Transcript of ds325ee
© 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
DataStage Enterprise Edition
Advanced Architecture and Best Practices
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
Welcome!
This course is intended to provide:
• An overview of the development and runtime architecture of DataStage Enterprise Edition
• Recommendations for parallel job design and best practices

There is purposely a combination of baseline and advanced material:
• Most of this information does not exist in the current course offerings or DataStage documentation
• This material will eventually be rolled into future Essentials and Advanced course offerings
DataStage Enterprise Edition: Advanced Architecture and Best Practices
Agenda

Day 1
• Module 1: Parallel Framework Architecture
• Module 2: Partitioning, Collecting, and Sorting
• Module 3: The Parallel Job Score

Day 2
• Module 4: Best Practices and Job Design Tips
• Module 5: Environment Variables
• Module 6: Introduction to Performance Tuning
DataStage Enterprise Edition
Module 01: Parallel Framework Architecture
Paul Christensen, Solution Architect
Last revision: June 23, 2004
Why You Need to Know This
DataStage Client is a developer productivity tool
• The GUI is not intended as a replacement for understanding parallel, flow-based ETL design
• DataStage Designer includes intelligence to facilitate quick development of simple flows
• But this is a development environment, not Visio (picture drawing)

The key to mastering Enterprise Edition is understanding the DataStage Parallel Framework
• Parallel ETL is a fundamentally different process
• Complex, high-volume flows require an understanding of the underlying engine architecture
• For now (v7.x1), you'll ALWAYS need a copy of the "OEM" (Orchestrate) documentation, which documents the DataStage Parallel Framework
DataStage Enterprise Edition: Parallel Framework Architecture
DataStage Enterprise Edition Component Architecture

(Layered diagram, bottom to top:)
• UNIX operating system / networking; parallel hardware (SMP, Cluster, MPP)
• DataStage Parallel Application Framework and Runtime System
• Component layer: Ascential Data Management Components, Ascential Data Analysis Components, Transformer and BuildOp Components, Third Party Components
• Application layer: Ascential Applications (DataStage Client), Third Party Applications
Introduction to Enterprise Edition

Parallel processing = executing your application on multiple CPUs
Scalable processing = adding more resources (CPUs and disks) to increase system performance
• Example: a system containing 6 CPUs (or processing nodes) and disks
• Run an application in parallel by executing it on 2 or more CPUs
• Scale up the system by adding more CPUs
• New CPUs can be added as individual nodes, or added to an SMP node
Traditional Batch Processing

(Flow: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse, with disk staging between each step.)

Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging
Pipeline Multiprocessing

(Flow: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse, with rows streamed between stages.)

• Transform, clean, and load processes are executing simultaneously
• Rows are moving forward through the flow
• Start a downstream process while an upstream process is still running
• This eliminates intermediate storing to disk, which is critical for big data
• This also keeps the processors busy
• Still have limits on scalability

Think of a conveyor belt moving rows from process to process!
Partition Parallelism

(Diagram: Source Data is divided into subset1 through subset4, with Node 1 through Node 4 each running the same Transform on its own subset.)

Divide large data into smaller subsets (partitions) across resources
• Goal is to evenly distribute the data
• Some transforms require all data within the same group to be in the same partition
• Requires the same transform on all partitions
• BUT: each partition is independent of the others; there is no concept of "global" state

Facilitates near-linear scalability (corresponding to hardware resources)
• 8X faster on 8 processors
• 24X faster on 24 processors…
Enterprise Edition Combines Partition and Pipeline Parallelism

(Diagram: Source Data flows through Source → Transform → Clean → Load into the Data Warehouse, pipelined and partitioned simultaneously.)

Within the Parallel Framework, pipelining and partitioning are always automatic. The job developer need only identify:
• Sequential vs. parallel operations (by stage)
• Method of data partitioning
• Configuration file (there are advanced topics here)
• Advanced per-stage options (buffer tuning, combination, etc.)
Job Design vs. Execution

The user assembles the flow using DataStage Designer… at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes).

No need to modify or recompile your job design!
Example: Three Types of Parallelism

• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism

(Diagram: a job flow with Sort, Derivation, Sample, Lookup, and Link Constraint stages, annotated to show where explicit, pipeline, and data-partition parallelism occur.)
Defining Parallelism

Execution mode (sequential/parallel) is controlled by stage definition and properties
• Default = parallel for most Ascential-supplied stages
• Can override the default in most cases through Advanced stage properties; examples where stage usage defines parallelism:
  • Sequential File reads (unless "number of readers per node" is set)
  • Sequential File targets
  • Oracle Enterprise sources (unless "partition table" is set)
  • others...

Degree of parallelism is determined by the configuration file
• Total number of logical nodes in the nameless default pool, or
• Nodes listed in [nodemap] or in a named [nodepool]
The Parallel Configuration File

Configuration files separate configuration (hardware/software) from job design
• Specified per job at runtime by $APT_CONFIG_FILE
• Alter hardware and resources without changing the job design

Defines # of nodes = logical processing units with corresponding resources (need not match physical CPUs)
• Dataset, scratch, and buffer disk (filesystems)
• Optional resources (e.g. database, SAS, etc.)
• Advanced topics ("pools" - named subsets of nodes)

Multiple configuration files should be used
• Optimize overall throughput and match job characteristics to overall hardware resources
• Provide a runtime "throttle" on resource usage on a per-job basis
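Since the configuration file is resolved per job at runtime via $APT_CONFIG_FILE, switching a job between configurations is just an environment change. A minimal sketch (the paths shown are hypothetical):

```shell
# Point the next run at a 4-node configuration; no recompile is needed.
export APT_CONFIG_FILE=/opt/ds/configs/4node.apt
echo "Running with config: $APT_CONFIG_FILE"
```

In practice this is typically supplied as a job-level environment-variable parameter rather than exported by hand.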
The Parallel Configuration File

  {
    node "n1" {
      fastname "s1"
      pool "" "n1" "s1" "app2" "sort"
      resource disk "/orch/n1/d1" {}
      resource disk "/orch/n1/d2" {"bigdata"}
      resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
      fastname "s2"
      pool "" "n2" "s2" "app1"
      resource disk "/orch/n2/d1" {}
      resource disk "/orch/n2/d2" {"bigdata"}
      resource scratchdisk "/temp" {}
    }
    node "n3" {
      fastname "s3"
      pool "" "n3" "s3" "app1"
      resource disk "/orch/n3/d1" {}
      resource scratchdisk "/temp" {}
    }
    node "n4" {
      fastname "s4"
      pool "" "n4" "s4" "app1"
      resource disk "/orch/n4/d1" {}
      resource scratchdisk "/temp" {}
    }
  }

Key aspects:
1. # of nodes defined (LOGICAL processing entities)
2. Resources assigned to each node (order of entries within each node is significant!)
3. Advanced resource optimizations and configuration (named pools, database, SAS)
DataStage Enterprise Edition: Job Compilation
DataStage Designer: Parallel Canvas Job Compilation

The DataStage Designer client generates all code
• Validates link requirements, mandatory stage options, transformer logic, etc.
• Generates an OSH representation of the job data flow and stages
  • GUI "stages" are representations of Framework "operators"
  • Stages in parallel shared containers are statically inserted in the job flow
  • Each server shared container becomes a dsjobsh operator
• Generates transform code for each parallel Transformer
  • Compiled on the DataStage server into C++ and then into corresponding native operators
  • To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
  • Force Compile recompiles all Transformers (use after client upgrades)
• BuildOp stages must be compiled manually, within the GUI or using the buildop UNIX command line

(Diagram: the Designer client sends the job to the DataStage server, which compiles the generated OSH and the C++ Transformer components into an executable job.)
Viewing Generated OSH

Enable viewing of generated OSH in DS Administrator.

OSH is visible in:
- Job Properties
- Job run log
- View Data
- Table Definitions (Show Schema)

(The screenshot highlights the Operator, Schema, and Comments sections of the generated OSH.)
Example Stage / Operator Mapping

Within Designer, stages represent operators, but there is not always a 1:1 correspondence. Examples:

• Sequential File: import (source), export (target)
• DataSet: copy
• Sort (DataStage): tsort
• Aggregator: group
• Row Generator, Column Generator, Surrogate Key Generator: generator
• Oracle: oraread (source), oralookup (sparse lookup), orawrite (target load), oraupsert (target upsert)
• Lookup File Set (target): lookup -createOnly

See the "OEM" OperatorsRef.PDF.
Generated OSH Primer

OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition
• Designer inserts comment blocks to assist in understanding the generated OSH
• Note that operator order within the generated OSH is the order in which stages were added to the job canvas

Each operator consists of:
• operator name
• schema (for generator, import, export)
• operator options (use "-name value" format)
• input (indicated by n< where n is the input #)
• output (indicated by n> where n is the output #); may include modify

For every operator, input and/or output datasets (links) are numbered sequentially starting from 0. For example:
  op1 0> dst
  op1 1< src

The following operator input/output data sources are generated by DataStage Designer:
• Virtual data set (name.v)
• Persistent data set (name.ds or [ds] name)

The virtual data set (link) name is used to connect the output of one operator to the input of another.

Example of generated OSH for the first 2 stages of this job:

  #################################################
  ## STAGE: Row_Generator_0
  ## Operator
  generator
  ## Operator options
  -schema record(
    a:int32;
    b:string[max=12];
    c:nullable decimal[10,2] {nulls=10};
  )
  -records 50000
  ## General options
  [ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
  ## Outputs
  0> [] 'Row_Generator_0:lnk_gen.v';

  #################################################
  ## STAGE: SortSt
  ## Operator
  tsort
  ## Operator options
  -key 'a'
  -asc
  ## General options
  [ident('SortSt'); jobmon_ident('SortSt'); par]
  ## Inputs
  0< 'Row_Generator_0:lnk_gen.v'
  ## Outputs
  0> [modify (keep a,b,c;)] 'SortSt:lnk_sorted.v';
Terminology

  Framework                 DataStage
  ------------------------  -----------------------------
  schema                    table definition
  property                  format
  type                      SQL type + length [and scale]
  virtual dataset           link
  record / field            row / column
  operator                  stage
  step, flow, OSH command   job
  Framework                 DS engine

• The GUI uses both terminologies
• Log messages (info, warnings, errors) use Framework terms
DataStage Enterprise Edition: Runtime Architecture
Enterprise Edition Runtime Execution

The generated OSH and configuration file are used to "compose" a job SCORE, similar to the way an RDBMS builds a query optimization plan
• Identifies the degree of parallelism and node assignment for each operator
• Inserts sorts and partitioners as needed to ensure correct results
• Defines the connection topology (datasets) between adjacent operators
• Inserts buffer operators to prevent deadlocks (e.g. fork-joins)
• Defines the number of actual UNIX processes
  • Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements

The job SCORE is used to fork UNIX processes with communication interconnects for data, messages, and control
• Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log

It is only after these steps that processing begins
• This is the "startup overhead" of an Enterprise Edition job

Job processing ends when:
• The last row (end of data) is processed by the final operator in the flow, or
• A fatal error is encountered by any operator, or
• The job is halted (SIGINT) by DataStage job control or human intervention (e.g. DataStage Director STOP)
Viewing the Job SCORE

• Set $APT_DUMP_SCORE to output the score to the DataStage job log
• For each job run, 2 separate score dumps are written to the log
  • The first score is actually from the license operator
  • The second score entry is the actual job score
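Both diagnostic variables mentioned in this module can be enabled together before a run; a sketch (the value 1 is an assumption, as any setting typically enables these switches in your environment):

```shell
# Write the job SCORE to the log, and tag log entries with UNIX PIDs.
export APT_DUMP_SCORE=1
export APT_PM_SHOW_PIDS=1
echo "APT_DUMP_SCORE=$APT_DUMP_SCORE APT_PM_SHOW_PIDS=$APT_PM_SHOW_PIDS"
```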
Example Job Score

Job scores are divided into two sections:
• Datasets: partitioning and collecting
• Operators: node/operator mapping

Both sections note sequential or parallel processing.

Why 9 UNIX processes?
Job Execution: The Orchestra

• Conductor - the initial Framework process, on the Conductor node
  - Score composer
  - Creates Section Leader processes (one per node)
  - Consolidates messages to the DataStage log
  - Manages orderly shutdown
• Section Leader (one per node)
  - Forks Player processes (one per stage)
  - Manages up/down communication
• Players
  - The actual processes associated with stages
  - Combined players: one process only
  - Send stderr and stdout to the Section Leader
  - Establish connections to other players for data flow
  - Clean up upon completion
• Default communication:
  - SMP: shared memory
  - MPP: shared memory (within a hardware node) and TCP (across hardware nodes)

(Diagram: the Conductor (C) oversees a Section Leader (SL) on each processing node; each Section Leader manages its Player (P) processes.)
Runtime Control and Data Networks

  $ osh "generator -schema record(a:int32) [par] | roundrobin | copy"

(Diagram: the Conductor connects through an APT_Communicator to Section Leaders 0, 1, and 2; each Section Leader forks its own generator and copy players. The control channel is TCP, while the stdout and stderr channels are pipes.)
Parallel Data Flow

Think of job runtime as a series of "conveyor belts" transporting rows for each link
• If the stage is parallel, each link will have multiple independent "belts" (partitions)

Row order is undefined ("non-deterministic") across partitions, or across multiple links
• Order within a particular link and partition is deterministic, based on partition type and, optionally, sort order

For this reason, job designs cannot include "circular" references
• e.g. cannot update a source or reference used in the same flow
DataStage Enterprise Edition: Data Types, Conversions, Nullability
Data Formats

The Framework processes only datasets. For external data, Enterprise Edition must perform conversion operations:
• Format translation using data type mappings
• May also require:
  • Recordization
  • Columnization

External data formats fall into two major categories:
• Automatic: the conversion is automatic or semi-automatic
  • data stored in a relational database (DB2, Informix, Oracle, Teradata)
  • data stored in a SAS data set
  • mapping rules are documented in OperatorsRef.pdf
• Manual: the user needs to manually specify formats
  • everything else: flat text files, binary files
  • use the Sequential File stage

(Diagram: external data is converted into the DataSet format on the way into the Parallel Framework, and converted back to external formats on the way out.)
Data Sets

Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
• Framework schema (format = name, type, nullability)
• Data records (data)
• Partitions (a subset of rows for each node)

Virtual Data Sets exist in memory
• Correspond to DataStage Designer links

Persistent Data Sets are stored on disk
• Descriptor file (metadata, configuration file, data file locations, flags)
• Multiple data files, one per node, stored in disk resource file systems
  (e.g. node1:/local/disk1/…, node2:/local/disk2/…)

There is no "DataSet" operator - the Designer GUI inserts a copy operator.
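The osh conventions shown in this module can be combined with a persistent Data Set target. The one-liner below is a sketch only (the output path is hypothetical, and exact generator options should be checked against the "OEM" documentation):

```
$ osh "generator -schema record(a:int32) -records 10 [par] > /tmp/example.ds"
```

The .ds descriptor file records the configuration file and the data file locations; the data itself lands in the disk resources of each logical node.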
When to Use Persistent Data Sets

When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints)
• Stored in native internal format (no conversion overhead)
• Retain data partitioning and sort order (end-to-end parallelism across jobs)
• Maximum performance through parallel I/O

Data Sets are not intended for long-term or archive storage
• Internal format is subject to change with new DataStage releases
• Requires access to named resources (node names, file system paths, etc.)
• Binary format is platform-specific

For fail-over scenarios, servers should be able to cross-mount filesystems
• Can read a dataset as long as your current $APT_CONFIG_FILE defines the same NODE names (fastnames may differ)
• orchadmin -x lets you recover data from a dataset if the node names are no longer available
Caution on Using Plug-In Metadata

DataStage Server plug-ins do not always match the data type definitions used by native Enterprise database stages
• Do not use a plug-in to import Oracle table definitions
• Instead, use ORCHDBUTIL to import Oracle table definitions
Runtime Column Propagation

Runtime Column Propagation (RCP) allows you to define only part of your table definition (schema). When RCP is enabled, if your job encounters extra columns not defined in the metadata, it will adopt these extra columns and propagate them through the rest of the job.

• RCP must be enabled at the project level (it is off by default)
• Can then be enabled/disabled at the job level (Job Properties/Execution)
• Can also be enabled/disabled at the stage level (Output Columns)

RCP allows maximum re-use of parallel shared containers
• Input and output table definitions only need the columns required by the container stages
• A parallel shared container can be used by multiple jobs with different schemas, as long as the core input/output columns exist
• Must enable RCP in every stage within the parallel shared container
Output Mapping With RCP Disabled

When RCP is disabled (the default):
• DataStage Designer will enforce stage input column to output column mappings
• At job compile time, modify operators are inserted on output links in the generated OSH
Output Mapping With RCP Enabled

When RCP is enabled:
• DataStage Designer will not enforce mapping rules
• Modify is still inserted at compile time, but:
  • Columns are not removed from the output
  • Columns are not renamed unless explicitly dragged to the derivation

In this example, a runtime error occurs because Name will not map to NAME (RCP maps by case-sensitive column name). You must drag the column name to the derivation column.
Type Conversions

Enterprise Edition provides numerous conversion functions between source and target data types
• Default type conversions take place across the output mappings of any parallel stage when runtime column propagation is disabled for that stage
• Variable-length to fixed-length string conversions will pad the remaining length with ASCII NULL (0x0) characters
  • Use $APT_STRING_PADCHAR to change the default padding (also used by target Sequential File stages)
• Non-default type conversions require use of Transformer or Modify (recommended method)
• Look for warnings in the DataStage log that indicate unexpected conversions!
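The $APT_STRING_PADCHAR variable mentioned above can be set for the session; a sketch changing the pad character from ASCII NUL to a space (0x20 is hex notation for the space character):

```shell
# Pad fixed-length string conversions with spaces instead of 0x0 bytes.
export APT_STRING_PADCHAR=0x20
echo "pad character: $APT_STRING_PADCHAR"
```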
Source Type to Target Type Conversions

Legend:
  d = there is a default type conversion from the source field type to the destination field type
  e = you can use a Modify or a Transformer conversion function to convert from the source type to the destination type
  d e = both a default conversion and an explicit conversion function exist
  (a blank cell in the original matrix indicates that no conversion is provided)

Destination field types, in column order: int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, timestamp

Source field type, followed by its row of conversion markers:
  int8:      d d d d d d d d d e d d e d e e e e
  uint8:     d d d d d d d d d d d d
  int16:     d e d d d d d d d d d d e d e
  uint16:    d d d d d d d d d d d e d e
  int32:     d e d d d d d d d d d d e d e e e
  uint32:    d d d d d d d d d d d e d e e
  int64:     d e d d d d d d d d d d d
  uint64:    d d d d d d d d d d d d
  sfloat:    d e d d d d d d d d d d d
  dfloat:    d e d d d d d d d d d e d e d e e e
  decimal:   d e d d d d e d d e d e d d e d e d e
  string:    d e d d e d d d e d d d d e d e d e e e
  ustring:   d e d d e d d d e d d d d e d e d e e
  raw:       e e
  date:      e e e e e e e
  time:      e e e e e d e
  timestamp: e e e e e e e

(The placement of blank cells was lost in transcription; the full matrix appears in the "OEM" OperatorsRef.pdf.)
Enterprise Edition Nullable Data

• Out-of-band: an internal data value marks a field as NULL
• In-band: a specific user-defined field value indicates a NULL
  • Required for Transformer processing
  • Disadvantage: must reserve a field value that cannot be used as valid data elsewhere in the flow
  • Examples: a numeric field's most negative possible value; an empty string

To convert a NULL representation from out-of-band to in-band and vice versa:
• Transformer stage:
  • Stage variables: IF ISNULL(linkname.colname) THEN … ELSE …
  • Derivations: SetNull(linkname.colname)
• Modify stage:
  • destinationColumnName = handle_null(sourceColumnName, value)
  • destinationColumnName = make_null(sourceColumnName, value)
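As a concrete sketch using the Modify syntax above (the column names and the reserved value -1 are hypothetical), converting an out-of-band NULL in a nullable balance column into the in-band value -1, and the reverse:

```
balance_ib = handle_null(balance, -1)
balance_ob = make_null(balance_ib, -1)
```

Remember the in-band caveat above: once -1 is chosen as the NULL marker, it cannot appear as valid data elsewhere in the flow.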
Null Transfer Rules
When mapping between source and destination columns of different nullability, the following rules apply:
Source Field      Destination Field    Result
not_nullable      not_nullable         Source value propagates to destination.
nullable          nullable             Source value or NULL propagates.
not_nullable      nullable             Source value propagates; destination value is never NULL.
nullable          not_nullable         WARNING messages in the log. If the source value is NULL, a fatal error occurs. Must handle in a Transformer or Modify stage.
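The four rules above can be sketched as a small Python function, with None standing in for NULL. The behavior shown for the nullable-to-not_nullable case (an error when the source value is NULL) mirrors the fatal runtime error unless the null is handled first. This is an illustrative model, not DataStage code:

```python
def transfer(value, src_nullable, dst_nullable):
    """Model of the null transfer rules; None stands in for NULL."""
    if value is None and not src_nullable:
        raise ValueError("a non-nullable source field cannot hold NULL")
    if value is None and not dst_nullable:
        # nullable -> not_nullable with a NULL value: fatal at runtime,
        # so the null must be handled in a Transformer or Modify first
        raise ValueError("NULL into a non-nullable column")
    return value   # every other combination propagates the value (or NULL)

ok = transfer(7, src_nullable=False, dst_nullable=True)             # 7 propagates
passthrough = transfer(None, src_nullable=True, dst_nullable=True)  # NULL propagates
```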
43April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
NULLS and Sequential Files
For NULLABLE columns, the following properties are used when reading from or writing to Sequential Files:

null_field
  A number, string, or C-style literal escape value (e.g. \xAB) that defines the NULL value representation

null_length
  A field length that indicates a NULL value (only appropriate for variable-length fields)

The null field representation can be any string, regardless of the valid values for the actual column data type.
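As an illustration of the null_field property, here is a Python sketch of reading a delimited line where any field matching a chosen null string becomes NULL (None). The marker string "NULL" is an arbitrary assumption for this example, not a default:

```python
NULL_FIELD = "NULL"   # arbitrary null_field marker chosen for this example

def parse_line(line, delimiter=","):
    """Split one delimited record; fields equal to the null marker
    become NULL (None), whatever the column's actual data type."""
    return [None if f == NULL_FIELD else f
            for f in line.rstrip("\n").split(delimiter)]

row = parse_line("42,NULL,Ford\n")
# row -> ["42", None, "Ford"]
```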
44April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Lookup and Nullable Columns
When using Lookup with "If Not Found = Continue", unmatched output rows follow the nullability attributes of the reference link for non-key columns:

- If the non-key columns of the reference link are defined as non-nullable, the Lookup stage assigns a "default value" to unmatched records. The default value depends on the data type*. For example:
  - Integer columns default to zero
  - Varchar defaults to a zero-length string (distinctly different from a NULL value)
  - Char defaults to a fixed-length string of $APT_STRING_PADCHAR characters
- If the non-key columns of the reference link are defined as nullable, the Lookup stage places NULL values in these columns for unmatched records

* Data type default values are documented in the OEM UserGuide.pdf

[Diagram: a Lookup stage with "If Not Found = Continue"; unmatched rows follow the nullability attributes of the non-key reference link columns]

TIP: When changing column attributes, be careful to propagate the change through the remaining links of your job design (including the output column definition of the Lookup stage in this example).
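The unmatched-row behavior described above can be modeled in a few lines of Python. The type defaults shown (zero, zero-length string) follow the slide, but the function and its signature are illustrative, not the Lookup stage's implementation:

```python
# Illustrative type defaults: zero for integers, zero-length string for varchar
TYPE_DEFAULTS = {"integer": 0, "varchar": ""}

def lookup_continue(stream, reference, ref_nullable, ref_type="integer"):
    """Lookup with If Not Found = Continue: unmatched rows get NULL
    (nullable reference columns) or a type default (non-nullable)."""
    out = []
    for key, payload in stream:
        if key in reference:
            out.append((key, payload, reference[key]))
        elif ref_nullable:
            out.append((key, payload, None))                     # NULL
        else:
            out.append((key, payload, TYPE_DEFAULTS[ref_type]))  # default value
    return out

reference = {1: 500}
rows = [(1, "a"), (2, "b")]
nullable_result = lookup_continue(rows, reference, ref_nullable=True)
# -> [(1, 'a', 500), (2, 'b', None)]
default_result = lookup_continue(rows, reference, ref_nullable=False)
# -> [(1, 'a', 500), (2, 'b', 0)]
```

Note how the non-nullable case silently produces a zero for the unmatched row: downstream logic cannot distinguish it from a genuine zero, which is why the nullability of the reference link matters.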
45April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Outer JOINs and Nullable Columns
Similar to Lookup, when performing an OUTER JOIN (Left Outer, Right Outer, Full Outer), unmatched output rows follow the nullability attributes of the corresponding outer link(s):

- If the non-key columns of the outer link(s) are defined as non-nullable, the Join stage assigns a "default value" to unmatched records, based on their data type
- If the non-key columns of the outer link(s) are defined as nullable, the Join stage places NULL values in these columns for unmatched records

[Diagrams: a Left Outer JOIN of Left and Right inputs, and a Full Outer JOIN; in both, unmatched rows follow the nullability attributes of the non-key columns of the outer link(s)]
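A Left Outer Join with these nullability rules can be sketched as follows. As with the Lookup example, the default value and function are illustrative, not the Join stage's implementation:

```python
def left_outer_join(left, right, right_nullable, default=0):
    """Left outer join on the first tuple element; unmatched right-side
    columns become NULL (nullable) or a type default (non-nullable)."""
    right_map = dict(right)
    filler = None if right_nullable else default
    return [(k, lv, right_map.get(k, filler)) for k, lv in left]

left = [(1, "x"), (2, "y")]
right = [(1, 100)]
nullable_result = left_outer_join(left, right, right_nullable=True)
# -> [(1, 'x', 100), (2, 'y', None)]
default_result = left_outer_join(left, right, right_nullable=False)
# -> [(1, 'x', 100), (2, 'y', 0)]
```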
46April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Transformer and Null Expressions
Within a parallel Transformer, any expression that includes a NULL value produces a NULL result:
  1 + NULL = NULL
  "John" : NULL : "Doe" = NULL

When the result of a link constraint or output derivation is NULL, the Transformer outputs that row to its reject link (dashed line):
- Always create a Transformer reject link in DataStage Designer
- Always test for null values before using them in an expression:
  IF ISNULL(link.col) THEN ... ELSE ...
  Use stage variables if the result is re-used

The v7 Transformer now warns when rows are rejected.
v7 also clarifies the naming of output link constraints ("Otherwise").
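The NULL-propagation rules above can be modeled in Python, where None stands in for an out-of-band NULL; the helper names are illustrative:

```python
def null_add(a, b):
    """1 + NULL = NULL: any arithmetic touching NULL yields NULL."""
    return None if a is None or b is None else a + b

def null_concat(*parts):
    """'John' : NULL : 'Doe' = NULL: concatenation with NULL yields NULL."""
    return None if any(p is None for p in parts) else "".join(parts)

def safe_derivation(col):
    """The IF ISNULL(link.col) THEN ... ELSE ... pattern: test first,
    so the expression never sees a NULL and the row is never rejected."""
    return "UNKNOWN" if col is None else col.upper()

a = null_add(1, None)                    # None: the whole expression is NULL
b = null_concat("John", None, "Doe")     # None
c = safe_derivation(None)                # "UNKNOWN": null handled explicitly
```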
47April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Framework “OEM” Documentation
UserGuide.PDF
  Covers framework architecture, parallel processing, partitioning/collecting data, data sets, data types, conversion functions, OSH
  Also includes detailed documentation on buildops

OperatorsRef.PDF
  Detailed reference for every built-in operator

RecordSchema.PDF
  Format of the Framework schema definition (including import, export, generator)

DevGuide.PDF, HeaderSorted.PDF, ClassSorted.PDF
  Low-level Orchestrate C++ APIs for building custom operators

Available in the documentation section ("Orchestrate") of the Ascential eServices public website
48April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
For More Information
Framework "OEM" Documentation: User Guide, Operators Reference, Record Schema

DataStage Enterprise Edition Best Practices and Performance Tuning document

PLEASE send your comments and feedback to:

Don't be afraid to try!
DataStage Enterprise Edition
Module 01: Parallel Framework Architecture
Paul Christensen, Solution Architect
Last revision: June 23, 2004
DataStage Enterprise Edition
Module 02: Partitioning, Collecting, and Sorting Data
Paul Christensen, Solution Architect
Last revision: June 22, 2004
Partitioners, Collectors, and Sorting
Partitioners distribute rows of a single link into smaller segments that can be processed independently, in parallel. They appear ONLY before parallel stages.

Collectors combine parallel partitions of a single link for sequential processing. They appear ONLY before sequential stages.

Sorting is used to arrange rows into specific groupings and order. It may be parallel or sequential.

[Diagram: partitioner and collector icons on the links between stages running sequentially and stages running in parallel]
Partitioning and Collecting Icons
"Fan-Out" Partitioner: Sequential to Parallel

"Fan-In" Collector: Parallel to Sequential

NOTE: Partitioner and Collector icons are ALWAYS drawn "Left to Right", regardless of how the link is drawn!
Partitioning Data
[Diagram: a partitioner on the link between two stages running in parallel]
Partitioners
Partitioners distribute rows of a single link (data set) into smaller segments that can be processed independently, in parallel.

Partitioners exist before ANY parallel stage. The previous stage may be running:
- Sequentially: results in a "fan-out" operation (and link icon)
- In Parallel: if the partitioning method changes, data is repartitioned

[Diagrams: a partitioner between two stages running in parallel; a sequential stage fanning out to a parallel stage; two parallel stages joined by a repartitioning icon]
Partition Numbers and Director Job Log
At runtime, the Parallel Framework determines the degree of parallelism for each stage using:
- $APT_CONFIG_FILE
- (optionally) a stage's node pool (Advanced properties)

Partitions are assigned numbers, starting at zero. The partition number is appended to the stage name for messages written to the DataStage Director job log.

[Screenshot: a Director job log entry, with the stage name and partition number labeled]
System Variables for Parallel Derivations
To facilitate parallel calculations regardless of the actual runtime configuration, system variables are provided in the Column/Row Generator and Transformer stages.

Within Column/Row Generator, two reserved words are provided for numeric cycles:
- part: actual partition number (starts at zero)
- partcount: total number of partitions at runtime

Starting with v7.1, the Surrogate Key Generator stage can generate a sequence of integer values in parallel:
- Internally similar to using a Column Generator stage with the part and partcount keywords
- Also supports an initial value for the sequence(s)

Within the Transformer, the @INROWNUM system variable is generated for each node. Instead, use:
- @PARTITIONNUM: actual partition number (starts at zero)
- @NUMPARTITIONS: total number of partitions

Example Generator sequence:
  Type = Cycle
  Initial value = part
  Increment = partcount

For a 4-node configuration file:
  @NUMPARTITIONS = 4
  @PARTITIONNUM = 0 through 3

Assuming incoming data is round-robin partitioned (the first four rows carry the initial values, the next four the first increment):

  Row#  Part  Partcount  Result
  1     0     4          0
  2     1     4          1
  3     2     4          2
  4     3     4          3
  5     0     4          4
  6     1     4          5
  7     2     4          6
  8     3     4          7
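The Cycle (initial value = part, increment = partcount) technique can be sketched in Python to show why the partitions together produce a dense, non-overlapping integer sequence:

```python
def partition_sequence(part, partcount, rows):
    """Values a single partition generates with Cycle(initial=part,
    increment=partcount)."""
    return [part + i * partcount for i in range(rows)]

partcount = 4                                  # 4-node configuration file
per_partition = [partition_sequence(p, partcount, 2) for p in range(partcount)]
# per_partition -> [[0, 4], [1, 5], [2, 6], [3, 7]]

# Together the partitions cover a dense range with no duplicates:
all_values = sorted(v for seq in per_partition for v in seq)
# all_values -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Because each partition's values are congruent to its partition number modulo partcount, no two partitions can ever generate the same value, regardless of how many rows each one processes.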
Selecting a Partitioning Method
Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Enable "Show Instances" in the DataStage Director Job Monitor to show data distribution (skew) across partitions.

Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the DataStage log as each stage/node (operator) completes processing.
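Given the per-partition row counts that Show Instances or $APT_RECORD_COUNTS expose, a simple skew metric is the ratio of the largest partition to the mean partition size. This helper is an illustrative convention for eyeballing balance, not a DataStage metric:

```python
def skew(counts):
    """Ratio of the largest partition to the mean partition size
    (1.0 = perfectly balanced; larger = more skewed)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

balanced = skew([250, 250, 250, 250])   # -> 1.0
skewed = skew([10, 10, 10, 970])        # -> 3.88 (one node does nearly all the work)
```

In the skewed case the job's elapsed time is dominated by the overloaded partition, so the effective parallelism is far below the configured degree.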
Selecting a Partitioning Method
Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Objective 2: The partition method MUST match the stage logic, assigning related records to the same partition if required
- Applies to any stage that operates on groups of related data (often using key columns)
- Examples: Aggregator, Join, Merge, Sort, Remove Duplicates, etc. (perhaps also Transformers, Buildops)
- The partitioning method needed to ensure correct results may violate Objective 1, depending on actual data distribution

Objective 3: The partition method should not be overly complex
- Use the simplest method that meets Objectives 1 and 2
- If possible, leverage partitioning performed earlier in the flow
59April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Specifying Partitioning
Partitioning method is defined on the Input properties, Partitioning category, of any stage running in parallel
Partitioning Methods
Keyless Partitioning (rows are distributed independent of actual data values):
- Same: existing partitioning is not altered
- Round Robin: rows are evenly alternated among partitions
- Random: rows are randomly assigned to partitions
- Entire: each partition gets the entire data set (rows are duplicated)

Keyed Partitioning (rows are distributed at runtime based on values in specified key column(s)):
- Hash: rows with the same key column value go to the same partition
- Modulus: assigns each row of an input data set to a partition, as determined by a specified numeric key column
- Range: similar to Hash, but the partition mapping is user-determined and partitions are ordered
- DB2: matches DB2 EEE partitioning (discussed in the database chapter)

Auto (the default method): DataStage EE chooses an appropriate partitioning method. Round Robin, Same, or Hash are most commonly chosen.
SAME Partitioning

Keyless partitioning method:
- Rows retain their current distribution and order from the output of the previous parallel stage
- Doesn't move data between nodes
- Retains "carefully partitioned" data (such as the output of a previous sort)
- Fastest partitioning method (no overhead)

[Diagram: row IDs 0,3,6 / 1,4,7 / 2,5,8 stay in the same partitions across the SAME partitioning icon]
Impact of SAME Partitioning
Don't over-use SAME partitioning in a job flow. Because SAME does not alter existing partitions, the degree of parallelism remains unchanged in the downstream stage:
- If you read a Sequential File using SAME partitioning (without specifying the Readers Per Node option), the downstream stage will run sequentially!
- If you read a persistent Data Set using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the data set, regardless of the current $APT_CONFIG_FILE / specified node pool
Round Robin and Random Partitioning
Keyless partitioning methods:
- Rows are evenly distributed across partitions
- Good for initial import of data if no other partitioning is needed
- Useful for redistributing data
- Fairly low overhead

Round Robin assigns rows to partitions like dealing cards:
- The row/partition assignment will be the same for a given $APT_CONFIG_FILE

Random distributes rows in random order:
- Higher overhead than Round Robin
- Not subject to regular patterns that might exist in the source data
- The row/partition assignment will differ between runs of the same input data

[Diagram: input rows ...8 7 6 5 4 3 2 1 0 dealt Round Robin into partitions 0,3,6 / 1,4,7 / 2,5,8]
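Round robin partitioning can be sketched in a few lines; note the deterministic row/partition assignment for a fixed number of partitions, matching the diagram above:

```python
def round_robin(rows, partitions):
    """Deal rows to partitions in turn, like dealing cards."""
    out = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        out[i % partitions].append(row)
    return out

parts = round_robin(list(range(9)), 3)
# parts -> [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Partition sizes differ by at most one row regardless of data values, which is why round robin is the safe choice when no key grouping is required.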
Parallel Runtime Example
Remember, row order is undefined (non-deterministic) across partitions, or across multiple links.

Consider this example job: a Row Generator produces 10 rows {A: Integer, initial_value=1, incr=1} with Round Robin partitioning. Round robin distributes rows in a specific order to the number of nodes at runtime. But the order in which a particular node outputs its results may change with each run.

Results with a 4-node $APT_CONFIG_FILE:
  Node 0: 1, 5, 9
  Node 1: 2, 6, 10
  Node 2: 3, 7
  Node 3: 4, 8

With round robin partitioning, rows are distributed in the same order for the same input data and $APT_CONFIG_FILE.
ENTIRE Partitioning
Keyless partitioning method: each partition gets a complete copy of the data
- Useful for distributing lookup and reference data
- May have a performance impact in MPP / clustered environments
- On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating the ENTIRE reference data

ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning
- On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)

[Diagram: input rows ...8 7 6 5 4 3 2 1 0 copied by ENTIRE so that every partition holds rows 0,1,2,3,...]
HASH Partitioning
Keyed partitioning method: rows are distributed according to the values in one or more key columns
- Guarantees that rows with an identical combination of values in the key column(s) are assigned to the same partition
- Needed to prevent matching rows from "hiding" in other partitions (e.g. Join, Merge, RemDup, ...)
- Partition size will be relatively equal if the data across the source key column(s) is evenly distributed

[Diagram: key column values ...0 3 2 1 0 2 3 2 1 1 hashed into partitions 0,3,0,3 / 1,1,1 / 2,2,2]
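A hash partitioner can be sketched as follows; Python's built-in hash() stands in for the Framework's hashing function, but the invariant is the same: equal key values always map to the same partition:

```python
def hash_partition(rows, key, partitions):
    """Assign each row to a partition by hashing its key column value."""
    out = [[] for _ in range(partitions)]
    for row in rows:
        # Equal key values always produce the same partition number.
        out[hash(row[key]) % partitions].append(row)
    return out

rows = [
    {"ID": 1, "LName": "Ford"},
    {"ID": 5, "LName": "Dodge"},
    {"ID": 7, "LName": "Ford"},
]
parts = hash_partition(rows, "LName", 4)

# All "Ford" rows land in a single partition, wherever that happens to be:
ford_partitions = {i for i, p in enumerate(parts)
                   for r in p if r["LName"] == "Ford"}
```

Which partition each key lands in is an implementation detail; the guarantee is only that a given key value never splits across partitions, so grouped operations see complete groups.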
Hash Key Selection
HASH ensures that rows with the same combination of all key column values are assigned to the same partition.

Hash on LName with a 4-node config file distributes as follows.

Source Data:

  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 1:

  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 0:

  ID  LName  FName    Address
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard

NOTE: Partition distribution matches the source data distribution. In this example, the number of distinct hash key values limits parallelism!
Another Hash Key Example
Using the same source data, Hash on LName, FName with a 4-node config file:

Part 3:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen

Part 2:
  ID  LName  FName    Address
  4   Ford   Eleanor  7900 Jefferson
  6   Dodge  John     75 Boston Boulevard
  10  Ford   Eleanor  1100 Lakeshore

Part 1:
  ID  LName  FName    Address
  3   Ford   Edsel    7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  9   Ford   Edsel    1100 Lakeshore

Part 0:
  ID  LName  FName    Address
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

NOTE: Improved distribution. Only rows with the same unique combination of key column values appear in the same partition.

For partitioning purposes, the order of HASH key columns is insignificant.
NOTE: To avoid repartitioning, key column order should be consistent across stages with the same keys.
Modulus Partitioning
Keyed partitioning method: rows are distributed according to the values in one integer key column
- Simpler (and faster) calculation than HASH, using the modulus (remainder) of division:
  partition = MOD(key_value, #partitions)
- Guarantees that rows with identical values in the key column end up in the same partition
- Partition size will be relatively equal if the data within the key column is evenly distributed

[Diagram: key column values ...0 3 2 1 0 2 3 2 1 1 assigned by MODULUS into partitions 0,3,0,3 / 1,1,1 / 2,2,2]
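The modulus calculation can be shown directly; with the key values from the diagram and 3 partitions, the distribution matches the figure:

```python
def modulus_partition(rows, key, partitions):
    """partition = MOD(key_value, number_of_partitions)"""
    out = [[] for _ in range(partitions)]
    for row in rows:
        out[row[key] % partitions].append(row)
    return out

key_values = [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]   # from the diagram
rows = [{"id": v} for v in key_values]
parts = modulus_partition(rows, "id", 3)
contents = [[r["id"] for r in p] for p in parts]
# contents -> [[0, 3, 0, 3], [1, 1, 1], [2, 2, 2]]
```

Unlike hash, the partition number is fully determined by the key value and the partition count, with no hashing step, which is why it is only applicable to a single integer key.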
RANGE Partitioning
Keyed partitioning method: rows are evenly distributed according to the values in one or more key columns
- Requires "pre-processing" the data to generate a range map
- More expensive than HASH partitioning: must read the entire data TWICE to guarantee results
- Guarantees that rows with identical values in the key columns end up in the same partition
- The "Write Range Map" stage is used to generate the range map file
- If the source data distribution is consistent over time, it may be possible to re-use the map file
- Values outside of a given range map land in the first or last partition, as appropriate

[Diagram: key column values 4 0 5 1 6 0 5 4 3 split by a Range Map file into ordered partitions 0,1,0 / 4,4,3 / 5,6,5]

QUIZ: If incoming data is ordered on the key, something bad happens. WHAT?
ANSWER: The process runs sequentially (key value adjacency)!
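Range partitioning can be sketched in two steps: build a range map (boundary values) from the data, then assign each row by binary search against the boundaries. Using the key values from the diagram, the resulting partitions match the figure; bisect stands in for the range map lookup:

```python
import bisect

def build_range_map(sample, partitions):
    """Boundary values that split the sorted sample into equal-sized ranges
    (a stand-in for the Write Range Map stage's pre-processing pass)."""
    s = sorted(sample)
    return [s[len(s) * i // partitions] for i in range(1, partitions)]

def range_partition(rows, boundaries):
    """Assign each value to an ordered partition by binary search; values
    outside the map fall into the first or last partition."""
    out = [[] for _ in range(len(boundaries) + 1)]
    for v in rows:
        out[bisect.bisect_right(boundaries, v)].append(v)
    return out

data = [4, 0, 5, 1, 6, 0, 5, 4, 3]           # key values from the diagram
boundaries = build_range_map(data, 3)         # -> [3, 5]
parts = range_partition(data, boundaries)     # -> [[0, 1, 0], [4, 4, 3], [5, 6, 5]]
```

The two passes are visible here: one over the data to build the boundaries, a second to assign rows. The quiz answer also falls out of this model: if the incoming rows are already ordered on the key, consecutive rows fall into the same range, so only one partition is busy at a time.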
Example Partitioning Icons
[Icons: a "fan-out" (Sequential to Parallel) AUTO partitioner; a re-partition icon (watch for this!); and a SAME partitioner]
Automatic Partitioning
By default, the Parallel Framework inserts partition components as necessary to ensure correct results (check the job score):
- Before any stage with "Auto" partitioning
- Generally chooses between ROUND-ROBIN and SAME
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, RemDup)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links; NOT always appropriate for MPP/clusters

Since the Framework has limited awareness of your data and business rules, it is usually best to explicitly specify HASH partitioning when key groupings are required:
- The Framework has no visibility into Transformer logic
- Required before SORT and AGGREGATOR (hash method) stages
- The Framework may insert un-needed or non-optimal partitioning
Preserve Partitioning Flag
The "preserve partitioning" flag is designed for stages that use "Auto" partitioning. The flag has 3 possible settings:
- Set: instructs downstream stages to attempt to retain partitioning and sort order
- Clear: downstream stages need not retain partitioning and sort order
- Propagate: passes (if possible) the flag setting from input to output links

The flag is set automatically by some operators (e.g. Sort, Hash partitioning), and can be manually set by users through the stage Advanced properties.

It is functionally equivalent to explicitly specifying SAME partitioning, but allows the Parallel Framework to over-ride and optimize for performance (e.g. if the degree of parallelism differs).

The Preserve Partitioning setting is part of the data set structure, in memory (virtual) and on disk (persistent).

At runtime, if the Preserve Partitioning flag is set and a downstream operator cannot use the previous partitioning, a warning is issued.
Summary: Partitioning Strategy
Use HASH partitioning when a stage requires grouping of related values:
- Specify only the key columns that are necessary for correct grouping (as long as the number of unique values is sufficient)
- Use MODULUS if the group key is a single Integer column
- RANGE may be appropriate in rare instances when data distribution is uneven but consistent over time
- Know your data! How many unique values are in the hash key column(s)?

If grouping is not required, use ROUND ROBIN to redistribute data equally across all partitions:
- The Framework will often do this with AUTO partitioning

Try to optimize partitioning for the entire job flow.
Job SCORE: Data Sets
The job SCORE can be used to verify the partitioning and collecting methods used at runtime.

Partitioners and collectors are associated with data sets (the top portion of the SCORE). Data sets connect a source and a target, each of which is either:
- operator(s) (see the lower portion of the SCORE)
- persistent data set(s)

The partitioner / collector method is shown between the source and target.
Interpreting the Job Score - Partitioning
The DataStage Parallel Framework implements a producer-consumer data flow model: upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets).

The partitioning method is associated with the producer; the collector method is associated with the consumer. They are separated by an indicator:

  ->  Sequential to Sequential
  <>  Sequential to Parallel
  =>  Parallel to Parallel (SAME)
  #>  Parallel to Parallel (not SAME)
  >>  Parallel to Sequential
  >   No producer or no consumer

The indicator may also include a [pp] notation when the Preserve Partitioning flag is set.
Optimizing Partitioning
Minimize the number of re-partitions within and across job flows.

Within a flow:
- Examine up-stream partitioning and sort order, and attempt to preserve them for down-stream stages using SAME partitioning
- May require re-examining key column usage within stages and processing (stage) order

Across jobs, through a persistent data set:
- Data sets retain partitioning AND sort order across flows
- If sort order is significant, write to a persistent data set with the Preserve Partitioning flag SET
- Useful if downstream jobs are run with the same degree of parallelism and require the same partition and sort order
Collecting Data
[Diagram: a collector feeding a stage running sequentially]
Collectors

Collectors combine partitions of a data set into a single input stream to a sequential stage.

[Diagram: data partitions (NOT links) feed a collector, which feeds the sequential stage]
Specifying Collector Type
The collector method is defined on the Input properties, Partitioning category, of any stage running sequentially when the previous stage is running in parallel.

[Diagram: a stage running in parallel, a collector icon on the link, and a stage running sequentially]
Collector Methods
(Auto)
- Eagerly reads any row from any input partition
- Output row order is undefined (non-deterministic)
- This is the default collector method

Round Robin
- Patiently picks rows from input partitions in round robin order
- Slower than Auto; rarely used

Ordered
- Reads all rows from the first partition, then the second, ...
- Preserves the order that exists within partitions

Sort Merge
- Produces a single (sequential) stream of rows sorted on specified key column(s), for input already sorted on those keys
- Row order is undefined for non-key columns
Choosing a Collector Method
In most instances, the Auto collector (the default) is the fastest and most efficient method of collecting data into a sequential stream.

To generate a single stream of sorted data, use the Sort Merge collector for previously-sorted input:
- Input data must be sorted on these keys to produce a sorted result
- Sort Merge does not perform a sort; it simply defines the order in which rows are read from all partitions, using the values in one or more key columns

The Ordered collector is only appropriate when sorted input has been range-partitioned:
- No sort is required to produce sorted output

The Round Robin collector can be used to reconstruct the original (sequential) row order for round-robin partitioned inputs:
- As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
- Rarely used in real-life scenarios
Collectors vs. Funnels
Don't confuse a collector with a Funnel stage!

Collector:
- Operates on a single, partitioned link (a single virtual data set)
- Consolidates partitions as the input to a sequential stage
- Always identified by a "fan-in" link icon

Funnel stage:
- A stage that runs in parallel
- Merges data from multiple links (multiple virtual data sets) into a single output link
- Table Definitions (schemas) of all input links must match
Sorting Data
Traditional (Sequential) Sort
Traditionally, the process of sorting data uses one primary key column and (optionally) multiple secondary key columns to generate a sequential, ordered result set. The order of the key columns determines the sort sequence (and groupings), and each key column specifies an ascending or descending sort. This is the method SQL uses for an ORDER BY clause.
Source Data:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore
Sort on: LName (asc), FName (desc)
Sorted Result:
ID LName FName Address
6 Dodge John 75 Boston Boulevard
5 Dodge Horace 17840 Jefferson
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
4 Ford Eleanor 7900 Jefferson
10 Ford Eleanor 1100 Lakeshore
3 Ford Edsel 7900 Jefferson
9 Ford Edsel 1100 Lakeshore
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
Parallel Sort
- In most cases, there is no need to globally sort data to produce a single sequence of rows.
- Instead, sorting is most often needed to establish order within specified groups of data (Join, Merge, Aggregator, Remove Duplicates, etc.). This sort can be done in parallel!
- Partitioning is used to gather related rows: it assigns rows with the same key column value(s) to the same partition.
- Sorting is used to establish grouping and order within each partition, based on one or more key column(s), so that equal key values are adjacent.
- Partition and sort keys need not be the same! This is often the case before Remove Duplicates.
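The partition-then-sort idea above can be sketched in a few lines of Python (an illustration only, not DataStage code): hash-partition on the key, then sort each partition independently. Every key group ends up whole, in key order, inside one partition, with no global sort.

```python
# Grouped parallel sort: hash-partition on the key column, then sort
# each partition on its own.
rows = [("Ford", 1), ("Dodge", 2), ("Ford", 3), ("Dodge", 4), ("Ford", 5)]

nparts = 2
parts = [[] for _ in range(nparts)]
for row in rows:
    parts[hash(row[0]) % nparts].append(row)   # partition on the key column

sorted_parts = [sorted(p, key=lambda r: r[0]) for p in parts]  # sort per partition

# Each partition is now in key order, and each key lives in exactly one partition.
for p in sorted_parts:
    keys = [r[0] for r in p]
    assert keys == sorted(keys)
```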
Example Parallel Sort
Using the same source data, hash partition on LName, FName (4-node config):
Part 3
ID LName FName Address
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
Part 2
ID LName FName Address
4 Ford Eleanor 7900 Jefferson
6 Dodge John 75 Boston Boulevard
10 Ford Eleanor 1100 Lakeshore
Part 1
ID LName FName Address
3 Ford Edsel 7900 Jefferson
5 Dodge Horace 17840 Jefferson
9 Ford Edsel 1100 Lakeshore
Part 0
ID LName FName Address
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
Within each partition, sort using LName, FName:
Part 3
ID LName FName Address
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
Part 2
ID LName FName Address
6 Dodge John 75 Boston Boulevard
4 Ford Eleanor 7900 Jefferson
10 Ford Eleanor 1100 Lakeshore
Part 1
ID LName FName Address
5 Dodge Horace 17840 Jefferson
3 Ford Edsel 7900 Jefferson
9 Ford Edsel 1100 Lakeshore
Part 0
ID LName FName Address
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
(A Parallel Sort runs independently within each of the four partitions.)
Stages that require Sorted Data
- Stages that process data on groups:
  - Aggregator
  - Remove Duplicates
  - Compare (perhaps: if only comparing values, not order, between two sources)
  - Transformer, BuildOp (perhaps, depending on internal stage-variable logic)
- "Lightweight" stages that minimize memory usage by requiring data in key-column sort order:
  - Join
  - Merge
  - Sort Aggregator
Parallel (Grouped) Sorting Methods
DataStage Designer provides two methods for parallel (grouped) sorting:
- A Sort stage in parallel execution mode, OR
- A sort specified on a link, when partitioning is not Auto. Links with a sort defined display a Sort icon.

By default, both methods use the same internal sort package (the tsort operator).
Sorting on a Link
- Easier job maintenance (fewer stages on the job canvas)
- BUT fewer options (tuning, features)
- Right-click on a key column to specify sort options
- Specify key usage for Sorting, Partitioning, or Both
Sort Stage
The Sort stage offers more options than a link sort
Always specify “DataStage” Sort Utility (much faster than UNIX sort)
Stable Sorts
- A stable sort preserves the order of non-key columns within each sort group.
- Stable sorts are slightly slower than non-stable sorts for the same data and keys, so only use a stable sort when needed.
- By default, stable sort is enabled on Sort stages!
- Stable sort is not the default for link sorts.
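As a quick illustration of what "stable" means (Python's built-in sorted() happens to be stable, so it serves as a stand-in, not as DataStage code):

```python
# A stable sort keeps the prior relative order of rows with equal keys.
rows = [("Ford", "Henry"), ("Ford", "Clara"), ("Dodge", "John"), ("Ford", "Edsel")]

by_lname = sorted(rows, key=lambda r: r[0])   # stable: ties keep input order
print(by_lname)
# [('Dodge', 'John'), ('Ford', 'Henry'), ('Ford', 'Clara'), ('Ford', 'Edsel')]
```

The three "Ford" rows stay in their original Henry, Clara, Edsel order; a non-stable sort would be free to permute them.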
Resorting on Sub-Groups
- Use the Sort Key Mode property to re-use key-column groupings from previous sorts. This uses significantly less memory/disk!
- The sort then operates on previously-sorted key-column groups, not the entire data set, and outputs rows after each group.
- Key column order is important! Don't forget to retain the incoming sort order (e.g. SAME partitioning).
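The memory saving comes from sorting one group at a time. A minimal Python sketch of the idea (illustrative only; the "Don't Sort, Previously Sorted" behavior is modeled with itertools.groupby):

```python
# Input already sorted/grouped on the major key (column 0), so only each
# group needs sorting on the minor key, never the whole data set.
from itertools import groupby

rows = [("Dodge", 9), ("Dodge", 2), ("Ford", 7), ("Ford", 1), ("Ford", 4)]

out = []
for _, group in groupby(rows, key=lambda r: r[0]):
    out.extend(sorted(group, key=lambda r: r[1]))  # sort within each group only

print(out)  # [('Dodge', 2), ('Dodge', 9), ('Ford', 1), ('Ford', 4), ('Ford', 7)]
```

Only one group is ever held in memory at a time, and rows flow out as each group completes, which is exactly why this mode uses far fewer resources than a full re-sort.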
Partitioning and Sort Order
- When partitioning data (except with SAME), sort order is not maintained.
- To restore row order / groupings, a sort is required after any repartitioning.

(Diagram: sorted rows 1, 2, 3 and 101, 102, 103 pass through a partitioner; after repartitioning, the rows within each partition are no longer in sorted order.)
Sequential (Total) Sorting Methods
Within Enterprise Edition, DataStage provides two methods for generating a sequential (totally sorted) result:
- A Sort stage in sequential execution mode, OR
- A Sort Merge collector, for sorted input

In general, a parallel Sort plus a Sort Merge collector will be MUCH faster than a sequential Sort, unless the data is already sequential. (This is similar to how databases perform a "parallel sort".)
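The parallel Sort plus Sort Merge collector pattern can be sketched with Python's heapq.merge (an illustration of the technique, not DataStage code): sort each partition independently, then merge the already-sorted streams into one total order.

```python
# Per-partition sorts (could run in parallel), then a Sort Merge-style
# collection: merge sorted streams without re-sorting anything.
import heapq

partitions = [[3, 1, 8], [7, 2], [6, 4, 5]]
sorted_parts = [sorted(p) for p in partitions]   # "parallel" per-partition sorts

total = list(heapq.merge(*sorted_parts))         # collector merges on the key
print(total)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The merge step only ever compares the head row of each partition, which is why it is so much cheaper than sorting the full data set sequentially.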
Automatic Sorting
- By default, the Parallel Framework inserts sort operators as necessary to ensure correct results:
  - before any stage that requires matched key values (e.g. Join, Merge, Remove Duplicates)
  - only when the user has NOT explicitly defined an input sort
- Check the job SCORE for inserted tsort operators.
- For versions 7.01 and later, set $APT_SORT_INSERTION_CHECK_ONLY to change the behavior of automatically inserted sorts: instead of actually performing the sort, the inserted sort operators only VERIFY that the data is sorted. If the data is not sorted properly at runtime, the job will fail. Recommended only on a per-job basis during performance tuning.

op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}

op1[4p] {(parallel inserted tsort operator {key={value=LastName, subArgs={sorted}}, key={value=FirstName, subArgs={sorted}}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
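Conceptually, check-only mode replaces the sort with a pass-through that aborts on the first out-of-order row. A rough Python sketch of that idea (plain illustration, not Framework code):

```python
# Verify order instead of sorting: pass rows through unchanged, but fail
# the "job" if any row breaks the expected key sequence.
def check_sorted(rows, key):
    prev = None
    for i, row in enumerate(rows):
        k = key(row)
        if prev is not None and k < prev:
            raise RuntimeError(f"row {i} out of sort order")  # job would abort
        prev = k
    return rows  # rows pass through unchanged

check_sorted([("Dodge",), ("Ford",)], key=lambda r: r[0])   # passes
# check_sorted([("Ford",), ("Dodge",)], key=lambda r: r[0]) # would raise
```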
Sort Resource Usage
- By default, each sort uses 20MB per partition as an internal memory buffer. This includes user-defined (link, stage) and framework-inserted sorts.
- A different size can be specified for each Sort stage using the "Restrict Memory Usage" option:
  - Increasing this value can improve performance, especially if the entire data partition (or group) can fit into memory.
  - Decreasing this value may hurt performance, but will use less memory (the minimum is 1MB per partition).
  - From Designer, this option is unavailable for link sorts.
- When the memory buffer is filled, sort uses temporary disk space in the following order:
  1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
  2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  3. The default directory specified by $TMPDIR
  4. The UNIX /tmp directory
Optimizing Sort Performance
- Minimize the number of sorts within a job flow. Each sort interrupts the parallel pipeline: it must read all rows before generating output.
- Specify only the necessary key columns.
- Avoid stable sorts unless needed to retain the order of non-key-column data.
- If possible, use the "Sort Key Usage" key-column option to re-use previous sort keys.
- Within the Sort stage, adjusting the "Restrict Memory Usage" option may improve performance.
Partitioning Examples
Partitioning Example 1
Scenario: assign an average value to existing detail rows.
"Standard" solution (3 hash/sorts): Copy the data, then hash and sort on all inputs to the Aggregator and Join. This is also the method the framework would use with Auto partitioning to ensure correct results.

(Diagram: Copy feeding Aggregate and Join. Notice that all 3 hash partitioners and sorts use the same key columns and order!)
Example 1 - Optimized Solution
Optimize partitioning keys (and sort order) across multiple stages in a single flow, to minimize re-sorts and re-partitions.
Optimized solution (1 hash/sort):
- Move the Hash/Sort upstream, before the Copy
- Use SAME partitioning to preserve partitioning and sort order

(Diagram: partition and sort on the key column(s) before the Copy; SAME partitioning retains the previous sort order, so the inputs to the Join do not need to be sorted!)
Example 1: Sort Insertion
- Looking at the job SCORE for the "optimal" solution, the Framework inserts sorts before each Join input to ensure correct results, regardless of the partitioning method chosen. In this example we don't want these extra sorts.
- To change the behavior of framework-inserted sorts for this job, set $APT_SORT_INSERTION_CHECK_ONLY. The inserted sorts will verify row order at runtime, but will not actually sort the data.

op3[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
op4[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Partitioning Example 2: Header / Detail
Know your data: HASH guarantees correct grouping results, but it is not always the most efficient choice.
Scenario: header / detail processing. Assign data from the header row to all detail rows.
- Use a Transformer to separate the header and detail data
- Add a join key column (constant value) to both outputs

(Diagram: Src -> Transformer -> Header and Detail links -> Join -> Out)

NOTE: since the join key value is constant, the inputs to the Join stage should NOT be sorted.
Partitioning Example 2: Solutions
Solution 1: hash on the key columns and Join.
- This is the "standard" approach; it is also the method the Framework would use with Auto partitioning.
- BUT there is only one hash key value, so the Join runs sequentially.

Solution 2: use Entire to copy the header data to all partitions.
- Distribute the detail data using Round Robin
- The Join will now run in parallel

But there is still a possible problem with either solution! For either solution, to counteract framework-inserted sorts, set $APT_SORT_INSERTION_CHECK_ONLY.
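Solution 2 can be sketched in plain Python (an illustration of the partitioning idea, not DataStage code; the header and detail field names here are made up for the example):

```python
# Entire partitioning: the single header row is copied to every partition,
# so each partition can join its share of detail rows independently.
header = {"batch": "B1"}                     # hypothetical header values
details = [{"id": i} for i in range(6)]      # hypothetical detail rows

nparts = 3
part_header = [dict(header) for _ in range(nparts)]          # Entire: full copy
part_details = [details[i::nparts] for i in range(nparts)]   # Round Robin

joined = [
    {**d, **part_header[p]}                                  # per-partition join
    for p in range(nparts)
    for d in part_details[p]
]
assert len(joined) == len(details) and all("batch" in r for r in joined)
```

Because every partition holds its own copy of the header, no partition has to wait on another, which is why the join can run in parallel instead of collapsing onto one hash key value.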
Introducing the Buffer Operator
At runtime, the Framework automatically inserts buffer operators to prevent deadlocks and to optimize overall performance:
- For job flows with a fork-join (any link split that is later combined in the same job flow), buffer operators are inserted on all inputs to the downstream joining operator.
- Buffer operators may also be inserted in an attempt to match producer and consumer rates.
- Data is never repartitioned across a buffer operator; rows are processed first-in, first-out.
- Some stages (e.g. Sort, Hash Aggregator) internally buffer the entire data set before outputting a row; buffer operators are never inserted after these stages.

(Diagram: Stage 1 forks to two paths that rejoin at Stage 3; a buffer operator sits on each input to Stage 3.)
Identifying Buffer Operators
- At runtime, buffers are identified in the operators section of the job SCORE.
- For more details on buffering, see the OEM User Guide PDF, Appendix A.

It has 6 operators:
op0[1p] {(sequential Row_Generator_0) on nodes ( ecc3671[op0,p0] )}
op1[1p] {(sequential Row_Generator_1) on nodes ( ecc3672[op1,p0] )}
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3) on nodes ( ecc3671[op2,p0] )}
op3[4p] {(parallel buffer(0)) on nodes ( ecc3671[op3,p0] ecc3672[op3,p1] ecc3673[op3,p2] ecc3674[op3,p3] )}
op4[4p] {(parallel APT_CombinedOperatorController:
    (APT_LUTProcessImpl in Lookup_3)
    (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
    (PeekNull)
  ) on nodes ( ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3] )}
op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12) on nodes ( ecc3672[op5,p0] )}
It runs 12 processes on 4 nodes.
How Buffer Operators Work
- The primary goal of a buffer operator is to prevent deadlocks.
- This is accomplished by holding rows until the downstream operator is ready to process them.
- Rows are held in memory up to the size defined by $APT_BUFFER_MAXIMUM_MEMORY (the default is 3MB per buffer, per partition).
- When buffer memory is filled, rows are spilled to disk: by default, up to the amount of available scratch disk, unless a QUEUE UPPER BOUND limit has been set.

(Diagram: Producer -> Buffer -> Consumer)
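The hold-in-memory-then-spill behavior can be modeled with a toy Python class (purely illustrative; the row limit stands in for the 3MB memory default, and a list stands in for scratch disk):

```python
# Toy buffer operator: FIFO, bounded memory, spill-to-"disk" overflow.
from collections import deque

class Buffer:
    def __init__(self, max_rows_in_memory=4):
        self.mem = deque()
        self.disk = deque()              # stands in for scratch-disk spill
        self.max = max_rows_in_memory

    def put(self, row):                  # producer side
        if len(self.mem) < self.max:
            self.mem.append(row)
        else:
            self.disk.append(row)        # memory full: spill to disk

    def get(self):                       # consumer side, first-in first-out
        if not self.mem:
            return None
        row = self.mem.popleft()
        if self.disk:                    # refill memory from spilled rows
            self.mem.append(self.disk.popleft())
        return row

buf = Buffer()
for r in range(6):
    buf.put(r)                           # rows 4 and 5 spill
assert [buf.get() for _ in range(6)] == [0, 1, 2, 3, 4, 5]  # order preserved
```

Even after spilling, rows come out in their original order, matching the first-in, first-out guarantee of the real buffer operator.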
Buffer Flow Control
- When buffer memory usage reaches $APT_BUFFER_FREE_RUN, the buffer operator offers resistance to new rows, slowing down the rate of the upstream producer. The default is 0.5 (50%).
- Setting $APT_BUFFER_FREE_RUN greater than 1 (100%) prevents the buffer from slowing down the upstream producer until $APT_BUFFER_MAXIMUM_MEMORY * $APT_BUFFER_FREE_RUN of data has been buffered. This assumes that the overhead of disk I/O for buffer scratch usage is less than the impact of slowing down the upstream operator.

(Diagram: Producer -> Buffer -> Consumer; at the $APT_BUFFER_FREE_RUN threshold, the buffer offers resistance to new rows, slowing down the upstream producer.)
Tuning Buffer Settings
- On a per-job basis, through environment variables: $APT_BUFFER_MAXIMUM_MEMORY, $APT_BUFFER_FREE_RUN, $APT_BUFFER_DISK_WRITE_INCREMENT, and many other advanced options.
- On a per-link basis (Inputs/Outputs -> Advanced). Buffer options are defined per link (virtual data set); hence the Output of one stage is the Input of the following stage.
- In general, Auto buffering (the default) is recommended. Don't change it unless you really understand your job flow and data! Disabling buffering may cause the job to deadlock (hang).
- In general, buffer tuning is an advanced topic; the default settings should be appropriate for most job flows.
- For very wide rows, it may be necessary to increase the default buffer size to handle more rows in memory. Calculate the total record width using the internal storage for each data type / length / scale; for variable-length (varchar) columns, use the maximum length.
Buffer Resource Usage
- By default, each buffer operator uses 3MB per partition of virtual memory. This can be changed through the Advanced link properties, or globally using $APT_BUFFER_MAXIMUM_MEMORY.
- When buffer memory is filled, temporary disk space is used in the following order:
  1. Scratch disks in the $APT_CONFIG_FILE "buffer" named disk pool
  2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  3. The default directory specified by $TMPDIR
  4. The UNIX /tmp directory
End of Data / End of Data Group
- Stages that process groups of data (e.g. Join, Merge, Remove Duplicates, Sort Aggregator) cannot output a row until:
  - the data in the grouping key column(s) changes (a logical End of Data Group), or
  - all rows have been processed (End of Data)
- For stages that process groups, rows are buffered in memory until an End of Data Group or End of Data.
- Some stages (e.g. Sort, Hash Aggregator) must read the entire input data set (until End of Data) before outputting a single record. Setting the "Don't Sort, Previously Sorted" key option changes the Sort behavior to output on groups instead of the entire data set.
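The difference between emitting at each End of Data Group versus holding until End of Data can be sketched with a group-wise aggregation in Python (an analogy, not DataStage code; itertools.groupby stands in for sorted-input group detection):

```python
# On key-sorted input, each key change is an End of Data Group, so a
# group-processing stage can emit its result group by group.
from itertools import groupby

rows = [("Dodge", 10), ("Dodge", 20), ("Ford", 5), ("Ford", 7), ("Ford", 8)]

def grouped_sums(sorted_rows):
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)   # output as soon as the group ends

print(list(grouped_sums(rows)))  # [('Dodge', 30), ('Ford', 20)]
```

A hash-style aggregator on unsorted input would instead accumulate every key in memory and emit nothing until all rows had been read.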
Revisiting Example 2: Buffering Impact
For large data volumes, buffering introduces a possible problem with this solution:
- At runtime, buffer operators are inserted for this fork-join scenario.
- The Join stage, operating on key-column groups, is unable to output rows until an End of Data Group or End of Data.
- Since one header row is generated with no subsequent change in the join column, data is buffered until End of Data.

Solution: use stage variables to hold the header data values, and output multiple header rows with different join-key values.
- This additional logic may impact Transformer performance.
- The proper solution ultimately depends on data volume and available hardware resources.

(Diagram: Src -> Transformer -> Header and Detail links, each passing through a buffer -> Join -> Out)
Revisiting Example 2: Buffering Solution
- Define stage variables to hold the header-row values; set their initial values to empty. Only set the header values when a header row is identified.
- Header link: use output link constraints to only output data after the header values have been captured. Assign more than one join key value using @INROWNUM (assumes only one header row).
- Detail link: assign a constant value to the detail join column.
Join Stage: Internal Buffering
- Even for inner joins, there is a difference between the inputs of a Join stage!
  - The first link (#0, "LEFT" within Link Ordering) establishes the "driver" input; its rows are read one at a time.
  - For non-unique key values, all rows within the same key-value group are read into memory from the second link (#1, "RIGHT" by Link Ordering).
- For Example 2, the single header row must be the second input link (#1) to the Join stage. Otherwise, all input data will be read into virtual memory.
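The left/right asymmetry can be sketched in Python (a simplified model: here the whole right link is pre-grouped into a dict, whereas the real stage holds only the current key group in memory, with both inputs key-sorted):

```python
# Left ("driver") link is streamed row by row; the right link's key
# groups are what get buffered.
from itertools import groupby

left  = [("K1", "a"), ("K1", "b"), ("K2", "c")]   # driver: streamed
right = [("K1", 1), ("K1", 2), ("K2", 3)]         # buffered per key group

right_groups = {k: [v for _, v in g]
                for k, g in groupby(right, key=lambda r: r[0])}

joined = [(k, lv, rv) for k, lv in left for rv in right_groups.get(k, [])]
print(joined)
# [('K1', 'a', 1), ('K1', 'a', 2), ('K1', 'b', 1), ('K1', 'b', 2), ('K2', 'c', 3)]
```

This is why the single header row belongs on the right link: buffering a one-row group is free, while buffering the entire detail stream is not.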
Avoiding Buffer Contention
- Datasets do not buffer; there is no upstream operation that would prevent rows from being output.
- In some cases, the best solution to avoiding fork-join buffer contention is to split the job, landing results to intermediate data sets:
  - Develop a single job first.
  - If performance / volume testing indicates a buffering-related performance issue that cannot be resolved by adjusting buffer settings, then split the job across intermediate data sets.
Example 2: Why Not Use Lookup?
- A Lookup cannot output any rows until ALL reference-link data has been read into memory (End of Data), except for sparse database lookups.
- NEVER generate Lookup reference data using a fork-join of the source data. Separate the creation of the lookup reference data from the lookup processing.

(Diagram: Src -> Transformer -> Header and Detail links; the Header link feeds the HeaderRef reference input of the Lookup -> Out)
Summary
Summary

- Partitioning: the method should ensure correct results AND (if possible) evenly distribute the data. Be aware of the data distribution and its impact on processing.
- Collecting: used to consolidate partitioned data into a sequential process.
- Sorting: parallel sorting establishes row order within groups; partitioning gathers the related rows. Sequential sorting is only needed to produce a single, globally sorted sequential result set.
DataStage Enterprise Edition
Module 02: Partitioning, Collecting, and Sorting Data
Paul Christensen, Solution Architect
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
DataStage Enterprise Edition
Module 03: The Parallel Job Score
Paul Christensen, Solution Architect
The Parallel Job SCORE
The job SCORE details the optimization plan used by the DataStage Parallel Framework to run a given job design, based on the specified $APT_CONFIG_FILE (similar to the way a parallel RDBMS builds a query plan). It:
- Identifies the degree of parallelism and node assignment(s) for each operator
- Details the mappings between functional stages/operators and actual UNIX processes
- Includes buffer operators inserted to prevent deadlocks and optimize data flow rates between stages
- Can be used to identify sorts and partitioners that have been automatically inserted to ensure correct results
- Outlines the connection topology (datasets) between adjacent operators and/or persistent data sets
- Defines the number of actual UNIX processes; where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
Viewing the Job SCORE
• Set $APT_DUMP_SCORE to output the Score to the DataStage job log
• Can be enabled at the project level to apply to all jobs
• For each job run, 2 separate Score dumps are written to the log:
  • The first score is actually from the license operator
  • The second score entry is the actual job score
Example Job Score
Job scores are divided into two sections:
- Datasets: partitioning and collecting
- Operators: node/operator mapping

Both sections note sequential or parallel processing.
Job SCORE: Operators
The operators (lower) section of the job SCORE details the mapping between stages and the actual processes created at runtime:
- Operator combination
- Operator-to-node mappings
- Degree of parallelism per operator
- Framework-inserted sorts
- Buffer operators

op0[1p] {(sequential APT_CombinedOperatorController:
    (Row_Generator_0)
    (inserted tsort operator {key={value=LastName}, key={value=FirstName}})
  ) on nodes ( node1[op0,p0] )}
op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
op2[4p] {(parallel buffer(0)) on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Operator Combination
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
• Use the same partitioning method
  - Repartitioning prevents operator combination between the corresponding producer and consumer stages
  - Implicit repartitioning (eg. sequential operators, node maps) also prevents combination
• Are Combinable
  - Set automatically within the stage/operator definition
  - Can also be set within DataStage Designer: Advanced stage properties
Composite Operator Example: Lookup
The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
• LUTCreateImpl
  - Reads the reference data into memory
• LUTProcessImpl
  - Performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
    on nodes (
      ecc3671[op2,p0]
  )}
op3[4p] {(parallel buffer(0))
    on nodes (
      ecc3671[op3,p0]
      ecc3672[op3,p1]
      ecc3673[op3,p2]
      ecc3674[op3,p3]
  )}
op4[4p] {(parallel APT_CombinedOperatorController:
      (APT_LUTProcessImpl in Lookup_3)
      (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
      (PeekNull)
    ) on nodes (
      ecc3671[op4,p0]
      ecc3672[op4,p1]
      ecc3673[op4,p2]
      ecc3674[op4,p3]
  )}
Job SCORE: Data Sets
The Job Score can be used to verify the partitioning and collecting methods used at runtime:
• Partitioners and Collectors are associated with datasets (top portion of the Score)
• Datasets connect a source and a target, each of which can be:
  - operator(s) (see the lower portion of the Score)
  - persistent Dataset(s)
• The Partitioner / Collector method is shown between the source and target
Interpreting the Job Score - Partitioning
The DataStage Parallel Framework implements a producer-consumer data flow model:
• Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)
• The partitioning method is associated with the producer
• The collector method is associated with the consumer
  - "eCollectAny" is specified for parallel consumers, although no collection occurs!
• Producer and consumer are separated by an indicator:
    ->  Sequential to Sequential
    <>  Sequential to Parallel
    =>  Parallel to Parallel (SAME)
    #>  Parallel to Parallel (not SAME)
    >>  Parallel to Sequential
    >   No producer or no consumer
• The Score may also include [pp] notation when the Preserve Partitioning flag is set
Using the Job SCORE
$APT_DUMP_SCORE = 1 ("True") is a recommended default (project-level) setting for all jobs.
At runtime, the Job Score can be examined to identify:
• The number of UNIX processes generated for a given job and $APT_CONFIG_FILE
• Operator combination
• Partitioning methods between operators
• Framework-inserted components, including sorts, partitioners, and buffer operators
DataStage Enterprise Edition
Module 04: Best Practices and Job Design Tips
Paul ChristensenSolution Architect
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
Assumptions
This module assumes that you have an understanding of the topics covered in:
• Module 01: Parallel Framework Architecture
• Module 02: Partitioning, Collecting, and Sorting
• Module 03: The Parallel Job Score
• DS324PX: DataStage Enterprise Edition Essentials
DataStageEnterprise Edition
Job Design Tips
Overall Job Design
Ideal job design must strike a balance between performance, resource usage, and restartability.
In theory, the best performance results from processing all data in memory without landing to disk. However:
• This requires hardware resources (eg. CPU, memory) and UNIX resources (eg. ulimit, nfiles, etc)
  - Resource usage grows multiplicatively with the degree of parallelism and the number of stages in a flow
  - Must also consider what else is running on the server(s)
• It may not be possible with very large amounts of data
  - eg. Sort will use scratch disk if the data is larger than its memory buffer
• Business rules may dictate job boundaries
  - eg. Dimensional maintenance before Fact table processing/load
  - eg. Lookup reference data must be created before lookup processing
Modular Job Design
Parallel shared containers facilitate modular job design by creating re-usable components (stages, logic):
• Runtime column propagation allows maximum parallel shared container re-use (only the columns used within the container logic need to be defined)
• The total number of stages in a job includes all stages in all parallel shared containers
Job parameters and multi-instance job properties facilitate job re-use.
Land intermediate results to parallel data sets.
Establishing Job Boundaries
Consider the following when establishing job boundaries:
• Business requirements
• Functional / DataStage requirements
• Establishing restart points in the event of a failure
  - Segment long-running steps
  - Separate the final database Load from the Extract and Transformation steps
• Resource utilization (number of stages, etc)
• Performance
  - Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
  - Depends on processing requirements and the ability to tune buffering
Job Sequences
Job Sequences can be used to combine individual jobs into functional "modules" that perform a sequence of activities.
Starting with DataStage release 7.1, Job Sequences can be "restartable":
• In the event of a failure, re-running the sequence will not re-run activities that completed successfully
• It is the developer's responsibility to ensure that an individual job can be re-run after a failure
• Enable Sequence restart in Job Properties (enabled by default)
• The "do not checkpoint run" sequence stage property forces that step to execute on every run of the Sequence
Job Design – Stage Usage Tips
• Sequential File
  - Optimizing performance
  - Reading and writing fixed-width files
  - Adjusting write buffer size
• Column Import
• Lookup
• Sort
• Aggregator
• Transformer
• Database Stages
Reading a Sequential File in Parallel
By default, Sequential File reads are not parallel unless multiple files are specified.
The Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets.
Note that sequential row order cannot be maintained when reading a file in parallel.
Partitioning and Sequential Files
Sequential File sources (import operator) create one partition for each input file.
• Always follow a Sequential File with ROUND ROBIN or another appropriate partitioning type
• NEVER follow a Sequential File source with SAME partitioning
  - If reading from one file, this will cause the downstream flow to run sequentially!
  - SAME is only appropriate in unusual scenarios where the source data is already separated into multiple files by partition
Capturing Sequential File Rejects
The Sequential File stage supports an optional reject link to capture rows that do not match the source or target format.
• The reject schema is a single raw (binary) column
• Be careful writing rejects to another Sequential File
• It is easiest to output rejects to a Dataset (with a Peek for debugging)
Sequential File Tips
To write fixed-length files from variable-length fields, use the following column properties:
• field width: specifies the output column width
• pad string: specifies the character used to pad data to the specified field width (if not specified, an ASCII NUL character 0x0 is used for padding)
When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VARCHAR columns.
• Starting with v7.01, set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead
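The effect of the field width and pad string properties can be sketched in plain Python (this is an illustration, not DataStage code; the function name `to_fixed_width` is ours). The NUL default mirrors the slide's note about unspecified pad strings:

```python
# Hypothetical sketch: pad a variable-length value to a fixed "field width".
# DataStage pads with ASCII NUL (0x00) when no pad string is specified;
# here both cases are made explicit.

def to_fixed_width(value, field_width, pad_string="\x00"):
    """Pad a value so the output column is exactly field_width characters."""
    if len(value) > field_width:
        raise ValueError("value %r exceeds field width %d" % (value, field_width))
    return value + pad_string * (field_width - len(value))

print(to_fixed_width("AB", 5, " "))     # space-padded: 'AB   '
print(repr(to_fixed_width("AB", 5)))    # default NUL padding: 'AB\x00\x00\x00'
```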
Buffering Sequential File Writes
By default, Sequential File targets (export operator) buffer writes to optimize performance.
• Buffers are automatically flushed when the job completes successfully
For real-time applications, the environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer.
• For example, $APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row
• Setting this value low incurs a SIGNIFICANT performance penalty!
Using Column Import
The Column Import stage can be used to improve the performance of non-parallel Sequential File reads and FTP sources.
• Allows column parsing to run in parallel
• Separates parsing (CPU) from sequential source I/O
Define the source file/FTP output as a single column:
• Type RAW or [VAR]CHAR
• Maximum length = record size
• Note that there are metadata implications
Define Columns, Data Types, and other format options in the Column Import stage (similar to a Sequential File definition).
Lookup Stage Usage
The Lookup stage is most appropriate when the reference data is small enough to fit into physical (shared) memory.
• For reference datasets larger than available memory, use the JOIN or MERGE stage
Limit the use of Sparse Lookup (for DB2 and Oracle reference tables):
• Per-row database lookups are extremely expensive (slow)
• For small numbers of rows, can be used for database-generated variables / function results
• ONLY appropriate when the number of input rows is significantly smaller (eg. 1:100) than the number of reference rows
Lookup Reference Data Partitioning
ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning.
• On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
• On SMP platforms, the Lookup stage uses shared memory instead of duplicating the ENTIRE reference data
To minimize data movement across nodes on clustered / MPP platforms, it may be appropriate to select a keyed partitioning method:
• Especially if the data is already partitioned on those keys
• Input and Reference data partitioning must match
Lookup Reference Data
NEVER generate Lookup reference data using a fork-join of the source data.
• Lookup cannot output rows until all reference data has been read into memory (except for Oracle or DB2 Sparse reference links)
• Use Lookup File Sets to separate the creation of lookup reference data from lookup processing
[Diagram: fork-join example job flow, with links labeled Header, Detail, Src, HeaderRef, and Out]
Lookup File Sets
Lookup File Sets should be used to store reference data on disk.
• Data is stored in native format, partitioned, and pre-indexed on the lookup key column(s)
• Key column(s) and partitioning are specified when the file is created
Lookup File Sets can only be used as a reference input link to a Lookup stage.
• The partitioning method and key columns specified when the Lookup File Set is created will be used to process the reference data on subsequent Lookups that use this file
Particularly useful when static reference data can be re-used in multiple jobs (or runs of the same job).
Aggregator
The Aggregator stage summarizes data based on groupings of key-column values.
• Input partitioning must match the desired groupings
Use the Hash method for inputs with a limited number of distinct key-column values:
• Uses 2K of memory per group
• Incoming data does not need to be pre-sorted
• Results are output after all rows have been read
• Output row order is undefined, even if the input data is sorted
Use the Sort method with a large (or unknown) number of distinct key-column values:
• Requires input pre-sorted on the key columns
• Results are output after each group
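The trade-off between the two Aggregator methods can be sketched in plain Python (function names are ours, not a DataStage API): the hash method keeps one in-memory bucket per distinct key and produces results only after all input is read, while the sort method streams pre-sorted input and emits each group as soon as it ends.

```python
from collections import defaultdict
from itertools import groupby

rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]

def hash_aggregate(rows):
    totals = defaultdict(int)        # memory grows with the number of distinct keys
    for key, value in rows:
        totals[key] += value
    return dict(totals)              # available only after all rows are read

def sort_aggregate(sorted_rows):
    # requires input pre-sorted on the key column; constant memory per group
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)   # emitted after each group

print(hash_aggregate(rows))                   # {'east': 17, 'west': 8}
print(list(sort_aggregate(sorted(rows))))     # [('east', 17), ('west', 8)]
```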
Sequential (Total) Aggregations
To summarize over all input rows:
• Generate a constant-value key column, using a Column Generator or a Transformer (if one is already in the upstream job flow)
• Sequentially Aggregate on the generated key column
  - No need to sort or hash-partition the input data!
Use 2 aggregators to prevent the sequential aggregation (and collector) from slowing down the upstream data flow:
• The first aggregator runs in parallel, grouping on the generated key column
  - Round-robin the input if it is not evenly distributed
• The second aggregator runs sequentially, grouping on the generated key column
  - Auto collector
[Diagram: parallel first aggregator feeding a sequential second aggregator]
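The two-aggregator total pattern can be sketched in plain Python (an illustration under our own variable names, not DataStage code): each "node" first totals its own partition in parallel, and only the small set of partial results crosses the collector to the sequential final aggregation.

```python
# Sketch of the two-aggregator total: a constant key makes every row one
# group, so the parallel step produces one partial total per partition and
# the sequential step only ever sees len(partitions) rows.

partitions = [[1, 2, 3], [4, 5], [6]]   # round-robin-style split across 3 nodes

# First aggregator: runs in parallel, one partial total per partition
partials = [sum(p) for p in partitions]

# Second aggregator: runs sequentially over only the partial results
grand_total = sum(partials)

print(partials, grand_total)   # [6, 9, 6] 21
```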
Transformer vs. Other Stages
For optimum performance, consider more appropriate stages instead of a Transformer in parallel job flows:
- Use the Copy stage as a placeholder
  - this is different from DataStage Server!
  - unless FORCE=TRUE, Copy is optimized out at runtime
- Leverage stage (eg. Copy) Output Mappings (RCP off) to:
  - rename columns
  - drop columns
  - perform default type conversions
- Modify is the most efficient "stage". Use it for:
  - non-default type conversions
  - null handling (converting between in-band and out-of-band)
  - string trimming (v7.01 and later)
- NOTE: starting with v7.01, Transformer output link constraints are FASTER than the Filter stage! (Filter is always interpreted)
Transformer vs. Lookup
- Consider implementing Lookup tables for expressions that depend on value mapping
- For example, instead of using transformer expressions such as:
  - … link.A=1 OR link.A=3 OR link.A=5 …
  - … link.A=2 OR link.A=7 OR link.A=15 OR link.A=20 …
  create a Lookup table for the source-value pairs, and use the Lookup stage to assign values
- This method can also be used to simplify output link constraints
A    Result
1    1
3    1
5    1
2    2
7    2
15   2
20   2
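The table above amounts to a key/value mapping; in plain Python the same substitution for the chained OR expressions looks like this (a dict standing in for the Lookup stage's reference table; names are ours):

```python
# Reference table from the slide: source value -> assigned result
value_map = {1: 1, 3: 1, 5: 1, 2: 2, 7: 2, 15: 2, 20: 2}

def assign_result(a):
    # Keys absent from the table would go to the Lookup stage's
    # reject/failure handling; here we simply return None.
    return value_map.get(a)

print([assign_result(a) for a in (1, 7, 20, 99)])   # [1, 2, 2, None]
```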
Transformer Performance Guidelines
- Minimize the number of Transformers by combining derivations from multiple Transformers
- NEVER use the Server-side BASIC Transformer in high-volume data flows
  - It is intended to provide a migration path for existing DataStage Server applications that use DataStage BASIC routines
  - Starting with v7, the parallel Transformer supports user-defined functions (external object files or libraries, not DataStage BASIC routines)
- Replace Transformer stages that do not meet performance requirements with BuildOps
  - It is generally not necessary to replace all Transformers, just those that are bottlenecks
  - Remember, BuildOps require more knowledgeable developers than equivalent Transformer logic
Optimizing Transformer Expressions
The parallel Transformer uses the following evaluation algorithm:
• Evaluate each stage variable initial value
• For each input row:
  - Evaluate each stage variable derivation value, unless the derivation is empty
  - For each output link:
    - Evaluate each column derivation value
    - Write the output record
Stage variables and columns within a link are evaluated in the order displayed in the Transformer editor.
Optimizing Transformer Expressions
Given the Transformer evaluation order, use stage variables instead of per-column derivations to minimize repeated use of the same derivation (move repeated expressions outside of loops). Examples:
• Portions of output column derivations that are used in multiple derivations
• Expressions that include calculated constant values
  - Use the stage variable Initial Value to evaluate once for all rows
• Expressions requiring a type conversion that are used as a constant, or used in multiple places
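The stage-variable optimization can be sketched in plain Python (the function and variable names are ours, and this is an analogy, not Transformer code): a constant is evaluated once per run, like a stage variable's Initial Value, and a shared sub-expression is evaluated once per row instead of once per output column.

```python
# "Initial value": evaluated once for all rows, not per row
PREFIX = "ID-"

def transform(rows):
    out = []
    for first, last in rows:
        # "Stage variable": the shared expression is computed once per row,
        # then reused by both output-column derivations below.
        full = (first + " " + last).upper()
        out.append((PREFIX + full, len(full)))
    return out

print(transform([("Ada", "Lovelace")]))   # [('ID-ADA LOVELACE', 12)]
```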
Transformer Decimal Arithmetic
Starting with v7.0.1 and v6.0.2, the Transformer supports DECIMAL arithmetic (earlier releases converted to dfloat).
• Default internal decimal variables are precision 38, scale 10, but this can be changed by specifying $APT_DECIMAL_INTERM_PRECISION and $APT_DECIMAL_INTERM_SCALE
• Set $APT_DECIMAL_INTERM_ROUND_MODE to specify:
  - ceil: rounds toward positive infinity (1.4 -> 2, -1.6 -> -1)
  - floor: rounds toward negative infinity (1.6 -> 1, -1.4 -> -2)
  - round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity (1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2)
  - trunc_zero: discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign (1.56 -> 1.5, -1.56 -> -1.5)
• If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, the result is rounded or truncated to the scale size
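The four rounding behaviors can be illustrated with Python's decimal module. The mapping to decimal's rounding constants is our approximation, not an official equivalence: ceil ~ ROUND_CEILING, floor ~ ROUND_FLOOR, round_inf ~ ROUND_HALF_UP (nearest, ties away from zero), trunc_zero ~ ROUND_DOWN.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

def rnd(value, mode):
    # Round a decimal value to integer precision using the given mode
    return Decimal(value).quantize(Decimal("1"), rounding=mode)

print(rnd("1.4", ROUND_CEILING), rnd("-1.6", ROUND_CEILING))   # 2 -1  (ceil)
print(rnd("1.6", ROUND_FLOOR), rnd("-1.4", ROUND_FLOOR))       # 1 -2  (floor)
print(rnd("1.5", ROUND_HALF_UP), rnd("-1.5", ROUND_HALF_UP))   # 2 -2  (round_inf ties)

# trunc_zero: drop extra fractional digits beyond the supported scale
print(Decimal("1.56").quantize(Decimal("0.1"), rounding=ROUND_DOWN))   # 1.5
```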
Conditionally Aborting a Job
Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job:
• Create a new output link and assign a link constraint that matches the abort condition
• Set "Abort After Rows" for this link to the number of rows allowed before the job aborts
When the "Abort After Rows" threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or unflushed file buffers.
More Transformer Best Practices
• Always include a Reject link
  - Captures NULL errors from Transformer expressions
• Always test for a null value before using a column in a function
• Avoid type conversions
  - Try to maintain the data type as imported
• Be aware of Column and Stage Variable data types
  - It is easy to neglect setting the proper Stage Variable type
“First Row” Transformer Derivations
Within a Transformer, stage variables can be used to identify the first row of an input group:
• Define one stage variable for each grouping key column
• Define a stage variable to flag when the input key column(s) do not match the previous value(s)
• On a new group (flag set), set the stage variable(s) to the incoming key column value(s)
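The stage-variable pattern above can be sketched in plain Python (names are ours; the input is assumed to be already partitioned and sorted on the key, as the technique requires):

```python
def flag_first_rows(sorted_rows):
    prev_key = None                 # stage variable holding the previous key value
    out = []
    for key, value in sorted_rows:
        is_first = key != prev_key  # flag stage variable: start of a new group?
        if is_first:
            prev_key = key          # remember the new group's key value
        out.append((key, value, is_first))
    return out

rows = [("A", 1), ("A", 2), ("B", 3)]
print(flag_first_rows(rows))   # [('A', 1, True), ('A', 2, False), ('B', 3, True)]
```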
“Last Row” Transformer Derivations
Since the Transformer cannot "read ahead", other methods must be used when derivations depend on the last row of a group.
For aggregate calculations within the Transformer, generate a "running total" for each group, then Remove Duplicates, retaining the Last row.
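The running-total-then-keep-last pattern can be sketched in plain Python (names are ours; the input is assumed pre-sorted on the group key). Every row carries the total so far, so the last row of each group holds the group total, and a stand-in for Remove Duplicates with "retain Last" keeps just that row:

```python
from itertools import groupby

def running_totals(sorted_rows):
    out, total, prev_key = [], 0, None
    for key, value in sorted_rows:
        total = value if key != prev_key else total + value
        prev_key = key
        out.append((key, total))    # every row carries the total so far
    return out

def keep_last_per_group(rows):
    # stand-in for Remove Duplicates retaining the Last row of each group
    return [list(group)[-1] for _, group in groupby(rows, key=lambda r: r[0])]

rows = [("A", 1), ("A", 2), ("B", 5)]
print(keep_last_per_group(running_totals(rows)))   # [('A', 3), ('B', 5)]
```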
Identifying “Last Row” in a Group
In general, it is a bad idea to perform multiple back-to-back sorts. The Sort stage, however, can be used for more than just sorting:
• Sub-sorting on groups (instead of complete sorts)
• Creating key-change columns
Example: For derivations that cannot output a running total, use 3 Sort stages before the Transformer to generate a change-key column for the last row in each group:
• Often, the data is already sorted earlier in the same flow
• Hash-partition/Sort on the key columns before the first sort
• Use SAME partitioning to ensure that subsequent stages keep the grouping and sort order
[Diagram: Sort -> KeyChange -> SubSort stages preceding the Transformer]
“Last Row” Sort Details
• First Sort: sorts on the key columns; sorts Descending on the group order column
• Second "Sort": does no sorting; creates the key-change column (specify only the key columns)
• Final "Sub-Sort": does not sort on the key columns; sub-sorts Ascending on the group order column
DataStageEnterprise Edition
Database Stage Usage
Database Stage Usage
• Overall Database Guidelines
• Native Parallel vs. Plug-In Stages
• DB2 Guidelines
• Oracle Guidelines
• Teradata Guidelines
• SQL or DataStage?
Optimizing Select Lists for Read
For source database stages, limit the use of "SELECT *" to read all columns:
• Uses more memory and may impact job performance
• Only needed for "dynamic" source / target flows (uncommon)
Instead, explicitly specify ONLY the columns needed in the flow:
• For the Table read method, specify the Select List property
• Or use Auto-Generated or User-Defined SQL
Native Parallel Database Stages
Starting with release 7, DataStage Enterprise Edition offers database connectivity through native parallel and plug-in stage types.
In general, for maximum parallel performance, scalability, and features, it is best to use the native parallel database stages:
• Parallel read and write
• OPEN and CLOSE commands
Upsert (API) vs. Load Methods
For database targets, most Enterprise stages provide a choice of Upsert or Load methods.
• The Upsert method uses database APIs
  - Allows concurrent processing with other jobs and applications
  - Does not bypass database constraints, indexes, or triggers
• The Load method uses the corresponding database-specific parallel load utility
  - Can be significantly faster than the Upsert method for large data volumes
  - Subject to database-specific limitations of the load utilities
    - May be issues with index maintenance, constraints, etc
    - May not work with tables that have associated triggers
  - Requires exclusive access to the target table
OPEN and CLOSE commands
The OPEN command allows the user to specify SQL to be executed before the stage begins reading or writing.
• Example: create a temporary table used to write rows
The CLOSE command allows the user to specify SQL to be executed after the stage completes reading or writing.
• Example: an "INSERT INTO … SELECT FROM …" statement to move rows from the temporary table to the actual table
• Example: delete temporary table(s)
These commands are available only in the native parallel (Enterprise) database stages.
Plug-In Database Stages
Plug-in stage types are intended to provide connectivity to database configurations not offered by the native parallel stages. Plug-in stages:
• Cannot read in parallel
• Cannot span multiple servers in clustered or MPP configurations
Designer Palette Customization
The DataStage repository window displays all stages available on the parallel canvas
- Stage Types/Parallel category
Not all of these stages are included in the default Designer palette
- Customize the palette to add stage types (e.g. Teradata API)
Enterprise Edition DB2 Stages
DB2 Enterprise stage
- Should always be used when reading from, performing lookups against, or writing to DB2 Enterprise Server Edition with the Database Partitioning Feature (DPF)
  - In DB2 v7.x this was called "DB2 EEE"
- Tightly coupled with DB2; communicates directly with each DB2 database node, using the same partitioning as the DB2 table
- Supports parallel Read, Upsert, Load, and Sparse Lookup
DB2 API stage
- Provides connectivity to non-UNIX DB2 databases (such as mainframe editions through DB2-CONNECT)
DB2 Upsert Commit Interval
For target DB2 tables using the Upsert method, the DB2 Enterprise stage provides options to specify the database commit interval for each stage
Rows are committed after a period of time or a number of rows, whichever comes first:
- Default is every 2 seconds or 2000 rows
Cleaning Up Failed DB2 Loads
In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table inaccessible (quiesced exclusive or load pending state)
To reset the target table to normal mode:
- Re-run the job specifying the "CleanupOnFailure=True" option
- Any rows that were inserted before the load failure must be deleted manually
Enterprise Edition Oracle Stages
Oracle Enterprise
- Source
  - Supports sequential (default) or parallel reads
- Target
  - Upsert: uses the Oracle API
  - Load: invokes SQL*Loader, subject to its limitations
Oracle OCI Load
- ONLY used for heterogeneous loads
  - When the target database's hardware platform differs from the Oracle client (DataStage server) platform
Specifying Oracle Remote Server
The Oracle Enterprise Remote Server connection option is intended for Oracle instances on remote hosts
In general, avoid using this option for local Oracle databases (on the same host as the DataStage server)
- Specifying it for local Oracle instances forces a TCP (network) database connection instead of shared memory
Instead, set the environment variable $ORACLE_SID
- The Oracle environment is typically defined within the DataStage dsenv file
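A minimal sketch of the Oracle entries one might add to $DSHOME/dsenv. The installation path and SID value below are illustrative placeholders, not values from the course — adjust them per site.

```shell
# Illustrative Oracle settings for $DSHOME/dsenv (path and SID are placeholders)
ORACLE_HOME=/u01/app/oracle/product/9.2.0; export ORACLE_HOME
ORACLE_SID=PROD; export ORACLE_SID
PATH=$ORACLE_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:${LD_LIBRARY_PATH:-}; export LD_LIBRARY_PATH
```

Because dsenv is sourced for every project on the server, settings here apply to all jobs unless overridden at the project or job level.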
Reading from Oracle in Parallel
By default, Oracle Enterprise reads sequentially. Use the "partition table" option to read in parallel from Oracle sources
Limitations of parallel read:
- Source table can only be non-partitioned or range-partitioned
- Cannot run queries containing a GROUP BY clause that are not also partitioned by the same field
- Cannot perform a non-collocated join
Oracle Schema Owner
To access Oracle tables that were created by a different user, fully qualify the table name
- Syntax: ownername.tablename
- NOTES:
  - Parameterize ownername
  - Database permissions must allow access
  - CANNOT use an unqualified synonym
    - A synonym provides no access to the Oracle system catalog information required by the Oracle Enterprise stage
Improving Oracle Upsert Performance
In Upsert write mode, the Oracle Enterprise stage:
- Executes the Insert statement (if present) first
- If the Insert fails with a unique-constraint violation, it then executes the Update statement
For larger data volumes, it is often faster to identify Insert and Update data within the job and separate them into different Oracle Enterprise targets
- Set Upsert Mode="Update Only" for rows to be updated
- Set Upsert Mode="Update and Insert" for rows to be inserted
- Prevents double-processing of update records
Insert processing uses Oracle host arrays to improve performance
- The optional InsertArraySize parameter can enhance performance (default is 500 rows)
Oracle Upsert Commit Interval
For target Oracle tables using the Upsert method, two environment variables specify the database commit interval
- As environment variables, commit settings apply to all Oracle stages in a job
Rows are committed after a period of time or a number of rows, whichever comes first, for each Oracle stage/partition:
- $APT_ORAUPSERT_COMMIT_ROW_INTERVAL
  - Default is every 2000 rows (per stage/partition)
- $APT_ORAUPSERT_COMMIT_TIME_INTERVAL
  - Default is every 2 seconds
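For example, to commit less frequently on a high-volume load, the two intervals might be raised like this (the values 10000 and 5 are illustrative; variable names follow the Oracle environment-variable table later in this module):

```shell
# Commit every 10000 rows or every 5 seconds, whichever comes first
# (illustrative values -- applies to all Oracle Upsert stages in the job)
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5
```

Larger intervals reduce commit overhead but increase the amount of uncommitted work lost on a failure.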
Oracle Load into Indexed Tables
By default, Oracle Enterprise will not Load an indexed table
- Must drop indexes before the load and recreate them after the load (requires appropriate Oracle privileges)
  - Can use OPEN and CLOSE commands
In Append or Truncate modes, the IndexMode option can allow a load into an indexed table:
- Rebuild: bypasses indexes during the load, rebuilds indexes after the load completes
  - Uses the Oracle ALTER INDEX REBUILD command
  - Indexes cannot be partitioned
- Maintenance: maintains the index during the load
  - Loads each partition sequentially
  - Table and index must be partitioned
  - Index must be local range-partitioned using the same range values used to partition the table
Alternate: Load into Indexed Tables
If the index mode options are not possible, or if you do not have the proper Oracle permissions, it is still possible to Load into an indexed table:
- Set the Oracle Enterprise stage to run sequentially
- Set the environment variable $APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)
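Set as a job-level environment variable, this looks like:

```shell
# Force a single direct-path, non-parallel SQL*Loader session,
# which allows loading into a table whose indexes remain in place
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)'
```

Quoting the value is important so the shell does not interpret the parentheses.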
Teradata Stage Usage
Because of limitations imposed by the Teradata utilities, it is sometimes appropriate to use plug-in stages for Teradata sources or targets
- Teradata imposes a system-wide limit on the number of concurrent database utilities
  - Can be adjusted by the DBA, but cannot be greater than 15
  - Within a parallel job, each use of the Teradata Enterprise, Teradata MultiLoad, or Teradata Load stages counts against this limit when the job is run
Which Teradata stage to use?
- Source or Target: Teradata Enterprise
  - Uses the FastExport and FastLoad utilities
  - High-volume parallel reads and writes
  - Targets are limited to Insert operations (empty table or Append)
  - Supports OPEN and CLOSE commands
Teradata Enterprise DBOptions
For Teradata instances with a large number of AMPs (VPROCs), it may be necessary to set the optional SessionsPerPlayer and RequestedSessions values in the DBOptions string of the Teradata Enterprise stage
- It is a good idea to parameterize these settings
- Syntax is:
  user=[user],password=[password],SessionsPerPlayer=nn,RequestedSessions=nn
Teradata Enterprise Sessions
RequestedSessions determines the total number of distributed connections to the Teradata source or target
- When not specified, it equals the number of Teradata VPROCs (AMPs) (your DBA can provide this)
- Can be set between 1 and the number of VPROCs
SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).
- Default is 2 sessions per player
- The number selected should be such that:
  SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions
Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value of SessionsPerPlayer should be increased.
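The relationship can be verified with simple arithmetic. Here, a hypothetical 8-node configuration file with one player per node and 2 sessions per player yields 16 requested sessions:

```shell
# Illustrative check of:
#   SessionsPerPlayer * number of nodes * players per node = RequestedSessions
SESSIONS_PER_PLAYER=2
NODES=8
PLAYERS_PER_NODE=1
REQUESTED_SESSIONS=$((SESSIONS_PER_PLAYER * NODES * PLAYERS_PER_NODE))
echo "$REQUESTED_SESSIONS"   # prints 16, matching a 16-AMP Teradata system
```

Raising SessionsPerPlayer while holding RequestedSessions constant reduces the number of players, and vice versa.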
Teradata SessionsPerPlayer Example
Example settings for:
- Teradata Server: MPP with 4 TPA nodes, 4 AMPs per TPA node (16 AMPs total)
- DataStage Server:

Configuration File | Sessions Per Player | Total Sessions
16 nodes           | 1                   | 16
8 nodes            | 2                   | 16
8 nodes            | 1                   | 8
4 nodes            | 4                   | 16
Teradata Plug-Ins
Target: Teradata MultiLoad plug-in (MultiLoad utility)
- Targets allow Insert, Update, Delete, or Upsert of moderate data volumes (stage cannot run in parallel)
- Do NOT use as a source in an EE flow! (runs FastExport sequentially)
Target: Teradata MultiLoad plug-in (TPump utility)
- Targets allow Insert, Update, Delete, or Upsert of small data volumes in a large database
- Does NOT lock the target table exclusively
- Stage cannot run in parallel
Source or Target: Teradata API stage
- Does not use database utilities
- Intended for small volumes of data
- Does not count against the Teradata utility limit, but is slower than TPump
- And… cannot read in parallel (parallel writes are allowed)
Teradata Stage Usage Guidelines
Stages that use Teradata utilities (database-wide limit):
- Teradata Enterprise: always has maximum performance for high volumes of data
  - The ONLY stage that will read in parallel
  - Limited target capabilities (insert, append)
- Teradata MultiLoad: for moderate data volumes
  - Inserts, Updates, Deletes, Upserts
  - Target stage ONLY!
  - Must run sequentially
- Teradata MultiLoad (TPump option)
  - Similar to MultiLoad, but does not lock the target table exclusively
Stages that do not use Teradata utilities:
- Teradata API
SQL or DataStage?
When reading data from multiple tables in the same database, it is possible to use either SQL or DataStage for some tasks.
In general, the optimal implementation leverages the strengths of each technology:
- When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job
- Use a SQL JOIN to combine data from tables with small-to-medium numbers of rows, especially when the join columns are indexed
- In general, avoid SQL SORTs – the DataStage SORT is much faster and runs in parallel without the overhead of sort-merge
- Use DataStage SORT and JOIN to combine data from very large tables, or when the join condition is complex
- Avoid invoking database stored procedures (e.g. Oracle PL/SQL) on a per-row basis. Implement these routines using native DataStage components.
When the direction is not obvious, the decision is often made by actual tests, or influenced by other factors such as metadata needs and developer skill sets
For More Information
Orchestrate "OEM" documentation (available in the documentation section of the Ascential eServices public website)
- User Guide
- Operators Reference
- Record Schema
DataStage Enterprise Edition Best Practices and Performance Tuning document
Don’t be afraid to try!
DataStage Enterprise Edition
Module 04: Best Practices and Job Design Tips
Paul Christensen
Solution Architect
DataStage Enterprise Edition
Module 05: Environment Variables
Paul Christensen
Solution Architect
Understanding a Job’s UNIX Environment
Jobs inherit environment variables at runtime based on this order of evaluation:
- Environment variables defined in $DSHOME/dsenv
  - Shared by all projects on the DataStage server
- Project-level environment variables defined with DataStage Administrator
  - Duplicate variables override $DSHOME/dsenv
  - NOTE: when migrating between environments, project-level environment variables are NOT exported
- Job-level environment variables set in Job Parameters
  - Duplicate variables override $DSHOME/dsenv and project-level settings
  - Cannot be set / passed in Job Sequences (bug!)
  - To avoid hard-coding job parameters, use the special values:
    - $ENV – pulls the value from the operating system environment
    - $PROJDEF – uses the project default value
Copying Project-Level Environment Variables
Project-level environment variables are not exported when performing a full export using DataStage Manager
With care, project-level environment variables can be copied between projects by editing the DSParams file located at the top level of the project directory
- User-defined settings are near the end of this file
IMPORTANT: Always make a backup copy of the DSParams file before any manual editing.
- It is possible to render a project unusable through improper editing of DSParams
[InternalSettings]
DisableParSCCheck=0
[AUTO-PURGE]
PurgeEnabled=0
DaysOld=0
PrevRuns=0
[EnvVarValues]
"ORACLE_SID"\1\"cpaul"
"APT_SORT_INSERTION_CHECK_ONLY"\1\"1"
Environment Variables For All Jobs
The following environment variables are recommended for all jobs. Although these can be set at the project level, it is better to specify them within job properties
- Provides a runtime parameter
- Specify them in your job template(s)
$APT_CONFIG_FILE=[filepath]
$APT_DUMP_SCORE=1
$APT_RECORD_COUNTS=1
- Outputs record counts to the job log as each operator completes processing
$OSH_ECHO=1
- Outputs the generated OSH to the job log
$APT_PM_SHOW_PIDS=1
- Places UNIX process ID entries in the job log for each process started at runtime
- Does not show DataStage phantom or Server processes
$APT_BUFFER_MAXIMUM_TIMEOUT=1
- Maximum buffer delay in seconds
$APT_COPY_TRANSFORM_OPERATOR=1
- For clusters/MPP only: copies Transform operator(s) to remote nodes
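Expressed as shell settings, the recommended baseline looks like the following sketch. The configuration-file path is a site-specific placeholder, not a value from the course.

```shell
# Recommended baseline environment for all jobs
# (the config path below is a placeholder -- substitute your own)
export APT_CONFIG_FILE=/opt/datastage/configs/default.apt
export APT_DUMP_SCORE=1
export APT_RECORD_COUNTS=1
export OSH_ECHO=1
export APT_PM_SHOW_PIDS=1
export APT_BUFFER_MAXIMUM_TIMEOUT=1
export APT_COPY_TRANSFORM_OPERATOR=1   # clusters/MPP only
```

In practice these would be defined as job parameters in a job template rather than literally exported in a shell, so each run can override them.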
Job Monitoring Environment Variables
Starting with DataStage v7, the Director Job Monitor captures results on a time interval
- Captured row counts are shown in Director, Job Monitor, and Designer (Show Performance Statistics)
- This data is also stored in the DataStage repository and can be extracted using Job Control or XML reports
The following environment variables alter Job Monitor characteristics:
- $APT_MONITOR_TIME=[seconds]
  - Specifies the time interval for capturing job monitor information at runtime
- $APT_MONITOR_SIZE=[rows]
  - If set, specifies that the job monitor capture information on a row (not time) basis. This is the method used in DataStage release 6.x
- $APT_NO_JOBMON=1
  - Disables job monitoring completely – no statistics will be captured
  - In rare instances, this may improve performance
Job Design Environment Variables
$APT_STRING_PADCHAR=[char]
- Overrides the default pad character of 0x0 (ASCII NULL)
- Can be a string character or C-notation
- Used for all variable-length to fixed-length string conversions
- May have implications for some target database stages (e.g. Oracle)
$APT_DECIMAL_INTERM_PRECISION=[precision]
$APT_DECIMAL_INTERM_SCALE=[scale]
- Specify the internal precision and scale used for internal Transformer derivations
- Default precision/scale is [38,10]; maximum is [255,255]
$APT_DECIMAL_INTERM_ROUND_MODE=[mode]
- ceil: rounds toward positive infinity
  - 1.4 -> 2, -1.6 -> -1
- floor: rounds toward negative infinity
  - 1.6 -> 1, -1.4 -> -2
- round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
  - 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
- trunc_zero: discards any fractional digits to the right of the rightmost fractional digit supported, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, rounds or truncates to the scale size
  - 1.56 -> 1.5, -1.56 -> -1.5
Job Debugging Environment Variables
The following environment variables can assist with debugging a job flow:
- $OSH_PRINT_SCHEMAS=1
  - Outputs the actual schema used at runtime for each dataset in a job flow. This is useful for determining whether the actual schema matches what the job designer expected.
- $APT_PM_PLAYER_TIMING=1
  - When set, prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time
- $APT_PM_PLAYER_MEMORY=1
  - When set, prints detailed information in the job log for each operator when additional memory is allocated
- $APT_BUFFERING_POLICY=FORCE and $APT_BUFFER_FREE_RUN=1000
  - Used in conjunction, these two environment variables effectively isolate each operator from slowing upstream production. Using the job monitor statistics, this can identify which part of a job flow is impacting overall performance.
  - NOT recommended for production job runs!
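For a debugging run, these variables might be combined as follows; this is a sketch of a debug-only environment, to be removed before production runs:

```shell
# Debug-only settings: verbose runtime schemas and per-operator
# timing/memory reporting, plus forced buffering to isolate slow operators
export OSH_PRINT_SCHEMAS=1
export APT_PM_PLAYER_TIMING=1
export APT_PM_PLAYER_MEMORY=1
export APT_BUFFERING_POLICY=FORCE
export APT_BUFFER_FREE_RUN=1000
```

With forced buffering in place, the job monitor row counts show which operator falls behind its upstream producers.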
Buffer Environment Variables
The following environment variables may also be specified on a per-stage basis within Designer:
- $APT_BUFFERING_POLICY
- $APT_BUFFER_MAXIMUM_MEMORY
- $APT_BUFFER_FREE_RUN
- $APT_BUFFER_DISK_WRITE_INCREMENT
Sequential File Stage Environment Variables
$APT_EXPORT_FLUSH_COUNT = [nrows]
- Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS = 1 (DataStage v7.01 and later)
- Directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.
$APT_IMPORT_BUFFER_SIZE, $APT_EXPORT_BUFFER_SIZE = [Kbytes]
- Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily loaded file servers may improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
- In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE = [bytes]
- Specifies the number of bytes the Sequential File (import) stage reads ahead to find the next delimiter. The default is 500 bytes, but this can be set as low as 2.
- Should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE = [bytes]
- Controls the upper bound, which is by default 100,000 bytes. When more than 500 bytes of read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.
DB2 Environment Variables
$INSTHOME = [path]
- Specifies the DB2 install directory. This variable is usually set in a user's environment from .db2profile.
$APT_DB2INSTANCE_HOME = [path]
- Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).
$APT_DBNAME = [database]
- Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the "Use Database Environment Variable" option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.
$APT_RDBMS_COMMIT_ROWS = [rows]
- Specifies the number of records to insert between commits. The default value is 2000.
- Can also be specified with the "Row Commit Interval" stage input property.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
- Allows DataStage to handle DB2 databases which use the special characters # and $ in column names.
Informix Environment Variables
$INFORMIXDIR = [path]
- Specifies the Informix install directory.
$INFORMIXSQLHOSTS = [filepath]
- Specifies the path to the Informix sqlhosts file.
$INFORMIXSERVER = [name]
- Specifies the name of the Informix server matching an entry in the sqlhosts file.
$APT_COMMIT_INTERVAL = [rows]
- Specifies the commit interval in rows for Informix HPL Loads. The default is 10000.
Oracle Environment Variables
$ORACLE_HOME = [path]
- Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by scripts.
$ORACLE_SID = [sid]
- Specifies the Oracle service name, corresponding to a TNSNAMES entry.
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num], $APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
- These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method.
- Commits are made whenever the time interval has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
- Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)
$APT_ORACLE_LOAD_DELIMITED = [char] (DataStage 7.01 and later)
- Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
- When set, a target Oracle stage with the Load method will limit the number of players to the number of datafiles in the table's tablespace.
$APT_ORA_WRITE_FILES = [filepath]
- Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
- Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.
Teradata Environment Variables
$APT_TERA_SYNC_DATABASE = [name]
- Starting with v7, specifies the database used for the terasync table. By default, EE uses the
$APT_TERA_SYNC_USER = [user]
- Starting with v7, specifies the user that creates and writes to the terasync table.
$APT_TERA_SYNC_PASSWORD = [password]
- Specifies the password for the user identified by $APT_TERA_SYNC_USER.
$APT_TERA_64K_BUFFERS = 1
- Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.
$APT_TERA_NO_ERR_CLEANUP = 1
- Not recommended for general use. When set, may assist in job debugging by preventing the removal of error tables and the partially written target table.
$APT_TERA_NO_PERM_CHECKS = 1
- Disables permission checking on the Teradata system tables that must be readable during the Teradata Enterprise load process. This can improve the startup time of the load.
For More Information
Orchestrate "OEM" documentation (available in the documentation section of the Ascential eServices public website)
- Admin Install Guide, Chapter 11: Environment Variables
- Operators Reference
DataStage Enterprise Edition Best Practices and Performance Tuning document
DataStage Enterprise Edition
Module 05: Environment Variables
Paul Christensen
Solution Architect
DataStage Enterprise Edition
Module 06: Introduction to Performance Tuning
Paul Christensen
Solution Architect
Assumptions
This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: Parallel Job Score
- Module 04: Best Practices and Job Design Tips
- Material covered in DS324PX: DataStage Enterprise Edition Essentials
Optimizing Performance
The ability to process large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
- Job Design
- Stage Properties
- DataStage Parameters
- Configuration File
- Disk Subsystem (especially RAID arrays / SANs)
- Source and Target databases
- Network
- etc.
Enterprise Edition Performance
Within DataStage, examine (in order):
- End-to-end process flow
  - Intermediate results, sources/targets, disk usage
- DataStage configuration file(s) for each job
  - Degree of parallelism
  - Impact on overall system resources
  - File system mappings, scratch disk
- Individual job design (including shared containers)
  - Stages chosen, overall design approach
  - Partitioning strategy
  - Combination
  - Buffering (as a last resort)

Ultimate job performance may be constrained by external sources / targets (e.g. disk subsystem, network, database). It may be appropriate to scale back the degree of parallelism to conserve unused resources.
Performance Tuning Methodology
Performance tuning is an iterative process:
- Test in isolation (nothing else should be running)
  - DataStage server
  - Source and target databases
- Change one item at a time, then examine the impact
- Use the Job Score to determine:
  - Number of processes generated
  - Operator combination
  - Framework-inserted sorting and partitioning
- Use the DataStage Job Monitor to verify:
  - Data distribution (partitioning)
  - Throughput and bottlenecks
- Use UNIX system monitoring tools to determine resource utilization (CPU, memory, disk, network)
Using DataStage Director Job Monitor
- Enable “Show Instances” to show data distribution (skew) across partitions; best performance comes from an even distribution
- Enable “Show %CP” to display CPU utilization
Selectively Disabling Operator Combination
Operator combination is intended to improve overall performance and lower resource usage:
- Generally separates I/O from CPU activity

There may be instances when operator combination hurts performance:
- One process cannot use more than 100% of a CPU
- It is also a good idea to separate I/O from CPU tasks
- Use the DataStage Job Monitor to identify CPU bottlenecks
- Selectively disable combination through Designer stage properties

In unusual circumstances, disable all combination by setting $APT_DISABLE_COMBINATION=TRUE:
- Generates significantly more UNIX processes
- May negatively impact performance
Operator Combination Example
In this example, the combined operator is using 100% of a CPU. Disabling operator combination allows each stage to use more CPU, and separates I/O from CPU.

With Operator Combination, the job score has 2 operators:

  op0[1p] {(parallel APT_CombinedOperatorController:
        (FileSetIn.InStream)
        (APT_TransformOperatorImplJob_Transformer in Transformer)
        (APT_RealFileExportOperator in File_Set_6.ToOutput)
      ) on nodes ( node1[op0,p0] )}
  op1[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op1,p0] )}

  It runs 2 processes on 1 node.

Without Operator Combination, the job score has 4 operators:

  op0[1p] {(parallel FileSetIn.InStream)
      on nodes ( node1[op0,p0] )}
  op1[1p] {(parallel APT_TransformOperatorImplJob_Transformer in Transformer)
      on nodes ( node1[op1,p0] )}
  op2[1p] {(parallel APT_RealFileExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op2,p0] )}
  op3[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op3,p0] )}

  It runs 4 processes on 1 node.
Configuration File Guidelines
Minimize I/O overlap across nodes:
- If multiple file systems are shared across nodes, alter the order of file systems within each node definition
- Pay particular attention to the mapping of file systems to physical controllers / drives within a RAID array or SAN
- Use local disks for scratch storage if possible

Named pools can be used to further separate I/O:
- “buffer” – file systems are only used for buffer overflow
- “sort” – file systems are only used for sorting

On clustered / MPP configurations, named pools can be used to further specify resources across physical servers:
- Through careful job design, can minimize data shipping
- Specifies server(s) with database connectivity
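To make the named-pool guidance concrete, here is a sketch of a configuration file that places “sort” and “buffer” scratch space in their own pools and staggers file-system order across the node definitions. The host name and mount points are hypothetical, not taken from this course:

```
{
  node "node1" {
    fastname "etl_host"
    pools ""
    resource disk "/u1/dsdata/node1" {pools ""}
    resource scratchdisk "/u1/scratch/node1" {pools ""}
    resource scratchdisk "/u2/sortwork/node1" {pools "sort"}
    resource scratchdisk "/u3/bufwork/node1" {pools "buffer"}
  }
  node "node2" {
    fastname "etl_host"
    pools ""
    resource disk "/u2/dsdata/node2" {pools ""}
    resource scratchdisk "/u2/scratch/node2" {pools ""}
    resource scratchdisk "/u3/sortwork/node2" {pools "sort"}
    resource scratchdisk "/u1/bufwork/node2" {pools "buffer"}
  }
}
```

Note how node1 lists /u1 first while node2 lists /u2 first: altering the order this way reduces I/O overlap when both nodes are writing at once.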
Use Parallel Data Sets
Use Parallel Data Sets to land intermediate results between parallel jobs:
- Stored in native internal format (no conversion overhead)
- Retains data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
- But, can only be used by other DataStage Enterprise Edition parallel jobs

When generating Lookup reference data to be used in subsequent jobs, use Lookup File Sets:
- Internal format, partitioned
- Pre-indexed
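The benefit of pre-indexed reference data can be sketched in plain Python (not DataStage code); the customer IDs and tier names below are invented for illustration:

```python
# Sketch (plain Python, not DataStage) of why pre-indexed reference
# data beats re-scanning it for every incoming row.
reference_rows = [
    ("CUST001", "Gold"),
    ("CUST002", "Silver"),
    ("CUST003", "Gold"),
]

# Build the index once, up front (the "pre-indexed" step) ...
index = dict(reference_rows)

# ... so every probe is a constant-time keyed lookup instead of an
# O(n) scan of the reference data per incoming row.
def lookup(customer_id, default="Unknown"):
    return index.get(customer_id, default)

print(lookup("CUST002"))  # Silver
print(lookup("CUST999"))  # Unknown
```

A Lookup File Set plays the same role across jobs: the indexing cost is paid once when the set is written, not on every run that reads it.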
Impact of Partitioning
Ensure data is as close to evenly distributed as possible
- When business rules dictate otherwise, re-partition to a more balanced distribution as soon as possible to improve performance of downstream stages

Minimize repartitions by optimizing the flow to re-use upstream partitioning
- Especially in clustered / MPP environments

Know your data
- Choose hash key columns that generate sufficient unique key combinations (while meeting business requirements)

Use SAME partitioning carefully
- Maintains degree of parallelism
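The “sufficient unique key combinations” point can be illustrated with a small Python sketch (hash partitioning simulated with Python's built-in hash; the data is made up):

```python
# Sketch (not DataStage code): why hash keys need enough unique values.
# Simulates hash partitioning across 4 partitions.
NUM_PARTITIONS = 4
rows = [(gender, "cust%04d" % i)
        for i, gender in enumerate(["M", "F"] * 500)]  # 1000 rows

def partition_counts(rows, key):
    """Count how many rows each partition would receive."""
    counts = [0] * NUM_PARTITIONS
    for row in rows:
        counts[hash(key(row)) % NUM_PARTITIONS] += 1
    return counts

# Hashing on gender alone (2 unique values): at most 2 of the
# 4 partitions ever receive data -- the rest sit idle.
skewed = partition_counts(rows, key=lambda r: r[0])

# Hashing on (gender, customer id): 1000 unique combinations,
# so rows spread far more evenly across all 4 partitions.
balanced = partition_counts(rows, key=lambda r: (r[0], r[1]))
print(skewed, balanced)
```

The skewed case caps the effective parallelism at 2 no matter how many partitions the configuration file defines, which is exactly what the Job Monitor's per-instance row counts reveal.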
Impact of Sorting
Use parallel sorts if possible (sort by key-column groups)
- Where a sequential sort is required, a parallel sort + Sort Merge collector is generally much faster than a sequential sort

Complete sorts are expensive:
- Interrupts the pipeline: rows cannot be output until all rows have been read
- Uses scratch disk for intermediate storage, unless the data set is small enough to fit in the sort buffer

Minimize and combine sorts where possible
- Use the “Don’t Sort, Previously Sorted” key-column option to leverage previous sort groupings
  - Uses much less memory
  - Outputs rows after each key-column group
- Parallel data sets maintain sort order and partitioning across jobs

Stable sorts are slower than non-stable sorts; use only when necessary

Use the “Restrict Memory Usage (MB)” option to increase the amount of memory per partition (default is 20 MB)
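The difference between a complete sort and leveraging previous sort groupings can be sketched in Python using itertools.groupby (illustrative data only):

```python
# Sketch of "Don't Sort, Previously Sorted": when input already arrives
# sorted on the key, groups can be emitted as soon as each key completes,
# instead of buffering the entire data set for a full sort.
from itertools import groupby

already_sorted = [("A", 1), ("A", 2), ("B", 5), ("B", 1), ("C", 9)]

# Complete sort: all rows must be read (and buffered) before any output.
fully_sorted = sorted(already_sorted)

# Previously-sorted input: groupby streams one key group at a time,
# holding only the current group in memory.
group_totals = {key: sum(v for _, v in grp)
                for key, grp in groupby(already_sorted, key=lambda r: r[0])}
print(group_totals)  # {'A': 3, 'B': 6, 'C': 9}
```

Like groupby, the previously-sorted option can release each key group downstream immediately, which is why it uses far less memory and does not interrupt the pipeline.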
Impact of Transformers
Minimize the number and use of Transformers:
- Consider more appropriate stages / methods: Copy, Output Mappings, Modify, Lookup
- Combine derivations from multiple Transformers
- Use stage variables to perform calculations used by multiple derivations
- Replace complex Transformers that do not meet performance requirements with BuildOps

And NEVER use the BASIC Transformer for high-volume flows!
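The stage-variable tip can be sketched in Python (hypothetical column names; this is not Transformer derivation syntax):

```python
# Sketch of the stage-variable idea: evaluate a shared expression once
# per row, then reuse it in several output derivations, rather than
# re-evaluating it inside each derivation.
def transform(row):
    # "stage variable": computed once per row
    net = row["price"] * row["qty"] - row["discount"]
    # multiple derivations reuse it without recomputation
    return {
        "net": net,
        "tax": round(net * 0.08, 2),
        "total": round(net * 1.08, 2),
    }

out = transform({"price": 10.0, "qty": 3, "discount": 2.0})
print(out)  # {'net': 28.0, 'tax': 2.24, 'total': 30.24}
```

At millions of rows, removing even one redundant expression per derivation is a measurable saving.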
Impact of Buffering
Consider the maximum row width:
- For very wide rows, it may be necessary to increase the buffer size to hold more rows in memory (default is 3 MB / partition)
- Set through stage properties, or for the entire job using $APT_BUFFER_MAXIMUM_MEMORY

Tune all other factors (job design, configuration file, disk, resources, etc.) before tuning buffer settings.

Be careful changing the buffering mode:
- Disabling buffering might cause deadlocks (job hang)

In some cases, the best solution to avoiding fork-join buffer contention may be to split the job, landing to intermediate data sets.
Isolating Buffers from Overall Performance
Buffer operators may make it difficult to identify performance bottlenecks in a job flow.

Setting the following environment variables effectively isolates each stage (by inserting buffers) and prevents the buffers from slowing down upstream stages (by spilling to disk):
- $APT_BUFFERING_POLICY=FORCE
  - Inserts buffers between each operator (isolates)
- $APT_BUFFER_FREE_RUN=1000
  - Writes excess buffer to disk instead of slowing down the producer
  - The buffer will not slow down the producer until it has written 1000 * $APT_BUFFER_MAXIMUM_MEMORY to disk

Important notes:
- These settings will generate a significant amount of disk I/O! Use configuration file “buffer” disk pools to isolate buffer file systems from scratch and resource disks
- Do NOT use these settings for production jobs!
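A minimal sketch of setting these diagnostics from a shell before launching a job; how the exports reach the engine (dsenv, job parameters, etc.) depends on your installation:

```shell
# Diagnostic use only -- NOT for production jobs.
export APT_BUFFERING_POLICY=FORCE   # insert a buffer between every pair of operators
export APT_BUFFER_FREE_RUN=1000     # spill to disk instead of throttling producers
```

With these in force, a stage that remains slow is genuinely slow itself, rather than being back-pressured by a downstream consumer.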
Other Performance Tips
Remove unneeded columns as early as possible within the flow:
- Minimizes memory usage, optimizes buffering
- Use a select list when reading from database sources
- To remove columns on an Output Mapping, disable runtime column propagation

Always specify a maximum length for VARCHAR columns: significant performance benefits.

Avoid type conversions if possible:
- Verify with $OSH_PRINT_SCHEMAS
- Always import Oracle table definitions using orchdbutil
Tuning Sequential File Performance
On heavily loaded file servers or some RAID/SAN configurations, setting these environment variables may improve performance (specify a number in KB; the default is 128):
- $APT_IMPORT_BUFFER_SIZE
- $APT_EXPORT_BUFFER_SIZE

In some disk array configurations, set the following environment variable equal to the read/write size in bytes:
- $APT_CONSISTENT_BUFFERIO_SIZE
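What a larger import buffer changes can be sketched in Python: a bigger read size means fewer, larger I/O calls over the same data. The 512 KB file and buffer sizes below are illustrative only:

```python
# Sketch: a larger read buffer means fewer, larger I/O calls over the
# same data, which matters on loaded file servers and some RAID/SANs.
import os
import tempfile

data = b"x" * (512 * 1024)  # 512 KB of sample data
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

def count_reads(buffer_size):
    """Read the whole file unbuffered, counting read() calls."""
    reads = 0
    with open(path, "rb", buffering=0) as raw:
        while raw.read(buffer_size):
            reads += 1
    return reads

small = count_reads(128 * 1024)  # 128 KB, the documented default
large = count_reads(256 * 1024)  # doubling the buffer halves the calls
os.unlink(path)
print(small, large)
```

The trade-off is memory: each sequential-file partition holds a buffer of the configured size, so raise it in measured steps rather than all at once.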
For More Information
Orchestrate “OEM” documentation (available in the documentation section of the Ascential eServices public website):
- User Guide
- Operators Reference

DataStage Enterprise Edition Best Practices and Performance Tuning document
Don’t be afraid to try!