1
2
Performance Tuning Version 8.6
Bert Peters, Global Education Services, Principal Instructor
3
Objectives
After completing this course, you will be able to:
• Control how PowerCenter uses memory
• Control how PowerCenter uses CPUs
• Understand the performance counters
• Isolate source, target and engine bottlenecks
• Tune different types of bottlenecks
• Configure Workflow and Session on Grid
4
Agenda
• Memory optimization
• Performance tuning methodology
• Tuning source, target, & mapping bottlenecks
• Pipeline partitioning
• Server Grid
• Q & A
• Course evaluation
5
Anatomy of a Session
[Diagram: the Integration Service runs the Data Transformation Manager (DTM). Source data flows through the READER thread into the DTM buffer, through the TRANSFORMER thread (backed by transformation caches), and out through the WRITER thread to the target.]
6
Memory Optimization
[Diagram: the two tunable memory areas: the DTM buffer shared by the READER, TRANSFORMER, and WRITER threads, and the transformation caches.]
7
DTM Buffer
• Temporary storage area for data
• Buffer is divided into blocks
• Buffer size and block size are tunable
• Default setting for each is Auto
8
DTM Buffer Size – Session Property
• Default is Auto, meaning the DTM estimates the optimal size
• Check session log for actual size allocation
9
DTM Buffer Block Size
• Default is Auto
• Check session log for actual size allocation
10
Reader Bottleneck
[Diagram: a slow reader; the transformer and writer threads sit idle, waiting for data to arrive in the DTM buffer.]
11
Transformer Bottleneck
[Diagram: a slow transformer; the reader waits for free buffer blocks while the writer waits for data.]
12
Writer Bottleneck
[Diagram: a slow writer; the reader and transformer threads wait for free buffer blocks.]
13
Source Row Logging
[Diagram] With source row logging, source rows must remain in the buffers until the transformation/writer threads process the corresponding rows downstream, leaving the reader waiting.
14
Large Commit Interval
Target rows remain in the buffers until the DTM reaches the commit point
15
Tuning the DTM Buffer
[Diagram: extra buffer blocks between the reader, transformer, and writer threads can keep all threads busy.]
16
Tuning the DTM Buffer
• Temporary slowdowns in reading, transforming or writing may cause large fluctuations in throughput
• A “slow” thread typically provides data in spurts
• Extra memory blocks can act as a “cushion”, keeping other threads busy in case of a bottleneck
17
Tuning the DTM Buffer
• Buffer block size
  • Recommendation: at least 100 rows per block
  • Compute based on the largest source or target row size
  • Typically not a significant bottleneck unless below 10 rows per block
• Number of blocks
  • Minimum of 2 blocks required for each source, target, and XML group
  • (number of blocks) = 0.9 x ((DTM buffer size) / (buffer block size))
18
Tuning the DTM Buffer
• Determine the minimum DTM buffer size (see the sizing sketch below):
  (DTM buffer size) = (buffer block size) x (minimum number of blocks) / 0.9
• Increase by a multiple of the block size
• If performance does not improve, return to previous setting
• There is no “formula” for optimal DTM buffer size
• Auto setting may be adequate for some sessions
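A minimal sketch of the sizing formulas above, assuming sizes in bytes; the helper names and example numbers are illustrative, not a PowerCenter API:

    # Illustrative sizing helpers based on the formulas above
    def buffer_block_size(largest_row_bytes, rows_per_block=100):
        # Recommendation: a block should hold at least 100 of the largest rows
        return largest_row_bytes * rows_per_block

    def min_dtm_buffer_size(block_bytes, sources, targets, xml_groups=0):
        # 2 blocks each per source, target, and XML group; only ~90% of the
        # DTM buffer is usable for data blocks, hence the division by 0.9
        min_blocks = 2 * (sources + targets + xml_groups)
        return block_bytes * min_blocks / 0.9

    block = buffer_block_size(8192)                          # assume an 8 KB widest row
    print(min_dtm_buffer_size(block, sources=2, targets=1))  # minimum size in bytes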
19
Transformation Caches
• Temporary storage area for certain transformations
• Except for Sorter, each is divided into a Data & Index Cache
• The size of each transformation cache is tunable
• If runtime cache requirement > setting, overflow written to disk
• The default setting for each cache is Auto
20
Tuning the Transformation Caches
Default is Auto
21
Max Memory for Transformation Caches
Only applies to transformation caches set to Auto
22
Max Memory for Transformation Caches
• Two settings: a fixed number and a percentage
  • The system uses the smaller of the two
  • If either setting is 0, the DTM assigns a default size to each transformation cache that is set to Auto
• Recommendation: use the fixed limit if this is the only session running; otherwise, use the percentage
• Use the percentage if running in a grid or HA environment
23
Tuning the Transformation Caches
• If a cache setting is too small, the DTM writes overflow to disk
• Determine if transformation caches are overflowing:
  • Watch the cache directory on the file system while the session runs
  • Use the session performance counters
• Options to tune:
  • Increase the maximum memory allowed for Auto transformation cache sizes
  • Set the cache sizes for individual transformations manually
24
Session Performance Counters
25
Performance Counters
26
Tuning the Transformation Caches
• Non-zero counts for readfromdisk and writetodisk indicate sub-optimal settings for the transformation index or data caches
• This may indicate the need to tune transformation caches manually
• Any manual setting allocates memory outside of the previously set maximum
• Cache Calculators provide guidance in manual tuning of transformation caches
27
Aggregator Caches
• Unsorted Input
  • Must read all input before releasing any output rows
  • Index cache contains group keys
  • Data cache contains non-group-by ports
• Sorted Input
  • Releases an output row as each input group is processed
  • Does not require a data or index cache (both = 0)
  • May run much faster than unsorted, BUT you must consider the expense of sorting
28
Aggregator Caches – Manual Tuning
29
Joiner Caches: Unsorted Input
Staging algorithm:
• All master data is loaded into the cache
• Specify the smaller data set as the master
• Index cache contains join keys
• Data cache contains non-key connected outputs
[Diagram: MASTER and DETAIL inputs to the Joiner]
30
Joiner Caches: Sorted Input
Streaming algorithm:
• Both inputs must be sorted on the join keys
• Only selected master data is loaded into the cache
• Index cache contains up to 100 keys
• Data cache contains the non-key connected outputs associated with those 100 keys
• Specify the data set with the fewest records under a single key as the master
[Diagram: MASTER and DETAIL inputs to the Joiner]
31
Joiner Caches – Manual Tuning
Cache calculator detects the sorted input property
32
Lookup Caches
• To cache or not to cache?
  • Large number of invocations: cache
  • Large lookup table: don't cache
  • Flat file lookup is always cached
33
Lookup Caches
• Data cache
  • Only connected output ports are included in the data cache
  • For an unconnected lookup, only the "return" port is included in the data cache
• Index cache
  • Only lookup keys are included in the index cache
34
Lookup Caches
• Lookup Transformation: fine-tuning the cache
  • SQL override
  • Persistent cache (if the lookup data is static)
  • Optimize the sort
    • Default: sort by lookup keys, then connected output ports in port order
    • Can be commented out or overridden in the SQL override
    • The indexing strategy on the table may impact performance
    • The Use Any Value property suppresses the sort
35
Lookup Caches
• Can build lookup caches concurrently
  • May improve session performance when there is significant activity upstream from the lookup & the lookup cache is large
  • This option applies to the individual session
• The Integration Service builds lookup caches at the beginning of the session run, even if no row has entered a Lookup transformation
Session properties > Config Object tab > Advanced settings
36
Lookup Caches – Manual Tuning
37
Rank Caches
• Index cache contains group keys
• Data cache contains non-group-by ports
• Cache sizes related to the number of groups & the number of ranks
38
Rank Caches – Manual Tuning
39
Sorter Cache
• Sorter Transformation
  • May be faster than a DB sort or a 3rd-party sorter
  • Index read from the RDB = pre-sorted data
  • SQL SELECT DISTINCT may reduce the volume of data across the network versus a Sorter with the "Distinct" property set
• Single cache (no separation of index & data)
40
Sorter Cache – Manual Tuning
41
64-bit vs. 32-bit OS
• Take advantage of large memory support in 64-bit
• Cache-based transformations like Sorter, Lookup, Aggregator, Joiner, and XML Target can address larger blocks of memory
42
Maximum Memory Allocation Example
• Parameters
  • 64-bit OS
  • Total system memory: 32 GB
  • Maximum allowed for transformation caches: 5 GB or 10%
  • DTM buffer: 24 MB
  • One transformation manually configured: index cache 10 MB, data cache 20 MB
  • All other transformations set to Auto
43
Maximum Memory Allocation Example
• Result (the arithmetic is sketched below)
  • 10% of 32 GB = 3.2 GB < 5 GB, so the max allowed for transformation caches = 3.2 GB = 3200 MB
  • The manually configured transformation uses 30 MB
  • The DTM buffer uses 24 MB
  • 3200 + 30 + 24 = 3254 MB
  • Note that 3254 MB is an upper limit; cached transformations may use less than the 3200 MB max
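A quick sketch of this arithmetic; the helper name is hypothetical, and 1 GB is treated as 1000 MB to match the slide's numbers:

    # Upper limit on session memory, following the example above
    def max_session_memory_mb(total_mb, fixed_cap_mb, pct_cap,
                              manual_caches_mb, dtm_buffer_mb):
        # Auto caches get the SMALLER of the fixed cap and the percentage cap;
        # manual cache settings and the DTM buffer come on top of that
        auto_cap_mb = min(fixed_cap_mb, total_mb * pct_cap)
        return auto_cap_mb + manual_caches_mb + dtm_buffer_mb

    print(max_session_memory_mb(32000, 5000, 0.10,
                                manual_caches_mb=10 + 20,  # 10 MB index + 20 MB data
                                dtm_buffer_mb=24))         # -> 3254.0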
44
Performance Tuning Methodology
• It is an iterative process:
  • Establish a benchmark
  • Optimize memory
  • Isolate the bottleneck
  • Tune the bottleneck
  • Take advantage of under-utilized CPU & memory
45
The Production Environment
[Diagram: PowerCenter at the center of a multi-vendor environment: OS, DBMS, LAN/WAN, and many disks]
• A multi-vendor, multi-system environment with many components:
  • Operating systems, databases, networks, and I/O
• You usually need to monitor performance in several places
• You usually need to monitor outside Informatica as well
46
The Production Environment
• Tuning involves an iterative approach:
  1. Identify the biggest performance problem
  2. Eliminate or reduce it
  3. Return to step 1
47
Preliminary Steps
• Eliminate transformation errors & data rejects: "first make it work, then make it faster"
• Source row logging requires the reader to hold onto buffers until the data is written to the target, EVEN IF THERE ARE NO ERRORS; this can significantly increase the DTM buffer requirement
• You may want to set stop on errors to 1
48
Preliminary Steps
• Override the tracing level to terse or normal
  • Override at the session level to avoid having to examine each transformation in the mapping
  • Only use verbose tracing during development, and only with very small data sets
  • If you expect row errors that you will not need to correct, avoid logging them by overriding the tracing level to terse (not recommended as a long-term error handling solution)
49
Benchmarking
• Hardware (CPU bandwidth, RAM, disk space, etc.) should be similar to production
• Database configuration should be similar to production
• Data volume should be similar to production
• Challenge: production data is constantly changing
  • Optimal tuning may be data dependent
  • Estimate "average" behavior
  • Estimate "worst case" behavior
50
Benchmarking – Conditional Branching
Scenario: a high percentage of test data goes to TARGET1, but a high percentage of production data goes to TARGET2.
Tuning of the sorter & aggregator could be overlooked in test.
51
Benchmarking – Conditional Branching
Scenario: a high percentage of production data goes to TARGET1 on Monday's load, but a high percentage goes to TARGET2 on Tuesday's load.
Performance of the 2 loads may differ significantly.
52
Benchmarking – Conditional Branching
• Conditional branching poses a challenge in performance tuning
• Volume & CHARACTERISTICS of data should be consistent between test & production
• May need to estimate average behavior
• May want to tune for worst-case scenario
53
Identifying Bottlenecks
• The first challenge is to identify the bottleneck:
  • Target
  • Source
  • Transformations
  • Mapping/Session
• Tuning the most severe bottleneck may reveal another one
• This is an iterative process
54
Thread Statistics
• The DTM spawns multiple threads
• Each thread has busy time & idle time
• Goal – maximize the busy time & minimize the idle time
55
Thread Statistics - Terminology
• A pipeline consists of:
  • A source qualifier
  • The sources that feed that source qualifier
  • All transformations and targets that receive data from that source qualifier
56
Thread Statistics - Terminology
[Diagram: PIPELINE 1 feeds the MASTER input of a Joiner; PIPELINE 2 feeds the DETAIL input.]
A pipeline on the master input of a joiner terminates at the joiner.
57
Thread Statistics - Terminology
• Stage: a portion of a pipeline; implemented at runtime as a thread
• Partition Point: the boundary between 2 stages; always associated with a transformation
58
Using Thread Statistics
• By default, PowerCenter assigns a partition point at each Source Qualifier, Target, Aggregator, and Rank.
[Diagram: partition points dividing a pipeline into a reader thread (first stage), two transformation threads (second and third stages), and a writer thread (fourth stage)]
59
Target Bottleneck
• The Aggregator transformation stage is waiting for target buffers
[Thread statistics: the writer thread shows Busy%=95 while an upstream transformation stage shows Busy%=15, indicating a target bottleneck]
60
Transformation Bottleneck
• Both the reader & writer are waiting for buffers
[Thread statistics: Reader Busy%=15, Transformation Busy%=60, Transform Busy%=95, Writer Busy%=10; the 95%-busy transformation stage is the bottleneck]
61
Thread Statistics in Session Log
***** RUN INFO FOR TGT LOAD ORDER GROUP [1], CONCURRENT SET [1] *****
Thread [READER_1_1_1] created for [the read stage] of partition point [SQ_SortMergeDataSize_Detail] has completed.
Total Run Time = [318.271977] secs
Total Idle Time = [176.488675] secs
Busy Percentage = [44.547843]
Thread [TRANSF_1_1_1] created for [the transformation stage] of partition point [SQ_SortMergeDataSize_Detail] has completed.
Total Run Time = [707.803168] secs
Total Idle Time = [105.303059] secs
Busy Percentage = [85.122550]
Thread work time breakdown:
JNRTRANS: 10.869565 percent
SRTTRANS: 89.130435 percent
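The Busy Percentage in the log above is derived from the run and idle times. A quick sketch of that arithmetic, using the reader thread's numbers:

    # Busy% = (total run time - total idle time) / total run time x 100
    run_secs, idle_secs = 318.271977, 176.488675      # READER_1_1_1 above
    busy_pct = (run_secs - idle_secs) / run_secs * 100
    print(round(busy_pct, 6))                         # ~44.547843, matching the log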
62
Performance Counters in WF Monitor
63
Integration Service Monitor in WF Monitor
64
Session Statistics in WF Monitor
65
Other Methods of Bottleneck Isolation
• Write to a flat file: if significantly faster than the relational target, it's a target bottleneck
• Place a FALSE Filter right after the Source Qualifier: if significantly faster, it's a transformation bottleneck
• If target & transformation bottlenecks are ruled out: source bottleneck
66
Target Optimization
• Target optimization often involves non-Informatica components
• Drop indexes and constraints
  • Use pre/post SQL to drop and rebuild
  • Use pre/post-load stored procedures
• Use constraint-based loading only when necessary
67
Target Optimization
• Use bulk loading
  • Informatica bypasses the database log
  • The target cannot perform a rollback
  • Weigh the importance of performance over recovery
• Use an external loader
  • Similar to the bulk loader, but the DB reads from a flat file
68
Target Optimization
• Target commit type
  • Best performance, least precise control
  • The system avoids writing partially-filled buffers
• Source commit type
  • The last active source to feed a target becomes a transaction generator
  • The commit interval provides precise control
  • Slower than the target commit type
  • Avoid setting the commit interval too low
• User-defined commit type
  • Required when the mapping contains a Transaction Control transformation
  • Provides precise, data-driven control
  • Slower than target and source commit types
69
Target Optimization
• "Update else insert" session property
  • Works well if you rarely insert
  • An index is required for the update key but slows down inserts
  • PowerCenter must wait for the database to return an error before inserting
• Alternative: a lookup followed by an Update Strategy
70
Source Bottlenecks
• Source optimization often involves non-Informatica components
• The generated SQL is available in the session log
  • Execute it directly against the DB
  • Update statistics on the DB
  • Use the tuned SELECT as a SQL override
• Set the Line Sequential Buffer Length session property to correspond with the record size
71
Source Bottlenecks
• Avoid transferring more than once from remote machine
• Avoid reading same data more than once
• Filter at source if possible (reduce data set)
• Minimize connected outputs from the source qualifier
  • Only connect what you need
  • The DTM only includes connected outputs when it generates the SQL SELECT statement
72
Reduce Data Set
• Remove unnecessary ports
  • Not all ports are needed
  • Fewer ports = better performance & lower memory requirements
• Reduce rows in the pipeline
  • Place a Filter transformation as far upstream as possible
  • Filter before an Aggregator, Rank, or Sorter if possible
  • Filter in the source qualifier if possible
73
Avoid Unnecessary Sorting
[Mapping diagram: an XML parser feeding a long chain of sorter (srt_*) and joiner (jnr_*) transformations; the data is re-sorted before each joiner even where it is already sorted, illustrating sorts that can be eliminated]
74
Expressions Language Tips
• Functions are more expensive than operators
  • Use || instead of CONCAT()
• Use variable ports to factor out common logic
75
Expressions Language Tips
• Simplify nested functions when possible
instead of:
  IIF(condition1, result1, IIF(condition2, result2, IIF( ... )))

try:
  DECODE(TRUE,
         condition1, result1,
         ...
         conditionN, resultN)
76
General Guidelines
• Data Type Conversions are expensive, avoid if possible
• All-input transformations (such as Aggregator, Rank, etc.) are more expensive than pass-through transformations
  • An all-input transformation must process multiple input rows before it can produce any output
77
General Guidelines
• High precision (session property) is expensive, but only applies to the "decimal" data type
• UNICODE requires 2 bytes per character; ASCII requires 1 byte per character
  • The performance difference depends on the number of string ports only
78
Transformation Specific
Reusable Sequence Generator: Number of Cached Values property
• Purpose: enables different sessions to share the same sequence without generating the same numbers
• >0: allocates the specified number of values & updates the current value in the repository at the end of each block (each session gets a different block of numbers; see the toy sketch below)
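A toy sketch of the block-allocation idea, illustrative only; this is not how the repository is actually implemented:

    import threading

    class SequenceBlocks:
        # Toy model: the "repository" hands each session a disjoint block
        def __init__(self, start=1, block_size=1000):
            self.current = start              # stands in for the repository's current value
            self.block_size = block_size
            self.lock = threading.Lock()      # the repository update is atomic
        def next_block(self):
            with self.lock:
                block = range(self.current, self.current + self.block_size)
                self.current += self.block_size   # one repository write per block
                return block

    seq = SequenceBlocks(block_size=1000)
    session_a = seq.next_block()   # values 1..1000
    session_b = seq.next_block()   # values 1001..2000; no overlap, but any
                                   # unused values in a block become gaps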
79
Transformation Specific
Reusable Sequence Generator: Number of Cached Values property
• Setting it too low causes frequent repository access, which impacts performance
• Unused values in a block are lost; this leads to gaps in the sequence
• Consider alternatives; for example, two non-reusable sequence generators, one generating even numbers & the other generating odd numbers
80
Other Transformations
• Normalizer
  • This transformation INCREASES the number of rows
  • Place it as far downstream as possible
• XML Reader / Mid-Stream XML Parser
  • Remove groups that are not projected
  • We do not allocate memory for these groups, but still need to maintain PK/FK relationships
  • Don't leave port lengths as infinite; use an appropriate length
81
Iterative Process
• After tuning your bottlenecks, revisit memory optimization
• Tuning often REDUCES memory requirements (you might even be able to change some settings back to Auto)
• Change one thing at a time & record your results
82
Partitioning
• Apply after optimizing source, target, & transformation bottlenecks
• Apply after optimizing memory usage
• Exploit under-utilized CPU & memory
• To customize partitioning settings, you need the partitioning license
83
Partitioning Terminology
• Partition: a subset of the data
• Stage: a portion of a pipeline
• Partition Point: the boundary between 2 stages
• Partition Type: the algorithm for distributing data among partitions; always associated with a partition point
84
Threads, Partition Points and Stages
[Diagram: partition points dividing a pipeline into reader, transformation, and writer threads (first through fourth stages)]
• The DTM implements each stage as a thread; hence, stages run in parallel
• You may add or remove partition points
85
Rules for Adding Partition Points
• You cannot add a partition point to a Sequence Generator
• You cannot add a partition point to an unconnected transformation
• You cannot add a partition point on a source definition
• If a pipeline is split and then concatenated, you cannot add a partition point on any transformation between the split and the concatenation
• Adding or removing partition points requires the partitioning license
86
Guidelines for Adding Partition Points
• Make sure you have ample CPU bandwidth
• Make sure you have gone through other optimization techniques
• Add on complex transformations that could benefit from additional threads
• If you have more than 1 partition, add partition points where data needs to be re-distributed:
  • At an Aggregator, Rank, or Sorter, where data must be grouped
  • Where data is distributed unevenly
  • On partitioned sources and targets
87
Partition Points & Partitions
• Partitions subdivide the data
  • Each partition represents a thread within a stage
  • Each partition point distributes the data among the partitions
[Diagram: 3 partitions, each with its own reader, transformation, and writer threads across the four stages]
88
Session Partitioning GUI
• The number next to each flag shows the number of partitions
• The color of each flag indicates the partition type
89
Rules for Adding Partitions
• The master input of a joiner can only have 1 partition unless you add a partition point at the joiner
• A pipeline with an XML target can only have 1 partition
• If the pipeline has a relational source or target and you define n partitions, each database must support n parallel connections
• A pipeline containing a custom or external procedure transformation can only have 1 partition unless those transformations are configured to allow multiple partitions
90
Rules for Adding Partitions
• The number of partitions is constant on a given pipeline
  • If you have a partition point on a Joiner, the number of partitions on both inputs will be the same
• At each partition point, you can specify how you want the data distributed among the partitions (this is known as the partition type)
91
Guidelines for Adding Partitions
• Make sure you have ample CPU bandwidth & memory
• Make sure you have gone through other optimization techniques
• Add 1 partition at a time & monitor the CPU
  • When CPU usage approaches 100%, don't add any more partitions
• Take advantage of database partitioning
92
Partition Types
• Each partition point is associated with a partition type
• The partition type defines how the DTM is to distribute the data among the partitions
• If the pipeline has only 1 partition, the partition point serves only to add a stage to the pipeline
• There are restrictions, enforced by the GUI, on which partition types are valid at which partition points
93
Partition Types – Pass Through
• Data is processed without redistributing the rows among partitions
• Serves only to add a stage to the pipeline
• Use when you want an additional thread for a complex transformation but you don’t need to redistribute the data (or you only have 1 partition)
94
Partition Types – Key Range
• The DTM passes data to each partition depending on user-specified ranges
• You may use several ports to form a compound partition key
• The DTM discards rows not falling in any specified range
• If 2 or more ranges overlap, a row can go down more than 1 partition, resulting in duplicate data (see the sketch below)
• Use key range partitioning when the sources or targets in the pipeline are partitioned by key range
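An illustrative sketch of the key-range routing semantics described above (a hypothetical helper, not PowerCenter code): rows outside every range are discarded, and overlapping ranges duplicate rows.

    # ranges: one (low, high) pair per partition, half-open [low, high)
    def key_range_partitions(key, ranges):
        # A row goes to every partition whose range contains its key:
        # overlaps duplicate the row, and no match means it is discarded
        return [i for i, (low, high) in enumerate(ranges) if low <= key < high]

    ranges = [(0, 1000), (1000, 2000), (1500, 3000)]   # last two overlap
    print(key_range_partitions(1700, ranges))   # [1, 2] -> duplicated row
    print(key_range_partitions(5000, ranges))   # []     -> discarded row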
95
Partition Types – Round Robin
• The Integration Service distributes rows of data evenly to all partitions
• Use when there is no need to group data among partitions
• Use when reading flat file sources of different sizes
• Use when data has been partitioned unevenly upstream and requires significantly more processing before arriving at the target
96
Partition Types – Hash Auto Keys
• The DTM applies a hash function to a partition key to group data among partitions
• Use hash partitioning to ensure that groups of rows are processed in the same partition
• The DTM automatically determines the partition key based on:
  • aggregator or rank group keys
  • join keys
  • sort keys
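A minimal sketch of the idea behind hash partitioning (illustrative; PowerCenter's actual hash function is internal): every row with the same group key hashes to the same partition, so groups are never split.

    import zlib

    def hash_partition(group_key, num_partitions):
        # Stable hash of the key: identical keys always land in the same
        # partition, so grouped processing stays correct
        return zlib.crc32(str(group_key).encode()) % num_partitions

    for key in ("ACME", "ACME", "GLOBEX"):
        print(key, "->", hash_partition(key, 3))   # both ACME rows go together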
97
Partition Types – Hash User Keys
• This is similar to hash auto keys except the user specifies which ports make up the partition key
• Alternative to hard-coded key range partition on relational target (if DB table is partitioned)
98
Partition Types – Database
• Only valid for DB2 and Oracle databases in a multi-node database
  • Sources: Oracle and DB2
  • Targets: DB2 only
• The number of partitions does not have to equal the number of database nodes
  • Performance may be better if they are equal, however
99
Partitioning with Relational Sources
• PowerCenter creates a separate source database connection for each partition
• If you define n partitions, the source database must support n parallel connections
• The DTM generates a separate SQL Query for each partition
• Each query can be overridden
• PowerCenter reads the data concurrently
100
Partitioning with Flat File Sources
• Multiple flat files
  • Each partition reads a different file
  • PowerCenter reads the files in parallel
  • If the files are of unequal sizes, you may want to repartition the data round-robin
• Single flat file
  • PowerCenter makes multiple parallel connections to the same file based on the number of partitions specified
  • PowerCenter distributes the data randomly to the partitions
  • Over a large volume of data, this random distribution tends to have an effect similar to round robin: partition sizes tend to be equal
101
Partitioning with Relational Targets
• The DTM creates a separate target database connection for each partition
• The DTM loads data concurrently
• If you define n partitions, database must support n concurrent connections
102
Partitioning with Flat File Targets
• The DTM writes output for each partition to a separate file
• Connection settings and properties can be configured for each partition
• The DTM can merge the target files if all have connections local to the Integration Service machine
• The DTM writes the data concurrently
103
Partitioning—Memory Requirements
• Minimum number of buffer blocks = (2 blocks per source, target, & XML group) x (number of partitions)
• Optimal number of buffer blocks = (optimal number for 1 partition) x (number of partitions)
104
Cache Partitioning
• The DTM may create separate caches for each partition for each cached transformation; this is called cache partitioning
• The DTM treats cache size settings as per-partition; for example, if you configure an aggregator with 2 MB for the index cache and 3 MB for the data cache, and you create 2 partitions, the DTM will allocate up to 4 MB and 6 MB total (as sketched below)
• The DTM does not partition lookup or joiner caches unless the lookup or joiner itself is a partition point
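The per-partition arithmetic from the example above, as a quick sketch (variable names are illustrative):

    # Cache size settings apply per partition once a cache is partitioned
    index_mb, data_mb, partitions = 2, 3, 2
    print(index_mb * partitions, "MB index,", data_mb * partitions, "MB data")  # 4 MB index, 6 MB data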
105
Cache Partitioning
[Diagram: with cache partitioning, each partition has its own cache(s): a sorter cache per partition, and an index cache & data cache per partition for other cached transformations]
106
Cache Partitioning
[Diagram: with a partition point on the joiner, each partition has its own index and data caches]
107
Cache Partitioning
[Diagram: with no partition point on the joiner, however, all partitions share 1 set of index and data caches]
108
Monitoring Partitions
• The Workflow Monitor provides runtime details for each partition
• Per partition, you can determine the following:
  • Number of rows processed
  • Memory usage
  • CPU usage
• If one partition is doing more work than the others, you may want to redistribute the data
109
Pipeline Partitioning Example
• Scenario:
  • Student record processing
  • XML source and Oracle target
  • XML source is split into 3 files
110
Pipeline Partitioning Example
Solution: define a partition for each of the 3 source files
[Diagram: partitions 1-3, one per source file]
111
Pipeline Partitioning Example
Problem: source files vary in size, resulting in unequal workloads for each partition
Solution: use Round Robin partitioning on the filter to balance the load
[Diagram: round-robin partition points marked on the filter]
112
Pipeline Partitioning Example
Problem: potential for splitting rank groups
Solution: use hash auto-keys partitioning on the rank to group rows appropriately
[Diagram: hash auto-keys partition points marked on the rank]
113
Pipeline Partitioning Example
Problem: target tables are partitioned on Oracle by key range
Solution: use target Key Range partitioning to optimize writing to the target tables
[Diagram: key-range partition points marked on the targets]
114
Dynamic Partitioning
• Integration Service can automatically set the number of partitions at runtime.
• Useful when the data volume increases or the number of CPUs available changes
• The basis for the number of partitions is specified as a session property
115
Concurrent Workflow Execution (8.5)
• Prior to 8.5:
  • Only one instance of a workflow could run
  • Users duplicated workflows, causing maintenance issues
  • Concurrent sessions required duplicates of the session
116
Concurrent Workflow Execution
• Allows workflow instances to be run concurrently
• Override parameters/variables across run instances
• Same scheduler across multiple instances
• Supports independent recovery/failover semantics
117
Concurrent Workflow Execution
118
Workflow on Grid (WonG)
• The Integration Service is deployed on a grid: an IS service process (pmserver) runs on each node in the grid
• Allows tasks of a workflow to be distributed across the grid; no user configuration is necessary if all nodes are homogeneous
119
Workflow on Grid (WonG)
• Different sessions in a workflow are dispatched on different nodes to balance load
• Use workflow on grid if:
  • There are many concurrent sessions and workflows
  • You want to leverage multiple machines in the environment
120
Load Balancer Modes
• Round Robin
  • Honors Max Number of Processes per Node
• Metric-based
  • Evaluates nodes in round-robin order
  • Honors resource provision thresholds
  • Uses stats from the last 3 runs; if no statistics have been collected yet, defaults are used (40 MB memory, 15% CPU)
121
Load Balancer Modes
• Adaptive
  • Selects the node with the most available CPU
  • Honors resource provision thresholds
  • Uses statistics from the last 3 runs of a task to determine whether a task can run on a node
  • Bypass in dispatch queue: skips tasks in the queue that are more resource-intensive and can't be dispatched to any currently available node
  • CPU Profile: ranks node CPU performance against a baseline system
• All modes take into account the service level assigned to workflows
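An illustrative sketch of the adaptive dispatch logic described above (greatly simplified; not the actual Load Balancer implementation, and the node metrics are hypothetical):

    # Hypothetical node metrics; "within_thresholds" stands in for the
    # resource provision thresholds that the Load Balancer honors
    nodes = [
        {"name": "node1", "available_cpu": 0.35, "within_thresholds": True},
        {"name": "node2", "available_cpu": 0.80, "within_thresholds": True},
        {"name": "node3", "available_cpu": 0.90, "within_thresholds": False},
    ]

    def adaptive_pick(nodes):
        # Honor thresholds first, then take the node with the most available CPU
        eligible = [n for n in nodes if n["within_thresholds"]]
        return max(eligible, key=lambda n: n["available_cpu"])["name"]

    print(adaptive_pick(nodes))   # -> node2 (node3 exceeds its thresholds)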
122
Session on Grid (SonG)
• The session is partitioned and dispatched across multiple nodes
• Allows unlimited scalability
• Sources and targets may be on different nodes
• More suited to large sessions
• Smaller machines in a grid are a lower-cost option than large multi-CPU machines
123
Session on Grid (SonG)
• Session on Grid will scale if:
  • Sessions are CPU/memory intensive, and the gain overcomes the overhead of data movement over the network
  • I/O is kept localized to each node running the partition
  • There is fast shared storage (e.g. NAS, clustered FS)
  • Partitions are independent
• Source and target have different connections that are only available on different machines
  • E.g. source Excel files on Windows while the target is only available on UNIX
• Supported on a homogeneous grid
124
Configuring Session on Grid
• Enable Session on Grid attribute in session configuration tab
• Assign workflow to be executed by an integration service that has been assigned to a grid
125
Dynamic Partitioning
• Based on user specification (number of partitions)
  • Can be parameterized as $DynamicPartitionCount
• Based on the number of nodes in the grid
• Based on source partitioning (database partitioning)
126
SonG Partitioning Guidelines
• Set the number of partitions equal to the number of nodes to get an even distribution
  • Tip: use the dynamic partitioning feature to ease expansion of the grid
• In addition, continue to create partition points to achieve parallelism
127
SonG Partitioning Guidelines
• To minimize data traffic across nodes:
  • Use the pass-through partition type, which will try to keep transformations on the same node
  • Use a resource map to dispatch the source and target transformations to the node where the source or target is located
  • Keep the target files unmerged whenever possible (e.g. if they are being used for staging)
• Resource requirements should be specified at the lowest granularity, e.g. transformation instead of session (as far as possible)
  • This will ensure better distribution in SonG
128
File Placement Best Practices
• Files that should be placed on a high-bandwidth shared file system (CFS / NAS):
  • Source files
  • Lookup source files [sequential file access]
  • Target files [sequential file access]
  • Persistent cache files for lookup or incremental aggregation [random file access]
• Files that should be placed on a shared file system where the bandwidth requirement is low (NFS):
  • Parameter files
  • Other configuration files
  • Indirect source or target files
  • Log files
129
File Placement Best Practices
• Files that should be put on local storage:
  • Non-persistent cache files (i.e. sorter temporary files)
  • Intermediate target files for sequential merge
  • Other temporary files created during session execution
• $PmTempFileDir should point to a local file system
• For best performance, ensure sufficient bandwidth for the shared file system and local storage (possibly by using additional disk I/O controllers)
130
Data Integration Certification Path

Informatica Certified Administrator
• Recommended Training: PowerCenter QuickStart (eLearning); PowerCenter 8.5+ Administrator (4 days)
• Required Exams: Architecture & Administration; Advanced Administration

Informatica Certified Developer
• Recommended Training: PowerCenter QuickStart (eLearning); PowerCenter 8.5+ Administrator (4 days); PowerCenter Developer 8.x Level I (4 days); PowerCenter Developer 8 Level II (4 days)
• Required Exams: Architecture & Administration; Mapping Design; Advanced Mapping Design

Informatica Certified Consultant
• Recommended Training: PowerCenter QuickStart (eLearning); PowerCenter 8.5+ Administrator (4 days); PowerCenter Developer 8.x Level I (4 days); PowerCenter Developer 8 Level II (4 days); PowerCenter 8 Data Migration (4 days); PowerCenter 8 High Availability (1 day)
• Required Exams: Architecture & Administration; Advanced Administration; Mapping Design; Advanced Mapping Design; Enablement Technologies

Additional Training: PowerCenter 8.5 New Features; PowerCenter 8.6 New Features; PowerCenter 8 Upgrade; PowerCenter 8 Team-Based Development; PowerCenter 8.5 Unified Security
131
Q & A
Bert Peters, Global Education Services, Principal Instructor
132
Course Evaluation
Bert Peters, Global Education Services, Principal Instructor
133
Appendix: Informatica Services by Solution
134
B2B Data Exchange Recommended Services
Professional Services
• Strategy Engagements: B2B Data Transformation Architectural Review
• Baseline Engagements: B2B Data Transformation Baseline Architecture
• Implement Engagements: B2B Full Project Lifecycle; Transaction/Customer/Payment Hub

Education Services
• Recommended Courses: Informatica B2B Data Transformation (D); Informatica B2B Data Exchange (D)

Target audience for courses: D = Developer, A = Administrator, M = Project Manager
135
Data Governance Recommended Services
Professional Services
• Strategy Engagements: Informatica Environment Assessment Service; Metadata Strategy and Enablement; Data Quality Audit
• Baseline Engagements: Data Governance Implementation; Metadata Manager Quick Start; Informatica Data Quality Baseline Deployment
• Implement Engagements: Metadata Manager Customization; Data Quality Management Implementation

Education Services
• Recommended Courses: PowerCenter Level I Developer (D); Informatica Data Explorer (D); Informatica Data Quality (D)
• Related Courses: PowerCenter Administrator (A); Metadata Manager (D)
• Certifications: PowerCenter; Data Quality
136
Data Migration Recommended Services
Professional Services
• Strategy Engagements: Data Migration Readiness Assessment; Informatica Data Quality Audit
• Baseline Engagements: PowerCenter Baseline Deployment; Informatica Data Quality (IDQ) and/or Informatica Data Explorer (IDE) Baseline Deployment
• Implement Engagements: Data Migration Jumpstart; Data Migration End-to-End Implementation

Education Services
• Recommended Courses: Data Migration (M); Informatica Data Explorer (D); Informatica Data Quality (D); PowerCenter Level I Developer (D)
• Related Courses: PowerExchange Basics (D); PowerCenter Administrator (A)
• Certifications: PowerCenter; Data Quality
137
Data Quality Recommended Services
Professional Services
• Strategy Engagements: Data Quality Management Strategy; Informatica Data Quality Audit
• Baseline Engagements: Informatica Data Quality (IDQ) and/or Informatica Data Explorer (IDE) Baseline Deployment; Informatica Data Quality Web Services Quick Start
• Implement Engagements: Data Quality Management Implementation

Education Services
• Recommended Courses: Informatica Data Explorer (D); Informatica Data Quality (D)
• Related Courses: Informatica Identity Resolution (D); PowerCenter Level I Developer (D)
• Certifications: Data Quality
138
Data Synchronization Recommended Services
Professional Services
• Strategy Engagements: Project Definition and Assessment
• Baseline Engagements: PowerExchange Baseline Architecture Deployment; PowerCenter Baseline Architecture Deployment
• Implement Engagements: Data Synchronization Implementation

Education Services
• Recommended Courses: PowerCenter Level I Developer (D); PowerCenter Level II Developer (D); PowerCenter Administrator (A)
• Related Courses: PowerExchange Basics Oracle Real-Time CDC (D); PowerExchange SQL RT (D); PowerExchange for MVS DB2 (D)
• Certifications: PowerCenter
139
Enterprise Data Warehousing Recommended Services
Professional Services
• Strategy Engagements: Enterprise Data Warehousing (EDW) Strategy; Informatica Environment Assessment Service; Metadata Strategy & Enablement
• Baseline Engagements: PowerCenter Baseline Architecture Deployment
• Implement Engagements: EDW Implementation

Education Services
• Recommended Courses: PowerCenter Level I Developer (D); PowerCenter Level II Developer (D); PowerCenter Metadata Manager (D)
• Related Courses: Informatica Data Quality (D); Data Warehouse Development (D)
• Certifications: PowerCenter
140
Integration Competency Centers Recommended Services
Professional Services
• Strategy Engagements: ICC Assessment
• Baseline Engagements: ICC Master Class Series; ICC Director
• Implement Engagements: ICC Launch; ICC Implementation; Informatica Production Support

Education Services
• Recommended Courses: ICC Overview (M); PowerCenter Level I Developer (D); PowerCenter Administrator (A)
• Related Courses: Metadata Manager (D); Informatica Data Explorer (D); Informatica Data Quality (D)
• Certifications: PowerCenter; Data Quality
141
Master Data Management Recommended Services
Professional Services
• Strategy Engagements: Master Data Management (MDM) Strategy; Informatica Data Quality Audit
• Baseline Engagements: Informatica Data Explorer (IDE) Baseline Deployment; Informatica Data Quality (IDQ) Baseline Deployment; PowerCenter Baseline Architecture Deployment
• Implementation: MDM Implementation

Education Services
• Recommended Courses: Informatica Data Explorer (D); Informatica Data Quality (D); PowerCenter Level I Developer (D)
• Related Courses: Metadata Manager (D); Informatica Identity Resolution (D)
• Certifications: PowerCenter; Data Quality
142
Services Oriented Architecture Recommended Services
Professional Services
• Strategy Engagements: Data Services (SOA) Strategy
• Baseline Engagements: Informatica Web Services Quick Start; Informatica Data Quality Web Services Quick Start
• Implement Engagements: Data Services (SOA) Implementation

Education Services
• Recommended Courses: PowerCenter Level I Developer (D); Informatica Data Quality (D)
• Certifications: PowerCenter; Data Quality
143
Governance, Risk & Compliance (GRC) Recommended Services
Professional Services
• Strategy Engagements: Informatica Environment Assessment Service; Enterprise Data Warehouse Strategy; Data Quality Audit
• Baseline Engagements: Informatica Data Quality Baseline Deployment; Metadata Manager Quick Start
• Implement Engagements: Risk Management Enablement Kit; Enterprise Data Warehouse Implementation

Education Services
• Recommended Courses: PowerCenter Level I Developer (D); Informatica Data Explorer (D); Informatica Data Quality (D)
• Related Courses: Data Warehouse Development (D); ICC Overview (M); Metadata Manager (D)
• Certifications: PowerCenter; Data Quality
144
Mergers & Acquisitions (M&A) Recommended Services
Professional Services
• Strategy Engagements: Data Migration Readiness Assessment; Informatica Data Quality Audit
• Baseline Engagements: PowerCenter Baseline Deployment; Informatica Data Quality (IDQ) and/or Informatica Data Explorer (IDE) Baseline Deployment
• Implement Engagements: Data Migration Jumpstart; Data Migration End-to-End Implementation

Education Services
• Recommended Courses: Data Migration (M); PowerCenter Level I Developer (D)
• Related Courses: Informatica Data Explorer (D); Informatica Data Quality (D); PowerExchange Basics (D)
• Certifications: PowerCenter; Data Quality
145
Deliver Your Project Right the First Time with Informatica Professional Services
146
Informatica Global Education Services
"We launched an aggressive data migration project that was to be completed in one year. The complexity of the data schema along with the use of Informatica PowerCenter tools proved challenging to our top colleagues.
We believe that Informatica training led us to triple productivity, helping us to complete the project on its original 1-year schedule.”
Joe Caputo, Director, Pfizer
147
Informatica Contact Information
Informatica Corporation Headquarters
100 Cardinal Way
Redwood City, CA 94063
Tel: 650-385-5000
Toll-free: 800-653-3871
Toll-free Sales: 888-635-0899
Fax: 650-385-5500

Informatica EMEA Headquarters
Informatica Nederland B.V.
Edisonbaan 14a
3439 MN Nieuwegein
Postbus 116
3430 AC Nieuwegein
Tel: +31 (0) 30-608-6700
Fax: +31 (0) 30-608-6777

Informatica Asia/Pacific Headquarters
Informatica Australia Pty Ltd
Level 5, 255 George Street
Sydney
N.S.W. 2000
Australia
Tel: +612-8907-4400
Fax: +612-8907-4499
Global Customer Support: [email protected]; or log in at my.informatica.com to open a new service request or to check on the status of an existing SR.
http://www.informatica.com