Partitioning In Datastage

  • 7/30/2019 Partitioning In Datastage

    Partitioning

    2002. Infosys Technologies Ltd. 2

    Agenda

    Introduction

    Why do we need partitioning

    Types of partitioning

    Introduction

    The strength of DataStage Parallel Extender (PX) lies in the parallel processing capability it brings to your data extraction and transformation applications.

    DataStage PX can slice the data into chunks and process them simultaneously.

    Parallelism in DataStage PX is of two types.

    Pipeline parallelism.

    Partition parallelism.

    Types of Parallelism

    Parallelism in PX jobs is of two types.

    Pipeline

    The output of a producer operator is processed by a consumer operator before the producer operator has finished processing its input.

    Partition

    Data is broken into partitions, and each partition is processed by a separate instance of the operator at the same time.

    Pipeline parallelism

    If a Parallel Extender job ran sequentially, each stage would process a single row of data and then pass it to the next stage, which would process that row and pass it on.

    When the same job runs in parallel, the reading stage starts on one node and begins filling a pipeline with the data it has read. The next stage starts running on another node as soon as there is data in the pipeline, processes it, and begins filling another pipeline.
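
The pipelined hand-off described above can be sketched with Python generators. This is only an illustration, not DataStage code: the stage names (`read_stage`, `transform_stage`, `load_stage`) and the `events` log are hypothetical, and generators run in a single thread, so they mimic the row-by-row flow of a pipeline rather than stages running concurrently on separate nodes.

```python
def read_stage(rows, events):
    """Producer: emits rows one at a time instead of building a full list."""
    for row in rows:
        events.append(("read", row))
        yield row


def transform_stage(upstream, events):
    """Consumes each row as soon as the reader yields it, then passes it on."""
    for row in upstream:
        events.append(("transform", row))
        yield row * 10


def load_stage(upstream, events):
    """Final consumer: collects the transformed rows."""
    loaded = []
    for row in upstream:
        events.append(("load", row))
        loaded.append(row)
    return loaded


events = []
result = load_stage(transform_stage(read_stage([1, 2, 3], events), events), events)
print(result)
print(events[:4])
```

The event log shows that the first row is read, transformed, and loaded before the second row is read, which is the behaviour the pipeline description relies on.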

    Pipeline

    Partition parallelism

    When the same job processes a huge volume of data, pipelining alone would take time. We can use the parallel processing power of DataStage by partitioning the data into separate sets.

    Each of these sets is then processed by a separate node.
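
A minimal sketch of the same idea in Python, assuming a thread pool stands in for the processing nodes. The helper names (`split_into_partitions`, `process_partition`) are hypothetical, and the contiguous split used here is just one simple way to form the sets; it is not one of the DataStage partitioning methods.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Cut rows into num_partitions contiguous slices (the last may be shorter)."""
    size = -(-len(rows) // num_partitions)  # ceiling division
    return [rows[start:start + size] for start in range(0, len(rows), size)]


def process_partition(partition):
    """The same stage logic runs on every partition; here it sums the rows."""
    return sum(partition)


rows = list(range(1, 11))
partitions = split_into_partitions(rows, 3)

# One worker per partition stands in for one processing node per partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

total = sum(partial_sums)
print(partitions, partial_sums, total)
```

Each worker sees only its own slice, and the partial results are combined afterwards.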

    Partition and Pipeline

    When more processors are available, pipeline and partition parallel processing can be combined to achieve better performance.

    Why do we need partitioning

    To introduce parallel processing into a job, data should be partitioned.

    To achieve greater performance, data should be partitioned.

    Each node works on a different partition.

    Types of partitioning

    The following are the various partitioning methods:

    Round Robin

    Random

    Same

    Entire

    Hash

    Modulus

    Range

    DB2

    Auto

    General

    Round Robin

    The first record goes to the first processing node, the second record goes to the second processing node, and so on. Once the last processing node is reached, the next record goes to the first processing node again.

    Used to resize partitions that are not equal in size.

    This method is used to create equal sized partitions.

    This method is used to create sequences.
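
The round robin rule above can be sketched in a few lines of Python. The function name `round_robin_partition` and the sample records are hypothetical; the sketch only illustrates the dealing order, not how DataStage implements it.

```python
from itertools import cycle


def round_robin_partition(records, num_nodes):
    """Deal records to nodes 0, 1, ..., num_nodes - 1, then wrap around."""
    partitions = [[] for _ in range(num_nodes)]
    for record, node in zip(records, cycle(range(num_nodes))):
        partitions[node].append(record)
    return partitions


records = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
parts = round_robin_partition(records, 3)
print(parts)
```

Because records are dealt one at a time, partition sizes never differ by more than one record.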

    Round Robin

    Same

    Fastest method of partitioning.

    Records stay on the same processing node.

    No repartitioning is done; the operator uses the output from the preceding stage as it is already partitioned.

    Same

    Entire

    Every processing node of the stage gets the entire set of data.

    Used when the data is small enough to fit into memory and access to the entire data set is needed.

    Generally used in lookups to create a hash table.
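
As a rough illustration of why Entire suits lookups, the sketch below gives every node a full copy of a small reference set and builds a hash table (a Python `dict`) from it. The function name `entire_partition`, the country-code data, and the way the main stream is split are all hypothetical.

```python
def entire_partition(records, num_nodes):
    """Give every node its own full copy of the record set."""
    return [list(records) for _ in range(num_nodes)]


reference = [("US", "United States"), ("IN", "India")]
copies = entire_partition(reference, 3)

# Each node builds its own in-memory hash table from its full copy.
lookup_tables = [dict(copy) for copy in copies]

# The main stream may be partitioned in any way; every node can still
# resolve every key because it holds the whole reference set.
stream_partitions = [["IN"], ["US", "IN"], ["US"]]
resolved = [
    [table[code] for code in partition]
    for table, partition in zip(lookup_tables, stream_partitions)
]
print(resolved)
```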

    Entire

    Hash

    Partitioning is based on a function of columns chosen as hash keys.

    This method is used when related records need to be kept in same partition.

    It does not ensure that the partitions are evenly sized.

    This partitioning method is used in join, sort, merge and lookup Stages.
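
A minimal sketch of hash partitioning, assuming `zlib.crc32` as a stand-in hash function (the hash DataStage actually uses is not stated in the slides). The function name `hash_partition` and the order records are hypothetical.

```python
import zlib


def hash_partition(records, key, num_partitions):
    """Send each record to the partition chosen by a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key_bytes = str(record[key]).encode("utf-8")
        # crc32 is deterministic, so equal keys always map to the same partition.
        partitions[zlib.crc32(key_bytes) % num_partitions].append(record)
    return partitions


orders = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 25},
    {"customer": "alice", "amount": 7},
    {"customer": "carol", "amount": 3},
    {"customer": "bob", "amount": 12},
]
parts = hash_partition(orders, "customer", 4)
print([len(p) for p in parts])
```

All rows for one customer land in one partition, which is what a join or sort on that key needs, but nothing forces the partitions to be the same size.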

    Hash

    Modulus

    Partitioning is based on the value of a key column modulo the number of partitions.

    This method is similar to hash by field, but involves simpler computation.
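
The modulus rule can be shown directly, assuming an integer key column. The function name `modulus_partition` and the `id` column are hypothetical.

```python
def modulus_partition(records, key, num_partitions):
    """Partition number = integer key value modulo the number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record[key] % num_partitions].append(record)
    return partitions


rows = [{"id": n} for n in (10, 11, 12, 13, 14, 21)]
parts = modulus_partition(rows, "id", 4)
ids_per_partition = [[row["id"] for row in part] for part in parts]
print(ids_per_partition)
```

No hash function is evaluated; a single remainder operation picks the partition, which is the simpler computation the slide refers to.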

    Range

    Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range.

    This method is also useful for ensuring that related records are in the same partition.

    This method needs a range map to be created, which decides which records go to which processing node.
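
A small sketch of range partitioning, assuming the range map can be modelled as a sorted list of upper bounds. That list, the function name `range_partition`, and the sample keys are hypothetical; a real range map is a DataStage artifact whose format is not described in the slides.

```python
from bisect import bisect_left


def range_partition(records, key, boundaries):
    """boundaries: sorted inclusive upper bounds; yields len(boundaries) + 1 partitions."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        # bisect_left finds the first boundary >= the key value.
        partitions[bisect_left(boundaries, record[key])].append(record)
    return partitions


range_map = [100, 200]  # keys <= 100, keys 101..200, keys > 200
rows = [{"id": n} for n in (5, 100, 101, 200, 250)]
parts = range_partition(rows, "id", range_map)
ids_per_partition = [[row["id"] for row in part] for part in parts]
print(ids_per_partition)
```

The partitions come out roughly equal in size only if the boundaries match the actual distribution of key values.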

    Range

    Range map

    DB2

    Data is partitioned in the same way as the DB2 table.

    Used when writing to a DB2 table.

    Default partitioning method for DB2 stages.

    DB2

    Degree of parallelism

    The degree of parallelism is determined by the configuration file.

    It is the total number of logical nodes in the default pool, or a subset if using "constraints".

    Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage.

    Job performance can be tuned by choosing the best configuration for a job.
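
The counting rule above can be illustrated with a toy model. The dictionary below is a hypothetical in-memory stand-in, not the syntax of a real configuration file, and it assumes the empty string names the default pool; the node and pool names are invented.

```python
# Hypothetical model: logical node name -> list of pools the node belongs to.
# The empty string "" is assumed to name the default pool.
config_nodes = {
    "node1": ["", "db"],
    "node2": [""],
    "node3": [""],
    "node4": ["db"],
}


def degree_of_parallelism(nodes, pool_constraint=""):
    """Count the logical nodes that belong to the requested pool."""
    return sum(1 for pools in nodes.values() if pool_constraint in pools)


default_degree = degree_of_parallelism(config_nodes)
db_degree = degree_of_parallelism(config_nodes, "db")
print(default_degree, db_degree)
```

With no constraint, the stage would run on the three nodes in the default pool; constrained to the `db` pool, it would run on two.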

    Partitioning and Collecting Icons

    Partitioner Collector