Datastage Interview Questions & Answers

    1) What are the types of Parallelism?

    There are two types of parallel processing:

    Pipeline Parallelism:

    It is the ability of a downstream stage to begin processing a row as soon as an upstream
    stage has finished processing that row (rather than processing one row completely through
    the job before beginning the next row). In parallel jobs, this is managed automatically.

    For example, consider a job (Source -> Transformer -> Target) running on a system with three

    processors:

    The source stage starts running on one processor, reads the data from the source and starts

    filling a pipeline with the read data.

    Simultaneously, the Transformer stage starts running on another processor, processes the

    data in the pipeline and starts filling another pipeline.

    At the same time, the target stage starts running on another processor and writes data to the
    target as soon as the data is available.

    Partitioning Parallelism:

    Partitioning parallelism means that the entire record set is partitioned into smaller sets and

    processed on different nodes. That is, several processors can run the same job

    simultaneously, each handling a separate subset of the total data.

    For example, if there are 100 records and 4 logical nodes, each node would process 25
    records. This enhances the speed at which loading takes place.
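    The sketch below (plain Python, not DataStage code) is only an illustration of these two ideas:
    a generator pipeline hands each row downstream as soon as it is produced, and a simple
    partitioner splits the record set across four hypothetical nodes; the column names are made up.

        def source():
            # Upstream stage: yields rows one at a time (fills the "pipeline").
            for i in range(100):
                yield {"id": i, "value": i * 10}

        def transform(rows):
            # Downstream stage: works on each row as soon as it arrives,
            # instead of waiting for the whole record set (pipeline parallelism).
            for row in rows:
                row["value"] += 1
                yield row

        def partition(rows, node_count=4):
            # Partition parallelism: split the record set into subsets, one per node.
            partitions = [[] for _ in range(node_count)]
            for i, row in enumerate(rows):
                partitions[i % node_count].append(row)   # simple round-robin split
            return partitions

        parts = partition(transform(source()))
        print([len(p) for p in parts])                   # -> [25, 25, 25, 25]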

    2) What are the partitioning methods available in PX?

    The Partitioning methods available in PX are:

    a) Auto:

    It chooses the best partitioning method depending on:

    The mode of execution of the current stage and the preceding stage.

    The number of nodes available in the configuration file.

    b) Round robin:

    Here, the first record goes to the first processing node, the second to the second processing

    node, and so on. This method is useful for resizing partitions of an input dataset that are not

    equal in size to approximately equal-sized partitions.

    DataStage uses Round robin when it partitions the data initially.

    c) Same:

    It implements the same partitioning method as the one used by the preceding stage. The

    records stay on the same processing node; that is, data is not redistributed or repartitioned.

    Same is considered the fastest partitioning method.

    DataStage uses Same when passing data between stages in a job.

    d) Random:

    It distributes the records randomly across all processing nodes and guarantees that each

    processing node receives approximately equal-sized partitions.

    e) Entire:

    It distributes the complete dataset as input to every instance of a stage on every processing

    node. It is mostly used with stages that create lookup tables for their input.

    f) Hash:

    It distributes all the records with identical key values to the same processing node so as to

    ensure that related records are in the same partition. This does not necessarily mean that the

    partitions will be equal in size.

    When hash partitioning, hashing keys that create a large number of partitions should be
    selected.

    Reason: For example, if you hash partition a dataset based on a zip code field, where a large

    percentage of records are from one or two zip codes, it can lead to bottlenecks because some

    nodes are required to process more records than other nodes.

    g) Modulus:

    Partitioning is based on a key column modulo the number of partitions. The modulus

    partitioner assigns each record of an input dataset to a partition of its output dataset as

    determined by a specified key field in the input dataset.

    h) Range:

    It divides a dataset into approximately equal-sized partitions, each of which contains records

    with key columns within a specific range. It guarantees that all records with the same partitioning

    key values are assigned to the same partition.

    Note: In order to use the Range partitioner, a range map has to be made using the Write Range
    Map stage.

    i) DB2:

    Partitions an input dataset in the same way that DB2 would partition it.

    For example, if this method is used to partition an input dataset containing update information

    for an existing DB2 table, records are assigned to the processing node containing the

    corresponding DB2 record.
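    As a rough illustration of three of these methods (round robin, hash, modulus), the plain-Python
    sketch below shows only the record-to-node assignment logic; it is not how DataStage implements
    partitioning, and the column names are made up.

        def round_robin(records, node_count):
            # Record 1 -> node 1, record 2 -> node 2, ..., then wrap around.
            return [records[i::node_count] for i in range(node_count)]

        def hash_partition(records, key, node_count):
            # All records with identical key values land on the same node;
            # partition sizes are not guaranteed to be equal.
            parts = [[] for _ in range(node_count)]
            for rec in records:
                parts[hash(rec[key]) % node_count].append(rec)
            return parts

        def modulus_partition(records, key, node_count):
            # Like hash, but uses a numeric key value modulo the number of partitions.
            parts = [[] for _ in range(node_count)]
            for rec in records:
                parts[rec[key] % node_count].append(rec)
            return parts

        records = [{"cust_id": i, "zip": 500 + (i % 3)} for i in range(10)]
        print([len(p) for p in round_robin(records, 4)])
        print([len(p) for p in hash_partition(records, "zip", 4)])
        print([len(p) for p in modulus_partition(records, "cust_id", 4)])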

    3) What is the difference between OLTP and OLAP?

    OLTP systems contain normalized data, whereas OLAP systems contain de-normalized data.

    OLTP stores current data, whereas OLAP stores current and historical data for analysis.

    Query retrieval is very fast in OLTP compared to OLAP systems, because OLTP queries touch
    small amounts of current data while OLAP queries scan large volumes of data for analysis.

    4) What is the difference between star schema and snowflake schema?

    In a star schema, dimension tables are denormalized, whereas in a snowflake schema, dimension

    tables are normalized.

    5) Where do we need partitioning (in processing or somewhere else)?

    Partitioning is needed in processing. It means we need partitioning wherever we have huge

    volumes of data to process.

    6) If we use Same partitioning in the first stage, which partitioning method will it take?

    DataStage uses Round robin when it partitions the data initially.

    7) If we check Preserve Partitioning in one stage and don't give any partitioning
    method (Auto) in the next stage, which partitioning method will it use?

    In this case, the partitioning method used by the preceding stage is used.

    Preserve Partitioning indicates whether the stage wants to preserve the partitioning at the

    next stage of the job. Options in this tab are:

    Set - Sets the preserve partitioning flag.

    Clear - Clears the preserve partitioning flag.

    Propagate - Sets the flag to Set or Clear depending on the option selected in the previous
    stage.

    8) Why do we need datasets rather than sequential files?

    A sequential file used as the source or target needs to be repartitioned as it is (as the name
    suggests) a single sequential stream of data. A dataset can be saved across nodes using the
    partitioning method selected, so it is always faster when used as a source or target. The Data Set

    stage allows you to store data being operated on in a persistent form, which can then be used

    by other DataStage jobs. Data sets are operating system files, each referred to by a control

    file, which by convention has the suffix .ds. Using datasets wisely can be key to good

    performance in a set of linked jobs.

    9) Why do we need the Sort stage other than the sort-merge collecting method and the
    perform-sort option in a stage's advanced properties?

    The Sort stage is used to perform more complex sort operations which are not possible using a
    stage's Advanced tab properties.

    Many stages have an optional sort function via the partition tab. This means if you are

    partitioning your data in a stage you can define the sort at the same time. The sort stage is

    for use when you don't have any stage doing partitioning in your job but you still want to sort

    your data, or if you want to sort your data in descending order, or if you want to use one of

    the sort stage options such as "Allow Duplicates" or "Stable Sort". If you are processing very

    large volumes and need to sort, you will find the Sort stage is more flexible than the partition

    tab sort.

    10) Why do we need Filter, Copy and Column Export stages instead of the Transformer stage?

    In parallel jobs we have specific stage types for performing specialized tasks. Filter, copy,

    column export stages are operator stages. These operators are the basic functional units of an

    orchestrate application. The operators in your Orchestrate application pass data records from

    one operator to the next, in pipeline fashion. For example, the operators in an application step

    might start with an import operator, which reads data from a file and converts it to an
    Orchestrate data set. Subsequent operators in the sequence could perform various processing

    and analysis tasks. The processing power of Orchestrate derives largely from its ability to

    execute operators in parallel on multiple processing nodes. By default, Orchestrate operators

    execute on all processing nodes in your system. Orchestrate dynamically scales your

    application up or down in response to system configuration changes, without requiring you to

    modify your application. Thus, using operator stages rather than Transformer stages increases
    the speed of data processing applications.

    11) Describe the types of Transformers used in DataStage PX, and their uses.

    Difference:

    A Basic transformer compiles in "Basic Language" whereas a Normal Transformer compiles in

    "C++".

    Basic transformer does not run on multiple nodes whereas a Normal Transformer can run on

    multiple nodes giving better performance.

    Basic transformer takes less time to compile than the Normal Transformer.

    Usage:

    A Basic Transformer should be used in server jobs; a Normal Transformer should be used in
    parallel jobs.

    12) What will you do in a situation where somebody wants to send you a file and use

    that file as an input or reference and then run the job?

    Use the Wait For File activity stage between Job Activity stages in the job sequencer.

    13) How did you handle an 'Aborted' sequencer?

    By using checkpoint information we can restart the sequence from the point of failure. If
    checkpoint information is enabled, reset the aborted job and run the sequence again.

    14) What performance tunings have you done in your last project to increase the
    performance of slowly running jobs?

    1. Using Dataset stage instead of sequential files wherever necessary.

    2. Use Join stage instead of Lookup stage when the data is huge.

    3. Use operator stages like Remove Duplicates, Filter, Copy etc. instead of the Transformer
    stage.

    4. Sort the data before sending it to the Change Capture or Remove Duplicates stage.

    5. The key column should be hash partitioned and sorted before an aggregate operation.

    6. Filter unwanted records at the beginning of the job flow itself.

    15) What is the Change Capture stage? Which execution mode would you use when using it
    for comparison of data?

    The Change Capture stage takes two input data sets, denoted before and after, and outputs a

    single data set whose records represent the changes made to the before data set to obtain the

    after data set.

    The stage produces a change data set, whose table definition is transferred from the after data

    set's table definition with the addition of one column: a change code with values encoding the

    four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change

    data set.

    The compare is based on a set of key columns; rows from the two data sets are assumed to be

    copies of one another if they have the same values in these key columns. You can also

    optionally specify change values. If two rows have identical key columns, you can compare the

    value columns in the rows to see if one is an edited copy of the other.

    The stage assumes that the incoming data is key-partitioned and sorted in ascending order.

    The columns the data is hashed on should be the key columns used for the data compare. You

    can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting

    and partitioning abilities of the Change Capture stage.

    We can use both sequential and parallel modes of execution for the Change Capture stage.
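    The plain-Python sketch below is only a rough model of the compare logic described above: it
    assigns insert/delete/copy/edit labels by comparing the before and after sets on a key column.
    It uses in-memory dictionaries rather than the sorted, key-partitioned merge the stage actually
    performs, and the column names are made up.

        def change_capture(before, after, key, value_cols):
            # Index both data sets by the key column, then compare them.
            before_by_key = {row[key]: row for row in before}
            after_by_key = {row[key]: row for row in after}
            changes = []
            for k, row in after_by_key.items():
                if k not in before_by_key:
                    changes.append({**row, "change_code": "insert"})
                elif any(row[c] != before_by_key[k][c] for c in value_cols):
                    changes.append({**row, "change_code": "edit"})
                else:
                    changes.append({**row, "change_code": "copy"})
            for k, row in before_by_key.items():
                if k not in after_by_key:
                    changes.append({**row, "change_code": "delete"})
            return changes

        before = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
        after = [{"id": 2, "name": "B2"}, {"id": 3, "name": "C"}]
        print(change_capture(before, after, "id", ["name"]))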

    16) What is the Peek stage? When do you use it?

    The Peek stage is a Development/Debug stage. It can have a single input link and any number

    of output links.

    The Peek stage lets you print record column values either to the job log or to a separate

    output link as the stage copies records from its input data set to one or more output data sets,

    like the Head stage and the Tail stage.

    The Peek stage can be helpful for monitoring the progress of your application or to diagnose a

    bug in your application.

    17) What is the difference between Join and Lookup?

    a. If the reference table has a huge amount of data then we go for Join, whereas if the
    reference table has a small amount of data then we go for Lookup.

    b. Join performs all 4 types of joins (inner join, left-outer join, right-outer join and full-outer
    join), whereas Lookup performs inner join and left-outer join only.

    c. Join doesn't have a reject link, whereas Lookup has a reject link.

    d. Join uses hash partitioning, whereas Lookup uses entire partitioning.
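    To illustrate point (a), the plain-Python sketch below contrasts an in-memory lookup (the whole
    reference data is loaded once, as with entire partitioning) with a merge join over two sorted
    inputs. It is a conceptual model only, not DataStage internals, and the column names are made up.

        def lookup(primary, reference, key):
            # Small reference data: build an in-memory map once, probe it per primary row.
            ref_map = {r[key]: r for r in reference}
            for row in primary:
                match = ref_map.get(row[key])          # left-outer semantics: no match -> keep row
                yield {**row, **(match or {})}

        def merge_join(primary, reference, key):
            # Huge reference data: both inputs sorted on the key and walked once in step.
            primary = sorted(primary, key=lambda r: r[key])
            reference = sorted(reference, key=lambda r: r[key])
            i = 0
            for row in primary:
                while i < len(reference) and reference[i][key] < row[key]:
                    i += 1
                if i < len(reference) and reference[i][key] == row[key]:
                    yield {**row, **reference[i]}      # inner-join semantics for brevity

        orders = [{"cust": 2, "amt": 10}, {"cust": 1, "amt": 5}]
        customers = [{"cust": 1, "name": "A"}, {"cust": 2, "name": "B"}]
        print(list(lookup(orders, customers, "cust")))
        print(list(merge_join(orders, customers, "cust")))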

    18) What is RCP? How is it implemented?

    DataStage is flexible about metadata. It can cope with the situation where metadata isn't

    fully defined. You can define part of your schema and specify that, if your job encounters extra

    columns that are not defined in the metadata when it actually runs, it will adopt these extra

    columns and propagate them through the rest of the job. This is known as runtime column

    propagation (RCP).

    This can be enabled for a project via the DataStage Administrator, and set for individual links

    via the Outputs Page Columns tab for most stages or in the Outputs page General tab for

    Transformer stages. You should always ensure that runtime column propagation is turned on.

    RCP is implemented through a schema file.

    The schema file is a plain text file that contains a record (or row) definition.

    19) What is the Row Generator stage? When do you use it?

    The Row Generator stage is a Development/Debug stage. It has no input links, and a single

    output link.

    The Row Generator stage produces a set of mock data fitting the specified metadata.

    This is useful where we want to test our job but have no real data available to process.

    Row Generator is also useful when we want processing stages to execute at least once in the

    absence of data from the source.
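    As an illustration of producing mock data that fits a column definition, here is a small
    plain-Python sketch; the schema, column names and types below are made up, and this is not
    the Row Generator stage itself.

        import random
        import string

        SCHEMA = {"cust_id": "int", "cust_name": "string", "balance": "float"}

        def mock_row(schema):
            # Build one row whose columns fit the declared types.
            row = {}
            for col, col_type in schema.items():
                if col_type == "int":
                    row[col] = random.randint(1, 1000)
                elif col_type == "float":
                    row[col] = round(random.uniform(0, 100), 2)
                else:
                    row[col] = "".join(random.choices(string.ascii_uppercase, k=5))
            return row

        mock_data = [mock_row(SCHEMA) for _ in range(10)]   # 10 rows of test data
        print(mock_data[0])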

    20) What are Stage Variables, Derivations and Constraints?

    Stage Variable - An intermediate processing variable that retains its value during a read and
    does not pass the value into a target column.

    Derivation - An expression that specifies the value to be passed on to the target column.

    Constraint - A condition that is either true or false and that specifies the flow of data within a
    link.

    The order of execution is: stage variables -> constraints -> derivations.
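    A plain-Python model of that per-row order is sketched below: stage variables are evaluated
    first, then the constraint decides whether the row flows on the link, then the derivations
    produce the target columns. The variable and column names are made up.

        def transform_row(row):
            # 1. Stage variables: intermediate values, not written to the target.
            sv_full_name = row["first_name"] + " " + row["last_name"]
            sv_is_active = row["status"] == "A"

            # 2. Constraint: a true/false condition controlling whether the row flows on the link.
            if not sv_is_active:
                return None

            # 3. Derivations: expressions producing the target columns.
            return {"CUSTOMER_NAME": sv_full_name.upper(), "CUSTOMER_ID": row["id"]}

        rows = [
            {"id": 1, "first_name": "Ann", "last_name": "Lee", "status": "A"},
            {"id": 2, "first_name": "Bob", "last_name": "Ray", "status": "I"},
        ]
        output = [r for r in (transform_row(x) for x in rows) if r is not None]
        print(output)   # only the active row reaches the target link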

    21) Explain the types of Dimension Tables.

    Conformed Dimension: If a dimension table is connected to more than one fact table, the
    granularity defined in the dimension table is common across those fact tables.

    Junk Dimension: A dimension table which contains only flags.

    Monster Dimension: A very large, rapidly changing dimension is known as a Monster
    Dimension.

    Degenerate Dimension: A dimension stored in the fact table itself, typical of a line-item-
    oriented fact table design.

    22) What is the significance of a surrogate key in DataStage?

    Surrogate key is a substitution for the natural primary key. It is just a unique identifier or

    number for each row that can be used as the primary key of the table. The only requirement

    for a surrogate primary key is that it is unique for each row in the table. It is useful because

    the natural primary key can change, which makes updates more difficult. Surrogate keys
    are always integer or numeric.
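    A minimal plain-Python sketch of the idea follows: each natural key is assigned a stable
    integer surrogate. This only models the concept; it is not the DataStage Surrogate Key
    Generator stage, and the sample keys are made up.

        surrogate_map = {}          # natural key -> surrogate key
        next_key = 1

        def get_surrogate(natural_key):
            # Hand out the next integer the first time a natural key is seen.
            global next_key
            if natural_key not in surrogate_map:
                surrogate_map[natural_key] = next_key
                next_key += 1
            return surrogate_map[natural_key]

        for customer in ["CUST-A", "CUST-B", "CUST-A"]:
            print(customer, "->", get_surrogate(customer))   # CUST-A keeps the same surrogate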

    23) What is the difference between sparse lookup and normal lookup?

    Sparse Lookup:

    If the reference table has a larger amount of data than the primary table, then it is better to
    go for a sparse lookup.

    Normal Lookup:

    If the reference table has a smaller amount of data than the primary table, then it is better to
    go for a normal lookup.

    In both cases the reference table should be entire partitioned and the primary table should be
    hash partitioned.

    24) Can we capture duplicates in DataStage? If yes, how do we do that?

    Yes. We can capture duplicates by using the Sort stage.

    In the Sort stage, there is a property called Create Key Change Column. It gives 1 for the first
    (original) record in each key group and 0 for the duplicates. Then, by using a Transformer or
    Filter stage, we can capture duplicates in one file and non-duplicates in another file.
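    A plain-Python model of that key-change logic is sketched below: sort on the key, flag the first
    row of each key group with 1 and the duplicates with 0, then split the two streams. It is only
    an illustration of the logic, not the Sort stage itself, and the column names are made up.

        def flag_key_change(records, key):
            # Sort on the key, then mark the first row of each key group with 1.
            prev = object()                       # sentinel that never equals a real key
            for rec in sorted(records, key=lambda r: r[key]):
                rec["keyChange"] = 1 if rec[key] != prev else 0
                prev = rec[key]
                yield rec

        records = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}, {"id": 2}]
        flagged = list(flag_key_change(records, "id"))

        originals = [r for r in flagged if r["keyChange"] == 1]   # one row per key
        duplicates = [r for r in flagged if r["keyChange"] == 0]  # the captured duplicates
        print(len(originals), len(duplicates))                    # -> 3 2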