Datastage Interview Questions



What is the flow of loading data into fact & dimensional tables?

    Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional

    table. Consists of fields with numeric values.

    Dimension table - Table with Unique Primary Key.

    Load - Data should be first loaded into dimensional table. Based on the primary key values in

    dimensional table, the data should be loaded into Fact table.

What is the default cache size? How do you change the cache size if needed?

Default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

    What does a Config File in parallel extender consist of?

    Config file consists of the following.

    a) Number of Processes or Nodes.

    b) Actual Disk Storage Location

What is Modulus and Splitting in Dynamic Hashed File?

In a dynamic hashed file the size of the file changes as data is added or removed; the modulus is the number of groups in the file.

When the file grows, a group is split and the modulus increases - this is called "Splitting".

When the file shrinks, groups are merged and the modulus decreases - this is called "Merging".

What are Stage Variables, Derivations and Constraints?

Stage Variable - An intermediate processing variable that retains its value from row to row and does not pass the value into a target column.

Derivation - An expression that specifies the value to be passed on to the target column.

Constraint - A condition that evaluates to true or false and controls the flow of data within a link.
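A minimal Transformer sketch of how the three fit together (the link and column names In.Amount, In.Region and the names svRunningTotal, TotalAmount are illustrative only, not from any particular job):

    Stage variable svRunningTotal (initial value 0), derivation:  svRunningTotal + In.Amount
    Derivation of target column TotalAmount:                      svRunningTotal
    Constraint on the output link (pass only one region):         In.Region = "EMEA"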

Types of views in Datastage Director?

There are 3 types of views in Datastage Director:

a) Job View - Dates of Jobs Compiled.

b) Status View - Status of the Job's last run.

c) Log View - Warning Messages, Event Messages, Program Generated Messages.


    Types of Parallel Processing?

    A) Parallel Processing is broadly classified into 2 types.

    a) SMP - Symmetrical Multi Processing.

    b) MPP - Massive Parallel Processing.

Orchestrate Vs Datastage Parallel Extender?

Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. Datastage used Orchestrate with Datastage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then acquired Orchestrate and integrated it with Datastage XE, releasing a new version, Datastage 6.0, i.e. Parallel Extender.

Importance of Surrogate Key in Data warehousing?

A Surrogate Key is a Primary Key for a Dimension table. Its main importance is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by changes going on in the source database.

How to run a Shell Script within the scope of a Datastage job?

By using the "ExecSH" command at Before/After job properties.

    How do you execute datastage job from command line prompt?

    Using "dsjob" command as follows.

    dsjob -run -jobstatus projectname jobname

    Functionality of Link Partitioner and Link Collector?

    Link Partitioner: It actually splits data into various partitions or data flows using various partition

    methods.

Link Collector: It collects the data coming from partitions, merges it into a single data flow and loads it to the target.

Types of Dimensional Modeling?

Dimensional modeling is sub divided into the following types:

a) Star Schema - Simple & much faster. Denormalized form.

b) Snowflake Schema - Complex with more granularity. More normalized form.

c) Galaxy Schema - Complex multi-star schema.


Differentiate Primary Key and Partition Key? Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. Partition Key is just a part of the Primary Key. There are several methods of partitioning like Hash, DB2, Random etc. While using Hash partitioning we specify the Partition Key.

Differentiate Database data and Data warehouse data?

Database (operational) data is: a) Detailed or transactional. b) Both readable and writable. c) Current.

Data warehouse data, by contrast, is typically summarized, read-only and historical.

    Containers Usage and Types?

    Container is a collection of stages used for the purpose of Reusability.

    There are 2 types of Containers.

    a) Local Container: Job Specific

    b) Shared Container: Used in any job within a project.

    Compare and Contrast ODBC and Plug-In stages? ODBC: a) Poor Performance.

    b) Can be used for Variety of Databases.

    c) Can handle Stored Procedures.

    Plug-In: a) Good Performance.

    b) Database specific. (Only one database)

    c) Cannot handle Stored Procedures.

    Dimension Modelling types along with their significance

    Data Modelling is Broadly classified into 2 types.

a) E-R Diagrams (Entity - Relationships).

    b) Dimensional Modelling.

What are Ascential DataStage Products, Connectivity

    Ascential Products

    Ascential DataStage

    Ascential DataStage EE (3)

    Ascential DataStage EE MVS

    Ascential DataStage TX


    Ascential QualityStage

    Ascential MetaStage

    Ascential RTI (2)

    Ascential ProfileStage

    Ascential AuditStage

    Ascential Commerce Manager

    Industry Solutions

    Connectivity

    Files

    RDBMS

    Real-time

    PACKs

    EDI

    Other

Explain Data Stage Architecture? Data Stage contains two sets of components: Client Components and Server Components.

Client Components: Data Stage Administrator, Data Stage Manager, Data Stage Designer, Data Stage Director.

Server Components: Data Stage Engine, Meta Data Repository, Package Installer.

Data Stage Administrator (roles and responsibilities):

Used to create the project. Contains the set of project properties. We can set the buffer size (by default 128 MB) and increase it. We can set the Environment Variables. Under Tunables we have in-process and inter-process row buffering: in-process reads the data sequentially, inter-process reads the data as it comes. It just interfaces to the metadata.


Data Stage Manager:

We can view and edit the Meta Data Repository. We can import table definitions. We can export the Data Stage components in .xml or .dsx format. We can create routines and transforms. We can compile multiple jobs.

Data Stage Designer:

We can create the jobs. We can compile the job. We can run the job. We can declare stage variables in a transform; we can call routines, transforms, macros and functions. We can write constraints.

Data Stage Director:

We can run the jobs. We can schedule the jobs (scheduling can be done daily, weekly, monthly or quarterly). We can monitor the jobs. We can release the jobs.

What is Meta Data Repository? Metadata is data about the data.

The repository also contains:

Query statistics, ETL statistics, business subject areas, source information, target information, and

source-to-target mapping information.

What is Data Stage Engine? It is the DataStage server engine (based on the UniVerse engine) running in the background; it executes server jobs.

    What is Dimensional Modeling? Dimensional Modeling is a logical design technique that seeks to present the data in a standard

    framework that is, intuitive and allows for high performance access.

    What is Star Schema? Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded

    by dimensions table.

    Dimension Table: It contains a primary key and description about the fact table.

    Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

What is surrogate Key? It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table.

We can store up to 2 billion records.


Why do we need a surrogate key? It is used for integrating data from different sources and serves better than the business key for

index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

What is Snowflake schema? It is a partially normalized dimensional model in which at least one dimension is represented by two or

more hierarchically related tables.

    Explain Types of Fact Tables?

    Factless Fact: It contains only foreign keys to the dimension tables.

    Additive Fact: Measures can be added across any dimensions.

    Semi-Additive: Measures can be added across some dimensions. Eg, % age, discount

Non-Additive: Measures cannot be added across any dimensions. Eg, Average.

Conformed Fact: The definition of the measure is the same in the two fact tables, so the facts can be

compared across dimensions using the same set of measures.

    Explain the Types of Dimension Tables?

Conformed Dimension: A dimension table that is connected to more than one fact table; the

granularity defined in the dimension table is common across those fact tables.

Junk Dimension: A dimension table which contains only flags.

Monster Dimension: A dimension that changes rapidly is known as a Monster Dimension.

Degenerate Dimension: A dimension key (such as a line-item number) kept in the fact table itself, as in a line item-oriented fact table design.

What are stage variables?

Stage variables are declared in the Transformer Stage and used to store values. Stage variables are

active at run time (memory is allocated at run time).
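A common use is detecting a change in a key value between consecutive rows. A minimal sketch, assuming an input link named In with a column CustId (names are illustrative only):

    Stage variable svIsNewKey, derivation:  If In.CustId <> svPrevKey Then 1 Else 0
    Stage variable svPrevKey, derivation:   In.CustId

svIsNewKey must be listed above svPrevKey, so that when it is evaluated svPrevKey still holds the previous row's key.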

    What is sequencer? It sets the sequence of execution of server jobs.

What are Active and Passive stages?

Active Stage: Active stages model the flow of data and provide mechanisms for combining data

streams, aggregating data and converting data from one data type to another. Eg, Transformer,

Aggregator, Sort, Row Merger etc.

Passive Stage: A Passive stage handles access to databases for the extraction or writing of data.

Eg, IPC stage, file types, Universe, Unidata, DRS stage etc.

    What is ODS?

    Operational Data Store is a staging area where data can be rolled back.

    What are Macros?

    They are built from Data Stage functions and do not require arguments.

    A number of macros are provided in the JOBCONTROL.H file to facilitate getting information

    about the current job, and links and stages belonging to the current job. These can be used in


    expressions (for example for use in Transformer stages), job control routines, filenames and table

    names, and before/after subroutines.

    DSHostName

    DSJobStatus

    DSProjectName

    DSJobName

    DSJobController

    DSJobStartDate

    DSJobStartTime

    DSJobStartTimestamp

    DSJobWaveNo

    DSJobInvocations

    DSJobInvocationId

    DSStageLastErr

    DSStageType

    DSStageInRowNum

    DSStageVarList

    DSLinkLastErr

    DSLinkName

    DSStageName

    DSLinkRowCount
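As a hedged illustration, the body of a before/after job subroutine (which receives the arguments InputArg and ErrorCode) could log some of these macros; the routine name LogJobContext is made up for this sketch:

    * Body of a before/after job subroutine called LogJobContext
    ErrorCode = 0   ;* set to a non-zero value to abort the job
    Msg = "Job " : DSJobName : " started " : DSJobStartTimestamp : " on " : DSHostName
    Msg = Msg : " (invocation " : DSJobInvocationId : ")"
    Call DSLogInfo(Msg, "LogJobContext")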

What is KeyMgtGetNextValue?

It is a built-in transform that generates sequential numbers. Its input type is literal string and its output

type is string.

    What index is created on Data Warehouse?

    Bitmap index is created in Data Warehouse.

    What is container?

    A container is a group of stages and links. Containers enable you to simplify and modularize

    your server job designs by replacing complex areas of the diagram with a single container stage.

    You can also use shared containers as a way of incorporating server job functionality into

    parallel jobs.

    DataStage provides two types of container:

Local containers. These are created within a job and are only accessible by that job. A

local container is edited in a tabbed page of the job's Diagram window.

Shared containers. These are created separately and are stored in the Repository in the

same way that jobs are. There are two types of shared container: server shared containers (usable in server jobs) and parallel shared containers (usable in parallel jobs).

    What is function? ( Job Control Examples of Transform Functions ) Functions take arguments and return a value.

BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or

    more. Arguments are always in parentheses, separated by commas, as shown in this general

    syntax:

    FunctionName (argument, argument)
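A few hedged examples of calling BASIC functions with this syntax (the variable names are illustrative):

    MyString = "  DataStage  "
    Clean = Trim(MyString)                  ;* "DataStage"
    HowLong = Len(Clean)                    ;* 9
    Today = Oconv(Date(), "D-YMD[4,2,2]")   ;* internal date formatted as e.g. "2006-10-14"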


DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current

job; these are useful in active stage expressions and before- and after-stage subroutines.

To do this ... Use this function ...

Specify the job you want to control - DSAttachJob
Set parameters for the job you want to control - DSSetParam
Set limits for the job you want to control - DSSetJobLimit
Request that a job is run - DSRunJob
Wait for a called job to finish - DSWaitForJob
Get the meta data details for the specified link - DSGetLinkMetaData
Get information about the current project - DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage - DSGetIPCStageProps
Get information about the controlled job or current job - DSGetJobInfo
Get information about the meta bag properties associated with the named job - DSGetJobMetaBag
Get information about a stage in the controlled job or current job - DSGetStageInfo
Get the names of the links attached to the specified stage - DSGetStageLinks
Get a list of stages of a particular type in a job - DSGetStagesOfType
Get information about the types of stage in a job - DSGetStageTypes
Get information about a link in a controlled job or current job - DSGetLinkInfo
Get information about a controlled job's parameters - DSGetParamInfo
Get the log event from the job log - DSGetLogEntry
Get a number of log events on the specified subject from the job log - DSGetLogSummary
Get the newest log event, of a specified type, from the job log - DSGetNewestLogId
Log an event to the job log of a different job - DSLogEvent
Stop a controlled job - DSStopJob
Return a job handle previously obtained from DSAttachJob - DSDetachJob
Log a fatal error message in a job's log file and abort the job - DSLogFatal
Log an information message in a job's log file - DSLogInfo
Put an info message in the job log of a job controlling the current job - DSLogToController
Log a warning message in a job's log file - DSLogWarn
Generate a string describing the complete status of a valid attached job - DSMakeJobReport
Insert arguments into the message template - DSMakeMsg
Ensure a job is in the correct state to be run or validated - DSPrepareJob
Interface to the system send mail facility - DSSendMail
Log a warning message to a job log file - DSTransformError
Convert a job control status or error code into an explanatory text message - DSTranslateCode
Suspend a job until a named file either exists or does not exist - DSWaitForFile
Check if a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space - DSCheckRoutine
Execute a DOS or Data Stage Engine command from a before/after subroutine - DSExecute
Set a status message for a job to return as a termination message when it finishes - DSSetUserStatus
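As a hedged sketch (not from the original document), a small job control routine could combine several of these functions; the job and parameter names (LoadCustomerDim, RunDate) are invented for illustration:

    * Attach the job, set a parameter and a warning limit, run it and wait for it
    hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob, "RunDate", Oconv(Date(), "D-YMD[4,2,2]"))
    ErrCode = DSSetJobLimit(hJob, DSJ.LIMITWARN, 50)
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)

    * Check how the job finished and log a warning if it failed
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
       Call DSLogWarn("LoadCustomerDim did not finish cleanly", "JobControl")
    End
    ErrCode = DSDetachJob(hJob)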

What are Routines?

Routines are stored in the Routines branch of the Data Stage Repository, where you can create,

view or edit them. The following programming components are classified as routines:

    Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE)

    functions, Web Service routines

What is Hash file stage and what is it used for?


Used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.

What are the types of Hashed File?

Hashed Files are classified broadly into 2 types: A) Static - sub divided into 17 types based on the Primary Key pattern. B) Dynamic - sub divided into 2 types: i) Generic ii) Specific. The default hashed file is "Dynamic - Type Random 30 D".

What are Static Hash files and Dynamic Hash files?

As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.

How did you handle reject data?

Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. So a Reject link has to be defined for every output link on which you wish to collect rejected data. Rejected data is typically bad data like duplicates of primary keys or null rows where data is expected.

What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?

Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts.

Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

Tuned the 'Project Tunables' in Administrator for better performance.

Used sorted data for the Aggregator. Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.

Removed the data not used from the source as early as possible in the job.

Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.

Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.

If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.

Before writing a routine or a transform, make sure that the functionality required is not already available in one of the standard routines supplied in the sdk or ds utilities categories.

Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records from even getting in before joins are made.

Tuning should occur on a job-by-job basis.

Use the power of the DBMS. Try not to use a Sort stage when you can use an ORDER BY clause in the database. Using a constraint to filter a record set is much slower than performing a SELECT with a WHERE clause.


    Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Tell me one situation from your last project where you faced a problem, and how did you solve it?

1. Jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.

2. A job aborted in the middle of loading some 500,000 rows. You have the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it aborted. To make sure the load was proper we opted for the former.

Tell me the environment in your last projects.

Give the OS of the Server and the OS of the Client of your most recent project.

How did you connect to DB2 in your last project?

Most of the time the data was sent to us in the form of flat files; the data is dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or) DB2-UDB depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.

    What are Routines and where/how are they written and have you written any routines before? Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of Routines:

    1. Transform Functions 2. Before-After Job subroutines 3. Job Control Routines

    How did you handle an 'Aborted' sequencer? In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.

Read the String functions in DS.

Functions like [] -> the substring function and ':' -> the concatenation operator.

Syntax: string [ [ start, ] length ] or string [ delimiter, instance, repeats ]

What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?

Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.


Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
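A job control routine can also wait for the file directly with DSWaitForFile (listed in the function table earlier). A hedged sketch, with an invented path and timeout:

    * Wait up to 2 hours for the trigger file, then either continue or log a warning
    Result = DSWaitForFile("/data/in/customers.dat timeout:2H")
    If Result = DSJE.NOERROR Then
       Call DSLogInfo("File arrived - the load job can now be attached and run", "JobControl")
    End Else
       Call DSLogWarn("File did not arrive within 2 hours", "JobControl")
    End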

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

Use the crontab utility along with the dsjob command, with the proper parameters passed.

Did you work in a UNIX environment?

Yes. It is one of the most important requirements.

How would you call an external Java function which is not supported by DataStage?

Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

    How will you determine the sequence of jobs to load into data warehouse? First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any).

The above might raise another question: why do we have to load the dimensional tables first and then the fact tables? As we load the dimensional tables the (primary) keys are generated, and these keys are foreign keys in the fact tables.

Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic?

There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have alter table permissions whereas Delete doesn't).

    How do you rename all of the jobs to support your new File-naming conventions? Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program, which can do a simple rename of the strings looking up the Excel file. Then import the new dsx file probably into a new project for testing. Recompile all jobs. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs. So you have to make the necessary changes to these Sequencers.

When should we use ODS?

DWHs are typically read-only and batch updated on a schedule; ODSs are maintained in more real time and trickle fed constantly.

    What other ETL's you have worked with?


Informatica, and also DataJunction if it is present in your resume.

How good are you with your PL/SQL?

On a scale of 1-10, say 8.5-9.

What versions of DS have you worked with?

DS 7.5, DS 7.0.2, DS 6.0, DS 5.2.

What's the difference between a Datastage developer and a Datastage designer?

A Datastage developer is one who codes the jobs. A Datastage designer is one who designs the job; I mean he deals with blueprints and designs the jobs with the stages that are required in developing the code.

    What are the command line functions that import and export the DS jobs?

    dsimport.exe - imports the DataStage components. dsexport.exe - exports the DataStage components.

How to handle date conversions in Datastage? Convert mm/dd/yyyy format to yyyy-dd-mm?

We use a) the "Iconv" function - internal conversion, and b) the "Oconv" function - external conversion.

The function to convert the mm/dd/yyyy format is Oconv(Iconv(Fieldname,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")


Did you parameterize the job or hard-code the values in the jobs?

Always parameterize the job. Either the values come from Job Properties or from a Parameter Manager, a third-party tool. There is no way you will hardcode some parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked up.

What are the main differences between Ascential DataStage and Informatica PowerCenter?

Chuck Kelley's answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski's answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?

What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?

Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

How well does each tool handle complex transformations, and how much external scripting is required?

What kinds of languages are supported for ETL script extensions?

Almost any ETL tool will look like any other on the surface. The trick is to find out which one

will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product - especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source

systems, data volumes and transformation complexity.


Ask both vendors for a list of their customers with characteristics similar to your own that have

used their ETL product for at least a year. Then interview each client (preferably several people

at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the

tool that have been encountered by that customer. Ultimately, ask each customer, if they had it all to do over again, whether or not they'd choose the same tool and why. You might be surprised at some of the answers.

Joyce Bischoff's answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the

detailed requirements. There are numerous ETL products on the market and it seems that you are

looking at only two of them. If you are unfamiliar with the many products available, you may

refer to www.tdan.com, the Data Administration Newsletter, for product lists.

If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's. Ask both vendors and compare the answers, which may or may

not be totally accurate. After you are very familiar with the products, call their references and be

sure to talk with technical people who are actually using the product. You will not want the

vendor to have a representative present when you speak with someone at the reference site. It is

also not a good idea to depend upon a high-level manager at the reference site for a reliable

opinion of the product. Managers may paint a very rosy picture of any selected product so that

they do not look like they selected an inferior product.

In how many places can you call Routines?

You can call routines in four places:

1. Transform of a routine: a. Date Transformation b. Upstring Transformation

2. Transform of the Before & After Subroutines

3. XML transformation

4. Web-based transformation

What is the Batch Program and how can you generate it?

A batch program is generated at run time and maintained by Datastage itself, but you can easily change it on the basis of your requirements (Extraction, Transformation, Loading). Batch programs are generated depending on your job's nature, either a simple job or a sequencer job; you can see this program under the job control option.

Suppose that 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after running the job only 5,000 rows have been loaded into the target table, the rest are not loaded and the job is aborted. How can you sort out the problem?

Suppose the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem. In this condition you should go to the Director and check what type of problem is showing: a data type problem, a warning message, job failed or job aborted. If the job failed it means a data type problem or a missing column action. So you should go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action -> select this option; here there are two options:

(i) On Fail -- Commit, Continue (ii) On Skip -- Commit, Continue.


First check how much data has already been loaded, then select the On Skip option and continue; for the remaining data that was not loaded, select On Fail, Continue. Run the job again and you should get a success message.

What happens if RCP is disabled?

In such a case OSH has to perform the import and export every time the job runs, and the processing time of the job also increases.


    What are Sequencers?

    Sequencers are job control programs that execute other jobs with preset Job parameters.


What is the difference between the Filter stage and the Switch stage? Ans: There are two main differences, and probably some minor ones as well. The two main differences are as follows.

    1) The Filter stage can send one input row to more than one output link. The Switch stage can not - the C switch construct has an implicit break in every case. 2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)

How can I achieve constraint-based loading using DataStage 7.5? My target tables have inter

    dependencies i.e. Primary key foreign key constraints. I want my primary key tables to be loaded

    first and then my foreign key tables and also primary key tables should be committed before the

    foreign key tables are executed. How can I go about it?

Ans: 1) Create a Job Sequencer to load your tables in sequential mode.

In the sequencer, call all the Primary Key table loading jobs first, followed by the Foreign Key

tables; when triggering the Foreign Key table load jobs, trigger them only when the Primary Key load

jobs have run successfully (i.e. an OK trigger).

2) To improve the performance of the job, you can disable all the constraints on the tables and

load them. Once loading is done, check the integrity of the data; whatever does not meet it, raise it as

exception data and cleanse it.

This is only a suggestion; normally when loading with constraints enabled, performance

goes down drastically.


    3) If you use Star schema modeling, when you create physical DB from the model, you can

    delete all constraints and the referential integrity would be maintained in the ETL process by

    referring all your dimension keys while loading fact tables. Once all dimensional keys are

    assigned to a fact then dimension and fact can be loaded together. At the same time RI is being

maintained at the ETL process level.

How do you merge two files in DS?

Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

    How do you eliminate duplicate rows?

    Ans: Data Stage provides us with a stage Remove Duplicates in Enterprise edition. Using that

    stage we can eliminate the duplicates based on a key column.

    How do you pass filename as the parameter for a job?

Ans: While developing the job we can create a parameter 'FILE_NAME' and the value can be

passed while running the job.

Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?

Ans: Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement - you can only export full projects from the command line. You can find the export and import executables on the client machine, usually someplace like C:\Program Files\Ascential\DataStage.

Diff. between JOIN stage and MERGE stage.

JOIN: Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.

MERGE: Combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.

Advantages of DataStage?

Business advantages: Helps in making better business decisions; it is able to integrate data coming from all parts of the company; it helps to understand new and already existing clients; we can collect data of different clients with it and compare them; it makes the research of new business possibilities possible; we can analyze trends in the data read by it.

Technological advantages: It handles all company data and adapts to the needs;


It offers the possibility of organizing complex business intelligence; it is flexible and scalable; it accelerates the running of the project; it is easily implementable.

What is the architecture of Data Stage?

Basically the architecture of DS is a client/server architecture, with client components and server components.

Client components are of 4 types: 1. Data Stage Designer 2. Data Stage Administrator 3. Data Stage Director 4. Data Stage Manager.

Data Stage Designer is used to design the jobs. Data Stage Manager is used to import & export the project and to view & edit the contents of the repository. Data Stage Administrator is used for creating the project, deleting the project & setting the environment variables. Data Stage Director is used to run the jobs, validate the jobs and schedule the jobs.

Server components: DS server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a DWH. DS Package Installer: a user interface used to install packaged DS jobs and plug-ins. Repository or project: a central store that contains all the information required to build a DWH or data mart.

What are the stages you worked on?

I have some jobs that automatically delete the log details every month. What steps do you have to take for that?

We have to set the auto-purge option in DS Administrator.

I want to run multiple jobs in a single job. How can you handle that?

In job properties set the option ALLOW MULTIPLE INSTANCES.


What is version controlling in DS?

In DS, version controlling is used for backing up the project or jobs. This option is available from DS version 7.1 onwards. Version controls are of 2 types:

1. VSS - Visual Source Safe 2. CVSS - Concurrent Visual Source Safe.

VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users have to wait until the first user completes the operation. With CVSS many users can access it concurrently. Compared to VSS, the cost of CVSS is high.

What is the difference between clearing the log file and clearing the status file?

Clear log - we can clear the log details by using the DS Director. Under the Job menu the clear log option is available. By using this option we can clear the log details of a particular job.

Clear status file - lets the user remove the status of the records associated with all stages of the selected jobs (in DS Director).

I developed 1 job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?

By using the Usage Analysis tool, which is available in DS Manager, we can find out which items are used in the job.

My job takes 30 minutes to run and I want to run it in less than 30 minutes. What steps do we have to take?

By using the performance tuning aspects which are available in DS, we can reduce the time. Tuning aspects: in DS Administrator: in-process and inter-process; in between passive stages: the inter-process (IPC) stage; OCI stage: array size and transaction size. Also use the Link Partitioner & Link Collector stages in between passive stages.

How to do row transposition in DS?

The Pivot stage is used for transposition. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.

If a job is locked by some user, how can you unlock that particular job in DS?

We can unlock the job by using the Clean Up Resources option which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.

I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the X value?

The X value is zero. The Iconv function converts a string to an internal storage format. It takes 31 Dec 1967 as zero and counts days from that date (31-DEC-1967).


What are unit testing, integration testing and system testing?

Unit testing: For DS, a unit test will check data type mismatches, the size of a particular data type, and column mismatches.

Integration testing: According to the dependencies, all jobs are integrated into one sequence; that is called the control sequence.

System testing: System testing is nothing but the performance tuning aspects in DS.

How many hashing algorithms are available for static hash files and dynamic hash files?

Sixteen hashing algorithms for static hash files. Two hashing algorithms for dynamic hash files (GENERAL or SEQ.NUM).

What happens when you have a job that links two passive stages together?

Obviously there is some process going on. Under the covers DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

What is the use of the Nested Condition activity?

Nested Condition: allows you to further branch the execution of a sequence depending on a condition.

I have three jobs A, B and C which are dependent on each other. I want to run the A & C jobs daily and the B job only on Sunday. How can you do it?

First schedule the A & C jobs Monday to Saturday in one sequence. Next take the three jobs according to their dependency in one more sequence and schedule that job only on Sunday.

What are the ways to execute Datastage jobs?

A job can be run using a few different methods:

from Datastage Director (menu Job -> Run now...)

from the command line using a dsjob command

from a Datastage routine that runs a job (the DSRunJob function)

by a job sequencer

How to invoke a Datastage shell command?

Datastage shell commands can be invoked from:

Datastage Administrator (Projects tab -> Command)

a Telnet client connected to the Datastage server

How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select a process which holds a lock and click Logout. If it still doesn't help, go to the Datastage shell and invoke the following command: ds.tools


It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

How to run and schedule a job from the command line?

To run a job from the command line use the dsjob command.

Command syntax: dsjob [-file <file> | [-server <server>][-user <user>][-password <password>]] [<command> <options>]

The command can be placed in a batch file and run by a system scheduler.

How to release a lock held by jobs?

Go to the Datastage shell and invoke the following command: ds.tools

It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

    User privileges for the default DataStage roles? The role privileges are:

DataStage Developer - a user with full access to all areas of a DataStage project.

DataStage Operator - has privileges to run and manage deployed DataStage jobs.

-none- - no permission to log on to DataStage.

    What is a command to analyze hashed file? There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:

    FILE.STAT command ANALYZE.FILE command

Is it possible to run two versions of Datastage on the same PC?

Yes, even though different versions of Datastage use different system dll libraries. To dynamically switch between Datastage versions, install and run the DataStage Multi-Client Manager. That application can unregister and register the system libraries used by Datastage.

Error in Link Collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to Job Properties -> Performance and select Enable row buffer. Then select Inter process, which will let the Link Collector run correctly. A buffer size of 128 Kb should be fine; however, it's a good idea to increase the timeout.

What is the DataStage equivalent to the LIKE option in ORACLE?

The following statement in Oracle: select * from ARTICLES where article_name like '%WHT080%';

can be written in DataStage (for example as a constraint expression): incol.empname matches '...WHT080...'

What is the difference between the Logging Text and the Final Warning Text message in a Terminator stage?

Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.

Informational - a green line, a DSLogInfo() type message.

The Final Warning Text - the red fatal message, which is included in the sequence abort message.


    Error in STPstage - SOURCE Procedures must have an output link The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage. To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

How to invoke an Oracle PL/SQL stored procedure from a server job

To run a PL/SQL procedure from Datastage a Stored Procedure (STP) stage can be used. However it needs a flow of at least one record to run. It can be designed in the following way:

a source ODBC stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual

a Transformer which passes that record through. If required, add the PL/SQL procedure parameters as columns on the right-hand side of the transformer's mapping

a Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

    Design of a DataStage server job with Oracle plsql procedure call

    Is it possible to run a server job in parallel? Yes, even server jobs can be run in parallel. To do that go to 'Job properties' -> General and check the Allow Multiple Instance button. The job can now be run simultaneously from one or many sequence jobs. When it happens datastage will create new entries in Director and new job will be named with automatically generated suffix (for example second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by datastage on the next run.

Error in STPstage - STDPROC property required for stage xxx The error appears in the Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the Procedure call syntax is filled in correctly. To get rid of the error, fill in the 'Procedure name' field.

Datastage routine to open a text file with error catching

Note! work_dir and file1 are parameters passed to the routine.

   * open file1
   OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
      CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
   END ELSE
      CALL DSLogInfo("Unable to open file", "JobControl")
      ABORT
   END

Datastage routine which reads the first line from a text file

Note! work_dir and file1 are parameters passed to the routine.

   * open file1
   OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
      CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
   END ELSE
      CALL DSLogInfo("Unable to open file", "JobControl")
      ABORT
   END
   READSEQ FILE1.RECORD FROM H.FILE1 ELSE
      Call DSLogWarn("******************** File is empty", "JobControl")
   END
   firstline = Trim(FILE1.RECORD[1,32], " ", "A")  ;* will read the first 32 chars
   Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
   CLOSESEQ H.FILE1

How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager. Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...', which will open a new window. Enter test parameters in the left-hand side column and click Run All to see the results. Datastage will remember all the test arguments during future tests.

    When hashed files should be used? What are the benefits or using them? Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value pairs. Hashed files are especially useful if they store information with data dictionaries (customer details, countries, exchange rates). Stored this way it can be spread across the project and accessed from different jobs.

How to construct a container and deconstruct it or switch between local and shared? To construct a container go to Datastage Designer, select the stages that should be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared. Local will only be visible in the current job, and shared can be re-used. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu. Local Datastage containers can be converted at any time to shared containers in Datastage Designer by right-clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.


Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is to map correctly the oracle NUMBER(x,y) format. The best way to do that in Datastage is to convert the oracle NUMBER format to the Datastage Decimal type and to fill in the Length and Scale columns accordingly. There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle Char to datastage Char.

How to adjust the commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in: General -> Transaction size (in version 7.x it's obsolete). Starting from Datastage 7.x it can be set up in the properties of the ODBC or ORACLE stage in Transaction handling -> Rows per transaction. If set to 0 the commit will be issued at the end of a successful transaction.

What is the use of the INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:

@INROWNUM counts incoming rows to a transformer in a datastage job

@OUTROWNUM counts outgoing rows from a transformer in a datastage job

These variables can be used to generate sequences, primary keys, ids, numbering of rows and also for debugging and error tracing. They play a similar role to sequences in Oracle.

Datastage trim function cuts out more characters than expected

By default the datastage trim function works this way: Trim(" a  b  c  d ") will return "a b c d", while in many other programming/scripting languages "a  b  c  d" (with the inner spacing preserved) would be expected. That is because by default an "R" parameter is assumed, which removes leading and trailing occurrences of a character and reduces multiple occurrences to a single occurrence. To keep the inner spacing, use the trim function in the following way: Trim(" a  b  c  d ", " ", "B")

Database update actions in the ORACLE stage

The destination table can be updated using various Update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly, as this determines which columns will appear in the WHERE part of the SQL statement. Available actions:

Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).

Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).

Insert rows without clearing - only adds new rows (INSERT statement).

Delete existing rows only - deletes matched rows (issues only the DELETE statement).

Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).

Update existing rows only - updates existing rows (UPDATE statement).

Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.

Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.


    User-defined SQL - the data is written using a user-defined SQL statement. User-defined SQL file - the data is written using a user-defined SQL statement from a file. Use and examples of ICONV and OCONV functions?

    ICONV and OCONV functions are quite often used to handle data in Datastage.

    ICONV converts a string to an internal storage format and OCONV converts an expression to an

    output format.

    Syntax:

    Iconv (string, conversion code)

    Oconv(expression, conversion )

    Some useful iconv and oconv examples: Iconv("10/14/06", "D2/") = 14167

    Oconv(14167, "D-E") = "14-10-2006"

    Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"

    Oconv(12003005, "MD2$,") = "$120,030.05"

    That expression formats a number and rounds it to 2 decimal places:

    Oconv(L01.TURNOVER_VALUE*100,"MD2")

    Iconv and oconv can be combined in one expression to reformat date format easily:

Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

ERROR 81021 Calling subroutine DSR_RECORD ACTION=2

Datastage system help gives the following error description: SYS.HELP. 081021 MESSAGE.. dsrpc: Error writing to Pipe.

The problem appears when a job sequence is used and it contains many stages (usually more than 10), and very often when a network connection is slow. Basically the cause of the problem is a failure in communication between the DataStage client and the server.

The solution to the issue is:

Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.

If the above does not help, execute the DS.REINDEX ALL command from the Datastage shell.

How to check Datastage internal error descriptions

To check the description of an error number, go to the datastage shell (from the Administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number.

The command will produce a brief error description, which probably will not be helpful in resolving the issue but can be a good starting point for further analysis.

    Error timeout waiting for mutex


The error message usually looks as follows: ... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus several ways to get rid of it. The error usually appears when using the Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file, or if a job is very complex and uses many transformers. There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters - get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it is very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

ERROR 30107 Subroutine failed to complete successfully

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a job is moved from one project to another (for example when deploying from a development environment to production). The solution to the issue is:
Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell.

Datastage Designer hangs when editing job activity properties

The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows. After opening a job sequence and navigating to the job activity properties window the application freezes, and the only way to close it is from the Windows Task Manager. The solution is simple: download and install the XP SP2 patch for the Datastage client. It can be found on the IBM client support site (login required): https://www.ascential.com/eservice/public/welcome.do
Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section. Sometimes users face problems when trying to log in (for example when the license doesn't cover IBM Active Support); then it may be necessary to contact IBM support, which can be reached at [email protected]

    Can Datastage use Excel files as a data input?


Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible approaches:

Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the Datastage server machine and using an ODBC stage in Datastage. The main disadvantage is that this is impossible on a Unix machine. On Datastage servers running on Windows it can be set up here: Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Microsoft Excel Driver (*.xls) -> provide a data source name -> select the workbook -> OK

Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a Sequential File stage in Datastage to read the data.

    Parallel processing

    Datastage jobs are highly scalable due to the implementation of parallel processing. The EE

    architecture is process-based (rather than thread processing), platform independent and uses the

    processing node concept. Datastage EE is able to execute jobs on multiple CPUs (nodes) in

    parallel and is fully scalable, which means that a properly designed job can run across resources

    within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP

    architecture (massively parallel processing).

    Partitioning and Pipelining

    Partitioning means breaking a dataset into smaller sets and distributing them evenly across the

    partitions (nodes). Each partition of data is processed by the same operation and transformed in

    the same way.

The main outcome of using a partitioning mechanism is linear scalability. This means

    for instance that once the data is evenly distributed, a 4 CPU server will process the data four

    times faster than a single CPU machine.

    Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed

    simultaneously, not sequentially. The key concept of ETL Pipeline processing is to start the

    Transformation and Loading tasks while the Extraction phase is still running.

    Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel

    processing. The concept is hidden from a Datastage programmer. The job developer only

    chooses a method of data partitioning and the Datastage EE engine will execute the partitioned

    and parallelized processes.

    Section 1.01 Differences between Datastage Enterprise and Server Edition

1. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, however their capabilities are quite different. In rough outline:


o Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server runtime environment.
o Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no manual intervention is needed to implement these techniques optimally.
o Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating.

2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators - instances of executable C++ classes, pre-built components representing the stages used in Datastage jobs. Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record- and column-level format properties.
4. Datastage EE also brings completely new stages implementing the parallel concept, for example:

o Enterprise database connectors for Oracle, Teradata & DB2
o Development and debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
o Data Set, File Set, Complex Flat File, Lookup File Set ...
o Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...

5. When processing large data volumes, Datastage EE jobs are the right choice; however, when dealing with a smaller data environment, Server jobs might simply be easier to develop, understand and manage. When a company has both Server and Enterprise licenses, both types of jobs can be used.
6. Sequence jobs are the same in the Datastage EE and Server editions.

What is the difference between DS 7.5 and 8.1?

The newer version of DS is 8.x and supports QualityStage, ProfileStage, etc., and it also contains a web-based console.
1. To implement SCDs we have a separate stage (the SCD stage).
2. We don't have the Manager client tool in version 8; it is incorporated into the Designer itself.
3. There is no need to hardcode the parameters for every job - we have an option called Parameter Set. If we create a parameter set, we can call it for the whole project, a job or a sequence.

What happens when a job is compiling?

During compilation of a DataStage parallel job there is very high CPU and memory utilization on the server, and the job may take a very long time to compile.

What is APT_CONFIG in DataStage?

APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.


The APT configuration file is used to store the node information; it contains the disk storage and scratch disk information, and Datastage understands the architecture of the system based on this config file. For parallel processing normally two or more nodes are defined.

In any case, APT_CONFIG_FILE (not just APT_CONFIG) is the environment variable that points to the configuration file defining the nodes (with their disk and scratch/temp areas) for a specific project.
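As a rough sketch, a minimal two-node configuration file could look like the following (the host name and the disk/scratch paths are assumed placeholders, not values from this document):

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/d1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/d2" {pools ""}
    resource scratchdisk "/data/ds/scratch2" {pools ""}
  }
}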

Is it possible to add extra nodes in the configuration file?

What is RCP and how does it work?

Runtime column propagation is used in the case of partial schema usage: when we only know about the columns to be processed and want all other columns to be propagated to the target as they are, we check the Enable Runtime Column Propagation option (in the Administrator, on a stage's Output page Columns tab, or on the Stage page General tab) and only need to specify the schema of the columns we are concerned with.

According to the documentation, Runtime Column Propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define just the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import and Column Export.

Runtime column propagation can be used with the Column Import stage. If RCP is enabled in our project, we can define only the columns we are interested in and DataStage will send the rest of the columns through the various other stages. This ensures such columns reach the target even though they are not explicitly used in the stages in between.
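For reference, a schema file is a plain text file that describes the full record layout; a minimal sketch (hypothetical column names, following the general DataStage/Orchestrate schema format) could look like:

record (
  CUSTOMER_ID: int32;
  CUST_NAME: string[max=50];
  CUST_CITY: string[max=30];
)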

Star schema vs snowflake schema - what is the difference?

Star schema: de-normalized data structure. Snowflake schema: normalized data structure.
Star: a single dimension table per category. Snowflake: each dimension table is split into many pieces.
Star: more data redundancy. Snowflake: less redundancy (normalized dimensions).
Star: no complicated joins needed. Snowflake: complicated joins.
Star: faster query results. Snowflake: some delay in query processing.
Star: dimension tables have no parent tables. Snowflake: dimension tables may have parent tables.
Star: simple database structure. Snowflake: more complicated database structure.

Difference between OLTP and a data warehouse?

    The OLTP database records transactions in real time and aims to automate clerical data entry processes

of a business entity. Addition, modification and deletion of data in the OLTP database is essential, and the semantics of the front-end application impact the organization of the data in the database.

    The data warehouse on the other hand does not cater to real time operational requirements of the

    enterprise. It is more a storehouse of current and historical data and may also contain data extracted

    from external data sources.

    However, the data warehouse supports OLTP system by providing a place for the latter to offload data

    as it accumulates and by providing services which would otherwise degrade the performance of the

    database.

Differences between a data warehouse database and an OLTP database

    Data warehouse database

    Designed for analysis of business measures by categories and attributes

    Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

    Loaded with consistent, valid data; requires no real time validation

    Supports few concurrent users relative to OLTP

    OLTP database

    Designed for real time business operations.

    Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.

    Optimized for validation of incoming data during transactions; uses validation data tables.

    Supports thousands of concurrent users.


What is data modelling?

    The analysis of data objects and their relationships to other data objects. Data modeling is often

    the first step in database design and object-oriented programming as the designers first create a

    conceptual model of how data items relate to each other. Data modeling involves a progression

    from conceptual model to logical model to physical schema.

    Data modelling is the process of identifying entities, the relationship between those entities and

    their attributes. There are a range of tools used to achieve this such as data dictionaries, decision

    trees, decision tables, schematic diagrams and the process of normalisation.

How to retrieve the second highest salary?

select ename, esal from
  (select ename, esal, rownum rn from
    (select ename, esal from hsal order by esal desc))
where rn = 2;

(This assumes distinct salary values: the inner query orders by salary descending and the outer query picks the second row.)
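An alternative sketch using an analytic function on the same table, which also copes with ties in salary:

select ename, esal from
  (select ename, esal, dense_rank() over (order by esal desc) rnk from hsal)
where rnk = 2;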

Difference between egrep and fgrep?

    There is a difference. fgrep can not search for regular expressions in a string. It is used for plain

    string matching.

    egrep can search regular expressions too.

grep covers both: with the -E option it behaves like egrep and with the -F option like fgrep; without either option it uses basic regular expressions (and of course it can still search plain strings).

Hence the most convenient one to use is grep.


    fgrep = "Fixed GREP".

    fgrep searches for fixed strings only. The "f" does not stand for "fast" - in fact, "fgrep foobar *.c"

    is usually slower than "egrep foobar *.c" (Yes, this is kind of surprising. Try it.)

    Fgrep still has its uses though, and may be useful when searching a file for a larger number of

    strings than egrep can handle.

    egrep = "Extended GREP"

    egrep uses fancier regular expressions than grep. Many people use egrep all the time, since it has

    some more sophisticated internal algorithms than grep or fgrep, and is usually the fastest of the

three programs.
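Quick usage sketch (the file names are placeholders):

grep -E 'error|warning' job_log.txt     (extended regular expression, egrep-style alternation)
grep -F 'a.b(c)' params.txt             (fixed string - the dot and parentheses are taken literally)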

    CHMOD command?

    Permissions

    u - User who owns the file.

    g - Group that owns the file.

    o - Other.

    a - All.

    r - Read the file.

    w - Write or edit the file.

    x - Execute or run the file as a program.

    Numeric Permissions:

CHMOD can also be used with numeric permissions:

    400 read by owner

    040 read by group

    004 read by anybody (other)

    200 write by owner

    020 write by group

    002 write by anybody

    100 execute by owner

    010 execute by group

    001 execute by anybody
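For example, chmod 754 loadjob.sh (a placeholder script name) combines the values above: 700 (read/write/execute for the owner) + 50 (read/execute for the group) + 4 (read for others). The equivalent symbolic form is chmod u=rwx,g=rx,o=r loadjob.sh.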

What is the difference between DS 7.5 and 8.1?

The main difference is that the DS Manager client is combined with the Designer in 8.1, and the following are new in 8.1: the SCD (type 2) stage, data connection objects, parameter sets (PS), QualityStage (QS) and range lookup.

    Difference between internal sort and external sort?


Performance-wise an internal sort is best because it doesn't use any extra buffer, whereas an external sort takes buffer memory to store the records.

How do you pass only a required number of records through the partitions?

Go to Job Properties -> Execution tab -> enable tracing (trace compile) and give the required number of records.

What happens when a job is compiling?

1. All processing stages generate OSH code.
2. Transformers generate C++ code in the background.
3. The job information is updated in the metadata repository.
4. The job is compiled.

What is APT_CONFIG in DS?

It is the environment variable pointing to the configuration file which defines the parallelism (nodes) for our jobs.

How many types of parallelism and partitions are there?

There are two types of parallelism - pipeline and partition parallelism - and the underlying systems can be SMP or MPP.

Is it possible to add extra nodes in the configuration file?

Yes, it is possible. Open the APT configuration file via the configuration management tool and edit it for the required number of nodes.

What is RCP and how does it work?

Runtime column propagation is used to propagate the columns which are not defined in the metadata.

How does data move from one stage to another?

In the form of virtual data sets.

Is it possible to run multiple instances of a single job?

Yes - go to Job Properties, which has an 'Allow Multiple Instance' option.

    What is APT_DUMP_SCORE?

APT_DUMP_SCORE shows the operators, data sets, nodes, partitions, combinations and processes used in a job. It is an environment variable, set through the Administrator.


Pipeline parallelism: each stage works on a separate processor.

What is the difference between Job Control and a Job Sequence?

Job control is used specifically to control jobs - through it we can pass parameters and conditions, capture log file information, dashboard information, handle load recovery, etc. A job sequence is used to run a group of jobs based on certain conditions; for final/incremental processing we keep all the jobs in one sequence and run them together by defining triggers.
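A minimal job control sketch in DataStage BASIC (the job name and parameter values are assumed placeholders, not taken from this document):

hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)      ;* attach the controlled job
ErrCode = DSSetParam(hJob, "LoadDate", "2006-10-14")     ;* pass a parameter
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)                  ;* start the job
ErrCode = DSWaitForJob(hJob)                             ;* wait until it finishes
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)               ;* read the finishing status
ErrCode = DSDetachJob(hJob)                              ;* release the job handle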


What is the max size of a Data Set stage (PX)? There is no fixed limit.

Performance in the Sort stage

If the source is an Oracle database, you can write a user-defined query to sort and remove duplicates in the source itself, and by using suitable key partitioning techniques you can improve the performance. If that is not the case, go for key partitioning in the Sort stage, keeping the same partitioning as in the previous stage. Do not allow duplicates - remove duplicates and use a unique partitioning key.

How to develop an SCD using the LOOKUP stage?

We can implement an SCD by using the LOOKUP stage, but only SCD type 1, not type 2. We take the source (file or database) and a data set as the reference link (for the lookup), then the LOOKUP stage; there we compare the source with the data set and set the lookup failure condition to Continue. After that, in a transformer we apply the conditions, and then we take two targets, one for inserts and one for updates, where we manually write the SQL INSERT and UPDATE statements. If you see the design you can easily understand it.
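A rough sketch of those manually written statements (hypothetical dimension table and columns; the ORCHESTRATE.column notation is commonly used for referencing job columns in user-defined SQL, but check the exact syntax expected by the particular target stage):

INSERT INTO DIM_CUSTOMER (CUSTOMER_ID, CUST_NAME, CUST_CITY)
VALUES (ORCHESTRATE.CUSTOMER_ID, ORCHESTRATE.CUST_NAME, ORCHESTRATE.CUST_CITY)

UPDATE DIM_CUSTOMER
SET CUST_NAME = ORCHESTRATE.CUST_NAME, CUST_CITY = ORCHESTRATE.CUST_CITY
WHERE CUSTOMER_ID = ORCHESTRATE.CUSTOMER_ID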

What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) and the standard Ascential DataStage 7.5 version?

IBM Information Server, also known as DS 8, has more features like QualityStage and MetaStage. It maintains its repository in DB2, unlike the file-based repository in 7.5. It also has a stage specifically for SCD types 1 and 2.

I think there is no version called standard Ascential DataStage 7.5; I only know the advanced edition of Datastage, i.e. WebSphere DataStage and QualityStage, released by IBM with version 8.0.1. In this release there are only 3 client tools (Administrator, Designer, Director) - the Manager has been removed and its import/export functions are included in the Designer itself - and some extra stages have been added, like the SCD stage, by which we can implement SCD type 1 and type 2 directly, along with other advanced stages.

They have also included QualityStage, which is used for data validation - very important for a data warehouse. There are so many things available in QualityStage that we can think of it as a separate tool for DWH work.

What are the errors you have experienced with DataStage?

In DataStage, warnings and fatal errors appear in the job log file.


If there is a fatal error, the job aborts; if there are only warnings, the job does not abort, but we still have to handle those warnings - the log file should ultimately be clear of warnings as well.

Many different errors appear in different jobs, for example:

Parameter not found in job load recover.

A child job failed because of some .....

A control job failed because of some .....

....etc

What are the main differences between server jobs and parallel jobs in DataStage?

a) In server jobs we have few stages; they are mainly logic intensive, we use the transformer for most things, and they do not use MPP systems. In parallel jobs we have lots of stages; they are stage intensive, with built-in stages for particular tasks, and they can use MPP systems.

b) In server jobs we don't have an option to process the data on multiple nodes as in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas we don't have any such concept in server jobs. There are also a lot of differences in using the same stages in server and parallel jobs. For example, in parallel jobs a sequential file (or any other file stage) can have either an input link or an output link, but in server jobs it can have both (and more than one of each).

c) Server jobs compile and run within the DataStage server, but parallel jobs compile and run within the DataStage Unix server. A server job extracts all the rows from the source into a stage before that stage is activated and the rows are passed on to the target or data warehouse, which is time consuming. In parallel jobs there are two kinds of parallelism:
1. Pipeline parallelism - while rows are still being extracted from the source, the next stage is already active and passing rows on towards the target/DWH; it maintains only one node between source and target.


2. Partition parallelism - maintains more than one node between source and target.

Why do you need the Modify stage?

When you are able to handle null handling and data type changes in ODBC stages, why do you need the Modify stage?

It is used to change data types: if the source contains a varchar and the target expects an integer, we use the Modify stage to convert it according to the requirement, and we can also change column lengths. The Modify stage is used for the purpose of data type changes.
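A minimal sketch of Modify stage specifications (hypothetical column names; only the drop/keep and rename forms are shown, and the exact specification grammar, including the type-conversion functions, should be checked against the Modify stage documentation):

DROP TEMP_COL
KEEP CUSTOMER_ID
CUSTOMER_KEY = CUST_ID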

What is the difference between the Sequential File stage and the Data Set stage? When do you use them?

a) The Sequential File stage is used for sequential (flat) file formats, while a data set holds data in DataStage's own internal format.

b) Parallel jobs use data sets to manage data within a job - you can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director. In a data set the data is stored in an internal (not human-readable) format, i.e. we can view the data through the View Data facility in DataStage but it cannot be viewed directly from Linux or the back-end system, whereas sequential file data can be viewed anywhere. Extraction of data from a data set is much faster than from a sequential file.

How can we improve the performance of a job while handling huge amounts of data?

a) Minimize the number of transformer stages. If the reference table holds a huge amount of data, use a Join stage; if the reference table holds a small amount of data, use a Lookup.

b) This requires job-level or server-level tuning.
Job-level tuning:
- use Join for huge amounts of data rather than Lookup;
- use the Modify stage rather than a transformer for simple transformations;
- sort the data before a Remove Duplicates stage.
Server-level tuning can only be done with adequate knowledge of the server-level parameters which can improve execution performance.


How can we create read-only jobs in DataStage?

By creating a protected project. In a protected project all jobs are read-only and cannot be modified.

b) A job can also be made read-only by the following process: export the job in .dsx format and change the attribute which stores the read-only information from 0 (0 refers to an editable job) to 1 (1 refers to a read-only job), then import the job again and overwrite or rename the existing job so that you have both forms.

There are 3 kinds of routines in Datastage:
1. Server routines, used in server jobs - these routines are written in the BASIC language.
2. Parallel routines, used in parallel jobs - these routines are written in C/C++.
3. Mainframe routines, used in mainframe jobs.

DataStage parallel routines made really easy:
http://blogs.ittoolbox.com/dw/soa/archives/datastage-parallel-routines-made-really-easy-20926

    How will you determine the sequence of jobs to load into data warehouse?

    First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the

    Aggregator tables (if any).

The sequence of the jobs can also be determined by the parent-child relationships in the target tables to be loaded: parent tables always need to be loaded before child tables.

Error while connecting to DS Administrator?

Go to Settings -> Control Panel -> User Accounts and create a new user with a password. Restart your computer and log in with the new user name. Try using the new user name in DataStage and you should be able to connect.


DataStage - delete header and footer on the source sequential file

How do you delete the header and footer on a source sequential file, and how do you create a header and footer on a target sequential file using DataStage?

In the Designer palette, under Development/Debug, we can find the Head and Tail stages; these can be used to handle the header and footer rows.

How can we implement Slowly Changing Dimensions in DataStage?

a) We can implement SCDs in Datastage as follows:
1. Type 1 SCD: insert else update in the ODBC stage.
2. Type 2 SCD: insert a new row when the primary key already exists, and update the old row with the effective-from date set to the job run date and the effective-to date set to some max date.
3. Type 3 SCD: insert the old value into a separate column and update the existing column with the new value.

b) By using the Lookup stage and the Change Capture stage we can implement SCDs. We have 3 types of SCDs:
Type 1: maintains the current values only.
Type 2: maintains both current and historical values.
Type 3: maintains current and partial historical values.
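As an illustration of the type 2 mechanics (hypothetical dimension table, columns and values, plain SQL, not DataStage-generated text): when a tracked attribute changes, the current row is expired and a new row is inserted, roughly like:

UPDATE DIM_CUSTOMER
SET EFFECTIVE_TO = CURRENT_DATE, CURRENT_FLAG = 'N'
WHERE CUSTOMER_ID = 1001 AND CURRENT_FLAG = 'Y';

INSERT INTO DIM_CUSTOMER (CUSTOMER_KEY, CUSTOMER_ID, CUST_CITY, EFFECTIVE_FROM, EFFECTIVE_TO, CURRENT_FLAG)
VALUES (50234, 1001, 'Pune', CURRENT_DATE, DATE '9999-12-31', 'Y');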

    Differentiate Database data and Data warehouse data?

    Data in a Database is

    Detailed or Transactional

    Both Readable and Writable.

    Current.

    b)By Database, one means OLTP (On Line Transaction Processing). This can be the source

    systems or the ODS (Operational Data Store), which contains the transactional data.

    c)Database data is in the form of OLTP and Data warehouse data will be in the form of OLAP.

    OLTP is for transactional process and OLAP is for Analysis purpose.

d) A data warehouse:
- contains current and historical data
- holds highly summarized data
- follows denormalization


- uses a dimensional model
- is non-volatile


What is the difference between DataStage and Informatica?

a) The main difference between DataStage and Informatica is scalability - Informatica is more scalable than DataStage.

b) In my view DataStage is also scalable; the difference lies in the number of built-in functions, which makes DataStage more user friendly.

c) In my view, DataStage has fewer transformers/functions compared to Informatica, which can make some tasks more difficult.

d) The main difference is the vendors. Each one has its plus points from its architecture. For DataStage it is a top-down approach. Based on the Business