Datastage Interview Questions



What is the flow of loading data into fact & dimensional tables?

    Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional

    table. Consists of fields with numeric values.

    Dimension table - Table with Unique Primary Key.

    Load - Data should be first loaded into dimensional table. Based on the primary key values in

    dimensional table, the data should be loaded into Fact table.

What is the default cache size? How do you change the cache size if needed?

Default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

    What does a Config File in parallel extender consist of?

    Config file consists of the following.

    a) Number of Processes or Nodes.

    b) Actual Disk Storage Location

What is Modulus and Splitting in Dynamic Hashed File?

In a dynamic hashed file the size of the file changes as data is added or removed; the modulus is the number of groups in the file.

When the file grows, a group is split and the modulus increases - this is called "Splitting".

When the file shrinks, groups are merged and the modulus decreases - this is called "Merging".

What are Stage Variables, Derivations and Constraints?

Stage Variable - An intermediate processing variable that retains its value from row to row and does not pass the value into a target column.

Derivation - An expression that specifies the value to be passed on to the target column.

Constraint - A condition that evaluates to true or false and controls the flow of data within a link.
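A minimal Transformer sketch of how the three fit together (the link and column names In.Amount, In.Region and the names svRunningTotal, TotalAmount are illustrative only, not from any particular job):

    Stage variable svRunningTotal (initial value 0), derivation:  svRunningTotal + In.Amount
    Derivation of target column TotalAmount:                      svRunningTotal
    Constraint on the output link (pass only one region):         In.Region = "EMEA"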

Types of views in Datastage Director?

There are 3 types of views in Datastage Director:

a) Job View - Dates of Jobs Compiled.

b) Status View - Status of the Job's last run.

c) Log View - Warning Messages, Event Messages, Program Generated Messages.


    Types of Parallel Processing?

    A) Parallel Processing is broadly classified into 2 types.

    a) SMP - Symmetrical Multi Processing.

    b) MPP - Massive Parallel Processing.

Orchestrate Vs Datastage Parallel Extender?

Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. Datastage used Orchestrate with Datastage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then acquired Orchestrate and integrated it with Datastage XE, releasing a new version, Datastage 6.0, i.e. Parallel Extender.

Importance of Surrogate Key in Data warehousing?

A Surrogate Key is a Primary Key for a Dimension table. Its main importance is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by changes going on in the source database.

How to run a Shell Script within the scope of a Datastage job?

By using the "ExecSH" command at Before/After job properties.

    How do you execute datastage job from command line prompt?

    Using "dsjob" command as follows.

    dsjob -run -jobstatus projectname jobname

    Functionality of Link Partitioner and Link Collector?

    Link Partitioner: It actually splits data into various partitions or data flows using various partition

    methods.

Link Collector: It collects the data coming from partitions, merges it into a single data flow and loads it to the target.

Types of Dimensional Modeling?

Dimensional modeling is sub divided into the following types:

a) Star Schema - Simple & much faster. Denormalized form.

b) Snowflake Schema - Complex with more granularity. More normalized form.

c) Galaxy Schema - Complex multi-star schema.


Differentiate Primary Key and Partition Key? Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. Partition Key is just a part of the Primary Key. There are several methods of partitioning like Hash, DB2, Random etc. While using Hash partitioning we specify the Partition Key.

Differentiate Database data and Data warehouse data?

Database (operational) data is: a) Detailed or transactional. b) Both readable and writable. c) Current.

Data warehouse data, by contrast, is typically summarized, read-only and historical.

    Containers Usage and Types?

    Container is a collection of stages used for the purpose of Reusability.

    There are 2 types of Containers.

    a) Local Container: Job Specific

    b) Shared Container: Used in any job within a project.

    Compare and Contrast ODBC and Plug-In stages? ODBC: a) Poor Performance.

    b) Can be used for Variety of Databases.

    c) Can handle Stored Procedures.

    Plug-In: a) Good Performance.

    b) Database specific. (Only one database)

    c) Cannot handle Stored Procedures.

    Dimension Modelling types along with their significance

    Data Modelling is Broadly classified into 2 types.

a) E-R Diagrams (Entity - Relationships).

    b) Dimensional Modelling.

What are Ascential DataStage Products, Connectivity

    Ascential Products

    Ascential DataStage

    Ascential DataStage EE (3)

    Ascential DataStage EE MVS

    Ascential DataStage TX


    Ascential QualityStage

    Ascential MetaStage

    Ascential RTI (2)

    Ascential ProfileStage

    Ascential AuditStage

    Ascential Commerce Manager

    Industry Solutions

    Connectivity

    Files

    RDBMS

    Real-time

    PACKs

    EDI

    Other

Explain Data Stage Architecture? Data Stage contains two sets of components: Client Components and Server Components.

Client Components: Data Stage Administrator, Data Stage Manager, Data Stage Designer, Data Stage Director.

Server Components: Data Stage Engine, Meta Data Repository, Package Installer.

Data Stage Administrator (roles and responsibilities):

Used to create the project. Contains the set of project properties. We can set the buffer size (by default 128 MB) and increase it. We can set the Environment Variables. Under Tunables we have in-process and inter-process row buffering: in-process reads the data sequentially, inter-process reads the data as it comes. It just interfaces to the metadata.


Data Stage Manager:

We can view and edit the Meta Data Repository. We can import table definitions. We can export the Data Stage components in .xml or .dsx format. We can create routines and transforms. We can compile multiple jobs.

Data Stage Designer:

We can create the jobs. We can compile the job. We can run the job. We can declare stage variables in a transform; we can call routines, transforms, macros and functions. We can write constraints.

Data Stage Director:

We can run the jobs. We can schedule the jobs (scheduling can be done daily, weekly, monthly or quarterly). We can monitor the jobs. We can release the jobs.

What is Meta Data Repository? Metadata is data about the data.

The repository also contains:

Query statistics, ETL statistics, business subject areas, source information, target information, and

source-to-target mapping information.

What is Data Stage Engine? It is the DataStage server engine (based on the UniVerse engine) running in the background; it executes server jobs.

    What is Dimensional Modeling? Dimensional Modeling is a logical design technique that seeks to present the data in a standard

    framework that is, intuitive and allows for high performance access.

    What is Star Schema? Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded

    by dimensions table.

    Dimension Table: It contains a primary key and description about the fact table.

    Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

What is surrogate Key? It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table.

We can store up to 2 billion records.


Why do we need a surrogate key? It is used for integrating data from different sources and serves better than the business key for

index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

What is Snowflake schema? It is a partially normalized dimensional model in which at least one dimension is represented by two or

more hierarchically related tables.

    Explain Types of Fact Tables?

    Factless Fact: It contains only foreign keys to the dimension tables.

    Additive Fact: Measures can be added across any dimensions.

    Semi-Additive: Measures can be added across some dimensions. Eg, % age, discount

Non-Additive: Measures cannot be added across any dimensions. Eg, Average.

Conformed Fact: The definition of the measure is the same in the two fact tables, so the facts can be

compared across dimensions using the same set of measures.

    Explain the Types of Dimension Tables?

Conformed Dimension: A dimension table that is connected to more than one fact table; the

granularity defined in the dimension table is common across those fact tables.

Junk Dimension: A dimension table which contains only flags.

Monster Dimension: A dimension that changes rapidly is known as a Monster Dimension.

Degenerate Dimension: A dimension key (such as a line-item number) kept in the fact table itself, as in a line item-oriented fact table design.

What are stage variables?

Stage variables are declared in the Transformer Stage and used to store values. Stage variables are

active at run time (memory is allocated at run time).
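A common use is detecting a change in a key value between consecutive rows. A minimal sketch, assuming an input link named In with a column CustId (names are illustrative only):

    Stage variable svIsNewKey, derivation:  If In.CustId <> svPrevKey Then 1 Else 0
    Stage variable svPrevKey, derivation:   In.CustId

svIsNewKey must be listed above svPrevKey, so that when it is evaluated svPrevKey still holds the previous row's key.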

    What is sequencer? It sets the sequence of execution of server jobs.

What are Active and Passive stages?

Active Stage: Active stages model the flow of data and provide mechanisms for combining data

streams, aggregating data and converting data from one data type to another. Eg, Transformer,

Aggregator, Sort, Row Merger etc.

Passive Stage: A Passive stage handles access to databases for the extraction or writing of data.

Eg, IPC stage, file types, Universe, Unidata, DRS stage etc.

    What is ODS?

    Operational Data Store is a staging area where data can be rolled back.

    What are Macros?

    They are built from Data Stage functions and do not require arguments.

    A number of macros are provided in the JOBCONTROL.H file to facilitate getting information

    about the current job, and links and stages belonging to the current job. These can be used in


    expressions (for example for use in Transformer stages), job control routines, filenames and table

    names, and before/after subroutines.

    DSHostName

    DSJobStatus

    DSProjectName

    DSJobName

    DSJobController

    DSJobStartDate

    DSJobStartTime

    DSJobStartTimestamp

    DSJobWaveNo

    DSJobInvocations

    DSJobInvocationId

    DSStageLastErr

    DSStageType

    DSStageInRowNum

    DSStageVarList

    DSLinkLastErr

    DSLinkName

    DSStageName

    DSLinkRowCount
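As a hedged illustration, the body of a before/after job subroutine (which receives the arguments InputArg and ErrorCode) could log some of these macros; the routine name LogJobContext is made up for this sketch:

    * Body of a before/after job subroutine called LogJobContext
    ErrorCode = 0   ;* set to a non-zero value to abort the job
    Msg = "Job " : DSJobName : " started " : DSJobStartTimestamp : " on " : DSHostName
    Msg = Msg : " (invocation " : DSJobInvocationId : ")"
    Call DSLogInfo(Msg, "LogJobContext")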

What is KeyMgtGetNextValue?

It is a built-in transform that generates sequential numbers. Its input type is literal string and its output

type is string.

    What index is created on Data Warehouse?

    Bitmap index is created in Data Warehouse.

    What is container?

    A container is a group of stages and links. Containers enable you to simplify and modularize

    your server job designs by replacing complex areas of the diagram with a single container stage.

    You can also use shared containers as a way of incorporating server job functionality into

    parallel jobs.

    DataStage provides two types of container:

Local containers. These are created within a job and are only accessible by that job. A

local container is edited in a tabbed page of the job's Diagram window.

Shared containers. These are created separately and are stored in the Repository in the

same way that jobs are. There are two types of shared container: server shared containers (usable in server jobs) and parallel shared containers (usable in parallel jobs).

    What is function? ( Job Control Examples of Transform Functions ) Functions take arguments and return a value.

BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or

    more. Arguments are always in parentheses, separated by commas, as shown in this general

    syntax:

    FunctionName (argument, argument)
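A few hedged examples of calling BASIC functions with this syntax (the variable names are illustrative):

    MyString = "  DataStage  "
    Clean = Trim(MyString)                  ;* "DataStage"
    HowLong = Len(Clean)                    ;* 9
    Today = Oconv(Date(), "D-YMD[4,2,2]")   ;* internal date formatted as e.g. "2006-10-14"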


DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current

job; these are useful in active stage expressions and before- and after-stage subroutines.

To do this ... Use this function ...

Specify the job you want to control - DSAttachJob
Set parameters for the job you want to control - DSSetParam
Set limits for the job you want to control - DSSetJobLimit
Request that a job is run - DSRunJob
Wait for a called job to finish - DSWaitForJob
Get the meta data details for the specified link - DSGetLinkMetaData
Get information about the current project - DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage - DSGetIPCStageProps
Get information about the controlled job or current job - DSGetJobInfo
Get information about the meta bag properties associated with the named job - DSGetJobMetaBag
Get information about a stage in the controlled job or current job - DSGetStageInfo
Get the names of the links attached to the specified stage - DSGetStageLinks
Get a list of stages of a particular type in a job - DSGetStagesOfType
Get information about the types of stage in a job - DSGetStageTypes
Get information about a link in a controlled job or current job - DSGetLinkInfo
Get information about a controlled job's parameters - DSGetParamInfo
Get the log event from the job log - DSGetLogEntry
Get a number of log events on the specified subject from the job log - DSGetLogSummary
Get the newest log event, of a specified type, from the job log - DSGetNewestLogId
Log an event to the job log of a different job - DSLogEvent
Stop a controlled job - DSStopJob
Return a job handle previously obtained from DSAttachJob - DSDetachJob
Log a fatal error message in a job's log file and abort the job - DSLogFatal
Log an information message in a job's log file - DSLogInfo
Put an info message in the job log of a job controlling the current job - DSLogToController
Log a warning message in a job's log file - DSLogWarn
Generate a string describing the complete status of a valid attached job - DSMakeJobReport
Insert arguments into the message template - DSMakeMsg
Ensure a job is in the correct state to be run or validated - DSPrepareJob
Interface to the system send mail facility - DSSendMail
Log a warning message to a job log file - DSTransformError
Convert a job control status or error code into an explanatory text message - DSTranslateCode
Suspend a job until a named file either exists or does not exist - DSWaitForFile
Check if a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space - DSCheckRoutine
Execute a DOS or Data Stage Engine command from a before/after subroutine - DSExecute
Set a status message for a job to return as a termination message when it finishes - DSSetUserStatus
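As a hedged sketch (not from the original document), a small job control routine could combine several of these functions; the job and parameter names (LoadCustomerDim, RunDate) are invented for illustration:

    * Attach the job, set a parameter and a warning limit, run it and wait for it
    hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob, "RunDate", Oconv(Date(), "D-YMD[4,2,2]"))
    ErrCode = DSSetJobLimit(hJob, DSJ.LIMITWARN, 50)
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)

    * Check how the job finished and log a warning if it failed
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
       Call DSLogWarn("LoadCustomerDim did not finish cleanly", "JobControl")
    End
    ErrCode = DSDetachJob(hJob)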

What are Routines?

Routines are stored in the Routines branch of the Data Stage Repository, where you can create,

view or edit them. The following programming components are classified as routines:

    Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE)

    functions, Web Service routines

What is Hash file stage and what is it used for?


Used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.

What are the types of Hashed File?

Hashed Files are classified broadly into 2 types: A) Static - sub divided into 17 types based on the Primary Key pattern. B) Dynamic - sub divided into 2 types: i) Generic ii) Specific. The default hashed file is "Dynamic - Type Random 30 D".

What are Static Hash files and Dynamic Hash files?

As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.

How did you handle reject data?

Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. So a Reject link has to be defined for every output link on which you wish to collect rejected data. Rejected data is typically bad data like duplicates of primary keys or null rows where data is expected.

What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?

Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance and also for data recovery in case the job aborts.

Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

Tuned the 'Project Tunables' in Administrator for better performance.

Used sorted data for the Aggregator. Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.

Removed the data not used from the source as early as possible in the job.

Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.

Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.

If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.

Before writing a routine or a transform, make sure that the functionality required is not already available in one of the standard routines supplied in the sdk or ds utilities categories.

Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records from even getting in before joins are made.

Tuning should occur on a job-by-job basis.

Use the power of the DBMS. Try not to use a Sort stage when you can use an ORDER BY clause in the database. Using a constraint to filter a record set is much slower than performing a SELECT with a WHERE clause.


    Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Tell me one situation from your last project where you faced a problem, and how did you solve it?

1. Jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.

2. A job aborted in the middle of loading some 500,000 rows. You have the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it aborted. To make sure the load was proper we opted for the former.

Tell me the environment in your last projects.

Give the OS of the Server and the OS of the Client of your most recent project.

How did you connect to DB2 in your last project?

Most of the time the data was sent to us in the form of flat files; the data is dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or) DB2-UDB depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.

    What are Routines and where/how are they written and have you written any routines before? Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of Routines:

    1. Transform Functions 2. Before-After Job subroutines 3. Job Control Routines

    How did you handle an 'Aborted' sequencer? In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.

Read the String functions in DS.

Functions like [] -> the substring function and ':' -> the concatenation operator.

Syntax: string [ [ start, ] length ] or string [ delimiter, instance, repeats ]

What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?

Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.


Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
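A job control routine can also wait for the file directly with DSWaitForFile (listed in the function table earlier). A hedged sketch, with an invented path and timeout:

    * Wait up to 2 hours for the trigger file, then either continue or log a warning
    Result = DSWaitForFile("/data/in/customers.dat timeout:2H")
    If Result = DSJE.NOERROR Then
       Call DSLogInfo("File arrived - the load job can now be attached and run", "JobControl")
    End Else
       Call DSLogWarn("File did not arrive within 2 hours", "JobControl")
    End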

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

Use the crontab utility along with the dsjob command, with the proper parameters passed.

Did you work in a UNIX environment?

Yes. It is one of the most important requirements.

How would you call an external Java function which is not supported by DataStage?

Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

    How will you determine the sequence of jobs to load into data warehouse? First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any).

The above might raise another question: why do we have to load the dimensional tables first and then the fact tables? As we load the dimensional tables the (primary) keys are generated, and these keys are foreign keys in the fact tables.

Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic?

There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have alter table permissions whereas Delete doesn't).

    How do you rename all of the jobs to support your new File-naming conventions? Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program, which can do a simple rename of the strings looking up the Excel file. Then import the new dsx file probably into a new project for testing. Recompile all jobs. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs. So you have to make the necessary changes to these Sequencers.

When should we use ODS?

DWHs are typically read-only and batch updated on a schedule; ODSs are maintained in more real time and trickle fed constantly.

    What other ETL's you have worked with?


Informatica, and also DataJunction if it is present in your resume.

How good are you with your PL/SQL?

On a scale of 1-10, say 8.5-9.

What versions of DS have you worked with?

DS 7.5, DS 7.0.2, DS 6.0, DS 5.2.

What's the difference between a Datastage developer and a Datastage designer?

A Datastage developer is one who codes the jobs. A Datastage designer is one who designs the job; I mean he deals with blueprints and designs the jobs with the stages that are required in developing the code.

    What are the command line functions that import and export the DS jobs?

    dsimport.exe - imports the DataStage components. dsexport.exe - exports the DataStage components.

How to handle date conversions in Datastage? Convert mm/dd/yyyy format to yyyy-dd-mm?

We use a) the "Iconv" function - internal conversion, and b) the "Oconv" function - external conversion.

The function to convert the mm/dd/yyyy format is Oconv(Iconv(Fieldname,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")


Did you parameterize the job or hard-code the values in the jobs?

Always parameterize the job. Either the values come from Job Properties or from a Parameter Manager, a third-party tool. There is no way you will hardcode some parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked up.

What are the main differences between Ascential DataStage and Informatica PowerCenter?

Chuck Kelley's answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski's answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?

What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?

Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

How well does each tool handle complex transformations, and how much external scripting is required?

What kinds of languages are supported for ETL script extensions?

Almost any ETL tool will look like any other on the surface. The trick is to find out which one

will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product - especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source

systems, data volumes and transformation complexity.


Ask both vendors for a list of their customers with characteristics similar to your own that have

used their ETL product for at least a year. Then interview each client (preferably several people

at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the

tool that have been encountered by that customer. Ultimately, ask each customer, if they had it all to do over again, whether or not they'd choose the same tool and why. You might be surprised at some of the answers.

Joyce Bischoff's answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the

detailed requirements. There are numerous ETL products on the market and it seems that you are

looking at only two of them. If you are unfamiliar with the many products available, you may

refer to www.tdan.com, the Data Administration Newsletter, for product lists.

If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's. Ask both vendors and compare the answers, which may or may

not be totally accurate. After you are very familiar with the products, call their references and be

sure to talk with technical people who are actually using the product. You will not want the

vendor to have a representative present when you speak with someone at the reference site. It is

also not a good idea to depend upon a high-level manager at the reference site for a reliable

opinion of the product. Managers may paint a very rosy picture of any selected product so that

they do not look like they selected an inferior product.

In how many places can you call Routines?

You can call routines in four places:

1. Transform of a routine: a. Date Transformation b. Upstring Transformation

2. Transform of the Before & After Subroutines

3. XML transformation

4. Web-based transformation

What is the Batch Program and how can you generate it?

A batch program is generated at run time and maintained by Datastage itself, but you can easily change it on the basis of your requirements (Extraction, Transformation, Loading). Batch programs are generated depending on your job's nature, either a simple job or a sequencer job; you can see this program under the job control option.

Suppose that 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after running the job only 5,000 rows have been loaded into the target table, the rest are not loaded and the job is aborted. How can you sort out the problem?

Suppose the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem. In this condition you should go to the Director and check what type of problem is showing: a data type problem, a warning message, job failed or job aborted. If the job failed it means a data type problem or a missing column action. So you should go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action -> select this option; here there are two options:

(i) On Fail -- Commit, Continue (ii) On Skip -- Commit, Continue.


First check how much data has already been loaded, then select the On Skip option and continue; for the remaining data that was not loaded, select On Fail, Continue. Run the job again and you should get a success message.

What happens if RCP is disabled?

In such a case OSH has to perform the import and export every time the job runs, and the processing time of the job also increases.


    What are Sequencers?

    Sequencers are job control programs that execute other jobs with preset Job parameters.


What is the difference between the Filter stage and the Switch stage? Ans: There are two main differences, and probably some minor ones as well. The two main differences are as follows.

    1) The Filter stage can send one input row to more than one output link. The Switch stage can not - the C switch construct has an implicit break in every case. 2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)

How can I achieve constraint-based loading using DataStage 7.5? My target tables have inter

    dependencies i.e. Primary key foreign key constraints. I want my primary key tables to be loaded

    first and then my foreign key tables and also primary key tables should be committed before the

    foreign key tables are executed. How can I go about it?

Ans: 1) Create a Job Sequencer to load your tables in sequential mode.

In the sequencer, call all the Primary Key table loading jobs first, followed by the Foreign Key

tables; when triggering the Foreign Key table load jobs, trigger them only when the Primary Key load

jobs have run successfully (i.e. an OK trigger).

2) To improve the performance of the job, you can disable all the constraints on the tables and

load them. Once loading is done, check the integrity of the data; whatever does not meet it, raise it as

exception data and cleanse it.

This is only a suggestion; normally when loading with constraints enabled, performance

goes down drastically.


    3) If you use Star schema modeling, when you create physical DB from the model, you can

    delete all constraints and the referential integrity would be maintained in the ETL process by

    referring all your dimension keys while loading fact tables. Once all dimensional keys are

    assigned to a fact then dimension and fact can be loaded together. At the same time RI is being

maintained at the ETL process level.

How do you merge two files in DS?

Ans: Either use the Copy command as a Before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

    How do you eliminate duplicate rows?

    Ans: Data Stage provides us with a stage Remove Duplicates in Enterprise edition. Using that

    stage we can eliminate the duplicates based on a key column.

    How do you pass filename as the parameter for a job?

Ans: While developing the job we can create a parameter 'FILE_NAME' and the value can be

passed while running the job.

Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?

Ans: Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement - you can only export full projects from the command line. You can find the export and import executables on the client machine, usually someplace like C:\Program Files\Ascential\DataStage.

Diff. between JOIN stage and MERGE stage.

JOIN: Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.

MERGE: Combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.

Advantages of DataStage?

Business advantages: Helps in making better business decisions; it is able to integrate data coming from all parts of the company; it helps to understand new and already existing clients; we can collect data of different clients with it and compare them; it makes the research of new business possibilities possible; we can analyze trends in the data read by it.

Technological advantages: It handles all company data and adapts to the needs;


It offers the possibility of organizing complex business intelligence; it is flexible and scalable; it accelerates the running of the project; it is easily implementable.

What is the architecture of Data Stage?

Basically the architecture of DS is a client/server architecture, with client components and server components.

Client components are of 4 types: 1. Data Stage Designer 2. Data Stage Administrator 3. Data Stage Director 4. Data Stage Manager.

Data Stage Designer is used to design the jobs. Data Stage Manager is used to import & export the project and to view & edit the contents of the repository. Data Stage Administrator is used for creating the project, deleting the project & setting the environment variables. Data Stage Director is used to run the jobs, validate the jobs and schedule the jobs.

Server components: DS server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a DWH. DS Package Installer: a user interface used to install packaged DS jobs and plug-ins. Repository or project: a central store that contains all the information required to build a DWH or data mart.

What are the stages you worked on?

I have some jobs that automatically delete the log details every month. What steps do you have to take for that?

We have to set the auto-purge option in DS Administrator.

I want to run multiple jobs in a single job. How can you handle that?

In job properties set the option ALLOW MULTIPLE INSTANCES.


What is version controlling in DS?

In DS, version controlling is used for backing up the project or jobs. This option is available from DS version 7.1 onwards. Version controls are of 2 types:

1. VSS - Visual Source Safe 2. CVSS - Concurrent Visual Source Safe.

VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users have to wait until the first user completes the operation. With CVSS many users can access it concurrently. Compared to VSS, the cost of CVSS is high.

What is the difference between clearing the log file and clearing the status file?

Clear log - we can clear the log details by using the DS Director. Under the Job menu the clear log option is available. By using this option we can clear the log details of a particular job.

Clear status file - lets the user remove the status of the records associated with all stages of the selected jobs (in DS Director).

I developed 1 job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?

By using the Usage Analysis tool, which is available in DS Manager, we can find out which items are used in the job.

My job takes 30 minutes to run and I want to run it in less than 30 minutes. What steps do we have to take?

By using the performance tuning aspects which are available in DS, we can reduce the time. Tuning aspects: in DS Administrator: in-process and inter-process; in between passive stages: the inter-process (IPC) stage; OCI stage: array size and transaction size. Also use the Link Partitioner & Link Collector stages in between passive stages.

How to do row transposition in DS?

The Pivot stage is used for transposition. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.

If a job is locked by some user, how can you unlock that particular job in DS?

We can unlock the job by using the Clean Up Resources option which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.

I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the X value?

The X value is zero. The Iconv function converts a string to an internal storage format. It takes 31 Dec 1967 as zero and counts days from that date (31-DEC-1967).


What are unit testing, integration testing and system testing?

Unit testing: For DS, a unit test will check data type mismatches, the size of a particular data type, and column mismatches.

Integration testing: According to the dependencies, all jobs are integrated into one sequence; that is called the control sequence.

System testing: System testing is nothing but the performance tuning aspects in DS.

How many hashing algorithms are available for static hash files and dynamic hash files?

Sixteen hashing algorithms for static hash files. Two hashing algorithms for dynamic hash files (GENERAL or SEQ.NUM).

What happens when you have a job that links two passive stages together?

Obviously there is some process going on. Under the covers DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

What is the use of the Nested Condition activity?

Nested Condition: allows you to further branch the execution of a sequence depending on a condition.

I have three jobs A, B and C which are dependent on each other. I want to run the A & C jobs daily and the B job only on Sunday. How can you do it?

First schedule the A & C jobs Monday to Saturday in one sequence. Next take the three jobs according to their dependency in one more sequence and schedule that job only on Sunday.

What are the ways to execute Datastage jobs?

A job can be run using a few different methods:

from Datastage Director (menu Job -> Run now...)

from the command line using a dsjob command

from a Datastage routine that runs a job (the DSRunJob function)

by a job sequencer

How to invoke a Datastage shell command?

Datastage shell commands can be invoked from:

Datastage Administrator (Projects tab -> Command)

a Telnet client connected to the Datastage server

How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select a process which holds a lock and click Logout. If it still doesn't help, go to the Datastage shell and invoke the following command: ds.tools


It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

How to run and schedule a job from the command line?

To run a job from the command line use the dsjob command.

Command syntax: dsjob [-file <file> | [-server <server>][-user <user>][-password <password>]] [<command> <options>]

The command can be placed in a batch file and run by a system scheduler.

How to release a lock held by jobs?

Go to the Datastage shell and invoke the following command: ds.tools

It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

    User privileges for the default DataStage roles? The role privileges are:

DataStage Developer - a user with full access to all areas of a DataStage project.

DataStage Operator - has privileges to run and manage deployed DataStage jobs.

-none- - no permission to log on to DataStage.

    What is a command to analyze hashed file? There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:

    FILE.STAT command ANALYZE.FILE command

Is it possible to run two versions of Datastage on the same PC?

Yes, even though different versions of Datastage use different system dll libraries. To dynamically switch between Datastage versions, install and run the DataStage Multi-Client Manager. That application can unregister and register the system libraries used by Datastage.

Error in Link Collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to Job Properties -> Performance and select Enable row buffer. Then select Inter process, which will let the Link Collector run correctly. A buffer size of 128 Kb should be fine; however, it's a good idea to increase the timeout.

What is the DataStage equivalent to the LIKE option in ORACLE?

The following statement in Oracle: select * from ARTICLES where article_name like '%WHT080%';

can be written in DataStage (for example as a constraint expression): incol.empname matches '...WHT080...'

What is the difference between the Logging Text and the Final Warning Text message in a Terminator stage?

Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.

Informational - a green line, a DSLogInfo() type message.

The Final Warning Text - the red fatal message, which is included in the sequence abort message.


    Error in STPstage - SOURCE Procedures must have an output link The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage. To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

How to invoke an Oracle PL/SQL stored procedure from a server job

To run a PL/SQL procedure from Datastage a Stored Procedure (STP) stage can be used. However it needs a flow of at least one record to run. It can be designed in the following way:

a source ODBC stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual

a Transformer which passes that record through. If required, add the PL/SQL procedure parameters as columns on the right-hand side of the transformer's mapping

a Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

    Design of a DataStage server job with Oracle plsql procedure call

    Is it possible to run a server job in parallel? Yes, even server jobs can be run in parallel. To do that go to 'Job properties' -> General and check the Allow Multiple Instance button. The job can now be run simultaneously from one or many sequence jobs. When it happens datastage will create new entries in Director and new job will be named with automatically generated suffix (for example second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by datastage on the next run.

Error in STPstage - STDPROC property required for stage xxx The error appears in the Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the Procedure call syntax is filled in correctly. To get rid of the error, fill in the 'Procedure name' field.

Datastage routine to open a text file with error catching

Note! work_dir and file1 are parameters passed to the routine.

   * open file1
   OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
      CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
   END ELSE
      CALL DSLogInfo("Unable to open file", "JobControl")
      ABORT
   END

Datastage routine which reads the first line from a text file

Note! work_dir and file1 are parameters passed to the routine.

   * open file1
   OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
      CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
   END ELSE
      CALL DSLogInfo("Unable to open file", "JobControl")
      ABORT
   END
   READSEQ FILE1.RECORD FROM H.FILE1 ELSE
      Call DSLogWarn("******************** File is empty", "JobControl")
   END
   firstline = Trim(FILE1.RECORD[1,32], " ", "A")  ;* will read the first 32 chars
   Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
   CLOSESEQ H.FILE1

How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager. Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...', which will open a new window. Enter test parameters in the left-hand side column and click Run All to see the results. Datastage will remember all the test arguments during future tests.

    When hashed files should be used? What are the benefits or using them? Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value pairs. Hashed files are especially useful if they store information with data dictionaries (customer details, countries, exchange rates). Stored this way it can be spread across the project and accessed from different jobs.

How to construct a container and deconstruct it or switch between local and shared? To construct a container go to Datastage Designer, select the stages that should be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared. Local will only be visible in the current job, and shared can be re-used. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu. Local Datastage containers can be converted at any time to shared containers in Datastage Designer by right-clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.


Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is to map correctly the oracle NUMBER(x,y) format. The best way to do that in Datastage is to convert the oracle NUMBER format to the Datastage Decimal type and to fill in the Length and Scale columns accordingly. There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle Char to datastage Char.

How to adjust the commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in: General -> Transaction size (in version 7.x it's obsolete). Starting from Datastage 7.x it can be set up in the properties of the ODBC or ORACLE stage in Transaction handling -> Rows per transaction. If set to 0 the commit will be issued at the end of a successful transaction.

What is the use of the INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:

@INROWNUM counts incoming rows to a transformer in a datastage job

@OUTROWNUM counts outgoing rows from a transformer in a datastage job

These variables can be used to generate sequences, primary keys, ids, numbering of rows and also for debugging and error tracing. They play a similar role to sequences in Oracle.

Datastage trim function cuts out more characters than expected

By default the datastage trim function works this way: Trim(" a  b  c  d ") will return "a b c d", while in many other programming/scripting languages "a  b  c  d" (with the inner spacing preserved) would be expected. That is because by default an "R" parameter is assumed, which removes leading and trailing occurrences of a character and reduces multiple occurrences to a single occurrence. To keep the inner spacing, use the trim function in the following way: Trim(" a  b  c  d ", " ", "B")

Database update actions in the ORACLE stage

The destination table can be updated using various Update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly, as this determines which columns will appear in the WHERE part of the SQL statement. Available actions:

Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).

Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).

Insert rows without clearing - only adds new rows (INSERT statement).

Delete existing rows only - deletes matched rows (issues only the DELETE statement).

Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).

Update existing rows only - updates existing rows (UPDATE statement).

Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.

Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.


    User-defined SQL - the data is written using a user-defined SQL statement. User-defined SQL file - the data is written using a user-defined SQL statement from a file. Use and examples of ICONV and OCONV functions?

    ICONV and OCONV functions are quite often used to handle data in Datastage.

    ICONV converts a string to an internal storage format and OCONV converts an expression to an

    output format.

    Syntax:

    Iconv (string, conversion code)

    Oconv(expression, conversion )

    Some useful iconv and oconv examples: Iconv("10/14/06", "D2/") = 14167

    Oconv(14167, "D-E") = "14-10-2006"

    Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"

    Oconv(12003005, "MD2$,") = "$120,030.05"

    That expression formats a number and rounds it to 2 decimal places:

    Oconv(L01.TURNOVER_VALUE*100,"MD2")

    Iconv and oconv can be combined in one expression to reformat date format easily:

Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

ERROR 81021 Calling subroutine DSR_RECORD ACTION=2

Datastage system help gives the following error description: SYS.HELP. 081021 MESSAGE.. dsrpc: Error writing to Pipe.

The problem appears when a job sequence is used and it contains many stages (usually more than 10), and very often when a network connection is slow. Basically the cause of the problem is a failure in communication between the DataStage client and the server.

The solution to the issue is:

Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.

If the above does not help, execute the DS.REINDEX ALL command from the Datastage shell.

How to check Datastage internal error descriptions

To check the description of an error number, go to the datastage shell (from the Administrator or telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error number.

The command will produce a brief error description, which probably will not be helpful in resolving the issue but can be a good starting point for further analysis.

    Error timeout waiting for mutex


The error message usually looks as follows: ... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus several ways to get rid of it. The error usually appears when using the Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file, or if a job is very complex and uses many transformers. There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters - get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it is very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

ERROR 30107 Subroutine failed to complete successfully

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a job is moved from one project to another (for example when deploying from a development environment to production). The solution to the issue is:
Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell.

Datastage Designer hangs when editing job activity properties

The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows. After opening a job sequence and navigating to the job activity properties window the application freezes, and the only way to close it is from the Windows Task Manager. The solution is simple: download and install the XP SP2 patch for the Datastage client. It can be found on the IBM client support site (login required): https://www.ascential.com/eservice/public/welcome.do
Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section. Sometimes users face problems when trying to log in (for example when the license doesn't cover IBM Active Support); then it may be necessary to contact IBM support, which can be reached at [email protected]

    Can Datastage use Excel files as a data input?


Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible approaches:

Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the Datastage server machine and using an ODBC stage in Datastage. The main disadvantage is that this is impossible on a Unix machine. On Datastage servers running on Windows it can be set up here: Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Microsoft Excel Driver (*.xls) -> provide a data source name -> select the workbook -> OK

Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a Sequential File stage in Datastage to read the data.

    Parallel processing

    Datastage jobs are highly scalable due to the implementation of parallel processing. The EE

    architecture is process-based (rather than thread processing), platform independent and uses the

    processing node concept. Datastage EE is able to execute jobs on multiple CPUs (nodes) in

    parallel and is fully scalable, which means that a properly designed job can run across resources

    within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP

    architecture (massively parallel processing).

    Partitioning and Pipelining

    Partitioning means breaking a dataset into smaller sets and distributing them evenly across the

    partitions (nodes). Each partition of data is processed by the same operation and transformed in

    the same way.

The main outcome of using a partitioning mechanism is linear scalability. This means

    for instance that once the data is evenly distributed, a 4 CPU server will process the data four

    times faster than a single CPU machine.

    Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed

    simultaneously, not sequentially. The key concept of ETL Pipeline processing is to start the

    Transformation and Loading tasks while the Extraction phase is still running.

    Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel

    processing. The concept is hidden from a Datastage programmer. The job developer only

    chooses a method of data partitioning and the Datastage EE engine will execute the partitioned

    and parallelized processes.

    Section 1.01 Differences between Datastage Enterprise and Server Edition

1. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, however their capabilities are quite different. In rough outline:


o Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server runtime environment.
o Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no manual intervention is needed to implement these techniques optimally.
o Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating.

2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators - instances of executable C++ classes, pre-built components representing the stages used in Datastage jobs. Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record- and column-level format properties.
4. Datastage EE also brings completely new stages implementing the parallel concept, for example:

o Enterprise database connectors for Oracle, Teradata & DB2
o Development and debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
o Data Set, File Set, Complex Flat File, Lookup File Set ...
o Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...

5. When processing large data volumes, Datastage EE jobs are the right choice; however, when dealing with a smaller data environment, Server jobs might simply be easier to develop, understand and manage. When a company has both Server and Enterprise licenses, both types of jobs can be used.
6. Sequence jobs are the same in the Datastage EE and Server editions.

What is the difference between DS 7.5 and 8.1?

The newer version of DS is 8.x and supports QualityStage, ProfileStage, etc., and it also contains a web-based console.
1. To implement SCDs we have a separate stage (the SCD stage).
2. We don't have the Manager client tool in version 8; it is incorporated into the Designer itself.
3. There is no need to hardcode the parameters for every job - we have an option called Parameter Set. If we create a parameter set, we can call it for the whole project, a job or a sequence.

What happens when a job is compiling?

During compilation of a DataStage parallel job there is very high CPU and memory utilization on the server, and the job may take a very long time to compile.

What is APT_CONFIG in DataStage?

APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.


The APT configuration file is used to store the node information; it contains the disk storage and scratch disk information, and Datastage understands the architecture of the system based on this config file. For parallel processing normally two or more nodes are defined.

In any case, APT_CONFIG_FILE (not just APT_CONFIG) is the environment variable that points to the configuration file defining the nodes (with their disk and scratch/temp areas) for a specific project.
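As a rough sketch, a minimal two-node configuration file could look like the following (the host name and the disk/scratch paths are assumed placeholders, not values from this document):

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/d1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/d2" {pools ""}
    resource scratchdisk "/data/ds/scratch2" {pools ""}
  }
}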

Is it possible to add extra nodes in the configuration file?

What is RCP and how does it work?

Runtime column propagation is used in the case of partial schema usage: when we only know about the columns to be processed and want all other columns to be propagated to the target as they are, we check the Enable Runtime Column Propagation option (in the Administrator, on a stage's Output page Columns tab, or on the Stage page General tab) and only need to specify the schema of the columns we are concerned with.

According to the documentation, Runtime Column Propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define just the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import and Column Export.

Runtime column propagation can be used with the Column Import stage. If RCP is enabled in our project, we can define only the columns we are interested in and DataStage will send the rest of the columns through the various other stages. This ensures such columns reach the target even though they are not explicitly used in the stages in between.
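For reference, a schema file is a plain text file that describes the full record layout; a minimal sketch (hypothetical column names, following the general DataStage/Orchestrate schema format) could look like:

record (
  CUSTOMER_ID: int32;
  CUST_NAME: string[max=50];
  CUST_CITY: string[max=30];
)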

Star schema vs snowflake schema - what is the difference?

Star schema: de-normalized data structure. Snowflake schema: normalized data structure.
Star: a single dimension table per category. Snowflake: each dimension table is split into many pieces.
Star: more data redundancy. Snowflake: less redundancy (normalized dimensions).
Star: no complicated joins needed. Snowflake: complicated joins.
Star: faster query results. Snowflake: some delay in query processing.
Star: dimension tables have no parent tables. Snowflake: dimension tables may have parent tables.
Star: simple database structure. Snowflake: more complicated database structure.

Difference between OLTP and a data warehouse?

    The OLTP database records transactions in real time and aims to automate clerical data entry processes

of a business entity. Addition, modification and deletion of data in the OLTP database is essential, and the semantics of the front-end application impact the organization of the data in the database.

    The data warehouse on the other hand does not cater to real time operational requirements of the

    enterprise. It is more a storehouse of current and historical data and may also contain data extracted

    from external data sources.

    However, the data warehouse supports OLTP system by providing a place for the latter to offload data

    as it accumulates and by providing services which would otherwise degrade the performance of the

    database.

Differences between a data warehouse database and an OLTP database

    Data warehouse database

    Designed for analysis of business measures by categories and attributes

    Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

    Loaded with consistent, valid data; requires no real time validation

    Supports few concurrent users relative to OLTP

    OLTP database

    Designed for real time business operations.

    Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.

    Optimized for validation of incoming data during transactions; uses validation data tables.

    Supports thousands of concurrent users.


What is data modelling?

    The analysis of data objects and their relationships to other data objects. Data modeling is often

    the first step in database design and object-oriented programming as the designers first create a

    conceptual model of how data items relate to each other. Data modeling involves a progression

    from conceptual model to logical model to physical schema.

    Data modelling is the process of identifying entities, the relationship between those entities and

    their attributes. There are a range of tools used to achieve this such as data dictionaries, decision

    trees, decision tables, schematic diagrams and the process of normalisation.

How to retrieve the second highest salary?

select ename, esal from
  (select ename, esal, rownum rn from
    (select ename, esal from hsal order by esal desc))
where rn = 2;

(This assumes distinct salary values: the inner query orders by salary descending and the outer query picks the second row.)
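An alternative sketch using an analytic function on the same table, which also copes with ties in salary:

select ename, esal from
  (select ename, esal, dense_rank() over (order by esal desc) rnk from hsal)
where rnk = 2;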

Difference between egrep and fgrep?

    There is a difference. fgrep can not search for regular expressions in a string. It is used for plain

    string matching.

    egrep can search regular expressions too.

grep covers both: with the -E option it behaves like egrep and with the -F option like fgrep; without either option it uses basic regular expressions (and of course it can still search plain strings).

Hence the most convenient one to use is grep.


    fgrep = "Fixed GREP".

    fgrep searches for fixed strings only. The "f" does not stand for "fast" - in fact, "fgrep foobar *.c"

    is usually slower than "egrep foobar *.c" (Yes, this is kind of surprising. Try it.)

    Fgrep still has its uses though, and may be useful when searching a file for a larger number of

    strings than egrep can handle.

    egrep = "Extended GREP"

    egrep uses fancier regular expressions than grep. Many people use egrep all the time, since it has

    some more sophisticated internal algorithms than grep or fgrep, and is usually the fastest of the

three programs.
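Quick usage sketch (the file names are placeholders):

grep -E 'error|warning' job_log.txt     (extended regular expression, egrep-style alternation)
grep -F 'a.b(c)' params.txt             (fixed string - the dot and parentheses are taken literally)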

    CHMOD command?

    Permissions

    u - User who owns the file.

    g - Group that owns the file.

    o - Other.

    a - All.

    r - Read the file.

    w - Write or edit the file.

    x - Execute or run the file as a program.

    Numeric Permissions:

CHMOD can also be used with numeric permissions:

    400 read by owner

    040 read by group

    004 read by anybody (other)

    200 write by owner

    020 write by group

    002 write by anybody

    100 execute by owner

    010 execute by group

    001 execute by anybody
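For example, chmod 754 loadjob.sh (a placeholder script name) combines the values above: 700 (read/write/execute for the owner) + 50 (read/execute for the group) + 4 (read for others). The equivalent symbolic form is chmod u=rwx,g=rx,o=r loadjob.sh.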

What is the difference between DS 7.5 and 8.1?

The main difference is that the DS Manager client is combined with the Designer in 8.1, and the following are new in 8.1: the SCD (type 2) stage, data connection objects, parameter sets (PS), QualityStage (QS) and range lookup.

    Difference between internal sort and external sort?


Performance-wise an internal sort is best because it doesn't use any extra buffer, whereas an external sort takes buffer memory to store the records.

How do you pass only a required number of records through the partitions?

Go to Job Properties -> Execution tab -> enable tracing (trace compile) and give the required number of records.

What happens when a job is compiling?

1. All processing stages generate OSH code.
2. Transformers generate C++ code in the background.
3. The job information is updated in the metadata repository.
4. The job is compiled.

What is APT_CONFIG in DS?

It is the environment variable pointing to the configuration file which defines the parallelism (nodes) for our jobs.

How many types of parallelism and partitions are there?

There are two types of parallelism - pipeline and partition parallelism - and the underlying systems can be SMP or MPP.

Is it possible to add extra nodes in the configuration file?

Yes, it is possible. Open the APT configuration file via the configuration management tool and edit it for the required number of nodes.

What is RCP and how does it work?

Runtime column propagation is used to propagate the columns which are not defined in the metadata.

How does data move from one stage to another?

In the form of virtual data sets.

Is it possible to run multiple instances of a single job?

Yes - go to Job Properties, which has an 'Allow Multiple Instance' option.

    What is APT_DUMP_SCORE?

APT_DUMP_SCORE shows the operators, data sets, nodes, partitions, combinations and processes used in a job. It is an environment variable, set through the Administrator.


Pipeline parallelism: each stage works on a separate processor.

What is the difference between Job Control and a Job Sequence?

Job control is used specifically to control jobs - through it we can pass parameters and conditions, capture log file information, dashboard information, handle load recovery, etc. A job sequence is used to run a group of jobs based on certain conditions; for final/incremental processing we keep all the jobs in one sequence and run them together by defining triggers.
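A minimal job control sketch in DataStage BASIC (the job name and parameter values are assumed placeholders, not taken from this document):

hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)      ;* attach the controlled job
ErrCode = DSSetParam(hJob, "LoadDate", "2006-10-14")     ;* pass a parameter
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)                  ;* start the job
ErrCode = DSWaitForJob(hJob)                             ;* wait until it finishes
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)               ;* read the finishing status
ErrCode = DSDetachJob(hJob)                              ;* release the job handle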


What is the max size of a Data Set stage (PX)? There is no fixed limit.

Performance in the Sort stage

If the source is an Oracle database, you can write a user-defined query to sort and remove duplicates in the source itself, and by using suitable key partitioning techniques you can improve the performance. If that is not the case, go for key partitioning in the Sort stage, keeping the same partitioning as in the previous stage. Do not allow duplicates - remove duplicates and use a unique partitioning key.

How to develop an SCD using the LOOKUP stage?

We can implement an SCD by using the LOOKUP stage, but only SCD type 1, not type 2. We take the source (file or database) and a data set as the reference link (for the lookup), then the LOOKUP stage; there we compare the source with the data set and set the lookup failure condition to Continue. After that, in a transformer we apply the conditions, and then we take two targets, one for inserts and one for updates, where we manually write the SQL INSERT and UPDATE statements. If you see the design you can easily understand it.
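A rough sketch of those manually written statements (hypothetical dimension table and columns; the ORCHESTRATE.column notation is commonly used for referencing job columns in user-defined SQL, but check the exact syntax expected by the particular target stage):

INSERT INTO DIM_CUSTOMER (CUSTOMER_ID, CUST_NAME, CUST_CITY)
VALUES (ORCHESTRATE.CUSTOMER_ID, ORCHESTRATE.CUST_NAME, ORCHESTRATE.CUST_CITY)

UPDATE DIM_CUSTOMER
SET CUST_NAME = ORCHESTRATE.CUST_NAME, CUST_CITY = ORCHESTRATE.CUST_CITY
WHERE CUSTOMER_ID = ORCHESTRATE.CUSTOMER_ID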

What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) and the standard Ascential DataStage 7.5 version?

IBM Information Server, also known as DS 8, has more features like QualityStage and MetaStage. It maintains its repository in DB2, unlike the file-based repository in 7.5. It also has a stage specifically for SCD types 1 and 2.

I think there is no version called standard Ascential DataStage 7.5; I only know the advanced edition of Datastage, i.e. WebSphere DataStage and QualityStage, released by IBM with version 8.0.1. In this release there are only 3 client tools (Administrator, Designer, Director) - the Manager has been removed and its import/export functions are included in the Designer itself - and some extra stages have been added, like the SCD stage, by which we can implement SCD type 1 and type 2 directly, along with other advanced stages.

They have also included QualityStage, which is used for data validation - very important for a data warehouse. There are so many things available in QualityStage that we can think of it as a separate tool for DWH work.

What are the errors you have experienced with DataStage?

In DataStage, warnings and fatal errors appear in the job log file.


If there is a fatal error, the job aborts; if there are only warnings, the job does not abort, but we still have to handle those warnings - the log file should ultimately be clear of warnings as well.

Many different errors appear in different jobs, for example:

Parameter not found in job load recover.

A child job failed because of some .....

A control job failed because of some .....

....etc

What are the main differences between server jobs and parallel jobs in DataStage?

a) In server jobs we have few stages; they are mainly logic intensive, we use the transformer for most things, and they do not use MPP systems. In parallel jobs we have lots of stages; they are stage intensive, with built-in stages for particular tasks, and they can use MPP systems.

b) In server jobs we don't have an option to process the data on multiple nodes as in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas we don't have any such concept in server jobs. There are also a lot of differences in using the same stages in server and parallel jobs. For example, in parallel jobs a sequential file (or any other file stage) can have either an input link or an output link, but in server jobs it can have both (and more than one of each).

c) Server jobs compile and run within the DataStage server, but parallel jobs compile and run within the DataStage Unix server. A server job extracts all the rows from the source into a stage before that stage is activated and the rows are passed on to the target or data warehouse, which is time consuming. In parallel jobs there are two kinds of parallelism:
1. Pipeline parallelism - while rows are still being extracted from the source, the next stage is already active and passing rows on towards the target/DWH; it maintains only one node between source and target.


2. Partition parallelism - maintains more than one node between source and target.

Why do you need the Modify stage?

When you are able to handle null handling and data type changes in ODBC stages, why do you need the Modify stage?

It is used to change data types: if the source contains a varchar and the target expects an integer, we use the Modify stage to convert it according to the requirement, and we can also change column lengths. The Modify stage is used for the purpose of data type changes.
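A minimal sketch of Modify stage specifications (hypothetical column names; only the drop/keep and rename forms are shown, and the exact specification grammar, including the type-conversion functions, should be checked against the Modify stage documentation):

DROP TEMP_COL
KEEP CUSTOMER_ID
CUSTOMER_KEY = CUST_ID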

What is the difference between the Sequential File stage and the Data Set stage? When do you use them?

a) The Sequential File stage is used for sequential (flat) file formats, while a data set holds data in DataStage's own internal format.

b) Parallel jobs use data sets to manage data within a job - you can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director. In a data set the data is stored in an internal (not human-readable) format, i.e. we can view the data through the View Data facility in DataStage but it cannot be viewed directly from Linux or the back-end system, whereas sequential file data can be viewed anywhere. Extraction of data from a data set is much faster than from a sequential file.

How can we improve the performance of a job while handling huge amounts of data?

a) Minimize the number of transformer stages. If the reference table holds a huge amount of data, use a Join stage; if the reference table holds a small amount of data, use a Lookup.

b) This requires job-level or server-level tuning.
Job-level tuning:
- use Join for huge amounts of data rather than Lookup;
- use the Modify stage rather than a transformer for simple transformations;
- sort the data before a Remove Duplicates stage.
Server-level tuning can only be done with adequate knowledge of the server-level parameters which can improve execution performance.


How can we create read-only jobs in DataStage?

By creating a protected project. In a protected project all jobs are read-only and cannot be modified.

b) A job can also be made read-only by the following process: export the job in .dsx format and change the attribute which stores the read-only information from 0 (0 refers to an editable job) to 1 (1 refers to a read-only job), then import the job again and overwrite or rename the existing job so that you have both forms.

There are 3 kinds of routines in Datastage:
1. Server routines, used in server jobs - these routines are written in the BASIC language.
2. Parallel routines, used in parallel jobs - these routines are written in C/C++.
3. Mainframe routines, used in mainframe jobs.

DataStage parallel routines made really easy:
http://blogs.ittoolbox.com/dw/soa/archives/datastage-parallel-routines-made-really-easy-20926

    How will you determine the sequence of jobs to load into data warehouse?

    First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the

    Aggregator tables (if any).

The sequence of the jobs can also be determined by the parent-child relationships in the target tables to be loaded: parent tables always need to be loaded before child tables.

Error while connecting to DS Administrator?

Go to Settings -> Control Panel -> User Accounts and create a new user with a password. Restart your computer and log in with the new user name. Try using the new user name in DataStage and you should be able to connect.


DataStage - delete header and footer on the source sequential file

How do you delete the header and footer on a source sequential file, and how do you create a header and footer on a target sequential file using DataStage?

In the Designer palette, under Development/Debug, we can find the Head and Tail stages; these can be used to handle the header and footer rows.

How can we implement Slowly Changing Dimensions in DataStage?

a) We can implement SCDs in Datastage as follows:
1. Type 1 SCD: insert else update in the ODBC stage.
2. Type 2 SCD: insert a new row when the primary key already exists, and update the old row with the effective-from date set to the job run date and the effective-to date set to some max date.
3. Type 3 SCD: insert the old value into a separate column and update the existing column with the new value.

b) By using the Lookup stage and the Change Capture stage we can implement SCDs. We have 3 types of SCDs:
Type 1: maintains the current values only.
Type 2: maintains both current and historical values.
Type 3: maintains current and partial historical values.
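As an illustration of the type 2 mechanics (hypothetical dimension table, columns and values, plain SQL, not DataStage-generated text): when a tracked attribute changes, the current row is expired and a new row is inserted, roughly like:

UPDATE DIM_CUSTOMER
SET EFFECTIVE_TO = CURRENT_DATE, CURRENT_FLAG = 'N'
WHERE CUSTOMER_ID = 1001 AND CURRENT_FLAG = 'Y';

INSERT INTO DIM_CUSTOMER (CUSTOMER_KEY, CUSTOMER_ID, CUST_CITY, EFFECTIVE_FROM, EFFECTIVE_TO, CURRENT_FLAG)
VALUES (50234, 1001, 'Pune', CURRENT_DATE, DATE '9999-12-31', 'Y');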

    Differentiate Database data and Data warehouse data?

    Data in a Database is

    Detailed or Transactional

    Both Readable and Writable.

    Current.

    b)By Database, one means OLTP (On Line Transaction Processing). This can be the source

    systems or the ODS (Operational Data Store), which contains the transactional data.

    c)Database data is in the form of OLTP and Data warehouse data will be in the form of OLAP.

    OLTP is for transactional process and OLAP is for Analysis purpose.

d) A data warehouse:
- contains current and historical data
- holds highly summarized data
- follows denormalization


- uses a dimensional model
- is non-volatile


What is the difference between DataStage and Informatica?

a) The main difference between DataStage and Informatica is scalability - Informatica is more scalable than DataStage.

b) In my view DataStage is also scalable; the difference lies in the number of built-in functions, which makes DataStage more user friendly.

c) In my view, DataStage has fewer transformers/functions compared to Informatica, which can make some tasks more difficult.

d) The main difference is the vendors. Each one has its plus points from its architecture. For DataStage it is a top-down approach. Based on the Business