Datastage Interview Questions
Transcript of Datastage Interview Questions
Datastage Interview Questions - Answers
What is the flow of loading data into fact & dimensional tables?
Fact table - A table whose columns are foreign keys corresponding to the primary keys of the dimension tables, plus fields with numeric values.
Dimension table - A table with a unique primary key.
Load - Data should first be loaded into the dimension tables. Based on the primary key values generated in the dimension tables, the data is then loaded into the fact table.
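The dimension-first load order can be sketched in a few lines of Python (table and column names here are hypothetical, not from any specific project): the dimension load assigns keys, and the fact load looks those keys up.

```python
# Sketch of the dimension-first load order (hypothetical data and names).
# Step 1: load the dimension, recording the primary key assigned to each
# natural (business) key. Step 2: load facts, resolving each business key
# to the dimension's primary key.

def load_dimension(rows):
    """Assign a primary key to each distinct business key."""
    dim = {}
    next_key = 1
    for business_key, description in rows:
        if business_key not in dim:
            dim[business_key] = next_key
            next_key += 1
    return dim

def load_facts(fact_rows, dim):
    """Replace business keys with dimension primary keys (foreign keys)."""
    return [(dim[business_key], amount) for business_key, amount in fact_rows]

customer_dim = load_dimension([("C100", "Acme"), ("C200", "Globex")])
fact_table = load_facts([("C100", 250.0), ("C200", 99.0)], customer_dim)
```

Loading the facts first would fail here: the dictionary lookup has no key to resolve until the dimension load has run, which mirrors the foreign key dependency in the real tables.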
What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it in the Datastage Administrator by selecting the Tunables tab and specifying the cache size there.
What does a Config File in parallel extender consist of?
Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location
What is Modulus and Splitting in Dynamic Hashed File?
In a dynamic hashed file, the size of the file changes as data is added or removed. The modulus is the current number of groups in the file. When the file grows, a group is split into two ("splitting") and the modulus increases; when the file shrinks, groups are merged and the modulus decreases.
What are Stage Variables, Derivations and Constants?
Stage Variable - An intermediate processing variable that retains its value during the read and doesn't pass the value into a target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constraint - A condition (true or false) that controls the flow of data along a link.
Types of views in Datastage Director?
There are 3 types of views in Datastage Director
a) Job View - Dates of Jobs Compiled.
b) Log View - Status of Job last run
c) Status View - Warning Messages, Event Messages, Program Generated Messages.
Types of Parallel Processing?
Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massive Parallel Processing.
Orchestrate Vs Datastage Parallel Extender?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. Datastage used Orchestrate with Datastage XE (beta version of 6.0) to incorporate parallel processing capabilities. Ascential then acquired Orchestrate and integrated it with Datastage XE, releasing a new version, Datastage 6.0, i.e. Parallel Extender.
Importance of Surrogate Key in Data warehousing?
A Surrogate Key is a primary key for a dimension table. Its main advantage is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source database.
How to run a Shell Script within the scope of a Data stage job?
By using the "ExecSH" command at Before/After job properties.
How do you execute a datastage job from the command line prompt?
Using the "dsjob" command, as follows:
dsjob -run -jobstatus projectname jobname
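The same invocation can be scripted; a minimal Python sketch (the project and job names are placeholders) builds the dsjob command line shown above and would hand it to the shell:

```python
# Sketch: build the dsjob command line for running a job and waiting for
# its completion status. Project/job names are placeholders; on a real
# server you would pass check=True to subprocess.run to surface failures.
import subprocess

def build_dsjob_cmd(project, job):
    return ["dsjob", "-run", "-jobstatus", project, job]

cmd = build_dsjob_cmd("MyProject", "LoadCustomerDim")
# On a machine with the DataStage client installed you would run:
# subprocess.run(cmd, check=True)
```

The actual `subprocess.run` call is commented out because `dsjob` only exists where the DataStage engine or client is installed.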
Functionality of Link Partitioner and Link Collector?
Link Partitioner: It actually splits data into various partitions or data flows using various partition
methods.
Link Collector: It collects the data coming from partitions, merges it into a single data flow and loads to
target.
Types of Dimensional Modeling?
Dimensional modeling is sub-divided into the following types.
a) Star Schema - Simple & Much Faster. Denormalized form.
b) Snowflake Schema - Complex with more Granularity. More normalized form.
c) Galaxy Schema - a complex multi-star schema.
Differentiate Primary Key and Partition Key?
A Primary Key is a combination of unique and not null. It can be a collection of key columns, called a composite primary key. A Partition Key is just a set of columns used to distribute rows across partitions. There are several methods of partitioning, such as Hash, DB2, Random, etc. When using Hash partitioning we specify the Partition Key.
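Hash partitioning as described above can be illustrated with a small sketch (the modulo scheme below is illustrative, not DataStage's internal hash function): rows with the same partition-key value always land in the same partition.

```python
# Illustrative hash partitioning: the partition for a row is derived from
# its partition-key value, so equal keys always go to the same partition.
import zlib

def partition_of(key, num_partitions):
    # crc32 gives a stable hash across runs (unlike Python's built-in hash()).
    return zlib.crc32(key.encode()) % num_partitions

rows = ["C100", "C200", "C100", "C300"]
partitions = [partition_of(k, 4) for k in rows]
```

This same-key-same-partition guarantee is why hash partitioning is chosen before stages like Aggregator or Join: all rows for one key are processed by one node.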
Differentiate Database data and Data warehouse data?
Database (OLTP) data is:
a) Detailed or transactional.
b) Both readable and writable.
c) Current.
Data warehouse data, by contrast, is typically summarized as well as detailed, read-only, and historical.
Containers Usage and Types?
Container is a collection of stages used for the purpose of Reusability.
There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project.
Compare and Contrast ODBC and Plug-In stages?
ODBC: a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.
Plug-In: a) Good Performance.
b) Database specific. (Only one database)
c) Cannot handle Stored Procedures.
Dimension Modelling types along with their significance
Data Modelling is Broadly classified into 2 types.
a) E-R Diagrams (Entity - Relatioships).
b) Dimensional Modelling.
What are the Ascential DataStage products and connectivity options?
Ascential Products
Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions
Connectivity
Files
RDBMS
Real-time
PACKs
EDI
Other
Explain Data Stage Architecture?
Data Stage contains two components: the Client Component and the Server Component.
Client Components:
a) Data Stage Administrator
b) Data Stage Manager
c) Data Stage Designer
d) Data Stage Director
Server Components:
a) Data Stage Engine
b) Meta Data Repository
c) Package Installer
Data Stage Administrator (Roles and Responsibilities):
Used to create projects. Contains a set of properties. We can set the buffer size (by default 128 MB) and increase it. We can set the Environment Variables. In Tunables we have in-process and inter-process buffering: in-process reads the data sequentially; inter-process reads the data as it comes. It also provides the interface to the metadata.
Data Stage Manager:
We can view and edit the Meta Data Repository, import table definitions, export Data stage components in .xml or .dsx format, create routines and transforms, and compile multiple jobs.
Data Stage Designer:
We can create jobs, compile jobs, and run jobs. We can declare stage variables in the Transformer, call routines, transforms, macros and functions, and write constraints.
Data Stage Director:
We can run jobs, schedule jobs (daily, weekly, monthly or quarterly), monitor jobs, and release jobs.
What is Meta Data Repository?
Metadata is data about data. The repository also contains:
query statistics, ETL statistics, business subject areas, source information, target information, and source-to-target mapping information.
What is Data Stage Engine? It is the server engine, running in the background, on which DataStage jobs execute; it is based on the UniVerse engine rather than a Java engine.
What is Dimensional Modeling? Dimensional Modeling is a logical design technique that seeks to present the data in a standard
framework that is intuitive and allows for high-performance access.
What is Star Schema? Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded
by dimension tables.
Dimension Table: It contains a primary key and descriptive attributes referenced by the fact table.
Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.
What is surrogate Key? It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table.
With a signed 4-byte integer we can store up to about 2 billion records.
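A surrogate key generator of this kind can be sketched as a simple counter that is independent of the business key; the signed 4-byte integer maximum of 2,147,483,647 is where the "2 billion" figure comes from.

```python
# Sketch of surrogate key assignment: a plain counter, independent of the
# OLTP/business key, capped at the signed 32-bit maximum (~2 billion).
INT32_MAX = 2**31 - 1  # 2,147,483,647

class SurrogateKeyGenerator:
    def __init__(self, start=1):
        self.next_key = start

    def next(self):
        if self.next_key > INT32_MAX:
            raise OverflowError("4-byte surrogate key space exhausted")
        key = self.next_key
        self.next_key += 1
        return key

gen = SurrogateKeyGenerator()
keys = [gen.next() for _ in range(3)]  # 1, 2, 3
```

Because the counter never looks at the business key, a change in the source system's key format leaves existing dimension rows untouched, which is exactly the independence argued for above.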
Why we need surrogate key? It is used for integrating data coming from multiple sources, and it performs better than the business key for
index maintenance, joins, table size, key updates, disconnected inserts and partitioning.
What is Snowflake schema? It is a partially normalized dimensional model in which at least one dimension is represented by two or
more related hierarchy tables.
Explain Types of Fact Tables?
Factless Fact: It contains only foreign keys to the dimension tables.
Additive Fact: Measures can be added across any dimensions.
Semi-Additive: Measures can be added across some dimensions but not others. E.g., percentage, discount.
Non-Additive: Measures cannot be added across any dimension. E.g., average.
Conformed Fact: The definition and measures of the fact are the same in the two fact tables, so the facts can be
compared across the dimensions using the same set of measures.
Explain the Types of Dimension Tables?
Conformed Dimension: A dimension table that is connected to more than one fact table; the
granularity defined in the dimension table is common across those fact tables.
Junk Dimension: A dimension table which contains only flags.
Monster Dimension: A very large dimension that changes rapidly.
De-generate Dimension: A dimension attribute (such as a line-item or order number) stored in the fact table itself; it comes from line-item-oriented fact table design.
What are stage variables?
Stage variables are declaratives in Transformer Stage used to store values. Stage variables are
active at the run time. (Because memory is allocated at the run time).
What is sequencer? It sets the sequence of execution of server jobs.
What are Active and Passive stages?
Active Stage: Active stages model the flow of data and provide mechanisms for combining data
streams, aggregating data and converting data from one data type to another. E.g., Transformer,
Aggregator, Sort, Row Merger, etc.
Passive Stage: A Passive stage handles access to Database for the extraction or writing of data.
Eg, IPC stage, File types, Universe, Unidata, DRS stage etc.
What is ODS?
Operational Data Store is a staging area where data can be rolled back.
What are Macros?
They are built from Data Stage functions and do not require arguments.
A number of macros are provided in the JOBCONTROL.H file to facilitate getting information
about the current job, and links and stages belonging to the current job. These can be used in
expressions (for example for use in Transformer stages), job control routines, filenames and table
names, and before/after subroutines.
DSHostName
DSJobStatus
DSProjectName
DSJobName
DSJobController
DSJobStartDate
DSJobStartTime
DSJobStartTimestamp
DSJobWaveNo
DSJobInvocations
DSJobInvocationId
DSStageLastErr
DSStageType
DSStageInRowNum
DSStageVarList
DSLinkLastErr
DSLinkName
DSStageName
DSLinkRowCount
What is keyMgtGetNextValue?
It is a built-in transform that generates sequential numbers. Its input type is a literal string and its output
type is a string.
What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.
What is container?
A container is a group of stages and links. Containers enable you to simplify and modularize
your server job designs by replacing complex areas of the diagram with a single container stage.
You can also use shared containers as a way of incorporating server job functionality into
parallel jobs.
DataStage provides two types of container:
Local containers. These are created within a job and are only accessible by that job. A
local container is edited in a tabbed page of the job's Diagram window.
Shared containers. These are created separately and are stored in the Repository in the
same way that jobs are. There are two types of shared container: server shared containers and parallel shared containers.
What is function? (Job Control - Examples of Transform Functions)
Functions take arguments and return a value.
BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or
more. Arguments are always in parentheses, separated by commas, as shown in this general
syntax:
FunctionName (argument, argument)
DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current
job; these are useful in active stage expressions and before- and after-stage subroutines.
To do this ... Use this function ...
Specify the job you want to control: DSAttachJob
Set parameters for the job you want to control: DSSetParam
Set limits for the job you want to control: DSSetJobLimit
Request that a job is run: DSRunJob
Wait for a called job to finish: DSWaitForJob
Get the meta data details for the specified link: DSGetLinkMetaData
Get information about the current project: DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage: DSGetIPCStageProps
Get information about the controlled job or current job: DSGetJobInfo
Get information about the meta bag properties associated with the named job: DSGetJobMetaBag
Get information about a stage in the controlled job or current job: DSGetStageInfo
Get the names of the links attached to the specified stage: DSGetStageLinks
Get a list of stages of a particular type in a job: DSGetStagesOfType
Get information about the types of stage in a job: DSGetStageTypes
Get information about a link in a controlled job or current job: DSGetLinkInfo
Get information about a controlled job's parameters: DSGetParamInfo
Get the log event from the job log: DSGetLogEntry
Get a number of log events on the specified subject from the job log: DSGetLogSummary
Get the newest log event, of a specified type, from the job log: DSGetNewestLogId
Log an event to the job log of a different job: DSLogEvent
Stop a controlled job: DSStopJob
Return a job handle previously obtained from DSAttachJob: DSDetachJob
Log a fatal error message in a job's log file and abort the job: DSLogFatal
Log an information message in a job's log file: DSLogInfo
Put an info message in the job log of a job controlling the current job: DSLogToController
Log a warning message in a job's log file: DSLogWarn
Generate a string describing the complete status of a valid attached job: DSMakeJobReport
Insert arguments into the message template: DSMakeMsg
Ensure a job is in the correct state to be run or validated: DSPrepareJob
Interface to the system send-mail facility: DSSendMail
Log a warning message to a job log file: DSTransformError
Convert a job control status or error code into an explanatory text message: DSTranslateCode
Suspend a job until a named file either exists or does not exist: DSWaitForFile
Check if a BASIC routine is cataloged, either in the VOC as a callable item or in the catalog space: DSCheckRoutine
Execute a DOS or Data Stage Engine command from a before/after subroutine: DSExecute
Set a status message for a job to return as a termination message when it finishes: DSSetUserStatus
What are Routines?
Routines are stored in the Routines branch of the Data Stage Repository, where you can create,
view or edit them. The following programming components are classified as routines:
Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE)
functions, Web Service routines
What is Hash file stage and what is it used for?
Used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.
What are types of Hashed File?
Hashed files are classified broadly into 2 types.
a) Static - sub-divided into 17 types based on the primary key pattern.
b) Dynamic - sub-divided into 2 types: i) Generic ii) Specific.
The default hashed file is dynamic, Type 30.
What are Static Hash files and Dynamic Hash files?
As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds that size.
How did you handle reject data?
Typically a reject link is defined and the rejected data is loaded back into the data warehouse. A reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data such as duplicate primary keys or null rows where data is expected.
What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?
Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using hash/sequential files, for optimum performance and for data recovery in case a job aborts.
Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for the Aggregator.
Sorted the data as much as possible in the database and reduced the use of DS-Sort, for better performance of jobs.
Removed unused data from the source as early as possible in the job.
Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
Before writing a routine or a transform, make sure that the functionality required is not already in one of the standard routines supplied in the sdk or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate unnecessary records before joins are made.
Tuning should occur on a job-by-job basis.
Use the power of the DBMS. Try not to use a Sort stage when you can use an ORDER BY clause in the database. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE.
Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
Tell me one situation from your last project where you faced a problem and how did you solve it?
1. The jobs in which data was read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
2. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it had aborted. To make sure the load was proper we opted for the former.
Tell me the environment in your last projects.
Give the OS of the server and the OS of the client of your most recent project.
How did you connect with DB2 in your last project?
Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or DB2-UDB) depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.
What are Routines and where/how are they written and have you written any routines before? Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of Routines:
1. Transform Functions 2. Before-After Job subroutines 3. Job Control Routines
How did you handle an 'Aborted' sequencer? In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then run the job again.
Read the String functions in DS.
Functions like [] (the substring function) and ':' (the concatenation operator).
Syntax:
string [ [ start, ] length ]
string [ delimiter, instance, repeats ]
What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?
Under Windows: Use the 'WaitForFileActivity' stage under the Sequencers and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.
Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Use the crontab utility along with the dsjob command with the proper parameters passed.
Did you work in a UNIX environment?
Yes. This is one of the most important requirements.
How would you call an external Java function which is not supported by DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.
How will you determine the sequence of jobs to load into data warehouse? First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any).
The above might raise another question: why do we have to load the dimension tables first, then the fact tables? As we load the dimension tables the (primary) keys are generated, and these keys become foreign keys in the fact tables.
Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic? There is no TRUNCATE on ODBC stages; 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have ALTER TABLE permission, whereas Delete doesn't).
How do you rename all of the jobs to support your new File-naming conventions? Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program, which can do a simple rename of the strings looking up the Excel file. Then import the new dsx file probably into a new project for testing. Recompile all jobs. Be cautious that the name of the jobs has also been changed in your job control jobs or Sequencer jobs. So you have to make the necessary changes to these Sequencers.
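The rename step over the exported dsx can be sketched as a single-pass string substitution driven by an old-name/new-name mapping (the job names below are made up; the mapping would normally come from the spreadsheet). A single regex pass is used so that one replacement can never feed into another.

```python
# Sketch: rename job names inside an exported .dsx treated as plain text.
# The old->new mapping is inlined here with made-up names. Longer names are
# tried first in the alternation so a name that is a prefix of another
# (LoadCust vs LoadCustAgg) cannot clobber it.
import re

def rename_jobs(dsx_text, name_map):
    names = sorted(name_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    # Single left-to-right pass: replacements are never rescanned.
    return pattern.sub(lambda m: name_map[m.group(0)], dsx_text)

mapping = {"LoadCust": "stg_LoadCustomer", "LoadCustAgg": "agg_LoadCustomer"}
export = 'BEGIN DSJOB Identifier "LoadCustAgg" ... "LoadCust" END'
renamed = rename_jobs(export, mapping)
```

As the answer warns, the same substitution must also hit the job control and Sequencer jobs inside the export, which plain text substitution over the whole dsx naturally does.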
When should we use ODS? DWHs are typically read-only and batch-updated on a schedule; ODSs are maintained in more real time, trickle-fed constantly.
What other ETL's you have worked with?
Informatica, and also DataJunction if it is present in your resume.
How good are you with your PL/SQL?
On a scale of 1-10, say 8.5-9.
What versions of DS have you worked with?
DS 7.5, DS 7.0.2, DS 6.0, DS 5.2.
What's the difference between Datastage Developers and Designers?
A Datastage developer is one who codes the jobs. A Datastage designer is one who designs the job; I mean he deals with the blueprints and designs the jobs with the stages that are required in developing the code.
What are the command line functions that import and export the DS jobs?
dsimport.exe - imports the DataStage components. dsexport.exe - exports the DataStage components.
How to run a Shell Script within the scope of a Data stage job? By using the "ExecSH" command at Before/After job properties.
How to handle date conversions in Datastage? Convert mm/dd/yyyy format to yyyy-dd-mm?
We use:
a) the "Iconv" function - internal conversion.
b) the "Oconv" function - external conversion.
The function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(Fieldname, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
Did you Parameterize the job or hard-code the values in the jobs?
Always parameterize the job. Either the values come from Job Properties or from a Parameter Manager, a third-party tool. There is no way you should hardcode parameters in your jobs. The often-parameterized variables in a job are: DB DSN name, username, password, and dates with respect to the data to be looked up.
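The Iconv/Oconv date conversion shown above can be mirrored outside DataStage; a small Python sketch of the same two-step idea for the mm/dd/yyyy to yyyy-dd-mm reformatting asked about in the question:

```python
# Sketch: the same two-step idea as Iconv/Oconv - parse the external
# mm/dd/yyyy form into an internal date value, then format it back out
# in the (unusual) yyyy-dd-mm order the question asks for.
from datetime import datetime

def convert_date(text):
    internal = datetime.strptime(text, "%m/%d/%Y")  # the Iconv step
    return internal.strftime("%Y-%d-%m")            # the Oconv step

print(convert_date("12/31/2024"))  # prints "2024-31-12"
```

Parsing to an internal value first (rather than slicing the string) means invalid dates such as "13/40/2024" are rejected at the Iconv-equivalent step, which is also how the DataStage pair behaves.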
What are the main differences between Ascential DataStage and Informatica PowerCenter?
Chuck Kelley's answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.
Les Barbusinski's answer: Without getting into specifics, here are some differences you may want to explore with each vendor:
Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?
What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?
Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?
How well does each tool handle complex transformations, and how much external scripting is required?
What kinds of languages are supported for ETL script extensions?
Almost any ETL tool will look like any other on the surface. The trick is to find out which one will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product, especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source
systems, data volumes and transformation complexity.
Ask both vendors for a list of their customers with characteristics similar to your own that have
used their ETL product for at least a year. Then interview each client (preferably several people
at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the
tool that have been encountered by that customer. Ultimately, ask each customer, if they had it all to do over again, whether or not they'd choose the same tool and why. You might be surprised at some of the answers.
Joyce Bischoff's answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the
detailed requirements. There are numerous ETL products on the market and it seems that you are
looking at only two of them. If you are unfamiliar with the many products available, you may
refer to www.tdan.com, the Data Administration Newsletter, for product lists.
If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's. Ask both vendors and compare the answers, which may or may
not be totally accurate. After you are very familiar with the products, call their references and be
sure to talk with technical people who are actually using the product. You will not want the
vendor to have a representative present when you speak with someone at the reference site. It is
also not a good idea to depend upon a high-level manager at the reference site for a reliable
opinion of the product. Managers may paint a very rosy picture of any selected product so that
they do not look like they selected an inferior product.
How many places can you call Routines?
There are four places you can call a routine:
1. Transform of a routine: a. Date Transformation b. Upstring Transformation
2. Transform of the Before & After Subroutines
3. XML transformation
4. Web-based transformation
What is the Batch Program and how is it generated? A batch program is a program generated at run time and maintained by Datastage itself, but you can easily change it on the basis of your requirement (Extraction, Transformation, Loading). Batch programs are generated depending on your job's nature, either a simple job or a sequencer job; you can see this program under the job control option.
Suppose that 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after the run only 5,000 rows have been loaded in the target table, the remaining rows are not loaded and the job aborts. How can you sort out the problem?
Suppose the job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this condition you should go to the Director and check what type of problem is showing: a data type problem, a warning message, a job fail or a job abort. If the job fails it means a data type problem or a missing column action. So you should go to the Run window -> click -> Tracing -> Performance, or in your target table -> General -> Action -> select this option. Here there are two options:
(i) On Fail: Commit, Continue (ii) On Skip: Commit, Continue.
First check how much data has already been loaded, then select the On Skip option with Continue; for the remaining data that was not loaded, select On Fail with Continue. Run the job again and you should get a success message.
What happens if RCP is disabled? In such a case OSH has to perform the import and export every time the job runs, and the processing time of the job is also increased.
What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?
A. Under Windows: Use the 'WaitForFileActivity' stage under the Sequencers and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.
What are Sequencers?
Sequencers are job control programs that execute other jobs with preset Job parameters.
How did you handle an 'Aborted' sequencer?
In almost all cases we have to delete the data inserted by it from the database manually, fix the job and then run the job again.
What is the difference between the Filter stage and the Switch stage?
Ans: There are two main differences, and probably some minor ones as well. The two main differences are as follows:
1) The Filter stage can send one input row to more than one output link. The Switch stage cannot - the C switch construct has an implicit break in every case.
2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)
How can I achieve constraint-based loading using DataStage 7.5? My target tables have interdependencies, i.e. primary key / foreign key constraints. I want my primary key tables to be loaded first and then my foreign key tables, and also the primary key tables should be committed before the foreign key tables are executed. How can I go about it?
Ans: 1) Create a job sequencer to load your tables in sequential mode. In the sequencer, call all the primary key table loading jobs first, followed by the foreign key tables; trigger the foreign key table load jobs only when the primary key load jobs run successfully (i.e. an OK trigger).
2) To improve the performance of the job, you can disable all the constraints on the tables and load them. Once loading is done, check the integrity of the data; raise whatever does not meet it as exceptional data and cleanse it.
This is only a suggestion; normally, loading while the constraints are up will drastically degrade performance.
3) If you use star schema modeling, when you create the physical DB from the model you can delete all the constraints, and referential integrity would be maintained in the ETL process by referring to all your dimension keys while loading the fact tables. Once all dimension keys are assigned to a fact, the dimension and fact can be loaded together. At the same time RI is maintained at the ETL process level.

How do you merge two files in DS?
Ans: Either use the Copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.
How do you eliminate duplicate rows?
Ans: DataStage provides a Remove Duplicates stage in the Enterprise Edition. Using that
stage we can eliminate duplicates based on a key column.
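Outside DataStage, the keep-first-per-key behaviour of a Remove Duplicates stage can be sketched in a few lines of Python (the column and key names here are made up for illustration, not from any real job):

```python
# Keep the first row seen for each value of a key column,
# mimicking what a Remove Duplicates stage does on its input.
def remove_duplicates(rows, key):
    seen = set()
    out = []
    for row in rows:
        k = row[key]
        if k not in seen:  # first occurrence wins; later ones are dropped
            seen.add(k)
            out.append(row)
    return out

rows = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
    {"cust_id": 1, "name": "Ann (dup)"},
]
print(remove_duplicates(rows, "cust_id"))
```

In the real stage, the input is normally hash-partitioned and sorted on the key so that duplicates land in the same partition.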
How do you pass a filename as a parameter to a job?
Ans: During job development we can create a parameter 'FILE_NAME', and the value can be passed in while running the job.
Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?
Ans: Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement - you can only export full projects from the command line. You can find the export and import executables on the client machine, usually someplace like C:\Program Files\Ascential\DataStage.
What is the difference between the JOIN stage and the MERGE stage?
JOIN: performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
MERGE: combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.

What are the advantages of DataStage?
Business advantages: it helps in making better business decisions; it is able to integrate data coming from all parts of the company; it helps to understand new and already existing clients; we can collect data of different clients and compare them; it makes the research of new business possibilities possible; we can analyze trends in the data it reads.
Technological advantages: it handles all company data and adapts to the needs;
it offers the possibility of organizing complex business intelligence; it is flexible and scalable; it accelerates the running of the project; it is easily implementable.

What is the architecture of DataStage?
Basically the architecture of DS is a client/server architecture, with client components & server components.
Client components are of 4 types:
1. DataStage Designer 2. DataStage Administrator 3. DataStage Director 4. DataStage Manager
DataStage Designer is used to design the jobs. DataStage Manager is used to import & export the project and to view & edit the contents of the repository. DataStage Administrator is used for creating projects, deleting projects & setting the environment variables. DataStage Director is used to run the jobs, validate the jobs and schedule the jobs.
Server components:
DS Server: runs executable server jobs, under the control of the DS Director, that extract, transform, and load data into a DWH.
DS Package Installer: a user interface used to install packaged DS jobs and plug-ins.
Repository or project: a central store that contains all the information required to build a DWH or data mart.

What are the stages you worked on?

I have some jobs that every month automatically delete the log details. What steps do you have to take for that?
We have to set the autopurge option in DS Administrator.

I want to run multiple jobs in a single job. How can you handle that?
In job properties, set the option ALLOW MULTIPLE INSTANCES.
What is version controlling in DS?
In DS, version controlling is used for backing up the project or jobs. This option is available from DS version 7.1 onwards. Version controls are of 2 types:
1. VSS - Visual SourceSafe 2. CVSS - Concurrent Visual SourceSafe.
VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users have to wait until the first user completes the operation. With CVSS, many users can access it concurrently. Compared to VSS, the cost of CVSS is high.

What is the difference between clearing the log file and clearing the status file?
Clear log - we can clear the log details by using the DS Director; under the Job menu the Clear Log option is available. By using this option we can clear the log details of a particular job.
Clear status file - lets the user remove the status of the record associated with all stages of selected jobs (in DS Director).

I developed 1 job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?
By using the Usage Analysis tool, which is available in DS Manager, we can find out what items are used in the job.

My job takes 30 minutes to run; I want to run the job in less than 30 minutes. What steps should we take?
By using the performance tuning features available in DS, we can reduce the time. Tuning aspects:
In DS Administrator: in-process and inter-process
In between passive stages: inter-process (IPC) stage
OCI stage: array size and transaction size
Also use the Link Partitioner & Link Collector stages between passive stages.

How do you do row transposition in DS?
The Pivot stage is used for transposition. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.

If a job is locked by some user, how can you unlock that particular job in DS?
We can unlock the job by using the Clean Up Resources option, which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.

I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the value of X?
The value of X is zero. The Iconv function converts a string to an internal storage format; it takes 31 DEC 1967 as zero and counts days from that date (31-dec-1967).
What are unit testing, integration testing and system testing?
Unit testing: for DS, a unit test will check data type mismatches, the size of a particular data type, and column mismatches.
Integration testing: according to the dependencies, all jobs are integrated into one sequence; that is called a control sequence.
System testing: system testing is nothing but the performance tuning aspects in DS.

What are the command line functions that import and export DS jobs?
dsimport.exe - to import DataStage components
dsexport.exe - to export DataStage components

How many hashing algorithms are available for static hash files and dynamic hash files?
Sixteen hashing algorithms for static hash files; two hashing algorithms for dynamic hash files (GENERAL or SEQ.NUM).

What happens when you have a job that links two passive stages together?
Obviously there is some process going on. Under the covers DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

What is the use of the Nested Condition activity?
Nested Condition allows you to further branch the execution of a sequence depending on a condition.

I have three jobs A, B, C which are dependent on each other. I want to run the A & C jobs daily, and job B runs only on Sunday. How can you do it?
First schedule the A & C jobs Monday to Saturday in one sequence. Next, take the three jobs according to their dependency in one more sequence and schedule that sequence only on Sunday.

What are the ways to execute DataStage jobs?
A job can be run using a few different methods:
from DataStage Director (menu Job -> Run now...)
from the command line using a dsjob command
from a DataStage routine, which can run a job (DSRunJob command)
by a job sequencer

How do you invoke a DataStage shell command?
DataStage shell commands can be invoked from:
DataStage Administrator (Projects tab -> Command)
a telnet client connected to the DataStage server

How do you stop a job when its status is running?
To stop a running job, go to DataStage Director and click the Stop button (or Job -> Stop from the menu). If that doesn't help, go to Job -> Cleanup Resources, select a process which holds a lock and click Logout. If it still doesn't help, go to the DataStage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).

How do you run and schedule a job from the command line?
To run a job from the command line, use the dsjob command.
Command syntax: dsjob [-file <file> | [-server <server>][-user <user>][-password <password>]] <project> <job>
The command can be placed in a batch file and run by a system scheduler.

How do you release a lock held by jobs?
Go to the DataStage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).
What are the user privileges for the default DataStage roles?
The role privileges are:
DataStage Developer - a user with full access to all areas of a DataStage project
DataStage Operator - has privileges to run and manage deployed DataStage jobs
-none- - no permission to log on to DataStage

What is the command to analyze a hashed file?
There are two ways to analyze a hashed file; both should be invoked from the DataStage command shell. These are:
the FILE.STAT command
the ANALYZE.FILE command
Is it possible to run two versions of DataStage on the same PC?
Yes, even though different versions of DataStage use different system DLL libraries. To dynamically switch between DataStage versions, install and run the DataStage Multi-Client Manager. That application can unregister and register the system libraries used by DataStage.

Error in Link Collector - Stage does not support in-process active-to-active inputs or outputs
To get rid of the error, go to Job Properties -> Performance and select Enable row buffer. Then select Inter process, which will let the Link Collector run correctly. A buffer size of 128 KB should be fine; however, it's a good idea to increase the timeout.

What is the DataStage equivalent of the LIKE option in Oracle?
The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';
can be written in DataStage (for example as a constraint expression):
incol.empname matches '...WHT080...'

What is the difference between the logging text and the final text message in a Terminator stage?
Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.
Informational - a green line, a DSLogInfo() type message.
The Final Warning Text - the red fatal message, which is included in the sequence abort message.
Error in STP stage - SOURCE procedures must have an output link
The error appears in the Stored Procedure (STP) stage when there are no stages going out of that stage. To get rid of it, go to 'stage properties' -> 'Procedure type' and select Transform.
How to invoke an Oracle PL/SQL stored procedure from a server job
To run a PL/SQL procedure from DataStage, a Stored Procedure (STP) stage can be used. However, it needs a flow of at least one record to run. It can be designed in the following way:
a source ODBC stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
a transformer which passes that record through. If required, add the PL/SQL procedure parameters as columns on the right-hand side of the transformer's mapping.
Put a Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. In the input tab select 'execute procedure for each row' (it will be run once).
Design of a DataStage server job with Oracle plsql procedure call
Is it possible to run a server job in parallel? Yes, even server jobs can be run in parallel. To do that, go to 'Job properties' -> General and check the Allow Multiple Instance option. The job can then be run simultaneously from one or many sequence jobs. When that happens, DataStage will create new entries in the Director and each new job will be named with an automatically generated suffix (for example, the second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by DataStage on the next run.
Error in STP stage - STDPROC property required for stage xxx
The error appears in the Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs even if the Procedure call syntax is filled in correctly. To get rid of the error, fill in the 'Procedure name' field.
Datastage routine to open a text file with error catching
Note! work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END
Datastage routine which reads the first line from a text file
Note! work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END
READSEQ FILE1.RECORD FROM H.FILE1 ELSE
   Call DSLogWarn("******************** File is empty", "JobControl")
END
firstline = Trim(FILE1.RECORD[1,32]," ","A") ;******* will read the first 32 chars
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

How do you test a datastage routine or transform?
To test a DataStage routine or transform, go to the DataStage Manager. Navigate to Routines, select the routine you want to test and open it. First compile it and then click 'Test...', which will open a new window. Enter the test parameters in the left-hand column and click Run All to see the results. DataStage will remember all the test arguments during future tests.
When should hashed files be used? What are the benefits of using them? Hashed files are the best way to store data for lookups. They're very fast when looking up key-value pairs. Hashed files are especially useful for storing dictionary-like reference data (customer details, countries, exchange rates). Stored this way, the data can be shared across the project and accessed from different jobs.
How to construct a container and deconstruct it, or switch between local and shared? To construct a container, go to DataStage Designer, select the stages that should be included in the container and from the main menu select Edit -> Construct Container, then choose between local and shared. Local will only be visible in the current job; shared can be re-used. Shared containers can be viewed and edited in DataStage Manager under the 'Routines' menu. Local DataStage containers can be converted at any time to shared containers in DataStage Designer by right-clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.
What are the corresponding DataStage data types for Oracle types?
Most of the DataStage variable types map very well to Oracle types. The biggest problem is mapping the Oracle NUMBER(x,y) format correctly. The best way to do that in DataStage is to convert the Oracle NUMBER format to the DataStage Decimal type and fill in the Length and Scale columns accordingly. There are no problems with string mappings: Oracle Varchar2 maps to DataStage Varchar, and Oracle Char to DataStage Char.

How do you adjust the commit interval when loading data to the database?
In earlier versions of DataStage the commit interval could be set up in General -> Transaction size (in version 7.x it's obsolete). Starting from DataStage 7.x it can be set up in the properties of the ODBC or ORACLE stage, in Transaction handling -> Rows per transaction. If set to 0, the commit will be issued at the end of a successful transaction.

What is the use of the INROWNUM and OUTROWNUM datastage variables?
@INROWNUM and @OUTROWNUM are internal DataStage variables which do the following:
@INROWNUM counts incoming rows to a transformer in a DataStage job
@OUTROWNUM counts outgoing rows from a transformer in a DataStage job
These variables can be used to generate sequences, primary keys, IDs, numbering rows, and also for debugging and error tracing. They play a similar role to sequences in Oracle.

The Datastage Trim function cuts out more characters than expected
By default the DataStage Trim function works this way: Trim(" a b c d ") will return "a b c d" with all repeated internal spaces collapsed, while in many other programming/scripting languages only the leading and trailing spaces would be removed. That is because by default the R parameter is assumed, and R removes leading and trailing occurrences of the character and reduces multiple internal occurrences to a single occurrence. To remove only the leading and trailing blanks, use the trim function in the following way: Trim(" a b c d "," ","B")

Database update actions in the ORACLE stage
The destination table can be updated using various update actions in the Oracle stage. Be aware that it is crucial to select the key columns properly, as that determines which columns appear in the WHERE part of the SQL statement. Available actions:
Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
Insert rows without clearing - only adds new rows (INSERT statement).
Delete existing rows only - deletes matched rows (issues only the DELETE statement).
Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
Update existing rows only - updates existing rows (UPDATE statement).
Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
User-defined SQL - the data is written using a user-defined SQL statement.
User-defined SQL file - the data is written using a user-defined SQL statement from a file.

Use and examples of the ICONV and OCONV functions?
ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an
output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion )
Some useful Iconv and Oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"
That expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100,"MD2")
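The "MD2$," conversion above treats its input as an integer number of cents and scales it down by two digits. A rough Python equivalent (illustrative only, not the Oconv implementation):

```python
def oconv_md2_dollar(n):
    """Scale an integer cent value down by 2 digits and format it with a
    dollar sign and thousands separators, like Oconv(n, "MD2$,")."""
    return "${:,.2f}".format(n / 100)

print(oconv_md2_dollar(12003005))  # $120,030.05
```

The result matches the Oconv(12003005, "MD2$,") = "$120,030.05" example shown in the list above.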
Iconv and Oconv can be combined in one expression to reformat a date easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

ERROR 81021 Calling subroutine DSR_RECORD ACTION=2
The DataStage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.
The problem appears when a job sequence is used and it contains many stages (usually more than 10), and very often when a network connection is slow. Basically the cause of the problem is a failure in the communication between the DataStage client and the server. The solution to the issue is:
Do not log in to DataStage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.
If the above does not help, execute the DS.REINDEX ALL command from the DataStage shell.

How do you check DataStage internal error descriptions?
To check the description of an error number, go to the DataStage shell (from the Administrator or via telnet to the server machine) and invoke the following command:
SELECT * FROM SYS.MESSAGE WHERE @ID='081021';
where in this case the number 081021 is the error number. The command will produce a brief error description which probably will not be helpful in resolving the issue, but can be a good starting point for further analysis.
Error timeout waiting for mutex
The error message usually looks as follows: ... ds_ipcgetnext() - timeout waiting for mutex
There may be several reasons for the error and thus several solutions to get rid of it. The error usually appears when using the Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file, or if a job is very complex with the use of many transformers. There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters - get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it's very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

ERROR 30107 Subroutine failed to complete successfully
The DataStage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers
The problem appears when a project is moved from one environment to another (for example when deploying a project from a development environment to production). The solution to the issue is:
Rebuild the repository index by executing the DS.REINDEX ALL command from the DataStage shell.

Datastage Designer hangs when editing job activity properties
This appears when running DataStage Designer under Windows XP after installing patches or Service Pack 2 for Windows. After opening a job sequence and navigating to the job activity properties window, the application freezes and the only way to close it is from the Windows Task Manager. The solution to the problem is very simple: just download and install the XP SP2 patch for the DataStage client.
It can be found on the IBM client support site (log-in required): https://www.ascential.com/eservice/public/welcome.do. Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section. Sometimes users face problems when trying to log in (for example when the license doesn't cover IBM Active Support); then it may be necessary to contact IBM support, which can be reached at [email protected]
Can Datastage use Excel files as a data input?
Microsoft Excel spreadsheets can be used as data input in DataStage. Basically there are two possible approaches available:
Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the DataStage server machine and using an ODBC stage in DataStage. The main disadvantage is that it is impossible to do this on a Unix machine. On DataStage servers operating in Windows it can be set up here:
Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Driver do Microsoft Excel (.xls) -> Provide a Data source name -> Select the workbook -> OK
Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a sequential stage in DataStage to read the data.
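The save-as-CSV route can be prototyped outside DataStage. A minimal Python sketch of reading such a file the way a sequential stage would, with the first row supplying the column names (the sample data is made up for illustration):

```python
import csv
import io

# Simulate a CSV file exported from Excel; in a real scenario this would be
# an open() call on a file sitting on the DataStage server.
csv_text = "cust_id,name\n1,Ann\n2,Bob\n"

with io.StringIO(csv_text) as f:
    reader = csv.DictReader(f)  # header row becomes the column names
    rows = list(reader)

print(rows[0]["name"])  # Ann
```

Note that, like a sequential stage without explicit metadata, everything arrives as strings; type conversion is a separate step.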
Parallel processing
Datastage jobs are highly scalable due to the implementation of parallel processing. The EE
architecture is process-based (rather than thread processing), platform independent and uses the
processing node concept. Datastage EE is able to execute jobs on multiple CPUs (nodes) in
parallel and is fully scalable, which means that a properly designed job can run across resources
within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP
architecture (massively parallel processing).
Partitioning and Pipelining
Partitioning means breaking a dataset into smaller sets and distributing them evenly across the
partitions (nodes). Each partition of data is processed by the same operation and transformed in
the same way.
The main outcome of using a partitioning mechanism is linear scalability. This means, for instance, that once the data is evenly distributed, a 4-CPU server will process the data four times faster than a single-CPU machine.
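The even-distribution idea can be sketched in Python: a toy round-robin partitioner (a conceptual illustration, not the DataStage engine):

```python
def round_robin_partition(rows, nodes):
    """Distribute rows evenly across the given number of partitions,
    the way a round-robin partitioning method spreads data over nodes."""
    partitions = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        partitions[i % nodes].append(row)  # row i goes to node i mod nodes
    return partitions

parts = round_robin_partition(list(range(100)), 4)
print([len(p) for p in parts])  # [25, 25, 25, 25] - even load on every node
```

With an even spread like this, each of the 4 partitions can be processed independently and in parallel, which is where the linear speed-up comes from.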
Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed
simultaneously, not sequentially. The key concept of ETL Pipeline processing is to start the
Transformation and Loading tasks while the Extraction phase is still running.
Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel
processing. The concept is hidden from a Datastage programmer. The job developer only
chooses a method of data partitioning and the Datastage EE engine will execute the partitioned
and parallelized processes.
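The pipelining idea - transformation starting while extraction is still producing rows - can be illustrated with Python generators (a conceptual sketch, not DataStage internals):

```python
def extract():
    # Extraction yields rows one at a time instead of
    # materializing the whole source first.
    for i in range(5):
        yield {"id": i, "value": i * 10}

def transform(rows):
    # Transformation consumes each row as soon as extract() yields it.
    for row in rows:
        row["value"] += 1
        yield row

def load(rows):
    # Loading starts before extraction has finished producing rows.
    return [row["value"] for row in rows]

print(load(transform(extract())))  # [1, 11, 21, 31, 41]
```

No stage in this chain waits for the previous one to finish its whole data set, which is exactly the contrast with sequential (staged) ETL execution.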
Differences between Datastage Enterprise and Server Edition
1. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, however their capabilities differ significantly. In rough outline:
o Parallel jobs are executable datastage programs, managed and controlled by the Datastage Server runtime environment
o Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no manual intervention is needed to implement those techniques optimally.
o Parallel jobs are a lot faster in ETL tasks like sorting, filtering and aggregating
2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators - instances of executable C++ classes, pre-built components representing the stages used in Datastage jobs. Server jobs are compiled into BASIC, which is an interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.
3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance record-level and column-level format properties.
4. Datastage EE also brings completely new stages implementing the parallel concept, for example:
o Enterprise Database Connectors for Oracle, Teradata & DB2
o Development and Debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
o Data Set, File Set, Complex Flat File, Lookup File Set ...
o Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...
5. When processing large data volumes, Datastage EE jobs would be the right choice; however, when dealing with a smaller data environment, using Server jobs might just be easier to develop, understand and manage. When a company has both Server and Enterprise licenses, both types of jobs can be used.
6. Sequence jobs are the same in Datastage EE and Server editions.
What is the difference between DS 7.5 & 8.1?
The newer version of DS is 8.x; it supports Quality Stage, Profile Stage etc., and it also contains a web browser based console.
1. To implement SCD we have a separate stage (the SCD stage).
2. We don't have the Manager client tool in version 8; it is incorporated into the Designer itself.
3. There is no need to hardcode the parameters for every job; we have an option called a Parameter Set. If we create a parameter set, we can call the parameter set for the whole project, a job, or a sequence.
What happens when a job is compiling?
During compilation of a DataStage parallel job there is very high CPU and memory utilization on the server, and the job may take a very long time to compile.
What is APT_CONFIG in DS?
APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.
The APT config file is used to store the node information; it contains the disk storage information and the scratch disk information, and DataStage understands the architecture of the system based on this config file. For parallel processing, normally at least two nodes are defined.
Anyway, the APT_CONFIG_FILE (not just APT_CONFIG) is the configuration file that defines the nodes (and the scratch and temp areas) for the specific project.

Is it possible to add extra nodes in the configuration file?
Yes. The configuration file is a plain text file, so additional nodes can be defined in it, and the degree of parallelism is picked up at run time without recompiling the jobs.
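For illustration, a minimal two-node configuration file might look like the sketch below (the hostname and paths are made up; real files are normally created with the Configurations editor):

```
{
  node "node1" {
    fastname "etlserver"
    pools ""
    resource disk "/ds/data/node1" {pools ""}
    resource scratchdisk "/ds/scratch/node1" {pools ""}
  }
  node "node2" {
    fastname "etlserver"
    pools ""
    resource disk "/ds/data/node2" {pools ""}
    resource scratchdisk "/ds/scratch/node2" {pools ""}
  }
}
```

Adding another node block of the same shape is all that is needed to raise the degree of parallelism for jobs run against this file.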
What is RCP and how does it work?
Runtime Column Propagation (RCP) is used in cases of partial schema usage: when we only know about the columns to be processed and want all other columns propagated to the target as they are. We check the enable-RCP option in the Administrator, on the output page Columns tab, or on the stage page General tab, and we then only need to specify the schema of the columns we are concerned with.
According to the documentation, Runtime Column Propagation allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define just the columns you are interested in using in a job and ask DataStage to propagate the other columns through the various stages, so such columns can be extracted from the data source and end up on your data target without being explicitly operated on in between.
Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set, External Source, External Target, Column Import, and Column Export.
Runtime Column Propagation can also be used with the Column Import stage. If RCP is enabled in our project we can define only the columns we are interested in, and DataStage will carry the rest of the columns through the various stages. This ensures such columns reach the target even though they are not used by the stages in between.
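For example, a schema file describing all the columns of a comma-delimited sequential file might look like this sketch (the column names are hypothetical; the syntax follows the DataStage schema-file format):

```
record
  {final_delim=end, delim=','}
(
  cust_id: int32;
  cust_name: string[max=30];
  cust_city: string[max=20];
)
```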
What are star schema and snowflake schema, and what is the difference?

Star Schema                                Snowflake Schema
De-normalized data structure               Normalized data structure
Category-wise single dimension table       Dimension table split into many pieces
More data dependency and redundancy        Less data dependency and no redundancy
No need for complicated joins              Complicated joins
Faster query results                       Some delay in query processing
No parent table                            May contain parent tables
Simple DB structure                        Complicated DB structure
What is the difference between OLTP and a data warehouse?
The OLTP database records transactions in real time and aims to automate clerical data entry processes
of a business entity. Addition, modification and deletion of data in the OLTP database is essential and
the semantics of the application used at the front end influence the organization of the data in the database.
The data warehouse on the other hand does not cater to real time operational requirements of the
enterprise. It is more a storehouse of current and historical data and may also contain data extracted
from external data sources.
However, the data warehouse supports OLTP system by providing a place for the latter to offload data
as it accumulates and by providing services which would otherwise degrade the performance of the
database.
Differences between a data warehouse database and an OLTP database:
Data warehouse database
Designed for analysis of business measures by categories and attributes
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Loaded with consistent, valid data; requires no real time validation
Supports few concurrent users relative to OLTP
OLTP database
Designed for real time business operations.
Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Optimized for validation of incoming data during transactions; uses validation data tables.
Supports thousands of concurrent users.
What is data modelling?
The analysis of data objects and their relationships to other data objects. Data modeling is often
the first step in database design and object-oriented programming as the designers first create a
conceptual model of how data items relate to each other. Data modeling involves a progression
from conceptual model to logical model to physical schema.
Data modelling is the process of identifying entities, the relationship between those entities and
their attributes. There are a range of tools used to achieve this such as data dictionaries, decision
trees, decision tables, schematic diagrams and the process of normalisation.
How to retrieve the second highest salary?
select ename, esal from
  ( select ename, esal, rownum rn from
      ( select ename, esal from hsal order by esal desc ) )
where rn = 2;
What is the difference between egrep and fgrep?
There is a difference: fgrep cannot search for regular expressions in a string; it is used for plain, fixed-string matching. egrep can search with regular expressions too.
grep covers both cases: by default it uses basic regular expressions, with the -E option it behaves like egrep, and with the -F option it behaves like fgrep. Hence the most convenient one to use is grep.
fgrep = "fixed grep".
fgrep searches for fixed strings only. The "f" does not stand for "fast"; in fact, "fgrep foobar *.c" is usually slower than "egrep foobar *.c" (yes, this is kind of surprising; try it).
fgrep still has its uses, though, and may be useful when searching a file for a larger number of strings than egrep can handle.
egrep = "extended grep".
egrep uses fancier regular expressions than grep. Many people use egrep all the time, since it has more sophisticated internal algorithms than grep or fgrep and is usually the fastest of the three programs.
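A quick sketch from the shell illustrates the difference (the file name and contents are made up for the demonstration; grep -F corresponds to fgrep and grep -E to egrep):

```shell
# Sample file: one line with a literal dot, one where the dot position holds an X
printf 'foo.bar\nfooXbar\n' > demo.txt

# Fixed-string match (fgrep): the dot is literal, so only the first line matches
grep -F 'foo.bar' demo.txt

# Extended regex (egrep): the dot matches any character, so both lines match
grep -E 'foo.bar' demo.txt
```

On this file, counting matches with -c reports 1 line for the fixed-string search and 2 lines for the regex search.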
CHMOD command?
Permissions
u - User who owns the file.
g - Group that owns the file.
o - Other.
a - All.
r - Read the file.
w - Write or edit the file.
x - Execute or run the file as a program.
Numeric permissions:
chmod can also be invoked using numeric permissions:
400 read by owner
040 read by group
004 read by anybody (other)
200 write by owner
020 write by group
002 write by anybody
100 execute by owner
010 execute by group
001 execute by anybody
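The two forms can be sketched as follows (the file name is arbitrary):

```shell
# Create a demo file
touch demo.sh

# Symbolic form: add execute for the user (owner), remove write for group and other
chmod u+x,go-w demo.sh

# Numeric form: 754 = owner rwx (400+200+100), group r-x (040+010), other r (004)
chmod 754 demo.sh

ls -l demo.sh   # the mode column now shows -rwxr-xr--
```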
What is the difference between DataStage 7.5 and 8.1?
The main difference is that the DataStage Manager client is merged into the Designer in 8.1, and the following are new in 8.1: the SCD Type 2 stage, data connection objects, parameter sets, QualityStage, and range lookup.
What is the difference between internal sort and external sort?
Performance-wise, the internal sort is best because it does not use any buffer, whereas the external sort takes buffer memory to store records.
How do you pass only the required number of records through partitions?
Go to Job Properties > Execution, enable trace compile, and give the required number of records.
What happens when a job is compiling?
1. All processing stages generate OSH code.
2. Transformers generate C++ code in the background.
3. The job information is updated in the metadata repository.
4. The job is compiled.
What is APT_CONFIG in DataStage?
It points to the configuration file which defines the parallelism of our jobs.
How many types of parallelism and partitioning are there?
Two system types: SMP and MPP.
Is it possible to add extra nodes in the configuration file?
Yes, it is possible: go to the configuration management tool, where you find the APT configuration file, and edit it for your required number of nodes.
What is RCP and how does it work?
Runtime Column Propagation is used to propagate the columns which are not defined in the metadata.
How does data move from one stage to another?
In the form of virtual datasets.
Is it possible to run multiple instances of a single job?
Yes: go to Job Properties, which has an Allow Multiple Instances option.
What is APT_DUMP_SCORE?
APT_DUMP_SCORE shows the operators, datasets, nodes, partitions, combinations and processes used in a job. It is an environment variable set through the Administrator.
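A sketch of enabling it for a single run from the shell (the dsjob invocation is commented out because the project and job names are hypothetical):

```shell
# Export the variable so the next parallel job run dumps its score to the job log
export APT_DUMP_SCORE=1

# $DSHOME/bin/dsjob -run MyProject MyJob   # hypothetical project/job names

echo "$APT_DUMP_SCORE"   # prints 1
```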
Pipeline parallelism:
Each stage works on a separate processor, so a downstream stage can start consuming rows while the upstream stage is still producing them.
What is the difference between Job Control and a Job Sequence?
Job control is used specifically to control jobs: through it we can pass parameters, conditions, log-file information, dashboard information, load recovery, and so on.
A job sequence is used to run a group of jobs based on some conditions. For final/incremental processing we keep all the jobs in one sequence and run them together by setting triggers.
What is the maximum size of the Data Set stage (PX)? There is no fixed limit; it is bounded only by the available disk space.
How to improve performance in the Sort stage?
If the source is an Oracle database, you can write a user-defined query to sort and remove duplicates in the source itself, and by maintaining suitable key-partitioning techniques you can improve performance.
If that is not the case, it is better to use key-partitioning techniques in the Sort stage, keeping the same partitioning as in the previous stage. Do not allow duplicates: remove duplicates and give a unique partitioning key.
How to develop SCD using the LOOKUP stage?
We can implement SCD using the LOOKUP stage, but only SCD Type 1, not Type 2. We take the source (file or database) and a dataset as the reference link (for the lookup), then the LOOKUP stage; there we compare the source with the dataset and set the lookup-failure condition to Continue/Continue. After that, in a Transformer we apply the conditions, and then we take two targets, one for inserts and one for updates, where we manually write the SQL insert and update statements. If you see the design, you can easily understand it.
What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) and standard Ascential DataStage 7.5?
IBM Information Server, also known as DataStage 8, has more features, such as QualityStage and MetaStage. It maintains its repository in DB2, unlike the file-based repository in 7.5. It also has a stage specifically for SCD Types 1 and 2.
I think there is no version called "standard Ascential DataStage 7.5"; I know only the advanced edition, WebSphere DataStage and QualityStage, released by IBM with version 8.0.1. In this release there are only three client tools (Administrator, Designer, Director); the Manager has been removed and its import/export functions are included in the Designer. Some extra stages have been added, such as the SCD stage, with which we can implement SCD Type 1 and Type 2 directly, along with other advanced stages.
They have also included QualityStage, which is used for data validation, something very important for data warehousing. There are so many things available in QualityStage that we can think of it as a separate tool for data warehousing.
What errors have you experienced with DataStage?
In DataStage, warnings and fatal errors appear in the log file. If there is a fatal error, the job aborts; warnings do not abort the job, but we have to handle them as well, since the log file should finish clear of warnings.
Many different errors come up in different jobs, for example:
Parameter not found in job load recover.
Child job failed because of some .....
Control job failed because of some .....
....etc.
What are the main differences between server jobs and parallel jobs in DataStage?
In server jobs we have few stages; they are mainly logic-intensive, we use the Transformer for most things, and they do not use MPP systems.
In parallel jobs we have many stages; they are stage-intensive, there are built-in stages for particular tasks, and they use MPP systems.
***********************************************************
In server jobs we do not have an option to process the data on multiple nodes as we do in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas there is no such concept in server jobs.
There are also many differences in how the same stages behave in server and parallel jobs. For example, in a parallel job a sequential file (or any other file stage) can have either an input link or an output link, but in a server job it can have both (and more than one of each).
********************************************************************
Server jobs compile and run within the DataStage server, but parallel jobs compile and run within the DataStage UNIX server.
A server job extracts all the rows from the source into the next stage; only then does that stage become active and pass the rows on towards the target or data warehouse. This is time-consuming.
In parallel jobs there are two types of parallelism:
1. Pipeline parallelism
2. Partition parallelism
1. With pipeline parallelism, some rows can be extracted from the source into the next stage while, at the same time, that stage is active and passing rows on towards the target or data warehouse. It maintains only one node between source and target.
2. Partition parallelism maintains more than one node between source and target.
Why do you need the Modify stage?
When you can handle null handling and data-type changes in ODBC stages, why do you need the Modify stage?
It is used to change data types: if the source contains a varchar and the target an integer, we use the Modify stage to convert it according to the requirement. We can also make some modifications to the length.
In short, the Modify stage is used for data-type changes.
What is the difference between the Sequential File stage and the Data Set stage? When do you use them?
a) The Sequential File stage is used for sequential-file formats, while the Data Set stage is used for the parallel engine's own (random-access) format.
b) Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating-system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director. In a data set, the data is stored in an internal format: we can view it through the View Data facility in DataStage, but it cannot be viewed from Linux or the back-end system. Data in a sequential file can be viewed anywhere. Extraction of data from a data set is much faster than from a sequential file.
How can we improve the performance of a job while handling huge amounts of data?
a) Minimize the Transformer stages. If the reference table has a huge amount of data, use the Join stage; if it has a small amount of data, use a Lookup.
b) This requires job-level or server-level tuning.
At the job level we can do the following:
Use Join for huge amounts of data rather than Lookup.
Use the Modify stage rather than a Transformer for simple transformations.
Sort the data before the Remove Duplicates stage.
Server-level tuning can only be done with adequate knowledge of the server-level parameters that can improve execution performance.
How can we create read-only jobs in DataStage?
a) By creating a protected project. In a protected project all jobs are read-only and cannot be modified.
b) A job can also be made read-only by the following process:
Export the job in .dsx format and change the attribute which stores the read-only information from 0 (editable job) to 1 (read-only job), then import the job again, overwriting or renaming the existing job so you have both forms.
There are three kinds of routines in DataStage:
1. Server routines, used in server jobs; these are written in the BASIC language.
2. Parallel routines, used in parallel jobs; these are written in C/C++.
3. Mainframe routines, used in mainframe jobs.
DataStage Parallel routines made really easy
http://blogs.ittoolbox.com/dw/soa/archives/datastage-parallel-routines-made-really-easy-20926
How will you determine the sequence of jobs to load into the data warehouse?
First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregate tables (if any).
The sequence of jobs can also be determined from the parent-child relationships among the target tables: a parent table always needs to be loaded before its child tables.
Error while connecting to DataStage Administrator?
Go to Settings > Control Panel > User Accounts and create a new user with a password. Restart your computer and log in with the new user name. Try using the new user name in DataStage, and you should be able to connect.
DataStage: deleting the header and footer on a source sequential file
How do you delete the header and footer on a source sequential file, and how do you create a header and footer on a target sequential file, using DataStage?
In the Designer palette, under Development/Debug, we find the Head and Tail stages; using these we can do it.
How can we implement Slowly Changing Dimensions in DataStage?
a) We can implement SCD in DataStage:
1. Type 1 SCD: insert-else-update in the ODBC stage.
2. Type 2 SCD: insert new rows if the primary key is the same, and update the effective-from date to the job run date and the to-date to some maximum date.
3. Type 3 SCD: insert the old value into a separate column and update the existing column with the new value.
b) We can also implement SCD using the Lookup stage and the Change Capture stage.
We have three types of SCD:
Type 1: maintains the current values only.
Type 2: maintains both current and historical values.
Type 3: maintains the current and partial historical values.
Differentiate database data and data warehouse data?
a) Data in a database is:
Detailed or transactional.
Both readable and writable.
Current.
b) By "database" one means OLTP (On-Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.
c) Database data is in the form of OLTP, and data warehouse data is in the form of OLAP. OLTP is for transaction processing and OLAP is for analysis purposes.
d) A data warehouse:
Contains current and historical data.
Holds highly summarized data.
Follows denormalization and a dimensional model.
Is non-volatile.
How do you run a shell script within the scope of a DataStage job?
By using the "ExecSH" command in the before/after-job subroutine properties.
What is the difference between DataStage and Informatica?
a) The main difference between DataStage and Informatica is scalability: Informatica is more scalable than DataStage.
b) In my view DataStage is also scalable; the difference lies in the number of built-in functions, which makes DataStage more user-friendly.
c) In my view, DataStage has fewer transformers compared to Informatica, which can make it harder to work with.
d) The main difference is the vendors. Each one has strengths from its architecture. For DataStage it is a top-down approach. Based on the business