Data stage FAQ Questions


DATASTAGE FAQs & TUTORIALs TOPIC INDEX

1. DATASTAGE QUESTIONS
2. DATASTAGE FAQ from GEEK INTERVIEW QUESTIONS
3. DATASTAGE FAQ
4. TOP 10 FEATURES IN DATASTAGE HAWK
5. DATASTAGE NOTES
6. DATASTAGE TUTORIAL
   About DataStage
   Client Components: DataStage Designer, DataStage Director, DataStage Manager, DataStage Administrator, DataStage Manager Roles
   Server Components
   DataStage Features
   Types of jobs
   DataStage NLS
   JOB
   Built-In Stages - Server Jobs: Aggregator, Hashed File, UniVerse, UniData, ODBC, Sequential File, Folder Stage, Transformer, Container, IPC Stage, Link Collector Stage, Link Partitioner Stage
   Server Job Properties
   Containers: Local containers, Shared containers
   Job Sequences
7. LEARN FEATURES OF DATASTAGE
8. INFORMATICA vs DATASTAGE
9. BEFORE YOU DESIGN YOUR APPLICATION
10. DATASTAGE 7.5x1 GUI FEATURES
11. DATASTAGE & DWH INTERVIEW QUESTIONS
12. DATASTAGE ROUTINES
13. SET_JOB_PARAMETERS_ROUTINE

    DATASTAGE QUESTIONS

    1. What is the flow of loading data into fact & dimensional tables?

A) Fact table - Table with a collection of Foreign Keys corresponding to the Primary Keys in the Dimension tables. Consists of fields with numeric values.
Dimension table - Table with a Unique Primary Key.
Load - Data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data is then loaded into the Fact table.

    2. What is the default cache size? How do you change the cache size if needed?

A. The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.

    3. What are types of Hashed File?

A) Hashed Files are broadly classified into 2 types:
a) Static - subdivided into 17 types based on the Primary Key pattern.
b) Dynamic - subdivided into 2 types: i) Generic ii) Specific.


Dynamic files do not perform as well as a well-designed static file, but they do perform better than a badly designed one. When creating a dynamic file you can specify a number of settings (although all of these have default values).

By default, a Hashed File is "Dynamic - Type Random 30 D".

    4. What does a Config File in parallel extender consist of?

A) The Config file consists of the following:
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

    5. What is Modulus and Splitting in Dynamic Hashed File?

A. In a Hashed File, the size of the file keeps changing randomly.
If the size of the file increases it is called "Modulus".
If the size of the file decreases it is called "Splitting".

    6. What are Stage Variables, Derivations and Constants?

A. Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value into the target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constraint - A condition that is either true or false and that specifies the flow of data within a link.

    7. Types of views in Datastage Director?

There are 3 types of views in Datastage Director:
a) Job View - Dates of jobs compiled.
b) Status View - Status of the job's last run.
c) Log View - Warning messages, event messages, program-generated messages.

    8. Types of Parallel Processing?

A) Parallel Processing is broadly classified into 2 types:
a) SMP - Symmetrical Multi Processing.
b) MPP - Massively Parallel Processing.

    9. Orchestrate Vs Datastage Parallel Extender?

A) Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (Beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.

    10. Importance of Surrogate Key in Data warehousing?

A) A Surrogate Key is a Primary Key for a Dimension table. The main advantage of using it is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by changes going on in the database.

    11. How to run a Shell Script within the scope of a Data stage job?

    A) By using "ExcecSH" command at Before/After job properties.


12. How to handle Date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

    A) We use a) "Iconv" function - Internal Conversion.b) "Oconv" function - External Conversion.

The function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
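A small illustration of the round trip, as it might appear in a Transformer derivation or routine (the literal date is arbitrary):

    InternalDate = Iconv("12/31/2005", "D/MDY[2,2,4]")  ;* parse the external date into the internal day number
    DisplayDate = Oconv(InternalDate, "D-YDM[4,2,2]")   ;* format it back out as "2005-31-12" (yyyy-dd-mm)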

13. How do you execute a datastage job from the command line prompt?

A) Using the "dsjob" command as follows:
dsjob -run -jobstatus projectname jobname

    14. Functionality of Link Partitioner and Link Collector?

Link Partitioner: It splits data into various partitions or data flows using various partition methods.
Link Collector: It collects the data coming from the partitions, merges it into a single data flow and loads it to the target.

    15. Types of Dimensional Modeling?

A) Dimensional modelling is again subdivided into 2 types:
a) Star Schema - Simple & much faster. Denormalized form.
b) Snowflake Schema - Complex with more granularity. More normalized form.

    16. Differentiate Primary Key and Partition Key?

A Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of partitioning like Hash, DB2, Random etc. While using Hash partitioning we specify the Partition Key.

    17. Differentiate Database data and Data warehouse data?

A) Data in a Database is:
a) Detailed or Transactional
b) Both Readable and Writable
c) Current
Data in a Data Warehouse, by contrast, is typically detailed or summarized historical data, subject-oriented and read-only (non-volatile).

    18. Containers Usage and Types?

A Container is a collection of stages used for the purpose of reusability.
There are 2 types of Containers:

a) Local Container: Job specific.
b) Shared Container: Can be used in any job within a project.

    19. Compare and Contrast ODBC and Plug-In stages?

ODBC: a) Poor performance.
b) Can be used for a variety of databases.
c) Can handle Stored Procedures.


Plug-In: a) Good performance.
b) Database specific (only one database).
c) Cannot handle Stored Procedures.

    20. Dimension Modelling types along with their significance

Data Modelling is broadly classified into 2 types:
a) E-R Diagrams (Entity-Relationship).
b) Dimensional Modelling.

Q 21 What are the Ascential DataStage products and connectivity options?

Ans: Ascential Products

Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions

    Connectivity

Files
RDBMS
Real-time
PACKs
EDI
Other

    Q 22 Explain Data Stage Architecture?

Data Stage contains two types of components:

Client Components.

Server Components.

    Client Component:

    Data Stage Administrator.

    Data Stage Manager

    Data Stage Designer

    Data Stage Director

    Server Components:

    Data Stage Engine

    Meta Data Repository


    Package Installer

    Data Stage Administrator:

Used to create projects. Contains a set of properties.

We can set the buffer size (by default 128 MB).
We can increase the buffer size.
We can set the Environment Variables.
In Tunables we have in-process and inter-process:
In-process - data is read in sequentially.
Inter-process - it reads the data as it comes.
It just interfaces to metadata.

    Data Stage Manager:

We can view and edit the Meta Data Repository.
We can import table definitions.
We can export the DataStage components in .xml or .dsx format.
We can create routines and transforms.
We can compile multiple jobs.


    Data Stage Designer:

We can create the jobs. We can compile the jobs. We can run the jobs. We can declare stage variables in transforms; we can call routines, transforms, macros and functions.

We can write constraints.

    Data Stage Director:

We can run the jobs.
We can schedule the jobs. (Scheduling can be done daily, weekly, monthly or quarterly.)
We can monitor the jobs.
We can release the jobs.

    Q 23 What is Meta Data Repository?

Meta Data is data about the data. It also contains:

Query statistics

ETL statistics

    Business subject area

    Source Information

    Target Information

    Source to Target mapping Information.

    Q 24 What is Data Stage Engine?


It is the engine that runs in the background on the DataStage server and executes server jobs (it is derived from the UniVerse database engine rather than being a Java engine).

    Q 25 What is Dimensional Modeling?

Dimensional Modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access.

    Q 26 What is Star Schema?

A Star Schema is a de-normalized multi-dimensional model. It contains centralized fact tables surrounded by dimension tables.
Dimension Table: It contains a primary key and descriptive attributes related to the facts.
Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

    Q 27 What is surrogate Key?

It is a 4-byte integer which replaces the transaction / business / OLTP key in the dimension table. We can store up to 2 billion records.

Q 28 Why do we need a surrogate key?

It is used for integrating data coming from different sources and serves better than the natural primary key. It helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

    Q 29 What is Snowflake schema?

It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchy-related tables.

    Q 30 Explain Types of Fact Tables?

Factless Fact: It contains only foreign keys to the dimension tables.
Additive Fact: Measures can be added across any dimensions.
Semi-Additive: Measures can be added across some dimensions. E.g., percentage, discount.
Non-Additive: Measures cannot be added across any dimensions. E.g., average.
Conformed Fact: The definition and the measures of the two fact tables are the same, so the facts can be measured across the dimensions with the same set of measures.

    Q 31 Explain the Types of Dimension Tables?

Conformed Dimension: If a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.
Junk Dimension: A dimension table which contains only flags.
Monster Dimension: A very large dimension that changes rapidly is known as a Monster Dimension.

Degenerate Dimension: A dimension attribute (for example an order or invoice number) stored in the fact table itself, typical of line-item-oriented fact table designs.

    Q 32 What are stage variables?

Stage variables are declared in the Transformer stage and are used to store values. Stage variables are active at run time (because memory is allocated at run time).

    Q 33 What is sequencer?

    It sets the sequence of execution of server jobs.


    Q 34 What are Active and Passive stages?

Active Stage: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. E.g., Transformer, Aggregator, Sort, Row Merger etc.
Passive Stage: A passive stage handles access to databases for the extraction or writing of data. E.g., IPC stage, file types, UniVerse, UniData, DRS stage etc.

    Q 35 What is ODS?

    Operational Data Store is a staging area where data can be rolled back.

    Q 36 What are Macros?

They are built from DataStage functions and do not require arguments.
A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and the links and stages belonging to the current job. These can be used in expressions (for example in Transformer stages), job control routines, filenames and table names, and before/after subroutines.

These macros provide the functionality of using the DSGetProjectInfo, DSGetJobInfo, DSGetStageInfo, and DSGetLinkInfo functions with the DSJ.ME token as the JobHandle and can be used in all active stages and before/after subroutines. The macros provide the functionality for all the possible InfoType arguments for the DSGet...Info functions. See the Function call help topics for more details.

The available macros are:

DSHostName

DSProjectName

    DSJobStatus

    DSJobName

    DSJobController

    DSJobStartDate

    DSJobStartTime

    DSJobStartTimestamp

    DSJobWaveNo

    DSJobInvocations

    DSJobInvocationId

    DSStageName

    DSStageLastErr

    DSStageType

    DSStageInRowNum

    DSStageVarList

    DSLinkRowCount

    DSLinkLastErr

    DSLinkName

Examples

To obtain the name of the current job:

    MyName = DSJobName

    To obtain the full current stage name:

MyName = DSJobName : "." : DSStageName

Q 37 What is KeyMgtGetNextValue?

It is a built-in transform that generates sequential numbers. Its input type is a literal string and its output type is string.
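For example, a Transformer derivation for a surrogate key column might be written as follows (a sketch; the sequence name is a placeholder):

    KeyMgtGetNextValue("CUSTOMER_DIM")

Each call with the same literal sequence name returns the next number in that sequence, which is how the transform is typically used to populate surrogate keys in server jobs.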

    Q 38 What are stages?

Stages are either passive or active.
Passive stages handle access to databases for extracting or writing data. Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.

Q 39 What index is created on a Data Warehouse?

A Bitmap index is created in a Data Warehouse.

    Q 40 What is container?

A container is a group of stages and links. Containers enable you to simplify and modularize your server job designs by replacing complex areas of the diagram with a single container stage. You can also use shared containers as a way of incorporating server job functionality into parallel jobs. DataStage provides two types of container:

Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.

Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container: server shared containers and parallel shared containers.

Q 41 What is a function? (Job Control - Examples of Transform Functions)

    Functions take arguments and return a value.

BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax:

    FunctionName (argument,argument)
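For instance, two of the standard BASIC functions follow this syntax (a sketch; the values are arbitrary):

    MyString = Trim("   DataStage   ")   ;* removes surplus spaces, returns "DataStage"
    StrLen = Len(MyString)               ;* returns 9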

DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.

To do this ... Use this function ...

Specify the job you want to control: DSAttachJob
Set parameters for the job you want to control: DSSetParam
Set limits for the job you want to control: DSSetJobLimit
Request that a job is run: DSRunJob
Wait for a called job to finish: DSWaitForJob
Get the meta data details for the specified link: DSGetLinkMetaData
Get information about the current project: DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage: DSGetIPCStageProps
Get information about the controlled job or current job: DSGetJobInfo
Get information about the meta bag properties associated with the named job: DSGetJobMetaBag
Get information about a stage in the controlled job or current job: DSGetStageInfo
Get the names of the links attached to the specified stage: DSGetStageLinks
Get a list of stages of a particular type in a job: DSGetStagesOfType
Get information about the types of stage in a job: DSGetStageTypes
Get information about a link in a controlled job or current job: DSGetLinkInfo
Get information about a controlled job's parameters: DSGetParamInfo
Get the log event from the job log: DSGetLogEntry
Get a number of log events on the specified subject from the job log: DSGetLogSummary
Get the newest log event, of a specified type, from the job log: DSGetNewestLogId
Log an event to the job log of a different job: DSLogEvent
Stop a controlled job: DSStopJob
Return a job handle previously obtained from DSAttachJob: DSDetachJob
Log a fatal error message in a job's log file and abort the job: DSLogFatal
Log an information message in a job's log file: DSLogInfo
Put an info message in the job log of a job controlling the current job: DSLogToController
Log a warning message in a job's log file: DSLogWarn
Generate a string describing the complete status of a valid attached job: DSMakeJobReport
Insert arguments into the message template: DSMakeMsg
Ensure a job is in the correct state to be run or validated: DSPrepareJob
Interface to the system send mail facility: DSSendMail
Log a warning message to a job log file: DSTransformError
Convert a job control status or error code into an explanatory text message: DSTranslateCode
Suspend a job until a named file either exists or does not exist: DSWaitForFile
Check if a BASIC routine is cataloged, either in the VOC as a callable item, or in the catalog space: DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine: DSExecute
Set a status message for a job to return as a termination message when it finishes: DSSetUserStatus
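As a concrete illustration, a job control routine that runs one job with a parameter and checks its outcome might look like the sketch below. The job name, parameter name and values are placeholders, and error handling is kept to a minimum.

    * Attach the job we want to control (abort this routine if it cannot be attached)
    hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)

    * Set a job parameter and request a normal run
    ErrCode = DSSetParam(hJob, "RUN_DATE", "2005-12-31")
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)

    * Wait for the job to finish, then inspect its final status
    ErrCode = DSWaitForJob(hJob)
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)

    If Status = DSJS.RUNOK OR Status = DSJS.RUNWARN Then
       Call DSLogInfo("LoadCustomerDim finished OK", "JobControl")
    End Else
       * DSLogFatal logs the error and aborts the controlling job
       Call DSLogFatal("LoadCustomerDim failed", "JobControl")
    End

    * Release the handle obtained from DSAttachJob
    ErrCode = DSDetachJob(hJob)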

Q 42 What are Routines?

Routines are stored in the Routines branch of the Data Stage Repository, where you can create, view or edit them. The following programming components are classified as routines: Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX

(OLE) functions, Web Service routines.

Q 43 What is a DataStage Transform?

Q 44 What are MetaBrokers?

Q 45 What is usage analysis?

Q 46 What is a job sequencer?

Q 47 What are the different activities in a job sequencer?

Q 48 What are triggers in DataStage? (conditional, unconditional, otherwise)

Q 49 Have you generated job reports?

Q 50 What is a plug-in?

Q 51 Have you created any custom transform? Explain. (Oconv)


    DATASTAGE FAQ from GEEK INTERVIEW QUESTIONS

    Question: Dimension Modeling types along with their significance

    Answer:

Data Modelling is broadly classified into 2 types:
A) E-R Diagrams (Entity-Relationship).
B) Dimensional Modelling.

    Question: Dimensional modelling is again sub divided into 2 types.

    Answer:

A) Star Schema - Simple & much faster. Denormalized form.
B) Snowflake Schema - Complex with more granularity. More normalized form.

    Question: Importance of Surrogate Key in Data warehousing?

    Answer:

A Surrogate Key is a Primary Key for a Dimension table. The main advantage of using it is that it is independent of the underlying database, i.e. the Surrogate Key is not affected by changes going on in the database.

Question: Differentiate Database data and Data warehouse data?
Answer:

Data in a Database is:
A) Detailed or Transactional
B) Both Readable and Writable
C) Current
Data in a Data Warehouse, by contrast, is typically detailed or summarized historical data, subject-oriented and read-only (non-volatile).

    Question: What is the flow of loading data into fact & dimensional tables?

Answer:

Fact table - Table with a collection of Foreign Keys corresponding to the Primary Keys in the Dimension tables. Consists of fields with numeric values.

Dimension table - Table with a Unique Primary Key.

Load - Data should first be loaded into the dimension tables. Based on the primary key values in the dimension tables, the data is then loaded into the Fact table.

Question: Orchestrate Vs Datastage Parallel Extender?
Answer:
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. DataStage used Orchestrate with DataStage XE (Beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.

    Question: Differentiate Primary Key and Partition Key?

    Answer:

A Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of partitioning like Hash, DB2, Random etc. While using Hash partitioning we specify the Partition Key.

    Question: What are Stage Variables, Derivations and Constants?

    Answer:

Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value into the target column.
Constraint - A condition that is either true or false and that specifies the flow of data within a link.
Derivation - An expression that specifies the value to be passed on to the target column.

Question: What is the default cache size? How do you change the cache size if needed?
Answer:

The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab, and specifying the cache size there.

    Question: What is Hash file stage and what is it used for?

    Answer:

Used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.

    Question: What are types of Hashed File?

Answer:
A Hashed File is classified broadly into 2 types:
A) Static - subdivided into 17 types based on the Primary Key pattern.
B) Dynamic - subdivided into 2 types:
i) Generic
ii) Specific

The default Hashed File is "Dynamic - Type Random 30 D".


    Question: What are Static Hash files and Dynamic Hash files?

    Answer:

As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.

    Question: What is the Usage of Containers? What are its types?

    Answer:

A Container is a collection of stages used for the purpose of reusability.
There are 2 types of Containers:
A) Local Container: Job specific.
B) Shared Container: Can be used in any job within a project.

    Question: Compare and Contrast ODBC and Plug-In stages?

    Answer:

ODBC                                       PLUG-IN
Poor performance                           Good performance
Can be used for a variety of databases     Database specific (only one database)
Can handle Stored Procedures               Cannot handle Stored Procedures

Question: How do you execute a datastage job from the command line prompt?

    Answer:

    Using "dsjob" command as follows.dsjob -run -jobstatus projectname jobname

    Question: What are the command line functions that import and export the DS jobs?

    Answer:

    dsimport.exe - imports the DataStage components.

    dsexport.exe - exports the DataStage components.

    Question: How to run a Shell Script within the scope of a Data stage job?

    Answer:

    By using "ExcecSH" command at Before/After job properties.

    Question: What are OConv () and Iconv () functions and where are they used?

    Answer:

IConv() - Converts a string to an internal storage format.
OConv() - Converts an expression to an output format.

Question: How to handle Date conversions in Datastage? Convert mm/dd/yyyy format to yyyy-dd-mm?

Answer:

We use
a) the "Iconv" function - Internal Conversion.
b) the "Oconv" function - External Conversion.

The function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")


    Question: Types of Parallel Processing?

Answer:
Parallel Processing is broadly classified into 2 types:
a) SMP - Symmetrical Multi Processing.
b) MPP - Massively Parallel Processing.

    Question: What does a Config File in parallel extender consist of?

    Answer:

The Config file consists of the following:
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

    Question: Functionality of Link Partitioner and Link Collector?

    Answer:

Link Partitioner: It splits data into various partitions or data flows using various partition methods.

Link Collector: It collects the data coming from the partitions, merges it into a single data flow and loads it to the target.

    Question: What is Modulus and Splitting in Dynamic Hashed File?

    Answer:

In a Hashed File, the size of the file keeps changing randomly.
If the size of the file increases it is called "Modulus".
If the size of the file decreases it is called "Splitting".

    Question: Types of views in Datastage Director?

Answer:
There are 3 types of views in Datastage Director:
a) Job View - Dates of jobs compiled.
b) Status View - Status of the job's last run.
c) Log View - Warning messages, event messages, program-generated messages.

    Question: Did you Parameterize the job or hard-coded the values in the jobs?

    Answer:

Always parameterize the job. Either the values come from Job Properties or from a "Parameter Manager" - a third-party tool. There is no way you would hard-code some parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and dates with respect to the data to be looked up.

Question: Have you ever been involved in updating DS versions (like DS 5.X)? If so, tell us some of the steps you have taken in doing so.

    Answer:

Yes. The following are some of the steps:

Definitely take a backup of the whole project(s) by exporting the project as a .dsx file.

See that you are using the same parent folder for the new version as well, so that your old jobs using hard-coded file paths continue to work.

After installing the new version, import the old project(s); you have to compile them all again. You can use the 'Compile All' tool for this.

Make sure that all your DB DSNs are created with the same names as the old ones. This step is for moving DS from one machine to another.

In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on the DS CD that can do this for you.

Do not stop the 6.0 server before the upgrade; the version 7.0 install process collects project information during the upgrade. There is NO rework (recompilation of existing jobs/routines) needed after the upgrade.

    Question: How did you handle reject data?

    Answer:

Typically a Reject link is defined and the rejected data is loaded back into the data warehouse. So a Reject link has to be defined for every Output link from which you wish to collect rejected data. Rejected data is typically bad data like duplicates of primary keys or null rows where data is expected.

Question: What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?

    Answer:

Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance, and also for data recovery in case the job aborts.

Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

Tuned the 'Project Tunables' in Administrator for better performance.

Used sorted data for the Aggregator.

Sorted the data as much as possible in the DB and reduced the use of DS-Sort for better performance of jobs.

Removed the data not used from the source as early as possible in the job.

Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.

Converted some of the complex joins/business logic in DS to Stored Procedures on the database side for faster execution of the jobs.

If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.

Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories.

Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.

Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.

Tuning should occur on a job-by-job basis.

Use the power of the DBMS.

Try not to use a sort stage when you can use an ORDER BY clause in the database.

Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Question: Tell me one situation from your last project where you faced a problem and how did you solve it?

    Answer:

1. The jobs in which data is read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
2. The job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which the job had aborted. To make sure the load was proper we opted for the former.

    Question: Tell me the environment in your last projects

    Answer:

Give the OS of the Server and the OS of the Client of your most recent project.

Question: How did you connect with DB2 in your last project?

    Answer:

Most of the time the data was sent to us in the form of flat files. The data is dumped and sent to us. In some cases where we needed to connect to DB2 for look-ups, we used ODBC drivers to connect to DB2 (or) DB2-UDB depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as you know the native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC drivers to connect to AS400/DB2.

Question: What are Routines, where/how are they written, and have you written any routines before?

    Answer:

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are different types of Routines:

1. Transform Functions
2. Before-After Job subroutines
3. Job Control Routines

    Question: How did you handle an 'Aborted' sequencer?

    Answer:

In almost all cases we have to delete the data inserted by it from the DB manually, fix the job and then run the job again.

    Question: What are Sequencers?


    Answer:

    Sequencers are job control programs that execute other jobs with preset Job parameters.

    Question: Read the String functions in DS

    Answer:

Functions like [ ] -> the substring function and ':' -> the concatenation operator.
Syntax:
string [ [ start, ] length ]
string [ delimiter, instance, repeats ]
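A couple of worked examples of these two operators (a sketch; the variable names and values are arbitrary):

    Phone = "123-456-7890"
    Area = Phone[1, 3]                 ;* start/length substring -> "123"
    Last = Phone["-", 3, 1]            ;* 3rd "-"-delimited field -> "7890"
    Label = "(" : Area : ") " : Last   ;* concatenation -> "(123) 7890"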

Question: What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?

    Answer:

Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. Maybe you can schedule the sequencer around the time the file is expected to arrive.

Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.

Question: What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

    Answer:

Use the crontab utility along with the dsjob command (or a shell script that wraps it), with the proper parameters passed.
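A typical crontab entry might look like the line below. This is only a sketch: the install path, project and job names are placeholders, and the dsenv file is sourced first so the DataStage environment is set up before dsjob is called.

    30 2 * * * . /opt/Ascential/DataStage/DSEngine/dsenv && /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -jobstatus MyProject NightlyLoadSequence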

    Question: Did you work in UNIX environment?

    Answer:

    Yes. One of the most important requirements.

Question: How would you call an external Java function which is not supported by DataStage?

    Answer:

Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in a DataStage job.

    Question: How will you determine the sequence of jobs to load into data warehouse?

    Answer:

First we execute the jobs that load the data into the Dimension tables, then the Fact tables, and then load the Aggregator tables (if any).

Question: The above might raise another question: why do we have to load the dimension tables first, then the fact tables?

    Answer:

As we load the dimension tables the (primary) keys are generated, and these keys are the Foreign Keys in the Fact tables.


Question: Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic?

    Answer:

There is no TRUNCATE on ODBC stages. It is 'Clear table...' and that is a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have ALTER TABLE permissions whereas Delete doesn't).

Question: How do you rename all of the jobs to support your new file-naming conventions?

    Answer:

Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file. Then import the new dsx file, probably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs may also have to be changed in your job control jobs or Sequencer jobs, so you have to make the necessary changes to these Sequencers.

Question: When should we use ODS?
Answer:

DWHs are typically read-only and batch-updated on a schedule.
ODSs are maintained in more real time, trickle-fed constantly.

Question: What other ETL tools have you worked with?

    Answer:

    Informatica and also DataJunction if it is present in your Resume.

    Question: How good are you with your PL/SQL?

Answer: On a scale of 1-10, say 8.5-9.

Question: What versions of DS have you worked with?

    Answer:

    DS 7.5, DS 7.0.2, DS 6.0, DS 5.2

Question: What's the difference between DataStage Developers and DataStage Designers?
Answer:

A DataStage developer is one who codes the jobs. A DataStage designer is one who designs the jobs; I mean he deals with the blueprints and designs the jobs with the stages that are required in developing the code.

    Question: What are the requirements for your ETL tool?

    Answer:

Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday?
If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups?
If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do.
If you are small enough in your data sets, then either would probably be OK.

Question: What are the main differences between Ascential DataStage and Informatica PowerCenter?

    Answer:

Chuck Kelley's Answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski's Answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?

What add-ons are available for extracting data from industry-standard ERP, Accounting, and CRM packages?

Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

How well does each tool handle complex transformations, and how much external scripting is required?

What kinds of languages are supported for ETL script extensions?

Almost any ETL tool will look like any other on the surface. The trick is to find out which one will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product. Especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity.

Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the tool that have been encountered by that customer. Ultimately, ask each customer, if they had it all to do over again, whether or not they'd choose the same tool and why. You might be surprised at some of the answers.

Joyce Bischoff's Answer: You should do a careful research job when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the detailed requirements. There are numerous ETL products on the market and it seems that you are looking at only two of them. If you are unfamiliar with the many products available, you may refer to www.tdan.com, the Data Administration Newsletter, for product lists.

If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's. Ask both vendors and compare the answers, which may or may not be totally accurate. After you are very familiar with the products, call their references and be sure to talk with technical people who are actually using the product. You will not want the vendor to have a representative present when you speak with someone at the reference site. It is also not a good idea to depend upon a high-level manager at the reference site for a reliable opinion of the product. Managers may paint a very rosy picture of any selected product so that they do not look like they selected an inferior product.

Question: How many places can you call Routines from?
Answer: Routines can be called from four places:

1. Transform of a routine
   a. Date Transformation
   b. Upstring Transformation
2. Transform of the Before & After Subroutines
3. XML transformation
4. Web base transformation

Question: What is the Batch Program and how can it be generated?
Answer: A batch program is a program generated at run time and maintained by DataStage itself, but you can easily change it on the basis of your requirements (Extraction, Transformation, Loading). Batch programs are generated depending on your job nature, either a simple job or a sequencer job; you can see this program under the job control option.

Question: Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after the run only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job is aborted. How can you sort out the problem?

Answer: Suppose the job sequencer synchronizes or controls the 4 jobs, but job 1 has a problem. In this condition you should go to the Director and check what type of problem is being shown: either a data type problem, a warning message, a job failure or a job abort. If the job fails it means a data type problem or a missing column action. So you should go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action -> select the option. Here there are two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue
First check how much data has already been loaded, then select the On Skip option and Continue; for the remaining data that was not loaded select On Fail and Continue. Run the job again and you will definitely get a success message.

Question: What happens if RCP is disabled?

Answer:

In such a case OSH has to perform the import and export every time the job runs, and the processing time of the job also increases.


Question 34: What is the difference between the Filter stage and the Switch stage?

Ans: There are two main differences, and probably some minor ones as well. The two main differences are as follows.

1) The Filter stage can send one input row to more than one output link. The Switch stage cannot - the C switch construct has an implicit break in every case.

2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)

Question: How can I achieve constraint-based loading using DataStage 7.5? My target tables have interdependencies, i.e. primary key / foreign key constraints. I want my primary key tables to be loaded first and then my foreign key tables, and the primary key tables should also be committed before the foreign key tables are executed. How can I go about it?


Ans:
1) Create a Job Sequencer to load your tables in sequential mode. In the sequencer call all the primary key table loading jobs first, followed by the foreign key tables; when triggering the foreign key table load jobs, trigger them only when the primary key load jobs have run successfully (i.e. an OK trigger).

2) To improve the performance of the job, you can disable all the constraints on the tables and load them. Once loading is done, check the integrity of the data; raise whatever does not meet it as exceptional data and cleanse it.

This is only a suggestion; normally, when loading with constraints enabled, performance will go down drastically.

3) If you use star schema modeling, when you create the physical DB from the model you can delete all constraints, and the referential integrity would be maintained in the ETL process by referring to all your dimension keys while loading the fact tables. Once all dimension keys are assigned to a fact, then dimensions and facts can be loaded together. At the same time RI is being maintained at the ETL process level.

    Question: How do you merge two files in DS?

Ans: Either use the Copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.
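For the "same metadata" case, the concatenation can be done with a before-job subroutine, for example (a sketch; the file paths are placeholders):

    Before-job subroutine: ExecSH
    Input value: cat /data/in/sales_part1.txt /data/in/sales_part2.txt > /data/in/sales_full.txt

The job then reads the single concatenated file through one Sequential File stage.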

    Question: How do you eliminate duplicate rows?

Ans: DataStage provides us with a Remove Duplicates stage in Enterprise Edition. Using that stage we can eliminate duplicates based on a key column.

    Question: How do you pass filename as the parameter for a job?

Ans: During job development we can create a parameter 'FILE_NAME', and the value can be passed while running the job.
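For example (a sketch; the parameter, directory and job names are placeholders): define job parameters SOURCE_DIR and FILE_NAME in Job Properties, reference them in the Sequential File stage's file name property as

    #SOURCE_DIR#/#FILE_NAME#

and supply the value at run time, e.g.

    dsjob -run -param FILE_NAME=customers_20051231.dat MyProject LoadCustomers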


Question: Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?

Ans: Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement; you can only export full projects from the command line. You can find the export and import executables on the client machine, usually someplace like: C:\Program Files\Ascential\DataStage.

Question: Diff. between JOIN stage and MERGE stage.
Answer:

JOIN: Performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
MERGE: Combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required.

A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.

Question: What are the advantages of DataStage?

    Answer:

    Business advantages:

Helps for better business decisions;
It is able to integrate data coming from all parts of the company;
It helps to understand the new and already existing clients;
We can collect data of different clients with it, and compare them;
It makes the research of new business possibilities possible;
We can analyze trends of the data read by it.

    Technological advantages:

It handles all company data and adapts to the needs;
It offers the possibility for the organization of a complex business intelligence;
Flexible and scalable;
It accelerates the running of the project;

    Easily implementable.


    DATASTAGE FAQ

    1. What is the architecture of data stage?

Basically the architecture of DS is a client/server architecture.

    Client components & server components

Client components are of 4 types; they are:
1. Data stage designer
2. Data stage administrator
3. Data stage director
4. Data stage manager

Data stage designer is used to design the jobs.

Data stage manager is used to import & export the project and to view & edit the contents of the repository.

Data stage administrator is used for creating the project, deleting the project & setting the environment variables.

Data stage director is used to run the jobs, validate the jobs and schedule the jobs.

Server components

DS server: runs executable server jobs, under the control of the DS director, that extract, transform, and load data into a DWH.

DS Package installer: a user interface used to install packaged DS jobs and plug-ins.

Repository or project: a central store that contains all the information required to build a DWH or data mart.

2. What are the stages you have worked on?

3. I have some jobs where every month the log details should be deleted automatically. What steps do I have to take for that?

We have to set the Auto-purge option in DS Administrator.

4. I want to run multiple jobs in a single job. How can you handle that?

In job properties set the option ALLOW MULTIPLE INSTANCES.

    5. What is version controlling in DS?

In DS, version controlling is used for backing up the project or jobs. This option is available from DS 7.1 onwards.
Version controls are of 2 types:

1. VSS - Visual Source Safe
2. CVSS - Concurrent Visual Source Safe

VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users have to wait until the first user completes the operation.
With CVSS, many users can access it concurrently. Compared to VSS, the cost of CVSS is high.

    6. What is the difference between clear log file and clear status file?

Clear log - We can clear the log details by using the DS Director. Under the Job menu the Clear Log option is available. By using this option we can clear the log details of a particular job.

Clear status file - Lets the user remove the status of the record associated with all stages of selected jobs (in DS Director).

7. I developed a job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?

By using the Usage Analysis tool, which is available in DS Manager, we can find out what items are used in the job.


8. My job takes 30 minutes to run and I want to run it in less than 30 minutes. What steps do we have to take?

By using the performance tuning aspects which are available in DS, we can reduce the time.
Tuning aspects:

In DS Administrator: in-process and inter-process
In between passive stages: inter-process (IPC) stage
OCI stage: Array size and transaction size

And also use Link Partitioner & Link Collector stages in between passive stages.

9. How to do row transposition in DS?

The Pivot stage is used for transposition purposes. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.

10. If a job is locked by some user, how can you unlock the particular job in DS?

We can unlock the job by using the Clean Up Resources option which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.

    11. What is a container? How many types containers are available? Is it possible to

    use container as look up?

    A container is a group of stages and links. Containers enable you to simplify andmodularize your server job designs by replacing complex areas of the diagram with asingle container stage.DataStage provides two types of container: Local containers. These are created within a job and are only accessible by that jobonly. Shared containers. These are created separately and are stored in the Repository in thesame way that jobs are. Shared containers can use any job in the project.

Yes, we can use a container as a lookup.

12. How do you deconstruct a shared container?

To deconstruct a shared container, first you have to convert the shared container to a local container, and then deconstruct that local container.

13. I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the value of X?

The value of X is zero. The Iconv function converts a string to an internal storage format. DataStage treats 31 DEC 1967 as day zero and counts days from that date (31-12-1967).
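A minimal DataStage BASIC sketch of this day-zero behaviour (the variable names here are illustrative only, not from the original):

    X = Iconv("31 DEC 1967", "D")    ;* X = 0, DataStage's internal day zero
    Y = Iconv("01 JAN 1968", "D")    ;* Y = 1, one day after day zero
    D = Oconv(X, "D")                ;* converts back to external format: 31 DEC 1967

Oconv is the reverse of Iconv: it converts the internal day number back into an external date string.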


14. What are unit testing, integration testing, and system testing?

Unit testing: for DS, a unit test checks for data type mismatches, the size of each data type, and column mismatches.

Integration testing: according to their dependencies, all the jobs are integrated into one sequence; that is called a control sequence.

System testing: system testing is nothing but the performance tuning aspects in DS.

    15. What are the command line functions that import and export the DS jobs?

dsimport.exe - to import DataStage components
dsexport.exe - to export DataStage components

16. How many hashing algorithms are available for static hash files and dynamic hash files?

Sixteen hashing algorithms for static hash files, and two hashing algorithms for dynamic hash files (GENERAL or SEQ.NUM).

    17. What happens when you have a job that links two passive stages together?

Obviously there is some process going on. Under the covers, DS inserts a cut-down Transformer stage between the passive stages, which just passes data straight from one stage to the other.

18. What is the use of the Nested Condition activity?

Nested Condition: allows you to further branch the execution of a sequence depending on a condition.

19. I have three jobs A, B, and C, which are dependent on each other. I want to run jobs A and C daily, while job B runs only on Sunday. How can you do it?

First, schedule jobs A and C in one sequence to run Monday to Saturday. Next, put all three jobs, ordered according to their dependencies, into another sequence and schedule that sequence to run only on Sunday.


    TOP 10 FEATURES IN DATASTAGE HAWK

The IILive2005 conference marked the first public presentations of the functionality in the WebSphere Information Integration Hawk release. Though it's still a few months away, I am sharing my top ten things I am looking forward to in DataStage Hawk:

1) The metadata server. To borrow a simile from that judge on American Idol: "Using MetaStage is kind of like bathing in the ocean on a cold morning. You know it's good for you but that doesn't stop it from freezing the crown jewels." MetaStage is good for ETL projects, but none of the projects I've been on has actually used it. Too much effort is required to install the software, set up the metabrokers, migrate the metadata, and learn how the product works and write reports. Hawk brings the common repository and improved metadata reporting, and we can get the positive effects of bathing in sea water without the shrinkage that comes with it.

2) QualityStage overhaul. Data quality reporting can be another forgotten aspect of data integration projects. Like MetaStage, the QualityStage server and client had an additional install, training and implementation overhead, so many DataStage projects did not use it. I am looking forward to more integration projects using standardisation, matching and survivorship to improve quality once these features are more accessible and easier to use.

3) Frictionless Connectivity and Connection Objects. I've called DB2 every rude name under the sun. Not because it's a bad database, but because setting up remote access takes me anywhere from five minutes to five weeks, depending on how obscure the error message is and how hard it is to find the obscure setup step that was missed during installation. Anything that makes connecting to a database easier gets a big tick from me.

4) Parallel job range lookup. I am looking forward to this one because it will stop people asking for it on forums. It looks good, it's been merged into the existing lookup form, and it seems easy to use. It will be interesting to see the performance.


5) Slowly Changing Dimension stage. This is one of those things that Informatica was able to trumpet at product comparisons: that they have more out-of-the-box DW support. There are a few enhancements to make updates to dimension tables easier: there is the improved surrogate key generator, there is the Slowly Changing Dimension stage, and updates can be passed to in-memory lookups. That's it for me with DBMS-generated keys; I'm only doing the keys in the ETL job from now on! DataStage server jobs have the hash file lookup where you can read and write to it at the same time; parallel jobs will have the updateable lookup.

6) Collaboration: better developer collaboration. Everyone hates opening a job and being told it is locked. "Bloody what's-his-name has gone to lunch, locked the job and now his password-protected screen saver is up! Unplug his PC!" Under Hawk you can open a read-only copy of a locked job, plus you get told who has locked the job, so you know whom to curse.

7) Session disconnection. Accompanied by the metallic cry of "Exterminate! Exterminate!", an administrator can disconnect sessions and unlock jobs.

8) Improved SQL Builder. I know a lot of people cross the street when they see the SQL Builder coming. Getting the SQL Builder to build complex SQL is a bit like teaching a monkey how to play chess. What I do like about the current SQL Builder is that it synchronises your SQL select list with your ETL column list to avoid column mismatches. I am hoping the next version is more flexible and can build complex SQL.

9) Improved job startup times. Small parallel jobs will run faster. I call it the death of a thousand cuts: your very large parallel job takes too long to run because a thousand smaller jobs are starting and stopping at the same time and cutting into CPU and memory. Hawk makes these cuts less painful.

10) Common logging. Log views that work across jobs, log searches, log date constraints, wildcard message filters, saved queries. It's all good. You no longer need to send out a search party to find an error message.

That's my top ten. I am also hoping the software comes in a box shaped like a hawk and makes a hawk scream when you open it, a bit like those annoying greeting cards. Is there any functionality you think Hawk is missing that you really want to see?


    DATASTAGE NOTES

    DataStage Tips:

1. The Aggregator stage does not support more than one stream input; if you try to add another, you will get the error "The destination stage cannot support any more stream input links".

2. You can give any number of input links to a Transformer stage, but you cannot use a Sequential File stage as a reference link. You can give only one Sequential File stage as the primary link, and the other input links must be reference links. If you try to connect a Sequential File stage as a reference link, you will get the error "The destination stage cannot support any more stream input links", because a reference link represents a lookup table; a sequential file cannot be used as a lookup table, but a hashed file can.

    Sequential file stage:

The Sequential File stage is provided by DataStage to access data in a sequential (text) file.

The access mechanism of a sequential file is sequential order.

    We cannot use a sequential file as a lookup.

The problem with a sequential file is that we cannot directly filter rows, and queries are not supported.

    Update actions in sequential file:

Overwrite existing file (radio button).

Append to existing file (radio button).

Backup existing file (check box).

    Hashed file stage:

The Hashed File stage is used to store data in a hashed file.

A hashed file is similar to a text file, but the data is organized using a hashing algorithm.

Basically, a hashed file is used for lookup purposes.


Retrieval of data from a hashed file is faster because it uses a hashing algorithm.

    Update actions in Hashed file:

Clear file before writing.

    Backup existing file.

(Unlike the Sequential File stage options, all of these are check boxes.)

    DATABASE Stages:

    ODBC Stage:

    ODBC Stage Stage Page:

You can use an ODBC stage to extract, write, or aggregate data. Each ODBC stage can have any number of inputs or outputs. Input links specify the data you are writing. Output links specify the data you are extracting and any aggregations required. You can specify the data on an input link using an SQL statement constructed by DataStage, a generated query, a stored procedure, or a user-defined SQL query.

GetSQLInfo is used to get the quote character and schema delimiters of your data source. Optionally specify the quote character used by the data source. By default, this is set to " (double quotes). You can also click the Get SQLInfo button to connect to the data source and retrieve the quote character it uses. An entry of 000 (three zeroes) specifies that no quote character should be used. Optionally specify the schema delimiter used by the data source. By default this is set to . (period), but you can specify a different schema delimiter, or multiple schema delimiters. If, for example, identifiers have the form Node:Schema.Owner;TableName, you would enter :.; into this field. You can also click the Get SQLInfo button to connect to the data source and retrieve the schema delimiter it uses.

NLS tab: you can define a character set map for an ODBC stage using the NLS tab of the ODBC stage.

    The ODBC stage can handle the following SQL Server data types:

GUID, Timestamp, SmallDateTime

    ODBC Stage Input Page:


Update action. Specifies how the data is written. Choose the option you want from the drop-down list box:

Clear the table, then insert rows. Deletes the contents of the table and adds the new rows.

Insert rows without clearing. Inserts the new rows in the table.

Insert new or update existing rows. New rows are added or, if the insert fails, the existing rows are updated.

Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.

Update existing rows only. Updates the existing data rows. If a row with the supplied key does not exist in the table, then the table is not updated but a warning is logged.

Update existing or insert new rows. The existing data rows are updated or, if this fails, new rows are added.

Call stored procedure. Writes the data using a stored procedure. When you select this option, the Procedure name field appears.

User-defined SQL. Writes the data using a user-defined SQL statement. When you select this option, the View SQL tab is replaced by the Enter SQL tab.

Create table in target database. Select this check box if you want to automatically create a table in the target database at run time. A table is created based on the defined column set for this stage. If you select this option, an additional tab, Edit DDL, appears. This shows the SQL CREATE statement to be used for table generation.

Transaction Handling. This page allows you to specify the transaction handling features of the stage as it writes to the ODBC data source. You can choose whether to use transaction grouping or not, specify an isolation level, the number of rows written before each commit, and the number of rows written in each operation.

Isolation levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable, Versioning, and Auto-Commit.

Rows per transaction field. This is the number of rows written before the data is committed to the data table. The default value is 0, that is, all the rows are written before being committed to the data table.

Parameter array size field. This is the number of rows written at a time. The default is 1, that is, each row is written in a separate operation.

    ODBC Stage Output Page:

    ==

    PROCESSING Stages:

TRANSFORMER Stage:


Transformer stages do not extract data or write data to a target database. They are used to handle extracted data, perform any conversions required, and pass data to another Transformer stage or a stage that writes data to a target data table.

Transformer stages can have any number of inputs and outputs. The link from the main data input source is designated the primary input link. There can only be one primary input link, but there can be any number of reference inputs.

    Input Links

The main data source is joined to the Transformer stage via the primary link, but the stage can also have any number of reference input links.

A reference link represents a table lookup. These are used to provide information that might affect the way the data is changed, but do not supply the actual data to be changed.

Reference input columns can be designated as key fields. You can specify key expressions that are used to evaluate the key fields. The most common use for a key expression is to specify an equi-join, which is a link between a primary link column and a reference link column. For example, if your primary input data contains names and addresses, and a reference input contains names and phone numbers, the reference link name column is marked as a key field and the key expression refers to the primary link's name column. During processing, the name in the primary input is looked up in the reference input. If the names match, the reference data is consolidated with the primary data. If the names do not match, i.e., there is no record in the reference input whose key matches the expression given, all the columns specified for the reference input are set to the null value.
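As a hedged illustration of such a key expression (the link names CustomerIn and PhoneRef and the column NAME are invented for this sketch, not taken from the original), the key expression entered against the reference link's NAME key column would simply reference the primary link column:

    CustomerIn.NAME

This tells the Transformer to look up the PhoneRef row whose NAME value equals the incoming CustomerIn.NAME value; if no such row exists, the PhoneRef columns are set to null for that row.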

Where a reference link originates from a UniVerse or ODBC stage, you can look up multiple rows from the reference table. The rows are specified by a foreign key, as opposed to a primary key used for a single-row lookup.

    Output Links

    You can have any number of output links from your Transformer stage.

You may want to pass some data straight through the Transformer stage unaltered, but it's likely that you'll want to transform data from some input columns before outputting it from the Transformer stage.

You can specify such an operation by entering a BASIC expression or by selecting a transform to apply to the data. DataStage has many built-in transforms, or you can define your own custom transforms that are stored in the Repository and can be reused as required.

The source of an output link column is defined in that column's Derivation cell within the Transformer Editor. You can use the Expression Editor to enter expressions or transforms in this cell. You can also simply drag an input column to an output column's Derivation cell, to pass the data straight through the Transformer stage.
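A small sketch of what a derivation expression might look like (the link name CustomerIn and the column FIRST_NAME are assumptions made for this sketch); this is ordinary DataStage BASIC entered in the Derivation cell:

    If Trim(CustomerIn.FIRST_NAME) = "" Then "UNKNOWN" Else UpCase(Trim(CustomerIn.FIRST_NAME))

Trim and UpCase are built-in functions; the expression outputs "UNKNOWN" when the incoming name is blank and an upper-cased, trimmed copy of it otherwise.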

In addition to specifying derivation details for individual output columns, you can also specify constraints that operate on entire output links. A constraint is a BASIC expression that specifies criteria that data must meet before it can be passed to the output link. You can also specify a reject link, which is an output link that carries all the data not output on other links, that is, rows that have not met the criteria.

Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input row, the data row is not output on that link.

Constraint expressions on different links are independent. If you have more than one output link, an input row may result in a data row being output from some, none, or all of the output links.

For example, if you consider the data that comes from a paint shop, it could include information about any number of different colors. If you want to separate the colors into different files, you would set up different constraints. You could output the information about green and blue paint on LinkA, red and yellow paint on LinkB, and black paint on LinkC.

When an input row contains information about yellow paint, the LinkA constraint expression evaluates to FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion for LinkB and the row is output on LinkB.

If the input data contains information about white paint, this does not satisfy any constraint and the data row is not output on links A, B or C, but will be output on the reject link. The reject link is used to route data to a table or file that is a catch-all for rows that are not output on any other link. The table or file containing these rejects is represented by another stage in the job design.
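Hedged constraint expressions matching the paint example above (the input link name PaintIn and the column COLOUR are assumptions made for this sketch):

    LinkA constraint:  PaintIn.COLOUR = "GREEN" Or PaintIn.COLOUR = "BLUE"
    LinkB constraint:  PaintIn.COLOUR = "RED" Or PaintIn.COLOUR = "YELLOW"
    LinkC constraint:  PaintIn.COLOUR = "BLACK"

A white-paint row fails all three expressions, so it is carried by the reject link if one is defined.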

    Before-Stage and After-Stage Routines

Because the Transformer stage is an active stage type, you can specify routines to be executed before or after the stage has processed the data. For example, you might use a before-stage routine to prepare the data before processing starts. You might use an after-stage routine to send an electronic message when the stage has finished.

    Specifying the Primary Input Link

The first link to a Transformer stage is always designated as the primary input link. However, you can choose an alternative link to be the primary link if necessary. To do this:

1. Select the current primary input link in the Diagram window.
2. Choose Convert to Reference from the Diagram window shortcut menu.


3. Select the reference link that you want to be the new primary input link.
4. Choose Convert to Stream from the Diagram window shortcut menu.

    ==

AGGREGATOR Stage:

Aggregator stages classify data rows from a single input link into groups and compute totals or other aggregate functions for each group. The summed totals for each group are output from the stage via an output link.

If you want to aggregate the input data in a number of different ways, you can have several output links, each specifying a different set of properties to define how the input data is grouped and summarized.

    ==

FOLDER Stage:

Folder stages are used to read or write data as files in a directory located on the DataStage server.

The Folder stage can read multiple files from a single directory and can deliver the files to the job as rows on an output link. The Folder stage can also write rows of data as files to a directory. The rows arrive at the stage on an input link.

Note: the behavior of the Folder stage when reading folders that contain other folders is undefined.

In an NLS environment, the user running the job must have write permission on the folder so that the NLS map information can be set up correctly.

    Folder Stage Input Data

The properties are as follows:

Preserve CRLF. When Preserve CRLF is set to Yes, field marks are not converted to newlines on write. It is set to Yes by default.


The Columns tab defines the data arriving on the link to be written as files to the directory. The first column on the Columns tab must be defined as a key, and gives the name of the file. The remaining columns are written to the named file, each column separated by a newline. Data to be written to a directory would normally be delivered in a single column.

    Folder Stage Output Data

The properties are as follows:

Sort order. Choose from Ascending, Descending, or None. This specifies the order in which the files are read from the directory.

Wildcard. This allows for simple wildcarding of the names of the files found in the directory. Any occurrence of * (asterisk) or ... (three periods) is treated as an instruction to match any or no characters.

Preserve CRLF. When Preserve CRLF is set to Yes, newlines are not converted to field marks on read. It is set to Yes by default.

Fully qualified. Set this to Yes to have the full path name of each file written in the key column instead of just the file name.

The Columns tab defines a maximum of two columns. The first column must be marked as the Key and receives the file name. The second column, if present, receives the contents of the file.

    ==

    IPC Stage:

An inter-process (IPC) stage is a passive stage which provides a communication channel between DataStage processes running simultaneously in the same job. It allows you to design jobs that run on SMP systems with great performance benefits. To understand the benefits of using IPC stages, you need to know a bit about how DataStage jobs actually run as processes; see DataStage Jobs and Processes.

The output link connecting the IPC stage to the stage reading data can be opened as soon as the input link connected to the stage writing data has been opened. You can use inter-process stages to join passive stages together. For example, you could use them to speed up data transfer between two data sources:

In this example the job will run as two processes: one handling the communication from the Sequential File stage to the IPC stage, and one handling communication from the IPC stage to the ODBC stage. As soon as the Sequential File stage has opened its output link, the IPC stage can start passing data to the ODBC stage. If the job is running on a multi-processor system, the two processes can run simultaneously, so the transfer will be much faster.

    Defining IPC Stage Properties

The Properties tab allows you to specify two properties for the IPC stage:

Buffer Size. Defaults to 128 KB. The IPC stage uses two blocks of memory; one block can be written to while the other is read from. This property defines the size of each block, so that by default 256 KB is allocated in total.

Timeout. Defaults to 10 seconds. This gives a time limit for how long the stage will wait for a process to connect to it before timing out. This normally will not need changing, but may be important where you are prototyping multi-processor jobs on single-processor platforms and there are likely to be delays.

    ==

LINK PARTITIONER Stage:

The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links. The stage expects the output links to use the same meta data as the input link.

Partitioning your data enables you to take advantage of a multi-processor system and have the data processed in parallel. It can be used in conjunction with the Link Collector stage to partition data, process it in parallel, then collect it together again before writing it to a single target. To really understand the benefits you need to know a bit about how DataStage jobs are run as processes; see DataStage Jobs and Processes.

In order for this job to compile and run as intended on a multi-processor system, you must have inter-process buffering turned on, either at project level using the DataStage Administrator, or at job level from the Job Properties dialog box.

    Before-Stage and After-Stage Subroutines

The General tab on the Stage page contains optional fields that allow you to define routines which are executed before or after the stage has processed the data.


Before-stage subroutine and Input Value. Contain the name (and value) of a subroutine that is executed before the stage starts to process any data. For example, you can specify a routine that prepares the data before processing starts.

After-stage subroutine and Input Value. Contain the name (and value) of a subroutine that is executed after the stage has processed the data. For example, you can specify a routine that sends an electronic message when the stage has finished.

Choose a routine from the drop-down list box. This list box contains all the routines defined as a Before/After Subroutine under the Routines branch in the Repository. Enter an appropriate value for the routine's input argument in the Input Value field.

If you choose a routine that is defined in the Repository but which was edited and not compiled, a warning message reminds you to compile the routine when you close the Link Partitioner Stage dialog box.

A return code of 0 from the routine indicates success; any other code indicates failure and causes a fatal error when the job is run.
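A minimal sketch of the body of such a routine, assuming it is created in the Manager as a Before/After Subroutine with the usual InputArg and ErrorCode arguments (the routine name and message text below are invented for illustration):

    * Log a message and report success back to DataStage.
    Call DSLogInfo("Stage finished; input argument was: " : InputArg, "AfterStageNotify")
    ErrorCode = 0   ;* 0 = success; any non-zero value causes a fatal error at run time

DSLogInfo writes an informational entry to the job log; setting ErrorCode is what tells DataStage whether the routine succeeded.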

If you installed or imported a job, the Before-stage subroutine or After-stage subroutine field may reference a routine that does not exist on your system. In this case, a warning message appears when you close the Link Partitioner Stage dialog box. You must install or import the missing routine or choose an alternative one to use.

    Defining Link Partitioner Stage Properties

    The Properties tab allows you to specify two properties for the Link Partitioner stage:

Partitioning Algorithm. Use this property to specify the method the stage uses to partition data. Choose from:

Round-Robin. This is the default method. Using the round-robin method, the stage will write each incoming row to one of its output links in turn.

Random. Using this method, the stage will use a random number generator to distribute incoming rows evenly across all output links.

Hash. Using this method, the stage applies a hash function to one or more input column values to determine which output link the row is passed to.

Modulus. Using this method, the stage applies a modulus function to an integer input column value to determine which output link the row is passed to.

Partitioning Key. This property is only significant where you have chosen a partitioning algorithm of Hash or Modulus. For the Hash algorithm, specify one or more column names separated by commas; these keys are concatenated and a hash function is applied to determine the destination output link. For the Modulus algorithm, specify a single column name which identifies an integer numeric column; the value of this column determines the destination output link.
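Conceptually, the Modulus method behaves like the DataStage BASIC arithmetic below. This is only an illustrative sketch (the variable names are assumptions), not something you type into the stage, which needs only the column name:

    OutputLinkCount = 4                            ;* e.g. the stage has four output links
    LinkNumber = Mod(KeyValue, OutputLinkCount)    ;* KeyValue is the integer column value for the row

Rows with the same key value therefore always land on the same output link.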

    Defining Link Partitioner Stage Input Data


The Link Partitioner stage can have one input link. This is where the data to be partitioned arrives. The Inputs page has two tabs: General and Columns.

General. The General tab allows you to specify an optional description of the stage.

Columns. The Columns tab contains the column definitions for the data on the input link. This is normally populated by the meta data of the stage connecting on the input side. You can also Load a column definition from the Repository, or type one in yourself (and Save it to the Repository if required). Note that the meta data on the input link must be identical to the meta data on the output links.

    Defining Link Partitioner Stage Output Data

The Link Partitioner stage can have up to 64 output links. Partitioned data flows along these links. The Output Name drop-down list on the Outputs page allows you to select which of the 64 links you are looking at. The Outputs page has two tabs: General and Columns.

General. The General tab allows you to specify an optional description of the stage.

Columns. The Columns tab contains the column definitions for the data on the output link. You can Load a column definition from the Repository, or type one in yourself (and Save it to the Repository if required). Note that the meta data on the output link must be identical to the meta data on the input link, so the meta data is identical for all the output links.

    ==

LINK COLLECTOR Stage:

The Link Collector stage is an active stage which takes up to 64 inputs and allows you to collect data from these links and route it along a single output link. The stage expects the output link to use the same meta data as the input links.

The Link Collector stage can be used in conjunction with a Link Partitioner stage to enable you to take advantage of a multi-processor system and have data processed in parallel. The Link Partitioner stage partitions data, it is processed in parallel, and then the Link Collector stage collects it together again before writing it to a single target. To really understand the benefits you need to know a bit about how DataStage jobs are run as processes; see DataStage Jobs and Processes.


In order for this job to compile and run as intended on a multi-processor system, you must have inter-process buffering turned on, either at project level using the DataStage Administrator, or at job level from the Job Properties dialog box.

    The Properties tab allows you to specify two properties for the Link Collector stage:

Collection Algorithm. Use this property to specify the method the stage uses to collect data. Choose from:

Round-Robin. This is the default method. Using the round-robin method, the stage will read a row from each input link in turn.

Sort/Merge. Using the sort/merge method, the stage reads multiple sorted inputs and writes one sorted output.

Sort Key. This property is only significant where you have chosen a collecting algorithm of Sort/Merge. It defines how each of the partitioned data sets is known to be sorted, and how the merged output will be sorted. The key has the following format:

Columnname [sortorder] [, Columnname [sortorder]] ...

Columnname specifies one (or more) columns to sort on. sortorder defines the sort order as follows:

In an NLS environment, the collate convention of the locale may affect the sort order. The default collate convention is set in the DataStage Administrator, but it can be set for individual jobs in the Job Properties dialog box.

Ascending order: a, asc, ascending, A, ASC, ASCENDING
Descending order: d, dsc, descending, D, DSC, DESCENDING

    For example:

    FIRSTNAME d, SURNAME D


This specifies that rows are sorted according to the FIRSTNAME column and then the SURNAME column, both in descending order.

The Link Collector stage can have up to 64 input links. This is where the data to be collected arrives. The Input Name drop-down list on the Inputs page allows you to select which of the 64 links you are looking at.

    The Link Collector stage can have a single output link.

    DATASTAGE TUTORIAL

1. About DataStage
2. Client Components
3. DataStage Designer
4. DataStage Director
5. DataStage Manager
6. DataStage Administrator
7. DataStage Manager Roles
8. Server Components
9. DataStage Features
10. Types of Jobs
11. DataStage NLS
12. JOB
13. Aggregator
14. Hashed File
15. UniVerse
16. UniData
17. ODBC
18. Sequentia