WebSphere QualityStage Tutorial

WebSphere® QualityStage
Version 8
SC18-9925-00


Note

Before using this information and the product that it supports, be sure to read the general information under "Notices and trademarks."

© Copyright International Business Machines Corporation 2004, 2006. All rights reserved.

US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents

Chapter 1. The Designer Client structure
    About DataStage Projects
    About QualityStage jobs
    WebSphere DataStage and QualityStage stages
    Server and client components

Chapter 2. Tutorial project goals
    Copying tutorial data
    Starting a project
        Creating a job
        Importing tutorial sample data

Chapter 3. Module 1: Investigating source data
    Lesson 1.1: Setting up and linking an investigate job
    Lesson 1.2: Renaming links and stages in an Investigate job
    Lesson 1.3: Configuring the source file
    Lesson 1.4: Configuring the Copy stage
    Lesson 1.5: Configuring the Investigate stage
    Lesson 1.6: Configuring the InvestigateStage2 icon
    Lesson 1.7: Configuring the target reports
    Lesson 1.8: Compiling and running jobs
    Module 1 Summary

Chapter 4. Module 2: Standardizing data
    Lesson 2.1: Setting up a standardize job
    Lesson 2.2: Configuring the Standardize job stage properties
        Configuring Standardize properties
        Configuring the Transformer properties
        Configuring the Standardize Copy stage
        Configuring the Match Frequency stage
    Lesson 2.3: Configuring the target data sets
    Summary for standardized data job

Chapter 5. Module 3: Grouping records with common attributes
    Lesson 3.1: Setting up an Unduplicate Match job
    Lesson 3.2: Configuring the Unduplicate Match job stage properties
        Configuring the Unduplicate Match stage
        Configuring the Funnel stage
    Lesson 3.3: Configuring the Unduplicate job target files
    Summary for Unduplicate stage job

Chapter 6. Module 4: Creating a single record
    Lesson 4.1: Setting up a survive job
    Lesson 4.2: Configuring the survive job stage properties
        Configuring the Survive stage
        Configuring the target file
    Module 4: Survive job summary

Chapter 7. WebSphere QualityStage tutorial summary

Accessing information about IBM
    Contacting IBM
    Accessible documentation
    Providing comments on the documentation

Notices and trademarks
    Notices
    Trademarks

Index


Chapter 1. The Designer Client structure

IBM® WebSphere® QualityStage is a data cleansing component that is part of the WebSphere DataStage® and QualityStage Designer (Designer client).

The Designer client provides a common user interface in which you design your data quality jobs. In addition, you have the power of the parallel processing engine to process large stores of source data.

The integrated stages available in the Repository provide the basis for accomplishing the following data cleansing tasks:

• Resolving data conflicts and ambiguities
• Uncovering new or hidden attributes from free-form or loosely controlled source columns
• Conforming data by transforming data types into a standard format
• Creating one unique result

Learning objectives

The key points to keep in mind as you complete this tutorial include the following concepts:

• How the processes of standardization and matching improve the quality of the data
• The ease of combining QualityStage and DataStage stages in the same job
• How the data flows in an iterative process from one job to another
• How the surviving data results in the best available record

About DataStage Projects

The Designer client provides projects as a method for organizing your re-engineered data. You define data files and stages, and you build jobs, in a specific project. WebSphere QualityStage uses these projects to create and store files on the client and server.

Each QualityStage project contains the following components:

• QualityStage jobs
• Stages that are used to build each job
• Match specifications
• Standardization rules
• Table definitions

In this tutorial, you will create a project by using the data that is provided.

About QualityStage jobs

WebSphere QualityStage uses jobs to process data.

To start a QualityStage job, you open the Designer client and create a new Parallel job. You build the QualityStage job by adding stages, source and target files, and links from the Repository and placing them onto the Designer canvas. The Designer client compiles the Parallel job and creates an executable file. When the job runs, the stages process the data by using the data properties that you defined. The result is a data set that you can use as input for the next job.

In this tutorial, you build four QualityStage jobs. Each job is built around one of the Data Quality stages and additional DataStage stages.
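A schematic view of how those four jobs chain together (the arrows represent data sets written by one job and read by the next):

    Investigate  →  Standardize  →  Unduplicate Match  →  Survive
    (assess the     (condition      (group duplicate       (construct the best
    source data)    the data)       records)               single record)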

WebSphere DataStage and QualityStage stages

A stage in QualityStage performs an action on data. The type of action depends on the stage that you use.

The Designer client stages are stored in the Designer tool palette. You can access all the QualityStage stages in the Data Quality group in the palette. You configure each stage to perform the type of action on the data that obtains the required results. Those results are used as input data to the next stage. The following stages are included in QualityStage:

• Investigate stage
• Standardize stage
• Match Frequency stage
• Unduplicate Match stage
• Reference Match stage
• Survive stage

In this tutorial, you use most of the QualityStage stages.

You can also add any of the DataStage stages to your job. In some of the lessons, you add DataStage stages to enhance the types of tools for processing the data.

Server and client components

You load the client and server components to process the data.

The following server components are installed on the server:

Repository
    A central store that contains all the information required to build a QualityStage job.

WebSphere DataStage server
    Runs the QualityStage jobs.

The following DataStage client components are installed on any personal computer:

• WebSphere DataStage and QualityStage Designer
• DataStage Director
• DataStage Administrator

In this tutorial, you use all of these components when you build and run your QualityStage project.

Chapter 2. Tutorial project goals

The goal of this tutorial is to use Designer client stages to cleanse customer data by removing all the duplicate customer addresses and providing a best case for the correct address.

In this tutorial, you have the role of a database analyst for a bank that provides many financial services. The bank has a large database of customers; however, there are problems with the customer list because it contains multiple name and address records for a single household. Because the marketing department wants to market additional services to existing customers, you need to find and remove duplicate addresses.

For example, a married couple has four accounts, each in their own names. The accounts include two checking accounts, an IRA, and a mutual fund. In the bank's existing system, customer information is tracked by account number rather than by customer name, number, or address. For this one customer, the bank has four address entries.
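For illustration only (these are hypothetical rows, not values from the tutorial's input.csv), the four entries might look like this:

    CUST-0101, JOHN DOE,      123 N. MAIN ST APT 4, BOSTON, MA, 02134
    CUST-0102, J. DOE,        123 NORTH MAIN STREET #4, BOSTON, MASS, 02134
    CUST-0203, JANE DOE,      123 N MAIN ST., APT. 4, BOSTON, MA, 02134-1001
    CUST-0204, MRS JANE DOE,  123 N. MAIN, APT 4, BOSTON, MA, 02134

All four records refer to one household, but no two agree character for character, so a simple exact-match query cannot consolidate them.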

To save money on the mailing, the bank wants to consolidate the household information so that each household receives only one mailing. In this tutorial, you are going to use WebSphere QualityStage to standardize all customer addresses. In addition, you need to locate and consolidate all records of customers who are living at the same address.

Learning objectives

The purpose of this tutorial is to provide a working knowledge of the QualityStage process flow through the jobs. In addition, you learn how to do the following tasks:

• Set up each job in the project
• Configure each stage in the job
• Assess the results of each job
• Apply those results to your business practices

With the completion of these tasks, you will understand how QualityStage stages restructure and cleanse the data by using applied business rules.

This tutorial takes approximately 2.5 hours to complete.

Skill level

To use this tutorial, you need an intermediate to advanced level of understanding of data analysis.

Audience

This tutorial is intended for business analysts and systems analysts who are interested in understanding QualityStage.

System requirements

• IBM Information Server Suite
• Microsoft® Windows® XP or Linux® operating systems

Prerequisites

To complete this tutorial, you need to know how to use:

• IBM WebSphere DataStage and QualityStage Designer
• Personal computers

Expected results

Upon completion of this tutorial, you should be able to use the Designer client to create your own QualityStage projects to meet your company's business requirements and data quality standards.

Copying tutorial data

The tutorial requires data that is located on the IBM Information Suite CD. You will locate the tutorial data and copy it to your server.

1. Locate the Tutorialdata folder on the IBM Information Suite CD.
2. Copy the Tutorialdata folder to your system.

Starting a project

You use a Designer client project as a container for your QualityStage jobs.

You open the DataStage Designer client to begin the tutorial. The DataStage Designer Parallel job provides the executable file that runs your QualityStage jobs.

1. Click Start → All Programs → IBM Information Server → Start the agent.
2. Click Start → All Programs → IBM Information Server → IBM WebSphere DataStage and QualityStage Designer. The Attach to Project window opens.
3. In the Domain field, type the name of the server that you are connected to.
4. In the User name field, type your user name.
5. In the Password field, type your password.
6. In the Project field, attach a project or accept the default.
7. Click OK. The New Parallel job opens in the Designer client.

Next, you are going to import the tutorial sample jobs and data.

Creating a job

The Designer client provides the interface to the parallel engine that processes the QualityStage jobs. You are going to save a job to a folder in the DataStage repository.

If it is not already open, open the DataStage Designer client.

To create a new DataStage job:

1. From the New window, select the Jobs folder in the left pane, and then select the Parallel Job icon in the right pane.
2. Click OK. A new empty job design window opens in the job design area.
3. Click File → Save.
4. In the Save Parallel Job As window, right-click the Jobs folder and select New → Folder from the shortcut menu.
5. Type a name for the folder, for example, MyTutorial.
6. In the Item name field, type the title Investigate and click Save.

You have created a new parallel job named Investigate and saved it in the folder Jobs\MyTutorial in the repository. Next, you are going to import the tutorial data into your project.

Importing tutorial sample data

The data for this tutorial contains sample jobs, table definitions, a Match specification, and source data. You can import the tutorial sample data to begin the tutorial lessons.

1. Click the Import menu and select DataStage Components. The DataStage Repository Import window opens.
2. In the Import from file field, browse to locate the QSTutorial data.
3. Click Import all.

The sample tutorial jobs are displayed in the repository under the Jobs/QSTutorial folder. You can open each job and look at how it is designed on the canvas. Use these jobs as a reference when you create your own jobs.

You can begin Module 1.

Chapter 3. Module 1: Investigating source data

This module explains how to set up and process an Investigate job.

The Investigate job provides data from which you can create reports in the Information Server Web console. You can use the information in the reports to make basic assumptions about the data and determine what further steps you need to take to attain the goal of providing a legitimate address for each customer in the database.

Learning objectives

After completing the lessons in this module, you will know how to do the following tasks:

1. Add QualityStage or DataStage stages and links to a job
2. Configure stage properties to specify what action the stages take when the job is run
3. Load and process customer data and metadata
4. Compile and run a job
5. Produce data for reports

This module should take approximately 30 minutes to complete.

Lesson 1.1: Setting up and linking an investigate job

You create each QualityStage job by adding Data Quality stages and DataStage sequential files and stages to the Designer canvas. Each icon on the canvas is linked together to allow the data to flow from the source file to each stage.

The goals for this lesson are to add stages to the Designer canvas and link them together.

To set up an investigate job:

1. Click Palette → Data Quality to select the Investigate stage. If you do not see the palette, click View → Palette.
2. Drag the Investigate stage icon onto the Designer canvas and drop it near the middle of the canvas.
3. Drag a second Investigate stage icon and drop it beneath the first Investigate stage. You use two Investigate stages to create the data for the reports.
4. Click Palette → File and select Sequential File.
5. Drag the sequential file onto the Designer canvas and drop it to the left of the first Investigate stage. This sequential file will become the source file.
6. Click Palette → Processing and select the Copy stage. This stage duplicates the data from the source file and copies it to the two Investigate stages.
7. Drag the Copy stage onto the Designer canvas and drop it between the sequential file and the first Investigate stage.
8. Click Palette → File, and drag a second sequential file onto the Designer canvas and drop it to the right of the first Investigate stage. The data from the Investigate stage is sent to the second sequential file, which becomes the target file.
9. Drag a third sequential file onto the Designer canvas and drop it also to the right of the Investigate stage, beneath the second sequential file. You now have a source file, a Copy stage, two Investigate stages, and two target files.
10. Drag a fourth sequential file onto the Designer canvas and drop it beneath the third sequential file as the final target file. Next, you are going to link all the stages together.
11. Click Palette → General → Link.
    a. Drag a link from the source sequential file to the Copy stage.
    b. Continue linking the other stages. The following figure shows the completed Investigate job.

Lesson checkpoint

When you set up the Investigate job, you connect the source file and its source data and metadata to all the stages, and you link the stages to the target files.

In completing this lesson, you learned the following about the Designer:

• How to add stages to the Designer canvas
• How to combine Data Quality and Processing stages on the Designer canvas
• How to link all the stages together

Lesson 1.2: Renaming links and stages in an Investigate job

When creating a large job in the Designer client, it is important to rename each stage, file, and link with meaningful names to avoid confusion when selecting paths during stage configuration.

When you rename the links and stages, do not use spaces. The Designer client resets the name back to the generic name if you put in spaces. The goals for this lesson are to replace the generic names for the icons on the canvas with more appropriate names.

To rename icons on the canvas:

1. To rename a stage, complete the following steps:
    a. Click the name of the source SequentialFile until a highlighted box appears around the name.
    b. Type SourceFile in the box.
    c. Click outside the box to deselect it.
2. To rename a link, complete the following steps:
    a. Right-click the generic link name DSLinkXX that connects SourceFile to the Copy icon and select Rename from the shortcut menu. A highlighted box appears around the default name.
    b. Type Customerdata and click outside the box. The default name changes to Customerdata.
3. Right-click the generic link name that connects the Copy icon to the Investigate icon.
4. Repeat step 2, except type Name in the box.
5. Right-click the generic link name that connects the Copy icon to the second Investigate icon.
6. Repeat step 2, except type AddrCityState in the box.
7. Rename the following stages as described in step 1:

    Stage          Change to
    Copy           CopyStage
    Investigate    InvestigateStage
    Investigate    InvestigateStage2

8. Rename the three target files, from the top, in the following order:
    a. NameTokenReport
    b. AreaTokenReport
    c. AreaPatternReport
9. Rename the links as described in step 2:

    Link                                           Change to
    From InvestigateStage to NameTokenReport       TokenData
    From InvestigateStage2 to AreaTokenReport      AreaTokenData
    From InvestigateStage2 to AreaPatternReport    AreaPatternData

Renaming the elements on the Designer canvas provides better organization for the Investigate job.

Related information

"Lesson 2.1: Setting up a standardize job"
    Standardizing data is the first step you take when you do data cleansing. In Lesson 2.1, you add a variety of stages to the Designer canvas. These stages include the Transformer stage, which applies derivations to handle nulls, and the Match Frequency stage, which adds frequency data.

"Lesson 3.1: Setting up an Unduplicate Match job"
    Sorting records into related attributes is the second step in data cleansing. In this lesson, you are going to add the Data Quality Unduplicate Match stage and a Funnel stage to match records and remove duplicates.

Lesson checkpoint

In this lesson, you changed the generic stage and link names to names appropriate for the job.

You learned the following tasks:

• How to select the default name field in order to edit it
• The correct method to use when changing the name

Lesson 1.3: Configuring the source file

The source data and metadata are attached to the sequential file as the source data for the job.

The goal of this lesson is to attach the input data of customer names and addresses and load the metadata.

To add data and metadata to the Investigate job, you configure the source file to locate the input data file input.csv stored on your computer, and you load the metadata columns.

To configure the source file:

1. Double-click the SourceFile icon to open the Properties window.
2. To select the tutorial data file, complete the following steps:
    a. Click Source → File to activate the File field.
    b. Click the arrow button in the File field and select Browse for File.
    c. Locate the directory where you copied the tutorial input.csv file.
    d. Click input.csv to select the file. You can click View Data to see the quality of the input data. You see bank customer names and addresses. The addresses are shown in a disorganized way, which makes it difficult for the bank to analyze the data.
3. Click Columns.
4. Click Load.
5. From the Table Definitions window, click the Table Definitions folder.
6. Click the Tutorial folder.
7. Click Input to load the source metadata. The metadata columns are shown in the Columns page.
8. Click Save to save the data at a location that you choose.
9. Click OK to close the Sequential File window.

Lesson checkpoint

In this lesson, you attached the input data (customer names and addresses) and loaded the metadata.

You learned how to do the following tasks:

• Attaching source data to the source file
• Adding column metadata to the source file

Lesson 1.4: Configuring the Copy stage

The Copy stage duplicates the source data and sends it to the two Investigate stages.

This lesson explains how you can add a DataStage Processing stage, the Copy stage, to a QualityStage job. You use the Copy stage to duplicate the metadata and send the output metadata to the two Investigate stages.

To configure a Copy stage:

1. Double-click the Copy stage icon to open the Properties window.
2. Click Input → Columns. The metadata that you loaded in the SourceFile has propagated to the Copy stage.
3. Click Output → Mapping. You will map the columns that display in the left Columns box to the right box.
4. In the Output name field, select Name if it is not already selected. Selecting the correct output link ensures that the data goes to the correct Investigate stage, either InvestigateStage or InvestigateStage2.
5. To copy the data from the Columns pane to the Name pane, complete the following steps:
    a. Place your cursor in the Columns pane, right-click, and choose Select All.
    b. Then choose Copy.
    c. Place your cursor in the Name pane, right-click, and choose Paste Column. The column metadata copies into the Name pane, and lines show the columns linking from Columns to Name.
6. Change the Output name field to AddrCityState.
7. Repeat step 5 to map the columns to the AddrCityState output link.

This procedure shows you how to map columns to two different outputs.

Lesson checkpoint

In this lesson, you mapped the input metadata to the two output links to continue the propagation of the metadata to the next two stages.

You learned how to do the following tasks:

• Adding a DataStage stage to a QualityStage job
• Propagating metadata to the next stage
• Mapping metadata to two output links

Lesson 1.5: Configuring the Investigate stage

The Word Investigate option of the Investigate stage parses name and address data into recognizable patterns by using a rule set that classifies personal names and addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you are going to select the NAME rule set to apply USPS® standards.

To configure the Investigate stage icon:

1. Double-click the InvestigateStage icon.
2. Click Word Investigate to open the Word Investigate window. The Name column that was propagated to the Investigate stage from the Copy stage is shown in the Available Columns pane.
3. Select Name from the Available Columns pane and click the arrow button to move the Name column into the Standard Columns pane. The Investigate stage analyzes the Name column by using the rule set that you select in step 4.
4. In the Rule Set field, click the arrow button and select a rule set for the Investigate stage:
    a. In the Open window, double-click the Standardization Rules folder to open the Standardization Rules tree.
    b. Double-click the USA folder and select NAME from the list. The NAME rule set parses the Name column according to United States Postal Service standards for names.
    c. Right-click NAME and select Provision All.
    d. Click OK to exit the Open window.
5. Click Token Report in the Output Dataset area.
6. Click Stage Properties → Output → Mapping.
7. Map the output columns:
    a. Click the Columns pane.
    b. Right-click and select Select All in the context menu.
    c. Select Copy.
    d. Click in the TokenData pane.
    e. Right-click and select Paste Column. The columns on the left side map to the columns on the right side. Your map should look like the following figure.
8. Click Columns. Notice that the Output columns are populated when you map the columns in the Mapping tab.
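For orientation (hypothetical rows; the real columns and counts come from running the job on the tutorial data), a name token report lists each distinct word that the rule set found in the Name column, how often it occurs, and how the rule set classifies it:

    Count   Token    Classification
    412     JOHN     classified as a first name
    57      JR       classified as a name suffix
    3       XYZZY    unclassified

High counts of unclassified tokens point to values that the rule set does not recognize.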

Lesson summary

This lesson explained how to configure the Investigate stage by using the NAME rule set.

You learned how to configure the Investigate stage in the Investigate job by doing the following tasks:

• Selecting the columns to investigate
• Selecting a rule set from the standardization rules
• Mapping the output columns

Lesson 1.6: Configuring the InvestigateStage2 icon

The Word Investigate option of the Investigate stage parses name and address data into recognizable patterns by using a rule set that classifies personal names and addresses.

The Investigate stage analyzes each record from the source file. In this lesson, you are going to select the AREA rule set to apply USPS® standards.

To configure the InvestigateStage2 icon:

1. Double-click the InvestigateStage2 icon.
2. Click Word Investigate to open the Word Investigate window. The address columns that were propagated to the second Investigate stage from the Copy stage are shown in the Available Columns pane.
3. Select the following columns to move to the Standard Columns pane. The second Investigate stage analyzes the address columns by using the rule set that you select in step 5.
    • City
    • State
    • Zip5
    • Zip4
4. Click the arrow button to move each selected column to the Standard Columns pane.
5. In the Rule Set field, click the arrow button to locate a rule set for InvestigateStage2:
    a. In the Open window, double-click the Standardization Rules folder to open the Standardization Rules tree.
    b. Double-click the USA folder and select AREA from the list. The rule set parses the City, State, Zip5, and Zip4 columns according to United States Postal Service standards.
    c. Right-click AREA and select Provision All.
    d. Click OK to exit the Open window. AREA.SET is shown in the Rule Set field.
6. Click Token Report and Pattern Report in the Output Dataset area. When you assign data to two outputs, you need to verify that the link ordering is correct. Link ordering ensures that the data is sent to the correct reports through the assigned links that you named in Lesson 1.2. The Link Ordering tab is not shown if there is only one link.
7. Click Stage Properties → Link Ordering and select the link to change, if you need to change the order.
8. Move the links up or down as needed:
    • Click the up arrow to move the link name up a level.
    • Click the down arrow to move the link name down a level.
    The following figure shows the correct order for the links.
9. Click Output → Mapping. Because there are two output links from the second Investigate stage, you need to map the columns to each link by following these steps:
    a. In the Output name field, select AreaPatternData.
    b. Select the Columns pane.
    c. Right-click and select Select All, and then Copy.
    d. Select the AreaPatternData pane, right-click, and select Paste Column. The columns are mapped to the output link AreaPatternData.
    e. In the Output name field, select AreaTokenData.
    f. Repeat substeps b to d, except select the AreaTokenData pane.
10. Click OK to close the InvestigateStage2 window.
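For orientation (a schematic description, not actual report output): the pattern report groups records by the pattern of token classes rather than by individual values, so every record whose area columns follow the same shape, for example "word, two-letter state abbreviation, five-digit number", falls into one pattern class with a single count, while the token report counts each distinct word separately. A handful of high-frequency patterns usually covers most of the file; rare patterns point to records that need attention.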

Lesson summary

This lesson explained how to configure the second Investigate stage to use the AREA rule set.

You learned how to configure the second Investigate stage in the Investigate job by doing the following tasks:

• Selecting the columns to investigate
• Selecting a rule set from the standardization rules
• Verifying the link ordering for the output reports
• Mapping the output columns to two output links

Lesson 1.7: Configuring the target reports

The source data information and column metadata are propagated to the target data files for later use in creating Investigation reports.

The Investigate job modifies the unformed source data into readable data that is later configured into Investigation reports.

To configure the data files:

1. Double-click the NameTokenReport icon on the Designer client canvas.
2. Click Target → File.
3. Click the arrow button and select tokrpt.csv from the tutorial data folder.
4. Click OK to close the stage.
5. Double-click the AreaPatternReport icon.
6. Repeat steps 2 to 4, except select areapatrpt.csv from the tutorial data folder.
7. Double-click the AreaTokenReport icon.
8. Repeat steps 2 to 4, except select areatokrpt.csv from the tutorial data folder.

Lesson checkpoint

This lesson explained how to configure the target files for use as reports. You configured the three target data files by linking the data to each report file.

Lesson 1.8: Compiling and running jobs

You test the completeness of the Investigate job by running the compiler, and then you run the job to process the data for the reports.

You compile the Investigate job in the Designer client. After the job compiles, you open the Director and run the job.

To compile and run the job:

1. Click File → Save to save the Investigate job on the Designer canvas.
2. Click the Compile button on the toolbar to compile the job. The Compile Job window opens and the job begins to compile. When the compiler finishes, the following message is shown: Job successfully compiled with no errors.
3. Click Tools → Run Director. The Director application opens with the job shown in the Director Job Status View window.
4. Click the Run button to open the Job Run Options window.
5. Click Run. After the job runs, Finished is shown in the Status column.

Related information

"Lesson 3.3: Configuring the Unduplicate job target files"
    You are going to attach files to the four output records. The records in the MatchedOutput file become the source records for the next job.

"Lesson 4.2: Configuring the survive job stage properties"
    You will load matched and duplicates data from the Unduplicate Match job, configure the Survive stage with rules that test columns against a set of conditions, and configure a target file.

Lesson checkpoint

In this lesson, you learned how to compile and process an Investigate job.

You processed the data into three output files by doing the following tasks:

• Compiling the Investigate job
• Running the Investigate job in the Director

Module 1 Summary

In Module 1, you set up, configured, and processed a QualityStage Investigate job.

An Investigate job looks at each record column by column and analyzes the data content of the columns that you select. The Investigate job loads the name and address source data stored in the bank's database, parses the columns into a form that can be analyzed, and then organizes the data into three data files.

The Investigate job modifies the unformed source data into readable data that you can configure into Investigation reports by using the Information Server Web console. You select QualityStage Reports to access the reports interface in the Web console.

The next module organizes the unformed data into standardized data that provides usable data for matching and survivorship.

Lessons learned

By completing this module, you learned about the following concepts and tasks:

• How to correctly set up and link stages in a job so that the data propagates from one stage to the next
• How to configure the stage properties to apply the correct rule set to analyze the data
• How to compile and run a job
• How to create data for analysis

Chapter 4. Module 2: Standardizing data

In Module 2, you configure the Data Quality Standardize stage to standardize name and address information. This information is derived from the bank's customer database.

When you looked at the data in Module 1, you might have noticed that some addresses were free-form and nonstandard. The goal of removing duplicate customer addresses and guaranteeing that a single address is the correct address for that customer would be impossible without standardizing.

Standardizing, or conditioning, ensures that the source data is internally consistent; that is, each type of data has the same type of content and format. When you use consistent data, the system can match address data with greater accuracy during the match stage.
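For example (hypothetical values, not taken from the tutorial data), standardization might transform one free-form record like this:

    Before:  Name = "MR J. DOE JR"
             AddressLine1 = "123 north main street, apt #4"
    After:   Title = MR, FirstName = J, LastName = DOE, NameSuffix = JR
             HouseNumber = 123, StreetPrefix = N, StreetName = MAIN,
             StreetType = ST, UnitType = APT, UnitValue = 4

The exact output column names depend on the rule set; the point is that every record ends up with the same columns and the same conventions, so that equivalent values compare as equal.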

Learning objectives

After completing the lessons in this module, you will know how to do the following tasks:

1. Add stages and links to a Standardize job
2. Configure the various stage properties to correctly process the data when the job is run
3. Use derivations to handle nulls
4. Generate frequency and standardized data

This module should take approximately 60 minutes to complete.

Lesson 2.1: Setting up a standardize job

Standardizing data is the first step you take when you do data cleansing. In Lesson 2.1, you add a variety of stages to the Designer canvas. These stages include the Transformer stage, which applies derivations to handle nulls, and the Match Frequency stage, which adds frequency data.

If you have not already done so, open the Designer client.

As you learned in Lesson 1.1, you are going to add stages and links to the Designer canvas to create a standardize job. The Investigate job that you completed helped you determine how to formulate a business strategy by using Investigation reports. The Standardize job applies rule sets to the source data to condition it for matching.

To set up a Standardize job:

1. Add the following stages to the Designer canvas from the palette:
    • Data Quality Standardize stage in the middle of the canvas
    • Sequential file to the left of the Standardize stage
    • Data Set file to the right of the Standardize stage
    • Transformer stage between the Standardize stage and the Data Set file
    • Copy stage between the Transformer stage and the Data Set file
    • Data Quality Match Frequency stage below the Copy stage
    • Second Data Set file to the right of the Match Frequency stage
    Do not worry about the positioning of the stages and files; after linking them, you can adjust their location on the canvas.
2. Right-click the Sequential file and drag to create a link from the Sequential file to the Standardize stage.
3. Drag links to all the stages as explained in step 2. If a link is red, click to activate the link and drag it until it meets the stage; it should turn black. When all the icons on the canvas are linked, you can click a stage and drag it to change its position.
4. Rename the stages to the following names by typing the name in the highlighted box:

    Stage                        Change to
    SequentialFile               Customer
    Standardize stage            Standardize
    Transformer stage            CreateAdditionalMatchColumns
    Copy stage                   Copy
    Data Set file (first)        Stan
    Match Frequency stage        MatchFrequency
    Data Set file (second)       Frequencies

5. Rename the following links by highlighting the default name:

    Link                                              Change to
    From Customer to Standardize                      Input
    From Standardize to CreateAdditionalMatchColumns  Standardized
    From CreateAdditionalMatchColumns to Copy         ToCopy
    From Copy to Stan                                 StandardizedData
    From Copy to MatchFrequency                       ToMatchFrequency
    From MatchFrequency to Frequencies                ToFrequencies

The following figure shows the completed Standardize job.

Related information

"Lesson 1.2: Renaming links and stages in an Investigate job"
    When creating a large job in the Designer client, it is important to rename each stage, file, and link with meaningful names to avoid confusion when selecting paths during stage configuration.

Lesson checkpoint

In this lesson, you learned how to set up a Standardize job. The purpose of the Standardize stage is to generate the type of data that can then be used in a match job.

You set up and linked a Standardize job by doing the following tasks:

• Adding Data Quality and Processing stages to the Designer canvas
• Linking all the stages
• Renaming the links and stages

Lesson 2.2: Configuring the Standardize job stage properties

In this lesson, you will configure the stage properties for the Standardize job stages on the Designer canvas.

When you configure the Standardize job, you will complete the following tasks:

• Loading source data and metadata
• Adding compliant rule sets for United States names and addresses
• Applying derivations to handle null values
• Copying data to two output links
• Creating frequency data

To configure the source file stage properties:

1. Double-click the Customer file to open the Output Properties window.
2. Select the File property under the Source category, and in the File field enter the path name for the tutorial data folder, C:\Tutorial\Data\Input.csv. The file that you attached is the source file that the stage reads when the job runs.
3. Click the Columns tab and click Load. The Table Definitions window opens.
4. Click the Table Definitions → Tutorial → Input folder. The table definitions load into the Columns tab of the source file.
5. Click OK to close the source file.

You attached the source data to the source file and loaded table definitions to organize the data into standard address columns.

Configuring Standardize properties

When you configure the Standardize stage, you apply rules to name and address data to parse it into a standard column format.

First, configure the source file.

To configure the Standardize stage:

1. Double-click Standardize to open the Standardize Stage window.
2. Click New Process to open the Standardize Rule Process window.
3. From the Standardize Rule Process window, click Standardization Rules → USA. The rule sets that you select for the Standardize job are domain-specific; you select them to create consistent, industry-standard data structures and matching structures.
4. To select a standardization rule, complete the following steps:
    a. Select NAME for the rule set. The NAME rule set appears in the Rule Set field. You select rule sets from the USA folder because the name and address data is from the United States.
    b. From the Available Columns pane, select Name.
    c. Click the arrow button to move the Name column into the Selected Columns pane. The Optional NAMES Handling field becomes active.
    d. Click OK.
5. Click New Process and select ADDR from the USA country name.
6. Move the following column names from the Available Columns pane to the Selected Columns pane:
    • AddressLine1
    • AddressLine2
7. Click New Process and select AREA from the USA country name.
8. Move the following column names from the Available Columns pane to the Selected Columns pane:
    • City
    • State
    • Zip5
    • Zip4
    Next, you are going to map the Standardize stage output columns and save the table definitions to the Table Definitions folder.
9. Click Output → Mapping, select the columns in the left pane, and copy them to the right pane.
10. Click Columns. You are going to create the identifier for the table definitions:
    a. Click Save. The Save Table Definitions window opens. The file name is shown in the Table/File name field.
    b. In the Data source type field, type Table Definitions.
    c. In the Data source name field, type QualityStage.
    d. Click OK.
    e. Type Standardized in the Item name field.
    f. Click Save and close the Standardize Stage window.

You have configured the Standardize stage to apply the NAME, ADDR, and AREA rule sets to the customer data, and you saved the table definitions.

Configuring the Transformer properties

The DataStage Transformer stage increases the number of columns that the matching stage uses to select matches. The Transformer stage also applies derivations to handle null values.

To configure the Transformer properties:

1. Open the Transformer stage.
2. From the Transformer Stage window, right-click the input columns and choose Select All to highlight all the columns from the Standardize stage.
3. Drag the selected columns to the Output link. You have now populated the Output pane and the Output metadata pane.
4. Select the top row of the ToCopy pane and complete the following steps to add three derivations and columns to the Transformer stage:
    a. Right-click the row and select Insert row.
    b. Add two more rows as explained in step a.
    c. Right-click the top inserted row and select Edit row. The Edit Column Meta Data window opens.
    d. In the Column name field, type MatchFirst1.
    e. In the SQL type field, select VarChar.
    f. In the Length field, select 1.
    g. In the Nullable field, select Yes.
    h. Click Apply and Close to remove the window.
    i. Right-click the next row and select Edit row.
    j. In the Column name field, type HouseNumberFirstChar.
    k. Repeat substeps e to h.
    l. Edit the last new row.
    m. In the Column name field, type ZipCode3.
    n. Repeat substeps e to h, except in substep f, select 3.
5. To add derivations to the columns, complete the following substeps:
    a. Double-click the Derivation area for the MatchFirst1 column to open the Expression Editor and type the following derivation:

        If isNull(Standardize.MatchFirstName_NAME) then Setnull()
        Else Standardize.MatchFirstName_NAME[1,1]

       The expression that you entered detects whether the MatchFirstName_NAME column contains a null. If it does, the expression writes a null; if the column contains a string, the expression extracts the first character and writes it to the MatchFirst1 column. (The substring syntax [start,length] takes one character starting at position 1.)
    b. Repeat substep a for the HouseNumberFirstChar column and type the following derivation:

        If isNull(Standardize.HouseNumber_ADDR) then Setnull()
        Else Standardize.HouseNumber_ADDR[1,1]

    c. Repeat substep a for the ZipCode3 column and type the following derivation:

        If isNull(Standardize.ZipCode_AREA) then Setnull()
        Else Standardize.ZipCode_AREA[1,3]

6. Finally, map the three derived columns to the input columns by completing the following steps:
    a. Scroll the Standardized pane until you locate MatchFirstName_NAME.
    b. Click and drag the column and drop it on the same column name in the ToCopy pane.
    c. Repeat substeps a and b with HouseNumber_ADDR and ZipCode_AREA.
    d. Click OK to close the Transformer Stage window.

Configuring the Standardize Copy stage

As you learned in Lesson 1.4, a Copy stage duplicates data and writes it to more than one output link. In this lesson, the Copy stage duplicates the metadata from the Transformer stage and writes it to the Match Frequency stage and the target file. The metadata from the Standardize and Transformer stages is duplicated and written to two output links.

To configure the Copy stage:

1. Double-click the Copy stage and click Output → Mapping.
2. To copy the data to the StandardizedData output link, follow these steps:
    a. From the Output name field, select StandardizedData.
    b. Copy the columns from the left pane to the right pane.
3. To copy the data to the ToMatchFrequency output link, repeat step 2, except select ToMatchFrequency in the Output name field.
4. Close the Copy stage.

Configuring the Match Frequency stage

The Match Frequency stage generates frequency information by using any data that provides the columns needed by a match. The Match Frequency stage processes frequency data independently from executing a match. The output link of the stage carries four columns:

• qsFreqVal
• qsFreqCounts
• qsFreqColumnID
• qsFreqHeaderFlag
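For intuition (hypothetical figures; the real counts come from the tutorial data): if the value MAIN occurs 312 times in a street-name column, the stage emits a row with qsFreqVal = MAIN and qsFreqCounts = 312 for that column. The match process later uses these counts to weight evidence, because agreement on a rare value, such as an unusual surname, argues more strongly for a match than agreement on a common value.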

To configure the Match Frequency stage:

1. Double-click the Match Frequency stage icon to open the Match Frequency Stage window.
2. Click Do not use a Match Specification. At this point, you do not know which columns will be used in the match specification.
3. From the Output → Mapping tab, copy the columns in the left pane to the right pane.
4. To create table definitions, continue with the following substeps:
    a. Click Columns.
    b. Click Save. The Save Table Definitions window opens.
    c. Click OK. The Save Table Definitions As window opens.
    d. Select Table Definitions → Tutorial → Save.
5. Click OK to close the Output → Columns tab and the Match Frequency stage.

Lesson checkpoint

This lesson explained how to configure the source file and all the stages for the Standardize job.

You have now applied settings to each stage and mapped the output columns to the next stage for the Standardize job. You learned how to do the following tasks:

• Configure the source file to load the customer data and metadata
• Apply United States Postal Service compliant rule sets to the customer name and address data
• Add additional columns for matching and create derivations to handle nulls
• Write data to two output links and associate the data with the correct links
• Create frequency data

Lesson 2.3: Configuring the target data sets

The two target data sets in the Standardize job store the standardized and frequency data that you can use as source data in the Unduplicate Match job.

First, configure the stages as explained in Lesson 2.2.

You are going to complete the following tasks:

• Attach a file to the Stan target data set
• Attach a file to the Frequencies target data set

To configure the target data sets:

1. Double-click the Stan target data set.
2. From the Data Set window, click Input → Properties and select Target → File.
3. Click the arrow button and select Browse for file.
4. Locate the Tutorial Data folder.
5. Select stan in the Tutorial Data folder and click OK.
6. Double-click the Frequencies target data set.
7. Repeat steps 2 to 4.
8. Select frequencies in the Tutorial Data folder and click OK. These two files will be used as the source data sets for the Unduplicate Match job.
9. Click the Compile button to compile the job in the Designer client.
10. Click the Run button to run the job.

The job standardizes the data according to the applied rules and adds additional matching columns to the metadata. The data is written to two target data sets that serve as the source files for a later job.

Lesson checkpoint

This lesson explained how to attach files to the target data sets to store the processed standardized customer name and address data and the frequency data.

You have configured the Stan and Frequencies target data sets to accept the data when it is processed.

Summary for standardized data job

In Module 2, you set up and configured a Standardize job.

Running a Standardize job conforms the data to ensure that all the customer name and address data has the same content and format. The Standardize job loaded the name and address source data stored in the bank's database and added table definitions to organize the data into a format that could be analyzed by the rule sets. Further processing by the Transformer stage increased the number of columns. In addition, frequency data was generated for input into the match job.

Lessons learned

By completing this module, you learned about the following concepts and tasks:

• Creating standardized data to match records effectively
• Running DataStage and Data Quality stages together in one job
• Applying country-specific rule sets to analyze the address data
• Using derivations to handle nulls
• Creating data that can be used as source data in a later job

Chapter 5. Module 3: Grouping records with common attributes

The Unduplicate Match stage uses standardized data and frequency data to match records and remove duplicate records.

The Unduplicate Match stage is one of two stages that match records while removing duplicates and residuals. The other matching stage is the Reference Match stage.

With the Unduplicate Match stage, you group records that share common attributes. The Match specification that you apply was configured to treat all records with weights above a certain match cutoff as duplicates. The master record is then identified by selecting the record within the set that matches to itself with the highest weight.

Any records that are not part of a set of duplicates are residuals. These records, along with the master records, are used for the next pass. You do not include duplicates, because you want them to belong to only one set.

The reason you use a matching stage is to ensure data integrity. Data integrity is assured because you are applying probabilistic matching technology. This technology is applied to any relevant attribute for evaluating the columns, parts of columns, or individual characters that you define. In addition, you can apply agreement or disagreement weights to key data elements.
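A small worked example (the weights and cutoffs are invented for illustration; the tutorial's Match specification defines the real ones): suppose three records score against one another with composite weights of 31, 28, and 6, and the match cutoff is 25. The first two records form a duplicate set because their weights exceed the cutoff, and the record that matches to itself with the highest weight (31) becomes the master. The third record, at weight 6, joins no set and is carried forward as a residual for the next pass.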

Learning objectives

When you build the Unduplicate Match job, you are going to perform the following tasks:

1. Add Data Quality and DataStage links and stages to a job
2. Add standardized data and frequency data as the source files
3. Configure stage properties to specify what action the stages take when the job is run
4. Remove duplicate addresses after the first pass
5. Apply a Match specification to determine how matches are selected
6. Funnel the common-attribute data to a separate target file

This module should take approximately 30 minutes to complete.

Lesson 3.1: Setting up an Unduplicate Match job

Sorting records into related attributes is the second step in data cleansing. In this lesson, you are going to add the Data Quality Unduplicate Match stage and a Funnel stage to match records and remove duplicates.

If you have not already done so, open the Designer client.

As you learned in the previous module, you are going to add stages and links to the Designer canvas to create an Unduplicate Match job. The Standardize job that you just completed created a stan data set and a frequencies data set. The information from these data sets is going to be used as the input data when you design the Unduplicate Match job.

To set up an Unduplicate Match job:

1. Add the following stages to the Designer canvas from the palette:
    • Data Quality Unduplicate Match stage in the middle of the canvas
    • Data Set file to the top left of the Unduplicate Match stage
    • A second Data Set file to the lower left of the Unduplicate Match stage
    • Processing Funnel stage to the upper right of the Unduplicate Match stage
    • Three Sequential files: one to the right of the Funnel stage, and the other two to the right of the Unduplicate Match stage
2. Right-click the top Data Set file and drag to create a link from the data set to the Unduplicate Match stage.
3. Drag links to all the stages as explained in step 2.
4. Rename the stages by typing the following names in the highlighted edit box:

    Stage                          Change to
    Top left Data Set file         MatchFrequencies
    Lower left Data Set file       StandardizedData
    Unduplicate Match stage        Unduplicate
    Funnel stage                   CollectMatched
    Top right Sequential file      MatchOutput_csv
    Middle right Sequential file   ClericalOutput_csv
    Lower right Sequential file    NonMatchedOutput_csv

5. Rename the following links by typing the name in the highlighted box:

    Link                                        Change to
    From MatchFrequencies to Unduplicate        MatchFrequencies
    From StandardizedData to Unduplicate        StandardizedData
    From Unduplicate to CollectMatched          MatchedData
    From Unduplicate to CollectMatched          Duplicates
    From CollectMatched to MatchOutput_csv      MatchedOutput
    From Unduplicate to ClericalOutput_csv      Clerical
    From Unduplicate to NonMatchedOutput_csv    NonMatched

Related information

"Lesson 1.2: Renaming links and stages in an Investigate job"
    When creating a large job in the Designer client, it is important to rename each stage, file, and link with meaningful names to avoid confusion when selecting paths during stage configuration.

Lesson checkpoint for the Unduplicate Match job

In this lesson, you learned how to set up an Unduplicate Match job. During the processing of this job, the records are matched by using the Match specification created for this tutorial. The records are then sorted according to their attributes and written to a variety of output links.

You set up and linked an Unduplicate Match job by doing the following tasks:

• Adding Data Quality and Processing stages to the Designer canvas
• Linking all the stages
• Renaming the links and stages with appropriate names

Lesson 3.2: Configuring the Unduplicate Match job stage properties

In this lesson, you configure the stage properties for the Unduplicate Match job stages on the Designer canvas.

When you configure the Unduplicate Match job, you complete the following tasks:

• Loading data and metadata for two source files
• Applying a Match specification to the Unduplicate Match job and selecting output links
• Combining unsorted records

To configure the MatchFrequencies and StandardizedData data sets:

1. Double-click the MatchFrequencies data set to open the Output Properties window.
2. Select the File property under the Source category, and in the File field enter the path name for the tutorial data folder, C:\Tutorial\Data\Frequencies.csv.
3. Click the Columns tab and click Load. The Table Definitions window opens.
4. Click the Table Definitions → Tutorial → Frequencies folder. The table definitions load into the Columns tab of the source file.
5. Click OK to close the MatchFrequencies window.
6. Double-click the StandardizedData file.
7. Repeat steps 2 to 5, except in steps 2 and 4, select stan.

With this lesson, you have loaded the data that resulted from the Standardize job into the source files for the Unduplicate Match job.

Configuring the Unduplicate Match stage

With the Unduplicate Match stage, you are grouping records that share common attributes.

To configure the Unduplicate Match stage:

1. Double-click Unduplicate and click the Match Specification

button.

2. From the Repository window, double-click the Match Specifications → Matches folder.

3. Select NameandAddress.MAT.

4. Right-click NameandAddress.MAT and select Provision All from the menu.

5. Click OK. You are attaching an Unduplicate Match specification that was created for the tutorial.

6. Click the following Match Output options:

v Match. Sends matched records as output data.

v Clerical. Separates those records that require clerical review.

v Duplicates. Includes duplicate records that are above the match cutoff.

v Residuals. Separates records that are not duplicates as residuals. (The sketch after this procedure illustrates how these four options route records.)

7. Keep the default Dependent Match Type. With a dependent match, duplicates identified in one pass are removed from consideration in each additional pass.

8. Click Stage Properties → Link Ordering. The links should be in the following order:

   Link label    Link name
   Match         MatchedData
   Clerical      Clerical
   Duplicate     Duplicates
   Residual      NonMatched

9. Click Output → Mapping and map the following columns to the correct links:

a. If not selected, select MatchedData from the Output name field.

b. Copy all the columns in the left pane to the MatchedData pane.

c. Select Duplicates from the Output name field.

d. Copy all the columns in the left pane to the Duplicates pane.

e. Select Clerical from the Output name field.

f. Copy all the columns in the left pane to the Clerical pane.

g. Select NonMatched from the Output name field.

h. Copy all the columns in the left pane to the NonMatched pane.

10. Click OK to close the Stage Properties window.
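The four Match Output options amount to routing each scored record down one of four links. The sketch below is a rough, hypothetical illustration of that routing in Python; the field names, scores, and cutoff values are invented, and the real Unduplicate Match stage derives its weights and cutoffs from the Match specification (NameandAddress.MAT in this tutorial).

    # Hypothetical sketch of the four-way Match Output routing.
    # Field names and cutoffs are invented for illustration only.

    MATCH_CUTOFF = 0.90      # assumed score above which a record is a match
    CLERICAL_CUTOFF = 0.75   # assumed score above which a record needs review

    def route(record):
        """Return the output link that a scored record would travel on."""
        if record["is_master"] and record["score"] >= MATCH_CUTOFF:
            return "MatchedData"    # Match: master record of a matched set
        if record["score"] >= MATCH_CUTOFF:
            return "Duplicates"     # Duplicates: above the match cutoff
        if record["score"] >= CLERICAL_CUTOFF:
            return "Clerical"       # Clerical: requires manual review
        return "NonMatched"         # Residuals: not matched in this pass

    records = [
        {"is_master": True,  "score": 0.97},
        {"is_master": False, "score": 0.93},
        {"is_master": False, "score": 0.80},
        {"is_master": False, "score": 0.40},
    ]
    for r in records:
        print(route(r))  # MatchedData, Duplicates, Clerical, NonMatched

The mapping in step 9 then decides which columns each of these links carries to its target file.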


Configuring the Funnel stage

With the Funnel stage, you are combining records as they arrive in an unordered format.

To configure a continuous funnel:

1. Double-click the CollectMatched stage and click Stage → Advanced.

2. Select Sequential from the Execution mode menu. This setting allows you to use the sort function.

3. Click Input → Partitioning. From this page, you are going to set the sort function.

a. Select Sort Merge from the Collector type menu.

b. From the Sorting area, click Perform sort.

c. Then, click Stable to preserve any previously sorted data sets.

d. From the Available list, select the sort key qsMatchSetID.

4. Click Output → Mapping.

5. Copy the columns from the Columns pane to the MatchedOutput pane.

6. Click OK to close the stage window.
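Conceptually, the sort-merge collector interleaves the MatchedData and Duplicates links into a single stream ordered on the sort key, and the Stable option keeps equal keys in their arrival order. A minimal Python sketch of the same idea, assuming records are rows that carry a qsMatchSetID column (the sample values are invented):

    import heapq

    # Records arriving on the two input links; the values are invented.
    matched = [{"qsMatchSetID": 1, "name": "J SMITH"},
               {"qsMatchSetID": 3, "name": "A JONES"}]
    duplicates = [{"qsMatchSetID": 1, "name": "JOHN SMITH"},
                  {"qsMatchSetID": 2, "name": "B LEE"}]

    # Sort each link on the key, then merge. heapq.merge performs a stable
    # sort-merge, preserving the relative order of records with equal keys,
    # much like the Stable option on the Partitioning page.
    def key(r):
        return r["qsMatchSetID"]

    combined = list(heapq.merge(sorted(matched, key=key),
                                sorted(duplicates, key=key),
                                key=key))
    for r in combined:
        print(r["qsMatchSetID"], r["name"])   # groups 1, 1, 2, 3 in order

Sorting on qsMatchSetID keeps each match set's master record and its duplicates together in the combined output file.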

Lesson 3.2 checkpoint

In Lesson 3.2, you configured the source files and stages of the Unduplicate Match job.

You learned how to do the following tasks:

v Load data and metadata generated in a previous job

v Apply a Match specification to process the data according to matches and duplicates

v Combine records into a single file


Lesson 3.3: Configuring the Unduplicate job target files

You are going to attach file names to the target files that receive the four output record streams. The records in the MatchedOutput file become the source records for the next job.

To configure the target files:

1. Double-click the MatchOutput_csv file. You will attach a file name to the matched records.

2. Click Target → File.

3. In the File field, enter C:\Tutorial\Data\MatchOutput.csv.

4. Repeat steps 2 to 3 for each additional target file.

v For the ClericalOutput_csv file, type ClericalOutput.csv.

v For the NonMatchedOutput_csv file, type NonMatchedOutput.csv.

5. Click OK to close the window.

6. Click the Compile button to compile the job in the Designer client.

7. Click Tools → Run Director to open the DataStage Director. The Director opens with the Unduplicate Match job visible in the Director window with the Compiled status.

8. Click Run.

You have configured the target files.

Related information

"Lesson 1.8: Compiling and running jobs" on page 15
You test the completeness of the Investigate job by running the compiler and then you run the job to process the data for the reports.

Lesson checkpoint

In this lesson, you have combined the matched and duplicate address records into one file. The

nonmatched and clerical output records were separated into individual files. The clerical output records

can be reviewed manually for matching records. The nonmatched records will be used for the next pass.

The matched and duplicate address records will be used in the Survive job.

You have learned how to separate the output records from the Unduplicate Match stage to the various

target files.

Summary for Unduplicate stage job

In Module 3, you set up and configured an Unduplicate stage job in order to isolate matched and

duplicate name and address data into one file.

In creating an Unduplicate stage job, you added a Match specification to apply the blocking and

matching criteria to the standardized and frequency data created in the Standardize job. After applying

the Match specification, the resulting records were sent out through four output links, one for each type

of record. The match and duplicates were sent to a Funnel stage that combined the records into one

output that was written to a file. The nonmatched, or residual, records were sent to a file, as were the

clerical output records.

Lessons learned

By completing Module 3, you learned about the following concepts and tasks:

v Applying a Match specification to the Unduplicate stage

v How the Unduplicate stage groups records with similar attributes


v Ensuring data integrity by applying probability matching technology


Chapter 6. Module 4: Creating a single record

In Module 4, you are designing a Survive job to isolate the best record for each customer’s name and

address.

With the Unduplicate job, groups of records with similar attributes were identified. In the Survive job, you specify which columns and column values from each group create the output record for the group. The output record can include the following information:

v An entire input record

v Selected columns from the record

v Selected columns from different records in the group

You select column values based on rules for testing the columns. A rule contains a set of conditions and a list of targets. If a record tests true against the conditions, the column value for that record becomes the best candidate for the target. After testing each record in the group, the columns declared best candidates

combine to become the output record for the group. Whether a column survives is determined by the

target. Whether a column value survives is determined by the rules.
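The following sketch models this rule mechanism in Python. It is an illustration only: the column names, the test on a hypothetical qsMatchType value of MP, and the data are all invented, and the real Survive stage expresses its conditions in its own rule syntax.

    # Toy model of survive rules: each rule is a list of targets plus a
    # condition. When a record tests true, its column values become the
    # best candidates for those targets; the surviving candidates combine
    # into the output record for the group.

    group = [
        {"qsMatchType": "MP", "Name": "J SMITH",    "Phone": ""},
        {"qsMatchType": "DA", "Name": "JOHN SMITH", "Phone": "555-0100"},
    ]

    rules = [
        # the Name survives from the master record (invented "MP" test)
        (["Name"],  lambda r: r["qsMatchType"] == "MP"),
        # a non-blank Phone from any record beats a blank one
        (["Phone"], lambda r: r["Phone"] != ""),
    ]

    output = {}
    for record in group:                   # test each record in the group
        for targets, condition in rules:
            if condition(record):          # record tests true for this rule
                for target in targets:
                    output[target] = record[target]  # new best candidate

    print(output)   # -> {'Name': 'J SMITH', 'Phone': '555-0100'}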

Learning objectives

After completing the lessons in this module, you will be able to do the following tasks:

1. Add stages and links to a Survive job.

2. Choose the selected column.

3. Add the rules.

4. Map the output columns.

This module should take approximately 20 minutes to complete.

Lesson 4.1: Setting up a survive job

Creating the best results record in the Survive stage is the last job in the data cleansing process. The best

results record is the name and address with the highest probability of being correct for every bank

customer.

In this lesson, you will add the Data Quality Survive stage, the source file of combined data from the

Unduplicate Match job, and the target file for the best records.

To set up a survive job:

1. Add the following stages to the Designer canvas from the palette:

v Data Quality Survive stage to the middle of the canvas

v Sequential file to the left of the Survive stage

v Second Sequential file to the right of the Survive stage

2. Right-click the left Sequential file and drag to create a link from the file to the Survive stage.

3. Drag a second link from the Survive stage to the output Sequential file.

4. Rename the following stages:

   Stage                    Change to
   Left Sequential file     MatchedOutput
   Survive stage            Survive
   Right Sequential file    Survived_csv

5. Rename the following links:

   Link                            Change to
   From MatchedOutput to Survive   Matchesandduplicates
   From Survive to Survived_csv    Survived

Lesson checkpoint

In this lesson, you learned how to set up a Survive job by adding the results of the Unduplicate Match job as source data, the Survive stage, and a target file for the output record of each group.

You have learned that the Survive stage takes one input link and one output link.

Lesson 4.2: Configuring the survive job stage properties

You will load matched and duplicates data from the Unduplicate Match job, configure the Survive stage

with rules that test columns to a set of conditions, and configure a target file.

With the survive job, you are testing column values to ascertain which columns are the best candidates

for that record. These columns are combined to become the output record for the group. In selecting a

best candidate, you can specify that these column values be tested:

v Record creation date

v Data source

v Length of data in a column

v Frequency of data in a group

To configure the source file:

1. Double-click the MatchedOutput file to access the Properties page.

2. Click Source → File and select Browse for file.

3. Find and load the MatchedOutput.csv file.

4. Click Columns.

5. Click Load and select MatchedOutput in the Table Definitions folder.

6. Click OK to close the MatchedOutput window.

You have loaded the table definitions into the MatchedOutput file and attached the file MatchedOutput.csv.

Related information

"Lesson 1.8: Compiling and running jobs" on page 15
You test the completeness of the Investigate job by running the compiler and then you run the job to process the data for the reports.


Configuring the Survive stage

You will configure the Survive stage with rules to compare the columns against a best case.

To configure the survive stage:

1. Double-click the Survive stage.

2. Click New Rule to open the Survive Rules Definition window. The Survive stage requires a rule that

contains one or more targets and a TRUE condition expression.

You define the rule by specifying each of the following elements:

v Target column or columns

v Column to analyze

v Technique to apply to the column being analyzed

3. Select AllColumns from Available Columns and click the add arrow to move AllColumns to the Target column. When you select AllColumns, you are assigning the first record in the group as the best record.

4. From the Survive Rule (Pick one) area, click Analyze Column and select qsMatchType from the Use

Target drop-down menu. You are selecting qsMatchType as the target to which to compare other

columns.

5. From the Technique field drop-down menu, click Equals. The rules syntax for the Equals technique

is c."column" = "DATA".

6. In the Data field, type MP.

7. Click OK to close the New Rule window.

8. Follow steps 2 to 5 to add the following columns and rules:

   Targets               Analyze Column        Technique
   GenderCode_USNAME     GenderCode_USNAME     Most Frequent (Non-blank)
   FirstName_USNAME      FirstName_USNAME      Most Frequent (Non-blank)
   MiddleName_USNAME     MiddleName_USNAME     Longest
   PrimaryName_USNAME    PrimaryName_USNAME    Most Frequent (Non-blank)

   You can view the rules that you added in the Survive Stage grid. (The sketch after this procedure illustrates how these techniques behave.)

9. From the Select the group identification data column list, select qsMatchSetID.

10. Click Stage Properties → Output → Mapping.

11. Select the columns from the Columns pane and copy them to the Survived pane.

12. Click OK to close the Mapping tab.

13. Click OK to close the Survive Stage window.
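The techniques in the table in step 8 reduce to simple selection functions over one match group's column values. The sketch below suggests how Most Frequent (Non-blank) and Longest might behave; the data is invented, and the real stage evaluates these techniques through its own rule syntax.

    from collections import Counter

    # One match group's values for two columns; the data is invented.
    first_names  = ["JOHN", "JOHN", "J", ""]
    middle_names = ["Q", "QUINCY", "", "Q"]

    def most_frequent_nonblank(values):
        """Survive the value that occurs most often, ignoring blanks."""
        counts = Counter(v for v in values if v.strip())
        return counts.most_common(1)[0][0] if counts else ""

    def longest(values):
        """Survive the longest value in the group."""
        return max(values, key=len)

    print(most_frequent_nonblank(first_names))   # -> "JOHN"
    print(longest(middle_names))                 # -> "QUINCY"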


Configuring the target file

You will configure the target file.

1. Double-click the target file and click Target → File to activate the File field.

2. Type the path name of the record.csv file.

3. Click OK to close the window.

Lesson checkpoint

You have set up the survive job, renamed the links and stages, and configured the source and target files and the Survive stage.

With Lesson 4.2, you learned how to select simple rules that are applied to selected columns. Each record in the group is tested against these rules to find the best record.

Module 4: survive job summary

With Module 4, you have completed the last job in the QualityStage work flow. In this module, you set

up and configured the survive job to select the best record from the matched and duplicate name and

address data created in the Unduplicate Match stage.

In configuring the Survive stage, you selected columns from the source file, added a rule to each column, and applied the rules to the data. After the Survive stage processed the records to select the best

record, the information was sent to the output file.


Lessons learned

In completing Module 4, you learned about the following tasks and concepts:

v Using the Survive stage to create the best candidate record

v How to apply simple rules to the column values


Chapter 7. WebSphere QualityStage tutorial summary

From the lessons in this tutorial, you learned how QualityStage can be used to help an organization manage and maintain its data quality. It is imperative that a company's customer data be of high quality: up-to-date, complete, accurate, and easy to use.

The tutorial presented a common business problem, verifying customer names and addresses, and showed the steps to take using QualityStage jobs to reconcile the various names that belonged to one household. The tutorial presented four modules that covered the four jobs in the QualityStage work flow. These jobs accomplish the following tasks:

v Investigating data to identify errors and validate the contents of fields in a data file

v Conditioning data to ensure that the source data is internally consistent

v Matching data to identify all records in one file that correspond to similar records in another file

v Identifying which records from the match data survive to create a best candidate record

Lessons learned

By completing this tutorial, you learned about the following concepts and tasks:

v About the QualityStage work flow

v How to set up a QualityStage job

v How data created in one job is the source for the next job

v How to create quality data using QualityStage


Accessing information about IBM

IBM has several methods for you to learn about products and services.

You can find the latest information on the Web at www-306.ibm.com/software/data/integration/info_server/:

v Product documentation in PDF and online information centers

v Product downloads and fix packs

v Release notes and other support documentation

v Web resources, such as white papers and IBM Redbooks™

v Newsgroups and user groups

v Book orders

To access product documentation, go to this site:

publib.boulder.ibm.com/infocenter/iisinfsv/v8r0/index.jsp

You can order IBM publications online or through your local IBM representative.

v To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/order.

v To order publications by telephone in the United States, call 1-800-879-2755.

To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at

www.ibm.com/planetwide.

Contacting IBM

You can contact IBM by telephone for customer support, software services, and general information.

Customer support

To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).

Software services

To learn about available service options, call one of the following numbers:

v In the United States: 1-888-426-4343

v In Canada: 1-800-465-9600

General information

To find general information in the United States, call 1-800-IBM-CALL (1-800-426-2255).

Go to www.ibm.com for a list of numbers outside of the United States.

Accessible documentation

Documentation is provided in XHTML format, which is viewable in most Web browsers.


XHTML allows you to view documentation according to the display preferences that you set in your

browser. It also allows you to use screen readers and other assistive technologies.

Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing

the online documentation using a screen reader.

Providing comments on the documentation

Please send any comments that you have about this information or other documentation.

Your feedback helps IBM to provide quality information. You can use any of the following methods to

provide comments:

v Send your comments using the online readers’ comment form at www.ibm.com/software/awdtools/rcf/.

v Send your comments by e-mail to [email protected]. Include the name of the product, the version

number of the product, and the name and part number of the information (if applicable). If you are

commenting on specific text, please include the location of the text (for example, a title, a table number,

or a page number).


Notices and trademarks

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries.

Consult your local IBM representative for information on the products and services currently available in

your area. Any reference to an IBM product, program, or service is not intended to state or imply that

only that IBM product, program, or service may be used. Any functionally equivalent product, program,

or service that does not infringe any IBM intellectual property right may be used instead. However, it is

the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or

service.

IBM may have patents or pending patent applications covering subject matter described in this

document. The furnishing of this document does not grant you any license to these patents. You can send

license inquiries, in writing, to:

IBM Director of Licensing

IBM Corporation

North Castle Drive

Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property

Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation

Licensing 2-31 Roppongi 3-chome, Minato-ku

Tokyo 106-0032, Japan

The following paragraph does not apply to the United Kingdom or any other country where such

provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION

PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR

IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF

NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some

states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this

statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically

made to the information herein; these changes will be incorporated in new editions of the publication.

IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this

publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in

any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of

the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without

incurring any obligation to you.


Licensees of this program who wish to have information about it for the purpose of enabling: (i) the

exchange of information between independently created programs and other programs (including this

one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation

J46A/G4

555 Bailey Avenue

San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases,

payment of a fee.

The licensed program described in this document and all licensed material available for it are provided

by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or

any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the

results obtained in other operating environments may vary significantly. Some measurements may have

been made on development-level systems and there is no guarantee that these measurements will be the

same on generally available systems. Furthermore, some measurements may have been estimated through

extrapolation. Actual results may vary. Users of this document should verify the applicable data for their

specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their

published announcements or other publicly available sources. IBM has not tested those products and

cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM

products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of

those products.

All statements regarding IBM’s future direction or intent are subject to change or withdrawal without

notice, and represent goals and objectives only.

This information is for planning purposes only. The information herein is subject to change before the

products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate

them as completely as possible, the examples include the names of individuals, companies, brands, and

products. All of these names are fictitious and any similarity to the names and addresses used by an

actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming

techniques on various operating platforms. You may copy, modify, and distribute these sample programs

in any form without payment to IBM, for the purposes of developing, using, marketing or distributing

application programs conforming to the application programming interface for the operating platform for

which the sample programs are written. These examples have not been thoroughly tested under all

conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these

programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright

notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. ©

Copyright IBM Corp. _enter the year or years_. All rights reserved.


If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks

IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document.

See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks.

The following terms are trademarks or registered trademarks of other companies:

Java™ and all Java-based trademarks and logos are trademarks or registered trademarks of Sun

Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT®, and the Windows logo are trademarks of Microsoft Corporation in

the United States, other countries, or both.

Intel®, Intel Inside® (logos), MMX and Pentium® are trademarks of Intel Corporation in the United States,

other countries, or both.

UNIX® is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product or service names might be trademarks or service marks of others.




Printed in USA

SC18-9925-00