-
DATA QUALITY 101
TOP TIPS & TRICKS ON HOW TO IMPLEMENT & IMPROVE DATA QUALITY
BY DATASOURCE CONSULTING
http://www.datasourceconsulting.com
COPYRIGHT
Title: Data Quality 101: Top Tips & Tricks on How to Implement and Improve Data Quality©
Copyright: ©2015 Datasource Consulting, LLC
Warning Against Unauthorized Use: No part of this eBook may be reproduced in any
electronic, written, recorded, or photocopied form without permission of Datasource
Consulting, LLC, except when specifically granted or for use in a review or critical article.
Disclaimer: Although the author has taken all reasonable precautions to verify the
information in this eBook, neither the publisher nor the author takes any responsibility
for errors or omissions. No liability shall be assumed for any damages resulting from
information used in this eBook. This eBook is intended to deliver a high-level overview of
Data Quality and provide tips and tricks we’ve learned from our own experience working
in the industry. This book is not meant to be a comprehensive guide to Data Quality
or to developing a Data Quality program for your company. This eBook is intended for
entertainment purposes only. For more information on Data Quality and to develop a Data
Quality program for your company, contact Datasource Consulting at 888-453-2624 or
http://www.datasourceconsulting.com
TABLE OF CONTENTS
1 HOW TO IMPLEMENT DATA QUALITY
2 WHY IS DATA QUALITY IMPORTANT?
2.1 Matching the Data
2.2 When to Implement Data Quality
3 DATA QUALITY MANAGEMENT PROCESS
3.1 Data Quality is Iterative
3.2 5-Step Data Quality Management Process
3.3 Two Critical Players in the Data Quality Management Process
4 PROFILING
4.1 Data Profiling Techniques: Tips for Better and Cleaner Data
4.2 Data Profiling Techniques: Data Quality Dimensions
4.3 Data Profiling Tips & Best Practices
5 DEFINE DATA QUALITY RULES
5.1 Six Data Quality Rules to Follow
6 DESIGN
6.1 Tool Design
6.2 Design Tips in Informatica Data Quality
7 IMPLEMENTATION
8 MONITORING
8.1 Scorecarding Best Practices
9 CLOSING
10 ABOUT DATASOURCE CONSULTING
10.1 Data Quality by Datasource Consulting
10.2 Iterative Data Quality Management Process
10.3 What to Expect
HOW TO IMPLEMENT DATA QUALITY
Data quality is increasingly important in today's enterprise business world, especially as companies look to leverage the full potential of their data.

The negative impacts of poor data range from minor interruptions (time lost doing manual cross-checks, etc.) to potentially devastating implications (exposure to unnecessary risk or significant financial loss). Well-established Data Quality programs are long-term initiatives that address today's Data Quality issues to prevent them from being a concern in the future. A proactive approach builds trust and enhances collaboration between business and IT. Additionally, high Data Quality will feed into the success of more comprehensive efforts such as Data Governance or Master Data Management.

This eBook (and related articles and videos) will help identify some of the best Data Profiling tips and Data Quality rules, and provide some real-world examples to help with your Data Quality initiatives.
[Diagram: the Data Quality Management Process cycle: Profile → Define Rules → Design → Implement → Monitor]
In this eBook, we'll dive into each phase of the
Data Quality Management Process, including:
- Data profiling
- Rule definition
- Design
- Implementation
- Monitoring
Without further ado, let’s get started!
Data Quality programs are long-term initiatives that address
today's Data Quality issues to prevent them from being a
concern in the future.
WHY IS DATA QUALITY IMPORTANT?
The cost of having bad data is one of the top reasons we need to ensure our data is good and accurate.

When the quality of a company's data is poor, the data becomes a liability instead of an asset.
In 2014, General Motors hit the
headlines with a Data Quality
blunder. The company recalled
2.6M vehicles and sent recall
notices to everyone who
could have been affected. Unfortunately, this
included the families of 13 people who lost
their lives due to faulty ignition switches.
General Motors was aware of these families
and could have removed their names from
the distribution list. However, instead of
cleansing their data and sending the notice
to a clean list, GM made a public relations
mistake and is now attempting to repair their
customer relationships by sending apology
letters and spending additional resources to
improve their company image.
Vodafone also made negative
headlines in 2014 when
customers received messages
thanking them for their
payments. The message would have been
great customer service, had the recipients
made recent payments on their accounts. This
Data Quality misstep, along with several other
billing errors, caused distress among Vodafone
customers. These types of errors lead to poor
customer retention, reduced profits, and
negative company image.
The poor data quality in these two cases
means the data is a liability for these
companies; improved data quality would
enable both organizations to focus on using
data as an asset.
A 2009 TDWI report found that the cost of bad data in the U.S. is over $600 billion each year, a figure likely to keep increasing with the amount of data generated daily. Looking back at our examples of General Motors and Vodafone, the potential consequences of low Data Quality include:
- Low customer satisfaction
- Customer loss
- Misguided business decisions
- Missed business opportunities
- Financial inaccuracies and mistakes
- Legal and monetary penalties
- Negative company image
How can these situations be avoided in the future? Without knowing the specifics of either
company's data, we can only speculate; however, by using a detailed example of poor data, we
can see a realistic potential outcome.
Below is a list of potential customers we want to contact for a sample promotion. What’s wrong
with this list?
CUST ID | CONTACT | ADDR1 | ADDR2 | ADDR3 | ADDR4 | CNTRY | PHONE | EMAIL
12345 | JON SMITH | 123 MAIN STREET | SAN DIEGO | CA | 92121 | USA | 8585555555 | [email protected]
23456 | JOHN SMITH | 123 MAIN ST. | San Diego | CA | 92121 | USA | 858-555-5555 | [email protected]
34567 | JOHN SMITH | 90 21ST PLACE | SD | CA | 92121 | USA | (858)555-5555 | [email protected]
45678 | JANE DOE | 834 2ND AVE ST | DENVER | CO | 80210 | USA | | [email protected]
56789 | BOB JOHNSON | 203 FRONT ST | DENVER | CO | 80209 | USA | | [email protected]
67890 | SARAH SMITH | 340 FIRST ST | LOS ANGELES | CA | | USA | | [email protected]
78901 | BILL WHITE | 3480 PEARL RD | | | | USA | |
89012 | JACK BLACK | 4667 GRAND AVE | | IL | 62223 | USA | 618-555-4897 | [email protected]
90123 | CHRIS WILLIAMS | 34350 PARK BLVD | | | | USA | | [email protected]
9876 | WILLIAM WALLACE | 2304 CHANCE PLACE | | CA | 1234 | USA | | [email protected]
1234 | STORM TAYLER | 7934 W. HILL LANE | SPRINGFIELD | | | USA | 6368789456 |
A123 | HOMER POWELL | WEST HILL LN. | SPRINGFIELD | MO | 62704 | USA | 636-555-7841 |

Right away we can see a number of things wrong with this dataset:

Incomplete Dataset: Missing data makes it impossible for our marketing messages to reach their target customers. In the example, we don't have phone numbers, email addresses, or certain address elements for every record.

Invalid Addresses: A number of addresses are missing house numbers or contain non-standard address elements.

Duplicate Content: There are several potential duplicate records. We have a few John Smiths; which John/Jon Smith is the correct one? With duplicate contacts, we could be wasting resources sending unnecessary multiple mailers.

Non-Standard Format: Much of the data is in a non-standard format. We see a state value (CA) in a city field, a postal code in a state field, and so on. Inconsistent and non-standard formatting makes process automation very difficult.
MATCHING THE DATA
The cost of having bad data is one of the biggest reasons we need to ensure our data is good and accurate.
Duplicate records can be a major stumbling block, and companies often underestimate how
quickly they get out of hand. In the image below, we’ve isolated five contacts. What are the
potential matches?
CUST ID | CONTACT | ADDR1 | ADDR2 | ADDR3 | ADDR4 | CNTRY | PHONE | EMAIL
12345 | JON SMITH | 123 MAIN STREET | SAN DIEGO | CA | 92121 | USA | 8585555555 | [email protected]
23456 | JOHN SMITH | 123 MAIN ST. | San Diego | CA | 92121 | USA | 858-555-5555 | [email protected]
34567 | JOHN SMITH | 90 21ST PLACE | SD | CA | 92121 | USA | (858)555-5555 | [email protected]
56789 | BOB JOHNSON | 203 FRONT ST | DENVER | CO | 80209 | USA | | [email protected]
67890 | SARAH SMITH | 340 FIRST ST | LOS ANGELES | CA | | USA | | [email protected]

These five records include a variety of matching options; highlighted below are only three of the potential ten. Matching criteria such as first name and last name ("John" / "Smith") yield candidate groups like these:

Candidate group 1:
12345 | JON SMITH | 123 MAIN STREET
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE

Candidate group 2:
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE
56789 | BOB JOHNSON | 203 FRONT ST

Candidate group 3:
12345 | JON SMITH | 123 MAIN STREET
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE
67890 | SARAH SMITH | 340 FIRST ST

From these examples, we see some of the potential consequences of poor Data Quality:
- Missed opportunities
- Misguided business decisions
- Low customer satisfaction
- Negative company image
- Overspending in marketing

One of the biggest challenges faced in Data Quality today is dealing with duplicate data. Over the next few pages we'll give you some practical examples and steps for how to handle duplicate data.
To come up with the potential number of matches, use the equation (n² − n) / 2, where "n" is the number of records. For example, if you have 5 records, you end up with 10 potential pairs.

As this equation shows, the number of potential matches skyrockets. The chart below illustrates the rapid increase.

NUMBER OF RECORDS | POTENTIAL MATCHES
5 | 10
50 | 1,225
500 | 124,750
5,000 | 12,497,500

As matching these records would be very time consuming, the best strategy is to group the records based on a common value. The grouping in our example could be based on the contact or address value.

It is important to make sure you have the right number of records to effectively match and group. If a group is too large, you may waste time on records that shouldn't be matched. If it is too small, there may not be any available matches.

One best practice is to create and leverage algorithms that help determine true matches. "John Smith" and "Jon Smith" are likely duplicates, but John Smith at 123 Main St. and John Smith at 90 21st Place are probably separate records.

Algorithms are a great resource to help matching efforts; a minimal sketch follows the list. Some common algorithms are:
- Edit distance
- Jaro distance
- Bigram
- Hamming
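To make the pair count and a matching algorithm concrete, here is a minimal Python sketch. It is our illustration rather than an excerpt from any Data Quality tool; the sample strings and the use of difflib's ratio as a stand-in for an edit-distance-style score are our own choices.

from difflib import SequenceMatcher

def potential_pairs(n: int) -> int:
    """Number of record pairs to compare: (n^2 - n) / 2."""
    return (n * n - n) // 2

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], akin to an edit-distance measure."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(potential_pairs(5))     # 10
print(potential_pairs(5000))  # 12497500

# "JON SMITH" vs "JOHN SMITH" score high; very different strings score low.
print(similarity("JON SMITH", "JOHN SMITH"))       # ~0.95 -> likely duplicates
print(similarity("123 MAIN ST", "90 21ST PLACE"))  # low   -> separate records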
WHEN TO IMPLEMENT DATA QUALITY
Create and leverage algorithms to determine which records truly match.
The purist in all of us will say “right now” while the realist will look for a clear reason or
business requirement to implement Data Quality. There are countless reasons to implement
Data Quality; below is a sampling of what companies may consider:
- Data as a Strategic Asset
- Risk of Non-Compliance
- Disparate Data Sources
- Master Data Management
- Mergers & Acquisitions
- New Systems
- System Migrations

Data as a Strategic Asset: Let's say Facebook wants to increase its advertising revenue. Facebook offers many ways for users to interact with friends and identify themselves as potential marketing targets:
- Status updates
- Hashtags
- Location check-ins
- Photo uploads
- "Likes" and reactions

John starts his day at a fitness class, where he checks in on Facebook and updates his status. After class, John takes a walk by the smoothie vendor, where he snaps a photo of the day's smoothies and posts a comment. Later at a baseball game, John again checks in and updates his status.

Facebook would be able to identify this person's likes and interests and target relevant ads, increasing advertising revenue.
Risk of Non-Compliance: Most companies
today are required to follow industry-
specific regulations. Some organizations
use validated systems requiring them to
follow Sarbanes-Oxley (SOX), HIPAA, or
the Sunshine Act. If a company is audited,
it is crucial that their data is accurate.
Data Quality is an important piece of the
compliance puzzle.
Disparate Data Sources: Think about
migrating all your data sources into a single
Data Warehouse. It’s difficult to report on non-
standardized data from disparate locations;
Data Quality will standardize your data and
allow for efficient reporting.
Master Data Management: Consider a customer master, for example. Many companies have multiple source systems housing data, and each system could house an individual record for the same (or similar) person. In our previous example of duplicate records, the only way to know which customer is the right one is by implementing a solid Data Quality initiative to standardize data and appropriately match and merge records.
Mergers and Acquisitions: The parent
company often takes over the products,
sales, and database of the purchased
company. To consolidate and report on the
data effectively, the company must have a
standardized format. A strong Data Quality
initiative ensures clean data.
New Systems: We’ve all heard the saying,
“Garbage in, garbage out.” When implementing
a new system, consider launching a Data
Quality Initiative to verify your data is cleansed.
Clean data allows you to generate accurate
reports for improved decision-making.
System Migrations: Many organizations are migrating their CRM or ERP systems, and the old and new systems are often formatted differently. This is where you use Data Quality to cleanse the data prior to migrating it.

Now let's look at implementation; the simplest method is to follow the Data Quality Management Process.
DATA QUALITY MANAGEMENT PROCESS
This procedure allows you to correct your data using a systematic process while also developing a scalable framework for the future.

DATA QUALITY IS ITERATIVE
The Data Quality Management Process is an iterative approach to data quality. Profiling is a common starting point; however, Data Quality is circular, measuring and improving the data on an ongoing basis.

[Diagram: the Data Quality Management Process cycle: Profile → Define Rules → Design → Implement → Monitor]

5-STEP DATA QUALITY MANAGEMENT PROCESS
The Data Quality Management Process is an easy-to-follow five-step program:

Step 1 - Profiling: Data Profiling is primarily used to measure the overall quality of the data and becomes increasingly important for the continued monitoring and reporting of your data.

Step 2 - Defining Rules: Rule definition is somewhat of a "wish list". During this phase, we focus on how we'd like the data to look ideally, not on how it currently looks.

Step 3 - Designing: In the design phase the Developer takes the business rules (the "wish list") defined by the Data Steward and converts them into meaningful and useful goals.

Step 4 - Implementing: Processes are automated, and the Data Steward and Developer work together to manage expectations, match and merge records, and remove duplicates after standardization.

Step 5 - Monitoring: The Data Steward consistently monitors the data to assess how well the Data Quality is performing for particular fields. During this phase, the Data Steward can determine whether to update the current rules or create new ones.

And that brings us right back to the first step in the Data Quality Management Process - profiling the data.
TWO CRITICAL PLAYERS IN THE DATA QUALITY MANAGEMENT PROCESS
As noted in the previous section, two critical people should be involved when effectively
implementing the Data Quality Management Process: the Data Steward and the Developer.
Data Steward: The Data Steward understands and profiles the data. Their goal is
to measure the data quality, identify possible anomalies, and define the rules used to
cleanse, standardize, and validate the data. Also, the Data Steward may define goals as
part of an ongoing improvement process.
Developer: Once the rules are defined, the Developer designs the definitions. This
process typically takes place within a data quality application like Informatica Data
Quality (IDQ).
Data Stewards and Developers collaborate to implement the rules and process, and the data
quality is then managed by the Data Steward.
As you can see, this is an iterative process. It becomes vitally important to the overall quality of your data for the Data Steward to continually measure the data against the initially defined goals. As stewards become increasingly familiar with the data, the rules implemented in the beginning may no longer be relevant and will need to change. The data itself will also change as it is updated and new sources are introduced. This means that today's rules may not be sufficient to cleanse, standardize, and de-duplicate the data in the future; therefore, there is a constant need for iteration.
The Data Quality Management Process is a continuous cycle of improvement. By asking the following questions, you'll improve the quality of your data over time:
- Is the data quality improving over time? If it isn't, how do we change the rules to help improve it?
- Are the current business rules meeting the needs of the company?
- If a new data source is introduced, can the same data quality rules be applied?
PROFILING
Step one in the Data Quality Management Process.

[Diagram: the Data Quality Management Process cycle]

In the examples below, we measured the quality of the data in a few different areas:
- Address
- Phone numbers
- Company

While the Data Steward can manually profile the data using SQL or Excel (or a variety of other tools), the examples showcase Informatica Data Quality (IDQ).

Data Quality Example 1:
Here, we'll be looking at ADDR2. From the illustration we can see there are seven unique values and four NULL values. Upon further inspection, we see three records: "SD", "SAN DIEGO", and "San Diego". It is likely that SD, SAN DIEGO, and San Diego are all the same value, so the Data Steward will need to create a rule that standardizes the data.
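As an illustration of what a value-frequency profile reports, here is a minimal Python sketch of the kind of summary a profiling tool produces for ADDR2. The sample values are hypothetical and simply mirror the example above; IDQ's actual output will differ.

from collections import Counter

# Sample ADDR2 values from a customer extract (hypothetical data; None = NULL).
addr2 = ["SAN DIEGO", "San Diego", "SD", "DENVER", "DENVER",
         "LOS ANGELES", None, None, None, None, "SPRINGFIELD", "SPRINGFIELD"]

nulls = sum(1 for v in addr2 if v is None)
freq = Counter(v for v in addr2 if v is not None)

print(f"NULL values: {nulls}")
print(f"Unique non-NULL values: {len(freq)}")
for value, count in freq.most_common():
    print(f"{value!r}: {count}")
# The profile surfaces 'SD', 'SAN DIEGO', and 'San Diego' as distinct values,
# which is the cue for the Data Steward to define a standardization rule.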
Data Quality Example 2:
We can also profile the data based on the phone number. The data pattern below is not consistent: we have two records formatted as numbers only, three with hyphens, and six that are NULL. This is a case for better Data Quality. As in Example 1, the Data Steward will need to create a rule to standardize the data.

Data Quality Example 3:
You can also profile based on specific values. We can see in the sample below entries for both "Go N Stop" and "Stop N Go". This raises the question of the actual name of the company, or whether there are two separate companies. If we determine this is the same company, we'll need to figure out how to cleanse the data so it reflects the proper business name.

These three examples highlight a small sampling of the variety of scenarios where data profiling can benefit your business.
DATA PROFILING TECHNIQUES: TIPS FOR BETTER AND CLEANER DATA
The first step in a successful Data Quality Management Process, data profiling focuses on identifying and measuring the quality and use of data. In addition to identifying anomalies in the overall data set, the Data Steward may also be looking to determine:
- How good or bad the data is right now
- Other potential uses for the data
- Ways to improve the ability to search the data
- Metadata accuracy
- Conformity to current standards

Data profiling allows companies to see an enterprise view of all data for use in Master Data Management, Data Governance, etc. Data Profiling is used to understand data issues and lay the foundation for business rules and processes. Outlined below are a variety of techniques for data profiling we've learned in the field. Our hope is that these different techniques will help push your Data Quality initiative forward.

DATA PROFILING TECHNIQUES: DATA QUALITY DIMENSIONS
There are six different dimensions of data quality, each illustrated with a sample below:

Completeness: Is the data complete or are there missing elements?
ADDR1 | ADDR2 | ADDR3 | ADDR4
3480 PEARL RD | | |
4667 GRAND AVE | | IL | 62223
34350 PARK BLVD | | |

Conformity: Is there any data in a non-standard format?
PHONE
8585555555
858-555-5555
(858)555-5555

Consistency: Are all of your transaction records clean?
CUST ID | TRANS ID | PROD ID | TRANS DT | AMT
12345 | A9384 | PRD3842 | 1/1/2014 | 1000
56789 | A9384 | PRD3842 | 1/1/2014 | 1000

Accuracy: What values are valid? Do you have an address that doesn't include a number? What other values are invalid?
ADDR1 | ADDR2 | ADDR3
8434 2ND AVE ST | DENVER | CO
WEST HILL LN. | SPRINGFIELD | MO

Duplicates: Are there duplicate records? Which record is the correct record?
CONTACT | ADDR1 | ADDR2
JON SMITH | 123 MAIN STREET | SAN DIEGO
JOHN SMITH | 123 MAIN ST. | San Diego
JOHN SMITH | 90 21ST PLACE | SD

Integrity: Are all of the fields complete? Do you have any missing IDs?
CUST ID | TRANS ID | PROD ID | TRANS DT | AMT
12345 | A9384 | PRD3842 | 1/1/2014 | 1000
 | A9201 | PRD124 | 1/12/2014 | 10000
45678 | A3402 | PRD492 | 2/1/2014 | 500
DATA PROFILING TIPS & BEST PRACTICES
In addition to the six data quality dimensions, we've outlined seven tips, tricks, and best practices to help you with your initiative.

Timeliness: If you expect your data to be loaded weekly or monthly, it is best to verify that your data is following this schedule. Ensuring that the data loads on a regular schedule will allow you to profile and identify any anomalies.

Profile often: It is a good practice to profile as often as the data is loaded. As new data is inserted or updated, new anomalies may appear. This practice will enable you to identify these anomalies as they come up and allow you to address and correct them before they get out of hand.

Profile production data: Looking at production data versus test data ensures the appropriate rules are defined as you are profiling. If you are not looking at production data, you could be defining an unnecessary rule.

Profile all data: Unless you have a large data set with billions of records, it is a best practice to profile all of your data. This ensures you capture all potential anomalies.

Perform column-level profiling first: Column-level profiling should be performed first to determine what columns to include in the Data Quality Management Process. For example, you may think a particular column stores an important data segment, but after profiling discover that data is actually NULL. This could indicate either a larger data problem or that you're looking at the wrong column. A short sketch of this check follows the list.

Document the results: Another best practice is to document your results. This helps you prioritize as you move into the next phase of the process.

Use the right tools for the size of the job: Software like Informatica Data Quality (IDQ) can provide faster analysis. If you are profiling a handful of records, profiling manually or by using SQL may be sufficient. However, for a large dataset, a tool like IDQ will speed up the profiling process.
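As promised above, here is a short Python sketch of the column-level check. It is our illustration, not IDQ functionality; the records and the FAX column are hypothetical.

# Hypothetical extract: each record is a dict of column -> value (None = NULL).
records = [
    {"CUST_ID": "12345", "PHONE": "8585555555", "FAX": None},
    {"CUST_ID": "23456", "PHONE": "858-555-5555", "FAX": None},
    {"CUST_ID": "34567", "PHONE": None, "FAX": None},
]

for col in records[0].keys():
    null_ratio = sum(1 for r in records if r[col] is None) / len(records)
    # A column that is entirely NULL may signal a larger data problem,
    # or that we are looking at the wrong column.
    flag = "  <- investigate before defining rules" if null_ratio == 1.0 else ""
    print(f"{col}: {null_ratio:.0%} NULL{flag}")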
DEFINE DATA QUALITY RULES
The next step in the process, after profiling, is to define the rules.

[Diagram: the Data Quality Management Process cycle]

Rule definition is typically part of the analysis phase of the project and, like profiling, is performed by the Data Steward. During this step, it is best to think about how the data should look in the target system, not what it currently looks like in the source system. In order to accomplish this, we'll use at least three different fundamentals:

Manipulation Rules: Data Quality manipulation rules should be created to assist the migration of data from one system to another. For example, if the source system allows special characters but the target system doesn't, a manipulation rule could help.

Validation Rules: Validation rules go hand-in-hand with manipulation rules. Validation rules validate your data against the newly created manipulation rule.

Metrics: During this part of the process, scorecard metrics should also be considered.

Let's look at an example in IDQ of how a Data Steward would define these rules. Looking at the data, they can see ADDR2 has the following four different values, highlighted in the illustration:
- SD
- SAN DIEGO (in all caps)
- San Diego (mixed case)
- NULL values

The Data Steward can easily put a comment in the tool noting that NULL values are not valid and add a rule to standardize the data.

Inside IDQ, the Data Steward can also tag the field and provide the Developer with direction on how they want to standardize. The Developer will see the comment and the tag in the Developer tool, giving them a jumpstart in designing the rule.
A Data Steward can look at the list of values and determine which are valid. Once valid values
have been identified, the values are added to the reference table in IDQ. In the image below,
we have “San Diego” and “SD”. If the value “SD” shows up, the value “San Diego” will be
returned. Similarly, if the value “LA” shows up, the value “Los Angeles” will be returned.
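Conceptually, a reference table behaves like a lookup from each incoming value to its standardized form. Here is a minimal Python sketch, with a plain dictionary standing in for the IDQ reference table; the city list is hypothetical.

# A tiny stand-in for an IDQ reference table: raw value -> standardized value.
city_reference = {
    "SD": "San Diego",
    "SAN DIEGO": "San Diego",
    "LA": "Los Angeles",
    "LOS ANGELES": "Los Angeles",
}

def standardize_city(value: str) -> str:
    """Return the standardized city, or the input unchanged if not in the table."""
    return city_reference.get(value.strip().upper(), value)

print(standardize_city("SD"))  # San Diego
print(standardize_city("LA"))  # Los Angeles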
Every Data Quality rule created should have
an impact on a business need.
Likewise, Data Stewards create rule specifications
in Informatica Analyst (a helpful tool when
performing Data Quality). Stewards define the
inputs and logic and can test the rule logic.
Once saved, this is ready for the developer in the
developer tool as a mapplet.
Data Stewards create mapping specifications where they map the data from a source to a target defined by business logic. Once saved, the mapping specification shows up in the Developer tool as an LDO or mapping.
Using the functionality described in
the examples (comments, tags, rule
specifications, pre-built rules, and reference
tables) enhances collaboration between
the Data Steward and Developer. This
allows for faster design/development and
information-sharing in the tool rather than
in a spreadsheet on someone’s desk.
SIX DATA QUALITY RULES TO FOLLOW
Outlined below are six tips to consider and implement during the rules phase of the Data Quality Management Process.

Rules Should Serve a Purpose: When we were kids, our parents set rules for our behavior. These rules were put in place to help protect us or to help us learn; they served a purpose. In the same way, part of the Data Quality Management Process is to define rules. Make sure to define rules that solve business data issues - every rule should impact a business need.

Validation & Manipulation Rules: When defining rules for specific business cases and business objectives, it can be easier to break them into two types: manipulation rules and validation rules.

Manipulation Rules: Manipulation rules help to cleanse and standardize the data. These can be as simple as trimming blank spaces or capitalizing certain values. During this part of the process, ask yourself: "How should this data be cleansed or standardized?" or, "How would I like this data to appear in the target system?"

Validation Rules: Validation rules help us determine whether the data is valid and usable or invalid, and can also assist in building congruency among data sets. Here's an example of a question you can ask regarding a data validation rule: "Is this address valid and in the correct format?"

As the data flows through the Data Quality Management Process, we need a way to flag it as valid or invalid. The Data Steward will define the validation rule, but the Developer should make sure the manipulation rule is included as part of the validation rule code.

CONTACT | ADDR1 | ADDR2
JON SMITH | 123 MAIN STREET | SAN DIEGO
JOHN SMITH | 123 MAIN ST. | San Diego
JOHN SMITH | 90 21ST PLACE | SD

As we process the data, we may find that "123 MAIN street, SAN DIEGO" should be "123 MAIN ST., San Diego" (notice the "ST." versus spelling out "street", and the casing). As we can see, validation rules provide consistency to our data. It's beneficial to define validation rules with the manipulation rules in mind.

The Developer can easily code a manipulation rule to properly case the data and standardize the format, so there's no need to go back to the source system to manipulate the data. Once the address is in a standardized format, the validation rule can be applied. So when we validate the address against reference data (e.g. in a reference table) as part of our validation rule, we need to make sure the manipulation rule is also in place; the sketch below shows the pairing.
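Here is that sketch of the manipulation/validation pairing, in Python. It is our illustration of the concept, not IDQ code; the street-suffix map and the reference city list are hypothetical.

# Manipulation rule: standardize the street suffix and casing.
SUFFIXES = {"STREET": "ST.", "AVENUE": "AVE.", "LANE": "LN."}

def manipulate_address(addr, city):
    words = [SUFFIXES.get(w.upper(), w.upper()) for w in addr.split()]
    return " ".join(words), city.title()  # e.g. "123 MAIN ST.", "San Diego"

# Validation rule: check the cleansed value against reference data.
VALID_CITIES = {"San Diego", "Los Angeles", "Denver"}

def validate(addr, city):
    # The manipulation rule runs first, so validation sees standardized data.
    std_addr, std_city = manipulate_address(addr, city)
    return std_addr[0].isdigit() and std_city in VALID_CITIES

print(validate("123 MAIN street", "SAN DIEGO"))  # True
print(validate("WEST HILL LN.", "SPRINGFIELD"))  # False: no house number, unknown city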
Prioritization: Prioritize the rules as you define them. Thinking back to when we
were kids, certain rules had a higher priority than others. For example, “look both ways
before you cross the street” is more important than “wash behind your ears”. They both
have consequences, but the consequence of one is potentially much greater than the
other. The same is true for rules. The office environment can be hectic, and it is important
for Developers to understand what rules take precedence. Prioritizing rules helps the
Developer implement the most important in the shortest amount of time, allowing them to
deliver first on rules that provide the highest impact. For example, if rules are prioritized
as part of the Data Quality Management Process, a developer can work on implementing
the first 10 rules, and then work on the next 10 as time allows. This approach can provide
a quick win for IT. By simply implementing a subset of the rules, we are able to show the
business the value in a Data Quality program.
Therefore, proper prioritization helps everyone be more efficient and successful.
Rule Changes: Since the Data Quality Management Process is very iterative, the
initially-established rules may not be the same moving forward.
As part of a cohesive team, it will be important to educate Data Stewards on rule changes.
Stewards should be aware of what changes will occur as they monitor Data Quality. It is
also important to communicate to the Data Steward that any manipulation rule change
may require an adjustment in the related validation rule.
Collaborate: Software enhances collaboration between the Data Steward
and the Developer. Informatica Data Quality (IDQ) has built in functionality in the form
of comments, tags, rule specifications, pre-built rules, and reference tables to enhance
collaboration between the Data Steward and Developer. This allows for faster design and
development as well as having information shared in one project in the tool rather than a
spreadsheet.
Reference Tables: Any eBook on Data Quality wouldn't be complete without sharing
a few tips regarding reference tables.
Reference Table Tip 1: Consider using
reference tables to store a valid list of values.
Reference tables can be managed by the Data
Steward within IDQ, which also includes an
audit trail of any changes to the reference
tables.
Reference Table Tip 2: Consider using reference tables for LOVs (lists of values) rather than hardcoding the data.
A proactive approach to Data Quality builds trust and enhances collaboration between business and IT.
DESIGN
This part of the process builds on the defined rules and converts them into goals.

[Diagram: the Data Quality Management Process cycle]

The next step of the Data Quality Management Process is design. In this phase, the Developer takes the business rules and converts them into meaningful and useful goals. These rules could include:
- Address validation
- Standardize names
- Remove noise
- Validate data values
- Matching

TOOL DESIGN
By using IDQ, we have a jump start in designing the rules with the collaboration techniques highlighted above. The Developer can build on top of what the Data Steward has already defined through rule and mapping specifications. During this process, the Developer should look at implementing the validation and manipulation rules together.

So, how would we go about designing in the tool? From our previous example with ADDR2, we have the following values:
- SD
- SAN DIEGO (all caps)
- San Diego (mixed case)
- NULL

The Data Steward identified that they want to uppercase the value. We can do that by applying "rule_Uppercase". As a result, we can see there is only SD and SAN DIEGO (all caps).
The Data Steward also prescribed that we should standardize the value. After applying "rule_Standardized_City" we have "San Diego" and no longer an "SD" value.

The Data Steward highlighted NULL values in their findings and specified that NULL values are invalid. So we apply "rule_Completeness" to see which records are NULL and which are not (records with a NULL value are incomplete).

Each of these rules can be applied one at a time or in a group. Pictured in the image below, we have applied the following rules in a group: uppercase, standardized city, and completeness. The output will tell the Data Steward whether the value is valid.

Once the rules have been applied in the profile, the Data Steward can drill down on the invalid records and determine what to do with them.

Note: You can also apply these rules within a Logical Data Object (LDO) in IDQ. An LDO is just a virtual mapping; we'll go into more detail later in the eBook.
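To make the grouped rules concrete, here is a rough Python equivalent of applying rule_Uppercase, rule_Standardized_City, and rule_Completeness in sequence. This is our sketch of the behavior described above, not IDQ's implementation.

def rule_uppercase(value):
    return value.upper() if value is not None else None

def rule_standardized_city(value):
    # Reference-table lookup, as in the earlier example.
    return {"SD": "SAN DIEGO"}.get(value, value) if value is not None else None

def rule_completeness(value):
    return value is not None  # NULL values are invalid

for raw in ["SD", "SAN DIEGO", "San Diego", None]:
    standardized = rule_standardized_city(rule_uppercase(raw))
    valid = rule_completeness(standardized)
    print(f"{raw!r} -> {standardized!r} (valid: {valid})")
# Every non-NULL value ends up as 'SAN DIEGO'; the NULL row is flagged invalid,
# so the Data Steward can drill down on it.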
DESIGN TIPS IN INFORMATICA DATA QUALITY
Outlined below are six different tips and tricks for the design phase of the Data Quality Management Process.

Consistent Naming and Coding Standards: Developers and Data Stewards will see the same rules in IDQ, so it is important to develop consistent naming and coding standards. For example, both the Data Steward and the Developer will understand what "rule_" means, while not everyone will understand what "mplt_" means. Therefore, mapplets should be named "rule_" if they are used in both the Informatica Analyst and Developer tools.

Anchors & Descriptors: It is a best practice to include anchors and descriptions in Data Quality mappings for faster modifications, readability, and comprehension. Any metadata changes to a source or target object in a mapping (e.g. a new column) can quickly be made if anchor transformations exist immediately following the source object and immediately preceding the target object. These can be simple pass-through transformations. As a reminder, all objects should have descriptions. The description will be displayed in the Analyst tool and can be useful to Data Stewards for understanding data anomalies when viewing profiles. Descriptors can also help remind the Data Steward what rules were applied and whether or not they need to be updated.
Consider the Environment: Informatica
Data Quality (IDQ) or Informatica
PowerCenter? We need to consider the
environment that will be leveraged. Will
all development be done in IDQ, or will
we need to integrate with PowerCenter?
PowerCenter, for example, should be
leveraged for improved performance,
scalability, reliability, or as part of an
ETL process. If there isn’t a performance
impact, then IDQ can be leveraged
alone and integration with PowerCenter
is not needed.
Reuse: Reduce! Reuse! Recycle! These
are three common words we hear often in
reference to protecting the environment.
We can use the same approach when
it comes to designing in Informatica;
design for reuse. Many of the same rules
can be leveraged across data domains
and verticals. Two rules that can be used
anywhere are uppercasing values and
trimming blank spaces. These reusable
rules should be in a shared location.
Having reusable rules will allow you to
design multiple mapplets which can
then be placed within one mapping. This
decreases the level of complexity in the
mapping, saves time, and reduces errors.
Pre-Fab Versus New Construction:
When buying a house, it is more
convenient to purchase a pre-built
home than starting from the ground
up. Mapplets and other objects are
very similar. Prebuilt mapplets, content
sets, reference data, etc. should be
leveraged when possible. IDQ provides
core accelerators that provide easy
and pre-built solutions to common
Data Quality issues.
LDO’s: Leverage Logical Data Objects
(LDOs) during design. LDOs are virtual
mappings that allow you to apply filters
and can be used in multiple profiles
where the LDO is the source object.
LDOs allow you to join data together,
join multiple tables and include them in
one profile, exclude columns from being
shown in the profile, filter out records,
rename columns, and more. Using LDOs
allows you to be more efficient because
you won’t need to perform these steps in
your Physical Data Object (PDO).
IMPLEMENTATION
Ready! Set! Go! We've built a solid foundation, established rules, and designed our overall structure.

[Diagram: the Data Quality Management Process cycle]

The next part of the Data Quality Management Process is implementation. It is during this phase that processes are automated, so collaboration between the Data Steward and the Developer is critical. These two will work together to manage expectations, exception records, match and merge records, and perform de-duplication after standardization.

During the implementation phase, exception records are managed by the Data Steward. When the validation rules are applied, the exception records will be flagged. At that time, the Data Steward determines how to handle the exceptions by asking some (or all) of the following questions:
- Will I need to update this record in the source system?
- Does the rule need to be updated?
- Maybe there is a new value in the data; do the reference values need to be updated?
- Is the exception valid?

As part of this step, we are going to match and merge the records. As referenced earlier in the eBook, the data needs to have been cleansed before we move on to this process. It is for this reason that matching and merging is performed as part of the implementation phase, as opposed to the design phase.

The Data Steward and Developer first define three things:
1) What data elements make a record a duplicate
2) What weightings should be applied to make a duplicate record the losing record
3) What constitutes a winner record

Before the duplicate records can be matched and merged, the records need to be split into groups to ensure the correct records are matched and merged. As stated earlier in this eBook, it is important to group the records into sizable chunks that are more likely to match. This in turn reduces the number of records we are evaluating at one time and reduces the impact on performance. A minimal match-and-merge sketch follows below.
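Here is the promised match-and-merge sketch in Python. The completeness-based weighting is a hypothetical choice made for illustration; real survivorship rules weigh specific fields according to business requirements.

# Duplicate group already identified (e.g. the John/Jon Smith records).
group = [
    {"cust_id": "12345", "contact": "JON SMITH", "addr1": "123 MAIN STREET",
     "phone": "8585555555", "email": None},
    {"cust_id": "23456", "contact": "JOHN SMITH", "addr1": "123 MAIN ST.",
     "phone": None, "email": "jsmith@example.com"},
]

def score(record):
    """Hypothetical weighting: the most complete record wins (ties keep the first)."""
    return sum(1 for v in record.values() if v)

winner = max(group, key=score)
losers = [r for r in group if r is not winner]

# Merge: keep the winner, but fill its gaps from the losing records.
merged = dict(winner)
for loser in losers:
    for field, value in loser.items():
        if not merged.get(field) and value:
            merged[field] = value

print(merged)  # winning record, with the missing email filled in from the loser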
It is best to cleanse the data before moving to
the Implementation phase of the Data Quality
Management Process.
While there are many strategies available
for matching and merging the records, three
common strategies to generate a key for
grouping include:
- String
- Soundex
- NYSIIS
Once records are properly grouped, use
a matching algorithm to determine if the
records in the group are truly duplicates. Not
all matching algorithms or datasets are the
same; therefore, it is important to test each
algorithm.
Several common matching algorithms include:
- Bigram
- Edit distance
- Jaro distance
- Hamming
Matching and determining which strategies
will work with the data can take a long time,
so it’s best to allocate enough time when
planning for this part of the Data Quality
Management Process.
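As a sketch of the grouping step, here is a simplified Soundex key generator in Python used to bucket contacts before a matching algorithm runs. This is our illustration; Data Quality tools ship tuned versions of these key generators.

from collections import defaultdict

def soundex(name):
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    digit = {ch: d for letters, d in codes.items() for ch in letters}
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out, prev = [letters[0]], digit.get(letters[0], "")
    for ch in letters[1:]:
        d = digit.get(ch, "")
        if d and d != prev:
            out.append(d)
        if ch not in "HW":  # H and W do not break a run of equal codes
            prev = d
    return "".join(out[:4]).ljust(4, "0")

contacts = ["JON SMITH", "JOHN SMITH", "BOB JOHNSON", "SARAH SMITH"]
groups = defaultdict(list)
for contact in contacts:
    groups[soundex(contact.split()[-1])].append(contact)  # group on last-name key

print(dict(groups))  # SMITH names share key "S530"; JOHNSON gets "J525"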
MONITORING
The final step in the Data Quality Management Process is monitoring.
The Data Steward consistently monitors the data and looks at how well the Data Quality is
performing for a particular field. By analyzing the trends, the Data Steward will be able to
determine if the current rules need to be updated or if new ones should be created.
[Diagram: the Data Quality Management Process cycle]
As part of the monitor phase, scorecards
are created with automated notifications
sent to Data Stewards alerting them to
trends in the data. A scorecard is a graphical
representation of valid values and is used to
measure Data Quality progress. The scorecard
can also be shared with stakeholders.
It is a best practice to baseline your Data
Quality by creating an initial scorecard before
applying the Data Quality rules. This allows
you to show the business the progress you’re
making and helps justify implementing new
processes.
Below is an example of a scorecard created by using Informatica Data Quality. In the image
below, the Data Quality dimensions of Accuracy and Completeness are grouped. Furthermore,
you can see how well the data is performing for the four categories of “Accurate_City”,
“Accurate_Email”, “Complete_Zip”, and “Complete_State”. The Score Trends column will show
how the quality of the data is improving, staying the same, or declining. The Data Steward is
able to drill down on the invalid rows, export the data, take action, or return to the profile for
further analysis.
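To illustrate what a scorecard measures, here is a minimal Python sketch that computes a percent-valid score per metric and a simple trend against a previous run. The metric names mirror the example above; all numbers are hypothetical.

# Validity counts per metric from the current profiling run: (valid, total).
current = {"Accurate_City": (9, 12), "Accurate_Email": (7, 12),
           "Complete_Zip": (10, 12), "Complete_State": (11, 12)}
previous_scores = {"Accurate_City": 70.0, "Accurate_Email": 60.0,
                   "Complete_Zip": 85.0, "Complete_State": 91.7}

for metric, (valid, total) in current.items():
    score = 100 * valid / total
    delta = score - previous_scores[metric]
    trend = "improving" if delta > 0 else "declining" if delta < 0 else "flat"
    print(f"{metric}: {score:.1f}% valid ({trend})")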
SCORECARDING BEST PRACTICES
Scorecarding can be a little confusing for some and difficult for others. To help you, we’ve
listed a few helpful tips for using scorecards within IDQ.
Baseline: When developing scorecards
within IDQ, consider creating a scorecard
before applying the rules to get a baseline
for measuring Data Quality.
Weighted Scorecards: Scorecard weightings should be defined to ensure the overall weighting of the metric group will have a value. If each metric has equal weighting, a value of "1" should be given for each metric to ensure the metric grouping has the correct weighted average score.
Audience: Consider the audience using
the scorecard and create the scorecard
appropriately. For example, if the
scorecard is going to upper management,
it may be best to include high level
trends; whereas, if the scorecard is going
to someone who will be using the data
daily, more details and an additional
report may be warranted.
Profiles: Scorecards are created from specific profiles and can include metrics from various profiles; however, a scorecard is not tied to any single profile. This allows a profile to be deleted without impacting the scorecard. If designed properly, the scorecard will source an LDO, so if there is a filter in the LDO, the filter will be applied in the scorecard. As a reminder, filters in profiles apply only to that specific profile and do not carry over to other objects.
A scorecard is a graphical representation of valid values
and is used to measure the progress of your Data
Quality initiative.
CLOSING
Congratulations! We've come full circle and are back to the first step in the Data Quality Management Process - profiling the data.

[Diagram: the Data Quality Management Process cycle]

While it is true that less time will be spent in each phase of the process as iterations continue, the collaboration between the Data Steward and Developer remains constant.

By following the Data Quality Management Process as outlined above, we can show the business the results of our efforts. This can be especially important for the stakeholders who want to know what business problems we have been solving with the Data Quality initiative.

ABOUT DATASOURCE CONSULTING
We are an Enterprise Data Management consulting company that focuses exclusively on enterprise data management, including both strategic and implementation services. We are experts in Data Architecture, Data Integration, Data Quality, Data Governance, Master Data Management, Reporting & Analytics, and Program Management. We are passionate about data.

DATA QUALITY BY DATASOURCE CONSULTING
Lean on the Data Quality experts at Datasource Consulting for experienced guidance with building and strengthening Data Quality at your company. We will tailor our expertise to fit your program needs.
[Diagram: Business and IT aligned through organizational best practices, technology, and delivery]
With this iterative approach, both the business and IT gain greater confidence in the data,
resulting in the data being used as a competitive advantage.
ITERATIVE DATA QUALITY MANAGEMENT PROCESS

Focus on Business Need: Data Quality initiatives must be driven by solving a business problem. During the first step of the process, we help identify relevant business rules and ensure they meet the needs of the company.

Continuous Monitoring: Accuracy requirements evolve as the business environment changes. The quality of the data is continually monitored because of these changes, making the Data Quality Management Process very iterative.

Business and IT Collaboration: Successful Data Quality projects require close collaboration between the business and IT. Our methodology ensures that we have the right people involved to implement the right technology and processes for an effective Data Quality practice.

WHAT TO EXPECT

Expertise: Datasource Consulting brings a wealth of knowledge and expertise.

Training: Datasource Consulting will train you on how to create and maintain a Data Quality practice that will survive the test of time.