-
DATA QUALITY 101
TOP TIPS & TRICKS ON HOW TO IMPLEMENT & IMPROVE DATA QUALITY
BY DATASOURCE CONSULTING
http://www.datasourceconsulting.com
COPYRIGHT
Title: Data Quality 101: Top Tips & Tricks on How to Implement and Improve Data Quality©
Copyright: ©2015 Datasource Consulting, LLC
Warning Against Unauthorized Use: No part of this eBook may be reproduced in any
electronic, written, recorded, or photocopied form without permission of Datasource
Consulting, LLC, except when specifically granted or for use in a review or critical article.
Disclaimer: Although the author has taken all reasonable precautions to verify the
information in this eBook, neither the publisher nor the author takes any responsibility
for errors or omissions. No liability shall be assumed for any damages resulting from
information used in this eBook. This eBook is intended to deliver a high-level overview of
Data Quality and provide tips and tricks we’ve learned from our own experience working
in the industry. This book is not meant to be a comprehensive guide to Data Quality
or to developing a Data Quality program for your company. This eBook is intended for
entertainment purposes only. For more information on Data Quality and to develop a Data
Quality program for your company, contact Datasource Consulting at 888-453-2624 or
http://www.datasourceconsulting.com
TABLE OF CONTENTS
1 HOW TO IMPLEMENT DATA QUALITY
2 WHY IS DATA QUALITY IMPORTANT?
2.1 Matching the Data
2.2 When to Implement Data Quality
3 DATA QUALITY MANAGEMENT PROCESS
3.1 Data Quality is Iterative
3.2 5-Step Data Quality Management Process
3.3 Two Critical Players in the Data Quality Management Process
4 PROFILING
4.1 Data Profiling Techniques: Tips for Better and Cleaner Data
4.2 Data Profiling Techniques: Data Quality Dimensions
4.3 Data Profiling Tips & Best Practices
5 DEFINE DATA QUALITY RULES
5.1 Six Data Quality Rules to Follow
6 DESIGN
6.1 Tool Design
6.2 Design Tips in Informatica Data Quality
7 IMPLEMENTATION
8 MONITORING
8.1 Scorecarding Best Practices
9 CLOSING
10 ABOUT DATASOURCE CONSULTING
10.1 Data Quality by Datasource Consulting
10.2 Iterative Data Quality Management Process
10.3 What to Expect
HOW TO IMPLEMENT DATA QUALITY
Data quality is increasingly important in today's enterprise business world, especially as companies look to leverage the full potential of their data.

The negative impacts of poor data range from minor interruptions (time lost doing manual cross-checks, etc.) to potentially devastating implications (exposure to unnecessary risk or significant financial loss). Well-established Data Quality programs are long-term initiatives that address today's Data Quality issues to prevent them from being a concern in the future. A proactive approach builds trust and enhances collaboration between business and IT. Additionally, high Data Quality will feed into the success of more comprehensive efforts such as Data Governance or Master Data Management.

This eBook (and related articles and videos) will help identify some of the best Data Profiling tips and Data Quality rules, and provide some real-world examples to help with your Data Quality initiatives.
[Diagram: the Data Quality Management Process cycle: Profile → Define Rules → Design → Implement → Monitor]
In this eBook, we'll dive into each phase of the
Data Quality Management Process, including:
- Data profiling
- Rule definition
- Design
- Implementation
- Monitoring
Without further ado, let’s get started!
Data Quality programs are long-term initiatives that address
today's Data Quality issues to prevent them from being a
concern in the future.
WHY IS DATA QUALITY IMPORTANT?
The cost of having bad data is one of the top reasons we need to ensure our data is good and accurate.

When the quality of a company's data is poor, the data becomes a liability instead of an asset.
In 2014, General Motors hit the
headlines with a Data Quality
blunder. The company recalled
2.6M vehicles and sent recall
notices to everyone who
could have been affected. Unfortunately, this
included the families of 13 people who lost
their lives due to faulty ignition switches.
General Motors was aware of these families
and could have removed their names from
the distribution list. However, instead of
cleansing their data and sending the notice
to a clean list, GM made a public relations
mistake and is now attempting to repair their
customer relationships by sending apology
letters and spending additional resources to
improve their company image.
Vodafone also made negative
headlines in 2014 when
customers received messages
thanking them for their
payments. The message would have been
great customer service, had the recipients
made recent payments on their accounts. This
Data Quality misstep, along with several other
billing errors, caused distress among Vodafone
customers. These types of errors lead to poor
customer retention, reduced profits, and
negative company image.
The poor data quality in these two cases
means the data is a liability for these
companies; improved data quality would
enable both organizations to focus on using
data as an asset.
A 2009 TDWI report found that the cost of bad data in the U.S. is over $600 billion each year, a figure likely to keep increasing with the amount of data generated daily. Looking back at our examples of General Motors and Vodafone, the potential consequences of low Data Quality include:
- Low customer satisfaction
- Customer loss
- Misguided business decisions
- Missed business opportunities
- Financial inaccuracies and mistakes
- Legal and monetary penalties
- Negative company image
How can these situations be avoided in the future? Without knowing the specifics of either
company's data, we can only speculate; however, by using a detailed example of poor data, we
can see a realistic potential outcome.
Below is a list of potential customers we want to contact for a sample promotion. What’s wrong
with this list?
CUST ID | CONTACT | ADDR1 | ADDR2 | ADDR3 | ADDR4 | CNTRY | PHONE | EMAIL
12345 | JON SMITH | 123 MAIN STREET | SAN DIEGO | CA | 92121 | USA | 8585555555 | [email protected]
23456 | JOHN SMITH | 123 MAIN ST. | San Diego | CA | 92121 | USA | 858-555-5555 | [email protected]
34567 | JOHN SMITH | 90 21ST PLACE | SD | CA | 92121 | USA | (858)555-5555 | [email protected]
45678 | JANE DOE | 834 2ND AVE ST | DENVER | CO | 80210 | USA | | [email protected]
56789 | BOB JOHNSON | 203 FRONT ST | DENVER | CO | 80209 | USA | | [email protected]
67890 | SARAH SMITH | 340 FIRST ST | LOS ANGELES | CA | | USA | | [email protected]
78901 | BILL WHITE | 3480 PEARL RD | | | | USA | |
89012 | JACK BLACK | 4667 GRAND AVE | | IL | 62223 | USA | 618-555-4897 | [email protected]
90123 | CHRIS WILLIAMS | 34350 PARK BLVD | | | | USA | | [email protected]
9876 | WILLIAM WALLACE | 2304 CHANCE PLACE | | CA | 1234 | USA | | [email protected]
1234 | STORM TAYLER | 7934 W. HILL LANE | SPRINGFIELD | | | USA | 6368789456 |
A123 | HOMER POWELL | WEST HILL LN. | SPRINGFIELD | MO | 62704 | USA | 636-555-7841 |

Right away we can see a number of things wrong with this dataset:

Incomplete Dataset: Missing data makes it impossible for our marketing messages to reach their target customers. In the example, we don't have phone numbers, email addresses, or certain address elements for every record.

Invalid Addresses: A number of addresses are missing house numbers or contain non-standard address elements.

Duplicate Content: There are several potential duplicate records. We have a few John Smiths; which John/Jon Smith is the correct one? With duplicate contacts, we could be wasting resources sending unnecessary multiple mailers.

Non-Standard Format: Much of the data is in a non-standard format. We see a state value (CA) in a city field, a postal code in a state field, and so on. Inconsistent and non-standard formatting makes process automation very difficult.
MATCHING THE DATA
The cost of having bad data is one of the biggest reasons we need to ensure our data is good and accurate.
Duplicate records can be a major stumbling block, and companies often underestimate how
quickly they get out of hand. In the image below, we’ve isolated five contacts. What are the
potential matches?
CUST ID | CONTACT | ADDR1 | ADDR2 | ADDR3 | ADDR4 | CNTRY | PHONE | EMAIL
12345 | JON SMITH | 123 MAIN STREET | SAN DIEGO | CA | 92121 | USA | 8585555555 | [email protected]
23456 | JOHN SMITH | 123 MAIN ST. | San Diego | CA | 92121 | USA | 858-555-5555 | [email protected]
34567 | JOHN SMITH | 90 21ST PLACE | SD | CA | 92121 | USA | (858)555-5555 | [email protected]
56789 | BOB JOHNSON | 203 FRONT ST | DENVER | CO | 80209 | USA | | [email protected]
67890 | SARAH SMITH | 340 FIRST ST | LOS ANGELES | CA | | USA | | [email protected]

These five records include a variety of matching options; highlighted below are only three of the potential ten. Matching criteria such as first name and last name ("John" / "Smith") yield candidate groups like these:

Candidate group 1:
12345 | JON SMITH | 123 MAIN STREET
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE

Candidate group 2:
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE
56789 | BOB JOHNSON | 203 FRONT ST

Candidate group 3:
12345 | JON SMITH | 123 MAIN STREET
23456 | JOHN SMITH | 123 MAIN ST.
34567 | JOHN SMITH | 90 21ST PLACE
67890 | SARAH SMITH | 340 FIRST ST

From these examples, we see some of the potential consequences of poor Data Quality:
- Missed opportunities
- Misguided business decisions
- Low customer satisfaction
- Negative company image
- Overspending in marketing

One of the biggest challenges faced in Data Quality today is dealing with duplicate data. Over the next few pages we'll give you some practical examples and steps for how to handle duplicate data.
To come up with the potential number of matches, use the equation (n² − n) / 2, where "n" is the number of records. For example, if you have 5 records, you end up with 10 potential pairs.

As this equation shows, the number of potential matches skyrockets. The chart below illustrates the rapid increase.

NUMBER OF RECORDS | POTENTIAL MATCHES
5 | 10
50 | 1,225
500 | 124,750
5,000 | 12,497,500

As matching these records would be very time consuming, the best strategy is to group the records based on a common value. The grouping in our example could be based on the contact or address value.

It is important to make sure you have the right number of records to effectively match and group. If a group is too large, you may waste time on records that shouldn't be matched. If it is too small, there may not be any available matches.

One best practice is to create and leverage algorithms that help determine true matches. "John Smith" and "Jon Smith" are likely duplicates, but John Smith at 123 Main St. and John Smith at 90 21st Place are probably separate records.

Algorithms are a great resource to help matching efforts; a minimal sketch follows the list. Some common algorithms are:
- Edit distance
- Jaro distance
- Bigram
- Hamming
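To make the pair count and a matching algorithm concrete, here is a minimal Python sketch. It is our illustration rather than an excerpt from any Data Quality tool; the sample strings and the use of difflib's ratio as a stand-in for an edit-distance-style score are our own choices.

from difflib import SequenceMatcher

def potential_pairs(n: int) -> int:
    """Number of record pairs to compare: (n^2 - n) / 2."""
    return (n * n - n) // 2

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], akin to an edit-distance measure."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(potential_pairs(5))     # 10
print(potential_pairs(5000))  # 12497500

# "JON SMITH" vs "JOHN SMITH" score high; very different strings score low.
print(similarity("JON SMITH", "JOHN SMITH"))       # ~0.95 -> likely duplicates
print(similarity("123 MAIN ST", "90 21ST PLACE"))  # low   -> separate records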
WHEN TO IMPLEMENT DATA QUALITY
Create and leverage algorithms to determine which records truly match.
The purist in all of us will say “right now” while the realist will look for a clear reason or
business requirement to implement Data Quality. There are countless reasons to implement
Data Quality; below is a sampling of what companies may consider:
- Data as a Strategic Asset
- Risk of Non-Compliance
- Disparate Data Sources
- Master Data Management
- Mergers & Acquisitions
- New Systems
- System Migrations

Data as a Strategic Asset: Let's say Facebook wants to increase its advertising revenue. Facebook offers many ways for users to interact with friends and identify themselves as potential marketing targets:
- Status updates
- Hashtags
- Location check-ins
- Photo uploads
- "Likes" and reactions

John starts his day at a fitness class, where he checks in on Facebook and updates his status. After class, John takes a walk by the smoothie vendor, where he snaps a photo of the day's smoothies and posts a comment. Later at a baseball game, John again checks in and updates his status.

Facebook would be able to identify this person's likes and interests and target relevant ads, increasing advertising revenue.
Risk of Non-Compliance: Most companies
today are required to follow industry-
specific regulations. Some organizations
use validated systems requiring them to
follow Sarbanes-Oxley (SOX), HIPAA, or
the Sunshine Act. If a company is audited,
it is crucial that their data is accurate.
Data Quality is an important piece of the
compliance puzzle.
Disparate Data Sources: Think about
migrating all your data sources into a single
Data Warehouse. It’s difficult to report on non-
standardized data from disparate locations;
Data Quality will standardize your data and
allow for efficient reporting.
Master Data Management: Consider a customer master, for example. Many companies have multiple source systems housing data, and each system could house an individual record for the same (or similar) person. In our previous example of duplicate records, the only way to know which customer is the right one is by implementing a solid Data Quality initiative to standardize data and appropriately match and merge records.
Mergers and Acquisitions: The parent
company often takes over the products,
sales, and database of the purchased
company. To consolidate and report on the
data effectively, the company must have a
standardized format. A strong Data Quality
initiative ensures clean data.
New Systems: We’ve all heard the saying,
“Garbage in, garbage out.” When implementing
a new system, consider launching a Data
Quality Initiative to verify your data is cleansed.
Clean data allows you to generate accurate
reports for improved decision-making.
System Migrations: Many organizations are migrating their CRM or ERP systems, and the old and new systems are often formatted differently. This is where you use Data Quality to cleanse the data prior to migrating it.

Now let's look at implementation; the simplest method is to follow the Data Quality Management Process.
DATA QUALITY MANAGEMENT PROCESS
This procedure allows you to correct your data using a systematic process while also developing a scalable framework for the future.

DATA QUALITY IS ITERATIVE
The Data Quality Management Process is an iterative approach to data quality. Profiling is a common starting point; however, Data Quality is circular, measuring and improving the data on an ongoing basis.

[Diagram: the Data Quality Management Process cycle: Profile → Define Rules → Design → Implement → Monitor]

5-STEP DATA QUALITY MANAGEMENT PROCESS
The Data Quality Management Process is an easy-to-follow five-step program:

Step 1 - Profiling: Data Profiling is primarily used to measure the overall quality of the data and becomes increasingly important for the continued monitoring and reporting of your data.

Step 2 - Defining Rules: Rule definition is somewhat of a "wish list". During this phase, we focus on how we'd like the data to look ideally, not on how it currently looks.

Step 3 - Designing: In the design phase the Developer takes the business rules (the "wish list") defined by the Data Steward and converts them into meaningful and useful goals.

Step 4 - Implementing: Processes are automated, and the Data Steward and Developer work together to manage expectations, match and merge records, and remove duplicates after standardization.

Step 5 - Monitoring: The Data Steward consistently monitors the data to assess how well the Data Quality is performing for particular fields. During this phase, the Data Steward can determine whether to update the current rules or create new ones.

And that brings us right back to the first step in the Data Quality Management Process - profiling the data.
TWO CRITICAL PLAYERS IN THE DATA QUALITY MANAGEMENT PROCESS
As noted in the previous section, two critical people should be involved when effectively
implementing the Data Quality Management Process: the Data Steward and the Developer.
Data Steward: The Data Steward understands and profiles the data. Their goal is
to measure the data quality, identify possible anomalies, and define the rules used to
cleanse, standardize, and validate the data. Also, the Data Steward may define goals as
part of an ongoing improvement process.
Developer: Once the rules are defined, the Developer designs the definitions. This
process typically takes place within a data quality application like Informatica Data
Quality (IDQ).
Data Stewards and Developers collaborate to implement the rules and process, and the data
quality is then managed by the Data Steward.
As you can see, this is an iterative process. It becomes vitally important to the overall quality of your data for the Data Steward to continually measure the data against the initially defined goals. As stewards become increasingly familiar with the data, the rules implemented in the beginning may no longer be relevant and will need to change. The data itself will also change as it is updated and new sources are introduced. This means that today's rules may not be sufficient to cleanse, standardize, and de-duplicate the data in the future; therefore, there is a constant need for iteration.
The Data Quality Management Process is a continuous cycle of improvement. By asking the following questions, you'll improve the quality of your data over time:
- Is the data quality improving over time? If it isn't, how do we change the rules to help improve it?
- Are the current business rules meeting the needs of the company?
- If a new data source is introduced, can the same data quality rules be applied?
PROFILING
Step one in the Data Quality Management Process.

[Diagram: the Data Quality Management Process cycle]

In the examples below, we measured the quality of the data in a few different areas:
- Address
- Phone numbers
- Company

While the Data Steward can manually profile the data using SQL or Excel (or a variety of other tools), the examples showcase Informatica Data Quality (IDQ).

Data Quality Example 1:
Here, we'll be looking at ADDR2. From the illustration we can see there are seven unique values and four NULL values. Upon further inspection, we see three records: "SD", "SAN DIEGO", and "San Diego". It is likely that SD, SAN DIEGO, and San Diego are all the same value, so the Data Steward will need to create a rule that standardizes the data.
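As an illustration of what a value-frequency profile reports, here is a minimal Python sketch of the kind of summary a profiling tool produces for ADDR2. The sample values are hypothetical and simply mirror the example above; IDQ's actual output will differ.

from collections import Counter

# Sample ADDR2 values from a customer extract (hypothetical data; None = NULL).
addr2 = ["SAN DIEGO", "San Diego", "SD", "DENVER", "DENVER",
         "LOS ANGELES", None, None, None, None, "SPRINGFIELD", "SPRINGFIELD"]

nulls = sum(1 for v in addr2 if v is None)
freq = Counter(v for v in addr2 if v is not None)

print(f"NULL values: {nulls}")
print(f"Unique non-NULL values: {len(freq)}")
for value, count in freq.most_common():
    print(f"{value!r}: {count}")
# The profile surfaces 'SD', 'SAN DIEGO', and 'San Diego' as distinct values,
# which is the cue for the Data Steward to define a standardization rule.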
Data Quality Example 2:
We can also profile the data based on the phone number. The data pattern below is not consistent: we have two records formatted as numbers only, three with hyphens, and six that are NULL. This is a case for better Data Quality. As in Example 1, the Data Steward will need to create a rule to standardize the data.

Data Quality Example 3:
You can also profile based on specific values. We can see in the sample below entries for both "Go N Stop" and "Stop N Go". This raises the question of the actual name of the company, or whether there are two separate companies. If we determine this is the same company, we'll need to figure out how to cleanse the data so it reflects the proper business name.

These three examples highlight a small sampling of the variety of scenarios where data profiling can benefit your business.
DATA PROFILING TECHNIQUES: TIPS FOR BETTER AND CLEANER DATA
The first step in a successful Data Quality Management Process, data profiling focuses on identifying and measuring the quality and use of data. In addition to identifying anomalies in the overall data set, the Data Steward may also be looking to determine:
- How good or bad the data is right now
- Other potential uses for the data
- Ways to improve the ability to search the data
- Metadata accuracy
- Conformity to current standards

Data profiling allows companies to see an enterprise view of all data for use in Master Data Management, Data Governance, etc. Data Profiling is used to understand data issues and lay the foundation for business rules and processes. Outlined below are a variety of techniques for data profiling we've learned in the field. Our hope is that these different techniques will help push your Data Quality initiative forward.

DATA PROFILING TECHNIQUES: DATA QUALITY DIMENSIONS
There are six different dimensions of data quality, each illustrated with a sample below:

Completeness: Is the data complete or are there missing elements?
ADDR1 | ADDR2 | ADDR3 | ADDR4
3480 PEARL RD | | |
4667 GRAND AVE | | IL | 62223
34350 PARK BLVD | | |

Conformity: Is there any data in a non-standard format?
PHONE
8585555555
858-555-5555
(858)555-5555

Consistency: Are all of your transaction records clean?
CUST ID | TRANS ID | PROD ID | TRANS DT | AMT
12345 | A9384 | PRD3842 | 1/1/2014 | 1000
56789 | A9384 | PRD3842 | 1/1/2014 | 1000

Accuracy: What values are valid? Do you have an address that doesn't include a number? What other values are invalid?
ADDR1 | ADDR2 | ADDR3
8434 2ND AVE ST | DENVER | CO
WEST HILL LN. | SPRINGFIELD | MO

Duplicates: Are there duplicate records? Which record is the correct record?
CONTACT | ADDR1 | ADDR2
JON SMITH | 123 MAIN STREET | SAN DIEGO
JOHN SMITH | 123 MAIN ST. | San Diego
JOHN SMITH | 90 21ST PLACE | SD

Integrity: Are all of the fields complete? Do you have any missing IDs?
CUST ID | TRANS ID | PROD ID | TRANS DT | AMT
12345 | A9384 | PRD3842 | 1/1/2014 | 1000
 | A9201 | PRD124 | 1/12/2014 | 10000
45678 | A3402 | PRD492 | 2/1/2014 | 500
DATA PROFILING TIPS & BEST PRACTICES
In addition to the six data quality dimensions, we've outlined seven tips, tricks, and best practices to help you with your initiative.

Timeliness: If you expect your data to be loaded weekly or monthly, it is best to verify that your data is following this schedule. Ensuring that the data loads on a regular schedule will allow you to profile and identify any anomalies.

Profile often: It is a good practice to profile as often as the data is loaded. As new data is inserted or updated, new anomalies may appear. This practice will enable you to identify these anomalies as they come up and allow you to address and correct them before they get out of hand.

Profile production data: Looking at production data versus test data ensures the appropriate rules are defined as you are profiling. If you are not looking at production data, you could be defining an unnecessary rule.

Profile all data: Unless you have a large data set with billions of records, it is a best practice to profile all of your data. This ensures you capture all potential anomalies.

Perform column-level profiling first: Column-level profiling should be performed first to determine what columns to include in the Data Quality Management Process. For example, you may think a particular column stores an important data segment, but after profiling discover that data is actually NULL. This could indicate either a larger data problem or that you're looking at the wrong column. A short sketch of this check follows the list.

Document the results: Another best practice is to document your results. This helps you prioritize as you move into the next phase of the process.

Use the right tools for the size of the job: Software like Informatica Data Quality (IDQ) can provide faster analysis. If you are profiling a handful of records, profiling manually or by using SQL may be sufficient. However, for a large dataset, a tool like IDQ will speed up the profiling process.
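As promised above, here is a short Python sketch of the column-level check. It is our illustration, not IDQ functionality; the records and the FAX column are hypothetical.

# Hypothetical extract: each record is a dict of column -> value (None = NULL).
records = [
    {"CUST_ID": "12345", "PHONE": "8585555555", "FAX": None},
    {"CUST_ID": "23456", "PHONE": "858-555-5555", "FAX": None},
    {"CUST_ID": "34567", "PHONE": None, "FAX": None},
]

for col in records[0].keys():
    null_ratio = sum(1 for r in records if r[col] is None) / len(records)
    # A column that is entirely NULL may signal a larger data problem,
    # or that we are looking at the wrong column.
    flag = "  <- investigate before defining rules" if null_ratio == 1.0 else ""
    print(f"{col}: {null_ratio:.0%} NULL{flag}")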
DEFINE DATA QUALITY RULES
The next step in the process, after profiling, is to define the rules.

[Diagram: the Data Quality Management Process cycle]

Rule definition is typically part of the analysis phase of the project and, like profiling, is performed by the Data Steward. During this step, it is best to think about how the data should look in the target system, not what it currently looks like in the source system. In order to accomplish this, we'll use at least three different fundamentals:

Manipulation Rules: Data Quality manipulation rules should be created to assist the migration of data from one system to another. For example, if the source system allows special characters but the target system doesn't, a manipulation rule could help.

Validation Rules: Validation rules go hand-in-hand with manipulation rules. Validation rules validate your data against the newly created manipulation rule.

Metrics: During this part of the process, scorecard metrics should also be considered.

Let's look at an example in IDQ of how a Data Steward would define these rules. Looking at the data, they can see ADDR2 has the following four different values, highlighted in the illustration:
- SD
- SAN DIEGO (in all caps)
- San Diego (mixed case)
- NULL values

The Data Steward can easily put a comment in the tool noting that NULL values are not valid and add a rule to standardize the data.

Inside IDQ, the Data Steward can also tag the field and provide the Developer with direction on how they want to standardize. The Developer will see the comment and the tag in the Developer tool, giving them a jumpstart in designing the rule.
A Data Steward can look at the list of values and determine which are valid. Once valid values
have been identified, the values are added to the reference table in IDQ. In the image below,
we have “San Diego” and “SD”. If the value “SD” shows up, the value “San Diego” will be
returned. Similarly, if the value “LA” shows up, the value “Los Angeles” will be returned.
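Conceptually, a reference table behaves like a lookup from each incoming value to its standardized form. Here is a minimal Python sketch, with a plain dictionary standing in for the IDQ reference table; the city list is hypothetical.

# A tiny stand-in for an IDQ reference table: raw value -> standardized value.
city_reference = {
    "SD": "San Diego",
    "SAN DIEGO": "San Diego",
    "LA": "Los Angeles",
    "LOS ANGELES": "Los Angeles",
}

def standardize_city(value: str) -> str:
    """Return the standardized city, or the input unchanged if not in the table."""
    return city_reference.get(value.strip().upper(), value)

print(standardize_city("SD"))  # San Diego
print(standardize_city("LA"))  # Los Angeles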
Every Data Quality rule created should have
an impact on a business need.
Likewise, Data Stewards create rule specifications
in Informatica Analyst (a helpful tool when
performing Data Quality). Stewards define the
inputs and logic and can test the rule logic.
Once saved, this is ready for the developer in the
developer tool as a mapplet.
Data Stewards create mapping specifications where they map the data from a source to a target defined by business logic. Once saved, the mapping specification shows up in the Developer tool as an LDO or mapping.
Using the functionality described in
the examples (comments, tags, rule
specifications, pre-built rules, and reference
tables) enhances collaboration between
the Data Steward and Developer. This
allows for faster design/development and
information-sharing in the tool rather than
in a spreadsheet on someone’s desk.
SIX DATA QUALITY RULES TO FOLLOW
Outlined below are six tips to consider and implement during the rules phase of the Data Quality Management Process.

Rules Should Serve a Purpose: When we were kids, our parents set rules for our behavior. These rules were put in place to help protect us or to help us learn; they served a purpose. In the same way, part of the Data Quality Management Process is to define rules. Make sure to define rules that solve business data issues - every rule should impact a business need.

Validation & Manipulation Rules: When defining rules for specific business cases and business objectives, it can be easier to break them into two types: manipulation rules and validation rules.

Manipulation Rules: Manipulation rules help to cleanse and standardize the data. These can be as simple as trimming blank spaces or capitalizing certain values. During this part of the process, ask yourself: "How should this data be cleansed or standardized?" or, "How would I like this data to appear in the target system?"

Validation Rules: Validation rules help us determine whether the data is valid and usable or invalid, and can also assist in building congruency among data sets. Here's an example of a question you can ask regarding a data validation rule: "Is this address valid and in the correct format?"

As the data flows through the Data Quality Management Process, we need a way to flag it as valid or invalid. The Data Steward will define the validation rule, but the Developer should make sure the manipulation rule is included as part of the validation rule code.

CONTACT | ADDR1 | ADDR2
JON SMITH | 123 MAIN STREET | SAN DIEGO
JOHN SMITH | 123 MAIN ST. | San Diego
JOHN SMITH | 90 21ST PLACE | SD

As we process the data, we may find that "123 MAIN street, SAN DIEGO" should be "123 MAIN ST., San Diego" (notice the "ST." versus spelling out "street", and the casing). As we can see, validation rules provide consistency to our data. It's beneficial to define validation rules with the manipulation rules in mind.

The Developer can easily code a manipulation rule to properly case the data and standardize the format, so there's no need to go back to the source system to manipulate the data. Once the address is in a standardized format, the validation rule can be applied. So when we validate the address against reference data (e.g. in a reference table) as part of our validation rule, we need to make sure the manipulation rule is also in place; the sketch below shows the pairing.
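Here is that sketch of the manipulation/validation pairing, in Python. It is our illustration of the concept, not IDQ code; the street-suffix map and the reference city list are hypothetical.

# Manipulation rule: standardize the street suffix and casing.
SUFFIXES = {"STREET": "ST.", "AVENUE": "AVE.", "LANE": "LN."}

def manipulate_address(addr, city):
    words = [SUFFIXES.get(w.upper(), w.upper()) for w in addr.split()]
    return " ".join(words), city.title()  # e.g. "123 MAIN ST.", "San Diego"

# Validation rule: check the cleansed value against reference data.
VALID_CITIES = {"San Diego", "Los Angeles", "Denver"}

def validate(addr, city):
    # The manipulation rule runs first, so validation sees standardized data.
    std_addr, std_city = manipulate_address(addr, city)
    return std_addr[0].isdigit() and std_city in VALID_CITIES

print(validate("123 MAIN street", "SAN DIEGO"))  # True
print(validate("WEST HILL LN.", "SPRINGFIELD"))  # False: no house number, unknown city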
Prioritization: Prioritize the rules as you define them. Thinking back to when we
were kids, certain rules had a higher priority than others. For example, “look both ways
before you cross the street” is more important than “wash behind your ears”. They both
have consequences, but the consequence of one is potentially much greater than the
other. The same is true for rules. The office environment can be hectic, and it is important
for Developers to understand what rules take precedence. Prioritizing rules helps the
Developer implement the most important in the shortest amount of time, allowing them to
deliver first on rules that provide the highest impact. For example, if rules are prioritized
as part of the Data Quality Management Process, a developer can work on implementing
the first 10 rules, and then work on the next 10 as time allows. This approach can provide
a quick win for IT. By simply implementing a subset of the rules, we are able to show the
business the value in a Data Quality program.
Therefore, proper prioritization helps everyone be more efficient and successful.
Rule Changes: Since the Data Quality Management Process is very iterative, the
initially-established rules may not be the same moving forward.
As part of a cohesive team, it will be important to educate Data Stewards on rule changes.
Stewards should be aware of what changes will occur as they monitor Data Quality. It is
also important to communicate to the Data Steward that any manipulation rule change
may require an adjustment in the related validation rule.
Collaborate: Software enhances collaboration between the Data Steward
and the Developer. Informatica Data Quality (IDQ) has built in functionality in the form
of comments, tags, rule specifications, pre-built rules, and reference tables to enhance
collaboration between the Data Steward and Developer. This allows for faster design and
development as well as having information shared in one project in the tool rather than a
spreadsheet.
Reference Tables: Any eBook on Data Quality wouldn't be complete without sharing
a few tips regarding reference tables.
Reference Table Tip 1: Consider using
reference tables to store a valid list of values.
Reference tables can be managed by the Data
Steward within IDQ, which also includes an
audit trail of any changes to the reference
tables.
Reference Table Tip 2: Consider using reference tables for LOVs (lists of values) rather than hardcoding the data.
A proactive approach to Data Quality builds trust and enhances collaboration between business and IT.
DESIGN
This part of the process builds on the defined rules and converts them into goals.

[Diagram: the Data Quality Management Process cycle]

The next step of the Data Quality Management Process is design. In this phase, the Developer takes the business rules and converts them into meaningful and useful goals. These rules could include:
- Address validation
- Standardize names
- Remove noise
- Validate data values
- Matching

TOOL DESIGN
By using IDQ, we have a jump start in designing the rules with the collaboration techniques highlighted above. The Developer can build on top of what the Data Steward has already defined through rule and mapping specifications. During this process, the Developer should look at implementing the validation and manipulation rules together.

So, how would we go about designing in the tool? From our previous example with ADDR2, we have the following values:
- SD
- SAN DIEGO (all caps)
- San Diego (mixed case)
- NULL

The Data Steward identified that they want to uppercase the value. We can do that by applying "rule_Uppercase". As a result, we can see there is only SD and SAN DIEGO (all caps).
The Data Steward also prescribed that we should standardize the value. After applying "rule_Standardized_City" we have "San Diego" and no longer an "SD" value.

The Data Steward highlighted NULL values in their findings and specified that NULL values are invalid. So we apply "rule_Completeness" to see which records are NULL and which are not (records with a NULL value are incomplete).

Each of these rules can be applied one at a time or in a group. Pictured in the image below, we have applied the following rules in a group: uppercase, standardized city, and completeness. The output will tell the Data Steward whether the value is valid.

Once the rules have been applied in the profile, the Data Steward can drill down on the invalid records and determine what to do with them.

Note: You can also apply these rules within a Logical Data Object (LDO) in IDQ. An LDO is just a virtual mapping; we'll go into more detail later in the eBook.
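To make the grouped rules concrete, here is a rough Python equivalent of applying rule_Uppercase, rule_Standardized_City, and rule_Completeness in sequence. This is our sketch of the behavior described above, not IDQ's implementation.

def rule_uppercase(value):
    return value.upper() if value is not None else None

def rule_standardized_city(value):
    # Reference-table lookup, as in the earlier example.
    return {"SD": "SAN DIEGO"}.get(value, value) if value is not None else None

def rule_completeness(value):
    return value is not None  # NULL values are invalid

for raw in ["SD", "SAN DIEGO", "San Diego", None]:
    standardized = rule_standardized_city(rule_uppercase(raw))
    valid = rule_completeness(standardized)
    print(f"{raw!r} -> {standardized!r} (valid: {valid})")
# Every non-NULL value ends up as 'SAN DIEGO'; the NULL row is flagged invalid,
# so the Data Steward can drill down on it.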
DESIGN TIPS IN INFORMATICA DATA QUALITY
Outlined below are six different tips and tricks for the design phase of the Data Quality Management Process.

Consistent Naming and Coding Standards: Developers and Data Stewards will see the same rules in IDQ, so it is important to develop consistent naming and coding standards. For example, both the Data Steward and the Developer will understand what "rule_" means, while not everyone will understand what "mplt_" means. Therefore, mapplets should be named "rule_" if they are used in both the Informatica Analyst and Developer tools.

Anchors & Descriptors: It is a best practice to include anchors and descriptions in Data Quality mappings for faster modifications, readability, and comprehension. Any metadata changes to a source or target object in a mapping (e.g. a new column) can quickly be made if anchor transformations exist immediately following the source object and immediately preceding the target object. These can be simple pass-through transformations. As a reminder, all objects should have descriptions. The description will be displayed in the Analyst tool and can be useful to Data Stewards for understanding data anomalies when viewing profiles. Descriptors can also help remind the Data Steward what rules were applied and whether or not they need to be updated.
Consider the Environment: Informatica
Data Quality (IDQ) or Informatica
PowerCenter? We need to consider the
environment that will be leveraged. Will
all development be done in IDQ, or will
we need to integrate with PowerCenter?
PowerCenter, for example, should be
leveraged for improved performance,
scalability, reliability, or as part of an
ETL process. If there isn’t a performance
impact, then IDQ can be leveraged
alone and integration with PowerCenter
is not needed.
Reuse: Reduce! Reuse! Recycle! These
are three common words we hear often in
reference to protecting the environment.
We can use the same approach when
it comes to designing in Informatica;
design for reuse. Many of the same rules
can be leveraged across data domains
and verticals. Two rules that can be used
anywhere are uppercasing values and
trimming blank spaces. These reusable
rules should be in a shared location.
Having reusable rules will allow you to
design multiple mapplets which can
then be placed within one mapping. This
decreases the level of complexity in the
mapping, saves time, and reduces errors.
Pre-Fab Versus New Construction:
When buying a house, it is more
convenient to purchase a pre-built
home than starting from the ground
up. Mapplets and other objects are
very similar. Prebuilt mapplets, content
sets, reference data, etc. should be
leveraged when possible. IDQ provides
core accelerators that provide easy
and pre-built solutions to common
Data Quality issues.
LDO’s: Leverage Logical Data Objects
(LDOs) during design. LDOs are virtual
mappings that allow you to apply filters
and can be used in multiple profiles
where the LDO is the source object.
LDOs allow you to join data together,
join multiple tables and include them in
one profile, exclude columns from being
shown in the profile, filter out records,
rename columns, and more. Using LDOs
allows you to be more efficient because
you won’t need to perform these steps in
your Physical Data Object (PDO).
IMPLEMENTATION
Ready! Set! Go! We've built a solid foundation, established rules, and designed our overall structure.

[Diagram: the Data Quality Management Process cycle]

The next part of the Data Quality Management Process is implementation. It is during this phase that processes are automated, so collaboration between the Data Steward and the Developer is critical. These two will work together to manage expectations, exception records, match and merge records, and perform de-duplication after standardization.

During the implementation phase, exception records are managed by the Data Steward. When the validation rules are applied, the exception records will be flagged. At that time, the Data Steward determines how to handle the exceptions by asking some (or all) of the following questions:
- Will I need to update this record in the source system?
- Does the rule need to be updated?
- Maybe there is a new value in the data; do the reference values need to be updated?
- Is the exception valid?

As part of this step, we are going to match and merge the records. As referenced earlier in the eBook, the data needs to have been cleansed before we move on to this process. It is for this reason that matching and merging is performed as part of the implementation phase, as opposed to the design phase.

The Data Steward and Developer first define three things:
1) What data elements make a record a duplicate
2) What weightings should be applied to make a duplicate record the losing record
3) What constitutes a winner record

Before the duplicate records can be matched and merged, the records need to be split into groups to ensure the correct records are matched and merged. As stated earlier in this eBook, it is important to group the records into sizable chunks that are more likely to match. This in turn reduces the number of records we are evaluating at one time and reduces the impact on performance. A minimal match-and-merge sketch follows below.
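Here is the promised match-and-merge sketch in Python. The completeness-based weighting is a hypothetical choice made for illustration; real survivorship rules weigh specific fields according to business requirements.

# Duplicate group already identified (e.g. the John/Jon Smith records).
group = [
    {"cust_id": "12345", "contact": "JON SMITH", "addr1": "123 MAIN STREET",
     "phone": "8585555555", "email": None},
    {"cust_id": "23456", "contact": "JOHN SMITH", "addr1": "123 MAIN ST.",
     "phone": None, "email": "jsmith@example.com"},
]

def score(record):
    """Hypothetical weighting: the most complete record wins (ties keep the first)."""
    return sum(1 for v in record.values() if v)

winner = max(group, key=score)
losers = [r for r in group if r is not winner]

# Merge: keep the winner, but fill its gaps from the losing records.
merged = dict(winner)
for loser in losers:
    for field, value in loser.items():
        if not merged.get(field) and value:
            merged[field] = value

print(merged)  # winning record, with the missing email filled in from the loser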
It is best to cleanse the data before moving to
the Implementation phase of the Data Quality
Management Process.
While there are many strategies available
for matching and merging the records, three
common strategies to generate a key for
grouping include:
- String
- Soundex
- NYSIIS
Once records are properly grouped, use
a matching algorithm to determine if the
records in the group are truly duplicates. Not
all matching algorithms or datasets are the
same; therefore, it is important to test each
algorithm.
Several common matching algorithms include:
- Bigram
- Edit distance
- Jaro distance
- Hamming
Matching and determining which strategies
will work with the data can take a long time,
so it’s best to allocate enough time when
planning for this part of the Data Quality
Management Process.
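As a sketch of the grouping step, here is a simplified Soundex key generator in Python used to bucket contacts before a matching algorithm runs. This is our illustration; Data Quality tools ship tuned versions of these key generators.

from collections import defaultdict

def soundex(name):
    """Simplified American Soundex: first letter plus up to three digits."""
    codes = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    digit = {ch: d for letters, d in codes.items() for ch in letters}
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out, prev = [letters[0]], digit.get(letters[0], "")
    for ch in letters[1:]:
        d = digit.get(ch, "")
        if d and d != prev:
            out.append(d)
        if ch not in "HW":  # H and W do not break a run of equal codes
            prev = d
    return "".join(out[:4]).ljust(4, "0")

contacts = ["JON SMITH", "JOHN SMITH", "BOB JOHNSON", "SARAH SMITH"]
groups = defaultdict(list)
for contact in contacts:
    groups[soundex(contact.split()[-1])].append(contact)  # group on last-name key

print(dict(groups))  # SMITH names share key "S530"; JOHNSON gets "J525"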
MONITORING
The final step in the Data Quality Management Process is monitoring.
The Data Steward consistently monitors the data and looks at how well the Data Quality is
performing for a particular field. By analyzing the trends, the Data Steward will be able to
determine if the current rules need to be updated or if new ones should be created.
[Diagram: the Data Quality Management Process cycle]
As part of the monitor phase, scorecards
are created with automated notifications
sent to Data Stewards alerting them to
trends in the data. A scorecard is a graphical
representation of valid values and is used to
measure Data Quality progress. The scorecard
can also be shared with stakeholders.
It is a best practice to baseline your Data
Quality by creating an initial scorecard before
applying the Data Quality rules. This allows
you to show the business the progress you’re
making and helps justify implementing new
processes.
Below is an example of a scorecard created by using Informatica Data Quality. In the image
below, the Data Quality dimensions of Accuracy and Completeness are grouped. Furthermore,
you can see how well the data is performing for the four categories of “Accurate_City”,
“Accurate_Email”, “Complete_Zip”, and “Complete_State”. The Score Trends column will show
how the quality of the data is improving, staying the same, or declining. The Data Steward is
able to drill down on the invalid rows, export the data, take action, or return to the profile for
further analysis.
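To illustrate what a scorecard measures, here is a minimal Python sketch that computes a percent-valid score per metric and a simple trend against a previous run. The metric names mirror the example above; all numbers are hypothetical.

# Validity counts per metric from the current profiling run: (valid, total).
current = {"Accurate_City": (9, 12), "Accurate_Email": (7, 12),
           "Complete_Zip": (10, 12), "Complete_State": (11, 12)}
previous_scores = {"Accurate_City": 70.0, "Accurate_Email": 60.0,
                   "Complete_Zip": 85.0, "Complete_State": 91.7}

for metric, (valid, total) in current.items():
    score = 100 * valid / total
    delta = score - previous_scores[metric]
    trend = "improving" if delta > 0 else "declining" if delta < 0 else "flat"
    print(f"{metric}: {score:.1f}% valid ({trend})")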
SCORECARDING BEST PRACTICES
Scorecarding can be a little confusing for some and difficult for others. To help you, we’ve
listed a few helpful tips for using scorecards within IDQ.
Baseline: When developing scorecards
within IDQ, consider creating a scorecard
before applying the rules to get a baseline
for measuring Data Quality.
Weighted Scorecards: Scorecard weightings should be defined to ensure the overall weighting of the metric group will have a value. If each metric has equal weighting, a value of "1" should be given for each metric to ensure the metric grouping has the correct weighted average score.
Audience: Consider the audience using
the scorecard and create the scorecard
appropriately. For example, if the
scorecard is going to upper management,
it may be best to include high level
trends; whereas, if the scorecard is going
to someone who will be using the data
daily, more details and an additional
report may be warranted.
Profiles: Scorecards are created from specific profiles and can include metrics from various profiles; however, a scorecard is not tied to any single profile. This allows a profile to be deleted without impacting the scorecard. If designed properly, the scorecard will source an LDO, so if there is a filter in the LDO, the filter will be applied in the scorecard. As a reminder, filters in profiles apply only to that specific profile and do not carry over to other objects.
A scorecard is a graphical representation of valid values
and is used to measure the progress of your Data
Quality initiative.
CLOSING
Congratulations! We've come full circle and are back to the first step in the Data Quality Management Process - profiling the data.

[Diagram: the Data Quality Management Process cycle]

While it is true that less time will be spent in each phase of the process as iterations continue, the collaboration between the Data Steward and Developer remains constant.

By following the Data Quality Management Process as outlined above, we can show the business the results of our efforts. This can be especially important for the stakeholders who want to know what business problems we have been solving with the Data Quality initiative.

ABOUT DATASOURCE CONSULTING
We are an Enterprise Data Management consulting company that focuses exclusively on enterprise data management, including both strategic and implementation services. We are experts in Data Architecture, Data Integration, Data Quality, Data Governance, Master Data Management, Reporting & Analytics, and Program Management. We are passionate about data.

DATA QUALITY BY DATASOURCE CONSULTING
Lean on the Data Quality experts at Datasource Consulting for experienced guidance with building and strengthening Data Quality at your company. We will tailor our expertise to fit your program needs.
[Diagram: Business and IT aligned through organizational best practices, technology, and delivery]
With this iterative approach, both the business and IT gain greater confidence in the data,
resulting in the data being used as a competitive advantage.
ITERATIVE DATA QUALITY MANAGEMENT PROCESS

Focus on Business Need: Data Quality initiatives must be driven by solving a business problem. During the first step of the process, we help identify relevant business rules and ensure they meet the needs of the company.

Continuous Monitoring: Accuracy requirements evolve as the business environment changes. The quality of the data is continually monitored because of these changes, making the Data Quality Management Process very iterative.

Business and IT Collaboration: Successful Data Quality projects require close collaboration between the business and IT. Our methodology ensures that we have the right people involved to implement the right technology and processes for an effective Data Quality practice.

WHAT TO EXPECT

Expertise: Datasource Consulting brings a wealth of knowledge and expertise.

Training: Datasource Consulting will train you on how to create and maintain a Data Quality practice that will survive the test of time.