
  • DATA QUALITY 101

    TOP TIPS & TRICKS ON HOW TO IMPLEMENT & IMPROVE DATA QUALITY

    BY DATASOURCE CONSULTING

    http://www.datasourceconsulting.com

  • DATASOURCECONSULTING.COM | PAGE 2

    COPYRIGHT

    Title: Data Quality 101: Top Tips & Tricks on How to Implement and Improve Data Quality©

    Copyright: ©2015 Datasource Consulting, LLC

    Warning Against Unauthorized Use: No part of this eBook may be reproduced in any electronic, written, recorded, or photocopied form without permission of Datasource Consulting, LLC, except when specifically granted or for use in a review or critical article.

    Disclaimer: Although the author has taken all reasonable precautions to verify the

    information in this eBook, neither the publisher nor the author takes any responsibility

    for errors or omissions. No liability shall be assumed for any damages resulting from

    information used in this eBook. This eBook is intended to deliver a high-level overview of

    Data Quality and provide tips and tricks we’ve learned from our own experience working

    in the industry. This book is not meant to be a comprehensive guide to Data Quality

    or to developing a Data Quality program for your company. This eBook is intended for

    entertainment purposes only. For more information on Data Quality and to develop a Data

    Quality program for your company, contact Datasource Consulting at 888-453-2624 or

    [email protected].


  • DATASOURCECONSULTING.COM | PAGE 3

    TABLE OF CONTENTS

    1 HOW TO IMPLEMENT DATA QUALITY 4

    2 WHY IS DATA QUALITY IMPORTANT? 5

    2.1 Matching the Data 7

    2.2 When to Implement Data Quality 9

    3 DATA QUALITY MANAGEMENT PROCESS 11

    3.1 Data Quality is Iterative 11

    3.2 5 Step Data Quality Management Process 11

    3.3 Two Critical Players in the Data Quality Management Process 12

    4 PROFILING 13

    4.1 Data Profiling Techniques: Tips for Better and Cleaner Data 15

    4.2 Data Profiling Techniques: Data Quality Dimensions 15

    4.3 Data Profiling Tips & Best Practices 16

    5 DEFINE DATA QUALITY RULES 17

    5.1 Six Data Quality Rules to Follow 19

    6 DESIGN 21

    6.1 Tool Design 21

    6.2 Design Tips in Informatica Data Quality 23

    7 IMPLEMENTATION 25

    8 MONITORING 27

    8.2 Scorecarding Best Practices 28

    9 CLOSING 29

    10 ABOUT DATASOURCE CONSULTING 29

    10.1 Data Quality by Datasource Consulting 29

    10.2 Iterative Data Quality Management Process 30

    10.3 What to Expect 30


  • DATASOURCECONSULTING.COM | PAGE 4

    HOW TO IMPLEMENT DATA QUALITY

    Data quality is increasingly important in today's enterprise business world, especially as companies look to leverage the full potential of their data.

    The negative impacts of poor data range from minor interruptions (time lost doing manual cross-checks, etc.) to potentially devastating implications (exposure to unnecessary risk or significant financial loss). Well-established Data Quality programs are long-term initiatives that address today's Data Quality issues to prevent them from being a concern in the future. A proactive approach builds trust and improves collaboration between business and IT. Additionally, high Data Quality will feed into the success of more comprehensive efforts such as Data Governance or Master Data Management.

    This eBook (and related articles and videos) will help identify some of the best Data Profiling tips and Data Quality rules, and provide some real-world examples to help with your Data Quality initiatives.

    (Figure: the iterative Data Quality Management Process cycle – Profile, Define Rules, Design, Implement, Monitor.)

    In this eBook, we’ll dive into each phase of the

    Data Quality Management Process, including:

    - Data profiling

    - Rule definition

    - Design

    - Implementation

    - Monitoring

    Without further ado, let’s get started!

    Data Quality programs are long-term initiatives that address today's Data Quality issues to prevent them from being a concern in the future.


  • DATASOURCECONSULTING.COM | PAGE 5

    WHY IS DATA QUALITY IMPORTANT?

    The cost of having bad data is one of the top reasons we need to ensure our data is good and accurate.

    When the quality of a company's data is poor, the data becomes a liability instead of an asset. Potential consequences include:

    - Low customer satisfaction

    - Customer loss

    - Misguided business decisions

    - Missed business opportunities

    - Financial inaccuracies and mistakes

    - Legal and monetary penalties

    - Negative company image

    In 2014, General Motors hit the

    headlines with a Data Quality

    blunder. The company recalled

    2.6M vehicles and sent recall

    notices to everyone who

    could have been affected. Unfortunately, this

    included the families of 13 people who lost

    their lives due to faulty ignition switches.

    General Motors was aware of these families

    and could have removed their names from

    the distribution list. However, instead of

    cleansing their data and sending the notice

    to a clean list, GM made a public relations

    mistake and is now attempting to repair their

    customer relationships by sending apology

    letters and spending additional resources to

    improve their company image.

    Vodafone also made negative

    headlines in 2014 when

    customers received messages

    thanking them for their

    payments. The message would have been

    great customer service, had the recipients

    made recent payments on their accounts. This

    Data Quality misstep, along with several other

    billing errors, caused distress among Vodafone

    customers. These types of errors lead to poor

    customer retention, reduced profits, and

    negative company image.

    The poor data quality in these two cases

    means the data is a liability for these

    companies; improved data quality would

    enable both organizations to focus on using

    data as an asset.

    A 2009 TDWI report found that the cost of bad data in the U.S. is over $600 billion each year, a figure likely to keep increasing with the amount of data generated daily. Looking back at our examples of General Motors and Vodafone, we can see many of the potential consequences of low Data Quality listed above.


  • DATASOURCECONSULTING.COM | PAGE 6

    How can these situations be avoided in the future? Without knowing the specifics of either company's data, we can only speculate; however, by using a detailed example of poor data, we can see a realistic potential outcome.

    Below is a list of potential customers we want to contact for a sample promotion. What’s wrong

    with this list?

    Right away we all see a number of things wrong with this dataset:

    Incomplete Dataset: Missing data makes it impossible for our marketing messages

    to reach their target customers. In the example, we don’t have phone numbers, email

    addresses, or certain address elements for each record.

    Invalid Addresses: A number of addresses are missing house numbers or contain non-standard address elements.

    Duplicate Content: There are several potential duplicate records. We have a few John

    Smiths; which John/Jon Smith is the correct one? With duplicate contacts, we could be

    wasting resources sending unnecessary multiple mailers.

    Non-Standard Format: Much of the data is in a non-standard format. We see a state

    value (CA) in a city field, a postal code in a state field, and so on. Inconsistent and

    nonstandard formatting makes process automation very difficult.

    CUST ID CONTACT ADDR1 ADDR2 ADDR3 ADDR4 CNTRY PHONE EMAIL

    12345 JON SMITH 123 MAIN STREET SAN DIEGO CA 92121 USA 8585555555 [email protected]

    23456 JOHN SMITH 123 MAIN ST. San Diego CA 92121 USA 858-555-5555 [email protected]

    34567 JOHN SMITH 90 21ST PLACE SD CA 92121 USA (858)555-5555 [email protected]

    45678 JANE DOE 834 2ND AVE ST DENVER CO 80210 USA [email protected]

    56789 BOB JOHNSON 203 FRONT ST DENVER CO 80209 USA [email protected]

    67890 SARAH SMITH 340 FIRST ST LOS ANGELES CA USA [email protected]

    78901 BILL WHITE 3480 PEARL RD USA

    89012 JACK BLACK 4667 GRAND AVE IL 62223 USA 618-555-4897 [email protected]

    90123 CHRIS WILLIAMS 34350 PARK BLVD USA [email protected]

    9876 WILLIAM WALLACE 2304 CHANCE PLACE CA 1234 USA [email protected]

    1234 STORM TAYLER 7934 W. HILL LANE SPRINGFIELD USA 6368789456

    A123 HOMER POWELL WEST HILL LN. SPRINGFIELD MO 62704 USA 636-555-7841


  • DATASOURCECONSULTING.COM | PAGE 7

    MATCHING THE DATA

    The cost of having bad data is one of the biggest reasons we need to ensure our data is good and accurate.

    Duplicate records can be a major stumbling block, and companies often underestimate how

    quickly they get out of hand. In the image below, we’ve isolated five contacts. What are the

    potential matches?

    These five records include a variety of matching options; highlighted are only three of the

    potential ten.

    From these examples, we see some of the potential consequences of poor Data Quality. We see:

    - Missed opportunities

    - Misguided business decisions

    - Low customer satisfaction

    - Negative company image

    - Overspending in marketing

    One of the biggest challenges faced in Data Quality today is dealing with duplicate data. Over the

    next few pages we’ll give you some practical examples & steps for how to handle duplicate data.

    CUST ID CONTACT ADDR1 ADDR2 ADDR3 ADDR4 CNTRY PHONE EMAIL

    12345 JON SMITH 123 MAIN STREET SAN DIEGO CA 92121 USA 8585555555 [email protected]

    23456 JOHN SMITH 123 MAIN ST. San Diego CA 92121 USA 858-555-5555 [email protected]

    34567 JOHN SMITH 90 21ST PLACE SD CA 92121 USA (858)555-5555 [email protected]

    56789 BOB JOHNSON 203 FRONT ST DENVER CO 80209 USA [email protected]

    67890 SARAH SMITH 340 FIRST ST LOS ANGELES CA USA [email protected]

    CRITERIA: First Name & Last Name ("John" / "Smith")

    Potential match group 1:
    CUST ID CONTACT ADDR1
    12345 JON SMITH 123 MAIN STREET
    23456 JOHN SMITH 123 MAIN ST.
    34567 JOHN SMITH 90 21ST PLACE

    Potential match group 2:
    CUST ID CONTACT ADDR1
    23456 JOHN SMITH 123 MAIN ST.
    34567 JOHN SMITH 90 21ST PLACE
    56789 BOB JOHNSON 203 FRONT ST

    Potential match group 3:
    CUST ID CONTACT ADDR1
    12345 JON SMITH 123 MAIN STREET
    23456 JOHN SMITH 123 MAIN ST.
    34567 JOHN SMITH 90 21ST PLACE
    67890 SARAH SMITH 340 FIRST ST


  • DATASOURCECONSULTING.COM | PAGE 8

    To come up with the potential number of matches, use the equation (n² − n) / 2, where "n" is the number of records. For example, if you have 5 records, you end up with 10 potential pairs.

    As the number of records grows, the number of potential matches skyrockets. The chart below illustrates the rapid increase.

    As matching these records would be very time consuming, the best strategy is to group the

    records based on a common value. The grouping in our example could be based on the contact

    or address value.

    It is important to make sure you have the right number of records to effectively match and group.

    If the dataset is too large, you may waste time on records that shouldn’t be matched. If the

    dataset is too small, there may not be any available matches.

    One best practice is to create and leverage algorithms that help determine true matches. “John

    Smith” and “Jon Smith” are likely duplicates, but John Smith at 123 Main St. and John Smith at 90

    21st Place are probably separate records.

    Algorithms are a great resource to help matching efforts. Some common algorithms are:

    - Edit distance

    - Jaro Distance

    - Bigram

    - Hamming

    (n² − n) / 2

    NUMBER OF RECORDS    POTENTIAL MATCHES
    5                    10
    50                   1,225
    500                  124,750
    5,000                12,497,500
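    To make the numbers above concrete, here is a minimal Python sketch of the pair-count formula (our illustration, not part of the original eBook):

    def potential_pairs(n: int) -> int:
        """Number of unique record pairs to compare: (n^2 - n) / 2."""
        return (n * n - n) // 2

    for n in (5, 50, 500, 5000):
        print(f"{n:>5} records -> {potential_pairs(n):,} potential pairs")
    # 5 -> 10, 50 -> 1,225, 500 -> 124,750, 5000 -> 12,497,500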


  • DATASOURCECONSULTING.COM | PAGE 9

    WHEN TO IMPLEMENT DATA QUALITY

    Create and leverage algorithms to determine which records truly match.

    The purist in all of us will say "right now," while the realist will look for a clear reason or business requirement to implement Data Quality. There are countless reasons to implement Data Quality; below is a sampling of what companies may consider:

    - Data as a Strategic Asset
    - Risk of Non-Compliance
    - Disparate Data Sources
    - Master Data Management
    - Mergers & Acquisitions
    - New Systems
    - System Migrations

    Data as a Strategic Asset: Let's say Facebook wants to increase its advertising revenue. Facebook offers many ways for users to interact with friends and identify themselves as potential marketing targets:

    - Status updates
    - Hashtags
    - Location check-ins
    - Photo uploads
    - "Likes" and reactions

    John starts his day at a fitness class, where he checks in on Facebook and updates his status. After class, John takes a walk by the smoothie vendor, where he snaps a photo of the day's smoothies and posts a comment. Later, at a baseball game, John again checks in and updates his status.

    Facebook would be able to identify this person's likes and interests and target relevant ads, increasing advertising revenue.

    Risk of Non-Compliance: Most companies

    today are required to follow industry-

    specific regulations. Some organizations

    use validated systems requiring them to

    follow Sarbanes-Oxley (SOX), HIPAA, or

    the Sunshine Act. If a company is audited,

    it is crucial that their data is accurate.

    Data Quality is an important piece of the

    compliance puzzle.

    Disparate Data Sources: Think about

    migrating all your data sources into a single

    Data Warehouse. It’s difficult to report on non-

    standardized data from disparate locations;

    Data Quality will standardize your data and

    allow for efficient reporting.


  • DATASOURCECONSULTING.COM | PAGE 10

    Master Data Management: Consider a customer master, for example. Many companies have multiple source systems housing data, and each system could house an individual record for the same (or similar) person. In our previous example of duplicate records, the only way to know which customer is the right one is by implementing a solid Data Quality initiative to standardize data and appropriately match and merge records.

    Mergers and Acquisitions: The parent company often takes over the products, sales, and database of the purchased company. To consolidate and report on the data effectively, the company must have a standardized format. A strong Data Quality initiative ensures clean data.

    New Systems: We've all heard the saying, "Garbage in, garbage out." When implementing a new system, consider launching a Data Quality initiative to verify your data is cleansed. Clean data allows you to generate accurate reports for improved decision-making.

    System Migrations: Many organizations are migrating their CRM or ERP systems, and the old and new systems are often formatted differently. This is where you use Data Quality to cleanse the data prior to migrating it.

    Now let's look at implementation; the simplest method is to follow the Data Quality Management Process.


  • DATASOURCECONSULTING.COM | PAGE 11

    DATA QUALITY MANAGEMENT PROCESS

    This procedure allows you to correct your data using a systematic process while also developing a scalable framework for the future.

    DATA QUALITY IS ITERATIVE

    The Data Quality Management Process is an iterative approach to data quality. Profiling is a common starting point; however, Data Quality is circular and measures and improves the data on an ongoing basis.

    5-STEP DATA QUALITY MANAGEMENT PROCESS

    The Data Quality Management Process is an easy-to-follow five-step program:

    Step 1 - Profiling: Data Profiling is primarily used to measure the overall quality of the data and becomes increasingly important for the continued monitoring and reporting of your data.

    Step 2 - Defining Rules: Rule definition is somewhat of a "wish list". During this phase, we focus on how we'd like the data to look ideally, not on how it currently looks.

    Step 3 - Designing: In the design phase the Developer takes the business rules (the "wish list") defined by the Data Steward and converts them into meaningful and useful goals.

    Step 4 - Implementing: Processes are automated, and the Data Steward and Developer work together to manage expectations, match & merge records, and remove duplicates after standardization.

    Step 5 - Monitoring: The Data Steward consistently monitors the data to assess how well the Data Quality is performing for particular fields. During this phase, the Data Steward can determine whether to update the current rules or create new ones.

    And that brings us right back to the first step in the Data Quality Management Process – profiling the data.


  • DATASOURCECONSULTING.COM | PAGE 12

    TWO CRITICAL PLAYERS IN THE DATA QUALITY MANAGEMENT PROCESS

    As noted in the previous section, two critical people should be involved when effectively

    implementing the Data Quality Management Process: the Data Steward and the Developer.

    Data Steward: The Data Steward understands and profiles the data. Their goal is

    to measure the data quality, identify possible anomalies, and define the rules used to

    cleanse, standardize, and validate the data. Also, the Data Steward may define goals as

    part of an ongoing improvement process.

    Developer: Once the rules are defined, the Developer designs the definitions. This

    process typically takes place within a data quality application like Informatica Data

    Quality (IDQ).

    Data Stewards and Developers collaborate to implement the rules and process, and Data Quality is then managed by the Data Steward.

    As you can see, this is an iterative process. It becomes vitally important to the overall quality

    of your data for the Data Steward to continually measure the data against the initially-

    defined goals. As stewards become increasingly familiar with the data, the rules implemented

    in the beginning may no longer be relevant and will need to change. The data itself will also

    change as it is updated and new sources are introduced. This means that today’s rules may

    not be sufficient to cleanse, standardize, and de-duplicate the data in the future; therefore, there is a constant need for iteration.

    The Data Quality Management Process is a continuous cycle of improvement. By asking the following questions, you'll improve the quality of your data over time:

    - Is the data quality improving over time? If it isn't, how do we change the rules to help improve the quality of the data?
    - Are the current business rules meeting the needs of the company?
    - If a new data source is introduced, can the same data quality rules be applied?


  • DATASOURCECONSULTING.COM | PAGE 13

    PROFILING

    Step one in the Data Quality Management Process.

    In the examples below, we measured the quality of the data in a few different areas:

    - Address
    - Phone numbers
    - Company

    While the Data Steward can manually profile the data using SQL or Excel (or a variety of other tools), the examples showcase Informatica Data Quality (IDQ).

    Data Quality Example 1:

    Here, we'll be looking at ADDR2. From the illustration we can see there are seven unique values and four NULL values. Upon further inspection, we see three records: "SD", "SAN DIEGO", and "San Diego".

    It is likely that SD, SAN DIEGO, and San Diego are all the same value, so the Data Steward will need to create a rule that standardizes the data.

  • DATASOURCECONSULTING.COM | PAGE 14

    Data Quality Example 2:

    We can also profile the data based on the phone number. The data pattern below is not

    consistent - we have two records where the data is formatted as numbers only, three with

    hyphens, and six that are NULL. This is a case for better Data Quality.

    In a similar manner to Example 1, the Data Steward will need to create a rule to standardize

    the data.

    Data Quality Example 3:

    You can also profile based on specific values. We can see in the sample below entries for both "Go N Stop" and "Stop N Go". This raises the question of the company's actual name, or whether there are two separate companies. If we determine this is the same company, we'll

    need to figure out how to cleanse the data so it reflects the proper business name.

    These three examples highlight a small sampling of the variety of scenarios where data

    profiling can benefit your business.
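    If you don't have a profiling tool handy, the same kind of column-level checks can be approximated in a few lines of Python with pandas. This is only a rough sketch: it assumes the customer list above has been exported to a hypothetical file named customers.csv, and it is not a substitute for IDQ's profiling.

    import re
    import pandas as pd

    df = pd.read_csv("customers.csv", dtype=str)  # hypothetical extract of the customer list

    # Example 1: unique values and NULL counts for ADDR2
    print(df["ADDR2"].value_counts(dropna=False))

    # Example 2: phone-number patterns (digits become 9, letters become X) to spot mixed formats
    def value_pattern(value):
        if pd.isna(value):
            return "NULL"
        return re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", str(value)))

    print(df["PHONE"].map(value_pattern).value_counts())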


  • DATASOURCECONSULTING.COM | PAGE 15

    DATA PROFILING TECHNIQUES: TIPS FOR BETTER AND CLEANER DATA

    The first step in a successful Data Quality Management Process, data profiling focuses on identifying and measuring the quality and use of data. In addition to identifying anomalies in the overall data set, the Data Steward may also be looking to determine:

    - How good or bad the data is right now
    - Other potential uses for the data
    - Ways to improve the ability to search the data
    - Metadata accuracy
    - Conformity to current standards

    Data profiling allows companies to see an enterprise view of all data for use in Master Data Management, Data Governance, etc. Data profiling is used to understand data issues and lay the foundation for business rules and processes. Outlined below are a variety of data profiling techniques we've learned in the field. Our hope is that these different techniques will help push your Data Quality initiative forward.

    DATA PROFILING TECHNIQUES: DATA QUALITY DIMENSIONS

    There are six different dimensions of data quality:

    Completeness: Is the data complete, or are there missing elements?

    Conformity: Is there any data in a non-standard format?

    Consistency: Are all of your transaction records clean?

    Accuracy: What values are valid? Do you have an address that doesn't include a number? What other values are invalid?

    Duplicates: Are there duplicate records? Which record is the correct record?

    Integrity: Are all of the fields complete? Do you have any missing IDs?

    (The original eBook illustrates each dimension with a small sample table: addresses with missing elements, phone numbers in mixed formats (8585555555, 858-555-5555, (858)555-5555), transactions with duplicate or missing IDs, and the Jon/John Smith contact records.)


  • DATASOURCECONSULTING.COM | PAGE 16

    DATA PROFILING TIPS & BEST PRACTICES

    In addition to the six data quality dimensions, we've outlined seven tips, tricks, and best practices to help you with your initiative.

    Timeliness: If you expect your data to be loaded weekly or monthly, it is best to

    verify that your data is following this schedule. Ensuring that the data loads on a

    regular schedule will allow you to profile and identify any anomalies.

    Profile often: It is a good practice to profile as often as the data is loaded. As

    new data is inserted/updated, new anomalies may appear. This practice will

    enable you to identify these anomalies as they come up and allow you to address

    and correct them before they get out of hand.

    Profile production data: Looking at production data versus test data ensures

    the appropriate rules are defined as you are profiling. If you are not looking at

    production data, you could be defining an unnecessary rule.

    Profile all data: Unless you have a large data set with billions of records, it is

    a best practice to profile all of your data. This ensures you capture all potential

    anomalies.

    Perform column-level profiling first: Column-level profiling should be performed

    first to determine what columns to include in the Data Quality Management

    process. For example, you may think a particular column stores an important

    data segment, but after profiling discover that data is actually NULL. This could

    indicate either a larger data problem or that you’re looking at the wrong column.

    Document the Results: Another best practice is to document your results.

    This helps you prioritize as you move into the next phase of the process.

    Use the right tool for the size of the job: Software like Informatica Data Quality (IDQ) can provide faster analysis. If you are profiling a handful of records, profiling manually or by using SQL may be sufficient. However, for a large dataset, a tool like IDQ will speed up the profiling process.

    When profiling a large dataset, a tool like IDQ will speed up the profiling process.


  • DATASOURCECONSULTING.COM | PAGE 17

    DEFINE DATA QUALITY RULES

    The next step in the process, after profiling, is to define the rules.

    Rule definition is typically part of the analysis phase of the project and, like profiling, is performed by the Data Steward.

    During this step, it is best to think about how the data should look in the target system, not what it currently looks like in the source system. In order to accomplish this, we'll use at least three different fundamentals:

    Manipulation Rules: Data Quality Manipulation Rules should be created to assist the

    migration of data from one system to another. For example, if the source system allows

    special characters but the target system doesn’t, a manipulation rule could help.

    Validation Rules: Validation rules go hand-in-hand with manipulation rules. Validation rules

    validate your data against the newly created manipulation rule.

    Metrics: During this part of the process, scorecard metrics should also be considered.

    Let's look at an example in IDQ of how a Data Steward would define these rules. Looking at the data, they can see ADDR2 has the following four different values:

    - SD

    - SAN DIEGO (in all caps)

    - San Diego (mixed case)

    - NULL values

    The Data Steward can easily put a comment in the tool to say that NULL values are not valid

    and add a rule to standardize the data.

    Inside IDQ, the Data Steward can also tag the field and provide the Developer with direction as

    to how they want to standardize. The Developer will see the comment and the tag in the developer tool, giving them a jump start in designing the rule.


  • DATASOURCECONSULTING.COM | PAGE 18

    A Data Steward can look at the list of values and determine which are valid. Once valid values

    have been identified, the values are added to the reference table in IDQ. In the image below,

    we have “San Diego” and “SD”. If the value “SD” shows up, the value “San Diego” will be

    returned. Similarly, if the value “LA” shows up, the value “Los Angeles” will be returned.
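    Outside of IDQ, the same reference-table idea can be sketched as a simple lookup: known variants map to the standardized value, and anything not in the table is passed through for the Data Steward to review. The variants below are illustrative assumptions, not the actual reference content.

    # Hypothetical city reference table: variant -> standardized value
    CITY_REFERENCE = {
        "SD": "San Diego",
        "SAN DIEGO": "San Diego",
        "LA": "Los Angeles",
        "LOS ANGELES": "Los Angeles",
    }

    def standardize_city(value):
        """Return the standardized city; unknown values are returned unchanged."""
        if value is None:
            return None
        return CITY_REFERENCE.get(value.strip().upper(), value)

    print(standardize_city("SD"))         # San Diego
    print(standardize_city("San Diego"))  # San Diego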

    Every Data Quality rule created should have

    an impact on a business need.

    Likewise, Data Stewards create rule specifications

    in Informatica Analyst (a helpful tool when

    performing Data Quality). Stewards define the

    inputs and logic and can test the rule logic.

    Once saved, this is ready for the developer in the

    developer tool as a mapplet.

    Data Stewards create mapping specifications

    where they map the data from a source to a

    target defined by business logic. Once saved,

    the mapping specification shows up in the developer tool as an LDO or mapping.

    Using the functionality described in

    the examples (comments, tags, rule

    specifications, pre-built rules, and reference

    tables) enhances collaboration between

    the Data Steward and Developer. This

    allows for faster design/development and

    information-sharing in the tool rather than

    in a spreadsheet on someone’s desk.


  • DATASOURCECONSULTING.COM | PAGE 19

    SIX DATA QUALITY RULES TO FOLLOW

    Outlined below are six tips to consider and implement during the rules phase of the Data Quality Management Process.

    Rules Should Serve a Purpose: When we were kids, our parents set rules for our behavior. These rules were put in place to help protect us or to help us learn; they served a purpose. In the same way, part of the Data Quality Management Process is to define rules. Make sure to define rules that solve business data issues - every rule should impact a business need.

    Validation & Manipulation Rules: When defining rules for specific business cases and business objectives, it can be easier to break them into two types: manipulation rules and validation rules.

    Manipulation Rules: Manipulation rules help to cleanse and standardize the data. These can be as simple as trimming blank spaces or capitalizing certain values. During this part of the process, ask yourself: "How should this data be cleansed or standardized?" or, "How would I like this data to appear in the target system?"

    Validation Rules: Validation rules help us determine whether the data is valid and

    usable or if it is invalid and can also assist in building congruency among data sets.

    As the data flows through the Data Quality Management Process we need a way to flag

    it to determine if it is valid or invalid. The Data Steward will define the validation rule,

    but the Developer should make sure the manipulation rule is included as part of the

    validation rule code.

    As we process the data, we may find that “123 MAIN street, SAN DIEGO” should be “123

    MAIN ST., San Diego,” (notice the emphasis on the “ST.” versus spelling out “street” and

    the casing). As we can see, validation rules provide consistency to our data. It’s beneficial

    to define validation rules with the manipulation rules in mind.

    The Developer can easily code a manipulation rule to properly case the data and

    standardize the format so there’s no need to go back to the source system to manipulate

    the data. Once the address is in a standardized format, the validation rule can be applied.

    So when we validate the address against reference data (e.g. in a reference table) as part

    of our validation rule, we need to make sure the manipulation rule is also in place.
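    As a rough illustration of that point (our own sketch, not IDQ code), the manipulation rule cleans the value first and the validation rule then checks the cleaned value against reference data, so the two always travel together. The abbreviation and city lists are assumptions for the example.

    ABBREVIATIONS = {"STREET": "ST.", "AVENUE": "AVE.", "PLACE": "PL."}   # assumed standard suffixes
    VALID_CITIES = {"San Diego", "Denver", "Los Angeles"}                 # assumed reference data

    def manipulate_address(addr1, city):
        """Manipulation rule: standardize street suffixes and fix casing."""
        words = [ABBREVIATIONS.get(w.upper(), w.upper()) for w in addr1.split()]
        return " ".join(words), city.title()

    def validate_address(addr1, city):
        """Validation rule: apply the manipulation rule, then check the city against reference data."""
        std_addr, std_city = manipulate_address(addr1, city)
        return std_addr, std_city, std_city in VALID_CITIES

    print(validate_address("123 MAIN street", "SAN DIEGO"))
    # ('123 MAIN ST.', 'San Diego', True)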


    CONTACT ADDR1 ADDR2

    JON SMITH 123 MAIN STREET SAN DIEGO

    JOHN SMITH 123 MAIN ST. San Diego

    JOHN SMITH 90 21ST PLACE SD

    Here's an example of a question you can ask regarding a data validation rule: "Is this address valid and in the correct format?"


  • DATASOURCECONSULTING.COM | PAGE 20

    Prioritization: Prioritize the rules as you define them. Thinking back to when we

    were kids, certain rules had a higher priority than others. For example, “look both ways

    before you cross the street” is more important than “wash behind your ears”. They both

    have consequences, but the consequence of one is potentially much greater than the

    other. The same is true for rules. The office environment can be hectic, and it is important

    for Developers to understand what rules take precedence. Prioritizing rules helps the

    Developer implement the most important in the shortest amount of time, allowing them to

    deliver first on rules that provide the highest impact. For example, if rules are prioritized

    as part of the Data Quality Management Process, a developer can work on implementing

    the first 10 rules, and then work on the next 10 as time allows. This approach can provide

    a quick win for IT. By simply implementing a subset of the rules, we are able to show the

    business the value in a Data Quality program.

    Therefore, proper prioritization helps everyone be more efficient and successful.

    Rule Changes: Since the Data Quality Management Process is very iterative, the

    initially-established rules may not be the same moving forward.

    As part of a cohesive team, it will be important to educate Data Stewards on rule changes.

    Stewards should be aware of what changes will occur as they monitor Data Quality. It is

    also important to communicate to the Data Steward that any manipulation rule change

    may require an adjustment in the related validation rule.

    Collaborate: Software enhances collaboration between the Data Steward

    and the Developer. Informatica Data Quality (IDQ) has built in functionality in the form

    of comments, tags, rule specifications, pre-built rules, and reference tables to enhance

    collaboration between the Data Steward and Developer. This allows for faster design and

    development as well as having information shared in one project in the tool rather than a

    spreadsheet.

    Reference Tables: Any eBook on Data Quality wouldn't be complete without sharing

    a few tips regarding reference tables.

    Reference Table Tip 1: Consider using

    reference tables to store a valid list of values.

    Reference tables can be managed by the Data

    Steward within IDQ, which also includes an

    audit trail of any changes to the reference

    tables.

    Reference Table Tip 2: Consider reference tables for LOVs (lists of values) rather than hardcoding the data.

    A proactive approach to Data Quality builds trust and improves collaboration between business and IT.


  • DATASOURCECONSULTING.COM | PAGE 21

    DESIGN

    This part of the process builds on the rules that have been defined and converts them into goals.

    The next step of the Data Quality Management Process is design. In this phase, the Developer takes the business rules and converts them into meaningful and useful goals. These rules could include:

    - Address validation
    - Standardize names
    - Remove noise
    - Validate data values
    - Matching

    TOOL DESIGN

    By using IDQ, we have a jump start in designing the rules with the collaboration techniques highlighted above. The Developer can build on top of what the Data Steward has already defined through rule and mapping specifications. During this process, the Developer should look at implementing the validation and manipulation rules together.

    So, how would we go about designing the rules in the tool? From our previous example with ADDR2, we have the following values:

    - SD
    - SAN DIEGO (all caps)
    - San Diego (mixed case)
    - NULL

    The Data Steward identified that they want to uppercase the value. We can do that by applying "rule_Uppercase". As a result, we can see there is only SD and SAN DIEGO (all caps).


  • DATASOURCECONSULTING.COM | PAGE 22

    The Data Steward also prescribed that we should standardize the value. After applying “rule_

    Standardized_City” we have “San Diego” and no longer a “SD” value.

    The Data Steward highlighted NULL values in their findings and specified that NULL values are invalid. So we apply "rule_Completeness" and see which records are NULL (incomplete) and which are not NULL (complete).

    Each of these rules can be completed one at a time or in a group. Pictured in the image

    below, we have applied the following rules in a group: uppercase, standardized city, and

    completeness. The output will tell the Data Steward whether the value is valid.

    Once the rules have been applied in the profile, the Data Steward can drill down on the invalid

    records and determine what to do with any invalid records.

    Note: You can also apply these rules within a Logical Data Object (LDO) in IDQ. An LDO is just a virtual mapping, and we'll go into more detail on that later in the eBook.
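    Conceptually, the grouped rules behave like a small pipeline. The sketch below is only an approximation of that idea (it is not the IDQ mapplets themselves): uppercase, then standardize the city, then check completeness, and report whether each value is valid.

    CITY_STANDARD = {"SD": "SAN DIEGO"}  # assumed standardization table

    def rule_uppercase(value):
        return value.upper() if value else value

    def rule_standardize_city(value):
        return CITY_STANDARD.get(value, value) if value else value

    def rule_completeness(value):
        return value is not None and value != ""

    for addr2 in ["SD", "SAN DIEGO", "San Diego", None]:
        cleaned = rule_standardize_city(rule_uppercase(addr2))
        print(addr2, "->", cleaned, "| valid:", rule_completeness(cleaned))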


  • DATASOURCECONSULTING.COM | PAGE 23

    DESIGN TIPS IN INFORMATICA DATA QUALITY

    Outlined below are six different tips & tricks for the design phase of the Data Quality Management Process.

    Consistent Naming and Coding

    Standards: Developers and Data

    Stewards will see the same rules in IDQ,

    so it is important to develop consistent

    naming and coding standards. For

    example, both the Data Steward and

    the Developer will understand what

    “rule_” means while not everyone

    will understand what “mplt_” means.

    Therefore, mapplets should be named

    “rule_” if they are used in both the

    Informatica Analyst and Developer tools.

    Anchors & Descriptors: It is a best

    practice to include anchors and

    descriptions in Data Quality mappings

    for faster modifications, readability and

    comprehension. Any metadata changes

    to a source or target object in a mapping

    (e.g. a new column) can quickly be

    made if anchor transformations exist

    immediately following the source object

    and immediately preceding the target

    object. These can be simple pass-through

    transformations. As a reminder, all objects

    should have descriptions. The description

    will be displayed in the Analyst tool

    and can be useful to Data Stewards for

    understanding data anomalies when

    viewing profiles. Descriptors can also help

    remind the Data Steward what rules were

    applied and whether or not they need to

    be updated.


    Consider the Environment: Informatica

    Data Quality (IDQ) or Informatica

    PowerCenter? We need to consider the

    environment that will be leveraged. Will

    all development be done in IDQ, or will

    we need to integrate with PowerCenter?

    PowerCenter, for example, should be

    leveraged for improved performance,

    scalability, reliability, or as part of an

    ETL process. If there isn’t a performance

    impact, then IDQ can be leveraged

    alone and integration with PowerCenter

    is not needed.

    Reuse: Reduce! Reuse! Recycle! These

    are three common words we hear often in

    reference to protecting the environment.

    We can use the same approach when

    it comes to designing in Informatica;

    design for reuse. Many of the same rules

    can be leveraged across data domains

    and verticals. Two rules that can be used

    anywhere are uppercasing values and

    trimming blank spaces. These reusable

    rules should be in a shared location.

    Having reusable rules will allow you to

    design multiple mapplets which can

    then be placed within one mapping. This

    decreases the level of complexity in the

    mapping, saves time, and reduces errors.


  • DATASOURCECONSULTING.COM | PAGE 24

    Pre-Fab Versus New Construction:

    When buying a house, it is more

    convenient to purchase a pre-built

    home than starting from the ground

    up. Mapplets and other objects are

    very similar. Prebuilt mapplets, content

    sets, reference data, etc. should be

    leveraged when possible. IDQ provides

    core accelerators that provide easy

    and pre-built solutions to common

    Data Quality issues.

    LDO’s: Leverage Logical Data Objects

    (LDOs) during design. LDOs are virtual

    mappings that allow you to apply filters

    and can be used in multiple profiles

    where the LDO is the source object.

    LDOs allow you to join data together,

    join multiple tables and include them in

    one profile, exclude columns from being

    shown in the profile, filter out records,

    rename columns, and more. Using LDOs

    allows you to be more efficient because

    you won’t need to perform these steps in

    your Physical Data Object (PDO).


  • DATASOURCECONSULTING.COM | PAGE 25

    IMPLEMENTATION

    Ready! Set! Go! We've built a solid foundation, established rules, and designed our overall structure. The next part of the Data Quality Management Process is implementation.

    It is during this phase that processes are automated, so collaboration between the Data Steward and the Developer is critical. These two will work together to manage expectations, exception records, match and merge records, and perform de-duplication after standardization.

    During the implementation phase, exception records are managed by the Data Steward. When the validation rules are applied, the exception records will be flagged. At that time, the Data Steward determines how to handle the exceptions by asking some (or all) of the following questions:

    - Will I need to update this record in the source system?
    - Does the rule need to be updated?
    - Maybe there is a new value in the data that needs to be part of the reference table. Do reference values need to be updated?
    - Is the exception valid?

    As part of this step, we are going to match and merge the records. As referenced earlier in

    the eBook, the data needs to have been cleansed before we move on to this process. It is for

    this reason that matching and merging is performed as part of the implementation phase, as

    opposed to the design phase.

    The Data Steward and Developer first define three things:

    1) What data elements make a record a duplicate

    2) What weightings should be applied to make a duplicate record the losing record

    3) What constitutes a winner record
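    One common way to express items 2 and 3 is a completeness-weighted score: each populated field contributes a weight, and the record with the highest score in a duplicate group survives as the winner. The weights and records below are illustrative assumptions only, not the eBook's actual rules.

    # Hypothetical field weights used to score duplicate records
    WEIGHTS = {"PHONE": 2, "EMAIL": 2, "ADDR1": 1, "ADDR2": 1}

    def completeness_score(record):
        """Sum the weights of the fields that are actually populated."""
        return sum(weight for field, weight in WEIGHTS.items() if record.get(field))

    duplicates = [
        {"CUST_ID": "12345", "ADDR1": "123 MAIN STREET", "ADDR2": "SAN DIEGO",
         "PHONE": "8585555555", "EMAIL": "[email protected]"},
        {"CUST_ID": "23456", "ADDR1": "123 MAIN ST.", "ADDR2": "San Diego",
         "PHONE": None, "EMAIL": None},
    ]

    winner = max(duplicates, key=completeness_score)
    print("Winning record:", winner["CUST_ID"])  # 12345 wins because more fields are populated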

    Before the duplicate records can be matched and merged, the records need to be split into groups

    to ensure the correct records are matched & merged. As stated earlier in this eBook, it is important

    to group the records into sizable chunks that are more likely to match. This in turn reduces the

    number of records we are evaluating at one time and reduces the impact on performance.



  • DATASOURCECONSULTING.COM | PAGE 26

    It is best to cleanse the data before moving to

    the Implementation phase of the Data Quality

    Management Process.

    While there are many strategies available

    for matching and merging the records, three

    common strategies to generate a key for

    grouping include:

    - String

    - Soundex

    - NYSIIS

    Once records are properly grouped, use

    a matching algorithm to determine if the

    records in the group are truly duplicates. Not

    all matching algorithms or datasets are the

    same; therefore, it is important to test each

    algorithm.

    Several common matching algorithms include:

    - Bigram

    - Edit distance

    - Jaro distance

    - Hamming

    Matching and determining which strategies

    will work with the data can take a long time,

    so it’s best to allocate enough time when

    planning for this part of the Data Quality

    Management Process.
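    As a rough sketch of the group-then-match idea, using only the Python standard library (the grouping key stands in for a Soundex or NYSIIS key, difflib's ratio stands in for whichever matching algorithm you choose, and the 0.85 threshold is an assumption):

    import difflib
    from collections import defaultdict

    records = [
        ("12345", "JON SMITH"),
        ("23456", "JOHN SMITH"),
        ("34567", "JOHN SMITH"),
        ("56789", "BOB JOHNSON"),
    ]

    # Group on a crude key (last name) so we only compare records that could plausibly match
    groups = defaultdict(list)
    for cust_id, contact in records:
        groups[contact.split()[-1]].append((cust_id, contact))

    # Within each group, flag pairs whose similarity ratio clears the assumed threshold
    for members in groups.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                ratio = difflib.SequenceMatcher(None, members[i][1], members[j][1]).ratio()
                if ratio > 0.85:
                    print("Potential duplicates:", members[i][0], members[j][0], round(ratio, 2))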


  • DATASOURCECONSULTING.COM | PAGE 27

    MONITORING

    The final step in the Data Quality Management Process is Monitoring.

    The Data Steward consistently monitors the data and looks at how well the Data Quality is

    performing for a particular field. By analyzing the trends, the Data Steward will be able to

    determine if the current rules need to be updated or if new ones should be created.


    As part of the monitor phase, scorecards

    are created with automated notifications

    sent to Data Stewards alerting them to

    trends in the data. A scorecard is a graphical

    representation of valid values and is used to

    measure Data Quality progress. The scorecard

    can also be shared with stakeholders.

    It is a best practice to baseline your Data

    Quality by creating an initial scorecard before

    applying the Data Quality rules. This allows

    you to show the business the progress you’re

    making and help justify implementing new

    processes.

    Below is an example of a scorecard created by using Informatica Data Quality. In the image

    below, the Data Quality dimensions of Accuracy and Completeness are grouped. Furthermore,

    you can see how well the data is performing for the four categories of “Accurate_City”,

    “Accurate_Email”, “Complete_Zip”, and “Complete_State”. The Score Trends column will show

    how the quality of the data is improving, staying the same, or declining. The Data Steward is

    able to drill down on the invalid rows, export the data, take action, or return to the profile for

    further analysis.
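    Under the hood, a scorecard metric boils down to "what percentage of rows pass this rule?" The minimal sketch below shows that calculation; the sample rows and validity checks are assumptions for illustration, not the IDQ scorecard itself.

    import re

    rows = [
        {"CITY": "San Diego", "ZIP": "92121"},
        {"CITY": "SD",        "ZIP": None},
        {"CITY": "Denver",    "ZIP": "80209"},
    ]

    metrics = {
        "Accurate_City": lambda r: r["CITY"] in {"San Diego", "Denver", "Los Angeles"},
        "Complete_Zip":  lambda r: bool(r["ZIP"]) and re.fullmatch(r"\d{5}", r["ZIP"]) is not None,
    }

    for name, check in metrics.items():
        valid = sum(1 for r in rows if check(r))
        print(f"{name}: {valid / len(rows):.0%} valid")
    # Accurate_City: 67% valid, Complete_Zip: 67% valid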


  • DATASOURCECONSULTING.COM | PAGE 28

    SCORECARDING BEST PRACTICES

    Scorecarding can be a little confusing for some and difficult for others. To help you, we’ve

    listed a few helpful tips for using scorecards within IDQ.

    Baseline: When developing scorecards

    within IDQ, consider creating a scorecard

    before applying the rules to get a baseline

    for measuring Data Quality.

    Weighted Scorecards: Scorecard weightings should be defined to ensure the overall weighting of the metric group will have a value. If each metric has equal weighting, a value of "1" should be given to each metric to ensure the metric grouping has the correct weighted average score.
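    In other words, the metric group score is a weighted average of the individual metric scores; with equal weights of 1 it reduces to a simple average. A quick sketch of the arithmetic, with made-up scores:

    # (metric name, score as a percentage, weight) -- illustrative values only
    metrics = [("Accurate_City", 67, 1), ("Accurate_Email", 90, 1), ("Complete_Zip", 75, 2)]

    weighted_sum = sum(score * weight for _, score, weight in metrics)
    total_weight = sum(weight for _, _, weight in metrics)
    print(weighted_sum / total_weight)  # (67*1 + 90*1 + 75*2) / 4 = 76.75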

    Audience: Consider the audience using

    the scorecard and create the scorecard

    appropriately. For example, if the

    scorecard is going to upper management,

    it may be best to include high level

    trends; whereas, if the scorecard is going

    to someone who will be using the data

    daily, more details and an additional

    report may be warranted.

    Profiles: Scorecards are created from profiles and can include metrics from various profiles; however, they are not tied to a specific profile. This allows a profile to be deleted without impacting the scorecard. If designed properly, the scorecard will source an LDO, so if there is a filter in the LDO, the filter will be applied in the scorecard. As a reminder, filters in profiles are only for that specific profile and do not carry over to other objects.

    A scorecard is a graphical representation of valid values

    and is used to measure the progress of your Data

    Quality initiative.


  • DATASOURCECONSULTING.COM | PAGE 29

    CLOSING

    Congratulations! We've come full circle and are back to the first step in the Data Quality Management Process – profiling the data.

    While it is true that less time will be spent in each phase of the process, the collaboration between the Data Steward and Developer remains constant.

    By following the Data Quality Management Process as outlined above, we can show the business the results of our efforts. This can be especially important for the stakeholders who want to know what business problems we have been solving with the Data Quality initiative.

    With this iterative approach, both the business and IT gain greater confidence in the data, resulting in the data being used as a competitive advantage.

    (Figure: the Data Quality Management Process cycle, connecting Business and IT through organizational best practices, technology, and delivery.)

    ABOUT DATASOURCE CONSULTING

    We are an Enterprise Data Management consulting company that focuses exclusively on enterprise data management, including both strategic and implementation services. We are experts in Data Architecture, Data Integration, Data Quality, Data Governance, Master Data Management, Reporting & Analytics, and Program Management. We are passionate about data.

    DATA QUALITY BY DATASOURCE CONSULTING

    Lean on the Data Quality experts at Datasource Consulting for experienced guidance with building and strengthening Data Quality at your company. We will tailor our expertise to fit your program needs.


  • DATASOURCECONSULTING.COM | PAGE 30

    WHAT TO EXPECT

    Expertise: Datasource Consulting brings a wealth of knowledge and expertise.

    Training: Datasource Consulting will train you on how to create and maintain a Data

    Quality practice that will survive the test of time.

    ITERATIVE DATA QUALITY MANAGEMENT PROCESS

    Focus on Business Need: Data Quality

    initiatives must be driven by solving

    a business problem. During the first

    step of the process, we help identify

    relevant business rules and ensure they

    meet the needs of the company.

    Continuous Monitoring: Accuracy

    requirements evolve as the business

    environment changes. The quality

    of the data is continually monitored

    because of these changes making the

    Data Quality Management Process

    very iterative.

    Business and IT Collaboration:

    Successful Data Quality projects

    require close collaboration between

    the business and IT. Our methodology

    ensures that we have the right people

    involved to implement the right

    technology and processes for an

    effective Data Quality practice.
