Post on 20-Dec-2015
Data Quality
Class 3
Goals
Dimensions of Data Quality Enterprise Reference Data Data Parsing
Dimensions of Data Quality
Poor data quality is similar to obscenity-– It seems as if there are no real ways to measure it,
but you know it when you see it!
In reality, data quality can be measured– The frame of reference for measurement is different
Dimensions of Data Quality 2
Data Models Data Values Data Presentation Data Policy
Example: Sales Database
Sales and marketing database– Current customers– Sales leads
Name, address, contact data For current customers, sales data
Data Quality of Data Models
Clarity of definition Comprehensiveness Flexibility Robustness
Data Quality of Data Models 2
Essentialness Attribute granularity Precision of domains Homogeneity
Data Quality of Data Models 3
Naturalness Identifiability Obtainability Relevance
Data Quality of Data Models 4
Simplicity Semantic Consistency Structural Consistency
Data Quality of Data Values
Accuracy Null values Completeness Consistency Currency
Accuracy
Agreement with established sources Database of record Other corroborative sources
Null Values
Null vs. Missing– Unavailable– Not applicable– No value– Not classified– Truly null
Completeness
Mandatory attributes require values Optional attributes may hold values (when and
how?) Inapplicable attributes may not have a value
(also when and how?) Completeness constraints
Consistency
Are values in one set consistent with values in another set?
Consistency relations between attributes in the same table
Consistency assertions across columns Consistency relationships between tables
Currency/Timeliness
What data is current? How is it kept up-to-date? Time expectations for accessibility to data
Data Quality of Data Presentation
Appropriateness Correct Interpretation Flexibility Format Precision
Data Quality of Data Presentation 2
Portability Representation Consistency Representation of Null Values
Data Quality of Data Policy
Access Metadata Privacy Fault-tolerance Security
Reference Data
Relatively static Referred to from within many tables Shared data Examples:
– Product catalog– Security type classification– Currency codes
Reference Data 2
Metadata Data Domains Mappings between those domains
Domains
A data domain is a subclassed data type Domains can be described using enumerations
– Example: states, cities, product codes
Domains can be described using functions– Example: formatted trouble ticket ids such as CC-
NNNN
Domain Membership
Data attributes can be affiliated with data domains
Test a value to make sure it is valid within a domain– For enumerated domains, a lookup works– For generated/described domains, check to see if
the value could be generated by function
Domains as Metadata
The existence of a data domain is metadata Keep track of which tables have attributes
making use of which domains Manage domains as enterprise reference data
Domain Tables
Domains are sets of values Maintain in two tables:
– Domain name table (domain_name, domain_id)– Domain value table (domain_id, value)
Example:– (2, “STATES”)– {(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}
Mappings
Map values in one domain to values in other domains
1-1, 1-Many, Many-1, Many-Many Represents relations from one set to another
set 1-1, Many-1 are functions
Mappings as Metadata
Semi-static mappings are also metadata Keep track of which tables refer to mappings Enterprise reference data
Mapping Tables
Maintain mapping information in two tables:– Mapping name table (from_domain, to_domain,
mapping_id, mapping name)– Mapping table (mapping_id, source_value,
target_value)
Example:
(“Currency”, “Country”, 65, “CurrencyToCountry”)
(65, “USD”, “US)
Tables
Inference of data fields from source data In our case, this is straightforward based on
SGML tags in the data Determine table structure in context of use of
metadata and reference data In other words, try to maintain normal form
Keys
Primary key: must be unique across all values in table Primary key may be assigned based on internal
increasing value, or maybe extant in the data Foreign key: relates values in one table to values in
another table Foreign keys must exist in target table (=referential
integrity)
Data Profiling and Parsing
Goals:– Identify data domains– Identify mappings– Identify candidate keys
Data Column Profiling
For each table– For each column
Type inference Subclassed Type inference Count the number of distinct values Enumerate the distinct values Sort the distinct values Look for patterns of usage Document discoveries
Data Table Profiling
For each Table– Identify candidate keys– Identify data domains– Identify data mappings– Look for patterns
Multiple Table Profiling
Identify foreign key relationships Validate referential integrity Look for patterns
Data Parsing
2 kinds of data domains– Enumerated, consisting of a predefined set of
values– Inferred/Functional, consisting of values that are
validated based on a set of rules
Type Inference
Data value sets all conform to a data type All values within a column must belong to the same
type Propose data type by a series of inferences
– Initial assumption is that all values belong to strictest type– Test for violations to type restriction– As a violation is discovered, loosen the restriction and test
again for violations– Measure conformance at each level
Subclassed Type Inference
Once type has been inferred, look for additional restrictions within type
Example:– Integer vs. Integer within a range (0..100)
For integers and real values, look for ranges For character strings, look for length, context,
character patterns
String Parsing
Classify characters into groups– Alphabetic– Numeric/Digits– Punctuation (this can be further refined)
Transform all values within a column set into their corresponding pattern
Example: 789-23-1100 would change to– DDDPDDPDDDD
String Parsing
Given a set of representative strings, perform column analysis again
Look for high frequency counts for specific patterns
This will provide proposed functional domains We can characterize certain domains by format
(e.g., telephone numbers, SSNs, Tax Ids, UPC codes, etc.)
Word Parsing
More advanced technique to explore domain types
Classify word tokens in terms of predefined characterization– Example: Name words, business words, transaction
words, connector words, titles, etc.
Primary Key Discovery
Iterative process to find candidate keys A candidate key is one or more attributes
whose values, when composed, can uniquely locate a record
Primary Key Discovery 2
For I = 1, number of attributes– For each set S composed of I attributes do
Are the composed values unique across the tables? If so, add to candidate key set
Next Assignment
Type Inferencer– Propose data types for analyzed columns
Format Parser– Transform strings into pattern strings– Propose format for a column– Catalog discovered named formats (telno, SSN,
etc.)– More details on posting on web site