Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality...
Date posted: 20-Dec-2015
Data Quality
Class 5
Goals
• Project
• Data Quality Rules (Continued)
• Example
• Use of Data Quality Rules
Data Quality Rules Classes
• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain Mappings
• 5) Relation rules
• 6) Table, Cross-table, and Cross-message assertions
• 7) In-Process directives
• 8) Operational Directives
• 9) Other rules
Representing Data Quality Rules
• Data is divided into 2 sets:
  – conformers
  – violators
• Sets can be represented using SQL
• Create SQL statements representing violating set
Using SQL
• Direct queries
• Embedded queries
  – Using ODBC/JDBC, can create validation scripts in:
    • C
    • C++
    • Java
    • Visual Basic
    • Etc.
Null Value Representations
• Maintain a table of null representation types and names:
create table nullreps (name varchar(30),
nulltype char(1),
description varchar(1024),
source varchar(512),
nullval varchar(100),
nullrepid integer
);
Null Value Rules
• Allows nulls
  – If the rule is “allows nulls” without any additional characterization:
    • Nothing necessary
  – If the rule is “allows nulls,” but only of a specific type:
    • Must check for real nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;
Null Value Rules
• Does not allow nulls
  – Must check for nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;
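The null test above, extended to blanks and spaces, might look like the following sketch. It uses Python's built-in sqlite3 module in place of an ODBC connection, and the table customers with its phone column is hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database standing in for the real data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, phone TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "555-0100"), (2, None), (3, ""), (4, "   ")],
)

# Violators of "does not allow nulls": real NULLs plus blank or all-space strings.
null_violators = conn.execute(
    "SELECT id FROM customers "
    "WHERE phone IS NULL OR TRIM(phone) = '' "
    "ORDER BY id"
).fetchall()
print(null_violators)
```

Whether blanks and spaces count as nulls depends on the null representations recorded for the attribute, so the TRIM test is optional.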
Value Rules
• Value rule is specified as some set of constraints
• Makes use of operators and functions:
  – +, -, *, /, <, <=, >, >=, !=, ==, AND, OR
  – User defined functions
• Example:
  – value >= 0 AND value <= 100
Value Rules 2
• Validation test is the opposite (negation) of the constraint
• Use De Morgan’s laws
  – If the constraint was “value >= 0 AND value <= 100”, use:
SELECT * from <table> where <table>.<attribute> < 0 OR <table>.<attribute> > 100;
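As a quick check of the De Morgan rewrite, the sketch below runs both the negated constraint and its rewritten form and compares the violating sets. The scores table and its contents are made up for the example, and sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, value INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 0), (2, 100), (3, -5), (4, 101), (5, 50)],
)

# Constraint: value >= 0 AND value <= 100.
# Negation written directly with NOT:
negated = conn.execute(
    "SELECT id FROM scores WHERE NOT (value >= 0 AND value <= 100) ORDER BY id"
).fetchall()

# Negation after applying De Morgan's laws:
value_violators = conn.execute(
    "SELECT id FROM scores WHERE value < 0 OR value > 100 ORDER BY id"
).fetchall()

print(negated == value_violators, value_violators)
```

Neither form selects rows where value is NULL; those are the business of a null value rule.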
Domain Membership
• Domains are stored in a database table
• Test for domain membership of an attribute is a test to make sure that all values are represented in domain table
Domain Reference Tables
create table domainref (
name varchar(30),
dtype char(1),
description varchar(1024),
source varchar(512),
domainid integer
);
Domain Reference Tables
create table domainvals (
domainid integer,
value varchar(128)
);
Domain Membership
• Test for membership of attribute foo in the domain named bar:
SELECT * from <table> where foo not in
(SELECT value from domainvals where domainid =
(SELECT domainid from domainref
where domainref.name = 'bar'));
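A runnable sketch of this membership test, using the domainref and domainvals tables defined earlier. The records table, the sample values, and the use of sqlite3 instead of ODBC are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domainref (name VARCHAR(30), dtype CHAR(1), description VARCHAR(1024),
                            source VARCHAR(512), domainid INTEGER);
    CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128));
    CREATE TABLE records (id INTEGER, foo TEXT);  -- hypothetical table under test
    INSERT INTO domainref VALUES ('bar', 'S', 'example domain', 'manual', 1);
    INSERT INTO domainvals VALUES (1, 'NY'), (1, 'CA');
    INSERT INTO records VALUES (1, 'NY'), (2, 'ZZ'), (3, 'CA');
""")

# Violators: rows whose foo value is not listed in domain 'bar'.
domain_violators = conn.execute(
    "SELECT id, foo FROM records WHERE foo NOT IN "
    "(SELECT value FROM domainvals WHERE domainid = "
    "(SELECT domainid FROM domainref WHERE name = ?)) "
    "ORDER BY id",
    ("bar",),
).fetchall()
print(domain_violators)
```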
Domain Assignment
• The values in the attribute define the domain:
  – Find all the values not in the domain already
  – Update domain tables with those values
Domain Assignment 2
• SELECT * from <table> where foo not in
(SELECT value from domainvals where domainid =
(SELECT domainid from domainref
where domainref.name = 'bar'));
For all values in this set, create a record with (the value, the domain id for “bar”), and insert into domainvals.
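The two steps of domain assignment could be scripted roughly as follows. This is a sketch only: the records table and its values are hypothetical, and sqlite3 stands in for the real connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE domainref (name VARCHAR(30), dtype CHAR(1), description VARCHAR(1024),
                            source VARCHAR(512), domainid INTEGER);
    CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128));
    CREATE TABLE records (id INTEGER, foo TEXT);  -- hypothetical table under test
    INSERT INTO domainref VALUES ('bar', 'S', 'example domain', 'derived', 1);
    INSERT INTO domainvals VALUES (1, 'red');
    INSERT INTO records VALUES (1, 'red'), (2, 'blue'), (3, 'red'), (4, 'green');
""")

domain_id = conn.execute(
    "SELECT domainid FROM domainref WHERE name = ?", ("bar",)
).fetchone()[0]

# Step 1: find the distinct values not yet in the domain.
missing = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT foo FROM records WHERE foo NOT IN "
        "(SELECT value FROM domainvals WHERE domainid = ?) ORDER BY foo",
        (domain_id,),
    )
]

# Step 2: insert one (domain id, value) record per missing value.
conn.executemany(
    "INSERT INTO domainvals (domainid, value) VALUES (?, ?)",
    [(domain_id, value) for value in missing],
)

domain_values = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM domainvals WHERE domainid = ? ORDER BY value", (domain_id,)
    )
]
print(missing, domain_values)
```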
Mapping Membership
• Similar to domain membership, except:
  – Must include domain membership tests for both values
  – The pair of values must also be looked up in the mapping tables
Completeness
• Defines when a record is complete
  – Ex: IF (Orders.Total > 0.0), Complete With {Orders.Billing_Street, Orders.Billing_City, Orders.Billing_State, Orders.Billing_ZIP}
• Format:
  – Condition
  – List of fields that must be complete
Completeness 2
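• Equivalent to a set of null tests applied where the condition holds
• Violators are records where the condition is true and at least one listed field is null:
Select * from <table> where (<condition>) and (<field 1> is NULL or ... or <field n> is NULL);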
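For the Orders example, the violating set is every order whose condition holds but which has at least one required field missing. The sketch below builds the null tests from the field list; the sample rows are invented and sqlite3 is used only to make the query runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER, Total REAL, Billing_Street TEXT, "
    "Billing_City TEXT, Billing_State TEXT, Billing_ZIP TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10.0, "1 Main St", "Springfield", "IL", "62701"),  # complete
        (2, 25.0, "2 Oak Ave", None, "CA", "94105"),            # city missing
        (3, 0.0, None, None, None, None),                       # condition false
    ],
)

condition = "Total > 0.0"
required = ["Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP"]

# One IS NULL test per required field, joined with OR.
null_tests = " OR ".join(f"{field} IS NULL" for field in required)
query = f"SELECT id FROM Orders WHERE ({condition}) AND ({null_tests}) ORDER BY id"

incomplete = conn.execute(query).fetchall()
print(incomplete)
```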
Exemption
• Defines which fields may be missing
  – Ex: IF (Orders.Item_Class != “CLOTHING”) Exempt {Orders.Color, Orders.Size}
• Format:
  – Condition
  – List of fields that may be missing
Exemption 2
• If condition is true, the fields may be null
• Therefore, if condition is false, fields may not be null
• Equivalent to a test for records where the condition is false and any of the listed fields is null
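A sketch of the exemption test for the Orders example: rows where the exemption condition does not hold, yet Color or Size is null. The sample rows are hypothetical and sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (id INTEGER, Item_Class TEXT, Color TEXT, Size TEXT)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "CLOTHING", "red", "M"),  # not exempt, fields present
        (2, "CLOTHING", None, "L"),   # not exempt, Color missing
        (3, "BOOK", None, None),      # exempt, nulls allowed
    ],
)

# Condition false (not exempt) AND at least one exemptable field is null.
exemption_violators = conn.execute(
    "SELECT id FROM Orders "
    "WHERE NOT (Item_Class != 'CLOTHING') "
    "AND (Color IS NULL OR Size IS NULL) "
    "ORDER BY id"
).fetchall()
print(exemption_violators)
```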
Consistency
• Define a relationship between attributes based on field content
  – IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
• Format:
  – Condition
  – Assertion
Consistency 2
• If condition is true, the assertion must be true
• Equivalent to test for cases where the condition is true and the assertion is false:
Select * from <table> where (<condition>) and not (<assertion>);
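Applied to the Employees rule, the test might be run as in this sketch. The rows are invented; note the parentheses around the assertion, without which NOT would bind only to the first comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (id INTEGER, title TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Staff Member", 25000),
        (2, "Staff Member", 35000),  # salary too high for the title
        (3, "Manager", 90000),       # condition false, rule does not apply
        (4, "Staff Member", 20000),
        (5, "Staff Member", 19999),  # salary too low for the title
    ],
)

# Condition true AND assertion false.
inconsistent = conn.execute(
    "SELECT id FROM Employees "
    "WHERE (title = 'Staff Member') "
    "AND NOT (Salary >= 20000 AND Salary < 30000) "
    "ORDER BY id"
).fetchall()
print(inconsistent)
```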
Derivation
• Prescriptive form of consistency rule
• Details how one attribute’s value is determined based on other attributes
  – Ex: IF (Orders.NumberOrdered > 0) Then {Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05}
• Format:
  – Condition
  – Assignment
Derivation 2
• The assigned fields must be updated if condition is true
• Find all records where the condition is true
• Generate update SQL calls with updated values
• Execute updates
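The three steps for the Orders derivation can be sketched as below: select the rows where the condition holds, compute the derived values, then issue the updates. Table contents are hypothetical and sqlite3 replaces the ODBC connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER, NumberOrdered INTEGER, Price REAL, Total REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [(1, 2, 10.0, None), (2, 0, 5.0, None), (3, 4, 2.5, 0.0)],
)

# Step 1: find all records where the condition is true.
rows = conn.execute(
    "SELECT id, NumberOrdered, Price FROM Orders WHERE NumberOrdered > 0"
).fetchall()

# Step 2: generate the update parameters with the derived values.
updates = [((number * price) * 1.05, order_id) for order_id, number, price in rows]

# Step 3: execute the updates.
conn.executemany("UPDATE Orders SET Total = ? WHERE id = ?", updates)

totals = dict(conn.execute("SELECT id, Total FROM Orders"))
print(totals)
```

A single UPDATE ... SET ... WHERE statement would achieve the same result here; the stepwise form simply mirrors the list above.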
Functional Dependence
• Functional Dependence between columns X and Y:
  – For any two records R1 and R2 in a table, if field X of R1 and field X of R2 contain the same value x, and field Y of R1 contains the value y, then field Y of R2 must also contain the value y.
• In other words, attribute Y is said to be determined by attribute X.
Functional Dependence 2
• Rule Format:
  – Attribute X determines Attribute Y
• Validation test makes sure that the functional dependence criterion is met
• This means that if we extract the X values from the set of all distinct (X, Y) value pairs, no X value should appear more than once
Functional Dependence 3
• Create view FD as select distinct X, Y from <table>;
• Select count (*) from FD;
• Select count (distinct X) from <table>;
• These should be the same numbers.
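A sketch of the count comparison, with X = zip and Y = city in a hypothetical addresses table. The final query, which lists the X values that break the dependence, is an addition for illustration and not part of the test as stated above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (zip TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?)",
    [
        ("10001", "New York"),
        ("10001", "New York"),
        ("94105", "San Francisco"),
        ("94105", "Oakland"),  # same zip, different city
    ],
)

# Distinct (X, Y) pairs versus distinct X values.
conn.execute("CREATE VIEW FD AS SELECT DISTINCT zip, city FROM addresses")
pair_count = conn.execute("SELECT COUNT(*) FROM FD").fetchone()[0]
x_count = conn.execute("SELECT COUNT(DISTINCT zip) FROM addresses").fetchone()[0]
dependence_holds = pair_count == x_count

# X values that map to more than one Y value.
offenders = conn.execute(
    "SELECT zip FROM FD GROUP BY zip HAVING COUNT(*) > 1"
).fetchall()
print(pair_count, x_count, dependence_holds, offenders)
```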
Primary Key/Uniqueness
• A set of attributes defined as a primary key must uniquely identify a record
• Can also be viewed as a uniqueness constraint
• Format:
  – {attribute list} is PRIMARY
  – {attribute list} is UNIQUE
Primary
• Test to make sure that the number of distinct records with the expected key is the same as the number of records
• Select count(*) from <table>;
• Select count (distinct <attribute list>) from <table>;
• These numbers should be the same
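The two counts can be compared as in this sketch for a two-column key. SQLite, used here as a stand-in, does not accept a multi-column COUNT(DISTINCT ...), so the distinct keys are counted through a subquery instead. The line_items table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (order_id INTEGER, line_no INTEGER, sku TEXT)")
conn.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [(1, 1, "A"), (1, 2, "B"), (2, 1, "A"), (1, 2, "C")],  # (1, 2) appears twice
)

total_records = conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]

# Count distinct (order_id, line_no) combinations via a DISTINCT subquery.
distinct_keys = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT order_id, line_no FROM line_items)"
).fetchone()[0]

key_is_primary = total_records == distinct_keys
print(total_records, distinct_keys, key_is_primary)
```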
Uniqueness
• Test for multiple record occurrences with the same set of values that should have been unique, if there is a separate known primary key
SELECT t1.<primary>, t2.<primary>, t1.<attribute>
FROM <table> AS t1, <table> AS t2
WHERE t1.<attribute> = t2.<attribute> and t1.<primary> <> t2.<primary>;
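A runnable version of the self-join, for a hypothetical users table whose email column should be unique and whose id column is the known primary key. As a small variation, the sketch compares keys with < rather than <> so that each duplicate pair is reported once instead of twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")],
)

# Pairs of distinct records that share a value which should have been unique.
duplicates = conn.execute(
    "SELECT t1.id, t2.id, t1.email "
    "FROM users AS t1, users AS t2 "
    "WHERE t1.email = t2.email AND t1.id < t2.id "
    "ORDER BY t1.id, t2.id"
).fetchall()
print(duplicates)
```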
Foreign Key
• When the values in field f in table T are chosen from the key values in field g in table S, field T.f is said to be a foreign key referencing field S.g
• If f is a foreign key, each of its values must exist in table S, column g (= referential integrity)
Foreign Key 2
• Similar to primary key
• Test is to make sure that all values in foreign key field exist in target table
Select * from <source table> where <attribute> not in (Select distinct <attribute> from <target table>);
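The referential integrity test, sketched for a hypothetical orders table whose customer_id column refers to customers.id. Both tables and their rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER);                  -- target table
    CREATE TABLE orders (id INTEGER, customer_id INTEGER); -- source table
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 3), (12, 2);
""")

# Orders whose customer_id has no matching row in customers.
orphans = conn.execute(
    "SELECT id, customer_id FROM orders "
    "WHERE customer_id NOT IN (SELECT DISTINCT id FROM customers) "
    "ORDER BY id"
).fetchall()
print(orphans)
```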
Use of Data Quality Rules
• Data Validation
• Root Cause Analysis
• Message Transformation
• Data-driven GUIs
• Metadata Collection
Data Validation
• Translate rule set into select statements
• Create a program that:
  – Loads select statements into an array, indexed by a unique integer
  – Connects to database via ODBC
  – Iterates through the array of select statements, executing each one and collecting the results
Data Validation 2
– Each type of rule has an expected result; check against the expected result
– Outputs the result of each statement to output file, tagged by rule identifier
– Results can be tallied to yield an overall percentage of valid records to total records
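A minimal sketch of such a validation program follows. It makes several assumptions: sqlite3 replaces the ODBC connection, the rules are kept in a dictionary keyed by integer rule identifier, every rule is written as a query for its violating set (so the expected result is an empty set), and the report is collected in a list rather than written to a file. The orders table and both rules are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, zip TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "10001"), (2, -3.0, "94105"), (3, 5.0, None), (4, -1.0, None)],
)

# Select statements for the violating sets, indexed by a unique integer.
rules = {
    1: "SELECT id FROM orders WHERE total < 0 ORDER BY id",    # value rule
    2: "SELECT id FROM orders WHERE zip IS NULL ORDER BY id",  # null value rule
}

report = []             # one line per rule, tagged by rule identifier
all_violators = set()   # ids of records that violate at least one rule
for rule_id, query in rules.items():
    ids = [row[0] for row in conn.execute(query)]
    status = "ok" if not ids else "violated"
    report.append(f"rule {rule_id}: {status}, {len(ids)} record(s) {ids}")
    all_violators.update(ids)

# Overall percentage of valid records to total records.
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
percent_valid = 100.0 * (total - len(all_violators)) / total

print("\n".join(report))
print(f"{percent_valid:.1f}% valid")
```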
Root Cause Analysis
• Root cause analysis can be started by looking at the counts of violated rules
• Use the most frequently violated rule as a starting place
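Picking the starting point from the violation counts is a simple ranking. In the sketch below the rule names and counts are invented placeholders, not measured results.

```python
from collections import Counter

# Hypothetical per-rule violation counts, e.g. tallied from a validation run.
violation_counts = Counter({"null_zip": 42, "negative_total": 7, "unknown_state": 15})

# Rank rules from most to least frequently violated.
ranked = violation_counts.most_common()
starting_rule = ranked[0][0]
print(ranked, starting_rule)
```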
Message Transformation
• Electronic Data Interchange
• Use DQ rules to validate incoming messages
• Use DQ rules (derivations, mappings) to transform incoming messages into an internal format
Data-driven GUIs
• Data dependence is specified in a collection of rules
• Generate equivalence classes of data values based on dependence specification
Data-driven GUIs
• First, look for all independent attributes – this is class 0
• For class i, collect all attributes that depend on class (i – 1)
• The GUI will be constructed to iteratively request data from class 0..n
• Based on the results from collecting data at step j, the rules associated with the actual values are applied, determining which values are requested at step j + 1
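One way to compute the classes is sketched below. It assumes the dependence specification has already been reduced to a map from each attribute to the attributes it depends on, and it reads "depends on class (i – 1)" as: an attribute's class is one more than the highest class among its dependencies. The function name and the sample attributes are hypothetical.

```python
def dependence_classes(depends_on):
    """Group attributes into classes 0..n by dependence depth."""
    level = {}

    def resolve(attr, path=()):
        if attr in level:
            return level[attr]
        if attr in path:
            raise ValueError(f"cyclic dependence involving {attr!r}")
        deps = depends_on.get(attr, set())
        # Independent attributes are class 0; others sit one class above
        # their deepest dependency.
        level[attr] = 0 if not deps else 1 + max(
            resolve(dep, path + (attr,)) for dep in deps
        )
        return level[attr]

    for attr in depends_on:
        resolve(attr)

    classes = {}
    for attr, depth in level.items():
        classes.setdefault(depth, []).append(attr)
    return [sorted(classes[i]) for i in range(len(classes))]


# Hypothetical dependence specification.
spec = {
    "country": set(),
    "state": {"country"},
    "city": {"state"},
    "item_class": set(),
    "size": {"item_class"},
}
gui_steps = dependence_classes(spec)
print(gui_steps)
```

A GUI built on this would request the attributes of gui_steps[0] first, then apply the rules to the values entered before requesting gui_steps[1], and so on.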
Metadata Collection
• Use domain and mapping derivation rules to collect metadata
• Use other rules as a documentation of business operations