Data Quality Class 4. Goals Discuss Project Midterm Statistical Process Control Data Quality Rules.
Data Quality
Class 4
Goals
• Discuss Project
• Midterm
• Statistical Process Control
• Data Quality Rules
Project
• Information is now on the web site
• Final version is due on July 26
• Data will be available by end of the week
• We will spend some time discussing goals today
Midterm
• Written exam on July 5th
• Will cover:
  – Cost of low data quality
  – Dimensions of data quality
  – Domains and mappings
  – SPC
  – Data Quality Rules
Statistical Process Control
• Developed by Shewhart at Bell Labs from the 1920s through the 1950s
• Notions of Variation vs. Control
• Important in its original context for both equipment manufacture and service quality
Variation
• Natural variations
• Defects
• Errors
• Mistakes
• Some variations are meaningful, some are not
Causes of Variation
• Common, or Chance causes
  – minor fluctuations or differences
  – not necessarily important to correct
  – observed to form a normal distribution
• Assignable, or Special causes
  – (self explanatory)
• We expect to see the normal variations, but assignable cause variations are interesting
Example
• Measure railroad on-time performance
  – Trains are typically on time or a few minutes late
  – One night, the trains are all 1 hour late due to electrical problems (a special cause)
Statistical Control
• State in which variations observed can be attributed to common causes that do not change with time
Pareto Principle
• In a population that contributes to a common effect, relatively few of the contributors account for the bulk of the effect
• Example: code performance analysis
• Can be used to direct analysis
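One way to use the Pareto principle to direct analysis is to rank contributors by their share of the effect and keep the smallest group that covers most of it. The sketch below assumes a hypothetical helper `pareto_top`, an illustrative 80% threshold, and made-up error counts per source system; none of these come from the slides.

```python
def pareto_top(counts, threshold=0.8):
    """Return the top contributors whose combined share first reaches `threshold`."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Rank contributors from largest to smallest effect
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0
    for name, count in ranked:
        selected.append(name)
        running += count
        if running / total >= threshold:
            break
    return selected

# Hypothetical error counts per source system (illustrative only)
errors_by_source = {"billing": 120, "orders": 45, "shipping": 20, "returns": 10, "web": 5}
print(pareto_top(errors_by_source))
```

Here two of the five sources account for over 80% of the errors, so they would be the first places to investigate.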
Control Chart
[Figure: control chart showing the upper control limit (UCL), center line, and lower control limit (LCL)]
Control Chart 2
• Used to look for distinct variations from the mean
• Goal: predictable behavior
• Plot series of data over time
• Variations are represented as distance from the mean
Control Chart 3
• Center Line: can be computed as mean of variable points
• Upper Control Limit: three standard deviations above center line
• Lower Control Limit: three standard deviations below center line
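The three lines can be computed directly from a series of measurements. This is a minimal sketch, assuming the population standard deviation of the plotted points is used for the limits (practical SPC often estimates sigma from subgroup or moving ranges instead); the function name `control_limits` and the sample values are illustrative.

```python
from statistics import mean, pstdev

def control_limits(samples, k=3.0):
    """Return (LCL, center line, UCL) using k standard deviations around the mean."""
    center = mean(samples)
    sigma = pstdev(samples)  # assumption: population standard deviation of the points
    return center - k * sigma, center, center + k * sigma

# Illustrative measurements: mean is 5 and standard deviation is 2
samples = [2, 4, 4, 4, 5, 5, 7, 9]
lcl, center, ucl = control_limits(samples)
print(lcl, center, ucl)
```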
Control Chart 4
• As long as all points are between UCL and LCL, the variations are due to common causes, and the process is said to be in control, or stable
• Points above UCL or below LCL are indicative of abnormal variation, and are due to special causes – the process is not in control
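As a rough illustration of this test, the sketch below reuses the railroad example: limits are computed from a baseline period assumed to be in control, then new observations are checked against them. The data values and the helper name `out_of_control` are hypothetical.

```python
from statistics import mean, pstdev

def out_of_control(points, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lcl or x > ucl]

# Hypothetical minutes-late figures for a baseline period believed to be in control
baseline = [0, 2, 1, 3, 0, 4, 2, 1, 3, 2]
center, sigma = mean(baseline), pstdev(baseline)
lcl, ucl = center - 3 * sigma, center + 3 * sigma

# New observations, including the night of the electrical problem
new_nights = [2, 0, 60, 3]
flagged = out_of_control(new_nights, lcl, ucl)
print(flagged)
```

Computing the limits from a separate baseline is a design choice: if the 60-minute night were included in the calculation, it would inflate the standard deviation and could hide itself.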
Control Chart 5
• Select variables chart or attributes chart
• Use data quality dimensions as guideline
• Select meaningful variables to measure (i.e., stuff that will point at a diagnosable problem)
Interpreting the Control Chart
• Lack of stability indicates potential problem
• Look for:
  – points outside of control limits
  – zone testing (clusters of points within certain standard deviation limits)
  – potential to split out data points into different logical data sets
• Look for cycles
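Zone tests flag clusters of points that stay inside the control limits but are still unlikely under common-cause variation. The slides do not say which zone rules are intended; the sketch below assumes one commonly used rule (two out of three consecutive points more than two standard deviations from the center line, on the same side). The function name and data are illustrative.

```python
def two_of_three_beyond_2sigma(points, center, sigma):
    """Return indexes where 2 of the last 3 points lie beyond 2 sigma on the same side."""
    hits = []
    for i in range(2, len(points)):
        window = points[i - 2 : i + 1]
        above = sum(1 for x in window if x > center + 2 * sigma)
        below = sum(1 for x in window if x < center - 2 * sigma)
        if above >= 2 or below >= 2:
            hits.append(i)
    return hits

# Illustrative series with center 0 and sigma 1; no point exceeds 3 sigma
points = [0, 2.5, 0.1, 2.6, 0, 0, 0]
print(two_of_three_beyond_2sigma(points, center=0, sigma=1))
```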
SPC and Data Quality
• “The Information Factory”
• Use data quality dimensions as guideline for investigation
• Analyze the state of data as it passes through the information chain
• Probing can be automated with data quality rules
Inserting the Probes
• Find a location in the information chain that is:
  – nondisruptive
  – easy to access
  – easy to retool
Data Quality Rules
• Definitions
• Proscriptive Assertions
• Prescriptive Assertions
• Conditional Assertions
• Operational Assertions
Definitions
• Nulls
• Domains
• Mappings
Proscriptive Assertions
• Describe what is not allowed
• Used to figure out what is wrong with data
• Used for validation
Prescriptive Assertions
• Describe what is supposed to happen with data
• Can be used for data population, extraction, transformation
• Can also be used for validation
Conditional Assertions
• Define an assertion that must be true if a condition is true
Operational Assertions
• Define an action that must be taken if a condition is true
9 Classes of Rules
• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain Mappings
• 5) Relation rules
• 6) Table, Cross-table, and Cross-message assertions
• 7) In-Process directives
• 8) Operational Directives
• 9) Other rules
Null Value Rules
• Null value specification
  – Define GETDATE for unavailable as “fill in date”
• Null values allowed
  – Attribute A allowed nulls {GETDATE, U, X}
• Null values not allowed
  – Attribute B nulls not allowed
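These three kinds of null rule could be checked along the following lines. This is only a sketch: the meanings given for `U` and `X` are assumptions (the slides define only `GETDATE`), and treating a bare missing value as a disallowed null representation is a design choice made here for illustration.

```python
# Null value specification: code -> meaning ("U" and "X" meanings are assumed)
NULL_CODES = {"GETDATE": "fill in date", "U": "unknown", "X": "not applicable"}

# Which null codes each attribute may carry; an empty set means nulls are not allowed
ALLOWED_NULLS = {"A": {"GETDATE", "U", "X"}, "B": set()}

def null_violations(record):
    """Return the attributes whose null representation breaks the rules."""
    violations = []
    for attr, allowed in ALLOWED_NULLS.items():
        value = record.get(attr)
        is_null = value is None or value in NULL_CODES
        # A bare None is never in `allowed`, so it is reported as a violation too
        if is_null and value not in allowed:
            violations.append(attr)
    return violations

print(null_violations({"A": "U", "B": "2015-12-19"}))
print(null_violations({"A": "GETDATE", "B": "X"}))
```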
Value Rules
• Value restriction rule
  Restrict GRADE: value >= ‘A’ AND value <= ‘F’ AND value != ‘E’
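The GRADE restriction translates almost directly into a predicate. In this sketch the single-character check is an added assumption, since a plain string comparison would otherwise accept values such as "AB"; the function name is illustrative.

```python
def grade_is_valid(value):
    """Check the GRADE restriction: between 'A' and 'F' inclusive, but not 'E'."""
    # Assumption: a grade is exactly one character
    return len(value) == 1 and "A" <= value <= "F" and value != "E"

for grade in ["A", "E", "F", "G", "AB"]:
    print(grade, grade_is_valid(grade))
```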
Domain Rules
• Domain Definition
• Domain Membership
• Domain Nonmembership
• Domain Assignment
Mapping Rules
• Mapping definition
• Mapping membership
• Mapping nonmembership
Relation Rules
• Completeness
• Exemption
• Consistency
• Derivation
Completeness
• Defines when a record is complete (i.e., which fields must be present)
  IF (Orders.Total > 0.0), Complete With
    {Orders.Billing_Street,
     Orders.Billing_City,
     Orders.Billing_State,
     Orders.Billing_ZIP}
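A completeness rule like this one might be checked as follows. The sketch assumes each order is a dictionary keyed by the column names from the rule, and that `None` or an empty string counts as missing; the function name is hypothetical.

```python
REQUIRED_BILLING = ["Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP"]

def missing_billing_fields(order):
    """Return required billing fields that are absent when the order total is positive."""
    if order.get("Total", 0.0) <= 0.0:
        return []  # the condition does not hold, so the rule does not apply
    return [f for f in REQUIRED_BILLING if order.get(f) in (None, "")]

# Hypothetical order record
order = {"Total": 25.0, "Billing_Street": "1 Main St", "Billing_City": "Springfield",
         "Billing_State": "", "Billing_ZIP": None}
print(missing_billing_fields(order))
```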
Exemption
• Defines which fields may be missing
  IF (Orders.Item_Class != “CLOTHING”) Exempt
    {Orders.Color,
     Orders.Size}
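An exemption can be read as switching off a requirement when its condition holds. The sketch below assumes that Color and Size are otherwise required, which the slide implies but does not state; the function name and records are illustrative.

```python
def missing_unless_exempt(order):
    """Return Color/Size fields that are missing and not covered by the exemption."""
    # Exemption rule: non-clothing items may omit Color and Size
    exempt = {"Color", "Size"} if order.get("Item_Class") != "CLOTHING" else set()
    checked = ["Color", "Size"]  # assumption: these are required unless exempt
    return [f for f in checked if f not in exempt and order.get(f) in (None, "")]

print(missing_unless_exempt({"Item_Class": "BOOKS"}))
print(missing_unless_exempt({"Item_Class": "CLOTHING", "Color": "red"}))
```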
Consistency
• Define a relationship between attributes based on field content
  – IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
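Used for validation, the consistency rule is a conditional check on each record. A minimal sketch, assuming employees are dictionaries keyed by the attribute names in the rule:

```python
def salary_consistent(employee):
    """Staff Members must earn at least 20000 and less than 30000; others pass."""
    if employee["title"] == "Staff Member":
        return 20000 <= employee["Salary"] < 30000
    return True  # the rule says nothing about other titles

print(salary_consistent({"title": "Staff Member", "Salary": 25000}))
print(salary_consistent({"title": "Staff Member", "Salary": 30000}))
```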
Derivation
• Prescriptive form of consistency rule
• Details how one attribute’s value is determined based on other attributes
  IF (Orders.NumberOrdered > 0) Then {
    Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05
  }
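Because a derivation rule is prescriptive, it can either populate the derived attribute or validate a stored one. The sketch below shows both uses under the assumption that orders are dictionaries; the helper names and the tolerance used for comparison are illustrative.

```python
import math

def derive_total(order):
    """Return a copy of the order with Total derived when NumberOrdered > 0."""
    result = dict(order)
    if result["NumberOrdered"] > 0:
        result["Total"] = (result["NumberOrdered"] * result["Price"]) * 1.05
    return result

def total_matches(order, rel_tol=1e-9):
    """Validation form: does the stored Total agree with the derived one?"""
    if order["NumberOrdered"] <= 0:
        return True  # the rule does not apply
    return math.isclose(order["Total"], derive_total(order)["Total"], rel_tol=rel_tol)

derived = derive_total({"NumberOrdered": 2, "Price": 10.0})
print(derived["Total"])
```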
Table and Cross-Table Rules
• Functional Dependence
• Primary Key Assertion
• Foreign Key Assertion (=referential integrity)
Functional Dependence
• Functional Dependence between columns X and Y:
  – For any two records R1 and R2 in a table, if field X of R1 and field X of R2 contain the same value x, and field Y of R1 contains the value y, then field Y of R2 must also contain y.
• In other words, attribute Y is said to be determined by attribute X.
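A functional dependence can be tested by remembering the first Y value seen for each X value and reporting any later disagreement. This is a sketch assuming rows are dictionaries; the ZIP-to-state data and the name `fd_violations` are made up for illustration.

```python
def fd_violations(rows, x, y):
    """Return (x value, first y seen, conflicting y) for rows that break X -> Y."""
    seen = {}
    violations = []
    for row in rows:
        key, val = row[x], row[y]
        if key in seen and seen[key] != val:
            violations.append((key, seen[key], val))
        else:
            seen.setdefault(key, val)
    return violations

# Hypothetical table: zip should determine state
rows = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NJ"},
    {"zip": "94105", "state": "CA"},
]
print(fd_violations(rows, "zip", "state"))
```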
Primary Key Assertion
• A set of attributes defined as a primary key must uniquely identify a record
• Enforcement = testing for duplicates across defined key set
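Testing for duplicates across the key set can be sketched with a counter over key tuples. The rows, column names, and function name below are hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return key tuples that appear in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical table with a two-column primary key (order_id, line_no)
rows = [
    {"order_id": 1, "line_no": 1},
    {"order_id": 1, "line_no": 2},
    {"order_id": 1, "line_no": 2},
    {"order_id": 2, "line_no": 1},
]
print(duplicate_keys(rows, ["order_id", "line_no"]))
```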
Foreign Key Assertion
• When the values in field f in table T are chosen from the key values in field g in table S, field T.f is said to be a foreign key referencing field S.g
• If f is a foreign key, each of its values must exist in table S, column g (=referential integrity)
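Referential integrity can be checked by collecting the key values in S.g and listing any values of T.f that are not among them. A minimal sketch with made-up tables; the function name `orphaned_values` is an assumption.

```python
def orphaned_values(t_rows, f, s_rows, g):
    """Return values of T.f that have no matching key in S.g."""
    valid = {row[g] for row in s_rows}
    return [row[f] for row in t_rows if row[f] not in valid]

# Hypothetical tables: orders.customer_id refers to customers.id
customers = [{"id": 1}, {"id": 2}]
orders = [{"customer_id": 1}, {"customer_id": 3}, {"customer_id": 2}]
print(orphaned_values(orders, "customer_id", customers, "id"))
```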
In-process Directives
• Definition directives (labeling information chain members)
• Measurement directives
• Trigger directives
Operational Directives
• Transformation
• Update
Other Rules
• Approximate Searching rules
• Approximate Matching rules