Data quality - Using sas dataflux in the real world - Shane Gibson - OptimalBI
Date posted: 19-Oct-2014
Delivering Data Quality in the real world
A case study using SAS Dataflux
What I Will Cover
1. What is Data Quality?
2. What is SAS Dataflux?
3. The approach we took and why
4. The things we did and how
5. Monitoring the results
1. What is Data Quality
• Data are of high quality “if they are fit for their intended uses in operations, decision making and planning” (J. M. Juran). Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer.
• Source: Wikipedia, http://en.wikipedia.org/wiki/Data_quality
• Joseph Moses Juran (December 24, 1904 – February 28, 2008) was a 20th-century management consultant, principally remembered as an evangelist for quality and quality management.
2. What is SAS Dataflux
DataFlux provides organisations with the ability to plan and complete data integration, data quality and master data management (MDM) projects, all from a single interface. It makes it easier to do:
• Profiling
• Standardization
• Matching
• Augmentation
• Business Rules Monitoring
It's delivered as:
• Standalone Desktop Client
• Component of SAS Enterprise Data Integration Server
• Full Dataflux solution
3. The approach we took and why
Information Governance Hierarchy
Board
Executive Team
Data Governance Committee
Data Council
Business Data Stewards, Technical Data Stewards
"Data Governance: From theory to practice", Zeeman van der Merwe, Manager: Information Integrity and Analysis, ACC. 2010 SUNZ Conference, 16 February 2010.
Data Quality Maturity Model
Axes: Trust in Information; Maturity of Data Governance processes
• Unaware: Data management is characterised as ad-hoc or chaotic. The organisation depends solely on individuals with no awareness of data management practices, resulting in variable results and no repeatability.
• Repeatable: Data management expertise exists internally and there is some ability to duplicate good practices. Key data management individuals are assigned to critical projects to reduce risks and improve results.
• Defined: The organisation uses a set of defined data management processes, which are published for recommended use.
• Managed: The use of the data management processes is required and monitored. All projects and initiatives include data management as a core part of their objectives and deliverables.
• Effective: Data Quality is automatically monitored and reported. Reliability and predictability of results are monitored via Six Sigma or an equivalent measurement methodology.
• Optimised: The organisation regularly analyses existing data management processes to determine where changes can deliver improved efficiencies, and implements them.
Data Cleansing Business Process
• Data Quality Issue
• Monitoring Scorecards
• Update Source System
• Profile Issues
• Initiate Information Governance Group
• Prioritise Data Quality Issues
• Manually or Programmatically update data
4. The things we did and how
We used Dataflux to
• Profile the data
• Profiled
  • Phone numbers
  • Customer Attributes
    • Gender
    • Date of Birth
    • Missing Values
  • Addresses
    • Suppliers
    • Customers
    • Locations
4. The things we did and how
Example
4. The things we did and how
Profile Data
Alpha String   Count
(NIGHTS            1
-ROOM              1
-X                 3
ACT                1
COURSE             1
EX                 7
EXT                4
FAX                1
N/A                2
SCHOOL             1
WK                 1
X                 48
XT                 7
XTN                3
Category                                   Count   Percentage
AREA CODE MISSING                          13476   51%
INVALID MOBILE NUMBER                        158   1%
INVALID NUMBER                               212   1%
INVALID LANDLINE NUMBER, TOO FEW DIGITS      723   3%
INVALID LANDLINE NUMBER, TOO MANY DIGITS     366   1%
MOBILE NUMBER                               1744   7%
MOBILE NUMBER OBSOLETE                       942   4%
NUMBER OK                                   8324   30%
ZERO                                         511   2%
Pattern         Count   Percentage
9999999         12760   51%
9999999999       2979   12%
99 9999999       2634   10%
999999999        2210   9%
999 9999         1453   6%
99 999 9999       998   4%
999 999 9999      605   2%
99*9999999        493   2%
999 9999999       297   1%
99999999999       292   1%
99999999          101   0%
9999999 9999       84   0%
999*9999           71   0%
999 999999         53   0%
9999999 999        49   0%
999 999 999        47   0%
999999             41   0%
999 9999 999       36   0%
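A pattern profile like the one above can be approximated outside Dataflux. The following is a minimal Python sketch, not the Dataflux implementation: it assumes every digit maps to 9, every letter maps to A, and any other character is kept as-is. The function names and sample phone numbers are hypothetical.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Map each digit to '9' and each letter to 'A'; keep other characters."""
    out = []
    for ch in value.strip():
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_patterns(values):
    """Return (pattern, count, percentage) rows, most frequent pattern first."""
    counts = Counter(char_pattern(v) for v in values)
    total = sum(counts.values())
    return [(p, n, round(100 * n / total)) for p, n in counts.most_common()]


# Hypothetical sample values, not taken from the case study data.
sample = ["4567890", "04 4567890", "021 456 7890", "1234567", "04*4567890"]

for pattern, count, pct in profile_patterns(sample):
    print(f"{pattern:<15}{count:>6}{pct:>5}%")
```

Sorting by frequency puts the dominant formats at the top, which makes the rare, suspicious patterns easy to spot at the bottom of the list.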
4. The things we did and how
We used Dataflux to
• Standardise Data
  • Use Dataflux Quality Knowledge Base to:
    • Standardise Person Names
      • Robert, Rob, Bob
    • Standardise Location Names
      • Wellington, WLG, Wgtn
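The idea behind standardisation can be illustrated with a simple lookup. This is a minimal Python sketch under stated assumptions, not how the Dataflux Quality Knowledge Base works internally: the lookup tables hold only the variants named on the slide, and the names `NAME_VARIANTS`, `LOCATION_VARIANTS` and `standardise` are hypothetical.

```python
# Hypothetical lookup tables built from the examples on the slide.
# A real knowledge base would hold far more variants and smarter parsing.
NAME_VARIANTS = {"robert": "Robert", "rob": "Robert", "bob": "Robert"}
LOCATION_VARIANTS = {"wellington": "Wellington", "wlg": "Wellington", "wgtn": "Wellington"}


def standardise(value: str, variants: dict) -> str:
    """Return the standard form of a value, or the trimmed input if unknown."""
    cleaned = value.strip()
    return variants.get(cleaned.lower(), cleaned)


print(standardise(" Bob ", NAME_VARIANTS))      # Robert
print(standardise("WLG", LOCATION_VARIANTS))    # Wellington
print(standardise("Alice", NAME_VARIANTS))      # Alice (no variant known)
```

Unknown values pass through unchanged, so the standardised column never loses information that the lookup does not cover.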
4. The things we did and how
We used Dataflux to
• Consolidate Data
  • Merge multiple people records
  • Multiple matching rules
  • Needed to be reusable
  • Needed to have logic layers
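One way to read "multiple matching rules" with "logic layers" is as small reusable rules grouped into layers that are tried from strictest to loosest. The Python sketch below shows that interpretation only; it is an assumption, not the matching logic used in the case study. The `Person` fields, rule functions and layer names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    first_name: str
    last_name: str
    date_of_birth: str  # ISO date string, empty when unknown
    phone: str          # digits only, empty when unknown


Rule = Callable[[Person, Person], bool]


# Reusable rules: each compares one aspect of two records.
def same_last_name(a: Person, b: Person) -> bool:
    return a.last_name.lower() == b.last_name.lower()


def same_full_name(a: Person, b: Person) -> bool:
    return same_last_name(a, b) and a.first_name.lower() == b.first_name.lower()


def same_dob(a: Person, b: Person) -> bool:
    return bool(a.date_of_birth) and a.date_of_birth == b.date_of_birth


def same_phone(a: Person, b: Person) -> bool:
    return bool(a.phone) and a.phone == b.phone


# Layers are tried in order, strictest first; all rules in a layer must hold.
LAYERS: List[Tuple[str, List[Rule]]] = [
    ("name+dob+phone", [same_full_name, same_dob, same_phone]),
    ("name+dob", [same_full_name, same_dob]),
    ("last_name+phone", [same_last_name, same_phone]),
]


def first_matching_layer(a: Person, b: Person, layers=LAYERS) -> Optional[str]:
    """Return the label of the first layer whose rules all match, else None."""
    for label, rules in layers:
        if all(rule(a, b) for rule in rules):
            return label
    return None


a = Person("Robert", "Smith", "1980-01-31", "045551234")
b = Person("robert", "smith", "1980-01-31", "")
print(first_matching_layer(a, b))  # name+dob
```

Recording which layer produced a match keeps weaker matches visible, so they can be reviewed before records are merged.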
4. The things we did and how
Logic Layers
4. The things we did and how
We used Dataflux to
• Programmatically Validate and Augment the Data
  • Validate against external datasets
    • NZ Post PAF
    • LINZ Data
    • Potentially
      • Births, Deaths and Marriages data
      • External Customer Lists
  • Can't find valid Phone number dataset
5. Monitoring the Results
• Typical attributes to measure data quality
  • Accuracy: Are targets defined to measure against?
  • Correctness: Requires something to look up
  • Data Age: Data Quality degrades over time, is that acceptable?
  • Completeness: What are the business rules that define what is acceptable?
  • Relevance: Have you documented how it is used?
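As an illustration of turning one of these attributes into a number, the Python sketch below measures completeness for a single field. It assumes a hypothetical business rule that empty values and the marker "N/A" count as missing; the function name, marker set and sample records are illustrative only.

```python
# Hypothetical business rule: these values (case-insensitive) count as missing.
MISSING_MARKERS = {"", "n/a"}


def completeness(records, field, missing_markers=MISSING_MARKERS) -> float:
    """Return the percentage of records where `field` holds a usable value."""
    if not records:
        return 0.0
    populated = 0
    for record in records:
        value = record.get(field)
        if value is not None and str(value).strip().lower() not in missing_markers:
            populated += 1
    return 100.0 * populated / len(records)


# Hypothetical sample records, not taken from the case study data.
customers = [
    {"phone": "045551234"},
    {"phone": ""},
    {"phone": "N/A"},
    {"phone": None},
]
print(completeness(customers, "phone"))  # 25.0
```

Keeping the missing-value rule as a parameter lets each business owner define what "acceptable" means for their own data.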
5. Monitoring the Results
• Give the business owners feedback that tells them:
  • If their Data Quality is getting better or worse
  • Who is the business owner who can impact the data quality
  • What do they need to change
• Encourage the business owners to improve the quality of the data
  • Ideally programmatically update the data for them
  • Or use centers of excellence to update data (e.g. Call Centers for Phone numbers)
  • Or provide the business a recommended process to update it
• Make people accountable for bad data quality!
5. Monitoring the Results
Customer Records
Record Type   Count       Percentage
Duplicates    1,037,964   56.85%
Master          787,673   43.15%
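The percentages in the Customer Records table follow directly from the two counts. A short Python check, using only the figures shown in the table:

```python
# Counts copied from the Customer Records table.
duplicates = 1_037_964
masters = 787_673
total = duplicates + masters

duplicate_pct = round(100 * duplicates / total, 2)
master_pct = round(100 * masters / total, 2)

print(total, duplicate_pct, master_pct)  # 1825637 56.85 43.15
```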
Data Quality is not a project, it is a never-ending process
The shameless plug!
• www.optimalBI.com Delivering Actionable Insight
• www.saasInct.com PreBuilt SAS Portlets
• blog.saasInct.com Ramblings about SAS