Data Cleanup: Unlock the potential at a corporate scale.
![Page 1: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/1.jpg)
Data Cleanup: Unlock the potential at a corporate scale
![Page 2: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/2.jpg)
Thibault Dambrine
• IT professional for 25 years
  – Network Designer
  – ETL Data Warehouse Analyst
  – Interface Specialist
  – ERP Developer
• Data Quality Experience:
  – Tasked to work on the pre-conversion data cleanup project during the Shell JDE to SAP transition
![Page 3: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/3.jpg)
Introduction
• Premise: “What would a Company-wide Data Quality Initiative look like?”
• Base:
  – Experience setting up a data cleanup team, prior to a JDE to SAP data conversion
  – Realizing the potential for increasing the data value within the corporation
![Page 4: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/4.jpg)
Defining Data Quality
• Intrinsic Data Quality:
  – Accuracy, Completeness, Uniqueness
  – Reliability, Security and Accessibility
• Contextual Data Quality:
  – Timeliness, Relevance
  – Inter-operability, consistency of identifiers
• Accessibility and Representational Data Quality
  – Ease of understanding
  – Consistency of identifiers
  – Consistency in structures
![Page 5: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/5.jpg)
Quantifying the cost of Quality: The 1-10-100 Rule - Additional cost = Less Competitive Business
[Figure: The 1-10-100 Quality Cost Rule. Relative cost ($$$): Prevention Cost = 1, Correction Cost = 10, Failure Cost = 100]
![Page 6: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/6.jpg)
The 1-10-100 Rule: A Data Quality Example
The 1-10-100 Quality Rule applied to mailing data. It costs:
• $1.00 to verify data at data entry time
• $10.00 to clean the data after the fact
• $100.00 to mop up errors caused by bad data
  – Packages mailed to the wrong address
  – Lost revenue
  – Lost customers
  – Bad (sloppy) reputation
  – Additional carrying cost for bad data
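To make the rule concrete, here is a small worked calculation. It is a sketch only: the per-record costs come from the slide, while the list size and the error rate are made-up assumptions, and it simplifies by applying each cost to the bad records alone.

```python
# Per-record costs from the 1-10-100 rule on the slide.
COST_PER_RECORD = {"prevention": 1.00, "correction": 10.00, "failure": 100.00}


def cost_of_bad_records(bad_records: int) -> dict[str, float]:
    """Cost of dealing with the same bad records at each stage."""
    return {stage: bad_records * unit for stage, unit in COST_PER_RECORD.items()}


records = 10_000     # hypothetical mailing list size (assumption)
error_rate = 0.05    # hypothetical share of bad addresses (assumption)
bad = round(records * error_rate)

costs = cost_of_bad_records(bad)
for stage, cost in costs.items():
    print(f"{stage:<10} ${cost:>10,.2f}")
```

With these assumed numbers, the same 500 bad addresses cost ten times more to correct later than to catch at entry, and a hundred times more once they cause failures.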
![Page 7: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/7.jpg)
Data Quality: The Up Side!
Trust
• Consistent data inquiry results build confidence in information systems
• Tractability across Business and IT domains
• Consistent data identifiers
  – Promote internal cross-department reporting
  – Consistency
  – Confidence in results
Productivity
• Removing redundant or near-redundant data
  – Maximizes re-use of data
  – Reduces the amount of data being processed
  – Reduces errors
Reliability
Consistently good quality data is data you can count on!
![Page 8: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/8.jpg)
A Data Cleanup Initiative
• Where to start?
• Who will enforce such quality initiatives?
• How will the data quality be maintained on an ongoing basis?
![Page 9: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/9.jpg)
DQB Task 1: Identify Sponsor and Data Quality Boss
To identify a sponsor:
• Communicate a clear understanding of the cost of bad data
• Use the 1-10-100 rule
• Initiative has to be backed with
  – Money
  – Authority
  – Responsibility
![Page 10: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/10.jpg)
Identify Data Quality Boss – DQB
• Must be knowledgeable on data quality
• Must be knowledgeable on the Business
• Will be responsible for data quality
• Will have authority to make changes
Note: Responsibility without authority will not work
![Page 11: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/11.jpg)
DQB Task 2: Identify Data Sets
• Identify/inventory high-level data sets, e.g.
  – CMDB
  – ERP
• Master Data, e.g.
  – Customer Master
  – Item Master
• Transaction Data, e.g.
  – POs & Invoices
  – Inventory movements
• Assign data sets to departments, with potential lists of Data Owners
• Note: The final data owners may not be the ones initially penciled in at this stage
![Page 12: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/12.jpg)
DQB Task 3: Identify Business Side Data Owners
• Data Owners will effectively be the local, more granular, Data Quality Bosses. Again, they will need to
  – Be responsible for the data at their level
  – Have authority to request changes at their level
  – Have bottom-up knowledge of the data, understand what “should be there”
• Set up meetings with every department, in line with the Data Sets identified, with the aim of coming up with a set of Data Owners
  – Have a presentation ready
  – Look for individuals who have been in the Business for a long time, who are well respected, who understand the data and its dependencies, and who know who to talk to, to get answers, from the bottom up
![Page 13: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/13.jpg)
DQB Task 4: Request from Data Owners the “Data Quality Specification” or DQS
• DQS is a document that spells out the data quality rules, e.g.
  – No duplicates or near-duplicates
  – Data older than x years should be purged or archived
  – Data dependencies, such as no detail without a header or no invoice without a PO
  – Consistency, e.g. data format
  – Quality audit, e.g. postal code matches address
  – More…
• Note: Some rules will apply in all DQS documents
• There is value in sharing, reviewing and updating the DQS over time
• Data quality issues are not always apparent until a first cut of data is cleaned up
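Two of the DQS rule types above (near-duplicates and format consistency) can be sketched as small checks. This is an illustration under stated assumptions, not a prescribed method: the customer records, the field names, the crude name-matching key and the "A1A 1A1" postal code format are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical customer master records (field names are assumptions).
customers = [
    {"id": 1, "name": "Acme Corp.", "postal_code": "T2P 1J9"},
    {"id": 2, "name": "ACME Corp", "postal_code": "T2P1J9"},
    {"id": 3, "name": "Globex", "postal_code": "12345"},
]


def name_key(name: str) -> str:
    """Crude near-duplicate key: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def near_duplicates(rows):
    """Group record ids whose names collapse to the same key."""
    groups = defaultdict(list)
    for row in rows:
        groups[name_key(row["name"])].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Assumed consistency rule: postal codes are written as "A1A 1A1".
POSTAL_FORMAT = re.compile(r"^[A-Z]\d[A-Z] \d[A-Z]\d$")


def bad_postal_format(rows):
    """Return ids of records whose postal code breaks the assumed format."""
    return [row["id"] for row in rows if not POSTAL_FORMAT.match(row["postal_code"])]


print(near_duplicates(customers))
print(bad_postal_format(customers))
```

A real DQS would define its own matching and format rules; the point is only that each written rule can become a check that returns the offending records.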
![Page 14: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/14.jpg)
DQS (part of Task 4): Also look for:
• Data Islands
  – Lack of consistent identifiers inhibits a single view of the big picture
• Data Opportunity
  – Could correlated data sets be more useful to the Business?
• Data Surprises
  – Misplaced data
  – Information buried in free-form fields
![Page 15: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/15.jpg)
DQB Task 5: Build IT Data Quality Team
• Data Quality
  – Cannot be a “side job” or a part-time task
  – Must be staffed with individuals who understand data. The best candidates:
    • Are proficient in SQL and data extract techniques
    • Understand ETL tools and techniques
    • Are detail-oriented
    • Experience: Data Warehouse staff are a good fishing ground for such individuals
![Page 16: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/16.jpg)
Mid-Presentation Recap: All the Ingredients are now in place
The real work can start!
1. Name Data Quality Sponsor & Data Quality Boss (DQB)
2. Identify Data Sets
3. Data Quality Owners
4. Data Quality Specifications (the DQ Roadmaps)
5. IT Data Cleansing Team
![Page 17: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/17.jpg)
Introducing: the Data Quality Cycle
• We now have
  – A sponsor
  – Identified data sets and data owners
    • They have produced Data Quality Specifications
  – An IT Team ready to work on the first Data Quality measurements, based on the DQS
• Next step: Initiate the cleanup
  – Not a single iteration but one that will be repeated in a cyclic fashion
![Page 18: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/18.jpg)
The Data Quality Cycle: Simple View
[Figure: a cycle of Analyze, Improve and Monitor, centered on DATA]
![Page 19: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/19.jpg)
Data Quality Cycle - Corporate Version
Step 1: Identify Bad Data
Step 2: Data Cleansing
Step 3: Measure Progress
Step 4: Data Hygiene
– Schedule Cleanup/Reviews
– Ensure progress visible
Step 5: Sharpen the Saw
[Figure labels: Analyze Data; Improve Data; Monitor Progress; Formalize Schedule, Make Progress VISIBLE; Continuous Improvement]
![Page 20: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/20.jpg)
Step 1: Identify Bad Data
Bad Data Definition: Does not adhere to DQS
• Coordinate meetings to translate DQS documents into a suite of repeatable data cleansing procedures
• It is very important that these procedures be repeatable and schedulable on a regular basis
• Initial Focus: Identify Bad Data
  – Bad data (does not adhere to DQS)
  – Inconsistent data
  – Old data
  – Note: DQS will spell out rules for “old” and “inconsistent”
• Ensure results are reported in a format readable by Management at executive level. This initiative has to be VISIBLE
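A repeatable identification procedure can be as simple as a named set of stored queries that are re-run on a schedule. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, sample rows and cutoff date are hypothetical, chosen to mirror the "no invoice without a PO" and "old data" rules.

```python
import sqlite3

# In-memory database with hypothetical tables and sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase_orders (po_id INTEGER PRIMARY KEY);
    CREATE TABLE invoices (
        invoice_id INTEGER PRIMARY KEY,
        po_id INTEGER,
        invoice_date TEXT
    );
    INSERT INTO purchase_orders VALUES (100), (101);
    INSERT INTO invoices VALUES
        (1, 100, '2013-05-01'),
        (2, 999, '2013-06-15'),
        (3, 101, '2001-01-10');
""")

# Each DQS rule becomes one named, stored, re-runnable query.
CHECKS = {
    "invoice_without_po": """
        SELECT i.invoice_id FROM invoices i
        LEFT JOIN purchase_orders p ON p.po_id = i.po_id
        WHERE p.po_id IS NULL
        ORDER BY i.invoice_id
    """,
    "old_invoice": """
        SELECT invoice_id FROM invoices
        WHERE invoice_date < :cutoff
        ORDER BY invoice_id
    """,
}


def run_checks(connection, cutoff):
    """Run every check and return the offending invoice ids per rule."""
    results = {}
    for name, sql in CHECKS.items():
        params = {"cutoff": cutoff} if ":cutoff" in sql else {}
        results[name] = [row[0] for row in connection.execute(sql, params)]
    return results


findings = run_checks(conn, cutoff="2007-01-01")
print(findings)
```

Because the checks live in one place and take the cutoff as a parameter, the same suite can be scheduled and its output compared from run to run.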
![Page 21: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/21.jpg)
Step 2: Data Cleansing
• Data Cleansing can be done in two ways:
  1) Automated, IT-based cleanup
  2) Business-based, manual cleanup
• Once the bad data is identified, determine who must do what
  – Business-based, manual cleanup
    • Appropriate for more subtle tasks, e.g. to determine which of two duplicates identified should be kept. These tasks may require additional research, phone calls etc.
  – Automated, IT-based cleanup
    • Good for simpler tasks, e.g. making telephone number formats consistent
    • Can also be sub-contracted to specialized data quality companies
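As an example of the simpler, automatable kind of task, telephone numbers can be brought to one format with a few lines of code. This sketch assumes 10-digit North American numbers and a target format of NNN-NNN-NNNN; both are assumptions for illustration. Anything that does not fit is returned as None so it can be routed to manual, Business-based review.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Return the number as 'NNN-NNN-NNNN', or None if it cannot be normalized."""
    digits = re.sub(r"\D", "", raw)          # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop a leading country code
    if len(digits) != 10:
        return None                          # leave for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


for raw in ["(403) 555-0199", "+1 403.555.0199", "555-0199"]:
    print(raw, "->", normalize_phone(raw))
```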
![Page 22: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/22.jpg)
IT-based Data Cleansing and Outsource Considerations
• Data cleansing may take valuable time that the Business does not have
  – Data Cleanup effort may suffer as a result
• Not all data cleansing is a simple SQL statement
• Not all data is highly confidential
When considering data cleansing tasks, look at all possible options
• Outsourcing some data cleansing tasks may be more economical than doing it all in-house
![Page 23: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/23.jpg)
Step 3: Measure Progress
• All programs and procedures written with the aim of identifying data quality issues should
  – Be stored, like any other programming assets
  – Be repeatable and be schedulable
  – Provide aggregate measures to describe the data cleanup status, e.g.
    • X duplicates
    • Y old records
    • Z invoices without PO
  – Progress
    • Has to be measured in a published dashboard
    • Has to be visible by the entire organization to provide a sense of value
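The aggregate measures above (X duplicates, Y old records, Z invoices without PO) amount to counting the findings of each rule per run. A minimal sketch follows; the rule names, the sample findings and the snapshot layout are hypothetical.

```python
from datetime import date


def dashboard_snapshot(findings: dict[str, list], run_date: date) -> dict:
    """Collapse per-rule lists of offending records into counts for a dashboard."""
    counts = {rule: len(items) for rule, items in findings.items()}
    return {
        "run_date": run_date.isoformat(),
        "counts": counts,
        "total": sum(counts.values()),
    }


# Hypothetical output of one scheduled run of the identification procedures.
sample_findings = {
    "duplicates": [[1, 2]],
    "old_records": [3, 7, 9],
    "invoices_without_po": [2],
}

snapshot = dashboard_snapshot(sample_findings, date(2013, 6, 30))
print(snapshot)
```

Storing one such snapshot per run gives the published dashboard a history to plot.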
![Page 24: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/24.jpg)
Step 4: Data Hygiene
* Schedule the Cleanup/Review Tasks
* Ensure results are visible
• Bad data is created EVERY DAY
• Data quality is an ongoing effort
• Establish and publish a schedule, as part of the dashboard
• Ensure there is visibility and accountability, so that the levels of bad data
  – Are going down with time
  – Or are kept at a minimal level
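The two acceptable outcomes above (levels going down, or kept at a minimal level) can be checked automatically from the history of scheduled runs. The sketch below is one possible reading; the function name, the status labels and the threshold are assumptions, and a real dashboard would likely look at more than the last two runs.

```python
def hygiene_status(history: list[int], minimal_level: int) -> str:
    """Classify the latest bad-record count.

    history: bad-record counts from successive scheduled runs, oldest first.
    minimal_level: the count the organization accepts as "minimal".
    """
    latest = history[-1]
    if latest <= minimal_level:
        return "at minimal level"
    if len(history) >= 2 and latest < history[-2]:
        return "going down"
    return "needs attention"


print(hygiene_status([120, 80, 45], minimal_level=10))
print(hygiene_status([12, 9, 8], minimal_level=10))
print(hygiene_status([40, 40, 55], minimal_level=10))
```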
![Page 25: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/25.jpg)
Step 5: Sharpen the Saw
Once the data cleanup cycle is established
• Review results
• Review DQS documents periodically (set up a schedule)
• Get Business input
  – Improve process
  – Give input on improvements to be made
• Ask the Business to come up with performance improvement measures born from the Data Quality initiative
![Page 26: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/26.jpg)
Conclusion
Two sets of Five activities best define the Data Quality
The Foundation Setup
1. Identify Data Quality Sponsor & Data Quality Boss (DQB)
2. Identify Data Sets
3. Identify Business Side Data Owners
4. Define Data Quality Specifications (DQS) - the DQ Roadmaps
5. Appoint IT Data Cleansing Team
The Data Quality Cycle – Ongoing
6. Identify Bad Data
7. Initiate Data Cleansing
8. Measure Progress
9. Initiate Data Hygiene, Data Cleanup Cycle
  – Schedule Cleanup/Reviews
  – Ensure progress visible
10. Sharpen the Saw
![Page 27: Data Cleanup: Unlock the potential at a corporate scale.](https://reader035.fdocuments.in/reader035/viewer/2022062712/56649c925503460f9494df3e/html5/thumbnails/27.jpg)
Links
• How to improve Data Quality
  http://www.informit.com/articles/article.aspx?p=399325&seqNum=3
• Predefined data quality rule definitions
  http://pic.dhe.ibm.com/infocenter/iisinfsv/v9r1/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ia.quality.doc%2Ftopics%2Fpdr_predef.html
• Creating Effective Business Rules: Interview with Graham Witt
  http://dataqualitypro.com/data-quality-pro-blog/how-to-create-effective-business-rules-graham-witt
• Gartner Magic Quadrant on Data Quality Tools – “Demand for data quality tools remains strong”
  http://www.citia.co.uk/content/files/50_161-377.pdf