Cleanliness is next to Godliness
Uploaded by jonathan-levin
Deduplicating Your Customer Data
Parts of the talk
Talking about Data Quality
Techniques for Deduplication
Processing, Timing and Mindset
Part 1 Part 2 Part 3
Timeline
Once upon a time...
Age of information
Large amounts of data inputted by humans
Humans make mistakes...
Information is a significant raw material for
businesses around the world.
Making data-based decisions: wrong information leads to wrong decisions
Information as products: bad and unimpressive products
Information for logistics: company may shut down
Gathering Data from Humans
Paper forms
• Spelling mistakes
• Unclear questions
• Bare minimum information
• OCR
Web forms
• Bypassing filters
Tainting Existing Data
Changes in procedures
• Didn’t update older data
• Different data structures
• Different ways of handling data
Importing sources of (bad) data
Some Industry Jargon
Single View of Customer
• Marketing Campaigns
Single Version of the Truth
• Strategy
Getting Correct Reports
Consider this
You start a direct mail marketing campaign
And this happens...
Dear Mr ----- O’Brien,
We are delighted to inform you that we have an amazing offer specifically for you..
Avoiding Embarrassing Mistakes
• Marketing/PR
• Accounting
• Shipping
• Strategy
How much is it worth?
• 30% ROI (big consultancy)
• 10-25% Loss of revenue for bad data quality
• Competitive advantage
• Avoid going out of business
MFI Group
Founded 1964
Upgraded ERP systems early 2000s
Due to issues with data quality in 2004
• £46m in lost sales, £16m extra deliveries + technical costs and £20m for the actual system.
Administration 2008
(Comeback 2010)
Recap
Data Quality is a big subject
Avoid embarrassing mistakes
Keep company running efficiently
Good for reports
What Deduplication is used for
Increasing data quality
Compressing data
Pre-stage data cleansing needed
Matching
Techniques
• Address
• Name
• Fuzzy
• DOB
Business Rules
Quality Matching
Ask the Data
Address Matching
Databases
• Royal Mail (PAF)
• Council Address Data
• Do Your Own
Fill in missing parts
House Number, Building Number, House Name,
Flat Number, Company Name, Street, Locality,
Town, City, County, Country and Postcode
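Once the address parts are filled in, two records can be compared on a normalised key. The Python sketch below is one possible way to do that, assuming UK-style postcodes (the inward part is always the last three characters). The function names and the choice of house, street and postcode as key parts are illustrative assumptions, not something prescribed by PAF.

```python
import re


def normalise_postcode(raw: str) -> str:
    """Upper-case a UK-style postcode with one space before the last three characters."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if len(compact) < 5:
        return compact  # too short to split safely; leave as is
    return f"{compact[:-3]} {compact[-3:]}"


def clean_part(part: str) -> str:
    """Drop punctuation, collapse whitespace and upper-case one address part."""
    without_punctuation = re.sub(r"[^\w\s]", "", part)
    return re.sub(r"\s+", " ", without_punctuation).strip().upper()


def address_key(house: str, street: str, postcode: str) -> str:
    """Build a comparison key from the parts most likely to identify a property."""
    return "|".join([clean_part(house), clean_part(street), normalise_postcode(postcode)])


print(address_key("10", "Downing  St.", "sw1a1aa"))  # 10|DOWNING ST|SW1A 1AA
```

Records that share a key can then be given the same address_id, which is what the SQL join on addresses relies on.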
Name Matching
Name, Full name
Forename, Firstname,
Lastname, Surname
Initial
Middle name(s)
Title, Suffix
Qualification
Lord James Jonah William Smith 3rd
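A full name like the one above has to be split into its parts before the parts can be compared. The following Python sketch shows one simple approach; the TITLES and SUFFIXES lists are short illustrative assumptions and a real system would need far longer, locale-aware lists.

```python
# Illustrative lookup lists (assumptions, deliberately incomplete).
TITLES = {"MR", "MRS", "MS", "MISS", "DR", "LORD", "LADY", "SIR"}
SUFFIXES = {"JR", "SR", "II", "III", "2ND", "3RD"}


def parse_name(full_name: str) -> dict:
    """Split a full name into title, forename, middle name(s), surname and suffix."""
    tokens = full_name.replace(".", "").split()
    title = tokens.pop(0) if tokens and tokens[0].upper() in TITLES else ""
    suffix = tokens.pop() if tokens and tokens[-1].upper() in SUFFIXES else ""
    surname = tokens.pop() if tokens else ""
    forename = tokens.pop(0) if tokens else ""
    return {
        "title": title,
        "forename": forename,
        "middle": " ".join(tokens),  # whatever is left between forename and surname
        "surname": surname,
        "suffix": suffix,
    }


print(parse_name("Lord James Jonah William Smith 3rd"))
```

Multi-word surnames and names written surname-first are not handled here; those cases usually need source-specific rules.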
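SQL example
-- Pair customers at the same address whose names match.
SELECT c1.*, c2.*
FROM customers c1 INNER JOIN customers c2
ON c1.address_id = c2.address_id
WHERE c1.surname = c2.surname
AND c1.forename = c2.forename
-- Middle names must be equal, or exactly one of them must be empty.
AND (c1.middlename = c2.middlename
XOR (c1.middlename = '' XOR c2.middlename = ''))
-- Assumes a unique id column; skips self-matches and mirrored pairs.
AND c1.id < c2.id;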
Title Forename Middle Surname DOB
MR MARK MADANES 05/10/1963
MR MARK MADANES 04/10/1963
Title Forename Middle Surname DOB
MR CIARAN GERARD O’NEILL 26/07/1971
MR CIARAN M O’NEILL 26/07/1971
Title Forename Middle Surname DOB
MS JAN PHILMORE 15/10/1954
MR JAN PHILMORE 00/00/0000
Title Forename Middle Surname DOB
MR ALBERTO CARLOS 00/00/0000
MR ALBERT O CARLOS 00/00/0000
Fuzzy Matching
Levenshtein
select levenshtein('jonathan','jonathon') -> 1
Download from: http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552
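For readers without the database function installed, the distance can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic-programming Levenshtein algorithm, not the code behind the download link above; the function name simply mirrors the SQL call.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("jonathan", "jonathon"))  # 1
```

A small distance relative to the name length is a hint that two spellings may refer to the same person; the cut-off is a business decision.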
Fuzzy Matching
Soundex
select soundex('jonathan') -> J535
Metaphone
echo metaphone('jonathan') -> JN0N
Title Forename Middle Surname DOB
SAMUEL JOHNSTONE 00/00/0000
MR SAMUEL JOHNSTON 00/00/0000
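The JOHNSTONE / JOHNSTON pair above is the kind of match a phonetic code catches. The sketch below implements the classic four-character American Soundex in Python to show why; database implementations can differ in small details, so treat it as an illustration and not as a drop-in replacement for the SQL function.

```python
# Letter groups of classic Soundex; vowels, H, W and Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the first letter plus three digits, padded with zeros."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    last_code = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(c, "")
        if code and code != last_code:
            encoded += code
        last_code = code  # a vowel resets the run, so repeats count again
    return (encoded + "000")[:4]


print(soundex("jonathan"))   # J535
print(soundex("JOHNSTONE"))  # J523
print(soundex("JOHNSTON"))   # J523
```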
Business Rules
Certain Level of Correctness
Generic Rules and Source Specific Rules
Business Rules
Example
• Middle name: Adam Smith vs. Adam E Smith
• Title: Miss vs. Ms vs. Lady
• Initial: A Smith vs. Adam Smith (same address)
• Surnames: O`Brien vs. O’Brien vs. O\’Brien
• More Surname: McDonald vs. Mc Donald vs. Mac Donald
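The surname variants above can be collapsed to a single comparison key before matching. This Python sketch is a deliberately crude assumption about what such a rule might look like: it strips everything that is not a letter and folds a leading "Mac" into "Mc". The key is only meant for comparing records, never for display.

```python
import re


def normalise_surname(surname: str) -> str:
    """Return an upper-case comparison key with punctuation and spaces removed."""
    # Drops apostrophes, backticks, backslashes, spaces and similar noise.
    key = "".join(ch for ch in surname.upper() if ch.isalpha())
    # Fold "MAC" into "MC" when a consonant follows (Mac Donald -> MCDONALD).
    return re.sub(r"^MAC(?=[^AEIOU])", "MC", key)


for variant in ["O`Brien", "O\u2019Brien", "O\\\u2019Brien"]:
    print(normalise_surname(variant))  # OBRIEN each time

for variant in ["McDonald", "Mc Donald", "Mac Donald"]:
    print(normalise_surname(variant))  # MCDONALD each time
```

The Mac/Mc fold will also merge some genuinely different surnames, so it is the kind of rule that belongs in a source-specific rule set with a stated level of correctness.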
Things to Watch Out for
Same father/son or mother/daughter names
Twins with same DOB
Initial for a forename
Mixing of forename with middle name
Changing surname after marriage
Quality Matching
Analyze data sources
How recent the data is
Ask the Data
Name popularity
Number of sources
• Example: 4 sources vs. 1 source say this spelling is right
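Counting how many sources agree on a spelling is simple to sketch. The Python example below assumes each source contributes one value for the field; the source names and the function name are made up for illustration.

```python
from collections import Counter


def most_supported_value(values_by_source: dict) -> tuple:
    """Return the value backed by the most sources, with its vote count."""
    counts = Counter(
        value.strip().upper()
        for value in values_by_source.values()
        if value and value.strip()  # sources with no value do not vote
    )
    if not counts:
        return ("", 0)
    return counts.most_common(1)[0]


# Hypothetical sources: four say JOHNSTONE, one says JOHNSTON.
surname_by_source = {
    "web_form": "Johnstone",
    "call_centre": "JOHNSTONE",
    "paper_form": "Johnstone ",
    "import_2009": "Johnstone",
    "partner_feed": "Johnston",
}
print(most_supported_value(surname_by_source))  # ('JOHNSTONE', 4)
```

Weighting each vote by how recent or how trusted the source is would be a natural extension, in line with the quality-matching points above.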
Consider Using a Democratic System
Opposite of a hierarchical (if-then-else) system
Useful if the order of the rules is problematic
Business Rules + Asking the Data
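One way to picture a democratic system is to let every rule cast a weighted vote instead of chaining the rules with if-then-else. The Python sketch below is an assumption about how that could look: the weights, the field names and the 00/00/0000 "unknown DOB" convention (taken from the example tables earlier) are illustrative, and rules abstain when data is missing.

```python
UNKNOWN_DOB = "00/00/0000"


def vote_equal(x: str, y: str) -> int:
    """+1 if both present and equal, -1 if both present and different, 0 if either is missing."""
    if not x or not y:
        return 0
    return 1 if x.upper() == y.upper() else -1


def vote_forename(x: str, y: str) -> int:
    """Like vote_equal, but an initial agrees with any name starting with it."""
    if not x or not y:
        return 0
    x, y = x.upper(), y.upper()
    if x == y:
        return 1
    if len(x) == 1 or len(y) == 1:
        return 1 if x[0] == y[0] else -1
    return -1


def vote_dob(x: str, y: str) -> int:
    """Abstain when either date of birth is the unknown placeholder."""
    if UNKNOWN_DOB in (x, y):
        return 0
    return vote_equal(x, y)


# (weight, field, voter) - the weights are illustrative assumptions.
RULES = [
    (3.0, "surname", vote_equal),
    (2.0, "forename", vote_forename),
    (1.0, "middle", vote_equal),
    (0.5, "title", vote_equal),
    (2.0, "dob", vote_dob),
]


def duplicate_score(a: dict, b: dict) -> float:
    """Weighted average of the votes cast, from -1 (different) to +1 (same)."""
    total = cast = 0.0
    for weight, field, voter in RULES:
        vote = voter(a.get(field, ""), b.get(field, ""))
        if vote != 0:
            total += weight * vote
            cast += weight
    return total / cast if cast else 0.0


first = {"title": "MR", "forename": "CIARAN", "middle": "GERARD", "surname": "O'NEILL", "dob": "26/07/1971"}
second = {"title": "MR", "forename": "CIARAN", "middle": "M", "surname": "O'NEILL", "dob": "26/07/1971"}
print(round(duplicate_score(first, second), 2))
```

Because no rule can veto the others, the order of the rules no longer matters; the threshold above which a pair counts as a duplicate still has to be chosen and tested against real data.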
Recap
Find address
Find duplicates
Try to make a decision for deduplication
• Business Rules
• Ask the Data
Processing
CPU/Disk/Memory bound
Sequential or parallel
Processing Data
Extra data
Result table
Temp data
Timing
On insert
A few minutes after insert (events)
Scheduled tasks
Pre-fetch
When user asks for it
New Data User Request
Points in Time
Using Your Team
DBAs
Database Developers/ETL experts
Data Analysts
Developers
Testers
Mindset
Never 100%
Best Effort
Pareto Principle
Continuous Improvement
Cost
Benefits
Final Recap
Continuous Improvements
Which duplicate is the correct one?
Combine business rules + ask the data
Questions & Answers
Contact Information:
MySQL-related questions about presentation?
Non-profit or Medical?