Cleanliness is next to Godliness


Page 1: Cleanliness is next to Godliness

Cleanliness is next to Godliness

Deduplicating Your Customer Data

Page 2: Cleanliness is next to Godliness

Parts of the talk

Part 1: Talking about Data Quality

Part 2: Techniques for Deduplication

Part 3: Processing, Timing and Mindset

Page 3: Cleanliness is next to Godliness

Once upon a time...

Age of information

Page 4: Cleanliness is next to Godliness

Large amounts of data inputted by humans

Page 5: Cleanliness is next to Godliness

Humans make mistakes...

Page 6: Cleanliness is next to Godliness

Information is a significant raw material for businesses around the world.

Page 7: Cleanliness is next to Godliness

Making data-based decisions: wrong information leads to wrong decisions

Information as products: bad and unimpressive products

Information for logistics: company may shut down

Page 8: Cleanliness is next to Godliness

Gathering Data from Humans

Paper forms

• Spelling mistakes

• Unclear questions

• Bare minimum information

• OCR

Web forms

• Bypassing filters

Page 9: Cleanliness is next to Godliness

Tainting Existing Data

Changes in procedures

• Didn’t update older data

• Different data structures

• Different ways of handling data

Importing sources of (bad) data

Page 10: Cleanliness is next to Godliness

Some Industry Jargon

Single View of Customer

• Marketing Campaigns

Single Version of the Truth

• Strategy

Getting Correct Reports

Page 11: Cleanliness is next to Godliness

Consider this

You start a direct mail marketing campaign

And this happens...

Page 12: Cleanliness is next to Godliness

Dear Mr ----- O’Brien,

We are delighted to inform you that we have an amazing offer specifically for you...

Page 13: Cleanliness is next to Godliness

Avoiding Embarrassing Mistakes

• Marketing/PR

• Accounting

• Shipping

• Strategy

Page 14: Cleanliness is next to Godliness

How much is it worth?

• 30% ROI (big consultancy)

• 10-25% loss of revenue due to bad data quality

• Competitive advantage

• Avoid going out of business

Page 15: Cleanliness is next to Godliness

MFI Group

Founded 1964

Upgraded ERP systems in the early 2000s

Due to issues with data quality in 2004:

• £46m in lost sales, £16m in extra deliveries and technical costs, and £20m for the system itself.

Administration 2008

(Comeback 2010)

Page 16: Cleanliness is next to Godliness

Recap

Data Quality is a big subject

Avoid embarrassing mistakes

Keep company running efficiently

Good for reports

Page 17: Cleanliness is next to Godliness

What Deduplication is used for

Increasing data quality

Compressing data

Pre-stage data cleansing needed

Page 18: Cleanliness is next to Godliness

Matching

Techniques

• Address

• Name

• Fuzzy

• DOB

Business Rules

Quality Matching

Ask the Data

Page 19: Cleanliness is next to Godliness

Address Matching

Databases

• Royal Mail (PAF)

• Council Address Data

• Do Your Own

Fill in missing parts

House Number, Building Number, House Name, Flat Number, Company Name, Street, Locality, Town, City, County, Country and Postcode

Page 20: Cleanliness is next to Godliness

Name Matching

Name, Full name

Forename, Firstname,

Lastname, Surname

Initial

Middle name(s)

Title, Suffix

Qualification

Lord James Jonah William Smith 3rd

Page 21: Cleanliness is next to Godliness

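SQL example

-- Pair up customers at the same address whose surname and forename match.
-- Middle names must be equal, or exactly one of the two may be empty.
-- Note: as written, every row also pairs with itself and each pair appears twice.
SELECT c1.*, c2.*
FROM customers c1 INNER JOIN customers c2
ON c1.address_id = c2.address_id
WHERE c1.surname = c2.surname
AND c1.forename = c2.forename
AND (c1.middlename = c2.middlename
XOR (c1.middlename = '' XOR c2.middlename = ''));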

Page 22: Cleanliness is next to Godliness

Title Forename Middle Surname DOB

MR MARK MADANES 05/10/1963

MR MARK MADANES 04/10/1963

Page 23: Cleanliness is next to Godliness

Title Forename Middle Surname DOB

MR CIARAN GERARD O’NEILL 26/07/1971

MR CIARAN M O’NEILL 26/07/1971

Page 24: Cleanliness is next to Godliness

Title Forename Middle Surname DOB

MS JAN PHILMORE 15/10/1954

MR JAN PHILMORE 00/00/0000

Page 25: Cleanliness is next to Godliness

Title Forename Middle Surname DOB

MR ALBERTO CARLOS 00/00/0000

MR ALBERT O CARLOS 00/00/0000

Page 26: Cleanliness is next to Godliness

Fuzzy Matching

Levenshtein

select levenshtein('jonathan','jonathon') -> 1

Download from: http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552
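As an illustration of what the levenshtein call above computes, here is a minimal Python sketch of the edit distance (insertions, deletions and substitutions each cost 1). It is not the stored function from the download link, only a small stand-in for experimenting with name pairs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("jonathan", "jonathon"))  # prints 1
```

A small distance relative to the length of the name suggests a likely typo; the threshold to use is a business decision.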

Page 27: Cleanliness is next to Godliness

Fuzzy Matching

Soundex (MySQL)

select soundex('jonathan') -> J535

Metaphone (PHP)

echo metaphone('jonathan') -> JN0N
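To show how a phonetic key such as J535 comes about, here is a sketch of the common four-character American Soundex in Python. This is an assumption about the variant: MySQL's built-in SOUNDEX may differ in details such as output length, so treat the code as an illustration rather than a drop-in replacement.

```python
# Map each consonant to its Soundex digit; vowels, y, h and w have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return the four-character American Soundex key for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets last_code, so a repeat is coded again
    return (result + "000")[:4]  # pad with zeros, then cut to four characters


print(soundex("jonathan"))  # prints J535
```

Names that sound alike share a key, for example JOHNSTONE and JOHNSTON both map to J523, which makes the key usable as a cheap first filter before a stricter comparison.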

Page 28: Cleanliness is next to Godliness

Title Forename Middle Surname DOB

SAMUEL JOHNSTONE 00/00/0000

MR SAMUEL JOHNSTON 00/00/0000

Page 29: Cleanliness is next to Godliness

Business Rules

Certain Level of Correctness

Generic Rules and Source Specific Rules

Page 30: Cleanliness is next to Godliness

Business Rules

Example

• Middle name: Adam Smith vs. Adam E Smith

• Title: Miss vs. Ms vs. Lady

• Initial: A Smith vs. Adam Smith (same address)

• Surnames: O`Brien vs. O’Brien vs. O\’Brien

• More surnames: McDonald vs. Mc Donald vs. Mac Donald
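The surname rules above could be expressed as a normalisation step that produces a comparison key. The sketch below is hypothetical: the function name surname_key, the choice to fold "Mac" into "Mc", and the assumption of mixed-case input are all illustrative and would need to be agreed as business rules.

```python
import re

# Backtick, curly quotes or straight apostrophe, optionally escaped with a backslash.
APOSTROPHE_VARIANTS = re.compile(r"\\?[`’‘']")
# "Mc" or "Mac", optional spaces, then a capital letter (assumes mixed-case input).
MC_PREFIX = re.compile(r"^(?:Mac|Mc)\s*(?=[A-Z])")


def surname_key(surname: str) -> str:
    """Return a normalised key so spelling variants of a surname compare equal."""
    s = surname.strip()
    s = APOSTROPHE_VARIANTS.sub("'", s)  # O`Brien, O’Brien, O\’Brien -> O'Brien
    s = MC_PREFIX.sub("Mc", s)           # Mc Donald, Mac Donald -> McDonald
    return s.casefold()


for variant in ["O`Brien", "O’Brien", "O\\’Brien", "McDonald", "Mc Donald", "Mac Donald"]:
    print(variant, "->", surname_key(variant))
```

The key is only used for matching; the original spelling stays in the record so that the later decision step can still pick the correct one.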

Page 31: Cleanliness is next to Godliness

Things to Watch Out for

Same father/son or mother/daughter names

Twins with same DOB

Initial for a forename

Mixing of forename with middle name

Changing surname after marriage

Page 32: Cleanliness is next to Godliness

Quality Matching

Analyze data sources

How recent the data is

Page 33: Cleanliness is next to Godliness

Ask the Data

Name popularity

Number of sources

• Example: 4 sources vs. 1 source say this spelling is right
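The "4 sources vs. 1 source" idea can be sketched as a simple majority vote over the spellings reported by each source. The function name and the tie handling (return nothing and leave the decision to another rule) are assumptions made for this illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def most_supported_spelling(spellings: Iterable[str]) -> Optional[str]:
    """Return the spelling backed by the most sources, or None on a tie."""
    counts = Counter(s for s in spellings if s)  # ignore empty values
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner: let a business rule decide instead
    return ranked[0][0]


# One value per source for the same customer (hypothetical data).
reported = ["JOHNSTON", "JOHNSTON", "JOHNSTONE", "JOHNSTON", "JOHNSTON"]
print(most_supported_spelling(reported))  # prints JOHNSTON
```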

Page 34: Cleanliness is next to Godliness

Consider Using a Democratic System

Opposite of a hierarchical (if-then-else) system

Useful when the order of the rules is problematic

Business Rules + Asking the Data
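One way to picture a democratic system is to let every rule cast a weighted vote instead of chaining the rules in a fixed if-then-else order. The sketch below is only an illustration: the rules, the weights and the record layout (dictionaries with forename, middle, surname and dob keys) are hypothetical and not taken from the talk.

```python
from typing import Callable, Optional

Record = dict[str, str]
# A rule answers True (same person), False (different) or None (abstain).
Rule = Callable[[Record, Record], Optional[bool]]

UNKNOWN_DOB = "00/00/0000"


def same_field(field: str) -> Rule:
    """Build a rule that compares one field case-insensitively."""
    def rule(a: Record, b: Record) -> Optional[bool]:
        return a[field].casefold() == b[field].casefold()
    return rule


def compatible_middle(a: Record, b: Record) -> Optional[bool]:
    if not a["middle"] or not b["middle"]:
        return None  # a missing middle name says nothing either way
    return a["middle"][0].casefold() == b["middle"][0].casefold()


def same_dob(a: Record, b: Record) -> Optional[bool]:
    if UNKNOWN_DOB in (a["dob"], b["dob"]):
        return None  # unknown date of birth: abstain
    return a["dob"] == b["dob"]


# Arbitrary example weights; in practice these come from business rules.
RULES: list[tuple[Rule, int]] = [
    (same_field("surname"), 2),
    (same_field("forename"), 2),
    (compatible_middle, 1),
    (same_dob, 2),
]


def vote(a: Record, b: Record, rules: list[tuple[Rule, int]] = RULES) -> int:
    """Sum the weighted votes; a positive score favours 'duplicate'."""
    score = 0
    for rule, weight in rules:
        verdict = rule(a, b)
        if verdict is None:
            continue
        score += weight if verdict else -weight
    return score


a = {"forename": "CIARAN", "middle": "GERARD", "surname": "O'NEILL", "dob": "26/07/1971"}
b = {"forename": "CIARAN", "middle": "M", "surname": "O'NEILL", "dob": "26/07/1971"}
print(vote(a, b))  # prints 5
```

Because no rule can short-circuit the others, the order in which the rules are listed does not change the result.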

Page 35: Cleanliness is next to Godliness

Recap

Find address

Find duplicates

Try to make a decision for deduplication

• Business Rules

• Ask the Data

Page 36: Cleanliness is next to Godliness

Processing

CPU/Disk/Memory bound

Sequential or parallel

Page 37: Cleanliness is next to Godliness
Page 38: Cleanliness is next to Godliness

Processing Data

Extra data

Result table

Temp data

Page 39: Cleanliness is next to Godliness

Timing

On insert

A few minutes after insert (events)

Scheduled tasks

Pre-fetch

When user asks for it

Points in Time: from New Data to User Request

Page 40: Cleanliness is next to Godliness

Using Your Team

DBAs

Database Developers/ETL experts

Data Analysts

Developers

Testers

Page 41: Cleanliness is next to Godliness

Mindset

Never 100%

Best Effort

Pareto Principle

Continuous Improvement

Cost

Benefits

Page 42: Cleanliness is next to Godliness

Final Recap

Continuous Improvements

Which duplicate is the correct one?

Combine business rules + ask the data

Page 43: Cleanliness is next to Godliness

Questions & Answers

Page 44: Cleanliness is next to Godliness

Contact Information:

MySQL-related questions about the presentation?

Non-profit or Medical?

[email protected]