De-Duplication A not so simple problem Covers Appendix Part 5.
Uploaded by lara-errett
De-Duplication
A not so simple problem
Covers Appendix Part 5
False?
• False positives occur when a group of records is identified as duplicates but does NOT represent the same customer
• False negatives occur when actual redundant representations of the same customer are NOT identified
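The two error types can be made concrete by comparing the pairs a deduplication run flags against the pairs known to be the same customer. The following is a minimal sketch; the record keys and the function name pair_errors are invented for illustration and do not come from the slides.

```python
def pair_errors(predicted_pairs, true_pairs):
    """Return (false_positives, false_negatives) as sets of unordered key pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    actual = {frozenset(p) for p in true_pairs}
    false_positives = predicted - actual  # flagged, but NOT the same customer
    false_negatives = actual - predicted  # same customer, but NOT flagged
    return false_positives, false_negatives


# Hypothetical keys: the run flagged (1, 2) and (3, 4); the truth is (1, 2) and (5, 6).
fp, fn = pair_errors([(1, 2), (3, 4)], [(2, 1), (5, 6)])
```

Here (3, 4) would be a false positive and (5, 6) a false negative; pairs are treated as unordered, so (1, 2) and (2, 1) count as the same pair.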
• Customer Name – only personal names
• Postal Address – only United States address formats
• Tax ID – could be a personal National Insurance Number or another unique identifier
Identical
Would you argue that these are NOT duplicate customers?
Abbreviation
The abbreviation of first and middle names is a common challenge:
Does a matching Tax ID guarantee that a variation is a duplicate? What about when Tax ID is missing?
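One possible way to handle abbreviated first and middle names is to treat an initial as compatible with any full name that starts with the same letter. This is only a sketch under that assumption; the function names and sample names are hypothetical, and "compatible" is weaker than "duplicate", which is exactly why the Tax ID questions above still matter.

```python
def name_part_compatible(a, b):
    """True if two given-name parts agree, treating a single letter as an abbreviation."""
    a = a.strip(". ").lower()
    b = b.strip(". ").lower()
    if not a or not b:
        return True  # an empty part cannot contradict anything
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]  # initial versus full name: compare first letters only
    return a == b


def given_names_compatible(name_a, name_b):
    """Compare given names position by position; extra parts on one side are ignored."""
    for x, y in zip(name_a.split(), name_b.split()):
        if not name_part_compatible(x, y):
            return False
    return True
```

For example, given_names_compatible("John Quincy", "J. Q.") holds, while given_names_compatible("John", "James") does not. Two different people can still share initials, so a compatible name should be combined with other evidence before merging.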
Marriage
Marriages can be good for people but possibly bad for their data:
Did the hyphenated last name on Key 252 help overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?
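A hyphenated last name carries both the old and the new family name, which is why it can bridge records that otherwise share nothing. A minimal sketch of that idea, with hypothetical function names and surnames, is to split on hyphens and look for any shared part:

```python
def surname_parts(surname):
    """Split a surname on hyphens and spaces into a set of lower-cased parts."""
    return set(surname.lower().replace("-", " ").split())


def surnames_overlap(a, b):
    """True if the two surnames share at least one part."""
    return bool(surname_parts(a) & surname_parts(b))
```

With these definitions, surnames_overlap("Smith-Jones", "Smith") and surnames_overlap("Smith-Jones", "Jones") both hold, but surnames_overlap("Smith", "Jones") does not. Without the hyphenated record as a bridge, a plain name comparison has no way to link the maiden and married names, so an overlap is a hint to review rather than proof.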
False Positives
For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address?
For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?
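One cautious way to act on the questions above is to send ambiguous combinations of evidence to manual review instead of merging automatically. The rules below are assumptions chosen for illustration, not a policy stated in the slides; in particular, treating a Tax ID mismatch as decisive ignores the possibility of keying errors.

```python
def classify(name_similar, same_address, tax_a, tax_b):
    """Return 'likely duplicate', 'review' or 'distinct' for a pair of records."""
    if tax_a and tax_b:
        tax = "match" if tax_a == tax_b else "mismatch"
    else:
        tax = "unknown"  # at least one Tax ID is missing

    if tax == "mismatch":
        return "distinct"  # assumption: different Tax IDs mean different customers
    if tax == "match" and name_similar:
        # Same Tax ID and similar name, but a different address, still needs a human look.
        return "likely duplicate" if same_address else "review"
    if tax == "unknown" and name_similar and same_address:
        return "review"  # could be a duplicate, or relatives at the same address
    return "distinct"
```

Under these rules, the first situation above (matching Tax ID, similar name, different address) and the second (same address, similar name, missing Tax IDs) both come out as "review".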
Same Address
A common challenge is the same family name and the exact same postal address
What goes in Report Appendix?
• Discuss deduplication
  – What is your business strategy?
• Show via a flow chart how you would attempt deduplication
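As a starting point for the flow chart, a typical deduplication attempt can be sketched as four steps: normalise each record, group records into blocks so that only plausible pairs are compared, compare the pairs, and decide what to do with each. The field names, the blocking rule (last word of the name) and the decision rules below are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def normalize(record):
    """Step 1: lower-case text and strip punctuation that varies between sources."""
    return {
        "key": record["key"],
        "name": " ".join(record["name"].lower().replace(".", " ").split()),
        "address": " ".join(record["address"].lower().replace(",", " ").split()),
        "tax_id": (record.get("tax_id") or "").replace("-", "").strip(),
    }


def candidate_pairs(records):
    """Step 2: block on the last word of the name (assumed to be the surname)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"].split()[-1] if rec["name"] else ""].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


def decide(a, b):
    """Steps 3 and 4: compare a pair and choose an action."""
    if a["tax_id"] and a["tax_id"] == b["tax_id"]:
        return "merge"
    if a["address"] == b["address"]:
        return "review"
    return "keep"


def dedupe_report(raw_records):
    records = [normalize(r) for r in raw_records]
    return [(a["key"], b["key"], decide(a, b)) for a, b in candidate_pairs(records)]


# Invented sample data.
sample = [
    {"key": 1, "name": "John Smith", "address": "1 High St, Anytown", "tax_id": "123-45-6789"},
    {"key": 2, "name": "J. Smith", "address": "9 Low Rd, Othertown", "tax_id": "123456789"},
    {"key": 3, "name": "Jane Smith", "address": "1 High St Anytown", "tax_id": None},
    {"key": 4, "name": "Ann Jones", "address": "2 Mill Ln", "tax_id": None},
]
report = dedupe_report(sample)
# report == [(1, 2, "merge"), (1, 3, "review"), (2, 3, "keep")]
```

Each function corresponds to one box in a flow chart, and the "review" outcome is the branch where a person decides.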
Mailing List Management Functional Requirements
• Set out what the new system will do.
• You have some experience with this from CS22120 Group Project.
• An attempt to describe, logically, the functionality of the system.
• You need to describe it NOT build it.
Requirements
• Functional
  – What is it supposed to do?
• Non-functional requirements
  – Computer environment
  – Personnel
  – Web based
Some functions
• Set up required fields
• Add, Modify and Delete Fields
• Import initial list
  – Field matching
  – Excel, CSV programs
• Add, Modify and Delete Records
• Merge records from externally purchased files
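Field matching on import means mapping the column names in the incoming file onto the fields the system requires. A minimal sketch for a CSV source follows; the column names, the target field names and the sample row are invented for illustration.

```python
import csv
import io

# Hypothetical mapping: source column name -> required field name.
FIELD_MAP = {"Surname": "last_name", "Postcode": "post_code", "Company": "company"}


def import_rows(text, field_map):
    """Read CSV text and return records keyed by the system's own field names."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {target: (row.get(source) or "").strip() for source, target in field_map.items()}
        for row in reader
    ]


rows = import_rows("Surname,Postcode,Company\nSmith, AB1 2CD ,Acme Ltd\n", FIELD_MAP)
```

A source column that is missing from the file simply produces an empty value, which a real system would probably report to the user before completing the import.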
Mailing List Functionality cont’d
• Cleanse using the Postcode Address File (PAF)
  – Contains all addresses in the UK
  – Use to correct address from post code
  – Can add correct:
    • Street name
    • Post town
    • County
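Cleansing from the post code can be pictured as a lookup that overwrites the street, post town and county with the reference values. The sketch below uses a tiny invented dictionary as a stand-in for the real file, which is a licensed data product with its own format; the function names and the sample entry are hypothetical.

```python
# Toy stand-in for the reference file: post code -> canonical address parts (invented).
PAF_SAMPLE = {
    "AB1 2CD": {"street": "High Street", "post_town": "Anytown", "county": "Anyshire"},
}


def normalize_postcode(post_code):
    """Upper-case and put a single space before the final three characters."""
    compact = "".join(post_code.upper().split())
    return compact[:-3] + " " + compact[-3:] if len(compact) > 3 else compact


def cleanse(record, reference):
    """Return a copy of the record with address parts corrected from the reference."""
    entry = reference.get(normalize_postcode(record.get("post_code", "")))
    cleaned = dict(record)
    if entry is not None:
        cleaned.update(entry)
    return cleaned


fixed = cleanse({"name": "J Smith", "street": "Hgh Stret", "post_code": "ab12cd"}, PAF_SAMPLE)
```

A record whose post code is not found is returned unchanged, so it can be flagged for manual correction.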
Sorting
• Sort by
  – Post code
  – Geographic Areas
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – And combinations of above
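Sorting by a combination of fields can be expressed as sorting on a tuple of values, with a separate shuffle for the random option. This is a sketch with hypothetical field names and records; it applies one direction to all fields at once.

```python
import random


def sort_records(records, fields, descending=False):
    """Sort by several fields at once; earlier fields take priority."""
    return sorted(records, key=lambda r: tuple(r[f] for f in fields), reverse=descending)


def shuffle_records(records, seed=None):
    """Return the records in random order without modifying the input list."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


records = [
    {"post_code": "B", "turnover": 2},
    {"post_code": "A", "turnover": 5},
    {"post_code": "A", "turnover": 1},
]
by_area_then_turnover = sort_records(records, ["post_code", "turnover"])
```

Mixing directions (for example post code ascending, turnover descending) would need a different key function or successive stable sorts.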
Mailing List Functionality cont’d
• Select number of records to deliver, and maybe by
  – Post code
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – Add false “ghosts”
  – File formats
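Selection for delivery can be sketched as filter, limit, then append the ghost records. This assumes that "ghosts" means decoy entries added to a delivered list so that unauthorised reuse can be detected; the function name, field names and sample data are hypothetical.

```python
def select_for_delivery(records, count, ghosts=(), **criteria):
    """Return up to `count` records matching all criteria, followed by the ghost records."""
    matching = [r for r in records if all(r.get(f) == v for f, v in criteria.items())]
    return matching[:count] + list(ghosts)


# Invented sample data.
mailing_list = [
    {"name": "A", "job_title": "Director"},
    {"name": "B", "job_title": "Manager"},
    {"name": "C", "job_title": "Director"},
    {"name": "D", "job_title": "Director"},
]
ghost = {"name": "Ghost One", "job_title": "Director"}
delivery = select_for_delivery(mailing_list, 2, ghosts=[ghost], job_title="Director")
```

The ghosts are added on top of the requested count here; whether they should count towards it is a business decision.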
Product?
• Must be able to distribute software
  – How?
  – Web or local OS
  – Hardware Platform
Competition
• Mailing Houses
  – Data discs
  – Web
  – Mailing list management services
• Software Companies
  – Dedupe software
  – Mailing List Management Software
• CHECK THESE OUT FOR THE REPORT