De-Duplication A not so simple problem Covers Appendix Part 5.


Page 1

De-Duplication

A not so simple problem

Covers Appendix Part 5

Page 2

False?

• False positives occur when records are identified as a group of duplicates but do NOT represent the same customer

• False negatives occur when actual redundant representations of the same customer are NOT identified
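Both error types can be counted once a small sample has been checked by hand. The sketch below is one assumed way of doing it, not a method taken from the slides: it compares every pair of records and counts pairs that were wrongly grouped (false positives) and pairs that were wrongly left apart (false negatives). The function name, record keys and sample data are all hypothetical.

```python
from itertools import combinations

def pairwise_errors(true_customer, predicted_group):
    """Count false-positive and false-negative record pairs.

    true_customer:   record key -> real customer id (hand-checked sample)
    predicted_group: record key -> duplicate group assigned by the dedupe run
    """
    false_pos = false_neg = 0
    for a, b in combinations(sorted(true_customer), 2):
        same_truth = true_customer[a] == true_customer[b]
        same_pred = predicted_group[a] == predicted_group[b]
        if same_pred and not same_truth:
            false_pos += 1   # grouped together, but different customers
        elif same_truth and not same_pred:
            false_neg += 1   # same customer, but not grouped together
    return false_pos, false_neg

# Hypothetical sample: keys 1-2 are one customer, keys 3-4 another.
truth = {1: "A", 2: "A", 3: "B", 4: "B"}
guess = {1: "g1", 2: "g2", 3: "g2", 4: "g2"}
print(pairwise_errors(truth, guess))   # (false positives, false negatives)
```

Counting pairs rather than groups means one badly merged group is penalised in proportion to how many records it wrongly pulls together.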

Page 3

• Customer Name – only personal names
• Postal Address – only United States address formats
• Tax ID – could be a personal National Insurance Number or another unique identifier

Page 6

Abbreviation

The abbreviation of first and middle names is a common challenge:

Does a matching Tax ID guarantee that a variation is a duplicate? What about when Tax ID is missing?
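One way to reason about these two questions is to treat an initial as compatible with any full name that starts with the same letter, and to treat a missing Tax ID as "unknown" rather than as a mismatch. The sketch below assumes that approach; the function names and the three-valued result are illustrative choices, not part of the slides.

```python
def name_parts_compatible(a, b):
    """True if two given names agree, allowing either one to be an initial."""
    a, b = a.strip(". ").lower(), b.strip(". ").lower()
    if not a or not b:
        return True           # a missing middle name contradicts nothing
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]   # initial vs full name: compare first letters
    return a == b

def tax_id_evidence(id_a, id_b):
    """Return 'match', 'conflict', or 'unknown' when either ID is missing."""
    if not id_a or not id_b:
        return "unknown"
    return "match" if id_a == id_b else "conflict"

print(name_parts_compatible("J.", "James"))   # initial is compatible
print(tax_id_evidence("", "123-45-6789"))     # missing ID is not a conflict
```

Compatible initials are weak evidence on their own ("J." fits James, John and Jane), and a matching Tax ID can still be a keying error, so neither check should decide a duplicate by itself.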

Page 7

Marriage

Marriages can be good for people but possibly bad for their data:

Did the hyphenated last name on Key 252 help overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?

Page 8

False Positives

For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address?

For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?
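Questions like these suggest that a yes/no rule is too blunt. A minimal sketch, assuming a weighted score with a middle "review" band for a person to check, is given below. The weights, thresholds and field names are invented for illustration, and the sample record is made up rather than taken from the Keys above.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough 0..1 string similarity using the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(rec_a, rec_b, auto=0.85, review=0.55):
    """Return 'duplicate', 'review' or 'distinct' (assumed weights)."""
    score = 0.4 * similarity(rec_a["name"], rec_b["name"])
    score += 0.3 * similarity(rec_a["address"], rec_b["address"])
    # Tax ID only counts when both records have one.
    if rec_a["tax_id"] and rec_b["tax_id"]:
        score += 0.3 if rec_a["tax_id"] == rec_b["tax_id"] else -0.3
    if score >= auto:
        return "duplicate"
    if score >= review:
        return "review"      # send to a person rather than merge blindly
    return "distinct"

# Made-up record for illustration only.
with_id = {"name": "Pat Example", "address": "1 Sample St", "tax_id": "111"}
no_id = dict(with_id, tax_id="")
print(classify(with_id, dict(with_id)))   # name, address and Tax ID agree
print(classify(no_id, dict(no_id)))       # same name and address, no Tax ID
```

With these assumed weights, an identical name and address without Tax IDs can only reach the review band, which keeps same-address cases from being merged automatically.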

Page 9

Same Address

A common challenge is different customers who share the same family name and the exact same postal address.

Page 10

What goes in the Report Appendix?

• Discuss deduplication
  – What is your business strategy?

• Show via a flow chart how you would attempt deduplication
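The boxes of such a flow chart might be: normalise, block, compare, group. The sketch below walks through those four stages under stated assumptions. The field names (surname, tax_id, postcode), the matching rule and the choice of post code as the blocking key are all hypothetical, and the report would need to justify its own choices.

```python
from collections import defaultdict
from itertools import combinations

def normalise(rec):
    """Stage 1: lower-case and collapse whitespace; None becomes ''."""
    return {k: " ".join(str(v or "").lower().split()) for k, v in rec.items()}

def is_match(a, b):
    """Stage 3: placeholder rule, same surname and same non-empty Tax ID."""
    return bool(a["tax_id"]) and a["tax_id"] == b["tax_id"] and a["surname"] == b["surname"]

def dedupe(records):
    """records: key -> dict. Returns a list of sets of keys judged duplicates."""
    clean = {k: normalise(r) for k, r in records.items()}
    # Stage 2: blocking, only compare records that share a post code.
    blocks = defaultdict(list)
    for k, r in clean.items():
        blocks[r["postcode"]].append(k)
    # Stage 4: group matched pairs with a small union-find.
    parent = {k: k for k in clean}
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k
    for keys in blocks.values():
        for a, b in combinations(keys, 2):
            if is_match(clean[a], clean[b]):
                parent[find(a)] = find(b)
    groups = defaultdict(set)
    for k in clean:
        groups[find(k)].add(k)
    return [g for g in groups.values() if len(g) > 1]

sample = {
    1: {"surname": "Smith", "tax_id": "111", "postcode": "ZZ1 1ZZ"},
    2: {"surname": "SMITH ", "tax_id": "111", "postcode": "zz1 1zz"},
    3: {"surname": "Jones", "tax_id": "222", "postcode": "ZZ1 1ZZ"},
}
print(dedupe(sample))
```

Blocking keeps the number of comparisons manageable, but blocking on post code would never compare a customer with their own record at a previous address, which is exactly the kind of false negative the earlier slides warn about.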

Page 11

Mailing List Management Functional Requirements

• Set out what the new system will do.

• You have some experience with this from CS22120 Group Project.

• An attempt to describe, logically, the functionality of the system.

• You need to describe it NOT build it.

Page 12

Requirements

• Functional
  – What is it supposed to do?

• Non-functional requirements
  – Computer environment
  – Personnel
  – Web based

Page 13

Some functions

• Set up required fields
• Add, Modify and Delete Fields
• Import initial list
  – Field matching
  – Excel, CSV programs
• Add, Modify and Delete Records
• Merge records from externally purchased files
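To make "field matching" on import concrete, here is a minimal sketch that reads a CSV export, maps the file's column headers onto the list's own field names, and sets aside rows with a required field missing. The required fields, the column headers and the function name are assumptions for illustration only.

```python
import csv
import io

REQUIRED = ["name", "address", "postcode"]   # hypothetical required fields

def import_list(csv_text, field_map):
    """field_map: column header in the file -> field name in our list."""
    rows, rejected = [], []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rec = {ours: (raw.get(theirs) or "").strip()
               for theirs, ours in field_map.items()}
        # Keep the row only if every required field has a value.
        if all(rec.get(f) for f in REQUIRED):
            rows.append(rec)
        else:
            rejected.append(rec)
    return rows, rejected

purchased = "Full Name,Addr,PC\nAnn Lee,1 High St,ZZ1 1ZZ\nBob Ray,,ZZ1 2ZZ\n"
mapping = {"Full Name": "name", "Addr": "address", "PC": "postcode"}
accepted, rejected = import_list(purchased, mapping)
print(len(accepted), len(rejected))
```

The same mapping step would apply when merging an externally purchased file, since its column names are unlikely to match the existing list.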

Page 14

Mailing List Functionality cont’d

• Cleanse using the Postcode Address File (PAF)
  – Contains all addresses in the UK
  – Use to correct an address from its post code
  – Can add the correct:
    • Street name
    • Post town
    • County
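The cleansing step can be pictured as a lookup keyed on post code. The sketch below uses a tiny made-up dictionary as a stand-in for PAF data; the real file is licensed from Royal Mail, has its own record layout, and holds many addresses per post code, so the structure and field names here are assumptions.

```python
# Stand-in for PAF data: entirely made-up values, not real addresses.
PAF_SAMPLE = {
    "ZZ1 1ZZ": {"street": "Example Road", "post_town": "EXAMPLETOWN",
                "county": "Exampleshire"},
}

def normalise_postcode(pc):
    """Upper-case and put a single space before the 3-character inward code."""
    pc = "".join(pc.upper().split())
    return pc[:-3] + " " + pc[-3:] if len(pc) > 3 else pc

def cleanse(record, paf=PAF_SAMPLE):
    """Return a copy of the record with address fields corrected from PAF."""
    fixed = dict(record)
    fixed["postcode"] = normalise_postcode(record.get("postcode", ""))
    entry = paf.get(fixed["postcode"])
    if entry:
        fixed.update(entry)   # overwrite street, post town and county
    return fixed

print(cleanse({"name": "A N Other", "street": "Exmple Rd", "postcode": "zz11zz"}))
```

Records whose post code is not found are returned unchanged apart from the post code formatting, so they can be flagged for manual correction.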

Page 15

Sorting

• Sort by
  – Post code
  – Geographic Areas
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – And combinations of above
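Sorting on a combination of fields can be expressed as sorting on a tuple of keys. A minimal sketch, assuming records are dictionaries and that "Random" means a shuffle, is given below; the function and parameter names are hypothetical.

```python
import random

def sort_records(records, keys, descending=False, shuffle_seed=None):
    """Sort by several fields in order, or shuffle if a seed is given."""
    if shuffle_seed is not None:
        out = list(records)
        random.Random(shuffle_seed).shuffle(out)   # repeatable random order
        return out
    return sorted(records, key=lambda r: tuple(r[k] for k in keys),
                  reverse=descending)

firms = [
    {"postcode": "B1", "turnover": 5},
    {"postcode": "A1", "turnover": 9},
    {"postcode": "A1", "turnover": 2},
]
by_area = sort_records(firms, ["postcode", "turnover"])
print([f["turnover"] for f in by_area])
```

Here the list is ordered by post code first and by turnover within each post code; reversing applies to the whole combined key.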

Page 16

Mailing List Functionality cont’d

• Select the number of records to deliver, and maybe select by
  – Post code
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – Add false “ghosts”
  – File formats
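A delivery step along these lines could filter the list, cap the number of records, append the ghosts and write the result in the requested format. The sketch below assumes that "ghosts" are made-up decoy records added so the list owner can spot unauthorised reuse, and it shows CSV as one possible output format. All names and data are illustrative.

```python
import csv
import io

def select_for_delivery(records, count, ghosts, **criteria):
    """Take up to `count` records matching the criteria, then add ghosts."""
    matches = [r for r in records
               if all(r.get(f) == v for f, v in criteria.items())]
    return matches[:count] + list(ghosts)

def to_csv(records, fields):
    """Write the chosen fields as CSV text, ignoring any other fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

people = [
    {"name": "Ann", "job_title": "Buyer"},
    {"name": "Bob", "job_title": "Clerk"},
    {"name": "Cy", "job_title": "Buyer"},
]
ghosts = [{"name": "Ghost One", "job_title": "Buyer"}]   # decoy record
batch = select_for_delivery(people, 1, ghosts, job_title="Buyer")
print(to_csv(batch, ["name", "job_title"]))
```

Ghosts are added after the cap is applied, so the customer still receives the number of genuine records they asked for.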

Page 17

Product?

• Must be able to distribute software
  – How?
  – Web or local OS
  – Hardware platform

Page 18

Competition

• Mailing Houses
  – Data discs
  – Web
  – Mailing list management services

• Software Companies
  – Dedupe software
  – Mailing List Management Software

• CHECK THESE OUT FOR THE REPORT