Post on 17-Dec-2015
Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik
Presented by Bryan Wilhelm
Problem Description
A single entity may be referenced in separate records in textually dissimilar ways. For example, “Robert” and “Bob”.
Traditional text similarity functions such as edit distance and the Jaccard coefficient cannot handle these cases.
Current research is looking at string transformation databases.
These databases can be extremely large.
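To illustrate why the traditional measures struggle with the “Robert”/“Bob” case, here is a minimal sketch that computes both scores. The helper names (`edit_distance`, `bigrams`, `jaccard`) and the choice of character bigrams as the Jaccard token set are assumptions made for illustration, not something the slides specify.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigrams(s):
    """Set of character bigrams of s (an assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(x, y):
    """Jaccard coefficient |x & y| / |x | y| of two sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    print(edit_distance("robert", "bob"))
    print(jaccard(bigrams("robert"), bigrams("bob")))
```

Under these assumptions the two names share only the bigram "ob" and need four edits out of six characters, so both measures rate them as dissimilar even though they refer to the same person.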
Solution: Definitions
Rule Application
Example: {Olathe→Olathe, 7, 4}
Alignment
Rule applications cannot overlap
Order does not matter
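The two alignment constraints can be sketched in code. This is an illustration under stated assumptions: a rule application is modeled as a rule plus a start position in each string (mirroring the triple in the example), positions are taken to be 1-based token offsets, and the names `RuleApplication` and `is_alignment` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleApplication:
    lhs: tuple  # tokens on the left-hand side of the rule
    rhs: tuple  # tokens on the right-hand side of the rule
    i: int      # assumed 1-based start position of lhs in X
    j: int      # assumed 1-based start position of rhs in Y


def matches(app, X, Y):
    """True if lhs occurs in X at position i and rhs occurs in Y at position j."""
    if app.i < 1 or app.j < 1:
        return False
    return (tuple(X[app.i - 1:app.i - 1 + len(app.lhs)]) == app.lhs
            and tuple(Y[app.j - 1:app.j - 1 + len(app.rhs)]) == app.rhs)


def is_alignment(apps, X, Y):
    """True if every application matches and no two overlap in X or in Y.

    The check only accumulates covered positions, so the order in which
    the applications are listed does not affect the result.
    """
    used_x, used_y = set(), set()
    for app in apps:
        if not matches(app, X, Y):
            return False
        xs = set(range(app.i, app.i + len(app.lhs)))
        ys = set(range(app.j, app.j + len(app.rhs)))
        if xs & used_x or ys & used_y:
            return False
        used_x |= xs
        used_y |= ys
    return True


if __name__ == "__main__":
    X = "Bob Smith 12 Main St".split()
    Y = "Robert Smith 12 Main Street".split()
    a = RuleApplication(("Bob",), ("Robert",), 1, 1)
    b = RuleApplication(("St",), ("Street",), 5, 5)
    print(is_alignment([a, b], X, Y), is_alignment([b, a], X, Y))
```

The sample strings are made up for the sketch; any pair of token sequences works the same way.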
Coverage
Solution: Algorithm
Record Matching Application
Generating Example Pairs
Traditional text matching methods are used (such as the Jaccard coefficient).
Input from domain experts could also be considered but this is expensive.
A few incorrect pairs will not affect the end result.
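One way example pairs might be generated with a traditional measure is sketched below. The function names, the whitespace tokenization, the brute-force comparison of all pairs, and the 0.4 threshold are all assumptions for illustration; the slides do not say how the threshold is chosen.

```python
from itertools import combinations


def token_jaccard(a, b):
    """Jaccard coefficient over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold=0.4):
    """Return record pairs whose token Jaccard meets the (assumed) threshold.

    Compares every pair, which is quadratic in the number of records and
    only suitable for small inputs.
    """
    return [(a, b) for a, b in combinations(records, 2)
            if token_jaccard(a, b) >= threshold]


if __name__ == "__main__":
    records = [
        "Bob Smith 12 Main St",
        "Robert Smith 12 Main Street",
        "Alice Jones 9 Oak Ave",
    ]
    for pair in candidate_pairs(records):
        print(pair)
```

A loose threshold lets a few wrong pairs through, which is tolerable here because the pairs only serve as input examples rather than final matches.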
Validation of Transformations
All approaches involve confirmation by a domain expert.
Analysis