Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik Presented by Bryan Wilhelm.



Problem Description

A single entity may be referenced in separate records in textually dissimilar ways. For example, “Robert” and “Bob”.

Traditional text similarity functions such as edit distance and the Jaccard coefficient cannot handle these cases.
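To make the point concrete, here is a minimal sketch of both measures applied to the “Robert” / “Bob” pair. The use of character bigrams for the Jaccard coefficient and the lowercasing of the inputs are assumptions made for illustration; the slides do not say which tokenization is used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard coefficient over character q-gram sets (q=2 is an assumption)."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


# "robert" vs "bob": 4 edits on a 6-character string, and only one shared
# bigram ("ob") out of six distinct bigrams.
print(edit_distance("robert", "bob"))
print(round(jaccard("robert", "bob"), 3))
```

Both scores put the pair far apart, even though the two names refer to the same person.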

Current research is looking at string transformation databases.

These databases can be extremely large.


Solution: Definitions

Rule Application

Example: {Olathe→Olathe, 7, 4}

Alignment

Rule applications cannot overlap

Order does not matter

Coverage
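The definitions above can be sketched in code. This is one possible reading, not the authors' formulation: it assumes a rule application is a rule plus a start position in each of the two token sequences, that an alignment is a set of rule applications whose spans do not overlap, and that coverage is the fraction of tokens in both records that the alignment accounts for. The class and function names, the 0-based positions, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleApplication:
    lhs: tuple  # tokens the rule matches in the first record
    rhs: tuple  # tokens the rule produces in the second record
    i: int      # assumed 0-based start of lhs in the first record
    j: int      # assumed 0-based start of rhs in the second record


def is_valid(app, x, y):
    """The rule's two sides must actually occur at the stated positions."""
    return (tuple(x[app.i:app.i + len(app.lhs)]) == app.lhs
            and tuple(y[app.j:app.j + len(app.rhs)]) == app.rhs)


def is_alignment(apps, x, y):
    """True if every application is valid and no two applications overlap."""
    used_x, used_y = set(), set()
    for app in apps:  # iteration order is irrelevant to the result
        if not is_valid(app, x, y):
            return False
        span_x = set(range(app.i, app.i + len(app.lhs)))
        span_y = set(range(app.j, app.j + len(app.rhs)))
        if span_x & used_x or span_y & used_y:
            return False
        used_x |= span_x
        used_y |= span_y
    return True


def coverage(apps, x, y):
    """Assumed definition: share of tokens in both records that are covered."""
    total = len(x) + len(y)
    covered = sum(len(a.lhs) + len(a.rhs) for a in apps)
    return covered / total if total else 0.0


x = ["bob", "smith", "olathe"]
y = ["robert", "smith", "olathe"]
apps = [
    RuleApplication(("bob",), ("robert",), 0, 0),
    RuleApplication(("olathe",), ("olathe",), 2, 2),
]
print(is_alignment(apps, x, y), coverage(apps, x, y))
```

Because the overlap check only compares sets of covered positions, passing the applications in a different order gives the same answer, which matches the "order does not matter" point.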


Solution: Algorithm


Record Matching Application

Generating Example Pairs

Traditional text matching methods are used (such as the Jaccard coefficient).

Input from domain experts could also be considered but this is expensive.

A few incorrect pairs will not affect the end result.
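As an illustration of this step, the sketch below pairs up records whose token-level Jaccard coefficient clears a threshold. The whitespace tokenization, the 0.5 threshold, the function names, and the sample records are assumptions chosen for the example; the slides do not specify them.

```python
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def candidate_pairs(records, threshold=0.5):
    """Return record pairs similar enough to serve as likely matches.

    The threshold is a hypothetical tuning knob: lower values admit more
    pairs, including more incorrect ones.
    """
    return [(a, b) for a, b in combinations(records, 2)
            if token_jaccard(a, b) >= threshold]


records = [
    "Robert Smith Olathe KS",
    "Bob Smith Olathe KS",
    "Alice Jones Topeka KS",
]
for pair in candidate_pairs(records):
    print(pair)
```

Here the Robert/Bob records share three of five distinct tokens and are paired, which is what lets a transformation between the two names be observed in the first place. Comparing every pair is quadratic in the number of records, so this form is only practical for small inputs.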

Validation of Transformations

All approaches involve confirmation by a domain expert.


Analysis
