Framework for Syntactic String Similarity Measures

Najlah Gali, Radu Mariescu-Istodor, Damien Hostettler, Pasi Fränti

24.4.2019

N. Gali, R. Mariescu-Istodor, D. Hostettler and P. Fränti, "Framework for syntactic string similarity measures", Expert Systems with Applications, 2019.

cs.uef.fi/sipu/pub/TitleSimilarity.pdf

Introduction

Application examples

Titles of web pages:

V-café vs. Viet-Café

Keywords and keyphrases:

Theater vs. theatre

Named entities:

U.S State Department vs. US Department of State

Personal names:

Gail Vest vs. Gayle Vesty

Place names:

Ting Tsi River vs. Tingtze River

Ontology alignments:

associate professor vs. senior lecturer

Short segments of text:

Apple computer vs. Apple pie

Sentences:

"I haven't watched television for ages" vs. "It's been a long time since I watched television"

Similarity framework

Existing packages

StringSim package

Legend: existing measure, new combination

Character-level measures

• Exact match

• Transformation

• Longest common substring (LCS)
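As an illustration of the last bullet, the sketch below finds a longest common substring with a rolling dynamic-programming table. The function names and the normalization by the longer string's length are assumptions made for this example, not necessarily what the StringSim package does.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    if not a and not b:
        return 1.0
    return len(longest_common_substring(a, b)) / max(len(a), len(b))


print(longest_common_substring("theater", "theatre"))
print(lcs_similarity("theater", "theatre"))
```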

Edit distance

Levenshtein 1966

dist=5

Solved by dynamic programming algorithm
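A minimal sketch of the dynamic programming solution, assuming unit cost for insertion, deletion and substitution. The function name is chosen for this example; the strings behind the slide's dist=5 figure are not recoverable from the transcript, so other strings are used here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("theater", "theatre"))
```

Only two rows of the table are kept at a time, so memory is proportional to the length of the second string.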

Character-level measures

String segmentation

• Tokenization

• Q-grams

Segmentation examples

The club at the Ivy
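The two segmentation options can be sketched on this example string. Lowercasing, splitting on whitespace, and letting q-grams run across spaces are assumptions of this sketch; the slide's own segmentation output is not preserved in the transcript.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Split into lowercase word tokens on whitespace (assumed convention)."""
    return text.lower().split()


def qgrams(text: str, q: int = 2) -> List[str]:
    """Overlapping substrings of length q, spaces included (assumed convention)."""
    s = text.lower()
    return [s[i:i + q] for i in range(len(s) - q + 1)]


title = "The club at the Ivy"
print(tokenize(title))
print(qgrams(title, q=3))
```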

Matching techniques

• Sequence matching

• Set matching

• Bag-of-tokens
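To make the difference between set matching and bag-of-tokens concrete, the sketch below computes Jaccard and Dice on token sets and cosine on token counts. These are the standard textbook definitions; the function names and the handling of empty inputs are assumptions of this example.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Set matching: shared tokens over all distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice(a, b):
    """Set matching: twice the shared tokens over the sum of set sizes."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Bag-of-tokens: cosine of the two token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


x, y = "apple computer".split(), "apple pie".split()
print(jaccard(x, y), dice(x, y), cosine(x, y))
```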

String matching at token level

Problem of crisp sets

gray color vs. color gray: Similarity = 1.0

gray color vs. colour grey: Similarity = 0.0 ?

Soft set-matching

Smith-Waterman-Gotoh

$\text{Similarity} = \frac{1}{3}(0.2 + 0.9 + 0.8) \approx 0.63$
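The averaging idea can be sketched as follows: each token of the first string is paired with its most similar token in the second string, and the best scores are averaged. This sketch uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in character-level measure, not Smith-Waterman-Gotoh, so its scores will differ from the slide's. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_sim(a: str, b: str) -> float:
    """Stand-in character-level similarity in [0, 1] (not Smith-Waterman-Gotoh)."""
    return SequenceMatcher(None, a, b).ratio()


def soft_match(tokens_a, tokens_b) -> float:
    """Average, over tokens of A, of the best character-level match in B."""
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(char_sim(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best) / len(best)


a, b = "gray color".split(), "colour grey".split()
print(set(a) & set(b))   # crisp sets share no token
print(soft_match(a, b))  # soft matching still finds the strings close
```

Note that this averaging is not symmetric in general: swapping the two token lists can change the score.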

Soft cardinalities of sets

{gray, grey} …. {gray, color}
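The slide's formula is not preserved in the transcript. One common formulation of soft cardinality sums, for each element, the reciprocal of its total similarity to all elements of the set (itself included), raised to a power p. The sketch below assumes that formulation, an assumed p = 2, and `difflib.SequenceMatcher.ratio()` as a stand-in character-level measure; it may not match what the authors used.

```python
from difflib import SequenceMatcher


def char_sim(a: str, b: str) -> float:
    """Stand-in character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def soft_cardinality(tokens, p: float = 2.0) -> float:
    """Assumed formulation: sum over a of 1 / sum over b of sim(a, b) ** p."""
    # Each inner sum includes sim(a, a) = 1, so every term is at most 1.
    return sum(1.0 / sum(char_sim(a, b) ** p for b in tokens) for a in tokens)


print(soft_cardinality(["gray", "grey"]))   # near-duplicates count as little more than one element
print(soft_cardinality(["gray", "color"]))  # dissimilar tokens count as almost two
```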

Another example

Summary of measures: sequence and set-matching

Summary of measures: bag-of-tokens

Results

Datasets

Text manipulation: character changes

Ch/Q | Edit | Monge-Elkan | Braun-Ban | Simpson | Jaccard | Dice & Rouge | Cosine | Manhattan | Euclidean

Exact match 2
4 10
Hamming
Levenshtein & Damerau-Levenshtein 6 12 6 9 6
Needleman-Wunsch 14
Smith-Waterman & SWG 2 6 10 15 9 12 15
Jaro 7
11 13 12
Jaro-Winkler 8
LCS 1 9 12 6 9 6
2-Grams 5
2 3
3-Grams 3
Word2Vec

Effect of char changes

[Figure: curves labelled Exact match, Word2Vec, Expected, Human intuition, and the numbered measures below]

1. Manhattan-Hamming
2. LCS
3. Edit-3Grams
4. Edit-LCS
5. BraunBan-3Grams
6. Edit-SWG
7. Edit-Damerau
8. BraunBan-Hamming
9. Edit-JaroWinkler
10. BraunBan-Damerau
11. BraunBan-LCS
12. Edit-Needle
13. BraunBan-SWG
14. Euclidean
15. BraunBan-JaroWinkler
16. BraunBan-Needle

Text manipulation: token changes

Ch/Q | Edit | Monge-Elkan | Braun-Ban | Simpson | Jaccard | Dice & Rouge | Cosine | Manhattan | Euclidean

Exact match 2
4 10
Hamming
Levenshtein & Damerau-Levenshtein 6 12 6 9 6
Needleman-Wunsch 14
Smith-Waterman & SWG 2 6 10 15 9 12 15
Jaro 7
11 13 12
Jaro-Winkler 8
LCS 1 9 12 6 9 6
2-Grams 5
2 3
3-Grams 3

Effect of token changes

Correlation to human intuition

Qualitative examples

Excellent match

Poor match

Correlation to distance

Clustering experiment

• 180 photos

• 15 clusters

Clustering results

Name matching

Summary of the results

Conclusions

Token level measures

• Well-maintained databases: ok as such

• Free text: soft variants improve!

Semantic similarity

• Suffers from single char changes

Recommendation:

• Dice or Rouge (token level) + Q-grams

• No single measure works for all applications
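One possible reading of the first recommendation is to apply the Dice coefficient to sets of character q-grams. The sketch below assumes that reading, with q = 2 and lowercasing as illustrative choices; the function names are hypothetical and this is not taken from the StringSim package.

```python
def qgram_set(text: str, q: int = 2) -> set:
    """Distinct overlapping substrings of length q, lowercased."""
    s = text.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def dice_qgrams(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over the two q-gram sets."""
    sa, sb = qgram_set(a, q), qgram_set(b, q)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


print(dice_qgrams("Theater", "theatre"))
print(dice_qgrams("Apple computer", "Apple pie"))
```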

The end

http://cs.uef.fi/sipu/soft/stringsim