Wilson Wong, Wei Liu and Mohammed Bennamoun School of Computer Science and Software Engineering University of Western Australia Enhanced Integrated Scoring for Cleaning Dirty Texts (ISSAC v2)


Page 1: Wilson Wong, Wei Liu and Mohammed Bennamoun School of Computer Science and Software Engineering University of Western Australia Enhanced Integrated Scoring.

Wilson Wong, Wei Liu and Mohammed Bennamoun

School of Computer Science and Software Engineering

University of Western Australia

Enhanced Integrated Scoring for Cleaning Dirty Texts (ISSAC v2)

Page 2:

INTRODUCTIONS

• Authors: Wilson Wong, Wei Liu and Mohammed Bennamoun (University of Western Australia)

• Presented by: Benjamin Johnston (University of Technology, Sydney)

Page 3:

INDEX

1. Background

2. Problems & Challenges

3. Solution

4. Evaluations

5. Future Works

Page 4:

BACKGROUND

3 types of errors

• These three errors are interrelated: spelling errors (“Splling erors”), abbreviations (“Abbre”) and improper casing (“IMPROPER cAsing”).

• Research on them has traditionally been carried out separately.

• Example: the error term “itme” can be read as:
  - “time”, with i and t swapped
  - “item”, with m and e swapped
  - “ITME”, Institute of Electronics Materials Technology in Warsaw, Poland
  - “it’s me”, with missing ’s

Page 5:

BACKGROUND

Spelling error and abbreviation

• Spelling error detection and correction:
  - Minimum edit distance (Damerau-Levenshtein, Wagner-Fischer, etc.)
  - Similarity key (SOUNDEX, Metaphone, Double Metaphone, Daitch-Mokotoff, etc.)

• Abbreviation expansion: most research is carried out in the area of named-entity recognition. Techniques rely on:
  - Letter casing, e.g. “NASA”
  - Use of periods, e.g. “U.S.A.”
  - Use of parentheses, e.g. “North Atlantic Treaty Organisation (NATO)”
  - Number of letters in words

Page 6:

BACKGROUND

Letter casing

• Case restoration: improperly cased words are detected and restored. Common approaches include:
  - Using N-grams to predict the most likely case (LC, MC, UC) of a token based on its local context.
  - Relying on unambiguous introduction of ambiguous tokens. The ambiguity of “Riders” is reduced when we encounter “John Riders” in the same text.

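[Figure: N-gram example. The token “new” is likely to be LC, and is categorized into LC, when the subsequent token is “information”; it is less likely to be LC when the subsequent token is “york”.]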

Page 7:

INDEX

Page 8:

PROBLEMS & CHALLENGES

Existing techniques, their accuracies and test data

• Test data are either artificial or not-so-dirty dirty text.

• Techniques are isolated.

Page 9:

PROBLEMS & CHALLENGES

Example of dirty texts

np, ty

• Ad-hoc abbreviations, common in the Internet era, pose extra challenges (e.g. “ty”, “u”).

Page 10:

PROBLEMS & CHALLENGES

Examples of existing applications

Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". 16 errors [Original]

Mi Teaser constantly REMINDS mer that eduction is an inerrant asper of LIFO. She sad, "Few yrs in school will ensue a beater LIFO for u". 2/16 [Aspell 0.50.3]

MI Teacher kinsman REMINDS meek that education is an important speak of life. She sad, "Few yes in Scholl will ensure a better LIFO for U". 5/16 [http://www.spellcheck.net]

Mi Teacher constantly REMINDS me that education is an important aspect of life. She sad, "Few yrs in Scholl will ensure a better LIFO for u". 8/16 [MS Office Word 2003]

Page 11:

PROBLEMS & CHALLENGES

Challenges to be addressed

• Techniques for abbreviation expansion, etc. that rely on patterns and a static dictionary face problems with expansion.

• Integrated approaches for automatically correcting all three types of errors are rare.

• The accuracy of corrections by the existing isolated techniques can be further improved.

• The accuracy of existing techniques (individual or integrated) on extremely challenging dirty texts (e.g. chat records) has yet to be demonstrated.

Page 12:

INDEX

Page 13:

SOLUTION

Overview

• Our solution must take the following into consideration:
  - Integrated approach (for all 3 types of errors)
  - High accuracy
  - Automatic (i.e. no user involvement)
  - Evaluations using real-world dirty texts

• ISSAC v2 draws on:
  - Suggestions and rank by Aspell
  - Expansions for abbreviations by Stands4.com
  - Google’s page count and spell check
  - Domain corpora (i.e. dirty texts collection)

Page 14:

SOLUTION

Aspell

• Each term is fed into Aspell, which generates a list of suggestions for each error term.
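A minimal sketch of this step, assuming the `aspell` command-line program is installed and is run in its ispell-compatible pipe mode (`aspell -a`). The slides do not say how ISSAC invokes Aspell, so the function names and the `lang` parameter are hypothetical.

```python
import subprocess


def parse_pipe_line(line):
    """Parse one result line of Aspell's ispell-compatible pipe mode.

    Returns (term, suggestions), or None for lines that do not report a
    misspelling (the version banner, "*" for a correct word, blank lines).
    """
    if line.startswith("&"):
        # Format: "& <term> <count> <offset>: <sug1>, <sug2>, ..."
        head, _, tail = line.partition(":")
        term = head.split()[1]
        return term, [s.strip() for s in tail.split(",") if s.strip()]
    if line.startswith("#"):
        # Format: "# <term> <offset>" (misspelt, no suggestions offered)
        return line.split()[1], []
    return None


def aspell_suggestions(terms, lang="en"):
    """Map each error term to Aspell's ranked suggestions (assumed workflow)."""
    # A leading "^" tells pipe mode to spell-check the rest of the line
    # instead of treating a leading symbol as a command.
    text = "".join("^" + term + "\n" for term in terms)
    proc = subprocess.run(
        ["aspell", "-a", "--lang=" + lang],
        input=text, capture_output=True, text=True, check=True,
    )
    parsed = (parse_pipe_line(line) for line in proc.stdout.splitlines())
    return dict(p for p in parsed if p is not None)
```

The position of a suggestion in the returned list (starting at 1) is its original rank. Terms that Aspell accepts as correct do not appear in the result.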

Page 15:

SOLUTION

Stands4.com

• Stands4.com is consulted for possible expansions for each erroneous term.

• A local copy is maintained for future use.

Page 16:

SOLUTION

Google

• Google’s ability to search for phrases

• The page count that Google returns

• Google’s suggestions for spelling errors in queries

Page 17:

SOLUTION

Notations

• The set of suggestions S for an error term e contains:
  - m expansions, all with rank 1
  - n suggestions by Aspell, according to their original rank
  - the error term itself
  - Google’s suggestion

• $s_{j,i}$ = jth suggestion with rank i in the set S

Page 18:

SOLUTION

Notations

• Example: for the error term “itme”, S includes “time”, “item” and “Institute of Electronics Materials Technology”.

• We use the neighbouring words to disambiguate and identify the most ideal suggestion from S for automatic correction.

• The left and right words are considered as context.

“shipping itme frame”

Left word, l = “shipping”

Right word, r = “frame”
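As an illustration of how the context (l, r) might be collected and paired with each suggestion, the sketch below extracts the left and right words of every token and builds one quoted phrase per suggestion, of the kind that could be submitted as a phrase search. The function names and the quoting convention are assumptions, not part of the slides.

```python
def contexts(tokens):
    """Yield (left, term, right) for every token; missing neighbours are ""."""
    for k, term in enumerate(tokens):
        left = tokens[k - 1] if k > 0 else ""
        right = tokens[k + 1] if k + 1 < len(tokens) else ""
        yield left, term, right


def phrase_candidates(left, suggestions, right):
    """Build one quoted phrase "l s r" per suggestion s (hypothetical helper)."""
    return {
        s: '"' + " ".join(w for w in (left, s, right) if w) + '"'
        for s in suggestions
    }
```

For “shipping itme frame”, the candidates for “time” and “item” become the phrases "shipping time frame" and "shipping item frame".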

Page 19:

SOLUTION

Different weights in ISSAC

ISSAC v2 uses the following weights for each suggestion $s_{j,i}$:

• Reuse factor, $RF(e, s_{j,i}) \in \{0, 1\}$

• Abbreviation factor, $AF(e, s_{j,i}) \in \{0, 1\}$

• Domain significance, $DS(l, s_{j,i}, r) \in (0, 1)$

• General significance, $GS(l, s_{j,i}, r) \in (0, 1)$

• Normalized edit distance, $NED(e, s_{j,i}) \in (0, 1]$

• Original rank by Aspell, $i^{-1} \in (0, 1]$

Page 20:

SOLUTION

Correction using ISSAC

• The list of suggestions S is re-ranked using a combined score, NS.

• Individual weights contribute to the overall ranking of each suggestion.

• The suggestion with the highest NS is taken as the most ideal replacement given the surrounding context.
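The re-ranking step can be sketched as follows. The scoring formula itself is not legible in this transcript, so the sketch assumes that NS is an unweighted sum of the six weights (inverse Aspell rank, NED, RF, AF, DS and GS); that combination rule, the `Weights` container and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """The six weights for one suggestion (values are supplied by the caller)."""
    rank: int    # original Aspell rank i, with 1 as the best rank
    ned: float   # normalized edit distance, in (0, 1]
    rf: int      # reuse factor, 0 or 1
    af: int      # abbreviation factor, 0 or 1
    ds: float    # domain significance, in (0, 1)
    gs: float    # general significance, in (0, 1)


def new_score(w):
    # ASSUMPTION: NS is a plain sum of the weights; the real formula may differ.
    return 1.0 / w.rank + w.ned + w.rf + w.af + w.ds + w.gs


def best_suggestion(candidates):
    """Return the suggestion with the highest NS from {suggestion: Weights}."""
    return max(candidates, key=lambda s: new_score(candidates[s]))
```

Under this assumption every weight can only raise a suggestion's score, so a suggestion that Aspell ranks low can still win if the context-based weights favour it strongly.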

Page 21:

SOLUTION

Edit distance

• Heuristic: the correct replacement should not deviate too far from the error.

[Figure: NED (vertical axis, 0 to 1.2) plotted against edit distance (horizontal axis, 1 to 18), with the error term “itme” and the suggestions “item”, “time”, “it me”, “timer” and “Tim” marked.]
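The slide gives only the range of NED, (0, 1], and the heuristic that it should fall as a suggestion moves further from the error. One function with those properties is the reciprocal of the base-2 logarithm of the edit distance plus one; whether ISSAC uses exactly this form is an assumption, and the name `ned` is hypothetical.

```python
import math


def ned(edit_distance):
    """Map an edit distance >= 1 to a weight in (0, 1] (assumed form)."""
    if edit_distance < 1:
        raise ValueError("edit distance must be at least 1")
    # ASSUMPTION: 1 / log2(ED + 1); gives 1 at ED = 1 and decays slowly.
    return 1.0 / math.log2(edit_distance + 1)
```

With this form a suggestion one edit away from the error receives the full weight of 1, and a suggestion three edits away receives 0.5.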

Page 22:

SOLUTION

Reuse and abbreviation factors

• Abbreviation factor: if a suggestion is a potential expansion for an abbreviation (i.e. the error term), AF yields 1, and 0 otherwise.

• The abbreviation dictionary is consulted.

• Reuse factor: RF returns 1 if the suggestion appears in the spelling dictionary, and 0 otherwise.

• There are two types of entries in the spelling dictionary:
  - Suggestions by Google for spelling errors, automatically updated every time Google suggests a replacement for an error.
  - Suggestions for errors provided by users (optional).
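Both factors amount to dictionary lookups. The sketch below assumes that each dictionary maps an error term to a set of known replacements or expansions; that layout and the function names are hypothetical, since the slides do not describe the storage format.

```python
def abbreviation_factor(error, suggestion, abbreviations):
    """AF: 1 if the suggestion is a listed expansion of the error term, else 0."""
    return int(suggestion in abbreviations.get(error.lower(), set()))


def reuse_factor(error, suggestion, spelling_dictionary):
    """RF: 1 if the suggestion was recorded earlier for this error, else 0."""
    return int(suggestion in spelling_dictionary.get(error.lower(), set()))


def remember(error, replacement, spelling_dictionary):
    """Record a replacement, e.g. one proposed by a search engine or a user."""
    spelling_dictionary.setdefault(error.lower(), set()).add(replacement)
```

Calling `remember` each time a replacement is proposed keeps the spelling dictionary growing, so later occurrences of the same error receive RF = 1 for that replacement.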

Page 23:

SOLUTION

Domain significance

DS is expressed in terms of four quantities A, B, C and D, where B, D > 0. Its limiting behaviour is:

• $\lim_{A \to 0} DS = 0$: $s_{j,i}$ is not common, both individually and in context.

• $\lim_{(A,C) \to (B,D)} DS = 0.3$: $s_{j,i}$ occurs very frequently, both individually and in context, but nearly all documents contain the term (i.e. too common).

• $\lim_{(A,C) \to (B,1)} DS = 1$: $s_{j,i}$ occurs very frequently and appears exclusively in only a few documents.

Page 24:

SOLUTION

General significance

GS is expressed in terms of four quantities A, B, C and D, where B, D > 0 and B < D. Its limiting behaviour is:

• $\lim_{A \to 0} GS = 0$: $s_{j,i}$ appears very rarely in context.

• $\lim_{(A,C) \to (B,D)} GS = 0.3$: $s_{j,i}$ appears often in context and often individually (i.e. the term is very common).

• $\lim_{(A,C) \to (B,A)} GS = 1$: $s_{j,i}$ appears often in context, and its individual appearance approaches its appearance in context (i.e. the term is exclusive to the context).

Page 25:

INDEX

Page 26:

EVALUATIONS

Accuracy of ISSAC

• Evaluation data (700 chat sessions, 3313 errors) are actual chat records between agents and customers provided by 247Customer.com.

$$\text{Accuracy} = \frac{N_{\text{correct\_replacement}}}{N_{\text{total\_error\_identified\_by\_aspell}}}$$

Page 27:

EVALUATIONS

Accuracy of ISSAC

Page 28:

EVALUATIONS

When ISSAC doesn’t work

• Cause 1 (≈0.8%): the accuracy of correction by ISSAC is bounded by the coverage of S produced by Aspell.
  - Failures arise when the correct replacement is absent from the list of suggestions produced by Aspell.
  - For example, the correct replacement for “dotn” is not present in the list of suggestions by Aspell.

Page 29:

EVALUATIONS

When ISSAC doesn’t work

• Cause 2 (≈0.7%): two flaws related to l and r:
  - Neighbouring words are not correctly spelt. Example: “morel iberal return”.
  - The left and right words are inadequate. Example: “both ocats <”.

• Cause 3 (≈0.5%): two anomalies where ISSAC does not apply:
  - Suggestions that are equally likely to be the correct replacement. Example: “Cheng” or “Cheung” in the context of “Janice Cheng <”.
  - Contrasting disagreement among weights.

Page 30:

INDEX

Page 31:

FUTURE WORKS

Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". 16 errors [Original]

My teacher constantly reminds me that education is an important aspect of life. She said, "Few years in school will ensure a better Life for you". 15/16 [ISSAC v2]

• Look for solutions to overcome the 3 causes to improve the accuracy.

• Carry out evaluations on larger data sets.

• Evaluate ISSAC in terms of time complexity.

Page 32:

THANK YOU

Page 34:

BACKGROUND

Spelling error

• Widely adopted classes of techniques for detecting and correcting spelling errors:
  - Minimum edit distance
  - Similarity key (phonetic algorithms)

• Minimum edit distance: the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other (Damerau-Levenshtein, Wagner-Fischer, etc.).
  - Example: “wear” → “beard” requires a minimum of 2 operations: substitute ‘w’ with ‘b’ (wear → bear), then insert ‘d’ (bear → beard).
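The definition above can be made concrete with a short dynamic-programming sketch. It implements the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions and transpositions of adjacent characters; the choice of this particular variant and the function name are assumptions.

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Because transpositions count as a single operation, “itme” is one edit away from both “time” and “item”, which is why edit distance alone cannot choose between them.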

Page 35:

BACKGROUND

Spelling error

• Similarity key: map every string into a key such that similarly spelled strings will have identical keys (SOUNDEX, Metaphone, Double Metaphone, etc.).
  - The key, computed for each spelling error, acts as a pointer to all similarly pronounced words (i.e. soundslike) in the dictionary.

wear → w006 → w6

ware → w060 → w6
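The two derivations above can be reproduced with a simplified SOUNDEX-style key: keep the first letter, replace each remaining letter with its consonant-group digit (0 for vowels and other uncoded letters), then drop the zeros. This is a reduced sketch using the standard SOUNDEX letter groups; full SOUNDEX also merges adjacent duplicate codes and pads the key to a fixed length, and the function names here are hypothetical.

```python
# Standard SOUNDEX consonant groups; every other character is coded as "0".
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def coded(word):
    """First letter followed by one digit per remaining letter, e.g. w006."""
    word = word.lower()
    if not word:
        return ""
    return word[0] + "".join(CODES.get(ch, "0") for ch in word[1:])


def similarity_key(word):
    """Drop the zeros from the coded form, e.g. w006 -> w6."""
    raw = coded(word)
    return raw[:1] + raw[1:].replace("0", "")
```

Words that share a key, such as “wear” and “ware”, are treated as sounding alike, so the key of an error term can be used to look up all dictionary words stored under the same key.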