Noun Homograph Disambiguation Using Local Context in Large Text Corpora


1

Noun Homograph Disambiguation Using Local Context in Large Text Corpora

Marti A. Hearst

Presented by: Heng Ji, Mar. 29, 2004

2

Outline

Introduction
Motivations of the Algorithm
Feature Selection
Crucial Problem and Detailed Algorithm
Experiment Results
Conclusions & Discussions

3

Introduction

What is a homograph? One of two or more words spelled alike but different in meaning.

What is noun homograph disambiguation? Determining which of a set of pre-determined senses should be assigned to that noun.

Why is noun homograph disambiguation useful?

4

Noun Compound Interpretation

5

Noun Compound Interpretation

Improve Information Retrieval Results


6

Extend keywords?

7

How to do it? -- Motivations

Intuition 1: Humans can identify word sense from local context.

Intuition 2: Humans' identification ability comes from familiarity with frequent contexts.

Intuition 3: Different senses can be distinguished by:
-- different high-frequency contexts
-- different syntactic, orthographic, or lexical features

Combining Intuitions 1, 2, and 3: similar-sense terms will tend to have similar contexts!

8

Feature Selection

Principles: selective & general

Example: "bank"
"Numerous residences, banks, and libraries" -- parallel buildings
"They use holes in trees, banks, or rocks for nests" -- parallel natural objects
"are found on the west bank of the Nile" -- "direction" + "bank of the" + proper name
"Headed the Chase Manhattan Bank in New York" -- name + capitalization

Neighboring words alone are not enough -- syntactic information is needed!
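The kind of local-context features the slide describes can be sketched in a few lines. This is a minimal illustration, not Hearst's actual feature extractor: the function name and feature labels are assumptions, covering only neighboring words and one orthographic cue (capitalization).

```python
def extract_features(tokens, target_index):
    """Collect simple local-context features for the noun at target_index.

    A minimal sketch: immediate neighbors as lexical features, plus a
    capitalization flag as an orthographic feature (as in the
    "Chase Manhattan Bank" example).
    """
    features = []
    target = tokens[target_index]
    # Immediate left/right neighbors as lexical features.
    if target_index > 0:
        features.append(("left_word", tokens[target_index - 1].lower()))
    if target_index < len(tokens) - 1:
        features.append(("right_word", tokens[target_index + 1].lower()))
    # Orthographic feature: is the target capitalized (part of a name)?
    features.append(("capitalized", target[0].isupper()))
    return features

tokens = "Headed the Chase Manhattan Bank in New York".split()
print(extract_features(tokens, 4))  # target noun "Bank"
```

As the slide notes, such neighbor-word features alone are too weak; the paper's feature set also draws on partial syntactic information.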

9

Feature Set

10

Crucial Problem: is large annotated data needed?

Problem:
The cost of manual tagging is high.
The size of the corpus is usually large.
Statistics vary a great deal across different domains.
Automating the tagging of the training corpus results in the "circularity problem" (Dagan and Itai, 1994).

Solution: construct the training corpus incrementally.
An initial model M1 is trained on a small corpus C1.
M1 is used to disambiguate the remaining ambiguous words.
All words that can be disambiguated with strong confidence are combined with C1 to form C2.
M2 is trained on C2; repeat.
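The incremental (bootstrapping) loop above can be sketched as follows. This is an illustrative outline, not the paper's implementation: `train`, `classify`, and the confidence threshold are placeholders for whatever model and confidence measure is used.

```python
def bootstrap_train(seed_labeled, unlabeled, train, classify,
                    confidence_threshold, max_rounds=10):
    """Incremental training: M1 is trained on small corpus C1, labels
    the rest, and high-confidence labels are folded into C1 to form C2;
    M2 is trained on C2, and so on until nothing confident remains.
    """
    corpus = list(seed_labeled)              # C1: small hand-tagged set
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        model = train(corpus)                # Mk trained on Ck
        newly_labeled, still_unlabeled = [], []
        for sentence in remaining:
            sense, confidence = classify(model, sentence)
            if confidence >= confidence_threshold:
                newly_labeled.append((sentence, sense))
            else:
                still_unlabeled.append(sentence)
        if not newly_labeled:                # nothing confident left: stop
            break
        corpus.extend(newly_labeled)         # Ck+1 = Ck + confident labels
        remaining = still_unlabeled
    return train(corpus)
```

Because only strongly confident labels are added each round, the loop sidesteps the circularity problem of fully automatic tagging.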

11

Algorithm

Training:
Manually label a small set of samples.
Segment into phrases & POS-tag.
Record context features.
Fold samples with high comparative evidence back into the training set.

Test:
Input: a sentence containing the target noun.
Check the context features of the target noun.
Compare evidence across senses.
Output: the sense with the most evidence.

12

Comparative Evidence

Definition: choose the sense i maximizing CE_i, where

  CE_i = E_i / (E_1 + E_2 + ... + E_n)   and   E_i = f_i1 + f_i2 + ... + f_im

CE: comparative evidence
n: number of senses
m: number of evidence features found in the test sentence
f_ij: frequency with which feature j is recorded in a sentence containing sense i

Procedure:
Choose the sense with the maximum comparative evidence.
If the largest CE does not exceed the second-largest CE by a threshold, the sentence cannot be classified (margin).
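The comparative-evidence decision rule can be sketched as below. This is an illustrative reading of the slide, not the paper's exact metric: evidence E_i is the summed frequency of observed features for sense i, CE_i normalizes over all senses, and the margin check abstains when the top two senses are too close.

```python
def classify_by_evidence(feature_freq, observed_features, margin=0.1):
    """Choose a sense by comparative evidence, or None if unclassifiable.

    feature_freq[sense][feature] = f_ij, the frequency with which
    feature j was recorded in training sentences for sense i.
    """
    # E_i: total frequency of the observed features under each sense.
    evidence = {
        sense: sum(freqs.get(f, 0) for f in observed_features)
        for sense, freqs in feature_freq.items()
    }
    total = sum(evidence.values())
    if total == 0:
        return None                      # no evidence at all
    # CE_i = E_i / sum of all E.
    ce = {sense: e / total for sense, e in evidence.items()}
    ranked = sorted(ce.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else (None, 0.0)
    if best[1] - second[1] < margin:
        return None                      # margin too small: cannot classify
    return best[0]
```

The sense names and margin value in any call are of course illustrative; the paper's threshold governs how aggressively the bootstrapping step accepts new labels.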

13

Experiment Result – “tank”

[Figure: "Results for word 'tank'" -- accuracy (%) vs. training size (20-70), supervised learning vs. supervised + unsupervised.]

14

Experiment Result – “bank”

[Figure: "Results for word 'bank'" -- accuracy (%) vs. training size (10-50), supervised learning vs. supervised + unsupervised.]

15

Experiment Result – “bass”

[Figure: "Results for word 'bass'" -- accuracy (%) vs. training size (10-25), supervised learning vs. supervised + unsupervised.]

16

Experiment Result – “country”

[Figure: "Results for 'country'" -- accuracy (%) vs. training size (10-40), supervised learning vs. supervised + unsupervised.]

17

Experiment Result – “Record”

[Figure: "Results for 'Record' with Supervised Learning" -- accuracy (%) vs. training size (20-40), series Record1 and Record2.]

Record1: "archived event" vs. "pinnacle achievement"
Record2: "archived event" vs. "musical disk"

18

Conclusions and Future Work

Main advantage: bootstrapping alleviates the tagging bottleneck; no sizable sense-tagged corpus is needed.
Results show the method is successful.
Unsupervised learning helps to improve results on general words, but has limitations on difficult words like "country"; it also helps to reduce the amount of manual work.
Use of partial syntactic information: richer than common statistical techniques.

Proposed improvements:
Bootstrapping from bilingual corpora.
Improve the evidence metric (adjust weights automatically; weight over the entire corpus and over each sense; add more feature types).
Integrate WordNet.

19

Discussion 1: Initial Training

A good training base must already be available, i.e. initial hand-tagging is required. But once training is complete, noun homograph disambiguation is fast.

This initial set is still large (20-30 occurrences for each sense), so the cost of tagging is still high!

20

Discussion 2: Resources

Advantages of an unrestricted corpus compared to dictionaries: it includes sufficient contextual variety and can automatically integrate unfamiliar words.

Assumption: the context around an instance of a sense of the homograph is meaningfully related to that sense.

Is a semantic lexicon needed?
"Numerous residences, banks, and libraries" -- parallel buildings
"They use holes in trees, banks, or rocks for nests" -- parallel natural objects

21

References

Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora.

Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's...

Chin (1999). Word Sense Disambiguation Using Statistical Techniques.

Peh, Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet.

Dagan, I. and Itai (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus.