Natural Language Processing word sense disambiguation Updated 1/12/2005.

Natural Language Processing

word sense disambiguation

Updated 1/12/2005

Overview of the Problem

Problem: many words have different meanings or senses ==> there is ambiguity about how they are to be interpreted.

Task: to determine which of the senses of an ambiguous word is invoked in a particular use of the word. This is done by looking at the context of the word’s use.

Note: more often than not the different senses of a word are closely related.

Word Sense

Many words have several meanings or senses. Consider the word bank (Webster’s New Collegiate):

the rising ground bordering a lake, river, or sea …

an establishment for the custody, loan, exchange, or issue of money, for the extension of credit, and for facilitating the transmission of funds.

However, the senses are not always so well defined. E.g. title:

An identifying name given to a book, play, film, musical composition, or other work.

A general or descriptive heading, as of a book chapter.

Law. A heading that names a document, statute, or proceeding.

A formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, etc.

Overview of our Discussion

Methodology

Supervised Disambiguation: based on a labeled training set.

Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri.

Unsupervised Disambiguation: based on unlabeled corpora.

Methodological Preliminaries

Supervised versus Unsupervised Learning: in supervised learning the sense label of a word occurrence is known. In unsupervised learning, it is not known.

Pseudowords: artificial ambiguous words, created by conflating two or more natural words, that are used to generate evaluation data for comparing and improving text-processing algorithms.

Upper and Lower Bounds on Performance: used to find out how well an algorithm performs relative to the difficulty of the task.
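The pseudoword idea and a simple lower bound can be sketched as follows. This is a minimal illustration, not a method taken from the slides: the function names, the toy corpus, and the word pair banana/door are assumptions chosen for the example. Every occurrence of either word is replaced by one artificial ambiguous token, and the original word is kept as the gold label. The baseline shown is the accuracy of always guessing the most frequent label, a common choice of lower bound.

```python
from collections import Counter

def make_pseudoword_corpus(sentences, word_a, word_b):
    """Replace word_a / word_b by one pseudoword; keep the original word as the gold label."""
    pseudoword = f"{word_a}-{word_b}"
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in (word_a, word_b):
                masked = tokens[:i] + [pseudoword] + tokens[i + 1:]
                examples.append((masked, i, tok))  # (context, position, gold label)
    return examples

def most_frequent_baseline(gold_labels):
    """Lower bound: accuracy obtained by always guessing the most frequent label."""
    return Counter(gold_labels).most_common(1)[0][1] / len(gold_labels)

# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "she ate a banana for lunch".split(),
    "please close the door quietly".split(),
    "the banana was not ripe".split(),
]
examples = make_pseudoword_corpus(corpus, "banana", "door")
baseline = most_frequent_baseline([label for _, _, label in examples])
```

An algorithm evaluated on `examples` is only doing useful work if its accuracy exceeds `baseline`.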

Supervised Disambiguation

Training set: exemplars where each occurrence of the ambiguous word w is annotated with a semantic label ==> Classification problem.

Approaches:

Bayesian Classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words.

Information Theory: only looks at informative features in the context. These features may be sensitive to text structure.

There are many more approaches (see Chapter 16 or the Machine Learning course).

Supervised Disambiguation: Bayesian Classification I

(Gale et al, 1992)’s Idea: to look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection. Instead, it combines the evidence from all features.

Bayes decision rule: decide $s'$ if $P(s' \mid C) > P(s_k \mid C)$ for all $s_k \neq s'$.

$P(s_k \mid C)$ is computed by Bayes’ Rule: $P(s_k \mid C) = \frac{P(C \mid s_k)\, P(s_k)}{P(C)}$.

Supervised Disambiguation: Bayesian Classification II

Naive Bayes assumption: $P(C \mid s_k) = P(\{v_j \mid v_j \text{ in } C\} \mid s_k) = \prod_{v_j \text{ in } C} P(v_j \mid s_k)$

The Naive Bayes assumption is incorrect in the context of text processing, but it is useful.

Two consequences:

The structure and linear ordering of words is ignored: bag of words model.

The presence of one word is assumed to be independent of another, which is clearly untrue in text.

Supervised Disambiguation: Bayesian Classification III

Decision rule for Naive Bayes: decide $s' = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{v_j \text{ in } C} \log P(v_j \mid s_k) \right]$

$P(v_j \mid s_k)$ and $P(s_k)$ are computed via Maximum-Likelihood Estimation, perhaps with appropriate smoothing, from the labeled training corpus.

Performance

Gale, Church, and Yarowsky obtain 90% correct disambiguation on 6 ambiguous nouns in the Hansard corpus using this approach (drug, duty, land, language, position, sentence).
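The Naive Bayes decision rule above can be sketched in a few lines. This is an illustrative sketch, not the implementation of Gale et al.: the function names, the add-alpha smoothing, and the four toy training contexts are assumptions made for the example. Priors and word likelihoods are estimated by counting, and a context is scored by summing log probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_contexts, alpha=1.0):
    """labeled_contexts: list of (sense, context_words). Returns log P(s_k) and log P(v_j | s_k)."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    total = sum(sense_counts.values())
    log_prior = {s: math.log(n / total) for s, n in sense_counts.items()}
    log_likelihood = {}
    for s in sense_counts:
        # Add-alpha smoothing (an assumed choice) keeps unseen words from getting probability 0.
        denom = sum(word_counts[s].values()) + alpha * len(vocab)
        log_likelihood[s] = {v: math.log((word_counts[s][v] + alpha) / denom) for v in vocab}
    return log_prior, log_likelihood

def disambiguate(context, log_prior, log_likelihood):
    """Return argmax over senses of log P(s_k) + sum of log P(v_j | s_k); unknown words are skipped."""
    def score(s):
        return log_prior[s] + sum(log_likelihood[s][v] for v in context if v in log_likelihood[s])
    return max(log_prior, key=score)

# Toy labeled contexts for "drug" (hypothetical data, for illustration only).
training = [
    ("medication", ["prices", "prescription", "patent"]),
    ("medication", ["prescription", "consumer", "increase"]),
    ("illegal substance", ["abuse", "illicit", "cocaine"]),
    ("illegal substance", ["alcohol", "cocaine", "paraphernalia"]),
]
log_prior, log_likelihood = train_naive_bayes(training)
sense_a = disambiguate(["prescription", "prices"], log_prior, log_likelihood)
sense_b = disambiguate(["cocaine", "abuse"], log_prior, log_likelihood)
```

Working in log space avoids numerical underflow when the context window contains many words.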

Supervised Disambiguation: Bayesian Classification IV

Clues for the two senses of drug

Sense               Clues for sense
medication          prices, prescription, patent, increase, consumer
illegal substance   abuse, paraphernalia, illicit, alcohol, cocaine

Supervised Disambiguation: An Information-Theoretic Approach

(Brown et al., 1991)’s Idea: to find a single contextual feature that reliably indicates which sense of the ambiguous word is being used.

The Flip-Flop algorithm is used to disambiguate between the different senses of a word using the mutual information as a measure.

$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$

The algorithm works by searching for a partition of senses that maximizes the mutual information. The algorithm stops when the increase becomes insignificant.

Flip-Flop algorithm

t1, …, tm are translations of an ambiguous word, and x1, …, xn are possible values of the indicator.

find random partition P={P1, P2} of {t1, …, tm}
while (there is a significant improvement) do
    find partition Q={Q1, Q2} of indicators {x1, …, xn} that maximizes I(P;Q)
    find partition P={P1, P2} of translations {t1, …, tm} that maximizes I(P;Q)
end

Flip-Flop: example

Suppose we want to translate prendre based on its object and have {t1, …, tm}={take, make, rise, speak} and {x1, …, xn}={mesure, note, exemple, décision, parole}, and that prendre is translated as take when it occurs with the objects mesure, note, and exemple, and as make, rise, or speak otherwise.

Suppose the initial partition is P1={take, rise} and P2={make, speak}. Then choose the partition Q of indicator values that maximizes I(P;Q), say Q1={mesure, note, exemple} and Q2={décision, parole} (selected if this division gives us the most information for distinguishing translations in P1 from translations in P2).

With this partition, prendre la parole is not translated as rise to speak when it should be (rise is in P1 while parole is in Q2), so repartition as P1={take} and P2={rise, make, speak}, with Q as previously. This is always correct for the take sense.

To distinguish among the others, we would have to consider more than two senses.
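The Flip-Flop loop on the prendre example can be sketched as below. Two things here are assumptions rather than part of the slides: the co-occurrence counts are invented for illustration, and the best partition is found by exhaustively trying every two-way split, which is only feasible for tiny sets (Brown et al. use a more efficient search). The names `mutual_information`, `best_split`, and `flip_flop` are hypothetical.

```python
import math
from itertools import combinations

def mutual_information(counts, P1, Q1):
    """I(P;Q) in bits for the binary partitions (P1 vs. rest) and (Q1 vs. rest)."""
    total = sum(counts.values())
    joint = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for (t, x), n in counts.items():
        joint[(int(t in P1), int(x in Q1))] += n / total
    mi = 0.0
    for (a, b), p in joint.items():
        if p > 0:
            pa = joint[(a, 0)] + joint[(a, 1)]  # marginal of the translation side
            pb = joint[(0, b)] + joint[(1, b)]  # marginal of the indicator side
            mi += p * math.log2(p / (pa * pb))
    return mi

def best_split(items, score):
    """Exhaustive search over all non-trivial subsets (assumed stand-in for an efficient search)."""
    items = sorted(items)
    candidates = [frozenset(c) for r in range(1, len(items)) for c in combinations(items, r)]
    return max(candidates, key=score)

def flip_flop(counts, initial_P1, tol=1e-9):
    translations = {t for t, _ in counts}
    indicators = {x for _, x in counts}
    P1, best = frozenset(initial_P1), -1.0
    while True:
        Q1 = best_split(indicators, lambda q: mutual_information(counts, P1, q))
        P1 = best_split(translations, lambda p: mutual_information(counts, p, Q1))
        mi = mutual_information(counts, P1, Q1)
        if mi - best <= tol:  # stop when the improvement is insignificant
            return P1, Q1, mi
        best = mi

# Hypothetical (translation, object) counts consistent with the example.
counts = {
    ("take", "mesure"): 10, ("take", "note"): 8, ("take", "exemple"): 6,
    ("make", "décision"): 9, ("rise", "parole"): 3, ("speak", "parole"): 4,
}
P1, Q1, mi = flip_flop(counts, initial_P1={"take", "rise"})
```

Because a partition and its complement carry the same information, the returned `P1` may be either {take} or {make, rise, speak}; both describe the same split.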

Dictionary-Based Disambiguation: Overview

We will be looking at three different methods:

Disambiguation based on sense definitions

Thesaurus-Based Disambiguation

Disambiguation based on translations in a second-language corpus

Also, we will show how a careful examination of the distributional properties of senses can lead to significant improvements in disambiguation.

Disambiguation based on sense definitions

(Lesk, 1986)’s Idea: a word’s dictionary definitions are likely to be good indicators for the sense they define.

Express each dictionary sub-definition of the ambiguous word as a bag of words, and express each word occurring in the context of the ambiguous word as a single bag of words drawn from its dictionary definitions (all pooled together).

Disambiguate the ambiguous word by choosing the sub-definition that has the greatest overlap with the bags of words of its context.
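The overlap computation described above can be sketched as follows. This is a simplified illustration: the function name `lesk`, the stopword list, and the miniature dictionary entries are assumptions written for the example, not entries from a real dictionary.

```python
def lesk(sense_definitions, context_words, dictionary, stopwords=frozenset()):
    """Pick the sense whose definition overlaps most with the pooled definitions of the context words."""
    context_bag = set()
    for word in context_words:
        # Pool all dictionary definitions of each context word into one bag.
        for definition in dictionary.get(word, []):
            context_bag.update(definition.lower().split())
    context_bag -= stopwords

    def overlap(sense):
        definition_bag = set(sense_definitions[sense].lower().split()) - stopwords
        return len(definition_bag & context_bag)

    return max(sense_definitions, key=overlap)

# Miniature hand-written dictionary (hypothetical entries, for illustration only).
stopwords = frozenset({"a", "of", "the", "that", "is", "and", "or", "when"})
ash_senses = {
    "tree": "a tree of the olive family",
    "residue": "the solid residue left when combustible material is burned",
}
dictionary = {
    "cigar": ["a roll of tobacco leaves that is burned and smoked"],
    "leaf": ["a flat green part of a tree or plant"],
}
sense_cigar = lesk(ash_senses, ["cigar"], dictionary, stopwords)
sense_leaf = lesk(ash_senses, ["leaf"], dictionary, stopwords)
```

Removing stopwords matters here: without it, function words such as "a" and "of" would dominate the overlap counts.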

Thesaurus-Based Disambiguation

Idea: the semantic categories of the words in a context determine the semantic category of the context as a whole. This category, in turn, determines which word senses are used.

(Walker, 87): each word is assigned one or more subject codes, which correspond to its different meanings. For each subject code, we count the number of words (from the context) having the same subject code. We select the subject code corresponding to the highest count.

(Yarowsky, 92): adapted the algorithm for words that do not occur in the thesaurus but that are very informative. E.g., Navratilova --> Sports
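The counting step of the subject-code method can be sketched as below. The function name, the subject codes, and the tiny thesaurus are hypothetical; a real thesaurus would supply the codes.

```python
from collections import Counter

def walker_disambiguate(target_codes, context_words, thesaurus):
    """Choose the subject code of the target word shared by the most context words."""
    counts = Counter({code: 0 for code in target_codes})
    for word in context_words:
        for code in thesaurus.get(word, ()):
            if code in counts:  # only codes that the ambiguous word can take
                counts[code] += 1
    return max(target_codes, key=lambda code: counts[code])

# Hypothetical subject codes for the context words of "interest".
thesaurus = {
    "bank": {"finance"},
    "loan": {"finance"},
    "rate": {"finance"},
    "hobby": {"curiosity"},
}
chosen = walker_disambiguate(["finance", "curiosity"], ["bank", "loan", "hobby"], thesaurus)
```

Context words missing from the thesaurus contribute nothing, which is the gap the adaptation for informative out-of-thesaurus words addresses.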

Disambiguation based on translations in a second-language corpus

(Dagan & Itai, 91)’s Idea: words can be disambiguated by looking at how they are translated in other languages.

Example: the word “interest” has two translations in German: 1) “Beteiligung” (legal share, as in a 50% interest in the company) 2) “Interesse” (attention, concern, as in her interest in Mathematics).

To disambiguate the word “interest”, we identify the phrase it occurs in, search a German corpus for instances of the phrase, and assign the meaning associated with the German use of the word in that phrase.
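A rough sketch of the lookup step, assuming the phrase has already been reduced to the ambiguous word plus one related word whose German translations are known. The function name, the three German sentences, and the co-occurrence test (both words in the same sentence) are simplifying assumptions, not the procedure of Dagan & Itai.

```python
from collections import Counter

def translation_sense(candidates, related_translations, corpus_sentences):
    """candidates maps each translation of the ambiguous word to a sense label.
    Count how often each translation co-occurs with a translation of the related word."""
    counts = Counter()
    related = {r.lower() for r in related_translations}
    for sentence in corpus_sentences:
        tokens = set(sentence.lower().split())
        for translation in candidates:
            if translation.lower() in tokens and tokens & related:
                counts[translation] += 1
    best = max(candidates, key=lambda t: counts[t])
    return candidates[best], counts

# Toy German corpus (hypothetical sentences, for illustration only).
german_corpus = [
    "Er will Interesse zeigen",
    "Sie zeigen Interesse an Mathematik",
    "Die Firma will eine Beteiligung erwerben",
]
candidates = {"Beteiligung": "legal share", "Interesse": "attention, concern"}
# English phrase "show interest": the related word "show" translates to "zeigen".
sense, counts = translation_sense(candidates, ["zeigen"], german_corpus)
```

Here "zeigen" co-occurs only with "Interesse", so the phrase is assigned the attention/concern sense.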

One sense per discourse, one sense per collocation

(Yarowsky, 1995)’s Idea: there are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation:

One sense per discourse: the sense of a target word is highly consistent within any given document.

One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.
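The one-sense-per-discourse constraint can be applied as a simple post-processing step, sketched below. This is an illustrative simplification, not the full algorithm of Yarowsky: the function name and the sense labels are hypothetical, and the majority sense among the occurrences already labeled in a document is simply propagated to every occurrence in that document.

```python
from collections import Counter

def one_sense_per_discourse(doc_labels):
    """doc_labels: sense label (or None if undecided) for each occurrence in one document."""
    known = [label for label in doc_labels if label is not None]
    if not known:
        return list(doc_labels)  # nothing to propagate
    majority, _ = Counter(known).most_common(1)[0]
    return [majority] * len(doc_labels)

# Four occurrences of "plant" in one document (hypothetical labels).
relabeled = one_sense_per_discourse(["living", None, "living", "factory"])
```

The unlabeled occurrence is filled in, and the minority label is overridden by the document-level majority.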

Unsupervised Disambiguation

Idea: disambiguate word senses without having recourse to supporting tools such as dictionaries and thesauri and in the absence of labeled text. Simply cluster the contexts of an ambiguous word into a number of groups and discriminate between these groups without labeling them.

(Schütze, 1998): The probabilistic model is the same Bayesian model as the one used for supervised classification, but the P(vj | sk) are estimated using the EM algorithm.

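EM algorithm

Initialize the parameters of the model $\mu$. These are $P(v_j \mid s_k)$ and $P(s_k)$, $j = 1, 2, \dots, J$, $k = 1, 2, \dots, K$.

Compute the log likelihood of the corpus $C$ given the model $\mu$: $l(C \mid \mu) = \log \prod_{i} \sum_{k} P(c_i \mid s_k)\, P(s_k)$

While $l(C \mid \mu)$ increases, repeat:

E-step: $h_{ik} = \frac{P(c_i \mid s_k)\, P(s_k)}{\sum_{k'} P(c_i \mid s_{k'})\, P(s_{k'})}$ (use Naive Bayes to compute $P(c_i \mid s_k)$)

M-step: re-estimate the parameters $P(v_j \mid s_k)$ and $P(s_k)$ by MLE:

$P(v_j \mid s_k) = \frac{\sum_{c_i} h_{ik}}{Z_j}$, where the sum is over all contexts $c_i$ in which $v_j$ occurs and $Z_j$ is a normalizing constant.

$P(s_k) = \frac{\sum_{i} h_{ik}}{\sum_{k} \sum_{i} h_{ik}} = \frac{\sum_{i} h_{ik}}{I}$, where $I$ is the number of contexts.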
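The E-step and M-step above can be sketched as follows. This is a minimal illustration under stated assumptions, not Schütze's implementation: the function name `em_wsd`, the random initialization, the small probability floor, and the toy contexts are all invented for the example. Each context is treated as a set of words, and the word probabilities of each sense are normalized over the vocabulary so that they sum to one, which is one possible reading of the normalizing constant.

```python
import math
import random

def em_wsd(contexts, K, iterations=50, seed=0, tol=1e-6):
    rng = random.Random(seed)
    ctxs = [set(c) for c in contexts]
    vocab = sorted({v for c in ctxs for v in c})
    p_s = [1.0 / K] * K  # P(s_k), uniform start
    p_v = []             # p_v[k][v] = P(v | s_k), random start
    for _ in range(K):
        weights = {v: rng.random() + 0.5 for v in vocab}
        z = sum(weights.values())
        p_v.append({v: w / z for v, w in weights.items()})
    prev = -math.inf
    h = []
    for _ in range(iterations):
        # E-step: posterior h[i][k] of sense k for context i (Naive Bayes likelihood).
        h, loglik = [], 0.0
        for c in ctxs:
            joint = [p_s[k] * math.prod(p_v[k][v] for v in c) for k in range(K)]
            total = sum(joint)
            loglik += math.log(total)
            h.append([j / total for j in joint])
        if loglik - prev < tol:  # stop when the log likelihood no longer increases
            break
        prev = loglik
        # M-step: re-estimate P(v | s_k) and P(s_k) from the posteriors.
        for k in range(K):
            counts = {v: 1e-6 for v in vocab}  # tiny floor (assumption) to avoid zeros
            for c, h_i in zip(ctxs, h):
                for v in c:
                    counts[v] += h_i[k]
            z = sum(counts.values())
            p_v[k] = {v: n / z for v, n in counts.items()}
            p_s[k] = sum(h_i[k] for h_i in h) / len(ctxs)
    return p_s, p_v, h

# Toy contexts of an ambiguous word such as "bank" (hypothetical data).
contexts = [
    ["money", "loan"], ["money", "credit"], ["loan", "credit"],
    ["river", "water"], ["river", "shore"], ["water", "shore"],
]
p_s, p_v, h = em_wsd(contexts, K=2)
```

For realistic context sizes the product in the E-step should be computed in log space to avoid underflow; the plain product is kept here for readability.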

Disambiguation

Once the model parameters have been estimated, a word w can be disambiguated by computing the probability of each sense given the words $v_j$ in the context.

Again we use the Naïve Bayes assumption: decide $s' = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{v_j \text{ in } C} \log P(v_j \mid s_k) \right]$

Performance of unsupervised disambiguation

It is capable of identifying minute differences in senses, e.g. a bank in the physical sense and in the abstract sense.

Usually the clusters obtained are not identical with dictionary senses.

Results of unsupervised disambiguation (Schütze, 1998)

Word     Sense                   Mean accuracy (%)
suit     lawsuit                 95
         garment                 96
motion   physical movement       85
         proposal for action     88
train    line of railroad cars   79
         teach                   55