Chapter 5: Collocations ∗
presented by Dustin Boswell
May 3, 2004
∗from Foundations of Statistical Natural Language Processing
What is a Collocation?
-“An expression consisting of two or more words that correspond to some conventional way of saying things.” - Ch. 5 of FSNLP
-“Collocations of a given word are statements of the habitual or customary places of that word.” -Firth (1957)
-“A phrase that means more than the sum of its parts.” -Dustin
There is no exact definition.
Examples: ‘‘strong tea’’, ‘‘New York’’, ‘‘weapons of mass destruction’’, etc.
What is a Collocation?
Characteristics:
• non-compositionality
Eg: white wine (wine isn’t white...)
• non-substitutability
Eg: yellow wine in place of white wine (doesn’t work)
• non-modifiability
Eg: I have a (slimy?) frog in my throat
Non-Examples:
• of the
• doctor ... nurse
(related words are simply co-occurrences)
Why care about Collocations?
• Sentence Parsing:
Helps identify noun/verb phrases.
• Natural Language Generation & Translation:
Avoid awkward output like
‘‘powerful tea’’ or ‘‘to take a decision’’
• Dictionary Building:
Identify phrases that essentially act like
individual words
How do we find Collocations?
• Counting frequencies of adjacent words
• Mutual Information between words
• Hypothesis Testing
– t test
– t test of differences
– $\chi^2$ test
– likelihood ratios
Counting Frequencies of Adjacent Words
- Method: Simply choose the most frequent adjacent pairs.
- Difficulty: prepositions are frequent.
- Results:
Counting Frequencies of Adjacent Words
- Fix: only look for phrases with special “part of speech patterns”
- Results:
Counting Frequencies of Adjacent Words
Summary
+ Easy to implement
+ Gets the simple cases right
- Too sensitive to frequent bigrams. (strong man)
- Ignores rare bigrams
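The counting method and the part-of-speech fix can be sketched in a few lines of Python. This is only an illustration: the toy sentence, the hand-assigned tags, and the two allowed patterns (adjective-noun, noun-noun) are assumptions made here, and a real system would use a tagger over a large corpus.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def filter_by_pos(counts, tags, patterns):
    """Keep only bigrams whose (tag1, tag2) pair is an allowed pattern."""
    return {
        pair: c
        for pair, c in counts.items()
        if (tags.get(pair[0]), tags.get(pair[1])) in patterns
    }

# Hypothetical toy corpus; real counts come from millions of words.
tokens = ("he drank strong tea in the city of new york and "
          "the tea in new york was strong tea").split()

# Hand-assigned tags (A = adjective, N = noun); untagged words never match.
tags = {"strong": "A", "new": "A", "tea": "N", "york": "N", "city": "N"}
patterns = {("A", "N"), ("N", "N")}

counts = bigram_counts(tokens)
filtered = filter_by_pos(counts, tags, patterns)

print(counts.most_common(3))   # raw counts include pairs like ("tea", "in")
print(filtered)                # only tag patterns that look like phrases
```

Even in this tiny example the raw counts rank "tea in" as highly as "strong tea", while the pattern filter keeps only the phrase-like pairs.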
Pointwise Mutual Information
$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
w1 and w2 are values of random variables, like word tokens.
Don’t confuse this with the usual Mutual Information:
$$\mathrm{MI}(W_1; W_2) = E[\,\mathrm{PMI}(w_1, w_2)\,] = \sum_{w_1, w_2} P(w_1, w_2) \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
W1 and W2 are random variables, like word locations.
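To make the distinction concrete, here is a small Python sketch of both quantities. The function names are made up for this example; the counts are those of the "new companies" example in the t test slides, and the exact corpus size 14,307,668 is an assumption (the slides round it to 14 million).

```python
import math

def pmi(c12, c1, c2, n):
    """Pointwise mutual information (in bits) of one bigram, from raw counts."""
    p12, p1, p2 = c12 / n, c1 / n, c2 / n
    return math.log2(p12 / (p1 * p2))

def mutual_information(joint):
    """Expected PMI over a joint distribution given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# PMI scores one specific pair of words (assumed corpus size, see above).
print(pmi(8, 15828, 4675, 14_307_668))

# MI averages PMI over all outcomes; two perfectly correlated fair coins share 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))
```

PMI takes the counts of one bigram, whereas MI needs the whole joint distribution of the two variables.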
Pointwise Mutual Information
- Method: Choose bigrams with highest PMI = I(w1, w2)
- Results:
Pointwise Mutual Information
- Difficulty: PMI is too sensitive to rare bigrams
- The problem is that the ratio $\frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$ easily becomes large for infrequent individual words.
Pointwise Mutual Information
- Possible Fixes:
• Ignore bigrams that occur fewer than (say) 20 times
• Redefine PMI(w1, w2)′ = C(w1, w2) ∗ PMI(w1, w2)
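Both fixes are easy to try out. The sketch below uses hypothetical counts and function names chosen for this example: a pair of words that each occur once (and together) gets a very high plain PMI, and each fix pushes it back down the ranking.

```python
import math

def pmi(c12, c1, c2, n):
    """Plain PMI in bits, computed from counts."""
    return math.log2(c12 * n / (c1 * c2))

def pmi_with_cutoff(c12, c1, c2, n, min_count=20):
    """Fix 1: return None for bigrams seen fewer than min_count times."""
    return pmi(c12, c1, c2, n) if c12 >= min_count else None

def weighted_pmi(c12, c1, c2, n):
    """Fix 2: scale PMI by the bigram count C(w1, w2)."""
    return c12 * pmi(c12, c1, c2, n)

N = 1_000_000                   # hypothetical corpus size
rare = (1, 1, 1, N)             # two words seen once each, together
frequent = (100, 200, 300, N)   # a well-attested pair

print(pmi(*rare), pmi(*frequent))                     # rare pair wins on plain PMI
print(pmi_with_cutoff(*rare))                         # dropped by the cutoff
print(weighted_pmi(*rare), weighted_pmi(*frequent))   # frequent pair wins
```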
Hypothesis Testing
- We really just want to know if words collocate
more often than chance.
- Define a null hypothesis H0 that says two words
are independent:
P (w1w2) = P (w1)P (w2)
- If (w1, w2) is a collocation, the hypothesis should
be rejected at some significance level.
Hypothesis Testing: The t test
• H0: we have data coming from a normal
distribution with mean µ.
• Data: we observe N points with sample mean $\bar{x}$
and sample variance $s^2$
• Compute the t statistic: $t = \dfrac{\bar{x} - \mu}{\sqrt{s^2 / N}}$
• If t is greater than some threshold, reject H0.
• The threshold (lookup in table) is 2.576 for large
N and 99.5% confidence.
The t test Applied to Collocations
-Our statistic is the frequency of the bigram.
• µ is the frequency assuming the words are
independent
• $\bar{x}$ is the observed frequency
• $s^2$ is the observed variance (of this ’binomial’)
• The ’frequency’ is really just the probability as
calculated by simply counting:
$$P(w_1 w_2) = \frac{\mathrm{Count}(w_1 w_2)}{N}$$
Hypothesis Testing: The t test: Example
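• We have a corpus with
– N = 14 million words (14,307,668 tokens in the book's corpus; the figures below use the exact value).
– C(new) = 15,828
– C(companies) = 4675
• H0: ‘‘new companies’’ occurs with probability
$$\mu = P(\text{new})\,P(\text{companies}) = \frac{15{,}828}{N} \times \frac{4675}{N} \approx 3.6 \times 10^{-7}$$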
Hypothesis Testing: The t test: Example
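• Data: we observe 8 occurrences of new companies, so
$\bar{x} = 8 / N \approx 5.6 \times 10^{-7}$.
• For a Bernoulli trial, $s^2 = p(1 - p) \approx p$ (for small p).
• Compute the t statistic:
$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} = \frac{5.6 \times 10^{-7} - 3.6 \times 10^{-7}}{\sqrt{\dfrac{5.6 \times 10^{-7}}{14\text{ million}}}} \approx 1.00$$
• t is not greater than 2.576, so we cannot reject H0: new companies is not
accepted as a collocation.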
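The worked example can be reproduced with a short Python sketch. The function name is invented here, and the exact corpus size is an assumption (the slides round it to 14 million); the function also assumes the bigram occurs at least once, since otherwise the variance estimate is zero.

```python
import math

def t_score(c12, c1, c2, n):
    """t statistic for the bigram w1 w2 under H0: P(w1 w2) = P(w1) P(w2)."""
    mu = (c1 / n) * (c2 / n)     # expected bigram probability under H0
    x_bar = c12 / n              # observed bigram probability
    s2 = x_bar * (1 - x_bar)     # Bernoulli variance, very close to x_bar
    return (x_bar - mu) / math.sqrt(s2 / n)

# Assumed exact corpus size; with the rounded 14 million, t comes out near 0.96.
N = 14_307_668
t = t_score(c12=8, c1=15828, c2=4675, n=N)

print(t)            # close to 1
print(t > 2.576)    # not significant at the threshold used in the slides
```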
Hypothesis Testing: The t test
- Method: Choose bigrams with highest t-statistic
- Results:
Hypothesis Testing: The t test of differences
• Consider two words with similar meaning:
strong, and powerful
• We want to find collocates that best distinguish
the usage of the two.
Ex: powerful computer vs. strong computer
• H0: we expect both pairs to occur just as
frequently.
• Compute a similar t statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{N}}}$$
• Find words (like computer) with highest t score.
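A minimal Python sketch of this statistic follows. The counts are hypothetical (they are not given in the slides), the function name is invented, and the Bernoulli variance approximation from the single-bigram t test is reused. With these approximations the score reduces to roughly (c_a - c_b) / sqrt(c_a + c_b).

```python
import math

def t_difference(c_a, c_b, n):
    """t statistic comparing how often a word follows w_a versus w_b."""
    x1, x2 = c_a / n, c_b / n
    s1, s2 = x1 * (1 - x1), x2 * (1 - x2)   # Bernoulli variances
    return (x1 - x2) / math.sqrt((s1 + s2) / n)

# Hypothetical counts: "powerful computer" seen 10 times, "strong computer" never.
N = 14_000_000
t = t_difference(10, 0, N)
print(t)   # about sqrt(10): "computer" is characteristic of "powerful"
```

The sketch assumes at least one of the two counts is nonzero; otherwise both variances vanish and the statistic is undefined.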
Hypothesis Testing: The t test of differences
- Results:
Hypothesis Testing: Pearson’s chi-square test
• t-test has been criticized because it assumes the data is
normally distributed.
• Pearson’s $\chi^2$ test also starts by assuming words are
independent.
• First, compute a table of observed values:
Hypothesis Testing: Pearson’s chi-square test
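• The $X^2$ statistic is computed as
$$X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where i and j range over all rows and columns of the table.
• $O_{ij}$ is the observed value (in the table)
• $E_{ij}$ is the expected value (if words were truly
independent). For example, to compute $E_{11}$:
$$E(\text{new companies}) = \frac{C(\text{new})}{N} \times \frac{C(\text{companies})}{N} \times N = (3.6 \times 10^{-7}) \times 14\text{ million} \approx 5.2$$
(the result 5.2 is computed from the unrounded values)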
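As a rough check, the statistic can be computed in Python directly from the counts of the "new companies" example. The 2x2 table is derived here from C(new), C(companies), the bigram count 8, and N, which is an assumption since the observed table itself is not reproduced in this transcript; the exact corpus size and the function name are assumptions as well.

```python
def chi_square(c12, c1, c2, n):
    """Pearson's X^2 for the 2x2 table of a bigram (w1, w2)."""
    # Observed counts: rows are w1 / not w1, columns are w2 / not w2.
    observed = [
        [c12,      c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # E_ij under independence
            x2 += (observed[i][j] - expected) ** 2 / expected
    return x2

N = 14_307_668                       # assumed exact corpus size
x2 = chi_square(8, 15828, 4675, N)
print(x2)
# 3.841 is the standard table value for 1 degree of freedom at the 5% level;
# the slides only say "some threshold", so this choice is an assumption.
print(x2 > 3.841)
```

With these counts the statistic stays well below the threshold, which agrees with the t test verdict on new companies.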
Hypothesis Testing: Pearson’s chi-square test
• Again, if the $X^2$ statistic is above some threshold we
accept the collocation.
• The top 20 collocations are the same for $\chi^2$ and t tests.
• But the $\chi^2$ test is considered more robust, and is more
frequently used.
Hypothesis Testing: Pearson’s chi-square test
• Consider another application: Learning word-to-word
translations from an aligned corpus.
• Here are observations for how often the French vache
was aligned with the English cow:
• $X^2$ = 456400 (very high), so (vache, cow) is a likely
translation pair.
Hypothesis Testing: Likelihood ratios
• Another approach to hypothesis testing
• We consider two hypotheses:
– Hypothesis 1. (Two words are independent.)
p = P (w2|w1) = P (w2|¬w1) = P (w2)
– Hypothesis 2. (w2 depends on w1.)
p1 = P (w2|w1)
p2 = P (w2|¬w1)
p1 ≠ p2
Hypothesis Testing: Likelihood ratios
Quick Notation
• c1 = C(w1)
• c2 = C(w2)
• c12 = C(w1w2)
• We will use the binomial model
$$b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$
- A coin is biased to heads with probability p.
- Flip the coin n times.
- b(k;n, p) is the probability of k heads.
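The binomial model is a one-liner in Python. This is a minimal sketch using the standard library's `math.comb` (Python 3.8 or later); the function name `binom` is chosen here for illustration.

```python
import math

def binom(k, n, p):
    """b(k; n, p): probability of exactly k heads in n flips of a p-biased coin."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom(1, 2, 0.5))                           # one head in two fair flips
print(sum(binom(k, 10, 0.3) for k in range(11)))  # probabilities sum to 1
```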
The World According to H1
• What we expect:
– P(w2|w1) = P(w2) = p = c2 / N
– P(w2|¬w1) = P(w2) = p = c2 / N
• In actuality, w1w2 occurred c12 times - how likely is this?
• We are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p.
– Each time ¬w1 appears, w2 should follow with prob p.
The Likelihood of our data, according to H1
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability b(c2 − c12; N − c1, p).
• The total probability (likelihood) of all the data is
simply the product:
L(H1) = P ( all the times we saw w2 )
= b(c12; c1, p) ∗ b(c2 − c12;N − c1, p)
The World According to H2
• What we expect:
– P(w2|w1) = p1 = c12 / c1
– P(w2|¬w1) = p2 = (c2 − c12) / (N − c1)
• In actuality, w1w2 occurred c12 times - how likely is this?
• Well, we are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p1.
– Each time ¬w1 appears, w2 should follow with prob p2.
The Likelihood of our data, according to H2
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p1).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability
b(c2 − c12;N − c1, p2).
• The total probability (likelihood) of all the data is
simply the product:
L(H2) = P ( all the times we saw w2 )
= b(c12; c1, p1) ∗ b(c2 − c12;N − c1, p2)
The Likelihood Ratio - general hypothesis testing
• We are given two hypotheses and some data.
• We have no reason to believe one over the other
(P (H1) = P (H2))
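• We pick H2 if
$$\frac{P(H_2 \mid \text{data})}{P(H_1 \mid \text{data})} = \frac{P(\text{data} \mid H_2)\,P(H_2)}{P(\text{data} \mid H_1)\,P(H_1)} = \frac{P(\text{data} \mid H_2)}{P(\text{data} \mid H_1)}$$
is large.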
The Likelihood Ratio
• We want bigrams with a large ratio
$$\frac{L(H_2)}{L(H_1)}$$
• To make the numbers nice, we equivalently find bigrams
with large
$$-2 \log \frac{L(H_1)}{L(H_2)}$$
• It turns out that the value $-2 \log \frac{L(H_1)}{L(H_2)}$ is
“asymptotically $\chi^2$ distributed” (more so than the $X^2$
statistic).
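Putting the H1 and H2 likelihoods together, the statistic can be sketched in Python as follows. It works in log space, and because the binomial coefficients are identical under both hypotheses they cancel in the ratio and are omitted. The helper names are invented for this sketch, the counts are those of the "new companies" example, and the exact corpus size is an assumption.

```python
import math

def log_l(k, n, p):
    """log of p^k (1-p)^(n-k); the binomial coefficient is left out on purpose."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log1p(-p)
    return total

def log_likelihood_ratio(c12, c1, c2, n):
    """-2 log(L(H1) / L(H2)) for the bigram w1 w2."""
    p = c2 / n                     # H1: w2 is independent of w1
    p1 = c12 / c1                  # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)     # H2: P(w2 | not w1)
    log_h1 = log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
    log_h2 = log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
    return -2 * (log_h1 - log_h2)

N = 14_307_668   # assumed exact corpus size
stat = log_likelihood_ratio(8, 15828, 4675, N)
print(stat)      # small value: little evidence against independence
```

Since H2 is fitted to the data, L(H2) is never smaller than L(H1), so the statistic is non-negative; larger values mean stronger evidence for a collocation.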
Using The Likelihood Ratio
Conclusion
• Counting Frequencies of adjacent words
- too sensitive to frequent pairs
• Mutual Information between words
- too sensitive to rare words
• Hypothesis Testing
– t test - okay
– $\chi^2$ test - better
– likelihood ratios - even better
The end
Questions?