Measures of Coincidence Vasileios Hatzivassiloglou University of Texas at Dallas.
Measures of Coincidence
Vasileios Hatzivassiloglou
University of Texas at Dallas
A study of different measures
• Smadja, McKeown, and Hatzivassiloglou (1996): Translating Collocations for Bilingual Lexicons: A Statistical Approach
• Use aligned parallel corpora (Hansards)
• Task: Find translation for a word group across languages
Sketch of algorithm
• Start with set of collocations in French
• Find candidate single-word translations according to the association between the original collocation and each candidate
• Measure association between source collocation and pairs of candidate words
• Expand iteratively to triplets, etc. by recalculating association
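The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy sentence-aligned corpus, the use of Dice as the association score, the fixed threshold of 0.6, and the rule of keeping the largest surviving word group are all assumptions made for the example. Word order is ignored.

```python
from itertools import combinations

def dice(n_both, n_src, n_tgt):
    """Dice coefficient from co-occurrence counts over aligned sentence pairs."""
    total = n_src + n_tgt
    return 2 * n_both / total if total else 0.0

def translate(collocation, corpus, threshold=0.6, max_size=3):
    """Greedy sketch: score single words, then pairs, triplets, and so on.

    corpus is a list of (source_word_set, target_word_set) aligned pairs.
    threshold and max_size are illustrative assumptions.
    """
    src = frozenset(collocation)
    src_hits = [src <= s for s, _ in corpus]  # sentences containing the collocation

    def score(words):
        tgt_hits = [words <= t for _, t in corpus]
        n_both = sum(a and b for a, b in zip(src_hits, tgt_hits))
        return dice(n_both, sum(src_hits), sum(tgt_hits))

    # Round 1: candidate single-word translations.
    vocab = set().union(*(t for _, t in corpus))
    best = {}
    for w in vocab:
        ws = frozenset([w])
        sc = score(ws)
        if sc >= threshold:
            best[ws] = sc
    singles = sorted(w for ws in best for w in ws)

    # Later rounds: recompute the association for larger groups of candidates.
    for size in range(2, max_size + 1):
        grown = {}
        for combo in combinations(singles, size):
            ws = frozenset(combo)
            sc = score(ws)
            if sc >= threshold:
                grown[ws] = sc
        if not grown:
            break
        best = grown
    return max(best.items(), key=lambda kv: kv[1]) if best else None

# Hypothetical toy corpus (French source, English target).
corpus = [
    ({"le", "premier", "ministre", "parle"}, {"the", "prime", "minister", "speaks"}),
    ({"premier", "ministre", "arrive"}, {"prime", "minister", "arrives"}),
    ({"le", "ministre", "parle"}, {"the", "minister", "speaks"}),
    ({"premier", "jour"}, {"first", "day"}),
    ({"le", "jour", "arrive"}, {"the", "day", "arrives"}),
]
result = translate({"premier", "ministre"}, corpus)
print(result)
```

On this toy data only "prime" and "minister" pass the first round, and their pair is the only group that survives the second.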
Dice vs. SI
• Dice depends on conditional probabilities only
• SI (specific mutual information) also depends on the marginals: log P(X|Y) − log P(X)
• SI depends on how rare X is
• Limit behavior
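The contrast can be written out explicitly. The derivation below assumes the standard definitions of the Dice coefficient and of specific (pointwise) mutual information for binary indicators X and Y; the slide itself gives only the second form of SI.

```latex
\begin{align*}
\mathrm{Dice}(X,Y) &= \frac{2\,P(X{=}1,Y{=}1)}{P(X{=}1)+P(Y{=}1)}
  = \frac{2}{\dfrac{1}{P(X{=}1\mid Y{=}1)}+\dfrac{1}{P(Y{=}1\mid X{=}1)}} \\[4pt]
\mathrm{SI}(X,Y) &= \log_2\frac{P(X{=}1,Y{=}1)}{P(X{=}1)\,P(Y{=}1)}
  = \log_2 P(X{=}1\mid Y{=}1) - \log_2 P(X{=}1)
\end{align*}
```

Dice is the harmonic mean of the two conditional probabilities, so the marginals never appear on their own; SI keeps the marginal P(X=1) as a separate term.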
Asymmetry
• Many kinds of asymmetry
  – Between X and Y
  – Between X=1 and X=0
  – 1-1 matches versus 0-0 matches
• Adding 0-0 matches does not change Dice
• Adding 0-0 matches always increases SI
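A small check of the two claims about 0-0 matches, assuming the usual count-based forms of Dice and SI (the function names and the counts are illustrative):

```python
import math

def dice(n11, n10, n01):
    """Dice coefficient; the 0-0 count never enters the formula."""
    return 2 * n11 / (2 * n11 + n10 + n01)

def si(n11, n10, n01, n00):
    """Specific mutual information in bits, from a 2x2 table of counts."""
    n = n11 + n10 + n01 + n00
    return math.log2(n * n11 / ((n11 + n10) * (n11 + n01)))

# Keep the 1-1, 1-0 and 0-1 counts fixed and add more 0-0 sentences.
for n00 in (92, 992, 9992):
    print(n00, dice(2, 3, 3), round(si(2, 3, 3, n00), 3))
```

The total count n appears only in the numerator of the SI ratio, which is why every extra 0-0 sentence pushes SI up while Dice stays put.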
Effect of asymmetry
• Hypothetical scenario on 100 sentences
• A,B appear together twice, by themselves three times each
• Dice: 2×2 / (5+5) = 0.4
• SI: log₂(0.02 / (0.05×0.05)) = log₂ 8 = 3 bits
• MI: 0.0457 bits
Reversing ones and zeroes
• Now replace every 1 with 0 and vice versa
• New variables A′, B′ occur together 92 times, each occurs by itself three times
• Dice: 2×92 / (95 + 95) = 0.9684
• MI: Unchanged (0.0457 bits)
• SI: log₂(0.92 / (0.95×0.95)) = 0.0277 bits
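The figures for the original and the reversed scenario can be reproduced with a short script. It assumes base-2 logarithms and takes MI to be the average mutual information over all four cells of the 2×2 table; the function name is illustrative.

```python
import math

def measures(n11, n10, n01, n00):
    """Return (Dice, SI, MI) for a 2x2 table of sentence counts; SI and MI in bits."""
    n = n11 + n10 + n01 + n00
    px, py = (n11 + n10) / n, (n11 + n01) / n
    dice = 2 * n11 / (2 * n11 + n10 + n01)
    si = math.log2((n11 / n) / (px * py))
    # Average mutual information sums over all four cells, not just the 1-1 cell.
    cells = [
        (n11 / n, px, py),
        (n10 / n, px, 1 - py),
        (n01 / n, 1 - px, py),
        (n00 / n, 1 - px, 1 - py),
    ]
    mi = sum(p * math.log2(p / (a * b)) for p, a, b in cells if p > 0)
    return dice, si, mi

original = measures(2, 3, 3, 92)   # A, B
flipped = measures(92, 3, 3, 2)    # A', B': every 1 replaced by 0 and vice versa
print([round(v, 4) for v in original])
print([round(v, 4) for v in flipped])
```

Flipping only swaps the 1-1 and 0-0 cells, so the symmetric sum that defines MI is unchanged while Dice and SI, which look only at the 1-1 cell, move in opposite directions.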
Explaining the behavior
• Limit effect as P(X) decreases with P(X|Y) constant
• P(X) eventually dominates SI
• Makes SI (and MI) more sensitive to estimation errors
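One way to make the limit argument concrete, assuming SI = log₂ P(X|Y) − log₂ P(X) and writing c for the fixed conditional probability and δ for a small absolute error in the estimate of P(X) (both symbols are introduced here for illustration):

```latex
\begin{align*}
\mathrm{SI} &= \log_2 c - \log_2 P(X)
  \;\longrightarrow\; +\infty \quad \text{as } P(X) \to 0 \\[4pt]
\frac{\partial\,\mathrm{SI}}{\partial P(X)} &= -\frac{1}{P(X)\,\ln 2}
  \quad\Longrightarrow\quad
  |\Delta\,\mathrm{SI}| \approx \frac{\delta}{P(X)\,\ln 2}
\end{align*}
```

The same absolute error δ therefore shifts SI more and more as X becomes rarer, which is the sensitivity the last bullet refers to.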
Bounds and testing purpose
• No upper bound for SI and MI
• Dice is always between 0 and 1
• Easy to test SI/MI for independence
• Easy to test Dice for correlation
Empirical comparison
• How to compare without redoing the entire experiment?
• Solution: Use competing measure in the last round
• Test cases where the correct solution is available
• Provide lower bound on competitor error
Empirical results
• 45 French collocations
• 2 did not produce any candidate translation
• Dice resulted in 36 correct, 7 incorrect translations
• SI resulted in 26 correct, 17 incorrect translations
Re-examining contingency tables
• Ted Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 1993.
• Problem: Asymptotic normality assumptions
• How much data is enough?
• Are researchers aware of the need for statistical validity analysis?
Rarity of words
• Empirical word counts show that 20–30% of words occur less often than once in 50,000 words
• Estimating binomial as normal: Good as long as np(1-p) > 5
• Significance overestimated by 20% for np=1, by a factor of 40 for np=0.1, and by a factor of 10²⁰ for np=0.01
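The size of the error can be checked numerically. The sketch below compares the exact binomial probability of seeing one or more occurrences with the tail of the matching normal distribution. The choice of n = 10,000, the absence of a continuity correction, and the function names are assumptions of this example.

```python
import math

def exact_tail(n, p):
    """Exact binomial P(X >= 1) = 1 - (1 - p)^n, computed stably for small p."""
    return -math.expm1(n * math.log1p(-p))

def normal_tail(n, p):
    """Normal approximation to P(X >= 1), with no continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (1 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def overestimate(np_, n=10_000):
    """Factor by which the normal tail understates the exact probability."""
    p = np_ / n
    return exact_tail(n, p) / normal_tail(n, p)

for np_ in (1.0, 0.1, 0.01):
    print(np_, f"{overestimate(np_):.3g}")
```

A tail probability that is too small makes the event look more significant than it is, so the printed ratio is the factor by which significance is overstated.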
Likelihood in parameter spaces
• Parametric model (known except for parameter values)
• Likelihood function H(ω;k)
• Hypothesis represented by a point ω0
Likelihood ratio
• Test statistic: −2 log λ, where
$$\lambda = \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)}$$
• Rapidly approaches the χ² distribution for binomial H
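For the binomial case the ratio has a closed form. The worked example below assumes two binomial samples with k₁ successes in n₁ trials and k₂ in n₂, and tests the hypothesis p₁ = p₂; the symbols L and the hatted estimates are introduced here for illustration.

```latex
\begin{align*}
L(p; k, n) &= p^{k}(1-p)^{n-k} \\[2pt]
\hat p_1 = \frac{k_1}{n_1}, \quad
\hat p_2 &= \frac{k_2}{n_2}, \quad
\hat p = \frac{k_1+k_2}{n_1+n_2} \\[2pt]
\lambda &= \frac{L(\hat p; k_1, n_1)\,L(\hat p; k_2, n_2)}
               {L(\hat p_1; k_1, n_1)\,L(\hat p_2; k_2, n_2)} \\[2pt]
-2\log\lambda &= 2\bigl[\log L(\hat p_1; k_1, n_1) + \log L(\hat p_2; k_2, n_2)
  - \log L(\hat p; k_1, n_1) - \log L(\hat p; k_2, n_2)\bigr]
\end{align*}
```

The binomial coefficients are omitted because they are identical in numerator and denominator and cancel.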
Comparing to chi-square
• −2 log λ reduces to the same formula as Pearson’s chi-square statistic when the binomial is approximated with a normal distribution
• The two statistics diverge significantly for low np
• −2 log λ still closely follows the chi-square distribution
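The divergence at low np is easy to see on a 2×2 contingency table. The sketch below computes Pearson's statistic and −2 log λ in its expected-count form, 2 Σ O ln(O/E); both tables are hypothetical counts chosen for illustration, not data from the paper.

```python
import math

def expected(table):
    """Expected counts under independence for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def pearson_x2(table):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    exp = expected(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def g2(table):
    """Log-likelihood ratio statistic: 2 * sum of O * ln(O / E); empty cells add 0."""
    exp = expected(table)
    return 2 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)

# Hypothetical tables over 32,000 bigram positions.
common = [[150, 1850], [1850, 28150]]   # frequent words, mild association
rare = [[1, 0], [0, 31999]]             # two words that each occur once, together

for name, table in (("common", common), ("rare", rare)):
    print(name, round(pearson_x2(table), 2), round(g2(table), 2))
```

With frequent words the two statistics roughly agree; for the single rare pair Pearson's statistic equals the corpus size while the log-likelihood ratio stays modest.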
Experimental results
• 32,000 words of financial text from Switzerland
• Find highly correlated word pairs
• Observe top-ranked entries for log-likelihood and chi-square
• Chi-square leads to huge scores for rare pairs
• 2,682 of 2,693 bigrams violate assumptions