Measures of Coincidence Vasileios Hatzivassiloglou University of Texas at Dallas.




A study of different measures

• Smadja, McKeown, and Hatzivassiloglou (1996): Translating Collocations for Bilingual Lexicons: A Statistical Approach

• Use aligned parallel corpora (Hansards)

• Task: Find translation for a word group across languages


Sketch of algorithm

• Start with set of collocations in French

• Find candidate single-word translations, ranked by their association with the original collocation

• Measure association between source collocation and pairs of candidate words

• Expand iteratively to triplets, etc. by recalculating association
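The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `threshold`, the rule of preferring the largest surviving group, and the toy English sentences are all hypothetical, and Dice is used as the association measure.

```python
def dice(src_hits, tgt_hits):
    """Dice coefficient between two sets of aligned-sentence indices."""
    if not src_hits and not tgt_hits:
        return 0.0
    return 2 * len(src_hits & tgt_hits) / (len(src_hits) + len(tgt_hits))

def sentences_containing(group, target_sentences):
    """Indices of target sentences that contain every word in `group`."""
    return {i for i, words in enumerate(target_sentences) if group <= words}

def translate_collocation(src_hits, target_sentences, threshold=0.6, max_size=3):
    """Grow candidate word groups one word at a time, keeping groups whose
    Dice score with the source collocation is at least `threshold`."""
    vocab = set().union(*target_sentences)
    # Round 1: single-word candidates.
    current = {}
    for word in vocab:
        group = frozenset([word])
        score = dice(src_hits, sentences_containing(group, target_sentences))
        if score >= threshold:
            current[group] = score
    best = dict(current)
    # Later rounds: extend surviving groups with surviving words and re-score.
    for _ in range(2, max_size + 1):
        words = set().union(*current) if current else set()
        extended = {}
        for group in current:
            for word in words - group:
                bigger = group | {word}
                if bigger in extended:
                    continue
                score = dice(src_hits, sentences_containing(bigger, target_sentences))
                if score >= threshold:
                    extended[bigger] = score
        if not extended:
            break
        best.update(extended)
        current = extended
    if not best:
        return None
    # Assumption: prefer the largest group, breaking ties by score.
    return max(best.items(), key=lambda item: (len(item[0]), item[1]))

# Toy data: the source collocation occurs in aligned sentences 0, 1 and 2.
target_sentences = [
    {"prime", "minister", "said"},
    {"the", "prime", "minister"},
    {"prime", "minister", "spoke"},
    {"the", "minister", "said"},
    {"prime", "number"},
]
result = translate_collocation({0, 1, 2}, target_sentences)
print(result)
```

On this toy input only "prime" and "minister" survive the first round, and their pair matches the source collocation in every aligned sentence.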


Dice vs. SI

• Dice depends on conditional probabilities only

• SI depends on the marginals: log P(X|Y) - log P(X)

• SI depends on how rare X is

• Limit behavior
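For reference, the two measures can be written out as follows. These are the usual definitions for binary indicator variables, assumed here because they agree with the worked numbers later in the slides; the notation (X=1 meaning "X occurs in the sentence") is an assumption of this note.

```latex
\[
\mathrm{Dice}(X,Y) \;=\; \frac{2\,P(X{=}1,\,Y{=}1)}{P(X{=}1) + P(Y{=}1)}
\;=\; \frac{2}{\dfrac{1}{P(X{=}1 \mid Y{=}1)} + \dfrac{1}{P(Y{=}1 \mid X{=}1)}}
\]
\[
\mathrm{SI}(X,Y) \;=\; \log_2 \frac{P(X{=}1,\,Y{=}1)}{P(X{=}1)\,P(Y{=}1)}
\;=\; \log_2 P(X{=}1 \mid Y{=}1) \;-\; \log_2 P(X{=}1)
\]
```

The second form of Dice is the harmonic mean of the two conditional probabilities, which is why the marginals do not enter on their own; in SI the marginal P(X=1) appears explicitly.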


Asymmetry

• Many kinds of asymmetry
  – Between X and Y
  – Between X=1 and X=0
  – 1-1 matches versus 0-0 matches

• Adding 0-0 matches does not change Dice

• Adding 0-0 matches always increases SI
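A small numeric check of the last two bullets, as a sketch: the 2×2 counts (`n11` for 1-1 matches, `n10` and `n01` for mismatches, `n00` for 0-0 matches) and the helper names are illustrative choices, not part of the original slides.

```python
from math import log2

def dice(n11, n10, n01):
    """Dice coefficient; note that the 0-0 count never appears."""
    return 2 * n11 / ((n11 + n10) + (n11 + n01))

def specific_information(n11, n10, n01, n00):
    """SI in bits: log2 of P(X=1,Y=1) / (P(X=1) P(Y=1))."""
    n = n11 + n10 + n01 + n00
    return log2(n11 * n / ((n11 + n10) * (n11 + n01)))

# Keep the 1-1 and mismatch counts fixed and keep adding 0-0 matches.
rows = []
for n00 in (92, 992, 9992):
    rows.append((n00, dice(2, 3, 3), specific_information(2, 3, 3, n00)))
    print(f"n00={n00:5d}  Dice={rows[-1][1]:.3f}  SI={rows[-1][2]:.3f} bits")
```

Dice stays fixed because the 0-0 count is not in its formula, while SI grows with the total number of sentences.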


Effect of asymmetry

• Hypothetical scenario on 100 sentences

• A and B appear together twice, and each appears without the other three times

• Dice: 2×2 / (5+5) = 0.4

• SI: log2(0.02 / (0.05×0.05)) = 3 bits

• MI: 0.0457 bits
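The MI figure can be reproduced by summing over all four cells of the 2×2 table (2 joint occurrences, 3 + 3 mismatches, 92 sentences with neither). The expansion below is a worked check assuming base-2 logarithms and the standard definition of mutual information:

```latex
\[
\begin{aligned}
\mathrm{MI}(A,B) &= \sum_{a,b \in \{0,1\}} P(a,b)\,\log_2 \frac{P(a,b)}{P(a)\,P(b)} \\
&= 0.02 \log_2 \frac{0.02}{0.05 \times 0.05}
 \;+\; 2 \times 0.03 \log_2 \frac{0.03}{0.05 \times 0.95}
 \;+\; 0.92 \log_2 \frac{0.92}{0.95 \times 0.95} \\
&\approx 0.0600 \;-\; 0.0398 \;+\; 0.0255 \;=\; 0.0457 \text{ bits}
\end{aligned}
\]
```

The first term alone is 0.02 times the SI value of 3 bits; MI averages that quantity over every cell, including the 0-0 cell.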


Reversing ones and zeroes

• Now replace every 1 with 0 and vice versa

• New variables A′, B′ occur together 92 times, each occurs by itself three times

• Dice: 2×92 / (95 + 95) = 0.9684

• MI: Unchanged (0.0457 bits)

• SI: log2(0.92 / (0.95×0.95)) = 0.0277 bits
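The figures on this slide and the previous one can be checked with a short script. It is a sketch: the tuple layout `(n11, n10, n01, n00)` and the function names are choices made here for illustration.

```python
from math import log2

def dice(n11, n10, n01, n00):
    return 2 * n11 / ((n11 + n10) + (n11 + n01))

def specific_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    return log2(n11 * n / ((n11 + n10) * (n11 + n01)))

def mutual_information(n11, n10, n01, n00):
    """Average of log2(P(x,y) / (P(x)P(y))) over all four cells, in bits."""
    n = n11 + n10 + n01 + n00
    x_totals = (n11 + n10, n01 + n00)  # X=1, X=0
    y_totals = (n11 + n01, n10 + n00)  # Y=1, Y=0
    cells = ((n11, 0, 0), (n10, 0, 1), (n01, 1, 0), (n00, 1, 1))
    total = 0.0
    for count, i, j in cells:
        if count:  # empty cells contribute nothing
            total += count / n * log2(count * n / (x_totals[i] * y_totals[j]))
    return total

def flip(n11, n10, n01, n00):
    """Swap 1s and 0s in both variables."""
    return (n00, n01, n10, n11)

original = (2, 3, 3, 92)
flipped = flip(*original)
for name, table in (("original", original), ("flipped", flipped)):
    print(name,
          f"Dice={dice(*table):.4f}",
          f"SI={specific_information(*table):.4f}",
          f"MI={mutual_information(*table):.4f}")
```

MI treats all four cells alike, so relabelling the outcomes cannot change it; Dice and SI look only at the 1-1 cell and the marginals for 1, so both move, in opposite directions.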


Explaining the behavior

• Limit effect as P(X) decreases with P(X|Y) constant

• P(X) eventually dominates SI

• Makes SI (and MI) more sensitive to estimation errors
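The limit effect can be made explicit with a short derivation. It assumes base-2 logarithms and writes c for the fixed value of P(X|Y) and ε for a small estimation error in P(X); both symbols are introduced here for illustration.

```latex
\[
\mathrm{SI} \;=\; \log_2 P(X \mid Y) \;-\; \log_2 P(X)
\quad\Longrightarrow\quad
\lim_{P(X) \to 0} \mathrm{SI} \;=\; \log_2 c \;-\; \lim_{P(X) \to 0} \log_2 P(X) \;=\; +\infty
\]
\[
\frac{\partial\, \mathrm{SI}}{\partial P(X)} \;=\; -\frac{1}{P(X)\,\ln 2}
\quad\Longrightarrow\quad
\bigl|\Delta \mathrm{SI}\bigr| \;\approx\; \frac{\varepsilon}{P(X)\,\ln 2}
\]
```

For a fixed absolute error ε, the shift in SI grows in proportion to 1/P(X), so the rarer X is, the more an estimation error in P(X) distorts the score.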


Bounds and testing purpose

• No upper bound for SI and MI

• Dice is always between 0 and 1

• Easy to test SI/MI for independence

• Easy to test Dice for correlation


Empirical comparison

• How to compare without redoing the entire experiment?

• Solution: Use competing measure in the last round

• Test cases where the correct solution is available

• Provide lower bound on competitor error


Empirical results

• 45 French collocations

• 2 did not produce any candidate translation

• Dice resulted in 36 correct, 7 incorrect translations

• SI resulted in 26 correct, 17 incorrect translations


Re-examining contingency tables

• Ted Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 1993.

• Problem: Asymptotic normality assumptions

• How much data is enough?

• Are researchers aware of the need for statistical validity analysis?


Rarity of words

• Empirical word counts show that 20–30% of words occur less than once per 50,000 words

• Approximating the binomial with a normal distribution: good as long as np(1-p) > 5

• Significance overestimated by about 20% for np=1, by a factor of about 40 for np=0.1, and by a factor of about 10^20 for np=0.01
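One way to see where overestimates of this size come from is to compare the exact binomial probability of seeing a rare word at least once with the tail of the matching normal distribution. The sketch below makes several assumptions that the slide does not state: "significance" is read as the tail probability of one or more occurrences, no continuity correction is applied, and the corpus size `n` is an arbitrary illustrative value.

```python
import math

def exact_tail(n, p):
    """Exact binomial P(K >= 1) = 1 - (1 - p)^n, computed stably."""
    return -math.expm1(n * math.log1p(-p))

def normal_tail(n, p):
    """Normal approximation to P(K >= 1): upper tail beyond k = 1 of a
    normal with mean np and variance np(1-p), no continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (1 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def overestimate(n, expected):
    """How many times smaller the normal tail is than the exact tail,
    for a word whose expected count is `expected` (that is, np)."""
    p = expected / n
    return exact_tail(n, p) / normal_tail(n, p)

n = 100_000  # hypothetical corpus size
for expected in (1.0, 0.1, 0.01):
    print(f"np={expected:<5} exact/normal tail ratio = {overestimate(n, expected):.3g}")
```

A ratio above 1 means the normal approximation makes a single occurrence look rarer, and hence more significant, than it really is; the ratio grows very quickly as np falls.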


Likelihood in parameter spaces

• Parametric model (known except for parameter values)

• Likelihood function H(ω;k)

• Hypothesis represented by a point ω0


Likelihood ratio

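• Likelihood ratio: $\lambda = \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)}$, where $\Omega$ is the full parameter space and $\Omega_0$ is the part of it allowed by the hypothesis

• Test statistic: $-2 \log \lambda$

• Rapidly approaches the $\chi^2$ distribution for binomial H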
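As a worked instance, consider a single binomial with k successes in n trials and a hypothesis that fixes the success probability at one point p_0. The symbols n, p, p_0 and the hat notation are introduced here for illustration; the slides give only the general form.

```latex
\[
H(p;\, n, k) \;=\; \binom{n}{k}\, p^{k} (1-p)^{\,n-k},
\qquad
\hat{p} \;=\; \arg\max_{p} H(p;\, n, k) \;=\; \frac{k}{n}
\]
\[
\lambda \;=\; \frac{H(p_0;\, n, k)}{H(\hat{p};\, n, k)}
\;=\; \frac{p_0^{\,k}\,(1-p_0)^{\,n-k}}{\hat{p}^{\,k}\,(1-\hat{p})^{\,n-k}}
\]
\[
-2 \log \lambda \;=\; 2\left[\, k \log \frac{k}{n\,p_0} \;+\; (n-k) \log \frac{n-k}{n\,(1-p_0)} \,\right]
\]
```

The binomial coefficient cancels in the ratio, so the statistic depends only on the observed counts and the counts expected under the hypothesis (n p_0 and n(1 - p_0)); natural logarithms are assumed.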


Comparing to chi-square

• The likelihood ratio statistic leads to the same formula as Pearson’s chi-square statistic when the binomial is approximated with a normal distribution

• The two statistics diverge significantly for low np

• The likelihood ratio statistic closely follows the chi-square distribution


Experimental results

• 32,000 words of financial text from Switzerland

• Find highly correlated word pairs

• Observe top-ranked entries for log-likelihood and chi-square

• Chi-square leads to huge scores for rare pairs

• 2,682 of 2,693 bigrams violate assumptions
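The contrast between the two scores on rare pairs can be illustrated with a made-up 2×2 table. The counts below are hypothetical, not taken from the experiment: two words that each occur exactly once, next to each other, among 32,000 bigram positions. The function names are also illustrative.

```python
from math import log

def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[row_totals[i] * col_totals[j] / n for j in range(2)] for i in range(2)]

def pearson_chi_square(table):
    """Pearson's X^2: sum over cells of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def log_likelihood_ratio(table):
    """-2 log(lambda) for independence: 2 * sum of observed * ln(observed / expected)."""
    exp = expected_counts(table)
    return 2 * sum(table[i][j] * log(table[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if table[i][j] > 0)

# Rows: first word present / absent. Columns: second word present / absent.
rare_pair = [[1, 0], [0, 31_999]]
chi2 = pearson_chi_square(rare_pair)
g2 = log_likelihood_ratio(rare_pair)
print(f"Pearson chi-square:   {chi2:.1f}")
print(f"Log-likelihood ratio: {g2:.1f}")
```

A single co-occurrence gives Pearson's statistic a score on the order of the corpus size, while the log-likelihood ratio stays modest, which is the pattern the slide describes for rare pairs.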