Page 1: Measures of Coincidence Vasileios Hatzivassiloglou University of Texas at Dallas.

Measures of Coincidence

Vasileios Hatzivassiloglou

University of Texas at Dallas

Page 2

A study of different measures

• Smadja, McKeown, and Hatzivassiloglou (1996): Translating Collocations for Bilingual Lexicons: A Statistical Approach

• Use aligned parallel corpora (Hansards)

• Task: Find translation for a word group across languages

Page 3

Sketch of algorithm

• Start with set of collocations in French

• Find candidate single word translations according to association between original collocation and translation

• Measure association between source collocation and pairs of candidate words

• Expand iteratively to triplets, etc. by recalculating association
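The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the cited paper: the corpus is a hypothetical list of sentence-aligned `(source_words, target_words)` set pairs, association is measured with the Dice coefficient over sentence indices, and the names `rank_translations`, `threshold`, and `max_size`, as well as the tie-break in favour of longer groups, are invented for this example.

```python
def dice(a, b):
    """Dice coefficient between two sets of sentence indices."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def rank_translations(collocation, pairs, threshold=0.5, max_size=3):
    """Rank target-word groups by association with a source collocation.

    pairs: aligned sentences as (source_word_set, target_word_set) tuples.
    threshold, max_size: hypothetical pruning parameters for this sketch.
    """
    # Sentences in which the whole source collocation occurs.
    source_hits = {i for i, (src, _) in enumerate(pairs) if collocation <= src}

    # Sentences in which each target word occurs.
    index = {}
    for i, (_, tgt) in enumerate(pairs):
        for word in tgt:
            index.setdefault(word, set()).add(i)

    # Round 1: candidate single-word translations.
    scored, singles = [], []
    for word in sorted(index):
        score = dice(source_hits, index[word])
        if score >= threshold:
            singles.append(word)
            scored.append((score, (word,)))

    # Later rounds: grow surviving groups by one word and recompute association.
    groups = [(w,) for w in singles]
    for _ in range(2, max_size + 1):
        next_groups = []
        for group in groups:
            for word in singles:
                if word <= group[-1]:
                    continue  # keep groups sorted so each is built once
                candidate = group + (word,)
                hits = set.intersection(*(index[w] for w in candidate))
                score = dice(source_hits, hits)
                if score >= threshold:
                    next_groups.append(candidate)
                    scored.append((score, candidate))
        groups = next_groups

    # Assumption: on equal scores, prefer the longer group.
    scored.sort(key=lambda item: (-item[0], -len(item[1]), item[1]))
    return scored


# Toy aligned corpus (invented data, for illustration only).
pairs = [
    ({"premier", "ministre"}, {"prime", "minister", "said"}),
    ({"premier", "ministre", "dit"}, {"the", "prime", "minister"}),
    ({"premier"}, {"first", "the"}),
    ({"ministre"}, {"minister", "the"}),
]
ranking = rank_translations({"premier", "ministre"}, pairs)
print(ranking[0])
```

Swapping `dice` for another association measure changes only the scoring function, which is what makes the measures compared in the following slides interchangeable within such a loop.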

Page 4

Dice vs. SI

• Dice depends on conditional probabilities only

• SI depends on the marginals: log P(X|Y) − log P(X)

• SI depends on how rare X is

• Limit behavior
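Written out, the contrast looks as follows. The notation is an assumption of this note: X and Y stand for the events X=1 and Y=1, and logarithms are base 2.

$$
\mathrm{Dice}(X,Y) = \frac{2\,P(X,Y)}{P(X)+P(Y)} = \frac{2}{\dfrac{1}{P(X \mid Y)} + \dfrac{1}{P(Y \mid X)}}
$$

$$
\mathrm{SI}(X,Y) = \log_2 \frac{P(X,Y)}{P(X)\,P(Y)} = \log_2 P(X \mid Y) - \log_2 P(X)
$$

Dice is the harmonic mean of the two conditional probabilities, so the marginals never enter on their own; in SI the term −log P(X) appears explicitly.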

Page 5

Asymmetry

• Many kinds of asymmetry

– Between X and Y

– Between X=1 and X=0

– 1-1 matches versus 0-0 matches

• Adding 0-0 matches does not change Dice

• Adding 0-0 matches always increases SI
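Both claims about 0-0 matches follow from writing the measures in terms of counts. The symbols are assumptions of this note: N sentences in total, n_{11} sentences containing both words, and n_X, n_Y sentences containing X and Y respectively, with n_{11} > 0.

$$
\mathrm{Dice} = \frac{2\,n_{11}}{n_X + n_Y},
\qquad
\mathrm{SI} = \log_2 \frac{n_{11}/N}{(n_X/N)(n_Y/N)} = \log_2 \frac{N\,n_{11}}{n_X\,n_Y}
$$

Adding m sentences that contain neither word changes only N. Dice does not involve N, so it stays the same, while SI grows by

$$
\log_2 \frac{N+m}{N} > 0 .
$$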

Page 6

Effect of asymmetry

• Hypothetical scenario on 100 sentences

• A,B appear together twice, by themselves three times each

• Dice: 2×2 / (5+5) = 0.4

• SI: log (0.02 / (0.05×0.05)) = 3 bits

• MI: 0.0457 bits

Page 7

Reversing ones and zeroes

• Now replace every 1 with 0 and vice versa

• New variables A′, B′ occur together 92 times, each occurs by itself three times

• Dice: 2×92 / (95 + 95) = 0.9684

• MI: Unchanged (0.0457 bits)

• SI: log(0.92 / (0.95×0.95)) = 0.0277 bits
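The figures in the two scenarios can be checked with a short script. It is a sketch that assumes base-2 logarithms and a 2×2 table of sentence counts; the function name `measures` and its argument order are choices made for this example.

```python
import math


def measures(n11, n10, n01, n00):
    """Return (Dice, SI in bits, MI in bits) for a 2x2 table of sentence counts.

    n11: both present, n10: only A, n01: only B, n00: neither.
    """
    n = n11 + n10 + n01 + n00
    pa = (n11 + n10) / n  # P(A=1)
    pb = (n11 + n01) / n  # P(B=1)

    dice = 2 * n11 / ((n11 + n10) + (n11 + n01))
    si = math.log2((n11 / n) / (pa * pb))

    # MI averages the log-ratio over all four cells, not only the 1-1 cell.
    cells = [
        (n11, pa, pb),
        (n10, pa, 1 - pb),
        (n01, 1 - pa, pb),
        (n00, 1 - pa, 1 - pb),
    ]
    mi = 0.0
    for count, p_row, p_col in cells:
        if count:  # treat 0 * log(0) as 0
            mi += (count / n) * math.log2((count / n) / (p_row * p_col))
    return dice, si, mi


original = measures(2, 3, 3, 92)   # A, B
flipped = measures(92, 3, 3, 2)    # A', B' after swapping 1s and 0s
print("original:", original)
print("flipped: ", flipped)
```

Because MI sums over all four cells, relabelling 1s and 0s only reorders its terms, which is why it is the one measure that stays the same.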

Page 8

Explaining the behavior

• Limit effect as P(X) decreases with P(X|Y) constant

• P(X) eventually dominates SI

• Makes SI (and MI) more sensitive to estimation errors
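A short calculation illustrates these points for SI. It is a sketch with base-2 logarithms; c and k are symbols introduced here, not taken from the slides. Hold P(X|Y) = c fixed:

$$
\mathrm{SI} = \log_2 c - \log_2 P(X) \;\longrightarrow\; \infty \quad \text{as } P(X) \to 0,
\qquad
\frac{\partial\,\mathrm{SI}}{\partial P(X)} = -\frac{1}{P(X)\,\ln 2}
$$

Since log c is at most 0 and fixed, the −log P(X) term dominates, and the slope grows without bound as P(X) shrinks. In terms of counts, if X was observed k times, miscounting by a single occurrence shifts SI by

$$
\log_2 \frac{k+1}{k},
$$

which is a full bit when k = 1.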

Page 9

Bounds and testing purpose

• No upper bound for SI and MI

• Dice is always between 0 and 1

• Easy to test SI/MI for independence

• Easy to test Dice for correlation

Page 10

Empirical comparison

• How to compare without redoing the entire experiment?

• Solution: Use competing measure in the last round

• Test cases where the correct solution is available

• Provide lower bound on competitor error

Page 11

Empirical results

• 45 French collocations

• 2 did not produce any candidate translation

• Dice resulted in 36 correct, 7 incorrect translations

• SI resulted in 26 correct, 17 incorrect translations

Page 12

Re-examining contingency tables

• Ted Dunning, “Accurate Methods for the Statistics of Surprise and Coincidence”, Computational Linguistics, 1993.

• Problem: Asymptotic normality assumptions

• How much data is enough?

• Are researchers aware of the need for statistical validity analysis?

Page 13

Rarity of words

• Empirical counts show that 20–30% of words appear less often than once per 50,000 words

• Approximating the binomial by a normal distribution: good as long as np(1-p) > 5

• Significance overestimated by 20% for np=1, by a factor of 40 for np=0.1, and by a factor of 10^20 for np=0.01
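One way to see where overestimation figures of this size come from is to compare the exact binomial probability of observing a rare word at least once with the tail probability given by the normal approximation. The sketch below makes several assumptions that are not in the slides: significance is read as the tail probability of "one or more occurrences", no continuity correction is applied, and the sample size `n` is arbitrary, because the result depends almost only on np when p is small.

```python
import math


def overestimation_factor(np_value, n=1_000_000):
    """Ratio of the exact P(k >= 1) to its normal-approximation estimate.

    A ratio above 1 means the normal approximation makes one or more
    occurrences look rarer, and therefore more significant, than they are.
    """
    p = np_value / n
    # Exact binomial tail: 1 - (1 - p)^n, computed stably for tiny p.
    exact = -math.expm1(n * math.log1p(-p))
    # Normal approximation with mean np and variance np(1 - p).
    z = (1.0 - np_value) / math.sqrt(np_value * (1.0 - p))
    normal = 0.5 * math.erfc(z / math.sqrt(2.0))
    return exact / normal


for np_value in (1.0, 0.1, 0.01):
    print(np_value, overestimation_factor(np_value))
```

Under these assumptions the ratio is a little above 1.2 for np=1, in the low 40s for np=0.1, and on the order of 10^20 for np=0.01.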

Page 14

Likelihood in parameter spaces

• Parametric model (known except for parameter values)

• Likelihood function H(ω;k)

• Hypothesis represented by a point ω₀

Page 15

Likelihood ratio

• Test statistic: −2 log λ

• Rapidly approaches χ² distribution for binomial H

$$
\lambda = \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)}
$$
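For two binomial samples, the case relevant to word and bigram counts, the test statistic takes a closed form in Dunning (1993). The symbols below follow that paper rather than the slides: k_i successes in n_i trials for sample i, tested against the hypothesis that both samples share a single rate p.

$$
L(p, k, n) = p^{k}\,(1-p)^{\,n-k},
\qquad
p_1 = \frac{k_1}{n_1},\quad
p_2 = \frac{k_2}{n_2},\quad
p = \frac{k_1 + k_2}{n_1 + n_2}
$$

$$
-2 \log \lambda = 2\,\bigl[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)\bigr]
$$

The binomial coefficients cancel between numerator and denominator of λ, which is why they do not appear. The statistic is 0 when the two observed rates are equal and grows as they diverge.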

Page 16

Comparing to chi-square

• Reduces to the same formula as Pearson’s chi-square statistic when the binomial is approximated with a normal distribution

• Diverges significantly from Pearson’s chi-square statistic for low np

• The log-likelihood statistic itself closely follows the chi-square distribution
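The divergence for rare events is easy to reproduce on a 2×2 contingency table. The sketch below computes Pearson's statistic and the log-likelihood ratio statistic (−2 log λ for the independence hypothesis, often written G²) side by side. The function names and the example table, a bigram seen once whose two words each occur only in that bigram, in a corpus of 32,000 bigram positions, are assumptions made for illustration, not data from the slides.

```python
import math


def pearson_chi_square(a, b, c, d):
    """Pearson's chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def log_likelihood_g2(a, b, c, d):
    """Log-likelihood ratio statistic: 2 * sum(observed * ln(observed / expected))."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # treat 0 * log(0) as 0
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


# Hypothetical rare pair: seen together once, never apart.
rare_pair = (1, 0, 0, 31_999)
chi2 = pearson_chi_square(*rare_pair)
g2 = log_likelihood_g2(*rare_pair)
print("Pearson chi-square:", chi2)
print("log-likelihood G2: ", g2)
```

On this table Pearson's statistic equals the corpus size, while the log-likelihood statistic stays in the low twenties, so a single co-occurrence no longer outranks well-attested pairs.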

Page 17

Experimental results

• 32,000 words of financial text from Switzerland

• Find highly correlated word pairs

• Observe top-ranked entries for log-likelihood and chi-square

• Chi-square leads to huge scores for rare pairs

• 2,682 of 2,693 bigrams violate assumptions