Cognates and Word Alignment in Bitexts Greg Kondrak University of Alberta.

Post on 22-Dec-2015



Outline

– Background
– Improving LCSR
– Cognates vs. word alignment links
– Experiments & results

Motivation

Claim: words that are orthographically similar are more likely to be mutual translations than words that are not similar.

Reason: the existence of cognates, which are usually orthographically and semantically similar.

Use: considering cognates can improve word alignment and translation models.

Objective

Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

MT applications

– sentence alignment
– word alignment
– improving translation models
– inducing translation lexicons
– aid in manual alignment

Cognates

Similar in orthography or pronunciation.
Often mutual translations.
May include:
– genetic cognates
– lexical loans
– names
– numbers
– punctuation

The task of cognate identification

Input: two words
Output: the likelihood that they are cognate
One method: compute their orthographic/phonetic/semantic similarity

Scope

The measures that we consider are:
– language-independent
– orthography-based
– operate on the level of individual letters
– binary identity function

Similarity measures

– Prefix method
– Dice coefficient
– Longest Common Subsequence Ratio (LCSR)
– Edit distance
– Phonetic alignment
– Many other methods

IDENT

1 if two words are identical, 0 otherwise.
The simplest similarity measure.
e.g. IDENT(colour, couleur) = 0

PREFIX

The ratio of the length of the longest common prefix of two words to the length of the longer word.
e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.29
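As a quick sketch, IDENT and PREFIX can be implemented directly (function names are mine, not from the talk):

```python
def ident(x: str, y: str) -> float:
    """IDENT: 1 if the two words are identical, 0 otherwise."""
    return 1.0 if x == y else 0.0

def prefix(x: str, y: str) -> float:
    """PREFIX: longest common prefix length over the longer length."""
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return k / max(len(x), len(y))

print(ident("colour", "couleur"))   # 0.0
print(prefix("colour", "couleur"))  # 2/7 ~ 0.2857
```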

DICE coefficient

Twice the number of common letter bigrams divided by the total number of letter bigrams in the two words.
e.g. DICE(colour, couleur) = 6/11 ≈ 0.55

colour:  co ol lo ou ur
couleur: co ou ul le eu ur
(common bigrams: co, ou, ur)
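A small sketch of the bigram Dice coefficient; note the factor of 2, which yields 6/11 for the example above (the three shared bigrams are co, ou, ur):

```python
from collections import Counter

def dice(x: str, y: str) -> float:
    """Dice coefficient over letter bigrams:
    2 * |common bigrams| / (|bigrams(x)| + |bigrams(y)|)."""
    bx = Counter(x[i:i + 2] for i in range(len(x) - 1))
    by = Counter(y[i:i + 2] for i in range(len(y) - 1))
    common = sum((bx & by).values())  # multiset intersection
    return 2 * common / (sum(bx.values()) + sum(by.values()))

print(dice("colour", "couleur"))  # 6/11 ~ 0.545
```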

Longest Common Subsequence Ratio (LCSR)

The ratio of the length of the longest common subsequence of two words to the length of the longer word.
e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71

c o - l o - u r
c o u l - e u r
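LCSR can be computed with the standard dynamic program for the longest common subsequence; a minimal sketch:

```python
def lcs_length(x: str, y: str) -> int:
    """Longest common subsequence length via O(|x|*|y|) DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcsr(x: str, y: str) -> float:
    """LCSR: LCS length over the length of the longer word."""
    return lcs_length(x, y) / max(len(x), len(y))

print(lcsr("colour", "couleur"))  # 5/7 ~ 0.714 (LCS "colur")
```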

LCSR

Method of choice in several papers.
Weak point: insensitive to word length.
Example:
– LCSR(walls, allés) = 0.8
– LCSR(sanctuary, sanctuaire) = 0.8
Sometimes a minimal word length is imposed.
A principled solution?

The random model

Assumption: strings are generated randomly from a given distribution of letters.
Problem: what is the probability of seeing k matches between two strings of length m and n?

A special case

Assumption: k = 0 (no matches)
t – alphabet size
S(n,i) – Stirling number of the second kind

\Pr(\mathrm{LCS}_{m,n} = 0) = \frac{1}{t^{m+n}} \sum_{i=1}^{\min(n,t)} S(n,i) \, \frac{t!}{(t-i)!} \, (t-i)^m
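The k = 0 case is easy to sanity-check, since LCS = 0 exactly when the two strings share no letter. A small sketch (my own reconstruction of the garbled formula on this slide: sum over the number i of distinct letters in the length-n string, validated here by brute-force enumeration for tiny parameters):

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def stirling2(n: int, i: int) -> int:
    """Stirling number of the second kind S(n, i)."""
    if n == i:
        return 1
    if i == 0 or i > n:
        return 0
    return i * stirling2(n - 1, i) + stirling2(n - 1, i - 1)

def falling(t: int, i: int) -> int:
    """Falling factorial t * (t-1) * ... * (t-i+1)."""
    r = 1
    for j in range(i):
        r *= t - j
    return r

def pr_lcs_zero(m: int, n: int, t: int) -> float:
    """Pr(LCS = 0) for two uniform random strings of lengths m and n
    over an alphabet of size t: the strings share no letter."""
    total = sum(stirling2(n, i) * falling(t, i) * (t - i) ** m
                for i in range(1, min(n, t) + 1))
    return total / t ** (m + n)

def pr_lcs_zero_brute(m: int, n: int, t: int) -> float:
    """Brute-force check by enumerating all string pairs."""
    alpha = range(t)
    hits = sum(1 for x in product(alpha, repeat=n)
                 for y in product(alpha, repeat=m)
                 if not set(x) & set(y))
    return hits / t ** (m + n)

assert abs(pr_lcs_zero(2, 2, 3) - pr_lcs_zero_brute(2, 2, 3)) < 1e-12
assert abs(pr_lcs_zero(3, 2, 3) - pr_lcs_zero_brute(3, 2, 3)) < 1e-12
```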

The problem

What is the probability of seeing k matches between two strings of length m and n?
An exact analytical formula is unlikely to exist.
A very similar problem has been studied in bioinformatics as the statistical significance of alignment scores.
Approximations developed in bioinformatics are not applicable to words because of length differences.

Solutions for the general case

Sampling
– Not reliable for small probability values
– Works well for low k/n ratios (uninteresting)
– Depends on a given alphabet size and letter frequencies
– No insight

Inexact approximation
– Works well for high k/n ratios (interesting)
– Easy to use
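A Monte Carlo sketch of the sampling approach (illustrative; assumes uniform letter frequencies, whereas the slide notes that the real quantity depends on the letter distribution):

```python
import random

def lcs_length(x, y):
    """Standard O(|x|*|y|) longest-common-subsequence DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def sample_pr_lcs_ge(k, m, n, t, trials=20000, seed=0):
    """Estimate Pr(LCS >= k) for two uniform random strings of
    lengths m and n over an alphabet of size t."""
    rng = random.Random(seed)
    alpha = range(t)
    hits = sum(
        lcs_length(rng.choices(alpha, k=m), rng.choices(alpha, k=n)) >= k
        for _ in range(trials)
    )
    return hits / trials
```

As the slide notes, the estimate is only reliable when the true probability is large relative to 1/trials, i.e. for low k/n ratios.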

Formula 1

p – probability of a match:

p = \sum_{j=1}^{t} p_j^2

\Pr(\mathrm{LCS}_{m,n} \ge k) \approx 1 - (1 - p^k)^{\binom{n}{k}\binom{m}{k}} = 1 - \exp\left( \binom{n}{k}\binom{m}{k} \log(1 - p^k) \right)

\Pr(\mathrm{LCS}_{m,n} = k) = \Pr(\mathrm{LCS}_{m,n} \ge k) - \Pr(\mathrm{LCS}_{m,n} \ge k+1)

Formula 1

– Exact for k = m = n: \Pr(\mathrm{LCS}_{n,n} = n) = p^n
– Inexact in general
– Reason: implicit independence assumption
– A lower bound for the actual probability
– Good approximation for high k/n ratios
– Runs into numerical problems for larger n
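A sketch of Formula 1 as reconstructed from this transcript (the garbled original leaves the exact form uncertain). The expm1/log form postpones the numerical problems mentioned on the slide, and p comes from the letter distribution:

```python
from math import comb, log, expm1

def match_prob(letter_probs):
    """p = sum_j p_j^2: probability that two random letters match."""
    return sum(q * q for q in letter_probs)

def formula1(k, m, n, p):
    """Pr(LCS >= k) ~ 1 - (1 - p^k)^(C(n,k) * C(m,k)).
    Computed as -expm1(N * log(1 - p^k)) for stability."""
    if k == 0:
        return 1.0
    return -expm1(comb(n, k) * comb(m, k) * log(1.0 - p ** k))

# Exact at k = m = n: the binomials are 1 and it reduces to p^n.
print(formula1(4, 4, 4, 0.1))  # ~ 1e-4 = 0.1**4
```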

Formula 2

Expected number of matching pairs of k-letter subsequences:

E[x_k] = \binom{n}{k} \binom{m}{k} p^k \approx \Pr(\mathrm{LCS}_{m,n} \ge k)

Approximates the required probability for high k/n ratios.

Formula 2

– Does not work for low k/n ratios
– Not monotonic
– Simpler than Formula 1
– More robust against numerical underflow for very long words

Comparison of both formulas

– Both are exact for k = m = n
– For k close to max(m,n):
  – both formulas are good approximations
  – their values are very close
– Both can be quickly computed using dynamic programming
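The two formulas can be compared directly; under the reconstruction used here, their values nearly coincide for k close to max(m, n):

```python
from math import comb, log, expm1

def formula1(k, m, n, p):
    """1 - (1 - p^k)^(C(n,k) * C(m,k))."""
    return -expm1(comb(n, k) * comb(m, k) * log(1.0 - p ** k))

def formula2(k, m, n, p):
    """E[x_k] = C(n,k) * C(m,k) * p^k."""
    return comb(n, k) * comb(m, k) * p ** k

p = 0.05
for k, m, n in [(6, 6, 6), (7, 7, 8), (9, 9, 10)]:
    f1, f2 = formula1(k, m, n, p), formula2(k, m, n, p)
    assert abs(f1 - f2) / f2 < 1e-3  # nearly identical when k ~ max(m, n)
```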

LCSF

A new similarity measure based on Formula 2.

LCSR(X,Y) = k/n
\mathrm{LCSF}(X,Y) = \max\left( -\log\left( \binom{n}{k}^2 p^k \right),\, 0 \right)

LCSF is as fast as LCSR, because its values, which depend only on k and n, can be pre-computed and stored.
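A sketch of LCSF under the reconstruction above; it assumes the score is max(−log(C(n,k)² p^k), 0), with n the length of the longer word and k the LCS length, which matches the statement that the values depend only on k and n. The value of p and the table size are illustrative:

```python
from math import comb, log

def lcs_length(x, y):
    """Standard longest-common-subsequence DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def make_lcsf(p, max_len=30):
    """Pre-compute the LCSF score for every (n, k) pair, so scoring
    a word pair costs no more than LCSR (one LCS computation)."""
    table = [[max(-log(comb(n, k) ** 2 * p ** k), 0.0)
              for k in range(n + 1)] for n in range(max_len + 1)]
    def lcsf(x, y):
        n = max(len(x), len(y))
        return table[n][lcs_length(x, y)]
    return lcsf

lcsf = make_lcsf(p=0.05)
# Unlike LCSR, LCSF separates the two pairs that both score 0.8:
print(lcsf("sanctuary", "sanctuaire") > lcsf("walls", "allés"))  # True
```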

Evaluation - motivation

Intrinsic evaluation of orthographic similarity is difficult and subjective.
My idea: extrinsic evaluation on cognates and word-aligned bitexts.
– Most cross-language cognates are orthographically similar, and vice versa.
– Cognation is binary and not subjective.

Cognates vs alignment links

Manual identification of cognates is tedious.
Manually word-aligned bitexts are available, but only some of the links are between cognates.
Question #1: can we use manually constructed word alignment links instead?

Manual vs automatic alignment links

Automatically word-aligned bitexts are easily obtainable, but a good fraction of the links are wrong.
Question #2: can we use machine-generated word alignment links instead?

Evaluation methodology

– Assumption: a word-aligned bitext
– Treat aligned sentences as bags of words
– Compute similarity for all word pairs
– Order word pairs by their similarity value
– Compute precision against a gold standard
  – either a cognate list or alignment links
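The methodology above can be sketched as follows (a toy illustration with hypothetical data and a crude stand-in similarity function; any of the measures discussed earlier can be plugged in as `sim`):

```python
def precision_curve(sent_pairs, sim, gold):
    """Rank every cross-language word pair from the aligned
    sentences by similarity, then compute precision at each
    cutoff against a gold standard (cognate list or links)."""
    candidates = []
    for src_sent, tgt_sent in sent_pairs:
        for s in set(src_sent):          # treat sentences as bags of words
            for t in set(tgt_sent):
                candidates.append((sim(s, t), s, t))
    candidates.sort(reverse=True)
    correct, curve = 0, []
    for rank, (_, s, t) in enumerate(candidates, 1):
        correct += (s, t) in gold
        curve.append(correct / rank)
    return curve

# Toy example: one sentence pair, one gold cognate pair, and a
# shared-letter ratio as the stand-in similarity.
shared = lambda s, t: len(set(s) & set(t)) / max(len(s), len(t))
curve = precision_curve([(["the", "colour"], ["la", "couleur"])],
                        shared, {("colour", "couleur")})
print(curve[0])  # 1.0: the top-ranked pair is the gold cognate
```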

Test data

Blinker bitext (French-English)
– 250 Bible verse pairs
– manual word alignment
– all cognates manually identified

Hansards (French-English)
– 500 sentences
– manual and automatic word alignment

Romanian-English
– 248 sentences
– manually aligned

Blinker results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF, evaluated against the cognate gold standard and against alignment links; y-axis 0 to 0.9]

Hansards results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF, evaluated against manual and automatic alignment links; y-axis 0 to 0.8]

Romanian-English results

[Bar chart: precision of IDENT, PREFIX, DICE, LCSR, and LCSF, evaluated against manual alignment links; y-axis 0 to 0.7]

Contributions

We showed that word alignment links can be used instead of cognates for evaluating word similarity measures.

We proposed a new similarity measure which outperforms LCSR.

Future work

Extend our approach to length normalization to edit distance and other similarity measures.

Incorporate cognate information into statistical MT models as an additional feature function.

Thank you