Cross Lingual Information Retrieval (CLIR)

Cross Lingual Information Retrieval (CLIR) Rong Jin


Transcript of Cross Lingual Information Retrieval (CLIR)

Page 1: Cross Lingual Information Retrieval (CLIR)

Cross Lingual Information Retrieval (CLIR)

Rong Jin

Page 2: Cross Lingual Information Retrieval (CLIR)

The Problem

Increasing pressure for accessing information in foreign languages:
find information written in foreign languages
read and interpret that information
merge it with information in other languages

Need for multilingual information access

Page 3: Cross Lingual Information Retrieval (CLIR)

Why Is Cross-Lingual IR Important?

The Internet is no longer monolingual, and non-English content is growing rapidly.

Non-English speakers represent the fastest-growing group of new Internet users.

In 1997: 8.1 million Spanish-speaking users
In 2000: 37 million …

Page 4: Cross Lingual Information Retrieval (CLIR)

[Figure: charts for 2000 and 2005, each with a segment labeled "English". © Manning & Napier Information Services 2000 (confidential, unpublished information).]

Page 5: Cross Lingual Information Retrieval (CLIR)

2. Multilingual Text Processing

Character encoding
Language recognition
Tokenization
Stop word removal
Feature normalization (stemming)
Part-of-speech tagging
Phrase identification

Page 6: Cross Lingual Information Retrieval (CLIR)

Character Encoding

Language (alphabet) specific native encodings:
Chinese: GB, Big5
Western European: ISO-8859-1 (Latin-1)
Russian: KOI-8, ISO-8859-5, CP-1251

Unicode (ISO/IEC 10646):
UTF-8: variable length, 1 to 4 bytes per character
UTF-16: variable length, 2 or 4 bytes per character
UCS-2: fixed length, 2 bytes per character
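The byte-length differences between these encodings can be checked with a short Python sketch. It uses only codecs that ship with CPython (`gb2312`, `big5`, `koi8_r`, `cp1251`, `utf-8`, `utf-16-le`). The sample strings and the helper name `byte_len` are illustrative assumptions, not part of the slides.

```python
# Illustrative sample strings (assumptions, not taken from the slides).
chinese = "中文"        # two Chinese characters
russian = "Привет"      # six Cyrillic letters
clef = "\U0001D11E"     # one character outside the Basic Multilingual Plane


def byte_len(text, encoding):
    """Return the number of bytes needed to store `text` in `encoding`."""
    return len(text.encode(encoding))


# Native Chinese encodings: 2 bytes per character here.
print(byte_len(chinese, "gb2312"), byte_len(chinese, "big5"))    # 4 4

# Native Cyrillic encodings: 1 byte per character.
print(byte_len(russian, "koi8_r"), byte_len(russian, "cp1251"))  # 6 6

# UTF-8: 1 byte for ASCII, 2 for Cyrillic, 3 for these Chinese characters.
print(byte_len("a", "utf-8"), byte_len(russian, "utf-8"), byte_len(chinese, "utf-8"))  # 1 12 6

# UTF-16: 2 bytes inside the BMP, 4 bytes (a surrogate pair) outside it.
print(byte_len(chinese, "utf-16-le"), byte_len(clef, "utf-16-le"))  # 4 4
```

The last line shows why UTF-16 is not a fixed-width encoding: the single character in `clef` needs two 16-bit code units.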

Page 7: Cross Lingual Information Retrieval (CLIR)

Tokenization

Punctuation is separated from words, including word-separation characters.

“The train stopped.” → “The”, “train”, “stopped”, “.”

The string is split into lexical units, including segmentation (Chinese) and compound splitting (German).
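A minimal sketch of the punctuation-splitting step, assuming a simple regular-expression tokenizer (the pattern and the function name `tokenize` are illustrative choices, not the method used in the slides):

```python
import re

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The train stopped."))  # ['The', 'train', 'stopped', '.']
```

A pattern like this relies on whitespace between words, so unsegmented Chinese text or a German compound comes back as a single token; those cases need the segmentation and compound-splitting steps.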

Page 8: Cross Lingual Information Retrieval (CLIR)

Chinese Segmentation

Page 9: Cross Lingual Information Retrieval (CLIR)

Chinese Segmentation Frank Petroleum Detection

Page 10: Cross Lingual Information Retrieval (CLIR)

German Segmentation

Unrestricted compounding in German, e.g. Abendnachrichtensendungsblock

Use compound analysis together with the CELEX German dictionary (360,000 words):
Treuhandanstalt → { treuhand, anstalt }

Use an n-gram representation:
Treuhandanstalt → { Treuha, reuhan, treuhand, euhand, … }
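Both representations can be sketched in a few lines of Python. The two-word `TOY_LEXICON` is a hypothetical stand-in for a real dictionary such as CELEX, and the function names are illustrative; the splitter is a simple longest-prefix search, not the compound analysis used in the slides.

```python
def char_ngrams(word, n):
    """Return all overlapping character n-grams of a lowercased word."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def split_compound(word, lexicon):
    """Split a compound into lexicon entries, trying the longest prefix first.

    Returns an empty list if no complete split exists.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest:
                return [head] + rest
    return []


# Hypothetical two-entry lexicon standing in for a full dictionary.
TOY_LEXICON = {"treuhand", "anstalt"}

print(split_compound("Treuhandanstalt", TOY_LEXICON))  # ['treuhand', 'anstalt']
print(char_ngrams("Treuhandanstalt", 6)[:3])           # ['treuha', 'reuhan', 'euhand']
```

This sketch ignores German linking elements (such as the "s" in "Sendungsblock"), which a real compound analyzer would have to handle. The n-gram route needs no dictionary at all, at the cost of a larger index.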

Page 11: Cross Lingual Information Retrieval (CLIR)

CLIR - Approaches

Machine Translation
Bilingual Dictionaries
Parallel/Comparable Corpora

[Diagram: User Query → Query Representation | Language Barrier | Document Representation ← Document]

Example query: 誰在 1998 年贏得環法自行車大賽 ("Who won the Tour de France in 1998?")

Example document: Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …

Page 12: Cross Lingual Information Retrieval (CLIR)

Machine Translation

Translate all documents into the query language

[Diagram: English documents → machine translation → Chinese documents → Lucene ← Chinese queries]

Page 13: Cross Lingual Information Retrieval (CLIR)

Machine Translation (MT)

Translate all documents into the query language

Not viable on large collections (MT is computationally expensive)

Not viable if there are many possible query languages

[Diagram: English documents → machine translation → Chinese documents → Lucene ← Chinese queries]

Page 14: Cross Lingual Information Retrieval (CLIR)

Machine Translation

Translate the query into languages of the content being searched

[Diagram: Chinese queries → machine translation → English queries → Lucene ← English documents]

Page 15: Cross Lingual Information Retrieval (CLIR)

Machine Translation

Translate the query into languages of the content being searched

Query translation is inadequate for CLIR:
no context for accurate translation
system selects preferred target term

[Diagram: Chinese queries → machine translation → English queries → Lucene ← English documents]

Page 16: Cross Lingual Information Retrieval (CLIR)

Example of Translating Queries

Who won the Tour de France in 1998?

Page 17: Cross Lingual Information Retrieval (CLIR)

Using Dictionaries

Bilingual machine-readable dictionaries (in-house or commercial)

Look up query terms in the dictionary and replace them with translations in the document languages

[Diagram: Chinese queries → bilingual dictionary → English queries → Lucene ← English documents]

Page 18: Cross Lingual Information Retrieval (CLIR)

Using Dictionaries

Problems:
ambiguity
many terms are out-of-vocabulary
lack of multiword terms
phrase identification
a bilingual dictionary is needed for every query-document language pair of interest
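Dictionary lookup and two of the problems listed above (ambiguity and out-of-vocabulary terms) can be illustrated with a small Python sketch. `TOY_DICTIONARY`, the pre-segmented query terms, and `translate_query` are all hypothetical and chosen for illustration; a real system would use a full machine-readable dictionary and a Chinese segmenter.

```python
# Hypothetical toy Chinese -> English dictionary; entries are illustrative only.
TOY_DICTIONARY = {
    "誰": ["who"],
    "贏得": ["win", "gain"],                 # ambiguous: several candidates
    "自行車": ["bicycle", "bike"],
    "大賽": ["competition", "tournament"],
}


def translate_query(terms, dictionary):
    """Replace each term with all of its dictionary translations.

    Returns (translated_terms, out_of_vocabulary_terms).
    """
    translated, unknown = [], []
    for term in terms:
        if term in dictionary:
            translated.extend(dictionary[term])
        else:
            unknown.append(term)
    return translated, unknown


# Assumed segmentation of the example query used earlier in the slides.
query_terms = ["誰", "1998", "贏得", "環法", "自行車", "大賽"]
translated, unknown = translate_query(query_terms, TOY_DICTIONARY)

print(translated)  # every candidate translation is kept, ambiguous or not
print(unknown)     # terms with no dictionary entry
```

With this toy dictionary the term "環法" (the short form for the Tour de France) has no entry, so the most important part of the query is lost, while the remaining terms expand into several competing translations.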

Page 19: Cross Lingual Information Retrieval (CLIR)

Word Sense Disambiguation

Page 20: Cross Lingual Information Retrieval (CLIR)

Word Sense Disambiguation

The sign for independent press to disappear

Page 21: Cross Lingual Information Retrieval (CLIR)

Using Corpora

Parallel corpora: translation equivalents, e.g. the UN corpus in French, Spanish & English

Comparable corpora: similar in topic, style, time, etc., e.g. Hong Kong TV broadcast news in both Chinese and English

Page 22: Cross Lingual Information Retrieval (CLIR)

Using Corpora

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

Query: A E

Documents:
d1: a a c e
d2: b c d a
d3: e d a

How can we bridge the language barrier using the parallel corpus?

Page 23: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (I)


Page 24: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (I)

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

Query: A E

Translated query: c e

Documents:
d1: a a c e
d2: b c d a
d3: e d a

Page 25: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (I)


Page 26: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (II)

Learn word-to-word translation probabilities from parallel corpora

Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q

Page 27: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (II)

Word-to-Word Translation Probabilities

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

Q = (A E), d1 = (a a c e)

Page 28: Cross Lingual Information Retrieval (CLIR)


Page 29: Cross Lingual Information Retrieval (CLIR)


Page 30: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (II)

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

Q = (A E)

Documents and scores for the query (A, E):
d1: a a c e (score 0.58)
d2: b c d a
d3: e d a

Page 31: Cross Lingual Information Retrieval (CLIR)

Translate Query using Parallel Corpus (II)

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

Q = (A E)

Documents and scores for the query (A, E):
d1: a a c e (score 0.58)
d2: b c d a (score 0.36)
d3: e d a (score 0.48)
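The transcript does not preserve the scoring formula, so the following Python sketch rests on an assumption: each query word is scored by its average translation probability over the words of the document, and the per-word scores are multiplied. The probabilities are copied from the slide table; the names `P` and `score` are illustrative.

```python
# p(query word | document word), copied from rows A and E of the slide table.
P = {
    "A": {"a": 1.0, "b": 0.67, "c": 1.0, "d": 0.5, "e": 0.67},
    "E": {"a": 0.5, "b": 0.33, "c": 0.5, "d": 0.5, "e": 1.0},
}


def score(query, doc):
    """Assumed model: product over query words of the mean p(q | w) for w in doc."""
    result = 1.0
    for q in query:
        result *= sum(P[q][w] for w in doc) / len(doc)
    return result


docs = {
    "d1": "a a c e".split(),
    "d2": "b c d a".split(),
    "d3": "e d a".split(),
}

for name, doc in docs.items():
    print(name, round(score(["A", "E"], doc), 2))
# d1 0.57
# d2 0.36
# d3 0.48
```

Under this reading the ranking is d1, d3, d2, and the values agree with the slide's 0.58, 0.36, and 0.48 up to rounding in the second decimal place.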

Page 32: Cross Lingual Information Retrieval (CLIR)

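Translate Query using Parallel Corpus (II)

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

Documents and scores for the query (A, E):
d1: a a c e (score 0.58)
d2: b c d a (score 0.36)
d3: e d a (score 0.48)

How do we obtain the translation probabilities?

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e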

Page 33: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting

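Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e

co(row, column) | a | b | c | d | e | total
A | 4 | 2 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |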

Page 34: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting

Co-occurrence-based translation model

e.g. p(A|a) = co(A, a) / occur(a) = 4/4 = 1


Page 35: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting

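co(row, column) | a | b | c | d | e | total
A | 4 | 2 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

P(B|c) = co(B, c) / occ(c) = 2/4 = 0.5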
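Co-occurrence counting can be sketched as follows, assuming that co(X, x) is the number of sentence pairs in which both words appear and occ(x) is the number of sentence pairs containing x. The sketch uses the variant of the corpus whose second source sentence is "B D E A"; the names `CORPUS`, `co`, `occ`, and `p` are illustrative.

```python
from collections import Counter

# Parallel corpus as (source sentence, aligned translation) pairs.
# Assumption: the variant in which the second source sentence is "B D E A".
CORPUS = [
    ("A B C", "b c a d"),
    ("B D E A", "b d e a"),
    ("C A", "c a"),
    ("A B E", "c b e"),
    ("A C E", "a c e"),
]

co = Counter()   # co[(X, x)]: sentence pairs containing both X and x
occ = Counter()  # occ[x]: sentence pairs whose translation contains x

for source, target in CORPUS:
    source_words = set(source.split())
    target_words = set(target.split())
    for x in target_words:
        occ[x] += 1
        for X in source_words:
            co[(X, x)] += 1


def p(X, x):
    """Co-occurrence estimate p(X | x) = co(X, x) / occ(x)."""
    return co[(X, x)] / occ[x]


print(p("A", "a"))  # 1.0
print(p("B", "c"))  # 0.5
```

These two values match the worked examples on the slides. Some other cells depend on whether the second source sentence is taken to contain "A", since the transcript shows both versions of the corpus.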

Page 36: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting


Any problem ?

Page 37: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting

P(row|column) | a | b | c | d | e
A | 1 | 0.67 | 1 | 0.5 | 0.67
B | 0.5 | 1 | 0.5 | 1 | 0.67
C | 0.75 | 0.33 | 0.75 | 0.5 | 0.33
D | 0.25 | 0.33 | 0 | 0.5 | 0.33
E | 0.5 | 0.33 | 0.5 | 0.5 | 1

Many large translation probabilities

Usually one word of one language corresponds mostly to a single word in another language

Page 38: Cross Lingual Information Retrieval (CLIR)

Approach I: Co-occurrence Counting


Many large translation probabilities

Usually one word of one language corresponds mostly to a single word in another language

We may over-count the co-occurrence statistics

Page 39: Cross Lingual Information Retrieval (CLIR)

Approach I: Overcounting

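co(row, column) | a | b | c | d | e | total
A | 4 | 2 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e

co(A, a) = 4 implies that every occurrence of ‘A’ is due to an occurrence of ‘a’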

Page 40: Cross Lingual Information Retrieval (CLIR)

Approach I: Overcounting

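co(row, column) | a | b | c | d | e | total
A | 4 | 3 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e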

Page 41: Cross Lingual Information Retrieval (CLIR)

Approach I: Overcounting

co(row, column) | a | b | c | d | e | total
A | 4 | 3 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |

If we believe that the first two occurrences of ‘A’ are due to ‘a’, then co(A, b) = 1, not 3

But we have no way of knowing whether the first two occurrences of ‘A’ are due to ‘a’

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e

Page 42: Cross Lingual Information Retrieval (CLIR)

How to Compute Co-occurrence?

IBM statistical translation model

IBM Research has published a series of translation models; we will only discuss IBM Translation Model I

It uses an iterative procedure to eliminate the over-counting problem

Page 43: Cross Lingual Information Retrieval (CLIR)

Step 1: Compute co-occurrence

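Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E A | b d e a
C A | c a
A B E | c b e
A C E | a c e

co(row, column) | a | b | c | d | e | total
A | 4 | 2 | 4 | 1 | 2 | 4
B | 2 | 3 | 2 | 2 | 2 | 3
C | 3 | 1 | 3 | 1 | 1 | 3
D | 1 | 1 | 0 | 1 | 1 | 1
E | 2 | 1 | 2 | 1 | 3 | 3
total | 4 | 3 | 4 | 2 | 3 |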

Page 44: Cross Lingual Information Retrieval (CLIR)

Step 1: Compute co-occurrence


Assume that translation probabilities are proportional to co-occurrence:

$$p(A \mid a) = \frac{co(A,a)}{co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a)} = \frac{4}{12} \approx 0.33$$

Page 45: Cross Lingual Information Retrieval (CLIR)

Step 2: Compute Conditional Prob.

Assume that translation probabilities are proportional to co-occurrence:

$$p(A \mid a) = \frac{co(A,a)}{co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a)} = \frac{4}{12} \approx 0.33$$

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

p(row|column) | a | b | c | d | e
A | 0.33 | 0.25 | 0.36 | 0.17 | 0.22
B | 0.17 | … | … | … | …
C | 0.25 | … | … | … | …
D | 0.08 | … | … | … | …
E | 0.17 | … | … | … | …

Page 46: Cross Lingual Information Retrieval (CLIR)

Step 3: Re-estimate co-occurrence

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

Sentence pair 1: A B C | b c a d

‘A’ can be caused by any one of the words ‘b’, ‘c’, ‘a’, ‘d’

co(A, a) for sentence 1 should be computed by taking this competition into account

p(row|column) | a | b | c | d | e
A | 0.33 | 0.25 | 0.36 | 0.17 | 0.22
B | 0.17 | … | … | … | …
C | 0.25 | … | … | … | …
D | 0.08 | … | … | … | …
E | 0.17 | … | … | … | …

Page 47: Cross Lingual Information Retrieval (CLIR)

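Step 3: Re-estimate co-occurrence

Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

Sentence pair 1: A B C | b c a d

p(row|column) | a | b | c | d | e
A | 0.33 | 0.25 | 0.36 | 0.17 | 0.22
B | 0.17 | … | … | … | …
C | 0.25 | … | … | … | …
D | 0.08 | … | … | … | …
E | 0.17 | … | … | … | …

$$co(A, a; 1) = \frac{p(A \mid a)}{p(A \mid a) + p(A \mid c) + p(A \mid b) + p(A \mid d)} = \frac{0.33}{0.33 + 0.25 + 0.36 + 0.17} = 0.41$$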

Page 48: Cross Lingual Information Retrieval (CLIR)

Step 3: Re-estimate co-occurrence

Parallel Corpus (source sentence | aligned translation | co(A,a) for this sentence)
A B C | b c a d | 0.41
B D E A | b d e a | 0.37
C A | c a | 0.48
A B E | c b e | 0
A C E | a c e | 0.36

p(row|column) | a | b | c | d | e
A | 0.33 | 0.25 | 0.36 | 0.17 | 0.22

Page 49: Cross Lingual Information Retrieval (CLIR)

Step 3: Re-estimate co-occurrence

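Parallel Corpus (source sentence | aligned translation | co(A,a) for this sentence)
A B C | b c a d | 0.41
B D E A | b d e a | 0.37
C A | c a | 0.48
A B E | c b e | 0
A C E | a c e | 0.36

p(row|column) | a | b | c | d | e
A | 0.33 | 0.25 | 0.36 | 0.17 | 0.22

co(A,a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62

Re-estimated co(row, column) | a
A | 1.62
B |
C |
D |
E |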

Page 50: Cross Lingual Information Retrieval (CLIR)

Step 3: Re-estimate co-occurrence

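Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

Re-estimated co(row, column) | a | b | c | d | e
A | 1.62 | … | … | … | …
B | 0.75 | … | … | … | …
C | 1.25 | … | … | … | …
D | 0.31 | … | … | … | …
E | 0.56 | … | … | … | …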

Page 51: Cross Lingual Information Retrieval (CLIR)

Step 4: Re-compute Conditional Prob.

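Parallel Corpus (source sentence | aligned translation)
A B C | b c a d
B D E | b d e a
C A | c a
A B E | c b e
A C E | a c e

p(row|column) | a | b | c | d | e
A | 0.46 | … | … | … | …
B | 0.15 | … | … | … | …
C | 0.22 | … | … | … | …
D | 0.06 | … | … | … | …
E | 0.11 | … | … | … | …

$$p(A \mid a) = \frac{co(A,a)}{co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a)} = 0.36$$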

Page 52: Cross Lingual Information Retrieval (CLIR)

IBM Statistical Translation Model

Apply the steps of counting and estimation iteratively

Convergence can be proved

This is related to the so-called Expectation Maximization (EM) algorithm:
E step: counting
M step: estimating the translation probabilities

It has had the best performance in past TREC evaluations for CLIR
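The four steps can be put together in a short Python sketch. It follows the slides in initializing from raw co-occurrence counts, and it leaves out the NULL word of the full IBM Model 1, so it is a simplified illustration rather than a reference implementation. The corpus variant (second source sentence "B D E A"), the function names, and the choice of 10 iterations are assumptions.

```python
from collections import defaultdict

# Parallel corpus as (source words, translation words).
# Assumption: the variant in which the second source sentence is "B D E A".
CORPUS = [
    ("A B C".split(), "b c a d".split()),
    ("B D E A".split(), "b d e a".split()),
    ("C A".split(), "c a".split()),
    ("A B E".split(), "c b e".split()),
    ("A C E".split(), "a c e".split()),
]


def initial_counts(corpus):
    """Step 1: raw co-occurrence counts co(X, x)."""
    co = defaultdict(float)
    for source, target in corpus:
        for X in source:
            for x in target:
                co[(X, x)] += 1.0
    return co


def normalize(co):
    """Steps 2 and 4 (M step): p(X | x) = co(X, x) / sum over X' of co(X', x)."""
    totals = defaultdict(float)
    for (X, x), count in co.items():
        totals[x] += count
    return {(X, x): count / totals[x] for (X, x), count in co.items()}


def expected_counts(corpus, p):
    """Step 3 (E step): share each occurrence of X among the words of its translation."""
    co = defaultdict(float)
    for source, target in corpus:
        for X in source:
            z = sum(p[(X, x)] for x in target)  # competition within the sentence
            for x in target:
                co[(X, x)] += p[(X, x)] / z
    return co


p_initial = normalize(initial_counts(CORPUS))
p = dict(p_initial)
for _ in range(10):  # number of iterations chosen arbitrarily for the sketch
    p = normalize(expected_counts(CORPUS, p))

print(round(p_initial[("A", "a")], 2))  # 0.33, as in Step 2 of the slides
print(round(p[("A", "a")], 2))          # value after 10 EM iterations
```

Each pass redistributes the fractional counts so that a source word explained well by one translation word contributes less to the others, which is how the iteration counteracts the over-counting of plain co-occurrence.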