Cross Lingual Information Retrieval (CLIR) Rong Jin.


Cross Lingual Information Retrieval (CLIR)

Rong Jin

The Problem

Increasing pressure for accessing information in foreign languages:
- find information written in foreign languages
- read and interpret that information
- merge it with information in other languages

Need for multilingual information access

Why Is Cross-Lingual IR Important?

The Internet is no longer monolingual, and non-English content is growing rapidly

Non-English speakers are the fastest growing group of new internet users

In 1997, 8.1 million Spanish-speaking users; in 2000, 37 million ……

[Chart: English vs. non-English internet content, 2000–2005. Source: Manning & Napier Information Services, 2000]

2. Multilingual Text Processing

- Character encoding
- Language recognition
- Tokenization
- Stop word removal
- Feature normalization (stemming)
- Part-of-speech tagging
- Phrase identification

Character Encoding

Language (alphabet)-specific native encodings:
- Chinese: GB, Big5
- Western European: ISO-8859-1 (Latin-1)
- Russian: KOI-8, ISO-8859-5, CP-1251

UNICODE (ISO/IEC 10646):
- UTF-8: variable byte length
- UTF-16, UCS-2: fixed double byte
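A quick check of these encoding widths, added here as an illustration (not part of the original deck):

```python
# Compare byte lengths of the same characters under a variable-width
# encoding (UTF-8) and a fixed double-byte encoding (UTF-16 for BMP
# characters). ASCII costs 1 byte in UTF-8; CJK characters cost 3.
for ch in ["A", "é", "Ж", "中"]:
    utf8 = ch.encode("utf-8")        # 1-4 bytes per character
    utf16 = ch.encode("utf-16-be")   # 2 bytes for characters in the BMP
    print(ch, len(utf8), len(utf16))
```

This is why UTF-8 is convenient for mostly-Latin text while fixed-width encodings simplify indexing into a string.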

Tokenization

Punctuation separated from words - incl. word separation characters.

“The train stopped.” → “The”, “train”, “stopped”, “.”

String split into lexical units - incl. segmentation (Chinese) and compound splitting (German)
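The punctuation-splitting step can be sketched with a single regular expression (an illustrative sketch, not the deck's own tokenizer):

```python
import re

def tokenize(text):
    # \w+ grabs maximal runs of word characters; [^\w\s] grabs any
    # single punctuation mark, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The train stopped."))  # ['The', 'train', 'stopped', '.']
```

Real tokenizers handle more cases (abbreviations, hyphenation, clitics), but this reproduces the example above.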

Chinese Segmentation

[Example: a Chinese sentence segmented into words; the Chinese characters did not survive transcription]

German Segmentation

Unrestricted compounding in German: Abendnachrichtensendungsblock

Use compound analysis together with the CELEX German dictionary (360,000 words): Treuhandanstalt → { treuhand, anstalt }

Or use an n-gram representation: Treuhandanstalt → { treuha, reuhan, euhand, uhanda, … }
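The n-gram fallback is trivial to implement; a minimal sketch (using 6-grams, matching the example above):

```python
def char_ngrams(word, n=6):
    # Overlapping character n-grams: a language-independent fallback
    # when no compound splitter or dictionary is available.
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("Treuhandanstalt")[:4])  # ['treuha', 'reuhan', 'euhand', 'uhanda']
```

Because compounds share n-grams with their parts (e.g. "treuha" with "treuhand"), matching works without any morphological analysis.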

CLIR - Approaches

- Machine Translation
- Bilingual Dictionaries
- Parallel/Comparable Corpora

[Diagram: a user query and a document, with the language barrier between the query representation and the document representation]

Example: the Chinese query “誰在 1998 年贏得環法自行車大賽” (“Who won the Tour de France in 1998?”) should match the English document “Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …”

Machine Translation

Translate all documents into the query language

[Diagram: English documents → Machine Translation → Chinese documents → Lucene, searched with Chinese queries]

Machine Translation (MT)

Translate all documents into the query language

- Not viable on large collections (MT is computationally expensive)
- Not viable if there are many possible query languages

[Diagram: English documents → Machine Translation → Chinese documents → Lucene, searched with Chinese queries]

Machine Translation

Translate the query into the languages of the content being searched

[Diagram: Chinese queries → Machine Translation → English queries → Lucene, over English documents]

Machine Translation

Translate the query into the languages of the content being searched

Query translation is inadequate for CLIR:
- no context for accurate translation
- the system selects one preferred target term

[Diagram: Chinese queries → Machine Translation → English queries → Lucene, over English documents]

Example of Translating Queries

Who won the Tour de France in 1998?

Using Dictionaries

Bilingual machine-readable dictionaries (in-house or commercial)

Look up query terms in the dictionary and replace them with translations in the document languages

[Diagram: Chinese queries → Bilingual Dictionary → English queries → Lucene, over English documents]

Using Dictionaries

Problems:
- ambiguity
- many terms are out-of-vocabulary
- lack of multiword terms / phrase identification
- a bilingual dictionary is needed for every query-document language pair of interest
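The basic look-up-and-replace step is simple; a minimal sketch with a hypothetical two-entry English-French dictionary (real systems use large machine-readable dictionaries):

```python
# Hypothetical toy dictionary; "bank" is deliberately ambiguous
# (financial institution vs. riverbank) to show the core problem.
BILINGUAL_DICT = {
    "bank": ["banque", "rive"],
    "interest": ["intérêt"],
}

def translate_query(query):
    translated = []
    for term in query.lower().split():
        # With no context, every translation alternative is kept;
        # out-of-vocabulary terms pass through untranslated.
        translated.extend(BILINGUAL_DICT.get(term, [term]))
    return translated

print(translate_query("bank interest"))  # ['banque', 'rive', 'intérêt']
```

Note how both problems above surface immediately: the ambiguous "bank" expands to two unrelated terms, and anything not in the dictionary is left in the source language.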

Word Sense Disambiguation

“The sign for independent press to disappear”

Using Corpora

Parallel corpora: translation-equivalent texts, e.g. the UN corpus in French, Spanish & English

Comparable corpora: similar in topic, style, time, etc., e.g. Hong Kong TV broadcast news in both Chinese and English

Using Corpora: Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

Query:

A E

d1

a a c e

d2

b c d a

d3

e d a

How to bridge the language barrier using the parallel corpora ?

Translate Query using Parallel Corpus (I)

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

Query:

A E

d1

a a c e

d2

b c d a

d3

e d a

Translate Query using Parallel Corpus (I)

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

Query:

A E

d2

b c d a

d3

e d a

Query:

c e

d1

a a c e


Translate Query using Parallel Corpus (II)

Learn word-to-word translation probabilities from the parallel corpora

Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1

Translate Query using Parallel Corpus (II)

Word-to-Word Translation Probabilities

Q = (A E), d1 = (a a c e)

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1


Translate Query using Parallel Corpus (II)

Word-to-Word Translation Probabilities

Q = (A E), d1 = (a a c e)

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1

d1

a a c e

d2

b c d a

d3

e d a

A, E 0.58

Translate Query using Parallel Corpus (II)

Q = (A E), d1 = (a a c e)

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1

d1

a a c e

d2

b c d a

d3

e d a

A, E 0.58 0.36 0.48

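The scores above are consistent with a simple translation-based retrieval rule: score each document by the product, over query terms, of the average probability of translating a document word into that query term. The exact formula is not spelled out above, so the following Python sketch is a reconstruction from the numbers shown:

```python
# P[E][f] = probability that foreign word f translates to English word E,
# copied from the translation-probability table above.
P = {
    "A": {"a": 1.0, "b": 0.67, "c": 1.0, "d": 0.5, "e": 0.67},
    "E": {"a": 0.5, "b": 0.33, "c": 0.5, "d": 0.5, "e": 1.0},
}

def score(query, doc):
    # P(q|d) = product over query terms of the average translation
    # probability over the document's words.
    s = 1.0
    for q in query:
        s *= sum(P[q][w] for w in doc) / len(doc)
    return s

for name, doc in [("d1", "a a c e"), ("d2", "b c d a"), ("d3", "e d a")]:
    print(name, round(score(["A", "E"], doc.split()), 2))
```

With the rounded probabilities from the table, d2 and d3 come out at 0.36 and 0.48 as shown; d1 evaluates to about 0.57 rather than the displayed 0.58, presumably because of intermediate rounding.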

Translate Query using Parallel Corpus (II)

How to obtain the translation probabilities ?

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

Approach I: Co-occurrence Counting

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

a b c d e total

A 4 2 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

Approach I: Co-occurrence Counting

Co-occurrence based translation model

e.g. p(A|a) = co(a, A) / occur(a) = 4/4 = 1

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

a b c d e total

A 4 2 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

Approach I: Co-occurrence Counting

a b c d e total

A 4 2 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1

P(B|c) = co(B, c)/occ(c) = 2/4 = 0.5
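The sentence-level counting above is easy to verify in code; a short sketch over the toy corpus (uppercase letters are one language, lowercase the other):

```python
# Each pair is an aligned sentence pair from the toy parallel corpus.
CORPUS = [
    ("A B C", "b c a d"),
    ("B D E A", "b d e a"),
    ("C A", "c a"),
    ("A B E", "c b e"),
    ("A C E", "a c e"),
]

def cooccur(E, f):
    # Number of sentence pairs containing both E and f.
    return sum(1 for e_s, f_s in CORPUS
               if E in e_s.split() and f in f_s.split())

def occur(f):
    # Number of sentence pairs whose foreign side contains f.
    return sum(1 for _, f_s in CORPUS if f in f_s.split())

print(cooccur("A", "a") / occur("a"))  # p(A|a) = 4/4 = 1.0
print(cooccur("B", "c") / occur("c"))  # p(B|c) = 2/4 = 0.5
```

This reproduces the two worked examples, p(A|a) = 1 and p(B|c) = 0.5.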

Approach I: Co-occurrence Counting

Any problem?

Approach I: Co-occurrence Counting

P(A|a) a b c d e

A 1 0.67 1 0.5 0.67

B 0.5 1 0.5 1 0.67

C 0.75 0.33 0.75 0.5 0.33

D 0.25 0.33 0 0.5 0.33

E 0.5 0.33 0.5 0.5 1

Many large translation probabilities. Usually one word of one language corresponds mostly to a single word in another language

Approach I: Co-occurrence Counting

Since one word of one language usually corresponds mostly to a single word in another language, we may over-count the co-occurrence statistics

Approach I: Overcounting

a b c d e total

A 4 2 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

co(A, a) = 4 implies that all occurrences of ‘A’ are due to the occurrences of ‘a’

Approach I: Overcounting

a b c d e total

A 4 3 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

Approach I: Overcounting

a b c d e total

A 4 3 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3

If we believe that the first two occurrences of ‘A’ are due to ‘a’, then co(A, b) = 1, not 3

But we have no idea whether the first two occurrences of ‘A’ are due to ‘a’

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e


How to Compute Co-occurrence? The IBM Statistical Translation Models

A series of translation models (Models 1-5) was published by IBM research; we will only discuss IBM Translation Model 1

It uses an iterative procedure to eliminate the over-counting problem

Step 1: Compute co-occurrence

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

a b c d e total

A 4 2 4 1 2 4

B 2 3 2 2 2 3

C 3 1 3 1 1 3

D 1 1 0 1 1 1

E 2 1 2 1 3 3

total 4 3 4 2 3


Step 2: Compute Conditional Prob.

Assume that translation probabilities are proportional to co-occurrence

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

a b c d e

A 0.33 0.25 0.36 0.17 0.22

B 0.17 … … … …

C 0.25 … … … …

D 0.08 … … … …

E 0.17 … … … …

p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a)) = 4/12 ≈ 0.33

Step 3: Re-estimate co-occurrence

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

A B C

b c a d

‘A’ can be produced by any one of the words ‘b’, ‘c’, ‘a’, ‘d’

co(A,a) for sentence 1 should be computed by taking this competition into account

a b c d e

A 0.33 0.25 0.36 0.17 0.22

B 0.17 … … … …

C 0.25 … … … …

D 0.08 … … … …

E 0.17 … … … …

Step 3: Re-estimate co-occurrence

co(A, a; 1) = p(A|a) / (p(A|a) + p(A|b) + p(A|c) + p(A|d)) = 0.33 / (0.33 + 0.25 + 0.36 + 0.17) = 0.41


Step 3: Re-estimate co-occurrence

Parallel Corpus

A B C b c a d

B D E A b d e a

C A c a

A B E c b e

A C E a c e

co(A,a) for each sentence: 0.41, 0.37, 0.48, 0, 0.36

co(A,a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62

Step 3: Re-estimate co-occurrence

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

a b c d e

A 1.62 … … … …

B 0.75 … … … …

C 1.25 … … … …

D 0.31 … … … …

E 0.56 … … … …

Step 4: Re-compute Conditional Prob.

Parallel Corpus

A B C b c a d

B D E b d e a

C A c a

A B E c b e

A C E a c e

a b c d e

A 0.36 … … … …

B 0.15 … … … …

C 0.22 … … … …

D 0.06 … … … …

E 0.11 … … … …

p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a)) = 1.62 / 4.49 ≈ 0.36

IBM Statistical Translation Model

Apply the counting and estimation steps iteratively; the convergence can be proved

This is the so-called Expectation-Maximization (EM) algorithm:
- E-step: counting
- M-step: estimate the translation probabilities

It achieved the best performance in past TREC evaluations for CLIR
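The whole iterative procedure fits in a few lines of Python. This is a minimal sketch of IBM Model 1 training on the toy corpus, starting from uniform translation probabilities (rather than the co-occurrence initialization walked through above):

```python
from collections import defaultdict

# Toy parallel corpus: uppercase = one language, lowercase = the other.
CORPUS = [
    ("A B C", "b c a d"),
    ("B D E", "b d e a"),
    ("C A", "c a"),
    ("A B E", "c b e"),
    ("A C E", "a c e"),
]
pairs = [(e.split(), f.split()) for e, f in CORPUS]
e_vocab = {w for e, _ in pairs for w in e}
f_vocab = {w for _, f in pairs for w in f}

# Uniform initialization: t[E][f] = probability of translating f into E.
t = {E: {f: 1.0 / len(e_vocab) for f in f_vocab} for E in e_vocab}

for _ in range(20):
    count = defaultdict(lambda: defaultdict(float))
    # E-step: distribute each target word's count over the source words
    # in the same sentence pair -- the "competition" that eliminates
    # the over-counting of plain co-occurrence.
    for e_sent, f_sent in pairs:
        for E in e_sent:
            z = sum(t[E][f] for f in f_sent)
            for f in f_sent:
                count[f][E] += t[E][f] / z   # fractional (expected) count
    # M-step: renormalize so the probabilities out of each source word sum to 1.
    for f in f_vocab:
        total = sum(count[f].values())
        for E in e_vocab:
            t[E][f] = count[f][E] / total if total else 0.0

print(round(t["A"]["a"], 2), round(t["D"]["a"], 2))
```

After a few iterations the mass concentrates on the dominant pairings (e.g. t(A|a) ends up far larger than t(D|a)), which is exactly the effect the re-estimation step above is after.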