Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Page 1: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval

Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval

Advisor: Dr. Hsu

Presenter: Zih-Hui Lin

Authors: Ying Zhang and Phil Vines

Page 2: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Outline

─ Motivation
─ Objective
─ Previous work
─ Methodology
─ Experiments and results
─ Conclusions

Page 3: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Motivation

One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out-of-vocabulary (OOV) terms.

─ An OOV term is not recognized, so it is segmented into either smaller sequences of characters or individual characters.

─ Example: 北野武 (Kitano Takeshi) is translated character by character as "north limit military".

Previous work has either relied on manual intervention or has only been partially successful in solving this problem.

Page 4: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Objective

We propose a segmentation-free method that can be applied to both Chinese-English and English-Chinese CLIR. It automatically extracts correct translations of OOV terms from the Web, and is thus a significant improvement on earlier work.

Page 5: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


English translation extraction in Chinese-English CLIR

Chinese OOV term detection

─ Example: for 北野武, segmented into single characters (north limit military), the probability P_value given by the HMM will be very low.

─ If P_value < P_min, the query is taken to contain OOV terms.

Web text extraction

─ We extract strings that contain the Chinese query terms and some English text from the Web.

Collection of co-occurrence statistics

Translation selection

Search for the longest Chinese substring C_t:

1. |C_targets| = max(|C_ij|).

2. f(e_t, C_t) = max(f(e_i, C_targets)).

3. Add (C_t, e_t) into the translation dictionary.

Search for the English term e_t' with the highest frequency:

1. f(e_targets) = max(f(e_i)).

2. f(e_t', C_t') = max(f(e_targets, C_ij)).

3. If C_t' ≠ C_t and e_t' ≠ e_t, add (C_t', e_t') into the translation dictionary.

Example strings (c = Chinese character, e = English term):

北野武 (Kitano Takeshi): c4 c5 c6, e1

導演北野武 (Kitano Takeshi), where 導演 means "director": c2 c3 c4 c5 c6, e1
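
The two selection passes listed on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name select_translations, the input layout (a dict from (English term, Chinese substring) pairs to counts), and the reading of f(e_i) as the sum of an English term's co-occurrence counts are all hypothetical, and the toy counts are invented.

```python
from collections import Counter


def select_translations(cooc):
    """Apply the two selection passes to a co-occurrence table.

    cooc maps (english_term, chinese_substring) -> co-occurrence count.
    Returns a list of (chinese_substring, english_term) dictionary entries.
    """
    dictionary = []

    # Pass 1: keep the longest Chinese substrings (C_targets), then pick the
    # English term that co-occurs most often with one of them.
    max_len = max(len(c) for _, c in cooc)
    longest = {pair: n for pair, n in cooc.items() if len(pair[1]) == max_len}
    e_t, c_t = max(longest, key=longest.get)
    dictionary.append((c_t, e_t))

    # Pass 2: keep the most frequent English terms (e_targets), then pick the
    # Chinese substring that co-occurs most often with one of them.
    # Assumption: f(e_i) is the sum of that term's co-occurrence counts.
    english_freq = Counter()
    for (e, _), n in cooc.items():
        english_freq[e] += n
    top = max(english_freq.values())
    e_targets = {e for e, n in english_freq.items() if n == top}
    candidates = {pair: n for pair, n in cooc.items() if pair[0] in e_targets}
    e_t2, c_t2 = max(candidates, key=candidates.get)

    # Step 3 of the second list: add only if both members differ from pass 1.
    if c_t2 != c_t and e_t2 != e_t:
        dictionary.append((c_t2, e_t2))
    return dictionary


# Invented toy counts, for illustration only.
toy = {("Kitano Takeshi", "北野武"): 5, ("Kitano", "北野"): 3}
print(select_translations(toy))
```

With these toy counts both passes agree on the same pair, so only one entry is added. Ties are broken by insertion order here, which the slide does not specify.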

Page 6: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Chinese translation extraction in English-Chinese CLIR

Extraction of web text

─ Use Google to fetch the top 100 Chinese documents with the English OOV term e_oov as the query.

Collection of co-occurrence statistics

─ Accumulate the frequency f_oov.

─ Consider all substrings in S_left and S_right, and collect the frequency f_n and the length |s_n| of each Chinese substring.

Translation selection

─ Exclude any substring that is already in the translation dictionary or that does not occur in the document collection.

Example:

與區域貿易………兩岸經貿關係 (Canada and Cross Straits Economic Relations) 三組發表九………英茂、加

S_left (contains 20 characters) | e_oov | S_right (contains 20 characters)

[Figure labels: "Longest length", "Highest frequency"]
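
The statistics-collection step on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name collect_statistics, exact case-sensitive matching of e_oov, and treating a "Chinese substring" as any contiguous piece of a run of CJK Unified Ideographs (U+4E00 to U+9FFF) are assumptions made for the example. The 20-character window comes from the slide.

```python
import re
from collections import Counter

WINDOW = 20  # characters kept on each side of e_oov, as on the slide
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")  # assumption: CJK Unified Ideographs only


def collect_statistics(documents, e_oov, window=WINDOW):
    """Return (f_oov, substring_freq) for a list of document strings.

    f_oov counts occurrences of e_oov; substring_freq counts every Chinese
    substring found in the left and right windows around each occurrence.
    """
    f_oov = 0
    substring_freq = Counter()
    for text in documents:
        start = text.find(e_oov)
        while start != -1:
            f_oov += 1
            end = start + len(e_oov)
            s_left = text[max(0, start - window):start]
            s_right = text[end:end + window]
            for side in (s_left, s_right):
                # Punctuation and Latin text break a window into separate runs.
                for run in CJK_RUN.findall(side):
                    for i in range(len(run)):
                        for j in range(i + 1, len(run) + 1):
                            substring_freq[run[i:j]] += 1
            start = text.find(e_oov, end)
    return f_oov, substring_freq


# Invented toy documents, for illustration only.
docs = ["兩岸經貿關係(Cross Straits)三組", "兩岸 Cross Straits 關係"]
f_oov, freq = collect_statistics(docs, "Cross Straits")
print(f_oov, freq["兩岸"], len("兩岸"))
```

The length |s_n| of a candidate is simply len(s) in Python, so only the frequencies need to be stored.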

Page 7: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Experiments and results

[Figures: results for Chinese-English CLIR and English-Chinese CLIR; charts not recoverable from the transcript]

Page 8: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Introduction

When translating from Chinese to English, a standard first step is to segment the text into words based on an existing segmentation dictionary.

However, where an OOV term occurs, it will not be recognized and will be segmented into either smaller sequences of characters or individual characters.

We propose a segmentation-free method based on frequency and length analysis and corpus-based disambiguation.
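
The segmentation failure described on this slide can be reproduced with a toy segmenter. The sketch below uses greedy forward maximum matching, which is a stand-in chosen for brevity and not necessarily the segmenter used in the work (the slides mention an HMM). The function name max_match, the max_word_len limit, and the toy dictionary are hypothetical.

```python
def max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation against a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary that does not contain the name 北野武.
dictionary = {"導演", "電影"}
print(max_match("導演北野武", dictionary))  # ['導演', '北', '野', '武']
```

Because the name is missing from the dictionary, it falls apart into single characters, which a dictionary-based translator would then translate one by one.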

Page 9: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Previous work

Dictionary-based translation schemes need to address three major issues:

─ Phrase identification and translation, e.g. "non proliferation treaty" and "cross straits".

─ Translation ambiguity, addressed using techniques such as term co-occurrence, mutual information, or language modeling.

─ Out-of-vocabulary (OOV) terms, e.g. "Dioxin".

Page 10: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Previous work: Existing approaches to the OOV problem

Depending on the language, it may be possible to deduce appropriate transliterated translations automatically.

─ This approach has been successfully applied in English-Arabic CLIR.

However, the issue is more difficult in Chinese: many characters have the same sound, and many English syllables do not have equivalent sounds in Chinese, so selecting the correct characters to represent a transliterated word can be problematic.

─ Examples: cross straits (兩岸), 北野武 (north limit military).

Page 11: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Previous work: Segmentation-free translation extraction

It is common to find a small amount of English text in Chinese web documents, but extremely rare to find Chinese text in English web documents.

We therefore rely on Chinese web documents to extract translations in both directions.

The problem is that the Chinese OOV term we are looking for is not yet known, and thus we have no information about how it should be segmented.

─ In previous work, this problem was overcome by manual intervention to provide appropriate segmentation.

Page 12: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Experiments and results

Chinese-English CLIR

─ Retrieving English documents using Chinese queries.

[Figure: results chart not recoverable from the transcript]

Page 13: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Experiments and results (cont.)

English-Chinese CLIR

─ Retrieving Chinese documents using English queries.

─ The aim of our work is to find appropriate Chinese translations of English OOV terms.

Page 14: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


Conclusions

We have described improved ways to extract translations of OOV terms from the Web without relying on prior segmentation.

Although the Web is constantly changing, we were able to find most OOV terms, many of which relate to news events from up to 10 years ago.

Page 15: Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval


My opinion

Advantage: segmentation-free translation extraction.

Disadvantage:

Application: online translation ………