Translation of Web Queries Using Anchor Text Mining

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology. Translation of Web Queries Using Anchor Text Mining. Advisor: Dr. Hsu. Graduate: Wen-Hsiang Hu. Authors: Wen-Hsiang Lu. ACM, June 2002

description

Translation of Web Queries Using Anchor Text Mining. Advisor: Dr. Hsu. Graduate: Wen-Hsiang Hu. Authors: Wen-Hsiang Lu. ACM, June 2002. Outline: Motivation, Objective, Introduction, Anchor Text Mining, Probabilistic Inference Model, Query Translation System, Experiments, Discussion.

Transcript of Translation of Web Queries Using Anchor Text Mining

Page 1: Translation of Web Queries Using Anchor Text Mining

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Translation of Web Queries Using Anchor Text Mining

Advisor : Dr. Hsu

Graduate : Wen-Hsiang Hu

Authors : Wen-Hsiang Lu

ACM, June 2002

Page 2: Translation of Web Queries Using Anchor Text Mining

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
Anchor Text Mining
Probabilistic Inference Model
Query Translation System
Experiments
Discussion
Conclusion
Personal Opinion

Page 3: Translation of Web Queries Using Anchor Text Mining

Motivation

One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.

Page 4: Translation of Web Queries Using Anchor Text Mining

Objective

Automatically extract translations of Web query terms.

Page 5: Translation of Web Queries Using Anchor Text Mining

Introduction

In this paper, we are interested in discovering translations of new terminology and proper names through mining Web anchor texts.

Problems with previous research methods:

Dependence on parallel corpora for various subjects and multiple languages
Lack of parallel correlation between word pairs
Short query terms

[Figure: anchor texts of links pointing to the Yahoo page, e.g. "Yahoo", "雅虎" (Yahoo), "美國雅虎" (Yahoo US), "搜尋、雅虎" (search, Yahoo)]

Page 6: Translation of Web Queries Using Anchor Text Mining

Anchor Text Mining

We use a triple form <Uj, Ui, Dk> to indicate that page Uj points to page Ui with description text Dk.

For a Web page (or URL) Ui, its anchor-text set AT(Ui) is defined as all of the anchor texts of the links pointing to Ui, i.e., Ui's in-links.

For a query term appearing in AT(Ui), it is likely that its corresponding translations also appear together.

[Figure: several pages Uj linking to page Ui]
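The anchor-text set AT(Ui) defined above can be sketched as a simple grouping of triples by target page. This is a minimal illustration, not the authors' implementation; the function name `build_anchor_text_sets` and the example URLs are assumptions made for this sketch.

```python
from collections import defaultdict


def build_anchor_text_sets(triples):
    """Group anchor texts by the page they point to.

    Each triple is (source_url, target_url, description), i.e. <Uj, Ui, Dk>.
    Returns a dict mapping each target page Ui to its anchor-text set AT(Ui),
    kept as a list so that repeated anchor texts are still counted.
    """
    anchor_sets = defaultdict(list)
    for source_url, target_url, description in triples:
        anchor_sets[target_url].append(description)
    return dict(anchor_sets)


# Hypothetical links, for illustration only.
triples = [
    ("http://a.example/", "http://www.yahoo.com/", "Yahoo"),
    ("http://b.example/", "http://www.yahoo.com/", "雅虎"),
    ("http://c.example/", "http://www.yahoo.com/", "美國雅虎"),
    ("http://a.example/", "http://other.example/", "other page"),
]
anchor_sets = build_anchor_text_sets(triples)
```

Here AT(Ui) for the Yahoo page contains both the English term and Chinese terms, which is the co-occurrence the mining step relies on.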

Page 7: Translation of Web Queries Using Anchor Text Mining

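Probabilistic Inference Model

An asymmetric similarity estimation model may cause some common terms to become the best translations. We therefore use a symmetric similarity estimation function based on the probabilistic inference model, defined first below, where Ts is the source term and Tt is the target translation.

The asymmetric model uses the inductive rule "if Ts then Tt", i.e. P(Ts→Tt):

$$P(T_s \rightarrow T_t) = P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)} \qquad (1)$$

The symmetric model uses both inductive rules, "if Ts then Tt" and "if Tt then Ts", i.e. P(Ts↔Tt):

$$P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} = \frac{P(T_s \cap T_t)}{P(T_s) + P(T_t) - P(T_s \cap T_t)} \qquad (2)$$

Example: there are 100 anchor texts in total; Ts = Yahoo appears in only one of them and Tt = 雅虎 appears in 10.

P(Tt|Ts) = 0.01 / 0.01 = 1

P(Ts↔Tt) = 0.01 / [(0.01 + 0.1) − 0.01] = 0.1

[Figure: the list of 100 anchor texts, e.g. "雅虎 Yahoo", "雅虎 動物" (animal), "雅虎 企業" (enterprise), ...]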
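The worked example (100 anchor texts, Yahoo in one, 雅虎 in 10) can be checked with a short count-based sketch. The function names and the substring test used to decide whether a term "appears" in an anchor text are assumptions for illustration; the filler anchor texts are made up.

```python
def asymmetric_similarity(anchor_texts, source, target):
    """Estimate P(Ts -> Tt) = P(Ts ∩ Tt) / P(Ts) from anchor-text counts."""
    n_source = sum(1 for text in anchor_texts if source in text)
    n_both = sum(1 for text in anchor_texts if source in text and target in text)
    return n_both / n_source if n_source else 0.0


def symmetric_similarity(anchor_texts, source, target):
    """Estimate P(Ts <-> Tt) = P(Ts ∩ Tt) / P(Ts ∪ Tt) from anchor-text counts."""
    n_both = sum(1 for text in anchor_texts if source in text and target in text)
    n_either = sum(1 for text in anchor_texts if source in text or target in text)
    return n_both / n_either if n_either else 0.0


# 100 anchor texts: one contains both terms, nine contain only 雅虎,
# and ninety (made-up filler) contain neither.
anchor_texts = ["雅虎 Yahoo"] + ["雅虎"] * 9 + ["其他"] * 90

asym = asymmetric_similarity(anchor_texts, "Yahoo", "雅虎")
sym = symmetric_similarity(anchor_texts, "Yahoo", "雅虎")
print(asym, sym)
```

The shared total of 100 cancels in both ratios, so working with raw counts gives the same values as working with the probabilities 0.01 and 0.1.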

Page 8: Translation of Web Queries Using Anchor Text Mining

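Probabilistic Inference Model (cont.)

Let U = (U1, U2, …, Un) be a concept space (Web page space), consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., Ui ∩ Uj = ∅ for i ≠ j. We can rewrite Eq. (2) as follows:

$$P(T_s \leftrightarrow T_t) = \frac{\sum_{i=1}^{n} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{i=1}^{n} P(T_s \cup T_t \mid U_i)\, P(U_i)}, \qquad P(U_i) = \frac{L(U_i)}{\sum_{j=1}^{n} L(U_j)}$$

where L(Uj) = the number of in-links of page Uj.

[Figure: illustration of in-link counts; labels Uj, L(Ui), and the value 15]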

Page 9: Translation of Web Queries Using Anchor Text Mining

Probabilistic Inference Model (cont.)

We assume that Ts and Tt are independent given Ui; then the joint probability P(Ts∩Tt|Ui) is equal to the product of P(Ts|Ui) and P(Tt|Ui):

$$P(T_s \cap T_t \mid U_i) = P(T_s \mid U_i)\, P(T_t \mid U_i)$$

The above estimation approach considers the link information and degree of authority among Web pages.
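A minimal sketch of how a link-weighted symmetric score could be computed under the independence assumption. Several choices here are assumptions for illustration and not confirmed by the slides: P(Ui) is taken as the page's share of all in-links, P(T|Ui) is estimated as the fraction of Ui's anchor texts containing T, and P(Ts∪Tt|Ui) is expanded as P(Ts|Ui) + P(Tt|Ui) − P(Ts|Ui)P(Tt|Ui). The function name and example data are hypothetical.

```python
def link_weighted_symmetric_similarity(anchor_sets, source, target):
    """Symmetric similarity summed over pages, weighted by in-link share.

    anchor_sets maps each page Ui to the list of anchor texts of its in-links,
    so len(anchor_sets[Ui]) plays the role of L(Ui).
    """
    total_links = sum(len(texts) for texts in anchor_sets.values())
    numerator = 0.0
    denominator = 0.0
    for texts in anchor_sets.values():
        if not texts:
            continue  # a page without in-links contributes nothing
        p_page = len(texts) / total_links                    # assumed P(Ui)
        p_s = sum(source in t for t in texts) / len(texts)   # assumed P(Ts|Ui)
        p_t = sum(target in t for t in texts) / len(texts)   # assumed P(Tt|Ui)
        joint = p_s * p_t                                    # independence given Ui
        numerator += joint * p_page
        denominator += (p_s + p_t - joint) * p_page
    return numerator / denominator if denominator else 0.0


# Hypothetical anchor-text sets, for illustration only.
anchor_sets = {
    "yahoo": ["Yahoo", "雅虎", "Yahoo 雅虎", "美國雅虎"],
    "zoo": ["動物", "動物園"],
}
score_related = link_weighted_symmetric_similarity(anchor_sets, "Yahoo", "雅虎")
score_unrelated = link_weighted_symmetric_similarity(anchor_sets, "Yahoo", "動物")
```

Pages with more in-links receive a larger weight, which is one way to read the slide's remark about link information and degree of authority.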

Page 10: Translation of Web Queries Using Anchor Text Mining

Query Translation System

Three different methods are used to extract Chinese terms:

PAT-tree-based:
1. Check whether the strings of candidate terms are complete within a lexical boundary.
2. Decide the importance of a term based on its relative frequency.

Query-set-based: take queries from search engines as the term set, using query sets of different sizes.

Tagger-based: use the CKIP tagger to extract unknown words.

[Figure: anchor texts pointing to the Yahoo page ("Yahoo", "雅虎", "美國雅虎", "搜尋、雅虎") and the extracted term 雅虎]
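The query-set-based method could be sketched roughly as follows: terms from a query set are matched as substrings against the anchor texts of pages whose anchor-text set contains the source term. This is an illustrative simplification under stated assumptions, not the system described in the paper; the function name, the substring matching, and the example data are all hypothetical.

```python
from collections import Counter


def extract_candidates(anchor_sets, query_set, source_term):
    """Count query-set terms that co-occur with source_term in AT(Ui)."""
    counts = Counter()
    for texts in anchor_sets.values():
        # Only pages whose anchor-text set mentions the source term are relevant.
        if not any(source_term in text for text in texts):
            continue
        for text in texts:
            for term in query_set:
                if term != source_term and term in text:
                    counts[term] += 1
    return counts


# Hypothetical data, for illustration only.
anchor_sets = {
    "yahoo": ["Yahoo", "雅虎", "美國雅虎", "搜尋、雅虎"],
    "zoo": ["動物園"],
}
query_set = {"雅虎", "搜尋", "動物園"}
candidates = extract_candidates(anchor_sets, query_set, "Yahoo")
```

The resulting candidates would then be ranked by a similarity estimation model rather than by these raw counts.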

Page 11: Translation of Web Queries Using Anchor Text Mining

Experiments

Experimental Environment

We collected popular query terms from the logs of Dreamer and GAIS. These query terms were taken as the major test set in our term translation extraction analysis. We filtered out the terms that had no corresponding Chinese translations in the anchor-text database and picked 622 English terms as the source query set.

Page 12: Translation of Web Queries Using Anchor Text Mining

Experiments (cont.)

Evaluation Metric

For a set of test query terms, its top-n inclusion rate is defined as the percentage of the query terms whose effective translation(s) can be found in the top n extracted translations.
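The top-n inclusion rate can be computed directly from its definition. The sketch below assumes ranked candidate lists and sets of effective translations are available as dictionaries; the function name and example data are hypothetical.

```python
def top_n_inclusion_rate(extracted, effective, n):
    """Percentage of query terms with an effective translation in the top n.

    extracted: dict mapping each query term to its ranked candidate list.
    effective: dict mapping each query term to its set of effective translations.
    """
    if not extracted:
        return 0.0
    hits = sum(
        1
        for term, ranked in extracted.items()
        if set(ranked[:n]) & effective.get(term, set())
    )
    return 100.0 * hits / len(extracted)


# Hypothetical rankings, for illustration only.
extracted = {
    "yahoo": ["雅虎", "搜尋"],
    "sakura": ["蜘蛛網", "櫻花"],
}
effective = {"yahoo": {"雅虎"}, "sakura": {"櫻花"}}
top1 = top_n_inclusion_rate(extracted, effective, 1)
top2 = top_n_inclusion_rate(extracted, effective, 2)
```

In this toy data only "yahoo" has an effective translation at rank 1, while both terms have one within the top 2.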

Page 13: Translation of Web Queries Using Anchor Text Mining

Experiments (cont.)

Performance with Various Similarity Estimation Models

MA: asymmetric model, P(Ts→Tt).
MAL: asymmetric model with link information.
MS: symmetric model, P(Ts↔Tt).
MSL: symmetric model with link information (the proposed model).

Test setting: 622 English query terms and the query-set-based term extraction method.

Page 14: Translation of Web Queries Using Anchor Text Mining

Experiments (cont.)

Performance with Various Term Extraction Methods

MSL is used as the similarity estimation model.

Translation type | PAT-tree-based | Query-set-based | Tagger-based
longer-translations | ○ | ○ | X
short-translations | ○ ○ (only two marks survive; their columns are unclear)
low-frequency | X | ○ | ○

Page 15: Translation of Web Queries Using Anchor Text Mining

Experiments (cont.)

Performance with Various Query-Set Sizes

The medium-sized query set achieved the best performance. A larger query set might also produce more noise.

Example: translations extracted for "sakura"

Query set of 9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)

Query set of 228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)

Page 16: Translation of Web Queries Using Anchor Text Mining

Discussion

Comparisons with a translation lexicon
Queries suitable for finding translations
Extracting domain-specific translations
Experiments on Simplified Chinese pages

Page 17: Translation of Web Queries Using Anchor Text Mining

Conclusion

We propose a new and effective approach for mining Web link structures and anchor texts for translations of Web query terms.

Future research: combining more in-depth linguistic knowledge to remove noisy terms.

Page 18: Translation of Web Queries Using Anchor Text Mining

……..

Personal Opinion