1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.

1

Cross-Lingual Query Suggestion Using Query Logs of Different Languages

SIGIR 07

2

Abstract

• Query suggestion

– To suggest relevant queries for a given query

– To help users better specify their information needs

• Cross-Lingual Query Suggestion (CLQS):

– For a query in one language, we suggest similar or relevant queries in other languages.

• cross-lingual keyword bidding (search engines)

• cross-language information retrieval (CLIR)

3

Introduction

• CLQS vs. Cross-Lingual Query Expansion

– CLQS suggests full queries formulated by users in another language.

• The users of search engines

– have similar interests in the same period of time

– issue queries on similar topics in different languages

• Key point

– How to learn a similarity measure between two queries

– MLQS: term co-occurrence based MI and χ²

4

Estimating Cross-Lingual Query similarity

• Discriminative Model for Estimating Cross-Lingual Query Similarity

• Monolingual Query Similarity Measure Based on Click-through Information

• Features Used for Learning Cross-Lingual Query Similarity Measure

– Bilingual Dictionary

– Parallel Corpora

– Online Mining for Related Queries

– Monolingual Query Suggestion

• Estimating Cross-lingual Query Similarity

5

Discriminative Model for Estimating Cross-Lingual Query Similarity – 1/2

– qf : a source language query

– qe : a target language query

– simML : Monolingual query similarity

– simCL : Cross-lingual query similarity

– Tqf : translation of qf in the target language
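The equation on this slide was lost in extraction. Read together, the symbols above suggest the following learning target, given here as a reconstruction under that assumption rather than the slide's exact formula:

```latex
\mathrm{sim}_{CL}(q_f, q_e) = \mathrm{sim}_{ML}(T_{q_f}, q_e)
```

In words: the cross-lingual similarity of qf and qe is fitted to the monolingual similarity between the translation Tqf and qe.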

6

Discriminative Model for Estimating Cross-Lingual Query Similarity – 2/2

• Learning: LIBSVM regression algorithm

– f : feature functions

– φ : mapping from the feature space onto the kernel space

– w : weight vector in the kernel space

– relevant vs. irrelevant

– strongly relevant, weakly relevant or irrelevant
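The three symbols defined on this slide (feature functions f, a kernel mapping, weight vector w) combine into a single scoring function. The slide's own equation is missing from the transcript, so the form below is an assumed reconstruction, and the letter φ for the kernel mapping is likewise an assumption:

```latex
\mathrm{sim}_{CL}(q_f, q_e) = w \cdot \phi\big(f(q_f, q_e)\big)
```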

7






8

Monolingual Query Similarity Measure Based on Click-through Information

• click-through information in query logs [26]

• KN(x) : number of keywords in a query x

• RD(x) : number of clicked URLs for a query x

• α = 0.4 , β = 0.6
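A minimal sketch of how a click-through based similarity of this kind could be computed. The formula itself is missing from the transcript, so the form used here (keyword overlap and clicked-URL overlap, each normalized by the larger of the two counts and mixed with the weights 0.4 and 0.6) is an assumption, as are the names `monolingual_similarity`, `alpha` and `beta`.

```python
def monolingual_similarity(keywords_p, keywords_q, clicks_p, clicks_q,
                           alpha=0.4, beta=0.6):
    """Assumed form: alpha * keyword overlap + beta * clicked-URL overlap."""
    kp, kq = set(keywords_p), set(keywords_q)
    cp, cq = set(clicks_p), set(clicks_q)
    # KN(x): number of keywords in x; RD(x): number of clicked URLs for x.
    kn_max = max(len(kp), len(kq))
    rd_max = max(len(cp), len(cq))
    keyword_part = len(kp & kq) / kn_max if kn_max else 0.0
    click_part = len(cp & cq) / rd_max if rd_max else 0.0
    return alpha * keyword_part + beta * click_part


# Toy data (hypothetical): two queries sharing 2 keywords and 2 clicked URLs.
score = monolingual_similarity(
    ["cheap", "flights", "paris"], ["flights", "paris"],
    ["u1", "u2", "u3"], ["u2", "u3", "u4"],
)
```

With weights summing to 1, two identical queries with identical clicks score 1.0 and queries with nothing in common score 0.0.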

9






10

1. Bilingual Dictionary – 1/2

– 120,000 unique entries (built in-house)

– Given an input query qf={wf1,wf2,…,wfn} (in the source language)

– By bilingual dictionary D: D(wfi)={ti1,ti2,…,tim}

– C(x,y) is the number of queries in the log containing both x and y.

– C(x) is the number of queries in the log containing x.

– N is the total number of queries in the log.
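The counts C(x,y), C(x) and N are the ingredients of a co-occurrence based mutual information score, which can be used to prefer dictionary translations whose terms appear together in the target-language log. The sketch below assumes the weighted form P(x,y)·log(P(x,y)/(P(x)P(y))) and sums it over term pairs; the slide's own formula is not in the transcript, and the names `mutual_information` and `cohesion` as well as the toy counts are hypothetical.

```python
import math
from itertools import combinations


def mutual_information(c_xy, c_x, c_y, n):
    """MI of two terms from query-log counts (assumed weighted form)."""
    if c_xy == 0:
        return 0.0
    p_xy, p_x, p_y = c_xy / n, c_x / n, c_y / n
    return p_xy * math.log(p_xy / (p_x * p_y))


def cohesion(translation, pair_counts, term_counts, n):
    """Sum MI over all unordered pairs of terms in one candidate translation."""
    total = 0.0
    for x, y in combinations(translation, 2):
        c_xy = pair_counts.get(frozenset((x, y)), 0)
        total += mutual_information(
            c_xy, term_counts.get(x, 0), term_counts.get(y, 0), n
        )
    return total


# Toy log statistics (hypothetical): "vol" may translate to "flight" or "theft".
term_counts = {"cheap": 100, "flight": 100, "theft": 100}
pair_counts = {
    frozenset(("cheap", "flight")): 50,
    frozenset(("cheap", "theft")): 10,
}
n_queries = 1000
score_flight = cohesion(("cheap", "flight"), pair_counts, term_counts, n_queries)
score_theft = cohesion(("cheap", "theft"), pair_counts, term_counts, n_queries)
```

Here "cheap flight" scores higher than "cheap theft", so it would rank above it among the candidate translations.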

11

1. Bilingual Dictionary – 2/2

– The set of top-4 query translations is denoted as S(Tqf)

– For each T ∈ S(Tqf):

• Retrieve all queries containing T in the target language and assign Sdict(T) as their value

12

2. Parallel Corpora

– Given a pair of queries

• qf : in the source language

• qe : in the target language

– Bi-directional translation score:

• IBM Model 1 & GIZA++ tool

• P(yj|xi) is the word-to-word translation probability

– The top 10 queries {qe} with the highest bi-directional translation scores with qf are retrieved from the query log
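A sketch of a bi-directional translation score built on IBM Model 1. The Model 1 likelihood is the standard one (a NULL source word plus the word-to-word probabilities P(yj|xi)); combining the two directions by a geometric mean is an assumption, since the slide's equation is not in the transcript. The probability tables are toy values, not GIZA++ output, and the function names are hypothetical.

```python
import math


def ibm1_prob(target, source, t_prob, epsilon=1.0):
    """IBM Model 1 likelihood of the target words given the source words.

    t_prob maps (target_word, source_word) to P(target_word | source_word).
    """
    src = ["<NULL>"] + list(source)  # Model 1 adds an empty source word
    prob = epsilon / (len(src) ** len(target))
    for y in target:
        prob *= sum(t_prob.get((y, x), 0.0) for x in src)
    return prob


def bidirectional_score(qf, qe, t_e_given_f, t_f_given_e):
    """Geometric mean of both translation directions (assumed combination)."""
    forward = ibm1_prob(qe, qf, t_e_given_f)
    backward = ibm1_prob(qf, qe, t_f_given_e)
    return math.sqrt(forward * backward)


# Toy word-to-word tables (hypothetical values).
t_e_given_f = {("flight", "vol"): 0.5}
t_f_given_e = {("vol", "flight"): 0.8}
score = bidirectional_score(["vol"], ["flight"], t_e_given_f, t_f_given_e)
```

Scoring both directions penalizes pairs where only one side explains the other, which a one-directional likelihood would let through.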

13

3. Online Mining for Related Queries – 1/3

• Out-of-vocabulary (OOV) terms are a major knowledge bottleneck for query translation and CLIR

• Assumption:

– If a query in the target language co-occurs with the source query in many web pages, the two are probably semantically related

– But the mined results are noisy

14


– Frequency in the snippets

• For example:

– Given a query q = abc in the source language

– By dictionary: a = {a1, a2, a3}, b = {b1, b2} and c = {c1}

– Web query: q ∧ (a1 ∨ a2 ∨ a3) ∧ (b1 ∨ b2) ∧ (c1) in the target language

– From 700 snippets, keep the 10 most frequent target queries
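The example above can be sketched in two steps: build the boolean web query from the dictionary translations, then count which target-language queries occur most often in the returned snippets. The `AND`/`OR` syntax, the substring matching, and the function names are illustrative assumptions; no real search engine API is called.

```python
from collections import Counter


def build_web_query(source_query, translations):
    """Source query AND, for each word, an OR-group of its translations."""
    groups = []
    for word in source_query.split():
        options = translations.get(word)
        if options:  # words without dictionary entries add no constraint
            groups.append("(" + " OR ".join(options) + ")")
    return " AND ".join(['"' + source_query + '"'] + groups)


def top_candidates(snippets, candidates, k=10):
    """Return the k candidate queries found in the most snippets."""
    counts = Counter()
    for snippet in snippets:
        text = snippet.lower()
        for cand in candidates:
            if cand.lower() in text:
                counts[cand] += 1
    return [cand for cand, _ in counts.most_common(k)]


web_query = build_web_query(
    "a b c", {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
)
best = top_candidates(
    ["cheap flights to Paris", "book cheap flights", "paris hotels"],
    ["cheap flights", "paris hotels", "car rental"],
    k=2,
)
```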

15


– Any query qe mined from the web is associated with a Co-Occurrence Double-Check (CODC) measure feature, SCODC(qf,qe)

16

4. Monolingual Query Suggestion

• Q0 : candidate queries (in target language)

– For each target query qe,

• SQML(qe) : monolingual source query
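One way to read SQML(qe) is as the member of the candidate set Q0 that is most similar to qe under the monolingual similarity measure. That reading is an assumption, since the slide's equation is missing. The sketch below uses a simple token-overlap function as a stand-in for the real click-through based measure; both function names are hypothetical.

```python
def token_overlap(a, b):
    """Stand-in monolingual similarity: shared tokens over the longer query."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / max(len(ta), len(tb))


def monolingual_source_query(q_e, q0, sim_ml=token_overlap):
    """Pick the candidate in Q0 with the highest monolingual similarity to q_e."""
    return max(q0, key=lambda cand: sim_ml(q_e, cand))


q0 = ["cheap flights", "paris hotels"]
source = monolingual_source_query("cheap flights paris", q0)
```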

17






18

Estimating Cross-lingual Query Similarity

• Four categories of features are used to learn the cross-lingual query similarity.

• Cross-lingual query similarity score

– Learning: LIBSVM regression algorithm

• f : feature functions

• φ : mapping from the feature space onto the kernel space

• w : weight vector in the kernel space

19

Performance Evaluation – Log Data

• Data resources:

– MSN Search Engine

• French (source language) vs. English (target language)

– A one-month English query log

– 7 million unique English queries with occurrence frequency above 5

• 5,000 French queries

– 4,171 queries have translations among the English queries

– 70% for training (learning the LIBSVM weights)

– 10% development data

– 20% testing
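The 70/10/20 partition can be sketched as below. Whether the split was random and which queries it covered are not stated on the slide; the shuffle, the seed, and the choice of the 4,171 translated queries as the pool are assumptions, and `split_queries` is a hypothetical helper.

```python
import random


def split_queries(queries, train=0.7, dev=0.1, seed=0):
    """Shuffle and cut into training, development and test portions."""
    items = list(queries)
    random.Random(seed).shuffle(items)  # fixed seed for a repeatable split
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (
        items[:n_train],
        items[n_train:n_train + n_dev],
        items[n_train + n_dev:],  # the remainder is the test portion
    )


# Placeholder identifiers standing in for the 4,171 French queries.
queries = [f"q{i}" for i in range(4171)]
train_set, dev_set, test_set = split_queries(queries)
```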

20

Performance Evaluation - CLIR

• Data resources:

– TREC6 CLIR data (AP88-90 newswire, 750MB)

– 25 short French-English query pairs (CL1-CL25)

• average length: 3.3 words

• comparable to the web query logs used for training CLQS

[Figure: CLIR experiment pipeline, with labels: Source Language (qf), CLQS, Target Language ({qe}), BM25, CLIR]

21

• CLQS

23

• CLIR

24

Conclusion

• Cross-lingual query suggestion

• Query Logs

• French to English

• TREC6 French to English CLIR task

– CLQS demonstrates high quality
