Search Query Disambiguation from Short Sessions

Post on 08-Feb-2016


1

Search Query Disambiguation from Short Sessions

Lilyana Mihalkova & Raymond Mooney

The University of Texas at Austin

2

Query Disambiguation

scrubs

?

3

Existing Work

• Well-studied problem: [e.g., Sugiyama et al. 04, Sun et al. 05, Dou et al. 07]

• Common Assumption: Information about each user is available over a relatively long period of time.

4

Privacy Concerns

• NY Times: “A Face is Exposed for AOL Searcher no. 4417749”

• [Conti, 06]: “Googling Considered Harmful”

5

Pragmatic Concerns

• Identifying users across search sessions
– Log-in?
– IP Address?

• Managing and protecting user-specific information

6

Proposed Setting

• Base personalization only on short-term search histories
– complete search histories cannot be reconstructed

• Relate current session to previous short sessions of other users, based on the search activity in these sessions

7

How Short is Short-Term?

[Figure: x-axis: Number of queries before ambiguous query; y-axis: Number of sessions with that many queries]

8

Is This Enough Info?

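Session 1 (query: clicked URL)
– 98.7 fm: www.star987.com
– kroq: www.kroq.com
– scrubs: ???

Session 2 (query: clicked URL)
– huntsville hospital: www.huntsvillehospital.com
– ebay.com: www.ebay.com
– scrubs: ???

Possible results for "scrubs": scrubs-tv.com, scrubs.com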

9

More Closely Related Work

• [Almeida & Almeida 04]: Similar assumption of short sessions, but better suited for a specialized search engine (e.g. on computer science literature)

• [Krause & Horvitz 08]: Explicitly models the tradeoff between better performance and more user information.

10

Main Challenge

• How to harness this small amount of potentially noisy information available for each user?
– Exploit the relations among users, sessions, URLs
– Use statistical relational learning (SRL) [Getoor & Taskar 07]

11

Using Relational Information

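[Diagram: the active session (huntsville hospital, clicked huntsvillehospital.org; ebay, clicked ebay.com; scrubs, clicked ???) is related to background sessions of other users. The background sessions shown include huntsville school, hospitallink.com, and ebay.com, and searches for scrubs followed by clicks on scrubs-tv.com and scrubs.com.]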

12

Details

• Used Markov logic networks (MLNs) [Richardson & Domingos 06]
– MLN structure is provided as domain knowledge
– Weights are learned from the data

• Weight learning: Adapted contrastive divergence [Lowd & Domingos 07] for incremental learning
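
For reference, the standard MLN definition from [Richardson & Domingos 06] shows how the learned weights turn into a probability distribution over possible worlds. The notation is the usual one for MLNs and does not come from the slides: $w_i$ is the weight of clause $i$, $n_i(x)$ is the number of true groundings of that clause in world $x$, and $Z$ is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_i w_i \, n_i(x') \Big)
```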

13

Predicates

• Evidence predicates
– provide information about clicked URLs and keywords shared between sessions, i.e.
  • shares-keyword-between-clicks(ActiveS, BackgroundS, keyword)
  • shares-keyword-between-click-and-search(ActiveS, BackgroundS, keyword)
  • shares-clicks(ActiveS, BackgroundS, hostname)
– provide information about clicked URLs and keywords in current session

• Query predicate
– states that user will choose particular URL
  • clicks-on(ActiveS, hostname)
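
A minimal sketch of how the between-session evidence predicates could be grounded for one pair of sessions. The slides do not say how keywords are extracted, so the tokenizer, the stop list, the session dictionary layout, and the choice to let the click-and-search predicate hold in either direction are all assumptions made for illustration.

```python
import re

# Hypothetical stop list: hostname tokens too common to be informative.
STOP_TOKENS = {"www", "com", "org", "net"}

def keywords(text):
    """Split a query string or hostname into lowercase alphanumeric tokens."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return {t for t in tokens if t and t not in STOP_TOKENS}

def all_keywords(strings):
    """Union of the keywords of several strings."""
    found = set()
    for s in strings:
        found |= keywords(s)
    return found

def evidence(active, background):
    """Ground the evidence predicates for one (active, background) pair.

    Each session is a dict with 'id', 'searches' (query strings) and
    'clicks' (hostnames). Returns a set of (predicate, args...) tuples.
    """
    a, b = active["id"], background["id"]
    facts = set()
    # shares-clicks: both sessions clicked the same hostname.
    for host in set(active["clicks"]) & set(background["clicks"]):
        facts.add(("shares-clicks", a, b, host))
    a_click = all_keywords(active["clicks"])
    b_click = all_keywords(background["clicks"])
    a_search = all_keywords(active["searches"])
    b_search = all_keywords(background["searches"])
    for kw in a_click & b_click:
        facts.add(("shares-keyword-between-clicks", a, b, kw))
    # Assumption: the click/search predicate holds in either direction.
    for kw in (a_click & b_search) | (a_search & b_click):
        facts.add(("shares-keyword-between-click-and-search", a, b, kw))
    return facts
```

The resulting tuples would serve as the evidence database handed to MLN inference; the query predicate clicks-on is what inference then estimates.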

14

Re-Ranking of Search Results

• Search engine produces a list of search results

• For each possible search result R, compute the probability that clicks-on(ActiveS, R) is true

• Rank the search results by their likelihood of being clicked
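
The re-ranking step itself is a sort by estimated click probability. In the sketch below, `click_probability` is a hypothetical callable standing in for MLN inference of clicks-on(ActiveS, R); the numbers in the usage example are invented.

```python
def rerank(results, click_probability):
    """Order results by decreasing estimated click probability.

    sorted() is stable even with reverse=True, so results with equal
    probability keep the order the search engine gave them.
    """
    return sorted(results, key=click_probability, reverse=True)

# Usage with made-up probabilities in place of MLN inference.
probabilities = {"scrubs-tv.com": 0.2, "scrubs.com": 0.7}
reranked = rerank(["scrubs-tv.com", "scrubs.com"], probabilities.get)
```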

15

MLN 1

• User will click on at least one result

• User will select result chosen by previous user with whom a click is shared

ambiguous query www.someplace1.com

some query ambiguous query

www.clickedResult.com

. . .

www.someplace1.com

16

MLN 2

• MLN 1 +

• User will select result chosen by previous user with whom a keyword is shared
– click-to-click, click-to-search, search-to-click, search-to-search

ambiguous query www.aClick.com

some query ambiguous query

www.clickedResult.com

www.someplace1.com

some other

17

MLN 3

• MLN 2 +

• User will choose result that shares a keyword with a previous search or click in the current session

ambiguous query

some query

www.someplace1.com

www.someResult.com

www.anotherPossibility.com

www.yetAnother.com

18

Data

• Collected from the MSN engine in May 2006

• Contains time-stamped records of searches and clicked URLs, grouped by sessions
– Average session length is 3.28
– No across-session identifiers

• Used first 25 days for training/validation and last 6 days for test
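
A sketch of the chronological split, assuming each record carries a calendar date and that the log starts on the first day of the month (25 + 6 = 31 days matches May). The record layout and function name are hypothetical.

```python
from datetime import date

def split_by_day(records, first_day, train_days=25):
    """Split (day, record) pairs chronologically.

    Records from the first `train_days` days go to training/validation,
    everything later goes to test.
    """
    train, test = [], []
    for day, record in records:
        offset = (day - first_day).days
        (train if offset < train_days else test).append(record)
    return train, test
```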

19

Data Limitation #1

• Data does not specify what queries are ambiguous
– Consider query as ambiguous, if over all pages clicked after searching for this query, at least 2 fall in different high-level categories in the DMOZ (dmoz.org) hierarchy.
– Limit to query strings of up to two words (43.7%)
  • 6,360 ambiguous queries (2.4% of all two-word query strings)
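
The ambiguity heuristic above can be sketched as follows. `top_category` is a hypothetical lookup from a clicked URL to its high-level DMOZ category; ignoring pages that the lookup cannot classify is an assumption, not something the slides state.

```python
def is_ambiguous(query, clicks_by_query, top_category, max_words=2):
    """Apply the slide's heuristic to one query string.

    clicks_by_query maps a query string to the URLs clicked after it.
    top_category maps a URL to a high-level category, or None if unknown.
    """
    if len(query.split()) > max_words:
        return False
    categories = {top_category(url) for url in clicks_by_query.get(query, ())}
    categories.discard(None)  # assumption: unclassified pages carry no evidence
    return len(categories) >= 2
```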

20

Data Limitation #2

• Data does not provide the full list of search results presented to the user; only the ones actually clicked
– Assume that the URLs seen by the user are those clicked by at least one person after searching for the exact query string
– Consequence: result sets have differing lengths
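
Under this assumption, the candidate result set for a query is the set of URLs clicked after that exact query string anywhere in the log. A minimal sketch, with the `(query, url)` pair layout assumed for illustration:

```python
from collections import defaultdict

def build_result_sets(click_log):
    """Map each exact query string to the set of URLs clicked after it.

    click_log is an iterable of (query, clicked_url) pairs.
    """
    result_sets = defaultdict(set)
    for query, url in click_log:
        result_sets[query].add(url)
    return dict(result_sets)
```

Because each set only contains URLs that someone clicked, popular queries end up with larger result sets than rare ones, which is why the lengths differ.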

21

Result Set Sizes

[Figure: x-axis: Size of result set for ambiguous query; y-axis: Number of queries with that result set size]

22

Evaluation Metrics: MAP

• Mean average precision
– identical to the area under the interpolated precision/recall curve
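
For concreteness, here is a sketch of the common textbook computation of average precision for one ranked list and its mean over queries. It assumes every clicked result appears somewhere in the ranked list; whether the authors used exactly this variant is not stated in the slides.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: iterable of (ranked_list, relevant_set) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0
```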

23

Evaluation Metrics: AUC-ROC

• Area under the ROC curve
– identical to the mean average true negative rate
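
One way to read the "average true negative rate" view: for each clicked result, take the fraction of non-clicked results ranked below it, then average over the clicked results. With no ties this equals the usual AUC-ROC for that ranked list. The sketch below assumes a strict ranking and returns None when the list contains only one class.

```python
def auc_roc(ranked, clicked):
    """AUC for one ranked list via per-positive true negative rates."""
    n_pos = sum(1 for item in ranked if item in clicked)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # undefined when only one class is present
    negatives_seen = 0
    total = 0.0
    for item in ranked:
        if item in clicked:
            # Fraction of non-clicked results ranked below this clicked one.
            total += (n_neg - negatives_seen) / n_neg
        else:
            negatives_seen += 1
    return total / n_pos
```

Averaging this value over all test queries would give the "mean" part of the metric.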

24

Baselines

• Random: Rank randomly

• Click-Sim: Rank by similarity based on shared clicks

• Click-KW-Sim: Rank by similarity based on shared clicks and keywords

25

Click-Sim

[Diagram: the active session (huntsville hospital, clicked huntsvillehospital.org; scrubs, clicked ???) is compared with background sessions that searched for scrubs and clicked scrubs.tv or scrubs.med. Each candidate result is scored by the average similarity, based on shared clicks, to the background sessions that chose it.]
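
A rough sketch of a Click-Sim style baseline: each candidate is scored by the average similarity between the active session and the background sessions that chose that candidate. The slides do not define the similarity function, so Jaccard overlap of clicked hostnames is used here purely as a stand-in, and the `(clicks_before_query, chosen_result)` layout is hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of clicked hostnames."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def click_sim_rank(active_clicks, candidates, background):
    """Rank candidates by average click similarity.

    background is a list of (clicks_before_query, chosen_result) pairs
    taken from other users' sessions containing the same ambiguous query.
    """
    scores = {}
    for candidate in candidates:
        sims = [jaccard(active_clicks, clicks)
                for clicks, chosen in background if chosen == candidate]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    return sorted(candidates, key=lambda c: scores[c], reverse=True)
```

Click-KW-Sim would follow the same pattern with a similarity that also counts shared keywords.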

26

Click-KW-Sim

[Diagram: same layout as for Click-Sim. The active session (huntsville hospital, clicked huntsvillehospital.org; scrubs, clicked ???) is compared with background sessions that searched for scrubs and clicked scrubs.tv or scrubs.med. Each candidate result is scored by the average similarity based on shared clicks and keywords.]

27

Results (MAP)

[Figure: MAP (y-axis, 0.28 to 0.39) for Random, Click-Sim, Click-KW-Sim, MLN1, MLN2, and MLN3; asterisks mark some of the methods]

28

Results (AUC-ROC)

[Figure: AUC-ROC (y-axis, 0.46 to 0.58) for Random, Click-Sim, Click-KW-Sim, MLN1, MLN2, and MLN3; asterisks mark some of the methods]

29

Current/Future Work

• Incorporating more information in the models
– Actual content of clicked pages
– Popularity of pages
– Weighing evidence based on how close it is in time to ambiguous query

• Learning separate weights for each connecting keyword or domain/group of keywords or domains

• Revising the provided clauses

30

Questions?

31

1

• The popularity of a possible result provides a strong signal, but providing relational information on top of popularity gives further performance improvements
– Rank by popularity + Click-KW-Sim baseline:
  • MAP (0.383), AUC-ROC (0.536)
– Rank by popularity only:
  • MAP (0.380), AUC-ROC (0.525)
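
A sketch of a popularity-only ranking, assuming popularity means the number of times a result was clicked in the training data. The slides do not define popularity, so this is one plausible reading, and the function name is hypothetical.

```python
from collections import Counter

def rank_by_popularity(candidates, training_clicks):
    """Order candidates by how often each was clicked in training data."""
    counts = Counter(training_clicks)
    # Counter returns 0 for URLs never clicked, so unseen candidates sort last.
    return sorted(candidates, key=lambda url: counts[url], reverse=True)
```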

32

2

[Figure: x-axis: Number of distinct clicks before ambiguous query; y-axis: Number of sessions with that many clicks]