Search Query Disambiguation from Short Sessions

Lilyana Mihalkova & Raymond Mooney, The University of Texas at Austin


Transcript of Search Query Disambiguation from Short Sessions

Page 1: Search Query Disambiguation from Short Sessions

1

Search Query Disambiguation from Short Sessions

Lilyana Mihalkova & Raymond Mooney

The University of Texas at Austin

Page 2: Search Query Disambiguation from Short Sessions

2

Query Disambiguation

scrubs

?

Page 3: Search Query Disambiguation from Short Sessions

3

Existing Work

• Well-studied problem: [e.g., Sugiyama et al. 04, Sun et al. 05, Dou et al. 07]

• Common Assumption: Information about each user is available over a relatively long period of time.

Page 4: Search Query Disambiguation from Short Sessions

4

Privacy Concerns

• NY Times: “A Face is Exposed for AOL Searcher no. 4417749”

• [Conti, 06]: “Googling Considered Harmful”

Page 5: Search Query Disambiguation from Short Sessions

5

Pragmatic Concerns

• Identifying users across search sessions
  – Log-in?
  – IP address?

• Managing and protecting user-specific information

Page 6: Search Query Disambiguation from Short Sessions

6

Proposed Setting

• Base personalization only on short-term search histories
  – Complete search histories cannot be reconstructed

• Relate current session to previous short sessions of other users, based on the search activity in these sessions

Page 7: Search Query Disambiguation from Short Sessions

7

How Short is Short-Term?

[Histogram. x-axis: number of queries before ambiguous query; y-axis: number of sessions with that many queries]

Page 8: Search Query Disambiguation from Short Sessions

8

Is This Enough Info?

Example session 1: 98.7 fm → www.star987.com; kroq → www.kroq.com; scrubs → ???

Example session 2: huntsville hospital → www.huntsvillehospital.com; ebay.com → www.ebay.com; scrubs → ???

Candidate results for "scrubs": scrubs-tv.com, scrubs.com

Page 9: Search Query Disambiguation from Short Sessions

9

More Closely Related Work

• [Almeida & Almeida 04]: Similar assumption of short sessions, but better suited for a specialized search engine (e.g. on computer science literature)

• [Krause & Horvitz 08]: Explicitly models the tradeoff between better performance and more user information.

Page 10: Search Query Disambiguation from Short Sessions

10

Main Challenge

• How to harness this small amount of potentially noisy information available for each user?
  – Exploit the relations among users, sessions, URLs
  – Use statistical relational learning (SRL) [Getoor & Taskar 07]

Page 11: Search Query Disambiguation from Short Sessions

11

Using Relational Information

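[Diagram: the active session (huntsville hospital → huntsvillehospital.org; ebay → ebay.com; scrubs → ???) shown alongside background sessions from other users, whose labels include huntsville school, hospitallink.com, ebay.com, and scrubs followed by scrubs-tv.com or scrubs.com.]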

Page 12: Search Query Disambiguation from Short Sessions

12

Details

• Used Markov logic networks (MLNs) [Richardson & Domingos 06]
  – MLN structure is provided as domain knowledge
  – Weights are learned from the data

• Weight learning: Adapted contrastive divergence [Lowd & Domingos 07] for incremental learning
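For reference, the weights mentioned above enter the standard MLN distribution of [Richardson & Domingos 06] as shown below. The symbols $w_i$ (weight of clause $i$), $n_i(x)$ (number of true groundings of clause $i$ in world $x$), and $Z$ (normalizer) follow that paper's usual notation and do not appear on the slides.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Under this form, a clause with a larger positive weight makes worlds that satisfy more of its groundings more probable.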

Page 13: Search Query Disambiguation from Short Sessions

13

Predicates

• Evidence predicates
  – provide information about clicked URLs and keywords shared between sessions, i.e.
    • shares-keyword-between-clicks(ActiveS, BackgroundS, keyword)
    • shares-keyword-between-click-and-search(ActiveS, BackgroundS, keyword)
    • shares-clicks(ActiveS, BackgroundS, hostname)
  – provide information about clicked URLs and keywords in the current session

• Query predicate
  – states that the user will choose a particular URL
    • clicks-on(ActiveS, hostname)
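A minimal sketch of how groundings of the evidence predicates above could be derived from two sessions. The session layout (a list of query/hostname pairs), the `tokens` keyword extractor, and the direction of the click-and-search comparison are all assumptions made for illustration; the slides do not specify them.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed layout: a session is a list of (query string, clicked hostname) pairs.
Session = List[Tuple[str, str]]

def tokens(text: str) -> Set[str]:
    """Hypothetical keyword extractor: lowercase alphanumeric runs minus a few stop tokens."""
    stop = {"www", "com", "org", "net"}
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop}

def click_keywords(session: Session) -> Set[str]:
    """Keywords found in the hostnames clicked during the session."""
    return set().union(*(tokens(host) for _, host in session))

def search_keywords(session: Session) -> Set[str]:
    """Keywords found in the query strings issued during the session."""
    return set().union(*(tokens(query) for query, _ in session))

def evidence(active: Session, background: Session) -> Dict[str, Set[str]]:
    """Third arguments (hostname or keyword) for which each evidence predicate would hold."""
    active_hosts = {h for _, h in active if h}
    background_hosts = {h for _, h in background if h}
    return {
        "shares-clicks": active_hosts & background_hosts,
        "shares-keyword-between-clicks":
            click_keywords(active) & click_keywords(background),
        # Assumed direction: keyword in an active click and in a background search.
        "shares-keyword-between-click-and-search":
            click_keywords(active) & search_keywords(background),
    }

active = [("huntsville hospital", "huntsvillehospital.org"), ("ebay", "ebay.com")]
background = [("huntsville school", "hospitallink.com"),
              ("ebay", "ebay.com"),
              ("scrubs", "scrubs.com")]
ev = evidence(active, background)
```

Here the two sessions share only the click on ebay.com and the keyword "ebay"; a finer tokenizer that split "huntsvillehospital" into words would surface more shared keywords.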

Page 14: Search Query Disambiguation from Short Sessions

14

Re-Ranking of Search Results

• Search engine produces a list of search results

• For each possible search result R, compute the probability that clicks-on(ActiveS, R) holds

• Rank the search results by their likelihood of being clicked
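The re-ranking step above amounts to a sort by estimated click probability. The sketch below assumes a `click_probability` callable standing in for MLN inference; the probabilities in the example are made up.

```python
from typing import Callable, List

def rerank(results: List[str],
           click_probability: Callable[[str], float]) -> List[str]:
    """Order results by decreasing estimated probability of clicks-on(ActiveS, R)."""
    return sorted(results, key=click_probability, reverse=True)

# Made-up probabilities; in the described system these would come from the MLN.
probs = {"scrubs-tv.com": 0.2, "scrubs.com": 0.7}
ranked = rerank(["scrubs-tv.com", "scrubs.com"], lambda r: probs.get(r, 0.0))
```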

Page 15: Search Query Disambiguation from Short Sessions

15

MLN 1

• User will click on at least one result
• User will select result chosen by previous user with whom a click is shared

[Diagram labels: ambiguous query; some query; www.someplace1.com; www.clickedResult.com]

Page 16: Search Query Disambiguation from Short Sessions

16

MLN 2

• MLN 1 +
• User will select result chosen by previous user with whom a keyword is shared
  – click-to-click, click-to-search, search-to-click, search-to-search

[Diagram labels: ambiguous query; some query; www.aClick.com; www.clickedResult.com; www.someplace1.com; some other]

Page 17: Search Query Disambiguation from Short Sessions

17

MLN 3

• MLN 2 +
• User will choose result that shares a keyword with a previous search or click in the current session

[Diagram labels: ambiguous query; some query; www.someplace1.com; www.someResult.com; www.anotherPossibility.com; www.yetAnother.com]

Page 18: Search Query Disambiguation from Short Sessions

18

Data

• Collected from the MSN engine in May 2006

• Contains time-stamped records of searches and clicked URLs, grouped by sessions
  – Average session length is 3.28
  – No across-session identifiers

• Used first 25 days for training/validation and last 6 days for test

Page 19: Search Query Disambiguation from Short Sessions

19

Data Limitation #1

• Data does not specify what queries are ambiguous
  – Consider a query ambiguous if, over all pages clicked after searching for this query, at least 2 fall in different high-level categories in the DMOZ (dmoz.org) hierarchy
  – Limit to query strings of up to two words (43.7%)
    • 6,360 ambiguous queries (2.4% of all two-word query strings)
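The ambiguity heuristic above can be sketched as follows. The `top_category` lookup stands in for a DMOZ mapping from URL to high-level category and is hypothetical, as are the category names in the example; URLs without a known category are skipped here, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def ambiguous_queries(clicks: Iterable[Tuple[str, str]],
                      top_category: Dict[str, str],
                      max_words: int = 2) -> Set[str]:
    """Queries whose clicked pages fall into at least 2 different high-level categories."""
    categories = defaultdict(set)
    for query, url in clicks:
        if len(query.split()) > max_words:
            continue  # keep only query strings of up to max_words words
        category = top_category.get(url)
        if category is not None:  # assumption: ignore uncategorized URLs
            categories[query].add(category)
    return {q for q, cats in categories.items() if len(cats) >= 2}

# Invented category assignments, for illustration only.
top_category = {"scrubs-tv.com": "Arts", "scrubs.com": "Shopping", "www.kroq.com": "Arts"}
clicks = [("scrubs", "scrubs-tv.com"), ("scrubs", "scrubs.com"),
          ("kroq", "www.kroq.com"), ("kroq", "www.kroq.com")]
found = ambiguous_queries(clicks, top_category)
```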

Page 20: Search Query Disambiguation from Short Sessions

20

Data Limitation #2

• Data does not provide the full list of search results presented to the user; only the ones actually clicked
  – Assume that the URLs seen by the user are those clicked by at least one person after searching for the exact query string
  – Consequence: result sets have differing lengths
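The assumption above can be sketched as a grouping of the click log by exact query string. The `(query, clicked_url)` record layout is assumed for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def result_sets(click_log: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Approximate each query's result list by the set of URLs anyone clicked for it."""
    results = defaultdict(set)
    for query, url in click_log:
        results[query].add(url)  # exact query-string match, no normalization
    return dict(results)

log = [("scrubs", "scrubs-tv.com"), ("scrubs", "scrubs.com"),
       ("scrubs", "scrubs.com"), ("kroq", "www.kroq.com")]
sets_by_query = result_sets(log)
```

In this toy log, "scrubs" ends up with two candidate results and "kroq" with one, which illustrates why the reconstructed result sets differ in length.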

Page 21: Search Query Disambiguation from Short Sessions

21

Result Set Sizes

[Histogram. x-axis: size of result set for ambiguous query; y-axis: number of queries with that result set size]

Page 22: Search Query Disambiguation from Short Sessions

22

Evaluation Metrics: MAP

• Mean average precision
  – identical to the area under the interpolated precision/recall curve
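A minimal sketch of mean average precision using the common textbook definition (precision averaged over the ranks of the relevant items). Whether the authors' computation handles interpolation in exactly this way is not stated on the slide, so treat this as an assumption.

```python
from typing import List, Sequence, Set, Tuple

def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items missing from the ranking contribute zero.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(cases: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in cases) / len(cases)

cases = [(["a", "b", "c"], {"b"}),       # AP = 1/2
         (["a", "b", "c"], {"a", "c"})]  # AP = (1/1 + 2/3) / 2
map_score = mean_average_precision(cases)
```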

Page 23: Search Query Disambiguation from Short Sessions

23

Evaluation Metrics: AUC-ROC

• Area under the ROC curve
  – identical to the mean average true negative rate
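The characterization above can be sketched per ranking: for each relevant result, take the fraction of non-relevant results ranked below it (the true negative rate at that cut-off), then average over the relevant results. The sketch assumes a strict ranking with no ties; the function name and the `None` return for undefined cases are illustrative choices.

```python
from typing import Optional, Sequence, Set

def auc_roc(ranking: Sequence[str], relevant: Set[str]) -> Optional[float]:
    """AUC for one ranking, as the average true negative rate at each relevant item."""
    positive_positions = [i for i, item in enumerate(ranking) if item in relevant]
    n_negatives = len(ranking) - len(positive_positions)
    if not positive_positions or n_negatives == 0:
        return None  # undefined without at least one positive and one negative
    total = 0.0
    for pos in positive_positions:
        # Negatives ranked below this positive are correctly ordered with respect to it.
        below = sum(1 for item in ranking[pos + 1:] if item not in relevant)
        total += below / n_negatives
    return total / len(positive_positions)

middle = auc_roc(["a", "b", "c"], {"b"})
```

Averaging this quantity over all test queries would give the mean value reported per system.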

Page 24: Search Query Disambiguation from Short Sessions

24

Baselines

• Random: Rank randomly
• Click-Sim: Rank by similarity based on shared clicks
• Click-KW-Sim: Rank by similarity based on shared clicks and keywords
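A rough sketch of what a Click-Sim style score could look like. The slides only say the ranking uses average similarity based on shared clicks; the similarity used here (count of shared clicked hostnames), the grouping of background sessions by the result they chose, and the dictionary layout are all assumptions, not the authors' definition.

```python
from typing import Dict, List, Set

def click_sim_scores(active_clicks: Set[str],
                     background: List[Dict],
                     candidates: List[str]) -> Dict[str, float]:
    """Score each candidate by the average similarity of the active session to
    the background sessions that chose that candidate for the ambiguous query."""
    scores = {}
    for candidate in candidates:
        # Assumed similarity: number of clicked hostnames the two sessions share.
        sims = [len(active_clicks & s["clicks"])
                for s in background if s["chosen"] == candidate]
        scores[candidate] = sum(sims) / len(sims) if sims else 0.0
    return scores

active_clicks = {"huntsvillehospital.org"}
background = [
    {"clicks": {"huntsvillehospital.org"}, "chosen": "scrubs.med"},
    {"clicks": {"kroq.com"}, "chosen": "scrubs.tv"},
    {"clicks": {"huntsvillehospital.org", "ebay.com"}, "chosen": "scrubs.med"},
]
scores = click_sim_scores(active_clicks, background, ["scrubs.tv", "scrubs.med"])
ranking = sorted(scores, key=scores.get, reverse=True)
```

A Click-KW-Sim variant would, under the same assumptions, add shared keywords to the similarity term.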

Page 25: Search Query Disambiguation from Short Sessions

25

Click-Sim

[Diagram: the active session (huntsville hospital → huntsvillehospital.org; scrubs → ???) compared with background sessions in which a search for scrubs led to a click on scrubs.tv or scrubs.med. Caption: average similarity based on shared clicks.]

Page 26: Search Query Disambiguation from Short Sessions

26

Click-KW-Sim

[Diagram: the active session (huntsville hospital → huntsvillehospital.org; scrubs → ???) compared with background sessions in which a search for scrubs led to a click on scrubs.tv or scrubs.med. Caption: average similarity based on shared clicks and keywords.]

Page 27: Search Query Disambiguation from Short Sessions

27

Results (MAP)

[Bar chart: MAP (axis range 0.28 to 0.39) for Random, Click-Sim, Click-KW-Sim, MLN1, MLN2, MLN3; some bars carry * or ** markers]

Page 28: Search Query Disambiguation from Short Sessions

28

Results (AUC-ROC)

[Bar chart: AUC-ROC (axis range 0.46 to 0.58) for Random, Click-Sim, Click-KW-Sim, MLN1, MLN2, MLN3; some bars carry * or ** markers]

Page 29: Search Query Disambiguation from Short Sessions

29

Current/Future Work

• Incorporating more information in the models
  – Actual content of clicked pages
  – Popularity of pages
  – Weighing evidence based on how close it is in time to ambiguous query

• Learning separate weights for each connecting keyword or domain/group of keywords or domains

• Revising the provided clauses

Page 30: Search Query Disambiguation from Short Sessions

30

Questions?

Page 31: Search Query Disambiguation from Short Sessions

31

1

• The popularity of a possible result provides a strong signal, but providing relational information on top of popularity gives further performance improvements
  – Rank by popularity + Click-KW-Sim baseline: MAP (0.383), AUC-ROC (0.536)
  – Rank by popularity only: MAP (0.380), AUC-ROC (0.525)

Page 32: Search Query Disambiguation from Short Sessions

32

2

[Histogram. x-axis: number of distinct clicks before ambiguous query; y-axis: number of sessions with that many clicks]