1 Quicklink Selection for Navigational Query Results Deepayan Chakrabarti (deepay@yahoo-inc.com)...

Post on 15-Jan-2016


1

Quicklink Selection for Navigational Query Results

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Ravi Kumar (ravikuma@yahoo-inc.com)

Kunal Punera (kpunera@yahoo-inc.com)

2

What are quicklinks

Quicklinks

Result Website

3

Quicklinks = URLs within the search result website
Enable fast navigation to important parts of the website
Which URLs should be QLs?


4

Quicklink Selection

Some obvious strategies don’t work very well
Top clicked URLs in search engine
  URL may have low relevance in the QL context: lib.utexas.edu/maps is popular for searches on “maps” and not for searches on “Univ. of Texas”
  URL may be too specific: automobiles.honda.com/civic-hybrid/exterior-photos.aspx for honda.com
  URL popularity may be time sensitive: nytimes.com/election-guide/2008/ for nytimes.com

5

Quicklink Selection

Some obvious strategies don’t work very well
Top clicked URLs in search engine
Top visited URLs in toolbar data
  May not relate to search activity, e.g., for nytimes.com:
    #3 is nytimes.com/mem/emailthis.html
    #6 is nytimes.com/auth/login
    #8 is nytimes.com/gst/regi.html

6

Quicklink Selection

Some obvious strategies don’t work very well
Top clicked URLs in search engine
Top visited URLs in toolbar data
Top URLs from analysis of hyperlink graph
  Ignores preferences of search users
  Toolbar data is more representative
Heavily tagged URLs (e.g., del.icio.us/digg)
  Low coverage: too few websites

7

Quicklink Selection

Need a combined approach
  Search logs (this paper)
  Toolbar data (this paper)
  Web-server logs
  Website hyperlink graph
  User tags

8

Related Work

Sitemap generation [Perkowitz+/00]
Detection of hard-to-find URLs [Srikant+/01]
Improving website navigability [Doerr+/07]
Mining Web usage patterns [Buchner/99, Cadez+/03]
BrowseRank [Liu+/08]
Post-search browsing behavior [Bilenko+/08]

We focus on QLs in the context of Search

9

Outline

Motivation and Related Work
Problem Formulation
Proposed Solution
Experiments
Conclusions

10

Problem Formulation

Which k URLs should be QLs?

“The greatest good for the greatest number”

QLs save clicks
Maximize the total number of clicks saved using at most k QLs
But when exactly is a click “saved”?

11

Problem Formulation

When does a QL get clicked by the user?

Graph of click trails (Toolbar data)

Say we pick this node as a QL

nasa.gov

Hubble telescope

Photos

12

Problem Formulation

Say we pick this node as a QL

Assumption: The user recognizes if SearchResult → QL → Destination

Graph of click trails (Toolbar data)

nasa.gov

Hubble telescope

Photos

13

Problem Formulation

Say we pick this node as a QL

(saves 1 click each)

Assumption: The user recognizes if SearchResult → QL → Destination

Graph of click trails (Toolbar data)

nasa.gov

14

Problem Formulation

Say we pick this node as a QL

(saves 1 click each)

(saves 2 clicks each)

(saves 0)

(saves 0)

Total savings = 1*3 + 2*2 = 7 clicks

Graph of click trails (Toolbar data)

Assumption: The user recognizes if SearchResult → QL → Destination

nasa.gov
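The savings count on this slide can be reproduced with a small sketch. It assumes that a trail is an ordered tuple of pages starting at the website's root, and that a QL found at position i on a trail saves i clicks. The trails and page names below are made up so that 3 trails save 1 click and 2 trails save 2 clicks, as on the slide; they are not taken from the toolbar data.

```python
from typing import Sequence

def clicks_saved(trail: Sequence[str], quicklink: str) -> int:
    """Clicks saved on one trail: the QL's position on the trail, or 0 if absent."""
    return trail.index(quicklink) if quicklink in trail else 0

# Hypothetical click trails, all starting at the result website's root page.
trails = [
    ("nasa.gov", "hubble"),                        # QL reached in 1 click
    ("nasa.gov", "hubble", "photos"),              # QL reached in 1 click
    ("nasa.gov", "hubble", "photos"),              # QL reached in 1 click
    ("nasa.gov", "missions", "hubble"),            # QL reached in 2 clicks
    ("nasa.gov", "missions", "hubble", "photos"),  # QL reached in 2 clicks
    ("nasa.gov", "news"),                          # QL not on trail: saves 0
    ("nasa.gov", "about"),                         # QL not on trail: saves 0
]

total_savings = sum(clicks_saved(t, "hubble") for t in trails)
print(total_savings)  # 1*3 + 2*2 = 7
```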

15

Problem Formulation

However…

Unknown pages might become QLs

lyrics.com → A, B, C, …, Z: these could become the “best” QLs

16

Problem Formulation

However…
Unknown pages might become QLs
Automatic-redirect pages might become QLs:
  nytimes.com forces logging in
  aaa.com forces zipcode entry

We need QLs that are “noticeable” in a search context

17

Problem Formulation

How can we estimate noticeability? Via Search click-logs
Noticeability of a URL u: the user notices a useful QL with probability α(u)
α(u) is computed from the fraction of search clicks for u on the website, with a tuning parameter (≈ 2)

18

Problem Formulation

QL1

(saves 0)

(saves 0)

QL2

# trails x prob x # clicks saved
2 x α1 x 2
1 x α1 x 1
2 x (1-α2)α1 x 1
2 x α2 x 2

Total = 5α1 + 4α2 + 2(1-α2)α1

Assumption: The user picks the best QL that he/she notices

nasa.gov

?

19

Problem Formulation

QL1

(saves 0)

(saves 0)

QL2

# trails x prob x # clicks saved
2 x α1 x 2
1 x α1 x 1
2 x (1-α2)α1 x 1
2 x α2 x 2

Total = 5α1 + 4α2 + 2(1-α2)α1

If only QL1 is perfectly noticeable (α1=1, α2=0): Total = 7 clicks (as if 1 QL only)

If both QLs are perfectly noticeable (α1=1, α2=1): Total = 9 clicks

nasa.gov
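A minimal sketch of the expected savings with noticeability, under the slide's assumption that the user takes the best QL he/she notices. It assumes a QL at position i on a trail saves i clicks; the trail layout and the intermediate page name "x" are hypothetical, chosen only to match the row counts in the table. It reproduces the two special cases quoted on this slide.

```python
from typing import Dict, Iterable, Sequence

def expected_saved(trail: Sequence[str], quicklinks: Iterable[str],
                   alpha: Dict[str, float]) -> float:
    """Expected clicks saved on one trail when the user takes the best QL noticed."""
    # (clicks saved, noticeability) for every QL on this trail, best first.
    options = sorted(
        ((trail.index(u), alpha[u]) for u in quicklinks if u in trail),
        reverse=True,
    )
    expected, p_missed_better = 0.0, 1.0
    for saved, a in options:
        # This QL is used only if noticed and every better QL was missed.
        expected += saved * a * p_missed_better
        p_missed_better *= 1.0 - a
    return expected

# Hypothetical trails consistent with the table on the slide.
TRAILS = (
    [("nasa.gov", "x", "QL1")] * 2        # QL1 saves 2 clicks
    + [("nasa.gov", "QL1")]               # QL1 saves 1 click
    + [("nasa.gov", "QL1", "QL2")] * 2    # QL1 saves 1 click, QL2 saves 2
)

def total_saved(alpha1: float, alpha2: float) -> float:
    alpha = {"QL1": alpha1, "QL2": alpha2}
    return sum(expected_saved(t, ["QL1", "QL2"], alpha) for t in TRAILS)

only_ql1 = total_saved(1.0, 0.0)  # only QL1 is perfectly noticeable
both = total_saved(1.0, 1.0)      # both QLs are perfectly noticeable
print(only_ql1, both)
```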

20

Problem Formulation

Which k URLs should be QLs?

Maximize the expected number of clicks saved using at most k QLs while incorporating “noticeability”

21

Outline

Motivation and Related Work

Problem Formulation
Proposed Solution
Experiments
Conclusions

22

Algorithms

Maximize expected number of saved clicks using k QLs: NP-hard

Theorem: This objective is non-decreasing submodular

1. Non-negative

2. Adding QLs never hurts

3. “Diminishing Returns”

Marginal improvement from adding u to set S ≥ marginal improvement from adding u to superset S’ (S ⊆ S’)

23

Algorithms

Greedy algorithm: Iteratively pick QLs that increase the number of saved clicks the most
Within a factor (1-1/e) of OPT [Nemhauser+/’78]
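The greedy step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a QL at position i on a trail saves i clicks, the user takes the best QL noticed, and the objective is re-evaluated from scratch for each candidate (simple, but not efficient). The trails, page names, and α values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

def expected_saved(trail: Sequence[str], quicklinks: Sequence[str],
                   alpha: Dict[str, float]) -> float:
    """Expected clicks saved on one trail when the user takes the best QL noticed."""
    options = sorted(
        ((trail.index(u), alpha[u]) for u in quicklinks if u in trail),
        reverse=True,
    )
    expected, p_missed_better = 0.0, 1.0
    for saved, a in options:
        expected += saved * a * p_missed_better
        p_missed_better *= 1.0 - a
    return expected

def objective(trails, quicklinks, alpha) -> float:
    return sum(expected_saved(t, quicklinks, alpha) for t in trails)

def greedy_quicklinks(trails, alpha: Dict[str, float],
                      k: int) -> Tuple[List[str], float]:
    """Pick up to k QLs, each time adding the URL with the largest gain."""
    candidates = {u for t in trails for u in t[1:]}  # every page except the root
    chosen: List[str] = []
    current = 0.0
    for _ in range(k):
        best_url, best_value = None, current
        for u in sorted(candidates - set(chosen)):  # sorted for determinism
            value = objective(trails, chosen + [u], alpha)
            if value > best_value:
                best_url, best_value = u, value
        if best_url is None:  # no remaining URL improves the objective
            break
        chosen.append(best_url)
        current = best_value
    return chosen, current

# Hypothetical trails and noticeabilities.
trails = (
    [("nasa.gov", "x", "QL1")] * 2
    + [("nasa.gov", "QL1")]
    + [("nasa.gov", "QL1", "QL2")] * 2
)
alpha = {"x": 1.0, "QL1": 1.0, "QL2": 1.0}
chosen, value = greedy_quicklinks(trails, alpha, k=2)
print(chosen, value)
```

With these inputs the first pick is QL1 (7 clicks saved), and adding QL2 raises the total to 9; adding "x" instead would not help, because QL1 already saves more on the same trails.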

24

Algorithms

However…
Inhomogeneous results: QLs for ea.com are
  fifa08.ea.com and battlefield.ea.com (two games made by EA)
  a page 6 webpages deep inside thesim2.ea.com
Redundant results: QLs for senate.gov include
  obama.senate.gov
  obama.senate.gov/about
  obama.senate.gov/contact
  obama.senate.gov/votes
  (the parent URL makes the child URLs redundant)

25

Algorithms

Both can be specified as pairwise constraints on URLs allowed to belong to a QL set
Pairwise-constrained QL selection is NP-hard
Two-step process:
  Heuristically find a large subset of trails that form a tree
  Enforce constraints on the tree
Dynamic program is optimal on the tree
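As an illustration of what a pairwise constraint might look like, the sketch below flags parent/child URL pairs (the redundancy case) and checks a candidate QL set against such a rule. The slides do not say how constraints are encoded; the prefix test and both function names are assumptions made for this example, and the tree-based dynamic program itself is not shown.

```python
from itertools import combinations
from typing import Callable, Iterable

def is_parent_child(u: str, v: str) -> bool:
    """True if one URL is a path prefix of the other (hypothetical redundancy rule)."""
    u, v = u.rstrip("/"), v.rstrip("/")
    return u != v and (v.startswith(u + "/") or u.startswith(v + "/"))

def satisfies_constraints(quicklinks: Iterable[str],
                          conflicts: Callable[[str, str], bool]) -> bool:
    """A QL set is allowed only if no pair of its URLs is in conflict."""
    return not any(conflicts(u, v) for u, v in combinations(list(quicklinks), 2))

redundant = ["obama.senate.gov", "obama.senate.gov/about"]
siblings = ["obama.senate.gov/about", "obama.senate.gov/contact"]
print(satisfies_constraints(redundant, is_parent_child))  # False
print(satisfies_constraints(siblings, is_parent_child))   # True
```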

26

Outline

Motivation and Related Work

Problem Formulation

Proposed Solution
Experiments
Conclusions

27

Experiments

Baseline Methods
TopClicked:
  URL score = # search clicks on URL
TopVisited:
  URL score = # occurrences on toolbar trails
PageRank:
  Build a weighted graph on URLs, where weight(i,j) = # trails using the i→j edge
  URL score = PageRank on this graph
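A rough sketch of the TopVisited and PageRank baselines on trail data. The damping factor of 0.85, the iteration count, and the uniform handling of pages without outgoing edges are assumptions; the slides do not specify them. TopClicked would be the same kind of count taken over search click logs instead of trails. The trails are hypothetical.

```python
from collections import Counter, defaultdict

def top_visited(trails):
    """URL score = number of occurrences on toolbar trails."""
    return Counter(u for t in trails for u in t)

def trail_graph(trails):
    """weight[i][j] = number of trails using the i -> j edge."""
    weight = defaultdict(Counter)
    for t in trails:
        for i, j in zip(t, t[1:]):
            weight[i][j] += 1
    return weight

def pagerank(weight, damping=0.85, iters=50):
    """Weighted PageRank by power iteration; dangling mass is spread uniformly."""
    nodes = set(weight) | {j for out in weight.values() for j in out}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not weight.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for i, out in weight.items():
            total = sum(out.values())
            for j, w in out.items():
                new[j] += damping * rank[i] * w / total
        rank = new
    return rank

# Hypothetical toolbar trails.
trails = [("r", "a"), ("r", "a", "b"), ("r", "c")]
visits = top_visited(trails)
ranks = pagerank(trail_graph(trails))
print(visits.most_common(2))
print(sorted(ranks, key=ranks.get, reverse=True))
```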

28

Experiments

Live Traffic dataset
  Computed CTRs on QLs currently displayed by Yahoo! (1043 website subset)
Measure:
  Pick two equal-size subsets of QLs
  Use sum-of-scores and sum-of-CTRs to predict the better subset
  Measure how often the predictions match
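The agreement measure can be sketched as below for a single website. The sketch enumerates every pair of equal-size subsets and skips pairs whose CTR sums tie; whether the authors enumerated or sampled pairs, and how they treated ties, is not stated, so both choices are assumptions. The URLs, scores, and CTR values are invented for illustration.

```python
from itertools import combinations
from typing import Dict

def agreement_rate(scores: Dict[str, float], ctrs: Dict[str, float],
                   size: int) -> float:
    """Fraction of subset pairs where sum-of-scores and sum-of-CTRs pick the same subset."""
    subsets = list(combinations(sorted(ctrs), size))
    agree = total = 0
    for a, b in combinations(subsets, 2):
        ctr_diff = sum(ctrs[u] for u in a) - sum(ctrs[u] for u in b)
        score_diff = sum(scores[u] for u in a) - sum(scores[u] for u in b)
        if ctr_diff == 0:  # live traffic shows no preference: skip the pair
            continue
        total += 1
        agree += (ctr_diff > 0) == (score_diff > 0)
    return agree / total if total else float("nan")

# Hypothetical CTRs for four displayed QLs and two scoring methods.
ctrs = {"a": 0.10, "b": 0.05, "c": 0.02, "d": 0.01}
perfect_scores = dict(ctrs)                           # ranks subsets exactly like CTR
reversed_scores = {u: -v for u, v in ctrs.items()}    # always prefers the worse subset
print(agreement_rate(perfect_scores, ctrs, size=2))
print(agreement_rate(reversed_scores, ctrs, size=2))
```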

29

Experiments

Live Traffic Data

[Figure: fraction of subset-pairs where predictions agree with live traffic (y-axis) vs. subset sizes (x-axis)]

QL-ALG > TopVisited > PageRank > TopClicked

30

Experiments

Tree-structured trails
  Most dropped trails are very short
  Tree-structured trails improve accuracy

[Figure: Live Traffic prediction quality comparison]

[Figure: Distribution of dropped trails; axes: length of trail, number of trails dropped]

31

Outline

Motivation and Related Work

Problem Formulation

Proposed Solution

Experiments
Conclusions

32

Conclusions

Proposed a formulation for the QL selection problem
  Both toolbar and search logs are used intuitively
Proposed two algorithms:
  Greedy: (1-1/e)-optimal
  Tree-structured: empirically better
Improvement of 22% over competing baselines