Post on 15-Jan-2016
1
Quicklink Selection for Navigational Query Results
Deepayan Chakrabarti (deepay@yahoo-inc.com)
Ravi Kumar (ravikuma@yahoo-inc.com)
Kunal Punera (kpunera@yahoo-inc.com)
2
What are quicklinks
Quicklinks
Result Website
3
Quicklinks = URLs within the search result website
Enable fast navigation to important parts of the website
Which URLs should be QLs?
Quicklinks
Quicklinks
Result Website
4
Quicklink Selection
Some obvious strategies don’t work very well
Top clicked URLs in search engine
  URL may have low relevance in the QL context: lib.utexas.edu/maps is popular for searches on “maps” and not for searches on “Univ. of Texas”
  URL may be too specific: automobiles.honda.com/civic-hybrid/exterior-photos.aspx for honda.com
  URL popularity may be time-sensitive: nytimes.com/election-guide/2008/ for nytimes.com
5
Quicklink Selection
Some obvious strategies don’t work very well
Top clicked URLs in search engine
Top visited URLs in toolbar data
  May not relate to search activity, e.g., for nytimes.com:
  #3 is nytimes.com/mem/emailthis.html
  #6 is nytimes.com/auth/login
  #8 is nytimes.com/gst/regi.html
6
Quicklink Selection
Some obvious strategies don’t work very well
Top clicked URLs in search engine
Top visited URLs in toolbar data
Top URLs from analysis of hyperlink graph
  Ignores preferences of search users
  Toolbar data is more representative
Heavily tagged URLs (e.g., del.icio.us/digg)
  Low coverage: Too few websites
7
Quicklink Selection
Need a combined approach
  Search logs
  Toolbar data
  Web-server logs
  Website hyperlink graph
  User tags
This paper
8
Related Work
Sitemap generation [Perkowitz+/00]
Detection of hard-to-find URLs [Srikant+/01]
Improving website navigability [Doerr+/07]
Mining Web usage patterns [Buchner/99, Cadez+/03]
BrowseRank [Liu+/08]
Post-search browsing behavior [Bilenko+/08]
We focus on QLs in the context of Search
9
Outline
Motivation and Related Work
Problem Formulation
Proposed Solution
Experiments
Conclusions
10
Problem Formulation
Which k URLs should be QLs?
“The greatest good for the greatest number”
QLs save clicks
Maximize the total number of clicks saved using at most k QLs
But when exactly is a click “saved”?
11
Problem Formulation
When does a QL get clicked by the user?
Graph of click trails (Toolbar data)
Say we pick this node as a QL
nasa.gov
Hubble telescope
Photos
12
Problem Formulation
Say we pick this node as a QL
Assumption: The user recognizes if SearchResult → QL → Destination
Graph of click trails (Toolbar data)
nasa.gov
Hubble telescope
Photos
13
Problem Formulation
Say we pick this node as a QL
(saves 1 click each)
Assumption: The user recognizes if SearchResult → QL → Destination
Graph of click trails (Toolbar data)
nasa.gov
14
Problem Formulation
Say we pick this node as a QL
(saves 1 click each)
(saves 2 clicks each)
(saves 0)
(saves 0)
Total savings = 1*3 + 2*2 = 7 clicks
Graph of click trails (Toolbar data)
Assumption: The user recognizes if SearchResult → QL → Destination
nasa.gov
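The click-savings count on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the trails below are hypothetical stand-ins for the figure (the real graph is not recoverable from the slide text), and `clicks_saved` and `total_savings` are names introduced here. It assumes each trail starts at the search result and visits each URL at most once.

```python
def clicks_saved(trail, ql):
    """Clicks saved on one trail if `ql` is offered as a quicklink."""
    # Position i means the user needed i clicks from the home page to reach ql,
    # so jumping there directly skips those i clicks.
    return trail.index(ql) if ql in trail else 0


def total_savings(trails, ql):
    """Total clicks saved over all trails by a single quicklink."""
    return sum(clicks_saved(trail, ql) for trail in trails)


# Hypothetical trails, all starting at the search result (home page).
trails = (
    [["nasa.gov", "hubble", "photos"]] * 3           # QL saves 1 click each
    + [["nasa.gov", "missions", "hubble"]] * 2       # QL saves 2 clicks each
    + [["nasa.gov", "news"], ["nasa.gov", "jobs"]]   # QL not on trail: saves 0
)
print(total_savings(trails, "hubble"))  # 1*3 + 2*2 = 7
```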
15
Problem Formulation
However…
Unknown pages might become QLs
Example: lyrics.com
Its index pages A, B, C, …, Z could become the “best” QLs
16
Problem Formulation
However…
Unknown pages might become QLs
Automatic-redirect pages might become QLs:
  nytimes.com forces logging in
  aaa.com forces zipcode entry
We need QLs that are “noticeable” in a search context
17
Problem Formulation
How can we estimate noticeability?
Via Search click-logs
Noticeability of a URL u: the user notices a useful QL with probability α(u)
α(u) is computed from the fraction of search clicks for u on the website, with a tuning parameter (≈ 2)
18
Problem Formulation
QL1
(saves 0)
(saves 0)
QL2
Expected savings = # trails x probability x # clicks saved
2 x α1 x 2
1 x α1 x 1
2 x (1-α2)α1 x 1
2 x α2 x 2
Total = 5α1 + 4α2 + 2(1-α2)α1
Assumption: The user picks the best QL that he/she notices
nasa.gov
?
19
Problem Formulation
QL1
(saves 0)
(saves 0)
QL2
Expected savings = # trails x probability x # clicks saved
2 x α1 x 2
1 x α1 x 1
2 x (1-α2)α1 x 1
2 x α2 x 2
Total = 5α1 + 4α2 + 2(1-α2)α1
If only QL1 is perfectly noticeable (α1=1, α2=0): Total = 7 clicks (as if 1 QL only)
If both QLs are perfectly noticeable (α1=1, α2=1): Total = 9 clicks
nasa.gov
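The expected-savings arithmetic on the last two slides can be checked with a small sketch. It assumes, as the slide states, that the user picks the best QL he/she notices, and additionally that each QL is noticed independently with its own probability. The trails and the names `expected_savings`, `ql1` and `ql2` are hypothetical; they are chosen only so that the trail counts match the rows of the table.

```python
def expected_savings(trail, alpha):
    """Expected clicks saved on one trail; `alpha` maps each QL to its noticeability."""
    # Deepest position of each QL on the trail (deeper = more clicks saved).
    depth = {u: i for i, u in enumerate(trail) if u in alpha}
    expected, p_nothing_better_noticed = 0.0, 1.0
    # Try QLs from most to least saving: a QL is used only if it is noticed
    # and no better QL on this trail was noticed.
    for u, saved in sorted(depth.items(), key=lambda item: item[1], reverse=True):
        expected += p_nothing_better_noticed * alpha[u] * saved
        p_nothing_better_noticed *= 1.0 - alpha[u]
    return expected


# Hypothetical trails matching the four rows of the table.
trails = (
    [["nasa.gov", "a", "ql1"]] * 2      # only QL1 on trail, saves 2
    + [["nasa.gov", "ql1"]]             # only QL1 on trail, saves 1
    + [["nasa.gov", "ql1", "ql2"]] * 2  # QL2 saves 2; otherwise QL1 saves 1
)


def total(alpha1, alpha2):
    alpha = {"ql1": alpha1, "ql2": alpha2}
    return sum(expected_savings(t, alpha) for t in trails)


print(total(1.0, 0.0))  # 7.0, as if QL1 were the only QL
print(total(1.0, 1.0))  # 9.0
```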
20
Problem Formulation
Which k URLs should be QLs?
Maximize the expected number of clicks saved
using at most k QLs
while incorporating “noticeability”
21
Outline
Motivation and Related Work
Problem Formulation
Proposed Solution
Experiments
Conclusions
22
Algorithms
Maximize expected number of saved clicks using k QLs
NP-Hard
Theorem: This objective is non-decreasing submodular
1. Non-negative
2. Adding QLs never hurts
3. “Diminishing Returns”: for S ⊆ S' and a URL u, the marginal improvement from adding u to set S is at least the marginal improvement from adding u to superset S'
23
Algorithms
Greedy algorithm: Iteratively pick QLs that increase the number of saved clicks the most
Within a factor (1-1/e) of OPT [Nemhauser+/’78]
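A minimal sketch of the greedy step described above, under stated assumptions: the objective is the expected number of saved clicks with the "user picks the best QL noticed" model, QLs are noticed independently, and the home page itself is never a candidate. The function names and the toy trails are hypothetical, and this version recomputes the objective from scratch for every candidate, which is simple but not efficient.

```python
def objective(trails, alpha, qls):
    """Expected clicks saved over all trails when `qls` are shown."""
    total = 0.0
    for trail in trails:
        hits = sorted(((i, u) for i, u in enumerate(trail) if u in qls), reverse=True)
        p_nothing_better_noticed = 1.0
        for saved, u in hits:  # best (deepest) QL first
            a = alpha.get(u, 0.0)
            total += p_nothing_better_noticed * a * saved
            p_nothing_better_noticed *= 1.0 - a
    return total


def greedy_quicklinks(trails, alpha, k):
    """Pick up to k QLs, each time adding the URL with the largest marginal gain."""
    candidates = {u for trail in trails for u in trail[1:]}  # skip the home page
    chosen = []
    for _ in range(k):
        base = objective(trails, alpha, set(chosen))
        best_gain, best_url = 0.0, None
        for u in sorted(candidates - set(chosen)):  # sorted for deterministic ties
            gain = objective(trails, alpha, set(chosen) | {u}) - base
            if gain > best_gain:
                best_gain, best_url = gain, u
        if best_url is None:  # no remaining URL adds any savings
            break
        chosen.append(best_url)
    return chosen


trails = (
    [["nasa.gov", "a", "ql1"]] * 2
    + [["nasa.gov", "ql1"]]
    + [["nasa.gov", "ql1", "ql2"]] * 2
)
alpha = {"ql1": 1.0, "ql2": 1.0, "a": 0.0}
print(greedy_quicklinks(trails, alpha, k=2))  # ['ql1', 'ql2']
```

Because the objective is non-decreasing and submodular, marginal gains only shrink as the set grows, so stopping early once no candidate has positive gain loses nothing.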
24
Algorithms
However…
Inhomogeneous results: QLs for ea.com are
  fifa08.ea.com and battlefield.ea.com (two games made by EA)
  a page 6 webpages deep inside thesim2.ea.com
Redundant results: QLs for senate.gov include
  obama.senate.gov
  obama.senate.gov/about
  obama.senate.gov/contact
  obama.senate.gov/votes
  (the parent URL makes the child URLs redundant)
25
Algorithms
Both can be specified as pairwise constraints on URLs allowed to belong to a QL set
Pairwise-constrained QL selection is NP-hard.
Two-step process:
  Heuristically find a large subset of trails that form a tree
  Enforce constraints on tree
26
Outline
Motivation and Related Work
Problem Formulation
Proposed Solution
Experiments
Conclusions
27
Experiments
Baseline Methods
TopClicked:
  URL score = # search clicks on URL
TopVisited:
  URL score = # occurrences on toolbar trails
PageRank:
  Build a weighted graph on URLs, where weight(i,j) = # trails using the i→j edge
  URL score = PageRank on this graph
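The TopVisited and PageRank baselines can be sketched directly from toolbar trails (TopClicked is just a lookup of search click counts per URL). This is an illustration only: the damping factor of 0.85, the fixed iteration count, the uniform handling of dangling nodes, and all function names are assumptions made here, not details given in the talk.

```python
from collections import Counter


def top_visited_scores(trails):
    """URL score = number of occurrences on toolbar trails."""
    return Counter(u for trail in trails for u in trail)


def edge_weights(trails):
    """weight(i, j) = number of times a trail uses the i -> j edge."""
    weights = Counter()
    for trail in trails:
        for i, j in zip(trail, trail[1:]):
            weights[(i, j)] += 1
    return weights


def pagerank_scores(trails, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on the trail graph."""
    weights = edge_weights(trails)
    nodes = sorted({u for trail in trails for u in trail})
    out_total = Counter()
    for (i, _), w in weights.items():
        out_total[i] += w
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_total[u] == 0)
        nxt = {u: (1.0 - damping + damping * dangling) / len(nodes) for u in nodes}
        for (i, j), w in weights.items():
            nxt[j] += damping * rank[i] * w / out_total[i]
        rank = nxt
    return rank


trails = [["h", "a", "b"], ["h", "a"], ["h", "c"]]
visited = top_visited_scores(trails)
ranks = pagerank_scores(trails)
print(visited.most_common(2))
print(sorted(ranks, key=ranks.get, reverse=True))
```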
28
Experiments
Live Traffic dataset
  Computed CTRs on QLs currently displayed by Yahoo! (1043 website subset)
Measure:
  Pick two equal-sized subsets of QLs
  Use sum-of-scores and sum-of-CTRs to predict the better subset
  Measure how often the predictions match
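The agreement measure above can be sketched as follows. How the subset pairs were actually drawn is not stated on the slide, so uniform random sampling, the number of pairs, and the skipping of tied pairs are all assumptions of this sketch; `agreement_rate` and the toy CTR values are hypothetical.

```python
import random


def agreement_rate(scores, ctrs, subset_size, num_pairs=1000, seed=0):
    """Fraction of subset pairs where sum-of-scores and sum-of-CTRs pick the same subset."""
    rng = random.Random(seed)
    urls = sorted(ctrs)
    agree = compared = 0
    for _ in range(num_pairs):
        a = rng.sample(urls, subset_size)
        b = rng.sample(urls, subset_size)
        score_diff = sum(scores[u] for u in a) - sum(scores[u] for u in b)
        ctr_diff = sum(ctrs[u] for u in a) - sum(ctrs[u] for u in b)
        if score_diff == 0 or ctr_diff == 0:
            continue  # skip ties: neither subset is predicted to be better
        compared += 1
        agree += (score_diff > 0) == (ctr_diff > 0)
    return agree / compared if compared else float("nan")


# Toy data: scores that are a multiple of the CTRs always agree with them.
ctrs = {"a": 1, "b": 2, "c": 4, "d": 8}
scores = {u: 10 * v for u, v in ctrs.items()}
print(agreement_rate(scores, ctrs, subset_size=2))  # 1.0
```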
29
Experiments Live Traffic Data
[Figure: fraction of subset-pairs where predictions agree with live traffic, plotted against subset sizes]
QL-ALG > TopVisited > PageRank > TopClicked
30
Experiments
Tree-structured trails
  Most dropped trails are very short
  Tree-structured trails improve accuracy
[Figure: Distribution of dropped trails (number of trails dropped vs. length of trail)]
[Figure: Live Traffic prediction quality comparison]
31
Outline
Motivation and Related Work
Problem Formulation
Proposed Solution
Experiments
Conclusions
32
Conclusions
Proposed a formulation for the QL selection problem
  Both toolbar and search logs are used intuitively
Proposed two algorithms:
  Greedy: (1-1/e)-optimal
  Tree-structured: empirically better
Improvement of 22% over competing baselines