The Cluster Hypothesis: Ranking Document Clusters
Transcript of The Cluster Hypothesis: Ranking Document Clusters
Oren Kurland, Faculty of Industrial Engineering and Management, Technion
* Based on joint work with Fiana Raiber
* This work has been supported by, and carried out at, the Technion-Microsoft Electronic Commerce Research Center
Query = “oren kurland dblp”
[Screenshots: results of Search #1 vs. Search #2]
Is search a solved problem?
The ad hoc retrieval task
Rank documents in a corpus by their relevance to the information need expressed by a query
Example: Vector space model (Salton ’68)
o query = “Technion”
  o q⃗ = ⟨0, …, 0, 1, 0, …, 0⟩
o document = “Faculty Technion Student Technion”
  o d⃗ = ⟨0, …, 0, 1, 0, …, 1, 0, …, 2, 0, …, 0⟩
o score(d; q) ≜ cos(q⃗, d⃗)
o Term weighting scheme: TF.IDF
  o TF: the number of occurrences of a term in the document
  o IDF: the inverse of the document frequency of the term
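The TF.IDF-weighted cosine similarity above can be sketched as follows. This is a minimal illustration; the function names and the toy corpus are my own, not from the talk.

```python
# Minimal sketch of the vector space model with TF.IDF weighting.
import math
from collections import Counter

def tfidf_vector(text, corpus):
    """Map a text to a sparse term -> TF*IDF weight dictionary."""
    tf = Counter(text.split())          # TF: term counts in this text
    n_docs = len(corpus)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc.split())
        idf = math.log(n_docs / df) if df else 0.0   # IDF: inverse doc frequency
        vec[term] = count * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(u) == 0 or norm(v) == 0:
        return 0.0
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    return dot / (norm(u) * norm(v))

# Toy corpus echoing the slide's example
corpus = ["Faculty Technion Student Technion", "Student Faculty", "Technion"]
q = tfidf_vector("Technion", corpus)
d = tfidf_vector("Faculty Technion Student Technion", corpus)
score = cosine(q, d)
```

The query and document share the term “Technion”, so the score is strictly positive but below 1 (the document also contains terms absent from the query).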
Web search engines
o Use a variety of relevance signals:
  o The textual similarity between the page and the query (query-dependent)
  o The textual similarity between the query and the anchor text of pages that point to the page (query-dependent)
  o The PageRank score of the page (query-independent)
  o The clickthrough rate for the page (query-independent)
  o …
o Learning-to-rank (Liu ’09)
The document-query similarity
o Relevance is determined based on whether the document content satisfies the information need expressed by the query
o The document-query similarity is among the most important features for ranking pages in Web search engines (Liu ’09)
Back to classical information retrieval?
The cluster hypothesis
“Closely associated documents tend to be relevant to the same requests”
(Jardine&van Rijsbergen ’71,van Rijsbergen ’79)
Operational consequence:
Relevant documents should be more similar to each other than to non-relevant documents
Cluster-based results interface
o Clustering the pages
o Automatically labeling the clusters
o Finding the highest quality clusters
Cluster-based document retrieval
Query → (document ranking method) → Initial list of documents
→ (clustering method) → Set of clusters
→ (cluster ranking method) → Ranking of clusters
→ (each cluster is replaced with its documents) → Ranking of documents
The optimal cluster problem (Kurland&Domshlak ’08)
[Figure: oracle experiment; p@5 of the optimal cluster compared with doc-query similarity and query expansion baselines]
The cluster ranking task
o Estimate the probability that cluster C is relevant to query Q:
  p(C|Q) = p(C,Q) / p(Q), which is rank-equivalent to p(C,Q)
o Estimate p(C,Q) using Markov Random Fields
Markov Random Fields
o Define a graph G:
  o Nodes ‒ random variables representing Q and C’s documents
  o Edges ‒ dependencies between the variables
o p(C,Q) = (1/Z) ∏_{l ∈ L(G)} ψ_l(l)
  o L(G) ‒ the set of cliques in G
  o l ‒ a clique
  o ψ_l ‒ a potential defined over l
  o Z ‒ a normalization factor
o A common instantiation of potential functions: ψ_l(l) ≜ exp(λ_l f_l(l))
  o f_l(l) ‒ a feature function defined over l
  o λ_l ‒ the weight associated with f_l(l)
ClustMRF
o A linear (in feature functions) cluster ranking function that depends on the graph G:
  p(C,Q) is rank-equivalent to ∑_{l ∈ L(G)} λ_l f_l(l)
o Next:
  o Determine the clique set L(G)
  o Associate feature functions with cliques
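The rank-equivalent linear scoring rule can be sketched as follows. The feature names, feature values, and weights below are made up for illustration; in practice the weights λ_l would be learned (e.g., with a learning-to-rank method).

```python
# Hedged sketch of the ClustMRF rank-equivalent scoring rule:
# score(C; Q) = sum over cliques l in L(G) of lambda_l * f_l(l).

def clustmrf_score(features, weights):
    """Linear combination of clique feature values."""
    return sum(weights[name] * value for name, value in features.items())

# Assumed (illustrative) learned weights and per-cluster feature values
weights = {"geo_qsim": 0.5, "max_qsim": 0.3, "min_pagerank": 0.2}
cluster_a = {"geo_qsim": -1.2, "max_qsim": -0.8, "min_pagerank": -3.0}
cluster_b = {"geo_qsim": -2.5, "max_qsim": -1.9, "min_pagerank": -2.0}

# Rank clusters by descending score
ranked = sorted([("A", cluster_a), ("B", cluster_b)],
                key=lambda nc: clustmrf_score(nc[1], weights),
                reverse=True)
```

Because the function is linear in the feature values, ranking clusters by p(C,Q) reduces to sorting by the weighted feature sum.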
The lQD clique
o Contains the query Q and a single document d in cluster C
o Consider query-similarity values of C’s documents independently
o f_geo-qsim(l_QD) ≜ (1/|C|) log sim(Q, d)   (cf. Liu&Croft ’08)
  o sim(·,·) ‒ an inter-text similarity measure
  o |C| ‒ the number of documents in C
The lQC clique
o Contains the query Q and all of C’s documents
o Induce information from relations between query-similarity values of C’s documents
o f_A-qsim(l_QC) ≜ A({log sim(Q, d)}_{d ∈ C}),  where A ∈ {min, max, stdv}
The lC clique
o Contains only C’s documents
o Induce information based on query-independent properties of C’s documents (e.g., PageRank score, ratio of stopwords to non-stopwords)
o f_A-P(l_C) ≜ A({log P(d)}_{d ∈ C}),  where A ∈ {min, max, geometric mean}
  o P ‒ a query-independent document (quality) measure
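The three clique feature families can be sketched together. This is a hedged illustration: the similarity and PageRank values are made up, and the function names are mine, not the talk’s.

```python
# Sketch of the three clique feature families from the slides.
import math
import statistics

def f_geo_qsim(sims):
    """Sum of the lQD features over C's documents:
    (1/|C|) * sum_d log sim(Q,d), i.e., the log of the geometric
    mean of the query-similarity values."""
    return sum(math.log(s) for s in sims) / len(sims)

def f_qc(sims, agg):
    """lQC feature: aggregate {log sim(Q,d)}_{d in C};
    agg plays the role of A in {min, max, stdv}."""
    return agg([math.log(s) for s in sims])

def f_c(values, agg):
    """lC feature: aggregate values of a query-independent quality
    measure P (e.g., PageRank) in log space."""
    return agg([math.log(v) for v in values])

sims = [0.4, 0.2, 0.1]         # assumed sim(Q, d) values for C's documents
pagerank = [0.03, 0.01, 0.02]  # assumed query-independent P(d) values

geo = f_geo_qsim(sims)
qc_min, qc_max = f_qc(sims, min), f_qc(sims, max)
qc_std = f_qc(sims, statistics.pstdev)  # stdv aggregate
c_min = f_c(pagerank, min)
```

Note that exp(geo) recovers the geometric mean of the similarity values, which is why summing the per-document lQD features captures the geo-qsim idea.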
Empirical evaluation
Query → (Markov Random Field (MRF); Metzler&Croft ’05) → Initial list of 50 documents
→ (Nearest Neighbor clustering; Griffiths et al. ’86) → Set of clusters
→ (ClustMRF) → Ranking of clusters
→ (each cluster is replaced with its documents) → Ranking of documents
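The nearest-neighbor clustering step can be sketched as follows, assuming (as in this line of work) that each document together with its k−1 nearest neighbors forms one overlapping cluster. The similarity function and toy documents are illustrative, not from the talk.

```python
# Sketch of nearest-neighbor clustering over the initial list.
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity over bag-of-words counts (illustrative)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def nn_clusters(docs, k=2):
    """One (overlapping) cluster per document: the document itself
    plus its k-1 nearest neighbors by inter-document similarity."""
    clusters = []
    for i, d in enumerate(docs):
        neighbors = sorted((j for j in range(len(docs)) if j != i),
                           key=lambda j: cos_sim(d, docs[j]), reverse=True)
        clusters.append([i] + neighbors[:k - 1])
    return clusters

docs = ["cluster ranking retrieval", "cluster retrieval", "pagerank web graph"]
clusters = nn_clusters(docs, k=2)
```

With k=2, the first two (textually similar) toy documents end up in each other’s clusters, while the third pairs off by default.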
Comparison with the initial ranking
■ Init: Markov Random Field (Metzler&Croft ’05)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF
[Figure: MAP of Init and ClustMRF on ClueWebB and GOV2]
Comparison with other cluster ranking methods
■ AMean: Arithmetic mean of query-similarity values (Liu&Croft ’08)
■ GMean: Geometric mean of query-similarity values (Liu&Croft ’08, Seo&Croft ’10)
■ ClustRanker: Uses measures of document and cluster biases (Kurland ’08)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF
[Figure: MAP of AMean, GMean, ClustRanker, and ClustMRF on ClueWebB and GOV2]
Comparison with automatic query expansion
■ RM3 (Abdul-Jaleel et al. ’04)
■ ClustMRF (our algorithm)
♦ Statistically significant differences with ClustMRF
[Figure: MAP of RM3 and ClustMRF on GOV2 and ClueWebB]
Diversifying search results
o MMR (Carbonell&Goldstein ’98) and xQuAD (Santos et al. ’10) iteratively re-rank the initial list
o In each iteration a document is scored by:
  β · (relevance to the query) + (1 − β) · (diversity with respect to documents already selected)
o Relevance estimates:
  ■ Init (MRF)
  ■ ClustMRF
♦ Statistically significant differences with ClustMRF
[Figure: α-NDCG@20 of MMR and xQuAD on ClueWebB, with Init and ClustMRF as relevance estimates]
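The iterative scoring loop can be sketched in the classic MMR form, where the diversity term is expressed as a penalty on similarity to already-selected documents. The relevance and similarity values below are made up; the talk plugs in Init (MRF) or ClustMRF as the relevance estimate.

```python
# Greedy MMR-style re-ranking sketch (Carbonell&Goldstein-style form):
# in each iteration pick the document maximizing
#   beta * relevance(d) - (1 - beta) * max similarity to selected docs.

def mmr_rerank(docs, relevance, sim, beta=0.7):
    """docs: list of ids; relevance: id -> score; sim: (id, id) -> similarity."""
    selected, remaining = [], list(docs)
    while remaining:
        def mmr_score(d):
            # Redundancy = worst-case similarity to anything already chosen
            redundancy = max((sim[(d, s)] for s in selected), default=0.0)
            return beta * relevance[d] - (1 - beta) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: d1 and d2 are near-duplicates, d3 is different but less relevant
docs = ["d1", "d2", "d3"]
relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.3}
sim = {(a, b): (0.95 if {a, b} == {"d1", "d2"} else 0.1)
       for a in docs for b in docs if a != b}
order = mmr_rerank(docs, relevance, sim, beta=0.5)
```

With β = 0.5, the redundant d2 is pushed below the less relevant but novel d3, illustrating the relevance/diversity trade-off.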
TREC (Text REtrieval Conference)
o Annual “competition” of search approaches
o In 2013, the Technion won first place in the Web retrieval track according to most evaluation criteria
o ClustMRF was used to re-rank the results of a highly effective learning-to-rank approach
Summary
o We presented a novel cluster ranking approach that
  o outperforms state-of-the-art document-based retrieval methods
  o outperforms state-of-the-art cluster-based retrieval methods
  o can be used to improve the performance of results-diversification methods