Algorithms for query result diversification

Swap algorithm for query result diversification

Emre Can Kucukoglu, [email protected]

Articles:

• C. Yu, L. Lakshmanan, and S. Amer-Yahia, “It takes variety to make a world: diversification in recommender systems,” in EDBT, 2009.

• Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On query result diversification. ICDE 2011: 1163-1174

Start with the k highest-scoring documents, then try to swap the document that contributes the least to the function F with the next highest-scoring document among the remaining documents. At each iteration, a candidate document with a lower relevance is swapped into the top-k set if and only if it increases the overall F value of the resulting set.

Function F (as defined in Vieira et al.):

F(q, R) = (k - 1)(1 - λ) · sim(q, R) + 2λ · div(R)

λ is the tradeoff value between the sim function and the div function.

k is the result set size.

The sim function computes the sum of the similarity distances between the query and every document in the set.

The div function computes the sum of the diversity distances among all documents in the set: for a given set of documents, the diversity distances of every pair of documents are added.

Similarity distances can be computed with tf-idf or other algorithms.
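The objective can be sketched in Python. This is a minimal sketch that assumes F has the form F(q, R) = (k - 1)(1 - λ) · sim(q, R) + 2λ · div(R) from Vieira et al.; the names `f_value`, `sim_to_query` and `constant_div`, and all numeric values, are illustrative assumptions and not part of the slides.

```python
from itertools import combinations


def f_value(result, sim_to_query, div_dist, lam, k):
    """Assumed objective: F(q, R) = (k-1)(1-lam)*sim(q, R) + 2*lam*div(R)."""
    # sim(q, R): sum of the similarity scores between the query and each document
    sim = sum(sim_to_query[d] for d in result)
    # div(R): sum of the diversity distances over every unordered pair in R
    div = sum(div_dist(a, b) for a, b in combinations(result, 2))
    return (k - 1) * (1 - lam) * sim + 2 * lam * div


# Toy data, invented for illustration only
sim_to_query = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5}


def constant_div(a, b):
    # Hypothetical diversity distance: every pair is equally diverse
    return 0.5


top4 = ["d4", "d5", "d1", "d7"]
score = f_value(top4, sim_to_query, constant_div, lam=0.5, k=4)
relevance_only = f_value(top4, sim_to_query, constant_div, lam=0.0, k=4)
```

With λ = 0 only the relevance term remains; with λ = 1 only the diversity term remains.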

Let the candidate set be S = {d1, d2, d3, d4, d5, d6, d7, d8} and k = 4. Documents in decreasing order of relevance score:

Document:  d4   d5   d1   d7   d6   d2   d8   d3
Score:     0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1

STEP 1

Inside the while loop in [2-5], the first 4 documents are added to the result set R. These are d4, d5, d1, d7. They are then removed from the candidate set S.

R: d4, d5, d1, d7

S: d6, d2, d8, d3
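The initialization in STEP 1 can be sketched as follows. This is only an illustration; the helper name `init_top_k` is an assumption, and the scores are the ones from the example above.

```python
def init_top_k(scores, k):
    """Split documents into the initial result set R (top-k by score) and the rest S."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[k:]


scores = {"d1": 0.6, "d2": 0.3, "d3": 0.1, "d4": 0.8,
          "d5": 0.7, "d6": 0.4, "d7": 0.5, "d8": 0.2}

R, S = init_top_k(scores, k=4)
print(R)  # ['d4', 'd5', 'd1', 'd7']
print(S)  # ['d6', 'd2', 'd8', 'd3']
```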




STEP 2

While the candidate set S has documents, pick its highest-scoring document. First pick d6 as the candidate C and remove it from S. Copy R into R’ (R’ ← R).

R: d4, d5, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d5, d1, d7





STEP 3

Pick d7 from R [line 10] and compute the F value for A = {d4, d5, d1, d6} (R with d7 replaced by the candidate d6) and for R’ = {d4, d5, d1, d7}. Let F(q, A) > F(q, R’). Then update R’ with A = {d4, d5, d1, d6}.

R: d4, d5, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d5, d1, d6





STEP 4

Pick d1 from R [line 10] and compute the F value for A = {d4, d5, d6, d7} and R’ = {d4, d5, d1, d6}. Let F(q, A) < F(q, R’), so R’ is not updated.

R: d4, d5, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d5, d1, d6






STEP 5

Pick d5 from R [line 10] and compute the F value for A = {d4, d6, d1, d7} and R’ = {d4, d5, d1, d6}. Let F(q, A) > F(q, R’). Then update R’ with A = {d4, d6, d1, d7}.

R: d4, d5, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d6, d1, d7

STEP 6

Pick d4 from R [line 10] and compute the F value for A = {d6, d5, d1, d7} and R’ = {d4, d6, d1, d7}. Let F(q, A) < F(q, R’), so R’ is not updated.

R: d4, d5, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d6, d1, d7

STEP 7

After the inner for-loop [10-12], if R’ has a higher F value than the initial set R, replace R with R’ (R ← R’) [13-14]. Let F(q, R’) > F(q, R).

R: d4, d6, d1, d7

S: d2, d8, d3

C: d6

R’: d4, d6, d1, d7

STEP 8

While the candidate set S has documents, pick its highest-scoring document. The second pick is d2. Remove it from S, copy R into R’ (R’ ← R), and repeat STEPS 3-6.

R: d4, d6, d1, d7

S: d8, d3

C: d2

R’: d4, d6, d1, d7
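The inner for-loop for a single candidate can be sketched as follows. This is a minimal sketch under stated assumptions: F has the form (k - 1)(1 - λ) · sim + 2λ · div from Vieira et al., the diversity distance is a hypothetical topic-based measure (1 if two documents have different topics, 0 otherwise), and the topic labels are invented. The toy values are not meant to reproduce the F comparisons assumed in the walkthrough.

```python
from itertools import combinations

# Relevance scores from the example; topic labels are invented for illustration
scores = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5, "d6": 0.4}
topics = {"d4": "A", "d5": "A", "d1": "B", "d7": "C", "d6": "D"}
LAM, K = 0.5, 4


def f_value(result):
    """Assumed objective F(q, R) with a topic-based diversity distance."""
    sim = sum(scores[d] for d in result)
    div = sum(1.0 for a, b in combinations(result, 2) if topics[a] != topics[b])
    return (K - 1) * (1 - LAM) * sim + 2 * LAM * div


def best_swap(R, cand):
    """Try replacing each document of R with cand; keep the best-scoring set."""
    best, best_val = list(R), f_value(R)       # R' starts as a copy of R
    for out in R:                              # for every document in R
        trial = [cand if d == out else d for d in R]
        val = f_value(trial)
        if val > best_val:                     # keep the swap only if F improves
            best, best_val = trial, val
    return best


R = ["d4", "d5", "d1", "d7"]
R_prime = best_swap(R, "d6")
print(R_prime)  # ['d4', 'd6', 'd1', 'd7']
```

In this toy setting d5 shares a topic with d4, so replacing d5 with d6 gives the largest gain in diversity for the smallest loss in relevance.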



If we assume that the complexity of evaluating function F is O(C), the overall complexity of the Swap algorithm is O(N·k·C), where N is the size of the candidate set S.

The F value of the final result R is not guaranteed to be optimal, since documents in the candidate set S are analyzed only in the order of their similarity distances.

That is, this method does not consider the order of diversity distances in S.
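The whole procedure can be sketched as below. This is a sketch under assumptions, not the authors' implementation: F is assumed to have the form (k - 1)(1 - λ) · sim + 2λ · div from Vieira et al., `topic_div` is a hypothetical diversity distance, and the topic labels are invented. The function also counts how often F is evaluated, to illustrate the O(N·k·C) bound.

```python
from itertools import combinations


def swap(scores, div_dist, lam, k):
    """Sketch of Swap; returns the result set and the number of F evaluations."""
    calls = 0

    def f_value(result):
        nonlocal calls
        calls += 1
        sim = sum(scores[d] for d in result)
        div = sum(div_dist(a, b) for a, b in combinations(result, 2))
        return (k - 1) * (1 - lam) * sim + 2 * lam * div

    ranked = sorted(scores, key=scores.get, reverse=True)
    R, S = ranked[:k], ranked[k:]              # top-k documents and the rest
    for cand in S:                             # candidates in decreasing score order
        best, best_val = R, f_value(R)         # R' starts as R
        for out in R:
            trial = [cand if d == out else d for d in R]
            val = f_value(trial)
            if val > best_val:
                best, best_val = trial, val
        R = best                               # differs from old R only if F improved
    return R, calls


scores = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5,
          "d6": 0.4, "d2": 0.3, "d8": 0.2, "d3": 0.1}
# Invented topic labels; documents with the same topic have diversity distance 0
topics = {"d4": "A", "d5": "A", "d1": "B", "d7": "C",
          "d6": "D", "d2": "B", "d8": "E", "d3": "A"}


def topic_div(a, b):
    return 0.0 if topics[a] == topics[b] else 1.0


result, f_calls = swap(scores, topic_div, lam=0.5, k=4)
print(result)   # ['d4', 'd6', 'd1', 'd7']
print(f_calls)  # 20
```

Here F is evaluated (N - k) · (k + 1) = 20 times for N = 8 and k = 4, which stays within the O(N·k·C) bound. Candidates are visited strictly by relevance, which is why the result is not guaranteed to be optimal.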

MMR* algorithm for query result diversification

Emre Can Kucukoglu, [email protected]

Articles:

• J. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” in Proc. ACM SIGIR, 1998.

• Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On query result diversification. ICDE 2011: 1163-1174

*: Maximal Marginal Relevance

The MMR algorithm iteratively constructs the result set R by selecting a new document in S that maximizes the function mmr (as defined in Vieira et al.):

mmr(di) = (1 - λ) · δsim(di, q) + (λ / |R|) · Σ δdiv(di, dj), summed over all dj in R

λ is the tradeoff value between the δsim function and the δdiv function.

k is the result set size.

The δsim function computes the similarity distance between the query and a document.

The δdiv function computes the diversity distance between a pair of documents.

Similarity distances can be computed with tf-idf or other algorithms.

Since R is empty in the initial iteration, |R| is 0 and the diversity term is dropped, so the element with the highest δsim in S is always included in R, regardless of the λ value.
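The scoring function can be sketched in Python. This is a minimal sketch assuming mmr(d) = (1 - λ) · δsim(d, q) + (λ / |R|) · Σ δdiv(d, dj) over dj in R, as in Vieira et al.; the function and variable names, the topic labels and the topic-based diversity distance are illustrative assumptions.

```python
def mmr(doc, result, sim_to_query, div_dist, lam):
    """Assumed form: (1-lam)*dsim(doc, q) + (lam/|R|) * sum of ddiv(doc, dj), dj in R."""
    if not result:
        # Assumption: with an empty R only similarity is used, which also
        # avoids dividing by |R| = 0
        return sim_to_query[doc]
    avg_div = sum(div_dist(doc, other) for other in result) / len(result)
    return (1 - lam) * sim_to_query[doc] + lam * avg_div


# Toy data, invented for illustration only
sim_to_query = {"d4": 0.8, "d5": 0.7, "d6": 0.4}
topics = {"d4": "A", "d5": "A", "d6": "D"}


def topic_div(a, b):
    # Hypothetical diversity distance: 1 for different topics, 0 otherwise
    return 0.0 if topics[a] == topics[b] else 1.0


first = mmr("d4", [], sim_to_query, topic_div, lam=0.5)
near_duplicate = mmr("d5", ["d4"], sim_to_query, topic_div, lam=0.5)
diverse = mmr("d6", ["d4"], sim_to_query, topic_div, lam=0.5)
```

With d4 already selected, the less relevant but more diverse d6 scores higher than the near-duplicate d5.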

Let the candidate set be S = {d1, d2, d3, d4, d5, d6, d7, d8} and k = 4. Documents in decreasing order of δsim score:

Document:  d4   d5   d1   d7   d6   d2   d8   d3
δsim:      0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1

STEP 1

Since R is initially empty, the first element is picked according to similarity distances. d4 has the highest δsim score. Add it to R and remove it from S.

R: d4

S: d1, d2, d3, d5, d6, d7, d8




STEP 2-4

Then, at every step, pick the highest-scoring document from S according to the function mmr, until |R| = k. Let d3, d6 and d8 have the highest scores.

For each calculation of function mmr, since |R| is increasing, the weight of each individual diversity distance is decreasing.

R: d4, d3, d6, d8

S: d1, d2, d5, d7



Since the result is constructed incrementally by inserting a new element into the previous results, the first chosen element has a large influence on the quality of the final result set R.

Moreover, experimental results show that the quality of the results for the MMR method decreases very fast when the λ parameter is increased.

If we assume that the complexity of picking the highest-scoring document according to function mmr is O(C), the overall complexity of the MMR algorithm is O(k·C).
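The greedy construction can be sketched as follows. This is a sketch under assumptions rather than a reference implementation: the mmr form is taken from Vieira et al., the first pick uses δsim alone, and the topic labels and the topic-based diversity distance are invented for illustration, so the selected documents differ from the ones assumed in the walkthrough.

```python
def mmr_select(sim_to_query, div_dist, lam, k):
    """Greedy MMR sketch: repeatedly move the document with the highest mmr from S to R."""
    S = sorted(sim_to_query, key=sim_to_query.get, reverse=True)
    R = []

    def mmr(doc):
        if not R:                              # first pick: similarity only
            return sim_to_query[doc]
        avg_div = sum(div_dist(doc, other) for other in R) / len(R)
        return (1 - lam) * sim_to_query[doc] + lam * avg_div

    while S and len(R) < k:                    # k picks, each scanning S once
        best = max(S, key=mmr)
        S.remove(best)
        R.append(best)
    return R


sim_to_query = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5,
                "d6": 0.4, "d2": 0.3, "d8": 0.2, "d3": 0.1}
# Invented topic labels; same topic means diversity distance 0
topics = {"d4": "A", "d5": "A", "d1": "B", "d7": "C",
          "d6": "D", "d2": "B", "d8": "E", "d3": "A"}


def topic_div(a, b):
    return 0.0 if topics[a] == topics[b] else 1.0


selected = mmr_select(sim_to_query, topic_div, lam=0.5, k=4)
print(selected)  # ['d4', 'd1', 'd7', 'd6']
```

The loop runs k times and each pass picks one document from S, which matches the O(k·C) estimate. The first pick is fixed by δsim alone, so every later choice depends on it.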
