Algorithms for query result diversification


Transcript of Algorithms for query result diversification

Page 1: Algorithms for query result diversification

Swap algorithm for query result diversification

Emre Can Kucukoglu, [email protected]

Reference articles:

• C. Yu, L. Lakshmanan, and S. Amer-Yahia, “It takes variety to make a world: diversification in recommender systems,” in EDBT, 2009.

• Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On query result diversification. ICDE 2011: 1163-1174

Page 2: Algorithms for query result diversification

Start with the k highest-scoring documents, then try to swap the document that contributes the least to the function F with the next-highest-scoring document among the remaining documents. At each iteration, a candidate document with a lower relevance is swapped into the top-k set if and only if it increases the overall F value of the resulting set.

Function F: [formula shown on the slide; not captured in the transcript]

• λ is the tradeoff value between the sim function and the div function.
• k is the result set size.
• The sim function computes the sum of the similarity distances of all documents in the set.
• The div function computes the sum of the diversity distances among all documents in the set: for a given set of documents, the diversity distances of every pair of documents are added together.
• Similarity distances can be computed with tf-idf or other algorithms.
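
As a rough illustration, the sketch below evaluates F for a small set. It assumes the form given in the Vieira et al. reference, F(q, R) = (k - 1)(1 - λ)·sim(q, R) + 2λ·div(R); that formula is not part of this transcript, so treat it as an assumption. The names f_score, rel and constant_div, and the constant diversity distance, are hypothetical.

```python
from itertools import combinations

def f_score(result, rel, div, lam, k):
    """Assumed objective: (k - 1) * (1 - lam) * sim + 2 * lam * div."""
    # sim: sum of each document's relevance (similarity to the query).
    sim = sum(rel[d] for d in result)
    # div: sum of diversity distances over every unordered pair of documents.
    dv = sum(div(a, b) for a, b in combinations(result, 2))
    return (k - 1) * (1 - lam) * sim + 2 * lam * dv

# Hypothetical toy data: relevance scores and a constant diversity distance.
rel = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5}
constant_div = lambda a, b: 0.5

docs = ["d4", "d5", "d1", "d7"]
only_relevance = f_score(docs, rel, constant_div, lam=0.0, k=4)
only_diversity = f_score(docs, rel, constant_div, lam=1.0, k=4)
print(only_relevance, only_diversity)
```

With λ = 0 only the relevance term remains, and with λ = 1 only the pairwise diversity term remains.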

Let the candidate set be S = {d1, d2, d3, d4, d5, d6, d7, d8} and k = 4. Documents in decreasing order of relevance score:

Document: d4   d5   d1   d7   d6   d2   d8   d3
Score:    0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1

STEP 1: Inside the while loop [lines 2-5], the first 4 documents are added to the result set R.

R: (empty)

Page 3: Algorithms for query result diversification


STEP 1: Inside the while loop [lines 2-5], the first 4 documents are added to the result set R. These are d4, d5, d1, d7.

R: d4, d5, d1, d7

Page 4: Algorithms for query result diversification


STEP 1: Inside the while loop [lines 2-5], the first 4 documents are added to the result set R. These are d4, d5, d1, d7. They are then removed from the candidate set S.

R: d4, d5, d1, d7
S: d6, d2, d8, d3
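
The initialization in step 1 can be sketched as below, using the relevance scores from the slide. The helper name init_top_k is hypothetical; it simply sorts by score and splits the ranking at k.

```python
def init_top_k(scores, k):
    """Split documents into the k most relevant (R) and the rest (S)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[k:]

# Relevance scores as listed on the slide.
scores = {"d1": 0.6, "d2": 0.3, "d3": 0.1, "d4": 0.8,
          "d5": 0.7, "d6": 0.4, "d7": 0.5, "d8": 0.2}

R, S = init_top_k(scores, 4)
print(R)  # ['d4', 'd5', 'd1', 'd7']
print(S)  # ['d6', 'd2', 'd8', 'd3']
```

Keeping S sorted by relevance means the next candidate is always its first element.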

Page 5: Algorithms for query result diversification


STEP 2: While the candidate set S still has documents, pick its highest-scoring document.

R: d4, d5, d1, d7
S: d6, d2, d8, d3

Page 6: Algorithms for query result diversification


STEP 2: While the candidate set S still has documents, pick its highest-scoring document. First, pick d6.

R: d4, d5, d1, d7
S: d6, d2, d8, d3
C: d6

Page 7: Algorithms for query result diversification


STEP 2: While the candidate set S still has documents, pick its highest-scoring document. First, pick d6 and remove it from S.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6

Page 8: Algorithms for query result diversification


STEP 2: While the candidate set S still has documents, pick its highest-scoring document. First, pick d6 and remove it from S. Then set R’ to a copy of R (R’ ← R).

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d1, d7

Page 9: Algorithms for query result diversification


STEP 3: Pick d7 from R [line 10] and compute the value of F for A = {d4, d5, d1, d6} (R with d7 replaced by d6) and for R’ = {d4, d5, d1, d7}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d1, d7

Page 10: Algorithms for query result diversification


STEP 3: Pick d7 from R [line 10] and compute the value of F for A = {d4, d5, d1, d6} (R with d7 replaced by d6) and for R’ = {d4, d5, d1, d7}. Let F(q, A) > F(q, R’). Then update R’ with A = {d4, d5, d1, d6}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d1, d6
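
The swap test applied to one candidate can be sketched as below. This is a minimal illustration, not the authors' code: f_score uses the objective assumed from the Vieira et al. reference, the topic labels and the 0/1 diversity distance are hypothetical, and best_swap is an invented name. Members of R are visited in list order here rather than in the order used on the slides.

```python
from itertools import combinations

def f_score(result, rel, div, lam, k):
    # Assumed objective from the cited paper, not from this transcript.
    sim = sum(rel[d] for d in result)
    dv = sum(div(a, b) for a, b in combinations(result, 2))
    return (k - 1) * (1 - lam) * sim + 2 * lam * dv

def best_swap(R, candidate, rel, div, lam):
    """Return R' after testing the candidate against every member of R."""
    k = len(R)
    best = list(R)  # R' starts as a copy of R
    for out in R:
        # A is R with one document replaced by the candidate.
        A = [d for d in R if d != out] + [candidate]
        if f_score(A, rel, div, lam, k) > f_score(best, rel, div, lam, k):
            best = A
    return best

rel = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5, "d6": 0.4}
# Hypothetical topics: documents on different topics count as diverse.
topic = {"d4": "a", "d5": "a", "d1": "a", "d7": "a", "d6": "b"}
div = lambda a, b: 1.0 if topic[a] != topic[b] else 0.0

r_prime = best_swap(["d4", "d5", "d1", "d7"], "d6", rel, div, lam=0.5)
print(r_prime)  # ['d4', 'd5', 'd1', 'd6']
```

With these made-up distances every swap adds the same diversity, so the best A is the one that drops the least relevant member, d7.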

Page 11: Algorithms for query result diversification


STEP 4: Pick d1 from R [line 10] and compute the value of F for A = {d4, d5, d6, d7} (R with d1 replaced by d6) and for R’ = {d4, d5, d1, d6}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d1, d6

Page 12: Algorithms for query result diversification

STEP 4: Pick d1 from R [line 10] and compute the value of F for A = {d4, d5, d6, d7} (R with d1 replaced by d6) and for R’ = {d4, d5, d1, d6}. Let F(q, A) > F(q, R’). Then update R’ with A = {d4, d5, d6, d7}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 13: Algorithms for query result diversification

STEP 5: Pick d5 from R [line 10] and compute the value of F for A = {d4, d6, d1, d7} (R with d5 replaced by d6) and for R’ = {d4, d5, d6, d7}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 14: Algorithms for query result diversification

STEP 5: Pick d5 from R [line 10] and compute the value of F for A = {d4, d6, d1, d7} (R with d5 replaced by d6) and for R’ = {d4, d5, d6, d7}. Let F(q, A) < F(q, R’), so R’ is not updated.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 15: Algorithms for query result diversification


STEP 6: Pick d4 from R [line 10] and compute the value of F for A = {d6, d5, d1, d7} (R with d4 replaced by d6) and for R’ = {d4, d5, d6, d7}.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 16: Algorithms for query result diversification


STEP 6: Pick d4 from R [line 10] and compute the value of F for A = {d6, d5, d1, d7} (R with d4 replaced by d6) and for R’ = {d4, d5, d6, d7}. Let F(q, A) < F(q, R’), so R’ is not updated.

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 17: Algorithms for query result diversification


STEP 7: After the inner for-loop [lines 10-12], if R’ has a higher F value than the initial set R, set R to R’ (R ← R’).

R: d4, d5, d1, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 18: Algorithms for query result diversification


STEP 7: After the inner for-loop [lines 10-12], if R’ has a higher F value than the initial set R, set R to R’ (R ← R’) [lines 13-14]. Let F(q, R’) > F(q, R), so R becomes {d4, d5, d6, d7}.

R: d4, d5, d6, d7
S: d2, d8, d3
C: d6
R’: d4, d5, d6, d7

Page 19: Algorithms for query result diversification


STEP 8: While the candidate set S still has documents, pick its highest-scoring document. Second, pick d2 and remove it from S. Set R’ to a copy of R (R’ ← R), then repeat steps 3-6.

R: d4, d5, d6, d7
S: d8, d3
C: d2
R’: d4, d5, d6, d7

Page 20: Algorithms for query result diversification


If we assume that evaluating the function F costs O(C), the overall complexity of the swap algorithm is O(N·k·C), where N is the size of the candidate set S.

The F value of the final result R is not guaranteed to be optimal, since documents in the candidate set S are analyzed in order of their similarity distances.

That is, this method does not consider the order of diversity distances in S.
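
Putting the steps together, a minimal sketch of the whole swap procedure might look like this. It follows the walkthrough rather than any published implementation: the objective in f_score is the form assumed from the Vieira et al. reference, and swap_diversify, topic and div are hypothetical names and data. The outer loop over the remaining candidates and the inner loop over the k members of R are where the O(N·k·C) bound comes from.

```python
from itertools import combinations

def f_score(result, rel, div, lam, k):
    # Assumed objective from the cited paper, not from this transcript.
    sim = sum(rel[d] for d in result)
    dv = sum(div(a, b) for a, b in combinations(result, 2))
    return (k - 1) * (1 - lam) * sim + 2 * lam * dv

def swap_diversify(rel, div, lam, k):
    ranked = sorted(rel, key=rel.get, reverse=True)
    R, S = ranked[:k], ranked[k:]       # step 1: top-k by relevance
    for c in S:                         # candidates in decreasing relevance
        best = list(R)                  # R' <- R
        for out in R:
            A = [d for d in R if d != out] + [c]
            if f_score(A, rel, div, lam, k) > f_score(best, rel, div, lam, k):
                best = A
        if f_score(best, rel, div, lam, k) > f_score(R, rel, div, lam, k):
            R = best                    # R <- R' only on strict improvement
    return R

rel = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5,
       "d6": 0.4, "d2": 0.3, "d8": 0.2, "d3": 0.1}
# Hypothetical topics used only to define a toy diversity distance.
topic = {"d4": "a", "d5": "a", "d1": "a", "d7": "a",
         "d6": "b", "d2": "b", "d8": "c", "d3": "c"}
div = lambda a, b: 1.0 if topic[a] != topic[b] else 0.0

print(swap_diversify(rel, div, lam=0.0, k=4))  # ['d4', 'd5', 'd1', 'd7']
print(swap_diversify(rel, div, lam=0.5, k=4))
```

With λ = 0 no swap can raise F, so the result is the plain top-k; for larger λ the result depends on the diversity distances supplied.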

Page 21: Algorithms for query result diversification

MMR* algorithm for query result diversification

Emre Can Kucukoglu, [email protected]

Reference articles:

• J. Carbonell and J. Goldstein, “The use of MMR, diversity-based reranking for reordering documents and producing summaries,” in Proc. ACM SIGIR, 1998.

• Marcos R. Vieira, Humberto Luiz Razente, Maria Camila Nardini Barioni, Marios Hadjieleftheriou, Divesh Srivastava, Caetano Traina Jr., Vassilis J. Tsotras: On query result diversification. ICDE 2011: 1163-1174

*: Maximal Marginal Relevance

Page 22: Algorithms for query result diversification

The MMR algorithm iteratively constructs the result set R by selecting a new document in S that maximizes the function mmr:

[formula shown on the slide; not captured in the transcript]

• λ is the tradeoff value between the δsim function and the δdiv function.
• k is the result set size.
• The δsim function computes the similarity distance between the query and a document.
• The δdiv function computes the diversity distance between a pair of documents.
• Similarity distances can be computed with tf-idf or other algorithms.
• Since R is empty in the initial iteration, |R| is 0, so the element with the highest δsim in S is always included in R, regardless of the λ value.

Let the candidate set be S = {d1, d2, d3, d4, d5, d6, d7, d8} and k = 4. Documents in decreasing order of δsim score:

Document: d4   d5   d1   d7   d6   d2   d8   d3
δsim:     0.8  0.7  0.6  0.5  0.4  0.3  0.2  0.1

STEP 1: Since R is initially empty, the first element is picked according to similarity distance.

R: (empty)
S: d1, d2, d3, d4, d5, d6, d7, d8
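
A possible reading of the mmr function is sketched below. It assumes the form given in the Vieira et al. reference, mmr(s) = (1 - λ)·δsim(s, q) + (λ / |R|)·Σ δdiv(s, r) over r in R, with the diversity term taken as 0 while R is empty; the formula is not in this transcript, and the constant diversity distance is hypothetical.

```python
def mmr(doc, R, rel, div, lam):
    """Assumed form: (1 - lam) * sim(doc, q) + (lam / |R|) * sum of div(doc, r)."""
    if not R:
        # Empty result set: only the relevance term contributes.
        return (1 - lam) * rel[doc]
    diversity = sum(div(doc, r) for r in R) / len(R)
    return (1 - lam) * rel[doc] + lam * diversity

# δsim scores from the slide; constant δdiv is a made-up placeholder.
rel = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5,
       "d6": 0.4, "d2": 0.3, "d8": 0.2, "d3": 0.1}
div = lambda a, b: 0.5

first = max(rel, key=lambda d: mmr(d, [], rel, div, lam=0.5))
print(first)  # d4
```

Because the diversity term vanishes for an empty R, the first pick is decided by δsim alone, which matches step 1.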

Page 23: Algorithms for query result diversification


STEP 1: Since R is initially empty, the first element is picked according to similarity distance. d4 has the highest δsim score, so add it to R and remove it from S.

R: d4
S: d1, d2, d3, d5, d6, d7, d8

Page 24: Algorithms for query result diversification


STEP 2-4: Then, at every step, pick the document in S with the highest mmr value, until |R| = k.

R: d4
S: d1, d2, d3, d5, d6, d7, d8

Page 25: Algorithms for query result diversification


STEP 2-4: Then, at every step, pick the document in S with the highest mmr value, until |R| = k. Let d3, d6 and d8 have the highest scores.

R: d4, d3, d6, d8
S: d1, d2, d5, d7

Page 26: Algorithms for query result diversification


STEP 2-4: Then, at every step, pick the document in S with the highest mmr value, until |R| = k. Let d3, d6 and d8 have the highest scores.

Each time mmr is recalculated, |R| has grown, so the weight of the diversity distances decreases.

R: d4, d3, d6, d8
S: d1, d2, d5, d7
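
The greedy selection in steps 1 to 4 can be sketched as below. As before, the mmr formula is the form assumed from the Vieira et al. reference, and mmr_diversify, topic and div are hypothetical names and data chosen only to make the example run.

```python
def mmr(doc, R, rel, div, lam):
    # Assumed form; the diversity term is 0 while R is empty.
    if not R:
        return (1 - lam) * rel[doc]
    diversity = sum(div(doc, r) for r in R) / len(R)
    return (1 - lam) * rel[doc] + lam * diversity

def mmr_diversify(rel, div, lam, k):
    S = sorted(rel, key=rel.get, reverse=True)  # candidates, most relevant first
    R = []
    while len(R) < k and S:
        # Pick the candidate with the highest mmr value against the current R.
        best = max(S, key=lambda d: mmr(d, R, rel, div, lam))
        R.append(best)
        S.remove(best)
    return R

rel = {"d4": 0.8, "d5": 0.7, "d1": 0.6, "d7": 0.5,
       "d6": 0.4, "d2": 0.3, "d8": 0.2, "d3": 0.1}
# Hypothetical topics used only to define a toy diversity distance.
topic = {"d4": "a", "d5": "a", "d1": "a", "d7": "a",
         "d6": "b", "d2": "b", "d8": "c", "d3": "c"}
div = lambda a, b: 1.0 if topic[a] != topic[b] else 0.0

print(mmr_diversify(rel, div, lam=0.0, k=4))  # ['d4', 'd5', 'd1', 'd7']
print(mmr_diversify(rel, div, lam=0.5, k=4))
```

Each iteration scans the remaining candidates once, and the loop runs k times.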

Page 27: Algorithms for query result diversification


Since the result is constructed incrementally by inserting a new element into the previous result, the first chosen element has a large influence on the quality of the final result set R.

Moreover, experimental results show that the quality of the results of the MMR method degrades quickly as the λ parameter increases.

If we assume that picking the highest-scoring document according to the function mmr costs O(C), the overall complexity of the MMR algorithm is O(k·C).