SIGIR 2008 Singapore Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell LTI/SCS/CMU Retrieval and Feedback Models for Blog Feed Search

Page 1:

SIGIR 2008

Singapore

Jonathan Elsas, Jaime Arguello,

Jamie Callan & Jaime Carbonell

LTI/SCS/CMU

Retrieval and Feedback Models for Blog Feed Search

Page 2:

Outline

• The task
– Overview of Blogs & Blog Search
– Challenges in Blog Search

• Our approach
– Retrieval Models
– Query Expansion Models

• Conclusion

Page 3:

Background

Page 4:

What is a Blog?

Page 5:

What is a Feed?

<xml>
  <feed>
    <entry>
      <author>Peter …</author>
      <title>Good, Evil…</title>
      <content>I’ve said…</content>
    </entry>
    <entry>
      <author>Peter …</author>
      <title>Agreeing…</title>
      <content>Some peo…</content>
    </entry>
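
As a rough illustration of the feed structure on this slide, the sketch below splits a feed into its entries with Python's standard library. The sample feed, its text, and the helper name `parse_entries` are hypothetical; the element names follow the slide rather than a real Atom or RSS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed modelled on the slide; not a real Atom/RSS document.
SAMPLE_FEED = """<feed>
  <entry>
    <author>A. Blogger</author>
    <title>First post</title>
    <content>Text of the first post.</content>
  </entry>
  <entry>
    <author>A. Blogger</author>
    <title>Second post</title>
    <content>Text of the second post.</content>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Return one dict per <entry>, with author, title and content text."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "author": entry.findtext("author", default=""),
            "title": entry.findtext("title", default=""),
            "content": entry.findtext("content", default=""),
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
```

Each dict corresponds to one entry (one blog post), which is the unit the small document model later treats as a document.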

Page 6:

Blog-Feed Correspondence

Blog ↔ Feed

Post ↔ Entry

HTML ↔ XML

Page 7:

Why are Blogs important?

Technorati currently tracking:

> 112.8 million blogs

> 175,000 new blogs per day

> 1.6 million posts per day

[http://www.technorati.com/about/]

Page 8:

The Task

Page 9:

Feed Search at TREC

Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]

“A relevant feed should have a principle and recurring interest in X”

— TREC 2007 Blog Track

(a.k.a. Blog Distillation)

Page 10:

Feed Search at TREC

[Gardening]

[Apple iPod]

[Violence in Sudan]

[Gun Control]

[Food]

[Wine]

Represent ongoing information needs

Frequently very general

Page 11:

Challenges in Feed Search

Page 12:

Challenges in Feed Search

1. A feed is a collection of documents

[Figure: a feed depicted as a sequence of entries over time]

Page 13:

Challenges in Feed Search

1. A feed is a collection of documents

– How does relevance at the entry level correspond to relevance at the feed level?

[Figure: a feed depicted as a sequence of entries over time]

Page 14:

Challenges in Feed Search

2. Even a topical feed is topically diverse

[Figure: entries of a "Space Exploration" feed plotted by time and topic. Entry labels: NASA, China’s plans for the moon, shuttle launch, My dog, Mars rover, Boeing]

Page 15:

Challenges in Feed Search

2. Even a topical feed is topically diverse

– Can we favor entries close to the central topic of the feed?

[Figure: entries of the "Space Exploration" feed plotted by time and topic]

Page 16:

Challenges in Feed Search

3. Feeds are noisy

– Spam blogs, spam & off-topic comments

[Figure: feed entries over time]

Page 17:

Challenges in Feed Search

4. General & Ongoing Information Needs

[Mac] … post regularly about new products, features, or application software of Apple Mac computers.

[Music] … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.

[Food] … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.

[Wine] … such as tastings, reviews, food matching or pairing, and oenophile news and events.

Page 18:

Our Approach

Page 19:

Challenges

• Feeds: topically diverse, noisy collections

• Information needs: general & ongoing

Our Approach

• Retrieval Models

• Feedback Models

Page 20:

Retrieval Models

• Challenge: ranking topically diverse collections

• Representation: feed vs. entry

• Model topical relationship between entries

Page 21:

Large Document (Feed) Model

[Figure: each feed XML file, with all of its entries, is treated as one document in the Feed Document Collection; the query [Q] is run against that collection to produce Ranked Feeds]

Rank by

Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
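
The idea of the large document model can be sketched as follows: concatenate every entry of a feed into one document and score that document against the query. This is a minimal sketch that assumes a plain unigram query-likelihood scorer with Dirichlet smoothing as a stand-in; it is not Indri and not the authors' implementation. The function names, the whitespace tokenizer, and the toy feeds are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def query_log_likelihood(query_terms, doc_tokens, coll_tf, coll_len, mu):
    """log P(Q | D) under a Dirichlet-smoothed unigram language model."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection; ignore it
        score += math.log((tf[term] + mu * p_coll) / (len(doc_tokens) + mu))
    return score


def rank_feeds_large_document(query, feeds, mu=2500.0):
    """feeds maps feed id -> list of entry texts; returns (feed_id, score), best first."""
    # Large document model: all entries of a feed become a single document.
    feed_docs = {fid: [tok for entry in entries for tok in tokenize(entry)]
                 for fid, entries in feeds.items()}
    coll_tf = Counter(tok for toks in feed_docs.values() for tok in toks)
    coll_len = sum(coll_tf.values())
    q_terms = tokenize(query)
    scored = [(fid, query_log_likelihood(q_terms, toks, coll_tf, coll_len, mu))
              for fid, toks in feed_docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data; mu is set small because the documents are tiny.
toy_feeds = {
    "space": ["nasa mars rover landing", "shuttle launch delayed", "my dog"],
    "food": ["pasta recipe", "wine and cheese pairing"],
}
ranking = rank_feeds_large_document("mars rover", toy_feeds, mu=10.0)
```

Because the entries are concatenated before scoring, a single long entry contributes more term counts than several short ones, which is the pitfall the deck raises for this model.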

Page 22:

Large Document (Feed) Model

Advantages:

• A straightforward application of existing retrieval techniques

Potential Pitfalls:

• Large entries dominate a feed’s language model

• Ignores relationship among entries

[Figure: a feed made up of entries of differing lengths]

Page 23:

Small Document (Entry) Model

[Figure: every entry is treated as its own document (document = entry) in the Entry Document Collection; the query [Q] produces Ranked Entries, and a rank aggregation function turns them into Ranked Feeds]

Apply some rank aggregation function

Rank by

Page 24:

Small Document (Entry) Model

• Query Likelihood

• Entry Centrality

• Feed Prior: favors longer feeds

ReDDE Federated Search Algorithm [Si & Callan, 2003]
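
The three components named on this slide can be combined as in the sketch below. The transcript does not preserve the ranking formula itself, so the combination used here (feed prior times the centrality-weighted sum of entry query likelihoods) is an assumption, as are the uniform centrality, the `log(1 + number of entries)` prior, the function names, and the toy data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def query_likelihood(query_terms, entry_tokens, coll_tf, coll_len, mu):
    """P(Q | E) under a Dirichlet-smoothed unigram language model."""
    tf = Counter(entry_tokens)
    prob = 1.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection; ignore it
        prob *= (tf[term] + mu * p_coll) / (len(entry_tokens) + mu)
    return prob


def rank_feeds_small_document(query, feeds, mu=10.0, use_log_prior=True):
    """feeds maps feed id -> non-empty list of entry texts; returns (feed_id, score)."""
    tokenized = {fid: [tokenize(e) for e in entries] for fid, entries in feeds.items()}
    coll_tf = Counter(tok for entries in tokenized.values() for e in entries for tok in e)
    coll_len = sum(coll_tf.values())
    q_terms = tokenize(query)
    scored = []
    for fid, entries in tokenized.items():
        centrality = 1.0 / len(entries)  # assumed uniform P(E | F)
        likelihood = sum(
            query_likelihood(q_terms, e, coll_tf, coll_len, mu) * centrality
            for e in entries
        )
        # Assumed feed prior: grows with the number of entries, so longer feeds
        # are favored; the "+ 1" only keeps single-entry feeds from scoring zero.
        prior = math.log(1 + len(entries)) if use_log_prior else 1.0
        scored.append((fid, prior * likelihood))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data.
toy_feeds = {
    "space": ["nasa mars rover", "shuttle launch", "my dog"],
    "food": ["pasta recipe", "wine pairing"],
}
ranking = rank_feeds_small_document("mars rover", toy_feeds)
```

Scoring each entry separately means the smoothing normalizes for entry length before the scores are aggregated, so a long entry no longer dominates the feed's score by sheer size.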

Page 25:

Entry Centrality

Uniform :

Geometric Mean :

[Figure: entries plotted by time and topic]
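
The two centrality formulas are not preserved in this transcript. One plausible way to write them, offered as an assumption and not as the authors' exact definitions, is to normalize a per-entry weight over the feed and to take the geometric-mean weight as the length-normalized likelihood of the entry under the feed's language model. The symbols below are assumed notation.

```latex
% Assumed notation: E is an entry, F a feed, |E| the entry length in terms,
% tf(t, E) the count of term t in E, P(t | F) the feed's unigram language model.
P(E \mid F) = \frac{\phi(E, F)}{\sum_{E' \in F} \phi(E', F)}

\phi_{\text{uniform}}(E, F) = 1

\phi_{\text{GM}}(E, F) = \Bigl( \prod_{t \in E} P(t \mid F)^{\mathrm{tf}(t, E)} \Bigr)^{1 / |E|}
```

Under this reading, the uniform weight treats every entry of a feed alike, while the geometric-mean weight is larger for entries whose vocabulary resembles the feed as a whole.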

Page 26:

Small Document (Entry) Model

Advantages:

• Controls for differing entry length

• Models topical relationship among entries

Disadvantages:

• Centrality computation is slow(er)

Q

Not only improves speed, but also performance

Page 27:

Retrieval Model Results

Page 28:

Retrieval Model Results

• 45 Queries from the TREC 2007 Blog Distillation Task

• BLOG06 test collection, XML feeds only

• 5-Fold Cross Validation for all retrieval model smoothing parameters

Page 29:

Retrieval Model Results

[Figure: bar chart of Mean Average Precision (axis 0.245 to 0.325) for the Large Document (Feed) Model and the Small Document (Entry) Models; bar values shown: 0.29, 0.277, 0.29, 0.298, 0.315]
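
For readers unfamiliar with the metric on the chart's axis, Mean Average Precision can be computed as in this sketch. The function names and the toy rankings are hypothetical and are not the TREC data.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: query -> ranked ids; qrels: query -> relevant ids."""
    return sum(average_precision(runs[q], qrels.get(q, [])) for q in runs) / len(runs)


# Hypothetical toy rankings for two queries.
toy_runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
toy_qrels = {"q1": ["a", "c"], "q2": ["y"]}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```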

Page 30:

Retrieval Model Results

[Figure: the same bar chart of Mean Average Precision (bar values 0.29, 0.277, 0.29, 0.298, 0.315), with bars labeled by prior: Uniform, Log(Feed Length), Uniform, Log; callout: MAP 0.188]

Page 31:

Retrieval Model Results

[Figure: the same bar chart of Mean Average Precision (bar values 0.29, 0.277, 0.29, 0.298, 0.315), with bar labels: Uniform, Log(Feed Length), Uniform, n/a]

Page 32:

Feedback Models

• Challenge: Noisy collection with general & ongoing information needs

• Use a cleaner external collection for query expansion (Wikipedia)

• With an expansion technique designed to identify multiple query facets

Page 33:

Query Expansion (PRF)

[Q] → BLOG06 Collection → Related terms from top K documents → [Q + Terms]

[Lavrenko & Croft, 2001]
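
The pseudo-relevance feedback loop on this slide can be sketched as follows: retrieve with the original query, take the top K documents, and add their most heavily weighted terms to the query. The term weighting below (each term's relative frequency in a document, weighted by that document's query likelihood) is a simplified stand-in in the spirit of relevance models, not the exact estimator of the cited work. All names and the toy documents are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def query_likelihood(query_terms, doc_tokens, coll_tf, coll_len, mu):
    """P(Q | D) under a Dirichlet-smoothed unigram language model."""
    tf = Counter(doc_tokens)
    prob = 1.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection; ignore it
        prob *= (tf[term] + mu * p_coll) / (len(doc_tokens) + mu)
    return prob


def expand_query(query, docs, k=2, n_terms=3, mu=10.0):
    """Return the query terms followed by n_terms expansion terms from the top k docs."""
    q_terms = tokenize(query)
    tokenized = [tokenize(d) for d in docs]
    coll_tf = Counter(tok for toks in tokenized for tok in toks)
    coll_len = sum(coll_tf.values())
    scored = [(query_likelihood(q_terms, toks, coll_tf, coll_len, mu), toks)
              for toks in tokenized]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    weights = Counter()
    for p_q_d, toks in top_k:
        for term, count in Counter(toks).items():
            weights[term] += p_q_d * count / len(toks)
    candidates = [(w, t) for t, w in weights.items() if t not in q_terms]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return q_terms + [t for _, t in candidates[:n_terms]]


# Hypothetical toy collection.
toy_docs = [
    "digital photography camera lens",
    "photography film camera",
    "pasta recipe sauce",
    "wine tasting notes",
]
expanded = expand_query("photography", toy_docs)
```

The expansion terms are only as clean as the top K documents, which is why a noisy collection can produce off-topic expansions.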

Page 34:

Query Expansion Example

[Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Page 35:

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD; series: None, PRF]

Page 36:

Query Expansion (Wikipedia PRF)

[Q] → Wikipedia → Related terms from top K documents → [Q + Terms] → BLOG06 Collection

[Lavrenko & Croft, 2001]

[Diaz & Metzler, 2006]

Page 37:

Query Expansion Example

[Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic

Page 38:

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.36) for BLOG LD and BLOG SD; series: None, PRF, Wiki. PRF]

Page 39:

Query Expansion (Wikipedia Link)

[Q] → Wikipedia → Related terms from link structure → [Q + Terms] → BLOG06 Collection

Page 40:

Wikipedia Link-Based Query Expansion

Page 41:

Wikipedia Link-Based Expansion

Wikipedia

Q

Page 42:

Wikipedia Link-Based Expansion

Wikipedia

Relevance Set, Top R = 100

Working Set, Top W = 1000

Q

Page 43:

Wikipedia Link-Based Expansion

Wikipedia

Q

Relevance Set, Top R = 100

Working Set, Top W = 1000

Page 44:

Wikipedia Link-Based Expansion

Relevance Set, Top R = 100

Working Set, Top W = 1000

Wikipedia

Extract anchor text from the Working Set that links to the Relevance Set.

Q

Page 45:

Wikipedia Link-Based Expansion

Relevance Set, Top R = 500

Working Set, Top W = 1000

Wikipedia

Extract anchor text from the Working Set that links to the Relevance Set.

Q

Combines relevance and popularity

Relevance: An anchor phrase that links to a highly ranked article gets a high score

Popularity: An anchor phrase that links many times to mid-ranked articles also gets a high score
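
A minimal sketch of how anchor phrases might be scored so that both relevance and popularity count. The transcript does not give the scoring formula, so the per-link credit used here (`R - rank + 1`, summed over all qualifying links) is an assumption, as are the function name, the input format, and the toy data. The defaults R = 100 and W = 1000 follow the values shown on most of the slides.

```python
from collections import defaultdict


def link_based_expansion(ranked_articles, anchors, R=100, W=1000, n_terms=20):
    """Score anchor phrases found in the working set that point into the relevance set.

    ranked_articles: article ids for the query, best first.
    anchors: iterable of (source_article, anchor_text, target_article).
    """
    rank = {article: i for i, article in enumerate(ranked_articles[:W], start=1)}
    scores = defaultdict(float)
    for source, text, target in anchors:
        if source not in rank:
            continue  # the link does not originate in the working set
        target_rank = rank.get(target)
        if target_rank is None or target_rank > R:
            continue  # the link does not point into the relevance set
        # Assumed credit: a higher-ranked target earns more (relevance), and
        # every additional link adds to the phrase's total (popularity).
        scores[text.lower()] += R - target_rank + 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]


# Hypothetical toy ranking and link table.
toy_ranking = ["Photography", "Digital photography", "Camera", "Film", "Dog"]
toy_anchors = [
    ("Camera", "digital photography", "Digital photography"),
    ("Film", "digital photography", "Digital photography"),
    ("Camera", "photography", "Photography"),
    ("Dog", "photography", "Photography"),  # source outside the working set
    ("Film", "camera", "Camera"),           # target outside the relevance set
]
phrases = link_based_expansion(toy_ranking, toy_anchors, R=2, W=4)
```

In the toy data, "photography" earns its score from one link to the top-ranked article, while "digital photography" reaches the same score through two links to a lower-ranked one, mirroring the relevance and popularity cases described on the slide.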

Page 46:

Query Expansion Example

[Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism

Page 47:

Feedback Model Results

[Figure: bar chart of Mean Average Precision (axis 0.2 to 0.4) for BLOG LD and BLOG SD; series: None, PRF, Wiki. PRF, Wiki. Link]

Page 48:

Conclusion

• Feed Search Challenges:

– Feeds are topically diverse, noisy collections

– Ranked against ongoing & general information needs

• Novel Retrieval Models:

– Ranking collections, sensitive to topical relationship among entries

• Novel Feedback Models:

– Discover multiple query facets & robust to collection noise

Page 49:

Thank You!

Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research

Page 50:

Entry Centrality GM Derivation

where

Entry Generation Likelihood:

|E|

Page 51:

Query Expansion Examples

[Music]

Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music

PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song

Page 52:

Query Expansion Examples

[Scottish Independence]

Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party

PRF: scotland, independence, party, convention, politics, snp, national, people, scot

Page 53:

Query Expansion Examples

[Machine Learning]

Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network

PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew

Page 54:

Query Generality Characteristics

• Query Length:

– BLOG: 1.9 words

– TB04: 3.2 words

– TB05: 3.0 words

• ODP Depth

– BLOG: 4.7 levels

– TB04: 5.2 levels

– TB05: 5.3 levels

Page 55:

Relevance Set Cohesiveness

Wikipedia

Relevance Set, Top R = 100

$\text{Cohesiveness} = \frac{|L_{in}|}{|L_{in} \cup L_{out}|}$
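
The cohesiveness ratio can be computed as in the sketch below. The slide does not define the two link sets, so this assumes that L_in holds links from a relevance-set article to another relevance-set article and L_out holds links from a relevance-set article to an article outside the set. The function name and the toy link table are hypothetical.

```python
def relevance_set_cohesiveness(relevance_set, links):
    """|L_in| / |L_in ∪ L_out| for directed (source, target) links."""
    members = set(relevance_set)
    # Assumed definitions: only links that start inside the relevance set count.
    l_in = {(s, t) for s, t in links if s in members and t in members}
    l_out = {(s, t) for s, t in links if s in members and t not in members}
    total = len(l_in | l_out)
    return len(l_in) / total if total else 0.0


# Hypothetical toy link table: A and B link to each other, A also links out to C.
toy_links = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
cohesiveness = relevance_set_cohesiveness(["A", "B"], toy_links)
```

A value near 1 means the top-ranked articles mostly link to one another; a value near 0 means their links mostly leave the set.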

Page 56:

Relevant Set Cohesiveness

Page 57:

Is it the Queries?

Feed Search Queries

TB Adhoc Queries

But none of these measures predict whether Wikipedia expansion helps…