Retrieval and Feedback Models for Blog Feed Search

SIGIR 2008, Singapore. Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell.

SIGIR 2008 Presentation

Transcript of Retrieval and Feedback Models for Blog Feed Search

Page 1: Retrieval and Feedback Models for Blog Feed Search

SIGIR 2008, Singapore

Jonathan Elsas, Jaime Arguello,

Jamie Callan & Jaime Carbonell

LTI/SCS/CMU

Retrieval and Feedback Models for Blog Feed Search

Page 2: Retrieval and Feedback Models for Blog Feed Search

Outline

• The task
– Overview of Blogs & Blog Search
– Challenges in Blog Search

• Our approach
– Retrieval Models
– Query Expansion Models

• Conclusion

Page 3: Retrieval and Feedback Models for Blog Feed Search

Background

Page 4: Retrieval and Feedback Models for Blog Feed Search

What is a Blog?

Page 5: Retrieval and Feedback Models for Blog Feed Search

What is a Feed?

<?xml … ?>
<feed>
  <entry>
    <author>Peter …</author>
    <title>Good, Evil…</title>
    <content>I’ve said…</content>
  </entry>
  <entry>
    <author>Peter …</author>
    <title>Agreeing…</title>
    <content>Some peo…</content>
  </entry>
</feed>
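As a rough illustration of the feed structure above, the following sketch splits a feed into its entries with Python's standard library. The sample feed text and the `parse_entries` helper are made up for this example; real Atom or RSS feeds usually carry XML namespaces and extra elements that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: element names follow the slide, the text is invented.
SAMPLE_FEED = """<feed>
  <entry>
    <author>A. Blogger</author>
    <title>First post</title>
    <content>Hello world</content>
  </entry>
  <entry>
    <author>A. Blogger</author>
    <title>Second post</title>
    <content>More text here</content>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Return one dict per <entry> with its author, title and content text."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "author": entry.findtext("author", default=""),
            "title": entry.findtext("title", default=""),
            "content": entry.findtext("content", default=""),
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
for e in entries:
    print(e["title"], "by", e["author"])
```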

Page 6: Retrieval and Feedback Models for Blog Feed Search

Blog-Feed Correspondence

Blog (HTML) ↔ Feed (XML)

Post ↔ Entry

Page 7: Retrieval and Feedback Models for Blog Feed Search

Why are Blogs important?

Technorati currently tracking:

> 112.8 Million Blogs

> 175,000 new Blogs per day

> 1.6 Million posts per day

[http://www.technorati.com/about/]

Page 8: Retrieval and Feedback Models for Blog Feed Search

The Task

Page 9: Retrieval and Feedback Models for Blog Feed Search

Feed Search at TREC

Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]

“A relevant feed should have a principal and recurring interest in X”

— TREC 2007 Blog Track

(a.k.a. Blog Distillation)

Page 10: Retrieval and Feedback Models for Blog Feed Search

Feed Search at TREC

[Gardening] [Apple iPod]

[Violence in Sudan] [Gun Control]

[Food] [Wine]

Represent Ongoing Information Needs

Frequently Very General

Page 11: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

Page 12: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

1. A feed is a collection of documents

[Figure: a feed shown as a sequence of entries over time]

Page 13: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

1. A feed is a collection of documents
– How does relevance at the entry level correspond to relevance at the feed level?

[Figure: a feed shown as a sequence of entries over time]

Page 14: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

2. Even a topical feed is topically diverse

[Figure: entries of a "Space Exploration" feed plotted by time and topic: NASA, China’s plans for the moon, shuttle launch, My dog, Mars rover, Boeing]

Page 15: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

2. Even a topical feed is topically diverse
– Can we favor entries close to the central topic of the feed?

[Figure: entries of the "Space Exploration" feed plotted by time and topic]

Page 16: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

3. Feeds are noisy
– Spam blogs, spam & off-topic comments

[Figure: feed entries over time]

Page 17: Retrieval and Feedback Models for Blog Feed Search

Challenges in Feed Search

4. General & Ongoing Information Needs

[Mac]: … post regularly about new products, features, or application software of Apple Mac computers.

[Music]: … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.

[Food]: … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.

[Wine]: … such as tastings, reviews, food matching or pairing, and oenophile news and events.

Page 18: Retrieval and Feedback Models for Blog Feed Search

Our Approach

Page 19: Retrieval and Feedback Models for Blog Feed Search

Challenges:

Feeds: Topically Diverse, Noisy Collections

Information Needs: General & Ongoing

Our Approach:

Retrieval Models

Feedback Models

Page 20: Retrieval and Feedback Models for Blog Feed Search

Retrieval Models

• Challenge: ranking topically diverse collections

• Representation: feed vs. entry

• Model topical relationship between entries

Page 21: Retrieval and Feedback Models for Blog Feed Search

Large Document (Feed) Model

[Figure: each feed (a <feed> with all of its <entry> elements) is one document in the Feed Document Collection; the query [Q] is run against this collection to produce Ranked Feeds]

Rank by

Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]

Page 22: Retrieval and Feedback Models for Blog Feed Search

Large Document (Feed) Model

Advantages:

• A straightforward application of existing retrieval techniques

Potential Pitfalls:

• Large entries dominate a feed’s language model

• Ignores relationship among entries

[Figure: a feed made up of large and small entries]
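A minimal sketch of the large document idea: concatenate all entries of a feed into one bag of words and score the feed by query likelihood. This is not Indri's retrieval model; it assumes plain Dirichlet-smoothed unigram query likelihood, whitespace tokenization, and a hypothetical `large_document_scores` helper, purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer, assumed for illustration: lowercase, split on whitespace."""
    return text.lower().split()


def large_document_scores(feeds, query, mu=2500.0):
    """feeds maps feed_id -> list of entry strings; returns feed_id -> log P(Q|F)."""
    # Each feed becomes one large document: the concatenation of its entries.
    feed_tf = {
        fid: Counter(t for entry in entries for t in tokenize(entry))
        for fid, entries in feeds.items()
    }
    coll_tf = Counter()
    for tf in feed_tf.values():
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())

    scores = {}
    for fid, tf in feed_tf.items():
        doc_len = sum(tf.values())
        log_p = 0.0
        for term in tokenize(query):
            if coll_tf[term] == 0:
                continue  # skip terms unseen in the whole collection
            p_coll = coll_tf[term] / coll_len
            # Dirichlet smoothing toward the collection model.
            log_p += math.log((tf[term] + mu * p_coll) / (doc_len + mu))
        scores[fid] = log_p
    return scores


feeds = {
    "space": ["nasa shuttle launch", "mars rover nasa"],
    "food": ["wine tasting", "recipes for pasta"],
}
scores = large_document_scores(feeds, "nasa")
print(sorted(scores, key=scores.get, reverse=True))
```

Because term counts are pooled over the whole feed, a single very long entry can dominate the counts, which is the first pitfall listed above.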

Page 23: Retrieval and Feedback Models for Blog Feed Search

Small Document (Entry) Model

[Figure: each <entry> is one document in the Entry Document Collection (document = entry); the query [Q] is run against this collection to produce Ranked Entries, and a rank aggregation function is applied to produce Ranked Feeds]

Rank By

Page 24: Retrieval and Feedback Models for Blog Feed Search

Small Document (Entry) Model

• Query Likelihood

• Entry Centrality

• Feed Prior: favors longer feeds

ReDDE Federated Search Algorithm [Si & Callan, 2003]
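One way the three components could be combined is sketched below. The ranking formula itself did not survive in the transcript, so the combination used here (feed prior times the sum over entries of query likelihood times entry centrality) is an assumption, as are the function names and the use of the number of entries as feed length.

```python
import math


def uniform_centrality(n_entries):
    """Assumed uniform centrality: every entry is equally central, P(E|F) = 1/N."""
    return [1.0 / n_entries] * n_entries


def log_length_prior(n_entries):
    """Assumed feed prior that grows with the log of the number of entries."""
    return math.log(n_entries)


def small_document_score(entry_likelihoods, centralities, feed_prior=1.0):
    """Aggregate entry-level scores into a feed score:
    feed_prior * sum over entries of P(Q|E) * P(E|F)."""
    return feed_prior * sum(q * c for q, c in zip(entry_likelihoods, centralities))


# Hypothetical query likelihoods P(Q|E) for a feed with two entries.
likelihoods = [0.2, 0.4]
score_uniform = small_document_score(likelihoods, uniform_centrality(2))
score_log = small_document_score(
    likelihoods, uniform_centrality(2), log_length_prior(2)
)
print(score_uniform, score_log)
```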

Page 25: Retrieval and Feedback Models for Blog Feed Search

Entry Centrality

Uniform :

Geometric Mean :

[Figure: entries plotted by time and topic]
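The centrality formulas are missing from the transcript. The sketch below shows one plausible geometric-mean centrality, assuming the entry's generation likelihood under the feed language model is normalized by entry length (its |E|-th root) and that the feed model is the average of the entry models. Both assumptions and all names are illustrative, not taken from the slides.

```python
import math
from collections import Counter


def term_probs(tokens):
    """Maximum-likelihood unigram model P(t|E) of one entry."""
    tf = Counter(tokens)
    n = len(tokens)
    return {t: c / n for t, c in tf.items()}


def geometric_mean_centrality(entries):
    """entries: list of non-empty token lists. Returns one normalized weight per entry."""
    entry_lms = [term_probs(e) for e in entries]

    # Assumed feed model: average of entry models, so each entry counts equally
    # regardless of its length.
    feed_lm = Counter()
    for lm in entry_lms:
        for t, p in lm.items():
            feed_lm[t] += p / len(entries)

    # log phi(E, F) = sum_t P(t|E) * log P(t|F), i.e. the length-normalized
    # (geometric mean) likelihood of the entry under the feed model.
    log_phi = [
        sum(p * math.log(feed_lm[t]) for t, p in lm.items()) for lm in entry_lms
    ]
    phi = [math.exp(x) for x in log_phi]
    total = sum(phi)
    return [x / total for x in phi]


entries = [["nasa", "shuttle"], ["nasa", "mars"], ["my", "dog"]]
centrality = geometric_mean_centrality(entries)
print(centrality)
```

In this toy feed the off-topic "my dog" entry shares no terms with the others and so receives the lowest weight.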

Page 26: Retrieval and Feedback Models for Blog Feed Search

Small Document (Entry) Model

Advantages:

• Controls for differing entry length

• Models topical relationship among entries

Disadvantages:

• Centrality computation is slow(er)

Not only improves speed, but also performance

Page 27: Retrieval and Feedback Models for Blog Feed Search

Retrieval Model Results

Page 28: Retrieval and Feedback Models for Blog Feed Search

Retrieval Model Results

• 45 Queries from the TREC 2007 Blog Distillation Task

• BLOG06 test collection, XML feeds only

• 5-Fold Cross Validation for all retrieval model smoothing parameters
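The results that follow are reported as Mean Average Precision, with parameters tuned by 5-fold cross validation over the queries. The sketch below shows how both could be computed; the fold assignment (round-robin over query ids) and the function names are assumptions made for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    total = 0.0
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i  # precision at this relevant item's rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: query -> ranked list of ids; qrels: query -> set of relevant ids."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)


def k_fold_splits(query_ids, k=5):
    """Yield (train, test) query lists; folds are assigned round-robin."""
    folds = [query_ids[i::k] for i in range(k)]
    for i in range(k):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, folds[i]


ap = average_precision(["a", "b", "c"], {"a", "c"})
splits = list(k_fold_splits(list(range(45))))
print(ap, len(splits))
```

In each split, parameters would be chosen to maximize MAP on the training queries and then evaluated on the held-out queries.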

Page 29: Retrieval and Feedback Models for Blog Feed Search

Retrieval Model Results

[Bar chart: Mean Average Precision (axis from 0.245 to 0.325) for the Large Document (Feed) Model and the Small Document (Entry) Models. Bar values shown: 0.29, 0.277, 0.29, 0.298, 0.315]

Page 30: Retrieval and Feedback Models for Blog Feed Search

Retrieval Model Results

[Bar chart: Mean Average Precision (axis from 0.245 to 0.325). Bar values shown: 0.29, 0.277, 0.29, 0.298, 0.315. Feed prior labels: Uniform, Log(Feed Length), Uniform, Log. One additional value is annotated: MAP 0.188]

Page 31: Retrieval and Feedback Models for Blog Feed Search

Retrieval Model Results

[Bar chart: Mean Average Precision (axis from 0.245 to 0.325). Bar values shown: 0.29, 0.277, 0.29, 0.298, 0.315. Labels: Uniform, Log(Feed Length), Uniform, n/a]

Page 32: Retrieval and Feedback Models for Blog Feed Search

Feedback Models

• Challenge: Noisy collection with general & ongoing information needs

• Use a cleaner external collection for query expansion (Wikipedia)

• With an expansion technique designed to identify multiple query facets

Page 33: Retrieval and Feedback Models for Blog Feed Search

Query Expansion (PRF)

[Q] → BLOG06 Collection → Related Terms from top K documents → [Q + Terms]

[Lavrenko & Croft, 2001]
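A simplified sketch of pseudo-relevance feedback: take the top K retrieved documents, weight each term by its frequency in a document times that document's retrieval score, and keep the highest-weighted terms. This is a reduced form in the spirit of relevance models, not the exact estimator of the cited work; the function name and the input format are assumptions.

```python
from collections import Counter


def prf_expansion_terms(ranked_docs, k=10, n_terms=5):
    """ranked_docs: list of (tokens, score) pairs sorted by descending score,
    where score is a non-negative retrieval score such as P(Q|D).
    Returns the n_terms highest-weighted (term, weight) pairs."""
    top = ranked_docs[:k]
    norm = sum(score for _, score in top)
    weights = Counter()
    for tokens, score in top:
        tf = Counter(tokens)
        for term, count in tf.items():
            # P(t|D) weighted by the document's share of the total score.
            weights[term] += (count / len(tokens)) * (score / norm)
    return weights.most_common(n_terms)


# Hypothetical top-ranked documents for the query [Photography].
docs = [
    (["photo", "camera", "photo"], 0.5),
    (["camera", "lens"], 0.5),
]
terms = prf_expansion_terms(docs, k=2, n_terms=3)
print(terms)
```

The selected terms would then be added to the original query to form [Q + Terms]. If the top K documents are noisy, the expansion terms are noisy too.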

Page 34: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Example

Query: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Page 35: Retrieval and Feedback Models for Blog Feed Search

Feedback Model Results

[Bar chart: Mean Average Precision (axis from 0.2 to 0.36) for BLOG LD and BLOG SD. Series: None, PRF]

Page 36: Retrieval and Feedback Models for Blog Feed Search

Query Expansion (Wikipedia PRF)

[Q] → Wikipedia → Related Terms from top K documents → [Q + Terms] → BLOG06 Collection

[Lavrenko & Croft, 2001]

[Diaz & Metzler, 2006]

Page 37: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Example

Query: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic

Page 38: Retrieval and Feedback Models for Blog Feed Search

Feedback Model Results

[Bar chart: Mean Average Precision (axis from 0.2 to 0.36) for BLOG LD and BLOG SD. Series: None, PRF, Wiki. PRF]

Page 39: Retrieval and Feedback Models for Blog Feed Search

Query Expansion (Wikipedia Link)

[Q] → Wikipedia → Related Terms from link structure → [Q + Terms] → BLOG06 Collection

Page 40: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Query Expansion

Page 41: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Expansion

[Diagram: the query Q is run against Wikipedia]

Page 42: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Expansion

Wikipedia

Relevance Set, Top R = 100

Working Set, Top W = 1000

Q

Page 43: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Expansion

Wikipedia

Q

Relevance Set, Top R = 100

Working Set, Top W = 1000

Page 44: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Expansion

Relevance Set, Top R = 100

Working Set, Top W = 1000

Wikipedia

Extract anchor text from links in the Working Set that point to the Relevance Set.

Q

Page 45: Retrieval and Feedback Models for Blog Feed Search

Wikipedia Link-Based Expansion

Relevance Set, Top R = 100

Working Set, Top W = 1000

Wikipedia

Extract anchor text from links in the Working Set that point to the Relevance Set.

Q

Combines relevance and popularity

Relevance: An anchor phrase that links to a highly ranked article gets a high score

Popularity: An anchor phrase that links many times to mid-ranked articles also gets a high score
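The scoring function is not shown in the transcript. The sketch below assumes one simple form that has both properties: each link from a Working Set article to a Relevance Set article adds (R minus the target's rank) to the score of its anchor phrase. The function name, the input format, and the 0-based ranks are illustrative assumptions.

```python
from collections import defaultdict


def link_expansion_scores(ranked_articles, links, R=100, W=1000):
    """ranked_articles: article ids ranked for the query, best first.
    links: (source_article, anchor_text, target_article) triples.
    Returns (anchor_phrase, score) pairs, highest score first."""
    # 0-based rank of every article in the Working Set (top W).
    rank = {article: i for i, article in enumerate(ranked_articles[:W])}
    scores = defaultdict(float)
    for source, anchor, target in links:
        in_working_set = source in rank
        in_relevance_set = target in rank and rank[target] < R
        if in_working_set and in_relevance_set:
            # High-ranked targets add more per link (relevance);
            # many links accumulate (popularity).
            scores[anchor.lower()] += R - rank[target]
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Tiny hypothetical example with R = 2 and W = 3.
ranked = ["Photography", "Camera", "Film", "Dog"]
links = [
    ("Film", "photography", "Photography"),
    ("Camera", "photography", "Photography"),
    ("Film", "camera", "Camera"),
    ("Dog", "camera", "Camera"),        # source is outside the Working Set
    ("Photography", "film", "Film"),    # target is outside the Relevance Set
]
expansion = link_expansion_scores(ranked, links, R=2, W=3)
print(expansion)
```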

Page 46: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Example

Query: [Photography]

Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography

PRF: photography, nude, erotic, art, girl, free, teen, fashion, women

Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism

Page 47: Retrieval and Feedback Models for Blog Feed Search

Feedback Model Results

[Bar chart: Mean Average Precision (axis from 0.2 to 0.4) for BLOG LD and BLOG SD. Series: None, PRF, Wiki. PRF, Wiki. Link]

Page 48: Retrieval and Feedback Models for Blog Feed Search

Conclusion

• Feed Search Challenges:
– Feeds are topically diverse, noisy collections
– Ranked against ongoing & general information needs

• Novel Retrieval Models:
– Ranking collections, sensitive to topical relationship among entries

• Novel Feedback Models:
– Discover multiple query facets & robust to collection noise

Page 49: Retrieval and Feedback Models for Blog Feed Search

Thank You!

Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research

Page 50: Retrieval and Feedback Models for Blog Feed Search

Entry Centrality GM Derivation

where

Entry Generation Likelihood:

|E|

Page 51: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Examples

Query: [Music]

Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music

PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song

Page 52: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Examples

Query: [Scottish Independence]

Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party

PRF: scotland, independence, party, convention, politics, snp, national, people, scot

Page 53: Retrieval and Feedback Models for Blog Feed Search

Query Expansion Examples

Query: [Machine Learning]

Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network

PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew

Page 54: Retrieval and Feedback Models for Blog Feed Search

Query Generality Characteristics

• Query Length:
– BLOG: 1.9 words
– TB04: 3.2 words
– TB05: 3.0 words

• ODP Depth:
– BLOG: 4.7 levels
– TB04: 5.2 levels
– TB05: 5.3 levels

Page 55: Retrieval and Feedback Models for Blog Feed Search

Relevance Set Cohesiveness

Wikipedia

Relevance Set, Top R = 100

$\text{Cohesiveness} = \dfrac{|L_{in}|}{|L_{in} \cup L_{out}|}$
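The slide does not define the two link sets. The sketch below assumes L_in is the set of links from Relevance Set articles to other Relevance Set articles, and L_out is the set of links from Relevance Set articles to articles outside it. Under that reading the two sets are disjoint, and cohesiveness is the fraction of outgoing links that stay inside the set.

```python
def cohesiveness(relevance_set, links):
    """relevance_set: iterable of article ids.
    links: iterable of (source_article, target_article) pairs.
    Returns |L_in| / |L_in U L_out| under the assumed definitions."""
    rel = set(relevance_set)
    # All distinct links that start inside the Relevance Set (L_in U L_out).
    from_rel = {(s, t) for s, t in links if s in rel}
    # Links that also end inside the Relevance Set (L_in).
    l_in = {(s, t) for s, t in from_rel if t in rel}
    return len(l_in) / len(from_rel) if from_rel else 0.0


# Hypothetical example: articles A and B form the Relevance Set.
example_links = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "A")]
value = cohesiveness({"A", "B"}, example_links)
print(value)
```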

Page 56: Retrieval and Feedback Models for Blog Feed Search

Relevant Set Cohesiveness

Page 57: Retrieval and Feedback Models for Blog Feed Search

Is it the Queries?

Feed Search Queries ≠ TB Adhoc Queries

But none of these measures predict whether Wikipedia expansion helps…