SIGIR 2008 Singapore Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell LTI/SCS/CMU...
SIGIR 2008
Singapore
Jonathan Elsas, Jaime Arguello,
Jamie Callan & Jaime Carbonell
LTI/SCS/CMU
Retrieval and Feedback Models for Blog Feed Search
Outline
• The task
– Overview of Blogs & Blog Search
– Challenges in Blog Search
• Our approach
– Retrieval Models
– Query Expansion Models
• Conclusion
Background
What is a Blog?
What is a Feed?
<xml>
<feed>
<entry>
<author>Peter …</author>
<title>Good, Evil…</title>
<content>I’ve said…</content>
</entry>
<entry>
<author>Peter …</author>
<title>Agreeing…</title>
<content>Some peo…</content>
</entry>
…
Blog-Feed Correspondence
Blog ↔ Feed
Post ↔ Entry
HTML ↔ XML
Why are Blogs important?
Technorati is currently tracking:
• > 112.8 million blogs
• > 175,000 new blogs per day
• > 1.6 million posts per day
[http://www.technorati.com/about/]
The Task
Feed Search at TREC
Ranking Blogs/Feeds (collections of posts) in response to a user’s query, [X]
“A relevant feed should have a principle and recurring interest in X”
— TREC 2007 Blog Track
(a.k.a. Blog Distillation)
Feed Search at TREC
[Gardening]
[Apple iPod]
[Violence in Sudan]
[Gun Control]
[Food]
[Wine]
Represent ongoing information needs
Frequently very general
Challenges in Feed Search
1. A feed is a collection of documents
– How does relevance at the entry level correspond to relevance at the feed level?
[Figure: the entries of a feed laid out over time]
Challenges in Feed Search
2. Even a topical feed is topically diverse
– Can we favor entries close to the central topic of the feed?
[Figure: entries of a "Space Exploration" feed plotted by time and topic: NASA, China’s plans for the moon, shuttle launch, Mars rover, Boeing, My dog]
Challenges in Feed Search
3. Feeds are noisy
– Spam blogs, spam & off-topic comments
Challenges in Feed Search
4. General & Ongoing Information Needs
[Mac] … post regularly about new products, features, or application software of Apple Mac computers.
[Music] … describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.
[Food] … describing experiences eating cuisines, culinary delights, recipes, nutrition plans.
[Wine] … such as tastings, reviews, food matching or pairing, and oenophile news and events.
Our Approach
Challenges → Our Approach
• Feeds: topically diverse, noisy collections → Retrieval Models
• Information needs: general & ongoing → Feedback Models
Retrieval Models
• Challenge: ranking topically diverse collections
• Representation: feed vs. entry
• Model topical relationship between entries
Large Document (Feed) Model
[Figure: the entries of each feed are concatenated into a single document; these documents form the Feed Document Collection, which is searched with the query Q to produce Ranked Feeds]
Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
Large Document (Feed) Model
Advantages:
• A straightforward application of existing retrieval techniques
Potential Pitfalls:
• Large entries dominate a feed’s language model
• Ignores relationship among entries
[Figure: a feed and its entries]
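The large document idea can be illustrated with a minimal sketch. It assumes a plain unigram query-likelihood model with Dirichlet smoothing as a stand-in for Indri's retrieval model; the function names, the toy feeds, and the value of `mu` are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def dirichlet_query_log_likelihood(query_terms, doc_terms, collection_counts,
                                   collection_length, mu=2500.0):
    """log P(Q | D) under a unigram language model with Dirichlet smoothing."""
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability; a tiny floor avoids log(0) for unseen terms.
        p_collection = max(collection_counts.get(term, 0) / collection_length, 1e-12)
        score += math.log((doc_counts[term] + mu * p_collection) / (doc_length + mu))
    return score


def rank_feeds_large_document(query_terms, feeds, mu=2500.0):
    """feeds maps a feed id to a list of entries; each entry is a list of tokens."""
    # Concatenate every entry of a feed into one large pseudo-document.
    feed_docs = {fid: [tok for entry in entries for tok in entry]
                 for fid, entries in feeds.items()}
    collection_counts = Counter(tok for doc in feed_docs.values() for tok in doc)
    collection_length = sum(collection_counts.values())
    scores = {fid: dirichlet_query_log_likelihood(query_terms, doc, collection_counts,
                                                  collection_length, mu)
              for fid, doc in feed_docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical).
feeds = {
    "space_blog": [["nasa", "shuttle", "launch"], ["mars", "rover", "nasa"]],
    "food_blog": [["wine", "tasting", "notes"], ["recipe", "pasta", "wine"]],
}
ranking = rank_feeds_large_document(["nasa"], feeds, mu=10.0)
```

Because all entries are pooled before scoring, a single very long entry contributes most of the term counts, which is the first pitfall listed above.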
Small Document (Entry) Model
[Figure: each entry is indexed as its own document (document = entry) in the Entry Document Collection; the query Q produces Ranked Entries, and a rank aggregation function turns them into Ranked Feeds]
Rank by:
• Query Likelihood
• Entry Centrality
• Feed Prior: favors longer feeds
ReDDE Federated Search Algorithm [Si & Callan, 2003]
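The three components named above can be combined in more than one way. The sketch below assumes one plausible form: the feed score is the feed prior times the centrality-weighted sum of entry query likelihoods. The uniform centrality, the `log(1 + number of entries)` prior, and all names are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def entry_query_likelihood(query_terms, entry, collection_counts,
                           collection_length, mu):
    """P(Q | E) under a Dirichlet-smoothed unigram model of a single entry."""
    counts = Counter(entry)
    likelihood = 1.0
    for term in query_terms:
        p_collection = max(collection_counts.get(term, 0) / collection_length, 1e-12)
        likelihood *= (counts[term] + mu * p_collection) / (len(entry) + mu)
    return likelihood


def rank_feeds_small_document(query_terms, feeds, use_log_prior=False, mu=2500.0):
    """Assumed form: score(F) = prior(F) * sum over E in F of P(Q | E) * P(E | F)."""
    collection_counts = Counter(tok for entries in feeds.values()
                                for entry in entries for tok in entry)
    collection_length = sum(collection_counts.values())
    scores = {}
    for feed_id, entries in feeds.items():
        centrality = 1.0 / len(entries)  # uniform entry centrality (assumption)
        evidence = sum(
            centrality * entry_query_likelihood(query_terms, entry, collection_counts,
                                                collection_length, mu)
            for entry in entries)
        # Hypothetical prior that grows with the number of entries in the feed.
        prior = math.log(1 + len(entries)) if use_log_prior else 1.0
        scores[feed_id] = prior * evidence
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical).
feeds = {
    "space_blog": [["nasa", "shuttle", "launch"], ["mars", "rover", "nasa"]],
    "food_blog": [["wine", "tasting", "notes"], ["recipe", "pasta", "wine"]],
}
ranking = rank_feeds_small_document(["nasa"], feeds, mu=10.0)
```

Scoring each entry separately means every entry is smoothed against its own length, so long entries no longer dominate the feed's score.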
Entry Centrality
• Uniform
• Geometric Mean
[Figure: entries plotted by time and topic]
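A minimal sketch of the two centrality options follows. The geometric-mean variant is an assumption about what the label means here: each entry is weighted by the geometric mean of its terms' probabilities under a maximum-likelihood feed language model (the entry generation likelihood raised to the power 1/|E|), then normalized within the feed. Function names and toy data are hypothetical.

```python
import math
from collections import Counter


def uniform_centrality(entries):
    """Every entry in the feed gets the same weight."""
    return [1.0 / len(entries)] * len(entries)


def geometric_mean_centrality(entries):
    """Weight entries by how typical their terms are of the whole feed."""
    feed_counts = Counter(tok for entry in entries for tok in entry)
    feed_length = sum(feed_counts.values())
    raw = []
    for entry in entries:
        if not entry:
            raw.append(0.0)
            continue
        # log P(E | feed model); every entry term occurs in the feed, so no log(0).
        log_likelihood = sum(math.log(feed_counts[tok] / feed_length) for tok in entry)
        # The |E|-th root turns the likelihood into a per-term geometric mean.
        raw.append(math.exp(log_likelihood / len(entry)))
    total = sum(raw)
    if total == 0.0:
        return uniform_centrality(entries)
    return [value / total for value in raw]


# Toy feed (hypothetical): two on-topic entries and one off-topic entry.
entries = [
    ["nasa", "shuttle", "nasa"],
    ["nasa", "mars", "rover"],
    ["my", "dog", "barks"],
]
weights = geometric_mean_centrality(entries)
```

In this toy feed the off-topic "my dog" entry receives the lowest weight, which matches the goal of favoring entries close to the feed's central topic.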
Small Document (Entry) Model
Advantages:
• Controls for differing entry length
• Models topical relationship among entries
Disadvantages:
• Centrality computation is slow(er)
Not only improves speed, but also performance
Retrieval Model Results
• 45 Queries from the TREC 2007 Blog Distillation Task
• BLOG06 test collection, XML feeds only
• 5-Fold Cross Validation for all retrieval model smoothing parameters
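The results that follow are reported as Mean Average Precision. As a reminder of how that metric is computed, here is a minimal sketch using the standard definition; the function names and toy runs are hypothetical, and it is not the evaluation tool used for the reported numbers.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs, qrels):
    """runs: query -> ranked list of feed ids; qrels: query -> set of relevant ids."""
    return sum(average_precision(runs.get(q, []), qrels[q]) for q in qrels) / len(qrels)


# Toy example (hypothetical).
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```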
Retrieval Model Results
[Figure: bar chart of Mean Average Precision (axis from 0.245 to 0.325) comparing the Large Document (Feed) Model with the Small Document (Entry) Models. Bar values: 0.29, 0.277, 0.29, 0.298, 0.315. Category labels: Uniform, Log(Feed Length), Uniform, Log; one bar is annotated "Map 0.188" and one "n/a".]
Feedback Models
• Challenge: Noisy collection with general & ongoing information needs
• Use a cleaner external collection for query expansion (Wikipedia)
• With an expansion technique designed to identify multiple query facets
Query Expansion (PRF)
[Figure: the query Q is run against the BLOG06 Collection; related terms from the top K documents are added to form Q + Terms]
[Lavrenko & Croft, 2001]
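The pseudo-relevance feedback step can be sketched as follows. This is a simplified relevance-model-style estimate that weights each term by the sum over the top K documents of P(w | D) times the normalized query likelihood of D. It is an illustration under those assumptions, not the exact estimator of the cited work, and the names and toy documents are hypothetical.

```python
from collections import defaultdict, Counter


def prf_expansion_terms(ranked_docs, k=10, num_terms=5):
    """ranked_docs: list of (tokens, query_likelihood) pairs in retrieval order."""
    top_docs = ranked_docs[:k]
    normalizer = sum(likelihood for _, likelihood in top_docs)
    term_weights = defaultdict(float)
    for tokens, likelihood in top_docs:
        doc_weight = likelihood / normalizer  # how strongly this document counts
        counts = Counter(tokens)
        for term, count in counts.items():
            # Maximum-likelihood P(w | D), weighted by the document's relevance.
            term_weights[term] += doc_weight * count / len(tokens)
    ranked_terms = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked_terms[:num_terms]


# Toy retrieval result (hypothetical): tokens plus the query likelihood of each doc.
ranked_docs = [
    (["photography", "camera", "lens", "camera"], 0.5),
    (["photography", "film", "camera"], 0.3),
    (["wine", "tasting"], 0.01),
]
terms = prf_expansion_terms(ranked_docs, k=2, num_terms=3)
```

Because the terms come straight from the top-ranked documents, any spam among those documents leaks into the expansion.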
Query Expansion Example
[Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Feedback Model Results
[Figure: bar chart of Mean Average Precision (axis from 0.2 to 0.36) for BLOG LD and BLOG SD; series: None, PRF]
Query Expansion (Wikipedia PRF)
[Figure: the query Q is run against Wikipedia; related terms from the top K documents are added to form Q + Terms, which is then run against the BLOG06 Collection]
[Lavrenko & Croft, 2001]
[Diaz & Metzler, 2006]
Query Expansion Example
[Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic
Feedback Model Results
[Figure: bar chart of Mean Average Precision (axis from 0.2 to 0.36) for BLOG LD and BLOG SD; series: None, PRF, Wiki. PRF]
Query Expansion (Wikipedia Link)
[Figure: the query Q is run against Wikipedia; related terms from the link structure are added to form Q + Terms, which is then run against the BLOG06 Collection]
Wikipedia Link-Based Query Expansion
[Figure: the query Q is run against Wikipedia and the ranked articles are split into two nested sets]
• Relevance Set: top R = 100 articles
• Working Set: top W = 1000 articles
• Extract anchor text from links in the Working Set that point to the Relevance Set.
Combines relevance and popularity:
• Relevance: an anchor phrase that links to a highly ranked article gets a high score
• Popularity: an anchor phrase that links many times to mid-ranked articles also gets a high score
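One way to realize the relevance and popularity trade-off described above is to let every qualifying link add a rank-based weight to its anchor phrase. In the sketch below the weight `R - rank(target)` is an assumption chosen for illustration, as are the function name, the link-tuple format, and the toy data.

```python
from collections import defaultdict


def link_based_expansion(ranked_articles, links, relevance_size=100,
                         working_size=1000, num_terms=20):
    """ranked_articles: article ids in retrieval order for the query.
    links: iterable of (source_article, anchor_text, target_article) tuples."""
    working = ranked_articles[:working_size]
    rank = {article: position for position, article in enumerate(working)}
    working_set = set(working)
    relevance_set = set(ranked_articles[:relevance_size])
    scores = defaultdict(float)
    for source, anchor, target in links:
        # Only links that start in the working set and end in the relevance set count.
        if source in working_set and target in relevance_set:
            # Assumed weight: a higher-ranked target is worth more (relevance),
            # and every additional link adds to the total (popularity).
            scores[anchor.lower()] += relevance_size - rank[target]
    ranked_phrases = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked_phrases[:num_terms]


# Toy data (hypothetical).
ranked_articles = ["Photography", "Digital photography", "Camera", "Film", "Painting"]
links = [
    ("Camera", "photography", "Photography"),
    ("Film", "photography", "Photography"),
    ("Painting", "digital photography", "Digital photography"),
    ("Camera", "film", "Film"),
    ("Painting", "art", "Art"),
]
phrases = link_based_expansion(ranked_articles, links, relevance_size=2, working_size=5)
```

The expansion candidates are whole anchor phrases rather than single words, which is why multi-word terms such as "digital photography" can appear in the output.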
Query Expansion Example
[Photography]
Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography
PRF: photography, nude, erotic, art, girl, free, teen, fashion, women
Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism
Feedback Model Results
[Figure: bar chart of Mean Average Precision (axis from 0.2 to 0.4) for BLOG LD and BLOG SD; series: None, PRF, Wiki. PRF, Wiki. Link]
Conclusion
• Feed Search Challenges:
– Feeds are topically diverse, noisy collections
– Ranked against ongoing & general information needs
• Novel Retrieval Models:
– Ranking collections, sensitive to topical relationship among entries
• Novel Feedback Models:
– Discover multiple query facets & robust to collection noise
Thank You!
Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research
Entry Centrality GM Derivation
[Equations: geometric mean entry centrality, expressed through the Entry Generation Likelihood and the entry length |E|]
Query Expansion Examples
[Music]
Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music
PRF: Music, Country, Download, Free, MP3, Mp3, and, more, Lyric, Listen, Song
Query Expansion Examples
[Scottish Independence]
Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party
PRF: scotland, independence, party, convention, politics, snp, national, people, scot
Query Expansion Examples
[Machine Learning]
Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network
PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew
Query Generality Characteristics
• Query Length:
– BLOG: 1.9 words
– TB04: 3.2 words
– TB05: 3.0 words
• ODP Depth:
– BLOG: 4.7 levels
– TB04: 5.2 levels
– TB05: 5.3 levels
Relevance Set Cohesiveness
[Figure: ranked Wikipedia articles with the Relevance Set (top R = 100) marked]
$$\text{Cohesiveness} = \frac{|L_{in}|}{|L_{in} \cup L_{out}|}$$
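The slide does not define the two link sets, so the sketch below assumes that L_in holds the links from Relevance Set articles to other articles inside the set and that L_out holds their links to articles outside it. Under that assumption the ratio is the share of the set's outgoing links that stay inside the set. The function name and toy data are hypothetical.

```python
def relevance_set_cohesiveness(relevance_set, links):
    """links: iterable of (source_article, target_article) pairs."""
    links_in = set()
    links_out = set()
    for source, target in links:
        if source not in relevance_set:
            continue  # only links leaving a relevance-set article are considered
        if target in relevance_set:
            links_in.add((source, target))
        else:
            links_out.add((source, target))
    total = len(links_in | links_out)
    return len(links_in) / total if total else 0.0


# Toy data (hypothetical).
relevance_set = {"A", "B", "C"}
links = [("A", "B"), ("B", "C"), ("C", "X"), ("A", "Y"), ("X", "A")]
cohesiveness = relevance_set_cohesiveness(relevance_set, links)
```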
Is it the Queries?
Feed Search Queries ≠ TB Adhoc Queries
But none of these measures predict whether Wikipedia expansion helps…