Transcript of CIS 430, November 6, 2008, Emily Pitler

Named Entities: 1 or 2 words, ambiguous meaning, ambiguous...
Beitzel et al., SIGIR 2004: America Online, one week in December 2003
◦ Popular queries: 1.7 words on average
◦ Overall: 2.2 words on average
Lempel and Moran, WWW 2003: AltaVista, summer 2001
◦ 7,175,151 queries; 2,657,410 distinct queries
◦ 1,792,104 queries occurred only once (63.7%)
◦ Most popular query: 31,546 occurrences
Clarity score ~ low ambiguity (Cronen-Townsend et al., SIGIR 2002)
Compare two language models:
◦ one over the relevant documents for a query
◦ one over all possible documents
The more different these are, the clearer the query is
◦ “programming perl” vs. “the”
Query Language Model:

P(w|Q) = \sum_{D \in R} P(w|D) \, P(D|Q)

where R is the set of relevant documents retrieved for the query

Collection Language Model (unigram):

P(w|\mathrm{collection}) = \frac{C(w)}{C(\mathrm{all\ words})}

where C(w) is the number of times w occurs in the collection
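A minimal sketch of these two models in Python, assuming whitespace tokenization and pre-computed P(D|Q) weights for the retrieved set (both simplifications, not the paper's exact estimation procedure):

    from collections import Counter

    def collection_lm(documents):
        # Unigram model: P(w|collection) = C(w) / C(all words).
        counts = Counter(w for doc in documents for w in doc.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def query_lm(retrieved_docs, p_d_given_q):
        # P(w|Q) = sum over retrieved D of P(w|D) * P(D|Q).
        # p_d_given_q holds one (assumed, pre-computed) weight per document.
        model = Counter()
        for doc, weight in zip(retrieved_docs, p_d_given_q):
            words = doc.split()
            for w, c in Counter(words).items():
                model[w] += (c / len(words)) * weight
        return dict(model)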
Relative entropy between the two distributions
The extra cost in bits of coding with Q when the true distribution is P

D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} = \sum_i P(i) \log P(i) - \sum_i P(i) \log Q(i)

Recall entropy: H(P(x)) = -\sum_i P(i) \log P(i)
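A sketch of the divergence on top of the two models above; a real clarity-score implementation would smooth the collection model rather than skip unseen terms as this one does:

    import math

    def kl_divergence(p, q):
        # D_KL(P || Q) = sum_i P(i) * log2(P(i) / Q(i)), in bits.
        return sum(p_i * math.log2(p_i / q[i])
                   for i, p_i in p.items() if p_i > 0 and i in q)

    # Clarity score = KL between the query model and the collection model:
    # clarity = kl_divergence(query_lm(docs, weights), collection_lm(corpus))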
Navigational
◦ greyhound bus
◦ compaq
Informational
◦ San Francisco
◦ normocytic anemia
Transactional
◦ britney spears lyrics
◦ download adobe reader
Broder, SIGIR 2002
The more webpages point to you, the more important you are
The more important the webpages pointing to you, the more important you are
These intuitions led to PageRank
PageRank led to…
Page et al., 1998
Assume our surfer is on a page
In the next time step she can:
◦ Choose a link on the current page uniformly at random, or
◦ Go somewhere else in the web uniformly at random
After a long time, what is the probability she is on a given page?
Could also “get bored” with probability 1 - d and jump somewhere else in the web completely at random

P(v) = \frac{1-d}{N} + d \sum_{u \in B_v} \frac{P(u)}{\deg(u)}

where B_v is the set of pages linking to v, \deg(u) is the number of links out of u, and N is the total number of pages
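A power-iteration sketch of this formula, assuming d = 0.85 (a commonly used damping value; the slide does not fix one) and a graph with no dangling pages:

    def pagerank(links, d=0.85, iterations=50):
        # links maps each page to the list of pages it links to.
        # Iterates P(v) = (1 - d)/N + d * sum_{u in B_v} P(u)/deg(u).
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - d) / n for p in pages}
            for u, outlinks in links.items():
                for v in outlinks:
                    new_rank[v] += d * rank[u] / len(outlinks)
            rank = new_rank
        return rank

    # Toy three-page web: c's importance flows to a and b.
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))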
Google, obviously
Given objects and links between them, PageRank measures importance
Summarization (Erkan and Radev, 2004)
◦ Nodes = sentences, edges = thresholded cosine similarity
Research (Mimno and McCallum, 2007)
◦ Nodes = people, edges = citations
Facebook?
What OTHER webpages say about your webpage
Very good descriptions of what’s on a page
Example: a page links to www.cis.upenn.edu/~nenkova
“Ani Nenkova” is the anchor text for that page
10,000 documents; 10 of them are relevant
What happens if you decide to return absolutely nothing?
9,990 of the 10,000 decisions are correct: 99.9% accuracy
Standard metrics in Information Retrieval
Precision: Of what you return, how many are relevant?
Recall: Of what is relevant, how many do you return?

\mathrm{Precision} = \frac{|\mathrm{Retrieved} \cap \mathrm{Relevant}|}{|\mathrm{Retrieved}|}

\mathrm{Recall} = \frac{|\mathrm{Retrieved} \cap \mathrm{Relevant}|}{|\mathrm{Relevant}|}
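The two definitions as set arithmetic in Python, with made-up document IDs:

    def precision(retrieved, relevant):
        # Of what you return, how many are relevant?
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        # Of what is relevant, how many do you return?
        return len(retrieved & relevant) / len(relevant)

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7"}
    print(precision(retrieved, relevant))  # 2/4 = 0.5
    print(recall(retrieved, relevant))     # 2/3 ≈ 0.67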
Not always a clear-cut binary classification: relevant vs. not relevant
How do you measure recall over the whole web?
How many of the 2.7 billion results will get looked at? Which ones actually need to be good?
Very relevant > Somewhat relevant > Not relevant
Want most relevant documents to be ranked first
NDCG = DCG of the proposed ordering / DCG of the ideal ordering
Ranges from 0 to 1

\mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}
Proposed ordering, with relevance grades 4, 2, 0, 1:
◦ DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) = 6.5
Ideal ordering is 4, 2, 1, 0:
◦ IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 6.63
NDCG = 6.5/6.63 = .98
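A short script reproducing the arithmetic above (position 1 undiscounted, position i >= 2 discounted by log2(i), as in the slide's formula):

    import math

    def dcg(rels):
        # DCG_p = rel_1 + sum_{i=2}^{p} rel_i / log2(i)
        return rels[0] + sum(r / math.log2(i)
                             for i, r in enumerate(rels[1:], start=2))

    proposed = [4, 2, 0, 1]
    ideal = sorted(proposed, reverse=True)   # [4, 2, 1, 0]
    print(dcg(proposed))                     # 6.5
    print(dcg(ideal))                        # ≈ 6.63
    print(dcg(proposed) / dcg(ideal))        # NDCG ≈ 0.98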
Documents: hundreds of words. Queries: 1 or 2 often-ambiguous words
It would be much easier to compare documents to documents
How can we turn a query into a document?
Just find ONE relevant document, then use that to find more
New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs
Original query = “train”
Relevant:
◦ www.dog-obedience-training-review.com
Irrelevant:
◦ http://en.wikipedia.org/wiki/Caboose
New query = train + .3*dog - .2*railroad
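A sketch of this update over bag-of-words weight vectors (it resembles Rocchio's classic formula, which the slide does not name); the vector representation and the averaging over feedback documents are assumptions, and beta = 0.3, gamma = 0.2 echo the example's weights:

    from collections import Counter

    def expand_query(query_vec, relevant_vecs, irrelevant_vecs,
                     beta=0.3, gamma=0.2):
        # New query = original + beta * relevant terms - gamma * irrelevant terms
        new_q = Counter(query_vec)
        for vec in relevant_vecs:
            for term, w in vec.items():
                new_q[term] += beta * w / len(relevant_vecs)
        for vec in irrelevant_vecs:
            for term, w in vec.items():
                new_q[term] -= gamma * w / len(irrelevant_vecs)
        return {t: w for t, w in new_q.items() if w > 0}

    print(expand_query({"train": 1.0},
                       [{"dog": 1.0, "train": 0.5}],
                       [{"railroad": 1.0, "train": 0.5}]))
    # ≈ {'train': 1.05, 'dog': 0.3}  (railroad is pushed below zero and dropped)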
Explicit feedback
◦ Ask the user to mark relevant versus irrelevant
◦ Or, grade on a scale (like we saw for NDCG)
Implicit feedback
◦ Users see a list of top 10 results, click on a few
◦ Assume clicked-on pages were relevant, the rest weren't
Pseudo-relevance feedback
◦ Do the search, assume the top results are relevant, repeat
Have query logs for millions of users
“hybrid car” -> “toyota prius” is more likely than “hybrid car” -> “flights to LA”
Find statistically significant pairs of queries (Jones et al., WWW 2006) using:
H_1: P(q_2 \mid q_1) = P(q_2 \mid \neg q_1)

H_2: P(q_2 \mid q_1) \neq P(q_2 \mid \neg q_1)

LLR = -2 \log \frac{L(H_1)}{L(H_2)}
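A sketch of how the statistic could be computed from session counts with binomial log-likelihoods, in the style of Dunning's likelihood ratio test; the count-based interface is an assumption, not the paper's code:

    import math

    def log_l(k, n, p):
        # Binomial log-likelihood of k successes in n trials at rate p.
        p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    def llr(k12, n1, k2, n2):
        # k12 of n1 sessions with q1 are followed by q2;
        # k2 of n2 sessions without q1 are followed by q2.
        p = (k12 + k2) / (n1 + n2)          # shared rate under H1
        p1, p2 = k12 / n1, k2 / n2          # separate rates under H2
        return -2 * (log_l(k12, n1, p) + log_l(k2, n2, p)
                     - log_l(k12, n1, p1) - log_l(k2, n2, p2))

    # Large values mean q1 -> q2 is a statistically significant pair.
    print(llr(k12=30, n1=100, k2=50, n2=10000))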
A lot of ambiguity is removed by knowing who the searcher is
Lots of Fernando Pereiras
◦ I (Emily Pitler) only know one of them
Location matters
◦ “Thai restaurants” from me means “Thai restaurants Philadelphia, PA”
Mei and Church, WSDM 2008
H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26.00 = 1.17
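A sketch of estimating such quantities from a click log with plug-in entropies, using H(URL|Q) = H(URL, Q) - H(Q); the toy log is made up:

    import math
    from collections import Counter

    def entropy(counts):
        # Entropy in bits of the empirical distribution given by counts.
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    # Toy click log of (query, url) pairs; real logs have millions of rows.
    log = [("jaguar", "jaguar.com"), ("jaguar", "en.wikipedia.org/wiki/Jaguar"),
           ("jaguar", "jaguar.com"), ("prius", "toyota.com")]

    h_q = entropy(Counter(q for q, _ in log))
    h_url_q = entropy(Counter(log))
    print(h_url_q - h_q)   # H(URL | Q) = H(URL, Q) - H(Q)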
Descriptive searches: “pictures of mountains”
◦ I don't want a document containing the words {“picture”, “of”, “mountains”}; I want pictures
Link farms: trying to game PageRank
Spelling correction: a huge portion of queries are misspelled
Ambiguity