Discussion 7
Uploaded by adrian-percival-kim
Information Retrieval
Admin
● Project 3 due tomorrow at 9pm
○ Extra OH today and tomorrow (see Piazza)
● Midterm Wed. 2/24 7pm
● Review Session: Monday 22nd, 7-8:30 PM, COOL G906 (Cooley Lab - White auditorium in basement)
Q1: Consider the following short documents.
1) “vim is the only real editor”
2) “why do people use emacs?”
3) “vim emacs evil mode?”
4) “nano is the best editor”
We want to make these documents searchable
First, remove stopwords: {is, the}
Q1: The documents after stopword removal:
1) “vim only real editor”
2) “why do people use emacs?”
3) “vim emacs evil mode?”
4) “nano best editor”
Now, build term frequency vectors for each document
First, build the global dictionary: dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}
Then, build the tf vectors for each document
1) “vim only real editor” = <1,1,1,1,0,0,0,0,0,0,0,0,0>
2) “why do people use emacs?” = <0,0,0,0,1,1,1,1,1,0,0,0,0>
3) “vim emacs evil mode?” = <1,0,0,0,0,0,0,0,1,1,1,0,0>
4) “nano best editor” = <0,0,0,1,0,0,0,0,0,0,0,1,1>
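The steps so far (stopword removal, global dictionary, tf vectors) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes tokens are lowercased and punctuation such as "?" is dropped, which is what the dictionary on the slide implies.

```python
import re

STOPWORDS = {"is", "the"}  # the stopword set given on the slide

RAW_DOCS = [
    "vim is the only real editor",
    "why do people use emacs?",
    "vim emacs evil mode?",
    "nano is the best editor",
]

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_dictionary(token_lists):
    """Collect terms in order of first appearance across the collection."""
    dictionary = []
    seen = set()
    for tokens in token_lists:
        for term in tokens:
            if term not in seen:
                seen.add(term)
                dictionary.append(term)
    return dictionary

def tf_vector(tokens, dictionary):
    """Count how often each dictionary term occurs in one document."""
    return [tokens.count(term) for term in dictionary]

token_lists = [tokenize(doc) for doc in RAW_DOCS]
dictionary = build_dictionary(token_lists)
tf_vectors = [tf_vector(tokens, dictionary) for tokens in token_lists]

for doc_id, vec in enumerate(tf_vectors, start=1):
    print(doc_id, vec)
```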
Draw the inverted index for this document collection
vim: (1, 3)
only: (1)
real: (1)
editor: (1, 4)
why: (2)
do: (2)
people: (2)
use: (2)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
best: (4)
Keep the inverted index sorted by term!
best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
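A minimal sketch of building this sorted inverted index in Python. The function name `build_inverted_index` is invented for the example, and the input is assumed to be the already-cleaned documents (stopwords and punctuation removed).

```python
from collections import defaultdict

# Cleaned documents keyed by document ID, as on the slide.
DOCS = {
    1: "vim only real editor",
    2: "why do people use emacs",
    3: "vim emacs evil mode",
    4: "nano best editor",
}

def build_inverted_index(docs):
    """Map each term to the ascending list of document IDs containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):                # ascending IDs keep postings sorted
        for term in set(docs[doc_id].split()):  # set(): one posting per document
            postings[term].append(doc_id)
    # Sort by term so the index itself is in dictionary order.
    return dict(sorted(postings.items()))

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(term, doc_ids)
```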
Consider the query: “vim editor”:
vim: (1, 3)
editor: (1, 4)
Using the inverted index lets us answer this query quickly: we read only the postings lists for the two query terms instead of scanning every document.
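The slide does not say whether the query means "vim AND editor" or "vim OR editor", so the sketch below shows both as assumptions. The AND case uses the standard two-pointer merge over sorted postings lists; `intersect` is a name chosen for this example.

```python
def intersect(p1, p2):
    """Merge two ascending postings lists, keeping IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Only the two postings lists the query needs.
index = {"vim": [1, 3], "editor": [1, 4]}

and_result = intersect(index["vim"], index["editor"])           # documents with both terms
or_result = sorted(set(index["vim"]) | set(index["editor"]))    # documents with either term

print("AND:", and_result)
print("OR:", or_result)
```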
What is the IDF of vim? emacs? nano?
vim: log(4/2) = log(2)
emacs: log(4/2) = log(2)
nano: log(4/1) = log(4)
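The IDF values above are log(N / df), where N is the number of documents and df is the length of the term's postings list. A small sketch, assuming the natural logarithm (the slide leaves the base unspecified) and a hypothetical helper named `idf`:

```python
import math

N_DOCS = 4

# Postings lists for the three terms in the question (from the inverted index).
postings = {"vim": [1, 3], "emacs": [2, 3], "nano": [4]}

def idf(term, postings, n_docs):
    """Inverse document frequency: log(N / number of documents containing term)."""
    return math.log(n_docs / len(postings[term]))

for term in ("vim", "emacs", "nano"):
    print(term, idf(term, postings, N_DOCS))
```

Whatever base is chosen, nano (in one document) scores twice as high as vim or emacs (in two documents each), since log(4) = 2 log(2).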
What is the tf-idf vector of document 1?
tf: <1,1,1,1,0,0,0,0,0,0,0,0,0>
idf: vim=log(4/2), only=log(4/1), real=log(4/1), editor=log(4/2)
tf-idf: <log(2), log(4), log(4), log(2), 0,0,0,0,0,0,0,0,0>
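As a check on the vector above, here is a hedged sketch that multiplies each tf entry by the term's IDF. The natural logarithm and the helper name `tf_idf_vector` are assumptions for this example; the dictionary order is the one from the slides.

```python
import math

DICTIONARY = ["vim", "only", "real", "editor", "why", "do", "people",
              "use", "emacs", "evil", "mode", "nano", "best"]

DOCS = [
    ["vim", "only", "real", "editor"],
    ["why", "do", "people", "use", "emacs"],
    ["vim", "emacs", "evil", "mode"],
    ["nano", "best", "editor"],
]

def tf_idf_vector(doc_tokens, docs, dictionary):
    """Return tf * log(N / df) for every dictionary term, 0 when tf is 0."""
    n_docs = len(docs)
    vector = []
    for term in dictionary:
        tf = doc_tokens.count(term)
        if tf == 0:
            vector.append(0.0)
            continue
        df = sum(1 for tokens in docs if term in tokens)
        vector.append(tf * math.log(n_docs / df))
    return vector

doc1_vector = tf_idf_vector(DOCS[0], DOCS, DICTIONARY)
print(doc1_vector)
```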
Q2: Kendall’s Tau
A startup search engine is trying to compare its query results to Google’s.
Google’s results: ABCXYZ
Startup’s results: ABZYXC
They claim their results are comparable. What do you think?
Number of matching (concordant) pairs, C: 9
Number of non-matching (discordant) pairs, D: 6
(9-6)/((1/2)(6)(5)) = 3/15 = 0.2
OR: (C-D)/(C+D) = (9-6)/(9+6) = 3/15 = 0.2
What if the startup was able to change their results to: ABCZYX
Number of matching (concordant) pairs, C: 12
Number of non-matching (discordant) pairs, D: 3
(12-3)/((1/2)(6)(5)) = 9/15 = 0.6
OR: (C-D)/(C+D) = (12-3)/(12+3) = 9/15 = 0.6
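Both computations can be reproduced by counting pairs directly. The sketch below assumes the two rankings contain the same items with no ties; `kendall_tau` is a name chosen for this example, not a library function.

```python
from itertools import combinations

def kendall_tau(reference, candidate):
    """Return (concordant, discordant, tau) for two rankings of the same items."""
    position = {item: rank for rank, item in enumerate(candidate)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) with a ranked before b in the reference.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1   # candidate agrees on the order of this pair
        else:
            discordant += 1   # candidate reverses this pair
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

print(kendall_tau("ABCXYZ", "ABZYXC"))  # the startup's original results
print(kendall_tau("ABCXYZ", "ABCZYX"))  # the startup's revised results
```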
Q3: Precision and Recall Question
Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.
What is the precision and recall of this search engine if it emits only its top-10 results?
What are precision and recall again?
Precision: the fraction of returned results that are relevant.
Recall: the fraction of all relevant documents that are returned.
https://en.wikipedia.org/wiki/Precision_and_recall
Top-10 results: Precision = 6/10 = .6, Recall = 6/100 = .06
What is the precision and recall of this search engine if it emits only its top-5 results?
Top-5 results: Precision = 4/5 = .8, Recall = 4/100 = .04
Q4: HITS
What does HITS stand for?
Hyperlink-Induced Topic Search
How does the algorithm work?
It starts from the user's query: the top results for the query form the root set. The base set is the root set plus pages that link to, or are linked from, root-set pages. On this focused subgraph, run the iterative algorithm to compute hub and authority scores for every page.
Q4: HITS
What are hubs and authorities?
Hubs are central repositories - they have links to good authorities
Authorities are the sources of information - they are linked to by good hubs
Q4: HITS
How does HITS differ from PageRank?
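HITS is query-dependent: scores are computed on the subgraph built from the user’s query, whereas PageRank is computed once over the whole link graph.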
Each node maintains two scores - hub and auth
Each round requires an explicit normalization step
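The hub and authority updates, with the per-round normalization, can be sketched as follows. The tiny graph and page names are invented for illustration, and the choice of L2 normalization is an assumption (the slides only say that a normalization step is required).

```python
import math

def hits(out_links, iterations=20):
    """Iteratively compute hub and authority scores on a small link graph."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to the page.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of the pages the page links to.
        hub = {n: sum(auth[dst] for dst in out_links.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Hypothetical base set: three hub-like pages pointing at two authority-like pages.
GRAPH = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}

hub_scores, auth_scores = hits(GRAPH)
print("hubs:", hub_scores)
print("authorities:", auth_scores)
```

Without the normalization step the scores would grow without bound from round to round, which is why each round rescales them.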
Q5: PageRank
What does the .85 value for d represent?
d is the damping factor: the probability that the random surfer follows a link on the current page. So 85% of the time the user clicks a link, and 15% of the time they jump to a new page (for example by typing in a URL).
If we assume most internet users are mobile, should we raise or lower the value of d?
Probably raise. Mobile users are more likely to follow links, and less likely to navigate to new pages (because navigating to new pages requires typing in a URL)
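For reference, the role of d can be written as the usual PageRank update. The symbols N (number of pages) and L(q) (number of outgoing links on page q) are notation introduced here, not taken from the slides; with d = .85 and N = 5 the first term is .15/5, matching the worked example in this discussion.

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}
```

Raising d puts more weight on the link-following sum and less on the random-jump term.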
Q5: PageRank
What are some issues with PageRank as a metric of page quality?
- Link farms, spam bots, etc. can skew rankings
- Links may not be meant as an “endorsement”, e.g. social media shares
- AJAX and JavaScript can make traditional surfing difficult
- Content can be behind a login - the Facebook feed is not searchable
Q5: PageRank: An example
Initial scores:
A = .2
B = .2
C = .2
D = .2
E = .2
[Figure: directed link graph over pages A, B, C, D, E]
Assume d=.85
Iteration 1 (every page starts at .2):
A = .15/5 + .85*(.2/2) = .115
B = .15/5 + .85*(.2/2 + .2) = .285
C = .15/5 + .85*(.2/2) = .115
D = .15/5 + .85*(.2/2+.2/2) = .2
E = .15/5 + .85*(.2 + .2/2) = .285
Iteration 2 (previous scores: A = .115, B = .285, C = .115, D = .2, E = .285):
A = (.15/5) + (.85)(.285/2) = .151
B = (.15/5) + (.85)(.115/2 + .2) = .249
C = (.15/5) + (.85)(.115/2) = .079
D = (.15/5) + (.85)(.115/2 + .285/2) = .2
E = (.15/5) + (.85)(.285 + .115/2) = .321
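The two iterations can be reproduced with a few lines of Python. The original link graph was a figure that did not survive in this transcript, so the edge list below is an assumption: it is one set of links consistent with the arithmetic shown above (the out-links of A and C in particular cannot be pinned down from the numbers alone). The function name `pagerank_step` is also made up for this sketch.

```python
def pagerank_step(scores, out_links, d=0.85):
    """One synchronous PageRank update: (1 - d)/N + d * sum(PR(q) / outdegree(q))."""
    n = len(scores)
    new_scores = {}
    for page in scores:
        incoming = sum(
            scores[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        new_scores[page] = (1 - d) / n + d * incoming
    return new_scores

# ASSUMED edges, chosen to match the slide's arithmetic; not the original figure.
OUT_LINKS = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["B"],
    "E": ["A", "D"],
}

initial = {page: 0.2 for page in "ABCDE"}
first = pagerank_step(initial, OUT_LINKS)
second = pagerank_step(first, OUT_LINKS)

for page in "ABCDE":
    print(page, round(first[page], 3), round(second[page], 3))
```

Every page here has at least one out-link, so the scores keep summing to 1 after each step.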