Discussion 7

Information Retrieval

Admin

● Project 3 due tomorrow at 9pm

○ Extra OH today and tomorrow (see Piazza)

● Midterm Wed. 2/24 7pm

● Review Session: Monday 22nd, 7-8:30 PM, COOL G906 (Cooley Lab - White auditorium in basement)

Q1: Consider the following short documents.

1) “vim is the only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano is the best editor”

We want to make these documents searchable

First, remove stopwords: {is, the}

After removing stopwords (documents 2 and 3 contain none, so they are unchanged):

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Now, build term frequency vectors for each document





First, build the global dictionary: dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}

Then, build the tf vectors for each document

1) “vim only real editor” = <1,1,1,1,0,0,0,0,0,0,0,0,0>

2) “why do people use emacs?” = <0,0,0,0,1,1,1,1,1,0,0,0,0>

3) “vim emacs evil mode?” = <1,0,0,0,0,0,0,0,1,1,1,0,0>

4) “nano best editor” = <0,0,0,1,0,0,0,0,0,0,0,1,1>
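The steps above (stopword removal, dictionary, tf vectors) can be sketched in Python. This is a minimal sketch that assumes lowercasing, whitespace tokenization, and stripping of surrounding punctuation, none of which the slides specify; the function names are illustrative.

```python
import string

STOPWORDS = {"is", "the"}

DOCS = [
    "vim is the only real editor",
    "why do people use emacs?",
    "vim emacs evil mode?",
    "nano is the best editor",
]

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w and w not in STOPWORDS]

def build_dictionary(docs):
    """Collect terms in order of first appearance across the collection."""
    dictionary = []
    for doc in docs:
        for term in tokenize(doc):
            if term not in dictionary:
                dictionary.append(term)
    return dictionary

def tf_vector(doc, dictionary):
    """Count how often each dictionary term occurs in the document."""
    tokens = tokenize(doc)
    return [tokens.count(term) for term in dictionary]

dictionary = build_dictionary(DOCS)
tf_vectors = [tf_vector(doc, dictionary) for doc in DOCS]
```

Building the dictionary in first-appearance order reproduces the term order used on the slide, so the vectors line up position by position.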






Draw the inverted index for this document collection





vim: (1, 3)
only: (1)
real: (1)
editor: (1, 4)
why: (2)
do: (2)
people: (2)
use: (2)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
best: (4)





Draw the inverted index for this document collection -> Keep this sorted!

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
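One possible way to build the sorted index in Python is shown here. The sketch assumes the documents are already tokenized with stopwords removed, and `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict

# Tokenized documents keyed by document id (stopwords already removed).
DOCS = {
    1: ["vim", "only", "real", "editor"],
    2: ["why", "do", "people", "use", "emacs"],
    3: ["vim", "emacs", "evil", "mode"],
    4: ["nano", "best", "editor"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sort the terms and each posting list so lookups and merges stay cheap.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(f"{term}: ({', '.join(map(str, doc_ids))})")
```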





Consider the query: “vim editor”:

vim: (1, 3)

editor: (1, 4)

Using the inverted index lets us answer this query quickly
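The slide does not say whether the query is an AND or an OR of its terms, so the sketch below shows both as an assumption. The AND case uses the standard linear merge of two sorted posting lists, which is one reason to keep postings sorted.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Posting lists for the two query terms, copied from the index above.
index = {"vim": [1, 3], "editor": [1, 4]}

and_hits = intersect(index["vim"], index["editor"])          # docs with both terms
or_hits = sorted(set(index["vim"]) | set(index["editor"]))   # docs with either term
```

Only the two posting lists are touched; the documents themselves are never scanned.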






What is the IDF of vim? emacs? nano?







vim: log(4/2) = log(2)

emacs: log(4/2) = log(2)

nano: log(4/1) = log(4)
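The same calculation as a small Python sketch, using idf = log(N / df) as on the slide. The slide does not state the log base; the natural log is assumed here.

```python
import math

N_DOCS = 4

# Posting lists for the three terms in the question.
INDEX = {"vim": [1, 3], "emacs": [2, 3], "nano": [4]}

def idf(term, index, n_docs):
    """log(N / df), where df is the number of documents containing the term."""
    return math.log(n_docs / len(index[term]))

idf_values = {term: idf(term, INDEX, N_DOCS) for term in INDEX}
```

The rarer term (nano, in one document) gets the larger weight.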






What is the tf-idf vector of document 1?






tf: <1,1,1,1,0,0,0,0,0,0,0,0,0>

idf: vim=log(4/2), only=log(4/1), real=log(4/1), editor=log(4/2)

tf-idf: <log(2), log(4), log(4), log(2), 0,0,0,0,0,0,0,0,0>
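A sketch of the element-wise product in Python, assuming tf-idf = tf × log(N / df) with the natural log (the slides leave the base open). The document frequencies are read off the inverted index; the result has one entry per dictionary term, 13 in total.

```python
import math

DICTIONARY = ["vim", "only", "real", "editor", "why", "do", "people",
              "use", "emacs", "evil", "mode", "nano", "best"]

# Number of documents containing each term, taken from the inverted index.
DOC_FREQ = {"vim": 2, "only": 1, "real": 1, "editor": 2, "why": 1, "do": 1,
            "people": 1, "use": 1, "emacs": 2, "evil": 1, "mode": 1,
            "nano": 1, "best": 1}

N_DOCS = 4
TF_DOC1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

def tf_idf(tf, dictionary, doc_freq, n_docs):
    """Multiply each term count by that term's idf."""
    return [count * math.log(n_docs / doc_freq[term])
            for count, term in zip(tf, dictionary)]

tfidf_doc1 = tf_idf(TF_DOC1, DICTIONARY, DOC_FREQ, N_DOCS)
```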

Q2: Kendall’s Tau

A startup search engine is trying to compare its query results to Google’s.

Google’s results: ABCXYZ

Startup’s results: ABZYXC

They claim their results are comparable. What do you think?



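Number of matching (concordant) pairs, C: 9

Number of non-matching (discordant) pairs, D: 6

(9-6)/((1/2)(6)(5)) = 3/15 = 0.2

OR: (C-D)/(C+D) = (9-6)/(9+6) = 3/15 = 0.2

A tau of 0.2 is far from 1 (identical ordering), so the two rankings agree only weakly.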



What if the startup was able to change their results to: ABCZYX



Number of matching (concordant) pairs, C: 12

Number of non-matching (discordant) pairs, D: 3

(12-3)/((1/2)(6)(5)) = 9/15 = 0.6

OR: (C-D)/(C+D) = (12-3)/(12+3) = 9/15 = 0.6
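The pair counting can be checked with a short Python sketch. It assumes both rankings contain exactly the same items with no ties; `kendall_tau` is an illustrative name.

```python
from itertools import combinations

def kendall_tau(reference, candidate):
    """Return (concordant, discordant, tau) for two rankings of the same items."""
    position = {item: rank for rank, item in enumerate(candidate)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) with a ranked before b in the reference.
    for a, b in combinations(reference, 2):
        if position[a] < position[b]:
            concordant += 1   # candidate orders the pair the same way
        else:
            discordant += 1   # candidate reverses the pair
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

first_attempt = kendall_tau("ABCXYZ", "ABZYXC")
second_attempt = kendall_tau("ABCXYZ", "ABCZYX")
```

With 6 items there are (1/2)(6)(5) = 15 pairs, so C + D is always 15 and both forms of the formula give the same value.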

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What is the precision and recall of this search engine if it emits only its top-10 results?

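What are precision and recall again?

Precision = (relevant results returned) / (all results returned)

Recall = (relevant results returned) / (all relevant results in the universe)

https://en.wikipedia.org/wiki/Precision_and_recall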




What is the precision and recall of this search engine if it emits only its top-10 results?

Precision = 6/10 = .6, Recall = 6/100 = .06

What is the precision and recall of this search engine if it emits only its top-5 results?

Precision = 4/5 = .8, Recall = 4/100 = .04
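Both answers follow from counting the relevant positions inside the cutoff. A minimal Python sketch, with an illustrative function name:

```python
# Rank positions (1-based) that hold a relevant URL, from the question.
RELEVANT_POSITIONS = {1, 3, 4, 5, 6, 8}
TOTAL_RELEVANT = 100  # relevant URLs in the whole universe

def precision_recall_at_k(k, relevant_positions, total_relevant):
    """Precision and recall when only the top-k results are emitted."""
    hits = sum(1 for pos in relevant_positions if pos <= k)
    precision = hits / k             # share of emitted results that are relevant
    recall = hits / total_relevant   # share of all relevant URLs that were emitted
    return precision, recall

top10 = precision_recall_at_k(10, RELEVANT_POSITIONS, TOTAL_RELEVANT)
top5 = precision_recall_at_k(5, RELEVANT_POSITIONS, TOTAL_RELEVANT)
```

Cutting the list from 10 to 5 raises precision and lowers recall, the usual trade-off between the two.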

Q4: HITS

What does HITS stand for?

Hyperlink-Induced Topic Search

How does the algorithm work?

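It starts with the user's query to build the root set (the top pages a text search returns for that query). It then expands the root set into the base set by adding pages that link to, or are linked from, root-set pages. On this focused subgraph, run the iterative algorithm to compute a hub score and an authority score for each page.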

What are hubs and authorities?

Hubs are central repositories - they have links to good authorities

Authorities are the sources of information - they are linked to by good hubs

How does HITS differ from PageRank?

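HITS is query-dependent: it runs at query time on the subgraph built from the user’s query, while PageRank is computed over the whole graph independent of any query.

Each node maintains two scores (hub and auth) instead of one.

Each round requires an explicit normalization step.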
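The hub/authority update and the per-round normalization can be sketched in Python. This is an illustration under stated assumptions rather than a reference implementation: the four-page graph is hypothetical, the base set is taken as given, scores are normalized to unit Euclidean length, and the iteration count is arbitrary.

```python
import math

def hits(graph, iterations=20):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages the node links to.
        hub = {n: sum(auth[dst] for dst in graph.get(n, [])) for n in nodes}
        # Explicit normalization so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical base set: two hub-like pages pointing at two content pages.
GRAPH = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(GRAPH)
```

In this toy graph a1 ends up with the higher authority score because both hubs link to it, and h1 gets the higher hub score because it links to both authorities.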

Q5: PageRank

What does the .85 value for d represent?

d is the damping factor: the probability that the surfer follows a link on the current page. So 85% of the time they click a link, and 15% of the time they jump to a new page (for example, by typing in a URL).

If we assume most internet users are mobile, should we raise or lower the value of d?

Probably raise. Mobile users are more likely to follow links, and less likely to navigate to new pages (because navigating to new pages requires typing in a URL)

What are some issues with PageRank as a metric of page quality?

- Link farms, spam bots, etc. can skew rankings

- Links may not be meant as an “endorsement”, e.g. social media shares

- AJAX and JavaScript can make traditional surfing difficult

- Content can be behind a login - a Facebook feed is not searchable

Q5: PageRank: An example

A = .2

B = .2

C = .2

D = .2

E = .2

[Figure: directed link graph over nodes A, B, C, D, E]

Assume d=.85


After the first iteration (every node starts at .2):

A = .15/5 + .85*(.2/2) = .115

B = .15/5 + .85*(.2/2 + .2) = .285

C = .15/5 + .85*(.2/2) = .115

D = .15/5 + .85*(.2/2+.2/2) = .2

E = .15/5 + .85*(.2 + .2/2) = .285



Second iteration, starting from the first-iteration values (A = .115, B = .285, C = .115, D = .2, E = .285):

A = (.15/5) + (.85)(.285/2) = .151

B = (.15/5) + (.85)(.115/2 + .2) = .249

C = (.15/5) + (.85)(.115/2) = .079

D = (.15/5) + (.85)(.115/2 + .285/2) = .2

E = (.15/5) + (.85)(.285 + .115/2) = .321
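The two iterations can be reproduced with a short Python sketch. The link graph below is an assumption: the original figure did not survive, and the arithmetic alone does not pin down every edge, so `LINKS` is one hypothetical graph that is consistent with the numbers on the slides.

```python
D = 0.85

# Hypothetical edges (page -> pages it links to), chosen to match the arithmetic.
LINKS = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["B"],
    "E": ["A", "D"],
}

def pagerank_step(ranks, links, d=D):
    """One synchronous update: PR(p) = (1-d)/N + d * sum(PR(q)/outdeg(q))."""
    n = len(ranks)
    new_ranks = {node: (1 - d) / n for node in ranks}
    for src, targets in links.items():
        share = ranks[src] / len(targets)  # rank is split evenly over out-links
        for dst in targets:
            new_ranks[dst] += d * share
    return new_ranks

initial = {node: 0.2 for node in LINKS}
first = pagerank_step(initial, LINKS)
second = pagerank_step(first, LINKS)
```

Every page in this graph has at least one out-link, so the ranks keep summing to 1 after each step.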
