Discussion 7

Information Retrieval

Transcript of Discussion 7

Page 1: Discussion 7

Information Retrieval

Page 2: Discussion 7

Admin

● Project 3 due tomorrow at 9pm

○ Extra OH today and tomorrow (see Piazza)

● Midterm Wed. 2/24 7pm

● Review Session: Monday 22nd, 7-8:30 PM, COOL G906 (Cooley Lab - White auditorium in basement)

Page 3: Discussion 7

Q1: Consider the following short documents.

1) “vim is the only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano is the best editor”

We want to make these documents searchable

Page 4: Discussion 7

Q1: Consider the following short documents.

1) “vim is the only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano is the best editor”

First, remove stopwords: {is, the}

Page 6: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Now, build term frequency vectors for each document

Page 7: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

First, build the global dictionary: dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}

Then, build the tf vectors for each document

Page 8: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor” = <1,1,1,1,0,0,0,0,0,0,0,0,0>

2) “why do people use emacs?” = <0,0,0,0,1,1,1,1,1,0,0,0,0>

3) “vim emacs evil mode?” = <1,0,0,0,0,0,0,0,1,1,1,0,0>

4) “nano best editor” = <0,0,0,1,0,0,0,0,0,0,0,1,1>

First, build the global dictionary: dictionary = {vim, only, real, editor, why, do, people, use, emacs, evil, mode, nano, best}

Then, build the tf vectors for each document
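The steps so far (drop stopwords, build the global dictionary, count terms per document) can be sketched in a few lines of Python. This is only an illustration: the helper names `tokenize`, `build_dictionary`, and `tf_vector` are made up for this example, and it assumes tokens are split on whitespace with the trailing "?" stripped.

```python
STOPWORDS = {"is", "the"}

DOCS = [
    "vim is the only real editor",
    "why do people use emacs?",
    "vim emacs evil mode?",
    "nano is the best editor",
]

def tokenize(text):
    """Lowercase, split on whitespace, strip '?', and drop stopwords."""
    words = (w.strip("?").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def build_dictionary(docs):
    """Collect distinct terms in first-seen order (assumed ordering)."""
    dictionary = []
    for doc in docs:
        for term in tokenize(doc):
            if term not in dictionary:
                dictionary.append(term)
    return dictionary

def tf_vector(doc, dictionary):
    """Count how often each dictionary term occurs in the document."""
    tokens = tokenize(doc)
    return [tokens.count(term) for term in dictionary]

dictionary = build_dictionary(DOCS)
tf_vectors = [tf_vector(doc, dictionary) for doc in DOCS]
```

With first-seen ordering, `dictionary` comes out in the same order as the slide, so the vectors line up position by position with the ones shown above.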

Page 9: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Draw the inverted index for this document collection

Page 10: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Draw the inverted index for this document collection

vim: (1, 3)
only: (1)
real: (1)
editor: (1, 4)
why: (2)
do: (2)
people: (2)
use: (2)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
best: (4)

Page 11: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Draw the inverted index for this document collection -> Keep this sorted!

best: (4)
do: (2)
editor: (1, 4)
emacs: (2, 3)
evil: (3)
mode: (3)
nano: (4)
only: (1)
people: (2)
real: (1)
use: (2)
vim: (1, 3)
why: (2)
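As a rough sketch of how such a sorted index could be built, the Python below maps each term to a sorted list of document IDs and orders the terms alphabetically. The function name `build_inverted_index` and the dict-based layout are assumptions for illustration, not something the slides prescribe.

```python
from collections import defaultdict

# Documents after stopword removal, keyed by document ID.
DOCS = {
    1: "vim only real editor",
    2: "why do people use emacs?",
    3: "vim emacs evil mode?",
    4: "nano best editor",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            postings[word.strip("?").lower()].add(doc_id)
    # Sort the terms and each postings list.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(f"{term}: ({', '.join(map(str, doc_ids))})")
```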

Page 12: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

Consider the query: “vim editor”:

vim: (1, 3)

editor: (1, 4)

The inverted index lets us answer this query quickly: we read only the postings lists for “vim” and “editor” instead of scanning every document

best: (4); do: (2); editor: (1, 4); emacs: (2, 3); evil: (3); mode: (3); nano: (4); only: (1); people: (2); real: (1); use: (2); vim: (1, 3); why: (2)
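One way to see why sorted postings help: if the query is treated as a conjunction (an assumption here, since the slide does not say whether “vim editor” means AND or OR), the two lists can be merged in a single pass. The function name `intersect` is hypothetical.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with a linear merge."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Postings copied from the index above.
INDEX = {"vim": [1, 3], "editor": [1, 4]}
matches = intersect(INDEX["vim"], INDEX["editor"])
```

Under the AND assumption only document 1 matches; under OR the answer would be the union of the two lists.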

Page 13: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

What is the IDF of vim? emacs? nano?

best: (4); do: (2); editor: (1, 4); emacs: (2, 3); evil: (3); mode: (3); nano: (4); only: (1); people: (2); real: (1); use: (2); vim: (1, 3); why: (2)

Page 14: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

What is the IDF of vim? emacs? nano?

vim: log(4/2) = log(2)

emacs: log(4/2) = log(2)

nano: log(4/1) = log(4)

best: (4); do: (2); editor: (1, 4); emacs: (2, 3); evil: (3); mode: (3); nano: (4); only: (1); people: (2); real: (1); use: (2); vim: (1, 3); why: (2)
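The IDF values can be read straight off the index, since the document frequency of a term is the length of its postings list. The sketch below assumes the natural logarithm (the slides do not state a base) and uses a hypothetical helper named `idf`.

```python
import math

N_DOCS = 4

# Inverted index from the slide: term -> document IDs.
INDEX = {
    "best": [4], "do": [2], "editor": [1, 4], "emacs": [2, 3],
    "evil": [3], "mode": [3], "nano": [4], "only": [1],
    "people": [2], "real": [1], "use": [2], "vim": [1, 3], "why": [2],
}

def idf(term, index=INDEX, n_docs=N_DOCS):
    """idf(t) = log(N / df(t)); the log base is an assumption."""
    return math.log(n_docs / len(index[term]))

for term in ("vim", "emacs", "nano"):
    print(term, round(idf(term), 3))
```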

Page 15: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

What is the tf-idf vector of document 1?

Page 16: Discussion 7

Q1: Consider the following short documents.

1) “vim only real editor”

2) “why do people use emacs?”

3) “vim emacs evil mode?”

4) “nano best editor”

What is the tf-idf vector of document 1?

tf: <1,1,1,1,0,0,0,0,0,0,0,0,0>

idf: vim=log(4/2), only=log(4/1), real=log(4/1), editor=log(4/2)

tf-idf: <log(2), log(4), log(4), log(2), 0,0,0,0,0,0,0,0,0>
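The tf-idf vector is the element-wise product of the tf vector and the per-term IDF. A minimal sketch, again assuming the natural log and with `tf_idf` as a made-up helper name:

```python
import math

N_DOCS = 4
DICTIONARY = ["vim", "only", "real", "editor", "why", "do", "people",
              "use", "emacs", "evil", "mode", "nano", "best"]
# Number of documents containing each term (postings list lengths).
DOC_FREQ = {"vim": 2, "only": 1, "real": 1, "editor": 2, "why": 1,
            "do": 1, "people": 1, "use": 1, "emacs": 2, "evil": 1,
            "mode": 1, "nano": 1, "best": 1}

def tf_idf(tf, dictionary=DICTIONARY, doc_freq=DOC_FREQ, n_docs=N_DOCS):
    """Multiply each term count by log(N / df) for that term."""
    return [count * math.log(n_docs / doc_freq[term])
            for count, term in zip(tf, dictionary)]

tf_doc1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
tfidf_doc1 = tf_idf(tf_doc1)
```

The result has 13 entries, one per dictionary term, with non-zero weights only for vim, only, real, and editor.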

Page 17: Discussion 7

Q2: Kendall’s Tau

A startup search engine is trying to compare its query results to Google’s.

Google’s results: ABCXYZ

Startup’s results: ABZYXC

They claim their results are comparable. What do you think?

Page 18: Discussion 7

Q2: Kendall’s Tau

A startup search engine is trying to compare its query results to Google’s.

Google’s results: ABCXYZ

Startup’s results: ABZYXC

Number of matching pairs: 9

Number of non-matching pairs: 6

(9-6)/((1/2)(6)(5)) = 3/15 = 0.2

OR: (C-D)/(C+D) = (9-6)/(9+6) = 3/15 = 0.2

Page 19: Discussion 7

Q2: Kendall’s Tau

A startup search engine is trying to compare its query results to Google’s.

Google’s results: ABCXYZ

What if the startup was able to change their results to: ABCZYX

Page 20: Discussion 7

Q2: Kendall’s Tau

A startup search engine is trying to compare its query results to Google’s.

Google’s results: ABCXYZ

What if the startup was able to change their results to: ABCZYX

Number of matching pairs: 12

Number of non-matching pairs: 3

(12-3)/((1/2)(6)(5)) = 9/15 = 0.6

OR: (C-D)/(C+D) = (12-3)/(12+3) = 9/15 = 0.6
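The pair counting can be checked mechanically. The sketch below enumerates all 15 pairs and counts those ordered the same way in both rankings (matching, C) and those ordered differently (non-matching, D). The function name `kendall_tau` is chosen for this example, and it assumes both rankings contain the same items with no ties.

```python
from itertools import combinations

def kendall_tau(reference, candidate):
    """Return (concordant, discordant, tau) for two tie-free rankings."""
    ref_pos = {item: i for i, item in enumerate(reference)}
    cand_pos = {item: i for i, item in enumerate(candidate)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):
        # A pair matches if both rankings put a and b in the same order.
        same_order = (ref_pos[a] - ref_pos[b]) * (cand_pos[a] - cand_pos[b]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau

first = kendall_tau("ABCXYZ", "ABZYXC")
second = kendall_tau("ABCXYZ", "ABCZYX")
```

For the first startup ranking this gives 9 matching and 6 non-matching pairs; for the revised ranking, 12 and 3.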

Page 21: Discussion 7

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What are the precision and recall of this search engine if it emits only its top-10 results?

Page 22: Discussion 7

Q3: Precision and Recall Question

What are precision and recall again?

https://en.wikipedia.org/wiki/Precision_and_recall

Page 23: Discussion 7

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What are the precision and recall of this search engine if it emits only its top-10 results?

Page 24: Discussion 7

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What are the precision and recall of this search engine if it emits only its top-10 results?

Precision = 6/10 = .6, Recall = 6/100 = .06

Page 25: Discussion 7

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What are the precision and recall of this search engine if it emits only its top-5 results?

Page 26: Discussion 7

Q3: Precision and Recall Question

Consider a query for which there are 100 relevant URLs in the universe. A search engine computes a ranking that contains a relevant URL in positions 1, 3, 4, 5, 6, and 8. It has irrelevant URLs in 2, 7, 9, and 10.

What are the precision and recall of this search engine if it emits only its top-5 results?

Precision = 4/5 = .8, Recall = 4/100 = .04
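Both answers follow the same recipe: count the relevant results among the top k, then divide by k for precision and by the total number of relevant URLs for recall. A small sketch, with `precision_recall_at_k` as a hypothetical helper name:

```python
def precision_recall_at_k(relevant_positions, k, total_relevant):
    """Precision and recall when only the top-k results are returned."""
    hits = sum(1 for pos in relevant_positions if pos <= k)
    return hits / k, hits / total_relevant

RELEVANT_POSITIONS = [1, 3, 4, 5, 6, 8]  # 1-based ranks from the question
TOTAL_RELEVANT = 100

p10, r10 = precision_recall_at_k(RELEVANT_POSITIONS, 10, TOTAL_RELEVANT)
p5, r5 = precision_recall_at_k(RELEVANT_POSITIONS, 5, TOTAL_RELEVANT)
```

Cutting the list from 10 to 5 results raises precision and lowers recall, which is the usual trade-off between the two.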

Page 27: Discussion 7

Q4: HITS

What does HITS stand for?

How does the algorithm work?

Page 28: Discussion 7

Q4: HITS

What does HITS stand for?

Hyperlink-Induced Topic Search

How does the algorithm work?

It starts with the user's query to create the root set (the pages that match the query). It then builds the base set by adding pages that link to, or are linked from, the root set. On this focused subgraph, iterate to compute hub and authority scores for every page.

Page 29: Discussion 7

Q4: HITS

What are hubs and authorities?

Page 30: Discussion 7

Q4: HITS

What are hubs and authorities?

Hubs are central repositories - they have links to good authorities

Authorities are the sources of information - they are linked to by good hubs

Page 31: Discussion 7

Q4: HITS

How does HITS differ from PageRank?

Page 32: Discussion 7

Q4: HITS

How does HITS differ from PageRank?

HITS is query-dependent: it runs on the subgraph built from the user’s query, whereas PageRank scores do not depend on any query.

Each node maintains two scores - hub and auth

Each round requires an explicit normalization step
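To make the two scores and the per-round normalization concrete, here is a minimal sketch of the HITS update loop. The four-page graph, the function name `hits`, the fixed round count, and the choice of L2 normalization are all assumptions for illustration. The root-set and base-set construction is skipped; `links` stands in for the focused subgraph.

```python
import math

def normalize(scores):
    """Scale scores so their squares sum to 1 (assumed L2 normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(links, rounds=20):
    """links maps each page to the list of pages it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(rounds):
        # Authority score: sum of hub scores of pages linking to the node.
        auth = normalize({n: sum(hub[s] for s, ts in links.items() if n in ts)
                          for n in nodes})
        # Hub score: sum of authority scores of pages the node links to.
        hub = normalize({n: sum(auth[t] for t in links.get(n, []))
                         for n in nodes})
    return hub, auth

# Hypothetical subgraph: h1 and h2 act as hubs, a1 and a2 as authorities.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(LINKS)
```

In this toy graph a1 ends up with the higher authority score because both hubs point to it, and h1 ends up the better hub because it points to both authorities.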

Page 33: Discussion 7

Q5: PageRank

What does the .85 value for d represent?

If we assume most internet users are mobile, should we raise or lower the value of d?

Page 34: Discussion 7

Q5: PageRank

What does the .85 value for d represent?

d is the damping factor: the probability that the user follows a link from the current page. So 85% of the time they continue by clicking a link, and 15% of the time they navigate to a new page without following a link.

If we assume most internet users are mobile, should we raise or lower the value of d?

Probably raise. Mobile users are more likely to follow links, and less likely to navigate to new pages (because navigating to new pages requires typing in a URL)
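For reference, the role of d is easiest to see in the update rule itself. The form below is one common way to write it; the symbols N (number of pages), In(p) (pages linking to p), and L(q) (number of out-links of q) are notation introduced here rather than taken from the slides.

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{L(q)}
```

With d = .85, the first term is the 15% share spread evenly over all pages, and the second term is the 85% share that arrives by following links. Raising d shifts weight from the first term to the second.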

Page 35: Discussion 7

Q5: PageRank

What are some issues with PageRank as a metric of page quality?

Page 36: Discussion 7

Q5: PageRank

What are some issues with PageRank as a metric of page quality?

- Link farms, spam bots, etc. can skew rankings

- Links may not be meant as an “endorsement”, e.g. social media shares

- AJAX and JavaScript can make traditional surfing difficult

- Content can be behind a login, e.g. the Facebook feed is not searchable

Page 37: Discussion 7

Q5: PageRank: An example

A = .2

B = .2

C = .2

D = .2

E = .2

[Figure: directed link graph over pages A, B, C, D, E; the edges did not survive extraction]

Assume d=.85

Page 38: Discussion 7

Q5: PageRank: An example

A = .15/5 + .85*(.2/2) = .115

B = .15/5 + .85*(.2/2 + .2) = .285

C = .15/5 + .85*(.2/2) = .115

D = .15/5 + .85*(.2/2+.2/2) = .2

E = .15/5 + .85*(.2 + .2/2) = .285

[Figure: directed link graph over pages A, B, C, D, E; the edges did not survive extraction]

Assume d=.85

Page 39: Discussion 7

Q5: PageRank: An example

A = .15/5 + .85*(.285/2) = .151 (was .115)

B = .15/5 + .85*(.115/2 + .2) = .249 (was .285)

C = .15/5 + .85*(.115/2) = .079 (was .115)

D = .15/5 + .85*(.115/2 + .285/2) = .2 (was .2)

E = .15/5 + .85*(.285 + .115/2) = .321 (was .285)

[Figure: directed link graph over pages A, B, C, D, E; the edges did not survive extraction]

Assume d=.85
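The two rounds above can be reproduced with a short script. Because the graph itself is not recoverable from this transcript, the edge set `ASSUMED_LINKS` below is a guess: it is one set of links that is consistent with the arithmetic on the slides, not necessarily the graph that was drawn. The function name `pagerank_step` is also made up for this sketch.

```python
def pagerank_step(scores, out_links, d=0.85):
    """One synchronous PageRank update over all pages."""
    n = len(scores)
    # Every page starts with the random-jump share (1 - d) / N.
    new_scores = {page: (1 - d) / n for page in scores}
    for src, targets in out_links.items():
        # Each page splits its current score evenly across its out-links.
        share = scores[src] / len(targets)
        for dst in targets:
            new_scores[dst] += d * share
    return new_scores

# ASSUMPTION: one edge set consistent with the slide arithmetic.
ASSUMED_LINKS = {
    "A": ["B", "C"],
    "B": ["E"],
    "C": ["D", "E"],
    "D": ["B"],
    "E": ["A", "D"],
}

initial = {page: 0.2 for page in "ABCDE"}
round1 = pagerank_step(initial, ASSUMED_LINKS)
round2 = pagerank_step(round1, ASSUMED_LINKS)
```

Under that assumed graph, `round1` and `round2` round to the values on the last two slides, and the scores sum to 1 after each round.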