Computing Relevance, Similarity: The Vector Space Model


Page 1: Computing Relevance, Similarity: The Vector Space Model


Computing Relevance, Similarity: The Vector Space Model

Chapter 27, Part B
Based on Larson and Hearst's slides at UC-Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Page 2: Computing Relevance, Similarity: The Vector Space Model


Document Vectors

Documents are represented as “bags of words”

Represented as vectors when used computationally
• A vector is like an array of floating-point numbers
• Has direction and magnitude
• Each vector holds a place for every term in the collection
• Therefore, most vectors are sparse
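As a rough illustration of this bag-of-words representation (a minimal sketch with made-up documents and terms, not part of the original slides), each document gets one counter per vocabulary term, and most of those counters stay at zero:

```python
# Toy collection: each document becomes a vector with one slot per term in
# the collection vocabulary; most entries are 0, so the vectors are sparse.
docs = {
    "A": "nova galaxy heat nova nova",
    "B": "film role film h'wood",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

vectors = {
    name: [text.split().count(term) for term in vocab]
    for name, text in docs.items()
}
print(vocab)          # one position per term in the collection
print(vectors["A"])   # counts for doc A; zeros where a term never occurs
```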

Page 3: Computing Relevance, Similarity: The Vector Space Model


Document Vectors: One location for each word.

(Table: term-frequency counts for documents A–I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur.)

"Nova" occurs 10 times in doc A
"Galaxy" occurs 5 times in doc A
"Heat" occurs 3 times in doc A
(Blank means 0 occurrences.)

Page 4: Computing Relevance, Similarity: The Vector Space Model


Document Vectors

(Table: the same term-frequency counts as on the previous slide; the row labels A–I are document ids.)

Page 5: Computing Relevance, Similarity: The Vector Space Model


We Can Plot the Vectors

(Plot: documents as points in a 2-D space with axes "Star" and "Diet"; one document is about astronomy, one about movie stars, and one about mammal behavior.)

Assumption: Documents that are “close” in space are similar.

Page 6: Computing Relevance, Similarity: The Vector Space Model


Vector Space Model 1/2

Documents are represented as vectors in term space
• Terms are usually stems
• Documents represented by binary vectors of terms

Queries represented the same as documents

Page 7: Computing Relevance, Similarity: The Vector Space Model


Vector Space Model 2/2

A vector distance measure between the query and documents is used to rank retrieved documents
• Query and document similarity is based on length and direction of their vectors
• Vector operations to capture boolean query conditions
• Terms in a vector can be "weighted" in many ways

Page 8: Computing Relevance, Similarity: The Vector Space Model


Vector Space Documents and Queries

docs   t1  t2  t3  RSV = Q·Di
D1      1   0   1   4
D2      1   0   0   1
D3      0   1   1   5
D4      1   0   0   1
D5      1   1   1   6
D6      1   1   0   3
D7      0   1   0   2
D8      0   1   0   2
D9      0   0   1   3
D10     0   1   1   5
D11     1   0   1   4
Q       1   2   3

(Figure: documents D1–D11 and the query Q plotted as points in the three-dimensional term space with axes t1, t2, t3.)

Boolean term combinations. Remember, we need to rank the resulting doc list.

Q is a query – also represented as a vector
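As a small sanity check (a sketch using NumPy; not part of the slides), the RSV column above can be recomputed as the dot product of the query vector Q with each binary document vector:

```python
import numpy as np

# Binary document vectors over the terms (t1, t2, t3), as in the table above.
docs = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1], "D11": [1, 0, 1],
}
q = np.array([1, 2, 3])  # the query vector Q

# RSV = Q . Di; then rank the retrieved documents by that score.
rsv = {name: int(q @ np.array(vec)) for name, vec in docs.items()}
for name, score in sorted(rsv.items(), key=lambda kv: -kv[1]):
    print(name, score)  # D5 scores 6, D3 and D10 score 5, D1 and D11 score 4, ...
```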

Page 9: Computing Relevance, Similarity: The Vector Space Model


Assigning Weights to Terms

• Binary weights
• Raw term frequency
• tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
    • frequent in relevant documents … BUT
    • infrequent in the collection as a whole

Page 10: Computing Relevance, Similarity: The Vector Space Model


Binary Weights

Only the presence (1) or absence (0) of a term is indicated in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Page 11: Computing Relevance, Similarity: The Vector Space Model


Raw Term Weights

The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

Page 12: Computing Relevance, Similarity: The Vector Space Model


The Zipfian Problem

(Figure: Zipf curve of term frequency vs. rank; the most frequent terms are stop words, and the long tail consists of rare terms.)

Page 13: Computing Relevance, Similarity: The Vector Space Model


TF x IDF Weights

tf x idf measure:
• Term Frequency (tf)
• Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution

Goal: Assign a tf * idf weight to each term in each document

Page 14: Computing Relevance, Similarity: The Vector Space Model


TF x IDF Calculation

Let C be a document collection.

$$w_{ik} = tf_{ik} \times \log(N / n_k)$$

where
T_k = term k in document D_i
tf_{ik} = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C, with idf_k = \log(N / n_k)
N = total number of documents in the collection C
n_k = the number of documents in C that contain T_k
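To make the calculation concrete, here is a minimal sketch over a toy three-document collection (documents and terms invented for illustration; math.log is the natural log, while the slides use base 10, which only rescales the weights):

```python
import math

# Toy collection C; in practice these would be tokenized, stemmed documents.
docs = {
    "D1": ["nova", "galaxy", "heat", "nova"],
    "D2": ["film", "role", "film"],
    "D3": ["galaxy", "film", "diet"],
}
N = len(docs)  # total number of documents in the collection

# n_k: the number of documents that contain term k
n = {}
for terms in docs.values():
    for t in set(terms):
        n[t] = n.get(t, 0) + 1

# w_ik = tf_ik * log(N / n_k)
weights = {
    doc: {t: terms.count(t) * math.log(N / n[t]) for t in set(terms)}
    for doc, terms in docs.items()
}
print(weights["D1"]["nova"])  # 2 * log(3/1) ≈ 2.20
```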

Page 15: Computing Relevance, Similarity: The Vector Space Model


Inverse Document Frequency

IDF provides high values for rare words and low values for common words

For a collection of 10,000 documents:

$$\log(10000/1) = 4$$
$$\log(10000/20) = 2.698$$
$$\log(10000/5000) = 0.301$$
$$\log(10000/10000) = 0$$

Words too common in the collection offer little discriminating power.

Page 16: Computing Relevance, Similarity: The Vector Space Model


TF x IDF Normalization

Normalize the term weights (so longer documents are not unfairly given more weight)
• The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So, we want to reduce the importance attached to a term appearing in a document based on the length of the document.

$$w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}}$$
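A minimal sketch of this length normalization (hypothetical helper and toy term statistics, not from the slides): a long document with the same term proportions as a short one ends up with the same normalized weights.

```python
import math

def normalized_tfidf(tf, df, N):
    """tf: term -> raw frequency in one document; df: term -> number of
    documents containing the term; N: collection size. Returns the
    cosine-normalized tf-idf weights for that document."""
    raw = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# Same proportions, different lengths -> identical normalized weights.
print(normalized_tfidf({"nova": 2, "galaxy": 1}, {"nova": 5, "galaxy": 50}, N=10000))
print(normalized_tfidf({"nova": 20, "galaxy": 10}, {"nova": 5, "galaxy": 50}, N=10000))
```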

Page 17: Computing Relevance, Similarity: The Vector Space Model


Pair-wise Document Similarity

nova galaxy heat h’wood film role diet fur 1 3 1

5 2 2 1 5 4 1

ABCD

How to compute document similarity?

Page 18: Computing Relevance, Similarity: The Vector Space Model


Pair-wise Document Similarity

(The same table of term counts for documents A–D as on the previous slide.)

$$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$$
$$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$$
$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \times w_{2i}$$

sim(A, B) = (1 × 5) + (3 × 2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2 × 4) + (1 × 1) = 9
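As a quick check of these numbers (a sketch in plain Python; the vectors simply transcribe the table above, with terms in the order nova, galaxy, heat, h'wood, film, role, diet, fur):

```python
# Term order: nova, galaxy, heat, h'wood, film, role, diet, fur
A = [1, 3, 1, 0, 0, 0, 0, 0]
B = [5, 2, 0, 0, 0, 0, 0, 0]
C = [0, 0, 0, 2, 1, 5, 0, 0]
D = [0, 0, 0, 4, 1, 0, 0, 0]

def sim(d1, d2):
    """Unnormalized inner-product similarity: sum over terms of w1i * w2i."""
    return sum(x * y for x, y in zip(d1, d2))

print(sim(A, B))                                   # 11
print(sim(C, D))                                   # 9
print(sim(A, C), sim(A, D), sim(B, C), sim(B, D))  # 0 0 0 0
```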

Page 19: Computing Relevance, Similarity: The Vector Space Model


Pair-wise Document Similarity (cosine normalization)

$$D_1 = w_{11}, w_{12}, \ldots, w_{1t}$$
$$D_2 = w_{21}, w_{22}, \ldots, w_{2t}$$

unnormalized:
$$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}$$

normalized cosine:
$$sim(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \, w_{2i}}{\sqrt{\sum_{i=1}^{t} (w_{1i})^2} \times \sqrt{\sum_{i=1}^{t} (w_{2i})^2}}$$

Page 20: Computing Relevance, Similarity: The Vector Space Model


Vector Space “Relevance” Measure

$$D_i = w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}$$
$$Q = w_{q1}, w_{q2}, \ldots, w_{qt}$$
(w = 0 if a term is absent)

If term weights are normalized:
$$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \times w_{d_{ij}}$$

otherwise, normalize in the similarity comparison:
$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2} \times \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Page 21: Computing Relevance, Similarity: The Vector Space Model


Computing Relevance Scores

Say we have query vector Q = (0.4, 0.8)
Also, document D2 = (0.2, 0.7)
What does their similarity comparison yield?

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$$

Page 22: Computing Relevance, Similarity: The Vector Space Model


Vector Space with Term Weights and Cosine Matching

(Figure: the query Q and documents D1, D2 plotted as vectors in the two-dimensional space spanned by Term A and Term B, with both axes running from 0 to 1.0.)

D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; …; d_{it}, w_{d_{it}})
Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; …; q_{it}, w_{q_{it}})

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2} \times \sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} \approx 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} \approx 0.74$$
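A minimal sketch that reproduces these scores (plain Python, not from the slides); the slide's 0.74 for D1 comes from rounding the intermediate values:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    return dot / (norm_q * norm_d)

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine_sim(Q, D2), 2))  # 0.98
print(round(cosine_sim(Q, D1), 2))  # 0.73 (the slide's rounded intermediates give 0.74)
```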

Page 23: Computing Relevance, Similarity: The Vector Space Model


Text Clustering

Finds overall similarities among groups of documents

Finds overall similarities among groups of tokens

Picks out some themes, ignores others

Page 24: Computing Relevance, Similarity: The Vector Space Model


Text Clustering

(Figure: document clusters plotted in the space of Term 1 vs. Term 2.)

Clustering is "The art of finding groups in data." -- Kaufman and Rousseeuw

Page 25: Computing Relevance, Similarity: The Vector Space Model


Problems with Vector Space

There is no real theoretical basis for the assumption of a term space
• It is more for visualization than having any real basis
• Most similarity measures work about the same

Terms are not really orthogonal dimensions
• Terms are not independent of all other terms; terms appearing in text may be correlated.

Page 26: Computing Relevance, Similarity: The Vector Space Model


Probabilistic Models

Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

Relies on accurate estimates of probabilities

Page 27: Computing Relevance, Similarity: The Vector Space Model


Probability Ranking Principle

If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
-- Stephen E. Robertson, Journal of Documentation, 1977

Page 28: Computing Relevance, Similarity: The Vector Space Model


Iterative Query Refinement

Page 29: Computing Relevance, Similarity: The Vector Space Model


Query Modification

Problem: How can we reformulate the query to help a user who is trying several searches to get at the same information?
• Thesaurus expansion:
  • Suggest terms similar to query terms
• Relevance feedback:
  • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

Page 30: Computing Relevance, Similarity: The Vector Space Model


Relevance Feedback

Main Idea:
• Modify existing query based on relevance judgements
• Extract terms from relevant documents and add them to the query
• AND/OR re-weight the terms already in the query

There are many variations:
• Usually positive weights for terms from relevant docs
• Sometimes negative weights for terms from non-relevant docs

Users, or the system, guide this process by selecting terms from an automatically-generated list.

Page 31: Computing Relevance, Similarity: The Vector Space Model


Rocchio Method

Rocchio automatically
• Re-weights terms
• Adds in new terms (from relevant docs)
  • have to be careful when using negative terms
• Rocchio is not a machine learning algorithm

Page 32: Computing Relevance, Similarity: The Vector Space Model


Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i \;-\; \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$

where
Q_0 = the vector for the initial query
R_i = the vector for the relevant document i
S_i = the vector for the non-relevant document i
n_1 = the number of relevant documents chosen
n_2 = the number of non-relevant documents chosen
β and γ tune the importance of relevant and nonrelevant terms (in some studies best to set β to 0.75 and γ to 0.25)
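A minimal sketch of one Rocchio update under the formula above (NumPy assumed; the two-term query and relevant-document vectors are hypothetical):

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """One Rocchio step: Q1 = Q0 + (beta/n1) * sum(Ri) - (gamma/n2) * sum(Si)."""
    q1 = np.array(q0, dtype=float)
    if relevant:
        q1 += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q1 -= gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q1, 0, None)  # negative term weights are usually dropped

# Example: move the query toward one judged-relevant document.
print(rocchio([0.7, 0.3], relevant=[[0.2, 0.8]], nonrelevant=[]))  # [0.85 0.9]
```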

Page 33: Computing Relevance, Similarity: The Vector Space Model


Rocchio/Vector Illustration

(Figure: Q0, Q', Q'', D1, and D2 plotted in the two-dimensional space with axes "retrieval" and "information", both running from 0 to 1.0.)

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q' = ½·Q0 + ½·D1 = (0.45, 0.55)
Q'' = ½·Q0 + ½·D2 = (0.80, 0.20)

Page 34: Computing Relevance, Similarity: The Vector Space Model


Alternative Notions of Relevance Feedback

Find people whose taste is “similar” to yours.
• Will you like what they like?

Follow a user’s actions in the background.
• Can this be used to predict what the user will want to see next?

Track what lots of people are doing.
• Does this implicitly indicate what they think is good and not good?

Page 35: Computing Relevance, Similarity: The Vector Space Model


Collaborative Filtering (Social Filtering)

If Pam liked the paper, I’ll like the paper
If you liked Star Wars, you’ll like Independence Day
Rating based on ratings of similar people
• Ignores text, so also works on sound, pictures etc.
• But: Initial users can bias ratings of future users

                    Sally  Bob  Chris  Lynn  Karen
Star Wars             7     7     3     4     7
Jurassic Park         6     4     7     4     4
Terminator II         3     4     7     6     3
Independence Day      7     7     2     2     ?

Page 36: Computing Relevance, Similarity: The Vector Space Model


Ringo Collaborative Filtering 1/2

Users rate items from like to dislike
• 7 = like; 4 = ambivalent; 1 = dislike
• A normal distribution; the extremes are what matter

Nearest Neighbors Strategy: Find similar users and predict a (weighted) average of their ratings

Page 37: Computing Relevance, Similarity: The Vector Space Model


Ringo Collaborative Filtering 2/2

Pearson Algorithm: Weight by degree of correlation between user U and user J
• 1 means similar, 0 means no correlation, -1 dissimilar
• Works better to compare against the ambivalent rating (4), rather than the individual’s average score

$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \, \sum (J - \bar{J})^2}}$$
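A minimal sketch of this weighting (plain Python; the ratings come from the movie table two slides back, and, as suggested above, the sums are centered on the ambivalent rating 4 rather than on each user's own average):

```python
import math

def ringo_pearson(u, j, center=4):
    """Correlation between two users' ratings, centered on the ambivalent rating."""
    num = sum((a - center) * (b - center) for a, b in zip(u, j))
    den = (math.sqrt(sum((a - center) ** 2 for a in u)) *
           math.sqrt(sum((b - center) ** 2 for b in j)))
    return num / den if den else 0.0

# Ratings on Star Wars, Jurassic Park, Terminator II (the films Karen has rated).
karen = [7, 4, 3]
others = {"Sally": [7, 6, 3], "Bob": [7, 4, 4], "Chris": [3, 7, 7], "Lynn": [4, 4, 6]}
for name, ratings in others.items():
    print(name, round(ringo_pearson(karen, ratings), 2))
# Sally and Bob correlate strongly with Karen, so their high ratings for
# Independence Day suggest Karen would like it too.
```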