IR Theory: IR Basics & Vector Space Model


Transcript of IR Theory: IR Basics & Vector Space Model

Page 1: IR Theory: IR Basics & Vector Space Model

IR Theory: IR Basics & Vector Space Model

Page 2: IR Theory: IR Basics & Vector Space Model

IR Approach


[Diagram] Information Seeker → Information Need → Query String; Authors → Concepts → Document Text. The central question: is the document relevant to the query?

Page 3: IR Theory: IR Basics & Vector Space Model

IR System Architecture


[Diagram] Documents → Representation Module → Document Representation; Query → Representation Module → Query Representation; both representations feed the Matching/Ranking Module, which produces the Results.

Page 4: IR Theory: IR Basics & Vector Space Model

Step 1: Representation


[Architecture diagram repeated from Page 3: Documents/Query → Representation Modules → Document/Query Representations → Matching/Ranking Module → Results]

Page 5: IR Theory: IR Basics & Vector Space Model

How to represent text? How do we represent the complexities of language?

► Computers don’t “understand” documents or queries

Simple, yet effective approach: “bag of words”

► Treat all the words in a document as index terms for that document
► Disregard order, structure, meaning, etc. of the words


McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

Word counts from the article above:

16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

Bag of Words
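To make the bag-of-words idea concrete, here is a minimal sketch in Python; the regex tokenizer and the tiny stopword list are illustrative assumptions, not part of the slide.

```python
import re
from collections import Counter

# Toy stopword list (assumption; real systems use much larger lists)
STOPWORDS = {"the", "is", "a", "of", "in", "its", "it", "to", "for"}

def bag_of_words(text):
    """Lowercase the text, split into word tokens, drop stopwords, count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

article = "McDonald's Corp. is cutting the amount of bad fat in its french fries ..."
print(bag_of_words(article).most_common(5))
```

Order, structure, and meaning are discarded: only the term counts survive.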

Page 6: IR Theory: IR Basics & Vector Space Model

Bag-of-Word Representation


Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Term     Document 1   Document 2
aid          0            1
all          0            1
back         1            0
brown        1            0
come         0            1
dog          1            0
fox          1            0
good         0            1
jump         1            0
lazy         1            0
men          0            1
now          0            1
over         1            0
party        0            1
quick        1            0
their        0            1
time         0            1

Stopword List: the, is, for, to, of
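A sketch of how the incidence table above could be produced; the tokenizer is an assumption, and stemming is reduced to a hard-coded map (“jumped” → “jump”, “dog’s” → “dog”) purely so the output matches the slide.

```python
import re

STOPWORDS = {"the", "is", "for", "to", "of"}    # the stopword list from the slide
NORMALIZE = {"jumped": "jump", "dog's": "dog"}  # stand-in for real stemming (assumption)

def terms(text):
    """Set of index terms in a text: tokenize, normalize, drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {NORMALIZE.get(t, t) for t in tokens} - STOPWORDS

d1 = "The quick brown fox jumped over the lazy dog's back."
d2 = "Now is the time for all good men to come to the aid of their party."

for term in sorted(terms(d1) | terms(d2)):
    print(f"{term:8s} {int(term in terms(d1))}  {int(term in terms(d2))}")
```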

Page 7: IR Theory: IR Basics & Vector Space Model

Step 2: Term Weighting


[Architecture diagram repeated from Page 3: Documents/Query → Representation Modules → Document/Query Representations → Matching/Ranking Module → Results]

Page 8: IR Theory: IR Basics & Vector Space Model

Term Weight: What & How?

What is term weight?
► Numerical estimate of term importance

How should we estimate the term importance?
► Terms that appear often in a document should get high weights
  • The more often a document contains the term “dog”, the more likely that the document is “about” dogs.
► Terms that appear in many documents should get low weights
  • Words like “the”, “a”, “of” appear in (nearly) all documents.
► Term frequency in long documents should count less than those in short ones

How do we compute it?
► Term frequency
► Inverse document frequency
► Document length


Page 9: IR Theory: IR Basics & Vector Space Model

Step 3: Matching/Ranking


[Architecture diagram repeated from Page 3: Documents/Query → Representation Modules → Document/Query Representations → Matching/Ranking Module → Results]

Page 10: IR Theory: IR Basics & Vector Space Model

Boolean vs. Vector Space Model

Boolean model
► Based on the notion of sets
  • Does not impose a ranking on retrieved documents
► Documents are retrieved only if they satisfy Boolean conditions specified in the query
  • Exact match

Vector space model
► Based on geometry, the notion of vectors in high-dimensional space
► Documents are ranked based on their similarity to the query
► Best/partial match


Page 11: IR Theory: IR Basics & Vector Space Model

Boolean Model: Overview

Weights assigned to terms are either “0” or “1”
► “0” represents “absence”: term isn’t in the document
► “1” represents “presence”: term is in the document

Build queries by combining terms with Boolean operators
► AND, OR, NOT

The system returns all documents that satisfy the query


[Venn diagrams over sets A and B: A OR B, A AND B, A AND NOT(B)]
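A sketch of Boolean retrieval over an inverted index of postings sets; the index contents and document IDs are made up for illustration.

```python
# Toy inverted index: term -> set of IDs of documents containing the term (made-up data)
index = {
    "party":      {1, 2, 4},
    "good":       {1, 3},
    "excellent":  {4},
    "democratic": {2},
}

def postings(term):
    return index.get(term, set())

# party AND good
print(postings("party") & postings("good"))                              # {1}
# party AND (good OR excellent)
print(postings("party") & (postings("good") | postings("excellent")))    # {1, 4}
# party AND NOT democratic
print(postings("party") - postings("democratic"))                        # {1, 4}
```

Set intersection, union, and difference implement AND, OR, and NOT exactly; every document either satisfies the query or it does not, with no ranking among the matches.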

Page 12: IR Theory: IR Basics & Vector Space Model

Boolean Model: Strength

Boolean operators define the relationship between query terms.
► AND → terms/concepts that are not equivalent/similar
  • party AND good: good party
  • Retrieves records that include all AND terms → Narrows the search
► OR → related terms, synonyms
  • party AND (good OR excellent OR wild): good party, excellent party, wild party
  • Retrieves records that include any OR term → Broadens the search
► NOT → antonyms, alternate terms for polysemes
  • party NOT democratic: Democratic party
  • Eliminates records that include the NOT term → Narrows the search

Precise, if you know the right strategies
► know what concepts to combine/exclude, and when to narrow/broaden

Efficient for the computer


Page 13: IR Theory: IR Basics & Vector Space Model

Boolean Model: Weakness

Natural language is way more complex
► Boolean logic insufficient to capture the richness of language
► AND “discovers” nonexistent relationships
  • Terms in different sentences, paragraphs, …: “Money is good, but I won’t be party to stealing.”
► Guessing terminology for OR is hard
  • good, nice, excellent, outstanding, awesome, …
► Guessing terms to exclude is even harder!
  • Democratic party, party to a lawsuit, …

No control over size of result set
► Too many documents or none
► All documents in the result set are considered “equally good”

No partial matching
► Documents that “don’t quite match” the query may also be useful.


Page 14: IR Theory: IR Basics & Vector Space Model

Vector Space Model: Pros & Cons

Pros
► Non-binary term weights
► Partial matching
► Ranked results
► Easy query formulation
► Query expansion

Cons
► Term relationships ignored
► Term order ignored
► No wildcard
► Problematic w/ long documents
► Similarity ≠ Relevance


Page 15: IR Theory: IR Basics & Vector Space Model

Vector Space Model: Representation

“Bags of words” can be represented as vectors
► Computational efficiency
► Ease of manipulation
► Geometric metaphor: “arrows”

A vector is a set of values recorded in any consistent order


“The quick brown fox jumped over the lazy dog’s back”

Bag of words: back 1, brown 1, dog 1, fox 1, jump 1, lazy 1, over 1, quick 1, the 2

Vector: (1, 1, 1, 1, 1, 1, 1, 1, 2)
► 1st position corresponds to “back”, 2nd to “brown”, 3rd to “dog”, 4th to “fox”, 5th to “jump”, 6th to “lazy”, 7th to “over”, 8th to “quick”, 9th to “the”
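A sketch of turning the bag of words into a vector by fixing an alphabetical term order; the tokenizer and the “jumped → jump, dog’s → dog” normalization are assumptions chosen so that the output matches the example above.

```python
import re
from collections import Counter

NORMALIZE = {"jumped": "jump", "dog's": "dog"}   # stand-in for real stemming (assumption)

def bag(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(NORMALIZE.get(t, t) for t in tokens)

b = bag("The quick brown fox jumped over the lazy dog's back")
vocab = sorted(b)                       # back, brown, dog, fox, jump, lazy, over, quick, the
vector = tuple(b[t] for t in vocab)
print(vector)                           # (1, 1, 1, 1, 1, 1, 1, 1, 2)
```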

Page 16: IR Theory: IR Basics & Vector Space Model

Vector Space Model: Ranked Retrieval

Order documents by “relevance”
► Relevance = how likely they are to be relevant to the information need
► Some documents are “better” than others
► Users can decide when to stop reading

Best (partial) match
► Documents need not have all query terms
► Documents with more query terms should be “better”

Estimate relevance with query-document similarity (sketched below)
1. Treat the query as if it were a document
   • Create a query bag-of-words
   • Compute term weights
2. Find its similarity to each document
3. Rank order the documents by similarity
   • Works surprisingly well
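The recipe above, as a generic ranking loop: the query is represented the same way as a document, every document is scored with some similarity function, and the results are sorted best-first. The similarity function is left as a parameter here; a tf·idf-weighted cosine version appears at the end of the section.

```python
def rank(query_vec, doc_vecs, similarity):
    """Return (doc_id, score) pairs sorted by descending similarity to the query."""
    scored = {doc_id: similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)

# Example with a plain dot-product similarity over equal-length vectors (illustrative)
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(rank((1, 0, 1), {"d1": (1, 1, 0), "d2": (1, 0, 1)}, dot))   # d2 ranks above d1
```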


Page 17: IR Theory: IR Basics & Vector Space Model

Vector Space Model: 3-D Example


[Figure: a rectangular (x, y, z) coordinate system showing vector A and its components Ax, Ay, Az]

A vector A in a 3-dimensional space
• Represented with initial point at the origin of a rectangular coordinate system
• Projections of A on the x, y, and z axes: Ax, Ay, and Az
  − the (rectangular) components of A in the x, y, and z directions
  − each axis represents a term (e.g., x = all, y = brown, z = cat)

Page 18: IR Theory: IR Basics & Vector Space Model

Vector Space Model: Postulate


Documents that are “close together” in vector space “talk about” the same things

[Figure: documents d1–d5 plotted as vectors in a space whose axes are terms t1, t2, t3; θ is the angle between two vectors]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Page 19: IR Theory: IR Basics & Vector Space Model

Similarity Measures: Set-based

Simple matching function, Dice’s coefficient, Jaccard’s coefficient (formulas below)

Example sets:
A = (wd1, wd2, wd3, wd4, wd5), B = (wd2, wd4, wd6)

A ∩ B: intersection of A and B
• the set of elements that belongs to both A and B
• A ∩ B = (wd2, wd4)

A ∪ B: union of A and B
• the set of elements that belongs to either A or B
• A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6)

|A|: cardinality of A
• the number of elements in A
• |A| = 5, |B| = 3, |A ∩ B| = 2, |A ∪ B| = 6

Similarity scores
• Simple: |A ∩ B| = 2
• Dice: 2·|A ∩ B| / (|A| + |B|) = 2·2/8 = 1/2
• Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3


Simple matching:       SIM(A, B) = |A ∩ B|
Dice’s coefficient:    SIM(A, B) = 2·|A ∩ B| / (|A| + |B|)
Jaccard’s coefficient: SIM(A, B) = |A ∩ B| / |A ∪ B|

Page 20: IR Theory: IR Basics & Vector Space Model

Similarity Measures: Set-based Example


Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)   |O1| = 4
O2 = (1, 0, 0, 0, 1, 1, 0, 0)   |O2| = 3
O3 = (1, 0, 0, 1, 1, 1, 0, 0)   |O3| = 4
O4 = (1, 1, 0, 1, 0, 1, 1, 0)   |O4| = 5
O5 = (1, 1, 1, 1, 0, 0, 1, 1)   |O5| = 6

|O1 ∩ O2| = |(A1)| = 1
|O1 ∩ O3| = |(A1, A4)| = 2
|O1 ∩ O4| = |(A1, A4)| = 2
|O1 ∩ O5| = |(A1, A3, A4, A8)| = 4
|O1 ∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6
|O1 ∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7
|O1 ∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6

Simple matching function: SIM(O1, B) = |O1 ∩ B|
        O2     O3     O4     O5
SIM     1      2      2      4
Rank    4      2      2      1

Dice’s coefficient: SIM(O1, B) = 2·|O1 ∩ B| / (|O1| + |B|)
        O2               O3               O4               O5
SIM     2·1/(4+3)=2/7    2·2/(4+4)=4/8    2·2/(4+5)=4/9    2·4/(4+6)=8/10
Rank    4                2                3                1

Jaccard’s coefficient: SIM(O1, B) = |O1 ∩ B| / |O1 ∪ B|
        O2     O3     O4     O5
SIM     1/6    2/6    2/7    4/6
Rank    4      2      3      1
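A sketch that recomputes the simple-matching, Dice, and Jaccard scores for the object–attribute data above (the attribute vectors are copied from the slide):

```python
def attrs(bits):
    """Convert a 0/1 attribute vector into the set of attribute indices that are 1."""
    return {i for i, b in enumerate(bits, start=1) if b}

O = {k: attrs(v) for k, v in {
    1: (1, 0, 1, 1, 0, 0, 0, 1),
    2: (1, 0, 0, 0, 1, 1, 0, 0),
    3: (1, 0, 0, 1, 1, 1, 0, 0),
    4: (1, 1, 0, 1, 0, 1, 1, 0),
    5: (1, 1, 1, 1, 0, 0, 1, 1),
}.items()}

simple  = lambda a, b: len(a & b)
dice    = lambda a, b: 2 * len(a & b) / (len(a) + len(b))
jaccard = lambda a, b: len(a & b) / len(a | b)

for k in (2, 3, 4, 5):
    print(f"O{k}: simple={simple(O[1], O[k])}  "
          f"dice={dice(O[1], O[k]):.3f}  jaccard={jaccard(O[1], O[k]):.3f}")
# O5 scores highest under all three measures (4, 8/10, 4/6), matching the tables above.
```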

Page 21: IR Theory: IR Basics & Vector Space Model

Similarity Measures: Vector-based

Cosine Similarity (n-dimensional space)
Dot/scalar product of the vectors divided by the product of the vector lengths

• Dot product = sum of the products of the corresponding components
  A = (A1, A2, A3, A4), B = (B1, B2, B3, B4)
  A•B = A1B1 + A2B2 + A3B3 + A4B4

• Vector length = square root of the sum of the squared components
  |A| = sqrt[(A1)² + (A2)² + (A3)² + (A4)²]
  |B| = sqrt[(B1)² + (B2)² + (B3)² + (B4)²]

Cosine Similarity (3-dimensional space)
  A = (Ax, Ay, Az), B = (Bx, By, Bz)


cos θ = (AxBx + AyBy + AzBz) / [ sqrt(Ax² + Ay² + Az²) * sqrt(Bx² + By² + Bz²) ]

cos θ = (A•B) / (|A| |B|) = Σ(i=1..n) AiBi / [ sqrt(Σ(i=1..n) Ai²) * sqrt(Σ(i=1..n) Bi²) ]
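The cosine formula as a small Python function (no external libraries; the example vectors are made up):

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = sqrt(sum(x * x for x in a))
    length_b = sqrt(sum(x * x for x in b))
    return dot / (length_a * length_b)

print(cosine((1, 2, 0), (2, 4, 0)))   # 1.0 -> parallel vectors, "about the same things"
print(cosine((1, 0, 0), (0, 1, 0)))   # 0.0 -> orthogonal vectors, no terms in common
```

Applying the same function to the O1–O5 vectors on the next page reproduces the similarity values worked out there.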

Page 22: IR Theory: IR Basics & Vector Space Model

Similarity Measures: Vector-based Example


Compute cosine similarities

Rank objects O2 through O5 by descending order of similarity to O1

Object – Attribute (feature) array
O1 = (1, 0, 1, 1, 0, 0, 0, 1)
O2 = (1, 0, 0, 0, 1, 1, 0, 0)
O3 = (1, 0, 0, 1, 1, 1, 0, 0)
O4 = (1, 1, 0, 1, 0, 1, 1, 0)
O5 = (1, 1, 1, 1, 0, 0, 1, 1)

|O1| = sqrt(1²+0²+1²+1²+0²+0²+0²+1²) = sqrt(4)
|O2| = sqrt(1²+0²+0²+0²+1²+1²+0²+0²) = sqrt(3)
|O3| = sqrt(1²+0²+0²+1²+1²+1²+0²+0²) = sqrt(4)
|O4| = sqrt(1²+1²+0²+1²+0²+1²+1²+0²) = sqrt(5)
|O5| = sqrt(1²+1²+1²+1²+0²+0²+1²+1²) = sqrt(6)

O1•O2 = (1*1+0*0+1*0+1*0+0*1+0*1+0*0+1*0) = 1
O1•O3 = (1*1+0*0+1*0+1*1+0*1+0*1+0*0+1*0) = 2
O1•O4 = (1*1+0*1+1*0+1*1+0*0+0*1+0*1+1*0) = 2
O1•O5 = (1*1+0*1+1*1+1*1+0*0+0*0+0*1+1*1) = 4

SIM(O1, O2) = 1 / (sqrt(4)*sqrt(3)) = 1/sqrt(12) ≈ 0.29
SIM(O1, O3) = 2 / (sqrt(4)*sqrt(4)) = 2/sqrt(16) = 0.50
SIM(O1, O4) = 2 / (sqrt(4)*sqrt(5)) = 2/sqrt(20) ≈ 0.45
SIM(O1, O5) = 4 / (sqrt(4)*sqrt(6)) = 4/sqrt(24) ≈ 0.82

Ranking: O5 > O3 > O4 > O2

cos θ = (A•B) / (|A| |B|) = Σ(i=1..n) AiBi / [ sqrt(Σ(i=1..n) Ai²) * sqrt(Σ(i=1..n) Bi²) ]

Page 23: IR Theory: IR Basics & Vector Space Model

Text Analysis: Word Frequency

TREC Volume 3 Corpus
► Number of documents: 336,310
► Total word occurrences: 125,720,891
► Unique words: 508,209

Zipf Distribution
► Rank * Frequency = constant
  • Population, Wealth, Popularity
► A few words are very common
► Most words are very rare

Term Weights
► Represent the ability of terms to identify relevant items & to distinguish them from non-relevant material
► Very common & very rare words are not very useful for indexing (Luhn, 1958)
► Good
  • Smaller index → Faster retrieval
► Bad
  • Lost gems & broken phrases


B. Croft (Umass)
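A quick way to eyeball the Zipf relationship on any text: count word frequencies, sort, and print rank × frequency for the top terms. The corpus file name is a placeholder assumption; any sufficiently large text will do.

```python
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read()      # placeholder corpus (assumption)
counts = Counter(re.findall(r"[a-z]+", text.lower()))

for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:3d} {word:12s} freq={freq:8d} rank*freq={rank * freq}")
```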

Page 24: IR Theory: IR Basics & Vector Space Model

Text Analysis: Term Weighting

Term Weighting Factors
► Term frequency (tf)
  • Number of times that a term occurs in a given document
    tf(dog, d1) = 2, tf(dog, d2) = 1; tf(fox, d1) = 3, tf(fox, d2) = 0; tf(party, d1) = 0, tf(party, d2) = 1
► Inverse document frequency (idf)
  • (Simple) 1 / number of documents in which the term occurs
    idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1
  • (Default) log(Nd / number of documents in which the term occurs), where Nd = number of documents in the collection
    idf(dog) = log(2/2) = 0, idf(fox) = log(2/1) = 0.3, idf(party) = log(2/1) = 0.3
► Document length (dlen)
  • Number of tokens in a document (token = an instance/occurrence of a word, not a unique word)
    dlen(d1) = 11, dlen(d2) = 10

tf⋅idf formula (below)


wki = tf * idf = fki * log(Nd / dk)
• wki = weight of term k in document i
• fki = frequency of term k in document i (tf)
• Nd = number of documents in collection
• dk = number of documents in which term k appears (postings)

Term frequencies in the two example documents:

Term     d1   d2
aid       0    1
all       0    1
back      1    0
brown     1    0
come      0    1
dog       2    1
fox       3    0
good      0    1
jump      1    0
lazy      1    0
men       0    1
now       0    1
over      1    0
party     0    1
quick     1    0
their     0    1
time      0    1
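A sketch of the wki = fki * log(Nd/dk) weighting for the two-document example above; the term counts are taken from the table, and log is base 10 as on the slide.

```python
from collections import Counter
from math import log10

# Term frequencies from the d1/d2 table above
tf = {
    "d1": {"back": 1, "brown": 1, "dog": 2, "fox": 3, "jump": 1,
           "lazy": 1, "over": 1, "quick": 1},
    "d2": {"aid": 1, "all": 1, "come": 1, "dog": 1, "good": 1,
           "men": 1, "now": 1, "party": 1, "their": 1, "time": 1},
}
N_d = len(tf)                                               # number of documents (2)
df = Counter(term for doc in tf.values() for term in doc)   # d_k: documents containing term k

def weight(term, doc):
    return tf[doc].get(term, 0) * log10(N_d / df[term])

print(round(weight("fox", "d1"), 2))    # 3 * log10(2/1) ~ 0.90
print(round(weight("dog", "d1"), 2))    # 2 * log10(2/2) =  0.0
```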

Page 25: IR Theory: IR Basics & Vector Space Model

Similarity Measures: using Term Weights


1. Compute term weights (e.g., tf*idf)

• Nd = 5 (number of documents in the collection)
• dk = number of documents in which term k appears: d1=5, d2=4, d3=1, d4=3, d5=2, d6=3, d7=4, d8=2

Document – Term array

wki = tf * idf = fki * log(Nd / dk)

Page 26: IR Theory: IR Basics & Vector Space Model

Similarity Measures: using Term Weights


2. Compute query-document cosine similarity with tf*idf weights

cos θ = (A•B) / (|A| |B|) = Σ(i=1..n) AiBi / [ sqrt(Σ(i=1..n) Ai²) * sqrt(Σ(i=1..n) Bi²) ]
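Putting the three steps together — representation, tf·idf term weighting, and cosine matching/ranking — here is a minimal end-to-end sketch. The toy collection, the tokenizer, and the absence of stopword removal and stemming are all simplifying assumptions.

```python
import re
from collections import Counter
from math import log10, sqrt

docs = {                                                   # made-up toy collection
    "d1": "the quick brown fox jumped over the lazy dog",
    "d2": "now is the time for all good men to come to the aid of their party",
    "d3": "the dog came to the good party",
}

def bag(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

bags = {doc_id: bag(text) for doc_id, text in docs.items()}
N = len(bags)
df = Counter(term for b in bags.values() for term in b)     # document frequency per term

def tfidf(b):
    """tf*idf weight vector (as a dict) for a bag of words, ignoring unseen terms."""
    return {t: f * log10(N / df[t]) for t, f in b.items() if t in df}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    denom = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / denom if denom else 0.0

def search(query):
    q = tfidf(bag(query))
    scored = {doc_id: cosine(q, tfidf(b)) for doc_id, b in bags.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)

print(search("good dog party"))     # ranks d3 first, then the partial matches
```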