STKI_-_Pertemuan_6

Vector Space Model - Imam Cholissodin -

Transcript of STKI_-_Pertemuan_6

Page 1: STKI_-_Pertemuan_6

Vector Space Model

- Imam Cholissodin -

Page 2: STKI_-_Pertemuan_6

Review : Boolean Retrieval

• Strengths and weaknesses: documents either match or they don't.

• Good for expert users with a precise understanding of their needs and of the collection (e.g., library search). Also good for applications: applications can easily consume thousands of results.

• Not good for the majority of users: most users are incapable of writing Boolean queries (or they can, but think it is too much work), and most users don't want to wade through thousands of results (e.g., web search).

Page 3: STKI_-_Pertemuan_6

Problem with Boolean Model

• Boolean queries often result in either too few (= 0) or too many (thousands of) results.

Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found” → 0 hits

Each space defaults to the Boolean operator “AND”.

Page 4: STKI_-_Pertemuan_6

Scoring as the basis of ranked retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.

• How can we rank-order the documents in the collection with respect to a query?

• Assign a score – say in [0, 1] – to each document.

• This score measures how well document and query “match”.

Page 5: STKI_-_Pertemuan_6

Model Comparison: Boolean & Vector

• Boolean: each document is represented by a binary vector ∈ {0,1}^|V|.

• Vector: each document is a count vector ∈ ℕ^|V|.

Binary term-document incidence matrix:

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Term-document count matrix:

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0
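The difference between the two matrices above can be sketched in a few lines of Python (the function names `binary_vector` and `count_vector` are illustrative, not from the slides):

```python
from collections import Counter

def binary_vector(doc_tokens, vocab):
    """Boolean model: 1 if the term occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if t in present else 0 for t in vocab]

def count_vector(doc_tokens, vocab):
    """Vector model: number of occurrences of each vocabulary term."""
    counts = Counter(doc_tokens)
    return [counts[t] for t in vocab]

vocab = ["antony", "brutus", "caesar"]
doc = ["antony", "antony", "caesar"]
print(binary_vector(doc, vocab))  # [1, 0, 1]
print(count_vector(doc, vocab))   # [2, 0, 1]
```

The binary vector only records presence, while the count vector keeps the raw frequencies used by the weighting schemes on the following slides.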

Page 6: STKI_-_Pertemuan_6

Structure : Vector Space Model

• Represent the “query” as a weighted tf-idf vector.

• Represent each “document” as a weighted tf-idf vector.

• Compute the “cosine similarity score” for the “query” vector and each “document” vector.

• “Rank documents” with respect to the query by score.

• Return the top K (e.g., K = 10) to the user.
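The pipeline above can be sketched end-to-end in Python. This is a minimal illustration that uses raw term counts rather than the tf-idf weights introduced later; the helper names (`vectorize`, `rank`) and the toy documents are assumptions, not from the slides:

```python
import math
from collections import Counter

def vectorize(tokens, vocab):
    """Map a token list to a count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs, k=10):
    """Score every document against the query and return the top K."""
    vocab = sorted({t for d in docs for t in d.split()} | set(query.split()))
    q_vec = vectorize(query.split(), vocab)
    scored = [(cosine(vectorize(d.split(), vocab), q_vec), d) for d in docs]
    return sorted(scored, reverse=True)[:k]

docs = ["brutus killed caesar", "caesar was ambitious", "the tempest"]
print(rank("brutus caesar", docs, k=2))
```

Swapping the count vectors for tf-idf weights (next slides) leaves the rest of the pipeline unchanged.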

Page 7: STKI_-_Pertemuan_6

Bag of Words : Vector Model

• The vector representation doesn’t consider the ordering of words in a document: “John is quicker than Mary” and “Mary is quicker than John” have the same vectors.

• This is called the bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents.
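The loss of word order can be seen directly with the slide's own example sentences:

```python
from collections import Counter

s1 = "John is quicker than Mary"
s2 = "Mary is quicker than John"

# Bag-of-words keeps only term counts, so word order is lost:
bow1 = Counter(s1.lower().split())
bow2 = Counter(s2.lower().split())
print(bow1 == bow2)  # True
```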

Page 8: STKI_-_Pertemuan_6

Term Frequency (tf)

• The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing query-document match scores. But how?

• Raw term frequency is not what we want:

  • A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.

  • But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

Page 9: STKI_-_Pertemuan_6

Log-Frequency Weighting

• The log frequency weight of term t in d is:

  w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
  w_t,d = 0                   otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

• Score for a document-query pair: sum over terms t in both q and d:

  score = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

• The score is 0 if none of the query terms is present in the document.
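The log-frequency weight is a one-line function in Python; this sketch reproduces the slide's example values:

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Matches the slide's examples: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```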

Page 10: STKI_-_Pertemuan_6

Inverse Document Frequency (idf)

• df_t is the document frequency of t: the number of documents that contain t. df is a measure of the informativeness of t.

• We define the idf (inverse document frequency) of t by:

  idf_t = log10(N / df_t)

  where N is the number of documents in the collection.

• We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.

Page 11: STKI_-_Pertemuan_6

tf-idf Weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

• Best known weighting scheme in information retrieval.

• Note: the “-” in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.

• Increases with the number of occurrences within a document.

• Increases with the rarity of the term in the collection.
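Combining the two components gives the tf-idf weight; a short sketch (the df values below are invented to illustrate the rare-vs-common contrast):

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf_weight(tf, N, df):
    """tf-idf = (1 + log10 tf) * log10(N / df); 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * idf(N, df)

# At equal tf, a rare term (df = 10) outweighs a common one (df = 5000):
print(tf_idf_weight(3, 10_000, 10))
print(tf_idf_weight(3, 10_000, 5000))
```

A term that occurs in every document gets idf = log10(N/N) = 0, so it contributes nothing to the score.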

Page 12: STKI_-_Pertemuan_6

Example Weight Matrix

• Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     5.25                  3.18           0            0       0        0.35
Brutus     1.21                  6.1            0            1       0        0
Caesar     8.59                  2.54           0            1.51    0.25     0
Calpurnia  0                     1.54           0            0       0        0
Cleopatra  2.85                  0              0            0       0        0
mercy      1.51                  0              1.9          0.12    5.25     0.88
worser     1.37                  0              0.11         4.15    0.25     1.95

Page 13: STKI_-_Pertemuan_6

Example Weight Matrix (After Normalization)

• Each document vector is now length-normalized:

  w'_t,d = w_t,d / √( Σ_{t=1}^{n} w_t,d² )

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet     Othello  Macbeth
Antony     0.489364639           0.424394021    0            0          0        0.16145327
Brutus     0.112786898           0.814089159    0            0.2207715  0        0
Caesar     0.800693761           0.338981387    0            0.333365   0.04751  0
Calpurnia  0                     0.205524148    0            0          0        0
Cleopatra  0.26565509            0              0            0          0        0
mercy      0.140750591           0              0.998328301  0.0264926  0.99774  0.40593964
worser     0.127700868           0              0.057797954  0.9162019  0.04751  0.89952535
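Length normalization divides each weight by the vector's Euclidean length. A sketch, using the "Antony and Cleopatra" column of the tf-idf weight matrix from the previous slide:

```python
import math

def l2_normalize(weights):
    """Divide each weight by the vector's Euclidean (L2) length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

# The "Antony and Cleopatra" tf-idf column:
col = [5.25, 1.21, 8.59, 0, 2.85, 1.51, 1.37]
normalized = l2_normalize(col)
print([round(w, 4) for w in normalized])
# A normalized vector has unit length:
print(round(sum(w * w for w in normalized), 6))  # 1.0
```

The first entry comes out as ≈ 0.4894, matching the normalized matrix above.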

Page 14: STKI_-_Pertemuan_6

Exercise : tf-idf

• Given a document containing terms with the following frequencies:

  A(3), B(2), C(1)

  Assume the collection contains 10,000 documents and the document frequencies of these terms are:

  A(50), B(1300), C(250)

Then:

A: tf = 3; idf = ; log freq weight = ; tf-idf weight = ;
B: tf = 2; idf = ; log freq weight = ; tf-idf weight = ;
C: tf = 1; idf = ; log freq weight = ; tf-idf weight = ;
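One way to check your answers numerically, using base-10 logarithms as on the preceding slides (a verification sketch, not the official answer key):

```python
import math

N = 10_000
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term: (tf, df)

for term, (tf, df) in terms.items():
    idf = math.log10(N / df)
    log_tf = 1 + math.log10(tf)
    print(f"{term}: idf = {idf:.3f}, log-tf = {log_tf:.3f}, tf-idf = {log_tf * idf:.3f}")
```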

Page 15: STKI_-_Pertemuan_6

Documents & Query as Vectors

• Document: so we have a |V|-dimensional vector space. Terms are axes of the space; documents are points or vectors in this space.

• Query:
  Key idea 1: do the same for queries: represent them as vectors in the space.
  Key idea 2: rank documents according to their proximity to the query in this space. proximity = similarity of vectors; proximity ≈ inverse of distance.

Page 16: STKI_-_Pertemuan_6

Graphic Representation : Di & Q

Example (three terms T1, T2, T3 as axes):

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the 3-dimensional term space spanned by T1, T2, T3.]

• Is D1 or D2 more similar to Q?

• How to measure the degree of similarity?
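The slide's questions can be answered numerically with cosine similarity (defined on the next slide); a quick sketch:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

D1 = [2, 3, 5]  # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # 3T1 + 7T2 + T3
Q = [0, 0, 2]   # 0T1 + 0T2 + 2T3

print(round(cosine(D1, Q), 3))  # ≈ 0.811
print(round(cosine(D2, Q), 3))  # ≈ 0.13
```

D1 is more similar to Q: it has a large T3 component, and Q lies entirely along the T3 axis.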

Page 17: STKI_-_Pertemuan_6

Similarity Measure

• A similarity measure is a function that computes the degree of similarity between two vectors.

• Using a similarity measure between the query and each document:

  • It is possible to rank documents.

  • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Page 18: STKI_-_Pertemuan_6

Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.

• Without normalization of the weights w_t,d:

  CosSim(d_j, q) = (d_j · q) / (|d_j| · |q|)
                 = Σ_{i=1}^{t} (w_ij · w_iq) / √( Σ_{i=1}^{t} w_ij² · Σ_{i=1}^{t} w_iq² )

• With normalized weights w_t,d, i.e.

  w'_t,d = w_t,d / √( Σ_{t=1}^{n} w_t,d² ),

  the denominator equals 1, so:

  CosSim(d_j, q) = Σ_{i=1}^{t} (w'_ij · w'_iq)

Page 19: STKI_-_Pertemuan_6

Example : Cosine similarity amongst 3 documents

• How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term       SaS  PaP  WH
affection  115  58   20
jealous    10   7    11
gossip     2    0    6
wuthering  0    0    38

Note: to simplify this example, we don't do idf weighting.

Page 20: STKI_-_Pertemuan_6

Answer : Cosine similarity amongst 3 documents

Note: to simplify this example, we don't do idf weighting.

• Log frequency weighting:

term       SaS        PaP          WH
affection  3.0606978  2.763427994  2.301029996
jealous    2          1.84509804   2.041392685
gossip     1.30103    0            1.77815125
wuthering  0          0            2.579783597

• CosSim:

     SaS  PaP          WH
SaS  1    0.942083434  0.788681945
PaP       1            0.694003346
WH                     1
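The similarity matrix above can be reproduced with a short script, starting from the raw counts on the previous slide:

```python
import math

def log_tf(tf):
    """Log-frequency weight, as on the earlier slide."""
    return 1 + math.log10(tf) if tf > 0 else 0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw counts for affection, jealous, gossip, wuthering:
SaS = [115, 10, 2, 0]
PaP = [58, 7, 0, 0]
WH = [20, 11, 6, 38]

SaS_w, PaP_w, WH_w = ([log_tf(tf) for tf in d] for d in (SaS, PaP, WH))

print(round(cosine(SaS_w, PaP_w), 4))  # ≈ 0.9421
print(round(cosine(SaS_w, WH_w), 4))   # ≈ 0.7887
print(round(cosine(PaP_w, WH_w), 4))   # ≈ 0.694
```

As expected, the two Austen novels (SaS, PaP) are far more similar to each other than either is to Wuthering Heights.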

Page 21: STKI_-_Pertemuan_6

tf.idf weighting has many variants

• Columns are acronyms for weight schemes. Chart borrowed from Salton & Buckley (1988).

Page 22: STKI_-_Pertemuan_6

Thank You

Page 23: STKI_-_Pertemuan_6

Exercise: Compute CosSim amongst 4 documents

D1: Document 1, D2: Document 2, D3: Document 3, D4: Document 4

tf (counts):

term  D1  D2  D3  D4
t1    50  32  4   15
t2    0   16  0   10
t3    42  2   0   23
t4    18  0   17  19

Page 24: STKI_-_Pertemuan_6

Exercise: Compute CosSim amongst 4 documents (continued)

• Assume the total number of documents in the collection is 1000.

df (counts):

term  df
t1    24
t2    50
t3    11
t4    9
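A sketch of how the full computation could be organized, using log-frequency weighting and log10 idf as on the earlier slides (a verification aid, not the official answer key):

```python
import math

N = 1000
counts = {  # term: counts per D1..D4
    "t1": [50, 32, 4, 15],
    "t2": [0, 16, 0, 10],
    "t3": [42, 2, 0, 23],
    "t4": [18, 0, 17, 19],
}
df = {"t1": 24, "t2": 50, "t3": 11, "t4": 9}

def tf_idf(tf, term):
    """(1 + log10 tf) * log10(N / df), or 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

# Build one tf-idf vector per document (D1..D4).
vectors = [[tf_idf(counts[t][j], t) for t in counts] for j in range(4)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for j in range(1, 4):
    print(f"CosSim(D1, D{j + 1}) = {cosine(vectors[0], vectors[j]):.4f}")
```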