STKI_-_Pertemuan_6

Vector Space Model - Imam Cholissodin -

Transcript of STKI_-_Pertemuan_6

Page 1: STKI_-_Pertemuan_6

Vector Space Model

- Imam Cholissodin -

Page 2: STKI_-_Pertemuan_6

Review : Boolean Retrieval

• Strengths and weaknesses: documents either match or they don't.

• Good for expert users with a precise understanding of their needs and of the collection (e.g., library search). Also good for applications: applications can easily consume thousands of results.

• Not good for the majority of users: most users are incapable of writing Boolean queries (or they can, but think it is too much work), and most users don't want to wade through thousands of results (e.g., web search).

Page 3: STKI_-_Pertemuan_6

Problem with Boolean Model

• Boolean queries often result in either too few (= 0) or too many (thousands of) results.

Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found” → 0 hits

Each space defaults to the Boolean operator “AND”.

Page 4: STKI_-_Pertemuan_6

Scoring as the basis of ranked retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.

• How can we rank-order the documents in the collection with respect to a query?

• Assign a score – say in [0, 1] – to each document.

• This score measures how well document and query “match”.

Page 5: STKI_-_Pertemuan_6

Model Comparison: Boolean & Vector

• Boolean: each document is represented by a binary vector ∈ {0,1}^|V|.

• Vector: each document is a count vector ∈ ℕ^|V|.

Binary term-document incidence matrix:

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

Term-document count matrix:

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0
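The difference between the two matrices above can be sketched in a few lines of Python (the function names `binary_vector` and `count_vector` are illustrative, not from the slides):

```python
from collections import Counter

def binary_vector(doc_tokens, vocab):
    """Boolean model: 1 if the term occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if t in present else 0 for t in vocab]

def count_vector(doc_tokens, vocab):
    """Vector model: number of occurrences of each vocabulary term."""
    counts = Counter(doc_tokens)
    return [counts[t] for t in vocab]

vocab = ["antony", "brutus", "caesar"]
doc = ["antony", "antony", "caesar"]
print(binary_vector(doc, vocab))  # [1, 0, 1]
print(count_vector(doc, vocab))   # [2, 0, 1]
```

The binary vector only records presence, while the count vector keeps the raw frequencies used by the weighting schemes on the following slides.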

Page 6: STKI_-_Pertemuan_6

Structure : Vector Space Model

• Represent the “query” as a weighted tf-idf vector.

• Represent each “document” as a weighted tf-idf vector.

• Compute the “cosine similarity score” for the “query” vector and each “document” vector.

• “Rank documents” with respect to the query by score.

• Return the top K (e.g., K = 10) to the user.
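The pipeline above can be sketched end-to-end in Python. This is a minimal illustration that uses raw term counts rather than the tf-idf weights introduced later; the helper names (`vectorize`, `rank`) and the toy documents are assumptions, not from the slides:

```python
import math
from collections import Counter

def vectorize(tokens, vocab):
    """Map a token list to a count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs, k=10):
    """Score every document against the query and return the top K."""
    vocab = sorted({t for d in docs for t in d.split()} | set(query.split()))
    q_vec = vectorize(query.split(), vocab)
    scored = [(cosine(vectorize(d.split(), vocab), q_vec), d) for d in docs]
    return sorted(scored, reverse=True)[:k]

docs = ["brutus killed caesar", "caesar was ambitious", "the tempest"]
print(rank("brutus caesar", docs, k=2))
```

Swapping the count vectors for tf-idf weights (next slides) leaves the rest of the pipeline unchanged.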

Page 7: STKI_-_Pertemuan_6

Bag of Words : Vector Model

• The vector representation doesn’t consider the ordering of words in a document: “John is quicker than Mary” and “Mary is quicker than John” have the same vectors.

• This is called the bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents.
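The loss of word order can be seen directly with the slide's own example sentences:

```python
from collections import Counter

s1 = "John is quicker than Mary"
s2 = "Mary is quicker than John"

# Bag-of-words keeps only term counts, so word order is lost:
bow1 = Counter(s1.lower().split())
bow2 = Counter(s2.lower().split())
print(bow1 == bow2)  # True
```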

Page 8: STKI_-_Pertemuan_6

Term Frequency (tf)

• The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing query-document match scores. But how?

• Raw term frequency is not what we want:

  • A document with 10 occurrences of the term may be more relevant than a document with one occurrence of the term.

  • But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

Page 9: STKI_-_Pertemuan_6

Log-Frequency Weighting

• The log frequency weight of term t in d is:

  w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
  w_t,d = 0                   otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

• Score for a document-query pair: sum over terms t in both q and d:

  score = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)

• The score is 0 if none of the query terms is present in the document.
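The log-frequency weight is a one-line function in Python; this sketch reproduces the slide's example values:

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Matches the slide's examples: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4
for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```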

Page 10: STKI_-_Pertemuan_6

Inverse Document Frequency (idf)

• df_t is the document frequency of t: the number of documents that contain t. df is a measure of the informativeness of t.

• We define the idf (inverse document frequency) of t by:

  idf_t = log10(N / df_t)

  where N is the number of documents in the collection.

• We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.

Page 11: STKI_-_Pertemuan_6

tf-idf Weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

• Best known weighting scheme in information retrieval.

• Note: the “-” in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.

• Increases with the number of occurrences within a document.

• Increases with the rarity of the term in the collection.
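Combining the two components gives the tf-idf weight; a short sketch (the df values below are invented to illustrate the rare-vs-common contrast):

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf_weight(tf, N, df):
    """tf-idf = (1 + log10 tf) * log10(N / df); 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * idf(N, df)

# At equal tf, a rare term (df = 10) outweighs a common one (df = 5000):
print(tf_idf_weight(3, 10_000, 10))
print(tf_idf_weight(3, 10_000, 5000))
```

A term that occurs in every document gets idf = log10(N/N) = 0, so it contributes nothing to the score.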

Page 12: STKI_-_Pertemuan_6

Example Weight Matrix

• Each document is now represented by a real-valued vector of tf-idf weights ∈ R|V|

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     5.25                  3.18           0            0       0        0.35
Brutus     1.21                  6.1            0            1       0        0
Caesar     8.59                  2.54           0            1.51    0.25     0
Calpurnia  0                     1.54           0            0       0        0
Cleopatra  2.85                  0              0            0       0        0
mercy      1.51                  0              1.9          0.12    5.25     0.88
worser     1.37                  0              0.11         4.15    0.25     1.95

Page 13: STKI_-_Pertemuan_6

Example Weight Matrix (After Normalization)

• Each document vector is now length-normalized:

  w'_t,d = w_t,d / √( Σ_{t=1}^{n} w_t,d² )

term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet     Othello  Macbeth
Antony     0.489364639           0.424394021    0            0          0        0.16145327
Brutus     0.112786898           0.814089159    0            0.2207715  0        0
Caesar     0.800693761           0.338981387    0            0.333365   0.04751  0
Calpurnia  0                     0.205524148    0            0          0        0
Cleopatra  0.26565509            0              0            0          0        0
mercy      0.140750591           0              0.998328301  0.0264926  0.99774  0.40593964
worser     0.127700868           0              0.057797954  0.9162019  0.04751  0.89952535
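Length normalization divides each weight by the vector's Euclidean length. A sketch, using the "Antony and Cleopatra" column of the tf-idf weight matrix from the previous slide:

```python
import math

def l2_normalize(weights):
    """Divide each weight by the vector's Euclidean (L2) length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

# The "Antony and Cleopatra" tf-idf column:
col = [5.25, 1.21, 8.59, 0, 2.85, 1.51, 1.37]
normalized = l2_normalize(col)
print([round(w, 4) for w in normalized])
# A normalized vector has unit length:
print(round(sum(w * w for w in normalized), 6))  # 1.0
```

The first entry comes out as ≈ 0.4894, matching the normalized matrix above.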

Page 14: STKI_-_Pertemuan_6

Exercise : tf-idf

• Given a document containing terms with the following frequencies:

  A(3), B(2), C(1)

  Assume the collection contains 10,000 documents and the document frequencies of these terms are:

  A(50), B(1300), C(250)

Then:

A: tf = 3; idf = ; log freq weight = ; tf-idf weight = ;
B: tf = 2; idf = ; log freq weight = ; tf-idf weight = ;
C: tf = 1; idf = ; log freq weight = ; tf-idf weight = ;
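One way to check your answers numerically, using base-10 logarithms as on the preceding slides (a verification sketch, not the official answer key):

```python
import math

N = 10_000
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term: (tf, df)

for term, (tf, df) in terms.items():
    idf = math.log10(N / df)
    log_tf = 1 + math.log10(tf)
    print(f"{term}: idf = {idf:.3f}, log-tf = {log_tf:.3f}, tf-idf = {log_tf * idf:.3f}")
```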

Page 15: STKI_-_Pertemuan_6

Documents & Query as Vectors

• Document: so we have a |V|-dimensional vector space. Terms are axes of the space; documents are points or vectors in this space.

• Query:
  Key idea 1: do the same for queries: represent them as vectors in the space.
  Key idea 2: rank documents according to their proximity to the query in this space. proximity = similarity of vectors; proximity ≈ inverse of distance.

Page 16: STKI_-_Pertemuan_6

Graphic Representation : Di & Q

Example (three terms T1, T2, T3 as axes):

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the 3-dimensional term space spanned by T1, T2, T3.]

• Is D1 or D2 more similar to Q?

• How to measure the degree of similarity?
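The slide's questions can be answered numerically with cosine similarity (defined on the next slide); a quick sketch:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

D1 = [2, 3, 5]  # 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # 3T1 + 7T2 + T3
Q = [0, 0, 2]   # 0T1 + 0T2 + 2T3

print(round(cosine(D1, Q), 3))  # ≈ 0.811
print(round(cosine(D2, Q), 3))  # ≈ 0.13
```

D1 is more similar to Q: it has a large T3 component, and Q lies entirely along the T3 axis.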

Page 17: STKI_-_Pertemuan_6

Similarity Measure

• A similarity measure is a function that computes the degree of similarity between two vectors.

• Using a similarity measure between the query and each document:

  • It is possible to rank documents.

  • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Page 18: STKI_-_Pertemuan_6

Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.

• Without normalization of the weights w_t,d:

  CosSim(d_j, q) = (d_j · q) / (|d_j| · |q|)
                 = Σ_{i=1}^{t} (w_ij · w_iq) / √( Σ_{i=1}^{t} w_ij² · Σ_{i=1}^{t} w_iq² )

• With normalized weights w_t,d, i.e.

  w'_t,d = w_t,d / √( Σ_{t=1}^{n} w_t,d² ),

  the denominator equals 1, so:

  CosSim(d_j, q) = Σ_{i=1}^{t} (w'_ij · w'_iq)

Page 19: STKI_-_Pertemuan_6

Example : Cosine similarity amongst 3 documents

• How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term       SaS  PaP  WH
affection  115  58   20
jealous    10   7    11
gossip     2    0    6
wuthering  0    0    38

Note: to simplify this example, we don't do idf weighting.

Page 20: STKI_-_Pertemuan_6

Answer : Cosine similarity amongst 3 documents

Note: to simplify this example, we don't do idf weighting.

• Log frequency weighting:

term       SaS        PaP          WH
affection  3.0606978  2.763427994  2.301029996
jealous    2          1.84509804   2.041392685
gossip     1.30103    0            1.77815125
wuthering  0          0            2.579783597

• CosSim:

     SaS  PaP          WH
SaS  1    0.942083434  0.788681945
PaP       1            0.694003346
WH                     1
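The similarity matrix above can be reproduced with a short script, starting from the raw counts on the previous slide:

```python
import math

def log_tf(tf):
    """Log-frequency weight, as on the earlier slide."""
    return 1 + math.log10(tf) if tf > 0 else 0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw counts for affection, jealous, gossip, wuthering:
SaS = [115, 10, 2, 0]
PaP = [58, 7, 0, 0]
WH = [20, 11, 6, 38]

SaS_w, PaP_w, WH_w = ([log_tf(tf) for tf in d] for d in (SaS, PaP, WH))

print(round(cosine(SaS_w, PaP_w), 4))  # ≈ 0.9421
print(round(cosine(SaS_w, WH_w), 4))   # ≈ 0.7887
print(round(cosine(PaP_w, WH_w), 4))   # ≈ 0.694
```

As expected, the two Austen novels (SaS, PaP) are far more similar to each other than either is to Wuthering Heights.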

Page 21: STKI_-_Pertemuan_6

tf.idf weighting has many variants

• Columns are acronyms for weight schemes. Chart borrowed from Salton & Buckley (1988).

Page 22: STKI_-_Pertemuan_6

Thank You

Page 23: STKI_-_Pertemuan_6

Exercise: Compute CosSim amongst 4 documents

D1: Document 1, D2: Document 2, D3: Document 3, D4: Document 4

tf (counts):

term  D1  D2  D3  D4
t1    50  32  4   15
t2    0   16  0   10
t3    42  2   0   23
t4    18  0   17  19

Page 24: STKI_-_Pertemuan_6

Exercise: Compute CosSim amongst 4 documents (continued)

• Assume the total number of documents in the collection is 1000.

df (counts):

term  df
t1    24
t2    50
t3    11
t4    9
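A sketch of how the full computation could be organized, using log-frequency weighting and log10 idf as on the earlier slides (a verification aid, not the official answer key):

```python
import math

N = 1000
counts = {  # term: counts per D1..D4
    "t1": [50, 32, 4, 15],
    "t2": [0, 16, 0, 10],
    "t3": [42, 2, 0, 23],
    "t4": [18, 0, 17, 19],
}
df = {"t1": 24, "t2": 50, "t3": 11, "t4": 9}

def tf_idf(tf, term):
    """(1 + log10 tf) * log10(N / df), or 0 when tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df[term])

# Build one tf-idf vector per document (D1..D4).
vectors = [[tf_idf(counts[t][j], t) for t in counts] for j in range(4)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for j in range(1, 4):
    print(f"CosSim(D1, D{j + 1}) = {cosine(vectors[0], vectors[j]):.4f}")
```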