
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

M. Soleymani

Fall 2016

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Vector space model: pros

Partial matching of queries and docs

dealing with the case where no doc contains all search terms

Ranking according to similarity score

Term weighting schemes

improve retrieval performance

Various extensions

Relevance feedback (modifying query vector)

Doc clustering and classification


Problems with lexical semantics

Ambiguity and association in natural language

Polysemy: words often have a multitude of meanings and different types of usage.

More severe in very heterogeneous collections.

The vector space model is unable to discriminate between different meanings of the same word.

Problems with lexical semantics

Synonymy: different terms may have identical or similar meanings (weaker: words indicating the same topic).

No associations between words are made in the vector space representation.

Polysemy and context

Doc similarity on single word level: polysemy and context

[Figure: two meanings of the ambiguous word "saturn". Meaning 1 (astronomy): ring, jupiter, space, voyager, planet. Meaning 2 (cars): car, company, dodge, ford. The word contributes to similarity if both docs use it in the 1st meaning, but not if one of them uses it in the 2nd.]

SVD


Latent Semantic Indexing (LSI)

Perform a low-rank approximation of the term-doc matrix (typical rank: 100–300).

Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense).

General idea: map docs (and terms) to a low-dimensional space.

Design the mapping such that the low-dimensional space reflects semantic associations (the latent semantic space).

Compute doc similarity based on the inner product in this latent semantic space.

Goals of LSI

Similar terms map to similar locations in the low-dimensional space.

Noise reduction by dimension reduction.

Term-document matrix

This matrix is the basis for computing similarity between docs and queries. Can we transform this matrix so that we get a better measure of similarity between docs and queries? …

Singular Value Decomposition (SVD)

For an 𝑀 × 𝑁 matrix 𝐴 of rank 𝑟 there exists a factorization (Singular Value Decomposition = SVD):

𝐴 = 𝑈Σ𝑉ᵀ
(𝑈 is 𝑀 × 𝑀, Σ is 𝑀 × 𝑁, 𝑉ᵀ is 𝑁 × 𝑁)

The columns of 𝑈 are orthogonal eigenvectors of 𝐴𝐴ᵀ.

The columns of 𝑉 are orthogonal eigenvectors of 𝐴ᵀ𝐴.

Singular values: the eigenvalues 𝜆1, …, 𝜆𝑟 of 𝐴𝐴ᵀ are also the eigenvalues of 𝐴ᵀ𝐴, and

Σ = diag(𝜎1, …, 𝜎𝑟), where 𝜎𝑖 = √𝜆𝑖.

Typically, the singular values are arranged in decreasing order.

Singular Value Decomposition (SVD)

Truncated SVD: 𝐴 = 𝑈Σ𝑉ᵀ, where

𝑈 is 𝑀 × min(𝑀, 𝑁), Σ is min(𝑀, 𝑁) × min(𝑀, 𝑁), and 𝑉ᵀ is min(𝑀, 𝑁) × 𝑁.
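A small NumPy illustration of these shapes (a sketch, not part of the original slides): full_matrices=True gives the full SVD, full_matrices=False the truncated form.

```python
import numpy as np

M, N = 5, 3
A = np.random.default_rng(1).standard_normal((M, N))

# Full SVD: U is M x M, Vt is N x N (Sigma would be M x N with zero rows).
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, s.shape, Vt_full.shape)   # (5, 5) (3,) (3, 3)

# Truncated SVD: U is M x min(M, N), Vt is min(M, N) x N.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)             # (5, 3) (3,) (3, 3)
print(np.allclose(A, U @ np.diag(s) @ Vt))    # True
```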

SVD example

M=3, N=2

𝐴 =
[ 1  −1 ]
[ 0   1 ]
[ 1   0 ]

Its (full) SVD is:

𝐴 =
[ 0      2/√6    1/√3 ]   [ 1   0 ]   [ 1/√2    1/√2 ]
[ 1/√2  −1/√6    1/√3 ]   [ 0  √3 ]   [ 1/√2   −1/√2 ]
[ 1/√2   1/√6   −1/√3 ]   [ 0   0 ]

Or equivalently, in truncated form (dropping the zero row of Σ and the last column of 𝑈):

𝐴 =
[ 0      2/√6 ]   [ 1   0 ]   [ 1/√2    1/√2 ]
[ 1/√2  −1/√6 ]   [ 0  √3 ]   [ 1/√2   −1/√2 ]
[ 1/√2   1/√6 ]
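A quick numerical check of this example (a NumPy sketch, not part of the original slides). Note that np.linalg.svd returns the singular values in decreasing order, so √3 ≈ 1.732 comes first, and the signs of the singular vectors may differ from the ones written above.

```python
import numpy as np

# The 3x2 example matrix from above.
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(s)                                      # [1.732..., 1.0] = [sqrt(3), 1]

# Singular values are the square roots of the eigenvalues of A^T A.
print(np.sqrt(np.linalg.eigvalsh(A.T @ A)))   # [1.0, 1.732...]

# Reconstruct A from the truncated factors.
print(np.allclose(A, U[:, :2] @ np.diag(s) @ Vt))   # True
```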

Example


We use a non-weighted matrix here to simplify the example.

Example of 𝐶 = 𝑈Σ𝑉𝑇: All four matrices


𝐶 = 𝑈Σ𝑉𝑇

Example of 𝐶 = 𝑈Σ𝑉𝑇: matrix 𝑈

Columns: “semantic” dimensions (distinct topics like politics, sports, ...)

𝑢𝑖𝑗: how strongly related term 𝑖 is to the topic in column 𝑗.

One row per term; one column per dimension (min(𝑀, 𝑁) columns).

Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix Σ


Singular value:

“measures the importance of the corresponding semantic dimension”.

We’ll make use of this by omitting unimportant dimensions.

A square, diagonal matrix of size min(𝑀, 𝑁) × min(𝑀, 𝑁).

Example of 𝐶 = 𝑈Σ𝑉𝑇: The matrix 𝑉𝑇

Columns of 𝑉: “semantic” dimensions

𝑣𝑖𝑗: how strongly related doc 𝑖 is to the topic in column 𝑗.

In 𝑉ᵀ: one column per doc; one row per dimension (min(𝑀, 𝑁) rows).

Matrix decomposition: Summary

We’ve decomposed the term-doc matrix 𝐶 into a product of three matrices:

𝑈: consists of one (row) vector for each term

𝑉ᵀ: consists of one (column) vector for each doc

Σ: diagonal matrix with singular values, reflecting the importance of each dimension

Next: why are we doing this?

LSI: Overview

Decompose the term-doc matrix 𝐶 into a product of matrices using SVD:

𝐶 = 𝑈Σ𝑉ᵀ

Use the columns of 𝑈 and 𝑉 that correspond to the largest values in the diagonal matrix Σ as term and document dimensions in the new space.

SVD used for this purpose is called LSI.

Solution via SVD

Low-rank approximation: set the smallest 𝑟 − 𝑘 singular values to zero, i.e. retain only the 𝑘 largest singular values:

𝐴𝑘 = 𝑈 diag(𝜎1, …, 𝜎𝑘, 0, …, 0) 𝑉ᵀ

Equivalently, keep only the first 𝑘 columns of 𝑈, the top-left 𝑘 × 𝑘 block of Σ, and the first 𝑘 rows of 𝑉ᵀ: (𝑀 × 𝑁) ≈ (𝑀 × 𝑘)(𝑘 × 𝑘)(𝑘 × 𝑁).

In column notation, 𝐴𝑘 is a sum of rank-1 matrices:

𝐴𝑘 = 𝜎1𝑢1𝑣1ᵀ + ⋯ + 𝜎𝑘𝑢𝑘𝑣𝑘ᵀ

Low-rank approximation

Approximation problem: given a matrix 𝐴, find a matrix 𝐴𝑘 of rank 𝑘 (i.e. a matrix with 𝑘 linearly independent rows or columns) that minimizes the Frobenius norm of the error:

𝐴𝑘 = arg min over 𝑋 with rank(𝑋) = 𝑘 of ‖𝐴 − 𝑋‖𝐹, where ‖𝑋‖𝐹 = √(Σ𝑖 Σ𝑗 𝑋𝑖𝑗²)

𝐴𝑘 and 𝑋 are both 𝑀 × 𝑁 matrices. Typically, we want 𝑘 ≪ 𝑟.

SVD can be used to compute optimal low-rank approximations: keeping the 𝑘 largest singular values and setting all others to zero gives the optimal rank-𝑘 approximation [Eckart-Young]. No matrix of rank 𝑘 approximates 𝐴 better than 𝐴𝑘.

Approximation error

How good (bad) is this approximation? It is the best possible, measured by the Frobenius norm of the error:

min over 𝑋 with rank(𝑋) = 𝑘 of ‖𝐴 − 𝑋‖𝐹 = ‖𝐴 − 𝐴𝑘‖𝐹 = √(𝜎𝑘+1² + ⋯ + 𝜎𝑟²)

where 𝐴𝑘 = 𝑈 diag(𝜎1, …, 𝜎𝑘, 0, …, 0) 𝑉ᵀ and the 𝜎𝑖 are ordered such that 𝜎𝑖 ≥ 𝜎𝑖+1. (In the 2-norm, the error is exactly 𝜎𝑘+1.)

This suggests why the Frobenius error drops as 𝑘 increases.
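A numerical illustration of this (a NumPy sketch, not from the slides): build 𝐴𝑘 by truncating the SVD and compare the approximation errors with the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))             # a toy M x N matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k largest singular values

# Frobenius error equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
print(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(s[k:] ** 2)))

# 2-norm (spectral) error equals sigma_{k+1}.
print(np.linalg.norm(A - A_k, 2), s[k])
```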

SVD Low-rank approximation

A term-doc matrix 𝐶 may have 𝑀 = 50,000 terms and 𝑁 = 10⁶ docs, with rank close to 50,000.

Construct an approximation 𝐶100 with rank 100: of all rank-100 matrices, it has the lowest Frobenius error.

Great … but why would we want to do this?

Answer: Latent Semantic Indexing

C. Eckart and G. Young, “The approximation of a matrix by another of lower rank,” Psychometrika, 1:211–218, 1936.

Recall unreduced decomposition 𝐶 = 𝑈Σ𝑉𝑇


Reducing the dimensionality to 2


Original matrix 𝐶 vs. reduced 𝐶2 = 𝑈Σ2𝑉𝑇

𝐶2 is a two-dimensional representation of 𝐶: we have performed a dimensionality reduction to two dimensions.

Why is the reduced matrix “better”?

Similarity of d2 and d3 in the original space: 0.

Similarity of d2 and d3 in the reduced space:
0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52
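This dot product is easy to reproduce (a sketch; the two vectors below are simply the d2 and d3 columns of the reduced matrix as printed on the slide):

```python
import numpy as np

# d2 and d3 columns of the reduced matrix C2, as printed on the slide.
d2 = np.array([0.52, 0.36, 0.72, 0.12, -0.39])
d3 = np.array([0.28, 0.16, 0.36, 0.20, -0.08])
print(d2 @ d3)   # ~0.52
```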

Why is the reduced matrix “better”?

“boat” and “ship” are semantically similar.

The “reduced” similarity measure reflects this.

What property of the SVD reduction is responsible for improved similarity?

Example (k = 2) [Example from Dumais et al.]

[Figures: the term-document matrix of the Dumais et al. example and its truncated factors 𝑈𝑘, Σ𝑘, 𝑉𝑘ᵀ.]

[Figure: two-dimensional plot of the terms and docs of the Dumais et al. example. Squares: terms (human, interface, computer, user, system, response, time, EPS, survey, tree, graph, minor). Circles: docs.]

How we use the SVD in LSI

Key property of SVD: each singular value tells us how important its dimension is.

By setting less important dimensions to zero, we keep the important information but get rid of the “details”.

These details may be noise ⇒ the reduced LSI matrix is a better representation.

These details may make things dissimilar that should be similar ⇒ the reduced LSI matrix is a better representation because it represents similarity better.

How does LSI address synonymy and semantic relatedness?

Docs may be semantically similar but not similar in the vector space (when they talk about the same topics but use different words).

Desired effect of LSI: synonyms contribute strongly to doc similarity.

Standard vector space: synonyms contribute nothing to doc similarity.

LSI (via SVD) selects the “least costly” mapping: different words (= different dimensions of the full space) are mapped to the same dimension in the reduced space.

Thus, it maps synonyms or semantically related words to the same dimension.

The “cost” of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words, so LSI will avoid doing that for unrelated words.

Performing the maps

Each row and column of 𝐶 gets mapped into the 𝑘-dimensional LSI space by the SVD.

Since 𝑉𝑘 = 𝐶𝑘ᵀ𝑈𝑘Σ𝑘⁻¹, a query 𝑞 is also mapped into this space by

𝑞𝑘 = 𝑞ᵀ𝑈𝑘Σ𝑘⁻¹

The mapped query is NOT a sparse vector.

Claim: this is not only the mapping with the best (Frobenius error) approximation to 𝐶, but it also improves retrieval.

Implementation

Compute the SVD of the term-doc matrix.

Map docs to the reduced space.

Map the query into the reduced space: 𝑞𝑘 = 𝑞ᵀ𝑈𝑘Σ𝑘⁻¹.

Compute the similarity of 𝑞𝑘 with all reduced docs in 𝑉𝑘.

Output a ranked list of docs as usual.

What is the fundamental problem with this approach?
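A minimal end-to-end sketch of these steps in NumPy (assumptions: a toy term-doc count matrix, a toy query over the same vocabulary, and cosine similarity for ranking; none of the variable names come from the slides):

```python
# Minimal LSI retrieval sketch under the assumptions stated above.
import numpy as np

# Toy term-doc matrix: rows = terms, columns = docs.
C = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, S_k = U[:, :k], np.diag(s[:k])
docs_k = Vt[:k, :].T                      # reduced docs: one row per doc (rows of V_k)

# Map the query into the reduced space: q_k = q^T U_k S_k^{-1}.
q = np.array([1, 0, 1, 0], dtype=float)   # query over the same 4-term vocabulary
q_k = q @ U_k @ np.linalg.inv(S_k)

# Rank docs by cosine similarity to q_k.
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
print(np.argsort(-sims))                  # doc indices, best match first
```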

Empirical evidence

Experiments on TREC 1/2/3 – Dumais

Lanczos SVD code (available on netlib), due to Berry, was used in these experiments. Running times of ~ one day on tens of thousands of docs [still an obstacle to use].

Dimensions – various values in the range 250–350 reported; under 200 dimensions reported unsatisfactory. Reducing k improves recall.

Generally we expect recall to improve – what about precision?

Empirical evidence

Precision at or above median TREC precision

Top scorer on almost 20% of TREC topics

Slightly better on average than straight vector spaces

Effect of dimensionality:

Dimensions Precision

250 0.367

300 0.371

346 0.374


But why is this clustering?

We’ve talked about docs, queries, retrieval, and precision here. What does this have to do with clustering?

Intuition: dimension reduction through LSI brings together “related” axes in the vector space.

Simplistic picture

[Figure: a simplistic picture of the vector space with docs clustered along three axes labeled Topic 1, Topic 2, and Topic 3.]

Reference

Chapter 18 of the IIR book (Manning, Raghavan & Schütze, Introduction to Information Retrieval).