Latent Semantic Analysis

An Example

d1 : Romeo and Juliet.

d2 : Juliet: O happy dagger!

d3 : Romeo died by dagger.

d4 : “Live free or die”, that’s New-Hampshire’s motto.

d5 : Did you know New-Hampshire is in New-England?

q: dies, dagger

Which documents should be returned, and how should they be ranked?

Eigenvectors and Eigenvalues

• Let A be an n × n matrix.

• If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector.

• In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have that

Ax = λx for some scalar λ.

• Such a special vector x is called an eigenvector of A, and the scalar λ is the corresponding eigenvalue.
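As a quick numerical check of Ax = λx, here is a minimal NumPy sketch (the matrix is an arbitrary example, not taken from the slides):

```python
import numpy as np

# A small symmetric matrix (arbitrary example, not from the slides).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues and eigenvectors (as columns) of a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(A)

for i in range(len(eigvals)):
    x = eigvecs[:, i]      # i-th eigenvector
    lam = eigvals[i]       # corresponding eigenvalue
    # Multiplying by A only rescales x: Ax equals lambda * x (up to rounding).
    print(np.allclose(A @ x, lam * x))   # prints True for each eigenpair
```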

Matrix Decomposition

• Let S be the matrix with the eigenvectors of A as columns.

• Let Λ be the diagonal matrix with the eigenvalues of A on the diagonal.

• Then

A = S Λ S⁻¹

• If A is symmetric, then S⁻¹ = Sᵀ and

A = S Λ Sᵀ
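This decomposition is easy to verify numerically; a minimal sketch, assuming a symmetric A so that S⁻¹ = Sᵀ:

```python
import numpy as np

# Symmetric example matrix (arbitrary, not from the slides).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, S = np.linalg.eigh(A)   # S: eigenvectors as columns
Lam = np.diag(eigvals)           # Lambda: eigenvalues on the diagonal

# A = S Lambda S^{-1}; for symmetric A, S is orthogonal, so S^{-1} = S^T.
print(np.allclose(A, S @ Lam @ np.linalg.inv(S)))   # True
print(np.allclose(A, S @ Lam @ S.T))                # True (symmetric case)
```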

Singular Value Decomposition

• Let A be an m × n matrix with real entries and m > n.

• Consider the n × n square matrix B = AᵀA.

– B is symmetric.

– It can be shown that the eigenvalues of such matrices (AᵀA) are non-negative.

– Since they are non-negative, we can write them in decreasing order as squares of non-negative real numbers: σ₁² ≥ … ≥ σₙ².

• For some index r (possibly n), the first r numbers are positive whereas the rest are zero.

• Let x₁, …, xᵣ be the eigenvectors of B corresponding to σ₁², …, σᵣ², and set S₁ = [x₁, …, xᵣ].

• Let y₁ = (1/σ₁)Ax₁, …, yᵣ = (1/σᵣ)Axᵣ, and set S₂ = [y₁, …, yᵣ].

• We can show that

A = S₂ Σ S₁ᵀ

where Σ is diagonal and the values along its diagonal are σ₁, …, σᵣ, which are called singular values.

• If we denote S₂ by S and S₁ by U, we have A = S Σ Uᵀ.
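The construction above can be reproduced numerically and checked against a library SVD; a minimal sketch (the matrix is an arbitrary example, not from the slides):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])              # m × n with m > n (arbitrary example)

# Eigen-decomposition of B = A^T A gives sigma_i^2 and the columns of S1.
B = A.T @ A
eigvals, X = np.linalg.eigh(B)          # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder so sigma_1^2 >= ... >= sigma_n^2
eigvals, X = eigvals[order], X[:, order]

sigma = np.sqrt(np.clip(eigvals, 0.0, None))   # singular values
r = int(np.sum(sigma > 1e-12))                 # number of positive singular values
S1 = X[:, :r]
S2 = (A @ S1) / sigma[:r]               # columns y_i = (1/sigma_i) A x_i
Sigma = np.diag(sigma[:r])

print(np.allclose(A, S2 @ Sigma @ S1.T))       # True: A = S2 Sigma S1^T

# Cross-check the singular values against NumPy's built-in SVD.
_, s, _ = np.linalg.svd(A, full_matrices=False)
print(np.allclose(s[:r], sigma[:r]))           # True
```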

Example

d1 : Romeo and Juliet.

d2 : Juliet: O happy dagger!

d3 : Romeo died by dagger.

d4 : “Live free or die”, that’s New-Hampshire’s motto.

d5 : Did you know New-Hampshire is in New-England?

q: dies, dagger

Document-term matrix
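The slides show this matrix as a figure. A minimal sketch of how such a matrix could be built for d1–d5 follows; the choice of kept terms (stemmed content words with stop words removed) is an assumption, not taken from the slides, and terms index the rows while documents index the columns, to match the document and term representations used later:

```python
import numpy as np

docs = ["d1", "d2", "d3", "d4", "d5"]
# Assumed vocabulary after stop-word removal and stemming (not from the slides).
terms = ["romeo", "juliet", "happy", "dagger", "live", "die",
         "free", "new-hampshire"]

# Rows = terms, columns = documents; entry (i, j) counts term i in document j.
A = np.array([
    # d1 d2 d3 d4 d5
    [1,  0,  1,  0,  0],   # romeo
    [1,  1,  0,  0,  0],   # juliet
    [0,  1,  0,  0,  0],   # happy
    [0,  1,  1,  0,  0],   # dagger
    [0,  0,  0,  1,  0],   # live
    [0,  0,  1,  1,  0],   # die ("died", "die")
    [0,  0,  0,  1,  0],   # free
    [0,  0,  0,  1,  1],   # new-hampshire
], dtype=float)
```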

Latent Concepts

• Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data.

• Each document and term (word) is then expressed as a vector with elements corresponding to these concepts.

– Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

• The goal is not to describe the concepts verbally, but to represent the documents and terms in a unified way that exposes

– document-document,
– document-term, and
– term-term similarities,

which are otherwise hidden.

Matrix

Matrix A can be written as: A = S Σ Uᵀ

Let's "neglect" the last three singular values of Σ as being too "small"...

Also, keep only the first two columns of S, obtaining S₂, and the first two rows of Uᵀ, obtaining U₂ᵀ.

Matrix A is then approximated as: A₂ = S₂ Σ₂ U₂ᵀ

In general: Aₖ = Sₖ Σₖ Uₖᵀ, where a good value for k is determined empirically.
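A minimal sketch of this rank-k truncation using NumPy's built-in SVD (the function and variable names are illustrative):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return S_k, Sigma_k, U_k^T and the rank-k approximation A_k = S_k Sigma_k U_k^T."""
    S, s, Ut = np.linalg.svd(A, full_matrices=False)   # A = S diag(s) U^T
    S_k = S[:, :k]               # first k columns of S
    Sigma_k = np.diag(s[:k])     # top-left k × k block of Sigma
    Ut_k = Ut[:k, :]             # first k rows of U^T
    return S_k, Sigma_k, Ut_k, S_k @ Sigma_k @ Ut_k

# Example: keep the two largest singular values of the term-document matrix A
# built in the earlier sketch.
# S_2, Sigma_2, Ut_2, A_2 = rank_k_approximation(A, k=2)
```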

Matrices Σ₂, S₂, U₂

Representing Documents, Terms, and Queries

• Represent documents by the column vectors of U₂ᵀ

• Represent terms by the row vectors of S₂

• Represent queries by the centroid vector of their terms, as in the sketch below
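Putting the pieces together, here is a minimal sketch that ranks d1–d5 against the query by cosine similarity in the 2-dimensional concept space; the function name is illustrative, and mapping the query word "dies" to the term "die" is an assumption:

```python
import numpy as np

def lsi_rank(A, terms, docs, query_terms, k=2):
    """Rank documents for a query in the k-dimensional LSI concept space."""
    S, s, Ut = np.linalg.svd(A, full_matrices=False)
    doc_vecs = Ut[:k, :].T                  # documents: columns of U_k^T
    term_vecs = S[:, :k]                    # terms: rows of S_k
    # Query: centroid of the vectors of its terms.
    idx = [terms.index(t) for t in query_terms]
    q = term_vecs[idx].mean(axis=0)
    # Cosine similarity between the query and each document vector.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return sorted(zip(docs, sims), key=lambda pair: -pair[1])

# Using the term-document matrix A, terms, and docs from the earlier sketch:
# for doc, sim in lsi_rank(A, terms, docs, ["die", "dagger"]):
#     print(doc, round(sim, 3))
```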

Geometry