Transcript of "Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Analysis" by Georges Dupret (georges.dupret@laposte.net), 25 slides.

1

Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Analysis

Georges Dupret

georges.dupret@laposte.net


2

Abstract

We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword.

By examining the precision, we find that lower-ranked dimensions identify related terms and higher-ranked dimensions discriminate between synonyms.


3

Introduction

The task of retrieving the documents relevant to a user query in a large text database is complicated by the fact that different authors use different words to express the same ideas or concepts.

Methods related to Latent Semantic Analysis interpret the variability associated with the expression of a concept as noise, and use linear algebra techniques to isolate the perennial concept from the variable noise.


4

Introduction

In LSA, the SVD (singular value decomposition) technique is used to decompose a term-by-document matrix into a set of orthogonal factors.

A large number of factors yields an approximation close to the original term-by-document matrix but retains too much noise.

On the other hand, if too many factors are discarded, the information loss is too large.

The objective is to identify the optimal number of orthogonal factors.


5

Bag-of-words representation

Each document is replaced by a vector of its attributes, which are usually the keywords present in the document.

This representation can be used to retrieve documents relevant to a user query: a vector representation is derived from the query in the same way as for regular documents and then compared with the database using a suitable measure of distance or similarity.
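A minimal sketch of this retrieval scheme; the vocabulary, the documents, and the choice of cosine similarity are illustrative assumptions, not taken from the slides:

```python
import math
from collections import Counter

def bow_vector(text, vocabulary):
    """Map a document to a bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine similarity, one common choice of similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["cat", "mouse", "dog"]
doc = bow_vector("the cat chased the mouse", vocabulary)
query = bow_vector("cat and mouse", vocabulary)  # the query is vectorized the same way
print(cosine_similarity(doc, query))
```

The query is treated exactly like a document, which is the point the slide makes.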


6

Latent Semantic Analysis

LSI is one of the few methods that successfully overcome the vocabulary noise problem, because it takes synonymy and polysemy into account.

Not accounting for synonymy leads to underestimating the similarity between related documents, while not accounting for polysemy leads to erroneously finding similarities.

The idea behind LSI is to reduce the dimension of the IR problem by projecting the D documents by N attributes matrix A onto an adequate subspace of lower dimension.


7

Latent Semantic Analysis

SVD of the D × N matrix A:

A = U Δ V^T    (1)

U, V: orthogonal matrices
Δ: a diagonal matrix with elements σ1, …, σp, where p = min(D, N) and σ1 ≥ σ2 ≥ … ≥ σp−1 ≥ σp

The closest matrix A(k) of rank k < rank(A) is obtained by setting σi = 0 for i > k:

A(k) = U(k) × Δ(k) × V(k)^T    (2)
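Equations (1) and (2) can be checked numerically; the matrix below is a made-up example, not data from the paper:

```python
import numpy as np

# Hypothetical small term-weight matrix A (D = 4 documents, N = 3 keywords).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # Eq. (1): A = U @ diag(s) @ Vt

k = 2                                             # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # Eq. (2): rank-k approximation A(k)

# By the Eckart-Young theorem, A(k) is the closest rank-k matrix in the
# Frobenius norm; the error equals the discarded singular value s[2].
print(np.linalg.norm(A - A_k))
```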


8

Latent Semantic Analysis

We then compare documents in a k-dimensional subspace based on A(k). The projection of the original document representation gives

A^T × U(k) × Δ(k)^−1 = V(k)    (3)

The same operation on the query vector Q gives

Q(k) = Q^T × U(k) × Δ(k)^−1    (4)

The closest document to the query Q is identified by a dissimilarity function d_k(·, ·).    (5)
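A sketch of the fold-in equations (3) and (4) on a made-up matrix; the body of Eq. (5) is not shown in the transcript, so cosine distance is used below purely as an assumed example of a dissimilarity:

```python
import numpy as np

# Hypothetical D x N (documents x keywords) matrix, illustration only.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Eq. (3): folding the original representation recovers V(k).
V_k = A.T @ U[:, :k] @ np.diag(1.0 / s[:k])
assert np.allclose(V_k, Vt[:k, :].T)

# Eq. (4): the query vector Q is folded in exactly the same way.
Q = np.array([1.0, 0.0, 1.0, 0.0])          # hypothetical query weights
Q_k = Q @ U[:, :k] @ np.diag(1.0 / s[:k])   # Q(k), a point in the k-dim subspace

# Eq. (5): rank the folded vectors by a dissimilarity d_k; cosine distance
# is one common choice and is used here only as an assumption.
d = 1.0 - V_k @ Q_k / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(Q_k))
```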


9

Covariance Method

Advantage: able to handle databases of several hundreds of thousands of documents.

If D is the number of documents, A_d the vector representing the d-th document, and Ā the mean of these vectors, the covariance matrix is written:

C = (1/D) Σ_{d=1}^{D} (A_d − Ā)(A_d − Ā)^T    (6)

This matrix being symmetric, the singular value decomposition can be written:

C = V Δ V^T    (7)
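Equations (6) and (7) in code, on an illustrative matrix; note that C is N × N, independent of the number of documents D, which is what lets the method scale to very large databases:

```python
import numpy as np

# Hypothetical document vectors: D = 4 documents over N = 3 keywords.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 2.0]])
D = A.shape[0]
A_bar = A.mean(axis=0)                  # mean document vector Ā

# Eq. (6): N x N covariance matrix of the keyword dimensions.
C = (A - A_bar).T @ (A - A_bar) / D

# Eq. (7): C is symmetric, so its SVD coincides with its
# eigendecomposition C = V diag(eigvals) V^T.
eigvals, V = np.linalg.eigh(C)
assert np.allclose(V @ np.diag(eigvals) @ V.T, C)
```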


10

Covariance Method

Reducing Δ to the k most significant singular values, we can project the keyword space onto a k-dimensional subspace:

(A|Q) → (A|Q) V(k) Δ = (A(k)|Q(k))    (8)


11

Embedded Concepts

Projecting the covariance matrix onto a subspace of fewer dimensions implies a loss of information.

This can be interpreted as the merging of keyword meanings into a more general concept.

ex: “cat” and “mouse” → “mammal” → “animal”

How many singular values are necessary for a keyword to be correctly distinguished from all others in the dictionary?

What is the definition of “correctly distinguished”?


12

Correlation Method

The correlation matrix S of A is defined based on the covariance matrix C:

S_{ij} = C_{ij} / √(C_{ii} C_{jj})    (9)

Using the correlation rather than the covariance matrix results in a different weighting of correlated keywords; the justification of the model otherwise remains identical.
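The transcript omits the body of Eq. (9); assuming the standard normalization S_ij = C_ij / √(C_ii C_jj), the computation is:

```python
import numpy as np

# Hypothetical covariance matrix C for N = 3 keywords.
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.2],
              [0.5, 0.2, 1.0]])

# Eq. (9), as reconstructed: divide each entry by the product of the
# standard deviations of its row and column keywords.
d = np.sqrt(np.diag(C))
S = C / np.outer(d, d)

print(np.diag(S))  # every keyword has correlation 1 with itself
```

The unit diagonal is what makes the "more correlated to itself than to any other attribute" test of the next slides meaningful.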


13

Keyword Validity

A property of the SVD:

S = Σ_{i=1}^{N} σ_i v_i v_i^T    (10)    (v_i: the i-th column of V)

The rank-k approximation S(k) of S can be written

S(k) = Σ_{i=1}^{k} σ_i v_i v_i^T    (11)

with k ≤ N, and S(N) = S.


14

Keyword Validity

We make the following argument: the rank-k approximation of the correlation matrix correctly represents a given keyword only if this keyword is more correlated to itself than to any other attribute.

For a given keyword α this condition is written

S(k)_{αα} > S(k)_{αβ}  for all β ≠ α    (12)

A keyword is said to be “valid” of rank k if k − 1 is the largest value for which Eq. (12) is not verified; this k is the validity rank of the keyword.
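A sketch of computing the validity rank from Eq. (12); the function name is this sketch's own, and the definition implemented is one reading of the slide's wording:

```python
import numpy as np

def validity_rank(S, alpha):
    """Return 1 plus the largest k at which keyword `alpha` fails to be
    more correlated to itself than to any other attribute in S(k) --
    an interpretation of the transcript's definition of validity rank."""
    eigvals, V = np.linalg.eigh(S)            # S symmetric: S = V diag(eigvals) V^T
    order = np.argsort(eigvals)[::-1]         # largest singular values first
    eigvals, V = eigvals[order], V[:, order]

    N = S.shape[0]
    valid_from = 1
    S_k = np.zeros_like(S, dtype=float)
    for k in range(1, N + 1):
        v = V[:, k - 1:k]
        S_k += eigvals[k - 1] * (v @ v.T)     # rank-k approximation S(k), Eq. (11)
        row = S_k[alpha].copy()
        self_corr = row[alpha]
        row[alpha] = -np.inf
        if self_corr <= row.max():            # Eq. (12) violated at rank k
            valid_from = k + 1
    return valid_from
```

For a well-formed correlation matrix (unit diagonal, off-diagonal entries below 1), Eq. (12) always holds at full rank, so the returned value is at most N.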


15

Experiments

Data: the REUTERS (21,578 articles) and TREC5 (131,896 articles, generating 1,822,531 “documents”) databases.

Pre-processing:
1. stemming words with the Porter algorithm
2. removing keywords appearing in more or fewer documents than two user-specified thresholds
3. mapping documents to vectors with TF-IDF
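A minimal version of this pre-processing pipeline; Porter stemming is omitted for brevity, and the threshold semantics (document-frequency bounds) are an interpretation of the slide:

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1, max_df=None):
    """Map documents to TF-IDF vectors, dropping keywords whose document
    frequency falls outside the two user-specified thresholds."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    if max_df is None:
        max_df = len(docs)
    vocab = sorted(t for t, f in df.items() if min_df <= f <= max_df)

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF weight: term frequency times log inverse document frequency.
        vectors.append([tf[t] * math.log(len(docs) / df[t]) for t in vocab])
    return vocab, vectors

docs = ["oil prices rise", "oil exports fall", "wheat prices fall"]
vocab, vecs = tfidf_vectors(docs, min_df=1, max_df=2)
```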


16

First Experiment

The claim: a given keyword is correctly represented by a rank-k approximation of the correlation matrix if k is at least equal to the validity rank of the keyword.

Experiment method:
1. Select a keyword, for example africa, and extract all the documents containing it.
2. Produce a new copy of these documents, in which we replace the selected keyword by a new one.
   ex: replace africa by afrique (French)


17

First Experiment

Add these new documents to the original database, and the keyword afrique to the vocabulary.

Compute the correlation matrix and SVD of this new, extended database. Note that afrique and africa are perfect synonyms.

Send the original database to the new subspace and issue a query for afrique. We hope to find the documents containing africa first.
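The database-extension steps of the experiment can be sketched as follows (toy documents, illustrative only):

```python
def inject_synonym(docs, keyword, synonym):
    """Copy every document containing `keyword`, replace it by `synonym`
    in the copies, and extend the database with those copies."""
    matching = [d for d in docs if keyword in d.split()]
    copies = [" ".join(synonym if w == keyword else w for w in d.split())
              for d in matching]
    return docs + copies

docs = ["africa exports oil", "wheat prices fall", "news from africa"]
extended = inject_synonym(docs, "africa", "afrique")
print(extended[-1])  # "news from afrique"
```

By construction afrique and africa occur in identical contexts, which is what makes them perfect synonyms in the extended database.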


18

First Experiment

Figure 1: Keyword africa is replaced by afrique. Curves corresponding to ranks 450 and 500 start with a null precision and remain under the curves of lower validity ranks.


19

First Experiment

Table 1: Keyword characteristics.


20

First Experiment

Note: the drop in precision is low after we reach the validity rank, but still present. As usual, the validity rank is precisely defined: the drop in precision is observed as soon as the rank reaches 400.


21

First Experiment

Figure 3: Keyword network is replaced by reseau. Precision is 100% until the validity rank and deteriorates drastically beyond it.


22

First Experiment

This experiment shows the relation between the “concept” associated with afrique (or any other concept) and the actual keyword.

For low-rank approximations S(k), augmenting the number of orthogonal factors helps identify the “concept” common to both afrique and africa, while orthogonal factors beyond the validity rank help distinguish between the keyword and its synonym.


23

Second Experiment

Figure 4: Ratios R and hit = N/G for keyword afrique. Validity rank is 428.


24

Third Experiment

Figure 5: Vocabulary of 1,201 keywords in the REUTERS db.

Figure 6: Vocabulary of 2,486 keywords in the TREC db.


25

Conclusion

We examined the dependence of the latent semantic structure on the number of orthogonal factors in the context of the Correlation Method.

We analyzed the claim that LSA provides a method to take account of synonymy.

We proposed a method to determine the number of orthogonal factors for which a given keyword best represents an associated concept.

Further directions might include the extension to multiple-keyword queries.