
A comparison between Latent Semantic Analysis and

Correspondence Analysis

Julie Séguéla, Gilbert Saporta

CNAM, Cedric Lab / Multiposting.fr

February 9th 2011 - CARME

Outline

1 Introduction

2 Latent Semantic Analysis: Presentation, Method

3 Application in a real context: Presentation, Methodology, Results and comparisons

4 Conclusion

A comparison between LSA & CA February 9th 2011 - CARME 2 / 29

Introduction

Outline

1 Introduction

2 Latent Semantic Analysis

3 Application in a real context

4 Conclusion


Introduction

Context

Text representation for categorization task


Introduction

Objectives

Comparison of several text representation techniques through theory and application

In particular, a comparison between a statistical technique, Correspondence Analysis, and an information retrieval (IR) oriented method, Latent Semantic Analysis

Is there an optimal technique for performing document clustering?


Latent Semantic Analysis

Outline

1 Introduction

2 Latent Semantic Analysis: Presentation, Method

3 Application in a real context

4 Conclusion


Latent Semantic Analysis Presentation

Uses of LSA

LSA was patented in 1988 (US Patent 4,839,853) by Deerwester, Dumais, Furnas, Harshman, Landauer, Lochbaum and Streeter.

Find semantic relations between terms

Helps to overcome synonymy and polysemy problems

Dimensionality reduction (from several thousand features to 40-400 dimensions)

Applications

Document clustering and document classification

Matching queries to documents of similar topic meaning (information retrieval)

Text summarization

...


Latent Semantic Analysis Method

LSA theory

How do we obtain document coordinates?

1) Document-term matrix: T = [f_ij]

2) Weighting: TW = [l_ij(f_ij) · g_j(f_ij)]

3) SVD: TW = UΣV′

4) Document coordinates in the latent semantic space: C = U_k Σ_k

We need to find the optimal dimension k for the final representation
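A minimal NumPy sketch of these four steps on a toy matrix (the weights l_ij and g_j are taken as identity here, and the truncation rank k = 2 is an arbitrary choice for illustration):

```python
import numpy as np

# 1) Toy document-term matrix T (3 documents x 4 terms), entries f_ij
T = np.array([[2., 0., 1., 0.],
              [0., 3., 0., 1.],
              [1., 1., 0., 2.]])

# 2) Weighting step skipped in this sketch: TW = T (identity weights)
TW = T

# 3) SVD: TW = U S V'
U, s, Vt = np.linalg.svd(TW, full_matrices=False)

# 4) Keep k dimensions; document coordinates C = U_k Sigma_k
k = 2
C = U[:, :k] * s[:k]          # one row of coordinates per document

# Sanity check: with all singular values kept, U S V' rebuilds TW
assert np.allclose((U * s) @ Vt, TW)
print(C.shape)  # (3, 2)
```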


Latent Semantic Analysis Method

Common weighting functions

Local weighting

Term frequency: l_ij(f_ij) = f_ij

Binary: l_ij(f_ij) = 1 if term j occurs in document i, else 0

Logarithm: l_ij(f_ij) = log(f_ij + 1)

Global weighting

Normalisation: g_j(f_ij) = 1 / √(Σ_i f_ij²)

IDF (Inverse Document Frequency): g_j(f_ij) = 1 + log(n / n_j), where n is the number of documents and n_j the number of documents in which term j occurs

Entropy: g_j(f_ij) = 1 + Σ_i (f_ij / f_.j) log(f_ij / f_.j) / log(n) (the sum is ≤ 0, so 0 ≤ g_j ≤ 1)
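As an illustration of one combination used later in the talk (logarithm local weight with entropy global weight, i.e. "Log Entropy"), a minimal NumPy sketch, assuming the raw counts sit in a dense array F:

```python
import numpy as np

F = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.]])          # f_ij: 3 documents x 3 terms
n = F.shape[0]                        # number of documents

# Local weight: l_ij = log(f_ij + 1)
L = np.log(F + 1.0)

# Global entropy weight: g_j = 1 + sum_i p_ij log(p_ij) / log(n),
# with p_ij = f_ij / f_.j; the 0 * log(0) terms are taken as 0
col_tot = F.sum(axis=0)
P = F / col_tot
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(P > 0, P * np.log(P), 0.0)
g = 1.0 + plogp.sum(axis=0) / np.log(n)

TW = L * g                            # weighted matrix, ready for the SVD
```

A term spread evenly over all documents gets an entropy weight near 0, while a term concentrated in a single document gets a weight near 1.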


Latent Semantic Analysis Method

LSA vs CA

Latent Semantic Analysis:

1) T = [f_ij]
2) TW = [l_ij(f_ij) · g_j(f_ij)]
3) TW = UΣV′
4) C = U_k Σ_k

Correspondence Analysis:

1) T = [f_ij]
2) TW = [f_ij / √(f_i. f_.j)]
3) TW = UΣV′
3′) Ũ = diag(√(f.. / f_i.)) U
4) C = Ũ_k Σ_k

CA thus amounts to LSA with the weighting l_ij(f_ij) = f_ij / √f_i. and g_j(f_ij) = 1 / √f_.j
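Read this way, CA is the same SVD pipeline with a different weighting; a minimal NumPy sketch of steps 1-4 on toy counts (not a library CA routine):

```python
import numpy as np

T = np.array([[4., 1., 2.],
              [1., 3., 1.],
              [2., 1., 5.]])                 # f_ij

fi = T.sum(axis=1)                           # row totals f_i.
fj = T.sum(axis=0)                           # column totals f_.j
f = T.sum()                                  # grand total f..

# 2) CA weighting: tw_ij = f_ij / sqrt(f_i. * f_.j)
TW = T / np.sqrt(np.outer(fi, fj))

# 3) SVD of the weighted matrix
U, s, Vt = np.linalg.svd(TW, full_matrices=False)

# 3') Rescale rows: U~ = diag(sqrt(f.. / f_i.)) U
Utilde = np.sqrt(f / fi)[:, None] * U

# 4) Row (document) coordinates C = U~_k Sigma_k; note the first CA
#    axis is trivial (singular value 1, constant coordinate)
C = Utilde * s
assert np.isclose(s[0], 1.0)                 # trivial CA dimension
```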


Application in a real context

Outline

1 Introduction

2 Latent Semantic Analysis

3 Application in a real context: Presentation, Methodology, Results and comparisons

4 Conclusion


Application in a real context Presentation

Objectives

Corpus of job offers

Find the best representation method to assess "job similarity" between offers in an unsupervised framework

Comparison of several representation techniques

Discussion about the optimal number of dimensions to keep

Comparison between two similarity measures


Application in a real context Presentation

Data

Offers have been manually labeled by recruiters into 8 categories during the posting procedure

Distribution among job categories :

Category                    Freq.  %     Category               Freq.  %
Sales/Business Development  360    24    Marketing/Product      141    10
R&D/Science                 69     5     Production/Operations  127    9
Accounting/Finance          338    23    Human Resources        138    9
Logistics/Transportation    118    8     Information Systems    192    13

Total                       1483   100

We keep only the "title" + "mission description" parts ("firm description" and "profile searched" are excluded)


Application in a real context Methodology

Preprocessing of texts

Lemmatisation and tagging

Filtering according to grammatical category (we keep nouns, verbs and adjectives)

Filtering out terms occurring in fewer than 5 offers

Vector space model ("bag of words")
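The frequency filter and bag-of-words steps can be sketched with scikit-learn's CountVectorizer (assumed tooling, not the authors'; lemmatisation and POS filtering are not reproduced here, and min_df=2 stands in for the talk's min_df=5 given the tiny toy corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the job-offer texts (title + mission)
corpus = [
    "sales manager develop client portfolio",
    "sales engineer develop new market",
    "finance analyst report budget",
    "finance controller report audit budget",
]

# Bag-of-words model; min_df drops terms occurring in too few documents
# (the talk filters terms occurring in fewer than 5 offers)
vec = CountVectorizer(min_df=2)
T = vec.fit_transform(corpus)            # sparse document-term matrix

print(sorted(vec.vocabulary_))           # only terms kept in >= 2 docs
```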


Application in a real context Methodology

Several representations are compared

Representation method

LSA, weighting : Term Frequency

LSA, weighting : TF-IDF

LSA, weighting : Log Entropy

CA

Dissimilarity measure

Euclidean distance between documents i and i′

1 − cosine similarity between documents i and i′
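Both dissimilarities can be written directly from the document coordinates; a minimal NumPy sketch (x and y stand for two coordinate rows of C):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two coordinate vectors."""
    return np.linalg.norm(x - y)

def cosine_dissim(x, y):
    """1 - cosine similarity; near 0 for aligned vectors, up to 2."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])       # same direction, twice the length

print(euclidean(x, y))              # sensitive to vector length
print(cosine_dissim(x, y))          # ~0: direction (topic) is identical
```

The contrast shown here is why cosine is often preferred for text: document length changes the Euclidean distance but not the cosine dissimilarity.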


Application in a real context Methodology

Method of clustering

Clustering steps

Computation of the dissimilarity matrix from document coordinates in the latent semantic space

Hierarchical Agglomerative Clustering down to an 8-class partition

Computation of class centroids

K-means clustering initialized from the previous centroids
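A sketch of these steps with SciPy and scikit-learn (assumed tooling; shown with Ward linkage on Euclidean coordinates — for the cosine variant, a condensed dissimilarity matrix can be fed to linkage instead):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
C = rng.normal(size=(60, 5))                 # stand-in document coordinates

# Hierarchical Agglomerative Clustering, cut into an 8-class partition
Z = linkage(C, method="ward")
labels_hac = fcluster(Z, t=8, criterion="maxclust")

# Class centroids of the HAC partition
centroids = np.vstack([C[labels_hac == g].mean(axis=0)
                       for g in range(1, 9)])

# K-means initialized from those centroids
km = KMeans(n_clusters=8, init=centroids, n_init=1).fit(C)
labels = km.labels_
```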


Application in a real context Methodology

Measures of agreement between two partitions

P1, P2: two partitions of n objects with the same number of classes k
N = [n_ij], i = 1,…,k, j = 1,…,k: the corresponding contingency table

Rand index

R = (2 Σ_i Σ_j n_ij² − Σ_i n_i.² − Σ_j n_.j² + n²) / n², 0 ≤ R ≤ 1

The Rand index is based on the number of pairs of units which belong to the same clusters. It does not depend on cluster labeling.
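The formula above translates directly into code; a small sketch taking the contingency table N as input:

```python
import numpy as np

def rand_index(N):
    """Rand index between two k-class partitions, from contingency table N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    return (2 * (N ** 2).sum()
            - (N.sum(axis=1) ** 2).sum()     # row totals n_i.
            - (N.sum(axis=0) ** 2).sum()     # column totals n_.j
            + n ** 2) / n ** 2

# Identical partitions give a diagonal table, hence R = 1
print(rand_index([[3, 0], [0, 2]]))          # 1.0
```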


Application in a real context Methodology

Measures of agreement between two partitions

Cohen's Kappa and F-measure values depend on the cluster labels. To overcome label switching, we look for their maximum values over all label permutations.

Cohen's Kappa

κ_opt = max { [(1/n) Σ_i n_ii − (1/n²) Σ_i n_i. n_.i] / [1 − (1/n²) Σ_i n_i. n_.i] }, −1 ≤ κ ≤ 1

F-measure

F_opt = max { 2 · [(1/k) Σ_i n_ii/n_i.] · [(1/k) Σ_i n_ii/n_.i] / [(1/k) Σ_i n_ii/n_i. + (1/k) Σ_i n_ii/n_.i] }, 0 ≤ F ≤ 1

(both maxima are taken over all permutations of the cluster labels)
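A brute-force sketch of the label-permutation maximum for Cohen's Kappa (feasible at k = 8, i.e. 8! = 40320 column permutations; the F-measure optimum can be handled the same way):

```python
import itertools
import numpy as np

def kappa_opt(N):
    """Max of Cohen's Kappa over all column-label permutations of N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    best = -1.0
    for perm in itertools.permutations(range(N.shape[1])):
        M = N[:, perm]
        agree = np.trace(M) / n                          # (1/n) sum_i n_ii
        chance = (M.sum(axis=1) * M.sum(axis=0)).sum() / n ** 2
        best = max(best, (agree - chance) / (1.0 - chance))
    return best

# Same partition with swapped labels: the optimum recovers agreement 1
print(kappa_opt([[0, 4], [3, 0]]))           # 1.0
```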


Application in a real context Results and comparisons

Correlation between coordinates derived from the different methods


Application in a real context Results and comparisons

Clustering quality according to the method and the number of dimensions: Rand index


Application in a real context Results and comparisons

Clustering quality according to the method and the number of dimensions: Cohen's Kappa


Application in a real context Results and comparisons

Clustering quality according to the method and the number of dimensions: F-measure


Application in a real context Results and comparisons

Clustering quality according to the dissimilarity function: LSA + Log Entropy


Application in a real context Results and comparisons

Clustering quality according to the dissimilarity function: CA


Conclusion

Outline

1 Introduction

2 Latent Semantic Analysis

3 Application in a real context

4 Conclusion


Conclusion

Conclusions

CA seems to be less stable than the other methods, but with cosine similarity it provides better results below 100 dimensions

As reported in the literature, cosine similarity between vectors seems better suited to textual data than the usual dot-product similarity: a slight increase in efficiency and more stability for the agreement measures

Optimal number of dimensions to keep? It varies according to the type of text studied and the method used (around 60 dimensions with CA)

We should prefer a dissimilarity measure whose results stay stable as the number of dimensions kept varies (in the context of automated tasks, it is problematic if the optimal dimension depends too much on the collection of documents)


Conclusion

Limitations & future work

Limitations of the study

The clusters obtained are compared with categories chosen by recruiters, which are sometimes subjective and could explain some errors

We are working on a very particular type of corpus: short texts of variable length, sometimes very similar but not actual duplicates

Future work

Test other clustering methods (the best representation may depend on the clustering method)

Repeat the study with a supervised classification algorithm (index values are disappointing in the unsupervised framework)

Study the effect of using the different parts of job offers for classification


Some references

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.

Picca, D., Curdy, B., & Bavaud, F. (2006). Non-linear correspondence analysis in text retrieval: a kernel view. In JADT'06, pp. 741-747.

Wild, F. (2007). An LSA package for R. In LSA-TEL'07, pp. 11-12.

Thanks !