Lecture 5: Probabilistic Latent Semantic Analysis

Ata Kaban, The University of Birmingham

Overview

• We learn how we can
– represent text in a simple numerical form in the computer
– find out topics from a collection of text documents

Salton’s Vector Space Model

Gerald Salton

’60s – ’70s

• Represent each document by a high-dimensional vector in the space of words

• Represent the doc as a vector where each entry corresponds to a different word and the number at that entry corresponds to how many times that word was present in the document (or some function of it)
– The number of words is huge
– Select and use a smaller set of words that are of interest
– E.g. uninteresting words: ‘and’, ‘the’, ‘at’, ‘is’, etc. These are called stop-words
– Stemming: remove endings. E.g. ‘learn’, ‘learning’, ‘learnable’, ‘learned’ could be substituted by the single stem ‘learn’
– Other simplifications can also be invented and used
– The set of different remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index.

Example

This is a small document collection that consists of 9 text documents. Terms that are in our dictionary are in bold.

Collect all doc vectors into a term by document matrix
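The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy documents and the stop-word list are made up (they are not the 9-document example of the lecture), tokenisation is a plain whitespace split, and no stemming is done.

```python
import numpy as np

# Hypothetical toy collection and stop-word list (not the lecture's example).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "learning is fun and the dog is learning",
]
stop_words = {"the", "on", "is", "and"}

def tokenize(text):
    """Lower-case, split on whitespace, and drop stop-words."""
    return [w for w in text.lower().split() if w not in stop_words]

# Fix an ordering of the dictionary so that terms can be addressed by index.
vocabulary = sorted({w for doc in docs for w in tokenize(doc)})
index = {term: t for t, term in enumerate(vocabulary)}

# X[t, d] = number of times term t occurs in document d.
X = np.zeros((len(vocabulary), len(docs)), dtype=int)
for d, doc in enumerate(docs):
    for w in tokenize(doc):
        X[index[w], d] += 1
```

Each column of X is one document vector; each row corresponds to one dictionary term.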

Queries

• Have a collection of documents
• Want to find the most relevant documents to a query
• A query is just like a very short document
• Compute the similarity between the query and all documents in the collection
• Return the best matching documents

• When are two documents similar?
• When are two document vectors similar?

Document similarity

$$\cos(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$$

Simple, intuitive

Fast to compute, because x and y are typically sparse (i.e. have many 0-s)
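A minimal sketch of query matching with the cosine similarity cos(x, y) = x^T y / (||x|| ||y||). The count matrix and the query below are hypothetical, and dense NumPy arrays are used for brevity; a real system would exploit sparsity.

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = x^T y / (||x|| ||y||); returns 0.0 if either vector is all zeros."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        return 0.0
    return float(x @ y / (nx * ny))

def rank_documents(X, q):
    """Return document indices sorted from most to least similar to query q, plus the scores."""
    scores = np.array([cosine(X[:, d], q) for d in range(X.shape[1])])
    return np.argsort(-scores, kind="stable"), scores

# Hypothetical term by document matrix (4 terms, 3 documents) and a query vector.
X = np.array([[2, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [0, 3, 0]], dtype=float)
q = np.array([1, 0, 1, 0], dtype=float)

order, scores = rank_documents(X, q)
```

Here the third document points in exactly the same direction as the query, so it is ranked first, while the second document shares no terms with the query and gets similarity 0.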

How to measure success?

• Assume there is a set of ‘correct answers’ to the query. The docs in this set are called relevant to the query

• The set of documents returned by the system are called retrieved documents

• Precision: what percentage of the retrieved documents are relevant

• Recall: what percentage of all relevant documents are retrieved
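As a small illustration, precision and recall can be computed from the two sets of document ids. The function name and the ids below are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|; recall = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents returned, 5 documents are truly relevant, 2 are in both sets.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
```

With these sets, 2 of the 4 retrieved documents are relevant (precision 0.5) and 2 of the 5 relevant documents are retrieved (recall 0.4).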

Problems

• Synonyms: separate words that have the same meaning
– E.g. ‘car’ & ‘automobile’
– They tend to reduce recall

• Polysemes: words with multiple meanings
– E.g. ‘saturn’
– They tend to reduce precision

The problem is more general: there is a disconnect between topics and words

• ‘… a more appropriate model should consider some conceptual dimensions instead of words.’ (Gardenfors)

Latent Semantic Analysis (LSA)

• LSA aims to discover something about the meaning behind the words; about the topics in the documents.
• What is the difference between topics and words?
– Words are observable
– Topics are not. They are latent.
• How to find out topics from the words in an automatic way?
– We can imagine them as a compression of words
– A combination of words
– Try to formalise this

Probabilistic Latent Semantic Analysis

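• Let us start from what we know
• Remember the random sequence model

$$P(doc) = P(term_1|doc)\,P(term_2|doc) \cdots P(term_L|doc) = \prod_{t=1}^{T} P(term_t|doc)^{X(term_t, doc)}$$

(The first product runs over the L word positions of the document, the second over the T dictionary terms, where X(term_t, doc) is the number of times term t occurs in the doc.)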

We know how to compute the parameters of this model, i.e. P(term_t|doc)

- We ‘guessed’ it intuitively in Lecture 1

- We also derived it by Maximum Likelihood in Lecture 1, because we said the guessing strategy may not work for more complicated models.

• Now let us have K topics as well:

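$$P(term|doc) = \sum_{k=1}^{K} P(term|topic_k)\,P(topic_k|doc)$$

The same, written using shorthands:

$$P(t|doc) = \sum_{k=1}^{K} P(t|k)\,P(k|doc)$$

So by replacing this, for any doc in the collection,

$$P(doc) = \prod_{t=1}^{T} \left\{ \sum_{k=1}^{K} P(t|k)\,P(k|doc) \right\}^{X(t,doc)}$$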

Which are the parameters of this model?

• The parameters of this model are:
P(t|k)
P(k|doc)

• It is possible to derive the equations for computing these parameters by Maximum Likelihood.

• If we do so, what do we get?
P(t|k) for all t and k, is a term by topic matrix (gives which terms make up a topic)
P(k|doc) for all k and doc, is a topic by document matrix (gives which topics are in a document)
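Seen as matrices, the mixture P(t|doc) = Σ_k P(t|k) P(k|doc) is just a matrix product of the term by topic matrix with the topic by document matrix. The sketch below shows this with random parameters; the sizes and the array names P_t_k and P_k_d are assumptions made for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N = 6, 2, 4  # hypothetical numbers of terms, topics and documents

# Term by topic matrix: column k holds P(t|k), so each column sums to 1.
P_t_k = rng.random((T, K))
P_t_k /= P_t_k.sum(axis=0, keepdims=True)

# Topic by document matrix: column d holds P(k|doc d), so each column sums to 1.
P_k_d = rng.random((K, N))
P_k_d /= P_k_d.sum(axis=0, keepdims=True)

# P(t|doc) = sum_k P(t|k) P(k|doc), for all terms and documents at once.
P_t_d = P_t_k @ P_k_d
```

Each column of P_t_d is again a probability distribution over the T terms, because it is a convex combination of the topic distributions.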

Deriving the parameter estimation algorithm

• The log likelihood of this model is the log probability of the entire collection:

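$$\sum_{d=1}^{N} \log P(d) = \sum_{d=1}^{N} \sum_{t=1}^{T} X(t,d) \log \sum_{k=1}^{K} P(t|k)\,P(k|d)$$

which is to be maximised w.r.t. parameters P(t|k) and then also P(k|d), subject to the constraints that

$$\sum_{t=1}^{T} P(t|k) = 1 \quad \text{and} \quad \sum_{k=1}^{K} P(k|d) = 1.$$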
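The log likelihood Σ_d Σ_t X(t,d) log Σ_k P(t|k) P(k|d) is easy to evaluate numerically, which is useful for monitoring any iterative fitting procedure. A minimal sketch follows; the function name, the array names and the small constant eps (added only to avoid log(0)) are assumptions of this illustration.

```python
import numpy as np

def log_likelihood(X, P_t_k, P_k_d, eps=1e-12):
    """Sum over d and t of X(t,d) * log( sum_k P(t|k) P(k|d) )."""
    mix = P_t_k @ P_k_d  # mix[t, d] = sum_k P(t|k) P(k|d)
    return float(np.sum(X * np.log(mix + eps)))

# Hypothetical check with uniform parameters: every P(t|d) equals 1/T.
T, K, N = 3, 2, 2
X = np.array([[1, 2],
              [0, 1],
              [2, 0]], dtype=float)
P_t_k = np.full((T, K), 1.0 / T)
P_k_d = np.full((K, N), 1.0 / K)

ll = log_likelihood(X, P_t_k, P_k_d)
```

With uniform parameters the value is simply the total number of word occurrences times log(1/T), which gives a quick sanity check.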

For those who would enjoy working it out:
- Lagrangian terms are added to ensure the constraints
- Derivatives are taken w.r.t. the parameters (one of them at a time) and equated to zero
- Solve the resulting equations. You will get fixed point equations which can be solved iteratively. This is the PLSA algorithm.

Note that these steps are the same as those we did in Lecture 1 when deriving the Maximum Likelihood estimate for random sequence models; just the working is a little more tedious.

We skip doing this in class and just give the resulting algorithm (see next slide).

You can get a 5% bonus if you work this algorithm out.

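The PLSA algorithm

• Inputs: term by document matrix X(t,d), t=1:T, d=1:N and the number K of topics sought
• Initialise arrays P1 (T by K) and P2 (K by N) randomly with numbers in [0,1] and normalise them so that they sum to 1 over the first index, i.e. $\sum_t P1(t,k) = 1$ and $\sum_k P2(k,d) = 1$
• Iterate until convergence
For d=1 to N, For t=1 to T, For k=1:K

$$P1(t,k) \leftarrow P1(t,k) \sum_{d=1}^{N} \frac{X(t,d)}{\sum_{k'=1}^{K} P1(t,k')\,P2(k',d)}\,P2(k,d); \qquad P1(t,k) \leftarrow \frac{P1(t,k)}{\sum_{t'=1}^{T} P1(t',k)}$$

$$P2(k,d) \leftarrow P2(k,d) \sum_{t=1}^{T} \frac{X(t,d)}{\sum_{k'=1}^{K} P1(t,k')\,P2(k',d)}\,P1(t,k); \qquad P2(k,d) \leftarrow \frac{P2(k,d)}{\sum_{k'=1}^{K} P2(k',d)}$$

• Output: arrays P1 and P2, which hold the estimated parameters P(t|k) and P(k|d) respectively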
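A vectorised NumPy sketch of the multiplicative PLSA updates, in which P1(t,k) is multiplied by Σ_d [X(t,d) / Σ_k' P1(t,k') P2(k',d)] P2(k,d) and renormalised over t, and P2(k,d) is updated symmetrically and renormalised over k. Several choices here are assumptions of this sketch rather than part of the lecture: a fixed number of iterations replaces the convergence test, the ratio matrix is recomputed between the two updates, a small eps guards against division by zero, and the toy matrix X is made up.

```python
import numpy as np

def plsa(X, K, n_iter=100, seed=0, eps=1e-12):
    """Fit PLSA by multiplicative updates.

    X is a T x N term by document count matrix.
    Returns P1 (T x K, estimates P(t|k)) and P2 (K x N, estimates P(k|d)).
    """
    rng = np.random.default_rng(seed)
    T, N = X.shape

    # Random initialisation, normalised so that each column sums to 1.
    P1 = rng.random((T, K))
    P1 /= P1.sum(axis=0, keepdims=True)
    P2 = rng.random((K, N))
    P2 /= P2.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # R[t, d] = X(t,d) / sum_k P1(t,k) P2(k,d)
        R = X / (P1 @ P2 + eps)
        # P1(t,k) <- P1(t,k) * sum_d R(t,d) P2(k,d), then normalise over t.
        P1 = P1 * (R @ P2.T)
        P1 /= P1.sum(axis=0, keepdims=True) + eps

        # Recompute the ratio with the new P1 (an assumption of this sketch).
        R = X / (P1 @ P2 + eps)
        # P2(k,d) <- P2(k,d) * sum_t R(t,d) P1(t,k), then normalise over k.
        P2 = P2 * (P1.T @ R)
        P2 /= P2.sum(axis=0, keepdims=True) + eps

    return P1, P2

# Hypothetical toy data: terms 0-1 occur only in docs 0-1, terms 2-3 only in docs 2-3.
X = np.array([[4, 3, 0, 0],
              [5, 2, 0, 0],
              [0, 0, 3, 4],
              [0, 0, 2, 5]], dtype=float)
P1, P2 = plsa(X, K=2)
```

The columns of P1 can then be inspected to see which terms make up each topic, and the columns of P2 to see which topics are present in each document. Because the initialisation is random, different seeds may give the topics in a different order.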

Example of topics found from a Science Magazine papers collection

The performance of a retrieval system based on this model (PLSI) was found superior to that of both the vector space based similarity (cos) and a non-probabilistic latent semantic indexing (LSI) method. (We skip details here.)

From Th. Hofmann, 2000

Summing up

• Documents can be represented as numeric vectors in the space of words.

• The order of words is lost but the co-occurrences of words may still provide useful insights about the topical content of a collection of documents.

• PLSA is an unsupervised method based on this idea.
• We can use it to find out what topics there are in a collection of documents
• It is also a good basis for information retrieval systems

Related resources

Thomas Hofmann, Probabilistic Latent Semantic Analysis. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99). http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

Scott Deerwester et al., Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990. http://citeseer.ist.psu.edu/cache/papers/cs/339/http:zSzzSzsuperbook.bellcore.comzSz~stdzSzpaperszSzJASIS90.pdf/deerwester90indexing.pdf

The BOW toolkit for creating term by doc matrices and other text processing and analysis utilities: http://www.cs.cmu.edu/~mccallum/bow
