Release 0 - Read the Docs

28
PLSA Release 0.1 Sep 18, 2019

Transcript of Release 0 - Read the Docs

Page 1: Release 0 - Read the Docs

PLSARelease 0.1

Sep 18, 2019

Page 2: Release 0 - Read the Docs
Page 3: Release 0 - Read the Docs

Contents:

1 plsa.preprocessors module 1

2 plsa.pipeline module 5

3 plsa.corpus module 7

4 plsa.algorithms package 114.1 Submodules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

5 plsa.visualize module 17

6 Indices and tables 19

Python Module Index 21

Index 23

i

Page 4: Release 0 - Read the Docs

ii

Page 5: Release 0 - Read the Docs

CHAPTER 1

plsa.preprocessors module

Preprocessors for documents and words.

These preprocessors come in three flavours (functions, closures that return functions, and classes defining callableobjects). The choice for the respective flavour is motivated by the complexity of the preprocessor. If it doesn’t needany parameters, a simple function will do. If it is simple, does not need to be manipulated interactively, but needs someparameter(s), then a closure is fine. If it would be convenient to alter parameters of the preprocessor interactively, thena class is a good choice.

Preprocessors act either on an entire document string or, after splitting documents into individual words, on an iterableover the words contained in a single document. Therefore, they cannot be combined in arbitrary order but care mustbe taken to ensure that the return value of one matches the call signature of the next.

plsa.preprocessors.remove_non_ascii(doc: str)→ strRemoves non-ASCII characters (i.e., with unicode > 127) from a string.

Parameters doc (str) – A document given as a single string.

Returns The document as a single string with all characters of unicode > 127 removed.

Return type str

plsa.preprocessors.to_lower(doc: str)→ strConverts a string to all-lowercase.

Parameters doc (str) – A document given as a single string.

Returns The document as a single string with all characters converted to lowercase.

Return type str

plsa.preprocessors.remove_numbers(doc: str)→ strRemoves digit/number characters from a string.

Parameters doc (str) – A document given as a single string.

Returns The document as a single string with all number/digit characters removed.

Return type str

1

Page 6: Release 0 - Read the Docs

PLSA, Release 0.1

plsa.preprocessors.remove_tags(exclude_regex: str)→ Callable[[str], str]Returns callable that removes matches to the given regular expression.

Parameters exclude_regex (str) – A regular expression specifying specific patterns to removefrom a document.

Returns A callable that removes patterns matching the given regular expression from a string.

Return type function

plsa.preprocessors.remove_punctuation(punctuation: Iterable[str])→ Callable[[str], str]Returns callable that removes punctuation characters from a string.”

Parameters punctuation (iterable of str) – An iterable over single-character stringsspecifying punctuation characters to remove from a document.

Returns A callable that removes the given punctuation characters from a string.

Return type function

plsa.preprocessors.tokenize(doc: str)→ Tuple[str, ...]Splits a string into individual words.

Parameters doc (str) – A document given as a single string.

Returns The document as tuple of individual words.

Return type tuple of str

class plsa.preprocessors.RemoveStopwords(stopwords: Union[str, Iterable[str]])Bases: object

Instantiate callable objects that remove stopwords from a document.

Parameters stopwords (str or iterable of str) – Stopword(s) to remove from a doc-ument given as an iterable over words.

Examples

>>> from plsa.preprocessors import RemoveStopwords>>> remover = RemoveStopwords('is')>>> remover.words('is',)

>>> remover.words = 'the', 'are'>>> remover.words('the', 'are')

>>> remover += 'is', 'we'>>> remover.words('is', 'we', 'the', 'are')

>>> new_instance = remover + 'do'>>> new_instance.words('are', 'we', 'is', 'do', 'the')

wordsThe current stopwords.

2 Chapter 1. plsa.preprocessors module

Page 7: Release 0 - Read the Docs

PLSA, Release 0.1

class plsa.preprocessors.LemmatizeWords(*incl_pos)Bases: object

Instantiate callable objects that find the root form of words.

Parameters *inc_pos (str) – One or more positional tag(s) indicating the type(s) of words toretain and to find the root form of. Must be one of ‘JJ’ (adjectives), ‘NN’ (nouns), ‘VB’ (verbs),or ‘RB’ (adverbs).

Raises KeyError – If the given positional tags are not among the list of allowed ones.

Examples

>>> from plsa.preprocessors import LemmatizeWords>>> lemmatizer = LemmatizeWords('VB')>>> lemmatizer.types('VB',)

>>> lemmatizer.types = 'jj', 'nn'>>> lemmatizer.types('JJ', 'NN')

>>> lemmatizer += 'VB', 'NN'>>> lemmatizer.types('JJ', 'NN', 'VB')

>>> new_instance = lemmatizer + 'RB'>>> new_instance.types('JJ', 'RB', 'NN', 'VB')

typesThe current type(s) of words to retain.

plsa.preprocessors.remove_short_words(min_word_len: int) → Callable[[Iterable[str]], Tu-ple[str, ...]]

Returns a callable that removes short words from an iterable of strings.

Parameters min_word_len (int) – Minimum number of characters in a word for it to be re-tained.

Returns A callable that removes words shorter than the given threshold from an iterable over strings.

Return type function

3

Page 8: Release 0 - Read the Docs

PLSA, Release 0.1

4 Chapter 1. plsa.preprocessors module

Page 9: Release 0 - Read the Docs

CHAPTER 2

plsa.pipeline module

class plsa.pipeline.Pipeline(*preprocessors)Bases: object

Encapsulates and applies multiple document preprocessors.

Each preprocessor is assumed to be a callable that takes a single document as input and produces a singledocument as output. Importantly, each document fed to the first preprocessor in the chain is delivered as asingle string, while the last preprocessor is required to return it as an iterable over strings with each elementrepresenting one word of that document.

Other than that, preprocessors can be combined in any which way, provided that the return value of one matchesthe call signature of the next. The order in which they are applied is the order in which they are specified, i.e.,from left to right.

Parameters *preprocessors (callable) – Function(s) or other callable object(s) that eachtake a single document as input and produce a (processed) document as output.

See also:

plsa.preprocessors

process(doc: str)→ Tuple[str, ...]Applies a chain of one or more preprocessors to a document.

Parameters doc (str) – A text document given as a single string.

Returns Each element represents one word of the document.

Return type tuple of str

5

Page 10: Release 0 - Read the Docs

PLSA, Release 0.1

6 Chapter 2. plsa.pipeline module

Page 11: Release 0 - Read the Docs

CHAPTER 3

plsa.corpus module

class plsa.corpus.Corpus(corpus: Iterable[str], pipeline: plsa.pipeline.Pipeline)Bases: object

Processes raw document collections and provides numeric representations.

Parameters

• corpus (iterable of str) – An iterable over documents given as a single stringeach.

• pipeline (Pipeline) – The preprocessing pipeline.

See also:

plsa.pipeline

classmethod from_csv(path: str, pipeline: plsa.pipeline.Pipeline, col: int = -1, encoding: str =’latin_1’, max_docs: int = 1000, **kwargs)→ plsa.corpus.Corpus

Instantiate a corpus from documents in a column of a CSV file.

Parameters

• path (str) – Full path (incl. file name) to a CSV file with one column containingdocuments.

• pipeline – The preprocessing pipeline.

• col (int) – Which column contains the documents. Numbering starts with 0 for the firstcolumn. Negative numbers count back from the last column (e.g., -1 for last, -2 just beforethe last, etc.).

• encoding (str) – A valid python encoding used to read the documents.

• max_docs (int) – The maximum number of documents to read from file.

• **kwargs – Keyword arguments are passed on to Python’s own csv.reader function.

Raises StopIteration – If you do not have at least two lines in your CSV file.

7

Page 12: Release 0 - Read the Docs

PLSA, Release 0.1

Notes

If you set a col to a value outside the range present in the CSV file, it will be silently reset to the first orlast column, depending on which side you exceed the permitted range.

A list of available encodings can be found at https://docs.python.org/3/library/codecs.html

Formatting parameters for the Python’s csv.reader can be found at https://docs.python.org/3/library/csv.html#csv-fmt-params

classmethod from_xml(directory: str, pipeline: plsa.pipeline.Pipeline, tag: str = ’post’, encoding:str = ’latin_1’, max_files: int = 100)→ plsa.corpus.Corpus

Instantiate a corpus from elements of XML files in a directory.

Parameters

• directory (str) – Path to the directory with the XML files.

• pipeline (Pipeline) – The preprocessing pipeline.

• tag – The XML tag that opens (<. . . >) and closes (</. . . >) the elements containing doc-uments.

• encoding – A valid python encoding used to read the documents.

• max_files – The maximum number of XML files to read.

Notes

A list of available encodings can be found at https://docs.python.org/3/library/codecs.html

get_doc(tf_idf: bool)→ numpy.ndarrayThe marginal probability that any word comes from a given document.

This probability p(d) is obtained by summing the joint document- word probability p(d, w) over all words.

Parameters tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.

Returns The document probability p(d).

Return type ndarray

get_doc_given_word(tf_idf: bool)→ numpy.ndarrayThe conditional probability of a particular word in a given document.

This probability p(d|w) is obtained by dividing the joint document- word probability p(d, w) by themarginal word probability p(w).

Parameters tf_idf (bool) – Whether to base the conditional probability on the term-frequency inverse-document-frequency or just the term-frequency matrix.

Returns The conditional word probability p(d|w).

Return type ndarray

get_doc_word(tf_idf: bool)→ numpy.ndarrayThe normalized document-word counts matrix.

Also referred to as the term-frequency matrix. Because words (or terms) that occur in the majority ofdocuments are the least helpful in discriminating types of documents, each column of this matrix canbe multiplied by the logarithm of the total number of documents divided by the number of documentscontaining the given word. The result is then referred to as the term-frequency inverse-document-frequencyor TF-IDF matrix.

8 Chapter 3. plsa.corpus module

Page 13: Release 0 - Read the Docs

PLSA, Release 0.1

Either way, the returned matrix is always normalized such that it can be interpreted as the joint document-word probability p(d, w).

Parameters tf_idf (bool) – Whether to return the term-frequency inverse-document-frequency or just the term-frequency matrix.

Returns The normalized document (rows) - word (columns) matrix, either as pure counts (iftf_idf = False) or weighted by the inverse document frequency (if tf_idf is False).

Return type ndarray

get_word(tf_idf: bool)→ numpy.ndarrayThe marginal probability of a particular word.

This probability p(w) is obtained by summing the joint document- word probability p(d, w) over all docu-ments.

Parameters tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.

Returns The word probability p(w).

Return type ndarray

idfLogarithm of inverse fraction of documents each word occurs in.

indexMapping from actual word to numeric word index.

n_docsThe number of non-empty documents.

n_occurrencesTotal number of times any word occurred in any document.

n_wordsThe number of unique words retained after preprocessing.

pipelineThe pipeline of preprocessors for each document.

rawThe raw documents as they were read from the source.

vocabularyMapping from numeric word index to actual word.

9

Page 14: Release 0 - Read the Docs

PLSA, Release 0.1

10 Chapter 3. plsa.corpus module

Page 15: Release 0 - Read the Docs

CHAPTER 4

plsa.algorithms package

4.1 Submodules

4.1.1 plsa.algorithms.plsa module

class plsa.algorithms.plsa.PLSA(corpus: plsa.corpus.Corpus, n_topics: int, tf_idf: bool = True)Implements probabilistic latent semantic analysis (PLSA).

At its core lies the assumption that the normalized document-word (or term-frequency) matrix p(d, w), weightedwith the inverse document frequency or not, can be factorized as:

𝑝(𝑑,𝑤) ≈∑︁𝑡

𝑝(𝑑|𝑡)𝑝(𝑤|𝑡)𝑝(𝑡)

Parameters

• corpus (Corpus) – The corpus of preprocessed and numerically represented documents.

• n_topics (int) – The number of latent topics to identify.

• tf_idf (bool) – Whether to use the term-frequency inverse-document-frequency or justthe term-frequency matrix as joint probability p(d, w) of documents and words.

Raises ValueError – If the number of topics is < 2 or the number of both, words and documents,in the corpus isn’t greater than the number of topics.

Notes

The implementation follows algorithm 15.2 in Barber’s book1 to the letter. What is not said there is that, in orderto update the conditional probability p(t|d, w) of a certain topic given a certain word in a certain document, onefirst needs to find the joint probability of all random variables as

𝑝(𝑡, 𝑑, 𝑤) = 𝑝(𝑑|𝑡)𝑝(𝑤|𝑡)𝑝(𝑡)

and then divide by the marginal 𝑝(𝑑,𝑤).1 “Bayesian Reasoning and Machine Learning”, David Barber (Cambridge Press, 2012).

11

Page 16: Release 0 - Read the Docs

PLSA, Release 0.1

References

best_of(n_runs: int = 3, **kwargs)→ plsa.algorithms.result.PlsaResultFinds the best PLSA model among the specified number of runs.

As with any iterative algorithm, also the probabilities in PSLA need to be (randomly) initialized prior tothe first iteration step. Therefore, calling the fit method of two different instances operating on the samecorpus with the same number of topics potentially leads to (slightly) different results, corresponding todifferent local minima of the Kullback-Leibler divergence between the true document-word probabilityand its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.

Parameters

• n_runs (int, optional) – Number of runs to pick the best model of. Defaults to 3.

• **kwargs – Keyword-only arguments are passed on to the fit method.

Returns Container class for the best result.

Return type PlsaResult

fit(eps: float = 1e-05, max_iter: int = 200, warmup: int = 5)→ plsa.algorithms.result.PlsaResultRun EM-style training to find latent topics in documents.

Expectation-maximization (EM) iterates until either the maximum number of iterations is reached or ifrelative changes of the Kullback- Leibler divergence between the actual document-word probability andits approximate fall below a certain threshold, whichever occurs first.

Since all quantities are update in-place, calling the fit method again after a successful run (possibly withchanged convergence criteria) will continue to add more iterations on top of the status quo rather thanstarting all over again from scratch.

Because a few EM iterations are needed to get things going, you can specify an initial warm-up period,during which progress in the Kullback-Leibler divergence is not tracked, and which does not count towardsthe maximum number of iterations.

Parameters

• eps (float, optional) – The convergence cutoff for relative changes in theKullback- Leibler divergence between the actual document-word probability and its ap-proximate. Defaults to 1e-5.

• max_iter (int, optional) – The maximum number of iterations to perform. De-faults to 200.

• warmup (int, optional) – The number of iterations to perform before changes inthe Kullback-Leibler divergence are tracked for convergence.

Returns Container class for the results of the latent semantic analysis.

Return type PlsaResult

n_topicsThe number of topics to find.

tf_idfUse inverse document frequency to weigh the document-word counts?

12 Chapter 4. plsa.algorithms package

Page 17: Release 0 - Read the Docs

PLSA, Release 0.1

4.1.2 plsa.algorithms.conditional_plsa module

class plsa.algorithms.conditional_plsa.ConditionalPLSA(corpus: plsa.corpus.Corpus,n_topics: int, tf_idf: bool =True)

Implements conditional probabilistic latent semantic analysis (PLSA).

Given that the normalized document-word (or term-frequency) matrix p(d, w), weighted with the inverse docu-ment frequency or not, can always be written as,

𝑝(𝑑,𝑤) = 𝑝(𝑑|𝑤)𝑝(𝑤)

the core of conditional PLSA is the assumption that the conditional p(d|w) can be factorized as:

𝑝(𝑑|𝑤) ≈∑︁𝑡

𝑝(𝑑|𝑡)𝑝(𝑡|𝑤)

Parameters

• corpus (Corpus) – The corpus of preprocessed and numerically represented documents.

• n_topics (int) – The number of latent topics to identify.

• tf_idf (bool) – Whether to use the term-frequency inverse-document-frequency or justthe term-frequency matrix as joint probability p(d, w) of documents and words.

Raises ValueError – If the number of topics is < 2 or the number of both, words and documents,in the corpus isn’t greater than the number of topics.

Notes

Importantly, the present implementation does not follow algorithm 15.3 in Barber’s book1. The update equationsthere appear non-sensical. Following through the derivation that gives (non-conditional) PLSA, one arrives atthe following updates:

𝑝(𝑑|𝑡) =∑︁𝑤

𝑝(𝑑,𝑤)𝑞(𝑡|𝑑,𝑤)

𝑝(𝑡|𝑤) =∑︁𝑑

𝑝(𝑑,𝑤)𝑞(𝑡|𝑑,𝑤)

𝑝(𝑡, 𝑑, 𝑤) = 𝑝(𝑤)∑︁𝑡

𝑝(𝑑|𝑡)𝑝(𝑡|𝑤)

𝑞(𝑡|𝑑,𝑤) = 𝑝(𝑡, 𝑑, 𝑤)/𝑝(𝑑,𝑤)

References

best_of(n_runs: int = 3, **kwargs)→ plsa.algorithms.result.PlsaResultFinds the best PLSA model among the specified number of runs.

As with any iterative algorithm, also the probabilities in PSLA need to be (randomly) initialized prior tothe first iteration step. Therefore, calling the fit method of two different instances operating on the samecorpus with the same number of topics potentially leads to (slightly) different results, corresponding todifferent local minima of the Kullback-Leibler divergence between the true document-word probabilityand its approximate factorization. To mitigate this effect, perform multiple runs and pick the best model.

Parameters1 “Bayesian Reasoning and Machine Learning”, David Barber (Cambridge Press, 2012).

4.1. Submodules 13

Page 18: Release 0 - Read the Docs

PLSA, Release 0.1

• n_runs (int, optional) – Number of runs to pick the best model of. Defaults to 3.

• **kwargs – Keyword-only arguments are passed on to the fit method.

Returns Container class for the best result.

Return type PlsaResult

fit(eps: float = 1e-05, max_iter: int = 200, warmup: int = 5)→ plsa.algorithms.result.PlsaResultRun EM-style training to find latent topics in documents.

Expectation-maximization (EM) iterates until either the maximum number of iterations is reached or ifrelative changes of the Kullback- Leibler divergence between the actual document-word probability andits approximate fall below a certain threshold, whichever occurs first.

Since all quantities are update in-place, calling the fit method again after a successful run (possibly withchanged convergence criteria) will continue to add more iterations on top of the status quo rather thanstarting all over again from scratch.

Because a few EM iterations are needed to get things going, you can specify an initial warm-up period,during which progress in the Kullback-Leibler divergence is not tracked, and which does not count towardsthe maximum number of iterations.

Parameters

• eps (float, optional) – The convergence cutoff for relative changes in theKullback- Leibler divergence between the actual document-word probability and its ap-proximate. Defaults to 1e-5.

• max_iter (int, optional) – The maximum number of iterations to perform. De-faults to 200.

• warmup (int, optional) – The number of iterations to perform before changes inthe Kullback-Leibler divergence are tracked for convergence.

Returns Container class for the results of the latent semantic analysis.

Return type PlsaResult

n_topicsThe number of topics to find.

tf_idfUse inverse document frequency to weigh the document-word counts?

4.1.3 plsa.algorithms.result module

class plsa.algorithms.result.PlsaResult(topic_given_doc: numpy.ndarray,word_given_topic: numpy.ndarray,topic_given_word: numpy.ndarray, topic:numpy.ndarray, kl_divergences: List[float],corpus: plsa.corpus.Corpus, tf_idf: bool)

Bases: object

Container for the results generated by a (conditional) PLSA run.

Parameters

• topic_given_doc (ndarray) – The conditional probability p(t|d) as 𝑛𝑡𝑜𝑝𝑖𝑐𝑠 × 𝑛𝑑𝑜𝑐𝑠

array.

• word_given_topic (ndarray) – The conditional probability p(w|t) as 𝑛𝑤𝑜𝑟𝑑𝑠 ×𝑛𝑡𝑜𝑝𝑖𝑐𝑠 array.

14 Chapter 4. plsa.algorithms package

Page 19: Release 0 - Read the Docs

PLSA, Release 0.1

• topic_given_word (ndarray) – The conditional probability p(t|w) as 𝑛𝑡𝑜𝑝𝑖𝑐𝑠 ×𝑛𝑤𝑜𝑟𝑑𝑠 array.

• topic (ndarray) – The marginal probability p(w).

• kl_divergences (list of float) – The Kullback-Leibler divergences between theoriginal document-word probability p(d, w) and its approximate for each iteration.

• corpus (Corpus) – The original corpus the PLSA model was trained on.

• tf_idf (bool) – Whether to weigh the document.word matrix with the inverse documentfrequencies or not.

convergenceThe convergence of the Kullback-Leibler divergence.

kl_divergenceKL-divergence of approximate and true document-word probability.

n_topicsThe number of latent topics identified.

predict(doc: str)→ Tuple[numpy.ndarray, int, Tuple[str, ...]]Predict the relative importance of latent topics in a new document.

Parameters doc (str) – A new document given as a single string.

Returns

• ndarray – A 1-D array with the relative importance of latent topics.

• int – The number of words in the new document that were not present in the corpus thePLSA model was trained on.

• tuple of str – Those words in the new document that were not present in the corpus thePLSA model was trained on.

Raises ValueError – If the document to predict on is an empty string, if there are no wordsleft after preprocessing the document, or if there are no known words in the document.

tf_idfUsed inverse document frequency to weigh the document-word counts?

topicThe relative importance of latent topics.

topic_given_docThe relative importance of latent topics in each document.

Dimensions are 𝑛𝑑𝑜𝑐𝑠 × 𝑛𝑡𝑜𝑝𝑖𝑐𝑠.

word_given_topicThe words in each latent topic and their relative importance.

Results are presented as a tuple of 2-tuples (word, word importance).

4.1. Submodules 15

Page 20: Release 0 - Read the Docs

PLSA, Release 0.1

16 Chapter 4. plsa.algorithms package

Page 21: Release 0 - Read the Docs

CHAPTER 5

plsa.visualize module

class plsa.visualize.Visualize(result: plsa.algorithms.result.PlsaResult)Bases: object

Visualize the results of probabilistic latent semantic analysis.

Parameters result (PlsaResult) – The results object returned by the fit method of a PLSAmodel object.

convergence(axis: matplotlib.axes._subplots.AxesSubplot)→ List[matplotlib.lines.Line2D]Plot the convergence of the PLSA run.

The quantity to be minimized is the Kullback-Leibler divergence between the original document-wordmatrix and its approximation given by the (conditional) PLSA factorization.

Parameters axis (Subplot) – The matplotlib axis to plot into.

Returns The line object plotted into the given axis.

Return type list of Line2D

prediction(doc: str, axis: matplotlib.axes._subplots.AxesSubplot) → mat-plotlib.container.BarContainer

Plot the predicted relative weights of topics in a new document.

Parameters

• doc (str) – A new document given as a single string.

• axis (Subplot) – The matplotlib axis to plot into.

Returns The container for the bars plotted into the given axis.

Return type BarContainer

topics(axis: matplotlib.axes._subplots.AxesSubplot)→ matplotlib.container.BarContainerPlot the relative importance of the individual topics.

Parameters axis (Subplot) – The matplotlib axis to plot into.

Returns The container for the bars plotted into the given axis.

17

Page 22: Release 0 - Read the Docs

PLSA, Release 0.1

Return type BarContainer

topics_in_doc(i_doc: int, axis: matplotlib.axes._subplots.AxesSubplot) → mat-plotlib.container.BarContainer

Plot the relative weights of topics in a given document.

Parameters

• i_doc (int) – Index of the document to plot. Numbering starts at 0.

• axis (Subplot) – The matplotlib axis to plot into.

Returns The container for the bars plotted into the given axis.

Return type BarContainer

wordclouds(figure: matplotlib.figure.Figure)→ List[matplotlib.image.AxesImage]Plot the relative importance of words in all topics.

Parameters figure (Figure) – An empty matplotlib figure to plot into.

Returns List of images with the created word clouds.

Return type list of AxisImage

words_in_topic(i_topic: int, axis: matplotlib.axes._subplots.AxesSubplot) → mat-plotlib.image.AxesImage

Plot the relative importance of words in a given topic.

Parameters

• i_topic (int) – Index of the topic to plot. Numbering starts at 0.

• axis (Subplot) – The matplotlib axis to plot into.

Returns The image with the produced word cloud.

Return type AxesImage

18 Chapter 5. plsa.visualize module

Page 23: Release 0 - Read the Docs

CHAPTER 6

Indices and tables

• genindex

• modindex

• search

19

Page 24: Release 0 - Read the Docs

PLSA, Release 0.1

20 Chapter 6. Indices and tables

Page 25: Release 0 - Read the Docs

Python Module Index

pplsa.algorithms.conditional_plsa, 13plsa.algorithms.plsa, 11plsa.algorithms.result, 14plsa.corpus, 7plsa.pipeline, 5plsa.preprocessors, 1plsa.visualize, 17

21

Page 26: Release 0 - Read the Docs

PLSA, Release 0.1

22 Python Module Index

Page 27: Release 0 - Read the Docs

Index

Bbest_of() (plsa.algorithms.conditional_plsa.ConditionalPLSA

method), 13best_of() (plsa.algorithms.plsa.PLSA method), 12

CConditionalPLSA (class in

plsa.algorithms.conditional_plsa), 13convergence (plsa.algorithms.result.PlsaResult at-

tribute), 15convergence() (plsa.visualize.Visualize method), 17Corpus (class in plsa.corpus), 7

Ffit() (plsa.algorithms.conditional_plsa.ConditionalPLSA

method), 14fit() (plsa.algorithms.plsa.PLSA method), 12from_csv() (plsa.corpus.Corpus class method), 7from_xml() (plsa.corpus.Corpus class method), 8

Gget_doc() (plsa.corpus.Corpus method), 8get_doc_given_word() (plsa.corpus.Corpus

method), 8get_doc_word() (plsa.corpus.Corpus method), 8get_word() (plsa.corpus.Corpus method), 9

Iidf (plsa.corpus.Corpus attribute), 9index (plsa.corpus.Corpus attribute), 9

Kkl_divergence (plsa.algorithms.result.PlsaResult at-

tribute), 15

LLemmatizeWords (class in plsa.preprocessors), 2

Nn_docs (plsa.corpus.Corpus attribute), 9n_occurrences (plsa.corpus.Corpus attribute), 9n_topics (plsa.algorithms.conditional_plsa.ConditionalPLSA

attribute), 14n_topics (plsa.algorithms.plsa.PLSA attribute), 12n_topics (plsa.algorithms.result.PlsaResult attribute),

15n_words (plsa.corpus.Corpus attribute), 9

PPipeline (class in plsa.pipeline), 5pipeline (plsa.corpus.Corpus attribute), 9PLSA (class in plsa.algorithms.plsa), 11plsa.algorithms.conditional_plsa (mod-

ule), 13plsa.algorithms.plsa (module), 11plsa.algorithms.result (module), 14plsa.corpus (module), 7plsa.pipeline (module), 5plsa.preprocessors (module), 1plsa.visualize (module), 17PlsaResult (class in plsa.algorithms.result), 14predict() (plsa.algorithms.result.PlsaResult method),

15prediction() (plsa.visualize.Visualize method), 17process() (plsa.pipeline.Pipeline method), 5

Rraw (plsa.corpus.Corpus attribute), 9remove_non_ascii() (in module

plsa.preprocessors), 1remove_numbers() (in module plsa.preprocessors),

1remove_punctuation() (in module

plsa.preprocessors), 2remove_short_words() (in module

plsa.preprocessors), 3remove_tags() (in module plsa.preprocessors), 1

23

Page 28: Release 0 - Read the Docs

PLSA, Release 0.1

RemoveStopwords (class in plsa.preprocessors), 2

Ttf_idf (plsa.algorithms.conditional_plsa.ConditionalPLSA

attribute), 14tf_idf (plsa.algorithms.plsa.PLSA attribute), 12tf_idf (plsa.algorithms.result.PlsaResult attribute), 15to_lower() (in module plsa.preprocessors), 1tokenize() (in module plsa.preprocessors), 2topic (plsa.algorithms.result.PlsaResult attribute), 15topic_given_doc (plsa.algorithms.result.PlsaResult

attribute), 15topics() (plsa.visualize.Visualize method), 17topics_in_doc() (plsa.visualize.Visualize method),

18types (plsa.preprocessors.LemmatizeWords attribute), 3

VVisualize (class in plsa.visualize), 17vocabulary (plsa.corpus.Corpus attribute), 9

Wword_given_topic (plsa.algorithms.result.PlsaResult

attribute), 15wordclouds() (plsa.visualize.Visualize method), 18words (plsa.preprocessors.RemoveStopwords attribute),

2words_in_topic() (plsa.visualize.Visualize

method), 18

24 Index