STATISTICAL TOPIC MODELING (part 1)
Andrea Tagarelli
Univ. of Calabria, Italy
Statistical topic modeling (1/3)
• Key assumption: text data are represented as a mixture of topics, i.e., probability distributions over terms (see the toy sketch below)
• Generative model for documents: document features are generated by latent variables
• Topic modeling vs. vector-space text modeling:
  • (Latent) semantic aspects underlying correlations between words
  • Document topical structure
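To make the mixture-of-topics assumption concrete, here is a minimal generative sketch; the six-word vocabulary, the two topic distributions, and the mixing weights are all invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "dna", "cell", "ball", "goal", "team"]

    # hypothetical per-topic word distributions (each row sums to 1)
    topics = np.array([
        [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "biology" topic
        [0.0, 0.0, 0.0, 0.3, 0.3, 0.4],   # a "sports" topic
    ])
    theta = np.array([0.7, 0.3])  # this document's topic mixture

    # the document's word distribution is the topic-weighted mixture,
    # and word tokens are drawn from it
    p_word = theta @ topics
    print(rng.choice(vocab, size=10, p=p_word))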
Statistical topic modeling (2/3)
• Training on a (large) corpus to learn (sketched below):
  • Per-topic word distributions
  • Per-document topic distributions
[Blei, CACM, 2012]
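A minimal sketch of what training yields in practice, using scikit-learn's LDA implementation as a stand-in for any topic-model trainer; the four-document corpus is invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the gene encodes a protein",
            "dna sequencing of the gene",
            "the team scored a late goal",
            "fans cheered the winning team"]

    X = CountVectorizer().fit_transform(docs)  # document-term counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # per-topic word distributions: normalize the learned topic-word weights
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    doc_topic = lda.transform(X)  # per-document topic distributions
    print(topic_word.shape, doc_topic.shape)  # (2, V) and (4, 2)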
Statistical topic modeling (3/3)
• Graphical “plate” notation:
  • Standard representation for generative models
  • Rectangles (plates) represent repeated parts of the model; the plate label gives the number of times the enclosed variable(s) are repeated
[Hofmann, SIGIR, 1999]
Observed and latent variables
• Observed variable: one whose current value we know
• Latent variable: one whose state cannot be directly observed
• Estimation problem:
  • Estimate the values of a set of distribution parameters that best explain a set of observations
  • The most likely parameter values give the maximum likelihood of the model
• The likelihood is impossible to calculate in full; it is approximated through one of the following (a minimal EM sketch follows this list):
  • Expectation-maximization (EM): an iterative method to estimate the distributions of unobserved, latent variables, repeated until a local optimum is reached
  • Gibbs sampling: update the parameters sample-wise
  • Variational inference: approximate the model by a simpler, tractable one
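A minimal EM sketch on a simple latent-variable model, a two-component Gaussian mixture with unit variances (the synthetic data and initial values are invented): the component label of each point is the latent variable, and alternating E- and M-steps climb to a local optimum of the likelihood.

    import numpy as np

    rng = np.random.default_rng(1)
    # synthetic data from two Gaussians; the generating component is latent
    x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

    mu = np.array([-1.0, 1.0])   # initial component means
    pi = np.array([0.5, 0.5])    # initial mixing weights

    for _ in range(100):
        # E-step: posterior responsibility of each component for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2)  # N(x; mu_k, 1), unnormalized
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood updates given the responsibilities
        pi = r.mean(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

    print(mu.round(2), pi.round(2))  # near (-2, 3) and (0.6, 0.4)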
Probabilistic LSA
• PLSA [Hofmann, 2001]:
  • Probabilistic version of LSA, conceived to better handle problems of term polysemy
[Plate diagram: document d generates a latent topic z, which generates a word w; plates mark repetition over the word positions and over the documents (M and N in the figure)]
PLSA training (1/2)
• Joint probability model:
  P(d, w) = P(d) ∑_z P(z|d) P(w|z)
• Likelihood (with n(d, w) the count of word w in document d):
  L = ∑_d ∑_w n(d, w) log P(d, w)
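Both quantities are direct to compute once the parameter matrices are in hand; a sketch with invented toy sizes and random parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    D, V, K = 3, 5, 2                           # toy sizes: docs, vocab, topics
    n = rng.integers(1, 4, (D, V))              # toy term counts n(d, w)
    Pd = n.sum(axis=1) / n.sum()                # P(d)
    Pz_d = rng.dirichlet(np.ones(K), size=D)    # P(z|d)
    Pw_z = rng.dirichlet(np.ones(V), size=K)    # P(w|z)

    # joint probability model: P(d, w) = P(d) * sum_z P(z|d) P(w|z)
    Pdw = Pd[:, None] * (Pz_d @ Pw_z)
    # log-likelihood: L = sum_{d,w} n(d, w) log P(d, w)
    print((n * np.log(Pdw)).sum())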
PLSA training (2/2)
• Training with EM:
  • Initialization of the per-topic word distributions P(w|z) and per-document topic distributions P(z|d)
  • E-step: compute the posterior of the latent topic for each (d, w) pair:
    P(z|d, w) = P(z|d) P(w|z) / ∑_z′ P(z′|d) P(w|z′)
  • M-step: re-estimate the parameters from the expected counts:
    P(w|z) ∝ ∑_d n(d, w) P(z|d, w)
    P(z|d) ∝ ∑_w n(d, w) P(z|d, w)
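A self-contained NumPy sketch of this EM loop on an invented toy corpus (sizes and counts are made up); each iteration applies exactly the E- and M-step updates above:

    import numpy as np

    rng = np.random.default_rng(0)
    D, V, K = 4, 6, 2                  # toy sizes: docs, vocab, topics
    n = rng.integers(1, 5, (D, V))     # toy term-frequency matrix n(d, w)

    # initialization of the two parameter sets
    Pz_d = rng.dirichlet(np.ones(K), size=D)   # P(z|d), shape (D, K)
    Pw_z = rng.dirichlet(np.ones(V), size=K)   # P(w|z), shape (K, V)

    for it in range(50):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z), shape (D, V, K)
        q = Pz_d[:, None, :] * Pw_z.T[None, :, :]
        q /= q.sum(axis=2, keepdims=True)
        # M-step: re-estimate both distributions from expected counts
        nq = n[:, :, None] * q
        Pw_z = nq.sum(axis=0).T
        Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pz_d = nq.sum(axis=1)
        Pz_d /= Pz_d.sum(axis=1, keepdims=True)
        if it % 10 == 0:  # monitor the conditional log-likelihood sum n(d,w) log P(w|d)
            print(it, (n * np.log(Pz_d @ Pw_z)).sum())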
Latent Dirichlet Allocation (1/2)
• LDA [Blei et al., 2003]:
  • Adds a Dirichlet prior on the per-document topic distribution
  • 3-level scheme: corpus, documents, and terms; terms are the only observed variables
[Plate diagram: for each doc dj in a collection of N docs, and for each word position i in a doc of length M, a topic is assigned to the word at position i in dj and the word token at that position is drawn from it; θ is the per-document topic distribution and β the per-topic word distribution. A generative sketch follows below.]
[Moens and Vulic, Tutorial @WSDM 2014]
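The generative process just described, as a runnable sketch (all sizes and hyperparameter values below are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, N, M = 2, 6, 3, 8      # topics, vocab size, N docs, M words per doc
    alpha, eta = 0.5, 0.1        # symmetric Dirichlet hyperparameters

    beta = rng.dirichlet(np.full(V, eta), size=K)   # per-topic word distributions
    for j in range(N):                              # for each doc dj
        theta = rng.dirichlet(np.full(K, alpha))    # per-document topic distribution
        for i in range(M):                          # for each word position i
            z = rng.choice(K, p=theta)              # topic assignment at position i
            w = rng.choice(V, p=beta[z])            # word token at position i
            print(j, i, z, w)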
Latent Dirichlet Allocation (2/2)
• Meaning of the Dirichlet priors (see the sampling sketch below):
  • θ ~ Dir(α1, …, αK)
  • Each αk is a prior observation count for the number of times topic zk is sampled in a document, prior to any word observations
  • Analogously for each ηi, with β ~ Dir(η1, …, ηV)
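The pseudo-count reading can be checked by sampling (the alpha vectors below are arbitrary): the mean of Dir(α) is α/∑α, and larger counts give samples that spread less around that mean.

    import numpy as np

    rng = np.random.default_rng(0)
    # [20, 2, 2] and [100, 10, 10] share the same mean, but the larger
    # pseudo-counts concentrate the samples more tightly around it
    for alpha in ([1, 1, 1], [20, 2, 2], [100, 10, 10]):
        theta = rng.dirichlet(alpha, size=1000)
        print(alpha, theta.mean(axis=0).round(2), theta.std(axis=0).round(2))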
• Inference for a new document: given α, β, η, infer θ
• The exact inference problem is intractable; training relies on approximation through:
  • Gibbs sampling
  • Variational inference
(a variational-inference sketch follows below)
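A sketch of approximate inference for a new document with scikit-learn's LDA, which is fitted by (online) variational inference; its doc_topic_prior and topic_word_prior parameters play the roles of α and η here, and the tiny corpus is invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    train = ["gene dna cell", "dna protein gene",
             "goal team match", "team fans goal"]
    vec = CountVectorizer()
    X = vec.fit_transform(train)

    lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.1,
                                    topic_word_prior=0.01,
                                    random_state=0).fit(X)

    # infer theta for a new, unseen document given the learned model
    new_doc = vec.transform(["dna of the gene"])
    print(lda.transform(new_doc))  # approximate posterior topic mixture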