Latent Tree Models for Hierarchical Topic Detection
Transcript of Latent Tree Models for Hierarchical Topic Detection
Nevin L. Zhang With: Tengfei Liu, Peixian Chen, Kin Man Poon,
Zhourong Chen Topic Detection One of the most active machine learning
research topics in the past decade. Objective: Reveal topics in a document
collection. What topics? Sizes? Topic-document relationships?
Topic-topic relationships? Topic evolution? Why Interesting?
Provide overview of document collection
Topic-guided browsing Features for further processing
(classification, retrieval) Topic Hierarchy for NIPS Data
Current Methods Latent Dirichlet Allocation (LDA) (Blei et al,
2003)
Correlated topic models (Lafferty & Blei 2005) Hierarchical LDA
(hLDA) (Griffiths et al 2004, Blei et al 2010) Nested hierarchical
Dirichlet processes (nHDP) (Paisley et al 2012) Dynamic topic model
(Blei & Lafferty 2006) ... A Novel Approach to Hierarchical
Topic Detection
Hierarchical latent tree analysis (HLTA) T. Liu, N.L. Zhang, P.
Chen. Hierarchical Latent Tree Analysis for Topic Detection.
ECML/PKDD (2) 2014: P. Chen, N.L. Zhang, et al. Progressive EM for
Latent Tree Models and Hierarchical Topic Detection. AAAI 2016.
T. Liu, N.L. Zhang, et al. Greedy learning of latent tree models for
multidimensional clustering. Machine Learning 98(1-2): (2015)
Outline Clustering and latent tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions How to Cluster? Here I have
some pictures. Intuitively, how should we cluster them? ... Pause ...
Obviously, we should divide them into two groups like this. This is
quite clear. How to Cluster These? Now, how do we cluster
those? Pause ... How to get one partition?
Finite mixture models: one latent variable z Gaussian mixture
models: continuous data Latent class model: categorical data Key
point: use a model with one latent variable for one partition How to
get multiple partitions?
Use models with multiple latent variables for multiple partitions
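The single-partition idea above (one latent variable z defining one clustering) can be sketched with a two-class latent class model for binary data, fit by EM. This is an illustrative numpy sketch with invented data, not the talk's software:

```python
import numpy as np

# Minimal EM for a latent class model (LCM): one binary latent variable Z
# partitions binary records into 2 clusters. All names and data here are
# illustrative, not from the HLTA software.

def lcm_em(X, n_iter=50, seed=0):
    """Fit a 2-class latent class model to binary data X (n x d) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.array([0.5, 0.5])                      # P(Z)
    theta = rng.uniform(0.25, 0.75, size=(2, d))   # P(X_j = 1 | Z)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z | x) for each record
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update P(Z) and P(X_j | Z), with light smoothing
        pi = resp.mean(axis=0)
        theta = (resp.T @ X + 1e-6) / (resp.sum(axis=0)[:, None] + 2e-6)
    return pi, theta, resp

# Two planted clusters: first half mostly-ones, second half mostly-zeros.
rng = np.random.default_rng(1)
X = np.vstack([rng.random((100, 6)) < 0.9,
               rng.random((100, 6)) < 0.1]).astype(float)
pi, theta, resp = lcm_em(X)
labels = resp.argmax(axis=1)
```

With well-separated clusters the responsibilities become nearly 0/1, and taking the argmax over them recovers the planted partition.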
Latent tree models: Probabilistic graphical models with multiple
latent variables Latent Tree Models (LTMs)
Tree-structured probabilistic graphical models (Bayesian networks)
with multiple latent variables Leaves observed (manifest variables)
Discrete Internal nodes latent (latent variables) Each edge is
associated with a conditional distribution One latent node with
marginal distribution Defines a joint distribution over all the
variables (Zhang, JMLR 2004) What are latent tree models, or LTMs? A
latent tree model is a tree-structured probabilistic graphical
model, or Bayesian network, where the leaf nodes represent observed
variables and the internal nodes represent latent variables
that are not observed. This is an example latent tree model for the
academic performance of high school students. The grades that the
students get in the 4 subjects, Math, Science, English, and History,
are observed variables. The other two skill variables, analytical
skill and literacy skill, are latent. The model says that a
student's performance in Math and Science is influenced by his
analytical skill, his performance in English and History is
influenced by his literacy skill, and the two latent skills are
correlated. Each edge in the model is associated with a probability
distribution. Collectively, those distributions define a joint
distribution over all the variables. Latent Tree Analysis
(LTA)
Learning latent tree models: Determine Number of latent variables
(i.e. number of partitions) Numbers of possible states for each
latent variable (i.e. number of clusters in each partition)
Connections among nodes Probability distributions Latent tree
analysis, or LTA, refers to the process of obtaining a latent tree
model from data on observed variables. For our example, we might
start with the scores of the students, and from this data we want
to obtain a model like this. To do so, we first need to determine
the number of latent variables; the number of states for each
latent variable; how the nodes should be connected up to form a
tree; and finally the probability distributions. This is a
rather difficult problem. In this tutorial, I will discuss various
algorithms for solving it. Before doing that, I would like
to use two examples to illustrate what LTA can be used for. Latent
Tree Analysis Learning latent tree models is difficult, but
doable. Outline Clustering and latent tree
models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions Hierarchical Latent Tree
Analysis (HLTA)
Each word is a variable: 0 = absence from doc, 1 = presence in doc Each
doc is a binary vector over the vocabulary Topics Each latent variable
partitions docs into 2 clusters
Document clusters interpreted as topics Z23=0: background topic
Z23=1: NASA Each latent variable gives one topic Topic Hierarchy
Topics at high levels: long-range word co-occurrences, more general.
Topics at low levels: short-range word co-occurrences, more specific.
NIPS Data ~2,000 papers from 1987-1999 Version 1: 1,000 words
selected by TF/IDF for analysis (Next slide: Model structure by
D3.js) Version 2: 10,000 words selected by TF/IDF for analysis
(Slide after next: Model structure by Graphviz) Model for 1k
words (D3.js) Model for 10k words: Some pruned. (Graphviz)
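The preprocessing described above, selecting a vocabulary by TF/IDF and turning each document into a binary presence/absence vector, can be sketched as follows. The toy corpus, the aggregation by summed TF-IDF, and k = 5 are all invented for illustration:

```python
import numpy as np

# Illustrative sketch: pick the top-k vocabulary words by TF-IDF, then
# represent each document as a binary presence/absence vector. (The talk
# used 1,000 / 10,000 words on NIPS papers; this corpus is made up.)

docs = [
    "latent tree model topic detection",
    "latent variable model clustering",
    "neural network training data",
    "network training topic data",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})

tf = np.array([[t.count(w) for w in vocab] for t in tokens], dtype=float)
df = (tf > 0).sum(axis=0)              # document frequency of each word
idf = np.log(len(docs) / df)
tfidf = tf * idf

scores = tfidf.sum(axis=0)             # aggregate TF-IDF score per word
k = 5
top_words = [vocab[i] for i in np.argsort(scores)[::-1][:k]]

# Binary doc-word matrix restricted to the selected vocabulary.
binary = (tf[:, [vocab.index(w) for w in top_words]] > 0).astype(int)
```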
Tokenization problems due to hyphenation:
"gaus sian", "dis tribution", "struc ture", "generafive" Outline Clustering
and latent tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions HLTA: Top Level
Control
Learn flat LTM: Each latent connected to observed variables Turn
the latent variables into observed variables. Repeat step 1. Stack
the model from step 2 on top of the model from step 1. Repeat steps 1-3
until termination. Optimize the parameters of the global model. Phase 1: Model
Construction Phase 2: Parameter Estimation Learning Flat LTMs
Objective function:
BIC(m|D) = log P(D|m, θ*) - (d/2) log N, where d is the number of free
parameters, N is the sample size, and θ* is the MLE of the parameters θ,
estimated using the EM algorithm.
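A minimal sketch of this score and of a BIC-based model comparison, with invented log-likelihoods and parameter counts standing in for EM results; the delta = 3 threshold anticipates the UD-test discussed later in the talk:

```python
import math

# Sketch of the BIC score above. The log-likelihoods and parameter counts
# are invented; in practice they come from running EM on candidate models.

def bic(loglik, d, n):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return loglik - 0.5 * d * math.log(n)

n = 5000                              # sample size N
bic_m1 = bic(-12000.0, d=11, n=n)     # hypothetical one-latent-variable model
bic_m2 = bic(-11990.0, d=17, n=n)     # hypothetical two-latent-variable model

# UD-test style comparison (delta = 3): keep the single latent variable
# unless the two-latent-variable model wins by more than delta.
delta = 3
ud_pass = bic_m2 - bic_m1 <= delta
```

Here the extra parameters of the second model cost more than its gain in likelihood, so the simpler model is preferred.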
Likelihood term of BIC: Measures how well the model fits the data. Second
term: Penalty for model complexity. The use of the BIC score
indicates that we are looking for a model that fits the data well
and, at the same time, is not overly complex. The Bridged Islands (BI)
Algorithm
Partition observed variables into clusters Partition criterion: The
clusters must be unidimensional (UD) Learn a latent class model
(LCM) for each cluster ("islands") Link up the islands using Chow-Liu's
algorithm: Mutual information (MI)-based maximum spanning tree over
latents Unidimensionality A set of variables is unidimensional if
the correlations among them can be properly modeled using a single
latent variable Unidimensionality (UD) Test Example: S = {A, B, C, D,
E} Learn two models m1: Best LCM, i.e., LTM with one latent variable
m2: Best LTM with two latent variables UD-test passes if and only if
The uni-dimensionality test, or UD-test, determines whether the
correlations among the variables in a set S can be properly
modeled using a single latent variable. It does so by learning two
models: the first is the best model among all models that
contain only one latent variable, and the second is the best
model with two latent variables. It then compares the BIC scores of
the two models. If the BIC score of the two-latent-variable model
does not exceed that of the single-latent-variable model by some
threshold delta, it concludes that the data can be properly modeled
using a single latent variable, and the test passes. In other
words, if the use of two latent variables does not give us a
significantly better model, then the use of one latent variable is
appropriate. Bayes Factor Unlike a likelihood-ratio test,
models do not need to be nested. Strength of evidence in favor of M2
depends on the value of K. (Wikipedia) The UD-test is related to the
Bayes factor from statistics. The Bayes factor compares two models,
M1 and M2, based on data. Unlike the likelihood-ratio test, the two
models are not required to be nested. The strength of evidence in
favor of M1 depends on the value of K. If 2 ln K is from 0 to 2, the
evidence is negligible. If 2 ln K is between 2 and 6, there is
positive evidence in favor of M1. If it is between 6 and 10, there is
strong evidence in favor of M1. If it is larger than 10, there is
very strong evidence in favor of M1. Bayes Factor and UD-Test
Wikipedia The statistic U = BIC(m2) - BIC(m1) is a large-sample
approximation of ln K. Strength of evidence in favor of two latent
variables depends on U. In the UD-test, we usually set delta = 3:
Conclude a single latent variable if there is no strong evidence for
more than one latent variable. The statistic we use in the UD-test,
BIC(m2) - BIC(m1), is a large-sample approximation of ln K. So, if U
is between 1 and 3, there is positive evidence in favor of two
latent variables; if U is between 3 and 5, there is strong evidence
in favor of two latent variables. In the UD-test, we usually set
delta = 3. This means that we conclude that there should be a single
latent variable if there is no strong evidence suggesting multiple
latent variables. Building Islands Build first
Island
Start with the three variables with the strongest correlations (MI)
Grow the cluster by adding other strongly correlated variables one by
one Perform the UD-test after each step Obtain the first island when
the UD-test fails Repeat on the remaining variables for more islands Empirical
Studies with BI
In comparison with alternative algorithms, BI finds the best models on
data with hundreds of attributes Does not scale up Parameter
Estimation for Intermediate Models
Large number of intermediate models Need fast estimation method
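The moment-based idea described next can be sketched exactly for binary variables: for a latent class model Y -> (A, B, C), the eigenvalues of P_{A,B=b,C} (P_{A,C})^{-1} are the entries of P(B=b|Y). The parameter values below are invented, and exact moments stand in for the empirical distributions estimated from data:

```python
import numpy as np

# Method-of-moments sketch for a latent class model Y -> (A, B, C) with
# binary variables. True parameters are invented for illustration; in
# practice P_AC and P_AbC would be empirical distributions from data.

pY = np.array([0.4, 0.6])                    # P(Y)
pA_Y = np.array([[0.8, 0.3], [0.2, 0.7]])    # columns: P(A|Y=y)
pB1_Y = np.array([0.9, 0.2])                 # P(B=1|Y=y), the target
pC_Y = np.array([[0.7, 0.1], [0.3, 0.9]])    # columns: P(C|Y=y)

# Moment matrices implied by the model:
# P(A,C)     = sum_y P(y) P(A|y) P(C|y)
# P(A,B=1,C) = sum_y P(y) P(B=1|y) P(A|y) P(C|y)
P_AC = pA_Y @ np.diag(pY) @ pC_Y.T
P_AbC = pA_Y @ np.diag(pY * pB1_Y) @ pC_Y.T

# P_AbC P_AC^{-1} = P_{A|Y} diag(P(B=1|Y)) P_{A|Y}^{-1}, so its
# eigenvalues are exactly the entries of P(B=1|Y).
eigvals = np.linalg.eigvals(P_AbC @ np.linalg.inv(P_AC))
recovered = np.sort(eigvals.real)
```

Only three observed variables enter the computation, which is what makes this fast enough for the many intermediate models.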
Method of Moments Theorem: Relates model parameters with
marginals
P_AC: matrix for P(A, C); P_A|Y: matrix for P(A|Y); P_b|Y: vector for
P(B=b|Y); P_AbC: matrix for P(A, B=b, C). Estimate P(B|Y): Obtain
empirical distributions P^_ABC and P^_AC from data (moments); Solve the
equation (by finding the eigenvalues of the matrix on the RHS) to get
P_b|Y. Only involves 3 observed variables Progressive Parameter
Estimation
Multi-step process, with a small number of parameters estimated at
each step. P(B|Y) in red part P(A|Y) in red part (swap roles)
P(C|Y) in red part (swap roles) P(Y): diag(P_Y) = P_A|Y^-1 P_AC
(P_C|Y^-1)^T P(D|Y), P(E|Y) in blue part P(Z|Y), P(E|Y) in purple part
P(E|Z) in green part P(Z|Y): P_Z|Y = P_E|Z^-1 P_E|Y Method of Moments Works if
Model structure matches the true model behind the data Sample size is
large enough, so that P^_ABC and P^_AC are close to P_ABC and P_AC
Breaks down if Data not from an LTM, or not from the LTM being
estimated, or Sample size is not large enough In such a case Produces
poor estimates, Can even give negative values for probabilities So, we
propose progressive EM Progressive EM Run EM multiple times, each on a
small part of the model
Example 1 P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM on red part
Then, P(D|Y) and P(E|Y) by running EM on blue part with P(Y) and
P(C|Y) fixed. Example 2 P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM
on purple part Then, P(Z|Y), P(C|Z), P(E|C) in green part with
other parameters fixed Quality: Never give negative probabilities
Efficiency: Data set consists of 8 or 16 distinct cases when
projected to 3 or 4 binary variables! Progressive EM in Island
Building
Model with parameters previously estimated Add "lunar": P(lunar|Y) by
EM in red part P(lunar|Z), P(moon|Z), P(Z|Y) by EM in blue part
Only 3 or 4 observed variables involved HLTA: Top Level
Control
Learn flat LTM: Each latent connected to observed variables Turn
the latent variables into observed variables. Repeat step 1. Stack
the model from step 2 on top of the model from step 1. Repeat steps 1-3
until termination. Optimize the parameters of the global model. Phase 1: Model
Construction Phase 2: Parameter Estimation Parameter Estimation for
Final Model
Stochastic EM Same as EM, except in each pass: Sample a small subset
of data Run one EM step on the subset Similar to stochastic
gradient descent More Empirical Results
New York Times Data (from UCI): 300,000 articles; 10,000 words
selected using TF/IDF. HLTA took 11 hours on a desktop machine. Model
from 300,000 New York Times Articles Topics about Countries:
0.09 china chinese beijing taiwan hong_kong
0.07 japanese japan tokyo yen
0.06 soccer world_cup seoul south_korea
0.08 missiles missile ballistic north_korea
0.08 moscow russian vladimir_putin
0.14 germany german european
0.05 hosni_mubarak cairo egyptian
0.08 greek greece athen god
0.21 (USA) affair scandal intern monica_lewinsky
Outline Clustering and latent
tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions Difference 1: Observed
Variables
Latent Dirichlet Allocation (LDA) Token variable Stands for
location in document Possible values: All words in vocabulary
Models word counts, but not conditional independence Latent Tree
Analysis (LTA) Word variable Possible values: 0, 1 Stands for
absence/presence of word in doc Models conditional independence,
but not word counts Independence structure: Conducive to clear
thematic meaning Not modeling word counts: Loss of information
However, TF/IDF prunes low/high-frequency words Difference 2: Topic
LDA Topic: A distribution over the vocabulary
Characterized by the words with the highest probabilities LTA Topic: A
soft cluster of documents Characterized by words that occur with
High probabilities in the topic Low probabilities outside the topic
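A sketch of this characterization, with an invented binary document-word matrix and a hard split standing in for the soft cluster given by a latent variable Z:

```python
import numpy as np

# Characterize an LTA-style topic: rank words by how much more often they
# occur inside the document cluster (Z=1) than outside (Z=0). The data and
# the hard cluster assignment are invented for illustration.

rng = np.random.default_rng(0)
n_docs, n_words = 200, 6
in_topic = np.arange(n_docs) < 80          # stand-in for P(Z=1|doc) ~ 1

# Binary doc-word matrix: words 0-2 common inside the topic, rare outside;
# words 3-5 equally common everywhere.
X = np.zeros((n_docs, n_words))
X[:, :3] = rng.random((n_docs, 3)) < np.where(in_topic, 0.9, 0.1)[:, None]
X[:, 3:] = rng.random((n_docs, 3)) < 0.5

p_in = X[in_topic].mean(axis=0)            # P(word present | Z=1)
p_out = X[~in_topic].mean(axis=0)          # P(word present | Z=0)
ranking = np.argsort(p_in - p_out)[::-1]
top_words = ranking[:3]                    # words that characterize the topic
```

Words with a large inside-outside gap characterize the topic; the uniformly common words are filtered out even though their absolute frequency is high.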
Difference 3: Document-Topic Relationship
LDA A document is a mixture of topics, or a distribution over
topics Mixed-membership model LTA A document can belong to multiple
topics Multi-membership model Difference 4: Topic Hierarchy
Hierarchy of document collections Each node corresponds to a
collection of documents Collection at a node is the union of the
collections at its children LDA and LTA approaches do not yield a
hierarchy of document collections Difference 4: Topic
Hierarchy
hLDA/nHDP (LDA-based hierarchical topic models) Hierarchy of topics
Topics at high level appear in more documents Topics at low level
appear in fewer documents Difference 4: Topic Hierarchy
HLTA (Hierarchical LTA) Hierarchy of latent
variables/co-occurrences Level-1 latent variables model
co-occurrence of words Higher level nodes represent co-occurrence
of patterns at the level below High-level variables give
thematically more general topics. Low-level variables give
thematically more specific topics A Side Note Conceptually, a word
co-occurrence pattern can be used to define two different topics
Narrowly defined topic: Collection of documents containing the
pattern. Broadly defined topic: Collection of Documents containing
the pattern, and Documents not containing the pattern, but similar
to those in a) otherwise By default, HLTA yield broadly defined
topics.However, it is possible to obtain narrowly definedtopics.
For topic-guided browsing, narrowly defined topics should be used.
For feature extraction, the issue is not clear and can be
determined by experiments. To get narrowly defined topics, we need
to re-run EM in the subtree rooted at each latent variables.
Quantitative Comparisons nHDP Topic Hierarchy vs HLTA Topic
Hierarchy Future
Directions Visualization: Model structure Topic hierarchy Topic-index
for documents Parallelization Topics as features in text analysis
tasks Explore middle ground between HLTM and DBN Learning structure
for sparse DBN