Latent Tree Models for Hierarchical Topic Detection
Transcript of Latent Tree Models for Hierarchical Topic Detection
Nevin L. Zhang With: Tengfei Liu, Peixian Chen, Kin Man Poon,
Zhourong Chen Topic Detection One of the most active machine learning
research topics in the past decade. Objective: Reveal topics in a document
collection. What topics? Sizes? Topic-document relationships?
Topic-topic relationships? Topic evolution? Why Interesting?
Provide overview of document collection
Topic-guided browsing Features for further processing
(classification, retrieval) Topic Hierarchy for NIPS Data
Current Methods Latent Dirichlet Allocation (LDA) (Blei et al,
2003)
Correlated topic models (Lafferty & Blei 2005) Hierarchical LDA
(hLDA) (Griffiths et al 2004, Blei et al 2010) Nested hierarchical
Dirichlet processes (nHDP) (Paisley et al 2012) Dynamic topic model
(Blei & Lafferty 2006) ... A Novel Approach to Hierarchical
Topic Detection
Hierarchical latent tree analysis (HLTA) T. Liu, N.L. Zhang, P.
Chen. Hierarchical Latent Tree Analysis for Topic Detection.
ECML/PKDD (2) 2014: P. Chen, N.L. Zhang, et al. Progressive EM for
Latent Tree Models and Hierarchical Topic Detection. AAAI 2016.
T. Liu, N.L. Zhang, et al. Greedy learning of latent tree models for
multidimensional clustering. Machine Learning 98(1-2): (2015)
Outline Clustering and latent tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions How to Cluster? Here I have
some pictures. Intuitively, how should we cluster them? ... Pause ...
Obviously, we should divide them into two groups like this. This is
quite clear. How to Cluster These? Now, how do we cluster
those? Pause ... How to get one partition?
Finite mixture models: one latent variable z Gaussian mixture
models: continuous data Latent class model: categorical data Key
point: use a model with one latent variable for one partition How to
get multiple partitions?
Use models with multiple latent variables for multiple partitions
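The single-partition idea above (one latent variable z defining one clustering) can be sketched with a two-class latent class model for binary data, fit by EM. This is an illustrative numpy sketch with invented data, not the talk's software:

```python
import numpy as np

# Minimal EM for a latent class model (LCM): one binary latent variable Z
# partitions binary records into 2 clusters. All names and data here are
# illustrative, not from the HLTA software.

def lcm_em(X, n_iter=50, seed=0):
    """Fit a 2-class latent class model to binary data X (n x d) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.array([0.5, 0.5])                      # P(Z)
    theta = rng.uniform(0.25, 0.75, size=(2, d))   # P(X_j = 1 | Z)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z | x) for each record
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update P(Z) and P(X_j | Z), with light smoothing
        pi = resp.mean(axis=0)
        theta = (resp.T @ X + 1e-6) / (resp.sum(axis=0)[:, None] + 2e-6)
    return pi, theta, resp

# Two planted clusters: first half mostly-ones, second half mostly-zeros.
rng = np.random.default_rng(1)
X = np.vstack([rng.random((100, 6)) < 0.9,
               rng.random((100, 6)) < 0.1]).astype(float)
pi, theta, resp = lcm_em(X)
labels = resp.argmax(axis=1)
```

With well-separated clusters the responsibilities become nearly 0/1, and taking the argmax over them recovers the planted partition.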
Latent tree models: Probabilistic graphical models with multiple
latent variables Latent Tree Models (LTMs)
Tree-structured probabilistic graphical models (Bayesian networks)
with multiple latent variables Leaves observed (manifest variables)
Discrete Internal nodes latent (latent variables) Each edge is
associated with a conditional distribution One latent node with
marginal distribution Defines a joint distribution over all the
variables (Zhang, JMLR 2004) What are latent tree models, or LTMs? A
latent tree model is a tree-structured probabilistic graphical
model, or Bayesian network, where the leaf nodes represent observed
variables and the internal nodes represent latent variables
that are not observed. This is an example latent tree model for the
academic performance of high school students. The grades that the
students get in the 4 subjects, Math, Science, English, and History,
are observed variables. The other two skill variables, analytical
skill and literacy skill, are latent. The model says that a
student's performance in Math and Science is influenced by his
analytical skill, his performance in English and History is
influenced by his literacy skill, and the two latent skills are
correlated. Each edge in the model is associated with a probability
distribution. Collectively, those distributions define a joint
distribution over all the variables. Latent Tree Analysis
(LTA)
Learning latent tree models: Determine Number of latent variables
(i.e. number of partitions) Numbers of possible states for each
latent variable (i.e. number of clusters in each partition)
Connections among nodes Probability distributions Latent tree
analysis, or LTA, refers to the process of obtaining a latent tree
model from data on observed variables. For our example, we might
start with the scores of the students, and from this data we want
to obtain a model like this. To do so, we first need to determine
the number of latent variables; the number of states for each
latent variable; how the nodes should be connected up to form a
tree; and finally the probability distributions. This is a
rather difficult problem. In this tutorial, I will discuss various
algorithms for solving it. Before doing that, I would like
to use two examples to illustrate what LTA can be used for. Latent
Tree Analysis Learning latent tree models is difficult, but
doable. Outline Clustering and latent tree
models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions Hierarchical Latent Tree
Analysis (HLTA)
Each word is a variable: 0 = absence from doc, 1 = presence in doc Each
doc is a binary vector over the vocabulary Topics Each latent variable
partitions docs into 2 clusters
Document clusters interpreted as topics Z23=0: background topic
Z23=1: NASA Each latent variable gives one topic Topic Hierarchy
Topics at high levels: long-range word co-occurrences, more general.
Topics at low levels: short-range word co-occurrences, more specific.
NIPS Data ~2,000 papers from 1987-1999 Version 1: 1,000 words
selected by TF/IDF for analysis (Next slide: Model structure by
D3.js) Version 2: 10,000 words selected by TF/IDF for analysis
(Slide after next: Model structure by Graphviz) Model for 1k
words (D3.js) Model for 10k words: Some pruned. (Graphviz)
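The preprocessing described above, selecting a vocabulary by TF/IDF and turning each document into a binary presence/absence vector, can be sketched as follows. The toy corpus, the aggregation by summed TF-IDF, and k = 5 are all invented for illustration:

```python
import numpy as np

# Illustrative sketch: pick the top-k vocabulary words by TF-IDF, then
# represent each document as a binary presence/absence vector. (The talk
# used 1,000 / 10,000 words on NIPS papers; this corpus is made up.)

docs = [
    "latent tree model topic detection",
    "latent variable model clustering",
    "neural network training data",
    "network training topic data",
]
tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})

tf = np.array([[t.count(w) for w in vocab] for t in tokens], dtype=float)
df = (tf > 0).sum(axis=0)              # document frequency of each word
idf = np.log(len(docs) / df)
tfidf = tf * idf

scores = tfidf.sum(axis=0)             # aggregate TF-IDF score per word
k = 5
top_words = [vocab[i] for i in np.argsort(scores)[::-1][:k]]

# Binary doc-word matrix restricted to the selected vocabulary.
binary = (tf[:, [vocab.index(w) for w in top_words]] > 0).astype(int)
```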
Tokenization problems due to hyphenation:
"gaus sian", "dis tribution", "struc ture", "generafive" Outline Clustering
and latent tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions HLTA: Top Level
Control
Learn flat LTM: Each latent connected to observed variables Turn
the latent variables into observed variables. Repeat step 1. Stack
the model from step 2 on top of the model from step 1. Repeat steps 1-3
until termination. Optimize the parameters of the global model. Phase 1: Model
Construction Phase 2: Parameter Estimation Learning Flat LTMs
Objective function:
BIC(m|D) = log P(D|m, θ*) - (d/2) log N, where d is the number of free
parameters, N is the sample size, and θ* is the MLE of the parameters θ,
estimated using the EM algorithm.
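A minimal sketch of this score and of a BIC-based model comparison, with invented log-likelihoods and parameter counts standing in for EM results; the delta = 3 threshold anticipates the UD-test discussed later in the talk:

```python
import math

# Sketch of the BIC score above. The log-likelihoods and parameter counts
# are invented; in practice they come from running EM on candidate models.

def bic(loglik, d, n):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return loglik - 0.5 * d * math.log(n)

n = 5000                              # sample size N
bic_m1 = bic(-12000.0, d=11, n=n)     # hypothetical one-latent-variable model
bic_m2 = bic(-11990.0, d=17, n=n)     # hypothetical two-latent-variable model

# UD-test style comparison (delta = 3): keep the single latent variable
# unless the two-latent-variable model wins by more than delta.
delta = 3
ud_pass = bic_m2 - bic_m1 <= delta
```

Here the extra parameters of the second model cost more than its gain in likelihood, so the simpler model is preferred.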
Likelihood term of BIC: Measures how well the model fits the data. Second
term: Penalty for model complexity. The use of the BIC score
indicates that we are looking for a model that fits the data well
and, at the same time, is not overly complex. The Bridged Islands (BI)
Algorithm
Partition observed variables into clusters Partition criterion: The
clusters must be unidimensional (UD) Learn a latent class model
(LCM) for each cluster ("islands") Link up the islands using Chow-Liu's
algorithm: Mutual information (MI)-based maximum spanning tree over
latents Unidimensionality A set of variables is unidimensional if
the correlations among them can be properly modeled using a single
latent variable Unidimensionality (UD) Test Example: S = {A, B, C, D,
E} Learn two models m1: Best LCM, i.e., LTM with one latent variable
m2: Best LTM with two latent variables UD-test passes if and only if
The uni-dimensionality test, or UD-test, determines whether the
correlations among the variables in a set S can be properly
modeled using a single latent variable. It does so by learning two
models: the first is the best model among all models that
contain only one latent variable, and the second is the best
model with two latent variables. It then compares the BIC scores of
the two models. If the BIC score of the two-latent-variable model
does not exceed that of the single-latent-variable model by some
threshold delta, it concludes that the data can be properly modeled
using a single latent variable, and the test passes. In other
words, if the use of two latent variables does not give us a
significantly better model, then the use of one latent variable is
appropriate. Bayes Factor Unlike a likelihood-ratio test,
models do not need to be nested. Strength of evidence in favor of M2
depends on the value of K. (Wikipedia) The UD-test is related to the
Bayes factor from statistics. The Bayes factor compares two models,
M1 and M2, based on data. Unlike the likelihood-ratio test, the two
models are not required to be nested. The strength of evidence in
favor of M1 depends on the value of K. If 2 ln K is from 0 to 2, the
evidence is negligible. If 2 ln K is between 2 and 6, there is
positive evidence in favor of M1. If it is between 6 and 10, there is
strong evidence in favor of M1. If it is larger than 10, there is
very strong evidence in favor of M1. Bayes Factor and UD-Test
Wikipedia The statistic U = BIC(m2) - BIC(m1) is a large-sample
approximation of ln K. Strength of evidence in favor of two latent
variables depends on U. In the UD-test, we usually set delta = 3:
Conclude a single latent variable if there is no strong evidence for
more than one latent variable. The statistic we use in the UD-test,
BIC(m2) - BIC(m1), is a large-sample approximation of ln K. So, if U
is between 1 and 3, there is positive evidence in favor of two
latent variables; if U is between 3 and 5, there is strong evidence
in favor of two latent variables. In the UD-test, we usually set
delta = 3. This means that we conclude that there should be a single
latent variable if there is no strong evidence suggesting multiple
latent variables. Building Islands Build first
Island
Start with the three variables with the strongest correlations (MI)
Grow the cluster by adding other strongly correlated variables one by
one Perform the UD-test after each step Obtain the first island when
the UD-test fails Repeat on the remaining variables for more islands Empirical
Studies with BI
In comparison with alternative algorithms, BI finds the best models on
data with hundreds of attributes Does not scale up Parameter
Estimation for Intermediate Models
Large number of intermediate models Need fast estimation method
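The moment-based idea described next can be sketched exactly for binary variables: for a latent class model Y -> (A, B, C), the eigenvalues of P_{A,B=b,C} (P_{A,C})^{-1} are the entries of P(B=b|Y). The parameter values below are invented, and exact moments stand in for the empirical distributions estimated from data:

```python
import numpy as np

# Method-of-moments sketch for a latent class model Y -> (A, B, C) with
# binary variables. True parameters are invented for illustration; in
# practice P_AC and P_AbC would be empirical distributions from data.

pY = np.array([0.4, 0.6])                    # P(Y)
pA_Y = np.array([[0.8, 0.3], [0.2, 0.7]])    # columns: P(A|Y=y)
pB1_Y = np.array([0.9, 0.2])                 # P(B=1|Y=y), the target
pC_Y = np.array([[0.7, 0.1], [0.3, 0.9]])    # columns: P(C|Y=y)

# Moment matrices implied by the model:
# P(A,C)     = sum_y P(y) P(A|y) P(C|y)
# P(A,B=1,C) = sum_y P(y) P(B=1|y) P(A|y) P(C|y)
P_AC = pA_Y @ np.diag(pY) @ pC_Y.T
P_AbC = pA_Y @ np.diag(pY * pB1_Y) @ pC_Y.T

# P_AbC P_AC^{-1} = P_{A|Y} diag(P(B=1|Y)) P_{A|Y}^{-1}, so its
# eigenvalues are exactly the entries of P(B=1|Y).
eigvals = np.linalg.eigvals(P_AbC @ np.linalg.inv(P_AC))
recovered = np.sort(eigvals.real)
```

Only three observed variables enter the computation, which is what makes this fast enough for the many intermediate models.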
Method of Moments Theorem: Relates model parameters with
marginals
P_AC: matrix for P(A, C); P_A|Y: matrix for P(A|Y); P_b|Y: vector for
P(B=b|Y); P_AbC: matrix for P(A, B=b, C). Estimate P(B|Y): Obtain
empirical distributions P^_ABC and P^_AC from data (moments); Solve the
equation (by finding the eigenvalues of the matrix on the RHS) to get
P_b|Y. Only involves 3 observed variables Progressive Parameter
Estimation
Multi-step process, with a small number of parameters estimated at
each step. P(B|Y) in red part P(A|Y) in red part (swap roles)
P(C|Y) in red part (swap roles) P(Y): diag(P_Y) = P_A|Y^-1 P_AC
(P_C|Y^-1)^T P(D|Y), P(E|Y) in blue part P(Z|Y), P(E|Y) in purple part
P(E|Z) in green part P(Z|Y): P_Z|Y = P_E|Z^-1 P_E|Y Method of Moments Works if
Model structure matches the true model behind the data Sample size is
large enough, so that P^_ABC and P^_AC are close to P_ABC and P_AC
Breaks down if Data not from an LTM, or not from the LTM being
estimated, or Sample size is not large enough In such a case Produces
poor estimates, Can even give negative values for probabilities So, we
propose progressive EM Progressive EM Run EM multiple times, each on a
small part of the model
Example 1 P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM on red part
Then, P(D|Y) and P(E|Y) by running EM on blue part with P(Y) and
P(C|Y) fixed. Example 2 P(Y), P(A|Y), P(B|Y), P(C|Y) by running EM
on purple part Then, P(Z|Y), P(C|Z), P(E|C) in green part with
other parameters fixed Quality: Never give negative probabilities
Efficiency: Data set consists of 8 or 16 distinct cases when
projected to 3 or 4 binary variables! Progressive EM in Island
Building
Model with parameters previously estimated Add "lunar": P(lunar|Y) by
EM in red part P(lunar|Z), P(moon|Z), P(Z|Y) by EM in blue part
Only 3 or 4 observed variables involved HLTA: Top Level
Control
Learn flat LTM: Each latent connected to observed variables Turn
the latent variables into observed variables. Repeat step 1. Stack
the model from step 2 on top of the model from step 1. Repeat steps 1-3
until termination. Optimize the parameters of the global model. Phase 1: Model
Construction Phase 2: Parameter Estimation Parameter Estimation for
Final Model
Stochastic EM Same as EM, except in each pass: Sample a small subset
of data Run one EM step on the subset Similar to stochastic
gradient descent More Empirical Results
New York Times Data (from UCI): 300,000 articles; 10,000 words
selected using TF/IDF. HLTA took 11 hours on a desktop machine. Model
from 300,000 New York Times Articles Topics about Countries:
0.09 china chinese beijing taiwan hong_kong
0.07 japanese japan tokyo yen
0.06 soccer world_cup seoul south_korea
0.08 missiles missile ballistic north_korea
0.08 moscow russian vladimir_putin
0.14 germany german european
0.05 hosni_mubarak cairo egyptian
0.08 greek greece athen god
0.21 (USA) affair scandal intern monica_lewinsky
Outline Clustering and latent
tree models
LTMs for hierarchical topic detection The HLTA algorithm Comparison
with the LDA approach Future directions Difference 1: Observed
Variables
Latent Dirichlet Allocation (LDA) Token variable Stands for
location in document Possible values: All words in vocabulary
Models word counts, but not conditional independence Latent Tree
Analysis (LTA) Word variable Possible values: 0, 1 Stands for
absence/presence of word in doc Models conditional independence,
but not word counts Independence structure: Conducive to clear
thematic meaning Not modeling word counts: Loss of information
However, TF/IDF prunes low/high-frequency words Difference 2: Topic
LDA Topic: A distribution over the vocabulary
Characterized by the words with the highest probabilities LTA Topic: A
soft cluster of documents Characterized by words that occur with
High probabilities in the topic Low probabilities outside the topic
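A sketch of this characterization, with an invented binary document-word matrix and a hard split standing in for the soft cluster given by a latent variable Z:

```python
import numpy as np

# Characterize an LTA-style topic: rank words by how much more often they
# occur inside the document cluster (Z=1) than outside (Z=0). The data and
# the hard cluster assignment are invented for illustration.

rng = np.random.default_rng(0)
n_docs, n_words = 200, 6
in_topic = np.arange(n_docs) < 80          # stand-in for P(Z=1|doc) ~ 1

# Binary doc-word matrix: words 0-2 common inside the topic, rare outside;
# words 3-5 equally common everywhere.
X = np.zeros((n_docs, n_words))
X[:, :3] = rng.random((n_docs, 3)) < np.where(in_topic, 0.9, 0.1)[:, None]
X[:, 3:] = rng.random((n_docs, 3)) < 0.5

p_in = X[in_topic].mean(axis=0)            # P(word present | Z=1)
p_out = X[~in_topic].mean(axis=0)          # P(word present | Z=0)
ranking = np.argsort(p_in - p_out)[::-1]
top_words = ranking[:3]                    # words that characterize the topic
```

Words with a large inside-outside gap characterize the topic; the uniformly common words are filtered out even though their absolute frequency is high.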
Difference 3: Document-Topic Relationship
LDA A document is a mixture of topics, or a distribution over
topics Mixed-membership model LTA A document can belong to multiple
topics Multi-membership model Difference 4: Topic Hierarchy
Hierarchy of document collections Each node corresponds to a
collection of documents Collection at a node is the union of the
collections at its children LDA and LTA approaches do not yield a
hierarchy of document collections Difference 4: Topic
Hierarchy
hLDA/nHDP (LDA-based hierarchical topic models) Hierarchy of topics
Topics at high level appear in more documents Topics at low level
appear in fewer documents Difference 4: Topic Hierarchy
HLTA (Hierarchical LTA) Hierarchy of latent
variables/co-occurrences Level-1 latent variables model
co-occurrence of words Higher level nodes represent co-occurrence
of patterns at the level below High-level variables give
thematically more general topics. Low-level variables give
thematically more specific topics A Side Note Conceptually, a word
co-occurrence pattern can be used to define two different topics
Narrowly defined topic: Collection of documents containing the
pattern. Broadly defined topic: Collection of Documents containing
the pattern, and Documents not containing the pattern, but similar
to those in a) otherwise By default, HLTA yield broadly defined
topics.However, it is possible to obtain narrowly definedtopics.
For topic-guided browsing, narrowly defined topics should be used.
For feature extraction, the issue is not clear and can be
determined by experiments. To get narrowly defined topics, we need
to re-run EM in the subtree rooted at each latent variables.
Quantitative Comparisons nHDP Topic Hierarchy vs HLTA Topic
Hierarchy Future
Directions Visualization: Model structure Topic hierarchy Topic-index
for documents Parallelization Topics as features in text analysis
tasks Explore middle ground between HLTM and DBN Learning structure
for sparse DBN