
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 800–809, Jeju Island, Korea, 12–14 July 2012. © 2012 Association for Computational Linguistics

SSHLDA: A Semi-Supervised Hierarchical Topic Model

Xian-Ling Mao♠∗, Zhao-Yan Ming♥, Tat-Seng Chua♥, Si Li♣, Hongfei Yan♠†, Xiaoming Li♠
♠Department of Computer Science and Technology, Peking University, China

♥School of Computing, National University of Singapore, Singapore
♣School of ICE, Beijing University of Posts and Telecommunications, China
{xianlingmao,lxm}@pku.edu.cn, [email protected]
{chuats,mingzhaoyan}@nus.edu.sg, [email protected]

Abstract

Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; unsupervised hierarchical topic modeling can automatically detect new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA), which explores new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process. We also prove that hLDA and hLLDA are special cases of SSHLDA. We conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that the predictive ability of SSHLDA is better than that of the baselines, and that SSHLDA also achieves significant improvements over the baselines in clustering on the FScore measure.

1 Introduction

Topic models, such as latent Dirichlet allocation (LDA), are useful NLP tools for the statistical analysis of document collections and other discrete data.

∗This work was done at the National University of Singapore.
†Corresponding author.

Furthermore, hierarchical topic modeling can capture the relations between topics, namely parent-child and sibling relations. Unsupervised hierarchical topic modeling can automatically detect new topics in the data space; an example is hierarchical Latent Dirichlet Allocation (hLDA) (Blei et al., 2004), which uses the nested Chinese restaurant process to automatically obtain an L-level hierarchy of topics. Modern Web documents, however, are not merely collections of words. They usually come with hierarchical labels, such as Web pages and their placement in hierarchical directories (Ming et al., 2010). Unsupervised hierarchical topic modeling cannot make use of any information from such hierarchical labels, so supervised hierarchical topic models, such as hierarchical Labeled Latent Dirichlet Allocation (hLLDA) (Petinot et al., 2011), have been proposed to tackle this problem. hLLDA uses hierarchical labels to automatically build a corresponding topic for each label, but it cannot find new latent topics in the data space, depending only on the hierarchy of labels.

Just as only about 10% of an iceberg's mass is visible above water while about 90% lies unseen below, we believe that a corpus with hierarchical labels contains not only the observed topics of the labels but also many more latent topics. hLLDA can make use of the information from labels, while hLDA can explore latent topics. How can we combine the merits of the two types of models into one model?

An intuitive and simple way to combine them is as follows: first, we use the hierarchy of labels as a basic hierarchy, called the Base Tree (BT); then we use hLDA to automatically build a topic hierarchy for each leaf node in the BT, called a Leaf Topic Hierarchy (LTH); finally, we attach each LTH to the corresponding leaf of the BT and obtain a hierarchy for the entire dataset. We refer to this method as Simp-hLDA. The performance of Simp-hLDA is not satisfactory, as can be seen from the example in Figure 3 (b). The drawbacks are: (i) the leaves in the BT do not obtain reasonable word distributions; for example, the topical words of the "Computers & Internet" node in Figure 3 (b), "the to you and a", are not about "Computers & Internet"; (ii) the non-leaf nodes in the BT cannot obtain word distributions, such as the "Health" node in Figure 3 (b); (iii) it is a heuristic method, so Simp-hLDA has no solid theoretical basis.

To tackle the above drawbacks, we explore the use of probabilistic models for this task, in which the hierarchical labels are viewed as part of a hierarchy of topics, and the topics along a path in the whole hierarchy generate the corresponding document. Our proposed generative model learns both the latent topics of the underlying data and the labeling strategies in a joint model, by leveraging the hierarchical structure of the labels and the hierarchical Dirichlet process.

We demonstrate the effectiveness of the proposed model on large, real-world datasets in the question answering and website category domains on two tasks: the topic modeling of documents, and the use of the generated topics for document clustering. Our results show that our joint, semi-supervised hierarchical model outperforms state-of-the-art supervised and unsupervised hierarchical algorithms. The contributions of this paper are threefold: (1) We propose a joint, generative semi-supervised hierarchical topic model, Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA), to overcome the defects of hLDA and hLLDA while combining their merits. SSHLDA not only explores new latent topics in the data space, but also makes use of the information from the hierarchy of observed labels; (2) We prove that hLDA and hLLDA are special cases of SSHLDA; (3) We develop a Gibbs sampling inference algorithm for the proposed model.

The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 introduces some preliminaries, and Section 4 presents SSHLDA. Section 5 details a Gibbs sampling inference algorithm for SSHLDA, and Section 6 presents the experimental results. Finally, we conclude the paper and suggest directions for future research in Section 7.

2 Related Work

There have been many variations of topic models. The existing topic models can be divided into four categories: unsupervised non-hierarchical topic models, unsupervised hierarchical topic models, and their corresponding supervised counterparts.

Unsupervised non-hierarchical topic models have been widely studied, such as LSA (Deerwester et al., 1990), pLSA (Hofmann, 1999), LDA (Blei et al., 2003), Hierarchical-concept TM (Chemudugunta et al., 2008c; Chemudugunta et al., 2008b), Correlated TM (Blei and Lafferty, 2006) and Concept TM (Chemudugunta et al., 2008a; Chemudugunta et al., 2008b). The most famous one is Latent Dirichlet Allocation (LDA). LDA is similar to pLSA, except that in LDA the topic distribution is assumed to have a Dirichlet prior. LDA is a completely unsupervised algorithm that models each document as a mixture of topics. Another well-known model that not only represents topic correlations but also learns them is the Correlated Topic Model (CTM). Topics in CTM are not independent; however, only pairwise correlations are modeled, and the number of parameters in the covariance matrix grows as the square of the number of topics.

However, the above models cannot capture the relation between super- and sub-topics. To address this problem, many models have been proposed to model such relations, such as hierarchical LDA (hLDA) (Blei et al., 2004), Hierarchical Dirichlet Processes (HDP) (Teh et al., 2006), the Pachinko Allocation Model (PAM) (Li and McCallum, 2006) and Hierarchical PAM (HPAM) (Mimno et al., 2007). The relations are usually in the form of a hierarchy, such as a tree or a Directed Acyclic Graph (DAG). Blei et al. (2004) proposed the hLDA model, which simultaneously learns the structure of a topic hierarchy and the topics contained within that hierarchy. This algorithm can be used to extract topic hierarchies from large document collections.

Although unsupervised topic models are sufficiently expressive to model multiple topics per document, they are inappropriate for labeled corpora because they are unable to incorporate the observed labels into their learning procedure. Several modifications of LDA that incorporate supervision have been proposed in the literature. Two such models, Supervised LDA (Blei and McAuliffe, 2007; Blei and McAuliffe, 2010) and DiscLDA (Lacoste-Julien et al., 2008), were first proposed to model documents associated with only a single label. Another category of models, such as MM-LDA (Ramage et al., 2009b), the Author TM (Rosen-Zvi et al., 2004), Flat-LDA (Rubin et al., 2011), Prior-LDA (Rubin et al., 2011), Dependency-LDA (Rubin et al., 2011) and Partially Labeled LDA (PLDA) (Ramage et al., 2011), are not constrained to one label per document because they model each document as a bag of words with a bag of labels. However, these models obtain topics that do not correspond directly with the labels. Labeled LDA (LLDA) (Ramage et al., 2009a) can be used to solve this problem.

None of these non-hierarchical supervised models, however, leverages the dependency structure, such as parent-child relations, in the label space. For hierarchically labeled data, there are also few models that can handle the label relations. To the best of our knowledge, only hLLDA (Petinot et al., 2011) and HSLDA (Perotte et al., 2011) have been proposed for this kind of data. HSLDA cannot obtain a probability distribution for a label. Although hLLDA can obtain a distribution over words for each label, it is unable to capture the relations between parent and child nodes with parameters, and it also cannot automatically detect latent topics in the data space. In this paper, we propose a generative topic model to tackle these problems of hLLDA.

3 Preliminaries

The nested Chinese restaurant process (nCRP) is a distribution over hierarchical partitions (Blei et al., 2004). It generalizes the Chinese restaurant process (CRP), which is a distribution over partitions. The CRP can be described by the following metaphor. Imagine a restaurant with an infinite number of tables, and imagine customers entering the restaurant in sequence. The dth customer sits at a table according to the following distribution,

Table 1: Notations used in the paper.

Sym      Description
V        Vocabulary (word set); w is a word in V
D        Document collection
Tj       The set of paths in the sub-tree whose root is the jth leaf node in the hierarchy of observed topics
m        A document m that consists of words and labels
wm       The text of document m; wi is the ith word in wm
cm       The topic set of document m
com      The set of topics with observed labels for document m
cem      The set of topics without labels for document m
ce−m     The set of latent topics for all documents other than m
zem      The assignment of the words in the mth document to one of the latent topics
wem      The set of words belonging to one of the latent topics in the mth document
zm,n     The assignment of the nth word in the mth document to one of the L available topics
z        The set of zm,n for all words in all documents
ci       A topic in the ith level of the hierarchy
θ        The word distribution set for z, i.e., {θ}z∈c
α        Dirichlet prior of θ
δci      The multinomial distribution over the sub-topics of ci−1
µci      Dirichlet prior of δci
η        Dirichlet prior of β
β        The multinomial distribution over words
θm       The distribution over topics for document m
θ        The set of θm, m ∈ {1, ..., D}

p(c_d = k \mid c_{1:(d-1)}) \propto \begin{cases} m_k & \text{if } k \text{ is a previously occupied table} \\ \gamma & \text{if } k \text{ is a new table} \end{cases} \qquad (1)

where m_k is the number of previous customers sitting at table k and γ is a positive scalar. After D customers have sat down, their seating plan describes a partition of D items.

In the nested CRP, imagine now that the tables are organized in a hierarchy: there is one table at the first level; it is associated with an infinite number of tables at the second level; each second-level table is associated with an infinite number of tables at the third level; and so on until the Lth level. Each customer enters at the first level and comes out at the Lth level, generating a path with L tables as she sits in each restaurant. Moving from a table at level l to one of its subtables at level l + 1, the customer draws a table following the CRP using Formula (1). In this paper, we make use of the nested CRP to explore latent topics in the data space.
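For concreteness, the following Python sketch draws one path through an L-level nested CRP using Formula (1). The tree representation (a dictionary from a node to the customer counts of its children) and the function names are illustrative assumptions, not part of the original model code.

import random

def crp_next_table(counts, gamma):
    # Formula (1): an existing table k is chosen with probability proportional
    # to m_k, and a new table with probability proportional to gamma.
    weights = list(counts) + [gamma]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r <= acc:
            return k              # k == len(counts) means "open a new table"
    return len(counts)

def draw_ncrp_path(child_counts, gamma, L):
    # Walk from the root down to level L, applying the CRP at every level.
    # child_counts maps a node to the list of customer counts of its children.
    path, node = ["root"], "root"
    for level in range(2, L + 1):
        counts = child_counts.setdefault(node, [])
        k = crp_next_table(counts, gamma)
        if k == len(counts):
            counts.append(0)      # a new subtable, i.e. a new latent topic
        counts[k] += 1
        node = (node, k)
        path.append(node)
    return path

Repeatedly calling draw_ncrp_path with a shared child_counts dictionary reproduces the "rich get richer" behaviour of the nCRP: popular subtrees attract more paths, while new branches keep appearing with probability controlled by γ.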

To elaborate our model, we first define two concepts. If a model can learn a distribution over words for a label, we refer to the topic with a corresponding label as a labeled topic. If a model can learn an unseen, latent topic without a label, we refer to the topic as a latent topic.

Figure 1: The graphical model of SSHLDA.

4 The Semi-Supervised Hierarchical Topic Model

In this section, we introduce our semi-supervised hierarchical topic model, Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). SSHLDA is a probabilistic graphical model that describes a process for generating a hierarchically labeled document collection. Like hierarchical Labeled LDA (hLLDA) (Petinot et al., 2011), SSHLDA can incorporate labeled topics into the generative process of documents. On the other hand, like hierarchical Latent Dirichlet Allocation (hLDA) (Blei et al., 2004), SSHLDA can automatically explore latent topics in the data space and extend the existing hierarchy of observed topics. SSHLDA thus makes use of not only observed topics but also latent topics.

The graphical model of SSHLDA is illustrated in Figure 1. In the model, N is the number of words in a document, D is the total number of documents in the collection, M is the number of leaf nodes among the observed hierarchical nodes, ci is a node at the ith level of the hierarchical tree, η, α and µci are Dirichlet prior parameters, βk is a distribution over words, θ is a document-specific distribution over topics, δci is a multinomial distribution over the observed sub-topics of topic ci, w is an observed word, z is the topic assigned to w, Dirk(.) is a k-dimensional Dirichlet distribution, Tj is the set of paths in the hierarchy of latent topics for the jth leaf node in the hierarchy of observed topics, and γ governs the multinomial distribution over paths in the tree. All notations used in this paper are listed in Table 1.

Figure 2: An illustration of SSHLDA. The tree has 5 levels. The shaded nodes are observed topics, and circled nodes are latent topics. The latent topics are generated automatically by the SSHLDA model. After learning, each node in this tree obtains a corresponding probability distribution over words, i.e., a topic.

SSHLDA, as shown in Figure 1, assumes the following generative process:

(1) For each table k ∈ T in the infinite tree:
    (a) Draw a topic βk ∼ Dir(η).
(2) For each document m ∈ {1, 2, ..., D}:
    (a) Let c1 be the root node.
    (b) For each level l ∈ {2, ..., L}:
        (i) If the nodes at this level have been observed, draw a node cl from Mult(δcl−1 | µcl−1).
        (ii) Otherwise, draw a table cl from restaurant cl−1 using Formula (1).
    (c) Draw an L-dimensional topic proportion vector θm from Dir(α).
    (d) For each word n ∈ {1, ..., N}:
        (i) Draw z ∈ {1, ..., L} from Mult(θ).
        (ii) Draw wn from the topic associated with restaurant cz.

As the example in Figure 2 shows, we assume that we know a hierarchy of observed topics, {A1, A2, A17, A3, A4}, and that the height of the desired topical tree is L = 5. All circled nodes are latent topics, and shaded nodes are observed topics. A possible generative process for a document m is: it starts from A1, chooses node A17 at level 2, and then chooses A18, A20 and A25 at the following levels. Thus we obtain a path cm = {A1, A17, A18, A20, A25}. After getting the path for m, SSHLDA generates each word from one of the topics in cm.
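To make the control flow of the generative story concrete, the sketch below simulates one document. The data structures (obs_children, delta, beta, latent_counts) and the lazy drawing of word distributions for newly created latent nodes are illustrative assumptions rather than the authors' implementation.

import numpy as np

def generate_document(obs_children, delta, beta, latent_counts,
                      L, alpha, eta, gamma, vocab_size, doc_len, rng):
    # obs_children[c]  : observed (labeled) children of node c, possibly empty
    # delta[c]         : multinomial over obs_children[c]          (step 2b-i)
    # beta[c]          : word distribution of the topic at node c  (step 1a)
    # latent_counts[c] : nested-CRP customer counts of the latent children of c
    path, node = ["root"], "root"
    for level in range(2, L + 1):
        kids = obs_children.get(node, [])
        if kids:                                   # labeled level, step (2b-i)
            node = kids[rng.choice(len(kids), p=delta[node])]
        else:                                      # latent level, step (2b-ii), Formula (1)
            counts = latent_counts.setdefault(node, [])
            w = np.array(counts + [gamma], dtype=float)
            k = int(rng.choice(len(w), p=w / w.sum()))
            if k == len(counts):
                counts.append(0)                   # open a new table = new latent topic
            counts[k] += 1
            node = (node, k)
        path.append(node)
    for c in path:                                 # draw any missing topics, step (1a)
        if c not in beta:
            beta[c] = rng.dirichlet([eta] * vocab_size)
    theta = rng.dirichlet([alpha] * L)             # level proportions, step (2c)
    levels = rng.choice(L, size=doc_len, p=theta)  # step (2d-i)
    words = [int(rng.choice(vocab_size, p=beta[path[z]])) for z in levels]  # step (2d-ii)
    return path, words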



5 Probabilistic Inference

In this section, we describe a Gibbs sampling algorithm for sampling from the posterior and the corresponding topics in the SSHLDA model. The Gibbs sampler provides a method for simultaneously exploring the model parameter space (the latent topics of the whole corpus) and the model structure space (L-level trees).

In SSHLDA, we sample the paths cm for document m and the per-word level allocations to topics in those paths zm,n. Thus, we approximate the posterior p(cm, zm | γ, η, w, µ). The hyper-parameter γ reflects the tendency of the customers in each restaurant to share tables, η denotes the expected variance of the underlying topics (e.g., η ≪ 1 will tend to choose topics with fewer high-probability words), µci is the Dirichlet prior of δci, and µ is the set of µci. wm,n denotes the nth word in the mth document; cm,l represents the restaurant corresponding to the lth-level topic in document m; and zm,n is the assignment of the nth word in the mth document to one of the L available topics. All other variables in the model, θ and β, are integrated out. The Gibbs sampler thus assesses the values of zm,n and cm,l.

The Gibbs sampler can be divided into two main steps: the sampling of level allocations and the sampling of path assignments.

First, given the values of the SSHLDA hidden variables, we sample the cm,l variables, which are associated with the CRP prior. Note that cm is composed of com and cem, where com is the set of observed topics for document m and cem is the set of latent topics for document m. The conditional distribution for cm, the L topics associated with document m, is:

p(c_m \mid z, w, c_{-m}, \mu) = p(c_{om} \mid \mu)\, p(c_{em} \mid z_{em}, w_{em}, c_{e-m}) \propto p(c_{om} \mid \mu)\, p(w_{em} \mid c_{em}, w_{e-m}, z_{em})\, p(c_{em} \mid c_{e-m}) \qquad (2)

where

p(c_{om} \mid \mu) = \prod_{i=0}^{|c_{om}|-1} p(c_{i,m} \mid \mu_{c_i}) \qquad (3)

and

p(w_{em} \mid c_{em}, w_{e-m}, z_{em}) = \prod_{l=1}^{|c_{em}|} \left( \frac{\Gamma\big(n^{\cdot}_{c_{em,l},-m} + |V|\eta\big)}{\prod_{w}\Gamma\big(n^{w}_{c_{em,l},-m} + \eta\big)} \times \frac{\prod_{w}\Gamma\big(n^{w}_{c_{em,l},-m} + n^{w}_{c_{em,l},m} + \eta\big)}{\Gamma\big(n^{\cdot}_{c_{em,l},-m} + n^{\cdot}_{c_{em,l},m} + |V|\eta\big)} \right) \qquad (4)

where ce−m is the set of latent topics for all documents other than m, zem is the assignment of the words in the mth document to one of the latent topics, and wem is the set of words belonging to one of the latent topics in the mth document. n^w_{cem,l,−m} is the number of instances of word w that have been assigned to the topic indexed by cem,l, excluding those in document m.
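A hedged sketch of how Formula (2) can be evaluated for one candidate path: the observed part contributes Formula (3), and the latent part contributes the nested-CRP prior together with the word likelihood of Formula (4). The argument layout (one word-count vector per latent level) is an assumption made only for illustration.

import numpy as np
from scipy.special import gammaln

def log_path_score(obs_probs, counts_minus_m, counts_m, log_ncrp_prior, eta, V):
    # obs_probs       : p(c_{i,m} | mu_{c_i}) for the observed nodes on the path (Formula 3)
    # counts_minus_m  : per latent level, a length-V word-count vector excluding document m
    # counts_m        : per latent level, the word-count vector of document m
    # log_ncrp_prior  : log p(c_em | c_e,-m) from the nested CRP
    score = float(np.sum(np.log(obs_probs))) + log_ncrp_prior
    for n_old, n_doc in zip(counts_minus_m, counts_m):      # Formula (4), one level at a time
        score += (gammaln(n_old.sum() + V * eta)
                  - gammaln(n_old + eta).sum()
                  + gammaln(n_old + n_doc + eta).sum()
                  - gammaln((n_old + n_doc).sum() + V * eta))
    return score

In a Gibbs sweep, this score would be computed for every candidate path and a path sampled from the renormalized exponentiated scores.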

Second, given the current state of SSHLDA, we sample the zm,n variables of the underlying model as follows:

p(z_{m,n} = j \mid z_{-(m,n)}, w, c_m, \mu) \propto \frac{n^{m}_{-n,j} + \alpha}{n^{m}_{-n,\cdot} + |c_m|\,\alpha} \cdot \frac{n^{w_{m,n}}_{-n,j} + \eta}{n^{\cdot}_{-(m,n),j} + |V|\,\eta} \qquad (5)
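A sketch of the corresponding sampling step; the count arrays are assumed to already exclude the word being resampled (the "-n" counts), and the placement of α and η in the denominators follows the standard collapsed-Gibbs form of Equation (5).

import numpy as np

def sample_level(doc_counts, word_counts, level_totals, alpha, eta, V, rng):
    # doc_counts[j]   : words of document m currently assigned to level j (numpy array)
    # word_counts[j]  : corpus-wide count of word w_{m,n} assigned to level j
    # level_totals[j] : corpus-wide total number of words assigned to level j
    L = len(doc_counts)
    left = (doc_counts + alpha) / (doc_counts.sum() + L * alpha)
    right = (word_counts + eta) / (level_totals + V * eta)
    p = left * right
    return int(rng.choice(L, p=p / p.sum()))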

Having obtained the full conditional distributions, the Gibbs sampling algorithm is straightforward. The zm,n variables are initialized to determine the initial state of the Markov chain. The chain is then run for a number of iterations, each time finding a new state by sampling each zm,n from the distribution specified by Equation (5). After obtaining the individual word assignments z, we can estimate the topic multinomials and the per-document mixing proportions. Specifically, the topic multinomials are estimated as:

\beta_{c_{m,j},i} = p(w_i \mid z_{c_{m,j}}) = \frac{\eta + n^{w_i}_{z_{c_{m,j}}}}{|V|\,\eta + \sum_{w} n^{w}_{z_{c_{m,j}}}} \qquad (6)

while the per-document mixing proportions can be estimated as:

\theta_{m,j} = \frac{\alpha + n^{m}_{\cdot,j}}{|c_m|\,\alpha + n^{m}_{\cdot,\cdot}}, \qquad j \in \{1, \ldots, |c_m|\} \qquad (7)
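Equations (6) and (7) are simple normalized-count estimates; a minimal sketch, assuming the assignment counts are kept in two matrices:

import numpy as np

def estimate_parameters(topic_word_counts, doc_level_counts, alpha, eta):
    # topic_word_counts : (num_topics x |V|) word assignments per topic     -> Equation (6)
    # doc_level_counts  : (num_docs x L) level assignments per document     -> Equation (7)
    beta = topic_word_counts + eta
    beta /= beta.sum(axis=1, keepdims=True)    # (n + eta) / (n_total + |V| * eta)
    theta = doc_level_counts + alpha
    theta /= theta.sum(axis=1, keepdims=True)  # (n + alpha) / (n_total + L * alpha)
    return beta, theta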

5.1 Relation to Existing Models

In this section, we draw comparisons with the current state-of-the-art models for hierarchical topic modeling (Blei et al., 2004; Petinot et al., 2011) and show that at certain choices of the parameters of our model, these methods fall out as special cases.

Our method generalizes not only hierarchical Latent Dirichlet Allocation (hLDA) but also Hierarchical Labeled Latent Dirichlet Allocation (hLLDA). Our proposed model provides a unified framework that allows us to model hierarchical labels while exploring new latent topics.

Equivalence to hLDA. As introduced in Section 2, hLDA is an unsupervised hierarchical topic model. In this case, there are no observed nodes, i.e., the corpus has no hierarchical labels. This means cm is equal to cem; meanwhile, the factor p(com | µ) is always equal to one because each document has only the root node as its observed part, and this allows us to rewrite Formula (2) as:

p(c_m \mid z, w, c_{-m}, \mu) \propto p(w_{c_m} \mid c, w_{-m}, z)\, p(c_m \mid c_{-m}) \qquad (8)

which is exactly the conditional distribution for cm, the L topics associated with document m, in the hLDA model. In this case, our model becomes equivalent to the hLDA model.

Equivalence to hLLDA. hLLDA is a supervised hierarchical topic model, which means all nodes in the hierarchy are observed. In this case, cm is equal to com, and this allows us to rewrite Formula (2) as:

p(c_m \mid z, w, c_{-m}, \mu) = p(c_m \mid \mu) \propto p(c_{om} \mid \mu) \qquad (9)

which is exactly the step "Draw a random path assignment cm" in the generative process of hLLDA. Consequently, in this sense our model is equivalent to hLLDA.

6 Experiments

We demonstrate the effectiveness of the proposed model on large, real-world datasets in the question answering and website category domains on two tasks: the topic modeling of documents, and the use of the generated topics for document clustering.

6.1 Datasets

To construct comprehensive datasets for our experiments, we crawled data from two websites. First, we crawled nearly all the question and associated answer pairs (QA pairs), from 2005.11 to 2008.11, of two top categories of Yahoo! Answers: Computers & Internet and Health. This produced forty-three sub-categories and an archive of 6,345,786 QA documents. We refer to the Yahoo! Answers data as Y Ans.

Table 2: The statistics of the datasets.

Datasets   #labels   #paths   Max level   #docs
Y Ans      46        35       4           6,345,786
O Hlth     6695      6505     10          54,939
O Home     2432      2364     9           24,254

In addition, we crawled two categories of the Open Directory Project (ODP)∗: Home and Health. We removed all categories containing fewer than 3 Web sites. Finally, for each Web site in the remaining categories, we submitted its URL to Google and used the words in the snippet and title of the first returned result to extend the summary of the Web site. We denote the data from the category Home as O Home, and the data from the category Health as O Hlth.

The statistics of all datasets are summarized in Table 2. From this table, we can see that the datasets are very diverse: Y Ans has far fewer labels than O Hlth and O Home, but has many more documents per label; meanwhile, the depth of the hierarchical tree for O Hlth and O Home can reach level 9 or above.

All experiments are based on the results of models with a burn-in of 10,000 Gibbs sampling iterations, symmetric priors α = 0.1 and free parameter η = 1.0; for µ, we obtain the estimate of µci by fixed-point iteration (Minka, 2003).
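The µ estimation is not spelled out beyond the citation; the sketch below shows Minka's (2003) standard fixed-point update for the parameters of a Dirichlet-multinomial, which is one way such an estimate could be computed from the child-topic counts under a parent node. The input layout is an assumption made for illustration.

import numpy as np
from scipy.special import digamma

def dirichlet_fixed_point(counts, n_iter=1000, tol=1e-6):
    # counts: (num_observations x num_categories) matrix, e.g. per-document
    # counts of the observed child topics under one parent node.
    N, K = counts.shape
    alpha = np.ones(K)
    row_sums = counts.sum(axis=1)
    for _ in range(n_iter):
        num = digamma(counts + alpha).sum(axis=0) - N * digamma(alpha)
        den = (digamma(row_sums + alpha.sum()) - digamma(alpha.sum())).sum()
        new_alpha = alpha * num / den
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha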

6.2 Case Study

With topic modeling, the top associated words of topics can be used as good descriptors for the topics in a hierarchy (Blei et al., 2003; Blei and McAuliffe, 2010). Figure 3 shows a comparative example of the proposed model and a baseline model on the Y Ans dataset. The tree-based topic visualizations in Figure 3 (a) and (b) are the results of SSHLDA and Simp-hLDA, respectively.

We have three major observations from the example: (i) SSHLDA is a unified, generative model; after learning, it obtains a hierarchy of topics;

∗http://dmoz.org/



Figure 3: (a) A sub-network discovered on the Y Ans dataset using SSHLDA; the whole tree has 74 nodes. (b) A sub-network discovered on the Y Ans dataset using the Simp-hLDA algorithm; the whole tree has 89 nodes. In both figures, the shaded and squared nodes are observed labels, not topics; the shaded and round nodes are topics with observed labels; blue nodes are topics without labels, and the yellow node is one of the leaves in the hierarchy of labels. Each topic is represented by its top 5 terms.

while Simp-hLDA is a heuristic method, and its result is a mixture of label nodes and topical nodes. For example, Figure 3 (b) shows that the hierarchy includes label nodes and topic nodes, and each labeled node has only a label, whereas the label nodes in Figure 3 (a) have their corresponding topics. (ii) While building the hierarchy, SSHLDA makes use of the information from observed labels, so it can generate a logical, structured hierarchy with parent-child relations; Simp-hLDA does not incorporate prior information from labels into its generation process, so although it can obtain a hierarchy, many of its parent-child pairs do not actually stand in a parent-child relation. For example, in Figure 3 (b), although the label "root" is a parent of the label "Computers & Internet", the topical words of the label "Computers & Internet" show that the topical node is not a child of the label "root". In Figure 3 (a), however, the labels "root" and "Computers & Internet" have the corresponding parent-child relation between their topical words. (iii) In a hierarchy of topics, if a topical node has a corresponding label, the label helps people understand its descendant topical nodes. For example, when we know that the node "error files click screen virus" in Figure 3 (a) has the label "Computers & Internet", we can understand that the child topic "hard screen usb power dell" is about "computer hardware". In Figure 3 (b), however, the labels in parent nodes cannot provide much information for understanding descendant topical nodes, because many label nodes do not have correct corresponding topical words; for example, the topical words of the label "Computers & Internet", "the to you and a", do not reflect the connotation of the label.

These observations further confirm that SSHLDAis better than the baseline model.

6.3 Perplexity Comparison

A good topic model should be able to generalize to unseen data. To measure the predictive ability of our model and the baselines, we compute the perplexity for each document d in the test sets. Perplexity, which is widely used in the language modeling and topic modeling communities, is algebraically equivalent to the inverse of the geometric mean per-word likelihood (Blei et al., 2003); lower perplexity scores are better. We compare our model, SSHLDA, with three state-of-the-art models: Simp-hLDA, hLDA and hLLDA. Simp-hLDA was introduced in Section 1, and hLDA and hLLDA were reviewed in Section 2. We keep 80% of the data collection as the training set and use the remaining collection as the held-out test set. We build the models on the training set and compute the perplexity of the test set to evaluate the models; thus, our goal is to achieve a lower perplexity score on the held-out test set. The perplexity of M test documents is calculated as:

\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{d=1}^{M}\sum_{m=1}^{N_d} \log p(w_{dm})}{\sum_{d=1}^{M} N_d} \right\} \qquad (10)

where Dtest is the test collection of M documents, Nd is the length of document d, and wdm is the mth word in document d.
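Equation (10) is just the exponentiated negative average per-word log-likelihood; a minimal sketch, assuming the per-word log-probabilities are supplied by the trained model:

import numpy as np

def perplexity(log_word_probs, doc_lengths):
    # log_word_probs : one array per test document, holding log p(w_dm) for each word
    # doc_lengths    : the N_d values
    total_log_prob = sum(float(np.sum(lp)) for lp in log_word_probs)
    return float(np.exp(-total_log_prob / sum(doc_lengths)))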

We present the results on the O Hlth dataset in Figure 4. We take the top 3 levels of labels as observed and assume the other labels are unobserved, i.e., l = 3. From the figure, we can see that the perplexity of SSHLDA is lower than that of Simp-hLDA, hLDA and hLLDA at every value of the tree height parameter L ∈ {5, 6, 7, 8}. This shows that SSHLDA consistently outperforms the state-of-the-art baselines, and it means that our proposed model can model hierarchically labeled data better than the state-of-the-art models. We obtain similar experimental results on the Y Ans and O Home datasets; their detailed description is omitted due to space limitations.

6.4 Clustering Performance

To evaluate the performance of the proposed model indirectly, we compare the clustering performance of the following systems: 1) the proposed model; 2) Simp-hLDA; 3) hLDA; 4) an agglomerative clustering algorithm. There are many agglomerative clustering algorithms; in this paper, we use the single-linkage method in the CLUTO software package (Karypis, 2005) to obtain hierarchies of clusters over our datasets, with words as features. We refer to this method as h-clustering.

Given a document collection DS with an H-level hierarchy of labels, each label in the hierarchy and its corresponding documents are taken as the ground truth for the clustering algorithms. The hierarchy of labels is denoted as the GT-tree. The process of evaluation is as follows. First, we choose the top l levels of labels in the GT-tree as the observed hierarchy, i.e., the Base Tree (BT), and we construct an L-level hierarchy (l < L <= H) over the documents DS using a model.

Figure 4: Perplexities of hLLDA, hLDA, Simp-hLDA and SSHLDA. The results are on the O Hlth dataset, with the height of the hierarchy of observed labels l = 3. The X-axis is the height of the whole topical tree (L), and the Y-axis is the perplexity.

The remaining labels in the GT-tree and their corresponding documents are the ground-truth classes, each class denoted as Ci. Then, (i) for h-clustering, we run the single-linkage method over the documents DS. (ii) For Simp-hLDA, hLDA runs on the documents in each leaf node of the BT, with the height parameter set to (L − l) for each hLDA. After training, each document is assigned to its top-1 topic according to the document's distribution over topics; each topic and its corresponding documents form a new cluster. (iii) For hLDA, hLDA runs on all documents in DS with height parameter L. As with Simp-hLDA, each document is assigned to its top-1 topic, and each topic and its corresponding documents form a new cluster. (iv) For SSHLDA, we set the height parameter to L. After training, each document is also assigned to its top-1 topic, and the topics and their corresponding documents form a hierarchy of clusters.

6.4.1 Evaluation Metrics

For each dataset we obtain the corresponding clusters using the various models described in the previous sections. We can therefore measure the quality of the various algorithms with clustering metrics, using a measure that takes into account the overall set of clusters represented in the newly generated part of a hierarchical tree.

One such measure is the FScore, introduced by Manning et al. (2008). Given a particular class Cr of size nr and a particular cluster Si of size ni, suppose nri documents in the cluster Si belong to Cr; then the FScore of this class and cluster is defined as

F(C_r, S_i) = \frac{2 \times R(C_r, S_i) \times P(C_r, S_i)}{R(C_r, S_i) + P(C_r, S_i)} \qquad (11)

where R(Cr, Si) is the recall, defined as nri/nr, and P(Cr, Si) is the precision, defined as nri/ni, for the class Cr and the cluster Si. The FScore of the class Cr is the maximum FScore value attained at any node in the hierarchical clustering tree T. That is,

F(C_r) = \max_{S_i \in T} F(C_r, S_i) \qquad (12)

The FScore of the entire clustering solution is then defined as the sum of the individual class FScores, weighted according to class size:

\mathrm{FScore} = \sum_{r=1}^{c} \frac{n_r}{n}\, F(C_r) \qquad (13)

where c is the total number of classes and n is the total number of documents. In general, the higher the FScore, the better the clustering solution.
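A small sketch of Equations (11)-(13), assuming classes and clusters are given as dictionaries mapping ids to sets of document ids:

def fscore(classes, clusters):
    # classes, clusters : dicts mapping an id to a set of document ids
    n = sum(len(docs) for docs in classes.values())
    total = 0.0
    for c_docs in classes.values():
        best = 0.0
        for s_docs in clusters.values():
            overlap = len(c_docs & s_docs)      # n_ri
            if overlap == 0:
                continue
            recall = overlap / len(c_docs)      # R(C_r, S_i) = n_ri / n_r
            precision = overlap / len(s_docs)   # P(C_r, S_i) = n_ri / n_i
            best = max(best, 2 * recall * precision / (recall + precision))  # Eq. (11)-(12)
        total += len(c_docs) / n * best         # Eq. (13): size-weighted sum
    return total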

6.4.2 Experimental Results

Each of hLDA, Simp-hLDA and SSHLDA needs one parameter, the height of the topical tree L; Simp-hLDA and SSHLDA additionally need another parameter, the height of the hierarchy of observed labels l. The h-clustering method has no height parameter, so its FScore keeps the same value at different heights of the topical tree. Choosing the height of the hierarchical labels for O Home as 4, i.e., l = 4, the results of our model and the baselines with respect to the height of the hierarchy are shown in Figure 5.

From the figure, we can see that our proposed model achieves consistent improvements over the baseline models at different heights L ∈ {5, 6, 7, 8}. For example, the performance of SSHLDA reaches 0.396 at height 5, while h-clustering, hLDA and hLLDA only achieve 0.295, 0.328 and 0.349, respectively, at the same height. This means our model achieves about 34.2%, 20.7% and 13.5% improvements over h-clustering, hLDA and hLLDA at height 5.

Figure 5: FScore measures of h-clustering, hLDA, Simp-hLDA and SSHLDA. The results are on the O Home dataset, with the height of the hierarchy of observed labels l = 3. The X-axis is the height of the whole topical tree (L), and the Y-axis is the FScore.

The improvements are significant by t-test at the 95% significance level. We obtain similar experimental results on Y Ans and O Hlth; for the same space limitations, their detailed descriptions are omitted.

7 Conclusion and Future Work

In this paper, we have proposed a semi-supervised hierarchical topic model, SSHLDA, which aims to overcome the drawbacks of hLDA and hLLDA while combining their merits. Specifically, SSHLDA incorporates the information from labels into the generative process of topic modeling while exploring latent topics in the data space. In addition, we have proved that hLDA and hLLDA are special cases of SSHLDA. We have conducted experiments on the Yahoo! Answers and ODP datasets, and assessed the performance in terms of perplexity and the FScore measure. The experimental results show that the predictive ability of SSHLDA is the best, and that SSHLDA also achieves significant improvements over the baselines on the FScore measure.

In the future, we will continue to explore novel topic models for hierarchically labeled data to further improve effectiveness; we will also apply SSHLDA to other media, such as images, to solve related problems in those areas.



Acknowledgments

This work was partially supported by NSFC with Grant No. 61073082, 60933004, 70903008 and NExT Search Centre, which is supported by the Singapore National Research Foundation & Interactive Digital Media R&D Program Office, MDA, under research grant (WBS: R-252-300-001-490).

References

D. Blei and J. Lafferty. 2006. Correlated topic models. Advances in Neural Information Processing Systems, 18:147.

D.M. Blei and J.D. McAuliffe. 2007. Supervised topic models. In Proceedings of Neural Information Processing Systems (NIPS).

D.M. Blei and J.D. McAuliffe. 2010. Supervised topic models. arXiv preprint arXiv:1003.0783.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

D. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106.

C. Chemudugunta, A. Holloway, P. Smyth, and M. Steyvers. 2008a. Modeling documents by combining semantic concepts with unsupervised statistical learning. The Semantic Web - ISWC 2008, pages 229–244.

C. Chemudugunta, P. Smyth, and M. Steyvers. 2008b. Combining concept hierarchies and statistical topic models. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1469–1470. ACM.

C. Chemudugunta, P. Smyth, and M. Steyvers. 2008c. Text modeling using unsupervised topic models and concept hierarchies. arXiv preprint arXiv:0808.0973.

S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

T. Hofmann. 1999. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, page 21. Citeseer.

G. Karypis. 2005. CLUTO: Software for clustering high dimensional datasets. Internet Website (last accessed, June 2008), http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview.

S. Lacoste-Julien, F. Sha, and M.I. Jordan. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems, 21.

W. Li and A. McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, pages 577–584. ACM.

C.D. Manning, P. Raghavan, and H. Schutze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

D. Mimno, W. Li, and A. McCallum. 2007. Mixtures of hierarchical topics with Pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pages 633–640. ACM.

Z.Y. Ming, K. Wang, and T.S. Chua. 2010. Prototype hierarchy based clustering for the categorization and navigation of web collections. In Proceedings of the 33rd International ACM SIGIR, pages 2–9. ACM.

T.P. Minka. 2003. Estimating a Dirichlet distribution. Annals of Physics, 2000(8):1–13.

A. Perotte, N. Bartlett, N. Elhadad, and F. Wood. 2011. Hierarchically supervised latent Dirichlet allocation. Neural Information Processing Systems (to appear).

Y. Petinot, K. McKeown, and K. Thadani. 2011. A hierarchical model of web summaries. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies: Short Papers - Volume 2, pages 670–675. ACL.

D. Ramage, D. Hall, R. Nallapati, and C.D. Manning. 2009a. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics.

D. Ramage, P. Heymann, C.D. Manning, and H. Garcia-Molina. 2009b. Clustering the tagged web. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 54–63. ACM.

D. Ramage, C.D. Manning, and S. Dumais. 2011. Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 457–465. ACM.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press.

T.N. Rubin, A. Chambers, P. Smyth, and M. Steyvers. 2011. Statistical topic models for multi-label document classification. arXiv preprint arXiv:1107.2462.

Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
