Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass.
Style & Topic Language Model Adaptation Using HMM-LDA
Bo-June (Paul) Hsu, James Glass
2
Outline
• Introduction
• LDA
• HMM-LDA
• Experiments
• Conclusions
3
Introduction
• An effective LM needs to not only account for the casual speaking style of lectures but also accommodate the topic-specific vocabulary of the subject matter
• Available training corpora rarely match the target lecture in both style and topic
• In this paper, syntactic state and semantic topic assignments are investigated using a combined HMM and LDA model
4
LDA
• A generative probabilistic model of a corpus
• The topic mixture is drawn from a conjugate Dirichlet prior
– PLSA: $P(w \mid d) = \sum_t P(w \mid t)\, P(t \mid d)$
– LDA: $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$, $z_i \mid \theta^{(d)} \sim \mathrm{Discrete}(\theta^{(d)})$, $\phi^{(z)} \sim \mathrm{Dirichlet}(\beta)$, $w_i \mid z_i, \phi \sim \mathrm{Discrete}(\phi^{(z_i)})$
– Model parameters: $P(\mathbf{w} \mid \alpha, \beta) = \int P(\theta \mid \alpha)\, P(\mathbf{w} \mid \theta, \beta)\, d\theta$
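The LDA generative story above can be sketched in a few lines of Python. This is a toy illustration only; the vocabulary size, topic count, and hyperparameter values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters, chosen only for illustration.
V, T, alpha, beta = 200, 10, 0.5, 0.1

# One word distribution phi^(t) per topic, each drawn from Dirichlet(beta).
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(n_words):
    """Generate one document under the LDA generative process."""
    theta = rng.dirichlet(np.full(T, alpha))  # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)            # z_i ~ Discrete(theta)
        w = rng.choice(V, p=phi[z])           # w_i ~ Discrete(phi^(z_i))
        words.append(int(w))
    return words

doc = generate_document(50)
```

Integrating the same story over all possible mixtures theta yields the marginal word likelihood in the model-parameters equation.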
5
Markov chain Monte Carlo
• A class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution
• The most common application of these algorithms is numerically calculating multi-dimensional integrals, where an ensemble of "walkers" moves around randomly
• A Markov chain is constructed in such a way as to have the integrand as its equilibrium distribution
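As a minimal illustration of this idea (not part of the paper), here is a random-walk Metropolis sampler whose equilibrium distribution is a standard normal; the step size and sample count are arbitrary:

```python
import math
import random

random.seed(0)

def target(x):
    # Unnormalized density of a standard normal: the chain's stationary distribution.
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0):
    """Random-walk Metropolis: accept a proposal with probability
    min(1, target(proposal) / target(current))."""
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + random.uniform(-step, step)
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50_000)
mean = sum(samples) / len(samples)
```

After enough iterations the empirical mean and variance of the walker's positions approach those of the target distribution (0 and 1 here).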
6
LDA
• Estimate the posterior $P(\mathbf{z} \mid \mathbf{w})$
• Integrate out $\theta$ and $\phi$
• Gibbs sampling: $P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \dfrac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta} \left( n^{(d_i)}_{-i,t} + \alpha \right)$, where $W$ is the vocabulary size
7
Markov chain Monte Carlo (cont.)
• Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
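A sketch of one sweep of collapsed Gibbs sampling for LDA, using the standard count notation (word-topic counts `n_wt`, topic totals `n_t`, document-topic counts `n_dt`); the variable names and data layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_wt, n_t, n_dt, alpha, beta):
    """One collapsed Gibbs sweep for LDA.

    docs: list of word-id lists; z: matching topic assignments.
    n_wt: word-topic counts (W x T); n_t: topic totals; n_dt: doc-topic counts.
    """
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment from all counts.
            n_wt[w, t_old] -= 1; n_t[t_old] -= 1; n_dt[d, t_old] -= 1
            # P(z_i = t | z_-i, w) ∝ (n_wt + beta)/(n_t + W*beta) * (n_dt + alpha)
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t_new = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t_new
            # Add the resampled assignment back.
            n_wt[w, t_new] += 1; n_t[t_new] += 1; n_dt[d, t_new] += 1
    return z
```

Each word's topic is resampled from its conditional given all other assignments; iterating sweeps drives the chain toward the posterior over topic assignments.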
8
HMM+LDA
• HMMs generate documents purely based on syntactic relations among unobserved word classes
– Short-range dependencies
• Topic models generate documents based on semantic correlations between words, independent of word order
– Long-range dependencies
• A major advantage of generative models is modularity
– Different models are easily combined
– Words can be generated by a mixture of models or a product of models
– Only a subset of words, the content words, exhibit long-range dependencies
• Replace one probability distribution over words in the syntactic model with the semantic model
9
HMM+LDA (cont.)
• Notation:
– A sequence of words $\mathbf{w} = (w_1, \ldots, w_n)$, $w_i \in \{1, \ldots, W\}$
– A sequence of topic assignments $\mathbf{z} = (z_1, \ldots, z_n)$, $z_i \in \{1, \ldots, T\}$
– A sequence of classes $\mathbf{c} = (c_1, \ldots, c_n)$, $c_i \in \{1, \ldots, C\}$
– $c_i = 1$ means the semantic class
– The $z$th topic is associated with a distribution over words $\phi^{(z)}$
– Each class $c \neq 1$ is associated with a distribution over words $\phi^{(c)}$
– Each document $d$ has a distribution over topics $\theta^{(d)}$
– Transitions between classes $c_{i-1}$ and $c_i$ follow a distribution $\pi^{(c_{i-1})}$
10
HMM+LDA (cont.)
• A document is generated as follows:
– Sample $\theta^{(d)}$ from a $\mathrm{Dirichlet}(\alpha)$ prior
– For each word $w_i$ in document $d$:
• Draw $z_i$ from $\theta^{(d)}$
• Draw $c_i$ from $\pi^{(c_{i-1})}$
• If $c_i = 1$, draw $w_i$ from $\phi^{(z_i)}$; else draw $w_i$ from $\phi^{(c_i)}$
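The HMM-LDA generative process can be sketched as follows. The sizes and hyperparameters are toy values, and index 0 stands in for the semantic class 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: W words, T topics, C classes (class index 0 = semantic).
W, T, C = 500, 10, 5
phi_topic = rng.dirichlet(np.full(W, 0.1), size=T)  # phi^(z): per-topic word dists
phi_class = rng.dirichlet(np.full(W, 0.1), size=C)  # phi^(c): per-class word dists
pi = rng.dirichlet(np.full(C, 1.0), size=C)         # class transition matrix

def generate_document(n_words, alpha=0.5):
    """Generate one document under the HMM-LDA generative process."""
    theta = rng.dirichlet(np.full(T, alpha))        # theta^(d) ~ Dirichlet(alpha)
    c, words = 0, []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)                  # z_i ~ theta^(d)
        c = int(rng.choice(C, p=pi[c]))             # c_i ~ pi^(c_{i-1})
        if c == 0:
            w = rng.choice(W, p=phi_topic[z])       # semantic class: topic emits word
        else:
            w = rng.choice(W, p=phi_class[c])       # syntactic class: class emits word
        words.append(int(w))
    return words

doc = generate_document(30)
```

The HMM's class chain captures short-range syntactic structure, while the topic mixture injects long-range semantic dependencies only at the semantic class.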
11
HMM+LDA (cont.)
• Inference:
– $\theta^{(d)}$ are drawn from $\mathrm{Dirichlet}(\alpha)$
– $\phi^{(z)}$ are drawn from $\mathrm{Dirichlet}(\beta)$
– The rows of the transition matrix $\pi$ are drawn from $\mathrm{Dirichlet}(\gamma)$
– $\phi^{(c)}$ are drawn from $\mathrm{Dirichlet}(\delta)$
– Assume all Dirichlet distributions are symmetric
12
HMM+LDA (cont.)
• Gibbs Sampling
13
HMM-LDA Analysis
• Lectures Corpus
– 3 undergraduate subjects in math, physics, and computer science
– 10 CS lectures for the development set, 10 CS lectures for the test set
• Textbook Corpus
– CS course textbook
– Divided into 271 topic-cohesive documents at every section heading
• Run the Gibbs sampler against the two datasets
– L: 2,800 iterations; T: 2,000 iterations
– Use the lowest-perplexity model as the final model
14
HMM-LDA Analysis (cont.)
• Semantic topics (Lectures): Machine Learning, Linear Algebra, Magnetism, Childhood Memories
• <laugh>: a cursory examination of the data suggests that speakers talking about children tend to laugh more during the lecture
• Although it may not be desirable to capture speaker idiosyncrasies in the topic mixtures, HMM-LDA has clearly demonstrated its ability to capture distinctive semantic topics in a corpus
15
HMM-LDA Analysis (cont.)
• Semantic topics (Textbook)
• A topically coherent paragraph
• 6 of the 7 instances of the words "and" and "or" (underlined) are correctly classified
• Multi-word topic key phrases can be identified for n-gram topic models
• The context-dependent labeling abilities of the HMM-LDA models are demonstrated
16
HMM-LDA Analysis (cont.)
• Syntactic States (Lectures)
– State 20 is the topic state
– Other states correspond to verbs, prepositions, hesitation disfluencies, and conjunctions
• As demonstrated with spontaneous speech, HMM-LDA yields syntactic states that have a good correspondence to part-of-speech labels, without requiring any labeled training data
17
Discussions
• Although MCMC techniques converge to the global stationary distribution, we cannot guarantee convergence from observation of the perplexity alone
• Unlike EM algorithms, random sampling may actually temporarily decrease the model likelihood
• The number of iterations was chosen to be at least double the point at which the perplexity first appeared to converge
18
Language Modeling Experiments
• Baseline model: Lecture + Textbook Interpolated trigram model (using modified Kneser-Ney discounting)
• Topic-deemphasized style (trigram) model (Lectures):
– Deemphasize the observed occurrences of topic words and ideally redistribute these counts to all potential topic words
– The counts of topic-to-style word transitions are not altered
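The baseline combines the lecture and textbook models by linear interpolation. A minimal sketch over toy unigram distributions (the function name, weights, and example words are illustrative, not from the paper):

```python
def interpolate(p_a, p_b, lam):
    """Linearly interpolate two word distributions (dicts mapping word -> prob):
    P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    vocab = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1 - lam) * p_b.get(w, 0.0) for w in vocab}

# Toy distributions standing in for lecture- and textbook-model predictions.
p_lec = {"the": 0.5, "um": 0.3, "stream": 0.2}
p_txt = {"the": 0.4, "stream": 0.4, "procedure": 0.2}
p_mix = interpolate(p_lec, p_txt, 0.6)
```

Because both inputs sum to one, any interpolation weight in [0, 1] yields a valid distribution; the real models interpolate trigram probabilities rather than unigrams.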
19
Language Modeling Experiments (cont.)
• The Textbook model should ideally have a higher weight in contexts containing topic words
• Domain trigram model (Textbook):
– Emphasize sequences containing a topic word in the context by doubling their counts
20
Language Modeling Experiments (cont.)
• Unsmoothed topical trigram model:
– Apply HMM-LDA with 100 topics to identify representative words and their associated contexts for each topic
• Topic mixtures for all models
– Mixture weights were tuned on individual target lectures (cheat)
– 15 of 100 topics account for over 90% of the total weight
21
Language Modeling Experiments (cont.)
• Since the topic distribution shifts over a long lecture, modeling a lecture with fixed weights may not be optimal
• Update the mixture distribution by linearly interpolating it with the posterior topic distribution given the current word:
$P(t \mid w_i) = \dfrac{P(w_i \mid t)\, p_{i-1}(t)}{\sum_{t'} P(w_i \mid t')\, p_{i-1}(t')}$, $\quad p_i(t) = (1 - \lambda)\, p_{i-1}(t) + \lambda\, P(t \mid w_i)$
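The per-word mixture update can be sketched as follows (a toy implementation; the names and the interpolation weight are illustrative):

```python
def update_topic_mixture(p_prev, p_w_given_t, lam):
    """One step of the adaptive topic-mixture update.

    p_prev:      current topic mixture p_{i-1}(t), one probability per topic
    p_w_given_t: P(w_i | t) for the observed word, one entry per topic
    lam:         interpolation weight toward the posterior
    """
    # Posterior P(t | w_i) ∝ P(w_i | t) p_{i-1}(t), normalized over topics.
    joint = [pw * pt for pw, pt in zip(p_w_given_t, p_prev)]
    norm = sum(joint)
    posterior = [j / norm for j in joint]
    # p_i(t) = (1 - lam) p_{i-1}(t) + lam P(t | w_i)
    return [(1 - lam) * pt + lam * po for pt, po in zip(p_prev, posterior)]

# A word strongly favoring topic 0 shifts the mixture toward that topic.
p_new = update_topic_mixture([0.5, 0.5], [0.9, 0.1], 0.5)
```

Each observed word nudges the running mixture toward the topics that best explain it, letting the model track topic drift over the course of a lecture.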
22
Language Modeling Experiments (cont.)
• The variation of topic mixtures
Review previous lecture -> Show an example of computation using accumulators -> Focus the lecture on streams as a data structure, with an intervening example that finds pairs of i and j that sum to a prime
23
Language Modeling Experiments (cont.)
• Experimental results
24
Conclusions
• HMM-LDA shows great promise for finding structure in unlabeled data, from which we can build more sophisticated models
• Speaker-specific adaptation will be investigated in the future