Style & Topic Language Model Adaptation Using HMM-LDA Bo-June (Paul) Hsu, James Glass.
Style & Topic Language Model Adaptation Using HMM-LDA
Bo-June (Paul) Hsu, James Glass
2
Outline
• Introduction
• LDA
• HMM-LDA
• Experiments
• Conclusions
3
Introduction
• An effective LM needs to not only account for the casual speaking style of lectures but also accommodate the topic-specific vocabulary of the subject matter
• Available training corpora rarely match the target lecture in both style and topic
• In this paper, syntactic state and semantic topic assignments are investigated using a combined HMM and LDA model
4
LDA
• A generative probabilistic model of a corpus
• The topic mixture is drawn from a conjugate Dirichlet prior
– PLSA: $P(w \mid d) = \sum_t P(w \mid t)\, P(t \mid d)$
– LDA: $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$, $z_i \mid \theta^{(d)} \sim \mathrm{Discrete}(\theta^{(d)})$, $\phi^{(z)} \sim \mathrm{Dirichlet}(\beta)$, $w_i \mid z_i, \phi \sim \mathrm{Discrete}(\phi^{(z_i)})$
– Model parameters: $P(\mathbf{w} \mid \alpha, \beta) = \int P(\theta \mid \alpha)\, P(\mathbf{w} \mid \theta, \beta)\, d\theta$
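The LDA generative story above can be sketched in a few lines of Python. This is a toy illustration only; the vocabulary size, topic count, and hyperparameter values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters, chosen only for illustration.
V, T, alpha, beta = 200, 10, 0.5, 0.1

# One word distribution phi^(t) per topic, each drawn from Dirichlet(beta).
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(n_words):
    """Generate one document under the LDA generative process."""
    theta = rng.dirichlet(np.full(T, alpha))  # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)            # z_i ~ Discrete(theta)
        w = rng.choice(V, p=phi[z])           # w_i ~ Discrete(phi^(z_i))
        words.append(int(w))
    return words

doc = generate_document(50)
```

Integrating the same story over all possible mixtures theta yields the marginal word likelihood in the model-parameters equation.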
5
Markov chain Monte Carlo
• A class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution
• The most common application of these algorithms is numerically calculating multi-dimensional integrals, where an ensemble of "walkers" moves around randomly
• A Markov chain is constructed in such a way as to have the integrand as its equilibrium distribution
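As a minimal illustration of this idea (not part of the paper), here is a random-walk Metropolis sampler whose equilibrium distribution is a standard normal; the step size and sample count are arbitrary:

```python
import math
import random

random.seed(0)

def target(x):
    # Unnormalized density of a standard normal: the chain's stationary distribution.
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, step=1.0):
    """Random-walk Metropolis: accept a proposal with probability
    min(1, target(proposal) / target(current))."""
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + random.uniform(-step, step)
        if random.random() < target(proposal) / target(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50_000)
mean = sum(samples) / len(samples)
```

After enough iterations the empirical mean and variance of the walker's positions approach those of the target distribution (0 and 1 here).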
6
LDA
• Estimate the posterior $P(\mathbf{z} \mid \mathbf{w})$
• Integrate out $\theta$ and $\phi$
• Gibbs sampling: $P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \dfrac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta} \left( n^{(d_i)}_{-i,t} + \alpha \right)$, where $W$ is the vocabulary size
7
Markov chain Monte Carlo (cont.)
• Gibbs Sampling http://en.wikipedia.org/wiki/Gibbs_sampling
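A sketch of one sweep of collapsed Gibbs sampling for LDA, using the standard count notation (word-topic counts `n_wt`, topic totals `n_t`, document-topic counts `n_dt`); the variable names and data layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_wt, n_t, n_dt, alpha, beta):
    """One collapsed Gibbs sweep for LDA.

    docs: list of word-id lists; z: matching topic assignments.
    n_wt: word-topic counts (W x T); n_t: topic totals; n_dt: doc-topic counts.
    """
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current assignment from all counts.
            n_wt[w, t_old] -= 1; n_t[t_old] -= 1; n_dt[d, t_old] -= 1
            # P(z_i = t | z_-i, w) ∝ (n_wt + beta)/(n_t + W*beta) * (n_dt + alpha)
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t_new = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t_new
            # Add the resampled assignment back.
            n_wt[w, t_new] += 1; n_t[t_new] += 1; n_dt[d, t_new] += 1
    return z
```

Each word's topic is resampled from its conditional given all other assignments; iterating sweeps drives the chain toward the posterior over topic assignments.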
8
HMM+LDA
• HMMs generate documents purely based on syntactic relations among unobserved word classes
– Short-range dependencies
• Topic models generate documents based on semantic correlations between words, independent of word order
– Long-range dependencies
• A major advantage of generative models is modularity
– Different models are easily combined
– Words can be generated by a mixture of models or a product of models
– Only a subset of words, the content words, exhibit long-range dependencies
• Replace one probability distribution over words in the syntactic model with the semantic model
9
HMM+LDA (cont.)
• Notation:
– A sequence of words $\mathbf{w} = (w_1, \ldots, w_n)$, $w_i \in \{1, \ldots, W\}$
– A sequence of topic assignments $\mathbf{z} = (z_1, \ldots, z_n)$, $z_i \in \{1, \ldots, T\}$
– A sequence of classes $\mathbf{c} = (c_1, \ldots, c_n)$, $c_i \in \{1, \ldots, C\}$
– $c_i = 1$ means the semantic class
– The $z$th topic is associated with a distribution over words $\phi^{(z)}$
– Each class $c \neq 1$ is associated with a distribution over words $\phi^{(c)}$
– Each document $d$ has a distribution over topics $\theta^{(d)}$
– Transitions between classes $c_{i-1}$ and $c_i$ follow a distribution $\pi^{(c_{i-1})}$
10
HMM+LDA (cont.)
• A document is generated as follows:
– Sample $\theta^{(d)}$ from a $\mathrm{Dirichlet}(\alpha)$ prior
– For each word $w_i$ in document $d$:
• Draw $z_i$ from $\theta^{(d)}$
• Draw $c_i$ from $\pi^{(c_{i-1})}$
• If $c_i = 1$, draw $w_i$ from $\phi^{(z_i)}$; else draw $w_i$ from $\phi^{(c_i)}$
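The HMM-LDA generative process can be sketched as follows. The sizes and hyperparameters are toy values, and index 0 stands in for the semantic class 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: W words, T topics, C classes (class index 0 = semantic).
W, T, C = 500, 10, 5
phi_topic = rng.dirichlet(np.full(W, 0.1), size=T)  # phi^(z): per-topic word dists
phi_class = rng.dirichlet(np.full(W, 0.1), size=C)  # phi^(c): per-class word dists
pi = rng.dirichlet(np.full(C, 1.0), size=C)         # class transition matrix

def generate_document(n_words, alpha=0.5):
    """Generate one document under the HMM-LDA generative process."""
    theta = rng.dirichlet(np.full(T, alpha))        # theta^(d) ~ Dirichlet(alpha)
    c, words = 0, []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)                  # z_i ~ theta^(d)
        c = int(rng.choice(C, p=pi[c]))             # c_i ~ pi^(c_{i-1})
        if c == 0:
            w = rng.choice(W, p=phi_topic[z])       # semantic class: topic emits word
        else:
            w = rng.choice(W, p=phi_class[c])       # syntactic class: class emits word
        words.append(int(w))
    return words

doc = generate_document(30)
```

The HMM's class chain captures short-range syntactic structure, while the topic mixture injects long-range semantic dependencies only at the semantic class.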
11
HMM+LDA (cont.)
• Inference:
– $\theta^{(d)}$ are drawn from $\mathrm{Dirichlet}(\alpha)$
– $\phi^{(z)}$ are drawn from $\mathrm{Dirichlet}(\beta)$
– The rows of the transition matrix $\pi$ are drawn from $\mathrm{Dirichlet}(\gamma)$
– $\phi^{(c)}$ are drawn from $\mathrm{Dirichlet}(\delta)$
– Assume all Dirichlet distributions are symmetric
12
HMM+LDA (cont.)
• Gibbs Sampling
13
HMM-LDA Analysis
• Lectures Corpus
– 3 undergraduate subjects in math, physics, and computer science
– 10 CS lectures for the development set, 10 CS lectures for the test set
• Textbook Corpus
– CS course textbook
– Divided into 271 topic-cohesive documents at every section heading
• Run the Gibbs sampler against the two datasets
– L: 2,800 iterations; T: 2,000 iterations
– Use the lowest-perplexity model as the final model
14
HMM-LDA Analysis (cont.)
• Semantic topics (Lectures): Machine Learning, Linear Algebra, Magnetism, Childhood Memories
• <laugh>: a cursory examination of the data suggests that speakers talking about children tend to laugh more during the lecture
• Although it may not be desirable to capture speaker idiosyncrasies in the topic mixtures, HMM-LDA has clearly demonstrated its ability to capture distinctive semantic topics in a corpus
15
HMM-LDA Analysis (cont.)
• Semantic topics (Textbook)
• A topically coherent paragraph
• 6 of the 7 instances of the words "and" and "or" (underlined) are correctly classified
• Multi-word topic key phrases can be identified for n-gram topic models
• The context-dependent labeling abilities of the HMM-LDA models are demonstrated
16
HMM-LDA Analysis (cont.)
• Syntactic States (Lectures)
– State 20 is the topic state
– Other states correspond to verbs, prepositions, hesitation disfluencies, and conjunctions
• As demonstrated with spontaneous speech, HMM-LDA yields syntactic states that have a good correspondence to part-of-speech labels, without requiring any labeled training data
17
Discussions
• Although MCMC techniques converge to the global stationary distribution, we cannot guarantee convergence from observation of the perplexity alone
• Unlike EM algorithms, random sampling may actually temporarily decrease the model likelihood
• The number of iterations was chosen to be at least double the point at which the perplexity first appeared to converge
18
Language Modeling Experiments
• Baseline model: Lecture + Textbook Interpolated trigram model (using modified Kneser-Ney discounting)
• Topic-deemphasized style (trigram) model (Lectures):
– Deemphasize the observed occurrences of topic words and ideally redistribute these counts to all potential topic words
– The counts of topic-to-style word transitions are not altered
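The baseline combines the lecture and textbook models by linear interpolation. A minimal sketch over toy unigram distributions (the function name, weights, and example words are illustrative, not from the paper):

```python
def interpolate(p_a, p_b, lam):
    """Linearly interpolate two word distributions (dicts mapping word -> prob):
    P(w) = lam * P_a(w) + (1 - lam) * P_b(w)."""
    vocab = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1 - lam) * p_b.get(w, 0.0) for w in vocab}

# Toy distributions standing in for lecture- and textbook-model predictions.
p_lec = {"the": 0.5, "um": 0.3, "stream": 0.2}
p_txt = {"the": 0.4, "stream": 0.4, "procedure": 0.2}
p_mix = interpolate(p_lec, p_txt, 0.6)
```

Because both inputs sum to one, any interpolation weight in [0, 1] yields a valid distribution; the real models interpolate trigram probabilities rather than unigrams.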
19
Language Modeling Experiments (cont.)
• The Textbook model should ideally have a higher weight in contexts containing topic words
• Domain trigram model (Textbook):
– Emphasize sequences containing a topic word in the context by doubling their counts
20
Language Modeling Experiments (cont.)
• Unsmoothed topical trigram model:
– Apply HMM-LDA with 100 topics to identify representative words and their associated contexts for each topic
• Topic mixtures for all models
– Mixture weights were tuned on individual target lectures (cheat)
– 15 of 100 topics account for over 90% of the total weight
21
Language Modeling Experiments (cont.)
• Since the topic distribution shifts over a long lecture, modeling a lecture with fixed weights may not be optimal
• Update the mixture distribution by linearly interpolating it with the posterior topic distribution given the current word:
$P(t \mid w_i) = \dfrac{P(w_i \mid t)\, p_{i-1}(t)}{\sum_{t'} P(w_i \mid t')\, p_{i-1}(t')}$, $\quad p_i(t) = (1 - \lambda)\, p_{i-1}(t) + \lambda\, P(t \mid w_i)$
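The per-word mixture update can be sketched as follows (a toy implementation; the names and the interpolation weight are illustrative):

```python
def update_topic_mixture(p_prev, p_w_given_t, lam):
    """One step of the adaptive topic-mixture update.

    p_prev:      current topic mixture p_{i-1}(t), one probability per topic
    p_w_given_t: P(w_i | t) for the observed word, one entry per topic
    lam:         interpolation weight toward the posterior
    """
    # Posterior P(t | w_i) ∝ P(w_i | t) p_{i-1}(t), normalized over topics.
    joint = [pw * pt for pw, pt in zip(p_w_given_t, p_prev)]
    norm = sum(joint)
    posterior = [j / norm for j in joint]
    # p_i(t) = (1 - lam) p_{i-1}(t) + lam P(t | w_i)
    return [(1 - lam) * pt + lam * po for pt, po in zip(p_prev, posterior)]

# A word strongly favoring topic 0 shifts the mixture toward that topic.
p_new = update_topic_mixture([0.5, 0.5], [0.9, 0.1], 0.5)
```

Each observed word nudges the running mixture toward the topics that best explain it, letting the model track topic drift over the course of a lecture.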
22
Language Modeling Experiments (cont.)
• The variation of topic mixtures
Review previous lecture -> Show an example of computation using accumulators -> Focus the lecture on streams as a data structure, with an intervening example that finds pairs of i and j that sum to a prime
23
Language Modeling Experiments (cont.)
• Experimental results
24
Conclusions
• HMM-LDA shows great promise for finding structure in unlabeled data, from which we can build more sophisticated models
• Speaker-specific adaptation will be investigated in the future