Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Lecture 8. Inference in Non-Tree Graphs
• The multiple-parent problem
• Solution #1: Parent merger
• Solution #2: Moralize, triangulate, and create a junction tree
• Inference in any DBN: the sum-product algorithm in a junction tree
• Example: Factorial HMM
• Example: Zweig-triangle LVCSR
• Example: Articulatory phonology
Example Problem: Find p(d|a)
(Graph: a → b, a → c; b → d, c → d)
• The correct answer is: p(d|a) = Σ_b Σ_c p(b|a) p(c|a) p(d|b,c)
• Try the sum-product algorithm. Propagate up, starting with node b: p(D_b|b) = p(d|b) = ???
• p(d|b) is no longer one of the parameters of the model; now it must be calculated from p(d|b,c).
• In fact, the calculation requires us to sum over every variable in the model: p(d|b) = Σ_c Σ_a p(a) p(c|a) p(d|b,c)
• High Computational Cost!
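The global sum above can be checked numerically. A minimal sketch with hypothetical binary CPTs (all probability-table values below are made-up toy numbers, not from the slides):

```python
import itertools
import random

random.seed(1)

def coin():
    # a random distribution over a binary variable
    p = random.random()
    return [p, 1.0 - p]

# hypothetical binary CPTs for the graph a -> b, a -> c, (b,c) -> d
p_b = [coin() for _ in range(2)]                      # p_b[a][b] = p(b|a)
p_c = [coin() for _ in range(2)]                      # p_c[a][c] = p(c|a)
p_d = [[coin() for _ in range(2)] for _ in range(2)]  # p_d[b][c][d] = p(d|b,c)

def p_d_given_a(d, a):
    # p(d|a) = sum_b sum_c p(b|a) p(c|a) p(d|b,c)
    return sum(p_b[a][b] * p_c[a][c] * p_d[b][c][d]
               for b, c in itertools.product(range(2), repeat=2))

# sanity check: p(d=0|a) + p(d=1|a) = 1 for each a
for a in range(2):
    assert abs(p_d_given_a(0, a) + p_d_given_a(1, a) - 1.0) < 1e-12
```

The point of the slide survives in the code: evaluating p(d|·) forces a sum over variables that are not local to node d.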
Conditional Independence of Descendants and Non-descendants
• The Sum-Product algorithm can use computations that are local at each node, v, because of the following theorem:
• Theorem: a Bayesian network is a tree if and only if, for every variable v, the descendants D_v and non-descendants N_v of v are conditionally independent given v: p(D_v, N_v | v) = p(D_v | v) p(N_v | v)
Example: Descendants and Non-descendants in a Tree
• p(D_c|c) = p(d|c) p(e|c) p(f|c)
• p(N_c, c) = p(a) p(b|a) p(c|a)
(Tree: a → b, a → c; c → d, c → e, c → f)
Example: Descendants and Non-descendants in a Non-Tree
• p(D_c, c) = Σ_{a,b} p(a) p(b|a) p(c|a) p(d|b,c) p(e|c) p(f|c)
• p(N_c, c) = Σ_d p(a) p(b|a) p(c|a) p(d|b,c)
• So is it necessary for EVERY computation to be global?
(Non-tree: a → b, a → c; b → d, c → d; c → e, c → f)
Local Computations in a Non-Tree
• Here are some computations that can be local:
– d depends only on the combination (b,c)
– (b,c) depend only on a
– e depends only on c or, equivalently, e depends on (b,c)
The “Parent Merger” Algorithm
• Combine b, c into a “super-node” bc
– Number of possible values = (# of b values) × (# of c values)
– p(bc | a) = p(b | a) p(c | a)
– p(d | bc) = p(d | b,c)
– p(e | bc) = p(e | c)
• Result is a tree
(Before: a → b, a → c; b → d, c → d; c → e, c → f. After: a → bc; bc → d, bc → e, bc → f)
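The super-node construction can be sketched numerically. Toy CPT values and the variable sizes N_b = N_c = 2 below are assumptions for illustration:

```python
import itertools
import random

random.seed(2)
Nb = Nc = 2    # assumed variable sizes

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p_b = [dist(Nb) for _ in range(2)]                       # p(b|a)
p_c = [dist(Nc) for _ in range(2)]                       # p(c|a)
p_d = [[dist(2) for _ in range(Nc)] for _ in range(Nb)]  # p(d|b,c)

# super-node bc: index bc = b*Nc + c, with Nb*Nc possible values
p_bc = [[p_b[a][b] * p_c[a][c]
         for b, c in itertools.product(range(Nb), range(Nc))]
        for a in range(2)]                                  # p(bc|a)
p_d_bc = [p_d[bc // Nc][bc % Nc] for bc in range(Nb * Nc)]  # p(d|bc)

# inference in the resulting tree: p(d|a) = sum_bc p(bc|a) p(d|bc)
p_d_a = [[sum(p_bc[a][bc] * p_d_bc[bc][d] for bc in range(Nb * Nc))
          for d in range(2)] for a in range(2)]

# agrees with the direct double sum over (b, c)
direct = sum(p_b[0][b] * p_c[0][c] * p_d[b][c][1]
             for b in range(Nb) for c in range(Nc))
assert abs(p_d_a[0][1] - direct) < 1e-12
```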
Sum-Product Algorithm with Super-Nodes
• Propagate up:
– p(D_bc | bc) = p(d|bc) p(e|bc) Σ_f p(f|bc)
– p(D_a | a) = Σ_bc p(D_bc | bc) p(bc | a)
• Propagate down:
– p(bc, N_bc) = Σ_a p(a) p(bc | a)
– p(f, N_f) = Σ_bc p(bc, N_bc) p(f | bc) p(e | bc) p(d | bc)
• Multiply:
– p(bc, d, e) = p(bc, N_bc) p(D_bc | bc)
– p(c, d, e) = Σ_b p(bc, d, e)
The “Parent Merger” Algorithm
• Algorithm #1 for turning a non-tree into a tree:
– If any node has multiple parents, merge them
– If any resulting supernode has multiple parents, merge them
– Repeat until no node has multiple parents
• Why this algorithm is sometimes undesirable:
– In an upward-branching graph, this results in a supernode with N_g N_h N_i N_j = many possible values
– Many values → lots of computation
(Figure: an upward-branching graph over a–p whose repeated parent mergers produce supernodes {bc}, {def}, {ghij}, {klm}, {no})
Algorithm #2: Junction Trees
• Moralize
• Triangulate
• Read off the cliques into a junction tree
• Add variables to cliques, as necessary, to ensure locality of influence
Moralization
• “Moralization” is the process of connecting the parents of every node.
• Goal: to show that values of the parents cannot really be independently computed.
• Once the graph has been moralized, we usually show it as an undirected graph; the dependency structure will still be necessary for inference, but not for finding the best junction tree.
(Figure: a directed graph over a, b, c, d, e, f, n, o, p, and its moralized undirected version)
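Moralization itself is mechanical. A minimal Python sketch (the adjacency-dict representation and the example edges are assumptions for illustration):

```python
import itertools

def moralize(parents):
    """parents: dict mapping node -> list of parent nodes (a DAG).
    Returns the undirected edge set of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))        # keep the original edges
        for p1, p2 in itertools.combinations(ps, 2):
            edges.add(frozenset((p1, p2)))          # "marry" the parents

    return edges

# the running example: d has parents b and c, so moralization adds edge b-c
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["c"], "f": ["c"]}
moral = moralize(dag)
assert frozenset(("b", "c")) in moral   # the married parents of d
assert frozenset(("a", "b")) in moral   # original edges survive
```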
Triangulation
• A “triangular” or “chordal” graph is a graph in which every cycle of length greater than three has a chord (an edge connecting two non-adjacent nodes of the cycle).
• “Triangulation” is the process of adding edges to a graph in order to make it triangular.
(Figure: a graph over a, b, c, d, e, f, g, h, i, j shown as the original DAG, after moralization, and after triangulation)
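The slides triangulate by hand; a common automatic heuristic (not covered in the slides) is greedy min-fill elimination, sketched here:

```python
def min_fill_triangulate(adj):
    """Greedy min-fill heuristic: repeatedly eliminate the vertex whose
    remaining neighbors need the fewest new edges; the added 'fill' edges
    make the graph chordal. adj: dict node -> set of neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}
    fill = set()
    remaining = set(adj)
    while remaining:
        def cost(v):
            nbrs = [u for u in adj[v] if u in remaining]
            return sum(1 for i, a in enumerate(nbrs)
                       for b in nbrs[i + 1:] if b not in adj[a])
        v = min(remaining, key=cost)
        nbrs = [u for u in adj[v] if u in remaining]
        for i, a in enumerate(nbrs):       # connect v's remaining neighbors
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.add(frozenset((a, b)))
        remaining.remove(v)
    return fill

# a 4-cycle a-b-c-d-a needs exactly one chord
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
chords = min_fill_triangulate(cycle4)
assert len(chords) == 1
```

As the slides note later, no greedy heuristic guarantees the minimum maximum clique; min-fill is just a widely used approximation.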
Digression: Why are Moralization and Triangulation Allowed?
• An edge connecting a→b means that, during inference, we must use a probability table of size N_a N_b: p(b|a)
• One special case of the p(b|a) probability table is the case in which every row is the same, i.e., p(b|a) = p(b)
• Therefore the graph without the extra edge is a special case of the graph with it.
• Put another way, information about the special features of a problem is coded by absent edges, not present edges.
• Adding edges is equivalent to forcing yourself to solve a harder, more general problem, rather than a simple specific one.
(Figure: the graph a → c ← b, and the same graph with an added edge between a and b)
Cliques
• A “clique” is a group of nodes, all of which are connected together.
• The “separator” of two cliques is the set of nodes that are members of both cliques. For example, clique efg and clique def have {e,f} as their separator.
Forming a Junction Tree
• It is always possible to create a junction tree from a triangular graph using the following algorithm:
– Start with any clique as the root node
– Next comes the clique whose separator with the root node is largest
– Locality of influence: if cliques A and B both contain node c, then node c must also be added to every clique in the junction tree between A and B
(Figure: the triangulated graph over a–j, and its junction tree with cliques a,b,d — c,d,e — d,e,f — e,f,g — f,g,h — g,h,j)
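Locality of influence is the running-intersection property, and it can be checked mechanically. A minimal sketch over the clique chain from this slide:

```python
# clique chain read off the slide's junction tree
chain = [("a", "b", "d"), ("c", "d", "e"), ("d", "e", "f"),
         ("e", "f", "g"), ("f", "g", "h"), ("g", "h", "j")]

def running_intersection(chain):
    """True iff any node shared by two cliques also appears in every
    clique between them (locality of influence)."""
    for i in range(len(chain)):
        for j in range(i + 2, len(chain)):
            shared = set(chain[i]) & set(chain[j])
            for k in range(i + 1, j):
                if not shared <= set(chain[k]):
                    return False
    return True

assert running_intersection(chain)
```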
Triangulation: A Hard Example
• In this example, the graph on the left is not yet fully triangulated (for example, the cycle d–b–c–f–o–n has no chord). Here is one possible triangulation algorithm: create a junction tree, then add variables to the cliques as necessary to maintain locality of influence.
(Junction tree so far: a,b,c — b,c,e? — b,d,e? — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p, over the graph a, b, c, d, e, f, n, o, p)
Example: Triangulation to Maintain Locality of Influence
• Every node that’s in both the second clique and the fourth clique must also exist in the third clique
Example: Triangulation to Maintain Locality of Influence
• Add node c to the 3rd clique, because c is in both the 2nd and 4th cliques. Putting c into this clique is equivalent to drawing an edge between c and d, so that b, c, d, e are all interconnected.
(Junction tree now: a,b,c — b,c,e? — b,c,d,e — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• … and then delete clique (b,c,e), because it’s now redundant with clique (b,c,d,e).
(Junction tree now: a,b,c — b,c,d,e — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• Similar reasoning: d is in the 4th clique, so it had better be added to the 3rd. This is equivalent to drawing an edge between d and f.
(Junction tree now: a,b,c — b,c,d,e — c,d,e,f — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• … and f is in the 5th clique, so it had better be added to the 4th. This is equivalent to drawing an edge between f and n.
(Junction tree now: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,o? — e,n,o? — n,o,p)
Finishing Up
(Final junction tree: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,n,o — n,o,p)
Inference in a Junction Tree
• The “clique” representation means that all inference is contained within one clique. For each clique:
– Input: a probability table for values of the lower separator. Size of this table = N^I, where N is the number of possible values of each variable and I is the number of nodes in the input separator.
– Product: multiply by information about other nodes in the clique. The resulting table is of size N^C, where C is the number of nodes in the clique.
– Sum: for each possible setting of variables in the output separator (N^O possible settings), marginalize out the values of all other variables (a sum with N^{C−O} terms). Total complexity: O{N^C}
– Pass the resulting table to the next clique.
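The product-then-sum step for one clique can be written generically. This sketch is not from the slides; the dict-of-tuples table representation and the uniform toy factor are assumptions:

```python
import itertools

N = 3   # assumed number of values per variable

def pass_message(msg_in, clique_factor, in_vars, clique_vars, out_vars):
    """One clique update: multiply the incoming separator table into the
    clique factor (a table of size N^C), then marginalize down to the
    output separator (a table of size N^O)."""
    out = {}
    for assign in itertools.product(range(N), repeat=len(clique_vars)):
        a = dict(zip(clique_vars, assign))
        val = clique_factor[assign] * msg_in[tuple(a[v] for v in in_vars)]
        key = tuple(a[v] for v in out_vars)
        out[key] = out.get(key, 0.0) + val
    return out

# toy clique over (e,f,n): uniform p(n|e,f), uniform incoming message over (e,f)
factor = {a: 1.0 / N for a in itertools.product(range(N), repeat=3)}
msg = {a: 1.0 / N ** 2 for a in itertools.product(range(N), repeat=2)}
out = pass_message(msg, factor, ("e", "f"), ("e", "f", "n"), ("n",))
assert abs(sum(out.values()) - 1.0) < 1e-12
```

The loop visits all N^C clique assignments, which is exactly the O{N^C} cost quoted on the slide.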
Inference Example
• Suppose b and o are observed, and we want to find p(♦,b,o) for all variables ♦.
• Clique noq:
– Output separator variables are n,o
– Product: p(observation, q | n,o) = p(q | n,o)
– Sum: p(observation | n,o) = Σ_q p(q | n,o) = 1
• We could have skipped this step by observing that p(o | o) is always 1
(Junction tree: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,n,o — n,o,q, over the graph a, b, c, d, e, f, n, o, q, with b and o observed)
Inference Example
• Clique efno:
– Input: p(observation | n,o)
– Product: nodes not in the output separator are moved to the left:
• p(observation, o | e,f,n) = p(observation | n,o) p(o | e,f,n)
– Sum: over unobserved elements not in the output separator:
• p(observation | e,f,n) = p(observation, o | e,f,n)
– Output: probabilities for every setting of the output separator:
• p(observation | e,f,n)
Inference Example
• Clique defn:
– Input: p(observation | e,f,n)
– Product: nodes not in the output separator are moved to the left:
• p(observation, n | d,e,f) = p(observation | e,f,n) p(n | d,e,f)
– Sum: over unobserved elements not in the output separator:
• p(observation | d,e,f) = Σ_n p(observation, n | d,e,f)
– Output:
• p(observation | d,e,f)
Inference Example: “Propagate Down”
• Clique abc:
– Product: p(a,b,c) = p(b,c|a) p(a)
– Sum: over every unobserved variable that’s not in the output separator:
• p(b,c) = Σ_a p(a,b,c)
Inference Example: “Propagate Down”
• Clique bcde:
– Product: p(b,c,d,e) = p(d,e|b,c) p(b,c)
– Sum: over every unobserved variable that’s not in the output separator:
• p(observations above, c,d,e) = p(b,c,d,e)
– Output: a probability table of size N^3:
• p(observations above, c,d,e)
Inference Example: “Propagate Down”
• Clique cdef:
– Input: p(observations above, c,d,e)
– Product: p(observations above, c,d,e,f) = p(f|c,d,e) p(observations above, c,d,e)
– Sum: over every unobserved variable that’s not in the output separator:
• p(observations above, d,e,f) = Σ_c p(observations above, c,d,e,f)
– Output: a probability table of size N^3:
• p(observations above, d,e,f)
… and so on…
Computational Complexity
• Complexity of inference is O{N^C}, where N is the number of values each node takes and C is the number of nodes in the largest clique.
– More precisely, complexity is O{max over cliques of Π_i N_i}, where the ith variable in the clique takes values in the range 1 ≤ v_i ≤ N_i, and the max is taken over all cliques.
• Therefore, a triangulation algorithm should minimize the maximum clique.
• Unfortunately, automatic minimum-maximum-clique triangulation is NP-hard. Good approximate algorithms exist, but…
• Humans are better at this than machines: design your graph with small cliques.
Factorial HMM (FHMM)
• Factorial HMM:
– q_t and v_t represent two different types of background information, each with its own history
– Observations x_t depend on both hidden processes
• Model parameters:
– p(v_{t+1}|v_t), p(q_{t+1}|q_t), p(x_t|q_t,v_t)
• Computational complexity of the sum-product algorithm:
– O{N^4 T} using “parent-merger” triangulation
– O{N^3 T} using a better triangulation (five slides from now)
(DBN: two parallel Markov chains q_1 … q_T and v_1 … v_T, with each observation x_t depending on both q_t and v_t)
Example: Speech in Music (Deoras and Hasegawa-Johnson, ICSLP 2004)
• q_t = one person speaking
– Speech log spectrum given by p_s(y_t(e^{jω})|q_t) = mixture Gaussian
• v_t = music playing in the background
– Music log spectrum given by p_m(z_t(e^{jω})|v_t) = mixture Gaussian
• Observed log spectrum = max(speech, music)
– x_t(e^{jω}) ≈ max(y_t(e^{jω}), z_t(e^{jω}))   (x_t(e^{jω}) ≥ max(y_t(e^{jω}), z_t(e^{jω})) ≥ x_t(e^{jω}) − 6 dB)
– p(x_t|q_t,v_t) = p_s(x_t|q_t) ∫^{x_t} p_m(z|v_t) dz + p_m(x_t|v_t) ∫^{x_t} p_s(y|q_t) dy
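With single Gaussians standing in for the mixtures (a simplification for illustration; the slides use mixture Gaussians per state), the max-model likelihood for one spectral bin can be written directly:

```python
import math

def gauss_pdf(x, mu, sig):
    return math.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu, sig):
    return 0.5 * (1.0 + math.erf((x - mu) / (sig * math.sqrt(2))))

def p_max(x, mu_s, sig_s, mu_m, sig_m):
    # density of max(speech, music) for independent Gaussians:
    # p(x|q,v) = p_s(x|q) P_m(z <= x|v) + p_m(x|v) P_s(y <= x|q)
    return (gauss_pdf(x, mu_s, sig_s) * gauss_cdf(x, mu_m, sig_m)
            + gauss_pdf(x, mu_m, sig_m) * gauss_cdf(x, mu_s, sig_s))

# sanity check: the max-model density integrates to (approximately) 1
total = sum(p_max(-10 + k * 0.01, 0.0, 1.0, 0.5, 1.0) * 0.01 for k in range(2000))
assert abs(total - 1.0) < 0.01
```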
AVSR: The Boltzmann Zipper (Hennecke, Stork, and Prasad, 1996)
• Same as the AVSR model from last time, except that now v_t has memory, independent of q_t. Model parameters:
– p(q_{t+1}|q_t), p(x_t|q_t)
– p(v_{t+1}|v_t,q_{t+1}), p(y_t|v_t)
• Sum-product algorithm: O{N^3 T}, just like the FHMM
• The extra observations add complexity of only O{T}
(DBN: audio phoneme states q_t with audio spectral observations x_t; viseme states v_t with video observations y_t)
AVSR: The Coupled HMM (Chu and Huang, 2000)
• Advantage over the Boltzmann zipper: more flexible, because neither vision nor sound is “privileged” over the other.
– p(q_{t+1}|v_t,q_t), p(x_t|q_t)
– p(v_{t+1}|v_t,q_t), p(y_t|v_t)
• Disadvantage: it can’t be triangulated like the FHMM, so complexity is O{N^4 T} rather than O{N^3 T}
Inference using Parent Merger
• N_t = observed non-descendants of (q_t,v_t) = {x_1,…,x_{t−1}}
• D_t = observed descendants of (q_t,v_t) = {x_t,…,x_T}
• Forward algorithm:
– p(N_{t+1},q_{t+1},v_{t+1}) = Σ_{q_t,v_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(v_{t+1} | v_t) p(N_t,q_t,v_t)
• Backward algorithm:
– p(D_t | q_t,v_t) = p(x_t | q_t,v_t) Σ_{q_{t+1},v_{t+1}} p(q_{t+1} | q_t) p(v_{t+1} | v_t) p(D_{t+1} | q_{t+1},v_{t+1})
• Complexity:
– (T frames) × (N^2 sums/frame) × (N^2 terms/sum) = O{N^4 T}
A Smarter Triangulation
• Forward algorithm, step 1:
– p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(q_t,v_t,N_t)
A Smarter Triangulation
• Forward algorithm, step 1:
– p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(q_t,v_t,N_t)
• Forward algorithm, step 2:
– p(q_{t+1},v_{t+1},N_{t+1}) = Σ_{v_t} p(v_{t+1}|v_t) p(q_{t+1},v_t,N_{t+1})
• Computational complexity:
– (T frames) × (2N^2 sums/frame) × (N terms/sum) = O{N^3 T}
– Complexity is N times higher than that of a one-stream HMM
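The two forward passes can be compared numerically. A toy sketch with random model parameters (N = 3 states per chain and T = 6 frames are arbitrary choices) confirms that the O{N^3 T} two-step factorization gives the same result as the O{N^4 T} parent-merger recursion:

```python
import random

random.seed(0)
N, T = 3, 6

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# toy FHMM parameters (made-up values)
A_q = [dist(N) for _ in range(N)]   # p(q_{t+1}|q_t)
A_v = [dist(N) for _ in range(N)]   # p(v_{t+1}|v_t)
B = [[[random.random() for _ in range(T)]
      for _ in range(N)] for _ in range(N)]   # B[q][v][t] = p(x_t|q,v)
pi_q, pi_v = dist(N), dist(N)

# parent-merger forward pass: N^2 sums of N^2 terms per frame -> O(N^4 T)
alpha = [[pi_q[q] * pi_v[v] for v in range(N)] for q in range(N)]
for t in range(T):
    alpha = [[sum(B[q][v][t] * A_q[q][q2] * A_v[v][v2] * alpha[q][v]
                  for q in range(N) for v in range(N))
              for v2 in range(N)] for q2 in range(N)]
merged = sum(map(sum, alpha))

# smarter triangulation: two N^2-entry tables of N-term sums -> O(N^3 T)
alpha = [[pi_q[q] * pi_v[v] for v in range(N)] for q in range(N)]
for t in range(T):
    # step 1: sum over q_t, giving a table over (q_{t+1}, v_t)
    half = [[sum(B[q][v][t] * A_q[q][q2] * alpha[q][v] for q in range(N))
             for v in range(N)] for q2 in range(N)]
    # step 2: sum over v_t, giving a table over (q_{t+1}, v_{t+1})
    alpha = [[sum(A_v[v][v2] * half[q2][v] for v in range(N))
              for v2 in range(N)] for q2 in range(N)]
smart = sum(map(sum, alpha))

assert abs(merged - smart) < 1e-12
```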
“Compiling” an FHMM into an HMM
• Purpose: GMTK (Bilmes and Zweig, ICASSP 2002) can implement the FHMM directly, but by compiling the FHMM into an HMM, we can also use HTK (Young, Evermann, Hain et al., 2002) and other software tools
• Method:
– Each state specifies the variables in the output separator of one clique, e.g., (q_{t+1},v_t) is the separator between cliques (q_t,q_{t+1},v_t) and (q_{t+1},v_t,v_{t+1}).
– The transition probability matrix p(q_{t+1},v_t|q_t,v_t) is N^2 × N^2, but only N^3 entries can be non-zero, so complexity is O{N^3}
(Compiled HMM: state sequence … (q_t,v_t) → (q_{t+1},v_t) → (q_{t+1},v_{t+1}) …, with observations x_t)
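The sparsity claim can be checked directly. A sketch building the half-step transition table from separator (q_t,v_t) to separator (q_{t+1},v_t), with made-up toy probabilities:

```python
import random

random.seed(3)
N = 3

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

A_q = [dist(N) for _ in range(N)]   # p(q_{t+1}|q_t)

# compiled state s = q*N + v; transition (q_t,v_t) -> (q_{t+1},v_t)
trans = [[0.0] * (N * N) for _ in range(N * N)]
for q in range(N):
    for v in range(N):
        for q2 in range(N):
            # v_t is carried along unchanged across this separator
            trans[q * N + v][q2 * N + v] = A_q[q][q2]

nonzero = sum(1 for row in trans for x in row if x > 0)
assert nonzero == N ** 3   # only N^3 of the N^4 entries are non-zero
assert all(abs(sum(row) - 1.0) < 1e-12 for row in trans)
```

The other half-step, (q_{t+1},v_t) → (q_{t+1},v_{t+1}), has the same structure with the roles of the two chains exchanged.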
A Note on Parameter Tying
• The transition probability table specifies p(separator|separator), e.g., p(q_{t+1},v_t | q_t,v_t)
• The only non-zero entries are those specifying variables that differ between separators, e.g., p(q_{t+1} | q_t,v_t)
• With “parameter tying,” we can constrain different elements of the transition matrix to equal one another, thus forcing the model to match the condition p(q_{t+1}|q_t,v_t) = p(q_{t+1}|q_t).
– If the two chains are known to be truly independent, e.g., speech and background music, parameter tying may help to avoid over-training the model.
– If the two chains are possibly dependent, allow the full transition matrix p(q_{t+1}|q_t,v_t): the result is a Boltzmann zipper.
“Compiling” an FHMM into an HMM
• In order to handle non-emitting states, we need a total of 2N^2 junction states: N^2 emitting, N^2 non-emitting
• The finite-state diagram looks like this (NOT a DBN; this is here to help you design the HTK configuration, if desired):
(Figure: emitting states (q_t,v_t) ∈ {(1,1),(1,2),(2,1),(2,2)} alternate with non-emitting states (q_{t+1},v_t) ∈ {(1,1),(1,2),(2,1),(2,2)}. Blue arrows: left-to-right transitions; red arrows: right-to-left; black arrows: both. No self-loops. Observation PDFs: p(x_t | q_t=1,v_t=1), p(x_t | q_t=1,v_t=2), p(x_t | q_t=2,v_t=1), p(x_t | q_t=2,v_t=2).)
Graphical Models for Large-Vocabulary Speech Recognition
“Zweig Triangles” (Zweig, 1998)
• w_t: word, 1 ≤ w_t ≤ N_w
– N_w = # words in vocabulary
– p(w_{t+1}=w_t | w_t, wdTr_t=0) = 1
– p(w_{t+1} | w_t, wdTr_t=1) = bigram word grammar
• i_t: segment index, 1 ≤ i_t ≤ N_i
– p(i_{t+1} | i_t, wdTr_t=0) > 0 iff i_t ≤ i_{t+1} ≤ i_t + 1
– p(i_{t+1}=1 | i_t, wdTr_t=1) = 1
• wdTr_t: is there a word transition?
– p(wdTr_t=0 | i_t < N_i) = 1
– p(wdTr_t=1 | i_t = N_i) = probability the word ends
• q_t: segment label; for example, q_t could equal “/aa/ state 3”
– p(q_t|i_t,w_t) = probability that the i_t-th phonetic segment of w_t is q_t
– Often deterministic: p(q_t|i_t,w_t) = 1 iff q_t is the i_t-th phone of w_t
• x_t: observation
– p(x_t|q_t) usually mixture Gaussian
(DBN: per frame, variables w_t, i_t, q_t, wdTr_t, and observation x_t, connected to w_{t+1}, i_{t+1}, q_{t+1}, wdTr_{t+1})
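The deterministic and near-deterministic CPTs above can be exercised with a toy simulation (hypothetical two-word vocabulary and made-up probabilities; indices are 0-based here rather than the slides' 1-based):

```python
import random

random.seed(4)

vocab = {0: ["h", "i"], 1: ["b", "y", "e"]}   # hypothetical word -> phone-state list
bigram = [[0.5, 0.5], [0.5, 0.5]]             # p(w_{t+1}|w_t, wdTr_t=1)
p_end = 0.4                                   # p(wdTr_t=1 | i_t = last segment)

w, i = 0, 0
frames = []
for _ in range(20):
    q = vocab[w][i]                           # q_t deterministic given (w_t, i_t)
    frames.append(q)
    last = (i == len(vocab[w]) - 1)           # wdTr_t can fire only at the last segment
    wd_tr = last and (random.random() < p_end)
    if wd_tr:
        w = random.choices([0, 1], weights=bigram[w])[0]   # bigram word grammar
        i = 0                                              # segment index resets
    elif not last:
        i += random.choice([0, 1])                         # i_t <= i_{t+1} <= i_t + 1

assert len(frames) == 20
assert set(frames) <= {"h", "i", "b", "y", "e"}
```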
(Figure: tract variables LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING)
Example: Pronunciation Variability
• Pronunciation variability (e.g., apparent deletions or substitutions of phonemes) can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent tract variables such as the LIP-OPENING and TONGUE-TIP-OPENING (Browman & Goldstein, 1990):
A DBN Model of Articulatory Phonology for Speech Recognition
(Livescu and Glass, 2004)
• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i,j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i
Summary
• Multiple parents violate conditional independence of descendants and non-descendants → sum-product fails
• A fast solution: parent merger
• A more computationally efficient solution:
– Moralize
– Triangulate
– Create a junction tree
• The sum-product algorithm in a junction tree has complexity O{N^C}, where C is the number of nodes in the largest clique
• Example: Factorial HMM
– Applications: speech with background noise, audiovisual speech
– Complexity: O{N^4} with parent merger, O{N^3} with triangulation
• Example: large-vocabulary speech recognition
– Zweig triangles: word grammar and phone model in one graph
– Livescu model: a DBN for pronunciation variability