Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, [email protected]
University of Illinois at Urbana-Champaign, USA
Lecture 8. Inference in Non-Tree Graphs
• The multiple-parent problem
• Solution #1: Parent merger
• Solution #2: Moralize, triangulate, and create a junction tree
• Inference in any DBN: the sum-product algorithm in a junction tree
• Example: Factorial HMM
• Example: Zweig-triangle LVCSR
• Example: Articulatory phonology
Example Problem: Find p(d|a)
(Graph: a → b, a → c; b → d, c → d)
• The correct answer is: p(d|a) = Σ_b Σ_c p(b|a) p(c|a) p(d|b,c)
• Try the sum-product algorithm. Propagate up, starting with node b: p(D_b|b) = p(d|b) = ???
• p(d|b) is no longer one of the parameters of the model; now it must be calculated from p(d|b,c).
• In fact, the calculation requires us to sum over every variable in the model: p(d|b) = Σ_c Σ_a p(a) p(c|a) p(d|b,c)
• High Computational Cost!
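The global sum above can be checked numerically. A minimal sketch with hypothetical binary CPTs (all probability-table values below are made-up toy numbers, not from the slides):

```python
import itertools
import random

random.seed(1)

def coin():
    # a random distribution over a binary variable
    p = random.random()
    return [p, 1.0 - p]

# hypothetical binary CPTs for the graph a -> b, a -> c, (b,c) -> d
p_b = [coin() for _ in range(2)]                      # p_b[a][b] = p(b|a)
p_c = [coin() for _ in range(2)]                      # p_c[a][c] = p(c|a)
p_d = [[coin() for _ in range(2)] for _ in range(2)]  # p_d[b][c][d] = p(d|b,c)

def p_d_given_a(d, a):
    # p(d|a) = sum_b sum_c p(b|a) p(c|a) p(d|b,c)
    return sum(p_b[a][b] * p_c[a][c] * p_d[b][c][d]
               for b, c in itertools.product(range(2), repeat=2))

# sanity check: p(d=0|a) + p(d=1|a) = 1 for each a
for a in range(2):
    assert abs(p_d_given_a(0, a) + p_d_given_a(1, a) - 1.0) < 1e-12
```

The point of the slide survives in the code: evaluating p(d|·) forces a sum over variables that are not local to node d.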
Conditional Independence of Descendants and Non-descendants
• The Sum-Product algorithm can use computations that are local at each node, v, because of the following theorem:
• Theorem: a Bayesian network is a tree if and only if, for every variable v, the descendants D_v and non-descendants N_v of v are conditionally independent given v: p(D_v, N_v | v) = p(D_v | v) p(N_v | v)
Example: Descendants and Non-descendants in a Tree
• p(D_c|c) = p(d|c) p(e|c) p(f|c)
• p(N_c, c) = p(a) p(b|a) p(c|a)
(Tree: a → b, a → c; c → d, c → e, c → f)
Example: Descendants and Non-descendants in a Non-Tree
• p(D_c, c) = Σ_{a,b} p(a) p(b|a) p(c|a) p(d|b,c) p(e|c) p(f|c)
• p(N_c, c) = Σ_d p(a) p(b|a) p(c|a) p(d|b,c)
• So is it necessary for EVERY computation to be global?
(Non-tree: a → b, a → c; b → d, c → d; c → e, c → f)
Local Computations in a Non-Tree
• Here are some computations that can be local:
– d depends only on the combination (b,c)
– (b,c) depend only on a
– e depends only on c or, equivalently, e depends on (b,c)
The “Parent Merger” Algorithm
• Combine b, c into a “super-node” bc
– Number of possible values = (# of b values) × (# of c values)
– p(bc | a) = p(b | a) p(c | a)
– p(d | bc) = p(d | b,c)
– p(e | bc) = p(e | c)
• Result is a tree
(Before: a → b, a → c; b → d, c → d; c → e, c → f. After: a → bc; bc → d, bc → e, bc → f)
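The super-node construction can be sketched numerically. Toy CPT values and the variable sizes N_b = N_c = 2 below are assumptions for illustration:

```python
import itertools
import random

random.seed(2)
Nb = Nc = 2    # assumed variable sizes

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p_b = [dist(Nb) for _ in range(2)]                       # p(b|a)
p_c = [dist(Nc) for _ in range(2)]                       # p(c|a)
p_d = [[dist(2) for _ in range(Nc)] for _ in range(Nb)]  # p(d|b,c)

# super-node bc: index bc = b*Nc + c, with Nb*Nc possible values
p_bc = [[p_b[a][b] * p_c[a][c]
         for b, c in itertools.product(range(Nb), range(Nc))]
        for a in range(2)]                                  # p(bc|a)
p_d_bc = [p_d[bc // Nc][bc % Nc] for bc in range(Nb * Nc)]  # p(d|bc)

# inference in the resulting tree: p(d|a) = sum_bc p(bc|a) p(d|bc)
p_d_a = [[sum(p_bc[a][bc] * p_d_bc[bc][d] for bc in range(Nb * Nc))
          for d in range(2)] for a in range(2)]

# agrees with the direct double sum over (b, c)
direct = sum(p_b[0][b] * p_c[0][c] * p_d[b][c][1]
             for b in range(Nb) for c in range(Nc))
assert abs(p_d_a[0][1] - direct) < 1e-12
```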
Sum-Product Algorithm with Super-Nodes
• Propagate up:
– p(D_bc | bc) = p(d|bc) p(e|bc) Σ_f p(f|bc)
– p(D_a | a) = Σ_bc p(D_bc | bc) p(bc | a)
• Propagate down:
– p(bc, N_bc) = Σ_a p(a) p(bc | a)
– p(f, N_f) = Σ_bc p(bc, N_bc) p(f | bc) p(e | bc) p(d | bc)
• Multiply:
– p(bc, d, e) = p(bc, N_bc) p(D_bc | bc)
– p(c, d, e) = Σ_b p(bc, d, e)
The “Parent Merger” Algorithm
• Algorithm #1 for turning a non-tree into a tree:
– If any node has multiple parents, merge them
– If any resulting supernode has multiple parents, merge them
– Repeat until no node has multiple parents
• Why this algorithm is sometimes undesirable:
– In an upward-branching graph, this results in a supernode with N_g N_h N_i N_j = many possible values
– Many values → lots of computation
(Figure: an upward-branching graph over a–p whose repeated parent mergers produce supernodes {bc}, {def}, {ghij}, {klm}, {no})
Algorithm #2: Junction Trees
• Moralize
• Triangulate
• Read off the cliques into a junction tree
• Add variables to cliques, as necessary, to ensure locality of influence
Moralization
• “Moralization” is the process of connecting the parents of every node.
• Goal: to show that values of the parents cannot really be independently computed.
• Once the graph has been moralized, we usually show it as an undirected graph; the dependency structure will still be necessary for inference, but not for finding the best junction tree.
(Figure: a directed graph over a, b, c, d, e, f, n, o, p, and its moralized undirected version)
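Moralization itself is mechanical. A minimal Python sketch (the adjacency-dict representation and the example edges are assumptions for illustration):

```python
import itertools

def moralize(parents):
    """parents: dict mapping node -> list of parent nodes (a DAG).
    Returns the undirected edge set of the moral graph."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))        # keep the original edges
        for p1, p2 in itertools.combinations(ps, 2):
            edges.add(frozenset((p1, p2)))          # "marry" the parents

    return edges

# the running example: d has parents b and c, so moralization adds edge b-c
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["c"], "f": ["c"]}
moral = moralize(dag)
assert frozenset(("b", "c")) in moral   # the married parents of d
assert frozenset(("a", "b")) in moral   # original edges survive
```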
Triangulation
• A “triangular” or “chordal” graph is a graph in which every cycle of length greater than three has a chord (an edge connecting two non-adjacent nodes of the cycle).
• “Triangulation” is the process of adding edges to a graph in order to make it triangular.
(Figure: a graph over a, b, c, d, e, f, g, h, i, j shown as the original DAG, after moralization, and after triangulation)
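The slides triangulate by hand; a common automatic heuristic (not covered in the slides) is greedy min-fill elimination, sketched here:

```python
def min_fill_triangulate(adj):
    """Greedy min-fill heuristic: repeatedly eliminate the vertex whose
    remaining neighbors need the fewest new edges; the added 'fill' edges
    make the graph chordal. adj: dict node -> set of neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}
    fill = set()
    remaining = set(adj)
    while remaining:
        def cost(v):
            nbrs = [u for u in adj[v] if u in remaining]
            return sum(1 for i, a in enumerate(nbrs)
                       for b in nbrs[i + 1:] if b not in adj[a])
        v = min(remaining, key=cost)
        nbrs = [u for u in adj[v] if u in remaining]
        for i, a in enumerate(nbrs):       # connect v's remaining neighbors
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.add(frozenset((a, b)))
        remaining.remove(v)
    return fill

# a 4-cycle a-b-c-d-a needs exactly one chord
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
chords = min_fill_triangulate(cycle4)
assert len(chords) == 1
```

As the slides note later, no greedy heuristic guarantees the minimum maximum clique; min-fill is just a widely used approximation.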
Digression: Why are Moralization and Triangulation Allowed?
• An edge connecting a→b means that, during inference, we must use a probability table of size N_a N_b: p(b|a)
• One special case of the p(b|a) probability table is the case in which every row is the same, i.e., p(b|a) = p(b)
• Therefore the graph without the extra edge is a special case of the graph with it.
• Put another way, information about the special features of a problem is coded by absent edges, not present edges.
• Adding edges is equivalent to forcing yourself to solve a harder, more general problem, rather than a simple specific one.
(Figure: the graph a → c ← b, and the same graph with an added edge between a and b)
Cliques
• A “clique” is a group of nodes, all of which are connected together.
• The “separator” of two cliques is the set of nodes that are members of both cliques. For example, clique efg and clique def have {e,f} as their separator.
Forming a Junction Tree
• It is always possible to create a junction tree from a triangular graph using the following algorithm:
– Start with any clique as the root node
– Next comes the clique whose separator with the root node is largest
– Locality of influence: if cliques A and B both contain node c, then node c must also be added to every clique in the junction tree between A and B
(Figure: the triangulated graph over a–j, and its junction tree with cliques a,b,d — c,d,e — d,e,f — e,f,g — f,g,h — g,h,j)
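Locality of influence is the running-intersection property, and it can be checked mechanically. A minimal sketch over the clique chain from this slide:

```python
# clique chain read off the slide's junction tree
chain = [("a", "b", "d"), ("c", "d", "e"), ("d", "e", "f"),
         ("e", "f", "g"), ("f", "g", "h"), ("g", "h", "j")]

def running_intersection(chain):
    """True iff any node shared by two cliques also appears in every
    clique between them (locality of influence)."""
    for i in range(len(chain)):
        for j in range(i + 2, len(chain)):
            shared = set(chain[i]) & set(chain[j])
            for k in range(i + 1, j):
                if not shared <= set(chain[k]):
                    return False
    return True

assert running_intersection(chain)
```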
Triangulation: A Hard Example
• In this example, the graph on the left is not yet fully triangulated (for example, the cycle d–b–c–f–o–n has no chord). Here is one possible triangulation algorithm: create a junction tree, then add variables to the cliques as necessary to maintain locality of influence.
(Junction tree so far: a,b,c — b,c,e? — b,d,e? — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p, over the graph a, b, c, d, e, f, n, o, p)
Example: Triangulation to Maintain Locality of Influence
• Every node that’s in both the second clique and the fourth clique must also exist in the third clique
Example: Triangulation to Maintain Locality of Influence
• Add node c to the 3rd clique, because c is in both the 2nd and 4th cliques. Putting c into this clique is equivalent to drawing an edge between c and d, so that b, c, d, e are all interconnected.
(Junction tree now: a,b,c — b,c,e? — b,c,d,e — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• … and then delete clique (b,c,e), because it’s now redundant with clique (b,c,d,e).
(Junction tree now: a,b,c — b,c,d,e — c,e,f? — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• Similar reasoning: d is in the 4th clique, so it had better be added to the 3rd. This is equivalent to drawing an edge between d and f.
(Junction tree now: a,b,c — b,c,d,e — c,d,e,f — d,e,n? — e,f,o? — e,n,o? — n,o,p)
Example: Triangulation to Maintain Locality of Influence
• … and f is in the 5th clique, so it had better be added to the 4th. This is equivalent to drawing an edge between f and n.
(Junction tree now: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,o? — e,n,o? — n,o,p)
Finishing Up
(Final junction tree: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,n,o — n,o,p)
Inference in a Junction Tree
• The “clique” representation means that all inference is contained within one clique. For each clique:
– Input: a probability table for values of the lower separator. Size of this table = N^I, where N is the number of possible values of each variable and I is the number of nodes in the input separator.
– Product: multiply by information about other nodes in the clique. The resulting table is of size N^C, where C is the number of nodes in the clique.
– Sum: for each possible setting of variables in the output separator (N^O possible settings), marginalize out the values of all other variables (a sum with N^{C−O} terms). Total complexity: O{N^C}
– Pass the resulting table to the next clique.
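The product-then-sum step for one clique can be written generically. This sketch is not from the slides; the dict-of-tuples table representation and the uniform toy factor are assumptions:

```python
import itertools

N = 3   # assumed number of values per variable

def pass_message(msg_in, clique_factor, in_vars, clique_vars, out_vars):
    """One clique update: multiply the incoming separator table into the
    clique factor (a table of size N^C), then marginalize down to the
    output separator (a table of size N^O)."""
    out = {}
    for assign in itertools.product(range(N), repeat=len(clique_vars)):
        a = dict(zip(clique_vars, assign))
        val = clique_factor[assign] * msg_in[tuple(a[v] for v in in_vars)]
        key = tuple(a[v] for v in out_vars)
        out[key] = out.get(key, 0.0) + val
    return out

# toy clique over (e,f,n): uniform p(n|e,f), uniform incoming message over (e,f)
factor = {a: 1.0 / N for a in itertools.product(range(N), repeat=3)}
msg = {a: 1.0 / N ** 2 for a in itertools.product(range(N), repeat=2)}
out = pass_message(msg, factor, ("e", "f"), ("e", "f", "n"), ("n",))
assert abs(sum(out.values()) - 1.0) < 1e-12
```

The loop visits all N^C clique assignments, which is exactly the O{N^C} cost quoted on the slide.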
Inference Example
• Suppose b and o are observed, and we want to find p(♦,b,o) for all variables ♦.
• Clique noq:
– Output separator variables are n,o
– Product: p(observation, q | n,o) = p(q | n,o)
– Sum: p(observation | n,o) = Σ_q p(q | n,o) = 1
• We could have skipped this step by observing that p(o | o) is always 1
(Junction tree: a,b,c — b,c,d,e — c,d,e,f — d,e,f,n — e,f,n,o — n,o,q, over the graph a, b, c, d, e, f, n, o, q, with b and o observed)
Inference Example
• Clique efno:
– Input: p(observation | n,o)
– Product: nodes not in the output separator are moved to the left:
• p(observation, o | e,f,n) = p(observation | n,o) p(o | e,f,n)
– Sum: over unobserved elements not in the output separator:
• p(observation | e,f,n) = p(observation, o | e,f,n)
– Output: probabilities for every setting of the output separator:
• p(observation | e,f,n)
Inference Example
• Clique defn:
– Input: p(observation | e,f,n)
– Product: nodes not in the output separator are moved to the left:
• p(observation, n | d,e,f) = p(observation | e,f,n) p(n | d,e,f)
– Sum: over unobserved elements not in the output separator:
• p(observation | d,e,f) = Σ_n p(observation, n | d,e,f)
– Output:
• p(observation | d,e,f)
Inference Example: “Propagate Down”
• Clique abc:
– Product: p(a,b,c) = p(b,c|a) p(a)
– Sum: over every unobserved variable that’s not in the output separator:
• p(b,c) = Σ_a p(a,b,c)
Inference Example: “Propagate Down”
• Clique bcde:
– Product: p(b,c,d,e) = p(d,e|b,c) p(b,c)
– Sum: over every unobserved variable that’s not in the output separator:
• p(observations above, c,d,e) = p(b,c,d,e)
– Output: a probability table of size N^3:
• p(observations above, c,d,e)
Inference Example: “Propagate Down”
• Clique cdef:
– Input: p(observations above, c,d,e)
– Product: p(observations above, c,d,e,f) = p(f|c,d,e) p(observations above, c,d,e)
– Sum: over every unobserved variable that’s not in the output separator:
• p(observations above, d,e,f) = Σ_c p(observations above, c,d,e,f)
– Output: a probability table of size N^3:
• p(observations above, d,e,f)
… and so on…
Computational Complexity
• Complexity of inference is O{N^C}, where N is the number of values each node takes and C is the number of nodes in the largest clique.
– More precisely, complexity is O{max over cliques of Π_i N_i}, where the ith variable in the clique takes values in the range 1 ≤ v_i ≤ N_i, and the max is taken over all cliques.
• Therefore, a triangulation algorithm should minimize the maximum clique.
• Unfortunately, automatic minimum-maximum-clique triangulation is NP-hard. Good approximate algorithms exist, but…
• Humans are better at this than machines: design your graph with small cliques.
Factorial HMM (FHMM)
• Factorial HMM:
– q_t and v_t represent two different types of background information, each with its own history
– Observations x_t depend on both hidden processes
• Model parameters:
– p(v_{t+1}|v_t), p(q_{t+1}|q_t), p(x_t|q_t,v_t)
• Computational complexity of the sum-product algorithm:
– O{N^4 T} using “parent-merger” triangulation
– O{N^3 T} using a better triangulation (five slides from now)
(DBN: two parallel Markov chains q_1 … q_T and v_1 … v_T, with each observation x_t depending on both q_t and v_t)
Example: Speech in Music (Deoras and Hasegawa-Johnson, ICSLP 2004)
• q_t = one person speaking
– Speech log spectrum given by p_s(y_t(e^{jω})|q_t) = mixture Gaussian
• v_t = music playing in the background
– Music log spectrum given by p_m(z_t(e^{jω})|v_t) = mixture Gaussian
• Observed log spectrum = max(speech, music)
– x_t(e^{jω}) ≈ max(y_t(e^{jω}), z_t(e^{jω}))   (x_t(e^{jω}) ≥ max(y_t(e^{jω}), z_t(e^{jω})) ≥ x_t(e^{jω}) − 6 dB)
– p(x_t|q_t,v_t) = p_s(x_t|q_t) ∫^{x_t} p_m(z|v_t) dz + p_m(x_t|v_t) ∫^{x_t} p_s(y|q_t) dy
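With single Gaussians standing in for the mixtures (a simplification for illustration; the slides use mixture Gaussians per state), the max-model likelihood for one spectral bin can be written directly:

```python
import math

def gauss_pdf(x, mu, sig):
    return math.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu, sig):
    return 0.5 * (1.0 + math.erf((x - mu) / (sig * math.sqrt(2))))

def p_max(x, mu_s, sig_s, mu_m, sig_m):
    # density of max(speech, music) for independent Gaussians:
    # p(x|q,v) = p_s(x|q) P_m(z <= x|v) + p_m(x|v) P_s(y <= x|q)
    return (gauss_pdf(x, mu_s, sig_s) * gauss_cdf(x, mu_m, sig_m)
            + gauss_pdf(x, mu_m, sig_m) * gauss_cdf(x, mu_s, sig_s))

# sanity check: the max-model density integrates to (approximately) 1
total = sum(p_max(-10 + k * 0.01, 0.0, 1.0, 0.5, 1.0) * 0.01 for k in range(2000))
assert abs(total - 1.0) < 0.01
```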
AVSR: The Boltzmann Zipper (Hennecke, Stork, and Prasad, 1996)
• Same as the AVSR model from last time, except that now v_t has memory, independent of q_t. Model parameters:
– p(q_{t+1}|q_t), p(x_t|q_t)
– p(v_{t+1}|v_t,q_{t+1}), p(y_t|v_t)
• Sum-product algorithm: O{N^3 T}, just like the FHMM
• The extra observations add complexity of only O{T}
(DBN: audio phoneme states q_t with audio spectral observations x_t; viseme states v_t with video observations y_t)
AVSR: The Coupled HMM (Chu and Huang, 2000)
• Advantage over the Boltzmann zipper: more flexible, because neither vision nor sound is “privileged” over the other.
– p(q_{t+1}|v_t,q_t), p(x_t|q_t)
– p(v_{t+1}|v_t,q_t), p(y_t|v_t)
• Disadvantage: it can’t be triangulated like the FHMM, so complexity is O{N^4 T} rather than O{N^3 T}
Inference using Parent Merger
• N_t = observed non-descendants of (q_t,v_t) = {x_1,…,x_{t−1}}
• D_t = observed descendants of (q_t,v_t) = {x_t,…,x_T}
• Forward algorithm:
– p(N_{t+1},q_{t+1},v_{t+1}) = Σ_{q_t,v_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(v_{t+1} | v_t) p(N_t,q_t,v_t)
• Backward algorithm:
– p(D_t | q_t,v_t) = p(x_t | q_t,v_t) Σ_{q_{t+1},v_{t+1}} p(q_{t+1} | q_t) p(v_{t+1} | v_t) p(D_{t+1} | q_{t+1},v_{t+1})
• Complexity:
– (T frames) × (N^2 sums/frame) × (N^2 terms/sum) = O{N^4 T}
A Smarter Triangulation
• Forward algorithm, step 1:
– p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(q_t,v_t,N_t)
A Smarter Triangulation
• Forward algorithm, step 1:
– p(q_{t+1},v_t,N_{t+1}) = Σ_{q_t} p(x_t | q_t,v_t) p(q_{t+1} | q_t) p(q_t,v_t,N_t)
• Forward algorithm, step 2:
– p(q_{t+1},v_{t+1},N_{t+1}) = Σ_{v_t} p(v_{t+1}|v_t) p(q_{t+1},v_t,N_{t+1})
• Computational complexity:
– (T frames) × (2N^2 sums/frame) × (N terms/sum) = O{N^3 T}
– Complexity is N times higher than that of a one-stream HMM
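The two forward passes can be compared numerically. A toy sketch with random model parameters (N = 3 states per chain and T = 6 frames are arbitrary choices) confirms that the O{N^3 T} two-step factorization gives the same result as the O{N^4 T} parent-merger recursion:

```python
import random

random.seed(0)
N, T = 3, 6

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# toy FHMM parameters (made-up values)
A_q = [dist(N) for _ in range(N)]   # p(q_{t+1}|q_t)
A_v = [dist(N) for _ in range(N)]   # p(v_{t+1}|v_t)
B = [[[random.random() for _ in range(T)]
      for _ in range(N)] for _ in range(N)]   # B[q][v][t] = p(x_t|q,v)
pi_q, pi_v = dist(N), dist(N)

# parent-merger forward pass: N^2 sums of N^2 terms per frame -> O(N^4 T)
alpha = [[pi_q[q] * pi_v[v] for v in range(N)] for q in range(N)]
for t in range(T):
    alpha = [[sum(B[q][v][t] * A_q[q][q2] * A_v[v][v2] * alpha[q][v]
                  for q in range(N) for v in range(N))
              for v2 in range(N)] for q2 in range(N)]
merged = sum(map(sum, alpha))

# smarter triangulation: two N^2-entry tables of N-term sums -> O(N^3 T)
alpha = [[pi_q[q] * pi_v[v] for v in range(N)] for q in range(N)]
for t in range(T):
    # step 1: sum over q_t, giving a table over (q_{t+1}, v_t)
    half = [[sum(B[q][v][t] * A_q[q][q2] * alpha[q][v] for q in range(N))
             for v in range(N)] for q2 in range(N)]
    # step 2: sum over v_t, giving a table over (q_{t+1}, v_{t+1})
    alpha = [[sum(A_v[v][v2] * half[q2][v] for v in range(N))
              for v2 in range(N)] for q2 in range(N)]
smart = sum(map(sum, alpha))

assert abs(merged - smart) < 1e-12
```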
“Compiling” an FHMM into an HMM
• Purpose: GMTK (Bilmes and Zweig, ICASSP 2002) can implement the FHMM directly, but by compiling the FHMM into an HMM, we can also use HTK (Young, Evermann, Hain et al., 2002) and other software tools
• Method:
– Each state specifies the variables in the output separator of one clique, e.g., (q_{t+1},v_t) is the separator between cliques (q_t,q_{t+1},v_t) and (q_{t+1},v_t,v_{t+1}).
– The transition probability matrix p(q_{t+1},v_t|q_t,v_t) is N^2 × N^2, but only N^3 entries can be non-zero, so complexity is O{N^3}
(Compiled HMM: state sequence … (q_t,v_t) → (q_{t+1},v_t) → (q_{t+1},v_{t+1}) …, with observations x_t)
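The sparsity claim can be checked directly. A sketch building the half-step transition table from separator (q_t,v_t) to separator (q_{t+1},v_t), with made-up toy probabilities:

```python
import random

random.seed(3)
N = 3

def dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

A_q = [dist(N) for _ in range(N)]   # p(q_{t+1}|q_t)

# compiled state s = q*N + v; transition (q_t,v_t) -> (q_{t+1},v_t)
trans = [[0.0] * (N * N) for _ in range(N * N)]
for q in range(N):
    for v in range(N):
        for q2 in range(N):
            # v_t is carried along unchanged across this separator
            trans[q * N + v][q2 * N + v] = A_q[q][q2]

nonzero = sum(1 for row in trans for x in row if x > 0)
assert nonzero == N ** 3   # only N^3 of the N^4 entries are non-zero
assert all(abs(sum(row) - 1.0) < 1e-12 for row in trans)
```

The other half-step, (q_{t+1},v_t) → (q_{t+1},v_{t+1}), has the same structure with the roles of the two chains exchanged.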
A Note on Parameter Tying
• The transition probability table specifies p(separator|separator), e.g., p(q_{t+1},v_t | q_t,v_t)
• The only non-zero entries are those specifying variables that differ between separators, e.g., p(q_{t+1} | q_t,v_t)
• With “parameter tying,” we can constrain different elements of the transition matrix to equal one another, thus forcing the model to match the condition p(q_{t+1}|q_t,v_t) = p(q_{t+1}|q_t).
– If the two chains are known to be truly independent, e.g., speech and background music, parameter tying may help to avoid over-training the model.
– If the two chains are possibly dependent, allow the full transition matrix p(q_{t+1}|q_t,v_t): the result is a Boltzmann zipper.
“Compiling” an FHMM into an HMM
• In order to handle non-emitting states, we need a total of 2N^2 junction states: N^2 emitting, N^2 non-emitting
• The finite-state diagram looks like this (NOT a DBN; this is here to help you design the HTK configuration, if desired):
(Figure: emitting states (q_t,v_t) ∈ {(1,1),(1,2),(2,1),(2,2)} alternate with non-emitting states (q_{t+1},v_t) ∈ {(1,1),(1,2),(2,1),(2,2)}. Blue arrows: left-to-right transitions; red arrows: right-to-left; black arrows: both. No self-loops. Observation PDFs: p(x_t | q_t=1,v_t=1), p(x_t | q_t=1,v_t=2), p(x_t | q_t=2,v_t=1), p(x_t | q_t=2,v_t=2).)
Graphical Models for Large-Vocabulary Speech Recognition
“Zweig Triangles” (Zweig, 1998)
• w_t: word, 1 ≤ w_t ≤ N_w
– N_w = # words in vocabulary
– p(w_{t+1}=w_t | w_t, wdTr_t=0) = 1
– p(w_{t+1} | w_t, wdTr_t=1) = bigram word grammar
• i_t: segment index, 1 ≤ i_t ≤ N_i
– p(i_{t+1} | i_t, wdTr_t=0) > 0 iff i_t ≤ i_{t+1} ≤ i_t + 1
– p(i_{t+1}=1 | i_t, wdTr_t=1) = 1
• wdTr_t: is there a word transition?
– p(wdTr_t=0 | i_t < N_i) = 1
– p(wdTr_t=1 | i_t = N_i) = probability the word ends
• q_t: segment label; for example, q_t could equal “/aa/ state 3”
– p(q_t|i_t,w_t) = probability that the i_t-th phonetic segment of w_t is q_t
– Often deterministic: p(q_t|i_t,w_t) = 1 iff q_t is the i_t-th phone of w_t
• x_t: observation
– p(x_t|q_t) usually mixture Gaussian
(DBN: per frame, variables w_t, i_t, q_t, wdTr_t, and observation x_t, connected to w_{t+1}, i_{t+1}, q_{t+1}, wdTr_{t+1})
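The deterministic and near-deterministic CPTs above can be exercised with a toy simulation (hypothetical two-word vocabulary and made-up probabilities; indices are 0-based here rather than the slides' 1-based):

```python
import random

random.seed(4)

vocab = {0: ["h", "i"], 1: ["b", "y", "e"]}   # hypothetical word -> phone-state list
bigram = [[0.5, 0.5], [0.5, 0.5]]             # p(w_{t+1}|w_t, wdTr_t=1)
p_end = 0.4                                   # p(wdTr_t=1 | i_t = last segment)

w, i = 0, 0
frames = []
for _ in range(20):
    q = vocab[w][i]                           # q_t deterministic given (w_t, i_t)
    frames.append(q)
    last = (i == len(vocab[w]) - 1)           # wdTr_t can fire only at the last segment
    wd_tr = last and (random.random() < p_end)
    if wd_tr:
        w = random.choices([0, 1], weights=bigram[w])[0]   # bigram word grammar
        i = 0                                              # segment index resets
    elif not last:
        i += random.choice([0, 1])                         # i_t <= i_{t+1} <= i_t + 1

assert len(frames) == 20
assert set(frames) <= {"h", "i", "b", "y", "e"}
```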
(Figure: tract variables LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING)
Example: Pronunciation Variability
• Pronunciation variability (e.g., apparent deletions or substitutions of phonemes) can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent tract variables such as the LIP-OPENING and TONGUE-TIP-OPENING (Browman & Goldstein, 1990):
A DBN Model of Articulatory Phonology for Speech Recognition
(Livescu and Glass, 2004)
• word_t: word ID at frame #t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i,j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator #i
• S_t^i: surface setting of articulator #i
Summary
• Multiple parents violate conditional independence of descendants and non-descendants → sum-product fails
• A fast solution: parent merger
• A more computationally efficient solution:
– Moralize
– Triangulate
– Create a junction tree
• The sum-product algorithm in a junction tree has complexity O{N^C}, where C is the number of nodes in the largest clique
• Example: Factorial HMM
– Applications: speech with background noise, audiovisual speech
– Complexity: O{N^4} with parent merger, O{N^3} with triangulation
• Example: large-vocabulary speech recognition
– Zweig triangles: word grammar and phone model in one graph
– Livescu model: a DBN for pronunciation variability