Transcript of Lecture 9: Speech Recognition
(dpwe/e6820/lectures/E6820-L09-asr.pdf, 2006-03-29, 42 pages)

Page 1:

E6820 SAPR - Dan Ellis L09 - Speech Recognition 2006-03-30 - 1

EE E6820: Speech & Audio Processing & Recognition

Lecture 9: Speech Recognition

1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
4. Hidden Markov Models

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/e6820/

Columbia University Dept. of Electrical Engineering
Spring 2006

Page 2:

Recognizing Speech

• What kind of information might we want from the speech signal?

- words
- phrasing, ‘speech acts’ (prosody)
- mood / emotion
- speaker identity

• What kind of processing do we need to get at that information?

- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features

[Figure: spectrogram (frequency 0-4000 Hz vs. time 0-3 s) of the utterance “So, I thought about that and I think it’s still possible”]

Page 3:

Speech Recognition as Transcription

• Transcription = “speech to text”

- find a word string to match the utterance

• Best suited to small vocabulary tasks

- voice dialing, command & control etc.

• Gives neat objective measure: word error rate (WER) %

- can be a sensitive measure of performance

• Three kinds of errors:

- WER = (S + D + I) / N

Reference:  THE CAT SAT ON THE MAT
Recognized:     CAT SAT AN THE MAT A
            (‘THE’ deleted; ‘ON’ → ‘AN’ substituted; ‘A’ inserted)
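The S, D, and I counts come from a minimum-edit-distance alignment of the recognized word string against the reference. A minimal sketch (not from the lecture) that computes the combined (S + D + I) / N score:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / N via Levenshtein distance on word lists."""
    R, H = len(ref), len(hyp)
    # d[i][j] = min edits aligning first i reference words to first j hypothesis words
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                      # i deletions
    for j in range(H + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H] / R

# The slide's example: 1 deletion + 1 substitution + 1 insertion over N = 6 words
print(wer("THE CAT SAT ON THE MAT".split(),
          "CAT SAT AN THE MAT A".split()))   # 0.5
```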

Page 4:

Problems: Within-speaker variability

• Timing variation:

- word duration varies enormously

- fast speech ‘reduces’ vowels

• Speaking style variation:

- careful/casual articulation
- soft/loud speech

• Contextual effects:

- speech sounds vary with context, role: “How do you do?”

[Figure: spectrogram (0-4000 Hz vs. 0-3 s) of “SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE”, annotated with word and phone labels (s, ow, ay, th, aa, dx, ax, b, aw, ...)]

Page 5:

Between-speaker variability

• Accent variation

- regional / mother tongue

• Voice quality variation

- gender, age, huskiness, nasality

• Individual characteristics

- mannerisms, speed, prosody

[Figure: spectrograms (0-8000 Hz vs. 0-2.5 s) of the same material from two speakers, ‘mbma0’ and ‘fjdm2’]

Page 6:

Environment variability

• Background noise

- fans, cars, doors, papers

• Reverberation

- ‘boxiness’ in recordings

• Microphone/channel

- huge effect on relative spectral gain

[Figure: spectrograms (0-4000 Hz vs. 0-1.4 s) of the same speech from a close mic and a tabletop mic]

Page 7:

How to recognize speech?

• Cross correlate templates?

- waveform?
- spectrogram?
- time-warp problems

• Match short-segments & handle time-warp later

- model with slices of ~10 ms
- pseudo-stationary model of words
- other sources of variation...

[Figure: spectrogram (0-4000 Hz vs. 0-0.45 s) of a word modeled as pseudo-stationary slices, labeled sil g w eh n sil]

Page 8:

Probabilistic formulation

• Probability that segment label is correct: p(q_i | X)
- gives standard form of speech recognizers:

• Feature calculation:
transforms signal into easily-classified domain, s[n] → X_m, m = n/H

• Acoustic classifier:
calculates probabilities p(q_i | X) of each mutually-exclusive state q_i

• ‘Finite state acceptor’ (i.e. HMM):
MAP match of allowable sequence to probabilities:

Q̂ = argmax over {q0, q1, ..., qL} of p(q0, q1, ..., qL | X0, X1, ..., XL)

[Diagram: feature frames X at time steps 0, 1, 2, ... aligned against a state sequence q0 = “ay”, q1, ...]

Page 9:

Standard speech recognizer structure

• Questions:

- what are the best features?
- how do we do the acoustic classification?
- how do we find/match the state sequence?

[Diagram: sound s[n] → Feature calculation → feature vectors X_m → Acoustic classifier (using network weights) → phone probabilities p(q_i | X_m) → HMM decoder (using word models & language model) → phone & word labeling]

Page 10:

Outline

1. Recognizing Speech
2. Feature Calculation
- Spectrogram, MFCCs & PLP
- Improving robustness
3. Sequence Recognition
4. Hidden Markov Models

Page 11:

Feature Calculation

• Goal: Find a representational space most suitable for classification

- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?

• Pattern Recognition: Representation is upper bound on performance

- maybe we should use the waveform...
- or, maybe the representation can do all the work

• Feature calculation is intimately bound to classifier

- pragmatic strengths and weaknesses

• Features develop by slow evolution
- current choices more historical than principled

Page 12:

Features (1): Spectrogram

• Plain STFT as features e.g.

• Consider examples:

• Similarities between corresponding segments
- but still large differences

X_m[k] = S[mH, k] = Σ_n s[n + mH] · w[n] · e^(−j2πkn/N)

[Figure: spectrograms (0-8000 Hz vs. 0-2.5 s) of male and female utterances, with one feature-vector slice highlighted]
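The framing and windowed DFT in the formula can be sketched directly with NumPy; the FFT size N and hop H below are illustrative choices, not the lecture's:

```python
import numpy as np

def stft_features(s, N=512, H=160):
    """X_m[k] = sum_n s[n + mH] w[n] e^{-j 2 pi k n / N}, one spectral slice per frame m."""
    w = np.hanning(N)                          # analysis window w[n]
    n_frames = 1 + (len(s) - N) // H
    frames = np.stack([s[m * H : m * H + N] * w for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=N, axis=1))   # magnitude spectrogram

rng = np.random.default_rng(0)
S = stft_features(rng.standard_normal(16000))  # ~1 s of audio at 16 kHz
print(S.shape)                                 # (97, 257): frames x frequency bins
```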

Page 13:

Features (2): Cepstrum

• Idea: Decorrelate, summarize spectral slices:

- good for Gaussian models
- greatly reduce feature dimension

X_m[l] = IDFT{ log |S[mH, k]| }

[Figure: male and female examples over 0-2.5 s: spectrum (0-8000 Hz) and the corresponding cepstrum (low-quefrency bins)]
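A sketch of the real cepstrum of one frame, following X_m[l] = IDFT{log|S[mH,k]|}; the truncation length is an illustrative choice:

```python
import numpy as np

def cepstrum(frame, n_ceps=13):
    """X_m[l] = IDFT{ log |S[mH, k]| }, truncated to the low-quefrency bins."""
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)   # floor avoids log(0)
    ceps = np.fft.irfft(log_spec)              # real cepstrum of this frame
    return ceps[:n_ceps]                       # low quefrencies = smooth envelope

rng = np.random.default_rng(1)
c = cepstrum(rng.standard_normal(512))
print(c.shape)                                 # (13,)
```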

Page 14:

Features (3): Frequency axis warp

• Linear frequency axis gives equal ‘space’ to 0-1 kHz and 3-4 kHz
- but perceptual importance very different

• Warp frequency axis closer to perceptual axis:
- mel, Bark, constant-Q ...

X[c] = Σ_{k = l_c}^{u_c} |S[k]|²

[Figure: male and female examples over 0-2.5 s: spectrum (0-8000 Hz) vs. the auditory-warped ‘audspec’ channels]

Page 15:

Features (4): Spectral smoothing

• Generalizing across different speakers is helped by smoothing (i.e. blurring) spectrum

• Truncated cepstrum is one way:
- MSE approximation to log |S[k]|

• LPC modeling is a little different:
- MSE approximation to |S[k]| → prefers detail at peaks

[Figure: male ‘audspec’ vs. PLP-smoothed version over 0-2.5 s, plus one spectral slice (level / dB vs. freq / chan, channels 0-18) comparing the two]

Page 16:

Features (5): Normalization along time

• Idea: feature variations, not absolute level

• Hence: calculate average level & subtract it:

• Factors out fixed channel frequency response:

Y[n, k] = X[n, k] − mean_n{ X[n, k] }

s[n] = h_c[n] * e[n]
log S[n, k] = log H_c[k] + log E[n, k]

[Figure: male and female PLP features over 0-2.5 s, before and after mean normalization]
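The subtraction above is a one-liner on a feature matrix (frames × dimensions); a minimal sketch:

```python
import numpy as np

def mean_norm(X):
    """Y[n, k] = X[n, k] - mean_n{ X[n, k] }: remove the per-dimension time average."""
    return X - X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 13)) + 5.0       # features with a fixed channel offset
Y = mean_norm(X)
print(bool(np.allclose(Y.mean(axis=0), 0.0)))  # True: the fixed offset is factored out
```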

Page 17:

Delta features

• Want each segment to have ‘static’ feature values
- but some segments are intrinsically dynamic!
→ calculate their derivatives - maybe steadier?

• Append dX/dt (+ d2X/dt2) to feature vectors

• Relates to onset sensitivity in humans?

[Figure: male example over 0-2.5 s: PLP features (mean & variance normalized), deltas, and double-deltas]
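Deltas are commonly computed as a linear-regression slope over a few surrounding frames; this is one standard recipe among several, and the ±2-frame window is an assumption:

```python
import numpy as np

def add_deltas(X, K=2):
    """Append dX/dt, estimated as a regression slope over +/-K frames (edges held)."""
    Xp = np.pad(X, ((K, K), (0, 0)), mode="edge")
    num = sum(k * (Xp[K + k : K + k + len(X)] - Xp[K - k : K - k + len(X)])
              for k in range(1, K + 1))
    d = num / (2 * sum(k * k for k in range(1, K + 1)))
    return np.hstack([X, d])                   # [static | delta] per frame

rng = np.random.default_rng(3)
F = add_deltas(rng.standard_normal((50, 13)))
print(F.shape)                                 # (50, 26)
```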

Page 18:

Overall feature calculation

• MFCCs and/or RASTA-PLP

• Key attributes:
- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels

[Diagram, MFCC path: Sound → FFT |X[k]| → Mel-scale freq. warp (audspec) → log|X[k]| → IFFT (cepstra) → Truncate → Subtract mean (CMN) → MFCC features]

[Diagram, Rasta-PLP path: Sound → FFT |X[k]| → Bark-scale freq. warp → log|X[k]| → Rasta band-pass (smoothed onsets) → LPC smooth (LPC spectra) → Cepstral recursion → Rasta-PLP cepstral features]
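The MFCC branch of the diagram can be strung together end to end. This is a minimal sketch with illustrative parameter choices (20 filters, 512-point FFT, 16 kHz, 13 cepstra), not the lecture's exact recipe:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt=20, n_fft=512, sr=16000):
    """Triangular filters spaced equally on the mel-warped axis."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(mag_frames, n_ceps=13, sr=16000):
    """Magnitudes -> mel warp -> log -> DCT-II (decorrelate) -> truncate -> CMN."""
    n_fft = 2 * (mag_frames.shape[1] - 1)
    logmel = np.log(mag_frames @ mel_filterbank(n_fft=n_fft, sr=sr).T + 1e-10)
    n = np.arange(logmel.shape[1])
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / len(n))
    ceps = logmel @ dct.T
    return ceps - ceps.mean(axis=0)            # cepstral mean normalization

rng = np.random.default_rng(4)
mags = np.abs(np.fft.rfft(rng.standard_normal((97, 512)), axis=1))
feats = mfcc(mags)
print(feats.shape)                             # (97, 13)
```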

Page 19:

Features summary

• Normalize same phones

• Contrast different phones

[Figure: male (0-1 s) and female (0-1.5 s) versions of the same words shown as spectrum (0-8000 Hz), audspec, rasta, and deltas]

Page 20:

Outline

1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
- Dynamic Time Warp
- Probabilistic Formulation
4. Hidden Markov Models

Page 21:

Sequence recognition:Dynamic Time Warp (DTW)

• Framewise comparison with stored templates:

- distance metric?
- comparison across templates?

[Figure: DTW grid of test frames (10-50) vs. reference frames (10-70) for the templates ONE, TWO, THREE, FOUR, FIVE]

Page 22:

Dynamic Time Warp (2)

• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy

• Best path via traceback from final state
- store predecessors for each (i,j)

D(i,j) = d(i,j) + min{ D(i−1,j) + T10,  D(i,j−1) + T01,  D(i−1,j−1) + T11 }

[Diagram: grid of input frames f_i vs. reference frames r_j; the lowest cost to (i,j) combines the local match cost d(i,j) with the best predecessor, including its transition cost T10, T01, or T11]
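The D(i,j) recursion with predecessor storage and traceback can be sketched directly; the uniform transition costs here are an illustrative choice:

```python
import numpy as np

def dtw(d, T10=1.0, T01=1.0, T11=1.0):
    """D(i,j) = d(i,j) + min{D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11},
    storing each cell's best predecessor for traceback."""
    I, J = d.shape
    D = np.zeros((I, J))
    pred = np.zeros((I, J, 2), dtype=int)
    D[0, 0] = d[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((D[i - 1, j] + T10, (i - 1, j)))
            if j > 0:
                cands.append((D[i, j - 1] + T01, (i, j - 1)))
            if i > 0 and j > 0:
                cands.append((D[i - 1, j - 1] + T11, (i - 1, j - 1)))
            cost, pred[i, j] = min(cands)
            D[i, j] = d[i, j] + cost
    path, cell = [], (I - 1, J - 1)       # traceback from the final state
    while cell != (0, 0):
        path.append(cell)
        cell = tuple(pred[cell])
    path.append((0, 0))
    return D[-1, -1], path[::-1]

# Toy distance matrix: test and reference frames nearly aligned
d = np.array([[0., 2., 3.],
              [2., 0., 2.],
              [3., 2., 0.]])
cost, path = dtw(d)
print(float(cost), [(int(i), int(j)) for i, j in path])   # 2.0 [(0, 0), (1, 1), (2, 2)]
```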

Page 23:

DTW-based recognition

• Reference templates for each possible word

• For isolated words:
- mark endpoints of input word
- calculate scores through each template (+ prune)
- continuous speech: link together word ends

• Successfully handles timing variation
- recognize speech at reasonable cost

[Figure: DTW paths through stacked reference templates ONE, TWO, THREE, FOUR against the input frames]

Page 24:

Statistical sequence recognition

• DTW limited because it’s hard to optimize
- interpretation of distance, transition costs?

• Need a theoretical foundation: Probability

• Formulate recognition as MAP choice among models:

M* = argmax_{M_j} p(M_j | X, Θ)

- X = observed features
- M_j = word-sequence models
- Θ = all current parameters

Page 25:

Statistical formulation (2)

• Can rearrange via Bayes’ rule (& drop p(X)):

M* = argmax_{M_j} p(M_j | X, Θ)
   = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)

- p(X | M_j) = likelihood of observations under model
- p(M_j) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters

• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?

Page 26:

State-based modeling

• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:

[Diagram: model M_j generates a state sequence Q_k: q1 q2 q3 q4 q5 q6 ..., which in turn generates the N observed feature vectors X_1: x1 x2 x3 x4 x5 x6 ... over time]

• Probability of observations given model is:

p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)

- sum over all possible state sequences Q_k

• How do observations depend on states? How do state sequences depend on model?

Page 27:

Outline

1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
4. Hidden Markov Models (HMM)
- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples

Page 28:

Markov models

• A (first order) Markov model is a finite-state system whose behavior depends only on the current state

• E.g. generative Markov model:

[State diagram: S → A; states A, B, C interconnected with self-loops .8, .8, .7 and cross-transitions .1; C → E with .1]

Transition matrix p(q_{n+1} | q_n):

           q_{n+1}:  S    A    B    C    E
  q_n = S:           0    1    0    0    0
  q_n = A:           0   .8   .1   .1    0
  q_n = B:           0   .1   .8   .1    0
  q_n = C:           0   .1   .1   .7   .1
  q_n = E:           0    0    0    0    1

Example generated state sequence:
S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
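The generative model above can be sampled directly; this sketch hard-codes the slide's transition matrix:

```python
import random

# The slide's generative model: states S, A, B, C, E (E is absorbing)
P = {"S": {"A": 1.0},
     "A": {"A": 0.8, "B": 0.1, "C": 0.1},
     "B": {"A": 0.1, "B": 0.8, "C": 0.1},
     "C": {"A": 0.1, "B": 0.1, "C": 0.7, "E": 0.1}}

def generate(seed=0):
    """Walk the chain from S, choosing each next state by p(q_{n+1} | q_n)."""
    rng = random.Random(seed)
    q, seq = "S", ["S"]
    while q != "E":
        nxt = list(P[q])
        q = rng.choices(nxt, weights=[P[q][s] for s in nxt])[0]
        seq.append(q)
    return seq

print(" ".join(generate()))   # a run of A/B/C states ending in E, like the slide's sample
```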

Page 29:

Hidden Markov models

• = Markov model where state sequence Q = {qn} is not directly observable (= ‘hidden’)

• But, observations X do depend on Q:

- x_n is a random variable that depends on the current state: p(x | q)
- can still tell something about the state sequence...

[Figure: emission distributions p(x | q) for states q = A, B, C; the hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the resulting observation sequence x_n over ~30 time steps]

Page 30:

(Generative) Markov models (2)

• HMM is specified by:
- states q_i
- transition probabilities: p(q_n = j | q_{n−1} = i) ≡ a_ij
- emission distributions: p(x | q_i) ≡ b_i(x)
- (+ initial state probabilities: p(q_1 = i) ≡ π_i)

[Example: a left-to-right word model ‘k a t •’ with self-loop probabilities 0.9 and advance probabilities 0.1, plus an emission distribution p(x | q) for each state]

Page 31:

Markov models for sequence recognition

• Independence of observations:
- observation x_n depends only on the current state q_n:

p(X | Q) = p(x1, x2, ..., xN | q1, q2, ..., qN)
         = p(x1 | q1) · p(x2 | q2) · ... · p(xN | qN)
         = Π_{n=1}^{N} p(x_n | q_n) = Π_{n=1}^{N} b_{q_n}(x_n)

• Markov transitions:
- the transition to each next state depends only on the current state:

p(Q | M) = p(q1, q2, ..., qN | M)
         = p(qN | q1 ... qN−1) · p(qN−1 | q1 ... qN−2) · ... · p(q2 | q1) · p(q1)
         = p(qN | qN−1) · p(qN−1 | qN−2) · ... · p(q2 | q1) · p(q1)
         = p(q1) · Π_{n=2}^{N} p(q_n | q_{n−1}) = π_{q1} · Π_{n=2}^{N} a_{q_{n−1} q_n}

Page 32:

Model-fit calculation

• From ‘state-based modeling’:

p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)

• For HMMs:

p(X | Q) = Π_{n=1}^{N} b_{q_n}(x_n)
p(Q | M) = π_{q1} · Π_{n=2}^{N} a_{q_{n−1} q_n}

• Hence, solve for M*:
- calculate p(X | M_j) for each available model, scale by prior p(M_j) → p(M_j | X)

• Sum over all Q_k ???

Page 33:

Summing over all paths

Model M1: states S, A, B, E with transitions
S→A 0.9, S→B 0.1; A→A 0.7, A→B 0.2, A→E 0.1; B→B 0.8, B→E 0.2

Observation likelihoods p(x | q) for the observations x1, x2, x3:
       x1    x2    x3
  A   2.5   0.2   0.1
  B   0.1   2.2   2.3

All possible 3-emission paths Q_k from S to E (length-3 paths only):
p(Q | M) = Π_n p(q_n | q_{n−1}),  p(X | Q, M) = Π_n p(x_n | q_n)

q0 q1 q2 q3 q4   p(Q | M)                      p(X | Q, M)               p(X, Q | M)
S  A  A  A  E    .9 x .7 x .7 x .1 = 0.0441    2.5 x 0.2 x 0.1 = 0.05    0.0022
S  A  A  B  E    .9 x .7 x .2 x .2 = 0.0252    2.5 x 0.2 x 2.3 = 1.15    0.0290
S  A  B  B  E    .9 x .2 x .8 x .2 = 0.0288    2.5 x 2.2 x 2.3 = 12.65   0.3643
S  B  B  B  E    .1 x .8 x .8 x .2 = 0.0128    0.1 x 2.2 x 2.3 = 0.506   0.0065
                 Σ = 0.1109                                              Σ = p(X | M) = 0.4020

[Figure: state diagram of M1 and the corresponding trellis of states S, A, B, E vs. time n = 0...4, showing the four length-3 paths]
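The table can be verified by brute-force enumeration of the length-3 paths; this sketch hard-codes the slide's model M1:

```python
from itertools import product

# Model M1 from the slide: transition and observation probabilities
a = {("S", "A"): 0.9, ("S", "B"): 0.1,
     ("A", "A"): 0.7, ("A", "B"): 0.2, ("A", "E"): 0.1,
     ("B", "B"): 0.8, ("B", "E"): 0.2}
b = {"A": [2.5, 0.2, 0.1],                # p(x_n | A), n = 1..3
     "B": [0.1, 2.2, 2.3]}                # p(x_n | B)

total = 0.0
for q in product("AB", repeat=3):         # every length-3 emitting sequence
    path = ("S",) + q + ("E",)
    pQ = 1.0                              # p(Q | M) = product of transitions
    for s0, s1 in zip(path, path[1:]):
        pQ *= a.get((s0, s1), 0.0)        # missing transition => impossible path
    pXQ = 1.0                             # p(X | Q, M) = product of emissions
    for n, s in enumerate(q):
        pXQ *= b[s][n]
    total += pQ * pXQ                     # accumulate p(X, Q | M)

print(round(total, 4))                    # 0.402, the slide's p(X | M)
```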

Page 34:

The ‘forward recursion’

• Dynamic-programming-like technique to calculate sum over all Q_k

• Define α_n(i) as the probability of getting to state q_i at time step n (by any path):

α_n(i) = p(x1, x2, ..., xn, q_n = q_i) ≡ p(X_1^n, q_n^i)

• Then α_{n+1}(j) can be calculated recursively:

α_{n+1}(j) = [ Σ_{i=1}^{S} α_n(i) · a_ij ] · b_j(x_{n+1})

[Diagram: trellis of model states q_i vs. time steps n: α_n(i) and α_n(i+1) combine through transitions a_ij and a_{i+1,j}, then multiply by the emission b_j(x_{n+1}) to give α_{n+1}(j)]

Page 35:

Forward recursion (2)

• Initialize:

α_1(i) = π_i · b_i(x1)

• Then total probability is:

p(X_1^N | M) = Σ_{i=1}^{S} α_N(i)

→ Practical way to solve for p(X | M_j) and hence perform recognition:

[Diagram: observations X scored by each model → p(X | M1)·p(M1), p(X | M2)·p(M2), ... → choose best]
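As a check, the forward recursion applied to the ‘Summing over all paths’ example (emitting states A and B folded into vectors; the entry from S handled by π and the exit to E by an explicit exit-probability vector) reproduces the brute-force total:

```python
import numpy as np

def forward(pi, A, B):
    """alpha_1(i) = pi_i b_i(x_1); alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] b_j(x_{n+1})."""
    alpha = pi * B[:, 0]
    for n in range(1, B.shape[1]):
        alpha = (alpha @ A) * B[:, n]
    return alpha

# Model M1 from the path-summing slide (emitting states A, B)
pi = np.array([0.9, 0.1])                 # entry from S: S->A, S->B
A = np.array([[0.7, 0.2],                 # A->A, A->B
              [0.0, 0.8]])                # B->A, B->B
exit_p = np.array([0.1, 0.2])             # A->E, B->E
B = np.array([[2.5, 0.2, 0.1],            # p(x_n | A)
              [0.1, 2.2, 2.3]])           # p(x_n | B)

p_X = forward(pi, A, B) @ exit_p
print(round(float(p_X), 4))               # 0.402, matching the sum over all paths
```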

Page 36:

Optimal path

• May be interested in actual q_n assignments
- which state was ‘active’ at each time frame
- e.g. phone labelling (for training?)

• Total probability is over all paths...
• ...but can also solve for the single best path = “Viterbi” state sequence

• Probability along best path to state q_{n+1} = j:

α*_{n+1}(j) = max_i { α*_n(i) · a_ij } · b_j(x_{n+1})

- backtrack from final state to get best path
- final probability is product only (no sum)
→ log-domain calculation is just summation

• Total probability often dominated by best path:

p(X, Q* | M) ≈ p(X | M)
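A log-domain Viterbi sketch on the same example model recovers the dominant path S A B B E from the path-summing slide:

```python
import numpy as np

def viterbi(pi, A, B):
    """Best path: alpha*_{n+1}(j) = max_i { alpha*_n(i) a_ij } b_j(x_{n+1}),
    computed as sums of logs, with backpointers for traceback."""
    with np.errstate(divide="ignore"):       # log(0) -> -inf is harmless here
        lpi, lA, lB = np.log(pi), np.log(A), np.log(B)
    S, N = B.shape
    delta = lpi + lB[:, 0]                   # log alpha*_1(i) = log pi_i b_i(x_1)
    back = np.zeros((N, S), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + lA         # scores[i, j]: arrive at j from i
        back[n] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + lB[:, n]
    q = [int(delta.argmax())]                # best final (emitting) state
    for n in range(N - 1, 0, -1):            # traceback via stored predecessors
        q.append(int(back[n][q[-1]]))
    return q[::-1], float(delta.max())

# Model M1 from the path-summing slide (emitting states 0 = A, 1 = B)
pi = np.array([0.9, 0.1])                    # S->A, S->B
A = np.array([[0.7, 0.2],
              [0.0, 0.8]])
B = np.array([[2.5, 0.2, 0.1],               # p(x_n | A)
              [0.1, 2.2, 2.3]])              # p(x_n | B)
path, logp = viterbi(pi, A, B)
print(path)                                  # [0, 1, 1], i.e. S A B B (then E)
print(round(float(np.exp(logp)) * 0.2, 4))   # x B->E exit 0.2 -> 0.3643, the table's best path
```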

Page 37:

Interpreting the Viterbi path

• Viterbi path assigns each x_n to a state q_i
- performing classification based on b_i(x)
- ...at the same time as applying transition constraints a_ij

• Can be used for segmentation
- train an HMM with ‘garbage’ and ‘target’ states
- decode on new data to find ‘targets’, boundaries

• Can use for (heuristic) training
- e.g. train classifiers based on labels...

[Figure: observation sequence x_n over ~30 time steps with the inferred Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]

Page 38:

Recognition with HMMs

• Isolated word
- choose best p(M | X) ∝ p(X | M) · p(M)

[Diagram: input scored by Model M1 “w ah n”, Model M2 “th r iy”, Model M3 “t uw”; each yields p(X | Mi)·p(Mi); choose the best]

• Continuous speech
- Viterbi decoding of one large HMM gives words

[Diagram: one large HMM: a silence state ‘sil’ feeding parallel word models “w ah n”, “th r iy”, “t uw” weighted by priors p(M1), p(M2), p(M3), looping back to ‘sil’]

Page 39:

HMM examples: Different state sequences

Model M1: S → K → A → T → E, with self-loop/advance probabilities K: 0.8/0.2, A: 0.9/0.1, T: 0.8/0.2
Model M2: S → K → O → T → E, with the same transition probabilities

[Figure: emission distributions p(x | K), p(x | T), p(x | A), p(x | O) over observation values x = 0...4]

Page 40:

Model matching: Emission probabilities

[Figure: observation sequence x_n over 18 time steps, matched against both models; for each model, the state alignment, log transition probability, and log observation likelihood per step]

Model M1:
log p(X, Q* | M) = -33.5
log p(X | M) = -32.1
log p(Q* | M) = -7.5
log p(X | Q*, M) = -26.0

Model M2:
log p(X, Q* | M) = -47.5
log p(X | M) = -47.0
log p(Q* | M) = -8.3
log p(X | Q*, M) = -39.2

Page 41:

Model matching: Transition probabilities

Model M'1: K (self 0.8) → A: 0.18, → O: 0.02; A and O (self 0.9) → T: 0.1; T (self 0.8) → E: 0.2
Model M'2: same, but K → A: 0.05 and K → O: 0.15

Model M'1:
log p(X, Q* | M) = -33.6
log p(X | M) = -32.2
log p(Q* | M) = -7.6
log p(X | Q*, M) = -26.0

Model M'2:
log p(X, Q* | M) = -34.9
log p(X | M) = -33.5
log p(Q* | M) = -8.9

[Figure: state alignments, log transition probabilities, and log observation likelihoods for both models over the 18-step observation sequence]

Page 42:

Summary

• Speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features

• Speech is modeled as a sequence of features
- need temporal aspect to recognition
- best time-alignment of templates = DTW

• Hidden Markov models are rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Parting thought:How to set the HMM parameters? (training)