E6820 SAPR - Dan Ellis L09 - Speech Recognition 2006-03-30 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 9:Speech Recognition
1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
4. Hidden Markov Models
Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical EngineeringSpring 2006
Recognizing Speech
• What kind of information might we want from the speech signal?
- words
- phrasing, 'speech acts' (prosody)
- mood / emotion
- speaker identity
• What kind of processing do we need to get at that information?
- time scale of feature extraction
- signal aspects to capture in features
- signal aspects to exclude from features
[Figure: spectrogram (0-4 kHz, 0-3 s) of "So, I thought about that and I think it's still possible"]
Speech recognition as Transcription
• Transcription = “speech to text”
- find a word string to match the utterance
• Best suited to small vocabulary tasks
- voice dialing, command & control etc.
• Gives neat objective measure: word error rate (WER) %
- can be a sensitive measure of performance
• Three kinds of errors (S substitutions, D deletions, I insertions; N = words in reference):
- WER = (S + D + I) / N
Reference:   THE  CAT  SAT  ON  THE  MAT  -
Recognized:   -   CAT  SAT  AN  THE  MAT  A
            (Del.)         (Sub.)        (Ins.)
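The WER arithmetic above can be checked with a standard minimum-edit-distance alignment; a minimal sketch (the function name `word_error_rate` is illustrative):

```python
def word_error_rate(ref, hyp):
    """Levenshtein alignment of reference vs. recognized word strings;
    WER = (S + D + I) / N, with N = number of reference words."""
    r, h = ref.split(), hyp.split()
    # D[i][j] = minimum edit cost to align r[:i] with h[:j]
    D = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        D[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i-1][j-1] + (r[i-1] != h[j-1])
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    return D[len(r)][len(h)] / len(r)

print(word_error_rate("THE CAT SAT ON THE MAT", "CAT SAT AN THE MAT A"))
# → 0.5
```

The dynamic program finds the cheapest mix of substitutions, deletions, and insertions; here 1 deletion + 1 substitution + 1 insertion over 6 reference words gives WER = 3/6 = 0.5.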
Problems: Within-speaker variability
• Timing variation:
- word duration varies enormously
- fast speech ‘reduces’ vowels
• Speaking style variation:
- careful/casual articulation
- soft/loud speech
• Contextual effects:
- speech sounds vary with context, role: "How do you do?"
[Figure: spectrogram (0-4 kHz, 0-3 s) of "SO I THOUGHT ABOUT THAT AND I THINK IT'S STILL POSSIBLE" with aligned word and phone labels]
Between-speaker variability
• Accent variation
- regional / mother tongue
• Voice quality variation
- gender, age, huskiness, nasality
• Individual characteristics
- mannerisms, speed, prosody
[Figure: spectrograms (0-8 kHz, ~2.5 s) of the same sentence spoken by two speakers, mbma0 and fjdm2]
Environment variability
• Background noise
- fans, cars, doors, papers
• Reverberation
- ‘boxiness’ in recordings
• Microphone/channel
- huge effect on relative spectral gain
[Figure: spectrograms (0-4 kHz, ~1.5 s) of the same speech captured by a close mic and a tabletop mic]
How to recognize speech?
• Cross correlate templates?
- waveform?
- spectrogram?
- time-warp problems
• Match short-segments & handle time-warp later
- model with slices of ~10 ms
- pseudo-stationary model of words:
[Figure: spectrogram (0-4 kHz, ~0.5 s) labeled sil g w eh n sil, divided into pseudo-stationary segments]
- other sources of variation...
Probabilistic formulation
• Probability that segment label is correct:  p(q_i | X)
- gives standard form of speech recognizers:
• Feature calculation: transforms signal into an easily-classified domain
  s[n] → X_m,  m = n/H
• Acoustic classifier: calculates probabilities of each mutually-exclusive state q_i
• 'Finite state acceptor' (i.e. HMM): MAP match of allowable sequence to probabilities:
  Q̂ = argmax_{q_0, q_1, ..., q_L} p(q_0, q_1, ..., q_L | X_0, X_1, ..., X_L)
[Figure: observations X over time with candidate state sequence q_0 = "ay", q_1, ...]
Standard speech recognizer structure
• Questions:
- what are the best features?
- how do we do the acoustic classification?
- how do we find/match the state sequence?

sound s[n] → [Feature calculation] → feature vectors X_m → [Acoustic classifier (network weights)] → phone probabilities p(q_i | X_m) → [HMM decoder (word models, language model)] → phone & word labeling
Outline
1. Recognizing Speech
2. Feature Calculation
- Spectrogram, MFCCs & PLP
- Improving robustness
3. Sequence Recognition
4. Hidden Markov Models
Feature Calculation
• Goal: Find a representational space most suitable for classification
- waveform: voluminous, redundant, variable
- spectrogram: better, still quite variable
- ...?
• Pattern Recognition: Representation is upper bound on performance
- maybe we should use the waveform...
- or, maybe the representation can do all the work
• Feature calculation is intimately bound to classifier
- pragmatic strengths and weaknesses
• Features develop by slow evolution
- current choices more historical than principled
Features (1): Spectrogram
• Plain STFT as features e.g.
• Consider examples:
• Similarities between corresponding segments
- but still large differences
X_m[k] = S[mH, k] = Σ_n s[n + mH] · w[n] · e^(-j2πkn/N)

[Figure: spectrograms (0-8 kHz, ~2.5 s) of two example utterances, with one feature-vector slice highlighted]
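The framing-and-DFT step above can be sketched in a few lines; the frame length, hop, and Hann window are illustrative choices, and `stft_features` is a made-up name:

```python
import numpy as np

# Sketch of spectrogram feature extraction: take frames of the signal
# at hop H, window each frame, and keep the DFT magnitude.
def stft_features(s, N=400, H=160):
    w = np.hanning(N)
    n_frames = 1 + (len(s) - N) // H
    X = np.empty((n_frames, N // 2 + 1))
    for m in range(n_frames):
        frame = s[m * H : m * H + N] * w      # s[n + mH] * w[n]
        X[m] = np.abs(np.fft.rfft(frame))     # |sum_n ... e^(-j2pi kn/N)|
    return X

# e.g. 1 s of a 440 Hz tone at 16 kHz; bin spacing is 16000/400 = 40 Hz
s = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
X = stft_features(s)
print(X.shape)   # → (98, 201)
```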
Features (2): Cepstrum
• Idea: Decorrelate, summarize spectral slices:
- good for Gaussian models
- greatly reduce feature dimension

X_m[l] = IDFT{ log |S[mH, k]| }

[Figure: male and female examples: spectrum (0-8 kHz) and the first 8 cepstral coefficients over ~2.5 s]
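A minimal sketch of the cepstral transform for one frame, assuming a log-magnitude spectrum and truncation to 13 coefficients (both conventional but illustrative choices):

```python
import numpy as np

# Cepstrum sketch: log magnitude spectrum, then an inverse DFT.
# Truncating to the first few coefficients keeps the smooth spectral
# envelope and shrinks the feature dimension.
def cepstrum(frame, n_coef=13):
    spec = np.abs(np.fft.rfft(frame))
    log_spec = np.log(np.maximum(spec, 1e-10))  # floor to avoid log(0)
    ceps = np.fft.irfft(log_spec)               # IDFT of log|S[k]|
    return ceps[:n_coef]                        # truncated cepstrum

frame = np.hanning(400) * np.sin(2 * np.pi * 0.05 * np.arange(400))
print(cepstrum(frame).shape)  # → (13,)
```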
Features (3): Frequency axis warp
• Linear frequency axis gives equal 'space' to 0-1 kHz and 3-4 kHz
- but perceptual importance very different
• Warp frequency axis closer to perceptual axis:
- mel, Bark, constant-Q ...
X[c] = Σ_{k = l_c}^{u_c} |S[k]|²

[Figure: male and female examples: linear spectrum (0-8 kHz) vs. auditory-warped 'audspec' features (~15 bands) over ~2.5 s]
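The band summation above can be sketched with a mel warp; the mel formula mel(f) = 2595·log10(1 + f/700) is standard, but the rectangular bands used here are a simplification of the usual overlapping triangular filters:

```python
import numpy as np

# Auditory frequency warp sketch: place band edges uniformly on the
# mel scale, then sum spectral energy |S[k]|^2 within each band
# (X[c] = sum_{k=l_c..u_c} |S[k]|^2).
def mel_bands(power_spec, sr=16000, n_bands=15):
    n_bins = len(power_spec)
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    edges_mel = np.linspace(0, mel_max, n_bands + 1)
    edges_hz = 700 * (10 ** (edges_mel / 2595) - 1)
    edges_bin = np.floor(edges_hz / (sr / 2) * (n_bins - 1)).astype(int)
    return np.array([power_spec[edges_bin[c]:edges_bin[c + 1] + 1].sum()
                     for c in range(n_bands)])

spec = np.abs(np.fft.rfft(np.random.randn(400))) ** 2
print(mel_bands(spec).shape)  # → (15,)
```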
Features (4): Spectral smoothing
• Generalizing across different speakers is helped by smoothing (i.e. blurring) spectrum
• Truncated cepstrum is one way:
- MSE approx to log |S[k]|
• LPC modeling is a little different:
- MSE approx to |S[k]| → prefers detail at peaks
[Figure: male example: audspec vs. PLP-smoothed features over ~2.5 s, plus one spectral slice (level/dB vs. frequency channel 0-18) comparing the two]
Features (5): Normalization along time
• Idea: feature variations, not absolute level
• Hence: calculate average level & subtract it:
  Y[n, k] = X[n, k] - mean_n{ X[n, k] }
• Factors out fixed channel frequency response:
  s[n] = h_c * e[n]
  log S[n, k] = log H_c[k] + log E[n, k]
[Figure: male and female examples: PLP features before and after mean normalization (~15 channels, ~2.5 s)]
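The mean-subtraction step is one line in practice; a minimal sketch (`mean_normalize` is an illustrative name):

```python
import numpy as np

# Cepstral mean normalization sketch: subtract each feature dimension's
# average over time, removing a fixed (log-domain) channel gain.
def mean_normalize(X):
    """X: (n_frames, n_features) log-domain feature array."""
    return X - X.mean(axis=0, keepdims=True)

X = np.random.randn(100, 13) + 5.0       # features with a channel offset
Y = mean_normalize(X)
print(np.allclose(Y.mean(axis=0), 0))    # → True: offset removed
```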
Delta features
• Want each segment to have 'static' feature vals
- but some segments intrinsically dynamic!
→ calculate their derivatives - maybe steadier?
• Append dX/dt (+ d²X/dt²) to feature vectors
• Relates to onset sensitivity in humans?
[Figure: male example: mean- and variance-normalized PLP features with their deltas and double-deltas (~2.5 s)]
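A minimal sketch of delta computation; a simple central difference is used here, where production systems typically fit a short regression window instead:

```python
import numpy as np

# Delta features sketch: append a local time-derivative estimate to
# each feature vector, doubling the feature dimension.
def add_deltas(X):
    """X: (n_frames, n_features); returns (n_frames, 2*n_features)."""
    d = np.empty_like(X)
    d[1:-1] = (X[2:] - X[:-2]) / 2      # central difference dX/dt
    d[0], d[-1] = d[1], d[-2]           # replicate at the edges
    return np.hstack([X, d])

X = np.random.randn(50, 13)
print(add_deltas(X).shape)  # → (50, 26)
```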
Overall feature calculation
• MFCCs and/or RASTA-PLP
• Key attributes:
- spectral, auditory scale
- decorrelation
- smoothed (spectral) detail
- normalization of levels
MFCC:  sound → FFT |X[k]| → mel-scale freq. warp (audspec) → log → IFFT (cepstra) → truncate → subtract mean (CMN) → MFCC features

Rasta-PLP:  sound → FFT |X[k]| → Bark-scale freq. warp → log → RASTA band-pass (smoothed onsets) → LPC smooth (LPC spectra) → cepstral recursion → Rasta-PLP cepstral features
Features summary
• Normalize same phones
• Contrast different phones
[Figure: male vs. female renditions of the same words compared as spectrum, audspec, rasta, and delta features]
Outline
1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
- Dynamic Time Warp
- Probabilistic Formulation
4. Hidden Markov Models
Sequence recognition: Dynamic Time Warp (DTW)
• Framewise comparison with stored templates:
- distance metric?
- comparison across templates?
[Figure: frame-by-frame distance matrix between the test input and concatenated reference templates ONE, TWO, THREE, FOUR, FIVE]
Dynamic Time Warp (2)
• Find lowest-cost constrained path:
- matrix d(i, j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy

  D(i, j) = d(i, j) + min{ D(i-1, j) + T10,  D(i, j-1) + T01,  D(i-1, j-1) + T11 }

  (d(i, j) = local match cost; D(i, j) = lowest cost to (i, j); the best predecessor, including transition cost, is chosen at each cell)

• Best path via traceback from final state
- store predecessors for each (i, j)
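The recurrence above can be sketched directly; unit transition costs and a Euclidean local distance are illustrative choices:

```python
import numpy as np

# DTW sketch: fill the cumulative-cost matrix with
# D(i,j) = d(i,j) + min(D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11)
# and return the total cost of the best path to the final cell.
def dtw(f, r, T10=1.0, T01=1.0, T11=1.0):
    """f: (n_in, dim) input frames; r: (n_ref, dim) reference frames."""
    n, m = len(f), len(r)
    d = np.linalg.norm(f[:, None, :] - r[None, :, :], axis=2)  # d(i,j)
    D = np.full((n, m), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, D[i-1, j] + T10)      # vertical step
            if j > 0:
                best = min(best, D[i, j-1] + T01)      # horizontal step
            if i > 0 and j > 0:
                best = min(best, D[i-1, j-1] + T11)    # diagonal step
            D[i, j] = d[i, j] + best
    return D[n-1, m-1]

f = np.array([[0.], [1.], [2.], [2.]])
r = np.array([[0.], [1.], [2.]])
print(dtw(f, r))  # → 3.0
```

Keeping a matrix of best predecessors alongside `D` would let the aligned path itself be recovered by traceback, as described above.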
DTW-based recognition
• Reference templates for each possible word
• For isolated words:
- mark endpoints of input word
- calculate scores through each template (+ prune)
• For continuous speech:
- link together word ends
• Successfully handles timing variation
- recognize speech at reasonable cost
[Figure: best path through reference templates ONE, TWO, THREE, FOUR against the input frames]
Statistical sequence recognition
• DTW limited because it's hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate recognition as MAP choice among models:
  M* = argmax_{M_j} p(M_j | X, Θ)
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters
Statistical formulation (2)
• Can rearrange via Bayes' rule (& drop p(X)):
  M* = argmax_{M_j} p(M_j | X, Θ)
     = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)
- p(X | M_j) = likelihood of observations under model
- p(M_j) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters
• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
[Figure: model M_j generates a state sequence Q_k: q1 q2 q3 q4 q5 q6 ..., which emits the N observed feature vectors X_1^N: x1 x2 x3 x4 x5 x6 ...]
• Probability of observations given model is:
  p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
- sum over all possible state sequences Q_k
• How do observations depend on states? How do state sequences depend on model?
Outline
1. Recognizing Speech
2. Feature Calculation
3. Sequence Recognition
4. Hidden Markov Models (HMM)
- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples
Markov models
• A (first order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. generative Markov model with states {S, A, B, C, E} and transition matrix p(q_{n+1} | q_n):

           (next state q_{n+1})
            S    A    B    C    E
  (q_n) S   0    1    0    0    0
        A   0   .8   .1   .1    0
        B   0   .1   .8   .1    0
        C   0   .1   .1   .7   .1
        E   0    0    0    0    1

A sample state sequence:
S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
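The sample sequence above can be reproduced in spirit by sampling from the transition matrix; a minimal sketch (transition table transcribed from the slide, `sample_sequence` is a made-up name):

```python
import random

# Generative Markov model sketch: start in S, repeatedly sample the
# next state from p(q_{n+1}|q_n), stop on reaching the end state E.
# S always enters A; only C can exit to E, so sequences end "... C E".
P = {
    'S': {'A': 1.0},
    'A': {'A': 0.8, 'B': 0.1, 'C': 0.1},
    'B': {'A': 0.1, 'B': 0.8, 'C': 0.1},
    'C': {'A': 0.1, 'B': 0.1, 'C': 0.7, 'E': 0.1},
}

def sample_sequence(seed=None):
    rng = random.Random(seed)
    q, seq = 'S', ['S']
    while q != 'E':
        nxt, probs = zip(*P[q].items())
        q = rng.choices(nxt, weights=probs)[0]
        seq.append(q)
    return seq

print(' '.join(sample_sequence(seed=0)))
```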
Hidden Markov models
• = Markov model where state sequence Q = {qn} is not directly observable (= ‘hidden’)
• But, observations X do depend on Q:
- x_n is a random variable that depends on the current state: p(x | q)
- can still tell something about state seq...
[Figure: emission distributions p(x | q) for states A, B, C; a hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the resulting observation sequence x_n over 30 time steps]
(Generative) Markov models (2)
• HMM is specified by:
- states q^i
- transition probabilities:  p(q_n^j | q_{n-1}^i) ≡ a_ij
- emission distributions:  p(x | q^i) ≡ b_i(x)
- (+ initial state probabilities:  p(q_1^i) ≡ π_i)
[Figure: word model for "kat": states k, a, t and an end state, with a transition matrix (self-loops ~0.9, advance ~0.1) and emission distributions p(x | q)]
Markov models for sequence recognition
• Independence of observations:
- observation x_n depends only on the current state q_n:

  p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
           = p(x_1 | q_1) · p(x_2 | q_2) · ... · p(x_N | q_N)
           = Π_{n=1..N} p(x_n | q_n)  =  Π_{n=1..N} b_{q_n}(x_n)

• Markov transitions:
- transition to next state q_{n+1} depends only on q_n:

  p(Q | M) = p(q_1, q_2, ..., q_N | M)
           = p(q_N | q_1 ... q_{N-1}) · p(q_{N-1} | q_1 ... q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
           = p(q_N | q_{N-1}) · p(q_{N-1} | q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
           = p(q_1) · Π_{n=2..N} p(q_n | q_{n-1})  =  π_{q_1} · Π_{n=2..N} a_{q_{n-1} q_n}
Model-fit calculation
• From 'state-based modeling':
  p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
• For HMMs:
  p(X | Q) = Π_{n=1..N} b_{q_n}(x_n)
  p(Q | M) = π_{q_1} · Π_{n=2..N} a_{q_{n-1} q_n}
• Hence, solve for M*:
- calculate p(X | M_j) for each available model, scale by prior p(M_j) → p(M_j | X)
• Sum over all Q_k ???
Summing over all paths

Model M1: states S, A, B, E; transition probabilities:
  S→A 0.9, S→B 0.1;  A→A 0.7, A→B 0.2, A→E 0.1;  B→B 0.8, B→E 0.2

Observation likelihoods p(x_n | q) for the observations x1, x2, x3:
       x1    x2    x3
  A   2.5   0.2   0.1
  B   0.1   2.2   2.3

All possible 3-emission paths Q_k from S to E, with
p(Q | M) = Π_n p(q_n | q_{n-1}) and p(X | Q, M) = Π_n p(x_n | q_n):

  Path        p(Q | M)                    p(X | Q, M)                p(X, Q | M)
  S A A A E   .9 x .7 x .7 x .1 = 0.0441  2.5 x 0.2 x 0.1 = 0.05     0.0022
  S A A B E   .9 x .7 x .2 x .2 = 0.0252  2.5 x 0.2 x 2.3 = 1.15     0.0290
  S A B B E   .9 x .2 x .8 x .2 = 0.0288  2.5 x 2.2 x 2.3 = 12.65    0.3643
  S B B B E   .1 x .8 x .8 x .2 = 0.0128  0.1 x 2.2 x 2.3 = 0.506    0.0065
              Σ = 0.1109                              Σ = p(X | M) = 0.4020
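The table above can be checked by brute force: enumerate every 3-emission state string from S to E and accumulate p(X, Q | M); impossible transitions contribute zero.

```python
from itertools import product

# Model M1 from the slide: transition and emission probabilities.
a = {('S','A'): 0.9, ('S','B'): 0.1,
     ('A','A'): 0.7, ('A','B'): 0.2, ('A','E'): 0.1,
     ('B','B'): 0.8, ('B','E'): 0.2}
b = {'A': [2.5, 0.2, 0.1],   # p(x1|A), p(x2|A), p(x3|A)
     'B': [0.1, 2.2, 2.3]}

total = 0.0
for path in product('AB', repeat=3):
    states = ('S',) + path + ('E',)
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= a.get((prev, nxt), 0.0)   # p(Q|M): transition product
    for n, q in enumerate(path):
        p *= b[q][n]                   # p(X|Q,M): emission product
    total += p
print(round(total, 4))  # → 0.402
```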
The ‘forward recursion’
• Dynamic-programming-like technique to calculate sum over all Q_k
• Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
  α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)
• Then α_{n+1}(j) can be calculated recursively:
  α_{n+1}(j) = [ Σ_{i=1..S} α_n(i) · a_ij ] · b_j(x_{n+1})
[Figure: trellis of model states vs. time steps, showing each α_n(i) feeding α_{n+1}(j) through a_ij and b_j(x_{n+1})]
Forward recursion (2)
• Initialize:
  α_1(i) = π_i · b_i(x_1)
• Then total probability:
  p(X_1^N | M) = Σ_{i=1..S} α_N(i)
→ Practical way to solve for p(X | M_j) and hence perform recognition
[Figure: observations X scored by each model; compute p(X | M1)·p(M1), p(X | M2)·p(M2), ...; choose best]
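The same p(X | M) falls out of the forward recursion without enumerating paths; a sketch on the toy model above, with the S and E transitions handled as explicit entry and exit probabilities:

```python
# Forward recursion sketch for the 4-path example model M1:
# alpha_{n+1}(j) = [sum_i alpha_n(i) * a_ij] * b_j(x_{n+1}).
states = ['A', 'B']
pi = {'A': 0.9, 'B': 0.1}                      # transitions out of S
a = {('A','A'): 0.7, ('A','B'): 0.2,
     ('B','A'): 0.0, ('B','B'): 0.8}
exit_p = {'A': 0.1, 'B': 0.2}                  # transitions into E
b = {'A': [2.5, 0.2, 0.1], 'B': [0.1, 2.2, 2.3]}

alpha = {q: pi[q] * b[q][0] for q in states}   # alpha_1(i) = pi_i b_i(x1)
for n in (1, 2):                               # steps for x2, x3
    alpha = {j: sum(alpha[i] * a[(i, j)] for i in states) * b[j][n]
             for j in states}
p_X = sum(alpha[q] * exit_p[q] for q in states)
print(round(p_X, 4))  # → 0.402
```

This matches the brute-force path sum, but costs only O(N·S²) operations instead of enumerating every path.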
Optimal path
• May be interested in actual q_n assignments:
- which state was 'active' at each time frame
- e.g. phone labelling (for training?)
• Total probability is over all paths...
• ...but can also solve for single best path = "Viterbi" state sequence
• Probability along best path to state q_{n+1}^j:
  α*_{n+1}(j) = max_i [ α*_n(i) · a_ij ] · b_j(x_{n+1})
- backtrack from final state to get best path
- final probability is product only (no sum)
→ log-domain calculation is just summation
• Total probability often dominated by best path:
  p(X, Q* | M) ≈ p(X | M)
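Replacing the forward sum with a max and keeping backpointers gives the Viterbi path; on the toy model from the path-sum example, the best path is S A B B E with probability 0.3643, close to the total 0.4020:

```python
# Viterbi sketch: alpha*_{n+1}(j) = max_i[alpha*_n(i) a_ij] * b_j(x_{n+1}),
# with backpointers to recover the best state sequence by traceback.
states = ['A', 'B']
pi = {'A': 0.9, 'B': 0.1}
a = {('A','A'): 0.7, ('A','B'): 0.2, ('B','A'): 0.0, ('B','B'): 0.8}
exit_p = {'A': 0.1, 'B': 0.2}
b = {'A': [2.5, 0.2, 0.1], 'B': [0.1, 2.2, 2.3]}

alpha = {q: pi[q] * b[q][0] for q in states}
back = []
for n in (1, 2):
    prev, alpha, ptr = alpha, {}, {}
    for j in states:
        i_best = max(states, key=lambda i: prev[i] * a[(i, j)])
        alpha[j] = prev[i_best] * a[(i_best, j)] * b[j][n]
        ptr[j] = i_best
    back.append(ptr)
q_last = max(states, key=lambda q: alpha[q] * exit_p[q])
p_best = alpha[q_last] * exit_p[q_last]
path = [q_last]                         # traceback from final state
for ptr in reversed(back):
    path.append(ptr[path[-1]])
path.reverse()
print(path, round(p_best, 4))  # → ['A', 'B', 'B'] 0.3643
```

In practice the products are computed as sums of log probabilities to avoid underflow, as noted above.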
Interpreting the Viterbi path
• Viterbi path assigns each x_n to a state q^i
- performing classification based on b_i(x)
- ...at the same time as applying transition constraints a_ij
• Can be used for segmentation:
- train an HMM with 'garbage' and 'target' states
- decode on new data to find 'targets', boundaries
• Can use for (heuristic) training:
- e.g. train classifiers based on labels...
[Figure: observation sequence x_n with Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, the inferred classification]
Recognition with HMMs
• Isolated word:
- choose best p(M | X) ∝ p(X | M) · p(M)
[Figure: input compared against separate word models "w ah n", "th r iy", "t uw"; each yields p(X | Mi)·p(Mi); choose best]
• Continuous speech:
- Viterbi decoding of one large HMM gives words
[Figure: one composite HMM: a silence state sil links word models "w ah n", "th r iy", "t uw", entered with priors p(M1), p(M2), p(M3)]
HMM examples: Different state sequences
Model M1:  S → K → A → T → E
Model M2:  S → K → O → T → E
- self-loop / exit probabilities:  K: 0.8 / 0.2;  A, O: 0.9 / 0.1;  T: 0.8 / 0.2

[Figure: emission distributions p(x | K), p(x | T), p(x | A), p(x | O), peaked at different x in 0-4]
Model matching: Emission probabilities

[Figure: a ~20-step observation sequence x_n and, for each model, the Viterbi state alignment, per-step log transition probability, and per-step log observation likelihood]

  Model M1:  log p(X,Q* | M) = -33.5;  log p(X | M) = -32.1;
             log p(Q* | M) = -7.5;  log p(X | Q*,M) = -26.0
  Model M2:  log p(X,Q* | M) = -47.5;  log p(X | M) = -47.0;
             log p(Q* | M) = -8.3;  log p(X | Q*,M) = -39.2
Model matching: Transition probabilities

Variants in which both A and O are reachable from K (K self-loop 0.8; other transitions as before):
  Model M'1:  K→A 0.18, K→O 0.02
  Model M'2:  K→A 0.05, K→O 0.15

[Figure: Viterbi state alignments and per-step log probabilities for both models on the same observation sequence]

  Model M'1:  log p(X,Q* | M) = -33.6;  log p(X | M) = -32.2;
              log p(Q* | M) = -7.6;  log p(X | Q*,M) = -26.0
  Model M'2:  log p(X,Q* | M) = -34.9;  log p(X | M) = -33.5;
              log p(Q* | M) = -8.9;  log p(X | Q*,M) = -26.0
Summary
• Speech signal is highly variable
- need models that absorb variability
- hide what we can with robust features
• Speech is modeled as a sequence of features
- need temporal aspect to recognition
- best time-alignment of templates = DTW
• Hidden Markov models are rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Parting thought: How to set the HMM parameters? (training)