Automatic speech recognition using an echo state network
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida, Gainesville, FL, USA
May 10, 2006
CNEL Seminar History
• Ratio spectrum, Oct. 2000
• HFCC, Sept. 2002
• Bats, Dec. 2004
• Electrohysterography, Aug. 2005
• Echo state network, May 2006
Overview
• ASR motivations
• Intro to echo state network
• Multiple readout filters
• ASR experiments
• Conclusions
ASR Motivations
• Speech is the most natural form of communication among humans.
• Human-machine interaction lags behind, relying on tactile interfaces.
• The bottleneck in machine understanding is signal-to-symbol translation.
• Human speech is a “tough” signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception
How to handle the “non”-ness of speech?
ASR State of the Art
• Feature extraction: HFCC
  – Bio-inspired frequency analysis
  – Tailored for statistical models
• Acoustic pattern rec: HMM
  – Piecewise-stationary stochastic model
  – Efficient training/testing algorithms
  – …but several simplistic assumptions
• Language models
  – Use knowledge of language, grammar
  – HMM implementations
  – Machine language understanding still elusive (spam blockers)
Hidden Markov Model
Premier stochastic model of non-stationary time series used for decision making.
Assumptions:
1) Speech is a piecewise-stationary process.
2) Features are independent.
3) State duration is exponentially distributed.
4) State transition probability is a function of the previous and next states only.
Can we devise a better pattern recognition model?
Echo State Network
• Partially trained recurrent neural network (Herbert Jaeger, 2001)
• Unique characteristics:
  – Recurrent “reservoir” of processing elements, interconnected with random untrained weights.
  – Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.
ESN Diagram & Equations
x(n) = f( W x(n−1) + Win u(n) )
y(n) = Wout x(n)
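The ESN update and readout equations can be sketched in NumPy. This is a minimal illustration, not the author's implementation; matrix names follow the slide's notation and the dimensions (M = 3, one input) are arbitrary.

```python
import numpy as np

def esn_step(x, u, W, W_in):
    # Reservoir update: x(n) = f(W x(n-1) + W_in u(n)), with f = tanh
    return np.tanh(W @ x + W_in @ u)

def esn_readout(x, W_out):
    # Linear readout: y(n) = W_out x(n)
    return W_out @ x

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((3, 3))     # untrained recurrent weights
W_in = 0.3 * rng.standard_normal((3, 1))  # untrained input weights
x = np.zeros(3)
x = esn_step(x, np.array([1.0]), W, W_in)
y = esn_readout(x, np.ones((1, 3)))       # placeholder readout weights
```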
![Page 9: Automatic speech recognition using an echo state network](https://reader036.fdocuments.in/reader036/viewer/2022062408/56813dfd550346895da7d650/html5/thumbnails/9.jpg)
ESN Matrices
• Win: untrained, M × Min matrix
  – Zero-mean, unit-variance normally distributed
  – Scaled by rin
• W: untrained, M × M matrix
  – Zero-mean, unit-variance normally distributed
  – Scaled such that spectral radius r < 1
• Wout: trained by linear regression, Mout × M matrix
  – Regression has a closed-form, stable, unique solution
  – O(M²) complexity per data point
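The two untrained matrices can be initialized as a short sketch under the slide's scaling conventions (the function name is mine):

```python
import numpy as np

def init_reservoir(M, M_in, r=0.9, r_in=0.3, seed=0):
    # W_in: zero-mean, unit-variance Gaussian entries, scaled by r_in.
    # W: zero-mean, unit-variance Gaussian entries, rescaled so that
    #    its spectral radius (largest eigenvalue magnitude) equals r.
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, M_in))
    W = rng.standard_normal((M, M))
    W *= r / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

W_in, W = init_reservoir(M=60, M_in=13)
```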
Echo States Conditions
• The network has echo states if x(n) is uniquely determined by left-infinite input sequence …,u(n-1),u(n).
• x(n) is an “echo” of all previous inputs.
• If f is the tanh activation function:
  – σmax(W) = ‖W‖ < 1 guarantees echo states.
  – r = |λmax(W)| > 1 guarantees no echo states.
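The gap between the two bounds can be checked numerically: the sufficient condition uses the matrix 2-norm, the necessary one the spectral radius, and the latter never exceeds the former. A small sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((50, 50))
sigma_max = np.linalg.norm(W, 2)             # largest singular value, ||W||
rho = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius, |lambda_max|

# sigma_max < 1 is sufficient for echo states; rho > 1 rules them out.
# Since rho <= sigma_max always holds, scaling W so that rho < 1
# (as done throughout these slides) sits between the two conditions.
```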
ESN Training
• Minimize the mean-squared error between y(n) and the desired signal d(n).
Wiener solution:

Wout = ( Σn x(n) x(n)ᵀ )⁻¹ ( Σn x(n) d(n)ᵀ ) = R⁻¹ p
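In code, the Wiener solution amounts to accumulating the correlation matrices over the training data and solving once. A sketch (the small ridge term is my addition for numerical stability, not part of the slide):

```python
import numpy as np

def train_readout(X, D, reg=1e-8):
    # X: reservoir states, M x N (one column per time step)
    # D: desired outputs, M_out x N
    R = X @ X.T                    # autocorrelation matrix
    p = D @ X.T                    # cross-correlation matrix
    return p @ np.linalg.inv(R + reg * np.eye(X.shape[0]))

# Sanity check: recover a known linear map from states to targets.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
W_true = rng.standard_normal((2, 5))
W_out = train_readout(X, W_true @ X)
```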
ESN Example: Mackey-Glass
• M = 60 PEs, r = 0.9, rin = 0.3
• u(n): Mackey-Glass series, 10,000 samples
• d(n) = u(n+1)
• Prediction gain (var(u)/var(e)):
  – Input: 16.3 dB
  – Wiener: 45.1 dB
  – ESN: 62.6 dB
Multiple Readout Filters
• Wout projects the reservoir space to the output space.
• Question: how to divide the reservoir space and use multiple readout filters?
• Answer: a competitive network of filters.
• Question: how to train/test a competitive network of K filters?
• Answer: mimic the HMM.
yk(n) = Wkout x(n),  k ∈ [1, K]
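A minimal winner-take-all sketch over K readout filters (names and toy numbers are mine): each filter produces yk(n) = Wkout x(n), and the filter with the smallest squared error against the desired d(n) wins.

```python
import numpy as np

def winner_take_all(x, d, W_outs):
    # Squared error of each readout filter's output against the desired d
    errs = [float(np.sum((W @ x - d) ** 2)) for W in W_outs]
    return int(np.argmin(errs)), errs

x = np.array([1.0, -1.0])              # reservoir state
d = np.array([2.0])                    # desired output
W_outs = [np.array([[1.0, -1.0]]),     # filter 0 predicts exactly 2.0
          np.array([[0.0, 0.0]])]      # filter 1 predicts 0.0
k, errs = winner_take_all(x, d, W_outs)
# filter 0 wins with zero error
```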
HMM vs. ESN Classifier

|                   | HMM                            | ESN Classifier                   |
|-------------------|--------------------------------|----------------------------------|
| Output            | Likelihood                     | MSE                              |
| Architecture      | States, left-to-right          | States, left-to-right            |
| Minimum element   | Gaussian kernel                | Readout filter                   |
| Elements combined | GMM                            | Winner-take-all                  |
| Transitions       | State transition matrix        | Binary switching matrix          |
| Training          | Segmental K-means (Baum-Welch) | Segmental K-means                |
| Discriminatory    | No                             | Maybe, depends on desired signal |
Segmental K-means: Initialization
For each input xi(n) and desired di(n) of sequence i:
• Divide x, d into equal-sized chunks Xη, Dη (one per state).
• For each n, select k(n) ∈ [1, K] uniformly at random.

After initialization with all sequences:

Ak = Σ{n: k(n)=k} Xη(n) Xη(n)ᵀ
Bk = Σ{n: k(n)=k} Xη(n) Dη(n)ᵀ
Wkout = Ak⁻¹ Bk,  k ∈ [1, K]
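One reading of this initialization, sketched in NumPy (names are mine; with K = 1 it reduces to the plain Wiener solution, which the sanity check below exploits):

```python
import numpy as np

def init_readouts(X, D, K, rng, reg=1e-8):
    # X: reservoir states (M x N); D: desired outputs (M_out x N).
    # Assign each frame a uniform-random state k(n), then accumulate
    # per-state matrices A_k = sum x x^T and B_k = sum x d^T, and
    # solve W_k_out = (A_k^-1 B_k)^T for each state.
    M, N = X.shape
    k_of_n = rng.integers(0, K, size=N)
    A = np.zeros((K, M, M))
    B = np.zeros((K, M, D.shape[0]))
    for n in range(N):
        k = k_of_n[n]
        A[k] += np.outer(X[:, n], X[:, n])
        B[k] += np.outer(X[:, n], D[:, n])
    return [np.linalg.solve(A[k] + reg * np.eye(M), B[k]).T
            for k in range(K)]

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 300))
W_true = rng.standard_normal((1, 4))
W_outs = init_readouts(X, W_true @ X, K=1, rng=rng)
```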
Segmental K-means: Training
• For each utterance:
  – Produce the MSE for each readout filter.
  – Find the Viterbi path through the MSE matrix.
  – Use features from each state to update the auto- and cross-correlation matrices.
• After all utterances: Wiener solution.
• Guaranteed to converge to a local minimum in MSE over the training set.
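The Viterbi step for a left-to-right model can be sketched as a minimum-cost path through the (states × frames) MSE matrix, where at each frame the path either stays in its current state or advances to the next one (the binary switching matrix). Function name and toy numbers are mine:

```python
import numpy as np

def viterbi_left_to_right(cost):
    # cost: (states x frames) matrix of per-frame MSE values.
    S, T = cost.shape
    acc = np.full((S, T), np.inf)       # accumulated path cost
    back = np.zeros((S, T), dtype=int)  # backpointers
    acc[0, 0] = cost[0, 0]              # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = acc[s, t - 1]
            move = acc[s - 1, t - 1] if s > 0 else np.inf
            back[s, t] = s if stay <= move else s - 1
            acc[s, t] = min(stay, move) + cost[s, t]
    path = [S - 1]                      # must end in the final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]

# Toy 2-state example: low cost in state 0 early, in state 1 late.
cost = np.array([[0.1, 0.1, 0.9, 0.9],
                 [0.9, 0.9, 0.1, 0.1]])
path = viterbi_left_to_right(cost)
```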
ASR Example 1
• Isolated English digits “zero” to “nine” from the TI46 corpus: 8 male, 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
• ESN: M = 60 PEs, r = 2.0, rin = 0.1, 10 word models, various numbers of states and filters per state.
• Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α = 0.95), CMS, Δ+ΔΔ (±4 frames).
• Pre-processing: zero-mean and whitening transform.
• M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
• Two to six training epochs for all models.
• Desired: next frame of 39-dimension features.
• Test: corrupted by additive noise from “real” sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
• Baseline: HMM with identical input features.
ASR Results, noise free

Number of classification errors out of 518 (smaller is better), ESN (HMM):

| Nst \ K | 1       | 2       | 3      | 4      | 5     | 10    |
|---------|---------|---------|--------|--------|-------|-------|
| 1       | 7 (171) | 6 (136) | 3 (65) | 2 (33) | 3 (4) | 2 (2) |
| 2       | 1 (83)  | 1 (46)  | 0 (4)  | 1 (3)  | 2 (2) | 1 (0) |
| 3       | 0 (126) | 1 (4)   | 0 (2)  | 0 (2)  | 0 (1) | 2 (0) |
| 5       | 1 (11)  | 1 (2)   | 0 (0)  | 0 (0)  | 1 (0) | 0 (0) |
| 10      | 1 (2)   | 1 (0)   | 1 (0)  | 1 (0)  | 0 (0) | 0 (0) |
| 15      | 0 (1)   | 0 (0)   | 0 (0)  | 0 (0)  | 0 (0) | 1 (0) |
| 20      | 0       | 0       | 0      | 0      | 0     | 1     |
ASR Results, noisy

Average accuracy (%) over all noise sources, 0-20 dB SNR (larger is better), ESN (HMM):

| Nst \ K | 1           | 2           | 3           | 4           | 5           | 10          |
|---------|-------------|-------------|-------------|-------------|-------------|-------------|
| 1       | 70.9 (22.4) | 70.0 (29.7) | 74.6 (45.6) | 74.3 (46.0) | 74.3 (36.2) | 75.8 (50.9) |
| 2       | 76.3 (41.5) | 77.6 (47.6) | 78.3 (50.1) | 77.7 (53.8) | 77.1 (50.2) | 75.8 (64.5) |
| 3       | 78.8 (29.2) | 79.2 (44.6) | 79.3 (51.7) | 79.2 (58.6) | 79.1 (58.6) | 78.8 (55.6) |
| 5       | 81.4 (51.6) | 81.1 (56.4) | 81.6 (59.7) | 81.9 (59.2) | 81.3 (59.2) | 81.3 (53.5) |
| 10      | 84.6 (57.2) | 84.4 (61.1) | 84.4 (58.7) | 83.6 (55.7) | 83.5 (56.2) | 81.0 (52.2) |
| 15      | 85.4 (64.0) | 85.1 (62.0) | 85.0 (59.2) | 83.8 (56.4) | 82.8 (52.9) | 78.4 (52.2) |
| 20      | 85.8        | 85.6        | 84.0        | 83.5        | 82.5        | 72.3        |
ASR Results, noisy
Single mixture per state (K=1): ESN classifier
ASR Results, noisy
Single mixture per state (K=1): HMM baseline
ASR Example 2
• Same experiment setup as Example 1.
• ESN: M = 600 PEs, 10 states, 1 filter per state, rin = 0.1, various r.
• Desired: one-of-many encoding of class, ±1, tanh output activation function AFTER the linear readout filter.
• Test: corrupted by additive speech-shaped noise.
• Baseline: HMM with identical input features.
ASR Results, noisy
Discussion
• What gives the ESN classifier its noise-robust characteristics?
• Theory: the ESN reservoir provides context for the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
• Theory: the nonlinearity and high dimensionality of the network increase the linear separability of classes in reservoir space.
Future Work
• Replace winner-take-all with mixture-of-experts.
• Replace segmental K-means with Baum-Welch-type training algorithm.
• “Grow” network during training.
• Consider nonlinear activation functions (e.g., tanh, softmax) AFTER linear readout filter.
Conclusions
• ESN classifier takes inspiration from the HMM:
  – Multiple readout filters per state, multiple states.
  – Trained as a competitive network of filters.
  – Segmental K-means guaranteed to converge to a local minimum of total MSE on the training set.
• ESN classifier is noise robust compared to the HMM:
  – Average over all sources, 0-20 dB SNR: +21 percentage points.
  – Average over all sources: +9 dB SNR.