Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
[Slide 1]

Jinyu Li, Microsoft
[Slide 2]

[Slide 3]

- Review the deep learning trends for automatic speech recognition (ASR) in industry
  - Deep Neural Network (DNN)
  - Long Short-Term Memory (LSTM)
  - Connectionist Temporal Classification (CTC)
- Describe selected key technologies that make deep learning models more effective in a production environment
[Slide 4]

[Slide 5: ASR system architecture — block diagram: input speech s(n) ("Hey Cortana") → Feature Analysis (Spectral Analysis) → features X_n → Pattern Classification (Decoding, Search) → word sequence W, drawing on the Acoustic Model (HMM), Word Lexicon, and Language Model, followed by Confidence Scoring (e.g., 0.9, 0.8)]
[Slide 6: the same ASR system architecture diagram as Slide 5, repeated]
[Slide 7]

- Word sequence: Hey Cortana
- Phone sequence: hh ey k ao r t ae n ax
- Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r ao-r+t r-t+ae t-ae+n ae-n+ax n-ax+sil
- Every triphone is then modeled by a three-state HMM: sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], …, n-ax+sil[3]. The key problem is how to evaluate the state likelihood given the speech signal.
[Slide 8]

[Slides 9–12: HMM state alignment over speech frames — sil-hh+ey[1] sil-hh+ey[2] sil-hh+ey[3] hh-ey+k[1] … n-ax+sil[3]]
[Slide 13: bar chart — ZH-CN relative CERR improvement (0–35%) moving from a GMM with MFCC features to cross-entropy-trained and sequence-trained DNNs with log-filter-bank (LFB) features]

- ZH-CN is improved by 32% within one year!
- CE: Cross-Entropy training; SE: SEquence training

[Slide 14]
[Slide 15]

DNNs process speech frames independently:

$h_t = \sigma(W_{hx}\, x_t + b)$
[Slide 16]

RNNs consider the temporal relation over speech frames:

$h_t = \sigma(W_{hx}\, x_t + W_{hh}\, h_{t-1} + b)$

They are, however, vulnerable to vanishing and exploding gradients.
[Slide 17]

- Memory cells store the history information
- Various gates control the information flow inside the LSTM
- Advantageous in learning long short-term temporal dependencies
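The progression across these three slides can be sketched in a few lines of NumPy. This is a toy illustration, not the production model: all layer sizes, weight names, and the single-layer structure are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions for illustration only.
n_in, n_h = 4, 3
rng = np.random.default_rng(0)
W_hx, b = rng.normal(size=(n_h, n_in)), np.zeros(n_h)

# DNN: each frame x_t is processed independently.
def dnn_step(x_t):
    return sigmoid(W_hx @ x_t + b)

# RNN: the hidden state h carries history across frames,
# but training suffers from vanishing/exploding gradients.
W_hh = rng.normal(size=(n_h, n_h))
def rnn_step(x_t, h_prev):
    return sigmoid(W_hx @ x_t + W_hh @ h_prev + b)

# LSTM: a memory cell c stores the history; input/forget/output
# gates (i, f, o) control the information flow.
Wi, Wf, Wo, Wc = (rng.normal(size=(n_h, n_in + n_h)) for _ in range(4))
def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i, f, o = sigmoid(Wi @ z), sigmoid(Wf @ z), sigmoid(Wo @ z)
    c = f * c_prev + i * np.tanh(Wc @ z)  # gated memory-cell update
    h = o * np.tanh(c)                    # gated output
    return h, c
```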
[Slide 18: bar chart — WER (scale 0–20) and relative WER reduction of LSTM over DNN on several Cortana test sets (SMD2015, VS2015, Mobile, Win10)]
[Slide 19]

[Slide 20]

[Slide 21]

The HMM/GMM or HMM/DNN pipeline is highly complex:

- Multiple training stages: CI phone, CD senones, …
- Various resources: lexicon, decision-tree questions, …
- Many hyper-parameters: number of senones, number of Gaussians, …

(Diagram: hybrid training stages — GMM, CI phone, CD senone, DNN/LSTM)
[Slide 22: the same ASR system architecture diagram as Slide 5, repeated]
[Slide 23]

The HMM/GMM or HMM/DNN pipeline is highly complex:

- Multiple training stages: CI phone, CD senones, …
- Various resources: lexicon, decision-tree questions, …
- Many hyper-parameters: number of senones, number of Gaussians, …
- Building the LM also requires huge amounts of data and a complicated process
- Writing an efficient decoder takes experts with years of experience
[Slide 24]
[Slide 25: End-to-End Model — speech in, "Hey Cortana" out]

- ASR is a sequence-to-sequence learning problem.
- A simpler paradigm with a single model (and training stage) is desired.
[Slide 26]

- CTC is a sequence-to-sequence learning method used to map speech waveforms directly to characters, phonemes, or even words.
- CTC paths differ from label sequences in that they:
  - allow repetitions of non-blank labels
  - add the blank (∅) as an additional label, meaning no (actual) label is emitted
- Example: the label sequence z = "A B C" over observation frames X expands to paths such as "A A ∅ ∅ B C ∅", "∅ ∅ A A B ∅ C C", and "∅ ∅ ∅ A B C ∅"; collapsing a path (merging repeats, then removing blanks) recovers "A B C".
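The collapse rule above is simple enough to state in code. A minimal sketch in Python — the blank symbol and the list representation are just for illustration:

```python
# CTC collapse: merge consecutive repeated labels, then drop blanks.
BLANK = "∅"

def ctc_collapse(path):
    """Map a CTC path (one label per frame) to its label sequence."""
    out = []
    prev = None
    for label in path:
        # Emit a label only when it differs from the previous frame's
        # label (merging repeats) and is not the blank.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out
```

Note that a blank between two identical labels separates them, so "A ∅ A" collapses to "A A", not "A".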
[Slide 27: LSTM-CTC architecture — stacked LSTMs over frames t−1, t, t+1, with a softmax output layer over words plus the blank ∅]
[Slide 28]

- Directly from speech to text: no language model, no decoder, no lexicon, …
[Slide 29]

[Slide 30]

- Reduce runtime cost without accuracy loss
- Adapt to speakers with a low footprint
- Reduce the accuracy gap between large and small deep networks
- Enable languages with limited training data
[Slide 31]
[Xue13]
[Slide 32]

- The runtime cost of a DNN is much larger than that of a GMM, which has been fully optimized in product deployment. We need to reduce the runtime cost of the DNN in order to ship it.
[Slide 33]

- We propose a new DNN structure that exploits the low-rank property of DNN weight matrices to compress the model.
[Slide 34]

- How do we reduce the runtime cost of a DNN? SVD!
- SVD also enables speaker personalization and AM modularization.
[Slide 35]

$$
A_{m\times n} = U_{m\times n}\,\Sigma_{n\times n}\,V_{n\times n}^{T}
=
\begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{pmatrix}
\begin{pmatrix} \epsilon_{11} & & & & \\ & \ddots & & & \\ & & \epsilon_{kk} & & \\ & & & \ddots & \\ & & & & \epsilon_{nn} \end{pmatrix}
\begin{pmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{n1} & \cdots & v_{nn} \end{pmatrix}^{T}
$$

$\Sigma$ is diagonal with the singular values in decreasing order; keeping only the top $k$ of them (zeroing $\epsilon_{k+1,k+1}, \ldots, \epsilon_{nn}$) yields a rank-$k$ approximation of $A$.
[Slide 36]

- Number of parameters: $mn \rightarrow mk + nk$
- Runtime cost: $O(mn) \rightarrow O(mk + nk)$
- E.g., $m = 2048$, $n = 2048$, $k = 192$: about 80% runtime cost reduction
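The parameter arithmetic above can be checked with a quick sketch: factor a weight matrix with NumPy's SVD and keep the top k singular values. The matrix here is random, purely to illustrate the bookkeeping; in the actual technique the factorization is applied to trained DNN weight matrices and the compressed model is then fine-tuned.

```python
import numpy as np

def svd_compress(A, k):
    """Replace A (m x n) with two rank-k factors, so that the layer
    product A @ x becomes W1 @ (W2 @ x): mk + nk multiplies vs. mn."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W1 = U[:, :k]                    # m x k
    W2 = s[:k, None] * Vt[:k, :]     # k x n (singular values folded in)
    return W1, W2

m, n, k = 2048, 2048, 192            # the sizes from the slide
A = np.random.default_rng(0).normal(size=(m, n))
W1, W2 = svd_compress(A, k)

orig_params = m * n
compressed_params = W1.size + W2.size    # mk + nk
print(compressed_params / orig_params)   # 0.1875, i.e. ~80% reduction
```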
[Slide 37]

[Slide 38]

[Slide 39: Singular Value Decomposition]
[Slide 40: frame-skipping diagram — DNN and LSTM models evaluate only some frames ($x_{t-1}$, $x_t$, $x_{t+1}$) and copy the output to the skipped frames]
[Slide 41]

- Split training utterances through frame skipping: when skipping 1 frame, the odd frames ($x_1, x_3, x_5$) and the even frames ($x_2, x_4, x_6$) are picked as separate utterances
- Frame labels are selected accordingly
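A minimal sketch of this splitting scheme — the function name and list representation are invented for the example:

```python
def split_by_frame_skipping(frames, labels, skip=1):
    """Split one utterance into skip+1 sub-utterances by taking every
    (skip+1)-th frame, starting at each possible offset. The frame
    labels are split in lockstep with the frames."""
    step = skip + 1
    return [(frames[offset::step], labels[offset::step])
            for offset in range(step)]
```

With `skip=1` this produces exactly the odd-frame and even-frame utterances shown on the slide.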
[Slide 42]
[Xue 14]
[Slide 43]

- Speaker personalization with a deep model creates a storage-size issue: it is not practical to store an entire deep model for each individual speaker during deployment.
[Slide 44]

- We propose a low-footprint DNN personalization method based on the SVD structure.
[Slide 45]

[Slide 46: Adapting with 100 utterances]

|                            | Full-size DNN | SVD DNN | Standard adaptation | SVD adaptation |
|----------------------------|---------------|---------|---------------------|----------------|
| Relative WER reduction (%) | 0             | 0.36    | 18.64               | 20.86          |
| Number of parameters (M)   | 30            | 7.4     | 7.4                 | 0.26           |
[Slide 47]

[Slide 48]

- SVD matrices are used to reduce the number of DNN parameters and CPU cost.
- Quantization for SSE evaluation is used for single-instruction-multiple-data processing.
- Frame skipping is used to remove the evaluation of some frames.
[Slide 49]

- The industry has a strong interest in running DNN systems on devices due to increasingly popular mobile scenarios.
- Even with the technologies mentioned above, the large computational cost is still very challenging given the limited processing power of devices.
- A common way to fit a CD-DNN-HMM on devices is to reduce the DNN model size by:
  - reducing the number of nodes in the hidden layers
  - reducing the number of targets in the output layer
[Slide 50: diagram — a large-size DNN and a small-size DNN evaluating the same input]

- Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation.
- The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy.
- The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN.
[Slide 51: Teacher–student training]

- Use the standard DNN training method to train a large-size teacher DNN on transcribed data.
- Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN on a large amount of un-transcribed data.
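Since the teacher's posteriors are fixed during student training, minimizing the per-frame KL divergence is equivalent, up to the teacher's entropy (a constant), to minimizing the cross-entropy of the student against the teacher's soft targets. A NumPy sketch with invented shapes and names:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_posteriors):
    """Mean KL(teacher || student) over frames, up to the constant
    teacher-entropy term: cross-entropy against soft targets."""
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(teacher_posteriors * log_p_student).sum(axis=-1).mean()
```

Gradient descent on this loss pushes the student's senone posteriors toward the teacher's, which is exactly the "similar output" criterion on the previous slide.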
[Slide 52: bar chart — accuracy of the teacher DNN (standard sequence training), a small-size DNN (standard sequence training), and the student DNN (output-distribution learning)]

- 2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN.
- The footprint is further reduced to 0.5 million parameters when combined with SVD.
[Slide 53]

[Huang 13]

[Slide 54]

- Develop a new language or scenario with a small amount of training data.

[Slide 55]

- Leverage resource-rich languages to develop high-quality ASR for resource-limited languages.
[Slide 56]

[Slide 57: multilingual DNN diagram — input layer (a window of acoustic feature frames) fed with new-language training or testing samples, a shared feature transformation of many hidden layers, and an output layer of new-language senones]
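A toy sketch of the shared-hidden-layer idea on this slide: the hidden layers act as a feature transformation learned on resource-rich languages and are kept fixed, while only the new-language output layer is trained. Every size, name, and the choice of activations here is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden layers: stand-ins for layers pretrained on
# resource-rich languages, kept frozen for the new language.
shared_weights = [rng.normal(size=(40, 64)) * 0.1,
                  rng.normal(size=(64, 64)) * 0.1]

def shared_transform(x):
    h = x
    for W in shared_weights:
        h = np.maximum(h @ W, 0.0)   # frozen ReLU feed-forward layers
    return h

# New-language output layer: the only trainable parameters.
n_senones = 100
W_out = rng.normal(size=(64, n_senones)) * 0.01

def new_language_posteriors(x):
    logits = shared_transform(x) @ W_out
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # softmax over senones
```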
[Slide 58: bar chart — relative error reduction (0–25%) for the new language with 3 hrs, 9 hrs, 36 hrs, and 139 hrs of training data]

[Slide 59]