Deep Neural Networks for Acoustic Modelling
Bajibabu Bollepalli
Hieu Nguyen
Rakshith Shetty
Pieter Smit (Mentor)
Introduction
• Automatic speech recognition
[Figure: ASR pipeline — speech signal → feature extraction → acoustic modelling + language modelling → decoder → recognized text]
Introduction
• Acoustic modelling using deep neural networks
[Figure: the same ASR pipeline, with acoustic modelling as the block addressed in this project]
Background
• HMM-GMM systems have dominated ASR for the last four decades
• Difficult for any new method to outperform them for acoustic modelling
• Can GMMs capture all the information in acoustic features?
• No. They are inefficient at modelling data that lie on or near a nonlinear manifold in the data space
• Need for better models
• Artificial neural networks (ANNs) are known to capture nonlinearities in the data
• Natural to consider ANNs as an alternative to GMMs
Background
• ANNs are not new to speech recognition
• Two decades ago, researchers already employed ANNs for ASR
• They were unable to outperform GMMs
• Hardware and learning algorithms of the time restricted the capacity of ANNs
• Advancements in hardware as well as in machine learning algorithms now allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs)
• DNNs outperform GMMs (finally ;) )
Deep Neural Networks (DNNs)
• Feed-forward ANNs with more than one hidden layer
Our task
• Frame-based phoneme recognition using simple DNNs
• Experiments with various input features
• Compare the results with GMMs
• Try more complex DNNs (if time permits):
• Deep belief networks (DBNs)
• Recurrent neural networks (RNNs)
Database
• Training data: 151 Finnish speech sentences (~15 mins)
• Development data: 135 sentences (~11 mins)
• Evaluation data: 100 sentences (~8 mins)
Simple DNN
• Similar to multi-layer perceptron (MLP)
• Hidden Layers: [300, 300]
• Activations: Sigmoid
• Optimization: Stochastic Gradient Descent (SGD)
• Error criterion: Categorical cross-entropy
• Software tool: Keras
• Input: MFCC features with 39 dimensions
• Output: 24 Finnish phonemes
• Normalization: Mean-variance (a minimal code sketch of this setup is shown below)
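A minimal sketch of the setup above, written with the current tf.keras API (the project used an earlier Keras version; the placeholder data, shapes and learning rate are illustrative assumptions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative shapes: X is (n_frames, 39) MFCCs, y is (n_frames,) phoneme ids 0..23.
X = np.random.randn(1000, 39).astype("float32")   # placeholder data
y = np.random.randint(0, 24, size=1000)

# Mean-variance normalization of the input features (per dimension).
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

model = keras.Sequential([
    keras.Input(shape=(39,)),                  # one 39-dim MFCC frame
    layers.Dense(300, activation="sigmoid"),   # hidden layer 1
    layers.Dense(300, activation="sigmoid"),   # hidden layer 2
    layers.Dense(24, activation="softmax"),    # 24 Finnish phoneme classes
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),   # learning rate is an assumption
    loss="sparse_categorical_crossentropy",               # cross-entropy with integer labels
    metrics=["accuracy"],
)

model.fit(X, y, batch_size=256, epochs=10, validation_split=0.1)
```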
Performance of simple DNN (MLP)
| Input feature | Frame-wise accuracy (%) |
| --- | --- |
| Single frame [t] | 63.81 |
| Three frames [t-1, t, t+1] | 67.59 |
| Five frames [t-2, t-1, t, t+1, t+2] | 67.22 |
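The multi-frame inputs in the table are simply neighbouring MFCC frames concatenated into one vector. A small illustrative helper (the function name and the edge handling are assumptions, not the project's code):

```python
import numpy as np

def stack_context(mfcc, left=1, right=1):
    """Concatenate each frame with `left` previous and `right` following frames.

    mfcc: array of shape (n_frames, n_dims), e.g. (n_frames, 39).
    Returns an array of shape (n_frames, (left + 1 + right) * n_dims).
    Edges are handled by repeating the first/last frame (an assumption).
    """
    padded = np.vstack([mfcc[:1]] * left + [mfcc] + [mfcc[-1:]] * right)
    windows = [padded[i:i + len(mfcc)] for i in range(left + 1 + right)]
    return np.hstack(windows)

# Three-frame input [t-1, t, t+1] -> 117-dim vectors; five-frame -> 195-dim.
# X3 = stack_context(mfcc, 1, 1); X5 = stack_context(mfcc, 2, 2)
```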
DBN
Deep Belief Network (DBN)
• This network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of being only randomly initialized.
• After the model is pre-trained, the weights are fine-tuned; this step is the same as training a plain MLP.
• Pre-training is unsupervised (it does not use the true target labels): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.
• Fine-tuning is the supervised training step, where we maximize prediction accuracy on the labelled data points.
Restricted Boltzmann Machine (RBM)
• This is a type of generative neural network.
• The idea is to generate an "energy surface" or "heat map" in the form of a probability density.
• Energy: $E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$
• Probability density: $p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$, with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$
• Optimize the log-likelihood of the data; for each weight the gradient is $\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}$
• Use Gibbs sampling to estimate $\langle \cdot \rangle_{\text{model}}$ (a rough sketch follows below)
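A rough numpy sketch of one contrastive-divergence (CD-1) update for a binary RBM, where a single Gibbs step stands in for the model expectation (names, shapes and the learning rate are illustrative assumptions, not the project's actual code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=1e-5):
    """One CD-1 step for a binary RBM.

    v0: (batch, n_visible) data batch.
    W:  (n_visible, n_hidden) weights; a, b: visible/hidden biases.
    """
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (np.random.rand(*h0_prob.shape) < h0_prob).astype(v0.dtype)

    # One Gibbs step: reconstruct the visibles, then the hidden probabilities again.
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)

    # <v h>_data - <v h>_model, with the model term approximated by the reconstruction.
    grad_W = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    W += lr * grad_W
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b
```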
DBN-pretraining
• Stack of RBMs:
• Two consecutive layers are trained as an RBM, with the lower layer acting as the visible layer and the upper layer as the hidden layer.
• The process proceeds bottom-up.
• Iterate for multiple epochs at each layer (a sketch of this loop follows below).
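In pseudo-numpy terms, greedy layer-wise pretraining just trains one RBM per pair of layers, bottom-up, reusing something like the `cd1_update` sketch above (layer sizes, batch splitting and epoch count here are illustrative assumptions):

```python
import numpy as np

def pretrain_stack(X, layer_sizes=(300, 300), epochs=10, lr=1e-5):
    """Greedy layer-wise pretraining: train one RBM per layer, bottom-up."""
    rbms, data = [], X
    for n_hidden in layer_sizes:
        n_visible = data.shape[1]
        W = 0.01 * np.random.randn(n_visible, n_hidden)
        a, b = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):                       # iterate multiple epochs per layer
            for batch in np.array_split(data, max(1, len(data) // 256)):
                # `cd1_update` is the CD-1 step defined in the earlier sketch.
                W, a, b = cd1_update(batch, W, a, b, lr)
        rbms.append((W, a, b))
        # Propagate the data upward: hidden activations become the next RBM's "visible" data.
        data = 1.0 / (1.0 + np.exp(-(data @ W + b)))
    return rbms   # weights used to initialize the MLP before supervised fine-tuning
```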
Setups
• Using Theano-based tutorial code from deeplearning.net
• Hidden layers use the sigmoid activation function.
• The prediction layer (top layer) is a softmax layer.
• The loss function is categorical cross-entropy (formulas below).
• The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities).
• Each input is MFCC features over a context of three frames.
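For reference, the softmax output and categorical cross-entropy loss used here are the standard definitions:

```latex
% Softmax over the K = 24 phoneme classes, from pre-activations z
p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K

% Categorical cross-entropy for one frame with one-hot target y
L(y, p) = -\sum_{k=1}^{K} y_k \log p_k
```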
Experiments
• Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen
• Train with and without pre-training to compare
• The number of hidden layers varies from 1 to 3
• The size of each hidden layer varies from 100 to 600 (some models with sizes 500 and 600 were not trained)
• Experiments with some 3-hidden-layer "hourglass" models did not show real improvement
DBN Results
• The best model is the non-pre-trained 500_500 network, with 66.82% accuracy on the validation set.
• The table shows prediction accuracy of the trained models on the validation set.
| Model size | Pre-trained acc. (%) | Iterations | Non-pre-trained acc. (%) | Iterations |
| --- | --- | --- | --- | --- |
| 100 | 60.188 | 48934 | 60.344 | 39830 |
| 200 | 61.235 | 44382 | 62.792 | 48934 |
| 300 | 61.387 | 39830 | 62.721 | 39830 |
| 400 | 61.284 | 42106 | 63.561 | 37554 |
| 100_100 | 61.641 | 48934 | 62.638 | 44382 |
| 200_200 | 63.106 | 47796 | 64.266 | 39830 |
| 300_300 | 63.808 | 46658 | 64.716 | 37554 |
| 400_400 | 63.741 | 51210 | 64.634 | 33002 |
| 500_500 | – | – | 66.820 | 33002 |
| 600_600 | – | – | 65.327 | 30726 |
| 100_100_100 | 62.237 | 55762 | 62.926 | 46658 |
| 200_200_200 | 63.589 | 53486 | 64.19 | 40968 |
| 300_300_300 | 63.572 | 44382 | 63.73 | 33002 |
| 400_400_400 | 63.106 | 44382 | 64.941 | 35278 |
Recurrent Networks
Recurrent Neural Networks
• The output of a recurrent network at time t depends on the input at time t as well as on the state of the network at time t-1 (see the recurrence below).
• They are therefore well suited to modelling sequences, since time dependencies can be learnt in the recurrent weights.
• For phoneme classification this makes it easy to include an arbitrary amount of context, i.e. previous frames within a window.
• In a sense, the network is infinitely deep in time.
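The dependence on the previous state is the standard recurrence (written here for a simple recurrent layer; the symbols are the usual textbook ones, not notation from the slides):

```latex
% Hidden state update and frame prediction of a simple RNN
h_t = \sigma\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \qquad
\hat{y}_t = \operatorname{softmax}\left(W_{hy}\, h_t + b_y\right)
```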
Our Model
• We use a fixed context size: frames from t-context up to t are fed into the RNN.
• The hidden state of the RNN at time t is then used to predict the class of the frame at time t.
Learning in recurrent nets
• We can compute the (cross-entropy) error at time t and backpropagate the gradients through time, similar to backpropagation in an MLP.
• The problem is that these gradients can vanish or blow up if the sequence is very long.
• One solution for exploding gradients is to truncate the depth in time through which you propagate.
• Another solution is to use more complex recurrent units such as LSTMs.
LSTM Cell
• Consists of a memory unit and 3 gates.
• Each gate is affected by the current input and the previous output state of the cell.
• The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell (the standard equations are shown below).
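In the standard formulation (not taken from the slides), the input, forget and output gates and the memory cell are:

```latex
% Standard LSTM cell: input (i), forget (f), output (o) gates, cell state c, output h
i_t = \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t = \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t = \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
h_t = o_t \odot \tanh(c_t)
```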
Learning Details and Regularization
• We use the RMSprop learning algorithm: a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients.
• Regularize using dropout: for each training sample some units are randomly switched off. This forces each unit to learn something useful and not co-depend too much on the others.
• Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to the recurrent connections (a code sketch follows).
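A minimal tf.keras sketch along these lines (the original used different tooling; treating the "embedding" as a per-frame dense projection is an assumption, and hyperparameters such as context 10, 200 units and dropout 0.3 are taken from the experiments on the next slide):

```python
from tensorflow import keras
from tensorflow.keras import layers

CONTEXT, N_MFCC, N_PHONES = 10, 39, 24

model = keras.Sequential([
    keras.Input(shape=(CONTEXT, N_MFCC)),          # frames t-9 ... t
    layers.Dense(200, activation="sigmoid"),       # per-frame "embedding" projection (assumption)
    layers.Dropout(0.3),                            # dropout on the embedding, not on recurrence
    layers.LSTM(200),                               # hidden state at time t summarizes the context
    layers.Dropout(0.3),                            # dropout before the output layer
    layers.Dense(N_PHONES, activation="softmax"),   # class of the frame at time t
])

model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),  # learning rate is an assumption
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(X_seq, y, ...)  with X_seq of shape (n_frames, CONTEXT, N_MFCC)
```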
Results with RNNs - Accuracies

Varying network size (LSTM, Context 10, Dropout 0.3):

| Size of n/w | 50 | 100 | 200 |
| --- | --- | --- | --- |
| Accuracy on Eval (%) | 67.79 | 68.11 | 67.76 |

Varying context window (LSTM, 200 Units, Dropout 0.3):

| Context Window | 5 | 10 | 20 |
| --- | --- | --- | --- |
| Accuracy on Eval (%) | 68.11 | 67.76 | 68.76 |

Varying dropout (LSTM, Context 10, 200 Units):

| Dropout Prob | 0.0 | 0.3 | 0.5 | 0.7 |
| --- | --- | --- | --- | --- |
| Accuracy on Eval (%) | 66.47 | 67.76 | 68.21 | 68.19 |

Varying unit type (Context 10, 200 Units, Dropout 0.3):

| Type of Unit | simple | lstm |
| --- | --- | --- |
| Accuracy on Eval (%) | 66.43 | 67.76 |
Summary Results: All Models

|  | MLP | DBN | RNN |
| --- | --- | --- | --- |
| Context Window | 3 | 3 | 20 |
| Accuracy on Eval (%) | 67.59 | 66.82 | 68.76 |
Source code is available on GitHub :
https://github.com/rakshithShetty/dnn-speech
References
• George E. Dahl, Abdel-rahman Mohamed, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.
• Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. ArXiv, abs/1303.5778, 2013.
• Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.
• The DBN implementation code is taken and modified from the tutorial on deeplearning.net.
Questions?