Audio-visual processing of speech with DNN
Ido Ariav
Electrical Engineering Department, Technion - Israel Institute of Technology
Supervised by Prof. Israel Cohen
Outline
▪ Background - Voice Activity Detection
▪ Deep Multimodal Architectures for Voice Activity Detection
▪ Results
Voice Activity Detection (VAD) - some background...
Voice Activity Detection (VAD)
▪ Many applications - speech and speaker recognition, speech enhancement, dominant speaker identification, hearing-improvement devices, etc.
Voice Activity Detection (VAD)
▪ A preliminary block to other speech-related applications
Traditional Methods
▪ Simple acoustic features (e.g., zero-crossings) and model-based methods (e.g., GMM); a toy sketch follows
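As a toy illustration of such feature-threshold detectors (not the method discussed in this talk), here is a minimal sketch of an energy and zero-crossing-rate VAD; the frame length and both thresholds are assumptions that would need tuning:

```python
import numpy as np

def simple_vad(signal, frame_len=400, energy_thresh=0.01, zcr_thresh=0.15):
    """Toy frame-wise VAD from short-time energy and zero-crossing rate.

    frame_len and both thresholds are illustrative, not tuned values.
    """
    n_frames = len(signal) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # sign-change rate
        # Voiced speech: enough energy and a moderate zero-crossing rate
        decisions.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(decisions)
```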
Traditional Methods
Performance deteriorates in the presence of noise
Cannot model highly non-stationary noise (transients)
[Figure: detection results - legend: "-3 dB thresh", "-4 dB thresh"]
Deep NN
▪ Deep learning to the rescue!
Deep NN
▪ But wait… speech is a time series, so why should we treat it as a discrete classification problem?
Multimodal
▪ Any other sensors we could use?
▪ Video is especially useful in challenging acoustic environments
Deep Multimodal Architectures for Voice Activity Detection
Problem Setting
▪ A multimodal setting - both audio and video signals are available
Problem Setting
▪ Stationary background noise and transients (metronome, keyboard typing, hammering) are added to the clean signal
▪ 11 speakers, each recording is 120 seconds long
[Figure: example recording containing both speech and transients]
Deep architecture for VAD
Feature Extraction
▪ Audio Features - MFCC (Mel-frequency cepstral coefficients)
▪ Video Features - motion vectors (MV)
▪ MVs capture both spatial and temporal information (see the sketch below)
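A minimal feature-extraction sketch under these choices; librosa computes the MFCCs, and OpenCV's dense optical flow stands in for the motion vectors (the exact MV computation and the n_mfcc=12 dimension are assumptions):

```python
import cv2
import librosa
import numpy as np

def audio_features(wav_path, sr=8000, n_mfcc=12):
    # MFCCs per audio frame; sr and n_mfcc are assumed values
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def video_features(prev_gray, next_gray):
    # Dense optical flow between consecutive grayscale frames as a
    # stand-in for motion vectors: spatial layout + temporal change
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow.reshape(-1)                                   # flatten (H, W, 2)
```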
Transient Reducing AE
▪ A special AE is designed both to fuse the audio and video signals and to reduce the effect of noise and transients (a sketch follows the figure)
[Figure: AE denoising example - panels labeled "Clean" and "mushroom"]
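A minimal PyTorch sketch of such an AE, assuming pre-extracted audio and video feature vectors per frame; the layer sizes and activations are illustrative, not the exact architecture from the talk. The training target is the feature vector of the clean signal, so reconstruction both fuses the modalities and suppresses noise and transients:

```python
import torch
import torch.nn as nn

class TransientReducingAE(nn.Module):
    """Fuses audio+video features; trained to reconstruct the clean features."""
    def __init__(self, audio_dim=12, video_dim=50, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim // 2), nn.Tanh())
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim // 2, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, audio_dim + video_dim))

    def forward(self, audio, video):
        z = self.encoder(torch.cat([audio, video], dim=-1))
        return self.decoder(z), z        # reconstruction and fused embedding

model = TransientReducingAE()
loss_fn = nn.MSELoss()                   # reconstruction vs. *clean* features
```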
Recurrent Neural Network
▪ The transient-reducing AE is followed by a multilayer RNN
▪ The length of the temporal window is learned instead of being arbitrarily predetermined.
▪ A sigmoid on the RNN output produces a probability measure for the presence of speech in each frame n (see the sketch below)
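A minimal sketch of this stage, with an LSTM standing in for the multilayer RNN and the AE's fused embedding as input (sizes are illustrative):

```python
import torch
import torch.nn as nn

class SpeechPresenceRNN(nn.Module):
    def __init__(self, input_dim=32, hidden_dim=64, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim,
                           num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):                # x: (batch, frames, input_dim)
        h, _ = self.rnn(x)
        # Sigmoid maps each frame's output to P(speech present in frame n)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # (batch, frames)
```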
Experimental Results
Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al.
Our method produces fewer false alarms
Experimental Results
Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.
Colored noise with 5 dB SNR and hammering transient
Experimental Results
Comparison to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and “Robust Audio-Visual Speech Recognition Using Audio-Visual Voice Activity Detection” by Tamura et al.
Babble noise with 10 dB SNR and keyboard transient
That’s nice, but still not end-to-end…
End-to-End VAD
Video Feature Extraction
▪ Residual networks (see the sketch below)
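A minimal sketch of extracting per-frame video embeddings with a residual network; torchvision's resnet18 is an assumed stand-in for the actual backbone, and the input shape is illustrative:

```python
import torch
import torchvision.models as models

# ResNet backbone with the classification head removed, so each video
# frame maps to a fixed-size embedding
resnet = models.resnet18(weights=None)
resnet.fc = torch.nn.Identity()        # keep the 512-d pooled features

frames = torch.randn(8, 3, 224, 224)   # a batch of frames (assumed shape)
video_embeddings = resnet(frames)      # (8, 512)
```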
Audio Feature Extraction
▪ A WaveNet encoder
▪ Stacked residual blocks of dilated convolutions
▪ Captures long-range temporal dependencies
Audio Feature Extraction
▪ Dilated convolutions (sketch below)
Audio Feature Extraction
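A minimal sketch of one such residual block with gated dilated convolutions (channel count, kernel size, and dilation schedule are assumptions; skip connections are omitted for brevity):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """WaveNet-style block: gated, causal dilated conv + residual connection."""
    def __init__(self, channels=32, kernel_size=2, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation          # causal left-padding
        self.pad = nn.ConstantPad1d((pad, 0), 0.0)
        self.filter = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                           # x: (batch, channels, time)
        h = torch.tanh(self.filter(self.pad(x))) * torch.sigmoid(self.gate(self.pad(x)))
        return x + self.res(h)

# Doubling the dilation at every block grows the receptive field
# exponentially - this is what captures long-range temporal dependencies
encoder = nn.Sequential(*[DilatedResidualBlock(dilation=2 ** i) for i in range(6)])
```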
Feature Fusion - MCB
▪ Fusing the two feature vectors into a single joint representation
[Diagram: fusion of the 2048-dimensional audio and video feature vectors]
Feature Fusion - MCB
▪ The best of all worlds – MCB
▪ Approximated by projecting the joint outer product to a lower-dimensional space, using a count sketch function
Whatever we choose..
Feature Fusion - MCB
▪ Can easily be extended to more than two modalities
▪ Able to choose the desired size of the joint vector
▪ MCB output size is set to 1024 (see the sketch below)
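A minimal NumPy sketch of that approximation: each feature vector is count-sketched to d = 1024 dimensions, and multiplying the sketches' FFTs (a circular convolution) yields the sketch of the joint outer product without ever materializing it. The fixed seed stands in for the fixed random hash functions:

```python
import numpy as np

def count_sketch(v, h, s, d):
    # Entry v[i] is added into bucket h[i] with random sign s[i]
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def mcb(x1, x2, d=1024, seed=0):
    rng = np.random.default_rng(seed)    # fixed hashes across calls
    h1, h2 = (rng.integers(0, d, size=x.shape[0]) for x in (x1, x2))
    s1, s2 = (rng.choice([-1.0, 1.0], size=x.shape[0]) for x in (x1, x2))
    # The FFT turns circular convolution of the two sketches into a
    # pointwise product, which equals the sketch of the outer product
    sk1 = np.fft.fft(count_sketch(x1, h1, s1, d))
    sk2 = np.fft.fft(count_sketch(x2, h2, s2, d))
    return np.real(np.fft.ifft(sk1 * sk2))

fused = mcb(np.random.randn(2048), np.random.randn(2048))   # shape (1024,)
```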
Dataset
▪ A more challenging dataset than in our previous work - each sample of the evaluation set contains a different mixture of background noise, transient, and SNR
▪ Training set - noise is mixed in anew at every training iteration
▪ Evaluation set - noised once at initialization and then kept fixed (sketch below)
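A minimal sketch of this noising scheme, assuming lists of pre-loaded clean clips, noise recordings, and transients (all assumed at least as long as the clean clip); the SNR range is illustrative:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so that the mixture reaches the requested SNR
    noise = noise[:len(clean)]
    gain = np.sqrt(np.mean(clean ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + gain * noise

def training_sample(clean_clips, noises, transients, rng):
    # Training set: a fresh noise / transient / SNR draw at every iteration
    clean = clean_clips[rng.integers(len(clean_clips))]
    noisy = mix_at_snr(clean, noises[rng.integers(len(noises))],
                       snr_db=rng.uniform(0, 20))
    return noisy + transients[rng.integers(len(transients))][:len(noisy)]

# Evaluation set: generated the same way, but only once, then kept fixed.
```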
Experimental Results
Comparison of our method to “Audio-Visual Voice Activity Detection Using Diffusion Maps” by Dov et al. and our previous work
Experimental Results
A comparison of our four architectures - with MCB/concatenation fusion, and with shared/joint LSTM
Shared LSTM + MCB is best
Discussion
▪ Features are learned from raw data
▪ Fusion of the modalities via an MCB module explores higher-order relations between the two modalities
▪ Can be applied to other domains (e.g., ECG)
Questions? Thank you.