Audio-Visual Dictionary Learning and
Probabilistic Time-Frequency Masking in Convolutive and Noisy Source Separation
Wenwu Wang
Senior Lecturer in Signal Processing
Centre for Vision, Speech and Signal Processing
Department of Electronic Engineering
University of Surrey, Guildford
Acknowledgement
Joint work with Dr Qingju Liu (former PhD student and current postdoc)
Collaborators: Dr Philip Jackson, Dr Mark Barnard, Prof Josef Kittler, Prof Jonathon Chambers (Loughborough University), and Dr Wei Dai (Imperial College London)
Financial support: EPSRC & DSTL, UDRC in Signal Processing
Outline
Introduction
o Cocktail party problem, source separation, time-frequency masking
o Why audio-visual BSS (AV-BSS)
Dictionary learning (AVDL) based AV-BSS
o Audio-visual dictionary learning
o Time-frequency mask fusion
Results and demonstrations
Conclusions and future work
Introduction----Cocktail party problem
Independent component analysis (ICA)
Time-frequency (TF) masking
BSS using TF masking
[Figure: mixture spectrogram X(m, ω)]
Sparsity assumption: each TF point is dominated by one source signal.
Benchmark: ideal binary mask (IBM)
Demonstrations by DeLiang Wang, The Ohio State Univ.
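As a rough illustration of this benchmark, the ideal binary mask can be sketched as below, assuming the individual source spectrograms are available (which is what makes the mask "ideal"). The function name `ideal_binary_mask`, the 0 dB local criterion `lc_db` and the toy magnitudes are assumptions made for illustration and are not taken from the slides.

```python
import numpy as np

def ideal_binary_mask(target_tf, interferer_tf, lc_db=0.0):
    """Return 1 where the target exceeds the interferer by more than lc_db, else 0."""
    eps = 1e-12  # avoids log(0) and division by zero
    local_snr_db = 20.0 * np.log10(
        (np.abs(target_tf) + eps) / (np.abs(interferer_tf) + eps)
    )
    return (local_snr_db > lc_db).astype(float)

# Toy magnitude spectrograms (frequency x time); hypothetical values.
target = np.array([[1.0, 0.1], [0.5, 2.0]])
interferer = np.array([[0.2, 1.0], [0.5, 0.1]])

mask = ideal_binary_mask(target, interferer)
# Apply the mask to a toy mixture magnitude (phases are ignored in this sketch).
separated = mask * (target + interferer)
```

TF points where the two sources are equally strong fall below the strict threshold and are assigned to the interferer in this sketch.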
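[Figure: spectrograms X1(m, ω) and X2(m, ω) of the two mixture channels. The interaural phase difference (IPD) and interaural level difference (ILD) at each TF point (m, ω) are derived from the ratio of the two.]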
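The two binaural cues can be computed per TF point roughly as follows. This is a minimal sketch that assumes the ratio is taken as X1 / X2 and that the ILD is expressed in dB; the slides do not confirm either convention, and the function name `interaural_cues` is hypothetical.

```python
import numpy as np

def interaural_cues(X1, X2, eps=1e-12):
    """Return (IPD in radians, ILD in dB) for two complex spectrograms.

    Assumed convention: both cues describe the ratio X1 / X2.
    """
    # The phase of X1 * conj(X2) equals the phase of X1 / X2 without dividing.
    ipd = np.angle(X1 * np.conj(X2))
    ild = 20.0 * np.log10((np.abs(X1) + eps) / (np.abs(X2) + eps))
    return ipd, ild

# One TF point: channel 1 is twice as strong and leads channel 2 by 90 degrees.
X1 = np.array([[2.0j]])
X2 = np.array([[1.0 + 0.0j]])
ipd, ild = interaural_cues(X1, X2)
```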
Adverse effects
Acoustic noise
Reverberations
Why AV-BSS?----AV coherence
[Figure panels: "CT & MRI" and "Perception", each showing an audio stream and a visual stream]
Why AV-BSS?
• The audio-domain BSS algorithms degrade in adverse conditions.
• The visual stream contains complementary information to the coherent audio stream.
Objective
How can the visual modality be used to assist audio-domain BSS algorithms in noisy and reverberant conditions?
Key challenges
• Reliable AV coherence modelling
• Bimodal differences in size, dimensionality and sampling rates
• Fusion of AV coherence with audio-domain BSS methods
Potential applications of AV-BSS
• Surveillance
• AV speech recognition
• HCI
• Robot audition
AVDL based BSS
TF masking, Mandel et al. 2010.
Dictionary learning
Figures taken from ICASSP 2013 Tutorial 11, by Dai, Mailhé and Wang; likewise for the next four pages. Acknowledgement to Wei Dai for making these figures.
A two-stage procedure
Sparse coding (approximation)
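One common greedy way to obtain such a sparse approximation is matching pursuit (MP), the algorithm named in the AVDL coding step later in this deck. The sketch below is a generic textbook MP, not the tutorial's or the authors' code; it assumes unit-norm dictionary columns, and the names `matching_pursuit` and `n_atoms` are illustrative.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse approximation so that signal ≈ dictionary @ coeffs.

    Columns of `dictionary` are assumed to have unit l2 norm.
    """
    residual = np.asarray(signal, dtype=float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))
        coeffs[k] += correlations[k]
        # Remove that atom's contribution from the residual.
        residual -= correlations[k] * dictionary[:, k]
    return coeffs, residual

# Toy example with a trivial (identity) dictionary.
D = np.eye(3)
x = np.array([0.0, 3.0, -1.0])
coeffs, residual = matching_pursuit(x, D, n_atoms=2)
```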
Dictionary update: the formulation
Dictionary update: K-SVD algorithm
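The core of the K-SVD dictionary update is a rank-1 SVD fit of the residual for one atom at a time, restricted to the signals that use that atom. The following is a minimal sketch of that single-atom step under the standard formulation Y ≈ DX; the function name `ksvd_update_atom` and the toy matrices are assumptions for illustration.

```python
import numpy as np

def ksvd_update_atom(Y, D, X, k):
    """Update atom k of D and row k of X in place with a rank-1 SVD.

    Y: (n, N) training signals, D: (n, K) dictionary, X: (K, N) sparse codes.
    """
    users = np.nonzero(X[k, :])[0]  # signals that currently use atom k
    if users.size == 0:
        return  # unused atom: left unchanged in this sketch
    # Residual with atom k's own contribution added back, for its users only.
    E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]               # new unit-norm atom
    X[k, users] = s[0] * Vt[0, :]   # new coefficients, same support

# Toy data: atom 0 starts misaligned with the signals that use it.
Y = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
D = np.array([[0.6, 0.0], [0.8, 1.0]])
X = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
err_before = np.linalg.norm(Y - D @ X)
ksvd_update_atom(Y, D, X, k=0)
err_after = np.linalg.norm(Y - D @ X)
```

Because only the entries in `users` are rewritten, the sparsity pattern found in the coding stage is preserved by the update.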
Audio-visual dictionary learning: a generative model
Sparse assumption of AVDL
Flow of the AVDL
[Flowchart: AV sequence → MP coding → dictionary update (K-SVD, K-means; audio and visual parts) → AV dictionary → converged? If no, return to the coding step; if yes, end with the learned AV dictionary.]
• The coding process relies on the matching criterion, i.e. how well an atom fits the signal in the MP algorithm.
• The learning process uses two different update methods, to accommodate the different sparsity constraints of the two modalities.
• A scanning index is proposed to reduce the computational complexity.
The overall algorithm
The coding process
The coding process (algorithm)
The learning stage
AVDL evaluations
Synthetic data
AVDL evaluations
Additive noise added
AVDL evaluations
Convolutive noise added
AVDL evaluations
Comparison of the approximation error metrics of AVDL and Monaci's method over 50 independent tests on the synthetic data.
The proposed AVDL outperforms the baseline approach, giving an average improvement of 33% for the audio modality and 26% for the visual modality.
AVDL evaluations
AV mask fusion for AVDL-BSS
Audio mask
Statistically generated by evaluating the IPD and ILD of each TF point.
Visual mask
Generated by mapping the observation to the learned AV dictionary via the coding stage in AVDL.
The power-law transformation
The power coefficients are determined by a non-linear interpolation with pre-defined values.
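A power-law fusion of the two masks might be sketched as follows, assuming the audio mask is raised to an exponent that is selected by the visual mask. The function name `fuse_masks`, the anchor points `v_points`, the exponents `r_points` and the log-domain interpolation are all hypothetical choices for illustration; the slides state only that the power coefficients come from a non-linear interpolation with pre-defined values.

```python
import numpy as np

def fuse_masks(audio_mask, visual_mask,
               v_points=(0.0, 0.5, 1.0),
               r_points=(4.0, 1.0, 0.25)):
    """Power-law fusion: audio_mask ** r, with r chosen from visual_mask.

    log2(r) is interpolated linearly between the (assumed) anchor points,
    which makes r itself a non-linear function of the visual mask.
    """
    log_r = np.interp(visual_mask, v_points, np.log2(r_points))
    r = 2.0 ** log_r
    return np.clip(audio_mask, 0.0, 1.0) ** r

# Same audio confidence, three different levels of visual support.
audio = np.array([0.5, 0.5, 0.5])
visual = np.array([0.0, 0.5, 1.0])
fused = fuse_masks(audio, visual)
```

With these assumed values, low visual support pushes the audio mask towards 0, high visual support pushes it towards 1, and the fused value always stays within [0, 1].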
Visual mask generation
AVDL evaluations
Long speech
The first AV atom represents the utterance "marine" /məri:n/, while the second one denotes the utterance "port" /pɔ:t/.
Data: LILiR Twotalk database, Sheerman-Chase et al. 2011
Lip tracking: Ong et al. 2008
Demonstration of TF mask fusion in AVDL-BSS
Why do we choose the power-law combination instead of, e.g., a linear combination?
AVDL-BSS evaluations----SDR
[Figure panels: noise-free; 10 dB Gaussian noise]
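For orientation, a signal-to-distortion ratio can be computed in its simplest form as the energy of the reference source over the energy of the estimation error, in dB. The sketch below uses that simplified definition; evaluation toolkits typically decompose the error further, and the slides do not say which implementation was used. The function name `sdr_db` and the toy signals are assumptions.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = estimate - reference
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps)
    )

# Toy example: the estimate is the reference plus a constant offset of 0.1.
reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = reference + 0.1
score = sdr_db(reference, estimate)
```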
AVDL-BSS evaluations----OPS-PEASS
[Figure panels: noise-free; 10 dB Gaussian noise]
Some examples
[Audio examples A to D, each for: Mixture, Ideal, Mandel, AV-LIU, AVDL-BSS, Rivet, AVMP-BSS]
Conclusions
• AVDL offers an alternative and effective method for modelling the AV coherence within the audio-visual data.
• The mask derived from AVDL can be used to improve the BSS performance for separating reverberant and noisy speech mixtures.
Future work
• To achieve dictionary adaptation and source separation simultaneously.
References
Q. Liu, W. Wang, P. Jackson, M. Barnard, J. Kittler, and J.A. Chambers, "Source separation of convolutive and noisy mixtures using audio-visual dictionary learning and probabilistic time-frequency masking", IEEE Transactions on Signal Processing, vol. 61, no. 22, pp. 5520-5535, 2013.
Q. Liu, W. Wang, and P. Jackson, "Use of bimodal coherence to resolve spectral indeterminacy in convolutive BSS", Signal Processing, vol. 92, no. 8, pp. 1916-1927, 2012.
Q. Liu, W. Wang, P. Jackson, and M. Barnard, "Reverberant speech separation based on audio-visual dictionary learning and binaural cues", in Proc. SSP, 2012.