ESTIMATING THE GOOD, THE BAD AND THE UGLY IN SPEECH RECORDINGS
Alankar A. Atreya , Oscar Forth, Samuel Kent, Finnian Kelly and Anil Alexander Research and Development
Oxford Wave Research Ltd
IAFPA 2018, Huddersfield, UK
THE GOOD THE BAD THE UGLY
and
MOTIVATION
Audio recordings commonly encountered in forensic and investigative casework can vary widely in both subjective quality and relative suitability for automatic speech and speaker recognition.
In performing a forensic speaker comparison, it is important to be aware of these variations and their potential influence on the results.
Standardising the often subjective judgments about audio quality is important and an improvement to the tools available to forensic phoneticians
We consider how objective measurements can be used to triage a dataset to inform automatic speaker recognition system performance.
QUALITATIVE EXTRINSIC VARIATIONS
Variations unrelated to the speech content or speaker identity include:
File format, i.e. codecs, sampling rates, bit rates
Acoustic content within the files, the presence of interfering tones, background noise or music
Recording conditions such as the channel and microphone
Net amount of speech present
JUICER is a tool for large-scale audio quality
measurements allowing quick identification of
regions suited for speech and audio processing.
Identifies and measures within and across files
• Net Speech (after Voice activity detection)
• Voiced speech
• Clipping
• VAD-based SNR
• WADA SNR
• DC-Offsets
• Intrusive Tone Frequencies
Provides detailed information about file types,
codecs, bit depth, sample rate etc.
OBJECTIVE AUDIO QUALITY MEASUREMENTS
OBJECTIVE AUDIO QUALITY MEASUREMENTS
JUICER: Joint Union of Intelligent Classifier Extracted Regions
USING GLOBAL OR TEMPORAL MEASURES Global measures provide one value for the whole file submitted
Temporal measures continuously values and assignments across files
In this study we use global measures for file quality for triage
SIGNAL-TO-NOISE RATIO
Signal to Noise Ratio or SNR is often used as the best means of characterising
HOWEVER SNR is tricky to measure if you don’t have a reference clean signal.
It is generally estimated by assuming non-speech output of the voice activity detection is noise and then calculating the relative energy of the signal to the noise.
This method can suffer significantly if the initial non-speech assignments include speech.
This is why you are sometimes asked not pre-clean the files before submission
VAD
SNR =
Signal
Noise
WADA SIGNAL-TO-NOISE RATIO
WADA is an non-destructive algorithm for estimating the signal-to-noise ratio (SNR)
WADA stands for Waveform Amplitude Distribution Analysis.
In the NIST 2016 evaluation, WADA-SNR was one of the recommended SNR measurement techniques
The algorithm “assumes that the amplitude distribution of clean speech can be approximated by the Gamma distribution with a shaping parameter of 0.4, and that an additive noise signal is Gaussian”
† Kim, C., & Stern, R. M. (2008). Robust signal-to-noise ratio estimation
based on waveform amplitude distribution analysis. Annual Conference
of the International Speech Communication Association 2008,
Brisbane, Australia.
Speech/Noise Amplitude Distributions
VOXCELEB DATABASE DESCRIPTION
VoxCeleb (1) (Nagrani et al. 2017) is a large-scale ‘in the wild’ database of over a thousand celebrity speakers collected from Youtube by an Oxford University group.
100,000 utterances from 1,251 celebrities from videos uploaded to Youtube – gender balanced (55% male) – “wide range of ethnicities, accents, professions, ages”
Relatively unconstrained database with a variety of recording conditions and various intrusive noises, making it an ideal test-bed for analysing the effect of objective quality measures on speaker comparisons.
VOXCELEB - AUDIO DURATION DISTRIBUTION
VOXCELEB – SOME QUALITY METRICS
VAD-based SNR distributions WADA-based SNR
THE GOOD, BAD AND THE UGLY
Three groups selected based on their audio quality (WADA SNRs) from 10800 files from 1174 Voxceleb speakers*
SNR<11.5 dB SNR>23.5 dB
Medium
Poor Good
THE GOOD, BAD AND THE UGLY
Three groups selected based on their audio quality (WADA SNRs) from 10800 files from 1174 Voxceleb speakers*.
*As greater than 70% of files fell into the medium group, a subset of the medium group was randomly selected to provide a more balanced comparison of the 3 groups.
SNR<11.5 dB SNR>23.5 dB
Medium
Poor Good
720 speakers
1322 files
772 speakers
1330 files
669 speakers
1300 files
THE GOOD, BAD AND THE UGLY
SNR<11.5 dB SNR>23.5 dB
Medium
Poor Good
720 speakers
1322 files
772 speakers
1330 files
669 speakers
1300 files
feature
extraction
speech
UBM
i-vector
High-dimensional,
universal speaker space
Low-dimensional,
speaker-specific
space
15 of 15
IVECTOR SCHEMATIC
I-VECTOR BASED APPROACH (VOCALISE) Group EER%
Poor 8.63
Medium 5.10
Good 3.78
Group EER%
Poor vs Medium 7.64
Good vs Poor 7.70
Good vs Medium 4.87
Cross Condition
Same Condition
ZOOPLOT I-VECTORS (POOR VS POOR)
EER: 8.63%
SPEAKER EMBEDDINGS – THE FUTURE
X-vectors are a revolutionary new technology that will probably replace i-vectors.
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” ICASSP, 2018.
It replaces i-vectors with a embeddings extracted from a feed-forward Deep Neural Network (DNN)
Research into x-vectors is in its early stage today – but is showing significant improvements, particularly with mismatched recording conditions.
feature
extraction
speech
DNN
19 of 19
X-VECTOR SCHEMATIC
High-dimensional,
universal speaker space
x-vector
Low-dimensional,
speaker-specific
space
FROM I- TO X-VECTOR Group EER%
Poor 2.59
Medium 1.43
Good 0.96
Group EER%
Poor vs Medium 2.23
Good vs Poor 2.10
Good vs Medium 1.40
Same Condition
Cross Condition
Group EER%
Poor 8.63
Medium 5.10
Good 3.78
Group EER%
Poor vs Medium 7.64
Good vs Poor 7.70
Good vs Medium 4.87
Cross Condition
Same Condition
ZOOPLOT X-VECTORS (GOOD VS GOOD)
EER: 0.9%
CONCLUSIONS
There is a clear relationship between objective quality in terms
of WADA-SNR and speaker recognition performance.
It is important to use quality metrics as a gatekeeper to an
automatic speaker recognition system.
Use of quality-metrics and triage tools can help reduce the risk
of mislabelling in databases
There is a remarkable improvement going from i-vectors to x-
vectors but the trends in performance between groups is
consistent
REFERENCES
† Alexander, A., Forth, O., Atreya, A. A., and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.
† Kim, C., & Stern, R. M. (2008). Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. Annual Conference of the International Speech Communication Association 2008, Brisbane, Australia.
† Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: a large-scale speaker identification dataset. Annual Conference of the International Speech Communication Association 2017, Stockholm, Sweden.
† Rousseau, A., Deléglise, P., & Estève, Y. (2014). Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks. International Conference on Language Resources and Evaluation 2014, Reykjavik, Iceland.