Download - THE GOOD THE BAD d THE UGLY - oxfordwaveresearch.com · THE GOOD, BAD AND THE UGLY Three groups selected based on their audio quality (WADA SNRs) from 10800 files from 1174 Voxceleb

ESTIMATING THE GOOD, THE BAD AND THE UGLY IN SPEECH RECORDINGS

Alankar A. Atreya , Oscar Forth, Samuel Kent, Finnian Kelly and Anil Alexander Research and Development

Oxford Wave Research Ltd

IAFPA 2018, Huddersfield, UK

THE GOOD THE BAD THE UGLY

and

MOTIVATION

Audio recordings commonly encountered in forensic and investigative casework can vary widely in both subjective quality and relative suitability for automatic speech and speaker recognition.

In performing a forensic speaker comparison, it is important to be aware of these variations and their potential influence on the results.

Standardising the often subjective judgments about audio quality is important and an improvement to the tools available to forensic phoneticians

We consider how objective measurements can be used to triage a dataset to inform automatic speaker recognition system performance.

QUALITATIVE EXTRINSIC VARIATIONS

Variations unrelated to the speech content or speaker identity include:

File format, i.e. codecs, sampling rates, bit rates

Acoustic content within the files, the presence of interfering tones, background noise or music

Recording conditions such as the channel and microphone

Net amount of speech present

JUICER is a tool for large-scale audio quality

measurements allowing quick identification of

regions suited for speech and audio processing.

Identifies and measures within and across files

• Net Speech (after Voice activity detection)

• Voiced speech

• Clipping

• VAD-based SNR

• WADA SNR

• DC-Offsets

• Intrusive Tone Frequencies

Provides detailed information about file types,

codecs, bit depth, sample rate etc.

OBJECTIVE AUDIO QUALITY MEASUREMENTS

OBJECTIVE AUDIO QUALITY MEASUREMENTS

JUICER: Joint Union of Intelligent Classifier Extracted Regions

USING GLOBAL OR TEMPORAL MEASURES Global measures provide one value for the whole file submitted

Temporal measures continuously values and assignments across files

In this study we use global measures for file quality for triage

SIGNAL-TO-NOISE RATIO

Signal to Noise Ratio or SNR is often used as the best means of characterising

HOWEVER SNR is tricky to measure if you don’t have a reference clean signal.

It is generally estimated by assuming non-speech output of the voice activity detection is noise and then calculating the relative energy of the signal to the noise.

This method can suffer significantly if the initial non-speech assignments include speech.

This is why you are sometimes asked not pre-clean the files before submission

VAD

SNR =

Signal

Noise

WADA SIGNAL-TO-NOISE RATIO

WADA is an non-destructive algorithm for estimating the signal-to-noise ratio (SNR)

WADA stands for Waveform Amplitude Distribution Analysis.

In the NIST 2016 evaluation, WADA-SNR was one of the recommended SNR measurement techniques

The algorithm “assumes that the amplitude distribution of clean speech can be approximated by the Gamma distribution with a shaping parameter of 0.4, and that an additive noise signal is Gaussian”

† Kim, C., & Stern, R. M. (2008). Robust signal-to-noise ratio estimation

based on waveform amplitude distribution analysis. Annual Conference

of the International Speech Communication Association 2008,

Brisbane, Australia.

Speech/Noise Amplitude Distributions

VOXCELEB DATABASE DESCRIPTION

VoxCeleb (1) (Nagrani et al. 2017) is a large-scale ‘in the wild’ database of over a thousand celebrity speakers collected from Youtube by an Oxford University group.

100,000 utterances from 1,251 celebrities from videos uploaded to Youtube – gender balanced (55% male) – “wide range of ethnicities, accents, professions, ages”

Relatively unconstrained database with a variety of recording conditions and various intrusive noises, making it an ideal test-bed for analysing the effect of objective quality measures on speaker comparisons.

VOXCELEB - AUDIO DURATION DISTRIBUTION

VOXCELEB – SOME QUALITY METRICS

VAD-based SNR distributions WADA-based SNR

THE GOOD, BAD AND THE UGLY

Three groups selected based on their audio quality (WADA SNRs) from 10800 files from 1174 Voxceleb speakers*

SNR<11.5 dB SNR>23.5 dB

Medium

Poor Good


Three groups selected based on their audio quality (WADA SNRs) from 10800 files from 1174 Voxceleb speakers*.

*As greater than 70% of files fell into the medium group, a subset of the medium group was randomly selected to provide a more balanced comparison of the 3 groups.


Medium

Poor Good

720 speakers

1322 files

772 speakers

1330 files

669 speakers

1300 files



Medium

Poor Good

720 speakers

1322 files

772 speakers

1330 files

669 speakers

1300 files

feature

extraction

speech

UBM

i-vector

High-dimensional,

universal speaker space

Low-dimensional,

speaker-specific

space

15 of 15

IVECTOR SCHEMATIC

I-VECTOR BASED APPROACH (VOCALISE) Group EER%

Poor 8.63

Medium 5.10

Good 3.78

Group EER%

Poor vs Medium 7.64

Good vs Poor 7.70

Good vs Medium 4.87

Cross Condition

Same Condition

ZOOPLOT I-VECTORS (POOR VS POOR)

EER: 8.63%

SPEAKER EMBEDDINGS – THE FUTURE

X-vectors are a revolutionary new technology that will probably replace i-vectors.

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” ICASSP, 2018.

It replaces i-vectors with a embeddings extracted from a feed-forward Deep Neural Network (DNN)

Research into x-vectors is in its early stage today – but is showing significant improvements, particularly with mismatched recording conditions.

feature

extraction

speech

DNN

19 of 19

X-VECTOR SCHEMATIC

High-dimensional,

universal speaker space

x-vector

Low-dimensional,

speaker-specific

space

FROM I- TO X-VECTOR Group EER%

Poor 2.59

Medium 1.43

Good 0.96

Group EER%

Poor vs Medium 2.23

Good vs Poor 2.10

Good vs Medium 1.40

Same Condition

Cross Condition

Group EER%

Poor 8.63

Medium 5.10

Good 3.78

Group EER%

Poor vs Medium 7.64

Good vs Poor 7.70

Good vs Medium 4.87

Cross Condition

Same Condition

ZOOPLOT X-VECTORS (GOOD VS GOOD)

EER: 0.9%

CONCLUSIONS

There is a clear relationship between objective quality in terms

of WADA-SNR and speaker recognition performance.

It is important to use quality metrics as a gatekeeper to an

automatic speaker recognition system.

Use of quality-metrics and triage tools can help reduce the risk

of mislabelling in databases

There is a remarkable improvement going from i-vectors to x-

vectors but the trends in performance between groups is

consistent

REFERENCES

† Alexander, A., Forth, O., Atreya, A. A., and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.

† Kim, C., & Stern, R. M. (2008). Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. Annual Conference of the International Speech Communication Association 2008, Brisbane, Australia.

† Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: a large-scale speaker identification dataset. Annual Conference of the International Speech Communication Association 2017, Stockholm, Sweden.

† Rousseau, A., Deléglise, P., & Estève, Y. (2014). Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks. International Conference on Language Resources and Evaluation 2014, Reykjavik, Iceland.