intelligibility of speech in noise after lossy audio data compression

Gaston Hilkhuysen & Mark Huckvale

department of Speech, Hearing and Phonetic Sciences
University College London (UK)

Centre for Law Enforcement Audio Research

overview

- compression strategies
- observed intelligibilities
- intelligibility metrics
- conclusions

CODEC distortion in law enforcement

G711 a-law

8 kHz sampling rate

12 bits compressed into 8 bits

=> 64 kbs

[Figure: CODEC input/output characteristic; x-axis: input level [dBfs], -60 to 0; y-axis: output level [dBfs], -60 to 0]
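The compression into 8 bits is logarithmic companding. A minimal sketch, assuming the continuous A-law characteristic with A = 87.6; the G.711 standard itself uses a segmented approximation, which is not reproduced here. All function names are illustrative and do not come from the slides.

```python
import math

A = 87.6  # assumed A-law parameter (value commonly quoted for G.711)
LOG_TERM = 1.0 + math.log(A)

def alaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / LOG_TERM
    else:
        y = (1.0 + math.log(A * ax)) / LOG_TERM
    return math.copysign(y, x)

def alaw_expand(y: float) -> float:
    """Inverse of alaw_compress."""
    ay = abs(y)
    if ay < 1.0 / LOG_TERM:
        x = ay * LOG_TERM / A
    else:
        x = math.exp(ay * LOG_TERM - 1.0) / A
    return math.copysign(x, y)

def codec(x: float, bits: int = 8) -> float:
    """Compress, quantise uniformly to the given word length, expand."""
    steps = 2 ** (bits - 1) - 1          # 127 steps per polarity for 8 bits
    code = round(alaw_compress(x) * steps)
    return alaw_expand(code / steps)

# 8000 samples per second at 8 bits per sample
bit_rate = 8000 * 8
```

With these assumptions a low-level sample such as 0.001 survives the 8-bit round trip with a small relative error, whereas plain uniform 8-bit quantisation would round it to zero.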

GSM 06.10

Regular Pulse Excitation - Long Term Prediction

8 kHz sampling; 20 ms frames of 160 samples

8th-order LPC

33 bytes per frame => 13.2 kbs
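The frame arithmetic can be checked in a few lines. This is a sketch assuming the 260 coded bits per 20 ms frame of GSM full rate and storage padded to whole bytes (33 bytes per frame); both figures are assumptions brought in for illustration and are not stated on the slide.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples in one frame
frames_per_second = 1000 // FRAME_MS

def bit_rate_kbs(bits_per_frame: int) -> float:
    """Bit rate in kbs for a fixed number of bits per frame."""
    return bits_per_frame * frames_per_second / 1000.0

coded_rate = bit_rate_kbs(260)       # assumed coded bits per frame
padded_rate = bit_rate_kbs(33 * 8)   # assumed byte-padded frame of 33 bytes
```

Under these assumptions the coded rate is 13 kbs and the byte-padded rate is 13.2 kbs, which matches the two figures quoted for GSM 06.10 in this talk.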

MP3 / WMA

proprietary patented algorithms

psycho-acoustically based: only transmit what is perceptually relevant

MP3 CBR, 16 kbs, 16 kHz, mono

MP3 CBR, 48 kbs, 44.1 kHz, mono

WMA 9.2 CBR, 16 kbs, 16 kHz, 16 bit mono

WMA 9.2 CBR, 48 kbs, 44.1 kHz, 16 bit mono

experimental design

CODEC

• PCM (256 kbs)

• G711 (64 kbs)

• GSM 06.10 (13 kbs)

• MP3 (48/16 kbs)

• WMA (48/16 kbs)

noise types

• car cabin noise

• babble

(5 SNRs per noise type)

IEEE sentences

• 10 sentences per experimental condition

• 5 keywords per sentence

• 10 listeners per noise type

500 observations per experimental condition

70 experimental conditions in total
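The two totals follow from the design factors. A short sketch of the counting, assuming MP3 and WMA each contribute two CODEC settings (48 and 16 kbs), which gives seven CODEC conditions; the labels are illustrative.

```python
from itertools import product

codecs = [
    "PCM 256 kbs", "G711 64 kbs", "GSM 06.10 13 kbs",
    "MP3 48 kbs", "MP3 16 kbs", "WMA 48 kbs", "WMA 16 kbs",
]
noise_types = ["car cabin", "babble"]
snrs_per_noise_type = 5

# Every CODEC is crossed with every noise type and every SNR.
conditions = list(product(codecs, noise_types, range(snrs_per_noise_type)))

# Sentences x keywords per sentence x listeners per noise type.
observations_per_condition = 10 * 5 * 10
```

Here `len(conditions)` is 7 x 2 x 5 and `observations_per_condition` is 10 x 5 x 10, in line with the 70 conditions and 500 observations stated above.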

observations

[Figure: performance level [Bk] (-10 to 6; second axis: words correct [%], ticks at 20, 50, 80, 94) versus SNR [dB] (-21 to 0), in two panels: car cabin and babble. Curves: non-processed POTS, G711 a-law 64kbs, GSM 06.10 13kbs, MP3 16kbs, MP3 48kbs, WMA 16kbs, WMA 48kbs]
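The two vertical scales of these plots are consistent with the performance level being the base-2 log odds of the proportion of words correct, a unit often called the Berkson (Bk): 20, 50 and 80 % line up with -2, 0 and 2 Bk, and 94 % with roughly 4 Bk. The slides do not define Bk, so the sketch below rests on that assumption.

```python
import math

def percent_to_berkson(percent_correct: float) -> float:
    """Base-2 log odds of a percentage correct (assumed definition of Bk)."""
    p = percent_correct / 100.0
    if not 0.0 < p < 1.0:
        raise ValueError("percentage must lie strictly between 0 and 100")
    return math.log2(p / (1.0 - p))

def berkson_to_percent(bk: float) -> float:
    """Inverse mapping from Bk back to percentage correct."""
    return 100.0 / (1.0 + 2.0 ** (-bk))
```

One reason to plot on such a scale is that it stretches the regions near 0 % and 100 %, where percentages saturate.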

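coherence SII

[Figure: waveforms x = speech and y = speech + noise, divided into slices m; time axis 0 to 16 ms]

$$|\gamma(k)|^2 = \frac{\left|\sum_m X_m(k)\,Y_m^*(k)\right|^2}{\sum_m |X_m(k)|^2 \,\sum_m |Y_m(k)|^2}$$

$$SDR(j) = \frac{\sum_k W_j(k)\,|\gamma(k)|^2\,S_{yy}(k)}{\sum_k W_j(k)\,\left[1-|\gamma(k)|^2\right]S_{yy}(k)}$$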
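The coherence and SDR definitions on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes non-overlapping, unwindowed slices, a naive DFT, and a single flat band weighting W_j(k) = 1. The test signals are synthetic.

```python
import cmath
import math
import random

def dft(frame):
    """Naive DFT of a real frame; returns bins 0..N/2."""
    n = len(frame)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(frame))
            for k in range(n // 2 + 1)]

def coherence(x, y, slice_len):
    """Magnitude-squared coherence per bin k, accumulated over slices m."""
    n_bins = slice_len // 2 + 1
    sxy = [0j] * n_bins
    sxx = [0.0] * n_bins
    syy = [0.0] * n_bins
    for start in range(0, min(len(x), len(y)) - slice_len + 1, slice_len):
        xm = dft(x[start:start + slice_len])
        ym = dft(y[start:start + slice_len])
        for k in range(n_bins):
            sxy[k] += xm[k] * ym[k].conjugate()
            sxx[k] += abs(xm[k]) ** 2
            syy[k] += abs(ym[k]) ** 2
    gamma2 = []
    for k in range(n_bins):
        denom = sxx[k] * syy[k]
        gamma2.append(min(1.0, abs(sxy[k]) ** 2 / denom) if denom > 0 else 0.0)
    return gamma2, syy

def sdr(weights, gamma2, syy):
    """Coherent over incoherent output power for one band weighting."""
    num = sum(w * g * s for w, g, s in zip(weights, gamma2, syy))
    den = sum(w * (1.0 - g) * s for w, g, s in zip(weights, gamma2, syy))
    return math.inf if den == 0 else num / den

# Synthetic demonstration: the same "speech" with mild and heavy additive noise.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(512)]
y_mild = [v + 0.1 * random.gauss(0.0, 1.0) for v in x]
y_heavy = [v + 1.0 * random.gauss(0.0, 1.0) for v in x]
flat = [1.0] * (32 // 2 + 1)
sdr_mild = sdr(flat, *coherence(x, y_mild, 32))
sdr_heavy = sdr(flat, *coherence(x, y_heavy, 32))
```

The more heavily degraded signal is less coherent with the clean speech and therefore yields the lower SDR.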

[Figure: instantaneous speech level [dB re: RMS] (-40 to 20) over time, divided into high, mid and low level regions]

CSII performance function

rrank = 0.93

[Figure: performance level [Bk] (-10 to 6) versus CSIImid (0 to 0.5) for car cabin and babble noise. CODECs: MP3 16kbs, MP3 48kbs, WMA 16kbs, WMA 48kbs, G711 a-law 64kbs, GSM 06.10 13kbs, non-processed POTS]
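The slide reports a rank correlation between the metric and the observed performance level. A minimal sketch, assuming rrank denotes Spearman's rank correlation (the slides do not say which rank statistic was used).

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))
```

A rank statistic only asks whether the metric orders the conditions the same way the listeners do, so it does not depend on the shape of the performance function.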

STOI

spectrogram pixels

• speech: 1/3 octave x 13 ms

• speech + noise: 1/3 octave x 13 ms

equalize energies in corresponding 400 ms ranges

limit pixel difference to range <-15..15> dB

correlate regions

average correlations
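The steps listed on this slide can be sketched as follows. The code follows the slide wording only and is not the published STOI algorithm, whose clipping rule and frame sizes differ in detail. It assumes the 1/3-octave pixel magnitudes have already been computed, and every function name is illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def equalize_and_limit(clean, degraded, limit_db=15.0):
    """Match the energy of one band region, then bound each pixel to
    within +/- limit_db of the corresponding clean pixel."""
    e_clean = sum(c * c for c in clean)
    e_deg = sum(d * d for d in degraded)
    gain = math.sqrt(e_clean / e_deg) if e_deg > 0 else 0.0
    lo, hi = 10 ** (-limit_db / 20.0), 10 ** (limit_db / 20.0)
    return [min(max(gain * d, lo * c), hi * c) for c, d in zip(clean, degraded)]

def region_score(clean_bands, degraded_bands):
    """Mean over bands of the clean/degraded correlation in one region."""
    scores = [pearson(c, equalize_and_limit(c, d))
              for c, d in zip(clean_bands, degraded_bands)]
    return sum(scores) / len(scores)

def stoi_like(regions):
    """Average the region scores; regions is a list of
    (clean_bands, degraded_bands) pairs, one per ~400 ms range."""
    return sum(region_score(c, d) for c, d in regions) / len(regions)
```

Each band is a list of pixel magnitudes over the frames of one region. A degraded signal that is merely a scaled copy of the clean one scores 1, because the energy equalization removes the level difference before correlating.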

STOI performance functions

rrank = 0.96

[Figure: performance level [Bk] (-10 to 6) versus STOI (0 to 1) for car cabin and babble noise. CODECs: MP3 16kbs, MP3 48kbs, WMA 16kbs, WMA 48kbs, G711 a-law 64kbs, GSM 06.10 13kbs, non-processed POTS]

SNR in the envelope domain

speech only => speech envelope

speech & noise => speech & noise envelope

• 17 channel 1/3-octave filterbank

• Hilbert transform

• squaring device

• bandpass 1-32 Hz

• -3 dB/oct slope

correlate envelopes

$$SNR_{env} = 10\log_{10}\left(\frac{r^2}{1-r^2}\right)$$

SNR in envelope domain

speech envelope, (processed) noisy speech envelope => correlate

$$SNR_{env} = 10\log_{10}\left(\frac{r^2}{1-r^2}\right)$$

[Figure: SNRenv [dB] (-20 to 40) versus SNRaud [dB] (-15 to 15)]
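The mapping from envelope correlation r to SNRenv is a one-line formula. The sketch below assumes the two envelopes have already been extracted (filterbank, Hilbert transform and modulation filtering are not reproduced) and that r is their Pearson correlation; the function names are illustrative.

```python
import math

def envelope_correlation(env_clean, env_noisy):
    """Pearson correlation r between two equally long envelopes."""
    n = len(env_clean)
    mc, mn = sum(env_clean) / n, sum(env_noisy) / n
    cov = sum((c - mc) * (v - mn) for c, v in zip(env_clean, env_noisy))
    vc = sum((c - mc) ** 2 for c in env_clean)
    vn = sum((v - mn) ** 2 for v in env_noisy)
    return cov / math.sqrt(vc * vn)

def snr_env_db(r: float) -> float:
    """SNRenv = 10 log10(r^2 / (1 - r^2)), defined for 0 < |r| < 1."""
    r2 = r * r
    if not 0.0 < r2 < 1.0:
        raise ValueError("r must satisfy 0 < |r| < 1")
    return 10.0 * math.log10(r2 / (1.0 - r2))

def r_squared_from_snr_env(snr_db: float) -> float:
    """Inverse mapping: the r^2 that corresponds to a given SNRenv."""
    return 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
```

An r² of 0.5 gives 0 dB: half of the envelope variance is shared with the clean speech envelope and half is not.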

modelling NR effects

[Figure: performance level [Bk] (second axis: words correct [%]) versus SIImod and CSIImid, in panels SS, MMSE and SSA. Conditions: car non processed, car processed, babble non processed, babble processed]

noise reduction parameter optimization

[Figure: contour plots for car cabin and babble noise; panels: SIImod, CSIImid, observed, STOI; x-axis: reduce noise by [dB] (0 to 30); y-axis: noisiness [%] (0 to 80); contour levels from -4 to 0]

SIImod performance functions

rrank = 0.96

[Figure: performance level [Bk] (-10 to 6) versus SIImod (0 to 1) for car cabin and babble noise. CODECs: MP3 16kbs, MP3 48kbs, WMA 16kbs, WMA 48kbs, G711 a-law 64kbs, GSM 06.10 13kbs, non-processed POTS]

conclusions

CODECs alter intelligibility of noisy speech

• negatively/positively(!)

effects can be predicted

• STOI/SIImod

SIImod may be improved by improving SII predictions for speech in babble

• modulations deteriorate intelligibility

• dip listening improves intelligibility

Special thanks to CLEAR

Mike Brooks

Dushyant Sharma

Nikolai Gaubitch

Mark Wibrow

Patrick Naylor

Mark Huckvale