intelligibility of speech in noise after lossy audio data ...
Transcript of intelligibility of speech in noise after lossy audio data ...
intelligibility of speech in intelligibility of speech in noise after lossy audio
data compression
Gaston Hilkhuysen & Mark HuckvaleGaston Hilkhuysen & Mark Huckvale
department of Speech, Hearing and Phonetic SciencesUniversity College London (UK)
Centre for Law Enforcement Audio Research
overview
- compression strategies- compression strategies- observed intelligibilities- intelligibility metrics- conclusions- conclusions
G711 a-law
8 kHz sampling rate
12 bits compressed 12 bits compressed into 8 bits
=> 64 kbs CODEC-40
-30
-20
-10
0
outp
ut le
vel [
dBfs
]
-60 -40 -20 0-60
-50
input level [dBfs]
outp
ut le
vel [
dBfs
]
GSM 06.10
Regular Pulse Excitation - Long Term Prediction
8 kHz sampling; 20 m frames 8 kHz sampling; 20 m frames 160 samples
8 order LPC
32 bytes per frame=>13.2 kbs=>13.2 kbs
MP3 / WMA
propriety patented algorithms
psycho-acoustically basedpsycho-acoustically basedonly transmit perceptually relevant
MP3 CBR, 16 kbs, 16 kHz, mono
MP3 CBR, 48 kbs, 44.1 kHz, monoMP3 CBR, 48 kbs, 44.1 kHz, mono
WMA 9.2 CBR, 16 kbs, 16 kHz, 16 bit mono
WMA 9.2 CBR, 48 kbs, 44.1 kHz, 16 bit mono
experimental design
CODEC • PCM (256 kbs)
IEEE sentences• 10 sentences per experimental condition• PCM (256 kbs)
• G711 (64 kbs)
• GSM 06.10 (13 kbs)
• MP3 (48/16 kbs)
• WMA (48/16 kbs)
noise types
• 10 sentences per experimental condition
• 5 keywords per sentence
•10 listeners per noise type
500 observations per experimental condition
noise types• car cabin noise
• babble
(5 SNRs per noise type)
70 experimental conditions in total
observations
6
performancelevel[Bk]
word correct[%]
car cabin babble
-4
-2
0
2
4
6
20
50
80
94
-21 -18 -15 -12 -9 -6 -3 0-10
-8
-6
SNR [dB]
2
6
G711 a-law 64kbs
GSM 06.10 13kbs
non-processed POTS
observations
6
performancelevel[Bk] car cabin babble
word correct[%]
-4
-2
0
2
4
6
20
50
80
94
-21 -18 -15 -12 -9 -6 -3 0-10
-8
-6
SNR [dB]
2
6
G711 a-law 64kbs
GSM 06.10 13kbs
non-processed POTS
MP3 16kbsMP3 48kbs
observations
6
performancelevel[Bk] car cabin babble
word correct[%]
-4
-2
0
2
4
6
20
50
80
94
WMA 16kbsWMA 48kbs
-21 -18 -15 -12 -9 -6 -3 0-10
-8
-6
SNR [dB]
2
6
G711 a-law 64kbs
GSM 06.10 13kbs
non-processed POTS
MP3 16kbsMP3 48kbs
speech
∑2
slice m
coherence SII
∑ ∑
∑=
m mmm
mmm
kYkX
kYkX
k 22
2
*
2
)()(
)()(
)(γ
∑=
yyk
j kSkkWjSDR
)()()()(
2γ
slice m
0 2 4 6 8 10 12 14 16
x =
y =
ms
speech + noise
∑ −=
kyyj
k
kSkkWjSDR
)(])(1)[()( 2γ
coherence SII
20instantaneousspeech level
-20
-10
0
10speech level[dB re: RMS] high
mid
-40
-30
time
low
CSII performance function
4
6
rrank
= 0.93
performancelevel[Bk]
-4
-2
0
2
MP3 16kbsMP3 48kbsWMA 16kbs
car cabinbabble
0 0.1 0.2 0.3 0.4 0.5-10
-8
-6
CSIImid
WMA 16kbsWMA 48kbsG711 a-law 64kbsGSM 06.10 13kbsnon-processed POTS
STOI
spectrogram pixels
• 1/3 octave x 13 ms
speech
• 1/3 octave x 13 ms
equalize energies in corresponding 400 ms ranges
limit pixel difference to range <-15..15> dB
speech + noise
range <-15..15> dB
STOI
spectrogram pixels
• 1/3 octave x 13 ms
speech
• 1/3 octave x 13 ms
equalize energies in corresponding 400 ms ranges
limit pixel difference to range <-15..15> dB
speech + noise
range <-15..15> dB
correlate regions
average correlations
STOI performance functions
4
6MP3 16kbsMP3 48kbsWMA 16kbsWMA 48kbs
-4
-2
0
2WMA 48kbsG711 a-law 64kbsGSM 06.10 13kbsnon-processed POTS
car cabinbabble
0 0.2 0.4 0.6 0.8 1-10
-8
-6 rrank= 0.96
speech onlyspeech
envelope
SNR in the envelope domain
17 channel 1/3-octave filterbank
Hilbert transform
squaringdevice
bandpass 1-32 Hz
-3 dB/octslope
speech & noise speech &noiseenvelope
correlate
envelope
)1
(log102
2
10 r
rSNRenv −
=
SNR in envelope domain
(processed) noisySNRenv
40
speech envelope
(processed) noisyspeech envelope
correlate
env[dB]
-10
0
10
20
30
)1
(log102
2
10
r
rSNRenv −
=-15 -10 -5 0 5 10 15
-20
SNRaud [dB]
modelling NR effects
46
performancelevel [Bk]
SS MMSE SSA
words correct[%]
94
0.01 0.1 1-8-6-4-2024
0.1 1 0.1 1
SIImod
SS MMSE SSA
5020
80
26
94
246
SS MMSE SSA
car non processed car processed babble processedbabble non processed
-8-6-4-202CSIImid
noise reduction parameter optimization
80
0
SIImod CSIImidcobserved STOI
-4-3.5
-3-2 .5-2
-1 .5-1-0
.500
80
nois
ines
s [%
]
0
20
40
60
00
0 0
car cabin
-3
-2-1.5-1
-0 .5-3
-2 .5
-2.5-2-1.5
-1.5
-1
-1
-0.5
-0.5
0
0
0
0
0
00
0 10 20 30
car cabin
nois
ines
s [%
]
reduce noise by [dB]0 10 20 30
0
20
40
60
0
reduce noise by [dB]
0 10 20 30
00
reduce noise by [dB]
0 10 20 30
babble -3
-2.5
-2.5
-2
-2
-1.5
-1.5
-1
-1
-0 .5
-0.5
-0.5
0
0
00 0
reduce noise by [dB]0 10 20 30
babble
SIImod performance functions
4
6
rrank
= 0.96
performancelevel[Bk]
-4
-2
0
2
MP3 16kbsMP3 48kbs
car cabinbabble
0 0.2 0.4 0.6 0.8 1-10
-8
-6
SIImod
MP3 48kbsWMA 16kbsWMA 48kbsG711 a-law 64kbsGSM 06.10 13kbsnon-processed POTS
conclusions
CODECs alter intelligibility of noisy speech
• negatively/positively(!)• negatively/positively(!)
effects can be predicted
• STOI/SIImod
SIImod may be improved by improving SII predictions for speech in babblepredictions for speech in babble
• modulations deteriorate intelligibility
• dip listening imporves intelligibility