Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 1
Context and Motivation
• What : Find an efficient representation of speech so that it can be transmitted with a minimum bandwidth, depending on the desired quality.
• How : Exploit the redundancy of the speech waveform.
• Applications :
– Telephony, PBX
– Wireless/Cellular Telephony
– Internet Telephony
– Speech Storage (Automated call-centers)
– Text-to-speech (machine generated speech)
Types of Coders
Speech Coders
– Waveform Coders
» Time Domain: PCM, ADPCM
» Frequency Domain: Sub-band coders, Adaptive transform coder
– Vocoders
» Linear Predictive Coder
» Formant Coders
• Waveform-based coders: Preserve the signal waveform, not the speech.
– Pulse Code Modulation (PCM)
– Differential PCM (DPCM)
– Adaptive DPCM (ADPCM)
• Model-based coders: Preserve the speech, not the waveform.
– LPC-10(e), Federal Standard 1015
– Mixed Excitation Linear Prediction (MELP)
• Hybrid coders
– Code Excited Linear Prediction (CELP)
– Vector Sum Excited Linear Prediction (VSELP)
Quantization
• Amplitude quantizing: Mapping samples of a continuous-amplitude waveform to a finite set of amplitudes.
[Figure: a continuous signal and the corresponding discrete signal; vertical axis: quantized values]
Uniform Quantizer
• A uniform linear quantizer is called Pulse Code Modulation (PCM).
• Pulse code modulation (PCM): Encoding the quantized samples into a digital word (PCM word or codeword).
– Each quantized sample is digitally encoded into an $l$-bit codeword, where $L$ is the number of quantization levels and $l = \log_2 L$.
Quantization example
[Figure: amplitude x(t) versus time t, showing the sampled values x(nTs), the quantized values xq(nTs), the quantization levels, and the decision boundaries; Ts: sampling time]
PCM codeword   Quant. level
111             3.1867
110             2.2762
101             1.3657
100             0.4552
011            -0.4552
010            -1.3657
001            -2.2762
000            -3.1867
PCM sequence: 110 110 111 110 100 010 011 100 100 011
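The codeword table of this example can be reproduced with a short uniform quantizer. The following is a minimal Python sketch rather than the lecture's own code: the function name `pcm_encode` is hypothetical, and the step size 0.9105 is an assumption inferred from the spacing of the listed levels (for instance 2.2762 - 1.3657).

```python
import math

def pcm_encode(x, bits=3, step=0.9105):
    """Return (codeword, quantized level) for one sample x.

    Assumption: a uniform mid-riser quantizer whose step size is
    inferred from the spacing of the levels in the example table.
    """
    levels = 2 ** bits
    index = math.floor(x / step) + levels // 2   # cell number, 0 .. levels-1 inside the range
    index = max(0, min(levels - 1, index))       # saturate samples outside the range
    level = (index - levels // 2 + 0.5) * step   # midpoint of the selected cell
    return format(index, "0{}b".format(bits)), level

for sample in (2.0, -0.3, 10.0):
    print(sample, pcm_encode(sample))
```

A sample of 2.0 falls in the cell coded 110, and any sample beyond the top boundary saturates to codeword 111.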
Quantization Error
• Quantizing error: The difference between the output and input of a quantizer:
$e(t) = \hat{x}(t) - x(t)$
[Figure: process of quantizing noise. The input $x(t)$ passes through an AGC and a quantizer $y = q(x)$ to give $\hat{x}(t)$; the error $e(t) = \hat{x}(t) - x(t)$ is modeled as noise added to $x(t)$.]
Quantization error …
• Quantizing error:
– Granular or linear errors happen for inputs within the dynamic range of the quantizer
– Saturation errors happen for inputs outside the dynamic range of the quantizer
» Saturation errors are larger than linear errors
» Saturation errors can be avoided by proper tuning of the AGC
• Quantization noise variance:
$\sigma_q^2 = E\{[x - q(x)]^2\} = \sigma_{Lin}^2 + \sigma_{Sat}^2$
[Figure: quantizer staircase (value of output signal versus value of input signal) and the quantizing error (output - input)]
Quantization error
[Figure: three plots of amplitude (in quantization levels) versus time (0 to 1 ms), with vertical ranges of about ±3, ±15, and ±250 levels]
Quantization error
• The choice between a “mid-tread” and a “mid-riser” quantizer design is significant when large quantizing steps are used.
– Mid-tread has zero output unless the analog input exceeds a threshold set by the voltage step size (half a step for a rounding quantizer), so background noise is suppressed, but it produces worse quantizing error at low voice levels.
– Mid-riser produces worse idle channel noise by amplifying the minuscule background room noise or circuit noise, but has less average quantizing noise at low signal levels.
• Quantizing error can be characterized as an equivalent additive quantizing “noise”.
[Figure: quantizer output code value versus analog voltage for the mid-tread and mid-riser designs]
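The idle-channel behaviour of the two designs can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function names, the step size of 0.1, and the four low-level "background noise" samples are all hypothetical, and the mid-tread quantizer is assumed to round to the nearest level.

```python
import math

def mid_tread(x, step):
    """Round to the nearest multiple of step; zero is one of the output levels."""
    return step * math.floor(x / step + 0.5)

def mid_riser(x, step):
    """Output the midpoint of the cell containing x; zero is a decision boundary."""
    return step * (math.floor(x / step) + 0.5)

step = 0.1                                   # hypothetical step size
idle_noise = [0.01, -0.02, 0.015, -0.005]    # hypothetical low-level background noise

tread_out = [mid_tread(v, step) for v in idle_noise]
riser_out = [mid_riser(v, step) for v in idle_noise]
print("mid-tread:", tread_out)   # stays at zero: the idle noise is suppressed
print("mid-riser:", riser_out)   # toggles between -step/2 and +step/2
```

With these inputs the mid-tread output stays at zero, while the mid-riser output jumps between the two levels nearest zero, which is the idle channel noise described in the bullets.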
Quantization error
– The quantization noise is characterized as a realization of a stationary random process q in which each of the random variables q(n) has a uniform pdf:
$pdf(q) = 1/\Delta$ for $-\Delta/2 \le q \le \Delta/2$, and 0 otherwise
$\sigma_q^2 = E\{[q(n)]^2\} = \int_{-\infty}^{\infty} q^2 \cdot pdf(q) \, dq$
» where the step size of the quantizer is $\Delta = A_{max} / 2^B$
B: Number of bits
Quantization Error
– $A_{max}$: maximum swing of the signal, with step size $\Delta = A_{max} / 2^B$.
– The mean square value of the quantization error is:
$E[q^2(n)] = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} q^2 \, dq = \frac{1}{\Delta} \cdot \left. \frac{q^3}{3} \right|_{-\Delta/2}^{\Delta/2} = \frac{\Delta^2}{12} = \frac{A_{max}^2}{12 \times 2^{2B}}$
– For the case of $A_{max} = 1$, the mean square value of the quantization noise in dB is:
$10 \log_{10} \frac{\Delta^2}{12} = 10 \log_{10} \frac{2^{-2B}}{12} = -6B - 10.8$ dB
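The $\Delta^2/12$ result can be checked numerically. The sketch below is an illustration, not part of the lecture: it assumes a mid-riser quantizer spanning a swing of `swing` centered on zero, drives it with a dense, evenly spaced ramp (so the error is spread uniformly over each cell), and compares the measured mean-square error with the formula. All function names are hypothetical.

```python
import math

def measured_noise_power(bits, swing=1.0, points_per_cell=1000):
    """Mean-square error of a uniform mid-riser quantizer driven by a dense ramp."""
    levels = 2 ** bits
    step = swing / levels                         # Delta = A_max / 2^B
    n = levels * points_per_cell
    total = 0.0
    for i in range(n):
        x = -swing / 2 + swing * (i + 0.5) / n    # evenly spaced inputs inside the range
        xq = (math.floor(x / step) + 0.5) * step  # midpoint of the cell containing x
        total += (xq - x) ** 2
    return total / n

def predicted_noise_db(bits, swing=1.0):
    """10*log10(Delta^2 / 12) for Delta = swing / 2^bits."""
    step = swing / 2 ** bits
    return 10 * math.log10(step ** 2 / 12)

bits = 4
print(measured_noise_power(bits), (1.0 / 2 ** bits) ** 2 / 12)
print(predicted_noise_db(8))   # close to -6*8 - 10.8 dB
```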
Quantization SNR
When the quantized sample is expressed in binary form,
$L = 2^B$, where $B$ is the number of bits per sample; $B = \log_2 L$
$\Delta = \frac{A_{max}}{2^B} = \frac{2 m_{max}}{2^B}$
$\sigma_Q^2 = \frac{1}{3} m_{max}^2 \, 2^{-2B}$
Let $P$ denote the average power of $m(t)$:
$\Rightarrow (SNR)_o = \frac{P}{\sigma_Q^2} = (3 \cdot 2^{2B}) \frac{P}{m_{max}^2}$
$(SNR)_{dB} = 10 \log_{10} \left[ (3 \cdot 2^{2B}) \frac{P}{m_{max}^2} \right] = 6B + 10 \log_{10} \frac{3P}{m_{max}^2}$
6 dB per bit
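The SNR expression is easy to evaluate directly. The following Python sketch uses a hypothetical helper name, `quantization_snr_db`, and takes a full-scale sine wave (average power $m_{max}^2 / 2$) as an example input; that choice of test signal is an assumption made for illustration.

```python
import math

def quantization_snr_db(bits, signal_power, m_max):
    """(SNR)_dB = 10*log10(3 * 2^(2B) * P / m_max^2)."""
    return 10 * math.log10(3 * 2 ** (2 * bits) * signal_power / m_max ** 2)

m_max = 1.0
sine_power = m_max ** 2 / 2          # average power of a full-scale sine wave
for bits in (8, 12, 16):
    print(bits, round(quantization_snr_db(bits, sine_power, m_max), 2))

# Each extra bit adds 10*log10(4), about 6.02 dB.
gain_per_bit = quantization_snr_db(9, sine_power, m_max) - quantization_snr_db(8, sine_power, m_max)
print(round(gain_per_bit, 2))
```

For the full-scale sine the second term evaluates to about 1.76 dB, so the result follows the familiar 6.02 B + 1.76 dB rule.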
How many bits?
• 16-bit resolution is more than is needed for telephone purposes.
– The voice waveform has already been band-limited to ~3.5 kHz bandwidth
– Filter imperfections add noise at about -30 dB
– The carbon microphone is not high-fidelity
– Extra bits cost more in hardware, in precision of design and manufacture, and in transmission.
• Empirical listener testing indicates that about 12-13 bits of uniform resolution is adequate
– No perception of degradation in telephone voice quality
• Logarithmically compressed (“companded”) steps at low level permit equivalent quality with even fewer bits
Types of Quantization
– Uniform (linear) quantizing:
• No assumption about amplitude statistics and correlation properties of the input.
• Robust to small changes in input statistics by not being finely tuned to a specific set of input parameters
• Simply implemented
– Non-uniform quantizing:
• Uses the input statistics to tune the quantizer parameters
• Larger SNR than uniform quantizing with the same number of levels
• Non-uniform intervals in the dynamic range with the same quantization noise variance
• Commonly used for speech
Statistics of Speech Signals
• In speech, weak signals are more frequent than strong ones.
• Using equal step sizes (uniform quantizer) gives low $(S/N)_q$ for weak signals and high $(S/N)_q$ for strong signals.
– Thus, adjusting the step size of the quantizer by taking into account the speech statistics improves the SNR over the input range.
[Figure: probability density function versus normalized magnitude of speech signal]
Non-Uniform Quantizer
[Figure: uniform and non-uniform transfer characteristics (output signal versus input signal), with the corresponding uniform and non-uniform error characteristics]
Uniform vs Non-Linear Quantizing
Non-uniform quantization
• It is done by uniformly quantizing the “compressed” signal.
• At the receiver, an inverse compression characteristic, called “expansion”, is employed to avoid signal distortion.
compression + expansion → companding
[Figure: transmitter: $x(t)$ → Compress, $y = C(x)$ → $y(t)$ → Quantize → $\hat{y}(t)$ → Channel; receiver: Expand → $\hat{x}(t)$]
μ Law/A Law
• The μ-law algorithm (μ-law) is a companding algorithm, primarily used in the digital telecommunication systems of North America and Japan. Its purpose is to reduce the dynamic range of an audio signal. In the analog domain, this can increase the signal-to-noise ratio achieved during transmission, and in the digital domain, it can reduce the quantization error (hence increasing the signal-to-quantization-noise ratio).
• The A-law algorithm is used in the rest of the world.
• The μ-law algorithm provides a slightly larger dynamic range than A-law, at the cost of worse proportional distortion for small signals. By convention, A-law is used for an international connection if at least one country uses it.
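The two companding laws can be written out as continuous compress/expand pairs. The sketch below uses the standard textbook forms, $F(x) = \mathrm{sgn}(x) \ln(1 + \mu |x|) / \ln(1 + \mu)$ for μ-law and the two-segment A-law curve, with the commonly quoted constants μ = 255 and A = 87.6; inputs are assumed normalized to [-1, 1]. The function names are hypothetical, and this is the continuous characteristic, not the segmented 8-bit encoding used in practice.

```python
import math

MU = 255.0   # commonly quoted mu-law constant
A = 87.6     # commonly quoted A-law constant

def mu_compress(x, mu=MU):
    """sgn(x) * ln(1 + mu|x|) / ln(1 + mu), for |x| <= 1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress: sgn(y) * ((1 + mu)^|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def a_compress(x, a=A):
    """Linear below |x| = 1/A, logarithmic above."""
    ax = abs(x)
    if ax < 1.0 / a:
        y = a * ax / (1.0 + math.log(a))
    else:
        y = (1.0 + math.log(a * ax)) / (1.0 + math.log(a))
    return math.copysign(y, x)

def a_expand(y, a=A):
    """Inverse of a_compress."""
    ay = abs(y)
    if ay < 1.0 / (1.0 + math.log(a)):
        x = ay * (1.0 + math.log(a)) / a
    else:
        x = math.exp(ay * (1.0 + math.log(a)) - 1.0) / a
    return math.copysign(x, y)

for x in (0.01, 0.1, 1.0):
    print(x, round(mu_compress(x), 4), round(a_compress(x), 4))
```

A small input such as 0.01 is mapped to more than a fifth of the output range by either law, which is how the weak samples end up with finer effective steps after uniform quantization of the compressed signal.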
μ Law
A Law
μ Law/A Law
[Figure: μ-law and A-law characteristics plotted against |x|]
A Companding Law (Europe - ITU)
[Figure: A-law companding characteristic mapping an 11-bit input to an 8-bit output]
Companding in music recording
• Recall the human ear’s “masking” phenomenon
– A noise signal is not perceived as objectionable unless it is sufficiently large in relation to a desired sound present simultaneously
– Small noises are objectionable in a quiet library
– The same small noise is imperceptible at a rock concert!
• This principle is the basis of noise reduction systems like the Dolby™ system for sound recording
– The recording audio level is automatically increased for soft passages
– The playback level is automatically reduced to match, via an auxiliary control signal, so the desired signal has the original loudness. In the Dolby system, this is typically a low-frequency control signal.
– Therefore, noise added by the recording medium (e.g., magnetic tape “hiss”) is not noticeable during “soft” music intervals
– Dolby systems treat different audio frequency bands separately (high frequency is noisiest in magnetic tape), and use different types of auxiliary signals (Dolby B, C, etc.)
G.711
• The most commonplace codec
– Used in the circuit-switched telephone network
– PCM, Pulse-Code Modulation
• If uniform quantization
– 12 bits × 8 k samples/sec = 96 kbps
• Non-uniform quantization
– 64 kbps DS0 rate
– mu-law
» North America
– A-law
» Other countries, a little friendlier to lower signal levels
– An MOS of about 4.3
Differential PCM
• Basic idea
– Since speech signals are slowly varying, it is possible to eliminate the temporal redundancy by prediction
– For many natural signals, the difference between successive samples quantizes better than the samples themselves
– Even better, predict the current sample from the past one(s) and transmit the prediction error to the decoder on the other side.
• Linear prediction
– Fixed: the same predictor is used again and again
– Adaptive: the predictor is adjusted on the fly
First-order Prediction
- Encoding: x1 x2 … … xN
$e_1 = x_1$
$e_n = x_n - x_{n-1}$, n = 2,…,N
- Decoding: e1 e2 … … eN
$x_1 = e_1$
$x_n = e_n + x_{n-1}$, n = 2,…,N
[Figure: DPCM loop. Encoder: the delayed sample $x_{n-1}$ (delay D) is subtracted from $x_n$ to give $e_n$. Decoder: $e_n$ is added to the delayed output $x_{n-1}$ to recover $x_n$.]
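The encoding and decoding recursions translate directly into code. This is a minimal Python sketch with hypothetical function names; it has no quantizer, so the round trip is lossless.

```python
def dpcm_encode(samples):
    """e_1 = x_1, e_n = x_n - x_(n-1) for n >= 2."""
    errors = []
    previous = 0                 # starting from 0 makes e_1 equal to x_1
    for x in samples:
        errors.append(x - previous)
        previous = x
    return errors

def dpcm_decode(errors):
    """x_1 = e_1, x_n = e_n + x_(n-1) for n >= 2."""
    samples = []
    previous = 0
    for e in errors:
        previous = e + previous
        samples.append(previous)
    return samples

x = [90, 92, 91, 93, 93, 95]
e = dpcm_encode(x)
print(e)                  # differences are much smaller than the samples
print(dpcm_decode(e))     # reproduces x exactly
```

After the first sample, the values to be transmitted span only a few units, which is why the differences quantize better than the samples themselves.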
Open-loop DPCM
[Figure: the encoder forms $e_n = x_n - x_{n-1}$ using a delay D and quantizes it with Q to give $\hat{e}_n$; the decoder reconstructs $\hat{x}_n = \hat{e}_n + \hat{x}_{n-1}$ using a delay D]
Note:
• Prediction is based on the past unquantized sample
• Quantization is located outside the DPCM loop
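DPCM
Bring the quantizer into the ‘prediction loop’
[Figure: coder: the prediction $\hat{x}(n-1)$ is subtracted from $x(n)$ to give $e(n)$, which is quantized to $\hat{e}(n)$ and sent over the communication channel; a predictor fed by $\hat{x}(n) = \hat{e}(n) + \hat{x}(n-1)$ closes the loop. Decoder: the received $\tilde{e}(n)$ is added to the predictor output $\hat{x}(n-1)$ to give $\hat{x}(n)$.]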
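Numerical Example
Quantizer: $Q(x) = [x/3] \cdot 3$, where $[\cdot]$ rounds to the nearest integer
$x(n)$:        90 92 91 93 93 95 …
$e(n)$:        90  2 -2  3  0  2
$\hat{e}(n)$:  90  3 -3  3  0  3
$\hat{x}(n)$:  90 93 90 93 93 96 …
[Figure: the encoder forms a - b (sample minus prediction) ahead of Q; the decoder forms a + b (quantized error plus prediction)]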
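The four rows of this example can be regenerated with a closed-loop DPCM coder. The sketch below is illustrative: the function names are hypothetical, the first sample is passed through unquantized as in the table, and the quantizer is assumed to round to the nearest multiple of 3.

```python
import math

def quantize(e, step=3):
    """Q(x) = [x/step] * step, rounding to the nearest integer."""
    return step * math.floor(e / step + 0.5)

def dpcm_closed_loop(samples, step=3):
    """Return (e, e_hat, x_hat), predicting from the previous reconstructed sample."""
    errors = [samples[0]]            # first sample is sent as is
    quantized = [samples[0]]
    recon = [samples[0]]
    for x in samples[1:]:
        e = x - recon[-1]            # prediction uses the decoder's own output
        e_hat = quantize(e, step)
        errors.append(e)
        quantized.append(e_hat)
        recon.append(recon[-1] + e_hat)
    return errors, quantized, recon

x = [90, 92, 91, 93, 93, 95]
e, e_hat, x_hat = dpcm_closed_loop(x)
print(e)       # [90, 2, -2, 3, 0, 2]
print(e_hat)   # [90, 3, -3, 3, 0, 3]
print(x_hat)   # [90, 93, 90, 93, 93, 96]
```

Because the encoder predicts from the reconstructed samples, the reconstruction error never exceeds the quantizer's own error (one unit here) and does not accumulate.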
DPCM
A: $e_n = x_n - \hat{x}_{n-1}$
B: $\hat{x}_n = \hat{e}_n + \hat{x}_{n-1}$
$\Rightarrow x_n - \hat{x}_n = e_n - \hat{e}_n$
The distortion due to quantization of the prediction residue $e_n$ is identical to the distortion introduced to the original sample $x_n$
[Figure: DPCM coder (quantizer and predictor loop) with points A and B marked]
Higher Order Prediction
- Encoding
initialize: $e_n = x_n$, n = 1,…,k
prediction: $e_n = x_n - \sum_{i=1}^{k} a_i x_{n-i}$, n = k+1,…,N
- Decoding
initialize: $x_n = e_n$, n = 1,…,k
prediction: $x_n = e_n + \sum_{i=1}^{k} a_i x_{n-i}$, n = k+1,…,N
DPCM
Prediction of the current sample from past estimated ones:
$\tilde{x}_n = \sum_{i=1}^{k} a_i \tilde{x}_{n-i}$
Prediction error: $e_n = x_n - \tilde{x}_n$
Est. of current sample = predicted + prediction error
[Figure: coder and decoder, each with a predictor in the loop; the quantized error $\hat{e}(n)$ is sent over the communication channel]
Linear predictor coefficients
$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(K-1) \\ R_n(1) & R_n(0) & \ddots & \vdots \\ \vdots & \ddots & \ddots & R_n(1) \\ R_n(K-1) & \cdots & R_n(1) & R_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(K) \end{bmatrix}$
minimize $MSE = \sum_{n=1}^{N} e^2(n) = \sum_{n=1}^{N} \left[ x(n) - \sum_{k=1}^{K} a_k x(n-k) \right]^2$
Note that in fixed prediction, the auto-correlation is calculated over the whole segment of speech (NOT short-time features)
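One common way to solve these Toeplitz normal equations is the Levinson-Durbin recursion; the slides do not say which solver is used, so this choice is an assumption. The Python sketch below uses hypothetical function names and computes the autocorrelation over the whole segment, as the note describes. The test signal, a decaying exponential $x(n) = 0.9^n$, is also an assumption chosen because its ideal first-order predictor is known ($a_1 = 0.9$).

```python
def autocorrelation(x, max_lag):
    """R(lag) = sum over the whole segment of x(n) * x(n + lag)."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i - k|) = R(i), i = 1..order. Returns (a_1..a_order, error)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)         # remaining prediction error energy
    return a[1:], err

x = [0.9 ** n for n in range(200)]   # test signal with known predictor a_1 = 0.9
r = autocorrelation(x, 2)
coeffs, err = levinson_durbin(r, 2)
print(coeffs)                        # close to [0.9, 0.0]
```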
Adaptive DPCM
• Forward adaptation
– The prediction parameters are estimated from the current speech data, which is available only at the transmitter. The quantized prediction coefficients are transmitted to the decoder as side information.
• Backward adaptation
– The parameters are estimated from past data, which is available at both transmitter and receiver; thus there is no need for side information (no overhead), but the operation is susceptible to transmission errors.
Forward / Backward Adaptation
Forward adaptive prediction | Backward adaptive prediction
Asymmetric complexity allocation (encoder>decoder) | Symmetric complexity allocation (encoder=decoder)
Overhead non-negligible | No overhead
Robust to errors | Sensitive to errors
More suitable for low-bit rate coding | More suitable for high-bit rate coding
Adaptive DPCM with Forward Adaptation
[Figure: encoder with an adaptive quantizer and an adaptive predictor of order p, driven by predictor adaptation and step size adaptation computed from the speech input; the adaptation parameters are sent as side info to the decoder, which uses Q-1 and an adaptive predictor to produce the speech output]
Adaptive DPCM with Backward Adaptation
Short and long-term ADPCM
If we wish to model both the short-term and the long-term prediction nature of speech, we can use a single predictor of order P:
$P(z) = \sum_{k=1}^{P} a_k z^{-k}$
But modeling the pitch periodicity of speech as well as the short-term redundancy would require a very large order, P = 50 to 100.
Instead, the predictor is split into two portions, one modeling the short-term redundancy of speech and one modeling the long-term redundancy due to pitch periodicity. The long-term predictor can be a single-coefficient filter of the form:
$A_L(z) = \beta \cdot z^{-M}$
Short and long-term ADPCM
β is a scaling factor that relates to the degree of periodicity of the waveform and M is the estimated pitch period (in samples). The predictor time response is a single impulse delayed by M samples and scaled by β. The synthesis filter is of the form:
$S_L(z) = \frac{1}{1 - A_L(z)} = \frac{1}{1 - \beta \cdot z^{-M}}$
Frequency response of the synthesis filter:
[Figure: |H(f)| versus frequency, 0 to 4000 Hz]
Peaks are spaced by 1/M in normalized frequency (Fs/M in Hz)
Width of the peaks is a function of β, which can be estimated as
$\beta = \frac{E[x(n) \, x(n-M)]}{E[x^2(n-M)]}$
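The estimate of β, together with a search for M, can be sketched in a few lines of Python. Everything here is illustrative: the function name, the lag search range, and the synthetic test signal (period 40 samples) are assumptions, and the lag is chosen by maximizing the usual prediction-gain criterion (correlation squared over energy), which the slides do not specify.

```python
import math

def long_term_predictor(x, min_lag, max_lag):
    """Pick the lag M maximizing corr^2 / energy, then beta = E[x(n)x(n-M)] / E[x^2(n-M)]."""
    best_lag, best_score, best_beta = min_lag, -1.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Sums over a fixed window replace the expectations.
        num = sum(x[n] * x[n - lag] for n in range(max_lag, len(x)))
        den = sum(x[n - lag] ** 2 for n in range(max_lag, len(x)))
        if den <= 0.0 or num <= 0.0:
            continue                  # no useful positive correlation at this lag
        score = num * num / den
        if score > best_score:
            best_lag, best_score, best_beta = lag, score, num / den
    return best_lag, best_beta

# Synthetic "voiced" signal with a period of exactly 40 samples.
period = 40
x = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period + 0.3)
     for n in range(400)]
M, beta = long_term_predictor(x, 20, 60)
print(M, beta)   # lag 40, beta close to 1 for a perfectly periodic signal
```

For real speech the periodicity is imperfect, so β comes out below 1 and the peaks of the synthesis filter response are correspondingly broader.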
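Short and long-term ADPCM
[Figure: encoder and decoder with long-term prediction. Encoder: the predictions from $A_L(z)$ and $A(z)$ are subtracted from $s(n)$ and the residual is quantized by Q to give $e_q(n)$. Decoder: $Q^{-1}$ recovers $e_q(n)$ and the outputs of $A_L(z)$ and $A(z)$ are added back to form the speech output.]
$A_L(z) = \beta \cdot z^{-M}$, $A(z) = \sum_{k=1}^{P} a_k z^{-k}$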
Higher-order LT predictor
The true pitch period is unlikely to be an exact multiple of 1/Fs. Thus, a predictor with multiple taps can be used to better synthesize the pitch periods:
$A_L(z) = \beta_1 \cdot z^{-M+1} + \beta_2 \cdot z^{-M} + \beta_3 \cdot z^{-M-1}$
Another way to deal with the varying degree of voicing across the spectrum (the lower spectrum is more harmonically pronounced than the higher) is to treat separate bands separately. This allows the pitch predictor in different bands to have a different β: high values, which give narrow peak bandwidths, for the lower frequencies, and lower values for the less periodic higher frequencies.
[Figure: the signal is split into a high band and a low band, each with its own LT prediction]
ITU-T G.726 - Adaptive Differential Pulse Code Modulation (ADPCM)
[Figure: G.726 encoder and decoder block diagrams]
ITU-T G.722 7 kHz Audio Coding within 64 kbit/s
Simultaneous speech and data transmission is possible, with data rate BD = 8 or 16 kbit/s and B + BD = 64 kbit/s
Overall signal delay: 1.5 ms
ADPCM (G.726-like) coding in both subbands, with w = 4, 5 or 6 (32, 40, 48 kbit/s) in the lower subband and w = 2 (16 kbit/s) in the higher subband
ITU waveform coders
G.711 (64 kbps)
G.722 (48 kbps)
G.726 (32 kbps)
http://www-lns.tf.uni-kiel.de/demo/demo_speech.htm
Delta Modulation : (DM)
• Predictor : one-step delay function; the prediction of $u(n)$ is the previous reconstructed sample $\tilde{u}(n-1)$
• Quantizer : 1-bit quantizer
$e(n) = u(n) - \tilde{u}(n-1)$
$\tilde{e}(n) = Q_{1\text{-bit}}[e(n)]$
$\tilde{u}(n) = \tilde{u}(n-1) + \tilde{e}(n)$
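A delta modulator built from the one-step-delay predictor and the 1-bit quantizer fits in a few lines. The Python sketch below is illustrative: the function names, the 100 Hz test sine, the 8 kHz sampling rate, and the two step sizes are assumptions chosen to show one case that tracks the input and one that runs into slope overload.

```python
import math

def dm_encode(samples, step):
    """1 bit per sample: 1 if the input is at or above the running reconstruction, else 0."""
    bits, recon = [], 0.0
    for u in samples:
        bit = 1 if u - recon >= 0 else 0       # 1-bit quantizer on e(n) = u(n) - u~(n-1)
        recon += step if bit else -step        # integrator: u~(n) = u~(n-1) +/- step
        bits.append(bit)
    return bits

def dm_decode(bits, step):
    """Rebuild the staircase approximation by integrating the +/- step decisions."""
    out, recon = [], 0.0
    for bit in bits:
        recon += step if bit else -step
        out.append(recon)
    return out

fs, f0 = 8000, 100                              # hypothetical sampling rate and tone
u = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]

def max_error(step):
    recon = dm_decode(dm_encode(u, step), step)
    return max(abs(a - b) for a, b in zip(u, recon))

print(max_error(0.1))    # step exceeds the largest sample-to-sample change: tracks the input
print(max_error(0.01))   # step too small: slope overload, large error
```

With the larger step the reconstruction stays within one step of the input, while the smaller step cannot keep up with the rising edge of the sine.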
Delta Modulation : (DM)
• Primary limitations of DM
– Slope overload : large jump region
» Max. slope = (step size) × (sampling freq.)
– Granularity noise : almost constant region
– Instability to channel noise
[Figure: input $u(n)$ and its staircase approximation $\tilde{u}(n)$]
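DM:
[Figure: DM coder and decoder. Coder: $e(n) = u(n) - \tilde{u}(n-1)$ is passed through a 1-bit quantizer to give $\tilde{e}(n)$; an integrator with a unit delay (Ts) accumulates $\tilde{e}(n)$ into $\tilde{u}(n)$. Decoder: the same integrator with a unit delay (Ts) rebuilds $\tilde{u}(n)$ from $\tilde{e}(n)$.]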
DM
Step size effect : the step size (together with the sampling frequency) governs
(i) slope overload
(ii) granular noise
DM – step size conditions
• The choice of step size is crucial to successful performance in DM. Since the output magnitude can change only by Δ in each sample interval T, Δ must be large enough to accommodate rapid changes:
$\frac{\Delta}{T} \ge \max \left| \frac{dx(t)}{dt} \right| \approx \max \frac{|x(n) - x(n-1)|}{T}$
For sinusoidal signals $x(t) = A \cdot \cos(2 \pi f_0 t)$:
$\frac{\Delta}{T} \ge \max \left| \frac{dx(t)}{dt} \right| = 2 \pi f_0 A$
$A \le \frac{\Delta}{2 \pi f_0 T} = \frac{\Delta \cdot f_s}{2 \pi f_0}$
DM – step size example
– Q. Consider a speech signal with a maximum frequency of 3.4 kHz and a maximum amplitude of 1 volt. This speech signal is applied to a delta modulator whose bit rate is set at 60 kbit/sec. What is an appropriate step size for the modulator?
– Bandwidth of the signal = 3.4 kHz
– Maximum amplitude = 1 volt
– Bit rate = 60 kbit/sec
– Sampling rate = 60 k samples/sec
– $\Delta \ge 2 \pi f_0 A T_s$
– STEP SIZE = 0.356 volts
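The arithmetic of this example can be checked in a couple of lines. The helper name `min_dm_step` is hypothetical; the only modeling assumption is that a delta modulator sends one bit per sample, so the sampling rate equals the bit rate.

```python
import math

def min_dm_step(f0_hz, amplitude, bit_rate_bps):
    """Smallest step avoiding slope overload for a sinusoid: 2*pi*f0*A*Ts."""
    fs = bit_rate_bps                 # one bit per sample, so fs equals the bit rate
    return 2 * math.pi * f0_hz * amplitude / fs

step = min_dm_step(3400.0, 1.0, 60000.0)
print(round(step, 3))   # 0.356
```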
Adaptive DM:
Variable Step Size
– Input signal is varying fast - Step size is increased
– Input signal is varying slowly - Step size is reduced
[Figure: adaptive DM loop with signals $X_{k+1}$, $E_{k+1}$ and $s_{k+1}$; an adaptive function computes the next step size $\Delta_{k+1}$ from the stored values $E_k$, $\Delta_k$ and $\Delta_{min}$; a unit delay provides $X_k$]
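One simple way to realize a variable step size is to grow the step when consecutive output bits agree (the input is moving fast in one direction) and shrink it when they alternate. The slides do not give the adaptation rule, so the rule below, the growth factor 1.5, the minimum step, and all function names are assumptions made for illustration. Because the rule depends only on the transmitted bits, the decoder can repeat it without side information.

```python
import math

def next_step(step, bit, prev_bit, step_min, growth):
    """Assumed rule: equal consecutive bits grow the step, alternating bits shrink it."""
    if prev_bit is None:
        return step
    if bit == prev_bit:
        return step * growth
    return max(step / growth, step_min)

def adm_encode(samples, step_min=0.01, growth=1.5):
    """Return (bits, reconstruction) for adaptive delta modulation."""
    bits, trace = [], []
    recon, step, prev_bit = 0.0, step_min, None
    for u in samples:
        bit = 1 if u >= recon else 0
        step = next_step(step, bit, prev_bit, step_min, growth)
        recon += step if bit else -step
        bits.append(bit)
        trace.append(recon)
        prev_bit = bit
    return bits, trace

def adm_decode(bits, step_min=0.01, growth=1.5):
    """Mirror of the encoder's reconstruction, driven only by the received bits."""
    trace = []
    recon, step, prev_bit = 0.0, step_min, None
    for bit in bits:
        step = next_step(step, bit, prev_bit, step_min, growth)
        recon += step if bit else -step
        trace.append(recon)
        prev_bit = bit
    return trace

u = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]   # hypothetical test input
bits, encoder_trace = adm_encode(u)
decoder_trace = adm_decode(bits)
print(decoder_trace == encoder_trace)   # the decoder follows the encoder exactly
```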