Speech Coding Techniques
Introduction
  Efficient speech-coding techniques
    Advantages for VoIP
    Digital streams of ones and zeros
    The lower the bandwidth, the lower the quality
    RTP payload types
  Processing power
    Better quality (for a given bandwidth) requires a more complex algorithm
    A balance between quality and cost
Voice Quality
  Bandwidth is easily quantified
  Voice quality is subjective
  MOS, Mean Opinion Score
    ITU-T Recommendation P.800
    Excellent – 5, Good – 4, Fair – 3, Poor – 2, Bad – 1
    A minimum of 30 people
    Listen to voice samples or take part in conversations
  P.800 recommendations
    The selection of participants
    The test environment
    Explanations to listeners
    Analysis of results
  Toll quality
    A MOS of 4.0 or higher
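The MOS arithmetic can be sketched in a few lines. The ratings below are made-up illustration data, not results from any real listening test, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical ratings from 30 listeners on the P.800 five-point scale
# (5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad).
ratings = [4, 5, 4, 3, 4, 4, 5, 4, 3, 4] * 3

def mean_opinion_score(scores):
    """Average the 1..5 ratings; reject anything outside the scale."""
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(scores)

mos = mean_opinion_score(ratings)
is_toll_quality = mos >= 4.0  # toll quality: a MOS of 4.0 or higher
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality}")
```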
About Speech
  Speech
    Air pushed from the lungs past the vocal cords and along the vocal tract
    The basic vibrations come from the vocal cords
    The sound is altered by the disposition of the vocal tract (tongue and mouth)
  Model the vocal tract as a filter
    The shape changes relatively slowly
  The vibrations at the vocal cords
    The excitation signal
  Speech sounds
    Voiced sounds
      The vocal cords vibrate open and closed
      Quasi-periodic pulses of air
      The rate of the opening and closing determines the pitch
    Unvoiced sounds
      Forcing air at high velocities through a constriction
      Noise-like turbulence
      Show little long-term periodicity
      Short-term correlations still present
    Plosive sounds
      A complete closure in the vocal tract
      Air pressure is built up and released suddenly
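Voice Sampling

Discrete-Time LTI Systems: The Convolution Sum
  Any input can be written as a sum of shifted, scaled impulses:
    $x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$
  The output of an LTI system with impulse response h[n] is the convolution sum:
    $y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$
  [Figure: example sequences x[n] and h[n] and the resulting output y[n], each plotted against n]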
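A minimal sketch of the convolution sum for finite sequences that start at n = 0. The sample values for x[n] and h[n] are assumed for illustration and the function name is hypothetical.

```python
def convolve(x, h):
    """Direct evaluation of y[n] = sum_k x[k] * h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, x_k in enumerate(x):
        for m, h_m in enumerate(h):
            # The term x[k] * h[m] contributes to the output sample n = k + m.
            y[k + m] += x_k * h_m
    return y

x = [0.5, 2.0]       # assumed input sequence
h = [1.0, 1.0, 1.0]  # assumed impulse response
y = convolve(x, h)
print(y)
```

Each input sample launches a scaled copy of h[n]; the output is the sum of those overlapping copies.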
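Frequency-Domain Representation of Sampling
  Impulse-train sampling:
    $s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$
    $x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$
  Spectrum of the impulse train, with $\Omega_s = 2\pi/T$:
    $S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$
  [Figure: the band-limited spectrum $X_c(j\Omega)$, nonzero only for $|\Omega| < \Omega_N$, and its copies repeated at multiples of $\Omega_s$; the copies do not overlap when $\Omega_s - \Omega_N > \Omega_N$]
  Nyquist sampling theorem
    A signal band-limited to $\Omega_N$ can be recovered from its samples if $\Omega_s > 2\Omega_N$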
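What goes wrong when the Nyquist condition is violated can be illustrated numerically. In this sketch the 8 kHz rate is the telephony rate quoted elsewhere in these slides; the 5 kHz tone is an assumed example. Sampled at 8 kHz, it produces exactly the same samples as a 3 kHz tone, so the two are indistinguishable after sampling.

```python
import math

FS_HZ = 8000.0              # sampling rate
F_HIGH_HZ = 5000.0          # assumed tone above FS_HZ / 2 = 4000 Hz
F_ALIAS_HZ = FS_HZ - F_HIGH_HZ  # 3000 Hz, where the tone folds back to

def sample_cosine(freq_hz, n_samples, fs_hz=FS_HZ):
    """Return n_samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

high = sample_cosine(F_HIGH_HZ, 16)
alias = sample_cosine(F_ALIAS_HZ, 16)
max_diff = max(abs(p - q) for p, q in zip(high, alias))
print(f"largest sample difference: {max_diff:.2e}")
```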
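Quantization (Scalar Quantization)
  Assume $|x[n]| \le A$; divide the range $[-A, A]$ into L quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$
    $J_k : [m_{k-1}, m_k]$, with $m_0 = -A$ and $m_L = A$
    $L = 2^R$
  Each quantization level $J_k$ is represented by a value $v_k$
    $S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$
  [Figure: the axis from $m_0 = -A$ to $m_L = A$, divided at $m_1, m_2, \ldots, m_{L-1}$, with the representative values $v_1, \ldots, v_L$ marked inside the intervals]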
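A minimal sketch of a uniform scalar quantizer built from these definitions. Two things are assumptions rather than content from the slides: the intervals J_k are given equal width, and each v_k is placed at the midpoint of its interval. The function name is hypothetical.

```python
def uniform_quantize(x, A=1.0, R=3):
    """Return (k, v_k): the interval index k in 1..L and its representative value.

    Assumes equal-width intervals over [-A, A] and midpoint representatives.
    """
    L = 2 ** R                 # number of quantization levels
    step = 2 * A / L           # width of each interval J_k
    x = max(-A, min(A, x))     # clip to the assumed range |x| <= A
    k = min(int((x + A) / step), L - 1) + 1
    v_k = -A + (k - 0.5) * step
    return k, v_k

k, v = uniform_quantize(0.3, A=1.0, R=3)
print(k, v)
```

With midpoint representatives the quantization error never exceeds half a step, which is why adding one bit (doubling L) halves the worst-case error.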
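Non-Uniform Quantization
  Concept: small quantization levels for small x, large quantization levels for large x
  Goal: constant $\mathrm{SNR}_Q$ for all x
  [Figure: the axis from $m_0 = -A$ to $m_L = A$, with the boundaries $m_1, m_2, \ldots$ packed more densely around 0]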
Companding
  Compressor: x[n] → F(x)
  Uniform quantization → …1101… → uniform decoder
  Expandor: $F^{-1}(x)$ → $\hat{x}[n]$
  Compressor + Expandor = Compandor
  F(x) specifies the non-uniform quantization characteristics
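Non-Uniform Quantization: μ-law and A-law
  x is the input magnitude normalized to [0, 1]; negative samples use the same curve with the sign restored
  A-law:
    $F(x) = \dfrac{A x}{1 + \ln A}$ for $0 \le x \le \frac{1}{A}$
    $F(x) = \dfrac{1 + \ln(A x)}{1 + \ln A}$ for $\frac{1}{A} \le x \le 1$
  μ-law:
    $F(x) = \dfrac{\ln(1 + \mu x)}{\ln(1 + \mu)}$ for $0 \le x \le 1$
  Typical values in practice: μ = 255, A = 87.6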
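The compressor and expandor curves can be sketched directly from the continuous μ-law and A-law formulas, using the typical values μ = 255 and A = 87.6. This shows only the continuous curves, not the segmented 8-bit tables that G.711 actually specifies; the function names are hypothetical, and the extension to negative inputs by odd symmetry is an assumption of the sketch.

```python
import math

MU = 255.0
A = 87.6

def mu_compress(x):
    """mu-law compressor F(x) for -1 <= x <= 1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def a_compress(x):
    """A-law compressor F(x): linear below 1/A, logarithmic above."""
    ax = abs(x)
    denom = 1.0 + math.log(A)
    y = A * ax / denom if ax < 1.0 / A else (1.0 + math.log(A * ax)) / denom
    return math.copysign(y, x)

def a_expand(y):
    """Inverse of a_compress."""
    ay = abs(y)
    denom = 1.0 + math.log(A)
    x = ay * denom / A if ay < 1.0 / denom else math.exp(ay * denom - 1.0) / A
    return math.copysign(x, y)

for sample in (0.001, 0.01, 0.1, 1.0):
    print(sample, round(mu_compress(sample), 4), round(a_compress(sample), 4))
```

Small inputs are stretched over a large part of the output range before uniform quantization, which is how companding gives small signals finer effective steps.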
Types of Speech Codecs
  Waveform codecs, source codecs (also known as vocoders), and hybrid codecs
Speech Source Model and Source Coding
  Excitation
    Voiced: periodic pulse train generator, period N
    Unvoiced: random sequence generator
    A voiced/unvoiced (v/u) switch selects the source; the gain G scales it to give the excitation u[n]
  Vocal Tract Model
    $G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
    u[n] → G(z), G(ω), g[n] → x[n]
  Excitation parameters
    v/u: voiced/unvoiced
    N: pitch for voiced
    G: signal gain
    Together these define the excitation signal u[n]
  Vocal tract parameters
    {a_k}: LPC coefficients
    Formant structure of speech signals
  A good approximation, though not precise enough
LPC Vocoder (Voice Coder)
  Encoder: x[n] → LPC analysis → {a_k}, N, G, v/u → encoder → …11011…
    N by pitch detection
    v/u by voicing detection
  Decoder (receiver): …11011… → decoder → {a_k}, N, G, v/u → excitation → g[n], G(z) → x[n]
  {a_k} can be non-uniformly or vector quantized to reduce the bit rate further
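A minimal sketch of the decoder side of such a vocoder: it rebuilds a frame from the transmitted parameters by driving an all-pole filter with either a pulse train (voiced) or noise (unvoiced). The function name, the unit-height pulses, the uniform noise source, and the example coefficient are all assumptions for illustration; the gain is applied to the excitation before filtering.

```python
import random

def lpc_synthesize(a, gain, voiced, pitch_period, n_samples, seed=0):
    """Synthesize x[n] = sum_{k=1..P} a[k-1] * x[n-k] + gain * u[n]."""
    rng = random.Random(seed)
    x = []
    for n in range(n_samples):
        if voiced:
            # Periodic pulse train: one unit pulse every pitch_period samples.
            u = 1.0 if n % pitch_period == 0 else 0.0
        else:
            # Noise-like excitation for unvoiced sounds.
            u = rng.uniform(-1.0, 1.0)
        past = sum(a_k * x[n - k] for k, a_k in enumerate(a, start=1) if n - k >= 0)
        x.append(past + gain * u)
    return x

voiced_frame = lpc_synthesize(a=[0.9], gain=1.0, voiced=True, pitch_period=4, n_samples=6)
unvoiced_frame = lpc_synthesize(a=[0.9], gain=0.5, voiced=False, pitch_period=0, n_samples=6)
print(voiced_frame)
```

Only the filter coefficients, pitch, gain, and voicing flag cross the channel; the waveform itself is regenerated, which is why vocoders reach very low bit rates at the cost of naturalness.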
G.711
  The most commonplace codec
    Used in the circuit-switched telephone network
  PCM, Pulse-Code Modulation
    With uniform quantization: 12 bits * 8 k samples/sec = 96 kbps
    With non-uniform quantization: 8 bits * 8 k samples/sec = 64 kbps, the DS0 rate
  μ-law
    North America
  A-law
    Other countries; a little friendlier to lower signal levels
  An MOS of about 4.3
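The PCM bit rates follow from bits per sample times samples per second. A short check, assuming 8000 samples per second and 8 bits per companded G.711 sample (the function name is hypothetical):

```python
SAMPLE_RATE_HZ = 8000  # telephony sampling rate

def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bits per second for fixed-size PCM samples."""
    return bits_per_sample * sample_rate_hz

uniform_bps = pcm_bit_rate(12)    # uniform quantization
companded_bps = pcm_bit_rate(8)   # non-uniform (companded) quantization, the DS0 rate
print(uniform_bps, companded_bps)
```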
ADPCM (Adaptive Differential PCM)
  DPCM and ADPCM
    ADPCM: adaptive prediction in DPCM, plus adaptive quantization
  Adaptive quantization
    The quantization level varies with the local signal level
    $\Delta[n] = a\,\sigma_x[n]$
    $\sigma_x[n]$: locally estimated standard deviation of x[n]
  G.721: ADPCM-coded speech at 32 kbps
  G.726 (A-law or μ-law)
    16, 24, 32, or 40 kbps
    MOS of 4.0 at 32 kbps
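The adaptive step-size rule, step[n] = a * sigma_x[n], can be sketched as follows. The window length, the constant a, the minimum step, and the function names are assumptions chosen for illustration; this is not the adaptation logic specified in G.721 or G.726.

```python
import math

def local_std(samples):
    """Standard deviation of a short block of samples."""
    m = sum(samples) / len(samples)
    return math.sqrt(sum((s - m) ** 2 for s in samples) / len(samples))

def adaptive_steps(x, window=4, a=0.5, min_step=1e-3):
    """Step size per sample, estimated from the previous `window` samples."""
    steps = []
    for n in range(len(x)):
        recent = x[max(0, n - window):n] or [0.0]  # no history yet at n = 0
        steps.append(max(a * local_std(recent), min_step))
    return steps

quiet_then_loud = [0.01, -0.01] * 4 + [1.0, -1.0] * 4
steps = adaptive_steps(quiet_then_loud)
print(steps[4], steps[15])
```

The step stays small while the signal is quiet and grows once the loud segment arrives, so the quantizer spends its levels where the signal currently lives.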
Analysis-by-Synthesis (AbS) Codecs
  Hybrid codecs
    Fill the gap between waveform and source codecs
    The most successful and commonly used
  Time-domain AbS codecs
    Not a simple two-state, voiced/unvoiced model
    Different excitation signals are attempted
    The one closest to the original waveform is selected
    MPE, Multi-Pulse Excited
    RPE, Regular-Pulse Excited
    CELP, Code-Excited Linear Predictive
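The analysis-by-synthesis loop can be sketched as an exhaustive search: synthesize each candidate excitation through the filter and keep the one closest to the target. The tiny codebook, the one-tap filter, and the function names are made up for illustration, and the error measure is plain squared error rather than the perceptually weighted error real codecs use.

```python
def synthesize(excitation, a):
    """All-pole filter: out[n] = sum_k a[k-1] * out[n-k] + excitation[n]."""
    out = []
    for n, u in enumerate(excitation):
        past = sum(a_k * out[n - k] for k, a_k in enumerate(a, start=1) if n - k >= 0)
        out.append(past + u)
    return out

def abs_search(target, codebook, a):
    """Return (index, error) of the excitation whose synthesis best matches target."""
    best_index, best_error = None, float("inf")
    for index, excitation in enumerate(codebook):
        candidate = synthesize(excitation, a)
        error = sum((t - c) ** 2 for t, c in zip(target, candidate))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error

# Hypothetical four-entry codebook of five-sample excitation vectors.
codebook = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
]
a = [0.5]
target = synthesize(codebook[2], a)  # a target the codebook can match exactly
index, error = abs_search(target, codebook, a)
print(index, error)
```

The encoder runs the decoder's synthesis inside its own search loop, which is what makes these codecs expensive to compute.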
G.728 LD-CELP
  CELP codecs
    A filter; its characteristics change over time
    A codebook of acoustic vectors
      A vector = a set of elements representing various characteristics of the excitation
    Transmit filter coefficients, gain, and a pointer to the vector chosen
  Low-Delay CELP
    Backward-adaptive coder
      Uses previous samples to determine filter coefficients
    Operates on five samples at a time
      Delay < 1 ms
    Only the pointer is transmitted
      1024 vectors in the codebook
      10-bit pointer (index)
    16 kbps
  LD-CELP encoder
    Minimizes a frequency-weighted mean-square error
  LD-CELP decoder
  An MOS score of about 3.9
    One-quarter of G.711 bandwidth
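The LD-CELP numbers fit together arithmetically. The check below assumes the 8 kHz telephony sampling rate; the constant names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000   # assumed telephony sampling rate
VECTOR_SAMPLES = 5      # the coder operates on five samples at a time
CODEBOOK_SIZE = 1024    # vectors in the codebook

index_bits = (CODEBOOK_SIZE - 1).bit_length()                # bits needed to address any vector
vectors_per_second = SAMPLE_RATE_HZ // VECTOR_SAMPLES
bit_rate_bps = index_bits * vectors_per_second
vector_duration_ms = 1000 * VECTOR_SAMPLES / SAMPLE_RATE_HZ  # buffering needed per vector

print(index_bits, bit_rate_bps, vector_duration_ms)
```

Because only the index is sent, ten bits every five samples gives the full bit rate, and buffering five samples takes well under 1 ms.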
G.723.1 ACELP
  6.3 or 5.3 kbps
    Both mandatory
    Can change from one to the other during a conversation
  The coder
    A band-limited input speech signal
    Sampled at 8 kHz, 16-bit uniform PCM quantization
    Operates on blocks of 240 samples at a time
    A look-ahead of 7.5 ms
    A total algorithmic delay of 37.5 ms, plus other delays
    A high-pass filter to remove any DC component
  G.723.1 Annex A
    Silence Insertion Description (SID) frames of size four octets
    The two LSBs of the first octet give the frame type:
      00  6.3 kbps   24 octets/frame
      01  5.3 kbps   20 octets/frame
      10  SID frame   4 octets/frame
  An MOS of about 3.8
    At least 37.5 ms delay
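A receiver can tell the three frame types apart from the two least significant bits of the first octet. A minimal sketch of that lookup follows; the function and table names are hypothetical, and the code 11 is rejected only because the table above does not define it.

```python
# Frame type keyed by the two LSBs of the first octet: (description, size in octets).
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}

def classify_frame(first_octet):
    """Return (description, frame size in octets) for a G.723.1 frame."""
    code = first_octet & 0b11  # keep only the two least significant bits
    if code not in FRAME_TYPES:
        raise ValueError(f"frame type code {code:02b} is not covered by this sketch")
    return FRAME_TYPES[code]

print(classify_frame(0xFC))  # low bits 00
print(classify_frame(0x02))  # low bits 10
```

Knowing the size from the first octet lets the receiver split a packet into frames even when the sender switches rate mid-conversation.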
G.729
  8 kbps
  Input frames of 10 ms, 80 samples at the 8 kHz sampling rate
  5 ms look-ahead
    Algorithmic delay of 15 ms
  An 80-bit frame for 10 ms of speech
  A complex codec
    G.729A (Annex A), a number of simplifications
    Same frame structure
    Encoder/decoder can be mixed between G.729 and G.729A
    Slightly lower quality
  G.729B (Annex B)
    VAD, Voice Activity Detection
      Based on analysis of several parameters of the input
      The current frame plus the two preceding frames
    DTX, Discontinuous Transmission
      Send nothing or send an SID frame
      The SID frame contains information to generate comfort noise
    CNG, Comfort Noise Generation
  G.729: an MOS of about 4.0
  G.729A: an MOS of about 3.7
Other Codecs
  CDMA QCELP, defined in IS-733
    Variable-rate coder
    Two most common rates
      The high rate, 13.3 kbps
      A lower rate, 6.2 kbps
    Silence suppression
    For use with RTP, RFC 2658
  GSM Enhanced Full-Rate (EFR)
    GSM 06.60
    An enhanced version of GSM Full-Rate
    ACELP-based codec
    The same bit rate and the same overall packing structure
      12.2 kbps
    Supports discontinuous transmission
    For use with RTP, RFC 1890
  GSM Adaptive Multi-Rate (AMR) codec
    GSM 06.90
    Eight different modes
      4.75 kbps to 12.2 kbps
      12.2 kbps, GSM EFR
      7.4 kbps, IS-641 (TDMA cellular systems)
    Can change the mode at any time
    Offers discontinuous transmission
    The coding choice of many 3G wireless networks
The MOS values are for laboratory conditions
  G.711 does not deal with lost packets
  G.729 can accommodate a lost frame by interpolating from previous frames
    But this causes errors in subsequent speech frames
Processing Power
  G.728 or G.729: 40 MIPS
  G.726: 10 MIPS
Cascaded Codecs
  E.g., G.711 stream -> G.729 encoder/decoder
    The quality might not even come close to that of G.729
  Each coder only generates an approximation of the incoming signal
Tones, Signals, and DTMF Digits
  The hybrid codecs are optimized for human speech
    Other data may need to be transmitted
    Tones: fax tones, dialing tone, busy tone
    DTMF digits for two-stage dialing or voice mail
  G.711 is OK
  With G.723.1 and G.729, tones can be unintelligible
  The ingress gateway needs to intercept the tones and DTMF digits
    Use an external signaling system
      Easy at the start of a call
      Difficult in the middle of a call
    Encode the tones differently from the speech
      Send them along the same media path
      An RTP packet provides the name of the tone and the duration
      Or, a dynamic RTP profile; an RTP packet containing the frequency, volume, and duration
    RFC 2198
      An RTP payload format for redundant audio data
      Sending both types of RTP payload
  RTP Payload Format for DTMF Digits
    An Internet Draft
    Both methods described before
    A large number of tones and events
      DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
    The named events
      E: the end of the tone, R: reserved
    Payload format
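The payload layout itself is not reproduced in this transcript. The sketch below assumes the four-octet named-event layout later published as RFC 2833: an 8-bit event code, the E bit, the R bit, a 6-bit volume, and a 16-bit duration in timestamp units. The function names are hypothetical.

```python
import struct

def pack_event(event, end, volume, duration):
    """Pack one named-event payload (layout assumed from RFC 2833)."""
    if not (0 <= event <= 255 and 0 <= volume <= 63 and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    flags_volume = (0x80 if end else 0x00) | volume  # E is the top bit; R stays 0
    return struct.pack("!BBH", event, flags_volume, duration)

def unpack_event(payload):
    """Unpack the four octets back into named fields."""
    event, flags_volume, duration = struct.unpack("!BBH", payload)
    return {
        "event": event,
        "end": bool(flags_volume & 0x80),
        "volume": flags_volume & 0x3F,
        "duration": duration,
    }

# DTMF digit 5, end of tone, volume field 10, 800 timestamp units (100 ms at 8 kHz).
payload = pack_event(event=5, end=True, volume=10, duration=800)
print(payload.hex(), unpack_event(payload))
```

Sending the event name instead of the audio keeps digits intact even when the speech itself goes through a low-bit-rate codec.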
Finis
Speech Source Model and Source Coding: Vocal Tract Model
  $x[n] - \sum_{k=1}^{P} a_k\,x[n-k] = u[n]$
  $G(z) = \dfrac{X(z)}{U(z)} = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
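The difference equation and the transfer function describe the same relationship in both directions: subtracting the prediction from x[n] recovers the excitation u[n], and filtering u[n] through G(z) restores x[n]. A minimal round-trip sketch, with assumed coefficients, assumed sample values, and hypothetical function names:

```python
def lpc_residual(x, a):
    """Inverse filter: u[n] = x[n] - sum_{k=1..P} a[k-1] * x[n-k], with x[n] = 0 for n < 0."""
    return [
        x_n - sum(a_k * x[n - k] for k, a_k in enumerate(a, start=1) if n - k >= 0)
        for n, x_n in enumerate(x)
    ]

def lpc_filter(u, a):
    """All-pole filter G(z): x[n] = u[n] + sum_{k=1..P} a[k-1] * x[n-k]."""
    x = []
    for n, u_n in enumerate(u):
        x.append(u_n + sum(a_k * x[n - k] for k, a_k in enumerate(a, start=1) if n - k >= 0))
    return x

a = [1.2, -0.5]                 # assumed LPC coefficients, P = 2
x = [1.0, 0.5, -0.25, 0.75]     # assumed speech samples
u = lpc_residual(x, a)
x_rebuilt = lpc_filter(u, a)
print(u)
print(x_rebuilt)
```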