Speech Processing
References
L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice-Hall, 1978.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
James H. McClellan, et al. Computer-Based Exercises for Signal Processing Using MATLAB 5. Prentice-Hall, 1998.
Spoken words are divided into phonemes. European languages have about forty phonemes each. Phonemes fall into two groups: voiced and unvoiced. Voiced sounds are “vowel-like” sounds produced by vibration in the throat. Unvoiced phonemes are “consonant-like” sounds produced by compressed air blown through the mouth. Although unvoiced phonemes are “consonant-like,” not all consonants are unvoiced: “s” is unvoiced, but “z” is voiced.
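Speech production may be modeled by the following block diagram:
Voiced: Pulse Train → Glottis → Vocal Tract → Lip Radiation
Unvoiced: Random Noise → Vocal Tract → Lip Radiation
Both branches feed the same vocal tract block; the Voiced or Unvoiced branch is selected according to the sound being produced.
(See Figure 10.5 in Computer-Based Exercises for Signal Processing.)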
The glottis (in the throat) produces “quasi-periodic” signals (like singing a long note). These signals are modeled as the output of the glottis block. These signals are then passed into a vocal tract block. The vocal tract models the mouth, nose and teeth. Finally the lip radiation block models the lips.
Unvoiced sounds have no glottal pulse component and can be modeled with the vocal tract and lip radiation blocks alone. To produce any sound, the input to these blocks cannot simply be a unit step; it must be a random process.
Let us assign symbols to these signals and systems:
Voiced: e[n] → G(z) → uG[n] → V(z) → uL[n] → R(z) → pL[n]
Unvoiced: random noise uG[n] → V(z) → uL[n] → R(z) → pL[n]
e[n] is a periodic pulse train.
G(z) is the transfer function of the glottis.
uG[n] is the glottis output.
V(z) is the transfer function of the vocal tract.
R(z) is the transfer function of the lips.
uL[n] is the output of the vocal tract.
pL[n] is the output of the lips.
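Read as a cascade, these definitions chain together as follows. This is only a restatement of the block diagram in transform notation; E(z), U_G(z), U_L(z), and P_L(z) denote the z-transforms of e[n], uG[n], uL[n], and pL[n].

```latex
\begin{aligned}
U_G(z) &= G(z)\,E(z), \\
U_L(z) &= V(z)\,U_G(z), \\
P_L(z) &= R(z)\,U_L(z) = R(z)\,V(z)\,G(z)\,E(z).
\end{aligned}
```

For unvoiced speech the first line is skipped and uG[n] is taken directly from the random noise source.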
The glottal transfer function G(z) will be represented by an exponential model:
$$G(z) = \frac{U_G(z)}{E(z)} = \frac{-a\,e\,\ln(a)\,z^{-1}}{(1 - a z^{-1})^{2}}.$$
The symbol e represents the base of natural logarithms. The parameter a is some value less than one that corresponds to the natural frequency of the glottis (which varies from speaker to speaker, man to woman, child to adult, etc.).
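As a rough sketch of how this model turns into filter coefficients, the MATLAB fragment below builds the numerator and denominator of G(z) for one value of a and checks the impulse response against the closed form g[n] = -e ln(a) n a^n, which follows from the standard transform pair for n a^n. The names numg and deng match the arrays mentioned elsewhere in these slides; the value a = 0.9 and the 100-sample length are illustrative choices, and this is not the code of voice.m itself.

```matlab
% Glottal model G(z) = -a*e*ln(a)*z^-1 / (1 - a*z^-1)^2
a = 0.9;                              % illustrative value (assumption)

numg = [0, -a*exp(1)*log(a)];         % numerator coefficients: 0 + (-a e ln a) z^-1
deng = [1, -2*a, a^2];                % denominator: (1 - a z^-1)^2 expanded

% Impulse response from the difference equation
n = 0:99;
impulse = [1, zeros(1, 99)];
g = filter(numg, deng, impulse);

% Closed-form impulse response for comparison
g_closed = -exp(1)*log(a) .* n .* a.^n;

max_err = max(abs(g - g_closed));     % should be at rounding-error level
```

With this scaling the impulse response rises to a peak just under 1 near n = -1/ln(a), which is around sample 9 or 10 for a = 0.9, and then decays.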
The frequency response of G(z) for various values of a is shown on the following slide. (Graph printed using glottal.m.)
[Figure: Frequency Response. |G(e^{jω})| plotted against ω in units of π (0 to 1) for a = 0.90, a = 0.80, and a = 0.70; the vertical axis runs from 0 to 25.]
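The curves in the frequency response plot can be approximated numerically by evaluating G(z) on the unit circle. The sketch below does this directly from the formula for the three values of a named in the figure; it is not the glottal.m script, and the 256-point frequency grid and the variable names are assumptions.

```matlab
% Magnitude response |G(e^{jw})| for several values of a
avals = [0.90 0.80 0.70];             % values of a from the figure
w  = linspace(0, pi, 256);            % frequency grid, 0..pi (grid size is an assumption)
zi = exp(-1j*w);                      % z^-1 evaluated on the unit circle

Gmag = zeros(length(avals), length(w));
for i = 1:length(avals)
    a = avals(i);
    Gmag(i, :) = abs(-a*exp(1)*log(a) .* zi ./ (1 - a.*zi).^2);
end

dc_gain = Gmag(:, 1);                 % response at w = 0 for each a
```

Each row of Gmag is one curve; `plot(w/pi, Gmag.')` would draw them against normalized frequency. The response is low-pass, and the gain at ω = 0 grows as a approaches 1.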
The vocal tract V(z) can be modeled as a sequence of “lossless tubes”:
[Figure: a chain of tube sections with areas Ak-1, Ak, Ak+1, with input uG[n] at one end and output uL[n] at the other.]
Each “tube” has a cross-sectional area Ak.
The vocal tract transfer function V(z) will be represented by the following model:
$$V(z) = \frac{U_L(z)}{U_G(z)} = \frac{z^{-N/2}\prod_{k=1}^{N}(1 + r_k)}{D(z)}.$$
The parameters rk (which correspond to reflection coefficients along the vocal tract) are found from
$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}.$$
The denominator D(z) is found from the recursive relationship
$$D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}),$$
starting with D0(z) = 1 and ending with D(z) = DN(z).
Here Ak (k = 1, …, N) are parameters corresponding to cross-sectional areas of the vocal tract. (These values are given for a particular phoneme.)
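As a small worked case, taking N = 2 makes the recursion concrete. Starting from D_0(z) = 1:

```latex
\begin{aligned}
D_1(z) &= D_0(z) + r_1 z^{-1} D_0(z^{-1}) = 1 + r_1 z^{-1}, \\
D_2(z) &= D_1(z) + r_2 z^{-2} D_1(z^{-1})
        = 1 + r_1 z^{-1} + r_2 z^{-2}\,(1 + r_1 z) \\
       &= 1 + r_1 (1 + r_2)\, z^{-1} + r_2 z^{-2}.
\end{aligned}
```

In coefficient form D_2 is [1, r1(1 + r2), r2]: each step extends the polynomial by one coefficient, and the last coefficient of D_k is always r_k.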
The numerator G [of V(z)] is found by
$$G = \prod_{k=1}^{N}(1 + r_k).$$
Finally, the lip radiation transfer function is given by
$$R(z) = \frac{P_L(z)}{U_L(z)} = 1 - z^{-1}.$$
The previous voice model was implemented in MATLAB in a script file called voice.m.
The vocal tract transfer function V(z) parameters are computed by a MATLAB function called AtoV().
The glottal transfer function G(z) coefficients are assigned to arrays numg and deng.
The vocal tract/lip radiation transfer function V(z)R(z) coefficients are assigned to arrays numv and denv.
The blocks of the model map onto voice.m as follows:
Voiced: e[n] → G(z) (numg, deng) → uG[n] → V(z) R(z) (AtoV; numv, denv) → pL[n]
Unvoiced: uG[n] is random noise (generated with randn) → V(z) R(z) (AtoV; numv, denv) → pL[n]
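AtoV()
The reflection coefficients are computed from the areas:
$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$$
    % Append r_k for each junction between adjacent tube sections
    for k=1:N-1
        r = [r (A(k+1)-A(k))/(A(k+1)+A(k))];
    end;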
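The denominator D(z) and the gain G are built up in the same loop:
$$D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}), \qquad G = \prod_{k=1}^{N}(1 + r_k).$$
    % D holds the coefficients of D_{k-1}(z); [0 fliplr(D)] holds those of z^{-k} D_{k-1}(z^{-1})
    for k=1:N
        D = [D 0] + r(k).*[0 fliplr(D)];
        G = G*(1+r(k));
    end;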
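Putting the two loops together, a self-contained sketch of the whole area-to-filter computation might look like the following. Several things here are assumptions rather than content of the slides: the area values in A are made up and do not describe any real phoneme; the last reflection coefficient rN (at the lip end, which the k = 1..N-1 loop does not produce) is simply set to 0.7; and forming numv as G times the lip radiation numerator [1 -1], while dropping the pure delay z^(-N/2), is one plausible reading of how numv and denv are built, not a copy of AtoV() or voice.m.

```matlab
% Area function -> reflection coefficients -> V(z)R(z) coefficients
A  = [1.0 2.0 4.0 2.0 1.0 3.0];       % made-up tube areas (assumption)
rN = 0.7;                             % assumed reflection coefficient at the lips
N  = length(A);

% r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for k = 1..N-1, then append rN
r = diff(A) ./ (A(1:end-1) + A(2:end));
r = [r, rN];

% Recursion for D(z) and product for G, starting from D_0(z) = 1
D = 1;
G = 1;
for k = 1:N
    D = [D, 0] + r(k) .* [0, fliplr(D)];
    G = G * (1 + r(k));
end

% Combine with lip radiation R(z) = 1 - z^-1 (delay z^(-N/2) omitted)
numv = G * [1, -1];
denv = D;
```

D ends up with N + 1 coefficients, a leading 1, and rN as its last entry. Dropping the z^(-N/2) factor only shifts the output in time, which is why it is left out of this sketch.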
Voiced Speech
    ug = 0.1*filter(numg,deng,p);
    pl = filter(numv,denv,ug);
The array p is a pulse train.
Unvoiced Speech
    ug = 0.01*randn(1,10000);
    pl = filter(numv,denv,ug);
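The slides do not show how p is built. One simple way, sketched below, places a unit impulse once per pitch period and then applies the glottal filter. The sampling rate fs, the pitch f0, and the glottal parameter a = 0.9 are assumed values chosen only for illustration; the 10000-sample length matches the noise excitation used for unvoiced speech.

```matlab
% Periodic pulse train and glottal output for voiced speech
fs = 8000;                            % sampling rate in Hz (assumption)
f0 = 100;                             % pitch in Hz (assumption)
a  = 0.9;                             % glottal parameter (assumption)

P = round(fs / f0);                   % pitch period in samples
p = zeros(1, 10000);
p(1:P:end) = 1;                       % one unit impulse every P samples

% Glottal filter coefficients for G(z) = -a*e*ln(a)*z^-1 / (1 - a*z^-1)^2
numg = [0, -a*exp(1)*log(a)];
deng = [1, -2*a, a^2];

ug = 0.1 * filter(numg, deng, p);     % glottal output, scaled as in the slides
```

Changing f0 changes the perceived pitch without touching the vocal tract filter, which is the main attraction of this source-filter split.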
Given the vocal tract areas Ak for a particular vowel, we can synthesize that vowel.
In the following demonstration, we will synthesize the phonemes AA and IY.
The phoneme AA is like the a in “father.”
The phoneme IY is like a long e (ē), as in “beet.”
AA voiced (aav.wav)
AA unvoiced (aau.wav)
IY voiced (iyv.wav)
IY unvoiced (iyu.wav)