AALBORG UNIVERSITY
Institute of Electronic Systems
Fredrik Bajers Vej 7 - 9220 Aalborg Øst - Phone: (+45) 9635 8080

Bandwidth Expansion of Narrowband Speech using Linear Prediction
- Worksheets -

Source: kom.aau.dk/group/04gr742/pdf/rapport.pdf


Bandwidth Expansion of Narrowband Speech using Linear Prediction
- Worksheets -

PERIOD:

September - December 2004

THEME:

Stochastic Signal Analysis

PROJECT GROUP:

742

GROUP MEMBERS:

Bjarke Bliksted Andersen
Jakob Dyreby
Frederik Holmelund Kjærskov
Ole Lodahl Mikkelsen
Peter Drustrup Nielsen
Niels Henrik Zimmermann
Brian Jensen

SUPERVISORS:

Søren Holdt Jensen, Aalborg University, DK
Jesper Jensen, Delft University of Technology, NL

COPIES: 9

NUMBER OF PAGES: 90

ABSTRACT:

We present a signal processing algorithm that converts telephone-quality speech, i.e. speech with a frequency range of about 0.3-3.4 kHz, into 8 kHz wide-band speech. The well-known source-filter model is used, and linear prediction analysis is applied to decompose the signal into an envelope part and an excitation part. By means of VQ and modulation techniques, wide-band envelope and excitation signals are generated. Subjective tests show the performance of the proposed algorithm.


Bjarke Bliksted Andersen

Jakob Dyreby

Frederik Holmelund Kjærskov

Ole Lodahl Mikkelsen

Peter Drustrup Nielsen

Brian Jensen

Niels Henrik Zimmermann

Aalborg University, December 17th, 2004.


Contents

1 Overview of the human speech system 1

2 Pitch detection 6

3 Framing and deframing 24

4 Cepstral signal analysis for pitch detection 28

5 LPC modeling of vocal tract 31

6 Line Spectrum Pairs 42

7 Reflection Coefficients 47

8 Vector Quantization 51

9 The K-means Clustering Algorithm 53

10 Generating the codebook for speech enhancement 58

11 Codebook training 62

12 Excitation extension 65

13 Telephone filter 76

14 Envelope and excitation evaluation 78

15 Listening test 85

Bibliography 90


1 Overview of the human speech system

[Wooil n.d.] The main components of the human speech system are: the lungs, trachea (windpipe), larynx, pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). Normally the pharyngeal and oral cavities are grouped into one unit called the oral tract, and the nasal cavity is called the nasal tract. The exact placement of the main organs is shown in figure 1.1. A simplified schematic of the speech system is shown in figure 1.2.

Figure 1.1: Schematic view of human speech production. [1]


Figure 1.2: Block model of human speech production. [1]

1.1 Speech production

As seen in figure 1.2, muscle force is used to press air from the lungs through the larynx (more specifically, past the epiglottis). The vocal cords then vibrate, interrupting the air flow and producing a quasi-periodic pressure wave. The pressure impulses are called pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency. A typical sound pressure function is shown in figure 1.3. The frequency of the pressure signal is the part that defines the speech melody: if we spoke with a constant pitch frequency, the speech would sound monotonous, but normally the frequency changes continuously. The frequency of the vocal cords is determined by several factors: the tension exerted by the muscles, their mass, and their length. These factors vary between sexes and according to age.

Figure 1.3: Typical impulse sequence (sound pressure function). [3]

The pressure impulses stimulate the air in the oral tract and, for certain sounds, also in the nasal tract. When the cavities resonate, they radiate a sound wave, which is the speech signal. Both tracts (oral and nasal) act as resonators with characteristic resonance frequencies, called formant frequencies. It is possible to change the cavities of the mouth by moving the jaw, tongue, velum, and lips; because of this we can pronounce very many different sounds.

1.2 Voiced/unvoiced speech

Speech can be divided into two classes, voiced and unvoiced. The difference between the two lies in the use of the vocal cords and the vocal tract (mouth and lips). When voiced sounds are pronounced, both the vocal cords and the vocal tract are used, and because of the vocal cords it is possible to find the fundamental frequency of the speech. In contrast, the vocal cords are not used when pronouncing unvoiced sounds, so it is not possible to find a fundamental frequency in unvoiced speech. In general, all vowels are voiced sounds. Examples of unvoiced sounds are /sh/, /s/ and /p/.

There are different ways to detect whether speech is voiced or unvoiced. As mentioned earlier, the fundamental frequency can be used to detect the voiced and unvoiced parts of speech. Another way is to calculate the energy in the signal (signal frame): there is more energy in a voiced sound than in an unvoiced sound. Figure 1.4 shows a speech signal divided into three parts: an original part, a voiced part, and an unvoiced part. The figure shows that there is more energy in the voiced part than in the unvoiced part.
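The energy criterion can be sketched in a few lines. This is an illustrative sketch only (the worksheets themselves use Matlab; Python/NumPy is used here), and the fixed threshold and non-overlapping frames are assumptions for illustration, not values from the report:

```python
import numpy as np

def frame_energy(frame):
    """Short-time energy: the sum of squared samples in the frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

def classify_frames(signal, frame_len, threshold):
    """Label each non-overlapping frame 'voiced' or 'unvoiced'.

    A frame whose energy exceeds the threshold is taken as voiced;
    the threshold itself is illustrative and would normally be tuned
    to the signal statistics.
    """
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        energy = frame_energy(signal[start:start + frame_len])
        labels.append("voiced" if energy > threshold else "unvoiced")
    return labels
```

For a 16 kHz signal, `frame_len = 320` corresponds to the 20 ms frames used elsewhere in these worksheets.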

[Figure: three spectrogram panels of the speech sample (original, voiced, unvoiced); axes: Time (sec) vs. Freq (kHz).]

Figure 1.4: Spectrogram of speech signal


1.3 Model of human speech

Following the overview of the human speech system above, a model of human speech production is shown in figure 1.5a. The mouth and nose can be described as a time-varying filter. The idea behind LPC is to minimize the sum of the squared differences between the original speech and the estimated speech. The LPC coefficients are normally estimated in frames of 20 ms; this length is chosen because a speech signal can be considered quasi-stationary over such a time frame. The transfer function of the time-varying filter is given in equation 1.1.

H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (1.1)

In the transfer function, the linear prediction coefficients are represented by a_k, G is the gain, and p is the number of coefficients. The theory of LPC is described in the worksheet "Line Spectrum Pairs". To estimate proper speech, the model must be able to decide whether the speech is voiced or unvoiced; this is done with a pitch detector (described in the worksheet "Pitch detection"). The pitch detector also finds the fundamental frequency, which controls the impulse train. A model of the speech system is shown in figure 1.5b.
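Equation 1.1 can be realized directly as the difference equation s[n] = G e[n] + a_1 s[n-1] + ... + a_p s[n-p]. The following sketch (Python/NumPy here, although the worksheets use Matlab) is not the report's implementation, only an illustration of the all-pole synthesis filter:

```python
import numpy as np

def lpc_synthesis(excitation, a, gain=1.0):
    """All-pole synthesis filter H(z) = G / (1 - sum_k a_k z^{-k}),
    realized as s[n] = G*e[n] + a_1*s[n-1] + ... + a_p*s[n-p].
    `a` holds the prediction coefficients a_1..a_p."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        # Feed back the p most recent output samples.
        for k in range(1, min(p, n) + 1):
            acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s
```

Feeding a unit impulse through a one-coefficient filter with a_1 = 0.5 yields the impulse response 1, 0.5, 0.25, ..., i.e. the expansion of 1/(1 - 0.5 z^{-1}).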

[Figure: a) physiological model: lungs (sound pressure) → vocal cords (voiced excitation: quasi-periodic excitation signal; unvoiced excitation: noise-like excitation signal) → articulation (mouth/nose) → speech. b) electrical model: a tone generator (pulse train), controlled by the fundamental frequency, and a noise generator, selected by the voiced/unvoiced decision and scaled by an energy factor, drive a variable LPC filter with filter coefficients → speech.]

Figure 1.5: a) Model of human speech production. b) Speech model for an electrical system. [4]


1.4 Literature list

The images originate from the following sites:

[1] http://ispl.korea.ac.kr/~wikim/research/speech.html

[2] http://www.kt.tu-cottbus.de/speech-analysis/

[3] http://www.student.chula.ac.th/~47704705/2-2.html

[4] http://www.kt.tu-cottbus.de/speech-analysis/tech.html



2 Pitch detection

In this chapter we describe a pitch detector, or pitch extractor. The goal of a pitch detector is to find the fundamental frequency f0 of a speech signal. There are several different techniques for finding this fundamental frequency. In this chapter a pitch detector will be chosen, based on tests of several different pitch detection techniques.

2.1 Introduction

A pitch detector is used to detect the fundamental frequency (f0), called the pitch, of a signal. In the project "Bandwidth Expansion of Narrowband Speech using Linear Prediction" a pitch detector would be useful for finding f0 of speech signals, because telephone signals are limited to the frequency interval 300-3400 Hz. When extending the bandwidth of a signal from a sampling frequency of 8000 Hz to 16000 Hz, it would also be useful to recreate the lower frequencies of a phone signal, say 50-300 Hz. Because the human pitch lies in the interval 80-350 Hz (table 2.1), where the pitch for men is normally around 150 Hz, for women around 250 Hz, and for children even higher, the pitch is needed to construct this part of the speech signal.

Some of the most used detectors are: energy based, cepstral based, zero crossing, pitch in the difference function, and autocorrelation based. In this chapter three different pitch detectors are tested: the "On-The-Fly" detector, a cepstral-based detector, and an autocorrelation-based detector.

"On-The-Fly" [1] is a time-based algorithm developed to estimate the pitch of a signal divided into frames. This chapter describes the algorithm, how it is implemented, and an overall test of its ability to estimate the pitch of a speech signal.

The cepstral pitch detector is based on the log of the Fourier transform. This algorithm is only briefly described and tested here.

The autocorrelation pitch detector [5] is described and tested in this chapter.


2.1.1 The test signal

The pitch detector must be able to estimate the pitch of signals in a limited frequency range. The speech signal used throughout this chapter is spoken by a man with a pitch around 150 Hz ± 20 Hz, verified with a pitch analyzer program¹. If not mentioned otherwise, figures illustrate a single frame (20 ms), and always the same frame, so that the figures can be compared. The signal is sampled at 16000 Hz and passed through a telephone filter, which is a 10th-order band pass filter working at 300-3400 Hz. All figures show both the original signal and the band-passed signal, so that the two can be compared.

         f0,min (Hz)   f0,max (Hz)
Men           80           200
Women        150           350

Table 2.1: Range of human speech

Also note that the implemented algorithms have not only been tested with the signal described above; this signal is only used to gain understanding of the algorithms.

2.2 On-The-Fly pitch estimation

2.2.1 Common time domain pitch detectors

Many time-domain pitch detectors use the autocorrelation method (equation 2.1) to estimate the pitch; it is also among the oldest methods. Another way to find the pitch is the difference function (equation 2.2), which has been shown by Yin [2] and Chowdhury [3] to be more effective than the autocorrelation method. The "On-The-Fly" algorithm is a pitch detector based on the difference function (d_t).

r_t(\tau) = \sum_{j=1}^{W-\tau} x_j \, x_{j+\tau} \qquad (2.1)

d_t(\tau) = \sum_{j=1}^{W-\tau} (x_j - x_{j+\tau})^2 \qquad (2.2)

W is the length of the window; the subscript t indicates a time-domain calculation.
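Equations 2.1 and 2.2 translate directly into code. A small sketch (Python/NumPy here, whereas the worksheets use Matlab), with 0-based indexing in place of the 1-based j of the equations:

```python
import numpy as np

def autocorr(x, tau):
    """r_t(tau) = sum_j x_j * x_{j+tau}  (Eq. 2.1), summed over the
    W - tau overlapping samples of the window."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x[:len(x) - tau], x[tau:]))

def diff_function(x, tau):
    """d_t(tau) = sum_j (x_j - x_{j+tau})^2  (Eq. 2.2)."""
    x = np.asarray(x, dtype=float)
    d = x[:len(x) - tau] - x[tau:]
    return float(np.dot(d, d))
```

For a perfectly periodic signal, d_t(τ) is (numerically) zero when τ equals the period, while r_t(τ) peaks there.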

2.2.2 Unifying the difference function

The difference function is an effective way to detect the f0 of a signal: f0 should be at the global minimum of the difference function. This minimum is, however, difficult to locate, because the difference function of a random signal decreases along a straight-line slope, as shown in figure 2.1.

¹Pitch analyzer v1.1, written by Franz Josef Elmer

[Figure: difference function amplitude vs. sample difference index, for the all-pass signal and the band-limited signal.]

Figure 2.1: The difference function of both the all-pass signal and the band-limited signal. Both signals approximately converge along a straight line. Also note the difference between the local minima: the local minima of the band-limited signal are generally much smaller than those of the all-pass signal.

To locate the actual global minimum, the difference function output needs to be unified. If the time delay τ is equal to the period of the pitch (P), the converging slope (see figure 2.1) can be seen as a straight line of unity slope. Correlating x_{t+P} with x_t will result in a global maximum at the pitch P. Therefore, by considering the covariance matrix of the random vector X = [x_t, x_{t+\tau}]^T, we get:

C_x(\tau) = E[(X - m)(X - m)^T] \qquad (2.3)

Considering the signal as WSS (wide-sense stationary), two orthogonal eigenvectors can be found from the covariance matrix by diagonalizing it:

C_x(\tau) = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \lambda_1(\tau) & 0 \\ 0 & \lambda_2(\tau) \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} \lambda_1(\tau) + \lambda_2(\tau) & \lambda_1(\tau) - \lambda_2(\tau) \\ \lambda_1(\tau) - \lambda_2(\tau) & \lambda_1(\tau) + \lambda_2(\tau) \end{bmatrix} \qquad (2.4)

By assuming uniform distributions and equal means, the eigenvalues can be written as:

\lambda_1(\tau) \propto \sum_{j=1}^{W-\tau} (x_j + x_{j+\tau})^2 \qquad (2.5)

\lambda_2(\tau) \propto \sum_{j=1}^{W-\tau} (x_j - x_{j+\tau})^2 \qquad (2.6)


Because the eigenvalues and eigenvectors describe a new basis for the covariance matrix, a unified difference function can be derived. Thinking of eigenvector 1 and eigenvalue 1 as a 1-D basis of the 2-D space, the corresponding eigenvector 2 and eigenvalue 2 can be thought of as the error of this basis, and vice versa. Therefore the unified difference function can be written as:

d'_t(\tau) = \frac{\lambda_2(\tau)}{\lambda_1(\tau) + \lambda_2(\tau)} = \frac{\sum_{j=1}^{W-\tau} (x_j - x_{j+\tau})^2}{2 \sum_{j=1}^{W-\tau} \left( x_j^2 + x_{j+\tau}^2 \right)} \qquad (2.7)

The unified difference function is plotted in figure 2.2.

[Figure: unified difference function amplitude (0-1) vs. sample difference index, for the all-pass signal (top) and the band-limited signal (bottom).]

Figure 2.2: Here the two signals are unified. The differences between the local minima are now clearly visible.

According to Yin [2], the pitch is found at the global minimum of the unified difference function (d'_t). It is found as a sample index, which is equal to the period P. f0 is then calculated as

f_0 = \frac{f_s}{P} \qquad (2.8)

where f_s is the sampling frequency.
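The whole chain of equations 2.7 and 2.8 can be sketched as follows (Python/NumPy here; the lag search bounds are left as arguments, since the worksheets do not fix them at this point):

```python
import numpy as np

def unified_diff(x, tau):
    """d'_t(tau) = lambda2 / (lambda1 + lambda2)  (Eq. 2.7):
    sum (x_j - x_{j+tau})^2 over 2 * sum (x_j^2 + x_{j+tau}^2).
    The result always lies in [0, 1]."""
    x = np.asarray(x, dtype=float)
    a, b = x[:len(x) - tau], x[tau:]
    den = 2.0 * np.sum(a ** 2 + b ** 2)
    return float(np.sum((a - b) ** 2) / den) if den > 0 else 0.0

def pitch_from_unified(x, fs, tau_min, tau_max):
    """Take the global minimum of d'_t over the lag search range as the
    period P and return f0 = fs / P  (Eq. 2.8)."""
    taus = range(tau_min, tau_max + 1)
    values = [unified_diff(x, tau) for tau in taus]
    P = tau_min + int(np.argmin(values))
    return fs / P
```

For a clean 200 Hz tone sampled at 16 kHz, the global minimum falls at lag 80 and the estimate is 16000/80 = 200 Hz.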


2.2.3 Contradiction in results

Taking a closer look at the two plots in figure 2.2, we find a contradiction: the two signals do not seem to have the same global minimum, even though the f0 of the two signals should be equal. The article [1] does not say anything about "On-The-Fly" being developed for these types of band-limited signals, but it is developed to find the actual pitch of a signal, since the authors claim that the global minimum of the difference function is not always equal to the pitch period. This is actually the case for our band-limited signal, where the situation is as follows:

• The all-pass signal has a global minimum around index 95, which corresponds to a pitch of about 170 Hz.

• The band-limited signal has a global minimum around index 47, which corresponds to a pitch of about 340 Hz.

Surely one of these results must be wrong. Considering that the test signal originates from a man with a pitch around 150 Hz, the result for the band-limited signal must be the wrong one.

2.2.4 Idea of the "On-The-Fly"

The fundamental idea of the "On-The-Fly" algorithm is to find other candidates for f0. This is done by locating a number of the smallest minima in d'_t, where the overall global minimum is still selected as f0 by default. Note that the global minimum is at all times still thought of as being some harmonic of the actual pitch: if the global minimum is not equal to f0, the actual pitch should be found within some limits of this minimum, as a candidate harmonic frequency of the global minimum.

The surrounding local minima are therefore tested in two ways:

1. A minimum differs only by a small amount (threshold stage 1). A small threshold (ATh1) is used to test the difference in the size of the minima. If the difference from the found pitch sample is within the limit, the one with the smaller time period is considered the actual pitch. This is accompanied by another, larger threshold (TTh1), which is used to test the cohesion between the candidate harmonic of the pitch minimum and the new candidates. If the minimum does not lie within the threshold of the harmonic, the sample is discarded.

2. A minimum differs by a larger, but still relatively significant, amount (threshold stage 2). A larger threshold (ATh2) is used to test the difference in the size of the minima. If the difference from the found pitch sample is within the limit, the one with the smaller time period is considered the actual pitch. This is accompanied by another, smaller threshold (TTh2), which is used to test the cohesion between the candidate harmonic of the pitch minimum and the new candidates. If the minimum does not lie within the threshold of the harmonic, the sample is discarded.


Figure 2.3 illustrates the two test conditions, with the threshold stages marked.

[Figure: unified difference functions of the all-pass signal (top, with TTh1 and ATh1 marked) and the band-limited signal (bottom, with TTh2 and ATh2 marked), with the smallest minima indicated.]

Figure 2.3: Same as figure 2.2, but here with the 5 smallest minima marked with green lines. Threshold stage 1 (ATh1 and TTh1) is shown in the upper plot, and threshold stage 2 (ATh2 and TTh2) in the lower plot. Of course, both threshold stages are applied to each signal during testing.

2.3 The On-The-Fly algorithm and implementation

The On-The-Fly algorithm is relatively simple; figure 2.4 shows it as a block diagram with six blocks:

A Calculate the unified difference function (d'_t) of a frame.
B Locate a number (typically 5) of the smallest minima of d'_t.
C Do parabolic interpolation of the possible candidates.
D Do threshold stage 1, with the thresholds ATh1 and TTh1.
E Do threshold stage 2, with the thresholds ATh2 and TTh2.
F The result of threshold stage 2 is the found pitch (fundamental frequency).

Figure 2.4: The diagram of the On-The-Fly algorithm

A Here d'_t is calculated.


B Locating a number of the smallest local minima of d'_t can be a challenging task. First, a minimum must be defined: can two local minima be separated by only one or a few samples? If not, where should the limit be?

A simple calculation shows that minima with only a few samples between them cannot be harmonics of each other. Typical speech signals are sampled at fs = 16000 Hz; with a maximal pitch of 500 Hz, there would still have to be at least 16000 Hz / 500 Hz = 32 samples between minima of d'_t. But a local minimum might still occur away from a harmonic and thereby prevent the actual harmonic from being located.

In the implementation, an algorithm to find a number of minima of d'_t was developed. It works as follows: first it locates all minima of d'_t. Then any minimum larger than 0.4 is deleted from the data, to ensure that the remaining minima could actually be the pitch. Next, the global minimum among all the minima is found and its index stored in t_i. If any minima exist within 10 samples of the found global minimum, these are deleted from the data along with the found minimum, and the procedure is repeated until there is no more data or the wanted number of minima has been found. The Matlab code for this is found in section 2.8.2.

C The parabolic interpolation is needed to reduce the quantization error in the candidate pitch values. For example, with fs = 16000 Hz, a minimum located at sample 60 corresponds to a pitch of 267 Hz, while a minimum at sample 59 corresponds to 271 Hz; no pitch between 267 and 271 Hz can be represented, and this error grows as the pitch rises. The parabolic interpolation solves this problem and is implemented as a simple curve-fitting problem. The Matlab code for this is found in section 2.8.3.
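Step C can be sketched with the standard three-point parabolic vertex formula (this particular formula is an assumption about the fit used; the worksheets only say "a simple curve fitting problem"):

```python
def parabolic_refine(y_left, y_mid, y_right, tau):
    """Fit a parabola through (tau-1, y_left), (tau, y_mid),
    (tau+1, y_right) and return the sub-sample lag of its vertex,
    using the standard formula tau + 0.5*(y_left - y_right)/denom."""
    denom = y_left - 2.0 * y_mid + y_right
    if denom == 0.0:
        return float(tau)   # degenerate (flat): keep the integer lag
    return tau + 0.5 * (y_left - y_right) / denom
```

With fs = 16000 Hz, refining a minimum from lag 60 to, say, 59.4 moves the pitch estimate from 16000/60 ≈ 267 Hz to 16000/59.4 ≈ 269 Hz, filling the gap between the quantized values.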

D Among the candidate minima (t_i), we need to know whether any of them has the global minimum, which is assumed to be the pitch, as a harmonic. This is decided with the formula:

\left| N - \frac{t_g}{t_i} \right| < TTh1, \quad N = \{2, 3, \ldots, 6\} \qquad (2.9)

where t_g is the index of the overall global minimum. The threshold TTh1 specifies how much a harmonic may maximally differ from the global minimum. Those t_i for which at least one of the possible N-values satisfies the threshold are then tested against the threshold ATh1, which bounds the difference in minimum value between t_g and t_i. If any t_i lies within both threshold values, t_g is exchanged with the candidate that has the smaller time period (if that is not t_g itself). In this threshold set, TTh1 is set to 0.2 and ATh1 is set to 0.07. The Matlab code for this is found in section 2.8.4.

E Steps D and E are exactly the same, just with two different threshold sets. In this threshold set, TTh2 is set to 0.05 and ATh2 is set to 0.2.

The fully implemented "On-The-Fly" algorithm can be seen in section 2.8.2.


2.3.1 Test results of "On-The-Fly"

Throughout this chapter a single frame has been used to show how the algorithm works. In figure 2.5 we see the final result of the algorithm on this frame.

[Figure: unified difference function vs. sample difference index for the all-pass signal (found pitch = 165.84 Hz, top) and the band-limited signal (found pitch = 338.51 Hz, bottom).]

Figure 2.5: The pitch found in a single frame by the "On-The-Fly" algorithm. The 'o' in each plot indicates where the global minimum of d'_t (offset pitch) was found.

The figure clearly shows that the pitch found is not the same for the two versions of this frame. The pitch should be around 150 Hz, which is the result obtained from the original signal; from the band-passed signal, no pitch was found anywhere near the actual pitch of the frame. This could, however, be an uncertainty in the algorithm, so figure 2.6 shows 150 frames of the signal.


[Figure: four panels over sample index 0-25000: the original speech signal with its On-The-Fly pitch estimate, and the band-passed speech signal with its On-The-Fly pitch estimate; pitch frequency shown on a ±500 Hz scale.]

Figure 2.6: The pitch found by the "On-The-Fly" algorithm.

In this figure we see that the pitch found from the original speech signal is far more consistent than the one found in the band-passed signal. From this, and from the fact that the speaker is a man with a pitch around 150 Hz, we can conclude that "On-The-Fly" may be good at detecting pitch in full-frequency-range signals, but not in band-passed signals². This algorithm is therefore not useful in the project "Bandwidth Expansion of Narrowband Speech using Linear Prediction".

2.4 Cepstral based pitch detector

The theory behind the cepstral detector is that the Fourier transform of a pitched signal usually has a number of regularly spaced peaks, representing the harmonic spectrum. When the log magnitude of the spectrum is taken, these peaks are reduced (their amplitude is brought into a usable scale). The result is a periodic waveform in the frequency domain, whose period is related to the fundamental frequency of the original signal. This means that a Fourier transform of this waveform has a peak representing the fundamental frequency. This method has only briefly been tested in Matlab, with an unsatisfactory result. Due to this test, and to time limitations, this pitch detector will not be explained further.

²A short description of the algorithm's flaws in detecting pitch in band-limited signals can be found in section 2.8.1.
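Although the worksheets do not give their Matlab test code, the cepstral idea just described can be sketched as: take the log magnitude spectrum, transform it back, and pick the quefrency peak in the plausible pitch range (the search bounds here are assumptions for illustration):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=80.0, fmax=350.0):
    """Real-cepstrum pitch estimate: the inverse FFT of the log
    magnitude spectrum has a peak at the quefrency (in samples)
    corresponding to the pitch period."""
    x = np.asarray(frame, dtype=float)
    log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_mag, n=len(x))
    qmin, qmax = int(fs // fmax), int(fs // fmin)     # quefrency search range
    q = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return fs / q
```

On a harmonic-rich synthetic tone this finds the fundamental; on real band-limited telephone speech it behaves much less reliably, consistent with the unsatisfactory result reported above.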


2.5 Pitch detection using the autocorrelation

Calculating the autocorrelation yields a lot of information about the speech signal; one piece of information is the pitch period (the fundamental frequency). To make the speech signal closely approximate a periodic impulse train, we must use some kind of spectrum flattening; we have chosen center clipping as the spectrum flattener. After the center clipping, the autocorrelation is calculated and the fundamental frequency is extracted. An overview of the system is shown in figure 2.7.

[Figure 2.7, block diagram: speech frame x(n) → LPF (Fc = 900 Hz) → center clip x(n) with CL = 30% of max{x(n)} → compute the autocorrelation Rn(k) for Fs/350 <= k <= Fs/80 → compute R = max{Rn(k)} and Rn(0) → test R >= 30% of Rn(0): yes → pitch period f0 = K + Fs/350; no → frame unvoiced, output = 0.]

Figure 2.7: Overview of the autocorrelation pitch detector

2.5.1 Center clipping spectrum flattener

The purpose of the center clipping spectrum flattener is to apply a threshold (CL) so that only the high and low peaks are kept. In our case CL is chosen as 30% of the maximum amplitude. The center clipping of a speech signal is shown in figure 2.8.


[Figure: input speech with the thresholds +CL and -CL marked relative to Amax, and the resulting center-clipped speech, both vs. time.]

Figure 2.8: Center clipping

The center clipping is performed according to the following definition:

• CL = a percentage of Amax (e.g. 30%)
• if x(n) > CL then y(n) = x(n) − CL
• if x(n) < −CL then y(n) = x(n) + CL (the symmetric branch implied by figure 2.8)
• otherwise (|x(n)| <= CL) y(n) = 0
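The definition above can be sketched directly (Python/NumPy here; note that the symmetric negative branch is an interpretation of figure 2.8, since the worksheets spell out only the positive case):

```python
import numpy as np

def center_clip(x, fraction=0.3):
    """Center clipping with threshold CL = fraction * max|x|.
    Samples inside [-CL, CL] are zeroed; samples outside are shifted
    toward zero by CL (negative branch mirrors the positive one)."""
    x = np.asarray(x, dtype=float)
    cl = fraction * np.max(np.abs(x))
    y = np.zeros_like(x)
    y[x > cl] = x[x > cl] - cl
    y[x < -cl] = x[x < -cl] + cl
    return y
```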

2.5.2 Autocorrelation

The autocorrelation is the correlation of a variable with itself over time. It can be calculated using equation 2.10:

R(k) = \sum_{m=0}^{N-k-1} x(m) \, x(m+k) \qquad (2.10)

In this project we have used the Matlab function "xcorr" to calculate the autocorrelation.
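A sketch of the search stage of figure 2.7, with the autocorrelation of equation 2.10 written out directly instead of calling Matlab's xcorr; the unvoiced test uses the 30% rule from the diagram, and the raw (untapered) sum slightly biases the peak toward smaller lags:

```python
import numpy as np

def autocorrelation_pitch(frame, fs, fmin=80.0, fmax=350.0, voicing=0.3):
    """Autocorrelation pitch detector: search R(k) = sum_m x[m]*x[m+k]
    for fs/fmax <= k <= fs/fmin and declare the frame unvoiced
    (return 0.0) unless the peak reaches `voicing` * R(0)."""
    x = np.asarray(frame, dtype=float)
    r0 = float(np.dot(x, x))            # R(0), the frame energy
    if r0 == 0.0:
        return 0.0
    kmin, kmax = int(fs // fmax), int(fs // fmin)
    lags = np.arange(kmin, min(kmax, len(x) - 1) + 1)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in lags])
    best = int(np.argmax(r))
    if r[best] < voicing * r0:
        return 0.0                      # unvoiced frame
    return fs / float(lags[best])       # pitch in Hz
```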


2.5.3 Median smoothing

Once the pitch detection has been calculated, we can see that there are some outliers. To minimize these outliers, we have chosen to use median smoothing (a median filter). The effect of this median filter is shown in figure 2.9. The window length n has been tested with n = 3 and n = 5, and n = 5 was chosen because the pitch detector then eliminates the outliers in the noisy part. One problem with the median filter at n = 5 is that it smooths over five windows; in our project that means we would have to introduce a delay of 50 ms in the system. The delay is calculated as:

\text{number of windows} \cdot \text{window length} \cdot \text{window overlap} = 5 \cdot 20\,\text{ms} \cdot 0.5 = 50\,\text{ms} \qquad (2.11)

We have chosen not to deal with this problem, since we are not going to implement a real-time system; if we did, we would have to predict the pitch five windows ahead.
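A minimal running-median sketch (Python/NumPy here; the worksheets use Matlab). The window shrinks at the edges, which is one of several reasonable boundary choices:

```python
import numpy as np

def median_smooth(values, n=5):
    """Running median of window length n (odd), shrinking at the edges.
    Removes isolated outliers from a per-frame pitch track while
    leaving step changes mostly intact."""
    values = np.asarray(values, dtype=float)
    half = n // 2
    out = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out[i] = np.median(values[lo:hi])
    return out
```

A single outlier frame in an otherwise flat pitch track is removed completely, which is exactly the behavior described above.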

[Figure: pitch detection with median smoothing for window lengths n = 3, n = 5, and n = 7; each panel shows the speech signal, the raw pitch detection, and the median-smoothed pitch track vs. sample index, with pitch frequency on a ±500 Hz scale.]

Figure 2.9: Median smoothing

If the window length is set to n=7, also shown in figure 2.9, the result seems perfect. The problem with such a long window is that the algorithm sometimes finds a wrong fundamental frequency where there are speech pauses.

2.5.4 Test

The algorithm has been tested against results from the program "pitch analyzer 1.1", and the comparison has shown that the autocorrelation pitch detector is very accurate.


2.6 Overall test and results

Throughout this worksheet a speech signal has been used as a test signal. But since the pitch of a random test signal isn't known, the algorithms have also been tested against signals with a known pitch. Figure 2.10 shows the two implemented pitch detectors estimating the pitch of a linear swept-frequency signal, whose spectrogram is shown in the top panel.

[Figure: top panel, spectrogram of the linear swept-frequency signal (frequency vs. time); bottom panels, signal amplitude and pitch frequency (Hz) vs. samples, with traces for OnTheFly, Fast-Auto-corr, Fast-Auto-corr shifted 5 windows, and the actual pitch of the signal.]

Figure 2.10: The top panel shows a spectrogram of the linear swept-frequency signal, starting at 80 Hz and ending at 1000 Hz. The bottom panel shows the two algorithms' estimates of the pitch of the linear swept-frequency signal. NOTE: the pitch detectors are constructed to detect pitch in the interval 80-350 Hz.

As the figure shows, both methods estimate the pitch equally well within their operating limits of 80-350 Hz. Outside these limits the figure shows that both methods do a quite poor job of flooring the pitch frequency, which would be the preferred outcome. Further, the 'Fast auto-correlation' method is delayed 5 frames (50 ms), because it uses a median smooth over 5 estimated pitches. This cannot be compensated for; the only solution is to remove the smoothing, which causes more spikes (see figure 2.9 on the page before).

Another important aspect is the influence of noise on the signal. This is tested with a 200 Hz sine signal whose signal-to-noise ratio shifts from 100% to 0% over a time interval, as shown in figure 2.11.


[Figure: signal amplitude vs. samples (top) and pitch frequency (Hz) vs. samples (bottom), with traces for the noisy signal, the signal/noise ratio, OnTheFly, Fast-Auto-corr, and Fast-Auto-corr shifted 5 windows.]

Figure 2.11: Test of the influence of noise on a signal. Here the signal-to-noise ratio decreases with time.

The figure shows that the Fast-auto-correlation method is very robust against noisy signals, while On-The-Fly breaks down very early in the test. Because speech signals are not just a single sine but may more or less resemble a noisy sine, the Fast-auto-correlation method would be preferred.

2.7 Conclusion

Of the two implemented pitch detection methods, only the autocorrelation method is usable in the project "Bandwidth expansion of narrowband signals". It was shown (figure 2.6) that the "On-The-Fly" algorithm could not estimate a reliable pitch from band-limited signals, which is a must in this project. The Fast-auto-correlation method, on the contrary, had no problem with either type of signal and is robust against noise. The preferred pitch detector is therefore the Fast-auto-correlation method.


2.8 Appendix

2.8.1 On-The-Fly’s flaws when used on band limited signals

Tests of the On-The-Fly algorithm showed that it was unable to estimate an accurate pitch of a band-limited signal. Why is that? The problem lies with the lower cut-off frequency of the band pass filter. The On-The-Fly algorithm uses these lower-frequency harmonics to estimate the pitch. Because it is time based and only uses the minima of the difference equation, the higher-order harmonics get more influence than the lower-order harmonics (see figure 2.2). Next, the key idea of the threshold stages is to select the candidate harmonic with the smallest time period, resulting in a higher pitch. Comparing this key idea with the figure, the algorithm will most likely select a higher pitch frequency than it should.

2.8.2 Matlab implementation of the "On-The-Fly" algorithm

% This function calculates the fundamental frequency of a frame of data
% with the OnTheFly algorithm.
%
% -- Input --
% data: Frame of data for which pitch is to be calculated
% fs:   Sampling frequency of signal
%
% -- Output --
% f0:   The calculated fundamental frequency of the input data
%       0 is returned if no pitch is found

function [f0] = OnTheFly(data, fs)

frameLength = length(data);  % Frame length
curveFitSamples = 3;         % Samples from minimum to left and right

% Calculate difference function
d = zeros(1, frameLength-1);
for i = 1:frameLength-1
    d(i) = sum((data(1:frameLength-i) - data(i+1:frameLength)).^2);
    d(i) = d(i) / (2*sum(data(1:frameLength-i).^2 + data(i+1:frameLength).^2));
end

% Find global minimum and 4 local minima (smallest minima) with minimum 10
% samples between minima. Minima can have a maximum value of 0.4
[minimas minCount] = findmin(d, 5, 10, 0.4);
nMins = size(minimas);

% If too many local minima are located by findmin, the signal is likely to
% be noise, and is therefore not to be processed. Also, if no minima have
% been found, the data is probably unvoiced, and is not to be processed.
if minCount <= 20 && nMins(1) > 0
    % Minimization of quantization error with parabola curve fitting for each minimum
    for i = 1:nMins(1)
        % x-coordinates to make curve fitting from
        x = (minimas(i,1)-curveFitSamples):(minimas(i,1)+curveFitSamples);
        % Remove coordinates of x which are not within the data frame
        x(find(x < 1 | x > (frameLength-1))) = [];
        y = d(x);  % Corresponding values of x-coordinates
        [minimas(i,1) minimas(i,2)] = parreg(x, y);  % Curve fitting and vertex extraction
    end

    % Thresholding stage 1
    TTh1 = 0.2; ATh1 = 0.07;  % Threshold settings
    [tg mins] = findf0(TTh1, ATh1, minimas);

    % Thresholding stage 2
    TTh2 = 0.05; ATh2 = 0.2;  % Threshold settings
    [tg mins] = findf0(TTh2, ATh2, [tg; minimas]);

    f0 = fs/tg(1);  % Calculation of the fundamental frequency of the frame
else
    f0 = 0;         % Zero if frame is unvoiced or noise!
end

Locating minima

Main function for locating minima of data:

% Function locates nMins of the smallest minima of data, where minima
% have to be separated by at least spred samples. Also, a minimum can't
% exceed maxMinValue!
%
% -- Input --
% data:        Data to locate minima in
% nMins:       # of minima to find in data (if possible)
% spred:       Minimum # of samples between located minima
% maxMinValue: Maximum value of a minimum
%
% -- Output --
% ti:       An n-by-2 matrix where column 1 is indexes of minima and
%           column 2 is values of minima.
% minCount: Total number of found minima in data

function [ti minCount] = findmin(data, nMins, spred, maxMinValue)

[n m] = size(data);                     % Flips vector correctly!
if m == 1
    data = data';
end

minimas = LocalMinima(data)';           % Locates ALL minima in data as indexes
minimas(:,2) = data(minimas(:,1))';     % Extends minima to hold the data values at those indexes
minCount = length(minimas);             % Total number of local minima found

indexes = find(minimas(:,2) >= maxMinValue);  % Find indexes of minima that exceed maxMinValue
minimas(indexes,:) = [];                % Delete minima that exceed maxMinValue
[m n] = size(minimas);                  % New size of minima matrix

if m >= 1                               % Only runs if data are available
    ti = zeros(nMins, 2);               % Allocate the number of smallest minima wanted
    for i = 1:nMins
        [minima mIndex] = min(minimas(:,2));  % Find global min among minima
        ti(i,:) = minimas(mIndex,:);    % Copy found global minimum into new matrix
        indexes = find(abs(minimas(mIndex,1)-minimas(:,1)) < spred);  % Find indexes of minima within 'spred' samples of found minimum
        minimas([mIndex indexes'],:) = [];    % Delete minima within 'spred' samples of found minimum
        if length(minimas) <= 0         % Break loop if no more data
            ti(i+1:end,:) = [];         % Delete unused allocated space
            break;
        end
    end
else
    ti = zeros(0, 2);                   % Allocate a 0-by-2 matrix (to avoid runtime error)
end

Sub function that locates all minima of data:

% mins = LocalMinima(x)
%
% finds positions of all strict local minima in input array

function mins = LocalMinima(x)

nPoints = length(x);
Middle = x(2:(nPoints-1));
Left = x(1:(nPoints-2));
Right = x(3:nPoints);
mins = 1 + find(Middle < Left & Middle < Right);

% Written by Kenneth D. Harris
% This software is released under the GNU GPL
% www.gnu.org/copyleft/gpl.html
% any comments, or if you make any extensions
% let me know at harris@axon.rutgers.edu


2.8.3 Curvefitting function

% This function calculates the vertex of a curve-fitted parabola.
function [xx yy] = parreg(x, y)
% x: x-coordinates to do curve fitting around
% y: y-coordinates to do curve fitting around

[m n] = size(y);  % Vector y has to be standing (an m-by-1 vector)
if n ~= 1
    y = y';
end
[m n] = size(x);  % Vector x has to be standing (an m-by-1 vector)
if n ~= 1
    x = x';
end

% Curve fitting in the form Ax = b
X = [ones(size(x)) x x.^2];  % X values
a = X\y;                     % Least squares solution of the coefficients
xx = -a(2)/(2*a(3));         % Calculates x of the vertex of the parabola
yy = [1 xx xx.^2]*a;         % Calculates y of the vertex of the parabola

2.8.4 Threshold stage function

% Function that locates the fundamental frequency of ti according to input
% thresholds TTh and ATh.
%
% -- Input --
% TTh: Harmonic threshold
% ATh: Amplitude threshold
% ti:  An n-by-2 matrix [index_of_minima, value_of_minima]
%      Row 1 MUST be the currently picked fundamental frequency minimum
%
% -- Output --
% tg: New minimum of the fundamental frequency
% ti: Remaining minima

function [tg, ti] = findf0(TTh, ATh, ti)
N = [2 3 4 5 6];       % Threshold constants to test for harmonics

tg = ti(1,:);          % Extract current fundamental frequency minimum
ti(1,:) = [];          % Delete the current fundamental frequency minimum from ti
ti = sortrows(ti, 1);  % Sort rows according to indexes (ascending)
ti = ti(end:-1:1,:);   % Reverse ti ... sort must be descending
index = [];            % Must be allocated to avoid runtime warnings

% Test for candidates of harmonic minima to the fundamental frequency minimum
for i = 1:length(N)                        % Loop over the harmonic constants in N
    x = tg(1)./ti(:,1);                    % Calculate tg/ti
    newIndexes = find(abs(N(i)-x) < TTh);  % Find indexes of ti that satisfy the threshold TTh
    if length(newIndexes) > 0
        index = [index; newIndexes];       % Add the found indexes to a general index variable
    end
end

% Test whether any of the candidate minima is within the amplitude threshold ATh
newTgIndex = 0;                            % Index into 'index' of the new fundamental frequency minimum
for i = 1:length(index)                    % Run only among the harmonic candidates
    if ti(index(i),2) - tg(2) < ATh        % Test if the ATh threshold is satisfied
        newTgIndex = i;                    % Save index into 'index' of the new fundamental frequency minimum
    end
end
if newTgIndex > 0      % If a new fundamental frequency was found, set new tg
    tgTemp = ti(index(newTgIndex),:);      % Make a copy of the new fundamental frequency minimum
    ti(index(newTgIndex),:) = tg;          % Copy the old fundamental frequency minimum into ti
    tg = tgTemp;                           % Copy the new fundamental frequency minimum into tg
    ti = sortrows(ti, 1);                  % Sort rows according to indexes (ascending)
    ti = ti(end:-1:1,:);                   % Reverse ti ... sort must be descending
end


2.9 Literature list

[1] S. Sood and A. Krishnamurthy: A Robust On-The-Fly Pitch (OTFP) Estimation Algorithm. The Ohio State University, Columbus, OH.

[2] A. de Cheveigné and H. Kawahara: YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4), 2002.

[3] S. Chowdhury, A. K. Datta, and B. B. Chaudhuri: Pitch detection algorithm using state phase analysis. Journal of the Acoustical Society of India, 28(1), Jan. 2000.

[4] http://www.cnmat.berkeley.edu/~tristan/Report/node4.html

[5] http://www.ee.ucla.edu/~ingrid/ee213a/speech/vlad_present.pdf


Framing and deframing 3

In speech processing it is often advantageous to divide the signal into frames to achieve stationarity. This worksheet describes how to split speech into frames and how to combine the frames into a speech signal.

3.1 Partitioning of a speech signal into frames

Normally a speech signal is not stationary, but seen from a short-time point of view it is. This results from the fact that the glottal system cannot change immediately. XXX XXX XXX states that a speech signal is typically stationary in windows of 20 ms. Therefore the signal is divided into frames of 20 ms, which corresponds to n samples:

n = t_s f_s   (3.1)

where t_s is the frame duration and f_s is the sampling frequency.

When the signal is framed it is necessary to consider how to treat the edges of each frame, because of the harmonics the edges add. It is therefore expedient to use a window to tone down the edges. As a consequence the samples will not be assigned the same weight in the following computations, and for this reason it is prudent to use an overlap.

Figure 3.1: Illustration of framing. The speech is divided into four frames; consecutive frames are separated by the frame step and share an overlap.

Figure 3.1 shows how a speech signal is divided into frames. Each frame shares its first part with the previous frame and its last part with the next frame. The frame step t_fs indicates the time between the starts of consecutive frames. The overlap t_o is defined as the time from where a new frame starts until the current one stops. From this it follows that the frame length t_fl is:

t_fl = t_fs + t_o   (3.2)


Hence the window has to be of length t_fl, which corresponds to t_fl f_s samples.
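The framing scheme above can be sketched as follows (a Python/NumPy illustration; the function name and the 10 ms step/overlap values are our own example choices, not fixed by the report):

```python
import numpy as np

def frame_signal(x, fs, t_fs=0.010, t_o=0.010, window=np.hamming):
    """Split x into windowed frames of length t_fl = t_fs + t_o (eq. 3.2)."""
    step = int(t_fs * fs)        # frame step in samples
    overlap = int(t_o * fs)      # overlap in samples
    length = step + overlap      # frame length in samples
    w = window(length)
    n_frames = (len(x) - overlap) // step
    return np.stack([w * x[i * step:i * step + length]
                     for i in range(n_frames)])

fs = 8000
x = np.random.default_rng(0).standard_normal(fs)   # one second of test signal
frames = frame_signal(x, fs)
print(frames.shape)   # (99, 160): 20 ms frames at 8 kHz
```

Each frame starts t_fs after the previous one and extends t_o into the next, so neighbouring frames share exactly the overlap samples that the deframing stage later averages.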

3.2 Deframing

After performing computations on each individual frame it is desirable to combine the frames into a coherent speech signal. This task involves two problems: to take account of the formerly applied window, and to combine the samples shared by different frames.

One possibility for removing the window is to multiply the frame by the reciprocal window. Before doing this it is essential to ensure that the window is everywhere different from zero. This requirement restricts the number of possible windows, ruling out for example the triangular, the Hanning and the Blackman window.

The common samples of two frames can be combined by averaging in such a way that the closer the samples get to the edge of the frame, the less weight they are given. Figure 3.2 shows two functions that average the beginning and the end of a frame. The functions are based on a frame that has an overlap from t1 to t2 and from t3 to t4. The solid line is a half triangular window in each end and the dashed line is a half Hanning window.

Figure 3.2: Two different functions to average the overlap of frames.

The triangular window provides a constant amplitude but gives a point of discontinuity, while the Hanning window has the opposite effect.

Figure 3.3 shows the functions from figure 3.2 divided by a Hamming window to compensate for the windowing performed when framing the signal. Multiplying a frame by one of these functions will dewindow the signal and average the overlapping sections. The frames can then be joined into the speech signal by addition in the overlapping zones.

Figure 3.3: Two functions that remove a Hamming window and average the overlaps of two frames.


3.3 Matlab Code

close all
clear all
clc

[speech, fs, nbits] = wavread('ANN10045.wav');
speech = speech(40000:75000);
speech8 = resample(speech, 16000, fs);
fs8 = 16000;

SegmentStep8 = fs8*.025;
Overlap8 = fs8*.015;
SegmentLength8 = SegmentStep8 + Overlap8;
SpeechLength8 = length(speech8);
nSegments8 = floor(SpeechLength8/(SegmentStep8)) - 1;
Window8 = hamming(SegmentLength8);

de = hanning(2*Overlap8-1)';
dewindow = [de(1:Overlap8), ones(1, SegmentLength8-2*Overlap8), de(Overlap8:end)]'./Window8;

recons = zeros(SpeechLength8, 1);

for i = 1:nSegments8
    speech8Segment(:,i) = speech8((i-1)*SegmentStep8+1 : i*SegmentStep8+Overlap8);
    speechW8(:,i) = Window8.*speech8Segment(:,i);
    speechde(:,i) = speechW8(:,i).*dewindow;
    recons((i-1)*SegmentStep8+1 : i*SegmentStep8+Overlap8) = ...
        speechde(:,i) + recons((i-1)*SegmentStep8+1 : i*SegmentStep8+Overlap8);
end


Cepstral signal analysis for pitch detection 4

Cepstral signal analysis is one of several methods that enable us to determine whether a signal contains periodic elements. The method can also be used to determine the pitch of a signal.

4.1 Definition of the cepstral coefficients

The cepstral coefficients are found by using the following equation

c(τ) = F{log(|F{x[n]}|²)}   (4.1)

where F denotes the Fourier transform and x[n] is the signal in the time domain. |F{x[n]}|² is the power spectrum estimate of the signal.

When using cepstral analysis we use new terms to denote the characteristics: τ is the quefrency, and the magnitude of c(τ) is called the gamnitude.

4.2 Properties of the cepstrum

From equation 4.1 it follows that the cepstral coefficients describe the periodicity of the spectrum. A peak in the cepstrum indicates that the signal is a linear combination of multiples of the pitch frequency. The pitch period can be found as the index of the coefficient where the peak occurs.
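This peak-picking idea can be sketched as follows (a Python illustration under our own assumptions about the quefrency search range, matching the 80-350 Hz limits used by the pitch detectors in chapter 2; not code from the report):

```python
import numpy as np

def cepstral_pitch(x, fs, fmin=80.0, fmax=350.0):
    """Pitch from the cepstrum c(tau) = F{log(|F{x}|^2)} (eq. 4.1): the
    peak quefrency in the search range is the pitch period in samples."""
    power = np.abs(np.fft.fft(x)) ** 2
    cepstrum = np.abs(np.fft.fft(np.log(power + 1e-12)))
    qmin = int(fs / fmax)                 # shortest period considered
    qmax = int(fs / fmin)                 # longest period considered
    q = qmin + np.argmax(cepstrum[qmin:qmax + 1])
    return fs / q

fs = 8000
t = np.arange(2048) / fs
# Harmonic-rich test signal with a 160 Hz fundamental
x = sum(np.sin(2 * np.pi * 160 * k * t) for k in range(1, 6))
print(cepstral_pitch(x, fs))   # close to 160 Hz
```

Restricting the search to qmin..qmax excludes the low-quefrency vocal tract component discussed below, so only the excitation peak is picked.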

To describe this property an example is used. Figure 4.1 shows a segment of speech which we want to analyze using the cepstrum.

The spectrum of the signal is shown in figure 4.2. It is clear that the signal is periodic, due to the equally spaced spikes in the spectrum. However, it is difficult to determine the pitch frequency, since the different harmonics are not necessarily present in the spectrum.

Figure 4.3 shows the cepstrum of the signal. The cepstrum consists of two elements: one from the excitation sequence (a pulse train for voiced speech), present in the higher quefrencies, and one originating from the vocal tract impulse response, present in the lower quefrencies.


Figure 4.1: A segment of a speech signal.

Figure 4.2: The spectrum of the speech signal.

The peak in the higher quefrencies indicates that there is some periodicity in the signal. The peak is located at the period of the fundamental frequency.


Figure 4.3: The cepstrum of the speech signal.


LPC modeling of vocal tract 5

LPC (Linear Predictive Coding) is a method to represent and analyze human speech. The idea of coding human speech is to change its representation: with LPC, the speech is represented by LPC coefficients and an error signal instead of the original speech signal. The LPC coefficients, found by LPC analysis, describe the inverse transfer function of the human vocal tract, and the error signal is found by LPC estimation based on the LPC coefficients from the LPC analysis.

Figure 5.1: Relationship between vocal tract model and LPC model. The left part shows the speech production model (an impulse generator / white noise exciting the vocal tract filter g/A(z)); the right part shows the LPC prediction error filter (FIR) applied to its output.

Figure 5.1 above shows the relationship between the vocal tract transfer function and the LPC transfer function. The left part of the figure shows the speech production model, while the right-hand side shows the LPC prediction error filter (LPC analysis filter) applied to the output of the vocal tract model. The vocal tract transfer function and the LPC transfer function are defined as follows:

H(z) = g / A(z) = g / (1 + Σ_{k=1}^{n} a_k z^{-k})   (5.1)

A(z) = 1 + Σ_{k=1}^{n} a_k z^{-k}   (5.2)

The LPC coefficients in equation 5.2 are obtained using LPC analysis (see section 5.5 on page 36 for LPC theory). The LPC analysis method is described in the next section. LPC estimation and LPC synthesis are described in later sections.


5.1 LPC-analysis

LPC analysis is used to construct the LPC coefficients of the inverse transfer function of the vocal tract. The standard methods for LPC coefficient estimation assume that the input signal is stationary. A quasi-stationary signal is obtained by framing the input signal, often in frames of 20 ms length. A more stationary signal results in a better LPC analysis, because the signal is better described by the LPC coefficients, which minimizes the residual signal. The residual signal, also called the error signal, is described in the next section.

Figure 5.2: LPC analysis block diagram.

Figure 5.2 shows a block diagram of an LPC analysis, where S is the input signal, g is the gain of the residual signal (prediction error signal) and a is a vector containing the LPC coefficients up to a specific order. The size of the vector depends on the order of the LPC analysis: a higher order means more LPC coefficients and therefore a better estimation of the vocal tract.

Matlab LPC analysis:

[autos, lags] = xcorr(s);
autos = autos(find(lags==0):end);
[a, g] = levinson(autos, N);

% autos: Autocorrelation of input signal [vector]
% s:     Input signal [vector]
% a:     LPC coefficients
% g:     Prediction error variance

The above Matlab code calculates a and g from a given input signal s.
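For readers without Matlab's levinson, the recursion can be sketched in Python (our own implementation of the standard Levinson-Durbin algorithm; the variable names and the AR(1) sanity check are ours):

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion on the autocorrelation r(0..order).
    Returns the LPC polynomial a = [1, a_1, ..., a_order] and the
    prediction error variance g, analogous to Matlab's levinson."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    g = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / g
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        g *= (1.0 - k * k)
    return a, g

# Sanity check on a first-order autoregressive signal x[n] = 0.5 x[n-1] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = 0.5 * x[n - 1] + e[n]
r = np.correlate(x, x, mode="full")[len(x) - 1:]
a, g = levinson(r, 2)
print(a)   # a[1] close to -0.5, a[2] close to 0
```

The recovered A(z) whitens the signal, which is exactly the property the LPC estimation step in the next section relies on.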

Figure 5.3(a) shows the input signal, and figure 5.3(b) shows the LPC coefficients and their frequency response, found using the above Matlab code.


Figure 5.3: LPC analysis. (a) Input signal (frame 38 of man_nb.wav); (b) the LPC coefficients (order 12) and their frequency response.

5.2 LPC-estimation

LPC estimation calculates an error signal using the LPC coefficients from the LPC analysis. This error signal, called the residual signal, is the part that could not be modeled by the LPC analysis. It is calculated by filtering the original signal with the inverse transfer function from the LPC analysis. If the inverse transfer function from the LPC analysis matches the vocal tract transfer function, then the residual signal from the LPC estimation equals the excitation that is put into the vocal tract. In that case the residual signal equals the impulses or noise from the human speech production (see figure 5.1).

Figure 5.4: LPC estimation.

Figure 5.4 shows a block diagram of LPC estimation, where S is the input signal, g and a are calculated by the LPC analysis, and e is the residual signal from the LPC estimation.

Matlab LPC estimation:

e = filter(a, sqrt(g), s);

% e: Error signal from LPC estimation [vector]
% a: LPC coefficients from LPC analysis [vector]
% g: Prediction error variance
% s: Input signal [vector]

The above Matlab code calculates e by filtering the input signal s with the inverse transfer function found by the LPC analysis.


Figure 5.5: LPC estimation. (a) Input signal (frame 38 of man_nb.wav); (b) error signal.

Figure 5.5(a) shows the input signal and figure 5.5(b) shows the error signal from the LPC estimation using the above Matlab code.

5.3 LPC-synthesis

LPC synthesis is used to reconstruct a signal from the residual signal and the transfer function of the vocal tract. Because the vocal tract transfer function is estimated by the LPC analysis, it can be combined with the residual/error signal from the LPC estimation to reconstruct the original signal.

Figure 5.6: LPC synthesis.

Figure 5.6 shows a block diagram of LPC synthesis, where e is the error signal found by LPC estimation and g and a come from the LPC analysis. The original signal s is reconstructed by filtering the error signal with the vocal tract transfer function.

Matlab LPC synthesis:

s = filter(sqrt(g), a, e);

% s: Reconstructed signal [vector]
% g: Prediction error variance
% a: LPC coefficients from LPC analysis [vector]
% e: Error signal from LPC estimation [vector]

The above Matlab code calculates the original signal s from the error signal e and the vocal tract


transfer function represented by a and g.

Figure 5.7: LPC synthesis. (a) Error signal; (b) reconstructed signal (frame 38 of man_nb.wav). Both panels show amplitude against time [s].

Figure 5.7(a) shows the error signal and Figure 5.7(b) shows the reconstructed signal obtained from LPC synthesis using the above Matlab code.
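Since the worksheet code is Matlab, the same analysis/synthesis round trip can be sketched with NumPy/SciPy; the coefficients a and g below are made up for illustration, and scipy.signal.lfilter plays the role of Matlab's filter:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
s = rng.standard_normal(160)       # stand-in for one speech frame
a = np.array([1.0, -0.9, 0.2])     # assumed LPC coefficients, a[0] = 1
g = 0.5                            # assumed prediction error variance

# Analysis (Matlab: e = filter(a, sqrt(g), s))
e = lfilter(a, [np.sqrt(g)], s)

# Synthesis (Matlab: s = filter(sqrt(g), a, e)) inverts the analysis filter
s_rec = lfilter([np.sqrt(g)], a, e)

print(np.allclose(s, s_rec))       # -> True: the cascade is the identity
```

Because the synthesis filter is exactly the inverse of the analysis filter, the reconstruction error is only rounding noise.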

5.4 Application of LPC

Bandwidth expansion is a method for increasing the frequency range of a signal. The increase in range is achieved by appending information about the higher frequency components. By generating these components with codebooks for envelope extension and excitation extension, it is possible to increase the bandwidth of the signal.

Figure 5.8: Bandwidth expansion (block diagram: s_nb passes through LP analysis and LP estimation; envelope extension maps g_nb, a_nb to g_wb, a_wb; excitation extension maps e_nb to e_wb; LP synthesis produces s_wb).

Figure 5.8 shows the block diagram of bandwidth expansion using LPC and codebooks (envelope and excitation extension) with additional frequency information.

The Matlab code in the appendices implements everything in the above block diagram except the excitation and envelope extension.


5.5 LPC theory

5.5.1 Wiener filter theory

Figure 5.9: Linear discrete-time filter (block diagram: u(n) is filtered by the FIR filter W to give y(n), and e(n) = d(n) − y(n)).

Orthogonality

y(n) = d̂(n|U_n) = Σ_{k=0}^{∞} w_k* u(n−k),   n = 0, 1, 2, ...   (5.3)

e(n) = d(n) − y(n)   (5.4)

J = E[e(n) e*(n)] = E[|e(n)|²]   (5.5)

∇_k J = −2 E[u(n−k) e*(n)]   (5.6)

∇_k J = 0  ⇒  E[u(n−k) e_o*(n)] = 0,   k = 0, 1, 2, ...   (5.7)

Minimum mean-square error

e_o(n) = d(n) − y_o(n)   (5.8)

e_o(n) = d(n) − d̂(n|U_n)   (5.9)

J_min = E[|e_o(n)|²]   (5.10)


Wiener-Hopf equations

E[ u(n−k) ( d*(n) − Σ_{i=0}^{∞} w_{oi} u*(n−i) ) ] = 0,   k = 0, 1, 2, ...   (5.11)

Σ_{i=0}^{∞} w_{oi} E[u(n−k) u*(n−i)] = E[u(n−k) d*(n)],   k = 0, 1, 2, ...   (5.12)

r(i−k) = E[u(n−k) u*(n−i)]   (5.13)

p(−k) = E[u(n−k) d*(n)]   (5.14)

Σ_{i=0}^{∞} w_{oi} r(i−k) = p(−k),   k = 0, 1, 2, ...   (5.15)

R w_o = p   (5.16)

Wiener-Hopf equations (matrix formulation)

With R = E[u(n) u^H(n)]:

R = [ r(0)      r(1)      ...  r(M−1) ]
    [ r*(1)     r(0)      ...  r(M−2) ]
    [ ...                             ]
    [ r*(M−1)   r*(M−2)   ...  r(0)   ]   (5.17)

p = E[u(n) d*(n)],   p = [p(0), p(−1), ..., p(1−M)]^T   (5.18)

w_o = [w_{o,0}, w_{o,1}, ..., w_{o,M−1}]^T   (5.19)

w_o = R^{−1} p   (5.20)
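As a quick numerical sketch of Equation 5.20 (Python/NumPy here; the correlation values are assumed for illustration), the optimum weights can be obtained either by a direct matrix solve or, exploiting the Toeplitz structure of R, by SciPy's Levinson-type solver:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# Assumed correlation values for a small (M = 3) illustration
r = np.array([1.0, 0.5, 0.25])     # autocorrelation r(0), r(1), r(2) (real input)
p = np.array([0.5, 0.25, 0.125])   # cross-correlation vector

# Direct solve of w_o = R^{-1} p with the full Toeplitz matrix
R = np.array([[r[0], r[1], r[2]],
              [r[1], r[0], r[1]],
              [r[2], r[1], r[0]]])
w_direct = np.linalg.solve(R, p)

# The Toeplitz structure permits a Levinson-type O(M^2) solve
w_fast = solve_toeplitz((r, r), p)

print(w_direct)                    # for these values: [0.5, 0, 0]
print(np.allclose(w_direct, w_fast))
```

Both routes give the same weights; the Toeplitz solve is what the Levinson-Durbin recursion of a later worksheet exploits.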


5.5.2 Prediction error filter

Figure 5.10: Prediction error filter (block diagram: u(n) is delayed by z^{−1}, filtered by the FIR filter W to give y(n), and e(n) = u(n) − y(n)).

y(n) = û(n|U_{n−1}) = Σ_{k=1}^{M} w_k* u(n−k)   (5.21)

e(n) = u(n) − û(n|U_{n−1})   (5.22)

e(n) = u(n) − Σ_{k=1}^{M} w_k* u(n−k)   (5.23)

e(n) = Σ_{k=0}^{M} a_k* u(n−k)   (5.24)

e(n) = Σ_{k=0}^{M} a_k* u(n−k),   a_k = { 1,  k = 0;   −w_k,  k = 1, 2, ..., M }   (5.25)
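For real-valued data the conjugates drop out, and Equations 5.21-5.25 can be checked directly; the following Python sketch (with assumed tap weights w) confirms that subtracting the prediction equals filtering with a:

```python
import numpy as np

# Assumed forward-predictor tap weights w_k, order M = 2 (real-valued, so the
# conjugates in Equations 5.21-5.25 drop out)
w = np.array([0.6, -0.3])
a = np.concatenate(([1.0], -w))    # a_0 = 1, a_k = -w_k   (Equation 5.25)

rng = np.random.default_rng(1)
u = rng.standard_normal(32)

# y(n) = sum_{k=1}^{M} w_k u(n-k): prediction from past samples only
y = np.convolve(u, np.concatenate(([0.0], w)))[:len(u)]

e_direct = u - y                       # Equation 5.22
e_filter = np.convolve(u, a)[:len(u)]  # Equation 5.24: filter u with a

print(np.allclose(e_direct, e_filter)) # -> True
```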


5.5.3 Application Matlab code

clear; close all

%%%%%%%%%%%%%%%%%%%
% Initialize - start
numberofLPCcoeff = 12;  % order of forward linear predictor (gives N+1 coefficients)
offset = 20;            % used frame in input signal
fftpoints = 512;        % number of fft points (used for fft and freqz analysis)
hammingwindowed = 1;    % if true, the input signal is hamming windowed
wave_as_input = 1;      % if true, wave file is used as input signal

plotnumber = 1;         % used for numbering the figures (increment for each figure)
plot_global = 1;
plot_estimation_analysis_input_signal = 1;
plot_LPC_estimation_frequency_response = 1;
plot_LPC_analysis_error_signal = 1;
plot_LPC_synthesis_signal_reconstruction = 1;
epsfiles = 0;

framelength = 20*10^-3;       % length of frame from input signal (even number) [unit: second]
framelengthoverlap = 5*10^-3; % length of overlap between two frames [unit: second]
framelengthwindow = framelength + 2*framelengthoverlap; % total length of frames [unit: second]
% Initialize - end
%%%%%%%%%%%%%%%%%

% Loading wav file - start
used_wav_file = 'man_nb.wav';
[y, fs] = wavread(used_wav_file);
y = y(:,1);

if wave_as_input == 0
    y = sin(2*pi*1000*(0:40000)*1/fs)';
end
% Loading wav file - end

% Downsample input signal - start
y = decimate(y, 2);
fs = fs/2;
% Downsample input signal - end

%%%%%%%%%%%%%%%%%%%%%%
% Pre-initialize - start
framesamples = framelengthwindow/(1/fs);         % length of frame from input signal [unit: samples]
framesamplesoverlap = framelengthoverlap/(1/fs); % length of overlap between two frames [unit: samples]

y = y(1:length(y) - mod(length(y), framesamples)); % fix the length of input signal for framing
minmaxy = [min(y) max(y)]; % min and max values of input signal (used for plotting)
% Pre-initialize - end
%%%%%%%%%%%%%%%%%%%%

% Framing - start
dimensiony = size(y); % used for reconstruction (contains the true dimensions of the input signal)
dimensionyframe = [framesamples length(y)/framesamples]; % used for framing [samples in frame, number of frames]

% framing the data
for i = 1:dimensionyframe(2)
    yframe(:,i) = y(1 + (framesamples - framesamplesoverlap)*(i-1) : framesamples + (framesamples - framesamplesoverlap)*(i-1));
end
minmaxyframe = [min(yframe(:,offset)) max(yframe(:,offset))];
% Framing - end

% Window - start
if hammingwindowed
    yframewindow = yframe .* [hamming(dimensionyframe(1))*ones(1, dimensionyframe(2))];
else
    yframewindow = yframe;
end
% Window - end

% Model fitting - start
signalNB = yframewindow;
% Model fitting - end

% LPC estimation - start
for i = 1:dimensionyframe(2)
    [autosignalNB(:,i), lags(i,:)] = xcorr(signalNB(:,i));
end
autosignalNB = autosignalNB(find(lags(i,:) == 0):end, :);
[a, g] = levinson(autosignalNB, numberofLPCcoeff);
% LPC estimation - end

% Frequency response of LPC transfer function - start
[H, F] = freqz(g(offset)^0.5, a(offset,:), fftpoints, fs);
% Frequency response of LPC transfer function - end


% LPC poly to LSF - start
lsf = poly2lsf(a(offset,:))
% LPC poly to LSF - end

% Frequency response of LSF - start
[H1, F1] = freqz(1, lsf, fftpoints, fs);
% Frequency response of LSF - end

% LPC LSF to poly - start
%a = lsf2poly(lsf)
% LPC LSF to poly - end

% LPC analysis - start
errorNB = filter(a(offset,:), g(offset).^0.5, signalNB); % Error signal
% LPC analysis - end

% LPC synthesis - start
signalNBreconstructed = filter(g(offset)^0.5, a(offset,:), errorNB);
% LPC synthesis - end

if plot_global
    figure(plotnumber)
    plotnumber = plotnumber + 1;

    subplot(2,1,1)
    hold on
    plotyframe = plot([1:framesamples], yframe(:,offset), 'r');
    if hammingwindowed
        plot([1:framesamples], yframewindow(:,offset), 'g')
        plotyframewindow = plot([1:5:framesamples], yframewindow(1:5:end,offset), 'go');
    end
    plot([1:framesamples], real(errorNB(:,offset)), 'b')
    ploterror = plot([1:5:framesamples], real(errorNB(1:5:end,offset)), 'bx');
    title(sprintf('Input signal and error signal (frame: %d)', offset))
    xlabel('Samples [n]'), ylabel('Amplitude'), grid, xlim([1 framesamples])
    if hammingwindowed
        legend([plotyframe plotyframewindow ploterror], 'Input signal', 'Input signal*hamming', 'Error signal')
    else
        legend([plotyframe ploterror], 'Input signal', 'Error signal')
    end

    subplot(2,1,2)
    minmaxdBscale = [min(20*log10(2*abs(H)/fftpoints))-6 max(20*log10(2*abs(H)/fftpoints))+6];
    hold on
    plotfft = plot([0:fftpoints-1]*fs/(fftpoints-1), 20*log10(2*abs(fft(yframewindow(:,offset), fftpoints))/fftpoints));
    plot(F, 20*log10(2*abs(H)/fftpoints), 'r')
    plotlpc = plot(F(1:10:end), 20*log10(2*abs(H(1:10:end))/fftpoints), 'rx');
    stem((lsf/pi)*fs/2, -200 + 20*log10(ones(1, length(lsf))), 'm')
    title(sprintf('FFT of input signal and frequency response of LPC (frame: %d)', offset))
    xlabel('Frequency [Hz]'), ylabel('Amplitude [dB]')
    legend([plotfft plotlpc], sprintf('FFT (fftpoints: %d)', fftpoints), sprintf('LPC (order: %d)', numberofLPCcoeff)), grid, ylim([minmaxdBscale]), xlim([0 fs/2])

    if epsfiles
        print -depsc -tiff -r300 eps/lpc_estimation_global_plot_fft_lpc_time_BJ
    end
end

if plot_estimation_analysis_input_signal
    figure(plotnumber)
    plotnumber = plotnumber + 1;

    plot([0:framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1), yframewindow(:,offset))
    title(texlabel(sprintf('Input signal - (frame: %d of %s)', offset, used_wav_file), 'literal'))
    xlim([0 framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1))
    xlabel('Time [s]'), ylabel('Amplitude'), ylim([minmaxyframe]), grid

    if epsfiles
        print -depsc -tiff -r300 eps/lpc_estimation_input_signal_BJ
    end
end

if plot_LPC_estimation_frequency_response
    figure(plotnumber)
    plotnumber = plotnumber + 1;

    subplot(2,1,1)
    stem([0:numberofLPCcoeff], a(offset,:))
    title(texlabel(sprintf('LPC coefficients - (LPC order: %d, frame: %d of %s)', numberofLPCcoeff, offset, used_wav_file), 'literal'))
    xlabel('Coefficients [n]'), ylabel('Amplitude')

    subplot(2,1,2)
    plot(F, 20*log10(2*abs(H)/fftpoints))
    title(texlabel(sprintf('LPC frequency response - (frame: %d of %s)', offset, used_wav_file), 'literal'))
    xlim([0 fs/2]), xlabel('Frequency [Hz]'), ylabel('Amplitude [dB]'), grid


    if epsfiles
        print -depsc -tiff -r300 eps/lpc_estimation_frequency_response_of_lpc_BJ
    end
end

if plot_LPC_analysis_error_signal
    figure(plotnumber)
    plotnumber = plotnumber + 1;

    plot([0:framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1), errorNB(:,offset))
    title(texlabel(sprintf('Error signal - (frame: %d of %s)', offset, used_wav_file), 'literal'))
    xlim([0 framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1))
    xlabel('Time [s]'), ylabel('Amplitude'), ylim([minmaxyframe]), grid

    if epsfiles
        print -depsc -tiff -r300 eps/lpc_analysis_error_signal_BJ
    end
end

if plot_LPC_synthesis_signal_reconstruction
    figure(plotnumber)
    plotnumber = plotnumber + 1;

    plot([0:framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1), signalNBreconstructed(:,offset))
    title(texlabel(sprintf('Reconstructed signal - (frame: %d of %s)', offset, used_wav_file), 'literal'))
    xlim([0 framesamples-1]*1/fs + (framelengthwindow - framelengthoverlap)*(offset-1))
    xlabel('Time [s]'), ylabel('Amplitude'), ylim([minmaxyframe]), grid

    if epsfiles
        print -depsc -tiff -r300 eps/lpc_synthesis_signal_reconstruction_BJ
    end
end


Line Spectrum Pairs 6

Line spectrum pairs (LSP) are a way of representing the LPC coefficients. The motivation behind the LSP transformation is better interpolation properties and robustness to quantization. These benefits come at the cost of higher complexity of the overall system [Bäckström 2004, p. 47-48]. This worksheet outlines the procedure of the LSP transformation.

6.1 Decomposition Strategy

The key idea of LSP decomposition is to decompose the pth order linear predictor A(z) into a symmetric and an antisymmetric part, denoted by the polynomials P(z) and Q(z) respectively. The procedure is depicted in Figure 6.1.

Figure 6.1: Decomposition of A(z).

For emphasis, the pth order linear predictor A(z) is reproduced below:

A(z) = 1 + Σ_{i=1}^{p} a_i z^{−i}   (6.1)

The two polynomials P(z) and Q(z) are given by Equations 6.2 and 6.3 [Zheng et al. 1998, p. 2]:

P(z) = A(z) + z^{−(p+1)} A(z^{−1})   (6.2)

Q(z) = A(z) − z^{−(p+1)} A(z^{−1})   (6.3)

Note the (p+1)th order of the two decomposition polynomials and the substitution of z^{−1} for the variable in A(z) on the right-hand side of Equations 6.2 and 6.3. The linear predictor A(z) can


be expressed in terms of P(z) and Q(z) as follows:

A(z) = (1/2) (P(z) + Q(z))   (6.4)
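In coefficient form, Equations 6.2-6.4 amount to adding or subtracting a reversed, zero-padded copy of the coefficient vector. A Python/NumPy sketch with an assumed second-order predictor:

```python
import numpy as np

# Assumed second-order predictor A(z) = 1 - 1.2 z^-1 + 0.6 z^-2
a = np.array([1.0, -1.2, 0.6])

A_ext = np.concatenate((a, [0.0]))   # A(z) padded to order p + 1
A_rev = A_ext[::-1]                  # coefficients of z^{-(p+1)} A(z^{-1})

P = A_ext + A_rev                    # Equation 6.2 (symmetric)
Q = A_ext - A_rev                    # Equation 6.3 (antisymmetric)

print(np.allclose(P, P[::-1]))           # -> True: P is palindromic
print(np.allclose(Q, -Q[::-1]))          # -> True: Q is antipalindromic
print(np.allclose((P + Q) / 2, A_ext))   # -> True: Equation 6.4
```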

The LSP parameters are expressed as the zeros (or roots) of P(z) and Q(z). The zeros uniquely determine P(z) and Q(z), and since A(z) can be recovered from P(z) and Q(z), the representation of LPC coefficients by LSP parameters is valid under the assumption that the synthesis filter A(z)^{−1} is stable, i.e. its poles are within the unit circle [Bäckström 2004, p. 48]. The zeros of the LSP polynomials have the following properties [Zheng et al. 1998, p. 2] [Yuan 2003, p. 23-24]:

1. All zeros of P(z) and Q(z) are located on the unit circle.

2. The zeros are separated and interlaced if A(z) is minimum phase, i.e. A(z) has all its zeros within the unit circle.

3. All zeros have a complex conjugate in the z-plane.

Properties 1 and 2 are illustrated in Figure 6.2.

Figure 6.2: Zeros of P(z) and Q(z) located on the unit circle. Note the interlacing property of the zeros.

Because all zeros are located on the unit circle (see Property 1), it is only necessary to specify the angle ω in order to represent the LSP [Bäckström 2004, p. 49]. When the LSP are expressed in terms of angular frequency, the solutions are called line spectral frequencies (LSF). The LSF coefficients are commonly the preferred feature vectors used in vector quantization and are represented by a number in [0;π], or normalized to [0;1].
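Property 1 and the angle representation can be verified numerically; this Python sketch (with an assumed minimum-phase predictor) builds P(z) and Q(z), checks that their roots lie on the unit circle, and reads the LSFs off as root angles:

```python
import numpy as np

# Assumed minimum-phase predictor A(z) = 1 - 1.2 z^-1 + 0.6 z^-2
a = np.array([1.0, -1.2, 0.6])
A_ext = np.concatenate((a, [0.0]))     # A(z) padded to order p + 1
A_rev = A_ext[::-1]                    # z^{-(p+1)} A(z^{-1})
P = A_ext + A_rev
Q = A_ext - A_rev

# Property 1: every zero of P(z) and Q(z) lies on the unit circle
roots = np.concatenate((np.roots(P), np.roots(Q)))
print(np.allclose(np.abs(roots), 1.0))          # -> True

# The LSFs are the sorted angles of the upper-half-plane roots
lsf = np.sort(np.angle(roots[np.angle(roots) > 0]))
print(lsf)                                       # angles in (0, pi]
```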

The all-pole model A^{−1}(z) turns the zeros into poles. As the zeros are located on the unit circle, the representation of A^{−1}(z) by LSF is depicted as vertical lines in the power spectrum.


6.2 Example of LSP

The following illustrates the use of the LSP representation in connection with LPC. Figure 6.3 shows the speech signal after the original speech signal has been preprocessed by framing and windowing with a Hamming window.

Figure 6.3: The speech signal being processed (amplitude against sample index).

The LPC coefficients are found as explained in a previous worksheet.

[LPCcoef, sigma] = lpc(inputdata, 12);
LPCcoef = [1.00 -1.55 1.60 -1.73 1.68 -1.87 1.95 -1.55 1.13 -.75 -.41 -.21 .08]

The LPC coefficients are then converted into line spectral frequencies, as described in the previous section, by the following Matlab command:

LSFcoef = poly2lsf(LPCcoef);
LSFcoef = [.17 .36 .56 .82 1.12 1.30 1.46 1.68 1.98 2.28 2.40 2.52]

Finally, the LSF coefficients and the LPC spectrum are plotted in Figure 6.4.

Note that the LSF vector takes values in [0;π] and that the stem plot of the LSF illustrates the location of the zeros on the unit circle. Where a peak occurs in the spectrum, the LSFs move closer together, and vice versa.

Figure 6.5 shows the zeros of the decomposition polynomials P(z) and Q(z) and the zeros of the predictor polynomial A(z). The zeros of P(z) and Q(z) are marked with circles and obey the interlacing property described earlier. The zeros of A(z) are denoted by triangles.


Figure 6.4: LPC spectrum and vertical lines indicating the LSF (amplitude [dB] against frequency [Hz]).

Figure 6.5: Zeros of P(z), Q(z) and A(z) (imaginary axis against real axis).


6.3 Matlab Code

%%%%%%%%% Illustration of LSP
close all
clear all
load 'lspdata.mat'
offset = 20;
% Windowed signal
inputdata = yframewindow(:,offset);

plot(inputdata)
xlabel('Sample [.]')
ylabel('Amplitude')
print -depsc 'lsp_speechsignal_fhk.eps'

% LPC estimation and spectrum
[LPCcoef, sigma] = lpc(inputdata, 12);
figure; hold on;
[H, freq] = freqz(sigma, LPCcoef, 512, 8000);
plot(freq, (20*log(abs(H))))

% LSF representation and stem plot
LSFcoef = poly2lsf(LPCcoef);

for i = 1:length(LSFcoef),
    Y(i) = H(round((LSFcoef(i)*4000/pi)/256));
end

stem(LSFcoef/pi*8000/2, (-200 + abs(Y)), 'r')
xlabel('Frequency [Hz]')
ylabel('Amplitude [dB]')
axis([0 4000 -200 -100])
print -depsc 'lsp_lpc-lsp_fhk.eps'

% Plot of A(z) and LSF coefficients in the z-plane
LSFpt = [exp(LSFcoef*pi*j); conj(exp(LSFcoef*pi*j))];
figure;
plot(LSFpt, 'ro')
hold on;
[K, V] = tf2latc(1, LPCcoef);
[num, den] = latc2tf(K, V);
[z, p, k] = zpkdata(tf(num, den));
p = cell2mat(p);
plot(p, 'b^');
hold on;
% Unit circle
theta = 0:.01:2*pi;
x = cos(theta);
y = sin(theta);
plot(x, y, 'k');
axis equal
xlabel('Real axis')
ylabel('Imaginary axis')
axis([-1.1 1.1 -1.1 1.1])

print -depsc 'lsp_pzmap_fhk.eps'


Reflection Coefficients 7

Reflection coefficients (RCs) are related to the Levinson-Durbin algorithm used to solve the augmented Wiener-Hopf equations for a prediction error filter. For simplicity, only the forward prediction error filter is used in this worksheet.

7.1 Basics

The matrix formulation of the Levinson-Durbin algorithm can be expressed as follows [Haykin 2002, p. 148]:

a_m = [ a_{m−1} ; 0 ] + κ_m [ 0 ; a_{m−1}^{B*} ]   (7.1)

where a_m denotes the weights of the forward prediction error filter, with the subscript denoting the order m. a_m is made up of the tap weights of the Wiener filter, i.e.

a_m = [1, −w_{m,1}, −w_{m,2}, ..., −w_{m,m}]^T

The superscript B* in Equation 7.1 denotes the complex-conjugated weights of the corresponding backward prediction error filter. κ_m is referred to as the reflection coefficient and is itself defined recursively [Haykin 2002, p. 151]:

κ_m = −∆_{m−1} / P_{m−1}   (7.2)

As can be seen from Equation 7.2, κ_m is a scalar made up of the scalar quantities ∆_{m−1} and P_{m−1}. The scalar ∆_{m−1} can be interpreted in two ways:

• as a cross-correlation between the forward prediction error f_{m−1}(n) and the unit-delayed backward prediction error b_{m−1}(n−1),

• or as a sum of products between a and the autocorrelation sequence of the input, r_{ii}(τ).

Both interpretations are given in Equation 7.3.


∆_{m−1} = E{ b_{m−1}(n−1) f_{m−1}*(n) } = Σ_{k=0}^{m−1} r_{ii}(k−m) a_{m−1,k}   (7.3)

where the forward and backward prediction errors, denoted f_{m−1}(n) and b_{m−1}(n−1) respectively, are defined as the outputs of the transversal filter with tap weights a_{m−1} applied to an input sequence u(n) [Haykin 2002, p. 141-146], i.e.

f_{m−1}(n) = Σ_{k=0}^{m−1} a_{m−1,k}* u(n−k)

b_{m−1}(n−1) = Σ_{k=0}^{m−1} a_{m−1,(m−1)−k} u(n−(k+1))   (7.4)

The scalar P_{m−1} in Equation 7.2 corresponds to the power of a forward prediction error filter of order m−1 and is given as:

P_{m−1} = E{ |f_{m−1}(n)|² }   (7.5)

As ∆_{m−1} in Equation 7.3 can never exceed the forward prediction error power given by Equation 7.5, it follows that the reflection coefficient κ is bounded between −1 and 1.

By recursive use of Equations 7.1, 7.2, 7.3 and 7.5, the Levinson-Durbin algorithm yields the reflection coefficients, the prediction error power and the filter weight sequence a. The initial conditions are P_0 = r(0) and ∆_0 = r*(1). Furthermore, it is worth noting that a_{m,0} = 1 for all m and a_{m,k} = 0 for all k > m. Once κ_m is found, Equation 7.1 readily determines the corresponding weights. The relation between the reflection coefficients and the tap weights makes the reflection coefficients a possible way of describing the LPC coefficients.
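The recursion is short enough to write out. The following Python sketch implements it for real-valued autocorrelations (the worksheets use Matlab's levinson for the same job); the test sequence r(k) = 0.5^k is, up to scale, the autocorrelation of an AR(1) process with coefficient 0.5, so the recursion should recover the predictor [1, −0.5, 0, 0]:

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion for a real autocorrelation sequence r(0..order).
    Returns the prediction-error filter a, the reflection coefficients kappa
    and the final prediction error power P (sketch of Eqs. 7.1-7.5)."""
    a = np.array([1.0])
    P = r[0]
    kappas = []
    for m in range(1, order + 1):
        # Delta_{m-1} = sum_{k=0}^{m-1} r(m - k) a_{m-1,k}   (real-valued case)
        delta = np.dot(r[m:0:-1], a)
        kappa = -delta / P                       # Equation 7.2
        # Order update, Equation 7.1 (backward filter = reversed a for real data)
        a = np.concatenate((a, [0.0])) + kappa * np.concatenate(([0.0], a[::-1]))
        P = P * (1.0 - kappa ** 2)
        kappas.append(kappa)
    return a, np.array(kappas), P

r = 0.5 ** np.arange(4)            # assumed AR(1) autocorrelation, normalized
a, kappas, P = levinson_durbin(r, 3)
print(a)        # predictor coefficients: [1, -0.5, 0, 0]
print(kappas)   # reflection coefficients: [-0.5, 0, 0], all inside ]-1;1[
```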

7.2 Alternative Representations of κ

There are various ways to represent the reflection coefficients. An often used representation is the partial correlation coefficients, abbreviated PARCOR, defined by Equation 7.6 [Haykin 2002, p. 153].

PARCOR = E{ b_{m−1}(n−1) f_{m−1}*(n) } / ( E{|b_{m−1}(n−1)|²} E{|f_{m−1}(n)|²} )^{1/2} = ∆_{m−1} / ( E{|b_{m−1}(n−1)|²} P_{m−1} )^{1/2}   (7.6)


Under the assumption of wide-sense stationarity, the forward prediction error power P_{m−1} is equal to the backward prediction error power E{|b_{m−1}(n−1)|²}, hence Equation 7.6 simplifies to

PARCOR = ∆_{m−1} / (P_{m−1}²)^{1/2} = ∆_{m−1} / P_{m−1} = −κ_m   (7.7)

Thus, under WSS conditions, PARCOR is simply the negative of the reflection coefficient given in the previous section. Both representations suffer from poor quantization properties when the magnitude of κ approaches unity. Therefore, more appropriate representations have been investigated.

The Log Area Ratio (LAR) is one way of obtaining a more robust representation of κ. The LAR transforms a given reflection coefficient by the transformation stated below [Deller et al. 1993, p. 301-302].

LAR = (1/2) · log( (1 + κ) / (1 − κ) ) = arctanh(κ)   (7.8)

Another useful representation is the Inverse Sine parameters, which transform κ according to Equation 7.9 [Deller et al. 1993, p. 301-302].

IS = (2/π) · arcsin(κ)   (7.9)

Both transformations warp the amplitude scale for values of κ near unity to avoid the high sensitivity towards quantization. Figure 7.1 depicts the LAR and IS parameters as a function of reflection coefficients in the interval ]−1;1[. Notice that the IS representation is still bounded by −1 and 1.
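Both transformations are one-liners; a Python sketch with assumed example values of κ:

```python
import numpy as np

kappa = np.array([-0.99, -0.5, 0.0, 0.5, 0.99])  # example reflection coefficients

lar = np.arctanh(kappa)                  # Log Area Ratio, Equation 7.8
inv_sine = 2 / np.pi * np.arcsin(kappa)  # Inverse Sine parameters, Equation 7.9

# Both transforms are invertible; IS stays inside ]-1;1[
print(np.allclose(np.tanh(lar), kappa))                  # -> True
print(np.allclose(np.sin(np.pi / 2 * inv_sine), kappa))  # -> True
print(np.all(np.abs(inv_sine) < 1))                      # -> True
```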

Matlab can easily convert LPC coefficients into reflection coefficients, LAR or IS and vice versa. Table 7.1 displays the commands. The table is read row-wise, e.g. to get from RC to LAR, type rc2lar(RC).

-    | LPC           | RC          | LAR          | IS
LPC  | -             | poly2rc(a)  | poly2lar(a)  | poly2is(a)
RC   | rc2poly(κ)    | -           | rc2lar(κ)    | rc2is(κ)
LAR  | lar2poly(LAR) | lar2rc(LAR) | -            | lar2is(LAR)
IS   | is2poly(IS)   | is2rc(IS)   | is2lar(IS)   | -

Table 7.1: Matlab commands.


Figure 7.1: Log Area Ratios and Inverse Sine parameters as a function of reflection coefficients.

7.3 Remark on κ and LSP

The two polynomials P(z) and Q(z) used in the representation of the line spectral frequencies are related to the reflection coefficients. Equation 7.1 can be expressed as a polynomial A(z) in the variable z as follows [Bäckström 2004, p. 51]:

A(z) + κ_m z^{−(p+1)} A(z^{−1}) = { P(z),  κ_m = 1;   Q(z),  κ_m = −1 }   (7.10)

If the reflection coefficient is either 1 or −1, the Levinson-Durbin algorithm yields the symmetric and antisymmetric polynomials used in connection with the LSF. The worksheet on line spectrum pairs/frequencies is Chapter 6.


Vector Quantization 8

Vector quantization is the process of taking a large set of vectors and producing a smaller set of vectors that represent the centroids of the large data space. VQ can be divided into two separate operations: vector encoding and vector decoding. Figure 8.1 illustrates the two operations.

Figure 8.1: Schematic of the VQ structure. x is the input vector, I an index returned from the encoder, and x̂ the vector returned from the quantization process.

The task of the encoder is to identify in which of N geometrically spaced regions a given input vector lies. The N regions define the codebook, and the index I identifies the position in the codebook. The codebook is made up of codewords, which represent the centroids of the clustered sample space [Gersho & Gray 1991].

The decoder maps the index I to the corresponding vector and outputs the quantized vector x̂.

The goal of VQ is to find a codebook, specifying the decoder, and a partition, specifying theencoder, that will maximize an overall measure of performance.

The K-means algorithm described in a previous worksheet can be used to create the partition needed by the encoder. Figure 8.2 illustrates the steps involved in the creation of the codebook.

A few comments are worth making with respect to Figure 8.2. The feature extraction involves ways of specifying the spectral characteristics of the signal, e.g. by LSP, MFCC etc. The clustering of the feature vectors can be performed by the K-means algorithm. Finally, the centroids calculated by the K-means algorithm are used as codewords in the codebook, against which future test vectors are tested by means of a distance measure (also called a distortion measure). The distance measure determines which centroid (or codeword) best represents a given test vector. A common size of the codebook is 1024 codewords.

After completion of the codebook, the quantizer is ready for use as depicted in Figure 8.3 on


Figure 8.2: Training of the codebook.

the following page. The input vector is first tested using the nearest neighbor rule, and the index returned from the codebook belongs to the codeword that produces the minimum distance measure between the input vector and the codeword. With the index as input to the decoder, the output vector is returned by a table lookup in the codebook. Vector quantizers which rely on this strategy are called Voronoi or nearest neighbor quantizers [Gersho & Gray 1991].
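The encode/decode pair can be sketched in a few lines of Python; the codebook and test vector below are made up for illustration:

```python
import numpy as np

# Tiny illustrative codebook of 2-D codewords (real codebooks often hold ~1024)
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [0.0, 1.0],
                     [1.0, 0.0]])

def vq_encode(x, codebook):
    """Return the index of the nearest codeword (squared Euclidean distance)."""
    distances = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(distances))

def vq_decode(index, codebook):
    """Table lookup: map the index back to its codeword."""
    return codebook[index]

x = np.array([0.9, 0.2])
I = vq_encode(x, codebook)
x_hat = vq_decode(I, codebook)
print(I, x_hat)                # nearest codeword to (0.9, 0.2) is (1.0, 0.0)
```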

Figure 8.3: I/O relationship.


The K-means Clustering Algorithm 9

K-means is a method of clustering observations into a specific number of disjoint clusters. The "K" refers to the number of clusters specified. Various distance measures exist to determine which observation belongs to which cluster. The algorithm aims at minimizing the distance between the centroid of a cluster and the given observation, iteratively reassigning observations between clusters and terminating when the lowest distance measure is achieved.

9.1 Overview of the Algorithm

1. The sample space is initially partitioned into K clusters and the observations are randomly assigned to the clusters.

2. For each sample:

• Calculate the distance from the observation to the centroid of each cluster.

• IF the sample is closest to its own cluster THEN leave it ELSE move it to the closest cluster.

3. Repeat step 2 until no observations are moved from one cluster to another.

When step 3 terminates, the clusters are stable and each sample is assigned to the cluster giving the lowest possible distance to its centroid.
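The steps above can be sketched in a few lines of Python. This is only a minimal illustration (the project's own code is Matlab), assuming random initial centroids rather than random initial assignment, a common variant, and using the squared Euclidean distance:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means sketch: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, cl in zip(centroids, clusters):
            if cl:
                new_centroids.append([sum(v) / len(cl) for v in zip(*cl)])
            else:
                new_centroids.append(c)   # keep an empty cluster's centroid
        if new_centroids == centroids:    # stable: no centroid moved
            break
        centroids = new_centroids
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))   # two centroids, near (0.05, 0.1) and (5.1, 4.95)
```

MATLAB's kmeans offers explicit policies for empty clusters (e.g. 'singleton'); the sketch above simply keeps the old centroid.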

9.2 Distance measures

Common distance measures include the Euclidean distance, the squared Euclidean distance and the Manhattan or City-block distance.

The Euclidean measure corresponds to the shortest geometric distance between two points.

d = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}    (9.1)


A faster way of determining the distance is to use the squared Euclidean distance, which is the above distance squared, i.e.

d_{sq} = \sum_{i=1}^{N} (x_i - y_i)^2    (9.2)

The Manhattan measure calculates a distance between points based on a grid and is illustrated in Figure 9.1.


Figure 9.1: Comparison between the Euclidean and the Manhattan measure.

For applications in speech processing the squared Euclidean distance is widely used.
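The three measures can be stated directly in code. A short Python sketch (for illustration only; the project's own code is Matlab):

```python
import math

def euclidean(x, y):
    """Shortest geometric distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Skips the square root: cheaper, and it preserves the nearest-neighbor ranking."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Grid (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(euclidean(x, y), squared_euclidean(x, y), manhattan(x, y))   # 5.0 25 7
```

Because the square root is monotonic, minimizing the squared distance selects the same centroid as minimizing the true Euclidean distance, which is why the cheaper measure suffices for clustering.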

9.3 Application of K-means

K-means can be used to cluster the extracted features from speech signals. The extracted features include for instance mel frequency cepstral coefficients or line spectrum pairs. This allows speech signals with similar spectral characteristics to be mapped to the same position in the codebook. In this way similar narrowband signals are treated alike, thereby limiting the required size of the codebook.

9.4 Example of K-means Clustering

The following figures illustrate the K-means algorithm on a 2-dimensional data set.


Figure 9.2: Example of signal data made from Gaussian White Noise.


Figure 9.3: The signal data are separated into seven clusters. The centroids are marked with a cross.


Figure 9.4: The Silhouette diagram shows how well the data are separated into the seven clusters. If the distance from one point to two centroids is the same, the point could belong to either centroid. The result is a conflict, which gives a negative value in the Silhouette diagram. The positive part of the Silhouette diagram shows that there is a clear separation of the points between the clusters.


9.5 Matlab Source Code

close all
clear all
clc

Limit = 20;

X = [10*randn(400,2); 10*randn(400,2)];
plot(X(:,1), X(:,2), 'k.')
length(X(:,1))
figure
k = 1;
for i = 1:length(X(:,1))
    if (sqrt(X(i,1)^2 + X(i,2)^2)) > Limit
        X(i,1) = 0;
        X(i,2) = 0;
    else
        Y(k,1) = X(i,1);
        Y(k,2) = X(i,2);
        k = k + 1;
    end
end
plot(Y(:,1), Y(:,2), 'k.')
figure

[cidx, ctrs] = kmeans(Y, 7, 'dist', 'sqEuclidean', 'rep', 5, 'disp', 'final', 'EmptyAction', 'singleton');

plot(Y(cidx==1,1), Y(cidx==1,2), 'r.', ...
     Y(cidx==2,1), Y(cidx==2,2), 'b.', ctrs(:,1), ctrs(:,2), 'kx');

hold on
plot(Y(cidx==3,1), Y(cidx==3,2), 'y.', Y(cidx==4,1), Y(cidx==4,2), 'g.');

hold on
plot(Y(cidx==5,1), Y(cidx==5,2), 'c.', Y(cidx==6,1), Y(cidx==6,2), 'm.');

hold on
plot(Y(cidx==7,1), Y(cidx==7,2), 'k.');

figure
[silk, h] = silhouette(Y, cidx, 'sqEuclidean');
mean(silk)


Generating the codebook for speech enhancement 10

When a codebook is used for speech enhancement, the complexity of the search depends on the actual data which the codebook is used to reproduce, and on the data in the codebook generated through the training process. If there are many similarities between these two sets of data, there is no need for a complex codebook; but if it is difficult to make a clear connection between the input data and the data in the codebook, a different approach must be used to generate the codebook.

The simplest way to generate the codebook is to use the K-means algorithm and separate the LSF vectors into a number of clusters, depending on the first ten coefficients in each vector. The first ten coefficients in the LSF vectors represent the narrowband signal from 300 Hz to 4 kHz, while the last ten coefficients represent the upper frequencies up to 8 kHz. Afterwards, when the codebook is used to enhance the speech, the ten coefficients from the input signal are compared to the first ten coefficients of the vectors in the codebook, as shown in Figure 10.1. The criterion for the comparison is to find the vector in the codebook with the shortest distance to the input vector. The last ten coefficients of the vector found in the codebook are then used to represent the upper frequencies of the signal.


Figure 10.1: The simple way to implement and use a codebook for speech enhancement.

This approach would be sufficient if neither the input signal nor the signal used for training had been modified through filtering and down sampling. The input signal, however, comes from a telephone line where both filtering and down sampling have occurred. Through the process of filtering and down sampling the formant peaks are changed both in frequency and amplitude in such a way that the clear connection between the vectors in the codebook and the input signal is lost. If the above mentioned method to generate and use the codebook is applied, the result


will be a wrong look-up in the codebook, leading to a wideband signal which has not been enhanced but, if anything, degraded.

10.1 Use of the enhanced codebook

To solve this problem, the codebook must be generated from a signal which is identical to the input signal from a telephone line with respect to filtering and down sampling. The training signal for the codebook is therefore filtered and down sampled in order to model the telephone line. This codebook has a clear connection to the input signal, though it does not contain any coefficients for the wideband extension. However, when the codebook is generated using the K-means algorithm, an index of the vectors related to each cluster is produced. This index is used to calculate a second codebook where the training data is the same as the data used to calculate the first codebook, but this time the training data is not filtered and down sampled (see Figure 10.2). The clusters for the second codebook contain both narrowband and wideband coefficients.


Figure 10.2: The figure shows the different steps to generate the advanced codebook. The difference between the narrowband and wideband training data is that the narrowband signal has been filtered with a model of a telephone line. The index records which vectors belong to which cluster.

When the codebook is used to enhance speech, the coefficients of the vector from the input signal are compared to the coefficients of the vectors from the first codebook. The vector in the codebook with the shortest distance to the input vector is chosen. The principle of using the codebook is shown in Figure 10.3. Since the connection between the first and second codebook is well known, the wideband coefficients are looked up in the second codebook.
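The linked lookup can be sketched as follows. This Python fragment uses toy codebooks (the real books hold 2048 entries of 10 and 20 LSF coefficients; the sizes and values here are made up for illustration):

```python
def extend_envelope(nb_vector, nb_codebook, wb_codebook):
    """Find the nearest narrowband codeword, then fetch the linked wideband codeword.

    nb_codebook[i] and wb_codebook[i] were trained from the same cluster, so
    sharing the index i links the two books (a sketch of the chapter's scheme).
    """
    dists = [sum((a - b) ** 2 for a, b in zip(nb_vector, c)) for c in nb_codebook]
    i = dists.index(min(dists))
    return wb_codebook[i]   # full vector: narrowband part plus wideband extension

# Toy 3-entry codebooks with 2 narrowband / 4 wideband coefficients per entry.
nb_cb = [[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]
wb_cb = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
print(extend_envelope([0.48, 0.63], nb_cb, wb_cb))   # entry 1 is the nearest
```

The search itself is identical to the simple scheme; only the table that the index is applied to changes.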

This solution solves the problem, though it makes the codebook more complex both to generate and to use. The produced wideband signal is significantly better when the advanced codebook is used instead of the simpler solution.


Figure 10.3: The advanced codebook is used the same way as the simple one, though the input data is looked up in the first codebook and afterwards linked to the wideband codebook.

10.2 Error correction of narrowband signal

Because of filtering and down sampling, the formant peaks are changed in signals which have been transmitted through a telephone system. This makes the lower frequencies of the speech sound unnatural when the signal is extended to a wideband signal. The ideal solution to this problem is to use the codebook to move the formant peaks so they are closer to those in the original signal.


Figure 10.4: To illustrate the principle of the error correction in the codebook, the figure has only two dimensions. The practical implementation uses 10 dimensions; the principle, however, is the same.

The first ten coefficients of each vector in the second codebook represent the formant peaks in the original signal. Instead of using them directly, the difference is calculated between the ten coefficients in each vector of the first codebook and the corresponding first ten coefficients of the vectors in the second codebook. The calculated difference replaces the first ten coefficients in the second codebook.


Now it is even simpler to use the codebook, since all twenty coefficients of the vectors in the second codebook are simply added to the narrowband signal from the telephone line.
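The construction and use of the difference codebook can be sketched as follows. This Python illustration uses made-up two-plus-two-coefficient vectors instead of the real ten plus ten:

```python
def build_error_codebook(nb_cb, wb_cb):
    """Replace the narrowband half of each wideband codeword with the difference
    (wideband codeword minus narrowband codeword); a sketch of the chapter's scheme."""
    err_cb = []
    for nb, wb in zip(nb_cb, wb_cb):
        n = len(nb)
        err_cb.append([w - m for w, m in zip(wb[:n], nb)] + wb[n:])
    return err_cb

def correct_and_extend(nb_vector, index, err_cb):
    """Add the stored correction to the input vector and append the wideband part."""
    n = len(nb_vector)
    entry = err_cb[index]
    return [v + e for v, e in zip(nb_vector, entry[:n])] + entry[n:]

nb_cb = [[1.0, 2.0]]                 # narrowband codebook (toy, one entry)
wb_cb = [[1.5, 2.5, 3.0, 4.0]]       # linked wideband codebook
err_cb = build_error_codebook(nb_cb, wb_cb)
print(err_cb)                        # [[0.5, 0.5, 3.0, 4.0]]
print(correct_and_extend([1.1, 2.1], 0, err_cb))   # corrected low band + wideband part
```

The lookup still selects the index from the narrowband codebook; only the final step changes from a copy to an addition.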


Codebook training 11

The codebook must be trained with speech data in order to be used for speech enhancement. The more data used for the training, the better the codebook will represent speech; this also means that the codebook must be trained with many different people and many varieties of speech sounds. The training of the codebook gets even more difficult if it is going to be used for different languages, since many languages have unique sounds that differ from other languages.

Making a codebook that includes all ages and several languages is a demanding task, both when it comes to generating and finding the speech samples that give a good representation of the language, and in terms of the computing power needed to calculate the codebook. Computing power is a big issue, since the time it takes to do the calculations appears to rise exponentially with the amount of input data for the training.

Computing power and time were the main limitations for generating a codebook for this project. The result was that the codebook only represents four persons: two women and two men. One codebook with all four persons was made, and two codebooks with men and women respectively. For each person approximately five sound files were chosen.

The files used for training the codebook are listed below. The three letters of the prefix describe the contents of the speech: NDS are numbers, PDS are phrases and finally SDS are sentences. The next two letters are the initials of the person.

Men:
NDS_EJN1001016k_original_short.wav
NDS_EJN2001116k_original_short.wav
NDS_EJN3001216k_original_short.wav
NDS_KTN1001116k_original_short.wav
NDS_KTN2001216k_original_short.wav
NDS_KTN3001316k_original_short.wav
PDS_EJK2000416k_original_short.wav
PDS_EJK3000516k_original_short.wav
PDS_KTD2000516k_original_short.wav
PDS_KTD3000616k_original_short.wav
SDS_EJK5000716k_original_short.wav
SDS_KTD5000816k_original_short.wav

Women:
NDS_HGN1000716k_original_short.wav
NDS_HGN2000816k_original_short.wav
NDS_HGN3000916k_original_short.wav
NDS_JDN1026516k_original_short.wav
NDS_JDN2026616k_original_short.wav
PDS_HGF1000016k_original_short.wav
PDS_HGF2000116k_original_short.wav
PDS_JDG2000316k_original_short.wav
PDS_JDG3000416k_original_short.wav
PDS_JDG4000516k_original_short.wav

Sound duration, men: 296 s. Sound duration, women: 237 s.

The total duration of the speech files for training of the codebook is 533 seconds. It would


have been desirable to have a longer duration, but the limited computer power restricted the amount of data. Another approach for indirectly getting more data could be to narrow down the number of persons, so the codebook gives an even better representation of the single person used for the training.

11.1 Matlab code for codebook training.

Figure 11.1 shows a simple block diagram of the functions for codebook training. The K-means function is described in chapter 9 on page 53, and a more detailed description of the error correction and the idea behind the codebook is found in chapter 10 on page 58.


Figure 11.1: Overview of the different processes for training the codebook.

The three final codebooks are calculated from the training data of men and women respectively and a combination of both. The number of clusters for these codebooks is 2048.
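The reuse of the narrowband cluster index to compute wideband centroids (the ctrsfull loop in the Matlab listing of section 11.2) can be sketched in Python with toy data; vector lengths and values here are made up:

```python
def wideband_centroids(wb_vectors, cidx, k):
    """For each cluster index 0..k-1, average the (unfiltered) wideband training
    vectors that the narrowband K-means assigned to that cluster."""
    sums = [[0.0] * len(wb_vectors[0]) for _ in range(k)]
    counts = [0] * k
    for vec, c in zip(wb_vectors, cidx):
        counts[c] += 1
        sums[c] = [s + v for s, v in zip(sums[c], vec)]
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

# Two clusters over four toy 4-dimensional wideband vectors.
wb = [[1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 4.0, 4.0],
      [1.0, 2.0, 2.0, 2.0], [3.0, 4.0, 4.0, 4.0]]
cidx = [0, 1, 0, 1]
print(wideband_centroids(wb, cidx, 2))
# [[1.0, 1.5, 2.0, 2.0], [3.0, 3.5, 4.0, 4.0]]
```

The key point is that no second K-means run is needed: the cluster membership found on the filtered data is simply reapplied to the unfiltered data.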

11.2 Matlab Source Code

function func_codebooktraining(names, size_of_codebook)
% Function for training the codebook. Input variables are an array with the names of the
% training data, and "size_of_codebook" describes the number of clusters for the codebook.

lsf = [];
etotal = [];

for j = 1:length(names)
    used_wav_file = char(names(j));
    [y, fs] = wavread(used_wav_file);       % Read wave files
    y = y(:,1);

    global HP LP
    load filtre
    y = func_telefilter2(y);                % "Channel" filtering
    fs = fs/2;

    framelength = 20*10^-3;                 % length of frame from input signal (even number) [unit: second]
    framelengthoverlap = 10*10^-3;          % length of overlap between two frames [unit: second]
    framelengthwindow = framelength;        % + 2*framelengthoverlap; % total length of frames [unit: second]

    framesamples = framelengthwindow/(1/fs);            % length of frame from input signal [unit: samples]
    framesamplesoverlap = framelengthoverlap/(1/fs);    % length of overlap between two frames [unit: samples]
    maxframes = length(y)/framesamplesoverlap;          % used for framing [samples in frame, number of frames]

    tic
    for frame = 1:maxframes-1               % framing functions
        signal = func_frame_in_data(y, framesamples, framesamplesoverlap, frame);
        windowedsignal = func_windowing(signal);

        [aLPC, e] = func_lpc_coeff(windowedsignal, 10);
        etotal = [etotal e];
        lsf = [lsf poly2lsf(aLPC)];
    end
    toc
end

% K-means function. The parameter "sqEuclidean" controls the distance measurement from each
% cluster to the vector. "EmptyAction" sets the way the function handles an empty cluster.
% When "singleton" is chosen, an empty cluster will be deleted, and a new one created. The
% new cluster is located where the longest distance between a cluster and a vector is
% found.
[cidx, ctrsfiltered] = kmeans(lsf', size_of_codebook, 'dist', 'sqEuclidean', 'rep', 1, 'disp', 'iter', 'EmptyAction', 'singleton');

eval(sprintf('save filtered_telefilt ctrsfiltered cidx'))

clear lsf; clear etotal; clear y;
lsf = [];
etotal = [];
energy = [];

for j = 1:length(names)
    used_wav_file = char(names(j));
    [y, fs] = wavread(used_wav_file);
    y = y(:,1);

    tic
    for frame = 1:maxframes-1
        signal = func_frame_in_data(y, framesamples, framesamplesoverlap, frame);
        windowedsignal = func_windowing(signal);

        [aLPC, e] = func_lpc_coeff(windowedsignal, 20);
        energy = [energy sum(abs(windowedsignal).^2)];
        etotal = [etotal e];
        lsf = [lsf poly2lsf(aLPC)];
    end
    toc
end

ctrsfull = zeros(size_of_codebook, 20);
for i = 1:size_of_codebook
    I = find(cidx == i);
    for j = 1:length(I)
        ctrsfull(i,:) = [lsf(:,I(j))' + ctrsfull(i,:)];
    end
    ctrsfull(i,:) = ctrsfull(i,:)/length(I);
end

eval(sprintf('save full_telefilt ctrsfull cidx'))
% Calculation of the error distance.
ctrsErrorWB = [ctrsfull(:,1:10) - ctrsfiltered(:,:)/2  ctrsfull(:,11:20)];
eval(sprintf('save errorwb_telefilt ctrsErrorWB frameenergy errorpower cidx'))


Excitation extension 12

The LPC analysis of the narrowband signal gives two different outputs: the LPC coefficients and the residual signal. To achieve a wideband signal it is necessary to expand the LPC coefficients as well as the residual signal. This chapter concerns the high frequency expansion of the residual signal.

The goal of the excitation extension is to extend the narrowband excitation signal to estimate the wideband excitation. According to [Deller et al. 1993, p. 165] the phase spectrum is not of essential importance for speech signals; however, phase continuity across time may be perceptually very relevant. Hence the first priority is to estimate the magnitude spectrum. The first step is therefore to analyze wideband residual signals.

12.1 Analysis of the residual signal

To illustrate how the wideband residual and the narrowband residual are related, a frequency analysis is performed. Figure 12.1 shows a speech segment residual in the time domain (a) and in the frequency domain for wideband (b) and narrowband (c) respectively. The narrowband spectrum is found by filtering the wideband speech signal with the telephone filter specified in chapter 13 and performing a new LPC analysis.

Figure 12.1 shows that the narrowband spectrum is equal to the corresponding part of the wideband spectrum. Because of the filter, the narrowband signal does not contain any frequencies below 300 Hz or above 3.4 kHz. Since the LPC synthesis is a linear operation, it is necessary to add high frequencies to the residual signal. These frequencies should correspond to the higher part of the wideband spectrum.

The residual signal can be divided into two cases:

• Voiced speech signals

• Unvoiced speech signals

The residual signal for a voiced signal is ideally a pulse train, and hence the magnitude spectrum is also a pulse train. Practical speech signals do however have small fluctuations in the pitch, which, due to the harmonic nature of a speech signal, result in a blurred magnitude spectrum at higher frequencies.


Figure 12.1: Frequency analysis of a residual signal. (a) is the wideband residual in the time domain. (b) is the magnitude spectrum of the wideband residual. (c) is the magnitude spectrum of the narrowband residual.

The residual signal for an unvoiced signal is ideally white noise, and hence the magnitude spectrum is flat. In reality a residual signal rarely has a completely flat spectrum.

The LPC estimation results in each frequency band having approximately the same amplitude. However, it is observed that the attenuation tendency of the spectrum generally continues into the high band.

We now turn to methods for extending the frequency range of the residual signal. The next section describes different methods to generate the upper frequencies.

12.2 Generation of high-frequency components

Incrementing the sample rate should result both in a higher sampling frequency and in new high frequency content. Figure 12.2 shows two different methods to extend the sample rate of a speech signal. The plots on the left side show the signal in the time domain, while the plots on the right show the same signal in the frequency domain.

The signal shown in (a) is the original narrowband signal, and the corresponding frequency plot shows that the signal consists of three sinusoidal harmonics.

Signal (b) is achieved by insertion of zeros between each sample. The spectrum shows that


Figure 12.2: Different types of sample rate increment and their effect in the frequency domain. The left side is the signal in the time domain and the right side is the magnitude spectrum. (a) is the original signal. (b) is extended by inserting zeros between every sample. (c) is extended by repeating every sample.

the spectrum is reflected around half the sample frequency of the narrowband signal. This operation therefore adds numerous high frequency components.

Signal (c) is obtained by repetition of each sample. This technique also adds harmonics reflected around half the sample frequency of the narrowband signal, but with a lower amplitude than (b).

Both methods of increasing the sample rate described above reflect the harmonics around half the sample frequency of the narrowband signal. Hence a voiced residual containing mainly low frequencies is given an addition of high frequencies, meaning that a decreasing narrowband spectrum results in an increasing highband spectrum. Clearly this is inappropriate, since the residual should continue the same tendency in the high frequencies as in the low. Therefore it is desirable to find an operation that moves the frequencies instead of mirroring them.

A method to avoid this reflection is to modulate the signal. This operation moves the spectrum, and the reflections are therefore avoided.
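The mirroring caused by zero insertion can be verified numerically. The following Python experiment is only an illustration (a plain DFT and a single tone at bin 4 of a 32-sample frame; the values are chosen for the demo, not taken from the system):

```python
import cmath
import math

def dft_mag(x):
    """Magnitude of the plain DFT (O(N^2), fine for a small demo)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]

N = 32
x = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]   # one tone at bin 4

# Zero insertion: double the sample rate by placing a zero after every sample.
up = []
for s in x:
    up += [s, 0.0]

mags = dft_mag(up)   # 64 bins; bins 0..32 are the positive frequencies
peaks = [k for k in range(33) if mags[k] > 1.0]
print(peaks)   # [4, 28]: the tone plus a mirror image above the old Nyquist (bin 16)
```

The mirror component at bin 28 is exactly the unwanted reflection around half the original sample frequency that the text describes.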


12.2.1 Moving the spectrum

By moving the spectrum, the distance between the harmonics and the structure of the residual are preserved in the highband. This is an advantageous property, as the resulting residual signal follows the same tendency at the higher frequencies as at the low.

A method of moving the spectrum of a signal is to multiply the signal by a complex exponential according to the following theorem:

e^{j\omega_0 n} x(n) \Leftrightarrow X\left(e^{j(\omega - \omega_0)}\right)    (12.1)

The block diagram shown in Figure 12.3 describes how to extend the frequency range of a signal. The letters a, b, c and d denote the different states in the process. These states can be seen in the time domain as well as in the frequency domain in Figure 12.4.


Figure 12.3: Block diagram of moving the signal in the frequency domain.

To use the property stated in equation 12.1 the signal has to be at the wideband sample rate. The sample rate is therefore raised by zero insertion followed by low pass filtering. The signal is now band limited up to 4 kHz.

The next step is to multiply by the modulation function exp(jω0n). After the multiplication the signal looks as shown in (b). The lower frequencies occur because of the symmetry property of the spectrum. These are undesirable, and therefore a high-pass filter is applied. The result is a signal as shown in (c).

To add the modulated signal to the original signal, a simple addition is used. The resulting signal is shown in (d). The factor G adjusts the attenuation of the artificial high frequency components. This factor has to be adjusted by listening tests in such a way that the high frequency components do not sound annoying and metallic.
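The frequency shift of equation 12.1 can also be checked numerically. The Python sketch below modulates a tone by exp(jω0n) and locates the spectral peaks before and after; the bin numbers are arbitrary demo values, and the interpolation, high-pass filtering and gain G of the real system are omitted:

```python
import cmath
import math

def dft_peaks(x, thresh=1.0):
    """Bins whose DFT magnitude exceeds thresh (plain O(N^2) DFT, demo only)."""
    N = len(x)
    return [k for k in range(N)
            if abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                       for n in range(N))) > thresh]

N = 64
tone_bin, shift_bin = 4, 20            # demo values, not the system's 3050 Hz shift
x = [math.sin(2 * math.pi * tone_bin * n / N) for n in range(N)]

w0 = 2 * math.pi * shift_bin / N       # modulation frequency, cf. equation 12.1
y = [x[n] * cmath.exp(1j * w0 * n) for n in range(N)]

print(dft_peaks(x))   # [4, 60]: the tone and its negative-frequency image
print(dft_peaks(y))   # [16, 24]: everything moved up 20 bins; bin 16 is the shifted
                      # image that the high-pass filter in figure 12.3 removes
```

Unlike zero insertion, the harmonics are shifted rather than mirrored, which is exactly the property the chapter exploits.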

The artificial high frequencies can sound metallic because they are too periodic with respect to a real speech signal. The natural signal is often more blurred in the high frequencies, because a small deviation of the pitch frequency results in a large deviation at the higher frequencies. Because of this, it may be an idea to blur the higher frequencies by addition of white noise.


Figure 12.4: Steps in the procedure of moving the signal in the frequency domain. (a) is the original signal. In (b) the signal is amplitude modulated. (c) is (b) high pass filtered and (d) is the sum of (a) and (c).

12.3 Design

The method of moving the lower frequencies is chosen because of the advantages stated in the previous section.

Because of the telephone filter, the frequency content below 300 Hz and above 3400 Hz originates from noise. Therefore the residual is first filtered so that the frequency band is limited to 350-3350 Hz, whereby noise at the band edges is avoided. The filter specifications are described below. The filter is designed using fdatool in Matlab.

High pass filter:
  Pass frequency: 350 Hz    Pass attenuation: -1 dB
  Stop frequency: 300 Hz    Stop attenuation: -60 dB

Low pass filter:
  Pass frequency: 3350 Hz   Pass attenuation: -1 dB
  Stop frequency: 3400 Hz   Stop attenuation: -60 dB

The residual signal is now band limited to the frequencies between 350 Hz and 3350 Hz and is sampled at 8 kHz. Therefore it is desirable to extend the signal with frequencies from 3350 Hz up to 8 kHz, since it is intended to use a new sample rate of 16 kHz. The known part of the signal has a bandwidth of 3000 Hz. It is desired to move this band so that it


covers the band from 3400 Hz to 6400 Hz, corresponding to a shift of 3050 Hz. The modulation frequency in equation 12.1 can now be calculated as:

\omega_0 = 2\pi \frac{f_0}{f_s} = 2\pi \cdot \frac{3050\ \text{Hz}}{16000\ \text{Hz}} = 0.38125\pi    (12.2)

To keep the harmonic properties of the residual, ω0 is quantized to a multiple of the pitch.

To check whether aliasing occurs, a graphical interpretation is the easiest to handle. Figure 12.5 shows a signal with the specific bandwidth before and after modulation. The figure shows that the introduced aliasing can be removed by a high pass filter. To design this filter it is necessary to decide the stop and pass frequencies and the passband and stopband attenuations.


Figure 12.5: Modulation of a signal with frequency range from 350 Hz to 3.35 kHz. The modulation frequency is 3.05 kHz and the figure shows that aliasing can be removed by a high pass filter. (a) is the signal before modulation and (b) is the signal after modulation.

Figure 12.6 shows the modulation and the ideal filter design to remove the modulated negative frequencies. It is observed that the stop frequency is 2.7 kHz and the pass frequency is 3.4 kHz. The attenuations are determined by comparison with the signal to noise ratio of the telephone signal, typically 45 dB. To avoid noticeable low frequency components the stopband attenuation is chosen to be 60 dB.


Figure 12.6: Filter dimensioning to remove the modulated negative frequencies. (a) is the signal before modulation and (b) is the signal after modulation. The dotted line describes the filter that yields the desired frequency band.

Since the passband does not represent the correct residual signal anyway, some ripple is accepted; as a consequence 1 dB is chosen. The filter is implemented as a sixth order elliptic filter using Matlab.

The modulated signal is combined with the original signal by addition in the time domain.
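That the real part of the complex-modulated signal really lands at the shifted frequencies can be verified numerically. A pure-Python sketch (the single-bin DFT helper goertzel_mag is illustrative):

```python
import cmath
import math

def goertzel_mag(x, f_hz, fs):
    """Magnitude of the DFT of x evaluated at the single frequency f_hz."""
    return abs(sum(xk * cmath.exp(-2j * math.pi * f_hz * k / fs)
                   for k, xk in enumerate(x)))

fs, n_samp = 16000, 1600              # 0.1 s at 16 kHz -> 10 Hz bin spacing
f_tone, f_mod = 1000, 3050            # test tone and modulation frequency
x = [math.cos(2 * math.pi * f_tone * k / fs) for k in range(n_samp)]

# Modulate and keep the real part: energy appears at f_mod +/- f_tone.
y = [(xk * cmath.exp(2j * math.pi * f_mod * k / fs)).real
     for k, xk in enumerate(x)]

print(goertzel_mag(y, f_mod + f_tone, fs) > 100)   # True: tone shifted to 4050 Hz
print(goertzel_mag(y, f_mod - f_tone, fs) > 100)   # True: image at 2050 Hz, to be filtered away
```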

12.4 Addition of high-frequency white noise

Since the added signal has a bandwidth of 3000 Hz, the expanded residual signal has a frequency range from 350 Hz to 6400 Hz. Hence there is a band from 6450 Hz to 8000 Hz without information. To fill this band, filtered white noise is added. In a natural residual signal the spectrum becomes more blurred at higher frequencies, because small fluctuations in the pitch have a larger effect the higher the frequency is. To imitate this, the white noise is filtered by a gently sloping filter such that white noise is added from 3500 Hz to 8000 Hz, with ascending amplitude from 3500 Hz to 6450 Hz. Hereby the upper frequencies are blurred. The filter is designed as an elliptic filter in Matlab. The filter specifications are as follows.

Pass frequency: 6450 Hz   Pass attenuation: -1 dB
Stop frequency: 3500 Hz   Stop attenuation: -60 dB

The amplitude of the white noise is determined by comparing reconstructed residuals to original residuals. To avoid detrimental high frequency components, the amplitude is fixed such that the mean amplitude is slightly lower than in the original residual.
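The described amplitude shaping (silent below 3500 Hz, rising until 6450 Hz, full amplitude up to 8000 Hz) can be written down as an idealized target gain; the project uses an elliptic IIR filter, so this piecewise function is only an illustrative sketch of the intended response:

```python
def noise_shaping_gain(f_hz, f_stop=3500.0, f_full=6450.0, f_nyq=8000.0):
    """Idealized target gain of the noise-shaping filter: zero in the
    stopband, a linear ramp from f_stop to f_full, unity up to Nyquist."""
    if f_hz <= f_stop or f_hz > f_nyq:
        return 0.0
    if f_hz >= f_full:
        return 1.0
    return (f_hz - f_stop) / (f_full - f_stop)

print(noise_shaping_gain(3000))   # 0.0: no noise below 3500 Hz
print(noise_shaping_gain(4975))   # 0.5: halfway up the ramp
print(noise_shaping_gain(7000))   # 1.0: full amplitude near Nyquist
```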

12.5 Low frequency extension of the excitation signal

The narrowband residual signal does not contain any frequencies below 350 Hz. Nevertheless, the narrowband residual can contain frequencies in the stopband due to amplification of noise. Therefore it is necessary to remove the residual content below 300 Hz. To enhance the quality of the low frequency part of the residual signal it is desirable to add some low frequency components. It is important to note that the pitch of the signal must be taken into account here: because of the human logarithmic frequency perception, a frequency change of f Hz is much more noticeable in the low frequency range than in the high.

The lower bands of a speech signal are typically periodic. Therefore the residual signal is enhanced by adding a sinusoidal signal with a frequency corresponding to the pitch of the signal. To avoid phase problems only the fundamental frequency is added. The phase is saved for each segment such that the new signal starts at the amplitude where the old one stopped.

In the case where the pitch detector does not find a pitch, no signal is added to the residual.
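The phase bookkeeping can be sketched as a small Python generator class (the class name and the 320-sample segment length at fs = 16 kHz are illustrative assumptions mirroring the save/reset behaviour described above):

```python
import math

class PitchSineGenerator:
    """Per-segment fundamental-frequency sinusoid whose phase carries over
    between segments; the phase resets when no pitch is detected."""

    def __init__(self, fs=16000):
        self.fs = fs
        self.phase = 0.0

    def segment(self, pitch_hz, n_samples=320):
        if pitch_hz <= 0:                 # unvoiced: add nothing, reset phase
            self.phase = 0.0
            return [0.0] * n_samples
        out = [math.sin(2 * math.pi * pitch_hz * k / self.fs + self.phase)
               for k in range(n_samples)]
        # Save the end phase so the next segment continues where this stopped.
        self.phase = (self.phase
                      + 2 * math.pi * pitch_hz * n_samples / self.fs) % (2 * math.pi)
        return out

gen = PitchSineGenerator()
a = gen.segment(123.0)
b = gen.segment(123.0)   # b[0] continues the sine exactly one sample after a[-1]
```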

Figure 12.7 shows the total block diagram for the excitation extension. The gains G1, G2, and G3 adjust the attenuation of each frequency range. These constants are determined by comparison of the extended and the original residual. The gains are constant and do not depend on the actual signal.

Figure 12.7: Total block diagram for excitation extension (the excitation signal is upsampled and interpolated; one branch is modulated by exp(jωn) and high pass filtered, white noise is high pass filtered, and a pitch detector drives a sine generator; the three contributions are weighted by the gains G1, G2, and G3, which adjust the attenuation of each frequency range, and summed to the output).

12.6 Test of excitation extension

To test the functionality of the excitation extension, the reconstructed residual signals are compared to the originals. The test is performed for a voiced and an unvoiced speech signal.

12.6.1 Voiced speech signal

Figure 12.8 shows magnitude spectra for the extended and the original excitation. The upper plot is the extended and the lower is the original excitation. The speech signal is voiced since the residual magnitude spectrum resembles a pulse train. It is observed that the two spectra have similarities. The residual signal is discussed in four different bands:

1. 0-350 Hz: The pitch detector has detected the pitch and a sinusoidal signal has beenadded. The low frequency band is hereby correctly reconstructed.

2. 350-3350 Hz: The band is approximately the same. This is expected since this frequency band is left unchanged by the telephone filter.

3. 3350-6400 Hz: The bands have the same harmonic structure, but the extended residual is more harmonic at the higher frequencies than the original. In addition, the harmonics do not have exactly the right attenuation. Note that this band is a copy of band 2.

4. 6400-8000 Hz: In the highest frequencies the harmonic structure is limited. Because white noise is added, the spectrum is different; the level of the extended residual is, however, lower.



Figure 12.8: Magnitude spectra for the extended and the original excitation. The upper plot is the extended and the lower is the original excitation. The signal is a voiced speech signal.

12.6.2 Unvoiced speech signal

Figure 12.9 shows magnitude spectra for the extended and the original excitation. The upper plot is the extended and the lower is the original excitation. The speech signal is unvoiced since the residual magnitude spectrum is not distinctly periodic. It is observed that the two spectra have similarities. The residual signal is discussed in four different bands:

1. 0-350 Hz: The pitch detector has not detected a pitch and nothing has been added. No low frequency content is reconstructed.

2. 350-3350 Hz: The band is approximately the same. This is expected since this frequency band is left unchanged by the telephone filter. There are no clear periodic components.

3. 3350-6400 Hz: The bands are not equal. There is no apparent structure and hence the method cannot predict this band.



Figure 12.9: Magnitude spectra for the extended and the original excitation. The upper plot is the extended and the lower is the original excitation. The signal is an unvoiced speech signal.

4. 6400-8000 Hz: Because white noise is added, the spectrum is different; the level of the extended residual is, however, lower.

These examples show that the method does add frequency content outside the narrowband frequency range. The added signal is not exactly correct, but similarities exist. To test whether the method is expedient, a listening test is essential. To test the overall performance of the algorithm, the extended residual signal is filtered with the wideband LPC coefficients. In this case the quality reduction of the resulting speech signal originates only from the excitation extension. This test is performed in the test worksheet (chapter 15).

12.7 Matlab Code

function [dataext, ango] = func_excitation_extension(data, extention_gain, pitch, angi)
%
% [dataext] = excitation_extension(data) returns the extended excitation signal.
%
% dataext: the extended (wideband) excitation signal
% data:    the narrowband excitation signal

global Bwhite Awhite
global MF fs

% Filter and resample excitation
xext = func_filtupsample(data);          % Fs = 16 kHz and filter excitation signal
                                         % Fpass = [350, 3350]

% Add frequencies in band: 3400-6450
if pitch ~= 0
    wm = quant(3050, pitch)/fs;          % Calculate modulation frequency
else
    wm = 3050/fs;
end
n = 0:length(xext)-1;                    % Initialize time for modulation
xextShift = real(xext.*exp(j*wm*pi*n)) + imag(xext.*exp(j*wm*pi*n)); % Modulate excitation by wm
xextShiftfilt = filter(MF, xextShift);   % Remove the former negative frequencies

noiFil = filter(Bwhite, Awhite, (.1*(rand(2*160,1)-.5)));  % Filter the noise

xlowr = zeros(1,320);                    % Initialize xlowr (low frequency extension)
if pitch ~= 0 & pitch < 350              % Add sinusoidal signal with the pitch frequency
    xlowr = xlowr + func_windowing(sin(2*pi*pitch*[0:1/(fs*2):319/(fs*2)] + angi));
    ango = angi + 2*pi*pitch*320/(fs*2); % Save the phase
else
    ango = 0;                            % Reset the phase if there is no pitch or pitch is higher than 350 Hz
end

dataext = xext + noiFil' + 1/5000*xlowr + extention_gain*xextShiftfilt;  % Add the contributions

% [dataauto, lags] = xcorr(dataext');
% dataauto = dataauto(find(lags==0):end,:);
% [a, e] = levinson(dataauto, 20);
% dataext = filter(a, sqrt(e), dataext);

function [out] = func_filtupsample(in)   % Define function

global EFH EFL

out = filter(EFH, in);                   % Filter excitation by HP-filter
out = upsample(out, 2);                  % Upsample
out = filter(EFL, out);                  % Interpolation filtering, fpass = 3400 Hz


Telephone filter 13

The codebook has the purpose of linking LSF coefficients for band limited signals with LSF coefficients for the same wideband signal. To generate this connection it is necessary to have the wideband signal as well as the band limited signal. The available sound samples are sampled at 20 kHz. Since the codebook has to represent signals at 16 kHz, all samples are resampled to 16 kHz. To obtain the band limited samples, a model of the telephone filter has to be estimated. This worksheet concerns the modelling of this filter.

13.1 Modelling of the telephone filter

The telephone system is a band pass filter with cut-off frequencies at 300 Hz and 3.4 kHz. It is chosen to have 50 Hz between pass band and stop band. Now the filter demands can be set up.

High pass filter:
  Pass frequency: 300 Hz    Pass attenuation: -1 dB
  Stop frequency: 250 Hz    Stop attenuation: -60 dB
Low pass filter:
  Pass frequency: 3400 Hz   Pass attenuation: -1 dB
  Stop frequency: 3450 Hz   Stop attenuation: -60 dB

To avoid filters of high order, an elliptic implementation is used. The filters are designed using fdatool in Matlab. Coefficient quantization causes more problems the higher the sample frequency is. Therefore the high pass filter is implemented at a sample frequency of 8 kHz and the low pass filter at 16 kHz. Figure 13.1 shows the magnitude plot of the total telephone filter.



Figure 13.1: Magnitude plot of the telephone filter.

13.2 Matlab Code

function [out] = func_telefilter2(in)

global LP HP

out = filter(LP, in);       % Lowpass filter the signal at 3400 Hz
out = downsample(out, 2);   % Downsample signal to 8 kHz
out = filter(HP, out);      % Highpass filter the signal at 300 Hz


Envelope and excitation evaluation 14

The bandwidth expansion algorithm which has been implemented in Matlab combines envelope and excitation extension to construct a bandwidth expanded signal. The algorithm is evaluated by testing its different parts separately. This worksheet documents the test of the envelope and excitation parts of the algorithm.

14.1 Speech signals

In this worksheet different speech signals are used to evaluate the envelope and excitation parts of the algorithm. Evaluation is done by using speech signals which originate from the codebook as a reference. The codebook is trained with different speech signals to construct specific codebooks for women and men respectively, and a combined codebook for both women and men. In later sections the codebooks are evaluated with speech signals. The speech signals used in plots and tables refer to the speech signals described in the codebook training worksheet.

14.2 Envelope measurement

This section describes how the envelope extension is evaluated using the spectral distortion measurement [Paliwal Atal 1993] [Kallio Dec. 2002]. The method calculates the spectral distortion between a reference envelope and a test envelope in accordance with equations 14.1 and 14.2.

D = (1/N) · Σ_{i=1}^{N} [ 20 log10 |H(ω_i)| − 20 log10 |Ĥ(ω_i)| ]²    (14.1)

D_avg = (1/M) · Σ_{j=1}^{M} D_j ,    D_j ∈ {D_1, D_2, ..., D_M}    (14.2)

Evaluation of a codebook can be done by applying this spectral distortion measurement between the original envelope and the envelope from the codebook. H(ω_i) is the frequency response of the reference LPC polynomial and Ĥ(ω_i) is the frequency response of the test LPC polynomial. M in equation 14.2 is the number of spectral distortion values. In [Paliwal Atal 1993] it is described that a spectral distortion of less than 1 dB gives spectral transparency. Spectral transparency means that there is no audible difference between a reference envelope and a test envelope. From this 1.0 dB limit the codebook can be evaluated for transparency.
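Equation 14.1 translates directly into code. A pure-Python sketch (function names are illustrative; both magnitude responses are assumed to be sampled on the same N-point frequency grid):

```python
import math

def spectral_distortion(h_ref, h_test):
    """D of equation 14.1: mean squared difference of the two log-magnitude
    (dB) responses over the N frequency points."""
    assert len(h_ref) == len(h_test)
    return sum((20 * math.log10(abs(a)) - 20 * math.log10(abs(b))) ** 2
               for a, b in zip(h_ref, h_test)) / len(h_ref)

def average_distortion(ref_frames, test_frames):
    """D_avg of equation 14.2: the mean of the M per-frame distortions."""
    ds = [spectral_distortion(r, t) for r, t in zip(ref_frames, test_frames)]
    return sum(ds) / len(ds)

print(spectral_distortion([1.0, 1.0], [1.0, 1.0]))   # 0.0 for identical envelopes
```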

Figure 14.1: Spectral distortion for a man and a woman. Each panel shows the narrowband, wideband, and original envelopes (amplitude in dB versus frequency, 0-8000 Hz).
(a) Maximum spectral distortion, woman: 'PDS_JDR00300.wav' frame 948, D = 18.61 dB
(b) Minimum spectral distortion, woman: 'PDS_JDR00300.wav' frame 507, D = 1.87 dB
(c) Maximum spectral distortion, man: 'PDS_EJK40006.wav' frame 1114, D = 19.85 dB
(d) Minimum spectral distortion, man: 'PDS_EJK40006.wav' frame 683, D = 2.61 dB

Figure 14.1 shows four plots, each consisting of the envelopes from the original signal, the band limited signal and the codebook, all corresponding to the same frame of a speech signal. The plots in the left column show the envelopes of a frame where the spectral distortion reaches its maximum, and the plots to the right show a frame where the spectral distortion has its minimum. As expected, the rightmost plots have the smallest spectral distortion value, because the original envelope and the codebook envelope are closest to each other.

To show how the spectral distortion evolves over time, it is plotted as a function of the frame index for both the woman and the man speech signal. The list below contains the speech signals used for the woman and the man.

• Woman speech signal: PDS_JDR0030016k_original_short.wav

• Man speech signal: PDS_EJK4000616k_original_short.wav

Figure 14.2: Spectral distortion as a function of the frame index, D(frame).
(a) Woman: 'PDS_JDR00300.wav', D_avg = 6.62 dB
(b) Man: 'PDS_EJK40006.wav', D_avg = 6.27 dB

Figure 14.2 shows the spectral distortion as a function of the frame index for the woman and the man speech signal respectively. In agreement with the definition of the spectral distortion measurement, each frame is evaluated against the corresponding original envelope frame. Because the spectral distortion calculation is performed on each frame, it is possible to calculate an average over all frames. This average measurement is used in the next section for the evaluation of different codebooks against different speech signals.

14.2.1 Codebook evaluation

The envelopes from the codebooks are evaluated using the average spectral distortion measurement. In this section the following codebooks are evaluated against different speech signals. The list below describes the three different codebooks which have been constructed from the codebook training; see chapter 11 on page 62.

• Codebook containing only women

• Codebook containing only men

• Codebook containing both men and women


Woman speech signals:
                 Woman codebook      Woman-and-man codebook
                 µ       σ²          µ       σ²
  Scenario 1     6.08    6.69        0.00    0.00
  Scenario 2     6.62    7.36        0.00    0.00
  Scenario 3     6.97    6.24        0.00    0.00

Man speech signals:
                 Man codebook        Woman-and-man codebook
                 µ       σ²          µ       σ²
  Scenario 1     5.34    2.39        0.00    0.00
  Scenario 2     6.27    4.07        0.00    0.00
  Scenario 3     6.36    3.57        0.00    0.00

Table 14.1: Average (µ) and variance (σ²) of the spectral distortion between speech signal and codebook

Woman speech signals:
  Scenario 1: PDS_JDG4000516k_original_short_incodebook.wav
  Scenario 2: PDS_JDR0030016k_original_short.wav
  Scenario 3: PDS_JAF1000216k_original_short_notincodebook.wav

Man speech signals:
  Scenario 1: PDS_KTD3000616k_original_short_incodebook.wav
  Scenario 2: PDS_EJK4000616k_original_short.wav
  Scenario 3: SDS_HCF5000616k_original_short_notincodebook.wav

Table 14.2: Used wav files

For the evaluation of each codebook in the above list there are three different scenarios for the combination of speech signals and codebooks.

• Scenario 1 - Test speech has been used as training data

• Scenario 2 - Person X is in codebook but test speech is not used as training data

• Scenario 3 - Test speech has not been used as training data

Combining the listed codebooks and the listed scenarios leads to three tests for every codebook.

Table 14.1 shows the average spectral distortion for speech files evaluated against the three different codebooks. Table 14.2 shows the wav files used in each scenario. According to the values in table 14.1 it has not been possible to obtain spectral transparency [Paliwal Atal 1993] between an original and a bandwidth expanded envelope. Although the spectral distortion exceeds the 1 dB limit for spectral transparency, it is still possible to extend the envelope with frequencies in a higher band than the original signal. The table also shows that the spectral distortion is smaller for a man than for a woman. This can be explained by the fact that men have a lower pitch than women, which indicates that men are more properly mapped in the codebook.

14.3 Excitation evaluation

This section describes how the excitation is evaluated using the FFT (Fast Fourier Transform). The same frames as in the envelope section are used to evaluate the excitation, which makes it possible to compare the overall performance of the algorithm. This overall performance is described in a later section.

Figure 14.3: FFT analysis of the excitation for a man and a woman. Each panel shows the original, narrowband, and wideband excitation spectra (amplitude in dB versus frequency, 0-8000 Hz).
(a) Maximum spectral distortion, woman: 'PDS_JDR00300.wav' frame 948, D = 18.61 dB
(b) Minimum spectral distortion, woman: 'PDS_JDR00300.wav' frame 507, D = 1.87 dB
(c) Maximum spectral distortion, man: 'PDS_EJK40006.wav' frame 1114, D = 19.85 dB
(d) Minimum spectral distortion, man: 'PDS_EJK40006.wav' frame 683, D = 2.61 dB


Figure 14.3 shows four plots containing the FFT analysis for a man and a woman respectively, at minimum and maximum spectral distortion. Each plot contains the FFT analysis of the original excitation, the channel excitation and the bandwidth expanded excitation.

In contrast to the envelope section, it is difficult to judge whether the bandwidth expanded excitation adds the missing frequency components correctly with respect to the original excitation signal. For a more detailed description of the excitation extension, see the excitation algorithm in chapter 12 on page 65.

The next section combines the envelope and excitation extension to evaluate the overall performance of the implemented algorithm.

14.4 Overall evaluation

In this section the overall performance of the implemented algorithm is evaluated from FFT (Fast Fourier Transform) analyses. From these analyses it is possible to compare the frequency content of the original signal and the bandwidth expanded signal.

Figure 14.4 shows four plots, each containing the FFT analysis of a frame from the original speech signal, the channel signal and the bandwidth expanded signal respectively. The plots in the left column show the FFT analysis of a frame where the spectral distortion reaches its maximum, and the plots to the right show a frame where the spectral distortion has its minimum.

From the envelope from the codebook and the excitation, a wideband signal is constructed. The evaluation of the excitation is not fully covered in this worksheet; for a more detailed description of the excitation, see chapter 12. The evaluation of the envelope is specified in table 14.1 and shows that the spectral distortion is smaller if the person in a speech file is represented in the codebook. From this observation, a codebook should contain a more representative sample of human speech. Such a codebook will cover a wider range of human speech at the expense of codebook complexity.

The codebook constructed in this project will need further work to be sufficient for a wider range of use and to minimize the spectral distortion enough to obtain spectral transparency.


Figure 14.4: FFT analysis of speech signals from a man and a woman. Each panel shows the original, narrowband, and wideband spectra (amplitude in dB versus frequency, 0-8000 Hz).
(a) Maximum spectral distortion, woman: 'PDS_JDR00300.wav' frame 948, D = 18.61 dB
(b) Minimum spectral distortion, woman: 'PDS_JDR00300.wav' frame 507, D = 1.87 dB
(c) Maximum spectral distortion, man: 'PDS_EJK40006.wav' frame 1114, D = 19.85 dB
(d) Minimum spectral distortion, man: 'PDS_EJK40006.wav' frame 683, D = 2.61 dB


Listening test 15

A method to evaluate the implemented algorithm is to perform a listening test. The goal of the test is to establish whether independent listeners prefer the original telephone signal or the synthesized wideband signal. Furthermore it is desirable to determine where the quality reduction occurs.

15.1 Design of the listening test

The test should examine the following questions:

• Does the population prefer the reconstructed signal over the narrowband signal?

• What part of the algorithm is introducing the greatest quality reduction?

• Does the codebook contain enough information?

The implemented algorithm contains two major parts: envelope extension and excitation extension. Both parts cause quality reduction relative to the original wideband signal. To determine which part introduces the greatest quality reduction, it is desirable to exclude one of the parts such that only the other degrades the signal.

This can be done by making an LP analysis and an LP estimation of the wideband signal, which gives respectively the correct envelope and the correct excitation signal. Hereby it is possible to construct six different signals from one speech signal:

1. Original signal

2. Telephone filtered signal

3. Original envelope and original excitation

4. Synthesized envelope and original excitation

5. Original envelope and synthesized excitation

6. Synthesized envelope and synthesized excitation

85

Page 91: Bandwidth Expansion of Narrowband Speech using Linear ...kom.aau.dk/group/04gr742/pdf/rapport.pdf · Bandwidth Expansion of Narrowband Speech using Linear Prediction - Worksheets

CHAPTER 15. LISTENING TEST

By comparing these signals it is possible to point out where signal distortion is added and where extra research is needed. The original signal and the signal obtained with the original envelope and original excitation are not used in the test. Hence there are four different signals for each speech segment.

A part of the envelope extension is the codebook. It is desirable to find out whether this codebook is large enough. For the test two codebooks were used: one for males and one for females. The generation of each codebook is based on training data from two persons.

It is investigated whether the training data describes all shades of the voices of the persons used, and whether the training persons represent a wider selection of the population. Hereby it is possible to determine whether it is necessary to use more training data from each person or to use more persons in the training data.

15.2 The practical test

The practical test is performed as an A/B test where the test persons have to evaluate which speech signal they prefer in a telephone. A total of eight speech segments of approximately 4 seconds are used for the test: four signals from the same persons as are used for training the codebook, two signals from persons outside the codebook, and finally two signals included in the training data.

Since each speech segment is converted into four signals, a total of 32 signals occur. All four versions are tested against each other, and therefore a total of 48 A/B tests is performed. The 48 tests are randomly ordered, and the assignment of A and B is also random.
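The count of 48 follows from pairing the four versions within each of the eight segments; a quick Python check using itertools (the version labels follow table 15.1):

```python
from itertools import combinations

versions = ["Channel", "Orig_LPC Synth_EX", "Synth_LPC Orig_EX", "Synth_LPC Synth_EX"]
n_segments = 8

# Every unordered pair of the four versions is compared once per segment.
pairs = list(combinations(versions, 2))
total_tests = n_segments * len(pairs)
print(len(pairs), total_tests)   # 6 48
```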

15.3 Results

27 test persons were used. The test procedure and the results are shown in table 15.1.

86

Page 92: Bandwidth Expansion of Narrowband Speech using Linear ...kom.aau.dk/group/04gr742/pdf/rapport.pdf · Bandwidth Expansion of Narrowband Speech using Linear Prediction - Worksheets

15.3. RESULTS

Test  Sound type     Filename        Sound A             Sound B             Votes A  Votes B
  1   Training data  PDS_JDG4000516  Synth_LPC Orig_EX   Channel                0       27
  2   Independent    SDS_HCF5000616  Orig_LPC Synth_EX   Synth_LPC Synth_EX    23        4
  3   Independent    PDS_JAF1000216  Synth_LPC Orig_EX   Synth_LPC Synth_EX    14       13
  4   Normal         PDS_EJK4000616  Channel             Synth_LPC Orig_EX     21        6
  5   Training data  PDS_JDG4000516  Channel             Synth_LPC Synth_EX    23        4
  6   Normal         PDS_KTD1000416  Synth_LPC Synth_EX  Orig_LPC Synth_EX      0       27
  7   Normal         PDS_JDR0030016  Channel             Synth_LPC Synth_EX    26        1
  8   Normal         PDS_KTD1000416  Synth_LPC Synth_EX  Channel                2       25
  9   Normal         PDS_KTD1000416  Synth_LPC Orig_EX   Channel                7       20
 10   Normal         PDS_EJK4000616  Synth_LPC Synth_EX  Channel                3       24
 11   Normal         PDS_KTD1000416  Synth_LPC Orig_EX   Synth_LPC Synth_EX    11       16
 12   Normal         PDS_HGF4000316  Synth_LPC Synth_EX  Synth_LPC Orig_EX     16       11
 13   Normal         PDS_KTD1000416  Orig_LPC Synth_EX   Channel               12       15
 14   Training data  PDS_JDG4000516  Synth_LPC Orig_EX   Orig_LPC Synth_EX      3       24
 15   Training data  PDS_KTD3000616  Synth_LPC Synth_EX  Synth_LPC Orig_EX      8       19
 16   Normal         PDS_JDR0030016  Orig_LPC Synth_EX   Synth_LPC Orig_EX     21        6
 17   Training data  PDS_JDG4000516  Orig_LPC Synth_EX   Synth_LPC Synth_EX    23        4
 18   Independent    SDS_HCF5000616  Channel             Synth_LPC Orig_EX     20        7
 19   Training data  PDS_KTD3000616  Channel             Synth_LPC Orig_EX     12       15
 20   Independent    PDS_JAF1000216  Orig_LPC Synth_EX   Synth_LPC Orig_EX     14       13
 21   Independent    SDS_HCF5000616  Synth_LPC Synth_EX  Channel                4       23
 22   Independent    PDS_JAF1000216  Synth_LPC Synth_EX  Orig_LPC Synth_EX      4       23
 23   Independent    SDS_HCF5000616  Orig_LPC Synth_EX   Channel               13       14
 24   Normal         PDS_HGF4000316  Synth_LPC Orig_EX   Orig_LPC Synth_EX      7       20
 25   Training data  PDS_KTD3000616  Synth_LPC Orig_EX   Orig_LPC Synth_EX     11       16
 26   Normal         PDS_EJK4000616  Synth_LPC Orig_EX   Synth_LPC Synth_EX    12       15
 27   Normal         PDS_KTD1000416  Orig_LPC Synth_EX   Synth_LPC Orig_EX     18        9
 28   Normal         PDS_HGF4000316  Synth_LPC Synth_EX  Channel                4       23
 29   Independent    PDS_JAF1000216  Orig_LPC Synth_EX   Channel                7       20
 30   Normal         PDS_EJK4000616  Synth_LPC Orig_EX   Orig_LPC Synth_EX     10       17
 31   Training data  PDS_KTD3000616  Orig_LPC Synth_EX   Synth_LPC Synth_EX    20        7
 32   Training data  PDS_KTD3000616  Channel             Orig_LPC Synth_EX     19        8
 33   Normal         PDS_JDR0030016  Synth_LPC Synth_EX  Synth_LPC Orig_EX     10       17
 34   Independent    SDS_HCF5000616  Synth_LPC Orig_EX   Synth_LPC Synth_EX     9       18
 35   Normal         PDS_HGF4000316  Synth_LPC Orig_EX   Channel                5       22
 36   Independent    PDS_JAF1000216  Synth_LPC Orig_EX   Channel                2       25
 37   Training data  PDS_JDG4000516  Orig_LPC Synth_EX   Channel                8       19
 38   Independent    PDS_JAF1000216  Synth_LPC Synth_EX  Channel                2       25
 39   Training data  PDS_KTD3000616  Channel             Synth_LPC Synth_EX    23        4
 40   Normal         PDS_EJK4000616  Orig_LPC Synth_EX   Channel                7       20
 41   Normal         PDS_EJK4000616  Orig_LPC Synth_EX   Synth_LPC Synth_EX    19        8
 42   Normal         PDS_HGF4000316  Channel             Orig_LPC Synth_EX     16       11
 43   Training data  PDS_JDG4000516  Synth_LPC Orig_EX   Synth_LPC Synth_EX     4       23
 44   Normal         PDS_JDR0030016  Synth_LPC Orig_EX   Channel                2       25
 45   Normal         PDS_JDR0030016  Synth_LPC Synth_EX  Orig_LPC Synth_EX     10       17
 46   Independent    SDS_HCF5000616  Synth_LPC Orig_EX   Orig_LPC Synth_EX      3       24
 47   Normal         PDS_JDR0030016  Orig_LPC Synth_EX   Channel                7       20
 48   Normal         PDS_HGF4000316  Synth_LPC Synth_EX  Orig_LPC Synth_EX      4       23

Table 15.1: Test ordering and results. The sound type parameter indicates whether the signal is part of the training data (Training data), the person is in the training data but not the signal (Normal), or the person is not in the training data (Independent).

To compare the different methods, the number of votes for each method is counted and plotted. Table 15.2 shows the distribution of votes as a percentage of the possible number of votes for each method. The distribution is shown for each codebook scenario. The data is plotted in figures 15.1, 15.2(a) and 15.2(b).
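The percentages in Table 15.2 follow by crediting each method with the votes it received and dividing by the total votes it could have received in that scenario. A minimal sketch of this counting, using three of the Normal-scenario rows from Table 15.1 as example data:

```python
from collections import defaultdict

# Three example rows from Table 15.1:
# (scenario, method A, method B, votes A, votes B); each test has 27 votes.
results = [
    ("Normal", "Channel", "Synth_LPC Orig_EX", 21, 6),
    ("Normal", "Synth_LPC Synth_EX", "Orig_LPC Synth_EX", 0, 27),
    ("Normal", "Channel", "Synth_LPC Synth_EX", 26, 1),
]

votes = defaultdict(int)     # (scenario, method) -> votes received
possible = defaultdict(int)  # (scenario, method) -> votes it could have received

for scenario, a, b, va, vb in results:
    votes[(scenario, a)] += va
    votes[(scenario, b)] += vb
    possible[(scenario, a)] += va + vb
    possible[(scenario, b)] += va + vb

# Vote share per method, in percent of the possible votes.
percentages = {key: 100 * votes[key] / possible[key] for key in votes}
```

Running the same counting over all 48 tests, grouped by scenario, yields the three rows of Table 15.2.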


CHAPTER 15. LISTENING TEST

Scenario       Channel  O.L. S.E.  S.L. O.E.  S.L. S.E.
Normal         79       61         32         27
Training data  76       61         32         31
Independent    78       64         30         28

Table 15.2: Votes for each method for each scenario, as a percentage of the possible votes. Abbreviations: O.: Original, S.: Synthesized, L.: LP envelope, E.: Excitation.

[Figure: bar chart "Same persons used for training of the codebook". Categories: Channel, O.Env. S.Exc, S.Env. O.Exc, S.Env. S.Exc; y-axis: Votes of possible votes [%], 0-100.]

Figure 15.1: Votes for each method for signals outside the training data but from the same persons.

[Figure: bar chart "Input data also used for training of the codebook". Categories: Channel, O.Env. S.Exc, S.Env. O.Exc, S.Env. S.Exc; y-axis: Votes of possible votes [%], 0-100.]

(a) Votes for each method for speech signals from the training data.

[Figure: bar chart "Different persons than used for training of the codebook". Categories and axes as above.]

(b) Votes for each method for speech signals with no relation to the training data.

Figure 15.2: Votes.

To evaluate whether each result is significant, all votes for one specific method versus another are counted.



Case                                          Normal  Training data  Independent
Difference for 5 % significance               54.7    56.7           56.7
1  Channel vs. Orig_LPC Synth_EX              66      70             63
2  Channel vs. Synth_LPC Orig_EX              81      72             83
3  Channel vs. Synth_LPC Synth_EX             91      85             89
4  Orig_LPC Synth_EX vs. Synth_LPC Orig_EX    70      74             70
5  Orig_LPC Synth_EX vs. Synth_LPC Synth_EX   80      80             85
6  Synth_LPC Synth_EX vs. Synth_LPC Orig_EX   53      57             57

Table 15.3: Percentage of the votes received by the first-mentioned method in each comparison. The significance level is 5 %; the listed difference is the percentage one of the methods must exceed to obtain significance.
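Such paired-preference counts are commonly assessed with a sign (binomial) test: under the null hypothesis of no preference, each vote is a fair coin flip, and a method is significantly preferred once its vote share exceeds a threshold determined by the total number of votes. The sketch below is a generic two-sided version of this test; it is an illustration under that assumption and is not guaranteed to reproduce the exact thresholds in Table 15.3, since the report does not state which test was used.

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def threshold_votes(n, alpha=0.05):
    """Smallest number of votes k (out of n) for which a two-sided
    sign test rejects 'no preference' at significance level alpha."""
    for k in range(n // 2 + 1, n + 1):
        if 2 * binom_sf(k, n) <= alpha:
            return k
    return n

# Hypothetical example: a comparison judged with n = 108 votes in total.
n = 108
k = threshold_votes(n)
share = 100 * k / n  # minimum winning vote share in percent
```

The threshold share approaches 50 % as the number of votes grows, which is why larger listening tests can detect smaller preferences.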

15.4 Discussion

The listening test shows that the test subjects prefer the channel method. This indicates that the algorithm either produces the wrong frequency content or adds annoying artefacts. One possibility is that the gain of the extended frequencies is too large. An interesting follow-up test would therefore be to compare different gains.
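One way to set up the suggested gain comparison is to scale the synthesized highband by a few candidate gains before adding it to the upsampled narrowband signal, and let listeners compare the resulting versions. A minimal sketch, where the helper name and signal names are hypothetical:

```python
import numpy as np

def mix_with_gain(narrowband, highband, gain_db):
    """Hypothetical helper: attenuate or boost the synthesized highband
    by gain_db (in dB) before adding it to the upsampled narrowband signal."""
    g = 10.0 ** (gain_db / 20.0)
    return narrowband + g * highband

# Candidate highband gains to compare in a listening test, in dB.
candidate_gains = [0.0, -3.0, -6.0, -9.0]
```

Each listener would then vote between pairs of gain settings exactly as in the test above.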

Furthermore, the test shows that the test subjects prefer synthesized excitation to a synthesized envelope. The conclusion is therefore that the primary quality loss occurs in the envelope extension. This is supported by the fact that there is no noticeable difference between signals with original excitation and purely synthesized signals. Envelope extension therefore needs further research. However, excitation extension also introduces some quality reduction that has to be investigated to obtain a satisfying result.

Regarding the codebook, table 15.3 shows that Synth_LPC Orig_EX receives more votes in case 2 when the signal is part of the training data, but fewer votes in case 4 under the same condition. These two results thus pull in opposite directions. There is therefore no significant indication that the codebook is too small, nor any indication that the training data is not descriptive. To test this relationship, a more comprehensive test is needed.


