
TAMPERE UNIVERSITY OF TECHNOLOGY
Department of Information Technology

JUKKA KIVIMÄKI

Very low bit rate speech coding using speech recognition, analysis and synthesis

MASTER OF SCIENCE THESIS

SUBJECT APPROVED BY DEPARTMENTAL COUNCIL ON June 7, 2000.

Examiners: Professor Ioan Tabus

M.Sc. Konsta Koppinen


Preface

This work has been carried out in the Signal Processing Laboratory of Tampere University of Technology, Finland.

I would like to thank my examiner, Professor Ioan Tabus. I would also like to express my gratitude to my thesis advisor Konsta Koppinen for his guidance, advice, and the wealth of material he provided me during the making of this thesis.

I have been fortunate to work in the Audio Research Group of TUT. In particular, I would like to thank Mr. Teemu Saarelainen, Mr. Antti Rosti and Mr. Tommi Lahti for their advice. I would also like to thank Mr. Timo Haanpää for proof-reading.

Finally, my warmest thanks to my beloved wife Hanna for her continuous support and understanding during this work.

Tampere, December 5, 2000

Jukka Kivimäki


Contents

Abstract  iv

Tiivistelmä  v

List of Abbreviations  vi

List of Symbols  vii

1 Introduction  1

2 Properties of Speech Signals  3
  2.1 Introduction  3
  2.2 Human speech production system  3
  2.3 Finnish articulatory phonetics  5
  2.4 Speech prosody  7
  2.5 Time-domain and Frequency-domain characteristics  8

3 Overview of Speech Analysis  11
  3.1 Introduction  11
  3.2 Overview of speech recognition  11
    3.2.1 HMM speech recognition system  12
    3.2.2 Feature extraction  14
    3.2.3 Hidden Markov models  16
  3.3 Fundamental frequency estimation  20
  3.4 Estimation of vocal tract parameters  22
    3.4.1 Source-filter model  23
    3.4.2 Linear prediction  24

4 Speech Synthesis  27
  4.1 Introduction  27
  4.2 Brief history of speaking machines  27
  4.3 Synthesis methods  29
    4.3.1 Articulatory synthesis  29
    4.3.2 Formant synthesis  30
    4.3.3 PSOLA synthesis  32
    4.3.4 Linear prediction synthesis  33
  4.4 Conclusions  36

5 Phonetic Vocoder for Finnish  38
  5.1 Introduction  38
  5.2 Description of coder  38
  5.3 Speech analysis  40
    5.3.1 Phoneme recognition  40
    5.3.2 Pitch estimation  44
    5.3.3 Energy estimation  46
  5.4 Speech synthesis  46
  5.5 Results  48

6 Conclusions  51

References  52

Appendices  55

A Linear Prediction Theory  55


TAMPERE UNIVERSITY OF TECHNOLOGY

Department of Information Technology

Signal Processing Laboratory

KIVIMÄKI, JUKKA: Very low bit rate speech coding using speech recognition, analysis and synthesis.

Master of Science Thesis, 54 pages and 2 appendix pages.

Examiners: Prof. Ioan Tabus, M.Sc. Konsta Koppinen

Funding: Signal Processing Laboratory, Tampere University of Technology

December 2000

Keywords: speech synthesis, speech analysis, speech recognition, speech coding

During the last few decades the area of speech coding has witnessed drastic development. The intelligibility of synthetic (artificially produced) speech has improved manifold, and today text-to-speech synthesis can be used to address many needs and applications. The recognition of speech, on the other hand, is still in its infancy, but restricted speech recognition can already be found in practical applications.

One of the more interesting fields in speech coding is very low bit rate coding, which is often based on speech recognition and synthesis. In phonetic vocoding, speech recognition is used to segment and recognize the sounds in speech. In addition to phoneme recognition, certain prosodic qualities of the speaker are measured with speech analysis methods. The decoding method is based on speech synthesis, where the speech waveform is constructed from the parametric representation of speech.

This thesis concentrates on the phonetic vocoding of speech. The thesis begins with a presentation of common problems, as well as solution models, associated with phonetic vocoder applications, and includes an introduction to the theory and algorithms used in the speech coder. The main emphasis is on the implementation of the whole speech coding system, with specific attention paid to the decoding application of the speech coder at hand. The speech recognition system's performance was measured with the SpeechDat II speech database, and the voice quality of the vocoder was measured with an informal subjective listening test.


TAMPEREEN TEKNILLINEN KORKEAKOULU (Tampere University of Technology)

Department of Information Technology

Signal Processing Laboratory

KIVIMÄKI, JUKKA: Very low bit rate speech coding using speech recognition, analysis and synthesis.

Master of Science Thesis, 54 pages and 2 appendix pages.

Examiners: Prof. Ioan Tabus, M.Sc. Konsta Koppinen

Funding: Signal Processing Laboratory, Tampere University of Technology

December 2000

Keywords: speech synthesis, speech recognition, speech analysis, speech coding

Speech coding methods have developed significantly during the last decade. The intelligibility of synthetic, i.e. artificially produced, speech has reached a level sufficient for many needs and applications. Speech recognition, in turn, is still rather limited; under clearly restricted conditions, however, a speech recognizer can already be useful in many practical applications.

One interesting speech coding method is very low bit rate speech coding, which is often based precisely on recognizing and synthesizing speech. Phonetic vocoding uses speech recognition to segment and recognize the sounds occurring in speech. In addition to phoneme recognition, certain prosodic properties of the speaker are estimated with speech analysis methods. The decoding in this method is based on speech synthesis, which constructs the waveform representation of speech from its parametric representation.

This thesis discusses the phonetic vocoding of speech and presents the problems encountered in implementing a phonetic speech codec, together with methods for solving them. In addition, the theory and algorithms behind the implemented codec are presented. The main emphasis of the work is on the implementation of the speech synthesizer needed in decoding and on building the complete system. The performance of the speech recognition is studied with the SpeechDat II speech database. The voice quality produced by the speech codec is assessed with an informal subjective listening test.


List of Abbreviations

ACF    Autocorrelation function
AR     Autoregressive
ARMA   Autoregressive moving average
ASR    Automatic speech recognition
CBR    Constant bit rate
DCT    Discrete cosine transform
DFT    Discrete Fourier transform
DP     Dynamic programming
DTFT   Discrete-time Fourier transform
FT     Fourier transform
FIR    Finite impulse response
HMM    Hidden Markov model
IDFT   Inverse discrete Fourier transform
IFFT   Inverse fast Fourier transform
IIR    Infinite impulse response
IPA    International Phonetic Alphabet
LM     Language model
LP     Linear prediction
LPC    Linear predictive coding
LSF    Line spectral frequency
LTI    Linear time invariant
MFCC   Mel-frequency cepstral coefficient
ML     Maximum likelihood
NCCF   Normalized cross-correlation function
PSOLA  Pitch synchronous overlap and add
RAPT   Robust algorithm for pitch tracking
RMS    Root mean square
SPL    Signal Processing Laboratory
ST     Short-term
TTS    Text-to-speech
TUT    Tampere University of Technology
VBR    Variable bit rate


List of Symbols

λ      Hidden Markov model parameter set
ρ      Gain of an error signal
/a/    Phoneme 'a'
a_k    kth predictor coefficient
a^T    Transpose of a vector a
A(z)   Analysis or whitening filter
c_i    ith cepstral coefficient
f_0    Fundamental frequency
f_s    Sampling frequency
H(z)   Synthesis filter
p      LPC model order
R      Autocorrelation matrix
y(n)   Output value of a linear system at time instant n
ŷ(n)   Estimate of y at time instant n


Chapter 1

Introduction

Speech communication plays an important role in our lives. Recently, cellular phones and the Internet have made it possible to convey audio in digital form. This has been made possible by the rapid developments in telecommunications and digital signal processing. However, the storage of enormous quantities of speech and audio has become a problem, and the need for low bit rate speech coding methods is on the increase.

To achieve good quality synthetic speech at very low bit rates, perceptually relevant bit allocations are essential. According to linguistics and information theory, the phonetic domain is the most plausible domain in which the information could be transferred.

In this thesis the problem of very low bit rate speech coding is studied. The aim of the work presented in this thesis was to implement a phonetic vocoder. This speech coding method can be considered a very high-level approach, which utilizes speech recognition, analysis and synthesis in encoding and decoding.

The encoder comprises three speech analysis processes. The objective of the speech recognition is to phonetically segment and recognize speech in a speaker-independent manner. The hidden Markov models (HMM) representing the phoneme models at the encoder are trained and tested using the SpeechDat(II) speech database. In order to incorporate naturalness into the synthesized speech, some prosodic aspects are analyzed. The fundamental frequency of the speaker is estimated using the Robust Algorithm for Pitch Tracking (RAPT). The speech energy is also included in the estimates.

The data stream from the speech coder consists of two components: the phoneme transcription and the prosody information. The decoding is based on speech synthesis. Concatenative LPC synthesis is used to construct the speech waveform from the parametric representation obtained in the encoder.

The purpose of this thesis is to explore the speech quality that is achievable using the phonetic vocoding approach. The topic combines many interesting aspects of speech processing, such as speech recognition, analysis of prosody, and synthesis. This thesis includes an introduction to the theory and algorithms used in the phonetic vocoder.

This thesis is divided into six chapters. In Chapter 2 an introduction to the properties of speech signals is presented. The presentation aims to give relevant background information on the human speech production system and the speech characteristics needed in the following chapters.

Chapter 3 is devoted to theory related to speech analysis. The first section gives a general view of the use of hidden Markov models in speech recognition systems. In addition, the problem of fundamental frequency estimation is addressed, and autocorrelation-based methods for solving the problem are reviewed briefly. In the last section, the vocal tract parameters are estimated using the source-filter model of speech production and linear prediction.

Chapter 4 consists of a brief history of speech synthesis and a review of commonly used techniques. The discussion is restricted to low-level synthesis methods, since in the implemented speech coder the high-level pre-processing used in text-to-speech synthesizers is replaced by parameter estimation from natural speech. The implemented concatenative LPC synthesis method is presented in detail.

The proposed phonetic vocoder and the obtained results are given in Chapter 5. First the overall structure of the coder is described. Then the implemented speech analysis and synthesis subsystems are presented in detail. The chapter ends with a presentation of the obtained results. Chapter 6 concludes this thesis.


Chapter 2

Properties of Speech Signals

2.1 Introduction

In order to apply digital signal processing techniques in the field of speech communication, it is essential to understand the fundamentals of the human speech production process and of digital signal processing. This chapter consists of a brief discussion of the human speech production system, acoustic phonetics, and some selected properties of speech signals related to the following chapters. In addition, the central terminology of speech processing is briefly explained.

A comprehensive review of speech properties and technical fundamentals is presented in [7] and [8]. A more detailed discussion of the human speech production system and acoustic phonetics can be found in [6].

2.2 Human speech production system

Acoustically, speech can be described as a fluctuation of air pressure. In this context the main function of the lungs is to produce air flow for the speech production system. This air flow is forced through the glottis, between the vocal cords, and through the larynx to the cavities of the vocal tract. Leaving the oral and nasal cavities, the air flow exits through the mouth and nose. The main vocal organs, the lungs, larynx, velum, tongue and lips, are depicted in Figure 2.1 below.

In more detail, the lung pressure in the larynx forces the air flow through the tensioned vocal cords, and, as a result, the vocal cords start to vibrate, producing a periodic pressure wave. The fundamental frequency of vibration is an inherent property of periodic signals, and depends on the mass and tension of the vocal cords. Variation in the tension of the vocal cords controls the vibration frequency. The average fundamental frequency is about 110, 200, and 300 Hz for men, women, and children, respectively.


Figure 2.1: The human vocal organs [3]. Legend: 1 nasal cavity, 2 hard palate, 3 alveolar ridge, 4 velum (soft palate), 5 apex, 6 dorsum, 7 uvula, 8 radix, 9 pharynx, 10 epiglottis, 11 false vocal cords, 12 vocal cords, 13 larynx, 14 esophagus, 15 trachea.

Sounds produced by vibrating vocal cords are called voiced sounds. In Finnish, vowels and voiced consonants such as /j/ (e.g., "patja") and /l/ (e.g., "peili") are voiced. Unvoiced sounds, such as /s/ and /f/, are produced with the vocal cords totally open.

The vocal tract is commonly held to consist of the vocal organs after the larynx. The vocal tract can be divided into three parts: the pharynx and the oral and nasal cavities. The overall average length of the vocal tract from the glottis to the lip opening is estimated to be 14.1 cm for women and 16.9 cm for men [6]. The velum controls the air flow from the pharynx to the oral and nasal cavities. From a technical point of view the vocal tract can be seen as an acoustic tube between the glottis and the mouth.

The oral cavity is a prominent part of the vocal tract. Its size and shape can be varied by movements of the tongue, lips, teeth and velum. The tongue has a high degree of freedom in movement, since it is also capable of changing shape by moving its tip and edges. Essential information about the tongue position can be described as constriction, which can be defined as the place where the gap between the tongue and the hard palate or velum is smallest. Vowels, for example, can be distinguished by the place of constriction. The lips are used to control the size and shape of the mouth opening.

Unlike the oral cavity, the nasal cavity has a fixed shape and dimensions. By opening the air flow route to the nasal cavity, the nasalized sounds, such as /m/ (e.g., "kammio") and /n/ (e.g., "paino"), can be produced.

2.3 Finnish articulatory phonetics

Speech in general can be analyzed from different points of view. Phonetics is the study of speech sounds considered in isolation from any language [21]. Articulatory phonetics considers how any given speech sound is produced, with particular emphasis on anatomical detail. In this section these two terms are used to describe how sounds are produced in Finnish.

While humans can produce a great number of sounds, each language has a small set of linguistic units called phonemes. A phoneme is the smallest meaningful contrastive unit in the phonology of a language [19]. Each word of a language is a series of phonemes needed to produce the word. Most languages have 20–40 phonemes, providing an alphabet of sounds which is a unique description of the words in a given language [19]. Originally, the Finnish language consisted of 21 phonemes, but during the last few centuries three phonemes, /b, f, g/, have been adopted from foreign languages [4].

In phonetics, an individual sound is a phone. A phone can be understood as a realization of an abstract phoneme. Acoustically, the realizations of phonemes, the phones, depend on their context. In most cases, the production of a phone will include some articulatory feature left over from the previous phone and some anticipation of features in a subsequent phone [21]. Therefore, each phone can be considered a target at which the vocal organs aim. This phenomenon is termed coarticulation.

A pair of words which differ in only one phone (e.g., "hai" – "hei") is known as a minimal pair. In linguistic field work, the analysis of a language into phonemes is based in part on finding minimal pairs which separate all phones from one another, in the way that "hai" and "hei" separate /a/ and /e/¹ in Finnish.

All languages share the property that speech sounds, phonemes, can be divided into two groups, namely vowels and consonants. The set of vowels and consonants used in a certain language is language specific, that is, phonemic, and therefore the sections below describe only the Finnish vowels and consonants.

Finnish vowels

In Finnish, vowels are defined as sounds that can form a syllable alone. This definition of a vowel would not be valid for, say, English. A more general property of vowel articulation is that the vocal tract is open, and therefore the air can flow out of the mouth freely.

The eight Finnish vowels /a, e, i, o, u, y, æ, ø/ can be characterized with the following properties of the vocal tract:

¹Phonemes are denoted between solidi (e.g., /a/), and they are typed with the International Phonetic Association (IPA) alphabet.


- Frontness/backness, the front-back position of the tongue

- High/mid/low, the up-down position of the tongue

- Roundness/wideness, the shape of the lips

The Finnish vowels are classified using the properties above in Table 2.1.

Table 2.1: Finnish vowels classified by their characteristic properties.

          Front           Back
          Wide   Round    Wide   Round
  High    i      y               u
  Mid     e      ø               o
  Low     æ               a

Finnish consonants

It is typical of consonant sounds that some part of the vocal tract is closed so that the air cannot flow at all, or that the air flow is restricted by a constriction of the vocal tract causing audible turbulence. The thirteen original Finnish consonants are /d, h, j, k, l, m, n, ŋ, p, r, s, t, v/, and the consonants adopted from other languages are /b, f, g/.

The consonants can be divided into resonants and obstruents. It is a property of the resonants that the air can flow relatively freely out of the mouth and nose. In the case of obstruents, however, a tight constriction in the vocal tract causes a turbulent or noisy sound. The classification of the consonants is illustrated in Figure 2.2.

Resonants can be classified into the following four classes according to the behavior of the vocal organs during phonation:

1. Semivowels /j, v/ — Resemble vowels, but the constriction is tighter.

2. Laterals /l/ — The raised tip of the tongue closes the vocal tract. However, the air can still flow via both sides of the tongue.

3. Tremulants /r/ — The tip of the tongue vibrates quickly against the teeth ridge. This results in a discontinuous air flow.

4. Nasals /n, m, ŋ/ — The velum redirects the air flow from the oral cavity to the nasal cavity. The generated voiced sound is affected by both the vocal and the nasal tract.

Obstruents can also be divided into the following two classes:

1. Fricatives /f, h, s/ — A very tight constriction produces a turbulent air flow. Finnish fricatives are unvoiced.


2. Plosives /k, p, t, g, b, d/ — The vocal tract is fully constricted. Reopening the vocal tract results in an impulsive, noisy or burst-like sound.

Figure 2.2: Classification of Finnish phonemes: phonemes divide into vowels and consonants; consonants divide into resonants (semivowels, laterals, tremulants and nasals, with laterals and tremulants grouped as liquids) and obstruents (fricatives and plosives).

2.4 Speech prosody

This chapter has so far concentrated on the human speech production organs and their articulatory behavior in each acoustic segment (phone). Another relevant aspect, with an interpretation beyond phone boundaries, is called prosody. In the field of prosody the relationships of duration, stress, and intonation of a speech utterance are studied. In general, small temporal segments are a cue to phoneme and word identification, while prosody primarily cues other linguistic phenomena [19]. Since prosody usually concerns temporally longer properties of speech than phonemes, the smallest temporal unit of prosody is usually a syllable.

Duration

Duration describes the temporal length of speech sounds. In some languages duration is a phonetic feature (e.g., in Spanish); this means that the meaning of a word does not change when the durations of the phonemes are changed. In Finnish, however, duration is a phonemic feature (e.g., the Finnish words "muta", "muuta" and "mutta" all have different meanings).

In addition to the aforementioned linguistic aspect of duration, phone durations vary considerably due to factors such as the chosen style of speech (reading vs. conversation), stress, and rhythm [19]; e.g., typically half of the conversation time consists of pauses, compared to only 20% in read speech [19].

Effects of stress

Stress is the phenomenon which occurs when a certain syllable or word is perceived to be uttered louder than the adjacent units of speech. Stress can be adjusted by tensing the vocal cords and by variations in the pressure of the lungs. Stress usually relates either to word stress or sentence stress.

Sentence stress is concerned with the stress of each word in the same sentence. It can be used, for example, to emphasize and introduce new information. In addition, infrequently used words have longer durations than common words.

Word stress can be found in a syllable that is pronounced louder than the other syllables in the word. Word stress can be further divided into three types. In many languages primary stress is automatically located in some language-specific syllable, and it acts as a cue for word boundaries. In Finnish the primary stress is always located on the first syllable of a word, but in English the general location of the stress is the first syllable of the root word (e.g., re'route²). However, in some cases the primary stress is also used to change the meaning of words (e.g., 'import vs. im'port). In Finnish, secondary stress is used to change the meaning of words.

Intonation

The term intonation refers to pitch in a speech utterance. Intonation is a phonemic feature in some languages, and it can also be used to express punctuation. For example, in English the pitch rises at the end of an interrogative clause, whereas in Finnish it does not rise but the average pitch of the whole clause is higher.

Intonation in speech is produced mainly by adjusting the tension of the vocal cords. This alters the fundamental frequency of the cords, so the perceived pitch is also varied.

2.5 Time-domain and Frequency-domain characteristics

The audio spectrum represents the range of frequencies audible to humans, and it describes the bandwidth of the human hearing system. This range is usually held to extend from 20 Hz to 20 kHz. The actual range of frequencies most human beings can hear is usually much narrower. This can be attributed to individual differences and characteristics of each human being's hearing system, as well as to the natural deterioration of the human hearing system due to ageing. Frequencies above and below the human audio spectrum are called ultrasonic and infrasonic, respectively.

²In the IPA alphabet, primary word stress is denoted with ' before the stressed syllable, and secondary word stress with , before the syllable.


A speech waveform sampled at 16 kHz and the corresponding spectrogram are illustrated in the upper and lower plots of Figure 2.3. It can be noticed that there is less variability in the time behavior of the spectral envelope than in the corresponding speech waveform. Moreover, two interesting features are present in the spectral envelope in the case of voiced sounds. The narrow maxima are due to the fundamental frequency of the speech signal, and the wide maxima are due to the resonant frequencies of the vocal tract, called formants.

Figure 2.3: Finnish word "seitsemän" represented as a waveform (upper plot; amplitude vs. time) and a spectrogram (lower plot; frequency 0–8000 Hz vs. time).

The periodicity of voiced sounds can also be noticed in the amplitude spectrum of a speech segment. In Figure 2.4 the amplitude spectrum of the Finnish vowel /e/ is presented. The amplitude spectrum is calculated from a short time segment (250–270 ms) of the spectrogram presented in Figure 2.3. The time-domain periodicity can be observed here as the harmonicity of the spectral peaks (solid line).

The speech signal is often represented as a non-stationary random process. This statistical description of the speech signal can be justified by the inertia of the articulatory organs. The speech signal is usually assumed to be at least wide-sense stationary and ergodic in autocorrelation for a short period of time, usually about 10–30 ms.

The short-time power spectrum (from now on referred to as the power spectrum) is calculated over a smooth analysis window. The analysis of the speech signal is therefore localized to a certain time instant.

Figure 2.4: Amplitude spectrum of the Finnish vowel /e/ showing rapidly varying harmonic peaks (solid line) and a slowly varying spectrum envelope (dashed line).

The autocorrelation theorem [5] defines the relationship between the autocorrelation function and the power spectrum of a signal. The power spectrum is the discrete-time Fourier transform (DTFT) of the signal's autocorrelation function; that is,

$$S_{XX}(\omega) \triangleq \sum_{n=-\infty}^{\infty} R_{XX}(n)\, e^{-jn\omega}, \qquad (2.1)$$

where $S_{XX}$ is the power spectrum of the signal and $R_{XX}$ is the autocorrelation function of the signal.

This shows that the short-term autocorrelation observed in a speech segment is a function of the power-spectral envelope, and therefore also a function of the shape of the vocal tract. It can also be shown that the long-term autocorrelation of a speech segment corresponds to the fine structure of the power spectrum [1].
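As a side note beyond the text, Equation 2.1 can be checked numerically: the minimal sketch below compares the squared-magnitude DFT of a windowed frame with the DFT of its biased autocorrelation estimate. The test signal, frame length, and window are arbitrary assumptions made for illustration.

```python
import numpy as np

# Numerical check of Equation 2.1: the power spectrum equals the Fourier
# transform of the autocorrelation function (Wiener-Khinchin theorem).
fs = 16000                                    # sampling frequency (Hz)
L = 512                                       # analysis frame length
n = np.arange(L)
frame = np.sin(2 * np.pi * 120 * n / fs)      # toy "voiced" frame, f0 = 120 Hz
frame += 0.01 * np.random.randn(L)            # a little noise
frame *= np.hanning(L)                        # smooth analysis window

N = 2 * L - 1                                 # DFT length that avoids aliasing

# Direct route: squared magnitude of the frame's DFT.
S_direct = np.abs(np.fft.fft(frame, N)) ** 2 / L

# Autocorrelation route: biased estimate R(k) for lags -(L-1)..(L-1) ...
R = np.correlate(frame, frame, mode="full") / L
# ... rotated so that lag 0 sits at index 0, then transformed.
S_from_R = np.fft.fft(np.fft.ifftshift(R), N).real

assert np.allclose(S_direct, S_from_R)        # the two agree, as Eq. 2.1 states
```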


Chapter 3

Overview of Speech Analysis

3.1 Introduction

In speech analysis, interesting properties of speech are examined with various analysis methods. Modern computers have the processing power needed to run a very large collection of speech analysis algorithms. The analyzed speech input is usually a sampled speech utterance, and the automated process returns the desired information (e.g., the fundamental frequency of speech, a spectrogram, etc.) to the user.

The speech analysis techniques reviewed in this chapter form the basis of the speech analysis subsystem of the phonetic vocoder described in Chapter 5.

In the next section, Section 3.2, the fundamentals of a modern speech recognition system are presented. The problem of fundamental frequency estimation is addressed in Section 3.3. Section 3.4 describes the source-filter model of speech production, and a method for obtaining the vocal tract parameters needed in the model.

3.2 Overview of speech recognition

Speech recognition by machine has been an important research goal for almost five decades. The development towards modern speech recognition systems has required enormous efforts over a wide range of disciplines such as signal processing, physics, pattern recognition, linguistics, acoustics, and mathematics [14]. Due to the interdisciplinary nature of automatic speech recognition (ASR), the task has proven to be challenging.

The earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s. In the 1980s, speech recognition research was characterized by a shift in technology from the template-based approach to statistical modeling methods, especially hidden Markov modeling [14].

In the template-based recognition approach, the objective is to derive a typical model, a template, for the object via some averaging procedure. Recognition is performed by choosing the template that has the minimum distance to the measured object. The methodology of template-based recognition is well developed and provides good recognition performance for a variety of practical applications. However, template-based recognition was found insufficient to meet the requirements of modern speech recognition systems.

In the mid-1980s, hidden Markov models (HMM) became widely applied in virtually every speech recognition laboratory in the world. The reason for this is that HMMs offer a more flexible model of the statistical nature of speech. Moreover, the HMM framework includes both an automatic training scheme for estimating the model parameters and efficient decoding algorithms for performing recognition.

3.2.1 HMM speech recognition system

Functionally, a speech recognition system can be divided into three separate blocks. In the preprocessing stage, the speech signal is segmented into low-dimensional feature vectors. In the recognition stage, statistical methods are used to classify the feature vectors into selected phonetic or linguistic categories. In the (optional) post-processing stage, a language model can be incorporated to enhance the recognition accuracy by analyzing the syntax and semantics of the recognition result. A block diagram of a general HMM-based speech recognition system is depicted in Figure 3.1.

Figure 3.1: Block diagram of an HMM speech recognition system. A feature extraction front-end converts the speech signal into feature vectors; the recognition algorithm combines acoustic models trained on a speech database with a language model (lexicon and grammar) to produce the recognition hypothesis.

In Figure 3.1, the feature extraction front-end converts the speech waveform into a sequence of acoustic observation vectors characterizing the spectral content of a temporally relatively short speech segment (typically 20–25 ms). This speech signal representation also aims to reduce the redundancy included in the speech waveform (see Section 2.5).

At present, the most popular acoustic model for speech recognition is the hidden Markov model. In the statistical speech recognition approach, the problem is how to model the distribution of the feature vectors, and HMMs are the standard construction used to model the second-order statistics, the mean and variance, of the feature vectors. This statistical modeling approach is well suited to the speech recognition task, since speech can be considered a random process (see Section 2.5). The HMM parameters are estimated from a speech database which matches the target user group and overall conditions as well as possible.

The language model comprises a lexicon and a grammar. The lexicon defines the mapping between the acoustic models and the system vocabulary. In the case of word-level HMM modeling the lexicon is a simple one-to-one mapping of models and words, and in the case of subword HMM modeling the lexicon determines how to concatenate models to form words. In triphone HMM modeling, for example, the lexicon could establish a relation between the dictionary word "cat" and the HMM model concatenated from the subword (triphone) models c+a, c-a+t, and a-t¹. Moreover, the multiple pronunciations of certain words (e.g., "the" in English) are defined in the lexicon.
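As a toy illustration of the lexicon's role, the hypothetical data structure below (invented for this sketch, not taken from any particular recognizer) maps each vocabulary word to the triphone model sequences that realize it:

```python
# Hypothetical subword lexicon: each word maps to one or more pronunciations,
# each pronunciation being the sequence of triphone models to concatenate.
lexicon: dict[str, list[list[str]]] = {
    "cat": [["c+a", "c-a+t", "a-t"]],          # the example from the text
    "the": [["dh+ax", "dh-ax"],                # invented phone labels showing
            ["dh+iy", "dh-iy"]],               # multiple pronunciations
}

# A decoder would build one composite HMM per pronunciation:
for models in lexicon["cat"]:
    print(" -> ".join(models))                 # c+a -> c-a+t -> a-t
```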

The grammar contains a set of probabilities for each word occurrence, and optionally probabilities for word occurrences after a given word. The language model is known as a unigram if it describes only the former probabilities. If the model also defines the probabilities of possible word pairs, it is called a bigram. Language modeling is outside the scope of this thesis, but further discussion of the subject can be found in [9, 10, 18].

The requirements set for a speech recognition system differ greatly depending on the intended application. The specifications usually considered are the environment, recognition accuracy, and speed. Speech recognizers without any major constraints perform very poorly, and recognizers can therefore be categorized using the following three aspects [11]:

1. Speaker dependence and independence

This aspect refers to the training process of the speech recognition system. In speaker-dependent speech recognition the system is trained to deal with input from a single "target" speaker. Due to the small variability of the speech, the recognition performance is better compared to speaker-independent recognition. In addition, there is no need for extensive training data.

2. Isolated word and continuous speech recognition

In isolated word recognition systems, it is assumed that the speech input consists of a single word or phrase which can be considered a 'command' word. In continuous speech recognition, however, the recognition process allows natural conversational speech. Continuous speech recognizers allow the most rapid input, but continuous speech is the most difficult class to recognize [19].

¹In triphone modeling, a-b+c denotes a model for phone /b/ where the left context is phone /a/ and the right context is /c/.


3. Vocabulary size

Vocabulary size has a great effect on recognition accuracy and speed. Naturally, a large vocabulary is more likely to contain confusable words than a smaller one. On the other hand, a small vocabulary containing similar words (or sub-word units) can also have the same problem. The strict speed requirements of real-time applications usually imply that the vocabulary size is very limited.

3.2.2 Feature extraction

To use a speech recognition system effectively, the speech signal has to be parametrized in some appropriate manner. The parameterization aims to represent the speech signal as compactly as possible, thus reducing the data rate. This means that redundant information should be avoided, and yet the representation should retain all the essential information needed in recognition.

Many different forms of speech parameterization have been considered over the years. Presently there are two approaches: one is some type of coding, usually linear prediction of the time-domain signal, and the other is direct sampling of domains other than the time domain, usually the frequency or cepstral domains. In the literature these approaches are usually referred to as linear predictive coding (LPC) and filterbank analysis, respectively. In both approaches the results describe an approximation of the spectral envelope of the speech signal. This thesis concentrates on the filterbank approach instead of reviewing the LPC-based approaches, but the reader can find more information in [1, 2, 14, 34].

Formants represent the most immediate source of articulatory information. Hence they have been used extensively as primary features in speech recognition. Information about formants is contained in the spectrum envelope (see Section 2.5). Therefore virtually all feature extraction front-ends calculate a group of feature values representing the spectral envelope of temporally short speech segments.

At present, the most common features in speech recognition systems are the cepstral coefficients. Cepstral coefficients can be obtained by linear prediction or filterbank analysis. Feature vectors are used to represent the speech waveform. An illustrative example of the feature extraction process is depicted in Figure 3.2. In the upper part of the figure, a 100 ms speech segment is divided into ten frames. Each feature vector $\mathbf{o}_i$ in the lower part of the figure is calculated from two frames of speech. The sequence of feature vectors is the output of the feature extraction subsystem.

Figure 3.2: Conversion of a speech waveform to a feature vector sequence.

The speech signal windowing is generally carried out with tapered analysis windows, such as the Hanning window (defined in, e.g., [20]), of length 20–30 ms. These speech segments are usually termed frames. The speech signal statistics are generally considered stationary within this analysis window (see Section 2.5). The analysis window normally covers more than two periods of voiced speech. With a shorter analysis window, a change in the signal period shows up in the spectral representation as a local variation; on the other hand, with a longer analysis window the fast changes in the vocal tract will not be captured by the speech analysis.

The human ear resolves frequencies non-linearly across the audio spectrum (defined in Section 2.5) [6]. This argument and the empirical evidence [13] suggest that a front-end that operates on a non-linear frequency scale improves the recognition accuracy. One of the most popular non-linear frequency spacings is called the mel-scale. The non-linear frequency warping function from the linear frequency scale to the mel-scale is defined by

$$f_{\text{Mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad (3.1)$$

where $f$ is the frequency on the linear scale and $f_{\text{Mel}}$ is the corresponding frequency on the mel-scale.
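As a quick numerical illustration of Equation 3.1 (the sample frequencies chosen below are arbitrary), note how each doubling of the linear frequency adds less and less on the mel axis:

```python
import numpy as np

def hz_to_mel(f_hz: float) -> float:
    """Warp a linear-scale frequency to the mel-scale (Equation 3.1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# Equal octave steps in linear frequency shrink on the mel axis:
for f in (500.0, 1000.0, 2000.0, 4000.0, 8000.0):
    print(f"{f:6.0f} Hz -> {hz_to_mel(f):7.1f} mel")
```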

Filterbank analysis provides a straightforward route to the above-mentioned non-linear frequency resolution of the speech signal. The mel-scaled filterbank is usually implemented with triangular filter channels spaced along the mel-scale, as illustrated in Figure 3.3.

Figure 3.3: Mel-scaled filterbank magnitude responses in linear scale.

However, the filterbank amplitudes of speech are usually highly correlated between adjacent frequency bands [12], and hence a cepstral transformation of the filterbank amplitudes is useful with the upcoming HMM framework. The transform is used to considerably reduce the HMM complexity [16].

The Mel-Frequency Cepstral Coefficients (MFCC) are calculated from the log filterbank bins $m_j$ using the Discrete Cosine Transform (DCT) as follows:

$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left(\frac{\pi i}{N}\,(j - 0.5)\right), \qquad (3.2)$$

where $N$ is the number of filterbank channels.
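The following sketch combines Equations 3.1 and 3.2: it builds a triangular mel-spaced filterbank, applies it to a one-sided power spectrum, and takes the DCT of the log filterbank energies. The channel count, FFT size, and the toy test frame are illustrative assumptions, not the front-end configuration used in the thesis.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)           # Equation 3.1

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)         # inverse of (3.1)

def mfcc(power, fs, n_chans=20, n_ceps=12):
    """MFCCs of a one-sided power spectrum via Equation 3.2."""
    n_fft = 2 * (len(power) - 1)
    # Channel edge frequencies equally spaced on the mel-scale.
    edges_mel = np.linspace(0.0, hz_to_mel(fs / 2.0), n_chans + 2)
    edges_bin = np.floor(mel_to_hz(edges_mel) / fs * n_fft).astype(int)
    fbank = np.zeros((n_chans, len(power)))
    for j in range(n_chans):                            # triangular channels
        lo, mid, hi = edges_bin[j], edges_bin[j + 1], edges_bin[j + 2]
        fbank[j, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[j, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    m = np.log(fbank @ power + 1e-10)                   # log filterbank bins m_j
    i = np.arange(1, n_ceps + 1)[:, None]               # cepstral indices
    j = np.arange(1, n_chans + 1)[None, :]              # channel indices
    dct = np.sqrt(2.0 / n_chans) * np.cos(np.pi * i * (j - 0.5) / n_chans)
    return dct @ m                                      # Equation 3.2

# Toy 25 ms Hamming-windowed frame at 16 kHz:
fs = 16000
frame = np.sin(2 * np.pi * 200 * np.arange(400) / fs) * np.hamming(400)
power = np.abs(np.fft.rfft(frame, 512)) ** 2
print(mfcc(power, fs))
```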

This parameterization of a speech frame is used in virtually all state-of-the-art speech recognition front-ends. MFCCs have been found to give good discrimination. In addition, MFCCs have been found to provide reasonably good recognition performance in both clean and noisy conditions. The number of cepstral coefficients used in the feature vectors is usually around 12. Additional features, such as energy measures and dynamic coefficients, are often appended to the feature vector [14, 16].

3.2.3 Hidden Markov models

A hidden Markov model (HMM) is a statistical model that can be used to model any time series data. Today, the HMM is the standard approach for modeling acoustic units [14, 15, 16]. The objective of statistical models is to characterize only the statistical properties of the signal. Thus, the underlying assumption is that the signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise, well-defined manner [15].

General specification of a hidden Markov model

An HMM is a finite state machine which undergoes a state change every time unit; each time $t$ that a state $i$ is entered, an observation vector $\mathbf{o}_t$ is generated from the probability density function $b_i(\mathbf{o}_t) = P(\mathbf{o}_t \mid q_t = i)$, where $q_t = i$ denotes that the model is in state $i$ at time instant $t$. The acoustic units to be recognized may be words, phonemes, or any other acoustic unit. The HMM is doubly stochastic in that the transition from state $i$ to state $j$ is also probabilistic, with a discrete transition probability $a_{ij} = P(q_{t+1} = j \mid q_t = i)$. Figure 3.4 illustrates a typical HMM structure and a possible observable output sequence generated by the HMM. As in the example figure, the entry and exit states are usually so-called non-emitting states, facilitating the construction of concatenated composite models.

Figure 3.4: A three-state left-to-right HMM generating an observation sequence $O = (\mathbf{o}_1 \mathbf{o}_2 \mathbf{o}_3 \mathbf{o}_4)$. States 1 and 5 are non-emitting entry and exit states; $a_{ij}$ denote transition probabilities and $b_j(\mathbf{o}_t)$ output densities.

The use of the HMM framework depends on certain assumptions. First, in speech analysis the speech is usually split into temporal segments, or states, in which the speech waveform may be assumed to be stationary; the transition between these states is assumed to be instantaneous. Second, the probability of a certain symbol being generated is only dependent on the current state, not on any previously generated states. This first-order Markov assumption is usually also referred to as the independence assumption.

An HMM is fully defined by a parameter set $\lambda = (A, B, \pi)$, where $A \in \mathbb{R}^{N \times N}$ is a matrix of the above-described transition probabilities $a_{ij}$, $B = \{b_j(\mathbf{o}_t)\}$ is a set of observation probability density functions, one for each state $j$, and $N$ is the number of states in the model.

Recognition problem

In general, the recognizer should decide in favor of the string of words $W$ which maximizes the probability $P(W \mid O)$, i.e., the probability that the string $W$ was spoken given the acoustic evidence in the sequence $O$. Here, the string of words can be considered a collection of word-level HMMs concatenated one after another, but the idea can be generalized to any other recognition units. Given the observed acoustic information, the most likely string can be represented as follows:

$$\hat{W} = \arg\max_{W} P(W \mid O), \qquad (3.3)$$

where the maximization is carried out over every possible string of words. Rewriting the right-hand-side probability of Equation 3.3 using Bayes' formula of probability theory, we get

$$P(W \mid O) = \frac{P(W)\, P(O \mid W)}{P(O)}, \qquad (3.4)$$


where $P(W)$ is the probability of the word string $W$ given by the language model, $P(O \mid W)$ is the probability that the sequence of feature vectors $O$ was observed given the string $W$, and $P(O)$ is the probability that the sequence of feature vectors $O$ was observed.

Since the acoustic information $O$ is fixed in Equation 3.4, $P(O)$ has no influence on the maximization in Equation 3.3, which can therefore be represented as

$$\hat{W} = \arg\max_{W} P(W)\, P(O \mid W). \qquad (3.5)$$

Equation 3.5 can be evaluated straightforwardly, since the probability $P(O \mid W)$ is available through Equation 3.7 by replacing the single model $\lambda$ with the concatenated model for the string $W$.

Probability calculation

The recognizer needs to be able to determine the conditional probability $P(O \mid \lambda)$ for a model $\lambda$ when a speaker utters a sentence producing a sequence $O$ of feature vectors. The joint probability $P(O, q^* \mid \lambda)$ that $O$ is generated by the model $\lambda$ moving through the known state sequence $q^* = (1\,2\,3\,3\,4\,5)$ in Figure 3.4 can be computed as the product of the output probabilities $P(O \mid q^*, \lambda)$ and the transition probabilities $P(q^* \mid \lambda)$; that is,

$$P(O, q^* \mid \lambda) = P(O \mid q^*, \lambda)\, P(q^* \mid \lambda) = a_{12} b_2(\mathbf{o}_1)\, a_{23} b_3(\mathbf{o}_2)\, a_{33} b_3(\mathbf{o}_3)\, a_{34} b_4(\mathbf{o}_4)\, a_{45}. \qquad (3.6)$$

Equation 3.6 above holds because of Bayes' formula [5] and the independence assumption of the observation vectors.
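As a concrete illustration of Equation 3.6, the sketch below evaluates the joint probability of an observation sequence and a known state path for a small left-to-right HMM. All transition values and Gaussian output parameters are invented toy numbers, not parameters from the thesis.

```python
from math import exp, pi, sqrt

# Toy left-to-right HMM (states 1..5, states 1 and 5 non-emitting).
a = {(1, 2): 1.0,
     (2, 2): 0.4, (2, 3): 0.6,
     (3, 3): 0.5, (3, 4): 0.5,
     (4, 4): 0.3, (4, 5): 0.7}
means = {2: 0.0, 3: 1.0, 4: 2.0}           # emitting states' Gaussian means
var = 0.5                                   # shared variance (toy value)

def b(j, o):
    """Univariate Gaussian output density b_j(o) of emitting state j."""
    return exp(-(o - means[j]) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

O = [0.1, 0.9, 1.2, 2.1]                    # observations o1..o4
path = [1, 2, 3, 3, 4, 5]                   # the known state sequence q*

p = 1.0
for t, (i, j) in enumerate(zip(path[:-1], path[1:])):
    p *= a[(i, j)]                          # transition a_ij
    if t < len(O):                          # the final move to the exit state
        p *= b(j, O[t])                     # emits nothing
print(p)  # = a12 b2(o1) a23 b3(o2) a33 b3(o3) a34 b4(o4) a45, as in Eq. 3.6
```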

However, in the field of speech recognition, it is desirable to compute the probability that the observation sequence was generated by a given HMM. The problem is that the actual state sequence $q^*$ is unknown, or hidden, which is what gives the model its name.

Since the underlying state sequence is hidden, the probability that the observation sequence $O = (\mathbf{o}_1 \mathbf{o}_2 \ldots \mathbf{o}_T)$ was generated by the model $\lambda$ with $N$ states is computed by summing over all possible state sequences $q^{(i)} = (q^{(i)}_1 q^{(i)}_2 \ldots q^{(i)}_T)$, $1 \le i \le N^T$; that is,

$$P(O \mid \lambda) = \sum_{i=1}^{N^T} P(O \mid q^{(i)}, \lambda)\, P(q^{(i)} \mid \lambda) = \sum_{i=1}^{N^T} a_{q^{(i)}_1 q^{(i)}_2}\, b_{q^{(i)}_2}(\mathbf{o}_1)\, a_{q^{(i)}_2 q^{(i)}_3}\, b_{q^{(i)}_3}(\mathbf{o}_2) \cdots a_{q^{(i)}_{T-1} q^{(i)}_T}\, b_{q^{(i)}_T}(\mathbf{o}_T). \qquad (3.7)$$

In practice, computing $P(O \mid \lambda)$ with Formula 3.7 would be computationally infeasible, requiring on the order of $2T \cdot N^T$ calculations [14]. Fortunately, there exists an efficient algorithm, the forward-backward procedure described in [15], requiring only $N^2 T$ computations. Also, the probability of the likeliest state sequence can be efficiently computed using the Viterbi algorithm [15].
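The sketch below shows the forward recursion that makes this computation feasible. It assumes a textbook formulation with an explicit initial distribution $\pi$ rather than the non-emitting entry state of Figure 3.4, takes the emission likelihoods as a precomputed matrix, and works in plain probabilities for clarity; practical systems use log-probabilities to avoid underflow. All example numbers are invented.

```python
import numpy as np

def forward_probability(pi: np.ndarray, A: np.ndarray, B: np.ndarray) -> float:
    """P(O | lambda) via the forward recursion in O(N^2 T) operations.

    pi : (N,)   initial state probabilities
    A  : (N, N) transition probabilities a_ij
    B  : (N, T) emission likelihoods, B[j, t] = b_j(o_t)
    """
    N, T = B.shape
    alpha = pi * B[:, 0]                   # alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] b_j(o_t)
        alpha = (alpha @ A) * B[:, t]
    return float(alpha.sum())              # P(O | lambda)

# Tiny example (3 states, 3 frames, invented numbers):
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.4, 0.5],
              [0.0, 0.1, 0.4]])
print(forward_probability(pi, A, B))
```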


Modeling the output probabilities

The form of the output distributions $b_j(\mathbf{o}_t)$ can be defined in many ways. Most current HMM-based systems use continuous Gaussian mixture densities as the output distributions, defined as [14]

$$b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \Sigma_{jm}), \quad 1 \le j \le N, \qquad (3.8)$$

where $\mathbf{o}_t$ is the observation vector to be modeled, $c_{jm}$ is the mixture weight for the $m$th mixture component in state $j$, and $\mathcal{N}(\cdot\,; \boldsymbol{\mu}, \Sigma)$ is a multivariate Gaussian with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The mixture weights $c_{jm}$ have the property

$$\sum_{m} c_{jm} = 1 \quad \forall\, j. \qquad (3.9)$$

It can be seen that an arbitrary observation vector $\mathbf{o}_i$ can be emitted from any of the HMM states $(q_1, q_2, q_3, \ldots, q_N)$.
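A minimal sketch of Equation 3.8 follows, evaluating a Gaussian mixture density for one observation vector. Diagonal covariance matrices are an assumption made here for simplicity (and are common in practice), not a requirement of the formula; all numbers are invented.

```python
import numpy as np

def gmm_density(o: np.ndarray, c: np.ndarray,
                mu: np.ndarray, var: np.ndarray) -> float:
    """b_j(o) = sum_m c_m N(o; mu_m, diag(var_m)) for one state j (Eq. 3.8).

    o   : (D,)    observation vector
    c   : (M,)    mixture weights, summing to 1 (Equation 3.9)
    mu  : (M, D)  component mean vectors
    var : (M, D)  diagonals of the component covariance matrices
    """
    D = o.shape[0]
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.prod(var, axis=1))   # (M,)
    expo = np.exp(-0.5 * np.sum((o - mu) ** 2 / var, axis=1))       # (M,)
    return float(np.sum(c * expo / norm))

# Two-component toy mixture over 3-dimensional features:
o = np.array([0.2, -0.1, 0.4])
c = np.array([0.7, 0.3])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_density(o, c, mu, var))
```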

Estimation of the parameters

The most difficult problem of HMMs is to estimate or adjust the model parameters $\lambda$ in order to maximize the probability of the observation sequence given the model. In fact, there is no optimal way of estimating the model parameters [15]. However, it is still possible to choose the model parameters $\lambda = (A, B, \pi)$ for each model independently such that $P(O \mid \lambda)$ is locally maximized, using an iterative re-estimation procedure such as the Baum-Welch method. This maximum likelihood (ML) estimation is the most common HMM training approach. The Baum-Welch re-estimation theory and formulas are carefully and comprehensively presented in [15].

An important feature of the re-estimation is that the new re-estimated model $\bar{\lambda}$ makes the observation sequence more likely than the current model $\lambda$; that is, $P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$. We can therefore iteratively use $\bar{\lambda}$ in place of $\lambda$ and repeat the re-estimation calculation several times until some local maximum of $P(O \mid \lambda)$ is reached. Finally, it can be noted that the entire problem of re-estimation can be set up as an optimization problem, where standard gradient techniques can be used to solve for "optimal" values of the model parameters [14]. The advantage of Baum-Welch estimation, however, is that it is guaranteed to produce a monotonic improvement in the model likelihood, whereas monotonic improvement cannot be guaranteed with standard gradient methods. Additional information about re-estimation can be found in [16, 18].

Conclusions

The most obvious benefits of using HMMs as acoustic models for speech recognition are the efficient and easy implementation of the likelihood calculations and of the re-estimation formulas for the model parameters. Despite the HMM assumptions mentioned, speech recognition performance has in general proven to be better with HMMs than with other approaches, e.g., neural networks.

Today, there are commercial products that can recognize individual commands spoken by several speakers at an accuracy of about 95%. Nevertheless, these recognizers need to be trained, and the different dialects of different speakers often cause problems. Phoneme-based recognizers, on the other hand, can be used with larger vocabularies, and in restricted cases they can recognize individual phonemes, especially vowels, at an accuracy of almost 100%; in the case of continuous speech, however, single errors tend to accumulate and the correct classification of words falls below 90%. Phoneme-based recognizers are also more dependent on the speaker and on the similarity of the environment in which they were trained. At present, users of these recognizers must tolerate long training periods and, even then, several transcription errors.

3.3 Fundamental frequency estimation

In human conversation, the pitch has been found to be the primary acoustic cue to intonation and stress in speech [19], and therefore pitch estimates have been used extensively in speech encoding, synthesis, and recognition. Although speech processing researchers have been very keen on pitch estimation methods for decades and many algorithms have been proposed, there is still no generally accepted method. In this section, a very abbreviated survey of general pitch estimation techniques is provided.

Pitch is defined as the perceived frequency of a sound. Thus, pitch is not a quantity that can be measured directly. However, pitch estimation virtually always refers to the estimation of the fundamental frequency (usually denoted as $f_0$) of the speech signal, and the practice is well established since the pitch tends to correlate well with the fundamental frequency [1].

The goal of a pitch estimator is to automatically extract the fundamental frequency information for voiced sounds from the speech signal. The fundamental frequency can be determined either from periodicity in the time domain or from regularly spaced harmonics in the frequency domain. An illustrative example of an estimated pitch track is presented in Figure 3.5 below, with the sample waveform and the estimated pitch track plotted together. It can be noticed that the pitch for the unvoiced fricative /s/ is zero.

Problematic speech

The problems that make estimating the fundamental frequency a difficult task are generally due to the complex nature of the human speech production system. The following list gives a glimpse of the problems associated with the estimation [1].

- f0 changes with time, often with each glottal period


- the fundamental frequency can occasionally jump up or down by an octave
- sub-harmonics of f0 often appear
- voicing is very irregular at voice onsets and offsets
- some voiced segments are only a few glottal cycles in extent
- it is difficult to distinguish periodic background noise from breathy speech

General approach to f0 estimation

The basic structure of fundamental frequency estimators, operating either in the time domain or in the frequency domain, usually comprises three major components: preprocessing, basic f0 extraction, and post-processing.

The preprocessing stage usually filters and simplifies the signal via data reduction. The objective of the preprocessing is to remove interfering signal components, such as noise and DC offsets.

The basic extraction stage generates the possible pitch marker candidates in the waveform. These candidates can be generated using waveform similarity measures, frequency-domain harmonicity, or a combination of both approaches.

In the final stage, the postprocessor chooses the best of the candidates generated in the previous stage. The selection of the best candidate usually has to cope with the problems summarized in the list above. The earliest methods used non-linear smoothing, such as median filtering, to deal with isolated outliers in the pitch track (a minimal sketch follows). More advanced methods use a dynamic programming technique to take into account the acoustic properties of pitch.
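As a concrete example of the simplest kind of post-processing, the sketch below median-filters a pitch track to suppress isolated outliers; the kernel size is an illustrative choice.

    import numpy as np
    from scipy.signal import medfilt

    def smooth_pitch_track(f0, kernel=5):
        # Running median removes isolated outliers (e.g., octave errors)
        # while leaving unvoiced frames (f0 == 0) untouched.
        f0 = np.asarray(f0, dtype=float)
        smoothed = medfilt(f0, kernel_size=kernel)
        return np.where(f0 > 0, smoothed, 0.0)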


Autocorrelation-based methods

The autocorrelation function (ACF) of the speech signal, or of a preprocessed version of it, is a traditional source of period candidates [1]. In speech processing, the autocorrelation function is usually estimated using the short-time autocorrelation R(k), defined as

    R(k) = \sum_{i=0}^{L-1-k} w(i) s(i) w(i+k) s(i+k),        (3.10)

where k is the lag index, s(i) is the sampled speech signal, and w(i) is a smooth window function (e.g., a Hamming window, see [20]) of length L. The fundamental frequency can be solved from the lag corresponding to the maximum value of the ACF; that is, f0 = fs/k, where fs is the sampling frequency and k is the lag with the maximum ACF value. The maximum should not be searched among either very small or very large lag values, because the ACF value increases drastically as the lag approaches multiples of the pitch period.

While autocorrelation-based methods have performed well in many contexts and have been shown to be relatively immune to noise, two flaws reduce their usability as a period candidate generator. The main disadvantage is the relatively large time window over which the autocorrelation must be computed to adequately cover the f0 ranges encountered in human speech; this precludes resolution of cycle-to-cycle variation in shorter periods. Additionally, the ACF estimates vary as a function of the lag index k, since the summation interval shrinks as k increases. A minimal implementation is sketched below.
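The following sketch implements Equation 3.10 and the rule f0 = fs/k; restricting the lag search to the plausible pitch range avoids the small- and large-lag problems noted above. The frame is assumed to be longer than the largest lag of interest.

    import numpy as np

    def acf_pitch(frame, fs, f0_min=50.0, f0_max=500.0):
        # Short-time ACF of Eq. 3.10: window first, then correlate.
        s = frame * np.hamming(len(frame))
        k_min = int(fs / f0_max)          # smallest lag of interest
        k_max = int(fs / f0_min)          # largest lag of interest
        R = np.array([np.dot(s[:len(s) - k], s[k:])
                      for k in range(k_min, k_max + 1)])
        k_best = k_min + int(np.argmax(R))
        return fs / k_best                # f0 from the best lag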

3.4 Estimation of vocal tract parameters

Formants and their bandwidths are the most important spectral features characterizing the sounds of a speaker. By linear prediction analysis of a short segment of speech, it is possible to encode the spectral envelope information efficiently.

The source-filter model of speech production is described in Section 3.4.1. The model gives a mathematical foundation to the separation of the speech signal into the glottal source signal and the vocal tract response. The spectral characteristics of the vocal tract are modeled with some spectral modeling approach, e.g., Linear Prediction (LP), described in Section 3.4.2. LP analysis has been very successful in providing a good model of the vocal tract for analysis and synthesis purposes.

An example of a spectral envelope for a voiced and an unvoiced sound is depicted in Figure 3.6 below. The formants can be seen as peaks in the frequency domain, and the formant bandwidths describe the width of the corresponding formant peaks. In the upper part of the figure, a voiced speech segment (Finnish vowel /e/) and its spectral envelope are presented. The first four formants of the voiced sound are located at 370, 2 340, 3 220, and 5 500 Hz, and their bandwidths are 80, 325, 285, and 700 Hz, respectively. In the lower part of the figure, an unvoiced speech segment is presented. It can be seen that its formant structure is located in the higher spectral range. This is a consequence of the fact that the unvoiced source signals, produced by a constriction in the vocal tract, are modulated by a shorter section of the vocal tract (from the place of constriction to the lips).

Figure 3.6: Voiced speech waveform and its spectral envelope (upper, vowel /e/), and unvoiced speech waveform and its spectral envelope (lower, fricative /s/).

3.4.1 Source-filter model

The source-filter model of speech can be used to replicate naturally spoken speech when the model parameters are obtained by analysis of natural speech. The model is based on the separation of the speech signal into excitation and vocal tract parameters. In fact, the basis of the model can be traced to the human speech production mechanisms (see Chapter 2).

The source-filter model after [7] is diagrammed in Figure 3.7 below. In the figure, two excitation signals are generated: an impulse train generator produces a periodic impulse train for voiced sounds, and a random noise generator produces white noise for unvoiced sounds. The excitations are scaled, and the voicing switch selects between the two. Some sounds (e.g., /f/ and unvoiced-voiced sound transitions) are combinations of both types of excitation; these sounds are generated using mixed excitation, where the two excitations are appropriately scaled and summed. In Chapter 4 a simplification of the excitation signal generation is used: the voicing is assumed to be binary (voiced/unvoiced). After the excitation simulating the glottis signal is generated, the vocal tract filter modulates the excitation, mimicking the modulation in the vocal tract, and the output speech is produced.

The model assumes that the speech signal can be linearly separated into two independent signals, the excitation signal and the vocal tract response.


Figure 3.7: Block diagram of the source-filter model for speech production.

In practice, the vocal tract response is usually assumed to be linear, and therefore the z-transform of the speech signal, S(z), can be expressed as

    S(z) = U(z) H(z),        (3.11)

where U(z) is an approximation to the excitation signal, and H(z) is the transfer function of a digital filter representing the vocal tract response and the radiation characteristic of the lips.

From the speech analysis point of view this means that the given speech signal should be deconvolved into the excitation and the vocal tract response. In practice, the separation is a hard deconvolution problem without a globally optimal solution. However, many methods have been proposed to solve the problem, and one solution is presented in Section 3.4.2.

Source-filter modeling is beneficial in many respects: it enables the use of efficient coding methods for the excitation signal and the vocal tract parameters, and it makes many interesting applications relatively easy to implement, such as speaker modification [22, 23] and speaker identification [24].

3.4.2 Linear prediction

The concept of predicting the future of a signal dates back at least to the late 1940s. Applied to speech processing, linear prediction (LP) has been found to give a very good model of the vocal tract. This dual role has made LP the most intensively used technique in low bit rate speech coding, and it has turned out to be a very important tool in general speech analysis as well.

Linear prediction is, as the name implies, based on prediction. The idea of discrete-time LP is to estimate the output value y(n) of a system from a given input value x(n), a combination of the M previous input values x(n-1), x(n-2), x(n-3), ..., x(n-M), and a combination of the N previous output values y(n-1), y(n-2), y(n-3), ..., y(n-N); that is,

    \hat{y}(n) = \sum_{k=1}^{N} a_k y(n-k) + \sum_{k=0}^{M} b_k x(n-k).        (3.12)

The problem is to find the predictor coefficients a_k and b_k, given the input and output signals x(n) and y(n), so that ŷ is an optimal linear estimate of y.

In speech processing, however, the input signal (corresponding to the glottis signal) x(n) is unknown, and therefore the prediction is limited to operate only on the p previous output values y(n-1), y(n-2), y(n-3), ..., y(n-p), that is, on the actual measured speech signal. Now, the LP model is

    \hat{y}(n) = -\sum_{k=1}^{p} a_k y(n-k).        (3.13)

(The minus sign is a standard notation used to simplify equations later.) Then the error, or residual, is

    e(n) = y(n) - \hat{y}(n) = y(n) + \sum_{k=1}^{p} a_k y(n-k).        (3.14)

This type of predictive modeling is termed autoregressive (AR) in statistical mathematics. In signal processing, the AR model corresponds to an all-pole or Infinite Impulse Response (IIR) filter. The problem of autoregressive LP modeling is to find the optimal prediction coefficients a_k which minimize the total squared error between the actual speech samples y(n) and the predicted speech samples ŷ(n) (the residual computation is sketched below).
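Computing the residual of Equation 3.14 amounts to running the speech through the FIR filter A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p}; a one-line sketch, assuming the coefficient vector comes from an LP analysis such as the one sketched in the next paragraphs:

    from scipy.signal import lfilter

    def lp_residual(y, a):
        # a = [1, a_1, ..., a_p]; A(z) acts as a "whitening" filter.
        return lfilter(a, [1.0], y)   # e(n) of Eq. 3.14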

The original objective was to find an estimate for the vocal tract transfer function H(z) in Equation 3.11. The practical solution here is an all-pole approximation

    H(z) \approx \frac{\rho}{A(z)},        (3.15)

where ρ is the error gain (ρ = \sqrt{\sum |e(n)|^2}), and A(z) is the z-transform of the optimal prediction coefficients a_k.

The optimal predictor coefficients, usually termed Linear Predictive Coding (LPC) coefficients, can be obtained via the autocorrelation method or the covariance method. The essential theory of linear prediction and the derivation of the optimal linear predictor coefficients (using the autocorrelation method) are presented in Appendix A. The autocorrelation method is preferred for several reasons. It enables the use of computationally fast algorithms such as the Levinson-Durbin algorithm (described in [8], and sketched below). The Toeplitz structure of the autocorrelation matrix R (in Appendix A) guarantees that the poles of the synthesis filter H(z) lie inside the unit circle; thus, the synthesis filter H(z) resulting from the analysis will always be stable [1]. This is a major motivating factor for the use of the autocorrelation method in many practical applications, such as speech synthesis.
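The following sketch combines the short-time autocorrelation with the Levinson-Durbin recursion, using the sign convention of Equations 3.13 and 3.14; it is an illustration of the method, not the implementation used in this thesis.

    import numpy as np

    def lpc_autocorrelation(y, p):
        # `y` is a windowed speech frame; returns a = [1, a_1, ..., a_p]
        # and the squared prediction error.
        r = np.array([np.dot(y[:len(y) - k], y[k:]) for k in range(p + 1)])
        a = np.zeros(p + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, p + 1):              # Levinson-Durbin recursion
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                     # reflection coefficient
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            err *= (1.0 - k * k)               # |k| < 1 => stable 1/A(z)
        return a, err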

The order of the predictor depends on the sampling frequency fs. The rule of thumb [21] for the predictor order p is

    p = \frac{f_s}{1000} + \gamma.        (3.16)

This rule can be justified as follows. For a normal (17 cm) vocal tract, there is on average about one formant per kilohertz of bandwidth [21], and modeling each formant requires two complex-conjugate poles. Hence, two predictor coefficients per kilohertz of bandwidth are needed. The total bandwidth is equal to the Nyquist rate fs/2. The γ in Equation 3.16 is an empirically determined fudge constant, typically 2 or 3 [21]; for example, with fs = 16 kHz and γ = 2, the rule gives p = 18. These extra poles can be interpreted as taking care of the roll-off of the glottal excitation function, and they also give some extra flexibility to the predictor.

An all-pole model is physically justifiable as a model of the vocal tract for the majority of speech sounds [7]. For vowels and some fricative sounds, the transfer function of the vocal tract is an all-pole function [21]. In the case of nasal sounds, however, the all-pole model performs poorly due to its inability to model anti-resonances. In practice, the all-pole modeling approach is nonetheless preferred.

Some examples of all-pole modeling of the spectral envelope of a speech frame can be seen in Figures 2.4 and 3.6. In these figures, LPC analysis (18th order, p = 18) was carried out over a 20 ms Hanning-windowed speech frame. The excellent spectrum envelope matching ability of LPC analysis can be observed in Figure 2.4: the vocal tract resonance frequencies and their bandwidths are precisely approximated with only 18 LPC coefficients.


Chapter 4

Speech Synthesis

4.1 Introduction

Speech synthesis is an automated process which artificially produces acoustic speech waveforms, usually from the written form of a sentence. Current speech synthesizers represent tradeoffs among the conflicting demands of maximizing speech quality while minimizing memory space, algorithmic complexity, and computational time [19].

It is important to make a distinction between high-level and low-level synthesis. A low-level synthesizer is used to create the output waveform; however, it cannot produce output unless it is driven by a matching high-level synthesizer. A high-level synthesizer is responsible for generating its input in such a format that the low-level synthesizer is capable of generating the acoustic waveform. The type of data used depends on the chosen system architecture. High-level synthesis usually includes text pre-processing (in text-to-speech systems), pronunciation analysis, and prosodic analysis.

The speech synthesis techniques reviewed in this chapter form the basis for the speech synthesis subsystem of the speech coder described in Chapter 5. The review deals exclusively with low-level synthesis methods, since the high-level pre-processing is replaced by parameter estimation from natural speech. In the implemented speech coder, the synthesis is used to generate the speech waveform from the estimated parameters.

4.2 Brief history of speaking machines

The earliest efforts to produce synthetic speech date back over two hundred years. In order to understand how the present systems work and how they have evolved to their present form, a very short historical review may be useful. For a more detailed discussion of speech synthesis development and history, the reader is advised to consult [25].


Era of mechanical synthesis

In 1791 Wolfgang von Kempelen introduced his "acoustic-mechanical speech machine". After over 20 years of research he published a book in which he described his studies on human speech production and the experiments with his speech machine. His studies led to the theory that the vocal tract is the main source of acoustic articulation.

The connection between a specific vowel and the geometry of the vocal tract was found by Willis in 1838. Willis noticed the important fact that the vowel quality depends only on the length of the tube, not on its diameter.

Research and experiments on improving the mechanical speech machine introduced by von Kempelen, as well as on semi-electrical machines, continued until the 1960s, but with no remarkable success.

Era of electrical synthesis

The first fully electrical synthesis device was introduced by Stewart in 1922. His synthesizer had a buzzer as an excitation and two resonant circuits to model the acoustic resonances of the vocal tract. The machine was able to synthesize single static vowel sounds with the two lowest formants.

In 1932 the Japanese researchers Obata and Teshima discovered the third formant of vowels. This was a remarkable discovery, because three formants are generally considered to be enough for intelligible speech.

The first machine generally considered a speech synthesizer was the VODER, introduced by Homer Dudley in 1939. The VODER was basically a spectrum-synthesis device operated from a finger keyboard. It did, however, duplicate one important physiological characteristic of the vocal system, namely that the excitation can be voiced or unvoiced. In addition, the basic structure of the machine is very similar to the source-filter model of speech production described in Section 3.4.1.

The Parametric Artificial Talker (PAT), introduced in 1953 by Walter Lawrence, was the first formant synthesizer (see Section 4.3.2). PAT consisted of three electronic formant resonators connected in parallel to model the vocal tract. At about the same time Gunnar Fant introduced the first cascade formant synthesizer, OVE I. These synthesizers sparked a debate on whether the transfer function of an acoustic tube should be modeled as a parallel or a cascade structure.

The first articulatory synthesizer (see Section 4.3.1), DAVO, was introduced in 1958 by George Rosen at the Massachusetts Institute of Technology (MIT). In the mid-1960s, the first experiments with Linear Predictive Coding (LPC, see Section 4.3.4) were made.

The first full text-to-speech (TTS) system for English was developed in 1968 by Noriko Umeda. The system was based on an articulatory model and included a syntactic analysis module with some heuristics. In 1979 Allen, Hunnicutt, and Klatt demonstrated the MITalk TTS system developed at MIT. The technology used in MITalk forms the basis of many synthesis systems today.


4.3 Synthesis methods

Modern speech synthesizers can be divided into two broad categories according to the chosen synthetic speech production strategy.

In the system-modeling approach the aim is to model the human speech production system; this approach is also known as articulatory synthesis, introduced in Section 4.3.1. The signal-modeling approach, in turn, attempts to model the resulting speech signal. It has been more thoroughly studied in the past, because it is the simpler of the two and has provided more natural-sounding synthesized speech. The signal-modeling approach can be further divided into methods known as rule-based formant synthesis (introduced in Section 4.3.2) and concatenation synthesis (Sections 4.3.3 and 4.3.4).

The choice of method is influenced by the size of the synthesis vocabulary and the required synthesis quality. The formant and concatenative methods are the most commonly used in present systems. Their advantage is simplicity, whereas articulatory synthesis is still too complicated, albeit a potential method for the future.

4.3.1 Articulatory synthesis

In the system-modeling approach of human speech production the aim is to model the human vocal organs as accurately as possible. Since the model originates from the actual speech production mechanisms, this method should be the most satisfying when the goal is to produce high-quality synthetic speech.

Articulatory synthesis typically involves models of the human articulators and vocal cords (see Section 2.2). The first articulatory model for speech synthesis was based on a table of vocal tract area functions for each phonetic segment and a linear interpolation scheme [25]. Modern models are usually based on two-dimensional or even three-dimensional modeling of the articulators.

An example of the vocal tract model used in the Haskins Laboratories articulatory synthesis program [26] is depicted in Figure 4.1. There are six key parameters in this model of the vocal tract: the tongue body center (2 degrees of freedom, df), the tongue tip (2 df), the jaw (1 df), the lips (2 df), the velum (1 df), and the hyoid (2 df, controlling larynx height and pharynx width). The tongue tip is a structure that rests on the tongue body, which is implemented as a ball; the tongue ball, in turn, rests on the jaw.

A vocal cord model may be similarly used to generate an appropriate excitation signal. Examples of vocal cord model control parameters are glottal aperture, cord tension, and lung pressure.

In rule-based articulatory synthesis programs, the model parameters are updated towards target positions for each phoneme using rules. Rules are a set of functions modeling the masses and degrees of freedom of the articulators' movements.


Figure 4.1: An example of a vocal tract model in an articulatory synthesis system [26].

However, a general solution to the problem of seeking target articulatory shapes via sets of dependent articulators seems to require control strategies incorporating considerable knowledge of the dynamic constraints on the system, and the selection of an optimal control strategy from a multiplicity of alternative ways to achieve a desired goal [25]. The articulators' target positions can be estimated from x-ray images acquired from natural speech, but difficulties in defining rules arise from the unavailability of appropriate data on the motion of the articulators during speech.

According to [25], many attempts have been made to implement articulatory synthesis systems, but the computational costs and, mainly, the lack of data upon which to base the rules prevent the immediate application of this approach. Thus, it has received less attention than other synthesis methods.

4.3.2 Formant synthesis

Formant synthesis is probably the most widely used synthesis method of the last few decades. It is based on modeling each formant used in speech production. The formant information is usually estimated from natural speech or generated from written text.

The idea of formant synthesis is very similar to Dudley's VODER (described in Section 4.2), with the distinction that the large number of fixed-frequency resonators is replaced by a small number of variable-frequency resonators. This arrangement is made to simplify the structure of the synthesizer, but it also gives some flexibility in adjusting the formant frequencies. This synthesis method obeys the source-filter model of speech production (see Section 3.4.1).

There are two methods of combining the formants to make the model of the vocal tract. In cascade formant synthesis the output of one formant resonator is applied to the input of the next resonator; the basic structure of a cascade formant synthesizer is depicted in Figure 4.2. In a parallel formant synthesizer the excitation signal is applied to all the modeled formants in parallel, and the output of each formant resonator is individually gained to modify the timbre of the resulting speech; the basic structure of a parallel formant synthesizer is depicted in Figure 4.3. The parallel arrangement of resonators has been found better for nasals, fricatives, and stops, and the cascade type for non-nasal voiced sounds. Efforts to improve on these formant arrangements have led to systems which incorporate both types.

Figure 4.2: The basic structure of a cascade formant synthesizer.

Figure 4.3: The basic structure of a parallel formant synthesizer.

Three formants are generally required to synthesize intelligible speech, and four or more are needed to produce high-quality speech. Each formant is usually modeled with a two-pole resonator, which enables both the formant frequency and its bandwidth to be specified [27]; a sketch of such a resonator follows.
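A two-pole resonator is easy to derive: a formant at frequency f with bandwidth bw (both in Hz) corresponds to a conjugate pole pair at radius exp(-π·bw/fs) and angle 2π·f/fs. The sketch below also shows the cascade arrangement of Figure 4.2; the formant values used are illustrative only.

    import numpy as np
    from scipy.signal import lfilter

    def formant_resonator(f, bw, fs):
        # Conjugate pole pair for one formant; gain normalized at DC.
        r = np.exp(-np.pi * bw / fs)
        a1 = -2.0 * r * np.cos(2.0 * np.pi * f / fs)
        a2 = r * r
        return [1.0 + a1 + a2], [1.0, a1, a2]   # (b, a) for lfilter

    fs = 8000
    excitation = np.random.randn(800)           # placeholder excitation
    y = excitation
    for f, bw in [(500, 60), (1500, 90), (2500, 150)]:  # illustrative
        b, a = formant_resonator(f, bw, fs)
        y = lfilter(b, a, y)                    # cascade: one after another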

Formant synthesizers are usually controlled by rules. These rules determine which allophones are used in a certain phonetic context, and specify exactly how these allophones, and the transitions between them, should be produced [27].

Extensive research into rule-based formant synthesis eventually led to some high-quality text-to-speech (TTS) systems, such as Klattalk, MITalk, DECtalk, the Infovox SA-101, and the Prose-2000. It should be noted that MITalk, the Prose-2000, Klattalk, and DECtalk all used a version of the Klatt formant synthesizer [27].

4.3.3 PSOLA synthesis

Concatenation synthesis operates by concatenating appropriate synthesis units to construct the required speech. The synthesis units are usually words, syllables, diphones, or monophones.

The Pitch Synchronous Overlap and Add (PSOLA) algorithm was developed by France Telecom at CNET [27]. The algorithm does not actually synthesize the speech signal itself, but merely enables pre-recorded segments of speech to be smoothly concatenated; it does, however, enable the alteration of the pitch and duration of the segments. The chief advantage of PSOLA synthesis is that the synthesized speech can generally be considered of very high quality for a relatively low complexity algorithm.

There are several versions of the PSOLA algorithm, and all of them work essentially the same way. The time-domain version, TD-PSOLA, is the most commonly used due to its computational efficiency [1]. TD-PSOLA consists of three steps. The analysis step divides the original speech signal into many separate but often overlapping short-term (ST) analysis signals; the analysis is carried out using pitch-synchronous Hanning windowing through regions of voiced speech and at a fixed interval through regions of unvoiced speech. In the second step each analysis signal is modified to match the desired ST synthesis signals: the pitch can be raised or lowered by altering the spacing of the ST signals during synthesis, and the duration can be simultaneously altered by copying or deleting ST signals from the synthetic speech. Finally, in the synthesis step the modified segments are recombined by means of overlap-add (a sketch is given below). TD-PSOLA has the disadvantage that spectral smoothing at the concatenation unit boundaries cannot be performed.
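The following heavily simplified sketch illustrates the analysis/modification/overlap-add idea for pitch modification of a fully voiced segment. The pitch-synchronous analysis marks are assumed to be given, and no duration modification or unvoiced handling is attempted.

    import numpy as np

    def psola_pitch_shift(x, marks, factor):
        # `marks`: pitch-synchronous epochs (sample indices); factor > 1
        # raises the pitch by spacing the synthesis instants more closely.
        marks = np.asarray(marks, dtype=int)
        y = np.zeros(len(x))
        t = float(marks[0])
        while t < marks[-1]:
            i = int(np.argmin(np.abs(marks - t)))   # nearest analysis mark
            # Local period from neighbouring marks; window spans ~2 periods.
            T = max(int(marks[min(i + 1, len(marks) - 1)]
                        - marks[max(i - 1, 0)]) // 2, 1)
            lo, hi = marks[i] - T, marks[i] + T
            at = int(t) - T
            if lo >= 0 and hi <= len(x) and at >= 0 and at + hi - lo <= len(y):
                y[at:at + hi - lo] += x[lo:hi] * np.hanning(hi - lo)
            t += T / factor                         # synthesis spacing
        return y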

The spectral discontinuity problem of TD-PSOLA can be overcome by using the frequency-domain version of PSOLA, namely FD-PSOLA. In this approach the ST signal is decomposed into a source spectrum (representing the contribution of the glottal source signal, see Section 2.5) and a spectral envelope (the contribution of the vocal tract characteristics). The spectral envelope can be estimated, for example, using LP techniques, and an estimate of the source spectrum can be obtained by dividing the Discrete Fourier Transform (DFT) of the ST signal by its spectral envelope. Now, the pitch can be modified to match the synthesis pitch by adjusting the spacing of the harmonics in the source spectrum, and spectral smoothing at the concatenation unit boundaries can be achieved by modifying the spectral envelope of the boundary units. After the modifications, the spectra are recombined and the Inverse Discrete Fourier Transform (IDFT) is applied to generate a synthesis ST signal. The actual synthesis procedure is then the same as in TD-PSOLA. FD-PSOLA requires considerably more computation than TD-PSOLA due to the needed transformations and the additional smoothing.

In the following section the linear prediction synthesis method is described. Since this synthesis method is used in the current work, the review is presented more comprehensively.

4.3.4 Linear prediction synthesis

Linear Prediction (LP) based speech synthesis methods were originally developed for speech coding systems. This synthesis method offers rapid and simple analysis and synthesis algorithms at the expense of output quality.

LP synthesis is another synthesis method based on the source-filter model of speech production (see Section 3.4.1), where the excitation signal imitates the glottal source and the vocal tract filter mimics the vocal tract configuration needed to produce the required sound. The synthesis process is carried out frame by frame. In the following sections various aspects of LPC speech synthesis are discussed.

Excitation

In LP synthesis, the glottal source is modeled by the excitation signal and the vocal tract response by the LPC filtering operation. The excitation signal usually comprises a very simple model of the glottal source signal. Voiced excitation is often approximated by a train of impulses with periods corresponding to the required pitch; this significant simplification of the glottis signal is a major cause of the unnatural sound of the method. More advanced techniques for glottal source modeling exist, but they usually gain only marginal improvement at the cost of additional parameters. The excitation for unvoiced sounds is approximated by white noise.

Vocal tract response

In the LP synthesis approach, the spectral characteristics of the speech (except periodicity) are extracted in the form of LPC filter coefficients. The basics of LP theory are presented in Section 3.4. In practice, LPC parameters provide an accurate and economical representation of the relevant speech parameters, that is, the formants and their bandwidths, and can be used to build efficient speech synthesis systems. For example, the spectral envelope of the speech frame presented in Figure 3.6 is coded into twelve LPC coefficients determined using the autocorrelation method of LPC analysis.

Synthesis units

Concatenating pre-recorded natural speech utterances is probably the easiest way to produce intelligible and high-quality synthetic speech. The problem with the flexibility of this approach arises when the required synthetic speech does not match the pre-recorded utterances.


Therefore, one of the most important aspects of this approach is choosing a suitable synthesis unit length. The selection is a trade-off between longer and shorter units. Longer units (e.g., words) provide naturalness due to the minimal number of concatenation points needed, but the number of required units and the needed memory increase drastically. Shorter units (e.g., diphones) require less memory, but the quality of the synthetic speech degrades due to the distortion introduced by the discontinuities at the concatenation points.

Phonemes are probably the most natural units of synthesis because they are the normal linguistic representation of speech. The number of phonemes is usually between 40 and 50 depending on the language, clearly the smallest quantity compared to other units. The use of phonemes gives great flexibility in concatenating units, but some phonemes that do not have a steady-state target position (e.g., plosives) can be difficult to synthesize.

Diphones (also called dyads) are the most commonly used synthesis units in speech synthesis. They describe a phone pair from the central point of the steady-state part of one phone to the central point of the following one; in other words, they describe the transition between adjacent phones. In this manner the coarticulation effects between adjacent phones can be captured. It also means that most concatenation points are located in the steady-state region of a phone, which reduces the distortion. In principle, the number of diphone elements to be stored in the inventory is the square of the number of phonemes plus allophones, but not all combinations of phonemes are needed. For example, in Finnish, combinations such as /h-s/, /s-j/, /m-t/, /n-k/, and /N-p/ within a word are not possible, and thus the number of units is usually from 1 500 to 2 000 [28]. This number of elements is still tolerable.

Longer segmental units, such as triphones¹ or tetraphones², are rarely used. The problem with these longer synthesis units is the collection and storage of the data. For example, English requires more than 10 000 triphone units.

Acoustic inventory

The set of chosen synthesis units, termed the acoustic inventory, is usually gathered more or less manually in three steps. First, natural speech must be recorded so that all the synthesis units used (e.g., diphones) are included within all possible contexts. The speech data must then be labeled and segmented, and finally the most appropriate instance of each inventory element is selected for further analysis.

In the analysis process each of the selected time segments is decomposed and encoded frame by frame into the excitation signal and the vocal tract filter parameters. The results can be stored in the acoustic inventory as such, or the amount of data can be reduced by means of interpolation. For example, if diphone synthesis units are used, the steady states and a small number of transition speech frames can be encoded, and the rest of the parameters can be interpolated.

¹ Contains one phone between steady-state points (half phone - phone - half phone).
² Contains two phones between steady-state points (half phone - phone - phone - half phone).


Synthesis

The concatenative LPC synthesis procedure consists of two parallel processes: the synthesis control parameters are either generated by rule or determined from the analysis of natural speech, and the actual waveform calculation is performed using these parameters. For simplicity, the parameters are assumed here to be determined from natural speech.

The excitation signal is generated using the pitch and voicing information, which is usually estimated from natural speech by an f0 extraction algorithm (e.g., the algorithm described in Section 3.3). The gain of the excitation is usually calculated from the power of the speech in the analysis time frame of the natural speech. The LPC filter coefficients for the desired synthesis unit are fetched from the acoustic inventory.

After all the parameters described above are computed, the waveform calculation is performed as in an ordinary LPC synthesis scheme. Figure 4.4 illustrates the typical structure of concatenative LPC synthesis. As in the figure, the voiced/unvoiced excitation signal is gained to match the desired speech intensity, and the gained excitation is filtered with the LPC synthesis filter to adjust the spectral characteristics to correspond to the required sound. The output of this filtering operation is a frame of synthesized speech. All the frames are calculated using the same procedure with updated or interpolated parameter values; a minimal sketch of this loop follows the figure below.

Figure 4.4: Block diagram of a typical concatenative LPC synthesis scheme.
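A minimal sketch of this frame loop is given below. The per-frame parameter layout (a dict with LPC coefficients, gain, and f0) is illustrative only, and for brevity the filter memory is not carried across frame boundaries as a production synthesizer would do.

    import numpy as np
    from scipy.signal import lfilter

    def lpc_synthesis(frames, fs, frame_len=160):
        # frames: [{'a': [1, a_1, ..., a_p], 'gain': g, 'f0': f}, ...]
        # with f0 == 0 marking an unvoiced frame (binary voicing).
        out, phase = [], 0
        for fr in frames:
            if fr['f0'] > 0:                         # voiced: impulse train
                period = max(int(fs / fr['f0']), 1)
                exc = np.zeros(frame_len)
                exc[phase::period] = 1.0
                phase = (phase - frame_len) % period # keep train continuous
            else:                                    # unvoiced: white noise
                exc = np.random.randn(frame_len)
            # Gain the excitation, then shape it with the all-pole 1/A(z).
            out.append(lfilter([1.0], fr['a'], fr['gain'] * exc))
        return np.concatenate(out)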

Summary

Synthetic speech produced using linear prediction synthesis is far from perfect [25]. This not uncommon statement can be explained by two problems associated with LP synthesis. First, the source-filter model assumes a clean separation of excitation and vocal tract response, and yet the excitation is determined separately [19]; moreover, an oversimplified version of the excitation is used in the synthesis process. This causes a mismatch in synthesis and implies imperfect reconstruction. Secondly, in LP synthesis the vocal tract response is modeled by an all-pole filter; therefore phonemes containing antiformants, such as nasals and nasalized vowels, are poorly modeled. A solution to this problem is the more general, but less popular, autoregressive moving average (ARMA) modeling of the vocal tract response. However, ARMA modeling has the disadvantage of being computationally too complex for practical applications.

The major advantages of the LPC synthesis approach are its simplicity, fast and stable algorithms, and easily modified synthesis parameters. An interesting application that emerges from parameter modification is speaker transformation: prosodic aspects of speech can be modified by adjusting the excitation signal, and a spectral transformation can be used to map the acoustic space of the original speaker to that of a target speaker.

Current research is examining the possibility of using versions of actual LPC residual waveforms for vocal tract excitation (e.g., multi-pulse excitation). The improvement in quality follows from the fact that exciting the LPC synthesis filter with its corresponding residual signal reconstructs the original speech exactly [19].

4.4 Conclusions

The four basic methods of low-level speech synthesis have been introduced in this chapter. Concatenative synthesis methods have become more and more popular as the methods to reduce distortion at concatenation points improve. The collecting and labeling of speech samples has usually been difficult and very time-consuming; however, much of this work can be automated today, for example by using speech recognition.

Formant synthesis has the advantage of being relatively flexible, allowing good control of the formant frequencies and their bandwidths, as well as the fundamental frequency. Compared to concatenative methods, formant synthesis produces slightly more unnatural speech [28], and an individual-sounding voice is more difficult to achieve.

In theory, articulatory synthesis is perhaps the most feasible method because it models the human articulatory system directly. Due to its very complex nature, however, the potential of this method has certainly not been realized yet. It may well become the method of choice in the future.

The problem area in speech synthesis is very wide. Although speech synthesis has developed steadily over the last few decades, there is still much work to do in both low-level and high-level speech synthesis systems. It can be observed that present speech synthesis systems are becoming so complicated that one researcher cannot handle the entire system, and therefore a modular structure is very popular in currently available synthesizers.

The evaluation and assessment of synthesized speech is not a simple task, either. Speech quality can be considered a multidimensional term, and it is usually evaluated by subjective listening tests measuring intelligibility and naturalness. It can be stated that in most modern applications the intelligibility and naturalness of synthetic speech have reached an acceptable level.


Chapter 5

Phonetic Vocoder for Finnish

5.1 Introduction

Speech coding approaches have traditionally been divided into waveform coders and parametric coders [1]. In the waveform coder class the objective is to minimize a criterion which measures the dissimilarity between the original and the reconstructed speech signal. In parametric coding, usually termed vocoding, the speech signal is characterized in terms of a set of model parameters, and these parameters are quantized without direct consideration of the speech waveform. In the waveform coder approach the signal-to-noise ratio (SNR) can be used as a useful performance measure. In the class of vocoders, however, the SNR measure is meaningless, because it has no correlation with the reconstructed speech quality; subjective listening tests are usually used to measure vocoding performance.

For uncompressed PCM coded speech the bit rate is 64 kbit/s, and for commonly known speech codecs such as ADPCM and LPC10 the bit rates are 32 and 2.4 kbit/s, respectively, with approximate compression factors of 4 to 50. For the phonetic vocoder described in this thesis, the target bit rate is below 1 000 bit/s, meaning a compression of about 130; i.e., using this method it is possible to store over three hours of speech on a single 1.44 MB floppy disk (1.44 MB ≈ 11.5 Mbit, which at 1 000 bit/s corresponds to roughly 11 500 seconds, or about 3.2 hours).

5.2 Description of coder

Phonetic vocoding is one of the most frequently proposed methods for speech coding at bit rates below 1 000 bit/s. At present, speech coding at such low bit rates relies on an extensive model of speech production and detailed linguistic knowledge of the information embedded in the speech signal.

It is natural, however, to base the model on the physiological structure of the human speech production system, and it is usually possible to identify synthesis structures which emulate the human vocal organs.


In the encoding process, the speech signal analysis comprises the estimation of prosodic measures such as duration, stress, and intonation. The information content of the speech is encoded by utilizing speech recognition. In the class of segmental speech coders, to which the phonetic vocoder belongs, the speech utterance is divided into temporal segments. These temporal segments typically represent some linguistic units (e.g., phonemes). This phonetic segmentation and labeling is carried out using speech recognition.

In the decoding process, the speech signal is reconstructed from the parametric representation obtained in the encoding process. The source-filter model of speech production (described in Section 3.4.1) is applied by the speech synthesis algorithm.

The implemented speech coder utilizes an HMM-based speech recognition system to phonetically segment and label the input speech (see Section 3.2). The fundamental frequency is estimated using a robust pitch tracking algorithm (see Section 3.3). Speech reconstruction is carried out with a concatenative LPC speech synthesis algorithm (see Section 4.3.4).

Coding parameters

The parametric representation of speech comprises three components: phonetic segmentation and labeling, the pitch track, and energy information. An illustrative example of the obtained parameters and the coder structure is presented in Figure 5.1.

The spectral information of speech is efficiently encoded in terms of a finite set of LPC filters describing the spectral characteristics of the phonemes. This implies that the phoneme labels can be used to encode the spectral content of speech. In Figure 5.1 an example of phonetic labeling can be seen as an output of the segmentation and labeling. The prosodic aspects, duration, stress, and intonation, are encoded in the form of the segmentation, gain, and pitch, respectively. These parameters have a considerable effect on the perceived naturalness of the synthesized speech.

The resulting parametric representation of the phonetic vocoder is a variable bit rate (VBR) data stream. This stems from the fact that the phoneme durations are a variable-rate process, whereas the rest of the parameters are computed periodically, resulting in a constant bit rate (CBR) data stream.

Although the set of parameters described above allows relatively good control of many subjective aspects of the synthesized speech, some important cues of speech remain uncoded; for example, the speaker's emotional state (angry, skeptical, etc.) is lost.

Discussion

Phonetic vocoders are very suitable for speaker-dependent speech coding under very low bit rate requirements. The low bit rate is achieved mainly by coding the spectral parameters efficiently.


Figure 5.1: Simplified block diagram of the implemented phonetic vocoder.

The obtained high-level description of the speech makes it relatively easy to implement voice altering applications (e.g., speaker transformation, pitch modification, etc.). In addition, the phonetic transcription makes it possible to search for speech according to the transcription.

The independent analysis processes in phonetic vocoder encoders give some freedom in choosing the analysis algorithms for the specified application and accuracy. The independence also enables a parallel implementation of the encoder. This is a major advantage, as it allows the use of distributed computing where computationally complex analysis processes are needed.

5.3 Speech analysis

This section describes the methods used to encode the speech signal into the parametric representation of the phonetic vocoder: speech segmentation and labeling are described in Section 5.3.1, pitch track estimation in Section 5.3.2, and energy estimation in Section 5.3.3.

5.3.1 Phoneme recognition

The phoneme recognizer is probably the most important component of the encoder. Even a small number of recognition errors may result in unintelligible synthesized speech. Therefore a state-of-the-art HMM recognizer is used for maximum phoneme recognition performance.

The phoneme recognizer aims to segment and label continuous, speaker-independent speech. The objective is to have the phoneme sequences segmented and to obtain the corresponding time indices. The recognition vocabulary consists of the phonemes of the Finnish language.


The problems in phoneme segmentation are caused by the very nature of speech. In the time domain, the continuous speech event has no discrete segments, and the boundaries between temporal units can be defined only ambiguously. Furthermore, the coarticulation effects in natural speech complicate phoneme recognition by machine.

The two major stages of the recognition process are feature extraction and the recognition itself, which generates the phonetic transcription of the speech. The feature extraction front-end produces the traditionally used Mel-Frequency Cepstral Coefficient (MFCC, see Section 3.2.2) feature vectors used by the hidden Markov model (HMM, described in Section 3.2.3) recognizer.

The implementation of the phoneme recognizer utilizes the freely downloadable¹ HTK software toolkit in the training and recognition processes. HTK is a portable toolkit for building and manipulating hidden Markov models and is primarily used for speech recognition research, although it has been used for numerous other applications, including research into speech synthesis and character recognition. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing, and results analysis. [30]

Acoustic model description

Speaker-independent phoneme recognition is based on HMM acoustic models trained for the Finnish language. Context-independent phoneme models have the problem that they are not adequate for representing the spectral and temporal properties of a speech unit in all contexts [14]; this is mainly due to the fact that the acoustic realization of phonemes depends greatly on the context (see Section 2.3). Context-dependent triphone models were therefore used to improve the vital recognition accuracy.

In triphone modeling the number of units needed to cover all sounds is very large, so there is a considerable increase in the need for training data. Moreover, the required robustness in speaker-independent recognition demands occurrences of the triphones across different speaker age groups, dialects, etc. Fortunately, natural languages contain only a relatively small percentage of the possible triphone units. In addition, several techniques can be used to reduce the number of needed models, which in turn implies a reduction in the need for training material. For example, in model tying the number of models is reduced by using shared models for acoustically similar contexts [16].

Each of the HMM triphone models consists of three left-to-right connected states (see Figure 5.2). The state output distribution is modeled using continuous Gaussian mixture densities (defined in Section 3.2.3). A short pause is modeled with a single-state HMM. Silence, modeling a longer pause in speech, is modeled with a three-state HMM, with the distinction, relative to the triphone models, of having a backward transition from the third to the first state.

¹ Currently available from the HTK web site: http://htk.eng.cam.ac.uk/


Figure 5.2: HMM topologies used in phoneme recognition: triphone, short pause, and silence.

SpeechDat database

The SpeechDat(II) FI database was used in the HMM training and testing processes. The database was collected by Tampere University of Technology's Digital Media Institute in the framework of the EC project LE2-4001 SpeechDat [31, 32].

The database was collected over the fixed telephone network for the purposes of automated teleservices. It consists of typical application words and sentences, spontaneous yes/no questions, dates, times, digit sequences, numbers, money amounts, cities, companies, names, and phonetically rich words and sentences. The full database provides over 160 000 utterances from 4 000 speakers representing all age groups, accents, and regions of Finland.

The database is phonetically transcribed, and unintelligible speech, speaker noise, and other non-speech events are marked with special markers. The speech file format is 8-bit, 8 kHz uncompressed A-law, and the speech files are accompanied by ASCII SAM label files.

Training of acoustic models

Building a phoneme recognizer from scratch involves a great number of tasks, and a detailed description is beyond the scope of this thesis. The training process steps are, however, briefly described here in chronological order. For more detailed information the reader is advised to consult [16].

The SpeechDat database was prepared for the HMM training and testing purposes. Utterances with non-speech content (speaker noise, unintelligible speech, etc.) were not used in the training and testing phases. Training and testing material of over 111 000 and 2 000 samples, respectively, was chosen randomly from the utterance database.

The creation of a well-trained set of triphone HMMs consists of many phases. The training begins with the training of the monophone models. In the flat start creation of the monophone models, all models were assigned identical means and identical variances for each state; intuitively, these mean and variance estimates describe the global statistics of the feature vectors (a sketch follows). Next, the monophone models were re-estimated three times with the help of HTK's embedded re-estimation tool.
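The flat start itself is a trivial computation; a minimal sketch, assuming the training features are stacked into an (n_frames x n_dims) array:

    import numpy as np

    def flat_start(features, n_states=3):
        # Every emitting state starts from the global statistics of the
        # training feature vectors: identical mean and identical variance.
        mu = features.mean(axis=0)
        var = features.var(axis=0)
        return [{'mean': mu.copy(), 'var': var.copy()}
                for _ in range(n_states)]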

After the monophone models are initialized, the next step is to tie the short pause model and the silence model via the centre model state. The tied silence model is depicted in Figure 5.3. The new silence model has the advantage that the parameters of the shared state can be estimated more robustly [16]. After the creation of the new silence model, the set of models was re-estimated four times.

Figure 5.3: The topology of the tied silence model.

The final stage in model building is to create the context-dependent triphone HMMs. The triphone models were created from the monophone models; for example, a triphone model of the form a-b+c can be obtained by making a copy of the monophone model b. The transcription of the training database is also converted into a triphone transcription. Re-estimation was done twice to initialize the triphone models.

Next, in order to ensure that all state distributions were robustly estimated, acoustically similar states of the triphones were tied. Tying may affect performance if it is performed indiscriminately [16]. Keeping this in mind, it is important to tie only those parameters which have a diminutive effect on discrimination.

The mechanism used for the tying was based on decision trees. The decision tree attempts, by asking questions, to find those contexts which make the largest difference to the acoustics; this procedure can therefore be used for clustering the states and then tying the clusters (one split is sketched below). A question can involve any linguistic or phonetic classification which may be relevant. For example, a question such as "is the left context of the triphone a nasal sound" can be used to split the states in a pooled cluster into two sets. The question process is repeated until the increase in log likelihood (by any question at any node) is less than a specified threshold. Finally, and for the last time, the tied-state triphone models were re-estimated three times.
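One split of such a tree can be sketched as follows. Here loglik is a hypothetical scoring function for a pooled set of tied states, so this is an outline of the search only, not the HTK implementation.

    def best_question(states, questions, loglik):
        # Pick the question whose yes/no split of the pooled cluster
        # yields the largest log-likelihood gain.
        base = loglik(states)
        best = None
        for q in questions:                  # q(state) -> True / False
            yes = [s for s in states if q(s)]
            no = [s for s in states if not q(s)]
            if not yes or not no:
                continue
            gain = loglik(yes) + loglik(no) - base
            if best is None or gain > best[0]:
                best = (gain, q)
        return best                          # split only above threshold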

A bigram language model (see Section 3.2) was also computed using the transcriptions of the training material.

Phoneme recognizer evaluation

The performance of the trained phoneme recognizer was evaluated. The test materialof 2 000 samples was run through the recognizer and the accuracy was evaluated by


Table 5.1: Phoneme recognition accuracies (%) using monophone models.

    number of re-estimations     4       5       6       7
    without LM                 36.41   38.29   39.07   39.59
    with LM                    36.98   38.71   39.51   40.07


As mentioned earlier, the phoneme recognition was carried out continuously in a speaker-independent manner. The recognition vocabulary consists of 1 572 Finnish triphone models, and the recognition network was a fully connected lattice, meaning that any triphone can be followed by any other triphone.

The phoneme recognition was tested using two configurations, with and without a bigram language model (LM); the two configurations were identical in all other aspects. Table 5.1 summarizes the recognition accuracies.

5.3.2 Pitch estimation

The pitch estimation is based on the robust algorithm for pitch tracking (RAPT) [1]. The algorithm was implemented in the Master of Science thesis of Anssi Rämö (source code available in [33]). Therefore, only the algorithm outline and some examples are presented here, in sufficient detail to understand the principles of the algorithm; the details are described in [33] and [1].

The primary aim of RAPT is to obtain a robust and accurate estimate of the pitch track, with some consideration for computational complexity, memory requirements and inherent processing delay [1]. RAPT is designed to work at any sampling frequency and frame rate over a wide range of possible f0, speaker and noise conditions.

The outline of RAPT comprises several steps. First, the f0 candidates are searched using a two-pass normalized cross-correlation function (NCCF) calculation; a sketch of the NCCF is given below. In the first pass the speech is decimated and the NCCF is calculated over the speech frame for all lags in the f0 range of interest, and the locations of the NCCF local maxima are stored. In the second pass, the NCCF of the signal at the original sample rate is calculated only in the vicinity of the promising NCCF peaks found in the first pass, in order to refine the estimates of the peak locations. Each retained peak generates a candidate f0 for that frame; in addition, the hypothesis that the frame is unvoiced is advanced. Finally, a dynamic programming (DP) technique is used to select the set of NCCF peaks or unvoiced hypotheses across all frames that best matches the characteristics of the speech.
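A minimal sketch of the NCCF for a single frame (illustrative Python; the names and interface are not from the implementation):

    import numpy as np

    def nccf(s, start, frame_len, min_lag, max_lag):
        """Normalized cross-correlation of one analysis frame against
        itself shifted by each candidate lag; peaks over the lag range
        are the f0 period candidates."""
        phi = np.zeros(max_lag - min_lag + 1)
        a = s[start:start + frame_len]
        e0 = np.dot(a, a)
        for k in range(min_lag, max_lag + 1):
            b = s[start + k:start + k + frame_len]
            phi[k - min_lag] = np.dot(a, b) / np.sqrt(e0 * np.dot(b, b) + 1e-12)
        return phi

    # For an f0 search range of 50-500 Hz at fs = 8000 Hz the lags are
    # fs/500 = 16 ... fs/50 = 160 samples.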

The general approach to f0 estimation was given in Section 3.3. In RAPT the NCCF calculation can be considered as the basic f0 extractor used to find the f0 candidates, and as a postprocessor, the dynamic programming aims to select the best candidate using knowledge of the speech production system.


Figure 5.4: Generated f0 candidates (plus signs) and the resulting pitch track after dynamic programming (line); f0 (Hz) versus frame index.


The implementation of RAPT was mainly written as Matlab m-script files. The NCCF computation is the dominant cost of the algorithm, and it was therefore written in the C programming language. The computation can also be reduced by limiting the range of f0 values searched; however, according to [1] a general-purpose f0 estimator should search at least the range 50 ≤ f0 ≤ 500 Hz. A better way to speed up the algorithm is to resample the input speech signal at a lower rate.

Unlike the original version of RAPT in [1], the implemented version calculates the first-pass NCCF from whitened and low-pass filtered speech; a small improvement in the accuracy of the initial estimates was found [33]. The speech frame was whitened using a 12th-order LPC analysis filter. The low-pass filtering was applied using a linear-phase (real and symmetric coefficients) FIR filter of order 60. After these filtering operations the first-pass NCCF was calculated. The implementation had no other significant differences from the original RAPT.
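A sketch of this pre-filtering in Python (for brevity a single whitening filter is computed over the whole signal, whereas the implementation works frame by frame; the 1 kHz low-pass cutoff is an assumed value for illustration):

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import firwin, lfilter

    def prefilter(s, fs, order=12, cutoff=1000.0):
        """Whiten with a 12th-order LPC analysis filter A(z), then
        low-pass with a linear-phase FIR filter of order 60."""
        r = np.correlate(s, s, mode="full")[len(s) - 1:] / len(s)
        a = solve_toeplitz(r[:order], -r[1:order + 1])   # Yule-Walker solve
        whitened = lfilter(np.concatenate(([1.0], a)), [1.0], s)
        h = firwin(61, cutoff, fs=fs)                    # 61 symmetric taps
        return lfilter(h, [1.0], whitened)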


5.3.3 Energy estimation

Stress is one of the important prosodic features of speech (see Section 2.4). Stress can be measured in terms of the speech signal energy.

The energy is computed from a windowed speech frame. The speech is windowed with a tapered analysis window; the implementation used a 10 ms Hanning window, and adjacent analysis windows overlapped by 5 ms. The energy is estimated simply by the root mean square (RMS) energy measure for each frame, that is

E_{\mathrm{rms}} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \bigl(s(n)\, w(n)\bigr)^{2}}, \qquad (5.1)

where s(n) is the speech signal and w(n) is the windowing function of length N. An example of the RMS energy estimated from the Finnish word 'seitsemän' is presented in Figure 5.5.
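A minimal sketch of this computation (illustrative Python, matching Equation (5.1) with the 10 ms window and 5 ms hop used in the implementation):

    import numpy as np

    def rms_track(s, fs, win_ms=10, hop_ms=5):
        """Frame-wise RMS energy with a Hanning analysis window."""
        N = int(fs * win_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        w = np.hanning(N)
        starts = range(0, len(s) - N + 1, hop)
        return np.array([np.sqrt(np.mean((s[m:m + N] * w) ** 2)) for m in starts])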

Figure 5.5: RMS energy estimate of the word 'seitsemän' (RMS gain versus frame index).

5.4 Speech synthesis

The speech synthesis method used in the phonetic vocoder implementation is the concatenative LPC synthesis algorithm described in Section 4.3.4. The synthesis uses the parameters estimated by the encoder; an acoustic inventory of a Finnish-speaking male voice was analyzed and stored beforehand. The synthesis was implemented as a Matlab m-script.

The LPC synthesis is based on the source-filter model of speech production (see Section 3.4.1). In this approach the excitation signal approximates the glottal source signal, and the vocal tract filtering is used to mimic the vocal tract configuration needed to produce a required sound.



The excitation for voiced sounds is modeled by a train of impulses with a period corresponding to the pitch parameter, while the excitation for unvoiced sounds is approximated by white noise. The binary decision between the two types of excitation signal is made according to the pitch parameter: the pitch tracking algorithm makes the voiced/unvoiced classification, and the unvoiced decision is transmitted as a pitch parameter value of zero. A sketch of the excitation generation is given below, and an illustrative example of an estimated pitch track and the generated excitation signal is presented in Figure 5.6. The pitch information in this example was extracted from an utterance of the Finnish word "seitsemän". From the generated excitation signal it can be seen that the fricative /s/ is excited with the noise signal, the following voiced sounds /e/ and /i/ are excited with the train of impulses, and so on.
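A minimal sketch of the excitation generation (illustrative Python; the noise gain of 0.1 is arbitrary, and scaling by the energy parameter is omitted here):

    import numpy as np

    def excitation(f0_track, fs, hop):
        """Impulse train with period fs/f0 for voiced frames; white
        noise when the pitch parameter is zero (unvoiced decision)."""
        e = np.zeros(len(f0_track) * hop)
        next_pulse = 0
        for i, f0 in enumerate(f0_track):
            start = i * hop
            if f0 == 0:                           # unvoiced frame
                e[start:start + hop] = 0.1 * np.random.randn(hop)
                next_pulse = start + hop          # restart pulse phase
            else:                                 # voiced frame
                period = int(round(fs / f0))
                while next_pulse < start + hop:
                    e[next_pulse] = 1.0
                    next_pulse += period
        return e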

Figure 5.6: Pitch track (red) and the corresponding excitation signal (blue) used in synthesis (f0 in Hz and amplitude versus time in seconds).

The contribution of the vocal tract is modeled with an LPC filtering operation. The spectral characteristics of each sound, the formant frequencies and their bandwidths, are reproduced by filtering the excitation signal with an LPC synthesis filter. An LPC synthesis filter of order 18 is used to synthesize the speech.
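The filtering itself is a single all-pole operation. A toy illustration (the order-2 coefficients below are made up for the example; the implementation uses order-18 filters taken from the acoustic inventory for the current phoneme):

    import numpy as np
    from scipy.signal import lfilter

    a = np.array([1.0, -1.3, 0.49])        # A(z) = 1 + a1 z^-1 + a2 z^-2 (stable)
    e = np.zeros(160); e[::80] = 1.0       # impulse-train excitation
    speech = lfilter([1.0], a, e)          # all-pole synthesis, H(z) = 1/A(z)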

The acoustic inventory consists of the set of LPC synthesis filter coefficients needed to produce all Finnish phonemes. Selecting phonemes as the synthesis units has several advantages over other synthesis units: the number of phonemes is relatively small (24 in Finnish) compared to e.g. diphone (approximately 500) or triphone (10 000 or more) synthesis units, which makes the creation of the acoustic inventory easier. Also, the synthesis procedure can be implemented straightforwardly by concatenating synthesized phonemes. An acoustic inventory of phonemes, however, has the disadvantage that the coarticulation effects at the phoneme boundaries are not reproduced, and as such the synthesis procedure creates spectral discontinuities at the phoneme boundaries.

The spectral discontinuity problem at the phoneme boundaries can be lessened by the use of interpolated LPC coefficients. The objective of the interpolation is to adjust the synthesis filter coefficients so that the shift of formant frequencies and bandwidths from the source to the target phoneme sounds natural. The easiest approach would be linear interpolation of the LPC coefficients directly. However, this interpolation method has two problems. First, directly interpolating the filter coefficients does not shift the formant frequencies and bandwidths in a 'natural' way. Second, the resulting interpolated synthesis filter is not guaranteed to be stable, which makes the synthesis result infeasible.



In the implementation, the LPC filter coefficients are linearly interpolated at the phoneme boundaries in the line spectral frequency (LSF) transform domain [34]. The LSF representation has a number of useful properties, including a bounded range, a sequential ordering of the parameters and a simple check for filter stability [1]. The LPC filter coefficients can thus be linearly interpolated in the LSF domain between phoneme boundaries, and the interpolated LSF coefficients and the corresponding inverse-transformed LPC coefficients are guaranteed to form a stable filter [34].
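A minimal sketch of the LPC-to-LSF conversion and the boundary interpolation (illustrative Python using a generic polynomial root finder; a production implementation would use a dedicated method):

    import numpy as np

    def lpc_to_lsf(a):
        """LSFs are the interleaved unit-circle angles of the roots of
        P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z),
        where a = [1, a1, ..., ap]."""
        ar = a[::-1]
        P = np.concatenate((a, [0.0])) + np.concatenate(([0.0], ar))
        Q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], ar))
        w = np.angle(np.concatenate((np.roots(P), np.roots(Q))))
        return np.sort(w[(w > 1e-9) & (w < np.pi - 1e-9)])

    # At a phoneme boundary the LSF vectors are blended linearly,
    # lsf = (1 - t) * lsf_source + t * lsf_target  for t in [0, 1],
    # and converted back to LPC coefficients; the result is stable [34].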

The acoustic inventory of Finnish phonemes was collected manually from speech uttered by the author. For the analysis, speech material containing several occurrences of all Finnish phonemes was recorded. All the occurrences of each phoneme were analyzed using LPC analysis, and the most representative candidates for each phoneme were chosen by means of a subjective listening test of the synthesized sound. The inventory was tested by synthesizing Finnish sentences. Some of the phonemes sounded clearly unnatural, and new candidates for them were chosen. This iterative process of enhancing the acoustic inventory was repeated three times, and the final inventory seemed to represent the phonemes as accurately as possible with this method of analysis.

5.5 Results

The algorithmic delay of the proposed phonetic vocoding system is mainly determined by the complex encoding subsystem. The pitch estimation and phoneme recognition increase the total delay of the coder to the order of several hundreds of milliseconds. Unfortunately, such a delay prevents a real-time implementation of the coder.

The estimated total average bit rate is 400–500 bits/s (see Table 5.2). The phoneme label and duration are variable bit rate parameters, and their bit rate estimates are based on the average phone rate in the SpeechDat database. The pitch and energy are constant bit rate parameters.

In [35] the entropy of the Finnish phoneme distribution in the SpeechDat(II) FI database was estimated at 4.0 bits, and the phone rate of the database was 10.3 phones/s. Thus the average theoretical coding limit of the phoneme label is 4.0 bits × 10.3 phones/s ≈ 42 bits/s. Following [29], the phoneme duration requires approximately 50 bits/s. In [35] the pitch track and energy parameter bit rates were estimated at 200 bits/s and 100 bits/s, respectively.

The high-level description of the speech signal implies high complexity and delay in encoding. Thus, the practical applications of the proposed method include the storage of large amounts of speech and experimental systems in speech research.

Informal subjective listening tests show that the synthesized speech is very intelligible, but the overall speech quality is poor. The lack of appropriate parameters describing the speaker's emotional state (e.g., angry, skeptical, etc.) degrades the quality of the coding.


Table 5.2: Estimation of bit allocation.

    Parameter           bits/s
    Phoneme label         42
    Phoneme duration      50
    Pitch track          200
    Energy               100
    Total               ≈400

speaker’s emotional state (e.g., angry, skeptical, etc.) degrades the quality of coding.Figures 5.7 and 5.8 represent the waveform and spectogram of original and synthesizedword ’seitsemän’.

Figure 5.7: Waveform of the original sample (upper) and the synthesized (lower) Finnish word 'seitsemän' (amplitude versus time in seconds).


Figure 5.8: Spectrogram of the original sample (upper) and the synthesized (lower) Finnish word 'seitsemän' (frequency in Hz versus time in seconds).


Chapter 6

Conclusions

In this thesis, a phonetic vocoding method for very low bit rate speech coding was studied. First the theory and algorithms of speech analysis and synthesis relevant to the scope of the thesis were presented. Then the idea of phonetic vocoding was explained, along with how it differs from conventional waveform coders.

In the implementation part of the thesis the training process of the HMM-based phoneme recognizer for the Finnish language was described. The recognition results showed that speaker-independent Finnish phoneme recognition using a fully connected recognition lattice and a bigram language model performs relatively poorly. However, a remarkable observation was that the quality of the acoustic match is usually good even if the phone recognition fails. A recognition error usually yields a phoneme that belongs to the same broad phonetic class as the correct phoneme, and thus the synthesis will not necessarily produce bad synthetic speech. As a result, the synthetic speech remains intelligible. The intelligibility of the coded speech was considered high, but the overall speech quality was deemed relatively poor. The main cause of the low quality was found to be the primitive concatenative LPC synthesis. On the other hand, much of the speech naturalness is also lost simply because the coder operates at the lowest bit rates.

Future work could begin with improving the speech synthesis. The concatenative LPC synthesis should be replaced by a more sophisticated method; for example, a PSOLA-based approach should significantly improve the naturalness of the output speech. Another area of work would be speaker adaptation: speaker normalization in the coder and voice transformation in the decoder should make it possible to obtain a synthesis voice matching the original speaker. Also, automatic collection of the acoustic inventory of the speaker would ease the building of the acoustic database; the manual collection was found to be time-consuming even though the acoustic inventory was small (consisting only of phonemes).


References

[1] W.B. Kleijn, K.K. Paliwal, editors, Speech Coding and Synthesis, Elsevier Science B.V., 1995.

[2] E.C. Ifeachor, B.W. Jervis, Digital Signal Processing, A Practical Approach, Addison-Wesley Publishing Company Inc., 1993.

[3] M. Karjalainen, Kommunikaatioakustiikka, Akustiikan ja äänenkäsittelytekniikan laboratorion raporttisarja #51, Teknillinen Korkeakoulu, 1999.

[4] K. Wiik, Fonetiikan perusteet, Werner Söderström Oy, Juva, 1981.

[5] H. Stark, J.W. Woods, Probability, Random Processes, and Estimation Theory for Engineers, Second Edition, Prentice-Hall, Inc., 1994.

[6] K.N. Stevens, Acoustic Phonetics, The MIT Press, Cambridge, Massachusetts, 1998.

[7] L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, N.J., 1978.

[8] J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, New York, 1993.

[9] F. Jelinek, R.L. Mercer, S. Roukos, "Principles of Lexical Language Modeling for Speech Recognition", in Advances in Speech Signal Processing, S. Furui and M.M. Sondhi, eds., Marcel Dekker, New York, 1991.

[10] S. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 35, No. 3, March 1987.

[11] O. Viikki, Adaptive Methods for Robust Speech Recognition, PhD Thesis, Tampere University of Technology, 1999.

[12] W.B. Kleijn, "Representing Speech", Proceedings of the X European Signal Processing Conference, EUSIPCO 2000, Tampere, Finland, Vol. 3, September 2000.


[13] Ö.B. Tüzün, M. Demirekler, K.B. Nakiboglu, "Comparison of Parametric and Non-Parametric Representation of Speech for Recognition", Proceedings of the 7th Mediterranean Electrotechnical Conference, 1994.

[14] L.R. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Inc., 1993.

[15] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

[16] S. Young, The HTK Book, Cambridge University Engineering Department, England, 1999.

[17] S.M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, PTR Prentice-Hall, Inc., New Jersey, 1993.

[18] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Massachusetts, 1997.

[19] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley Publishing Company Inc., 1987.

[20] A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing, Prentice-Hall International, 1999.

[21] T.W. Parsons, Voice and Speech Processing, McGraw-Hill, Inc., 1987.

[22] J. Slifka, T.R. Anderson, "Speaker Modification with LPC Pole Analysis", International Conference on Acoustics, Speech, and Signal Processing, ICASSP-95, Vol. 1, 1995.

[23] L.M. Arslan, "Speaker Transformation Algorithm using Segmental Codebooks (STASC)", Speech Communication, Vol. 28, 1999.

[24] K. Gopalan, S.S. Mahil, "Speaker Identification Using Singular Value Decomposition of LPC Spectral Magnitudes", Proceedings of the 35th Midwest Symposium on Circuits and Systems, Vol. 2, 1992.

[25] D.H. Klatt, "Review of text-to-speech conversion for English", Journal of the Acoustical Society of America, Vol. 82, No. 3, September 1987.

[26] Haskins Laboratories Homepage, October 2000, http://www.haskins.yale.edu/haskins/inside.html

[27] R.E. Donovan, Trainable Speech Synthesis, PhD Thesis, Cambridge University Engineering Department, England, 1996. ftp://svr-ftp.eng.cam.ac.uk/reports/donovan_thesis.ps.Z


[28] S. Lemmetty, Review of Speech Synthesis Technology, MSc Thesis, Department of Electrical and Communications Engineering, Helsinki University of Technology, Finland, 1999. http://www.acoustics.hut.fi/~slemmett/dippa/thesis.pdf

[29] J. Picone, G.R. Doddington, "A Phonetic Vocoder", Proc. Int. Conf. Acoust., Speech, Signal Proc., pp. 580-583, 1989.

[30] HTK web site, Cambridge University Engineering Department, England, October 2000, http://htk.eng.cam.ac.uk/

[31] H. Hoge, H.S. Tropf, R. Winski, H. van den Heuvel, R. Haeb-Umbach, K. Choukri, "European Speech Databases for Telephone Applications", Proc. Int. Conf. Acoust., Speech, Signal Proc., Vol. 3, 1997.

[32] SpeechDat(II) Finnish database homepage, Tampere University of Technology, Finland, October 2000, http://www.dmi.tut.fi/puhe/

[33] A. Rämö, Pitch Modification and Quantization for Offline Speech Coding, MSc Thesis, Signal Processing Laboratory, Tampere University of Technology, Finland, 1999.

[34] F. Itakura, "Line Spectrum Representation of Linear Predictive Coefficients of Speech Signals", J. Acoust. Soc. Am., Vol. 57, p. S35, April 1975.

[35] J. Kivimäki, T. Lahti, K. Koppinen, "A Phonetic Vocoder for Finnish", Proceedings of the X European Signal Processing Conference (EUSIPCO), Tampere, Finland, September 2000.


Appendix A

Linear Prediction Theory

Presented after Kleijn and Paliwal ([1], p. 436).

Consider a frame of speech signal having N samples, {s_1, s_2, ..., s_N}. In LPC analysis the current sample is approximately predicted by a linear combination of p past samples,

\hat{s}_n = -\sum_{k=1}^{p} a_k s_{n-k}, \qquad (A.1)

where p is the order of the LPC analysis and {a_1, a_2, ..., a_p} are the LPC coefficients. Let e_n denote the error, or residual, between the actual value and the predicted value, i.e.,

e_n = s_n - \hat{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k}. \qquad (A.2)

Since {e_n} is obtained by subtracting {\hat{s}_n} from {s_n}, it is called the residual signal. Taking the z-transform of Equation (A.2), we get

E(z) = A(z) S(z), \qquad (A.3)

where S(z) and E(z) are the z-transforms of the speech signal and the residual signal, respectively, and

A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}. \qquad (A.4)

The filter A(z) is known as the "whitening" filter, as it removes the short-term correlation present in the speech signal and therefore flattens the spectrum. Since E(z) has an approximately flat spectrum, the short-time power-spectral envelope of the speech is modeled in LPC analysis by an all-pole (or autoregressive) model

H(z) = \frac{1}{A(z)}. \qquad (A.5)


The filter A(z) is also known as the "inverse" filter, as it is the inverse of the all-pole model H(z) of the speech signal.

The LPC coefficients are determined by minimizing the total-squared LPC error,

E = \sum_{n=-\infty}^{\infty} e_n^2. \qquad (A.6)

Minimization of the error criterion defined in Equation (A.6) leads to the following equations,

\sum_{k=1}^{p} r_{|i-k|}\, a_k = -r_i, \qquad 1 \le i \le p, \qquad (A.7)

where r_k is the kth autocorrelation coefficient of the windowed speech signal, given by

r_k = \frac{1}{N} \sum_{n=k+1}^{N} w_n s_n\, w_{n-k} s_{n-k}. \qquad (A.8)

Here {w_i} is the window function, which is of duration N samples.

The p equations defined by Equation (A.7) are called the Yule-Walker equations, and they have to be solved to obtain the p LPC coefficients. These equations can be written in matrix form as follows:

\mathbf{R} \mathbf{a} = -\mathbf{r}, \qquad (A.9)

where

\mathbf{R} =
\begin{bmatrix}
r_0     & r_1     & r_2     & \cdots & r_{p-1} \\
r_1     & r_0     & r_1     & \cdots & r_{p-2} \\
r_2     & r_1     & r_0     & \cdots & r_{p-3} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots  \\
r_{p-1} & r_{p-2} & r_{p-3} & \cdots & r_0
\end{bmatrix}, \qquad (A.10)

\mathbf{a} = [a_1, a_2, \ldots, a_p]^{\mathrm{T}}, \qquad (A.11)

and

\mathbf{r} = [r_1, r_2, \ldots, r_p]^{\mathrm{T}}. \qquad (A.12)

Here, the superscript T indicates the transpose of a vector (or matrix).

The matrix R in Equation (A.10) is often called the autocorrelation matrix. It has a Toeplitz structure, which facilitates the solution of the Yule-Walker equations (Equations A.7 and A.9) for the LPC coefficients {a_i} through computationally fast algorithms such as the Levinson-Durbin algorithm and the Schur algorithm. The Toeplitz structure also guarantees that the poles of the LPC synthesis filter H(z) lie inside the unit circle. Thus, the synthesis filter H(z) resulting from the autocorrelation method will always be stable.
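As an illustration, a minimal numpy sketch of the Levinson-Durbin recursion for solving Equation (A.9); the function name and interface are illustrative, not from the thesis implementation:

    import numpy as np

    def levinson_durbin(r, p):
        """Solve the Yule-Walker equations R a = -r for the LPC
        coefficients, exploiting the Toeplitz structure of R;
        r holds the autocorrelations r_0, ..., r_p."""
        a = np.zeros(p + 1)
        a[0] = 1.0
        err = r[0]                     # prediction error energy
        for i in range(1, p + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err             # reflection coefficient
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        return a[1:], err              # {a_k} and the residual energy

    # Example: order-2 LPC from a toy autocorrelation sequence
    a, err = levinson_durbin(np.array([1.0, 0.5, 0.2]), 2)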