
BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING

D. MEENAKSHI

G. SILPA

V.RAJITHA

Department of Electronics and Communication Engineering

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY (Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad – 500 075

2010


BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING

PROJECT REPORT

SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

BACHELOR OF TECHNOLOGY

IN

ELECTRONICS AND COMMUNICATION ENGINEERING

BY

D. MEENAKSHI (06261A0412)

G. SILPA (06261A0420)

V. RAJITHA (06261A0456)

Department of Electronics and Communication Engineering

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY (Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad – 500 075

2010


MAHATMA GANDHI INSTITUTE OF TECHNOLOGY

(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad – 500 075

Department of Electronics and Communication Engineering

CERTIFICATE

Date: 26 April 2010

This is to certify that the project work entitled “Bessel Features For Speech

Signal Processing” is a bonafide work carried out by

D. Meenakshi (06261A0412), G. Silpa (06261A0420), V. Rajitha (06261A0456)

in partial fulfillment of the requirements for the degree of BACHELOR OF

TECHNOLOGY in ELECTRONICS & COMMUNICATION ENGINEERING by

the Jawaharlal Nehru Technological University, Hyderabad during the academic year

2009-10.

The results embodied in this report have not been submitted to any other University or

Institution for the award of any degree or diploma.

(Signature)
Mr. T. D. Bhatt, Associate Professor
Faculty Advisor/Liaison

(Signature)
Dr. E. Nagabhooshanam
Professor & Head


ACKNOWLEDGEMENT

We express our deep sense of gratitude to our Guide Mr. Suryakanth V

Gangashetty, IIIT, Hyderabad, for his invaluable guidance and encouragement in

carrying out our Project.

We are highly indebted to our Faculty Liaison Mr. T. D. Bhatt, Associate

Professor, Electronics and Communication Engineering Department, who has given us

all the necessary technical guidance in carrying out this Project.

We wish to express our sincere thanks to Dr. E. Nagabhooshanam, Head of the

Department of Electronics and Communication Engineering, M.G.I.T., for permitting us

to pursue our Project in Cranes Varsity and encouraging us throughout the Project.

Finally, we thank all the people who have directly or indirectly helped us through

the course of our Project.

D.Meenakshi

G. Silpa

V. Rajitha


Human speech signal is a multi component signal where the components are

called formants. Multicomponent signals produce delineated concentrations in the time-

frequency plane. In the time frequency plane there is a clear delineation into different

regions. Different time frequency distributions may give somewhat different

representations; however, they all give roughly the same picture in regard to the existence

of the components. Most efforts in devising recognition schemes have been directed

toward the recognition of human speech. While it has been appreciated for over fifty years

that speech is multicomponent, no particular exploitation has been made of that fact.

Recently, however, an ingenious idea has been proposed and developed by

Fineberg and Mammone which takes advantage of the multicomponent nature of a signal.

Suppose, for the sake of illustration, we consider signals consisting of two components.

The phase of the first component of the unknown and of the template candidate is

determined. Subtraction of the two phases for each instant of time defines the

transformation function for going from the template to the unknown. One can think of

this as the possible distortion function for the first component. It would be zero if there were

no distortion. Similarly one determines the distortion function for the second component.

If the two distortion functions are equal then we have a match. Fineberg and Mammone

have successfully used this method for the classification of speech.

The excellence of the results can be interpreted as indicating that indeed formants

are correlated and distorted in the same way. This is an important finding about the nature

of speech. Note that in the above discussion, the distortion function is taken to be the

difference of the phases. However, different circumstances may make other distortion

functions more appropriate. For example, one can define the distortion function to be the

ratio of the two phases. In general one can think of the distortion as a function of the

signal and the environment. It would be of considerable interest to investigate distortion

functions for common situations.


ABSTRACT


The discrete energy separation algorithm (DESA) together with Gabor

filtering provides a standard approach to estimate the amplitude envelope (AE) and the

instantaneous frequency (IF) functions of a multicomponent amplitude and frequency

modulated (AM-FM) signal. The filtering operation introduces amplitude and phase

modulation in the separated mono component signals, which may lead to an error in the

final estimation of the modulation functions. We have used a method called the Fourier-

Bessel expansion-based discrete energy separation algorithm (FB-DESA) for component

separation and estimation of the AE and IF functions of a multicomponent AM-FM

signal. The FB-DESA method does not introduce any amplitude or phase modulation in

the separated mono component signal leading to accurate estimations of the AE and IF

functions. Simulation results with synthetic and natural signals are included to illustrate

the effectiveness of the proposed method.
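The amplitude and frequency estimation step can be illustrated with the standard DESA-1 recursions built on the Teager-Kaiser energy operator. The following Python sketch illustrates that step only; it is not the FB-DESA implementation of this work, and the test sinusoid is a hypothetical signal:

```python
import numpy as np

def teager(x):
    # Teager-Kaiser energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x):
    """DESA-1 estimates of the amplitude envelope (AE) and the
    instantaneous frequency (IF, in radians/sample) of a mono component
    AM-FM signal x."""
    y = np.diff(x)                      # y(n) = x(n) - x(n-1)
    psi_x = teager(x)                   # psi[x](n), n = 1 .. N-2
    psi_y = teager(y)                   # psi[y](n), n = 2 .. N-2
    g = 1.0 - (psi_y[:-1] + psi_y[1:]) / (4.0 * psi_x[1:-1])
    g = np.clip(g, -1.0, 1.0)           # guard against rounding outside [-1, 1]
    omega = np.arccos(g)                # IF estimate
    ae = np.sqrt(psi_x[1:-1] / (1.0 - g ** 2))   # AE estimate
    return ae, omega

# For a pure sinusoid the estimates recover the amplitude and frequency.
n = np.arange(400)
x = 1.5 * np.cos(0.2 * np.pi * n + 0.3)
ae, omega = desa1(x)
```

For a mono component sinusoid the recursions are exact; for a filtered multicomponent signal the quality of the component separation determines the final accuracy, which is the point addressed by the FB-based separation above.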

The Voice Onset Time (VOT) is an important characteristic of stop consonants

which plays a significant role in perceptual discrimination of phonemes of the same place

of articulation [6]. It also plays an important role in word segmentation, stress related

phenomena, and dialectal and accented variations in speech patterns [7-8]. The VOT can

also be used for classification of accents. VOT can be used to classify

Mandarin-, Turkish-, German- and American-accented English. It is an important temporal

feature which is often overlooked in speech perception, speech recognition, as well as

accent detection.

Many speech analysis situations depend on accurate estimation of the location of

epoch within the glottal pulse. For example, the knowledge of epoch location is useful for

accurate estimation of the fundamental frequency (F0). Analyses of speech signals in the

closed glottis regions provide an accurate estimate of the frequency response of the

supralaryngeal vocal-tract system [12], [13]. With the knowledge of epochs, it may be

possible to determine the characteristics of the voice source by a careful analysis of the

signal within a glottal pulse. The epochs can be used as pitch markers for prosody

manipulation, which is useful in applications like text-to-speech synthesis, voice conversion

and speech rate conversion.


Table of contents

CERTIFICATE FROM ECE DEPARTMENT (i)

ACKNOWLEDGEMENTS (ii)

ABSTRACT (iii)

LIST OF FIGURES (v)

LIST OF TABLES (vi)

CHAPTER 1. OVERVIEW

1.1 Introduction

1.2 Aim of the project

1.3 Methodology

1.4 Significance and applications

1.5 Organization of work

CHAPTER 2. INTRODUCTION TO BESSEL FEATURES FOR SPEECH

SIGNAL PROCESSING

2.1 Introduction

2.2 Multicomponent signal

2.3 Series representation

2.4 Fourier Bessel

CHAPTER 3. REVIEW OF APPROACHES FOR BESSEL FEATURES

3.1 Introduction

3.2 Parseval's formula

3.3 Hankel transform

CHAPTER 4. THEORY OF BESSEL FEATURES

4.1 Introduction

4.2 Solution of differential equation

4.3 Mean frequency computation

4.4 Reconstruction of the signal


CHAPTER 5. BESSEL FEATURES FOR DETECTION OF VOICE ONSET TIME

(VOT)

5.1 Introduction

5.2 Significance of VOT

5.3 Detection of VOT

5.3.1 Fourier-Bessel expansion

5.3.2 AM-FM model and DESA method

5.3.3 Approach to detect VOTs from speech using

Amplitude Envelope (AE)

5.4 Results

5.5 Conclusions

CHAPTER 6. BESSEL FEATURES FOR DETECTION OF

GLOTTAL CLOSURE INSTANTS (GCI)

6.1 Introduction

6.2 Significance of Epoch in speech analysis

6.3 Review of existing approaches

6.3.1 Epoch extraction from short time

energy of speech signal

6.3.2 Epoch extraction from linear prediction analysis

6.3.3 Limitation of existing approaches

6.4 Detection of GCI using FB expansion and AM-FM model

6.4.1 Fourier-Bessel expansion

6.4.2 AM-FM model and DESA method

6.5 Studies on detection of GCIs for various speech utterances

6.6 Glottal activity detection

6.7 Conclusion

CHAPTER 7. SUMMARY AND CONCLUSIONS

7.1 Summary of the work

CHAPTER 8. REFERENCES


LIST OF FIGURES

2.1 Zero-order Bessel function

3.1 Sinusoidal function

4.1 Bessel functions of different orders

4.2 Reconstructed band-limited signal

5.1 Regions of significant events in the production

5.2 Plot of waveforms for speech utterance /ke/

5.3 Plot of waveforms for speech utterance /te/

5.4 Plot of waveforms for speech utterance /pe/

5.5 Plot of the bar graphs for utterances of /ke/, /te/, /pe/

6.1 Epoch (or GCI) extraction of a male speaker

6.2 Epoch (or GCI) extraction of a female speaker


LIST OF TABLES

5.1 FB coefficient orders for emphasizing the vowel and consonant parts

5.2 VOT values for female (F01) and male (M05) speakers


CHAPTER 1

OVERVIEW

1.1 INTRODUCTION

Multicomponent signals produce delineated concentrations in the time-frequency

plane. Human speech signal is a multi component signal where the components are called

formants. Most efforts in devising recognition schemes have been directed toward the

recognition of human speech. While it has been appreciated for over fifty years that speech is

multicomponent, no particular exploitation has been made of that fact. Recently, however,

an ingenious idea has been proposed and developed by Fineberg and Mammone which

takes advantage of the multicomponent nature of a signal.

1.2 AIM OF THE PROJECT

The aim is to explore Bessel features and apply them to speech signal processing

tasks such as accent detection, language identification and speaker identification.

1.3 METHODOLOGY

The Fourier-Bessel (FB) expansion and the AM-FM model have been employed for

efficient results in speech signal processing.

1.4 SIGNIFICANCE AND APPLICATIONS

The applications include speech segmentation, speaker verification, speaker

identification, speech recognition and language identification. Pattern recognition may

also be accomplished.


1.5 ORGANISATION OF WORK

Firstly, the significance of Bessel features has been studied and their

applications in speech signal processing for the detection of Voice Onset Time (VOT)

and Glottal Closure Instants (GCI) have been observed.

CHAPTER 2

INTRODUCTION TO BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING

2.1 INTRODUCTION

Representation of a signal directly by its sample values or by an analytic function

may not be desirable or practical. Many practical signals are highly redundant; both

image and speech signals fall under this category, and it may be desirable and possibly

necessary to represent the signal with fewer samples for economy of storage

and/or limited transmission bandwidth. The processing of signals can often be efficiently

carried out in a domain other than that of the original signal. Many natural and man-made

signals have a unique structure in time and frequency. In the time frequency plane there is

a clear delineation into different regions. Different time frequency distributions may give

somewhat different representations; however, they all give roughly the same picture in

regard to the existence of the components. Pattern recognition techniques rely on the

ability to generate a set of coefficients from the raw data (time domain samples) that are

more compact and are more closely related to the signal characteristics of interest.

2.2 MULTICOMPONENT SIGNAL

Human speech signal is a multi component signal where the components are called

formants. Multicomponent signals produce delineated concentrations in the time-

frequency plane. The general belief is that in a multicomponent signal each component in

the time-frequency plane corresponds to a part of the signal. That is, if

s(t) = s1(t) + s2(t),

then each part, s1 and s2, corresponds to a component. A monocomponent signal looks

like a single mountain ridge. The center of the ridge forms a trajectory in the time-

frequency plane that generally varies with time. The trajectory is the instantaneous frequency of the

signal. If the signal is written in the form

s(t) = A(t) e^(jφ(t)).

The instantaneous frequency is an average, namely the average of the frequencies at a

given time. The broadness at that time is the spread (root mean square) of frequencies at

time t; we call it the instantaneous bandwidth.
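For a mono component signal in this form, the amplitude envelope A(t) and the instantaneous frequency can be estimated from the analytic signal obtained with the Hilbert transform. The sketch below uses a hypothetical synthetic AM-FM tone; the sampling rate, carrier and modulation values are illustrative choices, not values from this work:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000.0                                           # sampling rate (Hz)
t = np.arange(0, 0.1, 1.0 / fs)
a_t = 1.0 + 0.5 * np.cos(2 * np.pi * 20 * t)          # envelope A(t)
phi = 2 * np.pi * 1000 * t + 5 * np.sin(2 * np.pi * 10 * t)  # phase phi(t)
s = a_t * np.cos(phi)                                 # s(t) = A(t) cos(phi(t))

z = hilbert(s)                  # analytic signal, approximately A(t) e^(j phi(t))
ae = np.abs(z)                  # amplitude envelope estimate
inst_f = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)  # IF in Hz
```

Here the instantaneous frequency swings about 50 Hz around the 1 kHz carrier while the envelope estimate tracks A(t); the spread of inst_f over time is exactly the kind of local frequency information the spectrum alone cannot provide.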

2.2.1 SERIES REPRESENTATION

If Frequency content of signal is desired, a series representation packs the

frequency information into fewer samples than a time domain representation. Hence,

signal decomposition by means of series representation is important to such applications

as Seismic, Speech and Image processing. Fourier series and Fourier transform

representations of the speech signal have been applied extensively in speech storage and

speech compression systems. The advantage of these representations lies in the ability to

store or transmit the parameters associated with the transformation instead of the binary

codes representing the values of samples of the waveform taken in the time domain.

Usually the transform domain parameters can be handled with greater efficiency than the

time domain parameters.

2.3 FOURIER BESSEL

We are interested in expanding a speech signal into a Fourier Bessel series.

Generally speaking the spectrum cannot be used to ascertain or define whether a signal is

mono component or not, although in some cases it may give an indication that

components are present. The reason is that the spectrum tells us what frequencies existed

for the total duration of the signal, while the existence of components happens locally in

time. If each component is limited to its own mutually exclusive band for all time then

and only then the spectrum may give an indication of the existence of components. If


there is an overlap at any point of time, the spectrum will not indicate that there were

components since the spectral components will coalesce.

The spectrum gives no indication of the timing of components; it just tells us what components

existed, irrespective of when they existed. In general the spectrum is not a good indicator


of components. What is needed is a measure of the spread of frequencies at each time-

the instantaneous bandwidth.

Fig 2.1 Zero-order Bessel function (basis function for the zero-order expansion)

It appears that the application of the F-B series to speech

processing, particularly speaker identification, bears further research. The shift-variant

property of the Hankel transform may prove valuable for non-stationary analysis and

there are some indications that fewer coefficients may be required. Since the coefficients are real,

the speech can be directly reconstructed from its coefficient-versus-time-index representation without the

need to retain phase components; this may prove to be of some use if conversion back to

the time domain is desired.

The Fourier series representation employs an orthogonal set of sinusoidal

functions as a basis set, while the Fourier transform uses a complex exponential function,

related to the sinusoidal through Euler’s relation as its kernel. The sinusoidal functions

are periodic and are ideal for representing general periodic functions, but may not fully

match the properties of other waveforms. In particular the random, non-stationary nature

of speech waveforms does not lead to the most efficient representation by sinusoid-based

transformations. In general, for series expansion or integral transforms of signals, the

representation converges more rapidly if there is a better match between the basis or

kernel function, and the signal being represented. Thus the Fourier transform of a sine

wave is impulsive, indicating a perfect match between signal and the kernel function and

corresponding to an infinite convergence rate for the transform. This principle may be

exploited to improve the signal to noise ratio of speech signals in digital signal

processing.

A basis set or kernel function with regular zero-crossings and decaying

amplitudes would be expected to provide efficiencies in representing speech signals for

storage and compression. Bessel functions provide the desired properties and have been

accordingly exploited for speech processing. The Fourier-Bessel transform can be shown to be more efficient in

representing speech-like waveforms by comparing the Fourier and Fourier-Bessel

transforms of a rectangular pulse, a triangular pulse, and a linearly-damped sinusoidal

pulse.


2.4 CONCLUSIONS

Generally speaking the spectrum cannot be used to ascertain or define whether a

signal is mono component or not, although in some cases it may give an indication that

components are present. The reason is that the spectrum tells us what frequencies existed

for the total duration of the signal, while the existence of components happens locally in

time. It appears that the application of F-B series speech processing, particularly speaker

identification, bears further research. The shift variant property of the Hankel transform

may prove valuable for non-stationary analysis and some indications that fewer

coefficients may be required. Since the coefficients are real, the speech can be directly

reconstructed from its coefficient-versus-time-index representation without the need to retain phase

components.


CHAPTER 3

REVIEW OF APPROACHES FOR BESSEL FEATURES

3.1 INTRODUCTION

Any orthonormal set of basis functions can be used to represent some arbitrary

function. Fourier series theory shows that the series coefficients are given by a discrete

Fourier transform, so coefficient generation is an easy process with the numerous FFT

algorithms that abound.

Calculation of the Fourier-Bessel series coefficients requires computation of a

Hankel Transform, which until recently greatly diminished consideration of this series for

potential applications. Fast Hankel transform algorithms have now been developed which allow

computation of F-B coefficients at a speed only slightly slower than that of Fourier coefficients;

this should result in increased use of the F-B expansion.

3.2 PARSEVAL'S FORMULA

The theory of series representation of an arbitrary signal is more general than

expressing a signal as a sum of sinusoids. In fact, any orthonormal set of basis functions

can be used to represent some arbitrary function. For example, if we define an orthogonal

set of functions φn(t) satisfying

∫ φm(t) φn(t) dt = 1 for m = n, and 0 for m ≠ n,

The function can be written as:

f(t) = ∑ Cn φn(t),

Where

Cn = ∫ f(t) φn(t) dt.

If we restrict f (t) to signals possessing finite energy and band limited frequency spectra a

useful property can be stated:

The energy, E, of f(t) is given by

E = ∫ f(t)² dt = ∑ Cn² < ∞.

1

Page 19: Project(2)

This is simply a restatement of Parseval's well-known formula concerning

Fourier series coefficients.
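The relation can be checked numerically. A minimal sketch using the DFT as the orthogonal expansion, for which Parseval's formula reads ∑|x(n)|² = (1/N) ∑|X(k)|²:

```python
import numpy as np

# Parseval's relation for the DFT basis:
# the sum over n of |x(n)|^2 equals (1/N) * the sum over k of |X(k)|^2.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
X = np.fft.fft(x)
energy_time = np.sum(x ** 2)
energy_freq = np.sum(np.abs(X) ** 2) / len(x)
```

The two energies agree to machine precision; the same identity holds for any orthogonal basis, including the Fourier-Bessel basis discussed below, up to the basis-specific normalization.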

Fig 3.1 Sinusoidal functions: basis functions for the Fourier transform

Although the generalized form of series representation is useful for the

construction of mathematical proofs, we are more interested in specific choices of the

basis function, ɸn(t). Obviously, choosing

φn(t) = e^(jnωt)

results in a Fourier series. If f(t) is available over a finite segment of time (−T, T), a

realistic assumption, it may be desirable to concentrate the energy within this interval.

Denoting the concentrated energy by the fractional energy ratio,

E = ∫−T..T |φn(t)|² dt / ∫−∞..∞ |φn(t)|² dt

As the waveform becomes more speech-like the Fourier-Bessel transform

converges faster than the Fourier transform. This means that the frequency spectrum is

narrower, so that when dealing with signal plus wide band noise, a filter with lower

cutoff frequency can be employed to attenuate more of the noise power without also

attenuating the signal.


3.3 HANKEL TRANSFORM

It can be shown that E is maximized for φn(t)

corresponding to the prolate spheroidal functions. This choice of basis function has been

investigated thoroughly and has found use within many areas of digital signal processing.

Another possible choice for ɸn(t) is a family of Bessel functions, which results in an

expansion termed the Fourier-Bessel series. In this case, choosing a zero order Bessel

function for illustration:

φn(t) = J0(λn t),

And f (t) can then be expressed as

f(t) = ∑ Cn J0(λn t).

Fourier series possesses some analytical properties such as shift invariance that

make the various mathematical manipulations much simpler.

3.4 CONCLUSIONS

Any orthonormal set of basis functions can be used to represent some arbitrary

function. Calculation of the Fourier-Bessel series coefficients requires computation of a

Hankel Transform, which until recently greatly diminished consideration of this series for

potential applications. A fast Hankel transform algorithm was presented that allows the

Fourier-Bessel series coefficients to be computed efficiently.


CHAPTER 4

THEORY OF BESSEL FEATURES

4.1 INTRODUCTION

Bessel functions arise as solutions of Bessel's differential equation. The general

solution is a combination of Jn(x), the Bessel function of the first kind of order n, and

Yn(x), the Bessel function of the second kind of order n. Bessel functions are expressible

in series form. It should be noted that the FB series coefficients Cm are unique for a given

signal, similarly as the Fourier series coefficients are unique for a given signal. However,

unlike the sinusoidal basis functions in the Fourier series, the Bessel functions decay over

time. This feature of the Bessel functions makes the FB series expansion suitable for

nonstationary signals.

Also, it is possible that the Fourier-Bessel coefficients in some sense better

capture the fundamental nature of the speech waveform; the shift variant property may be

desirable and possibly result in improved speaker identification/authentication

probabilities. Since the Fourier-Bessel coefficients are real, the noisy phase problem

upon reconstruction is avoided, which may be advantageous. The entire range of image

processing algorithms developed over the past several decades would be available for

exploitation to improve upon the speech characteristics.

4.2 SOLUTION FOR DIFFERENTIAL EQUATION

Bessel functions arise as a solution of the differential equation:

x² y″ + x y′ + (x² − n²) y = 0, n ≥ 0, (1)

this is called Bessel’s differential equation. The general solution of (1) is given by

y = C1Jn(x) + C2Yn(x),

where J n(x) is called a Bessel function of the first kind of order n and Y n(x) is called a

Bessel function of the second kind of order n. Bessel functions are expressible in series

form; for example, J n(x) can be written

Jn(x) = ∑ (−1)^r (x/2)^(n+2r) / [r! Γ(n+r+1)]

2

Page 22: Project(2)

And in particular

J0(x) = 1 − x²/2² + x⁴/(2²·4²) − …
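The partial sums of this series converge rapidly for moderate arguments. A small sketch comparing a truncated series against SciPy's j0; the 30-term truncation is an arbitrary illustrative choice:

```python
from scipy.special import j0

def j0_series(x, terms=30):
    # J0(x) = sum over r >= 0 of (-1)^r (x/2)^(2r) / (r!)^2,
    # accumulated term by term using the ratio of successive terms.
    total, term = 0.0, 1.0
    for r in range(terms):
        total += term
        term *= -(x / 2.0) ** 2 / (r + 1) ** 2
    return total
```

For arguments of a few units the truncated sum already matches the library value to machine precision.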

It can be readily shown that Bessel functions are orthogonal with respect to the weighting

function x. This can be seen by computing

∫₀¹ x Jn(αx) Jn(βx) dx = [β Jn(α) J′n(β) − α Jn(β) J′n(α)] / (α² − β²)

and

∫₀¹ x Jn²(αx) dx = ½ [J′n²(α) + (1 − n²/α²) Jn²(α)]

Now if α and β are distinct roots of Jn(x) = 0 we can write

∫₀¹ x Jn(αx) Jn(βx) dx = 0, α ≠ β,

and thus Jn(αx) and Jn(βx) are orthogonal with respect to the weighting function x.
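This orthogonality, together with the diagonal normalization ½J1(λm)² obtained when α = β is a root λm of J0, can be verified numerically over the interval (0, 1). A sketch using the first three roots:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros
from scipy.integrate import quad

lam = jn_zeros(0, 3)       # first three positive roots of J0(x) = 0
G = np.empty((3, 3))       # matrix of x-weighted inner products
for i in range(3):
    for k in range(3):
        G[i, k], _ = quad(lambda u: u * j0(lam[i] * u) * j0(lam[k] * u), 0, 1)
# Expected: zeros off the diagonal, 0.5 * J1(lam_m)^2 on the diagonal.
expected = np.diag(0.5 * j1(lam) ** 2)
```

The off-diagonal inner products vanish to quadrature accuracy, confirming that the x-weighted Bessel family forms an orthogonal set.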

Having established orthogonality, a series expansion of an arbitrary function can be

written in terms of Bessel functions with the form

f(x) = ∑ Cm Jn(λm x),

Where λ1, λ2, λ3........are the positive roots of Jn(x) = 0.

The coefficients, Cm, are given by

Cm = 2 ∫₀¹ x Jn(λm x) f(x) dx / [Jn+1(λm)]².

If we wish to expand f(t) over some arbitrary interval (0, a) the zero order Bessel series

expansion becomes

f(t) = ∑ Cm J0(λm t / a), 0 < t < a,

With the coefficients, Cm, calculated from

Cm = 2 ∫₀..a t f(t) J0(λm t/a) dt / (a² [J1(λm)]²)

where λ1, λ2, λ3, … are the ascending-order positive roots of J0(x) = 0. The integral in the numerator is

the Hankel transform.
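For a sampled signal the coefficient integral can be approximated by a Riemann sum over the sample instants, with the interval (0, a) mapped to the N samples. The following Python sketch is an illustrative discretization, not the implementation used in this work:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

def fb_coefficients(x, Q):
    """Zero-order Fourier-Bessel coefficients C_1..C_Q of x, approximating
    C_m = 2 / (a^2 J1(lam_m)^2) * integral of t x(t) J0(lam_m t / a) dt
    by a Riemann sum over the sample instants t = 1..N, with a = N."""
    N = len(x)
    t = np.arange(1, N + 1, dtype=float)
    lam = jn_zeros(0, Q)                       # ascending roots of J0
    C = np.array([2.0 * np.sum(t * x * j0(l * t / N)) / (N ** 2 * j1(l) ** 2)
                  for l in lam])
    return C, lam

def fb_reconstruct(C, lam, N):
    # x(t) is recovered as the sum of C_m J0(lam_m t / a).
    t = np.arange(1, N + 1, dtype=float)
    return np.sum([c * j0(l * t / N) for c, l in zip(C, lam)], axis=0)

# Sanity check: the first basis function itself gives C close to [1, 0, 0, ...].
N = 500
x = j0(jn_zeros(0, 1)[0] * np.arange(1, N + 1) / N)
C, lam = fb_coefficients(x, 5)
```

Expanding a signal that is itself a basis function concentrates essentially all the energy in a single coefficient, which is the compactness property exploited throughout this report.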

The coefficients of the FB expansion have been used to compute the Mean

Frequency. The FB coefficients are unique for a given signal in the same way that Fourier

series coefficients are unique for a given signal. However, unlike the sinusoidal basis

functions in the Fourier transform, the Bessel functions are aperiodic and decay over time.

These features of the Bessel functions make the FB series expansion suitable for analysis

of nonstationary signals when compared to the simple Fourier transform.


4.3 MEAN FREQUENCY COMPUTATION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over

some arbitrary interval (0, a) is expressed as in

x(t) = ∑ Cm J0(λm t / a)

where λm, m = 1, 2, 3, …, are the ascending-order positive roots of J0(x) = 0, and

J0(λm t/a) are the zero-order Bessel functions. The sequence of Bessel functions

{J0(λm t/a)} forms an orthogonal set on the interval 0 ≤ t ≤ a with respect to the weight t.

Using the orthogonality of the set {J0(λm t/a)}, the FB coefficients Cm are

computed by using the following equation

Cm = 2 ∫₀..a t x(t) J0(λm t/a) dt / (a² [J1(λm)]²)

with 1 ≤ m ≤ Q, where Q is the order of the FB expansion and J1(·) are the first-

order Bessel functions. The FB expansion order Q must be known a priori. The interval

between successive zero-crossings of the Bessel function J0(·) increases slowly with time

and approaches π in the limit. If the order Q is unknown, then in order to cover the full signal

bandwidth (half of the sampling frequency), Q must be set equal to the length of the

signal.

It should be noted that the FB series coefficients Cm are unique for a given signal,

similarly as the Fourier series coefficients are unique for a given signal. However, unlike

the sinusoidal basis functions in the Fourier series, the Bessel functions decay over time.

This feature of the Bessel functions makes the FB series expansion suitable for

nonstationary signals.

The mean frequency is calculated as in [11]

Fmean = (∑ fm Em) / (∑ Em), summed over m = 1, …, Q,

Where

Em = Cm² (energy at order m),

fm = m / (2a) (frequency at order m).
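The computation can be sketched for a sampled signal as below; taking a equal to the number of samples N gives fm ≈ m·fs/(2N) in hertz. The 500 Hz test tone, the 8 kHz sampling rate and the order Q = 200 are illustrative choices, not values from this work:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

def fb_mean_frequency(x, fs, Q):
    """Mean frequency F_mean = sum(f_m E_m) / sum(E_m), with E_m = C_m^2
    and f_m = m * fs / (2 N), since the m-th positive root of J0 is
    approximately m * pi."""
    N = len(x)
    t = np.arange(1, N + 1, dtype=float)
    lam = jn_zeros(0, Q)
    C = np.array([2.0 * np.sum(t * x * j0(l * t / N)) / (N ** 2 * j1(l) ** 2)
                  for l in lam])
    E = C ** 2                                   # energy at order m
    f = np.arange(1, Q + 1) * fs / (2.0 * N)     # frequency at order m (Hz)
    return np.sum(f * E) / np.sum(E)

# Illustrative check: a pure 500 Hz tone sampled at 8 kHz.
fs, N = 8000.0, 800
tone = np.cos(2 * np.pi * 500 * np.arange(N) / fs)
f_mean = fb_mean_frequency(tone, fs, Q=200)
```

For the tone, the coefficient energy peaks near order m ≈ 2Nf0/fs, so the computed mean frequency falls close to the tone frequency.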

Signals with decaying amplitude characteristics (such as speech) may be more compactly represented by Bessel-function

basis vectors rather than by pure sinusoids. Also, it is possible that the Fourier-Bessel

coefficients in some sense better capture the fundamental nature of the speech waveform;


the shift variant property may be desirable and possibly result in improved speaker

identification/authentication probabilities.

For the test function f (t) =J0 (t), the Fourier series coefficients produced an

extremely accurate reconstruction of the function under transformation. A Fourier-Bessel

expansion resulted in a higher error, but the number of coefficients required was dramatically

different. Regenerating f(t)= J0 (t) from Fourier coefficients required all 256 values to

achieve the result; by contrast just one Fourier-Bessel coefficient is required to

reconstruct the function.

Any function decomposed into basis vectors of the same analytic form will

produce a single coefficient. Indeed, expanding the test signal f (t) =sin (t) via Fourier

series requires a single coefficient. Nevertheless, the point being made is that an

unknown signal will be more efficiently (more information in fewer coefficients)

represented if expanded in the set of basis functions that “resemble” itself. Since the

Fourier-Bessel coefficients are real, the noisy phase problem upon reconstruction is

avoided, which may be advantageous. The entire range of image processing algorithms

developed over the past several decades would be available for exploitation to improve

upon the speech characteristics.

Fig 4.1 Bessel functions of different order


The figure shows Bessel functions of different orders: the red waveform is the

zero-order Bessel function, the green waveform is the first-order Bessel

function, and the blue waveform is the second-order Bessel function.


4.4 RECONSTRUCTION OF THE SIGNAL

Fig 4.2(a) shows the speech signal, Fig 4.2(b) the cluster of frequencies present in the

speech signal, and Fig 4.2(c) the band-limited signal reconstructed from the original

one using Bessel coefficients.

Fig 4.2 Reconstruction of the speech signal using Bessel coefficients.


4.5 CONCLUSIONS

In this chapter, we represented the general signal decomposition problem in terms

of an orthogonal series expansion. The focus was primarily on the Fourier-Bessel series

expansion, with other expansions used for comparison purposes. A fast Hankel transform

algorithm was presented that allows the Fourier-Bessel series coefficients to be computed

efficiently. Mean frequency computation is made possible using Fourier-Bessel

coefficients.


CHAPTER 5

BESSEL FEATURES FOR DETECTION OF VOICE ONSET

TIME (VOT)

5.1 INTRODUCTION

The instant of onset of vocal fold vibration relative to the release of closure

(burst) is the commonly used feature to analyze the manner of articulation in production

of stop consonants. The interval from the time of burst release to the time of onset of

vocal fold vibration is defined as voice onset time (VOT) [4].

Accurate determination of VOT from acoustic signals is important both theoretically

and clinically. From a clinical perspective, the VOT constitutes an important clue for

assessment of speech production of hearing impaired speakers [5]. From a theoretical

perspective, the VOT of stop consonants often serves as a significant acoustic correlate to

discriminate voiced from unvoiced, and aspirated from unaspirated stop consonants. The

unvoiced unaspirated stop consonants typically have low and positive VOTs, meaning

that the voicing of the following vowel begins near the instant of closure release. The

unvoiced aspirated stop consonants followed by a vowel have slightly higher VOTs than

their unaspirated counterparts, as the burst is followed by the aspiration noise. The

duration of the VOT in such cases is a practical measure of aspiration. The longer the

VOT, the stronger is the aspiration. On the other hand, voiced stop consonants have a

negative VOT, meaning that the vocal folds start vibrating before the stop is released.

5.2 SIGNIFICANCE OF VOICE ONSET TIME (VOT)

The Voice Onset Time (VOT) is an important characteristic of stop consonants

which plays a significant role in perceptual discrimination of phonemes of the same place

of articulation [6]. It also plays an important role in word segmentation, stress related

phenomena, and dialectal and accented variations in speech patterns [7-8]. The VOT can

also be used for classification of accents.


Voice Onset Time (VOT) can be used to classify Mandarin, Turkish, German and American accented English. It is an important temporal feature that is often overlooked in speech perception, speech recognition, and accent detection. Therefore, the

amplitude envelope (AE) function is useful for detection of VOT. The sub-band

frequency analysis is performed to detect VOT of unvoiced stops in [9]. The amplitude

modulation component (AMC) is used to detect vowel plus voiced onset regions (VORs)

in different frequency bands assuming the stop to vowel transition has different amplitude

envelopes for partitioned frequency ranges. In the following section we shall discuss the

effective VOT detection approach using FB expansion followed by AM-FM model for

stop consonant vowel units (/ke/, /ki/, /ku/, /te/, /ti/, /tu/, /pe/, /pi/, /pu/). The dominant

frequency bands of the voiced onset region for various stops and vowels are as

follows: /k/ between 1500 and 2500 Hz; /t/ between 2000 and 3000 Hz; /p/ between 2500

and 3500 Hz; vowel between 300 and 1200 Hz [10,6]. The VOT detection discussed here

is conceptually simpler and can be implemented as a one step process, which makes real

time implementation feasible.

5.3 DETECTION OF VOICE ONSET TIME (VOT)

The detection of VOT has been done using the FB expansion and the AM-FM model. Section 5.3.1 discusses the FB expansion. The AM-FM signal model and its analysis using the DESA method are discussed in Section 5.3.2. Section 5.3.3 describes the proposed algorithm for VOT detection using the FB expansion and the AE function of the AM-FM model. The VOT detection results for speech data collected from various male and female speakers are presented in Section 5.4.

5.3.1 FOURIER-BESSEL (FB) EXPANSION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over some arbitrary interval (0, T) is expressed as

x(t) = Σ (m = 1 to M) C_m J0(λ_m t / T)

where the coefficients C_m are given by

C_m = [2 / (T^2 J1(λ_m)^2)] ∫ t x(t) J0(λ_m t / T) dt, the integral taken over (0, T),

and λ_m, m = 1, 2, ..., M, are the ascending-order positive roots of J0(λ) = 0, J0(·) being the zero-order Bessel function of the first kind.

It has been shown that there is one-to-one correspondence between the frequency

content of the signal and the order (m) where the coefficient attains peak magnitude [10].

If the AM-FM components of the formants of the speech signal are well separated in the frequency domain, the speech signal components will be associated with distinct clusters of non-overlapping FB coefficients [11]. Each component of the speech signal

can be reconstructed by identifying and separating the corresponding FB coefficients.

5.3.2 AM-FM MODEL AND DESA METHOD

For both continuous- and discrete-time signals, Kaiser has defined a nonlinear energy-tracking operator [12]. For the discrete-time case, the energy operator for x[n] is defined as

Ψ[x(n)] = x^2(n) - x(n-1) x(n+1)

where x(n) = A(n) cos(φ(n)) is the AM-FM signal with amplitude envelope A(n) and instantaneous phase φ(n).

The energy operator can estimate the modulating signal, or more precisely its scaled version, when either AM or FM is present [12]. When AM and FM are present simultaneously, three algorithms are described in [12] to estimate the instantaneous frequency and A(n) separately. The best of the three in terms of performance is the discrete energy separation algorithm 1 (DESA-1).
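The DESA-1 recursions can be sketched compactly (a Python/NumPy illustration; the report's MATLAB listings at the end implement the same steps sample by sample):

```python
import numpy as np

def teager(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1]**2 - x[:-2] * x[2:]
    return psi

def desa1(x):
    """DESA-1: instantaneous frequency (rad/sample) and amplitude envelope."""
    psi_x = teager(x)
    y = np.zeros_like(x)
    y[1:] = x[1:] - x[:-1]          # backward difference of the signal
    psi_y = teager(y)
    n = np.arange(2, len(x) - 2)    # indices where every term is defined
    G = 1.0 - (psi_y[n] + psi_y[n + 1]) / (4.0 * psi_x[n])
    omega = np.arccos(np.clip(G, -1.0, 1.0))
    amp = np.sqrt(psi_x[n] / (1.0 - G**2))
    return omega, amp
```

For a constant-amplitude, constant-frequency sinusoid A cos(Ωn), these estimates are exact: omega returns Ω and amp returns A.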


5.3.3 APPROACH TO DETECT VOTS FROM SPEECH USING AMPLITUDE

ENVELOPE (AE)

In order to detect the voice onset time, the emphasized consonant and vowel regions of the speech utterance are separated using the FB expansion with an appropriate range of orders. Since the separated regions are narrow-band signals, they can be modeled using the AM-FM signal model. The DESA technique is applied to the emphasized regions of the speech utterance in order to estimate the AE function for the detection of VOT. From the vowel-emphasized part, the beginning of the vowel is detected. Starting from the beginning of the vowel and tracing back through the consonant-emphasized part, the beginning of the consonant region is detected. The VOT is obtained as the difference between the beginning of the vowel and the beginning of the consonant region.
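The trace-back described above can be approximated by simple onset thresholds on the two envelopes; the fixed-fraction threshold below is an illustrative assumption (a Python sketch), not the exact rule used in the report:

```python
import numpy as np

def onset_time(ae, fs, frac=0.1):
    """Time (s) of the first sample where the envelope exceeds frac * max(ae).

    A fixed-fraction threshold stands in for the report's trace-back
    procedure; frac is a hypothetical tuning parameter.
    """
    return int(np.argmax(ae >= frac * np.max(ae))) / fs

def vot(ae_consonant, ae_vowel, fs):
    """VOT = vowel onset time (tv) minus burst/consonant onset time (tc)."""
    return onset_time(ae_vowel, fs) - onset_time(ae_consonant, fs)
```

With the envelopes of Figure 5.2, tc and tv would be read off the consonant- and vowel-emphasized envelopes respectively, and tvot = tv - tc as in the text.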

5.4 RESULTS

5.4.1 RESULTS OF VARIOUS UTTERANCES FOR MALE AND FEMALE SPEAKERS

In this section we discuss the suitability of the proposed method for VOT detection.

Speech data used consists of isolated utterances of the units /ke/, /te/, /pe/, /ki/, /ti/, /pi/, /ku/, /tu/, /pu/ collected from 24 speakers (12 male and 12 female). The speech signals were sampled at 16 kHz with 16-bit resolution and stored as separate wave files.

Here, we shall consider the important subset of basic units namely SCV (stop

consonant vowel). Stop consonants are sounds produced by a complete closure at some point along the vocal tract, a build-up of pressure behind the closure, and a sudden release of that pressure. These units have two distinct regions in the production

characteristics: the region just before the onset of the vowel (corresponds to consonant

region) and steady vowel region. Figure 5.1 shows the regions of significant events in the

production of the SCV unit /kha/ with Vowel Onset Point (VOP) at sample number 3549.

Table 5.1 shows the requirement of the Fourier-Bessel coefficient orders for emphasizing

the vowel and consonant regions of the speech utterances.


Region of speech signal    Required FB coefficient orders
/a/                        P1 = 12  to P2 = 48
/k/                        P1 = 60  to P2 = 100
/t/                        P1 = 80  to P2 = 120
/p/                        P1 = 100 to P2 = 140

Table 5.1 FB coefficient orders for emphasizing the vowel and consonant parts of the

different speech utterances.

For illustration, first we shall consider the speech utterance /ke/, whose waveform is shown in Figure 5.2. The spectrogram and the amplitude envelope estimations for the vowel- and consonant-emphasized regions of the speech utterance /ke/ are shown in Figures 5.2(a), (c) and (d) respectively. Similarly, the plots of the waveform, spectrogram, and amplitude envelope estimation for the vowel and consonant regions of the speech utterances /te/ and /pe/ are shown in Figures 5.3 and 5.4 respectively. It is seen that the amplitude envelopes corresponding to the vowel and consonant regions are emphasized using the proposed method. This enables us to identify the beginning of the consonant (tc) and the beginning of the vowel region (tv). We subtract tc from tv in order to detect the voice onset time (tvot), tvot = tv - tc. For testing we have considered 24 utterances from various speakers. Their respective tv, tc and VOT values are shown in Table 5.2.

5.5 CONCLUSION

In this chapter, the Fourier-Bessel (FB) expansion and the amplitude and frequency modulated (AM-FM) signal model have been proposed to detect the voice onset time (VOT). The FB expansion is used to emphasize the vowel and consonant regions, which yields narrow-band signals from the speech utterance. The DESA method has been applied for estimating the amplitude envelope of the AM-FM signals due to its low complexity and good time resolution.


VOICE ONSET TIME (VOT)

Fig 5.1 Regions of significant events in the production of the SCV unit /kha/ with Vowel

Onset Point (VOP) at sample number 3549.


Fig 5.2 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)

emphasized part, (d) AE estimation of the consonant part (/k/) emphasized part for the

speech utterance /ke/.


Fig 5.3 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)

emphasized part, (d) AE estimation of the consonant part (/t/) emphasized part for the

speech utterance /te/.


Fig 5.4 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)

emphasized part, (d) AE estimation of the consonant part (/p/) emphasized part for the

speech utterance /pe/.


Fig 5.5 Plot of the bar graphs for utterances of /ke/, /te/, /pe/ for various male and female speakers.


Waveform        VOP (sec)  BURST (sec)  VOT (sec)
Ke_F01_s1.wav   0.8443     0.8227       0.0216
Ke_F01_s2.wav   0.8168     0.8039       0.0132
Ke_F01_s3.wav   0.9633     0.9473       0.0160
Ke_F01_s4.wav   0.4611     0.4394       0.0217
Te_F01_s1.wav   0.6178     0.6029       0.0149
Te_F01_s2.wav   0.6979     0.6839       0.0140
Te_F01_s3.wav   0.7401     0.7236       0.0165
Te_F01_s4.wav   0.7212     0.7088       0.0124
Pe_F01_s1.wav   0.5377     0.5239       0.0138
Pe_F01_s2.wav   0.5308     0.5153       0.0155
Pe_F01_s3.wav   0.8250     0.8087       0.0163
Pe_F01_s4.wav   0.8239     0.8154       0.0085
Ke_M05_s1.wav   1.4540     1.4170       0.0370
Ke_M05_s2.wav   0.5560     0.5230       0.0330
Ke_M05_s3.wav   0.7480     0.7136       0.0344
Ke_M05_s4.wav   0.6574     0.6137       0.0437
Te_M05_s1.wav   0.6687     0.6502       0.0185
Te_M05_s2.wav   0.4704     0.4604       0.0100
Te_M05_s3.wav   0.6814     0.6548       0.0266
Te_M05_s4.wav   0.6013     0.5843       0.0170
Pe_M05_s1.wav   0.9851     0.9718       0.0133
Pe_M05_s2.wav   0.7262     0.7164       0.0098
Pe_M05_s3.wav   0.4899     0.4809       0.0090
Pe_M05_s4.wav   0.4341     0.4291       0.0050

Table 5.2 VOT values for female (F01) and male (M05) speakers for utterances /ke/, /te/

and /pe/ respectively.


CHAPTER 6

BESSEL FEATURES FOR DETECTION OF GLOTTAL CLOSURE INSTANTS (GCI)

6.1 INTRODUCTION

The primary mode of excitation of the vocal tract system during speech production is due

to the vibration of the vocal folds. For voiced speech, the most significant excitation takes

place around the glottal closure instant (GCI), called the epoch. The performance of

many speech analysis and synthesis approaches depends on accurate estimation of GCIs.

In this chapter we propose to use a new method based on Fourier-Bessel (FB) expansion

and amplitude and frequency modulated (AM-FM) signal model for the detection of GCI

locations in speech utterances.

The organization of this chapter is as follows: the significance of epochs is discussed in Section 6.2. A review of the existing approaches for the detection of epochs is provided in Section 6.3. The detection of GCIs using the FB expansion and the AM-FM signal model is discussed in Section 6.4. A study on the detection of GCIs for various categories of sound units is provided in Section 6.5. The detection of glottal activity is analyzed in Section 6.6. The final section summarizes the study of GCIs.

6.2 SIGNIFICANCE OF EPOCHS IN SPEECH ANALYSIS

Glottal closure instants are defined as the instants of significant excitation of the

vocal-tract system. Speech analysis consists of determining the frequency response of the

vocal-tract system and the glottal pulses representing the excitation source. Although the source of excitation for voiced speech is the sequence of glottal pulses, the significant excitation of the vocal-tract system within a glottal pulse can be considered to occur at the GCI, called an epoch. Many speech analysis tasks depend on accurate estimation of the location of the epoch within the glottal pulse. For example, knowledge of the epoch location is useful for accurate estimation of the fundamental frequency (f0).


Analyses of speech signals in the closed-glottis regions provide an accurate estimate of the frequency response of the supralaryngeal vocal-tract system [12] [13]. With the

knowledge of epochs, it may be possible to determine the characteristics of the voice

source by a careful analysis of the signal within a glottal pulse. The epochs can be used as

pitch markers for prosody manipulation, which is useful in applications like text-to-speech synthesis, voice conversion and speech rate conversion [14] [15]. Knowledge of the

epoch locations may be used for estimating the time delay between speech signals

collected over a pair of spatially separated microphones [16]. The segmental signal-to-noise ratio (SNR) of the speech signal is high in the regions around the epochs. Hence it

is possible to enhance the speech by exploiting the characteristics of speech signals

around the epochs [17]. It has been shown that the excitation features derived from the

regions around the epoch locations provide complementary speaker-specific information

to the existing spectral features.

The instants of significant excitation play an important role in human perception also.

It is because of the epochs in speech that a human being seems to be able to perceive

speech even at a distance from the source, though the spectral components of the direct

signal suffer an attenuation of over 40 dB. The human neural mechanism has the ability to selectively process the robust regions around the epochs for extracting acoustic cues even under degraded conditions. It is the ability of human beings to focus

on these micro level events that may be responsible for perceiving the speech information

even under severe degradation such as noise, reverberation, presence of other speakers

and channel variations.

6.3 REVIEW OF EXISTING APPROACHES

Several methods have been proposed for estimating the GCI from a speech signal. We

categorize these methods as follows: (a) methods based on short-time energy of speech

signal, (b) methods based on predictability of all-pole linear predictor and (c) methods

based on the properties of the Group-Delay (GD), i.e., the negative-going zero crossings of the GD measure derived from the speech signal. The above classification is not rigid, and one category can overlap with another depending on the interpretation of the method.

6.3.1 EPOCH EXTRACTION FROM SHORT-TIME ENERGY OF SPEECH SIGNAL

GCIs can be detected from the energy peaks in the waveform derived directly

from the speech signal [17, 18] or from the features in its time-frequency representation

[19, 20]. The epoch filter computes the Hilbert envelope (HE) of the high-pass filtered composite signal to locate the epoch instants. It was shown that the instant of excitation of the vocal tract could be identified precisely even for continuous speech. However, this method is suitable for analyzing only clean speech signals.

6.3.2 EPOCH EXTRACTION FROM LINEAR PREDICTION ANALYSIS

Many methods of epoch extraction rely on the discontinuities in a linear model of

speech production. An early approach used the predictability measure to detect epochs by

finding the maximum of the determinant of the autocovariance matrix [21, 22] of the

speech signal. The determinant of the matrix as a function of time increases sharply when

the speech segment covered by the data matrix contains an excitation, and it decreases

when the speech segment is excitation free. This method does not work well for some

vowels, particularly when many pulses occur in the determinant computed around the

GCI.

A method for unambiguous identification of epochs from the LP residual was

proposed in [23] [24]. In this work the amplitude envelope of the analytic signal of the

LP residual, referred to as the Hilbert Envelope (HE) of the Linear prediction (LP)

residual, is used for epoch extraction. Computation of the HE overcomes the effect due to

inaccurate phase compensation during inverse filtering. This method works well for clean

signals, but the performance degrades under noisy conditions.
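The HE-of-LP-residual idea can be sketched as follows (a Python/NumPy/SciPy illustration of the general technique, not the exact algorithm of [23] [24]; the function names and the LP order are our choices):

```python
import numpy as np
from scipy.linalg import toeplitz, solve
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=10):
    """Inverse-filter x with autocorrelation-method LP coefficients."""
    # autocorrelation at lags 0..order
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve(toeplitz(r[:order]), r[1:order + 1])   # Yule-Walker equations
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)

def hilbert_envelope(e):
    """Hilbert envelope: magnitude of the analytic signal of e."""
    return np.abs(hilbert(e))
```

Because inverse filtering removes the resonances, the residual is far more impulse-like than the speech signal itself, and its Hilbert envelope peaks near the instants of significant excitation.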

6.3.3 EPOCH EXTRACTION BASED ON THE PROPERTIES OF GROUP-DELAY


A method for detecting the epochs in a speech signal using the properties of

minimum phase signals and GD function was proposed in [25]. The method is based on

the fact that the average value of the GD function of a signal within an analysis frame

corresponds to the location of the significant excitation. An improved method based on

computation of the GD function directly from the speech signal was proposed in [26].

The Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) for

automatic estimation of GCI in speech is presented in [27, 28]. The candidates for GCI

were obtained from the zero crossing of the phase-slope function derived from the energy

weighted GD, and were refined by employing a dynamic programming algorithm. The

performance of this method was better than the previous methods.

6.3.4 LIMITATIONS OF EXISTING APPROACHES

The epoch is a point property of the signal. However, in most of the methods discussed (except

the GD function based method) the epochs are detected by employing block processing

approaches, which result in ambiguity about the precise location of the epochs. One of

the difficulties in using the prediction error approach is that it often contains effects due

to the resonances of the vocal-tract system. As a result, the excitation peaks become less

prominent in the residual signal, and hence unambiguous detection of the GCIs becomes

difficult. Most of the existing methods rely on LP residual signal derived by inverse

filtering the speech signal. Even using the GD-based methods, it is still difficult to detect the epochs in the case of low-voiced consonants, nasals and semi-vowels, breathy voices and female speakers.

6.4 DETECTION OF GCI USING FOURIER BESSEL (FB) EXPANSION

AND THE AM-FM SIGNAL MODEL

The method is based on the FB expansion and the AM-FM signal model. The inherent

filtering property of the FB expansion is used to weaken the effect of formants in the

speech utterances. The FB coefficients are unique for a given signal in the same way that


Fourier series coefficients are unique for a given signal. However, unlike the sinusoidal

basis functions in the Fourier transform, the Bessel functions are aperiodic, and decay

over time. These features of the Bessel functions make the FB series expansion more suitable than the Fourier transform for the analysis of non-stationary signals such as speech [9-11]. The discrete energy separation algorithm (DESA) has been used to estimate the amplitude envelope (AE) function of the AM-FM model due to its good time resolution. This is advantageous as the resulting features are well localized in the time domain.

6.4.1 FOURIER-BESSEL EXPANSION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over some arbitrary interval (0, T) is expressed as

x(t) = Σ (m = 1 to M) C_m J0(λ_m t / T)

where the coefficients C_m are given by

C_m = [2 / (T^2 J1(λ_m)^2)] ∫ t x(t) J0(λ_m t / T) dt, the integral taken over (0, T),

and λ_m, m = 1, 2, ..., M, are the ascending-order positive roots of J0(λ) = 0, J0(·) being the zero-order Bessel function of the first kind.

It has been shown that there is one-to-one correspondence between the frequency

content of the signal and the order (m) where the coefficient attains peak magnitude [10].

If the AM-FM components of the formants of the speech signal are well separated in the frequency domain, the speech signal components will be associated with distinct clusters of non-overlapping FB coefficients [11]. Each component of the speech signal

can be reconstructed by identifying and separating the corresponding FB coefficients.


6.4.2 AM-FM Model and DESA Method

For both continuous- and discrete-time signals, Kaiser has defined a nonlinear energy-tracking operator [11]. For the discrete-time case, the energy operator for x[n] is defined as

Ψ[x(n)] = x^2(n) - x(n-1) x(n+1)

where x(n) = A(n) cos(φ(n)) is the AM-FM signal with amplitude envelope A(n) and instantaneous phase φ(n).

The energy operator can estimate the modulating signal, or more precisely its scaled version, when either AM or FM is present [11]. When AM and FM are present simultaneously, three algorithms are described in [11] to estimate the instantaneous frequency and A(n) separately. The best of the three in terms of performance is the discrete energy separation algorithm 1 (DESA-1).

6.4.3 APPROACH TO DETECT GCIS FROM SPEECH USING AMPLITUDE

ENVELOPE (AE)

In order to detect GCIs, we emphasize the low-frequency content of the speech signal in the range of 0 to 300 Hz. This is achieved by using FB coefficients of the appropriate range of orders. Since the resultant band-limited signal is a narrow-band signal, it can be modeled using the AM-FM signal model.

The advantage of choosing the 0 to 300 Hz band is that the characteristics of the time-varying vocal-tract system will not affect the locations of the GCIs, because the vocal-tract system has resonances at frequencies higher than 300 Hz. The characteristics of the peaks due to GCIs can therefore be extracted by reconstructing the speech signal using the FB expansion up to order m = 75. The DESA technique is applied to this band-limited AM-FM signal of the speech utterance in order to determine the amplitude envelope (AE) function for the detection of GCIs. The peaks in the envelope of the AM-FM signal provide the locations of the GCIs.

6.5 STUDIES ON DETECTION OF GCIs FOR VARIOUS SPEECH

UTTERANCES

In this section we provide an analysis of the studies on epochs (GCIs) for both male and female speakers. Figure 6.1 gives the speech signal of a male speaker and its corresponding spectrogram. The band-limited AM-FM signal, reconstructed using the FB expansion, is shown in the second waveform of the figure. The third waveform shows the estimated amplitude envelope of the second waveform. The differenced EGG signal is shown in the fourth waveform. It is seen that the peaks in the amplitude envelope and the peaks in the differenced EGG signal agree in most cases. Similar observations are noticed for a female speaker, shown in Figure 6.2.

This enables us to identify the locations of GCIs from the peaks of the amplitude envelope of the band-limited AM-FM signal of the given speech utterance. From Figures 6.1 and 6.2, it can be noticed that the number of GCIs for the female speaker is greater than for the male speaker over the same duration of speech. This is due to the fact that the fundamental frequency (the reciprocal of the interval between successive GCIs) of female speakers is generally higher than that of male speakers.

6.6 GLOTTAL ACTIVITY DETECTION

The strength of excitation helps in detecting the regions of the speech signal with glottal activity. The regions where the strength of excitation is significant are referred to as regions of vocal fold vibration (glottal activity); they are considered voiced regions, where glottal activity is detected. In the absence of vocal fold vibration, the vocal-tract system can be considered to be excited by random noise, as in the case of frication.


The energy of the random noise excitation is distributed both in time and frequency

domains. While the energy of an impulse is distributed uniformly in the frequency

domain, it is highly concentrated in the time domain. As a result, the filtered signal exhibits significantly lower amplitude for random-noise excitation compared to the

impulse-like excitation. Hence the filtered signal can be used to detect the regions of

glottal activity (vocal fold vibration).
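This amplitude contrast can be turned into a simple detector: frame-wise short-time energy of the band-limited signal, thresholded at a fraction of its peak (a Python sketch; the window length and threshold value are illustrative assumptions, not values from the report):

```python
import numpy as np

def glottal_activity(y, fs, win_ms=20.0, thresh=0.1):
    """Boolean mask of voiced frames from the 0-300 Hz band-limited signal y.

    A frame is marked as glottal activity when its energy exceeds
    thresh * (maximum frame energy).
    """
    win = int(fs * win_ms / 1000.0)
    frames = len(y) // win
    e = np.array([np.sum(y[i * win:(i + 1) * win] ** 2) for i in range(frames)])
    return e >= thresh * e.max()
```

The relative threshold makes the mask insensitive to the overall recording level, which matters since only the contrast between impulse-like and noise-like excitation is informative here.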

6.7 CONCLUSIONS

The primary mode of excitation of the vocal tract system during speech

production is due to the vibration of the vocal folds. For voiced speech, the most

significant excitation takes place around the glottal closure instant (GCI). The instants of

significant excitation play an important role in human perception. The studies on GCIs

help in identifying the various regions in continuous speech. The extracted GCIs help in

identifying the fundamental frequency (pitch) of the speaker. The pitch is a feature

unique to a speaker.


GLOTTAL CLOSURE INSTANTS OF A MALE SPEAKER

Fig. 6.1 Epoch (or GCI) extraction of a male speaker using Fourier-Bessel (FB)

expansion and AM-FM model


GLOTTAL CLOSURE INSTANTS OF A FEMALE SPEAKER

Fig 6.2 Epoch (or GCI) extraction of a female speaker using Fourier-Bessel (FB)

expansion and AM-FM model


CHAPTER 7

SUMMARY AND CONCLUSIONS

7.1 SUMMARY OF THE WORK

The Glottal Closure Instant (GCI), also known as the epoch, is one of the important events that can be attributed to the speech production mechanism. The primary mode of

excitation of vocal-tract system during speech production is due to the vibration of the

vocal folds. For voiced speech, the most significant excitation takes place around the

GCIs. The rate of glottal closure is referred to as the strength of the epoch. The GCIs and the

strength of the epochs form important features of the excitation source. In this work, we

propose to use a method based on Fourier-Bessel expansion and AM-FM model to detect

the regions of glottal activity and to estimate the strength of excitation in each glottal

cycle.

The influence of the vocal-tract system is relatively weak near zero frequency, as the vocal-tract system has resonances at much higher frequencies. Hence, we use the Fourier-Bessel expansion followed by the DESA method to extract the epoch locations and their

strengths. The method involves subjecting the speech signal to the Bessel transformation, using the required range of coefficients to band-limit the signal, and then applying the discrete energy separation algorithm (DESA) to estimate the Amplitude Envelope (AE). The peaks of the AE, validated against the differenced EGG signal, give the epoch locations and enable us to distinguish between male and female speakers.

The excitation source information is used to analyze the manner of articulation

(MOA) of stop consonants. Two of the excitation source features utilized in this study are

the filtered speech signal and the normalized error. The filtered speech signal is used to

characterize the excitation information during vocal-fold vibration. The normalized error

derived from linear prediction (LP) analysis is used to highlight the regions of noisy

excitation caused by a rush of air through the open glottis during closure release and

aspiration. It is observed from the studies that these two features jointly highlight the

important events. Like onset of voicing and instant of closure release, in the stop

4

Page 49: Project(2)

consonants. Using these two excitation source features, the voice onset time and the burst

duration of stop consonants were measured.

Using the epoch locations as anchor points within each glottal cycle, a method to

estimate the instantaneous fundamental frequency of voiced speech segments is

presented. The fundamental frequency is estimated as the reciprocal of the interval

between two successive epoch locations derived from the speech signal. Since the

method is based on the point property of epoch and does not involve any block

processing, it provides cycle-to-cycle variations in pitch during voicing. This results in

instantaneous fundamental frequency. Errors due to spurious zero crossings in weakly voiced regions are corrected using the filtered Hilbert Envelope (HE) of the

speech signal. In this work, analysis of the pitch frequency for various subjects in

different environmental conditions is carried out.

The detection of VOT in speech is a challenging problem because it combines temporal and frequency structure over a very short duration, and no fully successful automatic VOT detection scheme has yet been published in the literature. In this study, the amplitude modulation component (AMC) of the Teager Energy Operator (TEO), a sub-band-frequency-based non-linear energy tracking operator, was employed to detect the VOR and to estimate the VOT. The proposed algorithm was applied to the problem of accent classification using American English, Chinese and Indian accented speakers. Using 546 tokens, consisting of three words from 24 speakers, the average mismatch between the automatic and hand-labeled VOT was 0.735 ms (among the 409 tokens that were detected with less than 10 percent error). This represents a 1.15% mismatch. It was also shown that average VOTs differ among the three language groups, making VOT a good feature for accent classification.


CHAPTER 8

REFERENCES

2. A. Papoulis, Signal Analysis, McGraw-Hill, New York, 1977.

3. R. Bracewell, The Fourier Transform and its Applications, McGraw-Hill, New York, 1964.

4. Y. L. Luke, Integrals of Bessel Functions, McGraw-Hill, New York, 1962.

5. A. S. Abramson and L. Lisker, "Voice onset time in stop consonants: acoustic analysis and synthesis", in Proc. 5th International Congress of Phonetic Sciences, 1965.

6. R. B. Monsen, "Normal and reduced phonological space: the study of vowels by a deaf adolescent", Journal of Phonetics, vol. 4, 1976.

7. J. Jiang, M. Chen, and A. Alwan, "On the perception of voicing in syllable-initial plosives in noise", Journal of the Acoustical Society of America, vol. 119, 2006.

8. L. Lisker and A. S. Abramson, "A cross-language study of voicing in initial stops: acoustical measurements", Word, vol. 10, 1967.

9. L. Lisker and A. S. Abramson, "Some effects of context on voice onset time in English stops", Language and Speech, vol. 10, 1967.

10. S. Das and J. H. L. Hansen, "Detection of voice onset time (VOT) for unvoiced stops (/p/, /t/, /k/) using the Teager energy operator (TEO) for automatic detection of accented English", in Proc. 6th Nordic Signal Processing Symposium, 2004.

11. P. Ladefoged, A Course in Phonetics, 3rd ed., Harcourt Brace College Publishers, Fort Worth, 1993.

12. R. B. Pachori and P. Sircar, "Analysis of multicomponent AM-FM signals using FB-DESA method", Digital Signal Processing, vol. 20, 2010.

13. D. Veeneman and S. BeMent, "Automatic glottal inverse filtering from speech and electroglottographic signals", IEEE Trans. Signal Processing, vol. 33, 1985.

14. B. Yegnanarayana and R. N. J. Veldhuis, "Extraction of vocal-tract system characteristics from speech signals", IEEE Trans. Speech and Audio Processing, vol. 6, 1998.

15. C. Hamon, E. Moulines, and F. Charpentier, "A diphone synthesis system based on time-domain prosodic modifications of speech", in Proc. IEEE, May 1989.

16. K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of significant excitation", IEEE, May 2006.

17. B. Yegnanarayana, S. R. M. Prasanna, R. Duraiswamy, and D. Zotkin, "Processing of reverberant speech for time-delay estimation", IEEE Trans. Speech and Audio Processing, 2005.

18. B. Yegnanarayana and P. S. Murty, "Enhancement of reverberant speech using LP residual signal", IEEE Trans. Speech and Audio Processing, vol. 8, 2000.

19. Y. K. C. Ma and L. F. Willems, "A Frobenius norm approach to glottal closure detection from the speech signal", IEEE Trans. Speech and Audio Processing, vol. 2, 1994.

20. C. R. Jankowski Jr., T. F. Quatieri, and D. A. Reynolds, "Measuring fine structure in speech: application to speaker identification", in Proc. IEEE Int. Conf., 1995.

25. K. Sri Rama Murthy and B. Yegnanarayana, "Epoch extraction from speech signal", IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602-1613, Nov. 2008.


SOURCE CODE


Gci_female.m

clear all;

close all;

inputfile = '30401.wav';
eggfile = '30401.egg';
samplesrange = 89601:92800;
fborderleft = 1;
fborderright = 75;

MM = length(samplesrange);

% computation of the positive roots of the Bessel function J0(x)
% by Newton-Raphson iteration; successive roots are roughly pi apart
if exist('alfa') == 0
    x = 2;
    alfa = zeros(1, MM);
    for i = 1:MM
        ex = 1;
        while abs(ex) > .00001
            ex = -besselj(0, x)/besselj(1, x);
            x = x - ex;
        end
        alfa(i) = x;
        fprintf('Root # %g = %8.5f ex = %9.6f \n', i, x, ex)
        x = x + pi;
    end
end

[s, fs] = wavread(inputfile);

S = s';

% S = diff(S);

5

Page 54: Project(2)

S=-S(samplesrange);

ax(1)=subplot(4,1,1);

% plot(samplesrange, S);plot(samplesrange/fs, S);

axis tightgrid on

% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)

s=S;N=length(s);nb=1:N;

a=N;

for m1=1:MM a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));end

%reconstruction of the signal

for mm=1:N g1_r=[(alfa(fborderleft:fborderright))/a ]; F1_r(mm)=sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));end

y1=F1_r;

ax(2)=subplot(4,1,2);% plot(samplesrange, y1)plot((samplesrange)/fs, y1)axis tightgrid on% y1=y1-mean(y1);

ax(3)=subplot(4,1,3);

[egg, fs]=wavread(eggfile);

5

Page 55: Project(2)

vg=-diff(egg);

% plot(samplesrange,vg(samplesrange))plot((samplesrange)/fs,vg(samplesrange))axis tightgrid on

for l=2:N-1 xx=y1; si(l)=xx(l)^2-xx(l-1)*xx(l+1);end

for m=2:N-1 yy(m)=y1(m)-y1(m-1); endfor m=2:N-2siy(m)=yy(m)^2-yy(m-1).*yy(m+1);end

for mm=2:N-2 if siy(mm)<0 yy1(mm)=siy(mm-1); yy1(mm)=yy1(mm-1); else yy1(mm)=siy(mm); endendsiy=yy1;for m1=2:N-3 omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));endfor mm=2:N-3 if imag(omega(mm))==0 yy1(mm)=omega(mm); else yy1(mm)=omega(mm-1); yy1(mm)=yy1(mm-1); endendomega=yy1;

for m1=2:N-3 amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));endfor mm=2:N-3 if imag(amp(mm))==0

5

Page 56: Project(2)

yy1(mm)=amp(mm); else yy1(mm)=amp(mm-1); yy1(mm)=yy1(mm-1); endend[ca,cd]=dwt(yy1,'db2');

yy1=idwt(ca,[],'db2');

amp1=yy1;

amp1(end+1:end+2)=amp1(end);

% X2=overlapadd(omega1,W,INC); ax(4)=subplot(4,1,4);

% plot(samplesrange,amp1)plot((samplesrange)/fs,amp1) axis tightgrid on% ax(4)=subplot(4,1,4);% % plot((1:length(X2))/32000,X2)

linkaxes(ax,'x');
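The root-finding loop at the top of the script is Newton's method applied to J0, using the identity J0'(x) = -J1(x) and the fact that consecutive positive roots of J0 are roughly pi apart. The following stdlib-only Python re-expression of the same iteration is illustrative (the power-series evaluation of J0 and J1 is our own addition for self-containedness; the MATLAB code uses besselj):

```python
import math

def j0(x):
    # Power series J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2 (fine for small x)
    s, term = 0.0, 1.0
    for k in range(60):
        s += term
        term *= -(x / 2) ** 2 / ((k + 1) ** 2)
    return s

def j1(x):
    # Power series J1(x) = sum_k (-1)^k (x/2)^(2k+1) / (k! (k+1)!)
    s, term = 0.0, x / 2
    for k in range(60):
        s += term
        term *= -(x / 2) ** 2 / ((k + 1) * (k + 2))
    return s

def j0_roots(mm):
    """First mm positive roots of J0 via Newton's method."""
    roots, x = [], 2.0           # same starting guess as the MATLAB loop
    for _ in range(mm):
        ex = 1.0
        while abs(ex) > 1e-10:
            ex = -j0(x) / j1(x)  # Newton step, since J0'(x) = -J1(x)
            x -= ex
        roots.append(x)
        x += math.pi             # consecutive roots are roughly pi apart
    return roots

print(j0_roots(3))  # ~ [2.404826, 5.520078, 8.653728]
```

Each new root reuses the previous one plus pi as its starting point, which is why the MATLAB loop converges in a handful of iterations per root.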


Gci_male.m

clc;
clear all;
close all;

inputfile = '10501.wav';
eggfile = '10501.egg';
samplesrange = 76001:79200;
fborderleft = 1;
fborderright = 75;

MM = length(samplesrange);

% Computation of the roots of the Bessel function J0(x)
if exist('alfa') == 0
    x = 2;
    alfa = zeros(1,MM);
    for i = 1:MM
        ex = 1;
        while abs(ex) > .00001
            ex = -besselj(0,x)/besselj(1,x);
            x = x - ex;
        end
        alfa(i) = x;
        % fprintf('Root # %g = %8.5f ex = %9.6f \n', i, x, ex)
        x = x + pi;
    end
end

[s,fs] = wavread(inputfile);
S = s';
% S = diff(S);
S = -S(samplesrange);

ax(1) = subplot(4,1,1);
% plot(samplesrange, S);
plot(samplesrange/fs, S);
axis tight
grid on
% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)

s = S;
N = length(s);
nb = 1:N;
a = N;

% Fourier-Bessel coefficients of the segment
for m1 = 1:MM
    a3(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end

% Reconstruction of the signal
for mm = 1:N
    g1_r = alfa(fborderleft:fborderright)/a;
    F1_r(mm) = sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));
end
y1 = F1_r;

ax(2) = subplot(4,1,2);
% plot(samplesrange, y1)
plot(samplesrange/fs, y1)
axis tight
grid on
% y1 = y1 - mean(y1);

ax(3) = subplot(4,1,3);
[egg,fs] = wavread(eggfile);
vg = -diff(egg);
% plot(samplesrange, vg(samplesrange))
plot(samplesrange/fs, vg(samplesrange))
axis tight
grid on

% Teager energy of the reconstructed signal
for l = 2:N-1
    xx = y1;
    si(l) = xx(l)^2 - xx(l-1)*xx(l+1);
end

% Teager energy of its first difference
for m = 2:N-1
    yy(m) = y1(m) - y1(m-1);
end
for m = 2:N-2
    siy(m) = yy(m)^2 - yy(m-1).*yy(m+1);
end

% Replace negative energies by the previous valid value
for mm = 2:N-2
    if siy(mm) < 0
        yy1(mm) = yy1(mm-1);
    else
        yy1(mm) = siy(mm);
    end
end
siy = yy1;

% DESA-1 frequency estimate
for m1 = 2:N-3
    omega(m1) = acos(1 - (siy(m1)+siy(m1+1))/(4*si(m1)));
end
for mm = 2:N-3
    if imag(omega(mm)) == 0
        yy1(mm) = omega(mm);
    else
        yy1(mm) = yy1(mm-1);   % carry forward the previous real value
    end
end
omega = yy1;

% DESA-1 amplitude envelope estimate
for m1 = 2:N-3
    amp(m1) = sqrt(si(m1)/(1-(1-(siy(m1)+siy(m1+1))/(4*si(m1)))^2));
end
for mm = 2:N-3
    if imag(amp(mm)) == 0
        yy1(mm) = amp(mm);
    else
        yy1(mm) = yy1(mm-1);
    end
end

% Smooth the envelope by one level of db2 wavelet decomposition
[ca,cd] = dwt(yy1,'db2');
yy1 = idwt(ca,[],'db2');
amp1 = yy1;
amp1(end+1:end+2) = amp1(end);

% X2 = overlapadd(omega1,W,INC);
ax(4) = subplot(4,1,4);
% plot(samplesrange, amp1)
plot(samplesrange/fs, amp1)
axis tight
grid on
% plot((1:length(X2))/32000, X2)

linkaxes(ax,'x');
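The si/siy/omega/amp loops in the two GCI scripts implement the discrete energy separation algorithm (DESA-1): si is the Teager energy Psi[x(n)] = x(n)^2 - x(n-1)x(n+1), siy is the Teager energy of the first difference y(n) = x(n) - x(n-1), the frequency estimate is omega = acos(1 - (Psi[y(n)] + Psi[y(n+1)])/(4 Psi[x(n)])), and the amplitude estimate is A = sqrt(Psi[x]/(1 - cos^2(omega))). For a constant-amplitude cosine these estimates are exact, which the following Python check (our own illustration, not part of the report code) demonstrates:

```python
import math

def desa1(x, n):
    """DESA-1 frequency (rad/sample) and amplitude estimate at sample n."""
    psi_x = x[n]**2 - x[n-1]*x[n+1]                  # Teager energy of x
    y = [x[i] - x[i-1] for i in range(1, len(x))]    # first difference; y[i] = x(i+1)-x(i)
    def psi(s, i):
        return s[i]**2 - s[i-1]*s[i+1]
    # y(n) = x(n)-x(n-1) is y[n-1] in this 0-based list
    G = 1 - (psi(y, n-1) + psi(y, n)) / (4 * psi_x)  # ~ cos(omega)
    omega = math.acos(G)
    amp = math.sqrt(psi_x / (1 - G**2))
    return omega, amp

A, W = 0.7, 0.3                                      # amplitude, rad/sample
x = [A * math.cos(W * n) for n in range(50)]
omega, amp = desa1(x, 25)
print(omega, amp)  # ~ 0.3, 0.7
```

The acos can return complex values in MATLAB when the Teager energies go negative in noisy regions, which is exactly what the imag(...)==0 guards in the scripts handle by carrying forward the last real estimate.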

Vot.m

clc;
clear all;
close all;

inputfile = 'ku_F01_S4.wav';
MM = 320;
% MM = 1100;

% Computation of the roots of the Bessel function J0(x)
if exist('alfa') == 0
    x = 2;
    alfa = zeros(1,MM);
    for i = 1:MM
        ex = 1;
        while abs(ex) > .00001
            ex = -besselj(0,x)/besselj(1,x);
            x = x - ex;
        end
        alfa(i) = x;
        % fprintf('Root # %g = %8.5f ex = %9.6f \n', i, x, ex)
        x = x + pi;
    end
end

[s,fs] = wavread(inputfile);
% yy = resample(s,1,2);
% s = yy(9500:10600);
S = s';

% First pass: Bessel coefficients 12-48 (vowel region, Xvowel)
INC = 160;
% INC = 550;
NW = INC*2;
W = sqrt(hamming(NW+1)); W(end) = [];
F = enframe(S,W,INC);
[r,c] = size(F);

for i = 1:r
    s1 = F(i,:);
    N = length(s1);
    nb = 1:N;
    a = N;

    for m1 = 1:MM
        a3(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s1.*besselj(0,alfa(m1)/a*nb));
    end

    % Reconstruction of the signal
    for mm = 1:N
        g1_r = alfa(12:48)/a;
        F1_r(mm) = sum(a3(12:48).*besselj(0,g1_r*mm));
        % g1_r = alfa(20:85)/a;
        % F1_r(mm) = sum(a3(20:85).*besselj(0,g1_r*mm));
        % g1_r = alfa/a;
        % F1_r(mm) = sum(a3.*besselj(0,g1_r*mm));
    end
    y1 = F1_r;

    for l = 2:N-1
        xx = y1;
        si(l) = xx(l)^2 - xx(l-1)*xx(l+1);
    end
    for m = 2:N-1
        yy(m) = y1(m) - y1(m-1);
    end
    for m = 2:N-2
        siy(m) = yy(m)^2 - yy(m-1).*yy(m+1);
    end
    for mm = 2:N-2
        if siy(mm) < 0
            yy1(mm) = yy1(mm-1);
        else
            yy1(mm) = siy(mm);
        end
    end
    siy = yy1;
    for m1 = 2:N-3
        omega(m1) = acos(1 - (siy(m1)+siy(m1+1))/(4*si(m1)));
    end
    for mm = 2:N-3
        if imag(omega(mm)) == 0
            yy1(mm) = omega(mm);
        else
            yy1(mm) = yy1(mm-1);
        end
    end
    omega = yy1;
    for m1 = 2:N-3
        amp(m1) = sqrt(si(m1)/(1-(1-(siy(m1)+siy(m1+1))/(4*si(m1)))^2));
    end
    for mm = 2:N-3
        if imag(amp(mm)) == 0
            yy1(mm) = amp(mm);
        else
            yy1(mm) = yy1(mm-1);
        end
    end
    [ca,cd] = dwt(yy1,'db2');
    yy1 = idwt(ca,[],'db2');
    amp1(i,:) = yy1;
end

amp1(:,c-1) = amp1(:,c-2);
amp1(:,c) = amp1(:,c-1);
Xvowel = overlapadd(amp1,W,INC);

% Second pass: Bessel coefficients 60-100 (burst region, Xc)
% [s,fs] = wavread(inputfile);
% yy = resample(s,1,2); fs = fs/2; s = yy(9500:10600);
% S = s';
INC = 160;
NW = INC*2;
W = sqrt(hamming(NW+1)); W(end) = [];
F = enframe(S,W,INC);
[r,c] = size(F);

for i = 1:r
    s2 = F(i,:);
    N = length(s2);
    nb = 1:N;
    a = N;

    for m1 = 1:MM
        a3(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s2.*besselj(0,alfa(m1)/a*nb));
    end

    % Reconstruction of the signal
    for mm = 1:N
        g1_r = alfa(60:100)/a;
        F1_r(mm) = sum(a3(60:100).*besselj(0,g1_r*mm));
    end
    y1 = F1_r;

    for l = 2:N-1
        xx = y1;
        si(l) = xx(l)^2 - xx(l-1)*xx(l+1);
    end
    for m = 2:N-1
        yy(m) = y1(m) - y1(m-1);
    end
    for m = 2:N-2
        siy(m) = yy(m)^2 - yy(m-1).*yy(m+1);
    end
    for mm = 2:N-2
        if siy(mm) < 0
            yy1(mm) = yy1(mm-1);
        else
            yy1(mm) = siy(mm);
        end
    end
    siy = yy1;
    for m1 = 2:N-3
        omega(m1) = acos(1 - (siy(m1)+siy(m1+1))/(4*si(m1)));
    end
    for mm = 2:N-3
        if imag(omega(mm)) == 0
            yy1(mm) = omega(mm);
        else
            yy1(mm) = yy1(mm-1);
        end
    end
    omega = yy1;
    for m1 = 2:N-3
        amp(m1) = sqrt(si(m1)/(1-(1-(siy(m1)+siy(m1+1))/(4*si(m1)))^2));
    end
    for mm = 2:N-3
        if imag(amp(mm)) == 0
            yy1(mm) = amp(mm);
        else
            yy1(mm) = yy1(mm-1);
        end
    end
    [ca,cd] = dwt(yy1,'db2');
    yy1 = idwt(ca,[],'db2');
    amp2(i,:) = yy1;
end

amp2(:,c-1) = amp2(:,c-2);
amp2(:,c) = amp2(:,c-1);
Xc = overlapadd(amp2,W,INC);

fig = figure;
ax(1) = subplot(4,1,1);
% specgram(s,320,16000,320,160); colormap(1-gray);
svlSpgram(s,2^8,fs,10*fs/1000,9*fs/1000,30,4000);
% specgram(s,1100,8000,1100,550)

ax(2) = subplot(4,1,2);
plot((1:length(s))/16000, s); grid; axis tight;
% text(1.81,0.04,'(b)','fontweight','bold');
% plot((1:length(s))/8000, s);
% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)

ax(3) = subplot(4,1,3);
plot((1:length(Xvowel))/16000, Xvowel); grid; axis tight;
% text(1.81,0.15,'(c)','fontweight','bold');

ax(4) = subplot(4,1,4);
plot((1:length(Xc))/16000, Xc); grid; axis tight; xlabel('Time (sec)');
% text(1.81,0.004,'(d)','fontweight','bold');

% ax(5) = subplot(5,1,5);
% plot((1:length(Xc)-1)/16000, diff(Xc)); grid; axis tight; xlabel('Time (sec)');

linkaxes(ax,'x');

alltext = findall(fig,'type','text');
allaxes = findall(fig,'type','axes');
allfont = [alltext(:); allaxes(:)];
set(allfont,'fontsize',16);

Overlapadd.m

function [x,zo]=overlapadd(f,win,inc)
%OVERLAPADD join overlapping frames together X=(F,WIN,INC)
%
% Inputs:  F(NR,NW) contains the frames to be added together, one
%               frame per row.
%          WIN(NW) contains a window function to multiply each frame.
%               WIN may be omitted to use a default rectangular window.
%               If processing the input in chunks, WIN should be replaced by
%               ZI on the second and subsequent calls where ZI is the saved
%               output state from the previous call.
%          INC gives the time increment (in samples) between
%               successive frames [default = NW].
%
% Outputs: X(N,1) is the output signal. The number of output samples is
%               N=NW+(NR-1)*INC.
%          ZO contains the saved state to allow a long signal to be
%               processed in chunks. In this case X will contain only
%               N=NR*INC output samples.
%
% Example of frame-based processing:
%          INC=20                             % set frame increment
%          NW=INC*2                           % oversample by a factor of 2 (4 is also often used)
%          S=cos((0:NW*7)*6*pi/NW);           % example input signal
%          W=sqrt(hamming(NW+1)); W(end)=[];  % sqrt hamming window of period NW
%          F=enframe(S,W,INC);                % split into frames
%          ... process frames ...
%          X=overlapadd(F,W,INC);             % reconstitute the time waveform

%      Copyright (C) Mike Brookes 2009
%      Version: $Id: overlapadd.m,v 1.2 2009/06/08 16:21:49 dmb Exp $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   This program is free software; you can redistribute it and/or modify
%   it under the terms of the GNU General Public License as published by
%   the Free Software Foundation; either version 2 of the License, or
%   (at your option) any later version.
%
%   This program is distributed in the hope that it will be useful,
%   but WITHOUT ANY WARRANTY; without even the implied warranty of
%   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
%   GNU General Public License for more details.
%
%   You can obtain a copy of the GNU General Public License from
%   http://www.gnu.org/copyleft/gpl.html or by writing to
%   Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[nr,nf]=size(f);            % number of frames and frame length
if nargin<2
    win=nf;                 % default increment
end
if isstruct(win)
    w=win.w;
    if ~numel(w) && length(w)~=nf
        error('window length does not match frames size');
    end
    inc=win.inc;
    xx=win.xx;
else
    if nargin<3
        inc=nf;
    end
    if numel(win)==1 && win==fix(win) && nargin<3   % win has been omitted
        inc=win;
        w=[];
    else
        w=win(:).';
        if length(w)~=nf
            error('window length does not match frames size');
        end
        if all(w==1)
            w=[];
        end
    end
    xx=[];                  % partial output from previous call is null
end
nb=ceil(nf/inc);            % number of overlap buffers
no=nf+(nr-1)*inc;           % buffer length
z=zeros(no,nb);             % space for overlapped output speech
if numel(w)
    z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f.*repmat(w,nr,1);
else
    z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f;
end
x=sum(z,2);
if ~isempty(xx)
    x(1:length(xx))=x(1:length(xx))+xx;     % add on leftovers from previous call
end
if nargout>1                % check if we want to preserve the state
    mo=inc*nr;              % completed output samples
    if no<mo
        x(mo,1)=0;
        zo.xx=[];
    else
        zo.xx=x(mo+1:end);
        zo.w=w;
        zo.inc=inc;
        x=x(1:mo);
    end
elseif ~nargout
    if isempty(xx)
        k1=nf-inc;          % dubious samples at start
    else
        k1=0;
    end
    k2=nf-inc;              % dubious samples at end
    plot(1+(0:nr-1)*inc,x(1+(0:nr-1)*inc),'>r',nf+(0:nr-1)*inc,x(nf+(0:nr-1)*inc),'<r', ...
        1:k1+1,x(1:k1+1),':b',k1+1:no-k2,x(k1+1:end-k2),'-b',no-k2:no,x(no-k2:no),':b');
    xlabel('Sample Number');
    title(sprintf('%d frames of %d samples with %.0f%% overlap = %d samples',nr,nf,100*(1-inc/nf),no));
end


Enframe.m

function f=enframe(x,win,inc)
%ENFRAME split signal up into (overlapping) frames: one per row. F=(X,WIN,INC)
%
% F = ENFRAME(X,LEN) splits the vector X(:) up into frames. Each frame is
% of length LEN and occupies one row of the output matrix. The last few
% frames of X will be ignored if its length is not divisible by LEN. It
% is an error if X is shorter than LEN.
%
% F = ENFRAME(X,LEN,INC) has frames beginning at increments of INC.
% The centre of frame I is X((I-1)*INC+(LEN+1)/2) for I=1,2,...
% The number of frames is fix((length(X)-LEN+INC)/INC).
%
% F = ENFRAME(X,WINDOW) or ENFRAME(X,WINDOW,INC) multiplies each frame by
% WINDOW(:).
%
% Example of frame-based processing:
%          INC=20                             % set frame increment
%          NW=INC*2                           % oversample by a factor of 2 (4 is also often used)
%          S=cos((0:NW*7)*6*pi/NW);           % example input signal
%          W=sqrt(hamming(NW+1)); W(end)=[];  % sqrt hamming window of period NW
%          F=enframe(S,W,INC);                % split into frames
%          ... process frames ...
%          X=overlapadd(F,W,INC);             % reconstitute the time waveform

%      Copyright (C) Mike Brookes 1997
%      Version: $Id: enframe.m,v 1.6 2009/06/08 16:21:42 dmb Exp $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%   (GNU General Public License notice as in overlapadd.m above.)

nx=length(x(:));
nwin=length(win);
if (nwin == 1)
    len = win;
else
    len = nwin;
end
if (nargin < 3)
    inc = len;
end
nf = fix((nx-len+inc)/inc);
f=zeros(nf,len);
indf= inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len))+inds(ones(nf,1),:));
if (nwin > 1)
    w = win(:)';
    f = f .* w(ones(nf,1),:);
end
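The scripts above pass the same sqrt-Hamming window to both enframe (analysis) and overlapadd (synthesis), so each interior output sample is weighted by hamming(n) + hamming(n + NW/2), which for a periodic Hamming window at 50% overlap is the constant 1.08. A small Python check of this constant-overlap-add property (our own illustration; NW = 320 and INC = 160 follow the values used in Vot.m):

```python
import math

NW = 320          # frame length
INC = NW // 2     # 50% overlap, as in the scripts

# Periodic Hamming window: hamming(NW+1) with the last sample dropped
w = [0.54 - 0.46 * math.cos(2 * math.pi * n / NW) for n in range(NW)]
sw = [math.sqrt(v) for v in w]   # the sqrt window used for analysis and synthesis

# Net per-sample weight after analysis + synthesis windowing at 50% overlap:
# sw(n)^2 + sw(n+INC)^2 = w(n) + w(n+INC), and the cosine terms cancel.
cola = [sw[n]**2 + sw[n + INC]**2 for n in range(INC)]
print(min(cola), max(cola))   # both ~ 1.08
```

This is why the overlap-added envelopes Xvowel and Xc come out on a consistent scale without any explicit renormalization (apart from the half-window transients at the two ends).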


SvlSpgram.m

function [X, f_r, t_r] = svlSpgram(x, n, Fs, window, overlap, clipdB, maxfreq)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Usage:
%   [X [, f [, t]]] = svlSpgram(x [, n [, Fs [, window [, overlap [, clipdB [, maxfreq]]]]]])
%
% Generate a spectrogram for the signal. This chops the signal into
% overlapping slices, windows each slice and applies a Fourier transform
% to determine the frequency components at that slice.
%
% INPUT:
%   x:       signal or vector of samples
%   n:       size of fourier transform window, or [] for default=256
%   Fs:      sample rate, or [] for default=2 Hz
%   window:  shape of the fourier transform window, or [] for
%            default=hanning(n). Note: the window length can be specified
%            instead, in which case window=hanning(length)
%   overlap: overlap with previous window, or [] for default=length(window)/2
%   clipdB:  clip or cut off any spectral component more than 'clipdB'
%            below the peak spectral strength (default = 35 dB)
%   maxfreq: maximum frequency to be plotted in the spectrogram (default = Fs/2)
%
% OUTPUT:
%   X: STFT of the signal x
%   f: the frequency values corresponding to the STFT values
%   t: time instants at which the STFT values are computed
%
% Example:
%   x = chirp([0:0.001:2],0,2,500);  % freq. sweep from 0-500 over 2 sec.
%   Fs = 1000;                       % sampled every 0.001 sec so rate is 1 kHz
%   step = ceil(20*Fs/1000);         % one spectral slice every 20 ms
%   window = ceil(100*Fs/1000);      % 100 ms data window
%   svlSpgram(x, 2^nextpow2(window), Fs, window, window-step);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Original version by Paul Kienzle; modified by Sean Fulop, March 2002.
% Customized by Anand and then by Dhananjaya.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if nargin < 1 | nargin > 7
    error('usage: [Y [, f [, t]]] = svlSpgram(x [, n [, Fs [, window [, overlap [, clipdB]]]]])')
end

% assign defaults
if nargin < 2 | isempty(n), n = min(256, length(x)); end
if nargin < 3 | isempty(Fs), Fs = 2; end
if nargin < 4 | isempty(window), window = hanning(n); end
if nargin < 5 | isempty(overlap), overlap = length(window)/2; end
if nargin < 6 | isempty(clipdB), clipdB = 35; end   % clip anything below 35 dB
if nargin < 7 | isempty(maxfreq), maxfreq = Fs/2; end

% if only the window length is given, generate a hanning window
if length(window) == 1, window = hanning(window); end

% should be extended to accept a vector of frequencies at which to
% evaluate the fourier transform (via filterbank or chirp z-transform)
if length(n) > 1
    error('spgram doesn''t handle frequency vectors yet')
end

% compute window offsets
win_size = length(window);
if (win_size > n)
    n = win_size;
    warning('spgram fft size adjusted---must be at least as long as frame')
end
step = win_size - overlap;

% build matrix of windowed data slices
S = buffer(x, win_size, overlap, 'nodelay');
W = window(:, ones(1, size(S,2)));
S = S .* W;
offset = [0:size(S,2)-1]*step;

% compute fourier transform
STFT = fft(S, n);

% extract the positive frequency components
if rem(n,2) == 1
    ret_n = (n+1)/2;
else
    ret_n = n/2;
end
STFT = STFT(1:ret_n, :);

f = [0:ret_n-1]*Fs/n;
t = offset/Fs;
if nargout > 1, f_r = f; end
if nargout > 2, t_r = t; end

% maxfreq = Fs/2;
STFTmag = abs(STFT(2:n*maxfreq/Fs, :));      % magnitude of STFT
STFTmag = STFTmag/max(max(STFTmag));         % normalize so max magnitude will be 0 dB
STFTmag = max(STFTmag, 10^(-clipdB/10));     % clip everything below -clipdB

if nargout > 0, X = STFTmag; end

% Display as an indexed grayscale image showing the log magnitude of the
% STFT, i.e. a spectrogram; the colormap is flipped to invert the default
% setting, in which white is most intense and black least---in speech
% science we want the opposite of that.
if nargout == 0
    if Fs < 2000
        imagesc(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
        % pcolor(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
        ylabel('Hz');
    else
        % imagesc(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
        pcolor(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
        ylabel('kHz');
    end
    axis xy;
    colormap(flipud(gray));
    shading interp;
    % xlabel('Time (seconds)');
end


Fbc.m

function [c,F1_r] = fbc(s)

N = length(s);

% Calculation of the roots of J0(x)
if exist('alfa') == 0
    x = 2;
    alfa = zeros(1,N);
    for i = 1:N
        ex = 1;
        while abs(ex) > .00001
            ex = -besselj(0,x)/besselj(1,x);
            x = x - ex;
        end
        alfa(i) = x;
        % fprintf('Root # %g = %8.5f ex = %9.6f \n', i, x, ex);
        x = x + pi;
    end
end

a = N;
nb = 1:N;

% Fourier-Bessel coefficients
for m1 = 1:N
    c(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s'.*besselj(0,alfa(m1)/a*nb));
end
% cindex=[6:10];
% cindex=18:24;
% cindex=[6:10 18:24];
% cindex=[2:5 5:8 26:30];
% cindex=[2:6 6:10 10:15 52:58];
% cindex=[6:30 130:145 160:175 225:235];

% Reconstruction of the signal
for mm = 1:N
    g1_r = alfa(mm)/a;
    F1_r(mm) = sum(c(mm).*besselj(0,g1_r*mm));
    % g1_r = alfa(1:N/8)/a;
    % F1_r(mm) = sum(a3(1:N/8).*besselj(0,g1_r*mm));
end

Residual.m

function [residual,LPCoeffs,eta,Ne] = Residual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% USAGE : [residual,LPCoeffs,Ne] = Residual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
% INPUTS :
%   speech       - speech signal
%   Fs           - sampling frequency (Hz)
%   segmentsize  - framesize for LP analysis (ms)
%   segmentshift - frameshift for LP analysis (ms)
%   lporder      - order of LPC
%   preempflag   - if 1, do preemphasis
%   plotflag     - if 1, plot results
% OUTPUTS :
%   residual     - residual signal
%   LPCoeffs     - 2D array containing LP coeffs of all frames
%   Ne           - normalized error
%
% LOG: NaN errors have been fixed. Some elements of 'residual' were NaN;
% the error has now been corrected.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Preemphasize the speech signal
if (preempflag == 1)
    dspeech = diff(speech);
    dspeech(length(dspeech)+1) = dspeech(length(dspeech));
else
    dspeech = speech;
end

[r,c] = size(dspeech);
if r == 1
    dspeech = dspeech';
end

framesize  = floor(segmentsize * Fs/1000);
frameshift = floor(segmentshift * Fs/1000);
nframes = floor((length(dspeech)-framesize)/frameshift)+1;
LPCoeffs = zeros(nframes,(lporder+1));
Lspeech = length(dspeech);
numSamples = Lspeech - framesize;
Lspeechnframes = (nframes-1)*frameshift + framesize;

j = 1;
% Processing the frames.
for i = 1:frameshift:Lspeech-framesize
    curFrame = dspeech(i:i+framesize-1);
    frame = speech(i:i+framesize-1);

    % Check if the energy of the frame is zero.
    if (sum(abs(curFrame)) == 0)
        LPCoeffs(j,:) = 0;
        Ne(j) = 0;
        resFrame(1:framesize) = 0;
    else
        % [a,E] = lpc(hamming(framesize).*curFrame,lporder);
        [a,E] = lpc(curFrame,lporder);
        nanFlag = isnan(real(a));

        % Check for ill-conditioning that can lead to NaNs.
        if (sum(nanFlag) == 0)
            LPCoeffs(j,:) = real(a);
            Ne(j) = E;

            % Inverse filtering.
            if i <= lporder
                frameToFilter(1:lporder) = 0;
            else
                frameToFilter(1:lporder) = dspeech(i-lporder:i-1);
                % frameToFilter(1:lporder) = speech(i-lporder:i-1);
            end
            frameToFilter(lporder+1:lporder+framesize) = curFrame;
            % frameToFilter(lporder+1:lporder+framesize) = frame;
            resFrame = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));

            numer = resFrame(lporder+1:framesize);
            denom = curFrame(lporder+1:framesize);
            eta(i) = sum(numer.*numer)/sum(denom.*denom);
        else
            LPCoeffs(j,:) = 0;
            Ne(j) = 0;
            resFrame(1:framesize) = 0;
        end
    end

    % Write the current residual into the main residual array.
    residual(i:i+frameshift-1) = resFrame(1:frameshift);
    j = j+1;
end

clear frameToFilter;
i = i+frameshift;
% Updating the remaining residual samples of the penultimate frame.
% residual(i+frameshift:i+framesize-1) = resFrame(frameshift+1:framesize);

% The above processing covers only L samples, where
% L = (nframes-1)*frameshift + framesize.
% Still, the last {Lspeech-L} samples remain to be processed.
% Note that 0 <= {Lspeech-L} < framesize.

% Processing the last frame, which has {Lspeech-i+1} samples.
if (i < Lspeech)
    curFrame = dspeech(i:Lspeech);
    frame = speech(i:Lspeech);
    l = Lspeech-i+1;

    % Check if the energy of the frame is zero.
    if (sum(abs(curFrame)) == 0)
        LPCoeffs(j,:) = 0;
        Ne(j) = 0;
        resFrame(1:l) = 0;
    else
        % [a,E] = lpc(hamming(l).*curFrame,lporder);
        [a,E] = lpc(curFrame,lporder);
        nanFlag = isnan(real(a));
        if (sum(nanFlag) == 0)
            LPCoeffs(j,:) = real(a);
            Ne(j) = E;

            % Inverse filtering.
            frameToFilter(1:lporder) = dspeech(i-lporder:i-1);
            % frameToFilter(1:lporder) = speech(i-lporder:i-1);
            frameToFilter(lporder+1:lporder+l) = curFrame;
            % frameToFilter(lporder+1:lporder+l) = frame;
            resFrame(1:l) = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
        else
            LPCoeffs(j,:) = 0;
            Ne(j) = 0;
            resFrame(1:l) = 0;
        end
    end

    residual(i:i+l-1) = resFrame(1:l);
    % Residual computation is now complete.
    % The lengths of speech and residual are equal now.
end

% Plotting the results
if plotflag == 1
    % Setting the scale for the x-axis.
    i = 1:Lspeech;
    x = i/Fs;

    figure;
    ax(1) = subplot(2,1,1);
    plot(x,speech);
    xlim([x(1) x(end)]); xlabel('Time (s)'); ylabel('Signal'); grid;

    ax(2) = subplot(2,1,2);
    plot(x,residual);
    xlim([x(1) x(end)]); xlabel('Time (s)'); ylabel('LP Residual'); grid;

    linkaxes(ax,'x');
end


InverseFilter.m

function residual = InverseFilter(frameToFilter,lporder,a)
% frameToFilter: has 'lporder + framesize' samples.
% lporder:       order of LP analysis.
% a:             array of predictor coefficients, of size 'lporder+1'.

l = length(frameToFilter);

if l <= lporder
    return;
else
    % Note that l-lporder = frameSize.
    for i = 1:l-lporder
        predictedSample = 0;
        % Note that a(1) = 1; hence start from j=2.
        for j = 2:lporder+1
            predictedSample = predictedSample - a(j)*frameToFilter(lporder+1+i-j);
        end
        residual(i) = frameToFilter(i+lporder) - predictedSample;
    end
end


HilbertEnvelope.m

function [HilbertEnv] = HilbertEnvelope(signal,Fs,plotflag)
% hilbert() returns a complex helical sequence, sometimes called the
% analytic signal, from a real data sequence. The analytic signal has a
% real part, which is the original data, and an imaginary part, which
% contains the Hilbert transform. The imaginary part is a version of the
% original real sequence with a 90 degree phase shift. Sines are
% therefore transformed to cosines and vice versa.

tempSeq = hilbert(signal);
HilbertSeq = imag(tempSeq);

% HilbertSeq contains the Hilbert-transformed version of the input signal.
% The Hilbert envelope is given by t = sqrt(sig.^2 + h(sig).^2)
sigSqr = signal.*signal;
HilbertSqr = HilbertSeq.*HilbertSeq;
HilbertEnv = sqrt(sigSqr + HilbertSqr);

% wavwrite(HilbertSeq/1.01/max(abs(HilbertSeq)),Fs,16,'ht.wav');
wavwrite(HilbertSeq,Fs,16,'ht.wav');

if (plotflag == 1)
    % Setting the scale for the x-axis.
    len = length(signal);
    x = [1:len]*1/Fs;

    figure;
    subplot(3,1,1);
    plot(x,signal); ylabel('Signal'); grid; hold on;
    % plot(x,HilbertSeq.*HilbertSeq,'r');
    plot(x,HilbertEnv,'r');
    hold off;

    subplot(3,1,2);
    plot(x,HilbertSeq); ylabel('HT of Signal'); grid;

    subplot(3,1,3);
    plot(x,HilbertEnv); ylabel('HE of Signal'); grid;
    xlabel('Time (s)');

    % figure;
    % plot(signal,HilbertSeq,'k.'); grid;
    % plot(x, signal.*HilbertSeq); grid;
    for i = 1:16:len-16
        xi = signal(i:i+16);
        yi = HilbertSeq(i:i+16);
        % plot(xi-mean(xi),yi-mean(yi),'k.'); grid;
    end
end


Bandlimitfbc.m

clc;
clear all;
close all;

[os,fs] = wavread('sa1_8000.wav');
plot(os), title('Original Speech Signal'), axis tight, grid on;

samplesrange = 14000:20000;
S = os';
s = S(samplesrange);
N = length(s);
figure(), plot(s), title('Extracted voiced speech signal'), axis tight, grid on;

% Roots of J0(x)
if exist('alfa') == 0
    x = 2;
    alfa = zeros(1,N);
    for i = 1:N
        ex = 1;
        while abs(ex) > .00001
            ex = -besselj(0,x)/besselj(1,x);
            x = x - ex;
        end
        alfa(i) = x;
        % fprintf('Root # %g = %8.5f ex = %9.6f \n', i, x, ex);
        x = x + pi;
    end
end

% Spectrum of the extracted segment
fts = fft(s);
ftsby2 = fts(1:length(fts)/2);
n = length(ftsby2);
tf = [1:n].*((fs/2)/n);
figure(), plot(tf,20*log10(abs(ftsby2))), title('Spectrum of the Extracted speech signal'), axis tight, grid on;

% Fourier-Bessel coefficients
a = N;
nb = 1:N;
for m1 = 1:N
    fbc(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end

% Calculating the mean frequency
s1 = s;
N = length(s1);
nb = 1:N;
MM = N;
a = N;
for m1 = 1:MM
    c(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s1.*besselj(0,alfa(m1)/a*nb));
    p(m1) = (c(m1)^2)*(a^2*(besselj(1,alfa(m1))).^2)/2;
end
f(1:MM) = alfa(1:MM)/(2*pi*a);
fmean1 = sum(f(1:MM).*p(1:MM))/sum(p(1:MM));
mf = fmean1*fs

% Reconstructing the signal from the coefficients below fd = 300 Hz
fd = 300;
range = N*fd/(fs/2);
fbc = fbc(1:range);
fbc = [fbc zeros(1,N-length(fbc))];
for mm = 1:N
    g1_r = alfa/a;
    rs(mm) = sum(fbc.*besselj(0,g1_r*mm));
end
figure(), plot(rs), title('Band Limited signal with frequency < 300Hz'), axis tight, grid on;

% Spectrum of the band-limited signal
ftrs = fft(rs);
ftrsby2 = ftrs(1:length(ftrs)/2);
nr = length(ftrsby2);
tfr = [1:nr].*((fs/2)/nr);
figure(), plot(tfr,20*log10(abs(ftrsby2))), title('Spectrum of the Band Limited Signal'), axis tight, grid on;


Computeresidual.m

function [residual,LPCoeffs] = computeResidual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% USAGE : [residual,LPCoeffs] = computeResidual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
% INPUTS :
%   speech       - speech signal
%   Fs           - sampling frequency (Hz)
%   segmentsize  - framesize for LP analysis (ms)
%   segmentshift - frameshift for LP analysis (ms)
%   lporder      - order of LPC
%   preempflag   - if 1, do preemphasis
%   plotflag     - if 1, plot results
% OUTPUTS :
%   residual     - residual signal
%   LPCoeffs     - 2D array containing LP coeffs of all frames
%
% LOG: NaN errors have been fixed. Some elements of 'residual' were NaN;
% the error has now been corrected.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Preemphasize the speech signal
if (preempflag == 1)
    dspeech = diff(speech);
    dspeech(length(dspeech)+1) = dspeech(length(dspeech));
else
    dspeech = speech;
end

dspeech = dspeech(:);

framesize  = floor(segmentsize * Fs/1000);
frameshift = floor(segmentshift * Fs/1000);
nframes = floor((length(dspeech)-framesize)/frameshift)+1;
Lspeech = length(dspeech);
numSamples = Lspeech - framesize;
Lspeechnframes = (nframes-1)*frameshift + framesize

sbuf = buffer(dspeech, framesize, framesize-frameshift, 'nodelay');
[rs,cs] = size(sbuf);
tmp = dspeech(Lspeech-rs+1:Lspeech);
sbuf(:,cs) = tmp(:);   % last column of the buffer

% Computation of LPCs.
for i = 1:cs
    nanflag = 0;
    erg(i) = sum(sbuf(:,i).*sbuf(:,i));
    if erg(i) ~= 0
        a = lpc(sbuf(:,i),lporder);
        nanflag = sum(isnan(real(a)));
    else
        a1 = [1 zeros(1,lporder)];
    end
    if nanflag == 0
        A(:,i) = real(a(:));
    else
        A(:,i) = a1(:);
    end
end

% Computation of the LP residual.
x = [zeros(1,lporder) (dspeech(:))'];
xbuf = buffer(x, lporder+framesize, lporder+framesize-frameshift, 'nodelay');
[rx,cx] = size(xbuf);
tmp = x(Lspeech+lporder-rx+1:Lspeech+lporder);
xbuf(:,cx) = tmp(:);   % last column of the buffer

% Inverse filtering.
j = 1;
for i = 1:cx-1
    res = filter(A(:,i), 1, xbuf(:,i));
    % Write the current residual into the main residual array.
    residual(j:j+frameshift-1) = res(lporder+1:frameshift+lporder);
    j = j+frameshift;
end
res = filter(A(:,cx), 1, xbuf(:,cx));
residual((cx-1)*frameshift+1:Lspeech) = res((cx-1)*frameshift - Lspeech + rx + 1:rx);

LPCoeffs = A';
size(LPCoeffs)

% Plotting the results
if plotflag == 1
    % Setting the scale for the x-axis.
    i = 1:Lspeech;
    x = i/Fs;

    figure;
    subplot(2,1,1);
    plot(x,speech); xlim([x(1) x(end)]); ylabel('Signal'); grid;

    subplot(2,1,2);
    plot(x,residual); xlim([x(1) x(end)]); xlabel('Time (s)'); ylabel('LP Residual'); grid;
end


Synthesizespeech.m

function [speech] = synthesizeSpeech(Residual, LPCs, Fs, lporder, fsize, fshift, plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% function [speech] = synthesizeSpeech(Residual, LPCs, Fs, lporder, fsize, fshift, plotflag)
% fsize, fshift: in ms.
% Uses .res and .lpc files: in the .lpc file, each row is a set of LPCs
% for one frame; the sampling frequency is obtained from the .res file.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

framesize  = floor(fsize*Fs/1000);
frameshift = floor(fshift*Fs/1000);

speech = zeros(1,length(Residual));

j = 1;
for (i = 1:frameshift:length(Residual)-framesize)
    ResFrm = Residual(i:i+framesize-1);
    a = LPCs(j,:);
    j = j+1;

    if (i <= framesize)
        PrevFrm = zeros(1,framesize);
        PrevFrm(framesize-(i-2):framesize) = speech(1:i-1);
    else
        PrevFrm = speech((i-framesize):(i-1));
    end

    SpFrm = SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);
    speech(i:i+frameshift-1) = SpFrm(1:frameshift);
end
speech(i+frameshift:i+framesize-1) = SpFrm(frameshift+1:framesize);

% Processing the last-frame samples.
if (i < length(Residual))
    ResFrm = Residual(i:length(Residual));
    a = LPCs(j,:);
    j = j+1;
    PrevFrm = speech((i-framesize):(i-1));
    SpFrm = SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);
    speech(i:i+length(SpFrm)-1) = SpFrm(1:length(SpFrm));
end

if (plotflag == 1)
    figure;
    l = length(speech);
    x = [1:l]/Fs;

    subplot(2,1,1);
    plot(x,real(Residual),'k'); grid; xlim([x(1) x(end)]);

    subplot(2,1,2);
    plot(x,real(speech),'k'); grid; xlim([x(1) x(end)]); xlabel('Time (s)');
end

Synthfilter.m

function [SpchFrm] = SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag)

% USAGE: [SpchFrm] = SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag)

tempfrm = zeros(1,2*FrmSize);
tempfrm((FrmSize-LPorder):FrmSize) = PrevSpFrm((FrmSize-LPorder):FrmSize);

for (i = 1:FrmSize)
    t = 0;
    for (j = 1:LPorder)
        t = t + FrmLPC(j+1)*tempfrm(-j+i+FrmSize);
    end
    % s = -t + ResFrm(i);
    % tempfrm(FrmSize+i) = s;
    tempfrm(FrmSize+i) = -t + ResFrm(i);
end

SpchFrm = tempfrm(FrmSize+1:2*FrmSize);

if (plotflag == 1)
    figure;
    subplot(2,1,1); plot(ResFrm); grid;
    subplot(2,1,2); plot(SpchFrm); grid;
end
