Download - [IEEE 2013 IEEE Workshop on Signal Processing Systems (SiPS) - Taipei City, Taiwan (2013.10.16-2013.10.18)] SiPS 2013 Proceedings - Enhancement of speech over wireless network using

Enhancement of Speech over Wireless Network using Sinusoidal Modeling and Synthesis

Dhany Arifianto

Department of Engineering Physics Sepuluh Nopember Institute of Technology

Surabaya, 60111 Indonesia email: [email protected]

Abstract - This paper presents a non-invasive technique that uses wavelet analysis of the speech signal to estimate the degree of severity of vocal cord disorders in a patient. The early symptom is usually a perceivable hoarse voice that changes with the degree of severity. The first step was to estimate the fundamental frequency and the formants of the speech. Using these features, we measured the distortion through statistical analysis of jitter, shimmer, and harmonics-to-noise ratio (HNR). From this experiment, prescribed thresholds were set on the distortion features to classify automatically whether a subject was normal or suffered from a mild or severe disorder. This classification was then validated by an ear-nose-throat doctor using a fiber-optic laryngoscope (the gold-standard technique). In the second experiment, we propose sinusoidal modeling and synthesis to enhance the speech signal transmitted over a wireless network and compare the results with the Daubechies wavelet speech analysis technique. The results of both experiments suggest that the proposed technique may be used as an alternative non-invasive diagnostic technique.

Keywords: hoarse voice, vocal cord, signal analysis, severity, wavelet transform.

1. INTRODUCTION

The larynx, which houses the vocal cords, is an organ that produces sound through the back-and-forth movement of the vocal muscles and their interaction with other organs. Disturbances in the function of the vocal cords cause temporary or permanent loss of the ability to speak and often cause hoarseness. Detecting laryngeal disorders at an early stage increases the chance that the organ recovers fully and functions normally again. When a disorder is detected at a later stage, however, the patient can lose the ability to speak, and some cases, for example nasopharyngeal cancer, can be fatal [1].

Signal processing offers an alternative, non-invasive basis for fast and portable tools that detect vocal cord disorders. However, finding features that determine the type and the severity level of a laryngeal disorder is a challenge, given the nonlinear nature of the human voice signal with respect to time. The technique commonly used is to ask the patient to pronounce the vowel /a/ continuously in one breath [2]. In the normal case, the frequency of the sound should stay close to a single value with little variation over time. A regular structure in frequency with respect to time is called the harmonicity of a signal, in which the first harmonic is called the fundamental frequency and the later harmonics are the first, second, and n-th formants. With hoarseness, the frequency is distorted, and this distortion can be measured. Based on these characteristics, the hybrid feature harmonics-to-noise ratio (HNR) was proposed to determine the severity level of a laryngeal disorder. This article considers how the harmonics that normally exist are distorted by vocal cord abnormalities [3, 4].

Research using non-invasive methods has been carried out by the Acoustic Group of FTI-ITS Engineering Physics since 1998, in collaboration with the Medical Faculty of Airlangga University and the Ear Nose Throat Technical Implementation Unit (ENT-UPF) of dr. Soetomo Hospital. The idea for this study originated from the difficulty ENT specialists face in establishing a diagnosis by inserting a flexible optical cable into the throat (laryngoscopy), which is invasive and causes discomfort to the patient [1].

We extend the research to investigate the severity of vocal cord disorders suffered by patients, using the Daubechies wavelet for analysis because of its linear phase property [5]. To measure the degree of hoarseness quantitatively, we used jitter, shimmer, and HNR.

2. SPEECH SIGNAL ANALYSIS METHOD

Speech is generated by the mechanical vibration of the vocal cords, which creates air pressure differences during transmission. The vocal cord movements are driven by air pressure from the lungs and controlled by the intrinsic muscles of the larynx to produce the desired sound. Abnormalities of the vocal cords are often marked by an audibly hoarse voice. Hoarse voice is a generic term for any disorder that causes changes in the sound. When hoarse, the voice may sound harsh or rough, with a lower tone than usual; it may be weak, lost, or strained and hard; it may consist of several tones; and the speaker may feel pain when making a sound or be unable to reach a particular tone or intensity.

The change of sound also causes changes in the voice signal. These changes can be examined by analyzing the voice signal, so that the details and causes of the changes can be identified from its time-domain, frequency, and amplitude characteristics.

978-1-4673-6238-2/13 $31.00 © 2013 IEEE

2.1. Jitter, Shimmer, and HNR

Jitter, shimmer, and HNR are measures of the variation of the fundamental frequency and amplitude that are used to describe the pathological quality of the human voice. These parameters are easy to obtain and indicate the degree of distortion in a straightforward way. The values obtained can serve as one aspect of the characterization of a specific sound.

1) Jitter

Jitter is frequency modulation noise, which appears as variation between successive values of the fundamental frequency [6] [7]. It is computed from the extracted frequency values and the number of frequencies extracted.

2) Shimmer

Shimmer is amplitude modulation, expressed as the change of the peak-to-peak amplitude in decibels (dB) [6] [7]. It is computed from the extracted peak-to-peak amplitude data.

3) HNR

HNR is usually used to determine the clarity level of the measured sound signal. It expresses the harmonic amplitude content of the signal in decibels (dB): the larger the HNR value, the more harmonic the signal [4]. The harmonics of a speech signal are obtained using a fundamental frequency and formant estimation algorithm.

2.2. Wavelet Transform

Wavelets can analyze data in the time and frequency domains simultaneously, with a different resolution scale at each level, which is one of the advantages of the wavelet transform. The wavelet function cuts the data into components of different frequencies, so that each component can be studied at a resolution matched to its scale. The Daubechies wavelet transform is defined through a dilation and translation function [5], in which a is the dilation parameter, 2^j is the frequency of dilation or scale parameter, k is the time or location parameter, and j, k ∈ Z, that is, j and k take integer values. Equation (3) shows that the wavelet has the characteristics of translation, dilation, and short oscillation.
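For reference, the dilation and translation relation described in this paragraph is commonly written in the dyadic form sketched below. This is a reconstruction under assumptions and may differ from the expression intended here: the dilation parameter a is taken to be 2, and the normalization factor and the sign convention for j vary between texts.

```latex
\psi_{j,k}(t) = 2^{j/2}\,\psi\!\left(2^{j} t - k\right), \qquad j, k \in \mathbb{Z}
```

In this form, $\psi$ is the mother wavelet, a larger $j$ compresses the wavelet in time (finer scale), and $k$ shifts it along the time axis.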

In signal processing, the wavelet transform uses two important components: the scaling function and the wavelet function. Both components can be referred to as the mother wavelet, and they must satisfy a condition that guarantees vector orthogonality [5].

Daubechies wavelets are compactly supported, extremal-phase wavelets that have the highest number of vanishing moments for a given support width. The number of vanishing moments indicates the ability of the wavelet to represent polynomial behavior. The associated scaling filter is a minimum-phase filter.
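The orthogonality and vanishing-moment properties described above can be checked numerically on a small example. The sketch below uses the 4-tap Daubechies filter (often called db2), whose coefficients have a closed form; the choice of this particular order is an assumption for illustration, since the text does not state which Daubechies order was used.

```python
import math


def db2_scaling_filter():
    """Return the 4-tap Daubechies (db2) scaling filter in closed form."""
    s3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    return [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]


def wavelet_filter(h):
    """Build the wavelet (high-pass) filter with the alternating-flip rule."""
    n = len(h)
    return [((-1) ** k) * h[n - 1 - k] for k in range(n)]


def moment(g, p):
    """Discrete moment of order p: sum over k of k**p * g[k]."""
    return sum((k ** p) * gk for k, gk in enumerate(g))


h = db2_scaling_filter()
g = wavelet_filter(h)

# Orthonormality: unit energy, and orthogonal to its own shift by two samples.
energy = sum(x * x for x in h)
shift_by_two = h[0] * h[2] + h[1] * h[3]

# Vanishing moments: orders 0 and 1 vanish for db2, order 2 does not.
moments = [moment(g, p) for p in range(3)]

print("energy:", energy)
print("shift-by-two inner product:", shift_by_two)
print("moments of order 0, 1, 2:", moments)
```

Two vanishing moments mean that constant and linear trends in the signal produce (near) zero detail coefficients, which is the sense in which the wavelet "represents polynomial behavior".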

3. EXPERIMENTS AND RESULTS DISCUSSION

Data acquisition was conducted in research collaboration with the Department of Ear, Nose, Throat and Head-Neck Surgery (ENT-TOS) of dr. Soetomo Hospital. The hoarse voice of each patient was recorded in the Audiology Room of ENT-TOS, a double-walled, sound-attenuated booth. Patients were asked to pronounce the phoneme /a/ continuously in a single breath, for as long as their lung capacity allowed. The phonation length of healthy people is about 12 to 14 seconds. Patients with vocal cord disorders, on the other hand, generally have a lower lung capacity. The higher the stage of the illness, the shorter the duration of phonation.

Voice signals were recorded using a Shure-58 microphone with a flat frequency response and digitized with a 24-bit Creative Emu 0404 analog-to-digital converter (ADC). The data were acquired at a sampling rate of 16 kHz and saved as PCM-coded .wav files. When the voice data were recorded and stored, the identity of each patient was replaced with a combination of letters and numbers, and each recording was labeled with the disease diagnosed by an ENT-TOS specialist using gold-standard techniques prior to recording.

The schematic of the wireless data acquisition is depicted in Fig. 1. The patient was asked to sit in a quiet room and utter a continuous /a/ into a hand-phone (HP#1) connected to another, identical hand-phone (HP#2) acting as the data receiver. The recordings took place during a crowded period of service time. The voice data were coded using the wireless industry standard Adaptive Multi-Rate (AMR) codec. In this experiment, the data were sent one way only, from the patient to the receiver.

Figure 1. Schematic of Data Acquisition (blocks: Patient, HP#1, AMR format, transmission channel, AMR format, HP#2, Receiver)

Across all recorded data, statistical analysis of the frequency and amplitude data was performed in the form of jitter and shimmer values to determine the percentage deviation of the data. A comparative analysis of the harmonic content of the patient's voice signal and the noise present during recording was also performed by calculating the HNR (harmonics-to-noise ratio). The jitter, shimmer, and HNR values are used to determine the severity of the vocal cord disorder suffered by the patient. Based on severity, patients are divided into two conditions, mild and severe. For comparison, jitter, shimmer, and HNR were also measured in normal subjects; the results are shown in Table 1.
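The jitter and shimmer statistics described above can be sketched as follows. The formulas are assumptions based on definitions that are common in the voice-perturbation literature: jitter as the mean absolute difference between consecutive fundamental-frequency values relative to the mean frequency, in percent, and shimmer as the mean absolute level change between consecutive peak-to-peak amplitudes, in dB. The function names and the sample values are hypothetical.

```python
import math


def jitter_percent(f0_values):
    """Mean absolute difference of consecutive F0 values, as % of the mean F0.

    ASSUMPTION: this is one common jitter definition, chosen for illustration.
    """
    if len(f0_values) < 2:
        raise ValueError("need at least two F0 values")
    diffs = [abs(a - b) for a, b in zip(f0_values, f0_values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_f0 = sum(f0_values) / len(f0_values)
    return 100.0 * mean_diff / mean_f0


def shimmer_db(peak_to_peak_amplitudes):
    """Mean absolute level change between consecutive amplitudes, in dB.

    ASSUMPTION: this is one common shimmer definition, chosen for illustration.
    """
    if len(peak_to_peak_amplitudes) < 2:
        raise ValueError("need at least two amplitude values")
    pairs = zip(peak_to_peak_amplitudes, peak_to_peak_amplitudes[1:])
    changes = [abs(20.0 * math.log10(b / a)) for a, b in pairs]
    return sum(changes) / len(changes)


# Hypothetical per-cycle measurements from a sustained /a/.
example_f0 = [150.0, 150.5, 149.8, 150.2]
example_amplitudes = [0.80, 0.82, 0.79, 0.81]

print("jitter (%):", jitter_percent(example_f0))
print("shimmer (dB):", shimmer_db(example_amplitudes))
```

A perfectly steady phonation gives zero for both measures, so larger values indicate larger cycle-to-cycle perturbation.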

Table 1. Calculation of Jitter, Shimmer, and HNR

          Jitter (%)   Shimmer (dB)   HNR (dB)
Normal    0.249        0.149          25.223
Mild      1.350        0.647          15.411
Severe    4.235        1.501          3.296

Table 1 shows that, for normal subjects, the frequency deviation (jitter) is only 0.249% of the average frequency, and the shimmer is about 0.149 dB, meaning that the amplitude deviation of a normal subject is only 0.149 dB relative to the average signal amplitude. The harmonic content of the signal, given by the HNR, is 25.223 dB. The jitter, shimmer, and HNR values in Table 1 also show the difference between patients with mild and severe vocal cord disorders. For patients with a mild condition the jitter is 1.350%, whereas in the severe condition it is higher, at 4.235%, roughly three times the mild value. The amplitude deviation (shimmer) behaves similarly. Compared with a normal subject, the shimmer of patients with mild abnormalities is higher by 0.498 dB, and in severe patients it rises to 1.501 dB, about 2.3 times the mild value.
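The comparisons drawn from Table 1 can be checked with a few lines of arithmetic. The sketch below also shows how thresholds on a distortion feature could turn a measurement into a class label. The thresholds themselves are hypothetical: they are placed halfway between adjacent class values in Table 1 purely for illustration, because the prescribed thresholds are not listed in the text.

```python
# Values copied from Table 1 (decimal commas written as decimal points).
TABLE_1 = {
    "normal": {"jitter_pct": 0.249, "shimmer_db": 0.149, "hnr_db": 25.223},
    "mild": {"jitter_pct": 1.350, "shimmer_db": 0.647, "hnr_db": 15.411},
    "severe": {"jitter_pct": 4.235, "shimmer_db": 1.501, "hnr_db": 3.296},
}


def ratio(feature, numerator_class, denominator_class):
    """Ratio of one feature between two classes of Table 1."""
    return TABLE_1[numerator_class][feature] / TABLE_1[denominator_class][feature]


def difference(feature, class_a, class_b):
    """Difference of one feature between two classes of Table 1."""
    return TABLE_1[class_a][feature] - TABLE_1[class_b][feature]


def hypothetical_hnr_thresholds():
    """ASSUMPTION: thresholds halfway between adjacent class values."""
    normal, mild, severe = (
        TABLE_1[name]["hnr_db"] for name in ("normal", "mild", "severe")
    )
    return (normal + mild) / 2.0, (mild + severe) / 2.0


def classify_by_hnr(hnr_db):
    """Label a measured HNR using the hypothetical midpoint thresholds."""
    normal_limit, mild_limit = hypothetical_hnr_thresholds()
    if hnr_db >= normal_limit:
        return "normal"
    if hnr_db >= mild_limit:
        return "mild"
    return "severe"


print("jitter, severe / mild:", round(ratio("jitter_pct", "severe", "mild"), 2))
print("shimmer, mild - normal (dB):", round(difference("shimmer_db", "mild", "normal"), 3))
print("shimmer, severe / mild:", round(ratio("shimmer_db", "severe", "mild"), 2))
print("class for HNR = 12 dB:", classify_by_hnr(12.0))
```

With the Table 1 values, the severe-to-mild jitter ratio is about 3.1 and the severe-to-mild shimmer ratio is about 2.3. A practical classifier would combine all three features and use thresholds validated against clinical diagnosis.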

Figure 2. Fundamental frequency of normal (a) female and (b) male speakers

The HNR, however, moves in the opposite direction: in patients with a mild-stage vocal cord abnormality it drops below the normal value to 15.411 dB, and in the severe stage it decreases drastically to only 3.296 dB. This suggests that the lower the HNR, the more severe the disorder affecting the vocal cords. We argue that no two patients have an identical condition, even within one class of vocal cord abnormality, and consequently the data deviation across patients does not provide additional information. For validation, these values were instead confirmed by trained ear-nose-throat doctors using a fiber-optic laryngoscope, the medical gold-standard technique.

It is known that the fundamental frequency (F0) of females is about twice that of males [8]: children and adult women range from 250 to 300 Hz, while adult men are at about 150 Hz. Fig. 2 shows the F0 of a normal female and a normal male speaker. The F0 of the female speaker is about 380 Hz, with a duration of almost 12 seconds, and that of the male speaker is about 150 Hz, with a phonation duration of about 14.5 seconds. As the figure also shows, the F0 contour does not fluctuate during sustained phonation. In contrast, Figure 3 shows the F0 of two different subjects, and F0 distortion appears in both cases. In Figure 3(a), the distortion is called pitch doubling: the F0 jumps suddenly to double its value or even more. In Figure 3(b), on the other hand, the F0 vanishes in the middle and at the end of the sustained phonation. Knowing the differences in fundamental frequency between normal male and female voices, the vocal cord disorder may be measured quantitatively to determine the degree of severity of hoarseness.

Figure 3. Fundamental frequency of (a) male and (b) female speakers with severe polyps

The vocal cord abnormalities collected are vocal cord polyps. The statistical results for jitter, shimmer, and HNR obtained from male and female patients are nearly equal: there are no significant differences in the frequency and amplitude deviations, and the harmonic content of the signals is similar as well. This is seen clearly in Table 2. For male patients with polyps, the measured jitter is 1.2325%, which differs by only about 0.108% from the 1.3405% of female patients. The amplitude deviation is 0.4423 dB for male and 0.4983 dB for female patients. Similarly, the HNR of both groups lies at about 12.98 to 12.99 dB.
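The F0 contours and the pitch-doubling distortion discussed for Figures 2 and 3 can be illustrated with a short sketch. It estimates F0 and HNR for one frame from the strongest peak of the normalized autocorrelation, which is a widely used technique swapped in here for illustration; the text obtains harmonics with a fundamental frequency and formant estimation algorithm whose details are not given. The search range, the jump factor of 1.8, and the synthetic test frame are assumptions.

```python
import math
import random


def normalized_autocorrelation(frame, lag):
    """Correlation between the frame and itself shifted by `lag` samples."""
    head = frame[:len(frame) - lag]
    tail = frame[lag:]
    numerator = sum(x * y for x, y in zip(head, tail))
    denominator = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
    return numerator / denominator if denominator > 0.0 else 0.0


def estimate_f0_and_hnr(frame, fs, f0_min=110.0, f0_max=500.0):
    """Return (F0 in Hz, HNR in dB) from the strongest autocorrelation peak.

    ASSUMPTION: the F0 search range is illustrative and must suit the speaker.
    """
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    best_lag, best_r = max(
        ((lag, normalized_autocorrelation(frame, lag))
         for lag in range(lag_min, lag_max + 1)),
        key=lambda pair: pair[1],
    )
    r = min(max(best_r, 1e-6), 1.0 - 1e-6)  # keep the logarithm finite
    hnr_db = 10.0 * math.log10(r / (1.0 - r))
    return fs / best_lag, hnr_db


def find_pitch_doubling(f0_track, factor=1.8):
    """Indices where F0 jumps to at least `factor` times the previous value."""
    return [i for i in range(1, len(f0_track))
            if f0_track[i] >= factor * f0_track[i - 1]]


# Synthetic 40 ms frame: a 200 Hz tone plus weak noise, sampled at 16 kHz.
fs = 16000
rng = random.Random(0)
frame = [math.sin(2.0 * math.pi * 200.0 * n / fs) + 0.1 * rng.gauss(0.0, 1.0)
         for n in range(640)]

f0, hnr = estimate_f0_and_hnr(frame, fs)
print("F0 (Hz):", f0)
print("HNR (dB):", hnr)
print("doubling at indices:", find_pitch_doubling([150.0, 151.0, 300.0, 149.0]))
```

Applied frame by frame, the F0 values form the kind of contour shown in the figures, and frames with no clear autocorrelation peak correspond to the vanishing F0 described for Figure 3(b).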

To characterize the sound signals produced by male and female patients with vocal cord polyps, the signals were analyzed using the wavelet transform. The wavelet transform is an alternative signal processing technique to the Fourier transform. The mother wavelet used is the Daubechies wavelet. The main reason for using this type of wavelet is the linear phase of the analysis and synthesis filters, which is important for preserving the distortion information in the speech signal. Measuring the location of the distortion in the time-frequency space is the key to determining the type of vocal cord disorder. In addition, the position and intensity on the Daubechies wavelet scale can be used as benchmarks for characterizing male and female patients with vocal cord disorders.

Table 2. Calculation of Jitter, Shimmer, and HNR for patients with polyps

          Jitter (%)   Shimmer (dB)   HNR (dB)
Male      1.2325       0.4423         12.992
Female    1.3405       0.4983         12.983

Table 3. Frequency Allocation (Ananda J., 2009)

                Frequency band
Operator GSM    GSM900 (MHz)   GSM1800 (MHz)   Total (MHz)
Operator 3      7.5            22.5            30
Operator 2      10             20              30
Operator 5      7.5            7.5             15
Operator 1      0              15              15
Operator 4      0              10              10

Figure 4. Wavelet analysis of (a) male and (b) female normal speakers

Figure 4 shows the amplitude at each scale of the Daubechies wavelet when decomposing the speech signals of normal male and female speakers. From these figures it can easily be seen that the harmonicity strength is high when the signal is decomposed into its details. This feature in the time-scale space is then used to determine what type of vocal cord disorder the speaker suffers from. Without loss of generality, the scale can easily be converted back to frequency without changing the information, owing to Parseval's theorem.
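The statement that no information is lost in the wavelet domain can be illustrated with one level of an orthonormal discrete wavelet transform, for which the energy of the coefficients equals the energy of the signal (Parseval's relation). The sketch below is a minimal example under assumptions: it uses the 4-tap Daubechies (db2) filter with periodic extension, whereas the order and the type of transform used for Figure 4 are not stated in the text, and the two-tone test signal is hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# db2 scaling (low-pass) filter; the filter order is an assumption.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Wavelet (high-pass) filter from the alternating-flip rule.
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def dwt_one_level(signal):
    """One level of a periodic orthonormal DWT: (approximation, detail)."""
    n = len(signal)
    if n < 4 or n % 2:
        raise ValueError("signal length must be even and at least 4")
    approximation, detail = [], []
    for k in range(n // 2):
        # Four input samples starting at 2k, wrapping around at the end.
        taps = [signal[(2 * k + m) % n] for m in range(4)]
        approximation.append(sum(h * x for h, x in zip(LOW, taps)))
        detail.append(sum(g * x for g, x in zip(HIGH, taps)))
    return approximation, detail


def energy(values):
    """Sum of squared samples."""
    return sum(v * v for v in values)


# Hypothetical test signal: 150 Hz tone plus a weaker 300 Hz harmonic at 16 kHz.
fs = 16000
signal = [math.sin(2.0 * math.pi * 150.0 * i / fs)
          + 0.3 * math.sin(2.0 * math.pi * 300.0 * i / fs)
          for i in range(64)]

approximation, detail = dwt_one_level(signal)
print("signal energy:", energy(signal))
print("coefficient energy:", energy(approximation) + energy(detail))
```

The two printed energies agree up to rounding error, which is what allows amplitudes observed on the scale axis to be interpreted as signal energy.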

Figure 5. Spectrogram of (a) a normal male speaker and (b) a male speaker with a vocal nodule. The upper panel is the input signal, the middle panel is the synthesized signal, and the lower panel is the reconstructed signal.

In contrast to Figure 4, the harmonicity strength for the patients is shifted and appears only at the beginning of phonation. This may suggest that the phonation contains fewer harmonics, which we perceive as hoarseness. The corresponding wavelet analysis shows the sound signals of patients of different gender with vocal cord polyps, analyzed using the Daubechies wavelet. The horizontal axis is time, covering the duration of phonation; the vertical axis is the scale of the Daubechies wavelet coefficients; and the color intensity is the amplitude of the sound signal. The phonation duration of the female patient with a polyp is about 18 seconds, while that of the male patient is only about 7 seconds. This may show that the lung capacity of the female speaker is much larger than that of the male speaker. Patients with polyps show increased intensity that is uneven over the phonation time. This increased intensity can be regarded as a distortion of the voice signal frequency. The difference is that the distortion occurs 0.4 seconds after the start of phonation in the male patient, whereas in the female patient it occurs two seconds after the start. The distortions in the female patient are higher and more evenly distributed than those in the male patient. Although the fundamental frequencies of male and female speakers differ by about 100 Hz, with the same disease the high-intensity region apparently covers the same range of scales, approximately 101 to 161.
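The scale range quoted above can be related to frequency with the usual pseudo-frequency relation for wavelet scales: the frequency equals the center frequency of the wavelet times the sampling rate, divided by the scale. The sketch below assumes this relation applies to the analysis used here. The sampling rate of 16 kHz comes from the recording setup, while the center-frequency value is a hypothetical placeholder, because it depends on the Daubechies order, which is not stated.

```python
def scale_to_frequency(scale, sampling_rate_hz, center_frequency):
    """Pseudo-frequency in Hz of a wavelet scale: fc * fs / scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return center_frequency * sampling_rate_hz / scale


SAMPLING_RATE_HZ = 16000       # from the recording setup
ASSUMED_CENTER_FREQUENCY = 0.7  # HYPOTHETICAL placeholder, depends on the wavelet

for scale in (101, 161):
    frequency = scale_to_frequency(scale, SAMPLING_RATE_HZ, ASSUMED_CENTER_FREQUENCY)
    print("scale", scale, "-> about", round(frequency, 1), "Hz (placeholder fc)")
```

Because frequency is inversely proportional to scale, the upper end of the scale range corresponds to the lower end of the frequency range.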

The speech of normal speakers received at the receiver was reconstructed using the Sinusoidal Modeling and Synthesis (SMS) technique [9]. The results of the experiment are shown in Fig. 5(a) for a normal male speaker and in Fig. 5(b) for a speaker with a vocal nodule. The lower frequency regions were reconstructed successfully, as were some regions at the end of the utterance, where the signal power decreased during phonation.
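The core idea of sinusoidal modeling can be sketched for a single frame: pick the strongest spectral peaks, read off their amplitudes, frequencies, and phases, and rebuild the frame as a sum of sinusoids. This is a simplified illustration under assumptions, not the processing chain used in the experiment: it handles one frame only, uses a direct DFT with a Hann window, and omits the frame-to-frame peak matching and parameter interpolation of the full method in [9]. The function names and the test frame are hypothetical.

```python
import cmath
import math


def analyze_frame(frame, num_peaks):
    """Return (amplitude, bin index, phase) for the strongest spectral peaks."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    window_sum = sum(window)
    half = n // 2
    # Direct DFT of the windowed frame for the non-negative frequency bins.
    spectrum = [
        sum(frame[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n))
        for k in range(half + 1)
    ]
    magnitudes = [abs(c) for c in spectrum]
    peaks = [k for k in range(1, half)
             if magnitudes[k] > magnitudes[k - 1] and magnitudes[k] > magnitudes[k + 1]]
    peaks.sort(key=lambda k: magnitudes[k], reverse=True)
    # A sinusoid of amplitude A at a bin center gives |X[k]| = A * sum(window) / 2.
    return [(2.0 * magnitudes[k] / window_sum, k, cmath.phase(spectrum[k]))
            for k in peaks[:num_peaks]]


def synthesize_frame(partials, n):
    """Rebuild a frame of length n as a sum of cosines."""
    return [sum(a * math.cos(2.0 * math.pi * k * i / n + phase)
                for a, k, phase in partials)
            for i in range(n)]


# Hypothetical frame with two partials located exactly on DFT bins 8 and 20.
N = 128
frame = [1.0 * math.cos(2.0 * math.pi * 8 * i / N + 0.3)
         + 0.5 * math.cos(2.0 * math.pi * 20 * i / N - 1.0)
         for i in range(N)]

partials = analyze_frame(frame, num_peaks=2)
rebuilt = synthesize_frame(partials, N)
max_error = max(abs(x - y) for x, y in zip(frame, rebuilt))
print("partials (amplitude, bin, phase):", partials)
print("maximum reconstruction error:", max_error)
```

Keeping only the strongest partials discards components that do not fit the sinusoidal model, which is the sense in which such a resynthesis can act as an enhancement step. Partials that fall between bins would need frequency interpolation, which this sketch leaves out.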

4. CONCLUSIONS

Jitter, shimmer, and HNR values and a Daubechies wavelet time-scale analysis were obtained for patients with vocal cord disorders. The experimental results show that these two signal analysis techniques yield parameters that can indicate the severity of the vocal cord disorder experienced by a patient, and they provide an analysis of the differences between male and female patients with polyps. Ongoing research will focus on the effect of external distortion, for example channel distortion in wireless communication. The long-term goal of this project is to develop a reliable technique for e-diagnosis.

ACKNOWLEDGMENT

This work is supported in part by PHKI Research Grant 2009, DP2M Hibah Kompetensi Grant 2010 and the Hitachi Scholarship Foundation 2012. The authors would like to express their gratitude to Profs. Widodo Ario Kentjono and Sri Herawati of the School of Medicine, Airlangga University, for data collection and to Alif Januar Rahman for helping with the SMS processing.

REFERENCES

[1]. Soedjak, S., Analisa Suara Penyakit pada Pita Suara, Doctoral Dissertation, Faculty of Medicine, Universitas Airlangga, 1997.

[2]. Koike, Y., "Vowel Amplitude Modulation in Patients with Laryngeal Diseases," J. Acoust. Soc. Amer., vol. 45, no. 4, pp. 839-844, 1969.

[3]. Murphy, P. J., "Perturbation-free Measurement of the Harmonics-to-Noise Ratio in Voice Signals Using Pitch Synchronous Harmonic Analysis," J. Acoust. Soc. Amer., vol. 105, no. 5, pp. 2866-2881, May 1999.

[4]. Arifianto, D., Sekartedjo, "Speech Disorder Analysis using Time-Varying Autoregressive," Proc. IEEE-MWSCAS 2004, pp. III-191-III-194, Hiroshima, Japan, 2004.

[5]. Daubechies, I., "Orthonormal Bases of Compactly Supported Wavelets," Commun. on Pure and Applied Math., vol. 41, pp. 909-996, November 1988.

[6]. Michaelis, D., Gramss, T., Strube, H. W., "Glottal-to-Noise Excitation Ratio - a New Measure for Describing Pathological Voices," Acustica/Acta Acustica, vol. 83, pp. 700-706, 1997.

[7]. Moran, R. J., Reilly, R. B., Chazal, P., Lacy, P. D., "Telephony-Based Voice Pathology Assessment Using Automated Speech Analysis," IEEE Trans. on Biomedical Engineering, vol. 53, no. 3, March 2006.

[8]. Furui, S., Digital Speech Processing, Synthesis, and Recognition, CRC Press, New York, 2001.

[9]. McAulay, R. J., Quatieri, T., "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 4, pp. 744-754, August 1986.
