Statistical Parametric Speech Synthesis: A Review

IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO) 2015

Athira Aroon

Department of Electronics Engineering, A.I.S.S.M.S Institute of Information Technology

Pune, India [email protected]

Abstract- This paper briefly reviews statistical parametric speech synthesis (SPSS) based on hidden Markov models (HMMs). It gives a non-mathematical introduction to SPSS and emphasizes recently emerging techniques used in SPSS, such as the autoregressive HMM, Gaussian process regression (GPR), neural autoregressive distribution estimators (NADE), which overcome limitations of restricted Boltzmann machines (RBMs), and deep neural networks (DNNs). One of the major drawbacks of SPSS is vocoder quality. To address this problem, we analyze spectral envelope estimation algorithms proposed for high-quality speech synthesis, namely STRAIGHT, TANDEM-STRAIGHT and CheapTrick.

Index Terms- Text-to-Speech, Statistical Parametric Speech Synthesis (SPSS), Vocoder Quality.

I. INTRODUCTION

Speech is the vocalized form of communication between people. It comprises words drawn from large vocabularies. Each spoken word is formed from a phonetic combination of a limited set of vowel and consonant sounds, termed sound units in speech synthesis. Speech synthesis is the artificial production of speech, and a system performing this function is called a speech synthesizer [1].

[Figure: input text -> analysis routines -> abstract underlying linguistic description -> synthesis routines -> output speech]

Fig 1. Text converted to abstract linguistic representation [1]

S.B. Dhonde

Department of Electronics Engineering, A.I.S.S.M.S Institute of Information Technology

Pune, India [email protected]

A text-to-speech (TTS) system converts normal text of any language into speech. Highly intelligible synthesizers are being developed that produce sound as natural as expected. Naturalness and intelligibility are the important qualities required of a speech synthesizer. Intelligibility is the ease with which the output is understood, and naturalness describes how closely the output resembles human speech. Speech synthesis research tries to maximize both of these qualities. Descriptions of speech synthesisers often take a procedural view: they describe the sequence of processes required to convert text into speech, often arranged in a simple 'pipeline' architecture [1][2]. We have undertaken this review in order to study the progress made in speech synthesis, and in particular in one of its techniques, statistical parametric speech synthesis.

This paper gives a brief description of statistical parametric speech synthesis. The paper comprises three sections. Section 2 describes the emerging trends in statistical parametric speech synthesis techniques. Section 3 explains methods for improving the vocoder quality.

II. STATISTICAL PARAMETRIC SPEECH SYNTHESIS

This system is based on hidden Markov models (HMMs). The model is parametric because speech is described using parameters rather than stored exemplars. It is statistical because it describes those parameters using statistics (e.g., means and variances of probability density functions) which capture the distribution of parameter values found in the training data. An HMM-based speech synthesis system comprises a training stage and a synthesis stage. In the training stage, mel-cepstral coefficients are obtained from a speech database by mel-cepstral analysis. The mel-cepstral coefficients are then used to train phoneme HMMs [3].

The parameters are extracted for each phoneme, and an HMM is generated for each phoneme. In the synthesis stage, the text to be synthesized is transformed into a phoneme sequence, and a sentence HMM representing the whole text is constructed by concatenating the phoneme HMMs. From the sentence HMM, a speech parameter sequence is generated using the algorithm for speech parameter generation from HMMs. Finally, speech is synthesized from the generated mel-cepstral coefficients using a suitable spectral synthesis algorithm [3][4].
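The synthesis stage described above can be sketched in a few lines. This is only an illustration under strong assumptions: each phoneme HMM is reduced to per-state mean vectors and fixed state durations, and the names `phoneme_hmms`, `build_sentence_hmm` and `generate_parameters` are hypothetical. A real system also uses variances and dynamic features in the parameter generation algorithm [3][4].

```python
import numpy as np

# Hypothetical toy inventory: each phoneme HMM has 3 states, each with a
# mean mel-cepstral vector (dimension 2 here) and a duration in frames.
phoneme_hmms = {
    "h": {"means": np.array([[0.0, 1.0], [0.5, 1.0], [1.0, 0.5]]),
          "durations": [2, 3, 2]},
    "i": {"means": np.array([[1.5, 0.0], [2.0, 0.0], [1.5, 0.5]]),
          "durations": [3, 4, 3]},
}

def build_sentence_hmm(phoneme_sequence):
    """Concatenate phoneme HMMs into one left-to-right sentence HMM."""
    means = np.vstack([phoneme_hmms[p]["means"] for p in phoneme_sequence])
    durations = [d for p in phoneme_sequence
                 for d in phoneme_hmms[p]["durations"]]
    return means, durations

def generate_parameters(means, durations):
    """Emit each state's mean vector for its duration (static features only)."""
    return np.repeat(means, durations, axis=0)

means, durations = build_sentence_hmm(["h", "i"])
trajectory = generate_parameters(means, durations)
```

The resulting `trajectory` has one row per frame. Its piecewise-constant shape shows why practical systems add dynamic features to obtain smooth parameter sequences.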

A. Autoregressive HMM Model

978-1-4799-6480-2/15/$31.00 ©2015 IEEE


The autoregressive HMM is a probabilistic model. A specific form of autoregressive HMM, the linear Gaussian linear autoregressive HMM (LGLAR HMM), is used to model speech parameters. It uses a linear Gaussian linear regression model embedded in a larger sequential model.

It supports existing high-quality speech parameter generation methods, such as parameter generation considering global variance. It also supports a simple and exact time-recursive form of speech parameter generation that is not available for the standard HMM synthesis framework or the trajectory HMM, and which may be used for low-latency parameter generation. The LGLAR HMM, like the standard HMM synthesis framework, has certain tunable parameters, such as the MDL tuning factor. The LGLAR HMM is capable of producing speech that is as natural as that of the standard HMM synthesis framework with its conventional settings, but not as natural as that of the trajectory HMM [6].
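The time-recursive generation mentioned above can be illustrated with a small sketch. It assumes a scalar parameter stream whose mean in each state is a linear function of the previous outputs, x_t = sum_k a_k x_(t-1-k) + b. The function name, coefficient values and zero initial context are hypothetical and are not taken from [6].

```python
import numpy as np

def generate_ar_mean_trajectory(state_sequence, ar_coeffs, biases, depth):
    """Generate outputs one frame at a time from past outputs only."""
    history = np.zeros(depth)      # most recent output first; zero initial context
    outputs = []
    for s in state_sequence:
        # Mean of the current frame given the previous `depth` outputs.
        x_t = float(np.dot(ar_coeffs[s], history) + biases[s])
        outputs.append(x_t)
        # Shift the history so the newest output comes first.
        history = np.concatenate(([x_t], history[:-1]))
    return np.array(outputs)

ar_coeffs = np.array([[0.5, 0.0],   # state 0
                      [0.5, 0.0]])  # state 1
biases = np.array([1.0, 0.0])
states = [0, 0, 0, 1, 1]
traj = generate_ar_mean_trajectory(states, ar_coeffs, biases, depth=2)
```

Each frame depends only on frames already generated, so output can begin before the whole utterance has been processed. This is the property that makes low-latency generation possible.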

Fig 2. HMM based TTS system [4]

B. Gaussian Process Regression

Gaussian process regression (GPR) is a statistical technique with a long history in spatial statistics, and more recently in function estimation and prediction. To make the computational cost feasible, the partially independent conditional (PIC) approximation was adopted, and the GP-based approach was shown to achieve comparatively better performance than an HMM-based system.

The advantages of GPs include the flexibility to model complexity and robustness against over-fitting. Hierarchical GPR may be used to distinguish individual, group, and condition differences. Although GP-based speech synthesis is promising, a number of issues remain for the realization of practical systems. One of them is the generation of acoustic feature trajectories from the predictive distribution [8], which generally causes an over-smoothing problem. Global variance (GV), which is widely used to alleviate the over-smoothing problem in HMM-based speech synthesis, was considered as a remedy. Another issue is the selection of the hyperparameters of the kernel function used in the GP, so hyperparameter optimization for the PIC approximation using the EM algorithm was introduced. The method using GV and hyperparameter optimization outperformed the conventional HMM-based approach in subjective evaluation [9].
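As a rough illustration of GP regression itself, the sketch below computes the predictive mean and variance of a full (non-approximated) GP with a squared-exponential kernel. The kernel choice, the hyperparameter values and the toy frame-position input are assumptions made for this example. The PIC approximation and the EM-based hyperparameter optimization discussed above are not implemented.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2, **kernel_args):
    """Predictive mean and variance of the latent function at x_test."""
    K = rbf_kernel(x_train, x_train, **kernel_args)
    K += noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train, **kernel_args)
    K_ss = rbf_kernel(x_test, x_test, **kernel_args)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data: frame positions as input, one acoustic feature as output.
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2.0 * np.pi * x_train)
mean, var = gp_predict(x_train, y_train, x_train, length_scale=0.2)
```

The kernel hyperparameters (`length_scale`, `signal_var`, `noise_var`) control how smooth the predicted trajectory is, which is why their selection is treated as a separate issue in the text.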

C. Neural Autoregressive Distribution Estimators (NADE)

NADE is inspired by the restricted Boltzmann machine (RBM), a kind of bipartite undirected graphical model that has been applied to speech synthesis and voice conversion. However, the RBM does not provide a tractable partition function for computing the probability of an observation. Not knowing the exact value of the partition function makes it hard to evaluate how well the distribution estimated by the RBM fits the observations. NADE solves the difficulty of calculating the partition function by decomposing the joint distribution of the observations into tractable conditional distributions. Therefore, NADE was adopted as the form of the state PDFs instead of the RBM [10].

NADE has been shown to be an efficient multivariate binary distribution estimator that performs similarly to a large (but intractable) RBM on several datasets. Experimental results comparing the generalization ability of RBMs and NADEs show that NADEs perform better than RBMs, owing to the accurate calculation of gradients at training time. NADE can also be understood as a special kind of autoencoder whose output assigns valid probabilities to observations, and hence it is a proper generative model. Results have also shown the superiority of NADEs over Gaussian mixture models in describing the distribution of spectral envelopes as a density model and in alleviating the over-smoothing effect at synthesis time. Incorporating the dynamic features of mel-cepstra and spectral envelopes into NADE modeling, and extending the spectral features from spectral envelopes to the FFT spectrum, remain further directions [10][11].
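The decomposition into tractable conditionals can be made concrete with a small sketch of a binary NADE. The weights here are random and untrained, and the names `W`, `V`, `b` and `c` are chosen for this example. Modeling real-valued spectral envelopes as in [11] would need a different output distribution. The point is only that the probabilities of all binary vectors sum to one without computing any partition function.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nade_log_prob(v, W, V, b, c):
    """log p(v) = sum_i log p(v_i | v_<i) for a binary vector v.

    W: (H, D) input-to-hidden weights, V: (D, H) hidden-to-output weights,
    b: (D,) output biases, c: (H,) hidden biases.
    """
    a = c.copy()                        # running hidden pre-activation
    log_p = 0.0
    for i in range(len(v)):
        h = sigmoid(a)                  # hidden units see only v_<i
        p_i = sigmoid(b[i] + V[i] @ h)  # p(v_i = 1 | v_<i)
        log_p += v[i] * np.log(p_i) + (1 - v[i]) * np.log(1 - p_i)
        a = a + W[:, i] * v[i]          # include v_i for the next conditional
    return log_p

rng = np.random.default_rng(0)
D, H = 4, 3
W, V = rng.normal(size=(H, D)), rng.normal(size=(D, H))
b, c = rng.normal(size=D), rng.normal(size=H)

# Sum of p(v) over all 2^D binary vectors.
total = sum(np.exp(nade_log_prob(np.array(v), W, V, b, c))
            for v in itertools.product([0, 1], repeat=D))
```

Because `total` equals one by construction, the log-likelihood of training data can be evaluated and differentiated exactly, which is the advantage over the RBM described above.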

D. Deep Neural Networks (DNNs)

DNNs are feed-forward artificial neural networks (ANNs) with many hidden layers, and they have achieved significant improvements in many areas of machine learning. They have also been introduced as acoustic models for statistical parametric speech synthesis (SPSS). In SPSS, a number of linguistic features that affect speech, including phonetic, syllabic, and grammatical ones, have to be taken into account in acoustic modeling to achieve natural-sounding synthesized speech.

Effective modeling of these complex context dependencies is one of the most critical problems for SPSS. In DNN-based SPSS, a DNN is trained to represent the mapping function from linguistic features (inputs) to acoustic features (outputs).
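The mapping from linguistic features to acoustic features can be sketched as a plain feed-forward pass. The layer sizes, the tanh activation and the random untrained weights below are assumptions made purely for illustration. In practice the weights would be trained on aligned linguistic and acoustic frames.

```python
import numpy as np

def init_dnn(layer_sizes, seed=0):
    """Create (weight, bias) pairs for a feed-forward network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def dnn_forward(linguistic_features, params):
    """Map frame-level linguistic features to acoustic features."""
    h = linguistic_features
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)          # hidden layers
    W_out, b_out = params[-1]
    return h @ W_out + b_out            # linear output layer

# 10 linguistic inputs, two hidden layers, 5 acoustic outputs (all assumed).
params = init_dnn([10, 32, 32, 5])
frames = np.zeros((7, 10))
frames[:, 3] = 1.0                      # e.g. a one-hot phoneme identity
acoustic = dnn_forward(frames, params)
```

Every frame is predicted independently from its own linguistic input, so identical inputs give identical outputs.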


DNN-based acoustic models offer an efficient and distributed representation of the complex dependencies between linguistic and acoustic features, and have shown the potential to produce natural-sounding synthesized speech. Mixture density networks (MDNs) have been introduced to overcome limitations of DNN-based acoustic modelling for speech synthesis, such as the lack of variances and the unimodal nature of the objective function [7].

Objective and subjective evaluations have shown that using a mixture density output layer, which provides variances and multiple mixture components, helped to predict acoustic features more accurately and significantly improved the naturalness of the synthesized speech [7][10].
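A mixture density output layer can be sketched as follows. The layout of the raw output vector (mixture logits, then means, then log standard deviations) and the function names are assumptions for this example, not the parameterization used in [7]. The sketch shows how a network output is turned into a Gaussian mixture and how the negative log-likelihood of a target frame is evaluated.

```python
import numpy as np

def mdn_split(outputs, n_mix, dim):
    """Split raw network outputs into mixture weights, means and std devs."""
    logits = outputs[:n_mix]
    means = outputs[n_mix:n_mix + n_mix * dim].reshape(n_mix, dim)
    log_sigmas = outputs[n_mix + n_mix * dim:].reshape(n_mix, dim)
    weights = np.exp(logits - logits.max())     # softmax over mixtures
    weights /= weights.sum()
    return weights, means, np.exp(log_sigmas)

def mdn_neg_log_likelihood(target, weights, means, sigmas):
    """-log sum_m w_m N(target; mu_m, diag(sigma_m^2))."""
    log_gauss = -0.5 * np.sum(((target - means) / sigmas) ** 2
                              + 2.0 * np.log(sigmas)
                              + np.log(2.0 * np.pi), axis=1)
    log_terms = np.log(weights) + log_gauss
    m = log_terms.max()                         # log-sum-exp for stability
    return -(m + np.log(np.sum(np.exp(log_terms - m))))

# Two mixtures over a 1-D acoustic feature: means -1 and +1, unit variances.
raw = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
weights, means, sigmas = mdn_split(raw, n_mix=2, dim=1)
nll = mdn_neg_log_likelihood(np.array([0.0]), weights, means, sigmas)
```

Unlike a squared-error objective, this loss lets the network express both uncertainty (through the variances) and multimodality (through the mixture components).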

TABLE I.

Sr. No. 1
Authors: Masanori Morise (2014) [14]
Proposed Work: A spectral envelope estimation algorithm is presented to achieve high-quality speech synthesis. The algorithm obtains an accurate and temporally stable spectral envelope using the fundamental frequency (F0).
Authors' Contribution: The subjective evaluations demonstrated that CheapTrick was superior to the conventional algorithms. In particular, CheapTrick synthesized F0-manipulated speech more robustly than the other algorithms.

Sr. No. 2
Authors: Matt Shannon et al. (2013) [6]
Proposed Work: Proposed using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis.
Authors' Contribution: Compared to the standard HMM synthesis framework, the trajectory HMM has slightly better mean trajectories, much better trajectory covariances, and a higher naturalness score. Compared to the autoregressive HMM, the trajectory HMM has better mean trajectory modeling.

Sr. No. 3
Authors: Xiang Yin et al. (2014) [11]
Proposed Work: A new approach which utilized neural autoregressive distribution estimators (NADE) for the spectral modeling in statistical parametric speech synthesis, in order to alleviate the over-smoothing effect on the generated spectral structures.
Authors' Contribution: Experimental results show the superiority of NADEs over Gaussian mixture models in describing the distribution of spectral envelopes as a density model and in alleviating the over-smoothing effect at synthesis time.

Sr. No. 4
Authors: Zhen-Hua Ling (2013) [10]
Proposed Work: Adopted graphical models with multiple hidden variables, including restricted Boltzmann machines (RBM) and deep belief networks (DBN), to represent the distribution of the low-level spectral envelopes at each HMM state.
Authors' Contribution: Results show the superiority of RBM and DBN over the Gaussian mixture model in describing the distribution of spectral envelopes as density models and in mitigating the over-smoothing effect of the synthetic speech. RBM and DBN are more appropriate than the GMM for generating acoustic features by sampling, which may help make the synthetic speech less monotonic and boring.

Sr. No. 5
Authors: Tomoki Koriyama et al. (2014) [9]
Proposed Work: The paper examines two issues of a statistical speech synthesis approach based on Gaussian process (GP) regression, although GP-based speech synthesis can give higher performance in generating spectral parameters than the HMM-based one.
Authors' Contribution: The proposed method, which uses GV and hyperparameter optimization, outperformed the conventional HMM-based approach in subjective evaluation.


III. VOCODER

The term vocoder is derived from the words VOice and CODER. The major challenges of statistical parametric speech synthesis are the vocoder quality, which is not on par with the pure waveforms of unit selection synthesis; the accuracy of the HMM-based acoustic modeling, which does not exactly model the real speech waveform; and the over-smoothing of the HMM-generated parameter trajectories. We therefore consider vocoder methods for spectral estimation. Many different vocoders may be considered for HMM-based speech synthesis.

A. STRAIGHT

STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) is the most established of the more sophisticated vocoding methods. It is a tool for flexibly manipulating voice quality, timbre, pitch, speed and other attributes. It is a continually evolving system that aims at better sound quality, close to the original natural speech, by introducing advanced signal processing algorithms and findings on computational aspects of auditory processing. The main feature of STRAIGHT is its refined analysis of the speech spectrum: a series of advanced methods is adopted to modify and complement the original spectrum directly extracted by the short-time Fourier transform (STFT) [12].

The input speech is decomposed by STRAIGHT into three types of positive-valued parameters: an interference-free spectrogram, an aperiodicity map, and a fundamental frequency (F0) trajectory. The periodic interference in the time domain is eliminated by a pitch-adaptive smoothing filter, which effectively solves the size mismatch between the fixed time window and the variable pitch. The phase interference in the frequency domain is then taken into account.

A compensatory time window is designed to remove the holes in the spectrogram caused by out-of-phase components. Moreover, a compensation procedure for over-smoothing in the frequency domain is used to recover some of the underlying spectral structure and to further improve the analysis-synthesis performance of the STRAIGHT model [12].

[Figure; recoverable label: "Spectrum extraction with time frequency"]

Fig 3. The procedure of smoothed spectrum [12]

One advantage of the STRAIGHT model is the aperiodic component parameter, which can effectively describe the voicing attribute in order to enhance the quality of the synthesized speech. In traditional speech models, such as the Multi-Band Excitation (MBE) model, voiced/unvoiced labels are simply assigned to the speech in the time-frequency domain by hard clustering, which can only identify whether a frame or frequency band of the speech is voiced or not. In contrast to such clustering, the aperiodic component is more flexible: it represents the ratio between the periodic and noise energies, which are defined by the upper and lower smoothed spectral envelopes of the speech, respectively.

B. TANDEM-STRAIGHT

TANDEM-STRAIGHT superseded the original STRAIGHT, bringing about a complete reformulation and re-engineering based on the same underlying concept. When a time-invariant linear system is excited by a periodic pulse train, its output yields a spectrogram that has periodic interference in both the time and frequency domains, even if the system and the input are temporally stable and spectrally smooth. This is the major problem that STRAIGHT and TANDEM-STRAIGHT were designed to address.

TANDEM is a short-term power spectral representation of periodic signals that does not have a temporally varying component. The TANDEM procedure aims at shortening the window length while keeping the power spectra temporally constant and the logarithmic power spectra tolerant to background noise. Measures were therefore introduced for the window length, the frequency variation, the temporal variation of the power spectra, and the temporal variation of the logarithmic power spectra.
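The temporally stable property can be checked numerically. The sketch below rests on the assumption, based on our reading of [13], that the TANDEM spectrum is the average of two power spectra taken half a fundamental period apart. The pulse-train signal, the window length of two periods and the chosen frequency bin are illustrative choices only.

```python
import numpy as np

period = 100                            # fundamental period T0 in samples (assumed)
signal = np.zeros(600)
signal[::period] = 1.0                  # periodic pulse train

win_len = 2 * period
n = np.arange(win_len)
window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / win_len)   # periodic Hann window

def power_spectrum(start):
    """Power spectrum of the windowed frame beginning at sample `start`."""
    frame = signal[start:start + win_len] * window
    return np.abs(np.fft.rfft(frame)) ** 2

starts = np.arange(period)              # slide the window over one full period
single = np.array([power_spectrum(t) for t in starts])
tandem = np.array([0.5 * (power_spectrum(t) + power_spectrum(t + period // 2))
                   for t in starts])

k = 21                                  # bin midway between harmonics 10 and 11
single_variation = single[:, k].std()
tandem_variation = tandem[:, k].std()
```

At the chosen bin, the single-window power spectrum fluctuates strongly with the window position, while the averaged spectrum stays constant up to rounding error.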

STRAIGHT uses an F0-adaptive triangular smoothing function h1(ω) as an additional anti-aliasing filter impulse response to eliminate the leakage caused by periodicity. Its base length is set to 2ω0. TANDEM-STRAIGHT uses an F0-adaptive rectangular function h2(ω) instead, with a base length of ω0. The smoothing function h1(ω) is obtained by convolving h2(ω) with itself. Smoothing TANDEM spectra with this anti-aliasing smoother selectively removes the spectral variations due to periodicity. It provides a simple decomposition, which separates the periodicity and response information almost perfectly [13].
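The relation between the two smoothing functions is easy to verify on a discrete frequency axis. In this small check, the number of bins per ω0 (`w0_bins`) is an arbitrary value chosen for illustration.

```python
import numpy as np

w0_bins = 8                              # assumed width of w0 in frequency bins
h2 = np.ones(w0_bins) / w0_bins          # rectangular smoother, base length w0
h1 = np.convolve(h2, h2)                 # rectangle convolved with itself

base_length = len(h1)                    # 2 * w0_bins - 1 samples
peak_index = int(np.argmax(h1))
```

The result `h1` is a symmetric triangle whose base is about twice as long as that of `h2`, which matches the base lengths 2ω0 and ω0 stated above.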

C. CHEAPTRICK

CheapTrick is a simple algorithm for high-quality speech synthesis that is superior to conventional ones both objectively and subjectively.

CheapTrick consists of power spectrum estimation with an F0-adaptive Hanning window, smoothing of the power spectrum, and spectral recovery in the quefrency domain. Objective evaluations show that the algorithm can obtain an accurate and temporally stable spectral envelope. Conventional algorithms other than STRAIGHT and TANDEM-STRAIGHT cannot fulfill two requirements: high estimation performance and removal of the time-varying component. CheapTrick is an algorithm that fulfills these requirements. The name CheapTrick comes from its cheap and tricky design based on conventional techniques such as F0-adaptive windowing and the cepstrum method. CheapTrick was superior to the other algorithms in terms of sound quality regardless of gender. Because the results include the sound quality of not only the re-synthesized speech but also the F0-manipulated speech, they suggest that CheapTrick is robust against F0 manipulation. The difference in sound quality for female speech was smaller than that for male speech, and this difference is associated with the objective evaluation results, in which the error at higher F0 was smaller than that at lower F0 [14].
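The three steps named above can be sketched for a single analysis frame. This is not the reference implementation of [14]. The window length of three fundamental periods, the smoothing width of 2·F0/3, the lifter shapes, the constants `q0` and `q1`, and the function name `cheaptrick_sketch` are all assumptions made for illustration.

```python
import numpy as np

def cheaptrick_sketch(x, fs, f0, center, fft_size=1024, q0=1.18, q1=-0.09):
    """Rough three-step spectral envelope estimate for one frame."""
    # Step 1: power spectrum with an F0-adaptive Hanning window (assumed 3*T0 long).
    half = int(round(1.5 * fs / f0))
    # Indices outside the signal are clipped, which repeats the edge samples.
    idx = np.clip(np.arange(center - half, center + half + 1), 0, len(x) - 1)
    frame = x[idx] * np.hanning(len(idx))
    power = np.abs(np.fft.rfft(frame, fft_size)) ** 2

    # Step 2: smooth the power spectrum with a rectangular window
    # (assumed width 2*f0/3, converted to frequency bins).
    width = max(1, int(round((2.0 * f0 / 3.0) / (fs / fft_size))))
    smoothed = np.convolve(power, np.ones(width) / width, mode="same")

    # Step 3: liftering in the quefrency domain (smoothing and recovery lifters).
    log_spec = np.log(smoothed + 1e-12)
    full = np.concatenate([log_spec, log_spec[-2:0:-1]])  # even-symmetric spectrum
    cepstrum = np.fft.ifft(full).real
    q = np.arange(fft_size) / fs
    q = np.minimum(q, fft_size / fs - q)                  # symmetric quefrency axis
    lifter = np.sinc(f0 * q) * (q0 + 2.0 * q1 * np.cos(2.0 * np.pi * f0 * q))
    return np.exp(np.fft.fft(cepstrum * lifter).real[: fft_size // 2 + 1])

fs, f0 = 16000, 100.0
x = np.zeros(4000)
x[:: int(fs / f0)] = 1.0                                  # synthetic pulse train
envelope = cheaptrick_sketch(x, fs, f0, center=2000)
```

The function returns one envelope value per frequency bin from 0 Hz to fs/2. Because the window length and the smoothing width both follow F0, the harmonic structure is suppressed in a way that adapts to the pitch of the frame.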

IV. CONCLUSION

This paper reviews widely researched methods of statistical parametric speech synthesis. The autoregressive HMM supports existing high-quality speech parameter generation methods, such as parameter generation considering global variance, and also supports a simple and exact time-recursive form of generation. Gaussian process regression (GPR), with hyperparameter optimization, outperforms the conventional HMM-based approach. Neural autoregressive distribution estimators (NADE) overcome limitations of restricted Boltzmann machines (RBMs): NADE is an easy-to-implement and easy-to-train model for joint distributions that yields a tractable distribution function. In future work, NADE could be used on problems other than distribution estimation, in particular on problems for which RBMs and autoencoders are often considered. Deep neural networks (DNNs) were also reviewed. These approaches describe the distribution of spectral envelopes, make the synthetic speech less monotonic, and improve the naturalness of the synthesized speech.

Vocoder quality is the major drawback of SPSS, so the recently evolving vocoder algorithms STRAIGHT, TANDEM-STRAIGHT and CheapTrick were comparatively reviewed. Among these, CheapTrick outperforms the other methods: it obtains a temporally stable spectral envelope and synthesizes speech with higher sound quality than speech synthesized with the other algorithms.

V. REFERENCES

[1] Dr. Shaila Apte, "Speech Synthesis," chapter in the book Speech and Audio Processing, 2013.

[2] David Suendermann, Harald Hoge, and Alan Black, "Challenges in Speech Synthesis", chapter 2 of F. Chen, K. Jokinen, Speech Technology, Springer Science+Business Media, LLC 2010.

[3] Simon King, "An introduction to statistical parametric speech synthesis", Sadhana Vol. 36, Part 5, October 2011, pp. 837-85.

[4] Heiga Zen, Keiichi Tokuda, Alan W. Black, "Statistical Parametric Speech Synthesis", submitted to Speech Communication, April 6 2009.

[5] Gregory E. Cox, George Kachergis, Richard M. Shiffrin, "Gaussian Process Regression for Trajectory Analysis".

[6] Matt Shannon, Heiga Zen, and William Byrne, "Autoregressive Models for Statistical Parametric Speech Synthesis", IEEE Transactions on Audio, Speech and Language Processing, Vol. 21, No. 3, March 2013.

[7] Heiga Zen, Andrew Senior, "Deep Mixture Density Network for acoustic modelling in statistical parametric speech synthesis".

[8] Tomoki Koriyama, Takashi Nose, Takao Kobayashi, "Statistical Parametric Speech Synthesis based on Gaussian Process Regression", IEEE Journal of Selected Topics in Signal Processing, Vol. 8, No. 2, pp. 173-183, 2014.

[9] Tomoki Koriyama, Takashi Nose, Takao Kobayashi, "Parametric Speech Synthesis using local and global variance", 24th IEEE International Workshop on Machine Learning and Signal Processing, 2014.

[10] Zhen-Hua Ling, Li Deng, and Dong Yu, "Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis", IEEE Transactions on Audio, Speech and Language Processing, Vol. 21, No. 10, October 2013.

[11] Xiang Yin, Zhen-Hua Ling, Li-Rong Dai, "Spectral modelling using Neural Autoregressive distribution estimators for statistical parametric speech synthesis", 2014 IEEE International Conference on Acoustic and Speech Processing.

[12] Yibin Tang, Ning Xu, Yuan Gao, Changping Zhu, Qingbang Han, "A Simplified STRAIGHT Model with Aperiod Component Reconstruction", Journal of Computational Information Systems 9: 5 (2013).

[13] Hideki Kawahara and Masanori Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework", Sadhana Vol. 36, Part 5, October 2011, pp. 713-727.

[14] Masanori Morise, "CheapTrick, a spectral envelope estimator for high-quality speech synthesis", ScienceDirect, available online September 2014.
