– Interspeech 2011 –
Hanna Silén, Elina Helander, Moncef Gabbouj
Tampere University of Technology, Finland
hanna.silen@tut.fi
1
28.8.2011
Introduction
Asynchronous voice aperiodicity prediction
Objective evaluations
Conclusions
• In parametric speech synthesis, such as hidden Markov model (HMM) based text-to-speech (TTS) synthesis, speech must be parameterized into a form that enables control over the perceptually important features of speech.
• Typical parameterization schemes use the familiar source-filter decomposition to decompose speech into spectral and excitation parts.
• In the conventional approach, all speech features are modeled simultaneously using hidden Markov models (HMM) or hidden semi-Markov models (HSMM) with Gaussian observations.
• To cope with the data sparseness and to enable prediction for unseen contexts in synthesis, speech features are clustered separately using minimum description length (MDL) based decision tree clustering.
• Even for the voiced signals, the vocal cord vibration is not purely harmonic and hence the modeling of speech aperiodicities is essential for high-quality speech synthesis:
• For high-quality waveform re-synthesis, modeling of voice aperiodicities is essential.
• Voicing decisions describe whether the signal is voiced or not – whether there is a related F0 or not:
• Including also the amount of devoicing described by the voice aperiodicity in the speech parameterization improves the vocoding quality.
• Even though the aperiodicity measure is needed in synthesis, its role in HMM training is rather limited:
• The spectrum part is needed to create reliable labeling for the training data and to provide segmental intelligibility for synthesis.
• F0 modeling is needed for synthesis of natural intonations.
• However, increasing the number of model parameters also increases the computational load of the training.
• Asynchronous voice aperiodicity prediction based on synthetic spectra:
• Here we investigate an alternative prediction approach employing the possible correlation of the spectral and aperiodicity features.
• The approach provides prediction accuracy comparable to that of the HMM-based approach, but with a lower number of model parameters.
• STRAIGHT (http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv) is a high-quality vocoder that is widely used for speech analysis and waveform re-synthesis in HMM-based speech synthesis.
• It parameterizes the speech waveform into a spectral envelope without periodic interferences in the time or frequency domain and a mixed mode excitation signal.
• The mixed mode excitation signal of STRAIGHT consists of F0 with binary voicing decisions and the relative level of voice aperiodicity at each frequency:
• Voice aperiodicity is defined as a relative energy of aperiodic components at each frequency – for HMM modeling typically encoded as average band aperiodicity (BAP) of frequency sub-bands.
• Binary voicing decisions determine whether there is an F0 related to the signal segment or not – in HMM modeling typically embedded in F0 training.
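As a rough illustration of the band-averaging idea behind BAP, the sketch below averages a per-bin aperiodicity matrix within frequency bands. The function name and the band edges (given as bin indices) are assumptions made for illustration only; the actual band limits and value scale used with STRAIGHT are not stated here.

```python
import numpy as np

def band_aperiodicity(ap, band_edges):
    """Average per-bin aperiodicity within each frequency band.

    ap:         (T, K) array, aperiodicity per frame and frequency bin
    band_edges: B + 1 increasing bin indices delimiting B bands
                (hypothetical values, chosen by the caller)
    Returns a (T, B) array of band averages.
    """
    bands = [
        ap[:, lo:hi].mean(axis=1)
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]
    return np.stack(bands, axis=1)
```

With five bands, `band_edges` would hold six indices and each frame would be reduced to a five-dimensional BAP vector.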
[Figure: spectral envelope and level of aperiodicity]
• STRAIGHT decomposes speech into a spectral envelope without periodic interferences from intonation and a mixed mode excitation with F0 and level of voice aperiodicity.
• However, even after removal of the intonation interferences, there is clear correlation between the spectral and aperiodicity features.
• In the conventional HMM-based training, this correlation is, however, not exploited.
• In this approach we use the dependency and predict the aperiodicity features based on the spectral ones.
• In this paper, we investigate local prediction for aperiodicity features (BAP and voicing decisions) based on synthetic spectral features in HMM-based speech synthesis.
• The training is done asynchronously, separately from the HMM training of spectral features.
• Local mapping is done with Gaussian mixture models (GMMs), an approach similar to the one used in voice conversion for mapping between the spectral features of two speakers.
• The training phase aims at finding a prediction function from predictors (spectral features) into responses (BAP or voicing decisions).
• In the synthesis phase, the formed prediction function is further used for mapping data unseen in the training phase.
• We model the distribution of the spectral feature vectors xt using GMMs i.e. as a weighted sum of N Gaussian components:
• The posterior probability of the vector xt belonging to the nth cluster is:
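The two quantities above follow the standard GMM definitions: the density is a weighted sum of Gaussian densities, and the posterior of component n is its weighted density divided by the sum over all components (Bayes' rule). A minimal numpy sketch, assuming diagonal covariances and an already trained model; the function and argument names are illustrative and not taken from the original work:

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Posterior probability of each frame belonging to each Gaussian.

    X:         (T, D) spectral feature vectors x_t
    weights:   (N,)   mixture weights, summing to 1
    means:     (N, D) component means
    variances: (N, D) diagonal covariance entries
    Returns a (T, N) array whose rows sum to 1.
    """
    diff = X[:, None, :] - means[None, :, :]                        # (T, N, D)
    # Log of the Gaussian normalization constant for each component.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_lik = log_norm[None, :] - 0.5 * np.sum(
        diff ** 2 / variances[None, :, :], axis=2
    )
    log_joint = np.log(weights)[None, :] + log_lik
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

Working in the log domain avoids underflow when a frame lies far from all component means.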
• Direct mapping from spectral feature vectors xt into aperiodicity feature vectors yt (either BAP values or voicing decisions) would have a form:
and the mapping matrix β could be found using multivariate regression.
• To find local mapping functions from spectral features into aperiodicity features, we use GMM posterior probabilities to expand the input variable vectors:
• The least-squares solution for the coefficients of the mapping matrix β can be found using standard multivariate regression with pseudo-inverse:
where the training input data is expanded using GMMs and the output data is not:
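The expansion and the least-squares solution can be sketched as follows. The exact form of the expanded vector is not recoverable from the text, so this sketch assumes one common choice: for each Gaussian, a copy of the input vector with an appended bias term, weighted by that Gaussian's posterior. All function names are hypothetical, and the posterior matrix is taken as given.

```python
import numpy as np

def expand_with_posteriors(X, post):
    """Build posterior-weighted inputs (assumed form of the expansion).

    X:    (T, D) spectral feature vectors
    post: (T, N) GMM posterior probabilities
    Returns a (T, N * (D + 1)) matrix: for each Gaussian n, the block
    post[t, n] * [x_t, 1].
    """
    T = X.shape[0]
    X1 = np.hstack([X, np.ones((T, 1))])          # append a bias term
    return (post[:, :, None] * X1[:, None, :]).reshape(T, -1)

def fit_mapping(X_exp, Y):
    """Least-squares mapping matrix via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(X_exp) @ Y

def predict(X_exp, beta):
    """Apply the mapping to (expanded) input data."""
    return X_exp @ beta
```

Only the input side is expanded; the output matrix `Y` (BAP values or voicing decisions) is used as is, which matches the description above. With this expansion the prediction is a posterior-weighted sum of per-Gaussian linear mappings.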
• In speech signals, there is a rather strong correlation between adjacent frames.
• To exploit this correlation to improve the modeling accuracy, we further augment the regression source data with the corresponding representations of the neighboring frames:
• Dynamic modeling is likely to increase the correlation between adjacent predictors, so, for example, partial least squares regression could be used instead of standard multivariate regression.
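The augmentation with neighboring frames can be sketched as a simple frame-stacking step. The number of neighbors and the handling of utterance edges are not specified above; the sketch assumes a symmetric window of `context` frames on each side and pads the edges by repeating the first and last frame. The function name is hypothetical.

```python
import numpy as np

def stack_neighbors(X, context=1):
    """Concatenate each frame with its left and right neighbors.

    X:       (T, D) per-frame representations (e.g. expanded inputs)
    context: number of neighboring frames on each side (assumed value)
    Returns a (T, (2 * context + 1) * D) matrix. Edge frames are padded
    by repeating the first/last frame.
    """
    T = X.shape[0]
    padded = np.pad(X, ((context, context), (0, 0)), mode="edge")
    # Slice k holds the frame at offset k - context relative to frame t.
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])
```

Stacking multiplies the input dimension by `2 * context + 1`, and the stacked columns are strongly correlated, which is the motivation for considering partial least squares regression.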
• Objective evaluations:
• Accuracy of the proposed aperiodicity prediction approach was evaluated objectively.
• Evaluations consisted of two parts:
• Prediction of the band aperiodicity (BAP) of five bands based on spectral parameters (Mel-cepstral coefficients, MCCs).
• Prediction of the voicing decisions (whether there is F0 or not) based on spectral parameters (MCCs).
• Speech data: CMU ARCTIC speakers slt (female) and rms (male)
• Training: Set A with 593 utterances
• Testing: Set B with 539 utterances
• Parameterization: STRAIGHT parameterization with
• Mel-cepstrum coefficients (MCCs) of order 24
• F0 with binary voicing decisions
• Mean band aperiodicity (BAP) of five frequency bands
• HMM training: MCC, F0, and baseline BAP modeling using HTS
• 5-state left-to-right HSMMs
• GMM training: BAP and voicing prediction models from GMMs
• 8 Gaussians with diagonal covariances
• Speaker dependent training based on the training data:
Baseline: traditional HMM-modeling with HSMMs and delta-augmented features and decision tree-based context clustering.
Proposed I: prediction based on spectral parameters (MCCs) using GMMs (8 Gaussians with diagonal covariance matrices) and multivariate regression with dynamic modeling.
• And as a reference:
Proposed II: as Proposed I but without dynamic modeling.
Proposed III: prediction based on spectral parameters (MCCs) using standard multivariate regression with dynamic modeling (no GMMs).
Proposed IV: as Proposed III but without dynamic modeling.
[Figure panels: male speaker rms, female speaker slt. Labels: synthetic (from HMMs), recorded, HMM, Proposed I]
• Evaluation of the prediction accuracy for the test data:
• BAP prediction RMSE values for each frequency band and
• Percentage of incorrect voicing decisions.
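The two objective measures can be computed as in the sketch below. It assumes frame-aligned reference and predicted features; the function names are hypothetical, and the scale of the BAP values is whatever the parameterization produces.

```python
import numpy as np

def bap_rmse(bap_ref, bap_pred):
    """Root-mean-square error per frequency band.

    bap_ref, bap_pred: (T, B) reference and predicted BAP values
    Returns a (B,) array with one RMSE value per band.
    """
    err = np.asarray(bap_ref, dtype=float) - np.asarray(bap_pred, dtype=float)
    return np.sqrt(np.mean(err ** 2, axis=0))

def voicing_error_rate(voiced_ref, voiced_pred):
    """Percentage of frames whose binary voicing decision is incorrect."""
    ref = np.asarray(voiced_ref, dtype=bool)
    pred = np.asarray(voiced_pred, dtype=bool)
    return 100.0 * np.mean(ref != pred)
```

For example, one wrong decision out of four frames gives a voicing error rate of 25 percent.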
[Results table: HSMM + MDL clustering (baseline) vs. GMM + dynamic modeling (Proposed I)]
• Differences between the system Proposed I (with GMM and dynamic modeling) and the HMM-based baseline are small:
• In voicing decisions, the error rates of the proposed system were slightly lower than those of the baseline.
• In BAP prediction, the RMSE values of the traditional approach were somewhat lower.
• The small difference suggests that comparable accuracy can be achieved by the proposed approach that predicts the aperiodicity features based on the spectrum instead of the context-dependent labels.
• The comparison of the systems Proposed I-IV shows that both the use of GMM-based modeling and dynamics can increase the prediction accuracy compared to the direct mapping from spectral parameters into bandwise aperiodicities.
• The small differences are extremely difficult to detect in the synthesized waveforms.
• To give an idea of how similar the synthesis quality is:
Baseline:
Proposed I:
• Randomly chosen synthesis samples are available at:
http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html
• We have investigated the use of an alternative method for the prediction of voice aperiodicities in the framework of HMM-based speech synthesis.
• The prediction approach employs GMM modeling and multivariate regression to form local mappings from synthetic spectral features into aperiodicity features:
• The role of the band aperiodicity in HMM parameter estimation is limited and can therefore be left out from the training.
• The voicing decision modeling is typically embedded in the F0 modeling.
• In the objective evaluation, the proposed approach was found to produce quality comparable to that of the conventional approach using HMM modeling and context clustering.
• Randomly selected synthesis samples available at: http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html
• This is a starting point for future research:
• Our recent studies suggest that we can further improve the accuracy of the spectrum-based prediction.
Thank you for your attention.
Contact information:
Hanna Silén
Tampere University of Technology, Finland
hanna.silen@tut.fi