– Interspeech 2011 – Hanna Silén, Elina Helander, Moncef Gabbouj Tampere University of Technology, Finland [email protected] 1 28.8.2011

  • – Interspeech 2011 –

    Hanna Silén, Elina Helander, Moncef Gabbouj

    Tampere University of Technology, Finland

    [email protected]

    28.8.2011

  • Introduction

    Asynchronous voice aperiodicity prediction

    Objective evaluations

    Conclusions

  • • In parametric speech synthesis, such as hidden Markov model (HMM) based text-to-speech (TTS) synthesis, speech must be parameterized into a form that enables control over the perceptually important features of speech.

    • Typical parameterization schemes use the familiar source-filter model to decompose speech into spectral and excitation parts.

    • In the conventional approach, all speech features are modeled simultaneously using hidden Markov models (HMMs) or hidden semi-Markov models (HSMMs) with Gaussian observations.

    • To cope with data sparseness and to enable prediction for unseen contexts in synthesis, speech features are clustered separately using minimum description length (MDL) based decision tree clustering.

  • • Even for voiced signals, the vocal cord vibration is not purely harmonic, and hence the modeling of speech aperiodicities is essential for high-quality speech synthesis:

    • For high-quality waveform re-synthesis, modeling of voice aperiodicities is essential.

    • Voicing decisions describe whether the signal is voiced or not – whether there is a related F0 or not:

    • Also including the amount of devoicing, described by the voice aperiodicity, in the speech parameterization improves the vocoding quality.

  • • Even though the aperiodicity measure is needed in synthesis, its role in HMM training is rather limited:

    • The spectrum part is needed to create reliable labeling for the training data and to provide segmental intelligibility for synthesis.

    • F0 modeling is needed for synthesis of natural intonations.

    • However, increasing the number of model parameters also increases the computational load of the training.

  • • Asynchronous voice aperiodicity prediction based on synthetic spectra:

    • Here we investigate an alternative prediction approach that exploits the possible correlation between the spectral and aperiodicity features.

    • The approach provides prediction accuracy comparable to the HMM-based approach, but with fewer model parameters.

  • • STRAIGHT (http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv) is a high-quality vocoder that is widely used for speech analysis and waveform re-synthesis in HMM-based speech synthesis.

    • It parameterizes the speech waveform into a spectral envelope without periodic interference in the time or frequency domain and a mixed mode excitation signal.

    • The mixed mode excitation signal of STRAIGHT consists of F0 with binary voicing decisions and the relative level of voice aperiodicity at each frequency:

    • Voice aperiodicity is defined as the relative energy of aperiodic components at each frequency – for HMM modeling it is typically encoded as the average band aperiodicity (BAP) of frequency sub-bands.

    • Binary voicing decisions determine whether there is an F0 related to the signal segment or not – in HMM modeling they are typically embedded in F0 training.
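
As an illustration of the band encoding, the sketch below averages per-bin aperiodicity values within each frequency sub-band. The function name `band_aperiodicity`, the band edges given as bin indices, and the choice to average the values directly (rather than, say, in dB) are assumptions made for this example, not details given in the slides.

```python
import numpy as np


def band_aperiodicity(ap, band_edges):
    """Average per-bin aperiodicity within each frequency band.

    ap: (T, F) array, aperiodicity per frame and frequency bin.
    band_edges: bin indices delimiting the bands (length n_bands + 1);
                the values used below are hypothetical.
    Returns a (T, n_bands) array of band means.
    """
    bands = zip(band_edges[:-1], band_edges[1:])
    return np.stack([ap[:, lo:hi].mean(axis=1) for lo, hi in bands], axis=1)


# Two frames, four frequency bins, split into two bands of two bins each.
ap = np.array([[0.0, 2.0, 4.0, 6.0],
               [1.0, 1.0, 3.0, 5.0]])
bap = band_aperiodicity(ap, [0, 2, 4])
```

With five bands, as used later in the evaluation, `band_edges` would simply hold six bin indices.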

    [Figure: spectral envelope and level of aperiodicity]

  • • STRAIGHT decomposes speech into a spectral envelope without periodic interferences from intonation and a mixed mode excitation with F0 and level of voice aperiodicity.

    • However, even after removal of the intonation interferences, there is a clear correlation between the spectral and aperiodicity features.

    • Conventional HMM-based training does not exploit this correlation.

    • In this approach, we use the dependency and predict the aperiodicity features from the spectral ones.

  • • In this paper, we investigate local prediction for aperiodicity features (BAP and voicing decisions) based on synthetic spectral features in HMM-based speech synthesis.

    • The training is done asynchronously, separately from the HMM training of spectral features.

    • Local mapping is done with Gaussian mixture models (GMMs), an approach rather similar to the one used in voice conversion for mapping between the spectral features of two speakers.

    • The training phase aims at finding a prediction function from predictors (spectral features) to responses (BAP or voicing decisions).

    • In the synthesis phase, the resulting prediction function is used for mapping data unseen in the training phase.

  • • We model the distribution of the spectral feature vectors x_t using GMMs, i.e., as a weighted sum of N Gaussian components:

    • The posterior probability of the vector x_t belonging to the nth cluster is:
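
In standard GMM notation, these two quantities can be written as follows. The symbols for the mixture weights, means, and covariances are assumptions chosen here for illustration, not taken from the slides.

```latex
% Assumed symbols: \alpha_n (mixture weights), \mu_n (means), \Sigma_n (covariances)
p(x_t) = \sum_{n=1}^{N} \alpha_n \, \mathcal{N}(x_t;\, \mu_n, \Sigma_n),
\qquad \sum_{n=1}^{N} \alpha_n = 1

p(n \mid x_t) =
  \frac{\alpha_n \, \mathcal{N}(x_t;\, \mu_n, \Sigma_n)}
       {\sum_{m=1}^{N} \alpha_m \, \mathcal{N}(x_t;\, \mu_m, \Sigma_m)}
```

The posteriors of one frame sum to one over the N components, so they act as soft cluster memberships.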

  • • Direct mapping from spectral feature vectors x_t into aperiodicity feature vectors y_t (either BAP values or voicing decisions) would have the form:

    and the mapping matrix β could be found using multivariate regression.

    • To find local mapping functions from spectral features into aperiodicity features, we use GMM posterior probabilities to expand the input variable vectors:
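
Written out, the direct mapping is a linear regression, and the expansion replaces x_t with a posterior-weighted vector. The expanded form below is one plausible choice assumed for illustration, not necessarily the exact form used in the talk; the residual e_t and the appended constant 1 (a bias term) are likewise assumptions.

```latex
% Direct (global) mapping with residual e_t (assumed symbol)
y_t = \beta \, x_t + e_t

% Assumed posterior-weighted expansion of the input, one block per Gaussian
\tilde{x}_t =
  \begin{bmatrix}
    p(1 \mid x_t) \, [\, x_t^{\top} \;\; 1 \,] & \cdots &
    p(N \mid x_t) \, [\, x_t^{\top} \;\; 1 \,]
  \end{bmatrix}^{\top},
\qquad
y_t = \beta \, \tilde{x}_t + e_t
```

Each block of β then acts as a local linear map that is active mainly for frames assigned to the corresponding Gaussian.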

  • • The least-squares solution for the coefficients of the mapping matrix β can be found using standard multivariate regression with the pseudo-inverse:

    where the training input data is expanded using GMMs and the output data is not:
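
The training step can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the GMM parameters are given by hand instead of being trained, the expansion appends a bias term, frames are stored as rows (so the mapping is `Y ≈ X_exp @ beta`, the transpose of a column-vector formulation), and the function names and toy data are hypothetical.

```python
import numpy as np


def gmm_posteriors(X, weights, means, variances):
    """Posterior p(n | x_t) for diagonal-covariance Gaussians.

    X: (T, D) frames; weights: (N,); means, variances: (N, D).
    """
    diff = X[:, None, :] - means[None, :, :]                   # (T, N, D)
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                      + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_pdf                      # (T, N)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def expand(X, post):
    """Posterior-weighted copies of [x_t, 1], one per Gaussian: (T, N * (D + 1))."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])              # append bias term
    return (post[:, :, None] * X1[:, None, :]).reshape(X.shape[0], -1)


def fit_mapping(X_exp, Y):
    """Least-squares mapping via the pseudo-inverse, so that Y ≈ X_exp @ beta."""
    return np.linalg.pinv(X_exp) @ Y


# Toy data: two well-separated clusters, each with its own linear map.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (200, 2)), rng.normal(3.0, 0.5, (200, 2))])
Y = np.where(X[:, :1] < 0, X @ np.array([[1.0], [2.0]]), X @ np.array([[-1.0], [0.5]]))

# Hand-set GMM parameters (a real system would estimate them, e.g. with EM).
weights = np.array([0.5, 0.5])
means = np.array([[-3.0, -3.0], [3.0, 3.0]])
variances = np.full((2, 2), 0.25)

post = gmm_posteriors(X, weights, means, variances)
X_exp = expand(X, post)          # only the input is expanded
beta = fit_mapping(X_exp, Y)     # the output Y is used as-is
pred = X_exp @ beta
```

Because the posteriors switch between the two blocks of `beta`, a single linear solve recovers a different local map for each cluster, which a global linear regression on `X` alone could not do.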

  • • In speech signals, there is a rather strong correlation between adjacent frames.

    • To exploit this correlation to improve the modeling accuracy, we further augment the regression source data with the corresponding representations of the neighboring frames:

    • Dynamic modeling is likely to increase the correlation between adjacent predictors, so, for example, partial least squares regression could be used instead of standard multivariate regression.
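
Augmenting each frame with its neighbors can be sketched as a simple stacking operation. The context width `k` and the edge handling (repeating the first and last frame) are assumptions of this example; the slides do not state how many neighboring frames are used.

```python
import numpy as np


def stack_context(X, k=1):
    """Append the k previous and k next frames to every frame.

    X: (T, D) array of per-frame predictors.
    Returns a (T, (2k + 1) * D) array ordered [t-k, ..., t, ..., t+k].
    Edges are padded by repeating the first/last frame (an assumption).
    """
    T = X.shape[0]
    padded = np.pad(X, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])


# Four frames of two-dimensional features.
X = np.arange(8.0).reshape(4, 2)
X_dyn = stack_context(X, k=1)
```

The stacked columns are strongly correlated with each other, which is the reason the slide suggests partial least squares as an alternative to plain multivariate regression.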

  • • Objective evaluations:

    • Accuracy of the proposed aperiodicity prediction approach was evaluated objectively.

    • Evaluations consisted of two parts:

    • Prediction of the band aperiodicity (BAP) of five bands based on spectral parameters (Mel-cepstral coefficients, MCCs).

    • Prediction of the voicing decisions (whether there is F0 or not) based on spectral parameters (MCCs).

  • • Speech data: CMU ARCTIC speakers slt (female) and rms (male)

    • Training: Set A with 593 utterances

    • Testing: Set B with 539 utterances

    • Parameterization: STRAIGHT parameterization with

    • Mel-cepstrum coefficients (MCCs) of order 24

    • F0 with binary voicing decisions

    • Mean band aperiodicity (BAP) of five frequency bands

  • • HMM training: MCC, F0, and baseline BAP modeling using HTS

    • 5-state left-to-right HSMMs

    • GMM training: BAP and voicing prediction models from GMMs

    • 8 Gaussians with diagonal covariances

  • • Speaker dependent training based on the training data:

    Baseline: traditional HMM-modeling with HSMMs and delta-augmented features and decision tree-based context clustering.

    Proposed I: prediction based on spectral parameters (MCCs) using GMMs (8 Gaussians with diagonal covariance matrices) and multivariate regression with dynamic modeling.

    • And as a reference:

    Proposed II: as Proposed I but without dynamic modeling.

    Proposed III: prediction based on spectral parameters (MCCs) using standard multivariate regression with dynamic modeling (no GMMs).

    Proposed IV: as Proposed III but without dynamic modeling.

  • [Figure panels: male speaker rms, female speaker slt; synthetic (from HMMs), recorded; HMM, Proposed I]

  • • Evaluation of the prediction accuracy for the test data:

    • BAP prediction RMSE values for each frequency band and

    • Percentage of incorrect voicing decisions.
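
The two measures can be computed as in the sketch below. The function names and the toy values are hypothetical and are not results from the evaluation; the slides do not state the unit of the BAP values, so none is assumed here.

```python
import math


def band_rmse(predicted, reference):
    """Root-mean-square error per band.

    Both inputs are lists of frames, each frame a list of band values.
    """
    n_frames = len(reference)
    n_bands = len(reference[0])
    return [
        math.sqrt(
            sum((predicted[t][b] - reference[t][b]) ** 2 for t in range(n_frames))
            / n_frames
        )
        for b in range(n_bands)
    ]


def voicing_error_percent(predicted, reference):
    """Percentage of frames whose binary voicing decision is wrong."""
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * wrong / len(reference)


# Toy example: two frames with two bands, and four voicing decisions.
rmse = band_rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [3.0, 0.0]])
err = voicing_error_percent([1, 0, 1, 1], [1, 1, 1, 0])
```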

    [Result labels: HSMM + MDL clustering; GMM + dynamic modeling]

  • • Differences between the system Proposed I (with GMM and dynamic modeling) and the HMM-based baseline are small:

    • In voicing decisions, the error rates of the proposed system were slightly lower than those of the baseline.

    • In BAP prediction, the RMSE values of the traditional approach were somewhat lower.

    • The small difference suggests that comparable accuracy can be achieved by the proposed approach that predicts the aperiodicity features based on the spectrum instead of the context-dependent labels.

    • The comparison of the systems Proposed I-IV shows that both the use of GMM-based modeling and dynamics can increase the prediction accuracy compared to the direct mapping from spectral parameters into bandwise aperiodicities.

  • • The small differences are extremely difficult to detect in the synthesized waveforms.

    • To give an idea of the similarity of the synthesis quality:

    Baseline:

    Proposed I:

    • Randomly chosen synthesis samples are available at:

    http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html

  • • We have investigated the use of an alternative method for the prediction of voice aperiodicities in the framework of HMM-based speech synthesis.

    • The prediction approach employs GMM modeling and multivariate regression to form local mappings from synthetic spectral features into aperiodicity features:

    • The role of the band aperiodicity in HMM parameter estimation is limited, and it can therefore be left out of the training.

    • The voicing decision modeling is typically embedded in the F0 modeling.

  • • In objective evaluation, the proposed approach was found to produce quality comparable to the conventional approach using HMM modeling and context clustering.

    • Randomly selected synthesis samples available at: http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html

    • This is a starting point for future research:

    • Our recent studies suggest that we can further improve the accuracy of the spectrum-based prediction.

  • Thank you for your attention.

    Contact information:

    Hanna Silén

    Tampere University of Technology, Finland

    [email protected]
