Interspeech 2011 – - TUNI · – Interspeech 2011 – Hanna Silén, Elina Helander, Moncef...


Page 1: Interspeech 2011 – - TUNI · – Interspeech 2011 – Hanna Silén, Elina Helander, Moncef Gabbouj Tampere University of Technology, Finland hanna.silen@tut.fi 1 28.8.2011. Introduction

– Interspeech 2011 –

Hanna Silén, Elina Helander, Moncef Gabbouj

Tampere University of Technology, Finland

hanna.silen@tut.fi


28.8.2011


Introduction

Asynchronous voice aperiodicity prediction

Objective evaluations

Conclusions


• In parametric speech synthesis, such as hidden Markov model (HMM) based text-to-speech (TTS) synthesis, speech must be parameterized into a form that enables control over the perceptually important features of speech.

• Typical parameterization schemes use the familiar source-filter decomposition to decompose speech into spectral and excitation parts.

• In the conventional approach, all speech features are modeled simultaneously using hidden Markov models (HMM) or hidden semi-Markov models (HSMM) with Gaussian observations.

• To cope with data sparseness and to enable prediction for unseen contexts in synthesis, speech features are clustered separately using minimum description length (MDL) based decision tree clustering.


• Even for voiced signals, the vocal cord vibration is not purely harmonic, and hence the modeling of speech aperiodicities is essential for high-quality speech synthesis:

• For high-quality waveform re-synthesis, modeling of voice aperiodicities is essential.

• Voicing decisions describe whether the signal is voiced or not – whether there is a related F0 or not:

• Also including the amount of devoicing, as described by the voice aperiodicity, in the speech parameterization improves the vocoding quality.


• Even though the aperiodicity measure is needed in synthesis, its role in HMM training is rather limited:

• The spectrum part is needed to create reliable labeling for the training data and to provide segmental intelligibility for synthesis.

• F0 modeling is needed for synthesis of natural intonations.

• However, increasing the number of model parameters also increases the computational load of the training.


• Asynchronous voice aperiodicity prediction based on synthetic spectra:

• Here we investigate an alternative prediction approach that exploits the possible correlation between the spectral and aperiodicity features.

• The approach provides prediction accuracy comparable to that of the HMM-based approach, but with fewer model parameters.


• STRAIGHT (http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv) is a high-quality vocoder that is widely used for speech analysis and waveform re-synthesis in HMM-based speech synthesis.

• It parameterizes the speech waveform into a spectral envelope, free of periodic interference in the time or frequency domain, and a mixed mode excitation signal.

• The mixed mode excitation signal of STRAIGHT consists of F0 with binary voicing decisions and the relative level of voice aperiodicity at each frequency:

• Voice aperiodicity is defined as the relative energy of aperiodic components at each frequency – for HMM modeling typically encoded as average band aperiodicity (BAP) of frequency sub-bands.

• Binary voicing decisions determine whether there is an F0 related to the signal segment or not – in HMM modeling typically embedded in F0 training.
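The band-averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `band_aperiodicity`, the five band edges, the linear frequency grid, and averaging in the dB domain are all choices made here, not details given in the slides or taken from STRAIGHT itself.

```python
import numpy as np

def band_aperiodicity(ap_db, sample_rate, band_edges_hz):
    """Average a per-frequency aperiodicity track into band values.

    ap_db: array (frames, bins), aperiodicity in dB on a linear frequency
    grid from 0 Hz to sample_rate / 2 (an assumption of this sketch).
    band_edges_hz: increasing band boundaries in Hz.
    Returns an array (frames, len(band_edges_hz) - 1).
    """
    ap_db = np.asarray(ap_db, dtype=float)
    n_bins = ap_db.shape[1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bands = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        # Include the upper edge only in the last band.
        is_last = hi == band_edges_hz[-1]
        mask = (freqs >= lo) & ((freqs <= hi) if is_last else (freqs < hi))
        bands.append(ap_db[:, mask].mean(axis=1))
    return np.stack(bands, axis=1)

# Hypothetical five-band split for 16 kHz speech (illustrative only).
EDGES_HZ = [0, 1000, 2000, 4000, 6000, 8000]
ap = np.tile(np.linspace(-30.0, 0.0, 513), (3, 1))  # toy data, 3 frames
bap = band_aperiodicity(ap, 16000, EDGES_HZ)        # shape (3, 5)
```

Each row of `bap` then holds one five-dimensional BAP vector per frame, which is the kind of response vector the later prediction step maps to.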


[Figure panels: spectral envelope; level of aperiodicity]


• STRAIGHT decomposes speech into a spectral envelope without periodic interferences from intonation and a mixed mode excitation with F0 and level of voice aperiodicity.

• However, even after removal of the intonation interferences, there is clear correlation between the spectral and aperiodicity features.

• In the conventional HMM-based training, this correlation is, however, not exploited.

• In this approach we use the dependency and predict the aperiodicity features based on the spectral ones.


• In this paper, we investigate local prediction for aperiodicity features (BAP and voicing decisions) based on synthetic spectral features in HMM-based speech synthesis.

• The training is done asynchronously, separately from the HMM training of spectral features.

• Local mapping is done with Gaussian mixture models (GMMs), an approach similar to the one used in voice conversion to map between the spectral features of two speakers.

• The training phase aims at finding a prediction function from predictors (spectral features) to responses (BAP or voicing decisions).

• In the synthesis phase, the formed prediction function is further used for mapping data unseen in the training phase.


• We model the distribution of the spectral feature vectors x_t using GMMs, i.e., as a weighted sum of N Gaussian components:

• The posterior probability of the vector x_t belonging to the nth cluster is:
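The two formulas referred to on this slide are the standard GMM density and component posterior. In one common notation they can be written as below; the symbols \alpha_n, \mu_n, \Sigma_n, and p_{n,t} are assumptions made here and may differ from those on the original slide.

```latex
\[
  p(x_t) = \sum_{n=1}^{N} \alpha_n \, \mathcal{N}(x_t;\, \mu_n, \Sigma_n),
  \qquad \sum_{n=1}^{N} \alpha_n = 1
\]
\[
  p_{n,t} = P(n \mid x_t)
          = \frac{\alpha_n \, \mathcal{N}(x_t;\, \mu_n, \Sigma_n)}
                 {\sum_{k=1}^{N} \alpha_k \, \mathcal{N}(x_t;\, \mu_k, \Sigma_k)}
\]
```

Here \alpha_n is the weight of the nth component and \mathcal{N}(\cdot;\, \mu_n, \Sigma_n) is a Gaussian with mean \mu_n and covariance \Sigma_n, so the posteriors p_{n,t} sum to one over n for every frame.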


• Direct mapping from spectral feature vectors x_t into aperiodicity feature vectors y_t (either BAP values or voicing decisions) would have the form:

and the mapping matrix β could be found using multivariate regression.

• To find local mapping functions from spectral features into aperiodicity features, we use GMM posterior probabilities to expand the input variable vectors:
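A minimal sketch of the posterior-based expansion, assuming diagonal covariances as in the evaluation setup. The exact layout of the expanded vector is an assumption made here: each frame is copied once per Gaussian, weighted by that Gaussian's posterior, and the posteriors are appended as bias terms. The function names are hypothetical.

```python
import numpy as np

def gmm_posteriors(X, weights, means, variances):
    """Posterior probability of each diagonal-covariance component.

    X: (T, D) frames; weights: (N,); means, variances: (N, D).
    Returns (T, N); each row sums to one.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]               # (T, N, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    log_joint = np.log(weights)[None, :] + log_gauss       # (T, N)
    # Subtract the row maximum before exponentiating for stability.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def expand_with_posteriors(X, post):
    """Build [p_1 * x, ..., p_N * x, p_1, ..., p_N] for every frame.

    The appended posteriors act as per-cluster bias terms
    (an assumption of this sketch).
    """
    T, _ = X.shape
    weighted = (post[:, :, None] * X[:, None, :]).reshape(T, -1)
    return np.hstack([weighted, post])

# Toy example with two components in three dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
variances = np.ones((2, 3))
post = gmm_posteriors(X, weights, means, variances)
X_exp = expand_with_posteriors(X, post)   # shape (6, 2 * 3 + 2)
```

Because each frame is dominated by the components it most likely belongs to, a single linear regression on `X_exp` behaves like a soft combination of per-cluster linear mappings, which is what makes the mapping local.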


• The least-squares solution for the coefficients of the mapping matrix β can be found using standard multivariate regression with the pseudo-inverse:

where the training input data is expanded using GMMs and the output data is not:
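The pseudo-inverse solution can be sketched with NumPy as follows. The row-per-frame data layout and the names `fit_mapping` and `predict` are assumptions of this sketch; the toy data is noise-free so that the recovered matrix can be compared with the one used to generate it.

```python
import numpy as np

def fit_mapping(X_exp, Y):
    """Least-squares mapping matrix beta for y_t ~ beta @ x_t.

    X_exp: (T, P) expanded predictors, one frame per row.
    Y: (T, Q) responses (BAP values or voicing decisions), not expanded.
    Returns beta with shape (Q, P).
    """
    # With frames stored as rows, Y ~ X_exp @ beta.T,
    # so beta.T = pinv(X_exp) @ Y.
    return (np.linalg.pinv(X_exp) @ Y).T

def predict(X_exp, beta):
    """Apply the mapping to expanded predictors, one frame per row."""
    return X_exp @ beta.T

# Toy check: recover a known mapping from noise-free data.
rng = np.random.default_rng(1)
X_exp = rng.normal(size=(200, 8))
beta_true = rng.normal(size=(5, 8))
Y = X_exp @ beta_true.T
beta = fit_mapping(X_exp, Y)
Y_hat = predict(X_exp, beta)
```

In the synthesis phase the same `beta` would be applied to expanded synthetic spectral features that were not seen during training.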


• In speech signals, there is a rather strong correlation between adjacent frames.

• To exploit this correlation to improve the modeling accuracy, we further augment the regression source data with the corresponding representations of the neighboring frames:

• Dynamic modeling is likely to increase the correlation between adjacent predictors; thus, e.g., partial least squares regression could be used instead of standard multivariate regression.
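Augmenting each frame with its neighbors can be sketched as below. The number of neighboring frames and the handling of utterance boundaries are not stated in the slides; this sketch assumes one frame on each side and repeats the edge frames. The name `stack_neighbors` is hypothetical.

```python
import numpy as np

def stack_neighbors(X, context=1):
    """Append the `context` previous and following frames to each frame.

    X: (T, D). Returns (T, (2 * context + 1) * D), ordered from the
    oldest neighbor to the newest. Edge frames are repeated at the
    utterance boundaries (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    T = X.shape[0]
    padded = np.pad(X, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])

X = np.arange(8.0).reshape(4, 2)       # frames [0,1], [2,3], [4,5], [6,7]
X_dyn = stack_neighbors(X, context=1)  # shape (4, 6)
```

Since neighboring frames are strongly correlated, the stacked columns are close to collinear, which is the reason the slide mentions partial least squares as an alternative to plain multivariate regression.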


• Objective evaluations:

• Accuracy of the proposed aperiodicity prediction approach was evaluated objectively.

• Evaluations consisted of two parts:

• Prediction of the band aperiodicity (BAP) of five bands based on spectral parameters (Mel-cepstral coefficients, MCCs).

• Prediction of the voicing decisions (whether there is F0 or not) based on spectral parameters (MCCs).


• Speech data: CMU ARCTIC speakers slt (female) and rms (male)

• Training: Set A with 593 utterances

• Testing: Set B with 539 utterances

• Parameterization: STRAIGHT parameterization with

• Mel-cepstral coefficients (MCCs) of order 24

• F0 with binary voicing decisions

• Mean band aperiodicity (BAP) of five frequency bands


• HMM training: MCC, F0, and baseline BAP modeling using HTS

• 5-state left-to-right HSMMs

• GMM training: BAP and voicing prediction models from GMMs

• 8 Gaussians with diagonal covariances


• Speaker-dependent training based on the training data:

Baseline: traditional HMM modeling with HSMMs, delta-augmented features, and decision tree-based context clustering.

Proposed I: prediction based on spectral parameters (MCCs) using GMMs (8 Gaussians with diagonal covariance matrices) and multivariate regression with dynamic modeling.

• And as a reference:

Proposed II: as Proposed I but without dynamic modeling.

Proposed III: prediction based on spectral parameters (MCCs) using standard multivariate regression with dynamic modeling (no GMMs).

Proposed IV: as Proposed III but without dynamic modeling.


[Figure: panels for male speaker rms and female speaker slt; labels: synthetic (from HMMs), recorded, HMM, Proposed I]


• Evaluation of the prediction accuracy for the test data:

• BAP prediction RMSE values for each frequency band, and

• Percentage of incorrect voicing decisions.
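The two objective measures can be computed as in the following sketch. The function names are hypothetical, and thresholding a continuous regression output at 0.5 to obtain a binary voicing decision is an assumption made here, not a detail stated in the slides.

```python
import numpy as np

def bap_rmse_per_band(bap_true, bap_pred):
    """Root-mean-square error over frames, one value per band."""
    err = np.asarray(bap_pred, dtype=float) - np.asarray(bap_true, dtype=float)
    return np.sqrt(np.mean(err ** 2, axis=0))

def voicing_error_percent(voiced_true, voicing_score, threshold=0.5):
    """Percentage of frames whose voicing decision is wrong.

    voicing_score is a continuous prediction; scores at or above
    `threshold` count as voiced (an assumption of this sketch).
    """
    voiced_pred = np.asarray(voicing_score, dtype=float) >= threshold
    voiced_ref = np.asarray(voiced_true).astype(bool)
    return 100.0 * np.mean(voiced_pred != voiced_ref)

# Toy data: two frames with two bands, and four voicing decisions.
bap_true = np.array([[0.0, -2.0], [0.0, -4.0]])
bap_pred = np.array([[1.0, -2.0], [-1.0, -1.0]])
rmse = bap_rmse_per_band(bap_true, bap_pred)
voicing_err = voicing_error_percent([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
```

In this toy case the third frame is voiced but scores below the threshold, so one decision out of four is wrong.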


[Table: results for HSMM + MDL clustering and GMM + dynamic modeling]


• Differences between the system Proposed I (with GMM and dynamic modeling) and the HMM-based baseline are small:

• In voicing decisions, the error rates of the proposed system were slightly lower than those of the baseline.

• In BAP prediction, the RMSE values of the traditional approach were somewhat lower.

• The small difference suggests that comparable accuracy can be achieved by the proposed approach that predicts the aperiodicity features based on the spectrum instead of the context-dependent labels.

• The comparison of the systems Proposed I-IV shows that both the use of GMM-based modeling and dynamics can increase the prediction accuracy compared to the direct mapping from spectral parameters into bandwise aperiodicities.


• The small differences are extremely difficult to detect in the synthesized waveforms.

• To give an idea of the similarity of the synthesis quality:

Baseline:

Proposed I:

• Randomly chosen synthesis samples are available at:

http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html


• We have investigated the use of an alternative method for the prediction of voice aperiodicities in the framework of HMM-based speech synthesis.

• The prediction approach employs GMM modeling and multivariate regression to form local mappings from synthetic spectral features into aperiodicity features:

• The role of the band aperiodicity in HMM parameter estimation is limited, so it can be left out of the training.

• The voicing decision modeling is typically embedded in the F0 modeling.


• In objective evaluation, the proposed approach was found to produce quality comparable to that of the conventional approach using HMM modeling and context clustering.

• Randomly selected synthesis samples available at:

http://www.cs.tut.fi/sgn/arg/silen/is2011/AperiodicityPrediction.html

• This is a starting point for future research:

• Our recent studies suggest that we can further improve the accuracy of the spectrum-based prediction.


Thank you for your attention.

Contact information:

Hanna Silén

Tampere University of Technology, Finland

hanna.silen@tut.fi
