On Measuring and Estimating Speech Articulation
Dr Korin Richmond
Workshop on Speech Production in Automatic Speech Recognition, August 30th 2013, Lyon, France
© Korin Richmond 2013. For individual use only. No copying or reuse of content without prior permission.
Talk outline
1. Why articulation?
2. Available articulatory data
3. Inversion with Artificial Neural Nets (ANN)
• MLP vs. Mixture Density Network
• Deep ANN models
4. Is this any good though?
• is it an adequate articulatory representation?
• what is the best performance possible?
5. Summary
Why might articulation be useful?
• An articulatory representation of speech has attractive properties
• relatively slow, smooth
• physical constraints - (e.g. no “jumps”)
• Constraints potentially useful for
• low bit-rate speech coding
• speech synthesis
• speech training
• avatar animation/lip synching
• and of course ASR...
How to capture speech articulator movements?
• Imaging: Video, Ultrasound, X-ray Cineradiography, MRI
• Contact: Electropalatography, laryngography
• Point tracking: Mocap, X-ray microbeam, Electromagnetic articulography (EMA)
Advertisement: mngu0 articulatory corpus
Collaborators:
• EMA: Phil Hoole (LMU, Germany), Simon King (Edinburgh)
• MRI: Ingmar Steiner (Saarland, Germany), Ian Marshall (Edinburgh), Calum Gray (Edinburgh)
• Dental: Laurie Littlejohn (CoreDental, Glasgow)
mngu0 - EMA (day1) data set
• Carstens AG500 EMA
• Articulators: Upper and lower lips, jaw, and three tongue points
• >1,300 phonetically-rich utterances
• Good audio
mngu0 - vocal tract anatomy
• 3D volume scans (26 slices, 4 mm, 256×256 px): 13 vowels, 16 consonants
• Midsagittal “dynamic” scans, with 16 consonants in 3 vowel contexts (e.g. “apa”)
• Acoustic reference recordings
mngu0 web forum
Website => hub for mngu0 activity
Registration required
Default non-commercial licence
User uploads strongly encouraged
The inversion mapping problem
Speech production
Inversion mapping
ANN for the inversion mapping
(thanks to Benigno Uría for animation)
Inversion mapping - characteristics
[Figure: the word “oar” produced with two different articulator configurations]
• Interesting modelling problem:
• non-linear
• one-to-many mappings (=ill-posed problem)
Mixture density network (MDN) suits ill-posed problems
[Figure: for input vector x, an MLP gives the conditional mean E(t|x); an MDN combines a neural network with a mixture model to give the full conditional probability density p(t|x), with mixture parameters (α_m, μ_m, σ_m)]
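The MLP/MDN contrast can be sketched in code. This is a minimal illustrative sketch, not the network from the talk: all sizes and weights below are made up and untrained, so it only shows how an MDN's output layer parameterises a Gaussian mixture density p(t|x), while a plain MLP would give only the conditional mean E(t|x).

```python
import numpy as np

rng = np.random.default_rng(0)

M = 3             # number of mixture components (illustrative)
D_in, H = 10, 20  # input and hidden layer sizes (illustrative)

# Untrained random weights, just to show the shapes involved.
W1 = rng.normal(scale=0.1, size=(H, D_in)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(3 * M, H)); b2 = np.zeros(3 * M)

def mdn_params(x):
    """Map input x to mixture parameters (alpha_m, mu_m, sigma_m)."""
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    z_alpha, mu, z_sigma = np.split(z, 3)
    alpha = np.exp(z_alpha - z_alpha.max())
    alpha /= alpha.sum()          # softmax: mixing coefficients sum to 1
    sigma = np.exp(z_sigma)       # exponential keeps std devs positive
    return alpha, mu, sigma

def mdn_density(t, x):
    """Full conditional density p(t|x) for a scalar target t."""
    alpha, mu, sigma = mdn_params(x)
    comp = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.sum(alpha * comp))

def mlp_style_mean(x):
    """The single value a least-squares MLP would target: E(t|x)."""
    alpha, mu, _ = mdn_params(x)
    return float(np.sum(alpha * mu))

x = rng.normal(size=D_in)
p = mdn_density(0.0, x)
```

For one-to-many problems the advantage is that p(t|x) can be multimodal, whereas E(t|x) averages the modes together.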
TMDN Inversion summary
(thanks to Benigno Uría for animation)
MDN output

[Figure: MDN output for file 0580 (mngu0 s1): component means μ and variances σ² plotted against frame number (= time), frames 20 to 140]

K. Richmond, S. King, and P. Taylor. Modelling the uncertainty in recovering articulation from acoustics. Computer Speech and Language, 17:153–172, 2003.
MDN output - estimating trajectories
K. Richmond. Trajectory mixture density networks with multiple mixtures for acoustic-articulatory inversion. International Conference on Non-Linear Speech Processing, NOLISP 2007.
[Figure: real vs estimated mid-tongue x EMA trajectories plotted against frame number (= time), frames 20 to 140; RMS errors 1.37 mm and 0.99 mm]

K. Richmond. Preliminary inversion mapping results with a new EMA corpus. In Proc. Interspeech, pages 2835–2838, Brighton, UK, September 2009.
Improvements with deeper ANN models
• The target - deep multilayer ANN
• To build this network:
• stacking RBM pretraining
• add final layer + optimize
• fine tune all weights with standard backpropagation
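The pretraining step can be sketched as follows. This is a minimal illustrative sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM, not the talk's actual training setup; the sizes, learning rate and toy data are made up. In the full recipe, each trained RBM's weights initialise one hidden layer of the deep MLP before backpropagation fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(1)

n_vis, n_hid = 6, 4          # illustrative sizes
W = rng.normal(scale=0.01, size=(n_vis, n_hid))
a = np.zeros(n_vis)          # visible biases
b = np.zeros(n_hid)          # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM."""
    global W, a, b
    ph0 = sigmoid(v0 @ W + b)                       # P(h=1 | v0)
    h0 = (rng.random(n_hid) < ph0).astype(float)    # sample hidden units
    pv1 = sigmoid(h0 @ W.T + a)                     # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + b)
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)

# Train on a tiny repeating binary pattern; after pretraining, W would
# initialise one layer of the deep net before backprop fine-tuning.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
for _ in range(200):
    for v in data:
        cd1_update(v)
```

Stacking repeats this greedily: the hidden activations of one trained RBM become the visible data for the next.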
Deep MLP test - experiment settings
• Input: PLP features, 9 frames (100 ms)
• Output: 12 linear units (x,y per articulator)
• Sigmoidal hidden units
• 1 to 7 hidden layers
• 200, 300, 400 or 1024 units per hidden layer
Deep MLP results - Error versus depth
[Plot: average RMSE (mm) on the validation set versus number of hidden layers (1 to 7), for 200, 300, 400 and 1024 units per hidden layer]

test set RMS error = 0.94 mm (mngu0 day1)
Deep TMDN results - Error versus depth

[Plot: average RMSE (mm) on the validation set versus number of hidden layers (1 to 7), for 300, 600 and 1024 units per hidden layer]

test set RMS error = 0.885 mm (mngu0 day1)
Is this (or any) inversion mapping any good?!
Two central questions:
• Is the articulatory representation adequate?
• What is the best performance we can hope to achieve?
Adequate articulatory representation?
• Specifically, is EMA enough?
• Some indications from:
• Tongue contour modelling work
• Synthesis work
• Articulatory controllable HMM-based synthesis
• Direct articulatory-to-acoustic mapping
Tongue contour prediction from limited points
Fig. 5. Error (RMSE) incurred by the RBF prediction of the tongue contour with respect to the ground-truth contour. Left: RMSE (mm) for each contour point (averaged over all contours in the dataset) for different numbers K of landmarks, for the optimal landmark placement. Right: RMSE (mm) for each contour (averaged over all contours in the dataset and over all points in the contour), as a function of the number of landmarks K, for: the worst placement of the landmarks over the combinations we considered (solid line), the average over all combinations (dashed), and the optimal placement (dotted, corresponding to the left panel).
Fig. 6. Like Fig. 5 but for the spline interpolation instead of the RBF prediction. Note the different scale on the Y axis.
were used in the MOCHA database are quite close to the optimal ones. From Fig. 5 we then estimate that the tongue contours may be reconstructed from the 3 MOCHA pellets with an error of around 0.3 mm at each point on the tongue contour. The fact that the “worst” and “average” lines in Fig. 5 (right) increase the error by only about 0.1 mm means that, if we cannot place the landmarks optimally as given by Fig. 7, the following recipe will yield near-optimal results: place two pellets 2 to 4 mm from the tongue ends (tip and root, i.e., as far forward and backward as possible), and place the remaining K − 2 pellets so all K pellets are regularly spaced.
5. Conclusion

We have shown that realistic tongue contours (with errors well below 0.4 mm) may be predicted from as few as 3–4 landmarks (optimally located on the tongue) using a nonlinear mapping learned from ultrasound data. This information may be used to determine the optimal number and locations of pellets for EMA and X-ray microbeam technology. Although our dataset was small and limited to one speaker, the results demonstrate the approach is much more successful than spline interpolation, and quantify the extent to which the EMA/X-ray data is a good representation of the tongue. Future work will involve adapting the model to a different speaker for which we have no (or very little) data; animating tongue contours for vocal tract visualisation of EMA/X-ray databases; and augmenting the tongue representation in data-driven methods for articulatory speech synthesis and articulatory inversion. This will improve our understanding of the limitations of current articulatory databases for articulatory inversion, articulatory synthesis and vocal tract visualisation. The method is also applicable to predicting the 3D shape from landmarks if 3D ground truth is available.

Fig. 7. Optimal location of K landmarks (for K = 2, 3, 4, 5) depicted on a sample tongue contour (the tip is to the right and the root to the left; 10 mm scale bars). The bottom contour shows the approximate location of the 3 pellets used in the MOCHA database.
Fig. 4. Selected frames (3, 97, 205, 428, 553, 663, 711, 754) comparing the true contour (N-point contour, cyan) and the contours estimated by the cubic B-spline interpolation (green) and our RBF prediction (red), for K = 3 landmarks (yellow dots). Frame 754 shows indicative 10 mm scale bars.
the RBF overlaps almost perfectly with the true contour, so the latter is barely visible. The spline contour often deviates significantly from the true one. For example, since the spline behaves like an elastic bar, it is impossible for it to predict a sharp valley or hump between two adjacent landmarks (frames 97, 205, 553). When the landmarks are aligned (e.g. frames 428, 754) the spline naturally adopts a straight line shape, which is physically infeasible for the tongue, and different indeed from the true contour. In all these situations the RBF prediction works very well. The advantage of the prediction based on a training set is largest when extrapolating beyond the end landmarks, near the root or the tip of the tongue.
Optimal number and location of the landmarks. In this experiment, we used database db2. In order to determine the optimal location of K landmarks, we would need to fit a predictor to each of the (P choose K) combinations (where our contours have P = 24 points). We limit the computational cost involved as follows. (1) By using an RBF network with fixed basis function centres and width, we only need to estimate the linear weights W for each combination. (2) We ignore unreasonable arrangements of landmarks by dividing the contour into K consecutive segments and constraining each landmark to select points from one; for example, for K = 3, landmarks 1, 2 and 3 can only select points 1–8, 9–16 and 17–24, respectively. This prevents landmarks from being all very close, or very far from each other, which undoubtedly would lead to a much worse prediction. This resulted in 145, 513, 1297, 2501 combinations for K = 2, 3, 4, 5, respectively. The number of combinations for K = 6 (4900+) or higher required too much computer time for this study. For each combination, we performed 5-fold cross-validation to choose the optimal parameters and reported the averaged reconstruction errors. The optimal parameters of the RBF network (M, σ) (number of basis functions and width in mm) were:

K        2          3          4          5
(M, σ)   (410, 19)  (400, 13)  (410, 19)  (490, 19)
We report the root-mean-square error (RMSE) in mm for each contour point i = 1, ..., P, RMSE_i = ((1/N) Σ_{n=1}^{N} (y_i^(n) − ŷ_i^(n))²)^{1/2}, in Fig. 5 (left), where n is the index of the contour in the dataset (with N = 6000 contours for db2), and y^(n) and ŷ^(n) are the true and reconstructed tongue contours, respectively. Fig. 5 (right) reports the RMSE averaged over the P contour points.
Fig. 5 (left) shows that the prediction errors at each contour point are roughly symmetric around the fixed landmarks, with the lowest (zero) error at the landmarks themselves, and the highest error approximately at the midpoint between landmarks, or at the ends of the contour. The errors are largest at the tip of the tongue, consistent with its movement being the most complex. From Fig. 5 (right), using only 2 landmarks yields an optimal error of 0.6 mm, while using 3 yields less than 0.3 mm and 4 yields 0.2 mm. Using more landmarks yields diminishing returns; it is also practically harder to attach that many pellets to the tongue. The line labelled “worst” is actually not much worse than the optimal, because we have ruled out pellet arrangements that would indeed yield a far larger error (e.g. having all pellets next to each other).

For the spline interpolation, we predicted the contour ŷ by considering a uniform grid of P locations along the X axis (with known Y values for K points) and applying to it the spline function. Consistent with the previous section, the spline interpolation (Fig. 6) is always much worse (by an order of magnitude) than the RBF prediction, although its error improves as K increases.

Fig. 7 shows the optimal location of the landmarks for K = 2 to 5. The landmarks are roughly equidistant along the tongue contour, but somewhat closer to each other near the tongue tip, consistent with the fact that the tongue tip shows more complex movements than the rest of the tongue. The end landmarks are close to the contour ends (tip and back), but not right at the ends. The scale bar allows one to determine the positions in mm, and (after rescaling by the total tongue length) one can determine the approximately optimal placement for a different speaker. The approximate locations of the 3 pellets that
C. Qin, M. Carreira-Perpiñán, K. Richmond, A. Wrench, and S. Renals. Predicting tongue shapes from a few landmark locations. In Proc. Interspeech, 2008.
• Database of hand-labelled ultrasound images for training + testing data
• RBF used to predict tongue contour from varying number of points on contour (e.g. EMA locations)
• Also evaluate optimal “EMA” point placement
• For 3 tongue points, average error = 0.3mm
Articulatorily controllable synthesis
• Aim => to change the speech produced by a hidden Markov model-based speech synthesiser using articulatory controls.
Z. Ling, K. Richmond, J. Yamagishi, and R. Wang. Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 17(6):1171-1185, August 2009.
Z. Ling, K. Richmond, and J. Yamagishi. Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. Audio, Speech, and Language Processing, IEEE Transactions on, 21(1):207-219, 2013.
Motivation
• Pros:
• Flexible (parametric rather than concatenative)
• Trainable (data-driven rather than expert-intensive rules)
• Adaptable (speaker, style, emotion...)
• Cons:
• “Black box”
• Modification requires more data
• So, aim is to gain even more flexible control over synthesis
Interpolate: normal -> angry

Hidden Markov model-based synthesis is state-of-the-art.
Introducing articulation into the HMM synthesis model (first attempt)
• Model joint distribution of acoustic and articulatory parameters
• Acoustic distribution is dependent on articulation
• Dependency = linear transform
• Aj is (tied) linear transform matrix for state j...
• ... = Global piecewise linear mapping
• No loss of quality
• NOTE: can use arbitrary function to modify yt => articulatory control!
[acoustic only]
[acoustics + real EMA] [acoustics + EMA]
b_j(x_t | y_t) = N(x_t; A_j y_t + μ_{x_j}, Σ_{x_j})

[Graphical models: HMM states q_t emitting acoustic parameters x_t only, versus jointly emitting acoustic parameters x_t and articulatory parameters y_t]
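The state observation density can be sketched numerically. This is a minimal sketch with made-up dimensions and parameter values, assuming a diagonal covariance; it just shows that the acoustic mean is shifted by a linear transform of the articulatory vector, which is what makes articulatory control of the acoustics possible.

```python
import numpy as np

rng = np.random.default_rng(3)

dim_x, dim_y = 5, 12   # acoustic and articulatory dimensions (illustrative)
A_j = rng.normal(scale=0.1, size=(dim_x, dim_y))  # tied transform, state j
mu_xj = rng.normal(size=dim_x)                    # state acoustic mean
var_xj = np.full(dim_x, 0.5)                      # diagonal covariance

def log_b_j(x_t, y_t):
    """log N(x_t; A_j y_t + mu_xj, diag(var_xj)) for one frame."""
    mean = A_j @ y_t + mu_xj
    return float(-0.5 * np.sum(np.log(2 * np.pi * var_xj)
                               + (x_t - mean) ** 2 / var_xj))

y_t = rng.normal(size=dim_y)
x_at_mean = A_j @ y_t + mu_xj
# The density peaks where x_t equals the articulation-shifted mean, so
# modifying y_t (e.g. raising the tongue) shifts the generated acoustics.
assert log_b_j(x_at_mean, y_t) > log_b_j(x_at_mean + 0.3, y_t)
```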
Change model tongue height => change vowel
[Audio examples: “bet” synthesised with model tongue height raised by +0.5, +1.0 and +1.5 cm and lowered by -0.5, -1.0 and -1.5 cm]
Perceptual test results
• 20 listeners, lab conditions, results pooled across speakers and words
[Plot: percentage of listeners perceiving /ɪ/, /ɛ/ or /æ/ versus modification to tongue height, from +1.5 cm to -1.5 cm]

Z. Ling, K. Richmond, J. Yamagishi, and R. Wang. Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Transactions on Audio, Speech and Language Processing, 17(6):1171-1185, August 2009. (2010 IEEE Best Young Author paper award.)
Direct articulatory-acoustic mapping
• “Articulatory synthesis” - but data-driven (no physiological model)
• Nonlinear regression from articulation to acoustic synthesis parameters
• Some examples of simple baseline system (MLP mapping)
• Input = EMA+Gain+F0
• Output = LSF vocoder parameters
• Training data = 720 utts of mngu0 day2 EMA
“However, that optimism now seems premature.”
“But this yard.”
“There is a huge amount of data overload.”
How good is an inversion mapping?
Inversion mapping methods are standardly evaluated using RMS error and correlation. Is this good enough?
• Zero RMSE and perfect correlation will not happen
• Non-uniqueness (cf. “(non-)critical” articulators) is ignored
• No indication of how close to optimal an inversion is
Question: Can “task-based” evaluation add new insight?
• Explore with task = articulatory controlled TTS
• Compare standard articulatory error measures with error calculated in the *acoustic* domain
Korin Richmond, Zhenhua Ling, Junichi Yamagishi, and Benigno Uría. On the evaluation of inversion mapping performance in the acoustic domain. In Proc. Interspeech, 2013.
[Plot: real vs estimated sensor y-coordinate over time]
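The standard articulatory measures discussed above, RMS error (in mm) and correlation per articulator channel, can be sketched as follows. This is an illustrative sketch only: the toy trajectories are made up, and details of the actual evaluations (channel weighting, silence handling) are not shown.

```python
import numpy as np

def eval_inversion(true_traj, est_traj):
    """RMS error (in the input units, e.g. mm) and Pearson correlation,
    each averaged over channels. Arrays are (frames, channels)."""
    err = est_traj - true_traj
    rmse = np.sqrt((err ** 2).mean(axis=0))
    corrs = [np.corrcoef(true_traj[:, c], est_traj[:, c])[0, 1]
             for c in range(true_traj.shape[1])]
    return float(rmse.mean()), float(np.mean(corrs))

# Toy check: a noisy copy of smooth sinusoidal "trajectories".
t = np.linspace(0, 10, 2000)[:, None]
true_traj = np.hstack([np.sin(t), np.cos(t)])
est_traj = true_traj + 0.1 * np.random.default_rng(4).normal(size=true_traj.shape)
rmse, corr = eval_inversion(true_traj, est_traj)
```

Note what these numbers cannot show: they treat every channel and frame equally, so they say nothing about non-critical articulators or about how far a given score is from the optimum.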
Articulatory-controlled HMM-based TTS
“Feature-space-switched Multiple Regression HMM” (FSS-MRHMM)
On the Evaluation of Inversion Mapping Performance in the Acoustic Domain

Korin Richmond (1), Zhenhua Ling (2), Junichi Yamagishi (1) and Benigno Uría (1)

1) Centre for Speech Technology Research, School of Informatics, University of Edinburgh
2) National Engineering Laboratory of Speech and Language Information Processing, USTC, Hefei, P.R. China

Z.-H. Ling, K. Richmond, and J. Yamagishi, “Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression,” IEEE TASLP, vol. 21, no. 1, pp. 207–219, 2013.
• NOTE: the aim is not to develop a novel inversion method; importantly, a range of inversion performance is required here

Acknowledgements. Funding: NSFC Grant 61273032; NSFC-RSE Grant 61111130120; EU FP7 agreement 25623 (LISTA); EPSRC grants EP/I027696/1 (Ultrax) and EP/J002526/1.
• Spectral distributions (x_t) depend on state + external articulation (y_t)
• Separate single GMM for transform tying (matrices A_k below)
• Context feature tailoring

b_j(x_t | y_t) = Σ_{k=1}^{M} ζ_k(t) N(x_t; A_k ξ_t + μ_j, Σ_j)

where μ_j and Σ_j are the state-dependent acoustic mean and variance, A_k is the transform matrix for GMM component k, ξ_t = [y_t^T, 1]^T is the expanded articulatory vector, and ζ_k(t) is the probability of the (separate) GMM component m_t = k given y_t.

[Flowchart: use of the FSS-MRHMM to evaluate natural and estimated articulator trajectories]
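A numerical sketch of this observation density follows. All dimensions and parameter values are made up, and the separate GMM is given equal priors and unit spherical covariances for brevity; it only illustrates the switching structure, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(5)

dim_x, dim_y, M = 4, 6, 3   # illustrative dimensions; M GMM components

A = rng.normal(scale=0.1, size=(M, dim_x, dim_y + 1))  # transforms A_k
mu_j = rng.normal(size=dim_x)                          # state mean
var_j = np.full(dim_x, 1.0)                            # state diag variance

# A separate GMM over y_t supplies the switching weights zeta_k(t)
# (equal priors and unit spherical covariances, for brevity).
gmm_means = rng.normal(size=(M, dim_y))

def zeta(y_t):
    """Posterior P(m_t = k | y_t) under the separate GMM."""
    logp = -0.5 * ((y_t - gmm_means) ** 2).sum(axis=1)
    w = np.exp(logp - logp.max())
    return w / w.sum()

def b_j(x_t, y_t):
    """FSS-MRHMM observation density for state j."""
    xi = np.append(y_t, 1.0)   # expanded articulatory vector [y_t, 1]
    z = zeta(y_t)
    dens = 0.0
    for k in range(M):
        mean = A[k] @ xi + mu_j
        logn = -0.5 * np.sum(np.log(2 * np.pi * var_j)
                             + (x_t - mean) ** 2 / var_j)
        dens += z[k] * np.exp(logn)
    return float(dens)

y_t = rng.normal(size=dim_y)
p = b_j(rng.normal(size=dim_x), y_t)
```

The articulatory input acts twice: it selects which transform dominates (through ζ) and it shifts the component means (through A_k ξ_t).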
Experiment method
Test a range of inversion methods using:
1. articulatory evaluation using standard RMS error and correlation
2. acoustic evaluation using an articulator controlled text-to-speech synthesiser:
i. *acoustic* RMS error
ii. human perceptual test
[Flowchart: articulatory synthesis]
Data processing
• Day 1 EMA set of mngu0 corpus (for TTS + inversion methods)
• one British male, single session, 1263 prompts total
• 3D EMA (AG500), 6 coils (lips, jaw, 3 tongue)
• Acoustics -> LSF (STRAIGHT), order 40 + gain, 5 ms frame shift (= EMA sample rate)
• Dataset sizes (num. utterances): Validation (63), EMA Test (63), Train (1137)
• All data z-score normalised (not for FSS-MRHMM)
4 mapping methods tested (+smoothed versions)
Linear: simple linear projection, with 1, 2, 4, 6, 8 or 10 acoustic context frames
Codebook: KD-tree to find 5000 candidates each frame, then Viterbi search with unweighted Euclidean target and join costs
MLP: 1 per channel; 1 hidden layer, 100 units with tanh activation function, 10 acoustic context frames (alternate frames selected)
TMDN: 1 per channel; 1 hidden layer, 100 units with tanh activation function, 10 context frames, [1, 2 or 4] GMM components, static/∆/∆∆ PDFs + MLPG
+ Filter: output additionally smoothed with a 2nd-order Butterworth filter, 10 Hz low-pass cutoff
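The “+ Filter” smoothing can be reproduced with standard tools. A sketch assuming the 200 Hz frame rate implied by the 5 ms frame shift, and zero-phase filtering via filtfilt so the trajectories are not delayed (the talk specifies only the filter order and cutoff, so the rest is an assumption).

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 200.0      # frames per second (5 ms frame shift)
cutoff = 10.0   # low-pass cutoff in Hz

# 2nd-order Butterworth low-pass; butter takes the cutoff normalised
# by the Nyquist frequency (fs / 2).
b, a = butter(2, cutoff / (fs / 2.0), btype="low")

t = np.arange(0, 2, 1 / fs)
slow = np.sin(2 * np.pi * 2 * t)                  # articulator-like 2 Hz movement
noisy = slow + 0.2 * np.sin(2 * np.pi * 50 * t)   # fast estimation jitter
smoothed = filtfilt(b, a, noisy)                  # zero-phase smoothing
```

Because real articulator movement is slow and smooth, this kind of low-pass filtering removes estimation jitter while leaving the underlying trajectory largely intact.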
Standard articulatory error measure results
[Plot: EMA RMSE (mm) and correlation for Codebook, linear-1 to linear-10, MLP and TMDN, each with and without low-pass filtering]
• Codebook < linear < MLP ≤ TMDN
• +Filter improves all results, except TMDN
• Reasonable spread of performance
• NOTE: This is not a fair comparison of methods!
Acoustic evaluation results: LSF RMSE
[Plot: LSF RMSE for No EMA, Natural EMA, Codebook, linear-1 to linear-10, MLP and TMDN, each with and without low-pass filtering]
• Synthesise 63 test utts, calculate LSF error
• Perceptually weighted Euclidean distance
• Some interesting differences:
• MLP+Filter much better than TMDN
• TMDN appears worse than linear+Filter
Acoustic evaluation: listening test results
• 30 paid native British English listeners, lab conditions
• 9 preference tests, each with 10 pairs of stimuli
• Results generally mirror the LSF RMSE results:
No EMA < codebook < [MLP, linear-10, TMDN] < Natural EMA
• Some LSF RMSE differences are imperceptible
• MLP+Filter ≅ Natural EMA!
Conclusions from this study
Two questions addressed:
1. Can acoustic task-based evaluation give useful info?
• YES!
• Interesting differences from standard articulatory RMSE and correlation
2. Can we get insight into “optimal inversion” performance?
• MLP+LPFiltering performed as well as natural EMA...
• ...BUT we cannot yet claim this is “optimal inversion”
• Simply, sufficient inversion for this task
Summary of talk
On the inversion mapping:
• Shown ANNs are a viable model for inversion mapping
• Deep ANN models give state-of-the-art performance (probably)
• Given some indications of level of “information” in EMA data
• This is far from conclusive... open question up for debate and further study
• Raised question of “optimal inversion”
• Some interesting results - RMSE and correlation don’t give full picture
• Found inverted EMA ≅ natural EMA, but too early to judge if this is “optimal”
Thanks for listening!