

Audiovisual-to-articulatory speech inversion using Hidden Markov Models

Athanassios Katsamanis, George Papandreou, Petros Maragos
School of E.C.E., National Technical University of Athens, Athens 15773, Greece


Acknowledgements

This research was co-financed partially by the E.U. European Social Fund (75%) and the Greek Ministry of Development, GSRT (25%), under Grant ΠΕΝΕΔ-2003ΕΔ866, and partially by the European research project ASPI under Grant FP6-021324. We would also like to thank O. Engwall from KTH for providing us with the QSMT database.

Evaluation

Measured (black) and predicted (light color) articulatory trajectories

Speech inversion?

Recover the vocal tract geometry from the speech signal and the speaker’s face. Applications in language tutoring and speech therapy.

Zero states correspond to the case of a global linear model.

Qualisys-Movetrack (QSMT) database

Synchronized recordings: Electromagnetic Articulography (EMA), video, and audio.

[System diagram: multistream HMM state trellis (state vs. time) over phones /p1/, /p2/; visual stream y_v: 3-D marker coordinates, weight w_v; audio stream y_a: spectral characteristics/MFCC, weight w_a.]

Determination of multistream HMM state sequence

Why Canonical Correlation Analysis (CCA)? It leads to optimal reduced-rank linear regression models and improves predictive performance when training data are limited (a minimal sketch follows the figure caption below).

Generalization error of the linear regression model vs. model order for varying training set size. Upper row: tongue position from face expression. Lower row: face expression from tongue position.
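To make the reduced-rank construction concrete, here is a minimal Python/NumPy sketch of a rank-constrained linear map built in canonical (CCA) coordinates: whiten both variable sets, take the SVD of the whitened cross-covariance, keep the top canonical pairs, and map back. The function names and the regularization constant are ours, for illustration only; this is not the authors' code.

```python
import numpy as np

def _sym_inv_sqrt(S):
    """Inverse symmetric square root S^{-1/2} of an SPD matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def reduced_rank_map(X, Y, rank, reg=1e-6):
    """Rank-constrained linear map M such that Y ~= X @ M, built in the
    canonical (CCA) coordinates of X (n x p) and Y (n x q).
    Both X and Y are assumed zero-mean (center them beforehand)."""
    n = len(X)
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    Wx, Wy = _sym_inv_sqrt(Sxx), _sym_inv_sqrt(Syy)
    # SVD of the whitened cross-covariance: the singular values are the
    # canonical correlations, the singular vectors the canonical directions.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    k = rank
    # Regress in canonical coordinates (slope = canonical correlation),
    # keep only the top-k canonical pairs, then map back to Y's coordinates.
    return Wx @ U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] @ np.linalg.inv(Wy)
```

With `rank` equal to min(p, q) this recovers the ordinary least-squares solution $S_{xx}^{-1} S_{xy}$; truncating the rank trades a little bias for reduced variance, which is what helps when training data are limited.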

We use multistream HMMs. The visual-to-articulatory mapping is expected to be nonlinear, and the visual stream is incorporated following the audiovisual ASR paradigm.
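As a rough sketch of the multistream formulation borrowed from audiovisual ASR, each HMM state scores the audio and visual observations separately and combines the per-stream log-likelihoods with exponent weights $w_a$, $w_v$. The diagonal-Gaussian state model and the example weights below are assumptions made for the illustration.

```python
import numpy as np

def diag_gauss_logpdf(o, mu, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def multistream_state_loglik(o_a, o_v, state, w_a=0.7, w_v=0.3):
    """Stream-weighted state log-likelihood:
    log b_i(o) = w_a * log b_i^a(o_a) + w_v * log b_i^v(o_v),
    with o_a the MFCC vector and o_v the 3-D face marker vector."""
    return (w_a * diag_gauss_logpdf(o_a, state["mu_a"], state["var_a"])
            + w_v * diag_gauss_logpdf(o_v, state["mu_v"], state["var_v"]))
```

These stream-weighted scores are what a Viterbi pass would use when determining the multistream HMM state sequence; the weights are typically tuned to reflect the relative reliability of the two streams.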

We apply CCA: a linear mapping between audiovisual and articulatory data is trained at each HMM state (see the per-state fitting sketch below).

Performance is improved compared to a global linear model and to audio-only or visual-only HMMs.
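Combining the two sketches above, per-state training could look like the following: for the frames aligned to each HMM state, fit a forward map from articulatory to audiovisual features by reduced-rank regression, matching the model $y_t = A_i x_t + \varepsilon_t$ used below, and estimate the residual covariance $Q_i$. The container `frames_by_state`, the default rank, and the alignment procedure are all hypothetical; `reduced_rank_map` is the sketch given earlier.

```python
import numpy as np

def fit_state_forward_models(frames_by_state, rank=8):
    """Per HMM state i, fit y = A_i x + eps on the frames aligned to that
    state (x: articulatory vectors, y: audiovisual vectors, rows = frames),
    plus the residual covariance Q_i and the Gaussian prior of x.
    `frames_by_state` maps i -> (X_art, Y_av); the names are hypothetical."""
    models = {}
    for i, (X, Y) in frames_by_state.items():
        Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
        M = reduced_rank_map(Xc, Yc, rank)     # row convention: Y ~= X @ M
        A_i = M.T                              # column convention: y = A_i x
        resid = Yc - Xc @ M
        Q_i = resid.T @ resid / len(resid)     # approximation-error covariance
        models[i] = dict(A=A_i, Q=Q_i, mu_x=X.mean(axis=0), Sigma_x=np.cov(X.T))
    return models
```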

At time $t$, in state $i$, the audiovisual features $y_t$ and the articulatory parameters $x_t$ are related by a state-dependent linear model:

$$y_t = A_i x_t + \varepsilon_t$$

Maximum a posteriori (MAP) articulatory parameter estimate:

$$\hat{x} = \left( \Sigma_x^{-1} + A_i^T Q_i^{-1} A_i \right)^{-1} \left( \Sigma_x^{-1} \mu_x + A_i^T Q_i^{-1} y \right)$$

where $Q_i$ is the covariance of the approximation error $\varepsilon_t$, and the prior of $x$ is taken to be Gaussian with mean $\mu_x$ and covariance $\Sigma_x$, both determined in the training phase.
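For illustration, the MAP estimate above translates directly into a few lines of NumPy; the function name is ours, and in practice it would be applied frame by frame along the decoded state sequence.

```python
import numpy as np

def map_articulatory_estimate(y, A, Q, mu_x, Sigma_x):
    """MAP estimate of articulatory parameters x given audiovisual features y,
    under y = A x + eps with eps ~ N(0, Q) and prior x ~ N(mu_x, Sigma_x)."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Q_inv = np.linalg.inv(Q)
    precision = Sx_inv + A.T @ Q_inv @ A
    rhs = Sx_inv @ mu_x + A.T @ Q_inv @ y
    # Solve the linear system rather than forming the matrix inverse explicitly.
    return np.linalg.solve(precision, rhs)

# Hypothetical usage with the per-state models fitted above:
# x_hat = map_articulatory_estimate(y_t, **models[state_t])
```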