ESTIMATING TRACT VARIABLES FROM ACOUSTICS VIA MACHINE LEARNING CHRISTIANA SABETT APPLIED MATH, APPLIED STATISTICS, AND SCIENTIFIC COMPUTING (AMSC) OCTOBER 7, 2014 ADVISOR: DR. CAROL ESPY-WILSON ELECTRICAL AND COMPUTER ENGINEERING

  • Slide 1
  • Slide 2
  • Slide 3
  • INTRODUCTION Automatic speech recognition (ASR) systems remain inadequate in their current forms. A key difficulty is coarticulation: the overlap of articulatory actions in the vocal tract.
  • Slide 4
  • TRACT VARIABLES Articulatory information: information from the organs along the vocal tract. Tract variables (TVs): vocal tract constriction variables relaying information of a physical trajectory in time: Lip Aperture (LA), Lip Protrusion (LP), Tongue Tip Constriction Degree (TTCD), Tongue Tip Constriction Location (TTCL), Tongue Body Constriction Degree (TBCD), Tongue Body Constriction Location (TBCL), Velum (VEL), Glottis (GLO). (Mitra et al., 2010.)
  • Slide 5
  • TRACT VARIABLES TVs are consistent in the presence of coarticulation. TVs can improve the robustness of automatic speech recognition. [Figure: spectrogram (frequency vs. time) and waveform of the clearly articulated phrase "perfect memory," with the corresponding TB, TT, and LA trajectories plotted below.]
  • Slide 6
  • PROJECT GOAL Effectively estimate TV trajectories using artificial neural networks, implementing Kalman smoothing when necessary.
  • Slide 7
  • APPROACH Artificial neural networks (ANNs) (Papcun, 1992): feed-forward ANN (FF-ANN) and recurrent ANN (RANN). Motivation: the articulatory-to-acoustic mapping is many-to-one, so speech inversion is ill-posed; for example, a retroflex /r/ and a bunched /r/ produce nearly identical acoustics. ANNs can map m inputs to n outputs (Atal et al., 1978).
  • Slide 8
  • STRUCTURE (Mitra, 2010) 3 hidden layers. Each node has the sigmoidal activation function f(x) = tanh(x), with weights w and biases b. Input: acoustic feature vector (9x20 or 9x13). Output: g_k, an estimate of the TV trajectories at time k (dimension 8x1). g_k is a nonlinear composition of the activation functions.
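A minimal Python/NumPy sketch of this structure, assuming tanh at every node, a flattened 9x20 acoustic feature vector as input, and 8 TV outputs. The hidden-layer widths and weight initialization here are illustrative choices, not the values used in the project:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_ffann(sizes):
    """Random weights w and zero biases b per layer (sizes incl. input and output)."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """g_k as a nonlinear composition: f(x) = tanh(x) at every node."""
    for W, b in params:
        x = np.tanh(x @ W + b)
    return x

# input: flattened 9x20 acoustic feature vector; output: 8 TVs at time k
net = init_ffann([9 * 20, 100, 100, 100, 8])   # hidden widths are illustrative
g_k = forward(net, rng.standard_normal(9 * 20))
```

Training would then adjust the weights and biases so that g_k approaches the target TV values.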
  • Slide 9
  • COST FUNCTION Networks are trained by minimizing the sum-of-squares error E_SE = (1/2) Σ_k ||g_k − t_k||² over the training data [x, t] (N = 315 words) (Mitra, 2010). The output of the network, g_k, is the predicted TV trajectory estimated by position at each time step k. Weights and biases are updated using the scaled conjugate gradient algorithm and dynamic backpropagation to reduce E_SE.
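The sum-of-squares error can be sketched in a few lines; the array values below are made up for illustration only:

```python
import numpy as np

def sse(g, t):
    """E_SE = 1/2 * sum over time steps and TV dimensions of (g - t)^2."""
    return 0.5 * np.sum((g - t) ** 2)

g = np.array([[0.1, 0.2], [0.3, 0.4]])   # predicted TV positions, one row per step
t = np.array([[0.0, 0.2], [0.3, 0.5]])   # target TV positions
err = sse(g, t)                          # 0.5 * (0.01 + 0.01) = 0.01
```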
  • Slide 10
  • DYNAMIC BACKPROPAGATION (Jin and Gupta, 1999) [Update equations and variable definitions shown on slide.]
  • Slide 11
  • SCALED CONJUGATE GRADIENT (SCG) (Moller, 1993) Choose an initial weight vector w_1 and scalars. Let p_1 = r_1 = −∇E_SE(w_1). While the steepest-descent direction r_k ≠ 0: if success = true, calculate second-order information s_k, the finite-difference approximation to the second derivative along p_k, and scale it to obtain δ_k. If δ_k ≤ 0, make the Hessian approximation positive definite. Calculate the step size α_k = (p_k^T r_k)/δ_k. Calculate the comparison parameter Δ_k. If Δ_k ≥ 0: set w_{k+1} = w_k + α_k p_k and r_{k+1} = −∇E_SE(w_{k+1}); if k mod M = 0 (M is the number of weights), restart the algorithm with p_{k+1} = r_{k+1}; else create the new conjugate direction p_{k+1} = r_{k+1} + β_k p_k. If Δ_k < 0.25, increase the scale parameter: λ_k = 4λ_k.
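A simplified Python sketch of the loop above. The λ updates follow the slide's quarter/quadruple rule rather than Møller's exact formulas, and constants such as σ₀ and the initial λ are conventional choices, not values from the slide; it is demonstrated on a small quadratic instead of the network error E_SE:

```python
import numpy as np

def scg(E, gradE, w, n_iters=200, tol=1e-6):
    """Simplified scaled conjugate gradient (after Moller, 1993)."""
    sigma0 = 1e-4          # finite-difference step scale
    lam = 1e-6             # scale (regularization) parameter lambda
    r = -gradE(w)          # steepest-descent direction r_1
    p = r.copy()           # initial conjugate direction p_1 = r_1
    success = True
    n = w.size
    n_success = 0
    for k in range(n_iters):
        if np.linalg.norm(r) < tol:
            break
        if success:
            # second-order information: finite-difference Hessian-vector product
            kappa = p @ p
            sigma = sigma0 / np.sqrt(kappa)
            s = (gradE(w + sigma * p) - gradE(w)) / sigma
            theta = p @ s
        # scale the curvature estimate
        delta = theta + lam * kappa
        if delta <= 0:
            # make the approximated Hessian positive definite
            lam = 2.0 * (lam - theta / kappa)
            delta = theta + lam * kappa
        mu = p @ r
        alpha = mu / delta                 # step size alpha_k
        w_new = w + alpha * p
        # comparison parameter: how well the quadratic model predicted the drop
        Delta = 2.0 * delta * (E(w) - E(w_new)) / mu**2
        if Delta >= 0:                     # successful step
            w = w_new
            r_new = -gradE(w)
            success = True
            n_success += 1
            if n_success == n:             # restart after n successes
                p = r_new.copy()
                n_success = 0
            else:
                beta = (r_new @ r_new - r_new @ r) / mu
                p = r_new + beta * p       # new conjugate direction
            r = r_new
            if Delta >= 0.75:
                lam *= 0.25                # trust the quadratic model more
        else:
            success = False                # reject the step, keep w
        if Delta < 0.25:
            lam *= 4.0                     # trust the quadratic model less
    return w

# demo: minimize E(w) = 0.5 w^T A w - b^T w, whose minimizer solves A w = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_opt = scg(lambda w: 0.5 * w @ A @ w - b @ w,
            lambda w: A @ w - b,
            np.zeros(2))
```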
  • Slide 12
  • KALMAN SMOOTHING Kalman filtering is used to smooth the noisy trajectory estimates from the ANNs. TV trajectories are modeled as the output of a dynamic system in state-space representation. Parameters: Δt, the time difference (ms) between two consecutive measurements; the process noise at step k; and the measurement noise at step k.
  • Slide 13
  • KALMAN SMOOTHING (Kalman, 1960) Recursive estimator. Predict phase: compute the predicted state estimate and the predicted estimate covariance. Update phase: compute S_k, the residual covariance, and K_k, the optimal Kalman gain; then update the state estimate and the estimate covariance.
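The predict/update recursion can be sketched for one TV channel with a constant-velocity state-space model. The time step Δt and the noise covariances Q and R below are illustrative assumptions, not the project's values; the demo simply filters a noisy constant signal:

```python
import numpy as np

dt = 0.01                                 # 10 ms between measurements (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])     # state transition: position, velocity
H = np.array([[1.0, 0.0]])                # we observe position only
Q = 1e-4 * np.eye(2)                      # process-noise covariance (assumed)
R = np.array([[1e-2]])                    # measurement-noise covariance (assumed)

def kalman_step(x, P, z):
    # predict phase
    x_pred = F @ x                        # predicted state estimate
    P_pred = F @ P @ F.T + Q              # predicted estimate covariance
    # update phase
    y = z - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R              # residual covariance S_k
    K = P_pred @ H.T @ np.linalg.inv(S)   # optimal Kalman gain K_k
    x_new = x_pred + K @ y                # updated state estimate
    P_new = (np.eye(2) - K @ H) @ P_pred  # updated estimate covariance
    return x_new, P_new

# filter a noisy constant "trajectory" around 1.0
rng = np.random.default_rng(0)
x, P = np.zeros(2), np.eye(2)
for z in 1.0 + 0.1 * rng.standard_normal(100):
    x, P = kalman_step(x, P, np.array([z]))
```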
  • Slide 14
  • IMPLEMENTATION Python with scientific libraries: FANN (Fast Artificial Neural Network), Neurolab, PyBrain. Deepthought/Deepthought2 high-performance computing clusters.
  • Slide 15
  • TEST PROBLEM Synthetic data set (420 words) as model input [x, t]. Data sampled over nine 10-ms windows. Generated from a speech production model at Haskins Laboratories (Yale Univ.); TV trajectories generated by the TAsk Dynamic and Applications (TADA) model. Goal: reproduce estimates of root mean square error (RMSE) and the Pearson product-moment correlation coefficient (PPMC).
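The two evaluation metrics can be sketched as follows; the sine curve is just a stand-in for an estimated TV trajectory:

```python
import numpy as np

def rmse(est, ref):
    """Root mean square error between estimated and reference trajectories."""
    return np.sqrt(np.mean((est - ref) ** 2))

def ppmc(est, ref):
    """Pearson product-moment correlation coefficient."""
    e, r = est - est.mean(), ref - ref.mean()
    return (e @ r) / np.sqrt((e @ e) * (r @ r))

ref = np.sin(np.linspace(0, 2 * np.pi, 100))  # stand-in reference trajectory
est = ref + 0.01                              # estimate with a constant offset
```

A constant offset shows the complementary roles of the two metrics: RMSE is 0.01 while PPMC stays at 1, since correlation is insensitive to shifts.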
  • Slide 16
  • VALIDATION METHODS New real data set: 47 American-English speakers, 56 tasks per speaker, obtained from the University of Wisconsin's X-ray Microbeam Speech Production Database. Feed the data through the model, compare error estimates, and obtain visual trajectories.
  • Slide 17
  • MILESTONES Build an FF-ANN. Implement Kalman smoothing. Use synthetic data to test the FF-ANN. Build a recurrent ANN. Implement smoothing (if necessary). Test the recurrent ANN using real data.
  • Slide 18
  • TIMELINE This semester: Build and test an FF-ANN October: Research and start implementation. November: Finish implementation and incorporate Kalman smoothing. December: Test and compile results using synthetic data. Next semester: Build and test a recurrent ANN January-February: Research and begin implementation (modifying FF-ANN). March: Finish implementation. Begin testing. April: Modifications (as necessary) and further testing. May: Finalize and collect results.
  • Slide 19
  • DELIVERABLES Proposal presentation and report Mid-year presentation/report Final presentation/report FF-ANN code Recurrent ANN code Synthetic data set Real acoustic data set
  • Slide 20
  • BIBLIOGRAPHY
    1. Atal, B. S., J. J. Chang, M. V. Mathews, and J. W. Tukey. "Inversion of Articulatory-to-Acoustic Transformation in the Vocal Tract by a Computer-Sorting Technique." The Journal of the Acoustical Society of America 63.5 (1978): 1535-1553.
    2. Bengio, Yoshua. "Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)." Notes de cours IFT6266, Hiver 2010. 2 Apr. 2010. Web. 4 Oct. 2014.
    3. Jin, Liang, and M. M. Gupta. "Stable Dynamic Backpropagation Learning in Recurrent Neural Networks." IEEE Transactions on Neural Networks 10.6 (1999): 1321-1334. Web. 4 Oct. 2014.
    4. Jordan, Michael I., and David E. Rumelhart. "Forward Models: Supervised Learning with a Distal Teacher." Cognitive Science 16 (1992): 307-354. Web. 4 Oct. 2014.
    5. Kalman, R. E. "A New Approach to Linear Filtering and Prediction Problems." Journal of Basic Engineering 82 (1960): 35-45. Web. 4 Oct. 2014.
  • Slide 21
  • BIBLIOGRAPHY
    6. Mitra, Vikramjit. Improving Robustness of Speech Recognition Systems. Dissertation, University of Maryland, College Park, 2010.
    7. Mitra, V., I. Y. Ozbek, Hosung Nam, Xinhui Zhou, and C. Y. Espy-Wilson. "From Acoustics to Vocal Tract Time Functions." Acoustics, Speech, and Signal Processing, 2009 (ICASSP 2009): 4497-4500. Print.
    8. Moller, M. "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning." Neural Networks 6 (1993): 525-533. Web. 4 Oct. 2014.
    9. Nielsen, Michael. "Neural Networks and Deep Learning." Determination Press, 1 Sept. 2014. Web. 4 Oct. 2014.
    10. Papcun, George. "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-ray Microbeam Data." The Journal of the Acoustical Society of America (1992): 688. Web. 4 Oct. 2014.
  • Slide 22
  • BIBLIOGRAPHY All images taken from:
    11. Mitra, Vikramjit, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. "Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies." IEEE Journal of Selected Topics in Signal Processing 4.6 (2010): 1027-1045. Print.
    12. Espy-Wilson, Carol. Presentation at Interspeech 2013.
    13. Espy-Wilson, Carol. Unpublished results.
    Sound clips courtesy of I Know That Voice. 2013. Film.
  • Slide 23
  • THANKS! QUESTIONS?