For Peer Review · ﬁber anatomy, respectively. With advances in computational anatomy, a vocal...

For Peer Review

A Sparse Non-negative Matrix Factorization Framework for

Identifying Functional Units of Tongue Behavior from MRI

Journal: Transactions on Medical Imaging

Manuscript ID Draft

Manuscript Type: Full Paper

Date Submitted by the Author: n/a

Complete List of Authors: Woo, Jonghye; Massachusetts General Hospital, Radiology; MGH Prince, Jerry; Johns Hopkins University, Electrical and Computer Engineering Stone, Maureen; University of Maryland Dental School, Neural and Pain Sciences, and Orthodontics Xing, Fangxu; Massachusetts General Hospital, Radiology

Gomez, Arnold; Johns Hopkins University, Electrical and Computer Engineering Green, Jordan; MGH Institute of Health Professions, Communication Sciences and Disorders Hartnick, Christopher; Massachusetts Eye and Ear Infirmary, Harvard Medical School Brady, Thomas; Harvard Medical School, Radiology Reese, Timothy; Massachusetts General Hospital, Martinos Center for Biomedical Imaging Wedeen, Van; Massachusetts General Hospital and the Harvard Medical School, Athinoula A. Martinos Center for Biomedical Imaging El Fakhri, Georges; Massachusetts General Hospital and the Harvard

Medical School, Radiology

Keywords: Tracking (time series analysis) < General methodology, Magnetic resonance imaging (MRI) < Imaging modalities, Muscle < Object of interest, Machine learning < General methodology

Specialty/Area of Expertise:

For Peer Review

IEEE TRANS DRAFT, FEB 2018 1

A Sparse Non-negative Matrix FactorizationFramework for Identifying Functional Units of

Tongue Behavior from MRIJonghye Woo, Member, IEEE, Jerry L. Prince, Fellow, IEEE, Maureen Stone, Fangxu Xing, Arnold Gomez,Jordan R. Green, Christopher J. Hartnick, Thomas J. Brady, Timothy G. Reese, Van J. Wedeen, Georges El

Fakhri, Fellow, IEEE

Abstract—Tongue motions in the course of speech or otherlingual behaviors are synergies created by locally deformingregions, or functional units. Functional units are functionalgroupings of local structural elements within the tongue thatcompress, expand, and move in a cohesive and consistent manner.Identifying the functional units using tagged-Magnetic ResonanceImaging (MRI) provides an insight into the mechanisms ofnormal and pathological muscle coordination, potentially leadingto improvement in surgical planning, treatment, or rehabilitationprocedures. In this work, to mine this information, we propose agraph-regularized sparse non-negative matrix factorization andprobabilistic graphical model framework by learning latent build-ing blocks and the corresponding weighting map using a set ofmotion features from displacements extracted from tagged-MRI.Our tagged-MRI imaging and internal tissue motion trackingparadigm provide previously unavailable internal tongue motionpatterns, thus illuminating the inner workings of the tongueduring speech or other lingual behaviors. Spectral clusteringusing the weighting map is then performed to determine thecoherent regions defined by the tongue motion that may involvemultiple or undocumented regions. Two-dimensional image dataare used to verify that the proposed approach clusters thedifferent types of images accurately. Three-dimensional syntheticand in vivo tongue data carrying out simple non-speech/speechtasks are used to define subject/task-specic functional units ofthe tongue in localized regions.

Index Terms—Tongue Motion, Functional Units, Speech, Non-negative Matrix Factorization, MRI, Sparsity

I. INTRODUCTION

Finding a suitable representation of high-dimensional andcomplex data for a variety of tasks, such as clustering and topicmodeling, is a fundamental challenge in many areas such ascomputer vision, machine learning, data mining, and medicalimage analysis. Non-negative matrix factorization (NMF) [1],an unsupervised generative model, is a class of techniquesto find a low-dimensional representation of a dataset suitablefor a clustering interpretation [2]; it also has been used totransform seemingly disparate features to a common domain.

Part of this work was presented at the MICCAI 2014 and AcousticalSociety of America in 2017. Jonghye Woo, Fangxu Xing, Thomas J. Brady,Timothy R. Reese, Van J. Wedeen, and Georges El Fakhri are with Departmentof Radiology, Massachusetts General Hospital and Harvard Medical School.Maureen Stone is with the University of Maryland Dental School. Jordan R.Green is with MGH Institute of Health Professions. Christopher J. Hartnickis with Massachusetts Eye and Ear Infirmary and Harvard Medical School.Arnold Gomez and Jerry L. Prince are with Department of Electrical andComputer Engineering at Johns Hopkins University.E-mail: [email protected]

NMF and its variants involving sparsity have received substan-tial attention since the seminal work by Lee and Seung [1]because of its ability to provide an interpretable and parts-based representation inspired by psychological and physio-logical observations about the human brain [23]. Specifically,NMF with a sparsity constraint focuses on data matrices whoseelements are non-negative, allowing to model a data matrix assparse linear combinations of basis vectors. In this work, weare interested in modeling the tongue’s underlying behaviorsusing NMF, since non-negative properties of NMF are akinto the physiology of the tongue as reflected in the matrixdecomposition process. That is, NMF does not allow negativecombinations of basis vectors. This is consistent with theanalysis of muscles, which either have positive activation orno activation, not negative activation.

The human tongue is a structurally and functionally com-plex muscular structure, comprising orthogonally oriented andinter-digitated muscles. The tongue is innervated by morethan 13,000 hypoglossal motoneurons [3], [4]. The complexityand precision of the voluntary and involuntary movements ofthe tongue during the course of speaking, swallowing, andbreathing are invoked by a complex set of neural excitations oftongue muscles. The tongue muscles interact with one anotherto carry out the oromotor behaviors, which are executed bydeforming local functional units in this complex musculararray. Tongue motions are synergies created by locally deform-ing regions, or functional units [6], [7]. Functional units areregions of the tongue that exhibit homogeneous motion duringthe execution of the specific task identifying functional unitsand understanding the mechanisms of coupling among themcan identify motor control strategy in both normal and adaptedspeech (e.g., tongue motion after tongue cancer surgery orbrain injury such as Amyotrophic Lateral Sclerosis (ALS)or stroke). However, to date, the mechanisms of the musclecoordination and the relationship between tongue structure andfunction have remained poorly understood partly due to thegreater complexity and variability of both muscular structuresand their interactions.

Understanding the subject/task-specific functional organiza-tion of the tongue requires a map of the functional units ofthe human tongue for specific tasks using medical imaging. Inparticular, recent advances in medical imaging and associatedimage analysis techniques permit the non-invasive imaging ofstructural and functional components of the tongue. Magnetic

Page 1 of 9

For Peer Review


resonance imaging (MRI) technologies have shown that alarge number of tongue muscles undergo highly complexdeformations during speech and other lingual behaviors. Forexample, the ability to perform non-invasive imaging usingMRI has allowed to image both tongue surface motion usingcine-MRI [8], [9], [10] and internal tissue motion usingtagged-MRI (tMRI) [5]. In addition, high-resolution MRI anddiffusion MRI [25], [24] have provided the muscular andfiber anatomy, respectively. With advances in computationalanatomy, a vocal tract atlas [11], [12], a representation ofthe tongue anatomy, has also been created and validated,which allows for investigating the relationship between tonguestructure and function by providing a reference anatomicalconfiguration to analyze similarities and variability of tonguemotion.

In this work, we develop a computational approach to defin-ing the subject-specific and data-driven functional units fromtMRI and 4D (3D space with time) voxel-level tracking [13]by extending our previous approach [17]. We describe a re-fined algorithm including an advanced tracking algorithm andgraph-regularized sparse NMF to determine spatio-temoprallyvarying functional units using simple non-speech/speech tasksand provide extensive validations on both synthetic and in vivotongue data. The method integrates a regularization term thatencourages the computation of distances on a manifold ratherthan the whole of Euclidean space in order to preserve theintrinsic geometry of the observed motion data. We assumea manifold of the data within an NMF approach, therebycapturing the intrinsic geometry and finding a low dimensionalsubspace of the motion features derived from tMRI. Sincestandard NMF assumes a standard Euclidean distance measurefor its data, it fails to discover the intrinsic geometry of itsdata [23]. Thus, we improve the standard NMF by introducinga joint regularization scheme to determine the cohesive motionpattern of the tongue. Both quantitative and qualitative evalu-ation results demonstrate the validity of the proposed methodand its superiority to conventional clustering algorithms.

We present a fully automated approach to discovering thefunctional units using a graph-regularized sparse matrix fac-torization and probabilistic graphical model framework withthe following main contributions:

• The most prominent contribution of this work is touse voxel level data-driven MRI (1.875 mm×1.875mm×1.875 mm) methods incorporating internal tissuepoints to obtain subject-specific functional units of howtongue muscles coordinate to generate target observedmotion.

• This work applies a graph-regularized sparse NMF andprobabilistic graphical model to the voxel level motiondata, allowing us to estimate the latent functional coher-ence, by learning simultaneously latent building blocksand the corresponding weighting map from a set ofmotion features.

• Our tMRI imaging and internal tissue motion trackingparadigm bring into light patterns of motion that have sofar been intractable. The proposed approach is scalableto a variety of motion features derived from motiontrajectories to characterize the coherent motion patterns.

In this work, we consider the most representative featuressuch as displacements and angle from tMRI.

The structure of this paper is as follows. Related work is re-viewed in Section II. Section III shows the proposed approachto determining the functional units. Section IV presents theexperimental results. Section V provides a discussion, andfinally, Section VI concludes this paper.

II. RELATED WORK

In this section, we review recent work on NMF for clus-tering and functional units research that are closely related toour work.

NMF for Clustering. A multitude of NMF-based methodsfor unsupervised data clustering have been proposed overthe last decade across various domains, ranging from videoanalysis to medical image analysis. In particular, the ideaof using L1 norm regularization (i.e., sparse NMF) for thepurpose of clustering has been successfully employed [28].The sparsity condition imposed on the weighting matrix (orcoefficient matrix) indicates the clustering membership. Forexample, Shahnaz et al. [29] proposed an algorithm fordocument clustering. The matrix factorization was used tocompute a low-rank approximation of a sparse matrix alongwith preservation of natural data property. Wang et al. [31]applied the NMF framework to gene-expression data to iden-tify different cancer classes. Anderson et al. [32] presentedan NMF-based clustering method to differential changes indefault mode subnetworks in ADHD from multimodal data.In addition, Mo et al. [33] proposed a motion segmentationmethod using an NMF-based method. A key insight to useNMF for clustering purpose is that NMF is able to learn anddiscriminate localized traits of data with a better interpretation.Please refer to [19] for a detailed review on NMF for clusteringpurpose.

Speech Production. Research on determining functionalunits during speech has a long history (see e.g., Gick andStavness [14] for a recent review). Determining functionalunits is considered as discovering a “missing link” betweenspeech movements primitives and cortical regions associatedwith speech production [14]. In the context of lingual coartic-ulation, functional untis can be seen as “quasi-independent”motions [3]. A variety of different techniques have been usedto address this problem. Stone et al. [15] showed that fourmidsagittal regions including anterior, dorsal, middle, andposterior functioned quasi-independently using ultrasound andmicrobeam data. Green and Wang attempted to characterizefunctionally independent articulators from microbeam datausing covariance-based analysis [6]. More recently, Stone etal. [3] presented a method to determine functional segmentsusing ultrasound and tMRI. That work examined compressionand expansion between the anterior and posterior tongueand found that regions or muscles have functional segmentsstrongly affected by phonemic constraints. Another key reportis that of Ramanarayanan et al. [16], who used a convolutiveNMF algorithm to determine tongue movement primitivesfrom electromagnetic articulography (EMA). Our work isinspired by approaches discussed above and we use far richer

Page 2 of 9

For Peer Review


4D tMRI based tracking data and the augmented NMF frame-work with the addition of prior information on sparsity andintrinsic data geometry in defining functional units. Unlikeother approaches that largely rely on the tracking of landmarkpoints sparsely located on the fixed tongue surface such asthe tip, blade, body, and dorsum, we aim to identify 3Dcohesive functional units that involve multiple, and possiblyundocumented, internal tongue regions.

III. PROPOSED FRAMEWORK

A. Problem Statement

Without loss of generality, let us first define the notationsand definitions used in this work. Consider a set of P internaltongue tissue points tracked through tMRI, each with n scalarquantities (e.g., magnitude and angle of each track) trackedthrough F time frames. These quantities characterize eachtissue point, which are used to group them into cohesivemotion patterns, functional units. The location of the p-thtissue point at the f -th time frame can be expressed as (xpf , ypf ,zpf ). The tongue motion can then be represented by a 3F×Pspatio-temporal feature matrix N = [n1, ...,nP ] ∈ R3F×P ,where the p-th column is given by

np = [xp1, · · · , xpF , y

p1 , · · · , y

pF , z

p1 , · · · , z

pF ]T . (1)

We cast this problem of determining the functional units as amotion clustering problem as the functional units are consid-ered to be regions of the tongue that exhibit homogeneousmotion. However, different from generic motion clusteringproblems in computer vision, the tongue’s function and phys-iology also need to be reflected and captured in our formu-lation. Thus, the goal is to determine a permutation of thecolumns to form [N1| N2| · · · |Nc] , where the submatrix Ni

comprises point tracks associated with the i-th submotion—i.e., the i-th functional unit. The displacements and derivedmotion features for each underlying muscle are not completelyindependent; a subset of motion quantities from each musclemaps to one or more common latent dimensions, which can beinterpreted via our model. These latent dimensions provide asparse summary of the generative process behind the tongue’smotion for a single or multiple muscles. In this work, sparseNMF is utilized to infer the latent structure of motion featuresderived from tMRI. The proposed method is described in moredetail below; a flowchart is shown in Figure 1.

B. MR Data Acquisition and Motion Tracking

1) MR Data Acquisition: All MRI scanning is performedon a Siemens 3.0 T Tim Treo system (Siemens MedicalSolutions, Malvern, PA) with 12-channel head and 4-channelneck coil. While subjects are speaking the same word repeat-edly, the tMRI datasets are collected using Magnitude ImagedCSPAMM Reconstructed images [26]. The datasets have a 1second duration, 26 time-frames with a temporal resolution of36 ms for each phase with no delay from the tagging pulse, 6mm thick slices (6 mm tag separation), and 1.875 mm in-planeresolution with no gap. The field-of-view is 24 cm.

Fig. 1. Flowchart of the proposed method

2) Motion Estimation from Tagged-MRI: The phase vectorincompressible registration algorithm (PVIRA) [35] is usedto estimate deformation of the tongue from tMRI, yieldinga sequence of dense 3D motion fields. Although the inputis a set of sparsely acquired tMRI slices, PVIRA uses cubicB-spline to interpolate these 2D slices into denser 3D voxellocations. Then a harmonic phase (HARP) [20] filter is appliedto produce HARP phase volumes from the interpolated result.Finally, PVIRA uses the iLogDemons method [21] on thesephase volumes. Specifically, we denote the phase volumes asΦx, Θx, Φy , Θy , Φz , and Θz , where x, y, and z denote motioninformation from three cardinal directions usually containedin orthogonal axial, sagittal, and coronal tagged slices. Thevolume in the reference time frame is Φ and the volume inthe deformed time frame is Θ. The motion update vector fieldis derived from these phase volumes. At each voxel in theimage, the update vector is computed by

δv(x) =v0(x)

α1(x) + α2(x)/K, (2)

Note that K is the normalization factor. v0(x), α1(x), andα2(x) are defined by

v0(x) = W (Φx(x)−Θx(x))(∇∗Φx(x) +∇∗Θx(x))

+W (Φy(x)−Θy(x))(∇∗Φy(x) +∇∗Θy(x))

+W (Φz(x)−Θz(x))(∇∗Φz(x) +∇∗Θz(x)) ,

α1(x) = ||∇∗Φx(x) +∇∗Θx(x)||2 + ||∇∗Φy(x) +∇∗Θy(x)||2

+ ||∇∗Φz(x) +∇∗Θz(x)||2 ,α2(x) = W (Φx(x)−Θx(x))2 +W (Φy(x)−Θy(x))2

+W (Φz(x)−Θz(x))2.(3)

Wrapping of phase W (θ) is defined by

W (θ) = mod(θ + π, 2π)− π (4)

and the “starred” gradient is defined by

∇∗Φ(x) :=

{∇Φ(x), if |∇Φ(x)| ≤ |∇W (Φ(x) + π)|∇W (Φ(x) + π), otherwise.

(5)

Page 3 of 9

For Peer Review


After all iterations are complete, the forward and inversedeformation fields can be found by

ϕ(x) = exp(v(x)) and ϕ−1(x) = exp(−v(x)), (6)

and they are both incompressible and diffeomorphic, makingboth Eulerian and Lagrangian computations available for thefollowing continuum mechanics operations.

C. Extraction of Motion Quantities

The first step in our algorithm is to extract the motionfeatures from PVIRA that characterize the cohesive motionpatterns over time. We extract motion features including themagnitude and angle of the track similar to [34] described as

mpf =

√(xpf+1 − x

pf )2 + (ypf+1 − y

pf )2 + (zpf+1 − z

pf )2 (7)

czpf =xpf+1 − x

pf√

(xpf+1 − xpf )2 + (ypf+1 − y

pf )2

+ 1 (8)

cxpf =ypf+1 − y

pf√

(ypf+1 − ypf )2 + (zpf+1 − z

pf )2

+ 1 (9)

cypf =zpf+1 − z

pf√

(zpf+1 − zpf )2 + (xpf+1 − x

pf )2

+ 1 (10)

where mpf denotes the magnitude of the track and czpf , cxpf ,

and cypf represent the cosine of the angle projected in the z,x, and y axes plus one, respectively, which are in the rangeof 0 to 2 to satisfy the non-negative constraint in the NMFformulation. We then rescale all features into the range of 0to 10 for each feature to be comparable.

For clustering, we gather all the motion features into a 5(F−1)×P non-negative matrix U = [u1, ...,un] ∈ Rm×n

+ , wherethe p-th column can be expressed as

up = [mp1, · · · ,m

pF−1, cz

p1 , · · · , cz

pF−1, cx

p1, · · · ,

cxpF−1, cyp1 , · · · , cy

pF−1, ]

T .

These features are always non-negative and can therefore beinput to NMF.

Algorithm 1: Determination of the functional units1. Extract motion features from displacement fields andconstruct U.

2. Apply graph-regularized sparse NMF to U to obtainV and W.

3. Compute affinity matrix A from W.4. Apply spectral clustering to A and identify functionalunits.

D. Graph-regularized Sparse NMF

1) NMF: Given a non-negative data matrix U constructedfrom the motion quantities above and k ≤ min(m,n), let V =[vik] ∈ Rm×k

+ be the building blocks and let W = [wkj ] ∈Rk×n

+ be the weighting map. The goal of NMF is to learnbuilding blocks and corresponding weights such that the inputU is approximated by a product of two non-negative matrices(i.e., U ≈ VW). A typical way to define NMF is to usethe Frobenius norm to measure the difference between U andVW [1] given by

E(V,W) = ‖U−VW‖2F =∑i,j

(uij −

K∑k=1

vikwkj

)2

(11)where ‖·‖F denotes the matrix Frobenius norm. The solutioncan be found through the multiplicative update rule [1]:

V← V. ∗UWT ./VWWT (12)

W←W. ∗VTU./VTVW (13)

2) Sparsity Constraint: In this work, since we aim to iden-tify the simplest muscle coordinations given many differentcombinations based on the current wisdom on phonologicaltheories [18], we impose a sparsity constraint on the weightingmap W. The sparsity constraint allows us to encode the high-dimensional tongue motion data using a small number of activecomponents, thereby making the weighting map simple andeasy to interpret. In particular, the weighting map obtainedthis way will represent optimized tongue behavior that couldgenerate the observed motion. In the NMF framework, it hasbeen reported that a fractional regularizer using the L1/2 normoutperformed the L1 norm regularizer and gave sparser solu-tions [38]. Thus, we incorporate the L1/2 sparsity constraintinto the NMF framework, which can be expressed as

E(V,W) =1

2‖U−VW‖2F + η ‖W‖1/2, (14)

where the parameter η > 0 controls the sparseness of W and‖W‖1/2 is defined as

‖W‖1/2 =

k∑i=1

n∑j=1

w1/2ij

2

. (15)

3) Manifold Regularization: Despite the high-dimensionalconfiguration space of human motions, many human motionslie on low-dimensional manifolds that are non-Euclidean [37].NMF with the L1/2 norm sparsity constraint, however, pro-duces a weighting map based on a Euclidean structure inthe high-dimensional data space. Thus, the intrinsic and geo-metric relation between motion features may not be reflectedaccurately. To address this, we incorporate a manifold reg-ularization that respects the intrinsic geometric structure asin [23], [39], [40]. The manifold regularization favors the localgeometric structure while serving as a smoothness operator byreducing the interference of noise. Our final objective function

Page 4 of 9

For Peer Review


incorporating both the manifold regularization and the sparsityconstraint is then given by

E(V,W) =1

2‖U−VW‖2F +

1

2λTr(WLWT ) + η ‖W‖1/2

(16)where λ is a balancing parameter of the manifold regulariza-tion, Tr(·) denotes the trace of a matrix, Q is a heat kernelweighting, D is a diagonal matrix where Djj =

∑l

Qjl, and

L = D−Q, which is the graph Laplacian.4) Minimization: The objective function in Eq. (16) is not

convex in both V and W and therefore we use a multiplicativeiterative method akin to that used in [40]. Let Ψ = [ψmk] andΦ = [φkn] be Lagrange multipliers subject to vmk ≥ 0 andwkn ≥ 0, respectively. By using the definition of the Frobeniusnorm, ‖U‖F = (Tr(UTU))1/2, and matrix calculus, theLagrangian L is expressed as

L =1

2Tr(UUT )− Tr(UWTVT ) +

1

2Tr(VWWTVT )

+λ

2Tr(WLWT ) + Tr(ΨVT ) + Tr(ΦWT ) + η ‖W‖1/2 .

(17)The partial derivatives of L with respect to V and W aregiven by

∂L∂V

= −UWT + VWWT + Ψ

∂L∂W

= −VTU + VTVW + λWL +η

2W−1/2 + Φ.

(18)

Finally, the update rule is found by using Karush-Kuhn-Tuckerconditions—i.e., ΨmkVmk = 0 and ΦknWkn = 0:

V← V. ∗UWT ./VWWT

W←W. ∗ (VTU + λWQ)./(VTVW +η

2W−1/2

+λWD).

(19)

E. Spectral Clustering

The non-negative weighting map that is simple and sparseobtained in Eq. (19) provides a good measure of regional tissuepoint similarity. To obtain the final clustering results from theweighting map, spectral clustering is used to determine thecohesive motion patterns as spectral clustering outperformstraditional clustering algorithms such as the K-means algo-rithm [41].

Once W is determined from Eq. (19), an affinity matrix Ais first constructed:

A(i, j) = exp

(−‖w(i)− w(j)‖2

σ

), (20)

where w(i) is the i-th column vector of W and σ denotes thescale (we set σ = 0.01 in this work). The column vectors ofW form nodes in the graph, and the similarity A computedbetween column vectors of W form the edge weights. On theaffinity matrix, we apply a spectral clustering technique usinga normalized cut algorithm [42]. From a graph cut perspective,our method can be seen as identifying subgraphs representingmotions that exhibit distinct characteristics.

F. Model Selection

To achieve the best clustering quality, we need to determinethe optimal number k of clusters, which is a challengingtask [45]. In this work, we use a consensus matrix whoseentries indicate probabilities that samples i and j belong tothe same cluster by repeating NMF 10 times. To compute thedispersion coefficient of a consensus matrix C, the dispersioncoefficient ρ is defined as

ρ =1

n2

m∑i=1

n∑j=1

4(c̃ij − 0.5)2, (21)

where c̃ij , m, and n denote each entry of the matrix, the rowand column size of the matrix, respectively. In the ideal case,for a consensus matrix whose entries are all 0s or 1s, we haveρ = 1; for a scattered consensus matrix, we have 0 < ρ < 1.The optimal number of clusters is determined as the one withthe maximal ρk [46]. We set 2 ≤ k ≤ 6 in this work.

IV. EXPERIMENTAL RESULTS

We describe the qualitative and quantitative evaluation tovalidate the proposed approach. We both show synthetic andreal in vivo data to demonstrate the accuracy of our approach.

A. Experiments Using 2D Data

We first used two 2D datasets to demonstrate the clusteringperformance of the proposed method. The first dataset is theCOIL20 image library, which contains 20 classes (32×32gray scale images of 20 objects). The second dataset is theCMU PIE face database, which has 68 classes (32×32 grayscale face images of 68 persons). In order to compare theperformance of the different algorithms, we used a K-meansclustering method (K-means), a normalized cut method (N-Cut) [42], standard NMF with K-means clustering (NMF-K),graph-regularized NMF with K-means clustering (G-NMF-K) [23], graph-regularized NMF with spectral clustering (G-NMF-S), graph-regularized sparse NMF with K-means cluster-ing (GS-NMF-K), and our method (GS-NMF-S). Two metrics,the Normalized Mutual Information (NMI) and the accuracy(AC), were used to measure the clustering performance as usedin [23]. Table 1 lists the NMI and AC values, demonstratingthat the proposed method outperformed other methods. Wealso compared the L1/2 and L1 norms experimentally, andthe L1/2 norm had slightly better results.

B. Experiments Using Synthetic Tongue Motion Data

Since there is no ground truth in our in vivo tongue motiondata, we evaluated the performance of the proposed methodusing four synthetic tongue motion datasets using a compositeLagrangian displacement field of individual muscle groupsbased on a tongue atlas [11]. Each muscle group was definedby a mask volume with a value of 1 inside the musclegroup, and zero elsewhere. Since the masks were known, italso provided ground truth labels to assess the accuracy ofthe output of the clustering method. The first and seconddatasets used genioglossus (GG) and superior longitudinal(SL) muscles with and without interdigitated regions. The GG

Page 5 of 9

For Peer Review


TABLE ICLUSTERING PERFORMANCE: NMI AND AC

NMI (%) K-means N-Cut NMF-K G-NMF-K GS-NMF-K G-NMF-S Our methodCOIL20 (K=20) 73.80% 76.56% 74.36% 87.59% 90.11% 90.24% 90.63%

PIE (K=68) 54.40% 77.13% 69.82% 89.93% 89.95% 90.95% 91.74%AC (%) K-means N-Cut NMF-K G-NMF-K GS-NMF-K G-NMF-S Ours

COIL20 (K=20) 60.48% 66.52% 66.73% 72.22% 83.75% 84.58% 85.00%PIE (K=68) 23.91% 65.91% 66.21% 79.3% 79.93% 80.60% 84.31%

Fig. 2. Illustration of synthetic tongue motion simulation results: (a) trans-lation plus rotation without interdigitated regions (2 clusters), (b) rotationswith interdigitated regions (3 clusters), (c) translation plus rotation withoutinterdigitated regions (2 clusters), and (d) rotations with interdigitated regions(3 clusters). It is noted that our approach identified each label accurately asvisually assessed.

muscle was translated while the SL muscle was rotated −0.1radians about the x direction in Fig. 2(a) in the course of11 time frames. The GG muscle was rotated −0.1 radiansabout the x direction while the SL muscle was rotated 0.1radians about the x direction in Fig. 2(b) in the course of 11time frames. The third and fourth datasets were generated byapplying the same composite Lagrangian displacement field tothe GG and transverse muscles with and without interdigitatedregions as in Fig. 2(c) and (d), respectively. Fig. 2 showedthe final clustering results using our method. We attempted tocluster each dataset into two (first and third datasets) and three(second and fourth datasets) distinct motions, respectively,where we obtained 100% clustering accuracy for all datasetswhen evaluated against the ground truth labels.

C. Experiments Using In Vivo Tongue Motion Data

TABLE IICHARACTERISTICS OF in vivo TONGUE MOTION DATA

Task Protrusion /s/-/u/ /i/-/s/Time frames 1-13 10-17 15-22

Number of clusters 2-4 2-4 2-4

We also tested our method using a simple non-speechprotrusion task and speech tasks: “a souk” and “a geese”.

Ten subjects performed “a souk” and “a geese” tasks and onesubject performed the protrusion task. We used the featuresincluding the magnitude and angle of each track as our inputto the NMF framework. Table II lists the characteristics of ourin vivo tongue motion data including time frames analyzed andthe number of clusters based on the dispersion coefficient.

First, the functional units have been extracted using ourmethod for two clusters (Fig. 3(b)), three clusters (Fig. 3(c)),and four clusters (Fig. 3(d)), respectively, based on the dis-persion coefficient as in Fig. 4 and visual assessment. Theouter tongue layer expands forward and upward (but notbackward), and the region near the jaw has little motion asshown in Fig. 3(a). Fig. 3(b) is a good representation offorward protrusion (red) vs. small motion (blue). In addition,as the number of clusters increases as shown in Fig. 3(c)and (d), subdivision of large regions in small motion (blue,Fig. 3(b)) into small functional units was observed.

Second, the functional units during /s/ to /u/ from “asouk” were determined using our method for two clus-ters (Fig. 5(b)), three clusters (Fig. 5(c)), and four clusters(Fig. 5(d)), respectively, based on the dispersion coefficientshown in Fig. 6 and visual assessment. These motions arecharacterized by forward to upward/backward motion of thetongue tip, upward motion of the tongue body, and forwardmotion of the posterior tongue as in Fig. 5(a). Fig. 5(b) showstwo clusters including the tip plus bottom of the tongue (red)versus the tongue body. Three clusters as in Fig. 5(c) showa good representation of the tip, body and posterior of thetongue and four clusters as in Fig. 5(d) further subdivided thetongue tip and bottom.

Third, the functional units during /i/ to /s/ from “a geese”were determined using our method for two clusters (Fig. 7(b)),three clusters (Fig. 7(c)), and four clusters (Fig. 7(d)), respec-tively, based on the dispersion coefficient shown in Fig. 8and visual assessment. These motions are characterized by anupward motion of the tongue tip, upward/backward motion ofthe tongue body, and forward motion of the posterior tongueas in Fig. 7(a). Two clusters as in Fig. 7(b) show a divisionbetween the tip plus bottom of the tongue (red) and the tonguebody. Three clusters as in Fig. 7(c) are a good representationof the tip, body and posterior of the tongue and four clustersas in Fig. 7(d) subdivided the posterior of the tongue further.

V. DISCUSSION

In this work, we proposed a novel approach to character-izing multiple functional degrees of freedom of the tongue,which is critical to understand the tongue’s role in speechand other lingual behaviors. This is because the functioning

Page 6 of 9

For Peer Review


Fig. 3. Illustration of functional units during the tongue protrusion task, showing (a) 3D Lagrangian displacement field, (b) functional units (2 clusters), (c)functional units (3 clusters), and (d) functional units (4 clusters). It is noted that the colored clustering results are plotted in the tongue shape of the first timeframe showing the neutral tongue position.

Fig. 4. Plot of the dispersion coefficient versus the different number of clustersfor the tongue protrusion task.

of the tongue in speech or other lingual behaviors entailssuccessful orchestration of the complex system of 3D tonguemuscular structures over time. Determining functional unitsfrom healthy controls plays an important role in understandingmotor control strategy, which in turn could elucidate adaptedmotor control strategy when analyzing patient data such astongue cancer patients. It has been a long-sought problem thatmany researchers attempted using various techniques.

Inspired by recent advances in MR motion tracking and datamining schemes including sparse NMF and manifold learn-ing, we presented a novel method for determining functionalunits from tMRI, which opens new vistas to study speechproduction. Unsupervised data clustering using NMF is thetask of identifying semantically meaningful clusters using alow-dimensional representation from a dataset. Unlike previ-ous algorithms, this proposed work aimed at identifying theinternal, coherent manifold structure of high-dimensional 4Dmotion data. Two constraints in addition to the standard NMFwere employed to reflect the physiological properties of 4Dtongue motion during speech. Firstly, the sparsity constraintwas introduced to capture the simplest and the most optimizedweighting map. Sparsity has been one of important propertiesfor phonological theories [18], and our work attempted todecode this phenomenon within a sparse NMF framework.Secondly, the manifold regularization was added to capture theintrinsic and geometric relationship between motion features.It also allows preserving the geometric structure between

motion features, which is particularly important when deal-ing with tongue motions that lie on low-dimensional non-Euclidean manifolds. Our method performed better than K-means, N-Cut, NMF-K, G-NMF-K, GS-NMF-K, and G-NMF-S using 2D data.

As for the input features in our framework, we used instan-taneous velocity information derived from the point tracks.More features could be investigated such as those reflectmechanical properties including principal strains, curvature,minimum-jerk, two-thirds power law, and isochrony [48] ormotion descriptors combining those individual features.

The selection of the number of clusters is often performedmanually as there is no definite model selection methodavailable. In this work, we built a “consensus matrix” frommultiple runs for each k and assessed the presence of blockstructure. As an alternative, one can compare reconstructionerrors for different number k or examine the stability (i.e.,agreement between results) from multiple randomly initializedruns for each k. Since there is no ground truth in our tonguedata, we have used both visual assessment and the modelselection approach in which the model selection approachprovided an upper limit of the number of clusters.

There are a few directions to improve the current work.First, we used a data-driven approach to determine the func-tional units, which was visually assessed due to the lack ofground truth. This could be improved by further studies usingmodel-based approaches via biomechanical stimulations [43]or electromyography [44] to co-validate our findings. Forbiomechanical simulations, subject-specific anatomy and theassociated weighting map could be input and inverse simu-lation can then be used to verify the validity of the obtainedweighting map. Second, we used magnitude and angle of eachtrack as our input features. In order to equal the weight of eachinput feature, we normalized the feature values in the samerange. In our future work, we will further study automatic rele-vance determination methods to model the interactions amongthese features to yield the best clustering outcome. Finally,the identified functional units as shown in our experimentalresults may involve multiple regions that correspond to sub-muscles or multiple muscles. Therefore, we will further studythe identified functional units in the context of the muscularanatomy from individual high-resolution MRI, diffusion MRI,or a high-resolution atlas [11].

Page 7 of 9

For Peer Review


Fig. 5. Illustration of functional units during /s/ to /u/ from “a souk” showing (a) 3D Lagrangian displacement field, (b) functional units (2 clusters), (c)functional units (3 clusters), and (d) functional units (4 clusters). It is noted that the colored clustering results are plotted in the tongue shape of the first timeframe showing the neutral tongue position.

2 3 4 5 6Number of clusters

0.205

0.21

0.215

0.22

0.225

0.23

0.235

Fig. 6. Plot of the dispersion coefficient versus the different number of clustersfor the task of /s/ to /u/ from “a souk”.

VI. CONCLUSION

We have presented a new algorithm to determine localfunctional units that link muscle activity to surface tongue ge-ometry during non-speech and speech tasks. Our work applieda graph-regularized sparse NMF method that incorporates jointsparse and manifold regularizations to the motion trackingdata from tMRI. Both synthetic and in vivo tongue datawere used to verify the performance of the proposed method,demonstrating that the proposed method was able to accuratelycluster the tongue motion. Our results suggest that it is feasibleto identify the functional units using a set of motion featuresincluding magnitude and angle of each track, and this proposedmethod has great potential in the improvement of diagnosis,treatment, and rehabilitation in patients with speech-relateddisorders.

ACKNOWLEDGMENT

This research was supported in part by NIH R21DC016047,R00DC012575, R01DC014717, and R01CA133015.

REFERENCES

[1] D. D. Lee, H. S. Seung, “Learning the parts of objects by non-negativematrix factorization,” Nature, 401(6755), pp. 788–791, 1999

[2] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, B. W. Schuller, “A deepmatrix factorization method for learning attribute representations,” IEEETransactions on Pattern Analysis and Machine Intelligence. 2016 Apr 15.

[3] M. Stone, M. A. Epstein, K. Iskarous, “Functional segments in tonguemovement,” Clinical Linguistics & Phonetics 18(6-8), pp. 507–521, 2004

[4] J. R. O’Kusky and M. G. Norman, “Sudden infant death syndrome:increased number of synapses in the hypoglossal nucleus,” Journal ofNeuropathology and Experimental Neurology, 54, pp. 627–34, 1995

[5] V. Parthasarathy, J. L. Prince, M. Stone, E. Z. Murano, M. NessAiver,“Measuring tongue motion from tagged cine-MRI using harmonic phase(HARP) processing,” Journal of the Acoustical Society of America,121(1), pp. 491–504, January, 2007

[6] J. R. Green, “Tongue-surface movement patterns during speech andswallowing,” The Journal of the Acoustical Society of America 113(5),pp. 2820–2833, 2003

[7] J. A. S. Kelso, “Synergies: Atoms of Brain and Behavior,” In Progressin Motor Control, pp. 83–91, Springer, 2009

[8] M. Stone, E. P. Davis, A. S. Douglas, M. N. Aiver, R. Gullapalli, W.S. Levine, A. J. Lundberg, “Tongue-surface movement patterns duringspeech and swallowing,” Journal of Speech, Language, and HearingResearch, 113(5), pp. 2820–2833, 2001

[9] E. Bresch, Y. C. Kim, K. Nayak, D. Byrd, S. Narayanan, “Seeingspeech: Capturing vocal tract shaping using real-time magnetic resonanceimaging,” IEEE Signal Processing Magazine, 25(3), pp. 123–132, 2008

[10] M. Fu, Z. Bo, C. Carignan, R. K. Shosted, J. L. Perry, D. P. Kuehn, Z.P.Liang, and B. P. Sutton, “Highresolution dynamic speech imaging withjoint lowrank and sparsity constraints,” Magnetic Resonance in Medicine,73(5), pp. 1820-32, 2015

[11] J. Woo, J. Lee, E. Murano, F. Xing, A. Meena, M. Stone and J. Prince, “Ahigh-resolution atlas and statistical model of the vocal tract from structuralMRI,” Computer Methods in Biomechanics and Biomedical Engineering:Imaging & Visualization, pp. 1–14, 2015

[12] M. Stone, J. Woo, J. Lee, T. Poole, A. Seagraves, M. Chung, E. Kim,E. Z. Murano, J. L. Prince, S. S. Blemker, “Structure and variability inhuman tongue muscle anatomy,” Computer Methods in Biomechanics andBiomedical Engineering: Imaging & Visualization, pp. 1–9, 2016

[13] F. Xing, J. Woo, E. Z. Murano, J. Lee, M. Stone, J. L. Prince, “3Dtongue motion from tagged and cine MR images,” International Confer-ence on Medical Image Computing and Computer Assisted Intervention(MICCAI), LNCS 8151, pp. 41–48, 2013

[14] B. Gick and I. Stavness, “Modularizing speech,” Frontiers in psychology,4, pp. 1–3, 2013.

[15] M. Stone, “A three-dimensional model of tongue movement based onultrasound and x-ray microbeam data,” Journal of the Acoustical Societyof America, 87, pp.2207-2217, 1990

[16] V. Ramanarayanan, L. Goldstein, and S. Narayanan, “Spatio-temporalarticulatory movement primitives during speech production: extraction,interpretation, and validation,” Journal of the Acoustical Society ofAmerica, 134(2), pp. 1378–1394, 2013

[17] J. Woo, F. Xing, J. Lee, M. Stone, J. L. Prince, “Determining FunctionalUnits of Tongue Motion via Graph-regularized Sparse Non-negativeMatrix Factorization,” International Conference on Medical Image Com-puting and Computer Assisted Intervention (MICCAI), 17(Pt 2), pp. 146–53, 2014

[18] C. Browman and L. Goldstein, “Dynamics and articulatory phonology,”in Mind as Motion: Explorations in the Dynamics of Cognition, edited byR. F. Port and T.van Gelder (MIT Press, Cambridge, MA), pp. 175-194.

[19] T. Li and C. H. Q. Ding, “Nonnegative Matrix Factorizations forClustering: A Survey,” In: Data Clustering: Algorithms and Applications.Chapman and Hall/CRC, pp. 149-176, 2013

[20] N. F. Osman, W. S. Kerwin, E. R. McVeigh, J. L. Prince, “Cardiacmotion tracking using CINE harmonic phase (HARP) magnetic resonanceimaging,” Magnetic resonance in medicine, 42(6), pp. 1048-1060, 1999

Page 8 of 9

For Peer Review


Fig. 7. Illustration of functional units during /i/ to /s/ from “a geese”, showing (a) 3D Lagrangian displacement field, (b) functional units (2 clusters), (c)functional units (3 clusters), and (d) functional units (4 clusters). It is noted that the colored clustering results are plotted in the tongue shape of the first timeframe showing the neutral tongue position.

2 3 4 5 6Number of clusters

0.18

0.19

0.2

0.21

0.22

0.23

0.24

Fig. 8. Plot of the dispersion coefficient versus the different number of clustersfor the task of /i/ to /s/ from “a geese”.

[21] T. Mansi, X. Pennec, M. Sermesant, H. Delingette, and N. Ayache,“iLogDemons: A demons-based registration algorithm for tracking in-compressible elastic biological tissues,” International journal of computervision, 92(1), pp. 92–111, 2011

[22] F. Xing, C. Ye, J. Woo, M. Stone, J. L. Prince, “Relating speech pro-duction to tongue muscle compressions using tagged and high-resolutionmagnetic resonance imaging,” SPIE Medical Imaging, Florida, FL, 2015

[23] D. Cai, X. He, J. Han, T. S. Huang, “Graph regularized nonnegativematrix factorization for data representation,” IEEE Transactions on PatternAnalysis and Machine Intelligence 33(8), pp. 1548–1560, 2011

[24] H. Shinagawa, E. Z. Murano, J. Zhuo, B. Landman, R. Gullapalli, and J.L. Prince, M. Stone, “Tongue muscle fiber tracking during rest and tongueprotrusion with oral appliances: A preliminary study with diffusion tensorimaging,” Acoustical science and technology, 29(4), pp. 291–294, 2008

[25] T. A. Gaige, T. Benner, R. Wang, V. J. Wedeen, R. J. Gilbert, R.J.,“Three dimensional myoarchitecture of the human tongue determined invivo by diffusion tensor imaging with tractography,” Journal of MagneticResonance Imaging, 26(3), pp. 654–661, 2007

[26] M. NessAiver and J. L. Prince, “Magnitude image CSPAMM reconstruc-tion (MICSR),” Magnetic Resonance in Medicine, 50(2), pp. 331–342,2003

[27] J. Woo, M. Stone, Y. Suo, E. Murano, J. L. Prince, “Tissue-Point MotionTracking in the Tongue From Cine MRI and Tagged MRI,” Journal ofSpeech, Language, and Hearing Research, 57(2), pp. 626–636, 2013

[28] J. Kim and H. Park, “Sparse nonnegative matrix factorization for clus-tering,”Technical Report GT-CSE-08-01, Georgia Institute of Technology,2008

[29] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Documentclustering using nonnegative matrix factorization,” Information Processingand Management, 42(2), pp. 373–386, 2006

[30] Y. Gao, G. Church, “Improving molecular cancer class discovery throughsparse non-negative matrix factorization,” Bioinformatics, 21(21), pp.3970–3975, 2005

[31] J. Wang, X. Wang, X. Gao, “Non-negative matrix factorization bymaximizing correntropy for cancer clustering,” BMC Bioinformatics,14(1), pp. 1–11, 2013

[32] A. Anderson, P. K. Douglas, W. T., Kerr, V. S. Haynes, A. L. Yuille,A.L., J. Xie, Y. N. Wu, J. A. Brown, M. S. Cohen, “Non-negative

matrix factorization of multimodal MRI, fMRI and phenotypic datareveals differential changes in default mode subnetworks in ADHD,”NeuroImage, 102(1), pp. 207–219, 2014

[33] Q. Mo, and B. A. Draper, “Semi-nonnegative matrix factorizationfor motion segmentation with missing data,” European Conference onComputer Vision, pp. 402–415, Berlin, Heidelberg, 2012

[34] A. Cheriyadat and R. J. Radke, “Non-negative matrix factorization ofpartial track data for motion segmentation,” IEEE 12th InternationalConference in Computer Vision, pp. 865–872, 2009

[35] F. Xing, J. Woo, A. Gomez, D. L. Pham, P. V. Bayly, M. Stone, and J.L. Prince, “Phase Vector Incompressible Registration Algorithm (PVIRA)for Motion Estimation from Tagged Magnetic Resonance Images,” IEEETrans on Medical Imaging, 36(10), pp. 2116–2128, Oct, 2017

[36] F. Xing, J. Woo, E. Z. Murano, J. Lee, M. Stone, J. L. Prince, “3D tonguemotion from tagged and cine MR images,” 16th International Conferenceon Medical Image Computing and Computer Assisted Intervention (MIC-CAI), Nagoya, Japan, pp. 865–872, September, 2013

[37] A. Elgammal and C. S. Lee, “The role of manifold learning in humanmotion analysis,” In B. Rosenhahn, R. Klette, D. Metaxas, (eds.) HumanMotion. Computational Imaging and Vision, vol. 36, pp. 25–56. Springer,Netherlands, 2008

[38] Y. Qian, S. Jia, J. Zhou, J., A. Robles-Kelly, “Hyperspectral unmixing viasparsity-constrained nonnegative matrix factorization,” IEEE Transactionson Geoscience and Remote Sensing, pp. 4282–4297, 2011

[39] S. Yang, C. Hou, C. Zhang, Y. Wu, “Hyperspectral unmixing viasparsity-constrained nonnegative matrix factorization,” IEEE Transactionson Geoscience and Remote Sensing, pp. 4282–4297, 2011

[40] X. Lu, H. Wu, Y. Yuan, P. Yan, X. Li, “Manifold regularized sparseNMF for hyperspectral unmixing,” IEEE Transactions on Geoscience andRemote Sensing, 51(5), pp. 2815-2826, 2013

[41] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics andComputing, 17(4), pp. 395–416, 2007

[42] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEETransactions on Pattern Analysis and Machine Intelligence, 22(8), pp.888–905, 2000

[43] I. Stavness, J. E. Lloyd and S. Fels, “Automatic prediction of tonguemuscle activations using a finite element model,” Journal of Biomechan-ics, 45(16), pp. 2841–2848, 2012

[44] L. Pittman and E. F. Bailey, “Genioglossus and intrinsic electromyo-graphic activities in impeded and unimpeded protrusion tasks,” Journalof Neurophysiology, 101(1), pp. 276–282, 2009

[45] Z. Zhang, Y., T. Li,, C. Ding, R., X.-W., and X. S. Zhang, “Binarymatrix factorization for analyzing gene expression data. Data Mining andKnowledge Discovery,” 20(1), pp. 28–52, 2010

[46] K. Da, J. Choo, and H. Park, “Nonnegative matrix factorization for inter-active topic modeling and document clustering,” In Partitional ClusteringAlgorithms, pp. 215–243, 2015.

[47] C. Boutsidis and E. Gallopoulos, “SVD based initialization: A headstart for nonnegative matrix factorization,” Pattern Recognition, 41(4) pp.1350-1362, 2008.

[48] P. Viviani, T. Flash, “Minimum-jerk, two-thirds power law, andisochrony: converging approaches to movement planning,” Journalof Experimental Psychology: Human Perception and Performance,Feb;21(1):32, 1995.

[49] J. Woo, F. Xing, M. Stone, J. Green, T. Reese, T. Brady, V. Wedeen,J. Prince, and G. El Fakhri,, “Speech Map: A Statistical MultimodalAtlas of 4D Tongue Motion During Speech from Tagged and Cine MRImages,” Journal of Computer Methods in Biomechanics and BiomedicalEngineering (CMBBE): Special Issue on Imaging and Visualization, 2017

Page 9 of 9

For Peer Review · ﬁber anatomy, respectively. With advances in computational anatomy, a vocal...

Documents

Transcript of For Peer Review · ﬁber anatomy, respectively. With advances in computational anatomy, a vocal...