

  • JSLHR

    Research Note

    A Critical Evaluation of Gestural Stiffness Estimations in Speech Production Based on a Linear Second-Order Model

    Susanne Fuchs,ᵃ Pascal Perrier,ᵇ and Mariam Hartingerᵃ

    Purpose: Linear second-order models have often been used to investigate properties of speech production. However, these models are inaccurate approximations of the speech apparatus. This study aims at assessing how reliably stiffness can be estimated from kinematics with these models.

    Method: Articulatory movements were collected for 9 speakers of German during the production of reiterant CVCV words at varying speech rates. Velocity peaks, movement amplitudes, and gesture durations were measured. In the context of an undamped model, 2 stiffness estimations were compared that should theoretically yield the same result. In the context of a damped model, gestural stiffness and damping were calculated for each gesture.

    Results: Numerous cases were found in which stiffness estimations based on the undamped model contradicted each other. Less than 80% of the data were found to be compatible with the properties of the damped model. Stiffness tends to decrease with gestural duration. However, it is associated with a large, unrealistic damping dispersion, making stiffness estimations from kinematic data to a large extent unreliable.

    Conclusion: Any conclusions about speech control based on stiffness estimations using linear second-order models should therefore be considered with caution.

    Key Words: second-order model, gestural stiffness, speech kinematics, damping

    The modeling of the dynamics of vocal tract articulators by means of a linear second-order model (LSOM) has been very useful in understanding and characterizing some general aspects of speech production. For instance, vowel deletion in fast speech could be predicted by introducing the hypothesis of gestural overlap (Browman & Goldstein, 1990a), which involves competition between second-order dynamical attractors.

    The relation between articulators’ dynamics and gesture durations has been investigated from the perspective of the relations between stiffness and oscillation frequency in the context of an LSOM (Kelso, Vatikiotis-Bateson, Saltzman, & Kay, 1985; Ostry & Munhall, 1985). It could be shown that within a gesture, peak velocity (Vmax), movement amplitude (Amp), and duration (T) cannot be treated as independent variables. Because of these and other ambitious studies in this framework (e.g., Saltzman, 1986; Saltzman & Munhall, 1989), the LSOM has become very popular in speech production research. However, the model has been used not only to describe general aspects of speech production; it has also been used as a tool to quantitatively infer physical characteristics of articulators, such as stiffness, from speech kinematics. On the basis of such an approach, Kelso et al. (1985) proposed that stiffness and rest position are the key control parameters for speech production and determine speech rate variations. Ackermann, Hertrich, and Scharf (1995) suggested that stiffness can be a key parameter to differentiate speech motor control between populations with different speech disorders. More recently, Perkell,

    ᵃCenter for General Linguistics (Zentrum für Allgemeine Sprachwissenschaft [ZAS]/Phonetik), Berlin, Germany
    ᵇDépartement Parole et Cognition (DPC)/Gipsa-lab, Centre National de la Recherche Scientifique (CNRS), Grenoble INP, Grenoble, France

    Correspondence to Pascal Perrier: [email protected]

    Editor: Anne Smith
    Associate Editor: Wolfram Ziegler

    Received May 15, 2010
    Revision received October 27, 2010
    Accepted December 22, 2010
    DOI: 10.1044/1092-4388(2010/10-0131)
    Dedication: This work is dedicated to Dieter Fuchs.

    Journal of Speech, Language, and Hearing Research • Vol. 54 • 1067–1076 • August 2011 • © American Speech-Language-Hearing Association

    Complimentary Author PDF: Not for Broad Dissemination

  • Zandipour, Matthies, and Lane (2002) measured stiffness on the basis of undamped LSOM properties and discussed their results in the context of economy of effort in speech production. Kühnert and Hoole (2004) related alveolar and velar articulatory reductions to stiffness in the context of an undamped model. Similarly, in van Lieshout, Bose, Square, and Steele (2007), stiffness was analyzed to account for control mechanisms in apraxia of speech. Xu and Wang (2009) also proposed using the properties of a damped model to estimate stiffness with respect to the timing of f0 modulations in Mandarin Chinese. However, there is growing evidence that stiffness cannot be seen solely in the framework of an LSOM but is physical in nature and, therefore, is crucially dependent on biomechanics, muscular forces, and cocontraction, among other factors (e.g., Gomi, Honda, Ito, & Murano, 2002; Shiller, Houle, & Ostry, 2005).

    In the context of an LSOM, articulatory stiffness is computed on the basis of the assumption (see Appendix) that the ratio Vmax/Amp is proportional to the square root of the mass-normalized stiffness K/m, as seen in equation (1):

    $$\frac{V_{\max}}{Amp} = \alpha \sqrt{K/m} \qquad (1)$$

    In addition, articulatory stiffness can also be calculated on the basis of the assumption (see Appendix) that the square root of the mass-normalized stiffness K/m is inversely proportional to the duration (T) of the gesture:

    $$\sqrt{K/m} = \frac{\beta}{T} \qquad (2)$$

    This approach is implicitly based on three assumptions. (a) Speech gestures correspond to half a period of an LSOM oscillating around its rest position. The position reached at the end of the movement should correspond to a turning point (the extreme position of the oscillation). (b) The proportionality factors α and β used in equations (1) and (2) are assumed to be constant across gestures. (c) Stiffness and damping are constant during the whole duration of a gesture. In speech production, however, these three assumptions are problematic for the following reasons: There are speech sounds, such as long vowels, sibilants, or geminates, that can be kept relatively stable over fairly long time intervals. During these intervals, plateaus are observed in the articulatory trajectory. These plateaus are not compatible with the turning point conception of assumption (a) unless the damping is stronger than the critical damping. However, in this case, the influence of damping on peak velocity and gesture duration would be at least as strong as the influence of stiffness. Assumption (b) is true only if the system is undamped or if the same relation between damping and stiffness is found in all gestures (see Appendix). These two conditions are far from being realistic. First, the vocal tract is such a narrow space that during the course of their displacements, speech articulators are likely to be in contact with various structures (e.g., contact of the tongue with the teeth or the palate). Hence, damping clearly exists due to interaction of the tongue with vocal tract structures, and it is likely to vary across gestures as well as over the course of a gesture. Second, the mechanical properties of vocal tract articulators are very different. Tongue, lips, and velum are soft bodies, whereas the jaw is a rigid body. Hence, the nature of the damping is not the same for all these articulators. Assumption (c) may not hold true, since damping can vary along the course of the articulator’s displacement. The same may be the case for stiffness. Oro-facial muscles and, more generally, human soft tissue have been shown to have nonlinear passive elastic properties for which stiffness (the Young modulus) increases with strain (Fung, 1981; Gerard, Ohayon, Luboz, Perrier, & Payan, 2005). Moreover, stiffness increases with the level of muscular activity (McMahon, 1984). Assumption (c) also implies that the control parameters of the LSOM (stiffness and rest position) are set up instantaneously at the beginning of each gesture, without any transition period. Instantaneous changes are indeed nonphysiological. Hence, in spite of its strengths in globally describing important trends of speech gestures, the LSOM remains a rough approximation of the physical characteristics of articulatory gestures.

    Some of these limitations have already been acknowledged in previous studies. For example, Browman and Goldstein (1985) considered the problem raised by the existence of articulatory plateaus. They realized that plateaus cannot be accounted for without considering a specific time variation in the damping. Browman and Goldstein (1990b) also insisted on the necessity of properly evaluating the damping for each gesture, and they agreed with proposals made in the literature that the control parameters of the model should not vary as a step function. Kelso et al. (1985) addressed the fact that stiffness varies as a function of strain. They proposed an alternative way to measure stiffness in the context of an undamped model.

    The aim of the present study was to assess how reliably gestural stiffness can be inferred from articulatory amplitude and velocity or gestural duration as proposed by the LSOM.

    Method

    Articulatory data for various consonants, speakers, and speaking rates were used to evaluate gestural stiffness in the context of the LSOM. This assessment was carried out first using an undamped LSOM and then using a damped LSOM.


  • Corpus

    Subjects were asked to produce reiterant speech first at an increasing speaking rate and then at a decreasing speaking rate over a time interval of 16 s, following an experimental scenario proposed by Rochet-Capellan and Schwartz (2007). The corpus consisted of repetitions of bisyllabic CVCV words chosen from among /fata/, /kata/, /kapa/, /kaʃa/, /pasa/, /paʃa/, /pata/, /paka/, /sapa/, /ʃapa/, /ʃaka/, /tafa/, /tapa/, and /taka/. The variety of consonants and articulators involved and the different speech rates allowed us to evaluate the reliability of the LSOM in stiffness estimation. Movements were recorded by Electromagnetic Midsagittal Articulography (Carstens, AG100). Three sensors were glued onto the tongue at around 1 cm, 3 cm, and 5 cm from the tongue tip. Lower lip and jaw movements were measured using sensors glued at the vermillion border and just below the lower incisors. Reference sensors glued onto the upper incisors and onto the bridge of the nose were used to compensate for head movements in the helmet. A bite plane served as a reference to determine the vertical and horizontal coordinates of each individual vocal tract. The contribution of the jaw was not subtracted from tongue and lower lip movements, as this decomposition is not straightforward. Movements were sampled at 200 Hz. Movements and velocities were filtered with an 18-Hz and a 15-Hz low-pass Kaiser filter, respectively.

    Nine speakers of German were recorded: five females (sp1–sp5) and four males (sp6–sp9). They had no reported history of speech, language, or hearing impairment. All procedures were in agreement with our institutional regulations. All subjects provided informed consent prior to testing. Subjects were between 30 and 42 years old, except sp7, who was 68 years old. Sp7 was included in the corpus because he is an ultrafast speaker who has won several competitions for his fast and precise speech in the mass media (Jannedy, Fuchs, & Weirich, 2010). We expected that this speaker might use different articulatory strategies than the other subjects.

    The speaking pace was given by a visual metronome (see Rochet-Capellan & Schwartz, 2007) consisting of successively varying black and white images displayed on a computer screen. Their frequency of change varied between 3.3 Hz and 20 Hz. Sp7, however, was free to choose the pace range according to his specific skills.

    Measures

    Closing and opening gestures were analyzed in the present study. They were labeled on the tangential velocity signal using the velocity minimum of the movements toward and from the consonants. To keep the amount of data at a reasonable level for each consonant, only the articulators that are known to be crucial for the production of a given consonant were considered. Tongue tip and jaw movements were analyzed for /t, s/, tongue dorsum and jaw movements were analyzed for /ʃ/, tongue back movements were analyzed for /k/, and lower lip and jaw movements were analyzed for /p, f/. Within each of the segmented gestures and for each of the crucial articulators, movement amplitude (Amp), velocity peak (Vmax), and gesture duration (T) were measured. Movement amplitude was computed as the length of the articulatory path, that is, the cumulative Euclidean distance from one sample to the next.
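    The three measures can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and estimating the tangential velocity by plain finite differences is an assumption made here (the study low-pass filtered positions and velocities before measuring them).

```python
import math

SAMPLE_RATE_HZ = 200.0  # sampling rate reported in the Corpus section


def path_length(xs, ys):
    """Amp: cumulative Euclidean distance from one sample to the next."""
    return sum(
        math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
        for i in range(len(xs) - 1)
    )


def peak_tangential_velocity(xs, ys, rate=SAMPLE_RATE_HZ):
    """Vmax: largest sample-to-sample speed (finite differences, an assumption)."""
    return max(
        math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) * rate
        for i in range(len(xs) - 1)
    )


def gesture_duration(n_samples, rate=SAMPLE_RATE_HZ):
    """T: duration in seconds of a gesture spanning n_samples samples."""
    return (n_samples - 1) / rate
```

    Because the amplitude is a path length rather than a straight-line displacement, a curved trajectory yields a larger Amp than the distance between its end points.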

    Stiffness Estimation in the Context of an Undamped LSOM

    In an undamped LSOM, equations (1) and (2), computed respectively with α = .5 and β = π (see Appendix), are theoretically two strictly equivalent ways to compute the mass-normalized stiffness k. Equation (1) is based on kinematic measures, and equation (2) is based on temporal measures.

    Undamped LSOMs have essentially been used in speech production studies to evaluate differences in stiffness between two gestures and not to consider absolute stiffness values. For example, a stiffness increase from one gesture to another could be interpreted as an increase in articulatory effort (Perkell et al., 2002). Consequently, our study focused on stiffness differences between gestures. The following procedure was carried out: For a given speaker and a given condition (split by word, syllable, consonant, and articulator), all possible combinations of two gestures were selected and were grouped in pairs. For each pair, the stiffness difference was computed, once on the basis of equation (1) and once on the basis of equation (2). All cases in which the stiffness difference showed a positive (or negative) sign on the basis of one equation and a negative (or positive) sign on the basis of the other were considered misleading (errors). The percentage of gesture pairs for which such opposite estimations (OpposEstim) were found was taken as a measure of the inadequacy of an undamped LSOM to calculate stiffness. Whenever differences in stiffness showed the same sign, either positive or negative, the model was considered to be adequate.
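    The pairwise comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the (Vmax, Amp, T) tuple layout are hypothetical, and no selection criteria are applied to the pairs.

```python
import math
from itertools import combinations

ALPHA = 0.5      # equation (1) factor for an undamped LSOM
BETA = math.pi   # equation (2) factor for an undamped LSOM


def k_from_kinematics(vmax, amp):
    """Mass-normalized stiffness from equation (1): Vmax/Amp = ALPHA * sqrt(k)."""
    return (vmax / (ALPHA * amp)) ** 2


def k_from_duration(duration):
    """Mass-normalized stiffness from equation (2): sqrt(k) = BETA / T."""
    return (BETA / duration) ** 2


def percent_opposite(gestures):
    """Percentage of gesture pairs whose two stiffness differences disagree in sign.

    gestures: list of (vmax, amp, duration) tuples for one speaker and condition
    (hypothetical layout chosen for this sketch).
    """
    pairs = list(combinations(gestures, 2))
    if not pairs:
        return 0.0
    opposite = 0
    for (v1, a1, t1), (v2, a2, t2) in pairs:
        diff_eq1 = k_from_kinematics(v1, a1) - k_from_kinematics(v2, a2)
        diff_eq2 = k_from_duration(t1) - k_from_duration(t2)
        if diff_eq1 * diff_eq2 < 0:  # opposite signs: an OpposEstim case
            opposite += 1
    return 100.0 * opposite / len(pairs)
```

    For an ideal undamped oscillator, Vmax/Amp = π/(2T), so both functions return the same stiffness and no pair can be counted as opposite.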

    In order to be as conservative as possible, only the following cases were selected: Because movement data were sampled at 200 Hz, a maximal potential inaccuracy of 5 ms exists in the time labeling. Hence, only gestures with a temporal difference longer than 5 ms were taken into account. OpposEstim was computed only for pairs in which stiffness differences could not be attributed to measurement inaccuracies. In line with classical principles used for measurement techniques in physics, only cases where stiffness differences correspond to a minimum

    Fuchs et al.: Speech Gesture Stiffness and Second-Order Modeling 1069


  • of 10% of the stiffness for at least one gesture of the gesture pair were taken into account. This enabled the possible consequences of these inaccuracies in our evaluation to be minimized. The distribution of OpposEstim was analyzed for each speaker and condition as a function of the durations T1 and T2 of the two gestures that are compared within each pair. To this end, the gesture duration plane (T1, T2) was divided into elementary squares, each 20 ms × 20 ms in size (see Figure 1).
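    The selection criteria and the binning of the (T1, T2) plane can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: the 10% criterion is applied to both stiffness estimates and measured against the smaller stiffness of the pair, durations are rounded to whole milliseconds, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

BIN_MS = 20      # side of an elementary square in the (T1, T2) plane
MIN_DT_MS = 5    # timing resolution at a 200-Hz sampling rate
MIN_REL = 0.10   # minimum relative stiffness difference


def k_eq1(vmax, amp):
    return (2.0 * vmax / amp) ** 2      # equation (1) with alpha = 0.5


def k_eq2(duration):
    return (math.pi / duration) ** 2    # equation (2) with beta = pi


def large_enough(k_a, k_b):
    """True if the difference is at least 10% of the smaller stiffness (assumed reading)."""
    return abs(k_a - k_b) >= MIN_REL * min(k_a, k_b)


def oppos_estim_map(gestures):
    """Map each 20 ms x 20 ms cell to its percentage of opposite estimations.

    gestures: list of (vmax, amp, duration_in_seconds) tuples (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [opposite pairs, retained pairs]
    for (v1, a1, t1), (v2, a2, t2) in combinations(gestures, 2):
        t1_ms, t2_ms = round(t1 * 1000), round(t2 * 1000)
        if abs(t1_ms - t2_ms) <= MIN_DT_MS:
            continue  # durations not distinguishable at this sampling rate
        ks1 = (k_eq1(v1, a1), k_eq1(v2, a2))
        ks2 = (k_eq2(t1), k_eq2(t2))
        if not (large_enough(*ks1) and large_enough(*ks2)):
            continue  # difference could be a measurement inaccuracy
        cell = (t1_ms // BIN_MS, t2_ms // BIN_MS)
        counts[cell][1] += 1
        if (ks1[0] - ks1[1]) * (ks2[0] - ks2[1]) < 0:
            counts[cell][0] += 1
    return {cell: 100.0 * opp / total for cell, (opp, total) in counts.items()}
```

    Each pair is counted once, in the cell given by the order of the two gestures in the input list; mirroring the result across the diagonal would give a symmetric map.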

    Stiffness Estimations in the Context of a Damped LSOM

    The second step of the evaluation consisted of finding the mass-normalized stiffness and damping factor in the context of an underdamped LSOM that would account for the experimental measures of Vmax/Amp and T. In the context of such a model, the movement of the speech articulator is assumed to obey an equation of the type seen in equation (3),

    $$a(t) + b \cdot v(t) + k \cdot x(t) = 0, \qquad (3)$$

    where x(t), v(t), and a(t) are the position, the velocity, and the acceleration of the articulator, respectively, and b and k are the mass-normalized damping factor and stiffness (hereafter referred to as simply “damping” and “stiffness”). The following two equations describe the relation between movement amplitude (Amp), velocity peak (Vmax), gesture duration (T), stiffness (k), and damping (b) (see Appendix for details):

    $$T = \frac{2 \cdot \pi}{\sqrt{4 \cdot k - b^2}} \qquad (4)$$

    and

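    $$\frac{V_{\max}}{Amp} = \frac{\sqrt{k}}{1 + e^{-b \cdot \frac{T}{2}}} \cdot e^{-b \cdot \frac{T}{2} \cdot \left[ \frac{1}{2} - \frac{1}{\pi} \cdot \tan^{-1}\left( \frac{b \cdot T}{2 \cdot \pi} \right) \right]}. \qquad (5)$$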

    If no solution can be found for equations (4) and (5) when the kinematic parameters are experimentally measured for a particular speech gesture, then the production of this gesture cannot be assumed to behave like a damped LSOM. In this case, gestural stiffness cannot be inferred from speech kinematics. If a solution for stiffness and damping exists for these two equations, an LSOM can be assumed to underlie the production of this gesture. To solve equations (4) and (5), the fsolve function in MATLAB was used.
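    As an illustration of what solving equations (4) and (5) involves, the sketch below uses plain bisection in Python rather than MATLAB's fsolve, so it is a substitute technique and not the authors' procedure. It eliminates k through equation (4) and searches for b. The function names, the upper search bound, and the assumption that the predicted ratio grows with b for a fixed T (in line with the statement that c increases with damping) are choices made here.

```python
import math


def predicted_ratio(k, b, T):
    """Right-hand side of equation (5): the Vmax/Amp predicted for (k, b, T)."""
    u = b * T / 2.0
    return (math.sqrt(k) / (1.0 + math.exp(-u))
            * math.exp(-u * (0.5 - math.atan(u / math.pi) / math.pi)))


def stiffness_from(b, T):
    """Equation (4) solved for k: 4k - b^2 = (2*pi/T)^2."""
    return (math.pi / T) ** 2 + b * b / 4.0


def solve_damped(ratio, T, b_max=None, iterations=200):
    """Return (k, b) reproducing the measured ratio = Vmax/Amp and T, or None.

    None means that no underdamped solution exists in the searched range.
    b_max is a hypothetical search bound (default: b*T/2 up to 50).
    """
    if b_max is None:
        b_max = 100.0 / T

    def residual(b):
        return predicted_ratio(stiffness_from(b, T), b, T) - ratio

    if residual(0.0) > 0.0 or residual(b_max) < 0.0:
        return None  # measured ratio lies outside the range the model can produce
    lo, hi = 0.0, b_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return stiffness_from(b, T), b
```

    With b = 0 the predicted ratio reduces to π/(2T), so any gesture whose measured Vmax/Amp falls below that value is reported as having no solution.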

    Results

    Stiffness Estimations Based on an Undamped LSOM: Opposite Estimations

    Considering all gestures together, the mass-normalized stiffness values computed either from equation (1) or from equation (2) ranged from 80 s⁻² to 11,000 s⁻². Figure 1 shows the corresponding distribution of the percentage OpposEstim. It can be seen that the percentage OpposEstim was far from negligible (minimum = 2.9%, maximum = 47.1%; see Table 1). However, OpposEstims did not occur everywhere in the (T1, T2) plane. They were found in 52.2% of the elementary squares (see Table 1). These squares were mostly located in the neighborhood of the diagonal in the (T1, T2) plane. No case of OpposEstims was found when a very short gesture was compared with a very long one. OpposEstims increased with gestural duration.

    Figure 1. Distribution of opposite estimations (OpposEstim) computed for all data as a function of gesture durations T1 and T2 of the two gestures that are compared with each other. s = seconds.

    Table 1. Characteristics of the percentage of opposite estimation (OpposEstim) distributions.

    Speaker   Max    Min    % nonzero   Slope
    all       47.1   2.9    52.2        469.2
    sp1       46.7   3.3    46.2        409.5
    sp2       36.7   3.3    58.8        781.0
    sp3       41.9   3.2    54.8        649.8
    sp4       46.7   3.3    58.0        684.5
    sp5       41.0   2.56   52.6        809.5
    sp6       33.3   2.1    41.5        1031.3
    sp7       57.6   3.0    55.9        822.5
    sp8       51.5   3.0    56.0        535.7
    sp9       60.0   3.3    60.0        1290.5

    Note. all = analysis of all data together; Max = maximal value of OpposEstim in the distribution; Min = minimal value of OpposEstim in the distribution; % nonzero = proportion of elementary (20 ms × 20 ms) squares of the gesture duration plane (T1, T2) in which opposite estimations were found; Slope = the slope of the linear regression between T1 and the average value of OpposEstim computed around the diagonal in the (T1, T2) plane (see text discussion).


  • When data were split across speakers, similar observations could be made. Differences across subjects were found only in the percentage of elementary squares in which opposite estimations were found. This varied from 33.3% (sp6) to 60% (sp9). No significant difference was found across speakers in the minimum and maximum values of OpposEstim. In sum, the trends observed for the distribution of OpposEstim for all data taken together were confirmed for each speaker.

    Results Obtained for a Damped LSOM

    Considering all speakers and all repetitions together, the total number of analyzed gestures was 27,531. For 5,926 of these gestures, it was not possible to find a stiffness k and a damping b for the damped model that were compatible with both the measured Vmax/Amp value and the T duration value as specified in equations (4) and (5). This result represents 21.5% of the total number of data. When considering the data of each speaker separately, a similar percentage was found for seven speakers (see Table 2A). For the female speaker sp4, the percentage was clearly smaller (13.3%), and it was clearly larger (29.5%) for the ultrafast male speaker sp7. When considering the data of all speakers together split by articulators and by gestures (see Table 2B), a larger variability was observed: The percentage was significantly smaller than 21.5% in the closing gestures of the tongue tip and the lower lip, and it was significantly larger (close to 40%) for the tongue back. Nevertheless, the proportion of these data was generally substantial, and the value 20% was quite representative of the general trend observed across speakers, articulators, and gestures.

    To further investigate the characteristics of these 20% of data, all measures from all speakers were plotted in the (T, Vmax/Amp) plane. For the sake of clarity, it should be noted that for an LSOM, Vmax/Amp and T are theoretically linked by the relation

    $$\frac{V_{\max}}{Amp} = \frac{c}{T}, \qquad (6)$$

    where c increases with the damping, with a minimum value equal to 1.57 (π/2) in the case of an undamped

    Table 2. Global analysis of the results computed from equations (4) and (5).

    A

    Speaker   %NoComp   c1     c2
    all       21.5      1.85   1.44
    sp1       22.3      1.88   1.44
    sp2       18.4      1.85   1.44
    sp3       20.1      1.84   1.44
    sp4       13.3      1.89   1.47
    sp5       23.1      1.85   1.44
    sp6       22        1.80   1.45
    sp7       29.5      1.79   1.44
    sp8       23.5      1.86   1.42
    sp9       18.5      1.89   1.44

    B

    Articul   Gest   %NoComp   c1     c2
    dor       C      27.4      1.81   1.45
    dor       O      29.3      1.81   1.44
    jaw       C      30.2      1.76   1.43
    jaw       O      23.1      1.80   1.44
    llip      C      7.8       1.91   1.48
    llip      O      18.8      1.81   1.47
    tback     C      35.6      1.81   1.43
    tback     O      39.2      1.80   1.42
    ttip      C      5.8       1.99   1.46
    ttip      O      14        1.89   1.46

    Note. Panel A: Data split by speakers for all gestures together. Panel B: Data split by articulators and gesture types (opening vs. closing) for all speakers together. %NoComp = percentage of data for which no solution is found for equations (4) and (5); c1 = c coefficient in Vmax/Amp = c/T for data compatible with equations (4) and (5); c2 = c coefficient in Vmax/Amp = c/T for data not compatible with equations (4) and (5); sp = speaker; Articul = articulators taken into account, as measured with the different Electromagnetic Midsagittal Articulography (EMMA) sensors; Gest = gesture; O = opening; C = closing; dor = tongue dorsum sensor; jaw = jaw sensor; llip = lower lip sensor; tback = tongue back sensor; ttip = tongue tip sensor.

    Figure 2. Distribution of all data in the (T, Vmax/Amp) plane. Light gray indicates data for which no solution was found for equations (4) and (5). Dark gray indicates data for which a solution was found for equations (4) and (5). Bold solid line indicates Vmax/Amp = 1.57/T. Dashed–dotted line indicates Vmax/Amp = 1.44/T, accounting for data plotted in light gray. Dotted line indicates Vmax/Amp = 1.85/T, accounting for data plotted in dark gray.


  • system. Speech data have often been found in the literature to be generally well accounted for by equation (6) with c values larger than 1.57 (see, e.g., Ostry & Munhall, 1985, with c values ranging from 1.83 to 1.90), which is consistent with the hypothesis of the properties of an underdamped LSOM.

    Figure 2 shows the distribution of the data. Data for which solutions do not exist for equations (4) and (5) (henceforth, Set1) are plotted in light gray. Other data (Set2) are plotted in dark gray. The bold solid curve corresponds to equation (6), with c = 1.57. All the data of Set1 are located below this curve. Thus, all data of Set1 correspond to c values that are smaller than 1.57 and are, therefore, not compatible with the characteristics of an LSOM. These data are well accounted for by equation (6), with c = 1.44 (dashed–dotted curve). All the data of Set2 are above the bold solid line, and they are well accounted for by equation (6), with c = 1.85 (dotted curve). Table 2 confirms that the properties of the Set1 and Set2 distributions and the c values computed for all data together are well representative of the data split by speakers, articulators, and gestures.

    In summary, for a significant percentage of the data (around 20%), the kinematic measures Vmax and Amp and the measured gesture duration T are not related in a way that is compatible with an LSOM. For these data, the ratio Vmax/Amp is smaller than the smallest value observed for an LSOM having the same gesture duration. These data can be seen as intermediate between gestures generated by an undamped LSOM, in which duration and Vmax/Amp are fully determined by the stiffness, and constant velocity gestures (corresponding to c = 1), in which these variables are fully specified by an extrinsic controller. For these data, no reliable information about gestural stiffness can be extracted either from Vmax/Amp or from T.
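    The split into Set1 and Set2 and the c coefficient of equation (6) can be sketched as follows. The text does not say how the c1 and c2 values were fitted, so the least-squares fit below is an assumption, and the function names and the (Vmax, Amp, T) tuple layout are hypothetical.

```python
import math

C_UNDAMPED = math.pi / 2.0  # minimum c of an LSOM (about 1.57)


def c_coefficient(vmax, amp, T):
    """c in equation (6), Vmax/Amp = c / T, for a single gesture."""
    return vmax / amp * T


def split_sets(gestures):
    """Separate gestures below the undamped limit (Set1) from the rest (Set2)."""
    set1, set2 = [], []
    for gesture in gestures:
        target = set1 if c_coefficient(*gesture) < C_UNDAMPED else set2
        target.append(gesture)
    return set1, set2


def fit_c(gestures):
    """Least-squares c for Vmax/Amp = c / T over a set of gestures (assumed method)."""
    numerator = sum((vmax / amp) / T for vmax, amp, T in gestures)
    denominator = sum(1.0 / T ** 2 for _, _, T in gestures)
    return numerator / denominator
```

    A constant-velocity gesture has Vmax = Amp/T and therefore c = 1, which places it in Set1, consistent with the interpretation given above.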

    In the rest of this section, only the data of Set2 will be considered. The c values are in good agreement with those found in former studies (e.g., Ostry & Munhall, 1985). In addition, as predicted by an underdamped LSOM (equation [1]), the correlation between Vmax/Amp and √k

    Table 3. Relative variations of stiffness as computed from equations (4) and (5).

    A

    Speaker   Cor1   Cor2   AvVari1   MaxVari1   AvVari2   MaxVari2
    all       0.81   0.98   21.4      23.9       66.2      118.7
    sp1       0.73   0.97   21.9      32.8       67.9      98.7
    sp2       0.76   0.97   20.3      25.0       61.5      119.7
    sp3       0.76   0.97   20.9      26.9       65.6      129.2
    sp4       0.75   0.97   20.2      24.3       65.6      104.9
    sp5       0.82   0.97   22.5      28.3       61.1      132.7
    sp6       0.81   0.97   19.7      29.9       54.0      103.7
    sp7       0.85   0.98   19.9      25.1       55.7      86.8
    sp8       0.77   0.97   21.3      25.3       61.4      84.6
    sp9       0.72   0.97   21.8      26.9       72.4      121.1

    B

    Articul   Gest   Cor1   Cor2   AvVari1   MaxVari1   AvVari2   MaxVari2
    dor       C      0.81   0.97   20.4      24.9       51.7      72.1
    dor       O      0.84   0.98   20.2      26.7       57.9      97.9
    jaw       C      0.85   0.98   19.8      27.3       54.4      87.3
    jaw       O      0.83   0.98   19.4      22.8       58.3      114.8
    llip      C      0.83   0.97   19.8      23.8       52.1      101.6
    llip      O      0.81   0.97   19.0      22.2       51.5      93.4
    tback     C      0.87   0.98   20.5      26.4       57.2      102.1
    tback     O      0.87   0.98   20.5      26.4       58.6      97.4
    ttip      C      0.71   0.97   20.6      24.5       68.1      118.3
    ttip      O      0.78   0.97   21.0      26.7       63.7      116.0

    Note. Panel A: Data split by speakers for all gestures together. Panel B: Data split by articulator and gesture types (opening vs. closing) for all speakers together. Cor1 = correlation coefficient between √k and 1/T; Cor2 = correlation coefficient between √k and Vmax/Amp; AvVari1 = average relative stiffness variability when Vmax/Amp varies (cf. Figure 4A); MaxVari1 = the maximum relative stiffness variability when Vmax/Amp varies (cf. Figure 4A); AvVari2 = average relative stiffness variability when T varies (cf. Figure 4A); MaxVari2 = maximum relative stiffness variability when T varies (cf. Figure 4A).


  • is very strong (R = .98 for all data), and it is constant across speakers, articulators, and gestures (see Table 3 for details). Similarly, in agreement with equation (2), the correlation between 1/T and √k is strong too (R = .81 for all data), and it varies just a little across speakers, articulators, and gestures (see Table 3 for details). This confirms the fact that LSOMs account well for global dynamical characteristics of speech articulators. However, the damping values calculated from equations (4) and (5) are highly variable. Figure 3 shows the distribution of the normalized damping (damping divided by the critical damping) for data split by speakers and articulators. In all cases, the normalized damping is widely distributed between 0 (undamped system) and 1 (critical damping). In general, the damping variability is huge and cannot be related to any physical properties of the articulatory system. Hence, these damping values should be considered as ad hoc values fitting in both equations (4) and (5) rather than as reliable estimations of actual physical damping characteristics of speech gestures. Given equations (4) and (5), a large variability in damping should be associated with a noticeable variability in stiffness.
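    For reference, the normalized damping plotted in Figure 3 can be written out from equations (3) and (4). The symbols ζ and b_crit are introduced here for convenience and do not appear in the original text.

```latex
% Critical damping of a(t) + b v(t) + k x(t) = 0 is reached when b^2 = 4k.
b_{crit} = 2\sqrt{k}, \qquad \zeta = \frac{b}{b_{crit}} = \frac{b}{2\sqrt{k}}
% Equation (4) gives 4k = b^2 + (2\pi/T)^2, hence
\zeta = \frac{b\,T}{\sqrt{(b\,T)^2 + 4\pi^2}}, \qquad 0 \le \zeta < 1 .
```

    Any pair (k, b) that satisfies equation (4) therefore has a normalized damping strictly below 1, which matches the range between 0 and 1 reported for Figure 3.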

    Figure 3. Distribution of the normalized damping (damping divided by the critical damping) obtained by solving equations (4) and (5), as a function of gesture duration, for data split by speakers and articulators. ttip = tongue tip sensor; sp = speaker; tback = tongue back sensor; llip = lower lip sensor; jaw = jaw sensor; dor = tongue dorsum sensor.

    Fuchs et al.: Speech Gesture Stiffness and Second-Order Modeling 1073

    To quantitatively estimate this variability, two approaches were used. In the first approach, the standard deviation $sd_k$ and the mean value $\bar{k}$ of the stiffness distribution were calculated within small intervals (0.5) of variation of Vmax/Amp. The relative variability of the stiffness was estimated as the ratio $2 \cdot sd_k / \bar{k}$. Its variation was studied as a function of Vmax/Amp; its average and maximal values were calculated when Vmax/Amp varied. In a second approach, the same computations were done for 5-ms intervals of the gesture duration T. Table 3 shows the results. The relative stiffness variability varies little with Vmax/Amp (the maximal value is close to the average value), and it is located around 20%. This amount of relative variability is found when all data are taken together as well as for data split by speakers, articulators, and gestures. The relative stiffness variability is large enough to make any estimation of the stiffness from the ratio Vmax/Amp quite inaccurate. It is equally large for slow and for fast gestures, as illustrated by the top panel of Figure 4 for all data taken together. The relative stiffness variability varies significantly with gesture duration (the maximal value differs strongly from the average one). The average value is located around 60%, with small variations across speakers, articulators, and gestures. The bottom panel of Figure 4 shows, for all data taken together, that the relative variability increases when gesture duration increases. This result was also found when data were split across speakers, articulators, or gestures. Hence, stiffness estimations on the basis of gesture durations are even more inaccurate than the estimations based on Vmax/Amp, and the longer the gesture, the worse the estimation.
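The binned variability measure used for Table 3 (twice the standard deviation of the stiffness divided by its mean, within fixed-width intervals of Vmax/Amp or T) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the choice to start the bins at the smallest observed value, the use of the sample standard deviation, and the skipping of bins with fewer than two gestures are all hypothetical and are not taken from the study.

```python
import math
import statistics
from collections import defaultdict


def relative_stiffness_variability(x, k, bin_width):
    """Return a list of (bin_start, 2*sd_k/mean_k) for each usable bin of x.

    x         : predictor per gesture (Vmax/Amp, or duration T)
    k         : mass-normalized stiffness estimate per gesture
    bin_width : interval size (e.g. 0.5 for Vmax/Amp, 0.005 s for T)
    """
    x0 = min(x)  # assumption: bins start at the smallest observed value
    bins = defaultdict(list)
    for xi, ki in zip(x, k):
        bins[math.floor((xi - x0) / bin_width)].append(ki)

    result = []
    for i in sorted(bins):
        k_bin = bins[i]
        if len(k_bin) < 2:  # a sample standard deviation needs two values
            continue
        # assumption: sample standard deviation (n - 1 in the denominator)
        rel = 2.0 * statistics.stdev(k_bin) / statistics.mean(k_bin)
        result.append((x0 + i * bin_width, rel))
    return result
```

The average and maximal values reported per condition would then be the mean and the maximum of the second element over all returned bins.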

    Discussion

    An evaluation was conducted of the LSOM’s capacity to account for dynamical properties of speech articulators. The first part of the evaluation was made in the context of an undamped LSOM. It consisted of considering pairs of gestures and comparing the mass-normalized stiffness differences between these gestures as estimated using equations (1) and (2). These equations are two strictly equivalent ways to use the properties of an undamped LSOM to compute mass-normalized stiffness: one based on kinematic properties and the other based on gesture duration. Our results revealed that both estimations are quite different in a significant number of cases. Cases in which one equation predicts a stiffness increase while the other equation predicts a stiffness decrease are numerous. For long gesture durations (more than 200 ms), the percentage of such cases was found to be as high as 50%. Hence, a first important conclusion is that estimations of the mass-normalized stiffness based on an undamped LSOM should be considered with a great deal of caution.
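The pairwise comparison described above can be illustrated with a short sketch. It assumes that equations (1) and (2) take the forms implied by the Appendix for b = 0, namely Vmax/Amp = $\sqrt{k}$/2 and T = π/$\sqrt{k}$. The function names are hypothetical, and the sketch compares all pairs of gestures, which may differ from the pairing used in the study.

```python
import math
from itertools import combinations


def k_from_kinematics(vmax, amp):
    # Undamped case of the Appendix: Vmax/Amp = sqrt(k) / 2
    return (2.0 * vmax / amp) ** 2


def k_from_duration(T):
    # Undamped case of the Appendix: T = pi / sqrt(k)
    return (math.pi / T) ** 2


def fraction_opposite(gestures):
    """gestures: list of (vmax, amp, T) tuples.

    Returns the fraction of gesture pairs for which the two undamped
    estimates predict stiffness changes of opposite sign.
    """
    opposite = total = 0
    for g1, g2 in combinations(gestures, 2):
        d_kin = k_from_kinematics(g2[0], g2[1]) - k_from_kinematics(g1[0], g1[1])
        d_dur = k_from_duration(g2[2]) - k_from_duration(g1[2])
        if d_kin == 0 or d_dur == 0:
            continue  # no direction of change to compare
        total += 1
        if d_kin * d_dur < 0:
            opposite += 1
    return opposite / total if total else float("nan")
```

For data generated by a truly undamped LSOM the two estimates coincide and the returned fraction is zero; any nonzero fraction measures how far the data depart from that model.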

    It was also shown that the percentage of opposite estimations was small for short gestures (around 50 ms) but increased when gesture duration increased. A conceivable explanation for the deterioration of the estimation lies in the fact that, due to their nonlinear elastic properties (see Introduction), tongue and lip stiffnesses intrinsically vary when these soft articulators move. Because long gestures are often associated with larger deformations and displacements, long gestures should undergo greater variation in stiffness over time than short gestures. Consequently, the longer the gesture is, the less accurate a model based on an undamped LSOM becomes, since such a model assumes per se that stiffness is constant over time.

    In the second part of the evaluation, an underdamped LSOM was considered. Mass-normalized stiffness and damping were computed from equations (4) and (5), which link these variables with the ratio Vmax/Amp and the duration T. In around 20% of the cases, no solution could be found because the measures of Vmax, Amp, and T are not compatible with the physical laws of an LSOM. More specifically, the parameter c of equation (6) was found to be systematically smaller than the smallest value possible (c = 1.57) for an LSOM. Hence, for these data, stiffness estimation from Vmax/Amp or T has no theoretical justification. Consequently, a prerequisite for the estimation of stiffness from Vmax/Amp or T is to compute the c value and to remove those data where c is smaller than 1.57.
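The screening step recommended above can be sketched as follows. Equation (6) is not reproduced in this part of the text, so the sketch assumes that c is the product (Vmax/Amp) · T. That assumption is consistent with the Appendix, where the undamped case gives α · β = π/2 ≈ 1.57. The function names are hypothetical.

```python
import math

C_MIN = math.pi / 2  # about 1.57, the undamped limit alpha * beta


def c_value(vmax, amp, T):
    # Assumed form of the parameter c: (Vmax/Amp) * T
    return (vmax / amp) * T


def split_compatible(gestures):
    """Split (vmax, amp, T) tuples into LSOM-compatible and incompatible sets."""
    keep, drop = [], []
    for g in gestures:
        if c_value(*g) >= C_MIN:
            keep.append(g)
        else:
            drop.append(g)  # no (k, b) pair of an LSOM can produce this gesture
    return keep, drop
```

Only the gestures in the first returned list would then be passed on to a stiffness and damping estimation.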

    For the other 80% of the data, the stiffness and damping values computed from equations (4) and (5) show considerable variability. This variability makes stiffness estimations quite inaccurate and, consequently, not very reliable. Just as for the estimation based on an undamped model, the inaccuracy of the estimation based on an underdamped model clearly increases when gesture duration increases. This general result shows that the relation between gesture duration and stiffness consistently becomes weaker as gesture duration increases. It is consistent with the hypothesis that extrinsic factors contribute to the control of gesture duration, especially when this duration is long and achievable for a broad range of stiffness.

    Figure 4. Variation of the relative stiffness variability (see text) as a function of Vmax/Amp (top panel) and gesture duration T (bottom panel) for all data together.

    Conclusion

    To conclude, this study has shown that in speech production, the relation among mass-normalized stiffness, peak velocity, gesture amplitude, and gesture duration cannot be reliably described by a unique formulation based on an LSOM. This result may originate in a variation in the damping factor across gestures. In addition, other factors such as variations in the stiffness and in the damping factor over time may also contribute to this low reliability. Furthermore, there are some indications that external time specification could significantly contribute to determining gesture durations and the kinematic properties of speech gestures. Hence, inferences about motor control and time control in speech production based on stiffness estimation using an LSOM should be considered with caution.

    Acknowledgments

    This work was funded by a grant from the Bundesministerium für Bildung und Forschung (BMBF) and the French-German University to the project PILIOS carried out jointly at Gipsa-lab in Grenoble and ZAS in Berlin.

    References

    Ackermann, H., Hertrich, I., & Scharf, G. (1995). Kinematic analysis of lower lip movements in ataxic dysarthria. Journal of Speech and Hearing Research, 38, 1252–1259.

    Browman, C. P., & Goldstein, L. (1985). Dynamic modeling of phonetic structure. In V. Fromkin (Ed.), Phonetic linguistics (pp. 35–53). New York, NY: Academic Press.

    Browman, C. P., & Goldstein, L. (1990a). Representation and reality: Physical systems and phonological structure. Journal of Phonetics, 18, 411–424.

    Browman, C. P., & Goldstein, L. (1990b). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299–320.

    Fung, Y.-C. (1981). Biomechanics: Mechanical properties of living tissues. New York, NY: Springer Verlag.

    Gerard, J. M., Ohayon, J., Luboz, V., Perrier, P., & Payan, Y. (2005). Non-linear elastic properties of the lingual and facial tissues assessed by indentation technique: Application to the biomechanics of speech production. Medical Engineering & Physics, 27, 884–892.

    Gomi, H., Honda, M., Ito, T., & Murano, E. Z. (2002). Compensatory articulation during bilabial fricative production by regulating muscle stiffness. Journal of Phonetics, 30, 261–279.

    Jannedy, S., Fuchs, S., & Weirich, M. (2010). Articulation beyond the usual: Evaluating the fastest German speaker under laboratory conditions. In S. Fuchs, P. Hoole, C. Mooshammer, & M. Zygis (Eds.), Between the regular and the particular in speech and language (pp. 205–234). Frankfurt, Germany: Peter Lang Publishing Group.

    Kelso, J. A. S., Vatikiotis-Bateson, E., Saltzman, E., & Kay, B. (1985). A qualitative dynamical analysis of reiterant speech production: Phase portraits, kinematics and dynamic modeling. The Journal of the Acoustical Society of America, 77, 266–290.

    Kühnert, B., & Hoole, P. (2004). Speaker specific kinematic properties of alveolar reductions in English and German. Clinical Linguistics & Phonetics, 18, 559–575.

    McMahon, T. A. (1984). Muscles, reflexes and locomotion. Princeton, NJ: Princeton University Press.

    Ostry, D. J., & Munhall, K. G. (1985). Control of rate and duration of speech movements. The Journal of the Acoustical Society of America, 77, 640–648.

    Perkell, J. S., Zandipour, M., Matthies, M., & Lane, H. (2002). Economy of effort in different speaking conditions. I. A preliminary study of intersubject differences and modeling issues. The Journal of the Acoustical Society of America, 112, 1627–1641.

    Rochet-Capellan, A., & Schwartz, J.-L. (2007). An articulatory basis for the labial-to-coronal effect: /pata/ seems a more stable articulatory pattern than /tapa/. The Journal of the Acoustical Society of America, 121, 3740–3754.

    Saltzman, E. L. (1986). Task dynamic coordination of the speech articulators: A preliminary model. Experimental Brain Research, 15, 129–144.

    Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333–382.

    Shiller, D. M., Houle, G., & Ostry, D. J. (2005). Voluntary control of human jaw stiffness. Journal of Neurophysiology, 94, 2207–2217.

    van Lieshout, P. H. H. M., Bose, A., Square, P. A., & Steele, C. M. (2007). Speech motor control in fluent and dysfluent speech production of an individual with apraxia of speech and Broca’s aphasia. Clinical Linguistics & Phonetics, 21, 159–188.

    Xu, Y., & Wang, M. (2009). Organizing syllables into groups—Evidence from F0 and duration patterns in Mandarin. Journal of Phonetics, 37, 502–520.


    Appendix. Linear damped second-order model.

    Mass-normalized stiffness k; mass-normalized damping factor b
    Position x(t); Velocity v(t); Acceleration a(t)
    Differential equation of movement:

    $$a(t) + b \cdot v(t) + k \cdot x(t) = 0 \tag{i}$$

    Initial conditions: t = 0, $x(t) = A_0$ and v(t) = 0.
    Under these conditions:

    $$x(t) = A_0 \cdot \frac{2\sqrt{k}}{\sqrt{4k - b^2}} \cdot \cos\left(\frac{\sqrt{4k - b^2}}{2} \cdot t + \phi\right) \cdot e^{-\frac{b}{2} t} \tag{ii}$$

    with $\phi = \tan^{-1}\left(\frac{-b}{\sqrt{4k - b^2}}\right)$ and $\phi \in \left[-\frac{\pi}{2}, 0\right]$

    $$v(t) = -A_0 \cdot \frac{2k}{\sqrt{4k - b^2}} \cdot \sin\left(\frac{\sqrt{4k - b^2}}{2} \cdot t\right) \cdot e^{-\frac{b}{2} t} \tag{iii}$$

    $$a(t) = A_0 \cdot \frac{2k\sqrt{k}}{\sqrt{4k - b^2}} \cdot \cos\left(\frac{\sqrt{4k - b^2}}{2} \cdot t + \theta\right) \cdot e^{-\frac{b}{2} t} \tag{iv}$$

    with $\theta = \tan^{-1}\left(\frac{b}{\sqrt{4k - b^2}}\right) - \pi$ and $\theta \in \left[-\pi, -\frac{\pi}{2}\right]$

    The displacement during the first half pseudoperiod is assumed to represent the articulator displacement during a gesture. From these equations, the ratio Vmax/Amp and the movement duration T can be calculated according to the following equations:

    $$T = \frac{2\pi}{\sqrt{4k - b^2}} \tag{v}$$

    $$\frac{V_{max}}{Amp} = \frac{\sqrt{k}}{1 + e^{-\frac{bT}{2}}} \cdot e^{-\frac{bT}{2}\left[\frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left(\frac{bT}{2\pi}\right)\right]} \tag{vi}$$

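    Equation (vi) can also be written as follows:

    $$\frac{V_{max}}{Amp} = \alpha \cdot \sqrt{k} \quad \text{with} \quad \alpha = \frac{1}{1 + e^{-\frac{bT}{2}}} \cdot e^{-\frac{bT}{2}\left[\frac{1}{2} - \frac{1}{\pi}\tan^{-1}\left(\frac{bT}{2\pi}\right)\right]} \tag{vii}$$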

    and equation (v) as

    $$T = \frac{\beta}{\sqrt{k}} \quad \text{with} \quad \beta = \frac{\pi}{\sqrt{1 - \frac{b^2}{4k}}} \tag{viii}$$

    Thus, if b is constant across gestures, $V_{max}/Amp = \alpha \cdot \sqrt{k}$ and $T = \beta / \sqrt{k}$, α and β being constant. If the system is undamped, b = 0 and then α = 1/2 and β = π.
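The following is a minimal sketch of one way to invert equations (v) and (vi) numerically, that is, to recover k and b from a measured Vmax/Amp and T. It is not the authors' procedure. It rewrites α and β in terms of the normalized damping ζ = b / (2·$\sqrt{k}$), which is an assumption of notation introduced here, so that the product (Vmax/Amp) · T = α · β depends on ζ alone. It then finds ζ by bisection, relying on α · β increasing with ζ from π/2 at ζ = 0. All function names are hypothetical.

```python
import math


def alpha_beta(zeta):
    """alpha of equation (vii) and beta of equation (viii) for 0 <= zeta < 1."""
    root = math.sqrt(1.0 - zeta * zeta)
    half_bT = math.pi * zeta / root  # b*T/2 written in terms of zeta
    # b*T/(2*pi) equals zeta/root, so the arctangent term of (vi) becomes:
    expo = -half_bT * (0.5 - math.atan(zeta / root) / math.pi)
    alpha = math.exp(expo) / (1.0 + math.exp(-half_bT))
    beta = math.pi / root
    return alpha, beta


def solve_k_b(vmax_over_amp, T, tol=1e-12):
    """Return (k, b) consistent with equations (v) and (vi), or None."""
    c = vmax_over_amp * T
    if c < math.pi / 2:
        return None  # below the undamped limit: no underdamped solution
    lo, hi = 0.0, 1.0 - 1e-9  # search range for the normalized damping
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        a, b_ = alpha_beta(mid)
        if a * b_ < c:
            lo = mid
        else:
            hi = mid
    zeta = 0.5 * (lo + hi)
    _, beta = alpha_beta(zeta)
    k = (beta / T) ** 2             # from T = beta / sqrt(k)
    b = 2.0 * zeta * math.sqrt(k)   # from zeta = b / (2*sqrt(k))
    return k, b
```

A round trip gives a quick consistency check: choosing k and b, computing T and Vmax/Amp from equations (v) and (vi), and passing them to `solve_k_b` should return the original k and b up to numerical tolerance.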
