Measurement of acoustic and anatomic changes in orthognathic surgery patients

Daniel Aalto1, Olli Aaltonen2, Risto-Pekka Happonen3, Jarmo Malinen1, Tiina Murtola1, Riitta Parkkola3, Jani Saunavaara3, Martti Vainio2

1Dept. of Mathematics and System Analysis, Aalto University, Helsinki, Finland
2Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland

3Dept. of Oral Diseases, Dept. of Radiology, Medical Imaging Centre of Southwest Finland, Turku, Finland
[email protected]

Abstract
We describe an arrangement for simultaneous recording of speech and the geometry of the vocal tract in orthognathic surgery patients. The experimental design is considered from an articulatory-phonetic point of view. The speech signal is recorded with an acoustic-electrical arrangement and the vocal tract with MRI. A Matlab-based system controls the timing of speech recording and MR image acquisition.
Index Terms: speech, sound recording, MRI

1. Introduction
Mathematical models of human speech production based on the wave equation in the vocal tract have been used for studying normal speech production acoustics (e.g. [1, 2, 3]). Incorporating soft tissue and muscles into the model would extend its usability to studying speech production from a phonetics point of view, and to planning and evaluating oral and maxillofacial surgery [4, 5, 6].

A computational model of speech production (for example [1, 7, 8]) must be validated by comparing simulated sound to measured sound in terms of, for example, the resonance structure. Since such simulations are based on anatomic data, the validation of the computational model depends on recording a coupled data set: speech sound and the precise anatomy which produces it. This requires imaging the vocal and nasal tracts from the lips and nostrils to the beginning of the trachea.

We have developed an experimental arrangement to collect such a data set using magnetic resonance imaging (MRI) [9, 10]. During a pilot stage in 2010, a set of measurements was carried out on one male subject, confirming the feasibility of the arrangement and the high quality of the data obtained [11]. The pilot measurements also revealed a number of issues to be addressed before the next step: obtaining a clinically relevant data set.

Collecting simultaneous speech and anatomic data from a large set of patients is far from a trivial task even when suitable instrumentation is available. A number of phonetic aspects must be taken into account to ensure that the task is within the ability of the patients, regardless of background and skills. It must be possible to monitor the quality of phonation despite the acoustic noise in the MRI room, and data collection procedures must be reliable to minimise the number of repetitions and the amount of useless data obtained. All this must be achieved in as short a time as possible to minimise cost and maintain patient interest in the project.

This paper outlines experimental protocols which take into account the above problems as well as the issues noted during the pilot measurements. These experimental procedures will be used to measure acoustic and anatomic changes in orthognathic surgery patients. These surgical operations involve only hard tissue in the facial area, and patients are mostly young and healthy adults. Measurements will be done before and approximately 30 weeks after the operation for 10-20 voluntary patients. The second set of measurements will be done after removal of post-operational metal braces, which may adversely affect image quality by distorting air spaces in the mouth.

2. Speech recording and MR imaging
2.1. Phonetic material

The speech materials are chosen to provide a phonetically rich data set of Finnish sounds. The chosen MRI sequences require up to 11.6 s of continuous articulation in a stationary position. We use the Finnish sounds for which this is possible: vowels [a,e,i,o,u,y,æ,œ], nasals [m,n], and the approximant [l].

Patients are instructed to produce each of the sounds at a sustained fundamental frequency (f0). We use two different f0 levels (104 and 130 Hz for men, 168 and 210 Hz for women) for the sounds [a] and [i] to obtain the vocal tract geometry with different larynx positions. The rest of the sounds are produced at the lower f0 only. The f0 levels are matched with the MRI noise frequency profile to avoid interference.

In sustained phonation, the long exhalation causes contraction of the thorax and hence a change in the shape of the vocal organs. The stationary 3D imaging sequence used to obtain the vocal tract geometry provides no information on this adaptation process, so additional dynamic 2D imaging of the mid-sagittal section for the sounds [a,i,u,n,l] is used to monitor articulatory stability.

Speech context data is also acquired by having the patient repeat 12 sentences, which have been selected from a phonetically known, noise-tested set [reference (Daniel/Martti)??]. These continuous speech samples are imaged using the same dynamic 2D sequence that is used for stability checking.

An instruction and cue signal guides the patient through each measurement. The signal consists of three parts (Figure 1): (1) recorded instructions specifying the task with a sample of the desired f0, (2) a 2 s pause and three count-down beeps one second apart, and (3) continuous f0 for 11.6 s. In the case of speech context experiments, the recorded instructions specify the sentence to be repeated, and f0 is left empty in both parts (1) and (3). Audibility of the f0 cues over MR imaging noise is achieved by using a sawtooth wave.
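The assembly of parts (2) and (3) of the cue signal can be sketched as follows. This is an illustrative Python sketch rather than the Matlab implementation used in the actual system; the sampling rate, beep duration, and beep pitch are assumptions not specified in the text.

```python
import numpy as np

def cue_signal(f0_hz, fs=44100):
    """Parts (2) and (3) of the cue signal: a 2 s pause, three
    count-down beeps one second apart, then 11.6 s of continuous
    sawtooth wave at the target f0.  Beep length (200 ms) and
    pitch (880 Hz) are assumed values for illustration."""
    t_beep = np.arange(int(0.2 * fs)) / fs
    beep = 0.5 * np.sin(2 * np.pi * 880.0 * t_beep)
    gap = np.zeros(int(0.8 * fs))          # silence so beeps are 1 s apart
    pause = np.zeros(2 * fs)               # the 2 s pause of part (2)
    t_cue = np.arange(int(11.6 * fs)) / fs
    # A sawtooth is rich in harmonics, which keeps the f0 cue
    # audible over the MR imaging noise.
    saw = 0.5 * (2.0 * (t_cue * f0_hz % 1.0) - 1.0)
    return np.concatenate([pause, beep, gap, beep, gap, beep, gap, saw])

sig = cue_signal(104.0)   # lower male f0 target
```

The total duration is 2 s + 3 x 1 s + 11.6 s = 16.6 s of audio per measurement cue.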


Figure 1: Patient instruction and cue signal structure.

2.2. Setting for experiments

The experimental setting is similar to the setting in which the pilot arrangement was tested [11]. The patient lies supine inside the MRI machine with the sound collector placed on the Head Coil in front of the patient's mouth. The patient can communicate with the control room through the sound collector and the earphones of the MRI machine. The patient can also hear his own (de-noised) voice through the earphones with a delay of approximately 90 ms.

Changes to the setting mainly concern instructing and cueing the patient, the role of the experimenter, and the control and timing of MR imaging.

The patients will familiarise themselves with the tasks and the phonetic materials before the beginning of a measurement session. They will also practice the tasks under the supervision of a phonetician(?).

At the start of a measurement, the experimenter selects the phonetic task. The patient then hears the recorded instruction. The instructions, and the following pause and count-down beeps, give the patient time to swallow, exhale, and inhale before phonation starts. The patient hears the target f0 in the earphones, added to his own (de-noised) voice, throughout the phonation.

MR imaging is started 2 s after the start of phonation and finishes approximately 500 ms before the end of phonation. Thus "pure samples" of the stabilised utterance are available before and after the imaging sequence. Two 200 ms breaks are inserted into the 3D MRI sequences in order to monitor the quality of articulation throughout the experiment. Dynamic 2D sequences start and end simultaneously with phonation.

The experimenter listens to the speech sound throughout the experiment, allowing unsuccessful utterances to be detected immediately. At the end of the experiment, the experimenter writes comments and observations into a metadata file. The recorded sound pressure levels are also inspected. Unsuccessful measurements are repeated, at the experimenter's discretion, either immediately or later in the measurement set.

2.3. Magnetic resonance imaging

Measurements are performed on a Siemens Magnetom Avanto 1.5 T scanner (Siemens Medical Solutions, Erlangen, Germany). The maximum gradient field strength of the system is 33 mT/m (x, y, z directions) and the maximum slew rate is 125 T/m/s.

A 12-element Head Matrix Coil and a 4-element Neck Matrix Coil are used to cover the anatomy of interest. This coil configuration allows the use of the Generalized Auto-calibrating Partially Parallel Acquisition (GRAPPA) technique to accelerate acquisition. The technique is applied in all scans with an acceleration factor of 2.

A 3D VIBE (Volumetric Interpolated Breath-hold Examination) MRI sequence is used, as it allows for the rapid 3D acquisition required in this study. Sequence parameters are optimized to minimize the acquisition time. The following parameters allow imaging with 1.8 mm isotropic voxels in 7.8 s: repetition time (TR) 3.63 ms, echo time (TE) 1.19 ms, flip angle (FA) 6°, receiver bandwidth (BW) 600 Hz/pixel, FOV 230 mm, matrix 128×128, 44 slices, and a slab thickness of 79.2 mm.
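The quoted geometry is mutually consistent up to rounding: 230 mm / 128 ≈ 1.8 mm in-plane, and 44 slices × 1.8 mm = 79.2 mm of slab. A small consistency check (the helper function is hypothetical, used only to make the arithmetic explicit):

```python
def vibe_geometry(fov_mm=230.0, matrix=128, n_slices=44):
    """Check the quoted 3D VIBE parameters: in-plane voxel size is
    FOV / matrix, and the slab thickness is slices * nominal voxel
    for isotropic 1.8 mm voxels."""
    voxel = fov_mm / matrix          # approximately 1.8 mm
    slab = n_slices * round(voxel, 1)
    return voxel, slab

voxel, slab = vibe_geometry()
```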

Dynamic MRI scans are performed using a segmented ultrafast spoiled gradient echo sequence (TurboFLASH) with TR and TE minimized. A single sagittal plane is imaged at a pace of 5.5 images per second using parameters TR 178 ms, TE 1.4 ms, FA 6°, BW 651 Hz/pixel, FOV 230 mm, matrix 120×160, and slice thickness 10 mm.

Slices in the 3D VIBE sequence are externally triggered using a train of 12 ms TTL logic level 1 pulses separated by 237 ms of TTL level 0. With the acceleration methods used, the number of triggered slices in the sequence is 35. The two additional pauses are inserted after the 12th and the 24th slices. External triggering with the additional pauses increases the imaging time to 9.1 s. The pulse train is generated with a custom-made device which converts a 1 kHz analogue sine signal into the logic pulses.
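The stated 9.1 s follows from these numbers: 35 pulses with a 249 ms period (12 ms high + 237 ms low) plus two 200 ms pauses give 35 × 249 ms + 400 ms = 9115 ms. A Python sketch (a hypothetical helper, not part of the actual triggering device) that computes the pulse onset schedule:

```python
def trigger_onsets_ms(n_slices=35, pulse_ms=12, gap_ms=237,
                      pause_ms=200, pause_after=(12, 24)):
    """Onset time of each TTL trigger pulse in ms, with the two
    extra 200 ms pauses inserted after the 12th and 24th slices.
    Returns the onset list and the total sequence duration."""
    onsets, t = [], 0
    for k in range(1, n_slices + 1):
        onsets.append(t)
        t += pulse_ms + gap_ms          # one 249 ms trigger period
        if k in pause_after:
            t += pause_ms               # articulation-monitoring break
    return onsets, t

onsets, total = trigger_onsets_ms()
# total is 9115 ms, i.e. the 9.1 s imaging time quoted above
```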

2.4. Speech recording

The MRI room presents a challenging environment for sound recording due to acoustic noise and interference to electronics from the MRI machine. For safety and image quality reasons, the use of metal is restricted inside the MRI room and prohibited near the MRI machine.

We use instrumentation specially developed for speech recording during MRI [9, 10]: a two-channel sound collector samples the speech and primary noise signals in a dipole configuration. The sound signals are coupled to a microphone array inside a Faraday cage by acoustic waveguides of length 3 m. Additional noise samples are collected from the microphone array and from inside the MRI room using a directional microphone near the patient's feet, pointing towards the patient's head and the MRI coil. The signals are coupled from the microphones to a custom RF-proof amplifier situated outside the MRI room. This analogue electronics is used to optimally subtract the primary noise channel from the speech channel in real time. Audio signals are converted between analogue and digital forms using an M-Audio Delta 1010 PCI Audio Interface.
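Although the actual subtraction is done in real-time analogue electronics, its principle can be imitated digitally. The Python sketch below removes the best least-squares multiple of the noise channel from the speech channel; the single-gain model, time-aligned channels, and the toy signals are all assumptions for illustration only.

```python
import numpy as np

def subtract_noise(speech_ch, noise_ch):
    """Least-squares noise subtraction: find the gain g minimising
    ||speech - g * noise||^2 and return the residual.  A crude
    digital stand-in for the optimal analogue subtraction."""
    g = np.dot(speech_ch, noise_ch) / np.dot(noise_ch, noise_ch)
    return speech_ch - g * noise_ch

# toy check: a tone buried in scaled noise is largely recovered
rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
clean = np.sin(np.linspace(0, 20 * np.pi, 1000))
denoised = subtract_noise(clean + 3.0 * noise, noise)
```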

2.5. Control of measurements

Measurements are controlled with custom code in Matlab 7.11.0.584 (R2010b) running on a portable server under Linux Ubuntu 10.04 LTS (Figure 2). Access from Matlab to the Audio Interface is arranged through Playrec (a Matlab utility), the QjackCtl JACK Audio Connection Kit (v. 0.3.4), and JackEQ 0.4.1.

The custom code computes the input signal to the MRI triggering device, reads the patient instruction and cue audio file, and assembles the two signals into a playback matrix. Recording is started simultaneously with playback and carried on for an equal number of samples. In addition to the speech and three noise signals, the recording also includes the de-noised signal and the patient instruction signal.

The audio configuration causes delays in the signals. Relative to the onset of recording, MRI noise is recorded with a delay of approximately 60 ms (MRI machine delays excluded).


This is accounted for by the method of locating the "pure samples". Patient speech is recorded with a delay of approximately 90 ms (patient reaction time excluded), and patients also hear their own voices with this same delay. These patient speech delays may have two effects. First, the duration of the last pure sample is reduced from 500 ms to 410 ms, which makes no significant difference from the point of view of data analysis. Second, the echo effect may disturb the patient during sentence repetition, in which case the speech feedback may be turned off or its volume reduced independently of the cue signal.
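The 500 ms to 410 ms reduction follows directly from the 90 ms speech delay, since recording stops together with playback and the last 90 ms of delayed speech falls outside the file. A hypothetical one-line helper makes the arithmetic explicit:

```python
def trailing_pure_ms(post_imaging_ms=500, speech_delay_ms=90):
    """Duration of the trailing pure sample actually captured:
    the nominal post-imaging window minus the speech delay."""
    return post_imaging_ms - speech_delay_ms

# trailing_pure_ms() gives 410, the figure quoted in the text
```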

The control code automatically saves the recorded sound as a six-channel Waveform Audio File. A separate file containing metadata is also saved automatically. The metadata file contains all experimental parameters, including the task specification and the locations of the pure samples in the sound file.

The control system requires user input for three tasks. First, the experimenter selects the next phonetic task (target sound or sentence and f0) and MR imaging sequence. Second, comments and observations may, if necessary, be written about each measurement separately; they are saved automatically in the metadata file. Third, the patient headphone volume and recorded sound pressure levels may be adjusted manually based on feedback from the patient and rudimentary post-experimental sound data checks. The sound data checks consist of the maximum absolute value and root-mean-square value of the recorded signals, and they are displayed to the experimenter automatically at the end of each measurement.
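The two quantities used in the sound data checks are straightforward to compute; a Python sketch of the same checks (the actual system computes them in Matlab):

```python
import numpy as np

def level_check(x):
    """Post-experimental sound data check as described above:
    maximum absolute value and root-mean-square value of a
    recorded signal."""
    return {"peak": float(np.max(np.abs(x))),
            "rms": float(np.sqrt(np.mean(np.square(x))))}

# e.g. a full-scale sine has peak 1.0 and RMS of about 0.707
stats = level_check(np.sin(np.linspace(0, 2 * np.pi, 48000, endpoint=False)))
```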

A single measurement will take on average 30-40 s, including task selection by the experimenter and writing additional information and observations into the metadata file. We expect that with 60 min of measurement time per patient we can obtain, at a comfortable pace, a data set with 50-60 measurements per patient both before and after surgery.

3. Conclusions
We have described experimental protocols, MRI sequences, and a sound recording system that, in conjunction with previously reported arrangements [9, 10, 11], can be used for simultaneous speech sound and anatomical data acquisition on a large number of orthognathic surgery patients. Such data sets are intended for parameter estimation, fine tuning, and validation of a mathematical model for speech production as discussed in Section 1. However, these methods and procedures may be used in a wider range of applications and with other patient groups.

Some questions and problems in the measurement arrangements remain open, in particular concerning the visibility of teeth, and acoustic noise and its impact on articulation.

Teeth are not visible in MR images, but they are an important acoustic element of the vocal tract. Hence it is necessary to add the teeth geometry to the soft tissue geometry obtained from the MR images during post-processing. Laser scans of the teeth or digitised dental casts could readily be obtained from the patients, but the alignment of the two geometries remains a problem. Markers containing oil or paraffin attached to the surface of the teeth [12] appear a promising approach, but we are still studying and experimenting to find a working solution.

Acoustic noise during measurements remains a problem from two points of view. Firstly, the recorded speech signal may need further de-noising to be usable for formant extraction by, e.g., the LPC algorithm. The three noise signals collected simultaneously with the speech signal enable more accurate post-experimental de-noising than is achieved by the analogue device.
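As a rough illustration of the kind of LPC formant extraction mentioned here, the Python sketch below estimates spectral peaks with autocorrelation LPC and reads candidate formants off the complex roots of the prediction polynomial. The function, its parameters, and the synthetic two-tone test signal are illustrative assumptions, not the analysis pipeline used with the actual recordings.

```python
import numpy as np

def lpc_formants(x, fs, order=12):
    """Crude formant candidates via autocorrelation LPC: solve the
    normal equations for the linear predictor, then convert the
    angles of the upper-half-plane roots of A(z) to frequencies."""
    x = x * np.hamming(len(x))                         # analysis window
    r = np.correlate(x, x, "full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])           # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))      # roots of A(z)
    roots = roots[np.imag(roots) > 0.0]                # one of each conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[freqs > 90.0]                         # discard near-DC roots

fs = 16000
t = np.arange(int(0.05 * fs)) / fs                     # 50 ms frame
rng = np.random.default_rng(1)
x = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1100 * t)
     + 0.01 * rng.standard_normal(t.size))             # two "formants" + noise
cands = lpc_formants(x, fs)
```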

Figure 2: The measurement server with custom amplifier, M-Audio Delta 1010 Audio Interface, and networking facilities, and a laptop used for remote access.

Secondly, the onset of MRI noise may cause significant adaptation in the patients' articulation. It may be possible to reduce this problem by running the 3D MRI sequence once while the patient receives the task instruction, to adapt the patient to the noise, and a second time during phonation to obtain the vocal tract geometry. For the 2D sequences, the sequence may be started before phonation for the same effect.

4. Acknowledgements
5. References

[1] Hannukainen, A., Lukkari, T., Malinen, J. and Palo, P., "Vowel formants from the wave equation", Journal of the Acoustical Society of America Express Letters, 122(1): EL1–EL7, 2007.

[2] Lu, C., Nakai, T. and Suzuki, H., "Finite element simulation of sound transmission in vocal tract", J. Acoust. Soc. Jpn. (E), 92: 2577–2585, 1993.

[3] Svancara, P., Horacek, J., and Pesek, L., "Numerical modelling of production of Czech vowel /a/ based on FE model of the vocal tract", Proceedings of International Conference on Voice Physiology and Biomechanics, 2004.

[4] Dedouch, K., Horacek, J., Vampola, T., and Cerny, L., "Finite element modelling of a male vocal tract with consideration of cleft palate", Forum Acusticum, Sevilla, Spain, 2002.

[5] Nishimoto, H., Akagi, M., Kitamura, T., and Suzuki, N., "Estimation of transfer function of vocal tract extracted from MRI data by FEM", The 18th International Congress on Acoustics, Kyoto, Japan, Vol. II: 1473–1476, 2004.

[6] Svancara, P. and Horacek, J., "Numerical Modelling of Effect of Tonsillectomy on Production of Czech Vowels", Acta Acustica united with Acustica, 92: 681–688, 2006.

[7] Aalto, A., "A low-order glottis model with nonturbulent flow and mechanically coupled acoustic load", Master's thesis, TKK, Helsinki, Finland, 2009.

[8] Aalto, A., Alku, P., and Malinen, J., "A LF-pulse from a simple glottal flow model", MAVEBA 2009, Florence, Italy, 199–202, 2009.

[9] Lukkari, T., Malinen, J. and Palo, P., "Recording speech during magnetic resonance imaging", MAVEBA 2007, Florence, Italy, 163–166, 2007.

[10] Malinen, J. and Palo, P., "Recording speech during MRI: Part II", MAVEBA 2009, Florence, Italy, 211–214, 2009.

[11] Aalto, D., Malinen, J., Aaltonen, O., Vainio, M., Happonen, R.-P., Parkkola, R. and Saunavaara, J., "Recording speech sound and articulation in MRI", Biodevices 2011, 168–173, 2011.

[12] Ericsdotter, C., "Articulatory-Acoustic Relationships in Swedish Vowel Sounds", PhD dissertation, Stockholm University, Stockholm, Sweden, 2005.