
Multimodal Interaction for Service Robot Control

Felipe Trujillo-Romero, Felix Emilio Luis-Pérez, Santiago Omar Caballero-Morales
Technological University of the Mixteca, Postgraduate Division, Highway to Acatlima Km 2.5

Huajuapan de León, Oaxaca, México, 69000 {ftrujillo,eluis,scaballero}@mixteco.utm.mx

Abstract

In this paper we present the multimodal interaction of two sensor systems for the control of a mobile robot. These systems consist of (1) an acoustic sensor that receives and recognizes spoken commands, and (2) a visual sensor that perceives and identifies commands based on the Mexican Sign Language (MSL). According to the stimuli, either visual or acoustic, the multimodal interface of the robotic system is able to weight each sensor's contribution to perform a particular task. The multimodal interface was tested in a simulated environment to validate the pattern recognition algorithms, both independently and integrated. The average independent performance of the sensors was 93.62% (visual signs and spoken commands), and the multimodal system achieved 95.60% on service tasks.

1. Introduction

From its beginnings, robotics has been an important assistive technology for the human being, making possible the existence of constant production lines with minimum error rates for the realization of repetitive tasks. This is the case of industrial robots, which, although they provide a service to the human being, cannot be classified as service robots. According to the International Federation of Robotics (IFR) [1], a service robot is defined as: a robot that works autonomously, either partially or totally, and that performs useful services for the well-being of humans and equipment. These robots can have mobility and the capacity to manipulate objects.

Thus, the application field of service robots can be classified into:
- Applications of service for human beings (personal protection, assistance of handicapped people, etc.).
- Applications of service for equipment (maintenance, repairs, cleaning, etc.).
- Other autonomous tasks (surveillance, transportation, inspection, etc.).

The autonomy capacity required for such systems is obtained by means of a control system able to interact with the environment of application. To accomplish this, the system must have sensors to perceive the events that occur in the environment, and actuators/mechanisms to react and modify it.

This work is focused on developing service robotics tasks for the human being, aiming for the robotic system to be controlled by means of natural language represented in acoustic form (speech) and visual form (signs). Hence, the service robot must interpret visual and acoustic commands to perform a task. The tasks to be performed are simple so that they can be arranged in sequences to accomplish more complex tasks, such as to "serve a glass with water" and "take it to" a particular work space or person.

Our advances towards this system are presented in this paper, which is structured as follows: in Section II a review of studies related to this problem is presented, while in Section III the details of the design of the visual and acoustic sensor systems are given; in Section IV the design details of the integrated multimodal system are presented, whose results are presented and discussed in Section V; finally, in Section VI we present our conclusions and future work.

2. Research review

Communication between humans is performed by means of voice and gestures. Unfortunately, robots cannot understand these human natural languages. Thus, a mechanism is necessary for the robot to understand these languages as a human does. To accomplish this, many research projects have been developed to enable communication between humans and robots. Among those projects we can mention Posada-Gómez et al. [2], who controlled a wheelchair with hand signals. Such equipment has also been controlled by Hashimoto et al. [3] by tracking other signals such as ocular movements, while Alcubierre et al. [4] used spoken commands.

A project more closely related to the focus of our work is presented by Böhme et al. [5], who developed the multimodal control of a service robot for a shop. This robot was used as an informative kiosk which was able to locate a user by processing audio signals in addition to perceived visual information. On the other hand, there are developments for companion robots (e.g., Aibo, Robosapien, etc.). In this field we can mention Weitzenfeld et al. [6], who controlled a group of Aibo robots to play football by means of spoken commands and visual feedback.


The list of examples can be very extensive, and thus, we comment on two last projects that use acoustic and visual information for multimodal interaction: (1) the Golem robot [7], developed at the Universidad Nacional Autónoma de México (UNAM), which is able to receive complex spoken instructions and establish a dialog with a user based on that information; and (2) Rackham [8], which was developed at the Laboratoire d'Analyse et d'Architecture des Systemes (LAAS) in France. Rackham receives spoken commands to guide visitors towards the different areas of a museum.

3. Sensor systems

3.1 Signs recognition system

The sign recognition system was developed in different stages: image acquisition, segmentation, obtaining the descriptor, learning of patterns by the neural network, and recognition. The relation among these stages is shown in the flowchart of Fig. 1.

Figure 1. Structure of the sign recognition system (image acquisition → segmentation → shape signature → neural network learning/recognition → action/stop).

In the paragraphs below we briefly explain each of the stages mentioned above.

• Obtaining images

In Fig. 2 the 23 symbols that were used, from the 27 that comprise the alphabet of the MSL, are presented. We chose those whose shapes were different from each other; image sequences were not considered.

Figure 2. The 23 symbols of the MSL alphabet.

The images of the signs were taken using a uniform background. Additionally, the user wore a garment that left only the hand uncovered in order to avoid interference by objects on the wrist (watch, bracelets, etc.). This ensured that the region of interest to be processed belonged only to the hand.

We used four different sets similar to that shown in Fig. 2, each with a different background. These four image sets served to create a database of images by varying their parameters using a photometric filter. The filter used was a power function of the form shown in Eq. (1):

f(x) = c·e^x,  0.9 ≤ x ≤ 1.1    (1)

where c is a constant and x is the parameter to be varied to obtain different intensities in the image being processed. The parameter x was varied randomly in order to obtain different light variations for the input image. In evaluating Eq. (1) we have that, if x < 1, the image is darker, and if x > 1, it is brighter (see Fig. 3).

• Segmentation

The first step in the realization of the sign recognition system was the image segmentation by active contours [9]. These contours shape the boundaries between the object, the background, and other objects in the image. They also allow extraction of the contours of the objects of interest based on models that use a priori information on the shape of the objects. These techniques are much more robust to the presence of noise, and allow more complex image segmentation. In this paper, snakes were used to perform the segmentation (segmentation by region [10]).

Figure 3. Examples for several power settings.

In Fig. 4 an example is presented of the different transitions of the snake for the image of the MSL symbol representing the letter "y".

Figure 4. Different snake transitions: (a) initial snake, (d) final contour, (e) contour of the sign.

• Descriptor

Since the snake gives as a result the edge of the object, the shape signature [11] was used as descriptor. In our case, the object is the shape of the hand with one sign of the MSL alphabet (Fig. 4). The signature of the object is obtained by calculating the distances from the object's center of gravity to each one of the points that form the boundary.


This yields a histogram of distances, as shown in Fig. 5, which is the contour histogram for the symbol shown in Fig. 4(e). This histogram consists of 360 different values, which form the input vector for training the neural network.
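For illustration, the following Python sketch computes a 360-value shape signature from the boundary points returned by the snake. It is our own approximation of the descriptor described above; the per-degree sampling and the final normalization are assumptions, since the paper does not detail them.

```python
import numpy as np

def shape_signature(boundary_xy, n_bins=360):
    """Distance-from-centroid signature of a closed hand contour.

    boundary_xy: (N, 2) array of (x, y) boundary points produced by the snake.
    Returns an n_bins-element vector: for each 1-degree angular bin, the
    distance from the center of gravity to the boundary.
    """
    centroid = boundary_xy.mean(axis=0)                 # center of gravity
    rel = boundary_xy - centroid                        # vectors centroid -> boundary
    dist = np.hypot(rel[:, 0], rel[:, 1])               # distances to each point
    angle = np.degrees(np.arctan2(rel[:, 1], rel[:, 0])) % 360.0

    signature = np.zeros(n_bins)
    for a, d in zip(angle.astype(int), dist):
        signature[a] = max(signature[a], d)             # farthest boundary point per degree
    return signature / signature.max()                  # scale normalization (assumption)
```

The resulting 360-value vector corresponds to the histogram that is fed to the neural network in the next stage.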

• Neural network: learning

We used a neural network composed of three layers with the following configuration: an input layer of 360 neurons, an intermediate layer of 23 neurons, and one neuron in the final layer.

In this case, the input vectors are formed by the histograms of the shape signature descriptor. Each input vector is associated with an output label ranging from 1 to 23. Therefore, the output takes 23 different values, one for each symbol of the MSL alphabet that was used.

Figure 5. Shape signature of the symbol that represents the letter "y": (a) in the learning phase, (b) in the recognition phase. There are significant differences between both histograms.

Backpropagation was used as the network training algorithm due to its fast convergence and robustness compared with other types of training algorithms [12]. Some of the parameters used for this training were: (1) a learning rate of 0.01; (2) a maximum of 9000 iterations; and (3) a minimum error of 1e-10.
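The following Python/NumPy sketch is one possible realization of the network described above (360-23-1 architecture, backpropagation, learning rate 0.01, up to 9000 iterations, error goal 1e-10). The tanh hidden activation, linear output, and weight initialization are our assumptions; the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Architecture from the paper: 360 inputs, 23 hidden neurons, 1 output neuron
# whose value encodes the class label (1..23).
n_in, n_hid, n_out = 360, 23, 1
W1 = rng.normal(scale=0.1, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_out)); b2 = np.zeros(n_out)

def forward(X):
    H = np.tanh(X @ W1 + b1)          # hidden layer
    Y = H @ W2 + b2                   # linear output: predicted label value
    return H, Y

def train(X, t, lr=0.01, max_iter=9000, err_goal=1e-10):
    """Plain batch backpropagation on the mean squared error."""
    global W1, b1, W2, b2
    t = t.reshape(-1, 1).astype(float)               # labels 1..23
    for _ in range(max_iter):
        H, Y = forward(X)
        err = Y - t
        mse = np.mean(err ** 2)
        if mse < err_goal:
            break
        dY = 2 * err / len(X)                        # gradient of MSE w.r.t. output
        dW2 = H.T @ dY;  db2 = dY.sum(axis=0)
        dH = (dY @ W2.T) * (1 - H ** 2)              # tanh derivative
        dW1 = X.T @ dH;  db1 = dH.sum(axis=0)
        W2 -= lr * dW2;  b2 -= lr * db2
        W1 -= lr * dW1;  b1 -= lr * db1
    return mse

def recognize(x):
    _, y = forward(x.reshape(1, -1))
    return int(np.clip(np.rint(y[0, 0]), 1, 23))     # nearest label in 1..23
```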

The time it took the neural network to learn the pattern vectors and their association with the 23 goal patterns was 316 seconds.

• Sign Recognition Performance

We tested the neural network before and after training to observe how the recognition of the symbols of the MSL varied. Fig. 6(a) shows the graph obtained with the initial values of the weights of the neural network. The expected values are shown as circles, and the values obtained at the output of the network when presenting the symbols of the MSL alphabet are shown as stars. After training the system, new graphs were obtained, showing the patterns of Fig. 6(b).

To evaluate the recognition system, a series of images different from those learned by the system was used. By comparing the learning histograms with the histogram of the symbol to be recognized, some variation was observed, although many values in the histograms matched. This allowed the neural network to recognize the symbols smoothly, as expected.

Recognition tests were performed for all the symbols in the MSL alphabet. The system was tested 30 times for each symbol, and the following was observed:

- The symbols that were always properly recognized were those that represented the letters a, b, c, d, e, f, g, h, i, m, n, o, r, u, v, x, y.

- The system confused certain symbols: the letters l, p, s, and w were recognized correctly in only about 50% of the experiments. The overall recognition rate was 90%.

Figure 6. Neural network output: (a) before training, (b) after training.

3.2 Speech recognition system

The structure of the Automatic Speech Recognition (ASR) system is presented in Fig. 7, and each component is explained below.

• Acoustic Models and Lexicon

Hidden Markov Models (HMMs) were used to model the acoustic features of the user’s voice. These models were standard three-state left-to-right HMMs with eight Gaussian components per state. The front-end used 12 MFCCs plus energy, delta, and acceleration coefficients [13, 14].
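For illustration, the 39-dimensional front-end (12 MFCCs plus energy, with delta and acceleration coefficients) can be approximated with librosa as sketched below. This is our own sketch, not the HTK front-end actually used in the paper, and the frame/hop lengths are assumptions.

```python
import librosa
import numpy as np

def mfcc_e_d_a(wav_path, sr=16000):
    """12 MFCCs + log-energy, plus delta and acceleration: 39 features per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 13 base coefficients; the 0th is replaced with the frame log-energy
    # to approximate HTK's MFCC_E parameterization.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
    energy = librosa.feature.rms(y=y, frame_length=400, hop_length=160)
    mfcc[0] = np.log(energy[0] + 1e-10)
    delta = librosa.feature.delta(mfcc)            # first-order (delta) coefficients
    accel = librosa.feature.delta(mfcc, order=2)   # second-order (acceleration)
    return np.vstack([mfcc, delta, accel]).T       # shape: (n_frames, 39)
```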

Figure 7. Structure of the speech recognition system (the spoken command is decoded by Viterbi decoding using the Lexicon, the Language Model, and the Acoustic Models trained from the speech corpora).

It is now common practice to build HMMs at the phonetic level instead of at the word level. A phoneme is a sub-word unit that forms a word; for example, the word HELLO is formed by the sequence of phonemes /hh/, /eh/, /l/, /oh/. A Lexicon, or Phonetic Dictionary, is used in this case to establish the sequence of phonemes that defines each word in a vocabulary. In our case, we defined a sequence of phonemes for each word of the control sentences associated to the 12 symbols of the MSL (see Table 3). The TranscribeMex [15] tool was used to define the phoneme sequences for the vocabulary words of the application. Because of the limited availability of Mexican speech corpora, we developed a Speaker Dependent (SD) ASR system trained with speech samples from a single speaker. For multi-user purposes, Maximum Likelihood Linear Regression (MLLR) [14] was performed to adapt this system for its use by other users. The training corpus of the SD ASR system consisted of a selection of a short narrative, 16 phonetically balanced sentences, and 30 words with presence of contrast phonemes.


This text corpus was uttered 5 times by a male speaker to provide the speech corpus, which was labeled at the phonetic and word levels using the software Wavesurfer for supervised training. By using the sequences of phonemes established in our Lexicon, we expanded these word labels into their corresponding phoneme labels. The supervised training of the acoustic models was performed by Baum-Welch re-estimation of the HMMs with the acoustic data and the corresponding phoneme labels.
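The expansion of word-level labels into phoneme labels via the Lexicon can be sketched as follows. The dictionary entries shown are hypothetical illustrations, not the actual Mexican Spanish Lexicon produced with TranscribeMex.

```python
# Hypothetical lexicon: each vocabulary word maps to its phoneme sequence.
LEXICON = {
    "HELLO": ["hh", "eh", "l", "oh"],   # example taken from the text above
    "BOT":   ["b", "o", "t"],           # assumed entries, for illustration only
    "CUBE":  ["k", "u", "b", "e"],
}

def expand_to_phonemes(word_labels):
    """Expand a word-level transcription into its phoneme-level labels."""
    phones = []
    for word in word_labels:
        phones.extend(LEXICON[word])
    return phones

# e.g. expand_to_phonemes(["BOT", "CUBE"]) -> ['b', 'o', 't', 'k', 'u', 'b', 'e']
```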

• Language Model

The control sentences were defined for manipulation of objects (robot arm "Cube" tasks) and movement of the platform (robot platform "Bot" tasks). The sentences for manipulation have the following structure: Device + Task + Object. Here, Device defines the identifier of the robotic element to be controlled (Cube or Bot), and Task identifies the kind of action to perform over an Object (the element to be manipulated). Hence, sentences such as those presented in Fig. 8 were modeled by the grammar.

Figure 8. Sentences for manipulation tasks (sent-start → Bot | Cube → take | serve | open → cup | bottle | door one | door two → sent-end).

Sentences to control the movement of the platform, or to give more details about the task, had the following structure: Device + Task + Configuration.

Figure 9. Sentences for movement tasks and settings (sent-start → Bot | Cube → turn | move forward | move backwards | go out | come in → left | right | forty five | ninety degrees | two meters | by door one | by door two → sent-end).

As shown in Fig. 9, Configuration adds parameters to the Task given to the robotic element. In this way, the grammar allows recognition of simple and more complex commands. We used word bigrams for the Language Model, and 32 possible sentences were used for this purpose. Initially we used a vocabulary of 202 different Mexican Spanish words to build these sentences. The HTK Toolkit [14] was used for language and acoustic modeling, implementation of the search algorithm (i.e., the speech recognition function, Viterbi decoding), and evaluation of performance.
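The two sentence structures (Device + Task + Object for manipulation, Device + Task + Configuration for movement) can be illustrated with a small generator. The word lists below are taken from Figs. 8-9 and Table 3, but the actual grammar of the paper was written for HTK, not in Python, so this is only a sketch.

```python
from itertools import product

DEVICES        = ["BOT", "CUBE"]
MANIP_TASKS    = ["TAKE", "SERVE", "OPEN"]
OBJECTS        = ["DOOR ONE", "DOOR TWO", "CUP", "BOTTLE"]
MOVE_TASKS     = ["MOVE FORWARD", "MOVE BACKWARDS", "TURN", "GO OUT", "COME IN"]
CONFIGURATIONS = ["TWO METERS", "BY DOOR ONE", "BY DOOR TWO",
                  "FORTY FIVE DEGREES TO THE LEFT", "NINETY DEGREES TO THE RIGHT"]

def sentences():
    """Enumerate candidate control sentences accepted by the grammar."""
    for d, t, o in product(DEVICES, MANIP_TASKS, OBJECTS):
        yield f"{d} {t} {o}"                 # Device + Task + Object
    for d, t, c in product(DEVICES, MOVE_TASKS, CONFIGURATIONS):
        yield f"{d} {t} {c}"                 # Device + Task + Configuration

# A word-bigram language model would then be estimated from (a subset of)
# these sentences, e.g. by counting adjacent word pairs over the 32 sentences used.
```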

• Speech Recognition Performance

Initially the system was tested with the training corpus to verify the accurate modelling of the phonemes and the stability of the baseline system. The metric of performance was % Word Accuracy (%WAcc), which is defined as:

%WAcc = (N − D − S − I) / N × 100    (2)

where N is the number of words in the reference (correct) sentences, and S, D and I are the numbers of words substituted, deleted, and inserted in the recognized sentences. Hence, this measure considers the different kinds of errors which can change the meaning of a recognized sentence, compromising the ability of the robot to accomplish the desired task.

As presented in Table 1, the recognizer had a %WAcc of 98.45%, which is consistent with an SD system tested with the corpus used for training. If this metric were lower, it would mean that the samples have significant variations in pronunciation and thus the resulting system would not be stable. As we focus on using this recognizer with different users, the SD recognizer alone will not perform efficiently with users other than the one who provided his voice for the training corpus. The stability of the baseline system allows the use of a speaker adaptation technique to adjust the system to other speakers' voices, and thus make it multi-user.

Table 1. Performance of the baseline recognizer tested with the training corpus.

N      D    S    I    %WAcc
1418   9    12   1    98.45
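Eq. (2) can be checked directly against Table 1: (1418 − 9 − 12 − 1)/1418 ≈ 98.45%. A minimal helper:

```python
def word_accuracy(n_ref, deletions, substitutions, insertions):
    """%WAcc = (N - D - S - I) / N, expressed as a percentage (Eq. 2)."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

print(round(word_accuracy(1418, 9, 12, 1), 2))   # 98.45, matching Table 1
```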

Maximum Likelihood Linear Regression (MLLR) was the adaptation technique used for this system. It is based on the assumption that a set of linear transformations can be used to reduce the mismatch between a baseline HMM model set and the speech samples from a different speaker. In this work, these transformations were applied to the mean and variance parameters of the Gaussian mixtures of the SD HMMs. The adaptation speech consisted of a set of 16 phonetically balanced sentences which were uttered by the user once before starting to use the recognizer. The adapted system was tested using the set of control commands defined for this work (see Table 3) with two users (a male and a female student) across ten sessions. In each session, all 12 sentences were read by the test users (thus, there were 120 test sentences for each user), and we recorded the number of instances where %WAcc = 100%, as we are interested in precise recognition of commands. As presented in Table 2, the adapted system achieved around 95% recognition of complete commands.

Table 2. Performance of the adapted recognizer tested with control commands.

User             Fails/Total   % Success
Male student     4/120         96.67
Female student   7/120         94.17
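The core of MLLR mean adaptation is the affine transform applied to every Gaussian mean. The sketch below only illustrates applying an already-estimated transform (with a simplified, assumed diagonal scaling for the variances); the estimation itself is performed by HTK from the adaptation sentences and is not shown.

```python
import numpy as np

def apply_mllr(means, variances, A, b, H=None):
    """Apply an estimated MLLR transform to HMM Gaussian parameters.

    means:     (n_gaussians, dim) mean vectors of the SD models
    variances: (n_gaussians, dim) diagonal covariances
    A, b:      (dim, dim) matrix and (dim,) bias of the mean transform
    H:         optional (dim,) positive scaling for the variances (simplification)
    """
    adapted_means = means @ A.T + b                   # mu_hat = A mu + b for every Gaussian
    adapted_vars = variances * (H if H is not None else 1.0)
    return adapted_means, adapted_vars
```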

4. Multimodal recognition system

The structure of the integrated recognition system is shown in Fig. 10. As presented, both recognition systems receive the input data, and each one processes the signal it was designed to detect. Each recognition block generates an output with a likelihood, which is then weighted jointly with that of the other system to allow mutual help in the decision task to perform an action in case there is uncertainty.


In total, 12 control commands were considered for the bimodal system. These are presented in Table 3.

Figure 10. Multimodal recognition system structure.

Table 3. Control tasks: * represents composite actions (control of both the robot's arm – Cube – and the robot's platform – Bot).

 #   MSL Symbol   Action/Task
 1   h            BOT MOVE FORWARD QUICKLY TWO METERS
 2   a            BOT MOVE BACKWARDS SLOWLY
 3   m            BOT TURN NINETY DEGREES TO THE LEFT
 4   n            BOT TURN FORTY FIVE DEGREES TO THE RIGHT
 5   u            CUBE SERVE BOTTLE
 6   v            CUBE TAKE GLASS
 7   o            BOT GO OUT BY DOOR ONE
 8   s            BOT GET IN BY DOOR TWO
 9   y            BOT SERVE CUP *
 10  g            BOT MOVE FORWARD SLOWLY TWO METERS
 11  c            CUBE HOME
 12  b            BOT STOP *

The recognizers' outputs can either help each other (similar decision) or interfere with each other (different decision). Hence, weighting each other's contribution to the accurate decision task is important. The form in which the system's weighting is performed, to know if they mutually help or interfere with the decision taking, is as follows:
- Each sensor system votes for a particular action according to the output probability of each recognizer.
- If both systems vote for the same action, then both systems help each other. However, if both vote for different actions, then both interfere with each other.

Both responses are added together until the system converges towards a unique response. This sum is performed incrementally in time. Thus, the multimodal system delivers an action when such sum is higher than (or equal to) an established limit. In this case, this limit was set to 95%.

In Fig. 11(a), at t0, all actions have the same occurrence probabilities. This changes at t1 (Fig. 11(b)) because at this moment the system has already processed some signals (speech, signs) via the recognition systems. Thus, the probabilities are modified as a function of the probabilities provided by the recognizers. This continues until the multimodal system converges towards a single action. In case both systems vote for a different action, the process starts with equal occurrence probabilities for each action defined in Table 3. At t1 the probabilities are updated, and the probabilities corresponding to the voted actions (in this example, 3 and 5) are modified. This is shown in Fig. 11(b), where action 3 has a probability of 65% and action 5 has 85%.

Figure 11. Occurrence probabilities at (a) t0, and (b) t1.

As shown in Fig. 12, by weighting these probabilities we obtain an occurrence probability of 20% for the action that initially was higher (in this case, action 5). As presented, action 3 influenced action 5 in a negative way, reducing its value. Additionally, the probability of action 3 is decreased because it affected the other action.

Figure 12. Updated probabilities after weighting.

The probabilities for the other actions are also updated and all are decreased. Because the predominant action's probability does not reach the established limit, the system iterates and requests the user (via speech synthesis) to provide the control command again, in order to validate, or discard, the decoded command. Fig. 13 shows the evolution of the system for actions (a) 5 and (b) 6.

Figure 13. Decision to execute (a) action 5 (u) after 4 iterations, and (b) action 6 (v) after 3 iterations.
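The incremental weighting described above can be sketched as follows. The exact update and penalty rules are our assumptions; the paper only specifies that matching votes reinforce each other, conflicting votes penalize each other, and an action is issued once its accumulated weight reaches the 95% limit.

```python
import numpy as np

N_ACTIONS, LIMIT = 12, 0.95

def fuse(speech_probs_stream, sign_probs_stream, reinforce=0.5, penalty=0.25):
    """Iteratively combine the two recognizers' outputs until one action wins.

    Each stream yields a 12-element probability vector per iteration
    (one value per control action of Table 3).
    """
    weights = np.full(N_ACTIONS, 1.0 / N_ACTIONS)       # equal probabilities at t0
    for p_speech, p_sign in zip(speech_probs_stream, sign_probs_stream):
        vote_s, vote_v = int(np.argmax(p_speech)), int(np.argmax(p_sign))
        if vote_s == vote_v:                            # both systems help each other
            weights[vote_s] += reinforce * (p_speech[vote_s] + p_sign[vote_v])
        else:                                           # the votes interfere
            weights[vote_s] += reinforce * p_speech[vote_s] - penalty * p_sign[vote_v]
            weights[vote_v] += reinforce * p_sign[vote_v] - penalty * p_speech[vote_s]
        weights = np.clip(weights, 0.0, None)
        weights /= weights.sum()                        # keep a probability distribution
        if weights.max() >= LIMIT:
            return int(np.argmax(weights)) + 1          # action number as in Table 3
    return None  # no convergence: the system would ask the user to repeat the command
```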


5. Performance of the multimodal system

As presented in Table 3, 12 control actions were considered for the service robot. Each action has an associated spoken sentence and a sign of the MSL alphabet, as presented in Table 3. To verify that the system performed the desired action, a graphical user interface was designed in Matlab©, which processed the probabilities of the acoustic and visual signals to assign weights and generate an output (see Section 4). This implementation allowed the iterative reception and processing of speech signals and visual MSL signs until the multimodal system achieved convergence towards a maximum value over a defined limit. This limit is reached when the weighting is higher than or equal to 95%.

Although many tests were performed for each of the actions defined in Table 3, only one of them is shown. The action used to show the performance of the multimodal interaction is 6 (v), which is equivalent to the command "CUBE TAKE GLASS". In Fig. 13(b), the evolution of the system to perform action 6 (v) is shown. It is observed that the system requires only a few iterations to appropriately weight the probabilities generated by the recognition systems and to decide which action is to be executed. Also, the system converges to action 6 (v), which is the correct result. The implementation of the multimodal interaction system was tested with the simulation software Roboworks©, which was developed by the University of Texas. In Fig. 14 the simulation of action 6 (v) is shown.

Figure 14. Simulated execution of action 6 (v).

The tests for all control commands were performed 30 times, obtaining a successful completion rate of 95.60%. It is important to mention that lighting affected the segmentation, producing confusions which led to failures and delays in obtaining the desired result. However, this is part of our future research, which is presented in the next section.

6. Conclusions and future work

In this paper a multimodal system to control a service robot was presented. This system was built with two main blocks to process acoustic and visual commands (speech, signs). The proposed system was validated by means of implementation in a simulation environment. The robot was constituted by a mobile platform – Bot – and by a 6-DOF arm – Cube, which as a whole is able to perform easy tasks. The simulation was performed with 12 different tasks, where two of them required coordination of both the platform and the arm. The multimodal system performed the tasks with a completion rate of 95.60%, which was considered robust enough to allow interaction in closed environments. Among the future research and expectations for this work we present the following:

- Implementation of the multimodal system in the real robotic system.
- To use a network of sensor devices to synchronize the received signals.
- To modulate the lighting by using a scheme different from RGB, such as the CIELAB system.
- Implementation of recognition of MSL signs that have movement (i.e., dynamic signs instead of static signs). This would increase the range of applications and tasks to be executed by the service robot.

7. References

[1] International Federation of Robotics (IFR): http://www.ifr.org/service-robots/
[2] Posada-Gomez, R., Sanchez-Medel, L.H., Hernandez, G.A., Martinez-Sibaja, A., Aguilar-Lasserre, A., Leija-Salas, L., "A Hands Gesture System Of Control For An Intelligent Wheelchair", 4th International Conference on Electrical and Electronics Engineering (ICEEE 2007), pp. 68-71, 2007.
[3] Hashimoto, M., Takahashi, K., Shimada, M., "Wheelchair control using an EOG- and EMG-based gesture interface", IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2009), pp. 1212-1217, 2009.
[4] Alcubierre, J.M., Minguez, J., Montesano, L., Montano, L., Saz, O., Lleida, E., "Silla de Ruedas Inteligente Controlada por Voz", Primer Congreso Internacional de Domótica, Robótica y Teleasistencia para Todos, 2005.
[5] Böhme, H.-J., Wilhelm, T., Key, J., Schauer, C., Schröter, C., Groß, H.-M., Hempel, T., "An approach to multi-modal human–machine interaction for intelligent service robots", Robotics and Autonomous Systems, Vol. 44, pp. 83-96, ISSN 0921-8890, 2003.
[6] Weitzenfeld, A., Ramos, C., Dominey, P., "Coaching Robots to Play Soccer via Spoken-Language", RoboCup 2008: Robot Soccer World Cup XII, LNCS, Vol. 5399, Springer, ISBN 978-3-642-02920-2, pp. 379-390, 2009.
[7] Pineda, L., Meza, I., Espinosa, A., "Direct Interpretation of Speech Acts", Reporte Interno DCC-IIMAS, 2003.
[8] Clodic, A., Fleury, S., Alami, R., Herrb, M., Chatila, R., "Supervision and Interaction: Analysis of an Autonomous Tour-Guide Robot Deployment", 12th International Conference on Advanced Robotics (ICAR05), 2005.
[9] Blake, A., Isard, M., Active Contours, Springer, Cambridge, 1998.
[10] Lankton, S., Tannenbaum, A., "Localizing Region-Based Active Contours", IEEE Transactions on Image Processing, Vol. 17, No. 11, pp. 2029-2039, 2008.
[11] Fujimura, K., Sako, Y., "Shape Signature by Deformation", Shape Modeling International (SMI 99), IEEE Computer Society, 1999.
[12] Hagan, M.T., Demuth, H.B., Beale, M.H., Neural Network Design, PWS Publishing, 1995.
[13] Jurafsky, D., Martin, J.H., Speech and Language Processing, Pearson: Prentice Hall, 2009.
[14] Young, S., Woodland, P., The HTK Book (for HTK Version 3.4), Cambridge University, 2006.
[15] Pineda, L.A., Castellanos, H., Cuétara, J., Galescu, L., Juárez, J., Llisterri, J., Pérez, P., Villaseñor, L., "The Corpus DIMEx100: Transcription and Evaluation", Language Resources and Evaluation, 44:4, 2010.
