Anthropomorphic Agent as an Integrating Platform of Audio-Visual Information

Shigeki Sagayama Takuya Nishimoto

Graduate School of Information Science and Technology, The University of Tokyo, Hongo, Bunkyo-ku, Tokyo 113-8656, Japan / {sagayama,nishi}@hil.t.u-tokyo.ac.jp

1 Introduction

In integrating audio-visual information input from sensors with control output to actuators and robotic systems, an anthropomorphic spoken-dialog agent can be a good platform that combines them under a unified concept such as a "virtual human." This talk will focus on an anthropomorphic spoken-dialog agent and its future possibilities as an integrating platform.

2 Spoken Dialog Agent

2.1 Galatea Toolkit

One of the ultimate human-machine interfaces is an anthropomorphic spoken dialog agent that behaves like a human, with facial animation and gestures, and holds speech conversations with humans. Among the numerous efforts devoted to this goal, the Galatea Project, conducted by 17 members from 12 universities, has been developing an open-source, license-free software toolkit [1] for building an anthropomorphic spoken dialog agent, with financial support from IPA (Information-Technology Promotion Agency) during fiscal years 2000–2002. The authors are members of the project. The features of the toolkit are as follows: (1) high customizability in text-to-speech synthesis, realistic face animation synthesis, and speech recognition; (2) basic functions to achieve incremental (on-the-fly) speech recognition; (3) a mechanism for "lip synchronization," i.e., synchronization between audio speech and lip image motion; and (4) a "virtual machine" architecture to achieve transparency in module-to-module communication. The Galatea Toolkit for UNIX/Linux and Windows operating systems will be publicly available from August 22, 2003, at http://hil.t.u-tokyo.ac.jp/~galatea/.

2.2 Toolkit Components

The Galatea Toolkit consists of five functional modules: a speech recognizer, a speech synthesizer, a facial animation synthesizer, an agent manager that works as an inter-module communication manager, and a task (dialog) manager. Fig. 1 shows the basic module architecture of the Galatea Toolkit. The Galatea Project members newly created these components or modified existing components, either of their own or publicly available ones. Some of these functional modules are outlined below.

2.2.1 Common features

Galatea employs model-based speech and facial animation synthesizers whose model parameters are easily adapted to those of an existing person if his/her training data is given. Synthesized facial images and voices are easily customizable depending on the purposes and applications of the toolkit users. This customizability is achieved by employing model-based approaches in which basic model parameters are trained or determined from a set of training data derived from an existing person. Once the model parameters are trained, facial expressions and voice quality can be controlled easily.

[Figure 1: System architecture of Galatea. The Agent Manager acts as a hub connecting the Task Manager, the Speech Synthesis Module (SSM), the Face Image Synthesis Module (FSM), the Speech Recognition Module (SRM), and other application modules; prototyping tools supply task information and dialog models, and the modules handle microphone input and screen/speaker output.]

2.2.2 Speech recognition module (SRM)

SRM consists of three submodules: the command interpreter, the speech recognition engine, and the grammar transformer. It is based on the speech recognition engine "Julian," developed by Kyoto University and others; it accepts grammars that represent the sentences to be recognized, and has been modified to accept multiple formats of grammar representation and to output incremental recognition results. It can change the grammar on request from external modules during dialog sessions. It also produces N-best recognition candidates for sophisticated use of multiple results.
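As an illustration of how an external module might drive the SRM, the sketch below sends slot-style commands to switch grammars and request incremental, N-best output. The host, port, slot names, and reply format are assumptions made for this sketch, not the toolkit's documented API.

    import socket

    # Hypothetical client driving the SRM through the Agent Manager's
    # slot-style text protocol. Host, port, slot names, and the reply
    # format are illustrative assumptions, not the documented interface.
    HOST, PORT = "localhost", 10500

    def send_command(sock, line):
        """Send one 'set Key = Value'-style command line."""
        sock.sendall((line + "\n").encode("utf-8"))

    with socket.create_connection((HOST, PORT)) as sock:
        # Switch the active grammar during a dialog session.
        send_command(sock, "set SRM.Grammar = weather.grammar")
        # Request incremental (on-the-fly) results and 5-best candidates.
        send_command(sock, "set SRM.Incremental = ON")
        send_command(sock, "set SRM.NBest = 5")
        send_command(sock, "set SRM.Listen = Now")

        # Read newline-delimited events until the final hypothesis arrives.
        for line in sock.makefile("r", encoding="utf-8"):
            line = line.strip()
            print(line)  # partial results, then the N-best list
            if line.startswith("rep SRM.Result"):  # assumed final event
                break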

2.2.3 Speech synthesis module (SSM)

This module is the first open-source, license-free Japanese text-to-speech conversion system; it consists of four sub-modules. The command interpreter receives an input command from the agent manager and invokes sub-processes according to the command. The text analyzer decomposes arbitrary Japanese input texts containing Kanji, Kana, alphabetic, and numeric characters, optionally with embedded tags conforming to JEIDA-62-2000 [3] that typically specify the speaking style, and extracts linguistic information including pronunciation, accent type, part of speech, etc., partly utilizing ChaSen [2] and newly developed dictionaries for Japanese morphological analysis. The waveform generation engine is an HMM-based speech synthesizer that simultaneously models spectrum, F0, and duration in a unified HMM framework. The speech output sub-module outputs the synthetic speech waveform.
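To make this four-stage flow concrete, here is a minimal sketch of the pipeline in Python. Every function name and data layout below is a placeholder assumed for illustration, not the module's actual interface.

    # Minimal sketch of the SSM's four sub-modules in sequence. All names
    # and data layouts are illustrative assumptions.

    def text_analyzer(text):
        """Morphological analysis (ChaSen-style): extract pronunciation,
        accent type, and part of speech for each word."""
        # A real implementation would invoke the analyzer and accent
        # dictionaries; this stub returns one dummy entry.
        return [{"surface": text, "pron": "konnichiwa",
                 "accent": 0, "pos": "interjection"}]

    def waveform_generator(linguistic_info):
        """HMM-based synthesis: jointly generate spectrum, F0, and
        duration, then render a waveform (returned as raw samples)."""
        return b""  # placeholder waveform bytes

    def speech_output(waveform):
        """Send the synthetic waveform to the audio device."""
        pass

    def handle_command(command, text):
        """Command interpreter: dispatch commands from the agent manager."""
        if command == "Speak":
            speech_output(waveform_generator(text_analyzer(text)))

    handle_command("Speak", "こんにちは")  # "Hello"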

2.2.4 Facial image synthesis module (FSM)

FSM is a module for high-quality facial image synthesis, animation control, and precise lip synchronization with both synthetic and natural voices. To customize the face model, a graphical user interface is provided for fitting a generic face wire-frame model onto a frontal face snapshot image. Face action units are defined on this generic model, and prototype facial expressions can be synthesized by combining these action units. Autonomous actions such as blinking and nodding can also be generated. Lip movement during an utterance is controlled by "viseme" and duration parameters. Facial animation is easily expressed by a simple script.

[Figure 2: Examples of synthetic expressions from a single photo.]
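The article does not reproduce the script syntax itself, so the following is purely a hypothetical sketch, in the slot-command style of Section 3.1, of what a short facial-animation script might express. All slot names, expression names, and value formats are invented for illustration.

    # Hypothetical facial-animation script in the slot-command style of
    # Section 3.1. Slot names, expression names, and value formats are
    # invented for illustration; consult the toolkit for the real syntax.
    fsm_script = [
        "set FSM.Expression = smile 0.7",  # blend 'smile' action units at 70%
        "set FSM.Blink = AUTO",            # enable autonomous blinking
        "set FSM.Nod = ONCE",              # one autonomous nod
        "set FSM.Lip = a 120ms",           # viseme 'a' held for 120 ms
        "set FSM.Lip = i 90ms",            # viseme 'i' held for 90 ms
    ]

    for command in fsm_script:
        print(command)  # in practice, sent to the FSM via the Agent Manager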

2.2.5 Task manager (TM)

The task of user-agent dialog management can be described in VoiceXML. TM consists of a translator from VoiceXML documents to an intermediate language (Primitive Dialogue Operation Commands, PDOC) and a dialogue controller that interprets the PDOC documents. We extended the original VoiceXML specification with some commands, including facial expression controls for anthropomorphic dialogue agents.
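As a sketch of the kind of document the TM translates into PDOC, the fragment below combines standard VoiceXML elements with a hypothetical <expression> tag standing in for the facial-expression extensions mentioned above; the tag's name and attributes are assumptions, and the PDOC output format is not shown here.

    # VoiceXML fragment of the kind the TM translates into PDOC.
    # The <expression> tag is a hypothetical stand-in for the facial-
    # expression extensions; its name and attributes are assumptions.
    vxml_document = """\
    <vxml version="2.0">
      <form id="greeting">
        <block>
          <expression name="smile" intensity="0.7"/>
          <prompt>こんにちは。ご用件をどうぞ。</prompt>
        </block>
        <field name="request">
          <grammar src="request.grammar"/>
          <filled>
            <prompt>かしこまりました。</prompt>
          </filled>
        </field>
      </form>
    </vxml>
    """

    # The dialogue controller would then interpret the translated PDOC
    # and drive SSM, SRM, and FSM through the Agent Manager.
    print(vxml_document)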

3 Integrating Platform

3.1 Galatea Architecture

In the Galatea Toolkit, the functional units are independently modularized, input/output devices are managed directly within each module, and the agent manager controls inter-module communication. To make it easy to integrate additional modules, all modules are modeled as virtual machines with a simple common interface and are connected to each other through a broker (communication manager). The Agent Manager (AM) works as a hub through which all modules communicate with each other. For example, issuing the command "set Speak = Now" to the speech synthesis module starts voice synthesis of a given text immediately.
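The following minimal sketch illustrates this virtual-machine abstraction using the "set Speak = Now" command from the text; the class layout and the "Text" slot are illustrative assumptions rather than the toolkit's actual interface.

    # Minimal sketch of the "virtual machine" module abstraction: each
    # module exposes named slots set through the Agent Manager. The
    # "set Speak = Now" command follows the example in the text; the
    # class layout and the "Text" slot are illustrative assumptions.

    class VirtualMachineModule:
        """Simple common interface shared by all modules."""

        def __init__(self, name):
            self.name = name
            self.slots = {}

        def handle(self, line):
            """Interpret one 'set Key = Value' command line."""
            verb, rest = line.split(" ", 1)
            if verb == "set":
                key, value = (s.strip() for s in rest.split("=", 1))
                self.slots[key] = value
                self.on_set(key, value)

        def on_set(self, key, value):
            pass  # modules override this to react to slot changes

    class SpeechSynthesisModule(VirtualMachineModule):
        def on_set(self, key, value):
            if key == "Speak" and value == "Now":
                print(f"[{self.name}] synthesizing: {self.slots.get('Text', '')}")

    ssm = SpeechSynthesisModule("SSM")
    ssm.handle("set Text = こんにちは")
    ssm.handle("set Speak = Now")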

3.2 Integration of Audio-Visual Information

Recently, we added a body animation module to the Galatea Toolkit. First, we built a cartoon-type computer-graphics (CG)-based human-image animation (Fig. 3) and connected it to the Galatea agent system. Thanks to the simple Galatea architecture, the connection was relatively easy. Second, we combined the Galatea face with the body part of the CG animation by overlapping the two animated images in the same window, with synchronization. Now the agent system has both a face and a body, animated by commands received from the agent manager (AM).

[Figure 3: Computer-graphical animated agent whose body is combined with the Galatea face animation.]

Extending the above idea, we can connect information inputs and control outputs through the same agent architecture and operate the whole system using the new inputs and outputs. In this sense, an anthropomorphic spoken-dialog agent is an excellent platform for integrating audio-visual information from sensors and actuators. For example, visual information may provide information about the user's position, facial expression, and gestures. Sound-source separation will be helpful not only for speech recognition in noise but also for face/body direction control. Combined with mechanical robots, intelligent agents will move and work physically in the real world and communicate with human users through speech, gestures, and facial expressions.

4 Acknowledgements

The authors would like to thank Shin-ichi Kawamoto (JAIST) for providing materials for this article, and the other colleagues of the Galatea Project, Yasuharu Den (Chiba Univ.), Keikichi Hirose (UT), Katsunobu Itou (Nagoya U.), Atsuhiko Kai (Shizuoka U.), Takao Kobayashi (TIT), Akinobu Lee (NAIST), Nobuaki Minematsu (UT), Shigeo Morishima (Seikei U.), Satoshi Nakamura (ATR), Tsuneo Nitta (TUT), Hiroshi Shimodaira (JAIST), Keiichi Tokuda (NITech), Takehito Utsuro (Kyoto U.), Atsushi Yamada (ASTEM), and Yoichi Yamashita (Ritsumeikan U.), for their contributions to the Galatea Project. We are also grateful to Masayuki Nakazawa and Christer Lunde for their additional development (body animation) of the agent. We look forward to future collaboration with Profs. Shigeki Ando, Ayumu Matani, and Koichi Ishikawa in audio-visual information integration.

References

[1] Shigeki Sagayama et al., "Galatea: An Anthropomorphic Spoken Dialogue Agent Toolkit," IPSJ Technical Report, 2002-SLP-45-10, pp. 57-64, Feb. 2003. (in Japanese)

[2] ChaSen: http://chasen.aist-nara.ac.jp/index.html.en

[3] Japan Electronic Industry Development Association, "Standard of Symbols for Japanese Text-to-Speech Synthesizer," JEIDA-62-2000, 2000.