Perception Systems for Naturally Interacting Humanoid Robots
Norbert Schmitz and Karsten Berns
Abstract— This paper presents a design concept and an exemplary realization of a perception system for naturally interacting humanoid robots. Relevant non-verbal interaction capabilities are selected based on psychological research. These interaction capabilities are transferred into a multi-functional user model which is implemented exemplarily for interactive game scenarios. Benefits of this perception system design in comparison to existing realizations are a modular perception concept and the possibility to transfer psychological inter-personal research directly into human-robot interaction scenarios.
I. INTRODUCTION
The interaction capabilities of current humanoid robots
are still far behind those observed in inter-personal interaction.
Since inter-personal interaction is not limited to verbal exchange,
it is also worth considering non-verbal interaction capabilities.
Many realizations of interactive humanoid robots are
designed for a specific interaction scenario. This kind of
limitation is still required since inter-personal interaction is
a complex task. Aiming at complex and versatile human-robot
interaction, a feasible approach is to transfer
knowledge from inter-personal to human-robot interaction.
This transfer can be supported by the design of the robot
control architecture and the perception system in particular.
This paper focuses on the perception system, although various
other aspects are of interest as well. The design and
realization of the system proposed in this paper is based
on the idea of analyzing and categorizing knowledge from
psychological inter-personal interaction research and transferring
suitable and relevant aspects into a user model and a technical
realization of a perception system for interactive humanoid robots.
The whole system is embedded into an emotion-based control
architecture which is described in [9].
The following sections will introduce this concept and
present an exemplary realization with the focus on an in-
teractive game play situation.
II. ROBOT PERCEPTION
The analysis of currently realized perception systems for
naturally interacting robots reveals dramatic differences
concerning the design concepts. Many implementations are not
related to human interaction and extract information which
is considered relevant from the robotic point of view.
The following sections introduce important contributions
which are based on inter-personal communication.
Robotics Research Laboratory, Faculty of Computer Science, University of Kaiserslautern, Germany. {nschmitz|berns}@agrosy.cs.uni-kl.de
A. Gandalf and Ymir
The development of naturally interacting robots is not
limited to embodied agents. Artificial animated heads or even
full body models are often used to overcome the limitations
of the mechanical construction. The avatar Gandalf and the
cognitive architecture Ymir are used to realize real-time
capable interaction [16].
Thorisson developed a real-time capable system and a
control architecture based on data from the psychological and
linguistic literature, demonstrating that existing knowledge from
inter-personal communication can be transferred to human-robot
interaction [17].
The control architecture uses a layered combination of
reactive and reflective behaviors which are triggered by the
dialog management system. The modular distributed system
is divided into three module groups: perception, action and
decision. The decision modules of each layer use more or
less abstract knowledge generated by the perception system
to select suitable actions. The action modules are responsible
for executing one action in one of nine action categories.
The perception system as part of the whole architecture
uses virtual sensors for unimodal information extraction
and multimodal descriptors as fusion modules. The virtual
sensors operate on a single stream of input data and are
separated into the categories: Prosodic, Speech, Positional
and Directional. These sensors are further subdivided into
static sensors reporting the current state of a property and
dynamic sensors reacting to changes of an observed stimulus.
Gandalf uses 16 basic virtual sensors. Since the virtual
sensors operate on a single modality, a fusion system is
required. Thorisson introduces ten multimodal descriptors,
which again can react either to static stimuli or to dynamic
changes.
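To make this structure concrete, the following minimal sketch renders the split into unimodal virtual sensors (static and dynamic) and a multimodal descriptor as fusion module. All class names and percepts are hypothetical illustrations of the concept described above, not Thorisson's actual implementation.

```python
from abc import ABC, abstractmethod

class VirtualSensor(ABC):
    """Unimodal sensor operating on a single input stream."""
    @abstractmethod
    def update(self, sample) -> dict:
        """Return a percept describing a state or a change."""

class SpeechActivitySensor(VirtualSensor):
    """Static sensor: reports the current state of a property."""
    def update(self, sample):
        return {"speech_active": sample["energy"] > 0.1}

class HeadTurnSensor(VirtualSensor):
    """Dynamic sensor: reacts to changes of an observed stimulus."""
    def __init__(self):
        self.last_yaw = None
    def update(self, sample):
        changed = (self.last_yaw is not None
                   and abs(sample["yaw"] - self.last_yaw) > 0.2)
        self.last_yaw = sample["yaw"]
        return {"head_turned": changed}

class MultimodalDescriptor:
    """Fuses percepts of several virtual sensors into one description."""
    def __init__(self, sensors):
        self.sensors = sensors
    def update(self, sample):
        fused = {}
        for s in self.sensors:
            fused.update(s.update(sample))
        # e.g. "speaker turned away": speech active while the head turns
        fused["speaker_turned_away"] = (fused.get("speech_active", False)
                                        and fused.get("head_turned", False))
        return fused
```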
The presented approach of natural interaction using Ymir
and Gandalf is a very promising example of modeling
a perception system for humanoid robots. Major aspects
are the transfer of psychological knowledge into robotic
applications and the realization of real-time capable systems.
The perception module with the splitting of virtual sensors
and multimodal descriptors allows multiple views on the
same dataset and reusable information. This approach allows
the separation of the perception and decision process which
is crucial for the adaptation and extension of the proposed
architecture.
B. Asimo
Recent developments concerning Asimo's interaction capabilities
target interactive game situations. Okita
[13] analyzes the reactions of children to different behaviors
of the robot in interactive learning situations. Comparisons of
different learning styles show that the behavior of the robot
has an influence on the interacting children.
Focusing again on the perception system, the concept of a
cognitive map has been introduced by Ng-Thow-Hing [12].
Another demonstration using a card table top game shows
the principles of the architecture. The central component
of the Cognitive Map Architecture is the cognitive map
server which accepts perception information and provides an
interface for information retrieval. In the card game scenario
table, card and speech information are extracted and stored
in the server. Each stored piece of information consists of a
variable set of parameters like position and orientation.
To access the stored data, three mechanisms are provided:
indexers, detectors and deciders. Indexers are used to access the
stored data using a provided sorting strategy, which allows
sorting by time or position. The second mechanism, the
detectors, are boolean operators which limit the amount
of messages; messages are only forwarded if the detector
evaluates to true. In the card game example, detectors are used
to suppress new objects located beside the table. Deciders
finally use the information from detectors or other deciders to
extract information which can be stored back into the cognitive
map. The output of the deciders is used to trigger a finite
state machine representing the game flow by modeling both
robot and game state.
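A minimal sketch of the indexer/detector/decider chain may clarify the mechanism. All names and the percept format are hypothetical; they mirror the description above, not the actual API of the Cognitive Map Architecture.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Percept:
    kind: str                  # e.g. "card", "table", "speech"
    position: tuple            # (x, y) on the table
    timestamp: float = field(default_factory=time.time)

class CognitiveMapServer:
    """Stores percepts and hands them out to indexers."""
    def __init__(self):
        self.store = []
    def add(self, percept):
        self.store.append(percept)

def index_by_time(store):
    """Indexer: sorted view on the stored data."""
    return sorted(store, key=lambda p: p.timestamp)

def on_table_detector(percept, table=(0.0, 0.0, 1.0, 0.6)):
    """Detector: boolean filter, forwards only percepts on the table."""
    x0, y0, w, h = table
    x, y = percept.position
    return x0 <= x <= x0 + w and y0 <= y <= y0 + h

def new_card_decider(server):
    """Decider: derives higher-level information, stores it back and
    may trigger the game-flow finite state machine."""
    cards = [p for p in index_by_time(server.store)
             if p.kind == "card" and on_table_detector(p)]
    if cards:
        server.add(Percept("card_played", cards[-1].position))
        return "card_played"   # event for the game FSM
    return None
```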
The cognitive map architecture provides a uniform and
flexible way to access stored perception data. The hierarchy
of indexers, detectors and deciders allows the reuse of already
existing information and the reduction of computational effort.
Nevertheless, it does not provide rules about which perceptions
are required and should be perceived. Here, again, the focus
is the information required for the specific interactive game
situation.
III. COMMUNICATION MODELS
Focusing on the channels of sight and hearing, a new question
arises: Which information is transferred and of interest for a
communication process? In [6], detailed descriptions of face-
to-face interaction processes are given. There, seven major
categories of action that can be performed or perceived are
described.
• Para-language This term defines all information trans-
mitted on top of speech signals but not directly related
to the content of the spoken words. Further subdivi-
sion distinguishes perspective, organic, expressive and
linguistic aspects.
• Kinesics The term kinesics describes any type of body
motion including the head, facial expressions, trunk, hands and so on.
• Proxemics The perception of distance zones, spatial
arrangements and sensory capabilities is a key factor
of any conversation.
• Artifacts Artifacts in the environment, and even the
environment itself, influence a communication process.
Depending on the situation, artifacts may even be necessary.
• Olfactory The scent of interaction partners also has
an influence, although it is limited to a narrow area
close to the interaction partners [4]. Since human-robot
communication mainly covers public areas, olfactory
communication is of minor interest and therefore not
discussed further.
• Haptics Haptic interaction signals are related to specific
situations such as hand shaking, beating or grabbing.
These communicative actions require an embodied agent
and direct feedback of the robot [1]. Since the focus of
this work is the modeling of the user and haptic interactions
are seldom in public environments, this topic is not
considered further. Nevertheless, haptics may be of interest
for extensions of this work.
• Language Although non-verbal aspects are considered
important, most of the conversational content is transferred
via speech [10]. Verbal interaction is mentioned due to its
immense importance, but it is not directly included in the
model since the user modeling focuses on non-verbal aspects.
A. Aspects of Paralanguage
The analysis of speech-related communication information
is divided into different aspects. Traunmüller [18] distinguishes:
• Perspective aspects Perspective aspects of speech are
concerned with spatial relationships. These include, among
others, place, distance, orientation and transmission channel.
These aspects have various impacts on the communication;
examples are attention attraction and turn-taking.
• Organic aspects Organic aspects are modifications of
speech caused by the differences of the vocal tract. Each
human being has a unique vocal tract with a unique
modification of speech. Based on these modifications,
information like age, sex and pathology can be extracted.
Especially information like age and sex is commonly
used in communication situations.
• Expressive aspects Expressive aspects are transmitted
by variations in speaking rate, pitch dynamics and voice
quality. They convey, for example, emotion, attitude
and environment. Expressive aspects strongly influence
the communication process.
• Linguistic aspects Linguistic aspects transmit primarily
the message itself. Besides the spoken words, dialect,
accent and speech style are also transmitted. The
influence on the communication process is immense,
which is mainly based on the transmitted content.
The amount of information derived from paralinguistic
phenomena is huge, and large variations in the usage of this
information and in its reliability can be observed [2]. A crucial
part towards a model of non-verbal communication is the
selection of relevant aspects. Unfortunately, an importance-based
ranking of paralinguistic aspects is not available due
to the strong correlation with the situation and personal
preferences. Nevertheless, several aspects exist which
seem to be of crucial importance.
B. Situation Models
The situation model gives quite precise information about
required knowledge in a communication situation. If the
situation is more specific detailed models do exist. In [5]
various types of conversation situations are mentioned. As
example of a complex situation a specific model of play has
been developed1. Here four continuous cycles are mentioned.
• Continuous play In a concrete play situation an observe-
plan-act cycle is proposed. In this cycle all participants
are focused on the game with a mainly serial turn-taking
scheme.
• Fun In a second independent cycle the pursuit of
fun, which is a basis for many games, is mentioned.
Based on the conversation, a shared situation and the
engagement a situation of amusement can be created.
• Learning Additionally the authors claim a cycle of
learning. Each participant has the goal to gain expe-
rience and improve the game play.
• Repeated play The fourth cycle states that
game situations may be interrupted and continued at
any time.
C. Communication Partner Model
Any interactive communication situation is based on the
perception of the interaction partner. To provide these in-
formation as basis for any complex interaction three basic
requirements have to be fulfilled:
• Timing requirements The trisection of the communication
process into opening, main part and closing is
realized. During the opening, a conversation partner is
not yet present, so the detection of humans is important.
The main part strongly depends on the conversation
situation and requires detailed information about the
interaction partner. The third part, the closing,
concludes the conversation.
• Memory requirements It is obvious that previous knowl-
edge as well as a runtime memory is required in any
communication situation. In this user model two types
of knowledge are used: the a priori knowledge provided
by the application programmer containing information
like face dimensions and speech identification data
and the a posteriori knowledge which consists of data
captured and stored during an interaction.
• Sensor requirements The perception system generates
a broad range of information which can be subdivided
into basic information units called percepts. Each percept
provides a specific piece of information which can be of
interest for the robot. The percepts are grouped into the
categories paralanguage, kinesics, proxemics, language
and situation-specific percepts. The user model assigns
a set of relevant percepts to each perception task in the
timing model. Only these percepts are considered to be
relevant in the specific time step.
All these requirements are described in the general user
model depicted in figure 1. This model consists of the three
main stages of interaction and the relevant percepts for each
category.

¹ See http://www.dubberly.com/concept-maps/a-model-of-play.html
[Figure 1 shows the three interaction states and their assigned percepts. Opening state: speech activity, sound direction, head position, person movement. Main state: speech activity, sound direction, sound distance, speaker identification, hand gesture, head orientation, head position, emotional state, person movement, game state, player state. Closing state: head position.]

Fig. 1: Schematic view of the realized communication partner
model with active perception capabilities.
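The state-dependent percept assignment of figure 1 can be rendered directly as a lookup structure. The following sketch is illustrative; the state and percept names are taken from the figure, everything else is an assumption.

```python
from enum import Enum, auto

class InteractionState(Enum):
    OPENING = auto()
    MAIN = auto()
    CLOSING = auto()

# Percepts considered relevant in each stage of the interaction,
# following the communication partner model of figure 1.
RELEVANT_PERCEPTS = {
    InteractionState.OPENING: {
        "speech_activity", "sound_direction",
        "head_position", "person_movement",
    },
    InteractionState.MAIN: {
        "speech_activity", "sound_direction", "sound_distance",
        "speaker_identification", "hand_gesture", "head_orientation",
        "head_position", "emotional_state", "person_movement",
        "game_state", "player_state",
    },
    InteractionState.CLOSING: {"head_position"},
}

def filter_percepts(state, percepts):
    """Keep only the percepts relevant in the current time step."""
    wanted = RELEVANT_PERCEPTS[state]
    return {k: v for k, v in percepts.items() if k in wanted}
```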
IV. AUDITORY PERCEPTION SYSTEM
[Figure 2 shows the auditory perception modules: two JackCapture inputs feed a preprocessor and a CUDA-based beamformer on the localization branch, and a preprocessor, the voice activity detection and the voice classifier on the identification branch; both branches meet in the audio fusion.]

Fig. 2: Auditory perception system with two major branches,
localization and identification.
The auditory perception system as depicted in figure
2 contains two major branches, the identification and the
localization, which are fused in the auditory fusion step. The
following sections briefly introduce these branches.
A. Sound Source Localization
The sound source localization has to provide proxemic
information about the spatial arrangement of human
interaction partners. This includes the position of a sound
source relative to the position of the robot or, more precisely,
the position of the microphones. The contribution of the
sound source localization to the model are the paralinguistic
aspects of Sound Direction and Sound Distance. Additionally,
loudness information is of interest to allow a priority-based
distinction of sound sources. Sound localization approaches
as required here have been widely used in humanoid robotics
with varying focus. Basic requirements for interactive robots
are real-time data processing and noise robustness.
B. Speaker Identification
For any inter-personal communication situation the iden-
tity of the interaction partner is of crucial importance. The
introduction of all participants is often one of the first
conversation steps if the identity is not previously known.
The behavior of all participants heavily depends on the
relationship between them. It is therefore an important
competence for a naturally interacting humanoid robot to
estimate the identity of an interaction partner and adapt the
robot behavior according to the relationship. The speaker
identification contributes to the paralinguistic aspects of
Speech Activity and Speaker Identification.
The realized speaker classification is subdivided into three
modules: preprocessing, voice activity detection and voice
classification. The preprocessing is responsible for the basic
transformation into the frequency domain and is not described
further here. The voice activity detection (VAD)
generates a probability value for each microphone channel
and frame of the audio stream; it is based on the work of
Cohen [3]. The classification module generates the current
feature vectors and compares them to a set of previously
trained codewords [7].
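A minimal sketch of this three-stage pipeline is given below. The energy-threshold VAD merely stands in for the estimator of Cohen [3] actually used; all function names and parameters are illustrative.

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Preprocessing: split into 50%-overlapping frames and transform
    each Hamming-windowed frame into the frequency domain."""
    window = np.hamming(frame_len)
    n = 1 + (len(signal) - frame_len) // hop
    return np.array([np.fft.rfft(signal[i*hop:i*hop+frame_len] * window)
                     for i in range(n)])

def voice_activity(frames, rel_threshold=0.1):
    """VAD stand-in: a frame counts as voiced if its energy exceeds a
    fraction of the chunk maximum (the realized system uses the
    estimator of Cohen [3] instead)."""
    energy = np.mean(np.abs(frames) ** 2, axis=1)
    return energy > rel_threshold * energy.max()

def speech_chunks(frames, vad):
    """Group consecutive voiced frames into chunks for classification."""
    chunk = []
    for frame, voiced in zip(frames, vad):
        if voiced:
            chunk.append(frame)
        elif chunk:
            yield np.array(chunk)
            chunk = []
    if chunk:
        yield np.array(chunk)
```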
The extraction of speaker-related information from sound
sources is related to the extraction process described by
Sigurdsson [15]. The extraction process is divided into five steps:
framing, frequency transform, log amplitude transform, Mel
scaling and discrete cosine transform. The following section
briefly introduces this process and describes variations in
the realized approach.
The discrete auditory input signal $s(n)$ of each microphone
$m \in \{0, \ldots, M\}$ is divided by the voice activity detection into
chunks of speech. Each chunk is further subdivided into a
set of frames $f \in \{0, \ldots, F\}$ with length $L$. Each input frame
overlaps the previous one by 50% and is transformed into the
frequency domain using Hamming windows and the Fast
Fourier Transform.
The selection of a suitable set of features influences the
detection rate and the calculation time. The used set of
features is a modification of the MFCC FB-24 filter bank.
The frequency range between $f_{low} = 0\,\mathrm{Hz}$ and $f_{high} = 3000\,\mathrm{Hz}$ is divided into $M = 22$ triangular filters. The
centers $C_i$ of the triangular filters and their distance $\Delta f_{mel}$
are calculated as
$$\Delta f_{mel} = \frac{1127 \cdot \left(\ln\!\left(1 + \frac{f_{high}}{700}\right) - \ln\!\left(1 + \frac{f_{low}}{700}\right)\right)}{M}$$

$$C_i = f_{mel}^{-1}\!\left(1127 \cdot \ln\!\left(1 + \frac{f_{low}}{700}\right) + i \cdot \Delta f_{mel}\right), \qquad 0 \le i \le M+1$$

$$f_{mel}^{-1}(m) = 700 \cdot \left(e^{\frac{m}{1127}} - 1\right)$$
Each filter $F_i(f_{mel})$ overlaps half of both neighboring
filters and is defined as

$$F_i(f_{mel}) = \begin{cases} 0 & k < C_{i-1} \\ \dfrac{k - C_{i-1}}{C_i - C_{i-1}} & C_{i-1} \le k < C_i \\ \dfrac{C_{i+1} - k}{C_{i+1} - C_i} & C_i \le k < C_{i+1} \\ 0 & k > C_{i+1} \end{cases} \qquad 1 \le i \le M$$
The corresponding transformed Mel coefficients $S_{mel_i}$ for
each filter $i$ and the Fourier-transformed input signal $S(n)$
are then calculated as

$$S_{mel_i} = \sum_{n=0}^{N-1} F_i \cdot |S(n)|^2, \qquad 0 \le i \le M-1$$
The final Mel Frequency Cepstral Coefficients (MFCC)
are calculated using the Discrete Cosine Transform II (DCT-II)
as shown below. The output of this procedure is a
set of MFCC feature vectors describing a chunk of speech.

$$X_k = \sum_{n=0}^{N-1} x(n) \cdot \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \qquad k = 0, \ldots, N-1$$
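Putting the filter bank and the DCT-II together, a compact MFCC extractor could look like the following sketch. It follows the formulas above with the stated parameters ($f_{low} = 0\,\mathrm{Hz}$, $f_{high} = 3000\,\mathrm{Hz}$, $M = 22$ filters); the FFT size, sample rate and function names are assumptions, and the coefficient range kept at the end follows the selection discussed below.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(n_filters=22, f_low=0.0, f_high=3000.0,
                   n_fft=512, sample_rate=16000):
    """Triangular filters with centers equally spaced on the Mel scale."""
    d_mel = (mel(f_high) - mel(f_low)) / n_filters
    centers_hz = mel_inv(mel(f_low) + np.arange(n_filters + 2) * d_mel)
    # map the center frequencies to FFT bin indices
    bins = np.floor((n_fft + 1) * centers_hz / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(frames, fb):
    """MFCC per frame: Mel-filtered power spectrum, log, DCT-II."""
    power = np.abs(frames) ** 2        # |S(n)|^2 of the FFT frames
    s_mel = power @ fb.T               # S_mel_i = sum_n F_i * |S(n)|^2
    log_mel = np.log(s_mel + 1e-12)    # log amplitude transform
    coeffs = dct(log_mel, type=2, axis=1)
    # drop C0/C1 and the noisy high-order coefficients,
    # cf. the selection discussed in the text below
    return coeffs[:, 2:16]
```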
A suitable selection of contributing MFCCs is not trivial.
Zhen [20] provides a relative comparison of the importance
of the Mel coefficients. The coefficients C0 and C1 degrade
recognition results since they are correlated with the
signal energy. Zhen further describes that all coefficients
above C12 only have minor contributions since they are
heavily influenced by noise. Also, the contribution of delta
and delta-delta features can be neglected, as described by
Radová [14].
Based on this information, the final set of MFCC features
consists solely of the coefficients C2 to C16. One of these 14-dimensional
vectors is extracted for each frame of a chunk.
Each of them has to be compared to a reference set of
vectors approximating the possible speakers. To compare two
feature vectors, a distance measure is required. The mean
squared error (MSE) is used as the distance measure here
since, in contrast to the Euclidean distance, it does not require
a square root operation.
$$d(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\|^2 = \sum_{i=0}^{l} (x_i - y_i)^2$$
The following section will introduce the realized vector
quantization technique based on the feature vectors and
distance measurement described before.
The modeling of a speaker can now be reduced to the
description of a large amount of example MFCC vectors
generated during a training phase. Assume that a set of
descriptive vectors $x_i$ has been extracted; such a set $X = \{x_0, \ldots, x_n\}$ exists for each possible communication partner.
The goal of the vector quantization is the selection of
a suitable, preferably small set of representatives $M = \{\mu_1, \mu_2, \ldots, \mu_k\}$ which optimally describes the large amount
of input vectors $X$. This process can be seen as a lossy
compression of the data.
Several approaches like Gaussian Mixture Models and
Neural Networks have been used for speaker representation
with similar results. Another approach is the usage of the
K-Means algorithm, described for example by Kanungo [11],
with a fixed number of representatives. The algorithm assigns
each input vector to the closest centroid according to a
distance measure $d(x, \mu)$. The centroid vectors are then
recalculated as the mean of all assigned input vectors. The
new centroid positions lead to an update of the centroid
assignments. This procedure is repeated until the assignment
is unchanged.
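As an illustration, plain K-Means codebook training with the squared-error distance from above, together with the resulting matching step, could be sketched as follows. The helper names are hypothetical, not the paper's code.

```python
import numpy as np

def train_codebook(X, k=128, max_iter=100, seed=0):
    """Plain K-Means: X is an (n, d) array of MFCC vectors;
    the returned (k, d) array is the speaker codebook."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = np.full(len(X), -1)
    for _ in range(max_iter):
        # squared distance d(x, mu) = ||x - mu||^2 to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = d.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                       # assignments unchanged: converged
        assignment = new_assignment
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def distortion(frames, codebook):
    """Average distance of a chunk's frames to their nearest codeword;
    the speaker whose codebook yields the lowest value is the match."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()
```

At recognition time, a chunk of speech would then be assigned to the speaker whose codebook yields the minimal average distortion.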
The selection of an optimal number of representatives
$k$ is a complex task. Extensions to K-Means like the
X-Means algorithm do exist, but estimating the required
number of centroids leads to high computational complexity.
Several experiments showed that a large fixed number of
centroids sufficiently represents the set of input vectors.
The current implementation selects $k = 128$ centroid
vectors as the description for each speaker. The approach of
generating an independent codebook for each speaker did not
provide sufficient results: the distances between the speakers
are too low. The problem with this procedure is the
omnipresent background noise which is trained into each
codebook. The following section introduces the universal
background model, which helps to distinguish noise from
speaker.
Speaker recognition systems often use voice recorded
in silent environments. The influence of noise is
reduced by using microphones close to the speaker,
like headsets or telephones. In the area of mobile robotics
this is in general not the case: speakers are spread in
the environment, and noise from motors and computers is
omnipresent.
A crucial point in the training process of the speaker
codebooks is to avoid training noise as speaker features.
A possible solution is noise reduction, which is complex and
hard to realize in frequently changing environments. Another
approach is the introduction of a background model. The
principal idea is that all speaker vectors are used to train
a single codebook. This codebook should contain features
which are present in all sets of feature vectors, especially
the background noise. After the training of the background
model, each speaker codebook is trained with the additional
constraint to keep away from the background model.
The Universal Background Model (UBM) is trained using
the standard K-Means algorithm. The number of features
used to train the model is huge since it contains all speaker
vectors, which leads to long training times. The training of
the individual speaker models is realized using a modified
K-Means algorithm. A similar approach for background
modeling using Gaussian Mixture Models has been described by
Hautamäki [8].
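The paper does not spell out the modification of the K-Means algorithm, so the following sketch shows one plausible reading: a centroid update is rejected whenever it would move the centroid too close to a UBM codeword, so that speaker codebooks stay away from features shared by all speakers, i.e. the background noise. The rejection rule and the margin are assumptions.

```python
import numpy as np

def nearest_dist(v, codebook):
    """Squared distance of v to its closest codeword."""
    return ((codebook - v) ** 2).sum(axis=1).min()

def train_speaker_codebook(X, ubm, k=128, margin=1.0,
                           max_iter=100, seed=0):
    """Modified K-Means: centroids are kept away from the UBM.
    How the constraint is enforced is an assumption; here a mean
    update is rejected if it moves within `margin` of a UBM codeword."""
    rng = np.random.default_rng(seed)
    # initialize only from vectors far from the background model
    far = np.array([nearest_dist(x, ubm) > margin for x in X])
    pool = X[far] if far.sum() >= k else X
    centroids = pool[rng.choice(len(pool), size=k, replace=False)].copy()
    for _ in range(max_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignment = d.argmin(axis=1)
        moved = False
        for j in range(k):
            members = X[assignment == j]
            if not len(members):
                continue
            candidate = members.mean(axis=0)
            # keep the centroid away from background (noise) features
            if (nearest_dist(candidate, ubm) > margin
                    and not np.allclose(candidate, centroids[j])):
                centroids[j] = candidate
                moved = True
        if not moved:
            break
    return centroids
```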
V. VISUAL PERCEPTION
From the technical point of view, the visual perception
system as depicted in figure 3 is a sensor-input-driven
hierarchy of perception modules. Each module provides new
information based on input images and, optionally, additional
elementary percepts. The arrangement of perception modules
has been developed and refactored several times to optimize
the flow of information and reduce the computational
complexity. The final module hierarchy, depicted in figure 3,
includes the relation of the camera perception module to other
groups relevant for the perception system.
[Figure 3 shows the visual perception hierarchy: the camera images pass through SaveImages and ScaleImages into the single and stereo processing; CUDA-accelerated branches comprise two skin color detectors, two face detectors, the gesture detector, the emotion detector, the Tangram detector and the head pose estimation, all feeding the video fusion.]

Fig. 3: Visual perception system with several information
extraction branches.
Each module and its functionality is briefly introduced
below before details on relevant modules are given in the
following sections.
• Face Detector The face detector uses the Haar cascade
classifier introduced by [19] to find frontal faces. A
list of face candidates is generated and provided as
elementary percept.
• Tangram Detector The Tangram detector uses color
information to extract interaction relevant artifacts. The
number of relevant artifacts for communication pur-
poses is endless and therefore a small but useful set has
been selected. The detection focuses on the analysis of
game play situations during an instruction puzzle game
called Tangram.
• Skin Color Detector The skin color detector uses a
skin color model based on a combination of Gaussian
probabilities. The classifier assigns a skin probability to
each image pixel using a look-up table. Additionally, an
average skin color value is added to all face candidates
provided by the face detector.
• Emotion Detector The emotion detector uses a close-up
high-resolution image of a frontal face to extract facial
features. According to the relative locations of these
features, an emotional state is estimated using FACS.
• Gesture Detector The gesture detector realizes the
detection of emblematic gestures commonly used in
interaction situations. For this purpose, skin-colored regions
are detected and compared with a previously trained set of
hand gestures.
• Head Pose Estimation The head pose estimator uses the
camera with a narrow field of view and estimates the head
pose by matching a rigid 3D head model with a set of
2D image features.
• Video Fusion The video fusion module uses all
elementary percepts, like face candidates, skin color
estimation and the depth image, to track the heads of possible
interaction partners. The tracking is realized using a
particle filter for each person in the environment; a sketch
of this step follows the list. The output of this fusion
module is a list of public percepts.
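As an illustration of this fusion step, a minimal per-person particle filter over the 3D head position might look as follows. The random-walk motion model, the Gaussian weighting and the percept format are assumptions for the sketch, not the system's actual implementation.

```python
import numpy as np

class HeadTracker:
    """One particle filter per person, tracking a 3D head position."""
    def __init__(self, init_pos, n_particles=200, motion_noise=0.05):
        self.particles = init_pos + motion_noise * \
            np.random.randn(n_particles, 3)
        self.weights = np.full(n_particles, 1.0 / n_particles)
        self.motion_noise = motion_noise

    def predict(self):
        # random-walk motion model (assumption for the sketch)
        self.particles += self.motion_noise * \
            np.random.randn(*self.particles.shape)

    def update(self, measurements, sigma=0.2):
        """Weight particles by elementary percepts (e.g. face candidates
        back-projected to 3D) and resample."""
        w = np.zeros(len(self.particles))
        for m in measurements:
            d2 = ((self.particles - m) ** 2).sum(axis=1)
            w += np.exp(-d2 / (2 * sigma ** 2))
        if w.sum() > 0:
            self.weights = w / w.sum()
        idx = np.random.choice(len(self.particles),
                               size=len(self.particles),
                               p=self.weights)
        self.particles = self.particles[idx]
        self.weights = np.full(len(self.particles),
                               1.0 / len(self.particles))

    def estimate(self):
        """Public percept: the tracked head position."""
        return self.particles.mean(axis=0)
```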
[Figure 4 shows the fusion stage: the audio, video and kinematic branches are combined via the kinematic and dynamic annotation into the audio-video fusion.]

Fig. 4: Perception fusion system as late fusion of auditory
and visual perception.
The multimodal fusion module, as part of the fusion
hierarchy depicted in figure 4, provides a consistent view of
the extracted percepts by combining all relevant information
into a single description for each person in the environment.
Additionally, all percepts mentioned here can be accessed
directly by the control architecture if desired.
The perception fusion represents the last instance of the
perception system. All information extracted and fused up to
this point is forwarded to the emotion-based control
architecture described in [9].
VI. EXPERIMENTS
In group conversations with multiple interaction partners
it is not possible that all participants communicate
simultaneously. Participants who are currently not communicating
are observers of the interaction, but the roles of speaker,
addressee and observer change frequently. The following
experiment shows the observation of an interaction between
two participants: one of them is explaining the rules of the
Tangram tabletop game while the robot observes the situation
without active participation.
The goal of the experiment is to demonstrate both the auditory
and visual perception as well as the fusion mechanisms.
Figure 5 depicts the extracted percepts during the time
interval from 30 to 90 seconds; the duration of the whole
experiment was about 100 seconds. In the diagram, four time
steps are marked with a to d and the corresponding percepts
are displayed. Figure 6 depicts images of the overhead
reference camera, the left body camera with an overlay of the
person tracking system, and the left eye camera during the
experiment at the marked time steps.
The output of the percept P′_audiofusion is displayed in
the second diagram of figure 5. The voice activity detection
splits the talking phases of both participants. The voice
classification uses these intervals to generate an identity
estimation at the marked time steps a, b and d. During the
experiment, the identification zero is assigned to the person
on the left side (from the robot's view) and one is assigned to
the person on the right.
The plot below the voice activity shows the root position of
the localized sound source, which corresponds to the positions
of the two talking persons. The large variation of the y_r
variable depicts the alternating speech activity. The peak
output energy increases only slightly during speech phases:
the noise generated by the movement of the motors and the
computer below the robot significantly reduces the signal-to-noise
ratio. Nevertheless, both identification and
speaker localization are possible.
The last two plots show the output of the visual tracking
system, divided into the tracking of the person on the left and
the one on the right-hand side. A close relationship between
the positions of the visual and the auditory tracking
can be seen. This geometrical correlation is used to assign the
estimated speaker identification to the closest tracked person.
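A small sketch of this geometric assignment: the estimated speaker identity is attached to whichever visually tracked person is closest to the localized sound source. The names and the distance threshold are illustrative.

```python
import numpy as np

def assign_speaker(sound_pos, tracked_persons, max_dist=0.7):
    """Attach a speaker identity to the closest tracked person.

    sound_pos:       (x, y, z) of the localized sound source
    tracked_persons: {person_id: (x, y, z)} from the visual tracker
    Returns the matching person_id, or None if nobody is close enough
    (then the robot relies on the auditory percept alone)."""
    best_id, best_d = None, max_dist
    for pid, pos in tracked_persons.items():
        d = np.linalg.norm(np.asarray(pos) - np.asarray(sound_pos))
        if d < best_d:
            best_id, best_d = pid, d
    return best_id
```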
The behavior of the robot during the experiment is solely
influenced by the sound source position: the robot steadily
focuses on the sound source with the left eye camera. It can be
seen that the left eye is always focused on the current speaker.
The visual tracking sometimes loses the interaction partner,
which forces the robot to rely on the auditory perception
until the tracking is reinitialized.
VII. CONCLUSIONS AND FUTURE WORK
The complex interaction of humans and robots requires
a model of the human interaction partner. The design as
well as the realization of such a model should be based on
psychological research to allow the transfer of inter-personal
to human-robot communication scenarios.
This paper presents psychological models of time, memory
and interaction signals and combines these into a common
user model. One representative aspect of the model, the
speaker identification, is discussed in detail. With the focus
on an interactive tabletop game the implementation and usage
of the model is presented using the control architecture of
the humanoid robot Roman.
The experiments showed that the proposed design is well
suited for complex interactive scenarios. Future developments
will add additional information to the model and
support a wider range of communication scenarios.
REFERENCES
[1] Michael Argyle. Bodily Communication. Methuen & Co. Ltd., 1975 (9th edition, 2005).
[2] S. E. Avons, R. G. Leiser, and D. J. Carr. Paralanguage and human-computer interaction. Part 1: Identification of recorded vocal segregates. Behaviour & Information Technology, 8(1):13–21, 1989.
[3] I. Cohen. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 11(5):466–475, September 2003.
[Figure 5 plots, over the time interval from 30 to 90 seconds: the extracted percepts P′_person and P′_audiofusion with the marked time steps a to d and their percept vectors, the voice activity, the position (x_r, y_r, z_r) of the localized sound source, and the perceived positions of the two visually tracked persons.]
Fig. 5: Experimental evaluation of an interactive situation with two interaction partners talking in front of the robot. The first
diagram shows the extracted percepts. The second diagram shows the position and activity of the sound localization and
speaker identification. The last two rows show the corresponding perceived positions of the visual perception system.
[4] Richard L. Doty. Olfactory communication in humans. Chemical Senses, volume 6, 1981.
[5] Hugh Dubberly and Paul Pangaro. On modeling: What is conversation, and how can we design for it? Interactions, 16(4):22–28, 2009.
[6] Starkey Duncan and Donald W. Fiske. Face-to-Face Interaction. Erlbaum, Hillsdale, NJ, 1977.
[7] Florian Faust. Multi-channel voice based speaker recognition for humanoid robots. Master's thesis, University of Kaiserslautern, 2009.
[8] Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen, Juhani Saastamoinen, Marko Tuononen, and Pasi Fränti. Maximum a posteriori adaptation of the centroid model for speaker verification. IEEE Signal Processing Letters, pages 162–165, 2008.
[9] Jochen Hirth, Norbert Schmitz, and Karsten Berns. Towards social robots: Designing an emotion-based architecture. International Journal of Social Robotics, 2011. DOI: 10.1007/s12369-010-0087-2.
[10] James G. Hollandsworth, Richard Kazelskis, Joanne Stevens, and Mary Edith Dressel. Relative contributions of verbal, articulative, and nonverbal communication to employment decisions in the job interview setting. Personnel Psychology, 32(2):359–367, 1979.
[11] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, July 2002.
[12] V. Ng-Thow-Hing, K. Thorisson, R. Sarvadevabhatla, J. Wormer, and T. List. Cognitive map architecture. IEEE Robotics & Automation Magazine, 16(1):55–66, March 2009.
(a) 31s : Identification of person on the right side is assigned
(b) 38s : Visual tracking is lost but robot still focuses sound source
(c) 52s : Visual and auditory tracking is running independently
(d) 58s : Identification can be reassigned
Fig. 6: Samples of the captured images during the dialog experiment. The left eye image shown in the rightmost column
continuously focuses on the observed sound source and therefore shows the currently talking person.
[13] S. Y. Okita, V. Ng-Thow-Hing, and R. Sarvadevabhatla. Learning together: Asimo developing an interactive learning partnership with children. In The 18th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2009), pages 1125–1130, September 2009.
[14] Vlasta Radová and Zdeněk Švenda. Speaker identification based on vector quantization. In Václav Matoušek, Pavel Mautner, Jana Ocelková, and Petr Sojka, editors, Text, Speech and Dialogue, volume 1692 of Lecture Notes in Computer Science, pages 83–83. Springer Berlin / Heidelberg, 1999.
[15] Sigurdur Sigurdsson, Kaare Brandt Petersen, and Tue Lehn-Schiøler. Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music. In Proceedings of the International Symposium on Music Information Retrieval, 2006.
[16] Kristinn R. Thórisson. Real-time decision making in multimodal face-to-face communication. In Proceedings of the Second International Conference on Autonomous Agents (AGENTS '98), pages 16–23, New York, NY, USA, 1998. ACM.
[17] Kristinn R. Thórisson. A mind model for multimodal communicative creatures & humanoids. International Journal of Applied Artificial Intelligence, 13(4):519–538, 1999.
[18] Hartmut Traunmüller. Paralinguistic phenomena. In Handbooks of Linguistics and Communication Science, pages 653–665. Walter de Gruyter, Berlin/New York, December 2004.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I-511–I-518, 2001.
[20] Bin Zhen, Xihong Wu, Zhimin Liu, and Huisheng Chi. On the importance of components of the MFCC in speech and speaker recognition. In ICSLP-2000, volume 2, pages 487–490, October 2000.