V19 WS09 GestureRecognition.ppt
Gesture Recognition
Lecture "Visual Perception for Human-Machine Interaction" (Visuelle Perzeption für Mensch-Maschine Interaktion)
Rainer Stiefelhagen, Kai Nickel
2010-01-15
Interactive Systems Laboratories, Universität Karlsruhe (TH)
Overview
Introduction
Hidden Markov Models
Applications
• American Sign Language
• Pointing Gestures
• Gestures and Speech
Miscellaneous
Computer Vision for Human-Computer Interaction Research Group, Universität Karlsruhe (TH) (cv:hci)
Definition
Gesture:
• a movement usually of the body or limbs that expresses or emphasizes an idea, sentiment, or attitude
• the use of motions of the limbs or body as a means of expression
[Merriam-Webster Online Dictionary]
Automatic Gesture Recognition
A gesture recognition system generates a semantic description for certain body motions
Gesture recognition exploits the power of non-verbal communication, which is very common in human-human interaction
Gesture recognition is often built on top of a human motion tracker
Related topics: Human Activity Analysis, Intention Recognition, Facial Expression Recognition
Some Applications
Virtual Environments
• manipulation of objects
Multimodal Rooms
• interaction with room facilities
• analysis of a user's actions
• context-aware services
Human-Robot Interaction
• communicative gestures
• programming by demonstration
Understanding Human-Human Interaction
• level of arousal
• turn-taking
Types of Gestures
Hand & arm gestures
• Pointing gestures
• Sign language
• Hello, thumbs up/down, victory sign, the finger, call-me, … (Wikipedia lists around 40 hand gestures)
Head gestures
• Nodding, head shaking, turning, pointing
Body gestures
• Don't know / shrug
Gestures are mostly culture-specific
• The pointing gesture is probably one of the exceptions
Some gestures are closely coordinated with speech
Gesture Taxonomy
[Figure: gesture taxonomy, from Pavlović97]
static vs. dynamic gestures
different phases of a gesture: preparation, peak, retraction
A Purpose Taxonomy of Gestural Interaction [Quek, ICMI05]
Manipulative Gestures
• "Put that there" [Bolt 1980]
• Navigation in virtual spaces, game control, …
• Robot control
Semaphoric Gestures
• Precisely defined, specific symbols of an alphabet
• Static: finger menu selection, hand pose classification, …
• Dynamic: crane operator signals, editing gestures for interactive whiteboard, …
Conversational Gestures
• Naturally performed in the course of human multimodal communication
Spontaneous gestures [Cassell98] (Communicative Gestures)
Iconic gestures
• Depict by the form of the gesture some feature/action/event being described
Metaphoric gestures
• Also representational, but the concept they represent has no physical form; the form of the gesture comes from a common metaphor. Example: "the meeting went on and on" + hand indicating rolling motion.
Deictics
• Spatialize, or locate in the physical space in front of the narrator, aspects of the discourse.
Beat gestures
• Small baton-like movements; do not change in form with content of speech. Pragmatic function, occurring with comments on one's own linguistic contribution, speech repairs and reported speech. Example: "she talked first, I mean second" + hand flicking down and up.
See also: McNeill, D. 1992. Hand and Mind: What gestures reveal about thought. The University of Chicago Press, Chicago IL.
Spatial Models
3D model-based
• 3D hand model. Parameters: angles, palm position
Appearance-based
• Parameters: pixel templates, image geometry parameters, image motion parameters, fingertip position & motion, ...
• Use templates for modelling gestures
[Pavlović]
Gesture Recognition
Feature acquisition:
• attached sensors
• visually: markers, color, motion, shape, …
Classifiers:
• Hidden Markov Models (HMM)
• Neural Networks (ANN)
• Finite State Machines (FSM)
• Support Vector Machines (SVM)
• …
Hidden Markov Models
Gesture recognition: temporal pattern matching
HMMs
• represent templates in a person-independent way (when trained with examples from many persons)
• provide an inherently "soft" time-alignment mechanism
• are based on a sound mathematical foundation
Introduction to HMM [Rabiner89]
Motivation
[Figure: chain of non-observable states $s_1, s_2, s_3$ with transitions $a_{12}, a_{23}$; emission distributions $b_1, b_2, b_3$ produce the observable outputs $x_1, x_2, x_3$]
The term "hidden" comes from observing observations and drawing conclusions without knowing the hidden sequence of states
Markov assumption (1st order): the next state depends only on the current state (not on the complete state history)
HMM Definition
A Hidden Markov Model is a five-tuple $(S, \pi, A, B, V)$. It consists of:
• the set of states $S = \{s_1, s_2, \ldots, s_n\}$
• the initial probability distribution $\pi$, where $\pi(s_i)$ is the probability of $s_i$ being the first state of a state sequence
• the matrix of state transition probabilities $A = (a_{ij})$, where $a_{ij}$ is the probability of state $s_j$ following $s_i$
• the set of emission probability distributions/densities $B = \{b_1, b_2, \ldots, b_n\}$, where $b_i(x)$ is the probability of observing $x$ when the system is in state $s_i$
• the observable feature space $V$, which can be discrete, $V = \{x_1, x_2, \ldots, x_v\}$, or continuous, $V = \mathbb{R}^d$
Some Properties of HMMs
• for the initial probabilities we have: $\sum_i \pi(s_i) = 1$
• often things are simplified by $\pi(s_1) = 1$ and $\pi(s_i) = 0$ for $i > 1$
• obviously: $\sum_j a_{ij} = 1$ for all $i$
• often: $a_{ij} = 0$ for most $j$ except for a few states
• when $V = \{x_1, x_2, \ldots, x_v\}$, the $b_i$ are discrete probability distributions and the HMMs are called discrete HMMs
• when $V = \mathbb{R}^d$, the $b_i$ are continuous probability density functions and the HMMs are called continuous (density) HMMs
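The five-tuple and the normalization properties can be made concrete with a small container for a discrete HMM. This is only a sketch: the class name `DiscreteHMM`, the 0-based indexing of states and symbols, and the example numbers are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscreteHMM:
    """Discrete HMM (S, pi, A, B, V); states and symbols are indexed from 0."""
    pi: List[float]        # pi[i]   = probability that state i starts the sequence
    A: List[List[float]]   # A[i][j] = probability of state j following state i
    B: List[List[float]]   # B[i][k] = probability of observing symbol k in state i

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError if a normalization constraint is violated."""
        n = len(self.pi)
        if abs(sum(self.pi) - 1.0) > tol:
            raise ValueError("initial probabilities must sum to 1")
        if len(self.A) != n or len(self.B) != n:
            raise ValueError("A and B need one row per state")
        for i in range(n):
            if len(self.A[i]) != n or abs(sum(self.A[i]) - 1.0) > tol:
                raise ValueError(f"row {i} of A must have n entries summing to 1")
            if abs(sum(self.B[i]) - 1.0) > tol:
                raise ValueError(f"row {i} of B must sum to 1")


# Hypothetical left-to-right model: always start in state 0,
# only self-loops and steps to the next state are allowed.
hmm = DiscreteHMM(
    pi=[1.0, 0.0, 0.0],
    A=[[0.6, 0.4, 0.0],
       [0.0, 0.7, 0.3],
       [0.0, 0.0, 1.0]],
    B=[[0.8, 0.2],
       [0.3, 0.7],
       [0.5, 0.5]],
)
hmm.validate()
```

The zero entries in `A` illustrate the remark that most transitions are usually forbidden.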
HMM Topologies
[Figures: left-to-right; alternative paths; ergodic]
The Observation Model
Most popular: Gaussian mixture models
• $n_j$: number of Gaussians (in state $j$)
• $c_{jk}$: mixture weight for the $k$-th Gaussian (in state $j$)
• $\mu_{jk}$: mean of the $k$-th Gaussian (in state $j$)
• $\Sigma_{jk}$: covariance matrix of the $k$-th Gaussian (in state $j$)
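The mixture formula itself did not survive in the transcript. With the symbols listed above, the usual Gaussian mixture density would read $b_j(x) = \sum_{k=1}^{n_j} c_{jk}\, \mathcal{N}(x; \mu_{jk}, \Sigma_{jk})$. The sketch below evaluates it for one state. It assumes diagonal covariance matrices to stay free of matrix inversion; that is a simplification chosen here, and all function names and numbers are illustrative.

```python
import math


def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_density(x, weights, means, variances):
    """b_j(x) = sum_k c_jk * N(x; mu_jk, Sigma_jk) for a single state j."""
    return sum(c * gaussian_diag(x, m, v)
               for c, m, v in zip(weights, means, variances))


# One state with n_j = 2 Gaussians in a 2-dimensional feature space.
weights = [0.3, 0.7]                  # c_j1, c_j2 (sum to 1)
means = [[0.0, 0.0], [2.0, 2.0]]      # mu_j1, mu_j2
variances = [[1.0, 1.0], [0.5, 0.5]]  # diagonals of Sigma_j1, Sigma_j2
b = gmm_density([0.0, 0.0], weights, means, variances)
```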
Three main problems of HMMs (that can be solved!)
Given an HMM $\lambda$ and an observation $x_1, x_2, \ldots, x_T$
The evaluation problem:
• compute the probability of the observation $p(x_1, x_2, \ldots, x_T \mid \lambda)$
The decoding problem:
• compute the most likely state sequence $s_{q_1}, s_{q_2}, \ldots, s_{q_T}$, i.e. $\arg\max_{q_1, \ldots, q_T} p(q_1, \ldots, q_T \mid x_1, x_2, \ldots, x_T, \lambda)$
The learning/optimization problem:
• find an HMM $\lambda'$ such that $p(x_1, x_2, \ldots, x_T \mid \lambda') > p(x_1, x_2, \ldots, x_T \mid \lambda)$
The Evaluation Problem
We define $\alpha_t(j)$ as the probability of observing the partial observation $x_1, x_2, \ldots, x_t$ and then being in state $s_j$:
$\alpha_t(j) = p(x_1, x_2, \ldots, x_t, q_t = j \mid \lambda)$
$p(x_1, x_2, \ldots, x_T \mid \lambda) = \sum_{j=1}^{n} \alpha_T(j)$
Recursive calculation:
$\alpha_t(j) = b_j(x_t) \cdot \sum_{i=1}^{n} a_{ij}\, \alpha_{t-1}(i)$
$\alpha_1(j) = \pi(s_j)\, b_j(x_1)$
The Forward Algorithm
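The forward recursion translates almost line by line into code. The following is a minimal sketch for a discrete HMM with symbols given as integer indices. The function name and the toy model are assumptions for illustration. A practical implementation would scale the $\alpha$ values or work in log space to avoid underflow on long sequences.

```python
def forward(pi, A, B, obs):
    """Return p(x_1..x_T | lambda) for a discrete HMM via the forward recursion."""
    n = len(pi)
    # Initialization: alpha_1(j) = pi(s_j) * b_j(x_1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    # Recursion: alpha_t(j) = b_j(x_t) * sum_i a_ij * alpha_{t-1}(i)
    for x in obs[1:]:
        alpha = [B[j][x] * sum(A[i][j] * alpha[i] for i in range(n))
                 for j in range(n)]
    # Termination: sum over all states at time T
    return sum(alpha)


# Toy two-state left-to-right model with two observation symbols (0 and 1).
pi = [1.0, 0.0]
A = [[0.5, 0.5],
     [0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
likelihood = forward(pi, A, B, [0, 1])
```

Each time step costs $n^2$ multiplications, so the whole evaluation is linear in the sequence length $T$ instead of summing over all $n^T$ state sequences.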
The Decoding Problem
Find the most likely state sequence, i.e.
$\arg\max_{q_1, \ldots, q_T} p(q_1, \ldots, q_T, x_1, \ldots, x_T \mid \lambda)$
Define $z_t(j) = \max_{q_1, \ldots, q_{t-1}} p(x_1, \ldots, x_t, q_1, \ldots, q_{t-1}, q_t = j \mid \lambda)$
$z_t(j) = \max_i \big(z_{t-1}(i)\, a_{ij}\big)\, b_j(x_t)$
$z_1(j) = \pi(s_j)\, b_j(x_1)$
To retrieve the optimal state sequence, we have to remember for every state its optimal predecessor:
$r_t(j) = \arg\max_i \big(z_{t-1}(i)\, a_{ij}\big)$
The Viterbi Algorithm
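The Viterbi recursion has the same structure as the forward recursion, with the sum replaced by a maximum and a back-pointer table for $r_t(j)$. The sketch below assumes a discrete HMM with integer symbol indices. The function name and the toy model are illustrative, and the example output was checked by hand.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, joint_probability) for a discrete HMM."""
    n = len(pi)
    # Initialization: z_1(j) = pi(s_j) * b_j(x_1)
    z = [pi[j] * B[j][obs[0]] for j in range(n)]
    back = []  # back[t-2][j] holds r_t(j), the best predecessor of state j at time t
    for x in obs[1:]:
        # r_t(j) = argmax_i z_{t-1}(i) * a_ij
        r = [max(range(n), key=lambda i: z[i] * A[i][j]) for j in range(n)]
        # z_t(j) = max_i (z_{t-1}(i) * a_ij) * b_j(x_t)
        z = [z[r[j]] * A[r[j]][j] * B[j][x] for j in range(n)]
        back.append(r)
    # Backtrack from the most probable final state.
    state = max(range(n), key=lambda j: z[j])
    best_prob = z[state]
    path = [state]
    for r in reversed(back):
        state = r[state]
        path.append(state)
    path.reverse()
    return path, best_prob


pi = [1.0, 0.0]
A = [[0.5, 0.5],
     [0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8]]
path, prob = viterbi(pi, A, B, [0, 1, 1])  # path == [0, 1, 1], prob ≈ 0.288
```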
The Learning / Optimization Problem
Train parameters of HMM
• Tune $\lambda$ to maximize $P(O \mid \lambda)$
• No efficient algorithm for global optimum
• Efficient iterative algorithm finds a local optimum
Viterbi training
• Compute Viterbi path using current model
• Re-estimate parameters using labels assigned by Viterbi
Baum-Welch (Forward-Backward) reestimation
• See [Rabiner]
Sign Language Recognition [Starner98]
American Sign Language (ASL)
• 6000 gestures describe persons, places and things
• Exact meaning and strong rules of context and grammar for each
Sign recognition
• Previous work used instrumented gloves and neural nets
• HMMs are ideal for the complex and structured hand gestures of ASL
ASL Test Lexicon
Vocabulary size: 40 words
Part of speech   Vocabulary
Pronoun          I, you, he, we, you (pl), they
Verb             want, like, lose, dontwant, dontlike, love, pack, hit, loan
Noun             box, car, book, table, paper, pants, bicycle, bottle, can, wristwatch, umbrella, coat, pencil, shoes, food, magazine, fish, mouse, pill, bowl
Adjective        red, brown, black, gray, yellow
Feature Extraction
Camera located in either a 1st-person or a 2nd-person view
Segment hand blobs by a skin color model
Extracted features: x, y, Δx, Δy, size, blob "angle", eccentricity of ellipse
HMM for American Sign Language
Four-state HMM for each word
Training:
• Automatic segmentation of sentences into five portions
• Initial estimates by iterative Viterbi alignment
• Then Baum-Welch re-estimation
• No context used
Recognition
• With and without part-of-speech grammar (pronoun, verb, noun, adjective, pronoun)
• All features / only relative features used
ASL Results: Desk-based
348 training and 94 testing sentences without contexts
Accuracy: N: #words, D: #deletions, S: #substitutions, I: #insertions
Experiment                            Training set                              Independent test set
All features                          94.1%                                     91.9%
Relative features                     89.6%                                     87.2%
All features & unrestricted grammar   81.0% (D=31, S=287, I=137, N=2390)        74.5% (D=3, S=76, I=41, N=470)
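The accuracy formula itself is missing from the transcript. The usual word-accuracy definition, $\text{Acc} = (N - D - S - I) / N$, reproduces both percentages in the unrestricted-grammar row of the desk-based results, so it is assumed in the short check below. The function name is illustrative.

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Word accuracy (N - D - S - I) / N, as assumed from the legend N, D, S, I."""
    return (n_words - deletions - substitutions - insertions) / n_words


# Desk-based system, all features & unrestricted grammar.
train_acc = word_accuracy(2390, 31, 287, 137)  # rounds to 81.0 %
test_acc = word_accuracy(470, 3, 76, 41)       # rounds to 74.5 %
```

Insertions count as errors, so the accuracy can in principle become negative when the recognizer outputs many spurious words.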
ASL Results: Wearable-based
400 training sentences and 100 for testing
Test 5-word sentences
Restricted and unrestricted similar!
Experiment         Training set                           Independent test set
Part-of-speech     99.3%                                  97.8%
5-word sentence    98.2%                                  97.8%
unrestricted       96.4% (D=24, S=32, I=45, N=2500)       96.8% (D=4, S=6, I=6, N=500)
Pointing Gesture Recognition
Pointing gestures
• are used to specify objects and locations
• can be necessary to resolve ambiguities in verbal statements: "Put that there!"
Here:
• Pointing gesture = movement of the arm towards a pointing target
Tasks:
• Detect occurrence of human pointing gestures in natural arm movements
• Extract the 3D pointing direction
Application: Multimodal Dialogue for Human-Robot Interaction
[Images: (2004), (ca. 2005), (2006); Video]
Pointing Gesture Phases
Looking at persons performing pointing gestures, one can identify 3 phases in the movement of the pointing hand:
                   μ [sec]   σ [sec]
Complete gesture   1.75      0.48
Begin              0.52      0.17
Hold               0.76      0.40
End                0.47      0.12
Average duration of pointing gesture phases
For estimating the pointing direction, it is crucial to detect the hold phase precisely!
Modeling the Gesture
Separate models for each of the 3 phases:
• continuous HMMs with 2 Gaussians per state
"Garbage model" acts as a reference for the phase models' output
Models trained on hand-labeled sequences using the Baum-Welch reestimation equations [Rabiner89].
Detection
[Figure: output of the phase models during a sequence of 2 pointing gestures (subtracted by $P_0$)]
The likelihood $P_0$ of the garbage model $M_0$ is subtracted from the gesture model likelihoods ($P_1$, $P_2$, $P_3$)
$P_0$ thus serves as a threshold
Detect a pointing gesture whenever there are 3 points in time $t_B < t_H < t_E$, so that
• $P_B(t_B) > P_E(t_B)$ and $P_B(t_B) > 0$
• $P_H(t_H) > 0$
• $P_E(t_E) > P_B(t_E)$ and $P_E(t_E) > 0$
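The detection rule can be checked with a single pass over the frame-wise model outputs. The sketch below assumes that the three inputs are equally long lists of begin, hold and end scores per frame, with the garbage-model score $P_0$ already subtracted. The function name and the example values are illustrative, not taken from the lecture.

```python
def detect_pointing(p_begin, p_hold, p_end):
    """Return (t_B, t_H, t_E) of the first detected gesture, or None.

    Inputs are per-frame scores of the begin, hold and end models,
    each already reduced by the garbage-model score P0.
    """
    t_b = t_h = None
    for t in range(len(p_begin)):
        if t_b is None:
            # Begin phase: P_B(t_B) > P_E(t_B) and P_B(t_B) > 0
            if p_begin[t] > p_end[t] and p_begin[t] > 0:
                t_b = t
        elif t_h is None:
            # Hold phase at a strictly later frame: P_H(t_H) > 0
            if p_hold[t] > 0:
                t_h = t
        elif p_end[t] > p_begin[t] and p_end[t] > 0:
            # End phase at a strictly later frame: P_E(t_E) > P_B(t_E) and P_E(t_E) > 0
            return t_b, t_h, t
    return None


p_begin = [-1.0, 2.0, 0.5, -0.5, -1.0]
p_hold = [-1.0, -0.5, 1.5, 1.0, -0.5]
p_end = [-1.0, -1.0, -0.5, 0.5, 2.0]
result = detect_pointing(p_begin, p_hold, p_end)  # (1, 2, 3)
```

Taking the earliest admissible frame for each phase is sufficient to decide whether any valid triple exists. Picking the best hold frame within a detected gesture would be a separate step.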
Features
Hand position: $(r, \Delta\theta, \Delta y)$
• Coordinates of the hand in a cylindrical head-centered coordinate system
+
Head orientation: $(|\theta_{Head} - \theta_{Hand}|, |\Phi_{Head} - \Phi_{Hand}|)$
• Difference between head's and hand's azimuth/elevation angle
• 0 if head and hand are "in-line"
• Measured with a magnetic sensor, or visually estimated (ANN)
[see also Campbell, 96]
Pointing Direction
Head-hand line
• Line of sight between head and hand
• provided by the tracker
Forearm orientation
• PCA of the point cloud around the hand gives the forearm orientation
Head orientation
• Visually estimated (ANNs)
• (Magnetic sensor)
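As an illustration of the head-hand line, the sketch below computes the pointing direction from 3D head and hand positions and selects the target with the smallest angular deviation. The lecture does not specify how the error angle is measured or how a target is chosen. Measuring the angle at the hand, the function names and the coordinates are all assumptions made for this example.

```python
import math


def head_hand_direction(head, hand):
    """Unit vector from the head position through the hand position."""
    d = [b - a for a, b in zip(head, hand)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def angle_deg(u, v):
    """Angle between two 3D vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding errors
    return math.degrees(math.acos(cos))


def nearest_target(hand, direction, targets):
    """Index of the target with the smallest angle to the pointing ray (assumed rule)."""
    def deviation(target):
        to_target = [t - h for t, h in zip(target, hand)]
        return angle_deg(direction, to_target)
    return min(range(len(targets)), key=lambda k: deviation(targets[k]))


head = [0.0, 0.0, 0.0]
hand = [1.0, 0.0, 0.0]
targets = [[3.0, 0.0, 0.1], [1.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
direction = head_hand_direction(head, hand)
chosen = nearest_target(hand, direction, targets)  # index 0
```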
Experiments and Results
Scenario: Household Robot
118 gestures, 8 targets
[Figure: 3D plot with targets numbered 1 to 8; axes x [m], y [m], z [m]]
Video
• Head and hand tracking will be discussed later (Tracking II)
(Nickel & Stiefelhagen, 2004)
Results Pointing Direction
Accuracy of pointing direction estimation on manually labeled hold-phases
                   Error angle   Targets identified   Availability
Head-hand line     25°           90%                  98%
Forearm line       39°           73%                  78%
Head orientation   22°           75%                  (100%)
Results in Gesture Recognition
Performance of the automatic gesture recognition and pointing direction estimation:
                                      Recall   Precision   Error angle*
No head orientation                   79.8%    73.6%       19.4°
Visually estimated head orientation   78.3%    87.1%       16.9°
True (sensor) head orientation        78.3%    86.3%       16.8°
*Head-hand line
Multimodal Human-Robot Interaction
Multimodal Interaction in a Household Scenario
Example dialog: "Take the cup!" / "Which cup do you want me to take?" / "This one!"
Person Tracking & Gesture Recognition
Speech Recognition
Multimodal Dialog Manager
• Initiates clarification dialog for missing parameters
Speech Synthesis
Integrated on Mobile Robot
• able to find & follow person
SFB 588 Humanoide Roboter - Lernende und kooperierende multimodale Roboter
Further application:
Pointing gesture recognition in a smart room
Goals
Project: "Visual Perception for HMI: Interaction in and with Attentive Rooms" ("Visuelle Perzeption für die MMI – Interaktion in und mit aufmerksamen Räumen")
• Project at Fraunhofer IOSB (formerly IITB)
• Start: October 2007, duration 5 years, 5 staff members
Project goal:
• Machine vision to support human interaction in and with "attentive" rooms ("smart rooms") and large-area display environments
• First application target: "Smart Control Room" for crisis response and crisis management
Target functionalities of the room:
• Capture of all persons: position, identity, gaze direction and attention, movements
• Recognition of which information has been perceived
• Support for personalized work environments
• Support for multimodal interaction with large displays
The Smart Room: Sensors & Devices
Video wall:
• 8 rear-projection cubes (4096x1536 px)
• Can be driven as a single display
Sensors
• 4 cameras in the room corners
• 1 wide-angle camera on the ceiling
• 2 (4) cameras above the video wall
• 2 active (pan-tilt-zoom) cameras
• Microphones
Loudspeakers
Further smart room at KIT
• Similar setup, no video wall
• Microphone arrays for localization
Smart Room: Perception & Functionality
Perception components (all real-time capable)
• 3D person tracking and computation of the visual hull
• Recognition of pointing gestures and touches
• Recognition of head rotations and attention
• Identification by face recognition
• Speech recognition (cooperation with KIT)
• Situation assessment
First functionalities
• Personalized and localized workspaces and messages
• Multimodal interaction with the video wall (speech, gestures, touch)
• Recognition of attention to the video wall
[Image labels: Florian, Michael]
Video
Perception Components
3D multi-person tracking
• Localization is a fundamental component for all further analyses
• Features: head and upper-body detectors, foreground, color; audio: time delay of arrival (TDOA), generalized cross-correlation
• Particle filter for tracking and fusion
[Image: Multi-person tracking in a smart room (4-5 cameras)]
Identification
• Necessary for personalization
• Local appearance-based approach based on DCT (Research Prize 2008 of the Europ. Biometric Forum)
• Can be combined with speaker recognition and active cameras
• Joint tracking and identification brings advantages
[Image: Combined tracking and identification]
Perception Components (2)
Extraction of the visual hull:
• Provides 3D representations of the persons
• Helpful for recognizing body posture and for pointing gestures
• Possible in real time with a voxel-carving approach
[Image: Computation of the visual hulls]
Recognition of head rotations
• Provides important information about attention
• Solved by video-based estimation of head rotations (neural networks)
• Fusion over all results from the camera views (particle filter)
• Mapping of head rotations to likely targets, e.g. other persons, particular objects, regions of the video wall
[Image: Recognition of head rotations]
Interaction with Large Video Walls
Users expect touch recognition!
Touch alone is not sufficient, however
• Some objects are out of reach (distance / height)
• Extension: use of pointing gestures
Advantages of the video-based solution
• Existing devices and surfaces can be extended
• Scales well in size
• Cameras are inexpensive
• Seamless transition between touch and pointing gestures is possible
Technical Realization
3D extraction of the visual hull
• with a voxel-carving approach
Detection and tracking of the arm
• Detection of a cylindrical object in front of the video wall
• Tracking by pairwise matching
• Interaction point is computed by 3D reconstruction
Recognition of the interaction (touch / gesture)
• Modeled as a binary interaction modality
• State is determined by the hand movement
• Seamless transition between touch and pointing
[Figure labels: Approach, Hold, Movement, Hold, Retract]
3D Arm Reconstruction and Tracking
Evaluation of the Components
Runtime: < 20 ms / frame
• 3 GHz Core 2 Duo, NVIDIA GTX280
Component                 Duration [ms]
Foreground segmentation   6.50
Voxel carving             8.26
Touch and pointing        1.89
Accuracy
• How accurately does the touch / pointing detection hit the point targeted by the user?
Modality   Mean error [cm]   Standard deviation [cm]
Touch      9.49              6.56
Pointing   16.61             9.16
[Figure: Distribution of the errors on the video wall (touch / pointing combined)]
(As of September 09; considerably more accurate by now)
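The per-frame budget can be checked from the component timings. The short calculation below assumes that the three listed components run one after another and account for the whole frame time. The slide does not state this explicitly, so the resulting frame rate is only an upper-bound estimate.

```python
# Component durations in milliseconds, taken from the runtime table.
durations_ms = {
    "foreground segmentation": 6.50,
    "voxel carving": 8.26,
    "touch and pointing": 1.89,
}

total_ms = sum(durations_ms.values())  # about 16.65 ms per frame
max_fps = 1000.0 / total_ms            # about 60 frames per second

print(f"total: {total_ms:.2f} ms, at most {max_fps:.1f} frames/s")
```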
User Study
Task: Move the colored blocks to the target as quickly as possible
Two conditions: "Touch-only" vs. "Touch-and-pointing"
Participants: 15 men, 4 women
User Study: Results
Questionnaires with a five-point Likert scale
• Wilcoxon rank sum test for the analysis of statistical significance
Hypothesis 1: "touch + pointing" is more intuitive and easier to use
• No, both rated equally well (p=0.14)
Hypothesis 2: All regions of the video wall are easier to reach via "touch + pointing"
• Yes! (p<<0.01)
Hypothesis 3: "Pointing gestures were a helpful addition"
• Yes! (μ>4 with p<<0.01)
[Figures: histograms of # participants per rating, Touch vs. Touch + Pointing]
User Interfaces for Touch- and Pointing-Based Interaction
Design of graphical user interfaces for touch- and pointing-based interaction
Tests and comparisons of different interaction methods
• 40 participants, different setups
• Quantitative measurements: task completion time, error rates
• Qualitative measurements: questionnaires
[Images: Experimental setup; Scale-rotate; Move; Multiselect; Move-scale-rotate; Find optimal object size; Photo application]
Progress: Multi-Touch
[Image: Multitouch]
Multimodal Interaction
Application:
• Situation room for the fire brigade
• Adding tactical symbols to the situation map
Method:
• Voice command to select the symbol
• Touch/pointing for positioning
Planned:
• More, and more complex, voice commands
• Better speech detection
• Manipulation of symbols
• Multimodal dialog
Related Topics
Human motion analysis
• Gait: classification and person identification
• Classification of motion figures in dance, sport, …
Facial expression [Yam02]
Synthesizing gesture and facial expressions
• Talking heads
• Avatars
[Image: Efros et al., Recognizing Action at a Distance, ICCV'03]
Summary
Gesture Taxonomy:
• Manipulative - semaphoric - conversational
• Static vs. dynamic
Hidden Markov Models
• "The 3 problems"
• Gaussian mixture models
Examples
• American Sign Language recognition
• Pointing gesture recognition
• Interaction with a video wall, Speech & Gesture
References
* [Pavlović97] V.I. Pavlović, R. Sharma, T.S. Huang: Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.
* [Rabiner89] Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77 (2), 257–286, 1989.
* [Starner98] T. Starner, J. Weaver, A. Pentland: Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371-1375, 1998.
* [Nickel04] Nickel, K., Stiefelhagen, R.: 3D-Tracking of Heads and Hands for Pointing Gesture Recognition in a Human-Robot Interaction Scenario. Sixth Int. Conf. on Face and Gesture Recognition, May 2004, Seoul, Korea.
Additional:
[Becker97] Becker, D.A.: Sensei: A Real-Time Recognition, Feedback and Training System for T'ai Chi Gestures. M.I.T. Media Lab Perceptual Computing Group Technical Report No. 426, 1997.
[Cassell98] Cassell, J.: A Framework For Gesture Generation And Interpretation. In Cipolla, R. and Pentland, A. (eds.), Computer Vision in Human-Machine Interaction, pp. 191-215. New York: Cambridge University Press. 1998.
[Poddar98] Poddar, I., Sethi, Y., Ozyildiz, E. and Sharma, R.: Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration. Proc. Workshops on Perceptual User Interfaces, pages 1-6, November 1998.
[Xiong03] Y. Xiong, F. Quek, D. McNeill: Hand Motion Gestural Oscillations and Multimodal Discourse. ICMI 2003.