V19 WS09 GestureRecognition.ppt [Kompatibilitätsmodus]€¦ · Example: "she talked first, I mean...

Gesture RecognitionVorlesung

„Visuelle Perzeption für Mensch-Maschine Interaktion“

Rainer StiefelhagenKai Nickel

2010-01-15

Interactive Systems Laboratories, Universität Karlsruhe (TH)1

Overviewer

actio

n

H) Introduction

ompu

ter I

nte

Kar

lsru

he (T

H IntroductionHidden Markov ModelsA li ti

or H

uman

-C

Uni

vers

itätK Applications

American Sign Language

uter

Vis

ion

fo

rch

Gro

up, U Pointing Gestures

Gestures and Speech

Com

pu

Res

ear

ci

Miscellaneous

cv:h

c

2

Definitioner

actio

n

H) Gesture:

ompu

ter I

nte

Kar

lsru

he (T

H Gesture:a movement usually of the body or limbs that expresses or emphasizes an idea sentiment or

or H

uman

-C

Uni

vers

itätK expresses or emphasizes an idea, sentiment, or

attitudeth f ti f th li b b d

uter

Vis

ion

fo

rch

Gro

up, U the use of motions of the limbs or body as a means

of expression

Com

pu

Res

ear

ci

[Merriam-Webster Online Dictionary]

cv:h

c

3

Automatic Gesture Recognitioner

actio

n

H)

g

A gesture recognition system generates a semantic

ompu

ter I

nte

Kar

lsru

he (T

H g g y gdescription for certain body motionsGesture recognition exploits the power of non-verbal

or H

uman

-C

Uni

vers

itätK communication, which is very common in human-human

interaction

uter

Vis

ion

fo

rch

Gro

up, U Gesture recognition is often built on top of a human motion

trackerR l t d t i H A ti it A l i I t ti

Com

pu

Res

ear

ci

Related topics: Human Activity Analysis, Intention Recognition, Facial Expression Recognition

cv:h

c

4

Some Applicationser

actio

n

H)

ppom

pute

r Int

e

Kar

lsru

he (T

H Virtual Environmentsmanipulation of objects

M lti d l R

or H

uman

-C

Uni

vers

itätK Multimodal Rooms

interaction with room facilitiesanalysis of a user‘s actions

context aware services

uter

Vis

ion

fo

rch

Gro

up, U context aware services

Human-Robot Interactioncommunicative gestures

Com

pu

Res

ear

ci

gprogramming by demonstration

Understanding Human-Human Interaction

cv:h

c level of arousalturn-taking

5

Types of Gestureser

actio

n

H)

ypom

pute

r Int

e

Kar

lsru

he (T

H

Hand & arm gesturesPointing GesturesSi L

or H

uman

-C

Uni

vers

itätK Sign Language

Hello, Thumbs up/down, victory sign, the finger, call-me, … (Wikipedia lists around 40 hand gestures)

H d

uter

Vis

ion

fo

rch

Gro

up, U Head gestures

Nodding, head shaking, turning, pointingBody gestures

Com

pu

Res

ear

ci

Don’t know / shrug

Gestures are mostly culture-specific

cv:h

c Gestures are mostly culture specificPointing gesture probably one of the exceptions

Some gestures closely coordinated with speech

6

Gesture Taxonomyer

actio

n

H)

yom

pute

r Int

e

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

[P l ić97]

Com

pu

Res

ear

ci

[Pavlović97]

cv:h

c

static vs. dynamic gesturesdifferent phases of a gesture: preparation, peak, retraction

7

A Purpose Taxonomy of Gestural Interaction [Q k ICMI05]

erac

tion

H)

[Quek, ICMI05]

Manipulative Gestures

ompu

ter I

nte

Kar

lsru

he (T

H “Put that there” [Bolt 1980]Navigation in virtual spaces, game control, …Robot control

or H

uman

-C

Uni

vers

itätK

Semaphoric GesturesPrecisely defined specific symbols of an alphabet

uter

Vis

ion

fo

rch

Gro

up, U

Precisely defined, specific symbols of an alphabetStatic: Finger menu selection, hand pose classification, …Dynamic: Crane operator signals, editing gestures for interactive whiteboard, …

Com

pu

Res

ear

ci

whiteboard, …

Conversational GesturesNaturally performed in the course of human multimodal communication

cv:h

c Naturally performed in the course of human multimodal communication

8

Spontaneous gestures [Cassell98]

(C i i G )er

actio

n

H)

(Communicative Gestures)om

pute

r Int

e

Kar

lsru

he (T

H

Iconic gestures Depict by the form of the gesture some feature/action/event being described

or H

uman

-C

Uni

vers

itätK

Metaphoric gestures Also representational, but the concept they represent has no physical form; form of the gesture comes from a common metaphor Example: "the meeting

uter

Vis

ion

fo

rch

Gro

up, U from a common metaphor. Example: "the meeting

went on and on" + hand indicating rolling motion.

DeicticsSpatialize or locate in the physical space in front of

Com

pu

Res

ear

ci

Spatialize, or locate in the physical space in front of the narrator, aspects of the discourse.

Beat gesturesSmall baton like movements; do not change in form

cv:h

c Small baton like movements; do not change in form with content of speech. Pragmatic function, occurring with comments on one’s own linguistic contribution, speech repairs and reported speech. Example: "she talked first, I mean second“ + hand

See also:McNeill, D. 1992. Hand and Mind: What gestures reveal about thought. The University of Chicago Press, Chicago IL.

9

flicking down and up.Chicago IL.

Spatial Modelser

actio

n

H)

p

3D model-based

ompu

ter I

nte

Kar

lsru

he (T

H

3D hand model. Parameters: angels, palm position

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U Appearance-based

Parameters: Pi l t l t i t t I ti

Com

pu

Res

ear

ci

Pixel templates, image geometry parameters, Image motion parameters, Fingertip position & motion...Use templates for modelling gestures

[Pavlović]

cv:h

c

10

Gesture Recognitioner

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Feature acquisition:• attached sensors• visually

or H

uman

-C

Uni

vers

itätK

visually• markers• color• motion• shape

uter

Vis

ion

fo

rch

Gro

up, U • shape

• …

Classifiers: Hidd M k M d l (HMM)

Com

pu

Res

ear

ci

• Hidden Markov Models (HMM)• Neural Networks (ANN)• Finite State Machines (FSM)• Support Vector Machines (SVM)

cv:h

c • Support Vector Machines (SVM)• …

12

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

13

Hidden Markov Modelser

actio

n

H) Gesture recognition temporal pattern matching

ompu

ter I

nte

Kar

lsru

he (T

H g p p g

HMMs

or H

uman

-C

Uni

vers

itätK

represent templates in a person-independent way (when trained with examples from many persons)

uter

Vis

ion

fo

rch

Gro

up, U provide an inherently “soft” time-alignment mechanism

are based on a sound mathematical foundation

Com

pu

Res

ear

ci

Introduction to HMM [Rabiner89]

cv:h

c

14

Motivationer

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

x1 x2 x3

observable

or H

uman

-C

Uni

vers

itätK

b1 b2 b3 non-observable

uter

Vis

ion

fo

rch

Gro

up, U s1 a12

s2 s3a23

Com

pu

Res

ear

ci

the term "hidden" comes from observing observations and drawing conclusions without knowing the hidden sequence of states

cv:h

c Markov assumption (1st order): the next state depends only on the current state (not on the complete state history)

15

HMM Definitioner

actio

n

H) A Hidden Markov Model is a five-tuple (S,π,A,B,V).

ompu

ter I

nte

Kar

lsru

he (T

H

It consists of:the set of States S = {s1,s2,...,sn} h i iti l b bilit di ib i

or H

uman

-C

Uni

vers

itätK the initial probability distribution π

π(si) = probability of si being the first state of a state sequence the matrix of state transition probabilities A

uter

Vis

ion

fo

rch

Gro

up, U A=(aij) where aij is the probability of state sj following si

the set of emission probability distributions/densities B B={b1,b2,...,bn} where bi(x) is the probability of observing x

Com

pu

Res

ear

ci

n iwhen the system is in state sithe observable feature space V can be discrete V={x1,x2,...,xv}, or continuous V=Rd

cv:h

c { 1, 2, , v},

16

Some Properties of HMMser

actio

n

H)

p

for the initial probabilities we have: Σi π(si) = 1

ompu

ter I

nte

Kar

lsru

he (T

H p i ( i)often things are simplified by π(s1)=1, and π(si>1) = 0obviously: Σj aij = 1 for all i

or H

uman

-C

Uni

vers

itätK

y j ij

often: aij = 0 for most j except for a few stateswhen V = {x1,x2,...,xv} then bi are discrete probability

uter

Vis

ion

fo

rch

Gro

up, U

when V {x1,x2,...,xv} then bi are discrete probability distributions, the HMMs are called discrete HMMswhen V = Rd then bi are continuous probability density

Com

pu

Res

ear

ci

i p y yfunctions, the HMMs are called continuous (density) HMMs

cv:h

c

17

HMM Topologieser

actio

n

H)p g

ompu

ter I

nte

Kar

lsru

he (T

H

l ft t i ht

or H

uman

-C

Uni

vers

itätK left-to-right

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

ci

alternative paths ergodic

cv:h

c paths

18

The Observation Modeler

actio

n

H) Most popular: Gaussian mixture models

ompu

ter I

nte

Kar

lsru

he (T

H Most popular: Gaussian mixture models

or H

uman

-C

Uni

vers

itätK

nj: number of Gaussians (in state j)

uter

Vis

ion

fo

rch

Gro

up, U

j ( j)cjk: mixture weight for k-th Gaussian (in state j)μjk: means of k-th Gaussian (in state j)Σ i i f k h G i (i j)

Com

pu

Res

ear

ci

Σjk: covariane matrix of k-th Gaussian (in state j)

cv:h

c

19

Three main problems of HMMs (Th t b l d!)

erac

tion

H)

(That can be solved!)om

pute

r Int

e

Kar

lsru

he (T

H

Given an HMM λ and an observation x1,x2,...,xT

or H

uman

-C

Uni

vers

itätK The evaluation problem:

compute the probability of the observation p(x x x | λ)

uter

Vis

ion

fo

rch

Gro

up, U p(x1,x2,...,xT | λ)

The decoding problem: h lik l

Com

pu

Res

ear

ci

compute the most likely state sequence sq1,sq2,...,sqT, i.e. argmaxq1,..,qT p(q1,..,qT| x1,x2,...,xT, λ)

cv:h

c

The learning/optimization problem: find an HMM λ‘ such that p(x x x | λ‘ ) > p(x x x | λ)

20

p(x1,x2,...,xT | λ ) > p(x1,x2,...,xT | λ)

The Evaluation Problemer

actio

n

H) We define αt( j) as the probability of observing the partial

ompu

ter I

nte

Kar

lsru

he (T

H t( j) p y g pobservation x1,x2,...,xt and then being in state sj: αt( j) = p(x1,x2,...,xt, qt=j | λ )

( | λ) ( )

or H

uman

-C

Uni

vers

itätK p(x1,x2,...,xT | λ) = Σj=1..n αT( j)

uter

Vis

ion

fo

rch

Gro

up, U Recursive calculation:

αt( j) = bj(xt) · Σi=1..n aij αt-1(i)( j) = π(s ) b (x )

Com

pu

Res

ear

ci

α1( j) = π(sj) bj(x1)

cv:h

c

The Forward Algorithm

21

The Decoding Problemer

actio

n

H)g

Find the most likely state sequence, i.e.( | λ )

ompu

ter I

nte

Kar

lsru

he (T

H argmaxq1,..,qT p(q1,..,qT, x1,...,xT | λ )

or H

uman

-C

Uni

vers

itätK Define zt( j) = max p(x1,...,xt, qt=j | λ )

z (j) = max (z (i) a ) b (x )

uter

Vis

ion

fo

rch

Gro

up, U zt(j) = maxi(zt-1(i) aij) bj(xt)

z1(j) = π(sj) bj(x1)

T t i th ti l t t

Com

pu

Res

ear

ci

To retrieve the optimal state sequence,we have to remember for every state its optimal predecessorrt( j) = argmaxi(zt-1(i) aij)

cv:h

c

22The Viterbi Algorithm

The Learning / Optimization Problemer

actio

n

H)

g p

Train parameters of HMM

ompu

ter I

nte

Kar

lsru

he (T

H pTune λ to maximize P(O | λ)No efficient algorithm for global optimum

or H

uman

-C

Uni

vers

itätK Efficient iterative algorithm finds a local optimum

Viterbi-TrainingC Vi bi P h i C M d l

uter

Vis

ion

fo

rch

Gro

up, U Compute Viterbi-Path using Current Model

Reestimate Parameters Using Labels Assigned by Viterbi

Baum Welch (Forward Backward) reestimation

Com

pu

Res

ear

ci

Baum-Welch (Forward-Backward) reestimationSee [Rabiner]

cv:h

c

23

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

24

Sign Language Recognition [Starner98]er

actio

n

H)

g g g g [ ]om

pute

r Int

e

Kar

lsru

he (T

H

American Sign Language (ASL)

or H

uman

-C

Uni

vers

itätK 6000 gesture describe persons,

places and thingsExact meaning and strong rules

f d f h

uter

Vis

ion

fo

rch

Gro

up, U of context and grammar for each

Com

pu

Res

ear

ci

Sign recognitionPrevious with instrumental gloves and neural nets

cv:h

c Previous with instrumental gloves and neural netsHMM ideal for complex and structured hand gestures of ASL

25

ASL Test Lexiconer

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

Vocabulary size: 40 words

P t f h V b l

or H

uman

-C

Uni

vers

itätK Part of speech Vocabulary

Pronoun I, you, he, we, you (pl), they

Verb want like lose dontwant dontlike love pack hit loan

uter

Vis

ion

fo

rch

Gro

up, U Verb want, like, lose, dontwant, dontlike, love, pack, hit, loan

Noun box, car, book, table, paper, pants, bicycle, bottle, can,

Com

pu

Res

ear

ci

wristwatch, umbrella, coat, pencil, shoes, food, magazine, fish, mouse, pill, bowl

Adjective red, brown, black, gray, yellow

cv:h

c Adjective red, brown, black, gray, yellow

26

Feature Extractioner

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

Camera either located as a 1st-person and a 2nd-person viewSegment hand blobs by a skin color model

or H

uman

-C

Uni

vers

itätK Extracted Features: x, y, Δx, Δy, size, blob ”angle”, eccentricity

of ellipse

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

27

HMM for American Sign Languageer

actio

n

H)

g g g

Four-State HMM for each word

ompu

ter I

nte

Kar

lsru

he (T

H

Training:Automatic segmentation of sentences in five portions

or H

uman

-C

Uni

vers

itätK

Automatic segmentation of sentences in five portions Initial estimates by iterative Viterbi-alignmentThen Baum-Welch re-estimationNo conte t sed

uter

Vis

ion

fo

rch

Gro

up, U No context used

RecognitionWi h d i h f h ( b dj i

Com

pu

Res

ear

ci

With and without part-of-speech grammar (pronoun, verb, noun, adjective, pronoun)All features / only relative features used

cv:h

c

28

ASL Results: Desk-baseder

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H 348 training and 94 testing sentences without contexts

AN: #Words

or H

uman

-C

Uni

vers

itätK Accuracy: D: #Deletions

S: #SubstitutionsI: #Insertions

uter

Vis

ion

fo

rch

Gro

up, U

Experiment Training set Independent test set

Com

pu

Res

ear

ci

All features 94.1% 91.9%

Relative features 89.6% 87.2%

cv:h

c

All features & unrestricted grammar

81.0% (D=31, S=287, I=137,

N=2390)

74.5%(D=3, S=76, I=41,

N=470)

29

ASL Results: Wearable-baseder

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

400 training sentences and 100 for testingTest 5-word sentences

or H

uman

-C

Uni

vers

itätK

Test 5 word sentencesRestricted and unrestricted similar!

uter

Vis

ion

fo

rch

Gro

up, U

Experiment Training set Independent test set

Part of speech 99 3% 97 8%

Com

pu

Res

ear

ci

Part-of-speech 99.3% 97.8%

5-word sentence 98.2% 97.8%

unrestricted 96 4% 96 8%

cv:h

c unrestricted 96.4%(D=24, S=32, I=45,

N=2500)

96.8% (D=4, S=6, I=6, N=500)

30

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

31

Pointing Gesture Recognitioner

actio

n

H)

g gom

pute

r Int

e

Kar

lsru

he (T

H

Pointing gesturesare used to specify objects and locations

or H

uman

-C

Uni

vers

itätK

p y jcan be needful to resolve ambiguities in verbal statements: „Put that there!“

uter

Vis

ion

fo

rch

Gro

up, U Here:

Pointing gesture = movement of the arm towards a pointing target

Com

pu

Res

ear

ci

Tasks:Detect occurrence of human pointing gestures in natural arm movementsExtract the 3D pointing direction

cv:h

c p g

32

Application: Multimodal Dialogue for H R b t I t ti

erac

tion

H)

Human-Robot Interactionom

pute

r Int

e

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

(ca. 2005)

Com

pu

Res

ear

cicv

:hc

33(2004)

(2006)Video

Pointing Gesture Phaseser

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Looking at persons performing pointing gestures, one can identify 3 phases in the movement of the pointing hand:

or H

uman

-C

Uni

vers

itätK

μ [sec] σ [sec]

Complete gesture 1.75 0.48

B i 0 52 0 17

uter

Vis

ion

fo

rch

Gro

up, U Begin 0.52 0.17

Hold 0.76 0.40

End 0.47 0.12Average duration of pointing gesture phases

Com

pu

Res

ear

ci

For estimating the pointing direction, it is crucial to detect the hold phase precisely!

cv:h

c phase precisely!

34

Modeling the Gestureer

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Separate models for each of the 3 phases:

or H

uman

-C

Uni

vers

itätK

continuous HMMs with 2

uter

Vis

ion

fo

rch

Gro

up, U Gaussians per state

Com

pu

Res

ear

ci

„Garbage model“ acts as a reference for the phase models‘ output

cv:h

c Models trained on hand-labeled sequences using the Baum-Welch reestimation equations [Rabiner89].

35

Detectioner

actio

n

H) Output of the phase

ompu

ter I

nte

Kar

lsru

he (T

H Output of the phase models during a sequence of 2 pointing gestures (subtracted by P0)

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

the likelihood P0 of the garbage model M0 is subtracted from the gesture

Com

pu

Res

ear

ci

model likelihoods (P1,P2 ,P3)P0 thus serves as a threshold

cv:h

c

Detect a pointing gesture, whenever there are 3 points in time tB < tH < tE, so that

PE(tE) > PB(tE) and PE(tE) > 0

37

E( E) B( E) E( E)PB(tB) > PE(tB) and PB(tB) > 0 PH(tH) > 0

Featureser

actio

n

H)

Hand position Head orientation

ompu

ter I

nte

Kar

lsru

he (T

H

( r, Δθ, Δy ) ( |θHead-θHand|, |ΦHead-ΦHand| )+

or H

uman

-C

Uni

vers

itätK Coordinates of the hand in

a cylindrical head-centered coordinate system

Difference between head’s and hand’s azimuth/elevation angle

0 if h d d h d

uter

Vis

ion

fo

rch

Gro

up, U 0, if head and hand are

“in-line“

M d ith

Com

pu

Res

ear

ci

Measured with a magnetic sensor

cv:h

c

[see also Campbell 96]

Visually Estimated (ANN)

38

[see also Campbell, 96]

Pointing Directioner

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Head-hand lineLine of sight between head and hand

provided by the tracker

or H

uman

-C

Uni

vers

itätK provided by the tracker

Forearm orientation

uter

Vis

ion

fo

rch

Gro

up, U PCA of the point cloud around the

hand forearm orientation

Com

pu

Res

ear

ci

Head orientationVisually estimated (ANNs)(M i )

cv:h

c (Magnetic sensor)

39

Experiments and Resultser

actio

n

H)

p

118

ompu

ter I

nte

Kar

lsru

he (T

H

Scenario:Household Robot

118 gestures8 targets

or H

uman

-C

Uni

vers

itätK Household Robot

uter

Vis

ion

fo

rch

Gro

up, U 1

23

5

8

4

3

y [m]

Com

pu

Res

ear

ci

45

6

7

1

2

3

4

z [m]1

2

cv:h

c

-2-1

01

2x [m]

-1

00

40

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

Video

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

• Head and hand tracking will be discussed later ( Tracking II)

41

(Nickel & Stiefelhagen, 2004)

Results Pointing Directioner

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Accuracy of pointing direction estimation on manually labeled hold-phases

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

ci

Error angle

Targets identified Availability

Head-hand line 25° 90% 98%

cv:h

c

Forearm line 39° 73% 78%

Head orientation 22° 75% (100%)

42

Results in Gesture Recognitioner

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

Performance of the automatic gesture recognition and pointing direction estimation:

or H

uman

-C

Uni

vers

itätK

Recall Precision Error angle*

uter

Vis

ion

fo

rch

Gro

up, U

No Head Orientation 79.8% 73.6% 19.4°

Vi ll ti t d

Com

pu

Res

ear

ci

Visually estimatedHead-Orientation 78.3% 87.1% 16.9°

True (Sensor)78.3% 86.3% 16.8°

cv:h

c Head Orientation78.3% 86.3% 16.8

*Head-hand line

43

Multimodal Human-Robot Interactioner

actio

n

H)

Take the cup!

ompu

ter I

nte

Kar

lsru

he (T

H

Multimodal Interaction in a H h ld S i

“Which cup do you want me to take?”

Thi !

or H

uman

-C

Uni

vers

itätK Household Scenario

Person Tracking & Gesture Recognition

This one!

uter

Vis

ion

fo

rch

Gro

up, U & Gesture Recognition

Speech Recognition Multimodal Dialog Manager

I iti t l ifi ti di l

Com

pu

Res

ear

ci

Initiates clarification dialog for missing parameters

Speech SynthesisIntegrated on Mobile Robot

cv:h

c Integrated on Mobile Robotable to find & follow person

SFB 588 Humanoide Roboter - Lernende und kooperierende multimodale Roboter

44

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

W it A d

or H

uman

-C

Uni

vers

itätK Weitere Anwendung:

Zeigegestenerkennung in einem Smart

uter

Vis

ion

fo

rch

Gro

up, U Room

Com

pu

Res

ear

cicv

:hc

45

Zieleer

actio

n

H) Projekt: “Visuelle Perzeption für die MMI – Interaktion in und

mit a fmerksamen Rä men”

ompu

ter I

nte

Kar

lsru

he (T

H mit aufmerksamen Räumen”Projekt am Fraunhofer IOSB (ehemals IITB)Beginn: Oktober 2007, Laufzeit 5 Jahre, 5 Mitarbeiter

or H

uman

-C

Uni

vers

itätK

Projektziel:Maschinensehen für die Unterstützung der Interaktion des Menschen in und mit »aufmerksamen« Räumen (»Smart Rooms«) und großflächiger Display-

uter

Vis

ion

fo

rch

Gro

up, U mit »aufmerksamen« Räumen (»Smart Rooms«) und großflächiger Display-

UmgebungErstes Anwendungsziel: »Smart Control Room« für Krisenreaktion und –Management

Com

pu

Res

ear

ci

Zielfunktionalitäten des Raumes:Erfassung aller Personen: Position, Identität, Blickrichtung und Aufmerksamkeit Bewegungen

cv:h

c Aufmerksamkeit, BewegungenErkennung, welche Information wahrgenommen wurdeUnterstützung personalisierter ArbeitsumgebungenUnterstützung multimodaler Interaktion mit großen Displaysg g p y

46

Der Smart Room: Sensoren & Geräteer

actio

n

H) Videowand:

ompu

ter I

nte

Kar

lsru

he (T

H

8 Rückprojektionswürfel (4096x1536 px)Können als ein Display angesteuert werden

or H

uman

-C

Uni

vers

itätK Sensoren

4 Kameras in den Raumecken, 1 Weitwinkelkamera an der Decke

uter

Vis

ion

fo

rch

Gro

up, U

1 Weitwinkelkamera an der Decke2 (4) Kameras über der Videowand2 aktive (pan-tilt-zoom) KamerasMikrophone

Com

pu

Res

ear

ci

Mikrophone Lautsprecher

cv:h

c Weiterer Smart Room am KITÄhnliches Setup, keine VideowandMikrophonarrays zur Lokalisierung

Smart Room: Perzeption & Funktionalitäter

actio

n

H)

Perzeptionskomponenten (alle echtzeitf.) 3D Personentracking und Berechnung der Visuellen

ompu

ter I

nte

Kar

lsru

he (T

H g gHülleErkennung von Zeigegesten und Berührungen Erkennung v. Kopfdrehungen und Aufmerksamkeit

or H

uman

-C

Uni

vers

itätK Identifikation durch Gesichtserkennung

Spracherkennung (Kooperation mit KIT)Florian

uter

Vis

ion

fo

rch

Gro

up, U Situationserfassung

Erste Funktionalitäten

Com

pu

Res

ear

ci

Erste FunktionalitätenPersonalisierte und lokalisierte Arbeitsplätze und Nachrichten Multimodale Interaktion mit der Videowand

Michael

cv:h

c Multimodale Interaktion mit der Videowand (Sprache, Gesten, Berührung)Erkennung der Aufmerksamkeit auf Videowand

Video: er

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

Perzeptionskomponentener

actio

n

H)

p p3D Mehrpersonentracking

Lokalisierung ist eine grundlegende Komponente für alle

ompu

ter I

nte

Kar

lsru

he (T

H g g g pweiteren AnalysenMerkmale:

Kopf- und Oberkörperdetektoren, Vordergrund, Farbe

or H

uman

-C

Uni

vers

itätK Audio: Time-delay of arrival (TDOA), Gen. Cross-Corr.

Partikelfilter für Tracking und Fusion

Multi person tracking in a smart room (4 5 cameras)

uter

Vis

ion

fo

rch

Gro

up, U

IdentifikationNotwendig für Personalisierung

Multi-person tracking in a smart room (4-5 cameras)

Com

pu

Res

ear

ci

g gLokaler ansichtsbasierter Ansatz auf Basis von DCT

(Forschungspreis 2008 des Europ. Biometric Forum)Kann mit Sprechererkennung und aktiven Kameras

cv:h

c

kombiniert werdenGemeinsames Tracking und Identifikation bringt Vorteile

50

Combined tracking and identification

Perzeptionskomponenten (2)er

actio

n

H)

p p ( )Extraktion der Visuellen Hülle:

Liefert 3D Repräsentationen der Personen

ompu

ter I

nte

Kar

lsru

he (T

H Liefert 3D Repräsentationen der PersonenHilfreich zur Erkennung der Körperhaltung und für ZeigegestenIn Echtzeit möglich, mit Voxel-Carving Ansatz

or H

uman

-C

Uni

vers

itätK

g , g

uter

Vis

ion

fo

rch

Gro

up, U

Erkennung von KopfdrehungenLiefert wichtige Information über die Aufmerksamkeit

Berechnung der Visuellen Hüllen

Com

pu

Res

ear

ci

AufmerksamkeitGelöst durch videobasierte Schätzung der Kopfdrehungen (Neuronale Netze)Fusion über alle Ergebnisse aus den Kamerasichten

cv:h

c Fusion über alle Ergebnisse aus den Kamerasichten (Partikelfilter)Abbildung der Kopfdrehungen auf wahrscheinliche Ziele, bspw. andere Personen, bestimmte Objekte, Regionen der Videowand

51Erkennung von Kopfdrehungen

Interaktion mit großen Videowändener

actio

n

H)

gBenutzer erwarten Berührungserkennung!Berührungen reichen aber nicht aus

ompu

ter I

nte

Kar

lsru

he (T

H Berührungen reichen aber nicht ausObjekte sind zum Teil nicht erreichbar (Entfernung / Höhe)

or H

uman

-C

Uni

vers

itätK Erweiterung: Nutzung von Zeigegesten

uter

Vis

ion

fo

rch

Gro

up, U Vorteile der videobasierten Lösung

Vorhandene Geräte und Oberflächen können erweitert werdenSkaliert gut in der Größe

Com

pu

Res

ear

ci

Skaliert gut in der GrößeKameras sind günstigNahtloser Übergang zwischen Berührung und Zeigegesten ist möglich

cv:h

c g g g

52

Technische Realisierunger

actio

n

H)

g3D Extraktion der Visuellen Hülle

mit Voxel-Carving Ansatz

ompu

ter I

nte

Kar

lsru

he (T

H mit Voxel Carving Ansatz

Detektion und Tracking des Arms

or H

uman

-C

Uni

vers

itätK Detektion eines zylinderförmigen Objekts vor der

VideowandTracking durch paarweises Matching

uter

Vis

ion

fo

rch

Gro

up, U Interaktionspunkt wird durch 3D Rekonstruktion

berechnetAnnäherung

Com

pu

Res

ear

ci

Erkennung der Interaktion (Berührung / Geste)Als binäre Interaktionsmodalität modelliert

Halten

Bewegung

cv:h

c Zustand wird durch Handbewegung determiniertNahtloser Übergang zwischen Berührung und Zeigen

Halten

53

Zurückziehen

3D Armrekonstruktion und Trackinger

actio

n

H)

gom

pute

r Int

e

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

54

Evaluation der Komponentener

actio

n

H)

pLaufzeit: < 20ms / frame

3 GHz Core 2 Duo NVIDIA GTX280

ompu

ter I

nte

Kar

lsru

he (T

H 3 GHz Core 2 Duo, NVIDIA GTX280

Komponente Dauer [ms]

Vordergrundsegmentierung 6.50

or H

uman

-C

Uni

vers

itätK Voxel carving 8.26

Berührung und Zeigen 1.89

uter

Vis

ion

fo

rch

Gro

up, U

GenauigkeitWie genau trifft die Berührungs- /

Com

pu

Res

ear

ci

Zeigedetektion den vom Benutzer anvisierten Punkt

Modalität Mittl Fehler [cm] Standardab eich ng [cm]

cv:h

c Modalität Mittl. Fehler [cm] Standardabweichung [cm]

Berührung 9.49 6.56

Zeigen 16.61 9.16V t il d F hl f d Vid d

55

Verteilung der Fehler auf der Videowand(Berührung / Zeigen kombiniert)(Stand September 09, inzwischen deutlich genauer)

Benutzerstudieer

actio

n

H)

Aufgabe: Bewege die farbigen Blöcke schnellstmöglich ins ZielZwei Konditionen: “Touch-only” vs “Touch-and-pointing”

ompu

ter I

nte

Kar

lsru

he (T

H Zwei Konditionen: Touch-only vs. Touch-and-pointingTeilnehmer: 15 Männer, 4 Frauen

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

56

Benutzerstudie: Ergebnisseer

actio

n

H)

gFragebögen mit Fünfpunkt Likert-Skala

• Wilcoxon Rank Sum Test zur Analyse

ompu

ter I

nte

Kar

lsru

he (T

H ystatistischer Signifikanz

Hypo 1:“touch + pointing” ist intuitiver und nts

Touch ts Touch + Pointing

or H

uman

-C

Uni

vers

itätK

Hypo 1: touch + pointing ist intuitiver und einfacher zu benutzen

Nein, beides gleich gut bewertet (p=0.14)

# Pa

rtic

ipan Touch

# Pa

rtic

ipan

Touch + Pointing

uter

Vis

ion

fo

rch

Gro

up, U

Hypo 2: Alle Regionen an der Videowandsind per “touch + pointing” einfacher zuerreichen p

ants Touch + Pointing

pan

ts Touch

Com

pu

Res

ear

ci

erreichenJa ! (p<<0.01) Pa

rtic

ip

Part

icip

cv:h

c Hypo 3: “„Zeigegesten waren eine hilfreicheErgänzung“

Ja ! (μ>4 with p<<0.01) rtic

ipan

ts

(μ p )

57Pa

r

Benutzungsschnittstellen für Berührungs- und Zeigebasierte Interaktion

erac

tion

H)

Zeigebasierte InteraktionEntwurf graphischer Benutzungs-schnittstellenfü B üh d Z i b i t I t kti

ompu

ter I

nte

Kar

lsru

he (T

H für Berührungs- und Zeigebasierte InteraktionTests und Vergleiche verschiedener Interaktionsmethoden

or H

uman

-C

Uni

vers

itätK 40 Teilnehmer, verschiedene Setups

Quantitative Messungen: task completion time, Fehlerraten

uter

Vis

ion

fo

rch

Gro

up, U Experimental setupQualitative Messungen: Questionnaires

Com

pu

Res

ear

ci

Scale-rotateMove

cv:h

c

Multiselect Move-scale-rotateFind optimal objectsizePhoto application

Fortschritte: Multi-Toucher

actio

n

H)

ompu

ter I

nte

Kar

lsru

he (T

H

Multitouch

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

Multimodale Interaktioner

actio

n

H) Anwendung:

ompu

ter I

nte

Kar

lsru

he (T

H gLageraum für FeuerwehrHinzufügen taktischer Symbole zur Lagekarte

or H

uman

-C

Uni

vers

itätK

Methode:

uter

Vis

ion

fo

rch

Gro

up, U Sprachkommando zur Auswahl des Symbols

Berührung/Zeigen zur Positionierung

Com

pu

Res

ear

ci

Geplant:Mehr und komplexere Sprachkommandos

cv:h

c Mehr und komplexere SprachkommandosBessere Speech-DetektionManipulation von SymbolenMultimodaler Dialog

erac

tion

H)

ompu

ter I

nte

Kar

lsru

he (T

H

or H

uman

-C

Uni

vers

itätK

uter

Vis

ion

fo

rch

Gro

up, U

Com

pu

Res

ear

cicv

:hc

Related Topicser

actio

n

H)

p

Human motion Analysis

ompu

ter I

nte

Kar

lsru

he (T

H Gait: classification and person identificationClassification of motion figures in dance, sport, …

or H

uman

-C

Uni

vers

itätK

figures in dance, sport, …

Facial expression [Yam02]

uter

Vis

ion

fo

rch

Gro

up, U Synthesizing gesture

and facial expressionsTalking heads

Com

pu

Res

ear

ci

gAvatars

cv:h

c

Efros et al., Recognizing Action at a

62Distance, ICCV’03

Summaryer

actio

n

H)

y

Gesture Taxonomy:

ompu

ter I

nte

Kar

lsru

he (T

H Manipulative - semaphoric - conversational Static vs. dynamic

or H

uman

-C

Uni

vers

itätK

Hidden Markov Models“The 3 problems”G i i t d l

uter

Vis

ion

fo

rch

Gro

up, U Gaussian mixture models

Examples

Com

pu

Res

ear

ci

pAmerican Sign Language recognitionPointing gestures recognitionInteraction with a video wall, Speech & Gesture

cv:h

c Interaction with a video wall, Speech & Gesture

63

Referenceser

actio

n

H)

* [Pavlović97] Additional:

ompu

ter I

nte

Kar

lsru

he (T

H [Pavlović97]V.I. Pavlović, R. Sharma, T.S. Huang: Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.

Additional:

[Becker97]Becker, D.A.: Sensei: A Real-Time Recognition,Feedback and Training System for T’ai Chi Gestures. M.I.T. Media Lab Perceptual Computing Group Technical Report No. 426, 1997.

or H

uman

-C

Uni

vers

itätK

* [Rabiner89]Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77 (2), 257–286, 1989.

p p ,

[Cassell98]Cassell, J.: A Framework For Gesture Generation And Interpretation. In Cipolla, R. and Pentland, A. (eds.), Computer Vision in Human-Machine

uter

Vis

ion

fo

rch

Gro

up, U

* [Starner98]T. Starner, J. Weaver, A. Pentland: Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE

Interaction, pp. 191-215. New York: Cambridge University Press. 1998.

[Poddar98]Poddar, I., Sethi, Y., Ozyildiz, E. and Sharma, R.:

Com

pu

Res

ear

ci

Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371--1375, 1998.

* [Nickel04]Nickel, K., Stiefelhagen, R.: 3D-Tracking of Heads

d H d f P i ti G t R iti i

Toward Natural Gesture/Speech HCI: A Case Studyof Weather Narration. Proc. Workshops onPerceptual User Interfaces, pages 1-6, November,1998.

cv:h

c and Hands for Pointing Gesture Recognition in a Human-Robot Interaction Scenario, Sixth Int. Conf. On Face and Gesture Recognition, May 2004, Seoul, Korea.

[Xiong03]Y. Xioung, F. Quek, D. McNeill: Hand Motion Gestural Oscillations and Multimodal Discourse. ICMI 2003.

64

V19 WS09 GestureRecognition.ppt [Kompatibilitätsmodus]€¦ · Example: "she talked first, I mean...

Documents

Transcript of V19 WS09 GestureRecognition.ppt [Kompatibilitätsmodus]€¦ · Example: "she talked first, I mean...