Seeing Action 1 or Representation and Recognition of Activity by Machine Aaron Bobick...
[Slide 1]
Seeing Action 1
or
Representation and Recognition of Activity by Machine
Aaron Bobick (afb@cc.gatech.edu)
School of Interactive Computing, College of Computing
Georgia Tech
[Slide 2]
An analogy… "Once upon a time"
"Once upon a time" … computer vision researchers wanted to do Object Recognition– "This is a chair"
But this was really hard…
[Slide 3]
[Slide 4]
An analogy… "Once upon a time"
"Once upon a time" … computer vision researchers wanted to do Object Recognition– "This is a chair"
But this was really hard…
So they gave up! (until recently) and instead did…
[Slide 5]
[Slide 6]
An analogy… "Once upon a time"
"Once upon a time" … computer vision researchers wanted to do Object Recognition– "This is a chair"
But this was really hard…
So they gave up! (sort of) and instead did…
Chair recognition became model-based object recognition, based upon geometric properties.
[Slide 7]
Object Recognition ≠ Activity Recognition
Imagine you created an algorithm that could recognize only me sitting down.
– Would you get a PhD? (well, maybe…)
"Activity recognition"* seems to get back to semantics:
– A person stealing a car
– A person sitting down
– Two people having a discussion (or a fight)
– Two people attacking a third
– A crowd storming the Bastille
*which we'll define more precisely later.
[Slide 8]
Recognition implies Representation
To recognize something you have to have something to recognize.
– Brilliant, huh?
Just like any other AI recognition problem, we must have a representation of whatever it is we are going to recognize.
And with representations come questions…
[Slide 9]
Some Q's about Representations
Marr's criteria:
– Scope & range, sensitivity
"Language" of the representation
Computability of an instance
Learnability of the "class"; training versus learning versus "the oracle"
Stability in face of perceptual uncertainty
Inference-support (reasoning in face of variation or ambiguity)
Others???
[Slide 10]
BUT WAIT!!!!
Just what is it we are trying to represent?
Deconstruct delivering a package… (videos)
[Slide 11]
Barnes and Noble Video Segment
[Slide 12]
What are the "activities" or "actions" taking place?
Back door open
Truck arriving
Back door closing
Truck leaving
Carrying a package
Following the car
Unloading
[Slide 13]
An old story (’96):
Behavior taxonomy for vision
Different levels of understanding motion – visual evidence of behavior – require different forms of representation, methods of manipulating time, and depth of reasoning.
Propose three levels:
Movement – atomic behaviors defined by motion
• Ballet moves, body motions (“sitting”)
Activity – sequences or compositions
• Statistically structured events
Action – semantics, causality, knowledge
• Cooking, football plays, “moving Coke cans”
[Slide 14]
A better story… (anonymous paper)
(Still) three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
"Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
"Opening a door", "Lifting a package"
Typically short in time
Might be definable in terms of motion; especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
"Delivering packages"
Typically has causal underpinnings
Can be thought of as statistically structured events
[Slide 15]
Thinking a bit about the levels…
(Still) Three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
"Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
"Opening a door", "Lifting a package"
Typically short in time
Might be definable in terms of motion; especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
"Delivering packages"
Typically has causal underpinnings
Can be thought of as statistically structured events
Maybe Actions are movements in context??
[Slide 16]
What is the goal of a representation of activity/behaviors?
Recognition implies representation
Representations can talk about what events *ARE*:
– Definitional – but sometimes not “real” because primitives not grounded
– Permits specification of reasoning mechanism
– Context can be made explicit (but is not usually)
– Hard to learn
Representations can talk about what events *LOOK LIKE*:
– Sometimes learnable, always well-defined primitives
– Typically not guaranteed to be complete
– Have no explanatory power
– Often leverages (i.e. is wholly dependent upon) context – makes it learnable from specific data
[Slide 17]
Data-driven vs Knowledge-taught
[Chart: representations arranged from data-driven/statistical to knowledge-taught/structural, and from movement through action to activity along an axis of temporal and relational complexity: MHIs, PHMMs, SCFGs, P-Nets, BNs, PNF, event n-grams, suffix trees.]
[Slide 18]
So how do we proceed?
Three (now only 2.5) sessions, climbing the representational ladder in terms of the semantics and representational "power".
Cover movements through activity-level descriptions.
Some structural, some statistical.
I will leave the real AI to others here…
[Slide 19]
Strict Appearance: human movements
Is recognizing movement a 3D or 2D problem? Simple human psychophysics and computational complexity argue for 2D aspects.
Temporal templates: Movements are recognized directly from the motion.
Appearance-based recognition can assist geometric recovery: recognition labels the parts and allows extraction.
demonstration
[Slide 20]
Blurry Video
[Slide 21]
Less Blurry Video!
[Slide 22]
Shape and motion: view-based
Schematic representation of sitting at 90°
[Slide 23]
Motion energy images
Spatial accumulation of motion.
Collapse over specific time window.
Motion measurement method not critical (e.g. motion differencing).
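The accumulation step can be sketched minimally (assuming binary per-frame motion masks; the function name is illustrative, not from the talk):

```python
import numpy as np

def mei(motion_masks):
    """Motion energy image: union of per-frame binary motion masks
    accumulated over a fixed time window."""
    return np.any(np.stack(motion_masks), axis=0)
```

Any per-frame motion detector (e.g. simple frame differencing) can supply the masks, which is why the measurement method is not critical.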
[Slide 24]
Motion history images
Motion history images are a different function of temporal volume.
Pixel operator is replacement decay:
I(x,y,t) = τ if moving
I(x,y,t) = max(I(x,y,t−1) − 1, 0) otherwise
Trivial to construct I_k(x,y,t) (a shorter window) from I(x,y,t), so can process multiple time-window lengths without more search.
MEI is thresholded MHI
[Figures: pixels that moved at t−1; pixels that moved since t−15]
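The replacement-decay rule can be sketched as follows (the window length τ = 15 and the function names are illustrative assumptions, not from the talk):

```python
import numpy as np

TAU = 15  # time-window length in frames (an illustrative choice)

def update_mhi(mhi, moving):
    """One replacement-decay step of a motion history image:
    moving pixels are set to TAU, all others decay toward zero."""
    decayed = np.maximum(mhi - 1.0, 0.0)
    return np.where(moving, float(TAU), decayed)

def mei_from_mhi(mhi):
    """The MEI is just the thresholded MHI."""
    return mhi > 0
```

Because the update is recursive, the history for any shorter window falls out of the same single pass over the video.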
[Slide 25]
Temporal templates
MEI + MHI = Temporal template
motion history image
motion energy image
[Slide 26]
Recognizing temporal templates
For MEI and MHI compute global properties (e.g. Hu moments). Treat both as grayscale images.
Collect statistics on distribution of those properties over people for each movement.
At run time, construct MEIs and MHIs backwards in time.
– Recognizing movements as soon as they complete.
Linear time scaling.
– Compute range of τ using the min and max of training data.
Simple recursive formulation so very fast.
Filter implementation obvious so biologically “relevant”.
Best reference is Bobick and Davis, PAMI 2001.
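A rough sketch of the descriptor step, using raw normalized central moments η_pq as a simple stand-in for the seven Hu moments (function and variable names are illustrative):

```python
import numpy as np

def normalized_central_moments(img, orders=((2, 0), (0, 2), (1, 1))):
    """Translation- and scale-invariant moments eta_pq of a grayscale
    image (e.g. an MEI or MHI treated as an image)."""
    img = np.asarray(img, dtype=float)
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()
    xc, yc = (xs * img).sum() / m00, (ys * img).sum() / m00
    feats = []
    for p, q in orders:
        mu_pq = ((xs - xc) ** p * (ys - yc) ** q * img).sum()
        feats.append(mu_pq / m00 ** (1 + (p + q) / 2.0))
    return np.array(feats)
```

A recognizer would collect these features over the training examples of each movement and label a new template by its nearest class under, say, a Mahalanobis distance.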
[Slide 27]
Aerobics examples
[Slide 28]
Virtual PAT (Personal Aerobics Trainer)
Uses MHI recognition.
Portable IR background subtraction system.
(CAPTECH ’98)
[Slide 29]
The KidsRoom
A narrative, interactive children’s playspace.
Demonstrates computer vision “action” recognition.
Sometimes, possible because the machine knows the context.
A kinder, gentler C3I interface.
Ported to the Millennium Dome, London, 2001.
Summary and critique in Presence, August 1999.
[Slide 30]
Recognizing Movement in the KidsRoom
First teach the kids, then observe.
Temporal templates “plus” (details in the paper).
Monsters always do something, but only speak it when sure.
[Slide 31]
Some Q's about Representations…
Scope and Range:
– Gross motor activities; view dependent
"Language" of the representation:
– Statistical characterization of video properties
Computability of an instance:
– Easy to compute assuming you can extract the person from the background
Learnability of the "class":
– Parameters predetermined by design
– Explicit training; easily acquired
Stability in face of perceptual uncertainty:
– Pretty good
Inference-support:
– Zilch
[Slide 32]
Lesson from Temporal Templates:
It’s the representation, stupid…
[Slide 33]
"Gesture recognition"-like activities
[Slide 34]
Some thoughts about gesture
There is a conference on Face and Gesture Recognition so obviously Gesture recognition is an important problem…
Prototype scenario:
– Subject does several examples of "each gesture"
– System "learns" (or is trained) to have some sort of model for each
– At run time, compare input to known models and pick one
Recently some work at Univ. of Maryland on “ballistic motions”, decomposing a sequence into its parts.
[Slide 35]
Anatomy of Hidden Markov Models
Typically thought of as stochastic FSMs where:
– a_ij is P(q_t = j | q_t−1 = i)
– b_j(x) is p(x_t = x | q_t = j)
HMMs model activity by presuming activity is a first-order Markov process. The observation sequence is output from the b_j(x). States are hidden and unknown.
Train via expectation-maximization (EM). (more on this…)
Paradigm:
– Training: examples from each class, slow but OK.
– Testing: fast (Viterbi), typical PR types of issues.
– Backward looking, real-time at completion.
[Diagram: left-to-right FSM with states A, B, C]
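For concreteness, the fast Viterbi decoding mentioned above can be sketched on a toy discrete-output HMM (the two-state model and all its numbers are illustrative assumptions, not from the talk):

```python
import numpy as np

# Toy two-state HMM with two discrete output symbols.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])        # a_ij = P(q_t = j | q_{t-1} = i)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # b_j(o) = P(o_t = o | q_t = j)
pi = np.array([1.0, 0.0])        # always start in state 0

def viterbi(obs):
    """Most likely hidden-state sequence for the toy HMM above."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))              # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)     # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):             # follow backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

Decoding is a single forward sweep plus a backtrace, which is why recognition is fast relative to training.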
[Slide 36]
Tutorial on HMM?
Yes/No???
[Slide 37]
Wins and Losses of HMMs in Gesture
Good points about HMMs:
– A learning paradigm that acquires spatial and temporal models and does some amount of feature selection.
– Recognition is fast; training is not so fast but not too bad.
Not-so-good points:
– If you know something about state definitions, it is difficult to incorporate (coming later…)
– Every gesture is a new class, independent of anything else you’ve learned.
– Particularly bad for “parameterized gesture.”
[Slide 38]
Parameterized Gesture
“I caught a fish this big.”
[Slide 39]
Parametric HMMs (PAMI, 1999)
Basic ideas:
– Make the output probabilities of the state be a function of the parameter of interest: b_j(x) becomes b′_j(x, θ).
– Maintain same temporal properties; a_ij unchanged.
– Train with known parameter values to solve for dependencies of b′ on θ.
– During testing, use EM to find the θ that gives the highest probability. That probability is the confidence in recognition; the best θ is the parameter.
Issues:
– How to represent dependence on θ?
– How to train given θ?
– How to test for θ?
– What are the limitations on dependence on θ?
[Slide 40]
Linear PHMM - Representation
Represent dependence on θ as linear movement of the mean of the Gaussians of the states:
μ̂_j(θ) = W_j θ + μ_j
Need to learn W_j and μ_j for each state j. (ICCV ’98)
(For the graphical model folks in the audience.)
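A minimal sketch of that parameterized output density (the spherical covariance and all names are simplifying assumptions for illustration):

```python
import numpy as np

def output_logprob(x, theta, W, mu, var):
    """log b'_j(x, theta) for one state j with a spherical Gaussian
    whose mean moves linearly with the gesture parameter theta."""
    mean = W @ theta + mu            # mu_hat_j(theta) = W_j theta + mu_j
    d = x - mean
    k = len(x)
    return -0.5 * (d @ d / var + k * np.log(2 * np.pi * var))
```

Testing then maximizes the total sequence likelihood over theta; with this linear-Gaussian form the theta update in each EM iteration reduces to a weighted least-squares solve.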
[Slide 41]
Linear PHMM - training
Need to derive EM equations for linear parameters and proceed as normal:
[Slide 42]
Linear PHMM - testing
Derive EM equations with respect to θ:
We are testing by EM! (i.e. iterative):
– Solve for γ_tk given a guess for θ
– Solve for θ given a guess for γ_tk
[Slide 43]
How big was the fish?
[Slide 44]
Pointing
Pointing is the prototypical example of a parameterized gesture.
Assuming two DOF, can parameterize either by (x, y) or by angles.
Under the linear assumption must choose carefully.
A generalized non-linear map would allow greater freedom. (ICCV ’99)
[Slide 45]
Linear pointing results
Test for both recognition and recovery:
If we prune based on legal θ (MAP via uniform density):
[Slide 46]
Noise sensitivity
Compare ad hoc procedure with PHMM parameter recovery (ignoring “their” recognition problem!!).
[Slide 47]
Lesson from PHMMs:
It’s the representation, stupid…
(The non-linear case is an even better representation.)
[Slide 48]
Some Q's about Representations…
Scope and Range:
– Densely sampled motions through parameter spaces
"Language" of the representation:
– Stochastic FSM through regions of parameter space
Computability of an instance:
– Only assumes consistent noise model between training and testing
Learnability of the "class":
– Explicit training; easily acquired
– All parameters learned
Stability in face of perceptual uncertainty:
– Pretty good (see above)
Inference-support:
– Zilch