Extracting Simple Verb Frames from Images
Toward Holistic Scene Understanding
Prof. Daphne Koller Research Group
Stanford University
Geremy Heitz
DARPA CLLR Workshop
December 2, 2008
Grand Goal: Scene Understanding
“man wearing a backpack,
smoking a cigarette, walking a dog on a sidewalk”
Man
Dog
Backpack
Cigarette
“A cow walking through the grass
on a pasture by the sea”
Understanding Verb Frames
“a man is walking on a sidewalk”
Primitives: Objects, Parts, Surfaces, Regions
Interactions: Context, Actions
Methods exist to extract these, but we need to both do a better job and get them all at once.
Modeling verb frames requires understanding the interactions between primitives, which fit well into the framework of graphical models.
Man
Dog
Backpack
Cigarette
Building
Sidewalk
“a dog is walking on a sidewalk”
Frame: to walk
Outline
Extracting the Primitives
Qualitative 3D Scene Layout
Modeling Relationships
Learning Frames
Refined Characterization of Objects
Computer View of a “Scene”
BUILDING
ROAD
STREET
SCENE
Object Detection
= Car
= Person
= Motorcycle
= Boat
= Sheep
= Cow
Detection Window W
Score(W) > 0.5
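The thresholding rule above can be sketched in a few lines. This is a minimal illustration, assuming each candidate window already carries a score from some detector; the `Window` type, the `keep_detections` helper, and the sample scores are hypothetical and not taken from the talk.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    """A candidate detection window W (hypothetical representation)."""
    x: int
    y: int
    w: int
    h: int
    score: float  # assumed detector confidence for this window


def keep_detections(windows: List[Window], threshold: float = 0.5) -> List[Window]:
    """Keep only the windows with Score(W) strictly above the threshold."""
    return [win for win in windows if win.score > threshold]


# Toy candidates; the scores are made up for illustration.
candidates = [
    Window(10, 20, 64, 128, 0.91),
    Window(200, 40, 64, 128, 0.50),
    Window(320, 60, 64, 128, 0.12),
]
detections = keep_detections(candidates)
```

Only the first window survives here, because the comparison is strict and a score of exactly 0.5 is rejected.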
Finding the Primitives Jointly
SEASIDE PASTURE
GRASS
SKY
Grass = Flat
Sky = Far
FG = Vertical
40% Grass, 30% Sky…
1 cow, 2 boats…
[Heitz et al., NIPS 2008a]
Results – TAS Model
Contextual Detector
Base Detector
[Heitz et al., ECCV 2008]
Qualitative 3D Scene Layout
Primitives imply a certain 3D layout of the scene, although absolute depth may not be preserved
For example:
Sky is a far, vertical plane
Water, road are horizontal planes
Objects “pop up” from the image
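One way to read the examples above is as a lookup from region label to qualitative geometry. The sketch below encodes only the three cases stated on the slide; the `LAYOUT` table, the `qualitative_layout` function, and the `"popup"` and `"unknown"` markers are illustrative assumptions, not the group's implementation.

```python
from typing import Optional, Tuple

# Hypothetical table: region label -> (orientation, qualitative depth).
# Depth is None where the slide says nothing about it.
LAYOUT = {
    "sky": ("vertical", "far"),
    "water": ("horizontal", None),
    "road": ("horizontal", None),
}


def qualitative_layout(label: str, is_object: bool = False) -> Tuple[str, Optional[str]]:
    """Return (orientation, depth) for a labeled region.

    Detected objects are treated as popping up from the image; labels
    not listed in the table fall back to "unknown".
    """
    if is_object:
        return ("popup", None)
    return LAYOUT.get(label.lower(), ("unknown", None))
```

Absolute depth is deliberately absent: the table stores only coarse orientation and a qualitative near/far tag.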
Modeling Relationships
Beside
In front of
On
We have explored how to model 2D relationships
We should be able to extend this to 3D relationships
[Gould et al., IJCV 2008]
[Heitz et al., ECCV 2008]
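To make the 2D relationships concrete, here is a rough sketch that labels a pair of bounding boxes as "on", "beside", or "in front of". The geometric tests, the pixel tolerance, and the `Box` and `relation` names are all assumptions chosen for illustration; the cited papers learn such relationships rather than hard-coding them.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    x_min: float
    y_min: float  # top edge
    x_max: float
    y_max: float  # bottom edge


def overlap_1d(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> float:
    """Length of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def relation(a: Box, b: Box, tol: float = 5.0) -> str:
    """Heuristic 2D relation of box a with respect to box b."""
    x_ov = overlap_1d(a.x_min, a.x_max, b.x_min, b.x_max)
    y_ov = overlap_1d(a.y_min, a.y_max, b.y_min, b.y_max)
    # "on": a's bottom edge rests near b's top edge, with shared horizontal extent.
    if x_ov > 0 and abs(a.y_max - b.y_min) <= tol:
        return "on"
    # "beside": shared vertical extent, no horizontal overlap.
    if y_ov > 0 and x_ov == 0:
        return "beside"
    # "in front of": boxes overlap and a's bottom edge is lower in the image,
    # a common ground-plane cue for being closer to the camera.
    if x_ov > 0 and y_ov > 0 and a.y_max > b.y_max:
        return "in front of"
    return "none"
```

Extending this to 3D would replace the image-plane tests with tests on estimated depth and surface orientation.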
Outline
Extracting the Primitives
Qualitative 3D Scene Layout
Modeling Relationships
Learning Frames
Refined Characterization of Objects
Learning Semantics: Verb Frames
Given primitives, rough layout, and relationships
Let’s learn subjects, verbs, and objects for frames:
The [S] [V] the [O].
[S],[O]
CAR
ROAD
COW
GRASS
PERSON
APPLE
…
[V]
WALKS ON
EATS
DRIVES ON
JUMPS OVER
THROWS
…
The CAR DRIVES ON the ROAD
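The frame template can be represented as a simple (subject, verb, object) triple. The sketch below fills the "The [S] [V] the [O]." template and counts how often each triple is observed; the `Frame` type and the toy observation list are hypothetical, and counting is only a stand-in for whatever learning procedure would actually be used.

```python
from collections import Counter
from typing import NamedTuple


class Frame(NamedTuple):
    """A simple verb frame: subject [S], verb [V], object [O]."""
    subject: str
    verb: str
    obj: str

    def sentence(self) -> str:
        """Fill the template 'The [S] [V] the [O].'"""
        return f"The {self.subject} {self.verb} the {self.obj}."


# Toy observations built from the vocabulary on the slide (made up for illustration).
observations = [
    Frame("CAR", "DRIVES ON", "ROAD"),
    Frame("COW", "WALKS ON", "GRASS"),
    Frame("PERSON", "EATS", "APPLE"),
    Frame("CAR", "DRIVES ON", "ROAD"),
]

# Count how often each complete frame occurs.
frame_counts = Counter(observations)
most_common_frame, count = frame_counts.most_common(1)[0]
```

With these toy observations, `most_common_frame.sentence()` yields the example sentence from the slide.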
Refined Characterization
We need to know that the white stick is a cigarette…
and where the man’s mouth is…
in order to determine that he’s smoking.
Refined Object Characterization
Set of “keypoint” landmarks
Outline shape defined by connecting contour
[Heitz et al., NIPS 2008b, IJCV in submission]
Results
Giraffe
Llama
Rhino
Mammals
[Heitz et al., NIPS 2008b, IJCV in submission]
Eating Standing
Running Standing
Activity Recognition
Eating
Drinking
1) Localize the landmarks of the cow, including the head.
2) Extract histogram of “stuff” in a window around the head landmark
Grass
Cow
3) Make a decision
Eating
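The three steps can be sketched as follows, assuming the head landmark has already been localized and that a per-pixel "stuff" label map is available. The function names, the square window, and the 0.5 decision thresholds are illustrative assumptions rather than the method from the talk.

```python
from collections import Counter
from typing import Dict, List


def stuff_histogram(labels: List[List[str]], head_row: int, head_col: int,
                    radius: int) -> Dict[str, float]:
    """Normalized histogram of labels in a square window around the head."""
    counts: Counter = Counter()
    row_lo = max(0, head_row - radius)
    row_hi = min(len(labels), head_row + radius + 1)
    for r in range(row_lo, row_hi):
        col_lo = max(0, head_col - radius)
        col_hi = min(len(labels[r]), head_col + radius + 1)
        counts.update(labels[r][col_lo:col_hi])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def decide(hist: Dict[str, float]) -> str:
    """Toy decision rule: mostly grass -> Eating, mostly water -> Drinking."""
    grass = hist.get("grass", 0.0)
    water = hist.get("water", 0.0)
    if grass >= 0.5 and grass >= water:
        return "Eating"
    if water >= 0.5:
        return "Drinking"
    return "Standing"


# Toy label map; the head landmark is assumed to be at row 2, column 1.
label_map = [
    ["sky", "sky", "sky", "sky"],
    ["cow", "cow", "grass", "grass"],
    ["cow", "grass", "grass", "grass"],
    ["grass", "grass", "grass", "grass"],
]
hist = stuff_histogram(label_map, head_row=2, head_col=1, radius=1)
activity = decide(hist)
```

In practice the decision step would more likely be a trained classifier over the histogram than fixed thresholds.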
Activity Recognition with People
Running Walking Standing
Hitting
Pose of person is one of the important factors
Also need to recognize objects person interacts with
How far can we take this?
Front legs off ground = Jumping
Ball near hands = Throwing
Apple near mouth = Eating
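The three cues above read naturally as rules over localized landmarks and detected objects. The sketch below is a literal, hand-written version of those rules; the scene dictionary keys, the `near` helper, and the distance threshold are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def near(p: Point, q: Point, max_dist: float) -> bool:
    """True if two image points lie within max_dist of each other."""
    return math.dist(p, q) <= max_dist


def infer_actions(scene: Dict, max_dist: float = 5.0) -> List[str]:
    """Apply the three slide rules to a dictionary of (assumed) scene facts."""
    actions = []
    # Front legs off ground = Jumping
    if scene.get("front_legs_on_ground") is False:
        actions.append("Jumping")
    # Ball near hands = Throwing
    if "ball" in scene and "hands" in scene and near(scene["ball"], scene["hands"], max_dist):
        actions.append("Throwing")
    # Apple near mouth = Eating
    if "apple" in scene and "mouth" in scene and near(scene["apple"], scene["mouth"], max_dist):
        actions.append("Eating")
    return actions
```

Rules like these depend on accurate landmarks (mouth, hands, legs), which is the motivation for the refined object characterization.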
Does phased learning help?
Cartoon/Caricature
Exaggerates the most salient features of the object class.
Simple BG
Real object with no confusing clutter.
Cluttered BG
Object in standard pose on natural background.
Articulated
Once we have built a strong appearance model, can we learn complicated articulations?
Our Related Papers
G. Elidan, B. Packer, G. Heitz, and D. Koller. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. UAI, 2008.
S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-Class Segmentation with Relative Location Prior. IJCV, 2008.
S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller. Integrating Visual and Range Data for Robotic Object Detection. ECCV Workshop M2SFA2, 2008.
G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things. ECCV, 2008.
G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. NIPS, 2008a.
G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based Object Localization for Descriptive Classification. NIPS, 2008b.