Extracting Simple Verb Frames from Images

Toward Holistic Scene Understanding

Prof. Daphne Koller Research Group

Stanford University

Geremy Heitz

DARPA CLLR Workshop

December 2, 2008

Grand Goal: Scene Understanding

“man wearing a backpack, smoking a cigarette, walking a dog on a sidewalk”

Man

Dog

Backpack

Cigarette

“A cow walking through the grass on a pasture by the sea”

Understanding Verb Frames

“a man is walking on a sidewalk”

Primitives Objects Parts Surfaces Regions

Interactions Context Actions

Methods exist to extract each of these, but we need to extract them more accurately and all at once.

Modeling verb frames requires understanding the interactions between primitives, which fit well into the framework of graphical models.

Man

Dog

Backpack

Cigarette

Building

Sidewalk

“a dog is walking on a sidewalk”

Frame: to walk

Outline

Extracting the Primitives

Qualitative 3D Scene Layout

Modeling Relationships

Learning Frames

Refined Characterization of Objects

Computer View of a “Scene”

BUILDING

ROAD

STREET

SCENE

Object Detection

= Car

= Person

= Motorcycle

= Boat

= Sheep

= Cow

Detection Window W

Score(W) > 0.5
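The detection rule on this slide can be sketched as a sliding-window loop: every candidate window W is scored, and windows with Score(W) > 0.5 are kept. This is a minimal illustration only. The window size, stride, and the `toy_score` function are assumptions made up for the example; a real detector would compute the score from image features.

```python
from typing import Callable, Iterator, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    stride: int) -> Iterator[Window]:
    """Yield every fixed-size window that fits inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


def detect(img_w: int, img_h: int, score: Callable[[Window], float],
           win_w: int = 64, win_h: int = 64, stride: int = 32,
           threshold: float = 0.5) -> List[Window]:
    """Keep the windows whose score exceeds the threshold."""
    return [w for w in sliding_windows(img_w, img_h, win_w, win_h, stride)
            if score(w) > threshold]


def toy_score(window: Window) -> float:
    """Hypothetical scorer: confident only for the window at (32, 32)."""
    x, y, _, _ = window
    return 0.9 if (x, y) == (32, 32) else 0.1


detections = detect(128, 128, toy_score)
```

Because each window is scored in isolation, this kind of detector cannot use the surrounding scene, which is the limitation the following slides address.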

Finding the Primitives Jointly

SEASIDE PASTURE

GRASS

SKY

Grass = Flat

Sky = Far

FG = Vertical

40% Grass, 30% Sky…

1 cow, 2 boats…

[Heitz et al., NIPS 2008a]
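Summaries such as the region percentages and object counts on this slide can be computed from a per-pixel label map and a list of detections. The sketch below is only an illustration of those two summaries, not the joint model from the cited paper. The tiny label map and the detection list are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def region_fractions(label_map: List[List[str]]) -> Dict[str, float]:
    """Fraction of pixels assigned to each region label."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def object_counts(detected_classes: List[str]) -> Counter:
    """Number of detected objects per class."""
    return Counter(detected_classes)


# Hypothetical 2 x 5 segmentation and detector output.
label_map = [
    ["sky", "sky", "sky", "water", "water"],
    ["water", "grass", "grass", "grass", "grass"],
]
fractions = region_fractions(label_map)
counts = object_counts(["cow", "boat", "boat"])
```

Here `fractions["grass"]` is 0.4 and `counts` holds one cow and two boats, mirroring the kind of scene summary shown on the slide.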

Results – TAS Model

Contextual Detector

Base Detector

[Heitz et al., ECCV 2008]

Qualitative 3D Scene Layout

Primitives imply a certain 3D layout of the scene, although absolute depth may not be preserved.

For example:

Sky is a far, vertical plane

Water and road are horizontal planes

Objects “pop up” from the image
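The examples above amount to a lookup from region label to coarse geometry. A minimal sketch follows, assuming a hand-written table. The table entries restate the slides (sky is far and vertical; water, road, and grass are flat), and the function name and return strings are hypothetical.

```python
# Coarse geometry per region label, restating the examples on the slides.
REGION_GEOMETRY = {
    "sky": "far vertical plane",
    "water": "horizontal plane",
    "road": "horizontal plane",
    "grass": "horizontal plane",
}


def qualitative_geometry(label: str, is_object: bool = False) -> str:
    """Return a qualitative 3D description; no absolute depth is implied."""
    if is_object:
        # Detected objects are treated as upright "pop-ups".
        return "vertical pop-up"
    return REGION_GEOMETRY.get(label, "unknown")
```

For instance, `qualitative_geometry("road")` gives a horizontal plane, while `qualitative_geometry("cow", is_object=True)` gives a vertical pop-up.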

Modeling Relationships

Beside

In front of

On

We have explored how to model 2D relationships

We should be able to extend this to 3D relationships

[Gould et al., IJCV 2008]

[Heitz et al., ECCV 2008]
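To make the 2D relationships concrete, here is a rough sketch that labels a pair of bounding boxes as "on", "in front of", or "beside". These are hand-written geometric heuristics chosen for illustration, not the learned relative-location models from the cited papers. The thresholds `tol` and `max_gap` and the "lower in the image means closer" rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in image coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def h_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap in pixels; negative values are a gap."""
    return min(a.x_max, b.x_max) - max(a.x_min, b.x_min)


def v_overlap(a: Box, b: Box) -> float:
    """Vertical overlap in pixels; negative values are a gap."""
    return min(a.y_max, b.y_max) - max(a.y_min, b.y_min)


def relation(a: Box, b: Box, tol: float = 10.0, max_gap: float = 50.0) -> str:
    """Coarse 2D relation of box a with respect to box b."""
    ho, vo = h_overlap(a, b), v_overlap(a, b)
    # "on": a's bottom edge sits near b's top edge and they share columns.
    if ho > 0 and abs(a.y_max - b.y_min) <= tol:
        return "on"
    # "in front of": boxes overlap and a reaches lower in the image
    # (assumed to mean closer to the camera on a ground plane).
    if ho > 0 and vo > 0 and a.y_max > b.y_max:
        return "in front of"
    # "beside": same rows, separated horizontally by a small gap.
    if vo > 0 and ho <= 0 and -ho <= max_gap:
        return "beside"
    return "none"


cow, grass = Box(100, 200, 200, 300), Box(0, 295, 400, 400)
man, dog = Box(50, 100, 100, 300), Box(120, 220, 180, 300)
car, building = Box(100, 150, 300, 350), Box(50, 0, 350, 300)
```

With these boxes the cow is "on" the grass, the man is "beside" the dog, and the car is "in front of" the building. Extending to 3D would replace the image-plane tests with tests on estimated depth and surface orientation.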

Outline

Extracting the Primitives

Qualitative 3D Scene Layout

Modeling Relationships

Learning Frames

Refined Characterization of Objects

Learning Semantics: Verb Frames

Given primitives, rough layout, and relationships

Let’s learn subjects, verbs, and objects for frames:

The [S] [V] the [O].

[S],[O]

CAR

ROAD

COW

GRASS

PERSON

APPLE

…

[V]

WALKS ON

EATS

DRIVES ON

JUMPS OVER

THROWS

…

The CAR DRIVES ON the ROAD
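The frame template "The [S] [V] the [O]" maps naturally onto a small record type. The sketch below also includes a simple count-based stand-in for learning which verb goes with a subject and object pair. The counting scheme is an assumption for illustration and is not the learning method proposed in the talk. The training triples are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Frame(NamedTuple):
    subject: str
    verb: str
    obj: str

    def render(self) -> str:
        """Fill the template: The [S] [V] the [O]."""
        return f"The {self.subject} {self.verb} the {self.obj}"


VerbCounts = Dict[Tuple[str, str], Counter]


def fit_verb_counts(frames: Iterable[Frame]) -> VerbCounts:
    """Count how often each verb occurs with each (subject, object) pair."""
    counts: VerbCounts = defaultdict(Counter)
    for f in frames:
        counts[(f.subject, f.obj)][f.verb] += 1
    return counts


def most_likely_verb(counts: VerbCounts, subject: str,
                     obj: str) -> Optional[str]:
    """Most frequent verb for the pair, or None if the pair is unseen."""
    verbs = counts.get((subject, obj))
    if not verbs:
        return None
    return verbs.most_common(1)[0][0]


# Hypothetical labeled frames.
training = [
    Frame("CAR", "DRIVES ON", "ROAD"),
    Frame("CAR", "DRIVES ON", "ROAD"),
    Frame("COW", "WALKS ON", "GRASS"),
    Frame("COW", "EATS", "GRASS"),
    Frame("COW", "EATS", "GRASS"),
]
verb_counts = fit_verb_counts(training)
sentence = Frame("CAR", "DRIVES ON", "ROAD").render()
```

A frequency table ignores the image evidence entirely; in the setting of the talk, the verb would also depend on layout and relationships such as "on" or "beside".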

Refined Characterization

We need to know that the white stick is a cigarette…

and where the man’s mouth is…

in order to determine that he’s smoking.

Refined Object Characterization

Set of “keypoint” landmarks

Outline shape defined by connecting contour

[Heitz et al., NIPS 2008b, IJCV in submission]

Results

Giraffe

Llama Rhino

Mammals

[Heitz et al., NIPS 2008b, IJCV in submission]

Eating Standing

Running Standing

Activity Recognition

Eating

Drinking

1) Localize the landmarks of the cow, including the head.

2) Extract a histogram of “stuff” in a window around the head landmark.

Grass

Cow

3) Make a decision

Eating
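Steps 2 and 3 of this pipeline can be sketched as follows, assuming step 1 has already produced a head landmark and that a per-pixel "stuff" label map is available. The window size, the 0.5 thresholds, and the decision rule are hypothetical; the talk does not specify how the decision is made.

```python
from collections import Counter
from typing import Dict, List, Tuple


def stuff_histogram(label_map: List[List[str]], center: Tuple[int, int],
                    half_size: int) -> Dict[str, float]:
    """Normalized histogram of labels in a square window around center.

    center is (row, col); the window is clipped to the image bounds.
    """
    row, col = center
    rows = range(max(0, row - half_size),
                 min(len(label_map), row + half_size + 1))
    cols = range(max(0, col - half_size),
                 min(len(label_map[0]), col + half_size + 1))
    counts = Counter(label_map[r][c] for r in rows for c in cols)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def classify_activity(hist: Dict[str, float]) -> str:
    """Hypothetical decision rule based on what surrounds the head."""
    if hist.get("grass", 0.0) >= 0.5:
        return "eating"
    if hist.get("water", 0.0) >= 0.5:
        return "drinking"
    return "standing"


# Hypothetical 5 x 5 label map: cow pixels above, grass below.
label_map = [["cow"] * 5, ["cow"] * 5,
             ["grass"] * 5, ["grass"] * 5, ["grass"] * 5]
head_down = classify_activity(stuff_histogram(label_map, (3, 2), 1))
head_up = classify_activity(stuff_histogram(label_map, (0, 2), 1))
```

With the head landmark surrounded by grass the rule returns "eating"; with the head surrounded by cow pixels it falls back to "standing".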

Activity Recognition with People

Running Walking Standing

Hitting

Pose of the person is one of the important factors

Also need to recognize the objects the person interacts with

How far can we take this?

Front legs off ground = Jumping

Ball near hands = Throwing

Apple near mouth = Eating
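The three rules above can be written directly as tests on landmark and object positions. This is a sketch under strong assumptions: landmarks and object centers are given as 2D image points, "near" means within a fixed pixel radius, and the ground is a known horizontal line. All names, keys, and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y grows downward


def near(p: Point, q: Point, radius: float) -> bool:
    """True if two points are within radius pixels of each other."""
    return math.dist(p, q) <= radius


def infer_actions(landmarks: Dict[str, Point], objects: Dict[str, Point],
                  ground_y: Optional[float] = None,
                  radius: float = 20.0) -> List[str]:
    """Apply the slide's rules to landmark and object positions."""
    actions = []
    # Front legs off ground = Jumping (feet clearly above the ground line).
    if (ground_y is not None and "front_feet" in landmarks
            and landmarks["front_feet"][1] < ground_y - radius):
        actions.append("jumping")
    # Ball near hands = Throwing.
    if ("hands" in landmarks and "ball" in objects
            and near(landmarks["hands"], objects["ball"], radius)):
        actions.append("throwing")
    # Apple near mouth = Eating.
    if ("mouth" in landmarks and "apple" in objects
            and near(landmarks["mouth"], objects["apple"], radius)):
        actions.append("eating")
    return actions


person = {"hands": (100.0, 100.0), "mouth": (100.0, 50.0)}
eating = infer_actions(person, {"apple": (105.0, 55.0)})
jumping = infer_actions({"front_feet": (200.0, 150.0)}, {}, ground_y=220.0)
```

Rules like these depend on accurate landmarks and object detections, which is why refined object characterization matters for recognizing such actions.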

Does phased learning help?

Cartoon/Caricature

Exaggerates the most salient features of the object class.

Simple BG

Real object with no confusing clutter.

Cluttered BG

Object in standard pose on natural background.

Articulated

Once we have built a strong appearance model, can we learn complicated articulations?

Our Related Papers

G. Elidan, B. Packer, G. Heitz, and D. Koller. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. UAI, 2008.

S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-Class Segmentation with Relative Location Prior. IJCV, 2008.

S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller. Integrating Visual and Range Data for Robotic Object Detection. ECCV Workshop M2SFA2, 2008.

G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things. ECCV, 2008.

G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. NIPS, 2008.

G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based Object Localization for Descriptive Classification. NIPS, 2008.
