3D Scene Models 6.870 Object recognition and scene understanding Krista Ehinger.


Questions

What makes a good 3D scene model? How accurate does it need to be?

How far can you get with automatic surface detection? Where do you need human input?

Modelling the scene

Real scenes have way too many surfaces

Modelling the scene

Option 1: Diorama world

Tour Into the Picture (TIP)

Model the scene as 5 planes + foreground objects

Easy implementation: planes/objects defined by humans

Y. Horry, K.I. Anjyo and K. Arai. "Tour Into the Picture: Using a spidery mesh user interface to make animation from a single image". ACM SIGGRAPH 1997

TIP Implementation

User defines vanishing point, rear wall of the scene (inner rectangle)

Given some assumptions about the camera, position/size of all planes can be computed...


Defining the box

Define planes: Floor -> y=0, Ceiling -> y=H

Given horizon (vanishing point), corners of floor, ceiling can be computed from 2D image position
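The slides do not spell out the camera assumptions, so the following is only a minimal sketch of how a floor or ceiling corner could be recovered from its image position. It assumes a pinhole camera with focal length `f` in pixels, a horizontal optical axis, a camera at height `cam_height` above the floor, the horizon at image row `v0`, the principal point at column `u0`, and image rows increasing downward. All of these names and conventions are illustrative assumptions, not taken from the TIP paper.

```python
def floor_point_from_pixel(u, v, u0, v0, f, cam_height):
    """Back-project a pixel known to lie on the floor plane y = 0.

    Assumed projection model: u = u0 + f*X/Z, v = v0 + f*(cam_height - Y)/Z.
    """
    if v <= v0:
        raise ValueError("floor pixels must lie below the horizon (v > v0)")
    z = f * cam_height / (v - v0)   # depth along the optical axis
    x = (u - u0) * z / f            # lateral offset from the optical axis
    return (x, 0.0, z)


def ceiling_point_from_pixel(u, v, u0, v0, f, cam_height, room_height):
    """Back-project a pixel known to lie on the ceiling plane y = room_height."""
    if v >= v0:
        raise ValueError("ceiling pixels must lie above the horizon (v < v0)")
    z = f * (room_height - cam_height) / (v0 - v)
    x = (u - u0) * z / f
    return (x, room_height, z)
```

Under these assumptions, pixels closer to the horizon row map to points farther from the camera, which is why the rear wall's floor corners sit nearest the vanishing point.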


Defining the box

Once the positions of the planes are known, compute the texture of the planes
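One common way to pull a plane's texture out of the photo is to map a rectangular texture grid onto the plane's image quadrilateral with a homography. The sketch below is an assumption about how this step could be done, not the TIP authors' code; the function names are hypothetical, and the homography is fixed with h33 = 1, which works for ordinary (non-degenerate) quadrilaterals.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """3x3 homography (h33 = 1) mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, x, y):
    """Map texture coordinate (x, y) to an image position (u, v)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)
```

To fill a texture, one would loop over every texel (x, y) of the rectangle, call `apply_homography`, and sample the photo at the returned position.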


What about foreground objects?

Assume a quadrangle attached to floor, compute attachment points, upper points

Hierarchical model of foreground objects
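As a rough illustration of the attachment and upper points, the sketch below places one vertical edge of a foreground quadrangle. It assumes the same simplified camera as a typical pinhole setup (focal length `f` in pixels, horizontal optical axis, camera at `cam_height` above the floor y = 0, horizon row `v0`, principal column `u0`, rows increasing downward); these conventions and the function name are assumptions, not details from the paper.

```python
def vertical_edge_from_pixels(u, v_bottom, v_top, u0, v0, f, cam_height):
    """Place one vertical edge of a foreground quadrangle in 3D.

    The bottom pixel is assumed to touch the floor, which fixes the depth.
    The top pixel is assumed to lie directly above it at the same depth.
    Returns (attachment_point, upper_point) as (x, y, z) tuples.
    """
    if v_bottom <= v0:
        raise ValueError("attachment point must lie below the horizon")
    z = f * cam_height / (v_bottom - v0)       # depth from floor contact
    x = (u - u0) * z / f                       # lateral position
    y_top = cam_height - (v_top - v0) * z / f  # height of the upper point
    return (x, 0.0, z), (x, y_top, z)
```

Calling this once for each side of the quadrangle gives its four 3D corners. A child object in the hierarchy could be handled the same way, with its parent quadrangle playing the role of the supporting surface.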


Extracting foreground objects

Foreground objects removed, added to mask

Holes in background filled in using photo completion software


TIP Demonstration

TIP Discussion

Pros:

Accurate model (due to human input)

Deals with foreground objects, occlusions

Cons:

Requires human input, not automatic

Model too simple for many real-world scenes

Modelling the scene

Option 2: Pop-up book world

Automatic Photo Pop-Up

Three classes of surface: ground, sky, vertical

Not just a box: can model more kinds of scenes

Automatic classification, no labeling

D. Hoiem, A.A. Efros, and M. Hebert, "Automatic Photo Pop-up", ACM SIGGRAPH 2005.

Photo Pop-Up Implementation

Pixels -> superpixels -> constellations

Automatic labeling of constellations as ground, vertical, or sky

Define angles of vertical planes (using attachment to ground)

Map textures to vertical planes (as in TIP)


Superpixels, constellations

Superpixels are groups of neighboring pixels that have nearly the same color (Tao et al., 2001)

Superpixels assigned to constellations according to how likely they are to share a label (ground, vertical, sky) based on difference between feature vectors
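The grouping step can be sketched as a greedy assignment. The version below assumes a precomputed matrix `pairwise[i][j]` giving the estimated probability that superpixels i and j share a label; it seeds each constellation with one superpixel and then adds every remaining superpixel to the constellation with the highest mean log-likelihood. The function names, the `order` parameter, and the clamping constant are illustrative assumptions, and the real system may differ in its details.

```python
import math
import random


def mean_log_likelihood(i, group, pairwise):
    """Average log-probability that superpixel i shares a label with the group."""
    return sum(math.log(max(pairwise[i][j], 1e-9)) for j in group) / len(group)


def form_constellations(pairwise, n_constellations, order=None, seed=0):
    """Greedily group superpixels into n_constellations constellations.

    order: optional visiting order; the first n_constellations entries seed
    the groups. If omitted, a random order is drawn from `seed`.
    """
    if order is None:
        order = list(range(len(pairwise)))
        random.Random(seed).shuffle(order)
    groups = [[i] for i in order[:n_constellations]]
    for i in order[n_constellations:]:
        best = max(groups, key=lambda g: mean_log_likelihood(i, g, pairwise))
        best.append(i)
    return groups
```

Because the result depends on the seeds and the visiting order, running this several times with different orders and different numbers of constellations yields a set of alternative groupings rather than a single segmentation.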

Feature vectors

Color features: RGB, hue, saturation

Texture features: Difference of oriented Gaussians, Textons

Location (absolute and percentile)

Number of superpixels in constellation

Line and intersection detectors

Not used: constellation shape (contiguous, number of sides), some texture features
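To make the color and location entries concrete, here is a small sketch that computes a few of them for one superpixel. The input layout (a list of `(x, y, (r, g, b))` tuples), the choice of 10th and 90th percentiles, and taking hue and saturation from the mean color are all assumptions made for illustration. Texture, line, and intersection features are omitted.

```python
import colorsys


def superpixel_features(pixels, image_width, image_height):
    """Color and location features for one superpixel.

    pixels: list of (x, y, (r, g, b)) with r, g, b in 0..255.
    """
    n = len(pixels)
    mean_rgb = [sum(p[2][c] for p in pixels) / (255.0 * n) for c in range(3)]
    hue, saturation, _ = colorsys.rgb_to_hsv(*mean_rgb)

    # Normalize positions so features do not depend on image size.
    xs = sorted(p[0] / image_width for p in pixels)
    ys = sorted(p[1] / image_height for p in pixels)

    def percentile(sorted_values, q):
        return sorted_values[min(n - 1, int(q * n))]

    return {
        "mean_rgb": mean_rgb,
        "hue": hue,
        "saturation": saturation,
        "mean_x": sum(xs) / n,
        "mean_y": sum(ys) / n,
        "y_10th": percentile(ys, 0.10),
        "y_90th": percentile(ys, 0.90),
    }
```

Vertical position is a strong cue in this setting: regions near the bottom of the image tend to be ground and regions near the top tend to be sky.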

Training process

For each of 82 labeled training images:

Compute superpixels, features, pairwise likelihoods

Form a set of N constellations (N = 3 to 25), each labeled with ground truth

Compute constellation features

Compute constellation label, homogeneity likelihood:

Training process

Adaboost weak classifiers learn to estimate whether superpixels have same label (based on feature vector)

Another set of Adaboost weak classifiers learns constellation label, homogeneity likelihood (expressed as percent ground, vertical, sky, mixed)

Emphasis on classifying larger constellations
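For readers unfamiliar with boosting, the following is a textbook discrete AdaBoost with decision stumps, applied to the "same label or not" question. It is a simplified stand-in: the slides do not say which AdaBoost variant or weak learner was used, and the feature here (a single absolute feature difference per superpixel pair) is an assumption for illustration. Labels are +1 (same label) and -1 (different label).

```python
import math


def stump_predict(x, feature, threshold, polarity):
    """Decision stump: returns polarity if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Find the stump with the lowest weighted error by exhaustive search."""
    best = None
    for feature in range(len(xs[0])):
        for threshold in sorted({x[feature] for x in xs}):
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, feature, threshold, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=10):
    """Train an ensemble of (alpha, feature, threshold, polarity) stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this stump
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, feature, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, feature, threshold, polarity)
                for alpha, feature, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1
```

The raw weighted score, rather than its sign, is what one would convert into a likelihood when a soft estimate is needed.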

Building the 3D model

Along vertical/ground boundary, fit line segments (Hough transform) – goal is to find simplest shape (fewest lines)

Project lines up from corners of boundary lines, cut and fold
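The line-fitting step can be illustrated with a bare-bones Hough transform. The sketch below votes in (rho, theta) space for a set of boundary pixels and returns the single strongest line in the normal form x cos(theta) + y sin(theta) = rho. The bin sizes and function name are assumptions, and a full implementation would extract several segments and prefer the solution with the fewest lines.

```python
import math


def hough_best_line(points, n_theta=180, rho_step=1.0):
    """Return (rho, theta, votes) for the line supported by the most points."""
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (round(rho / rho_step), t)   # quantize rho into bins
            votes[key] = votes.get(key, 0) + 1
    (rho_bin, t), count = max(votes.items(), key=lambda kv: kv[1])
    return rho_bin * rho_step, math.pi * t / n_theta, count
```

Each boundary pixel votes for every line that could pass through it, so pixels lying along one straight stretch of the ground/vertical boundary pile their votes into the same bin while stray pixels do not.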


Photo Pop-Up Demonstration


Photo Pop-Up Discussion

Pros:

Automatic

Can handle a variety of scenes, not just boxes

Cons:

No handling of foreground objects

Misclassification leads to very strange models

Only 2 kinds of surface: ground, vertical


Modelling the scene

Option 3: Actually try to model surface angles

3D Scene Structure from Still Image

Compute surface normal for each surface

No right-angle assumptions; surfaces can have any angle

Automatic (trained on images with known depth maps)

3D Scene Implementation

Segment image into superpixels

Estimate surface normal of each superpixel (using Markov Random Field model)

Optional: Detect and extract foreground objects

Map textures to planes

[Figure: original image and modeled depth map]

A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image". In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007

Image features

Superpixel features (x_i): color and texture features as in Photo Pop-Up; vector also includes features of neighboring superpixels

Boundary features (x_ij): color difference, texture difference, edge detector

Markov Random Field Model

First term: model planes in terms of image features of superpixels

Second term: model planes in terms of pairs of superpixels, with constraints...
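The model equations did not survive in this transcript, but the plane representation behind them can be sketched. The code below assumes each superpixel's plane is stored as a 3-vector `alpha` such that every 3D point q on the plane satisfies alpha . q = 1, with the camera centre at the origin. Under that assumption the plane normal is alpha / |alpha|, the plane's distance from the camera is 1 / |alpha|, and the depth along a unit viewing ray R is 1 / (R . alpha). The function names are hypothetical.

```python
import math


def plane_normal_and_distance(alpha):
    """Unit normal and camera-to-plane distance for the plane alpha . q = 1."""
    magnitude = math.sqrt(sum(a * a for a in alpha))
    return [a / magnitude for a in alpha], 1.0 / magnitude


def depth_along_ray(alpha, ray):
    """Distance from the camera centre to the plane along the given ray."""
    norm = math.sqrt(sum(r * r for r in ray))
    unit = [r / norm for r in ray]
    denom = sum(a * r for a, r in zip(alpha, unit))
    if denom <= 0:
        raise ValueError("ray does not hit the plane in front of the camera")
    return 1.0 / denom
```

With this representation, predicting a depth map amounts to choosing one `alpha` per superpixel, and both terms of the model can be written as penalties on those vectors.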


Model constraints

Connected structure: except where there is an occlusion, neighboring superpixels are likely to be connected

Coplanar structure: except where there are folds, neighboring superpixels are likely to lie on the same plane

Co-linearity: long straight lines in the image correspond to straight lines in 3D
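As one illustration of how such a constraint might be scored, the sketch below penalizes the gap in depth between two neighboring superpixel planes at a point on their shared boundary. It assumes planes are given as 3-vectors `alpha` with alpha . q = 1 for points q on the plane, and uses a hypothetical weight `y_ij` in [0, 1] that is set to 0 where an occlusion is believed to separate the two superpixels. The exact penalty form is an assumption, not the paper's formula.

```python
import math


def depth(alpha, unit_ray):
    """Depth along a unit ray to the plane alpha . q = 1 (camera at origin)."""
    return 1.0 / sum(a * r for a, r in zip(alpha, unit_ray))


def connectivity_penalty(alpha_i, alpha_j, unit_ray, y_ij=1.0):
    """Relative depth gap between two planes along a shared boundary ray.

    y_ij = 0 switches the penalty off (occlusion boundary); y_ij = 1 applies
    it in full, which pulls the two planes together at that ray.
    """
    d_i = depth(alpha_i, unit_ray)
    d_j = depth(alpha_j, unit_ray)
    return y_ij * abs(d_i - d_j) / math.sqrt(d_i * d_j)
```

A coplanarity penalty could reuse the same function with the ray taken through the interior of one superpixel instead of the shared boundary, so that the two planes must agree away from the boundary as well.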

Foreground objects

Automatically-detected foreground objects may be removed from model (for example: pedestrians, using Dalal & Triggs detector)

Detected objects add 3D cues (pedestrians are basically vertical, occlude other surfaces)

3D Scene Demonstration

Results

A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene Structure from a Single Still Image". In ICCV workshop on 3D Representation for Recognition (3dRR-07), 2007

3D Scene Discussion

Pros:

Handles a variety of scene types

Fairly accurate (about 2/3 of scenes correct)

Automatic

Handles foreground objects

Cons:

Still fails on 1/3 of scenes

Discussion

Simple 3D models are adequate for many scenes

You can get pretty far without human input (but human annotation of scenes would still give better results)

Extensions?

Use photo completion techniques to handle occlusions?

Massive training sets -> better 3D models?
