Training Image Classifiers with Similarity Metrics, Linear Programming, and Minimal Supervision


Transcript of Training Image Classifiers with Similarity Metrics, Linear Programming, and Minimal Supervision

Page 1: Training Image Classifiers with Similarity Metrics, Linear Programming, and Minimal Supervision


Training Image Classifiers with Similarity Metrics, Linear Programming, and Minimal Supervision

Asilomar SSC

Karl Ni, Ethan Phelps, Katherine Bouman, Nadya Bliss
Lincoln Laboratory, Massachusetts Institute of Technology

2 November 2012

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Page 2: Applying Semantic Understanding of Images

• What can a computer understand?
  – Who? What? When? Where?
• Computer vision algorithms
  – Image retrieval
  – Robotic navigation
  – Semantic labeling
  – Image sketch
  – Structure from Motion
  – Image localization
• Requires: some prior knowledge
  – Training data: query by example, statistical model, query by sketch

[Diagram: training data feeds a pipeline of Feature Extraction, then Matching & Association, then a classifier decision.]

Page 3: Training Framework

[Diagram: OFFLINE SETUP: multi-modal sources (metadata, graphs, point clouds, distributions, terrain, etc.) are fused into a world model. EXPLOITATION: ground and aerial imagery and video are processed by localization algorithms (Feature Extraction, then Matching & Association) against the world model to produce a location.]

Page 4: Outline

• Introduction
• Feature Pruning Background
• Matched Filter Training
• Results
• Summary

Page 5: Finding the Features of an Image

• Problems in image pattern matching
  – Each image = 10 million pixels!
  – Most dimensions are irrelevant
  – Multiple concepts inside the image
• Features are a quantitative way for machines to understand an image
• Image property and its feature technique (an example appears after this list):
  – Local color (luma + chroma histograms)
  – Object texture (DCT, local & normalized)
  – Shape (curvelets, shapelets)
  – Lower-level gradients (DWT: Haar, Daubechies)
  – Higher-level descriptors (SIFT/SURF/HoG, etc.)
  – Scene descriptors (GIST), Torralba et al.
• Typical chain: Feature Extraction, then Training / Classifier
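As a concrete illustration of two entries in this list, the sketch below computes a luma histogram (local color) and block-DCT texture coefficients. It is a minimal example under our own assumptions (numpy/scipy, an RGB image as floats in [0, 1]), not the authors' pipeline.

```python
# Minimal sketch of two listed feature types: luma histogram + block DCT.
import numpy as np
from scipy.fft import dctn

def luma_histogram(rgb, bins=32):
    # ITU-R BT.601 luma approximation of local color
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    hist, _ = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
    return hist

def block_dct_texture(gray, block=8):
    h, w = gray.shape
    h, w = h - h % block, w - w % block          # crop to whole blocks
    blocks = gray[:h, :w].reshape(h // block, block, w // block, block)
    coeffs = dctn(blocks, axes=(1, 3), norm="ortho")   # 2-D DCT per block
    feats = np.abs(coeffs).mean(axis=(0, 2)).ravel()   # average block spectrum
    return feats / np.linalg.norm(feats)               # unit-normalized

img = np.random.rand(64, 64, 3)                        # stand-in image
x = np.concatenate([luma_histogram(img), block_dct_texture(img.mean(axis=2))])
```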

Page 6: Numerous Features: Only a Subset Is Relevant

[Three example scenes, each with different salient features:]

• Features are: red bricks on multiple buildings; small hedges, etc.; windows of a certain type; the types of buildings that are there
• Features are: arches and white buildings; domes and ancient architecture; older/speckled materials (higher-frequency image content)
• Features are: more suburb-like; larger roads; drier vegetation; shorter houses

Choice of features requires looking at multiple semantic concepts defined by entities and attributes inside of images.

Feature Extraction, then Training / Classifier

Page 7: Feature Descriptors

• Most of the features are irrelevant
  – Large dimensionality and algorithmic complexity
  – Feature invariance to transformations, content, and context holds only to an extent (e.g., SIFT, RIFT, etc.)
• Keep a small number of salient features and discard the large number of nondescriptive features
  – Simplifies the classifier (both computation & supervision)
  – Multiple instances of several features describe the same object
• Requires a high level of abstraction
  – Visual similarity does not always correlate with "semantic" similarity

Feature Extraction, then Training / Classifier

Brown et al., Lowe et al., Ng et al., Thrun et al.

Page 8: Getting the Right Features

• Tools to hand-label concepts, 2006-2011
  – Google Image Labeler
  – Kobus's Corel Dataset
  – MIT LabelMe
  – Yahoo! Games
• Problems
  – Tedious
  – Time consuming
  – Incorrect
  – Very low throughput
• Famous algorithms
  – Parallelizable
  – Not generalizable, unfortunately

[Example hand labeling: 1. Chair, 2. Table, 3. Road, 4. Road, 5. Table, 6. Car, 7. Keyboard. People can't be flying or walking on billboards!]

Page 9: Automatically Learn the Best Features

• Segmentation is a difficult manual task
• Multiple semantic concepts per single image
• Considerable amounts of noise, most often irrelevant to any concept

[Figure: an image expressed as a point on a "semantic simplex" over concepts: Concept 1 (e.g., sky), Concept 2 (e.g., mountain), Concept 3 (e.g., river), with mixture weights such as (0.2, 0.3, 0.05, ..., 0.2).]

Kwitt et al. (Kitware)

Page 10: Leveraging Related Work

• Lots of work in the 1990s
  – Conditional probabilities through large training data sets
  – Motivated by the query-by-example and query-by-sketch problems (not Zloof's IBM Query by Example for relational databases, but, e.g., Ballerini et al.)
  – Primarily based on multiple-instance learning and noisy density estimation
• Learning multiple instances of an object (the no-noise case): Dietterich et al., Keeler et al.
• Robustness to noise through the law of large numbers
  – Hope to integrate the noise out
  – Although the area of red boxes per instance is small, their aggregate over all instances is dominant
  – Noise, if uncorrelated, will become more and more sparse

Page 11: Parallel Calculations through Hierarchies

• Feature clustering in the large
• Mixture hierarchies can be incrementally trained (Vasconcelos et al.)

[Diagram: for each image class 1..N, feature distributions are estimated over entire training images; lower-level GMMs are fit per image, in parallel (Lincoln Laboratory GRID processing), and merged into a top-level GMM.]

Automatic feature subselection has been submitted to SSP 2012.
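The two-level structure can be sketched as follows. This is a minimal sketch under our own assumptions (scikit-learn's GaussianMixture, and a simplified merge that fits the top level directly on pooled lower-level component means); the incremental merging of mixture hierarchies in Vasconcelos et al. is more involved.

```python
# Two-level mixture-hierarchy sketch (a simplification, not the authors'
# implementation): fit one small GMM per image independently (this step
# parallelizes), then fit the class-level GMM on pooled component means.
import numpy as np
from sklearn.mixture import GaussianMixture

def lower_level_gmms(per_image_feats, k=4):
    # per_image_feats: list of (n_i x d) feature arrays, one per image
    return [GaussianMixture(n_components=k, covariance_type="diag",
                            random_state=0).fit(f) for f in per_image_feats]

def top_level_gmm(gmms, k=5):
    means = np.vstack([g.means_ for g in gmms])   # pooled component means
    return GaussianMixture(n_components=k, covariance_type="diag",
                           random_state=0).fit(means)

rng = np.random.default_rng(0)
images = [rng.normal(size=(200, 8)) for _ in range(6)]  # six images' features
class_model = top_level_gmm(lower_level_gmms(images))
```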

Page 12: Outline

• Introduction
• Feature Pruning Background
• Matched Filter Training
• Results
• Summary

Page 13: Finding a Sparse Basis Set

• Hierarchical Gaussian mixtures as a density estimate
  – Small-sample bias is large
  – Non-convex / sensitive to initialization
  – Extensive computational process to bring hierarchies together
  – Each level requires supervision (# classes, initialization, etc.)
• Think discriminatively:
  – Instead of: generating centroids that represent images
  – Think: prune features to eliminate redundancy
• Sparsity optimization
  – Solve directly for the features that we want to use
  – Reduction of redundancy is intuitive and not generative
• Under normalization, the GMM classifier can be implemented with a matched filter instead:

  $\arg\min_{c \in \{1,\dots,C\}} \|x - y_c\|_2^2 \;\xrightarrow{\text{normalize}}\; \arg\max_{c \in \{1,\dots,C\}} \langle x, y_c \rangle$
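The equivalence follows from a one-line expansion (a standard identity, added here for completeness) once $x$ and the centroids $y_c$ are normalized to unit length:

```latex
\|x - y_c\|_2^2
  = \|x\|_2^2 - 2\langle x, y_c\rangle + \|y_c\|_2^2
  = 2 - 2\langle x, y_c\rangle,
\qquad\text{so}\qquad
\arg\min_{c} \|x - y_c\|_2^2 \;=\; \arg\max_{c} \langle x, y_c\rangle .
```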

Page 14: A Note on Notation

• Let $x^{(j)}$ denote the jth feature in the training set, where $x_i^{(j)}$ is the ith dimension of that feature:

  $x^{(j)} = \begin{bmatrix} x_1^{(j)} & x_2^{(j)} & \cdots & x_d^{(j)} \end{bmatrix}^T$

• Let X be a d x N matrix that represents the collection of all the features, where the jth column of X is the feature vector $x^{(j)}$:

  $X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(N)} \end{bmatrix}$
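In code, the convention reads as follows (a tiny illustration with made-up numbers; the indexing is 0-based where the slides are 1-based):

```python
# X is d x N: each column is one feature vector x^(j); X[i, j] is x_i^(j).
import numpy as np

d, N = 4, 3
X = np.arange(d * N, dtype=float).reshape(d, N)
x_2 = X[:, 1]     # the feature vector x^(2) (column j = 2, 0-indexed as 1)
x_3_2 = X[2, 1]   # its third dimension, x_3^(2)
```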

Page 15: Finding Sparsity with Linear Programming

• Gaussian mixture models, solved via EM (a non-convex optimization problem):

  $\min_{\pi_1,\dots,\pi_M,\,\mu_1,\dots,\mu_M} \; -\sum_{j=1}^{N} \log \sum_{m=1}^{M} \pi_m \, p(x^{(j)} \mid \mu_m, \Sigma_m)$

• Many optimization problems induce sparsity. Group Lasso:

  $\arg\min_{\beta} \; \|X - X\beta\|_2^2 + \lambda \sum_{j} \|\beta_{:,j}\|_2$

• Matched filter constraint: max-constraint optimization (not convex):

  $\arg\max_{\beta} \; \operatorname{tr}(X^T X \beta) \quad \text{s.t. } \beta_{ij} \in \{0,1\} \text{ and } \mathbf{1}^T \beta_{:,j} = 1 \;\forall j$

• Relaxation of constraints gives the LP optimization problem:

  $t^* = \arg\min_{t} \; \operatorname{tr}\big(X^T X \operatorname{diag}(t)\big) \quad \text{s.t. } 0 \le \beta_{ij} \le t_i \le 1 \;\forall j \text{ and } \mathbf{1}^T t = 1$

  – Faster than Group Lasso
  – Independent of dimensionality!
  – Convex (unlike the matched-filter optimization and GMM/EM)
  – On average, scales as N^2

Feature Extraction, then Training / Classifier
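A minimal numerical sketch of the LP above, under assumptions that go beyond the slides: scipy's linprog as the solver, unit-normalized feature columns, the per-row constraints collapsed to their binding value (the ℓ∞-norm of each off-diagonal row of β, as on the next slide), and the simplex constraint left commented out because it can be infeasible for arbitrary data.

```python
# Sketch of the LP feature-pruning relaxation with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

def lp_feature_scores(X):
    """X: d x N matrix whose columns are unit-normalized features."""
    beta = X.T @ X                        # similarity (Gram) matrix, N x N
    N = beta.shape[0]
    c = np.diag(beta).copy()              # tr(X^T X diag(t)) = sum_i beta_ii t_i
    off = np.abs(beta - np.diag(np.diag(beta)))   # abs(): our assumption
    # beta_ij <= t_i for j != i  <=>  -t_i <= -max_{j != i} beta_ij
    A_ub, b_ub = -np.eye(N), -off.max(axis=1)
    # A_eq, b_eq = np.ones((1, N)), np.array([1.0])   # the 1^T t = 1 constraint
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, 1.0)] * N, method="highs")
    return res.x                          # t_i: l-inf norm of row i of beta

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 12))
X /= np.linalg.norm(X, axis=0)            # normalize columns
t = lp_feature_scores(X)
keep = np.argsort(t)[:5]                  # least-redundant features (assumed rule)
print(np.round(t, 3), keep)
```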

Page 16: Intuition

• Relies on the similarity-matrix concept: in the LP

  $t^* = \arg\min_{t} \; \operatorname{tr}\big(X^T X \operatorname{diag}(t)\big) \quad \text{s.t. } 0 \le \beta_{ij} \le t_i \le 1 \text{ and } \mathbf{1}^T t = 1,$

  each threshold $t_i$ upper-bounds row i of the similarity matrix, i.e., the ℓ∞-norm of the rows of $\beta = X^T X$:

  $\beta = X^T X = \begin{bmatrix} 1 & .95 & .1 & .01 \\ .95 & 1 & .2 & 0 \\ .1 & .2 & 1 & .98 \\ .01 & 0 & .98 & 1 \end{bmatrix} \quad \begin{matrix} < t_1 \\ < t_2 \\ < t_3 \\ < t_4 \end{matrix}$

• The actual implementation does not form the similarity matrix explicitly, but rather keeps track of the β indices.
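Numerically, the row/threshold picture reads as follows. This is a tiny illustration; whether the unit diagonal participates in the bound is not recoverable from the slide, so it is excluded here so that $t_i$ measures redundancy with the other features.

```python
# Each t_i upper-bounds row i of the similarity matrix beta = X^T X.
import numpy as np

beta = np.array([[1.00, 0.95, 0.10, 0.01],
                 [0.95, 1.00, 0.20, 0.00],
                 [0.10, 0.20, 1.00, 0.98],
                 [0.01, 0.00, 0.98, 1.00]])
t = (beta - np.eye(4)).max(axis=1)   # l-inf norm of each off-diagonal row
print(t)                             # [0.95 0.95 0.98 0.98]: two redundant pairs
```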

Page 17: Nonlinear Feature Selection

• The optimization problem consists solely of dot products, i.e., of a similarity function evaluated between the training features:

  $t^* = \arg\min_{t} \; \operatorname{tr}\big(X^T X \operatorname{diag}(t)\big) \quad \text{s.t. } 0 \le \beta_{ij} \le t_i \le 1 \text{ and } \mathbf{1}^T t = 1$

• Nonlinearity may be introduced through a kernel function (an RKHS), which induces a vector space whose mapping we do not necessarily need to know: replace $\beta_{ij} = \langle x^{(i)}, x^{(j)} \rangle$ with $\beta_{ij} = \kappa(x^{(i)}, x^{(j)})$.
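A sketch of the kernelized variant, assuming an RBF kernel (our choice, not stated on the slide) and the same simplified LP as on page 15:

```python
# Kernelized variant: the Gram matrix X^T X is replaced by an RBF kernel
# matrix; the LP itself is unchanged.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import pdist, squareform

def kernel_feature_scores(X, gamma=1.0):
    K = np.exp(-gamma * squareform(pdist(X.T, "sqeuclidean")))  # N x N, K_ii = 1
    N = K.shape[0]
    off = K - np.diag(np.diag(K))        # RBF entries are nonnegative
    res = linprog(np.diag(K).copy(), A_ub=-np.eye(N), b_ub=-off.max(axis=1),
                  bounds=[(0.0, 1.0)] * N, method="highs")
    return res.x

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 12))
t = kernel_feature_scores(X)
```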

Page 18: Application to Classification

TRAINING: Feature Extraction, then solve the LP per class; the surviving features are the matched filters (= BEST FEATURES):

  $t^* = \arg\min_{t} \; \operatorname{tr}\big(X^T X \operatorname{diag}(t)\big) \quad \text{s.t. } 0 \le \beta_{ij} \le t_i \le 1 \text{ and } \mathbf{1}^T t = 1$

QUERY: Feature Extraction, then Matching & Association against each class's selected features, classifying the image with a confidence.

• Just a faster way to classify imagery in one-versus-all frameworks.
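A sketch of the one-versus-all matched-filter classifier implied above: each class keeps its selected (unit-normalized) features, and a query image is scored by dot products against each class's filters. The aggregation rule here (mean of per-feature maxima) is our assumption, not the slides'.

```python
# One-versus-all matched-filter classification sketch.
import numpy as np

def classify(query_feats, class_filters):
    """query_feats: m x d array; class_filters: dict name -> k x d array."""
    scores = {}
    for name, F in class_filters.items():
        sims = query_feats @ F.T            # matched-filter responses, m x k
        scores[name] = sims.max(axis=1).mean()
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
# 356 filters per class, matching the reduction reported on page 23
filters = {c: rng.normal(size=(356, 16)) for c in ["Vienna", "Dubrovnik"]}
for F in filters.values():
    F /= np.linalg.norm(F, axis=1, keepdims=True)
q = rng.normal(size=(20, 16))
q /= np.linalg.norm(q, axis=1, keepdims=True)
label, confidences = classify(q, filters)
```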

Page 19: Outline

• Introduction
• Feature Pruning Background
• Matched Filter Training
• Results
• Summary

Page 20: LP Feature Learning versus Group Lasso

• More intuitive grouping
  – Threshold learning is unnecessary
  – Post-processing is unnecessary
• 5.452% more accurate on binary (+1/-1) learning classes

Page 21: Segmentation and Classification Visual Result

[Figure: an original image alongside the per-segment classifier decisions.]

Page 22: An Interesting Automatic Semantic Learning Result

Page 23: Application to Localization

Classification rates (rows: testing dataset; columns: training dataset):

               MIT-Kendall   Vienna   Dubrovnik   Lubbock
  MIT-Kendall     0.975      0.056      0.024      0.102
  Vienna          0.050      0.896      0.035      0.060
  Dubrovnik       0.015      0.024      0.905      0.057
  Lubbock         0.097      0.002      0.053      0.901

• 1400 images per dataset
• Filter reduction to 356 filters per class
• Less than a minute of classification time
• Coverage of cities: entire cities (Vienna, Dubrovnik, Lubbock); a portion of Cambridge (MIT-Kendall)

Page 24: Summary

• Accurate modeling must occur before we have any hope of classifying images
• Feature pruning is equivalent to Gaussian centroid determination under normalization
• Sparse optimization enables feature pruning and matched-filter creation
• The sparse optimization contains only dot products, so the optimization can occur in an RKHS in the transductive setting

Page 25: References

• K. Ni, E. Phelps, K. L. Bouman, and N. Bliss, "Image Feature Selection via Linear Programming," to appear at the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, October 2012 (Asilomar '12).
• S. M. Sawyer, K. Ni, and N. T. Bliss, "Cluster-based 3D Reconstruction of Aerial Video," to appear at the 1st IEEE High Performance Extreme Computing Conference, Waltham, MA, September 2012 (HPEC '12).
• H. Viggh and K. Ni, "SIFT Based Localization Using Prior World Model for Robotic Navigation in Urban Environments," to appear at the 16th International Conference on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, NV, 2012 (IPCV '12).
• K. Ni, Z. Sun, and N. Bliss, "Real-time Global Motion Blur Detection," to appear at the IEEE International Conference on Image Processing, Orlando, FL, 2012 (ICIP '12).
• N. Arcalano, K. Ni, B. Miller, N. Bliss, and P. Wolfe, "Moments of Parameter Estimates for Chung-Lu Random Graph Models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 2012 (ICASSP '12).
• A. Vasile, L. Skelly, K. Ni, R. Heinrichs, O. Camps, and M. Sznaier, "Efficient City-sized 3D Reconstruction from Ultra-High Resolution Aerial and Ground Video Imagery," Proceedings of the IEEE International Symposium on Visual Computing, Las Vegas, NV, 2011, pp. 347-358 (ISVC '11).
• K. Ni, Z. Sun, and N. Bliss, "3-D Image Geo-Registration Using Vision-Based Modeling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Prague, Czech Republic, 2011, pp. 1573-1576 (ICASSP '11).
• K. Ni and T. Q. Nguyen, "Empirical Type-I Filter Design for Image Interpolation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 866-869 (ICASSP '10).
• Z. Sun, N. Bliss, and K. Ni, "A 3-D Feature Model for Image Matching," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 2194-2197 (ICASSP '10).
• K. Ni, Z. Sun, N. Bliss, and N. Snavely, "Construction and Exploitation of a 3D Model from 2D Image Features," Proceedings of the SPIE International Conference on Electronic Imaging, Inverse Problems Session, Vol. 7533, San Jose, CA, January 2010 (SPIE '10).

Page 26: Contributors and Acknowledgements

• MIT Lincoln Laboratory
  – Karl Ni
  – Nicholas Armstrong-Crews
  – Scott Sawyer
  – Nadya Bliss
• MIT
  – Katherine L. Bouman
• Boston University
  – Zachary Sun
• Northeastern University
  – Alexandru Vasile
• Cornell University
  – Noah Snavely

Page 27: Questions?

Page 28: Backup