Evaluation of Visual Bag of Features: Theory of Operation, Performance Tests, and Proposed Enhancements

Michael Shawn Quinn
University of Washington, Bothell, WA
Abstract – Bag of Features, or Bag of Keypoints as it was originally named, has become a popular technique for multi-class object detection in computer image analysis. This paper reports on a specific implementation of "Bag of Features" that compares three popular machine learning techniques for multi-class supervised learning. It also outlines proposed enhancements that add geometric, spatial, and semantic constraints to the algorithm.

Keywords: bag of features, bag of keypoints, bag of words
1 Introduction
Digital images pervade all aspects of daily life and frequently contain information important and useful to private individuals, businesses, and government entities. Significant development effort has been devoted to devising efficient, automated techniques for extracting that information from images.
Recent developments in object identification have focused on designing algorithms that are general, in that they can be applied to any object class, while remaining computationally efficient when operating on images containing multiple objects of varying scale and orientation in the presence of noise.
Taking a cue from the bag-of-words technique used in text classification, one popular recent approach to object identification and classification involves constructing a bag-of-features, or, as (1) calls it, a bag-of-keypoints. The method uses fast feature detectors and unsupervised learning to build up a vocabulary, or dictionary, of key image features for each object class. The dictionary can then be used to train any common supervised learning algorithm.
2 Previous Work
Csurka et al. (1) proposed the bag-of-keypoints method and presented results for naïve Bayes and support vector machine (SVM) classifiers. The results indicated generally good performance when evaluated against seven diverse object classes. The motivation for its development was the perceived need for generalized object classification without imposing spatial or geometric constraints on the object descriptors.
A bag-of-keypoints or bag-of-features method uses a scale-invariant feature detector, SIFT in the case of (1), to gather a large collection of descriptors from the keypoints of a set of sample images representing the object classes of interest. The descriptors are then clustered into features by unsupervised k-means clustering; the center of each cluster represents one word in a dictionary of visual words. For each class of interest, a histogram of the frequency of occurrence of each dictionary word is produced. The histograms of unknown images are then matched against the histograms of the trained classes to predict the class of the unknown object.
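Once the dictionary exists, producing the per-image histogram amounts to assigning each descriptor to its nearest cluster center and counting. The sketch below illustrates that step only; for readability the descriptor dimension is reduced from SIFT's 128 to 2, and all names are illustrative rather than taken from any library.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in for a SIFT descriptor, reduced to 2 dimensions for illustration.
using Descriptor = std::array<double, 2>;

// Index of the nearest cluster center (visual word) for one descriptor,
// by squared Euclidean distance.
std::size_t nearestWord(const Descriptor& d,
                        const std::vector<Descriptor>& vocabulary) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t w = 0; w < vocabulary.size(); ++w) {
        double dist = 0.0;
        for (std::size_t i = 0; i < d.size(); ++i) {
            double diff = d[i] - vocabulary[w][i];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = w; }
    }
    return best;
}

// Normalized histogram of visual-word frequencies for one image.
std::vector<double> wordHistogram(const std::vector<Descriptor>& descriptors,
                                  const std::vector<Descriptor>& vocabulary) {
    std::vector<double> hist(vocabulary.size(), 0.0);
    if (descriptors.empty()) return hist;  // avoid dividing by zero
    for (const Descriptor& d : descriptors)
        hist[nearestWord(d, vocabulary)] += 1.0;
    for (double& h : hist) h /= descriptors.size();
    return hist;
}
```

With a two-word vocabulary, an image whose descriptors split evenly between the two clusters yields the histogram {0.5, 0.5}; it is these histograms, not the raw descriptors, that the supervised classifiers are trained on.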
3 Limitations and Possible Enhancements
SIFT features have been shown to be scale and affine invariant (2). This allows the user to classify a set of images without re-sizing, rotating, stretching, shrinking, shearing, or cropping them during pre-processing. However, the resulting clusters of descriptors, whose centers correspond to the vocabulary of visual words, contain no object localization information within the image, nor is there any spatial or geometric dependence between features.
As a result, the bag-of-features technique is susceptible to pixel noise in the image, particularly from pixels near the object of interest, and it returns no information about the relationship between the object and the background or other objects in the image scene.
To overcome these limitations, alternative approaches have been suggested. Cascades of Haar-like features (3) are a popular technique with open-source implementations available. They build up complex geometric classifiers from simple geometric building blocks, and the process of cascading successively larger kernel windows over the image also eliminates much of the background pixel noise, since noise pixels are rejected at each stage of the cascade. Unfortunately, existing Haar-cascade implementations are inherently single-class classifiers and are not readily adaptable to multi-class supervised learning.
Spatial pyramid matching (4) is another cascade-type approach, combining pyramid or cascade matching of image sections with feature clustering within each section. More recently, a bag-of-phrases approach (5) extends the bag-of-features by using spatial constraints inherent in SIFT features to impose spatial relationships between detected features in unknown images.
4 Algorithm
The bag-of-features algorithm consists of the following steps:
1. Build path to image files.
2. Divide images into training sets and evaluation sets.
Training
3. For each training image:
(a) Detect keypoints in training image using SURF.
(b) Extract or compute a descriptor from keypoints using SIFT.
(c) Cluster the descriptors into features or visual words using K-means.
(d) Add features to bag-of-features dictionary.
4. Add class labels to vector of class labels.
5. Train the model with the bag-of-features.
Classification
6. For each evaluation image:
(a) Detect keypoints in the evaluation image using SURF.
(b) Extract or compute a descriptor from the keypoints using SIFT.
(c) Predict the class of the image using the descriptor.
(d) Add the prediction to the confusion matrix.
Data Presentation
7. Calculate the prediction error.
8. Display the confusion matrix.
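The clustering in step 3(c) can be sketched as plain Lloyd's k-means. The sketch below is illustrative only: it operates on 2-D points rather than 128-D SIFT descriptors, takes its initial centers as an argument, and runs a fixed number of iterations, whereas the study used OpenCV's k-means implementation.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<double, 2>;  // stand-in for a 128-D SIFT descriptor

// Lloyd's k-means: assign each descriptor to its nearest center, then move
// each center to the mean of its assigned descriptors. The final centers
// are the "visual words" of the dictionary.
std::vector<Point> kmeans(const std::vector<Point>& data,
                          std::vector<Point> centers, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        std::vector<Point> sums(centers.size(), Point{0.0, 0.0});
        std::vector<int> counts(centers.size(), 0);
        for (const Point& p : data) {
            // Find the nearest center by squared Euclidean distance.
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centers.size(); ++c) {
                double dx = p[0] - centers[c][0], dy = p[1] - centers[c][1];
                double dist = dx * dx + dy * dy;
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            ++counts[best];
        }
        // Move each center to the centroid of its assigned points.
        for (std::size_t c = 0; c < centers.size(); ++c)
            if (counts[c] > 0)
                centers[c] = {sums[c][0] / counts[c], sums[c][1] / counts[c]};
    }
    return centers;
}
```

In the study itself this step runs over all training descriptors at once with 1500 clusters, so the dictionary is shared across classes; only the per-class histograms differ.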
5 Implementation
The OpenCV library version 3.0 (6) was used extensively in the code implementation. Development was done on Windows 10 using Visual Studio 2013 Professional in C++. An existing solution implementing the basic bag-of-words algorithm was used as a starting point (7). That implementation focused on evaluating various descriptors in addition to SIFT and used only SVM for classification. It also used the OpenCV implementation of k-means for unsupervised clustering of keypoint descriptors into the visual dictionary.
For this study, the machine learning aspects of the prior work were extended to include k-nearest-neighbor (kNN) and decision tree classifiers. This allowed comparison of three substantially different supervised learning algorithms for classifying objects in images. SURF was used for rapid keypoint detection, and SIFT was used as the descriptor extractor in all three cases. For kNN, a k value of 5 was chosen as a reasonable starting value, and in all three cases the number of k-means clusters was 1500. All other statistical parameters were at or near recommended default values.
The OpenCV ml library contains an implementation of bag-of-words, which was adapted to provide the basic bag-of-features capability. OpenCV also contains statistical model templates that were instantiated to provide the machine learning functionality, i.e., SVM, kNN, and decision tree.
Custom code was needed primarily for iterating over images, performing 10-fold cross validation, reading sample image files, generating sample JPEG files, calculating and displaying the confusion matrix, and tying together the library function calls for initializing and updating the bag-of-features data structures.
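As one illustration of the cross-validation bookkeeping, the sketch below partitions sample indices into training and evaluation sets for a single fold using a simple modulo rule. This is an assumed scheme for illustration; the actual custom code may have partitioned the samples differently (e.g., by contiguous blocks or after shuffling).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One fold of k-fold cross validation over n samples. Fold `fold`
// (0-based) holds out every index i with i % k == fold as the evaluation
// set; the remaining indices form the training set.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
foldSplit(std::size_t n, std::size_t k, std::size_t fold) {
    std::vector<std::size_t> train, eval;
    for (std::size_t i = 0; i < n; ++i)
        (i % k == fold ? eval : train).push_back(i);
    return {train, eval};
}
```

Iterating `fold` from 0 to k-1 uses every sample exactly once for evaluation, which is what allows the per-class counts in the confusion matrices to sum to the full dataset size.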
6 Data Set
Image samples were extracted from the Caltech 101 database (8). For purposes of this study, six distinct object classes were chosen to challenge the three diverse supervised learning methods:
(a) car_side
(b) chair
(c) elephant
(d) faces_easy
(e) motorbikes
(f) sunflower
7 Results
Sample images and the keypoints detected in them are shown in Figure 1.

Fig. 1. Sample images with detected keypoints.
Training and execution times were roughly equivalent, with SVM completing training and 10-fold cross validation in 3.25 hours, kNN in 3.5 hours, and decision tree in 4 hours. SVM exhibited the best overall classification success rate at 0.919, followed by kNN at 0.840 and decision tree at 0.711. The confusion matrices are shown in Figure 2.
Fig. 2. Confusion matrices (rows: actual class; columns: predicted class).

SVM Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side          119      2         2           0          0          0
chair               0     46        15           0          1          0
elephant            2      6        52           1          1          2
faces_easy          1      7        19         408          0          0
motorbike           7     15        37           0        739          0
sunflower           0      1         6           0          1         77

KNN Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side           98      2        21           2          0          0
chair               1     15        36           9          1          0
elephant            1      5        53           4          1          0
faces_easy          0      1        33         401          0          0
motorbike           3      8        75           9        702          1
sunflower           0      1        28           8          1         47

Decision Tree Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side           95     17         6           2          3          0
chair               5     34        10           1          8          4
elephant            5     20        27           3          3          6
faces_easy         14     41        45         323          8          4
motorbike          13    121        67           3        585          9
sunflower           3     16        11           2          2         51
The classification accuracy is consistent with similar results obtained with MATLAB.
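The overall success rates follow directly from the confusion matrices: the sum of the diagonal (correct predictions) divided by the total number of samples. For the SVM matrix this gives 1441/1567 ≈ 0.919. A minimal helper:

```cpp
#include <cstddef>
#include <vector>

// Overall classification success rate: correctly classified samples (the
// diagonal of the confusion matrix) divided by the total sample count.
double accuracy(const std::vector<std::vector<int>>& confusion) {
    long correct = 0, total = 0;
    for (std::size_t r = 0; r < confusion.size(); ++r)
        for (std::size_t c = 0; c < confusion[r].size(); ++c) {
            total += confusion[r][c];
            if (r == c) correct += confusion[r][c];
        }
    return static_cast<double>(correct) / total;
}
```

Applied to the three matrices above, this reproduces the reported rates of 0.919, 0.840, and 0.711 for SVM, kNN, and decision tree respectively.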
8 Conclusions and Follow-On Work
A visual bag-of-features is an effective and efficient method of object classification when paired with the OpenCV machine learning library. Testing with common supervised and unsupervised machine learning methods demonstrates the ability to detect and classify a wide range of object categories, and the method lends itself to multi-class training.
On the downside, the results do not contain any geometric, spatial, localization, or semantic content; they merely allow the user to determine whether an object is present in an image.
The proposed follow-on work will follow one of two paths:
(1) A hybrid algorithm that merges the simplicity of bag-of-features with the geometric constraints of SIFT descriptors and Haar-like objects in a cascade of kernels, resulting in a bag-of-constrained-phrases solution.
(2) Re-implementation of the entire algorithm, including geometric and spatial constraints, as a convolutional neural network. This approach would take the strong points of all the earlier work and realize them in a contemporary architecture, with the caveat that library support will be somewhat lacking and much of the code will need to be written from scratch.
Additional enhancements, such as code to display the individual visual words in the dictionary, would be useful for presenting results and explaining the technique to those unfamiliar with it. This will take some work, since keypoint locations are not encapsulated in the SIFT descriptor; the clustering operation would need to keep track of the spatial information somehow.
Another improvement would be to restructure the C++ classes so that the user does not need to rebuild the bag-of-features for each classifier: build it once, then train and run each classification. It would also be useful to make each classification run optional at execution time.
9 References
1. Csurka, Gabriella, et al. "Visual categorization with bags of keypoints." Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1, No. 1-22. 2004.
2. Lowe, David G. "Object recognition from local scale-invariant features." Proceedings of the Seventh IEEE International Conference on Computer Vision. Vol. 2. IEEE, 1999.
3. Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features." Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). Vol. 1. IEEE, 2001.
4. Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories." 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). Vol. 2. IEEE, 2006.
5. Xu, Yuetian, and Richard Madison. "Robust object recognition using a cascade of geometric consistency filters." 2009 IEEE Applied Imagery Pattern Recognition Workshop (AIPR 2009). IEEE, 2009.
6. OpenCV library. opencv.org
7. https://github.com/cfolson/classification
8. Caltech 101 dataset. http://www.vision.caltech.edu/Image_Datasets/Caltech101/