Evaluation of Visual Bag of Features: Theory of Operation, Performance Tests, and Proposed Enhancements

Michael Shawn Quinn
University of Washington, Bothell, WA
Abstract – Bag of Features, or Bag of Keypoints as it was originally named, has become a popular technique for multi-class object detection in computer image analysis. This paper reports on a specific implementation of "Bag of Features" that compares three popular machine learning techniques for multi-class supervised learning. It also outlines proposed enhancements that add geometric, spatial, and semantic constraints to the algorithm.

Keywords: bag of features, bag of keypoints, bag of words
1 Introduction
Digital images pervade all aspects of daily life and frequently contain information important and useful to private individuals, businesses, and government entities. Significant development effort has been devoted to devising efficient, automated techniques for extracting that information from images.
Recent developments in object identification have focused on designing algorithms that are general, in that they can be applied to any object class, while remaining computationally efficient when operating on images containing multiple objects of varying scale and orientation in the presence of noise.
Taking a cue from the bag-of-words technique used in text classification, one popular recent approach to object identification and classification involves constructing a bag-of-features, or, as (1) calls it, a bag-of-keypoints. The method uses fast feature detectors and unsupervised learning to build up a vocabulary, or dictionary, of key image features for each object class. The dictionary can then be used to train any common supervised learning algorithm.
2 Previous Work
Csurka et al. (1) proposed the bag-of-keypoints method and presented results for naïve Bayes and support vector machine (SVM) classifiers. The results indicated generally good performance when evaluated against seven diverse object classes. The motivation for its development was the perceived need for generalized object classification without imposing spatial or geometric constraints on the object descriptors.
A bag-of-keypoints or bag-of-features method uses a scale-invariant feature detector, SIFT in the case of (1), to gather a large collection of descriptors from the keypoints of a set of sample images representing the object classes of interest. The descriptors are then clustered into features by unsupervised k-means clustering; the center of each cluster represents one word in a dictionary of visual words. For each class of interest, a histogram of the frequency of occurrence of each dictionary word is produced. The histograms of unknown images are then matched against the histograms of the trained classes to predict the class of the unknown object.
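Once the dictionary exists, producing the per-image histogram amounts to assigning each descriptor to its nearest cluster center and counting. The sketch below illustrates that step only; for readability the descriptor dimension is reduced from SIFT's 128 to 2, and all names are illustrative rather than taken from any library.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in for a SIFT descriptor, reduced to 2 dimensions for illustration.
using Descriptor = std::array<double, 2>;

// Index of the nearest cluster center (visual word) for one descriptor,
// by squared Euclidean distance.
std::size_t nearestWord(const Descriptor& d,
                        const std::vector<Descriptor>& vocabulary) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t w = 0; w < vocabulary.size(); ++w) {
        double dist = 0.0;
        for (std::size_t i = 0; i < d.size(); ++i) {
            double diff = d[i] - vocabulary[w][i];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = w; }
    }
    return best;
}

// Normalized histogram of visual-word frequencies for one image.
std::vector<double> wordHistogram(const std::vector<Descriptor>& descriptors,
                                  const std::vector<Descriptor>& vocabulary) {
    std::vector<double> hist(vocabulary.size(), 0.0);
    if (descriptors.empty()) return hist;  // avoid dividing by zero
    for (const Descriptor& d : descriptors)
        hist[nearestWord(d, vocabulary)] += 1.0;
    for (double& h : hist) h /= descriptors.size();
    return hist;
}
```

With a two-word vocabulary, an image whose descriptors split evenly between the two clusters yields the histogram {0.5, 0.5}; it is these histograms, not the raw descriptors, that the supervised classifiers are trained on.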
3 Limitations and Possible Enhancements
SIFT features have been shown to be scale and affine invariant (2). This allows the user to classify a set of images without re-sizing, rotating, stretching, shrinking, shearing, or cropping them during pre-processing. However, the resulting clusters of descriptors, whose centers correspond to the vocabulary of visual words, contain no object localization information within the image, nor is there any spatial or geometric dependence between features.
As a result, the bag-of-features technique is susceptible to pixel noise in the image, particularly from pixels near the object of interest, and it returns no information about the relationship between the object and the background or other objects in the image scene.
To overcome these limitations, alternative approaches have been suggested. Cascades of Haar-like features (3) are a popular technique with open-source implementations available. They build up complex geometric classifiers from simple geometric building blocks, and the process of cascading successively larger kernel windows over the image also eliminates much of the background pixel noise, since noise pixels are rejected at each stage of the cascade. Unfortunately, existing Haar-cascade implementations are inherently single-class classifiers and are not readily adaptable to multi-class supervised learning.
Spatial pyramid matching (4) is another cascade-type approach, combining pyramid or cascade matching of image sections with feature clustering within each section. More recently, a bag-of-phrases approach (5) extends the bag-of-features by using spatial constraints inherent in SIFT features to impose spatial relationships between detected features in unknown images.
4 Algorithm
The bag-of-features algorithm consists of the following steps:
1. Build path to image files.
2. Divide images into training sets and evaluation sets.
Training
3. For each training image:
(a) Detect keypoints in training image using SURF.
(b) Extract or compute a descriptor from keypoints using SIFT.
(c) Cluster the descriptors into features or visual words using K-means.
(d) Add features to bag-of-features dictionary.
4. Add class labels to vector of class labels.
5. Train the model with the bag-of-features.
Classification
6. For each evaluation image:
(a) Detect keypoints in the evaluation image using SURF.
(b) Extract or compute a descriptor from the keypoints using SIFT.
(c) Predict the class of the image using the descriptor.
(d) Add the prediction to the confusion matrix.
Data Presentation
7. Calculate the prediction error.
8. Display the confusion matrix.
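The clustering in step 3(c) can be sketched as plain Lloyd's k-means. The sketch below is illustrative only: it operates on 2-D points rather than 128-D SIFT descriptors, takes its initial centers as an argument, and runs a fixed number of iterations, whereas the study used OpenCV's k-means implementation.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<double, 2>;  // stand-in for a 128-D SIFT descriptor

// Lloyd's k-means: assign each descriptor to its nearest center, then move
// each center to the mean of its assigned descriptors. The final centers
// are the "visual words" of the dictionary.
std::vector<Point> kmeans(const std::vector<Point>& data,
                          std::vector<Point> centers, int iterations) {
    for (int it = 0; it < iterations; ++it) {
        std::vector<Point> sums(centers.size(), Point{0.0, 0.0});
        std::vector<int> counts(centers.size(), 0);
        for (const Point& p : data) {
            // Find the nearest center by squared Euclidean distance.
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centers.size(); ++c) {
                double dx = p[0] - centers[c][0], dy = p[1] - centers[c][1];
                double dist = dx * dx + dy * dy;
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            ++counts[best];
        }
        // Move each center to the centroid of its assigned points.
        for (std::size_t c = 0; c < centers.size(); ++c)
            if (counts[c] > 0)
                centers[c] = {sums[c][0] / counts[c], sums[c][1] / counts[c]};
    }
    return centers;
}
```

In the study itself this step runs over all training descriptors at once with 1500 clusters, so the dictionary is shared across classes; only the per-class histograms differ.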
5 Implementation
The OpenCV library version 3.0 (6) was used extensively in the code implementation. Development was done on Windows 10 using Visual Studio 2013 Professional in C++. An existing solution implementing the basic bag-of-words algorithm was used as a starting point (7). That implementation focused on evaluating various descriptors in addition to SIFT and used only SVM for classification. It also used the OpenCV implementation of k-means for unsupervised clustering of keypoint descriptors into the visual dictionary.
For this study, the machine learning aspects of the prior work were extended to include k-nearest-neighbor (kNN) and decision tree classifiers. This allowed comparison of three substantially different supervised learning algorithms for classifying objects in images. SURF was used for rapid keypoint detection, and SIFT was used as the descriptor extractor in all three cases. For kNN, a k value of 5 was chosen as a reasonable starting value, and in all three cases the number of k-means clusters was 1500. All other statistical parameters were at or near recommended default values.
The OpenCV ml library contains an implementation of bag-of-words, which was adapted to provide the basic bag-of-features capability. OpenCV also contains statistical model templates that were instantiated to provide the machine learning functionality, i.e., SVM, kNN, and decision tree.
Custom code was needed primarily for iterating over images, performing 10-fold cross validation, reading sample image files, generating sample JPEG files, calculating and displaying the confusion matrix, and tying together the library function calls for initializing and updating the bag-of-features data structures.
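As one illustration of the cross-validation bookkeeping, the sketch below partitions sample indices into training and evaluation sets for a single fold using a simple modulo rule. This is an assumed scheme for illustration; the actual custom code may have partitioned the samples differently (e.g., by contiguous blocks or after shuffling).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One fold of k-fold cross validation over n samples. Fold `fold`
// (0-based) holds out every index i with i % k == fold as the evaluation
// set; the remaining indices form the training set.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
foldSplit(std::size_t n, std::size_t k, std::size_t fold) {
    std::vector<std::size_t> train, eval;
    for (std::size_t i = 0; i < n; ++i)
        (i % k == fold ? eval : train).push_back(i);
    return {train, eval};
}
```

Iterating `fold` from 0 to k-1 uses every sample exactly once for evaluation, which is what allows the per-class counts in the confusion matrices to sum to the full dataset size.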
6 Data Set
Image samples were extracted from the Caltech 101 database (8). For purposes of this study, six distinct object classes were chosen to challenge the three diverse supervised learning methods:
(a) car_side
(b) chair
(c) elephant
(d) faces_easy
(e) motorbikes
(f) sunflower
7 Results
Sample images and the keypoints detected in them are shown in Figure 1.

Fig. 1. Sample images with detected keypoints.
Training and execution times were roughly equivalent, with SVM completing training and 10-fold cross validation in 3.25 hours, kNN in 3.5 hours, and decision tree in 4 hours. SVM exhibited the best overall classification success rate at 0.919, followed by kNN at 0.840 and decision tree at 0.711. The confusion matrices are shown in Figure 2.
Fig. 2. Confusion matrices (rows: actual class; columns: predicted class).

SVM Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side          119      2         2           0          0          0
chair               0     46        15           0          1          0
elephant            2      6        52           1          1          2
faces_easy          1      7        19         408          0          0
motorbike           7     15        37           0        739          0
sunflower           0      1         6           0          1         77

KNN Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side           98      2        21           2          0          0
chair               1     15        36           9          1          0
elephant            1      5        53           4          1          0
faces_easy          0      1        33         401          0          0
motorbike           3      8        75           9        702          1
sunflower           0      1        28           8          1         47

Decision Tree Prediction
Class        car_side  chair  elephant  faces_easy  motorbike  sunflower
car_side           95     17         6           2          3          0
chair               5     34        10           1          8          4
elephant            5     20        27           3          3          6
faces_easy         14     41        45         323          8          4
motorbike          13    121        67           3        585          9
sunflower           3     16        11           2          2         51
The classification accuracy is consistent with similar results obtained with MATLAB.
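The overall success rates follow directly from the confusion matrices: the sum of the diagonal (correct predictions) divided by the total number of samples. For the SVM matrix this gives 1441/1567 ≈ 0.919. A minimal helper:

```cpp
#include <cstddef>
#include <vector>

// Overall classification success rate: correctly classified samples (the
// diagonal of the confusion matrix) divided by the total sample count.
double accuracy(const std::vector<std::vector<int>>& confusion) {
    long correct = 0, total = 0;
    for (std::size_t r = 0; r < confusion.size(); ++r)
        for (std::size_t c = 0; c < confusion[r].size(); ++c) {
            total += confusion[r][c];
            if (r == c) correct += confusion[r][c];
        }
    return static_cast<double>(correct) / total;
}
```

Applied to the three matrices above, this reproduces the reported rates of 0.919, 0.840, and 0.711 for SVM, kNN, and decision tree respectively.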
8 Conclusions and Follow-On Work
A visual bag-of-features is an effective and efficient method of object classification when paired with the OpenCV machine learning library. Testing with common supervised and unsupervised machine learning methods demonstrates the ability to detect and classify a wide range of object categories, and the method lends itself to multi-class training.
On the downside, the results do not contain any geometric, spatial, localization, or semantic content; they merely allow the user to determine whether an object is present in an image.
The proposed follow-on work will follow one of two paths:
(1) A hybrid algorithm that merges the simplicity of bag-of-features with the geometric constraints of SIFT descriptors and Haar-like objects in a cascade of kernels, resulting in a bag-of-constrained-phrases solution.
(2) Re-implementation of the entire algorithm, including geometric and spatial constraints, as a convolutional neural network. This approach would take the strong points of all the earlier work and realize them in a contemporary architecture, with the caveat that library support will be somewhat lacking and much of the code will need to be written from scratch.
Additional enhancements, such as code to display the individual visual words in the dictionary, would be useful for presenting results and explaining the technique to those unfamiliar with it. This will take some work, since keypoint locations are not encapsulated in the SIFT descriptor; the clustering operation would need to keep track of the spatial information somehow.
Another improvement would be to restructure the C++ classes so that the user does not need to rebuild the bag-of-features for each classifier: build it once, then train and run each classification. It would also be useful to make each classification run optional at execution time.
9 References
1. Csurka, Gabriella, et al. "Visual categorization with bags of keypoints." Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1, No. 1-22. 2004.
2. Lowe, David G. "Object recognition from local scale-invariant features." Proceedings of the Seventh IEEE International Conference on Computer Vision. Vol. 2. IEEE, 1999.
3. Viola, Paul, and Michael Jones. "Rapid object detection using a boosted cascade of simple features." Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). Vol. 1. IEEE, 2001.
4. Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories." 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). Vol. 2. IEEE, 2006.
5. Xu, Yuetian, and Richard Madison. "Robust object recognition using a cascade of geometric consistency filters." 2009 IEEE Applied Imagery Pattern Recognition Workshop (AIPR 2009). IEEE, 2009.
6. OpenCV library. opencv.org
7. https://github.com/cfolson/classification
8. Caltech 101 dataset. http://www.vision.caltech.edu/Image_Datasets/Caltech101/