Machine Learning in Machine Vision - ERNETval.serc.iisc.ernet.in/DAV/ML_in_Vision.pdf · Machine...
Machine Learning in Machine
Vision
R. Venkatesh Babu
Video Analytics Lab, SERC
Indian Institute of Science, Bangalore
Can Machines Replace Humans?
Semantic Gap
How do we interpret image data?
What is an Image?
What do we see?
What is an Image?
What do machines see?
Semantic Gap
Organization
• Machine Vision – Challenges
• Discriminative and Generative Approaches
• ML Applications in Vision
• Deep Learning
• Inspiration from Neuroscience
• Deep Architecture
• Applications
Machine Vision -
Challenges
Challenges 1: view point variation
Michelangelo 1475-1564
Challenges 2: illumination
slide credit: S. Ullman
Challenges 3: occlusion
Magritte, 1957
Challenges 4: scale
slide by Fei Fei, Fergus & Torralba
Challenges 5: deformation
Xu, Beihong 1943
Challenges 6: background clutter
Klimt, 1913
Challenges 7: object intra-class variation
slide by Fei-Fei, Fergus & Torralba
Object Categorization
Discriminative model: p(Object | image)
Generative models: p(image | Object)
Slides from: Fei-Fei Li
Discriminative
Generative
p(image | zebra) p(image | no zebra)
Object Detection Pipeline
Object Representation Which features are suitable for the task
Learning
Which machine learning algorithm to choose
Bag-of-words Approach
Features
Pixels
Texture
Color Histograms
SIFT/SURF
HoG …
Requirements: invariance to the above challenges (illumination, scale,
orientation, …) at acceptable computational and memory cost
Machine Learning Algorithms
Nearest Neighbor
Naïve Bayes
ANN
SVM
AdaBoost
CNN …
Face Detection
Neural Network-Based Face Detection
Rowley, Baluja and Kanade, PAMI ’98
Object Detection Using the Statistics of Parts
H. Schneiderman, & T. Kanade, CVPR’00, IJCV’04
Robust Real-time Object Detection
Paul Viola and Michael Jones (IJCV’04)
Neural Network-Based Face
Detection
(Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, PAMI ‘98)
System
Stage 1: Applies a set of neural network-based filters to an
image. The filters examine each location in the image at several scales.
Stage 2: Uses an arbitrator to combine the outputs:
merges detections from individual filters and eliminates
overlapping detections.
Overview
Detection Time
• Networks: two
• Image size: 320 × 240 pixels
• Windows examined: 246,766 (each 20 × 20)
• Machine: 200 MHz R4400 SGI Indigo 2
• Time taken: approximately 383 seconds (> 6 minutes!)
Object Detection Using the Statistics of Parts H. Schneiderman, & T. Kanade, CVPR’00, IJCV’04
• Represent appearance statistics as a product of histograms.
• Each histogram represents the joint statistics of a subset of
wavelet coefficients and their position on the object.
• Use many such histograms, representing a wide variety of visual
attributes.
Number of orientations
Face – 2
Cars – 8
There are too many parameters to learn
$$\frac{P(\mathrm{Object}\mid x_1,\ldots,x_n)}{P(\overline{\mathrm{Object}}\mid x_1,\ldots,x_n)} > 1$$
$$\Leftrightarrow\;\frac{P(x_1,\ldots,x_n\mid \mathrm{Object})\,P(\mathrm{Object})}{P(x_1,\ldots,x_n\mid \overline{\mathrm{Object}})\,P(\overline{\mathrm{Object}})} > 1$$
$$\Leftrightarrow\;\frac{P(x_1,\ldots,x_n\mid \mathrm{Object})}{P(x_1,\ldots,x_n\mid \overline{\mathrm{Object}})} > \frac{P(\overline{\mathrm{Object}})}{P(\mathrm{Object})}$$
Bayes optimal classifier
Image is defined by n attrs: x1,x2,…,xn
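As a minimal sketch of the likelihood-ratio test, here is a hypothetical version that assumes binary attributes that are conditionally independent given the class (a naive-Bayes simplification; the Schneiderman-Kanade system instead uses joint histograms of wavelet coefficients):

```python
def likelihood_ratio(x, p_obj, p_bg, prior_obj=0.5):
    """Return P(x|Object)P(Object) / (P(x|~Object)P(~Object)) for binary
    attributes x, assuming conditional independence given the class."""
    num = prior_obj          # accumulates P(x|Object) P(Object)
    den = 1.0 - prior_obj    # accumulates P(x|~Object) P(~Object)
    for xi, po, pb in zip(x, p_obj, p_bg):
        num *= po if xi else (1.0 - po)
        den *= pb if xi else (1.0 - pb)
    return num / den

def classify(x, p_obj, p_bg, prior_obj=0.5):
    """Declare 'Object' when the ratio exceeds 1, as in the rule above."""
    return likelihood_ratio(x, p_obj, p_bg, prior_obj) > 1.0
```

All probability values here (`p_obj`, `p_bg`) are illustrative placeholders; in practice they are estimated from training data.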
SE 263 R. Venkatesh Babu
Reported results for faces
Kodak dataset: Test set: 17 images, 46 faces, 36 profile views.
A bigger dataset: from multiple sources, 208 images, 441 faces, about 347
profiles.
Robust Real-time Object Detection Paul Viola and Michael Jones (IJCV’04)
Integral Image with Haar Features
Training via AdaBoost
Speed-up through Attentional cascades
Integral Image
The integral image at location (x,y), is the sum
of the pixel values above and to the left of (x,y),
inclusive.
Rapid evaluation of rectangular
features
Using the integral image
representation one can compute the
value of any rectangular sum in
constant time.
For example, the sum inside
rectangle D can be computed as:
ii(4) + ii(1) − ii(2) − ii(3)
As a result two-, three-, and four-rectangular features can be computed with 6, 8 and 9 array
references respectively.
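The integral image and the four-reference rectangle sum can be sketched in a few lines of Python (a plain illustration of the idea, not code from the paper):

```python
def integral_image(img):
    """ii[y][x] = sum of img values above and to the left of (x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum inside an inclusive rectangle using only 4 array references:
    ii(4) + ii(1) - ii(2) - ii(3)."""
    a = ii[top - 1][left - 1] if top > 0 and left > 0 else 0  # ii(1)
    b = ii[top - 1][right] if top > 0 else 0                  # ii(2)
    c = ii[bottom][left - 1] if left > 0 else 0               # ii(3)
    d = ii[bottom][right]                                     # ii(4)
    return d + a - b - c
```

Each rectangle needs at most 4 references, so a two-rectangle feature (sharing a corner pair) needs 6 in total, as stated above.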
Haar Features: 3 rectangular feature types:
• two-rectangle feature type
(horizontal/vertical)
• three-rectangle feature type
• four-rectangle feature type
Using a 24x24 pixel base detection window, with all possible
combinations of horizontal and vertical location and scale of these feature
types, the full set contains 49,396 features.
The motivation for using rectangular features, as opposed to more
expressive steerable filters, is their extreme computational efficiency.
Scanning at many Scales
At base scale objects are detected at 24x24 size
Scanned at 11 scales with a factor of 1.25 (24x24, 30x30, 38x38,
47x47 ….)
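The scale sequence above (24, 30, 38, 47, …) is just the base window grown by the factor at each step; a small sketch:

```python
def detection_scales(base=24, factor=1.25, n=11):
    """Window sizes for multi-scale scanning: grow the base window by
    `factor` at each step and round to whole pixels."""
    sizes, s = [], float(base)
    for _ in range(n):
        sizes.append(round(s))
        s *= factor
    return sizes
```

The exact rounding rule is an assumption here; the paper only specifies the 1.25 scale factor.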
Conventional Approach:
• Compute a pyramid of 11 images, each 1.25 times
smaller than the previous
• Requires significant time (well below 15 fps)
AdaBoost: Intuition
slide credit: K. Grauman, B. Leibe
Figure adapted from Freund and Schapire
Consider a 2-d feature
space with positive and
negative examples.
Each weak classifier splits
the training examples with
at least 50% accuracy.
Examples misclassified by
a previous weak learner
are given more emphasis
at future rounds.
AdaBoost Algorithm Start with uniform
weights on training
examples
Evaluate weighted
error for each
feature, pick best.
Incorrectly classified -> more weight
Correctly classified -> less weight
The final classifier is a combination of the weak ones,
weighted according to the error they had.
Freund & Schapire 1995
Training examples: {x1, …, xn}
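The loop above (uniform weights, pick the best weak classifier, reweight, combine) can be sketched as generic discrete AdaBoost. This is an illustrative sketch with hypothetical helper names, not the Viola-Jones feature-selection code itself:

```python
import math

def adaboost(weak_learners, X, y, rounds):
    """Discrete AdaBoost sketch: each weak learner maps a sample to {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                       # start with uniform weights
    ensemble = []                           # (alpha, classifier) pairs
    for _ in range(rounds):
        # evaluate weighted error for each weak classifier, pick the best
        best, best_err = None, 0.5
        for h in weak_learners:
            err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
            if err < best_err:
                best, best_err = h, err
        if best is None:
            break
        alpha = 0.5 * math.log((1.0 - best_err) / max(best_err, 1e-10))
        ensemble.append((alpha, best))
        # incorrectly classified -> more weight, correctly classified -> less
        w = [wi * math.exp(-alpha * yi * best(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Final classifier: error-weighted vote of the selected weak ones."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score > 0 else -1
```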
Boosting Example
First classifier
First 2 classifiers
First 3 classifiers
Final Classifier learned by Boosting
-0.42-0.65+0.92 = -0.15
-0.42+0.65+0.92 = 1.15
Recall: Perceptron Operation
Equation of "thresholded" operation:
$$o(x_1, x_2, \ldots, x_d) = \begin{cases} 1 & \text{if } w_1 x_1 + \cdots + w_d x_d + w_{d+1} > 0 \\ -1 & \text{otherwise} \end{cases}$$
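The thresholded unit translates directly to a few lines of Python (a generic sketch, not from the slides):

```python
def perceptron_output(x, w):
    """Thresholded perceptron: o(x) = 1 if w1*x1 + ... + wd*xd + w_{d+1} > 0,
    else -1. `w` has d+1 entries; the last is the bias weight w_{d+1}."""
    s = sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1]
    return 1 if s > 0 else -1
```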
Performance of a 200-feature face
detector: the ROC curve of the constructed classifier
indicates that a reasonable detection rate of 0.95
can be achieved while maintaining an extremely
low false positive rate of approximately 10⁻⁴ (1 in
14,084).
• First features selected by AdaBoost are meaningful and have high
discriminative power
• By varying the threshold of the final classifier one can construct a
two-feature classifier which has a detection rate of 1 and a false
positive rate of 0.4.
• Requires 0.7 seconds to scan a 384 × 288 image!
Speed-up through the Attentional Cascade
• Simple boosted classifiers can reject many of the negative sub-windows
while detecting all positive instances.
• A series of such simple classifiers can achieve good detection
performance while eliminating the need for further processing of
negative sub-windows.
• More difficult examples are faced by classifiers deeper in the cascade.
Single Vs Cascade Classifier
The Cascaded
Classifier is
nearly
10 times faster!
Experiments (dataset for training)
4916 positive training examples
were hand-picked, aligned,
normalized, and scaled to a base
resolution of 24x24.
10,000 negative examples were
selected by randomly picking
sub-windows from 9500 images
that did not contain faces.
Results cont.
More Detection Examples
Practical implementation
Details discussed in Viola-Jones paper
• Training time: weeks (with 5k faces and 9.5k non-faces)
• Final detector has 32 cascade layers and 4297 features
• 700 MHz Pentium III processor:
can process a 384 x 288 image in 0.067 seconds (in 2002,
when the paper was written)
Ensemble Tracking Shai Avidan – CVPR 05
(Adaboost in Tracking)
Object Localization
Ensemble of weak learners is used to create a per-pixel
confidence map
Optimal location found by mean shift algorithm
Ensemble is updated in new location
Weak Classifiers Linear classifiers are used as weak classifiers
Find the best hyperplane to separate data
Strong classifier calculated using AdaBoost
Determines weights of each weak classifier
Trains iteratively on “harder” examples
Experimental Results
SVMs in Machine Vision
Ensemble of Exemplar-SVMs for Object
Detection and Beyond (Malisiewicz et al.,
ICCV’11)
Discriminative Object Detectors
Linear SVM on HOG
Hard-Negative Mining
Sliding Window Detection
Exemplar SVMs
Learn a separate linear SVM for each instance
(exemplar) in the dataset
Exemplar SVM
Advantages: we can use different features for each exemplar
Adapt features to each exemplar’s aspect ratio
Ensemble of Exemplar SVMs
Results
Image Parsing
Tighe et al., Finding Things: Image Parsing with Regions and Per-Exemplar Detectors,
CVPR’13
Results
Representation Learning
using CNNs
Video Analytics Lab, SERC, IISc
Why Deep Learning??
❖ To learn feature hierarchies
❖ In Vision
➢Mainly for recognition
➢But, is being applied in almost all the vision
tasks
Conventional Recognition approach
Image/Video pixels → hand-designed feature extraction → trainable classifier → object class
Features are not learned
Conventional Recognition approach
❖ Classifiers are often generic
❖ Features have been the key to progress in recognition so far
❖ Multitude of hand-designed features
➢ SIFT, HOG, LBP, MSER, Color-SIFT etc.
But, Why learn features ??
❖ Better performance
❖ Other new domains (unclear how to hand engineer)
➢ Kinect
➢ Video
➢ Multi spectral
❖ Feature computation time
Deep Learning??
Learning
multiple levels of representation and abstraction
that help to make sense of data
such as images, sound, and text.
Hierarchical Structure of Visual Cortex
N. Kruger et al.
Lateral Geniculate Nucleus (LGN)
Primary Visual Cortex (V1)
David Hubel and Torsten Wiesel won the Nobel prize for discovering
the functional organization and basic physiology of neurons in V1.
• Simple Cells
• Complex Cells
• Hypercomplex Cells
Simple Cell: Hubel-Wiesel Model
Complex Cell
Deep Architecture
Theoretical:
“Many functions can be much more efficiently represented with deeper
architectures…” [Bengio & LeCun 2007]
f_l takes as input a datum x_l and a parameter set w_l, and outputs x_{l+1}
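This layer-by-layer view is just function composition; a minimal sketch (the toy layers `scale` and `shift` are hypothetical stand-ins for real layers such as convolution or pooling):

```python
def forward(layers, x0):
    """Compose layers: each (f, w) pair maps x_l to x_{l+1} = f(x_l, w_l)."""
    x = x0
    for f, w in layers:
        x = f(x, w)
    return x

# Toy 'layers' operating on a scalar, for illustration only.
scale = lambda x, w: x * w
shift = lambda x, w: x + w
```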
Learning a Hierarchy of Feature
Extractors
❖ Each layer extracts features from output of previous layer
❖ All the way from pixels to classifier
❖ Layers have (nearly) the same structure
❖ Train all layers jointly
Image/Video pixels → Layer 1 → Layer 2 → Layer 3 → simple classifier
Learning a Hierarchy of Feature
Extractors
❖ Stack multiple stages of simple cells / complex cells layers
❖ Higher stages compute more global, more invariant features
❖ Classification layer on top
Natural progression from
low level to high level structures.
Can share the lower-level
representations for multiple tasks.
Deep architectures can be
representationally efficient.
Typical CNN Operations
❖ Filtering (Convolution)
❖ Contrast Normalization
❖ Local Pooling (Sub-sampling)
2D Convolution
Image from http://developer.amd.com
Image Convolution / Filtering
❖ Convolutional
➢ Translation equivariance
➢ Tied filter weights
(same at each position: few
parameters)
Feature Maps
Translation Equivariance
❖ Input translation results in translation of features
➢ Fewer filters needed: no translated replications
➢ But still need to cover orientation/frequency
Convolutional Filters
CNN: Convolution in 3D
Image from http://deeplearning.net
Normalization
❖ Contrast normalization
➢ Across feature maps or within the maps
❖ Each feature is scaled by a local normalization factor
❖ α and β are parameters; n is the size of the local region
❖ Induces local competition between features to explain input
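The slide's normalization formula did not survive extraction; assuming the AlexNet-style local response normalization (each feature divided by a power of the summed squares over n neighboring maps), a sketch looks like:

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalization across feature maps (an assumed AlexNet-style form).
    `a` holds one value per feature map at a fixed spatial position; each
    value is divided by (k + alpha * sum of squares over n neighbors)**beta."""
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * sum(a[j] ** 2 for j in range(lo, hi + 1))) ** beta
        out.append(a[i] / denom)
    return out
```

Because a strong response in one map inflates its neighbors' denominators, features compete locally to explain the input, as noted above.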
Local Pooling
Images by Zhu et al. and http://vaaaaaanquish.hatenablog.com
Pooling
❖ Spatial Pooling
❖ Non-overlapping / overlapping regions
❖ Sum or max
❖ Invariance to small transformations
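Spatial pooling as described above, in a short generic sketch (non-overlapping when `stride == size`, overlapping when `stride < size`):

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Spatial pooling over size x size windows with the given stride,
    taking either the max or the sum of each window."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for y in range(0, h - size + 1, stride):
        row = []
        for x in range(0, w - size + 1, stride):
            window = [fmap[y + j][x + i]
                      for j in range(size) for i in range(size)]
            row.append(max(window) if mode == "max" else sum(window))
        out.append(row)
    return out
```

Small shifts of the input move values within a window without changing the max, which is the source of the invariance to small transformations.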
Example Nets
CNN Applications
❖ Image recognition, speech recognition, photo taggers
❖ Have won several competitions
➢ ImageNet, Kaggle Facial Expression and Multimodal Learning,
German Traffic Signs, Connectomics, Handwriting etc.
❖ Applicable to array data where nearby values are correlated
➢ Images, sound, time-frequency representations, video, volumetric
images, RGB-Depth images etc.
❖ Reading Text in the Wild
❖ One of the few models that can be trained purely supervised
Software Tools
Caffe: From Berkeley
Torch7: www.torch.ch
OverFeat: From NYU
Cuda-Convnet: http://code.google.com/p/cuda-convnet/
MatConvnet: CNNs for MATLAB
Theano:
http://deeplearning.net/software/theano/