Convolutional Restricted Boltzmann Machines for Feature Learning
Convolutional Restricted Boltzmann Machines for Feature Learning
Mohammad Norouzi
Advisor: Dr. Greg Mori
CS @ Simon Fraser University, 27 Nov 2009
1
Problems
• Human detection
• Handwritten digit classification
3
Sliding Window Approach
4
Sliding Window Approach (Cont’d)
5
[INRIA Person Dataset]
Decision Boundary
Success or failure of an object recognition algorithm hinges on the features used.
[Diagram: Input → Feature representation (our focus) → Classifier → Label: Human / Background, 0 / 1 / 2 / 3 / …]
6
Learning Local Feature Detector Hierarchies
7
[Figure annotation: features higher in the hierarchy are larger, more complicated, and less frequent]
Generative & Layerwise Learning
8
[Figure: unknown features (?) at each layer, learned generatively by a CRBM, one layer at a time]
Visual Features: Filtering
9
Filter Kernel (Feature):
 1  0 -1
 2  0 -2
 1  0 -1
[Figure: the image V is filtered with kernels W^1, W^2, W^3, producing responses Filter(V, W^1), Filter(V, W^2), Filter(V, W^3)]
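As an illustration of this filtering step (my own sketch, not code from the talk), Filter(V, W) can be computed as a "valid"-mode 2D cross-correlation; the kernel below is the horizontal-edge feature shown on the slide:

```python
import numpy as np
from scipy.signal import correlate2d

def filter_response(V, W):
    """Filter(V, W): 'valid'-mode 2D cross-correlation of image V with kernel W."""
    return correlate2d(V, W, mode="valid")

W1 = np.array([[1, 0, -1],
               [2, 0, -2],
               [1, 0, -1]], dtype=float)  # the kernel shown on the slide
V = np.random.rand(28, 28)                # toy image
print(filter_response(V, W1).shape)       # (26, 26)
```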
Our approach to feature learning is generative.
[Figure (CRBM model): the visible image V is connected through filters W^1, W^2, W^3 to binary hidden groups H^1, H^2, H^3]
10
Related Work
11
Related Work
• Convolutional Neural Network (CNN)
  – Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation
  – Does not perform well on natural images
• Biologically plausible models
  – Hand-crafted first layer vs. randomly selected prototypes for second layer
[Lecun et al. 98]
[Ranzato et al. CVPR'07]
[Serre et al., PAMI'07] [Mutch and Lowe, CVPR'06]
12
Discriminative
No Learning
Related Work (cont’d)
• Deep Belief Net
  – A two-layer partially observed MRF, called the RBM, is the building block
  – Learning is performed unsupervised and layer-by-layer, from the bottom layer upwards
• Our contributions: we incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add more complicated components, such as pooling and sparsity, into deep belief nets
[Hinton et al., NC'2006]
13
Generative & Unsupervised
Why Generative & Unsupervised
• Discriminative learning of deep and large neural networks has not been successful
  – Requires large training sets
  – Easily gets over-fitted for large models
  – First-layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first-layer features generatively
  – Switch to a discriminative model to select the discriminative features from those that are learned
  – Discriminative fine-tuning is helpful
Details
15
CRBM
• The image is the visible layer; the hidden layer corresponds to the filter responses
• An energy-based probabilistic model:

$$P(V; W) = \frac{1}{Z} \sum_H \exp\big(-E(V, H; W)\big)$$
$$E(V, H; W) = \sum_k E(V, H^k; W^k), \qquad E(V, H^k; W^k) = -H^k \bullet \mathrm{Filter}(V, W^k)$$

where $\bullet$ denotes the dot product of vectorized matrices.
16
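A minimal numeric sketch of this energy (my own illustration; the partition function Z is intractable and omitted, and bias terms are left out):

```python
import numpy as np
from scipy.signal import correlate2d

def crbm_energy(V, H, W):
    """E(V, H; W) = -sum_k H^k . Filter(V, W^k), where . is the dot product
    of vectorized matrices. V: (m, n) image; W: (K, r, r) filters;
    H: (K, m-r+1, n-r+1) binary hidden maps."""
    return -sum(np.sum(H[k] * correlate2d(V, W[k], mode="valid"))
                for k in range(len(W)))
```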
Training CRBMs
• Maximum likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable
• For CD learning we need to compute the conditionals P(H | V) and P(V | H)
[Figure: one Gibbs step from the data to a sample]
17
CRBM (Backward)
• Nearby hidden variables cooperate in reconstruction
• The conditional probabilities take the form

$$P(H^k \mid V) = \sigma\big(\mathrm{Filter}(V, W^k)\big)$$
$$P(V \mid H) = \sigma\Big(\sum_k \mathrm{Filter}(H^k, W^{k*})\Big)$$

where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ and $W^{k*}$ denotes the flipped filter.
18
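A sketch of these conditionals (my own code, assuming binary visible units and no bias terms); "full"-mode convolution applies the flipped kernel W^{k*} and restores the image size:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(V, W):
    """P(H^k = 1 | V) = sigmoid(Filter(V, W^k)) for each filter k."""
    return np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])

def p_v_given_h(H, W):
    """P(V = 1 | H) = sigmoid(sum_k Filter(H^k, W^k*)); convolve2d flips the
    kernel, and 'full' mode restores the original image size."""
    return sigmoid(sum(convolve2d(Hk, Wk, mode="full") for Hk, Wk in zip(H, W)))
```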
Learning the Hierarchy
• The structure is trained bottom-up and layerwise
• The CRBM model is used for training the filtering layers
• Filtering layers are followed by down-sampling (pooling) layers, which reduce the dimensionality
[Figure: Input → (1) 1st filtering (CRBM) → non-linearity → (2) pooling → (3) 2nd filtering (CRBM) → (4) pooling → classifier, with 1st and 2nd filter responses in between]
19
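The down-sampling step could look like the following non-overlapping max pooling (a plausible sketch of the pooling layers, not necessarily the exact operator used in the talk):

```python
import numpy as np

def max_pool(R, s=2):
    """Non-overlapping s x s max pooling of a filter-response map R."""
    m, n = R.shape
    m, n = m - m % s, n - n % s                           # crop to multiples of s
    return R[:m, :n].reshape(m // s, s, n // s, s).max(axis=(1, 3))
```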
Experiments
21
Evaluation
MNIST digit dataset
• Training set: 60,000 images of digits of size 28x28
• Test set: 10,000 images
INRIA person dataset
• Training set: 2416 person windows of size 128x64 pixels and 4.5x10^6 negative windows
• Test set: 1132 positive and 2x10^6 negative windows
22
First layer filters
• Gray-scale images of the INRIA positive set: 15 filters of size 7x7
• MNIST unlabeled digits: 15 filters of size 5x5
23
Second Layer Features (MNIST)
• Hard to visualize the filters
• We show patches that respond strongly to each filter:
24
Second Layer Features (INRIA)
25
MNIST Results
• MNIST error rate when the model is trained on the full training set
26
Results
27
False Positives (1st through 5th)
28-32
INRIA Results
• Adding our large-scale features significantly improves the performance of the baseline (HOG)
33
Conclusion
• We extended the RBM model to the Convolutional RBM, useful for domains with spatial locality
• We exploited CRBMs to train local hierarchical feature detectors, one layer at a time and generatively
• This method obtained results comparable to the state of the art in digit classification and human detection
34
Thank You
35
Hierarchical Feature Detector
36
Contrastive Divergence Learning
37
[Figure: Gibbs chain from the data (t = 0) to a one-step sample (t = 1)]

$$\frac{\partial E(V, H; \theta)}{\partial W^k} = -\mathrm{Filter}(V, H^k)$$
$$W^k \leftarrow W^k + \eta\Big(\mathrm{Filter}(V, H^k)\Big|_{t=0} - \mathrm{Filter}(V, H^k)\Big|_{t=1}\Big)$$

where t = 0 is evaluated on the data and t = 1 on a one-step Gibbs sample.
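Putting the pieces together, a CD-1 update for a bias-free CRBM might look like this (my own sketch built from the conditionals above; the talk additionally freezes hidden biases at large negative values for sparsity, which is omitted here):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, eta=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step: W^k += eta * (Filter(V, H^k)|t=0 - Filter(V, H^k)|t=1)."""
    # t = 0: hidden probabilities and a binary sample driven by the data.
    P0 = np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])
    H0 = (rng.random(P0.shape) < P0).astype(float)
    # Reconstruct the visible layer from the hidden sample.
    V1 = sigmoid(sum(convolve2d(Hk, Wk, mode="full") for Hk, Wk in zip(H0, W)))
    # t = 1: hidden probabilities under the reconstruction.
    P1 = np.stack([sigmoid(correlate2d(V1, Wk, mode="valid")) for Wk in W])
    for k in range(len(W)):
        W[k] += eta * (correlate2d(V, P0[k], mode="valid")
                       - correlate2d(V1, P1[k], mode="valid"))
    return W
```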
Training CRBMs (Cont'd)
• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is > 1
  – Partition the visible units into middle (v_m) and border (v_b) regions
• Instead of maximizing the likelihood, we (approximately) maximize $p(v_m \mid v_b)$
Enforcing Feature Sparsity
• The CRBM's representation is K (number of filters) times overcomplete
• After a few CD learning iterations, V is perfectly reconstructed
• Enforce sparsity to tackle this problem
  – Hidden bias terms were frozen at large negative values
• Having a single non-sparse hidden unit improves the learned features
  – Might be related to the ergodicity condition
Probabilistic Meaning of Max
[Figure: visible layer v with units 1-6, hidden layer h with units 1-4 (filters of size 3), and a max-pooled layer h' with h'_1 = max(h_1, h_2) and h'_2 = max(h_3, h_4)]

$$E(v, h) = h_1 w^T v_{1:3} + h_2 w^T v_{2:4} + h_3 w^T v_{3:5} + h_4 w^T v_{4:6}$$
$$E(v, h') = \max\big(h'_1 w^T v_{1:3},\; h'_1 w^T v_{2:4}\big) + \max\big(h'_2 w^T v_{3:5},\; h'_2 w^T v_{4:6}\big)$$
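A toy numeric check of the two energies above (my own illustration; w is a length-3 filter and v has six units, matching the figure):

```python
import numpy as np

w = np.array([1.0, -2.0, 1.0])                  # length-3 filter (toy values)
v = np.array([0.5, 1.0, 0.0, 2.0, 1.5, 0.0])    # six visible units

# Responses for the four valid filter placements: w . v[i:i+3]
r = np.array([w @ v[i:i + 3] for i in range(4)])

h = np.array([0.0, 1.0, 0.0, 1.0])              # binary hidden units
E_h = h @ r                                      # E(v, h) from the slide

hp = np.array([1.0, 1.0])                        # pooled units h'_1, h'_2
# Each pooled unit takes the max over its two candidate placements.
E_hp = max(hp[0] * r[0], hp[0] * r[1]) + max(hp[1] * r[2], hp[1] * r[3])
```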
The Classifier Layer
• We used an SVM as our final classifier
  – RBF kernel for MNIST
  – Linear kernel for INRIA
  – For INRIA, we combined our 4th-layer outputs with HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
  – This lets the discriminative model set the thresholds itself
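A minimal scikit-learn sketch of such a classifier layer (my own illustration; the feature dimensions and random data are hypothetical placeholders for pooled CRBM responses and HOG descriptors):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)

# MNIST: RBF-kernel SVM on pooled CRBM features (dimensions hypothetical).
X_mnist, y_mnist = rng.random((100, 240)), rng.integers(0, 10, 100)
clf_mnist = SVC(kernel="rbf").fit(X_mnist, y_mnist)

# INRIA: linear SVM on 4th-layer outputs concatenated with HOG features.
X_crbm, X_hog = rng.random((100, 512)), rng.random((100, 3780))
y_person = rng.integers(0, 2, 100)
clf_inria = LinearSVC().fit(np.hstack([X_crbm, X_hog]), y_person)
```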
Why are HOG features added?
• Because part-like features are very sparse
• Having a template of the human figure helps a lot
RBM
• A two-layer pairwise MRF with a full set of hidden-visible connections
[Figure: visible units v connected to hidden units h by weights w]
• The RBM is an energy-based model:

$$p(v, h; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(v, h; \theta)\big)$$
$$E(v, h; \theta) = -\sum_{i,j} v_i w_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j + \frac{1}{2} \sum_i v_i^2$$

• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward: $p(h \mid v)$ and $p(v \mid h)$
• Contrastive Divergence learning for training
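For the energy above (Gaussian visible units with unit variance), the conditionals reduce to simple closed forms; a hedged sketch of these:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, Wm, c):
    """p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + c_j)."""
    return sigmoid(v @ Wm + c)

def v_mean_given_h(h, Wm, b):
    """With the quadratic term (1/2) sum_i v_i^2, p(v | h) is Gaussian
    with mean W h + b and unit variance."""
    return Wm @ h + b
```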
Why Unsupervised Bottom-Up
• Discriminative learning of deep structure has not been successful
  – Requires large training sets
  – Is easily over-fitted for large models
  – First-layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first-layer features generatively
  – Later, switch to a discriminative model to select the discriminative features from those learned
  – Fine-tune the features discriminatively
INRIA Results (Cont'd)
• Miss rate at different FPPW rates
• FPPI is a better indicator of performance
• More experiments on the size of features and the number of layers are desired