Convolutional Restricted Boltzmann Machines for Feature Learning
Convolutional Restricted Boltzmann Machines for Feature Learning
Mohammad Norouzi
Advisor: Dr. Greg Mori
CS @ Simon Fraser University, 27 Nov 2009
1
Problems
• Human detection
• Handwritten digit classification
3
Sliding Window Approach
4
Sliding Window Approach (Cont’d)
5
[INRIA Person Dataset]
Decision Boundary
Success or failure of an object recognition algorithm hinges on the features used.
[Diagram: Input → Feature representation (our focus) → Classifier → Label: Human / Background, 0 / 1 / 2 / 3 / …]
6
Learning Local Feature Detector Hierarchies
7
[Figure annotation: features higher in the hierarchy are larger, more complicated, and less frequent]
Generative & Layerwise Learning
8
[Figure: unknown features (?) at each layer, learned generatively by a CRBM, one layer at a time]
Visual Features: Filtering
9
Filter Kernel (Feature):
 1  0 -1
 2  0 -2
 1  0 -1
[Figure: the image V is filtered with kernels W^1, W^2, W^3, producing responses Filter(V, W^1), Filter(V, W^2), Filter(V, W^3)]
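As an illustration of this filtering step (my own sketch, not code from the talk), Filter(V, W) can be computed as a "valid"-mode 2D cross-correlation; the kernel below is the horizontal-edge feature shown on the slide:

```python
import numpy as np
from scipy.signal import correlate2d

def filter_response(V, W):
    """Filter(V, W): 'valid'-mode 2D cross-correlation of image V with kernel W."""
    return correlate2d(V, W, mode="valid")

W1 = np.array([[1, 0, -1],
               [2, 0, -2],
               [1, 0, -1]], dtype=float)  # the kernel shown on the slide
V = np.random.rand(28, 28)                # toy image
print(filter_response(V, W1).shape)       # (26, 26)
```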
Our approach to feature learning is generative.
[Figure (CRBM model): the visible image V is connected through filters W^1, W^2, W^3 to binary hidden groups H^1, H^2, H^3]
10
Related Work
11
Related Work
• Convolutional Neural Network (CNN)
  – Filtering layers are bundled with a classifier, and all the layers are learned together using error backpropagation
  – Does not perform well on natural images
• Biologically plausible models
  – Hand-crafted first layer vs. randomly selected prototypes for second layer
[Lecun et al. 98]
[Ranzato et al. CVPR'07]
[Serre et al., PAMI'07] [Mutch and Lowe, CVPR'06]
12
Discriminative
No Learning
Related Work (cont’d)
• Deep Belief Net
  – A two-layer partially observed MRF, called the RBM, is the building block
  – Learning is performed unsupervised and layer-by-layer, from the bottom layer upwards
• Our contributions: we incorporate spatial locality into RBMs and adapt the learning algorithm accordingly
• We add more complicated components, such as pooling and sparsity, into deep belief nets
[Hinton et al., NC'2006]
13
Generative & Unsupervised
Why Generative & Unsupervised
• Discriminative learning of deep and large neural networks has not been successful
  – Requires large training sets
  – Easily gets over-fitted for large models
  – First-layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first-layer features generatively
  – Switch to a discriminative model to select the discriminative features from those that are learned
  – Discriminative fine-tuning is helpful
Details
15
CRBM
• The image is the visible layer; the hidden layer corresponds to the filter responses
• An energy-based probabilistic model:

$$P(V; W) = \frac{1}{Z} \sum_H \exp\big(-E(V, H; W)\big)$$
$$E(V, H; W) = \sum_k E(V, H^k; W^k), \qquad E(V, H^k; W^k) = -H^k \bullet \mathrm{Filter}(V, W^k)$$

where $\bullet$ denotes the dot product of vectorized matrices.
16
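A minimal numeric sketch of this energy (my own illustration; the partition function Z is intractable and omitted, and bias terms are left out):

```python
import numpy as np
from scipy.signal import correlate2d

def crbm_energy(V, H, W):
    """E(V, H; W) = -sum_k H^k . Filter(V, W^k), where . is the dot product
    of vectorized matrices. V: (m, n) image; W: (K, r, r) filters;
    H: (K, m-r+1, n-r+1) binary hidden maps."""
    return -sum(np.sum(H[k] * correlate2d(V, W[k], mode="valid"))
                for k in range(len(W)))
```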
Training CRBMs
• Maximum likelihood learning of CRBMs is difficult
• Contrastive Divergence (CD) learning is applicable
• For CD learning we need to compute the conditionals P(H | V) and P(V | H)
[Figure: one Gibbs step from the data to a sample]
17
CRBM (Backward)
• Nearby hidden variables cooperate in reconstruction
• The conditional probabilities take the form

$$P(H^k \mid V) = \sigma\big(\mathrm{Filter}(V, W^k)\big)$$
$$P(V \mid H) = \sigma\Big(\sum_k \mathrm{Filter}(H^k, W^{k*})\Big)$$

where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ and $W^{k*}$ denotes the flipped filter.
18
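A sketch of these conditionals (my own code, assuming binary visible units and no bias terms); "full"-mode convolution applies the flipped kernel W^{k*} and restores the image size:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(V, W):
    """P(H^k = 1 | V) = sigmoid(Filter(V, W^k)) for each filter k."""
    return np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])

def p_v_given_h(H, W):
    """P(V = 1 | H) = sigmoid(sum_k Filter(H^k, W^k*)); convolve2d flips the
    kernel, and 'full' mode restores the original image size."""
    return sigmoid(sum(convolve2d(Hk, Wk, mode="full") for Hk, Wk in zip(H, W)))
```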
Learning the Hierarchy
• The structure is trained bottom-up and layerwise
• The CRBM model is used for training the filtering layers
• Filtering layers are followed by down-sampling (pooling) layers, which reduce the dimensionality
[Figure: Input → (1) 1st filtering (CRBM) → non-linearity → (2) pooling → (3) 2nd filtering (CRBM) → (4) pooling → classifier, with 1st and 2nd filter responses in between]
19
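The down-sampling step could look like the following non-overlapping max pooling (a plausible sketch of the pooling layers, not necessarily the exact operator used in the talk):

```python
import numpy as np

def max_pool(R, s=2):
    """Non-overlapping s x s max pooling of a filter-response map R."""
    m, n = R.shape
    m, n = m - m % s, n - n % s                           # crop to multiples of s
    return R[:m, :n].reshape(m // s, s, n // s, s).max(axis=(1, 3))
```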
Experiments
21
Evaluation
MNIST digit dataset
• Training set: 60,000 images of digits of size 28x28
• Test set: 10,000 images
INRIA person dataset
• Training set: 2416 person windows of size 128x64 pixels and 4.5x10^6 negative windows
• Test set: 1132 positive and 2x10^6 negative windows
22
First layer filters
• Gray-scale images of the INRIA positive set: 15 filters of size 7x7
• MNIST unlabeled digits: 15 filters of size 5x5
23
Second Layer Features (MNIST)
• Hard to visualize the filters
• We show patches that respond strongly to each filter:
24
Second Layer Features (INRIA)
25
MNIST Results
• MNIST error rate when the model is trained on the full training set
26
Results
27
False Positives (1st through 5th)
28-32
INRIA Results
• Adding our large-scale features significantly improves the performance of the baseline (HOG)
33
Conclusion
• We extended the RBM model to the Convolutional RBM, useful for domains with spatial locality
• We exploited CRBMs to train local hierarchical feature detectors, one layer at a time and generatively
• This method obtained results comparable to the state of the art in digit classification and human detection
34
Thank You
35
Hierarchical Feature Detector
36
Contrastive Divergence Learning
37
[Figure: Gibbs chain from the data (t = 0) to a one-step sample (t = 1)]

$$\frac{\partial E(V, H; \theta)}{\partial W^k} = -\mathrm{Filter}(V, H^k)$$
$$W^k \leftarrow W^k + \eta\Big(\mathrm{Filter}(V, H^k)\Big|_{t=0} - \mathrm{Filter}(V, H^k)\Big|_{t=1}\Big)$$

where t = 0 is evaluated on the data and t = 1 on a one-step Gibbs sample.
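Putting the pieces together, a CD-1 update for a bias-free CRBM might look like this (my own sketch built from the conditionals above; the talk additionally freezes hidden biases at large negative values for sparsity, which is omitted here):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, eta=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step: W^k += eta * (Filter(V, H^k)|t=0 - Filter(V, H^k)|t=1)."""
    # t = 0: hidden probabilities and a binary sample driven by the data.
    P0 = np.stack([sigmoid(correlate2d(V, Wk, mode="valid")) for Wk in W])
    H0 = (rng.random(P0.shape) < P0).astype(float)
    # Reconstruct the visible layer from the hidden sample.
    V1 = sigmoid(sum(convolve2d(Hk, Wk, mode="full") for Hk, Wk in zip(H0, W)))
    # t = 1: hidden probabilities under the reconstruction.
    P1 = np.stack([sigmoid(correlate2d(V1, Wk, mode="valid")) for Wk in W])
    for k in range(len(W)):
        W[k] += eta * (correlate2d(V, P0[k], mode="valid")
                       - correlate2d(V1, P1[k], mode="valid"))
    return W
```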
Training CRBMs (Cont'd)
• The problem of reconstructing the border region becomes severe when the number of Gibbs sampling steps is > 1
  – Partition the visible units into middle (v_m) and border (v_b) regions
• Instead of maximizing the likelihood, we (approximately) maximize $p(v_m \mid v_b)$
Enforcing Feature Sparsity
• The CRBM's representation is K (number of filters) times overcomplete
• After a few CD learning iterations, V is perfectly reconstructed
• Enforce sparsity to tackle this problem
  – Hidden bias terms were frozen at large negative values
• Having a single non-sparse hidden unit improves the learned features
  – Might be related to the ergodicity condition
Probabilistic Meaning of Max
[Figure: visible layer v with units 1-6, hidden layer h with units 1-4 (filters of size 3), and a max-pooled layer h' with h'_1 = max(h_1, h_2) and h'_2 = max(h_3, h_4)]

$$E(v, h) = h_1 w^T v_{1:3} + h_2 w^T v_{2:4} + h_3 w^T v_{3:5} + h_4 w^T v_{4:6}$$
$$E(v, h') = \max\big(h'_1 w^T v_{1:3},\; h'_1 w^T v_{2:4}\big) + \max\big(h'_2 w^T v_{3:5},\; h'_2 w^T v_{4:6}\big)$$
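A toy numeric check of the two energies above (my own illustration; w is a length-3 filter and v has six units, matching the figure):

```python
import numpy as np

w = np.array([1.0, -2.0, 1.0])                  # length-3 filter (toy values)
v = np.array([0.5, 1.0, 0.0, 2.0, 1.5, 0.0])    # six visible units

# Responses for the four valid filter placements: w . v[i:i+3]
r = np.array([w @ v[i:i + 3] for i in range(4)])

h = np.array([0.0, 1.0, 0.0, 1.0])              # binary hidden units
E_h = h @ r                                      # E(v, h) from the slide

hp = np.array([1.0, 1.0])                        # pooled units h'_1, h'_2
# Each pooled unit takes the max over its two candidate placements.
E_hp = max(hp[0] * r[0], hp[0] * r[1]) + max(hp[1] * r[2], hp[1] * r[3])
```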
The Classifier Layer
• We used an SVM as our final classifier
  – RBF kernel for MNIST
  – Linear kernel for INRIA
  – For INRIA, we combined our 4th-layer outputs with HOG features
• We experimentally observed that relaxing the sparsity of the CRBM's hidden units yields better results
  – This lets the discriminative model set the thresholds itself
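A minimal scikit-learn sketch of such a classifier layer (my own illustration; the feature dimensions and random data are hypothetical placeholders for pooled CRBM responses and HOG descriptors):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)

# MNIST: RBF-kernel SVM on pooled CRBM features (dimensions hypothetical).
X_mnist, y_mnist = rng.random((100, 240)), rng.integers(0, 10, 100)
clf_mnist = SVC(kernel="rbf").fit(X_mnist, y_mnist)

# INRIA: linear SVM on 4th-layer outputs concatenated with HOG features.
X_crbm, X_hog = rng.random((100, 512)), rng.random((100, 3780))
y_person = rng.integers(0, 2, 100)
clf_inria = LinearSVC().fit(np.hstack([X_crbm, X_hog]), y_person)
```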
Why are HOG features added?
• Because part-like features are very sparse
• Having a template of the human figure helps a lot
RBM
• A two-layer pairwise MRF with a full set of hidden-visible connections
[Figure: visible units v connected to hidden units h by weights w]
• The RBM is an energy-based model:

$$p(v, h; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(v, h; \theta)\big)$$
$$E(v, h; \theta) = -\sum_{i,j} v_i w_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j + \frac{1}{2} \sum_i v_i^2$$

• Hidden random variables are binary; visible variables can be binary or continuous
• Inference is straightforward: $p(h \mid v)$ and $p(v \mid h)$
• Contrastive Divergence learning for training
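For the energy above (Gaussian visible units with unit variance), the conditionals reduce to simple closed forms; a hedged sketch of these:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, Wm, c):
    """p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + c_j)."""
    return sigmoid(v @ Wm + c)

def v_mean_given_h(h, Wm, b):
    """With the quadratic term (1/2) sum_i v_i^2, p(v | h) is Gaussian
    with mean W h + b and unit variance."""
    return Wm @ h + b
```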
Why Unsupervised Bottom-Up
• Discriminative learning of deep structure has not been successful
  – Requires large training sets
  – Is easily over-fitted for large models
  – First-layer gradients are relatively small
• Alternative hybrid approach
  – Learn a large set of first-layer features generatively
  – Later, switch to a discriminative model to select the discriminative features from those learned
  – Fine-tune the features discriminatively
INRIA Results (Cont'd)
• Miss rate at different FPPW rates
• FPPI is a better indicator of performance
• More experiments on the size of features and the number of layers are desired