Deep Nets for image classification - Colorado School of Mines

Transcript of Deep Nets for image classification - Colorado School of Mines

Page 1: Deep Nets for image classification - Colorado School of Mines

Image Classification with Deep Neural Networks

Greg Schoeninger

Page 2: Deep Nets for image classification - Colorado School of Mines

The Problem (General)

• Transform raw input (pixels) into higher-level representations.

• Edges, local shapes and colors, object parts, etc.

• How do we as humans recognize objects in scenes?

Page 3: Deep Nets for image classification - Colorado School of Mines

The Problem (simplified)

• Train a neural network to recognize 4 classes of images.

• STL‐10 Data Set (Stanford)

• 100,000 Unlabeled Image Patches

• 5,000 Labeled Images

Page 4: Deep Nets for image classification - Colorado School of Mines

Complexity

• Natural images have a high dimensionality.

• Varying position, orientation, lighting, etc. (Factors of variation)

• Many different features could be considered (edges, colors, SIFT, Gabor filters, etc.).

Page 5: Deep Nets for image classification - Colorado School of Mines

Deep Architectures

• Learn feature hierarchies from lower level features to higher level ones.

• Do not rely on hand engineered features.

• Inspired by the depth of the brain.

• Natural images are “stationary” ‐ features learned in one part can be applied to others.

• Invariant to small changes in input (translation, rotation, etc.)

Page 6: Deep Nets for image classification - Colorado School of Mines

Deep Architectures Continued

• # of variations in input greater than # of training examples.

• We now have sufficient computational power.

• Unsupervised learning performed locally at each level.

• Minimal supervised learning at the end.

• Learn good properties and representations of images, then learn what combinations of these properties are called (labels).

Page 7: Deep Nets for image classification - Colorado School of Mines

Solution (Overview)

• Self taught learning with a sparse auto encoder.

• Convolution

• Mean pooling of features

• Softmax regression of pooled features for classification.

• Unsupervised Feature Learning and Deep Learning ‐ Stanford

Page 8: Deep Nets for image classification - Colorado School of Mines

Simple Neuron

Page 9: Deep Nets for image classification - Colorado School of Mines

Neural Network

• Hook up neurons so that output of a neuron goes into input of another.

• 3 input units, 3 hidden units, 1 output unit.

• Notation – (x, a, w, b, l, h(x))

Page 10: Deep Nets for image classification - Colorado School of Mines

Neural Networks Activations

• x – Input

• a – Activations

• z – Total weighted sum of inputs and bias

• W – Parameters or weights associated with the connections between unit j in layer l, and unit i in layer l + 1.

• h(x) – Hypothesis, real number output. 

Page 11: Deep Nets for image classification - Colorado School of Mines

Forward Propagation

• a(1) = x
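
The slide lists only a(1) = x; below is a minimal NumPy sketch of the rest of the forward pass for the 3-3-1 network above. The sigmoid activation and the weight shapes are assumptions consistent with the notation on the previous slides, and the talk itself used MATLAB.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation for the 3-3-1 network: a(1) = x, then each
    layer computes z = W*a + b and a = f(z)."""
    a1 = x                    # a(1) = x
    z2 = W1 @ a1 + b1         # total weighted input to the hidden layer
    a2 = sigmoid(z2)          # hidden activations a(2)
    z3 = W2 @ a2 + b2         # weighted input to the output unit
    h = sigmoid(z3)           # hypothesis h(x)
    return a1, a2, h
```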

Page 12: Deep Nets for image classification - Colorado School of Mines

Bias Units

• Bias unit – enables activation function to be shifted as well as scaled.

Page 13: Deep Nets for image classification - Colorado School of Mines

Gradient Descent and Back Propagation

• Batch Gradient Descent.

• Try to minimize cost function J(W, b; x, y)

Page 14: Deep Nets for image classification - Colorado School of Mines

Gradient Descent

• Initialize weights (W) and biases (b) to small random values near zero.

• alpha = learning rate.

• Back propagation is an efficient way to calculate the partial derivatives of J(W, b).

Page 15: Deep Nets for image classification - Colorado School of Mines

Back Propagation

• Given a training example (x,y), run forward propagation to compute all the activations, including final hypothesis.

• Then for each neuron (i) in layer (L), compute an error term delta that measures how responsible that node was for any errors in the output.

Page 16: Deep Nets for image classification - Colorado School of Mines

Back Propagation

• Perform a feed-forward pass to calculate the activations.

• For each output unit in the final output layer set the delta term. This is just the error between the output nodes and the true expected values.

• Work backwards from the output layer to the first hidden layer and set the delta terms. Weighted average of error terms that use a(L) as an input.

• Use these delta terms to calculate the partial derivatives of weights and biases.
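
A sketch of those steps for the same small network, reusing the forward helper above and assuming a squared-error cost with sigmoid units (variable names are illustrative, not from the slides).

```python
def backprop(x, y, W1, b1, W2, b2):
    """Back propagation: compute the partial derivatives of J(W, b; x, y)."""
    a1, a2, h = forward(x, W1, b1, W2, b2)
    # Output-layer error term: difference from the target, scaled by the
    # sigmoid derivative f'(z) = f(z) * (1 - f(z)).
    delta3 = -(y - h) * h * (1 - h)
    # Hidden-layer error terms: errors of the nodes that use a(2) as input,
    # weighted by the connecting weights, times f'(z(2)).
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)
    # Partial derivatives with respect to the weights and biases.
    grad_W1 = np.outer(delta2, a1)
    grad_b1 = delta2
    grad_W2 = np.outer(delta3, a2)
    grad_b2 = delta3
    return grad_W1, grad_b1, grad_W2, grad_b2
```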

Page 17: Deep Nets for image classification - Colorado School of Mines

Gradient Descent and Back Prop

• Set initial weights and biases to random values close to 0.

• Go through the training examples and use back propagation to compute the error terms (delta)

• Set the change in the weights and biases by adding their respective delta terms.

• Update the parameters, minimizing the error
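
Putting the forward and backward sketches together, one batch gradient descent update might look like this (the loop structure and the alpha value are assumptions):

```python
def gradient_descent_step(examples, W1, b1, W2, b2, alpha=0.1):
    """One batch update: average the back-prop gradients over all training
    examples, then move the parameters downhill to reduce the error."""
    m = len(examples)
    sums = [np.zeros_like(p) for p in (W1, b1, W2, b2)]
    for x, y in examples:
        for s, g in zip(sums, backprop(x, y, W1, b1, W2, b2)):
            s += g
    gW1, gb1, gW2, gb2 = (s / m for s in sums)
    return W1 - alpha * gW1, b1 - alpha * gb1, W2 - alpha * gW2, b2 - alpha * gb2
```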

Page 18: Deep Nets for image classification - Colorado School of Mines

Auto encoders

• Unsupervised training

• Set target values equal to the inputs (identity function)

Page 19: Deep Nets for image classification - Colorado School of Mines

Auto encoders with sparsity constraint

• Make sure that the average activation of each hidden unit over the training set is constrained to a target sparsity value rho.

• Add an extra penalty to the overall cost function, based on the KL divergence between Bernoulli random variables.
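
A sketch of that penalty in NumPy: rho_hat is the vector of average hidden-unit activations over the training set; rho = 0.035 matches the sparsity parameter quoted later in the talk, while the beta weight here is an assumption.

```python
def sparsity_penalty(rho_hat, rho=0.035, beta=3.0):
    """KL divergence between Bernoulli variables with means rho and rho_hat,
    summed over the hidden units and added to the overall cost J(W, b)."""
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * np.sum(kl)
```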

Page 20: Deep Nets for image classification - Colorado School of Mines

Auto encoder continued

• Auto encoders learn what input image would most likely cause an activation.

• Each hidden unit is now learning to look for certain features.

• Example of training an auto encoder on 10x10 whitened image patches, with 100 hidden units.

Page 21: Deep Nets for image classification - Colorado School of Mines

Linear Decoders (Sparse Auto Encoder Variation)

• Some neurons use a different activation function.

• The sigmoid activation function constrains the range of the outputs (and therefore the inputs an auto encoder can reconstruct) to [0,1].

• Linear activation function: Set a(3) = z(3) instead of a(3) = f(z(3)) for the output layer. (Identity function)

• Output is now linear function of hidden unit activations.

Page 22: Deep Nets for image classification - Colorado School of Mines

Simplified Gradients and Back propagation

• New activation function, so the gradients change for output units.

• y = x is the desired output.

• f(z) = z

• f’(z) = 1

• Hidden layer still uses the sigmoid activation, f’(z(2))
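
A sketch of the resulting error terms, following the notation of the earlier back propagation sketch (not taken from the slides):

```python
def output_deltas_linear(y, a3, W2, a2):
    """Error terms when the output layer is linear: f(z) = z, so f'(z) = 1
    and the sigmoid-derivative factor drops out of delta(3)."""
    delta3 = -(y - a3)                        # a3 = z3 for a linear decoder
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)  # hidden layer is still sigmoid
    return delta3, delta2
```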

Page 23: Deep Nets for image classification - Colorado School of Mines

ZCA Whitening

• Goal is to make the input data less redundant.

– Pixels are highly correlated with nearby pixels, and weakly correlated with faraway pixels. This is similar to how we think the biological eye processes images.

– Adjacent pixels tend to have similar values, so it is inefficient to transmit every single pixel separately.

• We are not interested in the overall brightness, so subtract the mean value for normalization.

Page 24: Deep Nets for image classification - Colorado School of Mines

PCA and ZCA Whitening

• Subtract the mean value of all patches from the input patch.

• Sigma – the covariance matrix, since x has a 0 mean now.

• Compute the eigenvectors of sigma using:

– [U,S,V] = svd(sigma)

– U = eigenvectors

– S = eigenvalues

– V is transpose(U)

• You can reduce the dimensionality by only considering the top (k) eigenvalues of the data.
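
A minimal NumPy version of these steps (the talk used MATLAB's svd; the epsilon regularizer and its value are assumptions, and the optional dimensionality reduction to the top k components is omitted):

```python
def zca_whiten(X, epsilon=0.1):
    """ZCA-whiten image patches, one patch per column of X."""
    X = X - X.mean(axis=1, keepdims=True)    # subtract the mean patch
    sigma = X @ X.T / X.shape[1]             # covariance matrix (x is zero mean)
    U, S, _ = np.linalg.svd(sigma)           # U = eigenvectors, S = eigenvalues
    # Rotate into the eigenbasis, rescale each direction, then rotate back;
    # the rotation back is what makes this ZCA rather than PCA whitening.
    return U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ X
```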

Page 25: Deep Nets for image classification - Colorado School of Mines

Linear Decoder Implementation

• Learn color image patch features; flatten the intensities from each channel into a vector.

• 100,000 8x8 random image patches from 13,000 96x96 color images. (Cats, dogs, deer, airplanes, birds, horses, monkeys, ships, trucks).

• 192 (8*8*3) input units

• 400 hidden units

• 192 output units

• 0.035 sparsity parameter.

Page 26: Deep Nets for image classification - Colorado School of Mines

Convolution

• We have learned features over random 8x8 patches from large images.

• Convolve these feature detectors over a new, larger image.

– This gives us different feature activation values at each location of the image.

• Run 8x8 window over 64x64 image to get sets of 57x57 convolved features (400 sets in our case).

Page 27: Deep Nets for image classification - Colorado School of Mines

Convolution Implementation

• Compute activations for every 8x8 patch in new image.

• Loop through the features, and convolve the image with each feature using MATLAB's conv2 function over the “valid” region.

• Then run the resulting convolved image plus the bias for this feature through the sigmoid function to get the activations.
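
A single-channel sketch of that loop using SciPy's convolve2d in place of MATLAB's conv2, reusing the sigmoid helper from the earlier sketches (handling of the three color channels and of the folded-in ZCA whitening is omitted for brevity):

```python
from scipy.signal import convolve2d

def convolve_features(image, features, b):
    """Convolve each learned 8x8 feature over a 64x64 image, add the feature's
    bias, and pass through the sigmoid. Returns (num_features, 57, 57)."""
    maps = []
    for k in range(features.shape[0]):
        # Flipping the kernel makes convolve2d act like the sliding-window
        # correlation the auto encoder's hidden units compute; "valid" keeps
        # only windows fully inside the image: 64 - 8 + 1 = 57.
        kernel = np.flipud(np.fliplr(features[k]))
        maps.append(sigmoid(convolve2d(image, kernel, mode='valid') + b[k]))
    return np.array(maps)
```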

Page 28: Deep Nets for image classification - Colorado School of Mines

Pooling

• In theory we could run the convolved features right through a classifier – but this is computationally challenging.

– 57*57*400 = 1,299,600 features per example.

• Aggregate statistics of features over windows.

• Mean pooling or max pooling.

• PoolDim = 19, so 3x3 pooling (57/19 = 3).
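
A sketch of mean pooling over non-overlapping 19x19 windows of each 57x57 convolved feature map:

```python
def mean_pool(conv_maps, pool_dim=19):
    """Mean-pool each convolved feature map over non-overlapping
    pool_dim x pool_dim windows: (400, 57, 57) -> (400, 3, 3) here."""
    n, h, w = conv_maps.shape
    ph, pw = h // pool_dim, w // pool_dim
    trimmed = conv_maps[:, :ph * pool_dim, :pw * pool_dim]
    return trimmed.reshape(n, ph, pool_dim, pw, pool_dim).mean(axis=(2, 4))
```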

Page 29: Deep Nets for image classification - Colorado School of Mines

Classification

• We can now use the pooled features for classification.

• Softmax Regression

– Supervised

– Similar to logistic regression (binary classification), but we can have multiple class labels.

– Compute the probability of a label given an input.

Page 30: Deep Nets for image classification - Colorado School of Mines

Softmax Regression

• There is no closed-form way to solve for the minimum of J(theta).

• Use gradient descent or L-BFGS to solve for the minimum.

• Add a weight decay parameter to guarantee convergence to a unique solution.
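
A sketch of the softmax cost with the weight decay term (the decay strength lam is an assumption; in practice this function and its gradient would be handed to gradient descent or L-BFGS):

```python
def softmax_cost(theta, X, y, lam=1e-4):
    """Softmax regression cost with weight decay.
    theta: (num_classes, num_features), X: (num_features, m),
    y: integer class labels, one per example (length-m array)."""
    m = X.shape[1]
    scores = theta @ X
    scores = scores - scores.max(axis=0)              # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=0)
    data_term = -np.log(probs[y, np.arange(m)]).mean()
    decay_term = (lam / 2.0) * np.sum(theta ** 2)     # makes the minimum unique
    return data_term + decay_term, probs
```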

Page 31: Deep Nets for image classification - Colorado School of Mines

Demo

Page 32: Deep Nets for image classification - Colorado School of Mines

Architecture Overview

• Self taught learning with sparse auto encoder.

– Preprocessed with ZCA whitening.

• Use learned features for convolution on large image.

• Pool convolutions to reduce dimensionality and overfitting.

• Softmax regression for classification.

Example of self taught learning.

Page 33: Deep Nets for image classification - Colorado School of Mines

Layers of Depth

• Deep networks have multiple hidden layers.

– Remember our auto encoder had 1 hidden layer.

– You can stack auto encoders to achieve greater depth.

– Ditch the “decoding” layer and attach to next layer or classifier.
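
A sketch of that idea, keeping only the encoding half of each trained auto encoder and reusing the sigmoid helper from the earlier sketches (the layer-list format is an assumption):

```python
def stacked_features(X, layers):
    """Greedy layer-wise stacking: run the input through the encoder half of
    each trained auto encoder in turn; the decoding layers are discarded.
    X has one example per column; layers is a list of (W, b) pairs."""
    a = X
    for W, b in layers:
        a = sigmoid(W @ a + b[:, None])   # hidden activations feed the next layer
    return a  # feed these top-level features to the softmax classifier
```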

Page 34: Deep Nets for image classification - Colorado School of Mines

Questions?

Page 35: Deep Nets for image classification - Colorado School of Mines

Sources

• http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

• http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf

• http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

• http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf

• http://www.cs.toronto.edu/~hinton/absps/ranzato_cvpr2011.pdf

• http://www.cs.toronto.edu/~hinton/science.pdf