Unsupervised Feature Learning
![Page 1: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/1.jpg)
Unsupervised Feature Learning: A Literature Review
By: Amgad Muhammad & Mohamed EL Fadly
![Page 2: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/2.jpg)
Outline
• Background
• Problem Definition
• Unsupervised Feature Learning
• Our Work
• Sparse Auto-encoder
• Preprocessing: PCA and Whitening
• Self-Taught Learning and Unsupervised Feature Learning
• References
![Page 3: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/3.jpg)
Background
• Machine learning is one of the cornerstone fields of Artificial Intelligence, in which machines
learn to act autonomously and react to new situations without being pre-programmed.
• Machine learning has seen numerous successes, but applying learning algorithms today often
means spending a long time hand-engineering the input feature representation. This is true for
many problems in vision, audio, NLP, robotics, and other areas.
• There are several families of learning algorithms, among them [1]:
1) Supervised learning
2) Unsupervised learning
![Page 4: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/4.jpg)
Problem Definition
• Supervised learning targets two kinds of task:
• Regression
• Classification
• The first step in training a machine with supervised learning is collecting the data set,
which in most cases is a difficult and expensive process.
• The alternative approach is to measure and use everything, which leads to other problems, e.g.
noisy data [2].
![Page 5: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/5.jpg)
Unsupervised feature learning
• The unsupervised feature learning approach learns a higher-level representation of
unlabeled data by detecting patterns in it, using algorithms such as sparse
coding [3].
• It is a self-taught learning framework developed to transfer knowledge from unlabeled data,
which is much easier to obtain, and to use it as a preprocessing step that enhances supervised
inductive models.
• The framework was developed to tackle open issues in supervised learning and to
increase its accuracy regardless of the domain of interest (vision, sound, or text). [4]
![Page 6: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/6.jpg)
Our Work
• We present some methods for unsupervised feature learning and deep learning, each of
which automatically learns a good representation of the input from unlabeled data.
• We concentrate on the following algorithms, detailed in the following slides:
• Sparse Autoencoder
• PCA and Whitening
• Self-Taught Learning
• We also focus on applying these algorithms to learn features from images.
![Page 7: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/7.jpg)
Sparse Autoencoders
![Page 8: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/8.jpg)
Sparse Auto-encoder
An autoencoder neural network is an unsupervised learning algorithm that applies
backpropagation to a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in \mathbb{R}^n$, setting the target values to
be equal to the inputs, [6]
i.e. it uses $y^{(i)} = x^{(i)}$.
Autoencoder [6]
![Page 9: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/9.jpg)
Neural Network
Before getting further into the details of the algorithm, we need a quick review of neural
networks.
To describe neural networks, we begin with the simplest possible one: a network
that comprises a single "neuron." We use the following diagram to denote a single neuron. [5]
Single Neuron [8]
![Page 10: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/10.jpg)
Neural Network
This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term) and
outputs $h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$,
where $f : \mathbb{R} \to \mathbb{R}$ is called the activation function. [5]
![Page 11: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/11.jpg)
Sigmoid Activation Function
The activation function can be: [8]
1) Sigmoid function: $f(z) = \frac{1}{1 + \exp(-z)}$, with output in the range (0, 1)
Sigmoid Function [8]
![Page 12: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/12.jpg)
Tanh Activation Function
2) Tanh function: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with output in the range (-1, 1)
Tanh Function [8]
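A single neuron with either activation function can be sketched in a few lines of NumPy. This is only an illustration: the function names, the weights `w`, and the bias `b` are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: maps any real z into the open interval (-1, 1)."""
    return np.tanh(z)

def neuron_output(x, w, b, f=sigmoid):
    """Single neuron: h(x) = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

# Illustrative (hypothetical) input, weights, and bias.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.0])
b = 0.0

h_sig = neuron_output(x, w, b, f=sigmoid)   # pre-activation is 0, so sigmoid gives 0.5
h_tanh = neuron_output(x, w, b, f=tanh)     # pre-activation is 0, so tanh gives 0.0
print(h_sig, h_tanh)
```

With these weights the weighted sum is exactly zero, which makes the midpoint of each activation function easy to see.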
![Page 13: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/13.jpg)
Neural Network Model
• A neural network is built by hooking together many of our simple "neurons," so that the
output of one neuron can be the input of another. For example, here is a small neural network.
• The circles labeled "+1" are called bias units and correspond to the intercept term. The
leftmost layer of the network is called the input layer, and the rightmost layer the output
layer. The middle layer of nodes is called the hidden layer, because its values are not observed
in the training set. [8]
Small Neural Network [8]
![Page 14: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/14.jpg)
Neural Network Model
Neural network parameters are:
• $(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where $W^{(l)}_{ij}$ denotes the parameter (or weight) associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$.
• $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.
• $a^{(l)}_i$ denotes the activation (meaning output value) of unit $i$ in layer $l$.
Given a fixed setting of the parameters $W, b$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real
number.
Specifically, the computation that this neural network represents is given (in vector form) by [8]:
$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)}),$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad h_{W,b}(x) = a^{(3)} = f(z^{(3)}).$$
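The forward computation of such a three-layer network can be sketched as follows. The layer sizes (3 inputs, 3 hidden units, 1 output), the random initial weights, and the helper name `forward` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, f=sigmoid):
    """Forward pass of an input/hidden/output network.

    W1[i, j] is the weight from input unit j to hidden unit i;
    W2[i, j] is the weight from hidden unit j to output unit i.
    Returns the hidden activations a2 and the hypothesis h = a3.
    """
    a2 = f(W1 @ x + b1)   # hidden-layer activations
    a3 = f(W2 @ a2 + b2)  # output-layer activations
    return a2, a3

# Hypothetical parameters for a 3-3-1 network.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 3))
b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(1, 3))
b2 = np.zeros(1)

x = np.array([0.2, -0.4, 0.7])
a2, h = forward(x, W1, b1, W2, b2)
print(a2, h)
```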
![Page 15: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/15.jpg)
Autoencoders and Sparsity
• The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other words, it tries to learn an approximation to the
identity function, so that its output $\hat{x}$ is similar to $x$.
• Placing constraints on the network, such as limiting the number of hidden units or imposing
a sparsity constraint on the hidden units, leads it to discover interesting structure in the data. With a sparsity constraint this holds even if
the number of hidden units is large. [6]
![Page 16: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/16.jpg)
Autoencoders and Sparsity Algorithm
Assumptions:
1. We want the neurons to be inactive most of the time. A neuron is "active" (or "firing") if its output value is
close to 1, and "inactive" if its output value is close to 0; this assumes a sigmoid activation function. [6]
2. Recall that $a^{(2)}_j$ denotes the activation of hidden unit $j$ in the autoencoder.
3. We write $a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $x$.
4. Let $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a^{(2)}_j(x^{(i)}) \right]$ be the average activation of hidden unit $j$ (averaged over the training set).
Objective:
We would like to (approximately) enforce the constraint $\hat{\rho}_j = \rho$, where $\rho$ is a sparsity parameter, a small value
close to zero. [6]
![Page 17: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/17.jpg)
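Autoencoders and Sparsity Algorithm – Cont’d
• To achieve this, we add an extra penalty term to our optimization objective that penalizes $\hat{\rho}_j$
deviating significantly from $\rho$.
• The penalty is $\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j}$, where $s_2$ is the number of neurons in the hidden layer and the index $j$ sums over
the hidden units in the network. [6]
• It can also be written $\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j}$ is the Kullback-Leibler (KL) divergence between a
Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_j$. [6]
• KL divergence is a standard function for measuring how different two distributions are.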
![Page 18: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/18.jpg)
Autoencoders and Sparsity Algorithm – Cont’d
• The KL penalty function has the following property: $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and otherwise it increases monotonically as
$\hat{\rho}_j$ diverges from $\rho$.
• For example, if we plot $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ for a range of values of $\hat{\rho}_j$
(with $\rho = 0.2$), we see that the KL divergence reaches its minimum
of 0 at $\hat{\rho}_j = \rho$ and approaches ∞ as $\hat{\rho}_j$ approaches 0 or 1.
• Thus, minimizing this penalty term has the effect of causing
$\hat{\rho}_j$ to be close to $\rho$. [6]
KL Function [6]
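The behaviour of the KL penalty can be checked numerically. The sketch below evaluates the Bernoulli KL divergence for the sparsity target of 0.2 used in the plotted example; the function name `kl_bernoulli` and the sample values of the average activation are illustrative assumptions.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

rho = 0.2                                             # sparsity target from the plotted example
rho_hat = np.array([0.01, 0.1, 0.2, 0.5, 0.99])       # hypothetical average activations
kl = kl_bernoulli(rho, rho_hat)

for r, k in zip(rho_hat, kl):
    print(f"rho_hat = {r:4.2f}  KL = {k:.4f}")
```

The penalty is zero where the average activation equals the target and grows as it moves toward 0 or 1.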
![Page 19: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/19.jpg)
Autoencoders and Sparsity Algorithm – Cont’d
• The overall cost function is now $J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $J(W,b)$ is the neural network cost function and
$\beta$ controls the weight of the sparsity penalty term.
• The term $\hat{\rho}_j$ also depends on $W, b$, because it is the average activation of hidden unit $j$, and the
activation of a hidden unit depends on the parameters $W, b$. [6]
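A minimal sketch of the sparse cost follows. It assumes that the base cost is the average squared reconstruction error with no weight-decay term, and the values of the sparsity target `rho`, the penalty weight `beta`, and the random data are placeholders rather than the settings used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(X, W1, b1, W2, b2, rho=0.05, beta=3.0):
    """Sparse autoencoder cost for data X of shape (n_inputs, m_examples).

    Assumption: the base cost J is the mean squared reconstruction error
    (1 / (2m)) * sum ||h(x) - x||^2, without weight decay.
    Returns (J_sparse, J, rho_hat).
    """
    m = X.shape[1]
    A2 = sigmoid(W1 @ X + b1[:, None])   # hidden activations, one column per example
    A3 = sigmoid(W2 @ A2 + b2[:, None])  # reconstructions
    J = 0.5 * np.sum((A3 - X) ** 2) / m
    rho_hat = A2.mean(axis=1)            # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return J + beta * kl, J, rho_hat

# Hypothetical data and parameters: 64 inputs, 25 hidden units, 200 examples.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.9, size=(64, 200))
W1 = rng.normal(scale=0.01, size=(25, 64))
b1 = np.zeros(25)
W2 = rng.normal(scale=0.01, size=(64, 25))
b2 = np.zeros(64)

cost, J, rho_hat = sparse_autoencoder_cost(X, W1, b1, W2, b2)
print(cost, J, rho_hat.mean())
```

With small random weights the hidden units sit near 0.5, far from the target, so the penalty term dominates the untrained cost.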
![Page 20: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/20.jpg)
Autoencoder Implementation
• We implemented a sparse autoencoder, trained on 8×8 image patches using the L-BFGS
optimization algorithm.
Step 1: Generate training set
The first step is to sample image patches to form the training set.
A random sample of 200 patches from the dataset.
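Patch sampling can be sketched as below. The image array here is random stand-in data, and the function name `sample_patches` and the image size are assumptions; only the 8×8 patch size and the count of 200 come from the slides.

```python
import numpy as np

def sample_patches(images, num_patches=200, patch_size=8, seed=0):
    """Sample random square patches from a stack of grayscale images.

    images: array of shape (num_images, height, width).
    Returns an array of shape (patch_size**2, num_patches), one patch per column.
    """
    rng = np.random.default_rng(seed)
    n_img, height, width = images.shape
    patches = np.empty((patch_size * patch_size, num_patches))
    for p in range(num_patches):
        i = rng.integers(n_img)                      # which image
        r = rng.integers(height - patch_size + 1)    # top-left row
        c = rng.integers(width - patch_size + 1)     # top-left column
        patches[:, p] = images[i, r:r + patch_size, c:c + patch_size].ravel()
    return patches

# Stand-in data: 10 random 64x64 "images" (not the real dataset).
images = np.random.default_rng(1).random((10, 64, 64))
patches = sample_patches(images)
print(patches.shape)
```

Each column is a flattened patch, matching the 64 input units of the network.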
![Page 21: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/21.jpg)
Autoencoder Implementation
Step 2: Sparse autoencoder objective
Compute the sparse autoencoder cost function $J_{\text{sparse}}(W,b)$ and the corresponding derivatives
of $J_{\text{sparse}}$ with respect to the different parameters.
Step 3: Train the sparse autoencoder
After computing $J_{\text{sparse}}$ and its derivatives, we minimize $J_{\text{sparse}}$ with respect to its parameters, and
thereby train our sparse autoencoder. We trained our sparse autoencoder with the L-BFGS algorithm. Our
neural network has 64 input units, 25 hidden units, and 64 output units.
![Page 22: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/22.jpg)
Autoencoder Implementation Results
After training, the sparse autoencoder successfully learned a set of edge detectors.

| Component | Configuration |
| --- | --- |
| CPU | Intel Core i7 quad-core processor, 2.7 GHz |
| RAM | 6 GB |
| Training set | 200 patches of 8×8 images |
| Neural network for training | 64 input units, 25 hidden units, and 64 output units |
![Page 23: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/23.jpg)
Autoencoder Implementation Results
| Training Time | Expected Time [1] |
| --- | --- |
| 39 seconds | Less than a minute |
![Page 24: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/24.jpg)
Principal Component Analysis – PCA
![Page 25: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/25.jpg)
Principal Component Analysis – PCA
• PCA is a dimensionality reduction technique that eliminates highly correlated variables
without sacrificing much detail. [7]
![Page 26: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/26.jpg)
PCA – Example
• Consider the 2D data example shown.
• The data has already been pre-processed using mean
normalization.
• We want to find the principal directions of variation.
2D data example [8]
![Page 27: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/27.jpg)
PCA – Example (Cont’d)
As depicted, $u_1$ is the
direction of strongest variation and $u_2$ is
the second strongest direction of
variation. [8]
2D data example [8]
![Page 28: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/28.jpg)
PCA – Math
• To compute the principal directions of variation, we first compute the covariance matrix $\Sigma$ of the data
points as follows:
• $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$, where $m$ is the number of points.
• It can be proven that $u_1$ is the top eigenvector of $\Sigma$ and $u_2$ is the second eigenvector. [8]
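The covariance and eigenvector computation can be sketched with NumPy as follows. The 2D data set is synthetic and only stands in for the plotted example; the variable names are assumptions.

```python
import numpy as np

# Synthetic, strongly correlated 2D data (a stand-in for the plotted example).
rng = np.random.default_rng(0)
m = 500
t = rng.normal(size=m)
X = np.vstack([t + 0.1 * rng.normal(size=m),
               t + 0.1 * rng.normal(size=m)])    # shape (2, m), one point per column
X = X - X.mean(axis=1, keepdims=True)            # mean normalization

# Covariance matrix: (1/m) * sum_i x_i x_i^T
Sigma = (X @ X.T) / m

# eigh returns the eigenvalues of a symmetric matrix in ascending order,
# so reorder to put the strongest direction of variation first.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
U = eigvecs[:, order]
u1, u2 = U[:, 0], U[:, 1]

# Rotating the data into the eigenvector basis decorrelates the features.
X_rot = U.T @ X
print(lam, u1, u2)
```

For this data the top eigenvector lies close to the diagonal direction, and the covariance of the rotated data is diagonal with the eigenvalues on the diagonal.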
![Page 29: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/29.jpg)
PCA – Math
• Having calculated the eigenvectors, we construct a matrix $U = [u_1 \; u_2 \; \cdots \; u_n]$ whose columns are the eigenvectors.
• Retaining all the eigenvectors of the covariance matrix in $U$ makes $U$ a rotation matrix
for the input $x$, giving $x_{\text{rot}} = U^T x$. [8]
2D data example [8]
![Page 30: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/30.jpg)
PCA – Dimensionality Reduction
• If we decide to retain only the top $k$ eigenvectors in the matrix $U$, we reduce the dimension of the input from $n$ to $k$.
[7]
• How should we choose $k$?
![Page 31: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/31.jpg)
PCA – Dimensionality Reduction
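• To decide on $k$, we consider the percentage of variance retained (PVR).
• A covariance matrix $\Sigma$ with $n$ eigenvectors has $n$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and retaining the top $k$ components gives $\text{PVR} = \sum_{j=1}^{k} \lambda_j \Big/ \sum_{j=1}^{n} \lambda_j$.
• For images, for example, it is common to choose $k$ so that a high fraction of the variance is retained (the UFLDL tutorial [8] suggests 99%). [7]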
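Choosing the number of components from the eigenvalues can be sketched as below. The eigenvalues and the 0.95 threshold are made-up illustration values, and `choose_k` is a hypothetical helper name.

```python
import numpy as np

def choose_k(eigvals, threshold=0.99):
    """Smallest k whose top-k eigenvalues retain at least `threshold` of the variance.

    Returns (k, pvr), where pvr[i] is the fraction retained by the top i+1 components.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    pvr = np.cumsum(lam) / np.sum(lam)
    k = int(np.searchsorted(pvr, threshold)) + 1            # first index reaching threshold
    return min(k, lam.size), pvr

# Made-up eigenvalues; total variance is 10.
lam = [4.0, 3.0, 2.0, 0.9, 0.1]
k, pvr = choose_k(lam, threshold=0.95)
print(k, pvr)
```

Here the top three components retain 90% of the variance and the top four retain 99%, so a 95% threshold selects four components.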
![Page 32: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/32.jpg)
Whitening
• In PCA, we removed highly correlated features to reduce the feature dimensionality.
• The next step is to make sure that all the features have unit variance.
• With PCA, we transformed our input $x$ into $x_{\text{rot}} = U^T x$.
• By scaling every feature $x_{\text{rot},i}$ by $1/\sqrt{\lambda_i}$, where $\lambda_i$ is the eigenvalue corresponding to component $i$, we get
features with unit variance. [8]
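PCA whitening can be sketched as follows. The small constant `epsilon` added to the eigenvalues is an assumption of this sketch (a common guard against dividing by near-zero eigenvalues) and does not appear in the slides; the data is synthetic.

```python
import numpy as np

def pca_whiten(X, epsilon=1e-5):
    """PCA-whiten zero-mean data X of shape (n_features, m_examples).

    epsilon is an assumed regularizer that avoids division by tiny eigenvalues.
    Returns (X_white, U, lam).
    """
    m = X.shape[1]
    Sigma = (X @ X.T) / m
    lam, U = np.linalg.eigh(Sigma)                       # eigenvalues and eigenvectors
    X_rot = U.T @ X                                      # rotate into the eigenbasis
    X_white = X_rot / np.sqrt(lam + epsilon)[:, None]    # scale each feature to unit variance
    return X_white, U, lam

# Synthetic correlated data: mix independent noise with a fixed matrix.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
X = A @ rng.normal(size=(3, 1000))
X = X - X.mean(axis=1, keepdims=True)

X_white, U, lam = pca_whiten(X)
cov_white = (X_white @ X_white.T) / X.shape[1]
print(np.round(cov_white, 3))
```

The covariance of the whitened data is approximately the identity matrix, i.e. the features are uncorrelated with unit variance.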
![Page 33: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/33.jpg)
Self-Taught Learning
![Page 34: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/34.jpg)
Self-Taught learning and Unsupervised feature learning
Given an unlabeled data set, we can train a sparse autoencoder to
extract features that give us a better, condensed representation of the data.
Neural Network [8]
![Page 35: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/35.jpg)
Self-Taught learning and Unsupervised feature learning
• Once training is done, the network can produce better features to represent the input, using the activations of its hidden layer. [8]
Input layer of Neural Network [8]
![Page 36: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/36.jpg)
Self-Taught learning and Unsupervised feature learning
• Given a labeled training set $\{(x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)})\}$, we feed the autoencoder each input $x_l$ to get the activation $a_l$ of the trained hidden layer. [8]
Input layer of Neural Network [8]
![Page 37: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/37.jpg)
Self-Taught learning and Unsupervised feature learning
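• Given the activations $a_l$, we can either replace the labeled training data with $\{(a_l^{(1)}, y^{(1)}), \ldots, (a_l^{(m_l)}, y^{(m_l)})\}$ or concatenate the autoencoder's extracted features to the original labeled features, giving $\{((x_l^{(1)}, a_l^{(1)}), y^{(1)}), \ldots, ((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)})\}$. [8]
Input layer of Neural Network [8]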
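The two options, replacing the inputs with the hidden activations or concatenating the two, can be sketched as below. The weights here are random placeholders standing in for a trained autoencoder's first layer, and the sizes and names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(X_l, W1, b1):
    """Hidden-layer activations for labeled inputs X_l of shape (n_inputs, m_l)."""
    return sigmoid(W1 @ X_l + b1[:, None])

# Placeholder parameters; in practice W1 and b1 come from the trained autoencoder.
rng = np.random.default_rng(0)
n_inputs, n_hidden, m_l = 64, 25, 10
W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b1 = np.zeros(n_hidden)
X_l = rng.random((n_inputs, m_l))        # stand-in labeled inputs

A_l = extract_features(X_l, W1, b1)

X_replaced = A_l                         # option 1: use the learned features only
X_concat = np.vstack([X_l, A_l])         # option 2: original features plus learned features
print(X_replaced.shape, X_concat.shape)
```

Either matrix can then be paired with the original labels to train a supervised classifier.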
![Page 38: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/38.jpg)
Self-Taught Learning Application
• We used the self-taught learning paradigm with a sparse autoencoder and a
softmax classifier to build a classifier for handwritten digits.
• The goal is to distinguish between the digits 0 to 4. We use the digits
5 to 9 as our "unlabeled" dataset, and a labeled dataset of the
digits 0 to 4 to train the softmax classifier.
![Page 39: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/39.jpg)
Self-Taught Learning Implementation
Step 1: Generate the input and test data sets
We used the MNIST Handwritten Digit Database for this project.
Step 2: Train the sparse autoencoder
We used the unlabeled data (the digits 5 to 9) to train a sparse autoencoder. After
training, visualizing the learned features shows pen-stroke patterns like those in the image to the right.
Step 3: Extract features
After the sparse autoencoder is trained, we use it to extract features from the handwritten digit images.
Step 4: Train and test the softmax (multinomial logistic regression) model
We train a softmax classifier on the training set features and labels, and finally compute the predictions
and accuracy.
![Page 40: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/40.jpg)
Self-Taught Learning Setup Environment
| Component | Configuration |
| --- | --- |
| CPU | Intel Core i7 quad-core processor, 2.7 GHz |
| RAM | 6 GB |
| Training set | 60,000 examples from the MNIST database |
| Unlabeled set | 29,404 examples |
| Supervised training set | 15,298 examples |
| Supervised testing set | 15,298 examples |
![Page 41: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/41.jpg)
Self-Taught Learning Results
After training is complete, a visualization of the learned features shows pen strokes, as in the image
below:
![Page 42: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/42.jpg)
Self-Taught Learning Analysis
We compared
our application's outputs with the
Stanford course tutorial's outputs [8].

| | Our classifier | Tutorial’s classifier |
| --- | --- | --- |
| Training time | 16 minutes | 25 minutes |
| Classifier score (accuracy) | 98.208916% | 98% |
![Page 43: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/43.jpg)
Future Work
We propose that parallelizing our code, or running the training on a GPU for
example, would boost performance and decrease the time needed to train the classifier.
![Page 44: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/44.jpg)
References
[1] Taiwo Oladipupo Ayodele. New Advances in Machine Learning. InTech, 2010.
[2] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. Supervised machine learning: A review of classification techniques. 31:249-268, 2007.
[3] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. In Advances in Neural Information
Processing Systems, pages 801-808, 2006.
[4] Bruno A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature,
381(6583):607-609, 1996.
[5] Simon O. Haykin, "Multilayer Perceptron," in Neural Networks and Learning Machines, 3rd ed., Prentice Hall, 2009.
[6] Andrew Ng. CS294A. Lecture notes, Topic: "Sparse autoencoder," Stanford University, Jan. 11, 2011. Available:
http://www.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf. [Accessed Dec. 10, 2013].
[7] Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer, "Principal components and whitening," in Natural Image Statistics: A Probabilistic
Approach to Early Computational Vision, Vol. 39, Springer-Verlag, 2009, pp. 97-137.
[8] Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, and Caroline Suen, "UFLDL Tutorial," April 7, 2013. [Online]. Available:
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial. [Accessed Dec. 10, 2013].
![Page 45: Unsupervised Feature Learning](https://reader038.fdocuments.in/reader038/viewer/2022103016/554a0938b4c905e56c8b5a62/html5/thumbnails/45.jpg)
Thank You!