Deep Learning
Ashish Ghosh
Professor, Machine Intelligence Unit
In‐Charge, Center for Soft Computing Research
Indian Statistical Institute
203 B. T. Road, Kolkata
[email protected]
http://www.isical.ac.in/~ash
What is learning?
• Learning is a process by which a system improves its performance from experience
• Examples
  – Classification (divide objects into multiple classes)
  – Driving a vehicle (without hitting others)
Learning
• The agent perceives and then formulates a rule for it.
• This process is robust and avoids the need for any complicated mathematical model.
  – A child learns to catch a ball
• Learning is the ability of an agent to improve its performance based on experience.
  – The range of performance is expanded: the agent can do more
  – The accuracy on tasks is improved: the agent can do things better
  – The speed is improved: the agent can do things faster
Parking a vehicle: Learning
• Hits the left boundary
• Move right
• Hits the right boundary
• Move left (little less)
• Not perfect, move right (less amount)
• Perfect
• Learning (Neural Network)
An example
A bank wants to know whether or not to grant a loan to a person.
The bank has data from previous years.
Machine learning
In machine learning the agent is a computer.
What we do is:
(i) take some data,
(ii) train a model on that data, and
(iii) use the trained model to make predictions on new data.
What is machine learning?
• Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
Machine learning is used in cases where:
• There is an intuition that a certain rule exists
• But we do not know it or cannot express it mathematically
So, we learn the rule from data.
Shallow learning
• Traditional classifiers
• Linear & kernel regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single hidden layer MLP
• ...
Limited modeling capability of concepts
Cannot make use of unlabeled data
Biological neural networks
Rough sketch of how a biological neuron works.
Information flows in, is processed, and the result flows out.
Neural networks
• Definition: a massively parallel interconnected network of simple processing elements which are intended to interact with the objects of the real world in the same way as biological systems do.
• NN models are extreme simplifications of human neural systems.
• The computational elements (neurons/nodes/processors) are analogous to the fundamental constituents (neurons) of the biological nervous system.
Naïve neurobiology
Gross physical structure:
- There is one axon that branches.
- There is a dendritic tree that collects input from other neurons.
- Axons typically contact dendritic trees at synapses.
- A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.
- The axon hillock generates outgoing spikes only when enough charge has flowed in at synapses.
[Figure: neuron with labels: dendritic tree, axon hillock, body, synapse]
Naïve neurobiology
Synaptic weights: inputs arrive through synapses, each of which has its own strength.
Cumulative stimulus: the accumulated input decides whether the spike goes down the axon or not.
All or none: the spike either goes fully or does not go at all.
Artificial /electronic neuron
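[Figure: circuit of an electronic neuron; labels: input voltages V1, V2, ..., Vn-1, Vn; resistors Ri1, Ri2, ..., Rin-1, Rin; Ui; R; C; A; g; Ii; Vi]
• Gets input through resistors
• Total input is converted to a single output by an op-amp
• Output is transmitted via resistors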
Basic model of a neuron
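$y = b + \sum_i x_i w_i$
[Figure: inputs x1, x2, x3, ..., xn are weighted by w1, w2, w3, ..., wn and summed (+) together with the bias b to give the final output z]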
Characteristics of neural networks
Exhibit a number of the human brain's characteristics (partially).
• Learn from examples: shown a set of inputs, they self-adjust to produce consistent responses.
• Generalize from previous examples to new ones: once trained, a network's response is mostly insensitive to minor variations in input.
• Abstract essential characteristics from inputs: find the ideal (prototype) from imperfect inputs.
Major advantages
• adaptivity: adjusting the connection strengths to new data/information,
• speed: due to massively parallel architecture,
• robustness: to missing, confusing, ill-defined/noisy data,
• ruggedness: to failure of components,
• optimality: as regards error rates in performance.
Learning in neural networks
• Associative (supervised) learning
  – Learning pattern pair association.
  – Input = X = {x1, x2, ..., xn}
  – Output = T = {t1, t2, ..., tn}
  – Learn (X, T)
  – Auto-associator (auto-encoder): T ≅ X.
  – Hetero-associator: any arbitrary combination of X, T (classification).
• Regularity detection (unsupervised)
  – System discovers statistically salient features of the input population (clustering).
Popularly used NN models
Some common features are shared, but the models differ in finer details.
• Multi-layer perceptron (hetero-associator/supervised classifier)
• Hopfield's model of associative memory (auto-associator/CAM)
• Kohonen's model of self-organizing neural network (regularity detector/unsupervised classifier)
• Radial basis function network (supervised)
• Adaptive resonance theory (regularity detector)
• Cellular neural network
• Neo-cognitron
The Perceptron
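$y = b + \sum_i x_i w_i$, where y is the output, b is the bias, and x_i is the ith input.
[Figure: perceptron; each input x1, x2, x3, ..., xn is multiplied (X) by its weight w1, w2, w3, ..., wn, the products are summed (+) with the bias b, and the result gives the final output z]
It was proposed by Rosenblatt in the late 1950s.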
Learning rule
Learning: Present a set of input patterns, adjust the weights until the desired output occurs for each of them.
wi(t+1) = wi(t) + Δi;
Δi = η δ xi;
δ = T – A (i.e., target – actual).
If the sets of patterns are linearly separable, the single layer perceptron algorithm is guaranteed to find a separating hyperplane in a finite number of steps.
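The learning rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecturer's own code: the step activation, the zero initial weights, the learning rate of 1.0, the bias update, and the AND data set are all assumptions chosen to keep the example small.

```python
def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum plus bias is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Apply w_i(t+1) = w_i(t) + eta * delta * x_i, with delta = T - A."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)  # T - A
            if delta != 0:
                mistakes += 1
                weights = [w + eta * delta * xi for w, xi in zip(weights, x)]
                bias += eta * delta  # bias treated as a weight on a constant input of 1
        if mistakes == 0:  # every pattern is classified correctly
            break
    return weights, bias


# AND is linearly separable, so the rule settles on a separating line.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
print(weights, bias)
```

Swapping in a data set that is not linearly separable (XOR, for instance) makes the loop run until `max_epochs` without ever reaching zero mistakes.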
[Figure: two plots in the (x1, x2) plane, with the lines x1 = 1 and x2 = 1 marked, illustrating the change of weights]
Boolean functions
[Figure: single threshold units implementing Boolean functions]
• AND: θ = 1.5
• OR: θ = 0.5
• NOT: θ = −0.5
• Memory: θ = 1
How to design other gates (NOR, NAND)?
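The thresholds listed for the Boolean functions can be checked with a small threshold unit. The thresholds for AND, OR and NOT come from the slide; the connection weights (+1 for AND/OR, −1 for NOT) are assumptions, since the figure's weights did not survive, and the NAND and NOR units are just one possible answer to the question posed.

```python
def threshold_unit(inputs, weights, theta):
    """Fire (return 1) when the weighted sum of the inputs exceeds theta."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0


# Thresholds taken from the slide; weights are assumed.
def gate_and(a, b):
    return threshold_unit((a, b), (1, 1), 1.5)


def gate_or(a, b):
    return threshold_unit((a, b), (1, 1), 0.5)


def gate_not(a):
    return threshold_unit((a,), (-1,), -0.5)


# One possible design for the remaining gates: negate weights and thresholds.
def gate_nand(a, b):
    return threshold_unit((a, b), (-1, -1), -1.5)


def gate_nor(a, b):
    return threshold_unit((a, b), (-1, -1), -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, gate_and(a, b), gate_or(a, b), gate_nand(a, b), gate_nor(a, b))
```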
Cascading of layers
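[Figure: a two-layer network with inputs X and Y; layer 1 holds neurons S1 and S2 with weights w11, w12, w21, w22; the layer 2 neuron is 1 only in the region of the (X, Y) plane bounded by the lines S1 and S2]
Two layers: generate convex decision regions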
Effect of hidden neurons
Two layers: generate convex decision regions
Cascading of layers
Three layers: decision regions of any shape
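Multi-layer network
[Figure: a three-layer network (layer 1, layer 2, layer 3) with inputs X1 and X2 and output Y; triangle A and triangle B are convex regions, and the non-convex region is formed as A AND NOT B]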
Multi-layer perceptron
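[Figure: multi-layer perceptron; the input pattern enters at the input layer (nodes i), passes through the hidden layer (nodes j) via weights wji, and reaches the output layer (nodes k) via weights wkj, producing the output pattern]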
Nodes of two different consecutive layers are connected by links or weights.
There is no connection among the elements of the same layer.
The layer where the inputs are presented is known as the input layer.
On the other hand the output producing layer is called the output layer.
The layers in between the input and the output layers are known as hidden layers.
The total input (I_i) to the ith unit is
$I_i = \sum_j w_{ij} o_j$,
where o_j is the output of the jth neuron.
Multi-layer perceptron
The output of a node i is obtained as
oi = f(Ii), f is the activation function.
Mostly the activation function is sigmoidal/squashing (smooth, non-linear, differentiable and saturating), with the form
$f(x) = \frac{1}{1 + e^{-(x - \theta)/\theta_0}}$.
[Figure: plot of f(x) against x for x from −4 to 4, with f(x) rising from 0 through 0.5 to 1.0]
Initially very small random values are assigned to the links/weights.
Multi-layer perceptron
An input pattern X = {x_i} is presented during training.
The network's weights/biases are adjusted such that the desired output T = {t_i} is obtained at the output layer.
Then another pair of X and T is presented for learning.
During learning, a single set of weights and biases is found that is able to discriminate among all the input/output pairs presented to it.
In general the output {o_i} will not be the same as the target {t_i}.
The error is
$E = \frac{1}{2} \sum_i (t_i - o_i)^2$.
To learn the correct set of weights, the error E is reduced as rapidly as possible.
Use the gradient descent technique.
Parameter updating
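The incremental change in the direction of the negative gradient is
$\Delta w_{ji} \propto -\frac{\partial E}{\partial w_{ji}}$, i.e. $\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial I_j} \frac{\partial I_j}{\partial w_{ji}} = \eta \delta_j o_i$,
where
$\delta_j = -\frac{\partial E}{\partial I_j} = -\frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial I_j} = -\frac{\partial E}{\partial o_j} f'(I_j)$.
For the links connected to the output layer the change in weight is given by
$\Delta w_{ji} = \eta \left( -\frac{\partial E}{\partial o_j} \right) f'(I_j)\, o_i$.
For nodes in the hidden layers
$\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial I_k} \frac{\partial I_k}{\partial o_j} = \sum_k \frac{\partial E}{\partial I_k} \frac{\partial}{\partial o_j} \sum_i w_{ki} o_i = \sum_k \frac{\partial E}{\partial I_k} w_{kj} = -\sum_k \delta_k w_{kj}$.
Hence for the hidden layer we have
$\Delta w_{ji} = \eta \left( \sum_k \delta_k w_{kj} \right) f'(I_j)\, o_i$.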
Parameter updating
If
$o_j = \frac{1}{1 + e^{-\left( \sum_i w_{ji} o_i - \theta_j \right)}}$
then
$f'(I_j) = \frac{\partial o_j}{\partial I_j} = o_j (1 - o_j)$
and thus we get
$\Delta w_{ji} = \eta \left( -\frac{\partial E}{\partial o_j} \right) o_j (1 - o_j)\, o_i$ → output layer
$\Delta w_{ji} = \eta \left( \sum_k \delta_k w_{kj} \right) o_j (1 - o_j)\, o_i$ → hidden layer
A large value of η corresponds to rapid learning but might result in oscillations.
A momentum term $\alpha \Delta w_{ji}(t)$ can be added to increase the learning rate without oscillation:
$\Delta w_{ji}(t+1) = \eta \delta_j o_i + \alpha \Delta w_{ji}(t)$
The second term specifies that the change in w_ji at instant (t+1) should be somewhat similar to the change undertaken at instant t.
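The update formulas for the output and hidden layers can be sketched for a small sigmoid network as follows. This is an illustrative sketch under stated assumptions: the 2-2-1 layout, the weight values, and all function names are hypothetical, the error is taken to be E = ½ Σ (t − o)², so that −∂E/∂o_j = t_j − o_j, and each threshold θ_j is folded in as a bias weight on a constant input of 1.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_output):
    """Return (inputs + bias unit, hidden outputs + bias unit, network outputs)."""
    x_aug = list(x) + [1.0]  # constant 1 stands in for the threshold term
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_aug))) for row in w_hidden]
    h_aug = hidden + [1.0]
    output = [sigmoid(sum(w * v for w, v in zip(row, h_aug))) for row in w_output]
    return x_aug, h_aug, output


def error(x, target, w_hidden, w_output):
    """E = 1/2 * sum_i (t_i - o_i)^2"""
    output = forward(x, w_hidden, w_output)[2]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


def weight_changes(x, target, w_hidden, w_output, eta):
    x_aug, h_aug, output = forward(x, w_hidden, w_output)
    # Output layer: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    d_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
    # Hidden layer: delta_j = (sum_k delta_k * w_kj) * o_j * (1 - o_j)
    d_hid = [
        sum(d_out[k] * w_output[k][j] for k in range(len(w_output)))
        * h_aug[j] * (1 - h_aug[j])
        for j in range(len(w_hidden))
    ]
    # Delta w_ji = eta * delta_j * o_i
    dw_hidden = [[eta * d * v for v in x_aug] for d in d_hid]
    dw_output = [[eta * d * v for v in h_aug] for d in d_out]
    return dw_hidden, dw_output


def momentum_step(weights, changes, previous, alpha):
    """Delta w(t+1) = eta*delta_j*o_i + alpha*Delta w(t); returns (weights, Delta w(t+1))."""
    step = [[c + alpha * p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(changes, previous)]
    updated = [[w + s for w, s in zip(w_row, s_row)]
               for w_row, s_row in zip(weights, step)]
    return updated, step


# Hypothetical 2-2-1 network; the last entry of each row is the bias weight.
W_HIDDEN = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
W_OUTPUT = [[0.2, -0.3, 0.1]]
dw_hidden, dw_output = weight_changes([1.0, 0.0], [1.0], W_HIDDEN, W_OUTPUT, eta=0.5)
print(dw_hidden, dw_output)
```

A quick sanity check on such code is to compare each computed change against −η times a finite-difference estimate of ∂E/∂w; the two should agree to several decimal places.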
Parameter updating
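Gradient descent
In gradient descent, weights are changed in proportion to the negative of the error derivative with respect to each weight:
$\Delta w_{ji} \propto -\frac{\partial E}{\partial w_{ji}}$, i.e. $\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial I_j} \frac{\partial I_j}{\partial w_{ji}} = \eta \delta_j o_i$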
Gradient descent
Small values of η: slow convergence
Large values of η: oscillations
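The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-dimensional error E(w) = w² (gradient 2w), which is not from the slides; it is chosen only because the behaviour of each step is easy to follow.

```python
def gradient_descent(eta, w0=1.0, steps=10):
    """Minimise the toy error E(w) = w**2; returns the sequence of w values."""
    path = [w0]
    for _ in range(steps):
        gradient = 2.0 * path[-1]
        path.append(path[-1] - eta * gradient)  # w <- w - eta * dE/dw
    return path


slow = gradient_descent(eta=0.01)        # creeps towards the minimum at w = 0
oscillating = gradient_descent(eta=0.9)  # overshoots, so the sign of w flips each step

print([round(w, 3) for w in slow])
print([round(w, 3) for w in oscillating])
```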
Local minima or saddle points
There are some problems with the gradient descent approach:
• Adjusting all the weights at once can result in a significant movement of the neural network in weight space.
• The gradient descent algorithm is quite slow.
• It is susceptible to local minima and saddle points.
Training MLP
Feature 1   Feature 2   Feature 3   Class
1.4         2.7         1.9         0
3.8         3.4         3.2         0
6.4         2.8         1.7         1
4.1         0.1         0.2         0
...         ...         ...         ...

Present the first pattern (1.4, 2.7, 1.9): the network outputs 0.8.
Compare with the target class 0: error 0.8. Adjust the weights.
Present another pattern (6.4, 2.8, 1.7): the network outputs 0.9, the target class is 1: error -0.1.
Algorithms for weight adjustment are designed to make changes that will reduce the error.
Drawbacks of BP
• Backpropagation (BP) could train multiple layers, and was expected to learn multiple feature representations.
• It was too slow to train and needed many labelled instances.
• Since the basic back-propagation (BP) learning algorithm is too slow for most practical applications, there have been extensive research efforts to accelerate its convergence.
• The backpropagation algorithm is still the workhorse of learning in neural networks.
What happens when we add more layers?
Speed of learning in layers ($\|\delta^l\|_2$, where $\delta^l_j = \partial E / \partial b^l_j$)

Network                         ||δ1||_2   ||δ2||_2   ||δ3||_2   ||δ4||_2   ...   ||δn||_2
2 layers (784:30:10)            0.7        0.31
3 layers (784:30:30:10)         0.12       0.60       0.283
4 layers (784:30:30:30:10)      0.003      0.017      0.070      0.285
...                             ...        ...        ...        ...        ...   ...
n layers (784:30: ... :30:10)   0          ...        ...        ...        ...   ...

Input: 784 neurons
Output: 10 neurons
Each hidden layer has 30 neurons
Source: Deep Learning, draft book in preparation, by Yoshua Bengio, Ian Goodfellow, and Aaron Courville
MLNN are hard to train through BP
Vanishing gradient: early layers learn slower than later layers.
Let's take 784 neurons in the input layer (28×28 = 784 pixels), 30 hidden neurons, and 10 output neurons: accuracy = 96.48%
Add another layer of 30 hidden neurons: accuracy = 96.90%
Add another layer of 30 hidden neurons: accuracy = 96.57%
Add another layer of 30 hidden neurons: accuracy = 96.53%
Why is this happening?
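One way to see why the gradient shrinks is to follow δ backwards through a chain of sigmoid units. The sketch below is a toy, not the 784:30:...:10 network from the slides: it assumes one neuron per layer, a weight of 1 on every link, no biases, and hypothetical input and target values. Each backward step multiplies δ by w·o(1 − o), and o(1 − o) never exceeds 0.25.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_deltas(n_layers, w=1.0, x=0.5, target=1.0):
    """Deltas of a chain of single-neuron sigmoid layers; deltas[0] is the first layer."""
    outputs = [x]
    for _ in range(n_layers):
        outputs.append(sigmoid(w * outputs[-1]))
    # Output layer: delta = (t - o) * o * (1 - o)
    o = outputs[-1]
    deltas = [(target - o) * o * (1 - o)]
    # Earlier layers: delta_j = delta_k * w_kj * o_j * (1 - o_j)
    for o in reversed(outputs[1:-1]):
        deltas.append(deltas[-1] * w * o * (1 - o))
    deltas.reverse()
    return deltas


for layer, delta in enumerate(layer_deltas(4), start=1):
    print(f"layer {layer}: |delta| = {abs(delta):.6f}")
```

With larger weights the factor w·o(1 − o) can exceed 1, which is the exploding-gradient counterpart of the same effect.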
Deep architecture in human brain
[Figure: stages of the visual pathway, from input to high-level areas]
• Retina: pixels
• Lateral geniculate nucleus
• Visual cortex
  – V1: edges, corners
  – V2: simple shapes
  – V4: high-level abstractions
Do you see some changes?
Deep architecture of brain
• A model of object recognition in the brain, based on neuropsychological evidence, works bottom-up:
  – Stage 1: Basic components (colour, depth, and form) are processed.
  – Stage 2: Basic components are then grouped on the basis of similarity (distinct edges, shapes).
  – Stage 3: The visual representation is matched with structural descriptions in memory.
  – Stage 4: Semantic attributes are attached to the representation, which yields recognition.
• Other existing models propose integrative hierarchies (top-down and bottom-up) and parallel processing.
Deep neural networks architecture
• Modelled on the working principle of the multistage recognition ability of the human brain.
• At the core: artificial neural networks
• Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth, i.e. more hidden layers
• Many layers of non-linear transformations
• Takes raw input
Concept of deep learning
[Figure: an input image passes through successive levels of representation]
• Input image
• Raw input vector representation
• Slightly higher level of representation
• ...
• Higher levels of representation
• ...
• Very high level representation: "Girl Jumping"
Learning with depth
What is new: algorithms for training many-layer networks
• Prior to deep learning, MLPs were typically initialized using random numbers.
• Deep learning proposed a new initialization strategy: use a series of single-layer networks, which do not suffer from vanishing/exploding gradients, to find the initial parameters for a deep MLP.
Why deep learning?
• It has been proven that depth-2 networks are enough to represent any function.
• However, for any complex function the required number of nodes (computations and parameters) may grow very large.
  – The weight matrix may be too large to handle.
• Håstad has shown that if the number of nodes required for n inputs is O(n) at depth d, then the number of nodes becomes O(2^n) if the depth is d-1.
  – i.e. for a 10-layer network, if only 100 nodes are needed to represent a function, then a 9-layer network would need 2^100 nodes for the same function (although both of them represent it correctly).
[On the power of small-depth threshold circuits, Computational Complexity, Volume 1, Issue 2, pp. 113-129, Springer]
Types of deep neural networks
• Deep convolutional neural networks
  – Image recognition, object recognition and speech recognition
• Auto-encoders
  – Feature extraction, unsupervised learning, pattern recognition
• Deep belief networks
  – Image recognition and classification
• Deep convolutional inverse graphics networks
  – Given one input image, generates new images of the same object with variations in pose and lighting
Types of deep neural networks
• Deep residual networks
  – Image recognition
• Recurrent neural networks
  – Time series analysis, speech recognition
• Recursive neural tensor networks
  – Text analysis
Introduction to CNN
• CNN is a feed-forward network that can extract topological properties from an image.
• Like almost every other neural network, they are trained with a version of the back-propagation algorithm.
• Convolutional Neural Networks are designed to recognize visual patterns directly from pixel images.
• They can recognize patterns with extreme variability (such as handwritten characters).
Convolutional Neural Networks
• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of convolutional neural networks.
• How to produce good internal representations of the visual world to support recognition...
  – detect and classify objects into categories, independent of pose, scale, illumination, conformation, occlusion and clutter
• Previously in Computer Vision: Hand-crafted feature extractor
• Now in Computer Vision: Learn suitable representations of images
• Argument: Biological vision is hierarchically organized
Convolutional Neural Networks
[Figure: hierarchy of biological vision for a forest image]
• Forest image → photoreceptors
• V1 (simple and complex cells): oriented edges
• V4: different textures (barks, leaves, etc.)
• Inferotemporal cortex: trees
Convolution operation
• The convolution operation is local in the image domain, which gives more sparsity in the number of connections in the neural network.
[Figure: input image, convolution kernel, feature map]
Convolutional Neural Networks
• Convolutional layers apply a number of filters to the input.
  – The result of one filter applied across the image is called a feature map (FM), and the number of feature maps is equal to the number of filters.
  – The intuition behind sharing weights across the image is that features will be detected regardless of their location, while the multiplicity of filters allows each of them to detect a different set of features.
[Figure: connections from layer m-1 to the feature map in layer m; weights of the same color are shared, i.e. constrained to be identical]
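How one shared kernel produces a feature map can be sketched in plain Python. The image and kernel values are made up for illustration, and the function slides the kernel without flipping it (strictly a cross-correlation, which is an assumption here, though it is what CNN implementations commonly compute). The output only covers positions where the kernel fits entirely inside the image.

```python
def convolve2d_valid(image, kernel):
    """Slide one kernel over the image; every output cell reuses the same weights."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


IMAGE = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
KERNEL = [[1, 0],
          [0, -1]]  # responds to the difference along the diagonal

feature_map = convolve2d_valid(IMAGE, KERNEL)
print(feature_map)  # [[-4, -4], [-4, -4]]
```

Applying several different kernels to the same image would give one feature map per kernel, matching the statement that the number of feature maps equals the number of filters.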
Convolutional Neural Networks
• Subsampling layers reduce the size of the input.
  – There are multiple ways to subsample, but the most popular are max pooling, average pooling, and stochastic pooling.
  – The last subsampling (or convolutional) layer is usually connected to one or more fully connected layers, the last of which represents the target data.
• Modified backpropagation is used for training; weights are updated only in the convolution layers.
  – Subsampling layers have no weights to learn
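Max pooling and average pooling can be sketched as below. The 2×2 window, the stride equal to the window size, and the sample values are assumptions for illustration; stochastic pooling is not shown.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Subsample with non-overlapping size x size windows; there are no weights to learn."""
    if mode == "max":
        reduce_fn = max
    else:
        def reduce_fn(values):
            return sum(values) / len(values)
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


FEATURE_MAP = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]

print(pool2d(FEATURE_MAP, mode="max"))      # [[6, 8], [14, 16]]
print(pool2d(FEATURE_MAP, mode="average"))  # [[3.5, 5.5], [11.5, 13.5]]
```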
[Figure: a network unit with labels: input, weights, biases b1 and b2, output]
Convolutional Neural Networks
Initialization
• All-zero initialization
  – All weights are set to zero
• Initialization with small random numbers
  – All weights are initialised with random numbers very close to zero.
  – Treated as symmetry breaking
Activation functions
• Sigmoid
• Tanh
• ReLU
• Leaky ReLU
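For reference, the four activation functions can be written out as below, using their standard definitions. The leaky slope of 0.01 is an assumed default, not a value given in the slides.

```python
import math


def sigmoid(x):
    """Squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any input into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive inputs unchanged and outputs 0 otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs."""
    return x if x > 0 else slope * x


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```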
Backpropagation with weight constraints
• Modify the BP algorithm with linear constraints between the weights.
• Modify the gradients so that they satisfy the constraints.
  – So if the weights started off satisfying the constraints, they will continue to satisfy them.
To constrain: $w_1 = w_2$
we need: $\Delta w_1 = \Delta w_2$
compute: $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$
use: $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$ for $w_1$ and $w_2$
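The constraint can be sketched as a single update step. The function name, the learning rate, and the gradient values below are hypothetical; the point is only that both weights receive the same summed gradient, so weights that start equal stay equal.

```python
def shared_weight_step(w1, w2, grad_w1, grad_w2, eta=0.1):
    """Update two tied weights with the sum of their individual gradients."""
    shared_gradient = grad_w1 + grad_w2
    return w1 - eta * shared_gradient, w2 - eta * shared_gradient


# Both weights start equal, so the constraint w1 = w2 holds before the step.
w1, w2 = shared_weight_step(0.5, 0.5, grad_w1=0.2, grad_w2=-0.6)
print(w1, w2)
```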
Advantages of CNN
• They use fewer parameters (weights) to learn than a fully connected network.
• They are designed to be invariant to object position and distortion in the scene.
• They automatically learn and generalize features from the input domain.
Autoencoders
• Imagine that we train a neural network which has:
  – input
  – one hidden layer
  – output which is the same as the input
• And you require that the hidden layer:
  – either has fewer nodes than the input/output layers
  – or is sparse, i.e. the nodes usually output 0 and only sometimes output more than 0.05
• This is called an autoencoder (or autoassociator).
Autoencoders
• So then your final network might look like:
  – input layer
  – layer from first autoencoder
  – layer from second autoencoder
  – ...
  – layer from nth autoencoder
  – output layer
• Now, if we have a lot of labeled data we can then "fine-tune" this network.
  – i.e. use those layers from the autoencoders as the first generation of a big neural network and then
  – run a lot of generations of back propagation
Autoencoders
An auto-encoder is trained with an absolutely standard weight-adjustment algorithm to reproduce the input.
When you pass data through such a network,
• it first compresses (encodes) the input vector to "fit" in a smaller representation,
• then tries to reconstruct (decode) it back.
Autoencoders
$X \xrightarrow{f} Y \xrightarrow{g} X$
where Y is the hidden layer units' output vector and X is the input vector.
• The task of training is to minimize the reconstruction error, i.e. find the most efficient compact representation (encoding) for the input data.
Autoencoders
Autoencoders: why do they work?
[Figure: plot with the function in black and the error in red]
Associative memory
Any physical system whose dynamics in the state space is dominated by a number of locally stable states can be regarded as an associative memory/content addressable memory (CAM).
The information stored in the system is the set of locally stable points Xa, Xb, ..., Xn.
Then, if the system is started at X = Xa + Δ, it will proceed in time until X = Xa.
The starting point X = Xa + Δ represents partial knowledge of the item Xa, and the system then generates the total information Xa.
Autoencoders: why do they work?
Let $(x_k, y_k)$ be an associated pattern pair.
$x_k \in R^{N_1}$ → input
$y_k \in R^{N_2}$ → output
Then $M_{N_2 \times N_1} = y_k x_k^t$ can serve as an associative memory.
If $x_k$ is given as input, $y_k$ will be produced as output:
$M x_k = (y_k x_k^t) x_k = y_k (x_k^t x_k) = y_k$ [if $x_k^t x_k = 1$].
If the stimulus is not $x_k$ but $\tilde{x}_k$ (a distorted version of $x_k$), then
$M \tilde{x}_k = y_k (x_k^t \tilde{x}_k)$;
the output will be $y_k$ with magnitude $(x_k^t \tilde{x}_k)$.
[Figure: example labelled "Tom Cruise"]
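The outer-product memory can be tried out numerically. The vectors below are made up for illustration; the stored input is chosen with unit length so that the condition x^t x = 1 holds, and the distorted input is simply another unit vector whose overlap with the stored one is 0.96.

```python
def outer(y, x):
    """M = y x^t, an N2 x N1 matrix stored as a list of rows."""
    return [[yi * xj for xj in x] for yi in y]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


x_k = [0.6, 0.8]        # stored input, unit length (0.36 + 0.64 = 1)
y_k = [1.0, -2.0, 3.0]  # associated output
memory = outer(y_k, x_k)

recalled = matvec(memory, x_k)  # approximately y_k
x_distorted = [0.8, 0.6]        # overlap with x_k is 0.48 + 0.48 = 0.96
recalled_distorted = matvec(memory, x_distorted)  # approximately 0.96 * y_k

print(recalled)
print(recalled_distorted)
```

The distorted stimulus still recalls the direction of y_k, only scaled by the overlap between the stored and presented inputs.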
Dimensionality Reduction
Use a deep auto-encoder
Train it with images as input & output
Limit one layer to a few dimensions
Information has to pass through the middle layer
How does a deep autoencoder work?
[Figure: the layers of a deep network, trained one at a time]
• Train this layer first
• Then this second layer
• Then this third layer
• Then this fourth layer
• Finally this last layer
How does a deep autoencoder work?
Each of the (non-output) layers is trained to be an autoencoder.
Basically, it is forced to learn good features that describe what comes from the previous layer.
Deep Autoencoders
Intermediate layers are each trained to be autoencoders (or similar).
The final layer is pre-trained to predict the class based on outputs from the previous layers.
Deep Autoencoders
As the pre-training process has initialised the weights favourably,
– the deep MLP training can be done using gradient descent techniques;
– the problem of vanishing/exploding gradients is largely avoided.
• That's the basic idea
• There are many types of deep learning:
• different kinds of autoencoder, variations on architectures and training algorithms, etc.
Deep Autoencoders
So, everything basically boils down to:
• Iterative algorithm
• Learning at different levels of abstraction
• Non-linear transformations
• Typically multi-layered neural networks
What do we do?
• Take unlabeled data (a lot of data)
• Unsupervised pre-training (feature detection by autoencoders)
• Then run supervised backpropagation iteratively
  – Classify labeled data
• Learning in successive layers
Deep learning based image analysis
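[Figure: labeled images are split into a training set and a validation set; the deep neural network learns from the training set, is validated on the validation set, and the model is adjusted]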
• The Deep neural network tries to learn the features from the input images provided.
• The network trains iteratively by using the information available from the Training set.
• The Deep Network is validated and more adjustments are made in the model.
Image analysis using deep learning
It is an image labelling task.
It basically gives the semantic information from the image.
The quality of the results obtained depends on the features.
[Figure: example image labelled "Girl Jumping"]
Image analysis using deep learning
[Figure: identify the mug]
Image analysis using deep learning
Feature Representation Learning Algorithm
Neuron 1 of visual cortex's V1 part in human brain
Neuron 2 of visual cortex's V1 part in human brain
Image analysis using deep learning
[Figure: labeled images (cups) → pre-training → learn feature representations → deep neural network]
Training: form a concept
What is this? A cup.
Labels name this concept as cups.
Image analysis using deep learning
[Figure: unlabeled images → pre-training → learn feature representations → deep neural network]
Training: form many concepts of the groups/clusters of utensils
What is this? It belongs to a group of these.
Image analysis using deep learning
Layer 1: Learns from extracted image patches
Layer 2: Combines the image patches and forms individual objects
Layer 3: Combines the objects to form complete faces
And so on...
How a deep neural network sees
Image analysis using deep learning
Unlabeled data is readily available
• Example: images from the web
  – Download 10,000,000 images
  – Train a 9-layer deep neural net
• Concepts are formed by the DNN
• It is found that they perform 70% better than the previous state of the art
[Figure: the concept of a human face formed by the network]
Deep Neural Nets are fooled
[Figure: unrecognizable images classified with high confidence: "Remote" (99.99% confidence), "Starfish" (99.99% confidence), "Starfish" (99.99% confidence), "Remote" (99.99% confidence)]
A. Nguyen, J. Yosinski, J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Deep Neural Nets are fooled
• The algorithms creating the images add all the basic parts the DNN is looking for,
  – but not in a logical way.
• The result can look meaningless to humans looking at it.
• However, the DNN may find a resemblance to what it learned from its earlier training, because it is able to find those basic image parts.
Deep Neural Nets are fooled
If the DNN perceives the colour texture, then all of these images are recognized as giraffes.
If the DNN perceives the body structure, then none of these images are recognized as giraffes.
These too are then recognized as giraffes.
Conclusion
• Deep neural networks are powerful.
• Deep learning allows us to
  – use effectively the knowledge extracted from unlabeled data
  – lessen the chance of getting stuck in local minima
  – improve the training performance
• Deep neural networks are trainable if we have a very fast computer.
• So if we have a very large high-quality dataset, we can find the best deep neural network for the task,
  – which will solve the problem, or at least come somewhere close to solving it.
  – Big data might come in handy.
  – Much like big data tools, a deep learning model is only as good as the data it is fed.