Deep Learning - Indian Statistical Institutescc/DSML/DSML-Deep_learning.pdfDeep Learning AshishGhosh...

Deep Learning

Ashish Ghosh

Professor, Machine Intelligence Unit

In‐Charge, Center for Soft Computing Research

Indian Statistical Institute

203 B. T. Road, Kolkata

http://www.isical.ac.in/~ash


What is learning?

• Learning is a process by which a system improves its performance from experience

• Examples
– Classification (divide objects into multiple classes)
– Driving a vehicle (without hitting others)

Learning

• The agent perceives and then formulates a rule for it.

• This process is robust and avoids the need for any complicated mathematical model
– A child learns to catch a ball

• Learning is the ability of an agent to improve its performance based on experience.

• The range of performance is expanded: the agent can do more
• The accuracy on tasks is improved: the agent can do things better
• The speed is improved: the agent can do things faster

Parking a vehicle: Learning

• Hits the left boundary

• Move right

• Hits the right boundary

• Move left (little less)

• Not perfect, move right (less amount)

• Perfect

• Learning (Neural Network)


An example

A bank wants to decide whether or not to grant a loan to a person.

The bank has data of this kind from previous years.

Machine learning

In machine learning the agent is a computer.

What we do is: (i) take some data, (ii) train a model on that data, and (iii) use the trained model to make predictions on new data.

What is machine learning?

• Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Machine learning is used in cases where:
• There is an intuition that a certain rule exists
• But we do not know it or cannot express it mathematically

So, we learn the rule from data

Shallow learning

• Traditional Classifiers
• Linear & Kernel Regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single hidden layer MLP
• ...

Limited capability for modeling concepts
Cannot make use of unlabeled data

Biological neural networks

Rough sketch of how a biological neuron works.

Information flows in, is processed, and the result flows out.

Neural networks

• Definition: A massively parallel interconnected network of simple processing elements which are intended to interact with the objects of the real world in the same way as biological systems do.

• NN models are extreme simplifications of human neural systems.

• Computational elements (neurons/nodes/processors) are analogous to the fundamental constituents (neurons) of the biological nervous system.

Naïve neurobiology

Gross physical structure:
- There is one axon that branches.
- There is a dendritic tree that collects input from other neurons. Axons typically contact the dendritic tree at synapses.
- A spike of activity in the axon causes charge to be injected into the postsynaptic neuron.
- The axon hillock generates outgoing spikes only when enough charge has flowed in at synapses.

[Figure: neuron with labels: dendritic tree, body, axon hillock, synapse]

Naïve neurobiology

Synaptic weights: There are synapses.
Cumulative stimulus: Decision about whether the spike goes down the axon or not.
All or none: The spike either goes fully or does not go at all.

Artificial /electronic neuron

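[Figure: circuit of an electronic neuron. Input voltages V1, V2, ..., Vn-1, Vn pass through resistors Ri1, Ri2, ..., Rin-1, Rin into an amplifier stage (labels R, C, A, g) with labels Ui, Ii and Vi.]

• Gets input through resistors
• Total input is converted to a single output by an OP-AMP
• Output is transmitted via resistors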

Basic model of a neuron

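$$y = b + \sum_i x_i w_i$$

[Figure: inputs x1, x2, x3, ..., xn are multiplied by the weights w1, w2, w3, ..., wn and summed together with the bias b; the final output is z.]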
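The weighted sum of the basic neuron model can be written in a few lines of Python. This is a minimal sketch: the function name `neuron_output` and the numbers are made up for illustration, and only the sum y = b + Σ x_i w_i from the slide is computed (the step from y to the final output z is not shown).

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum y = b + sum_i x_i * w_i of a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return bias + sum(x * w for x, w in zip(inputs, weights))


# Hypothetical numbers, chosen only for illustration.
y = neuron_output([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], bias=0.5)
print(y)
```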

Characteristics of neural networks

Exhibit a number of the human brain's characteristics (partially).

Learn from example - shown a set of inputs, they self-adjust to produce a consistent response.

Generalize from previous examples to new ones - once trained, a network's response is mostly insensitive to variations in input.

Abstract essential characteristics from inputs - find the ideals (prototypes) from imperfect inputs.

Major advantages

adaptivity - adjusting the connection strengths to new data/information,

speed - due to massively parallel architecture,

robustness - to missing, confusing, ill-defined/noisy data,

ruggedness - to failure of components,

optimality - as regards error rates in performance.

Learning in neural networks

• Associative (supervised) learning
– Learning pattern pair association.
– Input = X = {x1, x2, ..., xn}
– Output = T = {t1, t2, ..., tn}
– Learn (X, T)
– Auto-associator (auto-encoder): T ≅ X.
– Hetero-associator: any arbitrary combination of X and T (classification).

• Regularity detection (unsupervised)
– The system discovers statistically salient features of the input population (clustering).

Popularly used NN models

Some common features are shared, but the models differ in finer details.

Multi-layer perceptron (hetero-associator/supervised classifier)

Hopfield's model of associative memory (auto-associator/CAM)

Kohonen's model of self-organizing neural network (regularity detector/unsupervised classifier)

Radial basis function network (supervised)

Adaptive resonance theory (regularity detector)

Cellular neural network

Neo-cognitron

The Perceptron

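$$y = b + \sum_i x_i w_i$$

[Figure: perceptron. Each input x1, x2, x3, ..., xn (the ith input) is multiplied by its weight w1, w2, w3, ..., wn; the products are summed with the bias b to give the output y, which produces the final output z.]

It was proposed by Rosenblatt in the late 1950s.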

Learning rule

Learning: Present a set of input patterns, adjust the weights until the desired output occurs for each of them.

wi(t+1) = wi(t) + Δi;

Δi = η δ xi;

δ = T – A (i.e., target – actual).

If the sets of patterns are linearly separable, the single layer perceptron algorithm is guaranteed to find a separating hyperplane in a finite number of steps.
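The learning rule above can be sketched in Python as follows. The step activation with a firing condition of "weighted sum greater than 0", the separate bias term, the learning rate of 0.1 and the AND data set are all assumptions made for this illustration; only the update w_i(t+1) = w_i(t) + η (T − A) x_i comes from the slide.

```python
def step(value):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if value > 0 else 0


def train_perceptron(patterns, targets, eta=0.1, max_epochs=100):
    """Apply w_i(t+1) = w_i(t) + eta * (T - A) * x_i until no pattern is misclassified."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(patterns, targets):
            actual = step(bias + sum(w * xi for w, xi in zip(weights, x)))
            delta = target - actual
            if delta != 0:
                errors += 1
                weights = [w + eta * delta * xi for w, xi in zip(weights, x)]
                bias += eta * delta  # bias treated as a weight on a constant input of 1
        if errors == 0:
            break
    return weights, bias


# Logical AND is linearly separable, so the rule is expected to converge.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]
weights, bias = train_perceptron(patterns, and_targets)
predictions = [step(bias + sum(w * xi for w, xi in zip(weights, x))) for x in patterns]
print(predictions)
```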

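[Figure: change of weights. Two plots in the (x1, x2) plane show the decision line before and after the weight change, with the points x1 = 1 and x2 = 1 marked on the axes.]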

Boolean functions

AND: θ = 1.5

OR: θ = 0.5

NOT: θ = −0.5

How to design other gates (NOR, NAND)?

Memory: θ = 1
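The threshold values for AND, OR and NOT can be tried out with a small threshold unit. The slide text gives only the thresholds, so the weights used here (+1 for AND and OR, −1 for NOT) are assumptions, as is the rule that the unit fires when the weighted sum exceeds θ. The NAND and NOR units are one possible answer to the question on the slide, not necessarily the one intended in the lecture.

```python
def threshold_unit(inputs, weights, theta):
    """Return 1 when the weighted input sum exceeds the threshold theta, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > theta else 0


# Thresholds from the slide; the weights are assumed.
def and_gate(a, b):
    return threshold_unit((a, b), (1, 1), 1.5)


def or_gate(a, b):
    return threshold_unit((a, b), (1, 1), 0.5)


def not_gate(a):
    return threshold_unit((a,), (-1,), -0.5)


# One possible design (assumed): negate the weights and thresholds of AND and OR.
def nand_gate(a, b):
    return threshold_unit((a, b), (-1, -1), -1.5)


def nor_gate(a, b):
    return threshold_unit((a, b), (-1, -1), -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), nand_gate(a, b), nor_gate(a, b))
```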

Cascading of layers

[Figure: two-layer network. Inputs X and Y feed the layer 1 units S1 and S2 through the weights w11, w12, w21, w22; in the (X, Y) plane the layer 2 neuron is 1 only in the marked region.]

Two layers: generate convex decision regions

Effect of hidden neurons

Two layers: generate convex decision regions

Cascading of layers

Three layers: decision regions of any shape

[Figure: triangle A and triangle B in the (X1, X2) plane; the non-convex region "A AND NOT B" is produced by a multi-layer network with layer 1, layer 2 and layer 3 giving the output Y.]

Multi-layer perceptron

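[Figure: multi-layer perceptron. The input pattern enters the input layer (units i), passes to the hidden layer (units j) through the weights wji and then to the output layer (units k) through the weights wkj, which produces the output pattern.]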

Nodes of two different consecutive layers are connected by links or weights.

There is no connection among the elements of the same layer.

The layer where the inputs are presented is known as the input layer.

On the other hand the output producing layer is called the output layer.

The layers in between the input and the output layers are known as hidden layers.

The total input $I_i$ to the $i$th unit is

$$I_i = \sum_j w_{ij}\, o_j,$$

where $o_j$ is the output of the $j$th neuron.

The output of a node $i$ is obtained as

$$o_i = f(I_i),$$

where $f$ is the activation function.

Mostly the activation function is sigmoidal/squashing (smooth, non-linear, differentiable and saturating), with the form

$$f(x) = \frac{1}{1 + e^{-(x-\theta)/\theta_0}}.$$

[Figure: plot of f(x) against x for x from −4 to 4, with 0.5 and 1.0 marked on the f(x) axis]

Initially, very small random values are assigned to the links/weights.
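The forward computation of a layer (total input, then sigmoid activation) and the small random initial weights can be sketched as below. The layer sizes, the weight range [-0.1, 0.1], the random seed and the default values θ = 0 and θ0 = 1 are assumptions chosen only to make the example runnable.

```python
import math
import random


def sigmoid(x, theta=0.0, theta0=1.0):
    """f(x) = 1 / (1 + exp(-(x - theta) / theta0))."""
    return 1.0 / (1.0 + math.exp(-(x - theta) / theta0))


def layer_forward(weights, outputs_prev):
    """For each unit i: I_i = sum_j w_ij * o_j, then o_i = f(I_i)."""
    return [sigmoid(sum(w_ij * o_j for w_ij, o_j in zip(row, outputs_prev)))
            for row in weights]


def init_weights(n_out, n_in, rng, scale=0.1):
    """Small random initial weights; the range [-scale, scale] is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
w_hidden = init_weights(4, 3, rng)  # 3 inputs -> 4 hidden units
w_output = init_weights(2, 4, rng)  # 4 hidden units -> 2 output units

x = [1.4, 2.7, 1.9]
hidden = layer_forward(w_hidden, x)
output = layer_forward(w_output, hidden)
print(output)
```

Because the initial weights are close to zero, every total input is close to zero and the outputs start near 0.5, the middle of the sigmoid.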


An input pattern X = {xi} is presented during training.

The network's set of weights/biases is adjusted such that the desired output T = {ti} is obtained at the output layer.

Then another pair of X and T is presented for learning.

During learning, a single set of weights and biases is found that is able to discriminate among all the input/output pairs presented to it.

The output {oi} will in general not be the same as the target {ti}.

The error is

$$E = \frac{1}{2}\sum_i (t_i - o_i)^2.$$

To learn the correct set of weights, the error E is reduced as rapidly as possible.

Use the gradient descent technique.

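Parameter updating

The incremental change in the direction of the negative gradient is

$$\Delta w_{ji} \propto -\frac{\partial E}{\partial w_{ji}}, \qquad \Delta w_{ji} = -\eta\,\frac{\partial E}{\partial w_{ji}} = -\eta\,\frac{\partial E}{\partial I_j}\,\frac{\partial I_j}{\partial w_{ji}} = \eta\,\delta_j\,o_i,$$

where

$$\delta_j = -\frac{\partial E}{\partial I_j} = -\frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial I_j} = -\frac{\partial E}{\partial o_j}\,f'(I_j).$$

For the links connected to the output layer the change in weight is given by

$$\Delta w_{ji} = \eta\left(-\frac{\partial E}{\partial o_j}\right) f'(I_j)\,o_i.$$

For nodes in the hidden layers

$$\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial I_k}\,\frac{\partial I_k}{\partial o_j} = \sum_k \frac{\partial E}{\partial I_k}\,\frac{\partial}{\partial o_j}\sum_i w_{ki}\,o_i = \sum_k \frac{\partial E}{\partial I_k}\,w_{kj} = -\sum_k \delta_k\,w_{kj}.$$

Hence for the hidden layer we have

$$\Delta w_{ji} = \eta\left(\sum_k \delta_k\,w_{kj}\right) f'(I_j)\,o_i.$$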
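The two weight-change formulas can be checked on a very small network. The sketch below assumes a sigmoid activation (whose derivative is o(1 − o)), the squared error E = ½ Σ (t − o)², a 2-2-1 network without biases, and made-up weights and training pattern; none of these specifics come from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Return the hidden outputs o_j and the output-layer outputs o_k."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    out = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, out


def error(w_hidden, w_out, x, target):
    """E = 1/2 * sum_k (t_k - o_k)^2."""
    _, out = forward(w_hidden, w_out, x)
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, out))


def weight_changes(w_hidden, w_out, x, target, eta):
    """Delta w = eta * delta * o for both layers (one pattern, no biases)."""
    hidden, out = forward(w_hidden, w_out, x)
    # Output layer: delta_k = (t_k - o_k) * o_k * (1 - o_k), since -dE/do_k = t_k - o_k.
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, out)]
    # Hidden layer: delta_j = (sum_k delta_k * w_kj) * o_j * (1 - o_j).
    delta_hidden = [
        sum(delta_out[k] * w_out[k][j] for k in range(len(w_out))) * h * (1 - h)
        for j, h in enumerate(hidden)
    ]
    d_out = [[eta * dk * h for h in hidden] for dk in delta_out]
    d_hidden = [[eta * dj * xi for xi in x] for dj in delta_hidden]
    return d_hidden, d_out


# Hypothetical weights and training pattern, chosen only for illustration.
w_hidden = [[0.15, -0.20], [0.25, 0.30]]
w_out = [[0.40, -0.45]]
x, target = [0.05, 0.10], [1.0]

before = error(w_hidden, w_out, x, target)
d_hidden, d_out = weight_changes(w_hidden, w_out, x, target, eta=0.5)
w_hidden = [[w + d for w, d in zip(r, dr)] for r, dr in zip(w_hidden, d_hidden)]
w_out = [[w + d for w, d in zip(r, dr)] for r, dr in zip(w_out, d_out)]
after = error(w_hidden, w_out, x, target)
print(before, after)
```

One update step is expected to lower the error on the presented pattern; comparing the computed changes with a finite-difference estimate of −η ∂E/∂w is a simple way to check such code.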

Parameter updating

If

$$o_j = \frac{1}{1 + e^{-\left(\sum_i w_{ji}\,o_i - \theta_j\right)}}$$

then

$$f'(I_j) = \frac{\partial o_j}{\partial I_j} = o_j\,(1 - o_j)$$

and thus we get

$$\Delta w_{ji} = \begin{cases} \eta\left(-\dfrac{\partial E}{\partial o_j}\right) o_j\,(1 - o_j)\,o_i & \rightarrow \text{output layer} \\[2ex] \eta\left(\displaystyle\sum_k \delta_k\,w_{kj}\right) o_j\,(1 - o_j)\,o_i & \rightarrow \text{hidden layer} \end{cases}$$

A large value of η corresponds to rapid learning but might result in oscillations.

A momentum term αΔwji(t) can be added to increase the learning rate without oscillation:

Δwji(t+1) = ηδjoi + αΔwji(t)

The second term specifies that the change in wji at instant (t+1) should be somewhat similar to the change undertaken at instant t.
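The effect of the momentum term can be seen on a toy problem. The sketch below assumes a single weight and the error E(w) = w²/2 (so the gradient step −η·w plays the role of ηδ_j o_i); the values η = 0.01 and α = 0.9 are arbitrary choices for illustration.

```python
def descend(eta, alpha, steps=100, w=1.0):
    """Minimise the toy error E(w) = w**2 / 2, whose gradient is dE/dw = w."""
    change = 0.0
    for _ in range(steps):
        gradient_step = -eta * w                 # corresponds to eta * delta_j * o_i
        change = gradient_step + alpha * change  # Delta w(t+1) = step + alpha * Delta w(t)
        w += change
    return w


w_plain = descend(eta=0.01, alpha=0.0)
w_momentum = descend(eta=0.01, alpha=0.9)
print(abs(w_plain), abs(w_momentum))
```

With the same small η, the run with momentum ends much closer to the minimum at w = 0 than the run without it.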

Parameter updating

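Gradient descent

In gradient descent, weights are changed in proportion to the negative of an error derivative with respect to each weight:

$$\Delta w_{ji} \propto -\frac{\partial E}{\partial w_{ji}}, \qquad \Delta w_{ji} = -\eta\,\frac{\partial E}{\partial w_{ji}} = -\eta\,\frac{\partial E}{\partial I_j}\,\frac{\partial I_j}{\partial w_{ji}} = \eta\,\delta_j\,o_i$$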

Gradient descent

Small values of η: slow convergence

Large values of η: oscillations
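Both regimes can be reproduced on a one-dimensional toy error. The sketch assumes E(w) = w², so dE/dw = 2w and each step multiplies w by (1 − 2η); the two learning rates are arbitrary illustrative choices.

```python
def gradient_descent_path(eta, steps=20, w=1.0):
    """Follow w <- w - eta * dE/dw for the toy error E(w) = w**2."""
    path = [w]
    for _ in range(steps):
        w = w - eta * 2.0 * w
        path.append(w)
    return path


slow = gradient_descent_path(eta=0.01)         # factor 0.98 per step: slow convergence
oscillating = gradient_descent_path(eta=0.95)  # factor -0.9 per step: sign flips every step

sign_changes = sum(1 for a, b in zip(oscillating, oscillating[1:]) if a * b < 0)
print(slow[-1], sign_changes)
```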

Local minima or saddle points

There are some problems with the gradient descent approach:

Adjusting all the weights at once can result in a significant movement of the neural network in weight space.

The gradient descent algorithm is quite slow.

It is susceptible to:
- Local minima, or
- Saddle points.

Training MLP

Feature 1   Feature 2   Feature 3   Class
1.4         2.7         1.9         0
3.8         3.4         3.2         0
6.4         2.8         1.7         1
4.1         0.1         0.2         0
...         ...         ...         ...

[Figure: the first pattern (1.4, 2.7, 1.9) is presented to the network, which outputs 0.8.]

[Figure: the output 0.8 is compared with the target class 0, giving an error of 0.8; the weights are adjusted.]

[Figure: the third pattern (6.4, 2.8, 1.7) is presented; the network outputs 0.9 against the target class 1, giving an error of -0.1.]

Algorithms for weight adjustment are designed to make changes that will reduce the error.

Drawbacks of BP

Backpropagation (BP) could train multiple layers.

It was expected to learn multiple feature representations.

It was too slow to train and needed many labelled instances.

• Since the basic back-propagation (BP) learning algorithm is too slow for most practical applications, there have been extensive research efforts to accelerate its convergence.

• The backpropagation algorithm is still the workhorse of learning in neural networks.

What happens when we add more layers?

Speed of learning in layers (||δl||2 where δl = ∂E/∂blj)

Network                         Layer 1     Layer 2     Layer 3     Layer 4     ...   Layer n
                                (||δ1||2)   (||δ2||2)   (||δ3||2)   (||δ4||2)         (||δn||2)
2 layers (784:30:10)            0.7         0.31
3 layers (784:30:30:10)         0.12        0.60        0.283
4 layers (784:30:30:30:10)      0.003       0.017       0.070       0.285
...                             ...         ...         ...         ...         ...   ...
n layers (784:30: ... :30:10)   0           ...         ...         ...         ...   ...

Input: 784 neurons
Output: 10 neurons
Each hidden layer has 30 neurons

Deep Learning, draft book in preparation, by Yoshua Bengio, Ian Goodfellow, and Aaron Courville

MLNN are hard to train through BP

Vanishing gradient: early layers learn more slowly than later layers

Let's take 784 neurons in the input layer (28×28 = 784 pixels), 30 hidden neurons and 10 output neurons:
- Accuracy = 96.48

Add another layer of 30 hidden neurons:
- Accuracy = 96.90

Add another layer of 30 hidden neurons:
- Accuracy = 96.57

Add another layer of 30 hidden neurons:
- Accuracy = 96.53

Why is this happening?
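One commonly given explanation is that the backpropagated gradient is multiplied by w·f′(I) at every layer, and for the sigmoid f′ is never larger than 0.25. The sketch below illustrates this on a deliberately simplified network with a single sigmoid neuron per layer; the weight of 1.0, the bias of 0.0 and the input of 0.5 are assumptions, and the numbers are not meant to reproduce the table above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bias_gradients(n_layers, x=0.5, weight=1.0, bias=0.0):
    """d(output)/d(b_l) for a chain of single sigmoid neurons, one per layer."""
    # Forward pass, remembering f'(I) = o * (1 - o) at every layer.
    derivs = []
    a = x
    for _ in range(n_layers):
        a = sigmoid(weight * a + bias)
        derivs.append(a * (1.0 - a))
    # Backward pass: every step towards the input multiplies by weight * f'(I).
    grads = [0.0] * n_layers
    upstream = 1.0
    for layer in reversed(range(n_layers)):
        grads[layer] = upstream * derivs[layer]
        upstream *= weight * derivs[layer]
    return grads


grads = bias_gradients(4)
for layer, g in enumerate(grads, start=1):
    print(layer, g)
```

In this chain each gradient is at most a quarter of the gradient one layer closer to the output, so the layers nearest the input receive the smallest updates.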

Deep architecture in human brain

[Figure: the visual pathway from the retina through the lateral geniculate nucleus to the visual cortex]

Retina: pixels
V1: edges, corners
V2: simple shapes
V4: high-level abstractions

Do you see some changes?

Deep architecture of brain

• A bottom-up model of object recognition in the brain, based on neuropsychological evidence:

– Stage 1: Basic components (colour, depth, and form) are processed.

– Stage 2: Basic components are then grouped on the basis of similarity (distinct edges, shapes).

– Stage 3: The visual representation is matched with structural descriptions in memory.

– Stage 4: Semantic attributes are attached to the representation, which yields recognition.

• Other existing models propose integrative hierarchies (top-down and bottom-up) and parallel processing.

Deep neural networks architecture

• Modelled on the working principle of the multistage recognition ability of the human brain.

• At the core: Artificial Neural Networks

• Deep-learning networks are distinguished from the more commonplace single-hidden-layer neural networks by their depth, i.e. more hidden layers

• Many layers of non-linear transformations

• Takes raw input

Concept of deep learning

[Figure: an input image (raw input vector representation) is transformed into a slightly higher level of representation, then higher levels of representation, and finally a very high level representation: "Girl Jumping"]

Presentation notes: Transform the raw input image into gradually higher levels of representation, where the higher levels are more abstract functions of the input data than the lower representations.

Learning with depth

What is new: algorithms for training many-layer networks

• Prior to deep learning, MLPs were typically initialized using random numbers.

• Deep learning proposed a new initialization strategy: use a series of single-layer networks, which do not suffer from vanishing/exploding gradients, to find the initial parameters for a deep MLP

Why deep learning?

• It has been proven that depth-2 networks are enough to represent any function.

• However, for any complex function the required number of nodes (computations and parameters) may grow very large.
– The weight matrix may be too large to handle.

• Hastad has shown that if the number of nodes required for n inputs is O(n) at depth d, then the number of nodes becomes O(2^n) if the depth is d−1.
– i.e. if a 10-layer network needs only 100 nodes to represent a function, a 9-layer network would need 2^100 nodes to represent the same function (although both of them represent it correctly).

[On the power of small-depth threshold circuits, Computational Complexity, Volume 1, Issue 2, pp. 113-129, Springer]

Types of deep neural networks

• Deep convolutional neural networks
– Image recognition, object recognition and speech recognition

• Auto-encoders
– Feature extraction, unsupervised learning, pattern recognition

• Deep belief networks
– Image recognition and classification

• Deep convolutional inverse graphics networks
– Given one input image, generates new images of the same object with variations in pose and lighting

• Deep residual network
– Image recognition

• Recurrent neural networks
– Time series analysis, speech recognition

• Recursive neural tensor network
– Text analysis

Introduction to CNN

• A CNN is a feed-forward network that can extract topological properties from an image.

• Like almost every other neural network, CNNs are trained with a version of the back-propagation algorithm.

• Convolutional Neural Networks are designed to recognize visual patterns directly from pixel images.

• They can recognize patterns with extreme variability (such as handwritten characters).

Convolutional Neural Networks

• In 1995, Yann LeCun and Yoshua Bengio introduced the concept of convolutional neural networks.

• How to produce good internal representations of the visual world to support recognition...
– detect and classify objects into categories, independent of pose, scale, illumination, conformation, occlusion and clutter

• Previously in Computer Vision: Hand-crafted feature extractor

• Now in Computer Vision: Learn suitable representations of images

• Argument: Biological vision is hierarchically organized


[Figure: hierarchical processing of a forest image: photoreceptors (forest image), V1 with simple and complex cells (oriented edges), V4 (different textures: barks, leaves, etc.), inferotemporal cortex (trees)]

Convolution operation

• The convolution operation is local in the image domain, which makes the connections in the neural network sparser.

Convolutional Neural Networks

• Convolutional layers apply a number of filters to the input.

– The result of one filter applied across the image is called a feature map (FM), and the number of feature maps is equal to the number of filters.

– The intuition behind sharing weights across the image is that features will be detected regardless of their location, while the multiplicity of filters allows each of them to detect a different set of features.

[Figure: layer m-1 connected to a feature map in layer m; weights of the same color are shared, i.e. constrained to be identical]

[Figure: input image, convolution kernel and resulting feature map]
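A minimal sketch of how one shared kernel produces a feature map, and how several filters give several feature maps, is shown below. The function name, the 4×4 image and the two 2×2 kernels are invented for the example; as in most CNN implementations the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide one shared kernel over the image, keeping only fully covered positions."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # The same kernel weights are reused at every position (weight sharing).
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
kernels = [
    [[-1, 1], [-1, 1]],  # responds to a dark-to-bright vertical edge
    [[1, 1], [1, 1]],    # sums the brightness of each 2x2 patch
]
feature_maps = [convolve2d_valid(image, k) for k in kernels]
for fm in feature_maps:
    print(fm)
```

The first kernel responds only at the column where the edge lies, in every row, which is the location-independent detection that weight sharing is meant to provide.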


• Subsampling layers reduce the size of the input.
– There are multiple ways to subsample, but the most popular are max pooling, average pooling, and stochastic pooling.
– The last subsampling (or convolutional) layer is usually connected to one or more fully connected layers, the last of which represents the target data.

• Modified backpropagation is used for training, with updates made only in the convolution layers
– Subsampling layers have no weights to learn
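Max pooling and average pooling can be sketched as follows. Non-overlapping 2×2 windows are assumed (the slides do not fix a window size), and the feature map values are made up; stochastic pooling is not shown.

```python
def pool(feature_map, size, reduce_window):
    """Apply reduce_window to each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            window = [feature_map[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(reduce_window(window))
        pooled.append(pooled_row)
    return pooled


def max_pool(feature_map, size=2):
    return pool(feature_map, size, max)


def average_pool(feature_map, size=2):
    return pool(feature_map, size, lambda window: sum(window) / len(window))


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool(feature_map))
print(average_pool(feature_map))
```

Neither function has any weights, which matches the remark that subsampling layers have nothing to learn.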

[Figure: network diagram with labels: input, weights, bias b1, bias b2, output]


Initialization

• All-zero initialization
– All weights are set to zero

• Initialization with small random numbers
– All weights are initialised with random numbers very close to zero.
– Treated as symmetry breaking

Activation functions

• Sigmoid

• Tanh

• RELU

• Leaky - RELU
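The four activation functions listed above have the following standard definitions. The negative-side slope of 0.01 for the leaky ReLU is a common default and an assumption here, since the slide gives no value.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes x into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (0.01 is an assumed default)."""
    return x if x > 0 else slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```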


Backpropagation with weight constraints

• Modify the BP algorithm with linear constraints between the weights.

• Modify the gradients so that they satisfy the constraints.
– So if the weights started off satisfying the constraints, they will continue to satisfy them.

To constrain: w1 = w2
we need: Δw1 = Δw2
compute: ∂E/∂w1 and ∂E/∂w2
use: ∂E/∂w1 + ∂E/∂w2 for both w1 and w2
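The constraint w1 = w2 can be kept during training by applying the summed gradient to both weights. In the sketch below the linear unit, the squared error, the input values and the learning rate are assumptions made for illustration; only the rule "use ∂E/∂w1 + ∂E/∂w2 for w1 and w2" comes from the slide.

```python
def tied_update(w1, w2, grad_w1, grad_w2, eta):
    """Apply the same change to both weights so that w1 = w2 stays true."""
    change = -eta * (grad_w1 + grad_w2)
    return w1 + change, w2 + change


def error_and_gradients(w1, w2, x1, x2, target):
    """E = 1/2 * (w1*x1 + w2*x2 - target)^2 together with dE/dw1 and dE/dw2."""
    diff = w1 * x1 + w2 * x2 - target
    return 0.5 * diff ** 2, diff * x1, diff * x2


# Hypothetical starting point that satisfies the constraint.
w1 = w2 = 0.5
before, g1, g2 = error_and_gradients(w1, w2, 1.0, 2.0, 3.0)
w1, w2 = tied_update(w1, w2, g1, g2, eta=0.1)
after, _, _ = error_and_gradients(w1, w2, 1.0, 2.0, 3.0)
print(w1, w2, before, after)
```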

Advantages of CNN

• They use fewer parameters (weights) to learn than a fully connected network.

• They are designed to be invariant to object position and distortion in the scene.

• They automatically learn and generalize features from the input domain.


Autoencoders

• Imagine that we train a neural network which has:
– input
– one hidden layer
– output which is the same as the input

• And you require that the hidden layer:
– either has fewer nodes than the input/output layers
– or is sparse, i.e. the nodes usually output 0, and only sometimes >0.05

• This is called an autoencoder (or autoassociator).

Autoencoders

• So then your final network might look like:
– input layer
– layer from first autoencoder
– layer from second autoencoder
– ...
– layer from nth autoencoder
– output layer

• Now, if we have a lot of labeled data we can then "fine-tune" this network.
– i.e. use those layers from autoencoders as the first generation of a big neural network and then
– run a lot of generations of back propagation

Autoencoders

An auto-encoder is trained, with an absolutely standard weight-adjustment algorithm, to reproduce the input.

When you pass data through such a network,
• it first compresses (encodes) the input vector to "fit" in a smaller representation,
• then tries to reconstruct (decode) it back.

$$X \xrightarrow{\;f\;} Y \xrightarrow{\;g\;} X$$

where Y is the hidden layer units' output vector and X is the input vector.

• The task of training is to minimize the error of reconstruction, i.e. to find the most efficient compact representation (encoding) for the input data.
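The encode/decode idea can be illustrated with a very small linear autoencoder trained by plain gradient descent. Everything specific here is an assumption for the sake of a runnable sketch: two inputs, one linear hidden unit, no biases, data lying on the line x2 = 2·x1, a learning rate of 0.05 and 500 training steps.

```python
def reconstruct(w_enc, w_dec, x):
    """X -f-> Y -g-> X: encode to one hidden value, then decode."""
    y = sum(w * xi for w, xi in zip(w_enc, x))  # f: hidden output Y
    return y, [w * y for w in w_dec]            # g: reconstruction of X


def mean_error(w_enc, w_dec, data):
    """Mean reconstruction error 1/2 * ||reconstruction - x||^2 over the data."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(w_enc, w_dec, x)
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train(data, eta=0.05, steps=500):
    """Batch gradient descent on the reconstruction error."""
    w_enc, w_dec = [0.1, 0.1], [0.1, 0.1]
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            y, x_hat = reconstruct(w_enc, w_dec, x)
            err = [a - b for a, b in zip(x_hat, x)]
            back = sum(w * e for w, e in zip(w_dec, err))  # error sent back to the hidden unit
            for i in range(2):
                g_dec[i] += err[i] * y / len(data)
                g_enc[i] += back * x[i] / len(data)
        w_enc = [w - eta * g for w, g in zip(w_enc, g_enc)]
        w_dec = [w - eta * g for w, g in zip(w_dec, g_dec)]
    return w_enc, w_dec


# Two-dimensional points that really need only one number (their position on a line).
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.5, 1.0)]
initial = mean_error([0.1, 0.1], [0.1, 0.1], data)
w_enc, w_dec = train(data)
final = mean_error(w_enc, w_dec, data)
print(initial, final)
```

Because the data are effectively one-dimensional, a single hidden unit is enough to reconstruct them, so the error is expected to fall substantially during training.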

Autoencoders: why do they work?

[Figure: plot with the function in black and the error in red]

Associative memory

Any physical system whose dynamics in the state space is dominated by a number of locally stable states can be regarded as an associative memory/content addressable memory (CAM).

The information stored in the system consists of the locally stable points Xa, Xb, ..., Xn.

Then, if the system is started at X = Xa + Δ, it will proceed in time until X = Xa.

The starting point X = Xa + Δ represents partial knowledge of the item Xa, and the system then generates the total information Xa.

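Let $(x_k, y_k)$ be an associated pattern pair, with

$$x_k \in \mathbb{R}^{N_1} \ (\text{input}), \qquad y_k \in \mathbb{R}^{N_2} \ (\text{output}).$$

Then the $N_2 \times N_1$ matrix $M = y_k\, x_k^{t}$ can serve as an associative memory.

If $x_k$ is given as input, $y_k$ will be produced as output:

$$M x_k = (y_k\, x_k^{t})\, x_k = y_k\, (x_k^{t} x_k) = y_k \quad [\text{if } x_k^{t} x_k = 1].$$

If the stimulus is not $x_k$ but $\tilde{x}_k$ (a distorted version of $x_k$), then

$$M \tilde{x}_k = y_k\, (x_k^{t}\, \tilde{x}_k);$$

the output will be $y_k$ with magnitude $(x_k^{t}\, \tilde{x}_k)$.

[Figure: example image labelled "Tom Cruise"]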
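The outer-product memory can be checked numerically. The vectors below are made up for illustration; the input is chosen to have unit length so that the condition x_k^t x_k = 1 holds, and the distorted input simply drops one component.

```python
def outer(y, x):
    """M = y x^t, a matrix with len(y) rows and len(x) columns."""
    return [[yi * xj for xj in x] for yi in y]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical pattern pair; x_k has unit length (0.36 + 0.64 = 1).
x_k = [0.6, 0.8, 0.0]
y_k = [1.0, -1.0]
M = outer(y_k, x_k)

recalled = matvec(M, x_k)              # y_k * (x_k^t x_k) = y_k

x_distorted = [0.6, 0.0, 0.0]          # a distorted version of x_k
recalled_distorted = matvec(M, x_distorted)  # y_k scaled by x_k^t x_distorted = 0.36
print(recalled, recalled_distorted)
```

The distorted input still recalls the direction of y_k, only with a smaller magnitude.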

Dimensionality Reduction

Use a deep auto-encoder

Train it with images as input & output

Limit one layer to few dimensions

Information has to pass through middle layer


How does a deep autoencoder work?

Train this layer first

Then this second layer

Then this third layer

Then this fourth layer

Finally this last layer

Each of the (non-output) layers is trained to be an autoencoder.

Basically, it is forced to learn good features that describe what comes from the previous layer.

Deep Autoencoders

Intermediate layers are each trained to be autoencoders (or similar).

The final layer is pre-trained to predict the class based on outputs from the previous layers.

As the pre-training process has initialised the weights favourably,
− the deep MLP training can be done using gradient descent techniques,
− the problem of vanishing/exploding gradients is largely avoided.

• That's the basic idea
• There are many, many types of deep learning
• Different kinds of autoencoder, variations on architectures and training algorithms, etc.

Deep Autoencoders

So, everything basically boils down to:
• Iterative algorithm
• Learning at different levels of abstraction
• Non-linear transformations
• Typically multi-layered neural networks

What do we do?
• Take unlabeled data (a lot of data)
• Unsupervised pre-training (feature detection by autoencoders)
• Then run supervised backpropagation iteratively
– Classify labeled data
• Learning in successive layers

Deep learning based image analysis

[Figure: labeled images are split into a training set and a validation set; the deep neural network learns from the training set, is validated on the validation set, and the model is adjusted.]

• The deep neural network tries to learn the features from the input images provided.

• The network trains iteratively by using the information available from the training set.

• The deep network is validated and more adjustments are made to the model.

Image analysis using deep learning

It is an image labelling task.

It basically gives the semantic information from the image.

The quality of the obtained results depends on the features.

[Figure: example image labelled "Girl Jumping"]


Identify the mug


Feature Representation Learning Algorithm

Neuron 1 of visual cortex's V1 part in human brain

Neuron 2 of visual cortex's V1 part in human brain



[Figure: labeled images (cups) are used for pre-training, in which the deep neural network learns feature representations.]

Training: form a concept

What is this?

A cup

Labels name this concept as cups

[Figure: unlabeled images are used for pre-training, in which the deep neural network learns feature representations.]

Training: form many concepts of the groups/clusters of utensils

What is this?

Belongs to the group of these

Image analysis using deep learning

And so on...

Layer 3: Combines the objects to form complete faces

Layer 2: Combines the image patches and forms individual objects

Layer 1: Learns from extracted image patches

How a deep neural network sees

Unlabeled data is readily available

• Example: images from the web
– Download 10,000,000 images
– Train a 9-layer deep neural net

• Concepts are formed by the DNN

• It is found that they perform 70% better than the previous state of the art

[Figure: the concept of a human face formed by the network]

Deep Neural Nets are fooled

[Figure: unrecognizable images classified as "remote" and "starfish", each with 99.99% confidence]

A. Nguyen, J. Yosinski, J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images", IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015.

• The algorithms creating the images add all the basic parts the DNN is looking for, but not in a logical way.

• The result can look meaningless to humans looking at it.

• However, the DNN may find a resemblance to what it learned during its earlier training, because it is able to find those basic image parts.

If the DNN perceives the colour texture, then all of these images are recognized as giraffes.

If the DNN perceives the body structure, then none of these images are recognized as giraffes.

These too are then recognized as giraffes.

Conclusion

• Deep neural networks are powerful.

• Deep learning makes it possible to
- use knowledge extracted from unlabeled data effectively
- lessen the chance of getting stuck in local minima
- improve the training performance

• Deep neural networks are trainable if we have a very fast computer.

• So if we have a very large, high-quality dataset, we can find the best deep neural network for the task, which will solve the problem or at least come close to solving it.
– Big data might come in handy.
– Much like big data tools, a deep learning model is only as good as the data it is fed.
