Lecture 9: Neural Networks
Shuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/CS410/index.html
Brief history of artificial neural nets
• The first wave
  • 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model
  • 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons
  • 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation
• The second wave
  • 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again
• The third wave
  • 2006: Deep (neural network) learning gained popularity
  • 2012: Significant breakthroughs were made in many applications
Biological neuron structure
• A neuron receives signals through its dendrites and sends its own signal out through the axon terminals
Biological neural communication
• Electrical potential across the cell membrane exhibits spikes called action potentials
• A spike originates in the cell body, travels down the axon, and causes synaptic terminals to release neurotransmitters
• The chemical diffuses across the synapse to the dendrites of other neurons
• Neurotransmitters can be excitatory or inhibitory
• If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, the neuron fires an action potential
Perceptron
• Inspired by the biological neuron, researchers built a simple model called the perceptron
• It receives signals x_i, multiplies them by weights w_i, and passes the weighted sum through an activation function, the step function
Neuron vs. Perceptron
Artificial neural networks
• Multilayer perceptron network
• Convolutional neural network
• Recurrent neural network
Perceptron
Training
• Update rule: w_i ← w_i − η (o − y) x_i
  • y: the true label
  • o: the output of the perceptron
  • η: the learning rate
• Explanation
  • If the output is correct, do nothing
  • If the output is too high, lower the weights
  • If the output is too low, increase the weights
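As a sketch, the update rule above can be written in code (a minimal illustration; the function and variable names are my own, not from the lecture):

```python
import numpy as np

def step(z):
    """Step activation: output 1 if the weighted sum is non-negative, else 0."""
    return 1.0 if z >= 0 else 0.0

def perceptron_update(w, x, y, lr=0.1):
    """One training step: w_i <- w_i - lr * (o - y) * x_i."""
    o = step(np.dot(w, x))           # perceptron output
    return w - lr * (o - y) * x      # unchanged when o == y

# Toy usage: the prediction is 1 but the label is 0, so the weights decrease
w = perceptron_update(np.array([0.0, 0.0]), np.array([1.0, 1.0]), y=0.0)
print(w)
```

When the prediction matches the label, the factor (o − y) is zero and the weights are left untouched, exactly as the explanation above states.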
Properties
• Rosenblatt [1958] proved that training converges if the two classes are linearly separable and η is reasonably small
Limitation
• Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron
• Rosenblatt believed these limitations could be overcome by adding more layers of units, but no learning algorithm was yet known for obtaining the weights
Solution: Add hidden layers
• Two-layer feedforward neural network
Demo
• Larger neural networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions learned by a trained neural network are drawn underneath. You can play with these examples in the ConvNetJS demo.
Activation Functions
Activation functions
• Sigmoid: σ(z) = 1 / (1 + e^(−z))   (most popular in fully connected neural networks)
• Tanh: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• ReLU (rectified linear unit): ReLU(z) = max(0, z)   (most popular in deep learning)
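For reference, the three activations and the derivatives used later in the lecture can be sketched as follows (my own code, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z), output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """tanh'(z) = 1 - tanh^2(z), with tanh output in (-1, 1)."""
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    """ReLU(z) = max(0, z)."""
    return np.maximum(0.0, z)

def relu_prime(z):
    """1 for z > 0, 0 for z <= 0."""
    return (z > 0).astype(float)

# sigmoid squashes into (0, 1); tanh into (-1, 1); ReLU keeps positives
z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), np.tanh(z), relu(z))
```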
Activation function values and derivatives

Sigmoid activation function
• Sigmoid: σ(z) = 1 / (1 + e^(−z))
• Its derivative: σ′(z) = σ(z)(1 − σ(z))
• Output range (0, 1)
• Motivated by biological neurons; can be interpreted as the probability of an artificial neuron "firing" given its inputs
• However, saturated neurons make gradients vanish (why?)
  • Repeated application σ(σ(σ(⋯))) squashes values: σ maps (0, 1) into ≈ (0.5, 0.731), and (0.5, 0.731) into ≈ (0.622, 0.675)
Tanh activation function
• Tanh function: tanh(z) = sinh(z) / cosh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• Its derivative: tanh′(z) = 1 − tanh²(z)
• Output range (−1, 1)
• Thus strongly negative inputs to the tanh map to negative outputs
• Only zero-valued inputs are mapped to near-zero outputs
• These properties make the network less likely to get "stuck" during training
ReLU activation function
• ReLU (rectified linear unit) function: ReLU(z) = max(0, z)
• Its derivative: ReLU′(z) = 1 if z > 0, 0 if z ≤ 0
• ReLU can be approximated by the softplus function
• ReLU's gradient does not vanish as z increases
• Speeds up the training of neural networks
  • The gradient computation is very simple
  • The computational step is simple: no exponentials, no multiplication or division operations (compared to the others)
  • The gradient on the positive portion is larger than for the sigmoid or tanh functions, so weights update more rapidly
• The "dead neuron" problem on the left portion can be ameliorated by Leaky ReLU

ReLU activation function (cont.)
• ReLU function: ReLU(z) = max(0, z)
• The only non-linearity comes from path selection, with individual neurons being active or not
• It allows sparse representations: for a given input, only a subset of neurons are active
Sparse propagation of activations and gradients
Multilayer Perceptron Networks
Neural networks
• Neural networks are built by connecting many perceptrons together, layer by layer
Universal approximation theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366
1-20-1 NN approximates a noisy sine function
Increasing power of approximation
• With more neurons, the approximation power increases and the decision boundary covers more details
• In applications, we usually use more layers with structure to approximate complex functions, instead of one hidden layer with many neurons
Common neuron network structures at a glance
Multilayer perceptron network
Single / Multiple layers of calculation
• Single-layer function:
  o_σ(x) = σ(w_0 + w_1 x_1 + w_2 x_2)
• Multiple-layer function:
  h_1(x) = σ(w_0^(1) + w_1^(1) x_1 + w_2^(1) x_2)
  h_2(x) = σ(w_0^(2) + w_1^(2) x_1 + w_2^(2) x_2)
  o_σ(h) = σ(w_0 + w_1 h_1 + w_2 h_2)
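Under these definitions, a forward pass through the two-layer function can be sketched as follows (my own minimal code; the weight values are arbitrary illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer(x, W_hidden, w_out):
    """o_sigma(h) where h_i(x) = sigma(w_0^(i) + w_1^(i) x_1 + w_2^(i) x_2).

    Each row of W_hidden holds (w_0, w_1, w_2) for one hidden unit;
    w_out holds (w_0, w_1, w_2) for the output unit.
    """
    h = sigmoid(W_hidden @ np.append(1.0, x))   # h_1(x), h_2(x)
    return sigmoid(w_out @ np.append(1.0, h))   # o_sigma(h)

x = np.array([0.5, -0.5])
W_hidden = np.array([[0.0, 1.0, -1.0],
                     [0.0, -1.0, 1.0]])
w_out = np.array([0.0, 1.0, 1.0])
print(two_layer(x, W_hidden, w_out))
```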
Training: Backpropagation
How to train?
• As with previous models, we use the gradient descent method to train the neural network
• Given the topology of the network (number of layers, number of neurons, and their connections), find a set of weights that minimize the error function E = ½ Σ_d Σ_k (t_{k,d} − o_{k,d})² between the network outputs o and the targets t over the set of training examples
Gradient interpretation
• The gradient is the vector (the red one) along which the value of the function increases most rapidly. Its opposite direction is thus where the value decreases most rapidly
Gradient descent
• To find a (local) minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or an approximation of it) of the function at the current point
• For a smooth function f(x), ∂f/∂x is the direction in which f increases most rapidly, so we apply
  x_{t+1} = x_t − η (∂f/∂x)(x_t)
  until x converges
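A minimal sketch of this iteration on f(x) = x², whose gradient is 2x (my own example, not from the slides):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Iterate x_{t+1} = x_t - lr * grad(x_t)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = x^2 decreases fastest along -2x; the minimum is at x = 0
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
print(x_min)
```

Each step shrinks x by the factor (1 − 2·lr), so the iterate converges geometrically to the minimizer 0.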
Gradient descent
• Gradient, training rule, and update formulas (shown on the slide); partial derivatives are the key
The chain rule
• The challenge in neural network models is that we only know the target for the output layer, not the targets for the hidden and input layers. How can we update their connection weights using gradient descent?
• The answer is the chain rule that you have learned in calculus:
  y = g(f(x))  ⇒  dy/dx = g′(f(x)) f′(x)
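A quick numerical sanity check of the chain rule, with my own choice of g(u) = u² and f(x) = 3x + 1:

```python
def f(x):
    return 3 * x + 1        # f'(x) = 3

def g(u):
    return u * u            # g'(u) = 2u

x = 2.0
analytic = 2 * f(x) * 3                                 # g'(f(x)) * f'(x)
numeric = (g(f(x + 1e-6)) - g(f(x - 1e-6))) / 2e-6      # central difference
print(analytic, numeric)
```

The analytic chain-rule value and the finite-difference estimate agree, which is exactly what backpropagation exploits layer by layer.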
Feed forward vs. Backpropagation
Make a prediction
Make a prediction (cont.)
Make a prediction (cont.)
Backpropagation
• Assume all the activation functions are sigmoid
• Error function: E = ½ Σ_j (y_j − t_j)²
• ∂E/∂y_j = y_j − t_j
• ∂y_j/∂w_{i,j}^(2) = σ′(net_j^(2)) h_i^(1) = y_j (1 − y_j) h_i^(1), where h_i^(1) is the output of hidden unit i
• Hence ∂E/∂w_{i,j}^(2) = −(t_j − y_j) y_j (1 − y_j) h_i^(1)
• Update: w_{i,j}^(2) ← w_{i,j}^(2) + η δ_j^(2) h_i^(1), with δ_j^(2) = (t_j − y_j) y_j (1 − y_j)
Backpropagation (cont.)
• Error function: E = ½ Σ_j (y_j − t_j)²
• ∂E/∂y_j = y_j − t_j
• δ_j^(2) = (t_j − y_j) y_j (1 − y_j)
• w_{i,j}^(2) ← w_{i,j}^(2) + η δ_j^(2) h_i^(1)
• ∂y_j/∂h_i^(1) = y_j (1 − y_j) w_{i,j}^(2)
• ∂h_i^(1)/∂w_{k,i}^(1) = σ′(net_i^(1)) x_k = h_i^(1) (1 − h_i^(1)) x_k
• ∂E/∂w_{k,i}^(1) = −h_i^(1) (1 − h_i^(1)) Σ_j w_{i,j}^(2) (t_j − y_j) y_j (1 − y_j) x_k = −h_i^(1) (1 − h_i^(1)) Σ_j w_{i,j}^(2) δ_j^(2) x_k
• w_{k,i}^(1) ← w_{k,i}^(1) + η δ_i^(1) x_k, where δ_i^(1) = h_i^(1) (1 − h_i^(1)) Σ_j w_{i,j}^(2) δ_j^(2)
Backpropagation algorithm
• Activation function: sigmoid
Initialize all weights to small random numbers.
Do until convergence:
• For each training example:
  1. Feed it to the network and compute the network output
  2. For each output unit k, with output o_k: δ_k ← o_k (1 − o_k)(t_k − o_k)
  3. For each hidden unit h, with output o_h: δ_h ← o_h (1 − o_h) Σ_{k ∈ next layer} w_{h,k} δ_k
  4. Update each network weight, where x_i is the input to unit j from unit i: w_{i,j} ← w_{i,j} + η δ_j x_i
Recap for the two-layer network:
• Error function: E = ½ Σ_j (y_j − t_j)²
• δ_j^(2) = (t_j − y_j) y_j (1 − y_j), and w_{i,j}^(2) ← w_{i,j}^(2) + η δ_j^(2) h_i^(1)
• δ_i^(1) = h_i^(1) (1 − h_i^(1)) Σ_j w_{i,j}^(2) δ_j^(2), and w_{k,i}^(1) ← w_{k,i}^(1) + η δ_i^(1) x_k
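The loop above can be sketched for a single hidden layer as follows (my own NumPy code; bias terms are omitted for brevity, and the network sizes are an arbitrary toy choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, lr=0.5):
    """One pass of steps 1-4 above for a single training example.

    W1 maps inputs to hidden units, W2 maps hidden units to outputs.
    """
    h = sigmoid(W1 @ x)                            # hidden outputs
    y = sigmoid(W2 @ h)                            # network outputs (step 1)
    delta_out = y * (1 - y) * (t - y)              # step 2
    delta_hid = h * (1 - h) * (W2.T @ delta_out)   # step 3
    W2 = W2 + lr * np.outer(delta_out, h)          # step 4
    W1 = W1 + lr * np.outer(delta_hid, x)
    return W1, W2

# Toy usage: drive a 2-2-1 network's output toward t = 0.5
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 2))
W2 = rng.normal(scale=0.5, size=(1, 2))
x, t = np.array([0.35, 0.9]), np.array([0.5])
for _ in range(500):
    W1, W2 = backprop_step(x, t, W1, W2)
y_final = sigmoid(W2 @ sigmoid(W1 @ x))
print(y_final)
```

Repeating the step on a single example drives the output toward the target, mirroring the gradient descent convergence discussed above.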
See the backpropagation demo:
• https://shuaili8.github.io/Teaching/VE445/L6%20backpropagation%20demo.html
Formula example for backpropagation
Formula example for backpropagation (cont.)
Calculation example
• Consider the simple network below:
• Assume that the neurons have the sigmoid activation function
• (i) Perform a forward pass on the network and find the predicted output
• (ii) Perform a reverse pass (one training step) with target = 0.5 and η = 1
• (iii) Perform a further forward pass and comment on the result
Calculation example (cont.)
Calculation example (cont.)
• Answer (i)
  • Input to top hidden neuron = 0.35 × 0.1 + 0.9 × 0.8 = 0.755, output = 0.68
  • Input to bottom hidden neuron = 0.35 × 0.4 + 0.9 × 0.6 = 0.68, output = 0.6637
  • Input to final neuron = 0.3 × 0.68 + 0.9 × 0.6637 = 0.80133, output = 0.69
• Answer (ii) (it is fine to use either the new or the old weights when computing δ_i for the hidden units)
  • Output error δ = (t − o) o (1 − o) = (0.5 − 0.69) × 0.69 × (1 − 0.69) = −0.0406
  • Error for top hidden neuron: δ_1 = 0.68 × (1 − 0.68) × 0.3 × (−0.0406) = −0.00265
  • Error for bottom hidden neuron: δ_2 = 0.6637 × (1 − 0.6637) × 0.9 × (−0.0406) = −0.008156
  • New weights for the output layer:
    • w_{o1} = 0.3 − 0.0406 × 0.68 = 0.272392
    • w_{o2} = 0.9 − 0.0406 × 0.6637 = 0.87305
  • New weights for the hidden layer:
    • w_{1A} = 0.1 − 0.00265 × 0.35 = 0.0991
    • w_{1B} = 0.8 − 0.00265 × 0.9 = 0.7976
    • w_{2A} = 0.4 − 0.008156 × 0.35 = 0.3971
    • w_{2B} = 0.6 − 0.008156 × 0.9 = 0.5927
• Answer (iii)
  • Input to top hidden neuron = 0.35 × 0.0991 + 0.9 × 0.7976 = 0.7525, output = 0.6797
  • Input to bottom hidden neuron = 0.35 × 0.3971 + 0.9 × 0.5927 = 0.6724, output = 0.662
  • Input to final neuron = 0.272392 × 0.6797 + 0.87305 × 0.662 = 0.7631, output = 0.682
  • The new error is −0.182, which is reduced compared to the old error −0.19
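The numbers above can be reproduced with a short script (my own code mirroring the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs A = 0.35, B = 0.9; weights as given in the example
x = np.array([0.35, 0.9])
W1 = np.array([[0.1, 0.8],   # top hidden neuron (w_1A, w_1B)
               [0.4, 0.6]])  # bottom hidden neuron (w_2A, w_2B)
w2 = np.array([0.3, 0.9])    # output weights (w_o1, w_o2)
target, lr = 0.5, 1.0

# (i) forward pass
h = sigmoid(W1 @ x)
y = sigmoid(w2 @ h)                        # about 0.69

# (ii) reverse pass with eta = 1
delta_out = (target - y) * y * (1 - y)     # about -0.0406
delta_hid = h * (1 - h) * w2 * delta_out
w2 = w2 + lr * delta_out * h
W1 = W1 + lr * np.outer(delta_hid, x)

# (iii) second forward pass: the error shrinks
y2 = sigmoid(w2 @ sigmoid(W1 @ x))
print(y, y2)
```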
More on backpropagation
• It is gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  • In practice it often works well (can run multiple times)
• Often includes a weight momentum term: Δw_{i,j}(n) = η δ_j x_i + α Δw_{i,j}(n − 1)
• Minimizes error over the training examples
  • Will it generalize well to unseen examples?
• Training can take thousands of iterations, so it is slow!
• Using the network after training is very fast
A Training Example
Network structure
• What happens when we train a network?
• Example:
Input and output
• A target function:
• Can this be learned?
Learned hidden layer
Convergence of sum of squared errors
Hidden unit encoding for 01000000
Convergence of first layer weights
Convergence of backpropagation
• Gradient descent converges to some local minimum
  • Perhaps not the global minimum
  • Remedies: add momentum, use stochastic gradient descent, or train multiple times with different initial weights
• Nature of convergence
  • Weights are initialized near zero
  • Therefore, initial networks are near-linear
  • Increasingly non-linear functions can be approximated as training progresses
Expressiveness of neural nets
• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • But it might require a number of hidden units exponential in the number of inputs
• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers
Overfitting
Two examples of overfitting
Avoid overfitting
• Penalize large weights (regularization)
• Train on target slopes as well as values
• Weight sharing
  • Reduces the total number of weights
  • Uses structure, as in convolutional neural networks
• Early stopping
• Dropout
Applications
Autonomous self-driving cars
• Use neural networks to summarize observation information and learn path planning
Autonomous self-driving cars (cont.)
A recipe for training neural networks
• By Andrej Karpathy
• https://karpathy.github.io/2019/04/25/recipe/
Summary
• Perceptron
• Activation functions
• Multilayer perceptron networks
• Training: backpropagation
• Examples
• Overfitting
• Applications
Questions?
https://shuaili8.github.io
Shuai Li