Introduction to neural networks
Stefano Rovetta, unige.it, 20/23-Jul-2016
Back to optimization
Stochastic optimization
• Optimize a cost that is a random variable
• Types of randomness:
- Measurement plus noise: R + ν
- Multiple effects mixed together (we might use a mixture model)
- Unknown statistical properties
Monte Carlo integration
• Expectation of a random variable X:
$$E\{X\} = \int_E \xi\, p_x(\xi)\, d\xi$$
(over the whole data space E)
• ...but only a sample {x_1, ..., x_n} is given (the training set)
• Empirical distribution:
$$P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$$
• Approximate (empirical) expectation of X:
$$E\{X\} = \int_E \xi\, P_x(\xi)\, d\xi = \frac{1}{n} \sum_{l=1}^{n} x_l$$
• This is a Monte Carlo integral
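To make this concrete, here is a minimal numpy sketch (my illustration, not from the slides): integrating ξ against the empirical distribution collapses the integral into the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data space": X ~ N(2, 1), whose true expectation is 2.
sample = rng.normal(loc=2.0, scale=1.0, size=10_000)  # training set {x_1, ..., x_n}

# The Monte Carlo (empirical) estimate of E{X} is just the sample mean.
estimate = sample.mean()
print(estimate)  # close to 2.0; the error shrinks as O(1/sqrt(n))
```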
• Suppose that R is classification performance (risk).
• We want to optimize the true risk, the one computed on all possible, infinite data:
$$R(w) = \int R(y(x), w)\, p(x)\, dx$$
• This is a function of w (the weights identify one specific neural net)
• It is also a function of the data distribution p(x) (the performance is estimated on the data)
• When training a neural network we don't have p(x), but only the training set {x_1, ..., x_n}
• From the training set we have the empirical distribution
$$P_x(\xi) = \frac{1}{n} \sum_{l=1}^{n} \delta(\xi - x_l)$$
• so we can compute a Monte Carlo estimate of the risk
$$\hat{R}(w, X) = \frac{1}{n_p} \sum_{l=1}^{n_p} R(y(x_l), w)$$
this is the empirical risk.
Training by epoch
• Optimize using the whole training set to estimate the cost
• It means computing R (and the ∆W)
• on the basis of a Monte Carlo estimate of the risk
• Finds the optimal value of an approximate (empirical) cost function
Stochastic approximation
• A special kind of stochastic optimization
• R is estimated at each input pattern using that pattern alone
• Extremely unreliable estimation – but it converges in probability!
• Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952
• Convergence in probability:
$$\lim_{n \to \infty} \Pr\left( |\hat{R}_n - R| \geq \varepsilon \right) = 0$$
• $\hat{R}_n$ is the estimate of R on a training set of size n
Stochastic approximation
• Given:
- A function R whose gradient ∇R we want to set to zero, or minimize (but that we cannot compute analytically)
- A sequence G_1, G_2, ..., G_l, ... of random samples of ∇R, affected by random noise
- A decreasing sequence η_1, η_2, ..., η_l, ... of step-size coefficients
• Basic iteration (sketched below):
$$w(l+1) = w(l) - \eta_l\, G_l$$
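A minimal sketch of this iteration, assuming a toy cost R(w) = w²/2 so that noisy gradient samples G_l can be generated explicitly (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0])              # start far from the optimum w* = 0

# R(w) = w^2 / 2, so the true gradient is w; each G_l is that gradient
# corrupted by zero-mean noise (a "random sample of grad R").
for l in range(1, 10_001):
    G_l = w + rng.normal(scale=1.0, size=w.shape)
    eta_l = 1.0 / l              # decreasing step-size coefficients
    w = w - eta_l * G_l          # basic iteration

print(w)  # near 0, even though every single gradient sample was noisy
```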
Stochastic approximation: The intuition
• Each sample gives a noisy (stochastic) estimate of the gradient
• ⇒ ∇R + noise
• By averaging over time, noise cancels out
• Random variations also make it possible to escape local minima
Results on convergence of stochastic approximation
• If R is twice differentiable and convex, then stochastic approximation converges with a rate of $O(1/l)$
• A condition of convergence (not an optimal rate of convergence):
$$0 < \sum_l \eta_l^2 = A < \infty$$
• Usually the hypotheses are not met (complex cost landscape) and we don't have guarantees.
Training by pattern
• is computing R (and the ∆W)
• on the basis of an estimate of the risk on a single point
• An extreme Monte Carlo estimate, on a training set of one observation only
• Finds the approximate optimal value of an approximate cost function
Implementation of training
• By epoch: estimation loop, then update
• By pattern: estimation + update loop (both loop structures are sketched below)
• By pattern on a training set: pick the pattern index l at random
• Learning rate η → by pattern: keep it low
• → by epoch: make it adaptive
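The two loop structures, sketched with a hypothetical per-pattern gradient grad_loss (a linear least-squares stand-in for the network's back-propagated gradient; all names are illustrative):

```python
import numpy as np

def grad_loss(w, x, t):
    # Per-pattern gradient; a linear model stands in for the network.
    return (w @ x - t) * x

def train_by_epoch(w, X, T, eta, epochs):
    # Estimation loop over the whole training set, then ONE update.
    for _ in range(epochs):
        g = np.mean([grad_loss(w, x, t) for x, t in zip(X, T)], axis=0)
        w = w - eta * g
    return w

def train_by_pattern(w, X, T, eta, epochs):
    # Estimation + update inside the loop; pattern index l is random.
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for l in rng.permutation(len(X)):
            w = w - eta * grad_loss(w, X[l], T[l])
    return w
```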
Multi-layer neural networks
Connectionism and Parallel Distributed Processing
David Rumelhart, James McClelland, Geoffrey Hinton
What is connectionism?
• Connectionism is an approach to cognitive science that characterizes learning and memory through the discrete interactions between nodes of neural networks
• Representation of concepts and rules is not concentrated in symbols with a lot of meaning, but in sub-symbolic "neural encodings" (neuron activations) which have a meaning only if taken collectively, as patterns
• Neural networks are distributed and massively parallel
• They rely on spontaneously-generated internal representations
Network topologies
Most general: feedback.
Units may be visible or hidden.
Network topologies
A special type of feedback is lateral connections
Network topologies
Less general: a topology where cycles are forbidden: feedforward.
Visible units may be input or output.
Network topologies
Least general: multi-layer
Why multi-layer?
Linear separability
Feature discovery
Hierarchies of abstractions
Example: Parity
Problem: Given any input string of d bits, tell whether the number of bits set (= 1) is even.
Generalizes XOR: it is not linearly separable
Example: Parity
The solution requires d hidden units
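One such construction, sketched below (my own illustration of the claim, not the slide's figure): hidden threshold unit j fires when at least j input bits are set, and alternating output weights pick out the odd counts.

```python
import numpy as np

def parity_net(x):
    """d-bit parity with d hidden threshold units.

    Hidden unit j (j = 1..d) fires iff at least j input bits are set;
    output weights alternate +1/-1 so only odd counts activate the output.
    Returns 1 if the number of set bits is even, as in the slide.
    """
    d = len(x)
    s = np.sum(x)
    h = np.array([1.0 if s >= j else 0.0 for j in range(1, d + 1)])
    out_w = np.array([(-1.0) ** (j + 1) for j in range(1, d + 1)])
    odd = 1 if out_w @ h - 0.5 > 0 else 0   # 1 iff the count is odd
    return 1 - odd                          # even parity

for x in [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]:
    print(x, parity_net(np.array(x)))       # 1, 0, 1, 0
```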
Universal approximation theorem
G. Cybenko 1989
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^d$
How do we train a multi-layer neural network?
1 With a suitable algorithm
2 With a sequence of independent trainings
• As we have seen, learning (e.g., learning to recognize) can be cast as the problem of optimizing a suitable cost function (risk)
• But most optimization methods rely on the necessary minimum condition ∇E = 0 or on the direction of the gradient ∇E
→ requirement: E must be at least differentiable (even better if also convex, but that's not always possible)
• Even if E is differentiable, for hidden units we cannot compute an error term like $(t - a)^2$ (MSE)
→ requirement: we need a way to do this
A differentiable activation function
• Let's write the discriminant function for a problem with two Gaussian, spherical, equal-variance classes.
• Translation of the origin, rotation of axes...
• 1-dimensional symmetrical problem in x with only two parameters:
$$p(x|\omega_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad p(x|\omega_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]$$
By Bayes' theorem:
$$P(\omega_1|x) = \frac{p(x|\omega_1)\,P(\omega_1)}{p(x|\omega_1)\,P(\omega_1) + p(x|\omega_2)\,P(\omega_2)} \qquad P(\omega_2|x) = \frac{p(x|\omega_2)\,P(\omega_2)}{p(x|\omega_1)\,P(\omega_1) + p(x|\omega_2)\,P(\omega_2)}$$
2-class discriminant function:
$$g(x) = P(\omega_1|x) - P(\omega_2|x) = \frac{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] - \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}{\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] + \exp\left[-\frac{(x+\mu)^2}{2\sigma^2}\right]}$$
removing the common factors $1/\sqrt{2\pi}\sigma$ (and the equal priors).
$$g(x) = \frac{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[\frac{x\mu}{\sigma^2}\right] - \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[-\frac{x\mu}{\sigma^2}\right]}{\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[\frac{x\mu}{\sigma^2}\right] + \exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]\exp\left[-\frac{x\mu}{\sigma^2}\right]}$$

The common positive factor $\exp\left[-\frac{x^2+\mu^2}{2\sigma^2}\right]$ cancels out:

$$g(x) = \frac{e^{x\mu/\sigma^2} - e^{-x\mu/\sigma^2}}{e^{x\mu/\sigma^2} + e^{-x\mu/\sigma^2}}$$
• We replace x with the score r = x · w′
• We can absorb the factor µ/σ² into the norm of w′:
$$w = \frac{\mu}{\sigma^2}\, w'$$
• We obtain
$$g(r) = \frac{e^r - e^{-r}}{e^r + e^{-r}}, \qquad r = x \cdot w$$
g(r) is the hyperbolic tangent activation, tanh(r)
• logistic or sigmoid activation:
$$\sigma(r) = \frac{1}{1 + e^{-r}} = \frac{\tanh(r/2) + 1}{2}$$
[Figure: plots of the sigmoid and tanh activations, a versus r for r ∈ [−10, 10].]
[Figure: plots of the Heaviside and sign activations, a versus r for r ∈ [−10, 10].]
• The sigmoid is the solution of the logistic equation
$$y' = y\,(1 - y)$$
• Therefore, by definition,
$$\frac{\partial \sigma(r)}{\partial r} = \sigma(r)\,(1 - \sigma(r))$$
• Also,
$$\frac{\partial \tanh(r)}{\partial r} = 1 - \tanh^2(r)$$
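These identities are easy to check numerically; a small illustrative sketch:

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

r = np.linspace(-5.0, 5.0, 11)

# Analytic derivatives from the identities above.
dsig = sigmoid(r) * (1.0 - sigmoid(r))
dtanh = 1.0 - np.tanh(r) ** 2

# Compare against centered finite differences.
h = 1e-6
print(np.allclose(dsig, (sigmoid(r + h) - sigmoid(r - h)) / (2 * h)))   # True
print(np.allclose(dtanh, (np.tanh(r + h) - np.tanh(r - h)) / (2 * h)))  # True
```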
The error back-propagation algorithm
• Discovered by Amari/Werbos/Parker/Rumelhart/Hinton/Williams from 1974 to 1986
• The name appears in Rosenblatt's book "Principles of Neurodynamics" in 1962
• A clever application of the chain rule of differential calculus
• We can perform gradient descent in a distributed way and without actually computing derivatives
• The responsibility for errors is back-propagated from the outputs back inside the network, and distributed among the hidden layers.
The chain rule
$$\frac{df(g(x))}{dx} = \left.\frac{df(y)}{dy}\right|_{y=g(x)} \frac{dg(x)}{dx}$$

Where is the "chain"?

$$\frac{df(g(h(x)))}{dx} = \frac{df(g)}{dg}\,\frac{dg(h)}{dh}\,\frac{dh(x)}{dx}$$

which, for instance, can be used to prove that

$$\frac{\partial \sigma(r)}{\partial w_i} = \frac{d\sigma(r)}{dr}\,\frac{\partial r}{\partial w_i} = \sigma'(r)\, x_i = \sigma(r)\,(1 - \sigma(r))\, x_i \tag{1}$$
Notation:
- $n_p$: number of patterns in the training set
- $n_i$: number of input units
- $n_h$: number of hidden units
- $n_o$: number of output units
- $n_w$: total number of weights, $n_w = (n_i + 1)\,n_h + (n_h + 1)\,n_o$
- $i$: index for input components
- $j$: index for hidden units
- $k$: index for output units
- $x_i$: $i$-th component of the input pattern
- $r_j$: net stimulus of the $j$-th hidden unit
- $r_k$: net stimulus of the $k$-th output unit
- $sh_j$: $j$-th hidden unit activation value
- $so_k$: $k$-th component of the output
- $tg_k$: $k$-th component of the target
- $w^{hi}_{ji}$: weight to the $j$-th hidden unit from the $i$-th input unit [$(n_i + 1) \times n_h$ values]
- $w^{oh}_{kj}$: weight to the $k$-th output unit from the $j$-th hidden unit [$(n_h + 1) \times n_o$ values]
Loss function: $\lambda(so_k, tg_k) = (tg_k - so_k)^2$
1 in general there may be several output units;
2 the overall cost function is not quadratic (a paraboloid) because the network is non-linear
Non-convex cost function
Expected cost

$$E = \int \frac{1}{2}\,\frac{1}{n_o} \sum_{k=1}^{n_o} (so_k(x) - tg_k(x))^2\, p(x)\, dx \tag{2}$$

E is known only through its estimate on the training set (here by epoch):

$$\hat{E} = \frac{1}{n_p} \sum_{l=1}^{n_p} \frac{1}{2}\,\frac{1}{n_o} \sum_{k=1}^{n_o} (so_k(x_l) - tg_k(x_l))^2 \tag{3}$$
Summation and differentiation are both linear and can therefore be exchanged freely.

$$E = \frac{1}{2}\,\frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k)^2 \tag{4}$$

We only consider one pattern.

• For training online (= by pattern), we will apply the ∆w immediately, as we did with the perceptron and Adaline
• For training by epoch, we will sum several ∆w and apply them only at the end of each pass (a training epoch).
• For training by batch, we will sum several ∆w and apply them only after some percentage of a complete pass.
The operation of the multilayer perceptron is divided into two steps:
• activation forward-propagation
• error back-propagation.
Forward propagation
$$\forall j: \quad r_j = \sum_{i=0}^{n_i} w^{hi}_{ji}\, x_i \;\Rightarrow\; sh_j = \sigma(r_j) \tag{5}$$

$$\forall k: \quad r_k = \sum_{j=0}^{n_h} w^{oh}_{kj}\, sh_j \;\Rightarrow\; so_k = \sigma(r_k) \tag{6}$$
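A minimal numpy sketch of Eqs. (5)-(6), assuming the bias is implemented as a constant input $x_0 = 1$ (so column 0 of each weight matrix holds the biases):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def forward(x, w_hi, w_oh):
    """Forward pass for one pattern, following Eqs. (5)-(6).

    x:    input pattern, shape (ni,)
    w_hi: input-to-hidden weights, shape (nh, ni + 1)
    w_oh: hidden-to-output weights, shape (no, nh + 1)
    """
    x1 = np.concatenate(([1.0], x))    # x_0 = 1 implements the bias
    sh = sigmoid(w_hi @ x1)            # Eq. (5): hidden activations
    sh1 = np.concatenate(([1.0], sh))  # bias unit for the output layer
    so = sigmoid(w_oh @ sh1)           # Eq. (6): output activations
    return sh, so
```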
Error back-propagation
Error back-propagation and update
We start from the computation of partial derivatives, i.e., the gradient of the error.
w is generically any of the weights of the network.
We need all the components of the gradient ∇E. These are $\frac{\partial E}{\partial w}$ for all possible w.
$$\frac{\partial E}{\partial w} = \frac{1}{2}\,\frac{1}{n_o} \sum_{k=1}^{n_o} \frac{\partial (so_k - tg_k)^2}{\partial w} = \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k)\,\frac{\partial so_k}{\partial w} \tag{7}$$

Depending on whether w is a $w^{oh}$ or a $w^{hi}$ we will have different expansions of the above expression.
Hidden-to-output weights $w^{oh}_{kj}$

$$\frac{\partial E}{\partial w^{oh}_{kj}} = \frac{1}{n_o} \sum_{k'=1}^{n_o} (so_{k'} - tg_{k'})\,\frac{\partial so_{k'}}{\partial r_{k'}}\,\frac{\partial r_{k'}}{\partial w^{oh}_{kj}} \tag{8}$$

We can drop all terms not depending on k, those with k′ ≠ k:

$$\frac{\partial E}{\partial w^{oh}_{kj}} = \frac{1}{n_o}\,(so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\,\frac{\partial r_k}{\partial w^{oh}_{kj}} \tag{9}$$

We plug in quantities known from the forward pass:

$$\frac{\partial E}{\partial w^{oh}_{kj}} = \frac{1}{n_o}\,(so_k - tg_k)\,\sigma'(r_k)\, sh_j \tag{10}$$
If we define

$$\delta_k = (so_k - tg_k)\,\sigma'(r_k) \tag{11}$$

we have a generalization of the "delta" term which we have seen in the delta rule by Widrow and Hoff.

Generalized delta rule for the hidden-to-output weights:

$$\Delta w^{oh}_{kj} = -\eta\,\delta_k\, sh_j \tag{12}$$
Problem with the input-to-hidden weights: not all terms are readily available.
We use the chain rule again to find another formulation for $\partial E / \partial w^{hi}_{ji}$.
$$\frac{\partial E}{\partial w^{hi}_{ji}} = \frac{1}{2}\,\frac{1}{n_o} \sum_{k=1}^{n_o} \frac{\partial (so_k - tg_k)^2}{\partial w^{hi}_{ji}} \tag{13}$$

$$= \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\,\frac{\partial r_k}{\partial sh_j}\,\frac{\partial sh_j}{\partial w^{hi}_{ji}} \tag{14}$$
Now the quantities appearing in the last equation are available, again from either the forward pass or theory:

• $(so_k - tg_k)\,\dfrac{\partial so_k}{\partial r_k} = \delta_k$
• $\dfrac{\partial r_k}{\partial sh_j} = w^{oh}_{kj}$
• $\dfrac{\partial sh_j}{\partial w^{hi}_{ji}} = \dfrac{\partial sh_j}{\partial r_j}\,\dfrac{\partial r_j}{\partial w^{hi}_{ji}} = \sigma'(r_j)\, x_i$
$$\frac{\partial E}{\partial w^{hi}_{ji}} = \frac{1}{n_o} \sum_{k=1}^{n_o} (so_k - tg_k)\,\frac{\partial so_k}{\partial r_k}\,\frac{\partial r_k}{\partial sh_j}\,\frac{\partial sh_j}{\partial w^{hi}_{ji}} \tag{15}$$

$$= \frac{1}{n_o} \sum_{k=1}^{n_o} \left[ \delta_k\, w^{oh}_{kj} \right] \left[ \sigma'(r_j)\, x_i \right] \tag{16}$$

Note that the summation here does not disappear
We can further manipulate the expression, by first isolating the terms which do not contribute to the summation:

$$= \left[ \frac{1}{n_o} \sum_{k=1}^{n_o} \delta_k\, w^{oh}_{kj} \right] \left[ \sigma'(r_j)\, x_i \right] \tag{17}$$

and then identifying the generalized delta for the input-to-hidden weights:

$$\delta_j = \sigma'(r_j) \left[ \frac{1}{n_o} \sum_{k=1}^{n_o} \delta_k\, w^{oh}_{kj} \right] \tag{18}$$
Generalized delta rule for the input-to-hidden weights:

$$\Delta w^{hi}_{ji} = -\eta\,\delta_j\, x_i \tag{19}$$

amazingly similar in form to that for the hidden-to-output weights
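Putting Eqs. (5)-(19) together, one by-pattern update might look like the following sketch (my illustration; biases are handled as constant inputs, and the $1/n_o$ factor is kept only where the slides keep it):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def backprop_step(x, tg, w_hi, w_oh, eta):
    """One by-pattern generalized-delta update (Eqs. (5)-(6), (11)-(19)).

    Shapes: x (ni,), tg (no,), w_hi (nh, ni+1), w_oh (no, nh+1).
    """
    # Forward propagation, Eqs. (5)-(6)
    x1 = np.concatenate(([1.0], x))
    sh = sigmoid(w_hi @ x1)
    sh1 = np.concatenate(([1.0], sh))
    so = sigmoid(w_oh @ sh1)

    # Output deltas, Eq. (11); sigma'(r) = sigma(r)(1 - sigma(r))
    delta_k = (so - tg) * so * (1.0 - so)
    # Hidden deltas, Eq. (18); the bias column of w_oh does not feed back
    delta_j = sh * (1.0 - sh) * (w_oh[:, 1:].T @ delta_k) / len(so)

    # Generalized delta rules, Eqs. (12) and (19)
    w_oh = w_oh - eta * np.outer(delta_k, sh1)
    w_hi = w_hi - eta * np.outer(delta_j, x1)
    return w_hi, w_oh
```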
Important property of multi-layer networks
The layered network is the simplest possible connectivity that has the universal approximation property.
It should be large enough – or deep enough.
Generalization and overfitting
The number of weights needs to be high.
We must take care of controlling overfitting.
Overfitting
Is the situation where
• the empirical risk $\hat{R}$ is low
• but $|\hat{R} - R|$ is high

Symptom: while training we are happy, but then tests fail!
No generalization, due to too much specialization (learning the training set, not the classification rule)
Multi-layer perceptrons not a good model for the brain?
Some evidence that the brain uses sparse (localized) rather than dense (distributed) representations.
Probably both
Deep neural networks
David Hubel and Torsten Wiesel
Hubel and Wiesel placed electrodes in animals' brains (visual cortex). They discovered the columnar organization of neurons.
Each layer in a cortical column extracts features from the input it receives from the previous layer.
These features are more and more abstract:
Edges – Simple shapes – Composite shapes – Eyes, mouths, noses... Grandmother (the Grandmother Cell hypothesis)
Learning features in neural networks
Internal representation in hidden layers
Hierarchy requires many layers (deep networks).
Learning: Limits of multi-layer networks
Error back-propagation does not work well with very deep structures.
Vanishing gradient phenomenon: at each layer, the back-propagated components of the gradient become exponentially smaller.
To avoid the problem: use shallow networks (theoretically sufficient).
Example of a shallow architecture
Support vector machines
Representational advantage of depth
In the '80s and early '90s, some works proved that some logical functions that can be implemented with a depth of k layers require exponentially more units if reduced to k − 1 layers.
In the 2010s: dependent inputs (variables) need very deep networks.
How can we avoid training the whole network all at once?
Multi-level hierarchies of networks
Cascaded networks of unsupervised layers, trained one after the other, plus a final classification layer.
The whole structure is finally trained with error back-propagation.
The idea is not new: Neocognitron
K. Fukushima, 1987
Unsupervised learning principles
Information Bottleneck
Techniques using the "information bottleneck" principle
Using statistics and entropy
• Coding theory
• Stochastic complexity and minimum description length
Using errors
• Autoencoders
• PCA
• Rate-distortion theory
Autoencoders
An autoencoder is a special case of a multi-layer perceptron characterized by two aspects:
1 Structure: number of units in the input layer = number of units in the output layer > number of hidden units
2 Learned task: an autoencoder is trained to approximate the identity function (= replicate its input at the output)
An autoencoder is not a classifier
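A minimal sketch of such a structure, assuming linear activations and plain gradient descent on the reconstruction error (sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 patterns in R^8 that actually live on a 2-D subspace.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))

n_in, n_hid = 8, 2                    # input = output = 8 > hidden = 2
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))   # encoder
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))   # decoder

eta = 0.02
for epoch in range(2000):
    H = X @ W1                        # hidden code
    err = H @ W2 - X                  # target = input: "self-supervised"
    g2 = H.T @ err / len(X)           # gradients of the mean squared error
    g1 = X.T @ (err @ W2.T) / len(X)
    W1 -= eta * g1
    W2 -= eta * g2

print(np.mean((X @ W1 @ W2 - X) ** 2))  # small: a 2-unit code suffices here
```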
Autoencoders
What is interesting is not the output value (it is an approximation of the input) but the pattern present on the hidden layer.
Since we don't use any target (the target coincides with the input), the autoencoder task is unsupervised.
Sometimes termed "self-supervised".
Learned features from a set of images
Recognizing handwritten digits
Features for recognizing ’0’ from ’8’
Features for recognizing ’1’ from ’8’
An example of an autoencoder for learning features from symbolic data
Task: diagnose Lyme disease from patient records
Problem: many features (observed signs and symptoms) are binary and very sparse
An example of an autoencoder for learning features from symbolic data
Learning the features
An example of an autoencoder for learning features from symbolic data
Using the learned features
Principal component analysis
It is an instance of factor analysis: discover the few unobservable factors that give rise to observable (measurable) variables.
Example of a factor analysis problem: discover the abilities underlying performance in school tests.

Observed variables: marks in the algebra test, geometry test, literature test, foreign language test, music test, and essay.
Hidden factors: linguistic ability, spatial ability, symbolic processing ability.
Principal Component Analysis or PCA
is a linear solution to the factor analysis problem.
Linear: factors are linear combinations of patterns:
$$v = \lambda_1 x_1 + \lambda_2 x_2 + \ldots + \lambda_d x_d$$
PCA works on the covariance matrix of the data.

Covariance between input $x_i$ and input $x_j$:
$$\sigma_{i,j} = \sigma_{j,i} = E\{(x_i - \bar{x}_i)(x_j - \bar{x}_j)\}$$
where $E\{\cdot\}$ is the expectation (or the mean over the training set) and $\bar{x}_i$ is the mean of the i-th input.

$$\Sigma = \begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \ldots & \sigma_{1,d} \\ \sigma_{2,1} & \sigma_{2,2} & \ldots & \sigma_{2,d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d,1} & \sigma_{d,2} & \ldots & \sigma_{d,d} \end{bmatrix}$$
Note: if X is the training set as a matrix and all inputs have zero mean (i.e., we use $X - \bar{X}$), then
$$\Sigma = X^T X$$
X = X - repmat(mean(X), size(X,1), 1)
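The whole procedure in a short numpy sketch (illustrative data; np.linalg.eigh serves as the eigensolver for the symmetric matrix Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy training set

Xc = X - X.mean(axis=0)           # zero-mean the inputs
Sigma = Xc.T @ Xc / len(Xc)       # covariance matrix

# Eigenvectors of Sigma are the principal components; order them by
# eigenvalue, from largest to smallest.
eigval, eigvec = np.linalg.eigh(Sigma)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

A = Xc @ eigvec[:, :2]            # features: projection onto first 2 components
print(eigval)                     # variance captured by each component
```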
Principal components
The "factors" in PCA are called principal components and are given by the eigenvectors of Σ:
$$v_1, \ldots, v_d$$
If we project pattern $x = [x_1, x_2, \ldots, x_d]$ onto the component $v_i = [v_{i1}, v_{i2}, \ldots, v_{id}]$, we obtain the value of the i-th factor, or component, or feature, for pattern x:
$$a_i = x \cdot v_i = \sum_{j=1}^{d} x_j\, v_{ij}$$
OK, components; but why "principal"?
Property
1 Eigenvectors of Σ can be ordered by the corresponding eigenvalues, from largest to smallest
2 Eigenvectors are thus ordered by variance or energy or level of activity, from largest to smallest
3 Projection of the training set X onto the first r (principal) components gives the best rank-r approximation to X itself, when measured by mean square error

PCA is a form of lossy compression.
The principal components are features useful to represent the data in a synthetic way (information bottleneck).
It has been proved that an autoencoder with linear activations learns the principal components.
This is because the objective is the mean squared reconstruction error of a lower-rank representation, the same as in PCA.
Oja's neuron
A single-unit model with linear (identity) activation:
$$a = x \cdot w$$
Learning rule:
$$w \leftarrow w + \eta\, a\,(x - a\,w)$$
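A sketch of the rule on a toy 2-D stream (illustrative; the explicit eigensolver at the end is only there to check the result):

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean data stream with a well-defined principal direction.
C = np.array([[3.0, 1.0], [1.0, 1.0]])          # desired covariance
X = rng.normal(size=(20_000, 2)) @ np.linalg.cholesky(C).T

w = rng.normal(size=2)
eta = 0.001                      # small learning rate, as required
for x in X:                      # online: one pattern at a time
    a = x @ w                    # linear activation a = x . w
    w += eta * a * (x - a * w)   # Oja's rule

eigval, eigvec = np.linalg.eigh(np.cov(X.T))
print(w, eigvec[:, -1])          # same direction up to sign; ||w|| ~ 1
```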
It can be proven that, for small η, Oja's learning rule is a first-order Taylor approximation of the Rayleigh quotient iteration method for finding the principal eigenvector.
At convergence, w is the principal component of Σ.
Oja’s neuron is a neural principal component analyzer
Advantages over using explicit eigensolvers (e.g., the LAPACK eigensolver, or Matlab's eig function):
1 Distributed
2 Online (big data!)
Disadvantages:
1 Stochastic (convergence in probability)
2 Slower because of the requirement of small η
Restricted Boltzmann Machines
A generative model.
Invented by G. Hinton.
Started in the Eighties (Boltzmann machines), then developed in the following decades.
Boltzmann Machines:
• binary-valued units
• bi-directional connections
• symmetric weights (equal in the two directions)
• general topology (feedback possible)
The restricted version has the limitation that its topology must be a bipartite graph.
This makes it more tractable
Energy
• $v = [v_i]$ and $h = [h_j]$: visible and hidden unit activation values, respectively
• $w_{i,j}$: weight between $v_i$ and $h_j$
• $a_i$ and $b_j$: biases of visible and hidden units, respectively

Then we can define an "energy":

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i\, w_{i,j}\, h_j$$
Probability of states
The probability of any possible network state is
$$P(v, h) = \frac{1}{Z}\, e^{-E(v,h)}$$
with Z the partition function (normalizer).
Probability of states
Since intra-layer connections are not present, the probability of activation of one unit does not depend on that of other units in the same layer – only on the units in the other layer:

$$P(v_i = 1 \mid h) = \frac{1}{1 + e^{-(a_i + \sum_j w_{i,j} h_j)}} \qquad P(h_j = 1 \mid v) = \frac{1}{1 + e^{-(b_j + \sum_i w_{i,j} v_i)}}$$
Training an RBM
The algorithm is called contrastive divergence.
It uses random sampling from the probabilities (computed as above); one step is sketched below:
• Apply one input
• Compute the probability P(h|v); sample from it to generate a hidden configuration
• Compute a positive update step $\Delta w^+ = v h^T$ (outer product)
• Generate one possible input v′ from the hidden configuration
• Compute the probability P(h′|v′) and sample a hidden configuration h′
• Compute a negative update step $\Delta w^- = v' h'^T$
• Apply the update: $w \leftarrow w + \eta(\Delta w^+ - \Delta w^-)$

This does not optimize any explicit objective function!
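A sketch of one CD-1 update following these steps (illustrative; the biases a and b enter the conditional probabilities but are kept fixed, since the slide's update touches only w):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def cd1_step(v, W, a, b, eta):
    """One contrastive-divergence (CD-1) weight update for a binary RBM.

    v: binary visible pattern (nv,); W: weights (nv, nh);
    a, b: visible and hidden biases.
    """
    # Positive phase: compute P(h|v), sample a hidden configuration
    h = (rng.random(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
    dw_pos = np.outer(v, h)                  # positive step v h^T

    # Negative phase: generate a possible input v' from h, then h' from v'
    v_neg = (rng.random(W.shape[0]) < sigmoid(a + W @ h)).astype(float)
    h_neg = (rng.random(W.shape[1]) < sigmoid(b + v_neg @ W)).astype(float)
    dw_neg = np.outer(v_neg, h_neg)          # negative step v' h'^T

    return W + eta * (dw_pos - dw_neg)       # w <- w + eta (dw+ - dw-)
```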
Training RBMs of large size is not simple. There are tricks to make the task easier.
Example: weight sharing and convolutional neural networks. These help with data having correlated inputs, as in images, video, speech, and general time series.
Deep Belief Networks
• A DBN is a sequence of RBMs
• Each RBM can be trained independently of the following ones
• greedy strategy
• The last layer can be a classifier
Deep networks can be built out of RBMs, but also out of autoencoders
Autoencoders are more sensitive to random noise.
Neural networks: why bother?
Deep learning has achieved success in very complex tasks and won many competitions.
Example: extracting words from audio and turning them into automatic subtitles (cf. YouTube).
T H E E N D