TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural
Transcript of TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural
TDT4171 Artificial Intelligence MethodsLecture 10 – Neural Networks
Norwegian University of Science and Technology
Helge LangsethIT-VEST 310
1 TDT4171 Artificial Intelligence Methods
Outline
1 Background
2 Threshold units
3 Gradient descent
4 Multilayer networks
5 Backpropagation
6 Hidden layer representations
7 Example: Detecting Viewing Direction
2 TDT4171 Artificial Intelligence Methods
Summary from last time
Learning needed for unknown environments, lazy designers
Learning agent = performance element + learningelement
Learning method depends on type of performance element,available feedback, type of component to be improved,and its representation
For supervised learning, the aim is to find a simplehypothesis that is approximately consistent with trainingexamples
Decision tree learning using information gain
Learning performance = prediction accuracy measuredon test set
3 TDT4171 Artificial Intelligence Methods
Background
Brains
1011 neurons of > 20 types, 1014 synapsesSignals are noisy “spike trains” of electrical potential
Axon
Cell body or Soma
Nucleus
Dendrite
Synapses
Axonal arborization
Axon from another cell
Synapse
4 TDT4171 Artificial Intelligence Methods
Background
Connectionist Models
Some facts about the human brain:
Number of neurons about 1011
Connections per neuron about 104 – 105
Scene recognition time about .1 second
Neuron switching time about .001 second
100 inference steps doesn’t seem like enough for scenerecognition → much parallel computation
Properties of artificial neural nets (ANN’s):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
5 TDT4171 Artificial Intelligence Methods
Background
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued (e.g. rawsensor input)
Output is discrete or real valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
Speech phoneme recognition
Image classification
Autonomous drivers (ALVINN)
Financial prediction
6 TDT4171 Artificial Intelligence Methods
Threshold units
Perceptron
Output is a “squashed” linear function of the inputs:
ai ← g(ini) = g (ΣjWj,iaj)
Output
ΣInput Links
Activation Function
Input Function
Output Links
a0 = −1 ai = g(ini)
ai
giniWj,i
W0,i
Bias Weight
aj
A gross oversimplification of real neurons, but its purpose is todevelop understanding of what networks of simple units can do.
7 TDT4171 Artificial Intelligence Methods
Threshold units
Activation functions
(a) (b)
+1 +1
iniini
g(ini)g(ini)
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e−x)
(c) We will also consider linear function
Note! Changing the bias weight W0,i moves the threshold location
8 TDT4171 Artificial Intelligence Methods
Threshold units
Perceptron activated by step-function
Start by looking at single perceptron with step-functionactivation:
1
20
1
2
0=-1
.
.
.Σ
AAAAAAAAA
Σ =0 1 f > 0
0 other se{o =Σ =0
xx
x
x
x
x
w
w
w
w
w
w
w
i ii
i
ii
iin
n
n
n
output(x1, . . . , xn) =
{
1 if w0x0 + w1x1 + · · · +wnxn > 00 otherwise.
Alternative vector notation:
output(~x) =
{
1 if ~wT~x > 00 otherwise.
9 TDT4171 Artificial Intelligence Methods
Threshold units
Decision Surface of a Perceptron
A perceptron activated by a step-function can represent someuseful binary functions.
Discuss with your neighbour:What weights (if any) represent g(x1, x2) = AND(x1, x2)?
g(x1, x2) = OR(x1, x2)?g(x1, x2) = XOR(x1, x2)?
x1
x2+
+
--
+
-
x1
x2
(a) (b)
-
+ -
+
10 TDT4171 Artificial Intelligence Methods
Threshold units
Implementing logical functions
AND
0 = 1.5
1 = 1
2 =1
OR
2 = 1
1 = 1
0 = 0.5
NOT
1 = -1
0 = -0.5 w
w
w
w
ww
w
w
Some functions not representable by perceptrons, namelythose that are not linearly separable (e.g., XOR).
Can make it work by “inventing” new inputs to make itlinearly separable (e.g., x3 ← |x1 − x2| for XOR).
Equivalentely, if we have networks networks of perceptrons,any Boolean function can be implemented
11 TDT4171 Artificial Intelligence Methods
Threshold units
Perceptron training rule (The “Delta-rule”)
First we look at learning weights of single perceptrons:From data 〈x, t〉 (t is “target”), we want to tune ~w automatically.
Let us consider simpler linear unit, where input is ~x gives outputo = ~wT~x. The correct (observed) value is t.
Desiderata:
Each weight wi should be tuned separately.If o = t there is no need to change any of the weights.If o < t and xi > 0, we should increase wi
If o < t and xi < 0, we should decrease wi
If o > t and xi > 0, we should decrease wi
If o > t and xi < 0, we should increase wi
The more we are off, the larger the change to the weights.
12 TDT4171 Artificial Intelligence Methods
Threshold units
Perceptron training rule (The “Delta-rule”)
First we look at learning weights of single perceptrons:From data 〈x, t〉 (t is “target”), we want to tune ~w automatically.
Let us consider simpler linear unit, where input is ~x gives outputo = ~wT~x. The correct (observed) value is t.
Learning rule:
wi ← wi +∆wi, where ∆wi = η(t− o)xi
Where:t is target value and o is perceptron output (as before)
η is small constant called the learning rate.
12 TDT4171 Artificial Intelligence Methods
Threshold units
Perceptron training rule (The “Delta-rule”)
First we look at learning weights of single perceptrons:From data 〈x, t〉 (t is “target”), we want to tune ~w automatically.
Let us consider simpler linear unit, where input is ~x gives outputo = ~wT~x. The correct (observed) value is t.
Learning rule:
wi ← wi +∆wi, where ∆wi = η(t− o)xi
Where:t is target value and o is perceptron output (as before)
η is small constant called the learning rate.
Convergence:
Converges if training data is linearly separable and ηsufficiently small.
Result unclear when NOT linearly separable.12 TDT4171 Artificial Intelligence Methods
Gradient descent
Finding the optimal weights
We still consider linear unit, where input is ~x gives output
o = w0x0 +w1x1 + · · · + wnxn = ~wT~x
We learn wi’s that minimise the squared error
E[~w] ≡1
2
∑
d∈D
(td − od)2,
where D is set of training examples, each of the formd = 〈xd, td〉.
13 TDT4171 Artificial Intelligence Methods
Gradient descent
Finding optimal weights – requirements
Description of our situation:
We have a function E[~w] we want to minimise (wrt ~w).
Why not just try a number of weight configurations ~wi,calculate E[~wi] and see what happens?
Demo: errorPlot.m
14 TDT4171 Artificial Intelligence Methods
Gradient descent
Finding optimal weights – requirements
Description of our situation:
We have a function E[~w] we want to minimise (wrt ~w).
Why not just try a number of weight configurations ~wi,calculate E[~wi] and see what happens?
There are infinitely many (even uncountably many) weightconfigurations.
Mineralisation is typically in very high dimensional space.
Evaluating E[~w] involves summing over all training examples –can be very expensive.
Thus, we cannot use a standard (“silly”) Trial & Error approachhere, but must devise a local search method.
14 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Motivation
Question: How can we minimise this error function withoutknowing the whole shape?Obvious answer: Walk “downhills”!
15 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – The setup
We want to find the value x which minimizes f(x).
To avoid evaluating the whole function we use an iterativeapproach:
Guess a value for xCalculate the derivative at x.Make a new guess for x based on the calculated information. . . and keep going.
Intuition:
If the derivative is large (in absolute value) we are far from theminimizing pointIf it is small we are close. . . and if it is zero we are done
Solution:Use the update rule xi+1 ← xi−η ·f ′(xi). η > 0 is the learning rate.
16 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Example
Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.
0 0.5 1 1.5 2 2.5 3−10
−5
0
5
10
15
20
25
30
35
The f(x) has a minimum at x = 1.8261. Let’s try to find it. . .
17 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Example
Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.
0 0.5 1 1.5 2 2.5 3−250
−200
−150
−100
−50
0
50
Starting from x0 = 3 and finding f ′(3) = 87:
x1 = x0 − ηf ′(x0)
= 3− 0.025 · 87 = 0.8250
17 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Example
Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.
0 0.5 1 1.5 2 2.5 3−20
−10
0
10
20
30
40
Going from x1 = 0.8250 with f ′(0.8250) = −8.4172:
x2 = x1 − ηf ′(x1)
= 0.825 − 0.025 · (−8.4172) = 1.0354
17 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Example
Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.
0 0.5 1 1.5 2 2.5 3−30
−20
−10
0
10
20
30
40
Going from x2 = 1.0354 with f ′(1.0354) = −9.0592:
x3 = 1.0354 − 0.025 · (−9.0592) = 1.2619
17 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Example
Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.
0 0.5 1 1.5 2 2.5 3−10
−5
0
5
10
15
20
25
30
35
. . . and finally going from x10 = 1.8260 with f ′(1.8260) = −0.0034:
x11 = 1.8260 − 0.025 · (−0.0034) = 1.8261
. . . and we are done.17 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – in higher dimensions
Recall thatThe gradient of a surface E[~w] is a vector in the direction thecurve grows the most (calculated at ~w).
The gradient is calculated as ∇E[~w] ≡[
∂E∂w0
, ∂E∂w1
, · · · ∂E∂wn
]
.
Training rule: ∆~w = −η · ∇E[~w], i.e., ∆wi = −η ·∂E∂wi
.
18 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – Details. . .
∂E
∂wi=
∂
∂wi
1
2
∑
d
(td − od)2
=1
2
∑
d
∂
∂wi(td − od)
2
=1
2
∑
d
2(td − od)∂
∂wi(td − od)
=∑
d
(td − od)∂
∂wi
(td − ~wT ~xd)
∂E
∂wi
=∑
d
(td − od)(−xi,d)
19 TDT4171 Artificial Intelligence Methods
Gradient descent
Gradient Descent – The algorithm
Gradient-Descent(D, η)
Each training example is a pair of the form 〈~x, t〉, where~x is the vector of input values, and t is the target outputvalue. η is the learning rate (e.g., .05).
1 Initialize each wi to some small random value2 Until the termination condition is met:
1 Initialize: ∆wi ← 0.2 For each 〈~x, t〉 in D:
Input the instance ~x to the unit and compute the output o
For each linear unit weight wi: ∆wi ← ∆wi + η(t− o)xi
3 For each linear unit weight wi: wi ← wi +∆wi
Same algorithm skeleton works for other activationfunctions, as long as we can calculate ∂E
∂wi.
20 TDT4171 Artificial Intelligence Methods
Gradient descent
Demo
errrorPlot.m — Second part
21 TDT4171 Artificial Intelligence Methods
Gradient descent
Perceptron learning – Does it work?
Perceptron learning rule converges to a consistent function for anylinearly separable data set
0.4
0.5
0.6
0.7
0.8
0.9
1
0 10 20 30 40 50 60 70 80 90 100
Prop
ortio
n co
rrec
t on
test
set
Training set size
PerceptronDecision tree
0.4
0.5
0.6
0.7
0.8
0.9
1
0 10 20 30 40 50 60 70 80 90 100Pr
opor
tion
corr
ect o
n te
st s
et
Training set size
Decision treePerceptron
majority_11: Perceptron OK, DTL hopelessrestaurant: DTL OK, perceptron cannot represent it
22 TDT4171 Artificial Intelligence Methods
Gradient descent
Summary - Perceptron
Perceptron delta-rule guaranteed to succeed if
Training examples are linearly separable
Sufficiently small learning rate η
Linear unit training rule uses gradient descent
Guaranteed to converge to w with minimum squared error
Given sufficiently small learning rate η
This is true even when training data contains noise, or thedata is not linearly separable.
23 TDT4171 Artificial Intelligence Methods
Multilayer networks
Multilayer Networks of Sigmoid Units
24 TDT4171 Artificial Intelligence Methods
Multilayer networks
Multilayer perceptrons
Layers are usually fully connected; numbers of hidden unitstypically chosen by hand
Input units
Hidden units
Output units i
j,i
j
k,j
kx
x
x
w
w
25 TDT4171 Artificial Intelligence Methods
Multilayer networks
Sigmoid Unit
w1
w2
wn
w0
x1
x2
xn
x0 = -1
AAAAAAAAAAAA
.
.
.Σ
i n = Σ wi xii=0
n1
1 + e-i n o = σ(i n ) =
σ(·) is the sigmoid function, σ(y) = 11+e−y .
Nice property: dσ(y)dy
= σ(y)(1 − σ(y))
We can derive gradient decent rules to train
One sigmoid unit → We (should) know how to do that.
Multilayer networks of sigmoid units → Backpropagation
26 TDT4171 Artificial Intelligence Methods
Multilayer networks
Sigmoids - expressibility
No hidden layer, input (x1, x2):
-4 -2 0 2 4x1-4
-20
24
x2
00.10.20.30.40.50.60.70.80.9
1Perceptron output
Adjusting weights moves the location, orientation, andsteepness of cliff
Cannot tackle “correlation effects” of non-separable targets
27 TDT4171 Artificial Intelligence Methods
Multilayer networks
Sigmoids - expressibility
Hidden layers:
-4 -2 0 2 4x1-4
-20
24
x2
00.10.20.30.40.50.60.70.80.9
hW(x1, x2)
-4 -2 0 2 4x1-4
-20
24
x2
00.10.20.30.40.50.60.70.80.9
1
hW(x1, x2)
Combine two opposite-facing threshold functions to make aridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
All continuous functions w/ 1 hidden layer, all functions w/ 227 TDT4171 Artificial Intelligence Methods
Multilayer networks
Error Gradient for a Sigmoid Unit
∂E
∂wi=
∂
∂wi
1
2
∑
d∈D
(td − od)2
=1
2
∑
d
2(td − od)∂
∂wi
(td − od)
= −∑
d
(td − od)∂od∂ind
∂ind∂wi
But we know:
∂od∂ind
=∂σ(ind)
∂ind= od(1− od) ,
∂ind∂wi
=∂(~wT~xd)
∂wi
= xi,d
So:∂E
∂wi= −
∑
d∈D
(td − od)od(1− od)xi,d
28 TDT4171 Artificial Intelligence Methods
Backpropagation
Backpropagation Algorithm
1 Initialize all weights to small random numbers.2 Until satisfied:
For each training example
1 Input the training example to the network and compute the
network outputs
2 For each output unit k: δk ← ok(1− ok)(tk − ok)3 For each hidden unit h:
δh ← oh(1− oh)∑
k∈outputs
wh,k · δk
4 Update each network weight wi,j : wi,j ← wi,j +∆wi,j , where
∆wi,j = η · δj · xi,j .
29 TDT4171 Artificial Intelligence Methods
Backpropagation
More on Backpropagation
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
In practice, often works well (can run multiple times)
Often include weight momentum α > 0
∆wi,j(n) = η · δj · xi,j + α ·∆wi,j(n− 1)
Minimizes error over training examples
Will it generalize well to subsequent examples? Overfitting. . .
Training can take thousands of iterations → slow!
Using network after training is very fast
30 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Learning Hidden Layer Representations
Inputs Outputs
A target function:
Input Output10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Can this be learned??
31 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Learning Hidden Layer Representations
Learned hidden layer representation:
Input Hidden OutputValues
10000000 → .89 .04 .08 → 10000000
01000000 → .15 .99 .99 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .01 .11 .88 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Do you recognize the encoding?
32 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Learning Hidden Layer Representations
Learned hidden layer representation:
Input Hidden OutputValues
10000000 → .89 .04 .08 → 10000000
01000000 → .15 .99 .99 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .01 .11 .88 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Do you recognize the encoding?
32 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Training
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0 500 1000 1500 2000 2500
Sum of squared errors for each output unit
33 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Training
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 500 1000 1500 2000 2500
Hidden unit encoding for input 01000000
33 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Representative power of 8 - 1 - 8 sigmoid network
Discussion time:Representative power of 8 - 1 - 8 sigmoid network:
Consider the same input → output mapping as last example(100000000 → 100000000, etc.)
Can this be represented using the 8 - 1 - 8 structure?
Hidden node h can, e.g., take value∑
8
i=12i−1 · xi
Can h = f1(x) be approximated using sigmoid functions?Can t = f2(h), be approximated using sigmoid functions?If so: t = f2(f1(x)), and the answer is YES!
34 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Convergence of Backpropagation
Gradient descent to some local minimum:
Perhaps not global minimum . . .
Add momentum
Train multiple nets with different initial weights
Nature of convergence:
Initialize weights near zero→ Therefore, initial networks near-linear
Increasingly non-linear functions possible as training progresses
35 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Expressive Capabilities of ANNs
Boolean functions:
Every boolean function can be represented by network withsingle hidden layer
but might require exponential (in number of inputs) hiddenunits
Continuous functions:
Every bounded continuous function can be approximated witharbitrarily small error, by network with one hidden layer
Any function can be approximated to arbitrary accuracy by anetwork with two hidden layers
36 TDT4171 Artificial Intelligence Methods
Hidden layer representations
Overfitting in ANNs
0.002
0.003
0.004
0.005
0.006
0.007
0.008
0.009
0.01
0 5000 10000 15000 20000
Err
or
Number of weight updates
Error versus weight updates (example 1)
Training set errorValidation set error
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0 1000 2000 3000 4000 5000 6000
Err
or
Number of weight updates
Error versus weight updates (example 2)
Training set errorValidation set error
37 TDT4171 Artificial Intelligence Methods
Example: Detecting Viewing Direction
Neural Nets for Recognition of Viewing Direction
left straight right up
Typical input images (30 × 32 pixels)
90% accurate learning head pose, and recognizing 1-of-20 faces38 TDT4171 Artificial Intelligence Methods
Example: Detecting Viewing Direction
Learned Hidden Unit Weights
left straight right up
Learned Weights
Mostly “Up” ”Right”; not “Left” Mostly “Straight”
Typical input images
39 TDT4171 Artificial Intelligence Methods
Example: Detecting Viewing Direction
Summary
Most brains have lots of neurons; each neuron ≈linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trainedby gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, frauddetection, etc.
Engineering, cognitive modelling, and neural system modellingsubfields have largely diverged
40 TDT4171 Artificial Intelligence Methods