TDT4171 Artificial Intelligence Methods - Lecture 10 â€“ Neural

TDT4171 Artificial Intelligence MethodsLecture 10 – Neural Networks

Norwegian University of Science and Technology

Helge LangsethIT-VEST 310

[email protected]

1 TDT4171 Artificial Intelligence Methods

Outline

1 Background

2 Threshold units

3 Gradient descent

4 Multilayer networks

5 Backpropagation

6 Hidden layer representations

7 Example: Detecting Viewing Direction


Summary from last time

Learning needed for unknown environments, lazy designers

Learning agent = performance element + learningelement

Learning method depends on type of performance element,available feedback, type of component to be improved,and its representation

For supervised learning, the aim is to find a simplehypothesis that is approximately consistent with trainingexamples

Decision tree learning using information gain

Learning performance = prediction accuracy measuredon test set


Background

Brains

1011 neurons of > 20 types, 1014 synapsesSignals are noisy “spike trains” of electrical potential

Axon

Cell body or Soma

Nucleus

Dendrite

Synapses

Axonal arborization

Axon from another cell

Synapse


Background

Connectionist Models

Some facts about the human brain:

Number of neurons about 1011

Connections per neuron about 104 – 105

Scene recognition time about .1 second

Neuron switching time about .001 second

100 inference steps doesn’t seem like enough for scenerecognition → much parallel computation

Properties of artificial neural nets (ANN’s):

Many neuron-like threshold switching units

Many weighted interconnections among units

Highly parallel, distributed process

Emphasis on tuning weights automatically


Background

When to Consider Neural Networks

Input is high-dimensional discrete or real-valued (e.g. rawsensor input)

Output is discrete or real valued

Output is a vector of values

Possibly noisy data

Form of target function is unknown

Human readability of result is unimportant

Examples:

Speech phoneme recognition

Image classification

Autonomous drivers (ALVINN)

Financial prediction


Threshold units

Perceptron

Output is a “squashed” linear function of the inputs:

ai ← g(ini) = g (ΣjWj,iaj)

Output

ΣInput Links

Activation Function

Input Function

Output Links

a0 = −1 ai = g(ini)

ai

giniWj,i

W0,i

Bias Weight

aj

A gross oversimplification of real neurons, but its purpose is todevelop understanding of what networks of simple units can do.


Threshold units

Activation functions

(a) (b)

+1 +1

iniini

g(ini)g(ini)

(a) is a step function or threshold function

(b) is a sigmoid function 1/(1 + e−x)

(c) We will also consider linear function

Note! Changing the bias weight W0,i moves the threshold location


Threshold units

Perceptron activated by step-function

Start by looking at single perceptron with step-functionactivation:

1

20

1

2

0=-1

.

.

.Σ

AAAAAAAAA

Σ =0 1 f > 0

0 other se{o =Σ =0

xx

x

x

x

x

w

w

w

w

w

w

w

i ii

i

ii

iin

n

n

n

output(x1, . . . , xn) =

{

1 if w0x0 + w1x1 + · · · +wnxn > 00 otherwise.

Alternative vector notation:

output(~x) =

{

1 if ~wT~x > 00 otherwise.


Threshold units

Decision Surface of a Perceptron

A perceptron activated by a step-function can represent someuseful binary functions.

Discuss with your neighbour:What weights (if any) represent g(x1, x2) = AND(x1, x2)?

g(x1, x2) = OR(x1, x2)?g(x1, x2) = XOR(x1, x2)?

x1

x2+

+

--

+

-

x1

x2

(a) (b)

-

+ -

+


Threshold units

Implementing logical functions

AND

0 = 1.5

1 = 1

2 =1

OR

2 = 1

1 = 1

0 = 0.5

NOT

1 = -1

0 = -0.5 w

w

w

w

ww

w

w

Some functions not representable by perceptrons, namelythose that are not linearly separable (e.g., XOR).

Can make it work by “inventing” new inputs to make itlinearly separable (e.g., x3 ← |x1 − x2| for XOR).

Equivalentely, if we have networks networks of perceptrons,any Boolean function can be implemented


Threshold units

Perceptron training rule (The “Delta-rule”)

First we look at learning weights of single perceptrons:From data 〈x, t〉 (t is “target”), we want to tune ~w automatically.

Let us consider simpler linear unit, where input is ~x gives outputo = ~wT~x. The correct (observed) value is t.

Desiderata:

Each weight wi should be tuned separately.If o = t there is no need to change any of the weights.If o < t and xi > 0, we should increase wi

If o < t and xi < 0, we should decrease wi

If o > t and xi > 0, we should decrease wi

If o > t and xi < 0, we should increase wi

The more we are off, the larger the change to the weights.


Threshold units




Learning rule:

wi ← wi +∆wi, where ∆wi = η(t− o)xi

Where:t is target value and o is perceptron output (as before)

η is small constant called the learning rate.


Threshold units




Learning rule:

wi ← wi +∆wi, where ∆wi = η(t− o)xi

Where:t is target value and o is perceptron output (as before)

η is small constant called the learning rate.

Convergence:

Converges if training data is linearly separable and ηsufficiently small.

Result unclear when NOT linearly separable.12 TDT4171 Artificial Intelligence Methods

Gradient descent

Finding the optimal weights

We still consider linear unit, where input is ~x gives output

o = w0x0 +w1x1 + · · · + wnxn = ~wT~x

We learn wi’s that minimise the squared error

E[~w] ≡1

2

∑

d∈D

(td − od)2,

where D is set of training examples, each of the formd = 〈xd, td〉.


Gradient descent

Finding optimal weights – requirements

Description of our situation:

We have a function E[~w] we want to minimise (wrt ~w).

Why not just try a number of weight configurations ~wi,calculate E[~wi] and see what happens?

Demo: errorPlot.m


Gradient descent

Finding optimal weights – requirements

Description of our situation:

We have a function E[~w] we want to minimise (wrt ~w).

Why not just try a number of weight configurations ~wi,calculate E[~wi] and see what happens?

There are infinitely many (even uncountably many) weightconfigurations.

Mineralisation is typically in very high dimensional space.

Evaluating E[~w] involves summing over all training examples –can be very expensive.

Thus, we cannot use a standard (“silly”) Trial & Error approachhere, but must devise a local search method.


Gradient descent

Gradient Descent – Motivation

Question: How can we minimise this error function withoutknowing the whole shape?Obvious answer: Walk “downhills”!


Gradient descent

Gradient Descent – The setup

We want to find the value x which minimizes f(x).

To avoid evaluating the whole function we use an iterativeapproach:

Guess a value for xCalculate the derivative at x.Make a new guess for x based on the calculated information. . . and keep going.

Intuition:

If the derivative is large (in absolute value) we are far from theminimizing pointIf it is small we are close. . . and if it is zero we are done

Solution:Use the update rule xi+1 ← xi−η ·f ′(xi). η > 0 is the learning rate.


Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x4 − 5x3 + 2x2 − 6x+ 4 with η = 0.025.

0 0.5 1 1.5 2 2.5 3−10

−5

0

5

10

15

20

25

30

35

The f(x) has a minimum at x = 1.8261. Let’s try to find it. . .


Gradient descent



0 0.5 1 1.5 2 2.5 3−250

−200

−150

−100

−50

0

50

Starting from x0 = 3 and finding f ′(3) = 87:

x1 = x0 − ηf ′(x0)

= 3− 0.025 · 87 = 0.8250


Gradient descent



0 0.5 1 1.5 2 2.5 3−20

−10

0

10

20

30

40

Going from x1 = 0.8250 with f ′(0.8250) = −8.4172:

x2 = x1 − ηf ′(x1)

= 0.825 − 0.025 · (−8.4172) = 1.0354


Gradient descent



0 0.5 1 1.5 2 2.5 3−30

−20

−10

0

10

20

30

40

Going from x2 = 1.0354 with f ′(1.0354) = −9.0592:

x3 = 1.0354 − 0.025 · (−9.0592) = 1.2619


Gradient descent



0 0.5 1 1.5 2 2.5 3−10

−5

0

5

10

15

20

25

30

35

. . . and finally going from x10 = 1.8260 with f ′(1.8260) = −0.0034:

x11 = 1.8260 − 0.025 · (−0.0034) = 1.8261

. . . and we are done.17 TDT4171 Artificial Intelligence Methods

Gradient descent

Gradient Descent – in higher dimensions

Recall thatThe gradient of a surface E[~w] is a vector in the direction thecurve grows the most (calculated at ~w).

The gradient is calculated as ∇E[~w] ≡[

∂E∂w0

, ∂E∂w1

, · · · ∂E∂wn

]

.

Training rule: ∆~w = −η · ∇E[~w], i.e., ∆wi = −η ·∂E∂wi

.


Gradient descent

Gradient Descent – Details. . .

∂E

∂wi=

∂

∂wi

1

2

∑

d

(td − od)2

=1

2

∑

d

∂

∂wi(td − od)

2

=1

2

∑

d

2(td − od)∂

∂wi(td − od)

=∑

d

(td − od)∂

∂wi

(td − ~wT ~xd)

∂E

∂wi

=∑

d

(td − od)(−xi,d)


Gradient descent

Gradient Descent – The algorithm

Gradient-Descent(D, η)

Each training example is a pair of the form 〈~x, t〉, where~x is the vector of input values, and t is the target outputvalue. η is the learning rate (e.g., .05).

1 Initialize each wi to some small random value2 Until the termination condition is met:

1 Initialize: ∆wi ← 0.2 For each 〈~x, t〉 in D:

Input the instance ~x to the unit and compute the output o

For each linear unit weight wi: ∆wi ← ∆wi + η(t− o)xi

3 For each linear unit weight wi: wi ← wi +∆wi

Same algorithm skeleton works for other activationfunctions, as long as we can calculate ∂E

∂wi.


Gradient descent

Demo

errrorPlot.m — Second part


Gradient descent

Perceptron learning – Does it work?

Perceptron learning rule converges to a consistent function for anylinearly separable data set

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60 70 80 90 100

Prop

ortio

n co

rrec

t on

test

set

Training set size

PerceptronDecision tree

0.4

0.5

0.6

0.7

0.8

0.9

1

0 10 20 30 40 50 60 70 80 90 100Pr

opor

tion

corr

ect o

n te

st s

et

Training set size

Decision treePerceptron

majority_11: Perceptron OK, DTL hopelessrestaurant: DTL OK, perceptron cannot represent it


Gradient descent

Summary - Perceptron

Perceptron delta-rule guaranteed to succeed if

Training examples are linearly separable

Sufficiently small learning rate η

Linear unit training rule uses gradient descent

Guaranteed to converge to w with minimum squared error

Given sufficiently small learning rate η

This is true even when training data contains noise, or thedata is not linearly separable.


Multilayer networks

Multilayer Networks of Sigmoid Units


Multilayer networks

Multilayer perceptrons

Layers are usually fully connected; numbers of hidden unitstypically chosen by hand

Input units

Hidden units

Output units i

j,i

j

k,j

kx

x

x

w

w


Multilayer networks

Sigmoid Unit

w1

w2

wn

w0

x1

x2

xn

x0 = -1

AAAAAAAAAAAA

.

.

.Σ

i n = Σ wi xii=0

n1

1 + e-i n o = σ(i n ) =

σ(·) is the sigmoid function, σ(y) = 11+e−y .

Nice property: dσ(y)dy

= σ(y)(1 − σ(y))

We can derive gradient decent rules to train

One sigmoid unit → We (should) know how to do that.

Multilayer networks of sigmoid units → Backpropagation


Multilayer networks

Sigmoids - expressibility

No hidden layer, input (x1, x2):

-4 -2 0 2 4x1-4

-20

24

x2

00.10.20.30.40.50.60.70.80.9

1Perceptron output

Adjusting weights moves the location, orientation, andsteepness of cliff

Cannot tackle “correlation effects” of non-separable targets


Multilayer networks

Sigmoids - expressibility

Hidden layers:

-4 -2 0 2 4x1-4

-20

24

x2

00.10.20.30.40.50.60.70.80.9

hW(x1, x2)

-4 -2 0 2 4x1-4

-20

24

x2

00.10.20.30.40.50.60.70.80.9

1

hW(x1, x2)

Combine two opposite-facing threshold functions to make aridge

Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface

All continuous functions w/ 1 hidden layer, all functions w/ 227 TDT4171 Artificial Intelligence Methods

Multilayer networks

Error Gradient for a Sigmoid Unit

∂E

∂wi=

∂

∂wi

1

2

∑

d∈D

(td − od)2

=1

2

∑

d

2(td − od)∂

∂wi

(td − od)

= −∑

d

(td − od)∂od∂ind

∂ind∂wi

But we know:

∂od∂ind

=∂σ(ind)

∂ind= od(1− od) ,

∂ind∂wi

=∂(~wT~xd)

∂wi

= xi,d

So:∂E

∂wi= −

∑

d∈D

(td − od)od(1− od)xi,d


Backpropagation

Backpropagation Algorithm

1 Initialize all weights to small random numbers.2 Until satisfied:

For each training example

1 Input the training example to the network and compute the

network outputs

2 For each output unit k: δk ← ok(1− ok)(tk − ok)3 For each hidden unit h:

δh ← oh(1− oh)∑

k∈outputs

wh,k · δk

4 Update each network weight wi,j : wi,j ← wi,j +∆wi,j , where

∆wi,j = η · δj · xi,j .


Backpropagation

More on Backpropagation

Gradient descent over entire network weight vector

Easily generalized to arbitrary directed graphs

Will find a local, not necessarily global error minimum

In practice, often works well (can run multiple times)

Often include weight momentum α > 0

∆wi,j(n) = η · δj · xi,j + α ·∆wi,j(n− 1)

Minimizes error over training examples

Will it generalize well to subsequent examples? Overfitting. . .

Training can take thousands of iterations → slow!

Using network after training is very fast


Hidden layer representations

Learning Hidden Layer Representations

Inputs Outputs

A target function:

Input Output10000000 → 10000000

01000000 → 01000000

00100000 → 00100000

00010000 → 00010000

00001000 → 00001000

00000100 → 00000100

00000010 → 00000010

00000001 → 00000001

Can this be learned??



Learning Hidden Layer Representations

Learned hidden layer representation:

Input Hidden OutputValues

10000000 → .89 .04 .08 → 10000000

01000000 → .15 .99 .99 → 01000000

00100000 → .01 .97 .27 → 00100000

00010000 → .99 .97 .71 → 00010000

00001000 → .03 .05 .02 → 00001000

00000100 → .01 .11 .88 → 00000100

00000010 → .80 .01 .98 → 00000010

00000001 → .60 .94 .01 → 00000001

Do you recognize the encoding?



Training

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 500 1000 1500 2000 2500

Sum of squared errors for each output unit



Training

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 500 1000 1500 2000 2500

Hidden unit encoding for input 01000000



Representative power of 8 - 1 - 8 sigmoid network

Discussion time:Representative power of 8 - 1 - 8 sigmoid network:

Consider the same input → output mapping as last example(100000000 → 100000000, etc.)

Can this be represented using the 8 - 1 - 8 structure?

Hidden node h can, e.g., take value∑

8

i=12i−1 · xi

Can h = f1(x) be approximated using sigmoid functions?Can t = f2(h), be approximated using sigmoid functions?If so: t = f2(f1(x)), and the answer is YES!



Convergence of Backpropagation

Gradient descent to some local minimum:

Perhaps not global minimum . . .

Add momentum

Train multiple nets with different initial weights

Nature of convergence:

Initialize weights near zero→ Therefore, initial networks near-linear

Increasingly non-linear functions possible as training progresses



Expressive Capabilities of ANNs

Boolean functions:

Every boolean function can be represented by network withsingle hidden layer

but might require exponential (in number of inputs) hiddenunits

Continuous functions:

Every bounded continuous function can be approximated witharbitrarily small error, by network with one hidden layer

Any function can be approximated to arbitrary accuracy by anetwork with two hidden layers



Overfitting in ANNs

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0.01

0 5000 10000 15000 20000

Err

or

Number of weight updates

Error versus weight updates (example 1)

Training set errorValidation set error

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 1000 2000 3000 4000 5000 6000

Err

or

Number of weight updates

Error versus weight updates (example 2)

Training set errorValidation set error


Example: Detecting Viewing Direction

Neural Nets for Recognition of Viewing Direction

left straight right up

Typical input images (30 × 32 pixels)

90% accurate learning head pose, and recognizing 1-of-20 faces38 TDT4171 Artificial Intelligence Methods


Learned Hidden Unit Weights

left straight right up

Learned Weights

Mostly “Up” ”Right”; not “Left” Mostly “Straight”

Typical input images



Summary

Most brains have lots of neurons; each neuron ≈linear–threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trainedby gradient descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, frauddetection, etc.

Engineering, cognitive modelling, and neural system modellingsubfields have largely diverged


TDT4171 Artificial Intelligence Methods - Lecture 10 â€“ Neural

Documents

Transcript of TDT4171 Artificial Intelligence Methods - Lecture 10 â€“ Neural