
Page 1: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

TDT4171 Artificial Intelligence Methods
Lecture 10 – Neural Networks

Norwegian University of Science and Technology

Helge Langseth
IT-VEST 310

[email protected]


Page 2: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Outline

1 Background

2 Threshold units

3 Gradient descent

4 Multilayer networks

5 Backpropagation

6 Hidden layer representations

7 Example: Detecting Viewing Direction


Page 3: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Summary from last time

Learning needed for unknown environments, lazy designers

Learning agent = performance element + learning element

Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with training examples

Decision tree learning using information gain

Learning performance = prediction accuracy measured on test set


Page 4: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Background

Brains

10^11 neurons of > 20 types, 10^14 synapses
Signals are noisy “spike trains” of electrical potential

[Figure: schematic of a neuron, showing the cell body (soma) with nucleus, dendrites, the axon with its axonal arborization, and synapses connecting to axons from other cells]


Page 5: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Background

Connectionist Models

Some facts about the human brain:

Number of neurons about 10^11

Connections per neuron about 10^4 – 10^5

Scene recognition time about 0.1 second

Neuron switching time about 0.001 second

100 inference steps doesn’t seem like enough for scene recognition → much parallel computation

Properties of artificial neural nets (ANN’s):

Many neuron-like threshold switching units

Many weighted interconnections among units

Highly parallel, distributed process

Emphasis on tuning weights automatically


Page 6: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Background

When to Consider Neural Networks

Input is high-dimensional discrete or real-valued (e.g., raw sensor input)

Output is discrete or real valued

Output is a vector of values

Possibly noisy data

Form of target function is unknown

Human readability of result is unimportant

Examples:

Speech phoneme recognition

Image classification

Autonomous drivers (ALVINN)

Financial prediction


Page 7: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Perceptron

Output is a “squashed” linear function of the inputs:

ai ← g(ini) = g(Σj Wj,i aj)

[Figure: a single unit, where input links aj with weights Wj,i (including the bias weight W0,i on the fixed input a0 = −1) feed the input function ini = Σj Wj,i aj, which is passed through the activation function g to give the output ai = g(ini) on the output links]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do.


Page 8: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Activation functions

[Figure: two activation functions g(ini) plotted against ini, panels (a) and (b)]

(a) is a step function or threshold function

(b) is the sigmoid function 1/(1 + e^−x)

(c) We will also consider the linear function

Note! Changing the bias weight W0,i moves the threshold location
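For concreteness, the three activation functions can be written down directly; a minimal Python sketch (the function names are my own, not from the lecture):

```python
import math

def step(in_i):
    """Step/threshold activation: 1 when the weighted input is positive, else 0."""
    return 1.0 if in_i > 0 else 0.0

def sigmoid(in_i):
    """Sigmoid activation 1 / (1 + e^(-in_i)): a smooth, differentiable squashing function."""
    return 1.0 / (1.0 + math.exp(-in_i))

def linear(in_i):
    """Linear activation: passes the weighted input through unchanged."""
    return in_i
```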


Page 9: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Perceptron activated by step-function

Start by looking at a single perceptron with step-function activation:

[Figure: a perceptron with inputs x1, . . . , xn and weights w1, . . . , wn, plus a bias input x0 = −1 with weight w0; the unit computes Σ_{i=0..n} wi xi and outputs 1 if this sum is > 0, and 0 otherwise]

output(x1, . . . , xn) = 1 if w0x0 + w1x1 + · · · + wnxn > 0, and 0 otherwise.

Alternative vector notation:

output(~x) = 1 if ~w^T ~x > 0, and 0 otherwise.
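Both notations describe the same computation; a small sketch, with the bias folded in as x0 = −1 and the example weights purely illustrative:

```python
def perceptron_output(w, x):
    """Step-function perceptron: returns 1 if w^T x > 0, else 0.
    Both vectors include the bias term: x[0] is fixed to -1 and w[0] is the bias weight."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Illustrative weights w = (0.5, 1.0):
print(perceptron_output([0.5, 1.0], [-1, 1]))  # 0.5*(-1) + 1.0*1 = 0.5 > 0  ->  1
print(perceptron_output([0.5, 1.0], [-1, 0]))  # -0.5 is not > 0            ->  0
```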


Page 10: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Decision Surface of a Perceptron

A perceptron activated by a step-function can represent some useful binary functions.

Discuss with your neighbour: What weights (if any) represent g(x1, x2) = AND(x1, x2)? g(x1, x2) = OR(x1, x2)? g(x1, x2) = XOR(x1, x2)?

[Figure: positive (+) and negative (−) examples plotted in the (x1, x2) plane, panels (a) and (b), illustrating which of these functions are linearly separable]


Page 11: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Implementing logical functions

[Figure: threshold units implementing logical functions]

AND: w0 = 1.5, w1 = 1, w2 = 1

OR: w0 = 0.5, w1 = 1, w2 = 1

NOT: w0 = −0.5, w1 = −1

Some functions are not representable by perceptrons, namely those that are not linearly separable (e.g., XOR).

Can make it work by “inventing” new inputs to make it linearly separable (e.g., x3 ← |x1 − x2| for XOR).

Equivalently, if we have networks of perceptrons, any Boolean function can be implemented.
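To see that the weights on this slide behave as claimed, here is a short sketch that evaluates them on all inputs (same step perceptron as before, with x0 fixed to −1):

```python
def perceptron(w, x):
    """Step perceptron with the bias folded in: x[0] is fixed to -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

AND_W = [1.5, 1.0, 1.0]    # w0 = 1.5, w1 = w2 = 1
OR_W = [0.5, 1.0, 1.0]     # w0 = 0.5, w1 = w2 = 1
NOT_W = [-0.5, -1.0]       # w0 = -0.5, w1 = -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "AND:", perceptron(AND_W, [-1, x1, x2]),
              "OR:", perceptron(OR_W, [-1, x1, x2]))
for x1 in (0, 1):
    print("NOT", x1, "->", perceptron(NOT_W, [-1, x1]))
```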


Page 12: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Perceptron training rule (The “Delta-rule”)

First we look at learning weights of single perceptrons: From data 〈x, t〉 (t is the “target”), we want to tune ~w automatically.

Let us consider a simpler linear unit, where input ~x gives output o = ~w^T ~x. The correct (observed) value is t.

Desiderata:

Each weight wi should be tuned separately.
If o = t there is no need to change any of the weights.
If o < t and xi > 0, we should increase wi

If o < t and xi < 0, we should decrease wi

If o > t and xi > 0, we should decrease wi

If o > t and xi < 0, we should increase wi

The more we are off, the larger the change to the weights.


Page 13: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Perceptron training rule (The “Delta-rule”)

First we look at learning weights of single perceptrons: From data 〈x, t〉 (t is the “target”), we want to tune ~w automatically.

Let us consider a simpler linear unit, where input ~x gives output o = ~w^T ~x. The correct (observed) value is t.

Learning rule:

wi ← wi +∆wi, where ∆wi = η(t− o)xi

Where: t is the target value and o is the perceptron output (as before)

η is a small constant called the learning rate.


Page 14: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Threshold units

Perceptron training rule (The “Delta-rule”)

First we look at learning weights of single perceptrons: From data 〈x, t〉 (t is the “target”), we want to tune ~w automatically.

Let us consider a simpler linear unit, where input ~x gives output o = ~w^T ~x. The correct (observed) value is t.

Learning rule:

wi ← wi +∆wi, where ∆wi = η(t− o)xi

Where: t is the target value and o is the perceptron output (as before)

η is a small constant called the learning rate.

Convergence:

Converges if training data is linearly separable and η sufficiently small.

Result unclear when NOT linearly separable.
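A single delta-rule update for the linear unit can be sketched as follows (the learning rate and the data are illustrative, not from the slide):

```python
def delta_rule_update(w, x, t, eta=0.05):
    """One update of the rule w_i <- w_i + eta * (t - o) * x_i, with o = w^T x (linear unit)."""
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

# Example: o = 0.2 < t = 1 and x1 > 0, so w1 is increased (as the desiderata require).
w = [0.0, 0.2]     # [w0 (bias weight), w1]
x = [-1.0, 1.0]    # x0 = -1, x1 = 1
print(delta_rule_update(w, x, t=1.0))
```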

Page 15: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Finding the optimal weights

We still consider the linear unit, where input ~x gives output

o = w0x0 + w1x1 + · · · + wnxn = ~w^T ~x

We learn the wi’s that minimise the squared error

E[~w] ≡ (1/2) Σ_{d∈D} (td − od)²,

where D is the set of training examples, each of the form d = 〈xd, td〉.


Page 16: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Finding optimal weights – requirements

Description of our situation:

We have a function E[~w] we want to minimise (wrt ~w).

Why not just try a number of weight configurations ~wi, calculate E[~wi] and see what happens?

Demo: errorPlot.m


Page 17: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Finding optimal weights – requirements

Description of our situation:

We have a function E[~w] we want to minimise (wrt ~w).

Why not just try a number of weight configurations ~wi, calculate E[~wi] and see what happens?

There are infinitely many (even uncountably many) weight configurations.

Minimisation is typically in a very high dimensional space.

Evaluating E[~w] involves summing over all training examples – can be very expensive.

Thus, we cannot use a standard (“silly”) Trial & Error approach here, but must devise a local search method.


Page 18: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Motivation

Question: How can we minimise this error function without knowing the whole shape?
Obvious answer: Walk “downhill”!


Page 19: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – The setup

We want to find the value x which minimizes f(x).

To avoid evaluating the whole function we use an iterative approach:

Guess a value for x.
Calculate the derivative at x.
Make a new guess for x based on the calculated information.
. . . and keep going.

Intuition:

If the derivative is large (in absolute value) we are far from the minimizing point.
If it is small we are close.
. . . and if it is zero we are done.

Solution: Use the update rule x_{i+1} ← x_i − η · f′(x_i). η > 0 is the learning rate.


Page 20: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x^4 − 5x^3 + 2x^2 − 6x + 4 with η = 0.025.

[Plot: f(x) on the interval 0 ≤ x ≤ 3]

The function f(x) has a minimum at x = 1.8261. Let’s try to find it. . .


Page 21: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x^4 − 5x^3 + 2x^2 − 6x + 4 with η = 0.025.

[Plot: the gradient descent step taken from x0 = 3]

Starting from x0 = 3 and finding f ′(3) = 87:

x1 = x0 − η · f′(x0) = 3 − 0.025 · 87 = 0.8250


Page 22: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x^4 − 5x^3 + 2x^2 − 6x + 4 with η = 0.025.

[Plot: the step taken from x1 = 0.8250]

Going from x1 = 0.8250 with f ′(0.8250) = −8.4172:

x2 = x1 − η · f′(x1) = 0.825 − 0.025 · (−8.4172) = 1.0354


Page 23: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x^4 − 5x^3 + 2x^2 − 6x + 4 with η = 0.025.

[Plot: the step taken from x2 = 1.0354]

Going from x2 = 1.0354 with f ′(1.0354) = −9.0592:

x3 = 1.0354 − 0.025 · (−9.0592) = 1.2619


Page 24: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Example

Minimize f(x) = 2x^4 − 5x^3 + 2x^2 − 6x + 4 with η = 0.025.

[Plot: the final steps approaching the minimum]

. . . and finally going from x10 = 1.8260 with f ′(1.8260) = −0.0034:

x11 = 1.8260 − 0.025 · (−0.0034) = 1.8261

. . . and we are done.
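The whole iteration on the preceding slides can be reproduced in a few lines; a sketch:

```python
def f(x):
    return 2*x**4 - 5*x**3 + 2*x**2 - 6*x + 4

def f_prime(x):
    return 8*x**3 - 15*x**2 + 4*x - 6

eta = 0.025
x = 3.0                       # starting point x0
for i in range(1, 12):
    x = x - eta * f_prime(x)  # x_{i+1} <- x_i - eta * f'(x_i)
    print(f"x{i} = {x:.4f}")
# After roughly ten steps x has settled near the minimum at x = 1.8261.
```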

Page 25: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – in higher dimensions

Recall that
The gradient of a surface E[~w] is a vector in the direction the curve grows the most (calculated at ~w).

The gradient is calculated as ∇E[~w] ≡ [∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn].

Training rule: ∆~w = −η · ∇E[~w], i.e., ∆wi = −η · ∂E/∂wi.


Page 26: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – Details. . .

∂E/∂wi = ∂/∂wi [ (1/2) Σ_d (td − od)² ]

= (1/2) Σ_d ∂/∂wi (td − od)²

= (1/2) Σ_d 2(td − od) · ∂/∂wi (td − od)

= Σ_d (td − od) · ∂/∂wi (td − ~w^T ~xd)

∂E/∂wi = Σ_d (td − od)(−xi,d)


Page 27: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Gradient Descent – The algorithm

Gradient-Descent(D, η)

Each training example is a pair of the form 〈~x, t〉, where ~x is the vector of input values, and t is the target output value. η is the learning rate (e.g., 0.05).

1 Initialize each wi to some small random value
2 Until the termination condition is met:
  1 Initialize: ∆wi ← 0
  2 For each 〈~x, t〉 in D:
    Input the instance ~x to the unit and compute the output o
    For each linear unit weight wi: ∆wi ← ∆wi + η(t − o)xi
  3 For each linear unit weight wi: wi ← wi + ∆wi

Same algorithm skeleton works for other activation functions, as long as we can calculate ∂E/∂wi.
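A direct Python rendering of this skeleton for the linear unit might look as follows (the toy data set and the fixed number of epochs are my own illustrative choices):

```python
import random

def gradient_descent(D, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = w^T x.
    D is a list of (x, t) pairs; every x already contains the bias input x0 = -1."""
    n = len(D[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]    # 1: small random initial weights
    for _ in range(epochs):                                # 2: termination = fixed number of epochs
        delta_w = [0.0] * n                                # 2.1: Delta w_i <- 0
        for x, t in D:                                     # 2.2: for each training example
            o = sum(wi * xi for wi, xi in zip(w, x))       #      compute the output o
            for i in range(n):
                delta_w[i] += eta * (t - o) * x[i]         #      Delta w_i <- Delta w_i + eta (t - o) x_i
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]      # 2.3: w_i <- w_i + Delta w_i
    return w

# Toy data: t = 2*x1 - 1, so with x0 = -1 the weights should approach w0 = 1, w1 = 2.
D = [([-1.0, x1], 2*x1 - 1) for x1 in (-1.0, 0.0, 0.5, 1.0, 2.0)]
print(gradient_descent(D))
```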


Page 28: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Demo

errorPlot.m — Second part


Page 29: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Perceptron learning – Does it work?

Perceptron learning rule converges to a consistent function for any linearly separable data set

[Figure: two plots of proportion correct on the test set (0.4–1) versus training set size (0–100), comparing the perceptron and decision-tree learning]

majority_11: Perceptron OK, DTL hopeless
restaurant: DTL OK, perceptron cannot represent it


Page 30: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Gradient descent

Summary - Perceptron

Perceptron delta-rule guaranteed to succeed if

Training examples are linearly separable

Sufficiently small learning rate η

Linear unit training rule uses gradient descent

Guaranteed to converge to w with minimum squared error

Given sufficiently small learning rate η

This is true even when training data contains noise, or the data is not linearly separable.


Page 31: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Multilayer Networks of Sigmoid Units


Page 32: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Multilayer perceptrons

Layers are usually fully connected; numbers of hidden units typically chosen by hand

[Figure: a feed-forward network with input units xk connected to hidden units via weights wk,j, and hidden units connected to output units via weights wj,i]


Page 33: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Sigmoid Unit

[Figure: sigmoid unit with inputs x1, . . . , xn plus bias input x0 = −1, weights w0, . . . , wn, net input in = Σ_{i=0..n} wi xi, and output o = σ(in) = 1/(1 + e^−in)]

σ(·) is the sigmoid function, σ(y) = 1/(1 + e^−y).

Nice property: dσ(y)/dy = σ(y)(1 − σ(y))
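The nice property is easy to check numerically; a tiny sketch comparing it with a finite-difference approximation:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))."""
    return sigmoid(y) * (1.0 - sigmoid(y))

# Compare against a central finite-difference approximation at a few points.
h = 1e-6
for y in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(y + h) - sigmoid(y - h)) / (2 * h)
    print(y, sigmoid_derivative(y), numeric)
```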

We can derive gradient descent rules to train:

One sigmoid unit → We (should) know how to do that.

Multilayer networks of sigmoid units → Backpropagation


Page 34: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Sigmoids - expressibility

No hidden layer, input (x1, x2):

[Plot: perceptron output as a function of (x1, x2), ranging from 0 to 1, forming a single soft-threshold “cliff”]

Adjusting weights moves the location, orientation, and steepness of the cliff

Cannot tackle “correlation effects” of non-separable targets


Page 35: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Sigmoids - expressibility

Hidden layers:

[Plots: hW(x1, x2) for networks with a hidden layer, one surface forming a ridge and one forming a bump]

Combine two opposite-facing threshold functions to make a ridge

Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface

All continuous functions w/ 1 hidden layer, all functions w/ 2

Page 36: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Multilayer networks

Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi [ (1/2) Σ_{d∈D} (td − od)² ]

= (1/2) Σ_d 2(td − od) · ∂/∂wi (td − od)

= −Σ_d (td − od) · (∂od/∂ind) · (∂ind/∂wi)

But we know:

∂od/∂ind = ∂σ(ind)/∂ind = od(1 − od),   ∂ind/∂wi = ∂(~w^T ~xd)/∂wi = xi,d

So:

∂E/∂wi = −Σ_{d∈D} (td − od) od(1 − od) xi,d
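As a sketch, this gradient can be computed for a single sigmoid unit over a data set like so (the function name and data are my own):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit_gradient(w, D):
    """Returns [dE/dw_i] for E = 1/2 * sum_d (t_d - o_d)^2, where o_d = sigma(w^T x_d)."""
    grad = [0.0] * len(w)
    for x, t in D:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * o * (1 - o) * xi   # -(t_d - o_d) o_d (1 - o_d) x_{i,d}
    return grad

# Two illustrative training examples; x includes the bias input x0 = -1.
D = [([-1.0, 0.5], 1.0), ([-1.0, -1.0], 0.0)]
print(sigmoid_unit_gradient([0.1, 0.2], D))
```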


Page 37: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Backpropagation

Backpropagation Algorithm

1 Initialize all weights to small random numbers.
2 Until satisfied:
  For each training example:
    1 Input the training example to the network and compute the network outputs
    2 For each output unit k: δk ← ok(1 − ok)(tk − ok)
    3 For each hidden unit h: δh ← oh(1 − oh) Σ_{k∈outputs} wh,k · δk
    4 Update each network weight wi,j: wi,j ← wi,j + ∆wi,j, where ∆wi,j = η · δj · xi,j
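A compact Python sketch of the algorithm for one hidden layer of sigmoid units (stochastic updates; the layer sizes, learning rate, and XOR data are illustrative assumptions, not from the slide):

```python
import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_backprop(D, n_in, n_hidden, n_out, eta=0.3, epochs=5000):
    """Backpropagation for a network with one hidden layer of sigmoid units.
    Each element of D is (x, t); a bias input of -1 is appended to the input
    vector and to the hidden layer inside the function."""
    # W_hid[j][i]: weight from input i to hidden unit j (last index = bias weight).
    W_hid = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    # W_out[k][j]: weight from hidden unit j to output unit k (last index = bias weight).
    W_out = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    for _ in range(epochs):
        for x, t in D:
            # 1. Forward pass: hidden and output activations.
            xb = list(x) + [-1.0]
            h = [sigmoid(sum(w * xi for w, xi in zip(W_hid[j], xb))) for j in range(n_hidden)]
            hb = h + [-1.0]
            o = [sigmoid(sum(w * hi for w, hi in zip(W_out[k], hb))) for k in range(n_out)]
            # 2. delta for each output unit k.
            d_out = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]
            # 3. delta for each hidden unit (its outgoing weights times the output deltas).
            d_hid = [h[j] * (1 - h[j]) * sum(W_out[k][j] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # 4. Update every weight: w <- w + eta * delta_j * x_ij.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W_out[k][j] += eta * d_out[k] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    W_hid[j][i] += eta * d_hid[j] * xb[i]
    return W_hid, W_out

# XOR as a toy problem (a single perceptron cannot represent it).
# With two hidden units this usually works, though it may end in a local minimum.
D = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
W_hid, W_out = train_backprop(D, n_in=2, n_hidden=2, n_out=1)
```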


Page 38: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Backpropagation

More on Backpropagation

Gradient descent over entire network weight vector

Easily generalized to arbitrary directed graphs

Will find a local, not necessarily global error minimum

In practice, often works well (can run multiple times)

Often include weight momentum α > 0

∆wi,j(n) = η · δj · xi,j + α ·∆wi,j(n− 1)
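A sketch of the bookkeeping a momentum term requires, for a single weight (names are my own):

```python
def momentum_update(w, delta_j, x_ij, prev_delta_w, eta=0.3, alpha=0.9):
    """Delta w_ij(n) = eta * delta_j * x_ij + alpha * Delta w_ij(n-1)."""
    delta_w = eta * delta_j * x_ij + alpha * prev_delta_w
    return w + delta_w, delta_w   # updated weight, and the Delta w to remember for step n+1
```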

Minimizes error over training examples

Will it generalize well to subsequent examples? Overfitting. . .

Training can take thousands of iterations → slow!

Using network after training is very fast


Page 39: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Learning Hidden Layer Representations

[Figure: an 8 × 3 × 8 network, with eight input units, three hidden units, and eight output units]

A target function:

Input → Output

10000000 → 10000000

01000000 → 01000000

00100000 → 00100000

00010000 → 00010000

00001000 → 00001000

00000100 → 00000100

00000010 → 00000010

00000001 → 00000001

Can this be learned??


Page 40: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Learning Hidden Layer Representations

Learned hidden layer representation:

Input → Hidden Values → Output

10000000 → .89 .04 .08 → 10000000

01000000 → .15 .99 .99 → 01000000

00100000 → .01 .97 .27 → 00100000

00010000 → .99 .97 .71 → 00010000

00001000 → .03 .05 .02 → 00001000

00000100 → .01 .11 .88 → 00000100

00000010 → .80 .01 .98 → 00000010

00000001 → .60 .94 .01 → 00000001

Do you recognize the encoding?



Page 42: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Training

[Plot: sum of squared errors for each output unit versus training iterations (0–2500)]


Page 43: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Training

[Plot: the hidden unit encoding (three hidden unit values) for input 01000000 versus training iterations (0–2500)]


Page 44: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Representative power of 8 - 1 - 8 sigmoid network

Discussion time: Representative power of 8 - 1 - 8 sigmoid network:

Consider the same input → output mapping as in the last example (10000000 → 10000000, etc.)

Can this be represented using the 8 - 1 - 8 structure?

Hidden node h can, e.g., take the value Σ_{i=1..8} 2^(i−1) · xi

Can h = f1(x) be approximated using sigmoid functions?
Can t = f2(h) be approximated using sigmoid functions?
If so: t = f2(f1(x)), and the answer is YES!


Page 45: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Convergence of Backpropagation

Gradient descent to some local minimum:

Perhaps not global minimum . . .

Add momentum

Train multiple nets with different initial weights

Nature of convergence:

Initialize weights near zero → Therefore, initial networks near-linear

Increasingly non-linear functions possible as training progresses


Page 46: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Expressive Capabilities of ANNs

Boolean functions:

Every boolean function can be represented by a network with a single hidden layer

but might require exponential (in number of inputs) hidden units

Continuous functions:

Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer

Any function can be approximated to arbitrary accuracy by a network with two hidden layers


Page 47: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Hidden layer representations

Overfitting in ANNs

[Plots: error versus number of weight updates (examples 1 and 2), each showing training set error and validation set error]


Page 48: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Example: Detecting Viewing Direction

Neural Nets for Recognition of Viewing Direction

[Images: typical input images (30 × 32 pixels) for the four viewing directions: left, straight, right, up]

90% accurate learning head pose, and recognizing 1-of-20 faces

Page 49: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Example: Detecting Viewing Direction

Learned Hidden Unit Weights

[Figure: learned hidden unit weights visualised as images, alongside typical input images for the poses left, straight, right, up; figure annotations: mostly “Up”, “Right”; not “Left”; mostly “Straight”]


Page 50: TDT4171 Artificial Intelligence Methods - Lecture 10 – Neural

Example: Detecting Viewing Direction

Summary

Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

Engineering, cognitive modelling, and neural system modelling subfields have largely diverged
