
Backpropagation

Introduction to Artificial Intelligence

COS302

Michael L. Littman

Fall 2001

Administration

Questions, concerns?

Classification Perceptron

[Diagram: inputs x1, x2, x3, …, xD plus a constant input 1, weighted by w1, w2, w3, …, wD and w0, are summed into "net", then passed through the squashing function g to produce "out".]

Perceptrons

Recall that the squashing function makes the output look more like bits: 0 or 1 decisions.

What if we give it inputs that are also bits?
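The squashing function itself is not written out on these slides; a minimal sketch, assuming the common logistic sigmoid, looks like this:

```python
import math

def g(z):
    """Logistic sigmoid: squashes any real 'net' value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(g(-10), g(0), g(10))  # ~0.00005, 0.5, ~0.99995
```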

A Boolean Function

A B C D E F G | out
1 0 1 0 1 0 1 |  0
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  1
0 0 1 1 0 0 0 |  1
1 1 1 0 1 0 1 |  0
0 1 0 1 0 0 1 |  1
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1
1 1 1 0 0 1 1 |  0

Think Graphically

Can a perceptron learn this?

[Figure: the examples plotted in the C–D plane; three of the four corners are labeled 1 and one is labeled 0.]

Ands and Ors

out(x) = g(sum_k w_k x_k)

How can we set the weights to represent (v1)(v2)(~v7)?  [AND]

w_i = 0, except
w1 = 10, w2 = 10, w7 = -10, w0 = -15   (w0 = 5 minus the maximum weighted sum)

How about ~v3 + v4 + ~v8?  [OR]

w_i = 0, except
w3 = -10, w4 = 10, w8 = -10, w0 = 15   (w0 = -5 minus the minimum weighted sum)
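A small sketch checking the two constructions above (the helper `unit`, the list layout of the inputs, and the rounding are illustrative choices, not from the slides):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, bias, x):
    """One perceptron unit: out = g(w . x + w0)."""
    return g(bias + sum(w * xi for w, xi in zip(weights, x)))

# AND of literals v1, v2, ~v7 over inputs v1..v8 (list indices 0..7):
# weight 10 on positive literals, -10 on negated ones, bias = 5 - max weighted sum.
and_w, and_b = [10, 10, 0, 0, 0, 0, -10, 0], -15

# OR of literals ~v3, v4, ~v8: bias = -5 - min weighted sum = 15.
or_w, or_b = [0, 0, -10, 10, 0, 0, 0, -10], 15

x = [1, 1, 1, 0, 0, 0, 0, 1]          # v1=v2=1, v7=0 (AND true); v3=1, v4=0, v8=1 (OR false)
print(round(unit(and_w, and_b, x)))   # 1
print(round(unit(or_w, or_b, x)))     # 0
```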

Majority

Are at least half the bits on?

Set all weights to 1, and w0 to -n/2.

A B C D E F G | out
1 0 1 0 1 0 1 |  1
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  0
1 1 1 0 1 0 1 |  1
0 1 0 1 0 0 1 |  0
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1

Representation size using a decision tree?
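A quick self-contained check of the majority construction on a few rows of the table above (since g(net) > 1/2 exactly when net > 0, a hard threshold is used in this sketch):

```python
def majority_out(bits):
    """All weights 1, bias -n/2: fires iff at least half the bits are on."""
    net = sum(bits) - len(bits) / 2.0
    return 1 if net > 0 else 0

rows = [([1, 0, 1, 0, 1, 0, 1], 1),
        ([0, 1, 1, 0, 0, 0, 1], 0),
        ([1, 1, 1, 1, 1, 0, 1], 1)]
for bits, label in rows:
    assert majority_out(bits) == label
```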

Sweet Sixteen?

ab           (~a)+(~b)
a(~b)        (~a)+b
(~a)b        a+(~b)
(~a)(~b)     a+b
a            ~a
b            ~b
1            0
a = b        a exclusive-or b (a ⊕ b)

XOR Constraints

A B | out
0 0 |  0    g(w0) < 1/2
0 1 |  1    g(wB + w0) > 1/2
1 0 |  1    g(wA + w0) > 1/2
1 1 |  0    g(wA + wB + w0) < 1/2

w0 < 0, wA + w0 > 0, wB + w0 > 0,
so wA + wB + 2 w0 > 0, forcing 0 < wA + wB + w0 < 0.
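Written out as a short derivation (assuming g is a sigmoid centered at zero, so g(z) > 1/2 exactly when z > 0):

```latex
\begin{aligned}
& w_0 < 0,\quad w_A + w_0 > 0,\quad w_B + w_0 > 0,\quad w_A + w_B + w_0 < 0 \\
& \Rightarrow\; w_A + w_B + 2w_0 > 0
  \;\Rightarrow\; w_A + w_B + w_0 > -w_0 > 0
\end{aligned}
```

The last line contradicts the fourth constraint, so no single unit can compute XOR.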

Linearly Separable

XOR is problematic.

[Figure: the four XOR cases plotted in the plane (axes drawn as C and D); corners labeled 0, 0, 1, 1, with a "?" marking that no single line separates the 1s from the 0s.]

How Represent XOR?

A xor B = (A+B)(~A+~B)

[Diagram: a two-layer network over inputs A and B. Hidden unit c1 computes A+B with weights 10, 10 and bias -5; hidden unit c2 computes ~A+~B with weights -10, -10 and bias 15; the output unit computes c1 AND c2 with weights 10, 10 and bias -15.]
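A sketch of this construction in code (the layout follows the diagram: c1 acts as A OR B, c2 as NOT(A AND B), and the output ANDs them):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(a, b):
    c1 = g(10 * a + 10 * b - 5)       # A OR B
    c2 = g(-10 * a - 10 * b + 15)     # (~A) OR (~B)
    return g(10 * c1 + 10 * c2 - 15)  # c1 AND c2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b)))  # prints 0, 1, 1, 0
```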

Requiem for a Perceptron

Rosenblatt proved that a perceptron will learn any linearly separable function.

Minsky and Papert (1969) in Perceptrons: "there is no reason to suppose that any of the virtues carry over to the many-layered version."

Backpropagation

Bryson and Ho (1969, the same year) described a training procedure for multilayer networks. It went unnoticed.

Multiply rediscovered in the 1980s.

Multilayer Net

[Diagram: inputs x1, x2, x3, …, xD plus a constant 1, weighted by W11, W12, W13, …, feed the hidden sums net_1, net_2, …, net_H; each hid_j = g(net_j). The hidden units plus a constant 1, weighted by U_1, …, U_H and U_0, feed the output sum net_i, and out = g(net_i).]

Multiple Outputs

Makes no difference for the perceptron.

Add more outputs off the hidden layer in the multilayer case.

Output Function

out_i(x) = g(sum_j U_ji g(sum_k W_kj x_k))

H: number of "hidden" nodes

Also:
• Use more than one hidden layer
• Use direct input-output weights
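A minimal sketch of this output function as a forward pass (the array shapes and the omission of bias terms are assumptions for brevity; a constant-1 input and hidden unit would supply the w0-style biases):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, U):
    """x: (D,) inputs; W: (D, H) input-to-hidden weights; U: (H, O) hidden-to-output weights."""
    hid = g(x @ W)    # hid_j = g(sum_k W_kj x_k)
    out = g(hid @ U)  # out_i = g(sum_j U_ji hid_j)
    return out
```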

How Train?

Find a set of weights U, W that minimize

sum_{(x,y)} sum_i (y_i - out_i(x))^2

using gradient descent.

Incremental version (vs. batch): move the weights a small amount for each training example.
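In symbols (with a step size α, a symbol not written on the slide), gradient descent nudges each weight against the gradient of this error; the update rules on the next slide compute exactly these derivatives layer by layer:

```latex
E = \sum_{(\mathbf{x},\mathbf{y})} \sum_i \bigl(y_i - \mathrm{out}_i(\mathbf{x})\bigr)^2,
\qquad
U_{ji} \leftarrow U_{ji} - \alpha\,\frac{\partial E}{\partial U_{ji}},
\qquad
W_{kj} \leftarrow W_{kj} - \alpha\,\frac{\partial E}{\partial W_{kj}}
```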

Updating WeightsUpdating Weights

1.1. Feed-forward to hidden:Feed-forward to hidden: netnetj j = = sumsumk k WWkj kj xxkk; hid; hidjj = g(net = g(netjj))

2.2. Feed-forward to output:Feed-forward to output:

netneti i = sum= sumj j UUji ji hidhidjj; out; outii = g(net = g(netii))

3. Update output weights:3. Update output weights:

i i = g’(net= g’(netii) (y) (yii-out-outii); U); Uji ji += += hid hidjj ii

4. Update hidden weights:4. Update hidden weights:

jj= g’(net= g’(netjj) sum) sumi i UUjj jj ii; W; Wkj kj += += x xkk jj
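A sketch of one incremental update (steps 1–4) in numpy; the array shapes, the name alpha for the step size, and the omission of bias weights are assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(net):
    s = g(net)
    return s * (1.0 - s)  # derivative of the logistic sigmoid

def backprop_step(x, y, W, U, alpha=0.1):
    """One per-example update. x: (D,), y: (O,), W: (D, H), U: (H, O)."""
    net_hid = x @ W                                  # 1. feed-forward to hidden
    hid = g(net_hid)
    net_out = hid @ U                                # 2. feed-forward to output
    out = g(net_out)
    delta_out = g_prime(net_out) * (y - out)         # δ_i
    delta_hid = g_prime(net_hid) * (U @ delta_out)   # δ_j = g'(net_j) sum_i U_ji δ_i
    U += alpha * np.outer(hid, delta_out)            # 3. U_ji += α hid_j δ_i
    W += alpha * np.outer(x, delta_hid)              # 4. W_kj += α x_k δ_j
    return W, U
```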

Multilayer Net (schema)

[Diagram: forward pass x_k → (W_kj) → net_j → hid_j → (U_ji) → net_i → out_i, compared against the target y_i; backward pass: the output error gives δ_i, which flows back through U_ji to give δ_j.]

Does it Work?

Sort of: lots of practical applications, lots of people play with it. Fun.

However, it can fall prey to the standard problems with local search…

NP-hard to train a 3-node net.

Step Size Issues

Too small? (Progress is very slow.) Too big? (Updates can overshoot and oscillate or diverge.)

Representation Issues

Any continuous function can be represented by a one-hidden-layer net with sufficient hidden nodes.

Any function at all can be represented by a two-hidden-layer net with a sufficient number of hidden nodes.

What's the downside for learning?

Generalization Issues

Pruning weights: "optimal brain damage"

Cross validation

Much, much more to this. Take a class on machine learning.

What to Learn

Representing logical functions using sigmoid units

Majority (net vs. decision tree)

XOR is not linearly separable

Adding layers adds expressibility

Backprop is gradient descent

Homework 10 (due 12/12)

1. Describe a procedure for converting a Boolean formula in CNF (n variables, m clauses) into an equivalent network. How many hidden units does it have?

2. More soon