
An informal account of BackProp

For each pattern in the training set:

Compute the error at the output nodes

Compute Δw for each weight in the 2nd layer

Compute δ (generalized error expression) for hidden units

Compute Δw for each weight in the 1st layer

After amassing Δw for all the weights, change each weight a little bit, as determined by the learning rate

$\Delta w_{ij} = \eta\,\delta_{ip}\,o_{jp}$ (for pattern $p$)
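As a rough illustration of this loop, here is a minimal Python skeleton of the batch update just described; `net`, its `weights` dictionary, and `compute_deltas` are illustrative stand-ins, not anything defined in the slides.

```python
# Minimal sketch of the informal BackProp loop above (batch form).
# `net`, `net.weights`, and `compute_deltas` are illustrative assumptions.
def train_epoch(net, training_set, eta=0.1):
    acc = {name: 0.0 for name in net.weights}          # amass Delta-w for every weight
    for pattern, target in training_set:
        # error at the output nodes, then generalized deltas for hidden units
        deltas = compute_deltas(net, pattern, target)   # {weight name: Delta-w}
        for name, dw in deltas.items():
            acc[name] += dw
    for name in net.weights:                            # change each weight a little bit,
        net.weights[name] += eta * acc[name]            # as set by the learning rate eta
```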


Backprop Details

Here we go…

Also refer to web notes for derivation


The output layer

Layers $k \rightarrow j \rightarrow i$, with weights $w_{jk}$ into the hidden layer and $w_{ij}$ into the output layer.

$E = \text{Error} = \tfrac{1}{2}\sum_i (t_i - y_i)^2$, where $y_i$ is the output of unit $i$ and $t_i$ is its target.

Gradient descent on the output-layer weights ($\eta$ = learning rate):

$W_{ij} \leftarrow W_{ij} - \eta\,\dfrac{\partial E}{\partial W_{ij}}$

$\dfrac{\partial E}{\partial W_{ij}} = \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial x_i}\,\dfrac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i)\,f'(x_i)\,y_j$

so $\Delta W_{ij} = \eta\,(t_i - y_i)\,f'(x_i)\,y_j$.

The derivative of the sigmoid is just $f'(x_i) = y_i(1 - y_i)$, so

$\Delta W_{ij} = \eta\,(t_i - y_i)\,y_i(1 - y_i)\,y_j = \eta\,\delta_i\,y_j$, where $\delta_i = (t_i - y_i)\,y_i(1 - y_i)$.


The hidden layer

Same network and error: $E = \tfrac{1}{2}\sum_i (t_i - y_i)^2$, with output $y_i$ and target $t_i$.

For a weight $W_{jk}$ into hidden unit $j$, the error reaches $W_{jk}$ through every output unit $i$:

$\dfrac{\partial E}{\partial W_{jk}} = \sum_i \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial x_i}\,\dfrac{\partial x_i}{\partial y_j}\,\dfrac{\partial y_j}{\partial x_j}\,\dfrac{\partial x_j}{\partial W_{jk}} = -\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}\,f'(x_j)\,y_k$

so $\Delta W_{jk} = \eta\,\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}\,f'(x_j)\,y_k$.

With the sigmoid derivative $f'(x) = y(1 - y)$:

$\Delta W_{jk} = \eta\,\Big[\sum_i (t_i - y_i)\,y_i(1 - y_i)\,W_{ij}\Big]\,y_j(1 - y_j)\,y_k = \eta\,\delta_j\,y_k$

where $\delta_j = y_j(1 - y_j)\sum_i \delta_i\,W_{ij}$.
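To make the two update rules concrete, here is a minimal Python sketch of one backprop step for a single pattern, with sigmoid units and one hidden layer, using the δ expressions derived above; the function and variable names are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_jk, W_ij, y_k, t, eta=0.5):
    """One gradient-descent step for a single pattern.

    W_jk : (hidden, input)  weights into the hidden layer
    W_ij : (output, hidden) weights into the output layer
    y_k  : input vector, t : target vector
    """
    # Forward pass
    y_j = sigmoid(W_jk @ y_k)                   # hidden activations
    y_i = sigmoid(W_ij @ y_j)                   # output activations

    # Output-layer deltas: delta_i = (t_i - y_i) * y_i * (1 - y_i)
    delta_i = (t - y_i) * y_i * (1 - y_i)

    # Hidden-layer deltas: delta_j = y_j * (1 - y_j) * sum_i delta_i * W_ij
    delta_j = (W_ij.T @ delta_i) * y_j * (1 - y_j)

    # Weight updates: Delta W_ij = eta * delta_i * y_j,  Delta W_jk = eta * delta_j * y_k
    W_ij = W_ij + eta * np.outer(delta_i, y_j)
    W_jk = W_jk + eta * np.outer(delta_j, y_k)
    return W_jk, W_ij
```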


Momentum term

The speed of learning is governed by the learning rate. If the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum.

Momentum tends to smooth small weight error fluctuations.

$\Delta w_{ij}(n+1) = \eta\,\delta_i(n)\,y_j(n) + \alpha\,\Delta w_{ij}(n), \qquad 0 < \alpha < 1$

The momentum accelerates the descent in steady downhill directions. The momentum has a stabilizing effect in directions that oscillate in time.
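A minimal sketch of the momentum update, assuming the plain gradient step (η δ y) has already been computed; the function and argument names are illustrative.

```python
# Momentum: blend the previous weight change into the new one.
# alpha is the momentum constant (0 < alpha < 1); names are illustrative.
def momentum_update(W, grad_step, prev_delta_w, alpha=0.9):
    """grad_step is the plain gradient-descent step, eta * delta * y."""
    delta_w = grad_step + alpha * prev_delta_w   # Δw(n+1) = η δ y + α Δw(n)
    return W + delta_w, delta_w                  # updated weights and remembered Δw
```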


Convergence

May get stuck in local minima

Weights may diverge

…but works well in practice

Representation power:
2-layer networks: any continuous function
3-layer networks: any function


Local Minimum

USE A RANDOM COMPONENT: SIMULATED ANNEALING


Overfitting and generalization

A NETWORK WITH TOO MANY HIDDEN NODES TENDS TO OVERFIT


Overfitting in ANNs


Early Stopping (Important!!!)

Stop training when error goes up on validation set
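A minimal sketch of such an early-stopping loop; `train_epoch` and `validation_error` are assumed helper functions, not part of the slides.

```python
import copy

# Early stopping sketch: keep the weights from the best validation point
# and stop once validation error has gone up for `patience` epochs in a row.
def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=10):
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_set)             # one pass of backprop updates (assumed helper)
        err = validation_error(net, val_set)    # error on the held-out set (assumed helper)
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1                     # validation error went up
            if bad_epochs >= patience:
                break                           # stop: overfitting has begun
    return best_net                             # network at the best validation point
```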


Stopping criteria

Sensible stopping criteria: total mean squared error change:

Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).

Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, then the part of the training set used for testing the network's generalization will not be used for updating the weights.
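The first criterion is easy to state in code; a tiny sketch, with the threshold taken from the range mentioned above:

```python
# Mean-squared-error-change criterion: converged when the absolute change in
# average squared error per epoch is sufficiently small (the slide suggests a
# threshold somewhere in [0.01, 0.1]).
def has_converged(prev_mse, curr_mse, threshold=0.01):
    return abs(curr_mse - prev_mse) < threshold
```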


Architectural Considerations

What is the right size network for a given job?

How many hidden units?

Too many: no generalization

Too few: no solution

Possible answer: Constructive algorithm, e.g.

Cascade Correlation (Fahlman, & Lebiere 1990)

etc


Network Topology

The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.

Two types of adaptive algorithms can be used:

start from a large network and successively remove nodes and links until network performance degrades;

begin with a small network and introduce new neurons until performance is satisfactory.


Problems and Networks

•Some problems have natural "good" solutions

•Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed

•Networks are general purpose tools.

•Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem

•Tension: tailoring tools for a specific job vs. exploiting a general-purpose learning mechanism


Summary

Multiple layer feed-forward networks

Replace step with sigmoid (differentiable) function

Learn weights by gradient descent on error function

Backpropagation algorithm for learning

Avoid overfitting by early stopping


ALVINN drives 70mph on highways


Use MLP Neural Networks when …

(vectored) Real inputs, (vectored) real outputs

You’re not interested in understanding how it works

Long training times acceptable

Short execution (prediction) times required

Robust to noise in the dataset


Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.

Recognizing printed or handwritten characters

Face recognition

Classification of loan applications into credit-worthy and non-credit-worthy groups

Analysis of sonar and radar signals to determine the nature of the source

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).


Extensions of Backprop Nets

Recurrent architectures

Backprop through time


Elman Nets & Jordan Nets

Updating the context as we receive input

• In Jordan nets we model “forgetting” as well

• The recurrent connections have fixed weights

• You can train these networks using good ol’ backprop (a forward-step sketch follows the diagrams below)

[Diagrams: Elman net and Jordan net, each with Output, Hidden, Context, and Input layers; the recurrent connections into the Context layer carry fixed weights (1, and α in the Jordan net).]
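A minimal sketch of one Elman-style time step (the context is a copy of the previous hidden state), trainable with ordinary backprop as the slide says; all names are illustrative, and the Jordan-style variant is only hinted at in a comment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, context, W_in, W_ctx, W_out):
    """One time step of an Elman net; weight names are illustrative."""
    hidden = sigmoid(W_in @ x + W_ctx @ context)  # input and context drive the hidden layer
    output = sigmoid(W_out @ hidden)              # ordinary feed-forward output
    new_context = hidden.copy()                   # fixed-weight copy (weight = 1)
    # Jordan-style variant (sketch): context fed from the output with decay alpha,
    # e.g. new_context = alpha * context + output
    return output, new_context
```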


Recurrent Backprop

• we’ll pretend to step through the network one iteration at a time

• backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent); a small sketch follows the figure below

[Figure: a three-node recurrent network (a, b, c) with weights w1, w2, w3, w4, unrolled for 3 iterations; each unrolled copy repeats the same weights w1–w4.]
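A tiny sketch of the "average equivalent weights" step, assuming the per-copy gradients for one tied weight have already been collected by ordinary backprop on the unrolled network; names are illustrative.

```python
# "Average equivalent weights": after unrolling, the same weight appears in every
# copy of the network; apply one update using the mean of its per-copy gradients.
# grads_per_copy is a list of equally shaped gradients (illustrative).
def averaged_update(W, grads_per_copy, eta=0.1):
    avg_grad = sum(grads_per_copy) / len(grads_per_copy)
    return W - eta * avg_grad
```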


Connectionist Models in Cognitive Science

[Diagram: kinds of connectionist models in cognitive science: Structured vs. PDP (Elman), Neural vs. Conceptual, Existence vs. Data Fitting, and Hybrid.]


5 levels of Neural Theory of Language

[Diagram: the five levels, from abstract to concrete: Cognition and Language; Computation; Structured Connectionism; Computational Neurobiology; Biology. Topics placed along the levels include neural development, triangle nodes, neural nets and learning, spatial relations, motor control, metaphor, SHRUTI, grammar, and psycholinguistic experiments; course checkpoints (Quiz, Midterm, Finals) are marked alongside, and the vertical axis is labeled abstraction.]


The Color Story: A Bridge between Levels of NTL

(http://www.ritsumei.ac.jp/~akitaoka/color-e.html)


A Tour of the Visual System

• two regions of interest:

– retina

– LGN


The Physics of Light

Light: electromagnetic energy whose wavelength is between 400 nm and 700 nm. (1 nm = 10⁻⁹ meter)

[Figure: the electromagnetic spectrum, from about 10⁻¹⁴ meters to 10⁶ meters (cosmic rays, gamma rays, X-rays, UV, visible light, infra-red, microwaves, TV, radio), with the visible spectrum (wavelength 400–700 nm) expanded.]

© Stephen E. Palmer, 2002


The Physics of Light

Some examples of the spectra of light sources

[Figure: four panels plotting # photons vs. wavelength (400–700 nm): A. Ruby laser, B. Gallium phosphide crystal, C. Tungsten lightbulb, D. Normal daylight.]

© Stephen E. Palmer, 2002


The Physics of Light

Some examples of the reflectance spectra of surfaces

[Figure: % photons reflected vs. wavelength (400–700 nm) for red, yellow, blue, and purple surfaces.]

© Stephen E. Palmer, 2002


The Psychophysical Correspondence

There is no simple functional description for the perceived color of all lights under all viewing conditions, but ...

A helpful constraint: Consider only physical spectra with normal distributions

[Figure: a normal-distribution spectrum, # photons vs. wavelength (400–700 nm), annotated with its mean, variance, and area.]

© Stephen E. Palmer, 2002


Physiology of Color Vision

© Stephen E. Palmer, 2002

Two types of light-sensitive receptors:

Cones: cone-shaped, less sensitive, operate in high light, color vision

Rods: rod-shaped, highly sensitive, operate at night, gray-scale vision

[Figure: drawings of a cone and a rod.]


The Microscopic View


Rods and Cones in the Retina

(http://www.iit.edu/~npr/DrJennifer/visual/retina.html)


What Rods and Cones Detect

Notice how they aren’t distributed evenly, and the rod is more sensitive to shorter wavelengths


© Stephen E. Palmer, 2002

[Figure: absorption spectra of the three cone types: relative absorbance (%) vs. wavelength (400–650 nm), with peaks near 440 nm (S), 530 nm (M), and 560 nm (L).]

Three kinds of cones: Absorption spectra

Implementation of Trichromatic theory

Physiology of Color Vision

Opponent Processes: R/G = L-M; G/R = M-L; B/Y = S-(M+L); Y/B = (M+L)-S
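As a tiny worked illustration of those opponent formulas, treating S, M, L as plain cone-activation numbers (the function name and the idea of scalar inputs are assumptions for illustration):

```python
# Opponent channels computed from cone activations, using the slide's formulas.
# S, M, L are illustrative scalar cone responses.
def opponent_channels(S, M, L):
    return {
        "R/G": L - M,
        "G/R": M - L,
        "B/Y": S - (M + L),
        "Y/B": (M + L) - S,
    }
```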


© Stephen E. Palmer, 2002

Double Opponent Cells in V1

Physiology of Color Vision

[Figure: receptive fields of double-opponent cells: Red/Green cells (G+R-, R+G-) and Blue/Yellow cells (Y+B-, B+Y-).]


Color Blindness

Not everybody perceives colors in the same way!

What numbers do you see in these displays?

© Stephen E. Palmer, 2002


Theories of Color Vision

A Dual Process Wiring Diagram

© Stephen E. Palmer, 2002

[Wiring diagram: the S, M, L cone signals of the trichromatic stage are combined, through excitatory and inhibitory connections, into red/green (R+G-, G+R-), blue/yellow (B+Y-, Y+B-), and white/black (W+Bk-, Bk+W-) channels of the opponent-process stage (e.g. R+ from L-M, W+ from S+M+L).]


Color Naming

© Stephen E. Palmer, 2002

Basic Color Terms (Berlin & Kay)

Criteria:

1. Single words -- not “light-blue” or “blue-green”

2. Frequently used -- not “mauve” or “cyan”

3. Refer primarily to colors -- not “lime” or “gold”

4. Apply to any object -- not “roan” or “blond”


Color Naming

© Stephen E. Palmer, 2002

BCTs in English

Red, Green, Blue, Yellow, Black, White

Gray, Brown, Purple, Orange*, Pink


Color Naming

© Stephen E. Palmer, 2002

Five more BCTs in a study of 98 languages

Light-Blue, Warm, Cool, Light-Warm, Dark-Cool


The WCS Color Chips

• Basic color terms:
– Single word (not blue-green)

– Frequently used (not mauve)

– Refers primarily to colors (not lime)

– Applies to any object (not blonde)

FYI:

English has 11 basic color terms


Results of Kay’s Color Study

If you group languages by the number of basic color terms they have, then as the number of color terms increases, the additional terms specify focal colors.

Stage I: [W or R or Y], [Bk or G or Bu]
Stage II: W, [R or Y], [Bk or G or Bu]
Stage IIIa: W, [R or Y], [G or Bu], Bk
Stage IIIb: W, R, Y, [Bk or G or Bu]
Stage IV: W, R, Y, [G or Bu], Bk
Stage V: W, R, Y, G, Bu, Bk
Stage VI: W, R, Y, G, Bu, Bk, Y+Bk (Brown)
Stage VII: W, R, Y, G, Bu, Bk, Y+Bk (Brown), R+W (Pink), R+Bu (Purple), R+Y (Orange), Bk+W (Grey)


Color Naming

© Stephen E. Palmer, 2002

Typical “developmental” sequence of BCTs

2 Terms: Light-warm, Dark-cool
3 Terms: White, Warm, Dark-cool
4 Terms: White, Warm, Black, Cool
5 Terms: White, Red, Yellow, Black, Cool
6 Terms: White, Red, Yellow, Black, Green, Blue


Color Naming

© Stephen E. Palmer, 2002

Studied color categories in two ways

Boundaries

Best examples

(Berlin & Kay)


Color Naming

© Stephen E. Palmer, 2002

MEMORY : Focal colors are remembered better than nonfocal colors.

LEARNING: New color categories centered on focal colors are learned faster.

CATEGORIZATION: Focal colors are categorized more quickly than nonfocal colors.

(Rosch)


Color Naming

FUZZY SETS AND FUZZY LOGIC (Zadeh)

[Figure: degree of membership (0 to 1.0) in the category "Green" as a function of hue, with regions labeled not-at-all, a little bit, sorta, very, and extremely.]

Fuzzy set theory (Zadeh)

A fuzzy logical model of color naming (Kay & McDaniel)

© Stephen E. Palmer, 2002


Color Naming

© Stephen E. Palmer, 2002

[Figure: degree of membership (0 to 1) vs. hue for the primary color categories Blue, Green, Yellow, and Red, each curve peaking at its focal color (focal blue, focal green, focal yellow, focal red).]

“Primary” color categories


Color Naming

© Stephen E. Palmer, 2002

“Primary” color categories

Red, Green, Blue, Yellow, Black, White


Color Naming

© Stephen E. Palmer, 2002

“Derived” color categories.

[Figure: degree-of-membership curves vs. hue for Yellow and Red; their fuzzy-logical "ANDf" yields the membership curve for the derived category Orange.]


Color Naming

© Stephen E. Palmer, 2002

“Derived” color categories

Orange = Red ANDf Yellow
Purple = Red ANDf Blue
Gray = Black ANDf White
Pink = Red ANDf White
Brown = Yellow ANDf Black
(Goluboi = Blue ANDf White)


Color Naming

© Stephen E. Palmer, 2002

“Composite” color categories

Fuzzy logical "ORf"

[Figure: degree-of-membership curves vs. hue for Yellow and Red; their fuzzy-logical "ORf" yields a composite category spanning both.]

Warm = Red ORf Yellow
Cool = Blue ORf Green
Light-warm = White ORf Warm
Dark-cool = Black ORf Cool
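A small sketch of how fuzzy membership values might compose; using min for ANDf and max for ORf is the classic Zadeh choice and an assumption here, since Kay & McDaniel define their own operators.

```python
# Fuzzy composition of color-category membership (sketch).
# min/max are the standard Zadeh operators, assumed here for illustration;
# Kay & McDaniel's ANDf / ORf are their own definitions.
def and_f(mu_a, mu_b):
    return min(mu_a, mu_b)   # derived category, e.g. Orange from Red and Yellow

def or_f(mu_a, mu_b):
    return max(mu_a, mu_b)   # composite category, e.g. Warm from Red and Yellow

# Example: a hue with membership 0.4 in Red and 0.7 in Yellow
orange = and_f(0.4, 0.7)     # 0.4
warm = or_f(0.4, 0.7)        # 0.7
```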


Color Naming

FUZZY LOGICAL MODEL OF COLOR NAMING (Kay & McDaniel)

Only 16 Basic Color Terms in Hundreds of Languages:

PRIMARY: Red, Green, Blue, Yellow, Black, White

DERIVED: Orange, Purple, Brown, Pink, Gray

COMPOSITE: [Light-blue], [Warm], [Cool], [Light-warm], [Dark-cool]

[Figure: degree-of-membership curves illustrating fuzzy sets (Yellow), fuzzy ANDf (Orange = Yellow ANDf Red), and fuzzy ORf (Warm = Yellow ORf Red).]

© Stephen E. Palmer, 2002