
An informal account of BackProp

For each pattern in the training set:

Compute the error at the output nodes

Compute Δw for each weight in the 2nd layer

Compute δ (generalized error expression) for hidden units

Compute Δw for each weight in the 1st layer

After amassing Δw for all the weights, change each weight a little bit, as determined by the learning rate

$\Delta w_{ij} = \eta\,\delta_{ip}\,o_{jp}$ (for pattern $p$)
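As a rough illustration of this loop, here is a minimal Python skeleton of the batch update just described; `net`, its `weights` dictionary, and `compute_deltas` are illustrative stand-ins, not anything defined in the slides.

```python
# Minimal sketch of the informal BackProp loop above (batch form).
# `net`, `net.weights`, and `compute_deltas` are illustrative assumptions.
def train_epoch(net, training_set, eta=0.1):
    acc = {name: 0.0 for name in net.weights}          # amass Delta-w for every weight
    for pattern, target in training_set:
        # error at the output nodes, then generalized deltas for hidden units
        deltas = compute_deltas(net, pattern, target)   # {weight name: Delta-w}
        for name, dw in deltas.items():
            acc[name] += dw
    for name in net.weights:                            # change each weight a little bit,
        net.weights[name] += eta * acc[name]            # as set by the learning rate eta
```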


Backprop Details

Here we go…

Also refer to web notes for derivation


The output layer

Layers $k \rightarrow j \rightarrow i$, with weights $w_{jk}$ into the hidden layer and $w_{ij}$ into the output layer.

$E = \text{Error} = \tfrac{1}{2}\sum_i (t_i - y_i)^2$, where $y_i$ is the output of unit $i$ and $t_i$ is its target.

Gradient descent on the output-layer weights ($\eta$ = learning rate):

$W_{ij} \leftarrow W_{ij} - \eta\,\dfrac{\partial E}{\partial W_{ij}}$

$\dfrac{\partial E}{\partial W_{ij}} = \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial x_i}\,\dfrac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i)\,f'(x_i)\,y_j$

so $\Delta W_{ij} = \eta\,(t_i - y_i)\,f'(x_i)\,y_j$.

The derivative of the sigmoid is just $f'(x_i) = y_i(1 - y_i)$, so

$\Delta W_{ij} = \eta\,(t_i - y_i)\,y_i(1 - y_i)\,y_j = \eta\,\delta_i\,y_j$, where $\delta_i = (t_i - y_i)\,y_i(1 - y_i)$.


The hidden layer

Same network and error: $E = \tfrac{1}{2}\sum_i (t_i - y_i)^2$, with output $y_i$ and target $t_i$.

For a weight $W_{jk}$ into hidden unit $j$, the error reaches $W_{jk}$ through every output unit $i$:

$\dfrac{\partial E}{\partial W_{jk}} = \sum_i \dfrac{\partial E}{\partial y_i}\,\dfrac{\partial y_i}{\partial x_i}\,\dfrac{\partial x_i}{\partial y_j}\,\dfrac{\partial y_j}{\partial x_j}\,\dfrac{\partial x_j}{\partial W_{jk}} = -\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}\,f'(x_j)\,y_k$

so $\Delta W_{jk} = \eta\,\sum_i (t_i - y_i)\,f'(x_i)\,W_{ij}\,f'(x_j)\,y_k$.

With the sigmoid derivative $f'(x) = y(1 - y)$:

$\Delta W_{jk} = \eta\,\Big[\sum_i (t_i - y_i)\,y_i(1 - y_i)\,W_{ij}\Big]\,y_j(1 - y_j)\,y_k = \eta\,\delta_j\,y_k$

where $\delta_j = y_j(1 - y_j)\sum_i \delta_i\,W_{ij}$.
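To make the two update rules concrete, here is a minimal Python sketch of one backprop step for a single pattern, with sigmoid units and one hidden layer, using the δ expressions derived above; the function and variable names are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W_jk, W_ij, y_k, t, eta=0.5):
    """One gradient-descent step for a single pattern.

    W_jk : (hidden, input)  weights into the hidden layer
    W_ij : (output, hidden) weights into the output layer
    y_k  : input vector, t : target vector
    """
    # Forward pass
    y_j = sigmoid(W_jk @ y_k)                   # hidden activations
    y_i = sigmoid(W_ij @ y_j)                   # output activations

    # Output-layer deltas: delta_i = (t_i - y_i) * y_i * (1 - y_i)
    delta_i = (t - y_i) * y_i * (1 - y_i)

    # Hidden-layer deltas: delta_j = y_j * (1 - y_j) * sum_i delta_i * W_ij
    delta_j = (W_ij.T @ delta_i) * y_j * (1 - y_j)

    # Weight updates: Delta W_ij = eta * delta_i * y_j,  Delta W_jk = eta * delta_j * y_k
    W_ij = W_ij + eta * np.outer(delta_i, y_j)
    W_jk = W_jk + eta * np.outer(delta_j, y_k)
    return W_jk, W_ij
```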


Momentum term

The speed of learning is governed by the learning rate. If the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum.

Momentum tends to smooth small weight error fluctuations.

$\Delta w_{ij}(n+1) = \eta\,\delta_i(n)\,y_j(n) + \alpha\,\Delta w_{ij}(n), \qquad 0 < \alpha < 1$

The momentum accelerates the descent in steady downhill directions. The momentum has a stabilizing effect in directions that oscillate in time.
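A minimal sketch of the momentum update, assuming the plain gradient step (η δ y) has already been computed; the function and argument names are illustrative.

```python
# Momentum: blend the previous weight change into the new one.
# alpha is the momentum constant (0 < alpha < 1); names are illustrative.
def momentum_update(W, grad_step, prev_delta_w, alpha=0.9):
    """grad_step is the plain gradient-descent step, eta * delta * y."""
    delta_w = grad_step + alpha * prev_delta_w   # Δw(n+1) = η δ y + α Δw(n)
    return W + delta_w, delta_w                  # updated weights and remembered Δw
```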


Convergence

May get stuck in local minima

Weights may diverge

…but works well in practice

Representation power:
2-layer networks: any continuous function
3-layer networks: any function


Local Minimum

USE A RANDOM COMPONENT: SIMULATED ANNEALING


Overfitting and generalization

A NETWORK WITH TOO MANY HIDDEN NODES TENDS TO OVERFIT


Overfitting in ANNs


Early Stopping (Important!!!)

Stop training when error goes up on validation set
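A minimal sketch of such an early-stopping loop; `train_epoch` and `validation_error` are assumed helper functions, not part of the slides.

```python
import copy

# Early stopping sketch: keep the weights from the best validation point
# and stop once validation error has gone up for `patience` epochs in a row.
def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=10):
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_set)             # one pass of backprop updates (assumed helper)
        err = validation_error(net, val_set)    # error on the held-out set (assumed helper)
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(net), 0
        else:
            bad_epochs += 1                     # validation error went up
            if bad_epochs >= patience:
                break                           # stop: overfitting has begun
    return best_net                             # network at the best validation point
```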


Stopping criteria

Sensible stopping criteria: total mean squared error change:

Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).

Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop. If this stopping criterion is used, then the part of the training set used for testing the network's generalization will not be used for updating the weights.
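The first criterion is easy to state in code; a tiny sketch, with the threshold taken from the range mentioned above:

```python
# Mean-squared-error-change criterion: converged when the absolute change in
# average squared error per epoch is sufficiently small (the slide suggests a
# threshold somewhere in [0.01, 0.1]).
def has_converged(prev_mse, curr_mse, threshold=0.01):
    return abs(curr_mse - prev_mse) < threshold
```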


Architectural Considerations

What is the right size network for a given job?

How many hidden units?

Too many: no generalization

Too few: no solution

Possible answer: Constructive algorithm, e.g.

Cascade Correlation (Fahlman, & Lebiere 1990)

etc


Network Topology

The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.

Two types of adaptive algorithms can be used:

start from a large network and successively remove nodes and links until network performance degrades;

begin with a small network and introduce new neurons until performance is satisfactory.


Problems and Networks

•Some problems have natural "good" solutions

•Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed

•Networks are general purpose tools.

•Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem

•Tension: tailoring tools for a specific job vs. exploiting a general-purpose learning mechanism


Summary

Multiple layer feed-forward networks

Replace step with sigmoid (differentiable) function

Learn weights by gradient descent on error function

Backpropagation algorithm for learning

Avoid overfitting by early stopping


ALVINN drives 70mph on highways


Use MLP Neural Networks when …

(vectored) Real inputs, (vectored) real outputs

You’re not interested in understanding how it works

Long training times acceptable

Short execution (prediction) times required

Robust to noise in the dataset


Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.

Recognizing printed or handwritten characters

Face recognition

Classification of loan applications into credit-worthy and non-credit-worthy groups

Analysis of sonar and radar signals to determine the nature of the source

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).


Extensions of Backprop Nets

Recurrent architectures

Backprop through time


Elman Nets & Jordan Nets

Updating the context as we receive input

• In Jordan nets we model “forgetting” as well

• The recurrent connections have fixed weights

• You can train these networks using good ol’ backprop (a forward-step sketch follows the diagrams below)

[Diagrams: Elman net and Jordan net, each with Output, Hidden, Context, and Input layers; the recurrent connections into the Context layer carry fixed weights (1, and α in the Jordan net).]
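A minimal sketch of one Elman-style time step (the context is a copy of the previous hidden state), trainable with ordinary backprop as the slide says; all names are illustrative, and the Jordan-style variant is only hinted at in a comment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, context, W_in, W_ctx, W_out):
    """One time step of an Elman net; weight names are illustrative."""
    hidden = sigmoid(W_in @ x + W_ctx @ context)  # input and context drive the hidden layer
    output = sigmoid(W_out @ hidden)              # ordinary feed-forward output
    new_context = hidden.copy()                   # fixed-weight copy (weight = 1)
    # Jordan-style variant (sketch): context fed from the output with decay alpha,
    # e.g. new_context = alpha * context + output
    return output, new_context
```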


Recurrent Backprop

• we’ll pretend to step through the network one iteration at a time

• backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent); a small sketch follows the figure below

[Figure: a three-node recurrent network (a, b, c) with weights w1, w2, w3, w4, unrolled for 3 iterations; each unrolled copy repeats the same weights w1–w4.]
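A tiny sketch of the "average equivalent weights" step, assuming the per-copy gradients for one tied weight have already been collected by ordinary backprop on the unrolled network; names are illustrative.

```python
# "Average equivalent weights": after unrolling, the same weight appears in every
# copy of the network; apply one update using the mean of its per-copy gradients.
# grads_per_copy is a list of equally shaped gradients (illustrative).
def averaged_update(W, grads_per_copy, eta=0.1):
    avg_grad = sum(grads_per_copy) / len(grads_per_copy)
    return W - eta * avg_grad
```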


Connectionist Models in Cognitive Science

[Diagram: kinds of connectionist models in cognitive science: Structured vs. PDP (Elman), Neural vs. Conceptual, Existence vs. Data Fitting, and Hybrid.]


5 levels of Neural Theory of Language

[Diagram: the five levels, from abstract to concrete: Cognition and Language; Computation; Structured Connectionism; Computational Neurobiology; Biology. Topics placed along the levels include neural development, triangle nodes, neural nets and learning, spatial relations, motor control, metaphor, SHRUTI, grammar, and psycholinguistic experiments; course checkpoints (Quiz, Midterm, Finals) are marked alongside, and the vertical axis is labeled abstraction.]


The Color Story: A Bridge between Levels of NTL

(http://www.ritsumei.ac.jp/~akitaoka/color-e.html)


A Tour of the Visual System

• two regions of interest:

– retina

– LGN


The Physics of Light

Light: electromagnetic energy whose wavelength is between 400 nm and 700 nm. (1 nm = 10⁻⁹ meter)

[Figure: the electromagnetic spectrum, from about 10⁻¹⁴ meters to 10⁶ meters (cosmic rays, gamma rays, X-rays, UV, visible light, infra-red, microwaves, TV, radio), with the visible spectrum (wavelength 400–700 nm) expanded.]

© Stephen E. Palmer, 2002


The Physics of Light

Some examples of the spectra of light sources

[Figure: four panels plotting # photons vs. wavelength (400–700 nm): A. Ruby laser, B. Gallium phosphide crystal, C. Tungsten lightbulb, D. Normal daylight.]

© Stephen E. Palmer, 2002


The Physics of Light

Some examples of the reflectance spectra of surfaces

[Figure: % photons reflected vs. wavelength (400–700 nm) for red, yellow, blue, and purple surfaces.]

© Stephen E. Palmer, 2002


The Psychophysical Correspondence

There is no simple functional description for the perceived color of all lights under all viewing conditions, but ...

A helpful constraint: Consider only physical spectra with normal distributions

[Figure: a normal-distribution spectrum, # photons vs. wavelength (400–700 nm), annotated with its mean, variance, and area.]

© Stephen E. Palmer, 2002


Physiology of Color Vision

© Stephen E. Palmer, 2002

Two types of light-sensitive receptors:

Cones: cone-shaped, less sensitive, operate in high light, color vision

Rods: rod-shaped, highly sensitive, operate at night, gray-scale vision

[Figure: drawings of a cone and a rod.]


The Microscopic View


Rods and Cones in the Retina

(http://www.iit.edu/~npr/DrJennifer/visual/retina.html)


What Rods and Cones Detect

Notice how they aren’t distributed evenly, and the rod is more sensitive to shorter wavelengths


© Stephen E. Palmer, 2002

[Figure: absorption spectra of the three cone types: relative absorbance (%) vs. wavelength (400–650 nm), with peaks near 440 nm (S), 530 nm (M), and 560 nm (L).]

Three kinds of cones: Absorption spectra

Implementation of Trichromatic theory

Physiology of Color Vision

Opponent Processes: R/G = L-M; G/R = M-L; B/Y = S-(M+L); Y/B = (M+L)-S
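As a tiny worked illustration of those opponent formulas, treating S, M, L as plain cone-activation numbers (the function name and the idea of scalar inputs are assumptions for illustration):

```python
# Opponent channels computed from cone activations, using the slide's formulas.
# S, M, L are illustrative scalar cone responses.
def opponent_channels(S, M, L):
    return {
        "R/G": L - M,
        "G/R": M - L,
        "B/Y": S - (M + L),
        "Y/B": (M + L) - S,
    }
```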


© Stephen E. Palmer, 2002

Double Opponent Cells in V1

Physiology of Color Vision

[Figure: receptive fields of double-opponent cells: Red/Green cells (G+R-, R+G-) and Blue/Yellow cells (Y+B-, B+Y-).]


Color Blindness

Not everybody perceives colors in the same way!

What numbers do you see in these displays?

© Stephen E. Palmer, 2002


Theories of Color Vision

A Dual Process Wiring Diagram

© Stephen E. Palmer, 2002

[Wiring diagram: the S, M, L cone signals of the trichromatic stage are combined, through excitatory and inhibitory connections, into red/green (R+G-, G+R-), blue/yellow (B+Y-, Y+B-), and white/black (W+Bk-, Bk+W-) channels of the opponent-process stage (e.g. R+ from L-M, W+ from S+M+L).]


Color Naming

© Stephen E. Palmer, 2002

Basic Color Terms (Berlin & Kay)

Criteria:

1. Single words -- not “light-blue” or “blue-green”

2. Frequently used -- not “mauve” or “cyan”

3. Refer primarily to colors -- not “lime” or “gold”

4. Apply to any object -- not “roan” or “blond”


Color Naming

© Stephen E. Palmer, 2002

BCTs in English

Red, Green, Blue, Yellow, Black, White

Gray, Brown, Purple, Orange*, Pink


Color Naming

© Stephen E. Palmer, 2002

Five more BCTs in a study of 98 languages

Light-Blue, Warm, Cool, Light-Warm, Dark-Cool


The WCS Color Chips

• Basic color terms:
– Single word (not blue-green)

– Frequently used (not mauve)

– Refers primarily to colors (not lime)

– Applies to any object (not blonde)

FYI:

English has 11 basic color terms


Results of Kay’s Color Study

If you group languages by the number of basic color terms they have, then as the number of color terms increases, the additional terms specify focal colors.

Stage I: [W or R or Y], [Bk or G or Bu]
Stage II: W, [R or Y], [Bk or G or Bu]
Stage IIIa: W, [R or Y], [G or Bu], Bk
Stage IIIb: W, R, Y, [Bk or G or Bu]
Stage IV: W, R, Y, [G or Bu], Bk
Stage V: W, R, Y, G, Bu, Bk
Stage VI: W, R, Y, G, Bu, Bk, Y+Bk (Brown)
Stage VII: W, R, Y, G, Bu, Bk, Y+Bk (Brown), R+W (Pink), R+Bu (Purple), R+Y (Orange), Bk+W (Grey)


Color Naming

© Stephen E. Palmer, 2002

Typical “developmental” sequence of BCTs

2 Terms: Light-warm, Dark-cool
3 Terms: White, Warm, Dark-cool
4 Terms: White, Warm, Black, Cool
5 Terms: White, Red, Yellow, Black, Cool
6 Terms: White, Red, Yellow, Black, Green, Blue


Color Naming

© Stephen E. Palmer, 2002

Studied color categories in two ways

Boundaries

Best examples

(Berlin & Kay)


Color Naming

© Stephen E. Palmer, 2002

MEMORY : Focal colors are remembered better than nonfocal colors.

LEARNING: New color categories centered on focal colors are learned faster.

CATEGORIZATION: Focal colors are categorized more quickly than nonfocal colors.

(Rosch)


Color Naming

FUZZY SETS AND FUZZY LOGIC (Zadeh)

[Figure: degree of membership (0 to 1.0) in the category "Green" as a function of hue, with regions labeled not-at-all, a little bit, sorta, very, and extremely.]

Fuzzy set theory (Zadeh)

A fuzzy logical model of color naming (Kay & McDaniel)

© Stephen E. Palmer, 2002


Color Naming

© Stephen E. Palmer, 2002

[Figure: degree of membership (0 to 1) vs. hue for the primary color categories Blue, Green, Yellow, and Red, each curve peaking at its focal color (focal blue, focal green, focal yellow, focal red).]

“Primary” color categories


Color Naming

© Stephen E. Palmer, 2002

“Primary” color categories

Red, Green, Blue, Yellow, Black, White


Color Naming

© Stephen E. Palmer, 2002

“Derived” color categories.

[Figure: degree-of-membership curves vs. hue for Yellow and Red; their fuzzy-logical "ANDf" yields the membership curve for the derived category Orange.]


Color Naming

© Stephen E. Palmer, 2002

“Derived” color categories

Orange = Red ANDf Yellow
Purple = Red ANDf Blue
Gray = Black ANDf White
Pink = Red ANDf White
Brown = Yellow ANDf Black
(Goluboi = Blue ANDf White)


Color Naming

© Stephen E. Palmer, 2002

“Composite” color categories

Fuzzy logical "ORf"

[Figure: degree-of-membership curves vs. hue for Yellow and Red; their fuzzy-logical "ORf" yields a composite category spanning both.]

Warm = Red ORf Yellow
Cool = Blue ORf Green
Light-warm = White ORf Warm
Dark-cool = Black ORf Cool
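A small sketch of how fuzzy membership values might compose; using min for ANDf and max for ORf is the classic Zadeh choice and an assumption here, since Kay & McDaniel define their own operators.

```python
# Fuzzy composition of color-category membership (sketch).
# min/max are the standard Zadeh operators, assumed here for illustration;
# Kay & McDaniel's ANDf / ORf are their own definitions.
def and_f(mu_a, mu_b):
    return min(mu_a, mu_b)   # derived category, e.g. Orange from Red and Yellow

def or_f(mu_a, mu_b):
    return max(mu_a, mu_b)   # composite category, e.g. Warm from Red and Yellow

# Example: a hue with membership 0.4 in Red and 0.7 in Yellow
orange = and_f(0.4, 0.7)     # 0.4
warm = or_f(0.4, 0.7)        # 0.7
```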


Color Naming

FUZZY LOGICAL MODEL OF COLOR NAMING (Kay & McDaniel)

Only 16 Basic Color Terms in Hundreds of Languages:

PRIMARY: Red, Green, Blue, Yellow, Black, White

DERIVED: Orange, Purple, Brown, Pink, Gray

COMPOSITE: [Light-blue], [Warm], [Cool], [Light-warm], [Dark-cool]

[Figure: degree-of-membership curves illustrating fuzzy sets (Yellow), fuzzy ANDf (Orange = Yellow ANDf Red), and fuzzy ORf (Warm = Yellow ORf Red).]

© Stephen E. Palmer, 2002