Connectionist Models: The Briefest Course


Transcript of Connectionist Models: The Briefest Course

Page 1: Connectionist Models: The Briefest Course

Connectionist Models: The Briefest Course

Page 2: Connectionist Models: The Briefest Course

What do cows drink?

Symbolic AI

ISA(cow, mammal)

ISA(mammal, animal)

Rule1: IF animal(X) AND thirsty(X) THEN lack_water(X)

Rule2: IF lack_water(X) THEN drink_water(X)

Conclusion: Cows drink water.
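For contrast, here is a minimal, purely illustrative sketch of the kind of forward-chaining inference a symbolic system would run to reach this conclusion. The engine and the fact encoding below are assumptions for illustration, not any particular AI system:

```python
# A toy forward-chaining engine sketching the symbolic derivation above.
# Facts are tuples: (predicate, args...). All names are illustrative.
facts = {("isa", "cow", "mammal"), ("isa", "mammal", "animal"), ("thirsty", "cow")}

def forward_chain(facts):
    facts = set(facts)
    while True:
        new = set()
        for f in facts:
            # ISA transitivity, and "X is an animal if isa(X, animal)"
            if f[0] == "isa":
                new |= {("isa", f[1], g[2]) for g in facts
                        if g[0] == "isa" and g[1] == f[2]}
                if f[2] == "animal":
                    new.add(("animal", f[1]))
            # Rule1: IF animal(X) AND thirsty(X) THEN lack_water(X)
            if f[0] == "animal" and ("thirsty", f[1]) in facts:
                new.add(("lack_water", f[1]))
            # Rule2: IF lack_water(X) THEN drink_water(X)
            if f[0] == "lack_water":
                new.add(("drink_water", f[1]))
        if new <= facts:          # nothing new can be derived: stop
            return facts
        facts |= new

print(("drink_water", "cow") in forward_chain(facts))  # True: cows drink water
```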

Page 3: Connectionist Models: The Briefest Course

What do cows drink? Connectionism:

[Figure: activation spreads among the nodes COW, MILK, and DRINK within about 100 ms.]

What interests Connectionism

What interests Symbolic AI

Page 4: Connectionist Models: The Briefest Course

What do cows drink? Connectionism:

[Figure: activation spreads among the nodes COW, MILK, and DRINK within about 100 ms.]

These neurons are activated without the word "milk" ever having been heard.

Page 5: Connectionist Models: The Briefest Course

Artificial Neural Networks

“Systems that are deliberately constructed to make use of some of the organizational principles that are felt to be used in the human brain.” (Anderson & Rosenfeld, 1990, Neurocomputing, p. xiii)

Page 6: Connectionist Models: The Briefest Course

The Origin of Connectionist Networks: Major Dates

• William James (1892): the idea of a network of associations in the brain
• McCulloch & Pitts (1943, 1947): the "logical" neuron
• Hebb (1949): The Organization of Behavior: Hebbian learning and the formation of cell assemblies
• Hodgkin & Huxley (1952): description of the chemistry of neuron firing
• Rochester, Holland, Haibt, & Duda (1956): first real neural network computer model
• Rosenblatt (1958, 1962): the perceptron
• Minsky & Papert (1969): bring the walls down on perceptrons
• Hopfield (1982, 1984): the Hopfield network, settling to an attractor
• Kohonen (1982): the unsupervised learning network
• Rumelhart & McClelland and the PDP Research Group (1986): backpropagation, etc.
• Elman (1990): the simple recurrent network
• Hinton (1980–present): just about everything else...

Page 7: Connectionist Models: The Briefest Course

McCulloch & Pitts (1943, 1947)

[Figure: a threshold unit with binary inputs (1, 0, 0) and a single output; the unit fires only if the summed input reaches the threshold T.]

The real neuron was far, far more complex, but they felt that they had captured its essence. Neurons were the biological equivalent of logic gates.

Conclusion: Collections of neurons, appropriately wired together, can do logical calculus. Cognition is just a complex logical calculus.

The McCulloch & Pitts representation of the “essential” neuron was that it was a logic gate (here an AND gate)
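A McCulloch-Pitts unit is easy to state in code. The sketch below is a hypothetical minimal rendering (not their original formalism), showing a threshold unit computing AND as in the figure:

```python
# A McCulloch-Pitts unit: output 1 iff the summed input reaches threshold T.
def mp_neuron(inputs, threshold):
    return 1 if sum(inputs) >= threshold else 0

# With threshold 2 and two binary inputs, the unit computes AND:
# only the input (1, 1) makes it fire.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", mp_neuron([x1, x2], threshold=2))
```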

Page 8: Connectionist Models: The Briefest Course

Hebb (1949): Connecting changes in neurons to cognition

Hebb asked: What changes at the neuronal level might make possible our acquisition of high-level (semantic) information? His answer: a learning rule of synaptic reinforcement (Hebbian learning).

When neuron A fires and is followed immediately by the firing of neuron B, the synapse between the two neurons is strengthened, i.e., the next time A fires, it will be easier for B to fire.
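As a sketch, Hebb's rule amounts to a one-line weight update; the learning rate eta below is an illustrative assumption:

```python
# A minimal sketch of Hebbian learning: the weight between two units grows
# when both are active together. The learning rate eta is an assumption.
def hebbian_update(w, pre, post, eta=0.1):
    """w: current synaptic weight; pre, post: activations of neurons A and B."""
    return w + eta * pre * post

w = 0.0
for _ in range(5):                 # A and B repeatedly fire together
    w = hebbian_update(w, pre=1.0, post=1.0)
print(w)                           # 0.5: firing A now drives B more strongly
```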

Page 9: Connectionist Models: The Briefest Course

Connecting neural function to behavior

[Figure: levels of modeling arranged from high to low: high-level models of human cognition and behavior; neuronal population coding models; low-level models of single neurons; even lower-level models of synapses and ion channels. The "Hebbian Gap" lies between the high-level models of cognition and the neuron-level models.]

Page 10: Connectionist Models: The Briefest Course

Cell assemblies: Closing the Hebbian Gap

Cell assemblies at the neuronal level give rise to categories at the semantic level.

The formation of cell assemblies involves:

• persistence of activity without external input
• recruitment: the creation of a new cell assembly (via Hebbian learning) corresponding to a new concept
• fractionation: the creation of new cell assemblies from an old one, corresponding to the refinement of a concept

Cell assemblies can overlap: e.g., the cell assembly associated with "dog" will overlap with those associated with "wolf", "cat", etc.

Page 11: Connectionist Models: The Briefest Course

A Hebbian Cell Assembly

By means of the Hebbian Learning Rule, a circuit of continuously firing neurons could be learned by the network.

The continuing activation in this cell assembly does not require external input.

The activation of the neurons in this circuit would correspond to the perception of a concept.

Page 12: Connectionist Models: The Briefest Course

A Cell Assembly

Input from the environment

Page 13: Connectionist Models: The Briefest Course

A Cell Assembly

Input from the environment

Page 14: Connectionist Models: The Briefest Course

A Cell Assembly

Input from the environment

Page 15: Connectionist Models: The Briefest Course

A Cell Assembly

Input from the environment

Page 16: Connectionist Models: The Briefest Course

A Cell Assembly

Notice that the input from the environment is gone...

Page 17: Connectionist Models: The Briefest Course

A Cell Assembly

Page 18: Connectionist Models: The Briefest Course

A Cell Assembly

Page 19: Connectionist Models: The Briefest Course

Rochester, Holland, Haibt, & Duda (1956)

• The first real simulation attempting to implement the principles outlined by Hebb on actual computer hardware.

• Attempted to simulate the emergence of cell assemblies in a small network of 69 neurons. They found that everything became active in their network.

• They decided that they needed to include inhibitory synapses. (Hebb only discussed excitatory synapses). This worked and cell assemblies did, indeed, form.

• Probably the earliest example in neural network modeling of a network that made a prediction (i.e., that inhibitory synapses are needed to form cell assemblies) that was later confirmed in real brain circuitry.

Page 20: Connectionist Models: The Briefest Course

Rosenblatt (1958, 1962): The Perceptron

• Rosenblatt’s perceptron could learn to associate inputs with outputs.

• He believed this was how the visual system learned to associate low-level visual input with higher level concepts.

• He introduced a learning rule (weight-change algorithm) that allowed the perceptron to learn associations.

Page 21: Connectionist Models: The Briefest Course

The elementary perceptron

Consists of:

• two layers of nodes (one layer of weights)

• only feedforward connections

• a threshold function on each output unit

• a linear summation of the weights times inputs

Page 22: Connectionist Models: The Briefest Course

[Figure: a two-layer perceptron. Inputs x1 and x2 feed through weights w1 and w2 into a single output unit with threshold T; the actual output y is compared with the desired output t (the "teacher").]

The output rule is: if Σi wixi > T (the threshold), then y = 1; else y = 0.

The perceptron (Widrow-Hoff) learning rule (weight-change rule) is:

wnew = wold + η·x·(t − y)

where η is the learning constant, with 0 < η < 1.
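A runnable sketch of the elementary perceptron with this learning rule, assuming a fixed threshold T = 0.5 and learning constant η = 0.1 (both illustrative), trained here on the linearly separable OR function:

```python
# Elementary perceptron with the Widrow-Hoff rule: w_new = w_old + eta*x*(t - y).
def train_perceptron(samples, eta=0.1, threshold=0.5, epochs=50):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, t in samples:
            y = 1 if w[0]*x[0] + w[1]*x[1] > threshold else 0   # output rule
            w = [wi + eta * xi * (t - y) for wi, xi in zip(w, x)]
    return w

OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(OR)
for x, t in OR:
    y = 1 if w[0]*x[0] + w[1]*x[1] > 0.5 else 0
    print(x, "->", y, "(target", t, ")")   # the learned weights reproduce OR
```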

Page 23: Connectionist Models: The Briefest Course

[Figure: a perceptron whose inputs xi are the pixels I of an image of two crossed straight lines, connected by weights wi to an output representing the character "X".]

This perceptron learns to associate the visual input of two crossed straight lines with the character “X”. In other words, the output of the network will be the character “X”.

Page 24: Connectionist Models: The Briefest Course

Generalization

[Figure: the same perceptron, now presented with a degraded image of the crossed lines.]

The real image in the world is degraded, but if the network has already learned to correctly identify the original complete “X”, it will recognize the degraded X as being an “X”.

Page 25: Connectionist Models: The Briefest Course

Fundamental limitations of the perceptron

Minsky & Papert (1969) showed that the Rosenblatt two-layer perceptron had some fundamental limitations: They could only classify linearly separable sets.

[Figure: two scatter plots of X's and Y's. "This:" shows an arrangement in which a straight line separates the X's from the Y's; "But not this:" shows an arrangement that no straight line can separate.]

Page 26: Connectionist Models: The Briefest Course

The (infamous) XOR problem

• Minsky and Papert showed there were a number of extremely simple patterns that no perceptron could learn, including the logic function XOR (exclusive OR).

• Since cognition supposedly required elementary logical operations, this severely weakened the perceptron’s claim to be able to do general cognition.

XOR:

Input    Output
0 0      0
0 1      1
1 0      1
1 1      0

There is no set of weights w1 and w2 and threshold T such that the perceptron below can learn the above XOR function.

Page 27: Connectionist Models: The Briefest Course

[Figure: the same two-layer perceptron: inputs x1 and x2, weights w1 and w2, output y with threshold T, and the desired output t (the "teacher").]

The activation arriving at the output node is w1x1 + w2x2. If w1x1 + w2x2 > T, then we output 1, otherwise 0.

But w1x1 + w2x2 = T is a straight line if we consider x1 and x2 to be the axes of a coordinate system.

Page 28: Connectionist Models: The Briefest Course

[Figure: the four XOR points plotted in the (x1, x2) plane: (0,0) and (1,1) require output 0, while (0,1) and (1,0) require output 1. A candidate line w1x1 + w2x2 = T is drawn and fails. NO!]

No values of w1, w2, and T will form a straight line w1x1 + w2x2 = T with (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.

Pages 29-32: Connectionist Models: The Briefest Course

[The same figure repeats across these four slides, with a different candidate line tried each time; every attempt fails to put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other. NO!]

Page 33: Connectionist Models: The Briefest Course

The Revival of the (Multi-layered) Perceptron: The Connectionist Revolution (1985) and the Statistical Nature of Cognition

By the early 1980s, Symbolic AI had hit a wall. "Simple" tasks that humans do (almost) effortlessly (face, word, and speech recognition, retrieving information from incomplete cues, generalizing, etc.) proved to be notoriously hard for symbolic AI.

• Minsky (1967): “Within a generation the problem of creating ‘artificial intelligence’ will be substantially solved.”

• Minsky (1982): “The AI problem is one of the hardest ever undertaken by science.”

Page 34: Connectionist Models: The Briefest Course

By the early 1980s, the statistical nature of much of cognition had become ever more apparent.

Three factors contributed to the revival of the perceptron:

• the radical failure of AI to achieve the goals announced in the 1960s

• the growing awareness of the statistical and “fuzzy” nature of cognition

• the development of improved perceptrons, capable of overcoming the linear separability problems brought to light by Minsky & Papert.

Page 35: Connectionist Models: The Briefest Course

Advantages of Connectionist Models compared to Symbolic AI

• Learning: Specifically designed to learn.

• Pattern completion of familiar patterns.

• Generalization: Can generalize to novel patterns based on previously learned patterns.

• Retrieval with partial information: Can retrieve information in memory based on nearly any attribute of the representation.

• Massive parallelism.

The 100-step processing constraint (Feldman & Ballard, 1982): neural hardware is too slow and too unreliable for sequential models of processing. We carry out very complex processing in a few hundred ms, yet transmission across a synapse (a gap of only ~10^-6 in.) takes about 1 ms. Complex tasks must therefore be accomplished in no more than a few hundred serial steps, which is impossible for a sequential model; the processing must be massively parallel.

• Graceful degradation: when they are damaged, their performance degrades gradually.

Page 36: Connectionist Models: The Briefest Course

Real Brains and Connectionist Networks

Some characteristics of real brains that serve as the basis of ANN design:

• Neurons receive input from lots of other neurons.
• Massive parallelism: neurons are slow, but there are lots of them.
• Learning involves modifying the strength of synaptic connections.
• Neurons communicate with one another via activation or inhibition.
• Connections in the brain have a clear geometric and topological structure.
• Information is continuously available to the brain.
• Graceful degradation of performance in the face of damage and information overload.
• Control is distributed, not central (i.e., no central executive).
• One primary way of understanding what the brain does is relaxation to attractors.

Page 37: Connectionist Models: The Briefest Course

General principles of all connectionist networks

• a set of processing units
• a state of activation defined over all of the units
• an output function ("squashing function") for each unit: transforms unit activation into outgoing activation
• a connectivity pattern with two features: the weights of the connections and the locations of the connections
• an activation rule for combining the inputs impinging on a unit to produce a total activation for the unit
• a learning rule, by which the connectivity pattern is changed
• an environment in which the system operates (i.e., how the input/output is represented and given to/taken from the system)
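Schematically, these components map onto code as follows; this is a sketch with illustrative names, not any particular framework:

```python
import math

# Schematic mapping of the components listed above (names illustrative).
class Unit:
    def __init__(self, incoming):
        self.incoming = incoming       # connectivity pattern: (source unit, weight) pairs
        self.activation = 0.0          # state of activation

    def activation_rule(self):
        # combine the inputs impinging on this unit into a total activation
        self.activation = sum(w * src.output() for src, w in self.incoming)

    def output(self):
        # output ("squashing") function: here, a logistic transform
        return 1.0 / (1.0 + math.exp(-self.activation))

# The learning rule (changing the weights stored in `incoming`) and the
# environment (how input/output patterns are presented) complete the list.
```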

Page 38: Connectionist Models: The Briefest Course

Knowledge storage and Learning

• Knowledge storage: Knowledge is stored exclusively in the pattern of strengths of the connections (weights) between units. The network stores multiple patterns in the SAME set of connections.

• Learning: The system learns by automatically adjusting the strengths of these weights as it receives information from its environment.

There are no high-level rules programmed into the system. Because all patterns are stored in the same set of connections, generalization, graceful degradation, etc. are relatively easy in connectionist networks. It is also what makes planning, logic, etc. so hard.

Page 39: Connectionist Models: The Briefest Course

Two major classes of networks

• Supervised: Includes all error-driven learning algorithms. The error between the desired output and the actual output determines how to change the weights. This error is gradually decreased by the learning algorithm.

• Unsupervised: There is no error feedback signal. The network automatically clusters the input into categories.

Example: if the network is presented with 100 patterns, half of which are different kinds of ellipses and half of which are different types of rectangles, it would automatically group these patterns into the two appropriate categories. There is no feedback to tell the network explicitly “this is a rectangle” or “this is an ellipse.”

Page 40: Connectionist Models: The Briefest Course

So, how did they solve the problem of linear separability?

ANSWER:

i) By adding another “hidden” layer to the perceptron between the input and output layers,

ii) introducing a differentiable squashing function and

iii) discovering a new learning rule (the “generalized delta rule”)

Page 41: Connectionist Models: The Briefest Course

“Concurrent” learning

Learning a series of patterns: if each pattern in the series is learned to criterion (i.e., completely) one after the other, learning the new patterns will erase the previously learned ones. This is why concurrent learning must be used; otherwise, catastrophic forgetting may occur.

Concurrent learning (one epoch):

- 1st pattern presented to the network: change its weights a little to reduce the error on that pattern;
- 2nd pattern: change its weights a little to reduce the error on that pattern;
- etc.
- last pattern: change its weights a little to reduce the error on that pattern;
- REPEAT until the error for all patterns is below criterion.
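As a sketch, one epoch of concurrent learning is just an interleaved loop; `network`, `update`, and `error` below are illustrative placeholders, not a real API:

```python
# Concurrent (interleaved) training, following the recipe above.
def train_concurrently(network, patterns, criterion=0.01):
    while max(network.error(p) for p in patterns) > criterion:
        for pattern in patterns:       # one epoch: visit every pattern...
            network.update(pattern)    # ...changing the weights a little each time
```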

Page 42: Connectionist Models: The Briefest Course

Backpropagation

[Figure: a three-layer network. Input from the environment enters the input layer (nodes subscripted with k), which connects through weights wjk to the hidden layer (nodes subscripted with j, forming the hidden-layer representation), which connects through weights wij to the output layer (nodes subscripted with i). The actual output is compared with the desired output (the "teacher") to yield the error.]

Training of a backpropagation network:

i) A feedforward activation pass, with activation "squashed" at the hidden layer.
ii) The output is compared with the desired output (= the error signal).
iii) The error signal is "backpropagated" through the network to change the network's weights (by gradient descent).
iv) When the overall error is below a predefined criterion, learning stops.
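Here is a minimal runnable sketch of steps i-iv, training a small backpropagation network on the XOR problem from Page 26. The network size, learning rate, and epoch count are illustrative choices:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2 inputs -> 3 hidden units -> 1 output; each unit has a bias weight.
# (On rare runs the network settles into a local minimum; rerun if so.)
nh = 3
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(nh)]  # [w_x1, w_x2, bias]
W2 = [random.uniform(-1, 1) for _ in range(nh + 1)]                  # hidden->output + bias

def forward(x1, x2):
    h = [sigmoid(w[0]*x1 + w[1]*x2 + w[2]) for w in W1]              # i) feedforward pass,
    y = sigmoid(sum(W2[j]*h[j] for j in range(nh)) + W2[nh])         #    squashed activations
    return h, y

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
eta = 0.5
for epoch in range(20000):
    for (x1, x2), t in xor_data:
        h, y = forward(x1, x2)
        delta_out = (t - y) * y * (1 - y)                    # ii) error signal at the output
        for j in range(nh):                                  # iii) backpropagate the error,
            delta_h = delta_out * W2[j] * h[j] * (1 - h[j])  #      descend the gradient
            W1[j][0] += eta * delta_h * x1
            W1[j][1] += eta * delta_h * x2
            W1[j][2] += eta * delta_h
        for j in range(nh):
            W2[j] += eta * delta_out * h[j]
        W2[nh] += eta * delta_out

# iv) inspect the result (a fixed epoch count stands in for the error criterion)
for (x1, x2), t in xor_data:
    print((x1, x2), "->", round(forward(x1, x2)[1], 2), "(target", t, ")")
```

With the extra hidden layer and a differentiable squashing function, the network is no longer limited to a single straight separating line, which is precisely what defeated the two-layer perceptron.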

Page 43: Connectionist Models: The Briefest Course

Backpropagation networks are excellent function-learners...

Page 44: Connectionist Models: The Briefest Course

...but they also suffer from catastrophic interference.

[Figure: two panels plotting proportion correct (0 to 1.0) on the A-B and A-C lists. Humans: correct recall as a function of learning trials (1, 5, 10, 20) on the A-C list, with A-B performance declining gradually. Backpropagation networks: correct performance as a function of learning epochs (0 to 50) on the A-C list, with A-B performance collapsing almost immediately.]

Page 45: Connectionist Models: The Briefest Course

They can learn to read words aloud (NETtalk, 1987)....

Page 46: Connectionist Models: The Briefest Course

... but they have trouble learning sequences.

Much of our cognition involves learning sequences of patterns. Standard BP networks are fine for learning input-output patterns, but they cannot be used effectively to learn sequences of patterns.

Consider the sequence: A B C D E F G H I. For this sequence we could train a network to associate the following input-output pairs:

A→B, B→C, C→D, D→E, E→F, F→G, G→H, H→I

If we give the network A as its "seed", it will produce B on output, which we then feed back into the network to produce C on output, and so on. Thus, we can reproduce the original sequence.
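As a sketch, this feedback scheme looks like the following, with a plain lookup table standing in for the trained network (purely illustrative):

```python
# The pairwise scheme above; a lookup table stands in for the trained network.
next_item = {"A": "B", "B": "C", "C": "D", "D": "E",
             "E": "F", "F": "G", "G": "H", "H": "I"}

def reproduce(seed, steps):
    sequence = [seed]
    for _ in range(steps):                      # feed each output back in as input
        sequence.append(next_item[sequence[-1]])
    return sequence

print(reproduce("A", 8))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
```

Note that the table, like the network, can store only one successor per letter; that is exactly the weakness the next page exposes.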

Page 47: Connectionist Models: The Briefest Course

But what about context-dependent sequences?

But what if the sequence were: A B C D E F C H I?

Here C is repeated. The technique above would give:

A→B, B→C, C→D, D→E, E→F, F→C, C→H, H→I

But the network could not learn this sequence, since it has no context to distinguish the two different outputs associated with C (for the first occurrence, D; for the second, H).

Page 48: Connectionist Models: The Briefest Course

A “sliding window” solution

Consider a “sliding window” solution to provide the context. Instead of having the network learn single-letter inputs, it will learn two-letter inputs, thus:

AB→C, BC→D, CD→E, DE→F, EF→C, FC→H, CH→I

Now the network is fed AB (here, "A" serves as "context" for "B") as its seed, and it can reproduce the sequence with the repeated C without difficulty: the two occurrences of C now have different two-letter contexts. But what if we needed more than one letter's worth of context, as in a sequence like this:

A B C D E B C H I

Now the network needs another context letter... and so on. Conclusion: the sliding-window technique doesn't work in general.

Page 49: Connectionist Models: The Briefest Course

Elman's solution (1990): The Simple Recurrent Network

[Figure: the simple recurrent network. Input units and context units both feed the hidden units; the hidden units feed the output units; and after each time step the hidden-unit activations are copied back into the context units.]
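A sketch of one forward step of the SRN (all names and dimensions are illustrative): the context units hold a copy of the previous hidden state and are fed in alongside the current input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One forward step of a simple recurrent network (illustrative sketch).
def srn_step(x, context, W_in, W_ctx, W_out):
    # hidden activation: current input plus the copied-back previous hidden state
    h = [sigmoid(sum(W_in[j][i] * x[i] for i in range(len(x))) +
                 sum(W_ctx[j][k] * context[k] for k in range(len(context))))
         for j in range(len(W_in))]
    # output: the network's prediction of the next element
    y = [sigmoid(sum(W_out[o][j] * h[j] for j in range(len(h))))
         for o in range(len(W_out))]
    return y, h   # h is copied into the context units for the next time step
```

At each time step one would call `y, context = srn_step(x, context, ...)`, so the hidden layer always sees where the sequence has been, which is exactly the context the sliding window lacked.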

Page 50: Connectionist Models: The Briefest Course

SRN Bilingual language learning (French, 1998; French & Jacquet, 2004)

BOY LIFTS TOY MAN SEES PEN GIRL PUSHES BALL BOY PUSHES BOOK FEMME SOULEVE STYLO FILLE PREND STYLO GARÇON TOUCHE LIVRE FEMME POUSSE BALLON FILLE SOULEVE JOUET WOMAN PUSHES TOY.... (Note: absence of markers between sentences and between languages.)

Input to the SRN:
- two "micro" languages, Alpha and Beta, 12 words each
- an SVO grammar for each language
- unpredictable language switching

The network tries each time to predict the next element.

We do a cluster analysis of its internal (hidden-unit) representations after the network has seen 20,000 sentences.


Page 51: Connectionist Models: The Briefest Course

Clustering of the internal representations formed by the SRN

[Figure: dendrogram of the hidden-unit representations. The top-level split separates the two languages, Alpha and Beta. Within each language, the words cluster by category: object nouns (LIVRE, STYLO, BALLON, JOUET; BOOK, PEN, BALL, TOY), verbs (VOIT, PREND, POUSSE, SOULEVE; SEES, TAKES, PUSHES, LIFTS), and person nouns (HOMME, FEMME, FILLE, GARCON; MAN, WOMAN, GIRL, BOY).]

N.B. It also works for micro languages with 768 words each

Page 52: Connectionist Models: The Briefest Course

Unsupervised learning: Kohonen networks

Kohonen networks cluster inputs in an unsupervised manner. There are no activation-spreading or summing processes here: Kohonen networks adjust weight vectors to match input vectors.

[Figure: a Kohonen network with a six-node input layer fully connected to two output nodes; wij denotes the weight from input node i to output node j.]
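A sketch of the Kohonen update: find the output node whose weight vector best matches the input, then move that winner toward the input. The learning rate is an illustrative assumption, and a full self-organizing map would also update the winner's neighbors:

```python
# One Kohonen learning step: match weight vectors to input vectors.
def kohonen_step(weights, x, eta=0.1):
    # winner = output node whose weight vector is closest to the input vector
    winner = min(range(len(weights)),
                 key=lambda n: sum((w - xi)**2 for w, xi in zip(weights[n], x)))
    # move the winner's weight vector a little toward the input
    weights[winner] = [w + eta * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner

weights = [[0.2, 0.8, 0.1], [0.9, 0.1, 0.5]]   # two output nodes, three inputs
print(kohonen_step(weights, [1.0, 0.0, 0.6]))  # node 1 wins and moves toward x
```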

Page 53: Connectionist Models: The Briefest Course

The next frontier...

Computational neuroscience will use spiking neurons, and variables such as connection density, firing timing, and synchrony, to better understand human cognitive functions.

We are almost at a point where the population dynamics of large networks of these kinds of simulated neurons can realistically be studied.

Further in the future, neuronal models based on the Hodgkin-Huxley equations of membrane potentials and neuronal firing will be incorporated into our computational models of cognition.

Page 54: Connectionist Models: The Briefest Course

Ultimately...

Gradually, neural network models and the computers they run on will become good enough to give us a deep understanding of neurophysiological processes and their behavioral counterparts and to make precise predictions about them.

They will be used to study epilepsy, Alzheimer’s disease, and the effects of various kinds of stroke, without requiring the presence of human patients.

They will be, in short, like the models used in all of the other hard sciences. Neural modeling and neurobiology will then have achieved a truly symbiotic relationship.