Unsupervised learning: simple competitive learning Biological background: Neurons are wired...
-
Upload
amelia-holmes -
Category
Documents
-
view
223 -
download
0
description
Transcript of Unsupervised learning: simple competitive learning Biological background: Neurons are wired...
Unsupervised learning: simple competitive learning
Biological background: Neurons are wired topographically, nearby neurons connect to nearby neurons. In visual cortex, neurons are organized in functional columns.
Ocular dominance columns: one region responds to one eye input
Orientation columns: one region responds to one direction
Self-organization as a principle of neural development
Competitive learning --finite resources: outputs ‘compete’ to see which will win via
inhibitory connections between them --aim is to automatically discover statistically salient features of
pattern vectors in training data set: feature detectors --can find clusters in training data pattern space which can be used to classify new patterns --basic structure:
inputs
Outputs
input layer fully connected to output layer input to output layer connection feedforward output layer compares activation's of units following presentation of pattern vector x via (sometimes virtual) inhibitory lateral connections
winner selected based on largest activation winner- takes-all (WTA) linear or binary activation functions of output units. Very different from previous (supervised) learning where we pay our attention to input-output relationship, here we will look at the pattern of connections (weights)
elsejiiallforyyif
v
iiiallforyy
ww
where
xwy
yx
ijj
ii
d
jjii
j
d
jjii
i
0,1
:definingby function binary a thismake could and*,
:if winner theis *ineuron say that WeM. ..., 1, i allfor
1
:
:have we outputs M andor input vect ldimensiona d a have weif formally, More
*
1
2
1
Simple competitive learning algorithm
Initialise all weights to random values and normalise (so that || w ||=1)loop until stopping criteria satisfied .
choose pattern vector x from training setCompute distance between pattern and weight vectors
|| Xi - W ||
find output unit with largest activation ie ‘winner’ i* with the property that || Xi* - W ||< || X i - W ||
update the weight vector of winning unit only withW(t+1) = W(t)+(t) (Xi - W (t))
end loop
NB choosing the largest output is the same as choosing the vector w that is nearest to x since:
a) w.x = wTx = ||w|| ||x|| cos(angle between x and w)
= ||w|| ||x|| if the angle is 0
b) ||w - x||2 = (w1-x1) 2 + (w2-x2) 2
= w1 2 + w2 2 + x1 2 + x2 2 – 2(x1w1 + x2w2)
= ||w|| 2 + ||x|| 2 - 2 wTx
Since ||w|| = 1 and ||x|| is fixed, minimising ||w - x||2 is equivalent to maximising wTx
Therefore, as we only really want the angle, WLOG only consider inputs with ||x|| =1
EG = 0.5
W1 = (-1, 0)
W2 = (0,1)
X = (1, 0)
Y1= -1, y2 = 0 so y2 wins
W2-> (0.5, 0.5)
Y1= -1, y2 = 0.5 so y2 wins
W2-> (0.75, 0.25)
Y1= -1, y2 = 0.75 so y2 wins
W2-> (0.875, 0.125)
Etc etc
How does competitive learning work?can view the points of all input vectors as in contact with surface of hypersphere (in 2D: a circle)
distance between points on surface of hypersphere = degree of similarity between patterns
Outputs
inputs
Initial state
inputs
Outputs
with respect to the incoming signal, the weight of the yellow line is updated so that the vector is rotated towards the incoming signal
W(t+1) = W(t)+(t) (Xi - W (t))Thus the weight vector becomes more and more similar to the input ie a feature detector for that input
When there are more input patterns, can see each weight vector migrates from initial position (determined randomly) to centre of gravity of a cluster of the input vectors.
Thus: Discover Clusters
inputs
Outputs
Eg on-line k-means. Nearest centre updated by: centrei(n+1) = centrei + (t) (Xi - centrei(t))
Error minimization viewpoint: Consider error minimization across all patterns N in training set. Aim is to decrease error E
E = ||Xi - W (t)|| 2
For winning unit k when pattern is Xi the direction the weights need to change in a direction (so as to perform gradient descent) determined by (from previous lectures):
W(t+1) = W(t)+(t) (Xi - W (t))
Which is the update rule for supervised learning (remembering that in supervised learning, W is O, the output of the neurons)
ie replace W by O and we recover the adaline/simple gradient descent learning rule
Enforcing fairer competition
Initial position of weight vector of an output unit may be in region with few, if any, patterns (cf problems of k-means)
Many never or rarely become a winner and so weight vector may not be updated preventing it finding richer part of pattern spaceDEAD UNIT
Or, initial position of a weight vector may be close to a large number of patterns while most other unit’s weights are more distant CONTINUAL WINNER, but weights will change little over time and prevent other units competing
More efficient to ensure a fairer competition where each unit has an equal chance of representing some part of training data
Leaky learning
modify weights of both winning and losing units but at different learning rates
where w(t) >> L(t)
has the effect of slowly moving losing units towards denser regions pattern space.
Many other ways as we will discuss later on
))(())((
)()1(twxtwx
twtwL
w
Vector Quantization:
Application of competitive learning
Idea: Categorize a given set of input vectors into M classes using competitive learning algorithms, and then represent any vector just by the class into which it falls
Important use of competitive learning (esp. in data compressing)divides entire pattern space into a number of separate subspacesset of M units represent set of prototype vectors: CODEBOOK (cf k-means)
new pattern x is assigned to a class based on its closeness to a prototype vector using Euclidean distances LABELED(cf k-nearest neighbours, kernel density estimation)
Example (VQ)
See:http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Topographic maps
Extend the ideas of competitive learning to incorporate the neighborhood around inputs and neurons
We want a nonlinear transformation of input pattern space onto output feature space which preserves neighbourhood relationship between the inputs -- a feature map where nearby neurons respond to similar inputs
Eg Place cells, orientation columns, somatosensory cells etc
Idea is that neurons selectively tune to particular input patterns in such a way that the neurons become ordered with respect to each other so that a meaningful coordinate system for different input features is created
Known as a Topographic map: spatial locations are indicative of the intrinsic statistical features of the input patterns: ie close in the input => close in the output
EG: When the yellow input is active, the yellow neuron is the winner so when the orange input input is active, we want the orange neuron to be the winner
2 layer network each cortical unit fully connect to visual space via Hebbian units
Interconnections of cortical units described by ‘Mexican-hat’ function (Garbor function): short-range excitation and long-range inhibition
visual space
cortical units
Eg Activity-based self-organization (von der Malsburg, 1973)
incorporation of competitive and cooperative mechanisms to generate feature maps using unsupervised learning networks
Biologically motivated: how can activity-based learning using highly interconnected circuits lead to orderly mapping of visual stimulus space onto cortical surface? (visual-tectum map)
After learning (see original paper for details), a topographic map appears.
However, input dimension is the same as output dimension
Kohonen simplified this model and called it Kohonen’s self-organizing map (SOM) algorithm
More general as it can perform dimensionality reduction
SOM can be viewed as a vector quantisation type algorithm
Kohonen’s self-organizing map (SOM) Algorithm
set time, t=0 initialise learning rate (0) initialise size of neighbourhood function initialize all weights to small random values
inputs
Loop until stopping criteria satisfied
Choose pattern vector x from training setCompute distance between pattern and weight vectors for each output unit || x - Wi(t) ||
Find winning unit from minimum distance i*: || x - Wi*(t) || = min || x - Wi(t) || Update weights of winning and neighbouring units using neighbourhood functions
wij(t+1)=wij(t)+ (t) h(i,i*,t) [xj- wij(t)]
note that when h(i,i*)=1 if i=i* and 0 otherwise, we have the simple competitive learning
Decrease size of neighbourhood when t is large we have h(i,i*,t)=1 if i=i* and 0 otherwise
Decrease learning rate (t) when t is large we have (t) ~ 0 Increment time t=t+1
end loop
Generally, need a LARGE number of iterations
Neighbourhood function
Relates degree of weight update to distance from winning unit, i* toother units in lattice. Typically a Gaussian function
When i=i* , distance is zero so h=1
Note h decreases monotonically with distance from winning unit and is symmetrical about winning unit
Important that the size of neighbourhood (width) decreases over time to stabilize mapping, otherwise would get noisy version of competitive learning
2
2
)(2||*||exp)*,,(
tiitiih
Biologically, development (self-organization) is governed by the gradient of a chemical element, which generally preserves the topological relation.
Kohonen (and others) has shown that the neighborhood function could be implemented by a difusing neuromodulatory gas such as nitric oxide (which has been shown to be necessary for learning esp. spatial learning)
Thus there is some biological motivation for the SOM
Weight Adaptation two phases during training: ordering and convergence
Ordering Phase
-Topographic ordering of weight vectors of output units to untangle map so that weight vectors are evenly spread
Example:
W2
W3
W1
W1 is closest to input x unit 1 is winner
W3 is closer to x than W2
x
W2
W3
W1 W2 moves closer to x since h(1,2) > 0;W3 does not move as h(1,3) = 0
After x is shown many times W2 moves closer to W1
Finally, the weights will have the correct order
Completion of ordering phase
W2 W3
W1
Convergence phase
Topographic map adapts to probability density of input patterns in training data
However, it is important that ordering is maintained Thus learning rate should be small to prevent large movements of weight vector but non-zero to avoid pathological states (getting caught in odd minima etc)
Neighbourhood size should also start off relatively small containing only close neighbours and eventually shrinking to nearest neighbours (or less)
Problems
SOM tends to overrepresent regions of low density and underrepresent regions of high density
where p(x) is the distribution density of input data
Also, can have problems when there is no natural 2d ordering of the data.
3/2)(xpWx
Visualizing network training
Consider simple 2D lattice of units
Six output units in 2 D space: set up neighbourhood function according to some two dimensional structure, connect units which are close together with a line
31
4 5 6
2
31
4 5 62
Initial random distribution of weight vectors of each unit in 2D
31
4 5 6
2 Ordered weight vectors after training
Example 2D to 2D
Example (JAVA)
See:http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Semantic Maps
In the previous demos etc, we only viewed the feature map by visualising the weight vectors of the neurons.
Another method of visualisation is to view class labels assigned to neurons in a 2d lattice depending on how they respond to (unseen) test patterns: contextual or semantic maps
In this way the SOM creates coherent regions by grouping sets of labels with similar features together
Useful tool in data mining and for visualising complex high dimensional data
Example: Animal and attributes, 16 animals with 13 attributes
Animal small 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0is medium 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 big 2 legs 4 legshas hair hooves mane see Haykin’s book feathers huntlikes runto fly swim
Dov
eH
enD
uck
Goo
seO
wl
Haw
kEa
gle
Fox
Dog
Wol
fC
atTi
ger
Lion
Hor
se
Zebr
aco
w
lion=(0,0,0,0,0,0,0,0,0, 0.2, 0,0,0, 0,0,1,0,1,1,0,1,0,1,1,0,0)
0,0,0,0,0,0,0,0,0, 0.2, 0,0,0 =animal code
Train for 2000 interations of all animals (until convergence)
Next the code for each animal is presented but with the animal code all zeroes. We then assign the neuron to the animal for which it gives the largest response
10x10 neuronsoutput
29 D inputs