Instance-Based Learning
Soongsil University
Intelligent Systems Lab.
Content
- Motivation: Eager Learning, Lazy Learning, Instance-Based Learning
- k-Nearest Neighbour Learning (kNN)
- Distance-Weighted k-NN
- Locally Weighted Regression (LWR)
- Radial Basis Functions (RBF)
- Case-Based Reasoning (CBR)
- Summary
Instance-based learning
One way of solving the task of approximating discrete- or real-valued target functions.
We have training examples $(x_n, f(x_n))$, $n = 1, \ldots, N$.
Key idea: just store the training examples; when a test example is given, find the closest matches.
Motivation: Eager Learning
The learning task: approximate a target function through a hypothesis on the basis of training examples.
EAGER learning: as soon as the training examples and the hypothesis space are received, the search for the first hypothesis begins.
Training phase:
- given: training examples $D = \{\langle x_i, f(x_i)\rangle\}$ and a hypothesis space H
- search: the best hypothesis $\hat f$
Processing phase: for every new instance $x_q$, return $\hat f(x_q)$.
Example: radial basis function networks.
Motivation: Lazy Algorithms
LAZY algorithms: training examples are stored and 'sleeping'; generalisation beyond these examples is postponed until new instances must be classified.
Every time a new query instance is encountered, its relationship to the previously stored examples is examined in order to compute the value of the target function for this new instance.
Motivation: Instance-Based Learning
Instance-based algorithms can establish a new local approximation for every new instance.
Training phase:
- given: training sample $D = \{\langle x_i, f(x_i)\rangle\}$
Processing phase:
- given: instance $x_q$
- search: the best local hypothesis
- return $\hat f(x_q)$
Examples: nearest neighbour algorithm, distance-weighted nearest neighbour, locally weighted regression, ...
Motivation: Instance-Based Learning
How are the instances represented? How can we measure the similarity of the instances? How can $\hat f(x_q)$ be computed?
Nearest Neighbour Algorithm
Idea: all instances correspond to points in the n-dimensional space $\mathbb{R}^n$. Assign the value of the nearest neighbouring instance to the new instance.
Representation: let $x_i = \langle a_1(x_i), a_2(x_i), \ldots, a_n(x_i)\rangle$ be an instance, where $a_r(x_i)$ denotes the value of the r-th attribute of instance $x_i$. We may also write $x_{ir}$ instead of $a_r(x_i)$.
Target function: discrete-valued or real-valued.
1-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? One.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the same output as the nearest neighbour.
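As a minimal sketch, these four choices fit in a few lines. Python with NumPy is assumed here, and the function name predict_1nn is illustrative, not from the slides:

```python
import numpy as np

def predict_1nn(X_train, y_train, x_query):
    """Return the label of the stored example closest to x_query."""
    # 1. distance metric: Euclidean; 2. neighbours: one;
    # 3. weighting: unused; 4. fit: copy the nearest neighbour's output.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[1.0, 1.0], [2.0, 2.5], [5.0, 5.0]])
y_train = np.array(["O", "O", "X"])
print(predict_1nn(X_train, y_train, np.array([1.5, 1.5])))  # -> O
```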
Nearest Neighbour Algorithm
HOW IS $\hat f(x_q)$ FORMED?
Discrete target function: $f : \mathbb{R}^n \to V$, where $V = \{v_1, v_2, \ldots, v_s\}$ is the set of s classes (e.g., red, black, yellow, ...).
Continuous target function: $f : \mathbb{R}^n \to \mathbb{R}$.
Let $x_n$ be the nearest neighbour of $x_q$: $d(x_n, x_q) = \min_i d(x_i, x_q)$. Then $\hat f(x_q) = f(x_n)$.

Nearest Neighbour Algorithm
1-Nearest neighbour: given a query instance $x_q$,
- first locate the nearest training example $x_n$,
- then $\hat f(x_q) := f(x_n)$.
k-Nearest neighbour: given a query instance $x_q$,
- first locate the k nearest training examples;
- if the target function is discrete-valued, take a vote among the k nearest neighbours (e.g., X, X, O, O, X, O, X, X → X);
- else, if the target function is real-valued, take the mean of the f values of the k nearest neighbours:
$\hat f(x_q) := \frac{1}{k}\sum_{i=1}^{k} f(x_i)$
How to choose "k"
An average of k points is more reliable when there is noise in the attributes, noise in the class labels, or the classes partially overlap.
Large k: less sensitive to noise (particularly class noise); better probability estimates for discrete classes; larger training sets allow larger values of k.
Small k: captures the fine structure of the problem space better; may be necessary with small training sets.
A balance must be struck between large and small k. As the training set approaches infinity and k grows large, kNN becomes Bayes optimal (if p(x) > 0.5 then predict 1, else 0).
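A common way to strike that balance in practice is leave-one-out cross-validation over candidate values of k; a minimal sketch, assuming NumPy arrays and the illustrative helper name loocv_error:

```python
import numpy as np
from collections import Counter

def loocv_error(X, y, k):
    """Fraction of training examples misclassified by k-NN when each
    example is predicted from all the others (leave-one-out)."""
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # leave example i out
        nearest = np.argsort(dists)[:k]
        vote = Counter(y[nearest]).most_common(1)[0][0]
        errors += int(vote != y[i])
    return errors / len(X)

# best_k = min([1, 3, 5, 7], key=lambda k: loocv_error(X, y, k))
```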
k-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? k.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the average output among the k nearest neighbours.
k-Nearest Neighbour
Idea:
If we choose k = 1, the algorithm assigns to $\hat f(x_q)$ the value $f(x_i)$, where $x_i$ is the training instance nearest to $x_q$.
For larger values of k, the algorithm assigns the most common value among the k nearest training examples.
How can $\hat f(x_q)$ be established? Let $x_1, \ldots, x_k$ denote the k instances from the training examples that are nearest to $x_q$:
$\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
where $\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ otherwise.

k-Nearest Neighbour Algorithm
Training algorithm:
- For each training example $\langle x, f(x)\rangle$, add the example to the list training_examples.
Classification algorithm: given a query instance $x_q$ to be classified,
- let $x_1, \ldots, x_k$ denote the k instances from training_examples that are nearest to $x_q$;
- return $\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$, where $\delta(a, b) = 1$ if $a = b$, and 0 otherwise.
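A sketch of the training/classification algorithm above (Python with NumPy assumed; knn_predict is an illustrative name). Training is just storing the examples; all work happens at query time:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, discrete=True):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]      # the k closest training examples
    if discrete:
        # argmax_v sum_i delta(v, f(x_i)): majority vote
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # real-valued target: mean of the neighbours' f values
    return float(np.mean(y_train[nearest]))
```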
The distance between examples
We need a measure of distance in order to know which examples are the neighbours.
Assume that we have T attributes for the learning problem. Then one example point $x_i$ has elements $x_{it}$, $t = 1, \ldots, T$.
The distance between two points $x_i$, $x_j$ is often defined as the Euclidean distance:
$d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}$
Similarity and Dissimilarity Between Objects
Distances are the normally used measures.
Minkowski distance, a generalization:
$d(x_i, x_j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}, \quad q > 0$
If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance.
Weighted distance:
$d(x_i, x_j) = \left(w_1|x_{i1} - x_{j1}|^q + w_2|x_{i2} - x_{j2}|^q + \cdots + w_p|x_{ip} - x_{jp}|^q\right)^{1/q}, \quad q > 0$
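Both distances above collapse into one helper; a sketch (the parameter q is the Minkowski order, w the optional attribute weights):

```python
import numpy as np

def minkowski(xi, xj, q=2, w=None):
    diff = np.abs(np.asarray(xi, float) - np.asarray(xj, float)) ** q
    if w is not None:                 # weighted distance
        diff = np.asarray(w) * diff
    return float(np.sum(diff) ** (1.0 / q))

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```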
Voronoi Diagram
Example: 1-NN classifies $x_q$ as $+$, whereas 5-NN classifies $x_q$ as $-$.
Voronoi diagram: the decision surface induced by a 1-Nearest Neighbour algorithm for a typical set of training examples. The convex region surrounding each training example indicates the region of query points whose classification will be completely determined by that training example.
Characteristics of Instance-Based Learning
An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a parameterised compact model of the target.
It produces a local approximation to the target function (a different one for each test instance).
When to consider Nearest Neighbour algorithms?
- Instances map to points in $\mathbb{R}^n$
- Not more than, say, 20 attributes per instance
- Lots of training data
Advantages:
- Training is very fast
- Can learn complex target functions
- Doesn't lose information
Disadvantages: ? (we will see them shortly...)
[Figure: eight example pictures, labelled one to eight; picture eight is the test instance, marked '?'.]
Training data

Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 6 | 1 | 10 | 4 | No
2 | 4 | 2 | 8 | 5 | No
3 | 5 | 2 | 7 | 4 | Yes
4 | 5 | 1 | 8 | 4 | Yes
5 | 5 | 1 | 10 | 5 | No
6 | 6 | 1 | 8 | 6 | Yes
7 | 7 | 1 | 14 | 5 | No

Test instance: 8 | 7 | 2 | 9 | 4 | ?

Distances are computed as $d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}$.
Distances of test instance from training data

Example | Distance of test from example | Mondrian?
1 | 3 | No
2 | 11 | No
3 | 8 | Yes
4 | 6 | Yes
5 | 7 | No
6 | 4 | Yes
7 | 27 | No

Classification: 1-NN No; 3-NN Yes; 5-NN Yes; 7-NN No.

THINK a moment!! Does this seem sensible to you? Isn't the calculation being skewed by the large values of the rectangle data relative to the other data?
Keep data in normalised form
One way to normalise the data $a_r(x)$ to $a'_r(x)$ is
$x'_t = \frac{x_t - \bar{x}_t}{\sigma_t}$
where $\bar{x}_t$ is the mean of the t-th attribute and $\sigma_t$ is the standard deviation of the t-th attribute.
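A sketch of this normalisation applied to the Mondrian training data above (NumPy assumed; numpy's default population standard deviation reproduces the table values on the next slide, and the test instance must be normalised with the training statistics):

```python
import numpy as np

X_train = np.array([[6, 1, 10, 4], [4, 2, 8, 5], [5, 2, 7, 4],
                    [5, 1, 8, 4], [5, 1, 10, 5], [6, 1, 8, 6],
                    [7, 1, 14, 5]], dtype=float)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

X_norm = (X_train - mean) / std                    # rows of the next table
x_test_norm = (np.array([7, 2, 9, 4]) - mean) / std
print(np.round(x_test_norm, 3))   # [ 1.739  1.581 -0.131 -1.021]
```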
Normalised training data

Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 0.632 | -0.632 | 0.327 | -1.021 | No
2 | -1.581 | 1.581 | -0.588 | 0.408 | No
3 | -0.474 | 1.581 | -1.046 | -1.021 | Yes
4 | -0.474 | -0.632 | -0.588 | -1.021 | Yes
5 | -0.474 | -0.632 | 0.327 | 0.408 | No
6 | 0.632 | -0.632 | -0.588 | 1.837 | Yes
7 | 1.739 | -0.632 | 2.157 | 0.408 | No

Test instance: 8 | 1.739 | 1.581 | -0.131 | -1.021 | ?
Distances of test instance from training data (after normalisation)

Example | Distance of test from example | Mondrian?
1 | 2.517 | No
2 | 3.644 | No
3 | 2.395 | Yes
4 | 3.164 | Yes
5 | 3.472 | No
6 | 3.808 | Yes
7 | 3.490 | No

Classification before and after normalisation:
1-NN: No (before), Yes (after)
3-NN: Yes (before), Yes (after)
5-NN: Yes (before), No (after)
7-NN: No (before), No (after)
Difficulties with k-nearest neighbour algorithms
- Have to calculate the distance of the test case from all training cases.
- There may be irrelevant attributes amongst the attributes (curse of dimensionality).
What if the target function is real-valued?
The k-nearest neighbour algorithm would just calculate the mean of the k nearest neighbours.
Distance-Weighted kNN: the weights of the neighbours are taken into account relative to their distance to the query point.
To accommodate the case where the query point $x_q$ exactly matches one of the training instances $x_i$ and the denominator therefore is zero, we assign $\hat f(x_q)$ to be $f(x_i)$ in this case (i.e., if a training example exactly matches the query point, its stored value is returned).
Distance-Weighted kNN
We might want nearer neighbours to have heavier weight:
$w_i = \frac{1}{d(x_q, x_i)^2}$
with $\hat f(x_q) := f(x_i)$ if $x_q = x_i$.
For discrete-valued target functions:
$\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i\,\delta(v, f(x_i))$
where V is the set of s classes (e.g., red, black, yellow, ...) and $\delta(a, b) = 1$ if $a = b$, 0 otherwise.
For real-valued target functions (Shepard's method):
$\hat f(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
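A sketch covering both cases, with $w_i = 1/d(x_q, x_i)^2$ and the exact-match shortcut (Python with NumPy assumed; for the real-valued branch, y_train must be numeric):

```python
import numpy as np

def dw_knn(X_train, y_train, x_query, k=3, discrete=True):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if dists[nearest[0]] == 0.0:          # exact match: return f(x_i)
        return y_train[nearest[0]]
    w = 1.0 / dists[nearest] ** 2
    if discrete:
        # argmax_v sum_i w_i * delta(v, f(x_i))
        scores = {}
        for wi, label in zip(w, y_train[nearest]):
            scores[label] = scores.get(label, 0.0) + wi
        return max(scores, key=scores.get)
    # Shepard's method: distance-weighted mean
    return float(np.sum(w * y_train[nearest]) / np.sum(w))
```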
Remarks on the k-Nearest Neighbour Algorithm
PROBLEM: the measurement of the distance between two instances considers every attribute, so even irrelevant attributes can influence the approximation.
EXAMPLE: n = 20, but only 2 attributes are relevant.
SOLUTION: weight each attribute differently when calculating the distance between two neighbours, stretching the relevant axes in Euclidean space: shorten the axes that correspond to less relevant attributes and lengthen the axes that correspond to more relevant attributes.
PROBLEM: how can we determine automatically which weight belongs to which attribute? Cross-validation, leave-one-out (see next lecture).
Remarks on the k-Nearest Neighbour Algorithm (2)
ADVANTAGES: the training phase is processed very fast; can learn complex target functions; robust to noisy training data; quite effective when a sufficiently large set of training data is provided.
DISADVANTAGES: the algorithm delays all processing until a new query is received, so significant computation can be required to process it (efficient memory indexing helps); processing is slow; sensitive to the curse of dimensionality.
BIAS: the inductive bias corresponds to an assumption that the classification of an instance $x_q$ will be most similar to the classification of other instances that are nearby in Euclidean distance.
Generalizing k-nearest neighbour to continuous outputs
The version of k-nearest neighbours we have already seen works well for discrete outputs. How would we generalize this to predict continuous outputs? Ideas?
Locally Weighted Regression
Local means using nearby points (i.e. a nearest-neighbours approach), based solely on the training data near the query point.
Weighted means we value points based upon how far away they are from the query point.
Regression means approximating a function.
This is an instance-based learning method. The idea: whenever you want to classify a sample,
- build a local model of the function (using a linear function, a quadratic, a neural network, etc.),
- use the model to predict the output value,
- throw the model away.
Locally Weighted Regression
IDEA: a generalization of the nearest neighbour algorithm. It constructs an explicit approximation to f over a local region surrounding $x_q$, using nearby or distance-weighted training examples to form the local approximation to f.
Local: the function is approximated based solely on the training data near the query point.
Weighted: the contribution of each training example is weighted by its distance from the query point.
Regression: means approximating a real-valued target function.
How Locally Weighted Regression works
- Unweighted averaging using springs: the strengths of the springs are equal in the unweighted case, and the position of the horizontal line minimizes the sum of the stored energy in the springs.
- Locally weighted averaging using springs: the springs are not equal; the spring constant of each spring is given by $K(d(x_i, q))$. The locally weighted average emphasizes points close to the query point, and produces an answer (the height of the horizontal line) that is closer to the height of points near the query point than in the unweighted case.
Example of Locally Weighted Learning
The upper graphic shows the set of data points (x, y) (blue dots), the query point (green line), the local linear model (red line) and the prediction (yellow dot). The graphic in the middle shows the activation area of the model. The corresponding weighting kernel (receptive field) is shown in the bottom graphic.
How Locally Weighted Regression works
[Figure: fits using different types of local models, for three and for five data points: nearest neighbour, weighted average, locally weighted regression.]
Locally weighted linear regression
In the following:
- x is an instance; D is the set of training examples, $D = \{\langle x_i, f(x_i)\rangle\}$
- $a_i(x)$ is the value of the i-th attribute of instance x
- the weights $w_i$ form our hypothesis
- f is the target function; $\hat f$ is our approximation to the target function
Locally weighted linear regression
In this case, we use a linear model to do the local approximation:
$\hat f(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
Suppose we aim to minimize the total squared error:
$E = \frac{1}{2}\sum_{x \in D}\big(f(x) - \hat f(x)\big)^2$
Recall the gradient descent rule we used in checkers for this purpose:
$\Delta w_j = \eta \sum_{x \in D}\big(f(x) - \hat f(x)\big)\,a_j(x)$
where η is a small number (the learning rate).
Locally weighted linear regression
Now we adjust this to the present situation. Define the error for the query instance $x_q$, minimised over the set of k nearest neighbours and weighted by a kernel function K that decreases with distance:
$E(x_q) = \frac{1}{2}\sum_{x \in \text{kNN of } x_q}\big(f(x) - \hat f(x)\big)^2\,K(d(x_q, x))$
The new version of the gradient descent rule becomes:
$\Delta w_j = \eta \sum_{x \in \text{kNN of } x_q} K(d(x_q, x))\,\big(f(x) - \hat f(x)\big)\,a_j(x)$
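Rather than iterating the gradient rule, the same kernel-weighted squared error can be minimised in closed form for each query. A sketch, assuming a Gaussian kernel applied to all training points (a common variant of restricting to the k nearest) and NumPy; lwr_predict is an illustrative name:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    # K(d(x_q, x)) = exp(-d^2 / (2 tau^2)), one weight per training point
    K = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    A = np.hstack([np.ones((len(X), 1)), X])     # model w0 + w1*a1(x) + ...
    s = np.sqrt(K)[:, None]
    # weighted least squares: minimise sum_x K * (f(x) - f_hat(x))^2
    w, *_ = np.linalg.lstsq(s * A, s[:, 0] * y, rcond=None)
    return float(np.array([1.0, *x_query]) @ w)  # f_hat(x_q); model discarded

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
print(lwr_predict(X, y, np.array([1.5]), tau=0.5))
```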
Locally Weighted Linear Regression
We might approximate the target function in the neighbourhood surrounding $x_q$ using a linear function, a quadratic function, a multilayer neural network, or some other functional form. Using a linear function to approximate f:
$\hat f(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
Recall from chapter 4 the gradient descent rule for minimising $E = \frac{1}{2}\sum_{x \in D}\big(f(x) - \hat f(x)\big)^2$:
$\Delta w_j = \eta \sum_{x \in D}\big(f(x) - \hat f(x)\big)\,a_j(x)$
Locally Weighted Regression
PROCEDURE: given a new query $x_q$, construct an approximation $\hat f$ that fits the training examples in the neighbourhood surrounding $x_q$. This approximation is used to calculate $\hat f(x_q)$, which is output as the estimated target value for the query instance. The description of $\hat f$ may change, because a different local approximation is calculated for each instance.
Evaluation of Locally Weighted Regression
ADVANTAGES:
- pointwise approximation of a complex target function
- earlier local approximations have no influence on new queries (each query gets its own local model)
DISADVANTAGES:
- the quality of the result depends on the choice of the function $\hat f$, the choice of the kernel function K, and the choice of the hypothesis space H
- sensitive to relevant and irrelevant attributes
Radial Basis Function (RBF) Networks
An RBF neural network has an input layer, a hidden layer, and an output layer. The neurons in the hidden layer contain Gaussian transfer functions whose outputs are inversely proportional to the distance from the center of the neuron (the farther data lies from a neuron's center, the less influence it has on the result).
In its method it is similar to k-means clustering and to PNN (Probabilistic Neural Network) / GRNN (Generalized Regression Neural Network) networks. The main difference: PNN/GRNN networks have one neuron for each point in the training file, while RBF networks have a variable number of neurons that is usually much less than the number of training points.
For problems with small to medium-sized training sets, PNN/GRNN networks are usually more accurate than RBF networks, but PNN/GRNN networks are impractical for large training sets.
How RBF networks work
Although the implementation is very different, RBF neural networks are conceptually similar to k-nearest neighbour (k-NN) models in strategy. The basic idea is that the predicted target value of an item is likely to be about the same as for other items that have close values of the predictor variables.
Radial-Basis Function Networks
RBFs represent local receptors, where each stored (green) point is a vector used in one RBF. In an RBF network, one hidden layer uses neurons with RBF activation functions describing local receptors; one output node is then used to combine the outputs of the hidden neurons linearly, with weights $w_1$, $w_2$, $w_3$ in the figure.
The output for the query (red) vector is 'interpolated' using the three green vectors, where each vector gives a contribution that depends on its weight and on its distance from the red point.
MLP vs RBFN
[Figure: decision regions formed by global hyperplanes in an MLP vs. local receptive fields in an RBFN.]
Radial Basis Function Network
A kind of supervised neural network; the design of the NN is treated as a curve-fitting problem.
Learning: find the surface in multidimensional space that best fits the training data.
Generalization: use this multidimensional surface to interpolate the test data.
Radial Basis Function Network
Approximate the function with a linear combination of radial basis functions:
$f(x) = \sum_{j=1}^{m} w_j h_j(x)$
h(x) is mostly a Gaussian function:
$h_j(x) = \exp\left(-\frac{(x - c_j)^2}{r_j^2}\right)$
where $c_j$ is the center of a region and $r_j$ is the width of the receptive field.
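A sketch of this model with given centers and widths (choosing them is the training problem discussed below; all names here are illustrative):

```python
import numpy as np

def rbf_output(x, centers, widths, weights):
    h = np.exp(-((x - centers) ** 2) / widths ** 2)  # h_j(x), Gaussian
    return float(weights @ h)                        # f(x) = sum_j w_j h_j(x)

centers = np.array([0.0, 1.0, 2.0])   # c_j
widths  = np.array([0.5, 0.5, 0.5])   # r_j
weights = np.array([1.0, -0.5, 2.0])  # w_j
print(rbf_output(0.9, centers, widths, weights))
```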
HIDDEN NEURON MODEL
A hidden neuron computes $h_j(\lVert x - c_j \rVert)$ for an input vector $x = (x_1, \ldots, x_m)$: the output depends on the distance of the input x from the center $c_j$.
$c_j$ is called the center, and $\sigma_j$ the spread; center and spread are the neuron's parameters.
RBF ARCHITECTURE
One hidden layer with RBF activation functions $h_1, \ldots, h_m$, and an output layer with a linear activation function:
$y = w_1 h_1(\lVert x - c_1 \rVert) + \cdots + w_m h_m(\lVert x - c_m \rVert)$
where $\lVert x - c \rVert$ is the distance of the input $x = (x_1, \ldots, x_m)$ from the center vector c.
Three layers
- Input layer: source nodes that connect the network to its environment.
- Hidden layer: hidden units provide a set of basis functions (high dimensionality).
- Output layer: linear combination of the hidden functions.

RBF Network Architecture
Weight = RBF(distance): the further a neuron is from the point being evaluated, the less influence it has.
Radial Basis Function
Different types of radial basis functions could be used, but the most common is the Gaussian function. If there is more than one predictor variable, the RBF function has as many dimensions as there are variables.
Radial Basis Function
With two predictor variables X and Y and three neurons in the space, Z is the value coming out of the RBF functions. The best predicted value for the new point is found by summing the output values of the RBF functions multiplied by the weights computed for each neuron.
The radial basis function for a neuron has a center and a radius (also called a spread). The radius may be different for each neuron and, in some RBF networks, for each dimension.
Training RBF Networks
The following parameters are determined by the training process:
- the number of neurons in the hidden layer
- the coordinates of the center of each hidden-layer RBF function
- the radius (spread) of each RBF function in each dimension
- the weights applied to the RBF function outputs as they are passed to the summation layer
Designing
Requires selection of the radial basis function width parameter and of the number of radial basis neurons.
Number of radial basis neurons: set by the designer. The maximum number of neurons equals the number of inputs; the minimum is experimentally determined. More neurons give a more complex network, but a smaller tolerance.
Designing
Various learning strategies exist, depending on how the centers of the radial-basis functions of the network are specified:
- fixed centers selected at random
- self-organized selection of centers
- supervised selection of centers
Fixed centers selected at random (1)
Fixed RBFs for the hidden units: the locations of the centers may be chosen randomly from the training data set. We can use different values of centers and widths for each radial basis function; experimentation with the training data is needed.
Fixed centers selected at random (2)
Only the output-layer weights need to be learned.
Main problem: requires a large training set for a satisfactory level of performance.
Self-organized selection of centers (1)
Centers are selected by means of clustering; the output weights are learned by supervised learning with the LMS (Least Mean Squares) algorithm.
Hybrid learning:
- self-organized learning to estimate the centers of the RBFs in the hidden layer
- supervised learning to estimate the linear weights of the output layer
(The centers are determined by clustering, but the output weights by supervised learning!)
Self-organized selection of centers (2)
k-means clustering (sketched below):
1. Initialization
2. Sampling
3. Similarity matching
4. Updating
5. Continuation
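A minimal batch (Lloyd-style) sketch of these five steps for placing RBF centers; the sampling/matching steps are per-example in the original formulation, but the batch version below is equivalent in spirit (NumPy assumed, X a float array):

```python
import numpy as np

def kmeans_centers(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialization
    for _ in range(n_iter):                                 # 5. continuation
        # 2./3. take the examples and match each to its most similar center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):                                  # 4. updating
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```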
Supervised selection of centers
All free parameters of the network are changed by a supervised learning process: error-correction learning using the LMS algorithm.
Radial functions
Gaussian RBF (monotonically decreases with distance from the center):
$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right)$
Multiquadric RBF (monotonically increases with distance from the center):
$h(x) = \frac{\sqrt{r^2 + (x - c)^2}}{r}$
where c is the center and r the radius.
Least Squares
Model: $f(x) = \sum_{j=1}^{m} w_j h_j(x)$
Training data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$
Minimize the sum-squared error:
$S = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2$
Example
Sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}
Linear model: $f(x) = w_1 h_1(x) + w_2 h_2(x)$, where $h_1(x) = 1$, $h_2(x) = x$.
Estimate the coefficients $w_1$, $w_2$ (the result, derived below, is f(x) = x).
New model: $f(x) = w_1 h_1(x) + w_2 h_2(x) + w_3 h_3(x)$, where $h_1(x) = 1$, $h_2(x) = x$, $h_3(x) = x^2$.
If the model absorbs all the noise, it overfits: if it is too flexible, it will fit the noise; if it is too inflexible, it will miss the target.
The optimal weight vector
Model: $f(x) = \sum_{j=1}^{m} w_j h_j(x)$
Sum-squared error: $S = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2$
Cost function (to be minimized), with a weight penalty term added:
$C = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2 + \sum_{j=1}^{m} \lambda_j w_j^2$
where the $\lambda_j$ are regularization parameters.
Differentiating the cost with respect to $w_j$, using $\frac{\partial f(x_i)}{\partial w_j} = h_j(x_i)$:
$\frac{1}{2}\frac{\partial C}{\partial w_j} = -\sum_{i=1}^{p}\big(y_i - f(x_i)\big)\,h_j(x_i) + \lambda_j w_j$
Setting $\frac{\partial C}{\partial w_j} = 0$ gives
$\sum_{i=1}^{p} f(x_i)\,h_j(x_i) + \lambda_j w_j = \sum_{i=1}^{p} y_i\,h_j(x_i)$
Let
$\mathbf{h}_j = \begin{pmatrix} h_j(x_1) \\ h_j(x_2) \\ \vdots \\ h_j(x_p) \end{pmatrix}, \quad \mathbf{f} = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_p) \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}$
Then
$\mathbf{h}_j^T \mathbf{f} + \lambda_j w_j = \mathbf{h}_j^T \mathbf{y} \quad \text{for all } j = 1, 2, \ldots, m$
Stacking the m equations, and writing $H = (\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_m)$ (the design matrix) and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$:
$\begin{pmatrix} \mathbf{h}_1^T \mathbf{f} \\ \mathbf{h}_2^T \mathbf{f} \\ \vdots \\ \mathbf{h}_m^T \mathbf{f} \end{pmatrix} + \begin{pmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_m w_m \end{pmatrix} = \begin{pmatrix} \mathbf{h}_1^T \mathbf{y} \\ \mathbf{h}_2^T \mathbf{y} \\ \vdots \\ \mathbf{h}_m^T \mathbf{y} \end{pmatrix} \quad\Longleftrightarrow\quad H^T \mathbf{f} + \Lambda \mathbf{w} = H^T \mathbf{y}$
Since $\mathbf{f} = H\mathbf{w}$,
$\big(H^T H + \Lambda\big)\,\mathbf{w} = H^T \mathbf{y}$
so, with $A = H^T H + \Lambda$ (simply $A = H^T H$ when all $\lambda_j = 0$), the optimal weight vector is
$\mathbf{w} = A^{-1} H^T \mathbf{y}$
Written out, the model values and the design matrix are
$H\mathbf{w} = \begin{pmatrix} \sum_{j=1}^{m} w_j h_j(x_1) \\ \sum_{j=1}^{m} w_j h_j(x_2) \\ \vdots \\ \sum_{j=1}^{m} w_j h_j(x_p) \end{pmatrix}, \qquad H = \begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_m(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_m(x_2) \\ \vdots & \vdots & & \vdots \\ h_1(x_p) & h_2(x_p) & \cdots & h_m(x_p) \end{pmatrix}$
Example
Sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}
Linear model: $f(x) = w_1 h_1(x) + w_2 h_2(x)$, where $h_1(x) = 1$, $h_2(x) = x$. Estimate the coefficients $w_1$, $w_2$:
$H = \begin{pmatrix} h_1(x_1) & h_2(x_1) \\ h_1(x_2) & h_2(x_2) \\ h_1(x_3) & h_2(x_3) \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 1.1 \\ 1.8 \\ 3.1 \end{pmatrix}, \quad H^T H = \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}$
$A^{-1} = \big(H^T H\big)^{-1} = \begin{pmatrix} 7/3 & -1 \\ -1 & 1/2 \end{pmatrix}, \quad \mathbf{w} = A^{-1} H^T \mathbf{y} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$
So $f(x) = 0 \cdot 1 + 1 \cdot x$, i.e. f(x) = x.
The linear fit misses the curvature of the data; the model should have an extra term, $x^2$.
New model: $f(x) = w_1 h_1(x) + w_2 h_2(x) + w_3 h_3(x)$, where $h_1(x) = 1$, $h_2(x) = x$, $h_3(x) = x^2$.
The design matrix gains a third column $\big(h_3(x_1), h_3(x_2), h_3(x_3)\big)^T = (1, 4, 9)^T$:
$H = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}$
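The worked example can be checked numerically; a sketch with NumPy, taking all $\lambda_j = 0$ so that $A = H^T H$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.8, 3.1])

H = np.column_stack([np.ones_like(x), x])         # columns h1(x)=1, h2(x)=x
w = np.linalg.solve(H.T @ H, H.T @ y)             # w = (H^T H)^-1 H^T y
print(w)                                          # ~[0. 1.]  ->  f(x) = x

H3 = np.column_stack([np.ones_like(x), x, x**2])  # add h3(x) = x^2
w3 = np.linalg.solve(H3.T @ H3, H3.T @ y)         # interpolates all 3 points
```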
Radial Basis Function (RBF) Networks
Each prototype node computes a distance-based kernel function (a Gaussian is common). The prototype nodes form a hidden layer in a neural network; the top layer is trained with the simple delta rule to produce the outputs. Thus the prototype nodes learn weightings for each class.
RBF networks are a blend of the instance-based method and the neural network method.
Radial Basis Function
Function to be learned:
$\hat f(x) = w_0 + \sum_{u=1}^{k} w_u\,K_u\big(d(x_u, x)\big)$
One common choice for $K_u\big(d(x_u, x)\big)$ is the Gaussian:
$K_u\big(d(x_u, x)\big) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
This is a global approximation to the target function, in terms of a linear combination of local approximations. It is related to distance-weighted regression, but 'eager' instead of 'lazy'.
Radial Basis Function Networks
$a_i(x)$ are the attributes describing instance x. The first layer computes the various $K_u\big(d(x_u, x)\big)$; the second layer computes a linear combination of the first-layer unit values. The hidden unit activation is close to 0 if x is not near $x_u$.
Approximation
MLP: a global network; all inputs cause an output.
RBF: a local network; only inputs near a receptive field produce an activation. It can give a 'don't know' output.
MLP vs RBFN

MLP | RBFN
Global hyperplane | Local receptive field
EBP (error back-propagation) | LMS
Local minima | Serious local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time | Longer computation time
Longer learning time | Shorter learning time
Case-based reasoning (CBR)
CBR shares the first two principles of instance-based methods and locally weighted regression (lazy learning, and classifying new queries by analysing similar stored instances), but the instances are represented by a richer symbolic description, and correspondingly more elaborate methods are used for retrieval.
CBR is an advanced instance-based learning method applied to more complex instance objects. Objects may include complex structural descriptions of cases and adaptation rules.
CBR cannot use Euclidean distance measures; distance measures must instead be defined for those complex objects (e.g., semantic nets).
CBR tries to model human problem-solving: it uses past experience (cases) to solve new problems and retains the solutions to new problems.
CBR is an ongoing area of machine learning research with many applications.
Applications of CBR
- Design: landscape, building, mechanical, conceptual design of aircraft sub-systems
- Planning: repair schedules
- Diagnosis: medical
- Adversarial reasoning: legal
CBR process
A new case is matched against the case base, retrieving the matched cases and the closest case. If no adaptation is needed, the closest case's solution is reused and suggested directly; otherwise the solution is revised using knowledge and adaptation rules. Useful new solutions are retained in the case base (learning).
[Figure: retrieve, reuse, revise, retain cycle around the case base.]
CBR example: property pricing

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
1 | 8 | 2 | 1 | terraced | 1 | poor | 20,500
2 | 8 | 2 | 2 | terraced | 1 | fair | 25,000
3 | 5 | 1 | 2 | semi | 2 | good | 48,000
4 | 5 | 1 | 2 | terraced | 2 | good | 41,000

Test instance:
5 | 7 | 2 | 2 | semi | 1 | poor | ???
How rules are generated
There is no unique way of doing it. Here is one possibility: examine cases and look for ones that are almost identical.
Cases 1 and 2 → Rule 1: if recep-rooms changes from 2 to 1, then reduce the price by £5,000.
Cases 3 and 4 → Rule 2: if type changes from semi to terraced, then reduce the price by £7,000.
Matching
Comparing the test instance with each case (counting matching attributes, as sketched below):
matches(5,1) = 3; matches(5,2) = 3 (max cost: £25,000); matches(5,3) = 2; matches(5,4) = 1
The estimated price of case 5 is £25,000.
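A sketch of this attribute-count matching (the dictionary layout and the function name matches are illustrative; price is excluded from the comparison):

```python
def matches(case_a, case_b):
    return sum(case_a[f] == case_b[f] for f in case_a if f != "price")

case2 = {"location": 8, "bedrooms": 2, "recep": 2, "type": "terraced",
         "floors": 1, "condition": "fair", "price": 25000}
test  = {"location": 7, "bedrooms": 2, "recep": 2, "type": "semi",
         "floors": 1, "condition": "poor", "price": None}
print(matches(test, case2))  # 3 (bedrooms, recep rooms, floors)
```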
Adapting
Reverse rule 2: if type changes from terraced to semi, then increase the price by £7,000.
Applying reversed rule 2, the new estimate of the price of property 5 is £32,000.
Learning
So far we have a new case and an estimated price; nothing is added yet to the case base.

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
5 | 7 | 2 | 2 | terraced | 1 | poor | 32,000

If later we find a house with location code 8 sold for £35,000:

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
#5 | 8 | 2 | 2 | terraced | 1 | poor | 35,000

then that case would be added, and we could add a new rule: if location changes from 7 to 8, increase the price by £3,000.
Problems with CBR
- How should cases be represented?
- How should cases be indexed for fast retrieval?
- How can good adaptation heuristics be developed?
- When should old cases be removed?
Advantages
- A local approximation is found for each test case.
- Knowledge is in a form understandable to human beings.
- Fast to train.
Lazy and Eager Learning
Lazy: wait for the query before generalizing (kNN, locally weighted regression, CBR).
Eager: generalize before seeing the query (RBF networks).
Differences: computation time; global vs. local approximations to the target function. Using the same H, a lazy learner can represent more complex functions (e.g., consider H = linear functions).
Summary
Differences and advantages:
- kNN algorithm: the most basic instance-based method.
- Locally weighted regression: a generalization of kNN.
- RBF networks: a blend of the instance-based method and the neural network method.
- Case-based reasoning.
Lazy and Eager Learning
Lazy: wait for the query before generalizing (k-Nearest Neighbour, case-based reasoning).
Eager: generalize before seeing the query (RBF networks, ID3, ...).
Does it matter? An eager learner must create a global approximation; a lazy learner can create many local approximations.
The End