Instance-Based Learning
Soongsil University
Intelligent Systems Lab.
Content
- Motivation: Eager Learning, Lazy Learning, Instance-Based Learning
- k-Nearest Neighbour Learning (kNN)
- Distance-Weighted k-NN
- Locally Weighted Regression (LWR)
- Radial Basis Functions (RBF)
- Case-Based Reasoning (CBR)
- Summary
Instance-based learning
One way of solving the task of approximating discrete- or real-valued target functions.
We have training examples $(x_n, f(x_n))$, $n = 1, \ldots, N$.
Key idea: just store the training examples; when a test example is given, find the closest matches.
Motivation: Eager Learning
The learning task: approximate a target function through a hypothesis on the basis of training examples.
EAGER learning: as soon as the training examples and the hypothesis space are received, the search for the first hypothesis begins.
Training phase:
- given: training examples $D = \{\langle x_i, f(x_i)\rangle\}$ and a hypothesis space H
- search: the best hypothesis $\hat f$
Processing phase: for every new instance $x_q$, return $\hat f(x_q)$.
Example: radial basis function networks.
Motivation: Lazy Algorithms
LAZY algorithms: training examples are stored and 'sleeping'; generalisation beyond these examples is postponed until new instances must be classified.
Every time a new query instance is encountered, its relationship to the previously stored examples is examined in order to compute the value of the target function for this new instance.
Motivation: Instance-Based Learning
Instance-based algorithms can establish a new local approximation for every new instance.
Training phase:
- given: training sample $D = \{\langle x_i, f(x_i)\rangle\}$
Processing phase:
- given: instance $x_q$
- search: the best local hypothesis
- return $\hat f(x_q)$
Examples: nearest neighbour algorithm, distance-weighted nearest neighbour, locally weighted regression, ...
Motivation: Instance-Based Learning
How are the instances represented? How can we measure the similarity of the instances? How can $\hat f(x_q)$ be computed?
Nearest Neighbour Algorithm
Idea: all instances correspond to points in the n-dimensional space $\mathbb{R}^n$. Assign the value of the nearest neighbouring instance to the new instance.
Representation: let $x_i = \langle a_1(x_i), a_2(x_i), \ldots, a_n(x_i)\rangle$ be an instance, where $a_r(x_i)$ denotes the value of the r-th attribute of instance $x_i$. We may also write $x_{ir}$ instead of $a_r(x_i)$.
Target function: discrete-valued or real-valued.
1-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? One.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the same output as the nearest neighbour.
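As a minimal sketch, these four choices fit in a few lines. Python with NumPy is assumed here, and the function name predict_1nn is illustrative, not from the slides:

```python
import numpy as np

def predict_1nn(X_train, y_train, x_query):
    """Return the label of the stored example closest to x_query."""
    # 1. distance metric: Euclidean; 2. neighbours: one;
    # 3. weighting: unused; 4. fit: copy the nearest neighbour's output.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[1.0, 1.0], [2.0, 2.5], [5.0, 5.0]])
y_train = np.array(["O", "O", "X"])
print(predict_1nn(X_train, y_train, np.array([1.5, 1.5])))  # -> O
```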
Nearest Neighbour Algorithm
HOW IS $\hat f(x_q)$ FORMED?
Discrete target function: $f : \mathbb{R}^n \to V$, where $V = \{v_1, v_2, \ldots, v_s\}$ is the set of s classes (e.g., red, black, yellow, ...).
Continuous target function: $f : \mathbb{R}^n \to \mathbb{R}$.
Let $x_n$ be the nearest neighbour of $x_q$: $d(x_n, x_q) = \min_i d(x_i, x_q)$. Then $\hat f(x_q) = f(x_n)$.

Nearest Neighbour Algorithm
1-Nearest neighbour: given a query instance $x_q$,
- first locate the nearest training example $x_n$,
- then $\hat f(x_q) := f(x_n)$.
k-Nearest neighbour: given a query instance $x_q$,
- first locate the k nearest training examples;
- if the target function is discrete-valued, take a vote among the k nearest neighbours (e.g., X, X, O, O, X, O, X, X → X);
- else, if the target function is real-valued, take the mean of the f values of the k nearest neighbours:
$\hat f(x_q) := \frac{1}{k}\sum_{i=1}^{k} f(x_i)$
How to choose "k"
An average of k points is more reliable when there is noise in the attributes, noise in the class labels, or the classes partially overlap.
Large k: less sensitive to noise (particularly class noise); better probability estimates for discrete classes; larger training sets allow larger values of k.
Small k: captures the fine structure of the problem space better; may be necessary with small training sets.
A balance must be struck between large and small k. As the training set approaches infinity and k grows large, kNN becomes Bayes optimal (if p(x) > 0.5 then predict 1, else 0).
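A common way to strike that balance in practice is leave-one-out cross-validation over candidate values of k; a minimal sketch, assuming NumPy arrays and the illustrative helper name loocv_error:

```python
import numpy as np
from collections import Counter

def loocv_error(X, y, k):
    """Fraction of training examples misclassified by k-NN when each
    example is predicted from all the others (leave-one-out)."""
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # leave example i out
        nearest = np.argsort(dists)[:k]
        vote = Counter(y[nearest]).most_common(1)[0][0]
        errors += int(vote != y[i])
    return errors / len(X)

# best_k = min([1, 3, 5, 7], key=lambda k: loocv_error(X, y, k))
```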
k-Nearest Neighbour
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbours to look at? k.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the average output among the k nearest neighbours.
k-Nearest Neighbour
Idea:
If we choose k = 1, the algorithm assigns to $\hat f(x_q)$ the value $f(x_i)$, where $x_i$ is the training instance nearest to $x_q$.
For larger values of k, the algorithm assigns the most common value among the k nearest training examples.
How can $\hat f(x_q)$ be established? Let $x_1, \ldots, x_k$ denote the k instances from the training examples that are nearest to $x_q$:
$\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
where $\delta(a, b) = 1$ if $a = b$, and $\delta(a, b) = 0$ otherwise.

k-Nearest Neighbour Algorithm
Training algorithm:
- For each training example $\langle x, f(x)\rangle$, add the example to the list training_examples.
Classification algorithm: given a query instance $x_q$ to be classified,
- let $x_1, \ldots, x_k$ denote the k instances from training_examples that are nearest to $x_q$;
- return $\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$, where $\delta(a, b) = 1$ if $a = b$, and 0 otherwise.
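A sketch of the training/classification algorithm above (Python with NumPy assumed; knn_predict is an illustrative name). Training is just storing the examples; all work happens at query time:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, discrete=True):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]      # the k closest training examples
    if discrete:
        # argmax_v sum_i delta(v, f(x_i)): majority vote
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # real-valued target: mean of the neighbours' f values
    return float(np.mean(y_train[nearest]))
```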
The distance between examples
We need a measure of distance in order to know which examples are the neighbours.
Assume that we have T attributes for the learning problem. Then one example point $x_i$ has elements $x_{it}$, $t = 1, \ldots, T$.
The distance between two points $x_i$, $x_j$ is often defined as the Euclidean distance:
$d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}$
Similarity and Dissimilarity Between Objects
Distances are the normally used measures.
Minkowski distance, a generalization:
$d(x_i, x_j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}, \quad q > 0$
If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance.
Weighted distance:
$d(x_i, x_j) = \left(w_1|x_{i1} - x_{j1}|^q + w_2|x_{i2} - x_{j2}|^q + \cdots + w_p|x_{ip} - x_{jp}|^q\right)^{1/q}, \quad q > 0$
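Both distances above collapse into one helper; a sketch (the parameter q is the Minkowski order, w the optional attribute weights):

```python
import numpy as np

def minkowski(xi, xj, q=2, w=None):
    diff = np.abs(np.asarray(xi, float) - np.asarray(xj, float)) ** q
    if w is not None:                 # weighted distance
        diff = np.asarray(w) * diff
    return float(np.sum(diff) ** (1.0 / q))

print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0 (Manhattan)
```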
Voronoi Diagram
Example: 1-NN classifies $x_q$ as $+$, whereas 5-NN classifies $x_q$ as $-$.
Voronoi diagram: the decision surface induced by a 1-Nearest Neighbour algorithm for a typical set of training examples. The convex region surrounding each training example indicates the region of query points whose classification will be completely determined by that training example.
Characteristics of Instance-Based Learning
An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a parameterised compact model of the target.
It produces a local approximation to the target function (a different one for each test instance).
When to consider Nearest Neighbour algorithms?
- Instances map to points in $\mathbb{R}^n$
- Not more than, say, 20 attributes per instance
- Lots of training data
Advantages:
- Training is very fast
- Can learn complex target functions
- Doesn't lose information
Disadvantages: ? (we will see them shortly...)
[Figure: eight example pictures, labelled one to eight; picture eight is the test instance, marked '?'.]
Training data

Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 6 | 1 | 10 | 4 | No
2 | 4 | 2 | 8 | 5 | No
3 | 5 | 2 | 7 | 4 | Yes
4 | 5 | 1 | 8 | 4 | Yes
5 | 5 | 1 | 10 | 5 | No
6 | 6 | 1 | 8 | 6 | Yes
7 | 7 | 1 | 14 | 5 | No

Test instance: 8 | 7 | 2 | 9 | 4 | ?

Distances are computed as $d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}$.
Distances of test instance from training data

Example | Distance of test from example | Mondrian?
1 | 3 | No
2 | 11 | No
3 | 8 | Yes
4 | 6 | Yes
5 | 7 | No
6 | 4 | Yes
7 | 27 | No

Classification: 1-NN No; 3-NN Yes; 5-NN Yes; 7-NN No.

THINK a moment!! Does this seem sensible to you? Isn't the calculation being skewed by the large values of the rectangle data relative to the other data?
Keep data in normalised form
One way to normalise the data $a_r(x)$ to $a'_r(x)$ is
$x'_t = \frac{x_t - \bar{x}_t}{\sigma_t}$
where $\bar{x}_t$ is the mean of the t-th attribute and $\sigma_t$ is the standard deviation of the t-th attribute.
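A sketch of this normalisation applied to the Mondrian training data above (NumPy assumed; numpy's default population standard deviation reproduces the table values on the next slide, and the test instance must be normalised with the training statistics):

```python
import numpy as np

X_train = np.array([[6, 1, 10, 4], [4, 2, 8, 5], [5, 2, 7, 4],
                    [5, 1, 8, 4], [5, 1, 10, 5], [6, 1, 8, 6],
                    [7, 1, 14, 5]], dtype=float)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

X_norm = (X_train - mean) / std                    # rows of the next table
x_test_norm = (np.array([7, 2, 9, 4]) - mean) / std
print(np.round(x_test_norm, 3))   # [ 1.739  1.581 -0.131 -1.021]
```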
Normalised training data

Number | Lines | Line types | Rectangles | Colours | Mondrian?
1 | 0.632 | -0.632 | 0.327 | -1.021 | No
2 | -1.581 | 1.581 | -0.588 | 0.408 | No
3 | -0.474 | 1.581 | -1.046 | -1.021 | Yes
4 | -0.474 | -0.632 | -0.588 | -1.021 | Yes
5 | -0.474 | -0.632 | 0.327 | 0.408 | No
6 | 0.632 | -0.632 | -0.588 | 1.837 | Yes
7 | 1.739 | -0.632 | 2.157 | 0.408 | No

Test instance: 8 | 1.739 | 1.581 | -0.131 | -1.021 | ?
Distances of test instance from training data (after normalisation)

Example | Distance of test from example | Mondrian?
1 | 2.517 | No
2 | 3.644 | No
3 | 2.395 | Yes
4 | 3.164 | Yes
5 | 3.472 | No
6 | 3.808 | Yes
7 | 3.490 | No

Classification before and after normalisation:
1-NN: No (before), Yes (after)
3-NN: Yes (before), Yes (after)
5-NN: Yes (before), No (after)
7-NN: No (before), No (after)
Difficulties with k-nearest neighbour algorithms
- Have to calculate the distance of the test case from all training cases.
- There may be irrelevant attributes amongst the attributes (curse of dimensionality).
What if the target function is real-valued?
The k-nearest neighbour algorithm would just calculate the mean of the k nearest neighbours.
Distance-Weighted kNN: the weights of the neighbours are taken into account relative to their distance to the query point.
To accommodate the case where the query point $x_q$ exactly matches one of the training instances $x_i$ and the denominator therefore is zero, we assign $\hat f(x_q)$ to be $f(x_i)$ in this case (i.e., if a training example exactly matches the query point, its stored value is returned).
Distance-Weighted kNN
We might want nearer neighbours to have heavier weight:
$w_i = \frac{1}{d(x_q, x_i)^2}$
with $\hat f(x_q) := f(x_i)$ if $x_q = x_i$.
For discrete-valued target functions:
$\hat f(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} w_i\,\delta(v, f(x_i))$
where V is the set of s classes (e.g., red, black, yellow, ...) and $\delta(a, b) = 1$ if $a = b$, 0 otherwise.
For real-valued target functions (Shepard's method):
$\hat f(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$
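A sketch covering both cases, with $w_i = 1/d(x_q, x_i)^2$ and the exact-match shortcut (Python with NumPy assumed; for the real-valued branch, y_train must be numeric):

```python
import numpy as np

def dw_knn(X_train, y_train, x_query, k=3, discrete=True):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if dists[nearest[0]] == 0.0:          # exact match: return f(x_i)
        return y_train[nearest[0]]
    w = 1.0 / dists[nearest] ** 2
    if discrete:
        # argmax_v sum_i w_i * delta(v, f(x_i))
        scores = {}
        for wi, label in zip(w, y_train[nearest]):
            scores[label] = scores.get(label, 0.0) + wi
        return max(scores, key=scores.get)
    # Shepard's method: distance-weighted mean
    return float(np.sum(w * y_train[nearest]) / np.sum(w))
```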
Remarks on the k-Nearest Neighbour Algorithm
PROBLEM: the measurement of the distance between two instances considers every attribute, so even irrelevant attributes can influence the approximation.
EXAMPLE: n = 20, but only 2 attributes are relevant.
SOLUTION: weight each attribute differently when calculating the distance between two neighbours, stretching the relevant axes in Euclidean space: shorten the axes that correspond to less relevant attributes and lengthen the axes that correspond to more relevant attributes.
PROBLEM: how can we determine automatically which weight belongs to which attribute? Cross-validation, leave-one-out (see next lecture).
Remarks on the k-Nearest Neighbour Algorithm (2)
ADVANTAGES: the training phase is processed very fast; can learn complex target functions; robust to noisy training data; quite effective when a sufficiently large set of training data is provided.
DISADVANTAGES: the algorithm delays all processing until a new query is received, so significant computation can be required to process it (efficient memory indexing helps); processing is slow; sensitive to the curse of dimensionality.
BIAS: the inductive bias corresponds to an assumption that the classification of an instance $x_q$ will be most similar to the classification of other instances that are nearby in Euclidean distance.
Generalizing k-nearest neighbour to continuous outputs
The version of k-nearest neighbours we have already seen works well for discrete outputs. How would we generalize this to predict continuous outputs? Ideas?
Locally Weighted Regression
Local means using nearby points (i.e. a nearest-neighbours approach), based solely on the training data near the query point.
Weighted means we value points based upon how far away they are from the query point.
Regression means approximating a function.
This is an instance-based learning method. The idea: whenever you want to classify a sample,
- build a local model of the function (using a linear function, a quadratic, a neural network, etc.),
- use the model to predict the output value,
- throw the model away.
Locally Weighted Regression
IDEA: a generalization of the nearest neighbour algorithm. It constructs an explicit approximation to f over a local region surrounding $x_q$, using nearby or distance-weighted training examples to form the local approximation to f.
Local: the function is approximated based solely on the training data near the query point.
Weighted: the contribution of each training example is weighted by its distance from the query point.
Regression: means approximating a real-valued target function.
How Locally Weighted Regression works
- Unweighted averaging using springs: the strengths of the springs are equal in the unweighted case, and the position of the horizontal line minimizes the sum of the stored energy in the springs.
- Locally weighted averaging using springs: the springs are not equal; the spring constant of each spring is given by $K(d(x_i, q))$. The locally weighted average emphasizes points close to the query point, and produces an answer (the height of the horizontal line) that is closer to the height of points near the query point than in the unweighted case.
Example of Locally Weighted Learning
The upper graphic shows the set of data points (x, y) (blue dots), the query point (green line), the local linear model (red line) and the prediction (yellow dot). The graphic in the middle shows the activation area of the model. The corresponding weighting kernel (receptive field) is shown in the bottom graphic.
How Locally Weighted Regression works
[Figure: fits using different types of local models, for three and for five data points: nearest neighbour, weighted average, locally weighted regression.]
Locally weighted linear regression
In the following:
- x is an instance; D is the set of training examples, $D = \{\langle x_i, f(x_i)\rangle\}$
- $a_i(x)$ is the value of the i-th attribute of instance x
- the weights $w_i$ form our hypothesis
- f is the target function; $\hat f$ is our approximation to the target function
Locally weighted linear regression
In this case, we use a linear model to do the local approximation:
$\hat f(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
Suppose we aim to minimize the total squared error:
$E = \frac{1}{2}\sum_{x \in D}\big(f(x) - \hat f(x)\big)^2$
Recall the gradient descent rule we used in checkers for this purpose:
$\Delta w_j = \eta \sum_{x \in D}\big(f(x) - \hat f(x)\big)\,a_j(x)$
where η is a small number (the learning rate).
Locally weighted linear regression
Now we adjust this to the present situation. Define the error for the query instance $x_q$, minimised over the set of k nearest neighbours and weighted by a kernel function K that decreases with distance:
$E(x_q) = \frac{1}{2}\sum_{x \in \text{kNN of } x_q}\big(f(x) - \hat f(x)\big)^2\,K(d(x_q, x))$
The new version of the gradient descent rule becomes:
$\Delta w_j = \eta \sum_{x \in \text{kNN of } x_q} K(d(x_q, x))\,\big(f(x) - \hat f(x)\big)\,a_j(x)$
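Rather than iterating the gradient rule, the same kernel-weighted squared error can be minimised in closed form for each query. A sketch, assuming a Gaussian kernel applied to all training points (a common variant of restricting to the k nearest) and NumPy; lwr_predict is an illustrative name:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    # K(d(x_q, x)) = exp(-d^2 / (2 tau^2)), one weight per training point
    K = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    A = np.hstack([np.ones((len(X), 1)), X])     # model w0 + w1*a1(x) + ...
    s = np.sqrt(K)[:, None]
    # weighted least squares: minimise sum_x K * (f(x) - f_hat(x))^2
    w, *_ = np.linalg.lstsq(s * A, s[:, 0] * y, rcond=None)
    return float(np.array([1.0, *x_query]) @ w)  # f_hat(x_q); model discarded

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
print(lwr_predict(X, y, np.array([1.5]), tau=0.5))
```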
Locally Weighted Linear Regression
We might approximate the target function in the neighbourhood surrounding $x_q$ using a linear function, a quadratic function, a multilayer neural network, or some other functional form. Using a linear function to approximate f:
$\hat f(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
Recall from chapter 4 the gradient descent rule for minimising $E = \frac{1}{2}\sum_{x \in D}\big(f(x) - \hat f(x)\big)^2$:
$\Delta w_j = \eta \sum_{x \in D}\big(f(x) - \hat f(x)\big)\,a_j(x)$
Locally Weighted Regression
PROCEDURE: given a new query $x_q$, construct an approximation $\hat f$ that fits the training examples in the neighbourhood surrounding $x_q$. This approximation is used to calculate $\hat f(x_q)$, which is output as the estimated target value for the query instance. The description of $\hat f$ may change, because a different local approximation is calculated for each instance.
Evaluation of Locally Weighted Regression
ADVANTAGES:
- pointwise approximation of a complex target function
- earlier local approximations have no influence on new queries (each query gets its own local model)
DISADVANTAGES:
- the quality of the result depends on the choice of the function $\hat f$, the choice of the kernel function K, and the choice of the hypothesis space H
- sensitive to relevant and irrelevant attributes
Radial Basis Function (RBF) Networks
An RBF neural network has an input layer, a hidden layer, and an output layer. The neurons in the hidden layer contain Gaussian transfer functions whose outputs are inversely proportional to the distance from the center of the neuron (the farther data lies from a neuron's center, the less influence it has on the result).
In its method it is similar to k-means clustering and to PNN (Probabilistic Neural Network) / GRNN (Generalized Regression Neural Network) networks. The main difference: PNN/GRNN networks have one neuron for each point in the training file, while RBF networks have a variable number of neurons that is usually much less than the number of training points.
For problems with small to medium-sized training sets, PNN/GRNN networks are usually more accurate than RBF networks, but PNN/GRNN networks are impractical for large training sets.
How RBF networks work
Although the implementation is very different, RBF neural networks are conceptually similar to k-nearest neighbour (k-NN) models in strategy. The basic idea is that the predicted target value of an item is likely to be about the same as for other items that have close values of the predictor variables.
Radial-Basis Function Networks
RBFs represent local receptors, where each stored (green) point is a vector used in one RBF. In an RBF network, one hidden layer uses neurons with RBF activation functions describing local receptors; one output node is then used to combine the outputs of the hidden neurons linearly, with weights $w_1$, $w_2$, $w_3$ in the figure.
The output for the query (red) vector is 'interpolated' using the three green vectors, where each vector gives a contribution that depends on its weight and on its distance from the red point.
MLP vs RBFN
[Figure: decision regions formed by global hyperplanes in an MLP vs. local receptive fields in an RBFN.]
Radial Basis Function Network
A kind of supervised neural network; the design of the NN is treated as a curve-fitting problem.
Learning: find the surface in multidimensional space that best fits the training data.
Generalization: use this multidimensional surface to interpolate the test data.
Radial Basis Function Network
Approximate the function with a linear combination of radial basis functions:
$f(x) = \sum_{j=1}^{m} w_j h_j(x)$
h(x) is mostly a Gaussian function:
$h_j(x) = \exp\left(-\frac{(x - c_j)^2}{r_j^2}\right)$
where $c_j$ is the center of a region and $r_j$ is the width of the receptive field.
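A sketch of this model with given centers and widths (choosing them is the training problem discussed below; all names here are illustrative):

```python
import numpy as np

def rbf_output(x, centers, widths, weights):
    h = np.exp(-((x - centers) ** 2) / widths ** 2)  # h_j(x), Gaussian
    return float(weights @ h)                        # f(x) = sum_j w_j h_j(x)

centers = np.array([0.0, 1.0, 2.0])   # c_j
widths  = np.array([0.5, 0.5, 0.5])   # r_j
weights = np.array([1.0, -0.5, 2.0])  # w_j
print(rbf_output(0.9, centers, widths, weights))
```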
HIDDEN NEURON MODEL
A hidden neuron computes $h_j(\lVert x - c_j \rVert)$ for an input vector $x = (x_1, \ldots, x_m)$: the output depends on the distance of the input x from the center $c_j$.
$c_j$ is called the center, and $\sigma_j$ the spread; center and spread are the neuron's parameters.
RBF ARCHITECTURE
One hidden layer with RBF activation functions $h_1, \ldots, h_m$, and an output layer with a linear activation function:
$y = w_1 h_1(\lVert x - c_1 \rVert) + \cdots + w_m h_m(\lVert x - c_m \rVert)$
where $\lVert x - c \rVert$ is the distance of the input $x = (x_1, \ldots, x_m)$ from the center vector c.
Three layers
- Input layer: source nodes that connect the network to its environment.
- Hidden layer: hidden units provide a set of basis functions (high dimensionality).
- Output layer: linear combination of the hidden functions.

RBF Network Architecture
Weight = RBF(distance): the further a neuron is from the point being evaluated, the less influence it has.
Radial Basis Function
Different types of radial basis functions could be used, but the most common is the Gaussian function. If there is more than one predictor variable, the RBF function has as many dimensions as there are variables.
Radial Basis Function
With two predictor variables X and Y and three neurons in the space, Z is the value coming out of the RBF functions. The best predicted value for the new point is found by summing the output values of the RBF functions multiplied by the weights computed for each neuron.
The radial basis function for a neuron has a center and a radius (also called a spread). The radius may be different for each neuron and, in some RBF networks, for each dimension.
Training RBF Networks
The following parameters are determined by the training process:
- the number of neurons in the hidden layer
- the coordinates of the center of each hidden-layer RBF function
- the radius (spread) of each RBF function in each dimension
- the weights applied to the RBF function outputs as they are passed to the summation layer
Designing
Requires selection of the radial basis function width parameter and of the number of radial basis neurons.
Number of radial basis neurons: set by the designer. The maximum number of neurons equals the number of inputs; the minimum is experimentally determined. More neurons give a more complex network, but a smaller tolerance.
Designing
Various learning strategies exist, depending on how the centers of the radial-basis functions of the network are specified:
- fixed centers selected at random
- self-organized selection of centers
- supervised selection of centers
Fixed centers selected at random (1)
Fixed RBFs for the hidden units: the locations of the centers may be chosen randomly from the training data set. We can use different values of centers and widths for each radial basis function; experimentation with the training data is needed.
Fixed centers selected at random (2)
Only the output-layer weights need to be learned.
Main problem: requires a large training set for a satisfactory level of performance.
Self-organized selection of centers (1)
Centers are selected by means of clustering; the output weights are learned by supervised learning with the LMS (Least Mean Squares) algorithm.
Hybrid learning:
- self-organized learning to estimate the centers of the RBFs in the hidden layer
- supervised learning to estimate the linear weights of the output layer
(The centers are determined by clustering, but the output weights by supervised learning!)
Self-organized selection of centers (2)
k-means clustering (sketched below):
1. Initialization
2. Sampling
3. Similarity matching
4. Updating
5. Continuation
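A minimal batch (Lloyd-style) sketch of these five steps for placing RBF centers; the sampling/matching steps are per-example in the original formulation, but the batch version below is equivalent in spirit (NumPy assumed, X a float array):

```python
import numpy as np

def kmeans_centers(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialization
    for _ in range(n_iter):                                 # 5. continuation
        # 2./3. take the examples and match each to its most similar center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):                                  # 4. updating
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```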
Supervised selection of centers
All free parameters of the network are changed by a supervised learning process: error-correction learning using the LMS algorithm.
Radial functions
Gaussian RBF (monotonically decreases with distance from the center):
$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right)$
Multiquadric RBF (monotonically increases with distance from the center):
$h(x) = \frac{\sqrt{r^2 + (x - c)^2}}{r}$
where c is the center and r the radius.
Least Squares
Model: $f(x) = \sum_{j=1}^{m} w_j h_j(x)$
Training data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$
Minimize the sum-squared error:
$S = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2$
Example
Sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}
Linear model: $f(x) = w_1 h_1(x) + w_2 h_2(x)$, where $h_1(x) = 1$, $h_2(x) = x$.
Estimate the coefficients $w_1$, $w_2$ (the result, derived below, is f(x) = x).
New model: $f(x) = w_1 h_1(x) + w_2 h_2(x) + w_3 h_3(x)$, where $h_1(x) = 1$, $h_2(x) = x$, $h_3(x) = x^2$.
If the model absorbs all the noise, it overfits: if it is too flexible, it will fit the noise; if it is too inflexible, it will miss the target.
The optimal weight vector
Model: $f(x) = \sum_{j=1}^{m} w_j h_j(x)$
Sum-squared error: $S = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2$
Cost function (to be minimized), with a weight penalty term added:
$C = \sum_{i=1}^{p} \big(y_i - f(x_i)\big)^2 + \sum_{j=1}^{m} \lambda_j w_j^2$
where the $\lambda_j$ are regularization parameters.
Differentiating the cost with respect to $w_j$, using $\frac{\partial f(x_i)}{\partial w_j} = h_j(x_i)$:
$\frac{1}{2}\frac{\partial C}{\partial w_j} = -\sum_{i=1}^{p}\big(y_i - f(x_i)\big)\,h_j(x_i) + \lambda_j w_j$
Setting $\frac{\partial C}{\partial w_j} = 0$ gives
$\sum_{i=1}^{p} f(x_i)\,h_j(x_i) + \lambda_j w_j = \sum_{i=1}^{p} y_i\,h_j(x_i)$
Let
$\mathbf{h}_j = \begin{pmatrix} h_j(x_1) \\ h_j(x_2) \\ \vdots \\ h_j(x_p) \end{pmatrix}, \quad \mathbf{f} = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_p) \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}$
Then
$\mathbf{h}_j^T \mathbf{f} + \lambda_j w_j = \mathbf{h}_j^T \mathbf{y} \quad \text{for all } j = 1, 2, \ldots, m$
Stacking the m equations, and writing $H = (\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_m)$ (the design matrix) and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$:
$\begin{pmatrix} \mathbf{h}_1^T \mathbf{f} \\ \mathbf{h}_2^T \mathbf{f} \\ \vdots \\ \mathbf{h}_m^T \mathbf{f} \end{pmatrix} + \begin{pmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_m w_m \end{pmatrix} = \begin{pmatrix} \mathbf{h}_1^T \mathbf{y} \\ \mathbf{h}_2^T \mathbf{y} \\ \vdots \\ \mathbf{h}_m^T \mathbf{y} \end{pmatrix} \quad\Longleftrightarrow\quad H^T \mathbf{f} + \Lambda \mathbf{w} = H^T \mathbf{y}$
Since $\mathbf{f} = H\mathbf{w}$,
$\big(H^T H + \Lambda\big)\,\mathbf{w} = H^T \mathbf{y}$
so, with $A = H^T H + \Lambda$ (simply $A = H^T H$ when all $\lambda_j = 0$), the optimal weight vector is
$\mathbf{w} = A^{-1} H^T \mathbf{y}$
Written out, the model values and the design matrix are
$H\mathbf{w} = \begin{pmatrix} \sum_{j=1}^{m} w_j h_j(x_1) \\ \sum_{j=1}^{m} w_j h_j(x_2) \\ \vdots \\ \sum_{j=1}^{m} w_j h_j(x_p) \end{pmatrix}, \qquad H = \begin{pmatrix} h_1(x_1) & h_2(x_1) & \cdots & h_m(x_1) \\ h_1(x_2) & h_2(x_2) & \cdots & h_m(x_2) \\ \vdots & \vdots & & \vdots \\ h_1(x_p) & h_2(x_p) & \cdots & h_m(x_p) \end{pmatrix}$
Example
Sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}
Linear model: $f(x) = w_1 h_1(x) + w_2 h_2(x)$, where $h_1(x) = 1$, $h_2(x) = x$. Estimate the coefficients $w_1$, $w_2$:
$H = \begin{pmatrix} h_1(x_1) & h_2(x_1) \\ h_1(x_2) & h_2(x_2) \\ h_1(x_3) & h_2(x_3) \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 1.1 \\ 1.8 \\ 3.1 \end{pmatrix}, \quad H^T H = \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}$
$A^{-1} = \big(H^T H\big)^{-1} = \begin{pmatrix} 7/3 & -1 \\ -1 & 1/2 \end{pmatrix}, \quad \mathbf{w} = A^{-1} H^T \mathbf{y} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$
So $f(x) = 0 \cdot 1 + 1 \cdot x$, i.e. f(x) = x.
The linear fit misses the curvature of the data; the model should have an extra term, $x^2$.
New model: $f(x) = w_1 h_1(x) + w_2 h_2(x) + w_3 h_3(x)$, where $h_1(x) = 1$, $h_2(x) = x$, $h_3(x) = x^2$.
The design matrix gains a third column $\big(h_3(x_1), h_3(x_2), h_3(x_3)\big)^T = (1, 4, 9)^T$:
$H = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}$
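The worked example can be checked numerically; a sketch with NumPy, taking all $\lambda_j = 0$ so that $A = H^T H$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.8, 3.1])

H = np.column_stack([np.ones_like(x), x])         # columns h1(x)=1, h2(x)=x
w = np.linalg.solve(H.T @ H, H.T @ y)             # w = (H^T H)^-1 H^T y
print(w)                                          # ~[0. 1.]  ->  f(x) = x

H3 = np.column_stack([np.ones_like(x), x, x**2])  # add h3(x) = x^2
w3 = np.linalg.solve(H3.T @ H3, H3.T @ y)         # interpolates all 3 points
```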
Radial Basis Function (RBF) Networks
Each prototype node computes a distance-based kernel function (a Gaussian is common). The prototype nodes form a hidden layer in a neural network; the top layer is trained with the simple delta rule to produce the outputs. Thus the prototype nodes learn weightings for each class.
RBF networks are a blend of the instance-based method and the neural network method.
Radial Basis Function
Function to be learned:
$\hat f(x) = w_0 + \sum_{u=1}^{k} w_u\,K_u\big(d(x_u, x)\big)$
One common choice for $K_u\big(d(x_u, x)\big)$ is the Gaussian:
$K_u\big(d(x_u, x)\big) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
This is a global approximation to the target function, in terms of a linear combination of local approximations. It is related to distance-weighted regression, but 'eager' instead of 'lazy'.
Radial Basis Function Networks
$a_i(x)$ are the attributes describing instance x. The first layer computes the various $K_u\big(d(x_u, x)\big)$; the second layer computes a linear combination of the first-layer unit values. The hidden unit activation is close to 0 if x is not near $x_u$.
Approximation
MLP: a global network; all inputs cause an output.
RBF: a local network; only inputs near a receptive field produce an activation. It can give a 'don't know' output.
MLP vs RBFN

MLP | RBFN
Global hyperplane | Local receptive field
EBP (error back-propagation) | LMS
Local minima | Serious local minima
Smaller number of hidden neurons | Larger number of hidden neurons
Shorter computation time | Longer computation time
Longer learning time | Shorter learning time
Case-based reasoning (CBR)
CBR shares the first two principles of instance-based methods and locally weighted regression (lazy learning, and classifying new queries by analysing similar stored instances), but the instances are represented by a richer symbolic description, and correspondingly more elaborate methods are used for retrieval.
CBR is an advanced instance-based learning method applied to more complex instance objects. Objects may include complex structural descriptions of cases and adaptation rules.
CBR cannot use Euclidean distance measures; distance measures must instead be defined for those complex objects (e.g., semantic nets).
CBR tries to model human problem-solving: it uses past experience (cases) to solve new problems and retains the solutions to new problems.
CBR is an ongoing area of machine learning research with many applications.
Applications of CBR
- Design: landscape, building, mechanical, conceptual design of aircraft sub-systems
- Planning: repair schedules
- Diagnosis: medical
- Adversarial reasoning: legal
CBR process
A new case is matched against the case base, retrieving the matched cases and the closest case. If no adaptation is needed, the closest case's solution is reused and suggested directly; otherwise the solution is revised using knowledge and adaptation rules. Useful new solutions are retained in the case base (learning).
[Figure: retrieve, reuse, revise, retain cycle around the case base.]
CBR example: property pricing

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
1 | 8 | 2 | 1 | terraced | 1 | poor | 20,500
2 | 8 | 2 | 2 | terraced | 1 | fair | 25,000
3 | 5 | 1 | 2 | semi | 2 | good | 48,000
4 | 5 | 1 | 2 | terraced | 2 | good | 41,000

Test instance:
5 | 7 | 2 | 2 | semi | 1 | poor | ???
How rules are generated
There is no unique way of doing it. Here is one possibility: examine cases and look for ones that are almost identical.
Cases 1 and 2 → Rule 1: if recep-rooms changes from 2 to 1, then reduce the price by £5,000.
Cases 3 and 4 → Rule 2: if type changes from semi to terraced, then reduce the price by £7,000.
Matching
Comparing the test instance with each case (counting matching attributes, as sketched below):
matches(5,1) = 3; matches(5,2) = 3 (max cost: £25,000); matches(5,3) = 2; matches(5,4) = 1
The estimated price of case 5 is £25,000.
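A sketch of this attribute-count matching (the dictionary layout and the function name matches are illustrative; price is excluded from the comparison):

```python
def matches(case_a, case_b):
    return sum(case_a[f] == case_b[f] for f in case_a if f != "price")

case2 = {"location": 8, "bedrooms": 2, "recep": 2, "type": "terraced",
         "floors": 1, "condition": "fair", "price": 25000}
test  = {"location": 7, "bedrooms": 2, "recep": 2, "type": "semi",
         "floors": 1, "condition": "poor", "price": None}
print(matches(test, case2))  # 3 (bedrooms, recep rooms, floors)
```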
Adapting
Reverse rule 2: if type changes from terraced to semi, then increase the price by £7,000.
Applying reversed rule 2, the new estimate of the price of property 5 is £32,000.
Learning
So far we have a new case and an estimated price; nothing is added yet to the case base.

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
5 | 7 | 2 | 2 | terraced | 1 | poor | 32,000

If later we find a house with location code 8 sold for £35,000:

Case | Location code | Bedrooms | Recep rooms | Type | Floors | Condition | Price (£)
#5 | 8 | 2 | 2 | terraced | 1 | poor | 35,000

then that case would be added, and we could add a new rule: if location changes from 7 to 8, increase the price by £3,000.
Problems with CBR
- How should cases be represented?
- How should cases be indexed for fast retrieval?
- How can good adaptation heuristics be developed?
- When should old cases be removed?
Advantages
- A local approximation is found for each test case.
- Knowledge is in a form understandable to human beings.
- Fast to train.
Lazy and Eager Learning
Lazy: wait for the query before generalizing (kNN, locally weighted regression, CBR).
Eager: generalize before seeing the query (RBF networks).
Differences: computation time; global vs. local approximations to the target function. Using the same H, a lazy learner can represent more complex functions (e.g., consider H = linear functions).
Summary
Differences and advantages:
- kNN algorithm: the most basic instance-based method.
- Locally weighted regression: a generalization of kNN.
- RBF networks: a blend of the instance-based method and the neural network method.
- Case-based reasoning.
Lazy and Eager Learning
Lazy: wait for the query before generalizing (k-Nearest Neighbour, case-based reasoning).
Eager: generalize before seeing the query (RBF networks, ID3, ...).
Does it matter? An eager learner must create a global approximation; a lazy learner can create many local approximations.
The End