K nearest neighbor


Page 1: K nearest neighbor
Page 2: K nearest neighbor

Classification is done by relating the unknown to the known according to some distance/similarity function

Stores all available cases and classifies new cases based on a similarity measure

Different names

Memory-based reasoning

Example-based reasoning

Instance-based reasoning

Case-based reasoning

Lazy learning

Page 3: K nearest neighbor

kNN determines the decision boundary locally. For example, for 1NN we assign each document to the class of its closest neighbor

For kNN we assign each document to the majority class of its k closest neighbors, where k is a parameter

The rationale of kNN classification is the contiguity hypothesis: we expect a test document to have the same label as the training documents located in the local region surrounding it.

The Voronoi tessellation of a set of objects decomposes space into Voronoi cells, where each object's cell consists of all points that are closer to that object than to any other object.

It partitions the plane into convex polygons, each containing its corresponding document.

Page 4: K nearest neighbor

Let k = 3

The 3NN estimates are:

P(circle class | star) = 1/3

P(X class | star) = 2/3

P(diamond class | star) = 0

The 1NN estimate is:

P(circle class | star) = 1

So 3NN prefers the X class while 1NN prefers the circle class
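
A minimal sketch of how these estimates can be computed, written in Python. The 2-D coordinates are hypothetical, chosen so that the star's single nearest neighbor is a circle while two of its three nearest neighbors are X's (the situation the slide's figure implies):

from collections import Counter
import math

def knn_class_probabilities(train, query, k):
    """Estimate P(class | query) as the fraction of the k nearest
    training points that belong to each class."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return {label: count / k for label, count in votes.items()}

# Hypothetical layout: the nearest point is a circle, the next two are X's.
train = [((1.0, 1.0), "circle"), ((1.3, 0.9), "X"), ((0.8, 1.4), "X"),
         ((3.0, 3.0), "diamond"), ((3.2, 2.8), "circle")]
star = (1.0, 1.1)

print(knn_class_probabilities(train, star, k=3))  # circle: 1/3, X: 2/3, diamond gets no votes
print(knn_class_probabilities(train, star, k=1))  # the single nearest neighbor is a circle: {'circle': 1.0}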

Page 5: K nearest neighbor

Advantages

Non-parametric architecture

Simple

Powerful

Requires no training time

Disadvantages

Memory intensive

Classification/estimation is slow

The distance is calculated using the Euclidean distance

Page 6: K nearest neighbor

Euclidean distance between two points (x1, y1) and (x2, y2):

D = sqrt((x1 - x2)^2 + (y1 - y2)^2)

Page 7: K nearest neighbor

Min-max normalization:

Xs = (X - Min) / (Max - Min)
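
A short Python sketch of these two formulas (the function names are illustrative); min-max scaling is typically applied so that all features contribute comparably to the distance:

import math

def euclidean_distance(p, q):
    """D = sqrt((x1 - x2)^2 + (y1 - y2)^2), generalized to any dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(values):
    """Rescale each value to [0, 1]: Xs = (X - Min) / (Max - Min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(euclidean_distance((1, 2), (4, 6)))   # 5.0
print(min_max_scale([10, 20, 15, 30]))      # [0.0, 0.5, 0.25, 1.0]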


Page 10: K nearest neighbor

If k = 1, select the nearest neighbor

If k > 1

For classification, select the most frequent class among the k neighbors

For regression, calculate the average of the k neighbors' values
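
A sketch of this prediction rule in Python; the function name knn_predict and the toy data are illustrative:

from collections import Counter
import math

def knn_predict(train, query, k, task="classification"):
    """Predict for a query point from its k nearest neighbors.
    train: list of (point, target) pairs; point is a tuple of numbers."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    targets = [target for _, target in neighbors]
    if task == "classification":
        # Most frequent class among the k neighbors.
        return Counter(targets).most_common(1)[0][0]
    # Regression: average of the k neighbors' target values.
    return sum(targets) / k

points = [((1, 1), 10.0), ((2, 1), 12.0), ((8, 9), 40.0), ((9, 8), 42.0)]
print(knn_predict(points, (1.5, 1.0), k=2, task="regression"))  # 11.0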

Page 11: K nearest neighbor
Page 12: K nearest neighbor

An inductive learning task – use particular facts to make more generalized conclusions

Predictive model based on a branching series of Boolean tests; these tests are less complex than a one-stage classifier

It learns from class-labeled tuples

Can be used as a visual aid to structure and solve sequential problems

An internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label

Page 13: K nearest neighbor

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

[Decision tree: the root node tests "Leave At" with branches 8 AM, 9 AM and 10 AM; 8 AM leads to Long; 9 AM leads to an "Accident?" test (Yes: Long, No: Medium); 10 AM leads to a "Stall?" test (Yes: Long, No: Short).]

Page 14: K nearest neighbor

In this decision tree, we made a series of Boolean decisions and followed the corresponding branch –

Did we leave at 10AM?

Did the car stall on road?

Is there an accident on the road?

By answering each of these questions as yes or no, we can come to a conclusion on how long our commute might take

Page 15: K nearest neighbor

We do not have to represent this tree graphically

We can represent this as a set of rules. However, it may be harder to read

if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short
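
These rules can also be written as a small runnable function; this is a sketch in Python (the name commute_time is illustrative), and it answers the earlier question about leaving at 10 AM with no stalled cars:

def commute_time(hour, accident=False, stall=False):
    # Rule-based form of the commute-time decision tree above.
    if hour == "8am":
        return "long"
    if hour == "9am":
        return "long" if accident else "medium"
    if hour == "10am":
        return "long" if stall else "short"
    raise ValueError("unknown departure time")

# Leaving at 10 AM with no cars stalled on the road -> short commute.
print(commute_time("10am", stall=False))  # prints: short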

Page 16: K nearest neighbor

The algorithm is called with three parameters: the data partition, the attribute list, and the attribute selection method.

The data partition D is a set of tuples and their associated class labels

The attribute list is a list of attributes describing the tuples

The attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates among the tuples

The tree starts at node N. If all the tuples in D are of the same class, then node N becomes a leaf and is labelled with that class

Otherwise the attribute selection method is used to determine the splitting criterion.

Node N is labelled with the splitting criterion, which serves as a test at the node.
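
A skeleton of this recursive procedure, as a Python sketch; the attribute-selection heuristic is left abstract (passed in as a function), and the names used here are illustrative rather than taken from the slides:

from collections import Counter

def build_tree(data, attributes, select_attribute):
    """data: list of (dict of attribute -> value, class_label) pairs."""
    labels = [label for _, label in data]
    # If all tuples in D have the same class, N becomes a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes remain, label the leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise use the attribute-selection method to pick the split.
    best = select_attribute(data, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {row[best] for row, _ in data}:
        subset = [(row, label) for row, label in data if row[best] == value]
        node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node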

Page 17: K nearest neighbor
Page 18: K nearest neighbor
Page 19: K nearest neighbor

The previous experience decision table showed 4 attributes – hour, weather, accident and stall

But the decision tree used only three attributes: hour, accident and stall

So which attribute is to be kept and which is to be removed?

Attribute selection methods show that weather is not a discriminating attribute

Guiding principle: given a number of competing hypotheses, the simplest one is preferable

We will focus on ID3 algorithm

Page 20: K nearest neighbor

Basic idea

Choose the best attribute to split the remaining instances and make that attribute a decision node

Repeat this process recursively for each child

Stop when

All instances have the same target attribute value

There are no more attributes

There are no more instances

ID3 splits attributes based on their entropy.

Entropy is a measure of uncertainty (disorder) in the data

Page 21: K nearest neighbor

Entropy is minimized when all values of target attribute are the same

If we know that the commute time will be short, the entropy=0

Entropy is maximized when there is an equal chance of values for the target attribute (i.e. result is random)

If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized

Calculation of entropy

Entropy(S) = ∑ (i = 1 to l) of −(|Si| / |S|) * log2(|Si| / |S|)

S = set of examples

Si = subset of S with value vi for the target attribute

l = number of distinct values of the target attribute (the size of its range)
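
The formula above as a short Python sketch, checked against the two cases from this slide (all values the same, and a 3/3/3 split):

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -(|Si|/|S|) * log2(|Si|/|S|)."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

print(entropy(["short"] * 6))                                  # 0.0 (all the same)
print(entropy(["short"] * 3 + ["medium"] * 3 + ["long"] * 3))  # log2(3) ~ 1.585 (maximal for 3 classes)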

Page 22: K nearest neighbor
Page 23: K nearest neighbor

If we break down the leaving time to the minute, we might get something like this

Since the entropy is very low for each branch and we would have n branches with n leaves, this would not be helpful for predictive modelling

We use a technique called discretization: we choose a cut point, such as 9 AM, for splitting continuous attributes

[Example: per-minute departure times 8:02 AM, 8:03 AM, 9:05 AM, 9:07 AM, 9:09 AM and 10:02 AM, each branch carrying its own commute-time label (Long, Medium, Short, Long, Long, Short).]

Page 24: K nearest neighbor

Consider the attribute commute time

When we group this attribute's values into bins, we increase the entropy within each branch, but we avoid a decision tree with as many cut points as leaves

8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
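
A sketch of how such a cut point can be chosen automatically: try each candidate cut and keep the one that minimizes the weighted entropy of the two resulting branches (equivalently, maximizes information gain). The encoding of times as minutes after midnight and the greedy single-cut search are assumptions for illustration, reusing the entropy helper from the earlier sketch and the data listed above:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def best_cut_point(times, labels):
    """Try each midpoint between consecutive times and keep the one whose
    binary split (<= cut vs > cut) gives the lowest weighted entropy."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    times = [times[i] for i in order]
    labels = [labels[i] for i in order]
    best = None
    for i in range(len(times) - 1):
        cut = (times[i] + times[i + 1]) / 2
        left, right = labels[:i + 1], labels[i + 1:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or weighted < best[1]:
            best = (cut, weighted)
    return best

# Minutes after midnight for 8:00, 8:02, 8:07, 9:00, 9:20, 9:25, 10:00, 10:02
times = [480, 482, 487, 540, 560, 565, 600, 602]
labels = ["L", "L", "M", "S", "S", "S", "S", "M"]
print(best_cut_point(times, labels))  # (484.5, 0.688...) - a cut between 8:02 and 8:07 for this toy data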

Page 25: K nearest neighbor

Binary decision trees

Classification of an input vector is done by traversing the tree beginning at the root node and ending at the leaf

Each node of the tree computes an inequality

Each leaf is assigned to a particular class

Each split of the input space is based on a single input variable

Each node draws a boundary that can be geometrically interpreted as a hyperplane perpendicular to that variable's axis

Page 26: K nearest neighbor

[Figure: a binary decision tree whose root node tests the inequality BMI < 24; each node's Yes/No branches lead to further nodes B and C and eventually to class-labelled leaves.]
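
A minimal traversal sketch in Python; the nested tests, the second attribute and the class names are illustrative, echoing only the BMI < 24 test from the figure:

def classify(node, sample):
    """Traverse a binary tree of inequality tests from root to leaf."""
    while isinstance(node, dict):          # internal node: evaluate its test
        branch = "yes" if node["test"](sample) else "no"
        node = node[branch]
    return node                            # leaf: the class label

# Illustrative tree: the root tests BMI < 24 (as in the figure above).
tree = {
    "test": lambda s: s["bmi"] < 24,
    "yes": {"test": lambda s: s["age"] < 40, "yes": "class B", "no": "class C"},
    "no": "class C",
}
print(classify(tree, {"bmi": 22, "age": 35}))  # class B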

Page 27: K nearest neighbor

They are similar to binary decision trees

The inequality computed at each node takes a linear form that may depend on multiple input variables, e.g.

aX1 + bX2


Page 28: K nearest neighbor

Chi-squared Automatic Interaction Detector (CHAID)

Non-binary decision tree

Decision made at each node is based on single variable, but can result in multiple branches

Continuous variables are grouped into a finite number of bins to create categories

Equal-population bins are created for CHAID (a short sketch follows below)

Classification and Regression Trees (CART) are binary decision trees which split a single variable at each node

The CART algorithm goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node.
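
A minimal sketch of the equal-population binning step mentioned above for CHAID; the cut-at-sorted-position approach and the toy data are assumptions for illustration:

def equal_population_bins(values, n_bins):
    """Group a continuous variable into bins holding (roughly) equal numbers
    of observations, by cutting the sorted values at quantile boundaries."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers sorted positions [i*n//n_bins, (i+1)*n//n_bins).
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

ages = [23, 45, 31, 62, 29, 51, 38, 47, 55, 60, 27, 34]
for b in equal_population_bins(ages, 3):
    print(b)   # three bins of 4 observations each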

Page 29: K nearest neighbor

There is another technique for reducing the number of attributes used in a tree – pruning

Two types of pruning

Pre-pruning (forward pruning)

Post-pruning (backward pruning)

Pre-pruning

We decide during the building process, when to stop adding attributes (possibly based on their information gain)

However, this may be problematic – why?

Sometimes attributes individually do not contribute much to a decision, but combined they may have a significant impact.

Page 30: K nearest neighbor

Post-pruning waits until the full decision tree has been built and then prunes its subtrees.

Two techniques:

Subtree replacement

Subtree raising

Subtree replacement

[Figure: original tree with root A; A's child B has children C, 4 and 5; node C has leaves 1, 2 and 3.]

Page 31: K nearest neighbor

Node 6 replaced the subtree

May increase accuracy

[Figure: after subtree replacement, the subtree rooted at C is replaced by the single leaf node 6, so B's children are now 6, 4 and 5.]

Page 32: K nearest neighbor

Entire subtree is raised onto another node

[Figure: subtree raising. Before: root A has child B, whose children are C, 4 and 5, and C has leaves 1, 2 and 3. After: the subtree rooted at C is raised to replace B, so A's child is C with leaves 1, 2 and 3.]

Page 33: K nearest neighbor

While a decision tree classifies quickly, the time taken to build the tree may be higher than for other types of classifiers.

Decision trees suffer from the problem of error propagation throughout the tree

Page 34: K nearest neighbor

Since decision trees work by a series of local decisions, what happens if one of these decisions is wrong?

Every decision from that point on may be wrong

We may never return to the correct path of the tree