Clustering in artificial intelligence
1
UNSUPERVISED LEARNING
Supervised and Unsupervised Learning
ID3 and version space search are supervised learning algorithms
Unsupervised learning eliminates the teacher and requires that the learner form concepts (categories) on its own
Conceptual clustering is the problem of discovering useful categories in unclassified data (data whose categories are not pre-determined)
2
CONCEPTUAL CLUSTERING
Unsupervised Learning and Numeric taxonomy
The clustering problem begins with a collection of unclassified objects and a means for measuring the similarity of objects
The goal is to organize the objects into classes so that similar objects are in one class
Numeric taxonomy is one of the oldest approaches to the clustering problem
3
CONCEPTUAL CLUSTERING
In numeric taxonomy the objects are represented as a collection of features, each of which has a numeric value
An object is thus a vector of n feature values and can be considered as a point in n-dimensional space
The similarity of any two objects can be measured by the Euclidean distance between them in this space
Using this similarity metric, clustering algorithms build clusters in a bottom-up fashion (an agglomerative clustering strategy)
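The distance computation is straightforward; a minimal Python sketch (the feature vectors shown are hypothetical examples, not from the source):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two objects viewed as points in
    n-dimensional feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical objects, each a vector of three numeric feature values
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```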
4
CONCEPTUAL CLUSTERING
The categories are formed by the following approach
1. Examine all pairs of objects, and select the pair with the highest degree of similarity and make that pair a cluster
2. Define the features of the cluster as some function, such as average, of the features of the component members and then replace the component objects with this cluster definition
3. Repeat this process on the collection of objects until all objects have been reduced to a single cluster
The result of this algorithm is a binary tree whose leaf nodes are instances and whose internal nodes are clusters of increasing size
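The three steps above can be sketched in Python. The sample objects, the use of averaging as the combining function, and the tuple representation of tree nodes are illustrative assumptions, not part of the original algorithm statement:

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(objects):
    """Bottom-up (agglomerative) clustering.  Each node is a tuple
    (features, left, right); leaf nodes have left = right = None.
    A cluster's features are the average of its two children's features
    (one possible choice of combining function)."""
    clusters = [(tuple(o), None, None) for o in objects]
    while len(clusters) > 1:
        # 1. Select the pair with the highest similarity
        #    (smallest Euclidean distance)
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]][0],
                                                 clusters[p[1]][0]))
        a, b = clusters[i], clusters[j]
        # 2. Define the cluster's features as the average of its members'
        merged = tuple((x + y) / 2 for x, y in zip(a[0], b[0]))
        # 3. Replace the pair with the new cluster and repeat
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, a, b))
    return clusters[0]  # root of the binary tree; its leaves are the instances

root = agglomerate([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(root[0])  # features of the final, all-inclusive cluster: (2.5, 2.75)
```

Note that the result is the binary tree described above: leaf nodes hold the original instances, and each internal node holds the averaged features of an increasingly large cluster.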
5
CONCEPTUAL CLUSTERING
6
CONCEPTUAL CLUSTERING
We may extend this algorithm to objects represented as sets of symbolic features. The only difficulty is in measuring the similarity of such objects
A similarity metric can be the proportion of features that any two objects have in common
object 1 = {small, red, rubber, ball}
object 2 = {small, blue, rubber, ball}
object 3 = {large, black, wooden, ball}
similarity (object 1, object 2) = 3/4
similarity (object 1, object 3) = 1/4
similarity (object 2, object 3) = 1/4
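This proportion can be computed directly from the set representation; a small sketch (Fraction is used only so the results print as fractions, and the metric assumes both objects have the same number of features):

```python
from fractions import Fraction

def similarity(a, b):
    """Proportion of features two objects hold in common.  Assumes both
    objects are described by the same number of features."""
    return Fraction(len(a & b), len(a))

obj1 = {"small", "red", "rubber", "ball"}
obj2 = {"small", "blue", "rubber", "ball"}
obj3 = {"large", "black", "wooden", "ball"}

print(similarity(obj1, obj2))  # 3/4
print(similarity(obj1, obj3))  # 1/4
print(similarity(obj2, obj3))  # 1/4
```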
7
CONCEPTUAL CLUSTERING
In defining categories we cannot give all features equal weight
In any given context, certain of an object’s features are more important than others; simple similarity metrics treat all features equally
Feature weights should be set according to the goals of the categorization
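One simple way to add such weights is sketched below, with objects recast as attribute-value pairs; the attribute names and weight values are hypothetical, chosen purely for illustration:

```python
def weighted_similarity(a, b, weights):
    """Weighted proportion of matching features: each attribute on which
    the two objects agree contributes its weight, normalised by the total
    weight.  The weights are hypothetical and would in practice be chosen
    to reflect the goals of the categorization."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items() if a[attr] == b[attr])
    return matched / total

obj1 = {"size": "small", "color": "red", "material": "rubber", "shape": "ball"}
obj2 = {"size": "small", "color": "blue", "material": "rubber", "shape": "ball"}

# If color matters little in this context, obj1 and obj2 are nearly identical
print(weighted_similarity(obj1, obj2, {"size": 1.0, "color": 0.1,
                                       "material": 1.0, "shape": 1.0}))
```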
8
CLUSTER/2
CLUSTER/2 forms k categories by constructing individuals around k seed objects
The parameter k is user adjustable
CLUSTER/2 evaluates the resulting clusters, selecting new seeds and repeating the process until its quality criteria are met
9
CLUSTER/2
The algorithm
• Select k seeds from the set of observed objects. This may be done randomly or according to some selection function
• For each seed, using that seed as a positive instance and all other seeds as negative instances, produce a maximally general definition that tries to cover all of the non-seed instances, until stopped by the negative instances (other seeds)
10
CLUSTER/2
The algorithm
• Classify all objects in the sample according to these descriptions. Note that this may lead to multiple classifications of other, non-seed, objects
11
CLUSTER/2
The algorithm
• Replace each maximally general description with a maximally specific description that covers all objects in the category. This decreases the likelihood that classes overlap on unseen objects
12
CLUSTER/2
The algorithm
• Classes may still overlap on given objects.
• Using a distance metric, select an element closest to the center of each class. Using these central elements as new seeds, repeat the above steps.
13
CLUSTER/2
The algorithm
• Stop when clusters are satisfactory. A typical quality metric is the complexity of the general descriptions of classes. For instance, we might prefer clusters that yield syntactically simple definitions, such as those with a small number of conjuncts
14
CLUSTER/2
The algorithm
• If clusters are unsatisfactory and no improvement occurs over several iterations, select the new seeds closest to the edge of the cluster, rather than those at the center. This favors creation of totally new clusters
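The control loop described in the steps above can be sketched as follows. This is a greatly simplified sketch: nearest-seed assignment stands in for classification by maximally general/specific descriptions, "closest to the center" is approximated by the member with the greatest total similarity to its classmates, and the edge-seeding fallback is omitted:

```python
import random

def cluster2_sketch(objects, k, similarity, iterations=10):
    """Simplified CLUSTER/2-style loop: choose k seeds, classify every
    object by its most similar seed, re-seed from each class's most
    central member, and repeat until the seeds stop changing."""
    seeds = random.sample(objects, k)           # random seed selection
    classes = {}
    for _ in range(iterations):
        # Classify all objects according to the current seeds
        classes = {i: [] for i in range(k)}
        for obj in objects:
            best = max(range(k), key=lambda i: similarity(obj, seeds[i]))
            classes[best].append(obj)
        # New seed: the element "closest to the center" of each class,
        # approximated here by greatest total similarity to its classmates
        new_seeds = []
        for i in range(k):
            members = classes[i] or [seeds[i]]
            new_seeds.append(max(members,
                key=lambda m: sum(similarity(m, o) for o in members)))
        if new_seeds == seeds:  # clusters are stable: stop
            break
        seeds = new_seeds
    return classes

random.seed(0)
objects = [frozenset(o) for o in (
    {"small", "red", "rubber", "ball"},
    {"small", "blue", "rubber", "ball"},
    {"large", "black", "wooden", "ball"},
    {"large", "brown", "wooden", "cube"},
)]
result = cluster2_sketch(objects, 2, lambda a, b: len(a & b))
```

The common-feature count serves as the similarity metric here; in the full algorithm the symbolic descriptions themselves, and their syntactic complexity, drive both classification and the stopping criterion.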
15
Assignment
Read Sections 9.6.1 & 9.6.2 of Luger