Transcript of DATA MINING WITH CLUSTERING AND CLASSIFICATION

Page 1: DATA MINING WITH CLUSTERING AND CLASSIFICATION

DATA MINING WITH CLUSTERING AND CLASSIFICATION

Spring 2007, SJSU
Benjamin Lam

Page 2: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Overview

• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Classification
• Classification examples
• Conclusion

Page 3: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Definition

• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.

• Clustering is “the process of organizing objects into groups whose members are similar in some way”.

• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Mu-Yu Lu, SJSU

Page 4: DATA MINING WITH CLUSTERING AND CLASSIFICATION
Page 5: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Why clustering?

A few good reasons ...

• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process

Page 6: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Where to use clustering?

• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

Page 7: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Which method should I use?

• Type of attributes in the data
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data-order dependency
• Result presentation

Page 8: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Major Existing clustering methods

• Distance-based
• Hierarchical
• Partitioning
• Probabilistic

Page 9: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Measuring Similarity

• Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)

• There is a separate “quality” function that measures the “goodness” of a cluster.

• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

• Weights should be associated with different variables based on applications and data semantics.

• It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.

Professor Lee, Sin-Min
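To make the idea of a distance function d(i, j) concrete, here is a minimal Python sketch (not from the original slides) of two common metrics for interval-scaled attributes; the attribute weighting discussed above is omitted.

import math

def euclidean(p, q):
    # Straight-line distance between two numeric vectors p and q.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

# Example: two points in a 2-D feature space.
print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7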

Page 10: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Distance based method

• In the scatter plot shown on the original slide, we can easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.

Page 11: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Hierarchical clustering

Agglomerative (bottom-up):
1. Start with each point as its own cluster (singleton).
2. Recursively merge the most appropriate clusters.
3. Stop when k clusters are obtained.

Divisive (top-down):
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are obtained.

Page 12: DATA MINING WITH CLUSTERING AND CLASSIFICATION

General steps of hierarchical clustering

Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

• Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.

• Compute the distances (similarities) between the new cluster and each of the old clusters.

• Repeat steps 2 and 3 until all items are clustered into K clusters.

Mu-Yu Lu, SJSU
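These steps are what library implementations of agglomerative clustering carry out. The sketch below is added for illustration (not part of the original lecture) and uses SciPy's single-linkage routine on a small hypothetical distance matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for 4 items (values are illustrative).
D = np.array([[0.,  2.,  6., 10.],
              [2.,  0.,  5.,  9.],
              [6.,  5.,  0.,  4.],
              [10., 9.,  4.,  0.]])

# linkage expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method='single')       # single-linkage merges
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into K = 2 clusters
print(Z)        # each row: the two clusters merged, the merge distance (level), cluster size
print(labels)   # cluster label for each of the N items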

Page 13: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Exclusive vs. non-exclusive clustering

• In the first case data are grouped in an exclusive way, so that if a certain datum belongs to one cluster it cannot be included in another. A simple example is a set of points on a two-dimensional plane separated into groups by a straight line (shown as a figure on the original slide).

• By contrast, the second type, overlapping (fuzzy) clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.

Page 14: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Partitioning clustering

1. Divide the data into k subsets.
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).

This recursive relocation yields higher-quality clusters.

Page 15: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Probabilistic clustering

1. Data are assumed to be drawn from a mixture of probability distributions.

2. The mean and variance of each distribution are used as the parameters of a cluster.

3. Each point has a single cluster membership.
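As an illustration of this idea (not from the original slides), the sketch below fits a two-component Gaussian mixture with scikit-learn; the sample data are invented.

import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data drawn from two Gaussian components (parameters are made up).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.5, size=(100, 2))])

# Fit a mixture of two Gaussians; each cluster is described by a mean and covariance.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)          # estimated mean of each cluster
print(gm.covariances_)    # estimated covariance (spread) of each cluster
print(gm.predict(X[:5]))  # hard (single) cluster membership for the first points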

Page 16: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Single-Linkage Clustering (hierarchical)

• The N*N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence numbers 0, 1, ..., (n-1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is denoted (m)
• The proximity between clusters (r) and (s) is denoted d[(r),(s)]

Mu-Yu Lu, SJSU

Page 17: DATA MINING WITH CLUSTERING AND CLASSIFICATION

The algorithm is composed of the following steps:

• Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

• Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to

d[(r),(s)] = min d[(i),(j)]

where the minimum is over all pairs of clusters in the current clustering.

Page 18: DATA MINING WITH CLUSTERING AND CLASSIFICATION

The algorithm is composed of the following steps:(cont.)

• Increment the sequence number : m = m +1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to

L(m) = d[(r),(s)]

• Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s) and old cluster (k) is defined in this way:

d[(k), (r,s)] = min { d[(k),(r)], d[(k),(s)] }

• If all objects are in one cluster, stop. Else, go to step 2.
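A from-scratch Python sketch of exactly these steps, added here for illustration; the 4-item distance matrix is hypothetical, not the Italian-cities data used on the following slides.

def single_linkage(D, labels, k=1):
    # D: square symmetric distance matrix (list of lists); labels: item names.
    clusters = [[name] for name in labels]    # disjoint clustering, level L(0) = 0
    D = [row[:] for row in D]                 # working copy of the proximity matrix
    while len(clusters) > k:
        # Step 2: find the least dissimilar pair of clusters (r), (s).
        r, s = min(((i, j) for i in range(len(D)) for j in range(i + 1, len(D))),
                   key=lambda ij: D[ij[0]][ij[1]])
        level = D[r][s]
        print(f"merge {clusters[r]} + {clusters[s]} at level {level}")
        # Step 3: merge (r) and (s); distances to the new cluster use the minimum rule.
        merged = clusters[r] + clusters[s]
        keep = [t for t in range(len(D)) if t not in (r, s)]
        new_row = [min(D[r][t], D[s][t]) for t in keep]
        # Delete rows/columns r and s, append the row and column of the new cluster.
        D = [[D[a][b] for b in keep] for a in keep]
        for a, d_new in enumerate(new_row):
            D[a].append(d_new)
        D.append(new_row + [0.0])
        clusters = [clusters[t] for t in keep] + [merged]
    return clusters

# Hypothetical 4-item example.
names = ["A", "B", "C", "D"]
dmat = [[0, 2, 6, 10],
        [2, 0, 5, 9],
        [6, 5, 0, 4],
        [10, 9, 4, 0]]
single_linkage(dmat, names, k=1)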

Page 19: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Hierarchical clustering example

• Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.

• Input distance matrix (L = 0 for all the clusters), shown as a table on the original slide.

Page 20: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

Page 21: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• After merging MI with TO we obtain an updated distance matrix (shown on the original slide).

Page 22: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2

Page 23: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3

Page 24: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

Page 25: DATA MINING WITH CLUSTERING AND CLASSIFICATION

• Finally, we merge the last two clusters at level 295.
• The process is summarized by a hierarchical tree (dendrogram) on the original slide.

Page 26: DATA MINING WITH CLUSTERING AND CLASSIFICATION

K-Means Algorithm

1. It accepts the number of clusters to group data into, and the dataset to cluster, as input values.

2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data at random from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters are created by selecting 3 records at random from the dataset. Each of the 3 initial clusters formed will have just one row of data.

Page 27: DATA MINING WITH CLUSTERING AND CLASSIFICATION

3. The K-Means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, so the arithmetic mean of such a cluster is simply the set of values that make up that record. For example, if the dataset is a set of Height, Weight and Age measurements for students at a university, where a record P in the dataset S is represented by P = {Age, Height, Weight}, then a record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member = {20, 170, 80}.

Page 28: DATA MINING WITH CLUSTERING AND CLASSIFICATION

4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity like the Euclidean Distance Measure or Manhattan/City-Block Distance Measure.

5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean is represented as P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2 and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster is therefore {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

Page 29: DATA MINING WITH CLUSTERING AND CLASSIFICATION

6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity.

7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is complete. Stable clusters are formed when a new iteration of the K-Means algorithm leaves the clusters unchanged, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when stable clusters have formed and the procedure can stop. A compact sketch of the whole procedure follows.
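The Python sketch below follows steps 1-7 above; it is illustrative only, not the lecture's code, and the toy records loosely follow the {Age, Height, Weight} example.

import random

def k_means(records, k, max_iter=100):
    # Step 2: pick K rows at random as the initial clusters / centers.
    centers = random.sample(records, k)
    for _ in range(max_iter):
        # Steps 4 and 6: assign every record to the nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for rec in records:
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(rec, centers[c])))
            clusters[idx].append(rec)
        # Steps 3 and 5: recompute each center as the arithmetic mean of its records.
        new_centers = [tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Step 7: stop when the centers no longer move (stable clusters).
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical {Age, Height(cm), Weight} records, as in the John/Henry example.
data = [(20, 170, 80), (30, 160, 120), (22, 175, 85), (45, 150, 60), (47, 155, 65)]
centers, clusters = k_means(data, k=2)
print(centers)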

Page 30: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification

• Classification Problem Overview
• Classification Techniques
  – Regression
  – Distance
  – Decision Trees
  – Rules
  – Neural Networks

Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.

Page 31: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Examples

• Teachers classify students’ grades as A, B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
• Pattern recognition

Page 32: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Ex: Grading

• If x >= 90 then grade = A.

• If 80 <= x < 90 then grade = B.

• If 70 <= x < 80 then grade = C.

• If 60 <= x < 70 then grade = D.

• If x < 60 then grade = F.

[Decision tree from the original slide: test x >= 90 → A; else x >= 80 → B; else x >= 70 → C; else x >= 60 → D; else F]
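Translated directly into code (with the last rule read as x < 60 so every score is covered), the rule set above is just a chain of threshold tests:

def grade(x):
    # Classification by a fixed set of rules, equivalent to walking the tree above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print([grade(s) for s in (95, 83, 71, 65, 40)])   # ['A', 'B', 'C', 'D', 'F']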

Page 33: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Techniques

• Approach:
  1. Create a specific model by evaluating training data (or using domain experts’ knowledge).
  2. Apply the model developed to new data.
• Classes must be predefined.
• Most common techniques use DTs, NNs, or are based on distances or statistical methods.

Page 34: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Defining Classes

Partitioning Based

Distance Based

Page 35: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Using Regression

• Division: Use regression function to divide area into regions.

• Prediction: Use regression function to predict a class membership function. Input includes desired class.
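A hedged sketch of the “division” idea (not from the original slides): encode the class as 0/1, fit a least-squares regression, and split the input space where the prediction crosses 0.5. The one-dimensional height data and labels below are invented for illustration.

import numpy as np

# Hypothetical 1-D example: classify people as Tall (1) or not (0) from height in metres.
heights = np.array([1.6, 1.7, 1.8, 1.85, 1.9, 1.95, 2.0, 2.1, 2.2])
labels  = np.array([0,   0,   0,   0,    1,   1,    1,   1,   1  ])

# Fit y = w0 + w1 * height by least squares; the regression output divides the
# height axis into two regions at the point where the prediction crosses 0.5.
A = np.column_stack([np.ones_like(heights), heights])
(w0, w1), *_ = np.linalg.lstsq(A, labels, rcond=None)
boundary = (0.5 - w0) / w1
print(f"predicted boundary near {boundary:.2f} m")

def classify(h):
    return "Tall" if w0 + w1 * h >= 0.5 else "Not tall"

print(classify(1.75), classify(2.05))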

Page 36: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium

Page 37: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Division

Page 38: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Prediction

Page 39: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Using Distance

• Place items in class to which they are “closest”.

• Must determine distance between an item and a class.

• Classes represented by
  – Centroid: central value
  – Medoid: representative point
  – Individual points

• Algorithm: KNN
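Before the KNN slides, a small sketch of the centroid representation (illustrative, not part of the original deck): each class is summarized by the mean of its members, and a new item goes to the nearest centroid. The training points are invented.

def centroid(points):
    # Central value of a class: coordinate-wise arithmetic mean of its members.
    return tuple(sum(vals) / len(vals) for vals in zip(*points))

def classify_by_centroid(item, classes):
    # classes: dict mapping class name -> list of training points.
    centroids = {name: centroid(pts) for name, pts in classes.items()}
    return min(centroids,
               key=lambda name: sum((a - b) ** 2 for a, b in zip(item, centroids[name])))

# Hypothetical training data in a 2-D feature space (height, weight).
training = {"Short": [(1.6, 60), (1.65, 62)], "Tall": [(1.95, 90), (2.0, 95)]}
print(classify_by_centroid((1.7, 65), training))   # -> "Short"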

Page 40: DATA MINING WITH CLUSTERING AND CLASSIFICATION

K Nearest Neighbor (KNN):

• Training set includes classes.
• Examine the K items nearest to the item to be classified.
• The new item is placed in the class with the largest number of these K nearest items.
• O(q) for each tuple to be classified. (Here q is the size of the training set.)
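A minimal KNN sketch matching the description above (illustrative; the training pairs are invented). For simplicity it sorts the whole training set, which costs slightly more than the O(q) scan mentioned on the slide.

from collections import Counter

def knn_classify(item, training, k=3):
    # training: list of (point, class_label) pairs.
    by_distance = sorted(training,
                         key=lambda pc: sum((a - b) ** 2 for a, b in zip(item, pc[0])))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]   # class holding the most of the K closest items

# Hypothetical training set of (height, weight) points.
train = [((1.6, 60), "Short"), ((1.65, 63), "Short"), ((1.8, 80), "Medium"),
         ((1.85, 82), "Medium"), ((2.0, 95), "Tall"), ((2.1, 100), "Tall")]
print(knn_classify((1.82, 81), train, k=3))   # -> "Medium"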

Page 41: DATA MINING WITH CLUSTERING AND CLASSIFICATION

KNN

Page 42: DATA MINING WITH CLUSTERING AND CLASSIFICATION

KNN Algorithm

Page 43: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Classification Using Decision Trees

• Partitioning based: Divide search space into rectangular regions.

• Tuple placed into class based on the region within which it falls.

• DT approaches differ in how the tree is built: DT Induction

• Internal nodes associated with attribute and arcs with values for that attribute.

• Algorithms: ID3, C4.5, CART

Page 44: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Decision Tree

Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}

A Decision (or Classification) Tree is a tree associated with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be applied to the attribute at its parent
– Each leaf node is labeled with a class, Cj
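In this notation, a library-based sketch (added for illustration, not part of the original slides): the fitted tree's internal nodes test attributes Ai, arcs carry predicates on their values, and leaves carry classes Cj. The data are invented.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical database D with schema {A1 = height, A2 = weight} and classes C = {Short, Tall}.
X = [[1.6, 60], [1.65, 62], [1.7, 65], [1.95, 90], [2.0, 95], [2.1, 100]]
y = ["Short", "Short", "Short", "Tall", "Tall", "Tall"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
# Print the learned tests at each internal node and the class at each leaf.
print(export_text(tree, feature_names=["height", "weight"]))
print(tree.predict([[1.75, 70]]))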

Page 45: DATA MINING WITH CLUSTERING AND CLASSIFICATION

DT Induction

Page 46: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Comparing DTs

Balanced vs. Deep (two tree shapes compared in the figure on the original slide)

Page 47: DATA MINING WITH CLUSTERING AND CLASSIFICATION

ID3

• Creates the tree using information-theory concepts and tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest information gain, using entropy as the basis for the calculation.
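A small sketch of the entropy and information-gain calculation that ID3 uses to pick the split attribute (illustrative; the weather-style records are invented).

import math
from collections import Counter

def entropy(labels):
    # H = -sum p_i * log2(p_i) over the class proportions in `labels`.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Expected reduction in entropy from splitting on one attribute's values.
    total = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / total * entropy(subset) for subset in split.values())
    return entropy(labels) - remainder

# Hypothetical training data: attributes (Outlook, Windy), class = play or not.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]
play = ["No", "No", "Yes", "No", "Yes"]
# ID3 would choose the attribute with the higher gain as the split at this node.
print(information_gain(rows, play, 0), information_gain(rows, play, 1))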

Page 48: DATA MINING WITH CLUSTERING AND CLASSIFICATION

Conclusion

• Very useful in data mining
• Applicable to both text-based and graphical data
• Helps simplify data complexity
• Classification
• Detects hidden patterns in data

Page 49: DATA MINING WITH CLUSTERING AND CLASSIFICATION

References

• Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Dr. Sin-Min Lee, San Jose State University
• Mu-Yu Lu, SJSU
• Silberschatz, Korth, Sudarshan, Database System Concepts