Clustering


An introduction to clustering using R


7.1 Introduction to Cluster Analysis

While we often think of statistics as giving definitive answers to well-posed questions, there are some statistical techniques that are used simply to gain further insight into a group of observations. One such technique (which encompasses lots of different methods) is cluster analysis. The idea of cluster analysis is that we have a set of observations, on which we have available several measurements. Using these measurements, we want to find out if the observations naturally group together in some predictable way. For example, we may have recorded physical measurements on many animals, and we want to know if there’s a natural grouping (based, perhaps, on species) that distinguishes the animals from one another. (This use of cluster analysis is sometimes called “numerical taxonomy”.) As another example, suppose we have information on the demographics and buying habits of many consumers. We could use cluster analysis on the data to see if there are distinct groups of consumers with similar demographics and buying habits (market segmentation).

It’s important to remember that cluster analysis isn’t about finding the right answer; it’s about finding ways to look at data that allow us to understand the data better. For example, suppose we have a deck of playing cards, and we want to see if they form some natural groupings. One person may separate the black cards from the red; another may break the cards up into hearts, clubs, diamonds and spades; a third person might separate cards with pictures from cards with no pictures, and a fourth might make one pile of aces, one of twos, and so on. Each person is right in their own way: in cluster analysis, there’s really not a single “correct” answer.

Another aspect of cluster analysis is that there are an enormous number of possible ways of dividing a set of observations into groups. Even if we specify the number of groups, the number of possibilities is still enormous. For example, consider the task of dividing 25 observations into 5 groups. (25 observations is considered very small in the world of cluster analysis.) It turns out there are about 2.4 × 10^15 different ways to arrange those observations into 5 groups. If, as is often the case, we don’t know the number of groups ahead of time, and we need to consider all possible numbers of groups (from 1 to 25), the number is more than 4 × 10^18! So any technique that simply tries all the different possibilities is doomed to failure.
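The counts quoted above can be checked with a short calculation. The sketch below assumes that "ways to arrange observations into groups" means partitions into non-empty, unlabeled groups, which is what the Stirling numbers of the second kind count; the helper name stirling2 is made up for this illustration and is not part of base R.

```r
# Stirling number of the second kind, S(n, k): the number of ways to split
# n distinct observations into k non-empty, unlabeled groups.
# (stirling2 is a hypothetical helper written for this example.)
stirling2 <- function(n, k) {
  # S[i + 1, j + 1] stores S(i, j); the first row and column stand for 0
  S <- matrix(0, nrow = n + 1, ncol = k + 1)
  S[1, 1] <- 1
  for (i in seq_len(n)) {
    for (j in seq_len(min(i, k))) {
      # Observation i either joins one of j existing groups
      # or starts a new group on its own
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

# 25 observations into exactly 5 groups (about 2.4e15)
five_groups <- stirling2(25, 5)

# 25 observations into any number of groups from 1 to 25 (more than 4e18);
# this sum is too large to be exact in double precision, so treat it as
# an approximation
any_groups <- sum(sapply(1:25, function(k) stirling2(25, k)))

print(format(c(five_groups, any_groups), digits = 3))
```

Even at a billion candidate groupings per second, enumerating the 5-group case alone would take roughly a month, which is why practical clustering algorithms search heuristically rather than exhaustively.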

7.2 Standardization

There are two very important decisions that need to be made whenever you are carrying out a cluster analysis. The first regards the relative scales of the variables being measured. We’ll see that the available cluster analysis algorithms all depend on the concept of measuring the distance (or some other measure of similarity) between the different observations we’re trying to cluster. If one of the variables is measured on a much larger scale than the other variables, then whatever measure we use will be overly influenced by that variable. For example, recall the world data set that we used earlier in the semester. Here’s a quick summary of the mean values of the variables in that data set:

> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)
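The effect of scale on distance can be sketched on a small made-up data set. The values and column names below are hypothetical and are not taken from the world data. Centering each column and dividing by its standard deviation with scale() is only one common way to put variables on a comparable footing; it is shown here as an assumption, not necessarily the standardization these notes go on to use.

```r
# Made-up data: two variables measured on very different scales
toy <- data.frame(
  income   = c(30000, 31000, 90000),  # hypothetical values, in dollars
  literacy = c(0.95, 0.40, 0.93)      # hypothetical values, as proportions
)
rownames(toy) <- c("A", "B", "C")

# Euclidean distances on the raw measurements: income dominates, so A and B
# look nearly identical even though their literacy values differ a lot
raw_d <- as.matrix(dist(toy))

# Distances after centering each column and dividing by its standard
# deviation: both variables now contribute on a comparable footing
std_d <- as.matrix(dist(scale(toy)))

print(round(raw_d, 1))
print(round(std_d, 2))
```

On the raw values, the distance between A and B is essentially just their income difference; after rescaling, the literacy gap between A and B counts about as much as the income gap between A and C.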

Annotation: We use cluster analysis when we know little about the structure of the data. We use it to investigate the data and to see whether the observations naturally group together in some predictable way.