Clustering
7.1 Introduction to Cluster Analysis
While we often think of statistics as giving definitive answers to well-posed questions, there are some statistical techniques that are used simply to gain further insight into a group of observations. One such technique (which encompasses lots of different methods) is cluster analysis. The idea of cluster analysis is that we have a set of observations, on which we have available several measurements. Using these measurements, we want to find out if the observations naturally group together in some predictable way. For example, we may have recorded physical measurements on many animals, and we want to know if there’s a natural grouping (based, perhaps, on species) that distinguishes the animals from one another. (This use of cluster analysis is sometimes called “numerical taxonomy”.) As another example, suppose we have information on the demographics and buying habits of many consumers. We could use cluster analysis on the data to see if there are distinct groups of consumers with similar demographics and buying habits (market segmentation).
It’s important to remember that cluster analysis isn’t about finding the right answer; it’s about finding ways to look at data that allow us to understand the data better. For example, suppose we have a deck of playing cards, and we want to see if they form some natural groupings. One person may separate the black cards from the red; another may break the cards up into hearts, clubs, diamonds and spades; a third person might separate cards with pictures from cards with no pictures, and a fourth might make one pile of aces, one of twos, and so on. Each person is right in their own way, but in cluster analysis, there’s really not a single “correct” answer.
Another aspect of cluster analysis is that there are an enormous number of possible ways of dividing a set of observations into groups. Even if we specify the number of groups, the number of possibilities is still enormous. For example, consider the task of dividing 25 observations into 5 groups. (25 observations is considered very small in the world of cluster analysis.) It turns out there are 2.4 × 10^15 different ways to arrange those observations into 5 groups. If, as is often the case, we don’t know the number of groups ahead of time, and we need to consider all possible numbers of groups (from 1 to 25), the number is more than 4 × 10^18! So any technique that simply tries all the different possibilities is doomed to failure.
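The counts quoted above can be checked with a short R sketch. It assumes that "ways to arrange observations into groups" means partitions into non-empty, unlabeled groups, which are counted by the Stirling numbers of the second kind; the helper name stirling2 is made up for this illustration and is not part of base R.

```r
# Stirling numbers of the second kind, S(n, k): the number of ways to split
# n distinct observations into k non-empty, unlabeled groups.
# Uses the recurrence S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1).
# (stirling2 is a hypothetical helper written for this example.)
stirling2 <- function(n) {
  S <- matrix(0, nrow = n + 1, ncol = n + 1)  # S[i + 1, j + 1] holds S(i, j)
  S[1, 1] <- 1                                # S(0, 0) = 1
  for (i in 1:n) {
    for (j in 1:i) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S
}

S <- stirling2(25)

# Ways to split 25 observations into exactly 5 groups
five_groups <- S[26, 6]

# Ways to split 25 observations into any number of groups (1 to 25).
# The largest entries exceed 2^53, so this total is only approximate
# in double-precision arithmetic.
any_groups <- sum(S[26, -1])

print(five_groups)
print(any_groups)
```

The first value comes out at roughly 2.4 × 10^15 and the second at more than 4 × 10^18, which agrees with the figures in the text.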
7.2 Standardization
There are two very important decisions that need to be made whenever you are carrying out a cluster analysis. The first regards the relative scales of the variables being measured. We’ll see that the available cluster analysis algorithms all depend on the concept of measuring the distance (or some other measure of similarity) between the different observations we’re trying to cluster. If one of the variables is measured on a much larger scale than the other variables, then whatever measure we use will be overly influenced by that variable. For example, recall the world data set that we used earlier in the semester. Here’s a quick summary of the mean values of the variables in that data set:
> apply(world1[-c(1,6)],2,mean,na.rm=TRUE)