K Means Handout

8/22/2019 K Means Handout

Cluster Analysis

What is Cluster Analysis?

Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous subgroups based on responses to variables. PASW (SPSS) 17.0 offers three clustering procedures: two-step, k-means, and hierarchical.

K-means clustering allows you to select the number of clusters, and the procedure can be used with moderate to large datasets. The k-means clustering algorithm assigns each case to the cluster whose mean is the smallest distance from the case. This is an iterative process that stops once the cluster means do not change much in successive steps.
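The assign-then-update loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure PASW runs internally: the function names, the tolerance `tol`, the tiny two-variable dataset, and the starting centers are all hypothetical.

```python
import math

def nearest(case, centers):
    """Return the index of the center closest (Euclidean) to the case."""
    return min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))

def k_means(cases, centers, max_iter=10, tol=1e-6):
    """Alternate assignment and mean updates until the centers barely move."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster with the nearest mean.
        labels = [nearest(case, centers) for case in cases]
        # Update step: each center becomes the mean of its current members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [case for case, lab in zip(cases, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop once the means no longer change much
            break
    labels = [nearest(case, centers) for case in cases]
    return labels, centers

# Hypothetical data: two obvious groups in two variables.
cases = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centers = k_means(cases, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)   # cluster index (0-based) for each case
print(centers)  # final cluster means
```

Cluster indices here are 0-based, whereas the PASW output numbers clusters from 1.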

K-Means Clustering

As an example of k-means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav, telecommunications provider data that has 14 continuous variables. The continuous variables have already been standardized, with a mean of 0 and a standard deviation of 1, to allow for the different units in which the variables were measured. This analysis will cluster customers by their service usage patterns.
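For readers working outside PASW, standardizing a variable to mean 0 and standard deviation 1 (a z-score) might look like the sketch below. The `minutes` values are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the handout does not say how telco_extra.sav was standardized.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical usage figures measured in minutes.
minutes = [10.0, 20.0, 30.0, 40.0]
z = standardize(minutes)
print(z)
```

After this step, a variable measured in minutes and one measured in dollars contribute to the Euclidean distance on the same scale.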

In PASW 17.0, go to Analyze > Classify > K-Means Cluster.

Next, the K-Means Cluster Analysis menu appears. Select the Standardized log-long distance through Standardized log-wireless and Standardized multiple lines through Standardized electronic billing variables and place them in the Variables box.

Label Cases by. Optional; place a variable here to label cases.

Number of Clusters. You have to specify the number of clusters you want. For this example, type 3 in the box.

Method. The default, "Iterate and classify," is an iterative process that recomputes the cluster means as cases are added to or removed from each cluster. Cases are then classified once the cluster centers have been updated. With the "Classify only" method, cases are classified based on the initial cluster centers, which are not iteratively updated. For this example, Iterate and classify is chosen.

Cluster Centers. You can draw initial cluster centers from a file (Read initial) or you can save the final cluster centers (Write final). For this example, we are not using either option.

Click the Iterate button; the K-Means Cluster Analysis: Iterate box appears. Change Maximum Iterations to 20. Click Continue.

Maximum Iterations. Sets the maximum number of iterations.

Convergence Criterion. The default terminates the iterations once the largest change in the mean of any cluster is less than 2% of the minimum distance between initial cluster centers.

Use running means. If this box is checked, cluster centers will be updated after each case is classified, instead of after all of the cases are classified.
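The convergence rule and the running-means option can be illustrated with a short Python sketch. It assumes that the "change" in a cluster mean is measured as the Euclidean distance its center moves between iterations; the function names and the example centers are hypothetical and are not taken from PASW.

```python
import math
from itertools import combinations

def convergence_threshold(initial_centers, criterion=0.02):
    """2% (by default) of the smallest distance between any two initial centers."""
    min_gap = min(math.dist(a, b) for a, b in combinations(initial_centers, 2))
    return criterion * min_gap

def has_converged(old_centers, new_centers, threshold):
    """True when the largest movement of any center is below the threshold."""
    largest_change = max(math.dist(a, b) for a, b in zip(old_centers, new_centers))
    return largest_change < threshold

def running_mean_update(center, n_members, case):
    """Running means: fold one newly classified case into the center immediately."""
    n = n_members + 1
    return [c + (x - c) / n for c, x in zip(center, case)], n

# Hypothetical initial centers; the closest pair is 5 units apart.
initial = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
threshold = convergence_threshold(initial)
small_move = has_converged([(0.0, 0.0)], [(0.05, 0.0)], threshold)
large_move = has_converged([(0.0, 0.0)], [(0.5, 0.0)], threshold)
updated_center, updated_n = running_mean_update([1.0, 1.0], 1, (3.0, 5.0))
```

Without running means, the update would instead wait until every case has been classified and then recompute each center as the mean of all of its members at once.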

Click Options in the K-Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table, Cluster information for each case, and Exclude cases pairwise. Click Continue. Click OK.

Initial cluster centers. Prints the initial variable means for each cluster in the output.

ANOVA table. ANOVA F tests are conducted for each variable to indicate how well the variable discriminates between clusters.

Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and the cluster center in the output.

Missing Values. The default is listwise deletion. For this example, there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
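The difference between the two missing-value options can be sketched as follows. This is only an illustration of the idea, assuming missing values are coded as `None` and that pairwise exclusion computes each distance from the variables a case actually has; it is not a reproduction of PASW's internal handling, and all names and values are hypothetical.

```python
import math

def listwise_keep(cases):
    """Listwise deletion: keep only cases with no missing values at all."""
    return [case for case in cases if None not in case]

def pairwise_distance(case, center):
    """Euclidean distance using only the variables present in the case."""
    pairs = [(x, c) for x, c in zip(case, center) if x is not None]
    if not pairs:
        return None  # nothing to compare: the case cannot be classified
    return math.sqrt(sum((x - c) ** 2 for x, c in pairs))

# Hypothetical customers; None marks a service the customer never subscribed to.
cases = [(1.0, None, 2.0), (0.0, 0.0, 0.0), (None, None, None)]
center = (1.0, 5.0, -2.0)

kept = listwise_keep(cases)                    # only the fully observed case survives
partial = pairwise_distance(cases[0], center)  # uses the first and third variables
empty = pairwise_distance(cases[2], center)    # no usable variables
```

With listwise deletion two of the three hypothetical customers would be dropped, which is why pairwise exclusion retains more information when missing values are common.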

K-Means Clustering Interpretation

The Initial Cluster Centers table shows the first step in k-means clustering: finding the k initial centers.

The Iteration History table shows the number of iterations that were needed before the cluster centers stopped changing substantially.

The Cluster Membership table gives you the cluster each case belongs to and the Euclidean distance of each case to its cluster center. Below is a printout of the first and last 10 cases. Visual inspection of the distances is necessary to check for outliers that may not adequately reflect the population.
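One way to support that inspection outside PASW is to rebuild the membership table and list the cases that lie farthest from their own cluster center. The sketch below is illustrative only: the data, the labels, the centers, and the choice to look at the top few distances are all assumptions rather than anything the handout prescribes.

```python
import math

def membership_table(cases, labels, centers):
    """Rows of (case number, cluster index, distance to that cluster's center)."""
    rows = []
    for case_id, (case, label) in enumerate(zip(cases, labels), start=1):
        rows.append((case_id, label, math.dist(case, centers[label])))
    return rows

def farthest_cases(rows, top=3):
    """Cases with the largest distances, which are candidates for a closer look."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]

# Hypothetical cases, 0-based cluster labels, and final centers.
cases = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.5), (5.0, 5.5)]

rows = membership_table(cases, labels, centers)
suspects = farthest_cases(rows, top=1)
```

A large distance does not by itself prove that a case is an outlier; it only marks the case as worth examining.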

The Final Cluster Centers table below allows you to describe the clusters by the variables. For example, customers in Cluster 1 tend to purchase a lot of services, as evidenced by values above the mean for all variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values for the four calling services (caller ID, call waiting, call forwarding, and 3-way calling). Customers in Cluster 3 tend to spend very little and do not purchase many services; they have negative values on most of the variables.

The Distances between Final Cluster Centers table shows the Euclidean distances between the final cluster centers. Greater distances between clusters mean there are greater dissimilarities.

Clusters 1 and 3 have the greatest dissimilarities.

Cluster 2 is equally similar to Clusters 1 and 3.
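The same kind of table can be rebuilt from any set of final centers by taking the Euclidean distance between every pair. The centers below are hypothetical values chosen only to mimic the pattern described above; they are not the centers from the telco_extra.sav solution.

```python
import math
from itertools import combinations

def center_distances(centers):
    """Euclidean distance for every pair of centers, keyed by 1-based cluster numbers."""
    return {
        (i + 1, j + 1): math.dist(centers[i], centers[j])
        for i, j in combinations(range(len(centers)), 2)
    }

# Hypothetical standardized centers for three clusters on two variables.
centers = [(1.0, 1.0), (0.0, 0.0), (-1.0, -1.0)]
d = center_distances(centers)
most_dissimilar = max(d, key=d.get)  # pair of clusters that are farthest apart
```

In this made-up example Clusters 1 and 3 are the farthest apart, and Cluster 2 sits at the same distance from each of them.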

The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with large mean square errors provide the least help in differentiating between clusters. For example, long distance and calling card had the two highest mean square errors (and lowest F statistics); therefore, the two variables were not as helpful as the other variables in forming and differentiating clusters.

The Number of Cases in each Cluster table illustrates the split of cases into clusters. A large number of cases were assigned to the third cluster, which is the least profitable group.
