K Means Handout
8/22/2019 K Means Handout
1/7
Cluster Analysis
What is Cluster Analysis?
Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous
subgroups based on responses to variables. PASW (SPSS) 17.0 offers three clustering procedures for
conducting a cluster analysis: two-step, k-means, and hierarchical.
K-means clustering allows you to select the number of clusters, and the procedure can be used with
moderate to large datasets. The k-means clustering algorithm assigns each case to the cluster whose
mean is the smallest distance from the case. This is an iterative process that stops
once the cluster means do not change much in successive steps.
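The assign-then-update loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the PASW implementation: the function names, the tolerance `tol`, and the four toy cases are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(cases, centers, max_iter=10, tol=1e-9):
    """Assign each case to its nearest center, then recompute each center
    as the mean of its cases; repeat until the centers stop moving."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: the nearest cluster mean wins.
        labels = [min(range(len(centers)), key=lambda j: euclidean(case, centers[j]))
                  for case in cases]
        # Update step: each center becomes the mean of its assigned cases.
        new_centers = []
        for j, old in enumerate(centers):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        shift = max(euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # centers no longer change: stop iterating
            break
    return labels, centers

# Hypothetical toy data: two obvious groups in two dimensions.
cases = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centers = k_means(cases, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)   # cluster index for each case
print(centers)  # final cluster means
```

The first two cases end up in one cluster and the last two in the other, and the loop stops on the second pass because the centers no longer move.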
K-Means Clustering
As an example of k-means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav,
telecommunications provider data that has 14 continuous variables. The continuous variables have
already been standardized, with a mean of 0 and a standard deviation of 1, to allow for the different
units in which the variables were measured. This analysis will cluster customers by their service usage patterns.
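Standardizing a variable means subtracting its mean and dividing by its standard deviation. The sketch below shows the arithmetic on a made-up variable; the name `minutes` and its values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption about how the scores were computed.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (assumed)
    return [(v - mean) / sd for v in values]

# Hypothetical raw values measured in minutes.
minutes = [10.0, 20.0, 30.0]
z = standardize(minutes)
print(z)
```

After this step every variable is on the same scale, so a variable measured in large units does not dominate the Euclidean distances.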
In PASW 17.0, go to Analyze > Classify > K-Means Cluster.
Next, the K-Means Cluster Analysis menu appears. Select the Standardized log-long distance through
Standardized log-wireless and Standardized multiple lines through Standardized electronic billing
variables and place them in the Variables box.
Label Cases by. Optional; place a variable here to label cases.
Number of Clusters. You have to specify the number of clusters you want. For this example, type 3 in the box.
Method. The default, "Iterate and classify," is an iterative process that recomputes the cluster means as cases are added to or removed from clusters. Cases are then classified once the cluster centers have been updated. With the "Classify only" method, cases are classified based on the initial cluster centers, which are not iteratively updated. For this example, Iterate and classify is chosen.
Cluster Centers. You can read initial cluster centers from a file (Read initial) or you can save the final cluster centers (Write final). For this example, we are not using either option.
Click the Iterate button; the K-Means Cluster Analysis: Iterate box appears. Change Maximum
Iterations to 20. Click Continue.
Maximum Iterations. Sets the maximum number of iterations.
Convergence Criterion. The default terminates once the largest change in means of any cluster is less than 2% of the minimum distance between initial cluster centers.
Use running means. If this box is checked, cluster centers will be updated after each case is classified, instead of after all of the cases are classified.
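The convergence rule and the running-means option described above can be illustrated as follows. This is a sketch of the arithmetic only, under the assumption that "change in means" is measured as the Euclidean shift of each center; all function names and the three initial centers are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def convergence_threshold(initial_centers, criterion=0.02):
    """Criterion (2% here) times the smallest distance between initial centers."""
    min_dist = min(euclidean(a, b) for a, b in combinations(initial_centers, 2))
    return criterion * min_dist

def has_converged(old_centers, new_centers, threshold):
    """True when the largest center shift in this iteration is below the threshold."""
    largest_shift = max(euclidean(o, n) for o, n in zip(old_centers, new_centers))
    return largest_shift < threshold

def running_mean_update(center, n_members, case):
    """Running means: update a center immediately after one more case joins
    a cluster that already holds n_members cases."""
    return [m + (x - m) / (n_members + 1) for m, x in zip(center, case)]

# Hypothetical initial centers; the closest pair is 5.0 apart.
initial = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
threshold = convergence_threshold(initial)           # 2% of 5.0
moved = [(0.05, 0.0), (3.0, 4.0), (10.0, 0.0)]       # largest shift is 0.05
converged = has_converged(initial, moved, threshold)
updated = running_mean_update([0.0, 0.0], 1, (2.0, 4.0))
print(threshold, converged, updated)
```

With running means switched off, the update would instead wait until every case has been classified and then recompute each center once per iteration.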
Click Options in the K-Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table,
Cluster information for each case, and Exclude cases pairwise. Click Continue. Click OK.
Initial cluster centers. Prints the initial variable means for each cluster in the output.
ANOVA table. ANOVA F-tests are conducted for each variable to indicate how well the variable discriminates between clusters.
Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and the cluster center in the output.
Missing Values. The default is listwise deletion. For this example, there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
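The difference between the two missing-value options can be sketched as below. Listwise deletion drops any case with a missing value, while the pairwise idea keeps the case and measures distance on the variables it does have. This is an assumption-laden illustration: the function names are hypothetical, missing values are represented as `None`, and PASW's exact pairwise distance computation may differ.

```python
import math

def listwise_keep(cases):
    """Listwise deletion: keep only cases with no missing values."""
    return [c for c in cases if all(v is not None for v in c)]

def pairwise_distance(case, center):
    """Distance to a center using only the variables the case actually has.
    Returns None when every variable is missing."""
    used = [(x, m) for x, m in zip(case, center) if x is not None]
    if not used:
        return None
    return math.sqrt(sum((x - m) ** 2 for x, m in used))

# Hypothetical cases; None marks a service the customer did not subscribe to.
cases = [(1.0, None, 2.0), (0.0, 0.0, 0.0), (None, None, None)]
center = (4.0, 9.0, 6.0)

kept = listwise_keep(cases)                  # only the complete case survives
d = pairwise_distance(cases[0], center)      # uses variables 1 and 3 only
print(len(kept), d)
```

In this toy data listwise deletion would discard two of the three cases, while the pairwise approach can still place the first case.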
K-Means Clustering Interpretation
The Initial Cluster Centers table shows the first step in k-means clustering: finding the k initial centers.
The Iteration History table shows the number of iterations that were needed before the cluster centers
stopped changing substantially.
The Cluster Membership table gives you the cluster each case belongs to and the Euclidean
distance of each case to its cluster center. Below is a printout of the first and last 10 cases. Visual
inspection of the distances is necessary to check for outliers that may not adequately reflect the population.
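When the membership table is long, one way to support the visual inspection is to sort the cases by their distance to the cluster center and look at the largest values first. The sketch below assumes the distances have been copied into a list; the function name and the five distances are hypothetical, and no cutoff is implied.

```python
def farthest_cases(distances, n=3):
    """Return (case_number, distance) pairs for the n largest distances.
    Case numbers start at 1, matching the order of the input list."""
    ranked = sorted(enumerate(distances, start=1),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical distances of five cases to their own cluster centers.
distances = [0.8, 1.1, 4.9, 0.7, 1.3]
top = farthest_cases(distances, n=2)
print(top)
```

Here case 3 stands far apart from the rest and would be the first candidate to examine as a possible outlier.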
The Final Cluster Centers table below allows you to describe the clusters by the variables. For example,
customers in Cluster 1 tend to purchase a lot of services, as evidenced by values above the mean for all
variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values for
the four calling services (caller ID, call waiting, call forwarding, and 3-way calling). Customers in
Cluster 3 tend to spend very little and do not purchase many services; they have negative values on
most of the variables.
The Distances between Final Cluster Centers table shows the Euclidean distances between the final
cluster centers. Greater distances between clusters mean there are greater dissimilarities.
Clusters 1 and 3 have the greatest dissimilarities.
Cluster 2 is equally similar to Clusters 1 and 3.
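The distance table is simply the Euclidean distance between every pair of final centers. The sketch below uses three made-up two-variable centers, chosen so that the pattern resembles the one described above; they are not the values from the telco output.

```python
import math

def center_distances(centers):
    """Symmetric table of Euclidean distances between cluster centers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [[dist(a, b) for b in centers] for a in centers]

# Hypothetical final centers for Clusters 1, 2, and 3.
centers = [(3.0, 4.0), (0.0, 0.0), (-3.0, -4.0)]
table = center_distances(centers)
for row in table:
    print(row)
```

In this toy table Cluster 2 is the same distance from Clusters 1 and 3, and Clusters 1 and 3 are the farthest apart.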
The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with
large mean square errors provide the least help in differentiating between clusters. For example, long
distance and calling card had the two highest mean square errors (and lowest F statistics); therefore,
these two variables were not as helpful as the other variables in forming and differentiating clusters.
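For a single variable, the F statistic is the between-cluster mean square divided by the error (within-cluster) mean square. The sketch below computes both for two made-up variables; the function name and the data are hypothetical and only illustrate why a large mean square error goes with a small F.

```python
def anova_f(groups):
    """One-way ANOVA for one variable.
    groups: one list of values per cluster.
    Returns (mean square between, mean square error, F)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_error = ss_error / (len(all_values) - len(groups))
    return ms_between, ms_error, ms_between / ms_error

# Hypothetical variable that separates three clusters well.
helpful = anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
# Hypothetical variable whose cluster means are identical.
unhelpful = anova_f([[1.0, 5.0, 9.0], [2.0, 5.0, 8.0], [3.0, 5.0, 7.0]])
print(helpful)
print(unhelpful)
```

The first variable has a small error mean square and a large F, while the second has a large error mean square and an F of zero, so it does nothing to tell the clusters apart.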
The Number of Cases in each Cluster table illustrates the split of cases into clusters. A large number of
cases were assigned to the third cluster, which is the least profitable group.