K Means Handout

  • 8/22/2019 K Means Handout

    Cluster Analysis

    What is Cluster Analysis?

    Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous subgroups based on responses to variables. PASW (SPSS) 17.0 offers three clustering procedures: two-step, k-means, and hierarchical.

    K-means clustering allows you to select the number of clusters, and the procedure can be used with moderate to large datasets. The k-means clustering algorithm assigns each case to the cluster whose mean is the smallest distance from that case. This is an iterative process that stops once the cluster means do not change much in successive steps.
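
    The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the PASW implementation: the function name `k_means`, the `init_centers` argument, the `tol` stopping threshold, and the toy data are all assumptions made for the example.

```python
import numpy as np

def k_means(X, init_centers, max_iter=10, tol=1e-6):
    """Assign each case to the nearest cluster mean, then recompute the means."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Euclidean distance from every case to every cluster mean.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        # Stop once no cluster mean moves by more than tol.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Toy data: two well-separated groups of two cases each.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
# Assumption for the example: use the first two cases as initial centers.
labels, centers = k_means(X, init_centers=X[:2])
```

    Even though both initial centers start in the same group, the means migrate over successive iterations until each group has its own center.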

    K-Means Clustering

    As an example of k-means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav, telecommunications provider data that has 14 continuous variables. The continuous variables have already been standardized, with a mean of 0 and a standard deviation of 1, to allow for the different units in which the variables were measured. This analysis will cluster customers by their service usage patterns.
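
    For readers working outside PASW, the standardization mentioned above is a per-variable z-score. The sketch below assumes the sample standard deviation (`ddof=1`); the `standardize` helper and the usage figures are hypothetical and are not taken from telco_extra.sav.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    # Assumption: sample standard deviation (ddof=1).
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical usage figures in different units (minutes, dollars).
usage = np.array([[120.0, 15.0], [300.0, 40.0], [60.0, 5.0], [480.0, 80.0]])
z = standardize(usage)
```

    After this step every column has mean 0 and standard deviation 1, so no variable dominates the Euclidean distance simply because of its units.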

    In PASW 17.0, go to Analyze > Classify > K-Means Cluster.

    Next, the K-Means Cluster Analysis menu appears. Select the Standardized log long distance through Standardized log wireless and Standardized multiple lines through Standardized electronic billing variables and place them in the Variables box.

    Label Cases by. Optional; place a variable here to label cases.

    Number of Clusters. You have to specify the number of clusters you want. For this example, type 3 in the box.

    Method. The default, "Iterate and classify," is an iterative process that recomputes the cluster means each time a case is added to or removed from a cluster. Cases are then classified once the cluster centers have been updated. With the "Classify only" method, cases are classified based on the initial cluster centers, which are not iteratively recomputed. For this example, Iterate and classify is chosen.
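
    To make the contrast concrete, the following sketch shows what a "Classify only" style pass amounts to: a single nearest-center assignment with the centers held fixed. The `classify_only` helper and the data are illustrative assumptions, not PASW code.

```python
import numpy as np

def classify_only(X, centers):
    """Assign each case to its nearest center; the centers are never updated."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
# Hypothetical fixed centers that both sit near the first group.
fixed_centers = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = classify_only(X, fixed_centers)
```

    Because the centers never move, the result depends entirely on where they start; an iterative method would let the second center drift toward the cases near (4, 4) and (5, 5).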

    Cluster Centers. You can draw initial cluster centers from a file (Read initial) or you can save the final cluster centers (Write final). For this example, we are not using either option.

    Click the Iterate button; the K-Means Cluster Analysis: Iterate box appears. Change Maximum Iterations to 20. Click Continue.

    Maximum Iterations. Sets the maximum number of iterations.

    Convergence Criterion. The default terminates once the largest change in means of any cluster is less than 2% of the minimum distance between initial cluster centers.
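
    The stopping rule described above can be written as a small check. This is a sketch of the rule as stated in the handout; the function name `has_converged`, its `criterion` parameter, and the example centers are assumptions for illustration.

```python
import numpy as np

def has_converged(old_centers, new_centers, initial_centers, criterion=0.02):
    """True if no center moved more than criterion * (smallest initial separation)."""
    k = len(initial_centers)
    diff = initial_centers[:, None, :] - initial_centers[None, :, :]
    pair_dist = np.linalg.norm(diff, axis=2)
    # Smallest distance between any two distinct initial centers.
    min_initial = pair_dist[np.triu_indices(k, k=1)].min()
    # Largest movement of any center during this iteration.
    max_shift = np.linalg.norm(new_centers - old_centers, axis=1).max()
    return bool(max_shift < criterion * min_initial)

initial = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])  # closest pair is 5 apart
old = np.array([[0.5, 0.5], [3.0, 4.0], [9.0, 0.0]])
small_move = old + np.array([[0.05, 0.0], [0.0, 0.0], [0.0, 0.0]])
big_move = old + np.array([[0.5, 0.0], [0.0, 0.0], [0.0, 0.0]])
```

    With a smallest initial separation of 5, the 2% threshold is 0.1, so a shift of 0.05 counts as converged while a shift of 0.5 does not.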

    Use running means. If this box is checked, cluster centers will be updated after each case is classified, instead of after all of the cases are classified.
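
    The difference between the two update schedules is easiest to see in code. The sketch below performs one pass with running means, updating a center immediately after each case is assigned. It is a simplification under one loud assumption: the starting center is given zero weight, so the first case assigned to a cluster replaces that center outright. PASW's exact weighting may differ.

```python
import numpy as np

def running_means_pass(X, centers):
    """One pass where a center is updated right after each case is classified."""
    centers = np.asarray(centers, dtype=float).copy()
    counts = np.zeros(len(centers), dtype=int)  # assumption: start centers carry no weight
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        j = int(np.linalg.norm(centers - x, axis=1).argmin())
        labels[i] = j
        counts[j] += 1
        # Incremental mean: move the center toward x by 1/n of the gap.
        centers[j] += (x - centers[j]) / counts[j]
    return labels, centers

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
start = np.array([[1.0, 0.0], [9.0, 0.0]])
labels, centers = running_means_pass(X, start)
```

    With running means, the order of the cases can influence the result, because later cases are compared against centers that earlier cases have already moved.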

    Click Options in the K-Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table, Cluster information for each case, and Exclude cases pairwise. Click Continue. Click OK.

    Initial cluster centers. Prints the initial variable means for each cluster in the output.

    ANOVA table. ANOVA F tests are conducted for each variable to indicate how well the variable discriminates between clusters.

    Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and the cluster center in the output.

    Missing Values. The default is listwise deletion. For this example, there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
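
    One way to picture pairwise exclusion is a distance that uses only the variables a case actually has. The sketch below is an assumption about how such a distance could be computed (missing values coded as NaN and contributing nothing to the distance, with no rescaling); it is not a description of PASW internals, and `nearest_center_pairwise` is a hypothetical helper.

```python
import numpy as np

def nearest_center_pairwise(X, centers):
    """Assign cases using only the variables observed (non-NaN) for each case."""
    diff = X[:, None, :] - centers[None, :, :]
    # Differences involving a missing value are NaN; treat them as contributing 0.
    sq = np.where(np.isnan(diff), 0.0, diff ** 2)
    dist = np.sqrt(sq.sum(axis=2))
    return dist.argmin(axis=1), dist

# Every case is missing one variable, so listwise deletion would drop them all.
X = np.array([[1.0, np.nan, 0.9],
              [np.nan, -1.0, -1.1],
              [0.0, 0.1, np.nan]])
centers = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
labels, dist = nearest_center_pairwise(X, centers)
all_incomplete = bool(np.isnan(X).any(axis=1).all())
```

    In this toy data no case is complete, yet each one can still be assigned to a cluster from the variables it does have.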

    K-Means Clustering Interpretation

    The Initial Cluster Centers table shows the first step in k-means clustering: finding the k centers.

    The Iteration History table shows how many iterations were needed before the cluster centers stopped changing substantially.

    The Cluster Membership table gives the cluster each case belongs to and the Euclidean distance of each case to its cluster center. Below is a printout of the first and last 10 cases. Visual inspection of the distances is necessary to check for outliers that may not adequately reflect the population.
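
    The distance column of such a membership table can be reproduced as below, which also makes it easy to sort cases by distance when looking for outliers. The data, centers, and the helper name `membership_distances` are hypothetical.

```python
import numpy as np

def membership_distances(X, centers, labels):
    """Euclidean distance from each case to the center of its assigned cluster."""
    return np.linalg.norm(X - centers[labels], axis=1)

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.2], [9.0, 9.0]])
centers = np.array([[0.1, 0.0], [5.0, 5.1]])  # hypothetical final centers
labels = np.array([0, 0, 1, 1, 1])
dist = membership_distances(X, centers, labels)
# Cases ordered from farthest to nearest; the top entries are outlier candidates.
farthest_first = np.argsort(dist)[::-1]
```

    Here the last case sits far from its assigned center compared with the others, which is the kind of pattern visual inspection is meant to catch.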

    The Final Cluster Centers table below allows you to describe the clusters by the variables. For example, customers in Cluster 1 tend to purchase a lot of services, as evidenced by values above the mean for all variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values for the four calling services (caller ID, call waiting, call forwarding, and 3-way calling). Customers in Cluster 3 tend to spend very little and do not purchase many services; they have negative values on most of the variables.

    The Distances between Final Cluster Centers table shows the Euclidean distances between the final cluster centers. Greater distances between clusters mean there are greater dissimilarities.

    Clusters 1 and 3 have the greatest dissimilarities.

    Cluster 2 is equally similar to Clusters 1 and 3.
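
    A table of distances between centers is simply the pairwise Euclidean distance matrix of the final centers. The centers below are made up to mirror the pattern described above; they are not the values from the handout's output.

```python
import numpy as np

def center_distances(centers):
    """Symmetric matrix of Euclidean distances between every pair of centers."""
    diff = centers[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=2)

# Hypothetical centers: row 0 = Cluster 1, row 1 = Cluster 2, row 2 = Cluster 3.
centers = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])
D = center_distances(centers)
```

    In this made-up case `D[0, 2]` is the largest entry, and `D[0, 1]` equals `D[1, 2]`, matching a layout where the middle cluster is equally far from the other two.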

    The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with large mean square errors provide the least help in differentiating between clusters. For example, long distance and calling card had the two highest mean square errors (and lowest F statistics); therefore, the two variables were not as helpful as the other variables in forming and differentiating clusters.
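
    The quantities in such an ANOVA table can be computed per variable as a one-way ANOVA with cluster membership as the factor. The sketch below uses the standard formulas (between-cluster mean square over error mean square); the function name `cluster_anova` and the toy data are assumptions.

```python
import numpy as np

def cluster_anova(X, labels):
    """Per-variable one-way ANOVA across clusters: (ms_between, ms_error, F)."""
    groups = np.unique(labels)
    n, k = len(X), len(groups)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_error = np.zeros(X.shape[1])
    for g in groups:
        Xg = X[labels == g]
        ss_between += len(Xg) * (Xg.mean(axis=0) - grand_mean) ** 2
        ss_error += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    ms_between = ss_between / (k - 1)   # df = number of clusters - 1
    ms_error = ss_error / (n - k)       # df = cases - number of clusters
    return ms_between, ms_error, ms_between / ms_error

# Toy data: variable 0 separates the clusters, variable 1 does not.
X = np.array([[0.0, 1.0], [0.2, 3.0], [5.0, 1.2], [5.2, 3.1]])
labels = np.array([0, 0, 1, 1])
ms_between, ms_error, F = cluster_anova(X, labels)
```

    The second variable ends up with the larger error mean square and the smaller F statistic, which is the signature of a variable that does little to separate the clusters.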

    The Number of Cases in each Cluster table illustrates the split of cases into clusters. A large number of cases were assigned to the third cluster, which is the least profitable group.
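
    Given a vector of cluster assignments, the same kind of summary is a simple count. The labels below are invented for illustration and use 0-based cluster indices.

```python
import numpy as np

# Hypothetical cluster assignments for seven cases (0, 1, 2 = Clusters 1, 2, 3).
labels = np.array([0, 2, 2, 1, 2, 0, 2])
counts = np.bincount(labels, minlength=3)   # cases per cluster
shares = counts / counts.sum()              # proportion of cases per cluster
```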
