Cluster Analysis
Grouping Cases or Variables
Clustering Cases
• Goal is to cluster cases into groups based on shared characteristics.
• Start out with each case being a one-case cluster.
• The clusters are located in k-dimensional space, where k is the number of variables.
• Compute the squared Euclidean distance between each case and every other case.
Squared Euclidean Distance
• The sum across variables (from i = 1 to v) of the squared difference between the score on variable i for the one case (X_i) and the score on variable i for the other case (Y_i):
$$d^2 = \sum_{i=1}^{v} (X_i - Y_i)^2$$
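As a minimal sketch of the distance just defined, the function below sums the squared differences across variables for two cases. The function name and the two example cases are made up for illustration; they are not from the faculty data.

```python
def squared_euclidean(x, y):
    """Sum over variables of the squared difference between two cases."""
    if len(x) != len(y):
        raise ValueError("both cases need a score on every variable")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# Two hypothetical cases measured on v = 3 variables (made-up scores).
case_a = [1.0, 2.0, 3.0]
case_b = [2.0, 4.0, 6.0]
print(squared_euclidean(case_a, case_b))  # 1 + 4 + 9 = 14.0
```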
Agglomerate
• The two cases closest to each other are agglomerated into a cluster.
• The distances between entities (clusters and cases) are recomputed.
• The two entities closest to each other are agglomerated.
• This continues until all cases end up in one cluster.
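The agglomeration steps above can be sketched in a few lines of Python. The slides do not say how the distance between two multi-case clusters is computed, so this sketch assumes average linkage (the mean squared Euclidean distance over all cross-cluster pairs of cases); that rule, the function names, and the five example cases are all assumptions for illustration.

```python
from itertools import combinations


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cluster_distance(c1, c2, data):
    """Mean squared distance over all cross-cluster pairs (assumed rule)."""
    total = sum(squared_euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(data):
    """Return the merge history: one (members_a, members_b, distance) per stage."""
    clusters = [(i,) for i in range(len(data))]  # every case starts alone
    schedule = []
    while len(clusters) > 1:
        # Find the two entities (cases or clusters) closest to each other.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], data))
        d = cluster_distance(a, b, data)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))  # distances are recomputed next pass
        schedule.append((a, b, d))
    return schedule


# Five hypothetical cases on two variables (made-up scores).
data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
for stage, (a, b, d) in enumerate(agglomerate(data), start=1):
    print(stage, a, b, round(d, 3))
```

With n cases the loop always runs n - 1 stages, ending with every case in a single cluster. This brute-force version recomputes every distance at every stage, which is fine for a handful of cases but slow for large data sets.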
What is the Correct Solution?
• You may have theoretical reasons to expect a certain k cluster solution.
• Look at that solution and see if it matches your expectations.
• Alternatively, you may try to make sense out of solutions at two or more levels of the analysis.
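One way to inspect solutions at several levels is to replay the merge history and stop when k clusters remain. The sketch below assumes the history is available as an ordered list of case pairs, each pair naming one case from each of the two clusters joined at that stage; the function name and the merge list are hypothetical.

```python
def k_cluster_solution(n_cases, merges, k):
    """Replay the first n_cases - k merges and return the resulting clusters."""
    label = list(range(n_cases))  # each case starts in its own cluster
    for a, b in merges[: n_cases - k]:
        old, new = label[b], label[a]
        label = [new if lab == old else lab for lab in label]
    groups = {}
    for case, lab in enumerate(label):
        groups.setdefault(lab, []).append(case)
    return sorted(groups.values())


# Hypothetical merge order for five cases.
merges = [(0, 1), (2, 3), (0, 2), (0, 4)]
for k in (2, 3, 4):
    print(k, k_cluster_solution(5, merges, k))
```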
Faculty Salaries
• Subjects were faculty in Psychology at ECU.
• Variables were rank, experience, number of publications, course load, and salary.
• Data are at ClusterAnonFaculty.sav
• Also see the statistical output
Analyze, Classify, Hierarchical Cluster
• Dialogs shown: Statistics, Plots, Method, Save
Proximity Matrix
• We did not request this, but if we had it would display a measure of dissimilarity for each pair of entities.
• The pair of cases with the smallest squared Euclidean distance is clustered first.
• Look at the Agglomeration Schedule. Cases 32 and 33 are clustered. They are very similar (distance = 0.000).
Agglomeration Schedule
Stage   Cluster Combined        Coefficients   Stage Cluster First Appears   Next Stage
        Cluster 1   Cluster 2                  Cluster 1   Cluster 2
1       32          33          .000           0           0                 9
2       41          42          .000           0           0                 6
3       43          44          .000           0           0                 6
4       37          38          .000           0           0                 5
5       37          39          .001           4           0                 7
6       41          43          .002           2           3                 27
Stages 2 Through 5
• The agglomeration schedule shows that in Stage 2 cases 41 and 42 are clustered.
• In Stage 3 cases 43 and 44 are clustered.
• In Stage 4 cases 37 and 38 are clustered.
• In Stage 5 case 39 is added to the cluster that contains cases 37 and 38.
• And so on.
Vertical Icicle, Two Clusters
• Look at the top of the display (next slide).
• You can see two clusters
– On the left, Boris through Willy
– On the right, Deanna through Sunila
• The 2 cluster solution was adjuncts versus full time faculty.
Vertical Icicle, Three Clusters
• Look at the second-highest white bar in the icicle plot.
• Now there are three clusters
– Adjuncts
– Junior faculty (Deanna through Mickey)
– Senior faculty (Lawrence through Roslyn)
Vertical Icicle, Four Clusters
• Look at the white bar furthest to the right.
• Now there are four clusters
– Adjuncts
– Junior faculty
– The acting chair (Lawrence)
– The rest of the senior faculty (Catalina through Roslyn)
The Dendrogram
• At the far right you can see the two cluster solution.
• The next step to the left shows the three cluster solution.
• The next step to the left shows the four cluster solution.
• And so on.
• Truncated and rotated dendrogram on next slide.
Compare Two Clusters
• The 2 cluster solution was adjuncts versus everybody else.
• Look at the t tests in the output.
• Adjuncts had lower rank, experience, number of publications, course load, and salary.
Compare Three Clusters
• Look at the ANOVAs and plots.
• The senior faculty had higher salary, experience, rank, and number of publications.
Compare Four Clusters
• The acting chair had a higher salary and number of publications.
I Could Not Help Myself
• With these data on hand, I could not resist predicting salary from the other variables.
• Salary was well correlated with Rank, FTEs, Publications, and Experience.
• In the multiple regression, only Rank and FTEs had significant unique effects.
• The residuals suggest who was being overpaid and who underpaid.
Split by Sex
• For men, the unique effect of number of publications was positive – more publications, higher salary.
• For women it was negative – more publications, lower salary.
• Curious.
Workaholism
• Aziz & Zickar (2005)
• Workaholics may be defined as those
– High in work involvement,
– High in drive to work, and
– Low in work enjoyment.
• For each case, a score was obtained for each of these three dimensions.
The Three Cluster Solution
• Workaholics
– High work involvement
– High drive to work
– Low work enjoyment
• Positively engaged workers
– High work involvement
– Medium drive to work
– High work enjoyment
• Unengaged workers
– Low work involvement
– Low drive to work
– Low work enjoyment
• Past research/theory indicated there should be six clusters, but the theorized six clusters were not obtained.
Clustering Variables
• FactBeer.sav
• The statistical output.
• Analyze, Classify, Hierarchical Cluster
• Dialogs shown: Statistics, Plots, Method
Proximity Matrix
• The proximity matrix is simply the intercorrelation matrix.
• The two most correlated variables are Color and Aroma (r = .909) – they are clustered on the first step.
• Stage 2: Size and Alcohol (r = .904) are clustered.
• Stage 3: Taste added to the cluster that already contains Color and Aroma
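When variables are clustered, the correlation acts as a similarity, so the first step picks the pair with the largest r rather than the smallest distance. The sketch below shows only that first step; the variable names A through D and the correlation values are hypothetical, since the full FactBeer correlation matrix is not reproduced here.

```python
from itertools import combinations


def most_similar_pair(names, r):
    """Return the pair of variables with the largest correlation, and that r."""
    i, j = max(combinations(range(len(names)), 2),
               key=lambda p: r[p[0]][p[1]])
    return names[i], names[j], r[i][j]


# Hypothetical correlation matrix for four variables (made-up values).
names = ["A", "B", "C", "D"]
r = [
    [1.00, 0.80, 0.10, 0.05],
    [0.80, 1.00, 0.15, 0.00],
    [0.10, 0.15, 1.00, 0.60],
    [0.05, 0.00, 0.60, 1.00],
]
print(most_similar_pair(names, r))  # ('A', 'B', 0.8)
```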
Also See Other Tables & Plots
• Stage 4: Cost added to the cluster that already contains Size and Alcohol.
• Stage 5: The two clusters are combined
– But they are not very similar (similarity coefficient = .038)
– Now we have one cluster with six variables and one with one (Reputation)