Applied Data Analysis (With SPSS)
7/28/2019 Applied Data Analysis (With SPSS)
Research Methodology: Tools
Applied Data Analysis (with SPSS)
Lecture 04: Cluster Analysis
March 2011
Prof. Dr. Jrg Schwarz
MSc Business Administration
Slide 2
Contents
Aims (slide 5)
Introduction (slide 6)
Outline (slide 9)
Concepts of Cluster Analysis (slide 10)
Cluster Analysis with SPSS: A detailed example (slide 24)
Slide 5
Aims
Aims of the lecture
You know different types of measures of distance / similarity
You know the key steps in conducting a cluster analysis.
You can conduct a cluster analysis with SPSS
(Hierarchical agglomerative methods: Between-groups linkage and Ward)
In particular, you know how to
choose the appropriate measure of distance / similarity
interpret the agglomeration schedule
use the dendrogram to determine the number of clusters
interpret the meaning of a cluster
Slide 6
Introduction
Example
Marketing research: Customer survey on brand awareness ("Markenbewusstsein")
[Figure: brand awareness [Index] vs. yearly income [Index]]
Survey features
Sample of n = 150 customers
Brand awareness index consists of 3 items:
How likely is it that you will use the brand again in the future?
How likely would you be to recommend the brand to your friends?
Overall, how satisfied are you with the brand?
Also included in the dataset:
yearly income
Slide 7
Question
Is there a linear relation between brand awareness and yearly income?
Hypothesis: The higher a person's income, the higher his/her brand awareness.
Conduct regression analysis with SPSS
[Figure: brand awareness [Index] vs. yearly income [Index]]
Output (summarized)
Overall model test (F-test)
Significance p = .014
Test of coefficients
Constant p = .000
Income p = .014
Coefficient of determination
R Square = .040
It is a really poor model: R Square = .040 means that income explains only 4% of the variance in brand awareness.
There seems to be structure in the brand awareness dataset.
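The regression output above can be reproduced in outline with a few lines of Python. This is only a sketch: the data are hypothetical values, not the survey data, and the variable names are chosen for illustration. It shows how R Square and the overall F statistic follow from the sums of squares. With a single predictor, F is the square of the t statistic of the slope, which is why the overall model test and the test of the income coefficient report the same p value.

```python
# Hypothetical (income, awareness) index pairs; NOT the survey data from the lecture.
income = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
awareness = [1.2, 2.9, 3.1, 2.8, 1.9, 2.2, 2.0, 2.4]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(awareness) / n

# Ordinary least squares for a single predictor.
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, awareness))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R squared = 1 - residual sum of squares / total sum of squares.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(income, awareness))
ss_tot = sum((y - mean_y) ** 2 for y in awareness)
r_squared = 1 - ss_res / ss_tot

# Overall F statistic with 1 and n - 2 degrees of freedom.
f_stat = r_squared / (1 - r_squared) * (n - 2)

print(f"slope={slope:.3f}  R^2={r_squared:.3f}  F={f_stat:.3f}")
```

A small R Square with a significant F-test, as on the slide, means the linear trend is detectable but explains little of the variation.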
Slide 8
Question
Is there structure in the brand awareness dataset?
Are there clusters for the combination of yearly income and brand awareness?
Conduct cluster analysis
[Figure: brand awareness [Index] vs. yearly income [Index]]
Output
SPSS identified 3 distinct clusters
Interpretation
People with low income are least aware because they lack money.
People with middle income have the highest brand awareness because of the dream of being richer.
People with high income are moderately brand aware because they have a certain status but don't need to show off.
Slide 9
Outline
Cluster analysis is a multivariate procedure for detecting natural groupings in data.
The grouping is based on the scores of several measures (e.g. income and awareness).
[Figure: brand awareness [Index] vs. yearly income [Index]; d marks a distance within a group, D a distance between groups]
Goals in conducting cluster analysis
Elements within a group should be as similar as possible
distance d should be small
Similarities between the groups should be minimal
distance D should be large
Features
Because all information is used for grouping, cluster analysis is more objective than just a subjective impression.
There is no optical illusion.
Slide 10
Concepts of Cluster Analysis
Key steps in using a cluster analysis
1. Measure of distance or similarity between objects (also called proximity measure)
Depends on type of data: interval, counts, binary
Distance: geometrical measure. Similarity: content-related measure
2. Formation of clusters
Calculation of proximity matrix
Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.
3. Tools / criteria for determining the number of clusters
Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")
Criteria (not available in SPSS): F-value, information criterion, etc.
4. Display and save cluster membership
Done by SPSS
5. Interpretation of clusters
Taking into account means (possibly variances) of cluster members
Slide 11
How to measure proximity
From dataset (cells contain raw data) ...
           Variable 1   Variable 2   Variable 3   ...   Variable j
Object 1
Object 2
Object 3
...
Object k
... to proximity matrix (cells contain distance or similarity; done by SPSS internally)
           Object 1   Object 2   Object 3   ...   Object k
Object 1
Object 2
Object 3
...
Object k
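The step from dataset to proximity matrix can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the proximity measure; the object names and values are hypothetical, and SPSS performs the equivalent calculation internally.

```python
import math

# Hypothetical raw data: one row per object, one column per variable.
data = {
    "Object 1": [1.0, 2.0],
    "Object 2": [2.5, 0.5],
    "Object 3": [4.0, 4.0],
}

def proximity_matrix(rows):
    """Return a dict-of-dicts with the Euclidean distance for every pair of objects."""
    return {
        k: {l: math.dist(rows[k], rows[l]) for l in rows}
        for k in rows
    }

matrix = proximity_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The result is a square, symmetric matrix with zeros on the diagonal, because every object has distance 0 to itself.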
Slide 12
Different proximity measures, depending on type of data
"Measure" allows specifying the distance (d) or similarity (s) to be used in clustering.
Interval (e.g. brand awareness, yearly income)
Euclidean distance (d)
City block distance (d)
Pearson correlation (s)
...
Counts (e.g. number of clients)
Chi-square measure (s)
Phi-square measure (s):
Binary (e.g. yes/no, female/male)
Euclidean distance (d)
Russel and Rao (s)
Simple matching (s)
Dice (s)
(only a selection of 27!)
Slide 13
Proximity measure with interval variables
Example: Brand awareness
Theorem of Pythagoras about the right triangle
$$a^2 + b^2 = c^2 \;\Rightarrow\; c = \sqrt{a^2 + b^2}$$
Distance between "pers_001" and "pers_002"
$$d_{001,002} = \left[(1.67 - 0.97)^2 + (2.95 - 1.73)^2\right]^{1/2} = \left[0.490 + 1.488\right]^{1/2} = 1.407$$
Coordinates {x-axis, y-axis}: {1.67, 1.73} and {0.97, 2.95}
[Figure: right triangle with legs a, b and hypotenuse c = 1.407 between the points {1.67, 1.73} and {0.97, 2.95}]
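The calculation on this slide can be checked step by step. The sketch below uses the two coordinate pairs from the slide; the names `p` and `q` are illustrative only and do not say which point belongs to which person.

```python
import math

# Coordinates {x-axis, y-axis} of the two persons, taken from the slide.
p = (1.67, 1.73)
q = (0.97, 2.95)

# Squared difference per axis, then the square root of their sum (Pythagoras).
squared_diffs = [(a - b) ** 2 for a, b in zip(p, q)]
distance = math.sqrt(sum(squared_diffs))

print([round(s, 3) for s in squared_diffs], round(distance, 3))
```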
Slide 14
Generalized equation
Minkowski distance (Hermann Minkowski, 1864 - 1909, German physicist)
$$d_{k,l} = \left[\sum_{j=1}^{J} \left|x_{kj} - x_{lj}\right|^{r}\right]^{1/r}$$
r = Minkowski's constant
d_{k,l} = Distance between objects k and l (e.g. distance between persons 001 and 002)
J = Number of cluster variables (e.g. variables income and awareness)
x_{kj}, x_{lj} = Values of variable j of objects k and l (e.g. income of persons 001 and 002)
Values of Minkowski's constant
r = 1: City block distance (also called L1-norm, Manhattan distance or taxi distance)
r = 2: Euclidean distance (also called L2-norm)
[Figure: paths between two points under the L1-norm and the L2-norm]
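As a small illustration of the generalized equation, the function below evaluates the Minkowski distance for a given constant r. The function name and the reuse of the two points from slide 13 are choices made for this sketch.

```python
def minkowski(x_k, x_l, r):
    """Minkowski distance between objects k and l for a constant r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x_k, x_l)) ** (1 / r)

# The two points from the brand awareness example on slide 13.
x_k = (1.67, 1.73)
x_l = (0.97, 2.95)

city_block = minkowski(x_k, x_l, r=1)   # L1-norm: sum of absolute differences
euclidean = minkowski(x_k, x_l, r=2)    # L2-norm: straight-line distance

print(round(city_block, 3), round(euclidean, 3))
```

The city block distance is never smaller than the Euclidean distance, since it follows the axes instead of the diagonal.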
Slide 15
Proximity measure with binary variables
Example: Car configuration
Identification of similarities between two objects by means of comparison
Configuration
           ABS   Airbag   ESP   Navi   Metallic
Mercedes   0     1        1     1      0
BMW        0     1        1     0      1
Case       D     A        A     C      B
0 = feature not present, 1 = feature present
4 Cases
A = Feature exists in both comparison objects
B, C = Feature exists in one comparison object
D = Feature exists in none of the comparison objects
Non-existence is also an important similarity in proximity definition
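The comparison of the two configurations can be written down as a short sketch. The function name `classify` is an assumption made for illustration; the assignment of B and C follows the slide, where Metallic (only the BMW has it) is case B and Navi (only the Mercedes has it) is case C.

```python
from collections import Counter

features = ["ABS", "Airbag", "ESP", "Navi", "Metallic"]
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

def classify(x, y):
    """Map a pair of binary values to the cases A, B, C, D used on the slide."""
    if x == 1 and y == 1:
        return "A"          # feature exists in both comparison objects
    if x == 0 and y == 0:
        return "D"          # feature exists in none of the comparison objects
    # Feature exists in exactly one object: (0, 1) is B, (1, 0) is C on the slide.
    return "B" if y == 1 else "C"

cases = [classify(x, y) for x, y in zip(mercedes, bmw)]
counts = Counter(cases)

print(dict(zip(features, cases)), dict(counts))
```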
Slide 16
Binary proximity measures
Proximity measure between two objects i and j depends on whether and how
the cases are included and how they are weighted (weights $\delta$ and $\lambda$).
General case: Simple Matching Coefficient*
$$S_{ij} = \frac{a + \delta d}{a + \lambda (b + c) + \delta d}$$
Variants          Description                                   Definition
Russel and Rao    Case d reduces proximity                      $S_{ij} = \frac{a}{a + b + c + d}$
Simple matching   Case d raises proximity                       $S_{ij} = \frac{a + d}{a + b + c + d}$
Dice              Case d is not taken into account;             $S_{ij} = \frac{2a}{2a + b + c}$
                  similar features are weighted more
* Sokal, R.R. and Michener, C.D., Statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.
a = Number of cases of case "A"
b = Number of cases of case "B"
...
Slide 17
Example: Car configuration
Configuration
           ABS   Airbag   ESP   Navi   Metallic
Mercedes   0     1        1     1      0
BMW        0     1        1     0      1
Case       D     A        A     C      B
0 = feature not present, 1 = feature present
Measure           Proximity
Russel and Rao    $S_{ij} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$
Simple matching   $S_{ij} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$
Dice              $S_{ij} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$
Some remarks
Sij varies between 0 and 1
There is no "right" proximity measure
Important question/decision:
Is non-existence important?
(=> taking case d into account?)
Count of cases
a = 2
b = 1
c = 1
d = 1
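The three measures can be recomputed from the counts of cases. This is a minimal sketch; the function names are chosen for illustration and are not SPSS identifiers.

```python
def russel_rao(a, b, c, d):
    """Case d appears only in the denominator, so it reduces proximity."""
    return a / (a + b + c + d)

def simple_matching(a, b, c, d):
    """Case d counts as agreement, so it raises proximity."""
    return (a + d) / (a + b + c + d)

def dice(a, b, c, d):
    """Case d is ignored (accepted only for a uniform signature); case a is weighted double."""
    return 2 * a / (2 * a + b + c)

# Count of cases from the car configuration example.
counts = dict(a=2, b=1, c=1, d=1)
results = {f.__name__: f(**counts) for f in (russel_rao, simple_matching, dice)}

for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

Changing d in `counts` while keeping a, b and c fixed makes the role of non-existence visible: Russel and Rao falls, simple matching rises, and Dice stays the same.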
Slide 18
How to form Clusters
[Figure: cluster A and cluster B with the three distances 1., 2. and 3. defined below]
How to define similarity?
Similarity between cluster A and cluster B is measured by
1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)
... the minimum of all possible distances between the cases in cluster A and the cases in B.
2. Centroid clustering (also called other linkage)
... the distance between the centroids of cluster A and of cluster B.
3. Furthest neighbor (also called complete linkage)
... the maximum of all possible distances between the cases in cluster A and the cases in B.
Slide 19
Similarity between cluster A and cluster B is measured by
Between-groups linkage (also called average linkage)
... the average of all the possible distances between the cases in cluster A and the cases in B.
Within-groups linkage (also called other linkage)
... the average of all the possible distances between the cases within a single new cluster
determined by combining cluster A and cluster B.
Median clustering (also called other linkage)
... the distance between the SPSS determined median for the cases in cluster A and the median
for the cases in cluster B.
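The linkage rules on this slide and the previous one differ only in how the case-to-case distances are summarized. The sketch below assumes Euclidean distance and two small hypothetical clusters; median clustering is left out, and the function names are illustrative.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Component-wise mean of the cases in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def linkage_distances(A, B):
    """Distance between clusters A and B under several linkage rules."""
    between = [math.dist(p, q) for p in A for q in B]
    merged = A + B
    within = [math.dist(p, q) for p, q in combinations(merged, 2)]
    return {
        "single": min(between),                            # nearest neighbor
        "complete": max(between),                          # furthest neighbor
        "centroid": math.dist(centroid(A), centroid(B)),
        "between_groups": sum(between) / len(between),     # average linkage
        "within_groups": sum(within) / len(within),        # all pairs in the merged cluster
    }

# Hypothetical clusters in two dimensions.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]
dist = linkage_distances(cluster_a, cluster_b)

for rule, value in dist.items():
    print(f"{rule}: {value:.3f}")
```

Between-groups linkage always lies between the single and the complete linkage value, which matches the remark on slide 21 that average linkage is "between" the two.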
Special case, taking into account sum of squares
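Ward's method
For a cluster the sum of squares is the sum of squared distances of each case from the centroid.
Ward's method merges the two clusters whose combination gives the smallest increase in the total sum of squares.
[Figure: cluster with the distances d1, d2, ... of the cases from the centroid]
Sum of squared distances
$$d_1^2 + d_2^2 + \dots + d_k^2 = \sum_{i=1}^{k} d_i^2$$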
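A minimal sketch of the sum of squares, and of the increase caused by merging two clusters, could look as follows. The clusters are hypothetical and the function names are chosen for this illustration.

```python
def centroid(cluster):
    """Component-wise mean of the cases in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def sum_of_squares(cluster):
    """Sum of squared Euclidean distances of each case from the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(case, c)) for case in cluster)

def ward_increase(A, B):
    """Increase in the total sum of squares caused by merging clusters A and B."""
    return sum_of_squares(A + B) - sum_of_squares(A) - sum_of_squares(B)

# Hypothetical clusters in two dimensions.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 0.0), (4.0, 2.0)]

print(sum_of_squares(cluster_a), sum_of_squares(cluster_b), ward_increase(cluster_a, cluster_b))
```

At each stage, the pair of clusters with the smallest such increase would be merged.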
Slide 20
Cluster formation tree (rules for cluster formation)
There are several types of clustering procedures:
Cluster algorithms
  Hierarchical
    Agglomerative
      Linkage methods
        Single linkage
        Complete linkage
        Average linkage
        Other linkage
      Variance methods
        Ward's procedure
    Divisive
  Non-hierarchical
    k-Means procedure
Non-hierarchical clustering is also called k-means clustering.
Average linkage between groups is the default in SPSS ("Between-groups linkage")
used in this course
Slide 21
Pros and cons
Hierarchical clustering
No a priori decision about the number of clusters
Can be very slow
Non-hierarchical clustering
Need to specify the number of clusters (can be an arbitrary number)
Faster, more reliable
Features
Procedure          Proximity measure        Remark
Single linkage     distance or similarity   tendency to form chains
Complete linkage   distance or similarity   tendency to form smaller groups of the same size
Average linkage    distance or similarity   "between" single and complete linkage
Other linkage      only distance            no remark
Ward's method      only distance            tendency to form groups of the same size
Slide 22
Example of hierarchical method: Single linkage (nearest neighbor)
Tendency to form chains
Suitable for the detection of outliers
Close groups are badly separated
[Figure: single linkage from step k to step k + 1; the nearest neighbor is merged and a "chain" forms. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de), date of access: March 2011]
Slide 23
Example of hierarchical method: Complete linkage (furthest neighbor)
Tendency to form smaller groups of the same size
Not suitable for detecting outliers
[Figure: complete linkage from step k to step k + 1, based on the furthest neighbor. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de), date of access: March 2011]
Slide 24
Cluster Analysis with SPSS: A detailed example
Marketing research: Customer survey on brand awareness
Data
Random sub-sample of n = 15
(Why this small sub-sample?
Just to keep track of what SPSS does.)
Data set: cluster_small.sav
Syntax: cluster_small.sps
[Figure: brand awareness [Index] vs. yearly income [Index]]
Slide 25
SPSS Elements:
Slide 27
First step: Measure of distance or similarity between objects
Output
Proximity Matrix (Distances or similarities between items)
Values represent Euclidean distances
Example: Distance between cases 9 and 7
Slide 28
Second step: Formation of clusters
Between-groups linkage
Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}
First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}
Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}
Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}
...
Agglomeration schedule: Displays the clusters combined at each stage.
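How such an agglomeration schedule comes about can be sketched in plain Python. This is an illustration of between-groups (average) linkage on Euclidean distances, not the SPSS implementation; the five cases are hypothetical and are not the 15 cases from cluster_small.sav. As in the stages described above, a cluster is identified by its lowest case number.

```python
import math

def agglomeration_schedule(cases):
    """Return a list of (stage, cluster_1, cluster_2, coefficient).

    cases maps a case number to its coordinates. At every stage the two
    clusters with the smallest average case-to-case distance are merged.
    """
    clusters = {k: [k] for k in cases}          # cluster id -> member case numbers
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                avg = sum(math.dist(cases[p], cases[q]) for p, q in pairs) / len(pairs)
                if best is None or avg < best[2]:
                    best = (a, b, avg)
        a, b, coefficient = best
        clusters[a] = clusters[a] + clusters.pop(b)   # merged cluster keeps the lower id
        stage += 1
        schedule.append((stage, a, b, coefficient))
    return schedule

# Hypothetical data: case number -> (yearly income, brand awareness).
cases = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (5.0, 0.0), 4: (5.0, 0.5), 5: (9.0, 9.0)}
schedule = agglomeration_schedule(cases)

for stage, c1, c2, coefficient in schedule:
    print(f"Stage {stage}: clusters {c1} and {c2}, coefficient {coefficient:.3f}")
```

With n cases there are always n - 1 stages, and the coefficients grow from stage to stage as increasingly dissimilar clusters are merged.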
Slide 29
Dendrogram
[Figure: dendrogram with the merges of stages 1, 5, 2 and 3 marked]
Slide 30
Icicle plot
14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.
13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their own clusters.
12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their own clusters.
...
The figure is called an icicle plot because the columns look like icicles hanging from above.
The plot shows how cases are merged into clusters. Read it from bottom to top.
Slide 31
Third step: Determining the number of clusters
0) Theoretical and empirical reasons (But, be careful about optical illusion!)
In the case of brand awareness there are some indications for three clusters.
A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)
[Figure: structure chart; x-axis: number of clusters (= sample size - "Stage"), 1 to 15; y-axis: proximity ("Coefficients"), 0.0 to 3.0; marked: elbow => choose 3 clusters]
Please note: There is usually a large jump between 1 cluster and 2 clusters, which is not the "elbow".
Slide 32
B) Dendrogram
Cut the dendrogram within the largest increase in heterogeneity and choose the resulting number of clusters
[Figure: dendrogram; axis: standardized distance; the largest increase in heterogeneity is marked]
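The rule "largest increase in heterogeneity" can be applied to the "Coefficients" column of the agglomeration schedule. The sketch below uses hypothetical coefficients and an illustrative function name; it relies on the relation number of clusters = sample size - "Stage".

```python
def clusters_at_largest_jump(coefficients):
    """Suggest a number of clusters; coefficients[i] is the value of stage i + 1."""
    n_cases = len(coefficients) + 1
    # Increase in heterogeneity from each stage to the next one.
    jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
    last_good_stage = jumps.index(max(jumps)) + 1     # stop before the largest jump
    return n_cases - last_good_stage                  # number of clusters = sample size - stage

# Hypothetical coefficients for a sample of 6 cases (5 stages).
coefficients = [0.2, 0.3, 0.5, 1.9, 2.6]
k = clusters_at_largest_jump(coefficients)

print(k)
```

Here the largest jump lies between stage 3 and stage 4, so the solution after stage 3 is kept. The suggestion should still be checked against the chart and against theoretical reasons, in particular when the largest jump is the final merge into a single cluster.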
Slide 33
Fourth step: Display and save cluster membership
Output table of cluster membership
Example of brand awareness: assumed 3 clusters
If you're not sure about the number of clusters, choose a full range
Slide 34
Saving the cluster membership
Used for drawing a scatter plot, for example.
Range of solutions: 2 to 5
Example of brand awareness: assumed 3 clusters
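Cluster membership for a range of solutions can be derived by replaying the agglomeration schedule up to the stage that leaves the desired number of clusters. This is a sketch under stated assumptions: the schedule is hypothetical (6 cases), each entry names the two clusters merged at that stage by their lowest case number, and the function name is illustrative.

```python
def membership(schedule, n_cases, n_clusters):
    """Replay the first n_cases - n_clusters stages and return {case: cluster label}."""
    owner = {case: case for case in range(1, n_cases + 1)}   # case -> current cluster id
    for cluster_1, cluster_2 in schedule[: n_cases - n_clusters]:
        for case in owner:
            if owner[case] == cluster_2:
                owner[case] = cluster_1      # merged cluster keeps the lower id
    # Relabel cluster ids as 1, 2, 3, ... in order of first appearance.
    labels = {}
    return {case: labels.setdefault(cid, len(labels) + 1) for case, cid in owner.items()}

# Hypothetical schedule for 6 cases: (cluster 1, cluster 2) per stage.
schedule = [(3, 4), (1, 2), (5, 6), (1, 3), (1, 5)]

# Range of solutions: 2 to 5 clusters.
solutions = {k: membership(schedule, n_cases=6, n_clusters=k) for k in range(2, 6)}

for k, assignment in solutions.items():
    print(k, assignment)
```

Each solution can then be used as a grouping variable, for example to color the points of a scatter plot.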
Slide 35
Scatter plot:
Slide 36
One point was assigned incorrectly
Slide 37
Fifth step: Interpretation of clusters
In the case of the brand awareness example, the interpretation is obvious and straightforward.
Taking into account means
The means of the clusters with respect to the original variables indicate how the clusters can be interpreted.
Example of Lecture 01: Marketing survey on consumer buying behavior
Questionnaire to ask people about their attitudes.
Among other questions:
"What is your general attitude to life?" (variable x1)
"What is your attitude to innovation?" (variable x2)
"What is your willingness to take risks?" (variable x3)
Scales of variables vary from extremely negative (1) to extremely positive (7)
Data of 6 people (objects in rows, attributes in columns)
Object     General attitude to life   Attitude to innovation   Willingness to take risks
Person A   1                          2                        2
Person B   1                          3                        3
Person C   2                          4                        2
Person D   5                          4                        3
Person E   5                          4                        4
Person F   7                          6                        7
Slide 38
Mean of clusters, regarding the cluster variables (clusters in rows, attributes in columns)
Cluster     General attitude to life   Attitude to innovation   Willingness to take risks
(A, B, C)   1.3                        3                        2.3
(D, E)      5                          4                        3.5
(F)         7                          6                        7
Cluster 1 (A, B, C): pessimistic people who live in fear
Cluster 2 (D, E): slightly optimistic, ordinary people
Cluster 3 (F): life-affirming adventurer
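The cluster means can be recomputed from the data of the 6 people. The sketch below takes the cluster assignment from the slide as given; the function name is chosen for illustration.

```python
# Values per person: (general attitude to life, attitude to innovation, willingness to take risks).
data = {
    "A": (1, 2, 2), "B": (1, 3, 3), "C": (2, 4, 2),
    "D": (5, 4, 3), "E": (5, 4, 4), "F": (7, 6, 7),
}
clusters = {1: ["A", "B", "C"], 2: ["D", "E"], 3: ["F"]}

def cluster_means(data, clusters):
    """Mean of every attribute per cluster, rounded to one decimal place."""
    means = {}
    for label, members in clusters.items():
        columns = zip(*(data[m] for m in members))
        means[label] = tuple(round(sum(col) / len(members), 1) for col in columns)
    return means

means = cluster_means(data, clusters)
for label, values in means.items():
    print(f"Cluster {label}: {values}")
```

Low means on all three attributes characterize cluster 1, while cluster 3 is high on all of them, which is the basis for the interpretations given above.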