Applied Data Analysis (With SPSS)

7/28/2019 Applied Data Analysis (With SPSS)

1/19

Research Methodology: Tools

Applied Data Analysis (with SPSS)

Lecture 04: Cluster Analysis

March 2011

Prof. Dr. Jrg Schwarz [email protected]

MSc Business Administration

Slide 2

Contents

Aims ___________________________________________________________________________________________________ 5

Introduction _____________________________________________________________________________________________ 6

Outline _________________________________________________________________________________________________ 9

Concepts of Cluster Analysis______________________________________________________________________________ 10

Cluster Analysis with SPSS: A detailed example______________________________________________________________ 24


2/19


3/19

Slide 5

Aims

Aims of the lecture

You know different types of measures of distance / similarity

You know the key steps in conducting a cluster analysis.

You can conduct a cluster analysis with SPSS

(Hierarchical agglomerative methods: Between-groups linkage and Ward)

In particular, you know how to

choose the appropriate measure of distance / similarity

interpret the agglomeration schedule

use the dendrogram to determine the number of clusters

interpret the meaning of a cluster

Slide 6

Introduction

Example

Marketing research: Customer survey on brand awareness ("Markenbewusstsein")

Brandawareness[Index]

Yearly income [Index]

Survey features

Sample of n = 150 customers

Brand awareness index consist of 3 items:

How likely is it that you will use thebrand again in the future?

How likely would you be to recommendthe brand to your friends?

Overall, how satisfied are you with thebrand?

Also included in the dataset:

yearly income


4/19

Slide 7

Question

Is there a linear relation between brand awareness and yearly income?

Hypothesis: The higher a person's income, the higher his/her brand awareness.

Conduct regression analysis with SPSS


Output (summarized)

Overall model test (F-test)

Significance p = .014

Test of coefficients

Constant p = .000

Income p = .014

Coefficient of determination

R Square = .040

It is a really poor model

It seems to have structure in the

brand awareness dataset.


Slide 8

Question

Is there structure in the brand awareness dataset?

Are there clusters for the combination of yearly income and brand awareness?

Conduct cluster analysis



Output

SPSS identified 3 distinct clusters

Interpretation

People with low income are least aware

because they lack money.

People with middle income have the

highest brand awareness because of

the dream of being richer.

People with high income are moderately

brand aware because they have a

certain status but don't need to show off.


5/19

Slide 9

Outline

Cluster analysis is a multivariate procedure for detecting natural groupings in data.

The grouping is based on the scores of several measures (e.g. income and awareness).

Brand

awareness[Index]


Goals in conducting cluster analysisElements within a group should be as

similar as possible

distance d should be small

Similarities between the groups should be

minimal

distance D should be large

FeaturesBecause all information is used for

grouping, cluster analysis is more

objective than just a subjective impression.

There is no optical illusion.

D

d

Slide 10

Concepts of Cluster Analysis

Key steps in using a cluster analysis

1. Measure of distance or similarity between objects (also called proximity measure)

Depends on type of data: interval, counts, binary

Distance: geometrical measure. Similarity: content-related measure

2. Formation of clusters

Calculation of proximity matrix

Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.

3. Tools / criteria for determining the number of clusters

Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")

Criteria (not available in SPSS): F-value, information criterion, etc.

4. Display and save cluster membership

Done by SPSS

5. Interpretation of clusters

Taking into account means (possibly variances) of cluster members


6/19

Slide 11

How to measure proximity

From dataset ...

Variable 1 Variable 2 Variable 3 : Variable j

Object 1

Object 2

Object 3

:

Object k

... to proximity matrix (done by SPSS internally)

Object 1 Object 2 Object 3 : Object k

Object 1

Object 2

Object 3

:

Object k

raw data

distance or similarity

Slide 12

Different proximity measures, depending on type of data

Measureallows specifying the distance (d) or similarity (s) to be used in clustering.

Interval (e.g. brand awareness, yearly income)

Euclidean distance (d)

City block distance (d)

Pearson correlation (d):

Counts (e.g. number of clients)

Chi-square measure (s)

Phi-square measure (s):

Binary (e.g. yes/no, female/male)

Euclidean distance (d)

Russel and Rao (s)

Simple matching (s)

Dice (s)

(only a selection of 27!)


7/19

Slide 13

Proximity measure with interval variables

Example: Brand awareness

Theorem of Pythagoras about right triangle

cbacba 22222 =+=>=+

Distance between "pers_001" and "pers_002"

[ ][ ]

407.1

488.1490.0

73.195.297.067.1d

2/1

2/122

002,001

=

+=

+=

Coordinates {x-axis, y-axis}

a

bc

0.97 1.67

2.95

1.73

1.407

a

bc

0.97 1.67

2.95

1.73

1.407

{1.67, 1.73}

{0.97, 2.95}

Slide 14

Generalized equation

Minkowski distance (Hermann Minkowski, 1864 - 1909, German physicist)r/1

J

1j

r

ljkjl,k xxd

=

=

r = Minkowski's constant

dk,l = Distance between objects k and l (e.g. distance between persons 001 and 002)

J = Number of cluster variables (e.g. variables income and awareness)xkj, xlj = Values of variable j of objects k and l (e.g. income of persons 001 and 002)

Values of Minkowski's constant

r = 1: City block distance (also called L1-norm)

r = 2: Euclidean distance (also called L2-norm)

City block distance

Manhattan distance

Taxi distance

L2

L1

L2

L1


8/19

Slide 15

Proximity measure with binary variables

Example: Car configuration

Identification of similarities between two objects by means of comparison

ABS Airbag ESP Navi MetallicMercedes 0 1 1 1 0

BMW 0 1 1 0 1

Case D A A C B

0 = feature not present 1 = feature present

Configuration

4 Cases

A = Feature exists in both comparison objects

B, C = Feature exists in one comparison object

D = Feature exists in none of the comparison objects

Non-existence is also an important similarity in proximity definition

Slide 16

Binary proximity measures

Proximity measure between two objects i and j depends on whether and how

the cases are included and how they are weighted (weights , i und ).

General case: Simple Matching Coefficient*

ij

a dS

a (b c) d1

2

+ =

+ + +

Variants Description Definition

Russel und Rao Case d reduces proximityij

aS

a b c d=

+ + +

Simple matching Case d raises proximityij

a dS

a b c d

+=

+ + +

DiceCase d is not taken into account

Similar features are weighted more ij 2aS 2a b c=

+ +

*Sokal, R.R. and Michener, C.D., Statistical method for evaluating systematic relationships,*University of Kansas science bulletin, 38:1409--1438, 1958.

a = Number of cases of case "A"b = Number of cases of case "B"

:


9/19

Slide 17

Example: Car configuration

ABS Airbag ESP Navi Metallic

Mercedes 0 1 1 1 0

BMW 0 1 1 0 1

Case D A A C B

0 = feature not present 1 = feature present

Configuration

Measure Proximity

Russel and Raoij

2 2S 0.4

2 1 1 1 5= = =

+ + +

Simple matchingij

2 1 3S 0.62 1 1 1 5

+= = =+ + +

Diceij

2 2 4S 0.67

2 2 1 1 6

= = =

+ +

Some remarks

Sij varies between 0 and 1

There is no "right" proximity measure

Important question/decision:

Is non-existence important?

( taking case d into account?)

Count of cases

a = 2

b = 1

c = 1

d = 1

Slide 18

How to form Clusters

Cluster A Cluster B

1.

2.

3.

How to define similarity?

Similarity between cluster A and cluster B is measured by

1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)

... the minimum of all possible distances between the cases in cluster A and the cases in B.2. Centroid clustering (also called other linkage)

... the distance between the centroids of cluster A and of cluster B.

3. Furthest neighbor (also called complete linkage)

... the maximum of all ossible distances between the cases in cluster A and the cases in B.


10/19

Slide 19

Similarity between cluster A and cluster B is measured by

Between-groups linkage (also called average linkage)

... the average of all the possible distances between the cases in cluster A and the cases in B.

Within-groups linkage (also called other linkage)

... the average of all the possible distances between the cases within a single new cluster

determined by combining cluster A and cluster B.

Median clustering (also called other linkage)

... the distance between the SPSS determined median for the cases in cluster A and the median

for the cases in cluster B.

Special case, taking into account sum of squares

Wards method

For a cluster the sum of squares is the sum of squared distances of each case from the centroid.

d1

d2

Sum of squared distances

=

=++k

1i

2

i

2

2

2

1 d...dd

Slide 20

Cluster formation tree (rules for cluster formation)

There are several types of clustering procedures:

DivisiveAgglomerative

Variance

methods

Linkage

methods

Wardsprocedure

Singlelinkage

Clusteralgorithms

Hierarchical

Completelinkage

Averagelinkage

k-Meansprocedure

Non-hierarchical

Otherlinkage

Non-hierarchical clustering is also called k-means clustering.

Average linkage between groups is the default in SPSS ("Between-groups linkage")

used in this course


11/19

Slide 21

Pros and cons

Hierarchical clustering

No a priori decision about the number of clusters

Can be very slow

Non-hierarchical clustering

Need to specify the number of clusters (can be an arbitrary number)

Faster, more reliable

Features

Procedure Proximity measure Remark

Single linkage distance or similarity tendency to form chains

Complete linkage distance or similarity tendency to smaller groups of same size

Average linkage distance or similarity "between" single and complete linkage

Other linkage only distance No remark

Ward's method only distance tendency to groups of same size

Slide 22

Example of hierarchical method: Single linkage (nearest neighbor)

Tendency to form chains

Suitable for the detection of outliers

Close groups are badly separated

Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)

nearest neighbor

Step k Step k + 1

"chain"


12/19

Slide 23

Example of hierarchical method: Complete linkage (furthest neighbor)

Tendency to form smaller groups with same size

Not suitable for detecting outliers

Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)

furthest neighbor

Step k Step k + 1

Slide 24

Cluster Analysis with SPSS: A detailed example

Marketing research: Customer survey on brand awareness

Data

Random sub-sample of n = 15

(Why this small sub-sample?

Just to keep track of what SPSS does.)

Data set: cluster_small.sav

Syntax: cluster_small.sps




13/19

Slide 25

SPSS Elements:


14/19

Slide 27

First step: Measure of distance or similarity between objects

Output

Proximity Matrix (Distances or similarities between items)

Values represent Euclidian distances

Example:

Distance between cases 9 and 7

:

:

Slide 28

Second step: Formation of clusters

Between-groups linkage

Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}

First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}

Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}

Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}

:

Agglomeration schedule: Displaysthe clusters combined at each stage.


15/19

Slide 29

Dendrogram

Stage 1

Stage 5

Stage 2

Stage 3

Slide 30

Icicle plot

14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.

13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their clusters.

12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their clusters.

:

The figure is called an

icicle plot because the

columns look like icicles

hanging from above.

The plot shows how cases

are merged into clusters.

Read it from bottom to top


16/19

Slide 31

Third step: Determining the number of clusters

0) Theoretical and empirical reasons (But, be careful about optical illusion!)

In the case of brand awareness there are some indications for three clusters.

A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Please note: Mostly there is large effect from cluster 1 to cluster 2 which is not the "elbow".

P

roximity("Coefficients")

Number of clusters (=sample size - "Stage")

elbow => choose 3 clusters

Slide 32

B) Dendrogram

Choose the number of clusters within the largest increase in heterogeneity

Standardized distance

Largest increase in heterogeneity


17/19

Slide 33

Fourth step: Display and save cluster membership

Output table of cluster membership

Example of brand awareness: assumed 3 clusters

If you're not sure

about the number of

clusters, choose a fullrange

Slide 34

Saving the cluster membership

Used for drawing a scatter plot, for example.

Range of solutions: 2 to 5

Example of brand awareness: assumed 3 clusters


18/19

Slide 35

Scatter plot:

Slide 36

One point was assigned incorrectly


19/19

Slide 37

Fifth step: Interpretation of clusters

In the case of the brand awareness example, the interpretation is obvious and straightforward.

Taking into account means

The means of the clusters with respect to the original variablesindicate how the clusters can be interpreted.

Example of Lecture 01: Marketing survey on consumer buying behavior

Questionnaire to ask people about their attitudes.

Among other questions:

"What is your general attitude to life?" (variable x1)

"What is your attitude to innovation?" (variable x2)

"What is your willingness to take risks?" (variable x3)

Scales of variables vary

from extremely negative (1)

to extremely positive (7)

general

attitude to life

attitude to

innovation

willingness

to take risks

Person A 1 2 2

Person B 1 3 3

Person C 2 4 2

Person D 5 4 3

Person E 5 4 4

Person F 7 6 7

Objects

Attributes

Data of 6 people

Slide 38

general

attitude to life

attitude to

innovation

willingness

to take risks

(A, B, C) 1.3 3 2.3

(D, E) 5 4 3.5

(F) 7 6 7Cluster

Attributes

Mean of clusters, regarding the cluster variables

Cluster 1 (A, B, C): pessimistic people who live in fear

Cluster 2 (D, E): slightly optimistic normalos

Cluster 3 (F): life-affirming adventurer

Applied Data Analysis (With SPSS)

Documents

Transcript of Applied Data Analysis (With SPSS)