Statistics for Marketing & Consumer Research Copyright © 2008 - Mario Mazzocchi 1 Cluster Analysis...


1

Cluster Analysis

(from Chapter 12)


2

Cluster analysis

• It is a class of techniques used to classify cases into groups that are
  • relatively homogeneous within themselves and
  • heterogeneous between each other

• These groups are called clusters


3

Market segmentation

• Cluster analysis is especially useful for market segmentation

• Segmenting a market means dividing its potential consumers into separate sub-sets where
  • Consumers in the same group are similar with respect to a given set of characteristics
  • Consumers belonging to different groups are dissimilar with respect to the same set of characteristics

• This allows one to calibrate the marketing mix differently according to the target consumer group


4

Other uses of cluster analysis

• Clustering of similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.

• Data reduction
  • Factor analysis and principal component analysis make it possible to reduce the number of variables.
  • Cluster analysis makes it possible to reduce the number of observations, by grouping them into homogeneous clusters.

• Maps that simultaneously profile consumers and products, market opportunities and preferences, as in preference or perceptual mapping.


5

Steps to conduct a cluster analysis

• Select a distance measure
• Select a clustering algorithm
• Define the distance between two clusters
• Determine the number of clusters
• Validate the analysis


6

Distance measures for individual observations

• To measure similarity between two observations a distance measure is needed.

• Multiple variables require an aggregate distance measure

• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.


7

Examples of distances

• Euclidean distance: $D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$

• City-block (Manhattan) distance: $D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$

where $D_{ij}$ is the distance between cases i and j, $x_{kj}$ is the value of variable $x_k$ for case j, and n is the number of variables.

[Figure: Euclidean and city-block distance between two points A and B]

Problems:
• Different measures = different weights
• Correlation between variables (double counting)

Solution: standardization, rescaling, principal component analysis
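The two distances and the standardization remedy can be sketched in plain Python. This is only an illustration: the function names and the small data set (age in years, yearly spend in euros) are hypothetical and do not come from the slides, and standardization here uses the sample standard deviation.

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardize(rows):
    """Rescale every variable (column) to mean 0 and sample std. deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical cases: (age in years, yearly spend in euros)
cases = [[25, 1200.0], [40, 1500.0], [31, 4800.0]]

print(euclidean([0, 0], [3, 4]))   # 5.0
print(city_block([0, 0], [3, 4]))  # 7

# On raw data the spend variable dominates the distance simply because of
# its unit of measurement; after standardization both variables count.
z = standardize(cases)
print(euclidean(cases[0], cases[1]), euclidean(z[0], z[1]))
```

Standardization addresses the "different weights" problem only; correlated variables would still be double counted, which is where principal component analysis comes in.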


8

Clustering procedures

• Hierarchical procedures
  • Agglomerative (start from n clusters to get to 1 cluster)
  • Divisive (start from 1 cluster to get to n clusters)

• Non-hierarchical procedures
  • K-means clustering (knowledge of the number of clusters (c) is required).


9

Distance between clusters

• Algorithms vary according to the way the distance between two clusters is defined.

• The most common algorithms for hierarchical methods include
  • single linkage method
  • complete linkage method
  • average linkage method
  • Ward algorithm
  • centroid method


10

Linkage methods

• Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.

• Complete linkage method (furthest neighbour): the distance between two clusters is the maximum distance between observations belonging to the two clusters.

• Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters.
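The three linkage definitions above differ only in how they summarise the pairwise distances between the two clusters. A minimal sketch, assuming Euclidean distance between observations and two small made-up clusters (the names and data are illustrative, not from the slides):

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All distances between an observation in one cluster and one in the other."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Nearest neighbour: minimum of the pairwise distances."""
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Furthest neighbour: maximum of the pairwise distances."""
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Average of the pairwise distances."""
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Two hypothetical clusters of two observations each
a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(a, b))    # 3.0, from (0, 0) to (3, 0)
print(complete_linkage(a, b))  # sqrt(26), from (0, 1) to (5, 0)
print(average_linkage(a, b))   # lies between the two values above
```

By construction the average linkage distance always lies between the single and complete linkage distances.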


11

Hierarchical vs. non-hierarchical methods

Hierarchical methods
• No decision about the number of clusters
• Problems when data contain a high level of error
• Can be very slow, preferable with small data sets
• Initial decisions are more influential (one-step only)
• At each step they require computation of the full proximity matrix

Non-hierarchical methods
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration


12

The number of clusters c

• Two alternatives
  • Determined by the analysis
  • Fixed by the researchers

• In segmentation studies, c represents the number of potential separate segments.

• Preferable approach: “let the data speak”
  • Hierarchical approach, with the optimal partition identified through statistical tests (stopping rule for the algorithm)
  • However, the detection of the optimal number of clusters is subject to a high degree of uncertainty

• If the research objectives allow the number of clusters to be chosen rather than estimated, non-hierarchical methods are the way to go.


13

Example: fixed number of clusters

• A retailer wants to identify several shopping profiles in order to activate new and targeted retail outlets

• The budget only allows him to open three types of outlets

• A partition into three clusters follows naturally, although it is not necessarily the optimal one.

• Fixed number of clusters, hence a non-hierarchical (k-means) approach
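A three-cluster k-means partition like the one in this example can be sketched as follows. This is a bare-bones illustration, not the procedure of any particular package: the shopper data (visits per month, average basket value) and the choice of initial seeds are invented for the sketch, and the variables are left unstandardized for readability.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids, repeat."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no case changed cluster: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical shoppers: (store visits per month, average basket in euros)
shoppers = [[1, 10], [2, 12], [1, 11],
            [8, 15], [9, 14], [8, 13],
            [4, 60], [5, 62], [4, 58]]

# The number of clusters is fixed at three by the number of seeds supplied.
seeds = [shoppers[0], shoppers[3], shoppers[6]]
labels, centroids = k_means(shoppers, seeds)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(centroids)
```

The result depends on the initial seeds, which is why the slides list seed selection among the requirements of non-hierarchical methods.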


14

Determining the optimal number of clusters from hierarchical methods (in SPSS)

• Agglomeration schedule
• Icicle plot
• Dendrogram
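To make the idea of reading the number of clusters off an agglomeration schedule concrete, the sketch below builds a schedule with single linkage and then applies one simple heuristic: stop just before the merge that produces the largest jump in merging distance. Both the heuristic and the toy data are assumptions made for illustration; this is not SPSS output and not necessarily the stopping rule intended in the slides.

```python
import math
from itertools import combinations, product

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomeration_schedule(points):
    """Agglomerative clustering with single linkage.

    Returns one (clusters remaining after the merge, merging distance)
    pair per step, starting from n single-case clusters.
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-linkage distance.
        dist, i, j = min(
            (min(euclidean(points[p], points[q])
                 for p, q in product(clusters[i], clusters[j])), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
        schedule.append((len(clusters), dist))
    return schedule

def suggest_number_of_clusters(schedule):
    """Heuristic: keep the partition that precedes the largest distance jump."""
    best_jump, best_c = -1.0, 1
    for (c_prev, d_prev), (_, d_next) in zip(schedule, schedule[1:]):
        if d_next - d_prev > best_jump:
            best_jump, best_c = d_next - d_prev, c_prev
    return best_c

# Hypothetical data: two tight pairs and one isolated case
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
schedule = agglomeration_schedule(points)
for remaining, dist in schedule:
    print(remaining, dist)
print(suggest_number_of_clusters(schedule))  # 3
```

A dendrogram displays the same information graphically: the merging distances are the heights at which branches join, and a long vertical gap corresponds to a large jump in the schedule.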
