Disseration_ppt
“A Comparative Study between Clustering Algorithms”
Pattern Discovery for Categorical Cross-Cultural Data in the Market Research Domain
September, 2015
Supervisor: Professor Plamen Angelov
Reviewer: Professor Nigel Davies
Industry Partner: Bonamy Finch
Author: Ahmed Hamada
INDUSTRY PARTNER
+ 50 Customers
THE CHALLENGE
Cross-cultural attitudinal segmentation studies using rating scales are a
seriously challenging task within the market research domain because, unlike
clustering on demographics, these studies contain many shared views with fuzzy
boundaries. Obtaining meaningful clusters that realistically reflect the
respondent segments while also having good geometrical cluster properties is
likewise a demanding problem in the market research domain.
GAP ANALYSIS
• 76% used K-means as the partitioning method for their segmentation
• 93% of the segmentation studies used Euclidean distance
• More than 60% of the examined market research studies did not include an evaluation criterion for the developed clusters
Findings of a multivariate survey study of 243 market segmentation publications in the tourism domain (Dolnicar, 2003)
K-MEANS PROBLEMS
Data Dimensionality
• In high-dimensional data, distances between points become relatively uniform, so the concept of a point's nearest neighbour becomes meaningless
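The effect can be seen in a small simulation. The sketch below is illustrative only and is not taken from the study: it assumes uniformly random points and Euclidean distance, and `relative_contrast` is a hypothetical helper name. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # Plain Euclidean distance between the query and this point.
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast shrinks as the number of dimensions grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows the printed contrast should fall towards zero, meaning the nearest and farthest neighbours are almost equally far away.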
Dissimilarity Measure
• K-means is not just about distances; it also requires computing a mean, and there is no reasonable mean for categorical data
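K-modes, one of the methods compared in these slides, avoids the problem by replacing the mean with the mode and Euclidean distance with a simple matching measure. The sketch below shows only those two building blocks, not a full K-modes implementation; the function names and the toy records are assumptions made for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def column_modes(records):
    """Return the most frequent category in each column (the cluster 'mode')."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*records)]


if __name__ == "__main__":
    # Toy cluster of three respondents answering three rating questions.
    cluster = [
        ("agree", "neutral", "disagree"),
        ("agree", "agree", "disagree"),
        ("neutral", "agree", "disagree"),
    ]
    centre = column_modes(cluster)
    print(centre)
    print([matching_dissimilarity(record, centre) for record in cluster])
```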
Non-Convex Shaped Clusters
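• In Euclidean space, an object is convex if, for every pair of points within the object, every point on the straight line segment that joins them is also within the object. K-means separates clusters with linear boundaries, so it cannot recover non-convex cluster shapes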
Local Minima
• K-means minimises its objective function by differentiating it with respect to the centroids, which only finds a local minimum. Trying more paths and more initialisation points increases the chance of reaching the global minimum
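The restart idea in the last point can be sketched as follows. This is a minimal pure-Python version of the standard K-means (Lloyd) iteration, run from several random initialisations and keeping the run with the lowest within-cluster sum of squares. The function names, the toy data and the number of restarts are assumptions for illustration, not the configuration used in the study.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, n_iter=50):
    """One K-means run from a random initialisation; returns (centroids, labels, wss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged to a (possibly local) minimum
        centroids = new_centroids
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss


def best_of_restarts(points, k, n_restarts=10):
    """Run K-means from several seeds and keep the run with the lowest WSS."""
    runs = [kmeans(points, k, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[2])


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (5.0, 5.0), (5.1, 5.2), (5.2, 4.9),
            (9.0, 0.0), (9.1, 0.3), (8.8, 0.1)]
    print("single run WSS:", kmeans(data, 3, seed=0)[2])
    print("best of 10 WSS:", best_of_restarts(data, 3)[2])
```

Restarts do not guarantee the global minimum; they only make a poor local minimum less likely to be the one reported.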
EXPERIMENTS
PARTITIONING METHODS: K-means, Kernel K-means, K-modes
HIERARCHICAL METHODS: ROCK
• K-means on raw data
• K-means on standardised rows
• MCA on raw data + K-means
• Kernel K-means on raw data
• Kernel K-means on standardised rows
• K-modes on raw data
• ROCK on raw data
Diagram labels: Euclidean Distance; Matching Measure; Arbitrary shaped clusters; Non-convex shaped clusters
21 experiments (7 x 3)
DETERMINING THE NUMBER OF CLUSTERS
Gap Statistic for 10 clusters (plot)
Within Sum of Squares for 10 clusters (plot)
→ 5, 6 & 7 Clusters Models
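For reference, the two criteria named on this slide are commonly defined as below. This is the usual formulation of the gap statistic (Tibshirani, Walther and Hastie, 2001). The slides do not state which reference distribution or how many reference samples were used, so B and the starred terms are placeholders rather than the study's settings.

```latex
% Within-cluster sum of squares for a k-cluster solution with clusters C_r and means \mu_r
W_k = \sum_{r=1}^{k} \sum_{x_i \in C_r} \lVert x_i - \mu_r \rVert^2

% Gap statistic: expected log W_k under a reference (no-cluster) distribution,
% estimated from B reference data sets, minus the observed log W_k
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^{*} - \log W_k

% Usual selection rule: the smallest k such that
\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1},
\qquad s_k = \mathrm{sd}_k \sqrt{1 + 1/B}
```

Here sd_k is the standard deviation of the B values of log W*_kb. The within sum of squares plot is read differently: one looks for the "elbow" where adding another cluster stops reducing W_k noticeably.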
7-CLUSTER MODEL GEOMETRICAL COMPARISON
[Bar chart] Within cluster sum of squares (axis 0 to 300,000); data labels: 117,604; 87,232; 1,644; 283,904; 224,892
[Bar chart] Cluster closeness index (axis 0% to 80%); data labels: 21%; 18%; 59%; 0%; 0%
INTERNAL MEASURES COMPARISON
Dunn index (values read from the chart data labels)
Method                                 5 clusters   6 clusters   7 clusters
K-means                                0.102        0.125        0.109
K-means on standardised rows           0.05         0.05         0.05
MCA + K-means                          0.09         0.1          0.1
Kernel K-means                         0.08         0            0.08
Kernel K-means on standardised rows    0.07         0.05         0.05

Silhouette measure (values read from the chart data labels)
Method                                 5 clusters   6 clusters   7 clusters
K-means                                0.05         0.05         0.04
K-means on standardised rows           0.03         0.03         0.03
MCA + K-means                          -0.02        -0.02        -0.03
Kernel K-means                         -0.01        0.01         -0.01
Kernel K-means on standardised rows    -0.01        0.01         0.02
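Both internal measures can be computed directly from the points and their cluster labels. The sketch below is an assumption-laden illustration, not the study's code: it uses Euclidean distance, and it takes the common Dunn variant of smallest between-cluster point distance divided by largest within-cluster diameter. It assumes at least two clusters and at least one cluster with two distinct points. Higher values are better for both measures.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_between = math.inf
    max_diameter = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean(points[i], points[j])
            if labels[i] == labels[j]:
                max_diameter = max(max_diameter, d)
            else:
                min_between = min(min_between, d)
    return min_between / max_diameter


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(euclidean(p, q) for q in others) / len(others)
        if labels[i] not in mean_dist:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = mean_dist[labels[i]]                                    # cohesion
        b = min(v for c, v in mean_dist.items() if c != labels[i])  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    labs = [0, 0, 1, 1]
    print(dunn_index(pts, labs), mean_silhouette(pts, labs))
```

Silhouette values near zero or below, as several of the reported values are, indicate heavily overlapping clusters.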
INDUSTRY EVALUATION
Algorithm               K-means on standardised rows    Kernel K-means on standardised rows
No. Clusters            5       6       7               5       6       7
Response Bias Freedom
1                       79%     86%     79%             70%     59%     58%
2                       81%     77%     67%             93%     61%     71%
3                       90%     61%     79%             77%     64%     75%
4                       72%     81%     71%             74%     79%     83%
5                       80%     70%     75%             79%     67%     67%
6                       -       71%     71%             -       61%     79%
7                       -       -       71%             -       -       79%
Reportability
1                       71%     62%     67%             62%     76%     71%
2                       38%     19%     19%             90%     24%     19%
3                       19%     29%     81%             48%     81%     71%
4                       43%     52%     29%             71%     33%     62%
5                       62%     52%     43%             10%     33%     43%
6                       -       71%     57%             -       33%     43%
7                       -       -       62%             -       -       52%
5-CLUSTERS MODEL SCATTER PLOT MATRIX FOR THE FIRST 4 VARIABLES
K-means on standardised rows Kernel K-means on standardised rows
CONCLUSION
1. The results of this research revealed that standardising each respondent's ratings (row standardisation) produced better segments from a pragmatic point of view.
2. From the overall evaluation, the 5-cluster models from K-means and kernel K-means on standardised rows revealed more meaningful segments than the other methods.
3. The results illustrated that the ROCK algorithm and the application of MCA followed by K-means were not suitable for multiscale categorical data and resulted in meaningless clusters.
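Conclusion 1 concerns standardising respondents (rows) rather than variables (columns). The slides do not give the exact formula. A common choice, assumed here, is to subtract each respondent's own mean rating and divide by that respondent's own standard deviation, which removes differences in how respondents use the scale. `standardise_rows` is a hypothetical name.

```python
import statistics


def standardise_rows(ratings):
    """Centre and scale each respondent's ratings by that respondent's own mean and SD."""
    result = []
    for row in ratings:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)  # population SD of this respondent's answers
        if sd == 0:
            # A respondent who gave the same rating everywhere shows no relative preference.
            result.append([0.0 for _ in row])
        else:
            result.append([(x - mean) / sd for x in row])
    return result


if __name__ == "__main__":
    # Two respondents with the same pattern but different scale use, plus a flat responder.
    survey = [[1, 2, 3], [3, 4, 5], [4, 4, 4]]
    for row in standardise_rows(survey):
        print([round(v, 3) for v in row])
```

After this transformation the first two respondents become identical, so a Euclidean method such as K-means groups them by the pattern of their answers rather than by their overall rating level.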
FURTHER RESEARCH
• Evaluate the stability of the classification accuracy using different algorithms
• Study other clustering methods available in the literature
• Evaluate the same algorithms on various cross-cultural multiscale data sets and test the hypothesis that multiscale data (i.e. Likert scale) develop better clusters from the geometrical point of view.
• Evaluate the clustering algorithms on different types of response scale, rather than only the multi-point biased response scales
Thank You