
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Customer segmentation of retail chain customers using cluster analysis

SEBASTIAN BERGSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Customer segmentation of retail chain customers using cluster analysis

SEBASTIAN BERGSTRÖM

Degree Projects in Mathematical Statistics (30 ECTS credits)

Master's Programme in Applied and Computational Mathematics (120 credits)

KTH Royal Institute of Technology year 2019

Supervisor at Advectas AB: Pehr Wessmark

Supervisor at KTH: Tatjana Pavlenko

Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2019:092

MAT-E 2019:48

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

In this thesis, cluster analysis was applied to data comprising customer spending habits at a retail chain in order to perform customer segmentation. The method used was a two-step cluster procedure in which the first step consisted of feature engineering, a square root transformation of the data in order to handle big spenders in the data set and finally principal component analysis in order to reduce the dimensionality of the data set. This was done to reduce the effects of high dimensionality. The second step consisted of applying clustering algorithms to the transformed data. The methods used were K-means clustering, Gaussian mixture models in the MCLUST family, t-distributed mixture models in the tEIGEN family and non-negative matrix factorization (NMF). For the NMF clustering a slightly different data pre-processing step was taken; specifically, no PCA was performed. Clustering partitions were compared on the basis of the Silhouette index, the Davies-Bouldin index and subject matter knowledge, which revealed that K-means clustering with K = 3 produces the most reasonable clusters. This algorithm was able to separate the customers into different segments depending on how many purchases they made overall, and within these clusters some minor differences in spending habits are also evident. In other words, there is some support for the claim that the customer segments have some variation in their spending habits.

Keywords: Cluster analysis, customer segmentation, tEIGEN, MCLUST, K-means, NMF, Silhouette, Davies-Bouldin, big spenders


Swedish title: Kundsegmentering av detaljhandelskunder med klusteranalys

Sammanfattning

I denna uppsats har klusteranalys tillämpats på data bestående av kunders konsumtionsvanor hos en detaljhandelskedja för att utföra kundsegmentering. Metoden som använts bestod av en två-stegs klusterprocedur där det första steget bestod av att skapa variabler, tillämpa en kvadratrotstransformation av datan för att hantera kunder som spenderar långt mer än genomsnittet och slutligen principalkomponentanalys för att reducera datans dimension. Detta gjordes för att mildra effekterna av att använda en högdimensionell datamängd. Det andra steget bestod av att tillämpa klusteralgoritmer på den transformerade datan. Metoderna som användes var K-means-klustring, gaussiska blandningsmodeller i MCLUST-familjen, t-fördelade blandningsmodeller från tEIGEN-familjen och icke-negativ matrisfaktorisering (NMF). För klustring med NMF användes en annan förbehandling av datan, mer specifikt genomfördes ingen PCA. Klusterpartitioner jämfördes baserat på silhuettvärden, Davies-Bouldin-indexet och ämneskunskap, vilket avslöjade att K-means-klustring med K = 3 producerar de rimligaste resultaten. Denna algoritm lyckades separera kunderna i olika segment beroende på hur många köp de gjort överlag, och i dessa segment finns vissa skillnader i konsumtionsvanor. Med andra ord finns visst stöd för påståendet att kundsegmenten har en del variation i sina konsumtionsvanor.

Nyckelord: Klusteranalys, kundsegmentering, tEIGEN, MCLUST, K-means, NMF, Silhouette, Davies-Bouldin, storkonsumenter


Acknowledgements

I would first and foremost like to thank my supervisors Pehr Wessmark at Advectas and Tatjana Pavlenko at KTH for all the help during the time working on this thesis. The discussions and feedback were immensely helpful, and I could not have completed the thesis without them. I also want to thank Christopher Madsen for his peer review of my project, which helped shape the report into its current form.


Contents

1 Introduction and motivation
  1.1 Customer segmentation and cluster analysis
  1.2 Problem formulation and data description
  1.3 Previous research

2 Theory
  2.1 Clustering algorithms used in this thesis
    2.1.1 K-means clustering
    2.1.2 Mixture models for clustering
    2.1.3 Gaussian mixture models (MCLUST)
    2.1.4 t-distributed mixture models (tEIGEN)
    2.1.5 Non-negative matrix factorization (NMF)
  2.2 Cluster validation indices
    2.2.1 Choice of cluster validity index (CVI)
    2.2.2 The silhouette index
    2.2.3 The Davies-Bouldin index
  2.3 Principal Component Analysis (PCA)
  2.4 High-dimensional data and dimensionality reduction
  2.5 t-distributed Stochastic Neighbor Embedding (t-SNE)

3 Case study and methods
  3.1 Outline of method
  3.2 Exploratory data analysis and feature engineering
    3.2.1 Feature engineering
    3.2.2 Exploratory data analysis
  3.3 Data pre-processing and dimensionality reduction
  3.4 Visualization
  3.5 Optimizing individual algorithms
  3.6 Software and hardware used
    3.6.1 Hardware used
    3.6.2 Software used

4 Results
  4.1 K-means
  4.2 MCLUST
  4.3 tEIGEN
  4.4 NMF
  4.5 Choice of algorithm

5 Analysis
  5.1 Analysis of distributions in clusters

6 Discussion
  6.1 Summary
  6.2 Reflection
  6.3 Future work

A Silhouette plots
  A.1 K-means clustering
  A.2 MCLUST models
  A.3 tEIGEN models
  A.4 NMF


List of Figures

2.1 Illustration of MCLUST models in 2D
3.1 Illustration of the sparsity in the data set
3.2 Correlations between the product groups
3.3 Histograms showing the distribution of products in a given level 1 product group
3.4 Histograms showing the distribution of products in a given level 1 product group
3.5 Box plots that illustrate the presence of big spenders
3.6 Illustration of silhouette values in the presence of outliers for K-means clustering with K = 2
3.7 Result of running t-SNE with standard settings
3.8 Result of running t-SNE with standard settings, now colored according to x3
3.9 Result of running t-SNE with standard settings, now colored according to the total amount of purchases made. 3.9a) Result of one run. 3.9b) Result of another run.
3.10 Result of running t-SNE with standard settings, colored according to the total amount of purchases made in x3
3.11 Results of running t-SNE with Linderman and Steinerberger's settings, colored according to the total amount of purchases. 3.11a) Result of one run. 3.11b) Result of another run.
4.1 Silhouette plots for K = 3 and K = 7 in K-means. 4.1a) K = 3. 4.1b) K = 7
4.2 Silhouette plots for different numbers of components K of the best MCLUST model EII. 4.2a) K = 2. 4.2b) K = 3. 4.2c) K = 4.
4.3 Silhouette plots for different numbers of components K of the best tEIGEN model. 4.3a) K = 2. 4.3b) K = 3. 4.3c) K = 4.
4.4 Silhouette plot obtained when using 3 clusters in NMF
4.5 Density plots of the principal components in cluster 0
4.6 Density plots of the principal components in cluster 1
4.7 Density plots of the principal components in cluster 2
5.1 Density plots of x3 and x4 in the different clusters. 5.1a) Density function of x3 within the three different clusters. 5.1b) Density function of x4 within the three different clusters
5.2 Boxplots of x3 and x4 in the different clusters. 5.2a) Boxplot of x4 within the three different clusters. 5.2b) Boxplot of x3 within the three different clusters
5.3 Radar charts showing the mean percentages spent within each product category for the three clusters. 5.3a) Cluster 0. 5.3b) Cluster 1. 5.3c) Cluster 2.
5.4 Radar chart showing the mean percentage spent within each product category (excluding x3) for the three clusters
5.5 Variances of the percentages in cluster 0
5.6 Variances of the percentages in cluster 1
5.7 Variances of the percentages in cluster 2
5.8 Density plots of percentages spent in x3, x4, x5 and x8 in the three clusters. 5.8a) x3 percentages. 5.8b) x4 percentages. 5.8c) x5 percentages. 5.8d) x8 percentages.
A.1 Silhouette plots for K-means clustering with different K. A.1a) K = 2. A.1b) K = 4. A.1c) K = 5. A.1d) K = 6.
A.2 Silhouette plots for different numbers of components K for the model EEE. A.2a) K = 3. A.2b) K = 4.
A.3 Silhouette plots for different numbers of components K for the model EEI. A.3a) K = 3. A.3b) K = 4.
A.4 Silhouette plots for different numbers of components K of the model EEV. A.4a) K = 2. A.4b) K = 3. A.4c) K = 4.
A.5 Silhouette plots for different numbers of components K of the model VEI. A.5a) K = 2. A.5b) K = 3. A.5c) K = 4.
A.6 Silhouette plots for different numbers of components K of the model VII. A.6a) K = 2. A.6b) K = 3. A.6c) K = 4.
A.7 Silhouette plots for different numbers of components K for the model CICC. A.7a) K = 2. A.7b) K = 3. A.7c) K = 4.
A.8 Silhouette plots for different numbers of components K for the model CIIC. A.8a) K = 2. A.8b) K = 3. A.8c) K = 4.
A.9 Silhouette plots for different numbers of components K for the model CIIU. A.9a) K = 2. A.9b) K = 3. A.9c) K = 4.
A.10 Silhouette plots for different numbers of components K for the model UIIC. A.10a) K = 2. A.10b) K = 3. A.10c) K = 4.
A.11 Silhouette plots for different numbers of components K for the model UIIU. A.11a) K = 2. A.11b) K = 3. A.11c) K = 4.
A.12 Silhouette plots for different numbers of components K for the model UUCU. A.12a) K = 2. A.12b) K = 3.
A.13 Silhouette plot obtained when using 2 clusters in NMF
A.14 Silhouette plot obtained when using 4 clusters in NMF
A.15 Silhouette plot obtained when using 5 clusters in NMF
A.16 Silhouette plot obtained when using 6 clusters in NMF
A.17 Silhouette plot obtained when using 7 clusters in NMF


List of Tables

2.1 Models available in Mclust
2.2 Models in the tEIGEN family
3.1 Class assignments based on x3
3.2 Class assignments based on the total number of purchases
4.1 Average silhouette scores for different K in K-means clustering
4.2 Davies-Bouldin scores for different K in K-means clustering
4.3 MCLUST models that converged
4.4 Average silhouette values for the different MCLUST models with varying number of components
4.5 Davies-Bouldin scores for the different MCLUST models with varying number of components
4.6 Models in the tEIGEN family that converged
4.7 Average silhouette values for the different tEIGEN models with varying number of components
4.8 Davies-Bouldin scores for the different tEIGEN models with varying number of components
4.9 Average silhouette values when clustering using NMF
4.10 Davies-Bouldin values when clustering using NMF


1 Introduction and motivation

1.1 Customer segmentation and cluster analysis

Customer segmentation is the process of a company dividing its customer base into different groups based on some shared characteristics. This is done so the different groups can be analyzed and marketing efforts can be tailored toward these groups based on their preferences, in order to improve sales and customer relations.

Cluster analysis is a branch of mathematics whose purpose is to group similar data points into subsets called clusters. The partition of data points should be done such that data points in the same cluster are more similar to one another than ones in different clusters [25, p.501]. A central aspect of cluster analysis is the concept of proximity between data points, which is usually described using an N × N (with N being the number of data points) dissimilarity matrix D whose elements d_ij contain the distance between the i:th and j:th observation [25, p.503]. Assuming we have access to a data matrix

X = (x_{ij})_{1 \le i \le n,\ 1 \le j \le p}    (1.1)

where n is the number of data points and p is the dimension of each point, we define a dissimilarity d_j(x_ij, x_i'j) between the values of the j:th feature in our data [25, p.503]. With the help of this we nearly always define a dissimilarity between two data points x_i, x_i' (i.e. rows in X) as

D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j d_j(x_{ij}, x_{i'j}), \qquad \sum_{j=1}^{p} w_j = 1, \quad [25, p.505]    (1.2)

A common choice is d_j(x_ij, x_i'j) = (x_ij − x_i'j)^2, which is suitable for numeric data [25, p.503]. These dissimilarities will be used in the clustering algorithms to produce clusters of similar data points. Details for the different clustering algorithms will be presented in other sections of this report.
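To make Equation 1.2 concrete, the short sketch below computes the weighted dissimilarity between two data points using the squared-difference choice of d_j; the toy feature values and the uniform weights w_j = 1/p are illustrative assumptions rather than anything taken from the thesis data.

```python
import numpy as np

def dissimilarity(x_i, x_ip, weights=None):
    """Weighted dissimilarity D(x_i, x_i') of Equation 1.2 with d_j(a, b) = (a - b)^2."""
    x_i, x_ip = np.asarray(x_i, dtype=float), np.asarray(x_ip, dtype=float)
    p = x_i.shape[0]
    if weights is None:
        weights = np.full(p, 1.0 / p)  # uniform weights that sum to one
    return float(np.sum(weights * (x_i - x_ip) ** 2))

# Toy example: two customers described by p = 3 numeric features.
print(dissimilarity([2.0, 0.0, 5.0], [1.0, 3.0, 5.0]))  # (1 + 9 + 0) / 3 ≈ 3.33
```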

Cluster analysis can be applied to customer segmentation since customers can be clustered based on the patterns in the data. In other words, customer segmentation can be done on the basis of collected data rather than preconceptions about what customer segments exist and how they differ from one another.

1.2 Problem formulation and data description

The data used in this thesis has been provided by a retail client of Advectas AB. The original data consists of transactions that can be linked to individual customers and must be aggregated for each customer before clustering can be done. This part of the project entails feature engineering, and the total number of purchases in different product groups will be used to profile the customers. Due to the proprietary nature of the data, the client name will not be revealed and variable names are anonymized. A subset of the customers (the most interesting ones from the client's perspective) was used in this analysis. In total, 74 836 customers were segmented based on purchases during a set time period.

1.3 Previous research

Cluster analysis has been used extensively for customer segmentation. Examples include Zakrzewska and Murlewski, who used K-means clustering and DBSCAN for customer segmentation in banking [53], the usage of mixture models for shopping market customers by Gilbert et al. [22], and other examples [46]. While K-means clustering and hierarchical clustering have been used extensively in customer segmentation, other methods are less common; one example is the tEIGEN family of mixture models for clustering. Due to the scarcity of research articles which utilize this algorithm for customer segmentation, it may be of interest to examine it and compare it to more commonly used algorithms.


2 Theory

2.1 Clustering algorithms used in this thesis

2.1.1 K-means clustering

K-means clustering is a clustering algorithm in which the data points are segmented into K ∈ Z non-overlapping clusters, where K is specified by the user [25, p.507] and all the columns in the data matrix X (i.e. the variables) are quantitative. Strategies for choosing a suitable K will be discussed in Section 2.2. In this case the squared Euclidean distance is the dissimilarity measure, i.e.

d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \|x_i - x_{i'}\|^2, \quad [25, p.509].    (2.1)

The goal of K-means clustering is to minimize the function W(C), defined in Equation 2.2. Here, C is a many-to-one mapping that assigns observations to clusters, i.e. the clustering result of some algorithm. \bar{x}_j = (\bar{x}_{1j}, \ldots, \bar{x}_{pj}) is the mean vector of cluster number j and N_j = \sum_{i=1}^{N} I(C(i) = j), where I is the indicator function. In other words, W(C) is minimized by assigning the observations to clusters such that the average dissimilarity from the cluster mean within each cluster is minimized [25, p.509].

W(C) = \frac{1}{2} \sum_{j=1}^{K} \sum_{C(i)=j} \sum_{C(i')=j} d(x_i, x_{i'}) = \sum_{j=1}^{K} N_j \sum_{C(i)=j} \|x_i - \bar{x}_j\|^2    (2.2)

W(C) can be called the 'within-cluster' point scatter. This is a combinatorial optimization problem which is computationally intractable for all but very small data sets. It can be shown that the number of different possible cluster assignments given N data points is \frac{1}{K!} \sum_{j=1}^{K} (-1)^{K-j} \binom{K}{j} j^N, which grows rapidly in its arguments [25, p.508]. For this reason, only a small subset of different cluster assignments is examined in K-means clustering. These strategies are built on iterative greedy descent. In this thesis a combination of Lloyd's algorithm (which is commonly referred to as K-means clustering) and K-means++ seeding has been used. K-means++ was proposed by Arthur and Vassilvitskii as a way to select the initial cluster centres in K-means clustering [8] and is presented in Algorithm 1. First we explain the notation used: χ ⊂ R^d is the data set which consists of n data points, K is the number of clusters, C = (c_1, \ldots, c_K) is the set of cluster centers and D(x) is the shortest distance from a data point x to the closest center already chosen.

Algorithm 1: Lloyd's algorithm and K-means++

Input: Number of clusters K, data set χ
Output: Partition of the data set into K clusters

K-means++ (initialization):
1. Randomly choose a center c_1 from χ.
2. Take a new center c_i, choosing data point x ∈ χ with probability \frac{D(x)^2}{\sum_{x \in \chi} D(x)^2}.
3. Repeat the previous step until K centers have been chosen.

Lloyd's algorithm:
4. For each i ∈ {1, \ldots, K}, set cluster C_i to be the set of points in χ that are closer to c_i than to c_j for all j ≠ i, i.e. C_i = \{x ∈ χ : \|x − c_i\|^2 \le \|x − c_j\|^2 \ \forall j ≠ i\}.
5. For each i ∈ {1, \ldots, K}, set c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x, where |C_i| is the number of elements in cluster C_i.
6. Repeat steps 4 and 5 until C no longer changes.

One problem to consider is that although the algorithm converges to a minimum, it may be a suboptimal local minimum. The algorithm may also be sensitive to the initial cluster assignments, so it is recommended to run the algorithm multiple times with different initial assignments and choose the solution which gives the best results [25, p.510]. This is what we refer to as the "hard" K-means algorithm, which gives a definitive assignment for each data point. We could instead use a "soft" clustering algorithm; this concept is introduced in Section 2.1.2.
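One concrete way of combining K-means++ seeding (steps 1-3 of Algorithm 1) with Lloyd's iterations and multiple restarts is shown in the hedged sketch below, which uses scikit-learn's KMeans; the synthetic data, the choice K = 3 and the number of restarts are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for a (transformed) customer matrix: 300 points in 5 dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 5)) for c in (0.0, 3.0, 6.0)])

# init="k-means++" performs the seeding of steps 1-3; fit() then runs Lloyd's iterations.
# n_init=10 restarts the whole procedure from different seeds and keeps the run with the
# smallest within-cluster sum of squared distances (inertia_).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print("within-cluster sum of squares:", km.inertia_)
print("cluster sizes:", np.bincount(labels))
```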


2.1.2 Mixture models for clustering

Clustering can also be done using a statistical model for the population we sample from. The model assumes the population consists of sub-populations, i.e. clusters, in which our data set's features follow different multivariate probability distributions. The population as a whole will then follow a so-called mixed density distribution. When using a mixture model, clustering becomes a question of estimating the mixture's parameters, calculating posterior cluster probabilities using these parameters and assigning cluster membership based on these [20, p.143]. We may express this mathematically as follows. Finite mixture models have probability densities of the form

f(x; p, \theta) = \sum_{j=1}^{K} p_j g_j(x; \theta_j).    (2.3)

x is a p-dimensional random variable, p^T = (p_1, \ldots, p_{K-1}) is the vector of mixing proportions where \sum_{j=1}^{K} p_j = 1, K is the assumed number of clusters and the g_j are the component densities parameterized by their respective θ_j. Assuming we have estimated the mixing distribution's parameters, probabilities for cluster membership can be estimated using the posterior probability in Equation 2.4

P(\text{cluster } j \mid x_i) = \frac{p_j g_j(x_i; \theta_j)}{f(x_i; p, \theta)}, \quad j = 1, 2, \ldots, K    (2.4)

Parameters are estimated by maximum-likelihood estimation [20, p.144-145]; specifically, the EM-algorithm is used in the case of unobserved cluster labels. The description of the algorithm is largely based on the one available by McLachlan and Krishnan [39, p.19-20], albeit with slightly changed notation to be consistent with the notation in Everitt [20].

Imagine we can observe realizations of the random variable X with the density in Equation 2.3 but not the cluster labels Y, and let Z = (X, Y) be the complete data vector. Letting g_c(z; Θ) denote the density of Z, the complete-data log-likelihood function is given by log(L_c(Θ)) = log(g_c(z; Θ)). The algorithm proceeds as in Algorithm 2.

Algorithm 2: EM-algorithm

Input: Starting values for the parameter estimates Θ^(0)
Output: Parameter estimates Θ^(k)

1. Initialize Θ = Θ^(0).
2. For k = 1, 2, ...:
   (a) E-step: compute Q(Θ; Θ^(k−1)) = E_{Θ^(k−1)}[log(L_c(Θ)) | X], where Θ^(k−1) denotes the estimate of Θ in the (k − 1):th iteration.
   (b) M-step: choose Θ^(k) as any θ such that Q(θ; Θ^(k−1)) is maximized, i.e. Q(Θ^(k); Θ^(k−1)) ≥ Q(θ; Θ^(k−1)) for all θ.

The E- and M-steps are repeated until the log-likelihood function converges, i.e. until L(Θ^(k)) − L(Θ^(k−1)) changes by an arbitrarily small amount.
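To make Algorithm 2 concrete, here is a minimal numpy sketch of the E- and M-steps for a two-component univariate Gaussian mixture, where the maximizer of Q in the M-step is available in closed form; the simulated data, the starting values and the stopping tolerance are illustrative assumptions, and the general algorithm is of course not limited to this special case.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Toy sample drawn from two Gaussian sub-populations.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Theta^(0): mixing proportions, means and standard deviations.
p, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # E-step: posterior cluster probabilities (Equation 2.4) for every observation.
    dens = p * norm.pdf(x[:, None], loc=mu, scale=sd)   # shape (n, K)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of Q for the Gaussian mixture.
    nk = resp.sum(axis=0)
    p = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # Stop when the log-likelihood changes by an arbitrarily small amount.
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print(p.round(3), mu.round(3), sd.round(3))
```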

2.1.3 Gaussian mixture models (MCLUST)

Mathematically, a Gaussian mixture model (GMM) can be described as a mixture model in which the components follow Gaussian distributions. The model assumed is given by Equation 2.5, in which {p_k}_{k≥1} are the mixing proportions, K is the assumed number of components and Φ(x | µ, Σ) is the density of a multivariate normal random variable following a N(µ, Σ)-distribution [17].

f(x) = \sum_{k=1}^{K} p_k \Phi(x \mid \mu_k, \Sigma_k)    (2.5)

The GMMs used in this thesis come from the MCLUST family, in which the number of clusters and covariance structures can vary [45]. In the MCLUST setting the covariance matrix Σ_k can be parameterized in terms of its eigenvalue decomposition as


\Sigma_k = D_k \Lambda_k D_k^T = \lambda_k D_k A_k D_k^T    (2.6)

In this decomposition D_k is the matrix of eigenvectors and Λ_k is a diagonal matrix with Σ_k's eigenvalues on the diagonal. Λ_k can be expressed as Λ_k = λ_k A_k, where λ_k is the first eigenvalue of Σ_k and A_k = diag(α_1k, ..., α_pk) with 1 = α_1k ≥ α_2k ≥ ... ≥ α_pk > 0. We may interpret the decomposition as follows: D_k determines the orientation of cluster k, λ_k determines the volume of the cluster and A_k determines its shape; see Figure 2.1 for examples in 2D. By placing different constraints on Σ_k we can obtain different models; these are listed in Table 2.1.

In mclust, clustering is done slightly differently compared to the method described in Section 2.1.2. Clustering is done by the classification maximum likelihood procedure, which aims to find θ, γ such that L(θ, γ) is maximized, where L(θ, γ) = \prod_{i=1}^{n} \phi_{\gamma_i}(x_i; \theta_{\gamma_i}). In this case, γ^T = (γ_1, ..., γ_n) where γ_i = j if x_i comes from cluster j, i.e. γ is the vector containing the cluster memberships and is treated as an unknown parameter [20]. Also, φ_j refers to the density function of the j:th component of the data; in this case it refers to the density function of a N(µ_j, Σ_j)-distribution.

Specifics for parameter estimation by the EM-algorithm are presented by Celeux and Govaert [17]. The relevant log-likelihood function is given in Equation 2.7, in which Φ(x | µ, Σ) denotes the density of a Gaussian N(µ, Σ)-distribution.

L(\theta, z_1, \ldots, z_n \mid x_1, \ldots, x_n) = \sum_{k=1}^{K} \sum_{x_i \in P_k} \ln\!\left(p_k \Phi(x_i \mid \mu_k, \Sigma_k)\right)    (2.7)

The variant of the EM-algorithm used is called the CEM algorithm and can briefly be described as follows. An initial partition of the data is made, then the conditional probabilities t_k(x_i) (see Equation 2.9) are computed (E-step). These are used to assign each x_i to the cluster with the largest current t_k(x_i) (C-step). Then the parameters p_k, µ_k and Σ_k are updated in the M-step, which consists of maximizing F in Equation 2.8. In this equation, we need the notation of a classification matrix c = (c_{ik})_{1≤i≤n, 1≤k≤K} where c_{ik} ∈ {0, 1}; c defines a partition in our case.

F(\theta \mid x_1, \ldots, x_n, c) = \sum_{k=1}^{K} \sum_{i=1}^{n} c_{ik} \ln\!\left(p_k \Phi(x_i \mid \mu_k, \Sigma_k)\right),    (2.8)

t_k(x_i) = \frac{p_k \Phi(x_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} p_l \Phi(x_i \mid \mu_l, \Sigma_l)}    (2.9)

Using n_k = \sum_{i=1}^{n} c_{ik}, updating formulas for the mixing proportions and mean vectors are given in Equations 2.10 and 2.11, respectively.

p_k = \frac{n_k}{n}    (2.10)

\mu_k = \bar{x}_k = \frac{\sum_{i=1}^{n} c_{ik} x_i}{n_k}    (2.11)

The updating formula for the covariance matrix varies from model to model. In some cases a closed-form solution for estimation exists, while in other cases an iterative procedure must be used. Presenting details on how all of these are updated is outside the scope of this thesis, and the interested reader is referred to Celeux and Govaert [17]. Browne and McNicholas present updates for the specific models EVE and VVE [14]. Unless explicitly stated otherwise, any descriptions of the parameter estimation procedure in this thesis come from Celeux and Govaert, according to the authors of the mclust software. The tolerance for likelihood convergence in the mclust software was 10^-5.
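The MCLUST models themselves are fitted with the mclust software; as a rough analogue in Python, scikit-learn's GaussianMixture exposes four covariance constraints that correspond to a subset of Table 2.1. The sketch below is only illustrative: the synthetic data, the choice of three components and the comparison by average log-likelihood and BIC are assumptions of this example, not the model-selection procedure used in the thesis (which compares partitions with cluster validation indices).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.multivariate_normal(m, np.eye(2) * s, 200)
               for m, s in (([0, 0], 0.3), ([4, 1], 1.0), ([0, 5], 0.5))])

# Rough correspondence (only 4 of the 14 MCLUST models have a direct counterpart):
#   "spherical" ~ VII, "diag" ~ VVI, "tied" ~ EEE, "full" ~ VVV
for cov_type in ("spherical", "diag", "tied", "full"):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          init_params="kmeans",  # EM started from a K-means partition
                          tol=1e-5, random_state=0)
    labels = gmm.fit_predict(X)
    print(f"{cov_type:9s}  avg log-lik = {gmm.score(X):8.3f}  BIC = {gmm.bic(X):10.1f}")
```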


Model | Σ_k | Distribution | Volume | Shape | Orientation
EII | λI | Spherical | Equal | Equal | -
VII | λ_k I | Spherical | Variable | Equal | -
EEI | λA | Diagonal | Equal | Equal | Coordinate axes
VEI | λ_k A | Diagonal | Variable | Equal | Coordinate axes
EVI | λA_k | Diagonal | Equal | Variable | Coordinate axes
VVI | λ_k A_k | Diagonal | Variable | Variable | Coordinate axes
EEE | λDAD^T | Ellipsoidal | Equal | Equal | Equal
EVE | λDA_k D^T | Ellipsoidal | Equal | Variable | Equal
VEE | λ_k DAD^T | Ellipsoidal | Variable | Equal | Equal
VVE | λ_k DA_k D^T | Ellipsoidal | Variable | Variable | Equal
EEV | λD_k AD_k^T | Ellipsoidal | Equal | Equal | Variable
VEV | λ_k D_k AD_k^T | Ellipsoidal | Variable | Equal | Variable
EVV | λD_k A_k D_k^T | Ellipsoidal | Equal | Variable | Variable
VVV | λ_k D_k A_k D_k^T | Ellipsoidal | Variable | Variable | Variable

Table 2.1: Models available in Mclust

Figure 2.1: Illustration of MCLUST models in 2D

Some special notice has to be taken of potential pitfalls when using the EM-algorithm for clustering, specifically convergence problems and singularities in the likelihood function. The EM-algorithm may get stuck in local maxima rather than global ones; to combat this, the algorithm is often run repeatedly using different starting values. In mclust and teigen these values are based on solutions from K-means clustering by default, which has been done in this project. In some cases the likelihood function may become infinite, which is due to the number of parameters to estimate being large in comparison to the number of data points. This may happen in models with unconstrained variances or many components. Two methods are possible for combating this problem: constraining the variances or attempting a Bayesian approach [20]. In this thesis the covariance matrices were constrained.

2.1.4 t-distributed mixture models (tEIGEN)

An alternative to Gaussian mixtures is to use mixture models in which the components follow a multivariate t-distribution. Different models and software packages are available for this; this thesis has used the tEIGEN family of models, which is described in Equation 2.12. The different mixture models in the family take the form

g(x \mid \vartheta) = \sum_{k=1}^{K} \pi_k f_t(x \mid \mu_k, \Sigma_k, \nu_k)    (2.12)

The π_k are the mixing proportions, K is the number of components, and f_t(x | µ_k, Σ_k, ν_k) is the density function of a random variable following a p-dimensional t-distribution that has expected value vector µ_k, covariance matrix Σ_k and degrees of freedom ν_k. In the tEIGEN model family, Σ_k is decomposed using the eigen-decomposition as in MCLUST, i.e. Σ_k = λ_k D_k A_k D_k^T. λ_k, D_k and A_k have the same meaning as in MCLUST. The same constraints on Σ_k can be considered as for the MCLUST models, and an additional one is that the degrees of freedom across groups can be constrained [6]. For a complete list of the models available in the software package teigen, see Table 2.2. In the column for ν_k, the entries may be either "C" (for "constrained") or "U" (for "unconstrained"). If ν_k is constrained, it will be equal for all the clusters; if it is not constrained, it may vary between clusters.

Model | Σ_k | ν_k
CIIC | λI | C
CIIU | λI | U
UIIC | λ_k I | C
UIIU | λ_k I | U
CICC | λA | C
CICU | λA | U
UICC | λ_k A | C
UICU | λ_k A | U
CIUC | λA_k | C
CIUU | λA_k | U
UIUC | λ_k A_k | C
UIUU | λ_k A_k | U
CCCC | λDAD^T | C
CCCU | λDAD^T | U
UCCC | λ_k DAD^T | C
UCCU | λ_k DAD^T | U
CUCC | λD_k AD_k^T | C
CUCU | λD_k AD_k^T | U
UUCC | λ_k D_k AD_k^T | C
UUCU | λ_k D_k AD_k^T | U
CCUC | λDA_k D^T | C
CCUU | λDA_k D^T | U
CUUC | λD_k A_k D_k^T | C
CUUU | λD_k A_k D_k^T | U
UCUC | λ_k DA_k D^T | C
UCUU | λ_k DA_k D^T | U
UUUC | λ_k D_k A_k D_k^T | C
UUUU | λ_k D_k A_k D_k^T | U

Table 2.2: Models in the tEIGEN family

In order to assign cluster labels to data points the model parameters must be estimated, which is done using a variant of the EM-algorithm called the multicycle ECM algorithm. The procedure is similar to the one described by Celeux and Govaert [17]. In the words of the authors, parameter estimation is done as follows. For the general tEIGEN model the complete-data log-likelihood is given by Equation 2.13, in which z_ik = 1 if x_i belongs to cluster k and 0 otherwise.

l_c(\vartheta) = \sum_{k=1}^{K} \sum_{i=1}^{n} z_{ik} \log\!\left[\pi_k\, \gamma(u_{ik} \mid \nu_k/2, \nu_k/2)\, \Phi(x_i \mid \mu_k, \Sigma_k/u_{ik})\right]    (2.13)

Here, \gamma(y \mid \alpha, \beta) = \frac{\beta^{\alpha} y^{\alpha-1} \exp(-\beta y)}{\Gamma(\alpha)} I\{y > 0\} for some α, β > 0, where I is the indicator function, and Φ is the density function of a multivariate Gaussian with mean µ and covariance matrix Σ. In each E-step of the algorithm the z_ik (cluster membership indicators) and u_ik (characteristic weights) are updated by their conditional expected values

z_{ik} = \frac{\pi_k f_t(x_i \mid \mu_k, \Sigma_k, \nu_k)}{\sum_{h=1}^{K} \pi_h f_t(x_i \mid \mu_h, \Sigma_h, \nu_h)}    (2.14)


u_{ik} = \frac{\nu_k + p}{\nu_k + \sigma(x_i, \mu_k \mid \Sigma_k)}    (2.15)

In Equation 2.15, σ(x_i, µ_k | Σ_k) is the squared Mahalanobis distance between x_i and µ_k. The CM step consists of two steps, one for updating mixing proportions, means and degrees of freedom, and another for updating covariance matrices. Mixing proportions and means are updated as described by Andrews and McNicholas [6].

\pi_k = \frac{n_k}{n}, \qquad \mu_k = \frac{\sum_{i=1}^{n} z_{ik} u_{ik} x_i}{\sum_{i=1}^{n} z_{ik} u_{ik}}, \qquad n_k = \sum_{i=1}^{n} z_{ik}    (2.16)

Updating the degrees of freedom deserves some special attention. Depending on whether or not they are constrained, different approximations are used for updating [5]. In the case of constrained degrees of freedom, let

k = -1 - \frac{1}{n} \sum_{g=1}^{G} \sum_{i=1}^{n} z_{ig}\left(\log(w_{ig}) - w_{ig}\right) - \varphi\!\left(\frac{\nu^{old} + p}{2}\right) + \log\!\left(\frac{\nu^{old} + p}{2}\right),

where φ denotes the digamma function. An approximation is then provided by

\nu \approx \frac{-\exp(k) + 2\exp(k)\left(\exp\!\left(\varphi\!\left(\frac{\nu^{old}}{2}\right)\right) - \left(\frac{\nu^{old}}{2} - \frac{1}{2}\right)\right)}{1 - \exp(k)}    (2.17)

The same approximation can be used for the unconstrained degrees of freedom by using an alternative definition of k, namely

k = -1 - \frac{1}{n_k} \sum_{i=1}^{n} z_{ik}\left(\log(w_{ik}) - w_{ik}\right) - \varphi\!\left(\frac{\nu^{old} + p}{2}\right) + \log\!\left(\frac{\nu^{old} + p}{2}\right)    (2.18)

Covariance matrices are updated in the second CM-step. In the words of the original authors [6], these are updated similarly to their Gaussian counterparts in Celeux and Govaert [17].

Finally, the aspects of initialization and convergence criteria must be addressed. The software package teigen provides some alternatives for these aspects. The z_ik must be initialized, and in this thesis this was achieved by using a K-means initialization with 50 random starting points. The initial degrees of freedom are 50 by default [5]. Aitken's acceleration is used to determine convergence [5]. In iteration t, Aitken's acceleration is given by

a^{(t)} = \frac{l^{(t+1)} - l^{(t)}}{l^{(t)} - l^{(t-1)}}    (2.19)

In this case, l^(i) refers to the value of the log-likelihood function at iteration i. It is used to compute an asymptotic estimate of the log-likelihood by

l_{\infty}^{(t+1)} = l^{(t)} + \frac{1}{1 - a^{(t)}}\left(l^{(t+1)} - l^{(t)}\right)    (2.20)

The stopping criterion is l_∞^(t+1) − l^(t+1) < ε_2, where ε_2 = 0.1 is the default value used in the tEIGEN software.
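A minimal sketch of this stopping rule, assuming three successive log-likelihood values from an ECM run are already available (the numbers below are made up for illustration): it computes Aitken's acceleration of Equation 2.19, the asymptotic estimate of Equation 2.20 and checks the criterion with ε_2 = 0.1.

```python
def aitken_converged(l_prev, l_curr, l_next, eps2=0.1):
    """Check l_inf^(t+1) - l^(t+1) < eps2 using Equations 2.19 and 2.20."""
    a_t = (l_next - l_curr) / (l_curr - l_prev)       # Aitken's acceleration a^(t)
    l_inf = l_curr + (l_next - l_curr) / (1.0 - a_t)  # asymptotic log-likelihood estimate
    return (l_inf - l_next) < eps2

# Example with three made-up successive log-likelihood values.
print(aitken_converged(-1523.4, -1520.9, -1519.8))  # False: keep iterating
```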

When the algorithm has converged, cluster memberships for the data points are determined by the maximum posterior probabilities, i.e. if z_ig is maximized for component g, then data point i is assigned to cluster g.
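This assignment step can be illustrated with a small sketch that evaluates the component densities f_t, forms the posterior probabilities z_ik of Equation 2.14 and takes the arg max per data point. The parameter values below are made-up illustrations rather than fitted tEIGEN estimates (which the thesis obtains with the teigen software).

```python
import numpy as np
from scipy.stats import multivariate_t

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))  # six points in p = 2 dimensions

# Made-up parameters for a K = 2 component t-mixture.
pis = np.array([0.6, 0.4])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]   # used directly as scipy's scale ("shape") matrices
dfs = [5.0, 10.0]

# z_ik: posterior probability that point i belongs to component k (Equation 2.14).
dens = np.column_stack([
    pis[k] * multivariate_t.pdf(X, loc=mus[k], shape=Sigmas[k], df=dfs[k])
    for k in range(2)
])
z = dens / dens.sum(axis=1, keepdims=True)
labels = z.argmax(axis=1)  # hard assignment by maximum posterior probability
print(np.round(z, 3), labels)
```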

2.1.5 Non-negative matrix factorization (NMF)

An entirely different approach to clustering is non-negative matrix factorization (NMF). It is a matrix approximation method which factorizes a non-negative matrix into two non-negative matrices of lower rank than the original. NMF has been used for clustering in different contexts, e.g. document clustering [51] and cancer gene clustering [16]. There is also theoretical work regarding the clustering aspect of NMF by Mirzal and Furukawa [40]. A general explanation of the problem, inspired by the ones provided by Brunet et al. and Xu et al. [16], is as follows. Suppose we have access to a data matrix X ∈ R^{m×n} in which all elements are non-negative, where m denotes the number of features and n denotes the number of observations. We wish to find an approximate non-negative factorization of this matrix, i.e. find non-negative matrices U ∈ R^{m×k}, V^T ∈ R^{k×n} such that X ≈ UV^T. This is the general outline of the problem, which may be tackled in different ways.


Xu et al. assume that the columns in X are normalized to unit Euclidean length, while no such assumption is described by Brunet et al. Moreover, finding U and V can be done by solving different optimization problems and using different optimization procedures. Brunet et al. seek to minimize a functional related to the Poisson likelihood, while Xu et al. seek to minimize J = \frac{1}{2}\|X - UV^T\|, where \|\cdot\| is the sum of squares of all matrix elements. In other words they wish to minimize half the squared Euclidean distance between X and UV^T, which of course is equivalent to minimizing the Euclidean distance.

The steps taken in this thesis are as follows. The data is not normalized prior to clustering since all features have the same units, and due to the constraint of non-negativity on the matrix elements, no PCA is performed. In other words, the only pre-processing done is the square-root transformation. The transpose of this data matrix (denoted X′) will then act as input to the NMF algorithm. We seek to solve the optimization problem in Equation 2.21.

\min_{U,V} \ \frac{1}{2}\|X' - UV^T\|    (2.21a)

\text{subject to } U \ge 0, \ V \ge 0    (2.21b)

The optimization problem may be solved by using the updating formulas presented in Equation 2.22. These multiplicative updates were introduced by Lee and Seung [35]; here we use the same notation as in [51]. These are iterative updates, so an initialization is required. Initialization is done using the method Nonnegative Double Singular Value Decomposition (NNDSVD), which is described by Boutsidis and Gallopoulos [13]. In this thesis, the tolerance for the stopping condition has been 10^-4 and the maximum number of iterations is set to 200.

u_{ij} \leftarrow u_{ij} \frac{(XV)_{ij}}{(UV^T V)_{ij}}, \qquad v_{ij} \leftarrow v_{ij} \frac{(X^T U)_{ij}}{(V U^T U)_{ij}}    (2.22)
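A hedged Python sketch of this clustering procedure is given below: scikit-learn's NMF offers NNDSVD initialization, multiplicative updates and the tolerance and iteration cap stated above, and the sketch also applies the normalization and arg-max cluster assignment described in the remainder of this subsection. The thesis does not name its NMF implementation here, and the random non-negative data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
# Illustrative non-negative data: 100 "customers" x 12 product-group counts, square-root transformed.
X = np.sqrt(rng.poisson(lam=4.0, size=(100, 12)).astype(float))

nmf = NMF(n_components=3, init="nndsvd", solver="mu",   # NNDSVD start, multiplicative updates
          beta_loss="frobenius", tol=1e-4, max_iter=200, random_state=0)
W = nmf.fit_transform(X)   # per-customer coefficients (plays the role of V in the text)
H = nmf.components_        # basis vectors (rows play the role of the columns of U)

# Normalization of Equation 2.23: scale each coefficient column by the norm of its basis vector.
scale = np.linalg.norm(H, axis=1)
labels = (W * scale).argmax(axis=1)  # assign each customer to its dominant component
print("cluster sizes:", np.bincount(labels))
```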

It should be noted that the solution to minimizing the Euclidean distance is not unique. Say U and V are solutions to minimizing J and that D is some positive diagonal matrix. Then UD and VD^-1 will also be solutions to the optimization problem. To ensure the solution's uniqueness, U is normalized by setting

v_{ij} \leftarrow v_{ij} \sqrt{\sum_i u_{ij}^2}, \qquad u_{ij} \leftarrow \frac{u_{ij}}{\sqrt{\sum_i u_{ij}^2}}    (2.23)

Once this has been done, the actual clustering can be performed. This is done by assigning data point x_i to cluster j* = argmax_j v_ij. In other words, for each row i in V we care about the column index of the greatest element in that row; this column index is the cluster assignment for the data point x_i [51]. It is also worth discussing how NMF relates to K-means clustering to further illustrate the algorithm's use in clustering, as shown by Kim and Park [32]. It is helpful to view K-means clustering as a lower-rank matrix factorization with special constraints. As previously discussed, the objective function in K-means can be written as in Equation 2.24, in which A = [a_1, ..., a_n] ∈ R^{m×n}, K is some integer and C = [c_1, ..., c_K] ∈ R^{m×K} is the centroid matrix. c_j is the cluster centroid of cluster j and B ∈ R^{n×K} is the cluster assignment matrix.

J_K = \sum_{j=1}^{K} \sum_{a_i \in C_j} \|a_i - c_j\|^2 = \|A - CB^T\|_F^2    (2.24)

J_K can be rewritten. Letting |C_j| denote the number of data points in cluster j and

D^{-1} = \mathrm{diag}\!\left(\frac{1}{|C_1|}, \frac{1}{|C_2|}, \ldots, \frac{1}{|C_K|}\right) \in \mathbb{R}^{K \times K}, \qquad C = ABD^{-1},    (2.25)

we may rewrite J_K as

J_K = \|A - ABD^{-1}B^T\|_F^2    (2.26)

Using this formulation, K-means may be viewed as the task of finding B such that J_K is minimized. In this setting, B has one non-zero element per row, this non-zero element being 1. We consider a factorization of D^{-1} using two diagonal matrices D_1 and D_2 which satisfy D^{-1} = D_1 D_2, and let F = BD_1 and H = BD_2. Then J_K can be written as

J_K = \|A - AFH^T\|_F^2    (2.27)

and the optimization problem becomes

\min_{F,H} J_K = \|A - AFH^T\|_F^2    (2.28a)

in which F and H have one non-zero and positive element per row. If we set U = AF we see that the objective function is similar to the one in NMF. In K-means, U is the centroid matrix with rescaled columns and H has exactly one non-zero element per row; in other words, each row represents a hard clustering of each data point. The difference compared to NMF is that NMF does not have these constraints, so the basis vectors (i.e. columns in U) in NMF are not necessarily the cluster centroids, and it does not force hard clustering upon the data points.
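The identity between Equations 2.24 and 2.26 is easy to check numerically; the sketch below builds an arbitrary hard assignment matrix B for made-up data and verifies that the sum of squared distances to the cluster centroids equals the factorization residual \|A - ABD^{-1}B^T\|_F^2.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, K = 4, 30, 3
A = rng.normal(size=(m, n))                                          # data: columns a_1, ..., a_n

labels = np.concatenate([np.arange(K), rng.integers(0, K, n - K)])   # every cluster non-empty
B = np.eye(K)[labels]                                                # n x K indicator matrix
D_inv = np.diag(1.0 / B.sum(axis=0))                                 # diag(1/|C_1|, ..., 1/|C_K|)
C = A @ B @ D_inv                                                    # centroid matrix (Equation 2.25)

# Equation 2.24: sum of squared distances of each point to its assigned centroid.
J_direct = sum(np.linalg.norm(A[:, i] - C[:, labels[i]]) ** 2 for i in range(n))
# Equation 2.26: the same objective written as a matrix factorization residual.
J_factored = np.linalg.norm(A - A @ B @ D_inv @ B.T, ord="fro") ** 2

print(np.isclose(J_direct, J_factored))  # True
```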

2.2 Cluster validation indices

After running a clustering algorithm and obtaining a partition we may naturally wish to evaluate this solution. Is it a good one, is it close to being good or is it worthless? Maybe we want to compare different kinds of algorithms, or maybe we want to vary the parameters in one type of algorithm and choose the best parameter values, but how do we determine the best result from all these different methods? Several validation indices exist, and there is some confusion regarding naming convention as well as which one should be chosen.

According to Halkidi et al. [23], cluster validation indices can be divided into three different types: external criteria, internal criteria and relative criteria. The usage of external criteria requires some prior knowledge of cluster structures inherent to the input data, in our case customer segment labels. In this thesis no such prior information is available, so we exclude this type of criteria and turn our attention to the two remaining ones. Defining internal and relative criteria proves to be somewhat challenging as the terms are used interchangeably. We begin by providing the explanations by Halkidi et al. [23] and then proceed to explain the naming conundrum. Internal criteria are described in terms of evaluating the clustering results in terms of the data itself, e.g. a proximity matrix. According to Halkidi et al. the goal is to determine how much the result of a clustering algorithm agrees with the proximity matrix, for which purpose the authors propose using Hubert's Γ-statistic. Relative criteria are used when we compare different clustering schemes with respect to some predetermined criterion. These criteria do not depend on any statistical tests like the external and internal criteria do. Now we turn our attention to a potential source of confusion. In the work by Halkidi et al., examples of relative criteria include the Davies-Bouldin index, Dunn index and Dunn-like indices. These are well known indices that have been studied in other works. The crux, however, is that in other works these are usually called internal indices, see [27, 10, 37, 48]. The very same articles refer to the Silhouette index as an internal validity measure although it fits the definition of a relative criterion. To make things even more confusing, other articles refer to the very same indices as relative criteria, see [50, 29].

This confusion about naming convention is important to note since cluster evaluation and comparing clustering solutions is a central part of this thesis. No prior customer segment memberships are known for the data points in this thesis, meaning external validation is out of the question. The indices used will be the Silhouette index and the Davies-Bouldin index. Whether to call them internal or relative criteria is left as an exercise to the reader; even though it has no direct impact on the work in this thesis, it is a confusing topic that needs to be addressed.

2.2.1 Choice of cluster validity index (CVI)

In the absence of class labels for our data we have to use internal validation indices to evaluate the clustering partitions obtained. There are a multitude of different indices one can use, and in order to reduce the arbitrariness of choice, a few criteria have been used in this thesis: level of empirical support, computational complexity and availability of software. These three aspects form the basis for the choice of CVI. The rationale is that since no single CVI dominates all data sets and we cannot analytically prove that a single CVI is suitable, we have to use empirical results available in the research literature. Also, we must take into account that due to the amount of data available some CVIs may be computationally unfeasible. Lastly, the question of software availability must be considered. It has been shown that different software implementations of the same algorithm can vary greatly in quality [30]. If we are to be certain about our results we need to trust the software used, so this is a factor we must consider.

Several works on comparing CVIs have been published. The most comprehensive one was by Arbelaitz et al. [7], in which multiple CVIs were evaluated on different data sets. The CVIs were evaluated as follows. A clustering algorithm was applied to the data set with different values of K, the hypothesized number of true clusters present. This yields a set of different partitions of the data. The CVI is computed for all such partitions, and the best partition according to the CVI will be the "predicted" partition by the CVI. The goal is to predict the partition that is most similar to the true one. This similarity was measured using the Adjusted Rand index, Jaccard index and Variation of Information. The results show that no single CVI dominates the others, but overall the Silhouette and Davies-Bouldin indices seem to be good choices. It should however be noted that they are not significantly different from the Calinski-Harabasz and Dunn indices. In another study it was shown that the Silhouette index was very competitive [30]. In this particular study the Silhouette was actually not the single best performing CVI, but since the better performing CVI overall has less support in the literature we refrain from using it. Also, the Davies-Bouldin index did not perform well in this particular study. In another article, by Liu et al. [37], we can see that the Silhouette and Davies-Bouldin indices are among the top performers. In this particular article neither was the best choice of CVI (rather, S_Dbw was), but it should also be noted that S_Dbw does not perform well in general [30]. All in all, there exists work which shows that the Silhouette and Davies-Bouldin indices are good choices even if they do not always find the correct number of clusters.

Other arguments for using the Silhouette and Davies-Bouldin indices are related to their mathematical properties. The Silhouette differs from the other well-performing indices (Dunn, Davies-Bouldin and Calinski-Harabasz) since it can compute a score for each data point while the others only compute an overall score for the partition. Thus, we can evaluate a distribution of silhouette scores instead of just the arithmetic mean, which is desirable since distributions are more informative than summary statistics. Secondly, the Silhouette is bounded. Since we know that a Silhouette index must lie in [−1, 1], we have a slightly more absolute measure of cluster quality. Even though we do not know the true labels of each customer, we at least know that if a clustering algorithm gives very high silhouette values then it at least produces tight-knit clusters. This argument also applies to the Davies-Bouldin index since we know that it is theoretically bounded from below by 0 (this follows from Definitions 1 and 2), so for very small values of R we at least know that tight-knit clusters are formed.

2.2.2 The silhouette index

The silhouette index is an internal validation index which provides a graphical way of assessing the quality of clustering results. Unless stated otherwise, all results in this section come directly from the original article by Rousseeuw [44]. In the author's own words it is useful for evaluating clusters when we wish to create clusters that are compact and clearly separated. The construction of a silhouette plot requires only the partition which results from applying a clustering algorithm to some data set and the distances between the data points.

Before defining the silhouette and discussing some properties we introduce some notation. Throughout this section x denotes some data point in our data set X. We let A be the cluster x is assigned to by some algorithm. When A contains more points than just x we can compute a(x) as the average dissimilarity of x to all other points in A. If we let C be a different cluster than A we can compute d(x, C) as the average dissimilarity of x to all points in C. Further, let b(x) = \min_{C \neq A} d(x, C). This minimum is attained for some cluster B, which will be called the neighbor of x. We can think of it as the "second best choice" for cluster assignment of x, i.e. the closest cluster we would have used if A was not available. We assume that the number of clusters K is strictly greater than 1. We then define the silhouette score of x as

s(x) = \begin{cases} 1 - a(x)/b(x) & \text{if } a(x) < b(x) \\ 0 & \text{if } a(x) = b(x) \\ b(x)/a(x) - 1 & \text{if } a(x) > b(x) \end{cases} \iff s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}

The definition implies that −1 ≤ s(x) ≤ 1. Rousseeuw sets s(x) = 0 if A = {x}, which in his own words is arbitrary but the most neutral option. This is done since it is unclear how to define a(x) if A = {x}. s(x) is used to measure how well x matches its cluster assignment. Below we describe how to interpret it.

In the case when s(x) ≈ 1 we say that x is well-clustered, since s(x) ≈ 1 implies a(x) ≪ b(x), i.e. it is likely that x is in the correct cluster since it is much closer to A than to B. When s(x) ≈ 0 it is unclear if x should be put in A or B. Finally we consider the case when s(x) ≈ −1, which is the worst case. s(x) ≈ −1 implies a(x) ≫ b(x), i.e. x is on average closer to elements in another cluster than the one it has been assigned to. We can almost certainly conclude that x has been put in the wrong cluster. Using these definitions and the computed s(x) for different x we can construct a graphical display. For some cluster A, its silhouette is the plot of all s(x) (in decreasing order) where x ∈ A. The s(x) are represented by horizontal bars whose lengths and directions are proportional to the respective s(x).

Silhouettes are useful since they are independent of the clustering algorithm used; only the actual cluster partitions are used. This means we can use silhouettes to compare the quality of outputs from different algorithms applied to the same data set. For example, if using K-means clustering, we can select an appropriate K based on the silhouettes. More generally, we can use silhouettes to determine the correct number of clusters K. This is explained in greater detail below.

Assume our data consists of clusters that are far apart but that we have specified K to be less than the true number of clusters. Some cluster algorithms (e.g. K-means) will join some of the true clusters so we end up with exactly K clusters. This will result in large a(x) and subsequently small s(x), which will be visible in the graphical display. In the reverse situation, i.e. we have set K too high, some of the true clusters will be divided so we form exactly K clusters. Such artificial divisions will also give small s(x) since the b(x) will become small. We can compare cluster quality between the different partitions that arise. For each cluster resulting from a partition we can compare the average s(x) from different clusters. This will distinguish weak and clear clusters. When comparing plots we can compute the overall average s(x) of the entire data set. Different K will in general give different average values of s(x), denoted by s(K). We can find a suitable number of clusters K* by K* = argmax_K s(K).

When choosing the number of clusters or cluster algorithm by maximizing the Silhouette coefficient, there are potential pitfalls one has to be aware of. High values of the average silhouette coefficient may indicate that a strong clustering structure has been found, but this might be misleading if there are outliers in the data. These outliers may be so far from the other data points that the other points seem like a tightly knit cluster in comparison to the outliers. In this case a clustering algorithm with 2 clusters may give a very high average silhouette index even if the cluster assignment is very rough, since the outlier(s) end up in one cluster and all the other data points end up in another. An example of this is provided by Rousseeuw to illustrate the idea. This phenomenon is relevant for this thesis since the data is very skewed, which is explored in more detail in Section 3.3.
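
To make the pitfall concrete, the following small Python sketch (using scikit-learn, the library employed later in this thesis) shows how a single extreme point can inflate the average silhouette value; the synthetic data, seeds and variable names are illustrative only and not taken from the thesis data.

# Hedged sketch: the outlier pitfall described above, on synthetic data.  A single
# extreme point makes a two-cluster partition look excellent on average, even though
# one "cluster" is a lone outlier and the other is simply everything else.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = 60.0                                   # one extreme "big spender"

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))               # 499 and 1
print("average silhouette:", silhouette_score(X, labels))  # close to 1
s = silhouette_samples(X, labels)                          # per-point s(x), used for silhouette plots
print("range of s(x):", s.min(), s.max())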

2.2.3 The Davies-Bouldin index

Another cluster validity index is the Davies-Bouldin index, introduced in 1979 [19]. Unless explicitly stated otherwise, all information in this section was obtained from the original article. The Davies-Bouldin index indicates similarity of clusters and can be used to determine how appropriate a partition of data is. The DB index is independent of the number of clusters as well as the method used for clustering, making it very attractive for comparing clustering algorithms that are very different in nature. Some notation needs to be defined. The original authors define the concept of a dispersion function as follows.

Definition 1. Let cluster C have members X1, ..., Xm. A real-valued function S is said to be a dispersion measure if the following properties hold:

1) S(X1, ..., Xm) ≥ 0
2) S(X1, ..., Xm) = 0 ⇐⇒ Xi = Xj, ∀ Xi, Xj ∈ C

The authors set out to define a cluster separation measure R(Si, Sj, Mij) which computes the average similarity of each cluster with its most similar cluster. They propose a definition of a cluster similarity measure as in Definition 2.

Definition 2. Let Mij denote the distance between vectors that are characteristic of clusters i and j. Si and Sj denote the dispersion of cluster i and j, respectively. A real-valued function R is a cluster similarity measure if the following properties hold:

1) R(Si, Sj, Mij) ≥ 0
2) R(Si, Sj, Mij) = R(Sj, Si, Mji)
3) R(Si, Sj, Mij) = 0 ⇐⇒ Si = Sj = 0
4) Sj = Sk and Mij < Mik =⇒ R(Si, Sj, Mij) > R(Si, Sk, Mik)
5) Mij = Mik and Sj > Sk =⇒ R(Si, Sj, Mij) > R(Si, Sk, Mik)

Definition 2 implies some heuristically meaningful limitations on R. Specifically, these are

1. R is nonnegative

2. R is symmetric

3. The similarity between two clusters is zero only if their dispersion functions vanish

4. If the distance between clusters increases while their dispersions remain constant, the similarity of theclusters decreases

5. If the distance between clusters remains constant while the dispersions increase, the similarity increases

The authors propose a function which satisfies the criteria in Definitions 1 and 2. Specifically, when using the same Si, Sj and Mij as in Definition 2 and letting N denote the number of clusters, they set

R_{ij} = \frac{S_i + S_j}{M_{ij}}, \qquad R = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad R_i := \max_{j \neq i} R_{ij} \qquad (2.29)

R is used to select an appropriate clustering solution. One can think of it as the average value of the similarity measures of each cluster with its most similar cluster. When comparing multiple clustering solutions to one another, the solution which minimizes R is the best one with respect to this measure. In the original article, and in the following, the choices of distance function, dispersion measure and characteristic vector were made as

S_i = \left(\frac{1}{T_i}\sum_{j=1}^{T_i} |X_j - A_i|^q\right)^{1/q}, \qquad M_{ij} = \left(\sum_{k=1}^{N} |a_{ki} - a_{kj}|^p\right)^{1/p} \qquad (2.30)

In Equation 2.30, Ti is the number of observations in cluster i, Ai is the centroid of cluster i and a_{ki} is the kth component of the centroid of cluster i. In this thesis, Mij has been chosen as the Euclidean distance between centroids, i.e. p = 2 and q = 2, meaning Si is the standard deviation of the distance between samples and their respective cluster centres.

There are additional aspects that must be mentioned. First of all, the partition of the data set must yield at least two clusters for R to make any sense. This is because the distance measure Mij must be non-zero for R to be defined. Moreover, the presence of clusters with only a single member limits the use of R since they will have zero dispersion (see Definition 1, property 2).
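
As with the silhouette index, the Davies-Bouldin index is available in scikit-learn. The sketch below compares K-means partitions for a few values of K on placeholder data; note that the library's exact choice of dispersion measure may differ slightly from the p = q = 2 convention stated above, so this is an illustration rather than a definition.

# Hedged sketch: comparing partitions with the Davies-Bouldin index (lower R is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                 # placeholder data

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))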

2.3 Principal Component Analysis (PCA)

Principal component analysis (PCA) is a linear dimension reduction technique often used as a data pre-processing step prior to clustering. Unless explicitly stated otherwise, the information about PCA was taken from Izenmann [28]. Consider the case of a random vector X = (X1, ..., Xr)^T which consists of r random variables. We assume X has mean vector µ_X and the r × r covariance matrix Σ_XX. The goal of PCA is to find a set of t (where t << r) ordered and uncorrelated linear projections of the input variables that can replace the original variables with minimal loss of information. In this case, "information" refers to the total variation of the input variables, defined in Equation 2.32. These projections are so-called principal components, will be denoted ξ_j, 1 ≤ j ≤ t, and are of the form shown in Equation 2.31

\xi_j = b_j^T X = b_{j1}X_1 + \dots + b_{jr}X_r, \qquad j = 1, 2, \dots, t \qquad (2.31)

\sum_{i=1}^{r} \operatorname{var}(X_i) = \operatorname{tr}(\Sigma_{XX}) \qquad (2.32)

To find these principal components we make use of the spectral decomposition of Σ_XX, see Equation 2.33. Λ is a diagonal matrix whose elements are the eigenvalues of Σ_XX (denoted λ_i) and U is a matrix whose columns are the eigenvectors of Σ_XX.

\Sigma_{XX} = U\Lambda U^T, \qquad U^T U = I \qquad (2.33)

We thus see that tr(Σ_XX) = tr(Λ) = λ_1 + ... + λ_r. b_j = (b_{j1}, ..., b_{jr})^T is the jth coefficient vector and is chosen so that the following properties hold

1. The first t principal components ξ_1, ..., ξ_t are ranked in decreasing order of their variances var(ξ_j). In other words, var(ξ_1) ≥ var(ξ_2) ≥ ... ≥ var(ξ_t)

2. ξj is uncorrelated with all ξk, k < j

In the case of dealing with observed data, we have to estimate the principal components. Specifically, we estimate Σ_XX by the sample covariance matrix Σ_XX = S/n = X_C X_C^T / n, where n is the number of observations available and X_C is the centered data matrix (i.e. each column is centered). The ordered eigenvalues of this matrix are denoted λ_1 ≥ λ_2 ≥ ... ≥ 0. The eigenvector associated with the jth greatest λ_j is the jth eigenvector and is denoted v_j, j = 1, 2, ..., r. The jth sample PC score of X is given by

\hat{\xi}_j = v_j^T (X - \bar{X}) \qquad (2.34)

and the variance of the jth principal component is estimated by λ_j. So we clearly see that in order to compute the principal components we need the eigenvalues and eigenvectors of X^T X.

We now turn our focus to how the principal components are computed. For this purpose we need to consider the singular value decomposition of a matrix. Unless explicitly stated otherwise, the information about SVD and its relationship to PCA is taken from Jolliffe [31]. For an arbitrary matrix X of size n × p, where we assume that n is the number of data points and p is the dimension of each data point, we may write

X = ULA^T \qquad (2.35)

In Equation 2.35 we used the following notation

1. U, A are matrices of sizes (n × r) and (p × r), respectively, and both have orthonormal columns.

2. L is an (r × r) diagonal matrix.

3. r is the rank of X.

To see why these statements are true, consider the spectral decomposition of X^T X. Let the eigenvalues of X^T X be denoted as l_k for k = 1, 2, ..., p and let S be the sample covariance matrix. Then we can write

(n-1)S = X^T X = l_1 a_1 a_1^T + \dots + l_r a_r a_r^T \qquad (2.36)

A is defined as the (p × r) matrix whose kth column is a_k, U is the (n × r) matrix whose kth column is

u_k = l_k^{-1/2} X a_k, \qquad k = 1, 2, \dots, r \qquad (2.37)

and finally L is the (r × r) diagonal matrix whose kth diagonal element is l_k^{1/2}. Thus, conditions 1 and 2 are satisfied and it can be shown that X = ULA^T. This equation can also be expressed as

ULA^T = \sum_{k=1}^{p} X a_k a_k^T = X \qquad (2.38)

in which Xa_k is a vector which contains the scores on the kth PC. SVD is important for the computation of PCA since we see that if we can find U, L and A such that ULA^T = X, then A and L give the eigenvectors and square roots of eigenvalues of X^T X. This means that we in turn get the coefficients and standard deviations of the principal components of S. Also, U will give scaled versions of the principal component scores. Having seen why SVD is important for PCA, we now turn our focus to how it is actually computed in practice.

The software implementation used in this thesis relies on an approximation of the SVD presented by Halko et al [24], in which a randomized low-rank approximation is used. The general idea behind their algorithm is to first perform a low-rank approximation of X and then perform SVD on this approximation. A brief summary of their goal is that they wish to construct a matrix Q with k orthonormal columns which satisfies

X \approx QQ^T X \qquad (2.39)

k is to be kept as low as possible and is usually chosen in advance. SVD of X is then performed with the help of Q.

In order to simplify algorithm development, the authors formulate the problem a bit more specifically as follows. Given X, a target rank k and an oversampling parameter p, the goal is to construct a matrix Q with k + p orthonormal columns such that

\|X - QQ^T X\| \approx \min_{\operatorname{rank}(Y) \leq k} \|X - Y\| \qquad (2.40)

The steps taken are

• A: Find a low-rank approximation of the target matrix X

• B: Perform SVD on this approximation

Specifically the authors make use of what they call the "proto-algorithm", which can be described generally as in Algorithm 3. It should be noted that this is a general outline and that the authors propose several potential refinements depending on the matrix used, which can be found in the work by Halko et al [24].

Algorithm 3: Prototype algorithm for randomized SVD

Input: m × n matrix X, target rank k ∈ Z, exponent q (usually 1 or 2)
Output: Approximate rank-2k factorization UΣV^T, where U and V are orthonormal and Σ is nonnegative and diagonal.

Stage A
1. Generate an n × 2k Gaussian test matrix Ω
2. Form Y = (XX^T)^q XΩ by multiplying alternately with X and X^T
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y, e.g. using the QR factorization Y = QR

Stage B
4. Form B = Q^T X
5. Compute an SVD of the small matrix B = ŨΣV^T
6. Set U = QŨ

In step 1, what we mean is that each entry of Ω is an independent Gaussian random variable following a N(0, 1)-distribution. It should be noted that in the authors' words this algorithm is especially appropriate for approximating sparse matrices, which is very suitable for this thesis project.

After Stage A is complete we turn to Stage B. At this point, an orthonormal matrix Q has been produced so that

\|X - QQ^T X\| \leq \varepsilon \qquad (2.41)

where ε is a computational tolerance. At this stage, the factorizations of X using Q are computed, see steps 4-6 in Algorithm 3. Using this implementation for computing the SVD of a large matrix, we may obtain an approximate SVD which we then use to compute the principal components.
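
For reference, scikit-learn exposes this machinery directly: sklearn.utils.extmath.randomized_svd implements a randomized SVD of the Halko et al. type, and PCA uses the same approach when svd_solver="randomized". The sketch below, on an illustrative count-like matrix, is only meant to show how an approximate SVD yields principal component scores; it is not the exact computation performed on the thesis data.

# Hedged sketch: approximate SVD of a centered data matrix and the resulting PC scores.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(2)
X = rng.poisson(lam=0.5, size=(10_000, 33)).astype(float)   # sparse-ish count-like matrix
Xc = X - X.mean(axis=0)                                     # center the columns, as in PCA

U, sing_vals, Vt = randomized_svd(Xc, n_components=3, n_oversamples=10,
                                  n_iter=2, random_state=0)
scores = Xc @ Vt.T            # scores on the first three principal components
print(sing_vals**2 / Xc.shape[0])   # squared singular values relate to the leading covariance eigenvalues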

2.4 High-dimensional data and dimensionality reduction

From the given data it is possible to create a very large number of new features. One alternative we will not pursue is to measure the total number of purchases and money spent for each individual product. Due to the large number of products offered, this is not a viable option. Also, it is unwise to use both the total number of purchases as well as total revenue since the total revenue is a function of the total number of purchases. Including both would increase the computational burden and introduce redundant variables in the data.

Dimensionality reduction and treating high-dimensional data is an important aspect of cluster analysis. We begin by describing some problems with using high-dimensional data. It was shown by Beyer et al. [11] that if the dimensionality of data X (i.e. the number of columns) increases, distance metrics will behave in a problematic way when determining the nearest neighbor. Since cluster algorithms are heavily dependent on distance metrics and the concept of nearest neighbors, this naturally relates to this thesis project. Even if the experimental conditions between this thesis and the setting used by Beyer et al differ, the results are still worth mentioning.

We express the results mathematically using the notation in Beyer. Specifically,

• Pm,1,...,Pm,n are n independent data points that follow the same probability distribution.

• Qm is a query point which is chosen independently of all the other points. We are interested in finding the closest neighbor to this point.

• m is the dimensionality of our data

• dm is a function that takes a data point Pm,i and a query point Qm and returns a non-negative real number. In our case, dm is a distance function

• 0 < p <∞ is a constant

• DMINm = min{ dm(Pm,i, Qm) | 1 ≤ i ≤ n }

• DMAXm = max{ dm(Pm,i, Qm) | 1 ≤ i ≤ n }

A summary of the authors’ result is given by Equation 2.42

\lim_{m \to \infty} \operatorname{var}\!\left(\frac{(d_m(P_{m,1}, Q_m))^p}{E[(d_m(P_{m,1}, Q_m))^p]}\right) = 0 \;\Longrightarrow\; \forall \varepsilon > 0: \lim_{m \to \infty} P[\mathrm{DMAX}_m \leq (1+\varepsilon)\,\mathrm{DMIN}_m] = 1 \qquad (2.42)

The last limit may equivalently be expressed as in Equation 2.43

\lim_{m \to \infty} P[\mathrm{DMAX}_m \leq (1+\varepsilon)\,\mathrm{DMIN}_m] = \lim_{m \to \infty} P\!\left[\left|\frac{\mathrm{DMAX}_m}{\mathrm{DMIN}_m} - 1\right| \leq \varepsilon\right] = 1 \qquad (2.43)

Expressed in words, Equations 2.42 and 2.43 mean that as the dimensionality of our data grows, all points will converge to being the same distance away from the query point in question. In other words, the concept of a nearest neighbor loses its meaning as m → ∞.
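
The phenomenon is easy to reproduce numerically. The following sketch draws i.i.d. Gaussian data and a query point for increasing dimension m and prints the ratio DMAXm/DMINm, which shrinks towards 1 as m grows; the particular distribution and sample size are arbitrary choices made only for illustration.

# Hedged sketch: distance concentration for i.i.d. data as the dimension m grows.
import numpy as np

rng = np.random.default_rng(3)
n = 1000                                    # number of data points
for m in (2, 10, 100, 1000):
    P = rng.normal(size=(n, m))             # data points P_{m,1}, ..., P_{m,n}
    Q = rng.normal(size=m)                  # query point Q_m
    d = np.linalg.norm(P - Q, axis=1)       # Euclidean distances d_m(P_{m,i}, Q_m)
    print(m, d.max() / d.min())             # DMAX_m / DMIN_m approaches 1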

Another view on the matter is presented by Houle et al [26], in which the work of Beyer et al was referenced. They also examined different experimental conditions which are more relevant to this thesis. Beyer et al assumed the data come from identical distributions while in Houle et al. the data were allowed to follow a mixture of different distributions. In this case it was shown that the problems caused by the curse of dimensionality were not always as severe, if they even existed. The article's main focus was to show that in the proposed experimental setting, the dimension of the data by itself was not a problem for clustering; rather, the proportion of irrelevant and redundant features was.

It is clear that dimensionality reduction and possibly feature selection must be considered. A common approach is to first perform PCA on the original data and use the principal components which account for most of the observed variation. Hopefully only a handful of principal components will be required to explain most of the variation, possibly two or three. Doing so may give a data set of drastically reduced dimensions and this new data set can then be used as input to a clustering algorithm. To see why, we consider Equation 2.44

\lim_{m \to \infty} P\!\left[\left|\frac{\mathrm{DMAX}_m}{\mathrm{DMIN}_m} - 1\right| \leq \varepsilon\right] = 1 \qquad (2.44)

When clustering on only a handful of principal components such that the new transformed data is low-dimensional, we see that the limit m → ∞ is irrelevant to that particular case and that the curse of dimensionality described by Beyer et al. is not a problem. Clustering a handful of principal components has been done in multiple works, e.g. [49, 41, 12, 38, 3].

Despite the fact that PCA is commonly used prior to clustering, it is important to note that doing so is not entirely uncontroversial and that some potential pitfalls exist. Work by Kriegel et al. [34] mentioned that PCA may not be suitable for high-dimensional data and provided a concrete example illustrating why. Moreover, it has been demonstrated by Chang that applying clustering to the first principal components may miss the true underlying clusters in data when the data was generated from two Gaussian distributions [18]. This, of course, is an argument against using the first few principal components if we assume the data arises from Gaussian distributions (as is done in GMMs). An article that continued in the same vein showed that the situation is even more complicated and that there exists no straightforward answer to whether PCA should be used prior to clustering or not. In some cases, PCA can improve cluster quality but this is not necessarily true. The authors arrived at the conclusion that the effectiveness of PCA on clustering depends on the data used, the algorithm used and the distance metric. It should be noted that in their example, the first couple of principal components could be used as input to a K-means algorithm using the Euclidean distance as dissimilarity metric. However, determining the number of principal components to use is unclear [52]. It should be noted that the conclusion from these two is not that performing clustering on the results from PCA is always necessarily bad but that doing
so will not necessarily capture the true clusters in the data. This potential pitfall is important and provides an argument against using PCA prior to clustering that one should be aware of.

Another alternative to performing dimension reduction and clustering on the principal components is presented by Brodinova et al [15]. Their proposed method is a variation of K-means that aims to tackle the situation where outliers and noise variables are present in the data. A short description of their method is as follows. First, a K-means based algorithm with a weighting function is applied to the data. The weighting function makes K-means more robust and gives observation weights that reflect which observations are outliers. In the second step, the variable weights are updated with respect to the clusters and observation weights from the first step. These two steps are repeated until the variable weights stabilize. In the final step the observations are clustered using the informative variables and the observations with low weights are classified as outliers.

2.5 t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is an algorithm designed for visualizing high dimensional data in two or three dimensions. The algorithm's purpose and steps can be summarized as follows, using the original article by Van Der Maaten and Hinton [49]. Assume we have access to high-dimensional data X ∈ R^d and that we wish to find a low-dimensional mapping Y such that the Kullback-Leibler divergence between the joint distributions P and Q (for X and Y, respectively) is minimized. In the high-dimensional space we model the similarities between data points using the joint probabilities p_ij given by Equation 2.46, and in the low-dimensional space we use the joint probabilities q_ij given by Equation 2.45. The Kullback-Leibler divergence to minimize is provided in Equation 2.47. Note that in Equation 2.46, σ_i^2 is the variance of a normally distributed random variable that is centered on the data point x_i.

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \qquad (2.45)

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / (2\sigma_i^2))} \qquad (2.46)

C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log\!\left(\frac{p_{ij}}{q_{ij}}\right) \qquad (2.47)

It is worth commenting on the choice of σ_i. Each value of σ_i induces a probability distribution P_i over all other data points, and this distribution's entropy increases with σ_i. In t-SNE, σ_i is chosen so that the perplexity of P_i, denoted Perp(P_i), matches some user-specified value. This perplexity is defined as Perp(P_i) = 2^{H(P_i)}, where H(P_i) is the Shannon entropy of P_i, i.e. H(P_i) = −Σ_j p_{j|i} log_2 p_{j|i}. In the original article by Van Der Maaten and Hinton the authors claim that the algorithm's performance is not greatly affected by the choice of Perp when values vary between 5 and 50. Outside of these values, however, performance may be greatly affected, as shown by Kobak and Berens, who use Perp = 5, 500, which gives very different outputs than when using Perp = 20, 50, 80 [33]. In the high-dimensional space we model the distance between data points using a Gaussian distribution P while in the low-dimensional space we use a Student t-distribution Q with one degree of freedom. Choosing the points y_i to minimize C is done using gradient descent with the gradient given by

\frac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1} \qquad (2.48)

The gradient update is given by the updating formula below, where Y^{(t)} is the solution at iteration t, η is the learning rate and α(t) is the momentum at iteration t.

Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right) \qquad (2.49)

The optimization is often performed using a trick called "early exaggeration". This is done by multiplying all the p_ij by some constant in the early stages of the optimization. The effect of this is that the data's natural clusters form widely separated clusters in the mapping, making it easier to find a good global organization of
the clusters [49]. The algorithm is summarized in Algorithm 4. As in the original article, the early exaggeration is not included in the pseudo-code.

Algorithm 4: Algorithm for producing a visualization by t-SNE

Input: high-dimensional data X, perplexity Perp, number of iterations T, learning rate η, momentum α(t)
Output: low-dimensional data Y = (y_1, ..., y_n)

1. Compute pairwise affinities p_{j|i} with perplexity Perp
2. Set p_{ij} = (p_{j|i} + p_{i|j}) / (2n)
3. Draw Y^{(0)} from a N(0, 10^{-4} I) multivariate Gaussian distribution
4. For t = 1, 2, ..., T:
   (a) Compute q_{ij}
   (b) Compute δC/δY
   (c) Update Y^{(t)} = Y^{(t-1)} + η δC/δY + α(t)(Y^{(t-1)} - Y^{(t-2)})

A few parameters must be set when using t-SNE, specifically the early exaggeration factor, the learning rate η and the perplexity Perp. There is very little analytical work done regarding the choice of these parameters and what little there is, is very new. In other words, there exists no widely acknowledged body of work that can guide us, but we can use the articles published in recent years. In this thesis the standard parameter values in the scikit-learn implementation were tested, i.e. the perplexity was set to 30, the early exaggeration to 12 and the learning rate to 200. In addition to this, the parameter choices recommended by Linderman and Steinerberger [36] were examined as well. Specifically they recommend η ∼ 1 and that the exaggeration factor (denoted β here) should satisfy β ∼ n/10, where n is the number of data points in the sample. The important point is to choose β and η so that βη ∼ n/10 is satisfied. This approach has been used in astronomy, in which the authors set η ∼ 1 and β ∼ n/10 [4]. In this thesis, using these parameter values sped up the convergence of t-SNE threefold.
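
For concreteness, the two settings discussed above can be expressed in scikit-learn as follows. The data matrix X_pca is a placeholder for the transformed customer features, and the seeds and sample sizes are illustrative rather than the ones used to produce the figures in this thesis.

# Hedged sketch: the default scikit-learn settings versus Linderman & Steinerberger-style settings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
X_pca = rng.normal(size=(2000, 3))          # placeholder for the real (transformed) data
n = X_pca.shape[0]

# Standard settings: perplexity 30, early exaggeration 12, learning rate 200
Y_default = TSNE(n_components=2, perplexity=30, early_exaggeration=12,
                 learning_rate=200, random_state=0).fit_transform(X_pca)

# Linderman & Steinerberger-style settings: learning rate ~ 1, exaggeration ~ n/10
Y_ls = TSNE(n_components=2, perplexity=30, early_exaggeration=n / 10,
            learning_rate=1, random_state=0).fit_transform(X_pca)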

Finally we can discuss some differences between PCA and t-SNE. As discussed in Section 2.3, PCA is a linear technique which seeks to find the linear combinations of the original variables that explain most of the variation. Due to its linearity it may be limited in some cases. t-SNE, on the other hand, produces projections to a low-dimensional space which aim to preserve local neighborhood structure (while ignoring global structure) rather than explain linear variation in the data. Since PCA is limited to these linear projections while t-SNE is not, t-SNE is able to capture non-linear characteristics of the data.

3 Case study and methods

3.1 Outline of method

Before diving into the details of what steps were taken and why, a general outline of the method used may be helpful.

1. Feature engineering/Exploratory data analysis: Before any mathematics can be done, the data to be used must be created. Since no features are immediately available they must all be created by feature engineering. In this particular case, the level of granularity of product groupings needed to be determined. This entailed a feedback loop of creating features, analyzing them and refining the created ones.

2. Data pre-processing: Data is usually pre-processed in some way prior to being used as input to a clustering algorithm. A square-root transformation was used in this thesis.

3. Visualization of high-dimensional data: Exploratory data analysis may be helpful and provide valuable insights on how to tackle the question of clustering, but histograms and density plots in two dimensions can only take us so far. To get a better understanding of the data we need to employ more sophisticated methods.

4. Dimensionality reduction: In order to mitigate damage caused by the curse of dimensionality, PCA was employed to reduce the dimension of the data.

5. Optimizing individual clustering algorithms: Different clustering algorithms have different parameters that can be varied, e.g. K in K-means or the covariance structure in MCLUST or tEIGEN models.

6. Selecting the best partition: Using the silhouette index, the Davies-Bouldin index and subject matter knowledge, the results from using different clustering algorithms can be compared. The best algorithm and the optimal parameter combination can be chosen here.

7. Analyze the results from the best algorithm: Once we have chosen the best-performing algorithm and configuration, we want to understand what patterns have been discovered in the data.

3.2 Exploratory data analysis and feature engineering

3.2.1 Feature engineering

The subject of feature engineering for clustering is not always straightforward. Often, subject matter knowledge is required to complement the ideas generated by purely mathematical considerations. In this particular case, a balance was needed between the granularity of product groupings (i.e. how specific the description of each customer is) and the dimensionality of the data set. The process of feature engineering was not entirely straightforward as it often entailed some exploratory data analysis followed by feature engineering and then some more exploratory data analysis. A full retelling of all the steps will not be provided, but an illustrative example can be found in Section 3.2.2.

The original data set consists of transactions that can be tied to individual customers. Each transaction provides details about what the customer bought, how much they bought and when they bought it, among other things. A clear hierarchy for grouping products exists, with three levels. The least granular one is called level 1, the middle one is called level 2 and the most granular one is called level 3. Choosing an appropriate level of granularity for different product groups is a matter of subject matter knowledge as well as mathematical considerations since it affects the dimensionality of the data. After some discussion and exploratory data analysis it was decided that the level 1 product groups with very low variation were kept at level 1 while the groups with the most variation were kept at level 2. Also, the most unpopular product groups at level 2 were combined into a larger category. For each such product group we noted the total number of units purchased by each customer; these were the features used in clustering. Thus, only numerical features were used.
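
A minimal sketch of this feature construction step is given below. The column names (customer_id, product_group, quantity) and the toy transaction table are assumptions made for illustration; the real data follows the retailer's own schema.

# Hedged sketch: building a customer x product-group count matrix from transactions.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id":   [1, 1, 2, 3, 3, 3],
    "product_group": ["x1", "x3", "x3", "x1", "x2", "x3"],
    "quantity":      [2, 1, 5, 1, 4, 2],
})

features = transactions.pivot_table(index="customer_id",
                                    columns="product_group",
                                    values="quantity",
                                    aggfunc="sum",
                                    fill_value=0)   # customers without a purchase in a group get 0
print(features)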

3.2.2 Exploratory data analysis

To illustrate the interplay between feature engineering and exploratory data analysis we consider Figure 3.1, which illustrates the sparsity of the features in the data set. Specifically, we consider the percentage of values
that take the value 0 among the 45 different product groups at hierarchy level 2.

Figure 3.1: Illustration of the sparsity in the data set

As we can see from Figure 3.1, quite a lot of the different product groups contain a large percentage of zeros, i.e. they are bought by a very small percentage of the customers. This was handled by combining the most unpopular categories into aggregated, larger ones. This approach has some precedent as it has been used in a previous master thesis project in Sweden [9]. Combining the sparsest features had to be weighed against subject matter knowledge, which in the end resulted in 33 product groups. Specifically, the variables x20, x36, x38, x29, x35, x32, x42, x43 and x45 were combined into a single aggregated variable that will be referred to as a1, which contained the most unpopular products. Furthermore x6, x13, x14, x15, x16 and x21 all belong to the same level 1 product grouping and were aggregated into a single feature called a2. The other product groups were kept separate. Performing these steps reduced the number of zeros in the data set but the data matrix was still very sparse.
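
The aggregation itself is a simple column operation. The sketch below reproduces the bookkeeping described above on a randomly generated stand-in for the count matrix; the column lists follow the text, everything else is illustrative.

# Hedged sketch: merging the sparsest product groups into the aggregate features a1 and a2.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
cols = [f"x{i}" for i in range(1, 46)]                          # 45 level-2 product groups
features = pd.DataFrame(rng.poisson(0.5, size=(100, 45)), columns=cols)

a1_cols = ["x20", "x36", "x38", "x29", "x35", "x32", "x42", "x43", "x45"]   # most unpopular groups
a2_cols = ["x6", "x13", "x14", "x15", "x16", "x21"]             # same level 1 grouping

features["a1"] = features[a1_cols].sum(axis=1)
features["a2"] = features[a2_cols].sum(axis=1)
features = features.drop(columns=a1_cols + a2_cols)             # remaining groups plus a1 and a2
print(features.shape)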

Other aspects we may wish to investigate are correlations between variables and their distributions in order to get a better understanding of the data. Correlations are illustrated in Figure 3.2, in which we see several moderately positive correlations in the data and no negative ones. These correlations are not a big concern from a clustering perspective since PCA will be performed prior to clustering, meaning the variables used for clustering will be uncorrelated. Even if PCA were not used the correlations would not be a major problem since the algorithms used in this thesis are not greatly affected by correlations between the features. In the case of the probabilistic clustering algorithms in Sections 2.1.3 and 2.1.4, correlation may even be incorporated into the models.

Figure 3.2: Correlations between the product groups

Other graphical illustrations for understanding the data may be used. We begin by illustrating some distributions within the level 1 product groups by using histograms. In Figure 3.3 we see the distribution of level 2 product groups within a given level 1 product group. As we can see, there is some clear variation between level 2 product groups. In Figure 3.4, on the other hand, there is much less variation as this is a less popular segment overall. It should be noted that these histograms do not show the full range of possible values of each individual variable. Due to the presence of outliers (as seen in Figure 3.5), histograms including all the values become difficult to understand. For this reason, appropriate cut-off values were used in order to illustrate the distributions. For the other level 1 product groups, similar patterns emerge. The distributions are heavily concentrated at low values with long right tails, meaning that most of the customers have made very few recorded purchases.

Figure 3.3: Histograms showing the distribution of products in a given level 1 product group

Figure 3.4: Histograms showing the distribution of products in a given level 1 product group

From Figure 3.3 we notice that there are small peaks for large values on the horizontal axis, i.e. a few customers make very many purchases. This hints at the presence of several big spenders and that the data may be skewed. This is examined closer in Figure 3.5. As the plot clearly shows, several of the product categories have values which are clearly outliers. A natural question is what should be done about these customers. Should they be removed? Should they be kept? Should the data be transformed? In this thesis, they were kept since they are all legitimate data points and also the most interesting points from the client's perspective. This means that the skewness must be handled in some way since it may affect clustering results and silhouette indices, as discussed in Section 3.3.

Figure 3.5: Box plots that illustrate the presence of big spenders

3.3 Data pre-processing and dimensionality reduction

The skewness of the data is of special importance in this thesis since the silhouette index is used to evaluate cluster quality. The data set shows signs of very uneven distributions in some of the features, i.e. there are customers who spend much more than the average customer in those product categories. These customers
contribute most of the total purchases made. It may be tempting to supply this data matrix as input to a clustering algorithm and evaluate the results, but Rousseeuw warns against doing so [44]. We have data points whose feature values are much larger than average, so most other points will seem close to each other in comparison to the highest spenders and thus return very high average silhouette values, even if they are not very similar at all! This is exactly what happens if K-means clustering with K = 2 is run on the data set (see Figure 3.6), in which the average silhouette value is 0.794. Although very high silhouette values are obtained, the clustering partition did not make very much sense. In the largest cluster there were several customers with very different purchasing behavior that were clumped together nonetheless; specifically, customers who had barely made any purchases ended up in the same cluster as many habitual customers, simply because they are not among the top spenders.

Figure 3.6: Illustration of silhouette values in the presence of outliers for K-means clustering with K = 2

A way of understanding this phenomenon is to think of a group of customers who have some differences in how they like to spend their money at the retail chain. Now we further imagine that a multimillionaire with lavish spending habits at our retail chain of interest decides to regularly shop at the same chain as the other customers. Due to the sheer amount of purchasing power, the millionaire will stand out as belonging to a separate cluster and the other customers will seem very homogeneous in comparison even though they may in fact be very different. This is only an illustrative example, but the issue of high spenders remains in our data and must be addressed. Since these customers are legitimate observations and of interest to the client, they were kept and the skewness was handled using a transformation of the data.

The chosen way in this thesis was to transform all of the elements of the data matrix using some function f(x) whose derivative f'(x) is decreasing in x. Common transformations include ln(x), ln(x + 1) or a Box-Cox transformation, but f(x) = √x was preferred in this thesis. All these transformations were considered and √x was the alternative that gave the most reasonable results. Using this transformation helped mitigate the issue and gave more reasonable cluster solutions. As we will see in the results, the cluster solutions still give one cluster that is clearly larger than the others, but this is reasonable considering the spending habits shown in Figures 3.1, 3.3 and 3.5, since a minority of the customers make a majority of the purchases.

Typically, some sort of standardization is employed prior to clustering if the features used exist on very different orders of magnitude or are measured in different units. In this case all the features are measured in the same units (number of purchases) and after the square-root transformation was applied they all had very similar orders of magnitude. For this reason no standardization was employed prior to applying principal component analysis, which was done in order to reduce the dimensionality of the data set. Three principal components were used; these explained 95% of the variance in the data.
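
A minimal sketch of this pre-processing, assuming the count matrix from Section 3.2 is available as a NumPy array, is shown below; the placeholder data and the random seed are illustrative.

# Hedged sketch: square-root transform followed by PCA with three components (no standardization).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
counts = rng.poisson(lam=2.0, size=(5000, 33)).astype(float)   # placeholder for the customer counts

X = np.sqrt(counts)                              # dampens the influence of the biggest spenders
pca = PCA(n_components=3, svd_solver="randomized", random_state=0)
X_pca = pca.fit_transform(X)                     # input to the clustering algorithms
print(pca.explained_variance_ratio_.sum())       # about 0.95 was reported for the real data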

3.4 Visualization

Histograms and density plots may reveal interesting and useful information but they only show limited aspects of the data. In order to understand the data a bit better we employ t-SNE to visualize the high dimensional data to see if any interesting patterns can be discerned.

t-SNE has a number of different parameters that can be varied, specifically the perplexity, learning rate and early exaggeration factor. There are no set rules for which values to choose, so a bit of experimentation was employed. Initially the standard settings in the scikit-learn software were used, i.e. perplexity 30, early exaggeration 12 and learning rate 200. Since t-SNE is not a deterministic algorithm, each run of the algorithm produces different visualizations. The interesting question is whether these different outcomes share some similarities. We turn to Figure 3.7, in which t-SNE was applied to the data where unpopular categories had been merged and the square root had been taken of all elements in the data matrix.

Figure 3.7: Result of running t−SNE with standard settings

Initially it is hard to detect clearly separated clusters, with the exception of the small group of points separated from the larger group. However, we do not know what this image tells us. To get a better understanding we will color the points according to how many purchases they have made in the product group x3, which results in Figure 3.8. In this figure, customers were assigned a class as given in Table 3.1.

Class assignment    Interval for x3
0                   x3 < 15
1                   15 ≤ x3 < 100
2                   100 ≤ x3 < 500
3                   500 ≤ x3 < 900
4                   x3 ≥ 900

Table 3.1: Class assignments based on x3

Figure 3.8: Result of running t−SNE with standard settings, now colored according to x3

From Figure 3.8 we see that most of the customers who have made few purchases in the x3 category are similar to one another but that there is some overlap with the other class assignments. In this figure it seems that some of the customers who make many x3 purchases are similar to ones that make very few. Another way of analyzing the result is possible. We color the points according to the total number of purchases made (here denoted t), to see if the different spending segments are appropriately grouped. For this purpose we use the class assignment as in Table 3.2 and present the visualization in Figure 3.9a, in which we see that t-SNE is able to separate the low, mid and high spenders. Some mishaps occur but overall the algorithm seems able to separate the data into reasonable groups.

Class assignment    Interval for total number of purchases made
0                   t < 50
1                   50 ≤ t < 100
2                   100 ≤ t < 500
3                   500 ≤ t < 600
4                   t ≥ 600

Table 3.2: Class assignments based on the total number of purchases


Figure 3.9: Result of running t-SNE with standard settings, now colored according to the total amount of purchases made. 3.9a) Result of one run. 3.9b) Result of another run.

Since each run of t-SNE gives different results we naturally may wish to inspect the results of a different run. The most important part here is to see if the same patterns as in Figures 3.9a and 3.9b appear. By inspecting these we arrive at the same conclusion as previously, namely that t-SNE is able to separate the different customers according to the total number of purchases made.

Figure 3.10: Result of running t-SNE with standard settings, colored according to the total amount of purchases made in x3

Another set of parameter choices used are the ones recommended by Linderman and Steinerberger [36], η ∼ 1 and β ∼ n/10, where n is the number of data points in the sample, as done by Friedrich et al [4]. This sped up the computations considerably and produced very different visualizations. If we turn our attention to Figures 3.11a and 3.11b we see that these parameter settings are also able to find the groups of customers according to the different spending segments.


Figure 3.11: Results of running t-SNE with Linderman and Steinerberger's settings, colored according to the total amount of purchases. 3.11a) Result of one run. 3.11b) Result of another run.

3.5 Optimizing individual algorithms

Different algorithms have different parameters or settings that can be adjusted, e.g. the number of clusters K in K-means clustering, which will give different silhouette plots, average silhouette values and values of the Davies-Bouldin index. The optimal number of clusters found in this stage will act as guidance for the number of clusters to investigate in the other algorithms. For example, if we notice that K = 10 is the best choice for K-means, then we will consider using 9, 10 or 11 as the number of components in the mixture models. In other words, K-means is used as a guide toward finding the number of clusters and once we have a rough idea of how many clusters might exist we employ more sophisticated clustering algorithms. The exception is NMF, in which we investigate the same numbers of clusters as in K-means clustering.

In the mixture models used, both the number of clusters and the covariance structure of the clusters can be varied. Different configurations of these will produce different silhouette values and values of the Davies-Bouldin index. It should be noted that this thesis strays off the beaten path when it comes to model selection for mixture models. It is more common to perform model selection based on statistical criteria, e.g. BIC [45].

The choice of the best clustering partition will be made on the basis of silhouette values, the Davies-Bouldin index and subject matter knowledge.

3.6 Software and hardware used

3.6.1 Hardware used

All the work in this thesis was performed on a HP EliteBook using the operating system 64-bit Windows 10 Pro with an Intel Core i7-5500U CPU @ 2.40GHz and 16GB of RAM.

3.6.2 Software used

This thesis used the programming languages R (versions 3.5.2, 3.5.3 and 3.6.0) [43] and Python (version 3.7.2, 2019). Data handling and visualization were performed in R using the packages dplyr [2], data.table and ggplot2 [1]. Both languages were used for the computational aspects of this thesis. Specifically, Python was used for clustering and computation of silhouette values for K-means clustering and NMF; this was done using the scikit-learn package [42]. When clustering was done using mixture models, both languages had to be used. In R, the packages mclust [45] and teigen [5] were used for the GMMs and t-distributed clustering, respectively. Silhouette and Davies-Bouldin indices were computed using the scikit-learn [42] package in Python. t-SNE was performed using the implementation in the Python package scikit-learn.

4 Results

4.1 K-means

The general trend when varying K in K-means clustering is that the average silhouette score decreases as K increases. Solely relying on the average silhouette score for selecting K would thus lead to selecting a solution in which only two clusters exist, as motivated by the values in Table 4.1. However, after discussion with the client, their subject matter knowledge indicated that three or four clusters are more reasonable solutions. In this case, K = 3 was agreed upon as the most reasonable solution. Relying on subject matter expertise in conjunction with average silhouette scores has a precedent in previous studies, e.g. as done by du Toit et al [47]. When looking at the Davies-Bouldin scores for different K in Table 4.2, we also reach the conclusion that K = 3 is a good number of clusters.

K    Average silhouette score
2    0.626
3    0.566
4    0.535
5    0.508
6    0.485
7    0.465

Table 4.1: Average silhouette scores for different K in K-means clustering

We also include the silhouette plot obtained when using three clusters in Figure 4.1a as well as seven clusters in Figure 4.1b to give a general idea of how they looked for the different K. The silhouette plots for the other K are available in Appendix A.1.


Figure 4.1: Silhouette plots for K = 3 and K = 7 in K-means. 4.1a) K = 3. 4.1b) K = 7

For the other values of K, we can also notice a single cluster that contains most of the observations. As K increases, this cluster is divided bit by bit into smaller clusters; this reduces the average silhouette as the new clusters are relatively weak. These silhouette plots are available in Appendix A.1.

K    Davies-Bouldin score
2    0.596
3    0.596
4    0.612
5    0.629
6    0.651
7    0.671

Table 4.2: Davies-Bouldin scores for different K in K-means clustering.

4.2 MCLUST

For MCLUST, we limited the number of clusters investigated. The results from using K-means indicate that using only a handful of clusters is the best choice. In this thesis we limited the number of clusters to the set {2, 3, 4}. Different models were tried using these values, but only a few produced results while many models suffered from convergence or singularity problems when using 3 components. These models were discarded from
further analysis and only the ones that converged when using three components were analyzed further when using two or four components. These models were EII, VII, EEI, VEI, EEE and EEV, whose covariance structures are presented in Table 4.3. The average silhouette values for 2, 3 and 4 components for each model are presented in Table 4.4 and the Davies-Bouldin values in Table 4.5.

Model    Σk            Distribution    Volume      Shape    Orientation
EII      λI            Spherical       Equal       Equal    -
VII      λkI           Spherical       Variable    Equal    -
EEI      λA            Diagonal        Equal       Equal    Coordinate axes
VEI      λkA           Diagonal        Variable    Equal    Coordinate axes
EEE      λDAD^T        Ellipsoidal     Equal       Equal    Equal
EEV      λDkADk^T      Ellipsoidal     Equal       Equal    Variable

Table 4.3: MCLUST models that converged

Model    K = 2    K = 3    K = 4
EII      0.638    0.576    0.545
VII      0.546    0.493    0.459
EEI      NA       0.651    0.540
VEI      0.555    0.475    0.451
EEE      NA       0.566    0.521
EEV      0.160    0.159    0.176

Table 4.4: Average silhouette values for the different MCLUST models with varying number of components

Model    K = 2     K = 3    K = 4
EII      0.572     0.589    0.610
VII      0.649     0.635    0.638
EEI      NA        0.508    2.320
VEI      0.668     0.646    0.650
EEE      NA        0.639    3.643
EEV      2.0358    3.416    2.459

Table 4.5: Davies-Bouldin scores for the different MCLUST models with varying number of components

Special attention needs to be paid to EEE and EEI. In both cases, fitting a model with K = 2 components resulted in all the data points being assigned to one cluster. In this case, neither the silhouette index nor the Davies-Bouldin index can be computed and the clustering is a trivial one. Furthermore, when K = 3 these two models only assigned observations to two clusters. Also, we can see from Figure A.3 that for EEI with K = 3 nearly all the points ended up in the same cluster, which is not a helpful partition. Setting these solutions aside then, we conclude that EEV is clearly the worst performing algorithm. EII is the best performing one despite having worse silhouette and Davies-Bouldin scores than EEI, since the EEI solution for K = 3 produces a partition that does not make sense. Silhouette plots for the EII model are presented in Figure 4.2, the rest can be found in Appendix A.2.
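
The thesis fitted these models with the R package mclust, so the following Python sketch is only a rough analogue: scikit-learn's GaussianMixture with covariance_type="spherical" gives each component its own spherical covariance (closest to VII; the equal-volume EII constraint has no direct counterpart in scikit-learn). It is included to illustrate how such a model can be fitted and scored, not to reproduce the reported numbers.

# Hedged sketch: a spherical Gaussian mixture on the PCA scores, evaluated with the same indices.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(8)
X_pca = rng.normal(size=(2000, 3))              # placeholder for the principal component scores

gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      n_init=5, random_state=0).fit(X_pca)
labels = gmm.predict(X_pca)
print("BIC:", gmm.bic(X_pca))                   # the more conventional model-selection criterion
print(silhouette_score(X_pca, labels), davies_bouldin_score(X_pca, labels))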


Figure 4.2: Silhouette plots for different numbers of components K of the best MCLUST model EII. 4.2a) K = 2. 4.2b) K = 3. 4.2c) K = 4.

4.3 tEIGEN

Regarding the number of components used in the tEIGEN models, the reasoning was the same as for MCLUST; in other words, we begin by trying to fit models using three clusters. Those that have convergence or singularity problems are discarded from further analysis and the remaining ones are examined using two and four components as well. The models which did not suffer from convergence problems were CIIC, CIIU, UIIC, UIIU, CICC and UUCU, whose covariance structures are given in Table 4.6.

Model    Σk            νk
CIIC     λI            C
CIIU     λI            U
UIIC     λkI           C
UIIU     λkI           U
CICC     λA            C
UUCU     λkDkADk^T     U

Table 4.6: Models in the tEIGEN-family that converged

These tEIGEN models were also tested with 2, 3 and 4 components and the silhouette values for each model were computed. These average values are presented in Table 4.7 and the computed Davies-Bouldin values in Table 4.8.

Model    K = 2    K = 3    K = 4
CIIC     0.322    0.215    0.286
CIIU     0.544    0.413    0.389
UIIC     0.502    0.393    0.242
UIIU     0.468    0.361    0.288
CICC     0.459    0.308    0.327
UUCU     0.502    0.372    NA

Table 4.7: Average silhouette values for the different tEIGEN models with varying number of components

Model    K = 2    K = 3    K = 4
CIIC     0.959    1.930    3.850
CIIU     0.673    2.395    4.734
UIIC     0.704    0.753    1.684
UIIU     0.722    0.768    3.193
CICC     0.833    1.768    3.957
UUCU     0.682    4.056    NA

Table 4.8: Davies-Bouldin scores for the different tEIGEN models with varying number of components


Figure 4.3: Silhouette plots for different numbers of components K of the best tEIGEN model. 4.3a) K = 2. 4.3b) K = 3. 4.3c) K = 4.

In the case of modeling with UUCU, the parameter estimates did not converge for K = 4. Overall, we see that the tEIGEN models perform worse than the MCLUST and K-means models. Naturally, only looking at average values does not tell the whole story and for this reason the silhouette plots are available in Appendix A.3. For now, only the best performing model CIIU is presented in Figure 4.3.

4.4 NMF

Overall, NMF performed very poorly. The general trend is that it always created a single extremely large cluster in which the data points had low silhouette values. Out of the clustering algorithms examined in this thesis it was clearly the worst performing. Only one silhouette plot is included, in Figure 4.4; Table 4.9 shows the average silhouette scores when using different numbers of clusters and Table 4.10 lists the Davies-Bouldin indices. Despite the misleadingly high average silhouette value for K = 3 clusters, we clearly see that NMF is not a suitable clustering algorithm for this data. Other silhouette plots are available in Appendix A.4.
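
For completeness, a minimal sketch of NMF-based clustering in scikit-learn follows. One common convention, assumed here and not necessarily identical to the exact procedure used in this thesis, is to assign each customer to the component with the largest weight in the factor matrix W; note that the non-negative (square-root transformed) data is used directly, without PCA.

# Hedged sketch: cluster labels from NMF via the largest component weight.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(9)
X = np.sqrt(rng.poisson(lam=2.0, size=(2000, 33)).astype(float))   # non-negative placeholder data

model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)             # customer x component weight matrix
labels = W.argmax(axis=1)              # hard cluster assignment
print(silhouette_score(X, labels))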

Figure 4.4: Silhouette plot obtained when using 3 clusters in NMF

Number of clusters    Average silhouette score
2                     0.103
3                     0.527
4                     -0.182
5                     -0.213
6                     -0.214
7                     -0.260

Table 4.9: Average silhouette values when clustering using NMF

Number of clusters    Davies-Bouldin score
2                     0.564
3                     1.869
4                     3.904
5                     4.216
6                     4.420
7                     4.543

Table 4.10: Davies-Bouldin values when clustering using NMF

4.5 Choice of algorithm

Naturally, we are interested in seeing if selecting the best clustering solution on the basis of the Davies-Bouldin index differs from selecting it based on the silhouette index. We provide an overview of the choices made depending on the index used.

We begin by noticing some similarities between the results from the silhouette and Davies-Bouldin indices. For K-means we see a clear trend in both that increasing K lowers the cluster quality according to both indices. In both cases K = 2 is the best option according to the indices, but using a bit of critical thought and subject matter knowledge leads to the conclusion that K = 3 is a more suitable alternative. In this case, the choice of K would be the same regardless of which metric one prefers. For the MCLUST models, we also see that the model EEI with K = 3 components leads to the optimal value of the respective indices. However, by studying the silhouette plot in Figure A.3 we see that the clustering is not quite satisfactory. Most observations are placed into a single cluster and the remaining ones seem to be misclassified according to the silhouette values. Both the Davies-Bouldin and silhouette index indicate that EEI with K = 3 is the best choice, but by applying a bit of critical thought we quickly realize that the model EII with K = 3 is a more suitable choice since it produces validation indices comparable to EEI, has fewer poorly clustered observations and makes more sense from the perspective of subject matter knowledge. For the tEIGEN models we see that the choice of model would
be the same regardless of whether the choice was based on the silhouette or the Davies-Bouldin index; both indices indicate that the CIIU model with K = 2 components is the best one. Finally we turn our attention to the results from NMF, in which we see a general trend of decreasing clustering quality as the number of clusters increases. The difference in this case occurs when using three clusters, which is the optimal value according to the silhouette scores, while two clusters is optimal according to the Davies-Bouldin index.

Now we must select the solution which seems to be best. This is done on the basis of both validation indices and subject matter knowledge. According to the silhouette index, K-means (or equivalently the EII MCLUST model) with three clusters is the best choice, while a choice according to the Davies-Bouldin index would select the results from NMF when using two clusters. However, by studying Figure 4.4 we see that this solution is not appropriate since nearly all observations end up in the same cluster. By inspecting Table 4.2 we see that K-means with K = 3 gives index values close to the very best one. The same is seen in Table 4.5. Based on these observations, K-means and MCLUST using the EII model are deemed the best options. In fact one may see them as the same, since the model EII is actually identical to K-means clustering in the sense that they solve the same optimization problem [17]. Clearly, we need to make a choice between EII and K-means. On the one hand, a GMM provides greater inferential capabilities than the results from K-means. The results from K-means can be analyzed graphically and possibly with the help of non-parametric statistics, while a parametric model allows the usage of covariances, confidence intervals and other tools that are possible for a given parametric distribution. On the other hand, K-means does not explicitly assume that the features are normally distributed, so moving away from this assumption-based way of thinking may be more robust.

Since MCLUST assumes that the features follow a Gaussian distribution, we naturally want to investigate whether this assumption seems reasonable. It is possible that the normality assumption does not hold and that pure chance resulted in a GMM producing results this good, in which case it would be unwise to rely on inference done by the GMM. To determine this, density plots of the variables used in clustering (the principal components) within each cluster are shown in Figures 4.5-4.7. It is easy to see that the principal components do not follow Gaussian distributions in any of the clusters, so relying on normality assumptions is unwise. For a more formal statistical approach one could employ e.g. a Kolmogorov-Smirnov test of normality, but in this case it is unnecessary since the test would not tell us anything we do not already see from the figures.
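Should such a formal check nevertheless be desired, a minimal sketch is given below. The names X_pca and labels are placeholders (not the thesis code) for the principal component scores and the K-means cluster labels.

import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_pca = rng.normal(size=(1000, 3))  # placeholder for the PCA scores
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)

for c in np.unique(labels):
    for j in range(X_pca.shape[1]):
        x = X_pca[labels == c, j]
        # KS test against a normal with the cluster's own mean and standard deviation;
        # estimating the parameters from the same data makes the p-values approximate
        stat, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
        print(f"cluster {c}, PC{j + 1}: KS statistic {stat:.3f}, p-value {p:.3g}")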

Figure 4.5: Density plots of the principal components in cluster 0


Figure 4.6: Density plots of the principal components in cluster 1

Figure 4.7: Density plots of the principal components in cluster 2

After finally arriving at the conclusion that K-means is the best choice, we note that the cluster sizes are 6942, 43168 and 24726.


5 Analysis

5.1 Analysis of distributions in clusters

The goal of customer segmentation is to understand customer behavior, not merely to maximize silhouette scores through hyperparameter tuning and algorithm selection. Naturally, we therefore wish to understand the customer segments that have been found. This can be done in different ways; the first step taken here is to analyze density plots of individual features and compare the different customer segments, as sketched below. This is done for a handful of features to illustrate the findings; plots of the other features lead to similar conclusions.
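A minimal sketch of such a per-cluster comparison is given below, assuming a DataFrame df with a feature column such as x3 and a cluster column holding the K-means labels. The data here are placeholders and the cut-off value of 50 is arbitrary, not taken from the thesis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({"x3": rng.exponential(10, 3000),          # placeholder counts
                   "cluster": rng.integers(0, 3, 3000)})      # placeholder labels

# one density curve per cluster, cut off so that big spenders do not distort the plot
sns.kdeplot(data=df[df["x3"] < 50], x="x3", hue="cluster", common_norm=False)
plt.title("Density of x3 within each cluster (cut off at x3 < 50)")
plt.show()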

Figure 5.1: Density plots of x3 and x4 in the different clusters. 5.1a) Density function of x3 within the three different clusters. 5.1b) Density function of x4 within the three different clusters.

The density plots have been cut off at chosen values of their respective variables to make them readable. The presence of outliers distorts the density plots, but we still wish to determine how these customers have been clustered. For this purpose we use box plots as a complement to the density plots; with the help of both we can get a better understanding of how the variables x3 and x4 are distributed. These two are chosen as examples, the other variables show similar results.

Figure 5.2: Boxplots of x3 and x4 in the different clusters. 5.2a) Boxplot of x4 within the three different clusters. 5.2b) Boxplot of x3 within the three different clusters.

As we can see in Figures 5.1a, 5.1b, 5.2b and 5.2a, the clusters seem to differ with respect to how much the customers spend in total. In other words, K-means seems to separate the customers into different levels depending on the number of purchases they have made. This is perhaps not the most groundbreaking of findings, but it at least gives an indication that K-means can reveal the most obvious patterns in the data. We see that cluster 0 corresponds to the big spenders, cluster 1 to the low spenders and cluster 2 to the mid-level spenders. To determine if we have been able to find any hitherto unknown patterns we can instead look at percentages spent within product categories. Expressed more clearly, we will analyze a slightly changed data set: rather than inspecting the features used for clustering, we divide each row in the data matrix X by its row sum. This yields a new matrix X′, where the element x′ij is the percentage of purchases made by customer i in the j:th product category. By comparing these percentages between customer segments we can analyze whether the different spending segments have markedly different preferences. We begin by looking at the mean percentages within each cluster, see Figure 5.3.
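A minimal sketch of this row normalization, assuming X is the customer-by-category count matrix (the data below are placeholders, not the thesis data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(5, 8)).astype(float)   # placeholder count matrix

# divide each row by its row sum so that x'_ij becomes the share of customer i's
# purchases that fall in category j; rows with zero purchases are left as zeros
row_sums = X.sum(axis=1, keepdims=True)
X_prime = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
print(X_prime.sum(axis=1))  # each non-empty row sums to 1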

Figure 5.3: Radar charts showing the mean percentages spent within each product category for the three clusters. 5.3a) Cluster 0. 5.3b) Cluster 1. 5.3c) Cluster 2.

It is easy to see that product category x3 is by far the most popular product category in all three customer segments. This is perhaps not surprising, as it was also the product category with the lowest percentage of sparsity, see Figure 3.1. Momentarily ignoring x3, we turn our attention to the other variables used to see if there are clear differences in mean values.

Figure 5.4: Radar chart showing the mean percentage spent within each product category (excluding x3) for the three clusters.

We see in Figure 5.4 that some differences exist but that the mean values overall seem to be very similar across customer segments. Overall, it cannot be said that the mean percentage values give evidence of any major differences. However, mean values are only point estimates and it may be helpful to consider distributions. For this reason we make use of density plots of the percentages within each cluster. The graphical analysis is limited to a few variables, specifically the ones with the highest variances. These features can be identified from Figures 5.5, 5.6 and 5.7.
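A minimal sketch of how these per-cluster variances can be computed and the highest-variance categories identified, assuming the row-normalized shares X_prime and the K-means labels from before (placeholder data and column names, not the thesis data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_prime = rng.dirichlet(np.ones(8), size=3000)   # placeholder category shares
labels = rng.integers(0, 3, size=3000)           # placeholder cluster labels

cols = [f"x{j}" for j in range(1, 9)]
shares = pd.DataFrame(X_prime, columns=cols)
shares["cluster"] = labels

# variance of each category share within each cluster, then the four most
# variable categories per cluster
variances = shares.groupby("cluster").var()
print(variances.apply(lambda row: row.nlargest(4).index.tolist(), axis=1))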

Figure 5.5: Variances of the percentages in cluster 0

Figure 5.6: Variances of the percentages in cluster 1

Figure 5.7: Variances of the percentages in cluster 2

From the variance plots it is clear that the features with the most variance across all three segments are x3, x4, x5 and x8. Hence, we will investigate the differences between the distributions of these more closely, see Figures 5.8a, 5.8b, 5.8c and 5.8d.


Figure 5.8: Density plots of percentages spent in x3, x4, x5 and x8 in the three clusters. 5.8a) x3 percentages. 5.8b) x4 percentages. 5.8c) x5 percentages. 5.8d) x8 percentages.

The clearest differences seem to exist in x3 and x4, while the distributions of x5 and x8 seem to be very similar across the clusters.


6 Discussion

6.1 Summary

A brief summary of the project is in order. The initial data set consisted of transactions that could be tied to specific customers. A suitable level of the product grouping hierarchy was decided upon and the transactions were then aggregated for all customers, leading to a data matrix in which each row represents a customer and each column entry in that row represents the number of purchases by the customer in a particular product group. Due to the sparsity in the data set, some columns were combined to form larger product groups. The presence of big spenders in most product groups made a transformation of the data necessary; specifically, the transformed data was created by computing the square root of all the elements in the original data matrix. In order to avoid the harmful effects of the curse of dimensionality, principal component analysis was performed and the first three components were used for clustering with K-means and the mixture models. When clustering with NMF, no PCA was performed.
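For reference, a minimal sketch of this two-step procedure (square-root transform, PCA down to three components, K-means with K = 3) is shown below. This is not the thesis code and the count matrix is a placeholder.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 8)).astype(float)  # placeholder count matrix

X_sqrt = np.sqrt(X)                                 # dampen the effect of big spenders
X_pca = PCA(n_components=3).fit_transform(X_sqrt)   # first three principal components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print(np.bincount(labels))                          # cluster sizes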

All clustering algorithms were evaluated on the basis of their silhouette values, Davies-Bouldin indices and subject matter knowledge. The results showed that non-negative matrix factorization (NMF) was a very poor clustering algorithm in this case. The t-distributed mixture models from the tEIGEN family fared better but were still not satisfactory. Most models from the MCLUST Gaussian mixture model family performed on par with, or even slightly better than, the tEIGEN models. EII, which solves the same optimization problem as K-means clustering, was the best MCLUST model. K-means clustering was the best choice of algorithm overall, specifically with K = 3. Since this is in a sense equivalent to the MCLUST model EII, a choice had to be made between the two before the results could be analyzed. After inspecting the distributions of the principal components it was decided that the GMM was an unwise choice. Analysis of the resulting clustering partition shows that customers were divided into segments depending on how many purchases they had made in total, with small differences in taste between the different clusters.

6.2 Reflection

Several parts of the project could have been performed differently. For one, different cluster validation indices could have been used. These alternatives range from well-established indices like the Calinski-Harabasz and Dunn indices to more experimental, stability-based approaches to cluster validation built on the bootstrap and cross-validation [21]. In this thesis these were not used due to the computational burden. For the mixture models, one could forgo the silhouette entirely and instead use statistical information criteria, e.g. BIC, which is the more common approach in the literature. The mixture model approach could also be developed further by adopting a Bayesian approach, which could have tackled the issues of infinite likelihood functions.
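As an illustration of the BIC-based alternative, a minimal sketch using scikit-learn's GaussianMixture is given below. Note that this is not the MCLUST family itself, only an analogous Gaussian mixture with a few covariance structures, and the data are placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_pca = rng.normal(size=(1000, 3))  # placeholder for the PCA scores

# fit mixtures with different numbers of components and covariance structures
# and keep the combination with the lowest BIC
best = min(
    ((k, cov, GaussianMixture(n_components=k, covariance_type=cov,
                              random_state=0).fit(X_pca).bic(X_pca))
     for k in range(2, 7) for cov in ["spherical", "diag", "full"]),
    key=lambda t: t[2],
)
print(f"lowest BIC: K={best[0]}, covariance={best[1]}, BIC={best[2]:.1f}")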

In the case of NMF, a possible change would be to optimize another loss function, e.g. the divergence between the target matrix X and the factorization UV^T rather than the Euclidean distance between them.
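A minimal sketch of such a change using scikit-learn's NMF is given below, with the generalized Kullback-Leibler divergence replacing the default Frobenius (Euclidean) loss; whether this matches the exact divergence intended above is an assumption, and the data are placeholders. Cluster labels are taken as the index of the largest entry in each row of the basis coefficient matrix (W in scikit-learn notation, playing the role of U).

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 8)).astype(float)  # placeholder non-negative data

model = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
            init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)          # basis coefficients
labels = W.argmax(axis=1)           # hard cluster assignment per customer
print(np.bincount(labels))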

Furthermore, the feature engineering could have been done differently. In this thesis only the total number of purchases within different categories was used; another possible approach would have been to count the number of purchases made during different times of the day. Each customer would then be profiled by the number of purchases in different time periods, and these variables could be used in clustering.
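A minimal sketch of such a time-of-day profile is given below; the transaction table, its column names (customer_id, timestamp) and the chosen time bins are hypothetical and not taken from the thesis data.

import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime(["2019-01-05 08:15", "2019-01-06 19:40",
                                 "2019-01-05 12:05", "2019-01-07 18:55",
                                 "2019-01-08 21:10"]),
})

# bin each purchase by the hour of day and count purchases per customer and period
period = pd.cut(transactions["timestamp"].dt.hour,
                bins=[0, 6, 12, 18, 24], right=False,
                labels=["night", "morning", "afternoon", "evening"])
profile = pd.crosstab(transactions["customer_id"], period)
print(profile)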

6.3 Future work

During the course of this thesis, many different algorithms and ideas were tried. Many showed unpromising results and were thus quickly discarded, but they still merit some mention since they might provide guidance for future work in customer segmentation by cluster analysis. We begin by discussing some algorithms that were tested on the original data, i.e. before any transformation or PCA was applied.

When the clustering algorithm DBSCAN was run with a variety of parameter values, it simply did not assign cluster memberships in most cases. One of the algorithm's features is that it may classify observations as noise, which it did for a large share of the observations. In addition, running the algorithm often gave rise to program crashes and unsatisfactory computational performance.
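This behaviour is easy to illustrate in scikit-learn, where DBSCAN labels noise points with -1; the sketch below uses placeholder skewed data and arbitrary parameter values, not the thesis data.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X_sparse = rng.exponential(1.0, size=(1000, 8))   # placeholder skewed data

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_sparse)
n_noise = np.sum(labels == -1)                    # -1 marks observations classified as noise
print(f"{n_noise} of {len(labels)} observations labelled as noise")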

Mean shift clustering did not fare very well either. It is a clustering algorithm in which the number of clusters is not determined beforehand, but it can be influenced through the bandwidth parameter. The general picture of the results is that one large cluster with a higher average silhouette score than the rest was always identified, along with a group of very small clusters in which the average silhouette scores were very low. The size and average silhouette value of this large cluster tended to increase with the bandwidth parameter, while the number of smaller clusters decreased.
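A minimal sketch of this bandwidth dependence using scikit-learn's MeanShift (placeholder data and arbitrary bandwidth values, not the thesis settings):

import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
X_sparse = rng.exponential(1.0, size=(500, 3))    # placeholder skewed data

for bw in [0.5, 1.0, 2.0]:
    labels = MeanShift(bandwidth=bw, bin_seeding=True).fit_predict(X_sparse)
    sizes = np.bincount(labels)
    # larger bandwidths tend to give fewer clusters, dominated by one large cluster
    print(f"bandwidth={bw}: {len(sizes)} clusters, largest has {sizes.max()} points")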

When using algorithms in which the number of clusters is not predetermined (DBSCAN and mean shift clustering), the resulting partitions consisted of one large homogeneous cluster with large silhouette values and many small clusters that were very heterogeneous and had small silhouette values. This is easily explained by the very sparse and heavily skewed distribution observed in the data. A closer inspection revealed that the large cluster consisted of the low spenders, while the smaller clusters often contained big spenders who had been put into their own separate clusters.

Other algorithms that were tried but ultimately could not be used due to computational memory issues were hierarchical clustering, BIRCH clustering and spectral clustering. In all of these cases the number of data points that could be handled was simply too small for the methods to be useful. Affinity propagation was also tested, but it too could only be used for small data sets due to computational reasons.


A Silhouette plots

A.1 K-means clustering

Figure A.1: Silhouette plots for K-means clustering with different K. A.1a) K = 2. A.1b) K = 4. A.1c) K = 5. A.1d) K = 6.

A.2 MCLUST models

Figure A.2: Silhouette plots for different number of components K for the model EEE. A.2a) K = 3. A.2b) K = 4.

Figure A.3: Silhouette plots for different number of components K for the model EEI. A.3a) K = 3. A.3b) K = 4.


Figure A.4: Silhouette plots for different number of components K of the model EEV. A.4a) K = 2. A.4b) K = 3. A.4c) K = 4.

Figure A.5: Silhouette plots for different number of components K of the model VEI. A.5a) K = 2. A.5b) K = 3. A.5c) K = 4.

Figure A.6: Silhouette plots for different number of components K of the model VII. A.6a) K = 2. A.6b) K = 3. A.6c) K = 4.

A.3 tEIGEN models

Figure A.7: Silhouette plots for different number of components K for the model CICC. A.7a) K = 2. A.7b) K = 3. A.7c) K = 4.


Figure A.8: Silhouette plots for different number of components K for the model CIIC. A.8a) K = 2. A.8b) K = 3. A.8c) K = 4.

Figure A.9: Silhouette plots for different number of components K for the model CIIU. A.9a) K = 2. A.9b) K = 3. A.9c) K = 4.

Figure A.10: Silhouette plots for different number of components K for the model UIIC. A.10a) K = 2. A.10b) K = 3. A.10c) K = 4.

Figure A.11: Silhouette plots for different number of components K for the model UIIU. A.11a) K = 2. A.11b) K = 3. A.11c) K = 4.


Figure A.12: Silhouette plots for different number of components K for the model UUCU. A.12a) K = 2. A.12b) K = 3.

A.4 NMF

Figure A.13: Silhouette plot obtained when using 2 clusters in NMF

Figure A.14: Silhouette plot obtained when using 4 clusters in NMF

Figure A.15: Silhouette plot obtained when using 5 clusters in NMF

Figure A.16: Silhouette plot obtained when using 6 clusters in NMF


Figure A.17: Silhouette plot obtained when using 7 clusters in NMF


References

[1] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. isbn: 978-3-319-24277-4. url: https://ggplot2.tidyverse.org.

[2] Hadley Wickham et al. dplyr: A Grammar of Data Manipulation. R package version 0.7.6. 2018. url: https://CRAN.R-project.org/package=dplyr.

[3] Antje Wolf and Karl Kirschner. ”Principal component and clustering analysis on molecular dynamics data of the ribosomal L11·23S subdomain”. eng. In: Journal of Molecular Modeling 19.2 (2013), pp. 539–549. issn: 1610-2940.

[4] Friedrich Anders et al. ”Dissecting stellar chemical abundance space with t-SNE”. In: 619 (2018). issn: 00046361.

[5] Jeffrey L. Andrews et al. ”teigen: An R Package for Model-Based Clustering and Classification via the Multivariate t Distribution”. eng. In: Journal of Statistical Software 83.1 (2018), pp. 1–32. issn: 1548-7660. url: https://doaj.org/article/8ecbdab733b747e0a8c9886fbdcfc1cc.

[6] Jeffrey Andrews and Paul McNicholas. ”Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions”. eng. In: Statistics and Computing 22.5 (2012), pp. 1021–1029. issn: 0960-3174.

[7] Olatz Arbelaitz et al. ”An extensive comparative study of cluster validity indices”. eng. In: Pattern Recognition 46.1 (2013), pp. 243–256. issn: 0031-3203.

[8] David Arthur and Sergei Vassilvitskii. ”k-means++: the advantages of careful seeding”. eng. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. SODA ’07. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. isbn: 9780898716245.

[9] Andrew Aziz. Customer Segmentation based on Behavioural Data in E-marketplace. eng. Uppsala universitet, Teknisk-naturvetenskapliga vetenskapsområdet, Matematisk-datavetenskapliga sektionen, Institutionen för informationsteknologi, 2017.

[10] Jonathan Baarsch and M. Emre Celebi. ”Investigation of Internal Validity Measures for K-Means Clustering”. In:

[11] K. Beyer et al. ”When is “nearest neighbor” meaningful?” In: vol. 1540. Springer Verlag, 1998, pp. 217–235. isbn: 3540654526.

[12] Ernesto Borrayo et al. ”Principal components analysis - K-means transposon element based foxtail millet core collection selection method (Report)”. eng. In: BMC Genetics 17.42 (2016). issn: 1471-2156.

[13] C. Boutsidis and E. Gallopoulos. ”SVD based initialization: A head start for nonnegative matrix factorization”. eng. In: Pattern Recognition 41.4 (2008), pp. 1350–1362. issn: 0031-3203.

[14] Ryan Browne and Paul McNicholas. ”Estimating common principal components in high dimensions”. eng. In: Advances in Data Analysis and Classification 8.2 (2014), pp. 217–226. issn: 1862-5347.

[15] Sarka Brodinova et al. ”Robust and sparse k-means clustering for high-dimensional data”. In: (Sept. 2017).

[16] Jean-Philippe Brunet et al. ”Metagenes and molecular pattern discovery using matrix factorization”. English. In: Proceedings of the National Academy of Sciences of the United States 101.12 (2004). issn: 0027-8424.

[17] Gilles Celeux and Gérard Govaert. ”Gaussian parsimonious clustering models”. eng. In: Pattern Recognition 28.5 (1995), pp. 781–793. issn: 0031-3203.

[18] Wei-Chien Chang. ”On Using Principal Components before Separating a Mixture of Two Multivariate Normal Distributions”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 32.3 (1983), pp. 267–275. issn: 0035-9254.

[19] D. L. Davies and D. W. Bouldin. ”A Cluster Separation Measure”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1.2 (Apr. 1979), pp. 224–227. issn: 0162-8828. doi: 10.1109/TPAMI.1979.4766909.

[20] Brian S Everitt. Cluster Analysis. eng. 5th edition. Wiley Series in Probability and Statistics. 2010. isbn: 1-280-76795-2.

[21] Yixin Fang and Junhui Wang. ”Selection of the number of clusters via the bootstrap method”. eng. In: Computational Statistics and Data Analysis 56.3 (2011). issn: 0167-9473.

[22] Tom Brijs Gilbert et al. ”Using Shopping Baskets to Cluster Supermarket Shoppers”. In: Proceedings of the 12th Annual Advanced Research Techniques Forum of the American Marketing Association. 2001.

[23] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. ”On Clustering Validation Techniques”. eng. In: Journal of Intelligent Information Systems 17.2 (2001), pp. 107–145. issn: 0925-9902.

[24] N. Halko, P. G. Martinsson, and J. A. Tropp. ”Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions”. eng. In: SIAM Review 53.2 (2011), pp. 217–288. issn: 0036-1445.


[25] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. eng. Springer Series in Statistics. New York, NY: Springer New York, 2001. isbn: 9780387216065.

[26] Michael E Houle et al. ”Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?” eng. In: Scientific and Statistical Database Management: 22nd International Conference, SSDBM 2010, Heidelberg, Germany, June 30–July 2, 2010. Proceedings. Vol. 6187. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 482–500. isbn: 9783642138171.

[27] Diego Ingaramo, Paolo Rosso, and Marcelo Errecalde. ”Evaluation of Internal Validity Measures in Short-Text Corpora”. eng. In: Computational Linguistics and Intelligent Text Processing: 9th International Conference, CICLing 2008, Haifa, Israel, February 17-23, 2008. Proceedings. Vol. 4919. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 555–567. isbn: 9783540781349.

[28] Alan Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. eng. Springer Texts in Statistics. New York, NY: Springer New York, 2008. isbn: 9780387781884.

[29] Pablo Jaskowiak et al. ”On strategies for building effective ensembles of relative clustering validity criteria”. eng. In: Knowledge and Information Systems 47.2 (2016), pp. 329–354. issn: 0219-1377.

[30] Susanne Jauhiainen and Tommi Kärkkäinen. ”A Simple Cluster Validation Index with Maximal Coverage”. In: Proceedings of the 25th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. 2017, pp. 293–298. isbn: 978-287587039-1.

[31] I. Jolliffe. Principal Component Analysis. eng. Springer Series in Statistics. New York, NY: Springer NewYork, 2002. isbn: 978-0-387-95442-4.

[32] Jingu Kim and Haesun Park. ”Sparse Nonnegative Matrix Factorization for Clustering”. In: 2008.

[33] Dmitry Kobak and Philipp Berens. ”The art of using t-SNE for single-cell transcriptomics”. In: bioRxiv (2018). doi: 10.1101/453449. eprint: https://www.biorxiv.org/content/early/2018/10/25/453449.full.pdf. url: https://www.biorxiv.org/content/early/2018/10/25/453449.

[34] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. ”Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering”. eng. In: ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009), pp. 1–58. issn: 1556-472X.

[35] Daniel D. Lee and H. Sebastian Seung. ”Algorithms for Non-negative Matrix Factorization”. In: Advances in Neural Information Processing Systems 13. Ed. by T. K. Leen, T. G. Dietterich, and V. Tresp. MIT Press, 2001, pp. 556–562. url: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf.

[36] George C. Linderman and Stefan Steinerberger. ”Clustering with t-SNE, provably”. In: CoRR abs/1706.02582 (2017). arXiv: 1706.02582. url: http://arxiv.org/abs/1706.02582.

[37] Yanchi Liu et al. ”Understanding of Internal Clustering Validation Measures”. eng. In: IEEE Publishing, 2010, pp. 911–916. isbn: 9781424491315.

[38] Ana Marín Celestino and Diego Martínez Cruz. ”Groundwater Quality Assessment: An Improved Approach to K-Means Clustering, Principal Component Analysis and Spatial Analysis: A Case Study”. eng. In: Water 10.4 (2018). issn: 20734441. url: http://search.proquest.com/docview/2040900240/.

[39] Geoffrey J. McLachlan and T Krishnan. The EM Algorithm and Extensions. eng. 2nd ed. Wiley series in probability and statistics. Interscience, 2008. isbn: 0471201707.

[40] Andri Mirzal and Masashi Furukawa. ”On the clustering aspect of nonnegative matrix factorization”. In: (2010).

[41] Seunghee Park et al. ”Electro-Mechanical Impedance-Based Wireless Structural Health Monitoring Using PCA-Data Compression and k-means Clustering Algorithms”. eng. In: Journal of Intelligent Material Systems and Structures 19.4 (2008), pp. 509–520. issn: 1045-389X.

[42] F. Pedregosa et al. ”Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[43] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2013. url: http://www.R-project.org/.

[44] Peter J. Rousseeuw. ”Silhouettes: A graphical aid to the interpretation and validation of cluster analysis”. eng. In: Journal of Computational and Applied Mathematics 20.C (1987), pp. 53–65. issn: 0377-0427.

[45] Luca Scrucca et al. ”mclust 5: clustering, classification and density estimation using Gaussian finite mixture models”. In: The R Journal 8.1 (2016), pp. 205–233. url: https://journal.r-project.org/archive/2016-1/scrucca-fop-murphy-etal.pdf.

[46] H. W. Shin and S. Y. Sohn. ”Segmentation of Stock Trading Customers According to Potential Value”. In: Expert Syst. Appl. 27.1 (July 2004), pp. 27–33. issn: 0957-4174. doi: 10.1016/j.eswa.2003.12.002. url: http://dx.doi.org/10.1016/j.eswa.2003.12.002.

[47] J. du Toit et al. ”Customer Segmentation Using Unsupervised Learning on Daily Energy Load Profiles”. eng. In: Journal of Advances in Information Technology 7.2 (2016), pp. 69–75. issn: 1798-2340.


[48] Toon Van Craenendonck and Hendrik Blockeel. ”Using internal validity measures to compare clustering algorithms”. eng. In: 2015.

[49] L. Van Der Maaten and G. Hinton. ”Visualizing data using t-SNE”. In: Journal of Machine Learning Research 9 (2008), pp. 2579–2625. issn: 15324435.

[50] Lucas Vendramin, Ricardo J. G. B. Campello, and Eduardo R. Hruschka. ”Relative clustering validity criteria: A comparative overview”. eng. In: Statistical Analysis and Data Mining 3.4 (2010), pp. 209–235. issn: 1932-1864.

[51] Wei Xu, Xin Liu, and Yihong Gong. ”Document clustering based on non-negative matrix factorization”. eng. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. SIGIR ’03. ACM, 2003, pp. 267–273. isbn: 1581136463.

[52] K. Y. Yeung and W. L. Ruzzo. ”Principal component analysis for clustering gene expression data”. In: Bioinformatics 17.9 (2001), pp. 763–774. issn: 1460-2059.

[53] D. Zakrzewska and J. Murlewski. ”Clustering algorithms for bank customer segmentation”. In: 5th International Conference on Intelligent Systems Design and Applications (ISDA’05). Sept. 2005, pp. 197–202. doi: 10.1109/ISDA.2005.33.
