Thierry Vallaud Thesis


TVallaud 1

Estimating potential customer value using customer data

Using a classification technique to determine customer value

Thierry Vallaud

A Thesis

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Science in Data Mining

Department of Mathematical Sciences

Central Connecticut State University

New Britain, Connecticut

April 2009

Thesis Advisor

Dr. Daniel Larose

Department of Mathematical Sciences

Key Words: Turnover potential, Classification, Kohonen Networks



Abstract:

This study outlines a method of determining individual customer potential, based solely on

data present in the customer database: descriptive information and transaction records.

We define potential as the incremental turnover that a company could generate with its present customers.

In order to successfully calculate this potential in a large database with multiple variables, we

propose grouping together customers who look like each other (known as clones), by means

of an appropriate clustering technique: Kohonen Networks.

This method is applied to actual data sets, and various techniques are employed to check the

stability of the clusters obtained. Real potential is then determined by means of an empirical

approach: a practical application to a major French retailer's database of 5 million customers.



Contents

The context
Our thesis subject
The precise modelling application
The research questions
The data mining process used
    Data understanding
    Data preparation
Clustering models and determination of customer potential
    Kohonen network method
Model development
    1- Objectives and methodology
    2- Robustness of the Kohonen method
    3- Calculation of the potentials
    4- Main results
    5- Results summary
The validation procedures for the models
Conclusions
    Discussion of the results of the research study
    The limits and the contribution of our research study
    Further research
Bibliography
Appendix



The context

Most companies would like to know their customers' potential in terms of turnover at the individual level. Determining potential means identifying the incremental turnover that a given company could generate with its existing customers.

Customer turnover potential models exist and are mainly based on the customer value determined by the LTV (Life Time Value) approach (Bénavent and Crié; Berger and Nasr 1998; Dwyer 1997; Venkasten, Rajkumar and Kumar 2004).

Besides this model, other models estimate the customer's spending share (Cooil et al. 2007; Yuxing Du et al.; Keimingham et al. 2007). There are also econometric models, which are based on data that are often external to the database (Plastria 2001; Huff 2003; Reilly 1931).

Customer consumption (total value) represents the lifetime consumption of a particular product by a particular customer, referred to as Customer Total Value or CTV. For example, a customer's total value for a retailer is the sum of all the purchases he will make in the retailer's stores during his life.

It is possible to estimate a customer's consumption on this market for a given brand b. Over the course of his lifetime, the customer will consume several brands. His total consumption of one of these brands then constitutes that brand's wallet share over the customer's lifetime (Figure 1).

[Figure 1: wallet share of brand b]

The difference, or delta, between the customer's total consumption in the market and his total consumption of the brand corresponds to the Competitors' Consumption Total Value, or CCTV (Figure 2).

Depending on the brand's marketing stimulus, the customer will shift a share of that delta away from competitors and/or increase his consumption in the total market: customers of the retailer also shop in some competitors' stores and may increase their total consumption across retailers.
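The formulas that accompanied Figures 1 and 2 are not reproduced in this copy. As a hedged sketch only, the definitions above can be written as follows, where $V_b$ (lifetime consumption of brand $b$ by the customer), $\mathrm{WS}_b$ (wallet share of brand $b$) and $B$ (the set of all brands in the market) are notation assumed here for illustration and are not necessarily the author's own symbols:

```latex
\begin{align*}
\mathrm{CTV}    &= \sum_{b' \in B} V_{b'}   && \text{total lifetime consumption in the market} \\
\mathrm{WS}_b   &= \frac{V_b}{\mathrm{CTV}} && \text{wallet share of brand } b \\
\mathrm{CCTV}_b &= \mathrm{CTV} - V_b       && \text{consumption going to the competitors of } b
\end{align*}
```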



Thus, the customer's theoretical potential is his total consumption over his lifetime. His reachable potential, which can be estimated by means of the above econometric models, is the sum of three components:

- the actual value for brand 1,
- the share of consumption taken from the competitors (Figure 3),
- the increase in the customer's total consumption.

The customer's reachable potential then corresponds to what the brand has already captured plus what the customer could consume additionally or transfer from competitors. This reachable potential can be estimated in two ways: using an econometric model, which requires data exogenous to the company's internal customer database; or, alternatively, using solely internal data from the company's customer database, by means of the clone method.
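The reachable potential described above can be summarised in one line. This is a sketch under assumed notation: $V_1$ is the actual value for brand 1, $\alpha \in [0, 1]$ is the share of the competitors' consumption $\mathrm{CCTV}$ that the brand can capture, and $\Delta$ is the increase in the customer's total consumption. None of these symbols is taken from the original formulas, which are not reproduced in this copy.

```latex
\[
\text{Reachable potential} \;=\;
\underbrace{V_1}_{\text{already captured}}
\;+\; \underbrace{\alpha \cdot \mathrm{CCTV}}_{\text{taken from competitors}}
\;+\; \underbrace{\Delta}_{\text{additional consumption}}
\]
```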

A given brand can only capture n% of the theoretical potential (Berend Wierenga and Gerrit, 2000). Some marketing researchers have shown that a brand can increase its actual wallet share by at most 30%: above this rate the customer perceives a change and tries to resist it, because an increase of more than 30% modifies his choice set¹ too much (Bremer and Joyce, 1988). This subject has already been covered in one of our previous studies (Vallaud, 2003).

1 The choice set is the finite set of products in a given product category that a customer has in mind before making a purchase.



The most advanced approaches to the determination of potential try to identify the portion that is reachable for the company, relying solely on customer data from the company's customer database. These approaches calculate a potential customer by customer, but the results must remain consistent, at the aggregated level, with macro-level market information.

Our thesis subject

The objective is to work on clustering models² (Lerman 1970; Dorofeyuk 1971; Borko and Bernick 1963; TwoStep (Tan et al. 1997); k-means (Hartigan et al. 1979; Fang et al. 1982); SOM (Teuvo Kohonen 1988; Vesanto 1997; Kaski 1997); etc.), applied to large databases from commercial companies (phone operators, ISPs, major retailers, mail order companies, etc.).

We use clustering models to determine customer potential using a method we call the clone method, whereby customers who most resemble each other are considered to be clones and should have the same potential.

We have access to a variety of databases suited to our methodological process. In this

document we will perform an empirical test of our method on customer data from a major

French grocery retailer.

As part of our brief presentation of the context, we will look at two main subjects:

- Calculation of potential, or customer value, in marketing and the notions it depends on: LTV, wallet share, market share capture, etc.

- The mathematical models that allow similar individuals to be grouped into homogeneous groups: clustering techniques.

The field of investigation will be multidisciplinary, with a minor part devoted to marketing and a major part devoted to statistics, data mining and clustering.

The precise modelling application

The main research objective is to test several techniques, separately and possibly jointly, to ensure that the clusters formed are homogeneous groups of clones.

Besides choosing the models, part of the research involves defining the most informative

variables and a model topology which fits with these data. The aim here is to obtain the most

meaningful and convergent results.

Another aspect of our research will involve confirming the clusters obtained using the models, along with other complementary statistical techniques:

- Dimension reduction to choose variables, because of the very large number of clusters and the large value ranges,
- Projection of passive and active variables³ in the clusters,
- Reallocation of clusters by supervised models,
- Validation by non-automatic classification techniques, connectivity of super classes,

2 SOM belongs to the clustering methods. "Typologies" is the French word for clustering, and typologies belong to the unsupervised classification techniques.

3 Active variables are used to build the groups themselves in terms of distances; passive variables are purely descriptive and serve to explain the groups.



- Empirical verification with external panels such as Nielsen or TNS Sofres⁴, which represent the market reality of the potential.

Another large part of our research study is selecting the above-mentioned methods and validating these choices. The aim is to find a clustering method that converges sufficiently to be validated with all the approaches described above. The modelling will therefore become a process involving several models.

The final modelling will be carried out using a market-standard software platform: the French version of Clementine from SPSS.

The scientific contribution will be:

- a methodological contribution to selecting clustering models and validating these choices,
- a real-life data application, validated by an actual business case: calculating real attainable potentials.

The research questions

- Can we use a clustering technique to identify customers who are similar to each other, and therefore define a realistic potential in terms of turnover for these customers?
- Can we develop a method?
- How can we validate the stability of the clusters?

The data mining process used

We will use the Cross-Industry Standard Process for Data Mining (CRISP-DM)⁵ to guide our approach to analysing the data. The CRISP-DM standard process consists of the following stages: business understanding, data understanding, data preparation, modelling, evaluation and deployment.

4 Nielsen and TNS are market research companies that run panels in which members scan the purchases they make. These panels can be crossed with customer databases to measure marketing-mix effects.

5 http://www.crisp-dm.org/



Data understanding

We will work on 5,373,026 individuals drawn from the database of a major French retail company. We have the details of all cash register receipts over a period of 12 months, from January 2006 to December 2006.

For external validation purposes, we also have market research available on the French market: Referenseigne 2006 from TNS Sofres⁶. This research gives us the wallet share of the main French retailers⁷.

Data preparation

This step consists in familiarizing ourselves with the data in the database of the program members, in order to determine the structure of the database (the data layout), the fill rate of the fields in the data file, and the origin and nature of the data. Each field is then checked to ensure it does not undermine model stability.

We carried out a data audit and EDA in two steps; only the second EDA is presented in this document.

The audit includes:

- The structure of the database
- The origin and nature of the data (socio-demographic / consumption)
- The possibility of performing cross data analysis (by brand / shelf / product family, etc.)
- Data periodicity
- Data historicity
- Data completeness

The principal data management processes performed on the data in the database therefore include:

- Checking and validating the format of the variables
- Recoding and correcting certain variables, called aberrant variables
- Creating specific aggregates useful for further segmentation (total turnover, turnover by product family, annual visit frequency, average shopping basket)
- Analysing the correlation of the target variable (turnover) with other variables (socio-demographic criteria, order frequency) in order to check whether any dependency relationships exist
- Geocoding, useful for enriching the profiles with certain socio-demographic data derived from the INSEE⁸ (French national statistical office) via the IRIS⁹ (specific French geocoding data)

- Calculating the distances between the customer and the point of sale (trade zone)

The analysis will be performed on 12-month sliding turnover computed over the whole of the historical data, to make the modelling more reliable: the larger and more homogeneous the historical data set, the more stable and predictive the model should be.

In this document, we have included only some examples of the data audit and data preparation, as our demonstration is focused on the model and results. Details of the second EDA are given in Appendix 2 (p. 39).

The input variables are as shown in the following table:

6 Referenseigne is a monographic market research study carried out yearly on the French retail market for the past ten years by TNS Sofres, the third-largest research company worldwide.

7 http://www.secodip.fr/worldpanel/htm/dossier_presse/tns-plusloin.asp/

8 INSEE (Institut National de la Statistique et des Études Économiques) is the French National Institute for Statistics and Economic Studies. It collects and publishes information on the French economy and society, carrying out the periodic national census. Located in Paris, it is the French branch of Eurostat, the European Statistical System. The INSEE was created in 1946 as a successor to the National Statistics Service (SNS), created under Vichy during World War II.

9 The IRIS is a French geographic unit to which the census data are linked.



Identification of the outliers:

We identified and eliminated from the analysis some customers with anomalous behaviour on two variables linked to turnover.

We used only these two variables in the outlier detection because they are the main constituents of the potential itself.

Discretization:

We discretized some important variables and studied their dispersion.

A complete EDA is provided in Appendix 2 (p. 39), with descriptive analysis, tables and graphs, correlation estimates, and so on.



Clustering models and determination of customer potential

The modelling process is divided into three major phases: (1) the clustering method itself, (2) the calculation of the evolution levels, and (3) the calculation of the individual customer potential.

1. The clustering method: as these models are applied to very large databases with large numbers of variables and records, the SOM (Self-Organizing Map) seems to be particularly well adapted (Kohonen, 1988):

- Kohonen networks produce very homogeneous and stable groups with many individuals and variables,
- Kohonen networks can capture complex non-linear relationships among many variables for many individuals,
- Kohonen networks handle missing data well.

Kohonen network method

Kohonen networks are a type of self-organising map (SOM), which is itself a special class of neural network.

Kohonen analysis is a clustering method. Its main advantage is that it converts a high-dimensional input signal into a simpler, low-dimensional discrete map. It is an unsupervised method: no target has to be defined.

Kohonen networks exhibit three characteristic processes:

1 Competition: output nodes compete with each other to produce the best value for a particular scoring function, most commonly the smallest Euclidean distance.

2 Cooperation: the winning node becomes the centre of a neighbourhood of excited neurons.

3 Adaptation: nodes in the neighbourhood of the winning node participate in adaptation, that is, learning. The weights of these nodes are adjusted so as to further improve the score function.

Network architecture :

Each neuron of the Kohonen map is linked to all the other neurons of the map. Each one of

them receives a complete copy of an input vector.



[Figure: Kohonen network architecture. Labels: inputs; learning rate; winner; neighbourhood; weights of the winners adjusted as a function of the input data; output data that try to become winners.]

Kohonen networks are self-organising maps that exhibit Kohonen learning. Let the set of $m$ field values for the $n$th record be an input vector $x_n = (x_{n1}, x_{n2}, \ldots, x_{nm})$, and let the current set of $m$ weights for a particular output node $j$ be a weight vector $w_j = (w_{1j}, w_{2j}, \ldots, w_{mj})$. In Kohonen learning, the nodes in the neighbourhood of the winning node adjust their weights using a linear combination of the input vector and the current weight vector:

$$w_{ij,\text{new}} = w_{ij,\text{current}} + \eta \, (x_{ni} - w_{ij,\text{current}})$$

where $\eta$, $0 < \eta < 1$, represents the learning rate. Kohonen indicates that the learning rate should be a decreasing function of the training epochs (runs through the data set).

Upon each iteration, the network checks the accuracy of its previous grouping.
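The competition, cooperation and adaptation processes, together with the weight update rule, can be illustrated with a short, self-contained Python sketch. It is not the Clementine implementation used in this study: the 3 x 3 grid, the linearly decreasing learning rate and neighbourhood radius, and the names `kohonen_update`, `train_som` and `assign_cluster` are all assumptions chosen for illustration.

```python
import math
import random


def kohonen_update(weight, x, eta):
    """One Kohonen learning step: w_new = w_current + eta * (x - w_current)."""
    return [w + eta * (xi - w) for w, xi in zip(weight, x)]


def train_som(data, rows=3, cols=3, epochs=20, eta0=0.5, radius0=1.5, seed=0):
    """Train a small Kohonen map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    m = len(data[0])
    # Random initial weights in [0, 1): the data are assumed to be scaled likewise.
    weights = {(r, c): [rng.random() for _ in range(m)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius both decrease with the epoch.
        frac = 1.0 - epoch / epochs
        eta = eta0 * frac
        radius = max(radius0 * frac, 0.5)
        records = list(data)
        rng.shuffle(records)
        for x in records:
            # Competition: the winner has the smallest Euclidean distance to x.
            winner = min(weights, key=lambda node: math.dist(weights[node], x))
            for node in weights:
                # Cooperation: only nodes close to the winner on the grid learn.
                if math.dist(node, winner) <= radius:
                    # Adaptation: move the node's weights towards the record.
                    weights[node] = kohonen_update(weights[node], x, eta)
    return weights


def assign_cluster(weights, x):
    """Label of the winning node for record x, e.g. '21' for node (2, 1)."""
    row, col = min(weights, key=lambda node: math.dist(weights[node], x))
    return f"{row}{col}"
```

Because each update is a convex combination of the current weight and the record, the weights stay within the range spanned by the initial weights and the data. On real customer data the variables would need to be rescaled to a common range first.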

- A Kohonen network is particularly well suited to building homogeneous groups. It is obviously a lengthy process when performed on a large number of individuals with many variables and records.
- A Kohonen network allocates a relevant group to each customer.

By mapping the analysis, we can evaluate the similarity between groups. Two groups which are close on the graph have similar characteristics.

The aim is to find a method:

- That represents the best trade-off: many classes, ensuring small groups of homogeneous customers within each group, but groups which differ greatly from each other.
- That enables us to obtain realistic customer potential with clusters that are internally stable.



2. Calculation of the evolution level: evolution is the small jump in turnover that a customer needs to make in order to be grouped with the customers who most resemble him on all the variables selected for the model, but who generate a higher turnover than he does. This requires a calculation method based on dividing each class of clones, for which we calculate the median, into deciles.

Individuals in one group should not have a huge gap to cross if we are to obtain a realistic determination of potential¹⁰: the potential increase in turnover that could be achieved after application of the correct marketing actions. We will try to justify this calculation by methodological means. This step will give us the evolution rates in the classes.

3. Calculating individual customer potential: once the rates are properly determined, we will calculate, for each customer, the individual potential to be captured. This calculation needs specific adjustments: all customers with an evolution rate above 100% are assigned the average potential rate of all groups other than the one to which they belong.
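Steps 2 and 3 can be sketched in a few lines of Python. The sketch follows the worked example of footnote 10, taking the highest turnover in the customer's decile of his clone group as the reachable turnover. Several points are assumptions made for illustration and are not stated in the text: deciles are cut by rank within each group, the replacement rate for customers above 100% is the mean of the other groups' average rates, and the function names are hypothetical.

```python
from collections import defaultdict


def evolution_rates(customers, n_bins=10):
    """customers: iterable of (customer_id, cluster, turnover), turnover > 0.

    Returns {customer_id: (potential, rate)}, where the potential is the gap
    between the customer's turnover and the highest turnover in his decile
    of the clone group (assumption: deciles are cut by rank within the group).
    """
    groups = defaultdict(list)
    for customer_id, cluster, turnover in customers:
        groups[cluster].append((turnover, customer_id))
    result = {}
    for members in groups.values():
        members.sort()
        size = len(members)
        for rank, (turnover, customer_id) in enumerate(members):
            # Decile of this customer inside his group (0 .. n_bins - 1).
            decile = rank * n_bins // size
            # Last rank that still falls in the same decile: its upper limit.
            last_rank = ((decile + 1) * size - 1) // n_bins
            upper_limit = members[last_rank][0]
            potential = upper_limit - turnover
            result[customer_id] = (potential, potential / turnover)
    return result


def adjust_extreme_rates(rates, cluster_of, threshold=1.0):
    """Replace rates above `threshold` (100%) by the mean of the average
    rates of all clusters other than the customer's own (assumed reading)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for customer_id, rate in rates.items():
        sums[cluster_of[customer_id]] += rate
        counts[cluster_of[customer_id]] += 1
    means = {cluster: sums[cluster] / counts[cluster] for cluster in sums}
    adjusted = {}
    for customer_id, rate in rates.items():
        if rate > threshold:
            others = [mean for cluster, mean in means.items()
                      if cluster != cluster_of[customer_id]]
            # With a single cluster there is nothing to average: cap instead.
            rate = sum(others) / len(others) if others else threshold
        adjusted[customer_id] = rate
    return adjusted
```

With a group of 20 customers in which the customers with turnovers of 1,000 and 1,200 share a decile, `evolution_rates` gives the first of them a potential of 200 and a rate of 20%, as in footnote 10.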

Model development

1 - Objectives and methodology

To complement the segmentations based on customer turnover, SML segmentation¹¹ (Brusset 2005) and RFM segmentation¹² (McCarty and Hastak 2007; Chen et al. 2008), we calculate a turnover potential score for each customer in the loyalty program database.

This score is based on an iterative approach that allows us to predict the consumption propensity of customers, with the aim of determining their potential future turnover.

10 Example: customer A has an actual turnover of $1,000. Customer A belongs to the first decile of a cluster in which all customers closely resemble each other. The maximum turnover in this decile, that of the customer at its upper limit, is $1,200. The potential is therefore the difference between the $1,200 of that customer and the $1,000 of customer A: $200, or 20%.

11 SML segmentation (Small, Medium, Large) divides the customers according to their turnover.

12 RFM segmentation (Recency, Frequency, Monetary value) is a classical segmentation in marketing.



The approach consists of grouping together customers who resemble each other, according to some socio-demographic and consumption variables.

For the computation we will use consumption data recorded over a period of 12 months (from January 2006 to December 2006).

The variables used in the model are those we decided to keep following the data preparation stage.

Socio-demographic and consumption data | Turnover rate per product family
Customer ID                            | Customer ID
Number of children in the household    | Rate other
Filtered turnover on 12 months         | Rate bazaar
Total turnover                         | Rate other
Yearly turnover on promotions          | Rate pork butcher LS
Nb of transformed points on 12 months  | Rate pet food
Nb of CM on 12 months                  | Rate baby
Nb of reduction vouchers used          | Rate butcher
SML 12 months                          | Rate bakery
RFM 3 months                           | Rate pork butcher
Number of children in the household    | Rate dietetic and organic
                                       | Rate cheese
                                       | Rate fruits and vegetables
                                       | Rate fish
                                       | Rate frozen food
                                       | Rate wine
                                       | Rate cleaning products
                                       | Rate grocery
                                       | Rate liquids
                                       | Rate textile
                                       | Rate ultra-fresh products
                                       | Rate poultry
                                       | Rate first price
                                       | Rate retailer brand 1
                                       | Rate retailer brand 2

Discarded variables are eliminated after a correlation analysis for the quantitative variables (turnover and number of purchases, for instance) or by means of proximity matrices for the qualitative variables. We did not use PCA because we want to keep the information at the most disaggregated level, that of the original variables in the database.

Inactive customers, that is, customers without any transaction during the period, are discarded.
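As an illustration of the correlation-based elimination of quantitative variables, the sketch below keeps a variable only if its absolute Pearson correlation with every variable already kept is below a threshold. The 0.9 threshold, the function names and the toy figures are assumptions; the study does not state which cut-off was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_variables(table, threshold=0.9):
    """table: {variable name: list of values}, in order of priority.

    A variable is kept only if it is not strongly correlated with any
    variable kept before it; the others are discarded as redundant.
    """
    kept = []
    for name, values in table.items():
        if all(abs(pearson(values, table[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: purchases follow turnover closely, children do not.
toy = {
    "turnover": [1, 2, 3, 4, 5, 6],
    "purchases": [2, 4, 6, 8, 10, 13],
    "children": [2, 0, 1, 0, 2, 1],
}
selected = select_variables(toy)
```

Here `selected` retains turnover and children and drops purchases, which is almost a linear function of turnover. The order of the dictionary decides which of two redundant variables survives, so the variables judged most important should come first.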

Clementine stream:



This figure illustrates how a model is built in Clementine from SPSS. Clementine is statistical software which uses an object-based language to build models.

We will use a clustering method to create "clone" groups that are highly homogeneous internally, but different from each other.

The second stage involves creating turnover potential values for these different groups, given that an individual with the same variables as another does not necessarily achieve the same level of turnover. He can tend towards the turnover of his superior clone. To do this, we will use a Kohonen neural network.

Once the clone families have been obtained and the potential values calculated, the main families are determined:

- "Gold": evolution rate higher than 20%
- "Silver": evolution rate between 15% and 20%
- "Bronze": evolution rate below 15%

The evolution rate is the ratio of the potential to the actual turnover.

It should be noted here that potential refers to absolute potential over twelve consecutive months.

This potential is expressed in the form of a rate. For operational purposes, potential values must be reclassified as absolute values:

P1: Large potential

P2: Medium potential

P3: Small potential
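The family thresholds translate directly into code. In this minimal sketch the function name is hypothetical, and the treatment of the boundaries (a rate of exactly 15% or exactly 20% counts as Silver) is an assumption, since the text does not say which family the limits belong to.

```python
def potential_family(potential, turnover):
    """Classify a customer from his evolution rate = potential / turnover."""
    rate = potential / turnover
    if rate > 0.20:
        return "Gold"
    if rate >= 0.15:
        return "Silver"
    return "Bronze"
```

For the customer of footnote 10, `potential_family(200, 1000)` gives a rate of exactly 20%, which this sketch places in the Silver family.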

2- Robustness of the Kohonen method:

We test several methods of determining convergences between Kohonen groups.

2.1 CONVERGENCES VISUALIZATION



We obtained 40 groups, numbered from 00 to 93 (note that clusters do not follow a numbered

sequence).

We would like to obtain quite a large number of groups in order to minimize the intra-group standard deviation as far as possible.

Mappings: 00 is the cluster at coordinate 0 on the X axis and 0 on the Y axis, and 93 is the cluster at coordinate 9 on the X axis and 3 on the Y axis.

Kohonen groups x SML segmentation (12 months)

Colors are generally well grouped, with customers belonging to the same SML segments

being together.

Visually, the placement of SML through clusters shows stability.

Kohonen groups x RFM segmentation (3 months)



Colours are again generally well grouped, with customers belonging to the same RFM segment found in the same Kohonen groups. However, there is a far greater mixture of colours inside each cluster, so the homogeneity of the clusters is less obvious than with the SML mapping.

2.2 ROBUSTNESS OF THE KOHONEN CLASSIFICATION

Is this distribution of the population stable? We can answer this question in four different ways:

A - Is there a convergence of cluster weights between the sample of active observations and the sample of passive observations?

B - Can the grouping be reproduced by a Bayesian network (Pourret et al., Jensen, Stephenson 2000)?

C - Can the classification be reproduced by a segmentation method such as C5.0 (Quinlan 1993, 1996, 2004)?

D - Is there convexity of the super classes?

A/ Convergence of the method

We can compare the percentage allocation of customers to the clusters on two random samples.



Clones | Total: number | Total: % | Learning sample: number | Learning sample: % | Test sample: number | Test sample: %
KH01 | 271,944 | 5.06% | 10,949 | 5.13% | 260,995 | 5.06%
KH02 | 171,396 | 3.19% | 6,983 | 3.27% | 164,413 | 3.19%
KH03 | 261,136 | 4.86% | 10,498 | 4.92% | 250,638 | 4.86%
KH04 | 289,912 | 5.40% | 11,508 | 5.39% | 278,404 | 5.40%
KH05 | 80,239 | 1.49% | 3,214 | 1.50% | 77,025 | 1.49%
KH06 | 40,698 | 0.76% | 1,596 | 0.75% | 39,102 | 0.76%
KH07 | 64,515 | 1.20% | 2,550 | 1.19% | 61,965 | 1.20%
KH08 | 93,685 | 1.74% | 3,768 | 1.76% | 89,917 | 1.74%
KH09 | 95,415 | 1.78% | 3,757 | 1.76% | 91,658 | 1.78%
KH10 | 91,169 | 1.70% | 3,681 | 1.72% | 87,488 | 1.70%
KH11 | 57,384 | 1.07% | 2,235 | 1.05% | 55,149 | 1.07%
KH12 | 181,691 | 3.38% | 7,224 | 3.38% | 174,467 | 3.38%
KH13 | 142,728 | 2.66% | 5,624 | 2.63% | 137,104 | 2.66%
KH14 | 83,298 | 1.55% | 3,260 | 1.53% | 80,038 | 1.55%
KH15 | 65,365 | 1.22% | 2,597 | 1.22% | 62,768 | 1.22%
KH16 | 152,665 | 2.84% | 6,153 | 2.88% | 146,512 | 2.84%
KH17 | 119,559 | 2.23% | 4,797 | 2.25% | 114,762 | 2.22%
KH18 | 45,360 | 0.84% | 1,794 | 0.84% | 43,566 | 0.84%
KH19 | 73,151 | 1.36% | 2,783 | 1.30% | 70,368 | 1.36%
KH20 | 35,914 | 0.67% | 1,378 | 0.65% | 34,536 | 0.67%
KH21 | 120,165 | 2.24% | 4,688 | 2.20% | 115,477 | 2.24%
KH22 | 137,752 | 2.56% | 5,462 | 2.56% | 132,290 | 2.56%
KH23 | 36,215 | 0.67% | 1,417 | 0.66% | 34,798 | 0.67%
KH24 | 267,939 | 4.99% | 10,739 | 5.03% | 257,200 | 4.99%
KH25 | 193,624 | 3.60% | 7,581 | 3.55% | 186,043 | 3.61%
KH26 | 50,454 | 0.94% | 2,019 | 0.95% | 48,435 | 0.94%
KH27 | 26,271 | 0.49% | 1,036 | 0.49% | 25,235 | 0.49%
KH28 | 76,724 | 1.43% | 3,082 | 1.44% | 73,642 | 1.43%
KH29 | 199,372 | 3.71% | 7,810 | 3.66% | 191,562 | 3.71%
KH30 | 28,913 | 0.54% | 1,102 | 0.52% | 27,811 | 0.54%
KH31 | 124,878 | 2.32% | 4,922 | 2.30% | 119,956 | 2.32%
KH32 | 347,565 | 6.47% | 13,963 | 6.54% | 333,602 | 6.47%
KH33 | 75,304 | 1.40% | 2,998 | 1.40% | 72,306 | 1.40%
KH34 | 103,656 | 1.93% | 4,107 | 1.92% | 99,549 | 1.93%
KH35 | 24,658 | 0.46% | 989 | 0.46% | 23,669 | 0.46%
KH36 | 31,206 | 0.58% | 1,272 | 0.60% | 29,934 | 0.58%
KH37 | 301,456 | 5.61% | 11,863 | 5.55% | 289,593 | 5.61%
KH38 | 252,820 | 4.71% | 10,042 | 4.70% | 242,778 | 4.71%
KH39 | 130,904 | 2.44% | 5,193 | 2.43% | 125,711 | 2.44%
KH40 | 425,926 | 7.93% | 16,942 | 7.93% | 408,984 | 7.93%
Total | 5,373,026 | 100.00% | 213,576 | 100.00% | 5,159,450 | 100.00%
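The comparison made in this table can be reduced to a single figure: the largest difference in cluster share between the learning sample and the test sample. The sketch below computes it for three of the forty clusters, using the counts and sample sizes from the table. The function name is hypothetical, and using the maximum gap as a summary is a choice made here for illustration, not a criterion stated in the study.

```python
def max_share_gap(counts_a, total_a, counts_b, total_b):
    """Largest absolute difference in cluster share between two samples."""
    return max(abs(counts_a[c] / total_a - counts_b[c] / total_b)
               for c in counts_a)


# Counts for three clusters, taken from the learning and test columns.
learning = {"KH01": 10949, "KH02": 6983, "KH40": 16942}
test = {"KH01": 260995, "KH02": 164413, "KH40": 408984}

gap = max_share_gap(learning, 213576, test, 5159450)
print(f"largest gap in cluster share: {gap:.2%}")
```

For these three clusters the largest gap stays below 0.1 percentage point (it comes from KH02), which is what the near-identical percentage columns suggest.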

B/ Reallocation using a Bayesian network

The above table confirms that the algorithm is able to reproduce the distribution on a larger data set (learning sample vs test sample).

However, only by using another algorithm can we determine whether the clustering can be reproduced and whether it is stable.

Again, the learning sample is split into two independent sub-samples: the learning sub-sample includes 70% of the observations, the test sub-sample 30%.

We use a Bayesian network because discriminant analysis is not well suited to predicting 40 groups.

A Bayesian network allows a stepwise approach, as we can set the probability level of the links that we retain between variables. If we set a probability of 0.9, the results are as presented in graph form below.



The network uses 11 variables: turnover data and socio-demographic variables. It can be seen that SML and RFM are very important. This result validates the representation of the densities. The weights of the variables in the model are shown below.



The Kullback-Leibler measure (http://www.it-innovations.ae/iit005/proceedings/articles/E_6_IIT05_Khalid.pdf) comes from information theory. It compares two series after they have been recoded in a bitmap format. The higher the value, the greater the probability that the two variables have a joint distribution, that is, that they are linked.
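The thesis does not give the formula behind this measure, so the following is only a minimal sketch of the standard Kullback-Leibler divergence for discrete distributions. The function names are hypothetical, and the second function rests on an assumption: that the strength of a link between two variables is read as the divergence between their joint table and the product of its margins (their mutual information), which is 0 for independent variables and grows with the dependence.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in bits, of two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a zero-probability cell contributes nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log2(pi / qi)
    return total


def link_strength(joint):
    """Assumed reading of the measure: KL divergence between a joint
    probability table and the product of its row and column margins."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    p = [v for r in joint for v in r]
    q = [ri * cj for ri in row for cj in col]
    return kl_divergence(p, q)


independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(link_strength(independent), link_strength(dependent))
```

In this toy case the independent table scores lower than the fully dependent one, which matches the reading that a higher value indicates a stronger link.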



Scoring results at the individual level: on the learning sample, 90.7% of the individuals are correctly classified.

On the test sample, the figure is 90.1%.

The rates (in %) of individuals correctly classified by the Bayesian network are given below for each of the 40 clusters.

The Kohonen clusters can be reproduced.
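As an illustration of how such reallocation rates can be computed, here is a minimal sketch. It assumes two parallel lists, the Kohonen cluster of each individual and the cluster predicted by the supervised model; the function name and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def reallocation_rates(actual, predicted):
    """Return the overall and per-cluster percentage of correctly reallocated individuals."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    size = Counter(actual)  # number of individuals per Kohonen cluster
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    per_cluster = {c: 100.0 * hits[c] / n for c, n in size.items()}
    overall = 100.0 * sum(hits.values()) / len(actual)
    return overall, per_cluster


# Toy data: five individuals, one of them reallocated to the wrong cluster.
actual = ["KH01", "KH01", "KH01", "KH02", "KH02"]
predicted = ["KH01", "KH01", "KH02", "KH02", "KH02"]
overall, per_cluster = reallocation_rates(actual, predicted)
print(overall, per_cluster)
```

Reporting the rate per cluster as well as overall makes it visible when a good global figure hides poorly reallocated small groups.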



In the above table, the poorly reallocated groups are, unsurprisingly, those containing a small number of customers.

Even for these groups, accuracy remains above 65%.

C/ Reallocation by decision tree

Under cross-validation, 94.2% of individuals are correctly assigned to their groups.

The test sample confirms this rate.

The two supervised learning methods converge strongly: both the Bayesian network and C5 are able to reallocate individuals correctly to the 40 clusters.

The robustness of the classification is therefore validated.



D/ Superclasses convexity

We use a Bayesian network analysis, which identifies the small number of variables that matter most for the clustering.

We analyse the contingency table between the 40 groups and the variables that each contribute more than 10% of the network's explanatory power:

- Family situation

- C.S.P.

- R.F.M. at 3 months

- S.M.L at 3 months

- Home type

- Age categories

- Filtered cumulated turnover

- Customer seniority categories

On this table, the distance used is the chi-square distance (Ottos, 2007, Meunier et al, Romesburg, 2004) and the aggregation method is Ward's (Clarke and Sun, 1997, Barnier 2008).
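For readers unfamiliar with the chi-square distance, the sketch below uses the usual correspondence-analysis definition: two rows of a contingency table are compared through their profiles (row percentages), and each column is weighted by the inverse of its overall share. Whether the software used in the study applies exactly this variant is an assumption; the function name and the toy table are hypothetical.

```python
import math


def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k of a contingency table."""
    grand_total = sum(sum(row) for row in table)
    col_mass = [sum(col) / grand_total for col in zip(*table)]
    row_i, row_k = table[i], table[k]
    total_i, total_k = sum(row_i), sum(row_k)
    squared = 0.0
    for a, b, mass in zip(row_i, row_k, col_mass):
        if mass == 0.0:
            continue  # an empty column carries no information
        squared += (a / total_i - b / total_k) ** 2 / mass
    return math.sqrt(squared)


# Toy table: 3 clusters (rows) by 2 modalities of one variable (columns).
table = [[10, 20], [20, 40], [30, 10]]
print(chi_square_distance(table, 0, 1))  # identical profiles
print(chi_square_distance(table, 0, 2))  # different profiles
```

Because the distance works on profiles, two clusters of very different sizes but with the same mix of modalities are at distance zero, which is the property needed before aggregating clusters into super-classes.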

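Dendrogram of the 40 Kohonen clusters (chart not reproduced; axis graduated from 0 to 9). Order of the leaves:

KH01, KH05, KH02, KH06, KH11, KH12, KH03, KH07, KH04, KH08, KH25, KH29, KH34, KH30, KH33, KH38, KH39, KH40, KH35, KH37, KH31, KH32, KH36, KH21, KH26, KH27, KH24, KH28, KH22, KH23, KH16, KH19, KH20, KH10, KH09, KH13, KH14, KH17, KH15, KH18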

Breakdown of the standard deviation for an optimal classification:

Intra-groups 89790172,495

Inter-groups 25701803,959

Total 115491976,454



Distances between the central objects:

Results per cluster:

Cluster                                          1              2              3               4               5
Objects                                         10              7              8               5              10
Sum of weights                                  10              7              8               5              10
Intra class standard deviation        72493506,744   25490092,048   67169820,393   125787548,900   151548331,778
Minimal distance to barycenter            3535,855       3419,845       3465,065        3277,114        4143,086
Average distance to the barycenter        7323,184       4577,213       6560,727        8500,865       10606,386
Maximal distance to the barycenter       14057,216       5976,284      16549,137       18792,278       22881,067

KH01 KH09 KH16 KH21 KH25





KH06 KH17 KH24 KH35

KH07 KH18 KH27 KH37

KH08 KH28 KH38

KH11 KH39

KH12 KH40

A check is performed to ensure that the bottom-up (hierarchical) classification respects the order of the groups: clone 40 is not grouped together with clone 3. This is one of the "quality" criteria of a Kohonen map.

In conclusion, the sharp classification obtained by the Kohonen algorithm satisfies the criteria of stability and reproducibility, which guarantee a robust and lasting estimate of potential.

3- Calculation of the potentials

We divided the annual turnover (filtered turnover over 12 months) into deciles.

For each cluster obtained with the Kohonen method, we calculated the business potential based on turnover.

We retained the deciles method because it allows very large variations in turnover to be taken into account.

We split the turnover of each class of clones into deciles, then calculated the median of each decile.
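The decile step can be sketched as follows for a single clone group. This is only an illustration under stated assumptions: the deciles are taken here as ten equal-sized slices of the sorted turnovers, whereas the tool used in the study may compute its quantiles differently; the function name is hypothetical.

```python
from statistics import median


def decile_medians_and_limits(turnovers):
    """Split sorted turnovers into ten slices; return each slice's median and upper limit."""
    values = sorted(turnovers)
    n = len(values)
    if n < 10:
        raise ValueError("at least 10 customers are needed to form deciles")
    medians, upper_limits = [], []
    for k in range(10):
        # Assumption: decile k is the k-th tenth of the sorted list.
        chunk = values[k * n // 10:(k + 1) * n // 10]
        medians.append(median(chunk))
        upper_limits.append(chunk[-1])
    return medians, upper_limits


# Toy group of 100 customers with turnovers 1, 2, ..., 100.
medians, upper_limits = decile_medians_and_limits(range(1, 101))
print(medians[0], upper_limits[0])   # first decile
print(medians[9], upper_limits[9])   # tenth decile
```

The medians and upper limits, interleaved in increasing order, give the boundaries between which the rates of increase are then computed.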



We then allocate to each group a potential turnover value derived from the rates of increase between successive decile medians and decile upper limits.

For each clone group, the rate of increase measures the turnover growth needed to move from one boundary (median or upper limit of a decile) to the next one up.

18 rates of increase are determined per clone group:

- Between the median of the first decile and the upper limit of the first decile: Tx01

- Between the upper limit of the first decile and the median of the second decile: Tx02

- Between the median of the second decile and the upper limit of the second decile: Tx03

...

- Between the upper limit of the eighth decile and the median of the ninth decile: Tx16

- Between the median of the ninth decile and the upper limit of the ninth decile: Tx17

- Between the upper limit of the ninth decile and the median of the tenth decile: Tx18

Customers whose turnover is above the median of the tenth decile are given no potential. We estimate that customers with such a high turnover will have an evolution rate close to 0, equal to the inflation rate, or equal to their annual evolution rate.

For each Kohonen group, a customer whose filtered turnover lies between the minimum and the median of the first decile has an evolution rate equal to rate 1 (Tx01).

A customer whose turnover lies between the median and the upper limit of the first decile has an evolution rate equal to rate 2 (Tx02), and so on.

Each customer is thus allocated an evolution rate. Multiplying this rate by the customer's turnover gives an estimate of the customer's potential turnover.
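The allocation rule described above can be sketched as follows. The boundaries are the decile medians and upper limits of one clone group, interleaved in increasing order. The function names are hypothetical, and one point is an assumption: a customer at or above the last boundary that still has a rate is given a rate of 0, whereas the text only states this explicitly above the median of the tenth decile.

```python
from bisect import bisect_right


def growth_rates(boundaries):
    """Rate of increase (in %) from each boundary to the next one up."""
    rates = []
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= 0:
            raise ValueError("boundaries must be strictly positive")
        rates.append(100.0 * (high - low) / low)
    return rates


def evolution_rate(turnover, boundaries, rates):
    """Rate of the band the customer falls in: below the first boundary
    gives the first rate, between the first and second gives the second, etc."""
    band = bisect_right(boundaries, turnover)  # number of boundaries <= turnover
    if band >= len(rates):
        return 0.0  # assumption: no potential at the top of the scale
    return rates[band]


def potential_turnover(turnover, boundaries, rates):
    """Potential turnover = evolution rate multiplied by current turnover."""
    return turnover * evolution_rate(turnover, boundaries, rates) / 100.0


# First three boundaries of KH01 as printed in the table of limits and medians.
kh01 = [16.6, 34.6, 56.4]
rates = growth_rates(kh01)
print([round(r, 2) for r in rates])
print(round(potential_turnover(10.0, kh01, rates), 2))
```

With these rounded boundaries the first two rates come out close to, but not exactly equal to, the 108,63 and 63,13 printed in the rate table, presumably because the table was computed from unrounded values.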

4.1 - Limits and medians of deciles per Kohonen group:



Turnover in euros

Number Mean Median Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median

KH01 00 271 944 683 354 16,6 34,6 56,4 81,9 111,9 146,5 187,5 234,1 289,4 354,5 431,5 523,7 634,9 769,0 935,8 1 151,5 1 409,5

KH02 01 171 396 730 396 19,0 39,9 64,5 93,1 126,3 164,1 209,3 262,8 323,4 396,1 482,8 584,4 705,7 847,6 1 023,5 1 245,5 1 521,7

KH03 02 261 136 770 414 22,2 45,8 72,5 102,8 137,3 177,3 223,6 277,3 340,4 414,1 501,1 603,4 724,0 867,4 1 041,4 1 266,4 1 570,2

KH04 03 289 912 1 353 658 31,3 65,4 104,5 148,8 201,6 262,2 335,5 423,3 529,1 658,0 812,4 1 000,2 1 230,8 1 510,2 1 864,7 2 314,6 2 898,5

KH05 10 80 239 471 303 17,8 35,4 55,7 79,8 106,7 136,4 170,6 209,7 253,7 303,4 359,7 425,9 499,4 582,3 675,7 784,9 911,0

KH06 11 40 698 452 312 19,7 39,1 61,4 86,5 113,9 146,0 180,7 217,6 262,3 311,7 367,8 430,3 502,8 584,4 677,0 775,0 885,0

KH07 12 64 515 469 360 23,9 49,5 77,3 107,4 140,9 177,0 217,5 261,5 308,6 360,1 417,2 480,0 549,0 622,9 705,6 797,8 900,8

KH08 13 93 685 697 467 27,9 57,4 91,0 127,5 168,3 215,1 268,2 326,7 391,7 466,7 550,4 643,3 748,8 862,3 995,6 1 146,6 1 381,5

KH09 20 95 415 432 313 20,5 39,6 61,7 86,1 113,2 144,8 179,9 219,5 263,6 313,0 369,2 430,3 499,9 576,5 662,0 756,4 863,1

KH10 21 91 169 519 452 32,7 65,5 101,0 140,9 183,5 229,8 280,1 333,2 391,0 451,7 516,7 581,1 651,4 722,7 797,6 877,1 957,5

KH11 22 57 384 580 454 27,6 56,8 92,0 129,3 170,7 219,8 271,1 327,2 390,2 454,2 525,6 600,9 683,3 771,7 865,6 965,8 1 071,6

KH12 23 181 691 850 670 37,7 80,4 129,2 184,3 245,6 314,3 390,1 475,0 567,6 669,9 783,8 905,1 1 037,8 1 180,3 1 330,4 1 501,9 1 694,4

KH13 30 142 728 988 779 39,4 85,9 144,8 214,2 293,1 377,6 472,0 571,3 674,2 779,0 888,3 1 000,1 1 115,8 1 257,1 1 429,2 1 627,0 1 853,7

KH14 31 83 298 2 971 2 592 1 268,3 1 393,2 1 525,2 1 658,7 1 802,7 1 946,8 2 099,5 2 257,7 2 430,3 2 591,5 2 764,1 2 946,5 3 158,4 3 396,3 3 675,6 4 009,5 4 422,9

KH15 32 65 365 2 488 2 079 1 209,3 1 295,8 1 383,0 1 472,9 1 564,9 1 660,3 1 759,2 1 863,6 1 971,1 2 078,9 2 189,0 2 304,7 2 422,7 2 594,9 2 876,1 3 239,7 3 701,3

KH16 33 152 665 4 203 3 745 1 584,7 1 976,5 2 325,5 2 579,3 2 746,6 2 917,8 3 107,6 3 304,4 3 520,0 3 745,0 3 991,8 4 256,8 4 549,6 4 879,7 5 264,2 5 707,4 6 261,8

KH17 40 119 559 2 646 2 254 1 166,1 1 270,9 1 377,5 1 488,5 1 604,0 1 722,5 1 843,4 1 971,9 2 110,0 2 253,5 2 405,7 2 574,7 2 784,9 3 027,5 3 304,2 3 625,9 4 034,0

KH18 41 45 360 3 263 2 888 1 336,7 1 512,5 1 681,1 1 859,9 2 035,6 2 215,0 2 397,4 2 568,8 2 720,1 2 888,3 3 069,4 3 267,3 3 488,7 3 738,1 4 025,8 4 393,4 4 841,6

KH19 42 73 151 4 554 4 065 2 621,0 2 755,0 2 893,3 3 034,8 3 177,2 3 334,5 3 500,5 3 673,4 3 862,7 4 064,4 4 274,6 4 499,5 4 755,2 5 056,5 5 397,1 5 793,1 6 303,2

KH20 43 35 914 4 571 4 070 2 570,7 2 705,2 2 850,9 3 004,1 3 155,6 3 312,4 3 487,7 3 669,4 3 861,8 4 069,6 4 284,9 4 527,1 4 797,5 5 103,9 5 447,5 5 861,8 6 382,0

KH21 50 120 165 2 291 1 943 1 212,4 1 282,5 1 356,5 1 432,3 1 510,3 1 590,1 1 674,2 1 760,4 1 849,6 1 942,5 2 038,5 2 141,3 2 246,1 2 355,7 2 469,4 2 762,9 3 232,9

KH22 51 137 752 4 572 4 109 2 515,0 2 664,9 2 817,2 2 978,6 3 143,5 3 319,0 3 501,2 3 691,9 3 891,3 4 109,4 4 339,2 4 594,2 4 870,6 5 186,9 5 539,8 5 956,9 6 479,9

KH23 52 36 215 4 336 3 849 2 599,6 2 709,0 2 823,5 2 941,5 3 069,1 3 201,5 3 344,2 3 504,2 3 667,1 3 848,9 4 045,0 4 260,8 4 498,7 4 770,8 5 088,1 5 474,5 5 962,4

KH24 53 267 939 4 633 4 099 2 637,0 2 772,2 2 912,9 3 057,5 3 208,4 3 365,3 3 528,8 3 707,9 3 896,9 4 099,3 4 320,3 4 562,6 4 827,9 5 128,9 5 477,8 5 896,2 6 428,4

KH25 60 193 624 552 436 24,0 50,9 83,8 121,2 163,1 210,5 260,4 314,0 373,5 436,1 502,0 573,0 646,0 722,1 802,9 888,8 977,2

KH26 61 50 454 1 692 1 669 161,2 1 182,5 1 234,0 1 285,9 1 342,0 1 400,7 1 463,2 1 528,9 1 595,8 1 668,7 1 741,8 1 821,0 1 904,3 1 990,9 2 082,9 2 178,5 2 281,0

KH27 62 26 271 2 766 2 552 1 348,1 1 538,1 1 709,0 1 852,4 1 978,5 2 103,6 2 221,7 2 339,1 2 452,7 2 551,7 2 652,2 2 772,0 2 905,3 3 059,1 3 241,7 3 468,9 3 781,4

KH28 63 76 724 3 224 2 933 1 632,4 1 913,0 2 120,9 2 299,2 2 463,6 2 559,5 2 640,9 2 728,8 2 827,7 2 933,2 3 052,2 3 184,3 3 340,5 3 517,5 3 732,8 3 997,6 4 345,7

KH29 70 199 372 365 276 22,6 43,0 64,6 87,8 112,6 140,2 169,7 202,1 237,2 276,0 320,2 368,9 424,1 486,6 557,6 641,4 738,9

KH30 71 28 913 598 565 58,5 119,2 174,6 228,6 281,5 335,2 390,3 444,8 502,3 564,5 622,8 684,5 744,3 808,2 873,5 940,2 1 009,9

KH31 72 124 878 1 292 1 442 25,8 62,7 118,6 215,8 498,0 1 194,3 1 251,5 1 313,0 1 375,6 1 441,9 1 512,9 1 589,5 1 672,5 1 763,4 1 859,4 1 965,2 2 078,4

KH32 73 347 565 1 220 1 415 14,7 32,2 59,1 102,6 184,8 445,0 1 200,4 1 267,9 1 339,4 1 414,7 1 493,6 1 577,2 1 665,5 1 759,2 1 859,6 1 967,5 2 084,2

KH33 80 75 304 527 495 72,8 124,0 168,2 212,6 257,3 301,7 348,6 395,7 444,1 494,5 547,2 602,7 659,6 719,7 783,0 849,3 917,2

KH34 81 103 656 417 334 36,0 62,9 90,2 118,1 146,9 178,4 211,3 248,3 289,0 334,2 383,6 437,9 498,8 567,4 643,6 728,1 821,9

KH35 82 24 658 458 345 16,7 51,6 82,6 115,2 149,1 184,6 219,3 257,4 297,8 345,3 395,3 449,8 507,5 575,5 647,2 729,4 822,4

KH36 83 31 206 982 1 199 0,9 3,8 9,1 15,0 22,1 32,4 47,8 76,5 155,9 1 199,2 1 264,9 1 335,1 1 417,4 1 509,6 1 622,1 1 759,1 1 930,6

KH37 90 301 456 623 619 151,9 214,8 269,7 322,0 371,7 421,1 470,2 520,2 569,5 619,2 669,1 719,9 771,5 823,8 876,4 930,6 985,9

KH38 91 252 820 322 241 27,8 47,1 66,0 86,4 107,8 130,7 155,3 181,6 209,5 241,0 275,7 314,9 358,7 409,4 467,9 537,8 624,2

KH39 92 130 904 265 171 21,4 35,3 48,6 62,5 77,3 92,6 109,4 127,5 147,5 170,5 196,9 228,3 263,3 306,0 357,8 423,1 509,1

KH40 93 425 926 143 62 5,5 9,6 13,7 18,2 23,3 29,0 35,6 43,1 52,0 62,4 74,8 89,8 108,4 131,6 163,2 206,9 271,4

TOTAL 5 373 026 1 390 687 31,2 2 772,2 90,0 3 057,5 181,3 3 365,3 316,4 3 707,9 461,9 4 109,4 632,4 4 594,2 780,9 5 186,9 1 055,1 5 956,9 1 556,4

Column groups after Number, Mean and Median, in order: 1st Decile, 2nd Decile, 3rd Decile, 4th Decile, 5th Decile, 6th Decile, 7th Decile, 8th Decile (Median and Higher limit for each), then 9th Decile (Median).

Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18

KH01 00 5 789 108,63 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37

KH02 01 1 580 109,62 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22

KH03 02 3 067 106,41 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51

KH04 03 2 345 109,01 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39

KH05 04 1 781 99,49 57,27 43,20 33,71 27,81 25,14 22,89 20,97 19,60 18,54 18,42 17,25 16,61 16,03 16,17 16,06 16,15 33,77

KH06 05 4 534 98,18 56,94 40,91 31,68 28,12 23,79 20,39 20,55 18,84 18,01 16,98 16,84 16,24 15,86 14,47 14,19 13,86 14,33

KH07 10 688 106,73 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62

KH08 11 378 105,92 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23

KH09 12 398 92,99 55,61 39,55 31,49 27,96 24,22 22,00 20,07 18,76 17,97 16,54 16,18 15,32 14,83 14,25 14,10 13,41 12,78

KH10 13 472 100,21 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49

KH11 14 627 106,06 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63

KH12 15 1 388 113,10 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42

KH13 20 4 594 118,30 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86

KH14 21 3 075 9,85 9,47 8,76 8,68 7,99 7,84 7,53 7,65 6,63 6,66 6,60 7,19 7,53 8,22 9,08 10,31 13,70 20,35

KH15 22 420 7,16 6,73 6,50 6,25 6,09 5,96 5,93 5,77 5,47 5,30 5,28 5,12 7,11 10,84 12,64 14,25 17,16 25,06

KH16 23 3 872 24,73 17,66 10,91 6,49 6,23 6,50 6,33 6,52 6,39 6,59 6,64 6,88 7,26 7,88 8,42 9,71 11,71 16,92

KH17 24 959 8,99 8,39 8,05 7,76 7,39 7,02 6,98 7,00 6,80 6,75 7,02 8,16 8,71 9,14 9,73 11,26 14,23 20,64

KH18 25 2 661 13,15 11,15 10,64 9,44 8,81 8,23 7,15 5,89 6,18 6,27 6,45 6,78 7,15 7,69 9,13 10,20 12,58 19,12

KH19 30 883 5,11 5,02 4,89 4,69 4,95 4,98 4,94 5,15 5,22 5,17 5,26 5,68 6,34 6,74 7,34 8,81 11,07 16,74

KH20 31 1 037 5,23 5,39 5,38 5,04 4,97 5,29 5,21 5,24 5,38 5,29 5,65 5,97 6,39 6,73 7,61 8,87 11,67 16,93

KH21 32 66 5,78 5,77 5,58 5,45 5,28 5,29 5,15 5,07 5,02 4,94 5,04 4,89 4,88 4,83 11,89 17,01 20,42 27,37

KH22 33 1 429 5,96 5,71 5,73 5,53 5,58 5,49 5,45 5,40 5,60 5,59 5,88 6,02 6,49 6,81 7,53 8,78 10,61 15,87

KH23 34 381 4,21 4,23 4,18 4,34 4,31 4,46 4,78 4,65 4,96 5,10 5,34 5,58 6,05 6,65 7,59 8,91 10,88 16,20

KH24 35 3 392 5,12 5,07 4,96 4,94 4,89 4,86 5,08 5,10 5,19 5,39 5,61 5,82 6,23 6,80 7,64 9,03 11,30 17,32

KH25 40 1 508 111,69 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91

KH26 41 1 340 633,59 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28

KH27 42 1 525 14,09 11,11 8,39 6,81 6,33 5,61 5,28 4,86 4,04 3,94 4,52 4,81 5,29 5,97 7,01 9,01 12,27 18,53

KH28 43 956 17,19 10,87 8,40 7,15 3,89 3,18 3,33 3,63 3,73 4,06 4,33 4,91 5,30 6,12 7,09 8,71 11,18 17,50

KH29 44 684 90,17 50,40 35,81 28,31 24,52 21,03 19,11 17,34 16,37 16,02 15,22 14,95 14,74 14,58 15,04 15,19 15,33 16,13

KH30 45 2 217 103,76 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45

KH31 50 3 839 143,57 89,13 81,91 130,78 139,82 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39

KH32 51 277 118,82 83,58 73,58 80,03 140,84 169,74 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26

KH33 52 1 649 70,22 35,66 26,38 21,03 17,27 15,53 13,53 12,21 11,36 10,67 10,13 9,45 9,11 8,80 8,47 7,99 7,89 8,40

KH34 53 799 74,74 43,50 30,94 24,35 21,45 18,44 17,50 16,40 15,64 14,78 14,16 13,91 13,76 13,43 13,12 12,88 12,70 12,05

KH35 54 392 208,99 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70

KH36 55 4 055 322,22 139,61 65,18 46,88 46,76 47,38 60,19 103,73 669,03 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74

KH37 60 353 41,38 25,55 19,40 15,44 13,28 11,67 10,62 9,49 8,73 8,05 7,60 7,16 6,78 6,38 6,18 5,94 5,75 5,56

KH38 61 1 000 69,11 40,25 30,82 24,84 21,22 18,77 16,95 15,39 15,04 14,40 14,20 13,93 14,13 14,28 14,93 16,07 18,24 21,65

KH39 62 904 64,60 37,69 28,67 23,61 19,82 18,09 16,55 15,76 15,56 15,50 15,91 15,33 16,25 16,93 18,25 20,33 23,48 29,45

KH40 63 1 670 73,32 43,87 32,68 27,54 24,82 22,54 21,23 20,67 20,03 19,75 20,16 20,60 21,49 24,02 26,72 31,21 39,22 54,61

Mean 39,16 38,60 29,60 23,07 18,69 16,44 15,15 13,07 12,38 11,70 11,36 11,18 11,19 11,45 12,07 12,98 14,62 19,14

Clusters

This table gives the evolution rate assigned to each customer depending on his clone group and present turnover.

The last row of the table shows the average of these rates. When an evolution rate is aberrant (>100%), we replace this excessively high value with the mean rate calculated over all the groups except those containing such a high value. The table below shows the corrections made.
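The correction rule can be sketched as follows. It rests on one reading of the rule, which is an assumption: for each rate column (Tx01, Tx02, ...), the mean is taken over the groups whose value in that column is not aberrant, and that mean replaces every aberrant value in the column. The function name and the toy figures are hypothetical.

```python
def correct_aberrant_rates(rates_by_cluster, threshold=100.0):
    """Replace rates above the threshold by the column mean of the non-aberrant rates."""
    clusters = list(rates_by_cluster)
    n_rates = len(rates_by_cluster[clusters[0]])
    column_means = []
    for j in range(n_rates):
        kept = [rates_by_cluster[c][j] for c in clusters
                if rates_by_cluster[c][j] <= threshold]
        if not kept:
            raise ValueError(f"every rate in column {j} is aberrant")
        column_means.append(sum(kept) / len(kept))
    return {
        c: [column_means[j] if r > threshold else r
            for j, r in enumerate(rates_by_cluster[c])]
        for c in clusters
    }


# Toy example with three groups and two rate columns.
toy = {"A": [150.0, 10.0], "B": [50.0, 20.0], "C": [30.0, 120.0]}
print(correct_aberrant_rates(toy))
```

Excluding the aberrant values before averaging keeps a single extreme rate, caused by a very small first-decile median, from inflating the replacement value itself.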

For instance, a customer who belongs to clone group 00 (KH01) and has a turnover below 21.5 euros will have an evolution rate equal to rate 1, i.e. 89%.

If a customer belongs to clone group 00 (KH01) and has a turnover between 21.5 and 40.5 euros, he will have an evolution rate equal to rate 2, i.e. 54%.



The evolution rates given for the two examples are very high, but they concern very small

customers.

Each of the customers is assigned an evolution rate. The rate multiplied by turnover will allow

us to estimate potential turnover for each of the customers.

CORRECTION OF ABERRANT RATES

Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18

KH01 0 5 789 39,16 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37

KH02 1 1 580 39,16 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22

KH03 2 3 067 39,16 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51

KH04 3 2 345 39,16 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39

KH07 10 688 39,16 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62

KH08 11 378 39,16 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23

KH10 13 472 39,16 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49

KH11 14 627 39,16 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63

KH12 15 1 388 39,16 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42

KH13 20 4 594 39,16 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86

KH25 40 1 508 39,16 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91

KH26 41 1 340 39,16 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28

KH30 45 2 217 39,16 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45

KH31 50 3 839 39,16 89,13 81,91 23,07 18,69 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39

KH32 51 277 39,16 83,58 73,58 80,03 18,69 16,44 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26

KH35 54 392 39,16 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70

KH36 55 4 055 39,16 38,60 65,18 46,88 46,76 47,38 60,19 13,07 12,38 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74

Cluster

The above table shows the replacement by the average of the aberrant rates (>100%).

4- Main results

The average incremental rate of the loyalty program customers is 12.79%.

This retailer can earn 12.79% of extra turnover on these customers.

Customer assigned to turnover

          Number   Turnover   Turnover potential
Bronze    41,9%    65,9%      38,4%
Silver    14,2%    15,0%      20,0%
Gold      44,0%    19,0%      41,5%

41.9% of the retailer's customers are "Bronze" potentials: they generate 65.9% of annual turnover and account for 38.4% of the potential turnover.

At the other end, 44% of customers are "Gold" potentials: they generate only 19% of the turnover but account for 41.5% of potential turnover.

Regrouping in SML segments:

(Chart not reproduced: breakdown of each potential category (Bronze, Silver, Gold) by segment (S, M, L, New), in % of customers.)

S customers account for a high proportion of "Gold" potentials, based on annual turnover. M customers are found among the "Bronze" potentials, and L customers among the "Bronze" and "Silver" potentials. Most of them have an interesting margin for growth.

In annual turnover (in euros):

(Chart not reproduced: breakdown of the annual turnover of each potential category (Bronze, Silver, Gold) by segment (New, S, M, L), in %.)



In potential turnover (in thousands of euros):

(Chart not reproduced: breakdown of the potential turnover of each potential category (Bronze, Silver, Gold) by segment (New, S, M, L), in %.)

L customers have the largest potential in absolute value, although they do not have the highest evolution rates. They make up for this with a much larger turnover than the S or M segments.

S customers are over-represented among "Gold" potentials, with 29.9% of the potential turnover of the cluster.

Distribution by the retailer RFM and by potential categories

In numbers:

(Chart not reproduced: breakdown of customers by retailer RFM status (Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+) and by potential category, in %.)

In annual turnover (in euros):

(Chart not reproduced: breakdown of annual turnover by retailer RFM status (Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+) and by potential category, in %.)

Logically, heavier potentials should be present in RFM+ segments in absolute values.

Categories of potential:

Potential rates per clone cluster are grouped into four categories:

- P0: No potential turnover

- P1: Potential > 20%

- P2: Potential between 15% and 20%

- P3: Potential below 15%
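The four categories can be expressed as a small classification function. This is a sketch with a hypothetical function name; how rates of exactly 15% and 20% are treated is not stated in the text, so placing both in P2 is an assumption.

```python
def potential_category(rate_percent):
    """Map an evolution rate (in %) to a potential category P0 to P3."""
    if rate_percent <= 0:
        return "P0"  # no potential turnover
    if rate_percent > 20:
        return "P1"
    if rate_percent >= 15:
        return "P2"  # assumption: both bounds, 15 and 20, are included
    return "P3"


for rate in (0.0, 25.0, 20.0, 15.0, 14.9):
    print(rate, potential_category(rate))
```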


(Chart not reproduced: share of the number of customers, of turnover and of potential turnover accounted for by each potential category P0, P1, P2 and P3, in %.)

40% of the customers generate 63% of the turnover and 47.8% of the potential turnover. On average, they achieve a turnover of 2,202 euros for an average potential of 162 euros. These customers, who already contribute substantially, are the most likely (for the least perceived effort) to reach their potential.

Grouping in SML segments

(Chart not reproduced: breakdown of each potential category (P0, P1, P2, P3) by segment (New, S, M, L), in % of customers.)

S customers represent a high proportion of "P1" and "P2", based on annual turnover generated by P1 potentials. M customers are found mainly among the P3 potentials, while L customers are found under "P0" but also under "P3".

In yearly turnover (in euros):

(Chart not reproduced: breakdown of the yearly turnover of each potential category (P0, P1, P2, P3) by segment (New, S, M, L), in %.)



Average amount

         P0       P1       P2       P3       TOTAL
New      4 187    120      430      456      165
S        1 004    222      422      701      384
M        2 131    1 673    1 756    1 732    1 746
L        6 439    3 911    6 493    3 963    4 401
TOTAL    3 853    448      961      2 202    1 393

In potential of turnover (in euros):

(Chart not reproduced: breakdown of the potential turnover of each potential category (P1, P2, P3) by segment (New, S, M, L), in %.)

S customers are over-represented in the "P1" category, with 38.9% of the potential turnover for this segment. In absolute terms, however, it is the L customers who have the highest potential. It is with good customers that we can increase turnover, as they are more likely to succeed than any other segment. The marketing budget can therefore be allocated on the basis of average turnover and of the intensity of offers by potential. The two concepts are complementary in defining the mechanics of loyalty and retention.

5-Results summary

The Kohonen network allows us to group customers into 40 clone clusters. The 4 by 10 matrix had no empty group, so we retained it.

Customers within the same group resemble each other according to socio demographic and

consumption characteristics.

Using the deciles method, we assigned a turnover evolution rate to each customer in the

sample.

We created the following potential turnover score:

- Gold : evolution rate higher than 20%

- Silver : evolution rate between 15% and 20%

- Bronze : evolution rate below 15%



We calculated potential turnover from this rate and turnover.

Our sample is composed of 5,373,026 customers generating an annual turnover of 7.46 billion euros and representing a potential turnover of 953.2 million euros.

The retailer can thus earn almost 12.77% more turnover from its customers.

In terms of rate, the sample is composed of 41.9% Bronze customers, 14.1% Silver customers and 44% Gold customers. In reality, it must be assumed that the best customers are those with the highest absolute values.

76.6% of Gold customers are S and generate 29.9% of potential turnover.

65% of Gold customers are 3 months Inactive, RFM-- or RFM-, and they generate 40% of annual turnover and 38.4% of potential turnover.

Logically, customers with the highest evolution rates are found among customers with low turnover.

At the opposite end of the scale, customers with the highest turnover have the strongest potential turnover in absolute value.

At the opposite end of the scale, customers with the highest turnover have the strongest

potential of turnover in terms of absolute value.

Average amount

         P1       P2       P3     TOTAL
New      49       77       34     52
S        60       71       75     64
M        431      302      120    182
L        1 055    1 104    279    336
TOTAL    120      163      162    137

The validation procedures for the models

Internal validity

We carried out several tests on our model:

- Division of our population into sub-populations to check the coherence of the allocation of the clone classes

- Benchmark of several classification techniques

- Re-allocation of the classes by supervised models (C5, Bayesian network)

- Connectivity of super classes

The internal validation methods will of course need to be completed.

External validity

The customer wallet share is consistent with the TNS figure of 24%. Given overall consumption, the achievable potential would raise the wallet share to 28%. An extra 2% of wallet share is much more realistic.



We would like to de-duplicate (13) our base against the Nielsen Home Scan Panel to check whether sales really do increase, but this is not yet possible in this context.

Conclusions

Discussion of the results of the research study

The results of our research study will be placed in the context of determining corporate customer potential: determining customer potential represents a major part of a company's direct and promotional marketing investment. Most large loyalty programs are based on this notion.

We will look at how our approach, compared with other methods, enables us to establish converging results that answer our research questions:

- The clustering technique (SOM) is used to identify customers who are similar and to define a realistic potential.

- We can estimate the stability of the clusters in several ways, all of which show internal stability.

- We have developed a pragmatic method for determining potential: the clones method.

The limits and the contribution of our research study

We used specific clustering techniques for the purpose of validating our method. We have shown the possible statistical limits of our approach in terms of the complexity or reliability of the models used.

For feasibility reasons we worked only with a single business area, the large grocery retail sector in France, and used only ancillary data from other business sectors.

We do not have access at the moment to data from foreign retailers, for example.

The calculation of potential turnover per group is very empirical and should be more scientifically justified.

Further research

There are several ways to improve upon our research:

- Refine our choice of variables

- Determine a less empirical method than the deciles/median method for estimating the potential per group

- Run the model in other industrial sectors; we have done this and it works quite well, but it is important that others test it

- Validate the results over time, by comparing the potential values with actual sales

13 We merge the two databases to find the duplicates.



We hope that this method will find widespread use, given its strategic impact on company results and the fact that the calculation is based on internal customer data already at hand.

Bibliography

1. Aguilera, P. A., Frenich, A. G., Torres, J. A., Castro, H., Vidal, J. L. M., and Canton,M. (2001). Application of the cohune neural network in coastal water management:

Methodological development for the assessment and prediction of water quality.

Water Research, 35(17):40534062.

2. Anderson, B. (1999). Kohonen neural networks and language. Brain and Language,70(1):8694

3. B Meunier, E Dumas, I Piec, D Bechet, M Hebraud, - J Proteome Res, 2007 -Assessment of hierarchical clustering methodologies for proteomic data mining - les

4 versions aseanbiotechnology.info4. Baran, Stanley J. Theories of Mass Communication.5. Benavent and Crie http://christophe.benavent.free.fr/publications/ltv1.pdf6. Beran, R. (1986). Discussion of Wu, C.F.J.: Jackknife, bootstrap, and other resampling

methods in regression analysis (with discussion). Ann. Statist., 14:1295-1298.

7. Berend Wierenga and Gerrit Harm van Bruggen (2000), Marketing Management,Springer Support Systems: Principles, Tools, and Implementation, Springer

8. Berger, Paul D. and Nada I. Nasr (1998), "Customer lifetime value: Marketing modelsand applications," Journal of Interactive Marketing, 12 (1), p.1730

9. Bertrand Clarke et Dongchu Sun, Reference priors under the Chi-Squared distance:The Indian Journal of Statistics 1997, Volume 59, Series A, Pt. 2, 215-231

10.Boos, D.D. (2003). Introduction to the bootstrap world. Statist. Science, 18:168-174.11.Borko, H. and Bernick, M., 'Automatic document classification', Journal of the ACM,

10, 151-162 (1963).

12.Bremer and Joyce (1988), Human Judgment,The SJT View, North-Holand13.Bruce Cooil, Timothy L Keiningham, Lerzan Aksoy, Michael Hsu. (2007) A

Longitudinal Analysis of Customer Satisfaction and Wallet share: Investigating the

Moderating Effect of Customer Characteristics. Journal of Marketing 71:1, 67-83

14. Charles Romesburg, Cluster Analysis for Researchers (2004), Lulu Press, p. 135.

15. Ching-Hsue Cheng and You-Shyang Chen, Classifying the segmentation of customer value via RFM model and RS theory, Expert Systems with Applications, In Press, Corrected Proof, Available online 16 April 2008.

Collectif, Recherche sur la Distribution moderne, p. 64, Ed.: l'Univers du Livre.

16. Ciampi, A. and Lechevallier, Y. (2000). Clustering large, multi-level data sets: an approach based on Kohonen self-organizing maps. In Principles of Data Mining and Knowledge Discovery. 4th European Conference, PKDD 2000. Proceedings (Lecture Notes in Artificial Intelligence Vol. 1910). Springer-Verlag, Berlin, Germany, pages 353-8.


18. Dahbur, K. and Muscarello, T. (2001). Hybrid Kohonen neural network in data mining. In Proceedings of the IASTED International Conference. Artificial Intelligence and Applications. ACTA Press, Anaheim, CA, USA, pages 303.

19. David Huff, 18 June 2003, University of Texas Austin, "A Retrospective View of the Huff Model and its Application to Spatial Interaction Analysis", University of Redlands/ESRI Colloquium Series.

20. Dorofeyuk, A.A., 'Automatic Classification Algorithms (Review)', Automation and Remote Control, 32, 1928-1958 (1971).

21. Dwyer, R.F. (1997), "Customer lifetime valuation to support marketing decision making", Journal of Direct Marketing, Vol. 11 No. 4, p. 6-13.

22. Efron, B. (1981). Non parametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 68, pp. 589-599.

23. Eric Chen-Kuo Tsao, James C. Bezdek and Nikhil R. Pal, "Fuzzy Kohonen clustering networks", 1994, published by Elsevier Science B.V.

24. F. V. Jensen, Introduction to Bayesian Networks, 1st edition, 1996, Springer-Verlag New York, Inc.

25. Fang, K.; He, S. The problem of selecting a given number of representative points in a normal population and a generalized Mills ratio. Technical report, Department of Statistics, Stanford University, 1982.

MacQueen, J. Some methods for classification and analysis of multivariate observations. Proceedings 5th Berkeley Symposium on Mathematics, Statistics and Probability. 1967;3:281-297.

26. Frank Plastria, Static competitive facility location: An overview of optimisation approaches, European Journal of Operational Research, Volume 129, Issue 3, 16 March 2001, Pages 461-470.

27. Gehrlein, W. V. General mathematical programming formulations for the statistical classification problem. Operations Research Letters, ISSN 0167-6377, CODEN ORLED5.

28. Harris, M.J. and N. Blisard. 1995. Characteristics of the Nielsen Homescan Data. Working paper. Washington, DC: U.S. Department of Agriculture, Economic Research Service.

29. Hartigan, J. A., Wong, M. A. A k-means clustering algorithm. Applied Statistics. 1979;28:100-108.

30. http://en.wikipedia.org/wiki/Lifetime_value

31. J. R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.

32. Jajuga, K. Classification, Clustering and Data Analysis: Recent Advances and Applications, 2002, Lavoisier.

33. John A. McCarty and Manoj Hastak, Segmentation approaches in data-mining: A comparison of RFM, CHAID, and logistic regression, Journal of Business Research, Volume 60, Issue 6, June 2007, Pages 656-662.

34. Juha Vesanto, 1997, The SOM in data mining: analysis of world pulp and paper technology.

35. Julien Barnier, Tout ce que vous n'avez jamais voulu savoir sur le Chi2 sans jamais avoir eu envie de le demander, Groupe de Recherche sur la Socialisation, CNRS UMR 5040, 15 April 2008.

36. Kaski, S., "Data exploration using self-organizing maps", Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82, Espoo, 1997.

37. Kohonen, T., Self-Organization and Associative Memory, New York: Springer-Verlag, 1988.

38. Lerman, I.C., Les Bases de la Classification Automatique, Gauthier-Villars, Paris (1970).

39. M. Roux, 1985, Algorithmes de classification, Editions Masson, Paris.

40. Matthias Otto, Chemometrics: Statistics and Computer Application in Analytical Chemistry, published 2007, Wiley-VCH.

41. Nielsen, Inc. May 2006. Understanding the Homescan Advantage. Presentation by Liz Crews and Ed Groves, Nielsen at RTI International, Research Triangle Park, NC.

42. O. Pourret, P. Naim and B. Marcot (2008). Bayesian Networks: A Practical Guide to Applications. Chichester, UK: Wiley. ISBN 978-0-470-06030-8.

43. Olivier Brusset, Segmentation: Cibler, scorer, analyser, une seule limite, les rendements, Marketing Direct No. 92, 01/04/2005, p. 2.

44. J. Pena, M. Vanegas, A. Valencia, Digital Hardware Architectures of Kohonen's Self Organizing Feature Maps with Exponential Neighboring Function, 2006 IEEE International Conference on Reconfigurable Computing and FPGA's (ReConFig 2006), pp. 1-8.

45. Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

46. Quinlan, R. (2004). Data mining tools See5 and C5.0.

47. Rajanee Ranjan, Encyclopaedia of Marketing Research, published 2002, Anmol Publications PVT. LTD., p. 585.

48. Reilly, W.J. (1931). The law of retail gravitation, New York.

49. S. Kaski, J. Nikkila, and T. Kohonen, Methods for Exploratory Cluster Analysis, in Intelligent Exploration of the Web, by Piotr S. Szczepaniak, published 2003, Springer.

50. Size and Share of Customer Wallet. Rex Yuxing Du, Wagner A. Kamakura, Carl F. Mela. Journal of Marketing, Volume 71, Issue 2, pp. 94-113.

51. Tan, Peter J., Dowe, David L., Dix, Trevor I., Building classification model in two steps, 1997.

52. Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 3rd edition, 1989.

53. Teuvo Kohonen. Self-Organizing Maps, 3rd edition. Springer, 20

54. The Useful Words from a Decisional Corpus. Contribution of Correspondence Analysis. Springer Berlin / Heidelberg, Volume 185/2005, p. 159-179.

55. Timothy L. Keiningham, Bruce Cooil, Lerzan Aksoy, Tor W. Andreassen, Jay Weiner (2007). The value of different customer satisfaction and loyalty metrics in predicting customer retention, recommendation, and share-of-wallet. Managing Service Quality 17:4, 361-384.

56. Todd A. Stephenson, An Introduction to Bayesian Network Theory and Usage, IDIAP-RR 00-03, 2000.

57. Vallaud Thierry (2003), La fidélisation rentable : la proposition du modèle composite, www.numlog.com

58. Venkatesan, Rajkumar and V. Kumar (2004), "A Customer Lifetime Value Framework for Customer Selection and Resource Allocation Strategy," Journal of Marketing, 68 (October), p. 106-125.



4. Appendix

Appendix 1 : Translation of the filenames

Appendix 2 : Detail of the first data audit

The data set was audited in two stages: a first stage to identify all the data in the original database that are useful for the analysis, and a second stage to determine the data available to calculate potential. Only the second stage is shown in this appendix.

Analysis of the Potentiel_Ratio and Potentiel_Socio tables

Potentiel_Ratio contains 5 373 048 observations (Customer accounts)

It is composed of 26 fields

Potentiel_Socio contains 5 373 056 observations (Customer accounts)

It is composed of 18 fields

This audit is based on the combination of the two tables, i.e. 5 373 048 observations
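
The two tables differ by eight accounts, and the audit keeps 5 373 048 observations, which would be consistent with retaining only the accounts present in both tables. A minimal sketch of such a combination follows. It assumes that both tables share an account identifier; the field names and the sample rows are hypothetical and are not taken from the retailer's database.

```python
# Hypothetical miniature versions of the two tables, keyed by account id.
potentiel_ratio = {
    101: {"rate_wine": 2.1},
    102: {"rate_wine": 0.0},
}
potentiel_socio = {
    101: {"age_band": "30 to 39 years"},
    102: {"age_band": None},
    103: {"age_band": "40 to 49 years"},  # no matching ratio record
}


def combine(ratio, socio):
    """Keep only accounts present in both tables and merge their fields."""
    shared_ids = ratio.keys() & socio.keys()
    return {cid: {**ratio[cid], **socio[cid]} for cid in sorted(shared_ids)}


combined = combine(potentiel_ratio, potentiel_socio)
print(len(combined))  # number of accounts kept for the audit
```

In this toy example account 103 is dropped because it has no ratio record, in the same way that the combined audit table is limited by the smaller of the two source tables.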



Data format

This is the original data format. We may have to change some formats to better achieve our

model objectives.



The RFM 3 months variable is empty and is therefore discarded.

Variable by variable analysis

Dichotomous variables

RFM 3 months | Number | % | First audit comparison
New | 247 326 | 4.60% | 3.74%
Ex-customers | 400 236 | 7.45% |
Inactive | 465 107 | 8.66% | 19.15%
M--F-- | 1 315 873 | 24.49% | 21.00%
M-F- | 1 302 089 | 24.23% | 26.00%
M-F+ | 248 619 | 4.63% | 6.06%
M+F- | 710 311 | 13.22% | 11.65%
M+F+ | 683 487 | 12.72% | 12.40%
Total | 5 373 048 | 100.00% | 100.00%

Family status | Number | % | First audit comparison
Couple | 1 557 871 | 28.99% | 26.19%
Single | 642 374 | 11.96% | 10.81%
Empty | 3 172 803 | 59.05% | 62.99%
Total | 5 373 048 | 100.00% | 100.00%
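
Each categorical table in this audit reports a count and a percentage per value, with missing values grouped under "Empty". The sketch below shows one way such a table could be computed. The function name and the sample values are illustrative assumptions, not the tool actually used for the audit.

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows; missing values are labelled 'Empty'."""
    labels = ["Empty" if v in (None, "") else v for v in values]
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cat, n, round(100.0 * n / total, 2)) for cat, n in counts.most_common()]


# Hypothetical sample of a family status field with missing entries.
sample = ["Couple", "Single", None, "Couple", "", None, None, "Couple"]
for category, number, percent in frequency_table(sample):
    print(f"{category:<8}{number:>4}{percent:>8.2f}%")
```

Comparing the percentage column of two such tables, one per audit, gives the "First audit comparison" view used throughout this appendix.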

SML on 12 months | Number | %
NA | 7 816 | 0.15%
NV | 245 278 | 4.56%
I | 2 573 | 0.05%
S | 3 086 955 | 57.45%
M | 1 014 777 | 18.89%
L | 1 015 649 | 18.90%
Total | 5 373 048 | 100.00%

Home type | Number | % | First audit comparison
Flat | 880 722 | 16.39% | 18.76%
House and flat | 1 300 | 0.02% | 0.00%
House | 1 576 829 | 29.35% | 34.98%
Empty | 2 914 197 | 54.24% | 65.02%
Total | 5 373 048 | 100.00% | 100.00%

Number of children in the household | Number | % | First audit comparison
0 | 4 088 282 | 76.09% | 75.72%
1 | 510 448 | 9.50% | 9.57%
2 | 508 967 | 9.47% | 9.54%
3 | 196 520 | 3.66% | 3.79%
4 | 48 644 | 0.91% | 1.01%
5 | 11 482 | 0.21% | 0.22%
> 5 | 8 705 | 0.16% | 0.14%
Total | 5 373 048 | 100.00% | 100.00%



Social categories | Number | % | First audit comparison
Farmer | 49 892 | 0.93% | 0.96%
Artisan | 86 807 | 1.62% | 1.79%
Other | 84 696 | 1.58% | 1.39%
Manager | 188 017 | 3.50% | 3.50%
Employee | 737 469 | 13.73% | 14.42%
Student | 85 928 | 1.60% | 1.28%
Housewife | 211 003 | 3.93% | 4.38%
Civil servant | 233 386 | 4.34% | 3.70%
Independent worker | 42 913 | 0.80% | 0.72%
Worker | 138 099 | 2.57% | 2.73%
Retired | 664 304 | 12.36% | 14.05%
Unemployed | 147 700 | 2.75% | 3.28%
Technician | 91 956 | 1.71% | 2.08%
Empty | 2 610 877 | 48.59% | 45.71%
24 | 1 | 0.00% | 0.00%
Total | 5 373 048 | 100.00% | 100.00%

The value 24 is an error; we eliminate it.
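
A minimal sketch of this kind of clean-up follows. It assumes that the offending record is set aside rather than recoded; the field names and the sample records are hypothetical.

```python
# Social categories listed in the audit table; "Empty" marks a missing value.
VALID_SOCIAL_CATEGORIES = {
    "Farmer", "Artisan", "Other", "Manager", "Employee", "Student",
    "Housewife", "Civil servant", "Independent worker", "Worker",
    "Retired", "Unemployed", "Technician", "Empty",
}


def drop_invalid(records, field, valid_values):
    """Split records into those whose field holds a known value and the rest."""
    kept = [r for r in records if r[field] in valid_values]
    rejected = [r for r in records if r[field] not in valid_values]
    return kept, rejected


# Hypothetical sample: the third record carries the stray code "24".
sample = [
    {"customer_id": 1, "social_category": "Retired"},
    {"customer_id": 2, "social_category": "Empty"},
    {"customer_id": 3, "social_category": "24"},
]
kept, rejected = drop_invalid(sample, "social_category", VALID_SOCIAL_CATEGORIES)
print(len(kept), len(rejected))
```

Returning the rejected records separately keeps a trace of what was eliminated, which is useful when an audit is repeated on a later extract.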

Age | Number | % | First audit comparison
0 to 18 years | 8 604 | 0.16% | 0.21%
19 to 29 years | 317 371 | 5.91% | 6.00%
30 to 39 years | 537 010 | 9.99% | 10.76%
40 to 49 years | 652 497 | 12.14% | 12.59%
50 to 59 years | 649 038 | 12.08% | 12.30%
60 to 69 years | 458 669 | 8.54% | 8.07%
70 years and more | 592 227 | 11.02% | 10.92%
Empty | 2 157 632 | 40.16% | 39.15%
Total | 5 373 048 | 100.00% | 100.00%

Customer history | Number | % | First audit comparison
0 to 2 months | 194 269 | 3.62% | 2.76%
3 to 5 months | 238 267 | 4.43% | 2.54%
6 to 8 months | 182 733 | 3.40% | 3.20%
9 to 11 months | 221 313 | 4.12% | 3.26%
12 to 17 months | 354 680 | 6.60% | 6.39%
18 to 23 months | 231 513 | 4.31% | 6.17%
24 to 35 months | 517 972 | 9.64% | 9.58%
36 to 47 months | 396 706 | 7.38% | 7.96%
48 to 59 months | 403 389 | 7.51% | 10.01%
60 months and more | 2 631 717 | 48.98% | 44.60%
Empty | 489 | 0.01% | 3.53%
Total | 5 373 048 | 100.00% | 100.00%
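
The bands in this table can be derived from a duration expressed in whole months. The sketch below shows one possible banding function. The band limits are those of the table; the function name and the treatment of missing or negative durations as "Empty" are assumptions made for illustration.

```python
import bisect

# Lower bound (in months) of each band, taken from the table.
BAND_STARTS = [0, 3, 6, 9, 12, 18, 24, 36, 48, 60]
BAND_LABELS = [
    "0 to 2 months", "3 to 5 months", "6 to 8 months", "9 to 11 months",
    "12 to 17 months", "18 to 23 months", "24 to 35 months",
    "36 to 47 months", "48 to 59 months", "60 months and more",
]


def history_band(months):
    """Map a duration in whole months to its band label."""
    if months is None or months < 0:
        return "Empty"  # assumption: unusable durations count as missing
    index = bisect.bisect_right(BAND_STARTS, months) - 1
    return BAND_LABELS[index]


for months in (0, 5, 17, 60, None):
    print(months, "->", history_band(months))
```

The same approach would apply to the age bands and to the time since last purchase, with a different list of lower bounds and labels.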



Time since last purchase | Number | %
0 to 2 months | 4 213 358 | 78.42%
3 to 5 months | 562 908 | 10.48%
6 to 8 months | 312 721 | 5.82%
9 to 11 months | 241 093 | 4.49%
12 to 17 months | 42 968 | 0.80%
Total | 5 373 048 | 100.00%

Numerical variables

Variable | Amount | Mean | Min | Max | SD
Rate Other | 5 373 055 | 77.1 | -34 035.0 | 294.9 | 22.2
Rate Bazar | 5 373 055 | 7.0 | -27 666.7 | 5 328.9 | 16.3
Rate BOF / APF | 5 373 055 | 9.0 | -195.2 | 9 685.6 | 8.4
Rate Pork butcher LS | 5 373 055 | 6.5 | -1 580.0 | 3 186.7 | 6.7
Rate Pet | 5 373 055 | 1.2 | -15.3 | 1 206.5 | 3.7
Rate Beauty/Make up | 5 373 055 | 0.7 | -61.2 | 574.5 | 2.4
Rate Baby | 5 373 055 | 1.4 | -23.5 | 3 614.4 | 5.8
Rate Butcher | 5 373 055 | 6.5 | -125.2 | 4 193.0 | 8.6

Variable | Amount | Mean | Min | Max | SD
Rate Baker | 5 373 055 | 2.5 | -138.5 | 1 247.6 | 4.9
Rate Pork butcher | 5 373 055 | 2.6 | -74.5 | 1 373.1 | 4.7
Rate Dietetic food | 5 373 055 | 0.4 | -11.3 | 2 428.6 | 2.3
Rate cheese | 5 373 055 | 1.5 | -29.4 | 262.7 | 2.9
Rate fruits and vegetables | 5 373 055 | 8.4 | -105.5 | 17 364.3 | 12.0
Rate fisher | 5 373 055 | 1.9 | -200.0 | 1 542.9 | 4.5
Rate frozen food | 5 373 055 | 3.1 | -308.3 | 11 528.6 | 7.6
Rate wine | 5 373 055 | 2.1 | -530.3 | 610.2 | 5.4

Variable | Amount | Mean | Min | Max | SD
Ratio cleaning products | 5 373 055 | 8.9 | -1 472.8 | 14 007.1 | 12.4
Ratio grocery | 5 373 055 | 16.8 | -1 463.6 | 16 122.2 | 14.9
Ratio liquid | 5 373 055 | 10.8 | -2 843.2 | 6 680.4 | 13.8
Ratio textile | 5 373 055 | 2.0 | -1 250.0 | 2 713.2 | 5.4
Ratio ultra fresh food | 5 373 055 | 4.7 | -372.3 | 5 814.3 | 6.4
Rate of poultry | 5 373 055 | 1.7 | -37.5 | 2 822.8 | 3.8
Rate first price | 5 373 055 | 6.2 | -393.8 | 26 242.0 | 15.0
Rate retailer brand 1 | 5 373 055 | 14.6 | -194.9 | 7 501.1 | 11.5
Rate retailer brand 2 | 5 373 055 | 2.0 | -30.9 | 2 031.8 | 3.4
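
The five indicators reported for each numerical variable (Amount, Mean, Min, Max, SD) can be recomputed with standard tools. The sketch below is illustrative only: it assumes that "Amount" is the number of observations and that SD is the sample standard deviation, neither of which is stated in the audit, and it uses made-up sample values.

```python
import statistics


def summarize(values):
    """Return the five audit indicators for one numerical variable."""
    return {
        "Amount": len(values),                    # assumed: number of observations
        "Mean": round(statistics.mean(values), 1),
        "Min": min(values),
        "Max": max(values),
        "SD": round(statistics.stdev(values), 1),  # assumed: sample standard deviation
    }


# Hypothetical rate values for a handful of customers.
sample_rates = [0.0, 2.5, 5.0, 7.5, 10.0]
print(summarize(sample_rates))
```

Reading Min and Max next to Mean and SD, as in the audit tables, makes extreme values easy to spot before any modelling.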

Monetary fields contain no decimal separator, so we divided turnover by 100.
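
A minimal sketch of this conversion, assuming the raw field is an integer number of hundredths of the currency unit. The function name and the sample values are hypothetical; Decimal is used here so that the division introduces no binary rounding error.

```python
from decimal import Decimal


def to_currency_units(raw_amount):
    """Convert an integer amount stored without a decimal separator into units."""
    return Decimal(raw_amount) / Decimal(100)


# Hypothetical raw values as they might appear in a monetary field.
for raw in (139020, 5, -1250):
    print(raw, "->", to_currency_units(raw))
```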



SML 12 months | Amount | % of customers | Filtered turnover 12 months | % Filtered turnover 12 months | Filtered turnover 12 months: Mean per customer
NA | 7 816 | 0.15% | 0 | 0.00% | 0.0
New | 245 278 | 4.56% | 40 365 377 | 0.54% | 164.6
I | 2 573 | 0.05% | 0 | 0.00% | 0.0
S | 3 086 955 | 57.45% | 1 185 154 005 | 15.87% | 383.9
M | 1 014 777 | 18.89% | 1 772 044 857 | 23.72% | 1 746.2
L | 1 015 649 | 18.90% | 4 471 932 029 | 59.87% | 4 403.0
Total | 5 373 048 | 100.00% | 7 469 496 268 | 100.00% | 1 390.2
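
The derived columns of this table follow directly from the customer counts and the filtered turnover: each share is a segment total divided by the grand total, and each mean is a segment turnover divided by its number of customers. The sketch below recomputes them from the figures in the table. The function and variable names are illustrative assumptions.

```python
# Customers and filtered 12-month turnover per SML segment, copied from the table.
SEGMENTS = {
    "NA": (7_816, 0),
    "New": (245_278, 40_365_377),
    "I": (2_573, 0),
    "S": (3_086_955, 1_185_154_005),
    "M": (1_014_777, 1_772_044_857),
    "L": (1_015_649, 4_471_932_029),
}


def segment_summary(segments):
    """Return {segment: (% of customers, % of turnover, mean turnover per customer)}."""
    total_customers = sum(n for n, _ in segments.values())
    total_turnover = sum(t for _, t in segments.values())
    summary = {
        name: (
            round(100 * n / total_customers, 2),
            round(100 * t / total_turnover, 2),
            round(t / n, 1),
        )
        for name, (n, t) in segments.items()
    }
    summary["Total"] = (100.0, 100.0, round(total_turnover / total_customers, 1))
    return summary


summary = segment_summary(SEGMENTS)
for name, (pct_customers, pct_turnover, mean) in summary.items():
    print(f"{name:<6}{pct_customers:>8.2f}%{pct_turnover:>8.2f}%{mean:>10.1f}")
```

For segment S, for example, this gives 57.45% of customers, 15.87% of filtered turnover and a mean of 383.9 per customer, matching the table.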

SML 12 months | Total turnover | % Total turnover | Total turnover: Mean per customer | Cumulated filtered turnover | % of cumulated filtered turnover | Cumulated filtered turnover: Mean per customer
NA | 29 189 799 | 0.06% | 3 734.6 | 16 060 558 | 0.05% | 2 054.8
New | 118 781 921 | 0.25% | 484.3 | 91 122 432 | 0.31% | 371.5
I | 19 120 403 | 0.04% | 7 431.2 | 9 701 975 | 0.03% | 3 770.7
S | 11 506 545 207 | 23.89% | 3 727.5 | 6 400 520 698 | 21.51% | 2 073.4
M | 10 946 321 611 | 22.72% | 10 786.9 | 6 841 635 615 | 22.99% | 6 742.0
L | 25 549 213 252 | 53.04% | 25 155.6 | 16 398 620 825 | 55.11% | 16 146.0
Total | 48 169 172 193 | 100.00% | 8 965.0 | 29 757 662 104 | 100.00% | 5 538.3

SML 12 months | Turnover annual on promo | % Turnover annual on promo | Turnover annual on promo: Mean per customer | Total nb taken reduction vouchers (BA) | % Total nb taken reducti
