Thierry Vallaud Thesis


  • 8/3/2019 Thierry Vallaud Thesis

    1/47

    TVallaud 1

    Estimating potential customer value using customer data

    Using a classification technique to determine customer value

    Thierry Vallaud

    A Thesis

    Submitted in Partial Fulfillment of the

    Requirements for the Degree of

    Master of Science in Data Mining

    Department of Mathematical Sciences

    Central Connecticut State University

    New Britain, Connecticut

    April 2009

    Thesis Advisor

    Dr. Daniel Larose

    Department of Mathematical Sciences

    Key Words: Turnover potential, Classification, Kohonen Networks


    Abstract:

    This study outlines a method of determining individual customer potential, based solely on

    data present in the customer database: descriptive information and transaction records.

    We define potential as the incremental turnover that any particular company could generate
    with its present customers.

    In order to successfully calculate this potential in a large database with multiple variables, we

    propose grouping together customers who look like each other (known as clones), by means

    of an appropriate clustering technique: Kohonen Networks.

    This method is applied to actual data sets, and various techniques are employed to check the

    stability of the clusters obtained. Real potential is then determined by means of an empirical

    approach: practical application to a major French retailer's database of 5 million customers.


    Contents

    The context
    Our thesis subject
    The precise modelling application
    The research questions
    The data mining process used
      Data understanding
      Data preparation
    Clustering models and determination of customer potential
      Kohonen network method
    Model development
      1- Objectives and methodology
      2- Robustness of the Kohonen method
      3- Calculation of the potentials
      4- Main results
      5- Results summary
    The validation procedures for the models
    Conclusions
      Discussion of the results of the research study
      The limits and the contribution of our research study
      Further research
    Bibliography
    Appendix


    The context

    Most companies would like to know their customers' potential in terms of turnover at the
    individual level. Determining potential means identifying the incremental turnover that a
    given company could generate with its existing customers.

    Customer turnover potential models exist and are mainly based on the customer value
    determined by the lifetime value (LTV) approach (Bénavent and Crié; Berger and
    Nasr 1998; Dwyer 1997; Venkasten, Rajkumar and Kumar 2004).

    Besides these, other models estimate the customer's share of spending (Cooil et
    al. 2007; Yuxing Du et al.; Keimingham et al. 2007). Further econometric models exist, which
    are based on data that are often external to the database (Plastria 2001; Huff 2003; Reilly
    1931).

    Customer consumption (total value) represents the lifetime consumption of a particular
    product by a particular customer, referred to as Customer Total Value (CTV). For example,
    a customer's total value for a retailer is the sum of all the purchases he will make in
    the retailer's stores over the course of his life.

    It is possible to estimate a customer's consumption in this market for a given brand b. Over
    the course of his lifetime, the customer will consume several brands. His total consumption
    of one of these brands constitutes that brand's wallet share over the customer's lifetime
    (Figure 1).

    Wallet share of

    The difference, or delta, between the customer's total consumption in the market and his
    total consumption of the brand corresponds to the Competitors' Consumption Total
    Value, CCTV (Figure 2).

    or

    Depending on the brand's marketing stimulus, the customer will transfer a share of that
    delta from competitors to the brand and/or increase his consumption in the total market:
    a customer of the retailer also buys in some competitors' stores and may increase his
    total consumption with retailers.


    Thus, the customer's theoretical potential is his total consumption over his lifetime:

    which is his reachable potential that can be estimated by means of the above econometric

    model

    Where

    Actual Value for Brand 1

    Share of consumption taken to the competitors (Figure 3).

    Increase of its total consumption

    The customer's reachable potential then corresponds to what the brand has already captured
    plus what the customer could consume additionally or transfer from competitors. This reachable
    potential can be estimated in two ways: using an econometric model, which requires
    data exogenous to the company's internal customer database; or, alternatively, using solely
    internal data from the company's customer database, by means of the clone method.
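The quantities introduced in this section can be summarised compactly. The notation below is an assumption made for illustration: $CTV$ denotes the customer's total lifetime consumption in the market, $CTV_b$ the part consumed with brand $b$, $\alpha$ the share of competitors' consumption that the brand could win, and $\Delta$ the possible increase in the customer's total consumption.

```latex
\begin{align}
  \text{Wallet share of } b &= \frac{CTV_b}{CTV} \\
  CCTV &= CTV - CTV_b \\
  \text{Reachable potential of } b &= CTV_b + \alpha \, CCTV + \Delta,
  \qquad 0 \le \alpha \le 1, \; \Delta \ge 0
\end{align}
```

Read this way, the reachable potential is what the brand has already captured, plus a share taken from competitors, plus any growth in total consumption.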

    A given brand can only capture n% of the theoretical potential (Berend Wierenga and Gerrit,
    2000). Some marketing researchers have shown that a brand can increase its actual wallet
    share by at most 30%: above this rate, the customer perceives a change and tries to
    resist it, because his choice set¹ is modified too much (Bremer and
    Joyce, 1988). This subject has already been covered in one of our previous studies (Vallaud,
    2003).

    1 The choice set is the finite set of products in a given product category that a customer has in mind
    before making a purchase.


    The most advanced approaches to determining potential try to identify the portion that
    is reachable for the company, relying solely on customer data from the company's
    customer database. These approaches calculate a customer-by-customer potential, but
    evidently have to be consistent at the aggregated level with macro-level market
    values.

    Our thesis subject

    The objective is to work on clustering models² (Lerman 1970; Dorofeyuk 1971; Borko and
    Bernick 1963; Two Steps (Tan et al. 1997); K-means (Hartigan et al. 1979; Fang et al. 1982);
    SOM (Teuvo Kohonen 1988; Vesanto 1997; Kaski 1997); etc.), applied to large databases from
    commercial companies (phone operators, ISPs, major retailers, mail-order companies, etc.).

    We use clustering models to determine customer potential with a method we call
    the clone method, whereby customers who most resemble each other are considered to be
    clones and should have the same potential.

    We have access to a variety of databases suited to our methodological process. In this
    document we perform an empirical test of our method on customer data from a major
    French grocery retailer.

    As part of our brief presentation of the context, we will look at two main subjects:

    - Calculation of potential, or customer value, in marketing and its different
      dependences: LTV, wallet share, market share capture, etc.

    - The mathematical models that allow similar individuals to be grouped into
      homogeneous groups: clustering techniques

    The field of investigation is multidisciplinary: a minor part concerns marketing
    and the major part concerns statistics, data mining and clustering.

    The precise modelling application

    The greater part of our research objectives is to test several techniques, separately and

    possibly jointly, to ensure that the clusters formed are homogeneous groups of clones.

    Besides choosing the models, part of the research involves defining the most informative

    variables and a model topology which fits with these data. The aim here is to obtain the most

    meaningful and convergent results.

    Another aspect of our research involves confirming the clusters obtained from the models
    with other, complementary statistical techniques:

    - Dimension reduction to choose variables, because of the very large number of clusters
      and the large value ranges,

    - Projection of passive and active variables³ in the clusters,

    - Reallocation of clusters by supervised models,

    - Validation by non-automatic classification techniques, connectivity of super classes,

    2 SOM is one of the clustering methods. "Typologies" is the French word for clustering;
    typologies belong to the unsupervised classification techniques.

    3 Active variables are used to build the groups themselves in terms of distances; passive
    variables are purely descriptive and serve to explain the groups.


    - Empirical verification against external panels such as Nielsen or TNS Sofres⁴, which
      represent the market reality of the potential.

    Another large part of our research study consists of selecting the above-mentioned methods and
    validating these choices. The aim is to find a clustering method that converges sufficiently to
    be validated by all the approaches described above. The modelling therefore becomes a
    process involving several models.

    The final modelling is carried out on a market-standard software platform:
    Clementine from SPSS, in its French version.

    The scientific contribution will be:

    - a methodological contribution to selecting clustering models and validating these
      choices

    - a real-life data application, validated by an actual business case:
      calculating real attainable potentials

    The research questions

    - Can we use a clustering technique to identify customers who are similar to each
      other, and thereby define a realistic turnover potential for these customers?

    - Can we develop a method?

    - How can we validate the stability of the clusters?

    The data mining process used

    We use the Cross-Industry Standard Process for Data Mining (CRISP)⁵, a data mining
    project process that guides our approach to analysing the data. The CRISP standard
    process consists of the following stages:

    4 Nielsen and TNS are market research companies which run panels whose members scan
    the purchases they make. These panels can be crossed with customer databases to measure
    marketing mix effects.

    5 http://www.crisp-dm.org/

    Data understanding

    We work on 5,373,026 individuals drawn from the database of a major French retail
    company. We have the details of all cash register receipts over a period of 12 months, from
    January 2006 to December 2006.

    For external validation purposes, we also have market research available on the French
    market: Referenseigne 2006 from TNS Sofres⁶. This research gives us the wallet share of the
    main French retailers⁷.

    Data preparation

    This step consists of familiarizing ourselves with the data in the database of the program
    members, in order to determine the structure of the database (the data layout), the fill
    rate of the fields in the data file, and the origin and nature of the data.
    Each field is checked to ensure that it does not undermine model stability.

    We carried out a data audit and an exploratory data analysis (EDA) in two steps; only the
    second EDA is presented in this document.

    The audit includes:

    - The structure of the database
    - The origin and nature of the data (socio-demographic / consumption)
    - The possibility of performing cross-data analysis (by brand / shelf / product family, etc.)
    - Data periodicity
    - Data historicity
    - Data completeness

    The principal data management processes performed on the database therefore include:

    - Checking and validating the format of the variables
    - Recoding and correcting certain variables, called aberrant variables
    - Creating specific aggregates useful for further segmentation (total turnover, turnover
      by product family, annual visit frequency, average basket)
    - Analysing the correlation of the target variable (turnover) with other variables
      (socio-demographic criteria, order frequency) in order to check whether any dependency
      relationships exist
    - Geocoding, useful for enriching the profiles with certain socio-demographic data
      derived from the INSEE⁸ (French national statistical office) via the IRIS⁹ (specific
      French geocoding data)

    6 Referenseigne is a monographic market research study of the French retail market, carried out
    yearly for ten years by TNS Sofres, the third-largest research company worldwide.

    7 http://www.secodip.fr/worldpanel/htm/dossier_presse/tns-plusloin.asp/

    8 INSEE (Institut National de la Statistique et des Études Économiques) is the French
    National Institute for Statistics and Economic Studies. It collects and publishes information on the
    French economy and society, carrying out the periodic national census. Located in Paris, it is the

    - Calculating the distances between the customer and the point of sale (trade zone)
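The aggregates named in the list above (total turnover, turnover by product family, annual visit frequency, average basket) could be derived from receipt lines roughly as follows. This is a minimal sketch: the tuple layout `(customer_id, receipt_id, product_family, amount)` and all field names are assumptions made for illustration, not the retailer's actual schema.

```python
from collections import defaultdict

def build_aggregates(receipt_lines):
    """Aggregate receipt lines per customer.

    receipt_lines: iterable of (customer_id, receipt_id, product_family, amount).
    The layout is hypothetical; adapt it to the real receipt table.
    """
    totals = defaultdict(float)
    by_family = defaultdict(lambda: defaultdict(float))
    visits = defaultdict(set)
    for customer_id, receipt_id, family, amount in receipt_lines:
        totals[customer_id] += amount
        by_family[customer_id][family] += amount
        visits[customer_id].add(receipt_id)  # one receipt = one visit

    aggregates = {}
    for customer_id, total in totals.items():
        n_visits = len(visits[customer_id])
        families = dict(by_family[customer_id])
        aggregates[customer_id] = {
            "total_turnover": total,
            "turnover_by_family": families,
            "visit_frequency": n_visits,
            "average_basket": total / n_visits,
            # Share of turnover per product family (the "Rate ..." variables).
            "family_rates": {f: v / total for f, v in families.items()} if total else {},
        }
    return aggregates

receipt_lines = [
    ("C1", "R1", "cheese", 10.0),
    ("C1", "R1", "wine", 30.0),
    ("C1", "R2", "cheese", 20.0),
    ("C2", "R3", "grocery", 50.0),
]
agg = build_aggregates(receipt_lines)
```

On 5 million customers the same logic would normally run inside the database or the mining tool; the sketch only shows how the variables relate to each other.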

    The analysis is performed on 12-month sliding turnover computed over the whole of the
    historical data, to make the modelling more reliable. The larger and more homogeneous the
    historical data set, the more stable and predictive the model should be.

    In this document we include only some examples of the data audit and data
    preparation, as our demonstration focuses on the model and the results. Details of the second
    EDA are given in Appendix 2 (p. 39).

    The input variables are shown in the following table:

    French branch of Eurostat, the European Statistical System. INSEE was created in 1946 as a
    successor to the National Statistics Service (SNS), created under Vichy during World War II.

    9 The IRIS is a French geographic unit to which the census data are linked.


    Identification of the outliers:

    We identified and eliminated from the analysis some customers with anomalous
    behaviour on two variables linked to turnover.

    We used only these two variables in the outlier detection because they are the main
    constituents of the potential itself.

    Discretization:

    We discretized some important variables and studied their dispersion.

    The complete EDA, with descriptive analyses, tables, graphs, correlation estimates and so on,
    is given in Appendix 2 (p. 39).


    Clustering models and determination of customer potential

    The modelling process is divided into three major phases: (1) the clustering method itself,
    (2) the calculation of the evolution levels, and (3) the calculation of the individual
    customer potential.

    1. The clustering method: as these models are applied to very large databases with
       large numbers of variables and records, the SOM (Self-Organizing Map) seems to be
       particularly well adapted (Kohonen, 1988):

    - Kohonen networks produce very homogeneous and stable groups with many
      individuals and variables,
    - Kohonen networks can capture complex non-linear relationships among many variables for
      many individuals,
    - Kohonen networks handle missing data well.

    Kohonen network method

    Kohonen networks are a type of self-organising map (SOM), which is itself a
    special class of neural network.

    Kohonen analysis is a clustering method. Its main advantage is that it converts a
    high-dimensional input signal into a simpler, low-dimensional discrete map. It is an
    unsupervised method: no target has to be defined.

    Kohonen networks exhibit three characteristic processes:

    1 Competition: output nodes compete with each other to produce the best value for a
    particular scoring function, most commonly the smallest Euclidean distance.

    2 Cooperation: the winning node becomes the centre of a neighbourhood of excited
    neurons.

    3 Adaptation: the nodes in the neighbourhood of the winning node participate in adaptation, that
    is, learning. The weights of these nodes are adjusted so as to further improve the score function.

    Network architecture:

    Each neuron of the Kohonen map is linked to all the other neurons of the map. Each one of

    them receives a complete copy of an input vector.


    [Figure: Kohonen network architecture. Labels: inputs; learning rate; winner and its
    neighbourhood; weights of the winners adjusted as a function of the input data; output
    nodes that try to become winners]

    Kohonen networks are self-organising maps that exhibit Kohonen learning. Consider the set of
    m field values for the nth record to be an input vector $x_n = (x_{n1}, x_{n2}, \dots, x_{nm})$,
    and the current set of m weights for a particular output node j to be a weight vector
    $w_j = (w_{1j}, w_{2j}, \dots, w_{mj})$. In Kohonen learning, the nodes in the neighbourhood
    of the winning node adjust their weights using a linear combination of the input vector and
    the current weight vector:

    $$w_{ij,\text{new}} = w_{ij,\text{current}} + \eta \, (x_{ni} - w_{ij,\text{current}})$$

    where $\eta$, $0 < \eta < 1$, represents the learning rate. Kohonen indicates that the learning
    rate should be a decreasing function of the training epochs (runs through the data set).

    At each iteration, the network checks the accuracy of its previous grouping.
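The competition, cooperation and adaptation steps can be illustrated with a very small pure-Python map. This is a sketch under stated assumptions, not the Clementine implementation: the neighbourhood is taken to be a disc on the grid, the learning rate and radius decay linearly with the epoch, the inputs are assumed to be scaled to [0, 1], and every parameter name (`eta0`, `radius0`, `epochs`) is hypothetical.

```python
import math
import random

def best_node(weights, x_n):
    """Competition: the node whose weight vector is closest to x_n wins."""
    return min(weights, key=lambda node: math.dist(weights[node], x_n))

def train_som(records, cols, rows, epochs=20, eta0=0.5, radius0=1.0, seed=0):
    """Train a cols x rows Kohonen map; returns {(x, y): weight vector}."""
    rng = random.Random(seed)
    m = len(records[0])
    weights = {(x, y): [rng.random() for _ in range(m)]
               for x in range(cols) for y in range(rows)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius decrease with the epochs.
        decay = 1.0 - epoch / epochs
        eta, radius = eta0 * decay, radius0 * decay
        for x_n in records:
            winner = best_node(weights, x_n)
            for node, w_j in weights.items():
                # Cooperation: only nodes within the radius of the winner learn.
                if math.dist(node, winner) <= radius:
                    # Adaptation: w_new = w_current + eta * (x - w_current)
                    for i in range(m):
                        w_j[i] += eta * (x_n[i] - w_j[i])
    return weights

# Two obvious groups of "customers" described by two scaled variables.
records = [(0.10, 0.10), (0.15, 0.05), (0.90, 0.95), (0.85, 0.90)]
som = train_som(records, cols=2, rows=1)
low = best_node(som, records[0])
high = best_node(som, records[2])
```

After training, each record is allocated to the node returned by `best_node`, which plays the role of its clone group; records that resemble each other end up on the same node.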

    - A Kohonen network is particularly well suited to building homogeneous groups. It is
      obviously a lengthy process when performed on a large number of individuals with
      many variables and records.

    - A Kohonen network allocates a relevant group to each customer.

    By mapping the analysis, we can evaluate the similarity between groups. Two groups which

    are close on the graph have similar characteristics.

    The aim is to find a method:

    - That represents the best trade-off between many classes, ensuring small groups with
      homogeneous customers within each group, but groups which differ greatly from each
      other.

    - That enables us to obtain realistic customer potential with clusters that are internally
      stable.


    2. Calculation of the evolution level: evolution is the small jump in turnover that a
       customer needs to make in order to be clustered with the customers who most resemble
       him on all the variables selected for the model, but who represent a higher turnover than
       he does. This requires a calculation method based on dividing each class of clones, for
       which we calculate the median, into deciles.

    Individuals in one group should not have a huge gap to cross if we are to obtain a realistic
    determination of potential¹⁰, that is, the potential increase in turnover that could be achieved
    after application of the correct marketing actions. We will try to justify this calculation by
    methodological means. This step gives us the evolution rates in the classes.

    3. Calculating individual customer potential: once the rates are properly determined, we
       calculate, for each customer, the individual potential to be captured. This
       calculation needs a specific adjustment: all customers with an evolution rate
       above 100% are assigned the average potential rate of all groups except the one to
       which they belong.
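The decile step can be made concrete with the worked example given in the footnote on potential (customer A at 1,000 in a decile whose upper limit is 1,200). The sketch below is one possible reading of the method, stated as an assumption: inside a single clone cluster, customers are ranked by turnover, cut into ten equal-sized slices, and each customer's potential is the gap to the highest turnover of his own slice. The function name and data layout are hypothetical, and the adjustment for rates above 100% is not included.

```python
def decile_potentials(turnovers):
    """Potential and evolution rate for every customer of ONE clone cluster.

    turnovers: dict customer_id -> actual 12-month turnover (must be > 0).
    Returns dict customer_id -> (potential, evolution_rate).
    """
    ranked = sorted(turnovers, key=turnovers.get)
    n = len(ranked)
    result = {}
    for d in range(10):
        # Customers whose rank falls in the d-th tenth of the cluster.
        members = ranked[d * n // 10:(d + 1) * n // 10]
        if not members:
            continue  # clusters with fewer than 10 customers leave gaps
        ceiling = turnovers[members[-1]]  # highest turnover in this decile
        for customer_id in members:
            potential = ceiling - turnovers[customer_id]
            result[customer_id] = (potential, potential / turnovers[customer_id])
    return result

# A 20-customer cluster: A and B form the first decile.
cluster = {"A": 1000.0, "B": 1200.0}
cluster.update({f"C{i}": 1300.0 + 100.0 * i for i in range(18)})
potentials = decile_potentials(cluster)
```

With this reading, customer A obtains a potential of 200 and an evolution rate of 20%, as in the footnote, while the customer at the top of each decile has a potential of zero.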

    Model development

    1 - Objectives and methodology

    To complement the segmentations based on customer turnover, namely SML segmentation¹¹
    (Brusset 2005) and RFM segmentation¹² (McCarty and Hastak 2007; Chen et al. 2008), we
    calculate a turnover potential score for each customer in the loyalty program database.

    This score is based on an iterative approach that allows us to predict the consumption propensity
    of customers, with the aim of determining their potential future turnover.

    10 Example: customer A has an actual turnover of $1,000. Customer A belongs to the first decile of a
    cluster in which all customers closely resemble each other. The maximum turnover, that of the
    customer at the upper limit of this decile, is $1,200. The potential is therefore the difference
    between the $1,200 of the top customer and the $1,000 of customer A: $200, or 20%.

    11 SML segmentation (Small, Medium, Large) divides the customers according to their turnover.

    12 RFM segmentation (Recency, Frequency, Monetary value) is a classical segmentation in marketing.


    The approach consists of grouping together customers who resemble each other according to
    some socio-demographic and consumption variables.

    For the computation we use consumption data recorded over a period of 12 months (from
    January 2006 to December 2006).

    The variables used in the model are those we decided to keep following the data preparation
    stage.

    Socio-demographic & consumption data     Turnover rate per product family
    ------------------------------------     --------------------------------
    Customer ID                              Customer ID
    Number of children in the household      Rate other
    Filtered turnover on 12 months           Rate Bazar
    Total turnover                           Rate other
    Yearly turnover on promo                 Rate Pork Butcher LS
    Nb of transformed points on 12 months    Rate Pet food
    Nb of CM on 12 months                    Rate Baby
    Nb of reduction vouchers used            Rate Butcher
    SML 12 months                            Rate bakery
    RFM 3 months                             Rate Pork Butcher
    Number of children in the household      Rate dietetic bio
                                             Rate cheese
                                             Rate fruits and vegetables
                                             Rate fish
                                             Rate frozen food
                                             Rate wine
                                             Rate cleaning products
                                             Rate grocery
                                             Rate liquid
                                             Rate textile
                                             Rate ultra fresh products
                                             Rate poultry
                                             Rate First price
                                             Rate Retailer Brand 1
                                             Rate Retailer Brand 2

    Variables are discarded after a correlation analysis for the quantitative variables
    (turnover and number of purchase acts, for instance) or by means of proximity matrices for the
    qualitative variables. We did not use PCA because we wanted to keep the information at the most
    disaggregated level of the original variables in the database.
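The correlation-based elimination of quantitative variables can be sketched as follows. The greedy rule, the 0.9 threshold and the variable names are assumptions chosen for illustration; the text does not state which threshold or procedure was actually applied.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # undefined for a constant variable

def drop_correlated(columns, threshold=0.9):
    """Keep a variable only if it is not strongly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: purchase acts follow turnover almost linearly, children do not.
columns = {
    "total_turnover": [100.0, 250.0, 400.0, 80.0, 300.0],
    "nb_purchase_acts": [10.0, 26.0, 39.0, 9.0, 30.0],
    "nb_children": [2.0, 0.0, 1.0, 3.0, 0.0],
}
kept = drop_correlated(columns)
```

Unlike PCA, this keeps the retained variables in their original, directly interpretable form, which is the property the paragraph above asks for.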

    Inactive customers, that is, customers without any transaction during the period, are discarded.

    Clementine stream:


    This figure illustrates how a model is built in Clementine from SPSS. Clementine is a statistical
    software package that uses an object language to build models.

    We use a clustering method to create "clone" groups that are highly homogeneous internally
    but differ from one another.

    The second stage involves creating turnover potential values for these different groups, given
    that an individual with the same variables as another does not necessarily achieve the same level
    of turnover: he can tend towards the turnover of his superior clone. To do this, we use a
    Kohonen neural network.

    Once the clone families have been obtained and the potential values calculated, the main families
    are determined:

    - Gold: evolution rate higher than 20%

    - Silver: evolution rate between 15% and 20%

    - Bronze: evolution rate below 15%

    The evolution rate is the ratio of the potential to the actual turnover.

    It should be noted here that potential refers to absolute potential over twelve consecutive
    months.

    This potential is expressed in the form of a rate. For operational purposes, potential values
    must be reclassified in absolute value:

    P1: Large potential

    P2: Medium potential

    P3: Small potential
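The Gold, Silver and Bronze families translate directly into a small classification function. This is a minimal sketch: treating the 15% and 20% boundaries as belonging to Silver is an assumption, since the text does not say on which side the limits fall, and the function names are hypothetical.

```python
def evolution_rate(potential, actual_turnover):
    """Ratio of the potential to the actual turnover (e.g. 0.20 for 20%)."""
    return potential / actual_turnover

def family(rate):
    """Map an evolution rate to its family; boundary handling is assumed."""
    if rate > 0.20:
        return "Gold"
    if rate >= 0.15:
        return "Silver"
    return "Bronze"

# Customer with a turnover of 1,000 and a potential of 200 (rate of 20%).
example_family = family(evolution_rate(200.0, 1000.0))
```

The absolute classes P1 to P3 would need turnover thresholds of their own, which are not given here, so they are left out of the sketch.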

    2- Robustness of the Kohonen method:

    We test several methods of determining convergences between Kohonen groups.

    2.1 CONVERGENCES VISUALIZATION


    We obtained 40 groups, numbered from 00 to 93 (note that the cluster numbers do not form a
    continuous sequence).

    We wanted a fairly large number of groups in order to minimize the
    inter-group standard deviation as far as possible.

    Mappings: 00 is the cluster with coordinate 0 on the X axis and 0 on the Y axis, and 93 is the
    cluster with coordinate 9 on the X axis and 3 on the Y axis.

    Kohonen groups x SML segmentation (12 months)

    Colors are generally well grouped, with customers belonging to the same SML segments

    being together.

    Visually, the placement of SML through clusters shows stability.

    Kohonen groups x RFM segmentation (3 months)


    Colours are generally well grouped, with customers belonging to the same RFM segments
    found in the same Kohonen groups.

    There is, however, a far greater mixture of colours inside each cluster: the homogeneity of the
    clusters is less obvious than with the SML mapping.

    2.2 ROBUSTNESS OF THE KOHONEN CLASSIFICATION

    Is this distribution of the population stable? We can answer this question in four different
    ways:

    A - Do the cluster weights converge between the sample of active observations
    and the sample of passive observations?

    B - Can the grouping be reproduced by a Bayesian network (Pourret et al.; Jensen; Stephenson
    2000)?

    C - Can the classification be reproduced by a segmentation method such as C5.0 (Quinlan 1993,
    1996, 2004)?

    D - Is there convexity of the super classes?

    A/ Convergence of the method

    We can check the percentage of customers allocated to each cluster on two random samples:


                  Total           Learning sample        Test sample
    Clones     Number        %    Number        %      Number        %
    KH01      271,944    5.06%    10,949    5.13%     260,995    5.06%
    KH02      171,396    3.19%     6,983    3.27%     164,413    3.19%
    KH03      261,136    4.86%    10,498    4.92%     250,638    4.86%
    KH04      289,912    5.40%    11,508    5.39%     278,404    5.40%
    KH05       80,239    1.49%     3,214    1.50%      77,025    1.49%
    KH06       40,698    0.76%     1,596    0.75%      39,102    0.76%
    KH07       64,515    1.20%     2,550    1.19%      61,965    1.20%
    KH08       93,685    1.74%     3,768    1.76%      89,917    1.74%
    KH09       95,415    1.78%     3,757    1.76%      91,658    1.78%
    KH10       91,169    1.70%     3,681    1.72%      87,488    1.70%
    KH11       57,384    1.07%     2,235    1.05%      55,149    1.07%
    KH12      181,691    3.38%     7,224    3.38%     174,467    3.38%
    KH13      142,728    2.66%     5,624    2.63%     137,104    2.66%
    KH14       83,298    1.55%     3,260    1.53%      80,038    1.55%
    KH15       65,365    1.22%     2,597    1.22%      62,768    1.22%
    KH16      152,665    2.84%     6,153    2.88%     146,512    2.84%
    KH17      119,559    2.23%     4,797    2.25%     114,762    2.22%
    KH18       45,360    0.84%     1,794    0.84%      43,566    0.84%
    KH19       73,151    1.36%     2,783    1.30%      70,368    1.36%
    KH20       35,914    0.67%     1,378    0.65%      34,536    0.67%
    KH21      120,165    2.24%     4,688    2.20%     115,477    2.24%
    KH22      137,752    2.56%     5,462    2.56%     132,290    2.56%
    KH23       36,215    0.67%     1,417    0.66%      34,798    0.67%
    KH24      267,939    4.99%    10,739    5.03%     257,200    4.99%
    KH25      193,624    3.60%     7,581    3.55%     186,043    3.61%
    KH26       50,454    0.94%     2,019    0.95%      48,435    0.94%
    KH27       26,271    0.49%     1,036    0.49%      25,235    0.49%
    KH28       76,724    1.43%     3,082    1.44%      73,642    1.43%
    KH29      199,372    3.71%     7,810    3.66%     191,562    3.71%
    KH30       28,913    0.54%     1,102    0.52%      27,811    0.54%
    KH31      124,878    2.32%     4,922    2.30%     119,956    2.32%
    KH32      347,565    6.47%    13,963    6.54%     333,602    6.47%
    KH33       75,304    1.40%     2,998    1.40%      72,306    1.40%
    KH34      103,656    1.93%     4,107    1.92%      99,549    1.93%
    KH35       24,658    0.46%       989    0.46%      23,669    0.46%
    KH36       31,206    0.58%     1,272    0.60%      29,934    0.58%
    KH37      301,456    5.61%    11,863    5.55%     289,593    5.61%
    KH38      252,820    4.71%    10,042    4.70%     242,778    4.71%
    KH39      130,904    2.44%     5,193    2.43%     125,711    2.44%
    KH40      425,926    7.93%    16,942    7.93%     408,984    7.93%
    Total   5,373,026  100.00%   213,576  100.00%   5,159,450  100.00%

    B/ Reallocation using a Bayesian network

    The above table confirms that the algorithm is able to reproduce the distribution on a larger

    data set (Learning sample vs Test sample).
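The comparison made in the table can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the thesis: the function names are hypothetical, and only three of the 40 clusters are copied from the table, together with the learning and test sample sizes.

```python
def cluster_shares(counts, total):
    """Share (in %) of each cluster, given its count and the sample size."""
    return {cluster: 100.0 * n / total for cluster, n in counts.items()}


def max_share_gap(shares_a, shares_b):
    """Largest absolute difference in share (percentage points) between two samples."""
    return max(abs(shares_a[c] - shares_b[c]) for c in shares_a)


# Counts copied from the table for three of the 40 clusters.
learning_counts = {"KH01": 10949, "KH20": 1378, "KH40": 16942}
test_counts = {"KH01": 260995, "KH20": 34536, "KH40": 408984}

# Sample sizes taken from the "Total" row of the table.
learning_shares = cluster_shares(learning_counts, total=213576)
test_shares = cluster_shares(test_counts, total=5159450)

gap = max_share_gap(learning_shares, test_shares)
print(f"largest gap: {gap:.3f} percentage points")
```

A small gap for every cluster is what the table is meant to show; the acceptable size of that gap is a judgment call and is not stated in the text.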

    However, only by using another algorithm can we determine whether the clustering can be reproduced, and hence whether it is stable.

    Again, the learning sample is split into two independent sub-samples: a learning sample with 70% of the observations and a test sample with 30%.

    We use a Bayesian network because discriminant analysis is not well suited to predicting 40 groups.

    A Bayesian network allows a stepwise approach, as we can fix the probability threshold for the links that we retain between variables. With a threshold of 0.9, the results are as presented in the graph below.


    The network uses 11 variables: turnover data and socio-demographic variables. It can be seen that SML and RFM are very important, which validates the representation of the densities. The weights of the variables in the model are shown below.


    The Kullback-Leibler measurement (http://www.it-innovations.ae/iit005/proceedings/articles/E_6_IIT05_Khalid.pdf) comes from information theory. It is a measure of convergence between two series after they have been recoded in a bitmap format. The higher the value, the greater the probability that the two series have a joint distribution.
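For reference, the textbook Kullback-Leibler divergence can be sketched as follows. Whether the software used in the thesis computes exactly this quantity is an assumption; the function names and the two small joint distributions are invented for illustration. Applied to a joint distribution against the product of its marginals, the divergence is the mutual information: zero for independent variables, larger when they are strongly linked.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """KL divergence between a joint distribution and the product of its marginals."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    flat_joint = [cell for row in joint for cell in row]
    flat_independent = [r * c for r in row_marginals for c in col_marginals]
    return kl_divergence(flat_joint, flat_independent)


# Invented 2 x 2 joint distributions of two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.45, 0.05], [0.05, 0.45]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```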


    Scoring result at the individual level: on the learning sample, 90.7% of the individuals are

    correctly classified.

    On the test sample, the figure is 90.1%.

    Below are the rates in % of correctly classified individuals by the Bayesian Network for each

    of the 40 clusters.

    The Kohonen clusters can be reproduced.


    In the above table, the poorly reallocated groups are, as expected, those containing a small number of customers.

    Even for these groups, accuracy remains above 65%.
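The per-cluster reallocation rates discussed above can be computed from two label lists, as in the minimal sketch below. The function name and the toy labels are invented for illustration; in the thesis the predicted labels come from the Bayesian network and the true labels from the Kohonen clustering.

```python
from collections import Counter


def per_cluster_accuracy(true_labels, predicted_labels):
    """Return {cluster: % of its members that the model reassigns to it}."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cluster: 100.0 * hits[cluster] / totals[cluster] for cluster in totals}


# Toy labels, invented for illustration only.
true_labels = ["KH01", "KH01", "KH01", "KH01", "KH02", "KH02", "KH03", "KH03", "KH03"]
predicted = ["KH01", "KH01", "KH01", "KH02", "KH02", "KH02", "KH03", "KH01", "KH03"]

rates = per_cluster_accuracy(true_labels, predicted)
overall = 100.0 * sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print(rates, round(overall, 1))
```

Reporting both figures matters because a high overall rate can hide a low rate on small clusters, which is the pattern the text describes.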

    C/ Reallocation by decision tree

    Under cross-validation, 94.2% of individuals are correctly assigned to their groups.

    The test sample confirms this rate.

    The two supervised learning methods converge strongly: both the Bayesian network and C5.0 are able to reallocate individuals correctly to the 40 clusters.

    The robustness of the classification is validated.


    D/ Superclasses convexity

    We use a Bayesian network analysis, which identifies the small number of variables that matter most for the clustering.

    We analyse the contingency table between the 40 groups and the variables that each contribute more than 10% of the network's explanatory power:

    - Family situation
    - Socio-professional category (C.S.P.)
    - R.F.M. at 3 months
    - S.M.L. at 3 months
    - Home type
    - Age categories
    - Filtered cumulated turnover
    - Customer seniority categories

    On this table, the distance used is the chi-square (Khi-2) distance (Ottos, 2007, Meunier et al, Romesburg, 2004) and the aggregation method is Ward's (Clarke and Sun, 1997, Barnier 2008).
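As a reminder of what the chi-square distance measures, here is a minimal sketch using the usual correspondence-analysis definition, which compares the row profiles of a contingency table and weights each column by the inverse of its mass. The counts are hypothetical, the function name is invented, and the sketch assumes every row and column contains at least one observation. Ward aggregation itself is not shown.

```python
def chi_square_distance(table, i, k):
    """Chi-square distance between the row profiles i and k of a contingency table."""
    grand_total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_masses = [sum(col) / grand_total for col in zip(*table)]
    squared = sum(
        (table[i][j] / row_totals[i] - table[k][j] / row_totals[k]) ** 2 / col_masses[j]
        for j in range(len(col_masses))
    )
    return squared ** 0.5


# Hypothetical counts: 3 groups (rows) by 3 categories of one variable (columns).
counts = [
    [50, 30, 20],
    [25, 15, 10],
    [10, 30, 60],
]

d_same_profile = chi_square_distance(counts, 0, 1)
d_different = chi_square_distance(counts, 0, 2)
print(d_same_profile, round(d_different, 4))
```

Two groups with the same profile are at distance zero whatever their sizes, which is why this distance suits groups of very unequal numbers of customers.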

    Dendrogram

    [Figure: dendrogram of the 40 Kohonen groups KH01 to KH40; distance axis graduated from 0 to 9.]

    Breakdown of the variance for an optimal classification:

    Intra-groups | 89790172,495
    Inter-groups | 25701803,959
    Total | 115491976,454


    Distances between the central objects:

    Results per cluster:

    Cluster | 1 | 2 | 3 | 4 | 5
    Objects | 10 | 7 | 8 | 5 | 10
    Sum of weights | 10 | 7 | 8 | 5 | 10
    Intra-class variance | 72493506,744 | 25490092,048 | 67169820,393 | 125787548,900 | 151548331,778
    Minimal distance to the barycenter | 3535,855 | 3419,845 | 3465,065 | 3277,114 | 4143,086
    Average distance to the barycenter | 7323,184 | 4577,213 | 6560,727 | 8500,865 | 10606,386
    Maximal distance to the barycenter | 14057,216 | 5976,284 | 16549,137 | 18792,278 | 22881,067

    Groups in each cluster:

    Cluster 1: KH01, KH02, KH03, KH04, KH05, KH06, KH07, KH08, KH11, KH12
    Cluster 2: KH09, KH10, KH13, KH14, KH15, KH17, KH18
    Cluster 3: KH16, KH19, KH20, KH22, KH23, KH24, KH27, KH28
    Cluster 4: KH21, KH26, KH31, KH32, KH36
    Cluster 5: KH25, KH29, KH30, KH33, KH34, KH35, KH37, KH38, KH39, KH40

    A check is performed to ensure that the bottom-up classification respects the order of the groups: clone 40 is not grouped together with clone 3. This is one of the "quality" criteria of a Kohonen map.
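One way to automate such a check is to test whether each super class forms a single block of neighbouring cells on the map. The sketch below rests on two assumptions that are not stated in the text: that groups KH01 to KH40 fill a map of 10 rows by 4 columns row by row, and that diagonal cells count as neighbours. The membership lists are those of the results-per-cluster table; the function names are hypothetical.

```python
from collections import deque

N_COLS = 4  # assumed layout: KH01..KH40 fill 10 rows of 4 cells, row by row


def cell(group_number):
    """Map a group number (1..40) to its (row, column) position on the assumed grid."""
    return divmod(group_number - 1, N_COLS)


def is_connected(group_numbers):
    """True if the groups form one block of adjacent cells (diagonals count)."""
    cells = {cell(n) for n in group_numbers}
    start = next(iter(cells))
    seen, queue = {start}, deque([start])
    while queue:
        row, col = queue.popleft()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                neighbour = (row + d_row, col + d_col)
                if neighbour in cells and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return seen == cells


# Super classes as listed in the results-per-cluster table.
super_classes = {
    1: [1, 2, 3, 4, 5, 6, 7, 8, 11, 12],
    2: [9, 10, 13, 14, 15, 17, 18],
    3: [16, 19, 20, 22, 23, 24, 27, 28],
    4: [21, 26, 31, 32, 36],
    5: [25, 29, 30, 33, 34, 35, 37, 38, 39, 40],
}
connected = {k: is_connected(v) for k, v in super_classes.items()}
print(connected)
```

A pair such as KH03 and KH40 would fail the same test, which is the situation the check is meant to exclude.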

    In conclusion, the sharp classification obtained by the Kohonen algorithm satisfies the criteria of stability and reproducibility, which guarantee a robust and lasting potential.

    3- Calculation of the potentials

    We divided the annual turnover (filtered turnover over 12 months) into deciles.

    For each cluster obtained with the Kohonen method, we calculated the business potential based on the turnover.

    We retained the decile method, which allows very large variations in turnover to be taken into account.

    We split the total turnover of each class of clones into deciles, then calculated the median of each decile.


    Then we allocate to each group a potential turnover value derived from the rates of increase between the medians and the upper limits of the deciles.

    For each clones group, the rate of increase measures the turnover growth needed to move from one decile to the next.

    18 rates of increase are determined per clones group:

    - Between the median of the first decile and the upper limit of the first decile: Tx01
    - Between the upper limit of the first decile and the median of the second decile: Tx02
    - Between the median of the second decile and the upper limit of the second decile: Tx03

    ...

    - Between the upper limit of the eighth decile and the median of the ninth decile: Tx16
    - Between the median of the ninth decile and the upper limit of the ninth decile: Tx17
    - Between the upper limit of the ninth decile and the median of the tenth decile: Tx18

    We assign no potential to companies whose turnover is higher than the median of the tenth decile. We estimate that companies with such high turnover will have an evolution rate near 0, equal to the inflation rate, or equal to their annual evolution rate.

    For each Kohonen group, a customer whose filtered turnover is between the minimum and the median of the first decile will have an evolution rate equal to rate 1 (Tx01).

    A customer whose turnover is between the median and the upper limit of the first decile will have an evolution rate equal to rate 2 (Tx02), and so on.

    Each customer is allocated an evolution rate. Multiplying this rate by the customer's turnover gives an estimate of his potential turnover.
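The mechanics described above can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function names are hypothetical, the breakpoints are the first five rounded values printed for group KH01 in section 4.1, and the rule applied once the list of rates is exhausted (no potential) is an assumption. The rates are those obtained before any correction of aberrant values.

```python
from bisect import bisect_right


def growth_rates(breakpoints):
    """Rate of increase between each pair of consecutive breakpoints."""
    return [(upper - lower) / lower for lower, upper in zip(breakpoints, breakpoints[1:])]


def assign_rate(turnover, breakpoints, rates):
    """Tx01 below the first breakpoint, Tx02 in the next interval, and so on."""
    position = bisect_right(breakpoints, turnover)
    # Assumption: no potential once the list of rates is exhausted.
    return rates[position] if position < len(rates) else 0.0


# Group KH01: median D1, upper limit D1, median D2, upper limit D2, median D3.
kh01_breakpoints = [16.6, 34.6, 56.4, 81.9, 111.9]
kh01_rates = growth_rates(kh01_breakpoints)

turnover = 25.0  # lies between the median and the upper limit of the first decile
rate = assign_rate(turnover, kh01_breakpoints, kh01_rates)
potential = turnover * rate
print(f"rate: {rate:.2%}, potential turnover: {potential:.2f} euros")
```

Because the breakpoints are rounded in the printed table, the rates recomputed this way differ slightly from the published TX values.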

    4.1 - Limits and medians of deciles per Kohonen group:


    Turnover in euros

    Number Mean Median Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median Higher limit Median

    KH01 00 271 944 683 354 16,6 34,6 56,4 81,9 111,9 146,5 187,5 234,1 289,4 354,5 431,5 523,7 634,9 769,0 935,8 1 151,5 1 409,5

    KH02 01 171 396 730 396 19,0 39,9 64,5 93,1 126,3 164,1 209,3 262,8 323,4 396,1 482,8 584,4 705,7 847,6 1 023,5 1 245,5 1 521,7

    KH03 02 261 136 770 414 22,2 45,8 72,5 102,8 137,3 177,3 223,6 277,3 340,4 414,1 501,1 603,4 724,0 867,4 1 041,4 1 266,4 1 570,2

    KH04 03 289 912 1 353 658 31,3 65,4 104,5 148,8 201,6 262,2 335,5 423,3 529,1 658,0 812,4 1 000,2 1 230,8 1 510,2 1 864,7 2 314,6 2 898,5

    KH05 10 80 239 471 303 17,8 35,4 55,7 79,8 106,7 136,4 170,6 209,7 253,7 303,4 359,7 425,9 499,4 582,3 675,7 784,9 911,0

    KH06 11 40 698 452 312 19,7 39,1 61,4 86,5 113,9 146,0 180,7 217,6 262,3 311,7 367,8 430,3 502,8 584,4 677,0 775,0 885,0

    KH07 12 64 515 469 360 23,9 49,5 77,3 107,4 140,9 177,0 217,5 261,5 308,6 360,1 417,2 480,0 549,0 622,9 705,6 797,8 900,8

    KH08 13 93 685 697 467 27,9 57,4 91,0 127,5 168,3 215,1 268,2 326,7 391,7 466,7 550,4 643,3 748,8 862,3 995,6 1 146,6 1 381,5

    KH09 20 95 415 432 313 20,5 39,6 61,7 86,1 113,2 144,8 179,9 219,5 263,6 313,0 369,2 430,3 499,9 576,5 662,0 756,4 863,1

    KH10 21 91 169 519 452 32,7 65,5 101,0 140,9 183,5 229,8 280,1 333,2 391,0 451,7 516,7 581,1 651,4 722,7 797,6 877,1 957,5

    KH11 22 57 384 580 454 27,6 56,8 92,0 129,3 170,7 219,8 271,1 327,2 390,2 454,2 525,6 600,9 683,3 771,7 865,6 965,8 1 071,6

    KH12 23 181 691 850 670 37,7 80,4 129,2 184,3 245,6 314,3 390,1 475,0 567,6 669,9 783,8 905,1 1 037,8 1 180,3 1 330,4 1 501,9 1 694,4

    KH13 30 142 728 988 779 39,4 85,9 144,8 214,2 293,1 377,6 472,0 571,3 674,2 779,0 888,3 1 000,1 1 115,8 1 257,1 1 429,2 1 627,0 1 853,7

    KH14 31 83 298 2 971 2 592 1 268,3 1 393,2 1 525,2 1 658,7 1 802,7 1 946,8 2 099,5 2 257,7 2 430,3 2 591,5 2 764,1 2 946,5 3 158,4 3 396,3 3 675,6 4 009,5 4 422,9

    KH15 32 65 365 2 488 2 079 1 209,3 1 295,8 1 383,0 1 472,9 1 564,9 1 660,3 1 759,2 1 863,6 1 971,1 2 078,9 2 189,0 2 304,7 2 422,7 2 594,9 2 876,1 3 239,7 3 701,3

    KH16 33 152 665 4 203 3 745 1 584,7 1 976,5 2 325,5 2 579,3 2 746,6 2 917,8 3 107,6 3 304,4 3 520,0 3 745,0 3 991,8 4 256,8 4 549,6 4 879,7 5 264,2 5 707,4 6 261,8

    KH17 40 119 559 2 646 2 254 1 166,1 1 270,9 1 377,5 1 488,5 1 604,0 1 722,5 1 843,4 1 971,9 2 110,0 2 253,5 2 405,7 2 574,7 2 784,9 3 027,5 3 304,2 3 625,9 4 034,0

    KH18 41 45 360 3 263 2 888 1 336,7 1 512,5 1 681,1 1 859,9 2 035,6 2 215,0 2 397,4 2 568,8 2 720,1 2 888,3 3 069,4 3 267,3 3 488,7 3 738,1 4 025,8 4 393,4 4 841,6

    KH19 42 73 151 4 554 4 065 2 621,0 2 755,0 2 893,3 3 034,8 3 177,2 3 334,5 3 500,5 3 673,4 3 862,7 4 064,4 4 274,6 4 499,5 4 755,2 5 056,5 5 397,1 5 793,1 6 303,2

    KH20 43 35 914 4 571 4 070 2 570,7 2 705,2 2 850,9 3 004,1 3 155,6 3 312,4 3 487,7 3 669,4 3 861,8 4 069,6 4 284,9 4 527,1 4 797,5 5 103,9 5 447,5 5 861,8 6 382,0

    KH21 50 120 165 2 291 1 943 1 212,4 1 282,5 1 356,5 1 432,3 1 510,3 1 590,1 1 674,2 1 760,4 1 849,6 1 942,5 2 038,5 2 141,3 2 246,1 2 355,7 2 469,4 2 762,9 3 232,9

    KH22 51 137 752 4 572 4 109 2 515,0 2 664,9 2 817,2 2 978,6 3 143,5 3 319,0 3 501,2 3 691,9 3 891,3 4 109,4 4 339,2 4 594,2 4 870,6 5 186,9 5 539,8 5 956,9 6 479,9

    KH23 52 36 215 4 336 3 849 2 599,6 2 709,0 2 823,5 2 941,5 3 069,1 3 201,5 3 344,2 3 504,2 3 667,1 3 848,9 4 045,0 4 260,8 4 498,7 4 770,8 5 088,1 5 474,5 5 962,4

    KH24 53 267 939 4 633 4 099 2 637,0 2 772,2 2 912,9 3 057,5 3 208,4 3 365,3 3 528,8 3 707,9 3 896,9 4 099,3 4 320,3 4 562,6 4 827,9 5 128,9 5 477,8 5 896,2 6 428,4

    KH25 60 193 624 552 436 24,0 50,9 83,8 121,2 163,1 210,5 260,4 314,0 373,5 436,1 502,0 573,0 646,0 722,1 802,9 888,8 977,2

    KH26 61 50 454 1 692 1 669 161,2 1 182,5 1 234,0 1 285,9 1 342,0 1 400,7 1 463,2 1 528,9 1 595,8 1 668,7 1 741,8 1 821,0 1 904,3 1 990,9 2 082,9 2 178,5 2 281,0

    KH27 62 26 271 2 766 2 552 1 348,1 1 538,1 1 709,0 1 852,4 1 978,5 2 103,6 2 221,7 2 339,1 2 452,7 2 551,7 2 652,2 2 772,0 2 905,3 3 059,1 3 241,7 3 468,9 3 781,4

    KH28 63 76 724 3 224 2 933 1 632,4 1 913,0 2 120,9 2 299,2 2 463,6 2 559,5 2 640,9 2 728,8 2 827,7 2 933,2 3 052,2 3 184,3 3 340,5 3 517,5 3 732,8 3 997,6 4 345,7

    KH29 70 199 372 365 276 22,6 43,0 64,6 87,8 112,6 140,2 169,7 202,1 237,2 276,0 320,2 368,9 424,1 486,6 557,6 641,4 738,9

    KH30 71 28 913 598 565 58,5 119,2 174,6 228,6 281,5 335,2 390,3 444,8 502,3 564,5 622,8 684,5 744,3 808,2 873,5 940,2 1 009,9

    KH31 72 124 878 1 292 1 442 25,8 62,7 118,6 215,8 498,0 1 194,3 1 251,5 1 313,0 1 375,6 1 441,9 1 512,9 1 589,5 1 672,5 1 763,4 1 859,4 1 965,2 2 078,4

    KH32 73 347 565 1 220 1 415 14,7 32,2 59,1 102,6 184,8 445,0 1 200,4 1 267,9 1 339,4 1 414,7 1 493,6 1 577,2 1 665,5 1 759,2 1 859,6 1 967,5 2 084,2

    KH33 80 75 304 527 495 72,8 124,0 168,2 212,6 257,3 301,7 348,6 395,7 444,1 494,5 547,2 602,7 659,6 719,7 783,0 849,3 917,2

    KH34 81 103 656 417 334 36,0 62,9 90,2 118,1 146,9 178,4 211,3 248,3 289,0 334,2 383,6 437,9 498,8 567,4 643,6 728,1 821,9

    KH35 82 24 658 458 345 16,7 51,6 82,6 115,2 149,1 184,6 219,3 257,4 297,8 345,3 395,3 449,8 507,5 575,5 647,2 729,4 822,4

    KH36 83 31 206 982 1 199 0,9 3,8 9,1 15,0 22,1 32,4 47,8 76,5 155,9 1 199,2 1 264,9 1 335,1 1 417,4 1 509,6 1 622,1 1 759,1 1 930,6

    KH37 90 301 456 623 619 151,9 214,8 269,7 322,0 371,7 421,1 470,2 520,2 569,5 619,2 669,1 719,9 771,5 823,8 876,4 930,6 985,9

    KH38 91 252 820 322 241 27,8 47,1 66,0 86,4 107,8 130,7 155,3 181,6 209,5 241,0 275,7 314,9 358,7 409,4 467,9 537,8 624,2

    KH39 92 130 904 265 171 21,4 35,3 48,6 62,5 77,3 92,6 109,4 127,5 147,5 170,5 196,9 228,3 263,3 306,0 357,8 423,1 509,1

    KH40 93 425 926 143 62 5,5 9,6 13,7 18,2 23,3 29,0 35,6 43,1 52,0 62,4 74,8 89,8 108,4 131,6 163,2 206,9 271,4

    TOTAL 5 373 026 1 390 687 31,2 2 772,2 90,0 3 057,5 181,3 3 365,3 316,4 3 707,9 461,9 4 109,4 632,4 4 594,2 780,9 5 186,9 1 055,1 5 956,9 1 556,4

    Columns: cluster, clones group code, number of customers, mean turnover, median turnover, then the median and higher limit of the 1st to 8th deciles, followed by the median of the 9th decile.

    Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18

    KH01 00 5 789 108,63 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37

    KH02 01 1 580 109,62 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22

    KH03 02 3 067 106,41 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51

    KH04 03 2 345 109,01 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39

    KH05 04 1 781 99,49 57,27 43,20 33,71 27,81 25,14 22,89 20,97 19,60 18,54 18,42 17,25 16,61 16,03 16,17 16,06 16,15 33,77

    KH06 05 4 534 98,18 56,94 40,91 31,68 28,12 23,79 20,39 20,55 18,84 18,01 16,98 16,84 16,24 15,86 14,47 14,19 13,86 14,33

    KH07 10 688 106,73 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62

    KH08 11 378 105,92 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23

    KH09 12 398 92,99 55,61 39,55 31,49 27,96 24,22 22,00 20,07 18,76 17,97 16,54 16,18 15,32 14,83 14,25 14,10 13,41 12,78

    KH10 13 472 100,21 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49

    KH11 14 627 106,06 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63

    KH12 15 1 388 113,10 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42

    KH13 20 4 594 118,30 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86

    KH14 21 3 075 9,85 9,47 8,76 8,68 7,99 7,84 7,53 7,65 6,63 6,66 6,60 7,19 7,53 8,22 9,08 10,31 13,70 20,35

    KH15 22 420 7,16 6,73 6,50 6,25 6,09 5,96 5,93 5,77 5,47 5,30 5,28 5,12 7,11 10,84 12,64 14,25 17,16 25,06

    KH16 23 3 872 24,73 17,66 10,91 6,49 6,23 6,50 6,33 6,52 6,39 6,59 6,64 6,88 7,26 7,88 8,42 9,71 11,71 16,92

    KH17 24 959 8,99 8,39 8,05 7,76 7,39 7,02 6,98 7,00 6,80 6,75 7,02 8,16 8,71 9,14 9,73 11,26 14,23 20,64

    KH18 25 2 661 13,15 11,15 10,64 9,44 8,81 8,23 7,15 5,89 6,18 6,27 6,45 6,78 7,15 7,69 9,13 10,20 12,58 19,12

    KH19 30 883 5,11 5,02 4,89 4,69 4,95 4,98 4,94 5,15 5,22 5,17 5,26 5,68 6,34 6,74 7,34 8,81 11,07 16,74

    KH20 31 1 037 5,23 5,39 5,38 5,04 4,97 5,29 5,21 5,24 5,38 5,29 5,65 5,97 6,39 6,73 7,61 8,87 11,67 16,93

    KH21 32 66 5,78 5,77 5,58 5,45 5,28 5,29 5,15 5,07 5,02 4,94 5,04 4,89 4,88 4,83 11,89 17,01 20,42 27,37

    KH22 33 1 429 5,96 5,71 5,73 5,53 5,58 5,49 5,45 5,40 5,60 5,59 5,88 6,02 6,49 6,81 7,53 8,78 10,61 15,87

    KH23 34 381 4,21 4,23 4,18 4,34 4,31 4,46 4,78 4,65 4,96 5,10 5,34 5,58 6,05 6,65 7,59 8,91 10,88 16,20

    KH24 35 3 392 5,12 5,07 4,96 4,94 4,89 4,86 5,08 5,10 5,19 5,39 5,61 5,82 6,23 6,80 7,64 9,03 11,30 17,32

    KH25 40 1 508 111,69 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91

    KH26 41 1 340 633,59 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28

    KH27 42 1 525 14,09 11,11 8,39 6,81 6,33 5,61 5,28 4,86 4,04 3,94 4,52 4,81 5,29 5,97 7,01 9,01 12,27 18,53

    KH28 43 956 17,19 10,87 8,40 7,15 3,89 3,18 3,33 3,63 3,73 4,06 4,33 4,91 5,30 6,12 7,09 8,71 11,18 17,50

    KH29 44 684 90,17 50,40 35,81 28,31 24,52 21,03 19,11 17,34 16,37 16,02 15,22 14,95 14,74 14,58 15,04 15,19 15,33 16,13

    KH30 45 2 217 103,76 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45

    KH31 50 3 839 143,57 89,13 81,91 130,78 139,82 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39

    KH32 51 277 118,82 83,58 73,58 80,03 140,84 169,74 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26

    KH33 52 1 649 70,22 35,66 26,38 21,03 17,27 15,53 13,53 12,21 11,36 10,67 10,13 9,45 9,11 8,80 8,47 7,99 7,89 8,40

    KH34 53 799 74,74 43,50 30,94 24,35 21,45 18,44 17,50 16,40 15,64 14,78 14,16 13,91 13,76 13,43 13,12 12,88 12,70 12,05

    KH35 54 392 208,99 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70

    KH36 55 4 055 322,22 139,61 65,18 46,88 46,76 47,38 60,19 103,73 669,03 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74

    KH37 60 353 41,38 25,55 19,40 15,44 13,28 11,67 10,62 9,49 8,73 8,05 7,60 7,16 6,78 6,38 6,18 5,94 5,75 5,56

    KH38 61 1 000 69,11 40,25 30,82 24,84 21,22 18,77 16,95 15,39 15,04 14,40 14,20 13,93 14,13 14,28 14,93 16,07 18,24 21,65

    KH39 62 904 64,60 37,69 28,67 23,61 19,82 18,09 16,55 15,76 15,56 15,50 15,91 15,33 16,25 16,93 18,25 20,33 23,48 29,45

    KH40 63 1 670 73,32 43,87 32,68 27,54 24,82 22,54 21,23 20,67 20,03 19,75 20,16 20,60 21,49 24,02 26,72 31,21 39,22 54,61

    Mean 39,16 38,60 29,60 23,07 18,69 16,44 15,15 13,07 12,38 11,70 11,36 11,18 11,19 11,45 12,07 12,98 14,62 19,14

    Clusters

    This table gives the evolution rate that is assigned to each customer depending on his clones group and present turnover.

    Under the table, we can see the average of these rates: if we obtain an aberrant evolution rate (>100%), we replace this excessively high value with the mean rate calculated on all the groups except those with an aberrant value. The table below shows the corrections made.
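The correction rule can be illustrated on the TX1 column of the rates table. The sketch below is one reading of the rule, with a hypothetical function name: rates above 100% are replaced by the mean of the remaining rates of the same column, rounded to two decimals.

```python
def replace_aberrant(rates, threshold=100.0):
    """Replace rates above `threshold` by the mean of the rates at or below it."""
    kept = [r for r in rates if r <= threshold]
    mean_kept = sum(kept) / len(kept)
    corrected = [r if r <= threshold else round(mean_kept, 2) for r in rates]
    return corrected, mean_kept


# TX1 column of the rates table, groups KH01 to KH40, in percent.
tx1 = [
    108.63, 109.62, 106.41, 109.01, 99.49, 98.18, 106.73, 105.92, 92.99, 100.21,
    106.06, 113.10, 118.30, 9.85, 7.16, 24.73, 8.99, 13.15, 5.11, 5.23,
    5.78, 5.96, 4.21, 5.12, 111.69, 633.59, 14.09, 17.19, 90.17, 103.76,
    143.57, 118.82, 70.22, 74.74, 208.99, 322.22, 41.38, 69.11, 64.60, 73.32,
]

corrected_tx1, mean_kept = replace_aberrant(tx1)
print(round(mean_kept, 2))
```

The mean obtained this way matches the value shown for TX1 in the "Mean" row of the rates table.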

    For instance, a customer who belongs to clones group 00 (KH01) and has turnover below 21.5 euros will have an evolution rate equal to rate 1, i.e. 89%.

    If a customer belongs to clones group 00 (KH01) and has turnover between 21.5 and 40.5 euros, then he will have an evolution rate equal to rate 2, i.e. 54%.


    The evolution rates given for the two examples are very high, but they concern very small

    customers.

    Each of the customers is assigned an evolution rate. The rate multiplied by turnover will allow

    us to estimate potential turnover for each of the customers.

    CORRECTION OF ABERRANT RATES

    Number TX1 TX2 TX3 TX4 TX5 TX6 TX7 TX8 TX9 TX10 TX11 TX12 TX13 TX14 TX15 TX16 TX17 TX18

    KH01 0 5 789 39,16 63,13 45,26 36,55 30,96 27,95 24,86 23,62 22,48 21,74 21,36 21,22 21,13 21,69 23,05 22,41 25,49 33,37

    KH02 1 1 580 39,16 61,89 44,29 35,62 29,93 27,54 25,55 23,07 22,50 21,87 21,06 20,76 20,11 20,74 21,69 22,18 24,58 28,22

    KH03 2 3 067 39,16 58,41 41,79 33,61 29,12 26,10 24,00 22,76 21,65 21,02 20,42 19,98 19,80 20,06 21,60 23,99 26,03 30,51

    KH04 3 2 345 39,16 59,80 42,32 35,48 30,08 27,94 26,19 24,98 24,36 23,48 23,11 23,05 22,70 23,48 24,13 25,23 27,30 35,39

    KH07 10 688 39,16 56,18 38,95 31,22 25,67 22,87 20,23 17,99 16,71 15,86 15,05 14,38 13,45 13,28 13,06 12,91 12,36 12,62

    KH08 11 378 39,16 58,47 40,16 32,00 27,76 24,72 21,80 19,91 19,14 17,94 16,89 16,39 15,16 15,46 15,16 20,49 23,96 25,23

    KH10 13 472 39,16 54,17 39,50 30,19 25,21 21,90 18,98 17,34 15,53 14,38 12,46 12,09 10,96 10,36 9,96 9,18 8,84 8,49

    KH11 14 627 39,16 62,07 40,46 32,05 28,76 23,36 20,69 19,26 16,38 15,72 14,34 13,70 12,94 12,17 11,58 10,96 13,96 33,63

    KH12 15 1 388 39,16 60,81 42,60 33,25 28,00 24,11 21,77 19,49 18,02 17,02 15,47 14,66 13,74 12,72 12,89 12,82 13,61 14,42

    KH13 20 4 594 39,16 68,56 47,95 36,82 28,83 25,00 21,04 18,01 15,55 14,03 12,58 11,58 12,66 13,69 13,84 13,93 15,36 18,86

    KH25 40 1 508 39,16 64,82 44,56 34,56 29,06 23,70 20,59 18,95 16,75 15,12 14,15 12,73 11,79 11,19 10,69 9,95 9,52 13,91

    KH26 41 1 340 39,16 4,35 4,21 4,36 4,37 4,47 4,49 4,37 4,57 4,38 4,55 4,58 4,55 4,62 4,59 4,70 4,84 5,28

    KH30 45 2 217 39,16 46,45 30,93 23,13 19,06 16,44 13,96 12,92 12,38 10,34 9,90 8,74 8,59 8,08 7,63 7,41 6,93 6,45

    KH31 50 3 839 39,16 89,13 81,91 23,07 18,69 4,80 4,91 4,77 4,82 4,92 5,06 5,23 5,43 5,44 5,69 5,76 6,14 6,39

    KH32 51 277 39,16 83,58 73,58 80,03 18,69 16,44 5,62 5,64 5,62 5,58 5,60 5,60 5,63 5,70 5,80 5,93 6,16 6,26

    KH35 54 392 39,16 60,21 39,48 29,41 23,76 18,82 17,38 15,70 15,94 14,46 13,78 12,83 13,39 12,46 12,71 12,76 13,08 13,70

    KH36 55 4 055 39,16 38,60 65,18 46,88 46,76 47,38 60,19 13,07 12,38 5,48 5,54 6,17 6,50 7,46 8,45 9,74 12,00 15,74

    Cluster

    The above table shows the replacement by the average of the aberrant rates (>100%).

    4- Main results

    The average incremental rate of the loyalty program customers is 12.79%.

    This retailer can earn 12.79% of extra turnover on these customers.

    Customer assigned to turnover


    Category | Number | Turnover | Turnover potential
    Bronze | 41,9% | 65,9% | 38,4%
    Silver | 14,2% | 15,0% | 20,0%
    Gold | 44,0% | 19,0% | 41,5%

    41.9% of the retailer customers are Bronze potentials generating 65.9% of annual turnover

    and accounting for 38.4 % of the potential turnover.

    At the other end 44% of customers are "Gold" potentials, generating only 19% of the

    turnover but accounting for 41.5% of potential turnover.

    Regrouping in SML segments:

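    [Figure: share of S, M, L and New customers, in numbers, within the Bronze, Silver and Gold potentials; vertical axis from 0% to 100%. Data labels, in extraction order: 31,4%, 76,7%, 76,4%, 33,8%, 5,9%, 34,6%, 13,6%, 5,7%, 9,0%, 9,0%, 3,8%, 0,2%.]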

    S customers account for a high proportion of "Gold" potentials, based on annual turnover.

    There are M customers among the "Bronze" potentials, and L customers among the "Bronze" and "Silver" potentials. Most of them have an interesting margin of growth.

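    In annual turnover (in €):

    [Figure: share of New, S, M and L customers in the annual turnover of the Bronze, Silver and Gold potentials; vertical axis from 0% to 100%. Data labels, in extraction order: 0,1%, 1,1%, 1,8%, 10,3%, 22,5%, 29,9%, 27,1%, 7,1%, 62,5%, 69,3%, 25,2%, 43,1%.]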


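    In potential turnover (in k€):

    [Figure: share of New, S, M and L customers in the potential turnover of the Bronze, Silver and Gold potentials; vertical axis from 0% to 100%. Data labels, in extraction order: 0,1%, 1,2%, 2,6%, 14,5%, 22,3%, 29,9%, 25,9%, 7,1%, 59,5%, 69,4%, 23,7%, 43,8%.]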

    L customers have the largest potential in absolute value, although they do not have the highest evolution rates: they make up for this with much higher turnover than the S or M segments.

    S customers are over-represented among "Gold" potentials, with 29,9% of the potential turnover of the cluster.

    Distribution by the retailer RFM and by potential categories

    In numbers:


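    [Figure: distribution of customers, in numbers, by retailer RFM segment within each potential category; vertical axis from 0% to 100%. Legend: 1 Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+. Data labels, in extraction order: 0,9%, 4,0%, 14,7%, 1,4%, 3,8%, 17,0%, 0,2%, 3,8%, 9,0%, 10,9%, 43,3%, 31,4%, 32,4%, 24,4%, 16,5%, 7,8%, 3,0%, 2,2%, 23,0%, 7,0%, 6,0%, 23,4%, 10,8%, 3,2%.]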

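    In annual turnover (in €):

    [Figure: distribution of annual turnover by retailer RFM segment within each potential category; vertical axis from 0% to 100%. Legend: 1 Without status, Inactive 3 months, New, M--F--, M-F-, M-F+, M+F-, M+F+. Data labels, in extraction order: 0,6%, 1,4%, 4,9%, 0,8%, 0,9%, 3,0%, 0,1%, 1,1%, 1,8%, 4,2%, 11,4%, 14,8%, 19,6%, 12,0%, 22,2%, 5,9%, 2,3%, 5,0%, 28,6%, 16,9%, 27,7%, 40,4%, 54,0%, 20,6%.]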

    Logically, heavier potentials should be present in RFM+ segments in absolute values.

    Categories of potential:

    Potential rates per clone clusters are grouped into four categories:

    - P0: No potential turnover

    - P1: Potential > 20 %

    - P2: Potential between 15 and 20 %

    - P3: Potential below 15 %
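The four categories translate directly into a small classification function. This is a sketch only: the function name is hypothetical, and the treatment of the exact boundary values (15% and 20% both falling in P2) is an assumption, since the text does not say which side they belong to.

```python
def potential_category(rate_percent):
    """Map an evolution rate, in percent, to the categories P0 to P3."""
    if rate_percent <= 0:
        return "P0"  # no potential turnover
    if rate_percent > 20:
        return "P1"
    if rate_percent >= 15:
        return "P2"  # assumption: both bounds, 15 and 20, fall in P2
    return "P3"


for example_rate in (0.0, 12.79, 15.0, 20.0, 39.16):
    print(example_rate, potential_category(example_rate))
```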


    Category | Number | Turnover | Potential turnover
    P0 | 5,0% | 13,9% | 0,0%
    P1 | 41,7% | 13,4% | 36,6%
    P2 | 13,0% | 9,0% | 15,6%
    P3 | 40,2% | 63,7% | 47,8%

    40% of the customers (category P3) create 63% of the turnover and 47.8% of the potential turnover. On average, they achieve turnover of 2,202 euros for an average potential of 162 euros. These customers, who already contribute substantially, are the most likely (for the least perceived effort) to reach their potential.

    Grouping in SML segments

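    [Figure: share of New, S, M and L customers within the potential categories P0, P1, P2 and P3; vertical axis from 0% to 100%. Data labels, in extraction order: 9,5%, 29,3%, 78,7%, 82,0%, 31,3%, 23,1%, 8,9%, 6,4%, 32,8%, 47,6%, 35,7%, 0,0%, 4,1%, 0,2%, 7,5%, 2,9%.]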

    S customers represent a high proportion of "P1" and "P2", based on annual turnover generated

    by P1 potentials. We find M customers mainly among potential P3, while L customers for

    their part are found under "P0", but also "P3".

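    In yearly turnover (in €):

    [Figure: share of New, S, M and L customers in the yearly turnover of the potential categories P0, P1, P2 and P3; vertical axis from 0% to 100%. Data labels, in extraction order: 0,0%, 2,5%, 1,8%, 0,0%, 39,0%, 36,0%, 10,0%, 12,8%, 33,2%, 25,8%, 79,6%, 25,3%, 64,2%, 7,6%, 11,8%, 50,4%.]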


    Average amount | P0 | P1 | P2 | P3 | TOTAL
    New | 4 187 | 120 | 430 | 456 | 165
    S | 1 004 | 222 | 422 | 701 | 384
    M | 2 131 | 1 673 | 1 756 | 1 732 | 1 746
    L | 6 439 | 3 911 | 6 493 | 3 963 | 4 401
    TOTAL | 3 853 | 448 | 961 | 2 202 | 1 393

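    In potential of turnover (in €):

    [Figure: share of New, S, M and L customers in the potential turnover of the categories P1, P2 and P3; vertical axis from 0% to 100%. Data labels, in extraction order: 3,9%, 1,9%, 0,0%, 38,9%, 35,7%, 14,4%, 31,8%, 11,9%, 25,4%, 50,4%, 24,3%, 61,3%.]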

    S customers are over-represented in the "P1" category, with 38.9% of the potential turnover for this segment. In absolute terms, however, it is L customers who have the highest potential. It is with good customers that we can increase turnover, as they are more likely to succeed than any other segment. The marketing budget can therefore be allocated on the basis of average turnover and of the intensity of offers by potential. The two concepts are complementary in defining the mechanics of loyalty and retention.

    5-Results summary

    The Kohonen network allows us to group customers into 40 clone clusters. The 4 by 10 matrix had no empty group, so we retained it.

    Customers within the same group resemble each other according to socio demographic and

    consumption characteristics.

    Using the deciles method, we assigned a turnover evolution rate to each customer in the

    sample.

    We created the following potential turnover score:

    - Gold : evolution rate higher than 20%

    - Silver : evolution rate between 15% and 20%

    - Bronze : evolution rate below 15%


    We calculated potential turnover from this rate and turnover.

    Our sample is composed of 5,373,026 customers generating annual turnover of 7.46 billion euros and representing potential turnover of 953.2 million euros.

    The retailer can therefore earn almost 12.77% more turnover from its customers.

    In terms of rates, the sample is composed of 41.9% Bronze customers, 14.1% Silver customers and 44% Gold customers. In reality, it must be assumed that the best customers are those with the highest absolute values.

    76.6% of Gold customers are S and generate 29.9% of potential turnover.

    65% of Gold customers are 3-month Inactive, RFM-- and RFM-, and they generate 40% of annual turnover and 38.4% of potential turnover.

    Logically, the customers with the highest evolution rates are found among customers with low turnover.

    At the opposite end of the scale, customers with the highest turnover have the strongest potential turnover in absolute value.

                  P1        P2        P3     TOTAL
             Average   Average   Average   Average
              amount    amount    amount    amount
    New           49        77        34        52
    S             60        71        75        64
    M            431       302       120       182
    L          1 055     1 104       279       336
    TOTAL        120       163       162       137

    The validation procedures for the models

    Internal validity

    We carried out several tests on our model:

    - Division of our population into sub-populations to check the coherence of the allocation of the clone classes

    - Benchmark of several classification techniques

    - Re-allocation of the classes by supervised models (C5, Bayesian network)

    - Connectivity of super classes

    The internal validation methods will, of course, need to be completed.
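
    One simple way to quantify the re-allocation test is to compare, customer by customer, the clone class given by the clustering with the class predicted by a supervised model. The sketch below assumes that both label lists are already available; the function name and the toy labels are hypothetical, and this is not necessarily the measure used in the study.

```python
from collections import defaultdict


def reallocation_agreement(cluster_labels, predicted_labels):
    """Return (overall agreement, agreement per original cluster)."""
    if len(cluster_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not cluster_labels:
        raise ValueError("label lists must not be empty")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for original, predicted in zip(cluster_labels, predicted_labels):
        totals[original] += 1
        if original == predicted:
            hits[original] += 1
    overall = sum(hits.values()) / len(cluster_labels)
    per_cluster = {label: hits[label] / totals[label] for label in totals}
    return overall, per_cluster


# Toy labels: one customer of cluster 2 is re-allocated to cluster 3.
overall, per_cluster = reallocation_agreement([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3])
print(overall, per_cluster)
```

    A high overall rate with no weak individual cluster would suggest that the classes can be recovered from the input variables, which is one indication of internal stability.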

    External validity

    The customers' wallet share is consistent with a TNS figure of 24%. Given overall

    consumption, achieving the full potential would raise the wallet share to 28%. An extra 2%

    of the wallet share is much more realistic.

    We would like to de-duplicate¹³ our database against the Nielsen Homescan panel to check whether sales

    really do increase, but this is not yet possible in this context.

    Conclusions

    Discussion of the results of the research study

    The results of our research study will be placed in the context of corporate customer potential

    determination: determining customer potential represents a major part of a company's direct

    and promotional marketing investment. Most large loyalty programs are based on this notion.

    We will look at how our approach, compared with other methods, enables us to establish

    converging results to answer our research questions:

    - The clustering technique (SOM) is used to identify customers who are similar and to define realistic potential.

    - We can estimate the stability of the clusters in several ways, all of which show internal stability.

    - We have developed a pragmatic method for determining potential: the clones method.

    The limits and the contribution of our research study

    We used specific clustering techniques for the purpose of validating our method. We have shown

    the possible statistical limits of our approach in terms of the complexity or reliability of the

    models used.

    For feasibility reasons we worked only with a single business area, the large grocery retail

    sector in France, and used only supplementary data from other business sectors.

    We do not have access at the moment to data from foreign retailers, for example.

    The calculation of potential turnover per group is very empirical and should be more scientifically

    justified.

    Further research

    There are several ways to improve upon our research:

    - Refine our choice of variables

    - Determine a more rigorous method than the empirical deciles/median method for estimating the

    potential per group

    - Run the model in other industry sectors; we have done this and it works quite well, but it is important that others test it

    - Validate the results over time by comparing the potential values with actual sales

    13 We merge the two databases to find the duplicates.

    We hope that, given its strategic impact on company results and the fact that the

    calculation is based on internal customer data already at hand, this method will find

    widespread use.

    4.

    Appendix

    Appendix 1 : Translation of the filenames

    Appendix 2 : Detail of the first data audit

    The data set was audited in two stages: a first stage to determine all the data useful for the

    analysis in the original database, and a second stage to determine the data available to

    calculate potential. In the appendix, only the second stage is shown.

    Analysis of the Potentiel_Ratio and Potentiel_Socio tables

    Potentiel_Ratio contains 5 373 048 observations (Customer accounts)

    It is composed of 26 fields

    Potentiel_Socio contains 5 373 056 observations (Customer accounts)

    It is composed of 18 fields

    This audit is based on the combination of the two tables, i.e. 5 373 048 observations
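
    The combination of the two tables can be pictured as a join on the customer account. The sketch below assumes an inner join (only accounts present in both tables are kept) and a key field named `customer_id`; neither the join type nor the field name is stated in the source, and the toy rows are invented purely for illustration.

```python
def combine_tables(ratio_rows, socio_rows, key="customer_id"):
    """Inner-join two lists of row dictionaries on a shared key (hypothetical)."""
    socio_by_key = {row[key]: row for row in socio_rows}
    combined = []
    for row in ratio_rows:
        match = socio_by_key.get(row[key])
        if match is not None:
            # Fields of both tables end up in a single record per customer.
            combined.append({**match, **row})
    return combined


# Invented toy rows: account 3 exists only in the socio table.
ratio = [{"customer_id": 1, "rate_wine": 2.1}, {"customer_id": 2, "rate_wine": 0.0}]
socio = [
    {"customer_id": 1, "home_type": "Flat"},
    {"customer_id": 2, "home_type": "House"},
    {"customer_id": 3, "home_type": "House"},
]
print(len(combine_tables(ratio, socio)))
```

    Under this assumption, accounts present in only one of the tables drop out, which would be consistent with the combined file having fewer observations than Potentiel_Socio.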

    Data format

    This is the original data format. We may have to change some formats to better achieve our

    model objectives.

    The RFM 3 months variable is empty and is therefore discarded.

    Variable by variable analysis

    Dichotomous variables

    RFM 3 months       Number         %   First audit comparison
    New               247 326     4.60%     3.74%
    Ex-customers      400 236     7.45%
    Inactive          465 107     8.66%    19.15%
    M--F--          1 315 873    24.49%    21.00%
    M-F-            1 302 089    24.23%    26.00%
    M-F+              248 619     4.63%     6.06%
    M+F-              710 311    13.22%    11.65%
    M+F+              683 487    12.72%    12.40%
    Total           5 373 048   100.00%   100.00%

    Family status      Number         %   First audit comparison
    Couple          1 557 871    28.99%    26.19%
    Single            642 374    11.96%    10.81%
    Empty           3 172 803    59.05%    62.99%
    Total           5 373 048   100.00%   100.00%

    SML on 12 months      Number         %
    NA                     7 816     0.15%
    NV                   245 278     4.56%
    I                      2 573     0.05%
    S                  3 086 955    57.45%
    M                  1 014 777    18.89%
    L                  1 015 649    18.90%
    Total              5 373 048   100.00%

    Home type           Number         %   First audit comparison
    Flat               880 722    16.39%    18.76%
    House and flat       1 300     0.02%     0.00%
    House            1 576 829    29.35%    34.98%
    Empty            2 914 197    54.24%    65.02%
    Total            5 373 048   100.00%   100.00%

    Number of children
    in the household      Number         %   First audit comparison
    0                  4 088 282    76.09%    75.72%
    1                    510 448     9.50%     9.57%
    2                    508 967     9.47%     9.54%
    3                    196 520     3.66%     3.79%
    4                     48 644     0.91%     1.01%
    5                     11 482     0.21%     0.22%
    > 5                    8 705     0.16%     0.14%
    Total              5 373 048   100.00%   100.00%

    Social categories       Number         %   First audit comparison
    Farmer                  49 892     0.93%     0.96%
    Artisan                 86 807     1.62%     1.79%
    Other                   84 696     1.58%     1.39%
    Manager                188 017     3.50%     3.50%
    Employee               737 469    13.73%    14.42%
    Student                 85 928     1.60%     1.28%
    Housewife              211 003     3.93%     4.38%
    Civil servant          233 386     4.34%     3.70%
    Independent worker      42 913     0.80%     0.72%
    Worker                 138 099     2.57%     2.73%
    Retired                664 304    12.36%    14.05%
    Unemployed             147 700     2.75%     3.28%
    Technician              91 956     1.71%     2.08%
    Empty                2 610 877    48.59%    45.71%
    24                           1     0.00%     0.00%
    Total                5 373 048   100.00%   100.00%

    The value 24 is an error; we eliminate it.

    Age                    Number         %   First audit comparison
    0 to 18 years           8 604     0.16%     0.21%
    19 to 29 years        317 371     5.91%     6.00%
    30 to 39 years        537 010     9.99%    10.76%
    40 to 49 years        652 497    12.14%    12.59%
    50 to 59 years        649 038    12.08%    12.30%
    60 to 69 years        458 669     8.54%     8.07%
    70 years and more     592 227    11.02%    10.92%
    Empty               2 157 632    40.16%    39.15%
    Total               5 373 048   100.00%   100.00%

    Customer history        Number         %   First audit comparison
    0 to 2 months          194 269     3.62%     2.76%
    3 to 5 months          238 267     4.43%     2.54%
    6 to 8 months          182 733     3.40%     3.20%
    9 to 11 months         221 313     4.12%     3.26%
    12 to 17 months        354 680     6.60%     6.39%
    18 to 23 months        231 513     4.31%     6.17%
    24 to 35 months        517 972     9.64%     9.58%
    36 to 47 months        396 706     7.38%     7.96%
    48 to 59 months        403 389     7.51%    10.01%
    60 months and more   2 631 717    48.98%    44.60%
    Empty                      489     0.01%     3.53%
    Total                5 373 048   100.00%   100.00%

    Time since last purchase      Number         %
    0 to 2 months              4 213 358    78.42%
    3 to 5 months                562 908    10.48%
    6 to 8 months                312 721     5.82%
    9 to 11 months               241 093     4.49%
    12 to 17 months               42 968     0.80%
    Total                      5 373 048   100.00%

    Numerical variables

    Variable                         Count    Mean         Min        Max     SD
    Rate Other                   5 373 055    77.1   -34 035.0      294.9   22.2
    Rate Bazar                   5 373 055     7.0   -27 666.7    5 328.9   16.3
    Rate BOF / APF               5 373 055     9.0      -195.2    9 685.6    8.4
    Rate Pork butcher LS         5 373 055     6.5    -1 580.0    3 186.7    6.7
    Rate Pet                     5 373 055     1.2       -15.3    1 206.5    3.7
    Rate Beauty/Make up          5 373 055     0.7       -61.2      574.5    2.4
    Rate Baby                    5 373 055     1.4       -23.5    3 614.4    5.8
    Rate Butcher                 5 373 055     6.5      -125.2    4 193.0    8.6
    Rate Bakery                  5 373 055     2.5      -138.5    1 247.6    4.9
    Rate Pork butcher            5 373 055     2.6       -74.5    1 373.1    4.7
    Rate Dietetic food           5 373 055     0.4       -11.3    2 428.6    2.3
    Rate cheese                  5 373 055     1.5       -29.4      262.7    2.9
    Rate fruits and vegetables   5 373 055     8.4      -105.5   17 364.3   12.0
    Rate fisher                  5 373 055     1.9      -200.0    1 542.9    4.5
    Rate frozen food             5 373 055     3.1      -308.3   11 528.6    7.6
    Rate wine                    5 373 055     2.1      -530.3      610.2    5.4
    Ratio cleaning products      5 373 055     8.9    -1 472.8   14 007.1   12.4
    Ratio grocery                5 373 055    16.8    -1 463.6   16 122.2   14.9
    Ratio liquid                 5 373 055    10.8    -2 843.2    6 680.4   13.8
    Ratio textile                5 373 055     2.0    -1 250.0    2 713.2    5.4
    Ratio ultra fresh food       5 373 055     4.7      -372.3    5 814.3    6.4
    Rate of poultry              5 373 055     1.7       -37.5    2 822.8    3.8
    Rate first price             5 373 055     6.2      -393.8   26 242.0   15.0
    Rate retailer brand 1        5 373 055    14.6      -194.9    7 501.1   11.5
    Rate retailer brand 2        5 373 055     2.0       -30.9    2 031.8    3.4

    Monetary fields contain no decimal separator, so we divided turnover by 100.
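
    The rescaling of the monetary fields amounts to the following conversion. This is a sketch only: the function name is hypothetical, and the use of `Decimal` (to avoid binary rounding on amounts) is a design choice of this example rather than something stated in the source.

```python
from decimal import Decimal


def to_currency_units(raw_value):
    """Convert a monetary field stored without a decimal separator.

    The raw field is assumed to hold hundredths of the currency unit,
    so 16457 becomes 164.57.
    """
    return Decimal(raw_value) / Decimal(100)


print(to_currency_units(16457))
```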

    SML 12      Amount   % of        Filtered turnover   % Filtered turnover   Filtered turnover 12 months:
    months               customers   12 months           12 months             Mean per customer
    NA           7 816     0.15%                     0                 0.00%                            0.0
    New        245 278     4.56%            40 365 377                 0.54%                          164.6
    I            2 573     0.05%                     0                 0.00%                            0.0
    S        3 086 955    57.45%         1 185 154 005                15.87%                          383.9
    M        1 014 777    18.89%         1 772 044 857                23.72%                        1 746.2
    L        1 015 649    18.90%         4 471 932 029                59.87%                        4 403.0
    Total    5 373 048   100.00%         7 469 496 268               100.00%                        1 390.2
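
    The derived columns of this table (share of customers, share of turnover, mean per customer) can be recomputed from the counts and turnover figures. The sketch below uses the values of the table itself; the function and variable names are hypothetical.

```python
# (number of customers, filtered turnover over 12 months) per SML segment,
# taken from the table above.
segments = {
    "NA": (7_816, 0),
    "New": (245_278, 40_365_377),
    "I": (2_573, 0),
    "S": (3_086_955, 1_185_154_005),
    "M": (1_014_777, 1_772_044_857),
    "L": (1_015_649, 4_471_932_029),
}


def summarize(segments):
    """Return total customers, total turnover and per-segment ratios."""
    total_customers = sum(n for n, _ in segments.values())
    total_turnover = sum(t for _, t in segments.values())
    summary = {}
    for name, (n, t) in segments.items():
        summary[name] = {
            "customer_share": n / total_customers,
            "turnover_share": t / total_turnover,
            "mean_per_customer": t / n,
        }
    return total_customers, total_turnover, summary


total_customers, total_turnover, summary = summarize(segments)
for name, row in summary.items():
    print(name, f"{row['customer_share']:.2%}", f"{row['turnover_share']:.2%}",
          f"{row['mean_per_customer']:.1f}")
```

    The L segment illustrates the concentration of turnover: under 19% of customers account for close to 60% of filtered turnover.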

    SML 12   Total turnover   % Total    Total turnover:     Cumulative filtered   % of cumulative     Cumulative filtered turnover:
    months                    turnover   Mean per customer   turnover              filtered turnover   Mean per customer
    NA           29 189 799     0.06%             3 734.6            16 060 558               0.05%                         2 054.8
    New         118 781 921     0.25%               484.3            91 122 432               0.31%                           371.5
    I            19 120 403     0.04%             7 431.2             9 701 975               0.03%                         3 770.7
    S        11 506 545 207    23.89%             3 727.5         6 400 520 698              21.51%                         2 073.4
    M        10 946 321 611    22.72%            10 786.9         6 841 635 615              22.99%                         6 742.0
    L        25 549 213 252    53.04%            25 155.6        16 398 620 825              55.11%                        16 146.0
    Total    48 169 172 193   100.00%             8 965.0        29 757 662 104             100.00%                         5 538.3

    SML 12

    months

    Turnover annual on

    promo

    % Turnover

    annual on

    promo

    Turnover annual on

    promo: Mean per

    customer

    Total nb taken

    reduction vouchers

    (BA)

    % Total nbtaken

    reducti