
NEW VARIATIONAL BAYESIAN

APPROACHES FOR

STATISTICAL DATA MINING

With applications to profiling and differentiating

habitual consumption behaviour of customers

in the wireless telecommunication industry

BURTON WU

Bachelor of Applied Science (Mathematics), QUT
Bachelor of Engineering (Electrical & Computing), QUT

Bachelor of Engineering (Honours), QUT

A thesis submitted for the degree of
Doctor of Philosophy

Mathematical Sciences
Faculty of Science and Technology

Queensland University of Technology

Principal Supervisor: Professor Anthony N. Pettitt
Associate Supervisor: Dr. Clare A. McGrory

April 2011


Abstract

This thesis investigates profiling and differentiating customers through the use of statistical data mining techniques. The business application of our work centres on examining individuals' seldom studied yet critical consumption behaviour over an extensive time period within the context of the wireless telecommunication industry; consumption behaviour (as opposed to purchasing behaviour) is behaviour that has been performed so frequently that it has become habitual and involves minimal intention or decision making. Key variables investigated are the activity initialisation timestamp and cell tower location, as well as the activity type and usage quantity (e.g., voice call with duration in seconds); the research focus is on customers' spatial and temporal usage behaviour. The main methodological emphasis is on the development of clustering models based on Gaussian mixture models (GMMs), which are fitted with the use of the recently developed variational Bayesian (VB) method. VB is an efficient deterministic alternative to the popular but computationally demanding Markov chain Monte Carlo (MCMC) methods. The standard VB-GMM algorithm is extended by allowing component splitting so that it is robust to initial parameter choices and can automatically and efficiently determine the number of components. The new algorithm we propose allows more effective modelling of individuals' highly heterogeneous and spiky spatial usage behaviour, or more generally human mobility patterns; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. Customers are then characterised and segmented based on the fitted GMM, which corresponds to how each of them uses the products/services spatially in their daily lives; this essentially reflects their likely lifestyle and occupational traits. Other significant research contributions include fitting GMMs using VB to circular data, i.e., the temporal usage behaviour, and developing clustering algorithms suitable for high dimensional data based on the use of VB-GMM.


Keywords

Gaussian Mixture Model (GMM); Mixture Models; Probability Density Estimation; Variational Bayes (VB); Bayesian Statistics; Data Mining (DM); Combinatorial Data Analysis (CDA); Profiling; Segmentation; Clustering; Feature Extraction; Behavioural Characteristics; Consumer Behaviour; Customer Behaviour; Consumption Behaviour; Customer Relationship Management (CRM); Relationship Marketing (RM); Human Mobility Pattern; Spatial Behaviour; Temporal Behaviour; Circular Data; Data Stream; High Dimensional Data; Call Detail Records (CDR); Wireless Telecommunication Industry


Acronyms

AIC     Akaike's Information Criterion
BIC     Bayesian Information Criterion
CCC     Cubic Clustering Criterion
CDA     Combinatorial Data Analysis
CDR     Call Detail Records
CH      Calinski and Harabasz (Index)
DIC     Deviance Information Criterion
DP      Dirichlet Process
EM      Expectation-Maximization (Algorithm)
GMM     Gaussian Mixture Model
HDDC    High Dimensional Data Clustering (Algorithm)
i.i.d.  Independent and Identically Distributed
KL      Kullback-Leibler (Divergence)
KM      k-Means Algorithm
LL      Log-Likelihood
MAE     Mean Absolute Error
MAEAC   Mean Absolute Error Adjusted for Covariance
MCMC    Markov Chain Monte Carlo
MD      Mahalanobis Distance
ML      Maximum Likelihood
PCA     Principal Component Analysis
RJMCMC  Reversible Jump Markov Chain Monte Carlo
SD      Standard Deviation
SEVB    Split and Eliminate Variational Bayesian (Method/Algorithm)
SMS     Short Message Services
VB      Variational Bayes or Variational Bayesian (Method/Algorithm)


Preface

This thesis includes four chapters that have been submitted as articles for publication, as follows.

• Chapter 3, titled "The Variational Bayesian Method: Component Elimination, Initialization & Circular Data", has been submitted;

• Chapter 4, titled "A New Variational Bayesian Algorithm with Application to Human Mobility Pattern Modeling", has been accepted for publication in Statistics and Computing. Note that the concepts described in this chapter were also presented as a peer-reviewed poster at the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010);

• Chapter 5, titled "Customer Spatial Usage Behavior Profiling and Segmentation with Mixture Modeling", is undergoing revision for a marketing journal. Note that the concepts discussed here were also presented as a poster at the Ninth World Conference of the International Society for Bayesian Analysis (ISBA 2008); and

• Chapter 6, titled "Identifying Subspace Clusters for High Dimensional Data with Mixture Models", has been submitted.

All research was carried out in collaboration with my principal supervisor, Professor Anthony N. Pettitt, and my associate supervisor, Dr. Clare A. McGrory. I proposed the ideas for each of the articles and was the main researcher responsible for implementing the methodology described therein and for writing the articles. Additionally, I collected all of the data in accordance with the intellectual property (IP) agreement signed by all associated parties.


Acknowledgements

I am grateful to my supervisors, Professor Anthony N. Pettitt and Dr. Clare A. McGrory, for their guidance in this work. Their experience and knowledge were invaluable to this project.

I also thank my former managers, Michael Sheehan and Terry Simmonds, for their support, in particular their efforts in organising the intellectual property agreement; this research would not have been possible without their involvement. For the same reason, I also thank Dr. Terry Bell, who was the senior contract review officer at QUT.

I am also grateful to all the academic and support staff and students within the discipline of mathematical sciences, as well as many of my former colleagues, for their assistance and kindness.

Finally, thank you Mum, Dad and my sister for always being there for me.


Contents

Abstract
Keywords
Acronyms
Preface
Acknowledgements

1 Introduction
   1.1 Understanding Customer Behaviour & Human Mobility Patterns
   1.2 Telecommunication Call Data Record Dataset
   1.3 Motivation for Our Inferential Approach
   1.4 The Role of Statistical Data Analysis
   1.5 Thesis Overview

2 Literature Review
   2.1 Introduction
      2.1.1 Research problem & methodology requirements
      2.1.2 Research methodology overview
   2.2 Mixture Models
      2.2.1 Introduction
      2.2.2 Gaussian mixture model (GMM)
      2.2.3 Classical/frequentist techniques (for GMMs)
      2.2.4 Bayesian techniques (for GMMs)
      2.2.5 Approximate techniques (for GMMs)
      2.2.6 High dimensional GMM
      2.2.7 Review conclusion
   2.3 Clustering
      2.3.1 Introduction
      2.3.2 Classical clustering algorithms
      2.3.3 Scalable clustering algorithms
      2.3.4 Algorithms for clustering high dimensional data
      2.3.5 Review conclusion

3 Variational Bayesian Method: Component Elimination, Initialization & Circular Data
   3.1 Introduction
   3.2 VB-GMM Algorithm
   3.3 Model Evaluation Criterion
   3.4 Results
      3.4.1 The irreversible nature of the VB component elimination property
      3.4.2 Evaluating the results of the VB-GMM fit under different initialization schemes for padded circular data
   3.5 Discussion
   3.6 References

4 A New Variational Bayesian Algorithm with Application to Human Mobility Pattern Modeling
   4.1 Introduction
   4.2 Standard VB-GMM Algorithm
   4.3 Split and Eliminate Variational Bayes for Gaussian Mixture Models (SEVB-GMM) Algorithm
      4.3.1 Model stability criterion
      4.3.2 Component splitting criteria
      4.3.3 Component split operations
      4.3.4 Algorithm termination criterion
      4.3.5 Model selection criterion
   4.4 Human Mobility Pattern Application & Results
      4.4.1 Data mining & human mobility patterns
      4.4.2 Simulated results
      4.4.3 Real data results
   4.5 Discussion
   4.6 References

5 Customer Spatial Usage Behavior Profiling and Segmentation with Mixture Modeling
   5.1 Introduction
   5.2 Data & Individuals' Consumption Behavior
      5.2.1 Data
      5.2.2 Usage behavior of aggregated voice call durations and SMS counts & the segmentation stability benchmark
      5.2.3 Spatial usage behavior (or mobility patterns)
   5.3 Modeling Individuals' Spatial Usage Behavior
      5.3.1 Gaussian mixture model (GMM)
      5.3.2 The variational Bayesian (VB) method
      5.3.3 Split and eliminate variational Bayes for Gaussian mixture model (SEVB-GMM) algorithm
      5.3.4 Results, model accuracy & computational efficiency
   5.4 Profiling Individuals' Spatial Usage Behavior
      5.4.1 SEVB-GMM component characteristics
      5.4.2 Differentiating SEVB-GMM components
      5.4.3 SEVB-GMM component types
      5.4.4 Spatial usage behavioral signatures
      5.4.5 Results & spatial usage behavioral profile stability
   5.5 Spatial Usage Behavioral Segmentation
      5.5.1 The k-means (KM) algorithm & selection of number of groups
      5.5.2 Results & spatial usage behavioral segmentation stability
   5.6 Cross Validation
   5.7 Discussion
   5.8 References

6 Identifying Subspace Clusters for High Dimensional Data with Mixture Models
   6.1 Introduction
   6.2 VB-GMM Algorithm
   6.3 Subspace Clusters Identification
      6.3.1 Approximating the density of each 2D subspace with VB-GMM
      6.3.2 Detection of dense regions of each 2D subspace
      6.3.3 Estimating the associated subspace of each observation
      6.3.4 Identifying interesting associated subspaces
      6.3.5 Assigning observations to appropriate subspace clusters
   6.4 Experimental Results
      6.4.1 Sensitivity to choice of δ, the tolerance level for determining if the VB-GMM model has converged
      6.4.2 Sensitivity to choice of GMM granularity h (or k_initial)
      6.4.3 Sensitivity to choice of c2, the likelihood threshold where observations are considered to be in the dense regions
      6.4.4 Sensitivity to choice of c3, the threshold level in determining the dimension relevance to an observation
      6.4.5 Effect of data dimensionality d
   6.5 Discussion
   6.6 References

7 Conclusion
   7.1 Discussion
      7.1.1 Semi-parametric Bayesian methods & mixed membership models
      7.1.2 Spatial-temporal/longitudinal extension
   7.2 Summary of Contributions

A Review of Research Question
   A.1 Telecommunication Industry Research
   A.2 Customer/Consumer Research
      A.2.1 Customer management system
      A.2.2 Customer behaviour heterogeneity
      A.2.3 Consumer behaviour research
      A.2.4 Customer/market segmentation
   A.3 Review Conclusion & Research Proposal

B Review of Data Stream Mining
   B.1 Data Stream & Its Mining Challenges
   B.2 Synopsis Data Structure
   B.3 Review Conclusion

C Review of Clustering Time Series & Data Stream
   C.1 Time Series Representation & Clustering
   C.2 Clustering on Extracted Time Series Characteristics
   C.3 Data Stream Clustering
   C.4 Review Conclusion

Bibliography


List of Figures

1.1 Spatial usage behaviour of two subscribers over the 17-month period. Plots (a) and (c) are line plots connecting consecutive activities; (b) is a bubble plot where the bubble volume represents the total number of activities at the location.

1.2 Weekday 24-hour temporal usage behaviour (i.e., number of activities) for two subscribers over the 17-month period.

3.1 Overlapping initialization scheme.

3.2 Distribution of the number of components k with selected setups.

3.3 The results of the VB-GMM fits of the usage pattern of User A. The histogram summarizes the actual observations; (a) represents the model fit of the Partitioned and k_initial = 17 setup, (b) Overlapping and k_initial = 17 setup, (c) Partitioned and k_initial = 23 setup, (d) Overlapping and k_initial = 23 setup, (e) Partitioned and k_initial = 35 setup, and (f) Overlapping and k_initial = 35 setup.

3.4 The results of the VB-GMM fits of the usage pattern of User B. The histogram summarizes the actual observations; (a) represents the model fit of the Partitioned and k_initial = 17 setup, (b) Overlapping and k_initial = 17 setup, (c) Partitioned and k_initial = 23 setup, (d) Overlapping and k_initial = 23 setup, (e) Partitioned and k_initial = 35 setup, and (f) Overlapping and k_initial = 35 setup.

3.5 The results of the VB-GMM fits of the usage pattern of User C. The histogram summarizes the actual observations; (a) represents the model fit of the Partitioned and k_initial = 17 setup, (b) Overlapping and k_initial = 17 setup, (c) Partitioned and k_initial = 23 setup, (d) Overlapping and k_initial = 23 setup, (e) Partitioned and k_initial = 35 setup, and (f) Overlapping and k_initial = 35 setup.

3.6 Stephens' Kuiper statistic V*_n vs. n; (a) Random and k_initial = 17 setup, (b) Partitioned and k_initial = 17 setup, (c) Overlapping and k_initial = 17 setup, (d) Random and k_initial = 23 setup, (e) Partitioned and k_initial = 23 setup, (f) Overlapping and k_initial = 23 setup, (g) Random and k_initial = 35 setup, (h) Partitioned and k_initial = 35 setup, and (i) Overlapping and k_initial = 35 setup.

4.1 (a) Plot of our simulated dataset where the data points ('Actual') are marked by an 'x'. (b) The results of our SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we also show 95% probability regions (outlined by '-') for each component in the model. We can see that the data appear to be well represented by the fitted model. Note also that the resulting fit is identical for k_initial = 1 to 20.

4.2 Selected results obtained from applying the standard VB-GMM algorithm under different initialization conditions to the simulated data shown in Figure 4.1 (a). The centers of each component in the fitted mixtures are indicated by a '+'; we also show 95% probability regions ('-') for each component in the model. The computed values of F and MAEAC, and the fitted value of k in the final model are also shown. We can see that the initial choice for k and the corresponding initial component allocation does influence the final fit obtained.

4.3 (a) Observed mobility pattern of Subscriber A over a 17-month period corresponding to the recorded locations, marked by an 'x', of cell towers from which telecommunication activities were initialized. (b) The results of the SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions ('-') for each fitted component. Note that results obtained were similar for k_initial = 1 to 18 and that values of k_final, F and MAEAC corresponding to various k_initial are summarized in Table 4.3.

4.4 Selected results obtained by using various choices for k_initial in the standard VB-GMM algorithm for Subscriber A's mobility pattern shown in Figure 4.3 (a); the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions (marked by '-') for each fitted component.

4.5 Mobility patterns over a 17-month period for four subscribers are shown in the left column. Observations, 'x', are the recorded cell tower locations from which subscribers initiated a communication. Bivariate mixture models fitted using SEVB-GMM are shown in the right column; the center of each fitted mixture component is marked '+' and corresponding 95% probability regions ('-') are shown. Note that SEVB-GMM was initialized with the inappropriate choice k_initial = 1 each time, yet we are still able to model the data well. Values for F and MAEAC are also reported.

4.6 Comparisons between fits obtained from standard VB-GMM and our SEVB-GMM algorithm when using different values of k_initial ranging from 1 to 30, based on the observed data for 100 randomly selected anonymous individuals: (a) plot of the fitted k_final vs. the k_initial that was used for both algorithms, (b) value of MAEAC (Equation (4.7)) for the fits from both algorithms vs. the k_initial that was used, and (c) for the fits obtained from both the standard and SEVB algorithms, we computed the corresponding values of BIC, DIC, F and MAEAC, then plotted the % of times there was an agreement between the model that would be selected based on either the BIC, DIC or F values and the model that was selected in the SEVB algorithm using MAEAC.

5.1 Voice call duration distributions approximated by a mixture of log-normal distributions ('—'s) of two subscribers whose voice call durations have a mean of 58 seconds. (a) Subscriber 1: large number of 'message'-like calls of a very short duration. (b) Subscriber 2: call duration is more evenly distributed when compared with Subscriber 1.

5.2 Spatial behavior of four different subscribers. (a) Subscriber A: inter-capital businessperson-like pattern. (b) Subscriber B: inter-state truck driver-like pattern. (c) Subscriber C: home-office-like pattern shown in a bubble plot. (d) Subscriber D: taxi driver-like pattern shown in a bubble plot. Note that in (a) and (b) 'x's represent the actual observations and '. . .'s represent the 'virtual' path the user is likely to have taken between two consecutive actual observations. In (c) and (d), user patterns are shown in the form of bubble plots instead of scatter plots to better demonstrate that a large number of activities were initiated from the same cell tower locations; the size of the bubble represents the activity volume of the particular location.

5.3 Distributions of users' aggregated call patterns. (a) Aggregated voice call durations. (b) Aggregated SMS counts.

5.4 Mobility pattern analysis. (a) Percentage of outbound activities made from users' top five preferred locations. (b) Average of users' cumulative activity count distribution with respect to distance from their real centers.

5.5 SEVB-GMM results of the four subscribers in Figure 5.2. (a) Subscriber A. (b) Subscriber B. (c) Subscriber C. (d) Subscriber D. Note that the ellipses represent 95% probability regions for the component densities, whereas the estimated centers of these components are marked by '+'s and the actual observations are marked by 'x's. We also note that the 95% probability regions of some components (e.g., those corresponding to point masses) are not always visible because they are simply too small to be seen. The most noticeable examples are the two most weighted components in (c), which correspond to the three big bubbles (with two of them centered at a nearly identical spot) in Figure 5.2 (c).

5.6 Model accuracy of SEVB-GMM. (a) Distribution of distances between real & SEVB-GMM centers. (b) Average of users' cumulative activity count distribution with respect to distance from their SEVB-GMM centers. Note that '. . .'s refer to calculations made with respect to the SEVB-GMM model fits, whereas '—'s were calculated with respect to the actual data.

5.7 Mobility pattern analysis based on SEVB-GMM. (a) Distribution of SEVB-GMM component maximum SD σ_max for which σ_max ≤ 10 km. (b) Distribution of SEVB-GMM component weight w for which w ≤ 0.24. (c) Distribution of % of variation accounted for by the first principal components (the p1's) of the SEVB-GMM components. (d) Distribution of distances between users' daily activity boundary and their SEVB-GMM centers (note: almost identical for real centers).

5.8 Distribution of spatial usage behavior. (a) SignificantWt. (b) UrbanWt. (c) RemoteWtX2. (d) UrbanArea (1 = 30²π km²). (e) RouteDist (1 = 1000 km). (f) HomeOfficeLik.

5.9 Selected k-means clustering results # 1. (a) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt & UrbanArea. (b) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt, UrbanArea, RemoteWtX2 & RouteDist. (c) Clustering quality evaluated with respect to different g when including voice call duration & SMS counts in the setting of (b). (d) Variables' R²'s (RSQ) for the setting of (c), with voice call duration marked as D, SMS counts as S & the five spatial behavioral signatures unmarked. Note that in (a) to (c), lines marked with '•' correspond to the CH index, and the lines with the other marker correspond to the CCC; the number of groups g is generally chosen based on the local maxima shared by both the CH index and the CCC.

5.10 Cross validation results # 1. (a) Clustering quality evaluated with respect to different g for the new sample # 1 with the setting of Figure 5.9 (b). (b) Clustering quality evaluated with respect to different g for the new sample # 1 with the unsuccessful simplistic model described in §5.6. Note that lines marked with '•' correspond to the CH index, and the lines with the other marker correspond to the CCC; the number of groups g is generally chosen based on the local maxima shared by both the CH index and the CCC.

Statement of Original Authorship

I hereby declare that this submission is my own work and to the best of my knowledge it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at QUT or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by colleagues, with whom I have worked at QUT or elsewhere, during my candidature, is fully acknowledged.

I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

Signature:

Burton Wu

Date:


1 Introduction

1.1 Understanding Customer Behaviour & Human Mobility Patterns

Customers are the most important asset of any business, but customers today are more educated, sophisticated, expectant, demanding and volatile than ever (Yankelovich and Meer, 2006). "Being willing and able to change your behaviour toward an individual customer based on what the customer tells you and what else you know about the customer" (Peppers et al., 1999, p.151) is vital to business survival and success. It is also important to understand that not all customers are the same or equally profitable for business (Cooper and Kaplan, 1991). Differentiating between customers according to a detailed understanding of their needs, behaviour, profitability and value to the business is therefore crucial in enabling companies to have appropriate relationships with each individual customer (Peppers et al., 1999).

Research finds that currently there is "too little emphasis on actual [customer] behaviour" (Yankelovich and Meer, 2006, p.131). Still, the majority of existing studies focus on purchasing behaviour (e.g., buying goods such as houses, vehicles, or plasma televisions) or loyalty (i.e., churn), while little attention is paid to comprehending customers' consumption behaviour (Jacoby, 1978), i.e., behaviour such as making phone calls, accessing the Internet and using water or electricity.

More formally, consumption behaviour refers to activities which have been performed so frequently that they have become habitual and involve little decision making (Ouellette and Wood, 1998; Ajzen, 2001). It is different from purchasing behaviour (Alderson, 1957) and is more relevant to the service industries than to the retail industries. However, companies' existing understanding of individuals' consumption behaviour appears to be almost exclusively limited to discrete measures (e.g., which services customers use and the number of institutions they conduct business with), or average and aggregated measures (e.g., number of transactions per month) (c.f. Yankelovich and Meer, 2006). These measures, which make no distributional assumptions at all, are not necessarily appropriate, meaningful or adequate for describing the observed patterns (Schultz, 1995). Also, some studies (e.g., Schultz, 1995) suggest that using just a single Gaussian distribution is generally not appropriate for describing customers and their behaviour. Appendix A.2.2 reviews the issue of customer behavioural heterogeneity further.

Despite this, most studies to date do not take the observed behaviour pattern variations at a point in time into consideration, and they also do not investigate individuals' behaviour spatially or temporally. Businesses need a more sophisticated approach than any currently available to profile, as well as differentiate, the types of customer behaviour they have observed. This would increase the insights that can be used for interacting with each individual in a personalised format, and support effective and informed strategic and tactical decision making, business management and resource planning. In other words, there appears to be insufficient understanding of customers' actual consumption behaviour in the competitive and rapidly changing wireless telecommunication industry, which is the application focus of this research.

Telecommunication is both a service and a retail industry (Berry and Linoff, 2004, pp.314-315), and there is a great demand for statistical data analysis. Our research application involves the use of mobile phone traffic data to better comprehend each telecommunication subscriber's usage patterns. Such data are sometimes referred to as call detail records (CDR); their primary use is for billing subscribers (and hence there are fewer issues with accuracy and completeness when compared to data typically used for analysis). To date, detailed CDR data have not been utilised for better profiling and differentiating individual subscribers. This is somewhat surprising since CDR data are typically readily available to all established telecommunication businesses.

Specifically, the primary interest of our application of this research is to analyse individuals' seldom studied, habitual spatial and temporal usage behaviour, the latter with respect to patterns across days of the week and hours of the day. However, the key focus of the modelling is the spatial aspect of the problem, where we see unique characteristics which are discussed further in § 2.1.1. This is essentially the problem of gaining statistical insights into human mobility patterns (Gonzalez et al., 2008), for which modelling is still largely problematic. The ability to model human mobility has implications for epidemic prevention, emergency response, and urban planning, for example (Gonzalez et al., 2008). Our research in this area makes a contribution to the development of statistical methods for pattern approximation, interpretation and classification.


1.2 Telecommunication Call Data Record Dataset

The dataset consisted of a total of 95,061 wireless consumer (as opposed to business) customers, or 185,331 subscribers when including those who have defected, partially defected or newly connected. They were randomly sampled via a simple random sampling (SRS) procedure, based on a fixed percentage of the entire subscriber base. Note that the selected subscribers should not give a biased representation of the customer base, as they are not based on uncontrolled convenience samples, a technique that is commonly applied (Glymour et al., 1996; Hand, 1998). Successful outbound usage activities for 17 consecutive months, from 1st May 2006 00:00 to 30th September 2007 23:59, were recorded for these traced subscribers. Key attributes of these records are:

• subscriber identification number,

• activity initiated date and time,

• activity initiated tower location in latitude and longitude, and

• activity type and usage quantities, for example voice call duration in seconds (and count), and text message count.

There are a total of 264,782,106 observed records (i.e., usage activities), with 185,324,517 made within Australia; a schematic sketch of one such record follows below.
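For concreteness, the following is a minimal Python sketch of one such usage record; the field names and types are purely illustrative assumptions, not the operator's actual CDR schema.

    # Illustrative sketch only: field names are assumptions, not the real CDR schema.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class UsageRecord:
        subscriber_id: str        # subscriber identification number
        initiated_at: datetime    # activity initiated date and time
        tower_lat: float          # initiating cell tower latitude
        tower_lon: float          # initiating cell tower longitude
        activity_type: str        # e.g., "voice" or "sms"
        quantity: float           # e.g., voice call duration in seconds, or SMS count

    # Example: a 58-second voice call initiated on the first day of the study window
    record = UsageRecord("S000001", datetime(2006, 5, 1, 9, 30),
                         -33.87, 151.21, "voice", 58.0)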

Moreover, in contrast to other related research, popular demographic variables available about the customer, while having been shown to be useful (Verhoef and Donkers, 2001), are not investigated here. The reason is that the account holder is not necessarily the actual user, and therefore such demographic variables may be misleading. Note that such false assumptions are often made in current practice. The sole focus here is the seldom studied behavioural data, which should also be useful in predicting a customer's future behaviour (c.f. Schmittlein and Peterson, 1994). Additionally, in contrast to analysing outbound voice calls and short message service (SMS) activities separately, as is done typically, this research focuses mainly on analysing the activities together.

1.3 Motivation for Our Inferential Approach

Selected examples which illustrate the inspiration for our research application are presented here. Figure 1.1 shows the spatial usage behaviour of two anonymous subscribers over a period of time. The pattern is based on the observed cellular tower location (i.e., its longitude and latitude) where the subscriber successfully initialised an outbound activity such as a voice call or text message; the activity-initialised cell tower is typically the one geographically closest to the user, though exceptions may occur if the closest cell tower is out of service or too busy, for example. Figure 1.2 illustrates the temporal usage behaviour of two other anonymous subscribers over the same period.

Figure 1.1: Spatial usage behaviour of two subscribers over the 17-month period. (a) Subscriber A; (b) Subscriber A (zoomed in on Sydney); (c) Subscriber B. Plots (a) and (c) are line plots connecting consecutive activities; (b) is a bubble plot where the bubble volume represents the total number of activities at the location.

Looking at Figure 1.1 (a), we can see a strong suggestion that Subscriber A is a business professional who is based in Sydney, Australia, and travels interstate (i.e., to Brisbane, the Gold Coast, Cairns, and Melbourne) occasionally (about 10% of the time). It also shows that he/she is most likely to have travelled by airplane rather than by car to these interstate destinations, since no activities were recorded between Sydney and the interstate locations; Hamilton Island is a holiday destination. Figure 1.1 (b) provides a closer look at his/her spatial usage behaviour closer to 'home' (i.e., Sydney). The size of the 'bubble' in this figure represents the number of activities initialised via the cellular tower location. The somewhat 'tradesperson'-like pattern suggests that his/her profession and/or lifestyle requires visiting various parts of Sydney regularly (about 30% of the time), in contrast to many other 'home-and-office'-like subscribers who typically visit only a handful of selected locations (e.g., home and office) within their living neighbourhood. It also suggests that Subscriber A is able to make the majority (about 60%) of activities from a relatively fixed location where the two largest bubbles are located. His/her 'home', in this case, is most likely located at the intersection of two cellular towers' service areas. Figure 1.1 (c) shows the mobility pattern of another subscriber, Subscriber B, who travelled around Australia during the analysed period.

Figure 1.2: Weekday 24-hour temporal usage behaviour (i.e., number of activities) for two subscribers over the 17-month period. (a) Subscriber C; (b) Subscriber D.

Figure 1.2 (a) shows one of many subscribers, Subscriber C, for whom the majority of his/her 1,442 communications are 'restricted' to an hour-long window (i.e., between 7 and 8 pm) during the entire analysed period. Preliminary investigation indicates that this is not an isolated case; communication time 'restriction', for reasons that are not known to us, is a behavioural characteristic of this user. On the other hand, Figure 1.2 (b) shows the temporal usage pattern of Subscriber D, who was quite active around midnight when the majority of other subscribers are asleep; this 'party lifestyle'-like behaviour, shared by many other users, appears to be their distinct feature. Note that this study only focuses on weekdays; while preliminary investigation shows that an individual's temporal behaviour generally varies somewhat from Monday to Friday, weekday usage tends to be significantly different from weekend usage.

Overall, considering the above examples, individuals' time-independent spatial and temporal usage patterns appear to provide some indication of their likely lifestyle and/or occupational traits, which otherwise cannot be easily or cheaply discovered. These insights, which are a combination of the actual behavioural understanding and the inferred lifestyle/occupation, appear to be potentially valuable for businesses to enhance their relationships with customers, influence customer behaviour, 'benefit' the customers with a deeper understanding of how the products/services are used in their day-to-day lives (Fournier et al., 1998; Yankelovich and Meer, 2006), and take more appropriate actions and/or decisions (e.g., pricing structures around hours of the day). There is an obvious need to understand each customer's exhibited behaviour more comprehensively, and businesses have long argued the importance of understanding customers' lifestyle/occupation. However, many of them appear to have lost focus and have been 'actively looking' for customers' lifestyle/occupation (Yankelovich and Meer, 2006) through the only available approaches they know of, such as the rather 'unreliable' market research approach (Wolfers and Zitzewitz, 2004) and/or the use of coupon promotions (Stone et al., 2004, p.114). Appendix A.2.3 provides a detailed review of the current 'strategies' used in practice and discusses the reliability issues associated with market research.

The research challenges/objectives are therefore to first find a reliable and practical statistical way to approximate these clearly overlooked individual habitual consumption behaviour patterns, many of which are extremely heterogeneous (discussed further in § 2.1.1, and more generally in Appendix A.2.2), and then to profile these observed patterns meaningfully. Efficient and effective techniques are evidently required, as it is clearly impractical to attempt to interpret each pattern subjectively, visually and manually. Of course, the proposed methods must also be transparent for interpretation. While it appears that our proposed strategy can provide businesses with a more sophisticated description and understanding of each (existing) individual customer with respect to his/her actual consumption behaviour, profiling and differentiating customers based on these alternative ideas also needs to be examined and compared to an existing benchmark. Chapter 2 reviews various existing, potentially suitable techniques for our problem, while Appendix A provides a more in-depth review of the business-related aspects of our research question; in particular, Appendix A.2.4 reviews how customer segmentation is typically performed today.

Note that in reality, one wireless customer can have multiple wireless subscriptions. Here the focus is on the subscriptions rather than the customers. However, the extracted subscriber knowledge will still be insightful for those who have multiple subscriptions.

1.4 The Role of Statistical Data Analysis

Analysis of this kind is often referred to as data mining (DM), or data mining and knowledge discovery from data (DMKDD); it refers to the process of extracting non-trivial, previously unknown but useful information hidden in large datasets in a bottom-up fashion (Fayyad et al., 1996). It is an interdisciplinary field of study which integrates statistics, database technology, machine learning, and pattern recognition, for example (Hand, 1998). It has attracted serious attention in recent years, particularly in industry, as a result of the explosive amount of data available nowadays and the imminent need to turn it into competitive advantage, i.e., knowledge (Han and Kamber, 2006).

A common misconception is that data mining is an "automatic or semi-automatic" (Berry and Linoff, 2004, p.7) black-box product; while one might place one's 'faith' in it finding complete solutions, this often turns out to be a disaster (Dasu and Johnson, 2003, p.2). That is, while it is essential to have some sort of computer program (i.e., for automation) when dealing with large volumes of data, making use of domain knowledge and involving the researcher in the inference process is fundamental (Hand, 1998), but often ignored. Additionally, most off-the-shelf packages generally offer only restricted functionality covering a very limited set of standard or textbook techniques that will not suit all projects, such as ours.

Large volumes of CDR data pose great challenges to analysts seeking useful subscriber knowledge. One of the essential first steps for mining them is therefore to summarise/approximate the data efficiently with significantly less required space (Han and Kamber, 2006). While it is important to be able to process the data efficiently, or even in a parallel/distributed fashion, which is also known as high performance computing (Park and Kargupta, 2003), it is also important to emphasise the statistical learning aspects (rather than today's common emphasis on computing and databases (c.f. Appendix A.2.1), for example) (Hand, 1998). Large volumes of data records and variables mean that traditional processes for statistical inference are likely to be inappropriate (Hand, 1998).

Our research focus here is on designing transparent methodologies. Predictive accuracy improvements from black-box operations, if any (Glymour et al., 1996), should not override the interpretability goals for both models and results that are critical to the business (e.g., the ability to include experts' insights and opinions). In fact, modelling individuals' mobility patterns with nonparametric approaches such as the kernel method, for example, would not be able to provide the interpretations and the data summarisation needed for this research. As indicated earlier, various techniques were considered for this research problem before we chose our approach based on the variational Bayesian (VB) method for Gaussian mixture models (GMMs).
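As a concrete illustration of the flavour of this approach (though not the implementation developed in this thesis), scikit-learn's BayesianGaussianMixture provides a comparable variational treatment of a GMM. Deliberately over-specifying the number of components leaves most of them with negligible weight, the component elimination behaviour explored in Chapter 3.

    # Illustration only: the thesis develops its own VB-GMM algorithms; this uses
    # scikit-learn's variational GMM to show the flavour of a VB fit.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    # Toy 2-D "spiky" data: two tight clusters plus sparse background noise
    X = np.vstack([rng.normal([0, 0], 0.05, size=(300, 2)),
                   rng.normal([3, 3], 0.05, size=(200, 2)),
                   rng.uniform(-2, 5, size=(20, 2))])

    # Deliberately over-specify k; the VB fit drives redundant weights towards zero
    vb = BayesianGaussianMixture(n_components=10, covariance_type="full",
                                 max_iter=500, random_state=0).fit(X)
    print(np.round(vb.weights_, 3))   # most components receive negligible weight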

1.5 Thesis Overview

So far, we have demonstrated the application value of this proposed customer consumption behaviour research, in which our primary focus is analysing users' spatial usage behaviour, although we also investigate temporal and high dimensional features. We emphasise that, to the best of our knowledge, this is the first research which aims to profile each mobile phone subscriber's overall spatial usage behaviour automatically, effectively and meaningfully, as well as to differentiate general users based on their actual observed mobility patterns (outlined in Chapters 4 and 5). In essence, our proposed strategy for this spatial objective involves a two-stage clustering process (sketched in code after the list below):

• we first cluster/model each individual's spatial usage pattern, from which his/her unique behavioural characteristics are extracted;

• customers are then clustered/segmented based on these extracted features.
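A schematic sketch of this two-stage process, under the simplifying assumptions that each user is represented by an array of activity coordinates and that only two illustrative features are extracted, might look as follows; the actual features and algorithms are developed in Chapters 4 and 5.

    # Schematic sketch only; the feature definitions here are illustrative placeholders.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture
    from sklearn.cluster import KMeans

    def spatial_features(X, k_max=15):
        """Stage 1: fit a VB-GMM to one user's activity locations, extract features."""
        gmm = BayesianGaussianMixture(n_components=k_max, random_state=0).fit(X)
        keep = gmm.weights_ > 0.01                  # effectively retained components
        return [keep.sum(),                         # number of significant components
                gmm.weights_[keep].max()]           # weight of the dominant location

    # Thirty synthetic "users", each an n-by-2 array of activity coordinates
    users = [np.random.default_rng(i).normal(0, 1, (200, 2)) for i in range(30)]
    F = np.array([spatial_features(X) for X in users])   # one feature row per user

    segments = KMeans(n_clusters=3, n_init=10).fit_predict(F)   # Stage 2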

As we will discuss later, our data are difficult to model efficiently and effectively using many existing standard methods, as they are too restrictive, lack the flexibility required and/or are not scalable. This is largely the result of our spatial data being highly heterogeneous and spiky; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. The nature of our data will be discussed further in Chapter 2.

To undertake this research, a total of three new statistical algorithms have been developed. They are designed, respectively, to:

• model one-dimensional circular data, i.e., individuals' temporal usage behaviour (Chapter 3),

• model two-dimensional heterogeneous spiky patterns with weak prior information, i.e., individuals' spatial usage behaviour (Chapter 4), and

• cluster high dimensional data (Chapter 6) in a way that is more useful for segmenting customers with a large number of (behavioural) attributes.

These algorithms are all based on the variational Bayesian (VB) method and Gaussian mixture models (GMMs). We briefly introduce these methods and our algorithms in the chapter overview below.

The main text of the thesis is separated into seven chapters, of which this is the first.

In Chapter 2, we review the literature related to the research problem in detail to facilitate comprehension of the research requirements for this project. This is followed by detailed methodology reviews on mixture models and clustering. Consideration of these thorough reviews led to the adoption of the recently popular VB method together with GMMs as the foundational techniques of the study. That is, the fast, non-sampling-based VB method, an alternative to Markov chain Monte Carlo (MCMC)-based methods, and GMMs will be utilised for approximating individuals' consumption behaviour and to facilitate the extraction of selected customer-centric behavioural characteristics. GMMs are one of the most popular and flexible approaches for modelling more complicated patterns.

Chapters 3 to 6 are written in the form of papers.

Chapter 3 focuses on modelling individuals' 24-hour activity patterns (c.f. one-dimensional circular data). We begin by exploring VB's unique component elimination property in more detail, and evaluating its modelling effectiveness and robustness. The empirical results appear promising, and we highlight a potential implication of the VB elimination property which is often overlooked. A new VB-GMM algorithm is also presented that is suitable for modelling circular data; its effectiveness is evaluated and demonstrated with Stephens' Kuiper statistics.
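To give a flavour of the circular-data issue, the sketch below uses a common padding heuristic, replicating the 24-hour sample shifted by ±24 hours so that a linear GMM can place a component across the midnight boundary. This is only an assumed stand-in for the padding and initialization schemes (Random, Partitioned and Overlapping) that Chapter 3 actually develops and compares.

    # Assumed padding heuristic for hour-of-day data; not the thesis's exact scheme.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    hours = rng.normal(23.5, 0.5, 500) % 24            # activity peak straddles midnight
    padded = np.concatenate([hours - 24, hours, hours + 24]).reshape(-1, 1)

    vb = BayesianGaussianMixture(n_components=9, random_state=0).fit(padded)
    central = (vb.means_.ravel() >= 0) & (vb.means_.ravel() < 24)
    print(np.round(vb.means_.ravel()[central], 2))     # components on the circle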

Chapters 4 and 5 are the heart of the thesis; they focus primarily on the subscribers' spatial usage behaviour. In Chapter 4, a new VB algorithm, called split and eliminate variational Bayes (SEVB), is developed. This new algorithm is more suitable for modelling large numbers of highly heterogeneous spatial patterns as GMMs with weak prior information. It introduces and makes use of several novel concepts in areas such as component splitting, and proposes a new model evaluation criterion. Empirical results suggest that our SEVB-GMM algorithm is effective and robust to different initialisation settings, including the initial choice of the number of components in the model as well as various initial allocations of observations to components. This chapter also examines the instability of many existing log-likelihood (LL)-based model selection criteria, which makes them unsuitable for this application, whereas our proposed alternative model evaluation measure appears able to provide consistent and reliable model selection. Chapter 5 adapts this new algorithm to our real-world dataset, and focuses on interpreting the patterns to gain useful customer insights. It also investigates the stability and differentiability of users' spatial usage behaviour. Empirical results reveal that users' spatial usage behaviour profiles are more stable than the currently popular approach, which involves the ordered partitioning of customers based on current benchmark measures such as aggregated voice call durations.


Chapter 6 develops a new VB-GMM based clustering algorithm capable of finding subspace clusters in high dimensional data; the key notion of subspace clusters is that not all attributes are relevant to all clusters. This algorithm aims to address the potential cluster quality issues of many current algorithms. Besides demonstrating the portability of VB to many existing data mining algorithms (which make use of, for example, histograms as the density estimation tool), the rationale of this chapter is that clustering algorithms for high dimensional data should be considered for customer behaviour segmentation, since the number of subscriber behavioural characteristics that we are interested in and extract from the database is likely to increase over time. Empirical results suggest that this algorithm is capable of identifying subspace clusters with very low intrinsic dimensionality in settings that would be considered challenging for many existing clustering algorithms.

The thesis concludes in Chapter 7 where our contributions are summarised. We also

discuss several future research directions that would be valuable from both statistical

and application perspectives.


2 Literature Review

2.1 Introduction

2.1.1 Research problem & methodology requirements

In order to understand the nature of the challenges we faced in this research and

to set out appropriate methodology requirements, it is important to review the lit-

erature on consumption behaviour studies within the telecommunication industry,

human mobility pattern modelling, and the nature of call detail records (CDR).

Existing Consumption Behavioural Understanding & Segmentation

As we briefly mentioned earlier in § 1.3, customers' consumption behaviour is typically evaluated based on discrete, or average and aggregated, measures. In relation to segmentation (i.e., differentiating customers based on certain characteristics), customers are commonly partitioned, for example, into several quantile groups based on the RFM model; RFM stands for recency (i.e., time elapsed since the last activity), (average) frequency and (aggregated) monetary value, all over a predefined time period (Stone et al., 2004, pp.111-134); a minimal sketch of such quantile partitioning follows the list below. The telecommunication industry is no exception; however, measures used which somewhat deviate from RFM include:

• whether certain features have been used by a customer,

• the total number of distinct receivers contacted by the customer over a speci-

fied period (Wei and Chiu, 2002),

• fraction of incomplete calls,

• call number ‘birthday’ i.e., the first day that a call was made to the number

(Cortes and Pregibon, 1999),

• the top regions where calls were made to and from, and

• the top cell tower locations where calls were made from (Cortes et al., 2000).
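To make the RFM-style quantile partitioning above concrete, the following is a minimal sketch in Python (pandas); the column names and the three-quantile split are illustrative assumptions, not taken from any system described here.

```python
import pandas as pd

# Hypothetical transaction-level data: one row per customer activity.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 3],
    "days_since":  [2, 30, 5, 90, 1, 3, 7],          # days since each activity
    "revenue":     [1.2, 0.5, 3.0, 0.8, 2.2, 1.1, 0.7],
})

# Recency, frequency and monetary value per customer.
rfm = df.groupby("customer_id").agg(
    recency=("days_since", "min"),      # time elapsed since last activity
    frequency=("customer_id", "size"),  # number of activities in the period
    monetary=("revenue", "sum"),        # aggregated spend over the period
)

# Ordered partition into quantile groups (here terciles) per measure.
for col in ["recency", "frequency", "monetary"]:
    rfm[col + "_group"] = pd.qcut(rfm[col], q=3, labels=False, duplicates="drop")

print(rfm)
```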


Besides evaluating the churn (i.e., customer defection) likelihood of the customer (e.g., Wei and Chiu, 2002) (c.f. Appendix A.1), which should not be considered as consumption behaviour, individuals' usage patterns have not been examined in detail; papers such as Cortes and Pregibon (1998), Cortes and Pregibon (1999) and Cortes et al. (2000) from AT&T can be considered the exceptions, though their focus is mostly on mining data streams (which we discuss briefly below and more fully in Appendix B) and identifying fraudulent accounts.

For each phone number, AT&T modelled the voice call distribution with 24 bins (c.f. 24 hours) and the voice call duration distribution with 12 logarithmically spaced bins; a sketch of this style of binning appears below. The degree of 'business'-likeness of the phone number was evaluated based on whether the majority of the calls were made during weekday office hours (excluding lunch time) and whether they were mainly shorter calls.
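A minimal sketch of such binning, assuming call start hours and durations in seconds as inputs (the variable names and synthetic data are illustrative, not AT&T's):

```python
import numpy as np

rng = np.random.default_rng(0)
start_hours = rng.integers(0, 24, size=500)               # hour of day of each call
durations = rng.lognormal(mean=4.0, sigma=1.0, size=500)  # call durations (seconds)

# 24 hourly bins for the time-of-day distribution.
hour_counts = np.bincount(start_hours, minlength=24)
hour_dist = hour_counts / hour_counts.sum()

# 12 logarithmically spaced bins for the duration distribution.
edges = np.logspace(np.log10(durations.min()), np.log10(durations.max()), 13)
dur_counts, _ = np.histogram(durations, bins=edges)
dur_dist = dur_counts / dur_counts.sum()

# Crude 'business-likeness' signal: share of calls in office hours,
# excluding a lunch hour (12-13).
office_share = hour_dist[9:12].sum() + hour_dist[13:17].sum()
print(office_share, dur_dist)
```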

However, even with their work on this topic, valuable call detail records (CDR) are still not being fully utilised for understanding individuals' actual behaviour; we can see this overlooked potential when considering the spatial usage behaviour illustrated in § 1.3 and thus the profiling/segmentation approach proposed in this research. Note that analysing CDR has already been shown to be useful, for example, for promoting services to groups of customers via association and sequential patterns (Han and Kamber, 2006, p.653), identifying the best time to contact customers (Berry and Linoff, 2000, p.394), and understanding the relationship among pricing, voice call counts, average voice call duration, household demographics and revenue at a higher than individual level (Train et al., 1987; Heitfield and Levy, 2001).

Businesses often impose an 'actionable' requirement (Wedel and Kamakura, 1998, pp.328-329) of having small and balanced customer groups (Ghosh and Strehl, 2004); this single view segmentation is often achieved by partitioning the customer base based only on predefined attributes reflecting so-called experts' opinions (Berkhin, 2006). Such a process completely ignores the reality of heterogeneous customer behaviour (Smith et al., 2000). This commonly used customer segmentation approach, which is in fact contrary to the idea of 'actionable' one-to-one/relationship marketing, cannot explain behaviour in detail and cannot be used to comprehend the needs of the customer; additionally, the resulting segments may not be easy for market specialists to relate to and act on. In fact, 50 homogeneous customer groups was believed to be the 'optimal' solution when segmenting car insurance renewal behaviour (Smith et al., 2000), while a total of 66 marketing segments are identified in Claritas's segmentation system, PRIZM, which is based on market research on demographic traits, lifestyle preferences and consumer behaviour in the USA (Claritas Inc., 2008).


That is, rather than analysing the entire customer behaviour as a 'whole', which would result in a large number of customer groups and is thus not necessarily useful or actionable for businesses due to its highly heterogeneous nature, it is preferable to have "different segments for different purposes" (Yankelovich and Meer, 2006, p.125) and not to predetermine the number of segments required for each purpose. This implies that customers should be able to be classified into several different groups at the same time, such that detailed understanding of different aspects of customer behaviour can be preserved; the algorithms utilised need to be able to automatically determine the complexity of the models. As indicated in § 1.3, our proposed strategy may provide a complementary/innovative view of the customers based on their spatial (and temporal) usage behaviour; its feasibility and potential merit (e.g., differentiability and stability) will need to be measured against the most common existing approach, for example, naively partitioning the customers based on aggregated call volumes.

Furthermore, the number of subscriber behavioural characteristics (what we call 'signatures') that we are interested in and that are extracted from the database is likely to increase over time, even when analyses have the same purpose (e.g., analysing spatial usage behaviour). This means that the chosen scalable algorithm utilised for segmenting customers also needs to be suitable for high dimensional data. This is a problematic issue and requires thorough investigation; detailed explanations and reviews are presented in § 2.3.4. Note that Appendix C reviews a related subject: clustering algorithms specifically for time series or data streams. However, it appears preferable, at least given the nature of this research, to cluster time series in the typical way, i.e., capturing each series' longitudinal characteristics and then clustering the series based on these extracted characteristics with an algorithm suitable for high dimensional data. Note that the investigation of longitudinal aspects of subscriber behavioural changes is outside the scope of this research because our data is limited to only 17 months of records.

Spatial Usage Behaviour (or Human Mobility Patterns)

One of the most important, unique and ignored features of CDR is its spatial information (Han and Kamber, 2006, p.653). Engineering/network focused commercial warehouses such as CDRInsight (LGR Telecommunications, 2008a) and CDRLive (LGR Telecommunications, 2008b) are perhaps the stand-out exceptions. While their primary focus is mostly on monitoring call drop-outs (with respect to the cell towers) (Intel Corporation, 2002; Ajala, 2005, 2006), they are able to identify subscribers geographically (i.e., subscribers who have utilised certain cell towers) for marketing/strategy purposes, and are able to display each subscriber's usage histories efficiently.


On the other hand, from the more 'scientific' viewpoint, individual human mobility patterns (or spatial usage behaviour) have recently been studied (Gonzalez et al., 2008) and results suggest that human trajectories have a high degree of spatial (as well as temporal) regularity, as we observed/illustrated in § 1.3. That is, individuals typically spend the majority of their time in their most highly preferred locations, and occasionally visit other 'isolated' places, varying widely in the range of distances outside of their usual activity areas (Gonzalez et al., 2008).

Statistically, this implies that users' repeated spatial usage behaviour is not only heterogeneous (both between and within users), but also spiky; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. These important features of human mobility patterns have been largely ignored, as is evident when we consider that previous modelling approaches were based on Lévy flights and Markov-based models (e.g., Brockmann et al., 2006). To further complicate the problem, CDR typically only records the location of the cell tower from which the activity was initialised, i.e., the exact location of the subscriber is not known. This lack of jitter creates an additional data singularity issue; jitter refers to small jumpy movements. Moreover, the 'home' location for each subscriber is different and is not known in advance. Overall, these unique features of individuals' spatial usage behaviour can be problematic when applying standard algorithms (c.f. Nurmi and Koolwaaij, 2006) and when trying to interpret/segment the patterns meaningfully. We note that there appears to be no previous attempt to model each individual's overall mobility pattern and differentiate their spatial usage behaviour in the way that we propose here.

Call Detail Records (CDR) & Data Stream

The distinct features of these research data are the massive volume of observations and their sequential characteristics. This emerging form of data is known as data stream (or stream data), and it is also not necessarily ordered in the desired way. For example, telecommunication transactions are recorded in a sequential manner, but a transaction is not recorded until it has ended. With today's technology, one can make use of several service connections at the same time, but the data sequence recorded may be ordered by transaction completion time; i.e., an Internet connection commenced prior to making a phone call, where the call completes before the Internet session ends, may appear after the phone call in the record sequence.

Data stream was first defined in 1998 to refer to this type of data that is often real time and generated by a continuous process, growing rapidly at an 'unlimited' rate (Henzinger et al., 1998; Muthukrishnan, 2005). Much modern data shares these unique characteristics, for example records on banking, credit card, shopping and financial market transactions, Internet clickstream records, weather measurements, and sensor monitoring observations (Babcock et al., 2002a).

Analysing such data (or analysing under such a data environment scenario) usually requires one to:

• process it in a single pass (i.e., only have one look at each data record); and

• approximate it with acceptable levels of accuracy but within stricter time and space requirements.

It is also often necessary to be adaptive to the non-stationary nature of the data as it may evolve over time (Aggarwal, 2007b). This is because a data stream may not be stored entirely on disk or in memory, and it may not be possible to access it randomly (Babcock et al., 2002a). More traditional data analytic approaches focus on examining the same data throughout the study, learning from data held within bounded memory, and they generally require multiple scans of the data. In contrast, a data stream is typically continuously updated throughout the analysis, meaning that having advance knowledge of the input data size, for example, is generally not possible and not particularly helpful (Babcock et al., 2002a). In other words, traditional tactics in analysing data, such as finding the median value of a dataset, which requires first knowing the size of the input data and then determining the median after the computationally heavy step of ordering the data, are no longer practical or efficient. This is an exciting aspect of the research problem.
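As a small illustration of the single-pass constraint, running statistics such as the mean and variance can be maintained in one pass with constant memory (Welford's online algorithm); this sketch is illustrative only and is not drawn from the stream literature reviewed here.

```python
# Welford's online algorithm: mean and variance in a single pass,
# using O(1) memory regardless of how long the stream runs.
def running_stats(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

print(running_stats(iter([3.0, 1.0, 4.0, 1.0, 5.0])))
```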

Most of the research in this field comes from research laboratories within companies such as AT&T, Bell, Google, IBM, and Microsoft that have a management system or database as the primary focus (Muthukrishnan, 2005). However, after reviewing their research in detail (see Appendix B), it appears that their techniques are typically based on the use of histograms (e.g., Muthukrishnan and Strauss, 2004) (of which many are based on the use of wavelet transformations (c.f. Donoho et al., 1996)) or the concept of averages (e.g., Aggarwal et al., 2003), or they produce output in a format from which distribution and/or seasonality/periodicity information cannot be easily obtained (Littau and Boley, 2006b), for example. While the above statement overlooks what has been achieved with these approaches, it does imply that their analytical techniques do not appear to be appropriate for extracting the versatile behavioural characteristics we seek in this research.

That is, despite the fact that CDR is a data stream, this research will not proceed in that direction, i.e., CDR will be treated as typical data; this should be acceptable since CDR for analytical purposes is typically updated periodically rather than in real time (Chaudhuri et al., 2001; Ganti et al., 2002). It is worth pointing out, though, that purely from the viewpoint of density approximation, i.e., not interpreting the patterns, the following algorithms appear useful: one-dimensional histograms (Guha and Harb, 2006), two-dimensional histograms (without assuming variables are independent of each other) (Thaper et al., 2002), and kernel estimation (Heinz and Seeger, 2008); many recent papers (e.g., Bhaduri et al., 2007; Parthasarathy et al., 2007; Feldman et al., 2008) also focus on analysing data streams in a parallel/distributed fashion.

2.1.2 Research methodology overview

It is clear by now that this research needs to utilise scalable, space efficient and transparent algorithms to approximate the patterns in one- (i.e., temporal) and two- (i.e., spatial) dimensional spaces, and to differentiate the patterns in high dimensional space without predetermining the model complexity. Analysis of this kind can be more formally described as combinatorial data analysis (CDA); the focus of CDA is the sensible arrangement of objects for which useful data are available (Arabie and Hubert, 1996). Mixture modelling and clustering are the two main approaches (Arabie and Hubert, 1996) used for this. Mixture models aim to model an unknown density as a combination of typically parametric functions, though the main difficulty encountered is that the parametric form for the density is not known (Scott and Sain, 2005). Clustering is perhaps the most commonly applied form of CDA (Arabie and Hubert, 1996, p.8) and it aims to identify homogeneous groups of objects in a nonparametric manner. Mixture models are reviewed in the following section while clustering is reviewed in § 2.3.

While clustering algorithms (e.g., Ester et al., 1996) have been applied before to modelling some aspects of individuals' spatial usage behaviour (e.g., Nurmi and Koolwaaij, 2006), such an approach appears to lack the interpretability needed to profile each subscriber; although clustering is still preferable to histograms from the viewpoint of density approximation due to the heterogeneous and spiky nature of human mobility patterns that we discussed above. Mixture models, on the other hand, appear to be more suitable for this particular task; we shall have to evaluate their effectiveness against the clustering approach. Both mixture models and clustering have been considered before for use in customer segmentation (Wedel and Kamakura, 1998, Chapters 5 and 6), though this was without the assumption of the data being high dimensional.


2.2 Mixture Models

2.2.1 Introduction

Patterns in data can be statistically represented with a convex combination of den-

sity distributions resulting in what is known as mixture models (Newcomb, 1886;

Pearson, 1894; Everitt and Hand, 1981, pp.1-2). They are flexible and attractive in that

they do not assume the overall shape of the distribution. At the same time, they are

known to be suitable for representing any distribution (c.f. Marron and Wand, 1992;

Priebe, 1994) as in the case of nonparametric approaches (Scott and Sain, 2005). This

is despite the fact that they often model subpopulations of the observed data para-

metrically (e.g., using a Gaussian distribution) (c.f. § 2.2.2).

Mixture models have been used extensively in various applications (McLachlan and

Peel, 2000) including customer segmentation (Wedel and Kamakura, 1998, Chapter

6); their usefulness in the context of clustering and discriminant analysis is a result

of the fact that they are able to represent a particular subpopulation in the data as a

mixture component (Banfield and Raftery, 1993; Fraley and Raftery, 2002). They can

also be useful for detecting outliers (Aitkin and Wilson, 1980; Wang et al., 1997a), an

important task in data mining (Madigan and Ridgeway, 2003).

Mixture models are now often considered as incomplete/missing data/variable

models (Everitt and Hand, 1981, pp.5-7) since the subpopulation identifier of the

observations is generally not known in practice (Marin and Robert, 2007, Chapter 6).

In fact, even though mixture models have a long history, fitting them was problem-

atic until the introduction of expectation-maximisation (EM) algorithm (Dempster

et al., 1977) which framed the models as a missing data problem (Scott and Sain,

2005). The objective of mixture models is to ‘unmix’ the distributions (Wedel and

Kamakura, 1998, p.73), and hence to estimate how many groups there are and the

distribution setting for each group. They are often used in a way such that recon-

ciling the missing actual group membership information is not of interest (Marin

and Robert, 2007, p.149), and one of the advantages is enabling statistical inference

(Wedel and Kamakura, 1998, p.76).

Mixtures of a finite number of parametric distributions have been shown to provide a

computationally convenient and flexible approach for modelling more complicated

distributions (McLachlan and Peel, 2000, pp.1-4), making them suitable for representing individuals' heterogeneous behavioural patterns. The overall finite parametric mixture density with k (∈ N) components for data x = (x1, ..., xn) can be expressed as:

f(x) = \sum_{j=1}^{k} w_j f_j(x \mid \theta_j),

where w_j is the mixing weight associated with the jth component, the weights {w_j} must satisfy 0 ≤ w_j and \sum_{j=1}^{k} w_j = 1, and f_j(·) denotes a parametric density distribution with θ_j representing the unknown component parameters (Everitt and Hand, 1981, pp.4-7).

2.2.2 Gaussian mixture model (GMM)

One of the most popular mixture models is the Gaussian mixture model (GMM); GMMs have been applied extensively (McLachlan and Peel, 2000), mainly because the theory for Gaussian distributions is well understood. The overall density of a GMM can be expressed as:

f(x) = \sum_{j=1}^{k} w_j N(x; \mu_j, T_j^{-1}),

where μ_j and T_j^{-1} represent the mean and (co-)variance parameters of the corresponding underlying Gaussian distribution N(·) (Everitt and Hand, 1981, p.25). From the viewpoint of clustering, T_j^{-1} controls the geometric features of the cluster, i.e., its shape, volume and orientation (Fraley and Raftery, 2002).
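As a small illustration (ours, not from the literature cited above), the density can be evaluated directly; this sketch assumes a one-dimensional, two-component GMM with made-up weights and parameters.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-component 1-D GMM: weights, means, precisions T_j.
w = np.array([0.7, 0.3])
mu = np.array([0.0, 5.0])
T = np.array([1.0, 0.25])  # precision; the variance is T^{-1}

def gmm_density(x):
    # f(x) = sum_j w_j N(x; mu_j, T_j^{-1})
    return sum(w[j] * norm.pdf(x, loc=mu[j], scale=T[j] ** -0.5)
               for j in range(len(w)))

print(gmm_density(np.array([0.0, 2.5, 5.0])))
```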

The main problem here, and more generally for mixture models, is the estimation of the parameters k, w, and θ = (μ, T^{-1}). However, parameter formulae typically cannot be written down explicitly (Titterington et al., 1985, p.ix), and the value of k, despite its importance (Titterington et al., 1985, p.148), is usually not known in advance (McLachlan and Peel, 2000, p.4); the challenges generally increase with increasing dimensionality of the data (Scott and Sain, 2005; Jain and Dubes, 1988, p.118). This means that the solutions generally cannot be obtained analytically, and many estimation techniques have been considered (c.f. Titterington et al., 1985, Chapter 4). We briefly review some below, and we note that these methods extend beyond mixture model scenarios.

2.2.3 Classical/frequentist techniques (for GMMs)

In the past, the main method for parameter estimation was the method of moments based on Pearson (1894). However, the introduction of the EM algorithm (Dempster et al., 1977), which simplified the problem considerably by instead interpreting the observations as incomplete data (McLachlan and Peel, 2000, p.4), combined with computational resources becoming more widely available, has made maximum likelihood estimation (MLE) popular. MLE involves maximising the likelihood to estimate the parameter values best suited to the data; for mixtures these estimates are not available in closed form. The likelihood of finite mixtures can also be maximised via numerical optimisation routines such as the Newton-Raphson (NR) method (McHugh, 1956).

Expectation-Maximisation (EM) Algorithm

The EM algorithm starts with initial guesses for the parameters of interest; making these guesses may involve some preliminary clustering of the data (Scott and Sain, 2005). It is iterative in nature, and there are two steps in each iteration for a given fixed number of components k (Dempster et al., 1977; McLachlan and Krishnan, 2008). They are:

• Expectation (E-step): This is the first step at each iteration; it calculates the

expected value of the complete data log-likelihood (i.e., replaces the missing

values with their conditional expectation) given the observed data and the pro-

visional parameter estimates;

• Maximisation (M-step): This is the second step at each iteration; it maximises

the expectation using both the observed data and the predictions found in the

previous step.

The algorithm iterates between these two steps until a converged solution is reached.
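A minimal sketch of EM for a one-dimensional GMM with fixed k, assuming synthetic data and simple initial guesses (illustrative only, not the exact scheme used elsewhere in this thesis):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])
k, n = 2, len(x)

# Initial guesses (could come from a preliminary clustering).
w = np.full(k, 1.0 / k)
mu = rng.choice(x, size=k, replace=False)
sd = np.full(k, x.std())

for _ in range(200):
    # E-step: responsibilities = expected component memberships.
    dens = np.stack([w[j] * norm.pdf(x, mu[j], sd[j]) for j in range(k)])
    resp = dens / dens.sum(axis=0)

    # M-step: maximise the expected complete-data log-likelihood.
    nj = resp.sum(axis=1)
    w = nj / n
    mu = (resp @ x) / nj
    sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nj)

print(w, mu, sd)
```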

In theory (Boyles, 1983; Wu, 1983), though not always in practice (Fraley and Raftery, 2002), the solution found by EM should be at least a local maximum, as the likelihood increases at each iteration, i.e., it converges monotonically. This means that, in order for EM to avoid finding a suboptimal solution, it is essential to have good initial parameter values (Biernacki et al., 2003; Karlis and Xekalaki, 2003) and stopping criteria (Wedel and Kamakura, 1998, p.87). Thus, many strategies have been proposed for better initialisation (e.g., McLachlan and Peel, 2000, pp.54-60), and many modifications to the standard EM algorithm have been made which attempt to guide the algorithm towards the global maximum (e.g., McLachlan and Krishnan, 2008); for example, the stochastic EM (SEM) algorithm (e.g., Celeux and Diebolt, 1985; Celeux et al., 1996), and the split and merge EM (SMEM) algorithm (Ueda et al., 2000).

Scalability-wise, EM may be slower than the direct numerical optimisation methods mentioned above (c.f. Titterington et al., 1985, pp.90-91), since the convergence is quadratic in the number of parameters (Wedel and Kamakura, 1998, p.87). Thus, it may be a good idea to adopt a hybrid approach, i.e., use EM initially for its good global convergence property but then switch to a Newton-type method for its rapid local convergence (McLachlan and Peel, 2000, pp.70-75). Note that there have also been many recent developments (e.g., McLachlan and Peel, 2000, Chapter 12) making EM more suitable for large datasets such as ours.

However, above all the previously described limitations (e.g., Wedel and Kamakura, 1998, pp.87-92), the key weakness of EM for mixture modelling is the need to predefine the number of components k (Cheung, 2005), the central issue in mixture modelling; this is also an issue for many clustering algorithms (to be reviewed in § 2.3). Consequently, there has been a great amount of research into better determining the choice of k (Scott and Sain, 2005). For example, this can be done more traditionally by using informal graphical techniques or formal hypothesis testing techniques (Everitt and Hand, 1981, pp.22,30-57,108-118).

Today, selecting models with respect to k is often achieved by comparing solutions using complexity criteria (McLachlan and Peel, 2000, Chapter 6) such as Akaike's information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), the Laplace empirical criterion (LEC), and minimum message length (MML) (Wallace and Dowe, 1994; Wallace and Freeman, 1987; Wallace and Dowe, 2000); a small illustration of such a comparison is given below. We note that numerical experiments have shown that measures such as BIC and MML, for example, lead to similar results (Roberts et al., 1998; Biernacki et al., 2000). Sometimes computationally expensive approaches (Corduneanu and Bishop, 2001) such as the bootstrap (Efron and Tibshirani, 1993), cross-validation, which is also wasteful of valuable data (Smyth, 2000), or statistical tests (Har-even and Brailovsky, 1995; Polymenis and Titterington, 1998) may also be used.
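For instance, a routine comparison over candidate values of k might look like the following sketch, here using scikit-learn's GaussianMixture purely for convenience (any ML fitting routine would do):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0, 1, (300, 1)),
                    rng.normal(5, 1.5, (200, 1))])

# Fit GMMs for a range of k and compare penalised-likelihood criteria.
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    print(k, round(gm.aic(X), 1), round(gm.bic(X), 1))
# The k minimising BIC is the usual choice; AIC tends to pick a larger k.
```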

Nonetheless, the central issue with EM, and more generally with maximum likelihood (ML) approaches, is that they tend to favour models of ever increasing complexity (Svensen and Bishop, 2005; Archambeau and Verleysen, 2007; McLachlan and Peel, 2000, p.41). That is, having more components, i.e., a larger k, nearly always translates to a 'better' model within the ML framework, which is generally not appropriate (Yamazaki and Watanabe, 2003). The model selection criteria mentioned above do aim to penalise over-complicated models and find a balance between data fit and model complexity. However, these measures can be misleading if the 'training' sample size is small (Watanabe et al., 2002); AIC is known to choose models with too many components, while certain regularity conditions for BIC do not hold in the mixture model case despite BIC's increasing popularity (Aitkin and Rubin, 1985; Biernacki et al., 2000; Scott and Sain, 2005).

In other words, besides the need to predefine k, EM has several inherent problems, such as sensitivity to the starting parameter values and possible singular/suboptimal solutions (Jain and Dubes, 1988, p.118), and it is not suitable for estimating very large numbers of components (Fraley and Raftery, 1998). With few exceptions (e.g., Bradley et al., 2000; Ordonez and Omiecinski, 2005), like most fuzzy-, nearest neighbours-, kernel-, optimisation-, or neural network-based clustering techniques (c.f. Jain et al., 1999; Xu and Wunsch II, 2005), EM is not particularly well suited for clustering large datasets. Nonetheless, it has been used as a clustering algorithm (e.g., Wallace and Dowe, 1994; Fraley and Raftery, 1998, 1999; McLachlan et al., 1999; Cadez et al., 2001), and can provide pattern interpretations which otherwise cannot be obtained with model-free clustering approaches (c.f. § 2.3). Finally, on the theoretical front, mixture models with the k covariance matrices assumed equal are related to a well-known clustering method (Symons, 1981), while Celeux and Govaert (1992) and Banfield and Raftery (1993) have shown that the classification EM (CEM) algorithm under a spherical Gaussian mixture is the 'same' as the k-means (KM) clustering algorithm. A more effective modelling approach than EM is needed.

2.2.4 Bayesian techniques (for GMMs)

Bayesian Statistics Overview

A more ‘recent’ advance in mixture modelling was the use of Bayesian techniques

(Binder, 1978; Symons, 1981; Gilks et al., 1989). Bayesian data analysis (Gelman et al.,

2004; Lee, 2004; Ghosh et al., 2006b; Marin and Robert, 2007) aims to make infer-

ences about data using probability models for quantities we observe and for quan-

tities about which we wish to learn (Gelman et al., 2004, p.1). It has now been used

widely in various real life applications (Ridgeway and Madigan, 2003; Gelman et al.,

2004) including many financial, economic (e.g., Rachev et al., 2008) and marketing

applications (e.g., Rossi et al., 2005). It differs from classical/frequentist statistics in

its use of probability for naturally quantifying model uncertainty at all levels of the

modelling process, and provides a natural framework for producing more reliable

parameter estimates (Andrieu et al., 2003); this includes selecting an appropriate number of components k in mixture modelling, either via fully Bayesian models or by comparing model marginal likelihoods (Diebolt and Robert, 1994; Richardson and Green, 1997). Bayesian statistics makes use of prior distributions on the model parameters to express the uncertainty present even before seeing the data (Chatfield, 1995). In essence, it is like mixing several models rather than aiming to obtain one single best model (Chatfield, 1995).

The attractiveness of the Bayesian approach comes from the transparent inclusion of

prior knowledge, a straightforward probabilistic interpretation and hence communi-

cation of parameter estimates, and greater flexibility in model specification (Ridge-

way and Madigan, 2003; Gelman et al., 2004, pp.3-4). Unlike classical/frequentist

statistics, Bayesian approaches favour a simpler model (Jefferys and Berger, 1992;

MacKay, 1995), and are less likely to over-fit the data (Beal and Ghahramani, 2002). Additionally, they do not rely on the sometimes problematic p-values (Schervish, 1996; Sterne et al., 2001; Hubbard and Lindsay, 2008). Consequently, Bayesian inference is

also now widely established as one of the principal foundations for machine learning

(e.g., Winn and Bishop, 2005; Bishop, 2006).

A Bayesian statistical analysis typically begins with a full probability model, and then uses Bayes' theorem (Bayes, 1763) to learn or compute the posterior distribution of the parameters of interest after seeing the data. The Bayesian posterior distribution p(θ|x) of the model parameter of interest θ given data x can be expressed as:

\text{posterior} \propto \text{prior} \times \text{likelihood},
p(\theta \mid x) \propto p(\theta) \times p(x \mid \theta),

with p(θ) representing the prior knowledge of θ, and p(x|θ) representing the likelihood of x given θ (Gelman et al., 2004, pp.7-8).
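As a standard worked instance of this update (our illustration, not drawn from this thesis), a conjugate Beta prior combined with Bernoulli observations yields a closed form posterior:

\theta \sim \mathrm{Beta}(a, b), \quad x_i \mid \theta \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\theta) \;\Rightarrow\; p(\theta \mid x) \propto \theta^{a-1}(1-\theta)^{b-1} \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i},

so that \theta \mid x \sim \mathrm{Beta}\big(a + \sum_{i=1}^{n} x_i,\; b + n - \sum_{i=1}^{n} x_i\big).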

Integration is required for obtaining the expectation of a function h(θ):

E(h(\theta) \mid x) = \int h(\theta)\, p(\theta \mid x)\, d\theta,

and typically researchers work with the logarithm of these quantities for convenience (Madigan and Ridgeway, 2003). However, such operations are known to be

generally intractable, and hence obtaining exact inference is rarely possible (Madi-

gan and Ridgeway, 2003). Thus, many approximation techniques have been devel-

oped. Two fundamental techniques are described below.

Monte Carlo Sampling Sampling techniques enjoy wide applicability and can be powerful in evaluating multi-dimensional integrals and representing posterior distributions (Madigan and Ridgeway, 2003). Monte Carlo integration is one computational approach that approximates the solution by sampling from the posterior distribution iteratively (say, m times):

\lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} h(\theta_i) = \int h(\theta)\, p(\theta \mid x)\, d\theta.

However, the convergence of this approximation can be slow (Madigan and Ridgeway, 2003).
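A minimal sketch of the idea, with a 'posterior' we can sample from directly (a standard Gaussian, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Posterior samples theta_i ~ p(theta|x); here a standard Gaussian stand-in.
theta = rng.standard_normal(100_000)

# Monte Carlo estimate of E[h(theta)|x] for h(theta) = theta^2.
h = theta ** 2
print(h.mean())  # converges to the true value 1 as m grows
```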


Importance Sampling (IS) Importance sampling is a useful technique that may

assist in obtaining a sampling distribution that converges quickly to the required

integral. It is often utilised in Monte Carlo-based approaches for estimating a tar-

get distribution p(θ|x) which is difficult to compute with samples generated from an

alternative and more amenable distribution g(θ), known as the importance distribu-

tion (Madigan and Ridgeway, 2003). Importance sampling is based on the identities

below, where the θi are drawn independent and identically distributed (i.i.d.) from g(θ):

\int h(\theta)\, p(\theta \mid x)\, d\theta = \int h(\theta) \frac{p(\theta \mid x)}{g(\theta)}\, g(\theta)\, d\theta = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \omega_i h(\theta_i),

\hat{E}(h(\theta) \mid x) = \frac{\sum_{i=1}^{m} \omega_i h(\theta_i)}{\sum_{i=1}^{m} \omega_i}, \qquad \omega_i = \frac{p(\theta_i \mid x)}{g(\theta_i)},

where the ωi are the importance sampling weights, and p(θ|x) need not be normalised.

However, importance sampling can be difficult to implement when the target distri-

bution is complex, because it can be difficult to find a suitable g(θ) to use (Andrieu

et al., 2003).
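A minimal self-normalised importance sampling sketch, using a wide Gaussian importance distribution g and an unnormalised target (both chosen only for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def target_unnorm(t):
    # Unnormalised target p(theta|x): a Gaussian bump at 2, for illustration.
    return np.exp(-0.5 * (t - 2.0) ** 2)

# Importance distribution g: wide enough to cover the target's mass.
g = norm(loc=0.0, scale=3.0)
theta = g.rvs(size=100_000, random_state=rng)

w = target_unnorm(theta) / g.pdf(theta)   # importance weights
est = np.sum(w * theta) / np.sum(w)       # self-normalised estimate of E[theta|x]
print(est)  # close to 2
```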

Markov Chain Monte Carlo (MCMC) Methods

While there exist many other types of Monte Carlo-based approaches (e.g., Fearn-

head, 2008), Markov chain Monte Carlo (MCMC) methods (Geyer, 1992; Tierney,

1994; Besag et al., 1995; Gilks et al., 1998) are generally considered as standard, flex-

ible, and the most widely used Bayesian methods for approximating these incalcu-

lable distributions (Andrieu et al., 2003; Ridgeway and Madigan, 2003; Balakrishnan

and Madigan, 2006). Indeed, Bayesian inference for mixture models was not correctly treated until the introduction of MCMC algorithms in the early 1990s (Marin and Robert, 2007, p.147), or else required simplification for approximation (Everitt and Hand, 1981, pp.12-13).

Unlike standard Monte Carlo-based methods that create independent sample draws, MCMC methods build a Markov chain, a sample of dependent draws whose stationary distribution is the target p(θ|x). The Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) is the most popular MCMC method. A popular MH kernel is the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990), which is somewhat related to the EM algorithm (Andrieu et al., 2003). The Gibbs sampler forms the basis for software packages such as BUGS (Gilks et al., 1994), and is just one example of an 'efficient' sampler that aims to make sensible moves based on the current distributional knowledge (Andrieu et al., 2003).
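A minimal random-walk Metropolis-Hastings sketch targeting an unnormalised density (the target and the step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(t):
    # Unnormalised log target: a Gaussian bump at 2, for illustration.
    return -0.5 * (t - 2.0) ** 2

theta, samples = 0.0, []
for _ in range(50_000):
    prop = theta + rng.normal(scale=1.0)           # symmetric random-walk proposal
    log_alpha = log_target(prop) - log_target(theta)
    if np.log(rng.uniform()) < log_alpha:          # MH accept/reject step
        theta = prop
    samples.append(theta)

print(np.mean(samples[5_000:]))  # posterior mean estimate after burn-in
```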

However, while the MH algorithm is simple, the success or failure of the algorithm often depends on the careful design of the proposal distribution (Andrieu et al., 2003), and it is typically not suitable for high dimensions (Mengersen and Tweedie, 1996); i.e., it may not be flexible enough for the task of modelling individuals' heterogeneous mobility patterns.

Note that it is also possible to combine several different samplers (Tierney, 1994). A

combination of ‘global’ and ‘local’ proposals, for example, can be a useful approach

when the target distribution has many narrow peaks (Andrieu et al., 2003).

Reversible Jump MCMC (RJMCMC) & Related Methods

The Bayes factor (BF) (Kass and Raftery, 1995) is a standard measure for comparing Bayesian models and can therefore be used for selecting a suitable number of components k; this selection is particularly critical for modelling heterogeneous patterns. However, BF approximations can be computationally demanding (Han and Carlin, 2001). Alternatively, one can:

• utilise other measures (e.g., Mengersen and Robert, 1994; Raftery, 1996; Roeder

and Wasserman, 1997) such as the recently proposed deviance information

criteria (DIC) (Spiegelhalter et al., 2002; Celeux et al., 2006) which can be

computed more straightforwardly (McGrory and Titterington, 2007) for model

comparison; this strategy is, of course, ad-hoc since models with various k

need to be obtained first;

• adopt the nonparametric Dirichlet process strategy, which involves modelling the Gaussian parameters as coming from a Dirichlet process (Escobar and West, 1995); this approach, however, is not always recommended: Roeder and Wasserman (1997) argue that choosing the number of components rather than modelling it using a Dirichlet process is preferable, on the basis that this provides more direct control over the number of components;

• circumvent the problem by having fixed k in the model but allowing some com-

ponents to be empty as done by Gilks; this strategy, while theoretically sound,

has been found to be problematic in practice particularly for modelling spiky

patterns (c.f. discussions in Richardson and Green, 1997); or


• utilise a reversible jump sampler (Green, 1995) capable of performing model

comparison between models of varying dimensions.

The practical advantage of adopting the reversible jump approach for mixture mod-

elling within the MCMC scheme is being able to automatically select k by a fully

Bayesian method and estimate the parameter values simultaneously (Richardson

and Green, 1997); this approach is known as the reversible jump MCMC (RJMCMC)

method. The reversible jump sampler (Green, 1995) allows algorithms to occasion-

ally propose ‘jumps’ for exploring potential models; models may be rejected to en-

sure the desired stationary distribution is retained. In the mixture modelling sense,

this means that the algorithm can attempt to randomly split/merge components

provided the move is reversible (Richardson and Green, 1997) en route to the ‘op-

timal’ models. However, engineering reversible moves can be very tricky and time

consuming (Andrieu et al., 2003). Note that such split/merge strategies can also be

implemented for the EM algorithm (Ueda et al., 2000), but k needs to stay fixed; this

is to prevent over-fitting within the maximum likelihood (ML) framework.

Instead of making 'reversible jump' moves, Stephens (2000) proposed an alternative scheme for determining k based on a continuous time Markov birth-death process; the scheme allows the 'birth' of new components and the 'death' of some existing components. While this more straightforward birth-death MCMC method does not require the calculation of a complicated Jacobian, its computational time requirement is comparable to RJMCMC; that is, neither algorithm is suitable for analysing large datasets such as those used in our research.

However, significant headway has been made recently in approximate Bayesian computation (ABC), such as the approach we describe in the following.

Sequential Monte Carlo (SMC) Methods & Sequential Importance Sampling (SIS)

Traditional importance sampling (IS) can be modified to sequential importance

sampling (SIS) so that efficiency can be improved for the sequential data environ-

ment. When a new observation arrives at time t, the importance sampling weights need to be adjusted by the ratio g_t(θ)/g_{t-1}(θ), which is proportional to p(x_t|θ). Such a setting allows the algorithm to stop once the uncertainty in the parameters of interest has reached a satisfactory level. This approach is known as the sequential Monte Carlo (SMC) method

or particle filter (PF) (Doucet et al., 2001); in contrast to standard MCMC methods

(Ridgeway and Madigan, 2003), SMC has been shown to be efficient, flexible, paral-

lelisable, and easy to implement (Doucet et al., 2001; Ridgeway and Madigan, 2003;

Balakrishnan and Madigan, 2006). More recently, there has been even more focus on

the efficiency issue of SMC methods (Chopin, 2002).
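A minimal sketch of the sequential weight update with multinomial resampling, for a static parameter and i.i.d. Gaussian observations (illustrative assumptions throughout; this is not the 1PFS algorithm discussed below):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
data = rng.normal(2.0, 1.0, size=200)      # stream of observations x_t

m = 2_000
particles = rng.normal(0.0, 5.0, size=m)   # initial particles from a wide prior
weights = np.full(m, 1.0 / m)

for x_t in data:
    # SIS update: multiply each weight by p(x_t | theta_i), then renormalise.
    weights *= norm.pdf(x_t, loc=particles, scale=1.0)
    weights /= weights.sum()
    # Resample when the effective sample size degenerates.
    if 1.0 / np.sum(weights ** 2) < m / 2:
        idx = rng.choice(m, size=m, p=weights)
        particles, weights = particles[idx], np.full(m, 1.0 / m)

print(np.sum(weights * particles))  # posterior mean estimate, near 2
```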


Already, Ridgeway and Madigan (2002, 2003), for example, have managed to reduce data access requirements by 98% compared to traditional MCMC methods through the use of an importance sampling algorithm in their experimental studies. Their algorithm partitions all observations into subsets x1 and x2; instead of sampling from the posterior conditioned on all the data, samples are drawn from the posterior conditioned on x1 to speed up the sampling procedure, and x2 is utilised for adjusting the sampled parameter values by reweighting. Furthermore, Balakrishnan and Madigan (2006) have recently improved on this and have presented the first single pass algorithm in this research direction, known as the one pass particle filter with shrinkage (1PFS), which is better suited to real time environments. 1PFS bypasses the exhaustive analysis of an initial portion of the training data by sampling initial particles from the prior distribution of the parameters (Balakrishnan and Madigan, 2006). Nonetheless, SMC methods typically focus on analysing long series (as opposed to many shorter series), and their properties for static datasets are still largely unknown; their shortfalls are described more generally below.

Disadvantages of Monte Carlo-Based Approaches

MCMC methods can provide 'correct' approximations given infinite computational resources (Bishop, 2006, p.462). However, despite their popularity, they are typically computationally intensive (Balakrishnan and Madigan, 2006; Blei and Jordan, 2006). While they have been successfully utilised for solving many smaller data mining problems (Giudici and Castelo, 2001; Giudici and Passerone, 2002), they are not known to be practical for analysing massive datasets (Madigan and Ridgeway, 2003); this is in spite of the fact that their statistical concepts are sound and useful, and are often noted in the data mining literature (Glymour et al., 1996). MCMC's scalability issues are the result of the following requirements:

1. intensive computational requirements of scanning through the dataset (and updating models), which generally means a very large number of iterations to ensure converged models (Ridgeway and Madigan, 2003);

2. typical requirement of a complete scan of the dataset for each iteration (Ridge-

way and Madigan, 2003; Balakrishnan and Madigan, 2006);

3. high storage requirements (Wang and Titterington, 2006);

4. having parameter posterior distributions stored as a set of samples usually in

the memory (Andrieu et al., 2003); and

5. the typical step of loading data into the memory prior to the modelling process

(Ridgeway and Madigan, 2003).

While the more recently popular sequential Monte Carlo (SMC) methods (Doucet et al., 2001; Chopin, 2002; Ridgeway and Madigan, 2003; Balakrishnan and Madigan, 2006) have attempted to address many of these scalability challenges, Monte Carlo-based methods typically have several additional drawbacks, namely they:

6. rely on samples being a good representation of the true model (Andrieu et al.,

2003);

7. do not yield closed form solutions;

8. do not guarantee monotonically improving approximations (Jaakkola and Jor-

dan, 2000);

9. involve difficulties in verifying if models have converged (Robert and Casella,

1999, Chapter 8); and

10. have a label switching issue caused by the non-identifiability of the components under symmetric priors (Celeux et al., 2000; Marin and Robert, 2007, p.162).

In short, Bayesian approaches allow for flexibility within models by naturally incorporating (decision) uncertainty into the models via prior distributions on the parameters, and the results can be interpreted simply. However, their scalability can be significantly worse than that of the classical/frequentist techniques described in the previous subsection, making them impractical for many applications such as this research. Nonetheless, there are several clustering algorithms (e.g., Cheeseman and Stutz, 1996) based on the Bayesian mixture model framework; some recent algorithms (e.g., Pizzuti and Talia, 2003) are more scalable as a result of a parallel implementation, for example. In sum, a more efficient approach than the existing typical Bayesian methods is needed for our task.

2.2.5 Approximate techniques (for GMMs)

Variational Bayesian (VB) Methods

There are many Bayesian inference approximation schemes (e.g., Madigan and Ridgeway, 2003; Gelman et al., 2004; Bishop, 2006; Rue et al., 2009), and one general approach is the variational Bayesian (VB) method (Wang and Titterington, 2006). VB (Waterhouse et al., 1996; Neal and Hinton, 1998; Jordan et al., 1998) was first formally formulated by Attias (1999). It involves introducing a prior over the model structure and is a deterministic alternative to sampling-based Monte Carlo methods for Bayesian inference. It is less computationally demanding and is promising for analysing large datasets (Madigan and Ridgeway, 2003). This is achieved by converting the inference problem into a 'relaxed' optimisation problem (Blei and Jordan, 2006). It is based on work on the calculus of variations, which has its origins in the 18th century; i.e., it involves a functional derivative which expresses how the value of the functional changes in response to infinitesimal changes to the input function (Bishop, 2006, pp.462-463).

The theory of VB has now been well documented (e.g., Wang and Titterington, 2006).

In short, it aims to obtain an approximation to the posterior distribution p (θ|x),

which leads to a coupled expression for the posterior that can be solved iteratively

until convergence. It aims to maximise the lower bound on the data marginal like-

lihood p (x), which is done by aiming to minimise the Kullback-Leibler (KL) diver-

gence between a function which is an approximation to the posterior, and the actual

target posterior from which the observed marginal posterior can be obtained. That

is, with the introduction of a variational function q (θ, z|x) and the use of Jensen’s in-

equality, the log transformation of the data marginal likelihood can be expressed as

follows:

\log p(x) = \log \int \sum_{\{z\}} q(\theta, z \mid x) \frac{p(x, z, \theta)}{q(\theta, z \mid x)}\, d\theta

= \int \sum_{\{z\}} q(\theta, z \mid x) \log \frac{p(x, z, \theta)}{q(\theta, z \mid x)}\, d\theta + \int \sum_{\{z\}} q(\theta, z \mid x) \log \frac{q(\theta, z \mid x)}{p(\theta, z \mid x)}\, d\theta \quad (2.1)

= F(q(\theta, z \mid x)) + KL(q \,\|\, p)

\geq F(q(\theta, z \mid x)),

where F(.) is the first term in (2.1) and KL(.) is the second term, which is the KL divergence between the target p(θ, z|x) and its variational approximation; z denotes the unobserved component membership information of x. By minimising KL(.), which cannot be negative, VB effectively maximises F(.), a lower bound on log p(x). However, the variational function must be chosen carefully so that it is a close approximation to the true conditional density and, importantly, so that it gives tractable computations for approximating the required posterior distribution. Typically it is assumed that q(θ, z|x) can be factorised as qθ(θ|x) × qz(z|x), with conjugate distributions chosen for the parameters. VB then involves solving for q(θ, z|x) iteratively in a way similar to the standard EM algorithm. At each iteration, the lower bound F(.) is increased provided it has not reached its maximum. We note that this typical factorisation of the variational function is based on the approximation framework of mean field theory (Opper and Saad, 2001).

Besides its scalability and deterministic nature, the advantages of VB are that it does not have the singularity problems of maximum likelihood (ML) approaches such as EM (Attias, 1999), nor the mixing or label switching problems of Monte Carlo-based methods (Wang and Titterington, 2006). It has been shown to perform well empirically (Wang and Titterington, 2006), and is more suitable (Ghahramani and Beal, 1999) and accurate for mixture models when compared with the Laplace approximation, a large sample method that assumes all posterior distributions are Gaussian (MacKay, 1998). Consequently, it has already been applied in various applications (Winn and Bishop, 2005; Wang and Titterington, 2006). Note that, as with the other methods which we have described previously, the VB approach is not limited to mixture modelling or the missing data model context, but this is the primary focus of this research.

Furthermore, although it is not yet clear exactly why, it turns out that VB will effectively and progressively eliminate redundant components when an excessive number of components is specified in the initial model (Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007). This automatic selection of the number of components k implies that one does not need to utilise a fully Bayesian, computationally extensive approach such as the RJMCMC method (Richardson and Green, 1997) or the birth-death MCMC method (Stephens, 2000). That is, VB can be used to simultaneously perform model selection and estimation of model parameters (McGrory and Titterington, 2007), as in the case of those fully Bayesian methods, but in a much more efficient way. Note that this can be a very useful feature for our research aim of modelling individuals' heterogeneous spatial usage behaviour, but its effect requires further investigation. We discuss this issue in Chapter 3, where we explore the practical implications of this aspect of the approach, and also in Chapter 4, where we attempt to describe this feature more fully. The actual VB model hierarchies for GMMs are omitted here as they are discussed in relation to particular cases later; please refer to Chapter 3 for the one-dimensional GMM and Chapters 4 to 6 for the two-dimensional GMM.
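This elimination behaviour can be observed with off-the-shelf tools; for instance, scikit-learn's BayesianGaussianMixture implements VB for GMMs, and fitting it with a deliberately excessive number of components typically drives the redundant mixing weights towards zero (a small illustration, not the implementation developed in this thesis):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(7)
X = np.concatenate([rng.normal(0, 1, (300, 1)),
                    rng.normal(6, 1, (300, 1))])  # two true components

# Start with far more components than needed; VB shrinks the extras.
vb = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior=0.01,  # small prior encourages elimination
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vb.weights_, 3))  # most weights collapse towards zero
```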

For completeness, we conclude this subsection by detailing the general result of VB for mixtures of exponential family models in the one-dimensional case, which can be easily extended to higher dimensions as outlined in the thesis of McGrory (2006). For consistency, we again denote the data as x = (x1, ..., xn), which is to be modelled as a mixture of k distributions where each distribution has a corresponding parameter φ, and z = {zij : i = 1, ..., n, j = 1, ..., k} as the missing binary observation component indicator variable such that zij = 1 if observation xi is from component j and zij = 0 otherwise. The complete data likelihood for a mixture of exponential family distributions can generally be expressed as follows:

exponential distributions can generally be expressed as follows:

p (x, z|φ,w) =n∏i=1

k∏j=1

wzij

j [s (xi) t (φj) exp {a (xi) b (φj)}]zij ,

Page 54: With applications to profiling and differentiating habitual ... · With applications to profiling and differentiating habitual consumption behaviour of customers in the wireless

30 Chapter 2. Literature Review

where w is the distribution mixing weights; b (φj) is the natural parameter and s (xi),

t (φj) and a (xi) are functions defining the exponential family distribution. Given

this, the conjugate prior will be in the form of:

p (x, z|η, υ) ∝k∏j=1

(0)j −1

j

k∏j=1

h(η

(0)j , υ

(0)j

)t (φj)

η(0)j exp

(0)j b (φj)

},

where α is an hyper-parameter and h (.) is a function of another exponential family

distribution with parameters η and υ:

p (x, z, φ, w|η, υ) ∝n∏i=1

k∏j=1

wzij

j [s (xi) t (φj) exp {a (xi) b (φj)}]zij

×k∏j=1

(0)j −1

j

k∏j=1

h(η

(0)j , υ

(0)j

)t (φj)

η(0)j exp

(0)j b (φj)

}.

Assuming the introduced variational function factorises as q(z, \phi, w) = \prod_{i=1}^{n} \{ q_{z_i}(z_i) \}\, q_\phi(\phi)\, q_w(w), with q_\phi(\phi) = \prod_{j=1}^{k} q_{\phi_j}(\phi_j), the variational posterior q_{z_i}(z_i = j) can be derived by focusing on the terms of the variational lower bound on the marginal log-likelihood that involve q_{z_i}:

\sum_{\{z\}} \int \prod_{i=1}^{n} \{ q_{z_i}(z_i) \}\, q_\phi(\phi)\, q_w(w) \log \frac{p(\phi, w) \prod_{i=1}^{n} p(x_i, z_i \mid \phi, w)}{\prod_{i=1}^{n} \{ q_{z_i}(z_i) \}\, q_\phi(\phi)\, q_w(w)}\, d\phi\, dw

= \sum_{\{j\}} \int q_{z_i}(z_i = j)\, q_\phi(\phi)\, q_w(w) \log \frac{p(x_i, z_i = j \mid \phi, w)}{q_{z_i}(z_i = j)}\, d\phi\, dw + \text{terms independent of } q_{z_i}

= \sum_{\{j\}} q_{z_i}(z_i = j) \left\{ \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \mid \phi, w)\, d\phi\, dw - \log q_{z_i}(z_i = j) \right\} + \text{terms independent of } q_{z_i}

= \sum_{\{j\}} q_{z_i}(z_i = j) \log \left[ \frac{\exp \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \mid \phi, w)\, d\phi\, dw}{q_{z_i}(z_i = j)} \right] + \text{terms independent of } q_{z_i}.

That is, substituting the prior and the general form of the mixture density into this equation gives:

\begin{align*}
q_{z_i}(z_i = j) &\propto \exp\left\{ \int q_\phi(\phi)\, q_w(w) \log p(x_i, z_i = j \mid \phi, w) \, d\phi\, dw \right\} \\
&\propto \exp\left\{ \int q_\phi(\phi)\, q_w(w) \left[ \log w_j + \log t(\phi_j) + a(x_i)\, b(\phi_j) \right] d\phi\, dw \right\} \\
&= \exp\left\{ \mathrm{E}_q[\log w_j] + \mathrm{E}_q[\log t(\phi_j)] + a(x_i)\, \mathrm{E}_q[b(\phi_j)] \right\}.
\end{align*}

Similarly, we can obtain:

\[ q_w(w) \propto \prod_{j=1}^{k} w_j^{\alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij} - 1} = \prod_{j=1}^{k} w_j^{\alpha_j - 1}, \]

where $q_{ij} = q_{z_i}(z_i = j)$ and $\alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}$; and

\begin{align*}
q_{\phi_j}(\phi_j) &\propto t(\phi_j)^{\sum_{i=1}^{n} \mathrm{E}_{q_{z_i}}[z_{ij}]} \exp\left[ \sum_{i=1}^{n} \mathrm{E}_{q_{z_i}}[z_{ij}]\, a(x_i)\, b(\phi_j) \right] t(\phi_j)^{\eta_j^{(0)}} \exp\left[ \upsilon_j^{(0)} b(\phi_j) \right] \\
&= t(\phi_j)^{\sum_{i=1}^{n} q_{ij} + \eta_j^{(0)}} \exp\left[ \left\{ \sum_{i=1}^{n} q_{ij}\, a(x_i) + \upsilon_j^{(0)} \right\} b(\phi_j) \right] \\
&= t(\phi_j)^{\eta_j} \exp\left[ \upsilon_j\, b(\phi_j) \right],
\end{align*}

where

\[ \eta_j = \eta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij}\, a(x_i). \]
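To make these updates concrete, the following is a minimal sketch (our own illustration, not taken from the thesis or from McGrory (2006)) of the resulting VB iteration for the special case of a one-dimensional mixture of exponential distributions, where $s(x) = 1$, $t(\phi) = \phi$, $a(x) = x$ and $b(\phi) = -\phi$; the conjugate priors are then a symmetric Dirichlet on the weights and a Gamma on each rate. All function names and hyper-parameter values are illustrative.

```python
import numpy as np
from scipy.special import digamma

def vb_exponential_mixture(x, k=5, n_iter=100,
                           alpha0=1.0, eta0=1.0, ups0=1.0, seed=0):
    # Minimal VB sketch for a 1-D mixture of exponential distributions
    # (density lam*exp(-lam*x)): s(x)=1, t(lam)=lam, a(x)=x, b(lam)=-lam
    # in the notation above, so the conjugate prior on each rate lam_j is
    # Gamma(eta0 + 1, ups0) and the weights receive a symmetric
    # Dirichlet(alpha0) prior. Illustrative only.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    q = rng.dirichlet(np.ones(k), size=len(x))       # responsibilities q_ij
    for _ in range(n_iter):
        # hyper-parameter updates: the alpha_j, eta_j and upsilon_j above
        nk = q.sum(axis=0)
        alpha = alpha0 + nk
        eta = eta0 + nk
        ups = ups0 + q.T @ x
        # expectations under q_w (Dirichlet) and q_phi (Gamma)
        e_log_w = digamma(alpha) - digamma(alpha.sum())
        e_log_lam = digamma(eta + 1.0) - np.log(ups)
        e_lam = (eta + 1.0) / ups
        # q_ij prop. to exp{E[log w_j] + E[log t(phi_j)] + a(x_i) E[b(phi_j)]}
        log_q = e_log_w + e_log_lam - np.outer(x, e_lam)
        log_q -= log_q.max(axis=1, keepdims=True)     # numerical stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
    return q, alpha, eta, ups

# two well-separated exponential components; redundant components of the
# initial k=4 model typically end up with small expected weight
rng = np.random.default_rng(1)
x = np.concatenate([rng.exponential(1.0, 300), rng.exponential(10.0, 300)])
q, alpha, eta, ups = vb_exponential_mixture(x, k=4)
print("posterior mean rates:", (eta + 1.0) / ups)
print("expected weights:", alpha / alpha.sum())
```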

Expectation Propagation (EP) Method

The deterministic expectation propagation (EP) method (Minka, 2001) is closely related to VB in that it also performs approximate inference based on minimising a KL divergence. However, EP minimises $\mathrm{KL}(p \| q)$ instead of $\mathrm{KL}(q \| p)$; note that the KL divergence is not symmetric. One disadvantage of EP is that it is typically not guaranteed to converge monotonically; moreover, EP is not 'sensible' for mixture modelling as it attempts to capture all of the posterior modes at once (Bishop, 2006, p.510), although attempts to use it have been made (Minka and Ghahramani, 2003).
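As a quick numerical illustration of this asymmetry (our own example, not from the cited works), the snippet below evaluates the closed-form KL divergence between two univariate Gaussians in both directions; the two values differ.

```python
import numpy as np

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    # closed-form KL(p || q) for univariate Gaussians p and q
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5)

print(kl_gaussian(0.0, 1.0, 3.0, 2.0))   # KL(p || q)
print(kl_gaussian(3.0, 2.0, 0.0, 1.0))   # KL(q || p): a different value
```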

2.2.6 High dimensional GMM

As indicated earlier, one of the research requirements is to investigate combinational data analysis (CDA) for high dimensional data. There has been limited study of high dimensional GMMs in comparison to high dimensional clustering algorithms (which we discuss later in § 2.3.4). This is not surprising considering that it is more challenging to model 'complex' data with parametric/semi-parametric models than with the nonparametric approach.

One notable recent exception is Bouveyron et al. (2007), which models subspace clusters in high dimensional spaces with a GMM; the concept of subspace clusters, discussed further in § 2.3.4, is that some dimensions are considered as noise for some clusters. The algorithm is based on work on mixtures of probabilistic PCA (Tipping and Bishop, 1999; McLachlan et al., 2003) and eigenvalue decomposition of the covariance matrices (Celeux and Govaert, 1995), with only certain essential parameters estimated by an EM algorithm. The algorithm assumes that there are no irrelevant dimensions (though some can have a weighting very close to zero), while the intrinsic dimensionality of the clusters is estimated iteratively (in each M-step, based on the eigenvalues of each cluster's covariance matrix) with the use of the scree-test of Cattell (1966) and BIC.

Despite the assumptions and constraints made so that parameters are estimated only with respect to the likely subspace of each cluster, the algorithm still appears to be quite computationally demanding. More problematic, however, is that it requires the number of clusters k to be predetermined, which is clearly not desirable for this research. Note that algorithms (e.g., Friedman and Meulman, 2004; Domeniconi et al., 2004; Jing et al., 2007; Cheng et al., 2008) which adopt this approach are often considered weighted k-means-like algorithms since they focus on normalising attributes rather than discarding them; they are sometimes referred to as soft projected clustering algorithms (Kriegel et al., 2009).

2.2.7 Review conclusion

Overall, the two recent but different approaches, sampling-based sequential Monte Carlo (SMC) methods and non-sampling-based variational Bayesian (VB) methods, both within the Bayesian framework, appear to be effective and scalable for approximating individuals' mobility patterns as Gaussian mixture models (GMMs). However, the deterministic nature of VB, together with its ability to automatically determine a suitable number of components k and its space efficiency, suggests that VB should be the preferred technique for modelling the heterogeneous patterns we expect to observe in this research; it is applied more generally in Chapters 3 to 6. Moreover, it may be necessary to consider utilising more robust covariance estimation (Campbell, 1980; Pena and Prieto, 2001; Wang and Raftery, 2002) within the VB framework to account for the irregular nature of the problem, though this appears to be computationally expensive.

2.3 Clustering

2.3.1 Introduction

The aim of clustering (Jain and Dubes, 1988; Duda et al., 2001) is to segment unlabelled and typically numerical data, in an automatic fashion, into relatively meaningful, natural, homogeneous but hidden subgroups or 'clusters'. This is done by maximising intra-cluster and minimising inter-cluster similarities without the need for any prior knowledge (Hastie et al., 2009, p.501). The similarities are often measured based on the (Euclidean) distance for (low dimensional) numerical data, though the definition of 'similar' (Xu and Wunsch II, 2005; Jain and Dubes, 1988, pp.14-23) can vary greatly. In machine learning, clustering is sometimes known as unsupervised classification (Jain et al., 1999); and today it is usually performed without first assessing the cluster tendency, i.e., without first determining whether clusters are present in the data at all (Smith and Jain, 1984), even though some have argued the importance of this step and proposed approaches for doing so (Smith and Jain, 1984; Jain and Dubes, 1988, p.201). In addition to discovering the underlying data structure (Hastie et al., 2009, p.502), clustering has been used as a data reduction, compression, summarisation (Jain, 2010) and outlier detection tool (Barbara et al., 1997); it has been shown to be useful for pattern/image segmentation/recognition and information retrieval, for example (Jain, 2010). Furthermore, it has been shown to be useful both as a stand-alone technique and as a preprocessing technique for other analytical tasks such as supervised classification (Han and Kamber, 2006, pp.383-384).

Even though probabilistic theories have recently been proposed for better algorithm design (Dougherty and Brun, 2004), clustering is still considered to be a problematic and subjective process with no standard benchmarks available for comparison (Jain et al., 1999). This is especially true for clustering high dimensional data (Patrikainen and Meila, 2006). The loose definition of a cluster, however, implies that clustering results can be evaluated/validated according to internal, external and relative criteria (Dubes, 1999; Jain and Dubes, 1988, Chapter 4). Yet, many of these approaches do not appear to be effective or practical, and often do not address issues such as the stability of the results (Lange et al., 2004). For applications such as our business application, however, being able to assess cluster interpretability and visualisation (Berkhin, 2006) can be useful if not critical.

Unlike mixture models (c.f. § 2.2), which profile data based on the mixture decomposition (Jain and Dubes, 1988, pp.117-118), the 'definition' of a cluster can vary greatly (Everitt, 1974, pp.43-44). Examples of different cluster representations for numerical data, which is our primary research focus, include:

• by the cluster’s gravity centre. For example, the classical k-means (KM) (Mac-

Queen, 1967) and hierarchical (Kaufman and Rousseeuw, 1990) clustering al-

gorithms;

• by an object (i.e., a medoid) located near its centre, for example, k-medoids algorithms (Kaufman and Rousseeuw, 1990; Ng and Han, 1994). This definition makes the algorithms less sensitive to outliers when compared to the gravity-centre definition above but, at the same time, makes them less scalable;

• by density connected points (Jain and Dubes, 1988, pp.128-133). Algorithms based on this notion are sometimes referred to as density-based (e.g., Ester et al., 1996) or grid-based (e.g., Wang et al., 1997b) clustering algorithms (c.f. § 2.3.3). They have significantly influenced research into clustering high dimensional data (Hinneburg and Keim, 1998; Agrawal et al., 1998) (c.f. § 2.3.4);

• by a collection of points (e.g., Guha et al., 1998) or by a boundary (e.g., Karypis et al., 1999). Algorithms based on the former definition can often better represent the clusters and reduce the implications of clusters being very different in size, for example. Algorithms utilising boundaries as cluster representations are often based on the use of graph theory (Jain and Dubes, 1988, pp.120-128) such as the minimal spanning tree (MST) (Zahn, 1971); the graphical approach is closely related to hierarchical clustering algorithms (c.f. § 2.3.2). However, while there have been many recent developments in this research direction (c.f. Jain, 2010), they do not appear to be effective for large high dimensional datasets;

• by concept (e.g., Gennari et al., 1989);

• by its statistical summaries (e.g., Zhang et al., 1996); and

• by the mode of its probability density estimate (Jain and Dubes, 1988, pp.118-120).

This in turn suggests that, while significant literature exists on the topics of clustering and algorithms (e.g., Jain and Dubes, 1988; Jain et al., 1999; Xu and Wunsch II, 2005; Berkhin, 2006; Han and Kamber, 2006), differences in cluster definitions (Jain, 2010), differences in cluster assumptions, such as assigning each data point to:

• only one (MacQueen, 1967) or multiple (Cole and Wishart, 1970) clusters, or

• every cluster with various degrees of probability (Dempster et al., 1977; Bezdek,

1981),


and the different contexts in which clustering is used, for example, have made reviewing clustering and transferring useful generic concepts and methodologies challenging (Jain et al., 1999). Moreover, it appears that there is no single optimal algorithm for solving all clustering problems (Kleinberg, 2002), and no standard or effective criteria to guide algorithm selection (Xu and Wunsch II, 2005). Of course, there are also issues that need to be carefully considered such as feature selection, weighting and normalisation (Wedel and Kamakura, 1998, pp.57-59).

Furthermore, most algorithms (including many more recently developed ones) appear to be very sensitive to critical parameter settings such as the number of clusters k (Jain and Dubes, 1988, p.177). Yet, these parameters can be difficult to determine and can lead to unreliable or poor clustering quality (Jain et al., 1999), especially for clustering high dimensional data (Moise et al., 2008). Determining the 'true' value of k is the fundamental problem of clustering, or cluster validity (Everitt, 1979), and some guidance (e.g., Milligan and Cooper, 1985; Dubes, 1987; Tibshirani et al., 2001) has been provided (c.f. Xu and Wunsch II, 2005; Berkhin, 2006); see also the discussion in § 2.2.3. Bayesian techniques in particular (Schwarz, 1978; Wallace and Freeman, 1987; Kass and Raftery, 1995; Fraley and Raftery, 1998; Blei et al., 2003; Li and McCallum, 2006) have been shown to be useful in determining the value of k and parameters more generally. However, most of these techniques are based on the concept of clusters being 'compact' and 'isolated' (Jain, 2010), which is not necessarily appropriate for this application, since human mobility patterns have a high degree of spatial regularity which makes the overall pattern heterogeneous and spiky.

In the remainder of this review of the clustering literature, methodology is discussed

roughly in order by time of development. In § 2.3.4, techniques for clustering high

dimensional numerical data are reviewed.

2.3.2 Classical clustering algorithms

Hierarchical Clustering Algorithms

Classical algorithms are commonly separated into two classes: hierarchical and flat partitioning (Jain and Dubes, 1988, pp.55-58). Hierarchical clustering algorithms represent data as an easily understood, hierarchically nested series of partitions; they are typically based on a distance-related criterion and/or a number-of-clusters criterion for merging (agglomerative) or splitting (divisive) (Jain et al., 1999). However, their minimum requirements of quadratic time and space complexity (i.e., $O(n^2)$ with n the number of observations) (Jain et al., 1999) imply that they have limited application for analysing large datasets, although many recent algorithms (e.g., Achtert et al., 2007a) for clustering high dimensional data are influenced by them.
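For illustration, a brief agglomerative clustering sketch using SciPy's hierarchical clustering routines is given below; this is a generic usage example, not an implementation from the cited literature, and the full pairwise-distance computation inside linkage() embodies the O(n^2) cost just noted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
# linkage() builds the full agglomerative merge tree; it needs all
# pairwise distances, which is the quadratic cost discussed above
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
print(np.bincount(labels)[1:])
```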

k-Means Algorithm(s)

On the other hand, the popular k-means (KM) algorithm is the most representative, but not the only, partitional algorithm (Jain et al., 1999). It and its variants are sometimes known as squared error clustering algorithms since they segment objects into k groups iteratively by minimising the sum of squared error over all k clusters (Jain et al., 1999). KM-based algorithms are computationally more efficient than hierarchical clustering algorithms; their time complexity is $O(ndk)$, with n the number of observations, d the number of attributes and k the number of clusters (Jain et al., 1999). While comparisons of cluster quality between them and the hierarchical alternatives are inconclusive (Milligan, 1980; Punj and Stewart, 1983), both (as well as the k-medoid variant) can generally only identify clusters with convex shapes (or hyper-spherical shapes, to be more precise) (Jain et al., 1999), and have a tendency to split larger clusters into clusters of similar size (Mao and Jain, 1996). Additionally, both can be rather sensitive to noise due to the use of a single representative per cluster and the use of distance-based measures (Guha et al., 1999). Interestingly, however, the KM algorithm with the Mahalanobis distance (MD) (Mahalanobis, 1936) as the proximity measure, as an alternative to the commonly used Euclidean distance, tends to produce hyper-ellipsoidal clusters that can be unusually large or small in size (Mao and Jain, 1996).
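A minimal Lloyd-style sketch of the KM iteration follows (a generic illustration, not any of the cited variants); each iteration costs O(ndk) as noted above, and the dependence on the random initial partition discussed below is visible here, since different seeds can converge to different local optima.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    # Minimal Lloyd-style k-means sketch (illustrative only)
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: nearest centre by squared Euclidean distance
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centre becomes the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels
```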

There are a great number of KM algorithm extensions (c.f. Jain, 2010). In particular,

many (e.g., Bradley et al., 1998; Farnstrom et al., 2000; Pham et al., 2004; Ordonez

and Omiecinski, 2004) have been developed to address, for example:

• the quality of results (as well as the algorithm's sensitivity to the initial partition selection, which can lead to locally optimal clusters);
• the issue of data order dependency; this is an issue for many efficient algorithms (c.f. Hartigan, 1975; Fisher, 1987) which scan the data only once (c.f. Appendices B and C); and

• the suitability of the algorithm for clustering large datasets without needing to

freely access all data as most (classical) algorithms have assumed (c.f. Jain et al.,

1999; Xu and Wunsch II, 2005; Kogan et al., 2006; Kogan, 2007).

Additionally, note that the KM algorithm, which is predominantly used for clustering numerical data, has also been extended to handle categorical or mixed type attributes (i.e., utilising measures other than distance) (e.g., Huang, 1998; Ghosh and Strehl, 2006; Kogan, 2007). Overall, however, more scalable algorithms that do not require predetermination of k and are capable of discovering more arbitrarily shaped clusters are needed for this application.


2.3.3 Scalable clustering algorithms

Random Sampling & Index Tree Structure Random sampling (and randomised search) (Kaufman and Rousseeuw, 1990; Ng and Han, 1994; Guha et al., 1998), which has been shown to be robust (in terms of resilience to noise) and useful for clustering large datasets by fitting selected samples into memory, is one technique frequently utilised in various parts of many clustering algorithms. Sets of well-chosen samples have been shown to be useful for identifying and representing quality clusters (Guha et al., 1998). An alternative approach to random sampling, which discards part of the observations to improve scalability, is to summarise the data into a dynamically updated index tree structure; an index tree structure can efficiently identify observations' nearest neighbours (Jain, 2010). The use of an index has been shown to help algorithms focus on relevant data; algorithms can thus cluster data representatives (resident in memory) based on the summarised statistical information instead of the original data (Zhang et al., 1996). Algorithms (e.g., Zhang et al., 1996) based on this technique have shown that it is actually possible to cluster with close to linear time complexity, i.e., $O(n)$. However, they typically utilise statistics (e.g., zeroth, first, and second moments) that assume the data is Gaussian distributed, which is generally inappropriate. Additionally, these algorithms are somewhat sensitive to the data ordering; and their use of a radius to control the boundary of each cluster still results in sharp spherical clusters, as in the case of classical clustering algorithms (Guha et al., 1998; Sheikholeslami et al., 1998). Nonetheless, the index tree structure has been widely utilised since (e.g., Ganti et al., 1999c; Aggarwal et al., 2003), and has been shown to be useful in dealing with noisy data and detecting anomalies (e.g., Bohm et al., 2000; Burbeck and Nadjm-Tehrani, 2005).

Density-Based Clustering Algorithms

Density-based clustering algorithms define and connect clusters agglomeratively based on the density of neighbourhood objects (i.e., the number of data points within a given radius of an object); DBSCAN is perhaps the most well-known example (Ester et al., 1996; Sander et al., 1998). Density-based clustering algorithms, through the use of a local criterion, have been shown to be able to discover somewhat arbitrary clusters with different shapes, sizes and densities, and are naturally robust to outliers; though the clusters may not be very informative and/or easy to interpret (Berkhin, 2006). Many improvements have been made in, for example:

• eliminating the input parameters requirement (e.g., Xu et al., 1998). This can

be done, for example, by identifying the intrinsic clustering structures of the

clusters (Ankerst et al., 1999) for which the technique has also been found use-

ful even for discovering high dimensional clusters hierarchically (e.g., Achtert

et al., 2006, 2007a),


• dealing with outliers better through the use of degrees of likelihood instead of a binary

decision (e.g., Breunig et al., 2000),

• improving efficiency by being able to incrementally update only neighbour-

hood information related to the updated data (e.g., Ester et al., 1998), and

• extending the algorithm to clustering objects in the spatial-temporal domain

(e.g., Birant and Kut, 2007).

However, poor quality clusters may still be obtained as a result of the global density tactic (i.e., the use of a fixed radius), and these algorithms typically still require $O(n \log n)$ time even with the use of the index tree structure discussed above. Note that, instead of using a global distance measure, utilising hyper-graph partitioning techniques (e.g., Karypis et al., 1999; Estivill-Castro and Lee, 2000; Agarwal et al., 2005) has been shown to improve the resultant cluster qualities through relative interconnectivity (c.f. Guha et al., 1999) and closeness concepts (c.f. Guha et al., 1998).
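For illustration, a brief usage sketch with scikit-learn's DBSCAN implementation is given below (a generic example, assuming scikit-learn is available, not tied to any of the cited extensions); eps is the fixed global neighbourhood radius and min_samples the density threshold criticised above as a difficult, global parameter setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (20, 2))])   # two clusters plus noise
# eps is the fixed neighbourhood radius; min_samples the density threshold
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", len(set(labels) - {-1}),
      "noise points:", int((labels == -1).sum()))
```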

Grid-Based Clustering Algorithms

The use of density means that random sampling (c.f. Guha et al., 1998) is not practical for improving scalability. However, density-based clustering algorithms can be approximated efficiently by algorithms based on grids (e.g., Schikuta and Erhart, 1997; Wang et al., 1997b; Sheikholeslami et al., 1998). These algorithms minimise the distance computation requirements by (a minimal sketch of these steps follows the list):

• quantising the objects in their original feature space,
• computing statistical distribution summaries for each attribute within each grid cell in a single scan of the data, i.e., with time complexity of $O(n)$, and then
• hierarchically clustering the resultant grid information structure instead of the original objects.
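The following is a hypothetical one-pass illustration of the quantisation and summarisation steps; the function name and the choice of per-cell statistics (count, per-attribute sum and sum of squares) are our own, not any particular cited algorithm's.

```python
import numpy as np
from collections import defaultdict

def grid_summaries(X, n_bins=10):
    # One-pass O(n) summarisation sketch: quantise each object to a grid
    # cell and keep per-cell sufficient statistics; clustering can then
    # operate on the much smaller set of occupied cells instead of the
    # original objects.
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_bins + 1e-12           # guard against zero width
    cells = defaultdict(lambda: [0, np.zeros(d), np.zeros(d)])
    for x in X:                                  # single scan of the data
        key = tuple(np.minimum(((x - lo) / width).astype(int), n_bins - 1))
        c = cells[key]
        c[0] += 1        # count
        c[1] += x        # per-attribute sum
        c[2] += x * x    # per-attribute sum of squares
    return cells

cells = grid_summaries(np.random.default_rng(0).normal(size=(1000, 2)))
print(len(cells), "occupied cells for 1000 objects")
```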

They generally require minimal prior knowledge, can obtain quality clusters (sometimes even with multi-resolution (e.g., Sheikholeslami et al., 1998)), are data order independent, and are robust to outliers (Berkhin, 2006). More importantly, however, the grid structure can also naturally facilitate parallel/distributed processing (c.f. Parthasarathy et al., 2007), incremental updating of only those summaries related to the updated data (Wang et al., 1997b), and working with mixed type attributes (Berkhin, 2006). Its use has also been shown to improve algorithm scalability, make better use of memory, and provide clusters of better quality (as a result of algorithms being less dependent on initialisation) (e.g., Guha et al., 1998; Zhang et al., 2005; Garg et al., 2006); and it has been found useful for speeding up clustering algorithms based on kernel density distribution functions that model the overall density at a point analytically as the sum of the influence functions of the data points around it (e.g., Hinneburg and Keim, 1998; Hinneburg and Gabriel, 2007). However, while these algorithms (as well as many recent algorithms) have improved the scalability requirements for clustering large datasets and the quality of the resulting clusters, often


without the need to predetermine k, they are still not suitable for clustering the type of high dimensional data that this application is facing (though they should be adequate for modelling individuals' mobility patterns, as these are two-dimensional). The most pertinent part of our review of clustering follows.

2.3.4 Algorithms for clustering high dimensional data

Curse of Dimensionality Data embedded in a high dimensional space remains difficult for humans to interpret (Jain et al., 1999) despite the recent development of visualisation tools (e.g., Lee and Ong, 1996; Kandogan, 2001; Ankerst et al., 1999; Konig and Gratz, 2004; Ghosh and Strehl, 2004; Assent et al., 2007b). As the data dimensionality d increases, the sparseness of the data usually increases as a result; this leads to meaningless similarity measures, which are the foundation of clustering (Berchtold et al., 1997; Agrawal et al., 1998; Aggarwal et al., 2001), especially when distance-based proximity measures are used. This issue is known as the "curse of dimensionality" (Bellman, 1961, p.94) or the "empty space phenomenon" (Scott, 1992, p.84). It implies that there is a lack of data separation in high dimensional spaces (Aggarwal and Yu, 2000; Hastie et al., 2009, pp.22-27) and that nearest neighbours are not stable (Beyer et al., 1999). This phenomenon also means that algorithms based on the density notion (c.f. § 2.3.3) are less effective for clustering high dimensional data, and that outlier detection is more challenging (Aggarwal and Yu, 2001; Yu et al., 2002) while at the same time being more critical (Hinneburg and Keim, 1999).

Recall also that, as discussed in § 2.3.3, many more recent algorithms make use of (spatial) index tree structures for scalability improvements. However, their effectiveness has been shown to degrade rapidly for dimension d > 10; that is, having an index is then no better than simply doing sequential searches (Weber et al., 1998; Beyer et al., 1999; Chakrabarti and Mehrotra, 1999). While some not-so-recent developments in the index data structure have achieved higher dimensional limits (e.g., Berchtold et al., 1996, 1998), or have even been extended to the spatial-temporal (e.g., Zhang et al., 2003) or multi-dimensional (e.g., Gaede and Gunther, 1998; Bohm et al., 2001) domain, they typically still require overall superlinear runtime complexity (Bohm et al., 2000). Additionally, they are generally still somewhat limited in their usefulness for clustering 'real' high dimensional data (i.e., $d \gg 20$); though the index data structure itself is still an active research area (e.g., Houle and Sakuma, 2005; Manolopoulos et al., 2005).

On the other hand, there are a small number of algorithms based on the use of grids (e.g., Sheikholeslami et al., 1998; Hinneburg and Keim, 1998), random sampling (e.g., Guha et al., 1998), and the fractal dimension (e.g., Barbara and Chen, 2000) (which is not discussed here due to its uniqueness) that have been shown, or are believed, to work somewhat better with high dimensional data (i.e., $d \sim 20$). However, they too


increasingly lose effectiveness (and become more sensitive to noise) as the dimension d increases, as clusters are likely to be spread over many grid cells, with many of them actually being empty (Hinneburg and Keim, 1999). Nevertheless, being able to obtain (non-axis-parallel) 'optimal' grid partitions, by cutting low density regions and maximising cluster discrimination through a set of (contracting) projections divisively and recursively, has been shown to be useful for analysing high dimensional data (Hinneburg and Keim, 1999).

Dimension Reduction One approach to the high dimensional problem is to first reduce the dimensionality of the data prior to clustering; feature transformation/extraction and feature selection are the two logical methods (Han and Kamber, 2006, pp.435-436).

Feature transformation projects the data onto a smaller space while seeking to maintain the original relative distances between objects. Principal component analysis (PCA) (Jolliffe, 2002) is one of the most popular techniques, and was also suggested by Sheikholeslami et al. (1998) for addressing their algorithm's shortfall in clustering 'real' high dimensional data. However, PCA is only suitable for projecting Gaussian distributions (Cherkassky and Mulier, 2007, p.204); it is relatively computationally expensive (Witten and Frank, 2005, p.309), sensitive to noise, and often misses interesting details (Volkovich et al., 2006). Not surprisingly, PCA has been shown to be an ineffective technique for clustering high dimensional data (Kriegel et al., 2009). In fact, Chang (1983) argued that the principal components with the largest eigenvalues do not necessarily retain the cluster structure, which makes PCA unsuitable for clustering high dimensional data; PCA was also not recommended by Wedel and Kamakura (1998, p.59) for customer segmentation more generally.
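For concreteness, a brief feature-transformation sketch using scikit-learn's PCA is shown below (a generic usage example, assuming scikit-learn is available); per the caveats above, the leading components it keeps need not preserve the cluster structure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                 # toy 50-dimensional data
pca = PCA(n_components=5)                      # keep 5 leading components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.round(3))
```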

There are many more recent linear techniques such as independent component analysis (ICA), projection pursuit, random projections (RP), singular value decomposition (SVD) (c.f. Xu and Wunsch II, 2005; Hastie et al., 2009) and the wavelet transformation (WT) (Murtagh et al., 2000) which have shown improvements over PCA. However, it has also been shown that, for example, SVD is unable to achieve any 'real' dimensionality reduction (Agrawal et al., 1998); and RP, while useful for nearest neighbour search as in the case of index tree structures (Jain, 2010), can result in highly unstable clusters (Fern and Brodley, 2003). Their ineffectiveness is the result of not removing irrelevant attributes (Parsons et al., 2004), and of not taking into account that there may be different feature correlations for different clusters (Kriegel et al., 2009). Perhaps more importantly, at least for our application, transformation generates results with poor interpretability, and interpretability is critical to clustering (Agrawal et al., 1998). Of course, this also means that it is not practical to consider the computationally infeasible kernel or non-linear transformation techniques as preprocessing for clustering high dimensional data (c.f. Xu and Wunsch II, 2005). It is worthwhile pointing out, though, that some recent PCA-based (e.g., Bohm et al., 2004b; Tung et al., 2005; Kriegel et al., 2008) and SVD-based (e.g., Agarwal and Mustafa, 2004) clustering algorithms have been shown to discover even arbitrarily shaped high dimensional clusters, and some success has been reported with the use of the Hough transform (e.g., Achtert et al., 2008).

On the other hand, feature selection (Guyon and Elisseeff, 2003) aims to reduce the number of attributes in a dataset by removing irrelevant/redundant dimensions (globally); it has been shown, more generally, to be able to improve prediction performance, stability, and interpretation (Parsons et al., 2004). Note that most of these techniques were designed for supervised learning (Parsons et al., 2004). While there exists no universally accepted approach for measuring (unsupervised) clustering accuracy, and hence for guiding the attribute selection, a number of methods (e.g., entropy analysis) have been shown to be useful (Parsons et al., 2004). Unfortunately, these techniques are typically iterative in nature and hence not scalable (Parsons et al., 2004); and they can cause the loss of important information or even distort the real clusters (Aggarwal et al., 1999; Xu and Wunsch II, 2005). That is, feature selection done in the typical way generally cannot overcome the challenges of clustering high dimensional data (Kriegel et al., 2009). Interestingly, however, without any dimensionality reduction, some success was shown in McCallum et al. (2000) by first dividing the data into overlapping subsets, known as canopies, prior to clustering; this technique is more generally known as domain decomposition.

Subspace Clusters Fortunately, despite all the challenges faced in clustering high dimensional data, such data usually have an intrinsic dimensionality (Jain and Dubes, 1988, pp.42-46) that is much lower than the original dimension (Cherkassky and Mulier, 2007, p.178). That is, as observed by Agrawal et al. (1998), usually only a small number of different dimensions (i.e., subspaces) are relevant to certain clusters, whilst noisy signals often contribute information in the remaining unwanted dimensions. This also implies that the number of unwanted attributes grows with dimension, as objects are increasingly likely to be located in different dimensional subspaces (Berkhin, 2006). This phenomenon is sometimes referred to as "local feature relevance" or "local feature correlation" (Kriegel et al., 2009, p.5). That is, the algorithms discussed previously are not effective because they were developed to discover clusters in the full dimensional space (Agrawal et al., 1998), and it is not feasible to obtain clusters by searching through all possible combinations of subspaces (i.e., different combinations of features) (Parsons et al., 2004). Consequently, the challenge becomes being able instead to search (in a localised way) effectively and efficiently for groups of clusters within different subspaces of the same dataset.


Besides (sequential) pattern-based clustering (see the short discussion separately below), algorithms that aim to discover subspace clusters are often divided into two categories, subspace clustering algorithms and projected clustering algorithms (Parsons et al., 2004), although the distinction is sometimes not clear (Kriegel et al., 2009).

• Subspace clustering algorithms aim at finding all subspaces where clusters can be identified (Kriegel et al., 2009); thus their solutions can contain significant overlap, since they aim to discover all clusters in all subspaces.
• In contrast, projected clustering algorithms aim to find an assignment of each point to a subspace (Kriegel et al., 2009). They typically report non-overlapping clusters (i.e., a unique assignment for each object) and thus are sometimes referred to as partition-based clustering algorithms (Liu et al., 2009).

Note that, unlike subspace clustering algorithms, which always follow a bottom-up search approach, projected clustering algorithms often adopt a top-down approach in discovering the clusters (Kriegel et al., 2009); though, as pointed out by Kriegel et al. (2009), the distinction between these two methods should not be simplified to bottom-up 'dimension-growth' subspace algorithms versus top-down 'dimension-reduction' projected algorithms, as done in Han and Kamber (2006, p.434). Nevertheless, Han and Kamber (2006, Chapter 7) is the first ever 'textbook' review of high dimensional numerical data clustering, as nearly all papers in this research direction focus on algorithm development, with Parsons et al. (2004) and Kriegel et al. (2009), for example, being the exceptions.

Most of these relatively recent algorithms are axis-parallel in the sense that they do not focus on finding arbitrarily shaped clusters (Kriegel et al., 2009); algorithms that are non-axis-parallel are referred to separately as correlation clustering algorithms in Kriegel et al. (2009). Axis-parallel algorithms have the advantage of restricted search spaces, though these are still $O(2^d)$ with d the data dimensionality (Kriegel et al., 2009), and the clusters found are more meaningful for business applications such as this one. Note that many of these algorithms (e.g., Ng et al., 2005; Moise et al., 2008) can still discover arbitrarily shaped clusters, but the clusters are identified in hyper-rectangle format. Consequently, this review concentrates on the axis-parallel algorithms. Otherwise, it is worth pointing out that Kailing et al. (2003) proposed an approach for ranking interesting subspaces and showed some success for clustering high dimensional data.


Pattern-based Clustering Algorithms

As mentioned previously, pattern-based clustering algorithms (also known, for example, as bi-clustering (Cheng and Church, 2000) or co-clustering) also aim to discover subspace clusters (Kriegel et al., 2009). They, in contrast, focus on clustering categorical data, with application domains such as microarray (e.g., gene expression) data (e.g., Jiang et al., 2004; Madeira and Oliveira, 2004; Van Mechelen et al., 2004; Tanay et al., 2006). They have also been utilised for discovering association rules (Agrawal et al., 1993; Agrawal and Srikant, 1994) among transactions/webs/texts, and are useful for (e-commerce and retail) business applications such as recommender systems (e.g., Cho and Kim, 2004; Cho et al., 2002; Kim and Yum, 2005; Lee et al., 2002; Li et al., 2005; Wang et al., 2004). However, since our research focus is primarily on clustering numerical data, this particular topic is not reviewed further.

Subspace Clustering Algorithms

Recall that subspace clustering algorithms aim to identify all subspace clusters in all subspaces (Kriegel et al., 2009). To avoid exhaustive searches through all possible subspaces, they employ a bottom-up strategy based on the downward closure (also known as monotonicity) property of density. This property is based on the lemma that if a d′-dimensional unit is dense (i.e., contains one or more clusters), then so are its projections in (d′ − 1)-dimensional subspaces (Agrawal et al., 1998). That is, if a cluster is found in subspace S, it must also be found in every subspace S′ ⊆ S (Kriegel et al., 2009). Accordingly, subspace clustering algorithms (a minimal sketch of both steps follows the list):

• first identify the dense regions/units for each dimension; this is generally based on the use of a histogram with a predefined number of bins and a density threshold parameter; and
• then use the dimensions that contain dense regions to form clusters by combining adjacent dense regions; this 'integration' step typically, though not always (e.g., Liu et al., 2009), utilises an algorithm similar to Apriori (Agrawal et al., 1993; Agrawal and Srikant, 1994), developed for market basket analysis (i.e., for searching frequent itemsets in transactional databases).
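The sketch below is an illustrative (and much simplified) rendering of these two steps, not CLIQUE itself; the function names, bin count and density threshold are our own.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, n_bins=10, density_threshold=0.05):
    # Step 1 (sketch): histogram each dimension and keep the bins whose
    # fraction of points exceeds a global density threshold.
    n, d = X.shape
    dense = set()
    for dim in range(d):
        counts, _ = np.histogram(X[:, dim], bins=n_bins)
        for b in np.flatnonzero(counts / n > density_threshold):
            dense.add((dim, b))                  # a one-dimensional dense unit
    return dense

def candidate_2d_units(dense_1d):
    # Step 2 (sketch): Apriori-style join. By downward closure, a 2-D unit
    # can only be dense if both of its 1-D projections are dense, so only
    # such pairs need to be checked against the data.
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2)
            if u[0] != v[0]}                     # units from distinct dimensions

X = np.random.default_rng(0).normal(size=(500, 4))
dense = dense_units_1d(X)
print(len(dense), "dense 1-D units,",
      len(candidate_2d_units(dense)), "2-D candidates")
```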

The first and most well-known algorithm of this kind is CLIQUE (Agrawal et al., 1998), which is innovative and has had a significant influence on categorical data clustering (i.e., pattern-based clustering algorithms) (e.g., Ganti et al., 1999a; Cheng and Church, 2000; Wang et al., 2002a; Yang et al., 2002; Zaki et al., 2005).

The utilisation of an axis-parallel grid (c.f. histograms) implies that these algorithms are scalable, not sensitive to the order of records, and make no assumptions about the data distribution. The process of connecting dense regions means that they can handle somewhat arbitrarily shaped clusters, and they focus on producing clusters with good


interpretability rather than accurate cluster shapes. However, they may produce a fairly large number of overlapping clusters, with many of them being projections of higher dimensional clusters (Moise and Sander, 2008), which can make interpretation of the results more complicated (Aggarwal et al., 1999). Additionally, they may consider a relatively large number of objects to be outliers, which may not be acceptable for some applications (Aggarwal et al., 1999).

Many improvements have been made to CLIQUE. For example, the quality of the clusters has been shown to improve:

• by pruning unwanted subspaces with entropy (e.g., Cheng et al., 1999) instead

of the minimal description length (MDL) (Rissanen, 1983) used in CLIQUE;

• by using an adaptive grid instead of a fixed interval size static grid (e.g., Nagesh

et al., 2000); the bin cut-points on each dimension are analysed based on his-

tograms (Nagesh et al., 2000; Chang and Jin, 2002). This strategy can elimi-

nate the use of the pruning techniques which could result in missing clusters

(Nagesh et al., 2000);

• by allowing histogram bins to be overlapping (e.g., Liu et al., 2007, 2009); and

• by varying the density threshold parameter, either globally (e.g., Sequeira and Zaki, 2004) or more adaptively in the sense of taking the dimensionality into consideration (e.g., Assent et al., 2007a).

In terms of computational requirements, CLIQUE scales linearly with the size of the inputs (Agrawal et al., 1998). Although its complete time complexity is data dependent, its most computationally demanding step is $O(n\, d_{I_{\max}} + c^{\,d_{I_{\max}}})$, with n the number of observations, $d_{I_{\max}}$ the highest cluster intrinsic dimensionality (which is generally $\ll d$, the full data dimensionality) and c a constant (Agrawal et al., 1998). However, despite the fact that subspace clustering algorithms are generally already faster than projected clustering algorithms (to be discussed in § 2.3.4) (Parsons et al., 2004), efficient techniques can still be utilised to further improve their performance, as we discuss in more detail below; note that this is typically still the case even though many projected clustering algorithms have already adopted some form of random sampling strategy (Moise and Sander, 2008) to make scalable improvements to efficiency. For example, by:

• allowing the algorithm to perform in a parallel/distributed fashion (e.g., Nagesh et al., 2000), or
• adopting a 'filter(-refinement) architecture' which can approximate clusters without performing the worst-case search procedure, not merging dense regions in the typical Apriori style but instead grouping one-dimensional dense regions, so-called base-clusters, through the use of a modified DBSCAN (c.f. § 2.3.3) (Kriegel et al., 2005).


However, even though these algorithms do not require the number of clusters k to be predetermined (Agrawal et al., 1998), they have been found to be generally rather sensitive to their difficult-to-determine input parameter values; though this is generally the case for most high dimensional data clustering algorithms (Parsons et al., 2004; Yip et al., 2005; Moise et al., 2008; Kriegel et al., 2009). It is worth pointing out that, while most subspace clustering algorithms are grid-based, some algorithms (e.g., Kailing et al., 2004; Assent et al., 2007a) are built on the notion of density, for example by utilising DBSCAN (c.f. § 2.3.3) instead of histograms for identifying the dense regions of each dimension; these algorithms can produce better clustering results but require more computation.

Projected Clustering Algorithms

In contrast to the subspace clustering algorithms discussed above, projected clustering algorithms typically cluster data in a top-down, partition-like fashion. Generally speaking, they aim to:

• first locate an initial approximation of the clusters in the full set of equally weighted dimensions (Parsons et al., 2004); and
• then adjust the feature weightings and evaluate the subspaces of each cluster iteratively (Parsons et al., 2004); though some algorithms (e.g., Friedman and Meulman, 2004; Achtert et al., 2007a) adjust each dimension weight of each instance.

Accordingly, these algorithms typically use some measure, either explicit or implicit, of similarity on the attributes or observations of interest (Moise and Sander, 2008). They are, by design, computationally more demanding than the subspace clustering approaches. Additionally, they may require the number of clusters k as an input (e.g., Bohm et al., 2004a; Friedman and Meulman, 2004), or even the average cluster dimensionality $d_{I_{avg}}$ (e.g., Aggarwal et al., 1999; Aggarwal and Yu, 2000).

In spite of this, projected clustering algorithms can often obtain clusters of better quality than the subspace clustering approach; the subspace clustering approach typically utilises global density thresholds, which are problematic as density decreases with increasing dimensionality (Parsons et al., 2004; Moise et al., 2008; Moise and Sander, 2008). As pointed out by Moise and Sander (2008), if the values chosen for the global density threshold are too large this will encourage the formation of only low dimensional clusters, while values that are too small will lead to many outlying clusters in addition to the real clusters of higher dimensionality. As to the usefulness of the results, some researchers (e.g., Aggarwal et al., 1999) believe non-overlapping clusters, as often obtained by projected clustering algorithms, provide clearer cluster interpretations; while others argue that they result in the loss of useful information (e.g., Kriegel et al., 2005). Interestingly, however,


the few projected clustering algorithms (Procopiuc et al., 2002; Moise et al., 2008) that can produce either type of solution have shown some scalability and cluster quality improvements in obtaining clusters that do not overlap significantly.

The first two projected clustering algorithms were:

• the axis-parallel PROCLUS (Aggarwal et al., 1999), which aims to group objects located closely in each of the related dimensions of its associated subspace, and
• the 'improved' non-axis-parallel ORCLUS, which aims to discover arbitrarily shaped clusters (Aggarwal and Yu, 2000).

However, while both algorithms were designed for high dimensional data, they both behave like k-medoid algorithms (Moise et al., 2008). This is because these algorithms initialise the clusters based on distance calculations involving all the dimensions (Moise et al., 2008). Accordingly, these two algorithms tend to produce clusters of similar size with spherical shapes (Kriegel et al., 2009), as in the case of classical clustering algorithms (c.f. § 2.3.2). Additionally, the use of mean squared error (MSE) as their objective function is also considered problematic, since having fewer dimensions will always result in a better MSE (Moise et al., 2008). Furthermore, their use of random sampling (c.f. Guha et al., 1998) for cluster initialisation means that they can miss interesting clusters and obtain different results each time (Kriegel et al., 2009); though many more recent algorithms (e.g., Procopiuc et al., 2002; Woo et al., 2004; Yiu and Mamoulis, 2005) still adopt the use of random sampling as a result of the general scalability challenges of the projected clustering approach (Parsons et al., 2004). Interestingly, the relatively computationally demanding decision tree has also been considered for projected clustering algorithms (e.g., Liu et al., 2000); a cluster split is determined by evaluating the information gain, where the calculation is based on labelling all observations with a common class and 'adding' uniformly distributed data with a different class.

Locality Assumption It is important to point out that projected clustering algorithms often assume that the subspace of each cluster can be determined locally, i.e., based on the local neighbourhood corresponding to members of the cluster or to the cluster representatives (Kriegel et al., 2009), despite the analysed data being high dimensional. In other words, many projected clustering algorithms (e.g., Friedman and Meulman, 2004) aim to derive insights, for example the 'true' subspace of an observation, through its nearest neighbours (c.f. density-based clustering algorithms; § 2.3.3). In fact, such a tactic has also been utilised even by algorithms (e.g., Achtert et al., 2007c,b) that aim to uncover arbitrarily shaped subspace clusters. However, this implies that the 'definitions' of distance and nearest neighbours clearly need to be redefined. Some ideas are discussed below.

As in the case of subspace clustering algorithms, the well-known DBSCAN (c.f. § 2.3.3) has also been utilised here. However, instead of applying DBSCAN to identify the dense regions of each dimension, as done by some subspace clustering algorithms, some projected clustering algorithms (e.g., Bohm et al., 2004a,b) have modified DBSCAN such that it can cluster data full-dimensionally. The key modification, of course, is the calculation of the distance between observations. Instead of using the simple Euclidean distance, which clearly suffers from the curse of dimensionality, they propose to use weighted Euclidean distances based essentially only on the preferred dimensions (Bohm et al., 2004a) or eigenbases (Bohm et al., 2004b) of the observations.

Alternatively, a specialised distance measure, the dimension oriented distance (DoD), was introduced by FINDIT (Woo et al., 2004) for determining an observation's/instance's nearest neighbours, and this appears to be robust. The DoD measures the similarity between two observations/instances by counting the number of dimensions in which their Manhattan distance is less than a given ε; the actual value of the Manhattan distance is not important, as it is simply utilised to determine whether the two observations/instances are 'close enough' with respect to that particular attribute. Obviously, the largest DoD implies that the two observations/instances are the 'nearest neighbours' (a minimal sketch of the measure is given below). FINDIT then proposes a 'dimensional voting policy' for determining an observation's/instance's likely subspace, i.e., its correlated dimensions; the relevant dimensions are determined by the number of its neighbours that are considered to be 'close'. Such a strategy is perhaps an improvement over the use of random sampling (e.g., Procopiuc et al., 2002); however, fixing the number of reliable 'voters' may be somewhat restrictive.
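The following is our own minimal rendering of the DoD measure as described above, not the FINDIT implementation.

```python
import numpy as np

def dod(a, b, eps):
    # dimension oriented distance: the number of attributes on which two
    # observations are 'close enough', i.e. their per-attribute absolute
    # (Manhattan) difference is below eps; a larger count indicates a
    # nearer neighbour
    return int(np.sum(np.abs(np.asarray(a) - np.asarray(b)) < eps))

print(dod([1.0, 2.0, 9.0], [1.1, 2.2, 3.0], eps=0.5))   # -> 2
```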

Bottom-up Strategy Some projected clustering algorithms (e.g., Moise et al., 2008), conversely, have adopted a bottom-up strategy, as in the case of subspace clustering algorithms such as CLIQUE. However, not all of them adopt this for efficiency purposes. HARP (Yip et al., 2004) is one such example; it is the first algorithm that aims to produce a cluster hierarchy and can eliminate the need to have the number of clusters k predetermined. In essence, HARP is a slow, single-link-like agglomerative hierarchical clustering algorithm (Kriegel et al., 2009) which merges observations/clusters based on the proposed 'relevance score' with respect to the dimensions. While HARP has trouble identifying low dimensional clusters, as is the case for most other algorithms (Kriegel et al., 2009), its 'related' algorithm, the k-medoid-like top-down SSPC (Yip et al., 2005), has been shown to be able to discover clusters of low dimensionality; SSPC successfully avoids extensive distance calculations involving all dimensions by basing its objective function on HARP's 'relevance score' measure. Nonetheless, it is worthwhile pointing out that some recent top-down algorithms (e.g., Achtert et al., 2007a) can also obtain the cluster hierarchy. SSPC, moreover, is a semi-supervised algorithm, which is rather unique; though there appears to be a growing interest in clustering within the semi-supervised framework (Chapelle et al., 2010).


Other bottom-up projected clustering algorithms include EPCH (Ng et al., 2005) and P3C (Moise et al., 2008) which, as in the case of HARP, also conveniently do not require the number of clusters k as an input. Both algorithms assume that clusters' low dimensional projections will 'stand out' (Moise and Sander, 2008), similar to the subspace clustering algorithms. However, rather than identifying the low dimensional dense regions utilising a predefined global density threshold, as done by most subspace clustering algorithms, they have adopted different approaches: EPCH iteratively lowers the threshold, whereas P3C applies a chi-square goodness-of-fit test to examine whether attributes are uniformly distributed over the histogram bins.

Yet the processes by which EPCH and P3C obtain the subspace clusters are quite different. In EPCH, after all low dimensional dense regions have been identified, a 'signature' is derived for each observation, recording in which dense regions the object is found. By comparing the 'signature' coefficients, the degree of similarity between two observations is determined; similar objects/clusters are merged until at most a user-specified number of clusters is obtained. Many similarity 'rules' have been introduced by the algorithm. On the other hand, P3C locates 'cluster cores', initial maximal-dimensional subspace cluster approximations, by merging dense regions with an Apriori-like process. The 'cluster cores' are then iteratively refined with an EM-like procedure; and observations are assigned to the nearest 'cluster core' based on the Mahalanobis distance (MD) (Mahalanobis, 1936), as sketched below.
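The snippet below is a generic sketch of such a Mahalanobis-based assignment step, not the P3C authors' implementation; the 'cores' and observation are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    # squared Mahalanobis distance of x from a 'cluster core' with the
    # given mean and covariance
    diff = np.asarray(x) - np.asarray(mean)
    return float(diff @ np.linalg.inv(cov) @ diff)

# assign an observation to the nearest of several hypothetical cores
cores = [(np.array([0.0, 0.0]), np.eye(2)),
         (np.array([5.0, 5.0]), np.diag([2.0, 0.5]))]
x = np.array([4.0, 4.5])
print(min(range(len(cores)),
          key=lambda j: mahalanobis_sq(x, *cores[j])))   # -> 1
```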

Interestingly, for the completeness of this review of projected clustering algorithms, note that there exist some relatively efficient but unique (top-down) algorithms (e.g., Procopiuc et al., 2002; Yiu and Mamoulis, 2005) that do not produce all clusters at the same time; rather, they discover clusters in a sequential manner. However, these algorithms are based on the inappropriate assumption that clusters are hypercube-like with a fixed width for all attributes (Moise et al., 2008), and usually have problems with clusters of significantly different dimensionality (Kriegel et al., 2009). The paper of Yiu and Mamoulis (2005) is perhaps more significant for showing the suitability, and the algorithm scalability improvements, of adopting FP-Growth (Han et al., 2000) instead of an Apriori-like process; this tactic can be, and has already been, adopted by some subspace clustering algorithms (e.g., Liu et al., 2009).

Overall, the iterative nature of the projected clustering algorithms, and the use of random sampling by some, imply that they are not suitable for real time applications; their time complexity requirements are also more challenging to determine. Algorithms such as P3C did not specify their time complexity. Experimental evaluations (Moise et al., 2008; Moise and Sander, 2008) between P3C and the subspace clustering algorithm MAFIA, which is considered faster than CLIQUE, indicate that P3C (as well as many other projected clustering algorithms) is about 10-100 times slower, but is about 10-100 times faster than PRIM (c.f. bump hunting) (Friedman and Fisher, 1999). On the other hand, while projected clustering algorithms may miss clusters (Kriegel et al., 2009), and there are no agreed cluster quality evaluation criteria (Patrikainen and Meila, 2006), it appears that the parameter-robust P3C and SSPC generally perform well, particularly in identifying clusters with the lower intrinsic dimensionality that is more realistic in the real world (Yip et al., 2004; Cherkassky and Mulier, 2007, p.178). Though, only P3C is also suitable for clustering categorical/mixed type attribute data, which could be useful for extensions of this application.

2.3.5 Review conclusion

Classical algorithms such as k-means (KM) (MacQueen, 1967) have been utilised widely in various applications; KM will be applied in Chapter 5 for segmenting customer behaviour (c.f. Wedel and Kamakura, 1998, Chapter 5). KM's limitations, and more generally the subject of clustering itself, have been well documented (e.g., Han and Kamber, 2006, Chapter 7). In the 1990s, there was a significant amount of research into developing more scalable clustering algorithms capable of discovering non-convex shaped clusters without the need to predetermine the number of clusters k, which in turn addressed some earlier algorithm shortfalls. DBSCAN (Ester et al., 1996) is perhaps the most influential algorithm from that period; it has recently been shown to be useful for identifying individuals' highly visited locations (Nurmi and Koolwaaij, 2006), an application which is somewhat similar to one of this research's objectives. Its effectiveness in comparison to mixture models will be examined in Chapter 4.

Since 1998, many efficient algorithms suitable for identifying hidden subspace clusters in a high dimensional space have been proposed based on geometric considerations that avoid exhaustive subspace searches. The majority of them can be referred to as grid-based; they utilise histograms (c.f. grids) as a density estimation tool for identifying low dimensional dense regions corresponding to the low dimensional projections of the subspace clusters. However, the effectiveness of these grid-based algorithms depends on the granularity and the positioning of the grid (Kriegel et al., 2009). The well-known Sturges' rule (Sturges, 1926), which determines the number of bins as (1 + log2(n)) with n the number of observations, has been used frequently by algorithms such as P3C (Moise et al., 2008); this rule, however, has been shown to be ineffective for n > 100 or 200 (Hyndman, 1995; Scott, 2009); for example, n = 1,000 observations would be summarised with only about 11 bins. An alternative approach for clustering high dimensional data will be investigated in Chapter 6; it may be useful to model low dimensional density distributions with mixture models instead of the common grid/histogram approach. Finally, it is worthwhile pointing out that Moise and Sander (2008) have recently proposed an approach for determining whether a subspace cluster is statistically significant; this approach is related to scan statistics (Agarwal et al., 2006) and bump hunting (Friedman and Fisher, 1999).


3 Variational Bayesian Method: Component Elimination, Initialization & Circular Data

Abstract

The recently popular variational Bayesian (VB) method is an efficient non-simulation-based alternative to Bayesian approaches such as the Markov chain Monte Carlo (MCMC) method. A key practical advantage of VB in fitting data with a Gaussian mixture model (GMM) is its ability to effectively and progressively eliminate redundant components specified in the initial model, thereby simultaneously estimating model complexity and parameters. In this paper, we consider the potential implications of this irreversible VB property. We then outline an extension of the VB approach to modeling circular data represented by a truncated GMM. We consider the usefulness and effectiveness of this approach and evaluate how different observation component allocation initialization schemes may influence the results.

Keywords

Variational Bayes (VB); Gaussian Mixture Model (GMM); Component Elimination; Initialization; Directional/Circular Statistics

3.1 Introduction

Mixture models provide a convenient, flexible way to model data; a popular and computationally efficient approach is to use Gaussian mixture models (GMMs) (McLachlan and Peel, 2000). In this paper, we investigate the increasingly popular variational Bayesian (VB) method for GMMs, which has been shown to approximate Bayesian posterior distributions efficiently and has already been used in various applications (Wang and Titterington, 2006). One of VB's unique features for mixture modeling is its automatic redundant component elimination property; we demonstrate the usefulness of this feature, as well as its generally overlooked implications, in a one-dimensional scenario. We base results on the algorithm of McGrory and Titterington (2007). In addition, we extend this one-dimensional VB-GMM approach to approximate the distribution of real world circular data. Our empirical results reveal that such an approach is generally sufficient; the implications of several different observation component allocation initialization schemes are also evaluated. Our application dataset comprises the temporal usage patterns of phone customers over a 24-hour period; being able to summarize each user's behavior more formally can provide businesses with the means to obtain a better understanding of customer behavior, better customer profiling/differentiation, and hence better customer relationships (c.f. Wu et al., 2010a).

The deterministic VB method for Bayesian inference was first formally outlined in Attias (1999). It is more efficient in terms of both computation and storage requirements than most other approximate Bayesian approaches such as Markov chain Monte Carlo (MCMC). Additionally, unlike Monte Carlo based approaches, VB does not suffer from the mixing or label switching problems (c.f. Celeux et al., 2000), or the difficulties with assessing convergence (Jaakkola and Jordan, 2000; Wang and Titterington, 2006). Being a Bayesian method, VB suffers less from the over-fitting and singularity problems that persist in maximum likelihood (ML) approaches (Attias, 1999); and theoretically it has been shown to be asymptotically consistent in approximating mixture models with a fixed number of components k (Wang and Titterington, 2006). Given that a central issue of mixture modeling is the selection of a suitable k (McLachlan and Peel, 2000), a key practical advantage of VB over ML approaches is its ability to automatically select k to give the 'best' fit to the data according to the variational optimization, while estimating the model parameter values and their posterior distributions at the same time (c.f. Richardson and Green, 1997).

We use the term standard VB-based algorithms to refer to those algorithms that do not allow components to be split and/or merged. These algorithms select k through the complexity reduction property of the VB approximation; this property leads to the progressive elimination of redundant components (as their component weights tend towards zero) that were specified in the initial model during convergence (McGrory and Titterington, 2007). Note that this implies that the final k, kfinal, in the model cannot be greater than the initial specification of k, kinitial; it is worthwhile to mention, though, that non-standard algorithms (e.g., Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007; Wu et al., 2010b) which allow components to be split do not face such limitations. This automatic feature of the approximation has been observed by many researchers (e.g., Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007), and has been shown to perform as well as or better than the use of the expectation-maximization (EM) algorithm (Dempster et al., 1977) with the Bayesian information criterion (BIC) (Schwarz, 1978) (e.g., Watanabe et al., 2002; Teschendorff et al., 2005); but its theoretical reasoning is still not well understood. Despite the usefulness and effectiveness of this VB component elimination property, its irreversible nature can potentially lead to incorrect or different models in practice. This is because a component might be eliminated prematurely in the iterative convergence towards a solution, removing the opportunity for appropriate observations to be allocated to it. Discussion of this issue has generally been overlooked in the literature; here we give some consideration to its implications.

As indicated earlier, our application involves analyzing users' phone activity patterns over a 24-hour period, and it is therefore critical to take the circular characteristics of the data into consideration; i.e., the difference between hour 0 and hour 24 is not 24 hours, but rather zero hours. Data analysis of this kind is often referred to as circular or directional statistics (Fisher, 1996; Mardia and Jupp, 2000; Jammalamadaka and Sengupta, 2001). Numerous distributional models have already been proposed for analyzing this type of data; the von Mises family, also known as the circular normal (Jammalamadaka and Sengupta, 2001), is perhaps the most popular choice, and the wrapped distributional family is also useful (Mardia and Jupp, 2000, pp.32-52). However, these models have typically focused on modeling unimodal and symmetric data and are therefore not well suited to many real world applications.

Recently some researchers (e.g., Fernandez-Duran, 2004; Pewsey, 2008; McVinish and Mengersen, 2008) have successfully modeled more complicated circular patterns either parametrically or semi-parametrically. For example, one computationally demanding approach is to model them using a mixture of von Mises distributions (Ghosh et al., 2003). However, VB cannot easily be utilized when taking this approach because the algebraic expressions required for the update equations are not computationally straightforward to evaluate. An EM-based analysis of a mixture of wrapped distributions (c.f. Fisher and Lee, 1994) can be computationally inefficient due to the infinite sums that need to be approximated at each step. Lees et al. (2007) present a VB-based implementation of a wrapped normal analysis, but they appear to have overly simplified the study by assuming only one distribution wrapping on the circle; this analysis would otherwise be more computationally demanding. Another option is to use non-parametric kernel density estimation based on the von Mises-Fisher kernel (Mardia and Jupp, 2000, pp.277-278), but the results of the kernel approach depend on the degree of smoothing and lack the interpretability that may be critical for some applications such as ours.

Consequently, in this paper, we propose a simple approach that circumvents such problems by applying the VB-GMM approach for interval data to the circular data problem: the data are padded by repeating them at both ends (c.f. Mardia and Jupp, 2000, p.4), and the resulting models are then normalized, i.e., f(x; 0 ≤ x ≤ 24). We acknowledge that this tactic can only be used as an approximation, but we have found it to be generally useful; a short sketch of the tactic is given below. We investigate how sensitive our modeling approach is to the extent to which we pad out the ends; we do this by comparing results when we repeat data at both ends up to either 1/4, 1/2 or 1 cycle; that is, we consider situations where the overall patterns analyzed comprise either 1.5, 2 or 3 complete data cycles. Note that we found that the usage pattern of each user was quite different on weekdays than it was at the weekend; here we restrict our focus to the weekday activities.
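A minimal Python sketch of this padding-and-normalization tactic follows; the function names are ours, and the renormalization simply divides the fitted GMM density by its probability mass on [0, 24] to give the truncated mixture:

    import numpy as np
    from scipy.stats import norm

    def pad_circular(hours, pad_frac, period=24.0):
        # Repeat circular data at both ends of the interval; pad_frac of
        # 0.25, 0.5 or 1.0 gives the 1.5-, 2- and 3-cycle scenarios.
        hours = np.asarray(hours, dtype=float)
        w = pad_frac * period
        left = hours[hours >= period - w] - period    # e.g. 23:00 -> -1:00
        right = hours[hours <= w] + period            # e.g. 01:00 -> 25:00
        return np.concatenate([left, hours, right])

    def truncated_gmm_pdf(t, weights, means, sds, lo=0.0, hi=24.0):
        # Normalize a GMM fitted to padded data back onto [lo, hi],
        # i.e. f(x; 0 <= x <= 24).
        t = np.asarray(t, dtype=float)
        dens = np.sum(weights * norm.pdf(t[:, None], means, sds), axis=1)
        mass = np.sum(weights * (norm.cdf(hi, means, sds)
                                 - norm.cdf(lo, means, sds)))
        return dens / mass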

While VB improves the model approximation at each iteration (in a similar manner to the EM algorithm), like many other algorithms including EM (Ueda et al., 2000), it is still somewhat sensitive to the initialization of the component membership of each observation (c.f. Watanabe et al., 2002; Wu et al., 2010b). In this paper, we also demonstrate this issue through modeling each user's heterogeneous temporal calling pattern. We compare the fitted models that arise when using the following three simple initialization schemes (with all other prior settings being non-informative, with the exception of kinitial); for the more informative schemes, i.e., Partitioned and Overlapping, observation component allocations correspond to intervals:

1. Random: assigning the component membership j of each observation i, which we call i(j), non-informatively and thus randomly;

2. Partitioned: assigning each i(j) more informatively based on which non-overlapping equal-width interval the observed value falls into (we focus on the equal-width interval setting here); and

3. Overlapping: similar to Partitioned but using an overlapping interval setting, i.e., allowing an observation to have equal chances of being initialized with either of the components corresponding to the overlapping intervals. An example of the 17 overlapping interval setting for the data range from -6 to 30 (corresponding to having 6 hours of information padded on both sides of the 24-hour period data) is shown in Figure 3.1; a sketch of all three schemes follows this list.
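A Python sketch of the three allocation schemes is given below. The interval geometry for Overlapping is inferred from Figure 3.1 (each interval spans twice the step between interval starts, so interior points fall in exactly two intervals); treat that construction, and the function name, as our assumptions:

    import numpy as np

    def initial_allocation(x, k, scheme, lo=-6.0, hi=30.0, seed=0):
        # Initial component membership i(j) under the three schemes.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        if scheme == "Random":
            return rng.integers(k, size=n)
        if scheme == "Partitioned":
            # k non-overlapping equal-width intervals over [lo, hi].
            width = (hi - lo) / k
            return np.clip(((x - lo) // width).astype(int), 0, k - 1)
        if scheme == "Overlapping":
            # k intervals of width 2*step, offset by step; with k = 17
            # on [-6, 30] this gives step 2 and width 4, as in Fig. 3.1.
            step = (hi - lo) / (k + 1)
            cell = ((x - lo) // step).astype(int)
            cand = np.clip(np.stack([cell - 1, cell], axis=1), 0, k - 1)
            return cand[np.arange(n), rng.integers(2, size=n)]
        raise ValueError(f"unknown scheme: {scheme}")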

Note that it is quite obvious that the VB-GMM algorithm with Random initialization will require more iterations to reach a converged model; thus our focus here is solely on the goodness-of-fit among the models fitted with the different initialization schemes after a large number of iterations, i.e., after 1,000 iterations have been executed. This also implies that we are less interested in the resulting fitted kfinal's, but we choose larger values for kinitial, which should allow more chance of finding a better fit to the model. We compare the goodness-of-fit of the results by considering Kuiper's (1962) test statistic and the mean absolute error (MAE).

Figure 3.1: Overlapping initialization scheme; 17 equal-width overlapping intervals (#1 to #17) covering the padded data range from -6 to 30.

We organize the rest of this paper as follows. In Section 3.2, we briefly discuss fitting GMMs with VB. In Section 3.3, we detail Kuiper's test statistic and the MAE. We present the results in Section 3.4 and conclude in Section 3.5.

3.2 VB-GMM Algorithm

In a GMM, it is assumed that all $k$ underlying distributions (or components) of the mixture are Gaussian. In the notation we adopt here, the density of an observation from the data $x = (x_1, \ldots, x_n)$ is given by $\sum_{j=1}^{k} w_j N(x; \mu_j, \tau_j^{-1})$, where $k \in \mathbb{N}$, $\mu_j$ and $\tau_j^{-1}$ represent the mean and variance, respectively, of the $j$th component, each mixing coefficient $w_j$ satisfies $0 \le w_j$ and $\sum_{j=1}^{k} w_j = 1$, and $N(\cdot)$ denotes a Gaussian density. In the Bayesian framework, inference is based on the target posterior distribution, $p(\theta, z \mid x)$, where $\theta$ denotes the model parameters $(\mu, \tau, w)$ and $z = \{z_{ij}\}$ denotes the missing component membership information of the observations $x$. Note that the $z_{ij}$'s are indicator variables such that $z_{ij} = 1$ if observation $x_i$ belongs to the $j$th component and $z_{ij} = 0$ otherwise.
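As a small worked example of this density (a sketch; the function name is ours), the mixture density can be evaluated directly from $(w, \mu, \tau)$:

    import numpy as np
    from scipy.stats import norm

    def gmm_density(x, weights, means, precisions):
        # sum_j w_j N(x; mu_j, tau_j^{-1}); precisions hold tau_j, so the
        # component standard deviations are tau_j^{-1/2}.
        sds = 1.0 / np.sqrt(np.asarray(precisions, dtype=float))
        x = np.asarray(x, dtype=float)
        return np.sum(weights * norm.pdf(x[..., None], means, sds), axis=-1)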

The target posterior is not analytically available in this mixture model problem, as is generally the case, and therefore it has to be approximated in the Bayesian inference approach. The idea of the VB approach is to approximate the target posterior by a variational distribution which we denote by $q(\theta, z \mid x)$. Importantly, it is a mean-field type approach in that it is assumed that this approximating distribution factorizes over the model parameters $\theta$ and the missing variables $z$; this means that we can write $q(\theta, z \mid x) = q_\theta(\theta \mid x) \times q_z(z \mid x)$. In order to obtain a good approximation to the target, the distribution $q(\theta, z \mid x)$ overall must be chosen carefully so that it can approximate the true conditional density well, while $q_\theta(\theta \mid x)$ and $q_z(z \mid x)$ provide the computational convenience needed. VB's objective is to maximize the lower bound on the log marginal likelihood, $\log p(x)$. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between the target posterior and the variational approximating distribution. This approach leads to tractable coupled expressions for the variational posterior over the parameters which can be iteratively updated, in a similar fashion to the classical EM algorithm, to obtain convergence to a solution.

Most of the papers on the subject of fitting GMMs with VB (e.g., Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007) make similar prior assumptions, but they differ in the form of the model hierarchy used. As indicated previously, we follow the model setting described in McGrory and Titterington (2007), but we do not make use of the deviance information criterion (DIC) as a complementary model selection criterion as they did in that paper (c.f. Spiegelhalter et al., 2002; Celeux et al., 2006). We model the pattern as a mixture of $k$ Gaussian distributions with unknown means $\mu = (\mu_1, \ldots, \mu_k)$, precisions $\tau = (\tau_1, \ldots, \tau_k)$ and mixing coefficients $w = (w_1, \ldots, w_k)$, such that

$$p(x, z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{ w_j N\left(x_i; \mu_j, \tau_j^{-1}\right) \right\}^{z_{ij}},$$

with the joint distribution being $p(x, z, \theta) = p(x, z \mid \theta)\, p(w)\, p(\mu \mid \tau)\, p(\tau)$. We express our priors as:

$$p(w) = \text{Dirichlet}\left(w; \alpha_1^{(0)}, \ldots, \alpha_k^{(0)}\right),$$

$$p(\mu \mid \tau) = \prod_{j=1}^{k} N\left(\mu_j; m_j^{(0)}, \left(\beta_j^{(0)} \tau_j\right)^{-1}\right), \quad \text{and}$$

$$p(\tau) = \prod_{j=1}^{k} \text{Gamma}\left(\tau_j; \tfrac{1}{2}\upsilon_j^{(0)}, \tfrac{1}{2}\sigma_j^{(0)}\right),$$

with $\alpha^{(0)}$, $\beta^{(0)}$, $m^{(0)}$, $\upsilon^{(0)}$, and $\sigma^{(0)}$ being known, user-chosen initial values. These are the standard conjugate priors used in Bayesian mixture modeling (Gelman et al., 2004). Using the lower bound approximation, the posteriors are then:

$$q_w(w) = \text{Dirichlet}(w; \alpha_1, \ldots, \alpha_k),$$

$$q_{\mu \mid \tau}(\mu \mid \tau) = \prod_{j=1}^{k} N\left(\mu_j; m_j, (\beta_j \tau_j)^{-1}\right), \quad \text{and}$$

$$q_\tau(\tau) = \prod_{j=1}^{k} \text{Gamma}\left(\tau_j; \tfrac{1}{2}\upsilon_j, \tfrac{1}{2}\sigma_j\right).$$

The posterior parameters are iteratively updated as:

$$\alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \beta_j = \beta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij},$$

$$m_j = \frac{1}{\beta_j}\left(\beta_j^{(0)} m_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i\right), \quad \text{and}$$

$$\sigma_j = \sigma_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i^2 + \beta_j^{(0)} \left(m_j^{(0)}\right)^2 - \beta_j m_j^2,$$

where $q_{ij}$ is the VB posterior probability that $z_{ij} = 1$, and the posterior expectations are given by $E(\mu_j) = m_j$ and $E(\tau_j) = \upsilon_j \sigma_j^{-1}$. Please refer to McGrory and Titterington (2007), for example, for more details on VB-GMM. Finally, we emphasize that approximating circular data with a truncated GMM, the simple circumventing approach we propose here, requires the data to first be padded on both sides of the interval prior to the modeling.
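A minimal NumPy sketch of this iterative scheme for one-dimensional data follows. It is not the exact implementation used here: the responsibility expressions are the standard mean-field results implied by, but not spelled out in, the updates above, and the vague prior values, the hard pruning threshold w_floor used to mimic component elimination, and the function name are all our assumptions:

    import numpy as np
    from scipy.special import digamma

    def vb_gmm_1d(x, k, n_iter=1000, seed=0, w_floor=1e-3):
        # VB-GMM with Dirichlet(alpha) on w, N(m, (beta*tau)^-1) on mu|tau
        # and Gamma(upsilon/2, sigma/2) (rate parameterization) on tau.
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        n = x.size
        alpha0, beta0, ups0, sig0 = 1e-2, 1e-2, 1e-2, 1e-2  # vague priors
        m0 = x.mean()
        # Random initialization scheme: hard-assign each observation.
        q = np.zeros((n, k))
        q[np.arange(n), rng.integers(k, size=n)] = 1.0
        for _ in range(n_iter):
            nj = q.sum(axis=0)                    # sum_i q_ij per component
            alpha, beta, ups = alpha0 + nj, beta0 + nj, ups0 + nj
            m = (beta0 * m0 + q.T @ x) / beta
            sig = sig0 + q.T @ x**2 + beta0 * m0**2 - beta * m**2
            # Mean-field responsibilities (up to an additive constant).
            log_q = (digamma(alpha) - digamma(alpha.sum())
                     + 0.5 * (digamma(0.5 * ups) - np.log(0.5 * sig))
                     - 0.5 * (1.0 / beta + (ups / sig)
                              * (x[:, None] - m) ** 2))
            q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
            q /= q.sum(axis=1, keepdims=True)
            # Irreversible elimination: drop near-empty components.
            keep = alpha / alpha.sum() > w_floor
            if not keep.all():
                q = q[:, keep]
                q /= q.sum(axis=1, keepdims=True)
        w_hat = alpha[keep] / alpha[keep].sum()
        return w_hat, m[keep], np.sqrt(sig[keep] / ups[keep])

The function returns the estimated mixing weights, the posterior means $m_j$ and the posterior standard deviation estimates $(\sigma_j/\upsilon_j)^{1/2}$ for the surviving components.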

3.3 Model Evaluation Criterion

Kuiper’s test statistic (Mardia and Jupp, 2000, pp.99-103) is based on the popular

Kolmogorov-Smirnov (KS) test statistics (c.f. Sheskin, 2004). However, unlike KS,

Kuiper is suitable for comparing two circular distributions non-parametrically as it

does not depend on the choice of origin. It is based on evaluating the cumulative

distribution function (CDF), and can be expressed as:

Vn = max1≤i≤n

(F(x′i

)− Sn

(x′i

))− min

1≤i≤n

(F(x′i

)− Sn

(x′i

))In this expression, x is the data re-arranged into increasing numerical order in an

array x′; Sn is the sample distribution function of the data and F is the fitted distri-

bution function of the GMM. However, we felt that simply evaluating the goodness-

of-fit of the results based on the combined largest deviances on both sides of the

distributional fit may not be sufficient. Consequently, we propose also using the

MAE:

MAE = 1n

∑ni=1

∣∣∣F (x′i)− Sn (x′i)∣∣∣.We consider MAE to be useful here given that VB generally does not over-fit (Attias,

1999). In our results section, we utilized both measures. Finally, we will make use of

the following modification of Vn (Stephens, 1970):

V ∗n =√n× Vn ×

(1 + 0.155√

n+ 0.24

n

)The definition of V ∗n allows us to evaluate more generally whether our modeling ap-

proach is robust with respect to n, number of observations in a pattern; the distribu-

tion of V ∗n has been shown to be quite stable for n ≥ 4, but having n ≥ 8 is recom-

mended.
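Both measures are straightforward to compute once the fitted (truncated) CDF is available. In the sketch below, F is a vectorized callable; evaluating $S_n$ as $i/n$ at the $i$th order statistic is a simplifying assumption (the deviations on both sides of each step of $S_n$ could be examined more carefully):

    import numpy as np

    def kuiper_mae(x, F):
        # Kuiper's V_n, Stephens' V_n* and the MAE between the fitted
        # CDF F and the sample distribution function S_n.
        xs = np.sort(np.asarray(x, dtype=float))
        n = xs.size
        d = F(xs) - np.arange(1, n + 1) / n
        v = d.max() - d.min()
        v_star = np.sqrt(n) * v * (1 + 0.155 / np.sqrt(n) + 0.24 / n)
        return v, v_star, np.abs(d).mean()

In the comparisons of Section 3.4, v_star would be judged against the critical value of 1.224 at the nominal α = 10% significance level (Stephens, 1970).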


3.4 Results

As indicated in the introduction, we carry out two analyses. The first task is to demonstrate the irreversible nature of the VB component elimination property when fitting GMMs, or mixture models more generally. We next evaluate the goodness-of-fit among the different initialization schemes with respect to the observation component allocations. We also evaluate the effectiveness of using the simple circumventing VB-GMM approach for modeling the circular data of users' weekday temporal usage patterns. We note that the heterogeneous nature of this dataset provides a good example of a real world problem on which to evaluate the practical merits and effectiveness of the VB-GMM methodology for approximating circular data.

3.4.1 The irreversible nature of the VB component elimination property

To demonstrate this property, a simulated dataset with 200 observations was generated from a mixture of four Gaussian components with the parameter values shown in Table 3.1. The data we generated here can be considered as not well separated and therefore reasonably challenging. This example is chosen to illustrate that, in some cases, the VB elimination property can lead to differences in results when different values of kinitial are used; we demonstrate this with a random initial allocation of observations to components. The models recovered by the VB-GMM algorithm with the random initialization scheme and with various kinitial's are shown in Table 3.2. We note that while the algorithm is unable to identify the original model for this less than well behaved dataset, the four-component solution found with most of the different kinitial's appears to be satisfactory. Moreover, we observe that VB has eliminated components with little support rather effectively, as has been observed by previous researchers (e.g., Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007).

Obviously, model outputs with kinitial < 4 cannot recover the original model. However, we observe that in this example there is some variation in the results obtained even when we initialize the algorithm such that kinitial is larger than the number of components in the model the data were simulated from. In particular, when we chose kinitial = 4 or 8, the algorithm only fits a three-component model, failing to fit component four, which was the least weighted component in the model we simulated from. It is easy to see how such an over-simplification of the model can arise, particularly when there are low weighted components heterogeneously mixed in the model. This over-simplification occurred here because, in the initial observation allocation, observations that were generated from component four were not grouped appropriately enough, giving little support to a fourth component and hence leading to its premature elimination from the model before convergence was complete. Following the computation iteration by iteration, it is clear that the standard VB elimination cannot be reversed.

Table 3.1: Mixture model parameters used for the simulated dataset

Parameter       #1       #2       #3       #4
µ             5.124    6.237    6.584    7.335
τ^(-1/2)      0.552    0.323    0.052    0.176
w             0.240    0.410    0.260    0.090

While the non-uniform behavior of VB in this example (c.f. the result for kinitial = 8 differing from those for kinitial = 7 or 9 in Table 3.2), which occurs as a result of the irreversible nature of its component elimination property, is somewhat concerning, we note that fortunately it seldom occurred in our other experiments when the data were well behaved. We do not show those results here as our focus is on more problematic examples. Of course, heterogeneous data can be modeled differently, and such data can also cause inference challenges for other Bayesian approaches. However, what is interesting to us is that regardless of how the observation component allocations are initialized, and of what value of kinitial is used for this non-well-separated dataset, the VB solutions have been quite consistent in the sense that models with the same number of components were nearly always practically identical (c.f. Table 3.2).
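A hypothetical re-run of this demonstration, using the vb_gmm_1d sketch from Section 3.2 (so all caveats given there apply), would simulate from the Table 3.1 mixture and refit with increasing kinitial; the number of surviving components varies with the allocation seed, illustrating the irreversible elimination:

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([5.124, 6.237, 6.584, 7.335])
    sd = np.array([0.552, 0.323, 0.052, 0.176])   # tau^(-1/2)
    w = np.array([0.240, 0.410, 0.260, 0.090])
    comp = rng.choice(4, size=200, p=w)           # 200 observations
    x = rng.normal(mu[comp], sd[comp])

    for k_init in range(1, 10):
        w_hat, m_hat, sd_hat = vb_gmm_1d(x, k=k_init, seed=k_init)
        print(f"k_initial = {k_init}: k_final = {len(w_hat)}")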

Finally, we note that in practice researchers often perform several analyses on the same dataset to ensure the findings are appropriate. Moreover, they typically choose some kind of initial clustering (e.g., Corduneanu and Bishop, 2001), for example, to initialize their Bayesian schemes; this tactic can provide potential computational savings given that a more informative prior is utilized. Nevertheless, randomly allocating observations to components in a one-dimensional space is somewhat concerning to us, in the sense that all components are initialized very similarly; within the VB framework this can make the issue of premature component elimination more likely to occur and can actually lead to very inappropriate models. This is a motivation for designing the allocation schemes in Section 3.1, and we follow up on this point in the next subsection. Despite this, we again emphasize that many studies, as well as the above analysis, have shown empirically that VB for GMMs is generally very effective.

3.4.2 Evaluating the results of the VB-GMM fit under different initialization schemes for padded circular data

Our data for this experiment were provided by a telecommunication provider. The dataset consists of every successful outbound communication activity made by 100 users during a 17-month period. These anonymous users were randomly sampled from a large database of several million users, with users' activity during the weekend ignored (given the weekday focus stated in the introduction) and each user having more than 100 activities overall. The average number of activities for the analyzed users is 1,766, and the maximum number of activities is 10,607. In this evaluation, we first consider results for just three of the anonymous users. Then we consider the whole sample to summarize the overall results obtained for all 100 users with various VB-GMM settings.

Table 3.2: Parameter estimates recovered by VB with various kinitial; each triple gives (µ, τ^(-1/2), w) for a surviving component

kinitial = 1    (6.130, 0.769, 1.000)
kinitial = 2    (5.969, 0.840, 0.737)  (6.579, 0.038, 0.263)
kinitial = 3    (4.160, 0.033, 0.025)  (6.036, 0.779, 0.715)  (6.579, 0.038, 0.260)
kinitial = 4    (4.160, 0.033, 0.025)  (6.036, 0.779, 0.715)  (6.579, 0.038, 0.260)
kinitial = 5    (5.348, 0.706, 0.342)  (6.278, 0.300, 0.309)  (6.580, 0.038, 0.259)  (7.299, 0.160, 0.089)
kinitial = 6    (5.348, 0.707, 0.342)  (6.278, 0.300, 0.309)  (6.580, 0.038, 0.259)  (7.299, 0.160, 0.089)
kinitial = 7    (5.348, 0.707, 0.342)  (6.278, 0.300, 0.309)  (6.580, 0.038, 0.259)  (7.299, 0.160, 0.089)
kinitial = 8    (4.160, 0.033, 0.025)  (6.036, 0.779, 0.715)  (6.579, 0.038, 0.260)
kinitial = 9    (5.348, 0.707, 0.342)  (6.278, 0.300, 0.309)  (6.580, 0.038, 0.259)  (7.299, 0.160, 0.089)
...
kinitial = 20   (5.348, 0.707, 0.342)  (6.278, 0.300, 0.309)  (6.580, 0.038, 0.259)  (7.299, 0.160, 0.089)

Table 3.3: Average Kuiper statistic over all setups for all 100 users; the 'best' model for each user is determined by the lowest Kuiper statistic. Note that the different kinitial correspond to the different cycle scenarios.

                     Average Kuiper                     # of 'best' models
               Random    Partitioned  Overlapping   Random  Partitioned  Overlapping
kinitial = 17  0.041911  0.033613     0.034013      21      43           36
kinitial = 23  0.060567  0.036761     0.037422      15      47           38
kinitial = 35  0.227374  0.039212     0.038057      12      40           48

Our objectives are to evaluate the goodness-of-fit of the VB-GMM results when applied to padded circular data, and the implications of the three different initialization schemes, Random, Partitioned and Overlapping, for the observations' initial component allocations. That is, on one hand, we aim to understand the implications of the different initialization schemes for VB-GMM on data that are more complicated; on the other hand, we assess how well circular data can be approximated with a truncated GMM. As stated in the introduction, we also investigate the sensitivity of our approach by considering three scenarios: 1.5, 2 and 3 complete data cycles. For ease of demonstration and 'fair' comparison, kinitial = 17 is used for the 1.5-cycle scenario (c.f. Figure 3.1 for Overlapping), while kinitial = 23 and 35 are utilized for the 2- and 3-cycle scenarios, respectively. In other words, for each pattern a total of nine setups will be evaluated (c.f. three initialization schemes by three data repeating scenarios). Note that, for simplicity, we shall refer to the three data repeating scenarios by their corresponding kinitial.

Figures 3.3 to 3.5 illustrate selected modeling results for the temporal usage patterns of three selected users over the 24-hour period. As suspected, we observed that the Random initialization can have a disastrous effect in this one-dimensional study; the resulting poor fits appear to be the result of the initialized components being too vague and/or too similar. This suggests that all initialized components covered nearly identical data ranges, as if one were aiming to fit a single Gaussian distribution to a padded heterogeneous pattern. We follow up on this point in the next paragraph. In contrast, most patterns appear to have been modeled quite well in all setups where the Partitioned or Overlapping initialization was used (c.f. Figures 3.3 to 3.5). We can also see that the fit appears good at the 'edges' of the datasets, which indicates that our proposed VB-GMM with padded data approach works well in the circular data setting.

Table 3.4: Average MAE over all setups for all 100 users; the 'best' model for each user is determined by the lowest MAE. Note that the different kinitial correspond to the different cycle scenarios.

                     Average MAE                        # of 'best' models
               Random    Partitioned  Overlapping   Random  Partitioned  Overlapping
kinitial = 17  0.006779  0.005205     0.005172      19      48           33
kinitial = 23  0.011930  0.005557     0.005656      13      48           39
kinitial = 35  0.073938  0.006210     0.005865      11      38           51

Figure 3.2: Distribution of the number of components k with selected setups; (a) Overlapping and kinitial = 35 setup, (b) Random and kinitial = 35 setup.

The averages of the Kuiper statistic and the MAE for all analyzed users are summarized in Tables 3.3 and 3.4. The more informative Partitioned and Overlapping initializations have generally produced results as good as or better than those of the non-informative Random initialization, and the models fitted under Random initialization appear to suffer significantly when a longer series of data is padded. Moreover, Overlapping appears to perform marginally better than Partitioned when a longer series of data is padded. To understand why some VB-GMM fits appear better than others, we next summarize the average kfinal's for these users in Table 3.5. It appears that, on average, approximately four or five components (c.f. Partitioned and Overlapping) are required for modeling one cycle of a user's 24-hour calling pattern, whereas the poorer model fits obtained from Random initialization (except with kinitial = 35, i.e., three complete data cycles) appear to be the direct result of there being fewer surviving components in the models, a side-effect of the irreversible nature of VB's component elimination property, which is normally quite effective. Nonetheless, careful evaluation of the distributions of the kfinal's for the different setups revealed something interesting. As expected, the distributions centered around their average kfinal's (c.f. Figure 3.2 (a)); however, the Random and kinitial = 35 setup (c.f. Figure 3.2 (b)) is the exception, in that most models either have many surviving components (nearly all of them practically identical) or very few, making the average kfinal misleading. This result suggests that the Random initialization scheme for VB can be problematic for modeling complicated one-dimensional patterns, and that more informative observation component allocation schemes are generally needed. Additionally, we note that if we were to execute several thousand more iterations for those models that still contain a very high number of components, components would in many cases still be eliminated; however, such models generally end up with only a handful of components, which is clearly still insufficient for modeling three cycles of heterogeneous data.

Table 3.5: Average kfinal over all setups for all 100 users. Note that the different kinitial correspond to the different cycle scenarios.

               Random    Partitioned  Overlapping
kinitial = 17  6.051     7.333        7.606
kinitial = 23  6.788     9.586        9.778
kinitial = 35  12.778    12.202       12.667

Table 3.6: Average Stephens' Kuiper statistic, V*_n, over all setups for all 100 users; a 'good' model is determined by comparing its V*_n to the critical value of 1.224 at the nominal significance level of α = 10% (Stephens, 1970). Note that the different kinitial correspond to the different cycle scenarios.

                     Average Stephens' Kuiper V*_n      # of 'good' models
               Random    Partitioned  Overlapping   Random  Partitioned  Overlapping
kinitial = 17  1.245     0.963        0.960         63      83           83
kinitial = 23  1.851     1.058        1.097         41      77           75
kinitial = 35  9.733     1.127        1.109         20      67           72

Recall that our other focus is to assess the effectiveness of modeling circular data with a truncated GMM. Figure 3.6 shows how Stephens' Kuiper statistic, V*_n, is distributed with respect to n in the fitted models for all nine setups. It shows that the VB-GMM with padded data approach for modeling circular data with either the Partitioned or the Overlapping observation component initialization scheme is robust regardless of the size of n; and, for our data, different numbers of cycles appear to have minimal effect when a more informative initialization scheme is used. Table 3.6 summarizes the number of cases (out of 100 models/users) where the model's V*_n is less than the critical value determined by the nominal significance level of α = 10% in the upper tail, that is, the number of 'good' fitted models. These numbers echo the findings from Figure 3.6; they show that the VB-GMM with the Partitioned or Overlapping approach on padded circular data will generally result in satisfactory models, and that it is sufficient to approximate circular patterns in this simple way, circumventing the problem with truncated GMMs.


3.5 Discussion

In this paper, we have shown how VB-GMM can be adapted for use in approximating circular data by taking an approach where the data is padded at the edges. Additionally, we have illustrated and discussed the generally overlooked potential implications of the irreversible nature of VB's component elimination property, and illustrated the effectiveness of utilizing more informative observation component allocation schemes in avoiding this problem. In doing this we have demonstrated an effective circumventing modeling approach for circular data that will be of particular value in settings where there are large volumes of data to be analyzed, as the VB-based approach is generally more computationally and time efficient than other Bayesian approaches. This should be useful for other circular data applications.

One key advantage of modeling each user's temporal usage pattern using a GMM is the ease of interpretation of the fitted model. From an application standpoint, this can provide telecommunication companies with an opportunity to gain insights into, as well as differentiate, each customer's temporal usage behavior (c.f. Wu et al., 2010a). This type of information has valuable implications for marketing and product design. For example, User B is mostly active during business hours, while User C is highly active around midnight; this might suggest that these two users have very different needs, and a company could therefore better tailor its products for each of them. We also note that while we restricted our attention here to the standard VB algorithm, in which components may only be eliminated and not added, there are component splitting VB schemes available, as discussed in the introduction. Wu et al. (2010b) show the usefulness of component splitting in VB applied to telecommunication spatial data. We anticipate that the model goodness-of-fit could be further improved by adopting a component splitting strategy in the VB-GMM algorithm more generally.

Figure 3.3: The results of the VB-GMM fits of the usage pattern of User A. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup (Kuiper = 0.020312, MAE = 0.002811), (b) Overlapping and kinitial = 17 setup (Kuiper = 0.020313, MAE = 0.002811), (c) Partitioned and kinitial = 23 setup (Kuiper = 0.031329, MAE = 0.005304), (d) Overlapping and kinitial = 23 setup (Kuiper = 0.020389, MAE = 0.002708), (e) Partitioned and kinitial = 35 setup (Kuiper = 0.02255, MAE = 0.003156), and (f) Overlapping and kinitial = 35 setup (Kuiper = 0.04123, MAE = 0.007255).

Figure 3.4: The results of the VB-GMM fits of the usage pattern of User B. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup (Kuiper = 0.010413, MAE = 0.001333), (b) Overlapping and kinitial = 17 setup (Kuiper = 0.011612, MAE = 0.001605), (c) Partitioned and kinitial = 23 setup (Kuiper = 0.010727, MAE = 0.001525), (d) Overlapping and kinitial = 23 setup (Kuiper = 0.010257, MAE = 0.001525), (e) Partitioned and kinitial = 35 setup (Kuiper = 0.009021, MAE = 0.001338), and (f) Overlapping and kinitial = 35 setup (Kuiper = 0.010963, MAE = 0.001692).

Figure 3.5: The results of the VB-GMM fits of the usage pattern of User C. The histogram summarizes the actual observations; (a) Partitioned and kinitial = 17 setup (Kuiper = 0.038779, MAE = 0.016959), (b) Overlapping and kinitial = 17 setup (Kuiper = 0.040034, MAE = 0.017051), (c) Partitioned and kinitial = 23 setup (Kuiper = 0.040514, MAE = 0.015667), (d) Overlapping and kinitial = 23 setup (Kuiper = 0.040352, MAE = 0.015937), (e) Partitioned and kinitial = 35 setup (Kuiper = 0.041396, MAE = 0.013901), and (f) Overlapping and kinitial = 35 setup (Kuiper = 0.042945, MAE = 0.017163).

Figure 3.6: Stephens' Kuiper V*_n vs. n; (a) Random and kinitial = 17 setup, (b) Partitioned and kinitial = 17 setup, (c) Overlapping and kinitial = 17 setup, (d) Random and kinitial = 23 setup, (e) Partitioned and kinitial = 23 setup, (f) Overlapping and kinitial = 23 setup, (g) Random and kinitial = 35 setup, (h) Partitioned and kinitial = 35 setup, and (i) Overlapping and kinitial = 35 setup.

3.6 References

Attias, H., 1999. Inferring parameters and structure of latent variable models by variational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm, Sweden, pp. 21–30.

Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information criteria for missing data models. Bayesian Analysis 1 (4), 651–674.

Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association 95 (451), 957–970.

Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures based on variational component splitting. IEEE Transactions on Neural Networks 18 (3), 745–755.

Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mixture distributions. In: Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.

Dempster, A. P., Laird, N. M., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 (1), 1–38.

Fernandez-Duran, J. J., 2004. Circular distributions based on nonnegative trigonometric sums. Biometrics 60 (2), 499–503.

Fisher, N. I., 1996. Statistical Analysis of Circular Data, 2nd Edition. Cambridge University Press, Cambridge, UK.

Fisher, N. I., Lee, A. J., 1994. Time series analysis of circular data. Journal of the Royal Statistical Society: Series B (Methodological) 56 (2), 327–339.

Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.

Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of factor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the 1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.

Ghosh, K., Jammalamadaka, S. R., Tiwari, R. C., 2003. Semiparametric Bayesian techniques for problems in circular data. Journal of Applied Statistics 30 (2), 145–161.

Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational methods. Statistics and Computing 10 (1), 25–37.

Jammalamadaka, S. R., Sengupta, A., 2001. Topics in Circular Statistics. Series on Multivariate Analysis. World Scientific, Singapore.

Kuiper, N. H., 1962. Tests concerning random points on a circle. Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, Series A 63, 38–47.


Lees, K., Roberts, S., Skamnioti, P., Gurr, S., 2007. Gene microarray analysis using angular distribution decomposition. Journal of Computational Biology 14 (1), 68–83.

Mardia, K. V., Jupp, P. E., 2000. Directional Statistics, 2nd Edition. Wiley Series in Probability and Statistics. Wiley, Chichester, UK.

McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis 51 (11), 5352–5367.

McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley, New York.

McVinish, R., Mengersen, K., 2008. Semiparametric Bayesian circular statistics. Computational Statistics & Data Analysis 52 (10), 4722–4730.

Pewsey, A., 2008. The wrapped stable family of distributions as a flexible model for circular data. Computational Statistics & Data Analysis 52 (3), 1516–1523.

Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4), 731–792.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.

Sheskin, D. J., 2004. Handbook of Parametric and Nonparametric Statistical Procedures, 3rd Edition. Chapman & Hall, Boca Raton, FL.

Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.

Stephens, M. A., 1970. Use of the Kolmogorov-Smirnov, Cramer-von Mises and related statistics without extensive tables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 32 (1), 115–122.

Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21 (13), 3025–3033.

Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.


Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for mixture models. Neural Computation 12 (9), 2109–2128.

Wang, B., Titterington, D. M., 2006. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis 1 (3), 625–650.

Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT, Vancouver, BC, Canada, pp. 1237–1244.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010a. Customer spatial usage behavior profiling and segmentation with mixture modeling. Submitted.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm with application to human mobility pattern modeling. Statistics and Computing (in press). http://dx.doi.org/10.1007/s11222-010-9217-9

4 A New Variational Bayesian Algorithm with Application to Human Mobility Pattern Modeling

Abstract

A new variational Bayesian (VB) algorithm, split and eliminate VB (SEVB), for modeling data via a Gaussian mixture model (GMM) is developed. This new algorithm makes use of component splitting in a way that is more appropriate for analyzing a large number of highly heterogeneous spiky spatial patterns with weak prior information than existing VB-based approaches. SEVB is a highly computationally efficient approach to Bayesian inference, and like any VB-based algorithm it can perform model selection and parameter value estimation simultaneously. A significant feature of our algorithm is that the fitted number of components is not limited by the initial proposal, giving increased modeling flexibility. We introduce two types of split operation in addition to proposing a new goodness-of-fit measure for evaluating mixture models, and we evaluate their usefulness through empirical studies. In addition, we illustrate the utility of our new approach in an application to modeling human mobility patterns. This application involves large volumes of highly heterogeneous spiky data, which are difficult to model well using the standard VB approach as it is too restrictive and lacks the required flexibility. Empirical results suggest that our algorithm improves upon the goodness-of-fit that would have been achieved using the standard VB method, and that it is more robust to various initialization settings.

Keywords

Variational Bayes (VB); Gaussian Mixture Model (GMM); Component Splitting; Human Mobility Pattern; Data Mining


4.1 Introduction

Mixture models are commonly employed in statistical analysis as they provide a great deal of modeling flexibility. In particular, one very popular and computationally convenient approach is to model data as a mixture of a finite number of independent Gaussian distributions. In this paper we refer to this model as a Gaussian mixture model (GMM) (McLachlan and Peel, 2000). In recent years the computationally efficient variational Bayesian (VB) approach has been successfully used to fit GMMs, as described in McGrory and Titterington (2007). We refer to this approach as the standard VB-GMM method. While this method enables faster computation and lower storage requirements than most other Bayesian approaches, working with large volumes of data which exhibit widely varying patterns can still be challenging.

This paper aims to improve on the standard method to create an approach which is better suited to analyzing datasets that are characterized by a large number of highly heterogeneous spiky spatial patterns, in particular when there is only weak prior information available. We use the term spiky to describe data patterns with large areas of low probability mixed with small areas of high probability, and the term heterogeneous to describe datasets where we observe patterns in different regions with various degrees of complexity, some of which are better described by a mixture of one or two components, while others may require a model with a large number of components.

Human mobility patterns are known to be highly heterogeneous and spiky (Gonzalez et al., 2008). An understanding of human mobility patterns is valuable for urban planning, traffic modeling and predicting the transmission of biological viruses (Gonzalez et al., 2008), for example. To the best of our knowledge, individuals' mobility patterns have not yet been modeled with GMMs; taking such an approach will therefore allow us to gain further insights into this type of data. To capture human mobility patterns, we analyze individuals' telecommunication call detail records (CDR) observed over a 17-month period. While CDR information is clearly biased in that it only reflects those times when communications are being made, studies have previously shown that this information is in fact adequate to provide a good reflection of a person's overall mobility pattern (Gonzalez et al., 2008). For the reasons mentioned earlier, fitting a GMM to this type of data presents challenges for many standard Bayesian approaches. Another challenge is that CDR data is somewhat discrete, since the location of the user is only known up to the location of the cell tower through which the activity was initiated. In this paper we present an algorithm that is highly stable, efficient and more appropriate than existing approaches for analyzing real world data where such issues arise. In particular, we shall show the advantages of adopting a component splitting strategy in the analysis.

VB was first formally proposed by Attias (1999) and has now been used in various applications (Wang and Titterington, 2006). Its scalability, ease of computation, and efficiency in terms of both computation and storage requirements make VB practical for analyzing large datasets, in contrast to the more popular but computationally demanding Markov chain Monte Carlo (MCMC) approach (Madigan and Ridgeway, 2003). Another alternative Bayesian approach for GMMs is sequential Monte Carlo (SMC); the properties of SMC for static datasets are largely unknown, although Balakrishnan and Madigan (2006) propose an efficient approach which is comparable with MCMC. Other key advantages associated with the VB approach are that, unlike Monte Carlo based approaches, it does not suffer from mixing or label switching problems, or from the difficulties with assessing convergence (Celeux et al., 2000; Jaakkola and Jordan, 2000; Wang and Titterington, 2006; McGrory and Titterington, 2007). Further, since VB is deterministic, it does not rely on sampling, the accuracy of which can be difficult to assess in the context of GMMs, and, being a Bayesian approach, it suffers less from over-fitting and singularity problems (Bishop, 2006, pp.461-486). While the literature is lacking in formal comparisons between VB and maximum likelihood (ML) approaches such as expectation-maximization (EM) algorithms (e.g., Aitkin and Wilson, 1980), within the context of speech recognition problems Watanabe et al. (2002), for example, showed through empirical studies that VB performed as well as or better than EM algorithms used with the Bayesian information criterion (BIC) and minimum description length (MDL), in terms of robustness, accuracy and rate of convergence in hidden Markov modeling.

Another significant practical advantage of using VB for mixture modeling is that, in the same way as the reversible jump MCMC (RJMCMC) approach (Richardson and Green, 1997) or birth-death MCMC (Stephens, 2000), VB is able to automatically select the number of components k and estimate the parameter values simultaneously (e.g., Attias, 1999). Note, however, that in McGrory and Titterington (2007) the authors had also chosen to compute the deviance information criterion (DIC) (Spiegelhalter et al., 2002; Celeux et al., 2006) within the VB algorithm; DIC was used only as a complementary approach to validate the automatic VB selection and to assist with modeling decisions in those cases where the VB algorithm could automatically select alternative fitted models under different initialization settings. Of course, the computing time and storage involved for VB are significantly less than those required to carry out Monte Carlo based approaches. Many other mixture modeling methods are incapable of this type of simultaneous estimation; they instead separate the selection of k, which is an important issue in mixture modeling (McLachlan and Peel, 2000), from the parameter estimation, which assumes k is fixed (c.f. Richardson and Green, 1997). The automatic and simultaneous scheme is naturally more desirable, particularly for analyzing the heterogeneous spiky patterns with weak prior information that we see in applications.


As mentioned above, in the GMM case, VB will ultimately select a suitable k and con-

verge to give parameter estimates for the k-component model. This leads to what we

call the variational posterior fit to the data. Standard VB converges to select a suitable

k by effectively and progressively eliminating redundant components in the model.

In general mixture modeling (i.e., fitting mixtures with an unknown number of com-

ponents and unknown parameter values), it is well-known that the posterior may be

multimodal (Wang and Titterington, 2006; Titterington et al., 1985, pp.48-50). This of

course can cause mixing and label switching problems for MCMC-based algorithms.

Within the VB framework, if the posterior is multimodal, then naturally the algo-

rithm can only converge to one of the local maxima of the posterior and any others

would not be explored. It would also be possible for the algorithm to converge to dif-

ferent local maxima if different parameter initializations were used. Therefore, while

in the specific context of GMMs, VB is guaranteed to converge (Bishop, 2006, p.466)

at least locally to the maximum likelihood estimator (Wang and Titterington, 2006)

and has been shown to monotonically improve the model approximations from one

iteration to the next (in contrast to stochastic convergence that is associated with

MCMC schemes), standard VB is still somewhat sensitive to the initialization of the

hyper-parameters in the prior (Watanabe et al., 2002) and the initial component

membership probabilities of each observation. That is, in some cases suboptimal

models might be found as a result of these initialization choices.

While the elimination property of VB may often be convenient and useful, note that

it implies that the initially proposed number of components kinitial in the standard

VB-GMM approach (Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Tit-

terington, 2007) is effectively the maximum. That is, the standard method will lead

to suboptimal models if the value for kinitial chosen is smaller than required. This

implies that, after assessing the dataset, a suitably large kinitial should be selected

in order to try to avoid this problem, but this approach is clearly not convenient

or efficient for exploring a large number of highly heterogeneous patterns. In this

situation, while one can set kinitial to be equivalent to or larger than the maximum

number of components likely to be present across all subsets of the data and then

let the algorithm converge in each case, such a tactic is computationally wasteful in

terms of both time and storage for those simple patterns where a large number of

unnecessary components would have to be removed as the algorithm converged.

This paper addresses the aforementioned challenges of standard VB-GMM algo-

rithm by allowing components to be split. This strategy will remove the limitation

imposed through the choice of kinitial, and allow a more thorough exploration of

the parameter space than the standard algorithm would achieve. Our approach

aims to avoid the possibility of obtaining a less appropriate model as a result of

an irreversible VB elimination operation (Wu et al., 2010c), and is more suitable

for analyzing the more challenging types of datasets that we are interested in here,

namely heterogeneous spiky spatial patterns with weak prior information. Split op-

erations have been proposed previously within the VB-GMM framework in the ma-

chine learning literature (Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002;

Constantinopoulos and Likas, 2007). However, these approaches have focused on

splitting only one component at a time, in an attempt to split every single compo-

nent until reaching a model that is optimal with respect to their criteria. In contrast,

we pursue a more focused approach that is more adaptable for real world problems.

That is, each time we attempt a split in a given iteration, we attempt to split not

all components, but only those fitted poorly, and we attempt to split all of them at

the same time. We define and assess the goodness-of-fit of each fitted component

through a set of proposed split criteria designed to identify why a component is a

poor fit and hence determine what are the appropriate split operations to pursue.

Two possible distributional failures of fit and hence two different split operations

are considered in our algorithm. The first operation is to split a component into

two side-by-side subcomponents such that their means are unequal, $\mu^{(1)} \neq \mu^{(2)}$, as is the approach taken in most mixture modeling split studies (e.g., Richardson and Green, 1997). The other operation is to split a component in a less conventional way into two overlapping subcomponents such that $\mu^{(1)} \approx \mu^{(2)}$ but $\|\Sigma^{(1)}\| \ll \|\Sigma^{(2)}\|$.

That is, components are split into ‘inliers’ and ‘non-inliers’ subcomponents in situa-

tions where there exists a high concentration of observations (i.e., inliers) close to the

component mean. We propose to allow these two split operations non-exclusively;

that is, we will allow a component to be split into three subcomponents at the same

time if both split criteria have been satisfied. We have found this to be useful. Like all

previous VB-GMM split studies, we rely on the competing nature of the VB mixture

modeling approach that arises from the component elimination property associated

with using the variational approximation. This enables the scheme to converge to an

appropriate model, therefore we need no other elimination or merge moves.

Typically mixture models within the Bayesian framework are evaluated based on in-

formation criteria such as BIC (McLachlan and Peel, 2000) or the recently developed

DIC (McGrory and Titterington, 2007) which are less wasteful than validation ap-

proaches (Corduneanu and Bishop, 2001). However, the posterior lower bound F ,

which we outline below, is perhaps the most popular approach for comparing mod-

els in the VB framework. In fact, F has been used for guiding each split attempt in

the previously proposed VB-GMM algorithms with a split process (Ghahramani and

Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and Likas, 2007). De-

spite the fact that our proposed split targets poorly fitted components, and empirical

studies suggest that these lead to a more suitable model for our application data, we

have observed that the models fitted through our split moves are sometimes ranked

lower than the pre-split models with respect to BIC, DIC and $F$. A closer examination

of these cases revealed that these measures of fit can be unstable for our data mainly

due to the discreteness present. We discuss this in more detail later. Consequently,

we propose a new criterion that is more robust to this issue for evaluating results,

and we select the final model based on our proposed goodness-of-fit measure in-

stead. In Section 4.3.5, we will introduce this alternative criterion which is based on

absolute errors and takes the covariance matrix of each component into considera-

tion. We believe our proposed criterion is more appropriate for assessing the fit of

the mixture models, at least for our application, and we provide motivation for this

opinion by demonstrating results on some examples.

We structure this paper as follows. In Section 4.2, we briefly discuss the theory of

VB and outline the standard VB-GMM algorithm. In Section 4.3, we detail our new

algorithm and proposed model selection measure. Section 4.4 discusses the human

mobility pattern application and presents both simulated and real data results. We

conclude this paper in Section 4.5.

4.2 Standard VB-GMM Algorithm

Modeling more complex distributions via a GMM with k independently Gaus-

sian distributed underlying mixture components is a well-known popular ap-

proach. The mixture density of each observation in the data $x = (x_1, \dots, x_n)$ is of the form $\sum_{j=1}^{k} w_j\, \mathcal{N}\left(x_i; \mu_j, T_j^{-1}\right)$, where $k \in \mathbb{N}$, $\mathcal{N}(\cdot)$ represents the (multivariate) Gaussian density, $\mu_j$ and $T_j^{-1}$ denote the mean and variance, respectively, for component $j$, and the mixing proportions $\{w_j\}$ satisfy $0 \leq w_j$ and $\sum_{j=1}^{k} w_j = 1$. Bayesian inference is

based on the target posterior distribution, p (θ, z|x), where θ represents the model

parameters (µ, T, w) and z = {zij : i = 1, ..., n, j = 1, ..., k} denotes the unobserved

component membership indicators for the observed data x. The target posterior is

not analytically available, as is typically the case, and it has to be approximated in

the Bayesian inference approach.

VB methods are becoming increasingly popular as an approach for approximating

the posterior distribution of the parameters of a GMM (Wang and Titterington, 2006).

VB aims to obtain tractable coupled expressions for approximating the posterior dis-

tribution p (θ|x); these resulting expressions can then be solved iteratively (McGrory

and Titterington, 2007). The parameters of the coupled expressions are adjusted via

an EM-like optimization algorithm. This yields a sequence of approximations that

improve with each iteration and can often be expressed in closed form when

these parameters have fixed values (Jaakkola and Jordan, 2000). In the following we

briefly outline the VB approach.

We begin by introducing a variational function written as $q(\theta, z|x)$, which will be used to maximize the value of a quantity $F(q(\theta, z|x))$ which depends on it as follows.

Jensen’s inequality tells us that we can express the marginal log-likelihood as

$$
\begin{aligned}
\log p(x) &= \log \int \sum_{\{z\}} q(\theta, z|x)\, \frac{p(x, z, \theta)}{q(\theta, z|x)}\, d\theta &\quad (4.1)\\[4pt]
&= \int \sum_{\{z\}} q(\theta, z|x) \log \frac{p(x, z, \theta)}{q(\theta, z|x)}\, d\theta + \int \sum_{\{z\}} q(\theta, z|x) \log \frac{q(\theta, z|x)}{p(\theta, z|x)}\, d\theta &\quad (4.2)\\[4pt]
&= F(q(\theta, z|x)) + KL(q|p) &\quad (4.3)\\[4pt]
&\geq F(q(\theta, z|x)), &\quad (4.4)
\end{aligned}
$$

where $F(\cdot)$ is the first term in Equation (4.2) and $KL(q|p)$ is the second, which is also the Kullback-Leibler (KL) divergence between the target $p(\theta, z|x)$ and its variational approximation $q(\theta, z|x)$. Note that $KL(q|p)$ cannot be negative. By minimizing $KL(q|p)$, VB is effectively maximizing $F(q(\theta, z|x))$, a lower bound of $\log p(x)$.

However, q (θ, z|x) must be chosen carefully so that it is a close approximation to

the true conditional density, and importantly that it gives tractable computations

for approximating the required posterior distribution. Typically it is assumed that

$q(\theta, z|x)$ can be expressed as $q_\theta(\theta|x) \times q_z(z|x)$, with conjugate distributions chosen for the parameters. VB then involves solving $q(\theta, z|x)$ iteratively in a way similar to the classical EM algorithm:

• E-step: find the expected value of the posterior of the component membership, $q_z(z|x)$; and,

• M-step: estimate the model parameters in $q_\theta(\theta|x)$ by maximizing $F(q(\theta, z|x))$.

This results in variational posterior approximations of the form $p(\theta|x) \approx q_\theta(\theta|x)$.

Research on the theoretical properties of VB is limited. However, Wang and Titter-

ington (2006) have already demonstrated the asymptotic consistency of the VB ap-

proximation for GMMs with fixed k. They have pointed out that VB-GMM is not bi-

ased in large samples, and have proved that its local convergence to the maximum likelihood estimator is at rate $O(1/n)$ for large $n$. As has been noted by other

researchers (e.g., Attias, 1999; McGrory and Titterington, 2007), we see here that VB

can effectively eliminate unnecessary mixture components when an excessive num-

ber of components is specified in the initial model. Although it is not yet well un-

derstood how this feature of the algorithm works, it means that VB can be used to

estimate model complexity and parameter values simultaneously.

Previous articles on VB-GMM (Attias, 1999; Corduneanu and Bishop, 2001; McGrory

and Titterington, 2007; Bishop, 2006, Section 10.2) have made similar prior assump-

tions for this type of model, but have used different model hierarchies. The conju-

gate priors used have been of the following forms:

• A Dirichlet distribution for the mixture weighting w with respect to k mixture

components,

• A Gaussian distribution for the mean µ of each mixture component, and

• A Wishart distribution for precision (i.e., inverse covariance) matrix T of each

mixture component.

In this paper, we follow the model hierarchy outlined in McGrory and Titterington

(2007) and the reader may refer to that paper for further detail on the derivation

expressions for the VB posteriors given below. Alternatively, readers may also wish

to refer to Attias (1999), Corduneanu and Bishop (2001), and (Bishop, 2006, Section

10.2) for different model hierarchies with different parameter notations. We assume

a mixture of k bivariate Gaussian distributions with unknown means µ = (µ1, ..., µk),

precisions T = (T1, ..., Tk) and mixing coefficients w = (w1, ..., wk), such that

$$
p(x, z|\theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{ w_j\, \mathcal{N}\left(x_i; \mu_j, T_j^{-1}\right) \right\}^{z_{ij}}.
$$

Recall that we have introduced latent indicator variables, the zij ’s in order to express

the GMM in the convenient and popular missing data model representation. Note

that zij = 1 if observation xi belongs to the jth component and zij = 0 otherwise.

The VB approximation will lead to an update expression for the variational posterior

estimates of these latent variables which is outlined below. The joint distribution is

$$
p(x, z, \theta) = p(x, z|\theta)\, p(w)\, p(\mu|T)\, p(T).
$$

Our priors are given by:

$$
\begin{aligned}
p(w) &= \mathrm{Dirichlet}\left(w; \alpha_1^{(0)}, \dots, \alpha_k^{(0)}\right),\\[4pt]
p(\mu|T) &= \prod_{j=1}^{k} \mathcal{N}\left(\mu_j; m_j^{(0)}, \left(\beta_j^{(0)} T_j\right)^{-1}\right),\\[4pt]
p(T) &= \prod_{j=1}^{k} \mathrm{Wishart}\left(T_j; \upsilon_j^{(0)}, \Sigma_j^{(0)}\right),
\end{aligned}
$$

with $\alpha^{(0)}$, $\beta^{(0)}$, $m^{(0)}$, $\upsilon^{(0)}$, and $\Sigma^{(0)}$ being known, user-chosen initial values. These are

standard conjugate priors used in Bayesian mixture modeling (Gelman et al., 2004).

Using the lower bound approximation, the posteriors are:

$$
\begin{aligned}
q_w(w) &= \mathrm{Dirichlet}(w; \alpha_1, \dots, \alpha_k),\\[4pt]
q_{\mu|T}(\mu|T) &= \prod_{j=1}^{k} \mathcal{N}\left(\mu_j; m_j, (\beta_j T_j)^{-1}\right),\\[4pt]
q_T(T) &= \prod_{j=1}^{k} \mathrm{Wishart}(T_j; \upsilon_j, \Sigma_j).
\end{aligned}
$$

The variational posterior update for the $q_{ij}$'s, where $q_{ij}$ denotes the VB posterior probability that, for observation $x_i$, the component membership indicator variable $z_{ij} = 1$, is given by

$$
q_{ij} \propto \exp\left\{ \Psi(\alpha_j) - \Psi(\alpha_\cdot) + \frac{1}{2}\left[ \sum_{s=1}^{2} \Psi\!\left(\frac{\upsilon_j + 1 - s}{2}\right) + 2\log(2) - \log|\Sigma_j| \right] - \frac{1}{2}\,\mathrm{tr}\!\left( \upsilon_j \Sigma_j^{-1} (x_i - m_j)(x_i - m_j)^T + \frac{1}{\beta_j} I_2 \right) \right\}, \tag{4.5}
$$

where $\Psi$ is the digamma function and $\alpha_\cdot = \sum_j \alpha_j$. Note that the above expression

is normalized so that for each observation xi, the qij ’s sum to one over the j’s. As

we can see, the update for the qij ’s involves the updates for the parameters, which in

turn requires this update for the qij ’s, i.e., we have a set of coupled expressions that

must be solved iteratively.

The corresponding updates for our posterior parameters are then:

$$
\begin{aligned}
\alpha_j &= \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad
\beta_j = \beta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad
\upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij},\\[4pt]
m_j &= \frac{1}{\beta_j}\left( \beta_j^{(0)} m_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i \right),\\[4pt]
\Sigma_j &= \Sigma_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i x_i^T + \beta_j^{(0)} m_j^{(0)} m_j^{(0)T} - \beta_j m_j m_j^T,
\end{aligned}
$$

where posterior expectations are given by $E(\mu_j) = m_j$ and $E(T_j) = \upsilon_j \Sigma_j^{-1}$. In this

way, the variational posterior estimate for each of the parameters is updated at each

iteration by adding some function of the current estimates of the qij ’s to the user

chosen initial values that are denoted by the superscript (0).

Note that the VB framework we described can straightforwardly be applied to the

case of a multivariate GMM with general dimension. For clarity we have restricted

our notation in this article to the two-dimensional case since our application in-

volves two-dimensional data.
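To make these coupled expressions concrete, the following is a minimal NumPy sketch of one standard VB-GMM update cycle for the bivariate case, implementing Equation (4.5) and the parameter updates above. The function name vb_gmm_step, the dictionary layout of the prior, and numerical details such as the log-domain normalization are our own illustrative choices, not part of any reference implementation; component elimination is not shown.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm_step(x, q, prior):
    """One coupled VB-GMM update for bivariate data: posterior parameter
    updates followed by the responsibility update of Equation (4.5).
    x is (n, 2); q is (n, k) with rows summing to one; prior holds the
    user-chosen initial values alpha0, beta0, v0 (each length k),
    m0 (k, 2) and S0 (k, 2, 2)."""
    n, d = x.shape
    Nj = q.sum(axis=0)                       # effective counts, sum_i q_ij
    alpha = prior["alpha0"] + Nj
    beta = prior["beta0"] + Nj
    v = prior["v0"] + Nj
    m = (prior["beta0"][:, None] * prior["m0"] + q.T @ x) / beta[:, None]
    k = len(Nj)
    S = np.empty((k, d, d))
    for j in range(k):
        S[j] = (prior["S0"][j]
                + (q[:, j, None] * x).T @ x
                + prior["beta0"][j] * np.outer(prior["m0"][j], prior["m0"][j])
                - beta[j] * np.outer(m[j], m[j]))
    # Responsibility update, Equation (4.5), computed in the log domain.
    logq = np.empty((n, k))
    for j in range(k):
        diff = x - m[j]
        quad = v[j] * np.einsum("ni,ij,nj->n", diff, np.linalg.inv(S[j]), diff)
        quad += d / beta[j]                  # trace term from (1/beta_j) I_2
        log_terms = (digamma(v[j] / 2) + digamma((v[j] - 1) / 2)
                     + d * np.log(2) - np.linalg.slogdet(S[j])[1])
        logq[:, j] = (digamma(alpha[j]) - digamma(alpha.sum())
                      + 0.5 * log_terms - 0.5 * quad)
    logq -= logq.max(axis=1, keepdims=True)  # stabilize before exponentiating
    qnew = np.exp(logq)
    qnew /= qnew.sum(axis=1, keepdims=True)  # normalize over j for each x_i
    return qnew, {"alpha": alpha, "beta": beta, "v": v, "m": m, "S": S}
```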

4.3 Split and Eliminate Variational Bayes for Gaussian Mixture Models (SEVB-GMM) Algorithm

Fitting a GMM via the standard VB approach typically involves choosing a value for

kinitial to start off the algorithm; this is then effectively the maximum number of com-

ponents allowed in the model because the component elimination property of stan-

dard VB means that components may be removed at any iteration, but none will be

added. This automatic complexity reduction (e.g., Attias, 1999) that can occur results

from the use of the variational approximation in performing the Bayesian inference.

The choice of kinitial can have an effect on the results in some cases. In this paper,

we remove the limitation imposed by the choice of kinitial by allowing components in

the model to be split during the SEVB iterations so that the size of k in the final fitted

model is still automatically determined, but it can now also be larger than kinitial, if

appropriate, as well as smaller. As we discussed in the introduction, this approach

will be capable of exploring the parameter space more thoroughly than the stan-

dard algorithm and is more suitable for analyzing the type of data patterns that we

are interested in here, that is, heterogeneous spiky spatial patterns with weak prior

information. As we have mentioned, splitting as well as merging mixture compo-

nents has already been considered within both the RJMCMC (Richardson and Green,

1997) and the VB-GMM framework (Ghahramani and Beal, 1999; Ueda and Ghahra-

mani, 2002; Constantinopoulos and Likas, 2007). In Richardson and Green (1997),

components are randomly chosen either to be split into two side-by-side subcom-

ponents, or combined into one, with the condition that the split/combine moves are

reversible. These moves are accepted or rejected via a trans-dimensional Metropolis-

Hastings update. However, this approach is known to be very computationally de-

manding.

In the context of VB-GMM, a birth-death operation on components has been pro-

posed (Ghahramani and Beal, 1999), based on the idea of improving $F$. Alternatively,

Ueda and Ghahramani (2002) suggested transforming VB into a greedy search algo-

rithm that examines all possible split, merge, and split-and-merge moves until F

cannot be further improved. Note that this was based on their previous work (Ueda

et al., 2000) that transformed the standard EM algorithm to be less dependent on

initial settings. Constantinopoulos and Likas (2007) also proposed a splitting VB al-

gorithm which always starts with a single component and progressively adds more.

Component additions are again guided by improvements in $F$, and this approach requires that components which will not be considered for splitting be integrated out. Each

of the previously proposed VB-GMM splitting algorithms tries to split components

with the worst fit first, and has a different assessment criterion for assessing the fit.

However, we do not believe these approaches are very practical considering that the

range of feasible values for k can be very wide and many applications involve massive

volumes of data. That is, we do not believe that it is necessary, efficient or effective,

from our modeling perspective, to choose and attempt to split only one component

at a time and/or to attempt to split every single component. Consequently, in each

split attempt, our algorithm instead identifies all components that do not appear to

have described the pattern well based on a set of proposed criteria and then splits all

of them at the same time.

Our SEVB algorithm makes use of the elimination property of the VB approximation

as discussed in the introduction. That is, components fitting the same region of the

data will be competing with each other; and when there is strong evidence to suggest

that two or more components are fitting the same part of the data, in most cases

only one of these existing components will survive while others will be removed.

In addition, we only attempt to split components when we have reached a stable

model, that is, one which cannot be improved by performing further iterations of

the VB algorithm.

Our proposed algorithm can be summarized by seven steps listed under the heading

Algorithm 1.

Algorithm 1 SEVB-GMM

1: Randomly assign mixture component membership probabilities to each observation and initialize the prior parameter values.

2: Execute a standard VB-GMM iteration (see Section 4.2).

3: If the model is assessed as not stable according to our criteria (see Section 4.3.1) ⇒ repeat Step 2.

4: Examine our split criteria (see Section 4.3.2) on each component.

5: If the algorithm should be terminated according to our criteria (see Section 4.3.4) ⇒ go to Step 7.

6: Perform our split operation(s) (see Section 4.3.3) on components to be split ⇒ return to Step 2.

7: Select the final model based on our model selection criterion (see Section 4.3.5).

That is, we propose to execute standard VB-GMM in Step 2, until a stable model

result is obtained in Step 3. We then aim to identify all poorly fitted components in

Step 4; these are then split simultaneously in Step 6, provided the algorithm has not

satisfied the terminating condition which is checked in Step 5. In the cases where

there are no remaining poorly fitted components to be split, or we do not want to

split components any further (Step 5), we go to the last step (Step 7) to select the final

model and then terminate the algorithm. Otherwise, we return to Step 2 to execute

more standard VB-GMM iterations until we reach the next stable model (Step 3) after

the split operations have been performed in Step 6.
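As a schematic illustration only, the outer loop of Algorithm 1 might be organized as follows in Python; every helper named here (is_stable, maeac, split_criteria, should_terminate, perform_splits) is a hypothetical stand-in for the corresponding procedure detailed in Sections 4.3.1–4.3.5, not an existing implementation.

```python
def sevb_gmm(x, q, prior, settings):
    """Schematic outer loop of Algorithm 1 (Step 1, the random
    initialization of q and prior, is assumed done by the caller)."""
    prev_post, iters, stable_models = None, 0, []
    while True:
        q, post = vb_gmm_step(x, q, prior)                      # Step 2
        iters += 1
        if not is_stable(post, prev_post, iters, settings):     # Step 3
            prev_post = post
            continue
        stable_models.append((maeac(x, q, post), post))
        flagged = split_criteria(x, q, post, settings)          # Step 4
        if should_terminate(flagged, stable_models, settings):  # Step 5
            break
        q, post = perform_splits(x, q, post, flagged, settings) # Step 6
        prev_post, iters = None, 0     # restart the stability count (S3)
    return min(stable_models, key=lambda s: s[0])[1]            # Step 7
```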

In the remainder of this section, we first discuss our criterion for assessing whether

the model is stable (Section 4.3.1). We then detail our proposed split criteria (Sec-

tion 4.3.2), split operations (Section 4.3.3), and our criterion for determining if the

algorithm should be terminated (Section 4.3.4). We conclude this section by outlin-

ing our proposed model selection criterion (Section 4.3.5).

4.3.1 Model stability criterion

Our algorithm considers splitting components only once a stable model has been reached. Most

VB-GMM algorithms define a stable model based on examining F . That is, a model

is typically declared stable if F of the current iteration is the same as the previous it-

eration up to a very small tolerance level. Such an approach can be computationally

demanding, and is therefore not suitable for analyzing large amounts of data. We

also point out that, while subsequent iterations may be able to fine tune the mod-

els, we have observed that, often when analyzing real world data, VB-based al-

gorithms may simply be hopping among several alternative similarly good, but dif-

ferent models, i.e., the models are not really improving. See further discussion on

Criterion T3 in Section 4.3.4. In contrast, we aim to find a balance between accuracy

and computational efficiency. We declare that the model is stable in Step 3 if:

• The number of surviving components ksurviving, i.e., the number of components currently in the model, remained identical from the previous iteration (S1);

• The variational posterior mean estimates of all surviving components, the mj's, remained the same up to a tolerance level δ1 from the previous iteration (S2). A

suitable tolerance level choice will be application driven to obtain the user’s

desired level of accuracy; and,

• At least c0 iterations have been completed since the initialization or last stable

model (S3). This is to prevent the algorithm being declared stable prematurely

in the very first few iterations.

Therefore, instead of monitoring changes in F , we propose to focus on the estima-

tion of the key model parameters through checking criteria S1 and S2. We have found

this to be adequate. As noted, S3 is to prevent the algorithm being declared stable

too early (e.g., after only one or two iterations) in the process. While in the majority

of cases model parameter estimates typically change rapidly in the early iterations,

this has been observed to happen occasionally when applying the algorithm. There-

fore, in practice, the choice of c0 will have minimal effect on the final fit, however

choosing it to be a very large number would tend to lead to excessive and wasteful

iterations. Once we have obtained a stable model, we will then proceed to Step 4.
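A minimal sketch of this stability check, assuming the posterior parameters are stored in dictionaries keyed as in the earlier sketch and that the loop tracks the iteration count since the last reset; the argument names and bookkeeping are illustrative.

```python
import numpy as np

def is_stable(post, prev_post, iters_since_reset, settings):
    """Sketch of criteria S1-S3; post and prev_post are the posterior
    dictionaries of the current and previous iteration."""
    if iters_since_reset < settings["c0"] or prev_post is None:   # S3
        return False
    if post["m"].shape[0] != prev_post["m"].shape[0]:             # S1: k unchanged
        return False
    # S2: every surviving posterior mean moved by less than delta1
    return bool(np.all(np.abs(post["m"] - prev_post["m"]) < settings["delta1"]))
```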

4.3.2 Component splitting criteria

Unlike existing VB-GMM algorithms with a split process, which attempt to split all of

the components, we adopt a more targeted and efficient approach: we only attempt

to split poorly fitted components. We identify the poorly fitted components in Step

4 through the use of our proposed criteria. We have proposed two split criteria for

the two distributional imperfections that we are interested in identifying. We shall

first discuss our inliers and non-inliers split criterion. This criterion aims to find

components which might be better separated into two overlapping subcomponents

in which their means are such that $\mu^{(1)} \approx \mu^{(2)}$, but their variances are such that $\|\Sigma^{(1)}\| \ll \|\Sigma^{(2)}\|$. We then detail our standard split criterion for identifying poorly fitted components that could be improved by separation into two side-by-side subcomponents whose means are such that $\mu^{(1)} \neq \mu^{(2)}$. Note that the reader may ignore

the inliers and non-inliers split criterion and the corresponding operation if they do

not wish to make such assumptions in their application.

4.3.2.1 Inliers and Non-Inliers Split Criterion

This criterion is based on the Mahalanobis distance (MD) measure. This measure

is utilized because it takes correlation between variables into consideration and it

is scale invariant. While MD is more typically used as a multivariate outlier statistic,

we show here that by considering its theoretical distribution, we obtain a straightfor-

ward diagnostic for the distributions fitted to the components which can aid in the

identification of the inliers. We define the distance $\mathrm{MD}_i^{(j)}$, corresponding to observation $x_i$, from the most likely $j$th component, determined by the largest $q_{ij}$ (Equation (4.5)), with mean $m_j$, as

$$
\mathrm{MD}_i^{(j)} = \sqrt{(x_i - m_j)^T \left(\Sigma_j / \upsilon_j\right)^{-1} (x_i - m_j)}.
$$

For each fitted component, the estimated distribution can be compared with the

chi-square distribution with two degrees of freedom (df), i.e., $\mathrm{MD}^2 \sim \chi^2_{df=2}$, as this

is the theoretical relationship that exists when x is assumed to be bivariate Gaussian

distributed (Azzalini, 1996, p.291). We consider an observation xi to be an inlier with

respect to the jth component if

$$
\mathrm{MD}_i^{(j)} < \sqrt{\chi^2_{df=2,\,\alpha=r}}, \tag{4.6}
$$

where $\alpha$ represents the cumulative probability of the area under the curve of the chi-square distribution; its value $r$ is a user-chosen probability, and a reasonable choice for $r$ will depend on the application: typically one would like to have a small

r so that only observations lying within a small MD, calculated from the right hand

side of Equation (4.6), from a fitted component mean will be identified as inliers.

The inliers and non-inliers split criterion for the jth component is as follows. We

consider it appropriate to split into two overlapping subcomponents

$$
\text{if } N^{(j)}_{\text{inliers}} \,/\, N^{(j)} > q.
$$

In the above expression, $N^{(j)}$ and $N^{(j)}_{\text{inliers}}$ are the total number of observations and the number of inliers, respectively, belonging to component $j$; and $q$ is a chosen probability value such that $1 > q > r > 0$. That is, we highlight components for split-

ting into two overlapping subcomponents when they have more than a proportion

q of their observations classified as inliers when the theoretical expected proportion

would only be r. In other words, we adopt a simple two level thresholding approach

in identifying inliers components: we choose r to correspond to the proportion of ob-

servations we would expect to see lying in the center region of a fitted component,

then by assessing how many observations actually lie in that region for each com-

ponent present in the model, we can decide whether a further split is needed. Note

that, however, choosing r too close to zero will result in the algorithm missing the

potential inliers subcomponent if its µ is not fairly close to the fitted component

mean.
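The criterion can be sketched as follows, assuming the posterior dictionary layout of the earlier sketches; chi2.ppf supplies the chi-square quantile on the right-hand side of Equation (4.6), and the function name is an illustrative choice.

```python
import numpy as np
from scipy.stats import chi2

def flag_inlier_split(x, q, post, settings):
    """Sketch: flag components with more than proportion q of their
    observations lying inside the MD threshold of Equation (4.6)."""
    labels = q.argmax(axis=1)           # most likely component for each x_i
    threshold = np.sqrt(chi2.ppf(settings["r"], df=2))
    flags = np.zeros(post["m"].shape[0], dtype=bool)
    for j in range(len(flags)):
        xj = x[labels == j]
        if len(xj) == 0:
            continue
        cov = post["S"][j] / post["v"][j]          # Sigma_j / upsilon_j
        diff = xj - post["m"][j]
        md = np.sqrt(np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff))
        flags[j] = (md < threshold).mean() > settings["q"]
    return flags
```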

4.3.2.2 Standard Split Criterion

In contrast with the inliers and non-inliers split criterion we have just described, the

type of move we use here, which splits components into two side-by-side subcom-

ponents is much more typical in the literature. In fact, it is generally the only type

of split that is used in mixture modeling problems (Richardson and Green, 1997;

Constantinopoulos and Likas, 2007) or clustering algorithms (Ball and Hall, 1965).

For our standard split criterion move, we propose using principal component anal-

ysis (PCA). We use PCA to transform linearly correlated variables into a set of un-

correlated principal components (or eigenvectors). This can assist in determining

whether or not a component has too much variation in a certain basis as this would

suggest that it could perhaps be better fitted using more than one component. An-

other advantage to using PCA is that we can straightforwardly incorporate it into our

algorithm since it makes use of the easily computable covariance matrix Σj/υj which

is already estimated in the algorithm.

In the PCA for our bivariate model, we transform our variables to obtain two prin-

cipal components p1 and p2. Here p1 represents as much of the data variation as

possible, corresponding to the larger eigenvalue λ1, and p2 represents the remaining variation

(λ2) of the component. The standard split criterion for the jth component is then

defined as split into two side-by-side subcomponents

$$
\text{if } \frac{\lambda_1}{\lambda_1 + \lambda_2} > g \ \text{ and if } \ \sigma^{(j)}_{\max} > s.
$$

In the above expression $\sigma^{(j)}_{\max} = \sqrt{\max\left(\mathrm{diag}\left(\Sigma_j/\upsilon_j\right)\right)}$, which represents the larger

standard deviation in either X-Y coordinates of the component of interest. This

means that we assess that a component should be split into two side-by-side com-

ponents if it has more than proportion g of the data variation along p1, and the larger

of the standard deviations in either X-Y coordinates is greater than s. Here g and s

are carefully chosen values that will often be application driven: g should be chosen

to be reasonably large as this would reflect a large difference in the eigenvalue ratio

which typically suggests a poorly fitted component, and the smaller s is, the more

likely we will be to split a component that is irregular, this would then lead to more

complex fitted models. In this way we have devised an alternative criterion to that

used in Constantinopoulos and Likas (2007) where the order of the split is assessed

by $\det\left(\Sigma^{-1}\right)$. It is also an alternative to that used in Ueda and Ghahramani (2002)

which assessed whether to split based on the KL divergence distance between the

data density and its estimated model distribution. Note that, instead of σmax, alter-

natively we could have used other measures such as $\sqrt{\lambda_1}$. Either of these criteria

would assist the algorithm in a similar manner; here σmax is adopted following Ball

and Hall (1965).
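A sketch of this check, again assuming the posterior dictionary layout of the earlier sketches; the eigenvalues of the estimated covariance $\Sigma_j/\upsilon_j$ play the role of $\lambda_1$ and $\lambda_2$, and the function name is an illustrative choice.

```python
import numpy as np

def flag_standard_split(post, settings):
    """Sketch: flag components whose first principal component carries
    more than proportion g of the variation and whose larger X-Y
    standard deviation exceeds s."""
    flags = np.zeros(post["m"].shape[0], dtype=bool)
    for j in range(len(flags)):
        cov = post["S"][j] / post["v"][j]
        lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # lambda_1 >= lambda_2
        sigma_max = np.sqrt(np.max(np.diag(cov)))      # larger X/Y std dev
        flags[j] = (lam[0] / lam.sum() > settings["g"]
                    and sigma_max > settings["s"])
    return flags
```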

4.3.3 Component split operations

In Section 4.3.2 we described the two proposed split criteria that are used in Step

4. With the exception of cases where the algorithm terminating criterion has been

satisfied in Step 5 (e.g., this could be satisfied because there are no remaining poorly

fitted components for splitting), the aim in Step 6 is then to split the identified com-

ponents into either two or three subcomponents as appropriate. The main focus

here is on determining the posterior parameter values for initializing the newly cre-

ated subcomponents. Depending on which split criterion the component has sat-

isfied, one of the two possible split moves will be performed. We discuss these and

the special case where a component has been flagged by both of the criteria in detail

in Sections 4.3.3.1–4.3.3.3 below. Section 4.3.3.4 then gives a description of model

adjustments that we carry out after all subcomponents have been created in order

to ensure that the algorithm continues towards convergence.

4.3.3.1 Inliers and Non-Inliers Split Operation

If a component has satisfied the inliers and non-inliers split criterion (see Sec-

tion 4.3.2.1), this implies that at least some proportion q of observations assigned

to that component have been assessed as inliers. In these instances, our split cre-

ates two new overlapping subcomponents; one of these represents the inliers and

the other represents the non-inliers. Assuming that we choose q ∼ 50% in this split

move, we can initialize the posterior parameters of the two new subcomponents (in-

stead of modifying the qij ’s to estimate them) as follows.

• $m_{\text{inliers}} = m_{\text{non-inliers}} = m_j$;

• $\alpha_{\text{inliers}} = \alpha_{\text{non-inliers}} = 0.5 \times \alpha_j$;

• $\beta_{\text{inliers}} = \beta_{\text{non-inliers}} = 0.5 \times \beta_j$;

• $\upsilon_{\text{inliers}} = \upsilon_{\text{non-inliers}} = 0.5 \times \upsilon_j$;

• $\Sigma_{\text{non-inliers}} = \Sigma_j$;

• $\Sigma_{\text{inliers}} = (1/c_1) \times \Sigma_j$,

where c1 > 1 is a user chosen value. This implies that we assign these two new sub-

components the same means with only half of the parent component mixing weight,

and the inliers subcomponent will have smaller variances than its parent compo-

nent. These newly created subcomponent initializations are used in the next round

of standard VB-GMM iterations in Step 2. The choice of c1 is data dependent, and

represents the assumed difference in variance between the inliers and non-inliers

components. While the specific choice of c1 will generally have only a minimal effect

on the resulting fit, setting c1 too large or too small may increase the likelihood of the

newly created inliers subcomponent being eliminated.

Note that the user may wish to estimate these posterior parameter values more for-

mally by first partitioning the observations in the component. However, we found

that our simple proposal was sufficient as the following VB iterations will adjust these

proposed values.
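A minimal sketch of this initialization, under the $q \approx 50\%$ assumption above and the posterior dictionary layout of the earlier sketches; the dictionary manipulation and function name are illustrative choices.

```python
import numpy as np

def split_inliers(post, j, c1=1000.0):
    """Sketch: replace component j's posterior parameters with the two
    overlapping 'non-inliers' (kept at index j) and 'inliers' (appended)
    subcomponent initializations listed above."""
    new = {key: np.array(val) for key, val in post.items()}   # copy
    for key in ("alpha", "beta", "v"):
        half = 0.5 * new[key][j]
        new[key][j] = half                            # non-inliers half weight
        new[key] = np.append(new[key], half)          # inliers half weight
    new["m"] = np.vstack([new["m"], new["m"][j]])     # identical means
    inlier_S = new["S"][j] / c1                       # much smaller variance
    new["S"] = np.concatenate([new["S"], inlier_S[None]], axis=0)
    return new
```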

4.3.3.2 Standard Split Operation

Components flagged by the standard split criterion (see Section 4.3.2.2) will be split

into two side-by-side subcomponents. Our approach is to use the data for initializing

the posterior parameters of these subcomponents. We do so by dividing the obser-

vations by first linearly projecting them on to p1 via PCA, and then we group them

according to $m^{(j)}_{p_1}$, which is the $p_1$ transformation of $m_j$. These subcomponent ini-

tializations are then used in the standard VB-GMM iteration at Step 2. Our approach

here differs from Ghahramani and Beal (1999) where the split direction was instead

sampled from the parent component’s distribution rather than relying on PCA. While

our objective here is similar to that of Constantinopoulos and Likas (2007), we found

their inverse covariance matrix assumption $T^{\pm} = T^{(j)}$ for initializing the two sub-

components problematic for our real world application as many components that

required this type of split were those covering two or more unrelated clusters, and in

these cases the inverse covariance matrix assumption is not realistic.
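A sketch of the PCA-based partition used to seed the two side-by-side subcomponents; setting the remaining posterior parameters from the two groups is omitted, and the names are illustrative choices.

```python
import numpy as np

def split_standard_groups(x, labels, post, j):
    """Sketch: project component j's observations onto its first
    principal component p1 and partition them about the projection of
    m_j; each group then seeds one side-by-side subcomponent."""
    xj = x[labels == j]
    cov = post["S"][j] / post["v"][j]
    eigvals, eigvecs = np.linalg.eigh(cov)
    p1 = eigvecs[:, np.argmax(eigvals)]      # direction of lambda_1
    cut = post["m"][j] @ p1                  # m_j transformed onto p1
    proj = xj @ p1
    return xj[proj <= cut], xj[proj > cut]
```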

4.3.3.3 Case of Splitting One Component into Three Subcomponents

If a component satisfies both of our split criteria, we perform both operations on the

component which results in its replacement by three new subcomponents instead

of two. We do this by performing the inliers and non-inliers split operation first.

When the standard split operation is then performed, all posterior parameters with

the exception of m will need to be further halved as a result of inliers which were

not excluded from the side-by-side subcomponent initialization process. As before,

these initializations are used in the next iteration of Step 2.

4.3.3.4 Adjusting the Variance Posterior Parameters for All Components

As mentioned, our final task in Step 6 is to adjust the overall model that we obtain

after all subcomponents have been initialized. Due to the convergence properties

of the VB algorithm, the combined component variance will generally decrease at

each iteration as the algorithm moves closer to a solution and the components pro-

vide an improved fit. Since we will have split some or all components in the stable

model, it is logical to assume that the overall dynamics of the model will have been

changed. Since splitting leads to additional components in some neighborhoods,

we would expect that some of the observations covered by other non-altered com-

ponents could now be incorrectly classified. We can address this issue by increasing

the variances of all components (after all subcomponents have been initialized) such

that the value of posterior parameter Σj is updated as the following:

$$
\mathrm{diag}\left(\Sigma_j^{*}\right) = c_2 \times \mathrm{diag}\left(\Sigma_j\right),
$$

with a user chosen value c2 > 1. We do this without concern as the estimates of the

variances will then be updated and improved with further iterations of the algorithm.

That is, the specific choice of c2 would generally have little effect on the overall re-

sults, but of course setting c2 too small would defeat the purpose of this particular step.

After this process has been performed, we will return to Step 2.

4.3.4 Algorithm termination criterion

In Step 5, we must decide whether to terminate the algorithm. In order to do this we

introduce termination criteria. We declare that the algorithm should be terminated

if either:

• No components satisfy any of the split criteria (T1);

• $N_{\text{splitting}}/k_{\text{surviving}} > c_3$ (T2); or

• Model log-likelihood (LL) is the same within a tolerance level δ2 as one of the

previous stable models in which ksurviving is identical (T3).

Here $c_3$ is a chosen value such that $c_3 > 1$, and $k_{\text{surviving}}$ and $N_{\text{splitting}}$ represent the number of components currently surviving, i.e., currently in the model, and the total number of split operations that have been performed up to this point, respectively.

If the chosen c3 is too small, it will limit the algorithm’s exploration of the parameter

space; while if too large, potentially unnecessary iterations will be performed, there-

fore this choice involves a trade off. On the other hand, there is no need to choose

the tolerance level δ2 to be very small as very small differences between the LLs have

little significance.

Termination criterion T1 is straightforward, so here we give some further detail on

the motivation for the other criteria.

Criterion T2 allows us to assess whether further split attempts are worthwhile based

on previous split attempts. Ideally we would like to track whether previous splits

have been successful or not as was done in Constantinopoulos and Likas (2007).

However, because the design of our algorithm allows for multiple splits to be per-

formed simultaneously, and these splits can potentially change the dynamics within

the model, the tracking approach is less straightforward here. For this reason we

have designed T2 as a way of tracking simply how successful the splits have been

overall.

Criterion T3 was proposed for recognizing the two other situations where the algo-

rithm should be terminated to avoid wasteful unnecessary computations. Firstly,

we know that when all split attempts have failed the algorithm will most likely have

converged back to models which are identical or very similar to those models that

we had prior to the attempted split. Secondly, we know that, quite often, we can

model the same (heterogeneous) data well with several alternative models; and in

those situations, we have observed that our algorithm can become stuck moving be-

tween several alternative ‘good’ models in our application. As a result, we would like

to terminate the algorithm immediately in these situations without

further unnecessary computations.
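A sketch of the termination check, assuming the loop keeps a count of splits performed and the (ksurviving, LL) pairs of previous stable models; this bookkeeping and the function name are illustrative assumptions.

```python
def should_terminate(flags, n_splits, k_surviving, loglik, past_stable, settings):
    """Sketch of criteria T1-T3.  past_stable is a list of
    (k_surviving, log-likelihood) pairs from previous stable models."""
    if not any(flags):                                          # T1
        return True
    if n_splits / k_surviving > settings["c3"]:                 # T2
        return True
    for past_k, past_ll in past_stable:                         # T3
        if past_k == k_surviving and abs(loglik - past_ll) < settings["delta2"]:
            return True
    return False
```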

4.3.5 Model selection criterion

The final step, Step 7, of our algorithm is to select the final model. We opt to select the

final model after all proposed splits have been considered and computations have

been completed. Models within the Bayesian framework can be evaluated based on

information criteria such as BIC or DIC. The DIC has been used as a complementary

model selection technique in McGrory and Titterington (2007)’s VB-GMM. However,

in the VB literature, F is perhaps the most popular approach for comparing models.

Interestingly, some studies (e.g., Beal and Ghahramani, 2002, 2006) have shown that

monitoring of F consistently outperformed the less computationally efficient BIC

approach for finding an appropriate model structure in each of their simulated ex-

amples. Most previous VB-GMM splitting algorithms allow components to be split

(Ghahramani and Beal, 1999; Ueda and Ghahramani, 2002; Constantinopoulos and

Likas, 2007) and then examine if the proposed random split should be accepted or

rejected based on the improvement of F . We do not monitor F or utilize F for deter-

mining the validity of the splits. Our proposed splits target poorly fitted components

meaning that intuitively they should result in a better representation of the data and

visual inspection of empirical results suggested that this was the case. However,

somewhat surprisingly, we observed that on some occasions the fit we obtained after

carrying out splits was ranked lower than the pre-split model fit, with respect to BIC,

DIC, and F . Since this conflicts with intuitive reasoning, we further explored this

issue and concluded that this is largely due to discreteness in our dataset. We have

proposed a new criterion for comparing the fitted models; we have found that eval-

uating the results and selecting the final model using our new goodness-of-fit mea-

sure is more robust to this issue than any of the aforementioned criteria based on

empirical results. We outline and discuss how we propose to evaluate the goodness-of-fit below, and Section 4.4.3 further illustrates this point through empirical studies.

Model evaluations or selections based on goodness-of-fit measures are particularly

useful for Bayesian techniques as they suffer less from the problem of over-fitting.

In this respect, it has been shown that absolute error is a preferred goodness-of-fit

measure over widely used square error related measures (e.g., sum of square error

(SSE), and root mean square error (RMSE) used in Ueda et al. (2000) and Ueda and

Ghahramani (2002), for example) which are known to be misleading and particularly

sensitive to outliers (see Armstrong, 2001, Chapter 14 for further detail). However,

none of these simple distance related measures are appropriate for evaluating re-

sults in applications such as ours, which often involve large numbers of inliers and in which

variables are highly correlated. To address this problem, we have introduced an al-

ternative criterion which is based on absolute errors, but also takes the 1-norm of

the covariance matrix of each component into consideration. We propose that this

provides a more appropriate assessment of the model. We call our proposed mea-

sure Mean Absolute Error Adjusted for Covariance (MAEAC), and it is based on the

use of MD:

$$
\mathrm{MAEAC} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MD}_i^{(j)} \times \frac{\sqrt{\left\|\Sigma^{(j)}\right\|_1}}{\sqrt{\upsilon^{(j)}}}, \tag{4.7}
$$

where observation xi belongs to the jth component as determined by the maximum

value of qij (Equation (4.5)). Recall that MD has been used in Section 4.3.2.1 for iden-

tifying inliers, and it can be considered as an absolute deviance measure. In MAEAC,

we estimate the model 'absolute error' with respect to observation $x_i$ by multiplying its $\mathrm{MD}_i^{(j)}$ by our best estimate of the deviance of the $j$th component, the square root of the maximum overall variance of the component; averaging these estimated absolute errors

results in MAEAC. We select the final model based on this goodness-of-fit measure

before we end the algorithm. We believe that this is a more appropriate selection cri-

terion than either BIC, DIC or F, because unlike these it does not involve an estimated

LL term. Estimation of the LL can be unstable when the component estimated co-

variance measure is singular or near singular. For example, this can occur when there

is a point mass in the dataset. Empirical studies support the assertion that MAEAC is

more robust in these settings and we also find that the use of MAEAC leads to more

consistent and reliable model selection when the same data is being analyzed with

different initialization settings. The initialization settings we are referring to are the

choice of initial model complexity kinitial and the corresponding various possible ini-

tial allocations of observations to components. Note that the user may elect to adopt

the usual model selection criteria instead of our MAEAC.
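A minimal sketch of Equation (4.7), assuming the posterior dictionary layout of the earlier sketches; np.linalg.norm(A, 1) computes the matrix 1-norm (maximum absolute column sum), and the function name is an illustrative choice.

```python
import numpy as np

def maeac(x, q, post):
    """Sketch of the MAEAC criterion, Equation (4.7): each observation's
    MD from its most likely component, scaled by the square root of the
    1-norm of that component's estimated covariance Sigma_j/upsilon_j."""
    labels = q.argmax(axis=1)
    total = 0.0
    for j in range(post["m"].shape[0]):
        xj = x[labels == j]
        if len(xj) == 0:
            continue
        cov = post["S"][j] / post["v"][j]
        diff = xj - post["m"][j]
        md = np.sqrt(np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff))
        total += md.sum() * np.sqrt(np.linalg.norm(post["S"][j], 1) / post["v"][j])
    return total / len(x)
```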

4.4 Human Mobility Pattern Application & Results

4.4.1 Data mining & human mobility patterns

Data mining involves extracting nontrivial, previously unknown, but useful hidden

information from large datasets (Han and Kamber, 2006). It has attracted increas-

ing attention in recent years as a result of the rapidly growing amount of available

data, and the timely need to turn it into knowledge. The real world application on

modeling human mobility patterns that we explore in this paper, is one such exam-

ple of a research area where large amounts of data are involved and efficient data

mining techniques are required. Through this application, we demonstrate that our

algorithm has improved upon the standard VB-GMM method and is also more ro-

bust to various initialization settings. This is a prime example for illustrating our

approach because individuals’ observed mobility patterns are highly heterogeneous

and spiky, and there is very little prior knowledge about them. We model each indi-

vidual’s mobility pattern by a GMM. We believe our modeling approach is appropri-

ate, because of the ease of interpretation, flexibility and computational convenience

of GMM. We present results for both simulated (in Section 4.4.2) as well as real data

(in Section 4.4.3).

Human trajectories have been modeled before with Lévy flight and random walk

models (Brockmann et al., 2006); but these previous analyses have not taken in-

dividuals’ well-known high degree of spatial regularity into consideration i.e., it is

known that over a period of time people tend to frequently return to the same sev-

eral locations, and these frequented locations may change across different periods

throughout their lives. For example, for many people, two of their most frequented

locations will be their current home and office (c.f. Gonzalez et al., 2008). This is

an important issue that should be accounted for when modeling this type of data,

particularly when we consider that it is estimated that individuals typically spend

approximately 40% to 80% of their time in their first two preferred locations. Note

that Gonzalez et al. (2008) show that the probabilities of individuals visiting certain

locations can be reasonably approximated by a truncated power law. This regularity

issue, which has a significant influence on the choice of an appropriate distribu-

tion for use in modeling, is addressed very easily in our approach with the use of our

proposed inliers and non-inliers split process. That is, we can model the frequently

visited locations with inliers components and capture the broader activity areas with

non-inliers components. This, on the whole, should lead to a better representation

of the observed mobility patterns.

Of course, while individuals typically spend the majority of their time in the same

area(s), they will also occasionally visit alternative locations which we refer to as

‘remote’ in this context. We use the term ‘remote locations’ to encompass all ob-

served locations other than the ones that are habitually visited by the given individ-

ual. These remote locations can vary widely in their range of distance from the habit-

ual daily activity areas and in frequency of observation. For example, a one-off visit

to a friend living at the opposite end of the city, or a vacation to the other side of the

country, would both represent visits to remote locations in relation to an individual’s

commonly exhibited day to day trajectories. Readers are referred to Gonzalez et al.

(2008) for further discussion on this point. This is another important feature of hu-

man mobility patterns which has also been ignored in previous Markov-based mod-

eling approaches. This characteristic presents challenges for the standard algorithm

as these isolated cases can have a large effect on the fit obtained using the standard

approach. We have observed repeatedly that applying the standard algorithm with

kinitial too small, or simply with an inappropriate component membership initializa-

tion setting, can lead to one component representing two or more ‘unrelated’ areas

visited by an individual. Here the notion of unrelated refers to locations in which the

individual’s observed presence has no clear connection, for example, we might think

of observations recorded at a person’s office and the surrounding cafes or transport

hubs as being related, while an observed visit to a restaurant in another part of town

and a visit to their local doctor’s surgery are unrelated. Our algorithm, through our

proposed standard split process, aims to address this issue in modeling terms, and

is therefore more robust to the initialization settings.

Gonzalez et al. (2008) points out that modeling of human mobility patterns has sug-

gested that regardless of how diverse and wide an individual’s mobility or travel

history is, human beings tend to follow simple underlying reproducible patterns.

Therefore, while our interest here is in capturing patterns with the business oriented

view of improving understanding of the needs and habits of each telecommunica-

tion customer, an ability to effectively model human mobility patterns could poten-

tially have wider implications for many other real world problems that are driven by

the effects of human mobility from the formulation of disease and epidemic strate-

gies to disaster response strategies.

4.4.2 Simulated results

We consider modeling a simulated human mobility-like spatial pattern. We compare

our SEVB-GMM algorithm to the standard VB-GMM algorithm outlined in McGrory

and Titterington (2007) using identical prior initialization, the settings for which are

not our primary concern with the exception of kinitial. We opt to use uninformative

prior settings as it is unrealistic for us to assume having any advance specific knowl-

edge on each pattern to be analyzed.

We have several algorithm settings that we must assign. We point out that in the case

of straightforward non-heterogeneous datasets, the particular choices are much less

significant as they will generally only impact computational time. This is because

their influence is to affect the number of split-merge attempts in the convergence

towards the final fit and, with datasets where there is clearer separation between

components, the final fitted model would generally not be altered as the algorithm

can more easily recover a good fitting model. However, for real data applications

where we often see heterogeneous or spiky patterns, these choices become slightly

more significant: if these values are chosen so as to encourage more split attempts

then we increase the likelihood of finding particular components and hence obtain-

ing a more complex model. This means that in choosing these variables our aim is to

find a trade off which encourages an appropriate tendency for split attempts with as

little computational waste as possible. In practical settings the choice of these val-

ues will primarily be application driven and here we explain the motivation for our

particular choices within the context of our problem. Our choices, which are also collected in the code sketch following this list, were:

• δ1 = 0.00001 longitudinal or latitudinal degrees. This choice corresponds to a

small distance of approximately 1 meter as the tolerance level for declaring a stable

model. We use a very small δ1 here because the actual position of an individual,

unlike the observed data, is not restricted by the locations of the cell towers (c.f.

Gonzalez et al., 2008),

• δ2 = 1 as the model LL tolerance level for terminating the algorithm, which

means that we assume that the model LLs are equal if the difference between the current and any previous stable model is less than 1,

• r = 0.25 and q = 0.45, representing that we would like to separate out inliers from the rest of the observations for components that have more than 45% of their observations classified as inliers, whereas the theoretical value is only 25%,

• g = 0.90 and s = 0.05 degrees, representing that we would like to split components for which over 90% of the data variation is along the principal component p1, and whose extent exceeds 30 km, or 0.30 degrees (i.e., 6s), in either the longitudinal or latitudinal direction,

• c0 = 5 so that in each case the algorithm will have at least 5 iterations; this

choice is simply to ensure that the algorithm does not terminate prematurely

and any other small number would be reasonable,

• c1 = 1000 which means that the variance posterior parameter of the inliers sub-

component will be initialized 1000 times smaller than that of the non-inliers

subcomponent when the inliers and non-inliers split operation is performed,

• c2 = 100 which means that after all subcomponents have been initialized, we

would like to adjust the overall model by increasing all component variance

posterior parameters by 100 times, and

• c3 = 2 which means that, on average, we allow each surviving component to be split fewer than two times.
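Purely as an illustration, the settings just listed can be collected in one place as follows; the variable names are our own and nothing here is part of the algorithm itself.

# Illustrative encoding of the SEVB-GMM settings described above; the names
# are our own and the values simply mirror the choices stated in the list.
sevb_settings = {
    "delta1": 0.00001,  # degrees; ~1 m stability tolerance on component means
    "delta2": 1.0,      # model LL tolerance between stable models
    "r": 0.25,          # theoretical central-region proportion for inliers
    "q": 0.45,          # observed inlier proportion that triggers a split
    "g": 0.90,          # split if 1st eigenvalue > 90% of total variation
    "s": 0.05,          # degrees; 6*s = 0.30 degrees (~30 km) size threshold
    "c0": 5,            # minimum number of iterations before termination
    "c1": 1000,         # inlier/non-inlier variance ratio at split time
    "c2": 100,          # post-initialization variance inflation factor
    "c3": 2,            # average split attempts allowed per surviving component
}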

Note that we chose q close to 0.50 as a result of our inliers and non-inliers split operation assumption and in order to have a meaningful inliers subcomponent with a reasonable number of observations. On the other hand, our choice of r, which is partially affected by the choice of q, allows us to identify potential inlier subcomponents whose µ's are not fully aligned with the current component mean. This is


particularly important for our real application data as it is somewhat discrete (c.f. comments above on the selection of δ1). We find the selected values of g and s to be sensible, as the components split using the standard approach in this application mainly arise through individuals occasionally traveling to isolated locations in relation to their base. That is, we target components whose first eigenvalue typically contributes close to 100% of the component variation, as these are the result of components wrongly surviving through groupings of observations from two or more unrelated components; and we simply elect to ignore those cases with small σmax which, even when they appear to be poorly fitted, are unlikely to affect our understanding of an individual's mobility pattern overall.

We emphasize that the selection of c0 = 5 is only there to ensure that the algorithm has a chance to move towards stable model results by avoiding early termination. Users may select alternative values, but we would suggest five as a minimum; our data analysis in Section 4.4.3 involved 3000 runs, and over 95% of them required more than five iterations (including the initialization and splitting steps) to reach the first stable models. The choice c1 = 1000 is proposed because the average cell tower service area is approximately 3 km², with over 30% of towers covering an area of 1 km² or less (Gonzalez et al., 2008); cities like New York City are about 1000 times that size. c2 = 100 is used as we have observed that, on average, when excluding the top 5% of cases, the combined component variance of our test data decreased by a factor of around 100 from the first iteration to the first stable model. We have observed that when a much smaller c2 is used, we are more likely to obtain results which either over-fit the data or miss inliers components, and which are less robust to the initialization settings.

Empirical results suggest that our algorithm is generally not very sensitive to the selection of these parameter values. However, many of them were set based on advance knowledge of the nature of the data. It is largely only the computational efficiency that may be influenced by the selection of δ1 and δ2. However, the model complexity, in particular for analyzing real application data, may be affected to some extent by the selection of r, q, g, s, and c3. The selection of c0, c1 and c2 has largely no effect on either the computational efficiency or the model complexity. We discuss this in our limited sensitivity analysis based on the simulated data at the end of this section.

Our experimental results are based on a simulated dataset of 200 observations from a mixture of seven Gaussian distributions with parameter values given in Table 4.1 and shown in Figure 4.1 (a). This simulated individual is most active around the area of latitude (Lat) 2.90 and longitude (Lng) 1.50 (c.f. components 1 to 3; in total 85% of all observations) and is particularly active around the locations marked by component 2. This simulated individual has occasionally visited other locations, marked by components 4 to 7.


Table 4.1: Parameters of the mixture model that our synthetic data were simulated from

Component   Mean Vector     Covariance Matrix
(#Points)   Lat    Lng      Var(Lat)  Cov     Var(Lng)
1 (40)      2.90   1.50     3e-03     1e-04   2e-03
2 (100)     2.95   1.49     1e-05     0       1e-05
3 (30)      2.90   1.51     1e-04     0       1e-04
4 (13)      2.20   1.20     1e-03     9e-04   1e-03
5 (15)      3.30   1.25     1e-04     0       1e-04
6 (1)       3.40   1.25     1e-99     0       1e-99
7 (1)       3.35   1.25     1e-99     0       1e-99

Table 4.2: Parameter estimates recovered by our SEVB-GMM algorithm with kinitial = 1 for the simulated dataset plotted in Figure 4.1 (a)

Component   Mean Vector     Covariance Matrix
(#Points)   Lat    Lng      Var(Lat)   Cov        Var(Lng)
1 (41)      2.90   1.50     3.5e-03    -3.7e-05   2.3e-03
2 (100)     2.95   1.49     1.3e-05    -2.3e-07   8.9e-06
3 (29)      2.90   1.51     8.9e-05    -2.0e-05   5.5e-05
4 (13)      2.21   1.20     8.3e-04    6.8e-04    7.0e-04
5 (17)      3.31   1.25     7.1e-04    -2.0e-05   1.2e-04

The results from our SEVB-GMM algorithm with kinitial varying between one and 20 were identical, and this is shown in Figure 4.1 (b), while selected results of the standard VB-GMM algorithm with various kinitial are presented in Figure 4.2. The ellipses represent 95% probability regions for the component densities, whereas the estimated centers of these components are marked by '+'s in this figure. Note that we plot our data pattern separately from the fitted models for clearer presentation.
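For readers wishing to reproduce a pattern of this kind, the following sketch draws a dataset from the Table 4.1 parameters and computes the boundary of a component's 95% probability region from the chi-square quantile; it is an illustration rather than the exact code used here.

import numpy as np

rng = np.random.default_rng(1)
params = [  # (n, mean (Lat, Lng), (Var(Lat), Cov, Var(Lng))) from Table 4.1
    (40,  (2.90, 1.50), (3e-03, 1e-04, 2e-03)),
    (100, (2.95, 1.49), (1e-05, 0.0,   1e-05)),
    (30,  (2.90, 1.51), (1e-04, 0.0,   1e-04)),
    (13,  (2.20, 1.20), (1e-03, 9e-04, 1e-03)),
    (15,  (3.30, 1.25), (1e-04, 0.0,   1e-04)),
    (1,   (3.40, 1.25), (1e-99, 0.0,   1e-99)),
    (1,   (3.35, 1.25), (1e-99, 0.0,   1e-99)),
]
X = np.vstack([rng.multivariate_normal(m, [[v1, c], [c, v2]], size=n)
               for n, m, (v1, c, v2) in params])   # 200 observations

def ellipse_95(mean, cov, n_points=100):
    # Boundary of the 95% probability region of a bivariate Gaussian:
    # scale the unit circle by chol(cov) and the chi-square(2) quantile.
    radius = np.sqrt(5.991)                     # 0.95 quantile of chi2(2)
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = radius * np.vstack([np.cos(theta), np.sin(theta)])
    L = np.linalg.cholesky(np.asarray(cov))
    return (np.asarray(mean)[:, None] + L @ circle).T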

In contrast to the standard VB-GMM algorithm, our SEVB-GMM algorithm has produced very consistent models regardless of the value of kinitial. That is, our proposed split process appears to be working, and we can obtain models with k both higher and lower than initially proposed. On the other hand, as discussed before, the standard algorithm can be sensitive to the isolated observations when uninformative initialization settings are used. Besides the outliers (c.f. components 6 and 7), our algorithm has recovered all other components, including both inliers components (c.f. components 1 and 3). The parameter values estimated by our algorithm with kinitial = 1 are presented in Table 4.2. We found that when a larger kinitial is used in the standard algorithm, we are more likely, but not always (c.f. kinitial = 11 and kinitial = 14), to recover the model. Neither algorithm allocated any observations to the outliers, which is useful for understanding individuals' mobility patterns at an aggregated level.

Note that despite the fact that six components were identified for the standard VB-GMM algorithm with kinitial = 11 and kinitial = 14 (c.f. Figure 4.2 (d)–(e)), we stress that the sixth component does not represent the outliers.


Figure 4.1: (a) Plot of our simulated dataset (n = 200) where the data points ('Actual') are marked by an 'x'. (b) The results of our SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we also show 95% probability regions (outlined by '-') for each component in the model. We can see that the data appear to be well represented by the fitted model. Note also that the resulting fit is identical for kinitial = 1 to 20.

The extra component (c.f. the linear-shaped component) was formed because observations from two unrelated components were inappropriately grouped together by the algorithm. These are good examples of results obtained when observations' initial component allocations are poor, but, as we have shown, this is not a concern for our SEVB-GMM algorithm, while it is for the standard algorithm.

Finally, we discuss the implications of the choice of algorithm settings. We concentrate on the SEVB-GMM algorithm only, performing a limited sensitivity analysis of its settings. We focus solely on the implications of the choice of values for g, s, r, q, c1, c2 and c3. Having experimented with using a wide

range of algorithm settings on our SEVB-GMM algorithm, we are pleased to observe that, with some rare exceptions, our algorithm has remained able to recover the correct model shown in Figure 4.1 (b). As we would expect, most of the exceptions to this occurred when kinitial was very small; this follows logically, as the settings we are modifying have some influence on whether or not split attempts will be made and on how the new components will be initialized. Of course, the splitting feature is crucial when there are not enough components in the initial model to represent the data well. As a result, here we only focus on reporting our results for various initializations of the user-chosen parameters when kinitial = 1, as this gives a reasonable picture of the effect these parameters can have.

Our analysis produces three different models: the correct model shown in Figure 4.1 (b); the correct model minus one inliers component, which we will refer to as a four-component model; and the failed model in which no component splitting has taken place, i.e., what we would have achieved by simply using the standard VB-GMM algorithm.


Figure 4.2: Selected results obtained from applying the standard VB-GMM algorithm under different initialization conditions (panels: (a) kinitial = 1, (b) kinitial = 2, (c) kinitial = 5, (d) kinitial = 11, (e) kinitial = 14, (f) kinitial = 19) to the simulated data shown in Figure 4.1 (a). The centers of each component in the fitted mixtures are indicated by a '+'; we also show 95% probability regions ('-') for each component in the model. The computed values of F and MAEAC, and the fitted value of k in the final model, are also shown. We can see that the initial choice for k and the corresponding initial component allocation does influence the final fit obtained.


We detail the results of interest below:

• the correct model was recovered when g was changed from 0.9 to 0.5; this change means that a component may be split if its first and second eigenvalues are not equal, which is a rather extreme choice;

• the correct model was recovered when s was changed from 0.05 to 0; this change means that a component may be split regardless of its size, which is again a rather extreme choice;

• the correct model was recovered when r was changed from 0.25 to 0.40;

• a four-component model was recovered when r (which we recall determines

the size of the center region of the components for an observation to be con-

sidered as an inlier) was changed from 0.25 to 0.10, but the algorithm failed

when a more extreme choice of 0.01 was utilized;

• the correct model was recovered when q (the proportion of components actu-

ally considered as inliers) was changed from 0.45 to either 0.30 or 0.60;

• a four-component model was recovered with q = 0.70, but the algorithm failed

when an extreme choice of 0.90 was utilized;

• the correct model was recovered when c1 (the variance ratio between inliers

and non-inliers subcomponent) was changed from 1000 to 10000;

• a four-component model was recovered with c1 = 100 or 10, but the algorithm failed when extreme choices such as 100000 or 1 were used (note that c1 = 1 corresponds to effectively disabling the inliers and non-inliers split operation (c.f. Section 4.3.3.1));

• a four-component model was recovered when c2 was changed from 100 to

1000, but the algorithm failed when 10 or 1 was used (note that the choice of

1 would imply no overall model variance posterior parameter adjustment (c.f.

Section 4.3.3.4) should take place). The test with c2 = 1 is particularly impor-

tant to demonstrate the need to have this operation;

• a four-component model was recovered when c3 was changed from 2 to 1; this change means that on average a component can only be split once;

• the correct model was still recovered when c3 was increased to 3 or more.


Table 4.3: For the mobility pattern of Subscriber A (the observed data are plotted in Figure 4.3 (a)), we report the values of kfinal, F and MAEAC resulting from several SEVB-GMM fits that were obtained using different values of kinitial; note that for comparison these values were chosen to correspond to those selected in the study represented in Figure 4.4. Comparing these results with Figure 4.4, we can see that unlike the standard VB algorithm, SEVB is much more robust to component initialization settings.

kinitial   1       2       4       5       8       10
kfinal     3       3       3       3       3       3
F          394     574     711     572     426     575
MAEAC      0.078   0.080   0.074   0.077   0.074   0.080

One can see then that our SEVB-GMM algorithm appears to be quite robust to the

selection of these values.

4.4.3 Real data results

Our real data analysis is based on the confidential call detail record (CDR) data pro-

vided by a wireless telecommunication provider based in Australia. It comprises

every single successful outbound activity made by 100 consumer subscribers over

a 17-month period. These anonymous subscribers were randomly selected from a

large database of several million subscribers, and these subscribers have stayed con-

nected during the entire study period. In practice the geographic position of an in-

dividual is generally recorded based on the position of the mobile cell tower that was

used at the commencement of the call (c.f. Gonzalez et al., 2008). This means that

the locations of the individuals are in fact approximate rather than precise geograph-

ical locations. However, cell tower location is precise enough to give us a picture of

the users’ movements. This data was collected for billing purposes with attributes

including, but not limited to, the activity-initiated date, time and mobile cell tower

location in latitude and longitude coordinates. Here we follow the same initializa-

tion settings as in the simulated study with the exception of δ1. We choose δ1 = 0.01 degrees, which translates to approximately 1 km. This selection reflects the fact that, in reality, we do not know the exact location of the subscribers within the tower service area, as was the case in Gonzalez et al. (2008), and is therefore sufficient when distances between most towers are considered. Before we evaluate both algorithms, we

first examine model outputs of several selected anonymous subscribers with various

kinitial.

Selected model outputs have been presented in Figures 4.3 (b), 4.4 and 4.5. As before,

the ellipses represent 95% probability regions for the component densities, whereas

the estimated centers of these components are marked by ‘+’s in these figures. Fig-

ures 4.3 (b) and 4.4 are the results of our SEVB-GMM and the standard VB-GMM algorithms, respectively, for the mobility pattern of Subscriber A with various values of kinitial.


Figure 4.3: (a) Observed mobility pattern of Subscriber A (n = 56) over a 17-month period corresponding to the recorded locations, marked by an 'x', of cell towers from which telecommunication activities were initialized. (b) The results of the SEVB-GMM fit of a bivariate mixture model to these data; the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions ('-') for each fitted component. Note that results obtained were similar for kinitial = 1 to 18 and that values of kfinal, F and MAEAC corresponding to various kinitial are summarized in Table 4.3.

With the SEVB-GMM algorithm, the resulting fits were almost identical for the various kinitial, hence we only plot one of the actual fits, and Table 4.3 summarizes the values of kfinal, F and MAEAC for selected kinitial. These figures show that our algorithm is more robust than the standard method, and is able to obtain very similar results regardless of the value of kinitial used. Note that the computed values of F and

MAEAC in Table 4.3 appear to vary somewhat even though the final fitted models

were almost identical and had three components in each case; this is mostly due to

the size of this particular dataset. Since it is rather small, having only 56 observa-

tions, any differences in the posterior component allocations of the observed points

will have more of an influence on the computation of the selection criteria. Further,

the presence of a near singular inliers component in the fitted model strongly affects

the estimation of the LL value that is required for the calculation of F , which is why

F varies more than MAEAC does.

Figure 4.5 shows the results for four other subscribers, Subscribers B to E, with the clearly inappropriate choice kinitial = 1 used. It shows that our algorithm is able to model very complicated patterns sufficiently well even when kinitial was assigned incorrectly. These results are very encouraging.

Before we evaluate our SEVB algorithms more generally, we draw the reader's attention to the differences in the values of F between, for example:


Figure 4.4: Selected results obtained by using various choices for kinitial (panels: (a) kinitial = 1, (b) kinitial = 2, (c) kinitial = 4, (d) kinitial = 5, (e) kinitial = 8, (f) kinitial = 10) in the standard VB-GMM algorithm for Subscriber A's mobility pattern shown in Figure 4.3 (a); the center of each component in the fitted mixture is indicated by a '+' and we plot the 95% probability regions (marked by '-') for each fitted component.


Figure 4.5: Mobility patterns over a 17-month period for four subscribers ((a) Subscriber B, n = 2970; (b) Subscriber C, n = 1593; (c) Subscriber D, n = 3576; (d) Subscriber E, n = 1843) are shown in the left column. Observations, 'x', are the recorded cell tower locations from which subscribers initiated a communication. Bivariate mixture models fitted using SEVB-GMM are shown in the right column; the center of each fitted mixture component is marked '+' and corresponding 95% probability regions ('-') are shown. Note that SEVB-GMM was initialized with the inappropriate choice kinitial = 1 each time, yet we are still able to model the data well. Values for F and MAEAC are also reported.


• Figure 4.1 (b), and Figure 4.2 (d) and (e),

• the entries of Table 4.3 for kinitial = 5 and 8, and Figure 4.4 (d)–(f), but note the small sample size; and

• Figure 4.5 (d), and the F value resulting from the standard VB-GMM algorithm fit with the inappropriate kinitial = 1, which we found to be −16743.

These are some of the examples where the measure of F appears to conflict with

the choice of our algorithm. Comparing F between Figure 4.4 (b) and (d), for exam-

ple, gives us cause for concern about using F to choose the model since it selects a

model which does not appear to be very appropriate while MAEAC leads to a choice

which intuitively appears to be more suitable. We suggest that for this application

our goodness-of-fit measure, MAEAC, is more reliable and robust than the widely

used F measure. While we are yet to be able to show this result more generally, we

believe that one of the reasons is that F (and also BIC and DIC) relies heavily on

the estimated model LL which is influenced by the covariance matrix estimation for

components in the model. The presence of point masses in our data which occur at

some cell tower locations, and components corresponding to any potentially inap-

propriate surviving linear-shaped components in the fitted model (e.g., the linear-

shaped components in the fits shown in Figure 4.4 (d) and (e)), cause numerical

stability problems because the corresponding covariance matrices for these com-

ponents are singular or near singular. This is because the estimated covariance ma-

trix for a singular or near singular component can be heavily influenced by round-

ing errors which can have a big effect on the resulting LL estimate computed. Note

also that the estimates obtained in this case are also software dependent since dif-

ferent packages can deal with this issue in different ways. MAEAC is more robust

to these situations, as we outline in the following. When there is a fitted component corresponding to a point mass, its estimated absolute deviance measure term in the MAEAC formula will be very close to zero and have almost no influence on MAEAC's ranking of the models. In contrast, an estimated LL can be greatly and unduly influenced by even slight changes in the covariance estimates associated with a point mass; it may even be undefined, and its estimation may be misleading if the covariance matrix turns out not to be positive definite. Further, MAEAC will penalize linear-shaped components (which are often inappropriate in our application, although we note that they may not be in some situations) as the estimated absolute deviance measure term for those will be extremely large, while in contrast an estimated LL for such a term could appear very favorable as a result of the corresponding covariance matrix being near singular or singular. These point masses,

and to some degree those linear-shaped components, are a result of how this data

was recorded, i.e. recorded user locations are not continuously varying in the plane


but given by the finite set of cell tower locations. This also explains why the differ-

ence between models selected using F and MAEAC appears to be less pronounced

in our simulated data studies as synthetic data tend to be more well behaved than

real data. From this perspective, MAEAC appears to be a more robust measure un-

der these conditions and therefore it may also be useful in other applications where

these data issues arise.
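The mechanism can be illustrated numerically. The sketch below (which does not reproduce the MAEAC formula of Equation (4.7)) shows how the Gaussian log-density at a component's center diverges as its variance shrinks toward a point mass, while an absolute-deviance style term vanishes instead:

import numpy as np

def gauss_logpdf_at_mean(var):
    # log N(mu | mu, var * I_2) = -log(2 * pi * var) for a bivariate
    # isotropic Gaussian with per-coordinate variance var.
    return -np.log(2.0 * np.pi * var)

for var in (1e-2, 1e-6, 1e-12):
    ll_term = gauss_logpdf_at_mean(var)
    abs_dev = np.sqrt(2.0 * var / np.pi)  # mean absolute deviation of N(0, var)
    print(f"var={var:.0e}: log-density at mean = {ll_term:7.1f}, "
          f"mean |deviance| = {abs_dev:.1e}")

The log-density term explodes, so any criterion built on the estimated LL is at the mercy of rounding in the covariance estimate, whereas the deviance-based contribution shrinks harmlessly toward zero.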

Figure 4.6 summarizes the results of both algorithms based on the 100 anonymous

subscribers. For each of these subscribers, we have fitted 30 GMMs to their observed

pattern corresponding to the fits obtained with kinitial ranging from one to 30. Fig-

ure 4.6 (a) focuses on the final number of components fitted, kfinal, when different

kinitial are used. It shows that our algorithm is able to discover models with more

consistent estimates of kfinal, and is not limited by the choice of kinitial. We are able

to discover, on average, more than six components for this particular dataset, even

when kinitial = 1 is used. Note that this result possibly could be improved further by

varying the user chosen parameters. Moreover, this figure also shows that if we use

larger values for kinitial, we tend to obtain a fitted model with a higher kfinal. This is

understandable as the dataset we are dealing with is highly heterogeneous. However,

it is encouraging that for both algorithms, and more so for our algorithm, the value

of kfinal in the fitted models is reasonably consistent even when a large kinitial is used.

Note that, compared to the standard method, our algorithm appears to fit models with smaller kfinal when a large kinitial is used for this particular dataset.

Figure 4.6 (b) examines both algorithms from the aspect of goodness-of-fit. It shows that our model performs better, which is not surprising considering that we select our models based on this particular criterion. It also shows that the data, particularly for the standard method, tend to be less well fitted when a smaller kinitial is used, corresponding to models with smaller kfinal (c.f. Figure 4.6 (a)). Except when a very small kinitial is used, our algorithm appears to be able to obtain models with similar goodness-of-fit regardless of the value of kinitial used. Based on Figure 4.6 (a) and (b), we believe our algorithm is, in comparison, very robust and suitable for our targeted application.

Note that while we did not evaluate the results with respect to different observa-

tion membership initializations with kinitial being the same here, we believe results

in Figure 4.6 (a) and (b) are already sufficient in demonstrating the robustness of our

algorithm in this respect.

We next examined our proposed goodness-of-fit measure, MAEAC, more closely. Figure 4.6 (c) is plotted based on comparing the model BIC, DIC, and F obtained by both algorithms. It is based on the following principle: assuming our algorithm has discovered the same or better models when compared to the standard method, then, in theory, the models of our algorithm should have equal or lower BIC and DIC, as well as equal or higher F, with respect to the models of the standard method. Percentages of


Figure 4.6: Comparisons between fits obtained from the standard VB-GMM and our SEVB-GMM algorithm when using different values of kinitial ranging from 1 to 30, based on the observed data for 100 randomly selected anonymous individuals: (a) plot of the fitted kfinal vs. the kinitial that was used for both algorithms, (b) value of MAEAC (Equation (4.7)) for the fits from both algorithms vs. the kinitial that was used, and (c) for the fits obtained from both the standard and SEVB algorithms, we computed the corresponding values of BIC, DIC, F and MAEAC, then plotted the % of times there was an agreement between the model that would be selected based on the BIC, DIC or F values and the model that was selected in the SEVB algorithm using MAEAC.


cases where BIC, DIC and F agreed with our final model selection are presented in this figure. It shows that when kinitial is small, our final models, besides being better fitted, have generally also improved from the point of view of BIC, DIC and F. BIC, DIC and F appear to agree with our model selection around 60 to 70% of the time regardless of kinitial and, as we have discussed earlier, in cases where MAEAC is not in agreement with these other measures, we would expect it to be more robust to the data discreteness issues we have described.
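For completeness, a hedged sketch of the agreement computation behind Figure 4.6 (c) follows; the array contents are placeholders, and only the comparison logic reflects the principle stated above.

import numpy as np

def agreement_rate(crit_standard, crit_sevb, lower_is_better=True):
    # % of fits where the criterion would also prefer the SEVB model.
    if lower_is_better:                     # BIC, DIC
        agrees = crit_sevb <= crit_standard
    else:                                   # F (variational lower bound)
        agrees = crit_sevb >= crit_standard
    return 100.0 * np.mean(agrees)

rng = np.random.default_rng(5)
bic_standard = rng.normal(0.0, 1.0, size=100)            # placeholder values
bic_sevb = bic_standard - rng.normal(0.5, 1.0, size=100)
print(f"BIC agreement: {agreement_rate(bic_standard, bic_sevb):.0f}%")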

4.5 Discussion

We have proposed an extension of the standard VB method for approximating

GMMs. Unlike the standard approach, our algorithm can lead to models with a

higher number of components than proposed initially. It is therefore more flexible

and practical for applications as we have demonstrated through our empirical re-

sults in Section 4.4. Our approach is inherently faster than other existing VB-GMM algorithms with split operations. This is because we only attempt to split compo-

nents that are identified as showing a poor fit, as determined by our proposed split

assessment criteria, and we split all of these components at the same time. Addi-

tionally, we terminate the algorithm based on a set of diagnostic criteria in contrast

to current approaches which typically run for as many iterations as necessary un-

til F cannot be improved further. While the exact computational advantages of our

SEVB-GMM over other VB-GMM algorithms which allow for component splitting is

difficult to evaluate, for illustration we note that a total of nine split attempts were

made by our algorithm for modeling subscriber D (c.f. Figure 4.5 (c)) while this fit

would have required making at least 39 successful split attempts if we had used one

of the existing splitting algorithms with kinitial = 1. This, in our view, suggests that

ours is a more effective and efficient approach as the increased speed is extremely

important when working with real, large datasets. We have also improved on the standard algorithm in the sense that the parameter space is now explored more thoroughly.

From the application perspective, to the best of our knowledge, this is the first piece

of research that aims to model individuals’ overall human mobility patterns with

GMMs. Mixture modeling (Jain and Dubes, 1988, pp.117–118) is often considered a model-based approach to clustering. For comparison, we have also attempted

to model the simulated data used in Section 4.4.2 with the well-known k-Means

(KM) algorithm as well as DBSCAN (Density-Based Spatial Clustering of Applica-

tions with Noise) (Ester et al., 1996). DBSCAN is one of the efficient density-based

clustering algorithms which has recently become popular in the machine learning literature (Han and Kamber, 2006). However, unlike the GMM approach, we found that


these clustering algorithms did not appear to be able to provide us with meaningful model descriptions, as the patterns were represented with combinations of many non-overlapping and irregularly shaped clusters. When compared to the GMM, KM appears to focus on identifying outliers as a result of the pattern being heterogeneous. In contrast, DBSCAN is able to first identify and then remove the outliers (c.f. mixture components 6 and 7). However, besides the fact that it identified three outliers from mixture component 1, it was not able to identify any inliers components. This is not a surprise to us, as clustering is typically based on the principle that the optimal model will have clusters that are compact or clearly separated (Milligan and Cooper, 1985; Jain and Dubes, 1988). This suggests that clustering is generally not an appropriate method for modeling human mobility patterns.
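The comparison described above can be reproduced in outline with standard library implementations; the sketch below uses scikit-learn's KMeans and DBSCAN on a synthetic two-cluster pattern, with eps and min_samples values that are illustrative assumptions only.

import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((2.90, 1.50), 0.03, size=(170, 2)),
               rng.normal((2.20, 1.20), 0.03, size=(30, 2))])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.02, min_samples=3).fit(X)

print("KM cluster sizes:", np.bincount(km.labels_))
print("DBSCAN clusters:", db.labels_.max() + 1,
      "| points flagged as noise:", int(np.sum(db.labels_ == -1)))

Neither method returns component-wise densities, so unlike a GMM they cannot describe overlapping inlier structure within a location.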

However, we should note that there is a limitation to using GMMs for modeling an individual's spatial pattern despite their superiority to clustering approaches. That is, a GMM is not able to model well the movements of an individual when they correspond to a journey or path that follows a non-linear trajectory; this is illustrated in the results of Subscriber D in Figure 4.5 (c). Additionally, one major limitation of a

GMM is its lack of robustness to outliers. Modeling more robust mixtures of Student-

t distributions within the VB framework has been proposed (Svensen and Bishop,

2005; Archambeau and Verleysen, 2007). Such algorithms might be improved further

with the adoption of our split process to result in an approach even more suitable

for our application. Additionally, Aitkin and Wilson (1980) outlined a modified EM

algorithm in which outliers were identified via mixture modeling and this might also

be usefully incorporated into our approach.

We have also proposed a new model selection criterion, MAEAC. We have shown through empirical results that it appears to be more robust to the problems of near singular or singular covariance matrices that arise due to issues of data discreteness. This new criterion might also be a useful tool in other applications where such data problems exist.

4.6 References

Aitkin, M., Wilson, G. T., 1980. Mixture models, outliers, and the EM algorithm. Tech-

nometrics 22 (3), 325–331.

Archambeau, C., Verleysen, M., 2007. Robust Bayesian clustering. Neural Networks

20 (1), 129–138.

Armstrong, J. S., 2001. Principles of Forecasting: A Handbook for Researchers and

Practitioners. International Series in Operations Research & Management Science.


Kluwer Academic, Boston, MA.

Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-

ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-

ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,

Sweden, pp. 21–30.

Azzalini, A., 1996. Statistical Inference: Based on the Likelihood. Monographs on

Statistics and Applied Probability. Chapman & Hall, London.

Balakrishnan, S., Madigan, D., 2006. A one-pass sequential Monte Carlo method for

Bayesian analysis of massive datasets. Bayesian Analysis 1 (2), 345–362.

Ball, G. H., Hall, D. J., 1965. ISODATA, a novel method of data analysis and pattern

classification. Tech. rep., Stanford Research Institute, Menlo Park, CA.

Beal, M. J., Ghahramani, Z., 2002. The variational Bayesian EM algorithm for incom-

plete data: with application to scoring graphical model structures. In: Bernardo,

J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., West,

M. (Eds.), Proceedings of the Seventh Valencia International Meeting. Oxford Uni-

versity, Tenerife, Spain, pp. 453–464.

Beal, M. J., Ghahramani, Z., 2006. Variational Bayesian learning of directed graphical

models with hidden variables. Bayesian Analysis 1 (4), 793–832.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Information Sci-

ence and Statistics. Springer, New York.

Brockmann, D., Hufnagel, L., Geisel, T., 2006. The scaling laws of human travel. Na-

ture 439 (7075), 462–465.

Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information crite-

ria for missing data models. Bayesian Analysis 1 (4), 651–674.

Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties

with mixture posterior distributions. Journal of the American Statistical Associa-

tion 95 (451), 957–970.

Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures

based on variational component splitting. IEEE Transactions on Neural Networks

18 (3), 745–755.


Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-

ture distributions. In: Proceedings of the Eighth International Conference on Arti-

ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for dis-

covering clusters in large spatial databases with noise. In: Simoudis, E., Han,

J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on

Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.

Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd

Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.

Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of fac-

tor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the

1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.

Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual hu-

man mobility patterns. Nature 453 (7196), 779–782.

Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd Edition. The

Morgan-Kaufmann Series in Data Management Systems. Morgan Kaufmann, San

Francisco, CA.

Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational

methods. Statistics and Computing 10 (1), 25–37.

Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper

Saddle River, NJ.

Madigan, D., Ridgeway, G., 2003. Bayesian data analysis. In: Ye, N. (Ed.), The Hand-

book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-

ciates, Mahwah, NJ.

McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian

model selection for finite mixture distributions. Computational Statistics & Data

Analysis 51 (11), 5352–5367.

McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability

and Statistics. Wiley, New York.

Milligan, G., Cooper, M., 1985. An examination of procedures for determining the

number of clusters in a data set. Psychometrika 50 (2), 159–179.


Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown

number of components (with discussion). Journal of the Royal Statistical Society:

Series B (Statistical Methodology) 59 (4), 731–792.

Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of

model complexity and fit. Journal of the Royal Statistical Society: Series B (Statis-

tical Methodology) 64 (4), 583–639.

Stephens, M., 2000. Bayesian analysis of mixture models with an unknown number

of components - an alternative to reversible jump methods. The Annals of Statis-

tics 28 (1), 40–74.

Svensen, M., Bishop, C. M., 2005. Robust Bayesian mixture modelling. Neurocom-

puting 64, 235–252.

Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite

Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wi-

ley, New York.

Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based

on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.

Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for

mixture models. Neural Computation 12 (9), 2109–2128.

Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-

rithm for calculating variational Bayesian estimates for a normal mixture model.

Bayesian Analysis 1 (3), 625–650.

Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational

Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer,

K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT,

Vancouver, BC, Canada, pp. 1237–1244.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-

ponent elimination, initialization & circular data. Submitted.


5 Customer Spatial Usage Behavior Profiling and Segmentation with Mixture Modeling

Abstract

While companies typically acknowledge the need to be customer focused, an un-

derstanding of how each customer utilizes their product/service often appears to be

lacking. This paper describes how businesses can improve on this knowledge short-

coming, with ideas illustrated within the context of the wireless telecommunication

industry. Importantly, this article demonstrates the feasibility and potential merit

in analyzing individuals’ frequently overlooked habitual consumption behavior. For

the first time, an approach is developed that can automatically and effectively pro-

file each user’s observed overall spatial usage behavior (or mobility pattern). Mobil-

ity data is highly heterogeneous and spiky; we tailor a technique based on the use

of Gaussian mixture models (GMMs) and the variational Bayesian (VB) method for

overcoming these difficulties. The detailed distributional understanding achieved

here is then transformed to unlock potentially valuable insights such as each sub-

scriber’s likely lifestyle and occupational traits, which otherwise cannot be easily or

cheaply discovered. Our empirical results reveal that users’ spatial usage behavior

profiles are more stable than those produced by the currently popular approach, which involves the ordered partitioning of customers based on current benchmark measures such as aggregated voice call durations. The mobility patterns that we find among customer

groups are highly differentiable and therefore are valuable for business strategy for-

mulation.


Keywords

Consumption Behavior; Spatial Usage Behavioral Segmentation; Gaussian Mixture

Model; Variational Bayes; k-Means Clustering; Wireless Telecommunication Indus-

try

5.1 Introduction

Customers are the most important asset of any business, and companies typically acknowledge the necessity of being customer focused (Christopher et al., 1991, p.13). However, not all customers are the same (Cooper and Kaplan, 1991).

To better serve and/or satisfy each customer, businesses often seek to group them

based on their characteristics, needs, preferences and behavior exhibited for distinct

marketing propositions (Smith, 1956). Alternatively, they may try to differentiate

customers based on their current and future needs and values to the business with the aim of developing appropriate relationships with them (Blattberg and Deighton,

1996; Reichheld, 1996; Fournier et al., 1998; Peppers et al., 1999). Detailed customer

behavior understanding forms a critical part of such customer knowledge. However,

the habitual consumption aspect of customer behavior has not been very well stud-

ied despite the fact that there is already a wealth of customer/consumer behavior

literature. The repeated nature of this behavior should provide good insight into

customers’ current and future patterns (c.f. Schmittlein and Peterson, 1994). In this

paper, we address this knowledge shortcoming and present an innovative approach

which aims to enable businesses to comprehend how customers have utilized their

product/service in their daily lives (c.f. Fournier et al., 1998). We illustrate our ideas

in the context of the wireless telecommunication industry although they can be gen-

eralized to other industries. More specifically, we explore and analyze, in a novel way, customers' spatial usage behavior (or mobility patterns), transforming it into insightful and more stable information, as well as a highly differentiable segmentation, for the business.

Consumption behavior differs from purchasing behavior (Alderson, 1957; Jacoby,

1978), and is more relevant to the service than the retail industries because of its re-

peated patterns (Ouellette and Wood, 1998; Ajzen, 2001). Existing knowledge of each

individual’s consumption behavior is primarily limited to discrete (e.g., which ser-

vices customers use) or average and aggregated measures (e.g., number of transac-

tions per month). These measures, however, are not necessarily appropriate, mean-

ingful or adequate for describing the observed pattern. Instead, these patterns can

often be modeled in a way that is more revealing and yet still fairly straightforward

by using mixture models (McLachlan and Peel, 2000). We demonstrate this with the


Figure 5.1: Voice call duration distributions, approximated by a mixture of lognormal distributions ('—'s), of two subscribers whose voice call durations have a mean of 58 seconds. (a) Subscriber 1: a large number of 'message'-like calls of a very short duration. (b) Subscriber 2: call duration is more evenly distributed when compared with Subscriber 1.

following example. The two users shown in Figure 5.1 have behaved quite differently in their outbound calling behavior. Yet, if we simply calculated the average call duration for each, we would obtain the same information for these two very differently behaving subscribers; both of their average voice call durations over a particular period are 58 seconds. The fitted models, marked by the overlaid solid lines, demonstrate that these distributions can be well approximated by a mixture of several lognormal distributions; that is, by a model involving several means and standard deviations which is much more flexible than just one mean value. Somewhat more interesting, however, is that this figure also illustrates that even the commonly used hazard functions, such as that of the Weibull distribution, are inappropriate (c.f. Heitfield and Levy, 2001). Clearly the behavioral distributional differences revealed by the type of modeling used in Figure 5.1 can be critical to the business, for example from the viewpoint of pricing structure, or for understanding potential product/service substitution/migration implications. Consequently, we believe it is important to promote the analysis and comprehension of the distributions

of customers’ consumption behavior, rather than just observing averages. We will

show that this distributional analysis tactic can provide businesses with more com-

prehensive insights, which otherwise could not be uncovered using the typically ap-

plied approaches of the ordered partitioning of customers based on their average or

aggregated measures (c.f. §5.2.2) (Twedt, 1967; Wedel and Kamakura, 1998, p.10).

This paper focuses primarily on the spatial aspect of customers’ consumption be-

havior. This aspect is rarely considered in the literature despite the fact that stud-

ies have already shown that individuals exhibit a high degree of spatial regularity

(Gonzalez et al., 2008), and the locations visited by an individual can be socially im-

portant to them (Stryker and Burke, 2000). That is, each user’s highly repetitive mo-

bility pattern should reveal something about them; yet, we know very little about this

subject. For our application dataset which we will describe in §5.2.1, unsurprisingly,

our preliminary analysis of subscribers’ spatial usage behavior has demonstrated


that different users have been using their wireless devices in very different ways in

their daily lives. For example:

• The profiles of Subscriber A and B in Figure 5.2 (a) and (b) suggest that they

have behaved like a businessperson frequently flying between cities, and like

an inter-state truck driver frequently driving between cities, respectively.

• The profile suggests that Subscriber C in Figure 5.2 (c) has been mostly active

in places most likely to be his/her home and workplace.

• Subscriber D in Figure 5.2 (d) appears to have behaved like a tradesperson or a

taxi driver moving around places in their living (or working) neighborhood.

We believe that the distinctly different observed patterns are largely influenced by occupations and/or lifestyles; such valuable individual insights could not otherwise be easily or cheaply obtained, but can be useful for businesses in improving their customer interactions. Given the volumes of this type of data, it is essential that these patterns, both individually and at the behavioral-segment level, are identified and also interpreted automatically in an efficient and effective analysis process.

Mathematically, the goal of this research is to put forward an approach for pro-

filing as well as segmenting subscribers accurately and intuitively based on their

frequently overlooked actual mobility patterns. It, however, does not aim to seg-

ment customers spatially/geographically (c.f. Hofstede et al., 2002), but rather to

differentiate users based on their spatial behavioral characteristics. In a novel way

of exploring individuals’ spatial usage behavior, we show that it is practical and

effective to model individuals’ mobility patterns using Gaussian mixture models

(GMMs), whose characteristics can be easily captured unlike alternative nonpara-

metric approaches. We tailor a recently proposed computationally efficient varia-

tional Bayesian (VB) algorithm that was designed specifically for modeling highly

heterogeneous and ‘spiky’ patterns with weak prior information available (Wu et al.,

2010b), and is therefore suitable for this application. Note that our application was

discussed in Wu et al. (2010b); but unlike that paper which focuses on the statisti-

cal methodology used for fitting a GMM, this paper concentrates on interpreting the

patterns to gain useful customer knowledge. We use the term ‘spiky’ to describe pat-

terns where large areas of low probability are often mixed with small areas of high

probability. While VB (Attias, 1999; Wang and Titterington, 2006) has received increasing attention in other fields, to the best of our knowledge no references to such models have been made in the marketing literature, although Braun and McAuliffe

(2010) have already illustrated the usefulness of a VB-based discrete choice model in

a statistical journal. We must emphasize that the use and interpretation of GMMs for

customer behavior modeling is the foundation of this study; therefore, while Wu et al.

(2010b)’s split and eliminate VB (SEVB) algorithm is used to fit the GMMs, alternative

modeling approaches could also be taken as we discuss in §5.3.


Figure 5.2: Spatial behavior of four different subscribers. (a) Subscriber A: inter-capital businessperson-like pattern. (b) Subscriber B: inter-state truck driver-like pattern. (c) Subscriber C: home-office-like pattern shown in bubble plot. (d) Subscriber D: taxi driver-like pattern shown in bubble plot. Note that in (a) and (b) 'x's represent the actual observations and '. . .'s represent the 'virtual' path the user is likely to have taken between two consecutive actual observations. In (c) and (d), user patterns are shown in the form of bubble plots instead of scatter plots, better demonstrating that a large number of activities were initiated from the same cell tower locations; the size of the bubble represents the activity volume of the particular location.


Based on our thorough data analyses for individuals’ real and approximated mobil-

ity patterns, we develop a new customer behavior modeling method involving the

introduction of several behavioral ‘signatures’ (i.e., characteristics) for automatically

and statistically profiling how each user utilizes the product/service spatially in their

daily lives. We demonstrate statistically that these meaningful descriptors, which are

extracted from the approximated GMM, are more stable and highly differentiable

(c.f. Wedel and Kamakura, 1998, p.4) than existing alternatives such as the quan-

tiles of aggregated outbound voice call durations and short message service (SMS) counts. In fact, we show that the lack of stability of the ordered partitioning of customers

based on these aggregated measures can be alarming. We also show that customers’

spatial usage behaviors naturally form clusters that can be easily related to by market

specialists.

The remainder of this paper is organized as follows. We introduce our data in §5.2; we

then establish a benchmark for assessing segmentation stability which is based on

subscribers’ aggregated outbound voice call durations and SMS counts since these

are two of the most widely analyzed consumption behaviors in this industry. Addi-

tionally, we analyze individuals’ spatial usage behavioral data, and identify its unique

characteristics which pose challenges in modeling. We begin §5.3 with a brief discus-

sion of existing literature on modeling individuals’ mobility patterns. We then artic-

ulate the advantages, and demonstrate the accuracy and efficiency of instead mod-

eling the patterns with GMMs; we fit the GMMs using the SEVB algorithm (Wu et al.,

2010b). This is followed by further data analyses differentiating users’ spatial usage

behavior, which leads to the introduction of our behavioral signatures in §5.4. We

emphasize that separate GMMs are fitted to each individual whose mobility pattern

characteristics are then extracted. In concluding both §5.4 and §5.5, we evaluate the

effectiveness of our spatial usage behavioral profiling and segmentation including

the comparison to the benchmark established in §5.2. We next perform validation

demonstrating that the proposed behavioral grouping is useful and highly differen-

tiable in §5.6, and finish with a discussion of our contributions in §5.7.

5.2 Data & Individuals’ Consumption Behavior

5.2.1 Data

Studies have shown that we can sufficiently comprehend an individual’s mobility

pattern through analyzing their call detail records (CDR) without the need to track

them at all times (Gonzalez et al., 2008). Our research adopts this convenient ap-

proach, and analyzes confidential CDR provided by a wireless telecommunication

provider in Australia. Our data records every single successful outbound activity

made by 1,082 consumer subscribers during a consecutive 17-month period, which is a relatively long history for an analysis of this kind.


Table 5.1: Distributions of users’ aggregated call patterns

(seconds) Mean Min 1Q Median 3Q MaxVoice Call Durations 80926 0 12326 39062 97925 1446649SMS Counts 557 0 16 135 564 12522

Figure 5.3: Distributions of users’ aggregated call patterns. (a) Aggregated voice calldurations. (b) Aggregated SMS counts.

(a) (b)

is a relatively long history for an analysis of this kind. These anonymous subscribers

were statistically randomly selected, prior to CDR being collected, and have stayed

connected during the entire study period. Attributes collected in this sample in-

clude, but are not limited to, the activity initiated timestamp and cell tower location

in latitude and longitude. Note, we do not know the precise location of an individual,

we only know approximately where they are through the location of the cell tower

used to initialize an outbound activity. This paper focuses only on the activities made

domestically.

5.2.2 Usage behavior of aggregated voice call durations and SMS counts & the segmentation stability benchmark

Ordered partitioning of customers based on their aggregated voice call durations or SMS counts over a period of time is perhaps the most commonly adopted approach for usage behavioral segmentation in the telecommunications industry; we refer to it as the benchmark in this paper. Before we examine its segmentation effectiveness, we explore these attributes. The distributional summary statistics of our subscribers' aggregated voice call durations and SMS counts over the 17-month period are shown in Table 5.1, and presented as histograms in Figure 5.3. They reveal that a typical user initializes more than half an hour of talk and nearly eight SMSs per month. Note that, as both distributions are highly skewed, we have grouped observations above the 95th percentile together and capped them at that level, limiting the influence of outlying extremely heavy users in the histograms.

To evaluate the stability of ordered partitioning of subscribers based on these two measures, we partition our 17-month data into three non-overlapping periods:

• Period 1, corresponding to the first five months,

• Period 2, for the following seven months, and

• Period 3, for the final five months (i.e., the same months as Period 1 but a year later).

For each period we partition subscribers into four, five, and 10 equal-sized groups based on the percentiles, and note which group each user's value falls in for each period. We concentrate on the comparisons between Periods 1 and 2, and 1 and 3, and define:

Stability = (# of subscribers in the same group for both periods) / (# of subscribers).    (5.1)
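As an illustration, the following minimal sketch (in Python, with hypothetical column names user_id, period and the chosen value column) computes Equation 5.1 for quantile-based groups; it is not the code used in our study, merely one way the measure can be reproduced:

    import pandas as pd

    def stability(df, value_col, n_groups=10, period_a=1, period_b=2):
        """Equation 5.1: share of subscribers falling in the same
        quantile group in both periods. df has one row per subscriber
        and period; column names here are illustrative only."""
        a = df[df["period"] == period_a].set_index("user_id")[value_col]
        b = df[df["period"] == period_b].set_index("user_id")[value_col]
        # rank first so that ties (e.g., many zero SMS counts) still
        # yield equal-sized quantile groups
        ga = pd.qcut(a.rank(method="first"), n_groups, labels=False)
        gb = pd.qcut(b.rank(method="first"), n_groups, labels=False)
        common = ga.index.intersection(gb.index)
        return (ga.loc[common] == gb.loc[common]).mean()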

Table 5.2: Stability of the 'benchmark' customer segmentation defined by Equation 5.1

                          Period 1 vs. 2           Period 1 vs. 3
# of Groups              4      5      10         4      5      10
Voice Call Durations   53.4%  44.4%  23.8%      55.6%  49.6%  29.3%
SMS Counts             62.8%  54.9%  32.1%      58.6%  51.7%  28.7%

The results are presented in Table 5.2, and they suggest that users' usage patterns in SMS are more stable than their voice calls. Interestingly, aggregated SMS counts can predict the behavior of the following period better than that of the same period a year later, while the situation is reversed for aggregated voice call durations. While users' behavior in this industry is known to change quickly, many of these prediction accuracies look alarming. For example, we can only correctly predict the behavior of 23.8% of all subscribers for the following period when they are partitioned into 10 quantile groups based on their aggregated voice call durations in the previous period. We refer to the values in Table 5.2 as the benchmark for our spatial usage behavior research. However, we acknowledge that the usefulness of these numbers is difficult to assess; to the best of our knowledge, there exists no benchmark better suited for comparison than the one we have selected, a direct result of the fact that this type of analysis is typically not conducted, despite its importance.

5.2.3 Spatial usage behavior (or mobility patterns)

Human mobility patterns have recently been examined closely (Gonzalez et al., 2008), revealing that they are highly heterogeneous while exhibiting strong spatial regularities. That is, individuals typically spend most of their time in their most highly preferred locations, and occasionally visit other places that are 'isolated' in relation to their usual activity areas. Statistically speaking, this implies that users' spatial usage behavior is not only heterogeneous (both between and within users), but also spiky. Our telecommunications data supports this finding: Figure 5.4 (a) reveals that around 70% of all outbound activities made by our users were initialized at their top five preferred locations, as marked by the corresponding cell towers. This heterogeneous and spiky nature of the spatial usage behavior, along with the data being somewhat discrete (the locations recorded in the CDR are restricted to where the cell towers are located rather than lying on a continuous plane), poses some modeling challenges, as we shall explain in §5.3.

Figure 5.4: Mobility pattern analysis. (a) Percentage of outbound activities made from users' top five preferred locations. (b) Average of users' cumulative activity count distribution with respect to distance from their real centers.

To help us understand how each user has moved around spatially, we define their mobility pattern 'real center', also known as 'home' (e.g., Balazinska and Castro, 2003), as the most frequently used cell tower location in the most active area; each evaluated area is circular with a radius of 100 km, and no differences were found when, for example, 200 km was used instead. For nearly all of our users, this real center corresponds to the location of the most frequently used cell tower; exceptions typically occurred when a subscriber's activities were divided somewhat equally into two or more regions and his/her spatial movement in one of those regions was largely limited (e.g., a mining site served by only one or two cell towers). Note that we use the term 'real' simply to express the fact that these centers are calculated directly from the actual data (as opposed to the estimation we will carry out in §5.3.4). Assuming that each latitude or longitude degree always corresponds to 100 km, the spatial behavior of our average user with respect to their real center is shown in Figure 5.4 (b). It illustrates that, on average, users made around 65% of all outbound activities within 10 km of their real centers, and nearly 90% within 100 km.
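A minimal sketch of this 'real center' heuristic, under the paper's 1 degree ≈ 100 km assumption (function and array names are ours, for illustration only):

    import numpy as np

    def real_center(towers, counts, radius_km=100.0):
        """Most frequently used tower within the most active circular
        area. towers: (m, 2) array of distinct (lat, lon) cell tower
        positions; counts: activity count observed at each tower."""
        towers = np.asarray(towers, dtype=float)
        counts = np.asarray(counts)
        # pairwise tower distances in km (1 degree ~ 100 km)
        d = np.linalg.norm(towers[:, None, :] - towers[None, :, :],
                           axis=-1) * 100.0
        # total activity volume of the area centred at each tower
        volume = (d <= radius_km).astype(int) @ counts
        busiest_area = np.argmax(volume)
        in_area = d[busiest_area] <= radius_km
        # busiest tower inside the most active area
        return towers[np.argmax(np.where(in_area, counts, -1))]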

5.3 Modeling Individuals’ Spatial Usage Behavior

Individuals' mobility patterns have mostly been studied from the perspective of infrastructure (Liu et al., 1998; Perkins, 2001; Camp et al., 2002; Balazinska and Castro, 2003), with the aim of providing better network experiences for users. However, these approaches are ineffective for understanding individuals' spatial usage behavior and typically do not take the strong spatial regularities discussed in §5.2.3 into consideration. Among studies that do focus on modeling each subscriber's mobility pattern, little attention is given to anything beyond the identification of 'significant' locations, i.e., individuals' preferred (cell tower) locations such as their home and workplace (Nurmi and Koolwaaij, 2006). While most users are stationary in the sense that they can generally be found at the same locations over time (Balazinska and Castro, 2003), non-significant locations are actually crucial to the understanding of highly mobile users (e.g., Subscriber D in Figure 5.2 (d)). It is important to comprehend users' overall mobility patterns, not just the significant locations. While it provides some insights, simply identifying the several most frequently used cell tower locations of each individual and attempting to characterize them, as done in Cortes et al. (2000), for example, is not adequate for fully understanding their spatial usage behavior. In addition, while most studies have utilized density-based clustering algorithms (e.g., the DBSCAN algorithm (Ester et al., 1996)) for identifying individuals' significant locations, the accuracy of these algorithms has been shown to be a concern, particularly when there is more than one cell tower covering the same location (Nurmi and Koolwaaij, 2006). In fact, Wu et al. (2010b) have shown that these algorithms often fail to identify those frequently used locations. Additionally, these algorithms appear inadequate for comprehending and characterizing subscribers' spatial usage behavior in detail, because they can only represent users' mobility patterns as combinations of many non-overlapping, irregularly shaped, clearly separated clusters, and that type of representation is not meaningful here. We will show that the approach we outline here is much more appropriate.

5.3.1 Gaussian mixture model (GMM)

Our approach is to model each user's overall mobility pattern (i.e., latitude and longitude) with bivariate Gaussian mixture models (GMMs). These are easy to interpret, flexible and computationally convenient (McLachlan and Peel, 2000). Mixture models, which are convex combinations of a number of component densities, have been shown to be capable of representing any distribution, as in the case of nonparametric approaches (Escobar and West, 1995; Roeder and Wasserman, 1997), and have therefore been used extensively in other research (e.g., Wedel and Kamakura, 1998, Chapter 6). The spatial density of an individual's mobility pattern x = (x_1, ..., x_n) (i.e., n outbound activity observations), when modeled with a mixture of k Gaussian components, is given by:

f(x) = ∑_{j=1}^{k} w_j N(x; µ_j, T_j^{-1}),    (5.2)

where k ≥ 1, and µ_j and T_j^{-1} represent the mean and variance, respectively, of the jth component density; each mixing proportion w_j satisfies 0 ≤ w_j, with ∑_{j=1}^{k} w_j = 1; and N(·) denotes a bivariate Gaussian density. We emphasize that each user's overall spatial usage behavior over a time period is individually modeled with a different GMM and thus fitted independently.
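For concreteness, Equation 5.2 can be evaluated directly; the sketch below (using scipy) computes the mixture density at a set of points given fitted weights, means and precision matrices T_j:

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_density(x, weights, means, precisions):
        """Mixture density of Equation 5.2 at the points x (n, 2);
        the covariance of component j is the inverse of T_j."""
        x = np.atleast_2d(x)
        dens = np.zeros(len(x))
        for w, mu, T in zip(weights, means, precisions):
            dens += w * multivariate_normal(mu, np.linalg.inv(T)).pdf(x)
        return dens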

5.3.2 The variational Bayesian (VB) method

The most popular approach for fitting the GMM in the literature is the expectation-maximization (EM) algorithm. However, maximum likelihood (ML) approaches such as EM (Dempster et al., 1977) can suffer from over-fitting and singularity problems; these are much less of a problem for the relatively recent variational Bayesian (VB) inference approach (McGrory and Titterington, 2007). VB is one of the most computationally efficient Bayesian techniques currently available. It has been shown to perform as well as or better than the EM algorithm used with the Bayesian information criterion (BIC) (Schwarz, 1978), in terms of accuracy, robustness and rate of convergence for mixture modeling (e.g., Watanabe et al., 2002; Teschendorff et al., 2005). The Bayesian approach differs from classical or frequentist statistical methods, such as ML approaches, in its use of probability to naturally quantify uncertainty at all levels of the modeling process. Consequently, it provides a natural framework for producing reliable parameter estimates (Gelman et al., 2004).

The most practical motivation for adopting VB is that it can automatically select the number of components k for each mixture model while estimating the parameter values simultaneously; previous studies have shown that this typically leads to a reliable fit (Attias, 1999; Corduneanu and Bishop, 2001; McGrory and Titterington, 2007). Automatic selection of k is particularly critical for our application because the complexities of users' mobility patterns can vary greatly and k is typically not known in advance. VB determines the 'optimal' k by automatically, effectively and progressively eliminating redundant components when an excessive number of initial components is specified in the mixture model. Such a strategy is clearly more efficient than ML/EM approaches, where one must choose the 'optimal' k using more ad hoc approaches based on comparing measures such as the BIC after fitting models with various possible k to the same pattern. In addition, the deterministic nature of VB means that it is much more computationally efficient than alternative Bayesian methods that are also capable of simultaneously estimating k, such as Reversible Jump Markov chain Monte Carlo (RJMCMC) (Richardson and Green, 1997).

The theory of VB is now well documented in the literature (e.g., Wang and Titterington, 2006). In short, VB minimizes the Kullback-Leibler (KL) divergence between the target posterior distribution and its approximation, which is improved iteratively through an EM-like algorithm. The VB approximation for GMMs, through its use of tractable coupled expressions typically under conjugate distribution settings, has been theoretically shown to be reliable, asymptotically consistent, and unbiased for large samples (Wang and Titterington, 2006). Additionally, unlike Markov chain Monte Carlo (MCMC)-based methods, VB does not have mixing or label switching problems, and its model convergence is easier to assess (McGrory and Titterington, 2007).

For the reasons given above, in this paper we choose the VB algorithm of Wu et al. (2010b) to approximate each individual's mobility pattern with a GMM. However, the particular algorithm utilized for fitting the GMM is not critical for this research, and other existing GMM algorithms could also be adapted for this application.
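For instance, although the SEVB algorithm itself is described in Wu et al. (2010b), the basic VB-GMM behaviour of pruning redundant components is available off the shelf: scikit-learn's BayesianGaussianMixture performs a variational fit that drives the weights of unneeded components towards zero (it lacks the split moves of SEVB). A minimal sketch on synthetic two-area data:

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # synthetic (lat, lon) observations for one user: two activity areas
    rng = np.random.default_rng(0)
    xy = np.vstack([rng.normal([-27.5, 153.0], 0.02, size=(400, 2)),
                    rng.normal([-33.9, 151.2], 0.05, size=(100, 2))])

    # start with a deliberately generous number of components;
    # the variational fit prunes the redundant ones
    vb = BayesianGaussianMixture(n_components=30,
                                 covariance_type="full",
                                 weight_concentration_prior=1e-3,
                                 max_iter=500).fit(xy)
    print("surviving components:", (vb.weights_ > 0.01).sum())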

5.3.3 Split and eliminate variational Bayes for Gaussian mixture model (SEVB-GMM) algorithm

The split and eliminate VB for the GMM (SEVB-GMM) algorithm (Wu et al., 2010b) is particularly suitable for our application. It is very effective for modeling highly heterogeneous spiky patterns with little prior knowledge of the parameters and the model complexities. The unique component elimination property of VB, coupled with the additional component splitting strategy used in SEVB, means that this algorithm is able to explore the parameter space well and can provide a good data fit without limiting the number of components k. There are other available VB-GMM algorithms that allow for component splitting (e.g., Ghahramani and Beal, 1999; Ueda et al., 2000; Constantinopoulos and Likas, 2007), but this algorithm adopts a more directed and targeted approach, which is why we choose it here. Empirical results of Wu et al. (2010b) suggest that, typically, only a handful of split attempts is required for analyzing patterns such as ours, and this algorithm is therefore more computationally efficient than alternatives which attempt to split all components one after another until an 'optimal' model is reached.

We emphasize the importance of allowing components to be split, as it can guide the algorithms such that they are less likely to converge to a local minimum solution; algorithms that allow components to be split (e.g., Richardson and Green, 1997; Ueda et al., 2000; Ueda and Ghahramani, 2002) are thus much more appropriate for fitting our highly heterogeneous and spiky data than the standard methods. Nevertheless, in most of these existing algorithms, components are typically only ever split into two side-by-side subcomponents, making them rather ineffective in their current form.

As observed by Wu et al. (2010b), when fitting heterogeneous spiky patterns with GMMs there is actually another distributional defect that needs to be carefully evaluated. Specifically, it was observed that a fitted component can often contain an unusually high concentration of observations near the component center, a direct result of humans' strong spatial regularities discussed earlier in §5.2.3. To address this challenge, Wu et al. (2010b) introduced and incorporated a novel 'inliers and non-inliers' split process into their VB-GMM algorithm; such a step is particularly useful for identifying and separating the significant locations (i.e., 'inliers') of each subscriber from the activity areas surrounding them (i.e., 'non-inliers').

Recall that, besides being heterogeneous and spiky, the other somewhat interesting feature of our data is its discreteness (and thus point masses), which arises because the positions of individuals cannot be located accurately: they are based on the positions of the mobile cell towers. While this type of challenge can usually be resolved with some jittering, such a tactic raises additional issues about how the data should be 'tweaked'; activities initiated through cell towers located in remote areas need to be adjusted very differently from those in the CBD. Without introducing those small jumpy movements, this data singularity issue causes modeling challenges on two fronts. First, a lack of randomness around certain locations can often lead to two sets of 'unrelated' observations being grouped together. This can be addressed quite simply by forcing linear-shaped components (i.e., those that simply connect two cell tower locations) to split. More important, however, is the question of how models of the same pattern should be evaluated when the underlying data lacks jitter. This is a very interesting statistical problem in itself and requires considerable further research; we do not attempt to address it fully here. Nonetheless, we point out that this discreteness issue is particularly problematic when adopting algorithms such as EM, where k needs to be determined by comparing the BIC among models with various different k values. This is because measures such as the BIC rely heavily on the log-likelihood (LL) calculation, which is influenced by the covariance matrix estimation for components in the model (Wu et al., 2010b). Covariance matrix estimation can be numerically unstable for components that are singular or near singular, which biases the LL/BIC either towards or away from a particular model when linear-shaped components or components corresponding to point masses exist. Consequently, Wu et al. (2010b) suggest making use of goodness-of-fit measures suitable for overlapping components or highly correlated variables; we found that the somewhat commonly applied Silhouette coefficient, for example, is not suitable for evaluating models with overlapping components. Wu et al. (2010b) proposed an alternative measure called the mean absolute error adjusted for covariance (MAEAC), which aims to approximate the overall model 'absolute errors', taking advantage of the fact that VB is less likely to over-fit. MAEAC was found to be generally in agreement with model selection choices made by the BIC when the data is well behaved, but

was shown to be more robust for data that is heterogeneous, spiky, and lacking jitter (Wu et al., 2010b). Consequently, in this paper we make use of the SEVB-GMM algorithm with MAEAC. Again, we emphasize that the main focus here is on fitting and interpreting a GMM for each user's spatial usage behavior, and not on the particular algorithm for doing the task.

5.3.4 Results, model accuracy & computational efficiency

The SEVB-GMM algorithm is relatively robust to the selection of k_initial, the initial choice of k, and can automatically determine the 'optimal' k for each fitted GMM. However, models with marginally larger k are likely to be obtained with larger k_initial values for data that is highly heterogeneous. While it is typically sufficient, and certainly more computationally efficient, to obtain a good visual representation of the data with a smaller k_initial (Wu et al., 2010b), we opt to initialize the SEVB-GMM algorithm with a larger k_initial for the potential slight improvement in goodness-of-fit. We adopt the default initialization settings used in Wu et al. (2010b) with k_initial = 30, but here we also force linear-shaped components simply connecting two sites to be split. We detail our promising results of modeling each user's spatial usage behavior below.

Figure 5.5 gives the modeling results for the four selected subscribers shown in Figure 5.2. It demonstrates that the SEVB-GMM algorithm appears to be effective for our application, since the fitted mixture model components seem to capture the patterns.

Figure 5.5: SEVB-GMM results of the four subscribers in Figure 5.2. (a) Subscriber A. (b) Subscriber B. (c) Subscriber C. (d) Subscriber D. Note that the ellipses represent 95% probability regions for the component densities, the estimated centers of these components are marked by '+'s, and the actual observations are marked by 'x's. We also note that the 95% probability regions of some components (e.g., those corresponding to point masses) are not always visible because they are simply too small to be seen. The most noticeable examples are the two most heavily weighted components in (c), which correspond to the three big bubbles (two of them centered at a nearly identical spot) in Figure 5.2 (c).

In terms of the modeling accuracy of the SEVB-GMM, one approach is to evaluate the pattern distributions with respect to their (estimated) pattern centers. We opt to estimate the pattern center directly from the GMM instead of utilizing the 'real center' defined in §5.2.3; this is convenient for profiling and segmenting users in §5.4 and §5.5, i.e., without access to the raw data. We define the 'SEVB-GMM center' as the center of the greatest-weighted component in the most active area; in practice this generally corresponds simply to the center of the greatest-weighted component (cf. §5.2.3). Although the definitions of real centers and SEVB-GMM centers differ (one is no more correct than the other), the distances between them are shown in Figure 5.6 (a). For over 50% of users, the difference between the two center definitions is less than 5 km. Figure 5.6 (b) examines the spatial pattern distribution with respect to the SEVB-GMM centers for both the real and modeled data. The SEVB-GMM modeled distribution is estimated based on the component weights and the distances between the component centers and the SEVB-GMM centers. It shows that the SEVB-GMM can effectively summarize the patterns; the 'gap' between the modeled and actual (i.e., the cumulative distribution of the distances between the SEVB-GMM center and the actual activity locations) distributions on the left hand side can be explained by patterns surrounding the SEVB-GMM centers that have already been summarized into a handful of components.

Figure 5.6: Model accuracy of SEVB-GMM. (a) Distribution of distances between real & SEVB-GMM centers. (b) Average of users' cumulative activity count distribution with respect to distance from their SEVB-GMM centers. Note that '. . .'s refer to calculations made with respect to the SEVB-GMM model fits, whereas '—'s were calculated with respect to the actual data.

Figure 5.6 (b) echoes Figure 5.4 (b) in that users are mostly active within 10 km of the centers, and, on average, around 10% of their activities were carried out at locations over 100 km from the centers.

We conclude that it is appropriate to model individuals' mobility patterns with GMMs. With an average of 10.763 GMM components, or 63.578 free parameters (each bivariate component contributes six parameters: two for the mean, three for the covariance and one for the weight, less one parameter overall for the sum-to-one constraint on the weights), we have been able to accurately approximate an average of 1,288 outbound activities over the 17-month analyzed period; this gives a data compression ratio of approximately 20 to 1. Of course, this ratio will be higher when a longer history is analyzed. In addition, we are pleased with the efficiency delivered by VB. The SEVB-GMM algorithm has been able to obtain a good approximation in an average of 62.612 iterations with an average of 2.982 split attempts. It is clearly less demanding in terms of computational requirements than many other Bayesian methods, as well as other VB-GMM splitting algorithms which attempt to split all components separately until an 'optimal' model is reached. It is also, in the case of this sample, 36 times more efficient than if we were to adopt an EM algorithm, given that we now know our final pattern complexities range from one to 36; using SEVB-GMM has removed the need to fit a model for each candidate number of components k = 1, ..., 36 separately and then select the 'optimal' one. Our next task is to interpret these behavior patterns meaningfully and automatically before segmenting users' spatial behavior in §5.5.

5.4 Profiling Individuals’ Spatial Usage Behavior

Several previous studies have attempted to characterize individuals' mobility patterns (e.g., Balazinska and Castro, 2003). However, they typically focused on corporate or campus networks with data histories often only of the order of weeks, and sometimes even days (Balazinska and Castro, 2003), making them difficult to generalize. Moreover, Balazinska and Castro (2003) grid-partitioned users' spatial usage behavior based on the standardized frequency of users' most visited location and their median usage quantity, which appears overly simplified. On the other hand, a mobility study described in Ghosh et al. (2006a) did give more consideration to the spatial aspect of the patterns. They first identified the nature of all locations with respect to all users, such as seminar rooms, shopping malls, and homes, in the ETH Zurich Campus of The State University of New York at Buffalo, New York; this was then utilized for profiling individuals based on their activity frequency in each location. Despite the fact that we could generalize this study further by matching each subscriber's mobility pattern to the social importance of each cell tower location (e.g., football stadium or airport), or even to regional census information (e.g., education, income, and primary industry such as mining or tourism), this approach suffers from several drawbacks. Firstly, in the real world, each user's significant locations are not aligned; that is, we do not necessarily all live in the same location and work in the same location. Secondly, the social importance of each location can be difficult to determine for different individuals; for example, shopping malls have a different meaning for the people who work there. Finally, as we have articulated earlier, it is important to comprehend each user's overall mobility pattern.

That is, simply profiling users based on their visitation probability at each location, as done in Ghosh et al. (2006a), is still inadequate for fully comprehending or characterizing users' spatial usage behavior. We also reference Larson et al. (2005), which aims to cluster/profile customers' supermarket shopping paths, a somewhat related focus; though, as in the case of Ghosh et al. (2006a), their work is also based on first defining the meaning of each zone of a finite space that has the same significance to all shoppers (as well as predefining the high-medium-low grouping of the path time). Studies related to Larson et al. (2005) include Hui et al. (2009a,b), where Hui et al. (2009b) has a strong analytical focus on the implications of actual purchasing behavior. Farley and Ring (1966) and Batty (2003), for example, focused on analyzing pedestrians' zone-to-zone Markovian movement, but these also do not appear to be directly useful for this study. In contrast, in this paper we develop an approach for automatically and statistically profiling each user's overall mobility pattern based on the approximated SEVB-GMM; note that while we have chosen a particular inferential approach, the underlying concept of partitioning patterns could also be implemented in other ways. We explore the characteristics of SEVB-GMM components for approximating users' spatial usage behavior in §5.4.1, then differentiate them in §5.4.2. In §5.4.4, we introduce several spatial usage behavioral signatures, extracted from the SEVB-GMM approximations, for automatically profiling users' mobility patterns; their effectiveness is examined in §5.4.5.


5.4.1 SEVB-GMM component characteristics

Previous literature and analyses such as that illustrated in Figure 5.4 have revealed most individuals' spatial usage behavioral tendency: they are mostly active in the neighborhood they live in and only occasionally travel to places outside that region. This means that the distance of each component from its corresponding SEVB-GMM center, which we call ∆, and the component mixing weight w can be crucial for interpreting individuals' overall mobility patterns. In addition, the component size and shape can assist us in understanding what patterns have been summarized nearby. For example, most components in Figure 5.5 (b) appear to be long and narrow, representing the routes that the individual appears to have regularly passed through, whereas the two concentrated point components in Figure 5.5 (c) correspond to the user's significant locations. We can determine the component size simply by considering the standard deviations (SDs) σ in latitude and longitude, whereas the percentage of the variation (λ_1) accounted for by the first principal component p_1 of a component, r = λ_1/(λ_1 + λ_2), computed through principal component analysis (PCA), can provide us with insights into the component shape; in the bivariate case, the second principal component p_2 accounts for the remaining variation λ_2. Besides the distance from center ∆, weight w, SD σ, and p_1's share of variation r, components can be described and characterized by the probability region A that they cover and the maximum length l. Note that here we adopt 99.9% as the probability region for the ellipses. Assuming a and b are the ellipse's (i.e., the GMM component's) semi-major and semi-minor axes respectively, then A = a × b × π and l = 2 × a. We next examine the differentiability of components with respect to some of these characteristics.
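A minimal sketch of these per-component descriptors, derived from a component's weight, mean and covariance (the chi-square quantile scales the 99.9% probability ellipse; function and variable names are ours, not from Wu et al. (2010b)):

    import numpy as np
    from scipy.stats import chi2

    def component_characteristics(weight, mean, cov, center, prob=0.999):
        """Descriptors of one fitted component; distances follow the
        paper's 1 degree ~ 100 km convention."""
        cov = np.asarray(cov, dtype=float)
        lam = np.linalg.eigh(cov)[0]          # eigenvalues, ascending
        lam2, lam1 = lam                      # lam1: leading variance
        q = chi2.ppf(prob, df=2)              # 99.9% region scaling
        a, b = np.sqrt(q * lam1), np.sqrt(q * lam2)  # semi-axes (degrees)
        return {
            "w": weight,
            "delta_km": np.linalg.norm(np.asarray(mean) - center) * 100,
            "sigma_max_km": np.sqrt(cov.diagonal().max()) * 100,
            "r": lam1 / (lam1 + lam2),        # p1's share of variation
            "A_km2": np.pi * a * b * 100**2,  # probability-region area
            "l_km": 2 * a * 100,              # maximum length
        }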

5.4.2 Differentiating SEVB-GMM components

Analyses to date have motivated us to differentiate observations, or SEVB-GMM components, into the following four types:

• Those that correspond to the individuals' significant locations (Significant),

• Those in the daily activity areas of an individual but which are not significant (Urban),

• Those in locations the individual does not visit regularly (Remote), and

• Those that correspond to the long commuting routes frequently traveled by an individual (Route).

To reinforce this idea and assist in formally defining these component types, we have conducted the following additional mobility pattern analyses.

Figure 5.7: Mobility pattern analysis based on SEVB-GMM. (a) Distribution of SEVB-GMM component maximum SD σmax for which σmax ≤ 10 km. (b) Distribution of SEVB-GMM component weight w for which w ≤ 0.24. (c) Distribution of the % of variation accounted for by the first principal components (the p_1's) of the SEVB-GMM components. (d) Distribution of distances between users' daily activity boundary and their SEVB-GMM centers (note: almost identical for real centers).

The first three analyses are based solely on the characteristics of the SEVB-GMM components. The distribution of the components' maximum SD in latitude and longitude, which we call σmax, is shown in Figure 5.7 (a). It reveals that more than 25% of all components have σmax ≤ 1/3 km. In fact, most of these compact components have σmax ≈ 0, corresponding to the exact locations of cell towers and individuals' significant locations. Figure 5.7 (b) examines the distribution of the component weights w. It indicates that more than 25% of all components are almost meaningless with respect to the understanding of individuals' overall mobility patterns, as their w < 0.01; this needs to be taken into consideration when profiling users' mobility patterns. The result of the PCA on the components is shown in Figure 5.7 (c). A large portion of the components are narrow, with their p_1's accounting for nearly all of the variability; these components are generally associated with individuals' (long) commuting routes. Note that we assume all compact components have their p_1's account for as much variability as their p_2's.

A different perspective on individuals' spatial usage behavior is taken in Figure 5.7 (d). This time we focus on users' daily activity areas. It is revealed that over 30% of the subscribers were mostly active within a 10 km (inner) radius of their centers, but were practically inactive in the circular ring whose outer radius is 10 km more than the inner radius; we define practically inactive as less than 1% of the activities. Similarly, over 35% of the users were mostly active in locations where the distance from center ∆ ≤ 20 km, but were practically inactive in areas where 20 < ∆ ≤ 30 km. Overall, the daily activity areas of approximately 90% of the subscribers were within the 30 km radius area, whereas nearly everyone had been active mostly within the 60 km radius area. Consequently, to suit most subscribers, we define the boundary of daily activity areas (or living neighborhoods), i.e., Significant and Urban, as 30 km from the SEVB-GMM centers, and Remote locations as those over 60 km away from the centers. This also implies that we will ignore around 6% of all components, centered between 30 and 60 km from the SEVB-GMM centers, because of the difficulty in determining their social importance with respect to each individual. Note that the 20 km-radius circular area is approximately 1,257 km², about the size of New York City, whereas the 60 km-radius circular area is approximately 11,309 km², about the size of metropolitan Sydney.

5.4.3 SEVB-GMM component types

Accordingly, we define compact components with weight w ≥ 0.01 and distance from center ∆ ≤ 30 km as individuals' significant locations. However, we believe that it is important to relax the definition of compact from our earlier discussion with respect to Figure 5.7 (a). This relaxation is necessary in order to avoid situations in which more than one cell tower covers the same area and the locations of these different cell towers are represented by a single component that is not compact under the earlier definition. If we assume the typical distance between two neighboring cell towers is 4 km, then we define compact, and thus the criterion for Significant components, as 3 × σmax ≤ 4 km; recall that σmax refers to the component's maximum SD. On the other hand, we identify Route components (which can also be Remote) as those components with ∆ > 30 km, w ≥ 0.01, p_1's share of variation r ≥ 0.90, and maximum length l > 12 km; this criterion on l corresponds to three times the assumed typical distance between two neighboring cell towers. We formally define the different types of components as follows:

• Significant: ∆ ≤ 30 km, 3 × σmax ≤ 4 km, and w ≥ 0.01;

• Urban: ∆ ≤ 30 km, 3 × σmax ≤ 60 km, and not Significant;

• Remote: ∆ > 60 km;

• Route: ∆ > 30 km, r ≥ 0.90, w ≥ 0.01, and l > 12 km.

Note that we could have redefined Remote to exclude Route components; however, we found in practice (cf. §5.4.5 and §5.5.2) that similar conclusions about individual subscribers' behavior would be obtained in any case. We next investigate how we can characterize individuals' overall spatial usage behavior from our detailed understanding of each SEVB-GMM component.
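These rules translate directly into code; the sketch below consumes the descriptor dictionary from the §5.4.1 sketch and returns a set of labels, since a component may be both Remote and Route:

    def component_types(c, tower_spacing_km=4.0):
        """Component types of section 5.4.3 for one descriptor dict c."""
        labels = set()
        compact = 3 * c["sigma_max_km"] <= tower_spacing_km
        if c["delta_km"] <= 30:
            if compact and c["w"] >= 0.01:
                labels.add("Significant")
            elif 3 * c["sigma_max_km"] <= 60:
                labels.add("Urban")
        if c["delta_km"] > 60:
            labels.add("Remote")
        if (c["delta_km"] > 30 and c["r"] >= 0.90
                and c["w"] >= 0.01 and c["l_km"] > 12):
            labels.add("Route")
        return labels  # empty for the ignored 30-60 km components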


5.4.4 Spatial usage behavioral signatures

In this section, we introduce what we refer to as spatial usage behavioral signatures; these aim to profile the mobility patterns of each subscriber meaningfully and automatically based on the component types defined above. The key signatures are:

• SignificantWt = ∑_{j∈Significant} w(j);

• UrbanWt = ∑_{j∈Urban} w(j);

• UrbanArea = ∑_{j∈Urban} A(j) / (30²π);

• RemoteWtX2 = min(1, 2 × RemoteWt) = min(1, 2 × ∑_{j∈Remote} w(j));

• RouteDist = min(1, ∑_{j∈Route} l(j) / 1000).

We next describe these signatures in more detail; their effectiveness in profiling and segmentation is examined in §5.4.5 and §5.5.2 respectively.
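Continuing the sketches above, the key signatures reduce to simple aggregations over the typed components ('components' below is a list of (descriptor, labels) pairs; again an illustration under our assumed data structures, not the production code):

    from math import pi

    def spatial_signatures(components):
        """Key signatures of section 5.4.4."""
        w_sig = sum(c["w"] for c, t in components if "Significant" in t)
        w_urb = sum(c["w"] for c, t in components if "Urban" in t)
        area = sum(c["A_km2"] for c, t in components
                   if "Urban" in t and c["w"] >= 0.01)
        w_rem = sum(c["w"] for c, t in components if "Remote" in t)
        l_rte = sum(c["l_km"] for c, t in components
                    if "Route" in t and c["w"] > 0.01)
        return {"SignificantWt": w_sig,
                "UrbanWt": w_urb,
                "UrbanArea": area / (30**2 * pi),
                "RemoteWtX2": min(1.0, 2 * w_rem),
                "RouteDist": min(1.0, l_rte / 1000.0)}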

SignificantWt, UrbanWt & RemoteWtX2  Recall that the weight w is one simple and informative measure for describing each VB-GMM component. Three of the signatures we have designed are based on the aggregated w's of different component types: SignificantWt, UrbanWt, and RemoteWt, which equate to the aggregated w of all Significant, Urban, and Remote VB-GMM components respectively, representing an individual's share of activities in the three different zones. However, as expected, the majority of users have not been very active in visiting locations outside their daily activity areas. Consequently, we choose RemoteWtX2 = 2 × RemoteWt instead of RemoteWt, so that it is more evenly distributed between zero and one. This rescaling/standardization is necessary for the clustering/segmentation in §5.5, which utilizes the k-means algorithm (Jain and Dubes, 1988, pp. 89-117).

Figure 5.8 (a)–(c) presents the distributions of the subscribers with respect to these three signatures. Figure 5.8 (a) indicates that more than 20% of all subscribers have not been particularly active, in the overall sense, in their significant locations, while a large number of users have been predominantly active in their several most preferred locations. Figures 5.8 (b) and (c) show that the majority of subscribers do, at least occasionally, visit their living neighborhood and places outside their daily activity areas, while approximately 5% of all users are less typical, having been more active in locations more than 60 km away from their centers than in their daily activity areas. Note that this does not imply that our estimated centers are incorrect, but rather that these users have been more active in various places outside their living neighborhood combined.

Figure 5.8: Distribution of spatial usage behavior. (a) SignificantWt. (b) UrbanWt. (c) RemoteWtX2. (d) UrbanArea (1 = 30²π km²). (e) RouteDist (1 = 1000 km). (f) HomeOfficeLik.

UrbanArea  It has been observed that the size of each subscriber's daily activity area varies, but this cannot always be reflected by UrbanWt. We have therefore introduced the signature UrbanArea, which equates to the aggregated probability region A of all Urban components with weight w ≥ 0.01, as a proportion of the nominated area of 30²π = 2,827.43 km², with the aim of providing a good proxy for this variation. We note that the key limitation of this proposed signature is its inability to exclude overlapping regions; a GMM focuses on density approximation and thus often requires multiple components to represent one non-Gaussian cluster (Baudry et al., 2010). Figure 5.8 (d) shows the distribution of the subscribers with respect to UrbanArea. It reveals that around 15% of the users have been very active throughout most areas of their living neighborhood.

RouteDist  The unique nature of Route components implies that the maximum length l is useful for characterizing this aspect of users' mobility patterns. We have therefore introduced the signature RouteDist, which equates to the aggregated l of these components with weight w > 0.01, as a proportion of the nominated distance of 1,000 km, representing an individual's overall unique long commuting distance. This is useful for identifying those whose spatial usage behavior is less conventional, as in the case of Subscriber B in Figure 5.2 (b). In fact, the unique long commuting distance for this user was estimated to be 1,785 km, providing a good indication of the 1,515 km approximated in Google Maps (cf. Brisbane to Sydney: 929 km, and Coffs Harbour to Newcastle via Tamworth: 586 km). Note that this knowledge cannot be extracted easily by analyzing a user's physical path directly. Recall that we assume one degree corresponds to 100 km in this paper. Figure 5.8 (e) shows the distribution of the subscribers with respect to RouteDist, and reveals that the majority of users rarely commute long distances; only about 5% of our subscribers have been traveling along routes totaling more than 1,000 km in combined unique route distance.

Alternative signatures  Finally, we point out that we can also interpret, for example, SignificantWt as the likelihood of an individual being a stationary subscriber, the combination of UrbanWt and UrbanArea as the likelihood of being tradesperson- or taxi driver-like, and the combination of RemoteWtX2 and RouteDist as the likelihood of being inter-state businessperson- or truck driver-like. In addition, it is also possible to profile each user's spatial usage behavior slightly differently. For example, besides describing how active an individual has been in significant locations, say 70% of the time (SignificantWt = 0.70), we can also profile them as being a certain type of user with some degree of likelihood. For example, the signature HomeOfficeLik can be introduced for measuring the likelihood that an individual is a home-office-like user who is mostly active in two locations, assumed to be their home and office. If we define HomeOfficeLik as the aggregated weight w of the top two Significant components when there is more than one Significant component, then the likelihood distribution of a user being home-office-like is as shown in Figure 5.8 (f). It reveals that only a small number of subscribers have a very high probability of being home-office-like users.
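A sketch of this derived signature (returning 0 when fewer than two Significant components exist, an assumption on our part since that case is left undefined above):

    def home_office_lik(components):
        """Aggregated weight of the top two Significant components."""
        w = sorted((c["w"] for c, t in components if "Significant" in t),
                   reverse=True)
        return sum(w[:2]) if len(w) > 1 else 0.0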

5.4.5 Results & spatial usage behavioral profile stability

We illustrate the usefulness of our profiling approach through examples, and evaluate the stability of our proposed spatial behavioral signatures for all users with respect to the benchmark. For consistency, we again demonstrate our results with the same users shown in Figure 5.2. Their signature values are listed in Table 5.3, and we profile them as follows:

Table 5.3: Spatial profile signature values for each subscriber in Figure 5.2

Subscriber  SignificantWt  UrbanWt  UrbanArea  RemoteWtX2  RouteDist
A                   0.850    0.001          -       0.216          -
B                   0.191        -          -       1.000      1.000
C                   0.970        -          -       0.006      0.048
D                       -    0.964      1.000       0.031      0.033

• Subscriber A is mostly active (85.0%; SignificantWt = 0.850) in selected fixed locations, rarely visits other parts of the living neighborhood, and sometimes (10.8%; RemoteWtX2 = 0.216) flies to towns/cities away from his/her center;

• Subscriber B is mostly active (≥ 50.0%; RemoteWtX2 = 1.000 and RouteDist = 1.000) on the road traveling long distances, and sometimes (19.1%; SignificantWt = 0.191) makes calls or sends messages from selected fixed locations;

• Subscriber C is practically only active (97.0%; SignificantWt = 0.970) in his/her significant locations;

• Subscriber D is very active (96.4%; UrbanWt = 0.964) throughout his/her entire living neighborhood with no particular preferred locations.

This provides very similar findings to the discussion in §5.1. In addition, HomeOfficeLik for Subscribers A and C is 0.850 and 0.970 respectively, suggesting that they were very active in their top two preferred locations, most likely their home and workplace. This highlights how our approach can provide meaningful, useful and otherwise hidden insights, effectively and automatically, based on the values of our proposed signatures, each of which captures a different aspect of users' spatial usage behavior.

We next examine the effectiveness of profiling individuals into several non-overlapping, equal-width intervals based on these signatures. For example, if we group subscribers into 10 groups based on their share of activities in significant locations, group one will consist of users with 0 ≤ SignificantWt < 0.1, group two of those with 0.1 ≤ SignificantWt < 0.2, and so on. Note that this differs from how users are more commonly partitioned, which is based on usage quantiles; we take this alternative approach because here the partitions have meaning within the context of the signature values. To provide some comparison of this profiling approach with the benchmark, even though the two are not strictly directly comparable, individuals' CDR have also been divided into three periods as in §5.2.2 prior to being approximated with the SEVB-GMM methods (i.e., fitted three times, one GMM for each user and period).
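The equal-width grouping is straightforward; a sketch for signatures bounded on [0, 1] (stability is then computed exactly as in the Equation 5.1 sketch, with these interval groups in place of the quantile groups):

    import numpy as np

    def interval_groups(values, n_groups=10):
        """Group g holds users with g/n <= value < (g+1)/n;
        a value of exactly 1.0 falls in the top group."""
        v = np.asarray(values, dtype=float)
        return np.minimum((v * n_groups).astype(int), n_groups - 1)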

Table 5.4: Stability of the customer spatial profile signature values defined by Equation 5.1

                 Period 1 vs. 2           Period 1 vs. 3
# of Groups     4      5      10         4      5      10
SignificantWt  61.7%  56.2%  40.3%     59.0%  52.5%  33.8%
UrbanWt        62.3%  55.7%  38.2%     60.2%  54.2%  36.5%
UrbanArea      62.7%  57.3%  46.0%     58.3%  53.7%  39.8%
RemoteWtX2     80.7%  76.0%  55.2%     78.8%  74.3%  59.2%
RouteDist      80.8%  76.7%  63.0%     81.5%  76.8%  64.3%
HomeOfficeLik  57.3%  53.7%  46.3%     62.5%  58.3%  49.3%

Table 5.4 presents the stability results, and shows that our profiling approach has generally 'outperformed' the benchmark. Again, we stress that the two are not directly comparable; we make the comparison simply to reference one of the most popular approaches used in customer behavioral segmentation today. The results imply that subscribers' spatial usage behavior is, relatively speaking, stable, and allow businesses to understand, for the first time, how their customers have been using the product/service in a spatial sense. Some of the shortfalls in prediction accuracy can, as in §5.2.2, be explained by the fact that subscribers' behavior changes over time, and some loss in accuracy may have arisen from the SEVB-GMM approximation. In addition, since the signatures are based on fixed numerical thresholds (e.g., the living neighborhood boundary of 30 km), these could possibly be optimized further with a Bayesian or fuzzy approach, which might further improve how well the signatures approximate users' spatial usage behavior.

5.5 Spatial Usage Behavioral Segmentation

5.5.1 The k-means (KM) algorithm & selection of number of groups

As well as gaining additional insights into how each individual has used the product/service spatially, it is vital for the business to comprehend the similarities and dissimilarities among users' behavior, for example for distinct marketing propositions or product/service developments. The most obvious approach for this is a cluster analysis that groups users into clusters with similar characteristics in an unsupervised manner. As a first attempt, we adopt the most widely used k-means (KM) algorithm for grouping users' overall spatial usage behavior; the limitations of KM have been well documented (e.g., Wedel and Kamakura, 1998, Chapter 5). Note that it is also possible to adopt a GMM, or a mixture modeling approach more generally, for this particular task (Fraley and Raftery, 1998); results of KM and GMM analyses can be very similar (e.g., Symons, 1981; Celeux and Govaert, 1992; Banfield and Raftery, 1993). However, GMMs are computationally more demanding than the KM algorithm, their parameter estimates can sometimes be numerically unstable, and, as pointed out by Baudry et al. (2010), GMM components may need to be merged for more appropriate clustering results. On the other hand, methods such as DeSarbo et al. (1990) that make use of multi-dimensional scaling (MDS) may be useful in obtaining more 'appropriate' groupings, but the transformed/scaled features can make the results more difficult to interpret. Please also refer to §5.7 for further discussion of other clustering techniques.

Results of KM can be influenced by variables that are unequally weighted or unstandardized, which has already been taken into consideration in our behavioral signature constructions. However, KM can also produce unreliable results if the variables defined in §5.4 are highly correlated. We have therefore performed a correlation analysis, with results shown in Table 5.5. It reveals that only SignificantWt and UrbanWt are moderately negatively correlated, which coincides with our expectation since the sum of these two variables is approximately 0.80 (cf. Figure 5.4 (b)); the correlations among the other variables, including the two benchmark measures, are quite weak. Overall, we believe KM should be sufficiently robust for our purposes.

Table 5.5: Pearson correlation coefficients among spatial profile signatures

                      SignificantWt  UrbanWt  UrbanArea  RemoteWtX2  RouteDist
UrbanWt                      -0.736        -          -           -          -
UrbanArea                    -0.161    0.374          -           -          -
RemoteWtX2                   -0.388   -0.200     -0.231           -          -
RouteDist                    -0.211   -0.115     -0.131       0.519          -
Voice Call Durations          0.056   -0.052      0.176       0.014      0.041
SMS Counts                    0.210   -0.078      0.102      -0.160     -0.081

Our biggest challenge in interpreting the KM results is determining the most appropriate number of spatial usage behavior groups, g, as in the case of mixture modeling. Milligan and Cooper (1985) have shown that two of the better measures for determining g, and hence the clustering quality, are the Calinski and Harabasz (CH) index (Calinski and Harabasz, 1974) and the cubic clustering criterion (CCC) (SAS Institute Inc., 1983); local peaks of these two measures, when in agreement, represent the most likely solutions for g. The CH index (also known as the pseudo-F statistic) aims to capture the tightness of clusters:

CH index = (SSB / (g − 1)) / (SSW / (m − g)),    (5.3)

when m users are grouped into g groups; it is dominated by the ratio between the between-groups sum of squares (SSB) and the within-groups sum of squares (SSW). CCC, on the other hand, can be biased towards larger g, and measures the deviation of the clusters from the distribution expected if the data had been sampled from a uniform distribution. We utilize both measures for the 'optimal' selection of g and hence the clustering results.
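The CH index is readily computed; the sketch below scans candidate g with scikit-learn, whose calinski_harabasz_score implements Equation 5.3. CCC is a SAS statistic without a standard scikit-learn counterpart, so it is omitted here:

    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    def ch_by_g(X, g_values=range(2, 13), seed=0):
        """CH index for each candidate number of groups g; X is an
        (m, p) matrix of signature values, one row per subscriber."""
        scores = {}
        for g in g_values:
            labels = KMeans(n_clusters=g, n_init=10,
                            random_state=seed).fit_predict(X)
            scores[g] = calinski_harabasz_score(X, labels)
        return scores  # local peaks suggest plausible g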

Figure 5.9: Selected k-means clustering results # 1. (a) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt & UrbanArea. (b) Clustering quality evaluated with respect to different g when subscribers are clustered with SignificantWt, UrbanWt, UrbanArea, RemoteWtX2 & RouteDist. (c) Clustering quality evaluated with respect to different g when voice call durations & SMS counts are included in setting (b). (d) Variable R²'s (RSQ) for setting (c), with voice call durations marked as D, SMS counts as S & the five spatial behavioral signatures unmarked. Note that in (a) to (c), lines marked with '•' correspond to the CH index and the remaining lines to CCC; the number of groups g is generally chosen based on local maxima shared by both the CH index and CCC.

5.5.2 Results & spatial usage behavioral segmentation stability

Here we focus on segmenting users under the following three scenarios. We first consider their behavior only within their living neighborhood, the 30 km radius area where the majority of activities have taken place, ignoring activities outside it. Next, we expand the analysis to include all of the spatial usage behavioral signatures, representing subscribers' overall mobility patterns. Finally, we include the two benchmark measures in the previous scenario, with the aim of increasing our understanding of these variables' inter-relationships and their implications for user behavioral segmentation.

Mathematically, our first scenario clusters users based on SignificantWt, UrbanWt and UrbanArea; the number of groups g is determined, as discussed in §5.5.1, from the CH index and CCC results shown in Figure 5.9 (a). These suggest that there are four distinct groups of users with respect to their spatial usage behavior. The four-cluster results are listed in Table 5.6 (a):


Table 5.6: Selected k-means clustering results # 2. (a) 4-cluster solution: cluster centers & variables' R2's (setting of Figure 5.9 (a)). (b) 6-cluster solution: cluster centers & variables' R2's (setting of Figure 5.9 (b)).

(a)

Cluster        # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)              434           0.690     0.101       0.092            -           -
(B)              189           0.608     0.282       0.813            -           -
(C)              305           0.122     0.564       0.166            -           -
(D)              154           0.095     0.721       0.852            -           -
Variable R2:                   0.738     0.664       0.826            -           -

(b)

Cluster        # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)              321           0.760     0.117       0.107        0.129       0.077
(B)              170           0.626     0.291       0.832        0.104       0.104
(C)              204           0.165     0.665       0.191        0.161       0.101
(D)              138           0.102     0.748       0.865        0.160       0.146
(E)              125           0.245     0.216       0.174        0.831       0.177
(F)              124           0.303     0.219       0.159        0.698       0.941
Variable R2:                   0.696     0.681       0.755        0.690       0.729

• Cluster (A) is the largest and represents those who are mostly active in selected fixed locations, for example, ‘home and office’ (c.f. Subscriber C in Figure 5.2 (c));

• Cluster (B) consists of users with similar behavior to those in Cluster (A) but who are more active outside those significant locations;

• Cluster (C) groups those who are mostly active in selected parts of their living neighborhood, for example, ‘regional salespersons’ or ‘tradespersons’; and,

• Cluster (D) represents subscribers who are active throughout most parts of their living neighborhood, for example, ‘taxi drivers’ (c.f. Subscriber D in Figure 5.2 (d)).

Next, we examine users' overall mobility patterns, that is, also taking the signatures RemoteWtX2 and RouteDist into consideration. The CH index and CCC results are presented in Figure 5.9 (b) and indicate that there are most likely six unique user groups. The six clusters are listed in Table 5.6 (b); the first four correspond very closely to the results in Table 5.6 (a). The remaining two clusters are:

• Cluster (E) identifies users who frequently visit places far from their centers, most likely via flights (c.f. Subscriber A in Figure 5.2 (a)); and,

• Cluster (F) represents those who frequently travel along selected routes, for example, ‘inter-state truck delivery workers’ (c.f. Subscriber B in Figure 5.2 (b)).

In addition, we have observed that if a higher number of groups g is chosen, the first four groups remain largely the same, while Clusters (E) and (F) are broken up and differentiated further based on the first three signatures.

Interestingly, however, if we further include voice call duration and SMS counts (which we first standardized so that 1 is the maximum, with values at or above the 95th percentile capped at 1 to limit the influence of extremely heavy users; c.f. §5.2.2) in the cluster analysis, the cluster structures suddenly become less clear (c.f. Figure 5.9 (c)). A variable's R2 measures the proportion of that variable's variance explained by the clustering (i.e., its between-cluster variance relative to its total variance); and, importantly, based on this measure, it appears that customers are more differentiable with respect to their spatial usage behavioral signatures than with respect to the benchmark measures, as indicated in Figure 5.9 (d). Voice call duration, in particular, has performed relatively poorly throughout the entire analysis. That is, mobility patterns among customer groups are highly differentiable; we can see this in the plot as the variables' R2's with respect to mobility patterns (c.f. unmarked lines) are quite high in comparison to voice call duration and SMS counts (c.f. lines marked with D and S).
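For reference, a variable's R2 can be written in the standard clustering form; this explicit formula is our own restatement of the usual definition and is not given in the original text:

$$R^2_l = 1 - \frac{\sum_{c=1}^{g} \sum_{i \in c} \left(x_{il} - \bar{x}_{cl}\right)^2}{\sum_{i=1}^{m} \left(x_{il} - \bar{x}_{l}\right)^2},$$

where $\bar{x}_{cl}$ is the mean of variable $l$ in cluster $c$ and $\bar{x}_{l}$ its overall mean; values near 1 indicate that the clustering explains most of that variable's variance.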

Despite the strong evidence for how key customer groups differ in their spatial usage behavior, as articulated earlier, it is important to examine the segmentation effectiveness. We adopt a similar setting to §5.2.2 and §5.4.5 in terms of separating the data into three periods. When comparing Period 1 to 2, and 1 to 3, the stability for the 4-cluster solution is 47.5% and 44.2% respectively; for the 6-cluster solution, the results are 42.2% and 40.0% respectively. These results are reasonably good but poorer than those in §5.2.2 and §5.4.5, despite significant evidence pointing to such a clustering structure; however, users' spatial usage behavior is being considered more completely here, i.e., considering more than just one attribute. Interestingly, the stability measures for the largest cluster, Cluster (A), are 59.1% and 62.8% for the 4-cluster solution, and 53.6% and 59.1% for the 6-cluster solution, respectively, making it the most stable customer group and potentially the most useful for the business. Overall, we believe this implies that more effort is still required to comprehend users' complicated spatial usage behavior and the appropriate segmentation. Finally, we have found the results to be generally similar even when different hard signature parameter values are used, even though KM can be very sensitive to the derived data values. That is, we view spatial usage behavioral segmentation as reasonably effective and potentially valuable for business strategy formulation.
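Stability here is evaluated as in §5.2.2 and §5.4.5; since those definitions fall outside this chapter, the sketch below shows one plausible way such a period-to-period stability percentage could be computed. The greedy matching of clusters by nearest centers, and all names, are our own illustrative assumptions.

```python
import numpy as np

def stability(labels_p1, labels_p2, centers_p1, centers_p2):
    """Fraction of users assigned to 'matching' clusters in two periods.

    Clusters are matched by the distance between their centers; this
    matching rule is an illustrative assumption, not the thesis's exact
    definition.
    """
    # Match each Period-1 cluster to its nearest Period-2 cluster center.
    dists = np.linalg.norm(
        centers_p1[:, None, :] - centers_p2[None, :, :], axis=2)
    match = dists.argmin(axis=1)  # match[c1] = corresponding Period-2 cluster
    stable = sum(match[c1] == c2 for c1, c2 in zip(labels_p1, labels_p2))
    return stable / len(labels_p1)
```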

5.6 Cross Validation

The clustering result can be unstable in general, and cross validation is thus required to further demonstrate the usability of the proposed spatial usage behavioral segmentation. We perform this on several new samples whose sizes are much larger than that of the initial dataset. The first two new samples are utilized for demonstrating the differentiability of the groupings; these samples consist of 6,898 and 9,337 users respectively, and these users do not need to stay connected for the entire 17 months. However, users in the new sample # 1 are randomly sampled, while those in # 2 are block sampled (based on the internal IDs). The new sample # 3 has 6,624 subscribers and is utilized for evaluating the segmentation stability; users in this dataset were chosen randomly but need to stay connected throughout the entire 17-month period for the obvious reason.

We are pleased to observe that all of the results were fairly consistent with what was observed previously, even though KM can be quite sensitive to the data, suggesting that our initial sample is quite representative. For example, when grouping subscribers in the new sample # 1 based on our five proposed spatial behavioral signatures (i.e., the same setting as in Figure 5.9 (b)), the CH index and CCC also point to a likely solution of six clusters (c.f. Figure 5.10 (a)); similarly for the new sample # 2. We note, however, that the CH indexes for both new samples also point to a good three-cluster solution, but are in disagreement with the CCCs. We further note that when combining these two samples, the CH index and CCC again point to a likely solution of six clusters, but less obviously so than in the case of Figure 5.9 (b). Additionally, the cluster structures disappeared as before (i.e., similar to Figure 5.9 (c)) when voice call duration and SMS counts were further included in the study. Table 5.7 lists the KM clustering results for both new samples # 1 and # 2; note the near identical results between the two samples as well as with the initial sample (i.e., Table 5.6 (b)). Finally, we note that the new sample # 3 illustrates the stability of the six-cluster solution; the comparisons between Periods 1 and 2, and Periods 1 and 3, are 42.3% and 40.7% respectively, similar to the findings for our initial sample.

Due to the lack of directly comparable approaches in the literature, our final experiment in this paper aimed to design a suitable benchmark in order to obtain a more appropriate comparison with our spatial behavioral segmentation, which seems to be quite effective, or at least highly differentiable. We aimed to group users with a simple comparison model based on the raw dataset; the behavioral characteristics of an individual were mostly derived from the aggregated activity frequencies in each zone, which is determined based on the distance to each user's center. For example, we classified zone urban as being at distance ∆ < 30 km from the center, and zone remote as ∆ > 100 km; activities initiated in zone urban through heavily used cell towers of the individual (e.g., frequencies greater than 0.01) were instead classified as significant. Boundary activities in zone urban were used to proxy our UrbanArea, whereas we attempted to use the activity frequency in the zone 60 < ∆ < 100 km to proxy our RouteDist, given that the ‘path’ cell towers are difficult to define and each tower has different meanings to different users. Unfortunately, we have been unsuccessful in finding a suitable and comparable simplistic model; the CH index and CCC (c.f. Figure 5.10 (b)) generally point to a lack of cluster structure.


Figure 5.10: Cross validation results # 1. (a) Clustering quality evaluated with respect to different g for the new sample # 1 with the setting of Figure 5.9 (b). (b) Clustering quality evaluated with respect to different g for the new sample # 1 with the unsuccessful simplistic model described in §5.6. Note that lines with • correspond to the CH index and the other lines correspond to CCC; the number of groups g is generally chosen based on the local maxima shared by both the CH index and CCC.


In some settings where the cluster structures appear more obvious, the cluster results are often not very meaningful at all, i.e., the cluster centers of an attribute proxying a behavioral characteristic are often very close to each other, and the two main clusters residing near the ‘center’ can together comprise around 75% of all users.
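To make the benchmark construction concrete, below is a minimal sketch of the zone classification just described; the distance and frequency thresholds follow the text, while the function name, zone labels and the handling of the unspecified 30–60 km band are our own illustrative choices.

```python
def classify_activity(dist_km, tower_freq, sig_freq=0.01):
    """Assign one activity to a benchmark zone.

    dist_km    : distance from the user's center to the cell tower.
    tower_freq : this user's relative usage frequency of the tower.
    Thresholds follow §5.6; the zone labels are illustrative.
    """
    if dist_km < 30:
        # Heavily used towers in the urban zone proxy 'significant' usage.
        return 'significant' if tower_freq > sig_freq else 'urban'
    if 60 < dist_km < 100:
        return 'route'   # proxies RouteDist
    if dist_km > 100:
        return 'remote'  # proxies RemoteWtX2
    return 'other'       # 30-60 km band is not specified in the text
```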

In summary, the above cross validation has illustrated the usefulness of our research.

5.7 Discussion

In this paper, we have made several contributions. Firstly, we have illustrated how businesses can further improve their understanding of how customers typically utilize their product/service, and the feasibility and potential merits of analyzing individuals' habitual consumption behavior. Secondly, we have shown that our approach can, for the first time, effectively model and automatically profile each user's overall spatial usage behavior. Thirdly, individuals' spatial usage behavioral signatures are more effective, at least for our dataset, in predicting their future behavior than an ordered partitioning of subscribers based on their aggregated voice call durations and SMS counts. Finally, spatial usage behavior among customer groups is highly differentiable.

Tactically, we have utilized observational CDR data that is typically already available and hence cost efficient for established businesses. We have shown that massive amounts of CDR, when leveraged more fully, can provide wireless telecommunication providers with a wealth of enhanced customer knowledge. In fact, we could have also included incoming CDRs and unsuccessful outbound CDRs,


Table 5.7: Cross validation results # 2. (a) 6-cluster k-means solution for the new sample # 1: cluster centers & variables' R2's (setting of Figure 5.9 (b)). (b) 6-cluster k-means solution for the new sample # 2: cluster centers & variables' R2's (setting of Figure 5.9 (b)).

(a)

Cluster        # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)             2335           0.779     0.108       0.098        0.117       0.077
(B)             1104           0.622     0.282       0.790        0.110       0.069
(C)             1375           0.154     0.703       0.214        0.163       0.089
(D)              751           0.100     0.747       0.884        0.199       0.133
(E)              704           0.362     0.145       0.104        0.764       0.180
(F)              627           0.278     0.283       0.191        0.642       0.895
Variable R2:                   0.717     0.722       0.765        0.622       0.693

(b)

Cluster        # Obs   SignificantWt   UrbanWt   UrbanArea   RemoteWtX2   RouteDist
(A)             3217           0.779     0.105       0.095        0.126       0.080
(B)             1471           0.614     0.286       0.793        0.118       0.083
(C)             1708           0.163     0.687       0.218        0.163       0.099
(D)             1048           0.099     0.746       0.892        0.194       0.151
(E)             1024           0.351     0.158       0.122        0.780       0.201
(F)              869           0.263     0.272       0.186        0.677       0.923
Variable R2:                   0.710     0.704       0.764        0.642       0.686

which are often considered valueless in analyses. Moreover, our data mining approach has provided a level of granularity and subtlety that cannot be achieved by approaches such as market research. In addition, our tactic has allowed marketers to identify and interact with individuals, and could easily be extended into a longitudinal behavioral study. Note that recently there has been much research focused on approximating the density distributions of stream data such as CDR in an extremely time-and-space efficient and scalable manner (Aggarwal, 2007a). However, we believe their non-parametric density representations, as in the case of clustering (c.f. §5.3), would be less suited to obtaining behavioral descriptors as we have done in §5.4.

We have demonstrated the model's accuracy in modeling each subscriber's mobility pattern with a GMM. While we have assumed that observations are independent and identically distributed (i.i.d.), we believe this assumption is reasonable given that we are analyzing users' habitual consumption behavior. Besides understanding the implications of observations actually being sequential, we believe that the most practical extension to this work is to explore users' spatial usage behavior with respect to different time periods, for example weekdays, weeknights or weekends (c.f. Ghosh et al., 2006a). Also, our choice of non-overlapped segmentation is ‘unnecessarily restrictive’ for interacting with the customers (c.f. Wedel and


Kamakura, 1998, p.32). Results in this paper also prompt us to consider alternative approaches for profiling each user's mobility pattern. We are currently exploring profiling behavior with ‘mixed membership’ (Airoldi et al., 2008). That is, rather than assigning each user to one cluster, we may profile users as a mixture of behavioral clusters. For example, Subscriber A in Figure 5.2 might be better profiled as 80% home-office-like plus 20% inter-capital businessperson-like, which may further improve the stability results of our spatial usage behavioral segments in §5.5.

We acknowledge that more investigation is required to understand how our spatial usage behavioral profiling and segmentation: (1) relate to the current and future needs and values of the business, (2) correlate with geo-demographic or other aspects of customer information or knowledge, including purchasing behavior, (3) have implications for future product/service development, and (4) are effective for interacting meaningfully with each customer in real life. Given the exploratory nature of this paper, we leave it to future researchers to examine the managerial implications. Nonetheless, we believe that our inferred richer behavioral descriptions, derived from a series of exploratory analyses, can provide businesses with a better understanding of both individuals' and spatial usage behavioral segments' typical needs and behavior, which marketers can more easily relate to and which are hence valuable for strategy formulation. Note that many benefits to both the business and the clients, such as finding the nearest restaurants or cash machines, or assisting in emergency situations or commercial activities such as advertising, can already be provided simply by knowing the current position of an individual (Chong et al., 2009). However, additional benefits can be delivered with a detailed understanding of customers' spatial needs. That is, it will now be possible to target customers in a more advanced and precise manner without the need to access all the data all the time or to have data pre-summarized for a particular purpose. For example, our method allows us to target inter-capital, non-Sydney-based businesspersons with restaurant discount vouchers in Sydney, or to target users who are mostly at their significant locations with innovative mobile devices capable of transforming into premises equipment. Also, it allows us to provide better services to subscribers in relation to traffic and public transport alerts relevant to them. Furthermore, we believe that our spatial knowledge can help businesses to better determine the cost of providing services to each user, and hence assess their value to the business as a result of the different servicing costs associated with different cell towers. We also believe that our spatial usage behavioral signatures may further improve predictions of subscribers' future behavior, such as churn modeling (Bhattacharya, 1998; Mozer et al., 2000; Keaveney and Parthasarathy, 2001; Lemon et al., 2002; Buckinx and Van den Poel, 2005), which is currently based on the less effective benchmark measures.

Finally, we believe that it is not sensible to extend this research much further with traditional clustering algorithms, including the KM used here, when investigating the relationships among a large number of customer attributes at the same time. This is because, as dimensionality increases, the sparseness of the data usually increases as a result, and this in turn leads to meaningless similarity measures and clustering results (Agrawal et al., 1998). The situation is worse when a high level of noise (c.f. heterogeneous behavior) also exists. Interestingly, it has been observed that usually only a small number of the dimensions (i.e., subspaces) are relevant to certain clusters, whilst noisy signals are often contributed by information in the remaining irrelevant dimensions (Agrawal et al., 1998). Put simply, age, for example, may be critical to one customer group but not to another. Consequently, algorithms that aim to cluster data full dimensionally, as done traditionally, are inappropriate. In fact, even applying traditional feature selection or transformation prior to clustering, based on the full dimension philosophy, will not resolve this dimensionality issue (Agrawal et al., 1998). Accordingly, we believe future research on this topic needs to consider adopting recently developed subspace or projected clustering algorithms, which have been shown to efficiently, effectively and automatically identify groups of clusters within different subspaces of the same dataset (e.g., Moise et al., 2008). In particular, we believe the algorithm P3C (Moise et al., 2008) currently looks very promising; it can deal with both numerical and categorical attributes, and can find both non-overlapped and overlapped clusters. Overall, given the competition taking place in the telecommunication industry, we believe this study should be of interest to both academics and practitioners.

5.8 References

Aggarwal, C. C., 2007. Data Streams: Models and Algorithms. Advances in Database Systems. Springer, New York.

Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clustering of high dimensional data for data mining applications. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM, Seattle, WA, pp. 94–105.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., Xing, E. P., 2008. Mixed membership stochastic blockmodels. The Journal of Machine Learning Research 9 (Sep), 1981–2014.

Ajzen, I., 2001. Nature and operation of attitudes. Annual Review of Psychology 52 (1), 27–58.

Alderson, W., 1957. Marketing Behavior and Executive Action: A Functionalist Approach to Marketing Theory. Richard D. Irwin, Homewood, IL.

Attias, H., 1999. Inferring parameters and structure of latent variable models by variational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm, Sweden, pp. 21–30.

Balazinska, M., Castro, P., 2003. Characterizing mobility and network usage in a corporate wireless local-area network. In: Proceedings of the First International Conference on Mobile Systems, Applications, and Services. USENIX, San Francisco, CA, pp. 303–316.

Banfield, J. D., Raftery, A. E., 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (3), 803–821.

Batty, M., 2003. Agent-based pedestrian modelling. In: Longley, P. A., Batty, M. (Eds.), Advanced Spatial Analysis: The CASA Book of GIS. ESRI Press, Redlands, CA.

Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19 (2), 332–353.

Bhattacharya, C. B., 1998. When customers are members: Customer retention in paid membership contexts. Journal of the Academy of Marketing Science 26 (1), 31–44.

Blattberg, R. C., Deighton, J., 1996. Manage marketing by the customer equity test. Harvard Business Review July-August, 136–144.

Braun, M., McAuliffe, J., 2010. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association 105 (489), 324–335.

Buckinx, W., Van den Poel, D., 2005. Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research 164 (1), 252–268.

Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods 3 (1), 1–27.

Camp, T., Boleng, J., Davies, V., 2002. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing 2 (5), 483–502.

Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis 14, 315–332.

Chong, C.-C., Guvenc, I., Watanabe, F., Inamura, H., 2009. Ranging and localization by UWB radio for indoor LBS. NTT DOCOMO Technical Journal 11 (1), 41–48.

Christopher, M., Payne, A., Ballantyne, D., 1991. Relationship Marketing: Bringing Quality, Customer Service and Marketing Together. The Marketing Series. Butterworth-Heinemann, Boston, MA.

Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures based on variational component splitting. IEEE Transactions on Neural Networks 18 (3), 745–755.

Cooper, R., Kaplan, R. S., 1991. Profit priorities from activity-based costing. Harvard Business Review May-June, 130–135.

Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mixture distributions. In: Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.

Cortes, C., Fisher, K., Pregibon, D., Rogers, A., 2000. Hancock: a language for extracting signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, Boston, MA, pp. 9–17.

Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 (1), 1–38.

DeSarbo, W. S., Howard, D. J., Jedidi, K., 1990. MULTICLUS: A new method for simultaneously performing multidimensional scaling and cluster analysis. Psychometrika 56 (1), 121–136.

Escobar, M. D., West, M., 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90 (430), 577–588.

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.

Farley, J. U., Ring, L. W., 1966. A stochastic model of supermarket traffic flow. Operations Research 14 (4), 555–567.

Fournier, S., Dobscha, S., Mick, D. G., 1998. Preventing the premature death of relationship marketing. Harvard Business Review 76 (1), 42–51.

Fraley, C., Raftery, A. E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.

Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.

Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of factor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the 1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.

Ghosh, J., Beal, M. J., Ngo, H. Q., Qiao, C., 2006. On profiling mobility and predicting locations of wireless users. In: Proceedings of the 2nd International Workshop on Multi-hop Ad Hoc Networks: From Theory to Reality. ACM, Florence, Italy, pp. 55–62.

Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual human mobility patterns. Nature 453 (7196), 779–782.

Heitfield, E., Levy, A., 2001. Parametric, semi-parametric and non-parametric models of telecommunications demand: an investigation of residential calling patterns. Information Economics and Policy 13 (3), 311–329.

Hofstede, F. T., Wedel, M., Steenkamp, J.-B. E. M., 2002. Identifying spatial segments in international markets. Marketing Science 21 (2), 160–177.

Hui, S. K., Bradlow, E. T., Fader, P. S., 2009a. Testing behavioral hypotheses using an integrated model of grocery store shopping path and purchase behavior. Journal of Consumer Research 36 (3), 478–493.

Hui, S. K., Fader, P. S., Bradlow, E. T., 2009b. The traveling salesman goes shopping: The systematic deviations of grocery paths from TSP-optimality. Marketing Science 28 (3), 566–572.

Jacoby, J., 1978. Consumer research: a state of the art review. Journal of Marketing 42 (2), 87–96.

Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper Saddle River, NJ.

Keaveney, S. M., Parthasarathy, M., 2001. Customer switching behavior in online services: an exploratory study of the role of selected attitudinal, behavioral, and demographic factors. Journal of the Academy of Marketing Science 29 (4), 374–390.

Larson, J. S., Bradlow, E. T., Fader, P. S., 2005. An exploratory look at supermarket shopping paths. International Journal of Research in Marketing 22 (4), 395–414.

Lemon, K. N., White, T. B., Winer, R. S., 2002. Dynamic customer relationship management: incorporating future considerations into the service retention decision. Journal of Marketing 66 (1), 1–14.

Liu, T., Bahl, P., Chlamtac, I., 1998. Mobility modeling, location tracking, and trajectory prediction in wireless ATM networks. IEEE Journal on Selected Areas in Communications 16 (6), 922–936.

McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics & Data Analysis 51 (11), 5352–5367.

McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability and Statistics. Wiley, New York.

Milligan, G., Cooper, M., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 (2), 159–179.

Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and Information Systems 14 (3), 273–298.

Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., Kaushansky, H., 2000. Predicting subscriber dissatisfaction and improving retention in the wireless telecommunications industry. IEEE Transactions on Neural Networks 11 (3), 690–696.

Nurmi, P., Koolwaaij, J., 2006. Identifying meaningful locations. In: Proceedings of the Third Annual International Conference on Mobile and Ubiquitous Systems: Networks and Services. IEEE, San Jose, CA, pp. 1–8.

Ouellette, J. A., Wood, W., 1998. Habit and intention in everyday life: the multiple processes by which past behavior predicts future behavior. Psychological Bulletin 124 (1), 54–74.

Peppers, D., Rogers, M., Dorf, B., 1999. Is your company ready for one-to-one marketing? Harvard Business Review 77 (1), 151–160.

Perkins, C. E., 2001. Ad Hoc Networking. Addison-Wesley, Boston, MA.

Reichheld, F. F., 1996. The Loyalty Effect: The Hidden Force Behind Growth, Profits, and Lasting Value. Harvard Business School, Boston, MA.

Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (4), 731–792.

Roeder, K., Wasserman, L., 1997. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association 92 (439), 894–902.

SAS Institute Inc., 1983. Cubic clustering criterion. SAS Technical Report A-108, SAS Institute Inc., Cary, NC.

Schmittlein, D. C., Peterson, R. A., 1994. Customer base analysis: an industrial purchase process application. Marketing Science 13 (1), 41–67.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.

Smith, W. R., 1956. Product differentiation and market segmentation as alternative marketing strategies. Journal of Marketing 21 (1), 3–8.

Stryker, S., Burke, P. J., 2000. The past, present, and future of an identity theory. Social Psychology Quarterly 63 (4), 284–297.

Symons, M. J., 1981. Clustering criteria and multivariate normal mixtures. Biometrics 37 (1), 35–43.

Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 21 (13), 3025–3033.

Twedt, D. W., 1967. How does brand awareness-attitude affect marketing strategy? Journal of Marketing 31 (4), 64–66.

Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.

Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for mixture models. Neural Computation 12 (9), 2109–2128.

Wang, B., Titterington, D. M., 2006. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis 1 (3), 625–650.

Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT, Vancouver, BC, Canada, pp. 1237–1244.

Wedel, M., Kamakura, W. A., 1998. Market Segmentation: Conceptual and Methodological Foundations. International Series in Quantitative Marketing. Kluwer Academic, Boston, MA.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm with application to human mobility pattern modeling. Statistics and Computing, (in press). http://dx.doi.org/10.1007/s11222-010-9217-9


6 Identifying Subspace Clusters for High Dimensional Data with Mixture Models

Abstract

Clustering algorithms have been extensively researched, but the traditional algorithms were developed to search for clusters over the full dimensional space, meaning that they are typically not suitable for analyzing high dimensional data, i.e., for identifying clusters located in different subspaces. In this paper, we put forward a simple alternative approach for integrating the low dimensional patterns to locate subspace clusters based on geometric considerations among observations. We propose utilizing Gaussian mixture models (GMMs) for approximating the low dimensional densities; the equal width histogram is currently commonly utilized for this, but its granularity can often affect the clustering results. We fit the GMMs with the efficient and recently popular non-simulation based variational Bayesian (VB) method, an alternative to more computationally expensive techniques such as the Markov chain Monte Carlo (MCMC) method. In addition to the fact that the number of clusters need not be prespecified in our approach, its clustering accuracy contrasts significantly with that of the standard full dimensional GMM, as we would expect, and also with that of another existing model-based subspace clustering algorithm. The method is applied to simulated data, yielding promising empirical results.

Keywords

Variational Bayes (VB); Gaussian Mixture Model (GMM); Clustering Algorithm; Subspace Clusters; High Dimensional Data


6.1 Introduction

Clustering algorithms aim to automatically segment unlabeled data into relatively meaningful and natural, homogeneous, but hidden, subgroups (or clusters). This is done by maximizing intra-cluster similarities and minimizing inter-cluster similarities without the need for any prior knowledge (Hastie et al., 2009, p.501). They have been used for data reduction, compression, summarization and outlier detection, and have been shown to be useful not only as a stand-alone technique, but also as a preprocessing technique for other analytical tasks (Kriegel et al., 2009).

However, while clustering algorithms have been well studied, the traditional algorithms were developed to discover clusters in the full dimensional space, and such an approach is typically not suitable for analyzing high dimensional data (Agrawal et al., 1998). This is because as the dimensionality increases, the sparseness of the data usually also increases, and this in turn means that the similarity measures that are critical for clustering become essentially meaningless (Aggarwal et al., 2001). This phenomenon is known as the ‘curse of dimensionality’ (Bellman, 1961) or the ‘empty space phenomenon’ (Scott, 1992, p.84); it implies a lack of data separation in high dimensional spaces (Hastie et al., 2009, pp.22-27) and means that the nearest neighbors are not stable (Beyer et al., 1999). However, we stress that this does not imply that observations are always better distinguished in a lower dimensional space. Therefore, it is still worthwhile to explore techniques for higher dimensional data, which motivates us in this paper to propose an approach that is suitable for identifying clusters in the high dimensional space.

One possible approach to tackling this ‘curse of dimensionality’ issue is to apply dimensionality reduction techniques prior to clustering high dimensional data. However, global techniques such as feature transformation, e.g., principal component analysis (PCA) and singular value decomposition (SVD), and feature selection have already been shown to be inappropriate (Chang, 1983; Agrawal et al., 1998; Kriegel et al., 2009). This is because many attributes are irrelevant (Parsons et al., 2004) and there may be variation in the level of correlation that is relevant among different clusters (Kriegel et al., 2009).

Another concept that has been proposed is that of subspace clusters; it is based on the observation that high dimensional data usually have an intrinsic dimensionality that is lower than the original full dimensionality (Jain and Dubes, 1988, pp.42-46). That is, usually only a small number of the dimensions are actually relevant to particular clusters, and it is often just noisy signals that contribute information to the remaining unwanted attributes (Agrawal et al., 1998). Also, the number of unwanted attributes is likely to grow with the total number of dimensions, as observations are increasingly likely to be located in different subspaces. The challenge in clustering


high dimensional data, at least in the notion of subspace clusters, is therefore to

achieve the ability to search effectively and efficiently for groups of clusters within

different subspaces of the same dataset without exhaustively examining all possible

attribute combinations (Parsons et al., 2004). In other words, the key for clustering

high dimensional data is to perform feature selection, but not in the global sense as

is typically done. We follow this notion in this paper.

Many new efficient algorithms following this research direction have recently been introduced (Parsons et al., 2004; Kriegel et al., 2009). They typically employ a reduced search space strategy based on the observation that if a d′-dimensional unit, or cluster, is dense, then so are its projections onto (d′ − 1)-dimensional subspaces (Agrawal et al., 1998). That is, they often aim to first identify dense regions in a lower (typically one- or two-) dimensional subspace. Then, depending on the algorithm used and the approach taken, subspaces that contain clusters are identified and clusters are formed, for example, by combining adjacent dense units in a bottom-up fashion (Agrawal et al., 1998).

Axis-parallel clustering algorithms (as opposed to those focused on finding arbitrarily shaped clusters) (Kriegel et al., 2009) have often adopted a grid-based approach for identifying dense regions. That is, they often model the distribution of each dimension (or dimension combination at the very low level) with an equal width histogram (or a grid) with a predefined number of bins, and then identify the dense regions of the histograms using a predefined value for the density threshold parameter. However, while the results of these algorithms can generally be interpreted meaningfully, and the algorithms are often also capable of finding arbitrarily shaped clusters in hyper-rectangular format, clusters are likely to be spread over many bins (or grid cells) (Hinneburg and Keim, 1999); and the accuracy of such a strategy, while simple and generally effective, depends on the granularity and positioning of the grid (Kriegel et al., 2009).

Some grid-based algorithms are flexible in that they utilize an adaptive grid instead of a static grid with a fixed interval size (e.g., Nagesh et al., 2000). Alternatively, some algorithms increase flexibility by iteratively lowering the density threshold parameter value, as in Ng et al. (2005), for example, or by setting the threshold parameter value based on the Poisson distribution, i.e., utilizing the chi-square goodness-of-fit test to examine whether the number of observations in a bin is significantly different from that expected under a uniform distribution, as in Moise et al. (2008). However, while these improved grid-based approaches can sometimes better identify the dense regions, they can still mistakenly divide a cluster into several smaller subclusters. This concern is shared by Liu et al. (2007), who have instead proposed adopting histograms with overlapped bins. The disadvantage of using grid-based approaches is also highlighted by Kriegel et al. (2005), who have shown that better clustering results can be obtained by using


a density-based algorithm (c.f. Ester et al., 1996) that is suitable for low dimensional

data to identify the dense regions of each dimension instead of a grid.

In this paper, we follow these typical bottom-up approaches to identifying subspace clusters by first approximating the densities in the low dimensional spaces (which are then utilized for guiding the discovery of the subspace clusters). However, we do this somewhat differently. We show that it is suitable to adopt Gaussian mixture models (GMMs) for identifying the dense regions at the low dimensional level. Additionally, instead of identifying dense regions of each dimension in the usual way, we opt to do this for each two-dimensional (2D) subspace (i.e., combination of two dimensions). While such a tactic is clearly less computationally efficient, better clustering results have been obtained in doing so (Ng et al., 2005).

To implement this approach, we adopt the algorithm of McGrory and Titterington (2007) that is based on the recently popular variational Bayesian (VB) framework for GMM approximations. VB is able to automatically select the number of components k that best represents the data (based on the variational approximation), which in this case is the observations in each 2D subspace. It also allows estimation of the model parameter values at the same time and is computationally more efficient than the alternative Markov chain Monte Carlo (MCMC) Bayesian approach (McGrory and Titterington, 2007). Note that one of the features associated with the use of VB is that it results in a somewhat automatic choice of a suitable k for the fitted model by effectively and progressively eliminating redundant components specified in the initial models as it converges. See Wang and Titterington (2006) and McGrory and Titterington (2007), for example, for more discussion of this aspect of the VB approximation. This VB property implies that we are less dependent on the initial choice of k, kinitial, than we would be if we had adopted approaches based on the expectation-maximization (EM) algorithm (Dempster et al., 1977), for example. This is partly because forcing data to be grouped into k components can lead to a component being divided into several smaller ones unnecessarily.

Additionally, unlike some other clustering algorithms, our algorithm does not require prespecification either of the number of subspace clusters g, or of the average subspace cluster dimensionality. We do this by taking observations' nearest neighbors into consideration and forming the subspace clusters bottom-up; we found that our simple, straightforward approach (described later in Section 6.3) significantly reduced the implications of the ‘curse of dimensionality’ present in the high dimensional space. Note that despite modeling the 2D subspaces with GMMs, we form the clusters nonparametrically. This differs from certain model-based algorithms, such as the high dimensional data clustering (HDDC) algorithm (Bouveyron et al., 2007), that aim to model each cluster as a GMM in a subspace. The intrinsic dimensionality of the clusters in HDDC is estimated iteratively based on the eigenvalues of each cluster covariance matrix, and g can be determined based on criteria such


as the Bayesian information criterion (BIC) (Schwarz, 1978; Fraley and Raftery, 1998) in HDDC. Furthermore, this research also differs from papers (e.g., Raftery and Dean, 2006; Maugis et al., 2009; Scrucca, 2010) that aim to identify one single, but potentially transformed, subspace (c.f. variable selection) which best distinguishes the high dimensional data overall. It is also quite different from many other weighted k-means-like algorithms (e.g., Friedman and Meulman, 2004) that focus on normalizing attributes but do not discard attributes as we do here; those approaches lead to clusters that are more difficult to interpret.

We organize the rest of this paper as follows. In Section 6.2, we briefly discuss how

VB can be utilized for approximating each 2D subspace with a GMM. In Section 6.3,

we detail our proposed process for identifying subspace clusters. We present our

experimental results and comparisons to the standard full dimensional GMM and

the HDDC algorithm in Section 6.4. Of course, we expect the full dimensional GMM

to perform poorly due to the ‘curse of dimensionality’ effect. We conclude with a

discussion in Section 6.5.

6.2 VB-GMM Algorithm

Finite mixture distributions provide a convenient, flexible way to approximate other potentially complex distributions (Titterington et al., 1985). In a GMM, it is assumed that all k underlying mixture components are Gaussian. The density of an observation $x = (x_1, \ldots, x_n)$ is given by

$$\sum_{j=1}^{k} w_j N\!\left(x; \mu_j, T_j^{-1}\right),$$

where $k \in \mathbb{N}$, $\mu_j$ and $T_j^{-1}$ represent the mean and variance, respectively, of the jth component density, each mixing coefficient $w_j$ satisfies $0 \leq w_j$ and $\sum_{j=1}^{k} w_j = 1$, and here $N(\cdot)$ denotes a multivariate Gaussian density. In the Bayesian framework, inference is based on the target posterior distribution, $p(\theta, z \mid x)$, where $\theta$ denotes the model parameters $(\mu, T, w)$ and $z$ denotes the missing component membership information for the observation x. Note that the elements of z, which we call the $z_{ij}$, are indicator variables such that $z_{ij} = 1$ if observation $x_i$ belongs to the jth component and $z_{ij} = 0$ otherwise. The target posterior is proportional to the product of the likelihood times the chosen prior distributions and is generally not analytically tractable.

VB methods have become popular for approximating the target posteriors of a GMM and the theory is now well documented in the literature (e.g., Wang and Titterington, 2006). The VB approximation for a GMM is reliable, asymptotically consistent, and not biased for large samples (Wang and Titterington, 2006). The idea of VB is to approximate the target posterior by a variational distribution $q(\theta, z \mid x)$ that factorizes over $\theta$ and $z$ so that $q(\theta, z \mid x) = q_\theta(\theta \mid x) \times q_z(z \mid x)$. The distribution $q(\theta, z)$ is chosen to maximize the lower bound on the log marginal likelihood. Alternatively, note that


this is equivalent to minimizing the Kullback-Leibler (KL) divergence between the target posterior and the variational approximating distribution. This minimization produces a set of coupled expressions for the variational approximations to the posteriors over the parameters, and these can be iteratively updated to find a solution.

While alternative model hierarchies could be used, this paper follows the model specification and resulting posterior updates that are described in McGrory and Titterington (2007). We model each combination of two-dimensional patterns as a mixture of k bivariate Gaussian distributions with unknown means $\mu = (\mu_1, \ldots, \mu_k)$, precisions $T = (T_1, \ldots, T_k)$ and mixing coefficients $w = (w_1, \ldots, w_k)$, such that

$$p(x, z \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left\{ w_j N\!\left(x_i; \mu_j, T_j^{-1}\right) \right\}^{z_{ij}},$$

with the joint distribution being $p(x, z, \theta) = p(x, z \mid \theta)\, p(w)\, p(\mu \mid T)\, p(T)$. We express our priors as:

$$p(w) = \text{Dirichlet}\!\left(w; \alpha_1^{(0)}, \ldots, \alpha_k^{(0)}\right),$$

$$p(\mu \mid T) = \prod_{j=1}^{k} N\!\left(\mu_j; m_j^{(0)}, \left(\beta_j^{(0)} T_j\right)^{-1}\right), \quad \text{and}$$

$$p(T) = \prod_{j=1}^{k} \text{Wishart}\!\left(T_j; \upsilon_j^{(0)}, \Sigma_j^{(0)}\right),$$

with $\alpha^{(0)}$, $\beta^{(0)}$, $m^{(0)}$, $\upsilon^{(0)}$, and $\Sigma^{(0)}$ being known, user-chosen values. These are the standard conjugate priors used in Bayesian mixture modeling (Gelman et al., 2004). Using the lower bound approximation, the posteriors are:

$$q_w(w) = \text{Dirichlet}(w; \alpha_1, \ldots, \alpha_k),$$

$$q_{\mu|T}(\mu \mid T) = \prod_{j=1}^{k} N\!\left(\mu_j; m_j, (\beta_j T_j)^{-1}\right), \quad \text{and}$$

$$q_T(T) = \prod_{j=1}^{k} \text{Wishart}(T_j; \upsilon_j, \Sigma_j).$$

The variational updates for the posterior parameters are then:

$$\alpha_j = \alpha_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \beta_j = \beta_j^{(0)} + \sum_{i=1}^{n} q_{ij}, \qquad \upsilon_j = \upsilon_j^{(0)} + \sum_{i=1}^{n} q_{ij},$$

$$m_j = \frac{1}{\beta_j}\left(\beta_j^{(0)} m_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i\right), \quad \text{and}$$

$$\Sigma_j = \Sigma_j^{(0)} + \sum_{i=1}^{n} q_{ij} x_i x_i^{T} + \beta_j^{(0)} m_j^{(0)} m_j^{(0)T} - \beta_j m_j m_j^{T},$$

where $q_{ij}$ is the VB posterior probability that the component membership indicator $z_{ij} = 1$, and the required expectations are given by $E(\mu_j) = m_j$ and $E(T_j) = \upsilon_j \Sigma_j^{-1}$. This


is a standard VB approach to fitting GMMs; other algorithms with different model

hierarchies include Attias (1999) and Corduneanu and Bishop (2001), for example.
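As an illustration of these updates, below is a minimal sketch of one VB iteration for a bivariate GMM, assuming the responsibilities q_ij have already been computed in the accompanying E-like step; the variable names and the NumPy implementation are our own, not code from the paper.

```python
import numpy as np

def vb_update(X, q, alpha0, beta0, v0, m0, Sigma0):
    """One pass of the variational posterior updates for a GMM.

    X  : (n, 2) observations in one 2D subspace.
    q  : (n, k) responsibilities q_ij from the previous iteration.
    The remaining arguments are the prior hyperparameters: alpha0,
    beta0, v0 of shape (k,), m0 of shape (k, 2), Sigma0 of (k, 2, 2).
    """
    Nj = q.sum(axis=0)              # sum_i q_ij for each component j
    alpha = alpha0 + Nj
    beta = beta0 + Nj
    v = v0 + Nj
    # m_j = (beta0_j * m0_j + sum_i q_ij x_i) / beta_j
    m = (beta0[:, None] * m0 + q.T @ X) / beta[:, None]
    Sigma = Sigma0.copy()
    for j in range(q.shape[1]):
        Sigma[j] += (q[:, j, None] * X).T @ X \
            + beta0[j] * np.outer(m0[j], m0[j]) \
            - beta[j] * np.outer(m[j], m[j])
    # Posterior expectations: E(mu_j) = m_j, E(T_j) = v_j * inv(Sigma_j).
    return alpha, beta, v, m, Sigma
```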

6.3 Subspace Clusters Identification

In the following, we assume an observation x has dimensionality d and lies in the feature space $\mathbb{R}^d$. We denote the lth attribute of an observation $x_i$ as $x_{il}$ and assume that the $x_{il}$'s have already been standardized such that $0 \leq x_{il} \leq 1$.

Our approach for identifying associated subspaces and evaluating the results is in a similar spirit to that of Ng et al. (2005), but differs in that it uses a VB approach for the low dimensional dense region estimation. It can be summarized in the following steps:

1. Approximating the density of each 2D subspace with VB-GMM.

2. Detecting the dense regions of each 2D subspace.

3. Estimating the associated subspace of each observation and deriving each observation's ‘signature’.

4. Identifying interesting associated subspaces, or subspace clusters, by merging similar observation ‘signatures’.

5. Assigning observations to appropriate subspace clusters.

We next describe these steps in detail.

6.3.1 Approximating the density of each 2D subspace with VB-GMM

Our first step is to adopt the VB-GMM algorithm described in Section 6.2 for approximating the pattern of each 2D subspace, i.e., we need to execute the VB-GMM algorithm a total of $\binom{d}{2}$ times. Typically, the iterative VB-GMM ‘declares’ that a model has converged based on examining the lower bound on the log marginal likelihood F (Attias, 1999; Corduneanu and Bishop, 2001; Wang and Titterington, 2006). When the lower bound F of the current iteration is the same as that of the previous iteration up to a very small tolerance level, then the variational scheme has converged. However, Wu et al. (2010b) pointed out that in practice such an approach can be computationally wasteful and subsequent iterations can simply be ‘hopping’ among several alternative ‘good’ models. Consequently, we follow the model stability criterion of Wu et al. (2010b) for identifying converged models, which is as follows; we found that this led to good results in our simulation trials and improved efficiency. A model is declared to be converged if:


• The number of components k currently in the model has remained unchanged

from the previous iteration (S1);

• The variational posterior mean estimates mj of all components currently in the model are the same as in the previous iteration up to a very small tolerance level δ (S2); and,

• At least c0 iterations have been completed (S3).

In other words, instead of monitoring changes in F as is done in most other VB papers, we focus on key model parameter estimates, as this has been found to be adequate. Note that we follow Wu et al. (2010b) and choose c0 to be equal to five; the role of S3 is simply to prevent the algorithm ‘declaring’ that a model has converged prematurely before at least some iterations have been carried out.
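A minimal sketch of this stability-based convergence check is given below; the data structures holding the previous and current model states are our own illustrative choices.

```python
import numpy as np

def has_converged(k_prev, k_curr, m_prev, m_curr, it, delta=1e-6, c0=5):
    """Model stability criterion (S1)-(S3) of Wu et al. (2010b).

    k_prev, k_curr : component counts in the previous/current iteration.
    m_prev, m_curr : lists of variational posterior mean estimates m_j.
    it             : current iteration number (1-based).
    """
    if it < c0:                      # (S3) minimum number of iterations
        return False
    if k_curr != k_prev:             # (S1) component count unchanged
        return False
    # (S2) all posterior means unchanged up to tolerance delta
    return all(np.allclose(a, b, atol=delta)
               for a, b in zip(m_prev, m_curr))
```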

Each time we run the algorithm, we initialize it with kinitial components. Assuming we choose kinitial = h² with h ∈ N, we propose to informatively assign the initial mixture membership of an observation based on where the observation lies on an h × h grid. An informative initial allocation strategy has been shown to perform better than simply allocating the observation component memberships randomly (Wu et al., 2010c). However, in order to avoid introducing any bias from this initialization scheme, we set larger initial component covariance matrices than those implied by the grid. The computational requirements of this step (for the same dimensionality d) are very dependent on the choices of δ and h; the implications of this are examined in Section 6.4.
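For concreteness, a minimal sketch of this grid-based initial allocation is shown below, assuming observations have been standardized to [0, 1] as stated at the start of Section 6.3; the function name is our own.

```python
import numpy as np

def initial_membership(X, h):
    """Assign each standardized 2D observation to one of h*h grid cells.

    X : (n, 2) array with entries in [0, 1].
    Returns integer labels in {0, ..., h*h - 1}, used as the initial
    component memberships for the k_initial = h**2 VB-GMM components.
    """
    # Map each coordinate to a bin index in {0, ..., h-1}.
    bins = np.minimum((X * h).astype(int), h - 1)
    return bins[:, 0] * h + bins[:, 1]
```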

6.3.2 Detection of dense regions of each 2D subspace

In order to identify interesting associated subspaces, we must first identify the dense regions of each 2D subspace. As discussed in the introduction, many different approaches have been used in grid-based methods, but these suffer from some drawbacks. Alternatively, an approach was suggested by Kriegel et al. (2005) which instead involves performing density approximation in each dimension with a non-overlapped density-based clustering algorithm that is suitable for low-dimensional data; dense regions are identified as those one-dimensional clusters with weights greater than 25% of the average cluster weight.

In this paper, we adopt a similar approach to that of Kriegel et al. (2005). However, in-

stead of ignoring components with small weights as done in Kriegel et al. (2005), we

consider the jth mixture component of a 2D subspace to be dense ifwj ≥ c1×waverage,

where c1 > 1 and waverage is the average component weight. Note that waverage can

be different for each 2D subspace as it depends on the number of components re-

maining in the model; this count must be less than or equal to kinitial due to VB’s


component elimination property. Additionally, instead of identifying dense regions based on all observations in the dense components, as done in Kriegel et al. (2005), we only consider observations to be in the dense region if their component likelihood is greater than c2; observations are only considered with respect to their most probable component according to the VB posterior estimate of qij. The implications of choosing different c2 values are examined in Section 6.4.
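As a sketch (assuming the fitted weights and the n × k responsibility matrix q of one 2D subspace are available; the names below are illustrative), the two thresholds c1 and c2 could be applied as:

    import numpy as np

    def dense_region_members(weights, q, c1=1.5, c2=0.9):
        """Label observations lying in a dense region of one 2D subspace.

        weights : (k,) fitted component weights w_j
        q       : (n, k) VB posterior responsibilities q_ij
        Returns the dense component index for each observation, or -1 if the
        observation is not confidently inside any dense component.
        """
        dense = weights >= c1 * weights.mean()      # w_j >= c1 * w_average
        best = q.argmax(axis=1)                     # most probable component
        confident = q[np.arange(q.shape[0]), best] > c2
        return np.where(dense[best] & confident, best, -1)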

After all dense regions have been identified for all 2D subspaces, we summarize each

observation into a ‘signature’ data structure as proposed by Ng et al. (2005) at the

end of this step. Each observation summary describes whether it has been found

in the dense regions in each of the 2D subspaces; and, if so, which dense region

or component number. That is, suppose we have four dimensions, A to D, and an observation has been found (only) to be in dense region #2 of subspace AB and dense region #5 of subspace AC; this observation will then be summarized as [2 5 0 0 0 0], corresponding to subspaces AB, AC, AD, BC, BD, and CD. However, unlike Ng et al. (2005), we do not refer to these observation summaries as signatures yet, as we will first refine them further in our next step.
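A sketch of assembling these summaries over all C(d, 2) subspaces is given below; the names are illustrative, and region numbers are stored starting from 1 so that 0 can mean 'not in any dense region':

    from itertools import combinations
    import numpy as np

    def build_summaries(dense_labels_by_pair, n, d):
        """Assemble per-observation summaries over all C(d, 2) 2D subspaces.

        dense_labels_by_pair : dict mapping a dimension pair (a, b) to an
            (n,) array of dense component indices, -1 meaning 'none'.
        """
        pairs = list(combinations(range(d), 2))   # AB, AC, AD, BC, BD, CD, ...
        summary = np.zeros((n, len(pairs)), dtype=int)
        for j, pair in enumerate(pairs):
            labels = dense_labels_by_pair[pair]
            # Store the dense region number (offset by one), 0 meaning 'none'.
            summary[:, j] = np.where(labels >= 0, labels + 1, 0)
        return summary, pairs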

6.3.3 Estimating the associated subspace of each observation

After detecting the dense regions for each 2D subspace, our next step, Step 3, is to

estimate the corresponding subspace of the clusters to which the observations are

likely to belong. Ng et al. (2005) determine the 'best' estimated associated subspace

of an observation as the union of its dense regions’ dimensions with respect to 2D

subspaces. For the example observation given above, its 'best' esti-

mated associated subspace will be ABC even though it was not identified to be in

any of the dense regions of subspace BC, but was identified to be in the dense re-

gions of subspaces AB and AC. Obviously, for this particular observation, one would

be more confident with the estimation if it were also in one of the subspace BC dense

regions. To take this confidence level into consideration, Ng et al. (2005) have pro-

posed to also compute the likelihood of each observation’s ‘best’ estimated associ-

ated subspace; observations with higher likelihood will have more influence on the

final clustering results.

However, we found that the tactic described above for estimating the associated subspace of an observation (i.e., the union of the dimensions of its dense regions across 2D subspaces) can be ineffective in practice. That is, assuming an observation is in a cluster whose true associated subspace includes dimension A, we have observed that the observation is often found in a dense region of most 2D subspaces involving dimension A, even when the other dimension of the pair is irrelevant.


Consequently, we often observed that the likelihoods, as calculated in Ng et al. (2005), of a large number of observations with respect to their 'best' estimated associated subspaces are practically zero and hence not very useful. For this reason we believe that the union of the dimensions of an observation's 2D subspace dense regions should really be considered as an 'upper bound' on the estimated associated subspace of that observation. This suggests a need to estimate the associated subspace of each observation differently.

We propose to do this by refining the observation summaries (described in Sec-

tion 6.3.2) obtained from Step 2. We do this using two simple strategies; both aim

to identify irrelevant dimensions with respect to an observation. We then update the

observation summaries so that the observation will no longer be identified as involv-

ing those irrelevant dimensions. We believe that this way the associated subspace for

each observation can be better estimated.

Our first strategy is based on our observation that if dimension A is highly relevant to an observation, then the observation is likely to be identified in the dense regions of most, if not all, 2D subspaces involving dimension A. Thus, we determine the relevance of a dimension to an observation by counting how many times the observation has been found in a dense region of a 2D subspace involving that dimension; a dimension with a low count with respect to an observation (i.e., less than or equal to c3 of the data dimensionality d) is considered irrelevant. The implications of choosing different c3 values are examined in Section 6.4, where we found c3 = 12% to be a good choice.
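A sketch of this counting rule for a single observation (with illustrative names; the summary row and pair list follow the structure described in Section 6.3.2) might be:

    import numpy as np

    def irrelevant_dimensions(summary_row, pairs, d, c3=0.12):
        """Strategy 1: flag dimensions with a low dense-region count.

        summary_row : one observation's summary over all 2D subspaces
        pairs       : the dimension pair underlying each summary entry
        A dimension is deemed irrelevant if it occurs in at most c3 * d of
        the 2D subspaces in which the observation is in a dense region.
        """
        counts = np.zeros(d, dtype=int)
        for entry, (a, b) in zip(summary_row, pairs):
            if entry > 0:            # observation is in a dense region here
                counts[a] += 1
                counts[b] += 1
        return np.where(counts <= c3 * d)[0]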

For our second strategy, we adopt a similar approach to the dimension voting proce-

dure in Woo et al. (2004); that is, we identify irrelevant dimensions of an observation

by utilizing the estimated associated subspaces information of its neighbors. For ex-

ample, if an observation’s estimated associated subspace is ABC and dimension A

was not found to be part of any of its neighbors’ estimated associated subspaces,

then dimension A will be considered as irrelevant for the observation and the ob-

servation’s estimated associated subspace will become just BC. To do this, Woo et al.

(2004) introduce a unique distance measure for identifying an observation’s p near-

est neighbors. However, the properties of this measure have not been explored thor-

oughly. In this paper, we instead utilize a measure similar to that of Ng et al. (2005) for measuring the similarity between two observation summaries. Suppose we have two observations x1 and x2; we measure their similarity as follows:

\[
\mathrm{sim}(x_1, x_2) = \frac{\binom{d_{\mathrm{common}}}{2}}{\binom{d_{\mathrm{unique}}}{2}}, \qquad (6.1)
\]


where \(\binom{d}{2}\) denotes the number of possible two-dimensional combinations of \(d\) dimensions, and \(d_{\mathrm{common}}\) and \(d_{\mathrm{unique}}\) are the numbers of common and unique dimensions of the two estimated associated subspaces, respectively. We found in simulation experiments that this measure can

effectively identify those observations which should not be considered as neighbors and hence have no right to 'vote'. This way, our decision as to whether a dimension is irrelevant to an observation does not depend only on its p nearest neighbors, but rather on all observations that are similar. In this paper, we define x2 to be a neighbor of x1 if it is at least 30% similar based on Equation 6.1, and an estimated associated subspace dimension of x1 is considered relevant if it is shared with at least 70% of its neighbors. We repeat this several times (e.g., five) or until there are no further changes to our observation summaries.
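Taking 'unique' in Equation 6.1 to mean the distinct dimensions across both estimated associated subspaces (one plausible reading of the definition above), the measure can be sketched as follows; the function name is ours:

    from math import comb

    def summary_similarity(dims1, dims2):
        """Equation (6.1): similarity of two estimated associated subspaces.

        dims1, dims2 : sets of dimensions of the two estimated subspaces.
        """
        d_common = len(dims1 & dims2)    # dimensions shared by both
        d_unique = len(dims1 | dims2)    # distinct dimensions across both
        if d_unique < 2:
            return 0.0
        return comb(d_common, 2) / comb(d_unique, 2)

    # Subspaces ABC and ABD: d_common = 2, d_unique = 4, so sim = 1/6.
    print(summary_similarity({"A", "B", "C"}, {"A", "B", "D"}))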

Recall that after identifying those irrelevant dimensions with respect to an observa-

tion, we update the observation summaries such that the observations are consid-

ered as not being found in any dense region in the 2D subspaces involving those ir-

relevant dimensions. We refer to this updated observation summary as a 'signature'; its structure still corresponds to the list of the observation's dense region or component numbers across the 2D subspaces.

6.3.4 Identifying interesting associated subspaces

Our next step, Step 4, is to merge similar observation signature entries. Our proce-

dure is simple: we group observations together as long as there is no ‘conflict’ in the

dense region number in any of the 2D subspaces, and we set some minimum size

for the group (e.g., 3% of n). We define 'conflict' as follows: suppose we have three observations x1, x2, and x3 whose signature entries with respect to subspace AB are 0, 5, and 7, respectively; then x2 and x3 are in conflict with each other with respect to subspace AB, whereas x1 and x2, and x1 and x3, are not. Recall that a signature entry equal to 0 implies that the observation is not considered to be in any dense region of that 2D subspace. At the end of this signature grouping/merging process, we obtain a list of groups, or subspace clusters, which are sometimes referred to as projected clusters in the literature.
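A sketch of the conflict test and a simple greedy grouping pass is given below; the merging order and the greedy strategy are our illustrative choices, not a prescription from the text:

    import numpy as np

    def in_conflict(sig1, sig2):
        """Two signatures conflict if, in some 2D subspace, both record a
        dense region number (non-zero) and the two numbers differ."""
        both = (sig1 > 0) & (sig2 > 0)
        return bool(np.any(sig1[both] != sig2[both]))

    def merge_signatures(signatures, min_size):
        """Greedily group conflict-free signatures into subspace clusters."""
        groups = []
        for i in range(len(signatures)):
            for group in groups:
                if not any(in_conflict(signatures[i], signatures[j]) for j in group):
                    group.append(i)
                    break
            else:
                groups.append([i])
        # Keep only groups meeting the minimum size (e.g., 3% of n).
        return [g for g in groups if len(g) >= min_size]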

6.3.5 Assigning observations to appropriate subspace clusters

Finally, we assign observations to appropriate subspace clusters utilizing the sim-

ilarity measure defined in Ng et al. (2005). Assuming we have subspace cluster s1,

and an observation x1, their similarity is defined as follows:

\[
\mathrm{sim}(s_1, x_1) = \frac{\text{number of matched 2D subspaces}}{\text{number of unique 2D subspaces}}.
\]


If we had x1 and s1 signatures of [2 5 0 2 0 0] and [2 5 3 7 0 0], respectively, then x1 and

s1 would be considered to be 50% similar since the number of unique 2D subspaces

= 4 and the number of matched 2D subspaces = 2. We assign each observation to

a subspace cluster with the highest similarity measure as long as it is greater than a

certain threshold (e.g., 30%). Note that in the above expression we place no importance on matching signature entries that are 0, since these dimensions/subspaces are less relevant.
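The worked example above can be reproduced with a short illustrative sketch (the function name 'assignment_similarity' is ours):

    import numpy as np

    def assignment_similarity(cluster_sig, obs_sig):
        """Similarity between a subspace cluster and an observation.

        'Unique' counts every 2D subspace that is non-zero in either
        signature; matching zero entries carry no weight.
        """
        unique = (cluster_sig > 0) | (obs_sig > 0)
        matched = (cluster_sig == obs_sig) & (cluster_sig > 0)
        return matched.sum() / unique.sum() if unique.any() else 0.0

    # Worked example from the text: signatures [2 5 0 2 0 0] and [2 5 3 7 0 0]
    x1 = np.array([2, 5, 0, 2, 0, 0])
    s1 = np.array([2, 5, 3, 7, 0, 0])
    print(assignment_similarity(s1, x1))   # 2 matched / 4 unique = 0.5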

6.4 Experimental Results

We demonstrate the effectiveness of our method in identifying subspace clusters in

high dimensional data. One of our goals is to evaluate the suitability of utilizing

mixture models for assisting in identifying subspace clusters. This preliminary eval-

uation is based on a simulated dataset with n = 1000 observations and d = 25 di-

mensions. It consists five clusters, four of which have weights equal to 20%, and one

of which has weight equal to 15%; the remaining 5% of the observations are outliers

(c.f. Moise et al., 2008). Each cluster has an intrinsic dimensionality of six, or 24% of the total dimensionality, with some dimensions being shared by more than one cluster.

But first, we note that attributes with no relevance to any of the clusters should dis-

play a uniform distribution, while those that are relevant to one or more clusters will

typically display a non-uniform distribution (Moise et al., 2008). Our experimental evaluations suggested that when we approximate bivariate uniform distributions with VB-GMM, the maximum component weights are generally lower than 1.5 × waverage. Consequently, we set c1 = 1.5, meaning that we consider a component to be a dense region if its w ≥ 1.5 × waverage.

We evaluate our results in the following contexts. First, we examine the implications of the choice of δ, the tolerance level for determining whether the VB-GMM model has converged. We then evaluate the implications of different GMM granularities, i.e., the choice of grid size h (c.f. h × h = kinitial); of c2, the likelihood threshold above which observations are considered to be in the dense regions; and of c3, the threshold level for determining the relevance of a dimension to an observation.

Finally, we consider situations where the intrinsic dimensionality of the clusters is lower than 24%, by adding up to 75 additional irrelevant dimensions. As indicated in the introduction, both the full dimensional GMM approach and the HDDC algorithm (Bouveyron et al., 2007), implemented within the MIXMOD software package (Biernacki et al., 2006), will also be applied in this situation for comparison. As is typically done, an EM algorithm is used here for the full dimensional


Table 6.1: Comparison of results obtained using different δ, the tolerance level for determining if the VB-GMM model has converged

  δ                                10⁻³     5 × 10⁻²
  VB Performance
    Avg. kfinal                    8.8      8.9
    Avg. # of VB Iterations        93.6     22.1
  Clustering Results
    # of Subspace Clusters         6        5
    Clustering Accuracy            70.7%    96.9%
    Avg. Cluster Dimensionality    5.4      6.0

GMM; whereas the default settings of HDDC are employed. We note that HDDC is

based on work on mixtures of probabilistic PCA (Tipping and Bishop, 1999; McLach-

lan et al., 2003) and eigenvalue decomposition of the covariance matrices (Celeux

and Govaert, 1995) with only certain essential parameters estimated by an EM algo-

rithm. Given that both algorithms are EM-based and thus dependent on the initialization, we execute them for a total of 20 runs, with the results presented being the averages of these. Additionally, since neither algorithm can automatically determine the number of clusters g, BIC is utilized for this purpose.

Note that we compare and evaluate results based on the accuracy of the cluster grouping, that is, how many observations were correctly grouped together or identified as outliers; in the event that more than five clusters are found, we classify the results as being 'correct' as long as the clusters are simply subsets of the original clusters. Additionally, we report the average dimensionality of the subspace clusters recovered by our approach. We also consider the performance of VB-GMM by reporting the average final number of components in the model, kfinal, over all 2D subspace combinations, and the average number of iterations required to reach converged models.

6.4.1 Sensitivity to choice of δ, the tolerance level for determining if the VB-GMM model has converged

We initialize the VB algorithm for each 2D subspace with nine components, i.e., h = 3 or kinitial = 9, and set c1 = 1.5, c2 = 90% and c3 = 12%. The results are shown in

Table 6.1 for two different choices of δ. We can see that if we have smaller δ then we

will require more VB iterations (i.e., more computations), but it does not guarantee

that we will obtain better clustering accuracy. Therefore it seems that it is only nec-

essary choose δ sufficiently small to ensure one obtains good density approximation

to each 2D subspace for identifying subspace clusters. Recall that the primary pur-

pose of fitting GMMs is simply to identify the potential dense regions within each 2D

subspace.


Table 6.2: Comparison of results obtained using different GMM granularity

  Granularity h × h (kinitial)   2 × 2 (4)   3 × 3 (9)   4 × 4 (16)   5 × 5 (25)
  Avg. kfinal                    3.9         8.9         13.8         17.1
  Avg. # of VB Iterations        26.6        22.1        27.7         30.1
  # of Subspace Clusters         6           5           5            5
  Clustering Accuracy            78.3%       96.9%       98.7%        81.4%
  Avg. Cluster Dimensionality    4.8         6.0         6.0          3.6

6.4.2 Sensitivity to choice of GMM granularity h (or kinitial)

Next, we examine the effects of using different GMM granularities, i.e., different h or kinitial. Based on the results reported in the previous subsection, we choose δ = 5 × 10⁻² (with c1 = 1.5, c2 = 90% and c3 = 12%) here; the results are shown in Table 6.2. The results suggest that having more mixture components for approximating

the density distribution of each 2D subspace does not guarantee that we will obtain

a better clustering accuracy. Note the excellent clustering accuracy that is achieved

when VB-GMM is initialized with either nine or 16 components. This is somewhat in

contrast to the grid-based approaches for which better clustering results may be ob-

tained with finer granularity. This in turn also highlights a potential challenge when

GMMs are estimated with an EM-based algorithm since, unlike VB, it is unable to

remove redundant components. As is the case when a smaller δ is chosen, initial-

izing VB-GMM with larger kinitial will require more computations. This appears to

be wasteful as empirical results suggest that our inference was not improved. This

suggests that kinitial should not be too large or too small.

6.4.3 Sensitivity to choice of c2, the likelihood threshold where observations are considered to be in the dense regions

Unlike most hard clustering algorithms, mixture models can provide each observa-

tion with a membership likelihood measure with respect to a certain component.

It provides an opportunity to define dense regions based on only a subset of ob-

servations of a component. Here we examine how the choice of c2, the likelihood

threshold where observations are considered to be in the dense regions, can affect

the results which are shown in Table 6.3. It shows that, at least for our proposed

method, better results can be obtained with larger c2. That is, we consider a dense

region to be the area covered by all observations that are classified with high prob-

ability as belonging to a heavily weighted component; this differs from Kriegel et al.

(2005)’s approach of using the area covered by all observations assigned to a heavily

weighted component even those for which the probabilities of assignment are rather

low. While we found that by adjusting some other parameter values (e.g., simply hav-

ing c3 = 20%) can significantly improve the very poor results shown towards the right

Page 189: With applications to profiling and differentiating habitual ... · With applications to profiling and differentiating habitual consumption behaviour of customers in the wireless

6.4 Experimental Results 165

Table 6.3: Comparison of results obtained using different c2, the likelihood thresholdwhere observations are considered to be in the dense regions

c2 90% 80% 70% 60% 50% 40% 30% < 20%

# of Subspace Clusters 5 5 5 3 2 3 2 2Clustering Accuracy 96.9% 94.3% 75.0% 41.4% 19.3% 7.8% 21.0% 18.0%

Avg. Cluster Dimensionality 6.0 6.0 6.0 8.0 16.0 16.7 16.0 16.5

Table 6.4: Comparison of results obtained using different c3, the threshold level indetermining the dimension relevance to an observation

c3 4% 8% 12% 16% 20% 24% 28% 32%

# of Subspace Clusters 2 5 5 5 5 5 5 5Clustering Accuracy 22.9% 93.4% 96.9% 97.8% 97.2% 92.5% 92.7% 91.7%

Avg. Cluster Dimensionality 6.0 6.0 6.0 6.0 5.0 3.4 3.2 3.2

hand side of Table 6.3; choosing smaller c2 still leads to the clustering algorithm be-

ing less accurate then choosing larger c2.

6.4.4 Sensitivity to choice of c3, the threshold level in determining the dimension relevance to an observation

The clustering results, shown in Table 6.4, suggest that setting c3 too small (or too large) can strongly influence the results. For this particular dataset, our algorithm appears relatively robust; however, a larger c3 implies that more dimensions will be considered irrelevant with respect to an observation, which in turn leads to a smaller average cluster dimensionality. That is, the subspace clusters will have been identified mostly correctly, but not all of the dimensions of the clusters will have been. In any case, the dimensions identified with respect to each subspace cluster were sufficient for obtaining good clustering accuracy.

6.4.5 Effect of data dimensionality d

Finally, we consider an additional scenario where the intrinsic dimensionality of the

subspace clusters is much lower than 24% of the total dimensionality. We test our al-

gorithm in this respect by adding irrelevant dimensions to our existing test data: the

intrinsic dimensionality of the subspace clusters is reduced to 12% when an additional

25 noise dimensions are added, and to 8% and 6%, respectively, when a total of 50

and 75 irrelevant dimensions are added. Many existing algorithms would find sce-

narios such as 6% challenging (Moise et al., 2008). While we found that the selection of

the parameter values became more critical when the intrinsic dimensionality of the

subspace clusters was smaller, we show that our approach can still identify the sub-

space clusters accurately (see Table 6.5). This contrasts significantly with situations

where the full dimensional GMM or the HDDC algorithm is applied (see Table 6.6).


Table 6.5: Comparison of results obtained for different d

  d                            25      50      75      100
  # of Subspace Clusters       5       5       5       5
  Clustering Accuracy          96.9%   96.6%   93.5%   92.2%
  Avg. Cluster Dimensionality  6.0     5.0     3.6     3.6

Table 6.6: Comparison of results obtained for different d for full dimensional GMM and HDDC

  Algorithm               Full Dimensional GMM             HDDC
  d                       25     50     75     100         25     50     75     100
  # of Subspace Clusters  4.50   1.00   1.00   1.00        7.40   3.40   1.00   failed
  Clustering Accuracy     84.8%  20.0%  20.0%  20.0%       92.0%  54.1%  20.0%  n/a

The results of using the full dimensional GMM were to be expected; this is simply

the outcome of ‘curse of dimensionality’ as discussed in the introduction. In this

particular example, it is unable to cluster the data when d ≥ 50; its recorded accuracy

of 20% simply reflects the fact that the largest component weight of the simulated

dataset is 20%.

On the other hand, HDDC is more robust than the full dimensional GMM. This is

not surprising since HDDC was designed to discover subspace clusters distributed

as Gaussian. However, the results in Table 6.6 indicate that, in this case, the effec-

tiveness of this model-based approach decreased quite sharply with increasing d in

comparison to our proposed method in which the cluster subspaces are, in a sense,

determined based on observations’ nearest neighbors.

Yet, importantly, we note that the number of clusters determined for HDDC by BIC appears to be problematic. When there are fewer irrelevant attributes in the dataset, i.e., d is smaller, BIC selects a higher number of clusters g than actually exist. This is perhaps understandable since BIC is based on providing a density approximation rather than recovering the number of clusters per se (Biernacki et al., 2000; Baudry et al., 2010). However, as the number of irrelevant attributes increases, i.e., d is larger, BIC appears to be ineffective in determining a suitable g for the dataset. That is, in contrast to the full dimensional GMM, for d = 50 and 75, HDDC can actually achieve a clustering accuracy somewhat similar to that for d = 25 when a different, larger g is chosen. This implies that the key issue for HDDC when clustering high dimensional data lies with how g should be chosen; BIC appears to have over-penalized in relation to the number of parameters in the model in the high dimensional space (c.f. Biernacki et al., 2000). Nonetheless, we note that a small matrix determinant estimation error caused the HDDC algorithm to terminate when d = 100.


6.5 Discussion

In this paper, we have shown that we can use mixture models for assisting in identi-

fying subspace clusters and our straightforward intuitive method appears to be use-

ful. The proposed approach for identifying nearest neighbors in the high dimen-

sional space appears effective. We have shown that good results can be obtained

without having to execute many VB iterations, and also that approximating each 2D

subspace in very fine detail may not be helpful in the identification of the subspace

clusters. Additionally, we showed that improved results may be obtained by select-

ing only those highly probable observations in the dense components, rather than all observations of the components, as was done in Kriegel et al. (2005). However, we

cannot generalize this finding at this point with respect to other existing algorithms

without further exploration. Finally, we have shown that our method can also iden-

tify subspace clusters with very low intrinsic dimensionality and it compares better

than the full dimensional GMM and the HDDC algorithm. Additionally, we have

observed that using the BIC can be problematic in determining the number of clus-

ters. More research is required, particularly on the scalability and the comparisons

to other existing algorithms, as well as on the ability to automatically select appro-

priate parameter values. Ideas from experimental design could also possibly have

application here for reducing the number of subspace combinations that have to be

considered in the VB approximations.

6.6 References

Aggarwal, C. C., Hinneburg, A., Keim, D. A., 2001. On the surprising behavior of dis-

tance metrics in high dimensional spaces. In: Van den Bussche, J., Vianu, V. (Eds.),

Proceedings of the 8th International Conference on Database Theory. Vol. 1973.

Springer, London, pp. 420–434.

Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clus-

tering of high dimensional data for data mining applications. In: Haas, L. M., Ti-

wary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference

on Management of Data. ACM, Seattle, WA, pp. 94–105.

Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-

ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-

ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,

Sweden, pp. 21–30.

Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19 (2), 332–353.

Bellman, R. E., 1961. Adaptive Control Processes: A Guided Tour, 5th Edition. Princeton University Press, Princeton, NJ.

Beyer, K. S., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is “nearest neigh-

bor” meaningful? In: Beeri, C., Buneman, P. (Eds.), Proceedings of the Seventh

International Conference on Database Theory. Vol. 1540. Springer, Jerusalem, Is-

rael, pp. 217–235.

Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and

Machine Intelligence 22 (7), 719–725.

Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster

and discriminant analysis with the MIXMOD software. Computational Statistics

& Data Analysis 51 (2), 587–600.

Bouveyron, C., Girard, S., Schmid, C., 2007. High-dimensional data clustering. Com-

putational Statistics & Data Analysis 52 (1), 502–519.

Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern

Recognition 28 (5), 781–793.

Chang, W.-C., 1983. On using principal components before separating a mixture

of two multivariate normal distributions. Journal of the Royal Statistical Society:

Series C (Applied Statistics) 32 (3), 267–275.

Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-

ture distributions. In: Proceedings of the Eighth International Conference on Arti-

ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.

Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete

data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statis-

tical Methodology) 39 (1), 1–38.

Ester, M., Kriegel, H.-p., Sander, J., Xu, X., 1996. A density-based algorithm for dis-

covering clusters in large spatial databases with noise. In: Simoudis, E., Han,

J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on

Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.

Fraley, C., Raftery, A. E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.


Friedman, J., Meulman, J., 2004. Clustering objects on subsets of attributes. Journal

of the Royal Statistical Society: Series B (Statistical Methodology) 66 (4), 1–25.

Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd

Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.

Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The Elements of Statistical Learning:

Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics.

Springer, New York.

Hinneburg, A., Keim, D. A., 1999. Optimal grid-clustering: towards breaking the

curse of dimensionality in high-dimensional clustering. In: Atkinson, M. P., Or-

lowska, M. E., Valduriez, P., Zdonik, S. B., Brodie, M. L. (Eds.), Proceedings of the

25th International Conference on Very Large Data Bases. Morgan Kaufmann, Ed-

inburgh, UK, pp. 506–517.

Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper

Saddle River, NJ.

Kriegel, H.-P., Kröger, P., Renz, M., Wurst, S. H. R., 2005. A generic framework for ef-

ficient subspace clustering of high-dimensional data. In: Proceedings of the Fifth

IEEE International Conference on Data Mining. IEEE, Houston, TX, pp. 250–257.

Kriegel, H.-P., Kröger, P., Zimek, A., 2009. Clustering high-dimensional data: A survey

on subspace clustering, pattern-based clustering, and correlation clustering. ACM

Transactions on Knowledge Discovery from Data 3 (1), 1–58.

Liu, G., Li, J., Sim, K., Wong, L., 2007. Distance based subspace clustering with flexi-

ble dimension partitioning. In: Proceedings of the 23rd International Conference

on Data Engineering. IEEE, Istanbul, Turkey, pp. 1250–1254.

Maugis, C., Celeux, G., Martin-Magniette, M.-L., 2009. Variable selection for cluster-

ing with Gaussian mixture models. Biometrics 65 (3), 701–709.

McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian

model selection for finite mixture distributions. Computational Statistics & Data

Analysis 51 (11), 5352–5367.

McLachlan, G. J., Peel, D., Bean, R. W., 2003. Modelling high-dimensional data by

mixtures of factor analyzers. Computational Statistics & Data Analysis 41 (3), 379–

388.

Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and

Information Systems 14 (3), 273–298.


Nagesh, H. S., Goil, S., Choudhary, A. N., 2000. Adaptive grids for clustering mas-

sive data sets. In: Proceedings of the 2000 International Conference on Parallel

Processing. IEEE, Toronto, ON, Canada, pp. 477–484.

Ng, E. K. K., Fu, A. W.-C., Wong, R. C.-W., 2005. Projective clustering by histograms.

IEEE Transactions on Knowledge and Data Engineering 17 (3), 369–383.

Parsons, L., Haque, E., Liu, H., 2004. Subspace clustering for high dimensional data:

a review. SIGKDD Explorations 6 (1), 90–105.

Raftery, A. E., Dean, N., 2006. Variable selection for model-based clustering. Journal

of the American Statistical Association 101 (473), 168–178.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics

6 (2), 461–464.

Scott, D. W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualiza-

tion. Wiley Series in Probability and Statistics. Wiley, New York.

Scrucca, L., 2010. Dimension reduction for model-based clustering. Statistics and

Computing 20, 471–484.

Tipping, M. E., Bishop, C. M., 1999. Mixtures of probabilistic principal component

analyzers. Neural Computation 11 (2), 443–482.

Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wiley, New York.

Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-

rithm for calculating variational Bayesian estimates for a normal mixture model.

Bayesian Analysis 1 (3), 625–650.

Woo, K.-G., Lee, J.-H., Kim, M.-H., Lee, Y.-J., 2004. FINDIT: a fast and intelligent

subspace clustering algorithm using dimension voting. Information and Software

Technology 46 (4), 255–271.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm

with application to human mobility pattern modeling. Statistics and Computing,

(in press).

http://dx.doi.org/10.1007/s11222-010-9217-9

Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-

ponent elimination, initialization & circular data. Submitted.


7 Conclusion

7.1 Discussion

In this thesis, we have motivated and demonstrated the value of analysing each cus-

tomer’s habitual consumption behaviour with the use of the variational Bayesian

(VB) method and Gaussian mixture models (GMMs). Before concluding with a summary of the contributions made here, we detail some future research directions that are expected to be active and useful from the viewpoints of both methodology and application. These are separated into the following two categories:

semi-parametric Bayesian methods & mixed membership models, and the spatial-

temporal/longitudinal extension.

7.1.1 Semi-parametric Bayesian methods & mixed membership models

Nonparametric Bayesian models provide a flexible approach to many more difficult problems; these models can be as complex as necessary (Blei and Jordan, 2004). Their key underlying assumption is that there is a set of random variables arising from an unknown probability distribution and that this underlying distribution itself is drawn from a prior distribution; the cornerstone prior choice is the Dirichlet process (DP), a nonparametric measure on measures (Ferguson, 1973). The DP is useful in that the measures drawn from it are discrete, which avoids problems attached to trying to fit the structure of a model (Blei and Jordan, 2004).

In relation to this research, DP mixture models (Antoniak, 1974; Ferguson, 1983), which are semi-parametric Bayesian methods, are particularly useful in that there is no need to prespecify the number of components k since it is 'unbounded' (Blei and Jordan, 2006; Heller and Ghahramani, 2007); the DP offers an alternative tactic for model selection (Escobar and West, 1995). The usefulness of DP mixture models to date is largely


due to the development of Markov chain Monte Carlo (MCMC) samplers (MacEachern and Muller, 1998; Neal, 2000). However, despite the fact that some recent studies have shown improvements in model quality by allowing the splitting/merging of components (e.g., Green and Richardson, 2001; Jain and Neal, 2004), by not making use of restricted conjugate prior settings (e.g., Jain and Neal, 2007), or even by obtaining models with hierarchically nested structure (Teh et al., 2006), MCMC-based DP models are not suitable for analysing massive, multivariate and highly correlated data (Blei and Jordan, 2004).

One interesting development in this area is the use of deterministic VB, the underlying method utilised in this research, which until recently has focused only on parametric models and has typically been used in the context of the exponential family of distributions (Ghahramani and Beal, 2001; Wainwright and Jordan, 2003). Early results of VB-based DP mixture models (Blei and Jordan, 2004, 2006) appear effective, scalable and thus promising. However, one of the key drawbacks of DP mixture models more generally is the so-called 'rich-get-richer' property (Wallach et al., 2010); that is, as the number of observations n → ∞, a small number of large clusters and a larger number of small clusters are expected to be found. Nonetheless, research in this direction seems promising and is certainly a valuable extension. However, the effectiveness of DP mixture models with respect to heterogeneous and spiky data such as those this research has been analysing requires further investigation. Note that there are other alternatives besides making use of VB in DP mixture models (Guha, 2010). Examples include Pennell and Dunson (2007), which adopts an empirical Bayes approach, and MacEachern et al. (1999), which makes use of sequential importance sampling (SIS). Other related approaches include the use of data squashing (DuMouchel et al., 1999; Madigan et al., 2002; Owen, 2003).

Finally, partitioning data into mutually exclusive groups, as we have done in Chapter 5 with respect to segmenting different subscriber spatial usage behaviours, while useful, is perhaps unnecessarily restrictive. That is, some observations/customers naturally belong to multiple groups. Consequently, it may be preferable to experiment with profiling customers' heterogeneous behaviour with mixed membership models (Blei et al., 2003; Heller and Ghahramani, 2007; Airoldi et al., 2008). This could be a useful extension; some model-free (projected) clustering algorithms for high dimensional data (c.f. § 2.3.4) can already produce overlapping clusters but, to the best of our knowledge, cannot provide degrees of membership likelihood, which can be critical to this application.


7.1.2 Spatial-temporal/longitudinal extension

Besides assuming that observations are independent and identically distributed (i.i.d.), one of the most critical assumptions made in this research is that the attributes, i.e., spatial and temporal behaviours, are independent of each other (Hsu et al., 2008). That is, behaviours more generally, not limited to just spatial usage behaviour, are expected to differ across different time periods, for example, weekdays, weeknights and weekends (c.f. Ghosh et al., 2006a). Being able to capture and interpret the complex interactions among different behaviours is certainly valuable from the viewpoint of this application, though this can be somewhat circumvented by first partitioning the data into different periods. Of course, there is still also the issue of seasonal/periodic behavioural variations that must be taken into consideration.

However, it is perhaps more valuable to extend the algorithm to analysing longitudinal/sequential data such that it is possible to track model changes, or at least to update the latest model, without having all the (historical) data available (c.f. Doucet et al., 2001; Babcock et al., 2002a). Currently, VB requires iteratively scanning through the entire dataset. One exception is Bruneau et al. (2008), in which the VB-GMM method has been reformulated such that models can be aggregated. There, the algorithm takes the parameter values of the existing models as input instead of the observations; the parameters of the models are 'summarised' via a modified VB algorithm making use of a virtual sampling technique (Vasconcelos and Lippman, 1998). This strategy can perhaps be generalised further.

In other words, from the viewpoint of this application, it could be preferable to have each subscriber's several years' worth of prior historical data, i.e., not just 17 months, 'compressed', say quarterly, into sets of model parameters; the historical model summaries could then be utilised for 'updating' the latest (possibly exponentially decayed) models. Such a tactic is clearly more computationally efficient than the current approach, which requires all of the raw data to be available; though the resulting ability to analyse individuals' gradual or sudden behavioural changes (c.f. Jacoby, 1978; Flint et al., 1997; Eriksson and Mattsson, 2002; Liu et al., 2005; Wang and Hong, 2006) or to differentiate/segment their behaviour longitudinally would perhaps be even more beneficial, since subscriber behaviour is known to be typically non-stationary (Wedel and Kamakura, 1998, Chapter 10).

The temporal aspect of behaviour is rarely examined. In other words, customer segmentations today are, at best, "redrawn as soon as they have lost their relevance" (Yankelovich and Meer, 2006, p.129). This is perhaps largely a result of the lack of suitable/established models (c.f. Wedel and Kamakura, 1998, Chapter 10), the


volatile nature of customer behaviour (Hunt, 1997; Arnould et al., 2004), and the difficulties in capturing all of the historical events/interventions (and forecasting, for that matter) and determining their causal implications. We believe the extension of the current VB-GMM method to the temporal setting may assist this greatly since VB is efficient in terms of computational storage requirements and speed. Note that behavioural changes of customers can also lead to an event in terms of consumption, or to interventions (e.g., churn and change of product/service) by the customer to address the change; the relationship between them is not necessarily causal as is usually assumed.

7.2 Summary of Contributions

The aim of this research has been to turn large volumes of dormant, seemingly valueless, but typically readily available 'cheap' data into detailed but useful customer intelligence, i.e., statistical data mining, for the wireless telecommunication industry. A unique aspect of the work presented here is the investigation into modelling, interpreting and differentiating, in an effective and efficient manner, patterns which are highly heterogeneous, both within and between patterns, and spiky (c.f. each wireless subscriber's spatial usage behaviour); the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. Developments for the recently popular variational Bayesian (VB) method, Gaussian mixture models (GMMs) and clustering techniques underpin the entire study.

Chapter 1 outlined the motivation for our research and, with selected illustrations, the potential merit, from the viewpoint of business applications, of analysing individuals' frequently overlooked habitual consumption behaviour. It detailed the broader research interest in understanding how customers typically utilise products/services temporally and spatially. It also motivated the necessity of using statistical modelling in this context. Appendix A provided more in-depth related detail.

Chapter 2 provided an up-to-date, thorough review and investigation of the nature of the problem and the data, as well as considering the advantages and drawbacks of various potentially useful current techniques. These methodological reviews discussed many very recent developments. Note that while our dataset can be considered a data stream, our research did not proceed in the direction of data stream mining. It is worth pointing out that the definition of a data stream was outlined for the first time in 1998. Research labs within companies such as AT&T, Bell, Google, IBM and Microsoft have since been actively researching suitable techniques in this still relatively unestablished field; this could be an exciting future research direction.

The careful methodological evaluation which led to the utilisation of VB and GMMs


for the study was also described. Appendices B and C gave more detailed information related to these issues.

Chapter 3 outlined and illustrated the effectiveness and robustness of adopting VB-GMM for modelling one-dimensional circular data, i.e., individuals' temporal usage behaviour with respect to hours of the day. Moreover, recall that one of the key advantages of fitting GMMs with VB is its ability to automatically determine the number of components needed to represent the data by effectively eliminating redundant components specified in the initial model. This chapter also examined and made note of a generally overlooked implication of this irreversible property of VB, as well as the implications of the initial component allocation strategies.

Chapter 4 was an important and significant chapter. A new VB algorithm, called split and eliminate variational Bayesian (SEVB), was developed; it is capable of exploring the parameter space more fully than the standard method for two-dimensional data. Unlike the standard algorithm, our SEVB can discover models with a higher number of components than proposed initially; this is achieved by allowing components to be split. Relying on the 'competing' and 'eliminating' nature of the mixture components within the VB framework, SEVB attempts to split all of, but only, the poorly fitted components at each split opportunity, i.e., after a stable model has been obtained. The adopted strategy is clearly more computationally efficient than existing alternatives that attempt to split all components one by one until the 'optimal' model is found. Several criteria were introduced to ensure the scalability of the algorithm.

Moreover, this new SEVB algorithm considered, for the first time, splitting components into two overlapping subcomponents with one having much larger variance than the other; this is in addition to the standard splitting of components into two side-by-side subcomponents. This newly introduced concept is motivated by our application data, i.e., individuals' mobility patterns, which have high probabilities at several of their own preferred locations. This is in contrast to the common perception that good clustering results or mixture models should have clusters or components that are isolated and clearly separated.

The adoption of this new component-overlapping philosophy, coupled with the fact that our data are somewhat discrete (c.f. the location of the user is recorded based on the location of the cell tower where the activity was initialised), implies that classical approaches to model evaluation or selection are no longer appropriate. The advantage of having Bayesian models is that it is appropriate to assess the results by measuring the goodness-of-fit. This chapter introduced a new adaptable goodness-of-fit measure, called mean absolute error adjusted for covariance (MAEAC), which aims to measure the average estimated errors of observations with the use of the Mahalanobis


distance (MD); its effectiveness over other existing measures was demonstrated em-

pirically, though not theoretically.

Overall, the modelling results for the real world data show that our new SEVB algorithm is more robust, practical and flexible for analysing large numbers of heterogeneous and spiky patterns. From the application perspective, to the best of our knowledge, this is the first piece of research that aims to model an individual's overall mobility pattern with a GMM. Some notable previous attempts have made use of density-based clustering algorithms such as DBSCAN. However, we demonstrated in this chapter that such a tactic is ineffective even for identifying only an individual's highly visited locations (which was their aim), whereas our SEVB-GMM can provide an interpretable and effective approximation to an individual's overall behavioural pattern.

Chapter 5 made full use of the SEVB-GMM algorithm developed in Chapter 4. It examined the effectiveness of the algorithm more fully, showing that it can accurately model very complicated bivariate patterns, i.e., individuals' spatial usage behaviour, with considerably lower storage requirements. SEVB's superior scalability over other common approaches, such as the expectation-maximisation (EM) algorithm, was also discussed; this is partly the result of SEVB being able to automatically determine the model complexity. This application chapter began by demonstrating, for those aiming to improve marketing abilities, the importance of distributional understanding. We also illustrated the fact that the widely used average-measures approach can be misleading, using the example of two individuals' outbound voice call duration patterns. This can be somewhat concerning (to marketers) with respect to existing pricing strategies (c.f. Danaher, 2002), and customer valuation (c.f. Blattberg and Deighton, 1996) or churn models; these are often based on uninformative measures such as averages and thus often do not address issues such as value at risk (VaR) (Chatfield, 1995).

The key contribution of this chapter, at least from the viewpoint of the application, was that it illustrated how the approximated models can be interpreted; without this step of translating the hidden meanings of the patterns, these new statistical models would be worthless to profit-oriented organisations. Several 'signatures', corresponding to the characteristics of users' spatial behaviour, were developed for profiling the patterns through extensive data exploration analyses; these signatures are then utilised for pattern differentiation, i.e., segmenting the behaviours of the subscribers. This chapter showed empirically that these signatures are meaningful and more stable than the currently commonly adopted alternative approach of ordered partitioning of subscribers based on their aggregated voice call durations and SMS counts. Spatial usage behaviour among customer groups, from which we might infer richer behavioural descriptions such as lifestyle or occupational traits that otherwise cannot be easily or cheaply obtained, turned out, as we expected, to be highly differentiable and therefore valuable for business strategy formulation.

Finally, in a similar way to Chapter 4, this is the first research, to the best of our knowledge, that can automatically profile each user's overall spatial usage behaviour meaningfully, and differentiate general users based on their actual observed mobility patterns. These alternative insights can assist businesses in better interacting with each individual (c.f. customer relationship management (CRM)), including handling and guiding customer shifts (Flint et al., 1997), and in more effective, informed strategic and tactical decision making (c.f. decision support systems (DSS)), such as customer behaviour segmentation based pricing (van Raaij et al., 2003; Jaihak and Rao, 2003), business management (c.f. business performance management (BPM)) and resource planning (c.f. enterprise resource planning (ERP)), for example.

Chapter 6 continued the discussion of Chapter 5 with respect to the need for a clustering algorithm suitable for high dimensional data; this is because the number of subscriber behavioural characteristics we are interested in and extract from the database is likely to increase over time. In this chapter, we designed a new high dimensional clustering algorithm and showed, for the first time, that it is suitable to make use of mixture models instead of the commonly adopted histogram/grid approaches for identifying subspace clusters in the high dimensional space; the concept of a subspace cluster is that some dimensions are considered as noise for some clusters. Due to the nature of the problem, i.e., the lack of data separation in the high dimensional space, we introduced several new concepts defining what is 'similar'. While clearly more research is still required on scalability, on comparisons to other existing algorithms, and on automatically selecting appropriate parameter values, empirical results suggested that our straightforward, intuitive method appears to be useful, even for identifying subspace clusters with very low intrinsic dimensionality.

Overall, from the application perspective, this research provides businesses with a more precise understanding of existing customers' habitual consumption behaviour, particularly how the products/services are utilised spatially; this is a competitive advantage in a very competitive industry (Wei and Chiu, 2002). In terms of contribution to statistical methodology, we have developed a new SEVB-GMM algorithm which can automatically determine the model complexity, can explore the parameter space more thoroughly, and is much more suitable for modelling heterogeneous and spiky patterns; the first meaningful approaches to profiling and differentiating each individual's overall mobility pattern are also presented.


A Review of Research Question

A.1 Telecommunication Industry Research

There is great demand for the use of data mining in the telecommunication industry with a view to improving business (Weiss, 2005), and many examples have already been provided in Berry and Linoff (2000, Chapters 11 and 12). However, most research in this industry appears to concentrate primarily on analysing demand and churn.

Demand analysis typically focuses on analysing aggregated product/service or

telecommunication traffic volume (Amaral et al., 1995; Levy, 1999; Heitfield and

Levy, 2001; Cox Jr., 2001; Fildes, 2002; Gilbert et al., 2003; Li et al., 2006; Marinucci

and Perez-Amaral, 2005); the objective often being fraud detection (Xing and Giro-

lami, 2007; Gibbons and Matias, 1998) or infrastructure resource management (Cox

Jr. and Popken, 2002).

On the other hand, churn analysis (also known as survival analysis) is perhaps the

most widely studied aspect of individual users' behaviour. Its popularity, particularly in the

wireless telecommunication industry (Mozer et al., 2000; Wei and Chiu, 2002; Cox

Jr., 2002; Ahn et al., 2006; Figini et al., 2006; Lemmens and Croux, 2006; Neslin et al.,

2006), can be explained by:

• the relatively high churn rate (Bolton, 1998; Neslin et al., 2006),

• high cost of acquisition,

• no direct customer contact, and

• the significant roles played by the device models (Berry and Linoff, 2000).

However, we believe that analyses of this kind, as done today, are often flawed. Firstly, they typically focus only on analysing customers' single product/service churn behaviour instead of taking a more holistic view. Secondly, besides the need to be cautious in dealing with truncated/censored observations, they are often based on the common flawed strategy of 'zero defections' (Reichheld and Sasser Jr, 1990) rather than focusing on strategies for optimizing retention values (Blattberg et al., 2000). Finally, having advanced models for predicting churn events utilizing the contract expiry date, despite being common practice, is not necessarily more meaningful or valuable than simple models based on the contract deadline.

A.2 Customer/Consumer Research

Customers are the most important asset of any business, but today they are more

educated, sophisticated, expectant, demanding and volatile than ever (Yankelovich

and Meer, 2006). This is the result of market competition (Reinartz and Kumar, 2000).

Companies today have typically acknowledged that not all customers are the same (Peppers et al., 1999; Hallberg, 1995) or equally profitable for the business (Cooper and Kaplan, 1991; Storbacka, 1997; Niraj et al., 2001); "being willing and able to change your behaviour toward an individual customer based on what the customer tells you and what else you know about the customer" (Peppers et al., 1999, p.151) is vital to business survival and success (Lloyd, 2005; Arnould et al., 2004). Consequently, organisations today are more willing to focus on customers (Kotler and Armstrong, 2009; Christopher et al., 1991, p.13) and customer relationships (Gummesson, 1999; Reichheld, 1996, p.24). Similarly, they are now also more willing to move away from the traditional perspective of the '4Ps' (i.e., product, price, promotion and place) (Borden, 1964; McCarthy, 1978) and the mass or transactional marketing attitude (Dwyer et al., 1987; Gummesson, 1994; Gronroos, 1994; Payne et al., 1998; Gummesson, 1999), which treats new customers as equal to long-term loyal/profitable customers.

That is, companies today have typically recognized the need to differentiate customers according to a detailed understanding of their current and future needs, wants, desires, behaviour, profitability and value to the business, so as to exchange (i.e., establish, develop, maintain and terminate) appropriate relationships with them (Morgan and Hunt, 1994; Blattberg and Deighton, 1996; Reichheld, 1996; Fournier et al., 1998; Peppers et al., 1999). The importance of customer understanding has prompted many organisations to invest billions of dollars in constructing customer relationship management (CRM) systems; this initiative is based on the belief and hope that holistic knowledge of the customers can be obtained for managing each individual customer more effectively, and consequently for delivering value to the business (Rigby et al., 2002; Stone et al., 2004, p.90).

The process of artificially grouping heterogeneous customers, or the focused market, based on similar characteristics, needs, preferences and exhibited behaviour for distinct marketing propositions (i.e., targeting and positioning) is known as market segmentation (Smith, 1956). The advantages of this conceptual tactic have already been well documented and accepted (McDonald and Dunbar, 2004; Weinstein, 2004; Wedel and Kamakura, 1998, pp.3-5). Essentially, homogeneous market segmentation (Blattberg and Deighton, 1996; Dhar and Glazer, 2003) has been viewed as the foundation for effective marketing planning and strategy formulation, and hence:

• provides businesses with competitive advantages and better returns as a result of being able to better serve customers (Egan, 2005, p.214); or

• satisfies their varying needs and wants with either existing or future products/services (McDonald and Dunbar, 2004; Weinstein, 2004; Wedel and Kamakura, 1998, p.3);

Additionally, market segmentation is also critical for integrated marketing communication (Duncan, 2005; Shimp, 2007). However, the fast changing market environment, along with advancements in information technology in recent years, has made it possible, and sometimes necessary, for marketers to interact with finer segments or even segments-of-one (i.e., individuals) (Wedel and Kamakura, 1998, p.4). In this respect, the subject of market segmentation links closely to the subject of:

• relationship marketing (also known as one-to-one marketing or customer rela-

tionship management) where there is a stronger emphasis on the value of the

customers (Blattberg and Deighton, 1996; Buttle, 2009, pp.127-136); and

• transactional-oriented database or direct marketing (O’Malley et al., 1999;

Evans et al., 2004; Egan, 2005, p.215).

Nevertheless, segment identification is still a major focus of today’s research (Wedel

and Kamakura, 1998, p.327), and is highly dependent on the bases (i.e., variables

or criteria) researched (Weinstein, 2004, p.19) and the methods employed to do so

(Wedel and Kamakura, 1998, p.5).

A better understanding of customer behaviour is essential for a successful business. However, customer/consumer research today appears to place limited emphasis on examining customers' actual behaviour (Jacoby, 1978; Yankelovich and Meer, 2006); most behaviour investigations to date are largely limited to:

• the purchasing aspect of observable product-specific (Wedel and Kamakura, 1998, p.10) behaviour (e.g., buying goods such as houses, vehicles, or plasma televisions) (Alderson, 1957; Jacoby, 1978); or

• loyalty (e.g., product/service attrition).

That is, customers’ habitual consumption behaviour (e.g., making phone calls, ac-

cessing the Internet and using water or electricity), which is different to the purchas-

ing behaviour and is more relevant to service than retail industries, is typically being

ignored today. This is in spite of the fact that most businesses have understood that

a “successful customer relationship requires a deep understanding of the context in

Page 206: With applications to profiling and differentiating habitual ... · With applications to profiling and differentiating habitual consumption behaviour of customers in the wireless

182 Appendix A. Review of Research Question

which our products and services are used in the course of our customer’s day-to-day

lives” (Fournier et al., 1998, p.49). Existing knowledge of individual’s habitual con-

sumption behaviour appears to be mostly limited to discrete (e.g., which services

customers use and the number of institutions they conduct business with), average

or aggregated measures (e.g., number of transactions per month) that are not neces-

sarily appropriate or meaningful for describing the observed pattern.

Below we briefly discuss selected subjects of customer/consumer research. We focus particularly on (1) the customer management system, (2) the heterogeneity aspect of customer behaviour, (3) today's typical focus of consumer behaviour research, and (4) segmentation. We conclude with our proposed research aim, that is, to focus on profiling and differentiating individuals' habitual consumption behaviour.

A.2.1 Customer management system

The ideology of relationship marketing, the economics of customer retention (c.f. zero defection) (Reichheld and Sasser Jr, 1990) and the possibilities afforded by technological developments (Sheth and Parvatiyar, 1995) have propelled the focus on customer relationships in mass markets, resulting in the emergence of CRM systems (Mitussis et al., 2006). Many ambitious businesses have already invested heavily in CRM, hoping that such a move will 'automatically' present them with improved customer insights for better business decision making, though they typically focus only on end customers (Mitussis et al., 2006) and their loyalty (Buttle, 2009). Unfortunately, despite the potential benefits of having CRM (Stone et al., 2004, p.98), more than half of the much-hyped CRM initiatives are believed to have generated unsatisfactory returns (i.e., not being an effective and profitable communications system with customers) (Gummesson, 1999; Rigby et al., 2002; Weinstein, 2004; Egan, 2005; Strouse, 2004; Mitussis et al., 2006; Buttle, 2009). This is often the result of inadequate cultural adjustment from being product focused to customer focused, or of placing too much emphasis on the (admittedly critical) information technology infrastructure.

That is, the actual implementation of many 'real world' CRM systems appears to amount to building a massive customer database and displaying standard reports or vague indicators (i.e., ambiguous and subjective; e.g., a market share indicator) (Doyle, 1995; Mitussis et al., 2006; Egan, 2005, pp.18,219) without clear strategies or sufficient value-adding analyses (Peppers et al., 1999; Rigby et al., 2002; Cokins, 2004; Rigby and Ledingham, 2004; Stone et al., 2004; Buttle, 2009, Chapter 1). In other words, 'real world' CRM systems have often focused more on the operational aspect than on analysis; despite common misperceptions, customer insights do not just 'pop up', as they require thorough investigation (Little and Marandi, 2003; Berry and Linoff, 2004; Egan, 2005, p.220). Furthermore, many currently available analyses can be difficult to adopt in practice without producing misleading or statistically biased findings, as they often need to assume that companies offer only a single product/service or hold the customer's entire business, or they rely on assumptions or information that may not be available (e.g., personal income, and the gender or age of the actual user rather than the account holder) (Reinartz and Kumar, 2000; Verhoef and Donkers, 2001). Nonetheless, it is encouraging to see that companies have now rightly shifted their focus to customers, with an emphasis on profitability (c.f. Activity Based Management (ABM)) rather than revenue contribution (Babad and Balachandran, 1993; Foster and Gupta, 1994; Morgan and Hunt, 1994; Cooper and Kaplan, 1998). This research aims to provide reliable customer insights, utilizing already-available behavioural data, that can be easily adopted by wireless telecommunication providers.

A.2.2 Customer behaviour heterogeneity

The principle that "all customers are not created equal" has already been established (Hallberg, 1995). The traditional 20/80 rule (also known as Pareto's Principle or Pareto's Law) reinforces this concept by suggesting that the best 20% of customers are responsible for 80% of revenue; yet in reality the contrast between customer profitability and behaviour is far greater than the rule suggests. In fact, Cooper and Kaplan (1991) found that, for one manufacturer, the profit figure for only 9% of customers was as high as 225%. Similarly, the study of Storbacka (1997) in the context of retail banking shows that half of the customers are not only unprofitable, but reduce the overall profit by 50%. Further, Niraj et al. (2001) have shown that the loss on some customers can be as high as 252% of the sales revenue.

Customer heterogeneity highlights strategic issues and naturally creates challenges when companies attempt to apply relationship marketing efficiently and effectively (Hunt, 1997; Eriksson and Mattsson, 2002). In fact, it is often necessary to passively ignore, or actively deter, customers exhibiting certain behaviour (Hunt, 1997; Gummesson, 1999, p.26). However, the single normal distribution, the lognormal distribution, and even highly aggregated measures such as the average, which carry no distributional assumptions at all, have been applied extensively, deliberately or not, and are generally not appropriate for describing customers and their behaviour (Schultz, 1995). On the other hand, extreme usage behaviour is commonly observed and is a critical insight that should never be treated as outliers (Hinneburg and Keim, 1999; Shaw et al., 2001; Lloyd, 2005). Consequently, this research investigates models that better deal with uncertainty, so as to incorporate the inherent stochastic nature of customers' behaviour (Niraj et al., 2001).


A.2.3 Consumer behaviour research

Cognitive & Affective   Consumer behaviour is a complex, multidimensional, dynamic process (Belk, 1987; Belk et al., 1988). Recently, (post-modernist) marketers (Thompson et al., 1989; Holbrook and Hirschman, 1982) have focused on:

• consumers’ information processing & decision processes (i.e., cognitive as-

pects), and

• their experiences (i.e., affective aspects)

in examining both:

• internal influences (e.g., attitude, knowledge, motivations, needs, opinions,

perceptions, personality, and involvement), and

• external influences (e.g., culture, lifestyle, marketing activities, reference

groups, social class, and values)

on consumers (Schiffman and Kanuk, 2004). The cognitive and affective (C&A) ap-

proach (Mitchell, 1983; Rokeach, 1973; Kahle, 1983; Veroff et al., 1981) has been used

extensively by practitioners to theoretically explain consumer behaviour (Peter and

Olson, 1983).

The C&A approaches are typically built on theories such as:

• Maslow’s theory (Maslow, 1954),

• Theory of reasoned action (TRA) (Ajzen and Fishbein, 1980),

• Theory of planned behaviour (TPB) (Ajzen, 1991), and

• Theory of value-attitude-behaviour hierarchy (Homer and Kahle, 1988).

However, these theories, while likely to be true, still have little empirical support (Anderson Jr. and Golden, 1984; Kahle et al., 1986; Schwartz and Bilsky, 1987; Yalch and Brunel, 1996; Holt, 1997; Ajzen, 2001; Schiffman and Kanuk, 2004; Rokeach, 1973, p.122). Still, C&A researchers (Jacoby, 1978; Anderson, 1983; Hirschman, 1986; Graham, 2005) have argued that empirical testing and prediction are unnecessary, pointing out that "facts do not speak for themselves" (Anderson, 1983, p.28). They, along with Peter and Olson (1983), believe that an 'infinite' number of objective tests, no matter how large the datasets, still cannot guarantee the truth. Moreover, they also believe that all observations are subject to errors, and that the choice of methodologies, data and findings is heavily influenced by the researchers. Consequently, C&A researchers believe this cognitively and socially significant problem of consumer behaviour should be solved by theoretically driven research.

However, it is important to point out that C&A approaches have often been conducted using a large number of variables for describing consumers' values and lifestyles, which may 'discover' variables discriminating consumers that just happen to be statistically significant simply by chance; no statistical test results related to the VALS (values, attitudes, and lifestyles) categories were reported in Mitchell (1983) or Kahle et al. (1986). Additionally, the currently popular C&A approaches (including segmentation based on C&A measures):

• may not be suitable for a fast changing (Egan, 2005, p.18) or innovative market

(Strouse, 2004, p.39), and

• require customers’ unobservable information. That is, they require customer

details that are not legally, directly, or dynamically available to the business

(e.g. income, education, culture, attitudes, perceptions, and satisfaction) (Ver-

hoef and Donkers, 2001). Note that many of these attributes (e.g. experience,

lifestyle) are subject to higher uncertainty because they are difficult to measure.

Furthermore, C&A approaches typically rely heavily on external market research, for which accuracy is a concern (Wolfers and Zitzewitz, 2004; Leigh and Wolfers, 2006; Arrow et al., 2008). The reliability of surveys has become increasingly challenging, partly because of the rapidly declining response rate (Bickart and Schmittlein, 1999; Curtin et al., 2005; Robert, 2006); Bickart and Schmittlein (1999) estimated that as few as 5% of adults in the US account for 50% of telephone survey interviews as a result of personal characteristics. There is also the issue that the surveyed public comes from a variety of backgrounds, meaning that they will interpret survey questions differently (Grunert and Scherlorn, 1990; Brennan and Hoek, 1992; Kahle et al., 1992), and there is a distinction between cognitive response and actual decision making (Claxton et al., 1974), for example.

Behavioural   Finally, while C&A approaches may assist companies in understanding subjective theoretical hypotheses about why consumers behave the way we observe, they are weak in predicting behaviour and hence are of little use in assisting businesses:

• to plan and manage the business (Hudson and Ozanne, 1988),

• to attain their ultimate goal, which is to better value the lifetime profitability of their customers, and

• to better control or change customer behaviour, and to better meet individual customer needs (Watson, 1913; Skinner, 1974) by implementing effective marketing strategies at the appropriate time (Wicker, 1969; Anderson Jr. and Golden, 1984; Kahle et al., 1986; Schwartz and Bilsky, 1987; Blattberg and Deighton, 1996; Peppers et al., 1999; Dhar and Glazer, 2003; Solomon, 2004; Kumar et al., 2006; Yankelovich and Meer, 2006).

Consequently, this research focuses instead on understanding each customer's actual behaviour, utilizing data already available. This approach (also known as modernism, positivism or behaviourism) has been shown to be useful for predicting customers' behaviour and their potential value to the business (Foster and Gupta, 1994), but is less emphasised today (Yankelovich and Meer, 2006).


A.2.4 Customer/market segmentation

Studies have shown that segmenting customers homogeneously can help companies achieve greater profitability in a quicker and more focused manner (Kotler, 1991; Blattberg and Deighton, 1996; Reichheld, 1996; Dhar and Glazer, 2003). However, homogeneous customer segmentation is not novel in itself (Lloyd, 2005). While researchers and practitioners often debate the best way of segmenting customers, they often fail to realize the need to differentiate customers differently for different purposes, and often misuse segmentation contrary to its original design (Yankelovich and Meer, 2006). Below we discuss some segmentation techniques commonly applied today.

Psychographic segmentation (Rokeach, 1973; Kahle, 1983; Veroff et al., 1981; Mitchell, 1983), based on the C&A framework discussed above, is now used extensively by marketers; examples include AIO (i.e., activities, interests, and opinions), which focuses on individuals' personality, and VALS2, which is heavily driven by psychology (Cahill, 2006, pp.15,25). In fact, many organizations today have been 'actively looking' for a 'speculative' C&A understanding of customers' behaviour (Gordon, 1998; Egan, 2005, pp.17-18). While this segmentation approach can undoubtedly be valuable for market positioning, new product concepts, advertising and distribution (Wind, 1978; Belk et al., 1988; Wedel and Kamakura, 1998, pp.15,32), many have questioned its explanatory power and effectiveness (Lastovicka, 1982; Lastovicka et al., 1990; Novak and MacEvoy, 1990; Gordon, 1998; Cahill, 2006; Egan, 2005; Wedel and Kamakura, 1998, p.13), particularly its capability for identifying the behavioural factors that will influence a particular brand (Ziff, 1971; Wells, 1975; Dickson, 1982). Moreover, despite its popularity, psychographic segmentation is believed to be "a mostly wasteful diversion from its original and true purpose - discovering customers whose behaviour can be changed or whose needs are not being met", and from informing the company about "which markets to enter or what kinds of offers to make, how products should be taken to market, and how they should be priced" (Yankelovich and Meer, 2006, p.126). That is, the outcome of psychographic segmentation still follows the tradition of the last century (e.g., the demographic segmentation approach (Yankelovich and Meer, 2006)), i.e., speaking to average consumers (Arnould et al., 2004, p.159).

Besides psychographic segmentation, customers today are often partitioned (McDonald and Dunbar, 2004; Duncan, 2005; Buttle, 2009, pp.154-157) based on:

• demographics or geographics. For example, PRIZM (potential rating index for zip markets) is built on the premise that where people live, and whom they live among, tells a lot about them (Cahill, 2006, p.19). Demographics, in particular, have been shown to be important for acquisition by financial service companies (Kamakura et al., 1991; Rust and Verhoef, 2005). However, this information may not be available for all products/services or industries/markets;

• interactive channel or intervention opportunities such as cross-selling and up-selling (DeSarbo and Ramaswamy, 1994; Cokins and King, 2004; Rust and Verhoef, 2005);

• lifecycle stages (Dwyer et al., 1987; Christopher et al., 1991; Payne et al., 1998; Gordon, 1998; Stone et al., 2004, p.98) or relationship status (e.g., satisfaction, loyalty, and referrals). This is on the premise that satisfied customers are more loyal, and loyal customers are more profitable and refer more (Baldinger and Rubinson, 1996; Reichheld, 1996; Dowling and Uncles, 1997; Knox, 1998; Reinartz and Kumar, 2000; Anderson and Mittal, 2000; Kumar et al., 2007);

• benefits sought from the products/services;

• current profit, revenue, or usage contributions (Shapiro et al., 1987; Bult and Wansbeek, 1995; Bitran and Mondschein, 1996; Zeithaml et al., 2001; van Raaij et al., 2003; Kotler and Armstrong, 2009). Note that more focus is now placed on profitability instead of revenue contribution (Foster and Gupta, 1994; Cooper and Kaplan, 1998); or

• lifetime/future/potential values (Niraj et al., 2001; Verhoef and Donkers, 2001). Note that these measures may be difficult to define (Zeithaml, 2000; Venkatesan et al., 2007), particularly with changing market definitions/conditions (e.g., a change of pricing structure), and can be misleading for organisations with multiple products/services (Verhoef and Donkers, 2001).

Additionally, it is important to point out that RFM (recency, frequency, and monetary value) segmentation (Berry and Linoff, 2004) is a popular tactic for segmenting customers based on their usage contribution. However, it has been shown to be inappropriate in its use of averages (Blattberg and Deighton, 1996), it often focuses only on revenue contribution rather than profitability (Dhar and Glazer, 2003), and it is often misinterpreted by practitioners who look at each measure independently (Stone et al., 2004, pp.40-45).
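For concreteness, the following is a minimal sketch (in Python) of how the three RFM measures are typically derived from raw transaction records; the data and column names are invented purely for illustration, and any subsequent scoring or binning step is omitted.

    import pandas as pd

    # Invented transaction log: one row per activity record.
    df = pd.DataFrame({
        "customer": ["a", "a", "b", "b", "b", "c"],
        "date": pd.to_datetime(["2010-01-05", "2010-03-01", "2010-02-10",
                                "2010-02-20", "2010-03-15", "2010-01-30"]),
        "amount": [12.0, 8.5, 30.0, 5.0, 7.5, 2.0],
    })

    now = df["date"].max()
    rfm = df.groupby("customer").agg(
        recency=("date", lambda d: (now - d.max()).days),  # days since last activity
        frequency=("date", "size"),                        # number of transactions
        monetary=("amount", "sum"),                        # total spend
    )
    print(rfm)

Note that this is exactly the kind of aggregated summary criticised above: it says nothing about how usage is distributed over time or space.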

Notice that, besides segmenting customers based on their average usage contribution, there appears to be little focus on how each user has utilised the product/service, particularly from the viewpoint of habitual consumption rather than purchasing behaviour. This research aims to address this shortfall.

A.3 Review Conclusion & Research Proposal

Customer Behavioural Segmentation

A detailed understanding of the customers, and the ability to predict their future behaviour as well as their potential value with good confidence, is vital to the business (Kumar et al., 2006), but cannot be achieved by simply grouping or segmenting the customers (Rust and Verhoef, 2005). On the other hand, analysing the little-studied behavioural data, the golden resource and asset conveniently available to all established businesses, has been shown to be potentially more important and effective in achieving this objective (Foster and Gupta, 1994; Wei and Chiu, 2002). Yet most companies today seem to have a limited understanding of their customers' actual behaviour (Fournier et al., 1998; Yankelovich and Meer, 2006). Furthermore, behavioural approaches have been found to be particularly useful in understanding customers' consumption behaviours that have been performed frequently and have become habitual, involving little intention (Schmittlein and Peterson, 1994; Verplanken et al., 1998; Ouellette and Wood, 1998; Leone et al., 1999; Ajzen and Fishbein, 2000; Ajzen, 2001). Still, most existing behaviour studies concentrate only on analysing purchasing behaviour, which has been shown to be very different from consumption behaviour (Alderson, 1957) and is more relevant to retail than to service industries.

Consequently, this research aims to fill the gap by analysing customers' actual habitual consumption behavioural data. This also means that, while psychographic measures can provide a richer description and understanding of consumers, and can be involved in the different stages of a consumer's lifecycle and decision making process (Belk et al., 1988), we believe these measures should be utilized to assist the company in better understanding the needs of customers or the reasoning behind their behaviour, rather than being 'actively looked' into as is commonly done in the industry today (Yankelovich and Meer, 2006; Stone et al., 2004, p.114).

Additionally, the benefit of taking a data-driven approach to understanding customer/consumer behaviour can also be demonstrated by the importance of existing customers. Existing customers have been shown to be more valuable than new ones, because it is less expensive to create extra value from them (e.g., through up-selling and cross-selling), and the cost of acquiring new customers to replace lost ones is high. This is why customer retention, especially of high value customers (Blattberg and Deighton, 1996), has been described by some as the most important role of relationship marketing and as critical to business profitability (Reichheld and Sasser Jr, 1990). Studies have suggested that companies able to retain just 5% more of their existing customers have been able to almost double company profits (Reichheld and Sasser Jr, 1990), as the profits generated by retained customers tend to accelerate over time due to price premiums, cost savings, and revenue growth (also known as growth in customer share of wallet (SOW)) (Reichheld, 1996, p.39).

Moreover, the importance of existing customers can also be demonstrated from the viewpoint of referrals, i.e., word of mouth (WOM) effects, which have been found to be highly valued by potential customers. That is, positive WOM will potentially improve the company's future profit; even more importantly, negative WOM could hurt the company's outlook, particularly when no well-defined impression has yet been formed by the potential customers (Arndt, 1967; Richins, 1983; Murray, 1991; Herr et al., 1991). Of course, the lack of information on potential new customers conversely makes the existing customers more valuable (Hwang et al., 2004). For industries such as wireless telecommunication, where customer attrition has been found to be a huge headache for the business, the value of existing customers is undoubtedly even greater (Wei and Chiu, 2002). Accordingly, our proposed research, i.e., to comprehend as well as differentiate the habitual consumption behaviour of existing customers, should be valuable to businesses.


Appendix B. Review of Data Stream Mining

B.1 Data Stream & Its Mining Challenges

A data stream (also known as stream data) is a massive or possibly 'infinite' volume of unordered sequential data (Henzinger et al., 1998; Babcock et al., 2002a; Gaber et al., 2005; Muthukrishnan, 2005; Aggarwal et al., 2007). It is often real-time data, or data generated by a continuous process, which grows rapidly at an 'unlimited' rate. Everyday examples of such data include telecommunication, banking, credit card, shopping, and financial market transactions. Examples also include Internet clickstream records (c.f. text or web mining), weather measurements, and sensor network, mobile traffic or security monitoring observations (Babu et al., 2001; Gama and Gaber, 2007). Data streams pose many great challenges for insightful analysis (Gaber et al., 2007) because of their often low-level, detailed construction (c.f. Cortes et al., 2000; Han and Kamber, 2006, p.468).

Large volumes of (streaming) data pose efficiency and scalability challenges (Han and Kamber, 2006). That is, storing an entire dataset on disk or in memory, and randomly accessing it for analysis, is generally not possible (Dong et al., 2003). 'Traditional' data mining techniques have focused on learning from data with bounded memory (Vitter, 2008), but generally require multiple scans of the data (Wang et al., 2003; Aggarwal et al., 2007). As a result, they are not suitable for the data stream environment, where data can generally only be looked at once (at least in the preprocessing step) and without advance knowledge such as the size of the data (Babcock et al., 2002a).

In other words, traditional data mining algorithms work on the assumption that the same data is being analysed throughout the entire process, whereas in data stream mining scenarios the data is continuously updated throughout the analytical process. Traditional approaches such as those constructing histograms (Piatetsky-Shapiro and Connell, 1984; Muralikrishna and DeWitt, 1988; Ioannidis and Poosala, 1995; Poosala et al., 1996; Poosala and Ioannidis, 1997; Jagadish et al., 1998) are not suitable for the data stream environment as they require superlinear time and space complexity (Vitter and Wang, 1999; Aggarwal and Yu, 2007). Similarly, the popular singular value decomposition (SVD) also requires multiple scans of the data (Littau and Boley, 2006b).

Another feature of a data stream is that it may evolve over time (Yang et al., 2005; Gao et al., 2007). This is known as data stream evolution (Aggarwal, 2003), changes in data streams (Dong et al., 2003), or concept drift (Wang et al., 2003). This non-stationarity issue is not unique to data streams, and is often addressed by real-time or incremental methods that continuously update the models when new data arrive (Wang et al., 2003). The typical approach to this non-stationarity issue is to utilise a time window or a data weighting scheme. Unfortunately, pattern changes may often be more critical or informative than pattern snapshots (Dong et al., 2003), and simply adopting existing single-pass (including real-time or incremental) mining algorithms may not be suitable (Aggarwal, 2003). Note that data stream models are very similar to real-time or incremental models in the sense that decisions need to be made before all the data are available; they differ, however, in what data is being accessed and in the timing of the required decision (Guha et al., 2003a).

In short, while many traditional algorithms have already been developed to address the efficiency and scalability challenges posed by large volumes of data (Han and Kamber, 2006), they are typically not suitable for analysis in data stream scenarios; the key additional challenges of data stream mining, when compared to traditional data mining (Han and Kamber, 2006), are: single-pass processing (at least for preprocessing), constraints on random data access, and concept drift (Aggarwal et al., 2007).

B.2 Synopsis Data Structure

Unlike in traditional data mining, approximate solutions are generally acceptable in data stream mining (Dong et al., 2003). Many traditional data reduction techniques, such as sampling, data weighting (e.g., exponential decay models (Gilbert et al., 2001; Cohen and Strauss, 2003)), and sliding (time) windows that focus only on part of the stream (Babcock et al., 2002b; Datar et al., 2002; Datar and Motwani, 2007), are, in one form or another, often adopted by techniques specially designed for data streams. So too is the histogram approach, which has been shown to be useful for approximating the distribution of the data (Silverman, 1986) as well as serving as a basic data analysis and visualisation tool in the data stream environment (Thaper et al., 2002). Load shedding techniques, which simply ignore chunks of data at the risk of bias, are also sometimes used for data stream mining (Babcock et al., 2007).
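As a concrete illustration of data weighting, below is a minimal sketch of an exponentially decayed counter, one simple building block in the spirit of the decay models cited above; the class name and the half-life parameterisation are our own choices for illustration.

    import math

    class DecayedCounter:
        """Exponentially decayed count: old observations fade with a chosen half-life."""

        def __init__(self, half_life):
            self.alpha = math.log(2.0) / half_life  # decay rate implied by the half-life
            self.value = 0.0
            self.last_t = None

        def update(self, t, weight=1.0):
            # Decay the running value for the elapsed time, then add the new item.
            if self.last_t is not None:
                self.value *= math.exp(-self.alpha * (t - self.last_t))
            self.value += weight
            self.last_t = t

    c = DecayedCounter(half_life=7.0)  # counts halve every 7 time units
    for t in [0, 1, 2, 10, 11]:
        c.update(t)
    print(round(c.value, 3))           # recent activity dominates the total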

An extensive number of recent studies have focused on efficiently constructing synopsis data structures (Gibbons and Matias, 1999) that can summarise data to an acceptable level of accuracy while being substantially smaller than their base datasets. Some recent studies have also focused not only on the time required for construction and the space required for storage, but also on the time required for updating and for (query) responding, and on the required working space (Matias and Urieli, 2005; Muthukrishnan, 2005; Cormode et al., 2006). However, synopses are often designed explicitly for specific applications, and it is as yet unknown how well the different methods compare with one another (Aggarwal and Yu, 2007).

Although many synopsis designs have been shown to have good accuracy, to be robust, and to be easy to maintain, they mostly take a management system (Arasu et al., 2003) or database (Babcock et al., 2002a) viewpoint. For example, some (additive) synopses can serve an important role in self-maintaining views in the database, or can (dynamically) provide approximate information even when the base data is unavailable or remote (Faloutsos et al., 1997; Thaper et al., 2002; Babcock et al., 2002a). Most of the focus has been on approximating (query) answers (Chakrabarti et al., 2001) and on selectivity (join) estimation (i.e., estimating the fraction of records satisfying a query) (Alon et al., 1999); typical applications include computing aggregates, evaluating differences between data streams, and identifying heavy hitters, item frequencies, and frequent itemsets (Babcock et al., 2002a; Muthukrishnan, 2005; Aggarwal, 2007a).
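As one small example of this family of summaries, the following is a minimal sketch of the classic Misra-Gries algorithm for approximate heavy hitters; the summary-size parameter k and the toy stream are invented for illustration.

    def misra_gries(stream, k):
        """Approximate heavy hitters in one pass using at most k - 1 counters."""
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # Decrement every counter; drop those that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters  # any item with true frequency > n/k is guaranteed to survive

    print(misra_gries("aababcabdaae", k=3))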

Random sampling (Acharya et al., 1999; Chaudhuri et al., 1999; Acharya et al., 2000) and histograms (Ioannidis and Poosala, 1999; Poosala and Ganti, 1999; Ioannidis, 2003; Muthukrishnan and Strauss, 2004) are two popular techniques that have been frequently utilised either as stand-alone synopses or embedded in other synopses (e.g., wavelet-based histograms (Matias et al., 1998; Vitter and Wang, 1999; Matias et al., 2000)). Histograms, in particular, have been shown to be useful for query optimization (Poosala and Ioannidis, 1997), approximate data warehouse queries (Acharya et al., 1999), and approximate answers for correlated aggregate queries over data streams (Gehrke et al., 2001; Dobra et al., 2002, 2004), for example. Note that an interesting alternative approach (DuMouchel et al., 1999) is to perform the analysis on generated pseudo-data, reproduced according to a series of statistical moments computed (with the use of sampling) from mutually exclusive groups of actual objects. Below we briefly discuss some key synopses with different problem focuses or challenges.


Input Data Size Is Unknown   As discussed previously, one of the key challenges associated with traditional data mining techniques is the need for prior knowledge of the data size, i.e., the number of observations. Reservoir sampling (Vitter, 1985), which maintains a random sample of fixed size, was the first algorithm to break this barrier within this research framework; this is in contrast to classical random sampling, which requires knowing the target data size in advance in order to calculate the sample size and hence obtain the sample. Vitter's (1985) technique has recently been improved (Gibbons and Matias, 1998; Chaudhuri et al., 1999; Babcock et al., 2002b; Aggarwal, 2006), and has been shown to be useful for constructing histograms (synopses) (Gibbons et al., 1997; Chaudhuri et al., 1998) in the data stream environment. Note that constructing histograms (Agrawal and Swami, 1995; Alsabti et al., 1997; Manku et al., 1998) in the data stream environment is closely related to estimating quantiles when the data is not completely available (Aggarwal and Yu, 2007).
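A minimal sketch of the basic reservoir scheme (Vitter's Algorithm R) follows; the toy stream is invented, and the later improvements cited above are not reflected here.

    import random

    def reservoir_sample(stream, k, rng=random):
        """Uniform fixed-size sample from a stream of unknown length (Algorithm R)."""
        sample = []
        for n, item in enumerate(stream):
            if n < k:
                sample.append(item)    # fill the reservoir first
            else:
                j = rng.randint(0, n)  # keep the new item with probability k/(n+1)
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(10000), k=5))

Each item ends up in the sample with equal probability k/N regardless of how large N turns out to be, which is exactly the property classical sampling cannot deliver without knowing N in advance.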

Recent advancements mean that approaches can now estimate quantiles (Manku et al., 1999; Greenwald and Khanna, 2001) and construct near-optimal histograms (synopses) (Guha et al., 2001, 2002; Guha and Koudas, 2002; Guha et al., 2006) in a (working-)space-efficient manner (Guha, 2005) and in a single-pass fashion, without the need for advance knowledge of the input data size.

More Accurate and Space Efficient Synopses   A large portion of the recent literature also focuses on improving data representation accuracy, and is often based on transformations (Lee et al., 1999). (Discrete) wavelet-based transformations, in particular, have frequently been used within synopsis designs (Chakrabarti et al., 2001; Gilbert et al., 2003). They have been shown to be better than their transformation alternatives (Barbara et al., 1997; Peng and Chu, 2004) (as well as sampling and the better class of histograms (Chakrabarti et al., 2001)) in being able to:

• represent data at multiple resolutions (Matias et al., 1998), and

• approximate sparse and/or skewed data (Barbara et al., 1997) with only the most significant wavelet coefficients (i.e., space efficient).

However, quality wavelets, other than those minimising Euclidean errors, cannot be easily obtained (Karras and Mamoulis, 2005; Guha, 2005; Guha and Harb, 2005). The selection of quality wavelets has been shown to depend on the query workloads (Matias and Urieli, 2005; Muthukrishnan, 2005), and any traced wavelet changes throughout the process can have complicated effects (Matias et al., 2000; Guha et al., 2004b).
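To make the idea of retaining only the most significant coefficients concrete, here is a minimal orthonormal Haar sketch; real wavelet synopses are considerably more sophisticated, and the signal and the choice k = 3 are invented.

    import numpy as np

    def haar_decompose(x):
        """Full orthonormal Haar transform of a length-2^m signal."""
        x = np.asarray(x, dtype=float)
        coeffs = []
        while len(x) > 1:
            coeffs.append((x[0::2] - x[1::2]) / np.sqrt(2.0))  # detail coefficients
            x = (x[0::2] + x[1::2]) / np.sqrt(2.0)             # running averages
        coeffs.append(x)                                       # overall average term
        return np.concatenate(coeffs)

    def haar_reconstruct(flat, n):
        """Invert haar_decompose (n must be a power of two)."""
        sizes, m = [], n
        while m > 1:
            sizes.append(m // 2)
            m //= 2
        sizes.append(1)
        levels, pos = [], 0
        for s in sizes:
            levels.append(flat[pos:pos + s])
            pos += s
        x = levels[-1]
        for diff in reversed(levels[:-1]):
            out = np.empty(2 * len(x))
            out[0::2] = (x + diff) / np.sqrt(2.0)
            out[1::2] = (x - diff) / np.sqrt(2.0)
            x = out
        return x

    # Keep only the k largest-magnitude coefficients: a tiny wavelet synopsis.
    x = np.array([2., 2., 2., 2., 10., 10., 2., 2.])
    flat = haar_decompose(x)
    flat[np.argsort(np.abs(flat))[:-3]] = 0.0            # zero all but the top k = 3
    print(np.round(haar_reconstruct(flat, len(x)), 2))   # recovers x exactly here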

Randomised projection techniques (many of which are based on the use of wavelets) (Gilbert et al., 2002, 2003) have been shown to be able to construct synopses that are even more space efficient than the wavelet approach. They typically work in polylogarithmic space with respect to the base data, because that is the minimum requirement for database indexing (Muthukrishnan, 2005). However, synopses based on randomised projection are generally much more difficult to interpret (than wavelets, for example). Note that an extensive amount of research has been done on these randomised projection techniques (Flajolet and Martin, 1983; Alon et al., 1996; Feigenbaum et al., 1999; Indyk, 2000) in recent years (Babcock et al., 2002a; Muthukrishnan, 2005; Aggarwal, 2007a).

Synopses with Guaranteed Bounds   Random sampling is generally considered easy, efficient, and widely applicable. However, many critics argue that it is:

• not suitable for evaluating infrequent patterns (Aggarwal and Yu, 2007; Gaber et al., 2007), and

• difficult to verify, in that one cannot easily determine whether a truly representative sample has been drawn from the dataset (Littau and Boley, 2006a).

Nonetheless, one of the key advantages of random sampling is its ability to provide unbiased data estimates with probabilistic error bounds (Haas, 1997). This is in contrast to most other synopses (Aggarwal and Yu, 2007), for which it is difficult to establish error bounds.

Consequently, some recent studies (Matias et al., 1998; Manku et al., 1999; Indyk et al., 2000; Gehrke et al., 2001; Garofalakis and Gibbons, 2002; Guha et al., 2004b; Guha and Harb, 2005) have focused on being able to provide approximation guarantees. Many of them now focus on minimising relative errors, or on minimising maximum absolute or maximum relative errors (Garofalakis and Kumar, 2004; Karras and Mamoulis, 2005); they do this instead of minimising the (inappropriate) absolute errors or the overall root mean squared error, which have been shown to result in poor quality data representations.

Multi-dimensional Extension   Apart from sampling (Barbara et al., 1997; Aggarwal and Yu, 2007), which retains the same dimensional representation as the original data, most other approximation techniques do not work well, or have only very limited success, with higher-dimensional data (i.e., more than four or five dimensions) (Vitter et al., 1998; Vitter and Wang, 1999; Gilbert et al., 2003). However, despite the fact that attributes are often wrongly assumed to be independent, such approaches (i.e., one-dimensional synopses) are still often adopted (in commercial software) (Poosala and Ioannidis, 1997; Matias et al., 2000). Some recent research has focused on, for example:

• constructing multi-dimensional histograms (Poosala and Ioannidis, 1997; Matias et al., 1998; Vitter et al., 1998; Vitter and Wang, 1999; Aboulnaga and Chaudhuri, 1999; Muthukrishnan et al., 1999; Gunopulos et al., 2000; Chakrabarti et al., 2001; Wu et al., 2001; Thaper et al., 2002),

• extended wavelets (Stollnitz et al., 1996; Deligiannakis and Roussopoulos,

2003; Guha et al., 2004a), and

• multi-dimensional synopses based on the randomised projection technique

(Cormode et al., 2006).

However, challenges remain in constructing optimal data summaries in a timely and

space efficient manner.

Temporal Extension   While some synopses (e.g., wavelet synopses and synopses based on the randomised projection technique) are believed to be easily extendable to a temporal representation of the data stream (with a much larger space requirement), such extensions are seldom used (Aggarwal and Yu, 2007). The exceptions include:

• an offline approach by Indyk et al. (2000) suitable for analysing universal trends, and

• the work of Thaper et al. (2002), which can be used for tracking or comparing the distribution of data streams temporally,

for example. That is, most existing synopsis constructions concentrate on efficiently providing one static or continuously updated summary data structure (Thaper et al., 2002). However, such continuously updated approaches still act like reapplying the traditional algorithms every time new data arrives (Thaper et al., 2002; Aggarwal et al., 2003). That is, while maintaining the latest summary information may resolve the data evolution issue, a proper understanding of evolving behaviour, or of seasonality or periodic insights, for example, cannot be achieved. As with the multi-dimensional extension, the temporal extension has many challenges ahead.

B.3 Review Conclusion

An extensive number of recent studies have focused on obtaining accurate and robust data summary structures via efficient single-pass processing within a limited-space framework (Gibbons and Matias, 1999). However, most of this research comes from research laboratories within companies such as AT&T, Bell, Google, IBM, and Microsoft (as well as from selected academics and universities around the world) (Muthukrishnan, 2005), with management systems or database applications as the primary focus. Additionally, these summaries are typically based on the concept of averages (Aggarwal et al., 2003), or are in a format from which distributional and/or seasonal/periodic information cannot be easily obtained (Littau and Boley, 2006b). Consequently, these approaches, despite their efficiency, do not appear to be appropriate for extracting customer behavioural characteristics, or for customer analytics in general; in these settings the data is likely to be sparse and/or skewed, or to comprise a mixture of distributions with seasonal/periodic patterns. This holds despite our research data being in the form of a data stream.


In this research, we adopt the variational Bayesian (VB) method to extract customer behaviour characteristics more formally and naturally. While VB is not a single-pass algorithm (as defined by data stream mining research), it can provide a more sophisticated statistical understanding which otherwise cannot be obtained from data stream models (c.f. Muthukrishnan, 2005). Alternatively, some recent studies (Zhou et al., 2003; Heinz and Seeger, 2006, 2007, 2008) have successfully used one-dimensional kernel density estimation efficiently in the stream environment; such nonparametric approaches could be useful to our research if our goal were simply to approximate various customers' behaviour. Note that, as in the case of data stream mining, more work is still needed to extend VB to multiple dimensions and to the temporal setting.
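For readers unfamiliar with VB in this setting, the following is a minimal illustration using scikit-learn's stock variational Bayesian mixture; it is emphatically not the SEVB-GMM algorithm developed in this thesis, and the data, priors and component cap are invented. Its key property, shared with our approach, is that superfluous mixture components receive near-zero posterior weight, so the effective number of components is determined automatically.

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    # Invented 2-D 'usage' data: three clusters of very different sizes and spreads.
    data = np.vstack([
        rng.normal([0.0, 0.0], 0.3, size=(500, 2)),
        rng.normal([5.0, 5.0], 0.2, size=(80, 2)),
        rng.normal([5.0, 0.0], 1.0, size=(200, 2)),
    ])

    # Variational Bayes with a generous cap on components; a small Dirichlet
    # concentration prior lets unnecessary components collapse towards zero weight.
    vbgmm = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior=0.01,
        covariance_type="full",
        max_iter=500,
        random_state=0,
    ).fit(data)

    print(np.round(vbgmm.weights_, 3))  # most of the 10 weights shrink towards zero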


Appendix C. Review of Clustering Time Series & Data Stream

C.1 Time Series Representation & Clustering

One of the logical choices for analysing customer behaviour longitudinally is to

group time series that are correlated, or have similar patterns or similar fitted models

(Lin et al., 2004; Ratanamahatana et al., 2005; Wang et al., 2006), for example. How-

ever, simply examining the similarity among subsequences (as is often done) instead

of the entire series can be misleading (Keogh et al., 2003).

Many data approximation techniques have already been studied for reducing data dimensionality, which is critical for time series clustering (Gavrilov et al., 2000; Keogh and Kasetty, 2003; Ding et al., 2008) and its closely related problems of similarity search and indexing (Agrawal et al., 1995; Yi and Faloutsos, 2000; Keogh et al., 2001; Chakrabarti et al., 2002). Common representations that have been proposed to date include:

• statistical models (Xiong and Yeung, 2004),

• spectral transformations (Agrawal et al., 1993; Faloutsos et al., 1994),

• dynamic time warping (DTW) (Berndt and Clifford, 1994) (see the sketch after these lists),

• wavelets (Chan and Fu, 1999),

• singular value decomposition (SVD),

• piecewise polynomial models (Yi and Faloutsos, 2000; Chakrabarti et al., 2002),

and

• symbolic models (Lin et al., 2003),

for example; whereas improvements have been made recently in:

• reducing local reconstruction error, and improving accuracy, efficiency, scalability and space usage (Keogh et al., 2001; Chakrabarti et al., 2002),

• better handling series that are out of phase, have missing values, or are of different lengths (Xiong and Yeung, 2004; Keogh and Ratanamahatana, 2005),

• better suiting the data stream environment (Palpanas et al., 2004; Yankov et al., 2007),

• extending to multi-dimensional series (Vlachos et al., 2005), and

• moving towards parameter-free data mining (Keogh et al., 2004),

for example.
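As promised above, a minimal sketch of the classic dynamic programming recurrence behind DTW follows; the quadratic-time version shown here omits the windowing and lower-bounding refinements used in practice, and the two toy series are invented.

    import numpy as np

    def dtw_distance(a, b):
        """Classic O(len(a) * len(b)) dynamic time warping distance."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                # best of insertion, deletion, or match along the warping path
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # Two series with the same shape but out of phase: warping absorbs the shift.
    x = [0, 0, 1, 2, 1, 0, 0]
    y = [0, 1, 2, 1, 0, 0, 0]
    print(dtw_distance(x, y))  # 0.0, whereas the Euclidean distance is not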

Symbolic representations (Lin et al., 2003, 2007), in particular, have recently been shown to be very promising for representing massive amounts of time series in clustering (Lin et al., 2004), indexing and mining (Shieh and Keogh, 2008), detecting unusual patterns (Keogh et al., 2005; Wei et al., 2006; Yankov et al., 2007), and visualisation (Kumar et al., 2005), for example. On the other hand, many statistical models, such as hidden Markov models (HMMs), Markov models (Ge and Smyth, 2000), and autoregressive moving average (ARMA) models (Kalpakis et al., 2001; Xiong and Yeung, 2002), have been shown to perform unfavourably in comparison (Keogh and Kasetty, 2003; Ratanamahatana et al., 2005; Wang et al., 2006). This is perhaps not unexpected, since these statistical models are often based on assumptions such as stationarity, normality and independent residuals, and they often do not take trends into consideration or cannot be fitted easily (Chatfield, 1995).

C.2 Clustering on Extracted Time Series Characteristics

However, ‘traditional’ time series clustering algorithms (i.e., those algorithms group-

ing time series that are correlated, or have similar patterns or fitted models) often

do not take the temporal aspect of the data into consideration. That is, they often

analyse series as sequences (i.e., irrespective of time) and thus are not necessarily

appropriate for all applications as they generally do not incorporate seasonality or

periodic insights (Ghosh and Strehl, 2004). Additionally, from the viewpoint of busi-

ness application, these clusters (or customer groups) are not meaningful without

identifying their (longitudinal) characteristics.

Alternatively, one can take a different approach to this problem; that is, to extract the (longitudinal) characteristics of each series (e.g., trend, seasonality, serial correlation, skewness, and signal-to-noise ratio (SNR) for expressing the fluctuations of the series) (Armstrong, 2001; Last et al., 2001; Nanopoulos et al., 2001) prior to performing any non-time-series clustering (Wang et al., 2006). This strategy (i.e., clustering on extracted time series characteristics) has been shown to be robust with good accuracy (Wang et al., 2006) and can incorporate the temporal aspect of the data more appropriately. We believe this approach is more suitable for our research.
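A minimal sketch of this extract-then-cluster strategy follows; the particular features, the two invented behavioural groups, and the use of k-means are illustrative choices only, not the profiling methods developed in Chapter 5.

    import numpy as np
    from scipy.stats import skew
    from sklearn.cluster import KMeans

    def series_features(y):
        """A few simple longitudinal characteristics of one series (illustrative)."""
        t = np.arange(len(y))
        coef = np.polyfit(t, y, 1)
        resid = y - np.polyval(coef, t)
        acf1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # lag-1 serial correlation
        return [coef[0], acf1, skew(y), np.std(y)]       # slope, acf, skewness, spread

    rng = np.random.default_rng(1)
    # Invented monthly usage series for 60 customers in two behavioural groups.
    rising = [10 + 0.5 * np.arange(24) + rng.normal(0, 1, 24) for _ in range(30)]
    flat = [10 + rng.normal(0, 3, 24) for _ in range(30)]
    X = np.array([series_features(y) for y in rising + flat])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # the groups separate on trend slope and noise level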

Unfortunately, in practice, many time series are not necessarily stationary over time, and may involve interventions (e.g., step functions) (Chatfield, 1995), making the extraction of series features challenging. While there have been (mostly one-dimensional) techniques proposed which aim to quantify changes (Ganti et al., 1999b, 2002), detect changes (Krishnamurthy et al., 2003; Zeira et al., 2004; Kifer et al., 2004; Schweller et al., 2004), and diagnose changes (Aggarwal, 2003; Dasu et al., 2006), even for the data stream environment, it is uncertain how these interventions should be incorporated into the clustering process for applications such as ours. Below we briefly discuss some recent literature closely related to time series clustering.

C.3 Data Stream Clustering

Recent studies (O'Callaghan et al., 2002; Guha et al., 2003b; Charikar et al., 2003; Babcock et al., 2003; Aggarwal et al., 2003) have shown that k-centroid clustering algorithms can be approximated efficiently in the data stream environment (i.e., where decisions need to be made before all the data is available and data can only be read once; c.f. Appendix B). However, while some algorithms (e.g., Aggarwal et al., 2003; Cao et al., 2006) based on a micro-macro clustering strategy have been shown to obtain better clusters, to provide information at multiple time granularities, and to provide a means for tracing objects or clusters temporally, they typically operate from the viewpoint of sequences rather than time series. In other words, as with traditional time series clustering algorithms, they typically focus on analysing data independently of time.

One rare exception among data stream clustering algorithms is HPStream (Aggarwal et al., 2004), which can identify subspace clusters in the high-dimensional data stream environment. Its subspace notion can improve on the common but problematic approach of clustering equal-length time series, which treats each time point as one dimension (c.f. Xiong and Yeung, 2004) and thus faces a serious curse-of-dimensionality issue. However, HPStream still focuses on analysing sequences.

Note that analysing sequences can still be very useful, and this has actually been

applied quite frequently. For example, to combat the issue of data evolution (i.e.,

pattern changes over time), many stream mining algorithms have been designed to:

• continuously update statistical information (e.g., correlation among multiple

series) (Yi et al., 2000; Guha et al., 2003a; Sakurai et al., 2005),

• continuously update models (Domingos and Hulten, 2000, 2001; Hulten et al.,

2001), or

• focus on monitoring the data stream (Ganti et al., 2001; Zhu and Shasha, 2002;

Wang et al., 2002b; Zhu and Shasha, 2003; Kleinberg, 2003; Papadimitriou et al.,

2007).


Similarly, most temporal and spatial-temporal algorithms and applications (e.g., in earth science, epidemiology, ecology, and climatology) (Last et al., 2001; Li et al., 2004; Aref et al., 2004; Han and Kamber, 2006; Huang et al., 2008; Hsu et al., 2008) typically focus only on:

• analysing or monitoring sequential patterns (Agrawal and Psaila, 1995). Note that scan statistics (Neill et al., 2005), for example, can be utilised for adjusting the compared patterns with respect to seasonality; or

• mining time-independent transaction association rules and frequent patterns (Agrawal et al., 1993; Agrawal and Srikant, 1994; Agrawal et al., 1995; Srikant and Agrawal, 1996; Han et al., 1999). However, many do this by first converting the data, unnaturally, into a sequence of events (Tan et al., 2001; Perlman and Java, 2003; Mamoulis et al., 2004).

Nonetheless, algorithms originating from the viewpoint of data stream mining, like traditional time series clustering algorithms, generally do not appear to be appropriate for our application.

C.4 Review Conclusion

In summary, we believe the most appropriate approach to analysing customer behaviour longitudinally, or spatially for that matter, is to first extract each customer’s overall behaviour characteristics and then cluster the customers based on the extracted features. Obviously, the extracted features need to be representative and stable. In Chapter 5, we investigate how to profile each customer’s spatial behaviour meaningfully, and evaluate the usefulness of segmenting customers based on these extracted characteristics. However, when the number of customer attributes to be considered is large (as is typically the case), it can be problematic to group customers with the typically applied classic clustering algorithms, such as k-means and hierarchical algorithms (cf. Wang et al., 2006); this is a consequence of the curse of dimensionality. In Chapter 6, we investigate high-dimensional data clustering.
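As a schematic illustration of this extract-then-cluster approach, the hypothetical Python sketch below summarises each customer’s raw usage records into a few overall features and then groups the customers on those features. The feature choices are arbitrary, the circularity of the hour-of-day variable is ignored, and k-means stands in for the clustering step only for brevity; the thesis itself develops VB-fitted mixture models for this purpose.

import numpy as np
from sklearn.cluster import KMeans  # placeholder for the clustering step

# Hypothetical usage records: (customer_id, hour_of_day, duration_seconds).
records = [(1, 9, 120), (1, 10, 300), (1, 11, 240),
           (2, 22, 60), (2, 23, 45), (2, 21, 80),
           (3, 9, 200), (3, 14, 500), (3, 10, 180)]

# Step 1: extract each customer's overall behaviour characteristics.
by_customer = {}
for cid, hour, dur in records:
    by_customer.setdefault(cid, []).append((hour, dur))
features = np.array([
    [np.mean([h for h, _ in recs]),             # typical activity hour
     np.log1p(np.mean([d for _, d in recs])),   # log mean usage quantity
     len(recs)]                                 # activity volume
    for recs in by_customer.values()])

# Step 2: cluster the customers on the extracted, stable features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(dict(zip(by_customer.keys(), labels)))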


Bibliography

Aboulnaga, A., Chaudhuri, S., 1999. Self-tuning histograms: building histograms

without looking at data. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.),

Proceedings of the 1999 ACM SIGMOD International Conference on Management

of Data. ACM, Philadelphia, PA, pp. 181–192.

Acharya, S., Gibbons, P. B., Poosala, V., 2000. Congressional samples for approximate

answering of group-by queries. In: Chen, W., Naughton, J., Bernstein, P. (Eds.),

Proceedings of the 2000 ACM SIGMOD International Conference on Management

of Data. ACM, Dallas, TX, pp. 487–498.

Acharya, S., Gibbons, P. B., Poosala, V., Ramaswamy, S., 1999. Join synopses for

approximate query answering. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S.

(Eds.), Proceedings of the 1999 ACM SIGMOD International Conference on Man-

agement of Data. ACM, Philadelphia, PA, pp. 275–286.

Achtert, E., Bohm, C., David, J., Kroger, P., Zimek, A., 2008. Robust clustering in ar-

bitrarily oriented subspaces. In: Proceedings of the 2008 SIAM International Con-

ference on Data Mining. SIAM, Atlanta, GA, pp. 763–774.

Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Muller-Gorman, I., Zimek, A., 2007a.

Detection and visualization of subspace cluster hierarchies. In: Ramamohanarao,

K., Krishna, P. R., Mohania, M. K., Nantajeewarawat, E. (Eds.), Proceedings of the

12th International Conference on Database Systems for Advanced Applications.

Springer, Bangkok, Thailand, pp. 152–163.

Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Zimek, A., 2007b. On exploring com-

plex relationships of correlation clusters. In: Proceedings of the 19th International

Conference on Scientific and Statistical Database Management. IEEE, Banff, AB,

Canada, pp. 7–16.


Achtert, E., Bohm, C., Kriegel, H.-P., Kroger, P., Zimek, A., 2007c. Robust, complete,

and efficient correlation clustering. In: Proceedings of the 2007 SIAM International

Conference on Data Mining. SIAM, Minneapolis, MN, pp. 413–418.

Achtert, E., Bohm, C., Kroger, P., Zimek, A., 2006. Mining hierarchies of correlation

clusters. In: Proceedings of the 18th International Conference on Scientific and

Statistical Database Management. IEEE, Vienna, Austria, pp. 119–128.

Agarwal, D., McGregor, A., Phillips, J. M., Venkatasubramanian, S., Zhu, Z., 2006.

Spatial scan statistics: approximations and performance study. In: Eliassi-Rad,

T., Ungar, L. H., Craven, M., Gunopulos, D. (Eds.), Proceedings of the Twelfth

ACM SIGKDD International Conference on Knowledge Discovery and Data Min-

ing. ACM, Philadelphia, PA, pp. 24–33.

Agarwal, P. K., Mustafa, N. H., 2004. k-means projective clustering. In: Deutsch,

A. (Ed.), Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Sympo-

sium on Principles of Database Systems. ACM, Paris, France, pp. 155–165.

Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D. J., Belongie, S., 2005.

Beyond pairwise clustering. In: Proceedings of the 2005 IEEE Computer Society

Conference on Computer Vision and Pattern Recognition. Vol. 2. IEEE, San Diego,

CA, pp. 838–845.

Aggarwal, C. C., 2003. A framework for diagnosing changes in evolving data streams.

In: Halevy, A. Y., Ives, Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD

International Conference on Management of Data. ACM, San Diego, CA, pp. 575–

586.

Aggarwal, C. C., 2006. On biased reservoir sampling in the presence of stream evolu-

tion. In: Dayal, U., Whang, K.-Y., Lomet, D. B., Alonso, G., Lohman, G. M., Kersten,

M. L., Cha, S. K., Kim, Y.-K. (Eds.), Proceedings of the 32nd International Confer-

ence on Very Large Data Bases. ACM, Seoul, Korea, pp. 607–618.

Aggarwal, C. C., 2007a. Data Streams: Models and Algorithms. Advances in Database

Systems. Springer, New York.

Aggarwal, C. C., 2007b. An introduction to data streams. In: Aggarwal, C. C. (Ed.),

Data Streams: Models and Algorithms. Advances in Database Systems. Springer,

New York.

Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2003. A framework for clustering evolv-

ing data streams. In: Freytag, J. C., Lockemann, P. C., Abiteboul, S., Carey, M. J.,

Selinger, P. G., Heuer, A. (Eds.), Proceedings of the 29th International Conference

on Very Large Data Bases. Morgan Kaufmann, Berlin, Germany, pp. 81–92.

Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2004. A framework for projected clustering

of high dimensional data streams. In: Nascimento, M. A., Ozsu, M. T., Kossmann,


D., Miller, R. J., Blakeley, J. A., Schiefer, K. B. (Eds.), Proceedings of the Thirtieth

International Conference on Very Large Data Bases. Morgan Kaufmann, Toronto,

ON, Canada, pp. 852–863.

Aggarwal, C. C., Han, J., Wang, J., Yu, P. S., 2007. On clustering massive data streams:

a summarization paradigm. In: Aggarwal, C. C. (Ed.), Data Streams: Models and

Algorithms. Advances in Database Systems. Springer, New York.

Aggarwal, C. C., Hinneburg, A., Keim, D. A., 2001. On the surprising behavior of dis-

tance metrics in high dimensional spaces. In: Van den Bussche, J., Vianu, V. (Eds.),

Proceedings of the 8th International Conference on Database Theory. Vol. 1973.

Springer, London, pp. 420–434.

Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., Park, J. S., 1999. Fast algorithms

for projected clustering. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.),

Proceedings of the 1999 ACM SIGMOD International Conference on Management

of Data. ACM, Philadelphia, PA, pp. 61–72.

Aggarwal, C. C., Yu, P. S., 2000. Finding generalized projected clusters in high dimen-

sional spaces. In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceedings of

the 2000 ACM SIGMOD International Conference on Management of Data. ACM,

Dallas, TX, pp. 70–81.

Aggarwal, C. C., Yu, P. S., 2001. Outlier detection for high dimensional data. In: Aref,

W. G. (Ed.), Proceedings of the 2001 ACM SIGMOD International Conference on

Management of Data. ACM, Santa Barbara, CA, pp. 37–46.

Aggarwal, C. C., Yu, P. S., 2007. A survey of synopsis construction in data streams. In:

Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in Database

Systems. Springer, New York.

Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence

databases. In: Lomet, D. B. (Ed.), Proceedings of the 4th International Conference

of Foundations of Data Organization and Algorithms. Springer, Chicago, IL, pp.

69–84.

Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clus-

tering of high dimensional data for data mining applications. In: Haas, L. M., Ti-

wary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference

on Management of Data. ACM, Seattle, WA, pp. 94–105.

Agrawal, R., Lin, K.-I., Sawhney, H. S., Shim, K., 1995. Fast similarity search in the

presence of noise, scaling, and translation in time-series databases. In: Dayal, U.,

Gray, P. M. D., Nishio, S. (Eds.), Proceedings of the 21st International Conference

on Very Large Data Bases. Morgan Kaufmann, Zurich, Switzerland, pp. 490–501.


Agrawal, R., Psaila, G., 1995. Active data mining. In: Fayyad, U. M., Uthurusamy, R.

(Eds.), Proceedings of the First International Conference on Knowledge Discovery

and Data Mining. AAAI, Montreal, QC, Canada, pp. 3–8.

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Bocca,

J. B., Jarke, M., Zaniolo, C. (Eds.), Proceedings of the 20th International Conference

on Very Large Data Bases. Morgan Kaufmann, Santiago de Chile, Chile, pp. 487–

499.

Agrawal, R., Swami, A. N., 1995. A one-pass space-efficient algorithm for finding

quantiles. In: Chaudhuri, S., Deshpande, A., Krishnamurthy, R. (Eds.), Proceed-

ings of the Seventh International Conference on Management of Data. McGraw-

Hill, Pune, India.

Ahn, J.-H., Han, S.-P., Lee, Y.-S., 2006. Customer churn analysis: churn determi-

nants and mediation effects of partial defection in the Korean mobile telecommu-

nications service industry. Telecommunications Policy 30 (10-11), 552–568.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., Xing, E. P., 2008. Mixed membership

stochastic blockmodels. The Journal of Machine Learning Research 9 (Sep), 1981–

2014.

Aitkin, M., Rubin, D. B., 1985. Estimation and hypothesis testing in finite mixture

models. Journal of the Royal Statistical Society: Series B (Statistical Methodology)

47, 67–75.

Aitkin, M., Wilson, G. T., 1980. Mixture models, outliers, and the EM algorithm. Tech-

nometrics 22 (3), 325–331.

Ajala, I., 26 Nov 2005. GIS and GSM network quality monitoring: A Nigerian case

study.

URL http://www.directionsmag.com/articles/

Ajala, I., 07 Mar 2006. Spatial analysis of GSM subscriber call data records.

URL http://www.directionsmag.com/articles/

Ajzen, I., 1991. The theory of planned behavior. Organizational Behavior and Human

Decision Processes 50 (2), 179–211.

Ajzen, I., 2001. Nature and operation of attitudes. Annual Review of Psychology

52 (1), 27–58.

Ajzen, I., Fishbein, M., 1980. Understanding Attitudes and Predicting Social Behav-

ior. Prentice-Hall, Englewood-Cliffs, NJ.

Ajzen, I., Fishbein, M., 2000. Attitudes and the attitude-behavior relation: rea-

soned and automatic processes. European Review of Social Psychology 11, 1–33.


Akaike, H., 1974. A new look at the statistical model identification. IEEE Transactions

on Automatic Control 19 (6), 716–723.

Alderson, W., 1957. Marketing Behavior and Executive Action: A Functionalist Ap-

proach to Marketing Theory. Richard D. Irwin, Homewood, IL.

Alon, N., Gibbons, P. B., Matias, Y., Szegedy, M., 1999. Tracking join and self-join

sizes in limited storage. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-

SIGART Symposium on Principles of Database Systems. ACM, Philadelphia, PA,

pp. 10–20.

Alon, N., Matias, Y., Szegedy, M., 1996. The space complexity of approximating the

frequency moments. In: Proceedings of the 28th Annual ACM Symposium on The-

ory of Computing. ACM, Philadelphia, PA, pp. 20–29.

Alsabti, K., Ranka, S., Singh, V., 1997. A one-pass algorithm for accurately estimat-

ing quantiles for disk-resident data. In: Jarke, M., Carey, M. J., Dittrich, K. R.,

Lochovsky, F. H., Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd

International Conference on Very Large Data Bases. Morgan Kaufmann, Athens,

Greece, pp. 346–355.

Amaral, T. P., Gonzalez, F. A., Jimenez, B. M., 1995. Business telephone traffic demand

in Spain: 1980–1991, an econometric approach. Information Economics and Policy

7 (2), 115–134.

Anderson, E. W., Mittal, V., 2000. Strengthening the satisfaction-profit chain. Journal

of Service Research 3 (2), 107–120.

Anderson, P. F., 1983. Marketing, scientific progress, and scientific method. Journal

of Marketing 47 (4), 18–31.

Anderson Jr., W. T., Golden, L. L., 1984. Lifestyle and psychographics: a critical review

and recommendation. Advances in Consumer Research 11 (1), 405–411.

Andrieu, C., de Freitas, N., Doucet, A., Jordan, M. I., 2003. An introduction to MCMC

for machine learning. Machine Learning 50 (1-2), 5–43.

Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J., 1999. OPTICS: ordering points to

identify the clustering structure. ACM SIGMOD Record 28 (2), 49–60.

Antoniak, C., 1974. Mixtures of Dirichlet processes with applications to Bayesian

nonparametric problems. The Annals of Statistics 2 (6), 1152–1174.

Arabie, P., Hubert, L. J., 1996. An overview of combinatorial data analysis. In: Arabie,

P., Hubert, L. J., De Soete, G. (Eds.), Clustering and Classification.

World Scientific, River Edge, NJ, pp. 5–63.


Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J.,

Widom, J., 2003. STREAM: Stanford stream data manager. In: Halevy, A. Y., Ives, Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM, San Diego, CA, p. 665.

Archambeau, C., Verleysen, M., 2007. Robust Bayesian clustering. Neural Networks

20 (1), 129–138.

Aref, W. G., Elfeky, M. G., Elmagarmid, A. K., 2004. Incremental, online, and merge

mining of partial periodic patterns in time-series databases. IEEE Transactions on

Knowledge and Data Engineering 16 (3), 332–342.

Armstrong, J. S., 2001. Principles of Forecasting: A Handbook for Researchers and

Practitioners. International Series in Operations Research & Management Science.

Kluwer Academic, Boston, MA.

Arndt, J., 1967. Role of product-related conversations in the diffusion of a new prod-

uct. Journal of Marketing Research 4 (3), 291–295.

Arnould, E. J., Price, L., Zinkhan, G. M., 2004. Consumers, 2nd Edition. McGraw-

Hill/Irwin Series in Marketing. McGraw-Hill/Irwin, Boston, MA.

Arrow, K. J., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J. O., Levmore,

S., Litan, R., Milgrom, P., Nelson, F. D., Neumann, G. R., Ottaviani, M., Schelling,

T. C., Shiller, R. J., Smith, V. L., Snowberg, E., Sunstein, C. R., Tetlock, P. C., Tet-

lock, P. E., Varian, H. R., Wolfers, J., Zitzewitz, E., 2008. The promise of prediction

markets. Science 320 (5878), 877–878.

Assent, I., Krieger, R., Muller, E., Seidl, T., 2007a. DUSC: dimensionality unbiased

subspace clustering. In: Proceedings of the 7th IEEE International Conference on

Data Mining. IEEE, Omaha, NE, pp. 409–414.

Assent, I., Krieger, R., Muller, E., Seidl, T., 2007b. VISA: visual subspace clustering

analysis. SIGKDD Explorations 9 (2), 5–12.

Attias, H., 1999. Inferring parameters and structure of latent variable models by vari-

ational Bayes. In: Laskey, K. B., Prade, H. (Eds.), Proceedings of the Fifteenth Con-

ference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, Stockholm,

Sweden, pp. 21–30.

Azzalini, A., 1996. Statistical Inference: Based on the Likelihood. Monographs on

Statistics and Applied Probability. Chapman & Hall, London.

Babad, Y. M., Balachandran, B. V., 1993. Cost driver optimization in activity-based

costing. The Accounting Review 68 (3), 563–575.


Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002a. Models and issues

in data stream systems. In: Popa, L. (Ed.), Proceedings of the Twenty-first ACM

SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM,

Madison, WI, pp. 1–16.

Babcock, B., Datar, M., Motwani, R., 2002b. Sampling from a moving window over

streaming data. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium

on Discrete Algorithms. ACM, San Francisco, CA, pp. 633–634.

Babcock, B., Datar, M., Motwani, R., 2007. Load shedding in data stream systems. In:

Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in Database

Systems. Springer, New York.

Babcock, B., Datar, M., Motwani, R., O’Callaghan, L., 2003. Maintaining variance and

k-medians over data stream windows. In: Proceedings of the Twenty-Second ACM

SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM,

San Diego, CA, pp. 234–243.

Babu, S., Subramanian, L., Widom, J., 2001. A data stream management system for

network traffic management. In: Workshop on Network-Related Data Manage-

ment. Santa Barbara, CA.

Balakrishnan, S., Madigan, D., 2006. A one-pass sequential Monte Carlo method for

Bayesian analysis of massive datasets. Bayesian Analysis 1 (2), 345–362.

Balazinska, M., Castro, P., 2003. Characterizing mobility and network usage in a cor-

porate wireless local-area network. In: Proceedings of the First International Con-

ference on Mobile Systems, Applications, and Services. USENIX, San Francisco,

CA, pp. 303–316.

Baldinger, A. L., Rubinson, J., 1996. Brand loyalty: the link between attitude and be-

havior. Journal of Advertising Research 36 (6), 22–34.

Ball, G. H., Hall, D. J., 1965. ISODATA, a novel method of data analysis and pattern

classification. Tech. rep., Stanford Research Institute, Menlo Park, CA.

Banfield, J. D., Raftery, A. E., 1993. Model-based Gaussian and non-Gaussian cluster-

ing. Biometrics 49 (3), 803–821.

Barbara, D., Chen, P., 2000. Using the fractal dimension to cluster datasets. In: Pro-

ceedings of the Sixth ACM SIGKDD International Conference on Knowledge Dis-

covery and Data Mining. ACM, Boston, MA, pp. 260–264.

Barbara, D., Dumouchel, W., Faloutsos, C., Haas, P., Hellerstein, J., Ioannidis, Y., Ja-

gadish, H. V., Johnson, T., Ng, R., Poosala, V., Ross, K., Sevcik, K., 1997. The New Jersey data reduction report. IEEE Data Engineering Bulletin 20 (4), 3–45.


Batty, M., 2003. Agent-based pedestrian modelling. In: Longley, P. A., Batty, M. (Eds.),

Advanced Spatial Analysis: The CASA Book of GIS. ESRI Press, Redlands, CA.

Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., Gottardo, R., 2010. Combining mix-

ture components for clustering. Journal of Computational and Graphical Statistics

19 (2), 332–353.

Bayes, T., 1763. An essay towards solving a problem in the doctrine of chances. Philo-

sophical Transactions of the Royal Society of London 53, 370–418; 54, 296–325.

Beal, M. J., Ghahramani, Z., 2002. The variational Bayesian EM algorithm for incom-

plete data: with application to scoring graphical model structures. In: Bernardo,

J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., West,

M. (Eds.), Proceedings of the Seventh Valencia International Meeting. Oxford Uni-

versity, Tenerife, Spain, pp. 453–464.

Beal, M. J., Ghahramani, Z., 2006. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis 1 (4), 793–832.

Belk, R. W., 1987. ACR presidential address: happy thought. Advances in Consumer

Research 14 (1), 1–4.

Belk, R. W., Sherry Jr., J. F., Wallendorf, M., 1988. A naturalistic inquiry into buyer and

seller behavior at a swap meet. Journal of Consumer Research 14 (4), 449–470.

Bellman, R. E., 1961. Adaptive Control Processes: A Guided Tour, 5th Edition. Prince-

ton University, Princeton, NJ.

Berchtold, S., Bohm, C., Keim, D. A., Kriegel, H.-P., 1997. A cost model for nearest

neighbor search in high-dimensional data space. In: Proceedings of the Sixteenth

ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.

ACM, Tucson, AZ, pp. 78–86.

Berchtold, S., Bohm, C., Kriegel, H.-P., 1998. The pyramid-technique: towards breaking the curse of dimensionality. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM, Seattle, WA, pp. 142–153.

Berchtold, S., Keim, D. A., Kriegel, H.-P., 1996. The X-tree: an index structure for

high-dimensional data. In: Vijayaraman, T. M., Buchmann, A. P., Mohan, C., Sarda,

N. L. (Eds.), Proceedings of the 22nd International Conference on Very Large Data

Bases. Morgan Kaufmann, Mumbai, India, pp. 28–39.

Berkhin, P., 2006. A survey of clustering data mining techniques. In: Kogan, J.,

Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data: Recent Ad-

vances in Clustering. Springer, New York.


Berndt, D. J., Clifford, J., 1994. Using dynamic time warping to find patterns in time

series. In: Fayyad, U. M., Uthurusamy, R. (Eds.), AAAI Workshop on Knowledge

Discovery in Databases. AAAI, Seattle, WA, pp. 359–370.

Berry, M. J. A., Linoff, G., 2000. Mastering Data Mining: The Art and Science of Cus-

tomer Relationship Management. Wiley, New York.

Berry, M. J. A., Linoff, G., 2004. Data Mining Techniques: for Marketing, Sales, and

Customer Relationship Management, 2nd Edition. Wiley, Indianapolis, IN.

Besag, J., Green, P., Higdon, D., Mengersen, K., 1995. Bayesian computation and

stochastic systems. Statistical Science 10 (1), 3–41.

Beyer, K. S., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is “nearest neigh-

bor” meaningful? In: Beeri, C., Buneman, P. (Eds.), Proceedings of the Seventh In-

ternational Conference on Database Theory. Vol. 1540. Springer, Jerusalem, Israel,

pp. 217–235.

Bezdek, J. C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms.

Advanced Applications in Pattern Recognition. Springer, New York.

Bhaduri, K., Das, K., Sivakumar, K., Kargupta, H., Wolff, R., 2007. Algorithms for dis-

tributed data stream mining. In: Aggarwal, C. C. (Ed.), Data Streams: Models and

Algorithms. Advances in Database Systems. Springer, New York.

Bhattacharya, C. B., 1998. When customers are members: Customer retention in

paid membership contexts. Journal of the Academy of Marketing Science 26 (1),

31–44.

Bickart, B., Schmittlein, D., 1999. The distribution of survey contact and participa-

tion in the United States: constructing a survey-based estimate. Journal of Mar-

keting Research 36 (2), 286–294.

Biernacki, C., Celeux, G., Govaert, G., 2000. Assessing mixture model for clustering

with integrated completed likelihood. IEEE Transactions on Pattern Analysis and

Machine Intelligence 22 (7), 719–725.

Biernacki, C., Celeux, G., Govaert, G., 2003. Choosing starting values for the EM algo-

rithm for getting the highest likelihood in multivariate Gaussian mixture models.

Computational Statistics & Data Analysis 41 (3-4), 561–575.

Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster

and discriminant analysis with the MIXMOD software. Computational Statistics

& Data Analysis 51 (2), 587–600.

Binder, D. A., 1978. Bayesian cluster analysis. Biometrika 65 (1), 31–38.


Birant, D., Kut, A., 2007. ST-DBSCAN: an algorithm for clustering spatial-temporal

data. Data & Knowledge Engineering 60 (1), 208–221.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Information Sci-

ence and Statistics. Springer, New York.

Bitran, G. R., Mondschein, S. V., 1996. Mailing decisions in the catalog sales industry.

Management Science 42 (9), 1364–1381.

Blattberg, R. C., Deighton, J., 1996. Manage marketing by the customer equity test.

Harvard Business Review July-August, 136–144.

Blattberg, R. C., Getz, G., Thomas, J. S., 2000. Customer Equity: Building and Manag-

ing Relationships as Valuable Assets. Harvard Business School, Boston, MA.

Blei, D. M., Jordan, M. I., 2004. Variational methods for the Dirichlet process. In:

Brodley, C. E. (Ed.), Proceedings of the Twenty-first International Conference Ma-

chine Learning. ACM, Banff, AB, Canada.

Blei, D. M., Jordan, M. I., 2006. Variational inference for Dirichlet process mixtures.

Bayesian Analysis 1 (1), 121–144.

Blei, D. M., Ng, A. Y., Jordan, M. I., 2003. Latent Dirichlet allocation. Journal of Ma-

chine Learning Research 3 (Jan), 993–1022.

Bohm, C., Berchtold, S., Keim, D. A., 2001. Searching in high-dimensional spaces:

index structures for improving the performance of multimedia databases. ACM

Computing Surveys 33 (3), 322–373.

Bohm, C., Braunmuller, B., Breunig, M. M., Kriegel, H.-P., 2000. High performance

clustering based on the similarity join. In: Proceedings of the 2000 ACM CIKM

International Conference on Information and Knowledge Management. ACM,

McLean, VA, pp. 298–305.

Bohm, C., Kailing, K., Kriegel, H.-P., Kroger, P., 2004a. Density connected cluster-

ing with local subspace preferences. In: Proceedings of the 4th IEEE International

Conference on Data Mining. IEEE, Brighton, UK, pp. 27–34.

Bohm, C., Kailing, K., Kroger, P., Zimek, A., 2004b. Computing clusters of correlation

connected objects. In: Weikum, G., Konig, A. C., Deßloch, S. (Eds.), Proceedings of

the 2004 ACM SIGMOD International Conference on Management of Data. ACM,

Paris, France, pp. 455–466.

Bolton, R. N., 1998. A dynamic model of the duration of the customer’s relation-

ship with a continuous service provider: the role of satisfaction. Marketing Science

17 (1), 45–65.


Borden, N. H., 1964. The concept of the marketing mix. Journal of Advertising Re-

search 4 (June), 2–7.

Bouveyron, C., Girard, S., Schmid, C., 2007. High-dimensional data clustering. Com-

putational Statistics & Data Analysis 52 (1), 502–519.

Boyles, R. A., 1983. On the convergence of the EM algorithm. Journal of the Royal

Statistical Society: Series B (Statistical Methodology) 45 (1), 47–50.

Bradley, P., Fayyad, U. M., Reina, C., 1998. Scaling clustering algorithms to large

databases. In: Agrawal, R., Stolorz, P. E., Piatetsky-Shapiro, G. (Eds.), Proceedings

of the Fourth International Conference on Knowledge Discovery and Data Mining.

AAAI, New York, pp. 9–15.

Bradley, P. S., Reina, C., Fayyad, U. M., 2000. Clustering very large databases using EM

mixture models. In: Proceedings of the 15th International Conference on Pattern

Recognition. IEEE, Barcelona, Spain, pp. 2076–2080.

Braun, M., McAuliffe, J., 2010. Variational inference for large-scale models of discrete

choice. Journal of the American Statistical Association 105 (489), 324–335.

Brennan, M., Hoek, J., 1992. The behavior of respondents, nonrespondents, and re-

fusers across mail surveys. Public Opinion Quarterly 56 (4), 530–535.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J., 2000. LOF: identifying density-

based local outliers. In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceed-

ings of the 2000 ACM SIGMOD International Conference on Management of Data.

ACM, Dallas, TX, pp. 93–104.

Brockmann, D., Hufnagel, L., Geisel, T., 2006. The scaling laws of human travel. Na-

ture 439 (7075), 462–465.

Bruneau, P., Gelgon, M., Picarougne, F., 2008. Parameter-based reduction of Gaus-

sian mixture models with a variational-Bayes approach. In: Proceedings of the

19th International Conference on Pattern Recognition. IEEE, Tampa, FL, pp. 1–4.

Buckinx, W., Van den Poel, D., 2005. Customer base analysis: partial defection of be-

haviourally loyal clients in a non-contractual FMCG retail setting. European Jour-

nal of Operational Research 164 (1), 252–268.

Bult, J. R., Wansbeek, T., 1995. Optimal selection for direct mail. Marketing Science

14 (4), 378–394.

Burbeck, K., Nadjm-Tehrani, S., 2005. ADWICE - anomaly detection with real-time

incremental clustering. In: Park, C., Chee, S. (Eds.), Proceedings of the Seventh

International Conference on the Theory and Application of Cryptology and Infor-

mation Security. Springer, Seoul, Korea, pp. 407–424.


Buttle, F., 2009. Customer Relationship Management: Concepts and Technologies,

2nd Edition. Butterworth-Heinemann, London.

Cadez, I. V., Smyth, P., Ip, E., Mannila, H., 2001. Predictive profiles for transaction data

using finite mixture models. Tech. Rep. UCI-ICS 01-67, Department of Information

& Computer Science, University of California, Irvine, CA.

Cahill, D. J., 2006. Lifestyle Market Segmentation. Haworth Series in Segmented, Tar-

geted, and Customized Marketing. Haworth, New York.

Calinski, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Communica-

tions in Statistics - Theory and Methods 3 (1), 1–27.

Camp, T., Boleng, J., Davies, V., 2002. A survey of mobility models for ad hoc network

research. Wireless Communications and Mobile Computing 2 (5), 483–502.

Campbell, N. A., 1980. Robust procedures in multivariate analysis I: robust covari-

ance estimation. Journal of the Royal Statistical Society: Series C (Applied Statis-

tics) 29 (3), 231–237.

Cao, F., Ester, M., Qian, W., Zhou, A., 2006. Density-based clustering over an evolving

data stream with noise. In: Ghosh, J., Lambert, D., Skillicorn, D. B., Srivastava, J.

(Eds.), Proceedings of the Sixth SIAM International Conference on Data Mining.

SIAM, Bethesda, MD.

Cattell, R. B., 1966. The scree test for the number of factors. Multivariate Behavioral

Research 1 (2), 245–276.

Celeux, G., Chauveau, D., Diebolt, J., 1996. Stochastic versions of the EM algorithm:

an experimental study in the mixture case. Journal of Statistical Computation and

Simulation 55 (4), 287–314.

Celeux, G., Diebolt, J., 1985. The SEM algorithm: a probabilistic teacher algorithm

derived from EM algorithm for the mixture problem. Computational Statistics

Quarterly 2, 73–82.

Celeux, G., Forbes, F., Robert, C., Titterington, D., 2006. Deviance information criteria

for missing data models. Bayesian Analysis 1 (4), 651–674.

Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two

stochastic versions. Computational Statistics & Data Analysis 14, 315–332.

Celeux, G., Govaert, G., 1995. Gaussian parsimonious clustering models. Pattern

Recognition 28 (5), 781–793.

Celeux, G., Hurn, M., Robert, C. P., 2000. Computational and inferential difficulties

with mixture posterior distributions. Journal of the American Statistical Associa-

tion 95 (451), 957–970.


Chakrabarti, K., Garofalakis, M. N., Rastogi, R., Shim, K., 2001. Approximate query

processing using wavelets. The VLDB Journal 10 (2-3), 199–223.

Chakrabarti, K., Keogh, E. J., Mehrotra, S., Pazzani, M. J., 2002. Locally adaptive di-

mensionality reduction for indexing large time series databases. ACM Transac-

tions on Database Systems 27 (2), 188–228.

Chakrabarti, K., Mehrotra, S., 1999. The hybrid tree: an index structure for high di-

mensional feature spaces. In: Proceedings of the 15th International Conference on

Data Engineering. IEEE, Sydney, Australia, pp. 440–447.

Chan, K.-P., Fu, A. W.-C., 1999. Efficient time series matching by wavelets. In: Pro-

ceedings of the 15th International Conference on Data Engineering. IEEE, Sydney,

Australia, pp. 126–133.

Chang, J.-W., Jin, D.-S., 2002. A new cell-based clustering method for large, high-

dimensional data in data mining applications. In: Proceedings of the 2002 ACM

Symposium on Applied Computing. ACM, Madrid, Spain, pp. 503–507.

Chang, W.-C., 1983. On using principal components before separating a mixture of

two multivariate normal distributions. Journal of the Royal Statistical Society: Se-

ries C (Applied Statistics) 32 (3), 267–275.

Chapelle, O., Scholkopf, B., Zien, A., 2010. Semi-Supervised Learning. Adaptive

Computation and Machine Learning. MIT, London.

Charikar, M., O’Callaghan, L., Panigrahy, R., 2003. Better streaming algorithms for

clustering problems. In: Proceedings of the 35th Annual ACM Symposium on The-

ory of Computing. ACM, San Diego, CA, pp. 30–39.

Chatfield, C., 1995. Model uncertainty, data mining and statistical inference. Journal

of the Royal Statistical Society: Series A (Statistics in Society) 158 (3), 419–466.

Chaudhuri, S., Dayal, U., Ganti, V., 2001. Database technology for decision support

systems. Computer 34 (12), 48–55.

Chaudhuri, S., Motwani, R., Narasayya, V., 1998. Random sampling for histogram

construction: how much is enough? In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM, Seattle, WA, pp. 436–447.

Chaudhuri, S., Motwani, R., Narasayya, V. R., 1999. On random sampling over joins.

In: Delis, A., Faloutsos, C., Ghandeharizadeh, S. (Eds.), Proceedings of the 1999

ACM SIGMOD International Conference on Management of Data. ACM, Philadel-

phia, PA, pp. 263–274.

Cheeseman, P., Stutz, J., 1996. Bayesian classification (AutoClass): theory and re-

sults. In: Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.),

Advances in Knowledge Discovery and Data Mining. AAAI.


Cheng, C. H., Fu, A. W.-C., Zhang, Y., 1999. Entropy-based subspace clustering for

mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp.

84–93.

Cheng, H., Hua, K. A., Vu, K., 2008. Constrained locally weighted clustering. Proceed-

ings of the VLDB Endowment 1 (1), 90–101.

Cheng, Y., Church, G. M., 2000. Biclustering of expression data. In: Bourne, P. E.,

Gribskov, M., Altman, R. B., Jensen, N., Hope, D. A., Lengauer, T., Mitchell, J. C.,

Scheeff, E. D., Smith, C., Strande, S., Weissig, H. (Eds.), Proceedings of the Eighth

International Conference Intelligent Systems for Molecular Biology. AAAI, La Jolla,

CA, pp. 93–103.

Cherkassky, V., Mulier, F. M., 2007. Learning from Data: Concepts, Theory, and Meth-

ods, 2nd Edition. Wiley, New York.

Cheung, Y.-M., 2005. Maximum weighted likelihood via rival penalized EM for den-

sity mixture clustering with automatic model selection. IEEE Transactions on

Knowledge and Data Engineering 17 (6), 750–761.

Cho, Y. H., Kim, J. K., 2004. Application of web usage mining and product taxonomy

to collaborative recommendations in e-commerce. Expert Systems with Applica-

tions 26 (2), 233–246.

Cho, Y. H., Kim, J. K., Kim, S. H., 2002. A personalized recommender system based on

web usage mining and decision tree induction. Expert Systems with Applications

23 (3), 329–342.

Chong, C.-C., Guvenc, I., Watanabe, F., Inamura, H., 2009. Ranging and localization

by UWB radio for indoor LBS. NTT DOCOMO Technical Journal 11 (1), 41–48.

Chopin, N., 2002. A sequential particle filter method for static models. Biometrika

89 (3), 539–552.

Christopher, M., Payne, A., Ballantyne, D., 1991. Relationship Marketing: Bring-

ing Quality, Customer Service and Marketing Together. The Marketing Series.

Butterworth-Heinemann, Boston, MA.

Claritas Inc., 2008. PRIZM NE®.

Claxton, J. D., Fry, J. N., Portis, B., 1974. A taxonomy of prepurchase information

gathering patterns. Journal of Consumer Research 1 (3), 35–42.

Cohen, E., Strauss, M., 2003. Maintaining time-decaying stream aggregates. In: Pro-

ceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on

Principles of Database Systems. ACM, San Diego, CA, pp. 223–233.


Cokins, G., 2004. Performance Management: Finding the Missing Pieces (to Close

the Intelligence Gap). Wiley and SAS Business Series. Wiley, Hoboken, NJ.

Cokins, G., King, K., 2004. Managing customer profitability and economic value in

the telecommunications industry: a holistic look at the individual level to build

corporate profitability one customer at a time. Tech. rep., SAS Institute Inc., Cary,

NC.

Cole, A. J., Wishart, D., 1970. An improved algorithm for the Jardine-Sibson method

of generating overlapping clusters. Computer Journal 13 (2), 156–163.

Constantinopoulos, C., Likas, A., 2007. Unsupervised learning of Gaussian mixtures

based on variational component splitting. IEEE Transactions on Neural Networks

18 (3), 745–755.

Cooper, R., Kaplan, R. S., 1991. Profit priorities from activity-based costing. Harvard

Business Review May-June, 130–135.

Cooper, R., Kaplan, R. S., 1998. The promise - and peril - of integrated cost systems.

Harvard Business Review July-August, 109–119.

Corduneanu, A., Bishop, C. M., 2001. Variational Bayesian model selection for mix-

ture distributions. In: Proceedings of the Eighth International Conference on Arti-

ficial Intelligence and Statistics. Morgan Kaufmann, Key West, FL, pp. 27–34.

Cormode, G., Garofalakis, M. N., Sacharidis, D., 2006. Fast approximate wavelet

tracking on streams. In: Ioannidis, Y. E., Scholl, M. H., Schmidt, J. W., Matthes, F.,

Hatzopoulos, M., Bohm, K., Kemper, A., Grust, T., Bohm, C. (Eds.), Proceedings of

the Tenth International Conference on Extending Database Technology. Springer,

Munich, Germany, pp. 4–22.

Cortes, C., Fisher, K., Pregibon, D., Rogers, A., 2000. Hancock: a language for extract-

ing signatures from data streams. In: Proceedings of the Sixth ACM SIGKDD In-

ternational Conference on Knowledge Discovery and Data Mining. ACM, Boston,

MA, pp. 9–17.

Cortes, C., Pregibon, D., 1998. Giga mining. In: Agrawal, R., Stolorz, P. E., Piatetsky-

Shapiro, G. (Eds.), Proceedings of the Fourth International Conference on Knowl-

edge Discovery and Data Mining. AAAI, New York, pp. 174–178.

Cortes, C., Pregibon, D., 1999. Information mining platform: an infrastructure for

KDD rapid deployment. In: Proceedings of the Fifth ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp.

327–331.

Cox Jr., L. A., 2001. Forecasting demand for telecommunications products from

cross-sectional data. Telecommunication Systems 16 (3), 437–454.


Cox Jr., L. A., 2002. Data mining and causal modeling of customer behaviors.

Telecommunication Systems 21 (2-4), 349–381.

Cox Jr., L. A., Popken, D. A., 2002. A hybrid system-identification method for forecast-

ing telecommunications product demands. International Journal of Forecasting

18 (4), 647–671.

Curtin, R., Presser, S., Singer, E., 2005. Changes in telephone survey nonresponse

over the past quarter century. Public Opinion Quarterly 69 (1), 87–98.

Danaher, P. J., 2002. Optimal pricing of new subscription services: analysis of a mar-

ket experiment. Marketing Science 21 (2), 119–138.

Dasu, T., Johnson, T., 2003. Exploratory Data Mining and Data Cleaning. Wiley Series

in Probability and Statistics. Wiley, Hoboken, NJ.

Dasu, T., Krishnan, S., Venkatasubramanian, S., Yi, K., 2006. An information-

theoretic approach to detecting changes in multi-dimensional data streams. In:

Proceedings of the 38th Symposium on the Interface of Statistics, Computing Sci-

ence, and Applications. Pasadena, CA.

Datar, M., Gionis, A., Indyk, P., Motwani, R., 2002. Maintaining stream statistics over

sliding window. SIAM Journal on Computing 31 (6), 1794–1813.

Datar, M., Motwani, R., 2007. The sliding-window computation model and results.

In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Advances in

Database Systems. Springer, New York.

Deligiannakis, A., Roussopoulos, N., 2003. Extended wavelets for multiple measures.

In: Halevy, A. Y., Ives, Z. G., Doan, A. (Eds.), Proceedings of the 2003 ACM SIGMOD

International Conference on Management of Data. ACM, San Diego, CA, pp. 229–

240.

Dempster, A. P., Laird, N. M., Rubin, D., 1977. Maximum likelihood from incomplete

data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statis-

tical Methodology) 39 (1), 1–38.

DeSarbo, W. S., Howard, D. J., Jedidi, K., 1990. MULTICLUS: A new method for simul-

taneously performing multidimensional scaling and cluster analysis. Psychome-

trika 56 (1), 121–136.

DeSarbo, W. S., Ramaswamy, V., 1994. CRISP: customer response based iterative seg-

mentation procedures for response modeling in direct marketing. Journal of Direct

Marketing 8 (3), 7–20.

Dhar, R., Glazer, R., 2003. Hedging customers. Harvard Business Review 81 (5), 86–92.


Dickson, P. R., 1982. Person-situation: segmentation’s missing link. Journal of Mar-

keting 46 (4), 56–64.

Diebolt, J., Robert, C. P., 1994. Estimation of finite mixture distributions through

Bayesian sampling. Journal of the Royal Statistical Society: Series B (Statistical

Methodology) 56 (2), 363–375.

Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E. J., 2008. Querying

and mining of time series data: experimental comparison of representations and

distance measures. In: Proceedings of the 34th International Conference on Very

Large Data Bases. ACM, Auckland, New Zealand, pp. 1542–1552.

Dobra, A., Garofalakis, M. N., Gehrke, J., Rastogi, R., 2002. Processing complex ag-

gregate queries over data streams. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.),

Proceedings of the 2002 ACM SIGMOD International Conference on Management

of Data. ACM, Madison, WI, pp. 61–72.

Dobra, A., Garofalakis, M. N., Gehrke, J., Rastogi, R., 2004. Sketch-based multi-query

processing over data streams. In: Bertino, E., Christodoulakis, S., Plexousakis,

D., Christophides, V., Koubarakis, M., Bohm, K., Ferrari, E. (Eds.), Proceedings of

the Ninth International Conference on Extending Database Technology. Springer,

Heraklion, Greece, pp. 551–568.

Domeniconi, C., Papadopoulos, D., Gunopulos, D., Ma, S., 2004.

Subspace clustering of high dimensional data. In: Berry, M. W., Dayal, U., Kamath,

C., Skillicorn, D. B. (Eds.), Proceedings of the Fourth SIAM International Confer-

ence on Data Mining. SIAM, Lake Buena Vista, FL.

Domingos, P., Hulten, G., 2000. Mining high-speed data streams. In: Proceedings

of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining. ACM, Boston, MA, pp. 71–80.

Domingos, P., Hulten, G., 2001. A general method for scaling up machine learning al-

gorithms and its application to clustering. In: Brodley, C. E., Danyluk, A. P. (Eds.),

Proceedings of the Eighteenth International Conference on Machine Learning.

Morgan Kaufmann, Williamstown, MA, pp. 106–113.

Dong, G., Han, J., Lakshmanan, L. V., Pei, J., Wang, H., Yu, P. S., 2003. Online mining

of changes from data streams: research problems and preliminary results. In: Pro-

ceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of

Data Streams. ACM, San Diego, CA.

Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., Picard, D., 1996. Density estima-

tion by wavelet thresholding. The Annals of Statistics 24 (2), 508–539.

Doucet, A., de Freitas, N., Gordon, N., 2001. An introduction to sequential Monte

Carlo methods. In: Doucet, A., de Freitas, N., Gordon, N. (Eds.), Sequential Monte


Carlo Methods in Practice. Statistics for Engineering and Information Science.

Springer, New York.

Dougherty, E. R., Brun, M., 2004. A probabilistic theory of clustering. Pattern Recog-

nition 37 (5), 917–925.

Dowling, G. R., Uncles, M., 1997. Do customer loyalty programs really work? Sloan

Management Review 38 (4), 71–82.

Doyle, P., 1995. Marketing in the new millennium. European Journal of Marketing

29 (13), 23–41.

Dubes, R., 1987. How many clusters are best? - an experiment. Pattern Recognition

20 (6), 645–663.

Dubes, R. C., 1999. Cluster analysis and related issues. In: Chen, C.-H., Pau, L. F.,

Wang, P. S.-P. (Eds.), Handbook of Pattern Recognition & Computer Vision. World

Scientific Publishing Company, River Edge, NJ.

Duda, R. O., Hart, P. E., Stork, D. G., 2001. Pattern Classification, 2nd Edition. Wiley,

New York.

DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., Pregibon, D., 1999. Squashing

flat files flatter. In: Proceedings of the Fifth ACM SIGKDD International Confer-

ence on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp. 6–15.

Duncan, T., 2005. Principles of Advertising & IMC, 2nd Edition. The McGraw-

Hill/Irwin series in marketing. McGraw-Hill/Irwin, Burr Ridge, IL.

Dwyer, F. R., Schurr, P. H., Oh, S., 1987. Developing buyer-seller relationships. Journal

of Marketing 51 (2), 11–27.

Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman &

Hall/CRC Monographs on Statistics & Applied Probability. Chapman & Hall, Lon-

don.

Egan, J., 2005. Relationship Marketing: Exploring Relational Strategies in Marketing,

2nd Edition. Prentice Hall, New York.

Eriksson, K., Mattsson, J., 2002. Managers’ perception of relationship management

in heterogeneous markets. Industrial Marketing Management 31 (6), 535–543.

Escobar, M. D., West, M., 1995. Bayesian density estimation and inference using mix-

tures. Journal of the American Statistical Association 90 (430), 577–588.

Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., Xu, X., 1998. Incremental cluster-

ing for mining in a data warehousing environment. In: Gupta, A., Shmueli, O.,

Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large

Data Bases. Morgan Kaufmann, New York, pp. 323–333.


Ester, M., Kriegel, H.-P., Sander, J., Xu, X., 1996. A density-based algorithm for dis-

covering clusters in large spatial databases with noise. In: Simoudis, E., Han,

J., Fayyad, U. M. (Eds.), Proceedings of the Second International Conference on

Knowledge Discovery and Data Mining. AAAI, Portland, OR, pp. 226–231.

Estivill-Castro, V., Lee, I., 2000. AMOEBA: hierarchical clustering based on spatial

proximity using Delaunay diagram. In: Foyer, P., Yeh, A., He, J. (Eds.), Proceedings

of the 9th International Symposium on Spatial Data Handling. IGU, Beijing, China,

pp. 26–41.

Evans, M., O’Malley, L., Patterson, M., 2004. Exploring Direct & Relationship Market-

ing, 2nd Edition. Thomson, London.

Everitt, B. S., 1974. Cluster Analysis. Reviews of Current Research. Heinemann, Lon-

don.

Everitt, B. S., 1979. Unresolved problems in cluster analysis. Biometrics 35 (1), 169–

181.

Everitt, B. S., Hand, D. J., 1981. Finite Mixture Distributions. Monographs on Applied

Probability and Statistics. Chapman & Hall, London.

Faloutsos, C., Jagadish, H. V., Sidiropoulos, N., 1997. Recovering information from

summary data. In: Carey, M. J. M. J., Dittrich, K. R., Lochovsky, F. H., Loucopoulos,

P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Conference on Very

Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 36–45.

Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching

in time-series databases. In: Snodgrass, R. T., Winslett, M. (Eds.), Proceedings of

the 1994 ACM SIGMOD International Conference on Management of Data. ACM,

Minneapolis, MN, pp. 419–429.

Farnstrom, F., Lewis, J., Elkan, C., 2000. Scalability for clustering algorithms revisited.

SIGKDD Explorations 2 (1), 51–57.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge

discovery in databases. AI Magazine 17 (3), 37–54.

Fearnhead, P., 2008. Computational methods for complex stochastic systems: a re-

view of some alternatives to MCMC. Statistics and Computing 18 (2), 151–171.

Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M., 1999. An approximate l1-

difference algorithm for massive data streams. In: Proceedings of the 40th Annual

Symposium on Foundations of Computer Science. IEEE, New York, pp. 501–511.

Feldman, J., Muthukrishnan, S., Sidiropoulos, A., Stein, C., Svitkina, Z., 2008. On the

complexity of processing massive, unordered, distributed data. The Computing

Research Repository.


Ferguson, T. S., 1973. A Bayesian analysis of some nonparametric problems. The An-

nals of Statistics 1 (2), 209–230.

Ferguson, T. S., 1983. Bayesian density estimation by mixtures of normal distribu-

tions. In: Rizvi, H., Rustagi, J. (Eds.), Recent Advances in Statistics. Academic, New

York, pp. 287–303.

Fern, X. Z., Brodley, C. E., 2003. Random projection for high dimensional data clus-

tering. In: Fawcett, T., Mishra, N. (Eds.), Proceedings of the Twentieth Interna-

tional Conference on Machine Learning. AAAI, Washington, DC, pp. 186–193.

Fernandez-Duran, J. J., 2004. Circular distributions based on nonnegative trigono-

metric sums. Biometrics 60 (2), 499–503.

Figini, S., Giudici, P., Brooks, S. P., 2006. Bayesian feature selection for estimating

customer survival. In: The Eighth World Meeting on Bayesian Statistics. Valencia,

Spain.

Fildes, R., 2002. Telecommunications demand forecasting: a review. International

Journal of Forecasting 18 (4), 489–522.

Fisher, D. H., 1987. Improving inference through conceptual clustering. In: Proceed-

ings of the Sixth National Conference on Artificial Intelligence. AAAI, Seattle, WA,

pp. 461–465.

Fisher, N. I., 1996. Statistical Analysis of Circular Data, 2nd Edition. Cambridge University Press, Cambridge, UK.

Fisher, N. I., Lee, A. J., 1994. Time series analysis of circular data. Journal of the Royal

Statistical Society: Series B (Statistical Methodology) 56 (2), 327–339.

Flajolet, P., Martin, G. N., 1983. Probabilistic counting. In: Proceedings of the 24th

Annual IEEE Symposium on Foundations of Computer Science. IEEE, Tucson, AZ,

pp. 76–82.

Flint, D. J., Woodruff, R. B., Gardial, S. F., 1997. Customer value change in industrial

marketing relationships: a call for new strategies and research. Industrial Market-

ing Management 26 (2), 163–175.

Foster, G., Gupta, M., 1994. Marketing, cost management and management account-

ing. Journal of Management Accounting Research 6 (Fall), 43–77.

Fournier, S., Dobscha, S., Mick, D. G., 1998. Preventing the premature death of rela-

tionship marketing. Harvard Business Review 76 (1), 42–51.

Fraley, C., Raftery, A. E., 1998. How many clusters? which clustering method? an-

swers via model-based cluster analysis. The Computer Journal 41 (8), 578–588.


Fraley, C., Raftery, A. E., 1999. MCLUST: software for model-based cluster and dis-

criminant analysis. Tech. Rep. 342, Statistics Department, University

of Washington, Seattle, WA.

Fraley, C., Raftery, A. E., 2002. Model-based clustering, discriminant analysis, and

density estimation. Journal of the American Statistical Association 97 (458), 611–

631.

Farley, J. U., Ring, L. W., 1966. A stochastic model of supermarket traffic flow. Oper-

ations Research 14 (4), 555–567.

Friedman, J., Fisher, N., 1999. Bump hunting in high-dimensional data. Statistics and

Computing 9 (2), 123–143.

Friedman, J., Meulman, J., 2004. Clustering objects on subsets of attributes. Journal

of the Royal Statistical Society: Series B (Statistical Methodology) 66 (4), 1–25.

Gaber, M., Zaslavsky, A., Krishnaswamy, S., 2005. Mining data streams: a review. SIG-

MOD Record 34 (2), 18–26.

Gaber, M. M., Zaslavsky, A., Krishnaswamy, S., 2007. A survey of classification meth-

ods in data streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algo-

rithms. Advances in Database Systems. Springer, New York.

Gaede, V., Gunther, O., 1998. Multidimensional access methods. ACM Computing

Surveys 30 (2), 170–231.

Gama, J., Gaber, M. M., 2007. Learning from Data Streams: Processing Techniques in

Sensor Networks. Springer, Dordrecht, Netherlands.

Ganti, V., Gehrke, J., Ramakrishnan, R., 1999a. CACTUS - clustering categorical data

using summaries. In: Proceedings of the Fifth ACM SIGKDD International Confer-

ence on Knowledge Discovery and Data Mining. ACM, San Diego, CA, pp. 73–83.

Ganti, V., Gehrke, J., Ramakrishnan, R., 1999b. A framework for measuring changes

in data characteristics. In: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-

SIGART Symposium on Principles of Database Systems. ACM, Philadelphia, PA,

pp. 126–137.

Ganti, V., Gehrke, J., Ramakrishnan, R., 2001. DEMON: mining and monitoring evolv-

ing data. IEEE Transactions on Knowledge and Data Engineering 13 (1), 50–63.

Ganti, V., Gehrke, J., Ramakrishnan, R., 2002. Mining data streams under block evo-

lution. SIGKDD Explorations 3 (2), 1–10.

Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A. L., French, J. C., 1999c. Clustering

large datasets in arbitrary metric spaces. In: Proceedings of the 15th International

Conference on Data Engineering. IEEE, Sydney, Australia, pp. 502–511.


Gao, J., Fan, W., Han, J., 2007. On appropriate assumptions to mine data streams:

analysis and practice. In: Proceedings of the Seventh IEEE International Confer-

ence on Data Mining. IEEE, Omaha, NE, pp. 143–152.

Garg, A., Mangla, A., Gupta, N., Bhatnagar, V., 2006. PBIRCH: a scalable parallel clus-

tering algorithm for incremental data. In: Proceedings of the Tenth International

Database Engineering and Applications Symposium. IEEE, Delhi, India, pp. 315–

316.

Garofalakis, M. N., Gibbons, P. B., 2002. Wavelet synopses with error guarantees. In:

Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002 ACM SIG-

MOD International Conference on Management of Data. ACM, Madison, WI, pp.

476–487.

Garofalakis, M. N., Kumar, A., 2004. Deterministic wavelet thresholding for

maximum-error metrics. In: Deutsch, A. (Ed.), Proceedings of the Twenty-third

ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.

ACM, Paris, France, pp. 166–176.

Gavrilov, M., Anguelov, D., Indyk, P., Motwani, R., 2000. Mining the stock market:

which measure is best? In: Proceedings of the Sixth ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining. ACM, Boston, MA, pp. 487–

496.

Ge, X., Smyth, P., 2000. Deformable Markov model templates for time-series pattern

matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining. ACM, Boston, MA, pp. 81–90.

Gehrke, J., Korn, F., Srivastava, D., 2001. On computing correlated aggregates over

continual data streams. SIGMOD Record 30 (2), 13–24.

Gelfand, A. E., Smith, A. F. M., 1990. Sampling-based approaches to calculating

marginal densities. Journal of the American Statistical Association 85 (410), 398–

409.

Gelman, A., Carlin, J. B., Stern, H. S., Rubin, D. B., 2004. Bayesian Data Analysis, 2nd

Edition. Texts in Statistical Science. Chapman & Hall, Boca Raton, FL.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the

Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence 6 (6), 721–741.

Gennari, J. H., Langley, P., Fisher, D. H., 1989. Models of incremental concept forma-

tion. Artificial Intelligence 40 (1-3), 11–61.

Geyer, C. J., 1992. Practical Markov chain Monte Carlo. Statistical Science 7 (4), 473–

483.


Ghahramani, Z., Beal, M. J., 1999. Variational inference for Bayesian mixtures of fac-

tor analysers. In: Solla, S. A., Leen, T. K., Muller, K.-R. (Eds.), Proceedings of the

1999 Neural Information Processing Systems. MIT, Denver, CO, pp. 449–455.

Ghahramani, Z., Beal, M. J., 2001. Propagation algorithms for variational Bayesian

learning. In: Leen, T. K., Dietterich, T. G., Tresp, V. (Eds.), Proceedings of the 2001

Neural Information Processing Systems. MIT, Denver, CO, pp. 507–513.

Ghosh, J., Beal, M. J., Ngo, H. Q., Qiao, C., 2006a. On profiling mobility and predicting

locations of wireless users. In: Proceedings of the 2nd International Workshop on

Multi-hop Ad Hoc Networks: From Theory to Reality. ACM, Florence, Italy, pp. 55–

62.

Ghosh, J., Strehl, A., 2004. Clustering and visualization of retail market baskets. In:

Pal, N. R., Jain, L. C. (Eds.), Advanced Techniques in Knowledge Discovery and

Data Mining. Advanced Information and Knowledge Processing. Springer, New

York.

Ghosh, J., Strehl, A., 2006. Similarity-based text clustering: a comparative study. In:

Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data: Re-

cent Advances in Clustering. Springer, New York.

Ghosh, J. K., Delampady, M., Samanta, T., 2006b. An Introduction to Bayesian Analy-

sis: Theory and Methods. Springer Texts in Statistics. Springer, New York.

Ghosh, K., Jammalamadaka, S. R., Tiwari, R. C., 2003. Semiparametric Bayesian tech-

niques for problems in circular data. Journal of Applied Statistics 30 (2), 145–161.

Gibbons, P. B., Matias, Y., 1998. New sampling-based summary statistics for improv-

ing approximate query answers. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of

the 1998 ACM SIGMOD International Conference on Management of Data. ACM,

Seattle, WA, pp. 331–342.

Gibbons, P. B., Matias, Y., 1999. Synopsis data structures for massive data sets. In:

Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms.

Vol. A. ACM, Baltimore, MD, pp. 909–910.

Gibbons, P. B., Matias, Y., Poosala, V., 1997. Fast incremental maintenance of ap-

proximate histograms. In: Jarke, M., Carey, M. J., Dittrich, K. R., Lochovsky, F. H.,

Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Con-

ference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 466–475.

Gilbert, A. C., Guha, S., Indyk, P., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2002.

Fast, small-space algorithms for approximate histogram maintenance. In: Pro-

ceedings of the 34th Annual ACM Symposium on Theory of Computing. ACM,

Quebec, QC, Canada, pp. 389–398.


Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2001. Surfing wavelets on

streams: One-pass summaries for approximate aggregate queries. In: Apers, P.

M. G., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K., Snodgrass, R. T.

(Eds.), Proceedings of the 27th International Conference on Very Large Data Bases.

Morgan Kaufmann, Roma, Italy, pp. 79–88.

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., Strauss, M., 2003. One-pass wavelet

decompositions of data streams. IEEE Transactions on Knowledge and Data Engi-

neering 15 (3), 541–554.

Gilks, W. R., Oldfield, L., Rutherford, A., 1989. Statistical analysis. In: Knapp, B. W.,

Darken, W. R., Gilks, S. F. S., Boumsell, L., Harlan, J. M., Kishimoto, T., Morimoto,

C., Ritz, J., Shaw, S., Silverstein, R., Springer, T., Tedder, T. F., Todd, R. F. (Eds.),

Leucocyte Typing IV. Oxford University, Oxford, UK, pp. 6–12.

Gilks, W. R., Richardson, S., Spiegelhalter, D. J., 1998. Markov Chain Monte Carlo in

Practice. Chapman & Hall, Boca Raton, FL.

Gilks, W. R., Thomas, A., Spiegelhalter, D. J., 1994. A language and program for com-

plex Bayesian modelling. Journal of the Royal Statistical Society: Series D (The

Statistician) 43 (1), 169–177.

Giudici, P., Castelo, R., 2001. Association models for web mining. Data Mining and

Knowledge Discovery 5 (3), 183–196.

Giudici, P., Passerone, G., 2002. Data mining of association structures to model con-

sumer behaviour. Computational Statistics & Data Analysis 38 (4), 533–541.

Glymour, C., Madigan, D., Pregibon, D., Smyth, P., 1996. Statistical inference and data

mining. Communications of the ACM 39 (11), 35–41.

Gonzalez, M. C., Hidalgo, C. A., Barabasi, A.-L., 2008. Understanding individual hu-

man mobility patterns. Nature 453 (7196), 779–782.

Gordon, I. H., 1998. Relationship Marketing: New Strategies, Techniques, and Tech-

nologies to Win the Customers You Want and Keep Them Forever. Wiley, Etobi-

coke, ON, Canada.

Graham, G., 2005. Behaviorism. In: Zalta, E. N. (Ed.), The Stanford Encyclopedia of

Philosophy, fall 2005 Edition.

Green, P. J., 1995. Reversible jump Markov chain Monte Carlo computation and

Bayesian model determination. Biometrika 82 (4), 711–732.

Green, P. J., Richardson, S., 2001. Modelling heterogeneity with and without the

Dirichlet process. Scandinavian Journal of Statistics 28 (2), 355–375.


Greenwald, M., Khanna, S., 2001. Space-efficient online computation of quantile

summaries. In: Aref, W. G. (Ed.), Proceedings of the 2001 ACM SIGMOD Interna-

tional Conference on Management of Data. ACM, Santa Barbara, CA, pp. 58–66.

Gronroos, C., 1994. Quo vadis, marketing? Toward a relationship marketing

paradigm. Journal of Marketing Management 10 (5), 347–360.

Grunert, S. C., Scherhorn, G., 1990. Consumer values in West Germany: underlying di-

mensions and cross-cultural comparison with North America. Journal of Business

Research 20 (2), 97–107.

Guha, S., 2005. Space efficiency in synopsis construction algorithms. In: Bohm, K.,

Jensen, C. S., Haas, L. M., Kersten, M. L., Larson, P.-A., Ooi, B. C. (Eds.), Proceedings

of the 31st International Conference on Very Large Data Bases. ACM, Trondheim,

Norway, pp. 409–420.

Guha, S., 2010. Posterior simulation in countable mixture models for large datasets.

Journal of the American Statistical Association 105 (490), 775–786.

Guha, S., Gunopulos, D., Koudas, N., 2003a. Correlating synchronous and asyn-

chronous data streams. In: Getoor, L., Senator, T. E., Domingos, P., Faloutsos,

C. (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining. ACM, Washington, DC, pp. 529–534.

Guha, S., Harb, B., 2005. Wavelet synopsis for data streams: minimizing non-

Euclidean error. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceedings

of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery

and Data Mining. ACM, Chicago, IL, pp. 88–97.

Guha, S., Harb, B., 2006. Approximation algorithms for wavelet transform coding of

data streams. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium

on Discrete Algorithm. ACM, Miami, FL, pp. 698–707.

Guha, S., Indyk, P., Muthukrishnan, S., Strauss, M., 2002. Histogramming data

streams with fast per-item processing. In: Widmayer, P., Ruiz, F. T., Bueno, R. M.,

Hennessy, M., Eidenbenz, S., Conejo, R. (Eds.), Proceedings of the 29th Interna-

tional Colloquium on Automata, Languages and Programming. Springer, Malaga,

Spain, pp. 681–692.

Guha, S., Kim, C., Shim, K., 2004a. XWAVE: approximate extended wavelets for

streaming data. In: Nascimento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J.,

Blakeley, J. A., Schiefer, K. B. (Eds.), Proceedings of the 30th International Con-

ference on Very Large Data Bases. Morgan Kaufmann, Toronto, ON, Canada, pp.

288–299.


Guha, S., Koudas, N., 2002. Approximating a data stream for querying and estima-

tion: algorithms and performance evaluation. In: Proceedings of the 18th Interna-

tional Conference on Data Engineering. IEEE, San Jose, CA, pp. 567–576.

Guha, S., Koudas, N., Shim, K., 2001. Data-streams and histograms. In: Proceedings

of the 33rd Annual ACM Symposium on Theory of Computing. ACM, Heraklion,

Greece, pp. 471–475.

Guha, S., Koudas, N., Shim, K., 2006. Approximation and streaming algorithms for

histogram construction problems. ACM Transactions on Database Systems 31 (1),

396–438.

Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L., 2003b. Clustering

data streams: theory and practice. IEEE Transactions on Knowledge and Data En-

gineering 15 (3), 515–528.

Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large

databases. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of the ACM SIGMOD In-

ternational Conference on Management of Data. ACM, Seattle, WA, pp. 73–84.

Guha, S., Rastogi, R., Shim, K., 1999. ROCK: a robust clustering algorithm for cate-

gorical attributes. In: Proceedings of the 15th International Conference on Data

Engineering. IEEE, Sydney, Australia, pp. 512–521.

Guha, S., Shim, K., Woo, J., 2004b. REHIST: relative error histogram construction

algorithms. In: Nascimento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J., Blakeley,

J. A., Schiefer, K. B. (Eds.), Proceedings of the 30th International Conference on

Very Large Data Bases. Morgan Kaufmann, Toronto, ON, Canada, pp. 300–311.

Gummesson, E., 1994. Making relationship marketing operational. International

Journal of Service Industry Management 5 (5), 5–20.

Gummesson, E., 1999. Total Relationship Marketing: Rethinking Marketing Man-

agement from 4Ps to 30Rs. Butterworth-Heinemann, Oxford, UK.

Gunopulos, D., Kollios, G., Tsotras, V. J., Domeniconi, C., 2000. Approximating multi-

dimensional aggregate range queries over real attributes. In: Chen, W., Naughton,

J. F., Bernstein, P. A. (Eds.), Proceedings of the 2000 ACM SIGMOD International

Conference on Management of Data. ACM, Dallas, TX, pp. 463–474.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Jour-

nal of Machine Learning Research 3 (Mar), 1157–1182.

Haas, P. J., 1997. Large-sample and deterministic confidence intervals for online ag-

gregation. In: Ioannidis, Y. E., Hansen, D. M. (Eds.), Proceedings of the Ninth In-

ternational Conference on Scientific and Statistical Database Management. IEEE,

Olympia, WA, pp. 51–63.


Hallberg, G., 1995. All Customers Are Not Created Equal: The Differential Marketing

Strategy for Brand Loyalty and Profits. Wiley, New York.

Han, C., Carlin, B. P., 2001. Markov chain Monte Carlo methods for computing

Bayes factors: a comparative review. Journal of the American Statistical Associa-

tion 96 (455), 1122–1132.

Han, J., Dong, G., Yin, Y., 1999. Efficient mining of partial periodic patterns in time

series database. In: Proceedings of the 15th International Conference on Data En-

gineering. IEEE, Sydney, Australia, pp. 106–115.

Han, J., Kamber, M., 2006. Data Mining: Concepts and Techniques, 2nd Edition. The

Morgan-Kaufmann Series in Data Management Systems. Morgan Kaufmann, San

Francisco, CA.

Han, J., Pei, J., Yin, Y., 2000. Mining frequent patterns without candidate generation.

In: Chen, W., Naughton, J. F., Bernstein, P. A. (Eds.), Proceedings of the 2000 ACM

SIGMOD International Conference on Management of Data. ACM, Dallas, TX, pp.

1–12.

Hand, D. J., 1998. Data mining: statistics and more? The American Statistician 52 (2),

112–118.

Har-even, M., Brailovsky, V. L., 1995. Probabilistic validation approach for clustering.

Pattern Recognition Letters 16 (11), 1189–1196.

Hartigan, J. A., 1975. Clustering Algorithms. Wiley Series in Probability and Mathe-

matical Statistics. Wiley, New York.

Hastie, T., Tibshirani, R., Friedman, J. H., 2009. The Elements of Statistical Learning:

Data Mining, Inference, and Prediction, 2nd Edition. Springer Series in Statistics.

Springer, New York.

Hastings, W. K., 1970. Monte Carlo sampling methods using Markov chains and their

applications. Biometrika 57 (1), 97–109.

Heinz, C., Seeger, B., 2006. Resource-aware kernel density estimators over stream-

ing data. In: Yu, P. S., Tsotras, V. J., Fox, E. A., Liu, B. (Eds.), Proceedings of the

2006 ACM CIKM International Conference on Information and Knowledge Man-

agement. ACM, Arlington, VA, pp. 870–871.

Heinz, C., Seeger, B., 2007. Adaptive wavelet density estimators over data streams.

In: Proceedings of the 19th International Conference on Scientific and Statistical

Database Management. IEEE, Banff, AB, Canada, p. 35.

Heinz, C., Seeger, B., 2008. Cluster kernels: resource-aware kernel density estima-

tors over streaming data. IEEE Transactions on Knowledge and Data Engineering

20 (7), 880–893.


Heitfield, E., Levy, A., 2001. Parametric, semi-parametric and non-parametric mod-

els of telecommunications demand: an investigation of residential calling pat-

terns. Information Economics and Policy 13 (3), 311–329.

Heller, K. A., Ghahramani, Z., 2007. A nonparametric Bayesian approach to modeling

overlapping clusters. In: Proceedings of the Eleventh International Conference on

Artificial Intelligence and Statistics. San Juan, PR.

Henzinger, M. R., Raghavan, P., Rajagopalan, S., 1998. Computing on data

streams. Tech. Rep. SRC-TN-1998-011, Systems Research Center, Palo Alto, CA.

Herr, P. M., Kardes, F. R., Kim, J., 1991. Effects of word-of-mouth and product-

attribute information on persuasion: an accessibility-diagnosticity perspective.

Journal of Consumer Research 17 (4), 454–462.

Hinneburg, A., Gabriel, H.-H., 2007. DENCLUE 2.0: fast clustering based on kernel

density estimation. In: Berthold, M. R., Shawe-Taylor, J., Lavrac, N. (Eds.), Pro-

ceedings of the Seventh International Symposium on Intelligent Data Analysis.

Vol. 4723. Springer, Ljubljana, Slovenia, pp. 70–80.

Hinneburg, A., Keim, D. A., 1998. An efficient approach to clustering in large mul-

timedia databases with noise. In: Agrawal, R., Stolorz, P. E., Piatetsky-Shapiro, G.

(Eds.), Proceedings of the Fourth International Conference on Knowledge Discov-

ery and Data Mining. AAAI, New York, pp. 58–65.

Hinneburg, A., Keim, D. A., 1999. Optimal grid-clustering: towards breaking the curse

of dimensionality in high-dimensional clustering. In: Atkinson, M. P., Orlowska,

M. E., Valduriez, P., Zdonik, S. B., Brodie, M. L. (Eds.), Proceedings of the 25th In-

ternational Conference on Very Large Data Bases. Morgan Kaufmann, Edinburgh,

UK, pp. 506–517.

Hirschman, E. C., 1986. Humanistic inquiry in marketing research: philosophy,

method, and criteria. Journal of Marketing Research 23 (3), 237–249.

Hofstede, F. T., Wedel, M., Steenkamp, J.-B. E. M., 2002. Identifying spatial segments

in international markets. Marketing Science 21 (2), 160–177.

Holbrook, M. B., Hirschman, E. C., 1982. The experiential aspects of consumption:

consumer fantasies, feelings, and fun. Journal of Consumer Research 9 (2), 132–

140.

Holt, D. B., 1997. Poststructuralist lifestyle analysis: conceptualizing the social pat-

terning of consumption in postmodernity. Journal of Consumer Research 23 (4),

326–350.

Homer, P. M., Kahle, L. R., 1988. A structural equation test of the value-attitude-

behavior hierarchy. Journal of Personality and Social Psychology 54 (4), 638–646.


Houle, M. E., Sakuma, J., 2005. Fast approximate similarity search in extremely high-

dimensional data sets. In: Proceedings of the 21st International Conference on

Data Engineering. IEEE, Tokyo, Japan, pp. 619–630.

Hsu, W., Lee, M. L., Wang, J., 2008. Temporal and Spatio-Temporal Data Mining. IGI,

Hershey, PA.

Huang, Y., Zhang, L., Zhang, P., 2008. A framework for mining sequential patterns

from spatio-temporal event data sets. IEEE Transactions on Knowledge and Data

Engineering 20 (4), 433–448.

Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets

with categorical values. Data Mining and Knowledge Discovery 2 (3), 283–304.

Hubbard, R., Lindsay, R. M., 2008. Why p values are not a useful measure of evidence

in statistical significance testing. Theory & Psychology 18 (1), 69–88.

Hudson, L. A., Ozanne, J. L., 1988. Alternative ways of seeking knowledge in con-

sumer research. Journal of Consumer Research 14 (4), 508.

Hui, S. K., Bradlow, E. T., Fader, P. S., 2009a. Testing behavioral hypotheses using an

integrated model of grocery store shopping path and purchase behavior. Journal

of Consumer Research 36 (3), 478–493.

Hui, S. K., Fader, P. S., Bradlow, E. T., 2009b. The traveling salesman goes shopping:

The systematic deviations of grocery paths from TSP-optimality. Marketing Sci-

ence 28 (3), 566–572.

Hulten, G., Spencer, L., Domingos, P., 2001. Mining time-changing data streams. In:

Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining. ACM, San Francisco, CA, pp. 97–106.

Hunt, S. D., 1997. Competing through relationships: grounding relationship market-

ing in resource-advantage theory. Journal of Marketing Management 13 (5), 431–

445.

Hwang, H., Jung, T., Suh, E., 2004. An LTV model and customer segmentation based

on customer value: a case study on the wireless telecommunication industry. Ex-

pert Systems with Applications 26 (2), 181–188.

Hyndman, R. J., 1995. The problem with Sturges’ rule for constructing histograms.

Tech. rep., Department of Econometrics and Business Statistics, Monash Univer-

sity, Clayton, VIC, Australia.

URL http://www-personal.buseco.monash.edu.au/~hyndman/papers/sturges.htm


Indyk, P., 2000. Stable distributions, pseudorandom generators, embeddings and

data stream computation. In: Proceedings of the 41st Annual Symposium on

Foundations of Computer Science. IEEE, Redondo Beach, CA, pp. 189–197.

Indyk, P., Koudas, N., Muthukrishnan, S., 2000. Identifying representative trends

in massive time series data sets using sketches. In: Abbadi, A. E., Brodie, M. L.,

Chakravarthy, S., Dayal, U., Kamel, N., Schlageter, G., Whang, K.-Y. (Eds.), Pro-

ceedings of the 26th International Conference on Very Large Data Bases. Morgan

Kaufmann, Cairo, Egypt, pp. 363–372.

Intel Corporation, 2002. CDR analysis and warehousing for mobile networks. Tech.

rep., Intel Corporation, Santa Clara, CA.

Ioannidis, Y. E., 2003. The history of histograms. In: Freytag, J. C., Lockemann, P. C.,

Abiteboul, S., Carey, M. J., Selinger, P. G., Heuer, A. (Eds.), Proceedings of the 29th

International Conference on Very Large Data Bases. Morgan Kaufmann, Berlin,

Germany, pp. 19–30.

Ioannidis, Y. E., Poosala, V., 1995. Balancing histogram optimality and practicality for

query result size estimation. SIGMOD Record 24 (2), 233–244.

Ioannidis, Y. E., Poosala, V., 1999. Histogram-based approximation of set-valued

query-answers. In: Atkinson, M. P., Orlowska, M. E., Valduriez, P., Zdonik, S. B.,

Brodie, M. L. (Eds.), Proceedings of the 25th International Conference on Very

Large Data Bases. Morgan Kaufmann, Edinburgh, UK, pp. 174–185.

Jaakkola, T. S., Jordan, M. I., 2000. Bayesian parameter estimation via variational

methods. Statistics and Computing 10 (1), 25–37.

Jacoby, J., 1978. Consumer research: a state of the art review. Journal of Marketing

42 (2), 87–96.

Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., Suel, T.,

1998. Optimal histograms with quality guarantees. In: Gupta, A., Shmueli, O.,

Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large

Data Bases. Morgan Kaufmann, New York, pp. 275–286.

Jaihak, C., Rao, V. R., 2003. A general choice model for bundles with multiple-

category products: application to market segmentation and optimal pricing for

bundles. Journal of Marketing Research 40 (2), 115–130.

Jain, A. K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Let-

ters 31 (8), 651–666.

Jain, A. K., Dubes, R. C., 1988. Algorithms for Clustering Data. Prentice Hall, Upper

Saddle River, NJ.


Jain, A. K., Murty, M. N., Flynn, P. J., 1999. Data clustering: a review. ACM Computing

Surveys 31 (3), 264–323.

Jain, S., Neal, R. M., 2004. A split-merge Markov chain Monte Carlo procedure for the

Dirichlet process mixture model. Journal of Computational and Graphical Statis-

tics 13 (1), 158–182.

Jain, S., Neal, R. M., 2007. Splitting and merging components of a nonconjugate

Dirichlet process mixture model (with discussion). Bayesian Analysis 2 (3), 445–

472.

Jammalamadaka, S. R., Sengupta, A., 2001. Topics in Circular Statistics. Series on

Multivariate Analysis. World Scientific, Singapore.

Jefferys, W. H., Berger, J. O., 1992. Ockham’s razor and Bayesian analysis. American

Scientist 80 (Jan/Feb), 64–72.

Jiang, D., Tang, C., Zhang, A., 2004. Cluster analysis for gene expression data: a sur-

vey. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1370–1386.

Jing, L., Ng, M. K., Huang, J. Z., 2007. An entropy weighting k-means algorithm

for subspace clustering of high-dimensional sparse data. IEEE Transactions on

Knowledge and Data Engineering 19 (8), 1026–1041.

Jolliffe, I. T., 2002. Principal Component Analysis, 2nd Edition. Springer Series in

Statistics. Springer, New York.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., Saul, L. K., 1998. An introduction

to variational methods for graphical models. In: Jordan, M. I. (Ed.), Learning

in Graphical Models. Adaptive Computation and Machine Learning. MIT, Cam-

bridge, MA, pp. 105–162.

Kahle, L. R., 1983. Social Values and Social Change: Adaptation to Life in America.

Praeger, New York.

Kahle, L. R., Beatty, S. E., Homer, P., 1986. Alternative measurement

approaches to consumer values: the list of values (LOV) and values and life style

(VALS). Journal of Consumer Research 13 (3), 405.

Kahle, L. R., Liu, R., Watkins, H., 1992. Psychographic variation across the United

States geographic regions. Advances in Consumer Research 19 (1), 346–352.

Kailing, K., Kriegel, H., Kroger, P., 2004. Density-connected subspace clustering for

high-dimensional data. In: Berry, M. W., Dayal, U., Kamath, C., Skillicorn, D. B.

(Eds.), Proceedings of the Fourth SIAM International Conference on Data Mining.

SIAM, Lake Buena Vista, FL, pp. 246–257.


Kailing, K., Kriegel, H.-P., Kroger, P., Wanka, S., 2003. Ranking interesting subspaces

for clustering high dimensional data. In: Lavrac, N., Gamberger, D., Blockeel, H., Todorovski,

L. (Eds.), Proceedings of the Seventh European Conference on Principles and Prac-

tice of Knowledge Discovery in Databases. Springer, Cavtat-Dubrovnik, Croatia,

pp. 241–252.

Kalpakis, K., Gada, D., Puttagunta, V., 2001. Distance measures for effective cluster-

ing of ARIMA time-series. In: Cercone, N., Lin, T. Y., Wu, X. (Eds.), Proceedings of

the 2001 IEEE International Conference on Data Mining. IEEE, San Jose, CA, pp.

273–280.

Kamakura, W. A., Ramaswami, S. N., Srivastava, R. K., 1991. Applying latent trait anal-

ysis in the evaluation of prospects for cross-selling of financial services. Interna-

tional Journal of Research in Marketing 8 (4), 329–349.

Kandogan, E., 2001. Visualizing multi-dimensional clusters, trends, and outliers us-

ing star coordinates. In: Proceedings of the Seventh ACM SIGKDD International

Conference on Knowledge Discovery and Data Mining. ACM, San Francisco, CA,

pp. 107–116.

Karlis, D., Xekalaki, E., 2003. Choosing initial values for the EM algorithm for finite

mixtures. Computational Statistics & Data Analysis 41 (3-4), 577–590.

Karras, P., Mamoulis, N., 2005. One-pass wavelet synopses for maximum-error met-

rics. In: Bohm, K., Jensen, C. S., Haas, L. M., Kersten, M. L., Larson, P.-A., Ooi, B. C.

(Eds.), Proceedings of the 31st International Conference on Very Large Data Bases.

ACM, Trondheim, Norway, pp. 421–432.

Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: hierarchical clustering us-

ing dynamic modeling. Computer 32 (8), 68–75.

Kass, R. E., Raftery, A. E., 1995. Bayes factors. Journal of the American Statistical As-

sociation 90 (430), 773–795.

Kaufman, L., Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Clus-

ter Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, New

York.

Keaveney, S. M., Parthasarathy, M., 2001. Customer switching behavior in online ser-

vices: an exploratory study of the role of selected attitudinal, behavioral, and de-

mographic factors. Journal of the Academy of Marketing Science 29 (4), 374–390.

Keogh, E. J., Chakrabarti, K., Pazzani, M., Mehrotra, S., 2001. Dimensionality reduc-

tion for fast similarity search in large time series databases. Knowledge and Infor-

mation Systems 3 (3), 263–286.


Keogh, E. J., Kasetty, S., 2003. On the need for time series data mining benchmarks: a

survey and empirical demonstration. Data Mining and Knowledge Discovery 7 (4),

349–371.

Keogh, E. J., Lin, J., Fu, A. W.-C., 2005. Hot SAX: efficiently finding the most unusual

time series subsequence. In: Proceedings of the Fifth IEEE International Confer-

ence on Data Mining. IEEE, Houston, TX, pp. 226–233.

Keogh, E. J., Lin, J., Truppel, W., 2003. Clustering of time series subsequences is

meaningless: implications for previous and future research. In: Proceedings of the

Third IEEE International Conference on Data Mining. IEEE, Melbourne, FL, pp.

115–122.

Keogh, E. J., Lonardi, S., Ratanamahatana, C., 2004. Towards parameter-free data

mining. In: Kim, W., Kohavi, R., Gehrke, J., DuMouchel, W. (Eds.), Proceedings of

the Tenth ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining. ACM, Seattle, WA, pp. 206–215.

Keogh, E. J., Ratanamahatana, C., 2005. Exact indexing of dynamic time warping.

Knowledge and Information Systems 7 (3), 358–386.

Kifer, D., Ben-David, S., Gehrke, J., 2004. Detecting change in data streams. In: Nasci-

mento, M. A., Ozsu, M. T., Kossmann, D., Miller, R. J., Blakeley, J. A., Schiefer, K. B.

(Eds.), Proceedings of the Thirtieth International Conference on Very Large Data

Bases. Morgan Kaufmann, Toronto, ON, Canada, pp. 180–191.

Kim, D., Yum, B.-J., 2005. Collaborative filtering based on iterative principal compo-

nent analysis. Expert Systems with Applications 28 (4), 823–830.

Kleinberg, J. M., 2002. An impossibility theorem for clustering. In: Becker, S., Thrun,

S., Obermayer, K. (Eds.), Proceedings of the 2002 Neural Information Processing

Systems. MIT, Vancouver, BC, Canada, pp. 446–453.

Kleinberg, J. M., 2003. Bursty and hierarchical structure in streams. In: Proceedings

of the Eighth ACM SIGKDD International Conference on Knowledge Discovery

and Data Mining. ACM, Edmonton, AB, Canada, pp. 373–397.

Knox, S., 1998. Loyalty-based segmentation and the customer development process.

European Management Journal 16 (6), 729–737.

Kogan, J., 2007. Introduction to Clustering Large and High-Dimensional Data. Cam-

bridge University, Cambridge, UK.

Kogan, J., Nicholas, C. K., Teboulle, M., 2006. Grouping Multidimensional Data: Re-

cent Advances in Clustering. Springer, New York.


Konig, A., Gratz, A., 2004. Advanced methods for the analysis of semiconductor man-

ufacturing process data. In: Pal, N. R., Jain, L. C. (Eds.), Advanced Techniques in

Knowledge Discovery and Data Mining. Advanced Information and Knowledge

Processing. Springer, New York.

Kotler, P., 1991. Marketing Management: Analysis, Planning, and Control. Prentice-

Hall, Englewood Cliffs, NJ.

Kotler, P., Armstrong, G., 2009. Principles of Marketing, 13th Edition. Pearson, Upper

Saddle River, NJ.

Kriegel, H.-P., Kroger, P., Renz, M., Wurst, S. H. R., 2005. A generic framework for ef-

ficient subspace clustering of high-dimensional data. In: Proceedings of the Fifth

IEEE International Conference on Data Mining. IEEE, Houston, TX, pp. 250–257.

Kriegel, H.-P., Kroger, P., Schubert, E., Zimek, A., 2008. A general framework for

increasing the robustness of PCA-based correlation clustering algorithms. In:

Proceedings of the 20th International Conference on Scientific and Statistical

Database Management. Springer-Verlag, Berlin, Germany, pp. 418–435.

Kriegel, H.-P., Kroger, P., Zimek, A., 2009. Clustering high-dimensional data: A survey

on subspace clustering, pattern-based clustering, and correlation clustering. ACM

Transactions on Knowledge Discovery from Data 3 (1), 1–58.

Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y., 2003. Sketch-based change detection:

methods, evaluation, and applications. In: Proceedings of the Third ACM SIG-

COMM Conference on Internet Measurement. ACM, Miami Beach, FL, pp. 234–

247.

Kuiper, N. H., 1962. Tests concerning random points on a circle. Proceedings of the

Koninklijke Nederlandse Akademie van Wetenschappen, Series A 63, 38–47.

Kumar, N., Lolla, V. N., Keogh, E. J., Lonardi, S., Ratanamahatana, C. A., 2005. Time-

series bitmaps: a practical visualization tool for working with large time series

databases. In: Proceedings of the 2005 SIAM International Data Mining Confer-

ence. Newport Beach, CA.

Kumar, V., Petersen, J. A., Leone, R. P., 2007. How valuable is word of mouth? Harvard

Business Review 85 (10), 139–146.

Kumar, V., Venkatesan, R., Reinartz, W., 2006. Knowing what to sell, when, and to

whom. Harvard Business Review 84 (3), 131–137.

Lange, T., Roth, V., Braun, M., Buhmann, J., 2004. Stability-based validation of clus-

tering solutions. Neural Computation 16 (6), 1299–1323.

Larson, J. S., Bradlow, E. T., Fader, P. S., 2005. An exploratory look at supermarket

shopping paths. International Journal of Research in Marketing 22 (4), 395–414.


Last, M., Klein, Y., Kandel, A., 2001. Knowledge discovery in time series databases.

IEEE Transactions on Systems, Man, and Cybernetics, Part B 31 (1), 160–169.

Lastovicka, J. L., 1982. On the validation of lifestyle traits: a review and illustration.

Journal of Marketing Research 19 (1), 126–138.

Lastovicka, J. L., Murry Jr., J. P., Joachimsthaler, E. A., 1990. Evaluating the measure-

ment validity of lifestyle typologies. Journal of Marketing Research 27 (1), 11–23.

Lee, H.-Y., Ong, H.-L., 1996. Visualization support for data mining. IEEE Expert 11 (5),

69–75.

Lee, J.-H., Kim, D.-H., Chung, C.-W., 1999. Multi-dimensional selectivity estimation

using compressed histogram information. In: Delis, A., Faloutsos, C., Ghandeharizadeh, S.

(Eds.), Proceedings of the 1999 ACM SIGMOD International Conference on Man-

agement of Data. ACM, Philadelphia, PA, pp. 205–214.

Lee, P. M., 2004. Bayesian Statistics: An Introduction, 3rd Edition. Hodder Arnold,

London.

Lee, W.-P., Liu, C.-H., Lu, C.-C., 2002. Intelligent agent-based systems for personal-

ized recommendations in internet commerce. Expert Systems with Applications

22 (4), 275–284.

Lees, K., Roberts, S., Skamnioti, P., Gurr, S., 2007. Gene microarray analysis using

angular distribution decomposition. Journal of Computational Biology 14 (1), 68–

83.

Leigh, A., Wolfers, J., 2006. Competing approaches to forecasting elections: eco-

nomic models, opinion polling and prediction markets. Economic Record 82 (258),

325–340.

Lemmens, A., Croux, C., 2006. Bagging and boosting classification trees to predict

churn. Journal of Marketing Research 43 (2), 276–286.

Lemon, K. N., White, T. B., Winer, R. S., 2002. Dynamic customer relationship man-

agement: incorporating future considerations into the service retention decision.

Journal of Marketing 66 (1), 1–14.

Leone, L., Perugini, M., Ercolani, A. P., 1999. A comparison of three models of

attitude-behavior relationships in the studying behavior domain. European Jour-

nal of Social Psychology 29 (2-3), 161–189.

Levy, A., 1999. Semi-parametric estimates of demand for intra-LATA telecommuni-

cations. In: Loomis, D. G., Taylor, L. D. (Eds.), The Future of the Telecommunica-

tions Industry: Forecasting and Demand Analysis. Kluwer, Boston, MA, pp. 115–

124.


LGR Telecommunications, 05 Jun 2008a. CDRInsight.

LGR Telecommunications, 05 Jun 2008b. CDRLive.

Li, S.-T., Shue, L.-Y., Lee, S.-F., 2006. Enabling customer relationship management

in ISP services through mining usage patterns. Expert Systems with Applications

30 (4), 621–632.

Li, W., McCallum, A., 2006. Pachinko allocation: DAG-structured mixture models of

topic correlations. In: Cohen, W. W., Moore, A. (Eds.), Proceedings of the 23rd In-

ternational Conference on Machine Learning. ACM, Pittsburgh, PA, pp. 577–584.

Li, Y., Han, J., Yang, J., 2004. Clustering moving objects. In: Kim, W., Kohavi, R.,

Gehrke, J., DuMouchel, W. (Eds.), Proceedings of the Tenth ACM SIGKDD Interna-

tional Conference on Knowledge Discovery and Data Mining. ACM, Seattle, WA,

pp. 617–622.

Li, Y., Lu, L., Li, X., 2005. A hybrid collaborative filtering method for multiple-interests

and multiple-content recommendation in e-commerce. Expert Systems with Ap-

plications 28 (1), 67–77.

Lin, J., Keogh, E. J., Lonardi, S., Chiu, B. Y.-c., 2003. A symbolic representation of time

series, with implications for streaming algorithms. In: Zaki, M. J., Aggarwal, C. C.

(Eds.), Proceedings of the Eighth ACM SIGMOD Workshop on Research Issues in

Data Mining and Knowledge Discovery. ACM, San Diego, CA, pp. 2–11.

Lin, J., Keogh, E. J., Wei, L., Lonardi, S., 2007. Experiencing SAX: a novel symbolic

representation of time series. Data Mining and Knowledge Discovery 15 (2), 107–

144.

Lin, J., Vlachos, M., Keogh, E. J., Gunopulos, D., 2004. Iterative incremental clustering

of time series. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides,

V., Koubarakis, M., Bohm, K., Ferrari, E. (Eds.), Proceedings of the Ninth In-

ternational Conference on Extending Database Technology. Springer, Heraklion,

Greece, pp. 106–122.

Littau, D., Boley, D., 2006a. Clustering very large data sets with principal direction

divisive partitioning. In: Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Mul-

tidimensional Data: Recent Advances in Clustering. Springer, New York.

Littau, D., Boley, D., 2006b. Streaming data reduction using low-memory factored

representations. Information Sciences 176 (14), 2016–2041.

Little, E., Marandi, E., 2003. Relationship Marketing Management. Thomson, Lon-

don.


Liu, A. H., Leach, M. P., Bernhardt, K. L., 2005. Examining customer value percep-

tions of organizational buyers when sourcing from multiple vendors. Journal of

Business Research 58 (5), 559–568.

Liu, B., Xia, Y., Yu, P. S., 2000. Clustering through decision tree construction. In: Pro-

ceedings of the Ninth International Conference on Information and Knowledge

Management. ACM, New York, pp. 20–29.

Liu, G., Li, J., Sim, K., Wong, L., 2007. Distance based subspace clustering with flexi-

ble dimension partitioning. In: Proceedings of the 23rd International Conference

on Data Engineering. IEEE, Istanbul, Turkey, pp. 1250–1254.

Liu, G., Sim, K., Li, J., Wong, L., 2009. Efficient mining of distance-based subspace

clusters. Statistical Analysis and Data Mining 2 (5-6), 427–444.

Liu, T., Bahl, P., Chlamtac, I., 1998. Mobility modeling, location tracking, and tra-

jectory prediction in wireless ATM networks. IEEE Journal on Selected Areas in

Communications 16 (6), 922–936.

Lloyd, A., 2005. The grid and CRM: From ‘if’ to ‘when’? Telecommunications Policy

29 (2-3), 153–172.

MacEachern, S. N., Clyde, M., Liu, J., 1999. Sequential importance sampling for

nonparametric Bayes models: the next generation. The Canadian Journal of Statis-

tics 27 (2), 251–267.

MacEachern, S. N., Muller, P., 1998. Estimating mixture of Dirichlet process models.

Journal of Computational and Graphical Statistics 7 (2), 223–238.

MacKay, D. J. C., 1995. Probable networks and plausible predictions - a review of

practical Bayesian methods for supervised neural networks. Network: Computation in Neu-

ral Systems 6 (3), 469–505.

MacKay, D. J. C., 1998. Choice of basis for Laplace approximation. Machine Learning

33 (1), 77–86.

MacQueen, J. B., 1967. Some methods for classification and analysis of multivariate

observations. In: Cam, L. M. L., Neyman, J. (Eds.), Proceedings of the Fifth Berkeley

Symposium on Mathematical Statistics and Probability. University of California,

Berkeley, CA, pp. 281–297.

Madeira, S. C., Oliveira, A. L., 2004. Biclustering algorithms for biological data analy-

sis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformat-

ics 1 (1), 24–45.

Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., Ridgeway, G., 2002.

Likelihood-based data squashing: a modeling approach to instance construction.

Data Mining and Knowledge Discovery 6 (2), 173–190.


Madigan, D., Ridgeway, G., 2003. Bayesian data analysis. In: Ye, N. (Ed.), The Hand-

book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-

ciates, Mahwah, NJ.

Mahalanobis, P. C., 1936. On the generalized distance in statistics. Proceedings of the

National Institute of Sciences of India 2 (1), 49–55.

Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., Cheung, D. W., 2004.

Mining, indexing, and querying historical spatiotemporal data. In: Kim, W., Ko-

havi, R., Gehrke, J., DuMouchel, W. (Eds.), Proceedings of the Tenth ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining. ACM, Seat-

tle, WA, pp. 236–245.

Manku, G. S., Rajagopalan, S., Lindsay, B. G., 1998. Approximate medians and other

quantiles in one pass and with limited memory. In: Haas, L. M., Tiwary, A. (Eds.),

Proceedings of the 1998 ACM SIGMOD International Conference on Management

of Data. ACM, Seattle, WA, pp. 426–435.

Manku, G. S., Rajagopalan, S., Lindsay, B. G., 1999. Random sampling techniques for

space efficient online computation of order statistics of large datasets. In: Delis,

A., Faloutsos, C., Ghandeharizadeh, S. (Eds.), Proceedings of the 1999 ACM SIG-

MOD International Conference on Management of Data. ACM, Philadelphia, PA,

pp. 251–262.

Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A., Theodoridis, Y., 2005. R-

Trees: Theory and Applications. Advanced Information and Knowledge Process-

ing. Springer, London.

Mao, J., Jain, A. K., 1996. A self-organizing network for hyperellipsoidal clustering

(HEC). IEEE Transactions on Neural Networks 7 (1), 16–29.

Mardia, K. V., Jupp, P. E., 2000. Directional Statistics, 2nd Edition. Wiley Series in

Probability and Statistics. Wiley, Chichester, UK.

Marin, J.-M., Robert, C. P., 2007. Bayesian Core: A Practical Approach to Computa-

tional Bayesian Statistics. Springer Texts in Statistics. Springer, New York.

Marinucci, M., Perez-Amaral, T., 2005. Econometric modeling of business telecom-

munications demand using RETINA and finite mixtures. Tech. rep., Facultad

de Ciencias Economicas y Empresariales, Universidad Complutense de Madrid,

Madrid, Spain.

Marron, J. S., Wand, M. P., 1992. Exact mean integrated squared error. The Annals of

Statistics 20 (2), 712–736.

Maslow, A. H., 1954. Motivation and Personality. Harper's Psychological Series.

HarperCollins, New York.


Matias, Y., Urieli, D., 2005. Optimal workload-based weighted wavelet synopses.

In: Eiter, T., Libkin, L. (Eds.), Proceedings of the Tenth International Conference

Database Theory. Vol. 3363. Springer, Edinburgh, UK, pp. 368–382.

Matias, Y., Vitter, J. S., Wang, M., 1998. Wavelet-based histograms for selectivity esti-

mation. In: Haas, L. M., Tiwary, A. (Eds.), Proceedings of the 1998 ACM SIGMOD

International Conference on Management of Data. ACM, Seattle, WA, pp. 448–459.

Matias, Y., Vitter, J. S., Wang, M., 2000. Dynamic maintenance of wavelet-based his-

tograms. In: Abbadi, A. E., Brodie, M. L., Chakravarthy, S., Dayal, U., Kamel, N.,

Schlageter, G., Whang, K.-Y. (Eds.), Proceedings of the 26th International Confer-

ence on Very Large Data Bases. Morgan Kaufmann, Cairo, Egypt, pp. 101–110.

Maugis, C., Celeux, G., Martin-Magniette, M.-L., 2009. Variable selection for cluster-

ing with Gaussian mixture models. Biometrics 65 (3), 701–709.

McCallum, A., Nigam, K., Ungar, L. H., 2000. Efficient clustering of high-dimensional

data sets with application to reference matching. In: Proceedings of the Sixth

ACM SIGKDD International Conference on Knowledge Discovery and Data Min-

ing. ACM, Boston, MA, pp. 169–178.

McCarthy, E. J., 1978. Basic Marketing: A Managerial Approach, 6th Edition. Richard

D. Irwin, Homewood, IL.

McDonald, M., Dunbar, I., 2004. Market Segmentation: How to Do It, How to Profit

from It, 3rd Edition. Elsevier, Oxford, UK.

McGrory, C. A., 2006. Variational approximations in Bayesian model selection. Ph.D.

thesis, University of Glasgow.

McGrory, C. A., Titterington, D. M., 2007. Variational approximations in Bayesian

model selection for finite mixture distributions. Computational Statistics & Data

Analysis 51 (11), 5352–5367.

McHugh, R. B., 1956. Efficient estimation and local identification in latent class anal-

ysis. Psychometrika 21 (4), 331–347.

McLachlan, G., Krishnan, T., 2008. The EM Algorithm and Extensions, 2nd Edition.

Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.

McLachlan, G. J., Peel, D., 2000. Finite Mixture Models. Wiley Series in Probability

and Statistics. Wiley, New York.

McLachlan, G. J., Peel, D., Basford, K. E., Adams, P., 1999. The EMMIX software for

the fitting of mixtures of normal and t-components. Journal of Statistical Software

4 (2), 1–4.


McLachlan, G. J., Peel, D., Bean, R. W., 2003. Modelling high-dimensional data by

mixtures of factor analyzers. Computational Statistics & Data Analysis 41 (3), 379–

388.

McVinish, R., Mengersen, K., 2008. Semiparametric Bayesian circular statistics.

Computational Statistics & Data Analysis 52 (10), 4722–4730.

Mengersen, K., Robert, C., 1994. Testing for mixtures: a Bayesian entropy approach.

In: Bernardo, J. M., Berger, J. O., Dawid, A. P., Smith, A. F. M. (Eds.), Proceedings of

the Fifth Valencia International Meeting. Clarendon, Alicante, Spain.

Mengersen, K. L., Tweedie, R. L., 1996. Rates of convergence of the Hastings and

Metropolis algorithms. The Annals of Statistics 24 (1), 101–121.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E., 1953.

Equation of state calculations by fast computing machines. Journal of Chemical

Physics 21 (6), 1087–1092.

Milligan, G., Cooper, M., 1985. An examination of procedures for determining the

number of clusters in a data set. Psychometrika 50 (2), 159–179.

Milligan, G. W., 1980. An examination of the effect of six types of error perturbation

on fifteen clustering algorithms. Psychometrika 45 (3), 325–342.

Minka, T. P., 2001. Expectation propagation for approximate Bayesian inference. In:

Breese, J. S., Koller, D. (Eds.), Proceedings of the 17th Conference in Uncertainty in

Artificial Intelligence. Morgan Kaufmann, Seattle, WA, pp. 362–369.

Minka, T. P., Ghahramani, Z., 2003. Expectation propagation for infinite mixtures.

In: NIPS’03 Workshop on Nonparametric Bayesian Methods and Infinite Models.

Whistler, BC, Canada.

Mitchell, A., 1983. The Nine American Lifestyles. Macmillan, New York.

Mitussis, D., O’Malley, L., Patterson, M., 2006. Mapping the re-engagement of CRM

with relationship marketing. European Journal of Marketing 40 (5-6), 572–589.

Moise, G., Sander, J., 2008. Finding non-redundant, statistically significant regions

in high dimensional data: a novel approach to projected and subspace clustering.

In: Li, Y., Liu, B., Sarawagi, S. (Eds.), Proceedings of the 14th ACM SIGKDD Inter-

national Conference on Knowledge Discovery and Data Mining. ACM, Las Vegas,

NV, pp. 533–541.

Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and

Information Systems 14 (3), 273–298.

Morgan, R. M., Hunt, S. D., 1994. The commitment-trust theory of relationship mar-

keting. Journal of Marketing 58 (3), 20–38.


Mozer, M. C., Wolniewicz, R., Grimes, D. B., Johnson, E., Kaushansky, H., 2000. Pre-

dicting subscriber dissatisfaction and improving retention in the wireless telecom-

munications industry. IEEE Transactions on Neural Networks 11 (3), 690–696.

Muralikrishna, M., DeWitt, D. J., 1988. Equi-depth histograms for estimating selec-

tivity factors for multi-dimensional queries. In: Boral, H., Larson, P.-A. (Eds.), Pro-

ceedings of the 1988 ACM SIGMOD International Conference on Management of

Data. ACM, Chicago, IL, pp. 28–36.

Murray, K. B., 1991. A test of services marketing theory: consumer information ac-

quisition activities. Journal of Marketing 55 (1), 10–25.

Murtagh, F., Starck, J.-L., Berry, M. W., 2000. Overcoming the curse of dimensionality

in clustering by means of the wavelet transform. The Computer Journal 43 (2), 107–

120.

Muthukrishnan, S., 2005. Data streams: Algorithms and applications. Foundations

and Trends in Theoretical Computer Science 1 (2), 117–236.

Muthukrishnan, S., Poosala, V., Suel, T., 1999. On rectangular partitionings in two

dimensions: Algorithms, complexity, and applications. In: Beeri, C., Buneman, P.

(Eds.), Proceedings of the Seventh International Conference on Database Theory.

Vol. 1540. Springer, Jerusalem, Israel, pp. 236–256.

Muthukrishnan, S., Strauss, M., 2004. Approximate histogram and wavelet sum-

maries of streaming data. Tech. Rep. TR: 2004-52, DIMACS, Rutgers, The State University

of New Jersey, Piscataway, NJ.

Nagesh, H. S., Goil, S., Choudhary, A. N., 2000. Adaptive grids for clustering massive

data sets. In: Proceedings of the 2000 International Conference on Parallel Pro-

cessing. IEEE, Toronto, ON, Canada, pp. 477–484.

Nanopoulos, A., Alcock, R., Manolopoulos, Y., 2001. Feature-based classification of

time-series data. In: Information Processing and Technology. Nova, New York, pp.

49–61.

Neal, R. M., 2000. Markov chain sampling methods for Dirichlet process mixture

models. Journal of Computational and Graphical Statistics 9 (2), 249–265.

Neal, R. M., Hinton, G. E., 1998. A view of the EM algorithm that justifies incremental,

sparse, and other variants. In: Jordan, M. I. (Ed.), Learning in Graphical Models.

MIT, Cambridge, MA, pp. 355–368.

Neill, D. B., Moore, A. W., Sabhnani, M., Daniel, K., 2005. Detection of emerging

space-time clusters. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceed-

ings of the Eleventh ACM SIGKDD International Conference on Knowledge Discov-

ery in Data Mining. ACM, Chicago, IL, pp. 218–227.


Neslin, S. A., Gupta, S., Kamakura, W., Lu, J., Mason, C. H., 2006. Defection de-

tection: measuring and understanding the predictive accuracy of customer churn

models. Journal of Marketing Research 43 (2), 204–211.

Newcomb, S., 1886. A generalized theory of the combination of observations so as to

obtain the best result. American Journal of Mathematics 8 (4), 343–366.

Ng, E. K. K., Fu, A. W.-C., Wong, R. C.-W., 2005. Projective clustering by histograms.

IEEE Transactions on Knowledge and Data Engineering 17 (3), 369–383.

Ng, R. T., Han, J., 1994. Efficient and effective clustering methods for spatial data

mining. In: Bocca, J. B., Jarke, M., Zaniolo, C. (Eds.), Proceedings of the 20th In-

ternational Conference on Very Large Data Bases. Morgan Kaufmann, Santiago de

Chile, Chile, pp. 144–155.

Niraj, R., Gupta, M., Narasimhan, C., 2001. Customer profitability in a supply chain.

Journal of Marketing 65 (3), 1–16.

Novak, T. P., MacEvoy, B., 1990. On comparing alternative segmentation schemes: the

list of values (LOV) and values and life styles (VALS). Journal of Consumer Research

17 (1), 105–109.

Nurmi, P., Koolwaaij, J., 2006. Identifying meaningful locations. In: Proceedings of

the Third Annual International Conference on Mobile and Ubiquitous Systems:

Networks and Services. IEEE, San Jose, CA, pp. 1–8.

O’Callaghan, L., Meyerson, A., Motwani, R., Mishra, N., Guha, S., 2002. Streaming-

data algorithms for high-quality clustering. In: Proceedings of the 18th Interna-

tional Conference on Data Engineering. IEEE, San Jose, CA, pp. 685–696.

O’Malley, L., Patterson, M., Evans, M. J., 1999. Exploring Direct Marketing. Thomson,

London, UK.

Opper, M., Saad, D. (Eds.), 2001. Advanced Mean Field Methods: Theory and Prac-

tice. MIT Press, Cambridge, MA.

Ordonez, C., Omiecinski, E., 2004. Efficient disk-based k-means clustering for rela-

tional databases. IEEE Transactions on Knowledge and Data Engineering 16 (8),

909–921.

Ordonez, C., Omiecinski, E., 2005. Accelerating EM clustering to find high-quality

solutions. Knowledge and Information Systems 7 (2), 135–157.

Ouellette, J. A., Wood, W., 1998. Habit and intention in everyday life: the multiple

processes by which past behavior predicts future behavior. Psychological Bulletin

124 (1), 54–74.


Owen, A. B., 2003. Data squashing by empirical likelihood. Data Mining and Knowl-

edge Discovery 7 (1), 101–113.

Palpanas, T., Vlachos, M., Keogh, E. J., Gunopulos, D., Truppel, W., 2004. Online am-

nesic approximation of streaming time series. In: Proceedings of the 20th Interna-

tional Conference on Data Engineering. IEEE, Boston, MA, pp. 338–349.

Papadimitriou, S., Sun, J., Faloutsos, C., 2007. Dimensionality reduction and fore-

casting on streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algo-

rithms. Advances in Database Systems. Springer, New York.

Park, B.-H., Kargupta, H., 2003. Distributed data mining. In: Ye, N. (Ed.), The Hand-

book of Data Mining. Human Factors and Ergonomics. Lawrence Erlbaum Asso-

ciates, Mahwah, NJ.

Parsons, L., Haque, E., Liu, H., 2004. Subspace clustering for high dimensional data:

a review. SIGKDD Explorations 6 (1), 90–105.

Parthasarathy, S., Ghoting, A., Otey, M. E., 2007. A survey of distributed mining of

data streams. In: Aggarwal, C. C. (Ed.), Data Streams: Models and Algorithms. Ad-

vances in Database Systems. Springer, New York.

Patrikainen, A., Meila, M., 2006. Comparing subspace clusterings. IEEE Transactions

on Knowledge and Data Engineering 18 (7), 902–916.

Payne, A., Christopher, M., Clark, M., Peck, H., 1998. Relationship Marketing

for Competitive Advantage: Winning and Keeping Customers, 2nd Edition.

Butterworth-Heinemann, Oxford, UK.

Pena, D., Prieto, F. J., 2001. Multivariate outlier detection and robust covariance ma-

trix estimation. Technometrics 43 (3), 286–310.

Pearson, K., 1894. Contributions to the mathematical theory of evolution. Philosoph-

ical Transactions of the Royal Society of London 185, 71–110.

Peng, Z. K., Chu, F. L., 2004. Application of the wavelet transform in machine con-

dition monitoring and fault diagnostics: a review with bibliography. Mechanical

Systems and Signal Processing 18 (2), 199–221.

Pennell, M. L., Dunson, D. B., 2007. Fitting semiparametric random effects models

to large data sets. Biostatistics 8 (4), 821–834.

Peppers, D., Rogers, M., Dorf, B., 1999. Is your company ready for one-to-one mar-

keting? Harvard Business Review 77 (1), 151–160.

Perkins, C. E., 2001. Ad Hoc Networking. Addison-Wesley, Boston, MA.


Perlman, E., Java, A., 2003. Predictive mining of time series data in astronomy. In:

Astronomical Data Analysis Software and Systems XII ASP Conference Series. Vol.

295. pp. 431–434.

Peter, J. P., Olson, J. C., 1983. Is science marketing? Journal of Marketing 47 (4), 111–

125.

Pewsey, A., 2008. The wrapped stable family of distributions as a flexible model for

circular data. Computational Statistics & Data Analysis 52 (3), 1516–1523.

Pham, D. T., Dimov, S. S., Nguyen, C. D., 2004. An incremental k-means algorithm.

Proceedings of the Institution of Mechanical Engineers, Part C 218 (7), 783–795.

Piatetsky-Shapiro, G., Connell, C., 1984. Accurate estimation of the number of tuples

satisfying a condition. In: Yormark, B. (Ed.), Proceedings of the 1984 ACM SIGMOD

International Conference on Management of Data. ACM, Boston, MA, pp. 256–276.

Pizzuti, C., Talia, D., 2003. P-AutoClass: scalable parallel clustering for mining large

data sets. IEEE Transactions on Knowledge and Data Engineering 15 (3), 629–641.

Polymenis, A., Titterington, D. M., 1998. On the determination of the number of com-

ponents in a mixture. Statistics & Probability Letters 38 (4), 295–298.

Poosala, V., Ganti, V., 1999. Fast approximate answers to aggregate queries on a data

cube. In: Ozsoyoglu, Z. M., Ozsoyoglu, G., Hou, W.-C. (Eds.), Proceedings of the

11th International Conference on Scientific and Statistical Database

Management. IEEE, Cleveland, OH, pp. 24–33.

Poosala, V., Ioannidis, Y. E., 1997. Selectivity estimation without the attribute value

independence assumption. In: Carey, M. J., Dittrich, K. R., Lochovsky, F. H.,

Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International Con-

ference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 486–495.

Poosala, V., Ioannidis, Y. E., Haas, P. J., Shekita, E. J., 1996. Improved histograms for

selectivity estimation of range predicates. In: Jagadish, H. V., Mumick, I. S. (Eds.),

Proceedings of the 1996 ACM SIGMOD International Conference on Management

of Data. ACM, Montreal, QC, Canada, pp. 294–305.

Priebe, C. E., 1994. Adaptive mixtures. Journal of the American Statistical Association

89 (427), 796–806.

Procopiuc, C. M., Jones, M., Agarwal, P. K., Murali, T. M., 2002. A Monte Carlo algo-

rithm for fast projective clustering. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.),

Proceedings of the 2002 ACM SIGMOD International Conference on Management

of Data. ACM, Madison, WI, pp. 418–427.

Punj, G., Stewart, D. W., 1983. Cluster analysis in marketing research: review and

suggestions for application. Journal of Marketing Research 20 (2), 134–148.


Rachev, S. T., Hsu, J. S. J., Bagasheva, B. S., Fabozzi, F. J., 2008. Bayesian Methods in Finance. The Frank J. Fabozzi Series. John Wiley & Sons, Inc., Hoboken, NJ.

Raftery, A. E., 1996. Hypothesis testing and model selection. In: Gilks, W. R., Richard-

son, S., Spiegelhalter, D. J. (Eds.), Markov Chain Monte Carlo in Practice. Chapman

& Hall, London, pp. 163–188 (Chapter 10).

Raftery, A. E., Dean, N., 2006. Variable selection for model-based clustering. Journal

of the American Statistical Association 101 (473), 168–178.

Ratanamahatana, C., Keogh, E. J., Bagnall, A. J., Lonardi, S., 2005. A novel bit level

time series representation with implication of similarity search and clustering.

In: Ho, T. B., Cheung, D. W.-L., Liu, H. (Eds.), Proceedings of the Ninth Pacific-

Asia Conference Advances in Knowledge Discovery and Data Mining. Vol. 3518.

Springer, Hanoi, Vietnam, pp. 771–777.

Reichheld, F. F., 1996. The Loyalty Effect: The Hidden Force Behind Growth, Profits,

and Lasting Value. Harvard Business School, Boston, MA.

Reichheld, F. F., Sasser Jr, W. E., 1990. Zero defections: quality comes to services.

Harvard Business Review 68 (5), 105–111.

Reinartz, W. J., Kumar, V., 2000. On the profitability of long-life customers in a non-

contractual setting: an empirical investigation and implications for marketing.

Journal of Marketing 64 (4), 17–35.

Richardson, S., Green, P. J., 1997. On Bayesian analysis of mixtures with an unknown

number of components (with discussion). Journal of the Royal Statistical Society:

Series B (Statistical Methodology) 59 (4), 731–792.

Richins, M. L., 1983. Negative word-of-mouth by dissatisfied consumers: a pilot

study. Journal of Marketing 47 (1), 68–78.

Ridgeway, G., Madigan, D., 2002. Bayesian analysis of massive datasets via particle

filters. In: Proceedings of the Eighth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining. ACM, Edmonton, AB, Canada, pp. 5–13.

Ridgeway, G., Madigan, D., 2003. A sequential Monte Carlo method for Bayesian

analysis of massive datasets. Data Mining and Knowledge Discovery 7 (3), 301–

319.

Rigby, D. K., Ledingham, D., 2004. CRM done right. Harvard Business Review 82 (11),

118–129.

Rigby, D. K., Reichheld, F. F., Schefter, P., 2002. Avoid the four perils of CRM. Harvard

Business Review 80 (2), 101–109.


Rissanen, J., 1983. A universal prior for integers and estimation by minimum descrip-

tion length. Annals of Statistics 11 (2), 416–431.

Robert, C. P., Casella, G., 1999. Monte Carlo Statistical Methods. Springer Texts in

Statistics. Springer, New York.

Groves, R. M., 2006. Nonresponse rates and nonresponse bias in household surveys.

Public Opinion Quarterly 70 (5), 646–675.

Roberts, S. J., Husmeier, D., Rezek, I., Penny, W., 1998. Bayesian approaches to Gaus-

sian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intel-

ligence 20 (11), 1133–1142.

Roeder, K., Wasserman, L., 1997. Practical Bayesian density estimation using mix-

tures of normals. Journal of the American Statistical Association 92 (439), 894–902.

Rokeach, M., 1973. The Nature of Human Values. Free Press, New York.

Rossi, P. E., Allenby, G. M., McCulloch, R., 2005. Bayesian Statistics and Marketing.

John Wiley & Sons, Inc.

Rue, H., Martino, S., Chopin, N., 2009. Approximate Bayesian inference for latent

Gaussian models by using integrated nested Laplace approximations. Journal of

the Royal Statistical Society: Series B (Statistical Methodology) 71 (2), 319–392.

Rust, R. T., Verhoef, P. C., 2005. Optimizing the marketing interventions mix in

intermediate-term CRM. Marketing Science 24 (3), 477–489.

Sakurai, Y., Papadimitriou, S., Faloutsos, C., 2005. BRAID: Stream mining through

group lag correlations. In: Ozcan, F. (Ed.), Proceedings of the 2005 ACM SIGMOD

International Conference on Management of Data. ACM, Baltimore, MD, pp. 599–

610.

Sander, J., Ester, M., Kriegel, H.-P., Xu, X., 1998. Density-based clustering in spatial

databases: the algorithm GDBSCAN and its applications. Data Mining and Knowl-

edge Discovery 2 (2), 169–194.

SAS Institute Inc., 1983. Cubic clustering criterion. Technical Report A-108, SAS Institute Inc., Cary, NC.

Schervish, M. J., 1996. P values: what they are and what they are not. The American

Statistician 50 (3), 203–206.

Schiffman, L. G., Kanuk, L. L., 2004. Consumer Behavior, 8th Edition. Pearson, Upper

Saddle River, NJ.

Schikuta, E., Erhart, M., 1997. The BANG-clustering system: grid-based data anal-

ysis. In: Liu, X., Cohen, P. R., Berthold, M. R. (Eds.), Proceedings of the Advances


in Intelligent Data Analysis, Reasoning about Data, Second International Sympo-

sium. Springer, London, pp. 513–524.

Schmittlein, D. C., Peterson, R. A., 1994. Customer base analysis: an industrial pur-

chase process application. Marketing Science 13 (1), 41–67.

Schultz, D. E., 1995. From the editor: the technological challenges to traditional direct

marketing. Journal of Direct Marketing 9 (1), 5–7.

Schwartz, S. H., Bilsky, W., 1987. Toward a universal psychological structure of hu-

man values. Journal of Personality and Social Psychology 53 (3), 550–562.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics

6 (2), 461–464.

Schweller, R. T., Gupta, A., Parsons, E., Chen, Y., 2004. Reversible sketches for effi-

cient and accurate change detection over network data streams. In: Lombardo, A.,

Kurose, J. F. (Eds.), Proceedings of the 4th ACM SIGCOMM Conference on Internet

Measurement. ACM, Taormina, Sicily, Italy, pp. 207–212.

Scott, D. W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualiza-

tion. Wiley Series in Probability and Statistics. Wiley, New York.

Scott, D. W., 2009. Sturges’ rule. Wiley Interdisciplinary Reviews: Computational

Statistics 1 (3), 303–306.

Scott, D. W., Sain, S. R., 2005. Multidimensional density estimation. In: Rao, C. R., Weg-

man, E. J., Solka, J. L. (Eds.), Handbook of Statistics 24: Data Mining and Data

Visualization. Vol. 24 of Handbook of Statistics. Elsevier, San Diego, CA, pp. 229–

261.

Scrucca, L., 2010. Dimension reduction for model-based clustering. Statistics and

Computing 20, 471–484.

Sequeira, K., Zaki, M. J., 2004. SCHISM: A new approach for interesting subspace

mining. In: Proceedings of the Fourth IEEE International Conference on Data Min-

ing. IEEE, Brighton, UK, pp. 186–193.

Shapiro, B. P., Rangan, V. K., Moriarty Jr., R. T., Ross, E. B., 1987. Manage cus-

tomers for profits (not just sales). Harvard Business Review 65 (5), 101–108.

Shaw, M. J., Subramaniam, C., Tan, G. W., Welge, M. E., 2001. Knowledge manage-

ment and data mining for marketing. Decision Support Systems 31 (1), 127–137.

Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: a multi-resolution

clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O.,

Widom, J. (Eds.), Proceedings of the 24th International Conference on Very Large

Data Bases. Morgan Kaufmann, New York, pp. 428–439.


Sheskin, D. J., 2004. Handbook of Parametric and Nonparametric Statistical Proce-

dures, 3rd Edition. Chapman & Hall, Boca Raton, FL.

Sheth, J. N., Parvatiyar, A., 1995. Relationship marketing in consumer markets: an-

tecedents and consequences. Journal of the Academy of Marketing Science 23 (4),

255–271.

Shieh, J., Keogh, E. J., 2008. iSAX: indexing and mining terabyte sized time series. In:

Li, Y., Liu, B., Sarawagi, S. (Eds.), Proceedings of the 14th ACM SIGKDD Interna-

tional Conference on Knowledge Discovery and Data Mining. ACM, Las Vegas, NV,

pp. 623–631.

Shimp, T. A., 2007. Advertising, Promotion and Other Aspects of Integrated Market-

ing Communications, 7th Edition. Thomson, Mason, OH.

Silverman, B. W., 1986. Density Estimation for Statistics and Data Analysis. Mono-

graphs on Statistics and Applied Probability. Chapman & Hall, New York.

Skinner, B. F., 1974. About Behaviorism. Vintage, New York.

Smith, K. A., Willis, R. J., Brooks, M., 2000. An analysis of customer retention and

insurance claim patterns using data mining: a case study. Journal of the Operational

Research Society 51 (5), 532–541.

Smith, S. P., Jain, A. K., 1984. Testing for uniformity in multidimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1), 73–81.

Smith, W. R., 1956. Product differentiation and market segmentation as alternative

marketing strategies. Journal of Marketing 21 (1), 3–8.

Smyth, P., 2000. Model selection for probabilistic clustering using cross-validated

likelihood. Statistics and Computing 10 (1), 63–72.

Solomon, M. R., 2004. Consumer Behavior: Buying, Having, and Being, 6th Edition.

Prentice Hall, Upper Saddle River, NJ.

Spiegelhalter, D., Best, N., Carlin, B., Van der Linde, A., 2002. Bayesian measures of

model complexity and fit. Journal of the Royal Statistical Society: Series B (Statis-

tical Methodology) 64 (4), 583–639.

Srikant, R., Agrawal, R., 1996. Mining sequential patterns: generalizations and per-

formance improvements. In: Apers, P. M. G., Bouzeghoub, M., Gardarin, G. (Eds.),

Proceedings of the Fifth International Conference on Extending Database Tech-

nology. Vol. 1057. Springer, Avignon, France, pp. 3–17.

Stephens, M., 2000. Bayesian analysis of mixture models with an unknown number

of components - an alternative to reversible jump methods. The Annals of Statis-

tics 28 (1), 40–74.


Stephens, M. A., 1970. Use of the Kolmogorov-Smirnov, Cramer-von Mises and re-

lated statistics without extensive tables. Journal of the Royal Statistical Society: Se-

ries B (Statistical Methodology) 32 (1), 115–122.

Sterne, J. A. C., Smith, G. D., Cox, D. R., 2001. Sifting the evidence - what’s wrong with

significance tests? British Medical Journal 322 (7280), 226–231.

Stollnitz, E. J., DeRose, A. D., Salesin, D. H., 1996. Wavelets for Computer Graph-

ics. The Morgan Kaufmann Series in Computer Graphics. Morgan Kaufmann, San

Francisco, CA.

Stone, M., Bond, A., Foss, B., 2004. Consumer Insight: How to Use Data and Mar-

ket Research to Get Closer to Your Customer. Market Research in Practice Series.

Kogan Page, London.

Storbacka, K. E., 1997. Segmentation based on customer profitability - retrospective

analysis of retail bank customer bases. Journal of Marketing Management 13 (5),

479–492.

Strouse, K. G., 2004. Customer-centered Telecommunications Services Marketing.

Artech House telecommunications library. Artech House, Boston, MA.

Stryker, S., Burke, P. J., 2000. The past, present, and future of an identity theory. Social

Psychology Quarterly 63 (4), 284–297.

Sturges, H. A., 1926. The choice of a class interval. Journal of the American Statistical

Association 21 (153), 65–66.

Svensen, M., Bishop, C. M., 2005. Robust Bayesian mixture modelling. Neurocom-

puting 64, 235–252.

Symons, M. J., 1981. Clustering criteria and multivariate normal mixtures. Biomet-

rics 37 (1), 35–43.

Tan, P., Steinbach, M., Kumar, V., Potter, C., Klooster, S., Torregrosa, A., 2001. Finding

spatio-temporal patterns in earth science data. In: Proceedings of the KDD 2001

Workshop on Temporal Data Mining. ACM, San Francisco, CA.

Tanay, A., Sharan, R., Shamir, R., 2006. Biclustering algorithms: a survey. In: Aluru,

S. (Ed.), Handbook of Computational Molecular Biology. Chapman & Hall, Boca

Raton, FL.

Teh, Y. W., Jordan, M. I., Beal, M. J., Blei, D. M., 2006. Hierarchical Dirichlet processes.

Journal of the American Statistical Association 101 (476), 1566–1581.

Teschendorff, A. E., Wang, Y., Barbosa-Morais, N. L., Brenton, J. D., Caldas, C., 2005.

A variational Bayesian mixture modelling framework for cluster analysis of gene-

expression data. Bioinformatics 21 (13), 3025–3033.


Thaper, N., Guha, S., Indyk, P., Koudas, N., 2002. Dynamic multidimensional his-

tograms. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002

ACM SIGMOD International Conference on Management of Data. ACM, Madison,

WI, pp. 428–439.

Thompson, C. J., Locander, W. B., Pollio, H. R., 1989. Putting consumer experi-

ence back into consumer research: the philosophy and method of existential-

phenomenology. Journal of Consumer Research 16 (2), 133–146.

Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of clusters in a

data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Sta-

tistical Methodology) 63 (2), 411–423.

Tierney, L., 1994. Markov chains for exploring posterior distributions. The Annals of

Statistics 22 (4), 1701–1728.

Tipping, M. E., Bishop, C. M., 1999. Mixtures of probabilistic principal component

analyzers. Neural Computation 11 (2), 443–482.

Titterington, D. M., Smith, A. F. M., Makov, U. E., 1985. Statistical Analysis of Finite

Mixture Distributions. Wiley Series in Probability and Mathematical Statistics. Wi-

ley, New York.

Train, K. E., McFadden, D. L., Ben-Akiva, M., 1987. The demand for local telephone

service: a fully discrete model of residential calling patterns and service choices.

RAND Journal of Economics 18 (1), 109–123.

Tung, A. K. H., Xu, X., Ooi, B. C., 2005. CURLER: finding and visualizing nonlinear

correlation clusters. In: Proceedings of the 2005 ACM SIGMOD International Con-

ference on Management of Data. ACM, New York, pp. 467–478.

Twedt, D. W., 1967. How does brand awareness-attitude affect marketing strategy?

Journal of Marketing 31 (4), 64–66.

Ueda, N., Ghahramani, Z., 2002. Bayesian model search for mixture models based

on optimizing variational bounds. Neural Networks 15 (10), 1223–1241.

Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G. E., 2000. SMEM algorithm for mix-

ture models. Neural Computation 12 (9), 2109–2128.

Van Mechelen, I., Bock, H.-H., Boeck, P. D., 2004. Two-mode clustering methods: a

structured overview. Statistical Methods in Medical Research 13 (5), 363–394.

van Raaij, E. M., Vernooij, M. J. A., van Triest, S., 2003. The implementation of cus-

tomer profitability analysis: a case study. Industrial Marketing Management 32 (7),

573–583.


Vasconcelos, N., Lippman, A., 1998. Learning mixture hierarchies. In: Kearns, M. J.,

Solla, S. A., Cohn, D. A. (Eds.), Proceedings of the 1998 Neural Information Pro-

cessing Systems. MIT, Denver, CO, pp. 606–612.

Venkatesan, R., Kumar, V., Bohling, T., 2007. Optimal customer relationship man-

agement using Bayesian decision theory: an application for customer selection.

Journal of Marketing Research 44 (4), 579–594.

Verhoef, P. C., Donkers, B., 2001. Predicting customer potential value an application

in the insurance industry. Decision Support Systems 32 (2), 189–199.

Veroff, J., Douvan, E., Kulka, R. A., 1981. The Inner American: A Self-Portrait from

1957 to 1976. Basic Books, New York.

Verplanken, B., Aarts, H., van Knippenberg, A., Moonen, A., 1998. Habit versus

planned behaviour: a field experiment. The British Journal of Social Psychology

37 (1), 111–128.

Vitter, J. S., 1985. Random sampling with a reservoir. ACM Transactions on Mathe-

matical Software 11 (1), 37–57.

Vitter, J. S., 2008. External memory algorithms and data structures: dealing with mas-

sive data. Foundations and Trends in Theoretical Computer Science 2 (4), 305–474.

Vitter, J. S., Wang, M., 1999. Approximate computation of multidimensional aggre-

gates of sparse data using wavelets. In: Delis, A., Faloutsos, C., Ghandeharizadeh,

S. (Eds.), Proceedings of the ACM SIGMOD International Conference on Manage-

ment of Data. ACM, Philadelphia, PA, pp. 193–204.

Vitter, J. S., Wang, M., Iyer, B. R., 1998. Data cube approximation and histograms

via wavelets. In: Gardarin, G., French, J. C., Pissinou, N., Makki, K., Bouganim, L.

(Eds.), Proceedings of the 1998 ACM CIKM International Conference on Informa-

tion and Knowledge Management. ACM, Bethesda, MD, pp. 96–104.

Vlachos, M., Hadjieleftheriou, M., Keogh, E., Gunopulos, D., 2005. Indexing multi-dimensional trajectories for similarity

queries. In: Manolopoulos, Y., Papadopoulos, A. N., Vassilakopoulos, M. G. (Eds.),

Spatial Databases: Technologies, Techniques and Trends. IGI, London, pp. 107–

128.

Volkovich, Z., Kogan, J., Nicholas, C., 2006. Sampling methods for building initial par-

titions. In: Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional

Data: Recent Advances in Clustering. Springer, New York.

Wainwright, M. J., Jordan, M. I., 2003. Graphical models, exponential families, and

variational inference. Technical Report 649, Department of Statistics, University of California, Berkeley, CA.


Wallace, C. S., Dowe, D. L., 1994. Intrinsic classification by MML - the Snob pro-

gram. In: Proceedings of the Seventh Australian Joint Conference on Artificial In-

telligence. World Scientific, Singapore, pp. 37–44.

Wallace, C. S., Dowe, D. L., 2000. MML clustering of multi-state, Poisson, von Mises

circular and Gaussian distributions. Statistics and Computing 10 (1), 73–83.

Wallace, C. S., Freeman, P. R., 1987. Estimation and inference by compact coding.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) 49 (3),

240–265.

Wallach, H. M., Dicker, L., Jensen, S. T., Heller, K. A., 2010. An alternative prior pro-

cess for nonparametric Bayesian clustering. In: Proceedings of the 13th Interna-

tional Conference on Artificial Intelligence and Statistics. Sardinia, Italy.

Wang, B., Titterington, D. M., 2006. Convergence properties of a general algo-

rithm for calculating variational Bayesian estimates for a normal mixture model.

Bayesian Analysis 1 (3), 625–650.

Wang, H., Fan, W., Yu, P. S., Han, J., 2003. Mining concept-drifting data streams us-

ing ensemble classifiers. In: Getoor, L., Senator, T. E., Domingos, P., Faloutsos,

C. (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining. ACM, Washington, DC, pp. 226–235.

Wang, H., Wang, W., Yang, J., Yu, P. S., 2002a. Clustering by pattern similarity in large

data sets. In: Franklin, M. J., Moon, B., Ailamaki, A. (Eds.), Proceedings of the 2002

ACM SIGMOD International Conference on Management of Data. ACM, Madison,

WI, pp. 394–405.

Wang, H.-F., Hong, W.-K., 2006. Managing customer profitability in a competitive

market by continuous data mining. Industrial Marketing Management 35 (6), 715–

723.

Wang, M., Chan, N. H., Papadimitriou, S., Faloutsos, C., Madhyastha, T. M., 2002b.

Data mining meets performance evaluation: fast algorithms for modeling bursty

traffic. In: Proceedings of the 18th International Conference on Data Engineering.

IEEE, San Jose, CA, pp. 507–516.

Wang, N., Raftery, A. E., 2002. Nearest-neighbor variance estimation (NNVE). Journal

of the American Statistical Association 97 (460), 994–1019.

Wang, S. J., Woodward, W. A., Gray, H. L., Wiechecki, S., Sain, S. R., 1997a. A new test

for outlier detection from a multivariate mixture distribution. Journal of Compu-

tational and Graphical Statistics 6 (3), 285–299.

Wang, W., Yang, J., Muntz, R. R., 1997b. STING: a statistical information grid ap-

proach to spatial data mining. In: Jarke, M., Carey, M. J., Dittrich, K. R., Lochovsky,


F. H., Loucopoulos, P., Jeusfeld, M. A. (Eds.), Proceedings of the 23rd International

Conference on Very Large Data Bases. Morgan Kaufmann, Athens, Greece, pp. 186–

195.

Wang, X., Smith, K. A., Hyndman, R. J., 2006. Characteristic-based clustering for time

series data. Data Mining and Knowledge Discovery 13 (3), 335–364.

Wang, Y.-F., Chuang, Y.-L., Hsu, M.-H., Keh, H.-C., 2004. A personalized recom-

mender system for the cosmetic business. Expert Systems with Applications 26 (3),

427–434.

Watanabe, S., Minami, Y., Nakamura, A., Ueda, N., 2002. Application of variational

Bayesian approach to speech recognition. In: Becker, S., Thrun, S., Obermayer,

K. (Eds.), Proceedings of the 2002 Neural Information Processing Systems. MIT,

Vancouver, BC, Canada, pp. 1237–1244.

Waterhouse, S. R., MacKay, D. J. C., Robinson, A. J., 1996. Bayesian methods for mix-

tures of experts. In: Touretzky, D. S., Mozer, M., Hasselmo, M. E. (Eds.), Proceedings

of the 1996 Neural Information Processing Systems. MIT, Denver, CO, pp. 351–357.

Watson, J. B., 1913. Psychology as the behaviorist views it. Psychological Review 20,

158–177.

Weber, R., Schek, H.-J., Blott, S., 1998. A quantitative analysis and performance study

for similarity-search methods in high-dimensional spaces. In: Gupta, A., Shmueli,

O., Widom, J. (Eds.), Proceedings of the 24th International Conference on Very

Large Data Bases. Morgan Kaufmann, New York, pp. 194–205.

Wedel, M., Kamakura, W. A., 1998. Market Segmentation: Conceptual and Method-

ological Foundations. International Series in Quantitative Marketing. Kluwer Aca-

demic, Boston, MA.

Wei, C.-P., Chiu, I.-T., 2002. Turning telecommunications call details to churn predic-

tion: a data mining approach. Expert Systems with Applications 23 (2), 103–112.

Wei, L., Keogh, E. J., Xi, X., 2006. SAXually explicit images: finding unusual shapes.

In: Proceedings of the 6th IEEE International Conference on Data Mining. IEEE,

Hong Kong, China, pp. 711–720.

Weinstein, A., 2004. Handbook of Market Segmentation: Strategic Targeting for Busi-

ness and Technology Firms, 3rd Edition. Haworth Series in Segmented, Targeted,

and Customized Market. Haworth, Binghamton, NY.

Weiss, G. M., 2005. Data mining in telecommunications. In: Maimon, O., Rokach, L.

(Eds.), Data Mining and Knowledge Discovery Handbook. Springer, New York.

Wells, W. D., 1975. Psychographics: a critical review. Journal of Marketing Research

12 (2), 196–213.


Wicker, A. W., 1969. Attitudes vs. actions: the relationship of verbal and overt behav-

ioral responses to attitude objects. Journal of Social Issues 25 (4), 41–78.

Wind, Y., 1978. Issues and advances in segmentation research. Journal of Marketing

Research 15 (3), 317–337.

Winn, J. M., Bishop, C. M., 2005. Variational message passing. Journal of Machine

Learning Research 6, 661–694.

Witten, I. H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and

Techniques, 2nd Edition. Morgan Kaufmann Series in Data Management Systems.

Morgan Kaufmann, Boston, MA.

Wolfers, J., Zitzewitz, E., 2004. Prediction markets. Journal of Economic Perspectives

18 (2), 107–126.

Woo, K.-G., Lee, J.-H., Kim, M.-H., Lee, Y.-J., 2004. FINDIT: a fast and intelligent

subspace clustering algorithm using dimension voting. Information and Software

Technology 46 (4), 255–271.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010a. Customer spatial usage behavior profiling

and segmentation with mixture modeling. Submitted.

Wu, B., McGrory, C. A., Pettitt, A. N., 2010b. A new variational Bayesian algorithm

with application to human mobility pattern modeling. Statistics and Computing,

(in press).

URL http://dx.doi.org/10.1007/s11222-010-9217-9

Wu, B., McGrory, C. A., Pettitt, A. N., 2010c. The variational Bayesian method: com-

ponent elimination, initialization & circular data. Submitted.

Wu, C. F. J., 1983. On convergence properties of the EM algorithm. The Annals of

Statistics 11 (1), 95–103.

Wu, Y.-L., Agrawal, D., Abbadi, A. E., 2001. Applying the golden rule of sampling for

query estimation. SIGMOD Record 30 (2), 449–460.

Xing, D., Girolami, M., 2007. Employing latent Dirichlet allocation for fraud detection

in telecommunications. Pattern Recognition Letters 28 (13), 1727–1734.

Xiong, Y., Yeung, D.-Y., 2002. Mixtures of ARMA models for model-based time se-

ries clustering. In: Proceedings of the 2002 IEEE International Conference on Data

Mining. IEEE, Maebashi City, Japan, pp. 717–720.

Xiong, Y., Yeung, D.-Y., 2004. Time series clustering with ARMA mixtures. Pattern

Recognition 37 (8), 1675–1689.

Xu, R., Wunsch II, D., 2005. Survey of clustering algorithms. IEEE Transactions on

Neural Networks 16 (3), 645–678.


Xu, X., Ester, M., Kriegel, H.-P., Sander, J., 1998. A distribution-based clustering al-

gorithm for mining in large spatial databases. In: Proceedings of the 14th Interna-

tional Conference on Data Engineering. IEEE, Orlando, FL, pp. 324–331.

Yalch, R., Brunel, F., 1996. Need hierarchies in consumer judgments of product de-

signs: is it time to reconsider Maslow’s theory? Advances in Consumer Research

23 (1), 405–410.

Yamazaki, K., Watanabe, S., 2003. Singularities in mixture models and upper bounds

of stochastic complexity. Neural Networks 16 (7), 1029–1038.

Yang, J., Wang, W., Wang, H., Yu, P. S., 2002. δ-clusters: capturing subspace correla-

tion in a large data set. In: Proceedings of the 18th International Conference on

Data Engineering. IEEE, San Jose, CA, pp. 517–528.

Yang, Y., Wu, X., Zhu, X., 2005. Combining proactive and reactive predictions for

data streams. In: Grossman, R., Bayardo, R. J., Bennett, K. P. (Eds.), Proceedings

of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery

and Data Mining. ACM, Chicago, IL, pp. 710–715.

Yankelovich, D., Meer, D., 2006. Rediscovering market segmentation. Harvard Busi-

ness Review 84 (2), 122–131.

Yankov, D., Keogh, E. J., Rebbapragada, U., 2007. Disk aware discord discovery: find-

ing unusual time series in terabyte sized datasets. In: Proceedings of the Seventh

IEEE International Conference on Data Mining. IEEE, Omaha, NE, pp. 381–390.

Yi, B.-K., Faloutsos, C., 2000. Fast time sequence indexing for arbitrary Lp norms.

In: Abbadi, A. E., Brodie, M. L., Chakravarthy, S., Dayal, U., Kamel, N., Schlageter,

G., Whang, K.-Y. (Eds.), Proceedings of the 26th International Conference on Very

Large Data Bases. Morgan Kaufmann, Cairo, Egypt, pp. 385–394.

Yi, B.-K., Sidiropoulos, N., Johnson, T., Jagadish, H. V., Faloutsos, C., Biliris, A., 2000.

Online data mining for co-evolving time sequences. In: Proceedings of the 16th

International Conference on Data Engineering. IEEE, San Diego, CA, pp. 13–22.

Yip, K. Y., Cheung, D. W., Ng, M. K., 2004. HARP: a practical projected clustering algo-

rithm. IEEE Transactions on Knowledge and Data Engineering 16 (11), 1387–1397.

Yip, K. Y., Cheung, D. W., Ng, M. K., 2005. On discovery of extremely low-dimensional

clusters using semi-supervised projected clustering. In: Proceedings of the 21st

International Conference on Data Engineering. IEEE, Tokyo, Japan, pp. 329–340.

Yiu, M. L., Mamoulis, N., 2005. Iterative projected clustering by subspace mining.

IEEE Transactions on Knowledge and Data Engineering 17 (2), 176–189.

Yu, D., Sheikholeslami, G., Zhang, A., 2002. FindOut: finding outliers in very large

datasets. Knowledge and Information Systems 4 (4), 387–412.


Zahn, C. T., 1971. Graph-theoretical methods for detecting and describing gestalt

clusters. IEEE Transactions on Computers C-20 (1), 68–86.

Zaki, M. J., Peters, M., Assent, I., Seidl, T., 2005. CLICKS: an effective algorithm for

mining subspace clusters in categorical datasets. In: Grossman, R., Bayardo, R. J.,

Bennett, K. P. (Eds.), Proceedings of the Eleventh ACM SIGKDD International Con-

ference on Knowledge Discovery and Data Mining. ACM, Chicago, IL, pp. 736–742.

Zeira, G., Last, M., Maimon, O., 2004. Segmentation of continuous data streams

based on a change detection methodology. In: Pal, N. R., Jain, L. C. (Eds.), Ad-

vanced Techniques in Knowledge Discovery and Data Mining. Advanced Informa-

tion and Knowledge Processing. Springer, New York.

Zeithaml, V. A., 2000. Service quality, profitability, and the economic worth of cus-

tomers: what we know and what we need to learn. Journal of the Academy of Mar-

keting Science 28 (1), 67–85.

Zeithaml, V. A., Rust, R. T., Lemon, K. N., 2001. The customer pyramid: creating and

serving profitable customers. California Management Review 43 (4), 118–142.

Zhang, D., Gunopulos, D., Tsotras, V. J., Seeger, B., 2003. Temporal and spatio-

temporal aggregations over data streams using multiple time granularities. Infor-

mation Systems 28 (1-2), 61–84.

Zhang, J., Hsu, W., Lee, M.-L., 2005. Clustering in dynamic spatial databases. Journal

of Intelligent Information Systems 24 (1), 5–27.

Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering

method for very large databases. In: Jagadish, H. V., Mumick, I. S. (Eds.), Proceed-

ings of the 1996 ACM SIGMOD International Conference on Management of Data.

ACM, Montreal, QC, Canada, pp. 103–114.

Zhou, A., Cai, Z., Wei, L., Qian, W., 2003. M-kernel merging: towards density estima-

tion over data streams. In: Proceedings of the Eighth International Conference on

Database Systems for Advanced Applications. IEEE, Kyoto, Japan, pp. 285–292.

Zhu, Y., Shasha, D., 2002. StatStream: statistical monitoring of thousands of data

streams in real time. In: Proceedings of the 28th International Conference on Very

Large Data Bases. Morgan Kaufmann, Hong Kong, China, pp. 358–369.

Zhu, Y., Shasha, D., 2003. Efficient elastic burst detection in data streams. In: Getoor,

L., Senator, T. E., Domingos, P., Faloutsos, C. (Eds.), Proceedings of the Ninth

ACM SIGKDD International Conference on Knowledge Discovery and Data Min-

ing. ACM, Washington, DC, pp. 336–345.

Ziff, R., 1971. Psychographics for market segmentation. Journal of Advertising Re-

search 11 (2), 3–9.