Knowledge discovery & data mining
Clustering & customer segmentation
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Konstanz, 27-28.3.2000
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering techniques
Problem 1: customer segmentation
Given: a large database of customer data containing their properties and past buying records.
Goal: find groups of customers with similar behavior; find customers with unusual behavior.
Problem 2: CAD databases
Given: a large database of CAD data containing abstract feature vectors.
Goal: find homogeneous groups of similar CAD parts; determine standard parts for each group; use standard parts instead of special parts, reducing the number of parts to be produced.
Clustering: general problem description
Given: a data set with N d-dimensional data items.
Find: a natural partitioning of the data set into a number of clusters (k) and noise.
Clusters should be such that:
items in the same cluster are similar (intra-cluster similarity is maximized)
items from different clusters are different (inter-cluster similarity is minimized)
Clustering = unsupervised classification
No predefined classes!
Used either as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.
Helps to understand how objects in a data set tend to naturally group together.
Cluster: a number of things of the same kind
being close together in a group
(Longman dictionary of contemporary English)
Research perspective
From the past: clustering is a well-known problem in statistics, with more recent research in machine learning, databases, and visualization.
… to the future: effective and efficient clustering algorithms for large, high-dimensional data sets with high noise.
This requires scalability with respect to:
the number of data points (N)
the number of dimensions (d)
the noise level
Basic methods
Partitioning methods: k-means, k-medoids
Hierarchical methods: agglomerative/divisive, BIRCH, CURE
Linkage-based methods
Density-based methods: DBSCAN, DENCLUE
Statistical methods: IBM-IM demographic clustering, COBWEB
A small example
Source: DM@CINECA, http://open.cineca.it/datamining/dmCineca/index.htm
Domain: technology watch, a.k.a. competitive intelligence.
Which are the emergent technologies?
Which of my competitors are investing in them?
In which areas are my competitors active?
Which areas will my competitors drop in the near future?
Data sources: public (on-line) databases.
The Derwent database
Contains all patents filed worldwide in the last 10 years.
Searching this database by keywords may yield thousands of documents.
Derwent documents are semi-structured: many long text fields.
Goal: analyze Derwent documents to build a model of competitors’ strategy.
Structure of Derwent documents
Example dataset
Patents in the area of patch technology (medical plaster):
105 companies from 12 countries
94 classification codes
52 Derwent codes
Clustering output
Zoom on cluster 2
Zoom on cluster 2 - profiling competitors
Activity of competitors in the clusters
Temporal analysis of clusters
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering techniques
A case-study on customer segmentation
Source: Gary Saarenvirta, “Mining customer data”. DB2 Magazine online, 1998.
http://www.db2mag.com/98fsaar.html
Gratefully acknowledged!
Customer segmentation
Task: use customer-purchase transaction data to track buying behavior, create strategic business initiatives, and divide customers into segments.
Segments are based on shareholder value variables: customer profitability, measure of risk, measure of the lifetime value of a customer, retention probability.
Attractive customer segments
High-profit, high-value, low-risk customers:
highly attractive: typically 10% to 20% of customers, who create 50% to 80% of a company's profits
strategic initiative for the segment: retention
Low-profit, high-value, low-risk customers:
attractive
strategic initiative for the segment: increase profitability via cross-selling (selling new products) and up-selling (selling more of what customers currently buy)
Behavioral vs. demographic segments
Within behavioral segments: demographic subsegments.
Customer demographic data are not used to create segments.
Demographic (sub)segmenting is used to select appropriate tactics (advertising, marketing channels, and campaigns) to implement the strategic initiatives.
AIR MILES
A Canadian reward program for a coalition of more than 125 companies in all industry sectors: finance, credit card, retail, grocery, gas, telecom.
60% of Canadian households are enrolled.
A frequent-shopper program: consumers collect bonuses that can be redeemed for rewards (air travel, hotel accommodation, theatre/sport tickets, …).
Data capture
The coalition partners capture consumer transactions and transmit them to AIR MILES, which stores these transactions and uses the data for database marketing initiatives.
At the time of the project, the AIR MILES data warehouse contained more than 6.3 million household records and 1 billion transaction records.
Before data mining
Analytical techniques used to analyze the success of the various marketing initiatives undertaken by the coalition:
Recency, Frequency, Monetary value (RFM) analysis
online analytic processing tools
linear statistical methods
Data mining project at AMRP
Goal: create a customer segmentation using a data mining tool, and compare the results to an existing segmentation developed using RFM analysis.
Data mining platform:
IBM DB2 Universal Database Enterprise, parallelized over a five-node RS/6000 SP system
IBM Intelligent Miner for Data
Dataset
~ 50,000 customers and their associated transactions for a 12-month period.
Data preparation
“Shareholder value” variables:
revenue
customer tenure
number of sponsor companies shopped at over the customer tenure
number of sponsor companies shopped at over the last 12 months
recency (in months) of the last transaction
These were calculated by aggregating the transaction data and then adding them to each customer record.
84 variables = 14 categories of sponsor companies × 3 variables per category × 2 quarters (first two quarters of 1997).
Data cleaning - missing values
Demographic data is usually categorical and has a high % of missing values.
Missing values can be set to either unknown or unanswered (if the result of unanswered questions).
If a large portion of an attribute is missing, the attribute may be discarded.
In the case study, missing numeric values were set to 0.
Data transformation
Ratio variables, e.g. profitability = profit / tenure.
Time-derivative variables, e.g. profit in the 2nd quarter minus profit in the 1st quarter.
Discretization using quantiles, e.g. break points at the 10th, 25th, 50th, 75th, and 90th percentiles.
Discretization using predefined ranges, e.g. those used in census data.
Log transforms, e.g. for very skewed distributions (see the sketch below).
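To make the quantile and log transformations concrete, here is a minimal sketch in Python/NumPy; the break points and the lognormal "revenue" sample are hypothetical illustrations, not data from the case study:

```python
import numpy as np

def discretize_by_quantiles(x, probs=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Map each value of x to a quantile bin (0 .. len(probs))."""
    breaks = np.quantile(x, probs)                   # break points at the chosen percentiles
    return np.searchsorted(breaks, x, side="right")  # bin index for every value

# A very skewed, synthetic "revenue" variable: log-transform before binning.
rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=3.0, sigma=1.5, size=1_000)
bins = discretize_by_quantiles(np.log(revenue))
print(np.bincount(bins))  # roughly 10/15/25/25/15/10 percent of points per bin
```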
Distribution of original data
Distribution of discretized data
Before/after discretization
Clustering/segmentation methodology
IBM-IM demographic clustering
Designed for categorical variables.
Similarity index:
increases with the number of common values on the same attribute
decreases with the number of different values on the same attribute
The number of clusters is not fixed a priori; only an upper bound is set.
Demographic clustering: data structure
Demographic clustering: parameters
Demographic clustering: similarity index
The index is:
proportional to the 1-1 matches (# of common 1’s)
inversely proportional to the 0-1 and 1-0 mismatches
unaffected by the 0-0 matches (# of common 0’s)
Condorcet index = N11 / (N11 + 1/2 (N01 + N10))
Dice index = N11 / (N11 + 1/4 (N01 + N10))
Dice is looser than Condorcet, hence appropriate with highly different objects.
The indexes generalize from objects to clusters.
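A small sketch of how the two indexes compare, assuming two customers are represented as binary attribute vectors; it follows the slide's definitions, not IBM Intelligent Miner's actual implementation:

```python
import numpy as np

def binary_counts(a, b):
    """Return (N11, N01, N10): common 1's plus the two mismatch counts; 0-0 pairs are ignored."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return int(np.sum(a & b)), int(np.sum(~a & b)), int(np.sum(a & ~b))

def condorcet(a, b):
    n11, n01, n10 = binary_counts(a, b)
    return n11 / (n11 + 0.5 * (n01 + n10))

def dice(a, b):
    n11, n01, n10 = binary_counts(a, b)
    return n11 / (n11 + 0.25 * (n01 + n10))

x = [1, 1, 0, 1, 0, 0]
y = [1, 0, 0, 1, 1, 0]
print(condorcet(x, y), dice(x, y))  # 0.666... vs 0.8: Dice is the looser index
```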
Demographic clustering: similarity index
Similarity threshold: i and j are assumed similar if s(i,j) exceeds the threshold; low values (< 0.5) are appropriate with highly different objects.
Weights for attributes: the importance of attributes in the similarity index may be varied with different weights; the default weight is 1.
IM Demographic clustering
Basic parameters to set:
maximum number of clusters
maximum number of passes through the data
accuracy: a stopping criterion for the algorithm; if the change in the Condorcet criterion between data passes is smaller than the accuracy (as a %), the algorithm terminates
In the case study: maximum # of clusters = 9, maximum # of passes = 5, accuracy = 0.1.
Input dataset
Dataset: all continuous variables discretized.
Input variables:
# of products purchased over the customer's lifetime
# of products purchased in the last 12 months
customer's revenue contribution over lifetime
customer tenure in months
ratio of revenue to tenure
ratio of number of products to tenure
region
recency
tenure (# of months since the customer first enrolled in the program)
Input dataset
Other discrete and categorical variables, and some interesting continuous variables, were input as supplementary variables: variables used to profile the clusters but not to define them.
This allows easier interpretation of the clusters, using data other than the input variables.
Output of demographic clustering
Visualization of clusters
Each horizontal strip is a cluster.
Clusters are ordered from top to bottom in order of size.
Variables are ordered from left to right in order of importance to the cluster, based on a chi-square test between the variable and the cluster ID.
Other ordering metrics include entropy, the Condorcet criterion, and database order.
Visualization of clusters
Variables used to define the clusters appear without brackets, while supplementary variables appear within brackets.
Numeric (integer), discrete numeric (small integer), binary, and continuous variables have their frequency distributions shown as bar graphs:
red bars = distribution of the variable within the current cluster
gray solid bars = distribution of the variable in the whole universe
Visualization of clusters
Categorical variables are shown as pie charts:
inner pie = distribution of the categories for the current cluster
outer ring = distribution of the variable for the entire universe
The more different the cluster distribution is from the average, the more interesting or distinct the cluster.
Qualitative characterization of clusters
Gold98 is a binary variable that indicates the best customers in the database, created previously by the business using RFM analysis.
Each cluster contains either almost all Gold customers or almost none.
The Gold segment is confirmed!
Qualitative characterization of clusters
Clustering results validate the existing concept of Gold customers, and suggest the introduction of a cluster within the Gold98 segment: a platinum segment.
Cluster 5: 9% of the population.
Revenue and bonuses collected are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile.
It looks like a very profitable cluster.
Detailed view of cluster 5
Profiling clusters
Goal: assess the potential business value of each cluster quantitatively by profiling the aggregate values of the shareholder value variables by cluster.
Profiling clusters
Leverage = ratio of revenue to customers.
Product index = the average number of products purchased by the customers in the cluster, divided by the average number of products purchased overall.
As profitability increases, so does the average number of products purchased.
Customer profitability increases as tenure increases.
Business opportunities
Best customers are in clusters 2, 5, and 7; indication: retention.
Clusters 2, 6, and 0; indication: cross-selling, by contrasting with clusters 5 and 7.
Clusters 2, 6, and 0 have a product index close to that of clusters 5 and 7, which have the highest number of products purchased.
Try to convert customers from clusters 2, 6, and 0 to clusters 5 and 7. By comparing which products are bought, we can find products that are candidates for cross-selling.
Business opportunities
Clusters 3 and 4; indication: cross-selling to clusters 2, 6, and 0.
Cluster 1; indication: wait and see; it appears to be a group of new customers.
Cluster 8; indication: do not waste marketing dollars.
Follow-up
Reactions from the management:
visualization of results allowed for meaningful and actionable analysis
the original segmentation methodology was validated, and the refinements obtained are potentially valuable
decision to undertake further data mining projects, including:
predictive models for direct mail targeting
further work on segmentation using more detailed behavioral data
opportunity identification using association algorithms within the discovered segments
Module outline
The clustering task
Application to a case study in customer segmentation
Quick overview of main clustering techniques
A taxonomy of main clustering methods
Partitioning methods: k-means (MacQueen 67), expectation maximization (Lauritzen 95), CLARANS (Ng and Han 94)
Hierarchical methods: agglomerative/divisive, BIRCH (Zhang et al 96), CURE (Guha et al 98)
Linkage-based methods
Density-based methods: DBSCAN (Ester et al 96), DENCLUE (Hinneburg and Keim 98)
Statistical methods: IBM-IM demographic clustering, COBWEB (Fisher 87), AutoClass (Cheeseman 96)
Which notion of distance?
Given two samples with n continuous attributes, X = <x1, …, xn> and Y = <y1, …, yn>, the Minkowski distance between X and Y is:
d(X,Y) = (|x1-y1|^q + … + |xn-yn|^q)^(1/q)
Euclidean distance: q = 2
Manhattan distance: q = 1
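A direct transcription of the formula into Python/NumPy, just to fix the notation:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance: d(X,Y) = (|x1-y1|^q + ... + |xn-yn|^q)^(1/q)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, q=2))  # Euclidean: 5.0
print(minkowski(x, y, q=1))  # Manhattan: 7.0
```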
K-Means
Determine k prototypes (p1, …, pk) of a given data set.
Assign data points to the nearest prototype; the objective (and stop criterion) is to minimize the distance of points to their prototypes.
Iterative algorithm (see the sketch below):
shift the prototypes towards the mean of their point set
re-assign the data points to the nearest prototype
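A minimal NumPy sketch of this iteration; it uses random initial prototypes, whereas a production implementation would add smarter seeding and empty-cluster handling:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: alternate nearest-prototype assignment and mean update."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign every point to its nearest prototype (squared Euclidean distance)
        dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # shift each prototype towards the mean of its point set
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else prototypes[j] for j in range(k)])
        if np.allclose(new, prototypes):  # stop when prototypes no longer move
            return new, labels
        prototypes = new
    return prototypes, labels
```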
K-Means: example
[Figure: k-means example over three iterations]
Expectation Maximization
A generalization of k-means: probabilistic assignment of points to clusters.
Idea:
estimate the parameters of k Gaussians
optimize the probability that the mixture of parameterized Gaussians fits the data
Iterative algorithm, similar to k-means (see the sketch below).
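A hedged sketch using scikit-learn's GaussianMixture, which fits the mixture by EM; the two-blob data set is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM estimation of 2 Gaussians
soft = gm.predict_proba(X)  # probabilistic (soft) assignment of points to clusters
hard = gm.predict(X)        # hard labels, analogous to k-means output
```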
Linkage-based methods (from statistics)
Single linkage: clusters are the connected components obtained by linking points within distance d (a sketch follows).
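In practice this can be computed with SciPy's hierarchical-clustering routines: cutting the single-linkage dendrogram at distance d yields exactly the connected components (synthetic data, hypothetical cut-off):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(30, 2))
Z = linkage(X, method="single")                    # single-linkage dendrogram
labels = fcluster(Z, t=0.8, criterion="distance")  # connected components at distance d = 0.8
```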
Linkage-based methods
Method of Wishart (minimum number of neighboring points: 4):
reduce the data set (keep only points with enough neighbors, i.e. the dense points)
apply single linkage to the reduced data set
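A rough sketch of this two-step recipe under the reading above (keep only points with at least the minimum number of neighbors within distance d, then single-link the remainder); the thresholds are hypothetical and a sparse data set may leave too few dense points to cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def wishart_like(X, d=0.8, min_neighbors=4):
    """Drop sparse points, then apply single linkage to the dense remainder."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n_neighbors = (dists <= d).sum(axis=1) - 1   # neighbors within d, excluding self
    dense = n_neighbors >= min_neighbors
    X_dense = X[dense]                           # step 1: reduce the data set
    Z = linkage(X_dense, method="single")        # step 2: single linkage
    return X_dense, fcluster(Z, t=d, criterion="distance")
```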
Kernel Density Estimation
Influence function: the influence of a data point in its neighborhood.
Density function: the sum of the influences of all data points.
Kernel Density Estimation
Influence function: the influence of a data point y at a point x of the data space is modeled by a function f_y(x), for instance a Gaussian influence f_y(x) = exp(-d(x,y)^2 / (2 sigma^2)).
Density function: the density at a point x of the data space is defined as the sum of the influences of all data points: f(x) = f_{x1}(x) + … + f_{xN}(x).
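A literal sketch of these two definitions with a Gaussian influence function; the bandwidth sigma and the toy data are hypothetical choices:

```python
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x: exp(-d(x,y)^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
print(density(np.array([0.25, 0.0]), data))  # high: near the dense pair
print(density(np.array([9.0, 9.0]), data))   # ~0: far from all points
```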
KDE and visual clustering
Scalability problem of clustering methods
Effectiveness degenerates with the dimensionality d and with the noise level.
Efficiency degenerates (at least) linearly with the number of data points N, and exponentially with the dimensionality d.
References - Clustering
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp. on Math., Stat., and Prob., 1:281-297, 1967.
S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
Charu C. Aggarwal, Cecilia Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park. Fast algorithms for projected clustering. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1999.
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and results. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI/MIT Press, 1996.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, and Xiaowei Xu. Incremental clustering for mining in a data warehousing environment. VLDB Conference, New York City, New York, August 1998.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD'96.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. KDD'98.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
Venkatesh Ganti, Raghu Ramakrishnan, and Johannes Gehrke. Clustering large datasets in arbitrary metric spaces. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: a robust clustering algorithm for categorical attributes. 15th International Conference on Data Engineering, Sydney, Australia, April 1999.
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988.
Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. VLDB Conference, Santiago, Chile, September 1994.
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996.