Marketing Database Analysis
Anna Andrusova
Nathan Bailey
James Ballard
Han Si
Demin Wang
MKT 6362. Database Marketing
Overview of Business Problem
• In the 1990s and early 2000s, Dominick’s was a chain of over 100 grocery stores in the Chicago metropolitan area
• For this evaluation, we perform both a corporate-level and a category-level data analysis
• Corporate Analysis – Relate store sales performance to known demographics to facilitate corporate planning activities and test potential locations
• Category Analysis – Relate category sales performance to known demographics to improve sales performance and expand product offerings
Data Description
Store-level historical sales data covering a period of more than seven years

Customer Count File
Daily sales of stores in 30 product categories:
• Bakery
• Beer
• Cosmetic
• Dairy
• Meat
• Pharmacy
• Grocery
• Cheese
• Wine
• Health and Beauty
• Deli
• Fish
• Floral
• Jewelry, etc.

Store-Specific Demographics
Demographic profiles of stores:
• Age
• Single / Retired / Unemployed
• Mortgage
• Poverty
• Income
• Education
• Household size
• Working woman, etc.
Data Preparation
Step 1. The latest year’s sales data from the Customer Count File was aggregated by store and summarized for the year
Step 2. Demographic variables were added from the Store Account File
Resulting data set:
• One record per store (94 stores) containing 12-month sales data and store demographic data
• Sales data on 30 product categories (the ‘Behavior’ variables)
• 43 demographic variables for residents living near the store
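The two preparation steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout, store numbers, field names and values are made up for the example, and the original work was most likely done in SAS.

```python
from collections import defaultdict

# Hypothetical daily rows from the Customer Count File:
# (store id, date, product category, sales). Values are invented.
daily_sales = [
    (101, "day-1", "BEER", 120.0),
    (101, "day-2", "BEER", 80.0),
    (101, "day-1", "WINE", 50.0),
    (102, "day-1", "BEER", 200.0),
]

# Hypothetical store demographics keyed by store id. Values are invented.
demographics = {
    101: {"INCOME": 10.5, "EDUC": 0.25},
    102: {"INCOME": 10.9, "EDUC": 0.31},
}

def build_store_records(daily_rows, demo_by_store):
    """Step 1: sum sales per store and category. Step 2: attach demographics."""
    totals = defaultdict(lambda: defaultdict(float))
    for store, _date, category, sales in daily_rows:
        totals[store][category] += sales
    records = {}
    for store, by_category in totals.items():
        record = dict(by_category)                   # yearly sales per category
        record.update(demo_by_store.get(store, {}))  # demographic variables
        records[store] = record                      # one record per store
    return records

store_records = build_store_records(daily_sales, demographics)
print(store_records[101])
```

The result has one record per store, which is the shape the clustering and discriminant steps work on.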
Approach
1. Segmentation: create groups of stores that perform similarly across a given set of product categories and differ from the other groups on that same set of categories
   Method: Non-hierarchical and hierarchical clustering
2. Response Analysis: find targetable characteristics of the identified groups of stores
   Method: Discriminant analysis
3. Model Validation: evaluate performance of the models on a hold-out sample (20% of the stores)
4. Recommendations and conclusions
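A hold-out split for the validation step might look like the sketch below. The slides do not say how the hold-out stores were drawn, so the random shuffle, the seed, the rounding rule and the store ids 1 to 94 are all assumptions made for illustration.

```python
import random

def split_holdout(store_ids, holdout_fraction=0.2, seed=0):
    """Return (training ids, hold-out ids) after a seeded random shuffle."""
    rng = random.Random(seed)           # fixed seed (assumed) for repeatability
    shuffled = list(store_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical store ids 1..94, matching only the number of stores in the deck.
train_ids, holdout_ids = split_holdout(range(1, 95))
print(len(train_ids), len(holdout_ids))
```

The models are fitted on the training stores only. The hold-out stores are scored afterwards to see whether the classification rates carry over.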
Flowchart of the Approach
[Flowchart: Dominick’s Data Set; General Data Set; Data Preparation; Corporate Analysis and Category Analysis; Clusters (Hierarchical Clustering and Non-Hierarchical Clustering); Response Analysis (Discriminant Analysis); Hold-Out Group (20%); Model Test; Corporate Analysis Results and Category Analysis Results; Conclusion and Recommendation]
Corporate Analysis
Step #1 – Hierarchical Clustering

Cluster History
Number of  Clusters Joined  Freq of New  RMS      Semipartial  R-Square  Centroid  Tie
Clusters                    Cluster      Std Dev  R-Square               Distance
…          …      …         …            …        …            …         …
11         CL21   311       3            255876   0.0013       .955      2.09E6
10         CL15   112       15           223435   0.0018       .953      2.09E6
9          CL18   CL11      9            293813   0.0044       .949      2.25E6
8          CL14   314       5            281264   0.0020       .947      2.37E6
7          CL10   CL17      43           329098   0.0346       .912      2.84E6
6          304    315       2            376122   0.0018       .910      2.86E6
5          CL8    CL9       14           451590   0.0209       .889      3.85E6
4          CL13   CL7       76           455327   0.1236       .766      3.88E6
3          CL12   CL5       16           567698   0.0270       .739      5.93E6
2          CL3    CL6       18           679121   0.0365       .702      6.84E6
1          CL4    CL2       94           918977   0.7022       .000      1.05E7

Conclusion: optimal number of clusters is between 3 and 6
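The R-Square and Semipartial R-Square columns of the cluster history are linked: the R-Square at k clusters equals the sum of the semipartial values of all later merges (the rows with fewer than k clusters). The sketch below checks this against the values in the table. The dictionary layout and function name are illustrative choices, not part of the original output.

```python
# Semipartial R-Square from the Cluster History table, keyed by the number
# of clusters remaining after each merge.
semipartial = {
    1: 0.7022, 2: 0.0365, 3: 0.0270, 4: 0.1236, 5: 0.0209,
    6: 0.0018, 7: 0.0346, 8: 0.0020, 9: 0.0044, 10: 0.0018,
}

def r_square(k):
    """R-Square of the k-cluster solution: variance lost in all later merges."""
    return sum(loss for clusters, loss in semipartial.items() if clusters < k)

for k in range(2, 8):
    # Merging from k to k-1 clusters costs semipartial[k - 1] in R-Square.
    print(k, round(r_square(k), 3), "next merge costs", semipartial[k - 1])
```

Read this way, the Semipartial R-Square column shows how much explained variance each further merge gives up, which is the quantity weighed when choosing where to stop merging.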
Corporate Analysis (Cont.)
Step #2 – Non-Hierarchical Clustering

                                         3 clusters  4 clusters  5 clusters  6 clusters
Pseudo F Statistic                       256.19      245.65      246.97      260.81
Approximate Expected Over-All R-Squared  0.7364      0.77973     0.80166     0.8157
Cubic Clustering Criterion               5.517       6.505       7.813       16.200

Conclusion: based on the results of both hierarchical and non-hierarchical clustering, the 6-cluster solution is determined to be optimal
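For readers unfamiliar with the Pseudo F Statistic, it is the ratio of between-cluster to within-cluster variance, each divided by its degrees of freedom: (BSS / (k - 1)) / (WSS / (n - k)). Here is a minimal sketch on made-up toy points, not the Dominick’s data. The function name and inputs are hypothetical.

```python
def pseudo_f(points, labels):
    """Pseudo F = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    grand_mean = [sum(p[d] for p in points) / n for d in range(dims)]
    bss = 0.0
    wss = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between-cluster SS: cluster size times squared centroid-to-grand-mean distance.
        bss += len(members) * sum((centroid[d] - grand_mean[d]) ** 2 for d in range(dims))
        # Within-cluster SS: squared distances of members to their own centroid.
        wss += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return (bss / (k - 1)) / (wss / (n - k))

# Toy data: two tight, well-separated clusters of two points each.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [1, 1, 2, 2]
toy_f = pseudo_f(toy_points, toy_labels)
print(toy_f)
```

Larger values indicate tighter, better-separated clusters, which is why the statistic is compared across candidate numbers of clusters.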
Corporate Analysis – Clustering Results

Cluster Summary
Cluster  Freq  RMS Std    Max Distance from    Radius    Nearest  Distance Between
               Deviation  Seed to Observation  Exceeded  Cluster  Cluster Centroids
1        33    201245     2233427                        6        2948467
2        1     .          0                              3        4765417
3        6     336424     2455353                        4        4286141
4        9     293813     2207687                        3        4286141
5        16    274583     3058122                        6        3192018
6        29    213948     2063995                        1        2948467
% of stores vs. % of sales

Segment    % of stores  % of sales
Segment 1  35.1%        21.5%
Segment 2  1.1%         2.4%
Segment 3  6.4%         12.1%
Segment 4  9.6%         14.7%
Segment 5  17.0%        21.2%
Segment 6  30.9%        28.2%
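The "% of stores" series follows directly from the cluster frequencies in the Cluster Summary (94 stores in total). A short check, with variable names chosen for illustration:

```python
# Cluster frequencies from the corporate Cluster Summary table.
cluster_freq = {1: 33, 2: 1, 3: 6, 4: 9, 5: 16, 6: 29}

total_stores = sum(cluster_freq.values())
store_share = {
    cluster: round(100 * freq / total_stores, 1)
    for cluster, freq in cluster_freq.items()
}
print(total_stores, store_share)
```

Comparing each segment's share of stores with its share of sales shows which segments sell more or less than their store count alone would suggest.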
Corporate Analysis - Discriminant Analysis
Confidence Level: 90%

Univariate Test Statistics
F Statistics, Num DF=5, Den DF=79

Variable  Total      Pooled     Between    R-Square  R-Square/  F Value  Pr > F
          Std Dev    Std Dev    Std Dev              (1-RSq)
EDUC      0.1129     0.1102     0.0394     0.1029    0.1147     1.81     0.1200
NOCAR     0.1316     0.1287     0.0453     0.1000    0.1111     1.76     0.1318
INCSIGMA  2323       2264       824.9388   0.1064    0.1190     1.88     0.1070
HSIZE1    0.0829     0.0809     0.0292     0.1045    0.1167     1.84     0.1138
SINHOUSE  0.2173     0.2103     0.0817     0.1194    0.1355     2.14     0.0690
HVAL200   0.1853     0.1758     0.0792     0.1541    0.1822     2.88     0.0194
SINGLE    0.0703     0.0665     0.0306     0.1593    0.1895     2.99     0.0158
NWRKCH17  0.0199     0.0194     0.006933   0.1024    0.1141     1.80     0.1218
TELEPHN   0.0309     0.0293     0.0134     0.1581    0.1879     2.97     0.0166
SHPINDX   0.2482     0.2405     0.0924     0.1168    0.1323     2.09     0.0753

* 17 statistically significant variables in total
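The F Value column can be recovered from the R-Square column and the stated degrees of freedom: F = (R² / (1 - R²)) × (Den DF / Num DF). This sketch checks three rows of the table. The function name is an illustrative choice.

```python
def f_from_r_square(r_square, num_df, den_df):
    """Univariate F statistic implied by a variable's R-Square."""
    return (r_square / (1.0 - r_square)) * (den_df / num_df)

# (R-Square, reported F Value) pairs taken from the table above.
reported = {
    "EDUC": (0.1029, 1.81),
    "HVAL200": (0.1541, 2.88),
    "SINGLE": (0.1593, 2.99),
}

for name, (r2, f_reported) in reported.items():
    f_computed = round(f_from_r_square(r2, num_df=5, den_df=79), 2)
    print(name, f_computed, f_reported)
```

The R-Square/(1-RSq) column is the first factor of this product, so a variable with a larger share of between-group variance gets a proportionally larger F.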
Corporate Analysis - Discriminant Analysis (Cont.)

   Canonical    Adjusted Canonical  Approximate     Squared Canonical
   Correlation  Correlation         Standard Error  Correlation
1  0.847077     0.761387            0.030819        0.717540

Multivariate Statistics and F Approximations
S=5 M=15 N=21

Statistic               Value       F Value  Num DF  Den DF  Pr > F
Wilks' Lambda           0.02426163  1.39     180     223.58  0.0103
Pillai's Trace          2.50666011  1.34     180     240     0.0172
Hotelling-Lawley Trace  6.07753961  1.44     180     164.86  0.0093
Roy's Greatest Root     2.54031820  3.39     36      48      <.0001

• Means of the independent variables are statistically different among segments
• Wilks' Lambda: only 2.4% of the variance in the discriminant scores is not explained by the differences among groups of the stores
• Squared canonical correlation: ratio of between-group SS to the total SS
=> Good set of descriptors
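Several of the reported statistics can be cross-checked against each other. The squared canonical correlation is the square of the first canonical correlation, and Roy's Greatest Root is that squared value divided by one minus itself. A small sketch using the reported figures, with illustrative variable names:

```python
# Reported values for the first canonical function (corporate analysis).
canonical_corr = 0.847077
wilks_lambda = 0.02426163

squared_corr = canonical_corr ** 2                # between-group SS / total SS
roys_root = squared_corr / (1.0 - squared_corr)   # between-group SS / within-group SS
unexplained_pct = round(100 * wilks_lambda, 1)    # share not explained by group differences

print(round(squared_corr, 6), round(roys_root, 4), unexplained_pct)
```

The results agree with the reported 0.717540, 2.54031820 and the 2.4% quoted in the text, up to rounding of the inputs.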
Corporate Analysis – Classification Results

Original Dataset
Error Count Estimates for CLUSTER
        1       2       3       4       5       6       Total
Rate    0.1818  0.0000  0.0000  0.1667  0.3571  0.1923  0.1497
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.1667
~ 85% of the stores are classified correctly

Hold-out Sample
Error Count Estimates for CLUSTER
        1       3       4       5       6       Total
Rate    0.1429  0.0000  0.0000  0.3333  0.3333  0.1619
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.8333
~ 84% of the stores are classified correctly
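The Total error rates, and the "classified correctly" percentages derived from them, can be reproduced as a prior-weighted average of the per-cluster error rates. Dividing by the sum of the priors is an assumption inferred from the numbers: it is what makes the hold-out total match when only five of the six clusters appear.

```python
def total_error(rates, priors):
    """Prior-weighted average error rate, normalised by the sum of priors (assumed)."""
    return sum(r * p for r, p in zip(rates, priors)) / sum(priors)

equal_prior = 1 / 6  # each of the six clusters gets the same prior (0.1667)

original_error = total_error(
    [0.1818, 0.0000, 0.0000, 0.1667, 0.3571, 0.1923], [equal_prior] * 6
)
holdout_error = total_error(
    [0.1429, 0.0000, 0.0000, 0.3333, 0.3333], [equal_prior] * 5
)

print(round(100 * (1 - original_error)), round(100 * (1 - holdout_error)))
```

With equal priors every cluster counts the same in the total regardless of how many stores it contains, so a one-store cluster weighs as much as a thirty-store one.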
Category Analysis: Beer and Wine
Step #1 – Hierarchical Clustering

Cluster History
Number of  Clusters Joined  Freq of New  RMS      Semipartial  R-Square  Centroid  Tie
Clusters                    Cluster      Std Dev  R-Square               Distance
9          CL16   309       8            72804.2  0.0031       .906      197203
8          CL23   CL13      10           93539.6  0.0091       .897      200748
7          CL10   CL31      11           95378.9  0.0085       .888      239550
6          CL7    CL8       21           145510   0.0459       .842      311263
5          CL87   CL11      61           112380   0.0639       .778      318702
4          CL5    CL6       82           170030   0.2099       .568      385452
3          CL4    CL15      85           185394   0.0973       .471      609748
2          CL3    CL9       93           226017   0.3212       .150      696877
1          CL2    304       94           243807   0.1499       .000      1.29E6

Conclusion: optimal number of clusters is between 4 and 6
Category Analysis: Beer and Wine (Cont.)
Step #2 – Non-Hierarchical Clustering

                                         4 clusters  5 clusters  6 clusters
Pseudo F Statistic                       87.53       116.85      131.08
Approximate Expected Over-All R-Squared  0.7692      0.81988     0.85358
Cubic Clustering Criterion               -1.336      1.458       2.489

Conclusion: based on the results of both hierarchical and non-hierarchical clustering, the 6-cluster solution is determined to be optimal
Category Analysis: Beer and Wine (Cont.)

Cluster Summary
Cluster  Frequency  RMS Std    Maximum Distance     Radius    Nearest  Distance Between
                    Deviation  from Seed to         Exceeded  Cluster  Cluster Centroids
                               Observation
1        35         83267.8    194999                         2        268532
2        32         78629.9    206948                         1        268532
3        8          131663     250170                         2        374603
4        9          82174.1    159203                         2        333646
5        9          80329.2    180104                         4        377389
6        1          .          0                              3        924906

Cluster Means
Cluster  BEER        WINE
1        144128.421  101864.577
2        326776.212  298713.241
3        493651.738  634093.243
4        649465.774  213912.842
5        955669.947  434505.459
6        383045.800  1552362.060

• Cluster #5 is the top seller of Beer
• Cluster #6 is the top seller of Wine
• Cluster #1 has the lowest sales of both Beer & Wine
• Cluster #6 contains a single store (an outlier)
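The "Nearest Cluster" and "Distance Between Cluster Centroids" columns can be reproduced from the Cluster Means, since each centroid is simply the (BEER, WINE) mean pair. The sketch assumes plain Euclidean distance, which matches the reported figures. Function and variable names are illustrative.

```python
import math

# Cluster centroids (BEER, WINE) from the Cluster Means table.
cluster_means = {
    1: (144128.421, 101864.577),
    2: (326776.212, 298713.241),
    3: (493651.738, 634093.243),
    4: (649465.774, 213912.842),
    5: (955669.947, 434505.459),
    6: (383045.800, 1552362.060),
}

def nearest_cluster(cluster):
    """Return (nearest cluster id, Euclidean distance between the two centroids)."""
    distance, nearest = min(
        (math.dist(cluster_means[cluster], centroid), other)
        for other, centroid in cluster_means.items()
        if other != cluster
    )
    return nearest, distance

nearest = {cluster: nearest_cluster(cluster) for cluster in cluster_means}
for cluster, (other, distance) in nearest.items():
    print(cluster, other, round(distance))
```

The single-store Cluster 6 sits far from every other centroid, which is consistent with treating that store as an outlier.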
Discriminant Analysis: Beer and Wine
Confidence level: 95%

Univariate Test Statistics
F Statistics, Num DF=5, Den DF=79

Variable  Total     Pooled    Between   R-Square  R-Square/  F Value  Pr > F
          Std Dev   Std Dev   Std Dev             (1-RSq)
AGE9      0.0272    0.0261    0.0109    0.1347    0.1557     2.46     0.0400
EDUC      0.1129    0.1051    0.0528    0.1843    0.2259     3.57     0.0058
INCOME    0.2921    0.2793    0.1192    0.1405    0.1635     2.58     0.0324
INCSIGMA  2323      2191      1021      0.1630    0.1948     3.08     0.0137
HSIZEAVG  0.2686    0.2480    0.1303    0.1985    0.2477     3.91     0.0032
HSIZE2    0.0322    0.0298    0.0154    0.1942    0.2410     3.81     0.0038
HSIZE567  0.0325    0.0277    0.0200    0.3176    0.4655     7.35     <.0001
HH3PLUS   0.0844    0.0796    0.0371    0.1628    0.1944     3.07     0.0138
HH4PLUS   0.0650    0.0606    0.0303    0.1833    0.2244     3.55     0.0061
DENSITY   0.001250  0.001192  0.000518  0.1447    0.1692     2.67     0.0277
HVAL150   0.2460    0.2260    0.1217    0.2064    0.2601     4.11     0.0023
HVAL200   0.1853    0.1664    0.0992    0.2417    0.3188     5.04     0.0005
HVALMEAN  47.3071   42.9341   24.4560   0.2254    0.2909     4.60     0.0010
SINGLE    0.0703    0.0664    0.0308    0.1616    0.1927     3.04     0.0145
UNEMP     0.0239    0.0226    0.0103    0.1576    0.1871     2.96     0.0169
WRKWNCH   0.0446    0.0424    0.0187    0.1483    0.1742     2.75     0.0241
TELEPHN   0.0309    0.0287    0.0148    0.1929    0.2389     3.78     0.0041
POVERTY   0.0457    0.0441    0.0175    0.1238    0.1413     2.23     0.0590

Statistically significant variables in discriminating observations among groups
Discriminant Analysis: Beer and Wine (Cont.)

   Canonical    Adjusted Canonical  Approximate     Squared Canonical
   Correlation  Correlation         Standard Error  Correlation
1  0.846814     0.751237            0.030868        0.717094

Multivariate Statistics and F Approximations
S=5 M=15 N=21

Statistic               Value       F Value  Num DF  Den DF  Pr > F
Wilks' Lambda           0.01346418  1.72     180     223.58  <.0001
Pillai's Trace          2.81504177  1.72     180     240     <.0001
Hotelling-Lawley Trace  7.26639429  1.72     180     164.86  0.0002
Roy's Greatest Root     2.53474655  3.38     36      48      <.0001

• Means of the independent variables are statistically different among segments
• Only 1.3% of the variance in the discriminant scores is not explained by the differences among groups of the stores
=> Good set of descriptors
Beer & Wine Category Analysis – Classification Results

Original Dataset
Error Count Estimates for CLUSTER
        1       2       3       4       5       6       Total
Rate    0.5714  0.6207  0.7143  0.6000  0.8750  1.0000  0.7302
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.1667
~ 27% of the stores are classified correctly

Hold-out Sample
Error Count Estimates for CLUSTER
        1       2       3       4       5       Total
Rate    0.1667  0.3333  0.5000  0.5000  0.5000  0.4000
Priors  0.1667  0.1667  0.1667  0.1667  0.1667  0.8333
~ 60% of the stores are classified correctly
Recommendations
Corporate Level:
• Resource allocation among the stores: perform additional analysis of the stores in underperforming segments (1 & 6)
• Evaluation of potential locations for a new store: deploy the discriminant function to predict store performance in different product categories based on the demographic profile of each location
Category Level (Beer & Wine):
• Marketing strategy for a new brand of Beer or Wine: adjust the targeting strategy for a product based on the demographic profile of the location where it will be sold
• Choice of stores to test-market a new product: we recommend market-testing Beer in stores of segments 4 & 5 and Wine in stores of segments 3 & 6
Limitations of the Analysis
Additional data
• Product-level data: assessment of specific product sales in new stores and prediction of the performance of a new product being considered for launch
• Customer-specific data: ability to build better predictive models tied to customer demographics (scanner data from loyalty program members’ transactions)
Higher quality analysis at a more granular level