Hierarchical Signal Propagation for Household Level Sales in
Bayesian Dynamic Models
by
Di Deng
Department of Statistical Science
Duke University

Date:
Approved:
Mike West, Advisor
Peter Hoff
Andrew Cron
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Department of Statistical Science in the Graduate School of
Duke University
2021
Abstract
Large consumer sales companies frequently face challenges in customizing decision
making for each individual customer or household. This thesis presents a novel,
efficient and interpretable approach to such personalized business strategies, involving
multi-scale dynamic modeling, Bayesian decision analysis and detailed application in
the context of supermarket promotion decisions and sales forecasting.
We use a hierarchical, sequential, probabilistic and computationally efficient Bayesian
dynamic modeling framework to propagate signals down the hierarchy, from the level
of overall supermarket sales in a store, to items sold in a department of the store,
within refined categories in a department, and then to the finest level of individual
items on sale. Scalability is achieved by extending the decouple-recouple concept:
the core example involves 162,319 time series over a span of 112 weeks, arising from
combinations of 211 items and 2,000 households. In addition to novel dynamic model
developments and application in this multi-scale framework, this thesis also devel-
ops a comprehensive customer labeling system, built based on customer purchasing
behavior in the context of prices and discounts offered by the store. This labeling
system addresses a main goal in the applied context to define customer categorization
to aid in business decision making beyond the currently adopted models. Further, a
key and complementary contribution of the thesis is development of Bayesian deci-
sion analysis using a set of loss functions that suit the context of the price discount
selection for supermarket promotions. Formal decision analysis is explored both the-
oretically and via simulations. Finally, some of the modeling developments in the
multi-scale framework are of general interest beyond the specific applied motivat-
ing context here, and are incorporated into the latest version of PyBATS, a Python
package for Bayesian time series analysis and forecasting.
Contents

Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Data and Context
  1.2 Prior Relevant Work
  1.3 Thesis Scope and Contributions

2 Dynamic Models
  2.1 Introduction
  2.2 DGLMs: Dynamic Generalized Linear Models
  2.3 DMMs: Dynamic Mixture Models
    2.3.1 DCMMs: Dynamic Count Mixture Models
    2.3.2 DLMMs: Dynamic Linear Mixture Models
  2.4 Multi-scale Framework
  2.5 Case Study and Examples
    2.5.1 Individual Household DGLMs
    2.5.2 Multi-scale Modeling
    2.5.3 Model Evaluation and Comparison
  2.6 Summary

3 Labeling System
  3.1 Motivations and Purposes
  3.2 Labeling System
  3.3 Case Study and Examples
  3.4 Summary

4 Decision Analysis
  4.1 Introduction
  4.2 Business Context
  4.3 Tentative Models
    4.3.1 Poisson Model
    4.3.2 Mixture Model
  4.4 DCMMs
  4.5 Summary

5 Computation, Implementation and Code
  5.1 Introduction
  5.2 Copula Approximation
    5.2.1 VBLB for Latent Factor DGLMs
    5.2.2 Code
  5.3 Clustering Visualization

6 Conclusions and Summary

Appendices

A DGLMs
  A.1 VBLB
  A.2 Discount Factors
  A.3 Random Effects
  A.4 Multi-scale Modeling

B More Figures

C More Code
List of Figures

2.1 Modeling hierarchy
2.2 (a) Coefficient of average discount percent in the external model M0; (b) product of (a) and the actual discount percent; (c) coefficient of (b) in an individual model M2
2.3 Model comparison in terms of forecasting accuracy. Naive model: Mnaive; DGLM: M1; Latent: M2; TF: a logistic regression model written in TensorFlow by 84.51°
3.1 Interactive visual aids for the labeling system
4.1 Distributions of four model parameters
4.2 Utility vs. discount
4.3 Distributions of simulated outcomes over a year (p/c = 1.2)
4.4 Distributions of simulated outcomes over a year (p/c = 2)
4.5 Distributions of simulated outcomes over a year (p/c = 10)
B.1 Distributions of simulated parameters over a year (p/c = 1.2)
B.2 Distributions of simulated parameters over a year (p/c = 2)
B.3 Distributions of simulated parameters over a year (p/c = 10)
List of Tables

3.1 Example items for each group
4.1 Summary statistics of logistic and Poisson regressions
Chapter 1
Introduction
1.1 Data and Context
The data analyzed throughout the case study are provided by 84.51°. They record
weekly purchasing data for 211 actively selling items across over 2,000 households,
over a span of 112 weeks, from September 5th, 2017 to October 22nd, 2019.
For each row/visit, the key numeric variables include the regular price, the
discounted/net price of items and the units sold, as well as identifying information,
such as the date, the household identification number, and the category/department
identification numbers of items.
We create a few derived variables for modeling purposes: the total money spent
on items for each visit, a dummy variable for whether or not there was a promotion,
and the discount percentage: the ratio of the discount to the regular price.
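As a concrete sketch, the derived variables above can be computed per row as follows; the function and argument names are hypothetical, not the actual 84.51° schema:

```python
def derive_features(regular_price, net_price, units):
    """Derived variables for one row/visit (names are illustrative only)."""
    total_spend = net_price * units                # money spent on the item this visit
    on_promotion = int(net_price < regular_price)  # dummy variable for a promotion
    discount_percent = (regular_price - net_price) / regular_price  # discount / regular price
    return total_spend, on_promotion, discount_percent

# e.g. an item with regular price 2.50 sold at 2.00, three units purchased
print(derive_features(2.50, 2.00, 3))  # (6.0, 1, 0.2)
```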
1.2 Prior Relevant Work
The motivation of this thesis is closely related to customized forecasting and decision
making in various retail contexts (e.g. Chen et al., 2018), although individualized
statistical models find broader application in recommendation systems, from image
recommendation (Niu et al., 2018) to music (Wang et al.,
2013).
Collaborative filtering and matrix factorization (Su and Khoshgoftaar, 2009; Du
et al., 2018) are usually the pillars of customized recommendation systems. More
recently, deep learning has shown potential in dealing with big, complex data (He et al.,
2018; Niu et al., 2018; Naumov et al., 2019). Unlike most of these non-dynamic methods,
this thesis aims to forecast the purchasing behavior of an individual. The whole
setting of this thesis is dynamic, like the dynamic extension of matrix factorization
(Jerfel et al., 2017) and the temporal features in Chu and Park (2009) and Hu et al.
(2015).
Customized prediction is also significant in medical applications, such as genomics
(e.g. Nevins et al., 2003; Pittman et al., 2004; West et al., 2006). Even though
the field has grown, as in glaucoma progression prediction (Kazemian
et al., 2018), the main goal there is a single value of interest. This thesis presents
methodology beyond forecasting a single value of interest (i.e. glaucoma progression
prediction, Kazemian et al., 2018); instead, it deals with forecasting across
thousands of households and hundreds of items.
Finally, in the retail domain, which is the setting of this thesis, relevant personalized
models are either not explicitly dynamic and unable to deal with a large number
of time series (e.g. Lichman and Smyth, 2018; Kazemian et al., 2018; Thai-Nghe
et al., 2011), or difficult to interpret (e.g. Salinas et al., 2019;
Chen et al., 2018). The former have hierarchical structure, but their formulations do
not allow computational scalability, which makes it difficult to model many time series.
The deep learning methods in the latter are probabilistic and scalable, but they
trade away interpretability. For forecasting and decision making in the retail
domain, this thesis presents a probabilistic, scalable and interpretable model. The
interpretability enables clear communication and easy decision making for downstream
collaborators. For example, one can easily determine how many discount coupons
a given household needs for a given item, so that the store can send out the appropriate
amount of promotions to targeted households at the right time.
1.3 Thesis Scope and Contributions
When it comes to commercial usage of dynamic modeling, we prefer online models
that generate full distributions of quantities of interest with both computational
speed and forecasting accuracy. This thesis adopts the Bayesian Dynamic Generalized
Linear Modeling framework by West and Harrison (1997), which is designed to be
sequential and probabilistic and is reviewed in section 2.2.
The challenge in this context arises from the inherent sparsity of the data at the
finest level of individual household and item, where random noise can dominate
the real signals. The sporadic counts can be well modeled by mixture models, such
as the Dynamic Count Mixture Models (Berry and West, 2020) and Dynamic Linear
Mixture Models (Yanchenko et al., 2021), which are described in section 2.3.

In practice, computational speed and accuracy trade off against each other:
we need to sacrifice one to compensate for the other. Within commercial
settings, new information comes in so fast that one cannot afford to
run a computationally intense model, for example a model that requires MCMC.
In order to promote efficiency while maintaining forecasting accuracy, we resort to the
decouple/recouple modeling strategy proposed by Berry and West (2020). Models
adopting this strategy are named multi-scale models; they first treat each series
independently and then propagate common simultaneous high-level signals down
to the decoupled series to restore the dependence. Decoupling enables fast parallel
computation, while recoupling mitigates the random noise at the finest level,
which contributes to overall accuracy along with the restored dependence. The rest
of chapter 2 reviews the framework of multi-scale modeling and showcases modeling
results on the data described in section 1.1. Note that even though section 2.4 describes the
multi-scale modeling of Berry and West (2020), section 2.5 utilizes an approximate
but much more efficient version by Lavine et al. (2020).
As noted in section 1.1, there is no demographic information about the households
available in the data. Since the multi-scale modeling scheme requires signals
from some aggregate level, we inevitably need to create a set of criteria to classify
households into groups. Chapter 3 discusses the motivation for, and the significance
beyond modeling of, the labeling/classification system in section 3.1, and
defines the specific standards in section 3.2, with section 3.3 showcasing a few examples
in both tabular and graphic forms.
Ultimately, the pursuit of better modeling is in the service of better decision making.
In the commercial context, questions such as how much discount one should give to
a particular household, or even what the optimal discount is, are of great interest.
The answers to such questions involve the decision maker's utility function: what
he/she prioritizes, i.e. short-term profits or long-term customer relationships. Chapter
4 explores these questions under the framework of chapter 2, in the context of
businesses like supermarkets, which is described in detail in section 4.2. Then, sections
4.3 and 4.4 walk through the mathematical details of decision optimization under
relevant models, from the simple to the more sophisticated, complemented by more
illustrations in section 4.4.
Chapter 5, as the last segment of the thesis, details my programming contributions
to the project (Yanchenko et al., 2021). It covers both the latent
factor modeling and the labeling system based on household purchasing behaviors,
which is introduced in chapter 3.
Chapter 2
Dynamic Models
2.1 Introduction
Dynamic modeling is of great interest of commercial outlets such as e-commence
companies like Amazon and supermarkets like Walmart and Target. This chapter
reviews the framework of the relevant models for our problem. Specifically, we use
dynamic multi-scale mixture models that are well suited to deal with multivariate
time series that are either non-negative counts or continuous-valued. Built off ex-
tensions to dynamic generalized linear models by West and Harrison (1997), these
models inherent the advantages of being sequential and probabilistic, and are able to
generate samples from the implied predictive distributions of target quantities, which
allows inference on various statistics and further decision analysis (Chapter 4).
Background
Many prior works pave the way for this thesis. Over 20 years ago, the framework of
dynamic generalized linear models was established (West and Harrison, 1997, chap.
14). In recent years, researchers have picked up the baton, extending and modifying
the old framework to build tailor-made models for count-valued time series (Berry
and West, 2020; Berry et al., 2020). The multi-scale modeling framework leverages
information from the aggregate level, which provides a potential solution for
zero-inflated data. To improve computational efficiency, Lavine et al. (2020) propose
a copula-based approximation that drastically speeds up the modeling while
maintaining forecasting accuracy.
2.2 DGLMs: Dynamic Generalized Linear Models
DGLMs are dynamic models whose response variables follow exponential family
distributions. The sampling model for a time series i over time t is given by Equation 2.1,
where i indexes the individual series and t denotes time.
p(yi,t|µi,t, τi,t, Dt) = b(yi,t, τi,t)exp[τi,t(yi,tµi,t − a(µi,t))], i = 1 : N, t = 1, 2, 3, . . .
(2.1)
Equation 2.1 is the conditional distribution of yi,t given all the information
available up to time t, denoted by Dt = {yt, Dt−1, It−1}, where It−1 represents
any additional relevant information beyond the observed data. Here µi,t and τi,t are the
natural parameter and the precision parameter, respectively. In the DGLM framework
the focus is µi,t, which maps to the linear predictor λi,t = g(µi,t) via a link
function g(·). As a state-space model, the dynamic Markov evolution is defined as
λi,t = F ′i,tθi,t where θi,t = Gi,tθi,t−1 + ωi,t with ωi,t ∼ [0,Wi,t] (2.2)
where
• Fi,t is a vector of known covariates at time t,
• θi,t is the state vector, which evolves via a first-order Markov process,
• Gi,t is a known state matrix,
• ωi,t is the stochastic innovation vector, or evolution "noise", with E(ωi,t|Dt−1, It−1) =
0 and V(ωi,t|Dt−1, It−1) = Wi,t, independently over time; Wi,t can be
controlled by the discount-factor scheme described in detail in Appendix A.2.
• For Poisson DGLMs, the design includes a random effect parameter ρ ∈ (0, 1] to
account for overdispersion. The models in both Sections 2.5.1 and 2.5.2
specify this parameter. More details on random effects can be found in Appendix
A.3.
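The evolve/forecast/update cycle behind Equations 2.1 and 2.2 is easiest to see in the conjugate normal special case (the DLM of West and Harrison, 1997). The sketch below is an illustration only, not the general DGLM algorithm: the evolution variance Wt is set implicitly through a single discount factor, in the spirit of Appendix A.2:

```python
import numpy as np

def dlm_step(m, C, F, G, y, v, delta=0.98):
    """One evolve/forecast/update cycle of a normal DLM with scalar response.
    The discount factor delta replaces an explicit W_t: the evolved state
    covariance is inflated as R_t = G C G' / delta."""
    a = G @ m                        # evolved state mean at time t
    R = G @ C @ G.T / delta          # discounted (inflated) state covariance
    f = float(F @ a)                 # one-step forecast mean
    q = float(F @ R @ F) + v         # one-step forecast variance
    A = R @ F / q                    # adaptive coefficient (Kalman gain)
    m_new = a + A * (y - f)          # posterior state mean after observing y
    C_new = R - np.outer(A, A) * q   # posterior state covariance
    return m_new, C_new, f, q

# local-level example: 2-dimensional state, observe only the first component
m, C = np.zeros(2), np.eye(2)
m, C, f, q = dlm_step(m, C, F=np.array([1.0, 0.0]), G=np.eye(2), y=2.0, v=1.0, delta=1.0)
```

Smaller delta forgets the past faster, so the state adapts more quickly at the cost of wider forecast intervals.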
2.3 DMMs: Dynamic Mixture Models
For zero-inflated data, a single DGLM is not flexible enough to capture
the signal at the finest level, which is the goal of our project: customized modeling
and decision making for individual households and items. Therefore, we resort to
mixture models which treat zeros separately, such as the Dynamic Count Mixture
Models (Berry and West, 2020) and Dynamic Linear Mixture Models (Yanchenko
et al., 2021). These two models are designed for non-negative counts and continuous
values, respectively. In the context of our business problem, the former models weekly
sales, while the latter is used for weekly spending.
2.3.1 DCMMs: Dynamic Count Mixture Models
To deal with non-negative counts with many zeros, Berry and West (2020)
propose Dynamic Count Mixture Models (DCMMs). These models are a mixture of a
Bernoulli DGLM and a shifted Poisson DGLM, as described by Equation 2.3.
Bernoulli DGLM: zt ∼ Ber(πt)
Poisson DGLM:  yt | zt = { 0, if zt = 0;  1 + st with st ∼ Po(µt), if zt = 1 }    (2.3)
The two components of this mixture, Bernoulli and Poisson, separately evolve,
predict and update just like univariate DGLMs, as detailed in Appendix A.1.
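To make the mixture concrete, the following sketch draws predictive samples from Equation 2.3; the probability πt and rate µt here are arbitrary illustrative values rather than fitted quantities:

```python
import numpy as np

rng = np.random.default_rng(0)

def dcmm_samples(pi_t, mu_t, n=10_000):
    """Predictive draws from the DCMM: z_t ~ Ber(pi_t); y_t = 0 when z_t = 0,
    and y_t = 1 + s_t with s_t ~ Po(mu_t) when z_t = 1."""
    z = rng.random(n) < pi_t                  # Bernoulli purchase indicator
    shifted = 1 + rng.poisson(mu_t, size=n)   # shifted Poisson counts, always >= 1
    return np.where(z, shifted, 0)

y = dcmm_samples(pi_t=0.3, mu_t=2.0)
# E[y] = pi_t * (1 + mu_t) = 0.3 * 3 = 0.9, so the sample mean sits near 0.9
```

The shift by one matters: conditional on a purchase occurring, the count is at least one, which a plain zero-inflated Poisson does not enforce.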
2.3.2 DLMMs: Dynamic Linear Mixture Models
With the similar strategy of treating the inflated zeros separately, Yanchenko et al.
(2021) propose Dynamic Linear Mixture Models, which are mixtures of Bernoulli
and normal DGLMs, as in Equation 2.4, and are used to model the logarithm of
the weekly spending of each individual household.
Bernoulli DGLM: zt ∼ Ber(πt)
Normal DGLM:  xt | zt = { 0, if zt = 0;  xt ∼ N(F′tθt, Vt), if zt = 1 }    (2.4)
Similarly, DLMMs retain the flexibility, computational efficiency, and full proba-
bilistic uncertainty from Bayesian DGLMs in West and Harrison (1997).
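Analogously, predictive samples of the implied weekly spending under Equation 2.4 can be sketched as follows; the Bernoulli probability and normal forecast moments are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)

def dlmm_spend_samples(pi_t, f_t, q_t, n=10_000):
    """DLMM draws: spend = 0 with probability 1 - pi_t; otherwise the
    log-spend x_t ~ N(f_t, q_t), so the spend itself is exp(x_t)."""
    z = rng.random(n) < pi_t
    log_spend = rng.normal(f_t, np.sqrt(q_t), size=n)
    return np.where(z, np.exp(log_spend), 0.0)

s = dlmm_spend_samples(pi_t=0.5, f_t=1.0, q_t=0.25)
# E[spend] = pi_t * exp(f_t + q_t / 2) = 0.5 * exp(1.125), about 1.54
```

Working on the log scale keeps the normal DGLM machinery while guaranteeing positive spending whenever a purchase occurs.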
2.4 Multi-scale Framework
In this project, the multi-scale modeling framework ties all the dynamic
models together. The framework has been conceptualized, developed, elaborated
and exemplified in prior work such as Berry and West (2020), Carvalho and
West (2007), and Ferreira et al. (2003, 2006). This prior work uses "bottom-up/top-down"
ideas and adopts them, in a novel way, in multi-scale time series models and
Bayesian forecasting.
Specifically, the application of multi-scale modeling relies on the natural hierarchy
of items defined by the store (see section 3.1 of Yanchenko et al. (2021)), and the
innovation of household grouping achieved by the labeling system/criteria in Chapter
3. This modeling strategy utilizes information across items and households that are
close in the hierarchy, which allows signals from the shared level to be
propagated in a "top-down" fashion, thus improving forecast accuracy at the finest
level: each item-household pair.

Under the multi-scale framework, the shared signals from the aggregate level are
simultaneous with the quantity of interest, as opposed to lagged. This retains the
online learning of the model of interest and takes into account the additional
uncertainty introduced by the high-level signals, which is critical for any inference on
the predictive samples. Note that this requires some knowledge/control of future values,
such as the discount percentages offered next week. The multi-scale framework
is described by Equation 2.5. (A summary of this section can be found in Appendix
A.4.)
Mi : Equations 2.1 and 2.2 with θi,t = (γi,t, βi,t)′ and Fi,t = (hi,t, φt)′, i = 1:N
M0 : φt ∼ p(φt | Dt−1)    (2.5)
Independent external models, denoted M0, model the simultaneous
predictor vector φt, which is incorporated into the individual models Mi through the
regressor vector Fi,t. Each Mi has its own dynamic state vector βi,t for the shared
signal φt, which allows individual models to respond uniquely to the shared higher-level
signal, thereby improving forecast accuracy.
For the implementation of the multi-scale model, Berry and West
(2020) propose a direct Monte Carlo method that obviates the use of Markov
chain Monte Carlo, while more recently Lavine et al. (2020) adopt an
analytic approximation which significantly boosts computational efficiency while
maintaining similar forecast accuracy.
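In the direct Monte Carlo spirit of Berry and West (2020), propagation can be sketched as: draw the shared signal φt from the external model's predictive distribution, then push each draw through an individual model's linear predictor and sampling distribution. Everything below (the Gaussian stand-in for φt's predictive, the fixed point-value coefficients, the log link) is an illustrative simplification, not the full algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def multiscale_poisson_samples(phi_draws, h_t, gamma, beta, reps=20):
    """For each draw of the shared signal phi_t (from M0), form the individual
    linear predictor lambda = gamma'h_t + beta * phi_t and sample Poisson
    counts through the log link, mixing over the uncertainty in phi_t."""
    lam = h_t @ gamma + beta * phi_draws   # linear predictor per phi draw
    rates = np.exp(lam)                    # log link: implied Poisson rates
    return rng.poisson(np.repeat(rates, reps))

phi = rng.normal(0.5, 0.1, size=500)       # stand-in for M0's predictive draws
y = multiscale_poisson_samples(phi, h_t=np.array([1.0]),
                               gamma=np.array([0.2]), beta=1.0)
# mixing over phi inflates the spread relative to a single fixed-rate Poisson
```

In the full model the individual coefficients are themselves uncertain and updated sequentially; the copula approximation of Lavine et al. (2020) replaces this sampling loop with an analytic step.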
2.5 Case Study and Examples
As part of the project (Yanchenko et al., 2021), one of the major goals of this thesis
is to identify, capture and utilize the price sensitivity of each household. Specifically,
as displayed in Figure 2.1, I find a multi-scale model that utilizes an aggregate
discount percentage across households to improve forecasting accuracy. The household
hierarchy used to aggregate the discount information is introduced and elaborated in
Chapter 3. In this chapter, models and modeling results for households with high
price sensitivity are exemplified, evaluated and discussed.

Similar to Figure 5 of Yanchenko et al. (2021), which visualizes the modeling
decomposition of each item-household pair, Figure 2.1 demonstrates the two main
implementations for this problem: "top-down" propagation along the hierarchy of
an item (store, department, category, product) or of a household (groups with different
price sensitivity/loyalty). This thesis mainly contributes to the latter.
Figure 2.1: Modeling hierarchy
2.5.1 Individual Household DGLMs
Following the decouple/recouple strategy, univariate DCMMs (Equation 2.3) model
the weekly sales of each household, with the first two covariates of the regressor
vector F′t being (1, discountt), where discountt is the simultaneous binary indicator
of a weekly promotion. The third covariate explored is the information in the weekly
discount percentage, used either directly or in aggregate. Models M1 and
M2 below give the full specifications.
M1 :
• Response variable yt: weekly sales of an item from a particular household
• F ′t = (1, discountt, discount percentt),G = I3
• Discount factors: ρ = 0.6, ρlocal linear = 0.98, ρregression = 0.98
M2 :
• Response variable yt: weekly sales of an item from a particular household
• F ′t = (1, discountt, aggregate discount percentt),G = I3
• Discount factors: ρ = 0.6, ρlocal linear = 0.98, ρregression = 0.98
2.5.2 Multi-scale Modeling
The idea of the multi-scale modeling is to have an aggregate level signal as a baseline
reference extracted from group behaviors, so that we have, at least, some ”safety”
information to draw on, when there is nothing but noise at the finest level (household).
In contrast to M1 which is a household specific model, M2 is a multi-scale model
(Equation 2.5) with a simultaneous covariate that incorporates price sensitivity across
a group of households. The exploration finds that M2 outperforms M1 and others,
especially when it comes to households with high price sensitivity. This section
describes the external model that generates the third covariate of M2: aggregate
discount percentt.
External Model Specification
The external model M0, whose parameters are specified below, is a Poisson DGLM
(Equations 2.1 and 2.2); a DCMM would have been practically equivalent, due to the
lack of zero inflation in the aggregate data.
M0 :
• Response variable yt: weekly sales of an item from a group of households
• F′t = (1, average discount percentt), G = I2
• Discount factors: ρ = 0.6, ρlocal linear = 0.998, ρregression = 0.998
Model Integration: Signal Propagation
Given the fitted coefficients of M0, the next step is to combine the aggregate-level
signal with the household-specific information. Specifically, given the state vector θt =
(αt, βt)′ of M0, the third covariate of M2 becomes aggregate discount percentt =
βt × discount percentt.

In the context of this application, βt can be interpreted as a measure of price
sensitivity for the item over the whole group of households, while multiplying it by
the household-specific discount percent accounts for the heterogeneity of promotions
within the group.
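The propagation step itself is just an elementwise product; the sketch below uses made-up numbers for βt and one household's weekly discount percents:

```python
import numpy as np

# Illustrative values only: beta_t would come from M0's sequentially updated
# state, and the discount percents from the household's own promotion record.
beta_t = np.array([0.8, 0.85, 0.9])              # group price-sensitivity coefficient
discount_percent = np.array([0.25, 0.0, 0.30])   # 0.0 => no promotion that week
aggregate_discount_percent = beta_t * discount_percent
# weeks without a promotion contribute exactly zero, matching Figure 2.2b
```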
Figure 2.2 shows the critical quantities over the forecasting period in the described
process. Figure 2.2a shows the coefficient of the average discount percent, βt, which
is multiplied by the simultaneous individual household discount percent to obtain
the customized aggregate discount percent for that particular household, shown in
Figure 2.2b. Note that in Figure 2.2a the coefficient is well above zero, indicating a
strong price sensitivity of the chosen household group, which validates the multi-scale
strategy. The latent factor in Figure 2.2b displays one household of the group; it
has zero values because the household was not offered any promotion
on item 62 in those weeks. Figure 2.2c gives the coefficient of the latent factor,
aggregate discount percent, in model M2. The mean and one-standard-deviation region
Figure 2.2: (a) Coefficient of average discount percent in the external model M0; (b) product of (a) and the actual discount percent; (c) coefficient of (b) in an individual model M2
implies that the combined covariate plays a consistently valuable role in that particular
model.
2.5.3 Model Evaluation and Comparison
Figure 2.2 illustrates the multi-scale model at the individual household level, this
section discusses the performance of the model on the chosen group, and compares
to other alternatives.
Figure 2.3 shows the accuracy of all four models: M1, M2, the TensorFlow logistic
regression model, and the naive guess (Equation 2.6) based on the promotion
indicator. Figure 2.3a compares the multi-scale model with each of the others, while
Figure 2.3b displays their individual accuracy distributions.
Naive Guess: zt = { 0, no discount;  1, discount }    (2.6)
In Figure 2.3a, each point represents a household in the group, and the straight
line is the y = x line; the points above the line are where the multi-scale
model outperforms the alternative. The first subplot of Figure 2.3a shows that
the multi-scale model is better than the naive guess when households follow the
promotions between 60% and 80% of the time. This is significant, because that is the
most common case and the one in which behavior is difficult to predict; if all customers
followed the promotions, there would be no need to build models more complex than
the naive guess. In the second subplot, for a few households, using the aggregate
discount percent conspicuously dominates the DGLM without the signal from an
aggregate level, which emphasizes the importance of "top-down" propagation, as
mentioned at the beginning of Section 2.5.2. Compared to the TensorFlow model,
the multi-scale model generates similar outcomes, while having the benefits of
being probabilistic, sequential, and a lot faster.
As the parent project of this one, Yanchenko et al. (2021) compare models M2
and M1 in terms of other metrics, such as MAD, MAPE and ZAPE. In particular,
Tables 4 and 6 there exemplify the improvement on a larger scale. The "simultaneous"
and "multi-scale" columns of Tables 4 and 6, respectively, report the performance
of the model chosen in Section 2.5.3. In the paper, the model is applied to a more
heterogeneous household group and outperforms the alternatives on all metrics.
(a) Accuracy pairwise comparisons
(b) Accuracy distributions
Figure 2.3: Model comparison in terms of forecasting accuracy. Naive model: Mnaive; DGLM: M1; Latent: M2; TF: a logistic regression model written in TensorFlow by 84.51°
2.6 Summary
This chapter reviews the framework of Bayesian dynamic generalized linear models
and its extensions to count-valued time series. The remainder elaborates and exemplifies
the multi-scale modeling approach. This "top-down" strategy shows potential for
the difficult task of modeling sparse data, compared to the other models
described in Section 2.5.3.

The extension of the multi-scale approach to hierarchical decomposition (Figure
2.1) not only captures household behavior, but also maintains scalability. The key is
to identify a group of individual series that share information (Chapter 3).

All models throughout this chapter are extensions of the fully probabilistic, interpretable,
sequential dynamic generalized linear models. They are tailor-made for the
individualized forecasting and decision making problem.
Chapter 3
Labeling System
The "top-down" modeling strategy (Berry and West, 2020) is suited to the personalized
household forecasting problem described in Section 2.5. It seeks common signals
at the aggregate level and propagates those signals down the hierarchy. This is effective
for sparse data, such as the data faced throughout this thesis. One obstacle to realizing
this modeling concept is the lack of proper aggregate information, which is the main
incentive for this chapter (Section 3.1).
3.1 Motivations and Purposes
To implement the multi-scale modeling strategy, which propagates clearer
signals from the aggregate level to each household, it is natural to develop a set of
grouping/clustering criteria with which we can circumvent the lack of demographic
information and identify the appropriate aggregate signals. The goal is to
group thousands of households according to their promotion scenarios and purchasing
behaviors. Based on the quantification of such scenarios and behaviors, households
are classified into eight categories, each geometrically represented by
an octant of the unit cube. Note that this process can
be implemented for every item actively sold in the store, which allows us
to identify and then model the aggregate signals for every household-item combination.
From a holistic perspective, the grouping not only enables the multi-scale
strategy, but also, as a guide, illuminates the proper actions for different groups
of households and identifies the strength of the model.
3.2 Labeling System
Due to the unavailability of the demographic information about the households, the
following grouping is developed, on the basis of the promotion circumstances and
buying behaviors, which are defined for every household, item combination, as below:
For every item-household pair: (i,h), i = 1:I, h = 1:H, with I, H being the total
number of items being sold and households recorded,
• Discount Offered Percentage (DOP): over the span of the 112 recorded weeks,
the proportion of weeks in which promotions on item i were offered to household h.
• Discounted Purchase Percentage (DPP): among the weeks in which item i was
discounted for household h, the proportion in which household h made a purchase.
• Regular Purchase Percentage (RPP): among the weeks in which item i was at
regular price for household h, the proportion in which household h made a purchase.
These three quantities together define a household space for each item, whose
domain is the unit cube, with eight divisions established and interpreted as below:
For i = 1:I,
1. The octant containing (DOPi, DPPi, RPPi) = (0, 0, 1);
Interpretation: Loyal households who are very consistent on item i.
2. The octant containing (DOPi, DPPi, RPPi) = (0, 1, 1);
Interpretation: Households similarly loyal to those of type 1, very consistent on
item i.
3. The octant containing (DOPi, DPPi, RPPi) = (1, 1, 0);
Interpretation: Promotion sensitive households who are responding and enjoy-
ing the discounts on item i.
4. The octant containing (DOPi, DPPi, RPPi) = (1, 1, 1);
Interpretation: Similar to type 3.
5. The octant containing (DOPi, DPPi, RPPi) = (0, 0, 0);
Interpretation: Untouched or pristine households who might respond to pro-
motions of item i if delivered.
6. The octant containing (DOPi, DPPi, RPPi) = (0, 1, 0);
Interpretation: Similar to type 5.
7. The octant containing (DOPi, DPPi, RPPi) = (1, 0, 0);
Interpretation: Disinterested households, despite promotions of item i.
8. The octant containing (DOPi, DPPi, RPPi) = (1, 0, 1);
Interpretation: Similar to type 7.
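To make the labeling concrete, here is a small Python sketch, with hypothetical column names and toy data, that computes DOP, DPP and RPP for one item-household pair and maps them to an octant type by thresholding each quantity at 0.5, the midpoint of the unit cube:

```python
import pandas as pd

# Toy weekly records for one (item, household) pair; the column names are
# hypothetical: 1 = promotion offered / purchase made that week, 0 = not.
weeks = pd.DataFrame({
    "promo":  [1, 1, 0, 0, 1, 0, 0, 0],
    "bought": [1, 1, 0, 1, 0, 1, 1, 0],
})

dop = weeks["promo"].mean()                            # Discount Offered Percentage
dpp = weeks.loc[weeks["promo"] == 1, "bought"].mean()  # Discounted Purchase Percentage
rpp = weeks.loc[weeks["promo"] == 0, "bought"].mean()  # Regular Purchase Percentage

# Octant label: threshold each quantity at 0.5 and read off the nearest corner
# (DOP, DPP, RPP); the type numbering follows the list above.
corner_to_type = {(0, 0, 1): 1, (0, 1, 1): 2, (1, 1, 0): 3, (1, 1, 1): 4,
                  (0, 0, 0): 5, (0, 1, 0): 6, (1, 0, 0): 7, (1, 0, 1): 8}
corner = tuple(int(v >= 0.5) for v in (dop, dpp, rpp))
household_type = corner_to_type[corner]   # here: type 2, a loyal household
```

Applying this per (item, household) pair over the full data reproduces the eight-way classification above.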
Based on this classification, the eight types of customers naturally coalesce into
four larger groups, as follows.
For i = 1:I, households among which:
1. habit and loyalty for item i are established.
Actions: Maintain the relationship and occasionally compensate for their loy-
alty to item i.
2. promotion sensitivity and interest in item i are detectable or even conspicuous;
this is the ideal group of customers for modeling price sensitivity.
Actions: It is interesting to find the amount of promotion on item i that generates
the most profit, which depends on the distribution of sales and the quantity being
optimized.
3. promotions are not available.
Actions: Explore and experiment with these customers by delivering promo-
tions of item i.
4. disinterest in the item or disregard for the promotions is noticeable.
Actions: Check the validity of the promotions sent out. If they are disregarded,
stop the promotions of item i.
Note that the households from group 2 above are of the most modeling interest,
given that the covariate is price.
3.3 Case Study and Examples
The examples demonstrated in this section result from applying the labeling system
of Section 3.2 to a portion of the data (the highest-spending household group)
described in Section 1.1.
Table 3.1: Example items for each group

        group1        group2        group3        group4
item   type1 type2   type3 type4   type5 type6   type7 type8   total
...
 62        0     0     578   254       0     0    1071    44    1947
...
 72       34   395      71     2     567   412       3     0    1484
...
176        5    46       0     0    1522    98      10     0    1681
...
199        0     0       5     6       7     0    1653     1    1672
...
As mentioned in Section 3.1, the labeling system aids in identifying signals to
model (discussed in Section 2.5), as well as shedding light on decision making at
the individual item, household level.
Table 3.1 displays four items, each with a significant number of households in one
of the groups. As discussed in Section 3.2, group 2 is of the most interest when it
comes to modeling households' sensitivity to promotions, and item 62 is the item
chosen to illustrate the multi-scale modeling strategy in Section 2.5 and the
multi-step decision analysis in Section 4.4. Group 1 contains the loyal customers
with consistent spending on a given item, exemplified by the 429 households
purchasing item 72. The majority of the households recorded for item 176 are
classified in group 3, which suggests tentative promotions. Lastly, item 199 is not
a selling item despite the promotions, which should draw the attention of decision
makers. Potential questions to investigate are whether the store should 1. check
the validity or accessibility of the promotions being sent out; 2. shrink the
promotions to save on mailing, etc.; 3. reduce the inventory, since the item does
not sell; or 4. bundle it with other items that sell.
Figure 3.1 visualizes the definition of the clustering criteria in Section 3.2, with
a couple of examples with and without the grouping regions. For a particular item,
Figure 3.1: Interactive visual aids for the labeling system. (a) Subspaces defined
by the labeling system; (b) 3D scatterplot of all households recorded for item X;
(c) 3D grouped scatterplot; (d) interactive legend.
these kinds of plots demonstrate its customers' sensitivity to promotions and
loyalty to the product, and help identify anomalies in the delivery of its
promotions. The four cuboids, each consisting of two octants, represent the four
customer groups. Each point in Figures 3.1b and 3.1c is a household recorded for
that particular item. The axes are the three quantities defined in Section 3.2
(DOP, DPP, RPP), with more information incorporated into the plots, such as the
household id, the exact values on the three axes, and a couple of categorical
attributes, as shown in Figure 3.1d.
This visualization serves as a dictionary and enables easy, straightforward
searching for any particular record in the data. For example, one might be curious
about the exact information of a point after locating it in group 2 of Figure 3.1c.
The user can then turn off the shading for the grouping, to display the mode of
Figure 3.1b, and hover over the chosen point for the household index, the average
discounted sales, whether or not the customer buys more with promotions, etc. The
tool also makes anomaly detection simple. For instance, a household buying
significantly more without discounts than with them is indicated by a cross, and is
easily distinguished from a household buying more with discounts than without them,
which is shown as a circle.
3.4 Summary
This chapter defines a set of standards for classifying households according to
their purchasing behavior. These criteria are best demonstrated and utilized
interactively, as illustrated by Section 3.3 and Figure 3.1. The outcomes are
referred to in Chapter 2, especially for the definition of aggregate information.
In addition, the user can interact with the figures, exploring features of interest
such as the popularity of items, promotion availability, and the distribution of
households in the space of purchasing behaviors.
Chapter 4
Decision Analysis
4.1 Introduction
In reality, for any business analysis, it is essential to make decisions and
understand the corresponding consequences and the uncertainties attached to them.
Decision analysis converts our statistical efforts into business potential and
bestows real-life significance upon the project. This chapter begins with a few
simple examples of decision analysis tailor-made for this context of item-specific
discount offers. It then proceeds to more realistic settings, where a
simulation-based approach shows advantages in terms of efficiency. Finally, the
chapter concludes with an example focusing not only on optimization of the expected
utility, but also on uncertainty analysis based on the full distribution,
showcasing the advantage of the probabilistic model.
4.2 Business Context
For retail businesses such as grocery stores or supermarkets, it is of great interest to
understand—for a given item—how discounts impact sales and eventually profits per
unit time. A typical setup would be the following:
• An item has usual/nominal selling price $p.
• Item cost is $c, intended to capture all real costs for the store (purchase/whole-
sale costs, storage, labour, etc).
• Percent discount 100d% for decision variable d ∈ (0, 1]; the discounted price is
$(1 − d)p. (To generate any profit, d must in fact satisfy d < 1 − c/p, as
discussed below.)
• Implied profit per item at discount d is then $p_d = ${(1 − d)p − c}. Note that
a short-term decision maker would always keep this value positive, i.e. d < 1 − c/p,
which is the scenario considered here. However, it is sometimes beneficial in the
long term to allow d ≥ 1 − c/p for a controlled period, i.e. to sacrifice
short-term profits to build lasting relationships with customers; that suggests a
more sophisticated setup than described here, with an extra term for the expected
long-term gain added to the expected utility.
• y is the number of items sold per unit time at offered discount.
• Implied expected profit (utility) is $u_d, where

u_d = E(y|d){(1 − d)p − c}    (4.1)

Intuitively, smaller d implies a higher price and lower expected sales, while
higher d increases expected sales but reduces the profit per sale. Hence u_d may
have an interior optimum within a reasonable range.
4.3 Tentative Models
4.3.1 Poisson Model
When it comes to non-negative counts, the Poisson model is one of the
lower-hanging fruits. Conditional on a chosen discount d, we assume the sales y of
a particular item follow a Poisson distribution and use a linear model on the log
link. Statistical details of the model and its optimization are shown below.
• Sales: y|d ∼ Po(µ_d) with log(µ_d) = α + β d. Naturally β > 0.
• Expected profit:

u_d = µ_d {(1 − d)p − c} = {(1 − d)p − c} exp{α + β d}    (4.2)
• Maximizing u_d is equivalent to maximizing log(u_d), with d ∈ (0, 1 − c/p]:

d_optimal = argmax_d log(u_d) = 1 − c/p − 1/β  if β > p/(p − c),  and 0 otherwise    (4.3)
• Sometimes in practice the actual selling price $p and the cost $c are not of
direct interest. In those circumstances it makes sense to replace p/c with r, i.e.
one plus the markup. Equation 4.3 then simply becomes

d_optimal = argmax_d log(u_d) = 1 − 1/r − 1/β  if β > r/(r − 1),  and 0 otherwise    (4.4)
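As a sanity check, the closed form in Equations 4.3/4.4 can be compared with a direct grid search over d; the coefficients and prices below are illustrative, not fitted values:

```python
import numpy as np

def d_opt_poisson(beta, r):
    """Closed-form optimal discount from Equations 4.3/4.4, with r = p/c."""
    if beta > r / (r - 1):
        return 1 - 1 / r - 1 / beta
    return 0.0

# illustrative (not fitted) coefficients and prices
alpha, beta, p, c = 0.2, 3.0, 2.0, 1.0   # so r = p/c = 2

def u(d):
    # Equation 4.2: expected profit under the Poisson model
    return ((1 - d) * p - c) * np.exp(alpha + beta * d)

d_star = d_opt_poisson(beta, p / c)
grid = np.linspace(0.0, 1 - c / p, 10_001)
d_num = grid[np.argmax(u(grid))]
```

Both approaches return d ≈ 1/6 here, since β = 3 exceeds p/(p − c) = 2.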
4.3.2 Mixture Model
• Sales: y|d = z(x + 1), where z and x are independent with

z ∼ Ber(π_d),  logit(π_d) = α_0 + β_0 d
x ∼ Po(µ_d),  log(µ_d) = α + β d    (4.5)

Naturally β_0, β > 0.
• Expected profit:

u_d = π_d (µ_d + 1) {(1 − d)p − c}
    = {(1 − d)p − c} logit^{-1}(α_0 + β_0 d) (1 + exp(α + β d))
    = {(1 − d)p − c} exp(α_0 + β_0 d)(1 + exp(α + β d)) / (1 + exp(α_0 + β_0 d))    (4.6)
• Maximizing u_d is equivalent to maximizing log(u_d), with d ∈ (0, 1 − c/p].
Taking the first derivative, we have the following problem:

β_0 (1 − π_d) + β µ_d / (1 + µ_d) = p / {(1 − d)p − c},    β_0, β > 0    (4.7)

Due to the difficulty of solving this for d_optimal analytically, we resort to a
numerical method for mode hunting, which is a seven-dimensional problem in
(d, α_0, β_0, α, β, p, c). Similar to how we obtained Equation 4.4, with
p/{(1 − d)p − c} written as r/{(1 − d)r − 1}, where r = p/c, we reduce the total
dimension to six. Moreover, by incorporating information from the business context,
we are able to narrow down the plausible values of some parameters and thus
mitigate the computational burden.
• Referring to Table 4.1, which shows the distributions of those four coefficients
over 300 household-level series, for the purpose of this analysis some plausible
domains are chosen as follows:

d ∈ (0, 1 − 1/r], where r = p/c
α_0 ∈ (−0.9, 1.2), taking the 10th and 90th percentiles
β_0 ∈ (0, 1.7), truncating to the positive portion
α ∈ (−0.55, 0.95), taking the 10th and 90th percentiles
β ∈ (0, 2.2), truncating the positive portion up to the 75th percentile
r ∈ (1.1, 2), simply a reasonable guess
Figure 4.1: Distributions of four model parameters
For each set of parameters, it is straightforward to compute the optimal discount
under the given utility. Figure 4.2 shows the relationships between the discount d
and π_d, µ_d and
Table 4.1: Summary statistics of logistic and poisson regressions

            alpha0      beta0       alpha       beta
count    300.000000  300.000000  300.000000  300.000000
mean       0.220211   -5.577831    0.206850    1.466264
std        0.911970    2.894260    0.596577    1.207612
min       -4.153728  -19.449333   -2.457143   -2.229326
10%       -0.872800   -8.751870   -0.556313   -0.010213
25%       -0.199389   -7.043806   -0.209543    0.716735
50%        0.323359   -5.459991    0.199102    1.416220
75%        0.788633   -3.830893    0.603878    2.208956
90%        1.229518   -2.197984    0.958858    3.157952
max        2.602268    1.693713    2.071097    5.358105
u_d, with four different sets of intercepts but the same slopes,
(β_0, β) = (1.7, 2.2). Since the intercepts represent the circumstances without
discounts, intercepts that are high relative to the slopes lead to d_optimal = 0:
the item is already popular without discounts, so a lower price would simply hurt
the profit while bringing only a marginal increase in sales. On the other hand,
very low values give the same result for a different reason: customers are so
indifferent to the item that even high discounts cannot attract them. In terms of
slopes, high values indicate high sensitivity to discounts, and an outstanding peak
in the utility can be expected.
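A minimal sketch of the numerical approach for the mixture model: with the utility written in terms of r = p/c, a grid search over d stands in for the analytically intractable mode hunting. The parameter set is hypothetical, chosen from the plausible domains above so that an interior optimum exists:

```python
import numpy as np

def mixture_utility(d, a0, b0, a, b, r):
    """Equation 4.6 with p and c replaced by r = p/c (profit in units of c)."""
    pi_d = 1.0 / (1.0 + np.exp(-(a0 + b0 * d)))   # Bernoulli purchase probability
    mu_d = np.exp(a + b * d)                      # Poisson mean of additional units
    return ((1 - d) * r - 1) * pi_d * (mu_d + 1)

# Hypothetical parameters: a rarely bought item (low intercept a0) with a
# strong promotion response (slopes at the upper ends of the domains above).
a0, b0, a, b, r = -3.0, 1.7, 0.0, 2.2, 2.0
grid = np.linspace(0.0, 1 - 1 / r, 5_001)
d_star = grid[np.argmax(mixture_utility(grid, a0, b0, a, b, r))]
```

With these values the first-order condition (Equation 4.7) is satisfied at an interior point, roughly d ≈ 0.15, so discounting beats the no-discount boundary.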
4.4 DCMMs
The decision analysis can be incorporated into the framework of Bayesian dynamic
linear models. Without approximations of the digamma and trigamma functions, the
optimization problem cannot be written in closed form; one has to resort to an
iterative numerical solution based on standard Newton-Raphson to find the implied
conjugate parameters. With the following approximations of the digamma and
trigamma functions, respectively, we are able to write out the optimization problem
in terms of the regressor vector F_t.
φ(x) ≈ log(x),    φ'(x) ≈ 1/x    (4.8)
For Binomial DGLMs, we have
α_t = (1 + exp(f_t)) / q_t,    β_t = (1 + exp(−f_t)) / q_t    (4.9)
and for Poisson DGLMs, we have
α_t = 1 / q_t,    β_t = exp(−f_t) / q_t    (4.10)
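A small sketch of these mappings (the function names are mine, not from PyBATS); under the approximations in Equation 4.8 they satisfy log(α_t/β_t) = f_t exactly, with 1/α_t + 1/β_t = q_t in the binomial case and 1/α_t = q_t in the Poisson case:

```python
import math

def binomial_conjugate_params(f, q):
    # Equation 4.9: Beta(alpha_t, beta_t) implied by lambda_t ~ [f_t, q_t]
    return (1 + math.exp(f)) / q, (1 + math.exp(-f)) / q

def poisson_conjugate_params(f, q):
    # Equation 4.10: Gamma(alpha_t, beta_t) implied by lambda_t ~ [f_t, q_t]
    return 1 / q, math.exp(-f) / q

# the implied conjugate priors for illustrative moments f_t = 0.7, q_t = 0.3
a_bin, b_bin = binomial_conjugate_params(0.7, 0.3)
a_poi, b_poi = poisson_conjugate_params(0.7, 0.3)
```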
In general, we might want to optimize the expectation of a scaled direct outcome,
here the profit, which is the product of the expected sales and a linear function
of the regressor vector F_t. Working through the math, we have the following
optimization problem:
π_t = α_t / (α_t + β_t) = (1 + exp(f_t)) / (2 + exp(f_t) + exp(−f_t))
µ_t = α_t / β_t = exp(f_t)
u_t = π_t (µ_t + 1) F'_t b    (4.11)
where f_t = F'_t a_t, with a_t the first moment of the evolved state vector θ_t
and b the vector of known linear coefficients. Note that this does not depend on
the second moment q_t, which makes sense because it is the first moment of the
profit that we are optimizing.
So, after simplification, we have
u_t(F_t) = {(1 + exp(F'_t a_t))^2 / (2 + exp(F'_t a_t) + exp(−F'_t a_t))} F'_t b    (4.12)
While Equation 4.12 is hard to solve analytically, it is straightforward to
approximate the optimal solution computationally when F_t is short. In this study,
F_t = (1, d)', where d is the discount percentage, which has a plausible range from
0 to 1. In reality, d < 1 − c/p for any positive profit, which means we need to
examine this problem under various cost-to-price ratios c/p.
Here, I have explored three different price-to-cost ratios: 1.2, 2 and 10. The
first two are realistic, while 10 is an experiment with an extreme case. The
question of interest is to forecast the outcomes if the store were to use the
optimal discount determined by Equation 4.12, with F_t = (1, d)' and
b = (p − c, −p)', every week for 52 weeks.
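A sketch of this computation: with F_t = (1, d)' and b = (p − c, −p)', Equation 4.12 becomes a one-dimensional function of d that a grid search handles directly. The state mean a_t and the prices below are illustrative values, not fitted ones:

```python
import numpy as np

def dcmm_expected_profit(d, a_t, p, c):
    """Equation 4.12 with F_t = (1, d)' and b = (p - c, -p)'."""
    f = a_t[0] + a_t[1] * d              # f_t = F_t' a_t
    margin = (p - c) - p * d             # F_t' b = (1 - d) p - c
    return (1 + np.exp(f)) ** 2 / (2 + np.exp(f) + np.exp(-f)) * margin

# illustrative one-step state mean (intercept, discount coefficient) and prices
a_t, p, c = np.array([-1.0, 3.0]), 2.0, 1.0
grid = np.linspace(0.0, 1 - c / p, 5_001)
d_star = grid[np.argmax(dcmm_expected_profit(grid, a_t, p, c))]
```

Since 2 + exp(f) + exp(−f) = (1 + exp(f))(1 + exp(−f)), the sales factor collapses to exp(f_t), so the grid optimum here coincides with the Poisson closed form 1 − c/p − 1/a_t[1] from Equation 4.3.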
In order to obtain the distributions of optimal discounts, corresponding profits
and parameters, it is convenient to utilize DCMMs, which, as emphasized throughout
this study and delineated in Appendix A, are probabilistic and sequential. Figures
4.3, 4.4 and 4.5 show such distributions for each scenario; Appendix B has more
figures of the relevant model parameters π_t and µ_t.
The figures are simulations of outcomes, given the models trained up to week 1,
under the policy that the store always picks the discount percentage maximizing
the total profit over the impending weeks. All three figures use the same
household, item pair; only p/c differs.
If one compares the results across the three, it is conspicuous that a lower cost
or higher sale price affords the store more room to offer discounts, thereby making
the weekly
Figure 4.3: Distributions of simulated outcomes over a year (p/c = 1.2): (a) optimal discounts; (b) optimal profits.
Figure 4.4: Distributions of simulated outcomes over a year (p/c = 2): (a) optimal discounts; (b) optimal profits.
Figure 4.5: Distributions of simulated outcomes over a year (p/c = 10): (a) optimal discounts; (b) optimal profits.
sales large enough to compensate for the discounted price and, as a result,
generating more total profit. It is also worth examining each figure along its
one-year horizon: the distributions appear to converge as enough time elapses.
This is easy to accept once we realize that, despite the careful design of
DCMMs/DGLMs, the model neither injects new information nor introduces
disturbances. So, after accepting the model trained up to week 1, we are bound to
obtain a stationary forecast after enough time has passed. This discloses the
difficulty of long-term forecasting: without any definite, deterministic insight,
forecasts made by stationary models (all pragmatic time series models) are simply
reflections of what has been observed. In comparison, short-term forecasting is
more plausible (Figure 2.3b), because our variables of interest are less volatile
in the short term than in the long term. In a few words, statistical models are
simply capsules of the available information; after training on the past, all one
can hope is that history sheds light on the future.
4.5 Summary
This chapter begins with a business setting that approximates reality, exploring
the decision analysis problem under a few models. The goal throughout is to
maximize the profit earned from an item by a household (Equation 4.1). Simple
models such as Poisson regression (Section 4.3) as well as DCMMs are studied for
the optimization problem (Section 4.4). I derive the mathematics for each case, at
least up to a simplification of the problem, and resort to numerical methods and
simulation-based computation when the analytical solution is difficult to obtain
(Section 4.4).
More areas can be explored in terms of the loss/utility function. Relevant loss
functions for zero-inflated, count-valued time series include ZAPE, adjusted ZAPE,
MAPE, etc. (Yanchenko et al., 2021). Moreover, the probabilistic model allows much
more complicated utility functions than those available to methods that only
provide point forecasts. For instance, a decision maker can ask for a 0.5 or
higher probability of gaining four dollars of profit from a household over a span
of two weeks.
Of course, there are expected but unsolved questions, such as long-term
forecasting and decision making. Forecasting long term has always been challenging
but intriguing, regardless of the field; applications range from natural disasters
such as earthquakes (Talebi, 2017) to artificial advancements (Kott and Perconti,
2018). The ability to forecast long term matters to policy-makers, business
owners, residents of a particular area, potentially everyone. However, since a bad
forecast is worse than no forecast at all, there are far fewer studies on
long-term horizons than on short-term ones. In my personal opinion, the best
forecast is to push the future toward the desired direction. We will meet the
future where our eyes are set; it could be late, but hopefully not absent.
Chapter 5
Computation, Implementation and Code
5.1 Introduction
This chapter is dedicated to showcasing the programming contributions I have made
to the project. Section 5.2 first introduces the mathematics behind the
programming, followed by Section 5.3, which contains the code generating the
interactive 3D clustering plots (Figure 3.1) in Section 3.3.
5.2 Copula Approximation
Lavine et al. (2020) propose a copula-based analytic method to approximate the
simulation-based one in Berry and West (2020). This approximation balances speed
and accuracy, substantially reducing the computational cost. This section derives
the mathematics behind Variational Bayes and Linear Bayes (VBLB) for multi-scale
DGLMs, as well as the programming contribution I have made to the published Python
package PyBATS.
5.2.1 VBLB for Latent Factor DGLMs
This subsection extends VBLB in Appendix A.1 to the latent factor modeling context.
To implement the method in the latent factor context, we first need to know the first
two moments of the linear predictor λi,t for all i = 1 : N and their covariances.
Expanding the expression for λ_{i,t}, we get

λ_{i,t} = F'_{i,t} θ_{i,t} = h'_{i,t} γ_{i,t} + φ'_t β_{i,t}    (5.1)
We also denote the first two moments of φ_t as φ_t | D_{t−1} ∼ [b_t, B_t] and
partition the moments of the state vector as follows (with ';' separating matrix
rows):

θ_{i,t} | D_{t−1} ∼ [ (a'_{γ,i,t}, a'_{β,i,t})',  ( R_{γ,i,t}  S_{i,t} ; S'_{i,t}  R_{β,i,t} ) ]    (5.2)
The mean of the linear predictor is then

f_{i,t} = E[λ_{i,t}] = E[F'_{i,t} θ_{i,t}] = h'_{i,t} a_{γ,i,t} + b'_t a_{β,i,t}    (5.3)
The variance of the linear predictor can be calculated using the law of total
covariance,
q_{i,t} = Var[λ_{i,t}] = Var[F'_{i,t} θ_{i,t}]
  = Cov(h'_{i,t} γ_{i,t} + φ'_t β_{i,t}, h'_{i,t} γ_{i,t} + φ'_t β_{i,t})
  = Cov(E[h'_{i,t} γ_{i,t} + φ'_t β_{i,t} | φ_t], E[h'_{i,t} γ_{i,t} + φ'_t β_{i,t} | φ_t])
      + E[Cov(h'_{i,t} γ_{i,t} + φ'_t β_{i,t}, h'_{i,t} γ_{i,t} + φ'_t β_{i,t} | φ_t)]
  = Cov(h'_{i,t} a_{γ,i,t} + φ'_t a_{β,i,t}, h'_{i,t} a_{γ,i,t} + φ'_t a_{β,i,t})    (noting φ_t is independent)
      + E[Var[h'_{i,t} γ_{i,t}] + φ'_t Var[β_{i,t}] φ_t + h'_{i,t} Cov(γ_{i,t}, β_{i,t}) φ_t + φ'_t Cov(β_{i,t}, γ_{i,t}) h_{i,t}]
  = Var[φ'_t a_{β,i,t}] + E[h'_{i,t} R_{γ,i,t} h_{i,t} + φ'_t Var[β_{i,t}] φ_t + 2 h'_{i,t} S_{i,t} φ_t]
  = a'_{β,i,t} B_t a_{β,i,t} + h'_{i,t} R_{γ,i,t} h_{i,t} + 2 h'_{i,t} S_{i,t} b_t + E[tr(φ'_t R_{β,i,t} φ_t)]    (5.4)
E[tr(φ'_t R_{β,i,t} φ_t)] = E[tr(R_{β,i,t} φ_t φ'_t)]
  = tr(E[R_{β,i,t} φ_t φ'_t])
  = tr(R_{β,i,t} E[φ_t φ'_t])
  = tr(R_{β,i,t} (Var[φ_t] + E[φ_t] E[φ_t]'))
  = tr(R_{β,i,t} B_t) + tr(R_{β,i,t} b_t b'_t)
  = tr(R_{β,i,t} B_t) + b'_t R_{β,i,t} b_t    (5.5)
Therefore, the moments of the linear predictor for the extended VBLB under latent
factor modeling are

f_{i,t} = h'_{i,t} a_{γ,i,t} + b'_t a_{β,i,t}
q_{i,t} = h'_{i,t} R_{γ,i,t} h_{i,t} + 2 h'_{i,t} S_{i,t} b_t + b'_t R_{β,i,t} b_t + a'_{β,i,t} B_t a_{β,i,t} + tr(R_{β,i,t} B_t)    (5.6)
Accordingly, the adaptive vector in the LB update step is R_{i,t} F̃_{i,t} / q_{i,t},
where F̃_{i,t} = (h'_{i,t}, b'_t)'. In contrast to traditional DGLMs, this modified
analysis introduces more uncertainty because φ_t is contemporaneous and comes from
another, external model; this extra uncertainty appears explicitly in q_{i,t} as
the last two terms.
Now that we have the means and variances, we only need the pairwise covariance
between λ_{i,t} and λ_{j,t}, i ≠ j, i, j = 1:N, to complete the joint covariance
matrix.
q_{i,j,t} = Cov(λ_{i,t}, λ_{j,t})
  = Cov(E[λ_{i,t} | φ_t], E[λ_{j,t} | φ_t]) + E[Cov(λ_{i,t}, λ_{j,t} | φ_t)]
  = Cov(h'_{i,t} a_{γ,i,t} + φ'_t a_{β,i,t}, h'_{j,t} a_{γ,j,t} + φ'_t a_{β,j,t}) + 0
  = Cov(φ'_t a_{β,i,t}, φ'_t a_{β,j,t})
  = a'_{β,i,t} B_t a_{β,j,t}    (5.7)
We obtain zero in the third step because of the independence between M_i and M_j
given φ_t, which is the key assumption of the decouple/recouple modeling strategy.
At this point, we have completed the modifications for the multiscale modeling
context, which paves the road for the construction of the copula in Section 3 of
Lavine et al. (2020).
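A minimal numpy sketch of the moment computations in Equations 5.6 and 5.7, with illustrative dimensions and randomly generated (but valid) moments:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions: 2 series-specific coefficients, 3 latent factors
h = rng.normal(size=2)                      # known regressors h_{i,t}
a_gamma = rng.normal(size=2)                # a_{gamma,i,t}
a_beta = rng.normal(size=3)                 # a_{beta,i,t}
b = rng.normal(size=3)                      # b_t = E[phi_t | D_{t-1}]
B = np.diag(rng.uniform(0.1, 1.0, size=3))  # B_t = Var[phi_t | D_{t-1}]
A = rng.normal(size=(5, 5))
R = A @ A.T + np.eye(5)                     # joint state covariance, partitioned:
R_gamma, S, R_beta = R[:2, :2], R[:2, 2:], R[2:, 2:]

# Equation 5.6: moments of the linear predictor lambda_{i,t}
f = h @ a_gamma + b @ a_beta
q = (h @ R_gamma @ h + 2 * h @ S @ b + b @ R_beta @ b
     + a_beta @ B @ a_beta + np.trace(R_beta @ B))

# Equation 5.7: cross-series covariance needs only a_{beta,j,t} and B_t
a_beta_j = rng.normal(size=3)
q_ij = a_beta @ B @ a_beta_j
```

The first three terms of q are exactly the quadratic form (h', b') R (h', b')', which is a quick consistency check on the partitioned formula.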
5.2.2 Code
This subsection contains aspects of the code I developed for the main modeling
components of the thesis research. It covers parts of the dynamic latent factor
framework that is part of the PyBATS package (https://lavinei.github.io/pybats/).
The first pair of functions extracts the linear predictor λ for the latent factor,
while the second pair generates scaled versions of model coefficients. The latter
can be achieved by using dlm_coef_fxn() and dlm_coef_forecast_fxn() together with
merge_lf_with_predictor(), whose explanations are available at
https://lavinei.github.io/pybats/latent_factor.html
## Latent factor functions for linear predictor lambda
import numpy as np
# helper constructors/utilities from the PyBATS package (exact module paths
# may differ slightly across versions)
from pybats.latent_factor import latent_factor
from pybats.forecast import forecast_aR, forecast_R_cov


def lambda_fxn(date, mod, k, **kwargs):
    """
    Return the mean and variance of the linear predictor lambda.

    :param date: date index
    :param mod: model being run
    :param k: forecast horizon
    :param kwargs: other arguments
    :return: mean and variance of lambda
    """
    return (mod.F.T @ mod.m).copy().reshape(-1), (mod.F.T @ mod.C @ mod.F).copy()


def lambda_forecast_fxn(date, mod, k, forecast_path=False, **kwargs):
    """
    Return the forecast mean and variance of lambda, plus the path
    covariances when forecast_path is True.

    :param date: date index
    :param mod: model being run
    :param k: forecast horizon
    :param forecast_path: True or False
    :param kwargs: other arguments
    :return: forecast mean and variance, plus covariances if forecast_path
    """
    lambda_mean = []
    lambda_var = []
    if forecast_path:
        # one vector of path covariances per horizon j = 2, ..., k
        lambda_cov = [np.zeros(h) for h in range(1, k)]
    for j in range(1, k + 1):
        # j-step-ahead prior moments of the state vector
        a, R = forecast_aR(mod, j)
        f, q = mod.get_mean_and_var(mod.F, a.reshape(-1), R)
        lambda_mean.append(f.copy())
        lambda_var.append(q.copy())
        if forecast_path and j > 1:
            for i in range(1, j):
                lambda_cov[j - 2][i - 1] = mod.F.T @ forecast_R_cov(mod, i, j) @ mod.F
    if forecast_path:
        return lambda_mean, lambda_var, lambda_cov
    else:
        return lambda_mean, lambda_var


lambda_lf = latent_factor(gen_fxn=lambda_fxn, gen_forecast_fxn=lambda_forecast_fxn)
## Latent factor functions for scaled model coefficients
def dlm_coef_scale_fxn(date, mod, scale=None, idx=None, scale_which=None, **kwargs):
    """
    Get the mean and variance of the coefficient latent factor.

    :param date: date index
    :param mod: model being run
    :param scale: scalars used to scale the mean and variance, as known fixed
        values (for example, covariates of models that use this latent factor);
        a pandas data frame with scalars in columns and dates as index
    :param scale_which: indices of coefficients to be scaled by the series in
        scale (must be within idx)
    :param idx: indices of the coefficients to extract
    :param kwargs: other arguments
    :return: mean and variance of the scaled coefficients
    """
    if scale is None:
        return dlm_coef_fxn(date, mod, idx, **kwargs)
    if idx is None:
        idx = np.arange(0, len(mod.m))
    if not set(scale_which).issubset(set(idx)):
        raise ValueError("scale_which needs to be a subset of idx")
    m_scale, C_scale = mod.m.copy(), mod.C.copy()
    scale_matrix = np.identity(C_scale.shape[0])
    scale_matrix[np.ix_(scale_which, scale_which)] = (
        scale.loc[date].values * scale_matrix[np.ix_(scale_which, scale_which)])
    m_scale = scale_matrix @ m_scale
    C_scale = scale_matrix @ C_scale @ scale_matrix
    return (m_scale[idx]).reshape(-1), (C_scale[np.ix_(idx, idx)]).copy()
def dlm_coef_scale_forecast_fxn(date, mod, k, scale=None, idx=None,
                                scale_which=None, forecast_path=False, **kwargs):
    """
    Compute the forecast mean, variance and, when forecast_path is True, the
    path covariances of the scaled coefficient latent factor.

    :param date: date index
    :param mod: model being run
    :param k: forecast horizon
    :param scale: scalars used to scale the mean and variance, as known fixed
        values (for example, covariates of models that use this latent factor);
        a pandas data frame with scalars in columns and dates as index
    :param scale_which: indices of coefficients to be scaled by the series in
        scale (must be within idx)
    :param idx: indices of the coefficients to extract
    :param forecast_path: True or False
    :param kwargs: other arguments
    :return: forecast mean and variance, plus covariances if forecast_path
    """
    if scale is None:
        # no scaling requested; forward the query unchanged
        return dlm_coef_forecast_fxn(date, mod, k, idx=idx,
                                     forecast_path=forecast_path, **kwargs)
    if idx is None:
        idx = np.arange(0, len(mod.m))
    p = len(idx)
    if not set(scale_which).issubset(set(idx)):
        raise ValueError("scale_which needs to be a subset of idx")
    dlm_coef_mean = []
    dlm_coef_var = []
    if forecast_path:
        dlm_coef_cov = [np.zeros([p, p, h]) for h in range(1, k)]
    for j in range(1, k + 1):
        # j-step-ahead prior moments of the state vector
        a, R = forecast_aR(mod, j)
        a_scale = a.copy()
        R_scale = R.copy()
        scale_matrix = np.identity(R_scale.shape[0])
        scale_matrix[np.ix_(scale_which, scale_which)] = (
            scale.loc[date].values * scale_matrix[np.ix_(scale_which, scale_which)])
        a_scale = scale_matrix @ a_scale
        R_scale = scale_matrix @ R_scale @ scale_matrix
        dlm_coef_mean.append(a_scale[idx].copy().reshape(-1))
        dlm_coef_var.append(R_scale[np.ix_(idx, idx)].copy())
        if forecast_path and j > 1:
            for i in range(1, j):
                R_cov_scale = forecast_aR(mod, i)[1]
                R_cov_scale = scale_matrix @ R_cov_scale @ scale_matrix
                Gk = np.linalg.matrix_power(mod.G, j - i)
                dlm_coef_cov[j - 2][:, :, i - 1] = (Gk @ R_cov_scale)[np.ix_(idx, idx)]
    if forecast_path:
        return dlm_coef_mean, dlm_coef_var, dlm_coef_cov
    else:
        return dlm_coef_mean, dlm_coef_var


dlm_coef_scale_lf = latent_factor(gen_fxn=dlm_coef_scale_fxn,
                                  gen_forecast_fxn=dlm_coef_scale_forecast_fxn)
5.3 Clustering Visualization
Below is the Python code that outputs the interactive 3D plots exemplified by
Figure 3.1. It includes the required packages and the data manipulation that
precedes the plotly plotting commands.
import pandas as pd
import numpy as np
from functools import reduce
from plotly.graph_objects import Scatter3d, Volume
from plotly.subplots import make_subplots
import plotly.io as pio

pio.renderers.default = "browser"


## define a function to merge multiple dataframes
def df_merge(df, on, how='outer'):
    """
    :param df: list of dataframes to be merged
    :param on: column(s) to merge on
    :param how: choices are "outer", "inner", or the index of the dataframe
        to use as the left table
    :return: a merged dataframe
    """
    if how in ['outer', 'inner']:
        return reduce(lambda x, y: pd.merge(x, y, on=on, how=how), df)
    else:
        df_reorder = [df.pop(how)]
        df_reorder.extend(df)
        return reduce(lambda x, y: pd.merge(x, y, on=on, how="left"), df_reorder)
## read in the provided data
data = pd.read_pickle('The household data')

## clean and create sensitivity data
item_discount_ratio = data.loc[:, ['date', 'item', 'household', 'discount',
                                   'discount_pot', 'item_qty', 'net_price',
                                   'regular_price']]
item_discount_ratio['discount_percentage'] = (
    (item_discount_ratio['regular_price'] - item_discount_ratio['net_price'])
    / item_discount_ratio['regular_price'])
item_discount_ratio['discount_sen'] = (
    item_discount_ratio.discount == item_discount_ratio.discount_pot)
sen = (item_discount_ratio
       .groupby(['item', 'household'], as_index=False, observed=True)
       [['discount_sen', 'discount_pot', 'item_qty']].mean()
       .sort_values(by=['discount_sen', 'discount_pot', 'item_qty'],
                    ascending=[False, False, True]))

## clean and create discounted purchase data
item_discount_ratio_buyd = item_discount_ratio.loc[
    item_discount_ratio.discount_pot == 1].copy()
item_discount_ratio_buyd['buy_discount'] = item_discount_ratio_buyd.item_qty > 0
buyd = (item_discount_ratio_buyd
        .groupby(['item', 'household'], as_index=False, observed=True)
        [['buy_discount', 'item_qty', 'discount_percentage']].mean()
        .sort_values(by=['buy_discount', 'item_qty', 'discount_percentage'],
                     ascending=[False, True, False]))

## clean and create regular purchase data
item_discount_ratio_buyr = item_discount_ratio.loc[
    item_discount_ratio.discount_pot == 0].copy()
item_discount_ratio_buyr['buy_regular'] = item_discount_ratio_buyr.item_qty > 0
buyr = (item_discount_ratio_buyr
        .groupby(['item', 'household'], as_index=False, observed=True)
        [['buy_regular', 'item_qty']].mean()
        .sort_values(by=['buy_regular', 'item_qty'], ascending=[False, True]))

# create extra variables for plotting
buy_data = df_merge([sen, buyd, buyr], on=['item', 'household'], how='outer')
buy_data.columns = ['item', 'household', 'Discount sensitivity', 'Discount offered',
                    'Sales', 'Discount buy', 'Discount sales', 'Discount percent',
                    'Regular buy', 'Regular sales']
buy_data.iloc[:, 2:] = buy_data.iloc[:, 2:].fillna(0)
buy_data['Buy more with discount'] = buy_data['Discount buy'] > buy_data['Regular buy']
buy_data['Discount level'] = ['small' if d < 0.25 else 'median' if d < 0.6 else 'large'
                              for d in buy_data['Discount percent']]
# 3D interactive plots for households
# Initialize figure with 212 3D subplots
rows = 4
cols = 4
specs = [[{’type’: ’scene’} for j in range(cols)] for i in
52
↪→ range(rows)]
subplot_titles = [’panel’ + str(i) for i in range(16)]
fig = make_subplots(
rows=rows, cols=cols,
specs=specs, subplot_titles=subplot_titles)
## There are 211 items, thus 211 scatter traces. Trace indices 211 to 274 are
## used to create color shadows for each customer group
for item in range(275):
    ## Plot 3D scatterplot for each household-item pair
    if item < 211:
        data = buy_data.loc[buy_data.item == item]
        # Generate data
        x = data['Discount offered']
        y = data['Discount buy']
        z = data['Regular buy']
        symbol = data['Buy more with discount'].map({True: 'circle', False: 'x'})
        color = data['Discount level'].map({'small': 'blue', 'median': 'green', 'large': 'red'})
        size = data['Discount sales']
        size = (size - np.min(size)) / (np.max(size) - np.min(size)) * 20 + 6
        # adding scatter traces to subplots
        fig.add_trace(
            Scatter3d(
                x=x,
                y=y,
                z=z,
                name='item' + str(item),
                visible=False,
                ## more information added to the scatterplots
                customdata=np.stack((data['household'].values,
                                     data['Buy more with discount'].values,
                                     data['Buy more with discount'].map(
                                         {True: 'Circle', False: 'X'}).values,
                                     data['Discount level'].values,
                                     data['Discount level'].map(
                                         {'small': '<25%', 'median': '25-60%',
                                          'large': '>60%'}).values),
                                    axis=-1),
                mode='markers',
                marker=dict(
                    size=size,
                    color=color,
                    cauto=True,
                    symbol=symbol,
                    opacity=0.8
                ),
                hovertemplate=
                '<b>Household</b>: %{customdata[0]}<br>' +
                '<b>Discount offered</b>: %{x:.0%}<br>' +
                '<b>Discount buy</b>: %{y:.0%}<br>' +
                '<b>Regular buy</b>: %{z:.0%}<br>' +
                '<b>Discount sales</b>: %{marker.size:.2f} units<br>' +
                '<b>Buy more with discount</b>: %{customdata[1]} (%{customdata[2]})<br>' +
                '<b>Discount level</b>: %{customdata[3]} (%{customdata[4]})<br>',
                hoverlabel=dict(bgcolor=color)
            ),
            row=(np.floor((item % 16) / 4) + 1).astype('int'), col=(item % 16) % 4 + 1)
        if item < 16:
            fig['layout']['scene' + str(item + 1)]['xaxis'] = {'title': {'text': 'Discount offered'}}
            fig['layout']['scene' + str(item + 1)]['yaxis'] = {'title': {'text': 'Discount buy'}}
            fig['layout']['scene' + str(item + 1)]['zaxis'] = {'title': {'text': 'Regular buy'}}
    ## Create shadow for each customer group
    else:
        if (item - 211) % 4 == 0:
            X, Y, Z = np.mgrid[0:0.5:2j, 0:1:2j, 0.5:1:2j]
            values = np.zeros(X.shape)
        elif (item - 211) % 4 == 1:
            X, Y, Z = np.mgrid[0.5:1:2j, 0.5:1:2j, 0:1:2j]
            values = np.ones(X.shape)
        elif (item - 211) % 4 == 2:
            X, Y, Z = np.mgrid[0:0.5:2j, 0:1:2j, 0:0.5:2j]
            values = np.ones(X.shape) * 2
        elif (item - 211) % 4 == 3:
            X, Y, Z = np.mgrid[0.5:1:2j, 0:0.5:2j, 0:1:2j]
            values = np.ones(X.shape) * 3
        x = X.flatten()
        y = Y.flatten()
        z = Z.flatten()
        value = values.flatten()
        fig.add_trace(
            Volume(
                name='group' + str((item - 211) % 4),
                x=x,
                y=y,
                z=z,
                value=value,
                opacity=0.3,  # needs to be small to see through all surfaces
                surface_count=50,  # needs to be a large number for good volume rendering
                colorscale="RdBu",
                showlegend=False,
                showscale=False,
                isomax=3,
                isomin=0,
                hovertemplate='<b>Group</b>: #%{value:.f}'
            ),
            row=(1 + np.floor((np.floor((item - 211) / 4)) / 4)).astype('int'),
            col=((np.floor((item - 211) / 4)) % 4 + 1).astype('int')
        )
# create buttons: 14 "pages" of 16 panels (the last page holds items 208--210);
# each page gets one plain button and one "grouped" button that also shows the
# 64 group-shadow traces
buttons = []
for i in range(14):
    visible_items = [False if np.floor(item / 16) != i else True for item in range(211)]
    if i < 13:
        label = "item" + str(i * 16) + "--" + str((i + 1) * 16 - 1)
    else:
        label = "item208--210"
    buttons.append(dict(method='update',
                        args=[{"visible": visible_items + [False] * 64}],
                        label=label))
    buttons.append(dict(method='update',
                        args=[{"visible": visible_items + [True] * 64}],
                        label=label + " grouped"))
fig.update_layout(scene=dict(
        xaxis_title='Discount offered',
        yaxis_title='Discount buy',
        zaxis_title='Regular buy'),
    title_text='Comprehensive 3D plots for all items (Author: Daniel Deng)',
    height=2500,
    width=1800,
    updatemenus=[dict(type='buttons',
                      buttons=buttons,
                      x=1.09,
                      xanchor='left',
                      y=1,
                      yanchor='top')],
    hovermode='closest'
)
fig.show()
Chapter 6
Conclusions and Summary
Recently, the application of statistical modeling to commercial problems has surged,
as companies have discovered the tremendous upside of the insights it reveals.
Businesses such as Walmart, Amazon, and Harris Teeter have begun to seek statistical
methods that improve their decision making and thereby strengthen their customer
relationships.
In this thesis, I introduce multi-scale modeling within the Bayesian Dynamic
Modeling framework. It showcases the power of hierarchical, sequential, probabilistic
and computationally efficient models, and emphasizes the novel decouple-recouple
modeling strategy, which propagates signals down the hierarchy. I also demonstrate
the resulting improvement in forecasting accuracy.
This method aims to mitigate the difficulty of forecasting sporadic data, which
sits at the finest level of our hierarchy. This has been a challenge for years,
so the improvement brought by this approach is another step forward. The multi-
scale modeling successfully inherits the hierarchical information of the retail setting:
households visit a store, spending on items, and purchasing outcomes connect across
large categories of items to small, refined categories, and eventually to specific items.
In Chapter 3, I elaborate on the design and criteria that I use to classify thousands
of households based on their purchasing behaviors. This not only enables the multi-
scale modeling in Chapter 2, but also sets an example of visualized learning, providing
a valuable way of thinking about customer behaviors. The case study exemplifies
the identification of price-sensitive households, which paves the way for customized
decision analysis.
In the end, making good decisions is the ultimate goal of modeling. In Chapter 4,
I explore the optimal-discount problem under various models. I also attempt decision
analysis over a longer period of time (a year). Even though the results
(Section 4.4) align with business sense, the use of the model over such a long time
span remains questionable.
Finally, I describe my programming contributions to this project
(Yanchenko et al. (2021)). First, my research has contributed extensions and
functionality for latent factor dynamic modeling to the existing PyBATS package.
Second, my development of innovative data assessment and dynamic visualization with
household labeling has produced software that is available for further applications.
Future Work and Comments
This thesis presents a novel approach to efficiently forecasting sparse time series.
However, the decision analysis based on the model has much more to explore than is
presented in Chapter 4. The main obstacle is long-term forecasting. First, one needs
to define how long "long-term" is, based on the context. For instance, three to six
months might be long for a retail setting, while one to two years can be short for
earthquake or volcanic-eruption forecasting. Problems with more artificial components
are generally easier than those without control. Second, accounting for the significant
factors can be challenging. Sometimes, even for a social-behavior problem like
retailing, there may be unexpected shocks that make our forecast obsolete (e.g.,
COVID-19 in 2020). Lastly, the uncertainty associated with our forecast increases
rapidly with the length of the horizon and the number of uncertain factors. This can
leave us with a statistically correct forecast that has no pragmatic use.
I prefer to think about long-term forecasting in the following way.
No one truly foresees the future. Instead, we can only study the past for insights
that help our decision making in the present, which in turn shapes the future. As
statisticians, we learn from history in a quantitative way: from data. We extract and
summarize information buried in the data that is not visible to the naked eye. As a
result, interpretability and openness are key, assuming that we do not believe in some
"black box" that determines our future (see what happened to Catholicism when the
plague hit). Therefore, I think the problem of long-term forecasting is not simply a
modeling or mathematical problem; rather, it is closely related to the horizon of
total human knowledge.
Returning to statistical modeling and decision analysis, a rational decision maker
should listen to multiple sources, to decrease the uncertainty in the quality of any
single agent. Bayesian Predictive Synthesis (McAlinn et al., 2020; West and Crosse,
1992; West, 1992) provides a potential framework for future researchers. A decision
maker using such a framework takes into account all probabilistic information from
the available agents, and updates his or her own opinion on the quantity of interest.
Appendix A
DGLMs
• yt denotes the time series of interest, whether it is continuous, binary or a
non-negative count.
• At any given time t, available information is denoted by Dt = {yt, Dt−1, It−1},
where It−1 is any relevant additional information at time t− 1.
• Ft, θt are the dynamic regression vector and state vector at time t, respectively.
• λt = F ′tθt, where λt is the linear predictor at time t. It links the parameter of
interest to the linear regression via a link function, i.e., λt = logit(πt) for the bino-
mial DGLM and λt = log(µt) for the Poisson DGLM, where πt, µt are the probability
of success and the mean of these processes, respectively.
• The state vector θt evolves via θt = Gtθt−1 +wt with wt ∼ (0,Wt), where Gt is the
known evolution matrix and wt is the stochastic innovation vector.
• wt is independent of current and past states, with moments E[wt|Dt−1, It−1] = 0
and V [wt|Dt−1, It−1] = Wt.
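The notation above can be illustrated with a short numpy sketch (the vectors F_t and theta_t below are illustrative numbers, not values from the thesis): the linear predictor λt = F′tθt is pushed through the inverse link to recover the natural parameter of each family.

```python
import numpy as np

# Hypothetical regression and state vectors at time t (illustrative values).
F_t = np.array([1.0, 0.5, 1.0])        # intercept, price, promotion indicator
theta_t = np.array([-1.0, 0.4, 0.8])   # current state vector

# Linear predictor: lambda_t = F_t' theta_t
lambda_t = F_t @ theta_t

# Inverse links recover the natural parameters:
pi_t = 1.0 / (1.0 + np.exp(-lambda_t))  # binomial: lambda_t = logit(pi_t)
mu_t = np.exp(lambda_t)                 # Poisson:  lambda_t = log(mu_t)
```

With these numbers λt = 0, so the implied Bernoulli probability is 0.5 and the implied Poisson mean is 1.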
A.1 VBLB
1. Current information is summarized in the mean vector and variance matrix of the
posterior state vector: θt−1|Dt−1, It−1 ∼ [mt−1,Ct−1].
2. Via the evolution equation θt = Gtθt−1 + wt, the implied 1-step ahead prior
moments at time t are θt|Dt−1, It−1 ∼ [at,Rt], with at = Gtmt−1 and
Rt = GtCt−1G′t +Wt.
3. The time t conjugate prior satisfies E[λt|Dt−1, It−1] = ft = F ′tat and
V [λt|Dt−1, It−1] = qt = F ′tRtFt.
i.e.
Binomial: yt ∼ Bin(ht, πt), conjugate prior: πt ∼ Be(αt, βt), with ft =
ψ(αt) − ψ(βt) and qt = ψ′(αt) + ψ′(βt), where ψ(x), ψ′(x) are digamma and
trigamma functions.
Poisson: yt ∼ Poi(µt), conjugate prior: µt ∼ Ga(αt, βt), with ft = ψ(αt) −
log(βt) and qt = ψ′(αt).
4. Forecast yt 1-step ahead using the conjugacy-induced predictive distribution
p(yt|Dt−1, It−1). This can be simulated trivially.
5. Observing yt, update to the posterior.
i.e.
Binomial: conjugate posterior: πt ∼ Be(αt + yt, βt + ht − yt).
Poisson: conjugate posterior µt ∼ Ga(αt + yt, βt + 1).
6. Update posterior mean and variance of the linear predictor λt: gt = E[λt|Dt]
and pt = V [λt|Dt]
7. Linear Bayes estimation gives posterior moments mt = at +RtFt(gt − ft)/qt
and Ct = Rt −RtFtF ′tRt(1− pt/qt)/qt.
This completes the time (t−1)-to-t evolve-predict-update cycle.
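The evolve-predict-update cycle above can be sketched in numpy. This is a minimal illustration with made-up moments: the conjugate-matching step (solving for the Beta or Gamma parameters from ft, qt) is skipped, and the updated posterior moments gt, pt of the linear predictor are simply assumed.

```python
import numpy as np

# Posterior at t-1: theta_{t-1} | D_{t-1} ~ [m, C]  (illustrative values)
m = np.array([0.2, 0.1])
C = np.diag([0.05, 0.02])

G = np.eye(2)                 # evolution matrix
W = np.diag([0.01, 0.005])    # innovation variance
F = np.array([1.0, 0.6])      # regression vector at time t

# 1-2. Evolve: prior moments theta_t | D_{t-1} ~ [a, R]
a = G @ m
R = G @ C @ G.T + W

# 3. Prior moments of the linear predictor lambda_t = F' theta_t
f = F @ a
q = F @ R @ F

# 5-6. After observing y_t, conjugate updating gives posterior moments of
#      lambda_t; here g, p are simply assumed values for illustration:
g, p = 0.35, 0.8 * q

# 7. Linear Bayes posterior for the state vector:
RF = R @ F
m_new = a + RF * (g - f) / q
C_new = R - np.outer(RF, RF) * (1 - p / q) / q
```

A useful consistency check on this update is that the implied posterior variance of the linear predictor, F′CtF, equals pt exactly.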
A.2 Discount Factors
• The regression vector Ft can include an intercept and known quantities, such as the
price of items or an indicator of whether or not a firewall is in use.
i.e.
F ′t = (1, pricet, promotiont, 1, 0, 1, 0, 1, 0)
• The evolution matrix Gt is usually block-diagonal. For the regular covariates
in Ft, Gt takes values of 1, allowing the corresponding coefficients to
evolve with the random innovation wt; Gt can also include seasonal effects
through blocks of seasonal components.
i.e.
Gt = blockdiag(1, 1, 1,H1,H2,H3), where
Hj = [[cos(2πj/7), sin(2πj/7)], [−sin(2πj/7), cos(2πj/7)]], for j = 1, 2, 3
• The evolution variance matrix Wt can be controlled by discount factors δj ∈ (0, 1],
j = 1 : J , via the following design:
Note that Rt = GtCt−1G′t +Wt.
Let Pt = GtCt−1G′t and Wt = blockdiag(Pt1(1− δ1)/δ1, . . . ,PtJ(1− δJ)/δJ),
where Ptj is the corresponding diagonal block of Pt.
This design enables separate discount factors for different components: each
component's uncertainty is inflated by the factor (1− δj)/δj, while the correlations
within Ptj are maintained.
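The block-diagonal Gt and the discount construction of Wt can be sketched as follows; this is a numpy-only illustration in which the block sizes, discount values and prior variance are made up, and block_diag is a small local helper.

```python
import numpy as np

def harmonic_block(j, period=7):
    """Rotation matrix H_j for the j-th seasonal harmonic."""
    w = 2 * np.pi * j / period
    return np.array([[np.cos(w), np.sin(w)],
                     [-np.sin(w), np.cos(w)]])

def block_diag(*blocks):
    """Minimal block-diagonal helper (numpy-only)."""
    sizes = [np.atleast_2d(b).shape[0] for b in blocks]
    out = np.zeros((sum(sizes), sum(sizes)))
    i = 0
    for b, s in zip(blocks, sizes):
        out[i:i + s, i:i + s] = np.atleast_2d(b)
        i += s
    return out

# G_t: three free-form coefficients plus three weekly harmonics (9x9 total)
G = block_diag(1.0, 1.0, 1.0, *[harmonic_block(j) for j in (1, 2, 3)])

# Discounted evolution variance: with P_t = G C_{t-1} G', each block j gets
# W_tj = P_tj (1 - delta_j) / delta_j
C_prev = np.eye(9) * 0.1
P = G @ C_prev @ G.T
deltas = [0.98, 0.95, 0.95, 0.9, 0.9, 0.9]   # one discount per block (illustrative)
sizes = [1, 1, 1, 2, 2, 2]
W_blocks, i = [], 0
for d, s in zip(deltas, sizes):
    W_blocks.append(P[i:i + s, i:i + s] * (1 - d) / d)
    i += s
W = block_diag(*W_blocks)
R = P + W   # prior variance R_t = P_t + W_t
```

Because each rotation Hj is orthogonal, with C_prev = 0.1·I the prior variance of each block simply scales to 0.1/δj, making the inflation easy to verify.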
A.3 Random Effects
• Applicable to any DGLMs.
• Capture additional variation.
• Extended state vector: θt = (ξt,θ′t,0)′ and regression vector: F ′t = (1,F ′t,0),
where ξt is a sequence of independent, zero-mean random effects and θt,0, Ft,0
are the baseline state vector and regression vector. Extended linear predictor:
λt = ξt + λt,0.
• ξt provides additional, day-specific "shocks" to the latent coefficients.
• A random effect discount factor ρ ∈ (0, 1] is used to control the level of vari-
ability injected (in a similar fashion to the other discount factors):
i.e.
With qt,0 = V [λt,0|Dt−1, It−1], let vt = V [ξt|Dt−1, It−1] = qt,0(1− ρ)/ρ, which inflates
the variance of λt by the factor (1− ρ)/ρ.
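The variance-inflation arithmetic is simple enough to check directly (the values of q0 and rho below are illustrative):

```python
import numpy as np

# Random-effect variance inflation (sketch). Baseline prior variance of the
# linear predictor q_{t,0} and random-effect discount rho in (0, 1]:
q0 = 0.4
rho = 0.6
v = q0 * (1 - rho) / rho        # variance of the random effect xi_t
q = q0 + v                      # inflated prior variance of lambda_t
assert np.isclose(q, q0 / rho)  # total inflation factor is exactly 1/rho
```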
A.4 Multi-scale Modeling
• Use the decouple/recouple method to enable information sharing across series as
well as scalability.
• Add information at the aggregate level to avoid signals being obscured by noise.
• Each of the N univariate series has a state vector and regression vector
defined by:
Mi : θi,t = (γ′i,t,β′i,t)′, Fi,t = (f ′i,t,φ′t)′, i = 1 : N,
which implies λi,t = γ′i,tfi,t + β′i,tφt, where the first term carries series-specific
information, while φt is a latent factor shared by all series.
• φt, the common latent factor, can be any shared quantity and is modeled by an-
other DGLM, denoted M0, conditional on which the updating and forecasting
of each Mi proceed separately and in parallel.
• This decoupling/recoupling technique enables scalability across the N individual
series, while creating linkage across them.
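The decouple/recouple flow can be sketched schematically; here update_series, beta_i and the simulated factor moments are illustrative stand-ins for real DGLM output, not part of PyBATS.

```python
import numpy as np

rng = np.random.default_rng(0)

# An aggregate model M0 supplies the common latent factor phi_t; here its
# forecast mean/variance over T periods are simulated stand-ins.
T, N = 10, 4
phi_mean = rng.normal(size=T)          # E[phi_t | D_{t-1}] from M0
phi_var = np.full(T, 0.05)             # V[phi_t | D_{t-1}] from M0

def update_series(i, phi_m, phi_v):
    """Stand-in for series-specific DGLM updating conditional on phi_t.
    Each M_i carries its own coefficient beta_i on the shared factor."""
    beta_i = 0.5 + 0.1 * i
    lam_contrib = beta_i * phi_m        # factor contribution to lambda_{i,t}
    lam_var = beta_i**2 * phi_v         # corresponding variance contribution
    return lam_contrib, lam_var

# decoupled step: each series is processed independently, conditional on M0's
# factor moments, so the loop is trivially parallelizable
results = [update_series(i, phi_mean, phi_var) for i in range(N)]
```

In the thesis's actual pipeline this parallel step is what joblib's Parallel/delayed calls in Appendix C implement, with analysis_dcmm in place of update_series.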
Appendix B
More Figures
Here are figures of the parameter distributions from the case study in Section 4.4.
We can see that one is able to boost the shifted mean µt of the DCMMs and the
probability πt by offering larger discounts, provided the item remains profitable.
(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.1: Distributions of simulated parameters over a year (p/c = 1.2).
(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.2: Distributions of simulated parameters over a year (p/c = 2).
(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.3: Distributions of simulated parameters over a year (p/c = 10).
Appendix C
More Code
This final appendix provides the code used for the modeling.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

####### My pybats is called pybats_latest; change that to your file name.
from pybats_latest.analysis import analysis, analysis_dcmm
from pybats_latest.latent_factor import dlm_coef_scale_fxn, \
    dlm_coef_scale_forecast_fxn, latent_factor, \
    merge_lf_with_predictor, dlm_coef_fxn, dlm_coef_forecast_fxn
from pybats_latest.point_forecast import zape_point_estimate
from sklearn.metrics import roc_auc_score, f1_score
from functools import partial, reduce

## These are used for the actual modeling; see the commented part at the bottom.
from complement import create_agg_data, latent_factor_generator, \
    multi_scale_modeling, list_files
from joblib import Parallel, delayed
import multiprocessing
import time
import os
def create_agg_data(directory, data_names, data_name):
    """
    Create aggregate-level data.
    :param directory: directory of the data and where you want to save the result
    :param data_names: data paths for each individual data file
    :param data_name: name of the aggregate data file to be stored
    :return:
    """
    item_total_sales = np.zeros(112)
    item_total_transaction = np.zeros(112)
    item_discount = np.zeros(112)
    item_discount_perc = np.zeros(112)
    total_household = np.zeros(112)
    for f in data_names:
        data = pd.read_pickle(directory + '/' + f)
        item_total_sales += data.item_qty.values
        item_total_transaction += (data.item_qty > 0).astype('int').values
        item_discount += data.discount_pot.values
        item_discount_perc += data.discount_amount_pot.values / data.regular_price.values
        total_household += 1
    item_discount = item_discount / len(data_names)
    item_discount_perc = item_discount_perc / len(data_names)
    item_data = pd.DataFrame({'date': data.date.values,
                              'total_sales': item_total_sales,
                              'total_transaction': item_total_transaction,
                              'discount': item_discount,
                              'discount_perc': item_discount_perc,
                              'total_household': total_household})
    item_data.to_pickle(directory + '/' + data_name)
def latent_factor_generator(agg_data_path):
    """
    Generate the latent factor at the aggregate level.
    :param agg_data_path: a string of the data path
    :return: a list of latent factors
    """
    agg_data = pd.read_pickle(agg_data_path)
    Y_sales = agg_data.total_sales.values
    X = np.c_[agg_data.discount.values, agg_data.discount_perc.values]
    # X[-8:, 0] = 1
    # X[-8:, 1] += 0.1
    n = agg_data.total_household.values

    # latent factor parameters
    prior_length = 8
    nsamps = 5000
    delregn = 0.98
    deltrend = 0.98
    delseas = 0.98
    rho = 0.6
    adapt_discount = None
    forecast_start = 52
    forecast_end = 103
    k = 8
    T = 112
    start_date = pd.to_datetime('2017-09-05')  # Make up a start date
    dates = pd.date_range(start_date, start_date + pd.DateOffset(days=T - 1), freq='D')
    forecast_start_date = start_date + pd.DateOffset(days=forecast_start)
    forecast_end_date = dates[-1] - pd.DateOffset(days=k)

    # latent factor modeling
    ## latent factor for sales
    idx = np.array([2])
    dlm_coef_fxn_sales = partial(dlm_coef_fxn, idx=idx)
    dlm_coef_forecast_fxn_sales = partial(dlm_coef_forecast_fxn, idx=idx)
    discount_sensitivity_lf_sales = latent_factor(
        gen_fxn=dlm_coef_fxn_sales,
        gen_forecast_fxn=dlm_coef_forecast_fxn_sales)
    discount_latent_sales = analysis(
        Y=Y_sales, X=X, family='poisson', prior_length=prior_length,
        k=k, rho=rho,
        forecast_start=forecast_start, forecast_end=forecast_end,
        forecast_start_date=forecast_start_date, forecast_end_date=forecast_end_date,
        dates=dates,
        nsamps=nsamps,
        deltrend=deltrend, delregn=delregn, adapt_discount=adapt_discount,
        ret=['new_latent_factors'],
        new_latent_factors=[discount_sensitivity_lf_sales.copy()])
    ##### The plotting block that used to live here (commented out) is replaced
    ##### by the latent_factor_plot() function defined below.
    return [discount_latent_sales]
def multi_scale_modeling(data_path, latent_factor):
    """
    Implement multi-scale modeling on the data for one household-item series.
    :param data_path: path of the time series chosen
    :param latent_factor: latent factor to be used
    :return:
    """
    discount_latent = latent_factor
    try:
        data = pd.read_pickle(data_path)
        Y = data.item_qty.values
        buy = (Y > 0).astype('float')
        X = np.c_[data.discount_pot.values,
                  data.discount_amount_pot.values / data.regular_price.values]
        # X[-8:, 0] = 1
        # X[-8:, 1] += 0.1
        household = data.household.iloc[0]
        group = data_path[-5:-4]

        # model parameters
        prior_length = 52
        nsamps = 5000
        delregn = 0.998
        deltrend = 0.998
        delseas = 0.998
        rho = 0.6
        adapt_discount = None
        forecast_start = 52
        forecast_end = 103
        k = 8
        T = len(Y)
        start_date = pd.to_datetime('2017-09-05')  # Make up a start date
        dates = pd.date_range(start_date, start_date + pd.DateOffset(days=T - 1), freq='D')
        forecast_start_date = start_date + pd.DateOffset(days=forecast_start)
        forecast_end_date = dates[-1] - pd.DateOffset(days=k)

        ### You can add your signal as a latent factor in the latent factor list.
        ### Here the first one is my latent factor: discount times coefficients.
        discount_latent[0] = merge_lf_with_predictor(discount_latent[0], X[:, 1], dates)
        # discount_latent[1] = "Your latent factor" (you will need to append it
        # to the input of this function)
        # this function plots mean and sd shadows for the latent factors in the input:
        # latent_factor_plot(discount_latent, directory=, names=)
        print("begin " + str(household))
        try:
            samples = analysis_dcmm(
                Y=Y, X=X[:, 0].reshape(112, -1), k=k, prior_length=prior_length,
                forecast_start=forecast_start, forecast_end=forecast_end,
                forecast_start_date=forecast_start_date, forecast_end_date=forecast_end_date,
                dates=dates, latent_factor=discount_latent[0],
                nsamps=nsamps, rho=rho,
                delseas=delseas, deltrend=deltrend, delregn=delregn,
                adapt_discount=adapt_discount,
                ret=['forecast'])
            print(samples.shape)

            # point forecasts
            buy_samples = (samples > 0).astype('float')
            medians = np.median(buy_samples[:, :, 0], axis=0).astype('int')
            # probs = np.mean(buy_samples, axis=(0, 2))

            # performance scores
            accuracy = (medians == buy[forecast_start + 1:forecast_end + 2]).astype('float').mean()
            f1 = f1_score(buy[forecast_start + 1:forecast_end + 2].astype('int'), medians)
            naive = (X[:, 0][forecast_start + 1:forecast_end + 2]
                     == buy[forecast_start + 1:forecast_end + 2]).astype('float').mean()
            naive_f1 = f1_score(buy[forecast_start + 1:forecast_end + 2].astype('int'),
                                X[:, 0][forecast_start + 1:forecast_end + 2].astype('int'))
            zape = zape_point_estimate(samples)
            print(str(household) + " finished")
            return [household, accuracy, f1, naive, naive_f1, zape,
                    np.median(buy_samples, axis=0)]
        except ValueError:
            print("error!!!!!!!!!!!")
    except EOFError:
        print("Oops")
def latent_factor_plot(latent_factor, directory, names):
    """
    :param latent_factor: a list of latent factors you want to plot
    :param directory: path where you want to save the figures
    :param names: list of names for your figures
    :return: saves the figures to `directory`
    """
    # note: relies on `dates`, `forecast_start` and `forecast_end` being
    # defined in the enclosing scope
    for l, n in zip(latent_factor, names):
        M = np.array([])
        V = np.array([])
        for date in dates[forecast_start:forecast_end]:
            m, v = l.get_lf_forecast(date)
            M = np.append(M, m[0])
            V = np.append(V, v[0])
        lf_mean = pd.DataFrame({'average': M, 'upper': M + np.sqrt(V),
                                'lower': M - np.sqrt(V),
                                'date': dates[forecast_start:forecast_end]})
        fig, ax = plt.subplots(1, 1)
        ax.plot(np.arange(0, len(lf_mean.date), 1), lf_mean.average.values,
                color='red', alpha=0.5, label='mean')
        ax.fill_between(np.arange(0, len(lf_mean.date), 1),
                        lf_mean.upper.values, lf_mean.lower.values,
                        alpha=0.4, label='unit sd region')
        plt.xticks(rotation=20)
        plt.legend()
        ax.set_ylabel("Coefficient multiplied by household discount percent")
        ax.set_title("Latent factor for " + n)
        fig.savefig(directory + '/' + n + '.png')
## Here is how I fit those models; you can adapt this if you want to.
## These work with the functions above.

# # Data names
# item72_names_group0 = list_files(os.getcwd() + "/Data/Items/group0", "item72", ".pkl")
# item62_names_group1 = list_files(os.getcwd() + "/Data/Items/group1", "item62", ".pkl")
# item17_names_group2 = list_files(os.getcwd() + "/Data/Items/group2", "item17", ".pkl")
# item76_names_group3 = list_files(os.getcwd() + "/Data/Items/group3", "item76", ".pkl")
#
# # create aggregate level data
# create_agg_data(os.getcwd() + "/Data/Items/group0", item72_names_group0, 'agg-72-group0.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group1", item62_names_group1, 'agg-62-group1.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group2", item17_names_group2, 'agg-17-group2.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group3", item76_names_group3, 'agg-76-group3.pkl')
#
# # create latent factors
# latent_factor72 = latent_factor_generator(os.getcwd() + "/Data/Items/group0/" + 'agg-72-group0.pkl')
# latent_factor62 = latent_factor_generator(os.getcwd() + "/Data/Items/group1/" + 'agg-62-group1.pkl')
# latent_factor17 = latent_factor_generator(os.getcwd() + "/Data/Items/group2/" + 'agg-17-group2.pkl')
# latent_factor76 = latent_factor_generator(os.getcwd() + "/Data/Items/group3/" + 'agg-76-group3.pkl')
#
# # parallelism
# num_cores = multiprocessing.cpu_count()
#
# scores72 = []
# scores72.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group0/" + data_path, latent_factor=latent_factor72)
#     for data_path in item72_names_group0))
#
# scores62 = []
# scores62.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group1/" + data_path, latent_factor=latent_factor62)
#     for data_path in item62_names_group1))
#
# scores17 = []
# scores17.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group2/" + data_path, latent_factor=latent_factor17)
#     for data_path in item17_names_group2))
#
# scores76 = []
# scores76.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group3/" + data_path, latent_factor=latent_factor76)
#     for data_path in item76_names_group3))
#
# # get rid of results that are None
# scores72 = [[score for score in scores72[0] if score is not None]]
# scores62 = [[score for score in scores62[0] if score is not None]]
# scores17 = [[score for score in scores17[0] if score is not None]]
# scores76 = [[score for score in scores76[0] if score is not None]]
#
# # print out how many households are left
# print(len(scores72[0]))
# print(len(scores62[0]))
# print(len(scores17[0]))
# print(len(scores76[0]))
#
# # save the results in a numpy zip file
# np.savez(os.getcwd() + "/plots/performances",
#          item72=np.array(scores72[0]), item62=np.array(scores62[0]),
#          item17=np.array(scores17[0]), item76=np.array(scores76[0]))
Bibliography

Berry, L. R., P. Helman, and M. West (2020). Probabilistic forecasting of heterogeneous consumer transaction-sales time series. International Journal of Forecasting 36, 552–569.

Berry, L. R. and M. West (2020). Bayesian forecasting of many count-valued time series. Journal of Business and Economic Statistics 38, 872–887.

Carvalho, C. M. and M. West (2007). Dynamic matrix-variate graphical models. Bayesian Analysis 2, 69–98.

Chen, T., B. Keng, and J. Moreno (2018). Multivariate arrival times with recurrent neural networks for personalized demand forecasting. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 810–819.

Chu, W. and S.-T. Park (2009). Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, New York, NY, USA, pp. 691–700. Association for Computing Machinery.

Du, C., C. Li, Y. Zheng, J. Zhu, and B. Zhang (2018, February). Collaborative filtering with user-item co-autoregressive models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana. Association for the Advancement of Artificial Intelligence.

Ferreira, M. A. R., Z. Bi, M. West, H. K. H. Lee, and D. M. Higdon (2003). Multiscale modelling of 1-D permeability fields. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West (Eds.), Bayesian Statistics 7, pp. 519–528. Oxford University Press.

Ferreira, M. A. R., M. West, H. K. H. Lee, and D. M. Higdon (2006). Multiscale and hidden resolution time series models. Bayesian Analysis 2, 294–314.

He, X., Z. He, X. Du, and T.-S. Chua (2018, July). Adversarial personalized ranking for recommendation. In SIGIR '18: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI.

Hu, Y., Q. Peng, X. Hu, and R. Yang (2015). Web service recommendation based on time series forecasting and collaborative filtering. In 2015 IEEE International Conference on Web Services, pp. 233–240.

Jerfel, G., M. Basbug, and B. Engelhardt (2017, 20–22 Apr). Dynamic collaborative filtering with compound Poisson factorization. Volume 54 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, pp. 738–747. PMLR.

Kazemian, P., M. S. Lavieri, M. P. V. Oyen, C. Andrews, and J. D. Stein (2018, April). Personalized prediction of Glaucoma progression under different target intraocular pressure levels using filtered forecasting methods. Ophthalmology 125(4), 569–577.

Kott, A. and P. Perconti (2018). Long-term forecasts of military technologies for a 20-30 year horizon: An empirical assessment of accuracy. Technological Forecasting and Social Change 137, 272–279.

Lavine, I., A. J. Cron, and M. West (2020). Bayesian computation in dynamic latent factor models. Technical Report, Department of Statistical Science, Duke University. arxiv.org/abs/2007.04956.

Lichman, M. and P. Smyth (2018, April). Prediction of sparse user-item consumption rates with zero-inflated Poisson regression. In WWW '18: Proceedings of the 2018 World Wide Web Conference, pp. 719–728.

McAlinn, K., K. A. Aastveit, J. Nakajima, and M. West (2020). Multivariate Bayesian predictive synthesis in macroeconomic forecasting. Journal of the American Statistical Association 115, 1092–1110. arXiv:1711.01667. Published online: Oct 9, 2019.

Naumov, M., D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy (2019). Deep learning recommendation model for personalization and recommendation systems. arxiv.org/abs/1906.00091.

Nevins, J. R., E. S. Huang, H. Dressman, J. L. Pittman, A. T. Huang, and M. West (2003). Towards integrated clinico-genomic models for personalized medicine: Combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Human Molecular Genetics 12, 153–157.

Niu, W., J. Caverlee, and H. Lu (2018). Neural personalized ranking for image recommendation. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). ACM.

Pittman, J. L., E. S. Huang, H. K. Dressman, C. F. Horng, S. H. Cheng, M. H. Tsou, C. M. Chen, A. Bild, E. S. Iversen, A. T. Huang, J. R. Nevins, and M. West (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences 101, 8431–8436.

Salinas, D., M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus (2019). High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, pp. 6827–6837. Curran Associates, Inc.

Su, X. and T. M. Khoshgoftaar (2009, Jan). A survey of collaborative filtering techniques. Advances in Artificial Intelligence.

Talebi, M., M. Zare, A. Peresan, and A. Ansari (2017). Long-term probabilistic forecast for M ≥ 5.0 earthquakes in Iran. Pure Appl. Geophys. 174, 1561–1580.

Thai-Nghe, N., T. Horváth, and L. Schmidt-Thieme (2011). Personalized forecasting student performance. In 2011 IEEE 11th International Conference on Advanced Learning Technologies, pp. 412–414.

Wang, X., Y. Wang, D. Hsu, and Y. Wang (2013). Exploration in interactive personalized music recommendation: A reinforcement learning approach. ACM Trans. Multimedia Comput. Commun. Appl. 2(3).

West, M. (1992). Modelling agent forecast distributions. Journal of the Royal Statistical Society (Ser. B) 54, 553–567.

West, M. and J. Crosse (1992). Modelling of probabilistic agent opinion. Journal of the Royal Statistical Society (Ser. B) 54, 285–299.

West, M. and P. J. Harrison (1997). Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag, New York, Inc.

West, M., A. T. Huang, G. S. Ginsberg, and J. R. Nevins (2006). Embracing the complexity of genomic data for personalized medicine. Genome Research 16, 559–566.

Yanchenko, A., D. D. Deng, J. Li, A. J. Cron, and M. West (2021). Hierarchical dynamic modelling for individualized Bayesian forecasting. Department of Statistical Science, Duke University. Submitted for publication. arXiv:2101.03408.