
Hierarchical Signal Propagation for Household Level Sales in

Bayesian Dynamic Models

by

Di Deng

Department of Statistical Science
Duke University

Date:

Approved:

Mike West, Advisor

Peter Hoff

Andrew Cron

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

in the Department of Statistical Science in the Graduate School of

Duke University

2021

ABSTRACT

Hierarchical Signal Propagation for Household Level Sales in

Bayesian Dynamic Models

by

Di Deng

Department of Statistical Science
Duke University

Date:

Approved:

Mike West, Advisor

Peter Hoff

Andrew Cron

An abstract of a thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

in the Department of Statistical Science in the Graduate School of

Duke University

2021

Copyright © 2021 by Di Deng

All rights reserved

Abstract

Large consumer sales companies frequently face challenges in customizing decision making for each individual customer or household. This dissertation presents a novel, efficient and interpretable approach to such personalized business strategies, involving multi-scale dynamic modeling, Bayesian decision analysis and detailed application in the context of supermarket promotion decisions and sales forecasting.

We use a hierarchical, sequential, probabilistic and computationally efficient Bayesian dynamic modeling framework to propagate signals down the hierarchy, from the level of overall supermarket sales in a store, to items sold in a department of the store, within refined categories in a department, and then to the finest level of individual items on sale. Scalability is achieved by extending the decouple-recouple concept: the core example involves 162,319 time series over a span of 112 weeks, arising from combinations of 211 items and 2,000 households. In addition to novel dynamic model developments and application in this multi-scale framework, this thesis also develops a comprehensive customer labeling system, built on customer purchasing behavior in the context of prices and discounts offered by the store. This labeling system addresses a main goal in the applied context: defining a customer categorization that aids business decision making beyond the currently adopted models. Further, a key and complementary contribution of the thesis is the development of Bayesian decision analysis using a set of loss functions suited to the context of price discount selection for supermarket promotions. Formal decision analysis is explored both theoretically and via simulations. Finally, some of the modeling developments in the multi-scale framework are of general interest beyond the specific applied motivating context here, and are incorporated into the latest version of PyBATS, a Python package for Bayesian time series analysis and forecasting.


Contents

Abstract iv

List of Figures viii

List of Tables ix

1 Introduction 1

1.1 Data and Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Prior Relevant Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.3 Thesis Scope and Contributions . . . . . . . . . . . . . . . . . . . . . 3

2 Dynamic Models 5

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 DGLMs: Dynamic Generalized Linear Models . . . . . . . . . . . . . 6

2.3 DMMs: Dynamic Mixture Models . . . . . . . . . . . . . . . . . . . . 7

2.3.1 DCMMs: Dynamic Count Mixture Models . . . . . . . . . . . 7

2.3.2 DLMMs: Dynamic Linear Mixture Models . . . . . . . . . . . 8

2.4 Multi-scale Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.5 Case Study and Examples . . . . . . . . . . . . . . . . . . . . . . . . 10

2.5.1 Individual Household DGLMs . . . . . . . . . . . . . . . . . . 11

2.5.2 Multi-scale Modeling . . . . . . . . . . . . . . . . . . . . . . . 12

2.5.3 Model Evaluation and Comparison . . . . . . . . . . . . . . . 14

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Labeling System 18

3.1 Motivations and Purposes . . . . . . . . . . . . . . . . . . . . . . . . 18


3.2 Labeling System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 Case Study and Examples . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Decision Analysis 25

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.2 Business Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Tentative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3.1 Poisson Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.3.2 Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 DCMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Computation, Implementation and Code 39

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.2 Copula Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.2.1 VBLB for Latent Factor DGLMs . . . . . . . . . . . . . . . . 39

5.2.2 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.3 Clustering Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 49

6 Conclusions and Summary 60

Appendices 63

A DGLMs 64

A.1 VBLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

A.2 Discount Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

A.3 Random Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


A.4 Multi-scale Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

B More Figures 69

C More Code 73


List of Figures

2.1 Modeling hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 (a) Coefficient of average discount percent in the external model M0; (b) Product of (a) and actual discount percent; (c) Coefficient of (b) in an individual model M2 . . . . . . 14

2.3 Model comparison in terms of forecasting accuracy. Naive model: Mnaive;

DGLM: M1; Latent: M2; TF: a logistic regression model written in Tensor-

Flow by 84.51◦ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 Interactive visual aids of labeling system . . . . . . . . . . . . . . . . . . 23

4.1 Distributions of four model parameters . . . . . . . . . . . . . . . . . . 29

4.2 Utility vs. Discount . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3 Distributions of simulated outcomes over a year (p/c = 1.2). . . . . . 34

4.4 Distributions of simulated outcomes over a year (p/c = 2). . . . . . . 35

4.5 Distributions of simulated outcomes over a year (p/c = 10). . . . . . . 36

B.1 Distributions of simulated parameters over a year (p/c = 1.2). . . . . 70

B.2 Distributions of simulated parameters over a year (p/c = 2). . . . . . 71

B.3 Distributions of simulated parameters over a year (p/c = 10). . . . . . 72


List of Tables

3.1 Example items for each group . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Summary statistics of logistic and poisson regressions . . . . . . . . . . . 30


Chapter 1

Introduction

1.1 Data and Context

The data analyzed throughout the case study are provided by 84.51°. They record weekly purchasing data for 211 actively selling items across over 2,000 households, over a span of 112 weeks from September 5th, 2017 to October 22nd, 2019. For each row/visit, the key numeric variables include the regular price, the discounted/net price, and the units sold, together with identification information such as the date, the household identification number, and the category/department identification numbers of items.

We create a few derived variables for modeling purposes: the total money spent on items in each visit, a dummy variable indicating whether or not there was a promotion, and the discount percentage, defined as the ratio of the discount to the regular price.
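As a concrete illustration, the following minimal pandas sketch constructs these derived variables from a transaction-level table. The column names (regular_price, net_price, units) are hypothetical stand-ins for the actual fields in the 84.51° extract, not the data's real schema.

import pandas as pd

# Hypothetical transaction-level data; column names are illustrative only.
visits = pd.DataFrame({
    "household_id":  [101, 101, 205],
    "item_id":       [62, 72, 62],
    "regular_price": [3.00, 5.50, 3.00],
    "net_price":     [2.40, 5.50, 3.00],
    "units":         [2, 1, 1],
})

# Total money spent on the item in the visit.
visits["spend"] = visits["net_price"] * visits["units"]
# Dummy variable: was any promotion applied?
visits["discount"] = (visits["net_price"] < visits["regular_price"]).astype(int)
# Discount percentage: ratio of the discount to the regular price.
visits["discount_percent"] = 1.0 - visits["net_price"] / visits["regular_price"]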

1.2 Prior Relevant Work

The motivation for this thesis is closely related to customized forecasting and decision making in various retail contexts (e.g. Chen et al., 2018), although individualized statistical models also find broader application in recommendation systems, ranging from images (Niu et al., 2018) to music (Wang et al., 2013).

Collaborative filtering and matrix factorization (Su and Khoshgoftaar, 2009; Du et al., 2018) are usually the pillars of customized recommendation systems. More recently, deep learning has shown potential in dealing with big, complex data (He et al., 2018; Niu et al., 2018; Naumov et al., 2019). Unlike most non-dynamic methods, this thesis aims to forecast the purchasing behavior of an individual in a fully dynamic setting, in the spirit of the dynamic extension of matrix factorization (Jerfel et al., 2017) and the temporal features in Chu and Park (2009) and Hu et al. (2015).

Customized prediction is also significant in medical applications, such as genomics (e.g. Nevins et al., 2003; Pittman et al., 2004; West et al., 2006). Even in growing areas such as glaucoma progression prediction (Kazemian et al., 2018), the main goal typically focuses on a single value of interest. This thesis presents methodology that goes beyond forecasting a single quantity: it deals with forecasting across thousands of households and hundreds of items.

Finally, in the retail domain, which is the setting of this thesis, relevant personalized models are either not explicitly dynamic and unable to deal with a large number of time series (e.g. Lichman and Smyth, 2018; Kazemian et al., 2018; Thai-Nghe et al., 2011), or difficult to interpret (e.g. Salinas et al., 2019; Chen et al., 2018). The former have hierarchical structure, but their formulations do not allow computational scalability, which makes it difficult to model many time series. The deep learning methods in the latter are probabilistic and scalable, but they trade away interpretability. For forecasting and decision making in the retail domain, this thesis presents a probabilistic, scalable and interpretable model. Its interpretability enables clear communication and easy decision making for downstream collaborators. For example, one can readily determine which households need how many discount coupons for which items, so that the store can send the appropriate amount of promotion to targeted households at the right time.

1.3 Thesis Scope and Contributions

For commercial use of dynamic modeling, we prefer online models that generate full distributions of the quantities of interest with both computational speed and forecasting accuracy. This thesis adopts the Bayesian dynamic generalized linear modeling framework of West and Harrison (1997), which is designed to be sequential and probabilistic and is reviewed in Section 2.2.

The challenge in this context arises from the inherent sparsity of the data at the finest level of individual household and item, where random noise can dominate the real signal. The resulting sporadic counts can be well modeled by mixture models, such as Dynamic Count Mixture Models (Berry and West, 2020) and Dynamic Linear Mixture Models (Yanchenko et al., 2021), which are described in Section 2.3.

In practice, computational speed and accuracy involve a trade-off: we often have to sacrifice one to compensate for the other. In commercial settings, new information arrives so fast that one cannot afford to run a computationally intense model, for example one that requires MCMC. To promote efficiency while maintaining forecasting accuracy, we resort to the decouple-recouple modeling strategy proposed by Berry and West (2020). Models adopting this strategy are called multi-scale models; they first treat each series independently and then propagate common simultaneous high-level signals down to the decoupled series to restore the dependence. Decoupling enables fast parallel computation, while recoupling mitigates the noise issue at the finest level, which contributes to overall accuracy along with the restored dependence. The rest of Chapter 2 reviews the multi-scale modeling framework and showcases modeling results on the data described in Section 1.1. Note that even though Section 2.4 describes the multi-scale modeling of Berry and West (2020), Section 2.5 utilizes an approximate but much more efficient version by Lavine et al. (2020).

As noted in Section 1.1, the data contain no demographic information about the households. Since the multi-scale modeling scheme requires signals from some aggregate level, we need to create a set of criteria to classify households into groups. Chapter 3 discusses the motivation for, and the significance beyond modeling of, the labeling/classification system in Section 3.1, defines the specific standards in Section 3.2, and showcases a few examples in both tabular and graphical form in Section 3.3.

Ultimately, the pursuit of better modeling is in the service of better decision making. In the commercial context, questions such as how large a discount one should give to a particular household, or what the optimal discount is, are of great interest. The answers involve the decision maker's utility function, that is, what he or she prioritizes: short-term profits or long-term customer relationships. Chapter 4 explores these questions under the framework of Chapter 2, in the context of businesses like supermarkets, described in detail in Section 4.2. Sections 4.3 and 4.4 then walk through the mathematical details of decision optimization under the relevant models, from the simple to the more sophisticated, complemented by further illustrations in Section 4.4.

Chapter 5, the last segment of the thesis, details my programming contributions to the project (Yanchenko et al., 2021). It covers both the latent factor modeling and the labeling system based on household purchasing behavior introduced in Chapter 3.


Chapter 2

Dynamic Models

2.1 Introduction

Dynamic modeling is of great interest to commercial outlets such as e-commerce companies like Amazon and supermarkets like Walmart and Target. This chapter reviews the framework of the models relevant to our problem. Specifically, we use dynamic multi-scale mixture models that are well suited to multivariate time series that are either non-negative counts or continuous-valued. Built on extensions of the dynamic generalized linear models of West and Harrison (1997), these models inherit the advantages of being sequential and probabilistic, and are able to generate samples from the implied predictive distributions of target quantities, which allows inference on various statistics and further decision analysis (Chapter 4).

Background

Many prior works pave the way for this thesis. Over 20 years ago, the framework of dynamic generalized linear models was established (West and Harrison, 1997, chap. 14). In recent years, researchers have picked up the baton, extending and modifying the framework to build tailor-made models for count-valued time series (Berry and West, 2020; Berry et al., 2020). The multi-scale modeling framework leverages information from the aggregate level, which provides a potential solution for zero-inflated data. To improve computational efficiency, Lavine et al. (2020) propose a copula-based approximation that drastically speeds up the modeling while maintaining forecasting accuracy.

2.2 DGLMs: Dynamic Generalized Linear Models

DGLMs are dynamic models whose sampling distributions come from the exponential family. The sampling model for series i at time t is given by Equation 2.1, where i indexes the individual series and t indexes time:

p(y_{i,t} \mid \mu_{i,t}, \tau_{i,t}, D_t) = b(y_{i,t}, \tau_{i,t}) \exp[\tau_{i,t}\{y_{i,t}\mu_{i,t} - a(\mu_{i,t})\}], \quad i = 1{:}N, \; t = 1, 2, 3, \ldots \qquad (2.1)

Equation 2.1 is the conditional distribution of y_{i,t} given all the information available up to time t, denoted by D_t = \{y_t, D_{t-1}, I_{t-1}\}, where I_{t-1} represents any additional relevant information beyond the observed data. Here \mu_{i,t} and \tau_{i,t} are the natural parameter and the precision parameter, respectively. In the DGLM framework the focus is \mu_{i,t}, which maps to the linear predictor \lambda_{i,t} = g(\mu_{i,t}) via a link function g(\cdot). As a state-space model, the dynamic Markov evolution is defined as

\lambda_{i,t} = F'_{i,t}\theta_{i,t} \quad \text{where} \quad \theta_{i,t} = G_{i,t}\theta_{i,t-1} + \omega_{i,t} \text{ with } \omega_{i,t} \sim [0, W_{i,t}] \qquad (2.2)

where

• F_{i,t} is the known vector of covariates (the regression vector) at time t,

• \theta_{i,t} is the state vector, which evolves via a first-order Markov process,

• G_{i,t} is a known state evolution matrix,

• \omega_{i,t} is the stochastic innovation vector, or evolution "noise", with E(\omega_{i,t} \mid D_{t-1}, I_{t-1}) = 0 and V(\omega_{i,t} \mid D_{t-1}, I_{t-1}) = W_{i,t}, independently over time; W_{i,t} is controlled by the discount-factor scheme described in detail in Appendix A.2.

• For Poisson DGLMs, the design includes a random-effects parameter \rho \in (0, 1] to account for overdispersion. The models in Sections 2.5.1 and 2.5.2 both specify this parameter; more details on random effects can be found in Appendix A.3.

2.3 DMMs: Dynamic Mixture Models

For zero-inflated data, a single DGLM is not flexible enough to capture the signal at the finest level, which is the goal of our project: customized modeling and decision making for individual households and items. We therefore resort to mixture models that treat zeros separately, such as Dynamic Count Mixture Models (Berry and West, 2020) and Dynamic Linear Mixture Models (Yanchenko et al., 2021). These two model classes are designed for non-negative counts and continuous values, respectively. In the context of our business problem, the former models weekly sales, while the latter is used for weekly spending.

2.3.1 DCMMs: Dynamic Count Mixture Models

In order to deal with non-negative counts with many zeros, Berry and West (2020)

propose Dynamic Count Mixture Models (DCMMs). The models are a mixture of

Bernoulli DGLMs and shifted Poisson DGLMs, as described by Equation 2.3.

Bernoulli DGLM: z_t \sim \mathrm{Ber}(\pi_t)

Poisson DGLM: y_t \mid z_t = \begin{cases} 0, & z_t = 0, \\ 1 + s_t, \; s_t \sim \mathrm{Po}(\mu_t), & z_t = 1. \end{cases} \qquad (2.3)

The two components of this mixture, Bernoulli and Poisson, evolve, predict and update separately, each just like a univariate DGLM, as detailed in Appendix A.1.
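To make the mixture concrete, the following minimal numpy sketch simulates one week of sales from a DCMM given fixed values of \pi_t and \mu_t. In the actual sequential analysis these quantities would be draws from the model's forecast distributions rather than fixed numbers; the values here are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def simulate_dcmm_sales(pi_t, mu_t, size=1):
    """Draw sales y_t from the DCMM of Equation 2.3 for fixed pi_t and mu_t."""
    z = rng.binomial(1, pi_t, size=size)   # Bernoulli component: any purchase this week?
    s = rng.poisson(mu_t, size=size)       # shifted Poisson component
    return z * (1 + s)                     # zero if z = 0, else 1 + s

# Example: a household with a 30% chance of buying, averaging 0.5 extra units when buying.
draws = simulate_dcmm_sales(pi_t=0.3, mu_t=0.5, size=10000)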

2.3.2 DLMMs: Dynamic Linear Mixture Models

With a similar strategy of treating the inflated zeros separately, Yanchenko et al. (2021) propose Dynamic Linear Mixture Models, mixtures of Bernoulli and Normal DGLMs as in Equation 2.4, which are used to model the logarithm of the weekly spending of each individual household.

Bernoulli DGLM: z_t \sim \mathrm{Ber}(\pi_t)

Normal DGLM: x_t \mid z_t = \begin{cases} 0, & z_t = 0, \\ x_t \sim N(F'_t\theta_t, V_t), & z_t = 1. \end{cases} \qquad (2.4)

Similarly, DLMMs retain the flexibility, computational efficiency, and full proba-

bilistic uncertainty from Bayesian DGLMs in West and Harrison (1997).


2.4 Multi-scale Framework

In this project, the multi-scale modeling framework ties all of the dynamic models together. The framework has been conceptualized, developed, elaborated and exemplified in prior work such as Berry and West (2020), Carvalho and West (2007), and Ferreira et al. (2003, 2006). That work takes "bottom-up/top-down" ideas and adopts them in novel ways within multi-scale time series models and Bayesian forecasting.

Specifically, the application of multi-scale modeling relies on the natural hierarchy of items defined by the store (see Section 3.1 of Yanchenko et al., 2021) and on the household grouping achieved by the labeling system and criteria of Chapter 3. The strategy utilizes information across items and households that are close in the hierarchy, allowing signals from the shared level to be propagated in a "top-down" fashion and thus improving forecast accuracy at the finest level: each item-household pair.

Under the multi-scale framework, the shared signals from the aggregate level are simultaneous with the quantity of interest, as opposed to lagged. This retains the online learning of the model of interest and accounts for the additional uncertainty introduced by the high-level signals, which is critical for any inference based on the predictive samples. Note that this requires some knowledge or control of future values, such as the discount percentages offered next week. The multi-scale framework is described by Equation 2.5; a summary of this section can be found in Appendix A.4.

M_i: \text{Equations 2.1 and 2.2 with } \theta_{i,t} = (\gamma_{i,t}, \beta_{i,t})' \text{ and } F_{i,t} = (h_{i,t}, \phi_t)', \quad i = 1{:}N,
M_0: \phi_t \sim p(\phi_t \mid D_{t-1}). \qquad (2.5)

An independent external model, denoted M_0, models the simultaneous predictor vector \phi_t, which is incorporated into the regressor vector F_{i,t} of each individual model M_i. Each M_i has its own dynamic state vector \beta_{i,t} for the shared signal \phi_t, allowing individual models to respond uniquely to the shared higher-level signal and thereby improving forecast accuracy.

For implementation of the multi-scale model, Berry and West (2020) propose a direct Monte Carlo method that obviates the use of Markov chain Monte Carlo, while more recently Lavine et al. (2020) adopt an analytical approximation that significantly boosts computational efficiency while maintaining similar forecast accuracy.

2.5 Case Study and Examples

As a part of the project (Yanchenko et al., 2021), one of the major goals of this thesis is to identify, capture and utilize the price sensitivity of each household. Specifically, as displayed in Figure 2.1, I find that a multi-scale model utilizing an aggregate discount percentage across households improves forecasting accuracy. The household hierarchy used to aggregate the discount information is introduced and elaborated in Chapter 3. In this chapter, models and modeling results for households with high price sensitivity are exemplified, evaluated and discussed.

Similar to Figure 5 in Yanchenko et al. (2021), which visualizes the modeling decomposition for each item-household pair, Figure 2.1 demonstrates the two main implementations of this idea: "top-down" propagation along the hierarchy of an item (store, department, category, product) or of a household (groups with different price sensitivity/loyalty). This thesis mainly contributes to the latter.

Figure 2.1: Modeling hierarchy

2.5.1 Individual Household DGLMs

Following the decouple-recouple strategy, univariate DCMMs (Equation 2.3) model the weekly sales of each household, with the first two covariates of the regressor vector F'_t being (1, discount_t), where discount_t is the simultaneous binary indicator of a weekly promotion. The third covariate explored carries the information in the weekly discount percentage, used either directly or in aggregate form. Models M_1 and M_2 below give the full specifications.

M_1:

• Response variable y_t: weekly sales of an item from a particular household
• F'_t = (1, discount_t, discount percent_t), G = I_3
• Discount factors: \rho = 0.6, \rho_{local linear} = 0.98, \rho_{regression} = 0.98

M_2:

• Response variable y_t: weekly sales of an item from a particular household
• F'_t = (1, discount_t, aggregate discount percent_t), G = I_3
• Discount factors: \rho = 0.6, \rho_{local linear} = 0.98, \rho_{regression} = 0.98

2.5.2 Multi-scale Modeling

The idea of multi-scale modeling is to use an aggregate-level signal, extracted from group behavior, as a baseline reference, so that there is at least some "safety" information to draw on when there is nothing but noise at the finest (household) level. In contrast to M_1, which is a household-specific model, M_2 is a multi-scale model (Equation 2.5) with a simultaneous covariate that incorporates price sensitivity across a group of households. The exploration finds that M_2 outperforms M_1 and the other alternatives, especially for households with high price sensitivity. This section describes the external model that generates the third covariate of M_2: aggregate discount percent_t.

External Model Specification

The external model M_0, whose parameters are specified below, is a Poisson DGLM (Equations 2.1 and 2.2); a DCMM would have been practically equivalent, given the lack of zero inflation in the aggregate data.

M_0:

• Response variable y_t: weekly sales of an item from a group of households
• F'_t = (1, average discount percent_t), G = I_3
• Discount factors: \rho = 0.6, \rho_{local linear} = 0.998, \rho_{regression} = 0.998

Model Integration: Signal Propagation

Given the coefficients of M_0, the next step is to combine the aggregate-level signal with the household-specific information. Specifically, given the state vector \theta_t = (\alpha_t, \beta_t)' of M_0, the third covariate of M_2 becomes aggregate discount percent_t = \beta_t \times discount percent_t.

In the context of this application, \beta_t can be interpreted as a measure of the whole group's price sensitivity for the item, while multiplying it by the household-specific discount percent accounts for the heterogeneity of promotions within the group. A minimal sketch of this propagation step follows.
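The following minimal sketch illustrates the propagation: the posterior mean of \beta_t from the external model is multiplied by each household's own simultaneous discount percent to form the multi-scale covariate. The variable names and values are illustrative placeholders; in the full analysis the external-model coefficient carries its uncertainty into the copula-based recoupling rather than being plugged in as a point value.

import numpy as np
import pandas as pd

# Hypothetical weekly inputs for one household-item pair (values are placeholders).
discount_percent = pd.Series([0.0, 0.25, 0.10, 0.0])   # household-specific discount offered
beta_t = pd.Series([1.8, 1.9, 1.7, 1.8])                # posterior mean of M0's price-sensitivity coefficient

# Multi-scale covariate: aggregate discount percent_t = beta_t * discount percent_t.
aggregate_discount_percent = beta_t * discount_percent

# Regressor vector F_t = (1, discount_t, aggregate discount percent_t) for model M2.
discount = (discount_percent > 0).astype(int)
F = np.column_stack([np.ones(len(discount)), discount, aggregate_discount_percent])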

Figure 2.2 shows the critical quantities over the forecasting period in the described process. Figure 2.2a is the coefficient of average discount percent, \beta_t, which is multiplied by the simultaneous individual household discount percent to obtain the customized aggregate discount percent for that particular household, shown in Figure 2.2b. Note that in Figure 2.2a the coefficient is well above zero, indicating strong price sensitivity in the chosen household group, which validates the multi-scale strategy. The latent factor in Figure 2.2b corresponds to one household of the group; it takes zero values in some weeks because the household was not offered any promotion on item 62 in those weeks. Figure 2.2c gives the coefficient of the latent factor, aggregate discount percent, in model M_2. The mean and one-standard-deviation region imply that the combined covariate plays a consistently valuable role in that particular model.

Figure 2.2: (a) Coefficient of average discount percent in the external model M_0; (b) product of (a) and the actual discount percent; (c) coefficient of (b) in an individual model M_2.

2.5.3 Model Evaluation and Comparison

While Figure 2.2 illustrates the multi-scale model at the individual household level, this section discusses the performance of the model on the chosen group and compares it to the alternatives.

Figure 2.3 shows the accuracy of three models, M_1, M_2 and a TensorFlow logistic regression model, together with the naive guess (Equation 2.6) based on the promotion indicator. Figure 2.3a compares the multi-scale model with each of the others, while Figure 2.3b displays their individual accuracy distributions.

Naive guess: z_t = \begin{cases} 0, & \text{no discount}, \\ 1, & \text{discount}. \end{cases} \qquad (2.6)

In Figure 2.3a, each point represents a household in the group, and the straight line is the y = x line, so points above the line are households for which the multi-scale model outperforms the alternative. The first subplot of Figure 2.3a shows that the multi-scale model beats the naive guess when households follow the promotions 60% to 80% of the time. This is significant because that is the most common case and the one in which behavior is hardest to predict; if all customers followed the promotions, there would be no need for models more complex than the naive guess. In the second subplot, for a few households, using the aggregate discount percent conspicuously dominates the DGLM without the aggregate-level signal, which emphasizes the importance of "top-down" propagation, as mentioned at the beginning of Section 2.5.2. Compared to the TensorFlow model, the multi-scale model generates similar accuracy while having the benefits of being probabilistic, sequential and a lot faster.

As the parent project of this work, Yanchenko et al. (2021) compare models M_2 and M_1 in terms of other metrics, such as MAD, MAPE and ZAPE. In particular, their Tables 4 and 6 exemplify the improvement at a larger scope: the "simultaneous" column of Table 4 and the "multi-scale" column of Table 6 report the performance of the model chosen here. In the paper, the model is applied to a more heterogeneous household group and outperforms the alternatives on all metrics.

Figure 2.3: Model comparison in terms of forecasting accuracy. (a) Accuracy pairwise comparisons; (b) accuracy distributions. Naive model: M_naive; DGLM: M_1; Latent: M_2; TF: a logistic regression model written in TensorFlow by 84.51°.

2.6 Summary

This chapter reviews the framework of Bayesian dynamic generalized linear models and its extensions to count-valued time series, then elaborates and exemplifies the multi-scale modeling approach. This "top-down" strategy shows potential for the difficult task of modeling sparse data, compared to the other models described in Section 2.5.3.

The extension of the multi-scale approach to the hierarchical decomposition of Figure 2.1 not only captures household behavior but also maintains scalability. The key is to identify a group of individual series that share information (Chapter 3).

All models in this chapter are extensions of fully probabilistic, interpretable, sequential dynamic generalized linear models, tailor-made for the individualized forecasting and decision making problem.


Chapter 3

Labeling System

The "top-down" modeling strategy (Berry and West, 2020) is well suited to the personalized household forecasting problem described in Section 2.5. It seeks common signals at the aggregate level and propagates them down the hierarchy, which is effective for the sparse data faced throughout this thesis. One obstacle to realizing this modeling concept is the lack of proper aggregate information, which is the main motivation of this chapter (Section 3.1).

3.1 Motivations and Purposes

To implement the multiscale modeling strategy, which propagates clearer signals from the aggregate level to each household, it is natural to develop a set of grouping/clustering criteria with which we can circumvent the lack of demographic information and identify appropriate aggregate signals. The goal is to group thousands of households according to their promotion scenarios and purchasing behaviors. Based on the quantification of such scenarios and behaviors, households are classified into eight categories, each geometrically represented by an octant of the unit cube anchored at the origin. The process can be implemented for every item actively sold in the store, which allows us to identify and then model the aggregate signals for every household-item combination. From a holistic perspective, the grouping not only enables the multiscale strategy but also, as guidance, illuminates the proper actions for different groups of households and identifies the strengths of the model.

3.2 Labeling System

Since demographic information about the households is unavailable, the following grouping is developed on the basis of promotion circumstances and buying behaviors, defined for every household-item combination as below.

For every item-household pair (i, h), i = 1:I, h = 1:H, with I and H the total numbers of items sold and households recorded:

• Discount Offered Percentage (DOP): over the span of the 112 weeks recorded,

the proportion of weeks when there were promotions offered to household h of

item i.

• Discounted Purchase Percentage (DPP): among the weeks when item i was discounted for household h, the proportion of weeks in which household h made a purchase.

• Regular Purchase Percentage (RPP): among the weeks when item i was at regular price for household h, the proportion of weeks in which household h made a purchase.

These three quantities together define a household space for each item, whose domain is the unit cube anchored at the origin, with eight octants established and interpreted as below:

For i = 1:I,

1. The octant containing (DOPi, DPPi, RPPi) = (0, 0, 1);

Interpretation: Loyal households who are very consistent on item i.


2. The octant containing (DOPi, DPPi, RPPi) = (0, 1, 1);

Interpretation: Similarly loyal households to type 1 who are very consistent on

item i.

3. The octant containing (DOPi, DPPi, RPPi) = (1, 1, 0);

Interpretation: Promotion sensitive households who are responding and enjoy-

ing the discounts on item i.

4. The octant containing (DOPi, DPPi, RPPi) = (1, 1, 1);

Interpretation: Similar to type 3.

5. The octant containing (DOPi, DPPi, RPPi) = (0, 0, 0);

Interpretation: Untouched or pristine households who might respond to pro-

motions of item i if delivered.

6. The octant containing (DOPi, DPPi, RPPi) = (0, 1, 0);

Interpretation: Similar to type 5.

7. The octant containing (DOPi, DPPi, RPPi) = (1, 0, 0);

Interpretation: Disinterested households, despite promotions of item i.

8. The octant containing (DOPi, DPPi, RPPi) = (1, 0, 1);

Interpretation: Similar to type 7.

Based on the classification, it is natural for the eight types of customers to coalesce

into four larger groups, as the following:

For i = 1:I, among which households,

1. habit and loyalty for item i are established.

Actions: Maintain the relationship and occasionally compensate for their loy-

alty to item i.


2. promotion sensitivity and interest in item i are detectable or even conspicuous; this is the ideal group of customers for modeling price sensitivity.

Actions: Find the amount of promotion on item i that generates the most profit, which depends on the distribution of sales and the quantity being optimized.

3. promotions are not available.

Actions: Explore and experiment with these customers by delivering promo-

tions of item i.

4. disinterest in the item or disregard for the promotions is noticeable.

Actions: Check the validity of the promotions sent out. If they are disregarded, stop the promotions of item i.

Note that the households in group 2 above are of the most modeling interest, given that the covariates are price related. A minimal sketch of computing the three quantities and assigning the labels appears below.
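The following minimal pandas sketch, under assumed column names (item_id, household_id, discount, units), computes DOP, DPP and RPP for each item-household pair from weekly records and assigns an octant by thresholding each quantity at 0.5, an illustrative choice for splitting the unit cube. The binary octant code it returns simply identifies which of the eight octants a pair falls into; it does not reproduce the thesis's type 1-8 numbering.

import pandas as pd

# Hypothetical weekly records: one row per (item, household, week).
weekly = pd.DataFrame({
    "item_id":      [62, 62, 62, 62],
    "household_id": [7, 7, 7, 7],
    "discount":     [1, 0, 1, 0],    # was a promotion offered that week?
    "units":        [2, 0, 1, 0],    # units purchased that week
})
weekly["bought"] = (weekly["units"] > 0).astype(int)

def label_pair(g):
    dop = g["discount"].mean()                                    # Discount Offered Percentage
    disc, reg = g[g["discount"] == 1], g[g["discount"] == 0]
    dpp = disc["bought"].mean() if len(disc) else 0.0             # Discounted Purchase Percentage
    rpp = reg["bought"].mean() if len(reg) else 0.0               # Regular Purchase Percentage
    octant = 4 * (dop > 0.5) + 2 * (dpp > 0.5) + 1 * (rpp > 0.5)  # 0-7 codes the octant
    return pd.Series({"DOP": dop, "DPP": dpp, "RPP": rpp, "octant": octant})

labels = weekly.groupby(["item_id", "household_id"]).apply(label_pair)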

3.3 Case Study and Examples

The examples demonstrated in this section result from applying the labeling system of Section 3.2 to a portion of the data described in Section 1.1: the highest-spending household group.

Table 3.1: Example items for each group

            group 1            group 2            group 3            group 4
item      type1   type2      type3   type4      type5   type6      type7   type8     total
...
 62          0       0        578     254          0       0       1071      44      1947
...
 72         34     395         71       2        567     412          3       0      1484
...
176          5      46          0       0       1522      98         10       0      1681
...
199          0       0          5       6          7       0       1653       1      1672
...

As mentioned in Section 3.1, the labeling system aids in identifying signals to model, as discussed in Section 2.5, and sheds light on decision making at the individual item-household level.

Table 3.1 displays four items with significant numbers of households in each group. As discussed in Section 3.2, group 2 is of the most interest for modeling households' sensitivity to promotions, and item 62 is the item chosen to illustrate the multi-scale modeling strategy in Section 2.5 and the multi-step decision analysis in Section 4.4. Group 1 contains loyal customers with consistent spending on a given item, exemplified by the 429 households purchasing item 72. The majority of households recorded for item 176 fall in group 3, which suggests tentative promotions. Lastly, item 199 does not sell despite the promotions, which should draw the attention of decision makers. Potential questions to investigate are whether the store should (1) check the validity or accessibility of the promotions being sent out, (2) shrink the promotions to save on mailing, (3) reduce inventory since the item does not sell, or (4) bundle it with other items that do sell.

Figure 3.1: Interactive visual aids of the labeling system. (a) Subspaces defined by the labeling system; (b) 3D scatterplot of all households recorded for item X; (c) 3D grouped scatterplot; (d) interactive legend.

Figure 3.1 visualizes the definition of the clustering criteria of Section 3.2, with a couple of examples shown with and without the grouping regions. For a particular item, these kinds of plots demonstrate its customers' sensitivity to promotions and loyalty to the product, and help identify anomalies in the delivery of its promotions. The four cuboids, each consisting of two octants, represent the four customer groups.

Each point in Figures 3.1b and 3.1c is a household recorded for that particular item. The axes are the three quantities defined in Section 3.2 (DOP, DPP, RPP), with more information incorporated in the plots, such as the household id, the exact values on the three axes, and a couple of categorical results, as shown in Figure 3.1d.

This visualization serves as a dictionary and enables easy, straightforward searching for any particular record in the data. For example, one might be curious about the exact information for a point after locating it in group 2 of Figure 3.1c. The user can then turn off the shading for the grouping, to display the plot in the mode of Figure 3.1b, and hover over the chosen point to see the household index, the average discounted sales, whether or not the customer buys more with promotions, and so on. The tool also makes anomaly detection simple. For instance, a household buying significantly more without discounts than with them is indicated by a cross, and is easily distinguished from a household buying more with discounts than without, which is shown as a circle.

3.4 Summary

This chapter defines a set of standards for assigning households to groups according to their purchasing behavior. These criteria are best demonstrated and utilized interactively, as illustrated by Section 3.3 and Figure 3.1. The outcomes are referred to in Chapter 2, especially for the definition of the aggregate information. In addition, the user can interact with the figures to explore features of interest, such as the popularity of items, promotion availability, and the distribution of households in the space of purchasing behavior.


Chapter 4

Decision Analysis

4.1 Introduction

In any real business analysis, it is essential to make decisions and to understand the consequences and uncertainties attached to them. Decision analysis converts our statistical efforts into business potential and bestows real-life significance on the project. This chapter begins with a few simple examples of decision analysis tailor-made for the context of item-specific discount offers. It then proceeds to more realistic settings where a simulation-based approach shows advantages in terms of efficiency. Finally, the chapter concludes with an example focusing not only on optimization of the expected utility but also on the uncertainty analysis coming from the full distribution, showcasing the advantage of the probabilistic model.

4.2 Business Context

For retail businesses such as grocery stores or supermarkets, it is of great interest to

understand—for a given item—how discounts impact sales and eventually profits per

unit time. A typical setup would be the following:

• An item has usual/nominal selling price $p.

• Item cost is $c, intended to capture all real costs for the store (purchase/whole-

sale costs, storage, labour, etc).


• Percent discount 100d% for decision variable d ∈ (0, 1]; the discounted price is $(1 − d)p.

• The implied profit per item sold at discount d is then ${(1 − d)p − c}. A short-term decision maker would always keep this value positive, i.e. d < 1 − c/p, which is the scenario considered here. However, it is sometimes beneficial in the long term to allow d ≥ 1 − c/p for a controlled period, i.e. sacrificing short-term profits to build stronger relationships with customers; that would suggest a more sophisticated setup than the one described here, with an extra term for the expected future gain in the expected utility.

• y is the number of items sold per unit time at the offered discount.

• The implied expected profit (utility) is $u_d, where

u_d = E(y \mid d)\,\{(1 - d)p - c\}. \qquad (4.1)

A smaller d implies a higher price and lower expected sales; a higher d increases expected sales but reduces the profit per sale. Hence u_d may have an optimizing point within a reasonable range.

4.3 Tentative Models

4.3.1 Poisson Model

When it comes to non-negative counts, Poisson model is one of the lower-hanging

fruits. Conditioned on a chosen discount d, assuming the sales of a particular item y

26

follows a Poisson distribution, try a linear model for the log link. Statistical details

of the model and its optimization are shown below.

• Sales: y|d ∼ Po(µd) with log(µd) = α + βd. Naturally β > 0.

• Expected profit:

ud = µd{(1− d)p− c} = {(1− d)p− c} exp{α + βd}. (4.2)

• Maximizing ud is equivalent to maximizing log(ud), with d ∈ (0, 1− c/p],

doptimal = argmaxd log(ud) =

1− c

p− 1

β, β > p

p−c

0, otherwise

(4.3)

• Sometimes in practice, the actual selling price $p and its cost $c are not of great

interest. In those circumstances, it makes sense to replace p/c with r which is

the markup plus 1. Then equation 4.3 simply becomes

doptimal = argmaxd log(ud) =

1− 1

r− 1

β, β > r

r−1

0, otherwise

(4.4)
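As a check on Equation 4.4, the following small Python function evaluates the closed-form optimum for given \beta and r; it is a minimal sketch written for this exposition, not code from the thesis.

def optimal_discount_poisson(beta, r):
    """Optimal discount under the Poisson model of Section 4.3.1 (Equation 4.4).

    beta : price-sensitivity slope in log(mu_d) = alpha + beta * d (assumed > 0)
    r    : price-to-cost ratio p / c, i.e. the markup plus one (assumed > 1)
    """
    if beta > r / (r - 1.0):
        return 1.0 - 1.0 / r - 1.0 / beta
    return 0.0

# Example: beta = 2.2, r = 2 gives a positive optimal discount,
# whereas weaker sensitivity (beta = 1.5) gives d_optimal = 0.
print(optimal_discount_poisson(2.2, 2.0), optimal_discount_poisson(1.5, 2.0))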

4.3.2 Mixture Model

• Sales: y \mid d = z(x + 1), where z and x are independent,

z \sim \mathrm{Ber}(\pi_d), \quad \mathrm{logit}(\pi_d) = \alpha_0 + \beta_0 d,
x \sim \mathrm{Po}(\mu_d), \quad \log(\mu_d) = \alpha + \beta d. \qquad (4.5)

Naturally \beta_0, \beta > 0.

• Expected profit:

u_d = \pi_d(\mu_d + 1)\{(1 - d)p - c\}
    = \{(1 - d)p - c\}\,\mathrm{logit}^{-1}(\alpha_0 + \beta_0 d)\,(1 + \exp(\alpha + \beta d))
    = \{(1 - d)p - c\}\,\frac{\exp(\alpha_0 + \beta_0 d)\,(1 + \exp(\alpha + \beta d))}{1 + \exp(\alpha_0 + \beta_0 d)}. \qquad (4.6)

• Maximizing u_d is equivalent to maximizing \log(u_d), with d \in (0, 1 - c/p]. Setting the first derivative to zero gives the condition

\beta_0(1 - \pi_d) + \frac{\beta\mu_d}{1 + \mu_d} = \frac{p}{(1 - d)p - c}, \qquad \beta_0, \beta > 0. \qquad (4.7)

Because solving this for d_{\mathrm{optimal}} analytically is difficult, we resort to a numerical method for mode hunting, nominally a seven-dimensional problem in (d, \alpha_0, \beta_0, \alpha, \beta, p, c). As in the derivation of Equation 4.4, writing \frac{p}{(1-d)p-c} as \frac{r}{(1-d)r-1}, where r = p/c, reduces the total dimension to six. Moreover, by incorporating information from the business context, we can narrow down the plausible values of some parameters and thus mitigate the computational burden.

• Referring to Table 4.1, which shows the distributions of these four coefficients over 300 household-level fits, some plausible domains are chosen for the purpose of this analysis:

d ∈ (0, 1 − 1/r], where r = p/c
α_0 ∈ (−0.9, 1.2), taking the 10th and 90th percentiles
β_0 ∈ (0, 1.7), truncating to the positive portion
α ∈ (−0.55, 0.95), taking the 10th and 90th percentiles
β ∈ (0, 2.2), truncating the positive portion up to the 75th percentile
r ∈ (1.1, 2), a reasonable guess

Figure 4.1: Distributions of four model parameters

Table 4.1: Summary statistics of the logistic and Poisson regressions

           alpha0       beta0       alpha        beta
count   300.000000  300.000000  300.000000  300.000000
mean      0.220211   -5.577831    0.206850    1.466264
std       0.911970    2.894260    0.596577    1.207612
min      -4.153728  -19.449333   -2.457143   -2.229326
10%      -0.872800   -8.751870   -0.556313   -0.010213
25%      -0.199389   -7.043806   -0.209543    0.716735
50%       0.323359   -5.459991    0.199102    1.416220
75%       0.788633   -3.830893    0.603878    2.208956
90%       1.229518   -2.197984    0.958858    3.157952
max       2.602268    1.693713    2.071097    5.358105

For each set of parameters, it is straightforward to compute the optimal discount under the given utility. Figure 4.2 shows the relationships between the discount d and \pi_d, \mu_d and u_d, for four different sets of intercepts but the same slopes (\beta_0, \beta) = (1.7, 2.2).

Since the intercepts represent the circumstances without discounts, high intercept values relative to the slopes lead to d_optimal = 0: the item is already popular without discounts, so a lower price would simply hurt the profit while bringing only a marginal increase in sales. Very low values give the same result for a different reason: customers are so indifferent to the item that even large discounts cannot attract them. As for the slopes, high values indicate high sensitivity to discounts, and a pronounced peak in the utility can be expected.

Figure 4.2: Utility vs. Discount

A grid-search sketch of the mode hunting described above follows.
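Under the assumptions above, the following minimal numpy sketch performs the mode hunting numerically: for given (\alpha_0, \beta_0, \alpha, \beta) and ratio r, it evaluates the expected profit of Equation 4.6 (expressed in units of cost, so the profit factor is (1 − d)r − 1) on a grid of discounts and returns the maximizer. It is an illustrative sketch, not the exact code used to produce the figures.

import numpy as np

def optimal_discount_mixture(alpha0, beta0, alpha, beta, r, n_grid=1000):
    """Grid-search the optimal discount for the Bernoulli-Poisson mixture (Equation 4.6)."""
    d = np.linspace(0.0, 1.0 - 1.0 / r, n_grid)          # feasible discounts: non-negative profit
    pi_d = 1.0 / (1.0 + np.exp(-(alpha0 + beta0 * d)))   # Bernoulli probability
    mu_d = np.exp(alpha + beta * d)                       # Poisson mean
    utility = ((1.0 - d) * r - 1.0) * pi_d * (mu_d + 1.0) # expected profit, in units of c
    return d[np.argmax(utility)], utility.max()

# Example with slopes (beta0, beta) = (1.7, 2.2) as in Figure 4.2 and r = 1.5.
d_opt, u_opt = optimal_discount_mixture(alpha0=0.2, beta0=1.7, alpha=0.2, beta=2.2, r=1.5)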

4.4 DCMMs

The decision analysis can be incorporated into the framework of Bayesian dynamic linear models. Without approximations of the digamma and trigamma functions, the optimization problem cannot be written in closed form; one has to resort to an iterative numerical solution based on standard Newton-Raphson to find the implied conjugate parameters. With the following approximations of the digamma and trigamma functions, respectively, we are able to write the optimization problem in terms of the regressor vector F_t:

\phi(x) \approx \log(x), \qquad \phi'(x) \approx \frac{1}{x}. \qquad (4.8)

For Binomial DGLMs, we have

\alpha_t = \frac{1 + \exp(f_t)}{q_t}, \qquad \beta_t = \frac{1 + \exp(-f_t)}{q_t}, \qquad (4.9)

and for Poisson DGLMs, we have

\alpha_t = \frac{1}{q_t}, \qquad \beta_t = \frac{\exp(-f_t)}{q_t}. \qquad (4.10)

In general, we want to optimize the expectation of the scaled direct outcome, the profit in this case, which is the product of the expected sales and a linear function of the regressor vector F_t. Working through the math, we have the following optimization problem:

\pi_t = \frac{\alpha_t}{\alpha_t + \beta_t} = \frac{1 + \exp(f_t)}{2 + \exp(f_t) + \exp(-f_t)}, \qquad
\mu_t = \frac{\alpha_t}{\beta_t} = \exp(f_t), \qquad
u_t = \pi_t(\mu_t + 1)F'_t b, \qquad (4.11)

where f_t = F'_t a_t, with a_t the first moment of the evolved state vector \theta_t and b the vector of known linear coefficients. Note that this does not depend on the second moment q_t, which makes sense because it is the first moment of the profit that we are optimizing. After simplification, we have

u_t(F_t) = \frac{(1 + \exp(F'_t a_t))^2}{2 + \exp(F'_t a_t) + \exp(-F'_t a_t)}\,F'_t b. \qquad (4.12)

While Equation 4.12 is hard to solve analytically, it is straightforward to approximate the optimal solution computationally when F_t is short. In this study, F_t = (1, d)', where d is the discount percentage, which has a plausible range from 0 to 1. In reality, d < 1 − c/p for any positive profit, so we discuss the problem under various ratios of sale price to cost, p/c.

Here I explore three sale-to-cost ratios: 1.2, 2 and 10. The first two are realistic, while 10 is an experiment with an extreme case. The question of interest is to forecast the outcomes if the store were to use the optimal discount determined by Equation 4.12, with F_t = (1, d)' and b = (p − c, −p)', every week for 52 weeks. A grid-search sketch of this optimization appears below.
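The following minimal numpy sketch evaluates Equation 4.12 over a grid of discounts for given a_t, p and c, and returns the maximizing d; it is an illustrative stand-in, with placeholder values for a_t, for the simulation machinery actually used to produce Figures 4.3 to 4.5.

import numpy as np

def optimal_discount_dcmm(a_t, p, c, n_grid=1000):
    """Maximize u_t(F_t) of Equation 4.12 over d, with F_t = (1, d)' and b = (p - c, -p)'."""
    d = np.linspace(0.0, 1.0 - c / p, n_grid)   # discounts with non-negative profit per item
    f_t = a_t[0] + a_t[1] * d                    # f_t = F_t' a_t
    profit = (1.0 - d) * p - c                   # F_t' b
    u_t = (1.0 + np.exp(f_t)) ** 2 / (2.0 + np.exp(f_t) + np.exp(-f_t)) * profit
    return d[np.argmax(u_t)]

# Example: illustrative state mean a_t = (0.1, 1.5) and ratio p/c = 2.
d_star = optimal_discount_dcmm(np.array([0.1, 1.5]), p=2.0, c=1.0)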

To obtain the distributions of optimal discounts and of the corresponding profits and parameters, it is convenient to use DCMMs, which, as emphasized throughout this study and delineated in Appendix A, are probabilistic and sequential. Figures 4.3, 4.4 and 4.5 show such distributions for each scenario, and Appendix B has more figures of the relevant model parameters \pi_t and \mu_t.

The figures are simulations of outcomes given models trained up to week 1, where the store always picks the discount percentage that maximizes the total profit in the impending weeks. All three figures use the same household-item pair, with only p/c varying.

Figure 4.3: Distributions of simulated outcomes over a year (p/c = 1.2): (a) optimal discounts; (b) optimal profits.

Figure 4.4: Distributions of simulated outcomes over a year (p/c = 2): (a) optimal discounts; (b) optimal profits.

Figure 4.5: Distributions of simulated outcomes over a year (p/c = 10): (a) optimal discounts; (b) optimal profits.

Comparing the results across the three ratios, it is conspicuous that a lower cost (or higher sale price) affords the store more room to offer discounts, increasing weekly sales enough to compensate for the discounted price and, as a result, generating more total profit. It is also natural to examine each figure along its one-year horizon. It is worth noting that the distributions appear to converge as enough time

elapses. This is easy to accept once we realize that, despite the careful design of DCMMs/DGLMs, the simulation injects no new information and introduces no disturbances. So, after fixing the model at week 1, we are bound to obtain a stationary forecast after enough time has passed. This discloses the difficulty of long-term forecasting: without any definite, deterministic insight, forecasts made by stationary models (all pragmatic time series models) are simply reflections of what has been observed. In comparison, short-term forecasting is more reliable (Figure 2.3b), because our variables of interest are less volatile in the short term than in the long term. In a few words, statistical models are capsules of the available information; after training on the past, all one can hope is that history sheds light on the future.

4.5 Summary

This chapter began with a business setting that approximates reality and explored the decision analysis problem under a few models. The goal throughout is to maximize the profit earned on an item from a household (Equation 4.1). Simple models such as Poisson regression (Section 4.3) as well as DCMMs are studied for the optimization problem (Section 4.4). I derive the mathematics for each case, at least up to a simplification of the problem, and resort to numerical methods and simulation-based computation when the analytical solution is difficult to obtain (Section 4.4).

More can be explored in terms of the loss/utility function. Relevant loss functions for zero-inflated count-valued time series include ZAPE, adjusted ZAPE and MAPE (Yanchenko et al., 2021). Moreover, the probabilistic model allows much more complicated utility functions than those that only provide a point forecast. For instance, a decision maker can ask for a 0.5 or higher probability of gaining four dollars of profit from a household over a span of two weeks.

Of course, there remain open questions, such as long-term forecasting and decision making. Long-term forecasting has always been challenging but intriguing, regardless of the field; applications range from natural disasters such as earthquakes (Talebi, 2017) to artificial advancements (Kott and Perconti, 2018). The ability to forecast long term matters to policy makers, business owners, residents of a particular area, potentially everyone. However, since a bad forecast is worse than no forecast at all, there are far fewer studies on long horizons than on short ones. In my personal opinion, the best forecast is one that pushes the future toward the desired direction. We will meet the future where our eyes are fixed; it may be late, but hopefully not absent.

Chapter 5

Computation, Implementation and Code

5.1 Introduction

This chapter showcases the programming contributions I have made to the project. Section 5.2 first introduces the mathematics behind the programming, followed by Section 5.3, which contains the code generating the interactive 3D clustering plots (Figure 3.1) of Section 3.3.

5.2 Copula Approximation

Lavine et al. (2020) propose a copula-based analytic method to approximate the simulation-based approach of Berry and West (2020). The approximation balances speed and accuracy and substantially reduces the computational cost. This section derives the mathematics behind Variational Bayes and Linear Bayes (VBLB) for multi-scale DGLMs, and presents the programming contributions I have made to the published Python package PyBATS.

5.2.1 VBLB for Latent Factor DGLMs

This subsection extends VBLB in Appendix A.1 to the latent factor modeling context.

To implement the method in the latent factor context, we first need to know the first

two moments of the linear predictor λi,t for all i = 1 : N and their covariances.


Expanding the expression for λi,t, we get

\lambda_{i,t} = F'_{i,t}\theta_{i,t} = h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t} \qquad (5.1)

We also denote the first two moments of \phi_t by \phi_t \mid D_{t-1} \sim [b_t, B_t] and partition the moments of the state vector as follows:

\theta_{i,t} \mid D_t \sim \left[ \begin{pmatrix} a_{\gamma,i,t} \\ a_{\beta,i,t} \end{pmatrix}, \begin{pmatrix} R_{\gamma,i,t} & S_{i,t} \\ S'_{i,t} & R_{\beta,i,t} \end{pmatrix} \right] \qquad (5.2)

The mean of the linear predictor is then

f_{i,t} = E[\lambda_{i,t}] = E[F'_{i,t}\theta_{i,t}] = h'_{i,t}a_{\gamma,i,t} + b'_t a_{\beta,i,t}. \qquad (5.3)

The variance of the linear predictor can be calculated using the law of total covariance:

q_{i,t} = \mathrm{Var}[\lambda_{i,t}] = \mathrm{Var}[F'_{i,t}\theta_{i,t}]
= \mathrm{Cov}(h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t},\; h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t})
= \mathrm{Cov}(E[h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t} \mid \phi_t],\; E[h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t} \mid \phi_t]) + E[\mathrm{Cov}(h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t},\; h'_{i,t}\gamma_{i,t} + \phi'_t\beta_{i,t} \mid \phi_t)]
= \mathrm{Cov}(h'_{i,t}a_{\gamma,i,t} + \phi'_t a_{\beta,i,t},\; h'_{i,t}a_{\gamma,i,t} + \phi'_t a_{\beta,i,t}) + E[\mathrm{Var}[h'_{i,t}\gamma_{i,t}] + \phi'_t\mathrm{Var}[\beta_{i,t}]\phi_t + h'_{i,t}\mathrm{Cov}(\gamma_{i,t},\beta_{i,t})\phi_t + \phi'_t\mathrm{Cov}(\beta_{i,t},\gamma_{i,t})h_{i,t}] \quad (\text{noting that } \phi_t \text{ is independent of } \theta_{i,t})
= \mathrm{Var}[\phi'_t a_{\beta,i,t}] + E[h'_{i,t}R_{\gamma,i,t}h_{i,t} + \phi'_t\mathrm{Var}[\beta_{i,t}]\phi_t + 2h'_{i,t}S_{i,t}\phi_t]
= a'_{\beta,i,t}B_t a_{\beta,i,t} + h'_{i,t}R_{\gamma,i,t}h_{i,t} + 2h'_{i,t}S_{i,t}b_t + E[\mathrm{tr}(\phi'_t R_{\beta,i,t}\phi_t)], \qquad (5.4)

where

E[\mathrm{tr}(\phi'_t R_{\beta,i,t}\phi_t)] = E[\mathrm{tr}(R_{\beta,i,t}\phi_t\phi'_t)]
= \mathrm{tr}(E[R_{\beta,i,t}\phi_t\phi'_t]) = \mathrm{tr}(R_{\beta,i,t}E[\phi_t\phi'_t])
= \mathrm{tr}(R_{\beta,i,t}(\mathrm{Var}[\phi_t] + E[\phi_t]E[\phi_t]'))
= \mathrm{tr}(R_{\beta,i,t}B_t) + \mathrm{tr}(R_{\beta,i,t}b_t b'_t)
= \mathrm{tr}(R_{\beta,i,t}B_t) + b'_t R_{\beta,i,t}b_t. \qquad (5.5)

Therefore, the moments of the linear predictor in the extended VBLB for latent factor modeling are

f_{i,t} = h'_{i,t}a_{\gamma,i,t} + b'_t a_{\beta,i,t},
q_{i,t} = h'_{i,t}R_{\gamma,i,t}h_{i,t} + 2h'_{i,t}S_{i,t}b_t + b'_t R_{\beta,i,t}b_t + a'_{\beta,i,t}B_t a_{\beta,i,t} + \mathrm{tr}(R_{\beta,i,t}B_t). \qquad (5.6)

Accordingly, the adaptive vector in the LB update step is R_{i,t}\tilde{F}_{i,t}/q_{i,t}, where \tilde{F}_{i,t} = (h'_{i,t}, b'_t)'. In contrast to traditional DGLMs, this modified analysis carries additional uncertainty because \phi_t is simultaneous and comes from another, external model; this appears explicitly in q_{i,t} as the last two terms.

Now that we have the means and variances, we only need the pairwise covariances between \lambda_{i,t} and \lambda_{j,t}, i \neq j, i, j = 1{:}N, to complete the joint covariance matrix:

q_{i,j,t} = \mathrm{Cov}(\lambda_{i,t}, \lambda_{j,t})
= \mathrm{Cov}(E[\lambda_{i,t} \mid \phi_t], E[\lambda_{j,t} \mid \phi_t]) + E[\mathrm{Cov}(\lambda_{i,t}, \lambda_{j,t} \mid \phi_t)]
= \mathrm{Cov}(h'_{i,t}a_{\gamma,i,t} + \phi'_t a_{\beta,i,t},\; h'_{j,t}a_{\gamma,j,t} + \phi'_t a_{\beta,j,t}) + 0
= \mathrm{Cov}(\phi'_t a_{\beta,i,t},\; \phi'_t a_{\beta,j,t})
= a'_{\beta,i,t}B_t a_{\beta,j,t}. \qquad (5.7)

The zero in the third step follows from the independence of M_i and M_j given \phi_t, which is the key assumption of the decouple-recouple modeling strategy.

At this point we have finished the modifications for the multiscale modeling context, which paves the road for the construction of the copula in Section 3 of Lavine et al. (2020).
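As a numerical illustration of Equations 5.6 and 5.7, the following minimal numpy sketch computes f_{i,t}, q_{i,t} and the cross covariance q_{i,j,t} from given moment blocks. The array names mirror the notation above, and the values are arbitrary placeholders rather than quantities from the analysis.

import numpy as np

def latent_factor_moments(h, a_gamma, a_beta, R_gamma, R_beta, S, b, B):
    """Moments of the linear predictor under the extended VBLB (Equation 5.6)."""
    f = h @ a_gamma + b @ a_beta
    q = (h @ R_gamma @ h + 2.0 * h @ S @ b + b @ R_beta @ b
         + a_beta @ B @ a_beta + np.trace(R_beta @ B))
    return f, q

def cross_covariance(a_beta_i, a_beta_j, B):
    """Pairwise covariance q_{i,j,t} between linear predictors (Equation 5.7)."""
    return a_beta_i @ B @ a_beta_j

# Illustrative dimensions: two household-specific covariates, one shared latent factor.
h = np.array([1.0, 1.0]);   a_gamma = np.array([0.2, 0.5]);  a_beta = np.array([1.5])
R_gamma = 0.1 * np.eye(2);  R_beta = np.array([[0.05]]);     S = np.zeros((2, 1))
b = np.array([0.1]);        B = np.array([[0.02]])

f_it, q_it = latent_factor_moments(h, a_gamma, a_beta, R_gamma, R_beta, S, b, B)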

5.2.2 Code

This subsection contains aspects of the code I developed for the main modeling components of the thesis research, covering the dynamic latent factor framework that is part of the PyBATS package (https://lavinei.github.io/pybats/). The first pair of functions extracts the linear predictor λ as a latent factor, while the second pair generates scaled versions of model coefficients. The latter can also be achieved by using dlm_coef_fxn() and dlm_coef_forecast_fxn() together with merge_lf_with_predictor(), whose explanations are available at https://lavinei.github.io/pybats/latent_factor.html.

## Latent factor functions for the linear predictor lambda.
## Assumes numpy and the PyBATS latent factor / forecasting utilities are in scope, e.g.
##   import numpy as np
##   from pybats_latest.latent_factor import latent_factor
##   from pybats_latest.forecast import forecast_aR, forecast_R_cov
## (module paths may differ across PyBATS versions).

def lambda_fxn(date, mod, k, **kwargs):
    """
    Return the mean and variance of the linear predictor lambda.
    :param date: date index
    :param mod: model that is being run
    :param k: forecast horizon
    :param kwargs: other arguments
    :return: mean and variance of lambda
    """
    return (mod.F.T @ mod.m).copy().reshape(-1), (mod.F.T @ mod.C @ mod.F).copy()


def lambda_forecast_fxn(date, mod, k, forecast_path=False, **kwargs):
    """
    Return the forecast mean and variance, and potentially the path covariance,
    of lambda (if forecast_path is True).
    :param date: date index
    :param mod: model that is being run
    :param k: forecast horizon
    :param forecast_path: True or False
    :param kwargs: other arguments
    :return: forecast mean and variance, potentially covariance of lambda
    """
    lambda_mean = []
    lambda_var = []
    if forecast_path:
        lambda_cov = [np.zeros([1, h]) for h in range(1, k)]
    for j in range(1, k + 1):
        # Use the j-step-ahead prior moments of the state vector.
        a, R = forecast_aR(mod, j)
        f, q = mod.get_mean_and_var(mod.F, a.reshape(-1), R)
        lambda_mean.append(f.copy())
        lambda_var.append(q.copy())
        if forecast_path and j > 1:
            for i in range(1, j):
                # lambda_cov[j-2] has shape (1, j-1); fill the (i, j) path covariance.
                lambda_cov[j - 2][0, i - 1] = mod.F.T @ forecast_R_cov(mod, i, j) @ mod.F
    if forecast_path:
        return lambda_mean, lambda_var, lambda_cov
    else:
        return lambda_mean, lambda_var


lambda_lf = latent_factor(gen_fxn=lambda_fxn, gen_forecast_fxn=lambda_forecast_fxn)

## Latent factor functions for scaled model coefficients.

def dlm_coef_scale_fxn(date, mod, scale=None, idx=None, scale_which=None, **kwargs):
    """
    Get the mean and variance of the coefficient latent factor, scaled by known values.
    :param date: date index
    :param mod: model that is being run
    :param scale: scalars used to scale the mean and variance, as known fixed values
        (for example, covariates of the models that use this latent factor); should be
        a pandas data frame with scalars in columns and dates as index
    :param scale_which: index of coefficients to be scaled by series in scale
        (needs to be within idx)
    :param idx: index of coefficients to extract
    :param kwargs: other arguments
    :return: mean and variance of the scaled coefficients
    """
    if scale is None:
        return dlm_coef_fxn(date, mod, idx, **kwargs)
    if idx is None:
        idx = np.arange(0, len(mod.m))
    if not set(scale_which).issubset(set(idx)):
        raise ValueError("scale_which needs to be a subset of idx")
    m_scale, C_scale = mod.m.copy(), mod.C.copy()
    scale_matrix = np.identity(C_scale.shape[0])
    scale_matrix[np.ix_(scale_which, scale_which)] = \
        scale.loc[date].values * scale_matrix[np.ix_(scale_which, scale_which)]
    m_scale = scale_matrix @ m_scale
    C_scale = scale_matrix @ C_scale @ scale_matrix
    return (m_scale[idx]).reshape(-1), (C_scale[np.ix_(idx, idx)]).copy()


def dlm_coef_scale_forecast_fxn(date, mod, k, scale=None, idx=None,
                                scale_which=None, forecast_path=False, **kwargs):
    """
    Compute the forecast mean, variance and potentially path covariance
    (if forecast_path is True) of the scaled coefficient latent factor.
    :param date: date index
    :param mod: model that is being run
    :param k: forecast horizon
    :param scale: scalars used to scale the mean and variance, as known fixed values
        (for example, covariates of the models that use this latent factor); should be
        a pandas data frame with scalars in columns and dates as index
    :param scale_which: index of coefficients to be scaled by series in scale
        (needs to be within idx)
    :param idx: index of coefficients to extract
    :param forecast_path: True or False
    :param kwargs: other arguments
    :return: forecast mean, variance and potentially covariance
    """
    if scale is None:
        # Pass the caller's idx and forecast_path straight through.
        return dlm_coef_forecast_fxn(date, mod, k, idx=idx,
                                     forecast_path=forecast_path, **kwargs)
    if idx is None:
        idx = np.arange(0, len(mod.m))
    p = len(idx)
    if not set(scale_which).issubset(set(idx)):
        raise ValueError("scale_which needs to be a subset of idx")
    dlm_coef_mean = []
    dlm_coef_var = []
    if forecast_path:
        dlm_coef_cov = [np.zeros([p, p, h]) for h in range(1, k)]
    for j in range(1, k + 1):
        a, R = forecast_aR(mod, j)
        a_scale = a.copy()
        R_scale = R.copy()
        scale_matrix = np.identity(R_scale.shape[0])
        scale_matrix[np.ix_(scale_which, scale_which)] = \
            scale.loc[date].values * scale_matrix[np.ix_(scale_which, scale_which)]
        a_scale = scale_matrix @ a_scale
        R_scale = scale_matrix @ R_scale @ scale_matrix
        dlm_coef_mean.append(a_scale[idx].copy().reshape(-1))
        dlm_coef_var.append(R_scale[np.ix_(idx, idx)].copy())
        if forecast_path and j > 1:
            for i in range(1, j):
                R_cov_scale = forecast_aR(mod, i)[1]
                R_cov_scale = scale_matrix @ R_cov_scale @ scale_matrix
                Gk = np.linalg.matrix_power(mod.G, j - i)
                dlm_coef_cov[j - 2][:, :, i - 1] = (Gk @ R_cov_scale)[np.ix_(idx, idx)]
    if forecast_path:
        return dlm_coef_mean, dlm_coef_var, dlm_coef_cov
    else:
        return dlm_coef_mean, dlm_coef_var


# The scaled latent factor plugs in the scaled generation functions defined above.
dlm_coef_scale_lf = latent_factor(gen_fxn=dlm_coef_scale_fxn,
                                  gen_forecast_fxn=dlm_coef_scale_forecast_fxn)
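As a hedged usage sketch (not part of the package documentation), the scaled coefficient functions above can be bound to a particular scaling series with functools.partial, mirroring the partial(dlm_coef_fxn, idx=idx) pattern used in Appendix C. Here discount_perc, dates and the index choices are placeholders standing in for the real covariate series and coefficient positions.

from functools import partial
import pandas as pd

# Hypothetical scaling series: one column per scaled coefficient, indexed by date.
scale_df = pd.DataFrame({'discount_perc': discount_perc}, index=dates)

coef_scale_lf = latent_factor(
    gen_fxn=partial(dlm_coef_scale_fxn, scale=scale_df,
                    idx=[1, 2], scale_which=[2]),
    gen_forecast_fxn=partial(dlm_coef_scale_forecast_fxn, scale=scale_df,
                             idx=[1, 2], scale_which=[2]))

# The resulting object can then be extracted from an aggregate-level model run via
# analysis(..., ret=['new_latent_factors'], new_latent_factors=[coef_scale_lf.copy()])
# and passed to household-level models, as illustrated in Appendix C.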

5.3 Clustering Visualization

Below is the Python code that outputs the interactive 3D plots exemplified by Figure 3.1. It includes the required packages and the data manipulation performed before the plotly plotting commands.

import pandas as pd
import numpy as np
from functools import reduce
from plotly.graph_objects import Scatter3d, Volume
from plotly.subplots import make_subplots
import plotly.io as pio

pio.renderers.default = "browser"

## define a function to merge multiple dataframes
def df_merge(df, on, how='outer'):
    """
    :param df: list of dataframes to be merged
    :param on: column(s) to merge on
    :param how: choices are "outer", "inner", or the index of the dataframe to merge onto
    :return: a merged dataframe
    """
    if how in ['outer', 'inner']:
        return reduce(lambda x, y: pd.merge(x, y, on=on, how=how), df)
    else:
        df_reorder = [df.pop(how)]
        df_reorder.extend(df)
        return reduce(lambda x, y: pd.merge(x, y, on=on, how="left"), df_reorder)

## read in the provided data
data = pd.read_pickle('The household data')

## clean and create sensitivity data
item_discount_ratio = data.loc[:, ['date', 'item', 'household', 'discount',
                                   'discount_pot', 'item_qty', 'net_price',
                                   'regular_price']]
item_discount_ratio['discount_percentage'] = (
    item_discount_ratio['regular_price'] - item_discount_ratio['net_price']
) / item_discount_ratio['regular_price']
item_discount_ratio['discount_sen'] = (
    item_discount_ratio.discount == item_discount_ratio.discount_pot)
sen = item_discount_ratio.groupby(['item', 'household'], as_index=False, observed=True)[
    ['discount_sen', 'discount_pot', 'item_qty']].mean().\
    sort_values(by=['discount_sen', 'discount_pot', 'item_qty'],
                ascending=[False, False, True])

## clean and create discounted purchase data
item_discount_ratio_buyd = item_discount_ratio.loc[item_discount_ratio.discount_pot == 1].copy()
item_discount_ratio_buyd['buy_discount'] = item_discount_ratio_buyd.item_qty > 0
buyd = item_discount_ratio_buyd.groupby(['item', 'household'], as_index=False, observed=True)[
    ['buy_discount', 'item_qty', 'discount_percentage']].mean().\
    sort_values(by=['buy_discount', 'item_qty', 'discount_percentage'],
                ascending=[False, True, False])

## clean and create regular purchase data
item_discount_ratio_buyr = item_discount_ratio.loc[item_discount_ratio.discount_pot == 0].copy()
item_discount_ratio_buyr['buy_regular'] = item_discount_ratio_buyr.item_qty > 0
buyr = item_discount_ratio_buyr.groupby(['item', 'household'], as_index=False, observed=True)[
    ['buy_regular', 'item_qty']].mean().\
    sort_values(by=['buy_regular', 'item_qty'], ascending=[False, True])

# create extra variables for plotting
buy_data = df_merge([sen, buyd, buyr], on=['item', 'household'], how='outer')
buy_data.columns = ['item', 'household', 'Discount sensitivity', 'Discount offered',
                    'Sales', 'Discount buy', 'Discount sales', 'Discount percent',
                    'Regular buy', 'Regular sales']
buy_data.iloc[:, 2:] = buy_data.iloc[:, 2:].fillna(0)
buy_data['Buy more with discount'] = buy_data['Discount buy'] > buy_data['Regular buy']
buy_data['Discount level'] = ['small' if d < 0.25 else 'median' if d < 0.6 else 'large'
                              for d in buy_data['Discount percent']]

# 3D interactive plots for households
# Initialize a 4x4 grid of 3D subplots (16 panels per view; buttons below page
# through all 211 items)
rows = 4
cols = 4
specs = [[{'type': 'scene'} for j in range(cols)] for i in range(rows)]
subplot_titles = ['panel' + str(i) for i in range(16)]
fig = make_subplots(rows=rows, cols=cols, specs=specs, subplot_titles=subplot_titles)

## There are 211 items, thus 211 plot panels. Indices 211 to 274 are used to
## create color shadows for each customer group.
for item in range(275):
    ## Plot a 3D scatterplot for each household-item pair
    if item < 211:
        data = buy_data.loc[buy_data.item == item]
        # Generate data
        x = data['Discount offered']
        y = data['Discount buy']
        z = data['Regular buy']
        symbol = data['Buy more with discount'].map({True: 'circle', False: 'x'})
        color = data['Discount level'].map({'small': 'blue', 'median': 'green', 'large': 'red'})
        size = data['Discount sales']
        size = (size - np.min(size)) / (np.max(size) - np.min(size)) * 20 + 6
        # adding surfaces to subplots
        fig.add_trace(
            Scatter3d(
                x=x,
                y=y,
                z=z,
                name='item' + str(item),
                visible=False,
                ## more information added to the scatterplots
                customdata=np.stack((data['household'].values,
                                     data['Buy more with discount'].values,
                                     data['Buy more with discount'].map({True: 'Circle', False: 'X'}).values,
                                     data['Discount level'].values,
                                     data['Discount level'].map(
                                         {'small': '<25%', 'median': '25-60%', 'large': '>60%'}).values),
                                    axis=-1),
                mode='markers',
                marker=dict(
                    size=size,
                    color=color,
                    cauto=True,
                    symbol=symbol,
                    opacity=0.8
                ),
                hovertemplate=
                '<b>Household</b>: %{customdata[0]}<br>' +
                '<b>Discount offered</b>: %{x:.0%}<br>' +
                '<b>Discount buy</b>: %{y:.0%}<br>' +
                '<b>Regular buy</b>: %{z:.0%}<br>' +
                '<b>Discount sales</b>: %{marker.size:.2f} units<br>' +
                '<b>Buy more with discount</b>: %{customdata[1]} (%{customdata[2]})<br>' +
                '<b>Discount level</b>: %{customdata[3]} (%{customdata[4]})<br>',
                hoverlabel=dict(bgcolor=color)
            ),
            row=(np.floor((item % 16) / 4) + 1).astype('int'), col=(item % 16) % 4 + 1)
        if item < 16:
            fig['layout']['scene' + str(item + 1)]['xaxis'] = {'title': {'text': 'Discount offered'}}
            fig['layout']['scene' + str(item + 1)]['yaxis'] = {'title': {'text': 'Discount buy'}}
            fig['layout']['scene' + str(item + 1)]['zaxis'] = {'title': {'text': 'Regular buy'}}

    ## Create a shadow for each customer group
    else:
        if (item - 211) % 4 == 0:
            X, Y, Z = np.mgrid[0:0.5:2j, 0:1:2j, 0.5:1:2j]
            values = np.zeros(X.shape)
        elif (item - 211) % 4 == 1:
            X, Y, Z = np.mgrid[0.5:1:2j, 0.5:1:2j, 0:1:2j]
            values = np.ones(X.shape)
        elif (item - 211) % 4 == 2:
            X, Y, Z = np.mgrid[0:0.5:2j, 0:1:2j, 0:0.5:2j]
            values = np.ones(X.shape) * 2
        elif (item - 211) % 4 == 3:
            X, Y, Z = np.mgrid[0.5:1:2j, 0:0.5:2j, 0:1:2j]
            values = np.ones(X.shape) * 3
        x = X.flatten()
        y = Y.flatten()
        z = Z.flatten()
        value = values.flatten()
        fig.add_trace(
            Volume(
                name='group' + str((item - 211) % 4),
                x=x,
                y=y,
                z=z,
                value=value,
                opacity=0.3,  # needs to be small to see through all surfaces
                surface_count=50,  # needs to be a large number for good volume rendering
                colorscale="RdBu",
                showlegend=False,
                showscale=False,
                isomax=3,
                isomin=0,
                hovertemplate='<b>Group</b>: #%{value: .f}'
            ),
            row=(1 + np.floor((np.floor((item - 211) / 4)) / 4)).astype('int'),
            col=((np.floor((item - 211) / 4)) % 4 + 1).astype('int')
        )

# create buttons
buttons = []
for i in range(27):
    if i < 25:
        buttons.append(dict(method='update',
                            args=[{"visible": [False if np.floor(item / 16) != i else True
                                               for item in range(211)] + [False] * 64}],
                            label="item" + str(i * 16) + "--" + str((i + 1) * 16 - 1)))
        buttons.append(dict(method='update',
                            args=[{"visible": [False if np.floor(item / 16) != i else True
                                               for item in range(211)] + [True] * 64}],
                            label="item" + str(i * 16) + "--" + str((i + 1) * 16 - 1) + " grouped"))
    else:
        buttons.append(dict(method='update',
                            args=[{"visible": [False if np.floor(item / 16) != i else True
                                               for item in range(211)] + [False] * 64}],
                            label="item208--210"))
        buttons.append(dict(method='update',
                            args=[{"visible": [False if np.floor(item / 16) != i else True
                                               for item in range(211)] + [True] * 64}],
                            label="item208--210" + " grouped"))

fig.update_layout(scene=dict(
    xaxis_title='Discount offered',
    yaxis_title='Discount buy',
    zaxis_title='Regular buy'),
    title_text='Comprehensive 3D plots for all items (Author: Daniel Deng)',
    height=2500,
    width=1800,
    updatemenus=[dict(type='buttons',
                      buttons=buttons,
                      x=1.09,
                      xanchor='left',
                      y=1,
                      yanchor='top')],
    hovermode='closest'
)

fig.show()


Chapter 6

Conclusions and Summary

The application of statistical modeling to commercial problems has surged recently, as companies have recognized the tremendous upside once insights are revealed. Businesses such as Walmart, Amazon and Harris Teeter have started to seek statistical methods that improve their decision making and thereby strengthen their customer relationships.

In this thesis, I introduce multi-scale modeling within the Bayesian dynamic modeling framework. It showcases the power of hierarchical, sequential, probabilistic and computationally efficient models, and emphasizes the novel decouple-recouple modeling strategy, which propagates signals down the hierarchy. I also demonstrate the resulting improvement in forecasting accuracy.

This method aims to mitigate the difficulty of forecasting sporadic data, which constitutes the finest level of our hierarchy. This has been a challenge for years, so the improvement brought by this approach is another step forward. The multi-scale modeling successfully inherits the hierarchical information of the retail setting: households visit a store and spend on items, and purchasing outcomes connect across large categories of items to smaller, refined categories, and eventually to specific items.

In Chapter 3, I elaborate the design and criteria used to classify thousands of households based on their purchasing behaviors. This not only enables the multi-scale modeling in Chapter 2, but also sets an example of visual learning, providing a valuable way of thinking about customer behavior. The case study exemplifies the identification of price-sensitive households, which paves the way to customized decision analysis.


In the end, making good decisions is the ultimate goal of modeling. In Chapter 4, I explore the optimal-discount problem under various models. I also attempt to extend the decision analysis over a longer period of time (a year). Even though the results (Section 4.4) align with business sense, the use of the model over such a long time span remains questionable.

Finally, I describe my programming contributions to this project (Yanchenko et al., 2021). First, my research has contributed extensions and functionality for latent factor dynamic modeling to the existing PyBATS package. Second, my development of innovative data assessment and dynamic visualization with household labeling has produced software that is available for further applications.

Future Work and Comments

This thesis presents a novel approach to efficiently forecast sparse time series. However, the decision analysis based on the model has much more to explore than is presented in Chapter 4. The main obstacle is long-term forecasting. First, one needs to define how long "long-term" is, based on the context: three to six months might be long in a retail setting, while one to two years can be short for earthquakes or volcanic eruptions. Problems with more controllable, human-made components are generally easier than those without control. Second, accounting for all the significant factors can be challenging; even for a social-behavior problem like retailing, there may be unexpected shocks that make a forecast obsolete (e.g., COVID-19 in 2020). Lastly, the uncertainty associated with a forecast increases rapidly with the length of the horizon and the number of uncertain factors. This can leave us with a statistically sound forecast that has little pragmatic use.

I prefer to think about long-term forecasting in the following way. No one really foresees the future. Instead, we can only study the past for insights that help our decision making in the present, which in turn has a significant impact on the future. As statisticians, we learn from history in a quantitative way: from data. We extract and summarize information buried in the data that is not visible to the naked eye. As a result, interpretability and openness are key, assuming that we do not rely on some "black box" to determine our future (consider what happened to Catholicism when the plague hit). Therefore, I do not think the problem of long-term forecasting is simply a modeling or mathematical problem; rather, it is closely related to the horizon of total human knowledge.

Returning to statistical modeling and decision analysis, a rational decision maker should listen to multiple sources in order to reduce uncertainty about the quality of individual agents. Bayesian Predictive Synthesis (McAlinn et al., 2020; West and Crosse, 1992; West, 1992) provides a potential framework for future research. A decision maker using such a framework takes into account all probabilistic information from the available agents and updates their own opinion about the quantity of interest.


Appendices


Appendix A

DGLMs

• yt denotes the time series of interest, whether it is continuous, binary or a non-negative count.

• At any given time t, available information is denoted by Dt = {yt, Dt−1, It−1},

where It−1 is any relevant additional information at time t− 1.

• Ft, θt are the dynamic regression vector and state vector at time t, respectively.

• λt = F′tθt, where λt is the linear predictor at time t. It links the parameter of interest to the linear regression via a link function, i.e., λt = logit(πt) for the binomial DGLM and λt = log(µt) for the Poisson DGLM, where πt and µt are the success probability and mean of these processes.

• The state vector θt evolves via θt = Gtθt−1 + wt with wt ∼ (0,Wt), where Gt is the known evolution matrix and wt is the stochastic innovation vector.

• wt is independent of current and past states, with moments E[wt|Dt−1, It−1] = 0 and V[wt|Dt−1, It−1] = Wt.

A.1 VBLB

1. Current information is summarized in the mean vector and variance matrix of the posterior state vector, θt−1|Dt−1, It−1 ∼ [mt−1,Ct−1].

2. Via the evolution equation θt = Gtθt−1 + wt, the implied 1-step ahead prior moments at time t are θt|Dt−1, It−1 ∼ [at,Rt], with at = Gtmt−1 and Rt = GtCt−1G′t + Wt.

3. The time t conjugate prior satisfies E[λt|Dt−1, It−1] = ft = F′tat and V[λt|Dt−1, It−1] = qt = F′tRtFt.

i.e.

Binomial: yt ∼ Bin(ht, πt), conjugate prior: πt ∼ Be(αt, βt), with ft =

ψ(αt) − ψ(βt) and qt = ψ′(αt) + ψ′(βt), where ψ(x), ψ′(x) are digamma and

trigamma functions.

Poisson: yt ∼ Poi(µt), conjugate prior: µt ∼ Ga(αt, βt), with ft = ψ(αt) −

log(βt) and qt = ψ′(αt).

4. Forecast yt 1-step ahead using the conjugacy-induced predictive distribution

p(yt|Dt−1, It−1). This can be simulated trivially.

5. Observing yt, update to the posterior.

i.e.

Binomial: conjugate posterior: πt ∼ Be(αt + yt, βt + ht − yt).

Poisson: conjugate posterior µt ∼ Ga(αt + yt, βt + 1).

6. Update posterior mean and variance of the linear predictor λt: gt = E[λt|Dt]

and pt = V [λt|Dt]

7. Linear Bayes estimation gives posterior moments mt = at + RtFt(gt − ft)/qt and Ct = Rt − RtFtF′tRt(1 − pt/qt)/qt.

This completes the time (t−1)-to-t evolve-predict-update cycle; a minimal numerical sketch of this cycle for the Poisson case is given below.
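To make the cycle concrete, here is a small self-contained numpy/scipy sketch of steps 1–7 for the Poisson case. All numbers (F_t, G_t, W_t, m_{t−1}, C_{t−1}, y_t) are made up for illustration, and the gamma-matching in step 3 uses a simple Newton solve rather than the routine implemented inside PyBATS.

import numpy as np
from scipy.special import digamma, polygamma

F_t = np.array([1.0, 0.3])                 # regression vector (intercept, covariate)
G_t = np.eye(2)                            # evolution matrix
W_t = 0.01 * np.eye(2)                     # innovation variance
m_prev = np.array([0.5, 0.2])              # posterior mean at t-1
C_prev = 0.05 * np.eye(2)                  # posterior variance at t-1

# Steps 1-2. Evolve: prior moments at time t.
a_t = G_t @ m_prev
R_t = G_t @ C_prev @ G_t.T + W_t

# Step 3. Moments of the linear predictor, and a Gamma(alpha, beta) prior matched via
#   psi(alpha) - log(beta) = f_t,   psi'(alpha) = q_t.
f_t = F_t @ a_t
q_t = F_t @ R_t @ F_t
alpha = 1.0
for _ in range(50):                        # Newton iterations to solve trigamma(alpha) = q_t
    alpha -= (polygamma(1, alpha) - q_t) / polygamma(2, alpha)
beta = np.exp(digamma(alpha) - f_t)

# Steps 4-5. Observe y_t and update the conjugate posterior.
y_t = 3
alpha_post, beta_post = alpha + y_t, beta + 1.0

# Step 6. Posterior moments of the linear predictor.
g_t = digamma(alpha_post) - np.log(beta_post)
p_t = polygamma(1, alpha_post)

# Step 7. Linear Bayes update of the state moments.
A_t = R_t @ F_t / q_t                      # adaptive vector
m_t = a_t + A_t * (g_t - f_t)
C_t = R_t - np.outer(A_t, A_t) * q_t * (1.0 - p_t / q_t)
print(m_t, np.diag(C_t))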


A.2 Discount Factors

• The regression vector Ft can include an intercept and known quantities, such as the price of an item or an indicator of whether or not a firewall is in use, e.g.,
F′t = (1, pricet, promotiont, 1, 0, 1, 0, 1, 0)

• The evolution matrix Gt is usually block-diagonal. For regular covariates in Ft, the corresponding diagonal entries of Gt are 1, allowing the coefficients to evolve via the random innovation wt; Gt can also include seasonal effects by adding blocks of harmonic components, e.g.,

Gt = blockdiag(1, 1, 1, H1, H2, H3), where H_j = \begin{pmatrix} \cos(2\pi j/7) & \sin(2\pi j/7) \\ -\sin(2\pi j/7) & \cos(2\pi j/7) \end{pmatrix}, \quad j = 1, 2, 3.

• The evolution variance matrix Wt can be controlled by discount factors δj ∈ (0, 1], j = 1:J, via the following design (a small numerical sketch is given below). Recall that Rt = GtCt−1G′t + Wt. Let Pt = GtCt−1G′t and set

Wt = blockdiag(Pt,1(1 − δ1)/δ1, . . . , Pt,J(1 − δJ)/δJ),

where Pt,j is the corresponding diagonal block of Pt. This design enables separate discount factors for different components: each component's uncertainty is inflated by the factor (1 − δj)/δj, while correlations within Pt,j are maintained.
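The following minimal numpy sketch illustrates this component-wise discounting with made-up values of Ct−1, Gt and two discount factors; it is only an illustration of the construction, not code from the thesis models.

import numpy as np
from scipy.linalg import block_diag

# Made-up posterior variance at t-1 and evolution matrix for a 2-block state
# (a 1-dimensional trend block and a 2-dimensional regression block).
C_prev = np.array([[0.10, 0.01, 0.00],
                   [0.01, 0.20, 0.05],
                   [0.00, 0.05, 0.15]])
G_t = np.eye(3)
deltas = [0.95, 0.98]            # one discount factor per block
blocks = [slice(0, 1), slice(1, 3)]

P_t = G_t @ C_prev @ G_t.T
W_t = block_diag(*[P_t[b, b] * (1 - d) / d for b, d in zip(blocks, deltas)])
R_t = P_t + W_t                  # each block's uncertainty inflated to P_{t,j}/delta_j
print(np.round(R_t, 4))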

A.3 Random Effects

• Applicable to any DGLMs.


• Capture additional variation.

• Extended state vector θt = (ξt, θ′t,0)′ and regression vector Ft = (1, F′t,0)′, where ξt is a sequence of independent, zero-mean random effects and θt,0, Ft,0 are the baseline state vector and regression vector. The extended linear predictor is λt = ξt + λt,0.

• ξt provides additional, day-specific "shocks" to the latent coefficients.

• A random-effect discount factor ρ ∈ (0, 1] controls the level of variability injected, in a similar fashion to the other discount factors (a small numerical illustration follows): with qt,0 = V[λt,0|Dt−1, It−1], let vt = V[ξt|Dt−1, It−1] = qt,0(1 − ρ)/ρ, which inflates the variance of λt by the factor (1 − ρ)/ρ, i.e., V[λt|Dt−1, It−1] = qt,0/ρ.
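A tiny numerical illustration of this inflation, with a made-up baseline variance and random-effect discount factor:

# Made-up values: baseline linear-predictor variance and random-effect discount factor.
q_t0 = 0.20
rho = 0.8

v_t = q_t0 * (1 - rho) / rho            # variance of the random effect xi_t
q_t = q_t0 + v_t                        # variance of lambda_t = xi_t + lambda_{t,0}
assert abs(q_t - q_t0 / rho) < 1e-12    # equivalently, inflation by a factor 1/rho
print(v_t, q_t)                         # 0.05, 0.25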

A.4 Multi-scale Modeling

• Use the decouple/recouple method to enable information sharing across series as well as scalability.

• Add information at the aggregate level so that individual-level signals are not obscured by noise.

• Each of the N univariate series has a state vector and regression vector defined by

Mi : θi,t = (γ′i,t, β′i,t)′, Fi,t = (f′i,t, φ′t)′, i = 1:N,

which implies λi,t = γ′i,tfi,t + β′i,tφt, where the first term carries series-specific information, while φt is a latent factor shared by all series.


• The common latent factor φt can be any shared factor and is modeled by another DGLM, denoted M0; conditional on it, the updating and forecasting of each Mi proceed separately and in parallel.

• This decoupling/recoupling technique enables scalability across the N individual series, while creating linkage across series.


Appendix B

More Figures

Here are figures of the parameter distributions from the case study in Section 4.4. They show that one can boost the shifted Poisson mean µt of the DCMMs and the Bernoulli probability πt by offering larger discounts, provided the item remains profitable.


(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.1: Distributions of simulated parameters over a year (p/c = 1.2).

(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.2: Distributions of simulated parameters over a year (p/c = 2).

(a) Distributions of optimal Bernoulli probability over a year
(b) Distributions of optimal Poisson mean over a year
Figure B.3: Distributions of simulated parameters over a year (p/c = 10).

Appendix C

More Code

This last appendix contains the code used for the modeling.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

####### My pybats is called pybats_latest; you need to change that to your file name.
from pybats_latest.analysis import analysis, analysis_dcmm
from pybats_latest.latent_factor import dlm_coef_scale_fxn, dlm_coef_scale_forecast_fxn, \
    latent_factor, merge_lf_with_predictor, dlm_coef_fxn, dlm_coef_forecast_fxn
from pybats_latest.point_forecast import zape_point_estimate
from sklearn.metrics import roc_auc_score, f1_score
from functools import partial, reduce

## These are used for the actual modeling runs; see the commented part at the bottom.
from complement import create_agg_data, latent_factor_generator, multi_scale_modeling, list_files
from joblib import Parallel, delayed
import multiprocessing
import time
import os

def create_agg_data(directory, data_names, data_name):
    """
    Create the aggregate level data.
    :param directory: directory of the data and where you want to save it
    :param data_names: data paths for each individual data file
    :param data_name: name of the data file to be stored
    :return: None (writes the aggregate data to disk)
    """
    item_total_sales = np.zeros(112)
    item_total_transaction = np.zeros(112)
    item_discount = np.zeros(112)
    item_discount_perc = np.zeros(112)
    total_household = np.zeros(112)
    for f in data_names:
        data = pd.read_pickle(directory + '/' + f)
        item_total_sales += data.item_qty.values
        item_total_transaction += (data.item_qty > 0).astype('int').values
        item_discount += data.discount_pot.values
        item_discount_perc += data.discount_amount_pot.values / data.regular_price.values
        total_household += 1
    item_discount = item_discount / len(data_names)
    item_discount_perc = item_discount_perc / len(data_names)
    item_data = pd.DataFrame({'date': data.date.values,
                              'total_sales': item_total_sales,
                              'total_transaction': item_total_transaction,
                              'discount': item_discount,
                              'discount_perc': item_discount_perc,
                              'total_household': total_household})
    item_data.to_pickle(directory + '/' + data_name)

def latent_factor_generator(agg_data_path):
    """
    Generate a latent factor at the aggregate level.
    :param agg_data_path: a string giving the data path
    :return: a list of latent factors
    """
    agg_data = pd.read_pickle(agg_data_path)
    Y_sales = agg_data.total_sales.values
    X = np.c_[agg_data.discount.values, agg_data.discount_perc.values]
    # X[-8:, 0] = 1
    # X[-8:, 1] += 0.1
    n = agg_data.total_household.values

    # latent factor parameters
    prior_length = 8
    nsamps = 5000
    delregn = 0.98
    deltrend = 0.98
    delseas = 0.98
    rho = 0.6
    adapt_discount = None
    forecast_start = 52
    forecast_end = 103
    k = 8
    T = 112
    start_date = pd.to_datetime('2017-09-05')  # Make up a start date
    dates = pd.date_range(start_date, start_date + pd.DateOffset(days=T - 1), freq='D')
    forecast_start_date = start_date + pd.DateOffset(days=forecast_start)
    forecast_end_date = dates[-1] - pd.DateOffset(days=k)

    # latent factor modeling
    ## latent factor for sales
    idx = np.array([2])
    dlm_coef_fxn_sales = partial(dlm_coef_fxn, idx=idx)
    dlm_coef_forecast_fxn_sales = partial(dlm_coef_forecast_fxn, idx=idx)
    discount_sensitivity_lf_sales = latent_factor(
        gen_fxn=dlm_coef_fxn_sales,
        gen_forecast_fxn=dlm_coef_forecast_fxn_sales)
    discount_latent_sales = analysis(
        Y=Y_sales, X=X, family='poisson', prior_length=prior_length,
        k=k, rho=rho,
        forecast_start=forecast_start, forecast_end=forecast_end,
        forecast_start_date=forecast_start_date, forecast_end_date=forecast_end_date,
        dates=dates,
        nsamps=nsamps,
        deltrend=deltrend, delregn=delregn, adapt_discount=adapt_discount,
        ret=['new_latent_factors'],
        new_latent_factors=[discount_sensitivity_lf_sales.copy()])

    # ##### This part can be replaced by a function called latent_factor_plot().
    # M = np.array([])
    # V = np.array([])
    #
    # for date in dates[forecast_start:forecast_end]:
    #     m, v = discount_latent_sales.get_lf_forecast(date)
    #     M = np.append(M, m[0])
    #     V = np.append(V, v[0])
    #
    # lf_mean = pd.DataFrame({'average': M, 'upper': M + np.sqrt(V), 'lower': M - np.sqrt(V),
    #                         'date': dates[forecast_start:forecast_end]})
    # fig, ax = plt.subplots(1, 1)
    # ax.plot(np.arange(0, len(lf_mean.date), 1), lf_mean.average.values,
    #         color='red', alpha=0.5, label='mean')
    # ax.fill_between(np.arange(0, len(lf_mean.date), 1),
    #                 lf_mean.upper.values, lf_mean.lower.values,
    #                 alpha=0.4, label='unit sd region')
    # plt.xticks(rotation=20)
    # plt.legend()
    # ax.set_ylabel("Coefficient of average discount percentage")
    # ax.set_title("Coefficient of " + "item" + "Your name")
    # fig.savefig("Your path")

    return [discount_latent_sales]

def multi_scale_modeling(data_path, latent_factor):
    """
    Implement multi-scale modeling on the data for one household-item series.
    :param data_path: path of the time series chosen
    :param latent_factor: latent factor to be used
    :return: a list [household, accuracy, f1, naive, naive_f1, zape, median forecasts]
    """
    discount_latent = latent_factor
    try:
        data = pd.read_pickle(data_path)
        Y = data.item_qty.values
        buy = (Y > 0).astype('float')
        X = np.c_[data.discount_pot.values,
                  data.discount_amount_pot.values / data.regular_price.values]
        # X[-8:, 0] = 1
        # X[-8:, 1] += 0.1
        household = data.household.iloc[0]
        group = data_path[-5:-4]

        # model parameters
        prior_length = 52
        nsamps = 5000
        delregn = 0.998
        deltrend = 0.998
        delseas = 0.998
        rho = 0.6
        adapt_discount = None
        forecast_start = 52
        forecast_end = 103
        k = 8
        T = len(Y)
        start_date = pd.to_datetime('2017-09-05')  # Make up a start date
        dates = pd.date_range(start_date, start_date + pd.DateOffset(days=T - 1), freq='D')
        forecast_start_date = start_date + pd.DateOffset(days=forecast_start)
        forecast_end_date = dates[-1] - pd.DateOffset(days=k)

        ### You can add your own signal as a latent factor in the latent factor list.
        ### Here the first one is my latent factor: discount times coefficients.
        discount_latent[0] = merge_lf_with_predictor(discount_latent[0], X[:, 1], dates)
        # discount_latent[1] = "Your latent factor" (you will need to append it to the
        #     input of this function)
        # this function plots the mean and sd shadow for the latent factors in the input:
        # latent_factor_plot(discount_latent, directory=, names=)

        print("begin " + str(household))
        try:
            samples = analysis_dcmm(
                Y=Y, X=X[:, 0].reshape(112, -1), k=k, prior_length=prior_length,
                forecast_start=forecast_start, forecast_end=forecast_end,
                forecast_start_date=forecast_start_date, forecast_end_date=forecast_end_date,
                dates=dates, latent_factor=discount_latent[0],
                nsamps=nsamps, rho=rho,
                delseas=delseas, deltrend=deltrend, delregn=delregn,
                adapt_discount=adapt_discount,
                ret=['forecast'])
            print(samples.shape)

            # point forecasts
            buy_samples = (samples > 0).astype('float')
            medians = np.median(buy_samples[:, :, 0], axis=0).astype('int')
            # probs = np.mean(buy_samples, axis=(0, 2))

            # performance scores
            accuracy = (medians == buy[forecast_start + 1:forecast_end + 2]).astype('float').mean()
            f1 = f1_score(buy[forecast_start + 1:forecast_end + 2].astype('int'), medians)
            naive = (X[:, 0][forecast_start + 1:forecast_end + 2] ==
                     buy[forecast_start + 1:forecast_end + 2]).astype('float').mean()
            naive_f1 = f1_score(buy[forecast_start + 1:forecast_end + 2].astype('int'),
                                X[:, 0][forecast_start + 1:forecast_end + 2].astype('int'))
            zape = zape_point_estimate(samples)
            print(str(household) + " finished")
            return [household, accuracy, f1, naive, naive_f1, zape,
                    np.median(buy_samples, axis=0)]
        except ValueError:
            print("error!!!!!!!!!!!")
    except EOFError:
        print("Opps")

def latent_factor_plot(latent_factor, directory, names):
    """
    :param latent_factor: a list of latent factors you want to plot
    :param directory: path where you want to save the figures
    :param names: list of names for your figures
    :return: saves one figure per latent factor to the given directory

    Note: as written, this helper expects dates, forecast_start and forecast_end
    to be available in the enclosing scope (they are defined locally in the
    functions above).
    """
    for l, n in zip(latent_factor, names):
        M = np.array([])
        V = np.array([])
        for date in dates[forecast_start:forecast_end]:
            m, v = l.get_lf_forecast(date)
            M = np.append(M, m[0])
            V = np.append(V, v[0])
        lf_mean = pd.DataFrame({'average': M, 'upper': M + np.sqrt(V), 'lower': M - np.sqrt(V),
                                'date': dates[forecast_start:forecast_end]})
        fig, ax = plt.subplots(1, 1)
        ax.plot(np.arange(0, len(lf_mean.date), 1), lf_mean.average.values,
                color='red', alpha=0.5, label='mean')
        ax.fill_between(np.arange(0, len(lf_mean.date), 1),
                        lf_mean.upper.values, lf_mean.lower.values,
                        alpha=0.4, label='unit sd region')
        plt.xticks(rotation=20)
        plt.legend()
        ax.set_ylabel("Coefficient multiplied by household discount percent")
        ax.set_title("Latent factor for " + n)
        fig.savefig(directory + '/' + n + '.png')

## Here is how I fit those models; you can adapt this if you want to.
## These work with the functions above.

# # Data names
# item72_names_group0 = list_files(os.getcwd() + "/Data/Items/group0", "item72", ".pkl")
# item62_names_group1 = list_files(os.getcwd() + "/Data/Items/group1", "item62", ".pkl")
# item17_names_group2 = list_files(os.getcwd() + "/Data/Items/group2", "item17", ".pkl")
# item76_names_group3 = list_files(os.getcwd() + "/Data/Items/group3", "item76", ".pkl")
#
# # create aggregate level data
# create_agg_data(os.getcwd() + "/Data/Items/group0", item72_names_group0, 'agg-72-group0.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group1", item62_names_group1, 'agg-62-group1.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group2", item17_names_group2, 'agg-17-group2.pkl')
# create_agg_data(os.getcwd() + "/Data/Items/group3", item76_names_group3, 'agg-76-group3.pkl')
#
# # create latent factors
# latent_factor72 = latent_factor_generator(os.getcwd() + "/Data/Items/group0/" + 'agg-72-group0.pkl')
# latent_factor62 = latent_factor_generator(os.getcwd() + "/Data/Items/group1/" + 'agg-62-group1.pkl')
# latent_factor17 = latent_factor_generator(os.getcwd() + "/Data/Items/group2/" + 'agg-17-group2.pkl')
# latent_factor76 = latent_factor_generator(os.getcwd() + "/Data/Items/group3/" + 'agg-76-group3.pkl')
#
# # parallelism
# num_cores = multiprocessing.cpu_count()
#
# scores72 = []
# scores72.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group0/" + data_path, latent_factor=latent_factor72)
#     for data_path in item72_names_group0))
#
# scores62 = []
# scores62.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group1/" + data_path, latent_factor=latent_factor62)
#     for data_path in item62_names_group1))
#
# scores17 = []
# scores17.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group2/" + data_path, latent_factor=latent_factor17)
#     for data_path in item17_names_group2))
#
# scores76 = []
# scores76.append(Parallel(n_jobs=num_cores)(delayed(multi_scale_modeling)(
#     data_path="Data/Items/group3/" + data_path, latent_factor=latent_factor76)
#     for data_path in item76_names_group3))
#
# # get rid of results that are None
# scores72 = [[score for score in scores72[0] if score is not None]]
# scores62 = [[score for score in scores62[0] if score is not None]]
# scores17 = [[score for score in scores17[0] if score is not None]]
# scores76 = [[score for score in scores76[0] if score is not None]]
#
# # print out how many households are left
# print(len(scores72[0]))
# print(len(scores62[0]))
# print(len(scores17[0]))
# print(len(scores76[0]))
#
# # save the results in a numpy zip file
# np.savez(os.getcwd() + "/plots/performances",
#          item72=np.array(scores72[0]), item62=np.array(scores62[0]),
#          item17=np.array(scores17[0]), item76=np.array(scores76[0]))


Bibliography

Berry, L. R., P. Helman, and M. West (2020). Probabilistic forecasting of heterogeneous consumer transaction-sales time series. International Journal of Forecasting 36, 552–569.

Berry, L. R. and M. West (2020). Bayesian forecasting of many count-valued time series. Journal of Business and Economic Statistics 38, 872–887.

Carvalho, C. M. and M. West (2007). Dynamic matrix-variate graphical models. Bayesian Analysis 2, 69–98.

Chen, T., B. Keng, and J. Moreno (2018). Multivariate arrival times with recurrent neural networks for personalized demand forecasting. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 810–819.

Chu, W. and S.-T. Park (2009). Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, New York, NY, USA, pp. 691–700. Association for Computing Machinery.

Du, C., C. Li, Y. Zheng, J. Zhu, and B. Zhang (2018, February). Collaborative filtering with user-item co-autoregressive models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana. Association for the Advancement of Artificial Intelligence.

Ferreira, M. A. R., Z. Bi, M. West, H. K. H. Lee, and D. M. Higdon (2003). Multiscale modelling of 1-D permeability fields. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West (Eds.), Bayesian Statistics 7, pp. 519–528. Oxford University Press.

Ferreira, M. A. R., M. West, H. K. H. Lee, and D. M. Higdon (2006). Multiscale and hidden resolution time series models. Bayesian Analysis 2, 294–314.

He, X., Z. He, X. Du, and T.-S. Chua (2018, July). Adversarial personalized ranking for recommendation. In SIGIR '18: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI.

Hu, Y., Q. Peng, X. Hu, and R. Yang (2015). Web service recommendation based on time series forecasting and collaborative filtering. In 2015 IEEE International Conference on Web Services, pp. 233–240.

Jerfel, G., M. Basbug, and B. Engelhardt (2017, 20–22 Apr). Dynamic collaborative filtering with compound Poisson factorization. Volume 54 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, pp. 738–747. PMLR.

Kazemian, P., M. S. Lavieri, M. P. V. Oyen, C. Andrews, and J. D. Stein (2018, April). Personalized prediction of Glaucoma progression under different target intraocular pressure levels using filtered forecasting methods. Ophthalmology 125(4), 569–577.

Kott, A. and P. Perconti (2018). Long-term forecasts of military technologies for a 20–30 year horizon: An empirical assessment of accuracy. Technological Forecasting and Social Change 137, 272–279.

Lavine, I., A. J. Cron, and M. West (2020). Bayesian computation in dynamic latent factor models. Technical Report, Department of Statistical Science, Duke University. arxiv.org/abs/2007.04956.

Lichman, M. and P. Smyth (2018, April). Prediction of sparse user-item consumption rates with zero-inflated Poisson regression. In WWW '18: Proceedings of the 2018 World Wide Web Conference, pp. 719–728.

McAlinn, K., K. A. Aastveit, J. Nakajima, and M. West (2020). Multivariate Bayesian predictive synthesis in macroeconomic forecasting. Journal of the American Statistical Association 115, 1092–1110. arXiv:1711.01667. Published online: Oct 9, 2019.

Naumov, M., D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy (2019). Deep learning recommendation model for personalization and recommendation systems. arxiv.org/abs/1906.00091.

Nevins, J. R., E. S. Huang, H. Dressman, J. L. Pittman, A. T. Huang, and M. West (2003). Towards integrated clinico-genomic models for personalized medicine: Combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Human Molecular Genetics 12, 153–157.

Niu, W., J. Caverlee, and H. Lu (2018). Neural personalized ranking for image recommendation. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM 2018). ACM.

Pittman, J. L., E. S. Huang, H. K. Dressman, C. F. Horng, S. H. Cheng, M. H. Tsou, C. M. Chen, A. Bild, E. S. Iversen, A. T. Huang, J. R. Nevins, and M. West (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences 101, 8431–8436.

Salinas, D., M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus (2019). High-dimensional multivariate forecasting with low-rank Gaussian copula processes. In Advances in Neural Information Processing Systems 32, pp. 6827–6837. Curran Associates, Inc.

Su, X. and T. M. Khoshgoftaar (2009, January). A survey of collaborative filtering techniques. Advances in Artificial Intelligence.

Talebi, M. et al. (2017). Long-term probabilistic forecast for M 5.0 earthquakes in Iran. Pure and Applied Geophysics 174, 1561–1580.

Thai-Nghe, N., T. Horváth, and L. Schmidt-Thieme (2011). Personalized forecasting student performance. In 2011 IEEE 11th International Conference on Advanced Learning Technologies, pp. 412–414.

Wang, X., Y. Wang, D. Hsu, and Y. Wang (2013). Exploration in interactive personalized music recommendation: A reinforcement learning approach. ACM Transactions on Multimedia Computing, Communications, and Applications 2(3).

West, M. (1992). Modelling agent forecast distributions. Journal of the Royal Statistical Society (Ser. B) 54, 553–567.

West, M. and J. Crosse (1992). Modelling of probabilistic agent opinion. Journal of the Royal Statistical Society (Ser. B) 54, 285–299.

West, M. and P. J. Harrison (1997). Bayesian Forecasting and Dynamic Models (2nd ed.). Springer-Verlag, New York, Inc.

West, M., A. T. Huang, G. S. Ginsberg, and J. R. Nevins (2006). Embracing the complexity of genomic data for personalized medicine. Genome Research 16, 559–566.

Yanchenko, A., D. D. Deng, J. Li, A. J. Cron, and M. West (2021). Hierarchical dynamic modelling for individualized Bayesian forecasting. Department of Statistical Science, Duke University. Submitted for publication. arXiv:2101.03408.