Boosted Multinomial Logit Model
September 10, 2012
Abstract
Understanding market demand is important for managing pricing strategies. Motivated by the
need to empirically estimate demand functions, we propose applying boosting to
the class of attraction-based demand models, which is popular in the pricing-optimization
literature. In the proposed approach, the utility of a product is specified semiparametrically,
either by a varying-coefficient linear model or by a partially linear model. We formulate the
multinomial likelihood and apply gradient boosting to maximize it. Several
attraction functions, including the multinomial logit (MNL), linear, and constant elasticity of
substitution (CES) attraction functions, are compared empirically, and the implications of the
model estimates for pricing are discussed.
KEY WORDS: Boosting; functional gradient descent; tree-based regression; varying-
coefficient model.
1 Introduction
Building a reliable demand model is critical for pricing and portfolio management. Such a
model should account for customer preferences over attributes, price sensitivity, and
competitive effects, and it should combine strong predictive power with flexibility.
In our application, we use aggregated mobile PC sales data from a third-party marketing firm.
The data include HP and Compaq sales as well as competitors' sales. Each row of the
data records brand, country, region, attributes, period, channel, price, and sales volume.
The sales data are large-scale, with thousands of rows and many columns, spanning multiple
time periods and regions. We therefore face a high-dimensional prediction problem, and need to
allow price sensitivity to vary with time, region, and configuration.
Broadly speaking, there are two ways of building demand models: modeling sales volume
directly or modeling customer preference. We focus on modeling customer valuation/preference
using discrete choice models (DCMs). In a DCM, we specify the choice set, the set of products
the customers choose from. Each product in the choice set has a utility, which depends on
brand, attributes, price and other factors. The customer purchases the product with the
highest utility.
There are two complications in specifying the utility function: nonlinearity and
non-additivity. By nonlinearity we mean that utility need not change linearly in an
attribute; for example, an upgrade from 2GB to 4GB of RAM may raise utility more than an
equal-sized upgrade at the high end. Further, the attribute effects are non-additive: for
example, the utility difference between 4GB and 2GB of RAM may differ across brands, or when
combined with different CPUs. Our model therefore needs to be flexible. We achieve this with
a semiparametric DCM, which models product utility without specifying a functional form. To
flexibly model the utility functions, we propose a novel boosted-tree-based
varying-coefficient DCM. Assume that we have a single market with M products. In this
formulation, both the intercept and the slope of the utility function are functions of a
large number of mixed-type variables, which makes the estimation problem difficult.
To estimate the nonparametric utility function described above, we use boosted trees. The
tree-based approach uses a heuristic algorithm to partition the products into groups that are
homogeneous in their utility functions: we want the utility functions within a group to be as
similar as possible, and those between groups to differ. For example, a simple tree with four
terminal nodes groups products by utility function, with the groups formed by splitting on
the features. The boosting approach improves on a single tree by repeatedly generating trees
that model the "residuals" from the previous iteration. The boosting result is thus a sum of
trees; equivalently, boosting is a way of maximizing a likelihood that contains unknown
functions.
Other uses of the model include feature importance plots and brand-level utility
functions. The feature importance plot tells us which features are important in determining
the utility function, and brand-level utility functions give a sense of brand value and price
sensitivity within each brand.
The remainder of the paper proceeds as follows.
2 Literature Review
We discuss two streams of literature that are relevant to this research: multinomial logit
demand modeling, and boosting.
Most demand research is built upon a structure for how demand responds to
prices, and this paper is no exception. The multinomial logit (MNL) discrete choice model
has been particularly popular since it was first proposed by McFadden (?), because of its
appealing theoretical properties (consistency with random utility maximization) and its ease
of application in empirical studies. It has received significant attention from researchers
in economics, marketing, transportation science and operations management, and it has
motivated extensive theoretical research and empirical validation across a wide range of
applications. The MNL is a special case of the class of attraction models proposed by
Luce (?). See also Ben-Akiva and Lerman (?) for a thorough review of choice models.
In most of the literature (for example, Berry 1994 and Kamakura, Kim and Lee 1996), the
utility function is assumed to be stationary and linear in product attributes. In practice,
these assumptions seldom hold. Wang and Hastie (2012) address both issues: time-varying
coefficients are used to incorporate non-stationary demand, and a non-parametric approach is
used to specify the structure of the utility function. In particular, a modified tree-based
regression method is used to discover nonlinear dependencies on, and interaction effects
between, product attributes in an MNL framework.
(add boosting literature here)
The main contribution of this paper is to apply boosting to tree-based and time-varying
coefficient MNL demand models. From a modeling perspective, these two models successfully
address two of the major criticisms of MNL models. However, both are challenging to estimate
empirically, because the search space of potential specifications is large, with little known
structure to exploit. For example, the standard binary splitting method for estimating the
tree-based MNL model is path dependent, and can result in sub-optimal estimation. Boosting
alleviates some of these problems. In empirical tests on field data, boosting can improve
out-of-sample performance by x%.
3 Boosted Multinomial Logit Model
In this exposition, consider a single market with K products in competition. The market
could be a mobile computer market in a geographical location over a period of time, or an
online market for certain non-perishable goods. The notion of a product could potentially
include the "non-purchase" option. Denote the sales volume of the i-th product as n_i, where
i = 1, ..., K. The total market size is denoted as N = \sum_{i=1}^{K} n_i. Further, let
(s_i', x_i', n_i) denote the vector of measurements on product i. Here,
s_i = (s_{i1}, s_{i2}, ..., s_{iq})' consists of product attributes, brand and channel
information, whose effect on utility has an unknown functional form. The vector of linear
predictors is x_i = (x_{i1}, x_{i2}, ..., x_{ip})', often consisting
of price or other predictors with linear effects.
The utility of a product captures its overall attractiveness given attributes, brand, price
and factors relating to customers' shopping experience. The utility is often positively
correlated with product attributes, but is adversely affected by price. The utility of the
i-th product is denoted as

u_i = f_i + \varepsilon_i,

where f_i is a deterministic function of s_i and x_i, and \varepsilon_i denotes the random
noise term not captured by the auxiliary variables, arising from idiosyncratic errors in
customers' decision making. If we assume that the \varepsilon_i's are independent and
identically distributed with the standard Gumbel distribution, then the utility maximization
principle leads to the following expression for the choice probability of the i-th product,

p_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)}. (1)
Further, we assume the vector of sales volumes (n_1, ..., n_K) follows a multinomial
distribution with N trials and probabilities (p_1, ..., p_K) defined by (1). The resulting
model is called the multinomial logit (MNL) model. The attraction function in the MNL model
is exponential, which can be generalized to an arbitrary attraction function. Let g(·) denote
the attraction function generically; it is a known monotone function taking values in
(0, +\infty). Under attraction function g(·), the choice probability of product i is

p_i = \frac{g(f_i)}{\sum_{k=1}^{K} g(f_k)}. (2)
To estimate the utility functions, we can maximize the data likelihood or, equivalently,
minimize -2 log L, where L denotes the multinomial likelihood function. Without causing
much confusion, we will work with J(f) defined below, which differs from -2 log L by a
constant,

J(f) = -2 \sum_{i=1}^{K} n_i \log(g(f_i)) + 2N \log\left\{\sum_{i=1}^{K} g(f_i)\right\}, (3)

where f = (f_1, ..., f_K)' denotes the vector of product utilities. The model can also be
regarded as a Poisson regression model conditional on the total sales volume in a
consideration set, also known as conditional Poisson regression. The model is conceptually
similar to the stratified Cox proportional hazards model with an offset term that depends on
the surviving cases in the corresponding stratum (Cox 1975, Hosmer and Lemeshow 1999).
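To make (3) concrete, the objective can be evaluated directly from the utilities and sales volumes. This small sketch (our own illustration, with g = exp as the default) makes the "up to a constant" claim checkable: for g = exp, J(f) equals -2 \sum_i n_i \log p_i exactly:

```python
import math

def deviance_J(f, n, g=math.exp):
    """Objective (3): J(f) = -2 * sum_i n_i * log g(f_i)
    + 2 * N * log sum_i g(f_i), which differs from -2 log L of the
    multinomial likelihood only by a constant."""
    N = sum(n)
    a = [g(fi) for fi in f]
    return (-2.0 * sum(ni * math.log(ai) for ni, ai in zip(n, a))
            + 2.0 * N * math.log(sum(a)))

f = [0.4, 0.1, -0.5]
n = [30.0, 20.0, 10.0]
J_val = deviance_J(f, n)
```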
We consider two semiparametric models of utility: the functional-coefficient model and
partially linear model, and refer to the resulting choice models as functional-coefficient and
partially linear choice models, respectively.
Functional-coefficient MNL
In the functional-coefficient MNL model, we specify the utility function as

f_i = x_i'\beta(s_i), (4)

which is linear in x with coefficients depending on s. The function reduces to a globally
linear function once we remove the dependence of the coefficients on s, which corresponds to
a linear MNL model. In the simple case with x_i = (1, x_i)', where x_i is the price of
product i, the utility function becomes \beta_0(s_i) + \beta_1(s_i) x_i. Here, both the base
utility and the price elasticity depend on s_i, and the price coefficient is constant when
s_i is fixed.
Our estimation of the coefficient surface \beta(s_i) minimizes the following
-2 log-likelihood by boosted varying-coefficient trees:

J(f) = -2 \sum_{i=1}^{K} n_i \log(g(x_i'\beta(s_i))) + 2N \log\left\{\sum_{i=1}^{K} g(x_i'\beta(s_i))\right\}.

The technical details for growing varying-coefficient trees can be found in Wang and Hastie
(2012), and are briefly reviewed in Section 4.1 of the current paper. As shown in Algorithm 1,
our proposed method starts with an estimate of the constant-coefficient linear MNL model,
iteratively constructs varying-coefficient trees, and then fits linear MNL models using the
tree-generated bases. Each incremental tree is grown to best predict the pseudo
observations \xi_i, which represent the negative gradient for minimizing J(f).
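For the MNL attraction g = exp, the pseudo observations have a simple closed form, since \partial J / \partial f_i = -2 n_i + 2 N p_i. A small sketch in our own notation, checkable by finite differences:

```python
import math

def mnl_pseudo_obs(f, n):
    """Negative gradient of (3) under g = exp:
    xi_i = -dJ/df_i = 2 * (n_i - N * p_i)."""
    N = sum(n)
    m = max(f)
    e = [math.exp(fi - m) for fi in f]    # stabilized softmax
    s = sum(e)
    return [2.0 * (ni - N * ei / s) for ni, ei in zip(n, e)]

f0 = [0.2, -0.1, 0.4]
n0 = [12.0, 5.0, 8.0]
xi = mnl_pseudo_obs(f0, n0)
```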
The estimation of the linear MNL model involves iteratively reweighted least squares,
or IRLS (Green 1984). We take the initial estimate as an example. Let \beta^{(b-1)} denote
the estimate from the (b-1)-th iteration, and p_i^{(b-1)} denote the fitted choice
probability. Next, we construct the pseudo response

y_i^{(b)} = x_i'\beta^{(b-1)} + \frac{n_i/N - p_i^{(b-1)}}{p_i^{(b-1)}(1 - p_i^{(b-1)})},

and fit y_i^{(b)} on x_i using weighted least squares with observation weights
p_i^{(b-1)}(1 - p_i^{(b-1)}). This procedure is iterated until convergence.
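One such IRLS step can be sketched as follows (a minimal numpy illustration; the names and toy data are ours). At a fixed point of this update, the weighted normal equations reduce to the MNL score equation \sum_i x_i (n_i/N - p_i) = 0, so a vector \beta that already solves the score equation is returned unchanged:

```python
import numpy as np

def irls_step(X, n, beta):
    """One IRLS update for the linear MNL model: form the pseudo response
    y_i = x_i' beta + (n_i/N - p_i) / (p_i * (1 - p_i)) and solve a
    weighted least squares with weights w_i = p_i * (1 - p_i)."""
    N = n.sum()
    f = X @ beta
    p = np.exp(f - f.max())
    p /= p.sum()                      # fitted choice probabilities
    w = p * (1.0 - p)
    y = f + (n / N - p) / w
    XtW = X.T * w                     # X' W without forming diag(W)
    return np.linalg.solve(XtW @ X, XtW @ y)

# toy market: utilities f = X beta, sales exactly proportional to p_i
X = np.array([[1.0, 0.2], [0.5, 1.0], [0.2, 0.5], [0.8, 0.9]])
beta_true = np.array([0.3, -0.2])
f = X @ beta_true
p = np.exp(f) / np.exp(f).sum()
n = 100.0 * p                         # fractional counts, for illustration only
```

Because n/N equals the fitted probabilities exactly in this toy data, beta_true is a fixed point of the update.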
Algorithm 1 Boosted Functional-coefficient MNL.

Require: B – the number of boosting steps, \nu – the "learning rate", and M – the number of
terminal nodes for a single tree.

1. Start with the naive fit f_i^{(0)} = x_i'\beta, where \beta is estimated via iteratively
reweighted least squares (IRLS) under a linear MNL model.

2. For b = 1, ..., B, repeat:

(a) Compute the "pseudo observations": \xi_i = -\partial J / \partial f_i evaluated at
f = f^{(b-1)}.

(b) Fit \xi_i on s_i and x_i using the "PartReg" algorithm to obtain partitions
(C_1^{(b)}, ..., C_M^{(b)}).

(c) Let z_i = (I(s_i \in C_1^{(b)}), ..., I(s_i \in C_M^{(b)}),
x_i I(s_i \in C_1^{(b)}), ..., x_i I(s_i \in C_M^{(b)}))', and apply IRLS to estimate
\gamma^{(b)} by minimizing

J(\gamma^{(b)}) = -2 \sum_{i=1}^{K} n_i \log(g(f_i^{(b-1)} + z_i'\gamma^{(b)})) + 2N \log\left\{\sum_{i=1}^{K} g(f_i^{(b-1)} + z_i'\gamma^{(b)})\right\},

and denote the estimated vector as
\gamma^{(b)} = (\gamma_{01}^{(b)}, ..., \gamma_{0M}^{(b)}, \gamma_{11}^{(b)}, ..., \gamma_{1M}^{(b)})'.

(d) Update the fitted model by
f_i^{(b)} = f_i^{(b-1)} + \nu \sum_{m=1}^{M} \{\gamma_{0m}^{(b)} + \gamma_{1m}^{(b)} x_i\} I(s_i \in C_m^{(b)}).

3. Output the fitted model f = f^{(B)}.
Partially Linear MNL
In the partially linear choice model, we specify the utility function as

f_i = \beta_0(s_i) + x_i'\beta, (5)

which consists of a nonparametric term \beta_0(s_i) and a linear term x_i'\beta. If the
linear predictors include price only, the resulting model consists of a base utility that is
a nonparametric function of attributes, and a globally constant price elasticity. In a
refined model, interactions between price and other factors such as brand or product category
can be incorporated into the design matrix of the linear term, to allow the price coefficient
to vary along certain dimensions. Another interesting special case of the partially linear
MNL is the nonparametric MNL model, obtained by removing the linear predictors x_i and
fitting only a nonparametric utility function. All these special cases can be estimated under
the same boosted-tree framework.
The boosting algorithm for the partially linear model is explained in Algorithm 2. Here,
the varying intercept β0(si) is initially fitted with a constant value, and then approximated
by piecewise constant trees using the CART algorithm. At every stage, the search for
optimal partitioning in CART and the estimation of β are conducted sequentially, instead
of simultaneously. Specifically, we search for the optimal tree split for predicting the pseudo
residuals, ignoring the linear predictors, and then fit a linear MNL model using the tree
grouping and the original predictors xi jointly.
4 Computational Details
4.1 Tree-based Varying-coefficient Regression
The estimation of the boosted varying-coefficient MNL model involves iteratively applying
the "PartReg" algorithm for constructing tree-based regressions. Let (s_i', x_i', y_i) denote
the measurements on subject i, where i = 1, ..., n. Here, the varying-coefficient variable,
or partition variable, is s_i = (s_{i1}, s_{i2}, ..., s_{iq})', and the regression variable
is x_i = (x_{i1}, x_{i2}, ..., x_{ip})'.
Algorithm 2 Boosted Partially Linear MNL model.

Require: B – the number of boosting steps, \nu – the "learning rate", and M – the number of
terminal nodes for a single tree.

1. Start with the naive fit f_i^{(0)} = \beta_0 + x_i'\beta, where \beta_0 and \beta are
estimated via the Newton-Raphson algorithm or IRLS.

2. For b = 1, ..., B, repeat:

(a) Compute the "pseudo observations": \xi_i = -\partial J / \partial f_i evaluated at
f = f^{(b-1)}.

(b) Fit \xi_i on s_i using the CART algorithm (Breiman et al. 1984) to obtain

\hat{\xi}_i = \sum_{m=1}^{M} \xi_m^{(b)} I(s_i \in C_m^{(b)}).

(c) Let z_i = (I(s_i \in C_1^{(b)}), ..., I(s_i \in C_M^{(b)}))', and apply IRLS to minimize

J(\gamma_0, \gamma) = -2 \sum_{i=1}^{K} n_i \log(g(f_i^{(b-1)} + z_i'\gamma_0 + x_i'\gamma)) + 2N \log\left\{\sum_{i=1}^{K} g(f_i^{(b-1)} + z_i'\gamma_0 + x_i'\gamma)\right\},

and denote the estimates as (\gamma_{01}^{(b)}, ..., \gamma_{0M}^{(b)}, \gamma^{(b)}).

(d) Update the fitted regression function by
f_i^{(b)} = f_i^{(b-1)} + \nu \left\{\sum_{m=1}^{M} \gamma_{0m}^{(b)} I(s_i \in C_m^{(b)}) + x_i'\gamma^{(b)}\right\}.

3. Output the fitted model f = f^{(B)}.
The two sets of variables are allowed to have overlaps. The first element of xi is set to be 1
if we allow for an intercept term.
Let \{C_m\}_{m=1}^{M} denote a partition of the space R^q satisfying
C_m \cap C_{m'} = \emptyset for any m \neq m', and \cup_{m=1}^{M} C_m = R^q. The set C_m is
referred to as a terminal node or leaf node, which defines the ultimate grouping of the
observations. Here, M denotes the number of partitions; M is fixed when the trees are used as
base learners in boosting. The tree-based varying-coefficient model is

y_i = \sum_{m=1}^{M} x_i'\beta_m I(s_i \in C_m) + \varepsilon_i, (6)

where I(·) denotes the indicator function, with I(c) = 1 if event c is true and zero
otherwise. The error terms \varepsilon_i are assumed to have zero mean and homogeneous
variance \sigma^2.
The least squares criterion for (6) leads to the following estimator of (C_m, \beta_m), as
minimizers of the sum of squared errors (SSE),

(\hat{C}_m, \hat{\beta}_m) = \arg\min_{(C_m, \beta_m)} \sum_{i=1}^{n} \left(y_i - \sum_{m=1}^{M} x_i'\beta_m I(s_i \in C_m)\right)^2 = \arg\min_{(C_m, \beta_m)} \sum_{i=1}^{n} \sum_{m=1}^{M} (y_i - x_i'\beta_m)^2 I(s_i \in C_m). (7)
In the above, the estimation of \beta_m is nested within that of the partitions. We take the
least squares estimator,

\hat{\beta}_m(C_m) = \arg\min_{\beta_m} \sum_{i=1}^{n} (y_i - x_i'\beta_m)^2 I(s_i \in C_m),

in which the minimization criterion is based only on the observations in node C_m. Thus, we
can "profile" out the regression parameters \beta_m and have

\hat{C}_m = \arg\min_{C_m} \sum_{m=1}^{M} SSE(C_m), (8)

where SSE(C_m) := \sum_{i=1}^{n} (y_i - x_i'\hat{\beta}_m(C_m))^2 I(s_i \in C_m).
The sets \{C_m\}_{m=1}^{M} comprise an optimal partition of the space spanned by the
partitioning variables s, where "optimality" is with respect to the least squares criterion.
The search for the optimal partition is of combinatorial complexity, and it is very
challenging to find the globally optimal partition even for a moderately sized dataset. The
tree-based
algorithm is an approximate solution to the optimal partitioning and scalable to large-scale
datasets. We restrict our discussions to binary trees that employ “horizontal” or “vertical”
partitions of the feature space and are stage-wise optimal.
In Algorithm 3, we cycle through the partition variables at each iteration and consider
all possible binary splits based on each variable. The candidate split depends on the type
of the variable. For an ordinal or a continuous variable, we sort the distinct values of the
variable, and place “cuts” between any two adjacent values to form partitions.
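For instance, the candidate cuts for an ordinal or continuous variable can be enumerated as midpoints between adjacent distinct values (a small sketch in our own notation):

```python
def candidate_cuts(values):
    """All candidate binary splits for an ordinal/continuous variable:
    one cut between each pair of adjacent distinct values, so L distinct
    values yield L - 1 candidates."""
    u = sorted(set(values))
    return [0.5 * (a + b) for a, b in zip(u, u[1:])]

cuts = candidate_cuts([3, 1, 2, 2, 5])   # distinct values 1, 2, 3, 5
```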
Splitting based on an unordered categorical variable is challenging, especially when there
are many categories. We propose to order the categories and treat the variable as ordinal.
The ordering approach is much faster than exhaustive search, and performs comparably to the
more complex search algorithms when combined with boosting. The category
ordering approach is similar to CART (Breiman et al. 1984). In a piecewise constant model
like CART, the categories are ordered by the mean response in each category, and then treated
as ordinal (Hastie et al. 2009). This reduces the computational complexity from exponential
to linear. The simplification was justified by Fisher (1958) in an optimal splitting setup,
and is exact for a continuous-response regression problem where the mean is the modeling
target. In the partitioned regression context, let \hat{\beta}_l denote the least squares
estimate of \beta based on the observations in the l-th category, so the fitted model in the
l-th category is x'\hat{\beta}_l. A strict ordering of the hyperplanes x'\hat{\beta}_l may
not exist, so we suggest an approximate solution: we order the L categories by
\bar{x}'\hat{\beta}_l, where \bar{x} is the mean vector of the x_i in the current node, and
then treat the categorical variable as ordinal. This approximation works well when the fitted
models are clearly separated, but it is not guaranteed to provide an optimal split at the
current stage.
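The ordering step can be sketched as follows (our own numpy illustration of the \bar{x}'\hat{\beta}_l scoring, not the paper's exact code). The toy data give each category a flat within-category fit, so the expected order is easy to read off:

```python
import numpy as np

def order_categories(X, y, cat):
    """Order the categories of a nominal variable by xbar' beta_l, where
    beta_l is the within-category least squares fit and xbar is the mean
    regressor vector in the current node; the ordered categories can then
    be split as if ordinal (L - 1 cuts instead of 2^(L-1) - 1)."""
    xbar = X.mean(axis=0)
    score = {}
    for level in sorted(set(cat)):
        idx = [i for i, c in enumerate(cat) if c == level]
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        score[level] = float(xbar @ beta)
    return sorted(score, key=score.get)

X = np.array([[1.0, 0.0], [1.0, 1.0]] * 3)       # intercept and one regressor
y = np.array([2.0, 2.0, 5.0, 5.0, 1.0, 1.0])     # flat fits at 2, 5, 1
cat = ["a", "a", "b", "b", "c", "c"]
ordered = order_categories(X, y, cat)
```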
4.2 Split Selection
The partitioning algorithms CART and "PartReg" aim at achieving an optimal reduction of
complexity at each stage. In exhaustive search, the number of binary partitions for an ordinal
Algorithm 3 "PartReg" Algorithm (Breadth-first search).

Require: n_0 – the minimum number of observations in a terminal node, and M – the desired
number of terminal nodes.

1. Initialize the current number of terminal nodes l = 1 and C_1 = R^q.

2. While l < M, loop:

(a) For m = 1 to l and j = 1 to q, repeat:

i. Consider all partitions of C_m into C_{m,L} and C_{m,R} based on the j-th variable. The
maximum reduction in SSE is

\Delta SSE_{m,j} = \max\{SSE(C_m) - SSE(C_{m,L}) - SSE(C_{m,R})\}, (9)

where the maximum is taken over all possible partitions based on the j-th variable such that
\min\{\#C_{m,L}, \#C_{m,R}\} \geq n_0, and \#C denotes the cardinality of set C.

ii. Let \Delta SSE_l = \max_m \max_j \Delta SSE_{m,j}, namely the maximum reduction in the
sum of squared errors among all candidate splits in all terminal nodes at the current stage.

(b) Let \Delta SSE_{m^*,j^*} = \Delta SSE_l, i.e., the j^*-th variable on the m^*-th terminal
node provides the optimal partition. Split the m^*-th terminal node according to the optimal
partitioning criterion and increase l by 1.
variable with L categories is L - 1, while it is 2^{L-1} - 1 for a nominal categorical
variable. Thus, the number of possible partitions for a categorical variable grows
exponentially, which greatly enlarges the search space and causes tree splitting to favor
categorical variables. Our varying-coefficient tree algorithm uses a response-driven ordering
of the categories, which alleviates this unfair split selection to some extent. But bias
remains with the current method, arising from the following aspects:
1. The response-driven ordering of the nominal categories can cause bias to split selection.
2. The number of categories is unequal among various variables.
Thus, direct use of the tree or boosting algorithm for inference, especially on variable
importance, should be treated with caution. To further reduce the bias in split selection, we
adopt a pretest procedure using the analysis of covariance (ANCOVA). The use of
significance-testing-based procedures in decision trees dates back to the CHAID technique
(Kass 1980), in which
a Bonferroni factor was introduced in classification based on multi-way splits. A number of
algorithms explicitly dealt with split selection in classification or regression trees,
including the FACT (Loh and Vanichsetakul 1988), QUEST (Loh and Shih 1997), and GUIDE
(Loh 2002) algorithms, among others. Hothorn et al. (2006) propose to use permutation tests
to select the split variable and a multiple testing procedure for testing the global null
hypothesis that none of the predictors is significant. In the context of boosting, the recent
Hofner et al.
(2011) paper proposes to use component-wise learners with comparable degrees of freedom,
where the degrees of freedom are made comparable via a ridge penalty. Their simulations show
satisfactory results under the null model, in which the response variable is independent of
the covariates.
5 Mobile Computer Sales in Australia
The proposed semiparametric MNL models have been applied to aggregated monthly mobile
computer sales data in Australia, obtained from a third-party marketing firm. The dataset
contains the sales volumes of various categories of mobile computers, including laptops,
netbooks, hybrid tablets, ultra-mobile personal computers and so on. The monthly sales data
run from October 2010 to March 2011, and cover all mobile computer brands on the Australian
market. Every row of the data set contains the detailed configuration of a product, its sales
volume, and the revenue generated from selling the product in a given month and state. The
average selling price is derived as the ratio of revenue to sales volume.
The data contain 6 months of mobile computer sales in 5 Australian states. A choice set is
defined as the combination of a month and a state, leading to 30 choice sets. A choice set
contains approximately 100 to 200 competing products. Other definitions of a choice set have
also been attempted, but for brevity we only present results under this definition. We
randomly select 25 choice sets as training data and the remaining 5 as test data. In this
paper, we present model estimates with price residuals, rather than the original price, as
the linear predictor; without causing much confusion, we denote the residual for the i-th
observation as x_i. The price residuals are the residuals from a linear regression of price
on product attributes and brand. The residuals are uncorrelated with the product attributes,
and a demand model using them as input usually leads to higher estimated price sensitivities.
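The price-residual construction is just a least squares projection, so the residuals are orthogonal to every column of the attribute design matrix by construction. A numpy sketch with a hypothetical attribute matrix A (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical design: intercept plus two attribute columns
A = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
price = A @ np.array([5.0, 1.2, -0.7]) + rng.normal(scale=0.5, size=50)

# regress price on the attribute/brand design A and keep the residuals
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
x_resid = price - A @ coef
```

Orthogonality to the columns of A is what makes the residuals "uncorrelated with product attributes" in the text.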
We have considered five specifications of the mean utility function: two essentially linear
specifications and three nonparametric or semiparametric models. The two intrinsically linear
choice models are estimated using the elastic net (Zou and Hastie 2005), which is explained
in detail below, and the remaining models are estimated via boosted trees. The five models
are listed below:

M1. Varying-coefficient MNL model:

f_i = x_i'\beta(s_i) = \beta_0(s_i) + \beta_1(s_i) x_i. (10)

Here, the utility is a linear function of the price residual with coefficients depending on
attributes, brand and sales channel. The multivariate coefficient surface \beta(s_i) is of
estimation interest.
M2. Partially linear MNL model:

f_i = \beta_0(s_i) + x_i \beta_1.

The utility consists of a base utility, which is a nonparametric function of product
attributes and reporting channel, plus a linear effect of the price residual. This model
assumes a constant price effect on the utility.
M3. Nonparametric MNL model:

f_i = \beta(s_i, x_i).

Here, the utility is a nonparametric function of the entire set of predictors. Customers'
sensitivity to price is implicit rather than explicitly specified.
M4. Linear-MNL model. The coefficient β(si) in (10) is approximated by a linear function
of si, and the model is estimated using penalized iteratively reweighted least squares
(IRLS).
M5. Quadratic-MNL model. We approximate the coefficient β(si) in (10) by a quadratic
function of si with first-order interactions among the elements of si. The model is
again estimated using penalized IRLS.
Elastic net varying-coefficient MNL
We take the quadratic MNL as an example to explain the penalized IRLS algorithm for MNL
models. The first step is to generate the feature vector: we first create dummy variables
from the categorical variables, and then generate the design matrix Z by including the
quadratic effects of individual variables and the first-order interaction effects between
pairs of variables. We denote the i-th row of Z by z_i', and then specify \beta_0(s_i) as
z_i'\gamma_0 and \beta_1(s_i) as z_i'\gamma_1. Next, we estimate the following penalized
generalized linear model:

(\hat{\gamma}_0, \hat{\gamma}_1) = \arg\min_{\gamma_0, \gamma_1} -2 \sum_{i=1}^{K} n_i \log(g(z_i'\gamma_0 + x_i z_i'\gamma_1)) + 2N \log\left\{\sum_{i=1}^{K} g(z_i'\gamma_0 + x_i z_i'\gamma_1)\right\} + \lambda \left\{\alpha \sum_{i,j} |\gamma_{ij}| + \frac{1-\alpha}{2} \sum_{i,j} \gamma_{ij}^2\right\}. (11)

In the penalized regression above, the penalty is a convex combination of the L1 and L2
penalties, with tuning parameter \alpha controlling the relative weight of each. Model (11)
reduces to ridge regression if we set \alpha = 0, and to LASSO regression if \alpha = 1.
The penalized linear MNL model (11) can be estimated by a penalized IRLS algorithm
(Friedman et al. 2010). Let \gamma_0^{(b-1)} and \gamma_1^{(b-1)} denote the estimates from
the (b-1)-th iteration, and p_i^{(b-1)} denote the fitted probabilities. In the next
iteration, we construct the pseudo response

y_i^{(b)} = z_i'\gamma_0^{(b-1)} + x_i z_i'\gamma_1^{(b-1)} + \frac{n_i/N - p_i^{(b-1)}}{p_i^{(b-1)}(1 - p_i^{(b-1)})},

and fit y_i^{(b)} on (z_i', x_i z_i')' using weights p_i^{(b-1)}(1 - p_i^{(b-1)}) and the
elastic net penalty. The elastic net penalized weighted least squares can be implemented with
the glmnet package in R, and the procedure is iterated until convergence.
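As a sanity check on the penalty in (11), the ridge limit (\alpha = 0) of each penalized weighted least squares step has a closed form; the numpy sketch below illustrates one such inner step (names and toy data ours; glmnet instead solves general \alpha by coordinate descent, and we fold the paper's 1/2 factor into lam):

```python
import numpy as np

def ridge_wls(Z, y, w, lam):
    """Penalized weighted least squares in the ridge limit (alpha = 0):
    minimizes sum_i w_i * (y_i - z_i' gamma)^2 + lam * ||gamma||^2,
    solved via (Z'WZ + lam I) gamma = Z'Wy."""
    ZtW = Z.T * w
    p = Z.shape[1]
    return np.linalg.solve(ZtW @ Z + lam * np.eye(p), ZtW @ y)

Z = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0]])
gamma_true = np.array([0.4, -0.3])
y = Z @ gamma_true                    # exactly linear pseudo response
w = np.array([0.2, 0.3, 0.3, 0.2])    # e.g. p_i * (1 - p_i) weights
g0 = ridge_wls(Z, y, w, lam=0.0)      # unpenalized: recovers gamma_true
g1 = ridge_wls(Z, y, w, lam=5.0)      # shrunk toward zero
```

With lam = 0 the step is ordinary weighted least squares; increasing lam shrinks the coefficient norm, which is the stabilizing role the L2 part of the elastic net plays.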
The three nonparametric or semiparametric models are estimated via boosted trees: the
varying-coefficient MNL model with Algorithm 1, and the remaining two models with Algorithm 2
or its variant. The base learner is an M-node tree with M = 4, and the learning rate is
\nu = 0.1. In Figure 1, we plot the training- and test-sample R^2 against the tuning
parameter for models M1-M3 and M5 (\alpha = 1). For the three models estimated with boosted
trees, R^2 increases dramatically over the first 200 iterations, but the improvement slows as
the number of iterations grows. We do not observe significant overfitting even when the
number of boosting iterations becomes much larger.
[Figure 1 here: four panels of training and test R^2 curves, plotted over 1000 boosting
iterations for the boosted models and over log(lambda) for glmnet with alpha = 1.]

Figure 1: The training and test sample R^2, plotted against tuning parameters, under the
varying-coefficient MNL (top left), partially linear MNL (top right), nonparametric MNL
(bottom left) and quadratic MNL model with LASSO penalty (bottom right).

The five MNL models are compared in Table 1 in terms of model implications, predictive
performance and time spent. The varying-coefficient MNL model has the best predictive
performance among all five models, followed by the penalized quadratic MNL models. The
nonparametric MNL model performs worse than the other two semiparametric models, despite the
fact that it includes them both as special cases. One possible explanation is that the
tree-based method fails to learn variable interactions, especially the interaction between
x_i and s_i. Unfortunately, the varying-coefficient MNL takes the longest to fit when no
significance test is performed. The pretest-based approach speeds up the boosting algorithm,
but slightly degrades model performance. Both the partially linear and nonparametric MNLs are
much faster than the varying-coefficient MNL, owing to the use of the built-in rpart function
instead of a user-defined tree-growing algorithm.
Table 1: Comparison of various versions of MNL models (i.e., M1-M5), including model
specification, estimation method, predictive performance and time consumption.

Utility specification  | Estimation                       | Training R^2 | Test R^2    | Time (min) | Interactions among attributes
Linear (α = 1)         | penalized IRLS                   | .399         | .357        | .17        | X
Linear (α = 1/2)       | penalized IRLS                   | .419         | .379        | .48        | X
Quadratic (α = 1)      | penalized IRLS                   | .582         | .499        | 76.91      | 1st-order
Quadratic (α = 1/2)    | penalized IRLS                   | .554         | .53         | 52.78      | 1st-order
Varying-coef.          | boosted trees (M = 4, B = 1000)  | .734         | .697        | 186.47     |
Partially linear       | boosted trees (M = 4, B = 1000)  | .493 (.014)  | .455 (.023) | 24.63      | (M-2)th-order
Nonparametric          | boosted trees (M = 4, B = 1000)  | .52 (.017)   | .502 (.053) | 23.43      |
6 Discussion
Acknowledgements
References
Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression
Trees. Wadsworth, New York.
Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–276.
Fisher, W. (1958). On grouping for maximum homogeneity. Journal of the American Sta-
tistical Association 53 (284), 789–798.
Friedman, J. H., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software 33 (1), 1–22.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood esti-
mation, and some robust and resistant alternatives. Journal of the Royal Statistical
Society, Series B 46 (2), 149–192.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Hofner, B., T. Hothorn, T. Kneib, and M. Schmid (2011). A framework for unbiased model
selection based on boosting. Journal of Computational and Graphical Statistics 20 (4),
956–971.
Hosmer, D. W. J. and S. Lemeshow (1999). Applied survival analysis: regression modeling
of time to event data. John Wiley & Sons.
Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A condi-
tional inference framework. Journal of Computational and Graphical Statistics 15 (3),
651–674.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categor-
ical data. Applied Statistics 29, 119–127.
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interaction de-
tection. Statistica Sinica 12, 361–386.
Loh, W.-Y. and Y.-S. Shih (1997). Split selection methods for classification trees. Statistica
Sinica 7, 815–840.
Loh, W.-Y. and N. Vanichsetakul (1988). Tree-structured classification via generalized
discriminant analysis (with discussion). Journal of the American Statistical Associa-
tion 83, 715–728.
Wang, J. C. and T. Hastie (2012). Boosted varying-coefficient regression models for prod-
uct demand prediction. Under revision.
Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B 67 (2), 301–320.