Latent Group Structures with Heterogeneous
Distributions: Identification and Estimation∗
Heng Chen† Xuan Leng‡ Wendun Wang§
January 21, 2019
Abstract
Panel data are often characterized by cross-sectional heterogeneity, and a flexible yet
parsimonious way of modeling heterogeneity is to cluster units into groups. A group pat-
tern of heterogeneity may exist not only in the mean but also in the other characteristics
of the distribution. To identify latent groups and recover the heterogeneous distribu-
tion, we propose a clustering method based on composite quantile regressions. We show
that combining the strength across multiple panel quantile regression models improves
the precision of the group membership estimates if the group structure is common across
quantiles. Asymptotic theories for the proposed estimators are established, while their
finite-sample performance is demonstrated by simulations. We finally apply the proposed
methods to analyze the cross-country output effect of infrastructure capital.
Keywords: Distributional heterogeneity, cluster analysis, panel structure model, com-
posite quantile regression, infrastructure effect
JEL Classification: C31, C33, C38, H54
∗We thank the participants of the seminars at the Nanyang Technological University, Erasmus University Rotterdam, 4th Conference of the International Society for Nonparametric Statistics in Salerno, and 4th Dongbei Econometrics Workshop in Dalian for their useful discussions and constructive comments. Wang acknowledges the financial support of the EUR Fellowship. Any remaining errors are ours.
†[email protected]. Currency Department, Bank of Canada.
‡[email protected]. Econometric Institute, Erasmus University Rotterdam.
§[email protected]. Econometric Institute, Erasmus University Rotterdam and Tinbergen Institute
1 Introduction
In many applications using panel data, the effect of the covariates on the dependent variable
differs across individual units, and such individual heterogeneity may be conveniently charac-
terized by a group pattern. For example, Hahn and Moon (2010) provided a theoretical jus-
tification of the group structure in game theory and macroeconomic models in which multiple
equilibria can appear. Bonhomme and Manresa (2015a) showed in a study of the democracy–
income relationship that countries can be separated into groups with different democratic tran-
sition paths. Ando and Bai (2016) found evidence of a group pattern of heterogeneity in the
styles of US mutual funds and performance of asset returns in Mainland China. Bonhomme
et al. (2017a) argued that the group structure can be a good discrete approximation even if
individual heterogeneity is continuous.
When we describe the features of a group, the mean statistics do not provide complete
information, and it is often of great interest and necessity to unveil the entire distribution of
the group, including such features as the volatility of asset returns of a group and the extent
to which different groups of firms behave in extreme cases (e.g., during the global financial
crisis and the Great Moderation). Such distributional heterogeneity (i.e., the difference in
the distribution of groups) is thus more general and informative than mean heterogeneity.
However, existing heterogeneous panel data models primarily focus on the heterogeneity of the
(conditional) mean effect, and thus do not capture any distributional heterogeneity.
This paper presents a new model and a new estimation procedure that allow us to examine
the distributional heterogeneity of panel data. We consider panel structure models in which in-
dividual units in the same group share a common conditional distribution, while the distribution
can differ in location, shape, or both across groups. This model offers a flexible yet parsimo-
nious approach to capture the heterogeneous distributional effects across individual units. The
group membership structure (i.e., which individuals belong to which group) is assumed to be
unknown and estimated from the data. For each group, we explore the distribution features by
adopting quantile regressions, such that the units in a group share the same quantile regression
function. Exploiting the information contained at different quantiles of the distribution in turn
improves classification accuracy as the clustering here is based not only on the location of the
distribution (measured by the mean), but also on the shape reflected by a range of quantiles.
Hence, when there is little or no heterogeneity in the group means, we can still correctly identify
the group membership structure as long as this structure is retained in the other parts of the
distribution. This occurs, for example, when the asset returns of two groups of firms behave
similarly in the tranquil period, but differently in the volatile period. Moreover, classification
based on the distribution is robust against outliers of the dependent variable compared with
mean-based classification. We name our model the panel structure quantile regression (PSQR)
model.
To estimate this model, we employ an iterative algorithm that alternates between group membership estimation and panel quantile regression estimation. The aim of our estimation approach
is to find the optimal group membership for each individual unit by minimizing its composite
quantile check function given the regression quantile estimates, and to estimate the regression
quantiles given the estimated group memberships. More specifically, the composite quantile
check function is defined as the weighted average of the standard quantile check function over
different quantile levels. The advantage of using the composite quantile check function for
classification is that it allows us to employ the information contained at multiple quantiles
simultaneously when the group membership structure does not vary across quantiles. This
approach thus relaxes the group separation condition and improves the estimation accuracy of
the group membership parameters.1
A technical contribution of this study is that in addition to conventional consistency results,
we precisely quantify the speed of the convergence of the misclassification probability. We show
that the convergence rate is an exponential function of the length of the time series and that it
also depends on various quantities such as the number of quantiles used for classification, degree
of group separation, variance of the error terms, and serial correlation in the data. To the best
of our knowledge, this is the first study to precisely quantify the asymptotic behavior of group
membership estimates. Indeed, previous studies in the panel data classification literature only
provide the consistency of the group membership estimates or a rough convergence rate of the
misclassification probability,2 neither of which explains how the features of the data influence
classification accuracy. With our newly-established asymptotic properties of the misclassifica-
tion probability, we can explicitly show that using multiple quantiles improves classification
accuracy, and this strongly advocates the usage of the composite quantile approach.
The effectiveness of using multiple quantiles for classification relies on the assumption that
1From the non-parametric perspective, the set of quantile regressions chosen at different quantile levels can be viewed as a set of basis functions (not necessarily orthogonal) used to approximate the log-likelihood of the unknown distribution. When the set is large, the composite approach can approximate the log-likelihood function well; therefore, it yields a nearly efficient method.
2A typical clustering result is that the group membership estimates converge at the rate of T^{-δ}, where δ is an unknown positive number (e.g., Bonhomme and Manresa (2015a) and Okui and Wang (2018)). The exponential rate of convergence is interestingly related to the minimax mismatch ratio of the stochastic block models in Zhang and Zhou (2016), in which their signal-to-noise ratio could be linked to the degree of group separation and variance of the error terms in our case.
the group membership structure is invariant over a range of (but not necessarily all) quantiles.
This is relevant in many applications since the factors that drive the group pattern are often
rather inertial. For example, Brand and Xie (2010) found that the economic returns of education
differ significantly over individuals depending on how likely they are to attend college. Since the
likelihood of attending college typically does not change over the different quantiles of the wage
distribution, it seems plausible that the group structure of heterogeneity is also invariant to
quantiles. Even in those cases where the group membership structure may vary over the whole
distribution, some quantile levels still share a common group structure, and the composite
quantile approach can thus be applied to this (small) range of quantiles.
We evaluate the finite-sample performance of our method via a simulation and compare it
with other methods of modeling a group pattern of heterogeneity. We first show that our method
can precisely estimate group memberships and quantile-specific slope coefficients. Classification
accuracy improves as the length of the time series increases. Next, we show that classification
accuracy is influenced by the degree of group separation, number of quantiles used for clus-
tering, and signal-to-noise ratio. In particular, we find that employing multiple quantiles for
classification significantly improves the accuracy of the group membership estimates compared
with existing methods. When groups are separated only in the tails of the distribution but not in their means, conventional mean-based clustering techniques fail to identify the group structure, while our method remains effective. Furthermore, owing
to the more accurate group membership estimates, the group-specific regression quantiles are
also estimated more accurately by our method in finite samples.
We illustrate our method by investigating the output effect of infrastructure capital. We
find that the effect of infrastructure exhibits a large degree of heterogeneity across countries, not
only in the mean effect but also in the shapes of the distributions. Interestingly, a salient geographic pattern emerges: most European and Asian countries behave similarly, while American and some African countries are classified into the same group with a relatively moderate effect of
infrastructure. For both groups, the effect of infrastructure becomes more positive and stronger
in the right tail of the distribution (i.e., when the economy is strong); however, the variation is
larger in the American/African group than in the European/Asian group. Such distributional
heterogeneity in the output effect of infrastructure has not been captured by existing studies.
The remainder of this paper is organized as follows. We position our work against the
related literature in Section 2. We set up the model and describe our estimation method
in Section 3. The asymptotic properties are provided in Section 4. Section 5 provides a
method for determining the number of groups and Section 6 considers an extension of our basic
model, allowing for individual-specific fixed effects. We study the finite-sample properties of our
estimator via a simulation in Section 7 and demonstrate our method with a real data application
in Section 8. Section 9 concludes. The technical details are organized in the Appendix.
2 Literature review
Many studies have aimed to identify the latent group structure in panel data models. Sun
(2005) considered a finite-mixture panel data model with unknown group membership and
Kasahara and Shimotsu (2009) provided the conditions under which the non-parametric identi-
fication of finite-mixture models of dynamic discrete choices is possible. Alternatively, Kmeans,
an “all-or-nothing” classification method (Pollard, 1981), has recently been extensively studied
in the panel data framework; see Lin and Ng (2012), Bonhomme and Manresa (2015a), Sarafidis and Weber (2015), Bonhomme et al. (2017b), Okui and Wang (2018), and Liu et al. (2018). Another popular classification method is to use the classifier-Lasso for clustering; this approach was
first proposed by Su et al. (2016) and further popularized by Su and Ju (2018), Wang et al.
(2017), and Su et al. (2017). Other classification techniques include the thresholding algorithm
(Vogt and Linton, 2016), pairwise comparisons (Krasnokutskaya et al., 2017), and binary seg-
mentation, as discussed by Ke et al. (2016) and Su and Wang (2017), among others. Most of
these studies focus on the heterogeneity in the mean effect, and classification is solely based
on the mean estimates. Thus, they fail to capture distributional heterogeneity. In addition,
the literature on clustering analysis in panel data models does not provide precise asymptotic
analysis of the group membership estimates besides their (super-)consistency. We thus con-
tribute to the literature by studying the distribution features of latent groups and classifying
individuals based on distributional heterogeneity. We also precisely quantify the speed of the
convergence of the misclassification probability and show the extent to which the accuracy of
the group membership estimates depends on the features of the model and data.
Our study builds on previous panel quantile regression models. Following the seminal study
of Koenker (2004), a number of influential authors have provided rigorous (asymptotic) analyses
on the estimation of panel quantile regression, such as Canay (2011), Galvao (2011), Kato et al.
(2012), Galvao and Kato (2016), Graham et al. (forthcoming), and Harding et al. (2017). These
studies all assume that individuals belong to a homogeneous population, and thus the quantile
regression coefficient is common across units. However, as suggested by Galvao et al. (2018),
cross-sectional heterogeneity remains a prominent feature in panel quantile regression. They
proposed a test for homogeneity in fixed effects quantile regressions and found strong evidence
of heterogeneity in the sensitivity of asset returns to firm characteristics at the tail quantiles
(i.e., during booms and busts). Motivated by their finding, we allow regression quantiles to
differ across individuals via a group pattern of heterogeneity. An alternative way in which
to uncover the distributional effect is via distribution regressions, and there exists an inverse
relationship between conditional quantile regression and conditional distributional regression
(Leorato and Peracchi, 2015; Chernozhukov et al., 2018). Distribution regressions offer a flexible
tool with which to model and estimate the entire conditional distribution of any type of outcome
variable (discrete, continuous, mixed), as they allow the effect of the covariates to vary across
different points of the conditional distribution. The cost of this flexibility is that the distribution
regression parameters can be difficult to interpret because they do not directly correspond to
the quantile effects.
Another stream of the literature on modeling heterogeneous distributions, including Rosen et al. (2000), Ng and McLachlan (2014), and Sugasawa (2018), considers finite mixtures of latent conditional distributions. These models often assume that the mixture probability depends on
some observables or that the mixture distribution is composed of several common (known)
distributions. On the contrary, we allow the group memberships and form of distributional
heterogeneity to be fully unrestricted. In other words, we do not require knowledge of the
distribution form for each group, and the distribution can also vary across groups in an arbi-
trary manner. Since finite-mixture models under normal errors are equivalent to the Kmeans
approach (Bonhomme and Manresa, 2015b), we differ from both methods in that our classifi-
cation is not based on mean heterogeneity but rather on distributional heterogeneity.
One closely related work is Gu and Volgushev (2018), who considered panel quantile regression with group fixed effects. We differ from their study in three main aspects. First, they
considered the fixed effects to have a group pattern of heterogeneity, while the regression quantiles are cross-sectionally homogeneous. We allow the regression quantiles to be heterogeneous
across groups to capture the distributional heterogeneity of the effect of the covariates. In ad-
dition, we allow for additive individual-specific fixed effects, accommodating the common situation of individual unobserved heterogeneity; hence, we do not circumvent the incidental parameter
problem. Second, they employed Lasso techniques for clustering, while our approach resembles
the Kmeans approach. Finally, and most importantly, they estimated the groups at one given
quantile level. In practice, however, it is typically unclear which quantile to utilize, and quantile-by-quantile estimation does not guarantee common group membership estimates even when the true group membership structure does not vary across quantiles. In contrast, our estimation is
based on the composite quantile loss function, which allows us to obtain a quantile-invariant
group membership structure and facilitates the identification of groups. We precisely quantify
the speed of the convergence of the misclassification probability, from which we explicitly show
that using multiple quantiles leads to more accurate group membership estimates (even in the
asymptotics) and thus more accurate regression quantiles than using a single quantile. Our
asymptotic results are thus more informative than the consistency of the group membership
estimates in the literature including Gu and Volgushev (2018). This theoretical comparison is
also complemented by the simulation study.
Finally, the idea of using composite quantiles is not new in the literature. Koenker (2005,
Section 5.5), Zou and Yuan (2008), and Zhao and Xiao (2014) showed in their cross-sectional
regressions that combining quantile information improves the efficiency of slope coefficients
that are quantile-independent. Fan et al. (2017) employed the composite idea to estimate the
linear quantile regression at boundary quantiles. We consider panel quantile regression models
in which the slope coefficients depend on the quantiles, but the group membership structure
is quantile-invariant. Hence, we use the composite quantile approach to estimate the group
membership parameter that does not vary across quantiles. Unlike the slope coefficients, group
memberships are discrete, individual-specific, invariant to relabeling, and enter the model as an
index of the slope coefficients. Therefore, new techniques are required to show that combining quantile information improves the efficiency of the group membership estimates.
3 Model setup and estimation
In this section, we first describe the setup of our model and link it to several popular models
and then explain our estimation approach.
3.1 Model setup
Suppose we observe {{y_{it}, x_{it}}_{t=1}^T}_{i=1}^N, where y_{it} is the scalar dependent variable of individual i observed at time t and x_{it} is a (p + 1) × 1 vector of exogenous regressors, whose first element is typically 1. We are interested in the effect of x_{it} on the conditional quantile of y_{it}, and this distributional effect is allowed to differ across individual units. The heterogeneity
can exist in the location of the distribution and/or the shape of the distribution. We assume
that the heterogeneous conditional quantile effect can be characterized by a group structure
such that individuals in the same group share a common conditional quantile curve, while the
curves can differ at any quantile across groups. Thus, we consider the panel data generated
from the following model with group-specific regression quantiles:
Q_τ(y_{it} | x_{it}) = x_{it}'β_{g_i}(τ),   i = 1, . . . , N,  t = 1, . . . , T,   (3.1)

where τ ∈ (0, 1) is the quantile index and Q_τ(y_{it} | x_{it}) is the conditional τ-quantile of y_{it} given x_{it}. The quantile coefficient β_{g_i}(τ) is characterized by a group pattern of heterogeneity, and the group membership of unit i is denoted by g_i ∈ {1, . . . , G}, with G being the number of groups. The group membership structure {g_i}_{i=1}^N is unknown and to be estimated. We assume that the group membership is time-invariant and independent of τ for all i = 1, . . . , N. This is common
in practice and can occur if the group membership of an individual unit is predetermined; then,
the time observations of the dependent variable for this unit are generated from the associated
conditional distribution of this group.
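To make the notion of distributional heterogeneity concrete, the following is a minimal numpy sketch of a data-generating process consistent with model (3.1): two groups share the same conditional median effect but differ in error scale, so they are separated only away from the median. The function name and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_panel(N=100, T=50, seed=0):
    """Simulate y_it = x_it * 1 + s_g * e_it: both groups share the same
    conditional median effect, but the group-specific scales s_g differ,
    so the groups are separated only in the tails of the distribution."""
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 2, size=N)            # latent group memberships
    scale = np.where(g == 0, 0.5, 2.0)        # group-specific scale s_g
    x = rng.normal(size=(N, T))               # exogenous regressor
    eps = rng.normal(size=(N, T))             # median-zero errors
    y = x + scale[:, None] * eps              # common median effect of 1
    return y, x, g
```

In this design Q_τ(y_{it} | x_{it}) = x_{it} + s_{g_i} Φ^{-1}(τ): the two groups coincide at τ = 0.5 but diverge at tail quantiles, which is exactly the setting where mean-based clustering fails.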
This model captures the cross-sectional heterogeneity and varying effects of the covariates
across different quantiles of the conditional distribution of the dependent variable. It includes
several important models as special cases. When τ = 0.5, our model collapses to a median
regression with group-specific slope coefficients, which is closely related to the panel structure
model with group-specific conditional mean effects (Lin and Ng, 2012; Bonhomme and Manresa,
2015a; Su et al., 2016). If the parameter β_{g_i}(τ) is cross-sectionally homogeneous, i.e., β_{g_i}(τ) = β(τ) for all i, the model reduces to the quantile regression with random effects (Galvao and Poirier, 2016). Our model also includes the panel quantile regression with group fixed effects (Gu and Volgushev, 2018) as a special case in which only the intercept is allowed to have a group pattern of heterogeneity.
Model (3.1) assumes that the fixed effects are group-specific. In some applications, it is
desirable to allow unobserved heterogeneity to be individual-specific and correlated with the
covariates. Hence, we consider an extension of model (3.1) with individual fixed effects in
Section 6 as
Q_τ(y_{it} | α_i(τ), x_{it}) = α_i(τ) + x_{it}'β_{g_i}(τ),   i = 1, . . . , N,  t = 1, . . . , T,   (3.2)

where α_i(τ) is the fixed effect of individual i. While allowing individual unobserved heterogeneity is more appealing, it is technically more involved because of the presence of incidental
parameters and non-smooth features of the quantile regression objective function (Galvao and
Kato, 2016).
Remark 1. We identify distributional heterogeneity by using repeated time series observations
of each individual unit.3 We discuss the identifiability of our model from two aspects. First, we can identify the latent group membership of each individual as long as that individual has sufficiently many time observations lying in the non-overlapping region of the two groups.
This is satisfied if we can observe each individual unit for infinitely many periods. Second, we
allow the distribution of a group to be of any shape, including the multi-modal distribution, and
can still correctly identify the group membership structure. This is again achieved by observing
each individual unit for infinitely many periods. For example, if a group is characterized by a
bimodal distribution, it would not be identified as two unimodal groups because the distribution
of each individual unit in this group is bimodal.
3.2 Estimation method
Model (3.1) contains two types of parameters to estimate: the group-specific regression quantiles β_{g_i}(τ) for some quantile index τ and the group membership variable g_i for all i ∈ {1, . . . , N}. In practice, we often consider a finite number of quantiles, τ_1, . . . , τ_K, and thus we must estimate K regression quantiles β_g(τ_k) ∈ B ⊂ R^{p+1} for each group. Let β(τ_k) = {β_1'(τ_k), . . . , β_G'(τ_k)}', and further denote β(τ) = {β'(τ_1), . . . , β'(τ_K)}' ∈ B^{GK} with τ = (τ_1, . . . , τ_K)'. Define γ = {g_1, . . . , g_N} as a partition of the N individuals into at most G groups, and let Γ_G denote the set of all possible partitions. Note that γ does not change over the quantile levels. We first discuss the estimation of β(τ) and γ for a given number of groups G and then discuss how to determine the number of groups in Section 5.
We propose estimating the regression quantiles and group memberships by minimizing the following composite quantile objective function:

(β̂(τ), γ̂) = argmin_{(β(τ), γ) ∈ B^{GK} × Γ_G} Σ_{k=1}^K Σ_{i=1}^N Σ_{t=1}^T w_k ρ_{τ_k}(y_{it} − x_{it}'β_{g_i}(τ_k)),   (3.3)

where the check function is ρ_{τ_k}(u) = (τ_k − I(u < 0))u and the weights satisfy w_k ∈ [0, 1], k = 1, . . . , K. Typically, we consider K equally spaced quantiles, say τ_k = k/(K + 1), and use equal weights if
the multiple quantiles all contain clustering information. When K = 1, this can be regarded as
Kmeans-type clustering based on the check function at a given quantile.4 Using the composite
quantile approach has several advantages. First, it addresses the problem of which quantile to use
3This differs from the group identification in cross-sectional data, where assumptions on the distribution of each group are typically required; see, e.g., Dong and Lewbel (2011).
4One may consider optimal weights that minimize a certain accuracy measure in the vein of Zhao and Xiao (2014). We leave this interesting topic to future research. Nevertheless, we do discuss how to verify whether quantiles contain clustering information in Section 5.
for clustering, and it directly provides the group membership estimates that do not vary across
quantiles. Second, it leads to a more efficient estimate for the group membership parameter
than using a single quantile or using the mean. Finally, it allows us to identify the latent group
structure when there is group separation only in part of the distribution, e.g., no separation at the mean but only in the tails.
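As a concrete illustration, the check function and the composite loss in (3.3) for a single unit can be written in a few lines of numpy. The function names here are ours; in practice, the minimization over β would be carried out by a linear-programming quantile-regression solver.

```python
import numpy as np

def check(u, tau):
    """Quantile check (pinball) function: rho_tau(u) = (tau - I{u < 0}) u."""
    return (tau - (u < 0)) * u

def composite_loss(y_i, x_i, betas, taus, w):
    """Composite check loss of one unit under candidate coefficients
    betas[k] at quantile level taus[k], with weights w[k]; the unit is
    assigned to the group whose coefficient path minimizes this loss."""
    return sum(wk * check(y_i - x_i @ bk, tk).sum()
               for bk, tk, wk in zip(betas, taus, w))
```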
Since an exhaustive search of the optimal partition of the parameter space is virtually
infeasible (Su et al., 2016), we solve the optimization problem by adopting the following iterative
algorithm in the spirit of Bonhomme and Manresa (2015a) and Okui and Wang (2018):
Algorithm 3.1. Let β^{(0)}(τ_k) be the initial estimate of β(τ_k) for k = 1, . . . , K, and set s = 0.

Step 1 Given β^{(s)}(τ_k), k = 1, . . . , K, compute, for all i ∈ {1, . . . , N},

g_i^{(s)} = argmin_{g_i ∈ {1, . . . , G}} Σ_{k=1}^K Σ_{t=1}^T w_k ρ_{τ_k}(y_{it} − x_{it}'β_{g_i}^{(s)}(τ_k)).

Step 2 Given γ^{(s)}, compute, for each quantile level τ_k,

β^{(s+1)}(τ_k) = argmin_{β ∈ B^G} Σ_{i=1}^N Σ_{t=1}^T ρ_{τ_k}(y_{it} − x_{it}'β_{g_i^{(s)}}(τ_k)),   k = 1, . . . , K.

Step 3 Set s = s + 1. Return to Step 1 until numerical convergence.
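Algorithm 3.1 can be sketched in pure numpy for the intercept-only special case, where the Step 2 quantile regression reduces to a within-group sample quantile; a general implementation would substitute a linear quantile-regression solver in Step 2. The initialization scheme, equal weights, and all names here are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def psqr_cluster(y, G=2, taus=(0.25, 0.5, 0.75), max_iter=100, seed=0):
    """Algorithm 3.1 for an intercept-only model Q_tau(y_it) = beta_g(tau):
    iterate between (Step 1) assigning each unit to the group minimizing its
    composite check loss and (Step 2) re-estimating the group quantiles,
    which here reduce to within-group sample quantiles. Equal weights w_k."""
    rng = np.random.default_rng(seed)
    N, T = y.shape
    K = len(taus)
    # beta^(0): initialize each group's quantile profile from a random unit
    init = rng.choice(N, size=G, replace=False)
    beta = np.array([[np.quantile(y[i], tau) for tau in taus] for i in init])
    check = lambda u, tau: (tau - (u < 0)) * u
    g = None
    for _ in range(max_iter):
        # Step 1: composite check loss of every unit under every group
        loss = np.stack([
            sum(check(y - beta[h, k], taus[k]).sum(axis=1) for k in range(K))
            for h in range(G)
        ])                                    # shape (G, N)
        g_new = loss.argmin(axis=0)
        if g is not None and np.array_equal(g_new, g):
            break                             # numerical convergence
        g = g_new
        # Step 2: group-wise quantile "regression" = sample quantiles
        for h in range(G):
            if np.any(g == h):
                beta[h] = [np.quantile(y[g == h], tau) for tau in taus]
    return g, beta
```

Because both steps weakly decrease the composite objective (3.3) and the number of partitions is finite, the iterations converge, although possibly to a local optimum, so multiple starting values are advisable in practice.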
This algorithm iterates between the classification and estimation steps. In the classification
step (Step 1), we update the group memberships based on the composite quantile check func-
tion given the regression quantile estimates. This step contrasts with the standard Kmeans
(Bonhomme and Manresa, 2015a) or Lasso-based (Su et al., 2016; Gu and Volgushev, 2018) al-
gorithms that cluster individuals only based on the mean or a single quantile of the distribution.
Since the group membership structure is common across the conditional distribution of the de-
pendent variable, multiple quantiles all contain clustering information, although the quantities
of information may differ. Hence, using multiple quantiles for classification is expected to be
more efficient than existing approaches.
Step 2 estimates the regression quantiles given the group membership estimates by applying
standard quantile regression estimation to each group. Since each β_g(τ) can depend on τ, the
estimation is conducted independently at each quantile level. The composite feature of the
objective function (3.3) does not help estimate the regression quantiles directly. However, it
does indirectly improve the coefficient estimation in finite samples owing to the more precise
group membership estimates. This improvement is especially significant when two groups are
less well separated or when the signal-to-noise ratio is low. In some special cases where a
subvector of βg(τ) is independent of τ , using composite quantiles helps improve the asymptotic
efficiency of the estimators for this subvector of parameters, similar to the arguments of Zou
and Yuan (2008) and Zhao and Xiao (2014).
4 Asymptotic properties
In this section, we study the asymptotic properties of the proposed estimators and formally
demonstrate the advantages of using multiple quantiles. We first introduce some notation. Let β^0 and g_i^0 denote the true values of β and g_i, and define ε_{it}(τ_k) := y_{it} − x_{it}'β^0_{g_i^0}(τ_k). It then follows from (3.1) that Q_{τ_k}(ε_{it}(τ_k) | x_{it}) = 0. For each individual i, let F_{i,τ_k}(· | x) denote the conditional distribution of ε_{it}(τ_k) given x_{it} = x for all t = 1, . . . , T. We assume that F_{i,τ_k}(· | x) has a density f_{i,τ_k}(· | x).
4.1 Weak consistency
We first show the consistency of the regression quantile estimates under the unknown (i.e.,
estimated) group membership structure. The following assumptions are required.
Assumption 1.
(i) B is a compact subset of R^{p+1}.

(ii) For each i ≥ 1, the process {(y_{it}, x_{it}) : t ≥ 1} is strictly stationary and α-mixing with α-mixing coefficient α_i(t). Furthermore, assume sup_{i≥1} α_i(t) → 0 as t → ∞.

(iii) There exists a constant M such that sup_{i≥1} E[‖x_{it}‖] ≤ M.

(iv) For each i and k, f_{i,τ_k}(u|x) is continuously differentiable with respect to u. Let f^{(1)}_{i,τ_k}(u|x) := (∂/∂u) f_{i,τ_k}(u|x). There exists a constant C_f such that |f^{(1)}_{i,τ_k}(u|x)| ≤ C_f uniformly over (u, x) for all i ≥ 1 and k = 1, . . . , K.

(v) Let λ_N(g, g̃, τ_k) be the minimum eigenvalue of N^{-1} Σ_{i=1}^N I{g_i^0 = g, g_i = g̃} E[f_{i,τ_k}(0|x_{it}) x_{it} x_{it}'], such that for all g, g̃ ∈ {1, . . . , G} and k = 1, . . . , K,

λ_N(g, g̃, τ_k) ≥ λ_N,  and  lim inf_{N→∞} λ_N > 0,  a.s.
Assumption 1(i) requires the compactness of the parameter space, which is standard in the
econometrics literature. Assumption 1(ii) is similar to (A1) of Galvao and Kato (2016), ensuring
that the distribution of the residual εit(τk) is common over time. Compared with Assumptions
2(c) and 2(d) in Bonhomme and Manresa (2015a) and Condition 1 in Vogt and Linton (2016),
we do not necessarily require exponentially or sufficiently high polynomial decaying mixing
rates. Assumption 1(iii), which controls the tail behavior of the variables, is used to bound the
maximum clustering error. Assumption 1(iv) restricts the smoothness of the density function of
the residuals, similar to Assumption (A5) in Galvao and Kato (2016). Finally, Assumption 1(v)
is a relevant rank condition akin to Assumption 1(g) in Bonhomme and Manresa (2015a).
Theorem 4.1. Under Assumption 1, we have, as N, T → ∞,

max_{1≤i≤N} ‖β̂_{ĝ_i}(τ_k) − β^0_{g_i^0}(τ_k)‖ →_P 0,   (4.1)

for any k = 1, . . . , K and g_i^0 ∈ {1, . . . , G}.
This theorem states that as N and T diverge, the estimated regression quantiles under the
unknown (i.e., estimated) group membership structure converge to their true values with known
group memberships.
4.2 Asymptotic behavior of the misclassification probability
The weak consistency result provides some justification for our method. Nevertheless, it does not explain whether, and to what degree, the accuracy of the group membership estimates
is influenced by using composite quantiles and by the features of the data. In this section,
we aim to precisely quantify the accuracy of the group membership estimates and examine
how the use of distributional information and other features of the data affects the level of
accuracy. A challenge is that the group membership parameter is discrete and label-invariant,
and thus directly studying its asymptotic distribution is rather difficult, if not impossible. To
overcome this challenge, we focus on the misclassification probability, a widely used measure
of classification accuracy, defined as
MP[β̂(τ)] = (1/N) Σ_{i=1}^N I{ĝ_i(β̂(τ)) ≠ g_i^0}.   (4.2)
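Because group labels are only identified up to a permutation, computing (4.2) on real data requires matching estimated labels to true ones. Below is a hedged numpy sketch of this label-matched misclassification probability; the matching-over-permutations convention is ours, not defined in the paper.

```python
import numpy as np
from itertools import permutations

def misclassification_prob(g_hat, g_true, G):
    """Empirical MP of (4.2), minimized over relabelings of the G groups,
    since group memberships are invariant to permuting the labels."""
    g_hat = np.asarray(g_hat)
    g_true = np.asarray(g_true)
    return min(
        np.mean(np.array(perm)[g_hat] != g_true)  # relabel g_hat via perm
        for perm in permutations(range(G))
    )
```

Enumerating all G! permutations is cheap for the small G typical in applications; for large G, a Hungarian-assignment matching would be the natural substitute.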
We derive the rate of the convergence of this probability and examine how the number of
quantiles used for the estimation and the features of the data affect this rate. Some additional
assumptions and a lemma are required.
Assumption 2.
(i) For all g ∈ {1, . . . , G}, N^{-1} Σ_{i=1}^N I{g_i^0 = g} →_p π_g > 0.

(ii) Let g ≠ g̃ with (g, g̃) ∈ {1, . . . , G}². Then, there exists a k ∈ {1, . . . , K} such that ‖β^0_g(τ_k) − β^0_{g̃}(τ_k)‖ > 0.

(iii) The minimum eigenvalues of E[x_{it} x_{it}'] are bounded away from zero uniformly over i ≥ 1.

(iv) For each i ≥ 1, let ρ_i(t) be the ρ-mixing coefficient of the process {(y_{it}, x_{it}) : t ≥ 1}, defined as the "maximal correlation" by Kolmogorov and Rozanov (1960). Let ρ′ := sup_{t≥1, i≥1} ρ_i(t), and assume ρ′ < 1.

(v) There exists a constant M such that sup_{i≥1} ‖x_{it}‖ ≤ M a.s.
(vi) Define λ as the minimum eigenvalue of the following matrix:(1
NT
N∑i=1
T∑t=1
I(g0i = g)xit
)(1
NT
N∑i=1
T∑t=1
I(g0i = g)x′it
),
Then, λp−→ λ > 0 as N, T →∞.
Assumption 2(i) requires a sufficient number of individual units in each group, as in Bonhomme and Manresa's Assumption 2(a). Assumptions 2(ii) and 2(iii) jointly provide the conditions under which we can identify the group memberships: they require that the groups are well separated. Assumption 2(iv) requires that the data are not perfectly correlated across individuals and over time, while a certain degree of serial correlation and cross-sectional dependence is allowed. Assumption 2(v) is similar to, but stronger than, Assumption 1(iii); it can be relaxed to $E\|x_{it}\|^{2+\delta} \leq M$ at the expense of lengthier proofs. Assumption 2(vi) guarantees that the rank condition holds under any group structure.
Remark 2. The group separation assumption (Assumption 2(ii)) is satisfied as long as the regression quantiles are distinct across groups for at least one quantile level. This allows, for example, for groups that are separated not in the mean but only in a tail quantile, and thus our group separation condition is weaker than the standard mean-based separation condition (e.g., Bonhomme and Manresa (2015a) and Su et al. (2016)).
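As a quick numerical illustration of this point (our own toy example, not from the paper), compare a standard normal with the demeaned $\mathrm{Gamma}(1, 5)$ error later used in DGP.3: both have mean zero, yet their 0.9 quantiles differ sharply, so a tail quantile separates the groups even when the mean does not.

```python
# Toy illustration (our own): identical means, but sharply different tail
# quantiles, so only a tail quantile identifies the groups.
import random

def empirical_quantile(draws, tau):
    s = sorted(draws)
    return s[min(len(s) - 1, int(tau * len(s)))]

rng = random.Random(0)
n = 200_000
normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Gamma(shape=1, scale=5) is Exponential(rate=1/5) with mean 5; demean it.
gamma = [rng.expovariate(1 / 5) - 5.0 for _ in range(n)]

mean_gap = abs(sum(normal) / n - sum(gamma) / n)   # close to zero
q90_gap = (empirical_quantile(gamma, 0.9)
           - empirical_quantile(normal, 0.9))      # roughly 6.5 vs 1.28
```

Mean-based clustering sees no separation here, while a quantile at $\tau = 0.9$ sees a gap of about five units.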
Remark 3. Compared with Assumption 2(d) in Bonhomme and Manresa (2015a), which imposes an exponential decay on the tail of the errors, we allow the errors to have heavy tails. We quantify the speed of convergence of the misclassification probability by using a central limit theorem for strongly mixing processes; see Bradley and Tone (2017). This also differs from Bonhomme and Manresa (2015a), who used exponential inequalities for dependent processes (their Lemma B.5).
Lemma 1. Under Assumptions 1 and 2(i)–2(ii), we have, as $N, T \to \infty$,
$$\|\hat{\beta}_g(\tau_k) - \beta_g^0(\tau_k)\| \xrightarrow{p} 0,$$
for $k = 1, \dots, K$ and $g \in \{1, \dots, G\}$.
This lemma suggests that the difference between the estimated and true regression quantiles is asymptotically negligible. In combination with Theorem 4.1, it implies that the group membership estimates converge to the true values asymptotically, and it allows us to study the asymptotic behavior of the group membership estimates in a neighborhood of the true value. Specifically, we denote by $N_\eta$ the set of parameters $\beta(\tau) \in \Theta^{GK}$ such that, for a given $\eta > 0$, $\|\beta_g(\tau_k) - \beta_g^0(\tau_k)\| < \eta$ for any $k = 1, \dots, K$ and $g \in \{1, \dots, G\}$.
The following quantities are crucial for establishing the asymptotic properties of the misclassification probability. Define
$$b_{it}(K) = K^{-1} \sum_{k=1}^{K} w_k \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du,$$
whose first and second moments are
$$E[b_{it}(K)] = K^{-1} \sum_{k=1}^{K} w_k E\left[ \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( F_{i,\tau_k}(u \mid x_{it}) - \tau_k \big) \, du \right],$$
and
$$E[b_{it}^2(K)] = E\left[ K^{-1} \sum_{k=1}^{K} w_k \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du \right]^2.$$
Let $\zeta_{g,\tilde{g}} = \inf_{i \geq 1} E[b_{it}(K)] / \sqrt{\mathrm{Var}[b_{it}(K)]} > 0$ and $C' = (1 + \rho') / (1 - \rho')$, with $\rho'$ defined as in Assumption 2(iv). The following theorem gives the rate of convergence of the misclassification probability.
Theorem 4.2. Under Assumptions 1 and 2, we have, as $N, T \to \infty$,
$$\sup_{\beta(\tau) \in N_\eta} \frac{1}{N} \sum_{i=1}^{N} I\big\{ \hat{g}_i(\beta(\tau)) \neq g_i^0 \big\} = O_P\big( T^{-1/2} \exp(-\zeta T) \big), \qquad (4.3)$$
where
$$\zeta = \min_{\substack{g \neq \tilde{g} \\ g, \tilde{g} \in \{1, \dots, G\}}} \frac{\zeta_{g,\tilde{g}}^2}{8 C'}.$$
This theorem shows that the rate of convergence of the misclassification probability is an exponential function of $T$, in line with the literature (Bonhomme and Manresa, 2015a; Okui and Wang, 2018). We take one step further and show that this rate depends on the number of quantiles used for the estimation, the magnitude of the noise, the degree of group separation, and the serial correlation. In particular, $\zeta$ is a function of the group separation across quantiles, which also depends on the error term. From the upper limit of the integral in $E[b_{it}(K)]$, we observe that the exponentially decaying convergence rate only requires the group pattern to be well separated at some, but not all, quantiles (Assumptions 2(ii) and 2(iii)). Hence, our composite approach is more robust in identifying the group structure than using only a single quantile or the mean for clustering. $C'$ captures the degree of serial dependence: a larger degree of dependence corresponds to a larger $C'$, which in turn leads to a larger misclassification probability.⁵
Illustrative example: To better illustrate how the rate of convergence is determined by the various quantities, we consider a simple model with two groups ($G = 2$) and only an intercept:
$$y_{it} = \beta^0_{g_i^0} + \varepsilon_{it}, \quad g_i^0 \in \{1, 2\}, \qquad (4.4)$$
where $\varepsilon_{it}$ is i.i.d. $N(0, \sigma^2)$. We assume $\beta_1^0 < \beta_2^0$ without loss of generality. Model (4.4) can be rewritten, in a form similar to (3.1), as
$$y_{it} = \beta^0_{g_i^0}(\tau_k) + \varepsilon_{it}(\tau_k), \quad k = 1, \dots, K, \qquad (4.5)$$
where $\beta^0_{g_i^0}(\tau_k) = \beta^0_{g_i^0} + q_{\tau_k}$ and $\varepsilon_{it}(\tau_k) = \varepsilon_{it} - q_{\tau_k}$, with $q_\tau$ denoting the $100\tau\%$ quantile of $N(0, \sigma^2)$.
In this case, the misclassification probability is the probability of (mis)classifying an individual into group 2 given that s/he is in group 1. It follows from (3.3) that this probability can be written as
$$P\big( \hat{g}_i[\beta^0(\tau)] = 2 \mid g_i^0 = 1 \big) = P\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - \beta_2^0(\tau_k) \big) \leq \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - \beta_1^0(\tau_k) \big) \right).$$

⁵ For a given finite $K$, it is possible to construct data-driven weights $w_k$, $k = 1, \dots, K$, based on Theorem 4.2 to obtain an improved convergence rate for the group membership estimates. For example, we can first set $w_k = 1$ in the objective function (3.3) to estimate $\beta_g^0(\tau_k)$, $F_{i,\tau_k}(u \mid x_{it})$, and $\varepsilon_{it}(\tau_k)$. Then, we can compute the data-driven weights $w_k$ from $\max_{w_k \geq 0, \, \mathrm{Var}[b_{it}(w_k)] = 1} E[b_{it}(w_k)]$ and apply them in (3.3) to re-estimate the model.
By using Knight's (1998) identity, we can rewrite the misclassification probability as
$$P\big( \hat{g}_i(\beta^0(\tau)) = 2 \mid g_i^0 = 1 \big) = P\left( \frac{1}{T} \sum_{t=1}^{T} b_{i,t}(K) \leq 0 \right),$$
where $b_{i,t}(K) = K^{-1} \sum_{k=1}^{K} w_k \int_0^{\beta_2^0 - \beta_1^0} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du$. In this simple case, we have
$$E[b_{i,t}(K)] = K^{-1} \sum_{k=1}^{K} w_k \int_0^{\beta_2^0 - \beta_1^0} \big( \Phi((u + q_{\tau_k})/\sigma) - \tau_k \big) \, du,$$
and
$$\mathrm{Var}[b_{i,t}(K)] = K^{-2} \sum_{k=1}^{K} \sum_{l=1}^{K} w_k w_l \int_0^{\beta_2^0 - \beta_1^0} \int_0^{\beta_2^0 - \beta_1^0} \left[ \Phi\left( \frac{\min(u + q_{\tau_k}, s + q_{\tau_l})}{\sigma} \right) - \Phi\left( \frac{u + q_{\tau_k}}{\sigma} \right) \Phi\left( \frac{s + q_{\tau_l}}{\sigma} \right) \right] du \, ds,$$
where $\Phi(\cdot)$ is the standard normal distribution function. The central limit theorem implies that
$$\sqrt{T} \, \frac{T^{-1} \sum_{t=1}^{T} b_{i,t}(K) - E[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}} \xrightarrow{d} N(0, 1), \quad T \to \infty.$$
Hence, we can show that, as $T \to \infty$,
$$P\big( \hat{g}_i(\beta^0(\tau)) = 2 \mid g_i^0 = 1 \big) \to \Phi\left( -\frac{E[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}} \sqrt{T} \right) = O\left( T^{-1/2} \exp\left( -\frac{(E[b_{i,t}(K)])^2}{2 \, \mathrm{Var}[b_{i,t}(K)]} T \right) \right),$$
where the equality uses the Mills ratio. Figure 1 depicts the relationships between the misclassification probability and the number of quantiles used for the estimation, the degree of group separation, and the variance of the error term.
FIGURE 1
The left panel of Figure 1 shows that the misclassification probability is a decreasing function of the number of quantiles. The middle panel considers a smaller degree of group separation, for which we see a larger misclassification probability than in the left panel. The right panel suggests that increasing the variance of the error term leads to a higher misclassification probability, as expected. In this special case, the association between the misclassification probability and these key quantities can be proven theoretically (see the supplementary file). In the general case, it is difficult to establish such monotonic relationships analytically since the distribution of the error term is unknown. Nevertheless, numerical investigation suggests that these relationships hold for a wide range of distributions, such as most distributions in the exponential family.
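The closed-form approximation above can be checked numerically. The sketch below (our own code; the function name and the midpoint-rule grids are our choices, and equal weights $w_k = 1$ are used) evaluates $E[b_{i,t}(K)]$ and $\mathrm{Var}[b_{i,t}(K)]$ by numerical integration for the Gaussian example, then reports the normal-approximation misclassification probability $\Phi(-\zeta\sqrt{T})$; one can use it to see the probability shrink as $T$ grows, and to experiment with the quantile set, the separation $\beta_2^0 - \beta_1^0$, and the error variance, as in Figure 1.

```python
# Numerical check of the two-group Gaussian example (our own sketch).
from statistics import NormalDist

def mp_normal_example(beta1, beta2, sigma, taus, T, n_grid=60):
    """Normal-approximation misclassification probability Phi(-zeta*sqrt(T))."""
    nd = NormalDist()
    K = len(taus)
    delta = beta2 - beta1                        # group separation
    q = [sigma * nd.inv_cdf(t) for t in taus]    # Gaussian quantiles q_tau
    h = delta / n_grid
    grid = [(j + 0.5) * h for j in range(n_grid)]  # midpoint rule on [0, delta]

    # E[b] = K^{-1} sum_k int_0^delta (Phi((u + q_k)/sigma) - tau_k) du
    e_b = sum(sum(nd.cdf((u + q[k]) / sigma) - taus[k] for u in grid) * h
              for k in range(K)) / K

    # Var[b] from the double integral with covariance kernel
    # Phi(min(u+q_k, s+q_l)/sigma) - Phi((u+q_k)/sigma)*Phi((s+q_l)/sigma)
    var_b = 0.0
    for k in range(K):
        for l in range(K):
            for u in grid:
                a = (u + q[k]) / sigma
                ca = nd.cdf(a)
                for s in grid:
                    b = (s + q[l]) / sigma
                    var_b += (nd.cdf(min(a, b)) - ca * nd.cdf(b)) * h * h
    var_b /= K * K

    return nd.cdf(-e_b / var_b ** 0.5 * T ** 0.5)
```

With a DGP.1-style separation $\beta_2^0 - \beta_1^0 = 0.75$ and $\sigma = 1$, the probability falls rapidly in $T$, mirroring the $T^{-1/2}\exp(-\zeta T)$ rate in Theorem 4.2.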
Theorem 4.2 implies that the estimation error caused by unknown group memberships converges to zero at an exponential rate. Formally, denote by $\tilde{\beta}(\tau)$ the regression quantiles estimated under the true group membership structure:
$$\tilde{\beta}(\tau) = \arg\min_{\beta(\tau) \in \mathcal{B}^{GK}} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - x_{it}' \beta_{g_i^0}(\tau_k) \big).$$
Then, the following corollary relates the estimates obtained under the unknown and the true group memberships.
Corollary 1. Under the assumptions of Theorem 4.2, we have, for all $g \in \{1, \dots, G\}$ and $k = 1, \dots, K$,
$$\|\hat{\beta}_g(\tau_k) - \tilde{\beta}_g(\tau_k)\| = O_P\big( T^{-1/2} \exp(-\zeta T) \big), \qquad (4.6)$$
as $N, T \to \infty$.
This corollary states that the regression quantile estimators under unknown group memberships converge to those under the true group memberships at a similar exponential rate, which depends on the number of quantiles used and the features of the data.
4.3 Asymptotic distribution
Finally, we derive the asymptotic distribution of the regression quantile estimates. Extra as-
sumptions are needed for this result.
Assumption 3.

(i) Let $y_i = (y_{i1}, \dots, y_{iT})'$ and $x_i = (x_{i1}', \dots, x_{iT}')'$. For all $g \in \{1, \dots, G\}$, $\{(y_i, x_i) I(g_i^0 = g)\}_{i=1}^{N}$ are i.i.d.

(ii) The conditional density of $y_{it}$ given $x_{it}$, $I(g_i^0 = g) f_g(y \mid x_{it})$, is bounded and continuous for all $g \in \{1, \dots, G\}$.

(iii) The matrix $\Gamma(\tau_k, g) = E\big[ x_{it} x_{it}' I(g_i^0 = g) f_g(x_{it}' \beta_g^0(\tau_k) \mid x_{it}) \big]$ is invertible and has its minimum eigenvalue bounded away from zero, uniformly for all $k = 1, \dots, K$.
Assumption 3 is similar to Assumptions A1–A4 in Galvao and Poirier (2016), which are used to prove the asymptotic normality of the pooled linear random effects quantile estimator, except that we restrict attention to the observations in each group. Assumption 3(i) requires that individual units are not cross-sectionally dependent.⁶ Assumption 3(ii) imposes identical distributions of units within a group, which strengthens Assumption 1(iv). Assumption 3(iii) is a rank condition for identification, and the matrix $\Gamma(\tau_k, g)$ determines the variance of the asymptotic distribution.
Corollary 2. If Assumptions 1, 2, and 3 hold and $N / (\sqrt{T} \exp(\zeta T)) \to 0$ as $N, T \to \infty$, with $\zeta > 0$ defined in Theorem 4.2, then we have, for all $g \in \{1, \dots, G\}$,
$$\Gamma(\tau, g) \sqrt{\pi_g N T} \big( \hat{\beta}_g(\tau) - \beta_g^0(\tau) \big) \Rightarrow z(\tau, g), \qquad (4.7)$$
where $z(\cdot, g)$ is a $K$-dimensional normal distribution with zero mean and covariance matrix
$$E[z(\tau, g) z(\tau, g)'] = \operatorname*{plim}_{T \to \infty} T^{-1} \sum_{s=1}^{T} \sum_{t=1}^{T} E\big[ (I(\varepsilon_{it}(\tau) \leq 0) - \tau)(I(\varepsilon_{is}(\tau) \leq 0) - \tau)' x_{it} x_{is}' I(g_i^0 = g) \big],$$
with $\varepsilon_{it}(\tau) = (\varepsilon_{it}(\tau_1), \dots, \varepsilon_{it}(\tau_K))'$.
This result is a direct consequence of Corollary 1 given the well-established asymptotic distribution of the regression quantile estimates. If individual units are allowed to be dependent within a group by relaxing Assumption 3(i), a cluster-robust variance–covariance matrix estimator can be employed for inference, as in Galvao and Poirier (2016).

⁶ For the weak consistency and asymptotic equivalence, we can allow for lagged outcomes and general predetermined regressors. However, to derive the asymptotic distribution, we have to rule out lagged outcomes unless stronger assumptions are added. We conjecture that when lagged outcomes are included, a dynamic quantile IV-type method (Galvao, 2011) can be applied to handle the bias from the unobserved initial values.
Remark 4. If the estimated conditional quantile is non-monotone in $\tau$, we can rearrange it into a monotone function by simply sorting the values of the function in non-decreasing order. Chernozhukov et al. (2010) showed that such a rearrangement can improve the finite-sample properties of the estimator.
5 Determining the number of groups
We have thus far assumed that the number of groups is known. However, this number is rarely
given in applications and we therefore need to determine it before implementing our estimator.
A popular approach is to minimize some information criterion (IC) computed for different numbers of groups, which trades off model fit against the number of parameters (e.g., Bonhomme and Manresa (2015a), Su et al. (2016), and Gu and Volgushev (2018)). However, the use of an IC in our case is complicated because our estimator is obtained from the composite quantile objective function. The heterogeneity in the number of groups across quantiles may "contaminate" the behavior of the IC if the composite feature of the objective function is not well incorporated (Gao and Song, 2010). In particular, if the groups are only identified at the tail quantiles but not at the central quantiles, a standard IC is likely to underestimate the number of groups: the check function evaluated at the central quantiles can dominate the composite quantile objective function, so that the objective function undervalues the improvement in fit when the number of groups rises. Underestimating the number of groups is particularly undesirable (compared with overestimation), as it leads to inconsistent coefficient estimates (Liu et al., 2018).
To avoid underestimation, we propose an innovative approach for selecting the number of groups: we first choose the optimal number of groups at each specific quantile by minimizing a quantile-specific IC, and then take the final number of groups to be the maximum over all considered quantiles. We assume that the true number of groups $G^0$ is bounded from above by a finite integer $G_{\max}$. The quantile-specific IC is defined as
$$\mathrm{IC}(G, \tau) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big( y_{it} - x_{it}' \hat{\beta}^{(G)}_{\hat{g}_i}(\tau) \big) + G(p + 1) f(N, T), \qquad (5.1)$$
where the superscript $(G)$ refers to the estimator with $G$ groups, $G(p + 1)$ is the number of parameters of interest, and $f(N, T)$ is a tuning parameter. We then minimize the IC to determine the number of groups at each given quantile $\tau_k$, $k = 1, \dots, K$:
$$\hat{G}(\tau_k) = \arg\min_{G = 1, \dots, G_{\max}} \mathrm{IC}(G, \tau_k). \qquad (5.2)$$
We can show that, for each $\tau_k$, $\hat{G}(\tau_k)$ consistently estimates $G^0$ under the following assumptions. Denote any $G$-partition of $\{1, 2, \dots, N\}$ by $P(G) = (P_1, \dots, P_G)$, where we suppress the dependence of $\{P_g, g = 1, \dots, G\}$ on $G$ for notational convenience. Then $P(G) \in \mathcal{P}(G)$, with $\mathcal{P}(G)$ denoting the collection of all such partitions. Let
$$\hat{e}_{P(G)} = \frac{1}{NT} \sum_{g=1}^{G} \sum_{i \in P_g} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \hat{\beta}_{P_g}(\tau_k) \big), \quad \text{with} \quad \hat{\beta}_{P_g}(\tau_k) = \arg\min_{\beta} \frac{1}{N_{P_g} T} \sum_{i \in P_g} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \beta \big),$$
where $N_{P_g}$ is the number of units in partition cell $P_g$.
Assumption 4.

(i) As $N, T \to \infty$,
$$\min_{1 \leq G < G^0} \inf_{P(G) \in \mathcal{P}(G)} \hat{e}_{P(G)} \xrightarrow{p} \underline{e} > e_{G^0},$$
where $e_{G^0} = \operatorname*{plim}_{N, T \to \infty} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \beta^0_{g_i^0}(\tau_k) \big)$.

(ii) As $N, T \to \infty$, $f(N, T) \to 0$ and $\sqrt{NT} f(N, T) \to \infty$.
(ii) As N, T →∞, f(N, T )→ 0 and√NTf(N, T )→∞.
Assumption 4(i) implies that the estimation error delivered by any underfitted model is larger than that of the true model; it thus ensures that $\hat{G}$ is chosen to be at least $G^0$. Assumption 4(ii), by quantifying the asymptotic magnitude of $f(N, T)$, ensures that $\hat{G}$ cannot exceed $G^0$.

Theorem 5.1. Suppose that Assumptions 1–4 hold and that, for any $\delta > 0$, $N / (\sqrt{T} \exp(\delta T)) \to 0$ as $N, T \to \infty$. Then $P(\hat{G}(\tau_k) = G^0) \to 1$.
Next, we choose the maximum number of groups over all quantiles under consideration:
$$\hat{G} = \arg\max_{k = 1, \dots, K} \hat{G}(\tau_k). \qquad (5.3)$$
This procedure avoids underspecifying the number of groups since the consistent selection criterion in (5.2) is computed at each quantile separately, with no summation over quantiles. Hence, we can select the right number of groups even when the groups are not separately identified at some (but not all) quantiles.
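The two-step rule in (5.1)–(5.3) can be sketched for an intercept-only model as follows. This is our own simplified implementation, not the authors' algorithm: the iterative assignment/update routine, the block initialization, and all names are our choices, and the penalty is the $f(N, T) = 0.1\log(NT)/\sqrt{NT}$ used later in the simulations.

```python
# Sketch of the two-step group-number selection rule (5.1)-(5.3) for an
# intercept-only model at each single quantile (our own simplified code).
import math

def rho(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantile(values, tau):
    s = sorted(values)
    return s[min(len(s) - 1, int(tau * len(s)))]

def fit_groups(panel, G, tau, n_iter=50):
    """Intercept-only group quantile fit: iterate assignment and update steps.

    panel: list of N lists, each holding the T observations of one unit.
    Returns the total check-function loss at the final grouping.
    """
    N = len(panel)
    # initialize by sorting units on their own tau-quantile into G blocks
    order = sorted(range(N), key=lambda i: quantile(panel[i], tau))
    labels = [0] * N
    for rank, i in enumerate(order):
        labels[i] = rank * G // N
    for _ in range(n_iter):
        # update step: group coefficient = tau-quantile of pooled group obs
        beta = []
        for g in range(G):
            obs = [y for i, ys in enumerate(panel) if labels[i] == g for y in ys]
            beta.append(quantile(obs, tau) if obs else 0.0)
        # assignment step: each unit joins the group minimizing its check loss
        new = [min(range(G), key=lambda g: sum(rho(y - beta[g], tau) for y in ys))
               for ys in panel]
        if new == labels:
            break
        labels = new
    return sum(rho(y - beta[labels[i]], tau)
               for i, ys in enumerate(panel) for y in ys)

def select_G(panel, taus, Gmax=4):
    """Quantile-specific IC minimization (5.2), then the max rule (5.3)."""
    N, T = len(panel), len(panel[0])
    f = 0.1 * math.log(N * T) / math.sqrt(N * T)   # penalty from Section 7
    p = 0                                          # intercept-only: p + 1 = 1
    G_hat = []
    for tau in taus:
        ics = [fit_groups(panel, G, tau) / (N * T) + G * (p + 1) * f
               for G in range(1, Gmax + 1)]
        G_hat.append(1 + ics.index(min(ics)))
    return max(G_hat)
```

Because the per-quantile criterion is minimized separately and only the maximum is kept, a quantile at which the groups are indistinguishable (returning $\hat{G}(\tau_k) = 1$) cannot drag the final choice below the true number.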
The estimated number of groups at each quantile level also suggests a way to determine the weights in the composite quantile objective function. If $\hat{G}(\tau_k) = 1$ for some $\tau_k$ (e.g., $\tau_k = 0.5$) but is larger than 1 at other quantiles, this suggests that we should not use these non-informative quantiles for clustering (i.e., we set $w_k = 0$). As long as the quantile $\tau_k$ contains any clustering information (i.e., $\hat{G}(\tau_k) > 1$), we recommend including it in the composite quantile check function for estimating the group memberships, even though $\hat{G}(\tau_k)$ may differ across $\tau_k$. The motivation for this approach is that the composite approach helps us obtain quantile-invariant group memberships, avoids the difficulty of relabeling groups across quantile levels, and improves the clustering accuracy.
One problem with this procedure is that it may overestimate the number of groups in finite samples if one quantile returns an incorrectly large number of groups, which typically occurs when there are limited observations at some tail quantiles. The cost of overestimation is comparatively lower than that of underestimation, since it does not harm the consistency of the coefficient estimates but only sacrifices efficiency and induces small-sample bias (Bonhomme and Manresa, 2015b; Liu et al., 2018). A solution to this problem is to skip some of the extreme quantiles, say 0.1 and 0.9, that have limited observations.
6 Models with individual fixed effects
In this section, we extend the benchmark model (3.1) to allow for individual-specific fixed effects (rather than group fixed effects):
$$Q_{\tau_k}(y_{it} \mid x_{it}) = \alpha_i(\tau_k) + x_{it}' \beta_{g_i}(\tau_k), \quad i = 1, \dots, N, \; t = 1, \dots, T, \; k = 1, \dots, K, \qquad (6.1)$$
where $\alpha_i(\tau_k)$ represents the time-invariant individual fixed effect, potentially correlated with the regressors, $x_{it}$ is the $p \times 1$ regressor vector without a constant term, $\beta_{g_i}(\tau_k) \in \mathcal{B} \subset \mathbb{R}^p$, and $\alpha_i(\tau_k) \in \mathcal{A} \subset \mathbb{R}$. By allowing the unobserved heterogeneity to be individual-specific, (6.1) provides a richer context of heterogeneity than (3.1). To derive the weak consistency of the regression quantile estimates under unknown group memberships, we need to impose a more restrictive asymptotic relation between $N$ and $T$ (i.e., $\log N / \sqrt{T} \to 0$) because of the incidental parameters arising from the individual fixed effects. We employ the assumption structure of Kato et al. (2012) for the weak consistency.
Assumption 5.

(i) $\mathcal{A} \subset \mathbb{R}$ and $\mathcal{B} \subset \mathbb{R}^p$ are compact.

(ii) For each $i \geq 1$, the process $\{(y_{it}, x_{it}) : t \geq 1\}$ is strictly stationary and $\beta$-mixing with mixing coefficient $\beta_i(t)$. Assume that there exist constants $a \in (0, 1)$ and $B \geq 0$ such that $\sup_{i \geq 1} \beta_i(t) \leq B a^t$. Furthermore, the processes $\{(y_{it}, x_{it}) : t \geq 1\}$ are independent across $i$.

(iii) The minimum eigenvalues of $E[f_{i,\tau_k}(0 \mid x_{it})(1, x_{it}')'(1, x_{it}')]$ are bounded away from zero uniformly over $i \geq 1$.
Theorem 6.1. If $(\log N) / \sqrt{T} \to 0$ as $N, T \to \infty$, and Assumptions 1(iv), 2(v), and 5 hold, then we have, for all $g \in \{1, \dots, G\}$ and any $k = 1, \dots, K$,
$$\max_{1 \leq i \leq N} \|\hat{\beta}_{\hat{g}_i}(\tau_k) - \beta^0_{g_i^0}(\tau_k)\| \xrightarrow{P} 0. \qquad (6.2)$$
With the help of Theorem 6.1, we can obtain results similar to Theorem 4.2 and Corollary 1 for the individual fixed effects model. In particular, Theorem 6.1 implies that the difference between the regression quantile estimates under the estimated group memberships and those under the true group memberships decays exponentially, which in turn implies the asymptotic normality of the regression quantile estimates with unknown group memberships. Furthermore, as noted by Kato et al. (2012), the estimation of regression quantiles suffers from finite-sample bias in the presence of individual fixed effects. To improve the finite-sample performance, one could apply a smoothing technique for bias correction similar to that suggested in Galvao and Kato (2016), but this is beyond the scope of this study.
7 Monte Carlo simulation
In this section, we evaluate the finite-sample performance of the proposed method. In particular,
we examine whether PSQR can correctly classify individual units as well as effectively recover
the quantile-specific slope coefficients within each group. By comparing the PSQR model with
mean-based or single-quantile-based clustering, we also shed light on the importance of taking
the entire distribution into account when clustering.
7.1 Data generation process
We consider four data generation processes (DGPs), differing in the distributions of errors and
degree of group separation.
DGP.1: The first and benchmark DGP is the typical location-scale shift model:
$$y_{it} = \alpha_{g_i} + \beta_{g_i} x_{it} + (1 + \gamma x_{it}) \varepsilon_{it}, \quad g_i = 1, \dots, G^0, \qquad (7.1)$$
where we set $\gamma = 0.5$. We follow Kato et al. (2012) and generate $x_{it} = 0.3 \alpha_{g_i} + z_{it}$, where $z_{it}$ is independently and identically drawn from $\chi_5^2$. The error term $\varepsilon_{it}$ is i.i.d. standard normal. There are three groups ($G^0 = 3$) containing $N_1$, $N_2$, and $N_3$ individual units, respectively, with $N_1 + N_2 + N_3 = N$. We fix the ratio among the groups at $N_1 : N_2 : N_3 = 0.3 : 0.3 : 0.4$. The intercepts and slope coefficients of the three groups are
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1.75, 2.5)'.$$
In this case, the groups are well separated in their means, while the shape of their distributions is common (see Figure 2(a) for the density function of $\alpha_{g_i} + \varepsilon_{it}$ for the three groups). Thus, both the mean-based clustering estimator and the PSQR estimator (which classifies based on the quantiles) are expected to identify the groups.
DGP.2: We then consider the case where the groups are less clearly separated, with less discrepancy between the coefficient parameters. We generate the data with
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1.5, 2)'.$$
The explanatory variable and error term are generated in the same way as in DGP.1. This case allows us to examine how the mean-based clustering and PSQR estimators perform when the distributions of the groups are less separable.
DGP.3: In practice, the groups might differ not only in their means but also in the shapes of their distributions. To mimic this situation, we consider heterogeneous slope coefficients and allow the distribution of the error term to vary across groups. The errors of the three groups are generated from three distribution families, namely the normal, Gamma, and Weibull distributions. By appropriately setting the parameters of these distributions and subtracting the theoretical means, the errors of the three groups follow heterogeneous distributions with mean zero but distinct tail behavior. In particular, we generate
$$\varepsilon_{it} \sim \begin{cases} \text{i.i.d. } N(0, 1) & \text{if } g_i = 1, \\ \text{i.i.d. } \mathrm{Gamma}(sh_G, sc_G) - E[\mathrm{Gamma}(sh_G, sc_G)] & \text{if } g_i = 2, \\ \text{i.i.d. } \mathrm{Weibull}(sh_W, sc_W) - E[\mathrm{Weibull}(sh_W, sc_W)] & \text{if } g_i = 3. \end{cases}$$
Here, $sh_G$ and $sh_W$ are the shape parameters of the Gamma and Weibull distributions, and $sc_G$ and $sc_W$ are the respective scale parameters. We set $(sc_G, sh_G) = (5, 1)$ and $(sc_W, sh_W) = (1, 3)$. Apart from the errors, the remaining variables and parameters are the same as in DGP.1. Although the three groups are characterized by distinct means and distributions, group separation in this case is not necessarily stronger than in DGP.1, since the densities of Groups 1 and 2 overlap more, especially around the mean (see Figure 2(c)). Moreover, the distributional heterogeneity between the clusters can only be captured by PSQR, not by mean-based clustering.
DGP.4: We consider the scenario where the groups are distinguished only by the shapes of their distributions and not by their mean values. This is potentially the most difficult case for classification because the difference between the groups is even less pronounced. We generate homogeneous intercepts and slope coefficients:
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1, 1)'.$$
The error term is generated in the same way as in DGP.3, with its distribution varying across the three groups. Since each group's error has a zero mean (because of the demeaning), the means of the coefficients are identical (see Figure 2(d)).
For each DGP, we consider two cross-sectional sample sizes, $N \in \{50, 100\}$, and two time-series lengths, $T \in \{30, 60\}$, leading to four combinations of the cross-sectional and time-series dimensions. The number of replications is set to 1000.
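For concreteness, DGP.1 can be generated in a few lines. This is a sketch under the stated parameter values (we draw $\chi_5^2$ as a sum of five squared standard normals, and all function and variable names are our own):

```python
# Sketch of the DGP.1 generator (our own code, under the stated parameters).
import random

def simulate_dgp1(N, T, gamma=0.5, seed=0):
    """y_it = alpha_g + beta_g * x_it + (1 + gamma * x_it) * eps_it."""
    rng = random.Random(seed)
    coef = {1: 1.0, 2: 1.75, 3: 2.5}      # alpha_g = beta_g in DGP.1
    sizes = [int(0.3 * N), int(0.3 * N)]
    sizes.append(N - sum(sizes))           # N1 : N2 : N3 = 0.3 : 0.3 : 0.4
    groups = [g for g, n in zip((1, 2, 3), sizes) for _ in range(n)]
    y, x = [], []
    for g in groups:
        yi, xi = [], []
        for _ in range(T):
            z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(5))  # chi^2_5 draw
            x_it = 0.3 * coef[g] + z
            eps = rng.gauss(0.0, 1.0)      # i.i.d. standard normal error
            yi.append(coef[g] + coef[g] * x_it + (1 + gamma * x_it) * eps)
            xi.append(x_it)
        y.append(yi)
        x.append(xi)
    return y, x, groups
```

The other DGPs differ only in the coefficient values and the error draws (demeaned Gamma and Weibull variates for DGP.3 and DGP.4).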
FIGURE 2
7.2 Implementation and evaluation
The PSQR estimator minimizes the composite objective function with different quantile levels
and thus classification is based on the entire distribution. To implement our method, we con-
sider two specifications of quantiles, a narrow and a wide range of quantiles; the corresponding
estimators are denoted by PSQRnarrow and PSQRwide. In DGP.1–DGP.3, we consider the narrow range of quantiles $\tau \in \{0.4, 0.5, 0.6\}$, which concentrates on the central part of the distribution, and the wide range of quantiles $\tau \in \{0.1, 0.2, \dots, 0.9\}$, which spans the entire distribution. In DGP.4, the narrow range of quantiles is specified as $\tau \in \{0.1, 0.2, 0.8, 0.9\}$, since heterogeneity only appears in the tails (according to the IC at each quantile), while the wide range remains the same. These two specifications of quantiles allow us to examine how the range of quantiles affects the classification accuracy and the coefficient estimation.
We compare our PSQR estimator with the group fixed effects (GFE) type estimator (Bon-
homme and Manresa, 2015a) and single-quantile group estimators using the best quantile. The
GFE-type estimator minimizes the least squares objective function, and thus its classification
is solely based on the mean. We also consider the single-quantile estimators obtained by mini-
mizing the PSQR objective function (3.3) but with only one quantile level. This estimator can
also be regarded as an extension of Gu and Volgushev (2018)’s quantile panel group fixed effects
estimator, allowing for both the intercept and the slope coefficients to exhibit a group pattern
of heterogeneity. We report the single-quantile estimator at the best quantile chosen ex post
based on their classification accuracy. This is, of course, not feasible in practice since we do not
know ex ante which quantile is the most informative for clustering. Both the GFE-type and the
single-quantile group estimators can be regarded as types of “limited information” estimators,
as they only employ information at a single point of the distribution. By comparing PSQR
with these two limited information estimators, we shed light on how distributional information
contributes to group membership estimation.
We evaluate the performance of the proposed method based on selecting the right number
of groups, clustering, and the coefficient estimates across quantiles. First, we examine how
the IC-based procedure performs in determining the number of groups. To compute the IC in
practice, we find that f(N, T ) = 0.1 log(NT )/√NT works fairly well based on a large number
of experiments with many alternatives, and we employ this penalty in all simulations and the
application. Our penalty term in (5.1) is comparable to the BIC proposed by Bonhomme and
Manresa (2015b). Performance is evaluated by the empirical probability of selecting a particular
number. Second, we measure clustering accuracy by taking the average of the misclassification
frequency (gi 6= g0i ) across replications. Let I(·) be the indicator function. The misclassification
frequency is the ratio of misclassified units to the total number of units:
MF =1
N
N∑i=1
I(gi 6= g0i ).
Finally, we evaluate the accuracy of the coefficient estimates at each quantile by their root mean squared error (RMSE) and by the coverage probability of the two-sided nominal 95% confidence interval. The overall RMSE for all units is
$$\mathrm{RMSE}(\hat{\beta}(\tau)) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big[ \hat{\beta}_{\hat{g}_i}(\tau) - \beta^0_{g_i^0}(\tau) \big]^2}.$$
The coverage probability is computed as
$$\mathrm{Coverage}(\hat{\beta}(\tau)) = \frac{1}{N} \sum_{i=1}^{N} I\big( LC_i(\tau) \leq \beta^0_{g_i^0}(\tau) \leq UC_i(\tau) \big),$$
where $LC_i(\tau)$ and $UC_i(\tau)$ are the lower and upper bounds of the 95% confidence interval for $\hat{\beta}_{\hat{g}_i}(\tau)$, based on the Huber sandwich estimate of the standard deviation. Because the RMSE and coverage probability are based on the coefficient estimates of the individual units, they are both affected by the classification accuracy.
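The evaluation metrics are straightforward to code. The sketch below is our own; the minimization over label permutations is a practical detail we add because group labels are only identified up to relabeling.

```python
# Sketch of the evaluation metrics (our own code).
import math
from itertools import permutations

def misclassification_frequency(est, true, G):
    """MF = min over relabelings of (1/N) * #{i : g_hat_i != g_i^0}."""
    N = len(est)
    best = N
    for perm in permutations(range(G)):
        errors = sum(perm[e] != t for e, t in zip(est, true))
        best = min(best, errors)
    return best / N

def rmse(beta_hat, beta_true):
    """RMSE of the unit-level coefficient estimates at one quantile."""
    N = len(beta_hat)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true)) / N)
```

Because `rmse` compares each unit's estimate against its own true group coefficient, a single misclassified unit inflates both metrics, which is why the RMSE and coverage results below track the classification accuracy.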
7.3 Results
Determining the number of groups
As the classification and coefficient estimation both rely on the choice of the number of groups,
we first examine how the IC-based procedure performs in determining the number of groups. We
use the IC defined in (5.2) to select the number of groups at each quantile τk = 0.1, 0.2, . . . , 0.9
and then choose the maximum number over the quantiles as in (5.3). Table 1 provides the
empirical probability of selecting a particular number of groups, ranging from G = 1 to 5.
Recall that the true number of groups is 3.
TABLE 1
Table 1 shows that our method can effectively detect the right number of groups and that
the frequency of selecting the right number generally increases with the time dimension. As
expected, when the number of groups is misspecified, it is more likely to be overestimated than
to be underestimated. In DGP.1 and DGP.3, where the means of the groups are well separated,
the empirical probability of selecting three groups is above 96% in all cases. In DGP.2, where
the mean is less separated, the method has a larger probability of underestimating the number
of groups, but only when N and T are small. Even in this case, the probability of selecting the
right number is still above 82%, and this probability quickly increases as $N$ or $T$ rises. When the coefficients of the three groups share a common mean in DGP.4, our method correctly detects three groups in 98% of the cases even when $T = 30$, owing to the sharp separation of the groups in the tails. We further examine the selected number of groups at each quantile (not reported) in this case. We find that the IC selects one group at the central quantiles, namely $\tau \in \{0.4, 0.5, 0.6\}$, and three groups only at the tail quantiles, namely $\tau \in \{0.1, 0.2, 0.8, 0.9\}$. This leads to the final result of three groups after the maximization step in (5.3). The quantile-specific IC results also suggest estimating the three-group structure using only the tail quantile levels $\{0.1, 0.2, 0.8, 0.9\}$, since no heterogeneity is exhibited at the central quantiles.
Classification accuracy
Next, we examine classification performance. Table 2 presents the misclassification rate of
the four estimators: the GFE-type estimator, single-quantile estimator, and PSQR estimators
based on the two specifications of quantiles. We compute the single-quantile estimator at each
specific quantile level ranging from 0.1 to 0.9 and report the best one in terms of the lowest
misclassification rate. We refer to this estimator as SQbest.
TABLE 2
In general, we find that PSQR produces a lower misclassification rate than GFE in all cases
and that increasing the time dimension significantly reduces the misclassification rate for all
methods. In particular, in DGP.1, where groups are well separated by their means and have a
common error distribution, all methods lead to accurate classification. Although the misclassification rate of all methods is generally low, the group membership estimate of PSQRnarrow is twice as accurate as that of GFE, and it also slightly outperforms the best single-quantile estimator, obtained at $\tau_{\text{best}} = 0.5$. Using a wide range of quantiles, $\tau \in \{0.1, 0.2, \dots, 0.9\}$, further halves the misclassification rate of PSQR compared with using fewer quantiles. This result suggests that employing information at multiple quantiles improves the classification when a group pattern of heterogeneity is common across different quantiles of the distribution.
When the groups are less separated in their means as in DGP.2, classification is more diffi-
cult and the misclassification rate of all methods increases as expected. The GFE-type method
misclassifies around 6% of individuals when T = 30. SQbest (with τbest = 0.5) leads to a mis-
classification rate of around 4% when T = 30, close to the level of PSQRnarrow. Nevertheless, if
we take advantage of the whole distribution by using a wider range of quantiles, we manage to
reduce the misclassification rate to roughly 2.5% under small T . Interestingly, increasing the
time dimension disproportionately improves the classification accuracy of the different meth-
ods. The improvement in PSQRwide is the most pronounced compared with the other three
estimators because it employs more quantiles and the information at each quantile becomes
more precise as T increases.
In DGP.3, classification turns out to be even harder since the densities of Groups 1 and 2
overlap more (see Figure 2 (c)). In this case, the single-quantile method (with τbest = 0.5) pro-
duces the least accurate group membership estimates, with a misclassification rate of more than
9% when T = 30 and 4% when T = 60. The misclassification rates of GFE and PSQRnarrow
are around 8% under small T and 3% under large T . In contrast, PSQRwide can better separate
groups by exploring the heterogeneity in the whole distribution, and the misclassification rate
is around 2.5% when T = 30, more than three times lower than that for GFE. The improvement in classification accuracy is even more sizeable as the time dimension increases: when
T = 60, the misclassification rate of PSQRwide is around 0.5%, at least seven times lower than those for GFE and PSQRnarrow.
Finally, we consider DGP.4, where the means of groups are identical. In this case, the
performance of GFE is particularly poor with a misclassification rate above 50%. The high
misclassification rate of GFE does not decrease as T increases. In contrast, PSQR can
identify the correct group structure. PSQRwide outperforms GFE by roughly 70% when T = 30,
with a misclassification rate below 17%, and classification accuracy dramatically improves as
T increases. As suggested by the IC at each quantile, individual units are best classified into
three groups only at τ ∈ {0.1, 0.2, 0.8, 0.9}. Hence, we consider PSQRnarrow, which is based
on these tail quantiles. We can also view PSQRnarrow here as an unequally weighted average
of the composite quantiles, where each of the four quantiles (0.1, 0.2, 0.8, and 0.9) receives a
weight of one-quarter and the others have a weight of zero. We find that classification accuracy
is further improved compared with PSQRwide; the misclassification rate is around 6% when
{N, T} = {50, 30} and decreases to below 4% as either N or T doubles. This is not surprising
because the central quantiles do not contain group information and thus incorporating these
quantiles into the composite function does not help identify groups (it only adds noise). In this
DGP, the best single-quantile estimator is obtained at τbest = 0.1. Compared with SQbest, the
rate of PSQRnarrow is lower when T = 30, suggesting the benefits of using composite quantiles
when T is relatively small. When T is large, SQbest performs similarly to PSQRnarrow. This is
because 0.1 is the most informative quantile for identifying the three groups, while PSQRnarrow
uses three additional quantiles (τ = 0.2, 0.8, 0.9), which contribute little additional information
for group separation. Although SQbest sometimes performs as well as PSQR under large T, the
single-quantile approach is not recommended since we do not know ex ante which quantile is
the most informative for clustering and the estimation accuracy of the other quantiles is not
guaranteed. This is confirmed by the performance of the single-quantile approach using the
second-best quantile τ = 0.9; indeed, its misclassification rate is much higher than PSQRnarrow
for all sample sizes.
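The weighting interpretation above can be made concrete. The snippet below is a minimal numpy sketch (not the authors' implementation) of a weighted composite check-function loss: equal weights over τ ∈ {0.1, . . . , 0.9} correspond to PSQRwide, while weight 1/4 on each of τ ∈ {0.1, 0.2, 0.8, 0.9} and zero elsewhere corresponds to PSQRnarrow in DGP.4. The data and "fitted" quantiles are illustrative stand-ins.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def composite_loss(y, fitted_q, taus, weights):
    """Weighted composite quantile loss; fitted_q[k] is the fitted tau_k-quantile."""
    return sum(w * check_loss(y - q, tau).mean()
               for tau, q, w in zip(taus, fitted_q, weights))

rng = np.random.default_rng(0)
y = rng.normal(size=1000)

taus = np.arange(0.1, 1.0, 0.1)
q_fit = np.quantile(y, taus)  # stand-in for fitted regression quantiles

# PSQRwide: equal weight 1/9 on all nine quantiles
w_wide = np.full(9, 1 / 9)
# PSQRnarrow (DGP.4): weight 1/4 on tau in {0.1, 0.2, 0.8, 0.9}, zero elsewhere
w_narrow = np.where(np.isin(np.round(taus, 1), [0.1, 0.2, 0.8, 0.9]), 0.25, 0.0)

loss_wide = composite_loss(y, q_fit, taus, w_wide)
loss_narrow = composite_loss(y, q_fit, taus, w_narrow)
```

Zeroing out the central-quantile weights drops exactly the terms that, per the discussion above, add noise without group information.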
Accuracy of the regression quantile estimates
Finally, we compare the accuracy of the regression quantile estimates. Table 3 presents the
bias, RMSE, and coverage probability of the three methods (GFE, best single-quantile, and
PSQR). To save space, we only report the statistics of the slope coefficient βgi produced by
PSQRwide among the two versions of PSQR in DGP.1–DGP.3 and PSQRnarrow in DGP.4. The
results of the intercept estimate αgi are qualitatively similar.
TABLE 3
In DGP.1, all three methods provide accurate coefficient estimates. Although the discrepancy between the three methods is marginal, we find that PSQRwide produces the smallest RMSE
and the most accurate coverage probability (closest to the nominal 95%) at τ = 0.5 compared with
the GFE and single-quantile estimators. As we move to the tail quantiles (i.e., when τ is close to
0 or 1), PSQRwide still works well, although less satisfactorily than at the central quantiles,
as in most quantile regression exercises.
In DGP.2, the difference between the three methods is larger due to diverse classification
performance. Since PSQRwide produces the most accurate group membership estimates, its
coefficient estimates are unsurprisingly more accurate than those of the other two methods. Similar results are observed in DGP.3, where we find that GFE is rather unsatisfactory (coverage
probability below 90% when T = 30). In contrast, PSQRwide continues to perform well
with a smaller RMSE and reasonable coverage probabilities.
Finally, in DGP.4, GFE completely breaks down, with a coverage probability far from
the nominal level. PSQRnarrow based on informative quantiles and SQbest both produce fairly
good coefficient estimates. We compare SQbest with PSQRnarrow at τ = 0.1 and find a smaller
RMSE and better coverage probability for PSQRnarrow when T = 30. When T = 60, the
coefficient estimates of SQbest have a better coverage probability than PSQRnarrow, but at the
cost of a larger RMSE. By comparing the second-best single-quantile estimator (τsecond best =
0.9) with PSQRnarrow at τ = 0.9, we can see that the former is strictly dominated by the latter,
again confirming the unstable performance of the single-quantile estimator.
In general, the classification results strongly favor PSQR using composite quantiles. If
the means are well separated, using a wide range of quantiles provides more accurate group
membership estimates compared with the GFE and single-quantile approaches, both of which
only employ limited distributional information. If there is no heterogeneity in the mean but only
in the tails, it is beneficial to consider PSQR based on only “informative” quantiles. Although
the single-quantile approach sometimes outperforms PSQR, in practice, which quantile is the
most informative is not a priori known, and there is no guarantee that the best single quantile
is chosen. For central quantiles without coefficient heterogeneity, we can thus estimate the
coefficient parameters by using standard quantile regressions.
8 Empirical application: Output effect of infrastructure
capital
In this section, we apply the PSQR method to investigate the effect of infrastructure capital
on aggregate output. The role of infrastructure capital in the local economy has long been
a central concern in macroeconomics given its strong policy implications. For example, the
US government increased infrastructure expenditure to counteract the recent recession (Leeper
et al., 2010). Hence, understanding the contribution of infrastructure expenditure to output is
crucial for evaluating the effectiveness of this fiscal stimulus. Scholarly attempts to quantify the
output effects of infrastructure have grown rapidly since the influential work of Aschauer (1989);
however, previous studies lack consensus and their empirical results are diverse depending on
the datasets and empirical methodologies employed (see the references in Calderon et al. (2015)
and Romp and De Haan (2007)).
We employ the cross-country panel dataset presented by Calderon et al. (2015) that covers
88 countries over 1960–2010. As suggested by Calderon et al. (2015), countries are likely to
differ significantly in their output elasticity of infrastructure because of their different levels of
technology development, institutions, demographic features, and so on (see also Gregoriou and
Ghosh (2009)). Ignoring such heterogeneity results in biased estimates of the output effect of
infrastructure. We assume that the countries are characterized by a group pattern of hetero-
geneity in the sense that the effect of infrastructure is common within a group, and we then
estimate the number of groups and group memberships from the data. This assumption allows
us to capture the potential similarity of countries, better understand the sources of hetero-
geneity, and obtain more efficient estimates than individual country estimates, especially given
the short time span. Even if a group pattern of heterogeneity is allowed, the (conditional)
distributional effect within a group may not necessarily be uniform, and the conditional mean
effect does not provide a complete picture. For countries in the same group, the effect of infras-
tructure could vary markedly depending on their economic statuses, such that the estimated
coefficient of infrastructure is not constant at different quantiles of output.
To capture the distributional effect of infrastructure that varies across groups, we consider
the following PSQR model:
Q_τ(O_it | f_{i,t−1}) = β_{1,g_i}(τ) C_{i,t−1} + β_{2,g_i}(τ) S_{i,t−1} + β_{3,g_i}(τ) Z_{i,t−1},    (8.1)
where Oit is real output measured by the logarithm of real GDP per worker in 2000 PPP (pur-
chasing power parity) US dollars and fit = (Cit, Sit, Zit) is the set of explanatory variables. In
this set, Cit is the logarithm of physical capital per worker constructed by using the perpetual
inventory method, Sit is human capital proxied by the average years of secondary schooling
of the population, and Zit is the logarithm of physical infrastructure per worker measured by
a synthetic index that summarizes three infrastructure dimensions (telecommunications, electric power, and roads) using a principal component method (see Calderon et al. (2015) for
the details of the definition and construction of the variables). This model is a reduced-form
representation of the structural aggregate production function with constant returns to scale.
All the variables are first differenced to remove unit roots following Calderon et al. (2015).
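To fix ideas, the sketch below estimates group-specific coefficients at several quantiles, as in a model of the form (8.1), assuming group memberships are already known. The data, dimensions, coefficient values, and the simple subgradient solver are all illustrative stand-ins; in practice quantile regression is solved by linear programming.

```python
import numpy as np

def quantreg_subgrad(X, y, tau, lr=0.05, iters=3000):
    """Minimize the mean check loss by subgradient descent.

    A toy stand-in for the LP-based solvers used in real applications."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        u = y - X @ beta
        # subgradient of mean rho_tau(y - X beta) is -X'(tau - 1{u<0})/n
        beta += lr * X.T @ (tau - (u < 0)) / len(y)
    return beta

rng = np.random.default_rng(0)
N, T, p = 60, 40, 3                     # hypothetical panel dimensions
groups = np.repeat([0, 1], N // 2)      # assumed (already estimated) memberships
unit = np.repeat(np.arange(N), T)
X = rng.normal(size=(N * T, p))         # stand-ins for the regressors (C, S, Z)
beta_true = {0: np.array([0.3, 0.1, 0.0]), 1: np.array([0.3, 0.6, 0.1])}
y = np.array([X[j] @ beta_true[int(groups[unit[j]])] for j in range(N * T)])
y += 0.2 * rng.normal(size=N * T)

# pooled quantile regression within each group, at each quantile level
taus = [0.25, 0.5, 0.75]
est = {(g, tau): quantreg_subgrad(X[groups[unit] == g], y[groups[unit] == g], tau)
       for g in (0, 1) for tau in taus}
```

Each group pools its own N_g × T observations, which is the source of the efficiency gain over country-by-country estimation mentioned above.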
TABLE 4
We first examine the median effect of the output determinants by using PSQR with G = 1
and compare this with the estimates of Calderon et al. (2015) that focused on the homogeneous
mean effect (see the first two columns of Table 4). The estimated median effects of the capital
stock and human capital produced by PSQR are 0.29 and 0.09, respectively, both of which are
in the range of previous estimates in the empirical macroeconomic literature and close to the
values reported by Calderon et al. (2015) (0.34 for capital stock and 0.10 for human capital).
The PSQR estimate of the infrastructure effect is 0.08, identical to the estimate provided by
Calderon et al. (2015) and close to the value reported by Wu et al. (2017) using a different
dataset.^7
FIGURE 3
The median or mean effect, however, does not give the full picture of the distribution, which
may differ across countries. Hence, we examine the distributional effect at the other quantiles
and allow cross-country heterogeneity in the distribution. We determine the number of groups
by using the IC-based procedure discussed in Section 5. We allow the maximum number
of groups to be six and compute the IC at quantiles ranging from 0.1 to 0.9 in steps of 0.1.
The IC suggests that two groups exist. Hence, we estimate (8.1) using G = 2.^8 Figure 3
displays the estimated group patterns found by using PSQR. Interestingly, both geographic
and economic features play a role in the clustering outcome. Group 1 consists of 54 countries,
including most Asian and European countries, whereas Group 2 contains Canada, the United
States, Australasian countries, and some coastal African countries.
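The group-number selection can be sketched generically. Below is a toy BIC-type criterion in the spirit of those used for latent-group panels (the paper's exact IC is defined in Section 5 and not reproduced here): the fitted objective improves in G, and a penalty proportional to the number of group-specific coefficients offsets overfitting. The objective values and the penalty constant `lam` are invented for illustration.

```python
import numpy as np

def ic(obj_value, G, n_coef, penalty):
    """BIC-type information criterion: log objective plus a model-size penalty."""
    return np.log(obj_value) + penalty * G * n_coef

# Toy objective values: fit improves sharply up to the true G, then flattens.
fit = {1: 1.00, 2: 0.50, 3: 0.48, 4: 0.47, 5: 0.465, 6: 0.46}
lam, n_coef = 0.1, 1  # assumed penalty constant and coefficient count

G_hat = min(fit, key=lambda G: ic(fit[G], G, n_coef, lam))  # selects G = 2
```

The sharp drop in the objective from G = 1 to G = 2, followed by near-flat improvement, is exactly the pattern that leads the IC to select two groups here.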
The right panel of Table 4 presents the PSQR estimates under G = 2. Three important
results emerge from the analysis. First, we find a large degree of heterogeneity in the dis-
tributional effect of human capital and synthetic infrastructure at the central quantiles. In
particular, human capital has a significant effect on output in both groups, but the effect
is much larger in Group 2 than in Group 1. Further, the infrastructure effect is not significant
in Group 1 but is significantly positive in Group 2. This finding suggests that the countries in
Group 2 drive the positive and significant mean/median infrastructure effect when we estimate
a homogeneous panel (G = 1). However, this homogeneity assumption ignores the fact that
there are a large proportion of countries (in Group 1) where physical infrastructure contributes
little to output.
Second, the output effects of the three ingredients all vary dramatically across quantiles. At
lower quantiles, the effects of physical and human capital are negative or insignificant. However,
these effects become positive and strong as the quantile level increases. The output elasticity of
infrastructure is negative at the lower quantiles in Group 1 but positive in Group 2. Again, as
^7 They considered a Chinese provincial panel and reported an estimated output effect of infrastructure of 0.06.
^8 The drop in the IC from G = 1 to G = 2 is particularly large for the tail quantiles, say 0.1, 0.2, 0.8, and 0.9, while the change in the IC around the central quantiles is minor, suggesting that the heterogeneity is especially pronounced at the tails but not around the median of the distribution. We also estimate the model by restricting the central quantiles to have homogeneous coefficients, finding that the group membership structure estimated only based on the tails is largely similar to that estimated from the whole range of quantiles.
the quantile level increases, the elasticity of both groups turns positive and becomes statistically
more significant. In fact, the literature on the output effect of infrastructure does document pos-
sible opposing effects (see Bom and Ligthart (2008) for a review and meta-analysis). Although
infrastructure investment is generally expected to have a positive effect on output, the relation
could be negative because of overinvestment, environmental damage, or spillover effects. Inap-
propriate or excessive investment is particularly relevant in some developed European countries
where infrastructure capacity is relatively large, such as Germany and the Netherlands (Sturm
et al., 1999; Uhde, 2010). The negative effects of (transportation) infrastructure investment
could also arise from spillovers since the migration of labor and mobile capital may hurt the
economic development of competing neighboring regions. Again, such spillovers often occur
within the European Union, where factors of production move easily across borders (Cantos et al., 2005). This
finding explains the negative output effect of infrastructure in Group 1, which contains most
European countries, suggesting that when these countries are in a poor economic state, the
negative effect of infrastructure dominates the positive impact. At upper quantiles, the infras-
tructure effect is strongly positive for both groups, and the size is still within a reasonable level
as suggested by the literature, ranging from 0.07 to 0.17.
Finally, we find that the distributional effect of physical and human capital is more dispersed
in Group 2 than in Group 1. This result is expected since Group 2 contains highly developed
countries such as the United States, Canada, and Australia as well as several African countries.
The heterogeneity in the shape of the distribution also contributes to the clustering outcome.
In general, we find a large degree of heterogeneity in the output effect of the three ingredients
both across groups and across quantile levels. The two groups of countries differ not only in their
median effects of the output ingredients, but also in their shapes of the distributional effect.
This distributional heterogeneity cannot be captured by standard clustering approaches, but is
well demonstrated by the PSQR procedure.
9 Conclusion and future research
This study offers a flexible yet parsimonious way of modeling the distributional heterogeneity of
the slope coefficients in panel data models. We model cross-sectional heterogeneity via a latent
group pattern, such that individual units in a group share a common conditional distribution.
The conditional distributions of the groups can differ not (only) in their means but (also) in the
(tail) quantiles. We capture the distributional effect within each group by regression quantiles.
We propose a composite quantile approach to simultaneously estimate the group membership
structure and regression quantiles of each group. We also precisely quantify the convergence
rate of the misclassification probability and show that using multiple quantiles for clustering
improves the accuracy of the group membership estimates over existing methods in which
clustering is only based on the mean or some single quantile.
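A stylized version of this estimation loop, in the spirit of k-means rather than the authors' exact algorithm, alternates between (i) assigning each unit to the group whose quantile parameters minimize the unit's composite check loss and (ii) re-estimating each group's parameters from its assigned units. The intercept-only sketch below treats each group's "coefficients" as its quantiles; the data and initialization are illustrative.

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def unit_loss(y_i, center, taus):
    """Composite check loss of one unit against a group's quantile parameters."""
    return sum(check_loss(y_i - c, tau).mean() for c, tau in zip(center, taus))

def cluster_composite_quantile(Y, G, taus, iters=20):
    """Y: N x T panel. Intercept-only illustration of the two-step iteration."""
    N, _ = Y.shape
    idx = np.linspace(0, N - 1, G).astype(int)        # crude spread-out init
    centers = [np.quantile(Y[i], taus) for i in idx]
    labels = np.zeros(N, dtype=int)
    for _ in range(iters):
        # (i) assignment step: each unit joins its loss-minimizing group
        for i in range(N):
            labels[i] = np.argmin([unit_loss(Y[i], c, taus) for c in centers])
        # (ii) update step: recompute group quantiles on pooled group data
        centers = [np.quantile(Y[labels == g].ravel(), taus) for g in range(G)]
    return labels, centers

rng = np.random.default_rng(1)
N, T = 40, 50
g_true = np.repeat([0, 1], N // 2)                    # two well-separated groups
Y = rng.normal(loc=3.0 * g_true[:, None], scale=1.0, size=(N, T))

labels, centers = cluster_composite_quantile(Y, G=2, taus=[0.25, 0.5, 0.75])
```

Because the assignment step averages the check loss over all T observations and all quantiles in the composite, misclassification becomes rare as T grows, mirroring the convergence-rate result described above.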
Several issues deserve future research. First, we assume that group membership is invariant
across quantiles. It is an open question how to verify this assumption in practice. Given
the discrete and label-invariant features of the group membership parameter, it is difficult,
if not impossible, to obtain the variance of its estimate. Hence, direct tests based on the
group membership estimates at different quantiles seem infeasible. One possible approach is
to derive the confidence intervals of the group membership estimates at each quantile in the
vein of Dzemski and Okui (2018), and then check whether the intervals at different quantiles
overlap. Second, our model allows individual or group fixed effects, while sometimes it is useful
to consider unobserved two-way fixed effects. How to deal with two incidental parameters
in panel quantile regression with heterogeneous slope coefficients is an interesting question.
Finally, the computational cost is non-trivial when the number of individual units is large. It is
thus desirable to develop alternative algorithms that are faster and more stable in such settings.
Figure 1: Misclassification probability in the illustrative example

[Figure: three panels — (a) β_2^0 − β_1^0 = 1, σ_ε^2 = 1; (b) β_2^0 − β_1^0 = 0.5, σ_ε^2 = 1; (c) β_2^0 − β_1^0 = 1, K = 9]
Figure 2: Density of α_{g_i} + ε_it for the three groups in the simulation

[Figure: four panels, each showing the densities of Groups 1–3 — (a) DGP.1; (b) DGP.2; (c) DGP.3; (d) DGP.4]
Figure 3: Estimated group memberships (G = 2)

[Figure: countries shaded by estimated membership in Group 1 or Group 2]
Table 1: Group number selection frequency using IC when G0 = 3

                        DGP.1                                    DGP.2
N    T      1      2      3      4      5        1      2      3      4      5
50   30   0.000  0.000  0.964  0.036  0.000    0.000  0.172  0.828  0.000  0.000
50   60   0.000  0.000  0.991  0.009  0.000    0.000  0.000  0.996  0.004  0.000
100  30   0.000  0.000  0.997  0.003  0.000    0.000  0.000  0.989  0.011  0.000
100  60   0.000  0.000  0.998  0.002  0.000    0.000  0.000  0.993  0.007  0.000

                        DGP.3                                    DGP.4
50   30   0.000  0.011  0.983  0.006  0.000    0.000  0.000  0.983  0.017  0.000
50   60   0.000  0.000  0.994  0.006  0.000    0.000  0.000  0.998  0.002  0.000
100  30   0.000  0.000  0.978  0.022  0.000    0.000  0.000  0.984  0.016  0.000
100  60   0.000  0.000  0.989  0.011  0.000    0.000  0.000  1.000  0.000  0.000
Table 2: Misclassification frequencies

                              N = 50              N = 100
                         T = 30    T = 60    T = 30    T = 60
DGP.1  GFE               0.0066    0.0004    0.0068    0.0003
       SQbest            0.0030    0.0000    0.0028    0.0001
       PSQRnarrow        0.0028    0.0000    0.0024    0.0000
       PSQRwide          0.0014    0.0000    0.0012    0.0000
DGP.2  GFE               0.0606    0.0116    0.0583    0.0106
       SQbest            0.0451    0.0062    0.0423    0.0057
       PSQRnarrow        0.0434    0.0051    0.0394    0.0046
       PSQRwide          0.0257    0.0020    0.0244    0.0019
DGP.3  GFE               0.0874    0.0365    0.0859    0.0356
       SQbest            0.0977    0.0420    0.0904    0.0391
       PSQRnarrow        0.0865    0.0357    0.0796    0.0315
       PSQRwide          0.0268    0.0051    0.0230    0.0041
DGP.4  GFE               0.5296    0.5418    0.5123    0.5123
       SQbest            0.0719    0.0268    0.0603    0.0166
       SQsecond best     0.1283    0.0586    0.1138    0.0367
       PSQRnarrow        0.0612    0.0163    0.0370    0.0123
       PSQRwide          0.1423    0.0495    0.1344    0.0301

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, PSQRnarrow is based on τ ∈ {0.4, 0.5, 0.6}, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}.
Table 3: Bias, root mean squared error, and coverage probability of coefficient estimates

                   τ      Bias      RMSE    Coverage      Bias      RMSE    Coverage
                             {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.1  GFE                0.0004    0.3181   0.9291      −0.0033    0.2474   0.9385
       SQbest     0.5    −0.0004    0.2981   0.9719      −0.0026    0.2330   0.9661
       PSQRwide   0.1     0.0056    0.3310   0.9265       0.0010    0.2760   0.9374
                  0.3     0.0010    0.2945   0.9624      −0.0003    0.2401   0.9583
                  0.5    −0.0006    0.2905   0.9653      −0.0021    0.2330   0.9652
                  0.7    −0.0018    0.2931   0.9625      −0.0041    0.2414   0.9581
                  0.9    −0.0062    0.3325   0.9311      −0.0038    0.2780   0.9264
                             {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE               −0.0012    0.2474   0.9385       0.0015    0.2103   0.9464
       SQbest     0.5    −0.0007    0.2542   0.9715      −0.0035    0.1989   0.9634
       PSQRwide   0.1     0.0010    0.2760   0.9374       0.0040    0.2308   0.9483
                  0.3    −0.0003    0.2432   0.9583       0.0008    0.2069   0.9612
                  0.5    −0.0021    0.2371   0.9652       0.0005    0.2006   0.9639
                  0.7    −0.0041    0.2414   0.9581       0.0009    0.2046   0.9548
                  0.9    −0.0038    0.2780   0.9264      −0.0003    0.2335   0.9350
                             {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.2  GFE                0.0036    0.3930   0.9021       0.0015    0.2814   0.9356
       SQbest     0.5     0.0031    0.3644   0.9604       0.0006    0.2593   0.9572
       PSQRwide   0.1     0.0183    0.3775   0.9104       0.0043    0.2842   0.9293
                  0.3     0.0059    0.3462   0.9477       0.0028    0.2524   0.9535
                  0.5     0.0016    0.3408   0.9581       0.0006    0.2478   0.9517
                  0.7    −0.0018    0.3445   0.9539      −0.0006    0.2520   0.9588
                  0.9    −0.0085    0.3716   0.9257      −0.0030    0.2849   0.9298
                             {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                0.0015    0.3749   0.8849       0.0003    0.2570   0.9408
       SQbest     0.5     0.0009    0.3419   0.9475       0.0006    0.2289   0.9601
       PSQRwide   0.1     0.0122    0.3376   0.9298       0.0020    0.2414   0.9313
                  0.3     0.0060    0.3171   0.9518       0.0001    0.2157   0.9551
                  0.5     0.0004    0.3150   0.9594       0.0003    0.2109   0.9530
                  0.7    −0.0054    0.3183   0.9514      −0.0006    0.2174   0.9544
                  0.9    −0.0096    0.3382   0.9223      −0.0009    0.2407   0.9370

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, and PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}.
Table 3 (con’t): Bias, root mean squared error, and coverage probability of coefficient estimates

                       τ      Bias      RMSE    Coverage      Bias      RMSE    Coverage
                                 {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.3  GFE                    0.0036    0.5241   0.8939       0.0025    0.4132   0.9264
       SQbest         0.5     0.0130    0.4950   0.9371       0.0046    0.3915   0.9468
       PSQRwide       0.1     0.0095    0.3446   0.9324       0.0029    0.2730   0.9396
                      0.3     0.0080    0.3504   0.9630       0.0043    0.2714   0.9551
                      0.5     0.0083    0.3897   0.9615       0.0025    0.2906   0.9623
                      0.7     0.0020    0.4455   0.9551      −0.0014    0.3244   0.9498
                      0.9    −0.0235    0.5616   0.9204      −0.0076    0.4065   0.9331
                                 {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                    0.0040    0.5035   0.8891       0.0016    0.3949   0.9253
       SQbest         0.5     0.0074    0.4714   0.9150       0.0030    0.3680   0.9362
       PSQRwide       0.1     0.0049    0.2888   0.9396       0.0016    0.2272   0.9500
                      0.3     0.0077    0.3083   0.9592       0.0014    0.2286   0.9573
                      0.5     0.0053    0.3510   0.9542       0.0011    0.2483   0.9580
                      0.7     0.0016    0.4118   0.9506      −0.0004    0.2848   0.9497
                      0.9    −0.0168    0.5278   0.9231      −0.0031    0.3551   0.9434
                                 {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.4  GFE                   −0.0009    0.6458   0.3097       0.0050    0.5530   0.3429
       SQbest         0.1     0.0104    0.4343   0.9080       0.0075    0.3093   0.9418
       SQsecond best  0.9    −0.0183    0.5853   0.8907      −0.0136    0.4547   0.9193
       PSQRnarrow     0.1     0.0054    0.3881   0.9151       0.0057    0.2827   0.9186
                      0.2    −0.0035    0.3674   0.9231       0.0025    0.2660   0.9302
                      0.8     0.0020    0.4244   0.8984       0.0047    0.3333   0.9097
                      0.9    −0.0119    0.4968   0.8820      −0.0024    0.3790   0.8952
                                 {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                   −0.0062    0.6402   0.2972      −0.0020    0.5503   0.3021
       SQbest         0.1     0.0188    0.3980   0.9138       0.0008    0.2733   0.9400
       SQsecond best  0.9    −0.0129    0.5621   0.8505      −0.0066    0.4334   0.9183
       PSQRnarrow     0.1     0.0028    0.3198   0.9286      −0.0018    0.2477   0.9198
                      0.2    −0.0036    0.3085   0.9310      −0.0017    0.2390   0.9315
                      0.8     0.0019    0.3453   0.9200      −0.0020    0.2837   0.9301
                      0.9    −0.0089    0.4123   0.9183      −0.0031    0.3280   0.9032

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, and PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}.
Table 4: Output effect of physical capital, human capital, and infrastructure

          Calderon       G = 1                                  G = 2
    τ     et al. (2015)  0.5      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9

Group 1
C         0.34     0.29     0.02    0.10    0.15    0.21    0.24    0.28    0.31    0.37    0.49
          (0.01)   (0.02)   (0.05)  (0.04)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)  (0.05)
S         0.10     0.09    −0.50   −0.23   −0.12   −0.02    0.05    0.19    0.19    0.31    0.55
          (0.01)   (0.01)   (0.04)  (0.03)  (0.02)  (0.02)  (0.02)  (0.02)  (0.02)  (0.03)  (0.03)
Z         0.08     0.08    −0.16   −0.09   −0.08   −0.04   −0.01    0.09    0.07    0.07    0.09
          (0.01)   (0.01)   (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.02)  (0.02)  (0.02)
Group 2
C                          −0.03    0.14    0.16    0.24    0.28    0.31    0.34    0.38    0.46
                            (0.06)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)  (0.04)
S                          −0.60   −0.25    0.02    0.16    0.27    0.39    0.58    0.76    1.12
                            (0.05)  (0.05)  (0.04)  (0.03)  (0.03)  (0.03)  (0.04)  (0.04)  (0.05)
Z                           0.04    0.06    0.07    0.06    0.08    0.07    0.08    0.09    0.17
                            (0.01)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)

Notes:

1. Standard deviations are in parentheses.

2. C measures physical capital, S measures human capital, and Z measures physical infrastructure.
References
T. Ando and J. Bai. Panel data models with grouped factor structure under unknown group
membership. Journal of Applied Econometrics, 31:163–191, 2016.
D. A. Aschauer. Is public expenditure productive? Journal of Monetary Economics, 23:
177–200, 1989.
P. R. D. Bom and J. Ligthart. How productive is public capital? A meta-analysis. CESifo
Working Paper Series 2206, CESifo Group Munich, 2008.
S. Bonhomme and E. Manresa. Grouped patterns of heterogeneity in panel data. Econometrica,
83:1147–1184, 2015a.
S. Bonhomme and E. Manresa. Supplement to ‘grouped patterns of heterogeneity in panel
data’. Econometrica Supplemental Material, 83:1147–1184, 2015b.
S. Bonhomme, T. Lamadon, and E. Manresa. Discretizing unobserved heterogeneity. Working
paper, 2017a.
S. Bonhomme, T. Lamadon, and E. Manresa. A distributional framework for matched employer
employee data. Working paper, 2017b.
R. C. Bradley and C. Tone. A central limit theorem for non-stationary strongly mixing random
fields. Journal of Theoretical Probability, 30:655–674, 2017.
J. E. Brand and Y. Xie. Who benefits most from college? Evidence for negative selection
in heterogeneous economic returns to higher education. American Sociological Review, 75:
273–302, 2010.
C. Calderon, E. Moral-Benito, and L. Serven. Is infrastructure capital productive? A dynamic
heterogeneous approach. Journal of Applied Econometrics, 30:177–198, 2015.
I. A. Canay. A simple approach to quantile regression for panel data. The Econometrics
Journal, 14:368–386, 2011.
P. Cantos, M. Gumbau-Albert, and J. Maudos. Transport infrastructures, spillover effects and
regional growth: Evidence of the Spanish case. Transport Reviews, 25:25–50, 2005.
V. Chernozhukov, I. Fernández-Val, and A. Galichon. Quantile and probability curves without
crossing. Econometrica, 78:1093–1125, 2010.
V. Chernozhukov, I. Fernandez-Val, and M. Weidner. Network and panel quantile effects via
distribution regression. working papers 1803.08154, arXiv.org, 2018.
Y. Dong and A. Lewbel. Nonparametric identification of a binary random factor in cross section
data. Journal of Econometrics, 163:163–171, 2011.
A. Dzemski and R. Okui. Confidence set for group membership. Working paper, 2018.
J. Fan and Q. Yao. Nonlinear time series: nonparametric and parametric methods. Springer-
Verlag, 2008.
Y. Fan, E. Guerre, and S. Lazarova. A unified framework for the estimation and inference in
linear quantile regression: A local polynomial approach. Working paper, 2017.
A. F. Galvao. Quantile regression for dynamic panel data with fixed effects. Journal of Econo-
metrics, 164:142–157, 2011.
A. F. Galvao and K. Kato. Smoothed quantile regression for panel data. Journal of Economet-
rics, 193:92–112, 2016.
A. F. Galvao and A. Poirier. Random effects quantile regression. Working paper, 2016.
A. F. Galvao, T. Juhl, G. Montes-Rojas, and J. Olmo. Testing slope homogeneity in quantile
regression panel data with an application to the cross-section of stock returns. Journal of
Financial Econometrics, 16:211–243, 2018.
X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105:1531–
1540, 2010.
B. S. Graham, J. Hahn, A. Poirier, and J. L. Powell. A quantile correlated random coefficients
panel data model. Journal of Econometrics, forthcoming.
A. Gregoriou and S. Ghosh. On the heterogeneous impact of public capital and current spending
on growth across nations. Economics Letters, 105:32–35, 2009.
J. Gu and S. Volgushev. Panel data quantile regression with grouped fixed effects. Working
paper, 2018.
J. Hahn and H. R. Moon. Panel data models with finite number of multiple equilibria. Econo-
metric Theory, 26:863–881, 2010.
M. Harding, C. Lamarche, and M. H. Pesaran. Common correlated effects estimation of het-
erogeneous dynamic panel quantile regression models. Working paper, 2017.
H. Kasahara and K. Shimotsu. Nonparametric identification of finite mixture models of dynamic
discrete choices. Econometrica, 77:135–175, 2009.
K. Kato, A. F. Galvao, and G. V. Montes-Rojas. Asymptotics for panel quantile regression
models with individual effects. Journal of Econometrics, 170:76–91, 2012.
Y. Ke, J. Li, and W. Zhang. Structure identification in panel data analysis. The Annals of
Statistics, 44:1193–1233, 2016.
K. Knight. Limiting distributions for L1 regression estimators under general conditions. The
Annals of Statistics, 26:755–770, 1998.
R. Koenker. Quantile regression for longitudinal data. Journal of Multivariate Analysis, 91:
74–89, 2004.
R. Koenker. Quantile Regression. Cambridge University Press, Cambridge, UK, 2005.
A. N. Kolmogorov and Y. A. Rozanov. On strong mixing conditions for stationary Gaussian
processes. Theory of Probability & Its Applications, 5(2):204–208, 1960.
E. Krasnokutskaya, K. Song, and X. Tang. Estimating unobserved agent heterogeneity using
pairwise comparisons. Working paper, 2017.
E. M. Leeper, T. B. Walker, and S.-C. S. Yang. Government investment and fiscal stimulus.
Journal of Monetary Economics, 57:1000–1012, 2010.
S. Leorato and F. Peracchi. Comparing distribution and quantile regression. EIEF Working
Papers Series 1511, Einaudi Institute for Economics and Finance (EIEF), 2015.
C.-C. Lin and S. Ng. Estimation of panel data models with parameter heterogeneity when
group membership is unknown. Journal of Econometric Methods, 1:42–55, 2012.
R. Liu, A. Schick, Z. Shang, Y. Zhang, and Q. Zhou. Identification and estimation in panel
models with overspecified number of groups. Working paper, 2018.
S. Ng and G. McLachlan. Mixture models for clustering multilevel growth trajectories. Com-
putational Statistics & Data Analysis, 71:43–51, 2014.
R. Okui and W. Wang. Heterogeneous structural breaks in panel data models. Working paper,
2018.
D. Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9:135–140, 1981.
W. Romp and J. De Haan. Public capital and economic growth: A critical survey. Perspektiven
der Wirtschaftspolitik, 8:6–52, 2007.
O. Rosen, W. Jiang, and M. Tanner. Mixtures of marginal models. Biometrika, 87:391–404,
2000.
V. Sarafidis and N. Weber. A partially heterogeneous framework for analyzing panel data.
Oxford Bulletin of Economics and Statistics, 77:274–296, 2015.
J.-E. Sturm, J. Jacobs, and P. Groote. Output effects of infrastructure investment in the
Netherlands, 1853–1913. Journal of Macroeconomics, 21:355–380, 1999.
L. Su and G. Ju. Identifying latent grouped patterns in panel data models with interactive
fixed effects. Journal of Econometrics, 2018. forthcoming.
L. Su and W. Wang. Identifying latent group structures in nonlinear panels. Working paper,
2017.
L. Su, Z. Shi, and P. C. B. Phillips. Identifying latent structures in panel data. Econometrica,
84:2215–2264, 2016.
L. Su, X. Wang, and S. Jin. Sieve estimation of time-varying panel data models with latent
structures. Journal of Business & Economic Statistics, 2017. forthcoming.
S. Sugasawa. Grouped heterogeneous mixture modeling for clustered data. Working paper,
2018.
Y. Sun. Estimation and inference in panel structural models. Working paper, Department of
Economics, UCSD, 2005.
N. Uhde. Output effects of infrastructures in east and west German states. Intereconomics, 45:
322–328, 2010.
M. Vogt and O. Linton. Classification of non-parametric regression functions in longitudinal
data models. Journal of the Royal Statistical Society: Series B, 79:5–27, 2016.
W. Wang, X. Zhang, and R. Paap. To pool or not to pool: What is a good strategy for
parameter estimation and forecasting in panel regressions? Working paper, 2017.
G. L. Wu, Q. Feng, and Z. Wang. Estimating productivity of public infrastructure investment.
Working paper, 2017.
A. Y. Zhang and H. H. Zhou. Minimax rates of community detection in stochastic block models.
The Annals of Statistics, 44:2252–2280, 2016.
Z. Zhao and Z. Xiao. Efficient regressions via optimally combining quantile information. Econo-
metric Theory, 30:1272–1314, 2014.
H. Zou and M. Yuan. Composite quantile regression and the oracle model selection theory. The
Annals of Statistics, 36:1108–1126, 2008.
A Appendix
This appendix provides the proofs of the technical results in the main text. We first prove the weak consistency of the regression quantile estimates under unknown (estimated) group memberships in A.1, and then derive the convergence rate of the misclassification probability in A.2. A.3 establishes the asymptotic distribution of the regression quantile estimates. The proof of the consistency of the estimated number of groups is given in A.4, and finally the case of individual fixed effects is treated in A.5.
A.1 Proof of Theorem 4.1 (weak consistency)
Proof. By the definition of $\hat\beta(\tau)$, for each fixed $k$, $\hat\beta(\tau_k)$ minimizes the sub-objective function
$$\sum_{i=1}^{N}\sum_{t=1}^{T}\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i}(\tau_k)\bigr). \tag{A.1}$$
For notational simplicity, we write $\beta_{g_i}$ for $\beta_{g_i}(\tau_k)$ throughout the proof. Let
$$\Delta_i(\beta) := T^{-1}\sum_{t=1}^{T}\bigl\{\rho_{\tau_k}(y_{it}-x_{it}'\beta_{g_i})-\rho_{\tau_k}(y_{it}-x_{it}'\beta^0_{g_i^0})\bigr\}.$$
For any $\delta>0$, define the neighbourhood $B^0_g(\delta):=\{\beta:\|\beta-\beta^0_g\|\le\delta\}$ and its boundary $\partial B^0_g(\delta):=\{\beta:\|\beta-\beta^0_g\|=\delta\}$. We now prove (4.1). Note that by the definition of $\hat\beta(\tau_k)$, we have
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
=\bigl\{\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta,\ \exists\,1\le i\le N\bigr\}
\subset\Bigl\{N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)\le 0,\ \exists\, g,\tilde g\in\{1,\dots,G^0\},\ \exists\,\beta\notin B^0_g(\delta)\Bigr\}. \tag{A.2}$$
For $\beta\notin B^0_g(\delta)$, define $\bar\beta=r_g\beta+(1-r_g)\beta^0_g$, where $r_g=\delta/\|\beta-\beta^0_g\|$. Observe that $r_g\in(0,1)$ and $\bar\beta\in\partial B^0_g(\delta)$. Similar to the proof of Theorem 3.1 in Kato et al. (2012), the convexity of the
sub-objective function (A.1) yields that
$$r_g\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)
\ge\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\bar\beta)
=\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\bar\beta)]
+\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\bar\beta)-\mathrm{E}[\Delta_i(\bar\beta)]\bigr\}. \tag{A.3}$$
We now show that for some $\varepsilon_\delta>0$,
$$\Bigl\{N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)\le 0,\ \exists\, g,\tilde g\in\{1,\dots,G^0\},\ \exists\,\beta\notin B^0_g(\delta)\Bigr\}
\subset\Bigl\{\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|>\varepsilon_\delta\Bigr\}. \tag{A.4}$$
To prove (A.4), in view of (A.3) it suffices to show that for all $g,\tilde g\in\{1,\dots,G^0\}$,
$$\liminf_{N\to\infty}\ \min_{\beta\in\partial B^0_g(\delta)} N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\beta)]>0,\quad\text{a.s.} \tag{A.5}$$
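The expansion below relies on the identity of Knight (1998); for the reader's convenience, it states that

```latex
\rho_\tau(u-v)-\rho_\tau(u)
  = -v\bigl(\tau - I(u<0)\bigr)
  + \int_0^{v}\bigl(I(u\le s)-I(u\le 0)\bigr)\,ds .
```

Taking expectations with $u=\varepsilon_{it}(\tau_k)$ and $v=x_{it}'(\beta-\beta^0_g)$, the linear term has zero conditional mean because $\tau_k$ is the conditional quantile level of $\varepsilon_{it}(\tau_k)$, which leaves the integral term appearing in the display that follows.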
By the Knight (1998) identity we have
$$\mathrm{E}[\Delta_i(\beta)]
=\mathrm{E}\Bigl[\int_0^{x_{it}'(\beta-\beta^0_g)}\bigl\{F_{i,\tau_k}(z\,|\,x_{it})-\tau_k\bigr\}\,dz\Bigr]
=\frac{1}{2}(\beta-\beta^0_g)'\,\mathrm{E}\bigl[f_{i,\tau_k}(0\,|\,x_{it})x_{it}x_{it}'\bigr](\beta-\beta^0_g)+o(\delta^2)$$
as $\delta\to0$, where the last equality is the uniform expansion of $\mathrm{E}[\Delta_i(\beta)]$ over $\{\beta\in\partial B^0_g(\delta)\}$ and $i\ge1$, by Assumption 1(iv). It hence follows that for each $\beta\in\partial B^0_g(\delta)$,
$$N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\beta)]
=\frac{1}{2}(\beta-\beta^0_g)'\,N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}\bigl[f_{i,\tau_k}(0\,|\,x_{it})x_{it}x_{it}'\bigr](\beta-\beta^0_g)+o(\delta^2)
\ge\frac{1}{2}\delta^2\lambda_N(g,\tilde g,\tau_k)+o(\delta^2)\quad\text{a.s.},$$
which leads to (A.5) by Assumption 1(v). Combining (A.2) and (A.4) yields that
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
\subset\Bigl\{\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|>\varepsilon_\delta\Bigr\}.$$
Therefore, to prove (4.1), it suffices to show that as $N,T\to\infty$,
$$\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|=o_P(1),$$
which, since $G^0$ is finite, is equivalent to
$$\sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|=o_P(1) \tag{A.6}$$
for any $g,\tilde g\in\{1,\dots,G^0\}$. We obtain (A.6) by showing that for every $\varepsilon>0$,
$$\lim_{T\to\infty}\mathrm{P}\Bigl\{\sup_{\beta\in B^0_g(\delta)}\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|>\varepsilon\Bigr\}=0,\qquad g\in\{1,\dots,G^0\}, \tag{A.7}$$
uniformly for all $i\ge1$. Without loss of generality, we write $B^0_g(\delta)=B^0(\delta)$ for simplicity. Since $B^0(\delta)$ is a compact subset of $\mathbb{R}^{p+1}$, there exist $J$ balls with centers $\{\beta^{(j)},\,j=1,\dots,J\}$ and radius $r$ such that the collection of the $J$ balls covers $B^0(\delta)$. Then for each $\beta\in B^0(\delta)$, there is $j\in\{1,\dots,J\}$ such that $\|\beta-\beta^{(j)}\|\le r$. Observe that
$$\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|
=\Bigl|T^{-1}\sum_{t=1}^{T}\bigl[\rho_{\tau_k}(y_{it}-x_{it}'\beta)-\rho_{\tau_k}(y_{it}-x_{it}'\beta^{(j)})\bigr]\Bigr|
\le T^{-1}\sum_{t=1}^{T} C(1+\|x_{it}\|)\,\|\beta-\beta^{(j)}\|$$
for some constant $C>0$ independent of $i$ and $t$. Let $L(x):=C(1+\|x\|)$ and $\kappa:=\sup_{i\ge1}\mathrm{E}[L(x_{it})]<\infty$ by Assumption 1(iii). Then we have
$$\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+r\kappa,$$
and hence
$$\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|
\le\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+\mathrm{E}\bigl|\Delta_i(\beta^{(j)})-\Delta_i(\beta)\bigr|
\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+r\kappa+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+r\kappa.$$
Setting $r=\varepsilon/(6\kappa)$ leads to
$$\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|
\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+\varepsilon/6+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+\varepsilon/6.$$
Therefore, we have
$$\mathrm{P}\Bigl\{\sup_{\beta\in B^0(\delta)}\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|>\varepsilon\Bigr\}
\le\mathrm{P}\Bigl\{T^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|>2\kappa\Bigr\}
+\sum_{j=1}^{J}\mathrm{P}\bigl\{\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|>\varepsilon/3\bigr\}, \tag{A.8}$$
where $J$ can be chosen such that $J=O(\varepsilon^{-p-1})$ as $\varepsilon\to0$. By Assumption 1(ii), an application of the ergodic theorem for $\alpha$-mixing processes (see Proposition 2.8 of Fan and Yao (2008)) yields that both terms on the right-hand side of (A.8) are $o(1)$ as $T\to\infty$, uniformly over $1\le i\le N$. This leads to (A.7) and hence (4.1).
A.2 Convergence rate of misclassification probability
We first provide the proof of Lemma 1.
Proof. We first prove that
$$\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\xrightarrow{\ p\ }0 \tag{A.9}$$
for all $\tilde g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$. Note that
$$\frac{1}{N}\sum_{i=1}^{N}\Bigl(\min_{g\in\{1,\dots,G^0\}} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)
=\Bigl(\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\Bigr)\Bigl(\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr).$$
By Assumption 2(i), to prove (A.9) it suffices to show
$$\frac{1}{N}\sum_{i=1}^{N}\Bigl(\min_{g\in\{1,\dots,G^0\}} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)\xrightarrow{\ p\ }0.$$
Now,
$$\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\Bigl(\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)
\le\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|
\le\frac{1}{N}\sum_{i=1}^{N}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|
\le\max_{1\le i\le N}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|,$$
which is $o_P(1)$ as $N,T\to\infty$, following the consistency result in Theorem 4.1. Therefore, (A.9) holds by Assumption 2(i).
We define, for all $\tilde g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$,
$$\sigma(\tilde g)=\arg\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|.$$
We now show that, with probability approaching one as $N,T\to\infty$, $\sigma:\{1,\dots,G^0\}\to\{1,\dots,G^0\}$ is a one-to-one mapping. Let $g\neq\tilde g$. By the triangle inequality, we have
$$\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\hat\beta_{\sigma(\tilde g)}(\tau_k)\bigr\|\ge\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|-\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\beta^0_g(\tau_k)\bigr\|-\bigl\|\hat\beta_{\sigma(\tilde g)}(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|,$$
where the right-hand side converges in probability to $\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\|>0$ by Assumption 2(ii) and result (A.9). In other words, as $N,T\to\infty$, with probability approaching one we have $\sigma(g)\neq\sigma(\tilde g)$ if $g\neq\tilde g$. Hence, $\sigma$ is an invertible mapping, which implies that
$$\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\beta^0_g(\tau_k)\bigr\|\xrightarrow{\ p\ }0,\qquad N,T\to\infty,$$
for all $g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$. Since the objective function (A.1) is invariant to relabeling of the groups, without loss of generality we take $\sigma(g)=g$. This completes the proof.
Lemma 2. Suppose $Z^{(T)}:=\{Z^{(T)}_t,\ t=1,\dots,T\}$ is an array of random variables such that $\mathrm{E}\,Z^{(T)}_t=0$ and $\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2<\infty$ for each $t=1,\dots,T$. Suppose the following mixing conditions hold:
$$\alpha(Z,t):=\sup_T\alpha\bigl(Z^{(T)},t\bigr)\to0,\quad t\to\infty,\qquad
\rho'(Z,1):=\sup_T\max_{1\le t<T}\rho\bigl(Z^{(T)},t\bigr)<1.$$
Define the partial sum $S_T:=\sum_{t=1}^{T}Z^{(T)}_t$ and $\sigma^2_T:=\mathrm{E}\,S^2_T$. Then for any $T\ge1$, the variance of the partial sum is bounded as follows:
$$C_Z^{-1}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2\le\sigma^2_T\le C_Z\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2, \tag{A.10}$$
where $C_Z=(1+\rho'(Z,1))/(1-\rho'(Z,1))$. Assume that $\sigma^2_T>0$, and suppose that the Lindeberg condition
$$\forall\,\varepsilon>0,\qquad\lim_{T\to\infty}\frac{1}{\sigma^2_T}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2 I\bigl(\bigl|Z^{(T)}_t\bigr|>\varepsilon\sigma_T\bigr)=0$$
holds. Then, as $T\to\infty$,
$$\sigma^{-1}_T S_T\xrightarrow{\ d\ }N(0,1).$$
Proof of Theorem 4.2

Proof. We first examine the asymptotic property of $\hat g_i(\beta(\tau))$. It follows from the definition of $\hat g_i(\cdot)$ that for all $g\in\{1,\dots,G^0\}$,
$$I\bigl\{\hat g_i(\beta(\tau))=g\bigr\}\le I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_g(\tau_k)\bigr)\le\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr)\Bigr\},$$
so
$$\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}
=\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}I\bigl\{g_i^0\neq g\bigr\}\,I\bigl\{\hat g_i(\beta(\tau))=g\bigr\}
\le\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}Z_{ig}(\beta(\tau)),$$
where
$$Z_{ig}(\beta(\tau))=I\bigl\{g_i^0\neq g\bigr\}\times I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_g(\tau_k)\bigr)\le\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr)\Bigr\}.$$
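The group-assignment rule analyzed here picks, for each unit, the group whose coefficient paths minimize the composite quantile loss $\sum_t\sum_k w_k\rho_{\tau_k}(y_{it}-x_{it}'\beta_g(\tau_k))$. The following is a minimal sketch of that rule, not the authors' implementation; all function and variable names are our own, and it assumes the candidate coefficients $\beta_g(\tau_k)$ are already given.

```python
import numpy as np

def check(u, tau):
    # Quantile check function: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

def assign_groups(y, X, beta, taus, w):
    """Assign each unit to the group with the smallest composite quantile loss.

    y: (N, T) outcomes; X: (N, T, p) regressors;
    beta: (G, K, p) candidate group coefficients at K quantile levels;
    taus: (K,) quantile levels; w: (K,) quantile weights.
    """
    N, _ = y.shape
    G, K, _ = beta.shape
    loss = np.zeros((N, G))
    for g in range(G):
        for k in range(K):
            resid = y - X @ beta[g, k]                     # (N, T) residuals
            loss[:, g] += w[k] * check(resid, taus[k]).sum(axis=1)
    return loss.argmin(axis=1)                             # (N,) group labels
```

In a full estimation loop this assignment step would alternate with re-estimating the group-specific quantile coefficients until the labels stabilize.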
By the Knight (1998) identity we have
$$Z_{ig}(\beta(\tau))=I\bigl\{g_i^0\neq g\bigr\}\times I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{g_i^0}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{g_i^0}(\tau_k)-\beta_{g_i^0}(\tau_k))\le u\bigr)-\tau_k\bigr]du\le0\Bigr\}$$
$$\le\max_{\tilde g\neq g}I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{\tilde g}(\tau_k)-\beta_{\tilde g}(\tau_k))\le u\bigr)-\tau_k\bigr]du\le0\Bigr\}.$$
We now bound $Z_{ig}(\beta(\tau))$, for all $\beta(\tau)\in\mathcal{N}_\eta$, by a quantity that does not depend on $\beta(\tau)$. Define
$$D_{TK}:=\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{\tilde g}(\tau_k)-\beta_{\tilde g}(\tau_k))\le u\bigr)-\tau_k\bigr]du
-\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|.$$
After a change of variables in the first integral, and using the Cauchy–Schwarz (CS) inequality, we observe that
$$D_{TK}=\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_{x_{it}'(\beta_{\tilde g}(\tau_k)-\beta^0_{\tilde g}(\tau_k))}^{x_{it}'(\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du
-\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|$$
$$\le\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))+x_{it}'(\beta_g(\tau_k)-\beta^0_g(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|
+\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_{\tilde g}(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|$$
$$\le TK\cdot2\eta\Bigl(\frac{1}{T}\sum_{t=1}^{T}\|x_{it}\|\Bigr).$$
We thus obtain that
$$Z_{ig}(\beta(\tau))\le\max_{\tilde g\neq g}I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\{\varepsilon_{it}(\tau_k)\le u\}-\tau_k\bigr]du\le TK\cdot2\eta\Bigl(\frac{1}{T}\sum_{t=1}^{T}\|x_{it}\|\Bigr)\Bigr\},$$
where the right-hand side of this inequality, denoted by $\bar Z_{ig}$, does not depend on $\beta(\tau)$. As a result,
$$\sup_{\beta(\tau)\in\mathcal{N}_\eta}\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}\le\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}. \tag{A.11}$$
Inequality (A.11) implies that for any $\delta>0$, $\mathrm{P}\bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}N^{-1}\sum_{i=1}^{N}I\{\hat g_i(\beta(\tau))\neq g_i^0\}>\delta\bigr)$ is bounded by $N^{-1}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\mathrm{E}\,\bar Z_{ig}/\delta$ by Markov's inequality. So, in order to obtain result (4.3), we first derive the asymptotic behavior of $\mathrm{E}\,\bar Z_{ig}$ as $T\to\infty$. Define
$$b_{i,t}(K)=K^{-1}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl(I(\varepsilon_{it}(\tau_k)\le u)-\tau_k\bigr)du.$$
We then have, by Assumption 2(v),
$$\mathrm{E}\,\bar Z_{ig}=\mathrm{P}\bigl(\bar Z_{ig}=1\bigr)\le\sum_{\tilde g\neq g}\mathrm{P}\Bigl(T^{-1}\sum_{t=1}^{T}b_{i,t}(K)\le2\eta M\Bigr). \tag{A.12}$$
To bound the term on the right-hand side of (A.12), we rely on a central limit theorem for mixing processes; specifically, we apply Lemma 2, which is a direct consequence of Theorems 1.1 and 2.2 in Bradley and Tone (2017).
For each $i=1,\dots,N$, the array $\{Z^{(T)}_{i,t}:=T^{-1/2}(b_{i,t}(K)-\mathrm{E}[b_{i,t}(K)]),\ t=1,\dots,T\}$ satisfies the mixing conditions in Lemma 2 by Assumptions 1(ii) and 1(v). We show that the Lindeberg condition holds as follows. Define $S_{i,T}=\sum_{t=1}^{T}Z^{(T)}_{i,t}$ and $\sigma^2_{i,T}=\mathrm{E}\,S^2_{i,T}$. We have that for any $\varepsilon>0$,
$$\frac{1}{\sigma^2_{i,T}}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2 I\bigl(\bigl|Z^{(T)}_{i,t}\bigr|>\varepsilon\sigma_{i,T}\bigr)
\le\frac{1}{\varepsilon^{\delta}\sigma^{2+\delta}_{i,T}}\sum_{t=1}^{T}\mathrm{E}\bigl|Z^{(T)}_{i,t}\bigr|^{2+\delta}
\le\varepsilon^{-\delta}\Bigl[C_i^{-1}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2\Bigr]^{-(1+\delta/2)}\sum_{t=1}^{T}\mathrm{E}\bigl|Z^{(T)}_{i,t}\bigr|^{2+\delta}
=\varepsilon^{-\delta}\Bigl[C_i^{-1}\mathrm{E}\bigl(Z^{(T)}_{i,1}\bigr)^2\Bigr]^{-(1+\delta/2)} T^{-\delta/2}\,\mathrm{E}\bigl|Z^{(T)}_{i,1}\bigr|^{2+\delta}\longrightarrow 0,\quad T\to\infty, \tag{A.13}$$
where $C_i=(1+\rho'(Z_i,1))/(1-\rho'(Z_i,1))$. Moreover, the convergence in (A.13) is uniform for $i=1,\dots,N$, since both $C_i^{-1}$ and $\mathrm{E}\,|Z^{(T)}_{i,1}|^{2+\delta}$ can be uniformly bounded. We now apply Lemma 2 to $\{Z^{(T)}_{i,t},\ t=1,\dots,T\}$ to bound the right-hand side of (A.12). First, for $\eta\le\frac{1}{4M}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]$,
$$\mathrm{P}\Bigl(T^{-1}\sum_{t=1}^{T}b_{i,t}(K)\le2\eta M\Bigr)
=\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le\sigma^{-1}_{i,T}\sqrt{T}\bigl(2\eta M-\mathrm{E}[b_{i,t}(K)]\bigr)\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sigma^{-1}_{i,T}\sqrt{T}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr), \tag{A.14}$$
noting that by Assumptions 2(ii)–2(iii) and 2(v),
$$\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]
=\inf_{i\ge1}K^{-1}\sum_{k=1}^{K}w_k\,\mathrm{E}\Bigl[\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl(F_{i,\tau_k}(u\,|\,x_{it})-\tau_k\bigr)du\Bigr]
\ge c\inf_{i\ge1}K^{-1}\sum_{k=1}^{K}w_k\bigl(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr)'\mathrm{E}[x_{it}x_{it}']\bigl(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr)
\ge c\Bigl(\inf_{i\ge1}\lambda_i\Bigr)K^{-1}\sum_{k=1}^{K}w_k\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|^2>0,$$
where $c$ is a constant independent of $i$, $t$ and $k$, and $\lambda_i$ is the minimum eigenvalue of $\mathrm{E}[x_{it}x_{it}']$.
Since it follows from Assumption 2(v) and the CS inequality that
$$\sigma^2_{i,T}\le C_i\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2
= C_i\,\mathrm{E}\bigl(b_{i,t}(K)-\mathrm{E}[b_{i,t}(K)]\bigr)^2
\le C_i\,\mathrm{E}\bigl[\|x_{it}\|^2\bigr]\,K^{-1}\sum_{k=1}^{K}\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|^2<\infty,$$
we have
$$\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sigma^{-1}_{i,T}\sqrt{T}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sqrt{T}\Bigl(\sup_{i\ge1}C_i\,\mathrm{Var}[b_{i,t}(K)]\Bigr)^{-1/2}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2\sqrt{C'}}\,\frac{\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}}\sqrt{T}\Bigr), \tag{A.15}$$
where $C'=(1+\rho')/(1-\rho')$ with $\rho'$ defined in Assumption 2(iv). By Lemma 2 we have
$$\lim_{T\to\infty}\frac{\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)}{\Phi\Bigl(-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)}=1$$
uniformly for $i=1,\dots,N$, with $\Phi(\cdot)$ denoting the standard normal distribution function. Therefore, combining (A.12), (A.14) and (A.15) yields that for sufficiently large $T$,
$$\mathrm{E}\,\bar Z_{ig}\le\sum_{\tilde g\neq g}\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)
\le 2\sum_{\tilde g\neq g}\Phi\Bigl(-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)
\le D_0 T^{-1/2}\sum_{\tilde g\neq g}\phi\Bigl(\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr), \tag{A.16}$$
uniformly for $i=1,\dots,N$, with $D_0$ denoting a constant, where the last inequality follows from the Mills ratio and $\phi(\cdot)$ is the standard normal density function. We now define
$$\zeta=\min_{\substack{g\neq\tilde g\\ g,\tilde g\in\{1,\dots,G^0\}}}\frac{\zeta^2_{g,\tilde g}}{8C'}>0.$$
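The Mills ratio bound invoked in the last inequality of (A.16) is the standard Gaussian tail inequality

```latex
\Phi(-x) \;\le\; \frac{\phi(x)}{x}, \qquad x>0,
```

applied with $x=\zeta_{g,\tilde g}\sqrt{T}/(2\sqrt{C'})$, which produces the factor $T^{-1/2}$ multiplying the normal density.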
It follows from (A.16) that for any $\varepsilon>0$, there is $M^*=G^0(G^0-1)\frac{D_0}{\sqrt{2\pi}}\varepsilon^{-1}$ such that
$$\mathrm{P}\Bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}>M^* T^{-1/2}\exp(-\zeta T)\Bigr)
\le\mathrm{P}\Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}>M^* T^{-1/2}\exp(-\zeta T)\Bigr)
\le\frac{1}{M^* T^{-1/2}\exp(-\zeta T)}\,\mathrm{E}\Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}\Bigr)
\le\frac{D_0 T^{-1/2}}{M^* T^{-1/2}\exp(-\zeta T)}\sum_{g=1}^{G^0}\sum_{\tilde g\neq g}\phi\Bigl(\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)\le\varepsilon.$$
That is (4.3).
Next, we prove (4.6). Let us denote
$$\hat Q(\beta(\tau))=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\,\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{\hat g_i(\beta(\tau))}(\tau_k)\bigr)$$
and
$$\tilde Q(\beta(\tau))=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\,\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr).$$
Then $\hat Q(\cdot)$ is minimized at $\hat\beta(\tau)$, and $\tilde Q(\cdot)$ is minimized at $\tilde\beta(\tau)$, the infeasible estimator that uses the true group memberships. By Assumptions 1(i), 2(v) and (4.3), it is easy to observe that
$$\sup_{\beta(\tau)\in\mathcal{N}_\eta}\bigl|\hat Q(\beta(\tau))-\tilde Q(\beta(\tau))\bigr|=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr) \tag{A.17}$$
as $N,T\to\infty$. Note that by Lemma 1, we have, as $N,T\to\infty$,
$$\mathrm{P}\bigl(\hat\beta(\tau)\notin\mathcal{N}_\eta\bigr)\longrightarrow 0. \tag{A.18}$$
Similarly, since $\tilde\beta(\tau)$ is also a consistent estimator of $\beta^0(\tau)$ under the assumptions of Theorem 4.1, we have
$$\mathrm{P}\bigl(\tilde\beta(\tau)\notin\mathcal{N}_\eta\bigr)\longrightarrow 0. \tag{A.19}$$
Now, combining (A.17) and (A.18) yields that
$$\hat Q(\hat\beta(\tau))-\tilde Q(\hat\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.20}$$
This is because, for $x>0$,
$$\mathrm{P}\bigl(\bigl|\hat Q(\hat\beta(\tau))-\tilde Q(\hat\beta(\tau))\bigr|>xT^{-1/2}\exp(-\zeta T)\bigr)
\le\mathrm{P}\bigl(\hat\beta(\tau)\notin\mathcal{N}_\eta\bigr)+\mathrm{P}\Bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}\bigl|\hat Q(\beta(\tau))-\tilde Q(\beta(\tau))\bigr|>xT^{-1/2}\exp(-\zeta T)\Bigr).$$
Likewise, combining (A.17) and (A.19), we obtain
$$\hat Q(\tilde\beta(\tau))-\tilde Q(\tilde\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.21}$$
Hence, using (A.20) and (A.21) and the definitions of $\hat\beta(\tau)$ and $\tilde\beta(\tau)$ yields
$$0\le\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))=\hat Q(\hat\beta(\tau))-\hat Q(\tilde\beta(\tau))+O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)\le O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr).$$
It thus follows that
$$\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.22}$$
We also observe that
$$\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\Bigl(\rho_{\tau_k}\bigl(y_{it}-x_{it}'\hat\beta_{g_i^0}(\tau_k)\bigr)-\rho_{\tau_k}\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\bigr)\Bigr)
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\hat\beta_{g_i^0}(\tau_k)-\tilde\beta_{g_i^0}(\tau_k))}\Bigl(I\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\le u\bigr)-\tau_k\Bigr)du, \tag{A.23}$$
where the last equality uses the identity of Knight (1998). Note that $I\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\le u\bigr)-\tau_k$ takes only the values $1-\tau_k$ and $-\tau_k$ for $k=1,\dots,K$. It hence follows from (A.22) and (A.23) that
$$\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}x_{it}'\bigl(\hat\beta_{g_i^0}(\tau_k)-\tilde\beta_{g_i^0}(\tau_k)\bigr)
=\frac{1}{K}\sum_{k=1}^{K}\sum_{g=1}^{G^0}\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr).$$
In particular, for all $g$ and $k$, we have
$$\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$$
as $N,T\to\infty$. Note that
$$\Bigl|\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)\Bigr|\ge\hat\lambda^{-1/2}\,\bigl\|\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr\|,$$
where $\hat\lambda\xrightarrow{\ p\ }\lambda>0$ as a consequence of Assumption 2(vi). Hence $\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$. This shows (4.6).
A.3 Asymptotic distribution of regression quantile estimates
Here we prove Corollary 2.
Proof. We have by Theorem 3.1 in Galvao and Poirier (2016) that for all $g\in\{1,\dots,G^0\}$, as $N,T\to\infty$,
$$\Gamma(\tau,g)\sqrt{\pi_g NT}\,\bigl(\tilde\beta_g(\tau)-\beta^0_g(\tau)\bigr)\Rightarrow z(\tau,g),$$
where $z(\cdot,g)$ is the $K$-dimensional normal distribution with zero mean, whose covariance matrix is
$$\mathrm{E}\bigl[z(\tau,g)z(\tau',g)'\bigr]=\operatorname*{plim}_{T\to\infty}\,T^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}\mathrm{E}\Bigl[\bigl(I(\varepsilon_{it}(\tau)\le 0)-\tau\bigr)\bigl(I(\varepsilon_{is}(\tau')\le 0)-\tau'\bigr)x_{it}x_{is}'\,I(g_i^0=g)\Bigr],$$
with $\varepsilon_{it}(\tau)=(\varepsilon_{it}(\tau_1),\dots,\varepsilon_{it}(\tau_K))'$. Result (4.7) then follows from the fact that $\|\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\|=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$, where $\zeta>0$ is defined in Theorem 4.2.
A.4 Consistency of IC at given τ
In this section, we provide the proof of Theorem 5.1.
Proof. The structure of the proof is similar to that of Theorem 2.6 in Su et al. (2016). We shall show that $\lim_{N,T\to\infty}\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)=1$ for all $G\neq G^0$ with $G\le G_{\max}$. Let $\psi(\omega_{it};\beta_{g_i})=K^{-1}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i}(\tau_k)\bigr)$, with $\omega_{it}=(y_{it},x_{it}')'$. Then we have from (5.1) that
$$\mathrm{IC}(G)=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)+G(p+1)f(N,T).$$
If $G=G^0$, then $\mathrm{IC}(G^0)=\hat e_{G^0}+G^0(p+1)f(N,T)$, where $\hat e_{G^0}=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi(\omega_{it};\hat\beta_{\hat g_i})$. It hence follows from Theorem 4.1 and Assumption 4 that $\mathrm{IC}(G^0)\xrightarrow{\ p\ }\sigma^2_0$. We now prove $\lim_{N,T\to\infty}\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)=1$ for $1\le G<G^0$ and $G^0<G\le G_{\max}$, respectively.
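The selection rule analyzed here, $\hat G=\arg\min_G \mathrm{IC}(G)$, can be sketched in a few lines. This is only an illustration: the function names are ours, and the default penalty $f(N,T)=\log(NT)/\sqrt{NT}$ is merely one choice compatible with the requirement $\sqrt{NT}f(N,T)\to\infty$ used below; the admissible penalties are governed by Assumption 4.

```python
import numpy as np

def select_num_groups(losses, p, N, T, penalty=None):
    """Pick the number of groups minimizing IC(G) = loss(G) + G*(p+1)*f(N,T).

    losses: dict mapping each candidate G to the fitted average composite
    quantile loss (the first term of IC(G)); penalty: the value f(N,T).
    The default is an illustrative choice only, not the paper's prescription.
    """
    if penalty is None:
        penalty = np.log(N * T) / np.sqrt(N * T)
    ic = {G: loss + G * (p + 1) * penalty for G, loss in losses.items()}
    return min(ic, key=ic.get)   # G with the smallest IC value
```

Because the fitted loss keeps decreasing (weakly) in $G$ while the penalty grows linearly, the criterion picks the smallest $G$ after which the loss stops improving appreciably, mirroring the two cases of the proof.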
Case 1: $1\le G<G^0$. By Assumption 4,
$$\min_{1\le G<G^0}\mathrm{IC}(G)\ \ge\ \min_{1\le G<G^0}\ \inf_{P(G)\in\mathcal{P}_G}\hat e_{P(G)}+G(p+1)f(N,T)\ \xrightarrow{\ p\ }\ \underline{e}>\sigma^2_0,$$
where $P(G)=(P_1,\dots,P_G)$ and $\mathcal{P}_G$ denote any $G$-partition of $\{1,2,\dots,N\}$ and the collection of all such partitions, respectively. It hence follows that $\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)\to1$ for $1\le G<G^0$, as $N,T\to\infty$.
Case 2: $G^0<G\le G_{\max}$. With the group membership estimates $\{\hat g_i(G),\ i=1,\dots,N\}$, we define $\hat P_g(G)=\{i\in\{1,2,\dots,N\}:\hat g_i(G)=g\}$ for $g=1,\dots,G$, and let $\hat P(G)=\{\hat P_1(G),\dots,\hat P_G(G)\}$. Then we rewrite the first term in $\mathrm{IC}(G)$ in the following way:
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=\frac{1}{NT}\sum_{g=1}^{G}\sum_{i\in\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=D_{1NT}+D_{2NT}-D_{3NT}+D_{4NT}, \tag{A.24}$$
where
$$D_{1NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in P^0_g}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G^0)}_{g_i^0}\bigr),\qquad
D_{2NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in\hat P_g(G)\setminus P^0_g}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr),$$
$$D_{3NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in P^0_g\setminus\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G^0)}_{g_i^0}\bigr),\qquad
D_{4NT}=\frac{1}{NT}\sum_{g=G^0+1}^{G}\sum_{i\in\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr).$$
First, following the proof of Theorem 4.2, we can show that, as $N,T\to\infty$,
$$D_{jNT}=O_P\bigl(T^{-1/2}\exp(-\delta_j T)\bigr),\qquad j=2,3,4, \tag{A.25}$$
for some $\delta_j>0$. For the expansion of $D_{1NT}$, we have
$$D_{1NT}-\hat e_{G^0}
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'\bigl(\hat\beta^{(G^0)}_{g_i^0}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr)}\bigl(I(\varepsilon_{it}(\tau_k)\le u)-\tau_k\bigr)du
\le N^{-1}\sum_{i=1}^{N}\Bigl(T^{-1}\sum_{t=1}^{T}\|x_{it}\|\Bigr)\Bigl(K^{-1}\sum_{k=1}^{K}\bigl\|\hat\beta^{(G^0)}_{g_i^0}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|\Bigr)
=O_P\bigl((NT)^{-1/2}\bigr), \tag{A.26}$$
where the last equality follows from Corollary 2. Therefore, combining (A.24), (A.25) and (A.26) yields that
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=\hat e_{G^0}+O_P\bigl((NT)^{-1/2}\bigr). \tag{A.27}$$
Using (A.27) and the fact that $\sqrt{NT}f(N,T)\to\infty$, we eventually obtain that
$$\mathrm{P}\Bigl(\min_{G^0<G\le G_{\max}}\mathrm{IC}(G)>\mathrm{IC}(G^0)\Bigr)
=\mathrm{P}\Bigl(\sqrt{NT}\Bigl(\min_{G^0<G\le G_{\max}}\mathrm{IC}(G)-\mathrm{IC}(G^0)\Bigr)>0\Bigr)
=\mathrm{P}\Bigl(O_P(1)+\sqrt{NT}f(N,T)(G-G^0)(p+1)>0\Bigr)\xrightarrow{\ p\ }1,\qquad N,T\to\infty. \tag{A.28}$$
It follows that $\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)\to1$ for $G^0<G\le G_{\max}$.
A.5 Asymptotics in the presence of individual-specific fixed effects
We provide the proof of Theorem 6.1 as follows.
Proof. The proof is similar to that of Proposition 3.1 in Galvao and Kato (2016). Put
$$\Delta_i(\alpha_i,\beta_{g_i}):=T^{-1}\sum_{t=1}^{T}\bigl\{\rho_{\tau_k}(y_{it}-\alpha_i-x_{it}'\beta_{g_i})-\rho_{\tau_k}(y_{it}-\alpha_{i0}-x_{it}'\beta^0_{g_i^0})\bigr\}.$$
For $\delta>0$, define $B^0_i(\delta):=\{(\alpha,\beta):|\alpha-\alpha_{i0}|+\|\beta-\beta^0_{g_i^0}\|\le\delta\}$ and $\partial B^0_i(\delta):=\{(\alpha,\beta):|\alpha-\alpha_{i0}|+\|\beta-\beta^0_{g_i^0}\|=\delta\}$. We can expand $\mathrm{E}[\Delta_i(\alpha,\beta)]$ uniformly over $(\alpha,\beta)\in\partial B^0_i(\delta)$ by Assumption 1(iv), and using Assumption 5(iii) yields that
$$\inf_{i\ge1}\ \min_{(\alpha,\beta)\in\partial B^0_i(\delta)}\mathrm{E}[\Delta_i(\alpha,\beta)]>0.$$
Therefore, it follows from a proof similar to that of Theorem 3.1 in Kato et al. (2012) that for some $\varepsilon_\delta>0$,
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
\subset\bigl\{\Delta_i(\hat\alpha_i,\hat\beta_{\hat g_i})\le0,\ \exists\,1\le i\le N,\ \exists\,(\hat\alpha_i,\hat\beta_{\hat g_i})\notin B^0_i(\delta)\bigr\}
\subset\Bigl\{\max_{1\le i\le N}\ \sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}.$$
This implies that to prove (6.2), it suffices to show that for every $\varepsilon_\delta>0$,
$$\max_{1\le i\le N}\mathrm{P}\Bigl\{\sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}=o\bigl(N^{-1}\bigr) \tag{A.29}$$
as $N\to\infty$. To show this, we apply Corollary C.1 (a Bernstein-type inequality for $\beta$-mixing processes) in the Supplemental Appendix of Galvao and Kato (2016). Under Assumptions 2(v) and 5(ii), taking $s=2\log N$ and $q=[\sqrt{T}]$ yields that for any $\varepsilon_\delta>0$,
$$\max_{1\le i\le N}\mathrm{P}\Bigl\{\sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}\le 2\bigl(N^{-2}+\sqrt{T}\,B_{a[\sqrt{T}]}\bigr).$$
The right-hand side is $o(N^{-1})$ by noting that $(\log N)/\sqrt{T}\to0$ as $N,T\to\infty$. Therefore, we obtain (A.29) and then (6.2).