Latent Group Structures with Heterogeneous
Distributions: Identification and Estimation∗
Heng Chen† Xuan Leng‡ Wendun Wang§
January 21, 2019
Abstract
Panel data are often characterized by cross-sectional heterogeneity, and a flexible yet
parsimonious way of modeling heterogeneity is to cluster units into groups. A group pat-
tern of heterogeneity may exist not only in the mean but also in the other characteristics
of the distribution. To identify latent groups and recover the heterogeneous distribu-
tion, we propose a clustering method based on composite quantile regressions. We show
that combining the strength across multiple panel quantile regression models improves
the precision of the group membership estimates if the group structure is common across
quantiles. Asymptotic theories for the proposed estimators are established, while their
finite-sample performance is demonstrated by simulations. We finally apply the proposed
methods to analyze the cross-country output effect of infrastructure capital.
Keywords: Distributional heterogeneity, cluster analysis, panel structure model, com-
posite quantile regression, infrastructure effect
JEL Classification: C31, C33, C38, H54
∗We thank the participants of the seminars at the Nanyang Technological University, Erasmus University Rotterdam, 4th Conference of the International Society for Nonparametric Statistics in Salerno, and 4th Dongbei Econometrics Workshop in Dalian for their useful discussions and constructive comments. Wang acknowledges the financial support of the EUR Fellowship. Any remaining errors are ours.
†[email protected]. Currency Department, Bank of Canada.
‡[email protected]. Econometric Institute, Erasmus University Rotterdam.
§[email protected]. Econometric Institute, Erasmus University Rotterdam and Tinbergen Institute
1 Introduction
In many applications using panel data, the effect of the covariates on the dependent variable
differs across individual units, and such individual heterogeneity may be conveniently charac-
terized by a group pattern. For example, Hahn and Moon (2010) provided a theoretical jus-
tification of the group structure in game theory and macroeconomic models in which multiple
equilibria can appear. Bonhomme and Manresa (2015a) showed in a study of the democracy–
income relationship that countries can be separated into groups with different democratic tran-
sition paths. Ando and Bai (2016) found evidence of a group pattern of heterogeneity in the
styles of US mutual funds and performance of asset returns in Mainland China. Bonhomme
et al. (2017a) argued that the group structure can be a good discrete approximation even if
individual heterogeneity is continuous.
When we describe the features of a group, the mean statistics do not provide complete
information, and it is often of great interest and necessity to unveil the entire distribution of
the group, including such features as the volatility of asset returns of a group and the extent
to which different groups of firms behave in extreme cases (e.g., during the global financial
crisis and the Great Moderation). Such distributional heterogeneity (i.e., the difference in
the distribution of groups) is thus more general and informative than mean heterogeneity.
However, existing heterogeneous panel data models primarily focus on the heterogeneity of the
(conditional) mean effect, and thus do not capture any distributional heterogeneity.
This paper presents a new model and a new estimation procedure that allow us to examine
the distributional heterogeneity of panel data. We consider panel structure models in which in-
dividual units in the same group share a common conditional distribution, while the distribution
can differ in location, shape, or both across groups. This model offers a flexible yet parsimo-
nious approach to capture the heterogeneous distributional effects across individual units. The
group membership structure (i.e., which individuals belong to which group) is assumed to be
unknown and estimated from the data. For each group, we explore the distribution features by
adopting quantile regressions, such that the units in a group share the same quantile regression
function. Exploiting the information contained at different quantiles of the distribution in turn
improves classification accuracy as the clustering here is based not only on the location of the
distribution (measured by the mean), but also on the shape reflected by a range of quantiles.
Hence, when there is little or no heterogeneity in the group means, we can still correctly identify
the group membership structure as long as this structure is retained in the other parts of the
distribution. This occurs, for example, when the asset returns of two groups of firms behave
similarly in the tranquil period, but differently in the volatile period. Moreover, classification
based on the distribution is robust against outliers of the dependent variable compared with
mean-based classification. We name our model the panel structure quantile regression (PSQR)
model.
To estimate this model, we employ an iterative algorithm that alternates between group membership estimation and panel quantile regression estimation. The aim of our estimation approach
is to find the optimal group membership for each individual unit by minimizing its composite
quantile check function given the regression quantile estimates, and to estimate the regression
quantiles given the estimated group memberships. More specifically, the composite quantile
check function is defined as the weighted average of the standard quantile check function over
different quantile levels. The advantage of using the composite quantile check function for
classification is that it allows us to employ the information contained at multiple quantiles
simultaneously when the group membership structure does not vary across quantiles. This
approach thus relaxes the group separation condition and improves the estimation accuracy of
the group membership parameters.1
A technical contribution of this study is that in addition to conventional consistency results,
we precisely quantify the speed of the convergence of the misclassification probability. We show
that the convergence rate is an exponential function of the length of the time series and that it
also depends on various quantities such as the number of quantiles used for classification, degree
of group separation, variance of the error terms, and serial correlation in the data. To the best
of our knowledge, this is the first study to precisely quantify the asymptotic behavior of group
membership estimates. Indeed, previous studies in the panel data classification literature only
provide the consistency of the group membership estimates or a rough convergence rate of the
misclassification probability,2 neither of which explains how the features of the data influence
classification accuracy. With our newly-established asymptotic properties of the misclassifica-
tion probability, we can explicitly show that using multiple quantiles improves classification
accuracy, and this strongly advocates the usage of the composite quantile approach.
The effectiveness of using multiple quantiles for classification relies on the assumption that
1From the non-parametric perspective, the set of quantile regressions chosen at different quantile levels can be viewed as a set of basis functions (not necessarily orthogonal) used to approximate the log-likelihood of the unknown distribution. When the set is large, the composite approach can approximate the log-likelihood function well; therefore, it yields a nearly efficient method.
2A typical clustering result is that the group membership estimates converge at the rate of T^{-δ}, where δ is an unknown positive number (e.g., Bonhomme and Manresa (2015a) and Okui and Wang (2018)). The exponential rate of convergence is interestingly related to the minimax mismatch ratio of the stochastic block models in Zhang and Zhou (2016), in which their signal-to-noise ratio could be linked to the degree of group separation and variance of the error terms in our case.
the group membership structure is invariant over a range of (but not necessarily all) quantiles.
This is relevant in many applications since the factors that drive the group pattern are often
rather inertial. For example, Brand and Xie (2010) found that the economic returns of education
differ significantly over individuals depending on how likely they are to attend college. Since the
likelihood of attending college typically does not change over the different quantiles of the wage
distribution, it seems plausible that the group structure of heterogeneity is also invariant to
quantiles. Even in those cases where the group membership structure may vary over the whole
distribution, some quantile levels still share a common group structure, and the composite
quantile approach can thus be applied to this (small) range of quantiles.
We evaluate the finite-sample performance of our method via a simulation and compare it
with other methods of modeling a group pattern of heterogeneity. We first show that our method
can precisely estimate group memberships and quantile-specific slope coefficients. Classification
accuracy improves as the length of the time series increases. Next, we show that classification
accuracy is influenced by the degree of group separation, number of quantiles used for clus-
tering, and signal-to-noise ratio. In particular, we find that employing multiple quantiles for
classification significantly improves the accuracy of the group membership estimates compared
with existing methods. When groups are separated only in the tails of the distribution but not in their means, conventional mean-based clustering techniques fail to identify the group structure, while our method remains effective. Furthermore, owing
to the more accurate group membership estimates, the group-specific regression quantiles are
also estimated more accurately by our method in finite samples.
We illustrate our method by investigating the output effect of infrastructure capital. We
find that the effect of infrastructure exhibits a large degree of heterogeneity across countries, not
only in the mean effect but also in the shapes of the distributions. Interestingly, a salient geographic pattern emerges: most European and Asian countries behave similarly, while American and some African countries are classified into the same group with a relatively moderate effect of
infrastructure. For both groups, the effect of infrastructure becomes more positive and stronger
in the right tail of the distribution (i.e., when the economy is strong); however, the variation is
larger in the American/African group than in the European/Asian group. Such distributional
heterogeneity in the output effect of infrastructure has not been captured by existing studies.
The remainder of this paper is organized as follows. We position our work against the
related literature in Section 2. We set up the model and describe our estimation method
in Section 3. The asymptotic properties are provided in Section 4. Section 5 provides a
method for determining the number of groups and Section 6 considers an extension of our basic
model, allowing for individual-specific fixed effects. We study the finite-sample properties of our
estimator via a simulation in Section 7 and demonstrate our method with a real data application
in Section 8. Section 9 concludes. The technical details are organized in the Appendix.
2 Literature review
Many studies have aimed to identify the latent group structure in panel data models. Sun
(2005) considered a finite-mixture panel data model with unknown group membership and
Kasahara and Shimotsu (2009) provided the conditions under which the non-parametric identi-
fication of finite-mixture models of dynamic discrete choices is possible. Alternatively, Kmeans,
an “all-or-nothing” classification method (Pollard, 1981), has recently been extensively studied
in the panel data framework; see Lin and Ng (2012), Bonhomme and Manresa (2015a), Sarafidis and Weber (2015), Bonhomme et al. (2017b), Okui and Wang (2018), and Liu et al. (2018). Another popular classification method is to use the classifier-Lasso for clustering; this approach was
first proposed by Su et al. (2016) and further popularized by Su and Ju (2018), Wang et al.
(2017), and Su et al. (2017). Other classification techniques include the thresholding algorithm
(Vogt and Linton, 2016), pairwise comparisons (Krasnokutskaya et al., 2017), and binary seg-
mentation, as discussed by Ke et al. (2016) and Su and Wang (2017), among others. Most of
these studies focus on the heterogeneity in the mean effect, and classification is solely based
on the mean estimates. Thus, they fail to capture distributional heterogeneity. In addition,
the literature on clustering analysis in panel data models does not provide precise asymptotic
analysis of the group membership estimates besides their (super-)consistency. We thus con-
tribute to the literature by studying the distribution features of latent groups and classifying
individuals based on distributional heterogeneity. We also precisely quantify the speed of the
convergence of the misclassification probability and show the extent to which the accuracy of
the group membership estimates depends on the features of the model and data.
Our study builds on previous panel quantile regression models. Following the seminal study
of Koenker (2004), a number of influential authors have provided rigorous (asymptotic) analyses
on the estimation of panel quantile regression, such as Canay (2011), Galvao (2011), Kato et al.
(2012), Galvao and Kato (2016), Graham et al. (forthcoming), and Harding et al. (2017). These
studies all assume that individuals belong to a homogeneous population, and thus the quantile
regression coefficient is common across units. However, as suggested by Galvao et al. (2018),
cross-sectional heterogeneity remains a prominent feature in panel quantile regression. They
proposed a test for homogeneity in fixed effects quantile regressions and found strong evidence
of heterogeneity in the sensitivity of asset returns to firm characteristics at the tail quantiles
(i.e., during booms and busts). Motivated by their finding, we allow regression quantiles to
differ across individuals via a group pattern of heterogeneity. An alternative way in which
to uncover the distributional effect is via distribution regressions, and there exists an inverse
relationship between conditional quantile regression and conditional distributional regression
(Leorato and Peracchi, 2015; Chernozhukov et al., 2018). Distribution regressions offer a flexible
tool with which to model and estimate the entire conditional distribution of any type of outcome
variable (discrete, continuous, mixed), as they allow the effect of the covariates to vary across
different points of the conditional distribution. The cost of this flexibility is that the distribution
regression parameters can be difficult to interpret because they do not directly correspond to
the quantile effects.
Another stream of the literature on modeling heterogeneous distributions, including Rosen et al. (2000), Ng and McLachlan (2014), and Sugasawa (2018), considers finite mixtures of latent conditional distributions. These models often assume that the mixture probability depends on
some observables or that the mixture distribution is composed of several common (known)
distributions. On the contrary, we allow the group memberships and form of distributional
heterogeneity to be fully unrestricted. In other words, we do not require knowledge of the
distribution form for each group, and the distribution can also vary across groups in an arbi-
trary manner. Since finite-mixture models under normal errors are equivalent to the Kmeans
approach (Bonhomme and Manresa, 2015b), we differ from both methods in that our classifi-
cation is not based on mean heterogeneity but rather on distributional heterogeneity.
One closely related work is Gu and Volgushev (2018), who considered panel quantile regression with group fixed effects. We differ from their study in three main aspects. First, they
considered the fixed effects to have a group pattern of heterogeneity, while the regression quantiles are cross-sectionally homogeneous. We allow the regression quantiles to be heterogeneous
across groups to capture the distributional heterogeneity of the effect of the covariates. In ad-
dition, we allow for additive individual-specific fixed effects, accommodating the common situation of individual unobserved heterogeneity; hence, we do not circumvent the incidental parameter
problem. Second, they employed Lasso techniques for clustering, while our approach resembles
the Kmeans approach. Finally, and most importantly, they estimated the groups at one given
quantile level. In practice, however, it is typically unclear which quantile to utilize, and quantile-by-quantile estimation does not guarantee common group membership estimates even when the true group membership structure does not vary across quantiles. In contrast, our estimation is
based on the composite quantile loss function, which allows us to obtain a quantile-invariant
group membership structure and facilitates the identification of groups. We precisely quantify
the speed of the convergence of the misclassification probability, from which we explicitly show
that using multiple quantiles leads to more accurate group membership estimates (even in the
asymptotics) and thus more accurate regression quantiles than using a single quantile. Our
asymptotic results are thus more informative than the consistency of the group membership
estimates in the literature including Gu and Volgushev (2018). This theoretical comparison is
also complemented by the simulation study.
Finally, the idea of using composite quantiles is not new in the literature. Koenker (2005,
Section 5.5), Zou and Yuan (2008), and Zhao and Xiao (2014) showed in their cross-sectional
regressions that combining quantile information improves the efficiency of slope coefficients
that are quantile-independent. Fan et al. (2017) employed the composite idea to estimate the
linear quantile regression at boundary quantiles. We consider panel quantile regression models
in which the slope coefficients depend on the quantiles, but the group membership structure
is quantile-invariant. Hence, we use the composite quantile approach to estimate the group
membership parameter that does not vary across quantiles. Unlike the slope coefficients, group
memberships are discrete, individual-specific, invariant to relabeling, and enter the model as an
index of the slope coefficients. Therefore, new techniques are required to show that combining quantile information improves the efficiency of the group membership estimates.
3 Model setup and estimation
In this section, we first describe the setup of our model and link it to several popular models
and then explain our estimation approach.
3.1 Model setup
Suppose we observe {{y_{it}, x_{it}}_{t=1}^T}_{i=1}^N, where y_{it} is the scalar dependent variable of individual i observed at time t and x_{it} is a (p + 1) × 1 vector of exogenous regressors, whose first element is typically 1. We are interested in the effect of x_{it} on the conditional quantile of y_{it}, and this distributional effect is allowed to differ across individual units. The heterogeneity
can exist in the location of the distribution and/or the shape of the distribution. We assume
that the heterogeneous conditional quantile effect can be characterized by a group structure
such that individuals in the same group share a common conditional quantile curve, while the
curves can differ at any quantile across groups. Thus, we consider the panel data generated
from the following model with group-specific regression quantiles:
Q_τ(y_{it} | x_{it}) = x_{it}'β_{g_i}(τ),   i = 1, . . . , N,  t = 1, . . . , T,   (3.1)

where τ ∈ (0, 1) is the quantile index and Q_τ(y_{it} | x_{it}) is the conditional τ-quantile of y_{it} given x_{it}. The quantile coefficient β_{g_i}(τ) is characterized by a group pattern of heterogeneity, and the group membership of unit i is denoted by g_i ∈ {1, . . . , G}, with G being the number of groups. The group membership structure {g_i}_{i=1}^N is unknown and to be estimated. We assume that the group membership is time-invariant and independent of τ for all i = 1, . . . , N. This is common
in practice and can occur if the group membership of an individual unit is predetermined; then,
the time observations of the dependent variable for this unit are generated from the associated
conditional distribution of this group.
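To make the notion of distributional heterogeneity concrete, the following is a minimal numpy sketch of a data-generating process consistent with model (3.1): two groups share the same conditional median effect but differ in error scale, so they are separated only away from the median. The function name and parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_panel(N=100, T=50, seed=0):
    """Simulate y_it = x_it * 1 + s_g * e_it: both groups share the same
    conditional median effect, but the group-specific scales s_g differ,
    so the groups are separated only in the tails of the distribution."""
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 2, size=N)            # latent group memberships
    scale = np.where(g == 0, 0.5, 2.0)        # group-specific scale s_g
    x = rng.normal(size=(N, T))               # exogenous regressor
    eps = rng.normal(size=(N, T))             # median-zero errors
    y = x + scale[:, None] * eps              # common median effect of 1
    return y, x, g
```

In this design Q_τ(y_{it} | x_{it}) = x_{it} + s_{g_i} Φ^{-1}(τ): the two groups coincide at τ = 0.5 but diverge at tail quantiles, which is exactly the setting where mean-based clustering fails.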
This model captures the cross-sectional heterogeneity and varying effects of the covariates
across different quantiles of the conditional distribution of the dependent variable. It includes
several important models as special cases. When τ = 0.5, our model collapses to a median
regression with group-specific slope coefficients, which is closely related to the panel structure
model with group-specific conditional mean effects (Lin and Ng, 2012; Bonhomme and Manresa,
2015a; Su et al., 2016). If the parameter β_{g_i}(τ) is cross-sectionally homogeneous, i.e., β_{g_i}(τ) = β(τ) for all i, the model reduces to the quantile regression with random effects (Galvao and Poirier, 2016). Our model also includes the panel quantile regression with group fixed effects (Gu and Volgushev, 2018) as a special case in which only the intercept is allowed to have a group pattern of heterogeneity.
Model (3.1) assumes that the fixed effects are group-specific. In some applications, it is
desirable to allow unobserved heterogeneity to be individual-specific and correlated with the
covariates. Hence, we consider an extension of model (3.1) with individual fixed effects in
Section 6 as
Q_τ(y_{it} | α_i(τ), x_{it}) = α_i(τ) + x_{it}'β_{g_i}(τ),   i = 1, . . . , N,  t = 1, . . . , T,   (3.2)

where α_i(τ) is the fixed effect of individual i. While allowing individual unobserved heterogeneity is more appealing, it is technically more involved because of the presence of incidental
parameters and non-smooth features of the quantile regression objective function (Galvao and
Kato, 2016).
Remark 1. We identify distributional heterogeneity by using repeated time series observations
of each individual unit.3 We discuss the identifiability of our model from two aspects. First, we can identify the latent group membership of each individual as long as that individual has sufficiently many time observations lying in the non-overlapping region of the two groups.
This is satisfied if we can observe each individual unit for infinitely many periods. Second, we
allow the distribution of a group to be of any shape, including the multi-modal distribution, and
can still correctly identify the group membership structure. This is again achieved by observing
each individual unit for infinitely many periods. For example, if a group is characterized by a
bimodal distribution, it would not be identified as two unimodal groups because the distribution
of each individual unit in this group is bimodal.
3.2 Estimation method
Model (3.1) contains two types of parameters to estimate: the group-specific regression quantiles β_{g_i}(τ) for some quantile index τ and the group membership variable g_i for all i ∈ {1, . . . , N}. In practice, we often consider a finite number of quantiles, τ_1, . . . , τ_K, and thus we must estimate K regression quantiles β_g(τ_k) ∈ B ⊂ R^{p+1} for each group. Let β(τ_k) = {β_1'(τ_k), . . . , β_G'(τ_k)}', and further denote β(τ) = {β'(τ_1), . . . , β'(τ_K)}' ∈ B^{GK} with τ = (τ_1, . . . , τ_K)'. Define γ = {g_1, . . . , g_N} as a partition of the N individuals into at most G groups, and let Γ_G denote the set of all possible partitions. Note that γ does not change over the quantile levels. We first discuss the estimation of β(τ) and γ for a given number of groups G and then discuss how to determine the number of groups in Section 5.
We propose estimating the regression quantiles and group memberships by minimizing the following composite quantile objective function:

(β̂(τ), γ̂) = argmin_{(β(τ), γ) ∈ B^{GK} × Γ_G} Σ_{k=1}^K Σ_{i=1}^N Σ_{t=1}^T w_k ρ_{τ_k}(y_{it} − x_{it}'β_{g_i}(τ_k)),   (3.3)

where the check function is ρ_{τ_k}(u) = (τ_k − I(u < 0))u and the weights satisfy w_k ∈ [0, 1], k = 1, . . . , K. Typically, we consider K equally spaced quantiles, say τ_k = k/(K + 1), and use equal weights if
the multiple quantiles all contain clustering information. When K = 1, this can be regarded as
Kmeans-type clustering based on the check function at a given quantile.4 Using the composite
quantile approach has several advantages. First, it addresses the problem of which quantile to use
3This differs from the group identification in cross-sectional data, where assumptions on the distribution of each group are typically required; see, e.g., Dong and Lewbel (2011).
4One may consider optimal weights that minimize a certain accuracy measure in the vein of Zhao and Xiao (2014). We leave this interesting topic to future research. Nevertheless, we do discuss how to verify whether quantiles contain clustering information in Section 5.
for clustering, and it directly provides the group membership estimates that do not vary across
quantiles. Second, it leads to a more efficient estimate for the group membership parameter
than using a single quantile or using the mean. Finally, it allows us to identify the latent group
structure when there is group separation only in part of the distribution, e.g., no separation at the mean but only in the tails.
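As a concrete illustration, the check function and the composite loss in (3.3) for a single unit can be written in a few lines of numpy. The function names here are ours; in practice, the minimization over β would be carried out by a linear-programming quantile-regression solver.

```python
import numpy as np

def check(u, tau):
    """Quantile check (pinball) function: rho_tau(u) = (tau - I{u < 0}) u."""
    return (tau - (u < 0)) * u

def composite_loss(y_i, x_i, betas, taus, w):
    """Composite check loss of one unit under candidate coefficients
    betas[k] at quantile level taus[k], with weights w[k]; the unit is
    assigned to the group whose coefficient path minimizes this loss."""
    return sum(wk * check(y_i - x_i @ bk, tk).sum()
               for bk, tk, wk in zip(betas, taus, w))
```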
Since an exhaustive search of the optimal partition of the parameter space is virtually
infeasible (Su et al., 2016), we solve the optimization problem by adopting the following iterative
algorithm in the spirit of Bonhomme and Manresa (2015a) and Okui and Wang (2018):
Algorithm 3.1. Let β^{(0)}(τ_k) be the initial estimate of β(τ_k) for k = 1, . . . , K, and set s = 0.

Step 1 Given β^{(s)}(τ_k), k = 1, . . . , K, compute, for all i ∈ {1, . . . , N},

g_i^{(s)} = argmin_{g_i ∈ {1, . . . , G}} Σ_{k=1}^K Σ_{t=1}^T w_k ρ_{τ_k}(y_{it} − x_{it}'β_{g_i}^{(s)}(τ_k)).

Step 2 Given γ^{(s)}, compute, for each quantile level τ_k,

β^{(s+1)}(τ_k) = argmin_{β ∈ B^G} Σ_{i=1}^N Σ_{t=1}^T ρ_{τ_k}(y_{it} − x_{it}'β_{g_i^{(s)}}(τ_k)),   k = 1, . . . , K.

Step 3 Set s = s + 1. Return to Step 1 until numerical convergence.
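Algorithm 3.1 can be sketched in pure numpy for the intercept-only special case, where the Step 2 quantile regression reduces to a within-group sample quantile; a general implementation would substitute a linear quantile-regression solver in Step 2. The initialization scheme, equal weights, and all names here are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def psqr_cluster(y, G=2, taus=(0.25, 0.5, 0.75), max_iter=100, seed=0):
    """Algorithm 3.1 for an intercept-only model Q_tau(y_it) = beta_g(tau):
    iterate between (Step 1) assigning each unit to the group minimizing its
    composite check loss and (Step 2) re-estimating the group quantiles,
    which here reduce to within-group sample quantiles. Equal weights w_k."""
    rng = np.random.default_rng(seed)
    N, T = y.shape
    K = len(taus)
    # beta^(0): initialize each group's quantile profile from a random unit
    init = rng.choice(N, size=G, replace=False)
    beta = np.array([[np.quantile(y[i], tau) for tau in taus] for i in init])
    check = lambda u, tau: (tau - (u < 0)) * u
    g = None
    for _ in range(max_iter):
        # Step 1: composite check loss of every unit under every group
        loss = np.stack([
            sum(check(y - beta[h, k], taus[k]).sum(axis=1) for k in range(K))
            for h in range(G)
        ])                                    # shape (G, N)
        g_new = loss.argmin(axis=0)
        if g is not None and np.array_equal(g_new, g):
            break                             # numerical convergence
        g = g_new
        # Step 2: group-wise quantile "regression" = sample quantiles
        for h in range(G):
            if np.any(g == h):
                beta[h] = [np.quantile(y[g == h], tau) for tau in taus]
    return g, beta
```

Because both steps weakly decrease the composite objective (3.3) and the number of partitions is finite, the iterations converge, although possibly to a local optimum, so multiple starting values are advisable in practice.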
This algorithm iterates between the classification and estimation steps. In the classification
step (Step 1), we update the group memberships based on the composite quantile check func-
tion given the regression quantile estimates. This step contrasts with the standard Kmeans
(Bonhomme and Manresa, 2015a) or Lasso-based (Su et al., 2016; Gu and Volgushev, 2018) al-
gorithms that cluster individuals only based on the mean or a single quantile of the distribution.
Since the group membership structure is common across the conditional distribution of the de-
pendent variable, multiple quantiles all contain clustering information, although the quantities
of information may differ. Hence, using multiple quantiles for classification is expected to be
more efficient than existing approaches.
Step 2 estimates the regression quantiles given the group membership estimates by applying
standard quantile regression estimation to each group. Since each β_g(τ) can depend on τ, the
estimation is conducted independently at each quantile level. The composite feature of the
objective function (3.3) does not help estimate the regression quantiles directly. However, it
does indirectly improve the coefficient estimation in finite samples owing to the more precise
group membership estimates. This improvement is especially significant when two groups are
less well separated or when the signal-to-noise ratio is low. In some special cases where a
subvector of βg(τ) is independent of τ , using composite quantiles helps improve the asymptotic
efficiency of the estimators for this subvector of parameters, similar to the arguments of Zou
and Yuan (2008) and Zhao and Xiao (2014).
4 Asymptotic properties
In this section, we study the asymptotic properties of the proposed estimators and formally
demonstrate the advantages of using multiple quantiles. We first introduce some notation. Let β^0 and g_i^0 denote the true values of β and g_i, and define ε_{it}(τ_k) := y_{it} − x_{it}'β^0_{g_i^0}(τ_k). It then follows from (3.1) that Q_{τ_k}(ε_{it}(τ_k) | x_{it}) = 0. For each individual i, let F_{i,τ_k}(· | x) denote the conditional distribution of ε_{it}(τ_k) given x_{it} = x for all t = 1, . . . , T. We assume that F_{i,τ_k}(· | x) has a density f_{i,τ_k}(· | x).
4.1 Weak consistency
We first show the consistency of the regression quantile estimates under the unknown (i.e.,
estimated) group membership structure. The following assumptions are required.
Assumption 1.
(i) B is a compact subset of R^{p+1}.

(ii) For each i ≥ 1, the process {(y_{it}, x_{it}) : t ≥ 1} is strictly stationary and α-mixing with α-mixing coefficient α_i(t). Furthermore, assume sup_{i≥1} α_i(t) → 0 as t → ∞.

(iii) There exists a constant M such that sup_{i≥1} E[‖x_{it}‖] ≤ M.

(iv) For each i and k, f_{i,τ_k}(u|x) is continuously differentiable with respect to u. Let f^{(1)}_{i,τ_k}(u|x) := (∂/∂u) f_{i,τ_k}(u|x). There exists a constant C_f such that |f^{(1)}_{i,τ_k}(u|x)| ≤ C_f uniformly over (u, x) for all i ≥ 1 and k = 1, . . . , K.

(v) Let λ_N(g, g̃, τ_k) be the minimum eigenvalue of N^{-1} Σ_{i=1}^N I{g_i^0 = g, g_i = g̃} E[f_{i,τ_k}(0|x_{it}) x_{it} x_{it}'], such that for all g, g̃ ∈ {1, . . . , G} and k = 1, . . . , K,

λ_N(g, g̃, τ_k) ≥ λ_N,  and  lim inf_{N→∞} λ_N > 0,  a.s.
Assumption 1(i) requires the compactness of the parameter space, which is standard in the
econometrics literature. Assumption 1(ii) is similar to (A1) of Galvao and Kato (2016), ensuring
that the distribution of the residual εit(τk) is common over time. Compared with Assumptions
2(c) and 2(d) in Bonhomme and Manresa (2015a) and Condition 1 in Vogt and Linton (2016),
we do not necessarily require exponentially or sufficiently high polynomial decaying mixing
rates. Assumption 1(iii), which controls the tail behavior of the variables, is used to bound the
maximum clustering error. Assumption 1(iv) restricts the smoothness of the density function of
the residuals, similar to Assumption (A5) in Galvao and Kato (2016). Finally, Assumption 1(v)
is a relevant rank condition akin to Assumption 1(g) in Bonhomme and Manresa (2015a).
Theorem 4.1. Under Assumption 1, we have, as N, T → ∞,

max_{1≤i≤N} ‖β̂_{ĝ_i}(τ_k) − β^0_{g_i^0}(τ_k)‖ →_P 0,   (4.1)

for any k = 1, . . . , K and g_i^0 ∈ {1, . . . , G}.
This theorem states that as N and T diverge, the estimated regression quantiles under the
unknown (i.e., estimated) group membership structure converge to their true values with known
group memberships.
4.2 Asymptotic behavior of the misclassification probability
The weak consistency result provides some justification for our method. Nevertheless, it does not explain whether, and to what degree, the accuracy of the group membership estimates
is influenced by using composite quantiles and by the features of the data. In this section,
we aim to precisely quantify the accuracy of the group membership estimates and examine
how the use of distributional information and other features of the data affects the level of
accuracy. A challenge is that the group membership parameter is discrete and label-invariant,
and thus directly studying its asymptotic distribution is rather difficult, if not impossible. To
overcome this challenge, we focus on the misclassification probability, a widely used measure
of classification accuracy, defined as
MP[β̂(τ)] = (1/N) Σ_{i=1}^N I{ĝ_i(β̂(τ)) ≠ g_i^0}.   (4.2)
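Because group labels are only identified up to a permutation, computing (4.2) on real data requires matching estimated labels to true ones. Below is a hedged numpy sketch of this label-matched misclassification probability; the matching-over-permutations convention is ours, not defined in the paper.

```python
import numpy as np
from itertools import permutations

def misclassification_prob(g_hat, g_true, G):
    """Empirical MP of (4.2), minimized over relabelings of the G groups,
    since group memberships are invariant to permuting the labels."""
    g_hat = np.asarray(g_hat)
    g_true = np.asarray(g_true)
    return min(
        np.mean(np.array(perm)[g_hat] != g_true)  # relabel g_hat via perm
        for perm in permutations(range(G))
    )
```

Enumerating all G! permutations is cheap for the small G typical in applications; for large G, a Hungarian-assignment matching would be the natural substitute.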
We derive the rate of the convergence of this probability and examine how the number of
quantiles used for the estimation and the features of the data affect this rate. Some additional
assumptions and a lemma are required.
Assumption 2.
(i) For all g ∈ {1, . . . , G}, N^{-1} Σ_{i=1}^N I{g_i^0 = g} →_p π_g > 0.

(ii) Let g ≠ g̃ with (g, g̃) ∈ {1, . . . , G}². Then, there exists a k ∈ {1, . . . , K} such that ‖β^0_g(τ_k) − β^0_{g̃}(τ_k)‖ > 0.

(iii) The minimum eigenvalues of E[x_{it} x_{it}'] are bounded away from zero uniformly over i ≥ 1.

(iv) For each i ≥ 1, let ρ_i(t) be the ρ-mixing coefficient of the process {(y_{it}, x_{it}) : t ≥ 1}, defined as the "maximal correlation" by Kolmogorov and Rozanov (1960). Let ρ′ := sup_{t≥1, i≥1} ρ_i(t), and assume ρ′ < 1.

(v) There exists a constant M such that sup_{i≥1} ‖x_{it}‖ ≤ M a.s.
(vi) Define λ as the minimum eigenvalue of the following matrix:(1
NT
N∑i=1
T∑t=1
I(g0i = g)xit
)(1
NT
N∑i=1
T∑t=1
I(g0i = g)x′it
),
Then, λp−→ λ > 0 as N, T →∞.
Assumption 2(i) requires a sufficient number of individual units in each group, as in Bonhomme and Manresa's Assumption 2(a). Assumptions 2(ii) and 2(iii) jointly provide the conditions under which we can identify the group memberships: they require that the groups are well separated. Assumption 2(iv) requires that the data are not perfectly correlated across individuals and over time, while a certain degree of serial correlation and cross-sectional dependence is allowed. Assumption 2(v) is similar to, but stronger than, Assumption 1(iii); it can be relaxed to $E\|x_{it}\|^{2+\delta} \leq M$ at the expense of lengthier proofs. Assumption 2(vi) guarantees that the rank condition holds under any group structure.
Remark 2. The group separation assumption (Assumption 2(ii)) is satisfied as long as the regression quantiles are distinct across groups for at least one quantile level. This allows, for example, for groups that are separated not in the mean but only in a tail quantile, and thus our group separation condition is weaker than the standard mean-based separation condition (e.g., Bonhomme and Manresa (2015a) and Su et al. (2016)).
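As a quick numerical illustration of this point (our own toy example, not from the paper), compare a standard normal with the demeaned $\mathrm{Gamma}(1, 5)$ error later used in DGP.3: both have mean zero, yet their 0.9 quantiles differ sharply, so a tail quantile separates the groups even when the mean does not.

```python
# Toy illustration (our own): identical means, but sharply different tail
# quantiles, so only a tail quantile identifies the groups.
import random

def empirical_quantile(draws, tau):
    s = sorted(draws)
    return s[min(len(s) - 1, int(tau * len(s)))]

rng = random.Random(0)
n = 200_000
normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Gamma(shape=1, scale=5) is Exponential(rate=1/5) with mean 5; demean it.
gamma = [rng.expovariate(1 / 5) - 5.0 for _ in range(n)]

mean_gap = abs(sum(normal) / n - sum(gamma) / n)   # close to zero
q90_gap = (empirical_quantile(gamma, 0.9)
           - empirical_quantile(normal, 0.9))      # roughly 6.5 vs 1.28
```

Mean-based clustering sees no separation here, while a quantile at $\tau = 0.9$ sees a gap of about five units.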
Remark 3. Compared with Assumption 2(d) in Bonhomme and Manresa (2015a), which imposes an exponential decay on the tail of the errors, we allow the errors to have heavy tails. We quantify the speed of convergence of the misclassification probability by using a central limit theorem for strongly mixing processes; see Bradley and Tone (2017). This also differs from Bonhomme and Manresa (2015a), who used exponential inequalities for dependent processes (their Lemma B.5).
Lemma 1. Under Assumptions 1 and 2(i)–2(ii), we have, as $N, T \to \infty$,
$$\|\hat{\beta}_g(\tau_k) - \beta_g^0(\tau_k)\| \xrightarrow{p} 0,$$
for $k = 1, \dots, K$ and $g \in \{1, \dots, G\}$.
This lemma suggests that the difference between the estimated and true regression quantiles is asymptotically negligible. In combination with Theorem 4.1, it implies that the group membership estimates converge to the true values asymptotically, and it allows us to study the asymptotic behavior of the group membership estimates in a neighborhood of the true value. Specifically, we denote by $N_\eta$ the set of parameters $\beta(\tau) \in \Theta^{GK}$ such that, for a given $\eta > 0$, $\|\beta_g(\tau_k) - \beta_g^0(\tau_k)\| < \eta$ for any $k = 1, \dots, K$ and $g \in \{1, \dots, G\}$.
The following quantities are crucial for establishing the asymptotic properties of the misclassification probability. Define
$$b_{it}(K) = K^{-1} \sum_{k=1}^{K} w_k \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du,$$
whose first and second moments are
$$E[b_{it}(K)] = K^{-1} \sum_{k=1}^{K} w_k E\left[ \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( F_{i,\tau_k}(u \mid x_{it}) - \tau_k \big) \, du \right],$$
and
$$E[b_{it}^2(K)] = E\left[ K^{-1} \sum_{k=1}^{K} w_k \int_0^{x_{it}'(\beta_g^0(\tau_k) - \beta_{\tilde{g}}^0(\tau_k))} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du \right]^2.$$
Let $\zeta_{g,\tilde{g}} = \inf_{i \geq 1} E[b_{it}(K)] / \sqrt{\mathrm{Var}[b_{it}(K)]} > 0$ and $C' = (1 + \rho') / (1 - \rho')$, with $\rho'$ defined as in Assumption 2(iv). The following theorem gives the rate of convergence of the misclassification probability.
Theorem 4.2. Under Assumptions 1 and 2, we have, as $N, T \to \infty$,
$$\sup_{\beta(\tau) \in N_\eta} \frac{1}{N} \sum_{i=1}^{N} I\big\{ \hat{g}_i(\beta(\tau)) \neq g_i^0 \big\} = O_P\big( T^{-1/2} \exp(-\zeta T) \big), \qquad (4.3)$$
where
$$\zeta = \min_{\substack{g \neq \tilde{g} \\ g, \tilde{g} \in \{1, \dots, G\}}} \frac{\zeta_{g,\tilde{g}}^2}{8 C'}.$$
This theorem shows that the rate of convergence of the misclassification probability is an exponential function of $T$, in line with the literature (Bonhomme and Manresa, 2015a; Okui and Wang, 2018). We take one step further and show that this rate depends on the number of quantiles used for the estimation, the magnitude of the noise, the degree of group separation, and the serial correlation. In particular, $\zeta$ is a function of the group separation across quantiles, which also depends on the error term. From the upper limit of the integral in $E[b_{it}(K)]$, we observe that the exponentially decaying convergence rate only requires the group pattern to be well separated at some, but not all, quantiles (Assumptions 2(ii) and 2(iii)). Hence, our composite approach is more robust in identifying the group structure than using only a single quantile or the mean for clustering. $C'$ captures the degree of serial dependence: a larger degree of dependence corresponds to a larger $C'$, which in turn leads to a larger misclassification probability.⁵
Illustrative example: To better illustrate how the rate of convergence is determined by the various quantities, we consider a simple model with two groups ($G = 2$) and only an intercept:
$$y_{it} = \beta^0_{g_i^0} + \varepsilon_{it}, \quad g_i^0 \in \{1, 2\}, \qquad (4.4)$$
where $\varepsilon_{it}$ is i.i.d. $N(0, \sigma^2)$. We assume $\beta_1^0 < \beta_2^0$ without loss of generality. Model (4.4) can be rewritten, in a form similar to (3.1), as
$$y_{it} = \beta^0_{g_i^0}(\tau_k) + \varepsilon_{it}(\tau_k), \quad k = 1, \dots, K, \qquad (4.5)$$
where $\beta^0_{g_i^0}(\tau_k) = \beta^0_{g_i^0} + q_{\tau_k}$ and $\varepsilon_{it}(\tau_k) = \varepsilon_{it} - q_{\tau_k}$, with $q_\tau$ denoting the $100\tau\%$ quantile of $N(0, \sigma^2)$.
In this case, the misclassification probability is the probability of (mis)classifying an individual into group 2 given that s/he is in group 1. It follows from (3.3) that this probability can be written as
$$P\big( \hat{g}_i[\beta^0(\tau)] = 2 \mid g_i^0 = 1 \big) = P\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - \beta_2^0(\tau_k) \big) \leq \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - \beta_1^0(\tau_k) \big) \right).$$

⁵ For a given finite $K$, it is possible to construct data-driven weights $w_k$, $k = 1, \dots, K$, based on Theorem 4.2 to obtain an improved convergence rate for the group membership estimates. For example, we can first set $w_k = 1$ in the objective function (3.3) to estimate $\beta_g^0(\tau_k)$, $F_{i,\tau_k}(u \mid x_{it})$, and $\varepsilon_{it}(\tau_k)$. Then, we can compute the data-driven weights $w_k$ from $\max_{w_k \geq 0, \, \mathrm{Var}[b_{it}(w_k)] = 1} E[b_{it}(w_k)]$ and apply them in (3.3) to re-estimate the model.
By using Knight's (1998) identity, we can rewrite the misclassification probability as
$$P\big( \hat{g}_i(\beta^0(\tau)) = 2 \mid g_i^0 = 1 \big) = P\left( \frac{1}{T} \sum_{t=1}^{T} b_{i,t}(K) \leq 0 \right),$$
where $b_{i,t}(K) = K^{-1} \sum_{k=1}^{K} w_k \int_0^{\beta_2^0 - \beta_1^0} \big( I(\varepsilon_{it}(\tau_k) \leq u) - \tau_k \big) \, du$. In this simple case, we have
$$E[b_{i,t}(K)] = K^{-1} \sum_{k=1}^{K} w_k \int_0^{\beta_2^0 - \beta_1^0} \big( \Phi((u + q_{\tau_k})/\sigma) - \tau_k \big) \, du,$$
and
$$\mathrm{Var}[b_{i,t}(K)] = K^{-2} \sum_{k=1}^{K} \sum_{l=1}^{K} w_k w_l \int_0^{\beta_2^0 - \beta_1^0} \int_0^{\beta_2^0 - \beta_1^0} \left[ \Phi\left( \frac{\min(u + q_{\tau_k}, s + q_{\tau_l})}{\sigma} \right) - \Phi\left( \frac{u + q_{\tau_k}}{\sigma} \right) \Phi\left( \frac{s + q_{\tau_l}}{\sigma} \right) \right] du \, ds,$$
where $\Phi(\cdot)$ is the standard normal distribution function. The central limit theorem implies that
$$\sqrt{T} \, \frac{T^{-1} \sum_{t=1}^{T} b_{i,t}(K) - E[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}} \xrightarrow{d} N(0, 1), \quad T \to \infty.$$
Hence, we can show that, as $T \to \infty$,
$$P\big( \hat{g}_i(\beta^0(\tau)) = 2 \mid g_i^0 = 1 \big) \to \Phi\left( -\frac{E[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}} \sqrt{T} \right) = O\left( T^{-1/2} \exp\left( -\frac{(E[b_{i,t}(K)])^2}{2 \, \mathrm{Var}[b_{i,t}(K)]} T \right) \right),$$
where the equality uses the Mills ratio. Figure 1 depicts the relationships between the misclassification probability and the number of quantiles used for the estimation, the degree of group separation, and the variance of the error term.
FIGURE 1
The left panel of Figure 1 shows that the misclassification probability is a decreasing function of the number of quantiles. The middle panel considers a smaller degree of group separation, for which we see a larger misclassification probability than in the left panel. The right panel suggests that increasing the variance of the error term leads to a higher misclassification probability, as expected. In this special case, the association between the misclassification probability and these key quantities can be proven theoretically (see the supplementary file). In the general case, it is difficult to establish such monotonic relationships analytically since the distribution of the error term is unknown. Nevertheless, numerical investigation suggests that these relationships hold for a wide range of distributions, such as most distributions in the exponential family.
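The closed-form approximation above can be checked numerically. The sketch below (our own code; the function name and the midpoint-rule grids are our choices, and equal weights $w_k = 1$ are used) evaluates $E[b_{i,t}(K)]$ and $\mathrm{Var}[b_{i,t}(K)]$ by numerical integration for the Gaussian example, then reports the normal-approximation misclassification probability $\Phi(-\zeta\sqrt{T})$; one can use it to see the probability shrink as $T$ grows, and to experiment with the quantile set, the separation $\beta_2^0 - \beta_1^0$, and the error variance, as in Figure 1.

```python
# Numerical check of the two-group Gaussian example (our own sketch).
from statistics import NormalDist

def mp_normal_example(beta1, beta2, sigma, taus, T, n_grid=60):
    """Normal-approximation misclassification probability Phi(-zeta*sqrt(T))."""
    nd = NormalDist()
    K = len(taus)
    delta = beta2 - beta1                        # group separation
    q = [sigma * nd.inv_cdf(t) for t in taus]    # Gaussian quantiles q_tau
    h = delta / n_grid
    grid = [(j + 0.5) * h for j in range(n_grid)]  # midpoint rule on [0, delta]

    # E[b] = K^{-1} sum_k int_0^delta (Phi((u + q_k)/sigma) - tau_k) du
    e_b = sum(sum(nd.cdf((u + q[k]) / sigma) - taus[k] for u in grid) * h
              for k in range(K)) / K

    # Var[b] from the double integral with covariance kernel
    # Phi(min(u+q_k, s+q_l)/sigma) - Phi((u+q_k)/sigma)*Phi((s+q_l)/sigma)
    var_b = 0.0
    for k in range(K):
        for l in range(K):
            for u in grid:
                a = (u + q[k]) / sigma
                ca = nd.cdf(a)
                for s in grid:
                    b = (s + q[l]) / sigma
                    var_b += (nd.cdf(min(a, b)) - ca * nd.cdf(b)) * h * h
    var_b /= K * K

    return nd.cdf(-e_b / var_b ** 0.5 * T ** 0.5)
```

With a DGP.1-style separation $\beta_2^0 - \beta_1^0 = 0.75$ and $\sigma = 1$, the probability falls rapidly in $T$, mirroring the $T^{-1/2}\exp(-\zeta T)$ rate in Theorem 4.2.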
Theorem 4.2 implies that the estimation error caused by unknown group memberships converges to zero at an exponential rate. Formally, denote by $\tilde{\beta}(\tau)$ the regression quantiles estimated under the true group membership structure:
$$\tilde{\beta}(\tau) = \arg\min_{\beta(\tau) \in \mathcal{B}^{GK}} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, \rho_{\tau_k}\big( y_{it} - x_{it}' \beta_{g_i^0}(\tau_k) \big).$$
Then, the following corollary relates the estimates obtained under the unknown and the true group memberships.
Corollary 1. Under the assumptions of Theorem 4.2, we have, for all $g \in \{1, \dots, G\}$ and $k = 1, \dots, K$,
$$\|\hat{\beta}_g(\tau_k) - \tilde{\beta}_g(\tau_k)\| = O_P\big( T^{-1/2} \exp(-\zeta T) \big), \qquad (4.6)$$
as $N, T \to \infty$.
This corollary states that the regression quantile estimators under unknown group memberships converge to those under the true group memberships at a similar exponential rate, which depends on the number of quantiles used and the features of the data.
4.3 Asymptotic distribution
Finally, we derive the asymptotic distribution of the regression quantile estimates. Extra as-
sumptions are needed for this result.
Assumption 3.

(i) Let $y_i = (y_{i1}, \dots, y_{iT})'$ and $x_i = (x_{i1}', \dots, x_{iT}')'$. For all $g \in \{1, \dots, G\}$, $\{(y_i, x_i) I(g_i^0 = g)\}_{i=1}^{N}$ are i.i.d.

(ii) The conditional density of $y_{it}$ given $x_{it}$, $I(g_i^0 = g) f_g(y \mid x_{it})$, is bounded and continuous for all $g \in \{1, \dots, G\}$.

(iii) The matrix $\Gamma(\tau_k, g) = E\big[ x_{it} x_{it}' I(g_i^0 = g) f_g(x_{it}' \beta_g^0(\tau_k) \mid x_{it}) \big]$ is invertible and has its minimum eigenvalue bounded away from zero, uniformly for all $k = 1, \dots, K$.
Assumption 3 is similar to Assumptions A1–A4 in Galvao and Poirier (2016), which are used to prove the asymptotic normality of the pooled linear random effects quantile estimator, except that we restrict attention to the observations in each group. Assumption 3(i) requires that individual units are not cross-sectionally dependent.⁶ Assumption 3(ii) imposes identical distributions of units within a group, which strengthens Assumption 1(iv). Assumption 3(iii) is a rank condition for identification, and the matrix $\Gamma(\tau_k, g)$ determines the variance of the asymptotic distribution.
Corollary 2. If Assumptions 1, 2, and 3 hold and $N / (\sqrt{T} \exp(\zeta T)) \to 0$ as $N, T \to \infty$, with $\zeta > 0$ defined in Theorem 4.2, then we have, for all $g \in \{1, \dots, G\}$,
$$\Gamma(\tau, g) \sqrt{\pi_g N T} \big( \hat{\beta}_g(\tau) - \beta_g^0(\tau) \big) \Rightarrow z(\tau, g), \qquad (4.7)$$
where $z(\cdot, g)$ is a $K$-dimensional normal distribution with zero mean and covariance matrix
$$E[z(\tau, g) z(\tau, g)'] = \operatorname*{plim}_{T \to \infty} T^{-1} \sum_{s=1}^{T} \sum_{t=1}^{T} E\big[ (I(\varepsilon_{it}(\tau) \leq 0) - \tau)(I(\varepsilon_{is}(\tau) \leq 0) - \tau)' x_{it} x_{is}' I(g_i^0 = g) \big],$$
with $\varepsilon_{it}(\tau) = (\varepsilon_{it}(\tau_1), \dots, \varepsilon_{it}(\tau_K))'$.
This result is a direct consequence of Corollary 1 given the well-established asymptotic distribution of the regression quantile estimates. If individual units are allowed to be dependent within a group by relaxing Assumption 3(i), a cluster-robust variance–covariance matrix estimator can be employed for inference, as in Galvao and Poirier (2016).

⁶ For the weak consistency and asymptotic equivalence, we can allow for lagged outcomes and general predetermined regressors. However, to derive the asymptotic distribution, we have to rule out lagged outcomes unless stronger assumptions are added. We conjecture that when lagged outcomes are included, a dynamic quantile IV-type method (Galvao, 2011) can be applied to handle the bias from the unobserved initial values.
Remark 4. If the estimated conditional quantile is non-monotone in $\tau$, we can rearrange it into a monotone function by simply sorting the values of the function in non-decreasing order. Chernozhukov et al. (2010) showed that such a rearrangement can improve the finite-sample properties of the estimator.
5 Determining the number of groups
We have thus far assumed that the number of groups is known. However, this number is rarely
given in applications and we therefore need to determine it before implementing our estimator.
A popular approach is to minimize some information criterion (IC) computed for different numbers of groups, which trades off model fit against the number of parameters (e.g., Bonhomme and Manresa (2015a), Su et al. (2016), and Gu and Volgushev (2018)). However, the use of an IC in our case is complicated because our estimator is obtained from the composite quantile objective function. The heterogeneity in the number of groups across quantiles may "contaminate" the behavior of the IC if the composite feature of the objective function is not well incorporated (Gao and Song, 2010). In particular, if the groups are only identified at the tail quantiles but not at the central quantiles, a standard IC is likely to underestimate the number of groups: the check function evaluated at the central quantiles can dominate the composite quantile objective function, so that the objective function undervalues the improvement in fit when the number of groups rises. Underestimating the number of groups is particularly undesirable (compared with overestimation), as it leads to inconsistent coefficient estimates (Liu et al., 2018).
To avoid underestimation, we propose an innovative approach for selecting the number of groups: we first choose the optimal number of groups at each specific quantile by minimizing a quantile-specific IC, and then take the final number of groups to be the maximum over all considered quantiles. We assume that the true number of groups $G^0$ is bounded from above by a finite integer $G_{\max}$. The quantile-specific IC is defined as
$$\mathrm{IC}(G, \tau) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big( y_{it} - x_{it}' \hat{\beta}^{(G)}_{\hat{g}_i}(\tau) \big) + G(p + 1) f(N, T), \qquad (5.1)$$
where the superscript $(G)$ refers to the estimator with $G$ groups, $G(p + 1)$ is the number of parameters of interest, and $f(N, T)$ is a tuning parameter. We then minimize the IC to determine the number of groups at each given quantile $\tau_k$, $k = 1, \dots, K$:
$$\hat{G}(\tau_k) = \arg\min_{G = 1, \dots, G_{\max}} \mathrm{IC}(G, \tau_k). \qquad (5.2)$$
We can show that, for each $\tau_k$, $\hat{G}(\tau_k)$ consistently estimates $G^0$ under the following assumptions. Denote any $G$-partition of $\{1, 2, \dots, N\}$ by $P(G) = (P_1, \dots, P_G)$, where we suppress the dependence of $\{P_g, g = 1, \dots, G\}$ on $G$ for notational convenience. Then $P(G) \in \mathcal{P}(G)$, with $\mathcal{P}(G)$ denoting the collection of all such partitions. Let
$$\hat{e}_{P(G)} = \frac{1}{NT} \sum_{g=1}^{G} \sum_{i \in P_g} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \hat{\beta}_{P_g}(\tau_k) \big), \quad \text{with} \quad \hat{\beta}_{P_g}(\tau_k) = \arg\min_{\beta} \frac{1}{N_{P_g} T} \sum_{i \in P_g} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \beta \big),$$
where $N_{P_g}$ is the number of units in partition cell $P_g$.
Assumption 4.

(i) As $N, T \to \infty$,
$$\min_{1 \leq G < G^0} \inf_{P(G) \in \mathcal{P}(G)} \hat{e}_{P(G)} \xrightarrow{p} \underline{e} > e_{G^0},$$
where $e_{G^0} = \operatorname*{plim}_{N, T \to \infty} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_{\tau_k}\big( y_{it} - x_{it}' \beta^0_{g_i^0}(\tau_k) \big)$.

(ii) As $N, T \to \infty$, $f(N, T) \to 0$ and $\sqrt{NT} f(N, T) \to \infty$.
(ii) As N, T →∞, f(N, T )→ 0 and√NTf(N, T )→∞.
Assumption 4(i) implies that the estimation error delivered by any underfitted model is larger than that of the true model; it thus ensures that $\hat{G}$ is chosen to be at least $G^0$. Assumption 4(ii), by quantifying the asymptotic magnitude of $f(N, T)$, ensures that $\hat{G}$ cannot exceed $G^0$.

Theorem 5.1. Suppose that Assumptions 1–4 hold and that, for any $\delta > 0$, $N / (\sqrt{T} \exp(\delta T)) \to 0$ as $N, T \to \infty$. Then $P(\hat{G}(\tau_k) = G^0) \to 1$.
Next, we choose the maximum number of groups over all quantiles under consideration:
$$\hat{G} = \arg\max_{k = 1, \dots, K} \hat{G}(\tau_k). \qquad (5.3)$$
This procedure avoids underspecifying the number of groups since the consistent selection criterion in (5.2) is computed at each quantile separately, with no summation over quantiles. Hence, we can select the right number of groups even when the groups are not separately identified at some (but not all) quantiles.
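The two-step rule in (5.1)–(5.3) can be sketched for an intercept-only model as follows. This is our own simplified implementation, not the authors' algorithm: the iterative assignment/update routine, the block initialization, and all names are our choices, and the penalty is the $f(N, T) = 0.1\log(NT)/\sqrt{NT}$ used later in the simulations.

```python
# Sketch of the two-step group-number selection rule (5.1)-(5.3) for an
# intercept-only model at each single quantile (our own simplified code).
import math

def rho(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def quantile(values, tau):
    s = sorted(values)
    return s[min(len(s) - 1, int(tau * len(s)))]

def fit_groups(panel, G, tau, n_iter=50):
    """Intercept-only group quantile fit: iterate assignment and update steps.

    panel: list of N lists, each holding the T observations of one unit.
    Returns the total check-function loss at the final grouping.
    """
    N = len(panel)
    # initialize by sorting units on their own tau-quantile into G blocks
    order = sorted(range(N), key=lambda i: quantile(panel[i], tau))
    labels = [0] * N
    for rank, i in enumerate(order):
        labels[i] = rank * G // N
    for _ in range(n_iter):
        # update step: group coefficient = tau-quantile of pooled group obs
        beta = []
        for g in range(G):
            obs = [y for i, ys in enumerate(panel) if labels[i] == g for y in ys]
            beta.append(quantile(obs, tau) if obs else 0.0)
        # assignment step: each unit joins the group minimizing its check loss
        new = [min(range(G), key=lambda g: sum(rho(y - beta[g], tau) for y in ys))
               for ys in panel]
        if new == labels:
            break
        labels = new
    return sum(rho(y - beta[labels[i]], tau)
               for i, ys in enumerate(panel) for y in ys)

def select_G(panel, taus, Gmax=4):
    """Quantile-specific IC minimization (5.2), then the max rule (5.3)."""
    N, T = len(panel), len(panel[0])
    f = 0.1 * math.log(N * T) / math.sqrt(N * T)   # penalty from Section 7
    p = 0                                          # intercept-only: p + 1 = 1
    G_hat = []
    for tau in taus:
        ics = [fit_groups(panel, G, tau) / (N * T) + G * (p + 1) * f
               for G in range(1, Gmax + 1)]
        G_hat.append(1 + ics.index(min(ics)))
    return max(G_hat)
```

Because the per-quantile criterion is minimized separately and only the maximum is kept, a quantile at which the groups are indistinguishable (returning $\hat{G}(\tau_k) = 1$) cannot drag the final choice below the true number.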
The estimated number of groups at each quantile level also suggests a way to determine the weights in the composite quantile objective function. If $\hat{G}(\tau_k) = 1$ for some $\tau_k$ (e.g., $\tau_k = 0.5$) but is larger than 1 at other quantiles, this suggests that we should not use these non-informative quantiles for clustering (i.e., we set $w_k = 0$). As long as the quantile $\tau_k$ contains any clustering information (i.e., $\hat{G}(\tau_k) > 1$), we recommend including it in the composite quantile check function for estimating the group memberships, even though $\hat{G}(\tau_k)$ may differ across $\tau_k$. The motivation for this approach is that the composite approach helps us obtain quantile-invariant group memberships, avoids the difficulty of relabeling groups across quantile levels, and improves the clustering accuracy.
One problem with this procedure is that it may overestimate the number of groups in finite samples if one quantile returns an incorrectly large number of groups, which typically occurs when there are limited observations at some tail quantiles. The cost of overestimation is comparatively lower than that of underestimation, since it does not harm the consistency of the coefficient estimates but only sacrifices efficiency and induces small-sample bias (Bonhomme and Manresa, 2015b; Liu et al., 2018). A solution to this problem is to skip some of the extreme quantiles, say 0.1 and 0.9, that have limited observations.
6 Models with individual fixed effects
In this section, we extend the benchmark model (3.1) to allow for individual-specific fixed effects (rather than group fixed effects):
$$Q_{\tau_k}(y_{it} \mid x_{it}) = \alpha_i(\tau_k) + x_{it}' \beta_{g_i}(\tau_k), \quad i = 1, \dots, N, \; t = 1, \dots, T, \; k = 1, \dots, K, \qquad (6.1)$$
where $\alpha_i(\tau_k)$ represents the time-invariant individual fixed effect, potentially correlated with the regressors, $x_{it}$ is the $p \times 1$ regressor vector without a constant term, $\beta_{g_i}(\tau_k) \in \mathcal{B} \subset \mathbb{R}^p$, and $\alpha_i(\tau_k) \in \mathcal{A} \subset \mathbb{R}$. By allowing the unobserved heterogeneity to be individual-specific, (6.1) provides a richer context of heterogeneity than (3.1). To derive the weak consistency of the regression quantile estimates under unknown group memberships, we need to impose a more restrictive asymptotic relation between $N$ and $T$ (i.e., $\log N / \sqrt{T} \to 0$) because of the incidental parameters arising from the individual fixed effects. We employ the assumption structure of Kato et al. (2012) for the weak consistency.
Assumption 5.

(i) $\mathcal{A} \subset \mathbb{R}$ and $\mathcal{B} \subset \mathbb{R}^p$ are compact.

(ii) For each $i \geq 1$, the process $\{(y_{it}, x_{it}) : t \geq 1\}$ is strictly stationary and $\beta$-mixing with mixing coefficient $\beta_i(t)$. Assume that there exist constants $a \in (0, 1)$ and $B \geq 0$ such that $\sup_{i \geq 1} \beta_i(t) \leq B a^t$. Furthermore, the processes $\{(y_{it}, x_{it}) : t \geq 1\}$ are independent across $i$.

(iii) The minimum eigenvalues of $E[f_{i,\tau_k}(0 \mid x_{it})(1, x_{it}')'(1, x_{it}')]$ are bounded away from zero uniformly over $i \geq 1$.
Theorem 6.1. If $(\log N) / \sqrt{T} \to 0$ as $N, T \to \infty$, and Assumptions 1(iv), 2(v), and 5 hold, then we have, for all $g \in \{1, \dots, G\}$ and any $k = 1, \dots, K$,
$$\max_{1 \leq i \leq N} \|\hat{\beta}_{\hat{g}_i}(\tau_k) - \beta^0_{g_i^0}(\tau_k)\| \xrightarrow{P} 0. \qquad (6.2)$$
With the help of Theorem 6.1, we can obtain results similar to Theorem 4.2 and Corollary 1 for the individual fixed effects model. In particular, Theorem 6.1 implies that the difference between the regression quantile estimates under the estimated group memberships and those under the true group memberships decays exponentially, which in turn implies the asymptotic normality of the regression quantile estimates with unknown group memberships. Furthermore, as noted by Kato et al. (2012), the estimation of regression quantiles suffers from finite-sample bias in the presence of individual fixed effects. To improve the finite-sample performance, one could apply a smoothing technique for bias correction similar to that suggested in Galvao and Kato (2016), but this is beyond the scope of this study.
7 Monte Carlo simulation
In this section, we evaluate the finite-sample performance of the proposed method. In particular,
we examine whether PSQR can correctly classify individual units as well as effectively recover
the quantile-specific slope coefficients within each group. By comparing the PSQR model with
mean-based or single-quantile-based clustering, we also shed light on the importance of taking
the entire distribution into account when clustering.
7.1 Data generation process
We consider four data generation processes (DGPs), differing in the distributions of errors and
degree of group separation.
DGP.1: The first and benchmark DGP is the typical location-scale shift model:
$$y_{it} = \alpha_{g_i} + \beta_{g_i} x_{it} + (1 + \gamma x_{it}) \varepsilon_{it}, \quad g_i = 1, \dots, G^0, \qquad (7.1)$$
where we set $\gamma = 0.5$. We follow Kato et al. (2012) and generate $x_{it} = 0.3 \alpha_{g_i} + z_{it}$, where $z_{it}$ is independently and identically drawn from $\chi_5^2$. The error term $\varepsilon_{it}$ is i.i.d. standard normal. There are three groups ($G^0 = 3$) containing $N_1$, $N_2$, and $N_3$ individual units, respectively, with $N_1 + N_2 + N_3 = N$. We fix the ratio among the groups at $N_1 : N_2 : N_3 = 0.3 : 0.3 : 0.4$. The intercepts and slope coefficients of the three groups are
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1.75, 2.5)'.$$
In this case, the groups are well separated in their means, while the shape of their distributions is common (see Figure 2(a) for the density function of $\alpha_{g_i} + \varepsilon_{it}$ for the three groups). Thus, both the mean-based clustering estimator and the PSQR estimator (which classifies based on the quantiles) are expected to identify the groups.
DGP.2: We then consider the case where the groups are less clearly separated, with less discrepancy between the coefficient parameters. We generate the data with
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1.5, 2)'.$$
The explanatory variable and error term are generated in the same way as in DGP.1. This case allows us to examine how the mean-based clustering and PSQR estimators perform when the distributions of the groups are less separable.
DGP.3: In practice, the groups might differ not only in their means but also in the shapes of their distributions. To mimic this situation, we consider heterogeneous slope coefficients and allow the distribution of the error term to vary across groups. The errors of the three groups are generated from three distribution families, namely the normal, Gamma, and Weibull distributions. By appropriately setting the parameters of these distributions and subtracting the theoretical means, the errors of the three groups follow heterogeneous distributions with mean zero but distinct tail behavior. In particular, we generate
$$\varepsilon_{it} \sim \begin{cases} \text{i.i.d. } N(0, 1) & \text{if } g_i = 1, \\ \text{i.i.d. } \mathrm{Gamma}(sh_G, sc_G) - E[\mathrm{Gamma}(sh_G, sc_G)] & \text{if } g_i = 2, \\ \text{i.i.d. } \mathrm{Weibull}(sh_W, sc_W) - E[\mathrm{Weibull}(sh_W, sc_W)] & \text{if } g_i = 3. \end{cases}$$
Here, $sh_G$ and $sh_W$ are the shape parameters of the Gamma and Weibull distributions, and $sc_G$ and $sc_W$ are the respective scale parameters. We set $(sc_G, sh_G) = (5, 1)$ and $(sc_W, sh_W) = (1, 3)$. Apart from the errors, the remaining variables and parameters are the same as in DGP.1. Although the three groups are characterized by distinct means and distributions, group separation in this case is not necessarily stronger than in DGP.1, since the densities of Groups 1 and 2 overlap more, especially around the mean (see Figure 2(c)). Moreover, the distributional heterogeneity between the clusters can only be captured by PSQR, not by mean-based clustering.
DGP.4: We consider the scenario where the groups are distinguished only by the shapes of their distributions and not by their mean values. This is potentially the most difficult case for classification because the difference between the groups is even less pronounced. We generate homogeneous intercepts and slope coefficients:
$$(\alpha_1, \alpha_2, \alpha_3)' = (\beta_1, \beta_2, \beta_3)' = (1, 1, 1)'.$$
The error term is generated in the same way as in DGP.3, with its distribution varying across the three groups. Since each group's error has a zero mean (because of the demeaning), the means of the coefficients are identical (see Figure 2(d)).
For each DGP, we consider two cross-sectional sample sizes, $N \in \{50, 100\}$, and two time-series lengths, $T \in \{30, 60\}$, leading to four combinations of the cross-sectional and time-series dimensions. The number of replications is set to 1000.
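For concreteness, DGP.1 can be generated in a few lines. This is a sketch under the stated parameter values (we draw $\chi_5^2$ as a sum of five squared standard normals, and all function and variable names are our own):

```python
# Sketch of the DGP.1 generator (our own code, under the stated parameters).
import random

def simulate_dgp1(N, T, gamma=0.5, seed=0):
    """y_it = alpha_g + beta_g * x_it + (1 + gamma * x_it) * eps_it."""
    rng = random.Random(seed)
    coef = {1: 1.0, 2: 1.75, 3: 2.5}      # alpha_g = beta_g in DGP.1
    sizes = [int(0.3 * N), int(0.3 * N)]
    sizes.append(N - sum(sizes))           # N1 : N2 : N3 = 0.3 : 0.3 : 0.4
    groups = [g for g, n in zip((1, 2, 3), sizes) for _ in range(n)]
    y, x = [], []
    for g in groups:
        yi, xi = [], []
        for _ in range(T):
            z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(5))  # chi^2_5 draw
            x_it = 0.3 * coef[g] + z
            eps = rng.gauss(0.0, 1.0)      # i.i.d. standard normal error
            yi.append(coef[g] + coef[g] * x_it + (1 + gamma * x_it) * eps)
            xi.append(x_it)
        y.append(yi)
        x.append(xi)
    return y, x, groups
```

The other DGPs differ only in the coefficient values and the error draws (demeaned Gamma and Weibull variates for DGP.3 and DGP.4).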
FIGURE 2
7.2 Implementation and evaluation
The PSQR estimator minimizes the composite objective function with different quantile levels
and thus classification is based on the entire distribution. To implement our method, we con-
sider two specifications of quantiles, a narrow and a wide range of quantiles; the corresponding
estimators are denoted by PSQRnarrow and PSQRwide. In DGP.1–DGP.3, we consider the narrow range of quantiles $\tau \in \{0.4, 0.5, 0.6\}$, which concentrates on the central part of the distribution, and the wide range of quantiles $\tau \in \{0.1, 0.2, \dots, 0.9\}$, which spans the entire distribution. In DGP.4, the narrow range of quantiles is specified as $\tau \in \{0.1, 0.2, 0.8, 0.9\}$, since heterogeneity only appears in the tails (according to the IC at each quantile), while the wide range remains the same. These two specifications of quantiles allow us to examine how the range of quantiles affects the classification accuracy and the coefficient estimation.
We compare our PSQR estimator with the group fixed effects (GFE) type estimator (Bon-
homme and Manresa, 2015a) and single-quantile group estimators using the best quantile. The
GFE-type estimator minimizes the least squares objective function, and thus its classification
is solely based on the mean. We also consider the single-quantile estimators obtained by mini-
mizing the PSQR objective function (3.3) but with only one quantile level. This estimator can
also be regarded as an extension of Gu and Volgushev (2018)’s quantile panel group fixed effects
estimator, allowing for both the intercept and the slope coefficients to exhibit a group pattern
of heterogeneity. We report the single-quantile estimator at the best quantile chosen ex post
based on their classification accuracy. This is, of course, not feasible in practice since we do not
know ex ante which quantile is the most informative for clustering. Both the GFE-type and the
single-quantile group estimators can be regarded as types of “limited information” estimators,
as they only employ information at a single point of the distribution. By comparing PSQR
with these two limited information estimators, we shed light on how distributional information
contributes to group membership estimation.
We evaluate the performance of the proposed method based on selecting the right number
of groups, clustering, and the coefficient estimates across quantiles. First, we examine how
the IC-based procedure performs in determining the number of groups. To compute the IC in
practice, we find that f(N, T ) = 0.1 log(NT )/√NT works fairly well based on a large number
of experiments with many alternatives, and we employ this penalty in all simulations and the
application. Our penalty term in (5.1) is comparable to the BIC proposed by Bonhomme and
Manresa (2015b). Performance is evaluated by the empirical probability of selecting a particular
number. Second, we measure clustering accuracy by taking the average of the misclassification
frequency (gi 6= g0i ) across replications. Let I(·) be the indicator function. The misclassification
frequency is the ratio of misclassified units to the total number of units:
MF =1
N
N∑i=1
I(gi 6= g0i ).
Finally, we evaluate the accuracy of the coefficient estimates at each quantile by their root mean squared error (RMSE) and by the coverage probability of the two-sided nominal 95% confidence interval. The overall RMSE for all units is
$$\mathrm{RMSE}(\hat{\beta}(\tau)) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big[ \hat{\beta}_{\hat{g}_i}(\tau) - \beta^0_{g_i^0}(\tau) \big]^2}.$$
The coverage probability is computed as
$$\mathrm{Coverage}(\hat{\beta}(\tau)) = \frac{1}{N} \sum_{i=1}^{N} I\big( LC_i(\tau) \leq \beta^0_{g_i^0}(\tau) \leq UC_i(\tau) \big),$$
where $LC_i(\tau)$ and $UC_i(\tau)$ are the lower and upper bounds of the 95% confidence interval for $\hat{\beta}_{\hat{g}_i}(\tau)$, based on the Huber sandwich estimate of the standard deviation. Because the RMSE and coverage probability are based on the coefficient estimates of the individual units, they are both affected by the classification accuracy.
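The evaluation metrics are straightforward to code. The sketch below is our own; the minimization over label permutations is a practical detail we add because group labels are only identified up to relabeling.

```python
# Sketch of the evaluation metrics (our own code).
import math
from itertools import permutations

def misclassification_frequency(est, true, G):
    """MF = min over relabelings of (1/N) * #{i : g_hat_i != g_i^0}."""
    N = len(est)
    best = N
    for perm in permutations(range(G)):
        errors = sum(perm[e] != t for e, t in zip(est, true))
        best = min(best, errors)
    return best / N

def rmse(beta_hat, beta_true):
    """RMSE of the unit-level coefficient estimates at one quantile."""
    N = len(beta_hat)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta_true)) / N)
```

Because `rmse` compares each unit's estimate against its own true group coefficient, a single misclassified unit inflates both metrics, which is why the RMSE and coverage results below track the classification accuracy.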
7.3 Results
Determining the number of groups
As the classification and coefficient estimation both rely on the choice of the number of groups,
we first examine how the IC-based procedure performs in determining the number of groups. We
use the IC defined in (5.2) to select the number of groups at each quantile τk = 0.1, 0.2, . . . , 0.9
and then choose the maximum number over the quantiles as in (5.3). Table 1 provides the
empirical probability of selecting a particular number of groups, ranging from G = 1 to 5.
Recall that the true number of groups is 3.
TABLE 1
Table 1 shows that our method can effectively detect the right number of groups and that
the frequency of selecting the right number generally increases with the time dimension. As
expected, when the number of groups is misspecified, it is more likely to be overestimated than
to be underestimated. In DGP.1 and DGP.3, where the means of the groups are well separated,
the empirical probability of selecting three groups is above 96% in all cases. In DGP.2, where
the mean is less separated, the method has a larger probability of underestimating the number
of groups, but only when N and T are small. Even in this case, the probability of selecting the
right number is still above 82%, and this probability quickly increases as $N$ or $T$ rises. When the coefficients of the three groups share a common mean in DGP.4, our method correctly detects three groups in 98% of the cases even when $T = 30$, owing to the sharp separation of the groups in the tails. We further examine the selected number of groups at each quantile (not reported) in this case. We find that the IC selects one group at the central quantiles, namely $\tau \in \{0.4, 0.5, 0.6\}$, and three groups only at the tail quantiles, namely $\tau \in \{0.1, 0.2, 0.8, 0.9\}$. This leads to the final result of three groups after the maximization step in (5.3). The quantile-specific IC results also suggest estimating the three-group structure using only the tail quantile levels $\{0.1, 0.2, 0.8, 0.9\}$, since no heterogeneity is exhibited at the central quantiles.
Classification accuracy
Next, we examine classification performance. Table 2 presents the misclassification rate of
the four estimators: the GFE-type estimator, single-quantile estimator, and PSQR estimators
based on the two specifications of quantiles. We compute the single-quantile estimator at each
specific quantile level ranging from 0.1 to 0.9 and report the best one in terms of the lowest
misclassification rate. We refer to this estimator as SQbest.
TABLE 2
In general, we find that PSQR produces a lower misclassification rate than GFE in all cases
and that increasing the time dimension significantly reduces the misclassification rate for all
methods. In particular, in DGP.1, where groups are well separated by their means and have a
common error distribution, all methods lead to accurate classification. Although the misclassification rate of all methods is generally low, the group membership estimate of PSQRnarrow is twice as accurate as that of GFE, and it also slightly outperforms the best single-quantile estimator, obtained at $\tau_{\text{best}} = 0.5$. Using a wide range of quantiles, $\tau \in \{0.1, 0.2, \dots, 0.9\}$, further halves the misclassification rate of PSQR compared with using fewer quantiles. This result suggests that employing information at multiple quantiles improves the classification when a group pattern of heterogeneity is common across different quantiles of the distribution.
When the groups are less separated in their means as in DGP.2, classification is more diffi-
cult and the misclassification rate of all methods increases as expected. The GFE-type method
misclassifies around 6% of individuals when T = 30. SQbest (with τbest = 0.5) leads to a mis-
classification rate of around 4% when T = 30, close to the level of PSQRnarrow. Nevertheless, if
we take advantage of the whole distribution by using a wider range of quantiles, we manage to
reduce the misclassification rate to roughly 2.5% under small T . Interestingly, increasing the
time dimension disproportionately improves the classification accuracy of the different meth-
ods. The improvement in PSQRwide is the most pronounced compared with the other three
estimators because it employs more quantiles and the information at each quantile becomes
more precise as T increases.
In DGP.3, classification turns out to be even harder since the densities of Groups 1 and 2
overlap more (see Figure 2 (c)). In this case, the single-quantile method (with τbest = 0.5) pro-
duces the least accurate group membership estimates, with a misclassification rate of more than
9% when T = 30 and 4% when T = 60. The misclassification rates of GFE and PSQRnarrow
are around 8% under small T and 3% under large T . In contrast, PSQRwide can better separate
groups by exploring the heterogeneity in the whole distribution, and the misclassification rate
is around 2.5% when T = 30, more than three times lower than that for GFE. The improvement in classification accuracy is even more sizeable as the time dimension increases: when
T = 60, the misclassification rate of PSQRwide is around 0.5%, at least seven times lower than those for GFE and PSQRnarrow.
Finally, we consider DGP.4, where the means of groups are identical. In this case, the
performance of GFE is particularly poor with a misclassification rate above 50%. The high
misclassification rate of GFE does not decrease as T increases. In contrast, PSQR can
identify the correct group structure. PSQRwide outperforms GFE by roughly 70% when T = 30,
with a misclassification rate below 17%, and classification accuracy dramatically improves as
T increases. As suggested by the IC at each quantile, individual units are best classified into
three groups only at τ ∈ {0.1, 0.2, 0.8, 0.9}. Hence, we consider PSQRnarrow, which is based
on these tail quantiles. We can also view PSQRnarrow here as an unequally weighted average
of the composite quantiles, where each of the four quantiles (0.1, 0.2, 0.8, and 0.9) receives a
weight of one-quarter and the others have a weight of zero. We find that classification accuracy
is further improved compared with PSQRwide; the misclassification rate is around 6% when
{N, T} = {50, 30} and decreases to below 4% as either N or T doubles. This is not surprising
because the central quantiles do not contain group information and thus incorporating these
quantiles into the composite function does not help identify groups (it only adds noise). In this
DGP, the best single-quantile estimator is obtained at τbest = 0.1. Compared with SQbest, the
rate of PSQRnarrow is lower when T = 30, suggesting the benefits of using composite quantiles
when T is relatively small. When T is large, SQbest performs similarly to PSQRnarrow. This is
because 0.1 is the most informative quantile for identifying the three groups, while PSQRnarrow
uses three additional quantiles (τ = 0.2, 0.8, 0.9), which contribute little additional information
for group separation. Although SQbest sometimes performs as well as PSQR under large T, the
single-quantile approach is not recommended since we do not know ex ante which quantile is
the most informative for clustering and the estimation accuracy of the other quantiles is not
guaranteed. This is confirmed by the performance of the single-quantile approach using the
second-best quantile τ = 0.9; indeed, its misclassification rate is much higher than PSQRnarrow
for all sample sizes.
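The weighting interpretation above can be made concrete. The snippet below is a minimal numpy sketch (not the authors' implementation) of a weighted composite check-function loss: equal weights over τ ∈ {0.1, . . . , 0.9} correspond to PSQRwide, while weight 1/4 on each of τ ∈ {0.1, 0.2, 0.8, 0.9} and zero elsewhere corresponds to PSQRnarrow in DGP.4. The data and "fitted" quantiles are illustrative stand-ins.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def composite_loss(y, fitted_q, taus, weights):
    """Weighted composite quantile loss; fitted_q[k] is the fitted tau_k-quantile."""
    return sum(w * check_loss(y - q, tau).mean()
               for tau, q, w in zip(taus, fitted_q, weights))

rng = np.random.default_rng(0)
y = rng.normal(size=1000)

taus = np.arange(0.1, 1.0, 0.1)
q_fit = np.quantile(y, taus)  # stand-in for fitted regression quantiles

# PSQRwide: equal weight 1/9 on all nine quantiles
w_wide = np.full(9, 1 / 9)
# PSQRnarrow (DGP.4): weight 1/4 on tau in {0.1, 0.2, 0.8, 0.9}, zero elsewhere
w_narrow = np.where(np.isin(np.round(taus, 1), [0.1, 0.2, 0.8, 0.9]), 0.25, 0.0)

loss_wide = composite_loss(y, q_fit, taus, w_wide)
loss_narrow = composite_loss(y, q_fit, taus, w_narrow)
```

Zeroing out the central-quantile weights drops exactly the terms that, per the discussion above, add noise without group information.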
Accuracy of the regression quantile estimates
Finally, we compare the accuracy of the regression quantile estimates. Table 3 presents the
bias, RMSE, and coverage probability of the three methods (GFE, best single-quantile, and
PSQR). To save space, we only report the statistics of the slope coefficient βgi produced by
PSQRwide among the two versions of PSQR in DGP.1–DGP.3 and PSQRnarrow in DGP.4. The
results of the intercept estimate αgi are qualitatively similar.
TABLE 3
In DGP.1, all three methods provide accurate coefficient estimates. Although the discrepancy between the three methods is marginal, we find that PSQRwide produces the smallest RMSE
and the most accurate coverage probability (closest to the nominal 95%) at τ = 0.5 compared with
the GFE and single-quantile estimators. As we move to the tail quantiles (i.e., when τ is close to
0 or 1), PSQRwide still works well, although less satisfactorily than at the central quantiles,
as in most quantile regression exercises.
In DGP.2, the difference between the three methods is larger due to diverse classification
performance. Since PSQRwide produces the most accurate group membership estimates, its
coefficient estimates are unsurprisingly more accurate than those of the other two methods. Similar results are observed in DGP.3, where we find that GFE is rather unsatisfactory (coverage
probability below 90% when T = 30). In contrast, PSQRwide continues to perform well
with a smaller RMSE and reasonable coverage probabilities.
Finally, in DGP.4, GFE completely breaks down, with a coverage probability far from
the nominal level. PSQRnarrow based on informative quantiles and SQbest both produce fairly
good coefficient estimates. We compare SQbest with PSQRnarrow at τ = 0.1 and find a smaller
RMSE and better coverage probability for PSQRnarrow when T = 30. When T = 60, the
coefficient estimates of SQbest have a better coverage probability than PSQRnarrow, but at the
cost of a larger RMSE. By comparing the second-best single-quantile estimator (τsecond best =
0.9) with PSQRnarrow at τ = 0.9, we can see that the former is strictly dominated by the latter,
again confirming the unstable performance of the single-quantile estimator.
In general, the classification results strongly favor PSQR using composite quantiles. If
the means are well separated, using a wide range of quantiles provides more accurate group
membership estimates compared with the GFE and single-quantile approaches, both of which
only employ limited distributional information. If there is no heterogeneity in the mean but only
in the tails, it is beneficial to consider PSQR based on only “informative” quantiles. Although
the single-quantile approach sometimes outperforms PSQR, in practice, which quantile is the
most informative is not a priori known, and there is no guarantee that the best single quantile
is chosen. For central quantiles without coefficient heterogeneity, we can thus estimate the
coefficient parameters by using standard quantile regressions.
8 Empirical application: Output effect of infrastructure
capital
In this section, we apply the PSQR method to investigate the effect of infrastructure capital
on aggregate output. The role of infrastructure capital in the local economy has long been
a central concern in macroeconomics given its strong policy implications. For example, the
US government increased infrastructure expenditure to counteract the recent recession (Leeper
et al., 2010). Hence, understanding the contribution of infrastructure expenditure to output is
crucial for evaluating the effectiveness of this fiscal stimulus. Scholarly attempts to quantify the
output effects of infrastructure have grown rapidly since the influential work of Aschauer (1989);
however, previous studies lack consensus and their empirical results are diverse depending on
the datasets and empirical methodologies employed (see the references in Calderon et al. (2015)
and Romp and De Haan (2007)).
We employ the cross-country panel dataset presented by Calderon et al. (2015) that covers
88 countries over 1960–2010. As suggested by Calderon et al. (2015), countries are likely to
differ significantly in their output elasticity of infrastructure because of their different levels of
technology development, institutions, demographic features, and so on (see also Gregoriou and
Ghosh (2009)). Ignoring such heterogeneity results in biased estimates of the output effect of
infrastructure. We assume that the countries are characterized by a group pattern of hetero-
geneity in the sense that the effect of infrastructure is common within a group, and we then
estimate the number of groups and group memberships from the data. This assumption allows
us to capture the potential similarity of countries, better understand the sources of hetero-
geneity, and obtain more efficient estimates than individual country estimates, especially given
the short time span. Even if a group pattern of heterogeneity is allowed, the (conditional)
distributional effect within a group may not necessarily be uniform, and the conditional mean
effect does not provide a complete picture. For countries in the same group, the effect of infras-
tructure could vary markedly depending on their economic statuses, such that the estimated
coefficient of infrastructure is not constant at different quantiles of output.
To capture the distributional effect of infrastructure that varies across groups, we consider
the following PSQR model:
Q_τ(O_it | f_{i,t−1}) = β_{1,g_i}(τ) C_{i,t−1} + β_{2,g_i}(τ) S_{i,t−1} + β_{3,g_i}(τ) Z_{i,t−1},    (8.1)
where Oit is real output measured by the logarithm of real GDP per worker in 2000 PPP (pur-
chasing power parity) US dollars and fit = (Cit, Sit, Zit) is the set of explanatory variables. In
this set, Cit is the logarithm of physical capital per worker constructed by using the perpetual
inventory method, Sit is human capital proxied by the average years of secondary schooling
of the population, and Zit is the logarithm of physical infrastructure per worker measured by
a synthetic index that summarizes three infrastructure dimensions (telecommunications, electric power, and roads) using a principal component method (see Calderon et al. (2015) for
the details of the definition and construction of the variables). This model is a reduced-form
representation of the structural aggregate production function with constant returns to scale.
All the variables are first differenced to remove unit roots following Calderon et al. (2015).
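To fix ideas, the sketch below estimates group-specific coefficients at several quantiles, as in a model of the form (8.1), assuming group memberships are already known. The data, dimensions, coefficient values, and the simple subgradient solver are all illustrative stand-ins; in practice quantile regression is solved by linear programming.

```python
import numpy as np

def quantreg_subgrad(X, y, tau, lr=0.05, iters=3000):
    """Minimize the mean check loss by subgradient descent.

    A toy stand-in for the LP-based solvers used in real applications."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        u = y - X @ beta
        # subgradient of mean rho_tau(y - X beta) is -X'(tau - 1{u<0})/n
        beta += lr * X.T @ (tau - (u < 0)) / len(y)
    return beta

rng = np.random.default_rng(0)
N, T, p = 60, 40, 3                     # hypothetical panel dimensions
groups = np.repeat([0, 1], N // 2)      # assumed (already estimated) memberships
unit = np.repeat(np.arange(N), T)
X = rng.normal(size=(N * T, p))         # stand-ins for the regressors (C, S, Z)
beta_true = {0: np.array([0.3, 0.1, 0.0]), 1: np.array([0.3, 0.6, 0.1])}
y = np.array([X[j] @ beta_true[int(groups[unit[j]])] for j in range(N * T)])
y += 0.2 * rng.normal(size=N * T)

# pooled quantile regression within each group, at each quantile level
taus = [0.25, 0.5, 0.75]
est = {(g, tau): quantreg_subgrad(X[groups[unit] == g], y[groups[unit] == g], tau)
       for g in (0, 1) for tau in taus}
```

Each group pools its own N_g × T observations, which is the source of the efficiency gain over country-by-country estimation mentioned above.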
TABLE 4
We first examine the median effect of the output determinants by using PSQR with G = 1
and compare this with the estimates of Calderon et al. (2015) that focused on the homogeneous
mean effect (see the first two columns of Table 4). The estimated median effects of the capital
stock and human capital produced by PSQR are 0.29 and 0.09, respectively, both of which are
in the range of previous estimates in the empirical macroeconomic literature and close to the
values reported by Calderon et al. (2015) (0.34 for capital stock and 0.10 for human capital).
The PSQR estimate of the infrastructure effect is 0.08, identical to the estimate provided by
Calderon et al. (2015) and close to the value reported by Wu et al. (2017) using a different
dataset.^7
FIGURE 3
The median or mean effect, however, does not give the full picture of the distribution, which
may differ across countries. Hence, we examine the distributional effect at the other quantiles
and allow cross-country heterogeneity in the distribution. We determine the number of groups
by using the IC-based procedure discussed in Section 5. We allow the maximum number
of groups to be six and compute the IC at quantiles ranging from 0.1 to 0.9 in steps of 0.1.
The IC suggests that two groups exist. Hence, we estimate (8.1) using G = 2.^8 Figure 3
displays the estimated group patterns found by using PSQR. Interestingly, both geographic
and economic features play a role in the clustering outcome. Group 1 consists of 54 countries,
including most Asian and European countries, whereas Group 2 contains Canada, the United
States, Australasian countries, and some coastal African countries.
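The group-number selection can be sketched generically. Below is a toy BIC-type criterion in the spirit of those used for latent-group panels (the paper's exact IC is defined in Section 5 and not reproduced here): the fitted objective improves in G, and a penalty proportional to the number of group-specific coefficients offsets overfitting. The objective values and the penalty constant `lam` are invented for illustration.

```python
import numpy as np

def ic(obj_value, G, n_coef, penalty):
    """BIC-type information criterion: log objective plus a model-size penalty."""
    return np.log(obj_value) + penalty * G * n_coef

# Toy objective values: fit improves sharply up to the true G, then flattens.
fit = {1: 1.00, 2: 0.50, 3: 0.48, 4: 0.47, 5: 0.465, 6: 0.46}
lam, n_coef = 0.1, 1  # assumed penalty constant and coefficient count

G_hat = min(fit, key=lambda G: ic(fit[G], G, n_coef, lam))  # selects G = 2
```

The sharp drop in the objective from G = 1 to G = 2, followed by near-flat improvement, is exactly the pattern that leads the IC to select two groups here.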
The right panel of Table 4 presents the PSQR estimates under G = 2. Three important
results emerge from the analysis. First, we find a large degree of heterogeneity in the dis-
tributional effect of human capital and synthetic infrastructure at the central quantiles. In
particular, human capital has a significant effect on output in both groups, but the effect
is much larger in Group 2 than in Group 1. Further, the infrastructure effect is not significant
in Group 1 but is significantly positive in Group 2. This finding suggests that the countries in
Group 2 drive the positive and significant mean/median infrastructure effect when we estimate
a homogeneous panel (G = 1). However, this homogeneity assumption ignores the fact that
there are a large proportion of countries (in Group 1) where physical infrastructure contributes
little to output.
Second, the output effects of the three ingredients all vary dramatically across quantiles. At
lower quantiles, the effects of physical and human capital are negative or insignificant. However,
these effects become positive and strong as the quantile level increases. The output elasticity of
infrastructure is negative at the lower quantiles in Group 1 but positive in Group 2. Again, as
^7 They considered a Chinese provincial panel and reported an estimated output effect of infrastructure of 0.06.
^8 The drop in the IC from G = 1 to G = 2 is particularly large for the tail quantiles, say 0.1, 0.2, 0.8, and 0.9, while the change in the IC around the central quantiles is minor, suggesting that the heterogeneity is especially pronounced at the tails but not around the median of the distribution. We also estimate the model by restricting the central quantiles to have homogeneous coefficients, finding that the group membership structure estimated only based on the tails is largely similar to that estimated from the whole range of quantiles.
the quantile level increases, the elasticity of both groups turns positive and becomes statistically
more significant. In fact, the literature on the output effect of infrastructure does document pos-
sible opposing effects (see Bom and Ligthart (2008) for a review and meta-analysis). Although
infrastructure investment is generally expected to have a positive effect on output, the relation
could be negative because of overinvestment, environmental damage, or spillover effects. Inap-
propriate or excessive investment is particularly relevant in some developed European countries
where infrastructure capacity is relatively large, such as Germany and the Netherlands (Sturm
et al., 1999; Uhde, 2010). The negative effects of (transportation) infrastructure investment
could also arise from spillovers since the migration of labor and mobile capital may hurt the
economic development of competing neighboring regions. Again, such spillovers often occur
within the European Union, where factors of production move easily across borders (Cantos et al., 2005). This
finding explains the negative output effect of infrastructure in Group 1, which contains most
European countries, suggesting that when these countries are in a poor economic state, the
negative effect of infrastructure dominates the positive impact. At upper quantiles, the infras-
tructure effect is strongly positive for both groups, and the size is still within a reasonable level
as suggested by the literature, ranging from 0.07 to 0.17.
Finally, we find that the distributional effect of physical and human capital is more dispersed
in Group 2 than in Group 1. This result is expected since Group 2 contains highly developed
countries such as the United States, Canada, and Australia as well as several African countries.
The heterogeneity in the shape of the distribution also contributes to the clustering outcome.
In general, we find a large degree of heterogeneity in the output effect of the three ingredients
both across groups and across quantile levels. The two groups of countries differ not only in their
median effects of the output ingredients, but also in their shapes of the distributional effect.
This distributional heterogeneity cannot be captured by standard clustering approaches, but is
well demonstrated by the PSQR procedure.
9 Conclusion and future research
This study offers a flexible yet parsimonious way of modeling the distributional heterogeneity of
the slope coefficients in panel data models. We model cross-sectional heterogeneity via a latent
group pattern, such that individual units in a group share a common conditional distribution.
The conditional distributions of the groups can differ not (only) in their means but (also) in the
(tail) quantiles. We capture the distributional effect within each group by regression quantiles.
We propose a composite quantile approach to simultaneously estimate the group membership
structure and regression quantiles of each group. We also precisely quantify the convergence
rate of the misclassification probability and show that using multiple quantiles for clustering
improves the accuracy of the group membership estimates over existing methods in which
clustering is only based on the mean or some single quantile.
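A stylized version of this estimation loop, in the spirit of k-means rather than the authors' exact algorithm, alternates between (i) assigning each unit to the group whose quantile parameters minimize the unit's composite check loss and (ii) re-estimating each group's parameters from its assigned units. The intercept-only sketch below treats each group's "coefficients" as its quantiles; the data and initialization are illustrative.

```python
import numpy as np

def check_loss(u, tau):
    return u * (tau - (u < 0))

def unit_loss(y_i, center, taus):
    """Composite check loss of one unit against a group's quantile parameters."""
    return sum(check_loss(y_i - c, tau).mean() for c, tau in zip(center, taus))

def cluster_composite_quantile(Y, G, taus, iters=20):
    """Y: N x T panel. Intercept-only illustration of the two-step iteration."""
    N, _ = Y.shape
    idx = np.linspace(0, N - 1, G).astype(int)        # crude spread-out init
    centers = [np.quantile(Y[i], taus) for i in idx]
    labels = np.zeros(N, dtype=int)
    for _ in range(iters):
        # (i) assignment step: each unit joins its loss-minimizing group
        for i in range(N):
            labels[i] = np.argmin([unit_loss(Y[i], c, taus) for c in centers])
        # (ii) update step: recompute group quantiles on pooled group data
        centers = [np.quantile(Y[labels == g].ravel(), taus) for g in range(G)]
    return labels, centers

rng = np.random.default_rng(1)
N, T = 40, 50
g_true = np.repeat([0, 1], N // 2)                    # two well-separated groups
Y = rng.normal(loc=3.0 * g_true[:, None], scale=1.0, size=(N, T))

labels, centers = cluster_composite_quantile(Y, G=2, taus=[0.25, 0.5, 0.75])
```

Because the assignment step averages the check loss over all T observations and all quantiles in the composite, misclassification becomes rare as T grows, mirroring the convergence-rate result described above.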
Several issues deserve future research. First, we assume that group membership is invariant
across quantiles. It is an open question how to verify this assumption in practice. Given
the discrete and label-invariant features of the group membership parameter, it is difficult,
if not impossible, to obtain the variance of its estimate. Hence, direct tests based on the
group membership estimates at different quantiles seem infeasible. One possible approach is
to derive the confidence intervals of the group membership estimates at each quantile in the
vein of Dzemski and Okui (2018), and then check whether the intervals at different quantiles
overlap. Second, our model allows individual or group fixed effects, while sometimes it is useful
to consider unobserved two-way fixed effects. How to deal with two incidental parameters
in panel quantile regression with heterogeneous slope coefficients is an interesting question.
Finally, the computational cost is non-trivial when the number of individual units is large. It is
thus desirable to develop alternative algorithms that are faster and more stable in such settings.
Figure 1: Misclassification probability in the illustrative example

[Figure: three panels — (a) β_2^0 − β_1^0 = 1, σ_ε^2 = 1; (b) β_2^0 − β_1^0 = 0.5, σ_ε^2 = 1; (c) β_2^0 − β_1^0 = 1, K = 9]
Figure 2: Density of α_{g_i} + ε_it for the three groups in the simulation

[Figure: four panels, each showing the densities of Groups 1–3 — (a) DGP.1; (b) DGP.2; (c) DGP.3; (d) DGP.4]
Figure 3: Estimated group memberships (G = 2)

[Figure: countries shaded by estimated membership in Group 1 or Group 2]
Table 1: Group number selection frequency using IC when G0 = 3

                        DGP.1                                    DGP.2
N    T      1      2      3      4      5        1      2      3      4      5
50   30   0.000  0.000  0.964  0.036  0.000    0.000  0.172  0.828  0.000  0.000
50   60   0.000  0.000  0.991  0.009  0.000    0.000  0.000  0.996  0.004  0.000
100  30   0.000  0.000  0.997  0.003  0.000    0.000  0.000  0.989  0.011  0.000
100  60   0.000  0.000  0.998  0.002  0.000    0.000  0.000  0.993  0.007  0.000

                        DGP.3                                    DGP.4
50   30   0.000  0.011  0.983  0.006  0.000    0.000  0.000  0.983  0.017  0.000
50   60   0.000  0.000  0.994  0.006  0.000    0.000  0.000  0.998  0.002  0.000
100  30   0.000  0.000  0.978  0.022  0.000    0.000  0.000  0.984  0.016  0.000
100  60   0.000  0.000  0.989  0.011  0.000    0.000  0.000  1.000  0.000  0.000
Table 2: Misclassification frequencies

                              N = 50              N = 100
                         T = 30    T = 60    T = 30    T = 60
DGP.1  GFE               0.0066    0.0004    0.0068    0.0003
       SQbest            0.0030    0.0000    0.0028    0.0001
       PSQRnarrow        0.0028    0.0000    0.0024    0.0000
       PSQRwide          0.0014    0.0000    0.0012    0.0000
DGP.2  GFE               0.0606    0.0116    0.0583    0.0106
       SQbest            0.0451    0.0062    0.0423    0.0057
       PSQRnarrow        0.0434    0.0051    0.0394    0.0046
       PSQRwide          0.0257    0.0020    0.0244    0.0019
DGP.3  GFE               0.0874    0.0365    0.0859    0.0356
       SQbest            0.0977    0.0420    0.0904    0.0391
       PSQRnarrow        0.0865    0.0357    0.0796    0.0315
       PSQRwide          0.0268    0.0051    0.0230    0.0041
DGP.4  GFE               0.5296    0.5418    0.5123    0.5123
       SQbest            0.0719    0.0268    0.0603    0.0166
       SQsecond best     0.1283    0.0586    0.1138    0.0367
       PSQRnarrow        0.0612    0.0163    0.0370    0.0123
       PSQRwide          0.1423    0.0495    0.1344    0.0301

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, PSQRnarrow is based on τ ∈ {0.4, 0.5, 0.6}, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}.
Table 3: Bias, root mean squared error, and coverage probability of coefficient estimates

                   τ      Bias      RMSE    Coverage      Bias      RMSE    Coverage
                             {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.1  GFE                0.0004    0.3181   0.9291      −0.0033    0.2474   0.9385
       SQbest     0.5    −0.0004    0.2981   0.9719      −0.0026    0.2330   0.9661
       PSQRwide   0.1     0.0056    0.3310   0.9265       0.0010    0.2760   0.9374
                  0.3     0.0010    0.2945   0.9624      −0.0003    0.2401   0.9583
                  0.5    −0.0006    0.2905   0.9653      −0.0021    0.2330   0.9652
                  0.7    −0.0018    0.2931   0.9625      −0.0041    0.2414   0.9581
                  0.9    −0.0062    0.3325   0.9311      −0.0038    0.2780   0.9264
                             {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE               −0.0012    0.2474   0.9385       0.0015    0.2103   0.9464
       SQbest     0.5    −0.0007    0.2542   0.9715      −0.0035    0.1989   0.9634
       PSQRwide   0.1     0.0010    0.2760   0.9374       0.0040    0.2308   0.9483
                  0.3    −0.0003    0.2432   0.9583       0.0008    0.2069   0.9612
                  0.5    −0.0021    0.2371   0.9652       0.0005    0.2006   0.9639
                  0.7    −0.0041    0.2414   0.9581       0.0009    0.2046   0.9548
                  0.9    −0.0038    0.2780   0.9264      −0.0003    0.2335   0.9350
                             {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.2  GFE                0.0036    0.3930   0.9021       0.0015    0.2814   0.9356
       SQbest     0.5     0.0031    0.3644   0.9604       0.0006    0.2593   0.9572
       PSQRwide   0.1     0.0183    0.3775   0.9104       0.0043    0.2842   0.9293
                  0.3     0.0059    0.3462   0.9477       0.0028    0.2524   0.9535
                  0.5     0.0016    0.3408   0.9581       0.0006    0.2478   0.9517
                  0.7    −0.0018    0.3445   0.9539      −0.0006    0.2520   0.9588
                  0.9    −0.0085    0.3716   0.9257      −0.0030    0.2849   0.9298
                             {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                0.0015    0.3749   0.8849       0.0003    0.2570   0.9408
       SQbest     0.5     0.0009    0.3419   0.9475       0.0006    0.2289   0.9601
       PSQRwide   0.1     0.0122    0.3376   0.9298       0.0020    0.2414   0.9313
                  0.3     0.0060    0.3171   0.9518       0.0001    0.2157   0.9551
                  0.5     0.0004    0.3150   0.9594       0.0003    0.2109   0.9530
                  0.7    −0.0054    0.3183   0.9514      −0.0006    0.2174   0.9544
                  0.9    −0.0096    0.3382   0.9223      −0.0009    0.2407   0.9370

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, and PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}.
Table 3 (con’t): Bias, root mean squared error, and coverage probability of coefficient estimates

                       τ      Bias      RMSE    Coverage      Bias      RMSE    Coverage
                                 {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.3  GFE                    0.0036    0.5241   0.8939       0.0025    0.4132   0.9264
       SQbest         0.5     0.0130    0.4950   0.9371       0.0046    0.3915   0.9468
       PSQRwide       0.1     0.0095    0.3446   0.9324       0.0029    0.2730   0.9396
                      0.3     0.0080    0.3504   0.9630       0.0043    0.2714   0.9551
                      0.5     0.0083    0.3897   0.9615       0.0025    0.2906   0.9623
                      0.7     0.0020    0.4455   0.9551      −0.0014    0.3244   0.9498
                      0.9    −0.0235    0.5616   0.9204      −0.0076    0.4065   0.9331
                                 {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                    0.0040    0.5035   0.8891       0.0016    0.3949   0.9253
       SQbest         0.5     0.0074    0.4714   0.9150       0.0030    0.3680   0.9362
       PSQRwide       0.1     0.0049    0.2888   0.9396       0.0016    0.2272   0.9500
                      0.3     0.0077    0.3083   0.9592       0.0014    0.2286   0.9573
                      0.5     0.0053    0.3510   0.9542       0.0011    0.2483   0.9580
                      0.7     0.0016    0.4118   0.9506      −0.0004    0.2848   0.9497
                      0.9    −0.0168    0.5278   0.9231      −0.0031    0.3551   0.9434
                                 {N, T} = {50, 30}               {N, T} = {50, 60}
DGP.4  GFE                   −0.0009    0.6458   0.3097       0.0050    0.5530   0.3429
       SQbest         0.1     0.0104    0.4343   0.9080       0.0075    0.3093   0.9418
       SQsecond best  0.9    −0.0183    0.5853   0.8907      −0.0136    0.4547   0.9193
       PSQRnarrow     0.1     0.0054    0.3881   0.9151       0.0057    0.2827   0.9186
                      0.2    −0.0035    0.3674   0.9231       0.0025    0.2660   0.9302
                      0.8     0.0020    0.4244   0.8984       0.0047    0.3333   0.9097
                      0.9    −0.0119    0.4968   0.8820      −0.0024    0.3790   0.8952
                                 {N, T} = {100, 30}              {N, T} = {100, 60}
       GFE                   −0.0062    0.6402   0.2972      −0.0020    0.5503   0.3021
       SQbest         0.1     0.0188    0.3980   0.9138       0.0008    0.2733   0.9400
       SQsecond best  0.9    −0.0129    0.5621   0.8505      −0.0066    0.4334   0.9183
       PSQRnarrow     0.1     0.0028    0.3198   0.9286      −0.0018    0.2477   0.9198
                      0.2    −0.0036    0.3085   0.9310      −0.0017    0.2390   0.9315
                      0.8     0.0019    0.3453   0.9200      −0.0020    0.2837   0.9301
                      0.9    −0.0089    0.4123   0.9183      −0.0031    0.3280   0.9032

Notes: In DGP.1–DGP.3, SQbest is the single-quantile estimator based on the best quantile τbest = 0.5, and PSQRwide is based on τ ∈ {0.1, 0.2, . . . , 0.9}. In DGP.4, SQbest is the single-quantile estimator based on the best quantile τbest = 0.1, SQsecond best is the single-quantile estimator based on the second-best quantile τsecond best = 0.9, and PSQRnarrow is based on τ ∈ {0.1, 0.2, 0.8, 0.9}.
Table 4: Output effect of physical capital, human capital, and infrastructure

          Calderon       G = 1                                  G = 2
    τ     et al. (2015)  0.5      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9

Group 1
C         0.34     0.29     0.02    0.10    0.15    0.21    0.24    0.28    0.31    0.37    0.49
          (0.01)   (0.02)   (0.05)  (0.04)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)  (0.05)
S         0.10     0.09    −0.50   −0.23   −0.12   −0.02    0.05    0.19    0.19    0.31    0.55
          (0.01)   (0.01)   (0.04)  (0.03)  (0.02)  (0.02)  (0.02)  (0.02)  (0.02)  (0.03)  (0.03)
Z         0.08     0.08    −0.16   −0.09   −0.08   −0.04   −0.01    0.09    0.07    0.07    0.09
          (0.01)   (0.01)   (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.02)  (0.02)  (0.02)
Group 2
C                          −0.03    0.14    0.16    0.24    0.28    0.31    0.34    0.38    0.46
                            (0.06)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)  (0.04)
S                          −0.60   −0.25    0.02    0.16    0.27    0.39    0.58    0.76    1.12
                            (0.05)  (0.05)  (0.04)  (0.03)  (0.03)  (0.03)  (0.04)  (0.04)  (0.05)
Z                           0.04    0.06    0.07    0.06    0.08    0.07    0.08    0.09    0.17
                            (0.01)  (0.04)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.03)  (0.04)

Notes:

1. Standard deviations are in parentheses.

2. C measures physical capital, S measures human capital, and Z measures physical infrastructure.
References
T. Ando and J. Bai. Panel data models with grouped factor structure under unknown group
membership. Journal of Applied Econometrics, 31:163–191, 2016.
D. A. Aschauer. Is public expenditure productive? Journal of Monetary Economics, 23:
177–200, 1989.
P. R. D. Bom and J. Ligthart. How productive is public capital? A meta-analysis. CESifo
Working Paper Series 2206, CESifo Group Munich, 2008.
S. Bonhomme and E. Manresa. Grouped patterns of heterogeneity in panel data. Econometrica,
83:1147–1184, 2015a.
S. Bonhomme and E. Manresa. Supplement to ‘grouped patterns of heterogeneity in panel
data’. Econometrica Supplemental Material, 83:1147–1184, 2015b.
S. Bonhomme, T. Lamadon, and E. Manresa. Discretizing unobserved heterogeneity. Working
paper, 2017a.
S. Bonhomme, T. Lamadon, and E. Manresa. A distributional framework for matched employer
employee data. Working paper, 2017b.
R. C. Bradley and C. Tone. A central limit theorem for non-stationary strongly mixing random
fields. Journal of Theoretical Probability, 30:655–674, 2017.
J. E. Brand and Y. Xie. Who benefits most from college? Evidence for negative selection
in heterogeneous economic returns to higher education. American Sociological Review, 75:
273–302, 2010.
C. Calderon, E. Moral-Benito, and L. Serven. Is infrastructure capital productive? A dynamic
heterogeneous approach. Journal of Applied Econometrics, 30:177–198, 2015.
I. A. Canay. A simple approach to quantile regression for panel data. The Econometrics
Journal, 14:368–386, 2011.
P. Cantos, M. Gumbau-Albert, and J. Maudos. Transport infrastructures, spillover effects and
regional growth: Evidence of the Spanish case. Transport Reviews, 25:25–50, 2005.
V. Chernozhukov, I. Fernández-Val, and A. Galichon. Quantile and probability curves without
crossing. Econometrica, 78:1093–1125, 2010.
V. Chernozhukov, I. Fernandez-Val, and M. Weidner. Network and panel quantile effects via
distribution regression. working papers 1803.08154, arXiv.org, 2018.
Y. Dong and A. Lewbel. Nonparametric identification of a binary random factor in cross section
data. Journal of Econometrics, 163:163–171, 2011.
A. Dzemski and R. Okui. Confidence set for group membership. Working paper, 2018.
J. Fan and Q. Yao. Nonlinear time series: nonparametric and parametric methods. Springer-
Verlag, 2008.
Y. Fan, E. Guerre, and S. Lazarova. A unified framework for the estimation and inference in
linear quantile regression: A local polynomial approach. Working paper, 2017.
A. F. Galvao. Quantile regression for dynamic panel data with fixed effects. Journal of Econo-
metrics, 164:142–157, 2011.
A. F. Galvao and K. Kato. Smoothed quantile regression for panel data. Journal of Economet-
rics, 193:92–112, 2016.
A. F. Galvao and A. Poirier. Random effects quantile regression. Working paper, 2016.
A. F. Galvao, T. Juhl, G. Montes-Rojas, and J. Olmo. Testing slope homogeneity in quantile
regression panel data with an application to the cross-section of stock returns. Journal of
Financial Econometrics, 16:211–243, 2018.
X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105:1531–
1540, 2010.
B. S. Graham, J. Hahn, A. Poirier, and J. L. Powell. A quantile correlated random coefficients
panel data model. Journal of Econometrics, forthcoming.
A. Gregoriou and S. Ghosh. On the heterogeneous impact of public capital and current spending
on growth across nations. Economics Letters, 105:32–35, 2009.
J. Gu and S. Volgushev. Panel data quantile regression with grouped fixed effects. Working
paper, 2018.
J. Hahn and H. R. Moon. Panel data models with finite number of multiple equilibria. Econo-
metric Theory, 26:863–881, 2010.
M. Harding, C. Lamarche, and M. H. Pesaran. Common correlated effects estimation of het-
erogeneous dynamic panel quantile regression models. Working paper, 2017.
H. Kasahara and K. Shimotsu. Nonparametric identification of finite mixture models of dynamic
discrete choices. Econometrica, 77:135–175, 2009.
K. Kato, A. F. Galvao, and G. V. Montes-Rojas. Asymptotics for panel quantile regression
models with individual effects. Journal of Econometrics, 170:76–91, 2012.
Y. Ke, J. Li, and W. Zhang. Structure identification in panel data analysis. The Annals of
Statistics, 44:1193–1233, 2016.
K. Knight. Limiting distributions for L1 regression estimators under general conditions. The
Annals of Statistics, 26:755–770, 1998.
R. Koenker. Quantile regression for longitudinal data. Journal of Multivariate Analysis, 91:
74–89, 2004.
R. Koenker. Quantile Regression. Cambridge University Press, Cambridge, UK, 2005.
A. N. Kolmogorov and Y. A. Rozanov. On strong mixing conditions for stationary Gaussian
processes. Theory of Probability & Its Applications, 5(2):204–208, 1960.
E. Krasnokutskaya, K. Song, and X. Tang. Estimating unobserved agent heterogeneity using
pairwise comparisons. Working paper, 2017.
E. M. Leeper, T. B. Walker, and S.-C. S. Yang. Government investment and fiscal stimulus.
Journal of Monetary Economics, 57:1000–1012, 2010.
S. Leorato and F. Peracchi. Comparing distribution and quantile regression. EIEF Working
Papers Series 1511, Einaudi Institute for Economics and Finance (EIEF), 2015.
C.-C. Lin and S. Ng. Estimation of panel data models with parameter heterogeneity when
group membership is unknown. Journal of Econometric Methods, 1:42–55, 2012.
R. Liu, A. Schick, Z. Shang, Y. Zhang, and Q. Zhou. Identification and estimation in panel
models with overspecified number of groups. Working paper, 2018.
S. Ng and G. McLachlan. Mixture models for clustering multilevel growth trajectories. Com-
putational Statistics & Data Analysis, 71:43–51, 2014.
R. Okui and W. Wang. Heterogeneous structural breaks in panel data models. Working paper,
2018.
D. Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9:135–140, 1981.
W. Romp and J. De Haan. Public capital and economic growth: A critical survey. Perspektiven
der Wirtschaftspolitik, 8:6–52, 2007.
O. Rosen, W. Jiang, and M. Tanner. Mixtures of marginal models. Biometrika, 87:391–404,
2000.
V. Sarafidis and N. Weber. A partially heterogeneous framework for analyzing panel data.
Oxford Bulletin of Economics and Statistics, 77:274–296, 2015.
J.-E. Sturm, J. Jacobs, and P. Groote. Output effects of infrastructure investment in the
Netherlands, 1853–1913. Journal of Macroeconomics, 21:355–380, 1999.
L. Su and G. Ju. Identifying latent grouped patterns in panel data models with interactive
fixed effects. Journal of Econometrics, 2018. forthcoming.
L. Su and W. Wang. Identifying latent group structures in nonlinear panels. Working paper,
2017.
L. Su, Z. Shi, and P. C. B. Phillips. Identifying latent structures in panel data. Econometrica,
84:2215–2264, 2016.
L. Su, X. Wang, and S. Jin. Sieve estimation of time-varying panel data models with latent
structures. Journal of Business & Economic Statistics, 2017. forthcoming.
S. Sugasawa. Grouped heterogeneous mixture modeling for clustered data. Working paper,
2018.
Y. Sun. Estimation and inference in panel structural models. Working paper, Department of
Economics, UCSD, 2005.
N. Uhde. Output effects of infrastructures in east and west German states. Intereconomics, 45:
322–328, 2010.
M. Vogt and O. Linton. Classification of non-parametric regression functions in longitudinal
data models. Journal of the Royal Statistical Society: Series B, 79:5–27, 2016.
W. Wang, X. Zhang, and R. Paap. To pool or not to pool: What is a good strategy for
parameter estimation and forecasting in panel regressions? Working paper, 2017.
G. L. Wu, Q. Feng, and Z. Wang. Estimating productivity of public infrastructure investment.
Working paper, 2017.
A. Y. Zhang and H. H. Zhou. Minimax rates of community detection in stochastic block models.
The Annals of Statistics, 44:2252–2280, 2016.
Z. Zhao and Z. Xiao. Efficient regressions via optimally combining quantile information. Econo-
metric Theory, 30:1272–1314, 2014.
H. Zou and M. Yuan. Composite quantile regression and the oracle model selection theory. The
Annals of Statistics, 36:1108–1126, 2008.
A Appendix
This appendix provides the proofs of the technical results in the main text. We first prove the weak consistency of the regression quantile estimates under unknown (estimated) group memberships in A.1, and then derive the convergence rate of the misclassification probability in A.2. A.3 establishes the asymptotic distribution of the regression quantile estimates. The proof of the consistency of the estimated number of groups is given in A.4, and finally the case of individual fixed effects is treated in A.5.
A.1 Proof of Theorem 4.1 (weak consistency)
Proof. By the definition of $\hat\beta(\tau)$, for each fixed $k$, $\hat\beta(\tau_k)$ minimizes the sub-objective function
$$\sum_{i=1}^{N}\sum_{t=1}^{T}\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i}(\tau_k)\bigr). \tag{A.1}$$
For notational simplicity, we write $\beta_{g_i}$ for $\beta_{g_i}(\tau_k)$ throughout the proof. Let
$$\Delta_i(\beta) := T^{-1}\sum_{t=1}^{T}\bigl\{\rho_{\tau_k}(y_{it}-x_{it}'\beta_{g_i})-\rho_{\tau_k}(y_{it}-x_{it}'\beta^0_{g_i^0})\bigr\}.$$
For any $\delta>0$, define the neighbourhood $B^0_g(\delta):=\{\beta:\|\beta-\beta^0_g\|\le\delta\}$ and its boundary $\partial B^0_g(\delta):=\{\beta:\|\beta-\beta^0_g\|=\delta\}$. We now prove (4.1). Note that by the definition of $\hat\beta(\tau_k)$, we have
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
=\bigl\{\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta,\ \exists\,1\le i\le N\bigr\}
\subset\Bigl\{N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)\le 0,\ \exists\, g,\tilde g\in\{1,\dots,G^0\},\ \exists\,\beta\notin B^0_g(\delta)\Bigr\}. \tag{A.2}$$
For $\beta\notin B^0_g(\delta)$, define $\bar\beta=r_g\beta+(1-r_g)\beta^0_g$, where $r_g=\delta/\|\beta-\beta^0_g\|$. Observe that $r_g\in(0,1)$ and $\bar\beta\in\partial B^0_g(\delta)$. Similar to the proof of Theorem 3.1 in Kato et al. (2012), the convexity of the
sub-objective function (A.1) yields that
$$r_g\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)
\ge\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\bar\beta)
=\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\bar\beta)]
+\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\bar\beta)-\mathrm{E}[\Delta_i(\bar\beta)]\bigr\}. \tag{A.3}$$
We now show that for some $\varepsilon_\delta>0$,
$$\Bigl\{N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\Delta_i(\beta)\le 0,\ \exists\, g,\tilde g\in\{1,\dots,G^0\},\ \exists\,\beta\notin B^0_g(\delta)\Bigr\}
\subset\Bigl\{\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|>\varepsilon_\delta\Bigr\}. \tag{A.4}$$
To prove (A.4), in view of (A.3) it suffices to show that for all $g,\tilde g\in\{1,\dots,G^0\}$,
$$\liminf_{N\to\infty}\ \min_{\beta\in\partial B^0_g(\delta)} N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\beta)]>0,\quad\text{a.s.} \tag{A.5}$$
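The expansion below relies on the identity of Knight (1998); for the reader's convenience, it states that

```latex
\rho_\tau(u-v)-\rho_\tau(u)
  = -v\bigl(\tau - I(u<0)\bigr)
  + \int_0^{v}\bigl(I(u\le s)-I(u\le 0)\bigr)\,ds .
```

Taking expectations with $u=\varepsilon_{it}(\tau_k)$ and $v=x_{it}'(\beta-\beta^0_g)$, the linear term has zero conditional mean because $\tau_k$ is the conditional quantile level of $\varepsilon_{it}(\tau_k)$, which leaves the integral term appearing in the display that follows.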
By the Knight (1998) identity we have
$$\mathrm{E}[\Delta_i(\beta)]
=\mathrm{E}\Bigl[\int_0^{x_{it}'(\beta-\beta^0_g)}\bigl\{F_{i,\tau_k}(z\,|\,x_{it})-\tau_k\bigr\}\,dz\Bigr]
=\frac{1}{2}(\beta-\beta^0_g)'\,\mathrm{E}\bigl[f_{i,\tau_k}(0\,|\,x_{it})x_{it}x_{it}'\bigr](\beta-\beta^0_g)+o(\delta^2)$$
as $\delta\to0$, where the last equality is the uniform expansion of $\mathrm{E}[\Delta_i(\beta)]$ over $\{\beta\in\partial B^0_g(\delta)\}$ and $i\ge1$, by Assumption 1(iv). It hence follows that for each $\beta\in\partial B^0_g(\delta)$,
$$N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}[\Delta_i(\beta)]
=\frac{1}{2}(\beta-\beta^0_g)'\,N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\,\mathrm{E}\bigl[f_{i,\tau_k}(0\,|\,x_{it})x_{it}x_{it}'\bigr](\beta-\beta^0_g)+o(\delta^2)
\ge\frac{1}{2}\delta^2\lambda_N(g,\tilde g,\tau_k)+o(\delta^2)\quad\text{a.s.},$$
which leads to (A.5) by Assumption 1(v). Combining (A.2) and (A.4) yields that
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
\subset\Bigl\{\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|>\varepsilon_\delta\Bigr\}.$$
Therefore, to prove (4.1), it suffices to show that as $N,T\to\infty$,
$$\max_{g,\tilde g\in\{1,\dots,G^0\}}\ \sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|=o_P(1),$$
which, since $G^0$ is finite, is equivalent to
$$\sup_{\beta\in\partial B^0_g(\delta)}\Bigl|N^{-1}\sum_{i=1}^{N} I\{g_i^0=g,\,\hat g_i=\tilde g\}\bigl\{\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr\}\Bigr|=o_P(1) \tag{A.6}$$
for any $g,\tilde g\in\{1,\dots,G^0\}$. We obtain (A.6) by showing that for every $\varepsilon>0$,
$$\lim_{T\to\infty}\mathrm{P}\Bigl\{\sup_{\beta\in B^0_g(\delta)}\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|>\varepsilon\Bigr\}=0,\qquad g\in\{1,\dots,G^0\}, \tag{A.7}$$
uniformly for all $i\ge1$. Without loss of generality, we write $B^0_g(\delta)=B^0(\delta)$ for simplicity. Since $B^0(\delta)$ is a compact subset of $\mathbb{R}^{p+1}$, there exist $J$ balls with centers $\{\beta^{(j)},\,j=1,\dots,J\}$ and radius $r$ such that the collection of the $J$ balls covers $B^0(\delta)$. Then for each $\beta\in B^0(\delta)$, there is $j\in\{1,\dots,J\}$ such that $\|\beta-\beta^{(j)}\|\le r$. Observe that
$$\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|
=\Bigl|T^{-1}\sum_{t=1}^{T}\bigl[\rho_{\tau_k}(y_{it}-x_{it}'\beta)-\rho_{\tau_k}(y_{it}-x_{it}'\beta^{(j)})\bigr]\Bigr|
\le T^{-1}\sum_{t=1}^{T} C(1+\|x_{it}\|)\,\|\beta-\beta^{(j)}\|$$
for some constant $C>0$ independent of $i$ and $t$. Let $L(x):=C(1+\|x\|)$ and $\kappa:=\sup_{i\ge1}\mathrm{E}[L(x_{it})]<\infty$ by Assumption 1(iii). Then we have
$$\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+r\kappa,$$
and hence
$$\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|
\le\bigl|\Delta_i(\beta)-\Delta_i(\beta^{(j)})\bigr|+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+\mathrm{E}\bigl|\Delta_i(\beta^{(j)})-\Delta_i(\beta)\bigr|
\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+r\kappa+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+r\kappa.$$
Setting $r=\varepsilon/(6\kappa)$ leads to
$$\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|
\le rT^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|+\varepsilon/6+\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|+\varepsilon/6.$$
Therefore, we have
$$\mathrm{P}\Bigl\{\sup_{\beta\in B^0(\delta)}\bigl|\Delta_i(\beta)-\mathrm{E}[\Delta_i(\beta)]\bigr|>\varepsilon\Bigr\}
\le\mathrm{P}\Bigl\{T^{-1}\Bigl|\sum_{t=1}^{T}\bigl\{L(x_{it})-\mathrm{E}[L(x_{it})]\bigr\}\Bigr|>2\kappa\Bigr\}
+\sum_{j=1}^{J}\mathrm{P}\bigl\{\bigl|\Delta_i(\beta^{(j)})-\mathrm{E}[\Delta_i(\beta^{(j)})]\bigr|>\varepsilon/3\bigr\}, \tag{A.8}$$
where $J$ can be chosen such that $J=O(\varepsilon^{-p-1})$ as $\varepsilon\to0$. By Assumption 1(ii), an application of the ergodic theorem for $\alpha$-mixing processes (see Proposition 2.8 of Fan and Yao (2008)) yields that both terms on the right-hand side of (A.8) are $o(1)$ as $T\to\infty$, uniformly over $1\le i\le N$. This leads to (A.7) and hence (4.1).
A.2 Convergence rate of misclassification probability
We first provide the proof of Lemma 1.
Proof. We first prove that
$$\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\xrightarrow{\ p\ }0 \tag{A.9}$$
for all $\tilde g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$. Note that
$$\frac{1}{N}\sum_{i=1}^{N}\Bigl(\min_{g\in\{1,\dots,G^0\}} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)
=\Bigl(\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\Bigr)\Bigl(\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr).$$
By Assumption 2(i), to prove (A.9) it suffices to show
$$\frac{1}{N}\sum_{i=1}^{N}\Bigl(\min_{g\in\{1,\dots,G^0\}} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)\xrightarrow{\ p\ }0.$$
Now,
$$\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\Bigl(\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|\Bigr)
\le\frac{1}{N}\sum_{i=1}^{N} I\{g_i^0=\tilde g\}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|
\le\frac{1}{N}\sum_{i=1}^{N}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|
\le\max_{1\le i\le N}\bigl\|\hat\beta_{\hat g_i}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|,$$
which is $o_P(1)$ as $N,T\to\infty$, following the consistency result in Theorem 4.1. Therefore, (A.9) holds by Assumption 2(i).
We define, for all $\tilde g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$,
$$\sigma(\tilde g)=\arg\min_{g\in\{1,\dots,G^0\}}\bigl\|\hat\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|.$$
We now show that, with probability approaching one as $N,T\to\infty$, $\sigma:\{1,\dots,G^0\}\to\{1,\dots,G^0\}$ is a one-to-one mapping. Let $g\neq\tilde g$. By the triangle inequality, we have
$$\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\hat\beta_{\sigma(\tilde g)}(\tau_k)\bigr\|\ge\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|-\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\beta^0_g(\tau_k)\bigr\|-\bigl\|\hat\beta_{\sigma(\tilde g)}(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|,$$
where the right-hand side converges in probability to $\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\|>0$ by Assumption 2(ii) and result (A.9). In other words, as $N,T\to\infty$, with probability approaching one we have $\sigma(g)\neq\sigma(\tilde g)$ if $g\neq\tilde g$. Hence, $\sigma$ is an invertible mapping, which implies that
$$\bigl\|\hat\beta_{\sigma(g)}(\tau_k)-\beta^0_g(\tau_k)\bigr\|\xrightarrow{\ p\ }0,\qquad N,T\to\infty,$$
for all $g\in\{1,\dots,G^0\}$ and $k=1,\dots,K$. Since the objective function (A.1) is invariant to relabeling of the groups, without loss of generality we take $\sigma(g)=g$. This completes the proof.
Lemma 2. Suppose $Z^{(T)}:=\{Z^{(T)}_t,\ t=1,\dots,T\}$ is an array of random variables such that $\mathrm{E}\,Z^{(T)}_t=0$ and $\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2<\infty$ for each $t=1,\dots,T$. Suppose the following mixing conditions hold:
$$\alpha(Z,t):=\sup_T\alpha\bigl(Z^{(T)},t\bigr)\to0,\quad t\to\infty,\qquad
\rho'(Z,1):=\sup_T\max_{1\le t<T}\rho\bigl(Z^{(T)},t\bigr)<1.$$
Define the partial sum $S_T:=\sum_{t=1}^{T}Z^{(T)}_t$ and $\sigma^2_T:=\mathrm{E}\,S^2_T$. Then for any $T\ge1$, the variance of the partial sum is bounded as follows:
$$C_Z^{-1}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2\le\sigma^2_T\le C_Z\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2, \tag{A.10}$$
where $C_Z=(1+\rho'(Z,1))/(1-\rho'(Z,1))$. Assume that $\sigma^2_T>0$, and suppose that the Lindeberg condition
$$\forall\,\varepsilon>0,\qquad\lim_{T\to\infty}\frac{1}{\sigma^2_T}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_t\bigr)^2 I\bigl(\bigl|Z^{(T)}_t\bigr|>\varepsilon\sigma_T\bigr)=0$$
holds. Then, as $T\to\infty$,
$$\sigma^{-1}_T S_T\xrightarrow{\ d\ }N(0,1).$$
Proof of Theorem 4.2

Proof. We first examine the asymptotic property of $\hat g_i(\beta(\tau))$. It follows from the definition of $\hat g_i(\cdot)$ that for all $g\in\{1,\dots,G^0\}$,
$$I\bigl\{\hat g_i(\beta(\tau))=g\bigr\}\le I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_g(\tau_k)\bigr)\le\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr)\Bigr\},$$
so
$$\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}
=\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}I\bigl\{g_i^0\neq g\bigr\}\,I\bigl\{\hat g_i(\beta(\tau))=g\bigr\}
\le\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}Z_{ig}(\beta(\tau)),$$
where
$$Z_{ig}(\beta(\tau))=I\bigl\{g_i^0\neq g\bigr\}\times I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_g(\tau_k)\bigr)\le\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr)\Bigr\}.$$
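The group-assignment rule analyzed here picks, for each unit, the group whose coefficient paths minimize the composite quantile loss $\sum_t\sum_k w_k\rho_{\tau_k}(y_{it}-x_{it}'\beta_g(\tau_k))$. The following is a minimal sketch of that rule, not the authors' implementation; all function and variable names are our own, and it assumes the candidate coefficients $\beta_g(\tau_k)$ are already given.

```python
import numpy as np

def check(u, tau):
    # Quantile check function: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

def assign_groups(y, X, beta, taus, w):
    """Assign each unit to the group with the smallest composite quantile loss.

    y: (N, T) outcomes; X: (N, T, p) regressors;
    beta: (G, K, p) candidate group coefficients at K quantile levels;
    taus: (K,) quantile levels; w: (K,) quantile weights.
    """
    N, _ = y.shape
    G, K, _ = beta.shape
    loss = np.zeros((N, G))
    for g in range(G):
        for k in range(K):
            resid = y - X @ beta[g, k]                     # (N, T) residuals
            loss[:, g] += w[k] * check(resid, taus[k]).sum(axis=1)
    return loss.argmin(axis=1)                             # (N,) group labels
```

In a full estimation loop this assignment step would alternate with re-estimating the group-specific quantile coefficients until the labels stabilize.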
By the Knight (1998) identity we have
$$Z_{ig}(\beta(\tau))=I\bigl\{g_i^0\neq g\bigr\}\times I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{g_i^0}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{g_i^0}(\tau_k)-\beta_{g_i^0}(\tau_k))\le u\bigr)-\tau_k\bigr]du\le0\Bigr\}$$
$$\le\max_{\tilde g\neq g}I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{\tilde g}(\tau_k)-\beta_{\tilde g}(\tau_k))\le u\bigr)-\tau_k\bigr]du\le0\Bigr\}.$$
We now bound $Z_{ig}(\beta(\tau))$, for all $\beta(\tau)\in\mathcal{N}_\eta$, by a quantity that does not depend on $\beta(\tau)$. Define
$$D_{TK}:=\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_g(\tau_k)-\beta_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)+x_{it}'(\beta^0_{\tilde g}(\tau_k)-\beta_{\tilde g}(\tau_k))\le u\bigr)-\tau_k\bigr]du
-\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|.$$
After a change of variables in the first integral, and using the Cauchy–Schwarz (CS) inequality, we observe that
$$D_{TK}=\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_{x_{it}'(\beta_{\tilde g}(\tau_k)-\beta^0_{\tilde g}(\tau_k))}^{x_{it}'(\beta_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du
-\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|$$
$$\le\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))+x_{it}'(\beta_g(\tau_k)-\beta^0_g(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|
+\Bigl|\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta_{\tilde g}(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\bigl(\varepsilon_{it}(\tau_k)\le u\bigr)-\tau_k\bigr]du\Bigr|$$
$$\le TK\cdot2\eta\Bigl(\frac{1}{T}\sum_{t=1}^{T}\|x_{it}\|\Bigr).$$
We thus obtain that
$$Z_{ig}(\beta(\tau))\le\max_{\tilde g\neq g}I\Bigl\{\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl[I\{\varepsilon_{it}(\tau_k)\le u\}-\tau_k\bigr]du\le TK\cdot2\eta\Bigl(\frac{1}{T}\sum_{t=1}^{T}\|x_{it}\|\Bigr)\Bigr\},$$
where the right-hand side of this inequality, denoted by $\bar Z_{ig}$, does not depend on $\beta(\tau)$. As a result,
$$\sup_{\beta(\tau)\in\mathcal{N}_\eta}\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}\le\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}. \tag{A.11}$$
Inequality (A.11) implies that for any $\delta>0$, $\mathrm{P}\bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}N^{-1}\sum_{i=1}^{N}I\{\hat g_i(\beta(\tau))\neq g_i^0\}>\delta\bigr)$ is bounded by $N^{-1}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\mathrm{E}\,\bar Z_{ig}/\delta$ by Markov's inequality. So, in order to obtain result (4.3), we first derive the asymptotic behavior of $\mathrm{E}\,\bar Z_{ig}$ as $T\to\infty$. Define
$$b_{i,t}(K)=K^{-1}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl(I(\varepsilon_{it}(\tau_k)\le u)-\tau_k\bigr)du.$$
We then have, by Assumption 2(v),
$$\mathrm{E}\,\bar Z_{ig}=\mathrm{P}\bigl(\bar Z_{ig}=1\bigr)\le\sum_{\tilde g\neq g}\mathrm{P}\Bigl(T^{-1}\sum_{t=1}^{T}b_{i,t}(K)\le2\eta M\Bigr). \tag{A.12}$$
To bound the term on the right-hand side of (A.12), we rely on a central limit theorem for mixing processes; specifically, we apply Lemma 2, which is a direct consequence of Theorems 1.1 and 2.2 in Bradley and Tone (2017).
For each $i=1,\dots,N$, the array $\{Z^{(T)}_{i,t}:=T^{-1/2}(b_{i,t}(K)-\mathrm{E}[b_{i,t}(K)]),\ t=1,\dots,T\}$ satisfies the mixing conditions in Lemma 2 by Assumptions 1(ii) and 1(v). We show that the Lindeberg condition holds as follows. Define $S_{i,T}=\sum_{t=1}^{T}Z^{(T)}_{i,t}$ and $\sigma^2_{i,T}=\mathrm{E}\,S^2_{i,T}$. We have that for any $\varepsilon>0$,
$$\frac{1}{\sigma^2_{i,T}}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2 I\bigl(\bigl|Z^{(T)}_{i,t}\bigr|>\varepsilon\sigma_{i,T}\bigr)
\le\frac{1}{\varepsilon^{\delta}\sigma^{2+\delta}_{i,T}}\sum_{t=1}^{T}\mathrm{E}\bigl|Z^{(T)}_{i,t}\bigr|^{2+\delta}
\le\varepsilon^{-\delta}\Bigl[C_i^{-1}\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2\Bigr]^{-(1+\delta/2)}\sum_{t=1}^{T}\mathrm{E}\bigl|Z^{(T)}_{i,t}\bigr|^{2+\delta}
=\varepsilon^{-\delta}\Bigl[C_i^{-1}\mathrm{E}\bigl(Z^{(T)}_{i,1}\bigr)^2\Bigr]^{-(1+\delta/2)} T^{-\delta/2}\,\mathrm{E}\bigl|Z^{(T)}_{i,1}\bigr|^{2+\delta}\longrightarrow 0,\quad T\to\infty, \tag{A.13}$$
where $C_i=(1+\rho'(Z_i,1))/(1-\rho'(Z_i,1))$. Moreover, the convergence in (A.13) is uniform for $i=1,\dots,N$, since both $C_i^{-1}$ and $\mathrm{E}\,|Z^{(T)}_{i,1}|^{2+\delta}$ can be uniformly bounded. We now apply Lemma 2 to $\{Z^{(T)}_{i,t},\ t=1,\dots,T\}$ to bound the right-hand side of (A.12). First, for $\eta\le\frac{1}{4M}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]$,
$$\mathrm{P}\Bigl(T^{-1}\sum_{t=1}^{T}b_{i,t}(K)\le2\eta M\Bigr)
=\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le\sigma^{-1}_{i,T}\sqrt{T}\bigl(2\eta M-\mathrm{E}[b_{i,t}(K)]\bigr)\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sigma^{-1}_{i,T}\sqrt{T}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr), \tag{A.14}$$
noting that by Assumptions 2(ii)–2(iii) and 2(v),
$$\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]
=\inf_{i\ge1}K^{-1}\sum_{k=1}^{K}w_k\,\mathrm{E}\Bigl[\int_0^{x_{it}'(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k))}\bigl(F_{i,\tau_k}(u\,|\,x_{it})-\tau_k\bigr)du\Bigr]
\ge c\inf_{i\ge1}K^{-1}\sum_{k=1}^{K}w_k\bigl(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr)'\mathrm{E}[x_{it}x_{it}']\bigl(\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr)
\ge c\Bigl(\inf_{i\ge1}\lambda_i\Bigr)K^{-1}\sum_{k=1}^{K}w_k\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|^2>0,$$
where $c$ is a constant independent of $i$, $t$ and $k$, and $\lambda_i$ is the minimum eigenvalue of $\mathrm{E}[x_{it}x_{it}']$.
Since it follows from Assumption 2(v) and the CS inequality that
$$\sigma^2_{i,T}\le C_i\sum_{t=1}^{T}\mathrm{E}\bigl(Z^{(T)}_{i,t}\bigr)^2
= C_i\,\mathrm{E}\bigl(b_{i,t}(K)-\mathrm{E}[b_{i,t}(K)]\bigr)^2
\le C_i\,\mathrm{E}\bigl[\|x_{it}\|^2\bigr]\,K^{-1}\sum_{k=1}^{K}\bigl\|\beta^0_g(\tau_k)-\beta^0_{\tilde g}(\tau_k)\bigr\|^2<\infty,$$
we have
$$\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sigma^{-1}_{i,T}\sqrt{T}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2}\sqrt{T}\Bigl(\sup_{i\ge1}C_i\,\mathrm{Var}[b_{i,t}(K)]\Bigr)^{-1/2}\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]\Bigr)
\le\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{1}{2\sqrt{C'}}\,\frac{\inf_{i\ge1}\mathrm{E}[b_{i,t}(K)]}{\sqrt{\mathrm{Var}[b_{i,t}(K)]}}\sqrt{T}\Bigr), \tag{A.15}$$
where $C'=(1+\rho')/(1-\rho')$ with $\rho'$ defined in Assumption 2(iv). By Lemma 2 we have
$$\lim_{T\to\infty}\frac{\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)}{\Phi\Bigl(-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)}=1$$
uniformly for $i=1,\dots,N$, with $\Phi(\cdot)$ denoting the standard normal distribution function. Therefore, combining (A.12), (A.14) and (A.15) yields that for sufficiently large $T$,
$$\mathrm{E}\,\bar Z_{ig}\le\sum_{\tilde g\neq g}\mathrm{P}\Bigl(\sigma^{-1}_{i,T}S_{i,T}\le-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)
\le 2\sum_{\tilde g\neq g}\Phi\Bigl(-\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)
\le D_0 T^{-1/2}\sum_{\tilde g\neq g}\phi\Bigl(\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr), \tag{A.16}$$
uniformly for $i=1,\dots,N$, with $D_0$ denoting a constant, where the last inequality follows from the Mills ratio and $\phi(\cdot)$ is the standard normal density function. We now define
$$\zeta=\min_{\substack{g\neq\tilde g\\ g,\tilde g\in\{1,\dots,G^0\}}}\frac{\zeta^2_{g,\tilde g}}{8C'}>0.$$
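The Mills ratio bound invoked in the last inequality of (A.16) is the standard Gaussian tail inequality

```latex
\Phi(-x) \;\le\; \frac{\phi(x)}{x}, \qquad x>0,
```

applied with $x=\zeta_{g,\tilde g}\sqrt{T}/(2\sqrt{C'})$, which produces the factor $T^{-1/2}$ multiplying the normal density.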
It follows from (A.16) that for any $\varepsilon>0$, there is $M^*=G^0(G^0-1)\frac{D_0}{\sqrt{2\pi}}\varepsilon^{-1}$ such that
$$\mathrm{P}\Bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}\frac{1}{N}\sum_{i=1}^{N}I\bigl\{\hat g_i(\beta(\tau))\neq g_i^0\bigr\}>M^* T^{-1/2}\exp(-\zeta T)\Bigr)
\le\mathrm{P}\Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}>M^* T^{-1/2}\exp(-\zeta T)\Bigr)
\le\frac{1}{M^* T^{-1/2}\exp(-\zeta T)}\,\mathrm{E}\Bigl(\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G^0}\bar Z_{ig}\Bigr)
\le\frac{D_0 T^{-1/2}}{M^* T^{-1/2}\exp(-\zeta T)}\sum_{g=1}^{G^0}\sum_{\tilde g\neq g}\phi\Bigl(\frac{\zeta_{g,\tilde g}}{2\sqrt{C'}}\sqrt{T}\Bigr)\le\varepsilon.$$
That is (4.3).
Next, we prove (4.6). Let us denote
$$\hat Q(\beta(\tau))=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\,\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{\hat g_i(\beta(\tau))}(\tau_k)\bigr)$$
and
$$\tilde Q(\beta(\tau))=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\,\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i^0}(\tau_k)\bigr).$$
Then $\hat Q(\cdot)$ is minimized at $\hat\beta(\tau)$, and $\tilde Q(\cdot)$ is minimized at $\tilde\beta(\tau)$, the infeasible estimator that uses the true group memberships. By Assumptions 1(i), 2(v) and (4.3), it is easy to observe that
$$\sup_{\beta(\tau)\in\mathcal{N}_\eta}\bigl|\hat Q(\beta(\tau))-\tilde Q(\beta(\tau))\bigr|=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr) \tag{A.17}$$
as $N,T\to\infty$. Note that by Lemma 1, we have, as $N,T\to\infty$,
$$\mathrm{P}\bigl(\hat\beta(\tau)\notin\mathcal{N}_\eta\bigr)\longrightarrow 0. \tag{A.18}$$
Similarly, since $\tilde\beta(\tau)$ is also a consistent estimator of $\beta^0(\tau)$ under the assumptions of Theorem 4.1, we have
$$\mathrm{P}\bigl(\tilde\beta(\tau)\notin\mathcal{N}_\eta\bigr)\longrightarrow 0. \tag{A.19}$$
Now, combining (A.17) and (A.18) yields that
$$\hat Q(\hat\beta(\tau))-\tilde Q(\hat\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.20}$$
This is because, for $x>0$,
$$\mathrm{P}\bigl(\bigl|\hat Q(\hat\beta(\tau))-\tilde Q(\hat\beta(\tau))\bigr|>xT^{-1/2}\exp(-\zeta T)\bigr)
\le\mathrm{P}\bigl(\hat\beta(\tau)\notin\mathcal{N}_\eta\bigr)+\mathrm{P}\Bigl(\sup_{\beta(\tau)\in\mathcal{N}_\eta}\bigl|\hat Q(\beta(\tau))-\tilde Q(\beta(\tau))\bigr|>xT^{-1/2}\exp(-\zeta T)\Bigr).$$
Likewise, combining (A.17) and (A.19), we obtain
$$\hat Q(\tilde\beta(\tau))-\tilde Q(\tilde\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.21}$$
Hence, using (A.20) and (A.21) and the definitions of $\hat\beta(\tau)$ and $\tilde\beta(\tau)$ yields
$$0\le\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))=\hat Q(\hat\beta(\tau))-\hat Q(\tilde\beta(\tau))+O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)\le O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr).$$
It thus follows that
$$\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr). \tag{A.22}$$
We also observe that
$$\tilde Q(\hat\beta(\tau))-\tilde Q(\tilde\beta(\tau))
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\Bigl(\rho_{\tau_k}\bigl(y_{it}-x_{it}'\hat\beta_{g_i^0}(\tau_k)\bigr)-\rho_{\tau_k}\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\bigr)\Bigr)
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'(\hat\beta_{g_i^0}(\tau_k)-\tilde\beta_{g_i^0}(\tau_k))}\Bigl(I\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\le u\bigr)-\tau_k\Bigr)du, \tag{A.23}$$
where the last equality uses the identity of Knight (1998). Note that $I\bigl(y_{it}-x_{it}'\tilde\beta_{g_i^0}(\tau_k)\le u\bigr)-\tau_k$ takes only the values $1-\tau_k$ and $-\tau_k$ for $k=1,\dots,K$. It hence follows from (A.22) and (A.23) that
$$\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}x_{it}'\bigl(\hat\beta_{g_i^0}(\tau_k)-\tilde\beta_{g_i^0}(\tau_k)\bigr)
=\frac{1}{K}\sum_{k=1}^{K}\sum_{g=1}^{G^0}\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr).$$
In particular, for all $g$ and $k$, we have
$$\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$$
as $N,T\to\infty$. Note that
$$\Bigl|\Bigl(\frac{1}{N}\sum_{i=1}^{N}I(g_i^0=g)\,\frac{1}{T}\sum_{t=1}^{T}x_{it}\Bigr)'\bigl(\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr)\Bigr|\ge\hat\lambda^{-1/2}\,\bigl\|\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\bigr\|,$$
where $\hat\lambda\xrightarrow{\ p\ }\lambda>0$ as a consequence of Assumption 2(vi). Hence $\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$. This shows (4.6).
A.3 Asymptotic distribution of regression quantile estimates
Here we prove Corollary 2.
Proof. We have by Theorem 3.1 in Galvao and Poirier (2016) that for all $g\in\{1,\dots,G^0\}$, as $N,T\to\infty$,
$$\Gamma(\tau,g)\sqrt{\pi_g NT}\,\bigl(\tilde\beta_g(\tau)-\beta^0_g(\tau)\bigr)\Rightarrow z(\tau,g),$$
where $z(\cdot,g)$ is the $K$-dimensional normal distribution with zero mean, whose covariance matrix is
$$\mathrm{E}\bigl[z(\tau,g)z(\tau',g)'\bigr]=\operatorname*{plim}_{T\to\infty}\,T^{-1}\sum_{s=1}^{T}\sum_{t=1}^{T}\mathrm{E}\Bigl[\bigl(I(\varepsilon_{it}(\tau)\le 0)-\tau\bigr)\bigl(I(\varepsilon_{is}(\tau')\le 0)-\tau'\bigr)x_{it}x_{is}'\,I(g_i^0=g)\Bigr],$$
with $\varepsilon_{it}(\tau)=(\varepsilon_{it}(\tau_1),\dots,\varepsilon_{it}(\tau_K))'$. Result (4.7) then follows from the fact that $\|\hat\beta_g(\tau_k)-\tilde\beta_g(\tau_k)\|=O_P\bigl(T^{-1/2}\exp(-\zeta T)\bigr)$, where $\zeta>0$ is defined in Theorem 4.2.
A.4 Consistency of IC at given τ
In this section, we provide the proof of Theorem 5.1.
Proof. The structure of the proof is similar to that of Theorem 2.6 in Su et al. (2016). We shall show that $\lim_{N,T\to\infty}\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)=1$ for all $G\neq G^0$ with $G\le G_{\max}$. Let $\psi(\omega_{it};\beta_{g_i})=K^{-1}\sum_{k=1}^{K}w_k\rho_{\tau_k}\bigl(y_{it}-x_{it}'\beta_{g_i}(\tau_k)\bigr)$, with $\omega_{it}=(y_{it},x_{it}')'$. Then we have from (5.1) that
$$\mathrm{IC}(G)=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)+G(p+1)f(N,T).$$
If $G=G^0$, then $\mathrm{IC}(G^0)=\hat e_{G^0}+G^0(p+1)f(N,T)$, where $\hat e_{G^0}=\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi(\omega_{it};\hat\beta_{\hat g_i})$. It hence follows from Theorem 4.1 and Assumption 4 that $\mathrm{IC}(G^0)\xrightarrow{\ p\ }\sigma^2_0$. We now prove $\lim_{N,T\to\infty}\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)=1$ for $1\le G<G^0$ and $G^0<G\le G_{\max}$, respectively.
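The selection rule analyzed here, $\hat G=\arg\min_G \mathrm{IC}(G)$, can be sketched in a few lines. This is only an illustration: the function names are ours, and the default penalty $f(N,T)=\log(NT)/\sqrt{NT}$ is merely one choice compatible with the requirement $\sqrt{NT}f(N,T)\to\infty$ used below; the admissible penalties are governed by Assumption 4.

```python
import numpy as np

def select_num_groups(losses, p, N, T, penalty=None):
    """Pick the number of groups minimizing IC(G) = loss(G) + G*(p+1)*f(N,T).

    losses: dict mapping each candidate G to the fitted average composite
    quantile loss (the first term of IC(G)); penalty: the value f(N,T).
    The default is an illustrative choice only, not the paper's prescription.
    """
    if penalty is None:
        penalty = np.log(N * T) / np.sqrt(N * T)
    ic = {G: loss + G * (p + 1) * penalty for G, loss in losses.items()}
    return min(ic, key=ic.get)   # G with the smallest IC value
```

Because the fitted loss keeps decreasing (weakly) in $G$ while the penalty grows linearly, the criterion picks the smallest $G$ after which the loss stops improving appreciably, mirroring the two cases of the proof.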
Case 1: $1\le G<G^0$. By Assumption 4,
$$\min_{1\le G<G^0}\mathrm{IC}(G)\ \ge\ \min_{1\le G<G^0}\ \inf_{P(G)\in\mathcal{P}_G}\hat e_{P(G)}+G(p+1)f(N,T)\ \xrightarrow{\ p\ }\ \underline{e}>\sigma^2_0,$$
where $P(G)=(P_1,\dots,P_G)$ and $\mathcal{P}_G$ denote any $G$-partition of $\{1,2,\dots,N\}$ and the collection of all such partitions, respectively. It hence follows that $\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)\to1$ for $1\le G<G^0$, as $N,T\to\infty$.
Case 2: $G^0<G\le G_{\max}$. With the group membership estimates $\{\hat g_i(G),\ i=1,\dots,N\}$, we define $\hat P_g(G)=\{i\in\{1,2,\dots,N\}:\hat g_i(G)=g\}$ for $g=1,\dots,G$, and let $\hat P(G)=\{\hat P_1(G),\dots,\hat P_G(G)\}$. Then we rewrite the first term in $\mathrm{IC}(G)$ in the following way:
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=\frac{1}{NT}\sum_{g=1}^{G}\sum_{i\in\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=D_{1NT}+D_{2NT}-D_{3NT}+D_{4NT}, \tag{A.24}$$
where
$$D_{1NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in P^0_g}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G^0)}_{g_i^0}\bigr),\qquad
D_{2NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in\hat P_g(G)\setminus P^0_g}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr),$$
$$D_{3NT}=\frac{1}{NT}\sum_{g=1}^{G^0}\sum_{i\in P^0_g\setminus\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G^0)}_{g_i^0}\bigr),\qquad
D_{4NT}=\frac{1}{NT}\sum_{g=G^0+1}^{G}\sum_{i\in\hat P_g(G)}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr).$$
First, following the proof of Theorem 4.2, we can show that, as $N,T\to\infty$,
$$D_{jNT}=O_P\bigl(T^{-1/2}\exp(-\delta_j T)\bigr),\qquad j=2,3,4, \tag{A.25}$$
for some $\delta_j>0$. For the expansion of $D_{1NT}$, we have
$$D_{1NT}-\hat e_{G^0}
=\frac{1}{NTK}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{k=1}^{K}w_k\int_0^{x_{it}'\bigl(\hat\beta^{(G^0)}_{g_i^0}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr)}\bigl(I(\varepsilon_{it}(\tau_k)\le u)-\tau_k\bigr)du
\le N^{-1}\sum_{i=1}^{N}\Bigl(T^{-1}\sum_{t=1}^{T}\|x_{it}\|\Bigr)\Bigl(K^{-1}\sum_{k=1}^{K}\bigl\|\hat\beta^{(G^0)}_{g_i^0}(\tau_k)-\beta^0_{g_i^0}(\tau_k)\bigr\|\Bigr)
=O_P\bigl((NT)^{-1/2}\bigr), \tag{A.26}$$
where the last equality follows from Corollary 2. Therefore, combining (A.24), (A.25) and (A.26) yields that
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi\bigl(\omega_{it};\hat\beta^{(G)}_{\hat g_i}\bigr)=\hat e_{G^0}+O_P\bigl((NT)^{-1/2}\bigr). \tag{A.27}$$
Using (A.27) and the fact that $\sqrt{NT}f(N,T)\to\infty$, we eventually obtain that
$$\mathrm{P}\Bigl(\min_{G^0<G\le G_{\max}}\mathrm{IC}(G)>\mathrm{IC}(G^0)\Bigr)
=\mathrm{P}\Bigl(\sqrt{NT}\Bigl(\min_{G^0<G\le G_{\max}}\mathrm{IC}(G)-\mathrm{IC}(G^0)\Bigr)>0\Bigr)
=\mathrm{P}\Bigl(O_P(1)+\sqrt{NT}f(N,T)(G-G^0)(p+1)>0\Bigr)\xrightarrow{\ p\ }1,\qquad N,T\to\infty. \tag{A.28}$$
It follows that $\mathrm{P}\bigl(\mathrm{IC}(G)>\mathrm{IC}(G^0)\bigr)\to1$ for $G^0<G\le G_{\max}$.
A.5 Asymptotics in the presence of individual-specific fixed effects
We provide the proof of Theorem 6.1 as follows.
Proof. The proof is similar to that of Proposition 3.1 in Galvao and Kato (2016). Put
$$\Delta_i(\alpha_i,\beta_{g_i}):=T^{-1}\sum_{t=1}^{T}\bigl\{\rho_{\tau_k}(y_{it}-\alpha_i-x_{it}'\beta_{g_i})-\rho_{\tau_k}(y_{it}-\alpha_{i0}-x_{it}'\beta^0_{g_i^0})\bigr\}.$$
For $\delta>0$, define $B^0_i(\delta):=\{(\alpha,\beta):|\alpha-\alpha_{i0}|+\|\beta-\beta^0_{g_i^0}\|\le\delta\}$ and $\partial B^0_i(\delta):=\{(\alpha,\beta):|\alpha-\alpha_{i0}|+\|\beta-\beta^0_{g_i^0}\|=\delta\}$. We can expand $\mathrm{E}[\Delta_i(\alpha,\beta)]$ uniformly over $(\alpha,\beta)\in\partial B^0_i(\delta)$ by Assumption 1(iv), and using Assumption 5(iii) yields that
$$\inf_{i\ge1}\ \min_{(\alpha,\beta)\in\partial B^0_i(\delta)}\mathrm{E}[\Delta_i(\alpha,\beta)]>0.$$
Therefore, it follows from a proof similar to that of Theorem 3.1 in Kato et al. (2012) that for some $\varepsilon_\delta>0$,
$$\Bigl\{\max_{1\le i\le N}\|\hat\beta_{\hat g_i}-\beta^0_{g_i^0}\|>\delta\Bigr\}
\subset\bigl\{\Delta_i(\hat\alpha_i,\hat\beta_{\hat g_i})\le0,\ \exists\,1\le i\le N,\ \exists\,(\hat\alpha_i,\hat\beta_{\hat g_i})\notin B^0_i(\delta)\bigr\}
\subset\Bigl\{\max_{1\le i\le N}\ \sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}.$$
This implies that to prove (6.2), it suffices to show that for every $\varepsilon_\delta>0$,
$$\max_{1\le i\le N}\mathrm{P}\Bigl\{\sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}=o\bigl(N^{-1}\bigr) \tag{A.29}$$
as $N\to\infty$. To show this, we apply Corollary C.1 (a Bernstein-type inequality for $\beta$-mixing processes) in the Supplemental Appendix of Galvao and Kato (2016). Under Assumptions 2(v) and 5(ii), taking $s=2\log N$ and $q=[\sqrt{T}]$ yields that for any $\varepsilon_\delta>0$,
$$\max_{1\le i\le N}\mathrm{P}\Bigl\{\sup_{(\alpha,\beta)\in B^0_i(\delta)}\bigl|\Delta_i(\alpha,\beta)-\mathrm{E}[\Delta_i(\alpha,\beta)]\bigr|>\varepsilon_\delta\Bigr\}\le 2\bigl(N^{-2}+\sqrt{T}\,B_{a[\sqrt{T}]}\bigr).$$
The right-hand side is $o(N^{-1})$ by noting that $(\log N)/\sqrt{T}\to0$ as $N,T\to\infty$. Therefore, we obtain (A.29) and then (6.2).