LASSO for Stochastic Frontier Models with Many
Efficient Firms∗
William C. Horrace† Hyunseok Jung‡ Yoonseok Lee§
March 2022
Abstract
We apply the adaptive LASSO (Zou, 2006) to select a set of maximally efficient firms
in the panel fixed-effect stochastic frontier model. The adaptively weighted L1 penalty
with sign restrictions for firm-level inefficiencies allows simultaneous estimation of the
maximal efficiency and firm-level inefficiency parameters, which results in a faster rate
of convergence of the corresponding estimators than the least-squares dummy variable
approach. We show that the estimator possesses the oracle property and selection
consistency still holds with our proposed tuning parameter selection criterion. We also
propose an efficient optimization algorithm based on coordinate descent. We apply
the method to estimate a group of efficient police officers who are best at detecting
contraband in motor vehicle stops (i.e. search efficiency) in Syracuse, NY.
Keywords: Panel Data, Fixed-Effect Stochastic Frontier Model, Adaptive LASSO, L1
Regularization, Sign Restriction, Zero Inefficiency.
∗We are grateful to Badi Baltagi, Christopher Parmeter and the participants at the 15th European Workshop on Efficiency and Productivity Analysis, the 28th Annual Meeting of the Midwest Econometrics Group and the International Association for Applied Econometrics 2019 Annual Conference for their valuable comments and suggestions. All errors are our own.
†Department of Economics, Syracuse University, Syracuse, NY 13244. whorrace@syr.edu
‡Corresponding author: Department of Economics, University of Arkansas, Fayetteville, AR 72701. hj020@uark.edu
§Department of Economics, Syracuse University, Syracuse, NY 13244. ylee41@syr.edu
1 Introduction
Stochastic frontier (SF) models for panel data typically estimate firm-level efficiency from
firm fixed-effects and rank them to identify a single firm in the sample as the most efficient
firm. That is, SF estimators do not identify efficiency ties in general, yet there may be
several firms in the sample tied for most efficient, particularly in competitive markets.
There exist some methodologies to identify multiple efficient firms in the literature, but
they rely on strong distributional assumptions and use two-step procedures. In the first
step, firm-level efficiencies (or equivalent measures) are estimated, and in the second step a
separate inference technique or a selection criterion is used to determine membership in a
subset of the most efficient firms. For example, for the parametric SF model of Aigner, Lovell
and Schmidt (1977), Horrace and Schmidt (1996), Simar and Wilson (2009), and Wheat,
Greene and Smith (2014) estimate the efficiency using Jondrow, Lovell, Materov and Schmidt
(JLMS, 1982) and construct univariate prediction intervals to identify multiple efficient firms
that are statistically indistinguishable from the most efficient firm. Horrace (2005) and
Flores-Lagunes, Horrace and Schnier (2007) extend this to multivariate intervals that account
for the multiplicity inherent in the ranked estimates, and Horrace and Schmidt (2000) develop
multivariate intervals for the fixed-effect SF model of Schmidt and Sickles (1984). Despite
the semi-parametric nature of the fixed-effect model, these inference techniques still rely
on a parametric assumption on the distribution of estimated efficiencies (i.e., that they
are normally distributed or asymptotically so). More recently, Kumbhakar, Parmeter and
Tsionas (2013) propose a zero inefficiency stochastic frontier model for cross sectional data
that produces a subset of firms in the sample that are fully efficient. They estimate the
probability of a firm falling into the zero inefficiency regime using a latent class model,
then use the probability to determine efficient firms. However, this approach still relies on
parametric distributional assumptions and a two-step procedure.1
In this paper, we explicitly assume that some fraction of firms in the panel are fully
efficient and develop a one-step, semi-parametric procedure for identifying a subset of efficient
firms using the adaptive LASSO (Zou, 2006). Specifically, the proposed approach proceeds as
the existing least squares dummy variable (LSDV) estimation, but the objective function is
augmented with the adaptively weighted shrinkage L1 penalty for the firm-level inefficiencies.
Since the inefficiency parameters are constrained to be non-negative in the model, we impose
sign restrictions on the parameters.2 The estimation procedure hence identifies a subset of
firm-level inefficiencies as exactly zero, which is an interesting feature of our model compared
to the conventional LASSO where identification of non-zero coefficients is of primary interest.
The LASSO has been applied to various selection problems, but our paper is the first to
consider its application to the stochastic frontier models for the identification of efficient
firms. We also propose an efficient optimization algorithm based on the coordinate descent
method, which significantly reduces computational costs.
Our estimator requires inefficiencies to be time-invariant, so it is best deployed when
measures of average sample inefficiency are appropriate or desired. If high frequency data
are available over a short time period (e.g., a year), the time-invariance assumption is arguably
reasonable.3 There are more flexible specifications that allow inefficiency to vary over time
while accounting for time-invariant firm heterogeneity (e.g., Greene, 2005, “true fixed-effect
SF model”), but they often come at the cost of more parametric assumptions. We discuss
these models and limitations of time-invariant inefficiency in greater detail in the next section.
1 Rho and Schmidt (2015) raise an identification issue for this model.
2 So, our proposed method is a constrained LASSO. The constrained LASSO has been introduced in the statistics literature (e.g., Hua et al., 2015), but to the best of our knowledge it has not yet been used in an economic context.
3 Our empirical example uses high frequency stop/search/arrest data from the Syracuse Police Department to estimate officer-level time-invariant search efficiency for a single year. Technically, our high frequency data are in the cross-sectional dimension, as each officer performs a certain number of vehicle searches in a year, and the analysis proceeds without regard to continuous time. In this sense, "high frequency" in this paper has a broader definition than it may first appear.
We analyze the asymptotic properties of the proposed estimator for the case $(N, T) \to \infty$,
where N is the number of firms and T is the number of time periods in the sample. We allow
for time-series dependence and cross-sectional heteroskedasticity in errors and covariates,
which is new for the analysis of panel SF models in the literature.4 Also, in our approach,
N is allowed to be much larger than T under proper moment conditions. In this case,
modeling a group of efficient firms is more desirable since the existence of many efficient
firms should be more apparent when markets are large and competitive. We show that the
proposed estimator consistently identifies a set of true efficient firms when the efficiency gap
between the efficient and the inefficient groups vanishes slowly enough with the sample size
and errors and covariates satisfy proper moment conditions. The LASSO estimator of the
maximal efficiency shows $\sqrt{\delta NT}$ consistency, where $\delta$ is the proportion of fully efficient firms in the sample, while the LSDV estimator exhibits $\sqrt{T}/(\log N)^2$ consistency. Consequently,
the LASSO estimators outperform LSDV in many panels, including short panels. This is
borne out in our simulation study. We also study the tuning parameter selection for the
LASSO, with which selection consistency still holds.
We apply our technique to a 2006 panel of Syracuse, NY, police stop/search/arrest data,
previously analyzed by Horrace and Rohlin (2016). Their focus was on estimating vehicle
stop rate differentials by race, and testing their significance for the entire force, ignoring
officer identifiers in the data. We use a linear probability model to model arrest rates
conditional on a vehicle search with time-invariant officer fixed effects, while controlling for
other features of officer ability and patrol assignment. Our LASSO technique identifies a
group of 45 out of 139 officers who are efficient at vehicle searches in 2006 (i.e., 45 officers
who are best at uncovering illegal items leading to a motorist arrest). Policy-makers might
use our linear probability model and the LASSO to identify a subset of efficient officers for
4 Park, Sickles and Simar (1998) study the asymptotic properties of the LSDV estimators under i.i.d. data. In this paper, we derive the convergence rates of the LSDV estimators under our setup.
performance recognition, for example.
The rest of this paper is organized as follows. The next section introduces the model and
the adaptive LASSO estimator. Section 3 provides some technical assumptions and derives
the oracle property of the estimator. Section 4 discusses the optimization algorithm and tuning parameter selection. Sections 5 and 6 provide simulation and empirical application results, and Section 7 concludes. All the proofs and additional simulation results are given in the
online supplementary material.
2 LASSO for Identifying Efficient Firms
2.1 LSDV estimation
We consider the panel SF model with time-invariant technical inefficiency (e.g., Schmidt and
Sickles, 1984) given as
$$y_{it} = \alpha_0 + x_{it}'\beta_0 + v_{it} - u_{0,i} \quad (1)$$

for $i = 1, ..., N$ and $t = 1, ..., T$, where $y_{it}$ is the logarithm of the scalar output of the $i$th firm in the $t$th period, $\alpha_0$ is a common intercept, $x_{it}$ is the logarithm of a $p \times 1$ input vector, and $\beta_0$ is the corresponding $p \times 1$ parameter vector of marginal effects. The regression equation has two error terms: the first, $v_{it}$, is a two-sided noise with $E[v_{it}|x_{it}, u_{0,i}] = 0$, and the second, $u_{0,i}$, is a time-invariant firm-specific inefficiency, which can be arbitrarily correlated with $x_{it}$. We suppose no cross-sectional dependence, but allow time-series dependence over the errors and covariates. Unlike the standard fixed-effect panel regression models, we restrict $u_{0,i} \ge 0$ for all $i$, but we do not impose any distributional assumption on this inefficiency.
The assumption of time-invariant inefficiency $u_{0,i}$ is somewhat restrictive, especially when we study
panel data over a long period of time. However, as mentioned earlier, if high frequency
data are available and measures of average sample inefficiency are desired, this approach can
still be employed.5 There are SF models that specify time-varying inefficiency with some
restrictive functional forms (e.g., Ahn, Lee and Schmidt, 2007). However, it is unclear how
to apply the LASSO in these cases to model a group of efficient firms.
Another limitation due to the time-invariant inefficiency is that the model (1) cannot
identify marginal effects for time-invariant regressors in xit, so their marginal effects are
absorbed into the inefficiency term. In this case, the interpretation of the firm-specific ineffi-
ciency u0,i can be subtle. There are SF models that include an additional term that accounts
for such time-invariant firm heterogeneity (e.g., Greene, 2005), but these models generally
require distributional assumptions on the noise and inefficiency terms. Our interest in this
paper is to specify a model amenable to semi-parametric estimation, so these approaches are
not appropriate for our case. If firm heterogeneity is more of a concern, practitioners may use
the panel data estimator of Hausman and Taylor (1981) where a set of time-invariant regres-
sors or instruments that are not correlated with fixed effects are used to control individual
heterogeneity. Another method one can adopt to minimize time-invariant heterogeneity is
the within-a-category comparison proposed by Feng and Horrace (2007) where comparisons
of fixed effects are made within groups of relatively homogeneous firms, rather than across
all the firms.
Existing studies estimate (1) using the standard least squares dummy variable (LSDV)
5 Fixed effects are still more generally employed to measure agent-specific effects. Examples include measurements of teacher quality (Rothstein, 2010; Chetty et al., 2014), school value-added (Angrist et al., 2017) and hospital efficacy (Friedson et al., 2019), where the quantities of interest are calculated from some form of fixed effects. Moreover, if high-frequency data for multiple periods (e.g., years) are available, we can partition the data along the time dimension (e.g., by year) and deploy separate high frequency models for each partition, while allowing inefficiency to vary over time.
method. More precisely, we rewrite (1) as
$$y_{it} = \alpha_{0,i} + x_{it}'\beta_0 + v_{it}, \quad (2)$$

where $\alpha_{0,i} = \alpha_0 - u_{0,i}$ is the firm-specific fixed effect. We can consistently estimate $\alpha_{0,i}$ (as $T \to \infty$) and $\beta_0$ (as $N$ or $T \to \infty$) by the standard within estimation, denoting the estimators $\hat{\alpha}_i$ and $\hat{\beta}$, respectively, provided $x_{it}$ does not include any time-invariant variables. In the LSDV approach, the frontier parameter $\alpha_0$ is estimated as

$$\hat{\alpha} = \max_{1 \le i \le N} \hat{\alpha}_i, \quad (3)$$

which can be verified to be consistent for $\alpha_0$ as $(N, T) \to \infty$ under the assumption that the density of $u_{0,i}$ is nonzero in the neighborhood of zero, so that $\min_{1 \le i \le N} u_{0,i} \to 0$ as $N \to \infty$ with probability approaching one (w.p.a.1) and consequently $\max_{1 \le i \le N} \alpha_{0,i} \to \alpha_0$ as $N \to \infty$ (e.g., Greene, 1980; Schmidt and Sickles, 1984). The individual firm inefficiency $u_{0,i}$ is then consistently estimated as

$$\hat{u}_i = \hat{\alpha} - \hat{\alpha}_i.$$

The estimator $\hat{\alpha}$ represents the maximal efficiency in the sample, and we interpret $\hat{u}_i$ as the inefficiency relative to the most efficient firm.
In practice, it is very unlikely that there are ties in the estimates $\hat{u}_i$. For this reason, all firms have strictly positive $\hat{u}_i$ values except the firm estimated as most efficient in the sample. Therefore, this approach has the limitation that it can identify only one (relatively most) efficient firm, even when there are multiple efficient firms with $u_{0,i} = 0$.
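To fix ideas, the LSDV steps in (2)-(3) can be sketched as follows. This is an illustrative sketch rather than the authors' code; the array layout ($y$ of shape $N \times T$, $x$ of shape $N \times T \times p$) and the function name are our own conventions.

```python
import numpy as np

def lsdv_frontier(y, x):
    """LSDV estimation of (2): within estimator for beta, then
    alpha_hat = max_i alpha_hat_i as in (3) and u_hat_i = alpha_hat - alpha_hat_i.

    y : (N, T) array of outputs; x : (N, T, p) array of inputs.
    """
    N, T, p = x.shape
    # within transformation: demean each firm's series over t
    y_w = y - y.mean(axis=1, keepdims=True)
    x_w = x - x.mean(axis=1, keepdims=True)
    beta = np.linalg.lstsq(x_w.reshape(N * T, p),
                           y_w.reshape(N * T), rcond=None)[0]
    alpha_i = (y - x @ beta).mean(axis=1)   # fixed-effect estimates alpha_hat_i
    alpha = alpha_i.max()                   # frontier estimate, equation (3)
    u = alpha - alpha_i                     # relative inefficiencies u_hat_i
    return beta, alpha, u
```

With noise-free data, the within regression recovers $\beta_0$ exactly and each $\hat{u}_i$ equals the gap to the largest fixed effect.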
2.2 Adaptive LASSO estimation
To overcome the aforementioned limitation in the LSDV approach, we instead estimate
(1) using the adaptive least absolute shrinkage and selection operator (adaptive LASSO) method, from which we can identify multiple efficient firms (i.e., all firms whose true $u_{0,i}$ is zero) by shrinking small values of $\hat{u}_i$ toward zero.
To this end, we first assume the following sparsity condition. We let $S = \{i : u_{0,i} = 0\}$ be the index set of efficient firms and $|C|$ denote the cardinality of a set $C$.
Assumption 1 $\delta = |S|/N \to \delta_0 \in (0, 1)$ as $N \to \infty$.
This sparsity assumption implies that $|S|$ firms in the sample are efficient and that the fraction of efficient firms does not vanish as $N \to \infty$, which plays an important role in the asymptotic analysis later. Note that the model (1) becomes the standard fixed-effect SF model when $|S| = 1$ and hence $\delta_0 = 0$; it becomes the neoclassical production model when $|S| = N$ and hence $\delta_0 = 1$. Although we suppose $p = \dim(\beta_0)$ is fixed in this paper, we can also allow $p$ to
increase with N and assume sparsity on β0, under which we can identify nonzero elements of
β0 as well. However, this result is already well-studied (e.g., Belloni, Chernozhukov, Hansen
and Kozbur, 2016; Caner, Han and Lee, 2018), so we focus on shrinkage estimators of u0,i in
this paper.
We let $\hat{\beta}$ be a consistent estimator of $\beta_0$ from (2), such as the standard within estimator:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \sum_{t=1}^{T} (\tilde{y}_{it} - \tilde{x}_{it}'\beta)^2, \quad (4)$$

where $\tilde{y}_{it} = y_{it} - \bar{y}_i$ with $\bar{y}_i = (1/T)\sum_{s=1}^{T} y_{is}$, and similarly for $\tilde{x}_{it}$. After concentrating out
$\beta_0$ in (1), the adaptive LASSO estimator for $\theta_0 = (\alpha_0, u_{0,1}, ..., u_{0,N})'$ is defined as

$$\hat{\theta}(\lambda) = (\hat{\alpha}(\lambda), \hat{u}_1(\lambda), ..., \hat{u}_N(\lambda))' = \arg\min_{\alpha, u_1, ..., u_N;\ u_i \ge 0\ \forall i} \left\{ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} - \alpha + u_i \right)^2 + \lambda \sum_{i=1}^{N} \pi_i u_i \right\}, \quad (5)$$

where $\lambda > 0$ is a tuning parameter and $\{\pi_i\}_{i=1}^{N}$ are data-dependent weights obtained from consistent initial estimates of $u_{0,i}$. In particular, we let $\pi_i = \hat{u}_i^{-\gamma}$ for some $\gamma > 1$, where $\hat{u}_i$ is the LSDV estimator described in the previous section.6 Unlike the original LASSO of Tibshirani (1996), the adaptive LASSO allows for unequal shrinkage across parameters depending on the data-dependent weight $\pi_i$, which yields the oracle property (see Fan and Li, 2001; Zou, 2006). However, it should be emphasized that $\hat{\theta}(\lambda)$ in (5) differs from the standard adaptive LASSO estimator of Zou (2006) since we impose sign restrictions $u_i \ge 0$ for all $i$ on a diverging number of parameters.
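For concreteness, the adaptive weights and the penalized objective in (5) can be written out as below. This is a sketch under our own array conventions; `adaptive_weights` and `lasso_objective` are hypothetical names, and the $1/N$ floor for a zero initial estimate follows footnote 6.

```python
import numpy as np

def adaptive_weights(u_lsdv, gamma=2.0):
    """pi_i = u_hat_i^{-gamma} from the initial LSDV estimates; the firm with
    u_hat_i = 0 is given the small value 1/N (footnote 6) so its weight is finite."""
    u = np.where(u_lsdv == 0, 1.0 / len(u_lsdv), u_lsdv)
    return u ** (-gamma)

def lasso_objective(alpha, u, y, x, beta, lam, pi):
    """Penalized least-squares objective in (5) for a candidate (alpha, u)
    with u_i >= 0. y : (N, T), x : (N, T, p)."""
    resid = y - x @ beta - alpha + u[:, None]
    return (resid ** 2).sum() + lam * (pi * u).sum()
```

A larger weight $\pi_i$ (a smaller initial $\hat{u}_i$) makes the penalty on $u_i$ heavier, which is exactly what pushes near-efficient firms to exact zeros.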
One important remark on (5) is that we estimate α0 and (u0,1, ..., u0,N)′ together in
one step. This is not feasible in the standard fixed-effect SF model because of the perfect
multicollinearity between the constant term and the individual dummies. In contrast, this
one-step estimation is feasible in our case due to the sparsity assumption and L1 penalty
term, which eliminates some of the individual dummies.
The main goal of this estimation is to identify two groups: efficient and inefficient firms.
Therefore, this approach seems similar to Bonhomme and Manresa (2015), who also consider
a latent group structure problem determined by group-specific fixed effects. However, their
methodology relies on minimization of a least squares criterion with respect to all possible
groupings, whereas we use the LASSO technique to identify the latent groups (efficient firms
vs. inefficient firms) under sign restrictions on the fixed effects.
6 Note that in the LSDV estimation, the firm with the largest fixed-effect estimate has a zero inefficiency estimate. For $\hat{u}_i = 0$, we use an arbitrarily small value (e.g., $1/N$) to construct the weight.
The adaptive LASSO problem in (5) is also related to the latent group structure model
by Su, Shi and Phillips (2016),7 or the fused LASSO by Tibshirani, Saunders, Rosset, Zhu
and Knight (2005). They penalize over pairwise-differences among the coefficient values
to produce group identification. However, our problem is different from theirs because we
impose sign restrictions on u0,i and allow that the size of the smallest (non-zero) inefficiency
shrinks to zero at a proper rate. In comparison, Su, Shi and Phillips (2016) assume that the
group-specific parameters in their model are separated from each other by a fixed distance.
3 Oracle Properties
The adaptive LASSO allows for unequal shrinkage for each parameter and results in the oracle property. This oracle property extends to our case under $(N, T)$-asymptotics, where $N$ can grow faster than $T$ when the errors and covariates satisfy proper moment conditions.
We assume the following conditions in our asymptotic analysis. We define

$$\eta = \min_{i \in S^c} u_{0,i} > 0.$$

Assumption 2 (1) (i) $E[v_{it}|x_{it}, u_{0,i}] = 0$ for all $i$ and $t$, and $\{(x_{it}, v_{it}) : t = 1, ..., T\}$ are independent over $i$; (ii) for each $i$, $\{(x_{it}, v_{it}) : t = 1, ..., T\}$ is strong mixing with mixing coefficients $\alpha[t] \le c_\alpha \rho^t$ for some $c_\alpha > 0$ and $\rho \in (0, 1)$; (iii) $\sup_{i \ge 1} \sup_{t \ge 1} E\|x_{it}\|^q < \infty$ and $\sup_{i \ge 1} \sup_{t \ge 1} E|v_{it}|^q < \infty$ for some $q \ge 4$.

(2) As $(N, T) \to \infty$, (i) $\hat{\beta} - \beta_0 = O_p((NT)^{-1/2})$; (ii) $T^{1/2}(\log N)^{-1} \to \infty$ and $N T^{1-q/2} (\log T)^{2q} \to 0$; (iii) $T^{1/2}(\log N)^{-1} \eta \to \infty$.

(3) As $(N, T) \to \infty$, $\lambda T^{-1/2} N^{1/2} \eta^{-\gamma} \to 0$ and $\lambda T^{(\gamma-1)/2} (\log N)^{-\gamma-1} \to \infty$ for some $\gamma > 1$.
7 Kutlu, Tran and Tsionas (2020) apply the shrinkage technique of Su, Shi and Phillips (2016) to parametric SF models to identify groups of firms sharing the same slope parameters.
In Assumption 2-(1), we rule out cross-sectional dependence but allow for time-series dependence and heteroskedasticity in the errors and covariates. In Assumption 2-(1)-(ii) and (iii), we require $(x_{it}, v_{it})$ to be a strong mixing process over $t$ with a geometric decay rate, and further restrict the moments of $\|x_{it}\|$ and $|v_{it}|$ to be finite up to a certain order. The tail restrictions and finite moment conditions allow us to use exponential inequalities for strong mixing processes (e.g., Merlevede, Peligrad and Rio, 2009) to bound misclassification probabilities and achieve selection consistency.8
Assumption 2-(2)-(i) holds for general M-estimators such as the within estimator under $(N, T) \to \infty$. Assumptions 2-(2)-(ii), (iii) and 2-(3) impose rate conditions on $N, T, \eta$ and $\lambda$ to ensure both selection and estimation consistency. Assumption 2-(2)-(ii) allows $N$ to grow much faster than $T$ when $q$ is large (i.e., when the tail probability of the error decays quickly). It therefore covers many panel structures, including short panels. Allowing for large $N$ (i.e., a large market) relative to $T$ is useful in our context since we consider time-invariant inefficiency and assume many efficient firms. The rate conditions also control the magnitude of the tuning parameter $\lambda$, so the LASSO procedure can select the zero coefficients correctly without yielding bias in the nonzero coefficient estimators in the limit.9 The assumption allows the nonzero inefficiencies to be close to zero (i.e., $\eta$ can be very small), but they must shrink slowly enough to be distinguished from the zero coefficients and to remain unaffected by shrinkage estimation.
We first derive the following lemma, which shows that the LSDV estimator of the frontier parameter $\alpha_0$ summarized in Section 2.1 is consistent.10

Lemma 1 Recall that $\hat{\alpha} = \max_{1 \le i \le N} \hat{\alpha}_i$, where $\hat{\alpha}_i$ is the LSDV estimator of $\alpha_{0,i}$ in (2). Then, under Assumptions 1, 2-(1) and 2-(2), as $(N, T) \to \infty$, $\hat{\alpha} - \alpha_0 = O_p\left((\log N)/T^{1/2}\right)$, where $\alpha_0$ is defined in (1).

8 Alternatively, exponential moment conditions can be employed as in Bonhomme and Manresa (2015).
9 In particular, Assumption 2-(3) implies that $\lambda$ should decrease as $N$ increases when $N \gg T$.
10 This lemma serves as a technical lemma to prove the theorems in this paper and also allows us to compare the convergence rate of $\hat{\alpha}$, the LSDV estimator, with that of $\hat{\alpha}(\lambda)$, the LASSO estimator.

This lemma implies $\hat{\alpha}$ is estimated from one of the efficient firms in the sample w.p.a.1 (i.e., $\Pr(\hat{\alpha} = \max_{i \in S} \hat{\alpha}_i) \to 1$ as $(N, T) \to \infty$). The rate in this lemma is identical to that derived in Park, Sickles and Simar (1998), but their result is obtained under i.i.d. data with exponential moment conditions imposed on the errors and covariates.11 The lemma can thus be seen as a generalization of their result.
Now we turn to the LASSO estimators. Let $\hat{S} = \{i : \hat{u}_i(\lambda) = 0\}$. We first establish the selection consistency.

Theorem 1 Suppose Assumptions 1 and 2 hold. Then, $\Pr(\hat{S} = S) \to 1$ as $(N, T) \to \infty$.
Theorem 1 implies that the LASSO consistently identifies the two latent groups provided the rate conditions on $N, T, \eta$ and $\lambda$ are satisfied. This, in turn, implies that in the limit the latent groups can be treated as known (i.e., the oracle information) and used for the estimation of $\alpha_0$ and the inefficiencies to improve their convergence rates.
We introduce the following assumptions and notations for the limiting distributions of
the LASSO estimators.
Assumption 3 (i) There exist positive constants $\sigma^2_{S_1}$, $\sigma^2_{S_2}$, $\sigma_{S_1 S_2}$, $\sigma^2_{S^c}$ and $\sigma^2_i$ for each $i \in S^c$ such that

$$\sigma^2_{S_1} = \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} v_{it} v_{ik}$$

$$\sigma^2_{S_2} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} x_{ik}' \right\} H_0^{-1} \Upsilon_S$$

$$\sigma_{S_1 S_2} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} \right\}$$

$$\sigma^2_{S^c} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{(1-\delta)NT} \sum_{i \in S^c} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} x_{ik}' \right\} H_0^{-1} \Upsilon_S$$

$$\sigma^2_i = \operatorname{plim}_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{T} v_{it} v_{ik},$$

where $\Upsilon_S = \operatorname{plim}_{N,T\to\infty} (\delta NT)^{-1} \sum_{i \in S} \sum_{t=1}^{T} x_{it}$ and $H_0 = \operatorname{plim}_{N,T\to\infty} (NT)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} x_{it} x_{it}' > 0$;

(ii) As $(N, T) \to \infty$, $(\delta NT)^{-1/2} \sum_{i \in S} \sum_{t=1}^{T} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S_1})$, $(\delta NT)^{-1/2} \sum_{i \in S} \sum_{t=1}^{T} \Upsilon_S' H_0^{-1} x_{it} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S_2})$, $((1-\delta)NT)^{-1/2} \sum_{i \in S^c} \sum_{t=1}^{T} \Upsilon_S' H_0^{-1} x_{it} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S^c})$ and $T^{-1/2} \sum_{t=1}^{T} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_i)$ for each $i \in S^c$.

11 Recall that we impose only finite moment conditions for the errors and covariates and allow for time-series dependence.
Theorem 2 Suppose Assumptions 1, 2 and 3 hold. Then, as $(N, T) \to \infty$,

(i) $\sqrt{\delta NT}\,(\hat{\alpha}(\lambda) - \alpha_0) \xrightarrow{d} \mathcal{N}\left(0,\ \sigma^2_{S_1} + \delta^2 \sigma^2_{S_2} - 2\delta \sigma_{S_1 S_2} + \delta(1-\delta) \sigma^2_{S^c}\right)$;

(ii) $\sqrt{T}\,(\hat{u}_i(\lambda) - u_{0,i}) \xrightarrow{d} \mathcal{N}(0, \sigma^2_i)$ for each $i \in S^c$.
Theorem 2 shows that we can efficiently estimate the frontier and the firm-level inefficiency parameters using the LASSO estimator. Combined with Theorem 1, it therefore establishes the oracle property of the adaptive LASSO estimators. It is worth noting that $\hat{\alpha}(\lambda) - \alpha_0 = O_p\left((\delta NT)^{-1/2}\right)$, a much faster rate than that of the LSDV estimator, $\hat{\alpha}$, in Lemma 1. This is quite an intuitive result: the LSDV estimator uses only a single best firm's observations, whereas $\hat{\alpha}(\lambda)$ uses the $|S| \cdot T$ observations of the firms identified as efficient by the LASSO. As long as $\delta$ does not vanish as $N \to \infty$, which is a reasonable assumption for competitive markets, the LASSO estimator will hence be preferred.
4 Computation
4.1 Optimization algorithm
The $L_1$ penalty term in the LASSO objective function has no second derivative at the origin, so we cannot directly apply standard quadratic optimization algorithms such as Newton-Raphson. Many alternative optimization algorithms have been developed: local quadratic approximation (Fan and Li, 2001), least angle regression (Efron, Hastie, Johnstone and Tibshirani, 2004), and the coordinate descent algorithm (Friedman, Hastie and Tibshirani, 2010), among others.
In this section, we derive an efficient coordinate descent algorithm that accounts for the sign restrictions in our model. The algorithm uses preliminary inefficiency ranking information from the initial LSDV estimation, which allows us to skip a large number of irrelevant optimization steps.
The efficiency of our proposed optimization procedure can be understood as follows. In our problem, the Karush-Kuhn-Tucker (KKT) conditions12 for (5) imply that the coordinate descent algorithm boils down to successively updating $\hat{\alpha}(\lambda), \hat{u}_1(\lambda), ..., \hat{u}_N(\lambda)$ based on the following two equations:

$$\hat{\alpha}(\lambda) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} + \hat{u}_i(\lambda) \right)$$

$$\hat{u}_i(\lambda) = \max\left\{ 0,\ \hat{\alpha}(\lambda) - \frac{1}{T} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} \right) - \frac{\lambda}{2T} \pi_i \right\} \quad \text{for } i = 1, ..., N. \quad (6)$$
We note that the ordering of $\hat{\alpha}(\lambda) - \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}'\hat{\beta})$ in (6) follows the ordering of $\hat{u}_i$, since $\hat{u}_i = \hat{\alpha} - \hat{\alpha}_i$ and $\hat{\alpha}_i = \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}'\hat{\beta})$. Moreover, the shrinkage effect from the penalty term in (6), $\frac{\lambda}{2T}\pi_i$, is larger for smaller $\hat{u}_i$ since $\pi_i = \hat{u}_i^{-\gamma}$ with $\gamma > 1$. This implies that, for a given $\lambda$, if $\hat{u}_i \le \hat{u}_j$, then $\hat{u}_i(\lambda) \le \hat{u}_j(\lambda)$. Therefore, we can skip updates for all firms $i$ with $\hat{u}_i \le \hat{u}_j$ (and identify them as efficient firms) once $\hat{u}_j(\lambda)$ shrinks to 0.13 This reduces computational costs significantly when $N$ is large. Our proposed algorithm based on this idea is summarized as follows.

12 See the proof of Theorem 1 in the online supplementary material for more details.
1. Using $\hat{\beta}$ from the initial estimation, let
$$\hat{\alpha}^{(0)}_i = \frac{1}{T} \sum_{t=1}^{T} (y_{it} - x_{it}'\hat{\beta}), \quad \hat{\alpha}^{(0)} = \max_{1 \le i \le N} \hat{\alpha}^{(0)}_i, \quad \text{and} \quad \hat{u}^{(0)}_i = \hat{\alpha}^{(0)} - \hat{\alpha}^{(0)}_i$$
for each $i$. Define order statistics $\hat{\alpha}^{(0)}_{[1]} \le \hat{\alpha}^{(0)}_{[2]} \le \cdots \le \hat{\alpha}^{(0)}_{[N]}$ and $\hat{u}^{(0)}_{[1]} \ge \hat{u}^{(0)}_{[2]} \ge \cdots \ge \hat{u}^{(0)}_{[N]}$, so that $\hat{\alpha}^{(0)}_{[N]} = \max_{1 \le i \le N} \hat{\alpha}^{(0)}_i$ and $\hat{u}^{(0)}_{[N]} = \min_{1 \le i \le N} \hat{u}^{(0)}_i$. In this step, we have only one fully efficient firm, with $\hat{u}_{[N]} = \hat{u}^{(0)}_{[N]} = 0$.

2. For a given $\lambda$, check the KKT condition for the second best firm based on the sign of
$$\Delta_{[N-1]} = \hat{u}^{(0)}_{[N-1]} - \frac{\lambda}{2T} \pi_{[N-1]}.$$
In particular, if $\Delta_{[N-1]} \le 0$, let $\hat{u}^{(1)}_{[N-1]} = 0$ and update $\hat{\alpha}^{(0)}$ as $\hat{\alpha}^{(1)} = (\hat{\alpha}^{(0)}_{[N]} + \hat{\alpha}^{(0)}_{[N-1]})/2$. Using this new frontier parameter estimate $\hat{\alpha}^{(1)}$, update the rest of the inefficiencies as $\hat{u}^{(1)}_{[N-1-j]} = \hat{u}^{(0)}_{[N-1-j]} - (\hat{\alpha}^{(0)} - \hat{\alpha}^{(1)})$ for all $j \le N - 2$. If $\Delta_{[N-1]} > 0$, go to Step 4 below.

3. Sequentially repeat Step 2 for each $\Delta_{[N-k]}$, $k = 2, 3, ..., N-1$, as long as $\Delta_{[N-k]} \le 0$ holds. For each $k$, we let $\hat{u}^{(k)}_{[N-k]} = 0$ and update $\hat{\alpha}^{(k-1)}$ as $\hat{\alpha}^{(k)} = (1/(k+1)) \sum_{j=0}^{k} \hat{\alpha}^{(k-1)}_{[N-j]}$. We also update $\hat{u}^{(k)}_{[N-1-j]} = \hat{u}^{(k-1)}_{[N-1-j]} - (\hat{\alpha}^{(k-1)} - \hat{\alpha}^{(k)})$ for all $k \le j \le N - 2$.

4. If $\Delta_{[N-k]} > 0$ at some $k \ge 1$, update the non-zero inefficiencies (i.e., $\hat{u}_{[N-j]} > 0$ for $k \le j \le N - 1$) as $\hat{u}^{(k)}_{[N-j]} = \hat{u}^{(k-1)}_{[N-j]} - \frac{\lambda}{2T} \pi_{[N-j]}$ for all $k \le j \le N - 1$, and then report the results.14

13 This reasoning readily applies to the balanced panel case. For an unbalanced panel, the shrinkage effect is $(\lambda/2T_i)\pi_i$, where $T_i$ is the number of time periods for firm $i$, which does not necessarily preserve the ordering of $\hat{u}_i$. In this case, the standard coordinate descent algorithm based on the two equations above can be used.
This coordinate descent algorithm exploits the convexity of the objective function and the preliminary inefficiency ranking at the same time, which enables us to reach the minimum of the objective function quickly.
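Steps 1-4 above can be sketched as follows, assuming a balanced panel and taking the LSDV fixed-effect estimates $\hat{\alpha}_i$ as input. The function and variable names are ours, and the sketch performs a single ranking-based forward pass rather than iterating the two updates in (6).

```python
import numpy as np

def adaptive_lasso_frontier(alpha_hat, lam, T, gamma=2.0):
    """Ranking-based coordinate descent (Steps 1-4) for (5)-(6), balanced panel.

    alpha_hat : (N,) LSDV fixed-effect estimates alpha_hat_i.
    Returns (alpha_lasso, u_lasso); efficient firms get u_i(lambda) exactly 0.
    """
    N = len(alpha_hat)
    order = np.argsort(alpha_hat)          # ascending; order[-1] is the best firm
    a = alpha_hat[order]
    front = a[-1]                          # Step 1: initial frontier max_i alpha_hat_i
    u = front - a                          # initial LSDV inefficiencies; u[-1] = 0
    # adaptive weights pi_i = u_i^{-gamma}; the zero estimate gets 1/N (footnote 6)
    pi = np.where(u == 0, 1.0 / N, u) ** (-gamma)

    k = 0                                  # number of extra firms declared efficient
    while k < N - 1:                       # Steps 2-3: absorb firms while KKT allows
        j = N - 2 - k                      # position of the next-best candidate firm
        if u[j] - lam * pi[j] / (2 * T) > 0:
            break                          # Delta_[N-k] > 0: stop absorbing
        k += 1
        new_front = a[N - 1 - k:].mean()   # average alpha over the efficient set
        u[N - 1 - k:] = 0.0
        u[:N - 1 - k] -= front - new_front # shift the remaining inefficiencies
        front = new_front
    # Step 4: soft-threshold the remaining strictly positive inefficiencies
    pos = u > 0
    u[pos] = np.maximum(u[pos] - lam * pi[pos] / (2 * T), 0.0)

    u_out = np.empty(N)
    u_out[order] = u                       # map back to the original firm ordering
    return front, u_out
```

Because the weights are fixed at their initial values and the candidates are visited in rank order, each firm is touched at most once, which is the source of the computational savings described above.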
4.2 Tuning parameter choice
The performance of the adaptive LASSO estimator relies on an appropriate choice of the tuning parameter $\lambda$. Methods based on cross-validation and AIC-type criteria are known to result in over-selection (i.e., too many nonzero estimates), which in our context means under-selection of the efficient firms. Wang, Li and Tsai (2007) instead propose a tuning parameter choice based on a BIC-type criterion, which is shown to consistently estimate the correct model when it exists.
We consider a BIC-type criterion for the choice of $\lambda$, given by

$$\lambda^* = \arg\min_{\lambda}\ \log \hat{\sigma}^2(\lambda) + \frac{\phi_{NT}}{NT} |\hat{S}^c(\lambda)|, \quad (7)$$

where $\phi_{NT}$ is a sequence increasing with $N$ or $T$, $\hat{S}^c(\lambda) = \{i : \hat{u}_i(\lambda) > 0\}$, and

$$\hat{\sigma}^2(\lambda) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} - \hat{\alpha}(\lambda) + \hat{u}_i(\lambda) \right)^2$$

from (5) for a fixed $\lambda$. The following theorem proves that selection consistency still holds with the tuning parameter chosen by (7).

14 In fact, directly minimizing (5) using any typical method shrinks not only $\hat{u}_i(\lambda)$ but also $\hat{\alpha}(\lambda)$. However, this is an undesirable shrinkage bias, which may slow down the convergence of $\hat{\alpha}(\lambda)$, particularly when $N$ is large (Equation A.6 in the online supplementary material gives the explicit form of this bias). Therefore, in the spirit of post-LASSO estimation (e.g., Belloni and Chernozhukov, 2013), our algorithm skips the steps that induce such a shrinkage effect on $\hat{\alpha}(\lambda)$, achieving smaller finite sample bias for $\hat{\alpha}(\lambda)$. This omission does not alter any of the asymptotic results in Section 3.
Theorem 3 Suppose Assumptions 1 and 2 hold, and that $(\phi_{NT}/T)^{1/2}\, \eta^{-1} \to 0$ and $\phi_{NT}/(\log N)^2 \to \infty$. Then, as $(N, T) \to \infty$, $\Pr(\hat{S}(\lambda^*) = S) \to 1$, where $\lambda^*$ is given in (7).
Theorem 3 indicates that when $\phi_{NT}$ grows at an appropriate rate, we can consistently identify the true set of efficient firms using the tuning parameter chosen by (7). In particular, the conditions $(\phi_{NT}/T)^{1/2}\, \eta^{-1} \to 0$ and $\phi_{NT}/(\log N)^2 \to \infty$ ensure that the probabilities of under-fitting (i.e., some non-zero inefficiencies estimated as zero) and over-fitting (i.e., some zero inefficiencies estimated as non-zero), respectively, vanish asymptotically. The choice of $\phi_{NT}$ is crucial in practice to control such under- and over-fitting probabilities. From our simulations, we found that $0.1(\log N)^2 c_{NT}$ with $c_{NT} = \log(\log(NT/(N+T)))$ works well for various panel structures.15 We use it for our simulations and empirical application.
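As an illustration, the criterion (7) with the suggested $\phi_{NT}$ can be computed from per-$\lambda$ fit summaries as below; the function names are ours, and the residual variances and nonzero counts are assumed to come from LASSO fits at each grid point.

```python
import numpy as np

def phi_NT(N, T, nu=0.1):
    """Suggested penalty weight: 0.1 (log N)^2 c_NT with
    c_NT = log(log(NT/(N+T))), as in Section 4.2."""
    c_NT = np.log(np.log(N * T / (N + T)))
    return nu * np.log(N) ** 2 * c_NT

def bic_select(lam_grid, sigma2_grid, n_pos_grid, N, T):
    """Criterion (7): log sigma_hat^2(lam) + phi_NT/(NT) * |S_hat^c(lam)|.

    sigma2_grid : residual variance sigma_hat^2(lam) for each lambda
    n_pos_grid  : number of firms with u_hat_i(lam) > 0 for each lambda
    """
    crit = np.log(np.asarray(sigma2_grid)) \
         + phi_NT(N, T) / (N * T) * np.asarray(n_pos_grid)
    return lam_grid[int(np.argmin(crit))]
```

The first term rewards fit while the second penalizes each firm declared inefficient, so too small a $\lambda$ (many nonzero $\hat{u}_i$) is penalized through the count $|\hat{S}^c(\lambda)|$.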
5 Simulations
In this section, we study the finite sample performance of the LASSO estimator. We consider
the model (1) with $\alpha_0 = 1$, $\beta_0 = (1, 1, 1, 1)'$, $x_{it} \sim$ i.i.d. $\mathcal{N}(0, \Sigma)$, where the $(j_1, j_2)$th element of $\Sigma$ is $0.5^{|j_1 - j_2|}$ for $j_1, j_2 = 1, ..., 4$, and $\delta = 0.3$ (i.e., 30% of firms in the sample are fully efficient).16 The two-sided error $v_{it}$ is conditionally heteroskedastic and serially correlated such that $v_{it} = 0.25 v_{i,t-1} + \omega_{it}$ for $t = 2, ..., T$ and $v_{i1} = \omega_{i1}$, where $\omega_{it}|x_{it} \sim$ i.i.d. $\mathcal{N}(0, \sigma^2_{it})$ with $\sigma_{it} = 0.45$ if $\sum_{j=1}^{4} x_{itj} < 0$ and $\sigma_{it} = 1.45$ otherwise.17

15 A $\phi_{NT}$ that satisfies the rate conditions can take the form $\nu(\log N)^2 c_{NT}$, where $\nu$ is some positive constant that gives flexibility to control the degree of penalization in the criterion (similar to ERIC by Hui, Warton and Foster (2015)) and $c_{NT}$ is a diverging sequence whose rate of divergence is arbitrarily slow. Note that $c_{NT} = \log(\log(NT/(N+T))) \approx \min\{\log(\log N), \log(\log T)\}$ in our case. We also experimented with other types of selection criteria in the simulation study, including ERIC and $IC_{p1}$ by Bai and Ng (2002), and found that (7) performs best in this panel SF model.
In every simulation, each nonzero individual inefficiency u0,i is generated independently
as max{0.01, u}, where u is drawn from an exponential distribution with density (1/σu)e^{−u/σu} for some σu > 0;
the trimming ensures that all draws are strictly positive. We experiment with σu ∈ {1, 2, 4}.
Note that as σu gets smaller, the probability of small inefficiency draws gets higher, making
it more difficult for the LASSO to distinguish them from zero. This is particularly
difficult when T is small.18 Figure 1 shows the density of the inefficiency u0,i for each
σu value (left panel) and an example of draws from each case (right panel). We
can clearly see that the inefficiencies have high density near zero when σu = 1. For the penalty
function, we set γ = 2, and λ is selected by (7) from a grid search over 250 evenly spaced
points between 10^{−4} and 10T.19 We simulate each case 1,000 times for the combinations of
N ∈ {100, 200, 400, 1000} and T ∈ {10, 30, 50, 70}.
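The data-generating process above can be sketched in a few lines. This is our own illustrative reconstruction (the function name, seed, and the rule assigning which firms are efficient are assumptions), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(N=100, T=10, delta=0.3, sigma_u=1.0):
    """Sketch of the simulation design: correlated regressors, an AR(1),
    conditionally heteroskedastic two-sided error v_it, and
    trimmed-exponential inefficiencies u_{0,i}."""
    # Regressors: 4 correlated normals with Sigma[j1, j2] = 0.5**|j1 - j2|
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
    x = rng.multivariate_normal(np.zeros(4), Sigma, size=(N, T))
    # sigma_it = 0.45 if sum_j x_itj < 0, else 1.45
    sig = np.where(x.sum(axis=2) < 0, 0.45, 1.45)
    omega = rng.standard_normal((N, T)) * sig
    v = np.empty((N, T))
    v[:, 0] = omega[:, 0]
    for t in range(1, T):                    # v_it = 0.25 v_{i,t-1} + w_it
        v[:, t] = 0.25 * v[:, t - 1] + omega[:, t]
    # A fraction delta of firms is fully efficient (u_{0,i} = 0); the rest
    # draw u_{0,i} from Exp(sigma_u), trimmed below at 0.01.
    u0 = np.zeros(N)
    inefficient = rng.permutation(N)[int(delta * N):]
    u0[inefficient] = np.maximum(0.01, rng.exponential(sigma_u, inefficient.size))
    beta0, alpha0 = np.ones(4), 1.0
    y = alpha0 - u0[:, None] + x @ beta0 + v
    return y, x, u0

y, x, u0 = simulate_panel()
```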
[=== Figure 1 is here ===]
First, Table 1 reports and compares the results from the adaptive LASSO estimation
in (5) and the conventional LSDV approach described in Section 2.1. In particular, we
report the root mean squared errors (RMSE) of ULASSO = (u1(λ∗), ..., uN(λ∗))′ and ULSDV =
(u1, ..., uN)′; point estimates of α0 from αLASSO = α(λ∗) and αLSDV = α (= max1≤i≤N αi);
and the sample correlations between the ranking of U0,Sc (i.e., the nonzero inefficiencies) and the
rankings of their counterpart estimates ULASSO,Sc and ULSDV,Sc for a given S.20

16 Additional simulation results for δ ∈ {0.1, 0.9} are in the online supplementary material. As δ decreases, the finite sample performance of the LASSO estimators deteriorates, but we still observe notable improvements from the LASSO estimation compared to the LSDV.
17 The variances of ωit were chosen so that the overall variance of vit is approximately one.
18 In this case, the rate conditions on η in Assumption 2-(2)-(iii) and 2-(3) are likely to be violated.
19 We are free to choose the value of γ as long as it satisfies the rate conditions in Assumption 2-(3). From the asymptotic analysis, we can see that setting a higher value for γ ensures that the LASSO estimates zero coefficients as zero, but it also increases the probability of estimating (small) nonzero coefficients as zero. Therefore, in practice γ should be determined in light of this trade-off.
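The two accuracy measures can be computed as follows. This is a minimal sketch with toy inputs (the function names are ours):

```python
import numpy as np

def rmse(u_hat, u0):
    """Root mean squared error over all N inefficiency estimates."""
    return float(np.sqrt(np.mean((u_hat - u0) ** 2)))

def rank_corr_nonzero(u_hat, u0):
    """Rank correlation computed only over firms whose true inefficiency
    is nonzero (the set S^c), as in the Table 1 columns."""
    mask = u0 > 0
    a = np.argsort(np.argsort(u_hat[mask]))  # ranks of the estimates
    b = np.argsort(np.argsort(u0[mask]))     # ranks of the true values
    return float(np.corrcoef(a, b)[0, 1])

u0 = np.array([0.0, 0.0, 0.5, 1.2, 2.0])
u_hat = np.array([0.0, 0.1, 0.6, 1.0, 2.5])
print(rmse(u_hat, u0), rank_corr_nonzero(u_hat, u0))
```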
[=== Table 1 here ===]
The LASSO notably outperforms the LSDV in terms of the RMSE of U in all cases. Note
that as N increases, the RMSE of ULASSO decreases but that of ULSDV increases, leading
to a larger disparity between the two methods.21 When N = 1000, the RMSE of ULSDV is
almost three times that of ULASSO. As the asymptotic analysis implies, this is
mainly because of the faster convergence of αLASSO to the true value.
As the means and variances of αLASSO and αLSDV in Table 1 show, the distribution
of αLASSO is centered much closer to the true value (α0 = 1) than that of αLSDV, even
when T and σu are small, and the bias and variance of αLASSO decrease quickly as N or
T increases. In addition, the max operator that αLSDV uses to estimate α0 tends to pick
the most biased individual intercept estimate. Therefore, in the presence of multiple zero-inefficiency firms, the max operator produces a biased estimate of α0, which, in turn, leads
to bias in estimating the inefficiencies u0,i.22
The LASSO and the LSDV show similar rank correlation results, although the LASSO
appears to preserve the original ranking better than the LSDV when T and σu are small. This is
when there is large uncertainty in the inefficiency estimates, and the LASSO improves
ranking accuracy by estimating statistically indistinguishable small inefficiencies as zero.
20 More precisely, the entries in Table 1 (and also those in Table 2) are the average values for each measure over 1,000 replications, with the corresponding standard deviations in parentheses. Rank correlations are computed only among the inefficiencies whose true values are nonzero, that is, corr(R(U0,Sc), R(ULASSO,Sc)) and similarly for the LSDV, where R(·) is a mapping from estimates to rankings.
21 When N is very large (e.g., 1000), however, we find that the RMSE of ULASSO starts to increase. This is related to the form of φNT in (7). The impact of φNT on the selection performance, and consequently on the estimation of α and ui, is discussed below when we examine the selection accuracy of the LASSO estimation.
22 Wang and Schmidt (2009) also document the "upward" bias of LSDV estimators using simulations.
Second, Table 2 presents the selection accuracy of the LASSO estimation. In particular,
we report the probability of yielding a zero estimate for i ∈ S, PS = Pr(i ∈ Ŝ | i ∈ S); the
probability of yielding a nonzero estimate for i ∈ Sc, PSc = Pr(i ∈ Ŝc | i ∈ Sc); the estimated
proportion of efficient firms, δ̂; and the maximum of the u0,i whose true values are nonzero
but which are estimated as zero, representing the degree of misclassification (i.e., max_{i∈Ŝ∩Sc} u0,i; Max-miss).
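The four diagnostics just defined can be sketched as follows (the function name and the toy vectors are ours):

```python
import numpy as np

def selection_metrics(u_hat, u0):
    """Sketch of the Table 2 diagnostics: P_S (zero estimate given truly
    efficient), P_Sc (nonzero estimate given truly inefficient), the
    estimated efficient share delta_hat, and Max-miss (largest true
    inefficiency among firms misclassified as efficient)."""
    S, Sc = u0 == 0, u0 > 0
    in_Shat = u_hat == 0
    P_S = np.mean(in_Shat[S])        # Pr(i in S_hat | i in S)
    P_Sc = np.mean(~in_Shat[Sc])     # Pr(i in S_hat^c | i in S^c)
    delta_hat = np.mean(in_Shat)
    missed = u0[Sc & in_Shat]        # truly inefficient, estimated as zero
    max_miss = float(missed.max()) if missed.size else 0.0
    return P_S, P_Sc, delta_hat, max_miss

u0 = np.array([0.0, 0.0, 0.05, 0.8, 1.5])
u_hat = np.array([0.0, 0.0, 0.0, 0.7, 1.6])
print(selection_metrics(u_hat, u0))  # (1.0, 2/3, 0.6, 0.05)
```

In this toy example, the only firm misclassified as efficient is the one with near-zero true inefficiency (0.05), mirroring the pattern discussed below for Max-miss.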
[=== Table 2 here ===]
Both PS and PSc improve as T increases, but PSc decreases as N increases while PS
increases. The trade-off between PS and PSc as N increases is related to the form of
φNT in (7). Theorem 3 implies that φNT should grow faster than (log N)^2, which ensures
that the LASSO estimates (a diverging number of) zero coefficients as zero as N increases,
but at the same time the smallest inefficiency, η, should be sufficiently large not to
adversely increase the probability of estimating (small) nonzero coefficients as zero. In our
simulations, we allow for small inefficiencies, so the trade-off between PS and PSc is apparent
as N increases. This is particularly true when σu is small.23 However, note that most of
the inefficient firms incorrectly estimated as efficient (i.e., those with zero inefficiency
estimates) have near-zero inefficiency. The small values of Max-miss in Table 2 imply
that only firms with near-zero inefficiency could be incorrectly categorized as fully efficient by
the LASSO procedure.
More importantly, it is impressive that even when T is small, including (N, T) =
(1000, 10), δ̂ is quite close to the true proportion δ = 0.3 as long as σu is large. This
has an important implication: our approach can be used even for short panels, as long
as there are not too many firms with near-zero inefficiency. Hence, in practice, information on
the variance of u0,i is important when deciding whether to use the proposed LASSO approach. Cai,
Horrace and Lee (2021) study nonparametric identification of σu in the panel setup, where
it is allowed to be conditionally heteroskedastic.

23 The degree and pattern of this trade-off apparently depend on the choice of φNT, which ultimately affects the estimation of α and ui. Therefore, as with γ, in practice φNT should be chosen in light of this trade-off. However, we find that a φNT that is optimal over a wide range of N is difficult to find (e.g., our φNT appears to grow rather quickly as N increases, leading to an underestimation of α when N = 1000). The optimal choice of φNT is left for future research.
6 Empirical Application: Police Vehicle Search Efficiency in Syracuse, New York
In this section, we consider a police chief who selects a group of the best officers for annual
evaluation based on how successfully they carried out vehicle searches throughout the year.
The idea is that officers perform a cost-benefit analysis when deciding whether to search the vehicle
of a stopped motorist. The costs of a search are the opportunity cost of their time and effort
and the potential cost of being targeted for a "wrongful search" when the search fails to
uncover illegal items (contraband). The benefit of a successful search (one that uncovers
contraband) is the arrest of the motorist. Specifically, we model success rates (i.e. hit rates)
conditional on a search of a stopped vehicle among officers using a linear probability model,
and use the officer fixed effects in the model to calculate officer-specific success rates (i.e.,
search efficiency). We include several police productivity inputs and dummy variables to
control for heterogeneity due to differing levels of police experience and the location and time
of the vehicle searches.24
We use the panel of discretionary vehicle search activity by officer from 2006 to 2009
in the City of Syracuse, NY, which was previously analyzed by Horrace and Rohlin (2016)
in a different context. We use the data for 2006 only and exclude officers whose total
number of vehicle searches in the year was less than five. We also exclude stops made in
census tracts with fewer than five observations. Our final sample includes 139 field officers
and 2,863 observations (i.e. searches). Note that our sample is an unbalanced panel in which
each officer made a different number of searches, Ti, over the period.25

24 Defining police productivity by the success rate has a limitation: officers with a higher standard for guilt tend to have a higher success rate, since they would search only vehicles with a high probability of carrying contraband. We may consider a composite measure that accounts for both the quantity and quality of searches, which is left for future research.
The linear probability model is specified as follows:
$$\Pr(arrest_{it} = 1 \mid \alpha_0, x_{it}, z_i, u_{0,i}) = \alpha_0 + x_{it}'\beta_{0,1} + z_i'\beta_{0,2} - u_{0,i}$$
for i = 1, ..., 139 and t = 1, ..., Ti, where arrest_it is the binary outcome variable for officer i
at time t,26 which is 1 if the search results in an arrest of the motorist and 0 otherwise. The
explanatory variables, xit and zi, are time-varying and time-invariant, respectively; xit is
allowed to be correlated with u0,i, but zi is assumed to be strictly exogenous. As discussed
in Section 2.1, zi is included here to control for an important source of time-invariant heterogeneity among
officers: experience. Police Experience measures the contemporaneous experience level of
each officer, based on years of employment on the force.27 We could consider a measure
of experience based on the cumulative number of stops made by each officer. However, this
measure may be endogenous to the probability that a vehicle search leads to the discovery
of contraband (and an arrest). We instead use officer start date as a proxy for experience,
which is plausibly exogenous. Although zi is removed in the within estimation, β0,2 can
be estimated in the between equation.28 To capture possible nonlinearity in the relationship
25 We use the average number of searches among officers for the T in the BIC criterion (7).
26 In the data, we identify the exact time (hh:mm:ss) of each stop.
27 Note that our analysis is for one year and Experience is a yearly variable, so it is time-invariant in this analysis.
28 The between equation is ȳi· = α0 + zi′β0,2 + ςi, where ȳi· = (1/Ti) ∑_{t=1}^{Ti} (arrest_it − x_it′β̂1,LSDV) and ςi is the regression error that contains u0,i and the original two-sided error. This regression is valid as long as zi is exogenous to u0,i and the original two-sided error.
between Experience and the arrest rate, we include a third-order polynomial in zi. The time-varying explanatory variables, xit, include variables that control for other dimensions of
heterogeneity in search activity: motorist Youth; Dispersion and Scale of stop activity at
the officer level; and Census × Shift and Season dummies. Youth is a dummy for drivers under
25, and Dispersion and Scale are constructed from monthly stop activity to account
for police heterogeneity due to the different types of duties assigned to officers. Dispersion
measures the spatial dispersion of each officer's stop activity and Scale measures the
intensity with which officers perform duties; they are defined as
intensity with which officers perform duties, which are defined as
Dispersionit =
J∑j=1
(STijt∑Jj=1 STijt
)2−1
and Scaleit =
∑Jj=1 STijt∑N
k=1
∑Jj=1 STkjt
where STijt is the number of stops in census tract j in the month of t by officer i, and J
is the total number of census tracts. These variables address potential selectivity and
heterogeneity arising from the way in which the police chief assigns officers to specific
duties in specific parts of the city. For example, officers who tend to make more stops (ceteris
paribus) may be assigned to parts of the city where performing many stops is optimal from
the perspective of improving arrest rates. Census × Shift includes 99 dummies for different
combinations of census tracts and three work shifts (7am-3pm, 3pm-11pm and 11pm-7am),
and Season includes dummies for spring, summer and fall.
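A minimal sketch of these two constructed controls (the function name and the toy stop counts are ours): Dispersion_i is the inverse Herfindahl index of an officer's stop shares across tracts within a month, and Scale_i is the officer's share of all stops that month.

```python
import numpy as np

def dispersion_and_scale(ST):
    """ST is an (N, J) array of monthly stop counts, officers in rows and
    census tracts in columns, for a fixed month t."""
    row_tot = ST.sum(axis=1, keepdims=True)       # sum_j ST_ijt per officer
    shares = ST / row_tot                         # ST_ijt / sum_j ST_ijt
    dispersion = 1.0 / (shares ** 2).sum(axis=1)  # (sum_j share^2)^(-1)
    scale = ST.sum(axis=1) / ST.sum()             # officer share of all stops
    return dispersion, scale

ST = np.array([[4, 4, 0],    # officer spread evenly over two tracts
               [8, 0, 0]])   # officer concentrated in one tract
d, s = dispersion_and_scale(ST)
print(d, s)  # d = [2., 1.], s = [0.5, 0.5]
```

The spread-out officer gets Dispersion 2 (effectively active in two tracts) while the concentrated one gets 1, matching the inverse-Herfindahl interpretation.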
We first present the estimation results for β0,1 and β0,2.
[=== Table 3 and Figure 2 are here ===]
The estimates of β0,1 in Table 3 are intuitive. The negative coefficient on Youth implies
that the arrest rate for young motorists is on average lower than for older motorists, which implies
that officers searched young motorists without success (i.e., an arrest) more frequently
than older motorists. This may be interpreted as bias against young motorists, but
it may simply be that the signal of guilt that officers receive from young motorists
is noisier. Dispersion is positive but statistically insignificant; the positive estimate may
indicate that officers who carry out duties over a larger area obtain additional learning
opportunities, which enhances their ability to detect crime. Scale is negative and significant,
which can be seen as a result of a quality-quantity trade-off in search activity.
Figure 2 depicts the change in the arrest rate by police experience, with 95% confidence
intervals. The arrest rate improves as years of employment increase until around the tenth year, and
then decreases. Vehicle searches involve prediction tasks regarding the likelihood of arrest,
which may improve over time through learning-by-doing, but the inverted-U-shaped curve
implies that learning in policing is not a constant accumulation but involves a degradation
after a certain period.
We now turn to our results on police search efficiency. The LASSO estimates 32.4%
of officers (45 out of 139) as efficient. The distribution of the inefficiencies
is reported in Figure 3, where the blue histogram represents the distribution of the
inefficiencies from the conventional LSDV approach and the yellow one represents that from
the LASSO, with 32.4% of the mass concentrated at zero.29 The distribution of the
inefficiencies from the conventional LSDV looks bimodal, with two
peaks at around 0.2 and 0.6. It appears that the LASSO shrinks the inefficient mass of officers
at the first peak towards zero inefficiency, implying that this mass of officers is equally
efficient. After the LASSO application, the density of the inefficiencies becomes
more similar to the half-normal or exponential distributions that are typically assumed in
parametric SF models.

29 The black and red dotted lines are (kernel-smoothed) density functions estimated by the default ksdensity function in Matlab.
[=== Figure 3 is here ===]
The single-year analysis can be extended to a multi-year analysis in a straightforward way:
we can deploy separate single-year models for each year, allowing inefficiency to vary
across years. Therefore, if high-frequency data for multiple years are available, our approach
can allow for time-varying inefficiency and identify a group of efficient agents for each year.
7 Conclusion
We proposed an adaptive LASSO estimator to identify a group of efficient firms in the
panel stochastic frontier model. The method is particularly useful when the market size
is large, and we showed that it outperforms the conventional LSDV-based approach in many
respects. More generally, whenever a panel linear regression model has individual
fixed effects whose ranking contains important information, our
approach can identify a subset of the best (or worst) effects. This type of "best and the
rest" classification can hence be used as an adaptive sample-splitting method. The empirical
application demonstrates the practical value of the proposed method.
References
Ahn, S. C., Lee, Y. H. and Schmidt, P. (2007), ‘Stochastic frontier models with multiple
time-varying individual effects’, Journal of Productivity Analysis 27(1), 1–12.
Aigner, D., Lovell, C. and Schmidt, P. (1977), ‘Formulation and estimation of stochastic
frontier production function models’, Journal of Econometrics 6, 21–37.
Angrist, J. D., Hull, P. D., Pathak, P. A. and Walters, C. R. (2017), 'Leveraging lotteries
for school value-added: Testing and estimation', The Quarterly Journal of Economics
132(2), 871–919.
Bai, J. and Ng, S. (2002), ‘Determining the number of factors in approximate factor models’,
Econometrica 70(1), 191–221.
Belloni, A. and Chernozhukov, V. (2013), ‘Least squares after model selection in high-
dimensional sparse models’, Bernoulli 19(2), 521–547.
Belloni, A., Chernozhukov, V., Hansen, C. and Kozbur, D. (2016), ‘Inference in high-
dimensional panel models with an application to gun control’, Journal of Business &
Economic Statistics 34(4), 590–605.
Bonhomme, S. and Manresa, E. (2015), ‘Grouped patterns of heterogeneity in panel data’,
Econometrica 83(3), 1147–1184.
Cai, J., Horrace, W. C. and Lee, Y. (2021), Panel nonparametric conditional heteroskedastic
frontiers with application to CO2 emissions. Working paper.
Caner, M., Han, X. and Lee, Y. (2018), ‘Adaptive elastic net gmm estimation with many in-
valid moment conditions: Simultaneous model and moment selection’, Journal of Business
& Economic Statistics 36(1), 24–36.
Chetty, R., Friedman, J. N. and Rockoff, J. E. (2014), 'Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates', American Economic Review
104(9), 2593–2632.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), ‘Least angle regression’, The
Annals of Statistics 32(2), 407–451.
Fan, J. and Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its
oracle properties’, Journal of the American Statistical Association 96, 1348–1360.
Feng, Q. and Horrace, W. C. (2007), ‘Fixed-effect estimation of technical efficiency with
time-invariant dummies’, Economics Letters 95(2), 247–252.
Flores-Lagunes, A., Horrace, W. C. and Schnier, K. E. (2007), ‘Identifying technically effi-
cient fishing vessels: a non-empty, minimal subset approach’, Journal of Applied Econo-
metrics 22(4), 729–745.
Friedman, J., Hastie, T. and Tibshirani, R. (2010), 'Regularization paths for generalized
linear models via coordinate descent', Journal of Statistical Software 33(1), 1–22.
Friedson, A. I., Horrace, W. C. and Marier, A. F. (2019), ‘So many hospitals, so little infor-
mation: How hospital value-based purchasing is a game of chance’, Southern Economic
Journal 86(2), 773–799.
Greene, W. H. (1980), ‘Maximum likelihood estimation of econometric frontier functions’,
Journal of Econometrics 13(1), 27–56.
Greene, W. H. (2005), ‘Reconsidering heterogeneity in panel data estimators of the stochastic
frontier model’, Journal of Econometrics 126(2), 269–303.
Hausman, J. A. and Taylor, W. E. (1981), ‘Panel data and unobservable individual effects’,
Econometrica 49(6), 1377–1398.
Horrace, W. C. (2005), ‘On ranking and selection from independent truncated normal dis-
tributions’, Journal of Econometrics 126(2), 335–354.
Horrace, W. C. and Rohlin, S. M. (2016), 'How dark is dark? Bright lights, big city, racial
profiling', The Review of Economics and Statistics 98(2), 226–232.
Horrace, W. C. and Schmidt, P. (1996), ‘Confidence statements for efficiency estimates from
stochastic frontier models’, Journal of Productivity Analysis 7(2), 257–282.
Horrace, W. C. and Schmidt, P. (2000), ‘Multiple comparisons with the best, with economic
applications’, Journal of Applied Econometrics 15(1), 1–26.
Hua, Q., Zeng, P. and Lin, L. (2015), ‘The dual and degrees of freedom of linearly constrained
generalized lasso’, Computational Statistics and Data Analysis 86, 13–26.
Hui, F. K. C., Warton, D. I. and Foster, S. D. (2015), 'Tuning parameter selection for the
adaptive lasso using ERIC', Journal of the American Statistical Association 110(509), 262–
269.
Jondrow, J., Lovell, C., Materov, I. S. and Schmidt, P. (1982), ‘On the estimation of technical
inefficiency in the stochastic frontier production function model’, Journal of Econometrics
19(2-3), 233–238.
Kumbhakar, S. C., Parmeter, C. F. and Tsionas, E. G. (2013), ‘A zero inefficiency stochastic
frontier model’, Journal of Econometrics 172(1), 66–76.
Kutlu, L., Tran, K. C. and Tsionas, M. G. (2020), ‘Unknown latent structure and inefficiency
in panel stochastic frontier models’, Journal of Productivity Analysis 54, 75–86.
Merlevede, F., Peligrad, M. and Rio, E. (2009), 'Bernstein inequality and moderate deviations under strong mixing conditions', in High Dimensional Probability V, IMS Collections, Institute of Mathematical Statistics, pp. 273–292.
Park, B., Sickles, R. and Simar, L. (1998), ‘Stochastic panel frontiers: A semiparametric
approach’, Journal of Econometrics 84(2), 273–301.
Rho, S. and Schmidt, P. (2015), ‘Are all firms inefficient?’, Journal of Productivity Analysis
43(3), 327–349.
Rothstein, J. (2010), ‘Teacher quality in educational production: Tracking, decay, and stu-
dent achievement’, The Quarterly Journal of Economics 125(1), 175–214.
Schmidt, P. and Sickles, R. C. (1984), ‘Production frontiers and panel data’, Journal of
Business and Economic Statistics 2(4), 367–374.
Simar, L. and Wilson, P. W. (2009), ‘Inferences from cross-sectional, stochastic frontier
models’, Econometric Reviews 29(1), 62–98.
Su, L., Shi, Z. and Phillips, P. C. B. (2016), ‘Identifying latent structures in panel data’,
Econometrica 84(6), 2215–2264.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal
Statistical Society. Series B (Methodological) 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005), ‘Sparsity and
smoothness via the fused lasso’, Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 67(1), 91–108.
Wang, H., Li, R. and Tsai, C.-L. (2007), 'Tuning parameter selectors for the smoothly clipped
absolute deviation method', Biometrika 94(3), 553–568.
Wang, W. S. and Schmidt, P. (2009), ‘On the distribution of estimated technical efficiency
in stochastic frontier models’, Journal of Econometrics 148(1), 36–45.
Wheat, P., Greene, W. and Smith, A. (2014), ‘Understanding prediction intervals for firm
specific inefficiency scores from parametric stochastic frontier models’, Journal of Produc-
tivity Analysis 42(1), 55–65.
Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American
Statistical Association 101(476), 1418–1429.
Figures and Tables
Table 1: Estimation Accuracy
                         RMSE                  Point estimate (α0 = 1)    Rank correlation
(N, T)       σu    ULASSO     ULSDV        αLASSO     αLSDV           LASSO     LSDV
(100, 10)    1     0.4222     0.9497       1.210      1.862           0.87      0.85
                   (0.1495)   (0.1787)     (0.214)    (0.195)         (0.045)   (0.044)
(100, 30)    1     0.2389     0.5517       1.097      1.498           0.93      0.93
                   (0.0643)   (0.1056)     (0.108)    (0.117)         (0.023)   (0.024)
(100, 50)    1     0.1816     0.4260       1.059      1.383           0.95      0.95
                   (0.0397)   (0.0839)     (0.075)    (0.093)         (0.017)   (0.017)
(100, 70)    1     0.1505     0.3608       1.045      1.326           0.96      0.96
                   (0.0249)   (0.0674)     (0.055)    (0.075)         (0.013)   (0.013)
(100, 10)    4     0.4618     0.9272       1.252      1.833           0.98      0.98
                   (0.1841)   (0.1876)     (0.244)    (0.205)         (0.007)   (0.007)
(100, 30)    4     0.2434     0.5387       1.102      1.483           0.99      0.99
                   (0.0677)   (0.1052)     (0.106)    (0.116)         (0.003)   (0.003)
(100, 50)    4     0.1813     0.4147       1.065      1.372           0.99      0.99
                   (0.0368)   (0.0824)     (0.069)    (0.090)         (0.002)   (0.002)
(100, 70)    4     0.1506     0.3523       1.048      1.316           1.00      1.00
                   (0.0289)   (0.0673)     (0.055)    (0.074)         (0.002)   (0.002)
(200, 10)    1     0.3602     1.0541       1.088      1.974           0.88      0.85
                   (0.0390)   (0.1711)     (0.109)    (0.183)         (0.031)   (0.031)
(200, 70)    1     0.1436     0.3967       1.018      1.365           0.97      0.96
                   (0.0111)   (0.0619)     (0.036)    (0.067)         (0.009)   (0.009)
(200, 10)    4     0.3878     1.0273       1.144      1.944           0.98      0.98
                   (0.0540)   (0.1744)     (0.117)    (0.188)         (0.005)   (0.005)
(200, 70)    4     0.1405     0.3909       1.027      1.359           1.00      1.00
                   (0.0108)   (0.0649)     (0.033)    (0.070)         (0.001)   (0.001)
(400, 10)    1     0.3587     1.1359       1.005      2.062           0.90      0.85
                   (0.0250)   (0.1512)     (0.074)    (0.161)         (0.022)   (0.020)
(400, 70)    1     0.1479     0.4295       0.997      1.400           0.97      0.96
                   (0.0111)   (0.0623)     (0.026)    (0.066)         (0.005)   (0.006)
(400, 10)    4     0.3650     1.1296       1.086      2.055           0.98      0.98
                   (0.0219)   (0.1676)     (0.072)    (0.179)         (0.003)   (0.003)
(400, 70)    4     0.1397     0.4280       1.012      1.398           1.00      1.00
                   (0.0070)   (0.0615)     (0.022)    (0.066)         (0.001)   (0.001)
(1000, 10)   1     0.3905     1.2564       0.933      2.191           0.92      0.85
                   (0.0279)   (0.1517)     (0.046)    (0.159)         (0.015)   (0.013)
(1000, 10)   4     0.3656     1.2368       1.035      2.170           0.99      0.98
                   (0.0139)   (0.1523)     (0.046)    (0.161)         (0.002)   (0.002)

Each entry contains the average value for each measure over 1,000 replications, and the corresponding standard deviations are in parentheses. Rank correlations are computed only among the inefficiencies whose true values are nonzero.
Table 2: Selection Accuracy

                         σu = 1                                  σu = 2                                  σu = 4
(N, T)       PS       PSc      δ̂        Max-miss    PS       PSc      δ̂        Max-miss    PS       PSc      δ̂        Max-miss
(100, 10)    0.6576   0.7936   0.3418   0.6661      0.6397   0.8889   0.2697   0.5842      0.6346   0.9393   0.2329   0.4938
             (0.1977) (0.1084) (0.1281) (0.2714)    (0.2102) (0.0682) (0.1034) (0.2993)    (0.2128) (0.0435) (0.0876) (0.3099)
(100, 30)    0.7209   0.8421   0.3268   0.4070      0.7209   0.9135   0.2768   0.3528      0.7369   0.9535   0.2536   0.2859
             (0.1705) (0.0805) (0.0997) (0.1665)    (0.1766) (0.0522) (0.0818) (0.1822)    (0.1680) (0.0335) (0.0657) (0.1891)
(100, 50)    0.7617   0.8595   0.3269   0.3278      0.7555   0.9239   0.2799   0.2776      0.7864   0.9570   0.2660   0.2343
             (0.1593) (0.0697) (0.0882) (0.1350)    (0.1615) (0.0451) (0.0721) (0.1478)    (0.1525) (0.0307) (0.0597) (0.1561)
(100, 70)    0.7894   0.8699   0.3279   0.2846      0.7961   0.9300   0.2878   0.2400      0.8102   0.9616   0.2700   0.1919
             (0.1400) (0.0641) (0.0790) (0.1112)    (0.1377) (0.0417) (0.0625) (0.1231)    (0.1410) (0.0280) (0.0540) (0.1320)
(200, 10)    0.7918   0.7183   0.4347   0.9195      0.7609   0.8504   0.3330   0.8453      0.7563   0.9211   0.2821   0.7515
             (0.1173) (0.0913) (0.0946) (0.2317)    (0.1276) (0.0596) (0.0745) (0.2519)    (0.1295) (0.0361) (0.0590) (0.2779)
(200, 70)    0.8698   0.8416   0.3719   0.3769      0.8708   0.9135   0.3218   0.3470      0.8814   0.9530   0.2973   0.2996
             (0.0824) (0.0528) (0.0566) (0.0985)    (0.0798) (0.0336) (0.0427) (0.1036)    (0.0776) (0.0219) (0.0336) (0.1180)
(1000, 10)   0.9445   0.5616   0.5902   1.4197      0.9110   0.7717   0.4331   1.3278      0.8936   0.8819   0.3507   1.2500
             (0.0271) (0.0523) (0.0434) (0.1832)    (0.0358) (0.0347) (0.0333) (0.1874)    (0.0410) (0.0223) (0.0259) (0.2138)

Each entry contains the average value for each measure over 1,000 replications, and the corresponding standard deviations are in parentheses.
Table 3: LSDV Estimates of β0,1
Estimate S.E.
Youth −0.041∗∗ 0.018
Dispersion 0.004 0.004
Scale −5.659∗∗∗ 2.000
Note: The linear probability model includes dummies for different combinations of census tracts and three work shifts, and dummies for seasons. Standard errors are clustered at the officer level. ***, **, and * indicate statistical significance at the 1%, 5% and 10% levels, respectively.
Figure 1: PDFs of Inefficiency with Different σu Values and an Example of Draws from Each PDF.
Figure 2: Change in Arrest Rate by Experience.
Figure 3: Distribution of Search Inefficiency
Supplementary Material for "LASSO for Stochastic Frontier Models with Many Efficient Firms"
By William C. Horrace, Hyunseok Jung and Yoonseok Lee
A. Proofs
Let $\kappa_{NT} = (\log N)/\sqrt{T}$. We first derive some technical lemmas.
Lemma A.1 Suppose Assumption 2-(1) and 2-(2)-(ii) hold. Then, for some 0 < Cx, Cv < ∞, as
(N, T) → ∞, we have

(a) $$\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big)=o(N^{-1}) \quad\text{and}\quad \max_{1\le i\le N}\Pr\Big(\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|\ge C_v\kappa_{NT}\Big)=o(N^{-1});$$

(b) $$\Pr\Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big)=o(1) \quad\text{and}\quad \Pr\Big(\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|\ge C_v\kappa_{NT}\Big)=o(1).$$
Proof of Lemma A.1 We only prove the first part of (a), since the proof of the second part of (a) is analogous and (a) implies (b): if (a) is true, then
$$\Pr\Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le \sum_{i=1}^{N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le N\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) = N\cdot o(N^{-1}) = o(1),$$
and similarly for the second part of (b).
To prove the first result of (a), we let $M_T=\sqrt{T}/(\log T)^2$ and $\mathbf{1}_{it}=\mathbf{1}\{\|x_{it}\|<M_T\}$. We define
$$\xi_{1,it}=x_{it}\mathbf{1}_{it}-E[x_{it}\mathbf{1}_{it}],\qquad \xi_{2,it}=x_{it}(1-\mathbf{1}_{it}),\qquad \xi_{3,it}=-E[x_{it}(1-\mathbf{1}_{it})].$$
Then $x_{it}-E[x_{it}]=\xi_{1,it}+\xi_{2,it}+\xi_{3,it}$, and thus we have
$$\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le \max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|+\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|+\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\|\ge C_x\kappa_{NT}\Big).$$
We prove the first part of (a) by showing
$$\text{(a1)}\quad N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)=o(1),$$
$$\text{(a2)}\quad N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)=o(1),\quad\text{and}$$
$$\text{(a3)}\quad \max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\|=o(\kappa_{NT}).$$
To prove (a1), we let $\xi^{\varphi}_{1,it}=\varphi'\xi_{1,it}$ for some constant $p\times 1$ vector $\varphi$ with $\|\varphi\|=1$. Then, by Assumption 2-(1)-(ii), $\xi^{\varphi}_{1,it}$ is a zero-mean strong mixing process, not necessarily stationary, with the mixing coefficients satisfying $\alpha[t]\le c_\alpha\rho^t$ for some $c_\alpha>0$ and $\rho\in(0,1)$. In addition, $\max_{1\le t\le T}|\xi^{\varphi}_{1,it}|\le 2M_T$ almost surely by construction. We define
$$v_N^2=\max_{1\le i\le N}\sup_{t\ge 1}\Big\{\mathrm{var}(\xi^{\varphi}_{1,it})+2\sum_{s=t+1}^{\infty}\big|\mathrm{cov}(\xi^{\varphi}_{1,it},\xi^{\varphi}_{1,is})\big|\Big\},$$
which is bounded by Assumption 2-(1)-(ii) and (iii), and the Davydov inequality. Then, by Lemma S1.1 of Su, Shi and Phillips (2016), there exists a constant $C_0>0$ such that for any $T\ge 2$ and $C_x>0$,
$$N\cdot\max_{1\le i\le N}\Pr\Big(\Big|\frac{1}{T}\sum_{t=1}^{T}\xi^{\varphi}_{1,it}\Big|\ge\frac{C_x}{2}\kappa_{NT}\Big) \le N\exp\Big(-\frac{C_0C_x^2T^2\kappa_{NT}^2/4}{v_N^2T+4M_T^2+2C_xT\kappa_{NT}M_T(\log T)^2/2}\Big) = \exp\Big(-\Big\{\frac{C_0C_x^2(\log N)^2/4}{v_N^2+4/(\log T)^4+C_x\log N}-\log N\Big\}\Big).$$
Thus, by choosing $C_x$ sufficiently large, it follows that
$$N\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)\to 0\quad\text{as }(N,T)\to\infty.$$
Next, by Assumption 2-(1)-(iii) and 2-(2)-(ii), and the Boole and Markov inequalities, we have
$$N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big) \le N\cdot\max_{1\le i\le N}\Pr\Big(\max_{1\le t\le T}\|x_{it}\|\ge M_T\Big) \le NT\max_{1\le i\le N}\max_{1\le t\le T}\Pr\big(\|x_{it}\|\ge M_T\big) \le \frac{NT}{M_T^q}\max_{1\le i\le N}\max_{1\le t\le T}E\|x_{it}\|^q = o(1).$$
Lastly, by Assumption 2-(1)-(iii), and the Hölder and Markov inequalities,
$$\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\| \le \max_{1\le i\le N}\max_{1\le t\le T}E\big\|x_{it}\mathbf{1}\{\|x_{it}\|\ge M_T\}\big\| \le \max_{1\le i\le N}\max_{1\le t\le T}\big(E\|x_{it}\|^{q/2}\big)^{2/q}\max_{1\le i\le N}\max_{1\le t\le T}\big\{\Pr(\|x_{it}\|\ge M_T)\big\}^{(q-2)/q} \le \max_{1\le i\le N}\max_{1\le t\le T}\big(E\|x_{it}\|^{q/2}\big)^{2/q}\max_{1\le i\le N}\max_{1\le t\le T}\Big(\frac{E\|x_{it}\|^q}{M_T^q}\Big)^{(q-2)/q} = O\big(M_T^{-(q-2)}\big) = o(\kappa_{NT}),$$
where we use the fact that $M_T^{q-2}\kappa_{NT}=T^{(q-3)/2}\log N/(\log T)^{2(q-2)}\to\infty$ for $q\ge 4$ in the last step.
Then, the desired result follows by combining (a1), (a2) and (a3). □
Proof of Lemma 1 First, note that
$$\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}'(\beta_0-\hat\beta)+v_{it}\}\Big| \le \Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|+\max_{1\le i\le N}E\|x_{it}\|\Big)\big\|\hat\beta-\beta_0\big\| + \max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|,$$
where $\max_{1\le i\le N}E\|x_{it}\|=O(1)$ and $\|\hat\beta-\beta_0\|=O_p((NT)^{-1/2})$ due to Assumption 2-(1)-(iii) and 2-(2)-(i). This implies that, for sufficiently large 0 < C < ∞,
$$\Pr\Big(\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}'(\beta_0-\hat\beta)+v_{it}\}\Big|\ge C\kappa_{NT}\Big)=o(1) \qquad (A.1)$$
by Lemma A.1.
Recall $\eta=\min_{i\in S^c}u_{0,i}$ and $\hat\alpha_i=T^{-1}\sum_{t=1}^{T}(y_{it}-x_{it}'\hat\beta)=T^{-1}\sum_{t=1}^{T}(\alpha_0-u_{0,i}+x_{it}'(\beta_0-\hat\beta)+v_{it})$, where $u_{0,i}=0$ for all $i\in S$. Thus, it follows that
$$\min_{i\in S}\hat\alpha_i-\max_{i\in S^c}\hat\alpha_i = \min_{i\in S}\Big\{\frac{1}{T}\sum_{t=1}^{T}(\alpha_0+x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} - \max_{i\in S^c}\Big\{\frac{1}{T}\sum_{t=1}^{T}(\alpha_0-u_{0,i}+x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} \ge \min_{i\in S^c}u_{0,i} + \Big[\min_{i\in S}\Big\{\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} - \max_{i\in S^c}\Big\{\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\}\Big] \ge \eta-2\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big| > \frac{\eta}{2}-O_p(\kappa_{NT}),$$
which implies
$$\Pr\Big(\min_{i\in S}\hat\alpha_i-\max_{i\in S^c}\hat\alpha_i>0\Big)\to 1 \qquad (A.2)$$
as (N, T) → ∞, since η > 0 and η/κNT → ∞ by Assumption 2-(2)-(iii). (A.2), in turn, implies Pr(α̂ = max_{i∈S} α̂i) → 1 as (N, T) → ∞, because α̂ is defined as max_{1≤i≤N} α̂i.
By (A.2), we can let α̂ = max_{i∈S} α̂i for sufficiently large (N, T), instead of α̂ = max_{1≤i≤N} α̂i.
Hence, for sufficiently large (N, T), we have
|α− α0| =
∣∣∣∣∣maxi∈S
{1
T
T∑t=1
(α0 + x′it(β0 − β) + vit
)}− α0
∣∣∣∣∣≤ max
1≤i≤N
∣∣∣∣∣ 1
T
T∑t=1
(x′it(β0 − β) + vit
)∣∣∣∣∣ = Op(κNT )
from A.1, which proves Lemma 1.
Since ui = α − αi = (α − α0) + (α0 − αi) = (α − α0) + (u0,i + α0,i − αi) so that |ui − u0,i| ≤
|α− α0|+ |αi − α0,i| ≤ 2 max1≤i≤N
∣∣∣ 1T
∑Tt=1 x
′it(β0 − β) + vit
∣∣∣ by the results above, we also have
Pr (|ui − u0,i| ≥ CκNT ) = o (1) (A.3)
for sufficiently large 0 < C <∞. �
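For intuition, the first-stage quantities in Lemma 1 (the firm-level mean residuals, their maximum as the frontier-intercept estimate, and the implied inefficiency estimates) are straightforward to compute. A minimal numerical sketch on simulated data, where all data-generating values and variable names are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 50
beta0, alpha0 = np.array([1.0, -0.5]), 1.0
# Half the firms are fully efficient (u = 0); the rest draw positive inefficiencies.
u0 = np.where(np.arange(N) < N // 2, 0.0, rng.exponential(1.0, N))
x = rng.normal(size=(N, T, 2))
y = x @ beta0 + alpha0 - u0[:, None] + rng.normal(scale=0.3, size=(N, T))

# First-stage slope: pooled OLS after within-firm demeaning (fixed-effect estimator).
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
X, Y = xd.reshape(-1, 2), yd.reshape(-1)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Firm-level mean residuals, their maximum, and the implied inefficiencies.
alpha_i = (y - x @ beta_hat).mean(axis=1)
alpha_tilde = alpha_i.max()
u_tilde = alpha_tilde - alpha_i   # nonnegative by construction
print(abs(alpha_tilde - alpha0))  # small: Lemma 1 gives the kappa_NT rate
```

The `u_tilde` values are what enter the adaptive weights in the LASSO step.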
Proof of Theorem 1. For Equation (5) in the main text, we form the Lagrangian
\[
\mathcal{L}\left(\alpha,\{u_i\}_{i=1}^N,\{\rho_i\}_{i=1}^N\right)
=\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\alpha+u_i\right)^2
+\lambda\sum_{i=1}^N\pi_i u_i-\sum_{i=1}^N\rho_i u_i,
\]
where $\rho_i\ge 0$, $u_i\ge 0$, and $\rho_i u_i=0$ (complementary slackness) for all $i$. From the Karush-Kuhn-Tucker (KKT) conditions, we have
\begin{align}
\hat\alpha(\lambda)&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta+\hat u_i(\lambda)\right), \tag{A.4}\\
\hat u_i(\lambda)&=\max\left\{0,\;\hat\alpha(\lambda)-\frac1T\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)-\frac{\lambda}{2T}\pi_i\right\}. \tag{A.5}
\end{align}
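The KKT conditions (A.4)-(A.5) are exactly the two updates that a coordinate-descent implementation alternates: (A.5) applies a nonnegative soft-threshold to each inefficiency given the intercept, and (A.4) recenters the intercept given the inefficiencies. A minimal sketch of this fixed-point iteration (the function name, initialization, and tolerance are ours, not the paper's code):

```python
import numpy as np

def lasso_frontier(resid_mean, pi, lam, T, tol=1e-10, max_iter=10_000):
    """Iterate the KKT updates (A.4)-(A.5).

    resid_mean[i] = (1/T) * sum_t (y_it - x_it' beta_hat);
    pi = adaptive weights; lam = tuning parameter.
    Returns (alpha_hat, u_hat).
    """
    alpha = resid_mean.max()  # start at the first-stage intercept estimate
    u = np.zeros_like(resid_mean)
    for _ in range(max_iter):
        # (A.5): nonnegative soft-threshold update for each u_i given alpha
        u_new = np.maximum(0.0, alpha - resid_mean - lam * pi / (2 * T))
        # (A.4): alpha is the grand mean of residuals plus inefficiencies
        alpha_new = (resid_mean + u_new).mean()
        done = abs(alpha_new - alpha) + np.abs(u_new - u).max() < tol
        alpha, u = alpha_new, u_new
        if done:
            break
    return alpha, u
```

Starting at the maximum mean residual keeps each new intercept no larger than the previous one, so the iteration behaves as a monotone fixed-point map in this sketch; with $\lambda=0$ it reproduces the first-stage estimates exactly.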
Recall $\delta=|S|/N$ and let $\hat\delta=|\hat S|/N$, where $\hat S=\hat S(\lambda)=\{i:\hat u_i(\lambda)=0\}$. By plugging (A.5) into (A.4), we have
\begin{align*}
\hat\alpha(\lambda)
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)
+\frac{1}{NT}\sum_{i\in\hat S^c}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta+\hat u_i(\lambda)\right)\\
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)
+\frac{1}{N}\sum_{i\in\hat S^c}\left(\hat\alpha(\lambda)-\frac{\lambda}{2T}\pi_i\right)\\
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+\alpha_0-u_{0,i}+v_{it}\right)
+\left(1-\hat\delta\right)\hat\alpha(\lambda)-\frac{\lambda}{2NT}\sum_{i\in\hat S^c}\pi_i,
\end{align*}
and hence
\[
\hat\alpha(\lambda)-\alpha_0
=\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)
-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i. \tag{A.6}
\]
This shows that $\hat\alpha(\lambda)$ is estimated as a common intercept for the firms classified as fully efficient by the LASSO, and that it contains a bias term due to the shrinkage applied to $\hat u_i(\lambda)$. From (A.5), it follows that, for $i\in\hat S^c$ (i.e., $\hat u_i(\lambda)>0$),
\begin{align*}
\hat u_i(\lambda)
&=\hat\alpha(\lambda)-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+\alpha_0-u_{0,i}+v_{it}\right)-\frac{\lambda}{2T}\pi_i\\
&=\frac{1}{\hat\delta NT}\sum_{j\in\hat S}\sum_{t=1}^T\left(x_{jt}'(\beta_0-\hat\beta)-u_{0,j}+v_{jt}\right)
-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)
-\frac{\lambda}{2\hat\delta NT}\sum_{j\in\hat S^c}\pi_j-\frac{\lambda}{2T}\pi_i.
\end{align*}
We prove the theorem by showing $S\subset\hat S$ and $S^c\subset\hat S^c$ w.p.a.1.

(i) We first prove $S\subset\hat S$ w.p.a.1 by showing $\Pr\left(\max_{i\in S}\hat u_i(\lambda)>0\right)\to 0$. Let $\tilde\tau=\max_{i\in S}\tilde u_i$. Then, from (A.5), for any $C>0$, we have
\begin{align}
\Pr\left(\max_{i\in S}\hat u_i(\lambda)>0\right)
&=\Pr\left(\max_{i\in S}\left\{\hat\alpha(\lambda)-\tilde\alpha_i-\frac{\lambda}{2T}\pi_i\right\}>0\right)\nonumber\\
&\le\Pr\left(\max_{i\in S}\left\{\hat\alpha(\lambda)-\tilde\alpha_i-\frac{\lambda}{2T}\pi_i\right\}>0,\;\tilde\tau\le C\kappa_{NT}\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right)\nonumber\\
&\le\Pr\left(\max_{i\in S}\left\{\frac{1}{\hat\delta NT}\sum_{j\in\hat S}\sum_{t=1}^T\left(x_{jt}'(\beta_0-\hat\beta)-u_{0,j}+v_{jt}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{j\in\hat S^c}\pi_j\right.\right.\nonumber\\
&\qquad\qquad\left.\left.-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2T}(C\kappa_{NT})^{-\gamma}\right\}>0\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right)\nonumber\\
&\le\Pr\left(2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left\{x_{it}'(\beta_0-\hat\beta)+v_{it}\right\}\right|-\frac{\lambda}{2T}(C\kappa_{NT})^{-\gamma}>0\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right), \tag{A.7}
\end{align}
where we use the facts that $u_{0,j}\ge 0$ and $\pi_j\ge 0$ for all $j$ in the last step. Then, by choosing sufficiently large $C<\infty$, we can easily show that the first term in (A.7) is $o(1)$ due to (A.1) and the fact that $((\lambda/T)\kappa_{NT}^{-\gamma})/\kappa_{NT}\to\infty$ as $(N,T)\to\infty$ by Assumption 2-(3). The second term in (A.7) is also $o(1)$ because
\[
\tilde\tau=\max_{i\in S}\tilde u_i=\max_{i\in S}\left\{(\tilde\alpha-\alpha_0)-(\tilde\alpha_i-\alpha_{0,i})\right\}
\le 2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left\{x_{it}'(\beta_0-\hat\beta)+v_{it}\right\}\right|
\]
and (A.1), where we use the fact that $u_{0,i}=0$ for $i\in S$.
(ii) Next, we prove $S^c\subset\hat S^c$ w.p.a.1. Define $D_i\equiv\{\hat u_i(\lambda)=0\}$, so that
\[
\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)=\Pr\left(\bigcup_{i\in S^c}D_i\right).
\]
Let $|S^c|=J$. We arbitrarily list the firms in $S^c$ and use an auxiliary index $[j]$, for $j=1,\dots,J$, to denote the $j$th firm on the list. Then we can partition $\bigcup_{i\in S^c}D_i$ into the disjoint sets $D_{[1]}\cap\left(\bigcup_{j=2}^J D_{[j]}\right)^c$, $D_{[2]}\cap\left(\bigcup_{j=3}^J D_{[j]}\right)^c$, $\dots$, and $D_{[J]}$. Therefore, we have
\[
\Pr\left(\bigcup_{i\in S^c}D_i\right)
=\sum_{j=1}^J\Pr\left(D_{[j]}\cap\Big(\bigcup_{k=j+1}^J D_{[k]}\Big)^c\right)
=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0,\;\hat u_{[j+1]}(\lambda)>0,\;\dots,\;\hat u_{[J]}(\lambda)>0\right),
\]
which is true regardless of the order of the firms on the list. So we list the firms in $S^c$ in ascending order of inefficiency, so that $u_{0,[1]}\le\dots\le u_{0,[j]}\le\dots\le u_{0,[J]}$. Then we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0,\;\hat u_{[j+1]}(\lambda)>0,\;\dots,\;\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)
\times\Pr\left(\hat u_{[j+1]}(\lambda)>0\;\Big|\;\hat u_{[j+2]}(\lambda)>0,\dots\right)\times\cdots\nonumber\\
&\qquad\cdots\times\Pr\left(\hat u_{[J-1]}(\lambda)>0\;\Big|\;\hat u_{[J]}(\lambda)>0\right)\times\Pr\left(\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i\right.\nonumber\\
&\qquad\qquad\left.-\frac1T\sum_{t=1}^T\left(x_{[j]t}'(\beta_0-\hat\beta)-u_{0,[j]}+v_{[j]t}\right)-\frac{\lambda}{2T}\pi_{[j]}<0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\underbrace{u_{0,[j]}-\frac{\sum_{i\in\hat S}u_{0,i}}{\hat\delta N}}_{(*)}+\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i\right.\nonumber\\
&\qquad\qquad\left.-\frac1T\sum_{t=1}^T\left(x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right)-\frac{\lambda}{2T}\pi_{[j]}<0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right). \tag{A.8}
\end{align}
We let $S^*=S^c\cap\hat S$ and $\hat\delta^*=|S^*|/N$. Then $(*)$ in the $j$th probability of (A.8) satisfies
\[
u_{0,[j]}-\frac{\sum_{i\in\hat S}u_{0,i}}{\hat\delta N}\ge u_{0,[j]}-\frac{\hat\delta^*u_{0,[j]}}{\hat\delta},
\]
since $u_{0,i}=0$ for all $i\in S$ and $u_{0,[j]}=\max_{i\in S^*}u_{0,i}$ in the $j$th event by construction, which further gives us
\[
u_{0,[j]}-\frac{\hat\delta^*}{\hat\delta}u_{0,[j]}=\frac{\delta}{\hat\delta}u_{0,[j]}\ge\delta u_{0,[j]}\ge\delta\eta, \tag{A.9}
\]
since $\hat\delta-\hat\delta^*=\delta$ and $\delta\le\hat\delta\le 1$ when $S\subset\hat S$.

Let $\tilde\eta=\min_{i\in S^c}\tilde u_i$ and $\bar\alpha=\left|\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|$. Then, by choosing sufficiently large $0<C<\infty$, we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)\nonumber\\
&\le\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT},\;\bar\alpha\le C\kappa_{NT},\;S\subset\hat S\right)\nonumber\\
&\quad+\Pr\left(\Vert\beta_0-\hat\beta\Vert>\kappa_{NT}\right)+\Pr\left(\bar\alpha>C\kappa_{NT}\right)+\Pr\left(\tilde\eta<\eta-C\kappa_{NT}\right)+\Pr\left(S\not\subset\hat S\right), \tag{A.10}
\end{align}
where $\Pr\left(\Vert\beta_0-\hat\beta\Vert>\kappa_{NT}\right)=o(1)$ by Assumption 2-(2)-(i); $\Pr\left(S\not\subset\hat S\right)=o(1)$ by the first part of this proof; $\Pr\left(\bar\alpha>C\kappa_{NT}\right)=o(1)$ by the fact that $\bar\alpha\le\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T x_{it}'(\beta_0-\hat\beta)+v_{it}\right|$ and (A.1); and $\Pr\left(\tilde\eta<\eta-C\kappa_{NT}\right)=o(1)$ by the fact that
\[
|\tilde\eta-\eta|\le|\tilde\eta-u_{0,\ell}|+|\tilde u_{\ell_0}-\eta| \tag{A.11}
\]
and (A.3), where $\ell=\arg\min_{i\in S^c}\tilde u_i$ and $\ell_0=\arg\min_{i\in S^c}u_{0,i}$.$^{30}$ Furthermore, we have
\[
\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i+\frac{\lambda}{2T}\pi_{[j]}
\le\frac{\lambda}{2\hat\delta NT}(1-\hat\delta)N\tilde\eta^{-\gamma}+\frac{\lambda}{2T}\tilde\eta^{-\gamma}
=\frac{\lambda}{2\hat\delta T}\tilde\eta^{-\gamma}\le\frac{\lambda}{\hat\delta T}\tilde\eta^{-\gamma}, \tag{A.12}
\]
where we use the facts that $\hat S^c\subset S^c$ and $\delta\le\hat\delta\le 1$ when $S\subset\hat S$.

$^{30}$Note that $|\tilde\eta-\eta|\le|\tilde u_{\ell_0}-\eta|$ if $\tilde\eta>\eta$ and $|\tilde\eta-\eta|\le|\tilde\eta-u_{0,\ell}|$ if $\tilde\eta<\eta$.
Then, for the first term in (A.10), by combining (A.8), (A.9) and (A.12), we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT},\;\bar\alpha\le C\kappa_{NT},\;S\subset\hat S\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\left|\frac1T\sum_{t=1}^T\left\{x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right\}\right|-\frac{\lambda}{\delta T}\tilde\eta^{-\gamma}<0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT}\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\left|\frac1T\sum_{t=1}^T\left\{x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right\}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT}\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{[j]t}-E[x_{[j]t}]\right\}\right\Vert+E\left\Vert x_{[j]t}\right\Vert\right)-\left|\frac1T\sum_{t=1}^T v_{[j]t}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(C\kappa_{NT}+E\left\Vert x_{[j]t}\right\Vert\right)-\left|\frac1T\sum_{t=1}^T v_{[j]t}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0\right)\nonumber\\
&\quad+\sum_{j=1}^J\Pr\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{it}-E[x_{it}]\right\}\right\Vert>C\kappa_{NT}\right)\nonumber\\
&\le N\max_{1\le i\le N}\Pr\left(\left|\frac1T\sum_{t=1}^T v_{it}\right|>\mathfrak{R}_{NT}\right)+N\max_{1\le i\le N}\Pr\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{it}-E[x_{it}]\right\}\right\Vert>C\kappa_{NT}\right), \tag{A.13}
\end{align}
where $\mathfrak{R}_{NT}=\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(C\kappa_{NT}+E\Vert x_{it}\Vert\right)-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}$, and the first inequality also uses $\hat\delta\ge\delta$ on the event $S\subset\hat S$. We can then easily show that the two terms in (A.13) are $o(1)$ by an application of Lemma A.1 and the fact that
\[
\frac{\mathfrak{R}_{NT}}{\kappa_{NT}}
=\frac{\delta\eta}{\kappa_{NT}}-C-C\kappa_{NT}-E\Vert x_{it}\Vert-\frac{\lambda}{\delta T}\eta^{-\gamma}\kappa_{NT}^{-1}\left(1-C\kappa_{NT}/\eta\right)^{-\gamma}\to\infty
\]
as $(N,T)\to\infty$ by Assumptions 1 and 2. Thus, the proof is complete. $\square$
Proof of Theorem 2. By Theorem 1, w.p.a.1, we have
\[
\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)
=\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)
-\frac{\lambda}{2\sqrt{\delta NT}}\sum_{i\in S^c}\pi_i.
\]
The second term is $o_p(1)$ since
\[
\frac{\lambda}{\sqrt{\delta NT}}\sum_{i\in S^c}\pi_i
\le\sqrt{\frac{(1-\delta)^2}{\delta}}\,\lambda\sqrt{\frac{N}{T}}\,\eta^{-\gamma}\left(\frac{\tilde\eta}{\eta}\right)^{-\gamma}=o_p(1) \tag{A.14}
\]
by Assumption 2-(3) and the fact that
\[
\left|\frac{\tilde\eta}{\eta}-1\right|\le\frac{|\tilde\eta-\eta|}{\eta}=o_p(1),
\]
due to (A.11) and $\kappa_{NT}/\eta\to 0$ as $(N,T)\to\infty$ by Assumption 2-(2)-(iii).
Since $\hat\beta-\beta_0=\left(\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}'\right)^{-1}\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}$ and $\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}=\sum_{i\in S}\sum_{t=1}^T x_{it}v_{it}+\sum_{i\in S^c}\sum_{t=1}^T x_{it}v_{it}$, we have
\[
\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)
=\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T v_{it}
-\sqrt{\delta}\left(\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T x_{it}'\right)\left(\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}'\right)^{-1}\left(\frac{1}{\sqrt{NT}}\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}\right)+o_p(1).
\]
We define
\[
\Upsilon_S=\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T x_{it}
\quad\text{and}\quad
H_0=\operatorname*{plim}_{N,T\to\infty}\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}',
\]
where $H_0>0$ by Assumption 3. We split the sample into $S$ and $S^c$ and define the two statistics
\begin{align*}
\Xi_{S,NT}&\equiv\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T\left\{v_{it}-\delta\Upsilon_S'H_0^{-1}x_{it}v_{it}\right\},\\
\Xi_{S^c,NT}&\equiv-\frac{1}{\sqrt{(1-\delta)NT}}\sum_{i\in S^c}\sum_{t=1}^T\sqrt{\delta(1-\delta)}\,\Upsilon_S'H_0^{-1}x_{it}v_{it},
\end{align*}
which are independent since the observations are cross-sectionally independent. By Assumption 3, we have
\[
\Xi_{S,NT}\overset{d}{\longrightarrow}N\left(0,\,\sigma_{S1}^2+\delta^2\sigma_{S2}^2-2\delta\sigma_{S1S2}\right)
\quad\text{and}\quad
\Xi_{S^c,NT}\overset{d}{\longrightarrow}N\left(0,\,\delta(1-\delta)\sigma_{S^c}^2\right)
\]
as $(N,T)\to\infty$, where
\begin{align*}
\sigma_{S1}^2&=\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T v_{it}v_{ik},\\
\sigma_{S2}^2&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S,\\
\sigma_{S1S2}&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}\right\},\\
\sigma_{S^c}^2&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{(1-\delta)NT}\sum_{i\in S^c}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S.
\end{align*}
Hence, $\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)=\Xi_{S,NT}+\Xi_{S^c,NT}\overset{d}{\longrightarrow}N\left(0,\,\sigma_{S1}^2+\delta^2\sigma_{S2}^2-2\delta\sigma_{S1S2}+\delta(1-\delta)\sigma_{S^c}^2\right)$, and the desired result follows.$^{31}$
For the second result, for $i\in S^c$, we have
\begin{align*}
\sqrt{T}\left(\hat u_i(\lambda)-u_{0,i}\right)
&=\sqrt{T}\left(\hat\alpha(\lambda)-\alpha_0\right)-\frac{1}{\sqrt{T}}\sum_{t=1}^T x_{it}'(\beta_0-\hat\beta)-\frac{1}{\sqrt{T}}\sum_{t=1}^T v_{it}-\frac{\lambda}{2\sqrt{T}}\pi_i\\
&\equiv\Psi_{1,NT}+\Psi_{2i,NT}+\Psi_{3i,T}+\Psi_{4i,NT},
\end{align*}
where $\Psi_{1,NT}=O_p(1/\sqrt{\delta N})=o_p(1)$ from the first result, $\Psi_{2i,NT}=O_p(1/\sqrt{N})=o_p(1)$ since $\hat\beta-\beta_0=O_p(1/\sqrt{NT})$, and $\Psi_{4i,NT}=o_p(1)$ by a similar argument as in (A.14). Since $\Psi_{3i,T}\overset{d}{\longrightarrow}N(0,\sigma_i^2)$ as $T\to\infty$ by Assumption 3, where $\sigma_i^2=\operatorname{plim}_{T\to\infty}\frac1T\sum_{t=1}^T\sum_{k=1}^T v_{it}v_{ik}$ for each $i$, we have the desired result. $\square$

$^{31}$When $v_{it}$ is conditionally homoskedastic across $i$, we have $\sigma_{S2}^2=\sigma_{S^c}^2=\Upsilon_S'H_0^{-1}\left\{\lim_{T\to\infty}T^{-1}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S$, and the limiting expression simplifies to $N\left(0,\,\sigma_{S1}^2+\delta\sigma_{S2}^2-2\delta\sigma_{S1S2}\right)$.
Proof of Theorem 3. We first define
\begin{align*}
\Lambda_-&=\{\lambda:\Pr(\hat S(\lambda)\supsetneq S)\to 1 \text{ as } (N,T)\to\infty\},\\
\Lambda_0&=\{\lambda:\Pr(\hat S(\lambda)=S)\to 1 \text{ as } (N,T)\to\infty\},\\
\Lambda_+&=\{\lambda:\Pr(\hat S(\lambda)\subsetneq S)\to 1 \text{ as } (N,T)\to\infty\},
\end{align*}
similarly to Hui, Warton and Foster (2015).$^{32}$ We denote the post-LASSO version of $\hat\theta(\lambda)$ by $\hat\theta_{\hat S(\lambda)}$,$^{33}$ the post-LASSO version of $\hat\sigma^2(\lambda)$ by $\hat\sigma^2_{\hat S(\lambda)}$, where
\[
\hat\sigma^2_{\hat S(\lambda)}=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda)}\right)^2,
\]
and the post-LASSO BIC by $\widetilde{\mathrm{BIC}}(\lambda)$,
\[
\widetilde{\mathrm{BIC}}(\lambda)=\log\hat\sigma^2_{\hat S(\lambda)}+\frac{\phi_{NT}}{NT}\left|\hat S^c(\lambda)\right|.
\]
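Operationally, Theorem 3 supports choosing $\lambda$ as the minimizer of the BIC over a grid: for each candidate $\lambda$, solve the LASSO problem and evaluate the log residual variance plus the penalty on the number of firms estimated as inefficient. A schematic sketch (the function name, the grid, and the simple choice of $\phi_{NT}$ in the test below are ours; in a full implementation the weights `pi` would come from the first-stage estimates):

```python
import numpy as np

def select_lambda(y, x, beta_hat, pi, lam_grid, phi_NT):
    """Pick lambda minimizing BIC(lam) = log sigma2(lam) + (phi_NT/(N*T)) * |S^c(lam)|."""
    N, T = y.shape
    e = y - np.einsum('itk,k->it', x, beta_hat)   # residuals y_it - x_it' beta_hat
    r = e.mean(axis=1)                            # firm-level mean residuals
    best = None
    for lam in lam_grid:
        alpha, u = r.max(), np.zeros(N)
        for _ in range(10_000):                   # iterate the KKT updates (A.4)-(A.5)
            u_new = np.maximum(0.0, alpha - r - lam * pi / (2 * T))
            alpha_new = (r + u_new).mean()
            done = abs(alpha_new - alpha) + np.abs(u_new - u).max() < 1e-10
            alpha, u = alpha_new, u_new
            if done:
                break
        sigma2 = ((e - alpha + u[:, None]) ** 2).mean()
        score = np.log(sigma2) + phi_NT / (N * T) * int((u > 0).sum())
        if best is None or score < best[0]:
            best = (score, lam, alpha, u)
    return best  # (BIC, lambda, alpha_hat, u_hat) at the minimizer
```

Firms with `u == 0` at the selected $\lambda$ form the estimated set of maximally efficient firms $\hat S(\lambda)$.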
The following lemma shows that, asymptotically, a $\lambda$ that yields an over-fitted or under-fitted model cannot be selected by $\widetilde{\mathrm{BIC}}(\lambda)$.

Lemma A.2. Suppose Assumptions 1 and 2 hold and there exists $\lambda_0\in\Lambda_0$. Then,
\[
\Pr\left(\inf_{\lambda\in\Lambda_-\cup\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1 \quad\text{as } (N,T)\to\infty.
\]

$^{32}$Recall Assumption 2-(3): i) $\lambda T^{-1/2}N^{1/2}\eta^{-\gamma}\to 0$; ii) $\lambda T^{(\gamma-1)/2}(\log N)^{-\gamma-1}\to\infty$ for some $\gamma>1$. Theorem 1 implies that, for $\lambda\in\Lambda_0$, both i) and ii) must be satisfied. For $\lambda\in\Lambda_+$, condition i) is satisfied but not ii); that is, $\lambda$ is not large enough, so some of the zero inefficiencies are estimated as nonzero, resulting in over-fitted models. For $\lambda\in\Lambda_-$, ii) is satisfied but not i), resulting in under-fitted models. In finite samples, under-fitted models include cases where some efficient firms are estimated as inefficient while some inefficient firms are estimated as efficient. However, Theorem 1 and its proof imply that we can ignore these cases asymptotically.

$^{33}$These post-LASSO estimates are simply the least-squares estimates given the estimated set of efficient firms, $\hat S(\lambda)$.
Proof of Lemma A.2. (i) We first show $\Pr\left(\inf_{\lambda\in\Lambda_-}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$. Let $\lambda_-\in\Lambda_-$. Since $\Pr(\hat S(\lambda_-)\supsetneq S)\to 1$ as $(N,T)\to\infty$, for sufficiently large $(N,T)$, we have
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_-)}
&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda_-)}\right)^2\\
&=\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^*}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)-u_{0,i}+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^{**}}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_-)}-u_{0,i}\right)+v_{it}\right)^2,
\end{align*}
where $S^*=S^c\cap\hat S(\lambda_-)$ and $S^{**}=S^c\cap\hat S^c(\lambda_-)$. Similarly, for large $(N,T)$,
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_0)}
&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda_0)}\right)^2\\
&=\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^*}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_0)}-u_{0,i}\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^{**}}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_0)}-u_{0,i}\right)+v_{it}\right)^2.
\end{align*}
Then, for large $(N,T)$, it can be verified that
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}
&=\delta\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0-\frac{1}{\delta NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\quad+\frac{1}{N}\sum_{i\in S^*}\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&>\frac{1}{N}\sum_{i\in S^*}\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&=\frac{1}{N}\sum_{i\in S^*}\left\{\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{j\in\hat S(\lambda_-)}\left(x_{jt}'(\beta_0-\hat\beta)+v_{jt}\right)-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}
+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\ge\frac{1}{N}\sum_{i\in S^*}\Bigg\{\underbrace{\left|u_{0,i}-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}\right|}_{(*)}
-\underbrace{\left|\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{j\in\hat S(\lambda_-)}\left(x_{jt}'(\beta_0-\hat\beta)+v_{jt}\right)-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|}_{(**)}\Bigg\}^2
\end{align*}
by the reverse triangle inequality and the fact that
\[
\hat\alpha_{\hat S(\lambda_-)}-\alpha_0
=\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{i\in\hat S(\lambda_-)}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{1}{\hat\delta N}\sum_{i\in S^*}u_{0,i},
\]
where $\hat\delta=\left|\hat S(\lambda_-)\right|/N$. Also note that $(*)$ is $O_p(1)$, or has the rate of $\eta$, which converges to zero more slowly than $(**)$.
Therefore, for large $(N,T)$, we have
\[
\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}
>\hat\delta^*\left\{\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2, \tag{A.15}
\]
where $\mathfrak{I}=\min_{i\in S^*}\left|u_{0,i}-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}\right|$ and $\hat\delta^*=\left|S^*\right|/N$.

Finally, note that, for any $\lambda_-\in\Lambda_-$,
\[
\widetilde{\mathrm{BIC}}(\lambda_-)-\widetilde{\mathrm{BIC}}(\lambda_0)
=\log\left\{1+\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^*
\ge\min\left\{\log 2,\;\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^*,
\]
and $\log 2-\frac{\phi_{NT}}{T}\hat\delta^*>0$ as $(N,T)\to\infty$ due to the condition that $(\phi_{NT}/T)^{1/2}\eta^{-1}\to 0$. Therefore, to prove $\Pr\left(\inf_{\lambda\in\Lambda_-}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$, it suffices to show that
\[
\inf_{\lambda\in\Lambda_-}\left\{\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^* \tag{A.16}
\]
is positive w.p.a.1 as $(N,T)\to\infty$.

Inequality (A.15) implies that (A.16) is asymptotically greater than
\begin{align*}
&\frac{\hat\delta^*}{2}\,\hat\sigma^{-2}_{\hat S(\lambda_0)}\left\{\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2-\frac{\phi_{NT}}{T}\hat\delta^*\\
&=\frac{\phi_{NT}}{T}\hat\delta^*\left(\frac{1}{2\hat\sigma^2_{\hat S(\lambda_0)}}\left(\left(\frac{T}{\phi_{NT}}\right)^{1/2}\left[\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right]\right)^2-1\right),
\end{align*}
which is asymptotically positive since $\hat\sigma^2_{\hat S(\lambda_0)}$ is bounded; $\mathfrak{I}$ is $O_p(1)$ or $O_p(\eta)$ and hence asymptotically dominates $\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|=O_p\!\left(\frac{\log N}{\sqrt{T}}\right)$ due to Assumption 2-(2)-(iii); and $\left(\frac{T}{\phi_{NT}}\right)^{1/2}\mathfrak{I}\to\infty$ by the condition that $(\phi_{NT}/T)^{1/2}\eta^{-1}\to 0$.
(ii) Next, we show $\Pr\left(\inf_{\lambda\in\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$. Let $\lambda_+\in\Lambda_+$. Similarly as in (i), for large $(N,T)$, it can be verified that
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_+)}-\hat\sigma^2_{\hat S(\lambda_0)}
&\ge-\delta^{\circ}\left\{\hat\alpha_{\hat S(\lambda_0)}-\alpha_0-\frac{1}{\delta^{\circ}NT}\sum_{t=1}^T\sum_{i\in S^{\circ}}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\quad-\delta^{\circ\circ}\max_{1\le i\le N}\left\{\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|+\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2,
\end{align*}
where $\delta^{\circ}=|S^{\circ}|/N$ and $\delta^{\circ\circ}=|S^{\circ\circ}|/N$ with $S^{\circ}=S\cap\hat S(\lambda_+)$ and $S^{\circ\circ}=S\cap\hat S^c(\lambda_+)$.

Therefore, to show $\Pr\left(\inf_{\lambda\in\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$, it suffices to show that
\begin{align*}
\widetilde{\mathrm{BIC}}(\lambda_+)-\widetilde{\mathrm{BIC}}(\lambda_0)
&\ge\frac{\phi_{NT}}{T}\delta^{\circ\circ}
-\underbrace{\frac{\delta^{\circ}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\left\{\hat\alpha_{\hat S(\lambda_0)}-\alpha_0-\frac{1}{\delta^{\circ}NT}\sum_{t=1}^T\sum_{i\in S^{\circ}}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2}_{(*)}\\
&\quad-\underbrace{\frac{\delta^{\circ\circ}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\max_{1\le i\le N}\left\{\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|+\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2}_{(**)}
\end{align*}
is positive w.p.a.1 as $(N,T)\to\infty$, which follows by the condition $\phi_{NT}/(\log N)^2\to\infty$, since $(**)$ is greater than $(*)$, but $(**)=O_p\!\left(\frac{(\log N)^2}{T}\right)$ because $\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|=O_p\!\left(\frac{1}{\sqrt{\delta NT}}\right)$ due to Theorem 2 and $\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|=O_p\!\left(\frac{\log N}{\sqrt{T}}\right)$.$^{34}$ $\square$

$^{34}$Even when $|S^{\circ\circ}|$ is finite, so that $\delta^{\circ\circ}=O\!\left(\frac1N\right)$ as $N\to\infty$, we obtain the same conclusion since $\delta^{\circ}\to\delta$ in this case, so $(*)=O_p\!\left(\frac{1}{NT}\right)$.
Next, to link the post-LASSO BIC and the LASSO BIC, we show that
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}=o_p\!\left(\frac{1}{NT}\right). \tag{A.17}
\]
Due to the shrinkage effect, we have $\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}>0$ and, similarly as in the proof of Lemma A.2 above, we can show that, for large $(N,T)$,
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}
=\delta\left\{\frac{\lambda}{2\delta NT}\sum_{i\in S^c}\pi_i\right\}^2+\frac{1}{N}\sum_{i\in S^c}\left\{\frac{\lambda}{2T}\pi_i\right\}^2,
\]
where we use the facts that $\hat\alpha(\lambda_0)-\alpha_0=\frac{1}{\delta NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2\delta NT}\sum_{i\in S^c}\pi_i$ and $\left(\hat\alpha(\lambda_0)-\alpha_0\right)-\left(\hat u_i(\lambda_0)-u_{0,i}\right)=\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2T}\pi_i$ for $i\in S^c$, w.p.a.1 as $(N,T)\to\infty$. Then, using the results in the proof of Theorem 2, we have
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}
\le\frac{1}{NT}\left\{\sqrt{\frac{(1-\delta)^2}{4\delta}}\,\lambda\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}\right\}^2
+\frac{1-\delta}{NT}\left\{\frac{\lambda}{2}\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}\right\}^2
=o_p\!\left(\frac{1}{NT}\right)
\]
since $\lambda\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}=o_p(1)$.

Finally, (A.17) and the fact that $\mathrm{BIC}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda)$ for any $\lambda$, again due to the shrinkage effect, imply
\[
\mathrm{BIC}(\lambda)-\mathrm{BIC}(\lambda_0)>\widetilde{\mathrm{BIC}}(\lambda)-\widetilde{\mathrm{BIC}}(\lambda_0)+o_p\!\left(\frac{1}{NT}\right),
\]
which gives
\[
\Pr\left(\inf_{\lambda\in\Lambda_-\cup\Lambda_+}\mathrm{BIC}(\lambda)>\mathrm{BIC}(\lambda_0)\right)\to 1 \quad\text{as } (N,T)\to\infty.
\]
This means that, asymptotically, a $\lambda$ which yields an over-fitted or under-fitted model cannot be chosen by the BIC criterion, so the desired result follows. $\square$
B. Additional Simulations for δ ∈ {0.1, 0.9}
Table B.1: Estimation Accuracy: δ = 0.1
(Point estimate RMSE (α0 = 1); rank correlations in the last two columns.)

(N, T)      σu   U-LASSO    U-LSDV     α-LASSO   α-LSDV    LASSO     LSDV
(100, 10)   1    0.4537     0.8630     1.166     1.761     0.87      0.85
                 (0.1765)   (0.1820)   (0.272)   (0.204)   (0.041)   (0.039)
(100, 30)   1    0.2623     0.4822     1.059     1.420     0.94      0.93
                 (0.0753)   (0.1056)   (0.143)   (0.121)   (0.019)   (0.019)
(100, 50)   1    0.2014     0.3675     1.034     1.318     0.96      0.95
                 (0.0576)   (0.0830)   (0.108)   (0.095)   (0.013)   (0.014)
(100, 70)   1    0.1733     0.3089     1.025     1.266     0.97      0.96
                 (0.0481)   (0.0700)   (0.095)   (0.081)   (0.011)   (0.011)
(100, 10)   4    0.4987     0.7802     1.225     1.663     0.98      0.98
                 (0.1918)   (0.1880)   (0.294)   (0.217)   (0.006)   (0.006)
(100, 30)   4    0.2818     0.4585     1.103     1.390     0.99      0.99
                 (0.1003)   (0.1174)   (0.168)   (0.138)   (0.003)   (0.003)
(100, 50)   4    0.2136     0.3528     1.063     1.297     0.99      0.99
                 (0.0711)   (0.0914)   (0.124)   (0.107)   (0.002)   (0.002)
(100, 70)   4    0.1722     0.2914     1.041     1.245     1.00      1.00
                 (0.0453)   (0.0713)   (0.089)   (0.084)   (0.001)   (0.001)
(200, 10)   1    0.4011     0.9625     1.025     1.874     0.89      0.85
                 (0.0627)   (0.1703)   (0.153)   (0.185)   (0.029)   (0.026)
(200, 70)   1    0.1661     0.3502     0.985     1.313     0.97      0.96
                 (0.0191)   (0.0675)   (0.053)   (0.075)   (0.008)   (0.008)
(200, 10)   4    0.4327     0.8770     1.122     1.779     0.98      0.98
                 (0.0743)   (0.1721)   (0.178)   (0.193)   (0.004)   (0.004)
(200, 70)   4    0.1614     0.3353     1.017     1.295     1.00      1.00
                 (0.0187)   (0.0708)   (0.057)   (0.08)    (0.001)   (0.001)
(400, 10)   1    0.4168     1.0597     0.920     1.981     0.91      0.85
                 (0.0399)   (0.1713)   (0.086)   (0.184)   (0.021)   (0.018)
(400, 70)   1    0.1794     0.3868     0.952     1.353     0.97      0.96
                 (0.0217)   (0.0643)   (0.039)   (0.070)   (0.005)   (0.005)
(400, 10)   4    0.4097     0.9973     1.045     1.911     0.98      0.98
                 (0.0376)   (0.1728)   (0.123)   (0.188)   (0.003)   (0.003)
(400, 70)   4    0.1577     0.3799     0.999     1.346     1.00      1.00
                 (0.0103)   (0.0668)   (0.039)   (0.073)   (0.001)   (0.001)
(1000, 10)  1    0.4792     1.1787     0.822     2.108     0.93      0.85
                 (0.0430)   (0.1546)   (0.057)   (0.164)   (0.014)   (0.011)
(1000, 10)  4    0.4158     1.1115     0.970     2.037     0.99      0.98
                 (0.0276)   (0.1612)   (0.081)   (0.171)   (0.002)   (0.002)
Table B.2: Estimation Accuracy: δ = 0.9
(Point estimate RMSE (α0 = 1); rank correlations in the last two columns.)

(N, T)      σu   U-LASSO    U-LSDV     α-LASSO   α-LSDV    LASSO     LSDV
(100, 10)   1    0.2772     1.0699     1.175     1.994     0.84      0.81
                 (0.1068)   (0.1713)   (0.144)   (0.184)   (0.133)   (0.151)
(100, 30)   1    0.1415     0.6292     1.057     1.582     0.91      0.89
                 (0.0458)   (0.0996)   (0.153)   (0.107)   (0.090)   (0.099)
(100, 50)   1    0.1046     0.4901     1.018     1.455     0.94      0.92
                 (0.0368)   (0.0769)   (0.186)   (0.082)   (0.068)   (0.076)
(100, 70)   1    0.0886     0.4137     0.985     1.383     0.95      0.93
                 (0.0404)   (0.0640)   (0.228)   (0.069)   (0.053)   (0.056)
(100, 10)   4    0.2744     1.0646     1.174     1.988     0.96      0.96
                 (0.1062)   (0.1736)   (0.120)   (0.186)   (0.038)   (0.039)
(100, 30)   4    0.1382     0.6229     1.076     1.577     0.98      0.98
                 (0.0461)   (0.0975)   (0.056)   (0.104)   (0.026)   (0.027)
(100, 50)   4    0.0980     0.4889     1.049     1.455     0.98      0.98
                 (0.0323)   (0.0753)   (0.039)   (0.080)   (0.018)   (0.019)
(100, 70)   4    0.0799     0.4138     1.037     1.384     0.99      0.99
                 (0.0249)   (0.0660)   (0.031)   (0.071)   (0.018)   (0.018)
(200, 10)   1    0.1991     1.1702     1.088     2.099     0.89      0.83
                 (0.0439)   (0.1619)   (0.064)   (0.170)   (0.075)   (0.084)
(200, 70)   1    0.0683     0.4496     1.013     1.422     0.96      0.95
                 (0.0164)   (0.0598)   (0.067)   (0.063)   (0.028)   (0.032)
(200, 10)   4    0.1992     1.1657     1.091     2.095     0.97      0.97
                 (0.0441)   (0.1621)   (0.061)   (0.172)   (0.019)   (0.020)
(200, 70)   4    0.0628     0.4488     1.015     1.420     0.99      0.99
                 (0.0117)   (0.0575)   (0.017)   (0.061)   (0.007)   (0.007)
(400, 10)   1    0.1718     1.2552     1.046     2.190     0.91      0.84
                 (0.0205)   (0.1504)   (0.036)   (0.158)   (0.048)   (0.058)
(400, 70)   1    0.0656     0.4800     1.008     1.454     0.97      0.96
                 (0.0069)   (0.0573)   (0.011)   (0.060)   (0.017)   (0.020)
(400, 10)   4    0.1727     1.2531     1.050     2.187     0.98      0.98
                 (0.0221)   (0.1502)   (0.035)   (0.159)   (0.010)   (0.011)
(400, 70)   4    0.0591     0.4802     1.007     1.454     1.00      0.99
                 (0.0068)   (0.0567)   (0.010)   (0.060)   (0.003)   (0.003)
(1000, 10)  1    0.1674     1.3605     1.016     2.301     0.93      0.85
                 (0.0112)   (0.1436)   (0.021)   (0.150)   (0.026)   (0.035)
(1000, 10)  4    0.1641     1.3736     1.023     2.314     0.99      0.98
                 (0.0112)   (0.1461)   (0.019)   (0.152)   (0.005)   (0.005)
Table B.3: Selection Accuracy

δ = 0.1
                      σu = 1                                  σu = 2                                  σu = 4
(N, T)       P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss
(100, 10)    0.6822   0.7729   0.2726   0.7364      0.6438   0.8805   0.1720   0.6366      0.6557   0.9332   0.1257   0.5638
             (0.2362) (0.1175) (0.1241) (0.3008)    (0.2506) (0.0765) (0.0885) (0.3380)    (0.2465) (0.0447) (0.0592) (0.3286)
(100, 30)    0.7612   0.8197   0.2384   0.4694      0.7320   0.9066   0.1572   0.3982      0.7323   0.9509   0.1175   0.3220
             (0.2034) (0.0916) (0.0975) (0.1893)    (0.2275) (0.0569) (0.0681) (0.2026)    (0.2302) (0.0360) (0.0495) (0.2071)
(100, 50)    0.7945   0.8431   0.2207   0.3659      0.7625   0.9191   0.1490   0.3125      0.7761   0.9573   0.1161   0.2546
             (0.1916) (0.0769) (0.0821) (0.1427)    (0.2171) (0.0497) (0.0606) (0.1594)    (0.2083) (0.0318) (0.0438) (0.1708)
(100, 70)    0.8066   0.8585   0.2080   0.3132      0.7962   0.9246   0.1474   0.2625      0.8087   0.9606   0.1163   0.2148
             (0.1879) (0.0730) (0.0791) (0.1298)    (0.1991) (0.0441) (0.0534) (0.1285)    (0.1791) (0.0284) (0.0376) (0.1388)
(200, 10)    0.8186   0.6933   0.3579   0.9995      0.7713   0.8416   0.2197   0.9037      0.7444   0.9191   0.1473   0.7990
             (0.1404) (0.1013) (0.1019) (0.2437)    (0.1648) (0.0657) (0.0720) (0.2924)    (0.1685) (0.0421) (0.0509) (0.3007)
(200, 70)    0.8913   0.8257   0.2460   0.4118      0.8690   0.9117   0.1664   0.3651      0.8630   0.9550   0.1268   0.3110
             (0.0984) (0.0587) (0.0595) (0.1025)    (0.1152) (0.0376) (0.0420) (0.1161)    (0.1217) (0.0224) (0.0288) (0.1256)
(1000, 10)   0.9660   0.5136   0.5344   1.5388      0.9336   0.7472   0.3209   1.4418      0.8972   0.8789   0.1987   1.2904
             (0.0252) (0.0531) (0.0495) (0.2005)    (0.0419) (0.0418) (0.0408) (0.1961)    (0.0567) (0.0276) (0.0294) (0.2220)

δ = 0.9
                      σu = 1                                  σu = 2                                  σu = 4
(N, T)       P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss
(100, 10)    0.7380   0.7468   0.6895   0.3797      0.7306   0.8555   0.6720   0.2943      0.7482   0.9222   0.6812   0.2093
             (0.1465) (0.1741) (0.1420) (0.2794)    (0.1564) (0.1270) (0.1470) (0.2897)    (0.1515) (0.0887) (0.1393) (0.2878)
(100, 30)    0.8134   0.7934   0.7527   0.2462      0.8140   0.8867   0.7439   0.1759      0.8207   0.9383   0.7448   0.1117
             (0.1188) (0.1756) (0.1162) (0.2235)    (0.1220) (0.1062) (0.1135) (0.1847)    (0.1192) (0.0770) (0.1097) (0.1622)
(100, 50)    0.8483   0.7935   0.7841   0.2218      0.8530   0.9048   0.7773   0.1272      0.8568   0.9467   0.7765   0.0804
             (0.1050) (0.1933) (0.1044) (0.2420)    (0.1037) (0.0991) (0.0966) (0.1475)    (0.1057) (0.0739) (0.0970) (0.1296)
(100, 70)    0.8707   0.8062   0.8030   0.2031      0.8697   0.9092   0.7918   0.1119      0.8739   0.9535   0.7912   0.0637
             (0.0978) (0.2192) (0.0999) (0.2808)    (0.0976) (0.0975) (0.0912) (0.1387)    (0.0947) (0.0682) (0.0867) (0.1075)
(200, 10)    0.8697   0.6501   0.8178   0.6744      0.8703   0.7940   0.8038   0.6103      0.8753   0.8882   0.7990   0.4721
             (0.0771) (0.1375) (0.0786) (0.2677)    (0.0766) (0.1035) (0.0744) (0.3021)    (0.0748) (0.0754) (0.0705) (0.3402)
(200, 70)    0.9404   0.7892   0.8674   0.2829      0.9476   0.8841   0.8644   0.2276      0.9509   0.9378   0.8620   0.1565
             (0.0429) (0.1144) (0.0445) (0.1503)    (0.0409) (0.0738) (0.0392) (0.1442)    (0.0422) (0.0568) (0.0396) (0.1513)
(1000, 10)   0.9675   0.5090   0.9198   1.2077      0.9670   0.7041   0.8999   1.2131      0.9697   0.8341   0.8894   1.1399
             (0.0154) (0.0675) (0.0189) (0.2175)    (0.0154) (0.0568) (0.0174) (0.2426)    (0.0144) (0.0419) (0.0152) (0.2432)