LASSO for Stochastic Frontier Models with Many
Efficient Firms∗
William C. Horrace† Hyunseok Jung‡ Yoonseok Lee§
March 2022
Abstract
We apply the adaptive LASSO (Zou, 2006) to select a set of maximally efficient firms
in the panel fixed-effect stochastic frontier model. The adaptively weighted L1 penalty
with sign restrictions for firm-level inefficiencies allows simultaneous estimation of the
maximal efficiency and firm-level inefficiency parameters, which results in a faster rate
of convergence of the corresponding estimators than the least-squares dummy variable
approach. We show that the estimator possesses the oracle property and selection
consistency still holds with our proposed tuning parameter selection criterion. We also
propose an efficient optimization algorithm based on coordinate descent. We apply
the method to estimate a group of efficient police officers who are best at detecting
contraband in motor vehicle stops (i.e. search efficiency) in Syracuse, NY.
Keywords: Panel Data, Fixed-Effect Stochastic Frontier Model, Adaptive LASSO, L1
Regularization, Sign Restriction, Zero Inefficiency.
∗We are grateful to Badi Baltagi, Christopher Parmeter and the participants at the 15th European Workshop on Efficiency and Productivity Analysis, the 28th Annual Meeting of the Midwest Econometrics Group and the International Association for Applied Econometrics 2019 Annual Conference for their valuable comments and suggestions. All errors are our own.
†Department of Economics, Syracuse University, Syracuse, NY 13244. whorrace@syr.edu
‡Corresponding author: Department of Economics, University of Arkansas, Fayetteville, AR 72701. hj020@uark.edu
§Department of Economics, Syracuse University, Syracuse, NY 13244. ylee41@syr.edu
1 Introduction
Stochastic frontier (SF) models for panel data typically estimate firm-level efficiency from
firm fixed-effects and rank them to identify a single firm in the sample as the most efficient
firm. That is, SF estimators do not identify efficiency ties in general, yet there may be
several firms in the sample tied for most efficient, particularly in competitive markets.
There exist some methodologies to identify multiple efficient firms in the literature, but
they rely on strong distributional assumptions and use two-step procedures. In the first
step, firm-level efficiencies (or equivalent measures) are estimated, and in the second step a
separate inference technique or a selection criterion is used to determine membership in a
subset of the most efficient firms. For example, for the parametric SF model of Aigner, Lovell
and Schmidt (1977), Horrace and Schmidt (1996), Simar and Wilson (2009), and Wheat,
Greene and Smith (2014) estimate the efficiency using Jondrow, Lovell, Materov and Schmidt
(JLMS, 1982) and construct univariate prediction intervals to identify multiple efficient firms
that are statistically indistinguishable from the most efficient firm. Horrace (2005) and
Flores-Lagunes, Horrace and Schnier (2007) extend this to multivariate intervals that account
for the multiplicity inherent in the ranked estimates, and Horrace and Schmidt (2000) develop
multivariate intervals for the fixed-effect SF model of Schmidt and Sickles (1984). Despite
the semi-parametric nature of the fixed-effect model, these inference techniques still rely
on a parametric assumption on the distribution of estimated efficiencies (i.e., that they
are normally distributed or asymptotically so). More recently, Kumbhakar, Parmeter and
Tsionas (2013) propose a zero inefficiency stochastic frontier model for cross sectional data
that produces a subset of firms in the sample that are fully efficient. They estimate the
probability of a firm falling into the zero inefficiency regime using a latent class model,
then use the probability to determine efficient firms. However, this approach still relies on
parametric distributional assumptions and a two-step procedure.1
In this paper, we explicitly assume that some fraction of firms in the panel are fully
efficient and develop a one-step, semi-parametric procedure for identifying a subset of efficient
firms using the adaptive LASSO (Zou, 2006). Specifically, the proposed approach proceeds as
the existing least squares dummy variable (LSDV) estimation, but the objective function is
augmented with the adaptively weighted shrinkage L1 penalty for the firm-level inefficiencies.
Since the inefficiency parameters are constrained to be non-negative in the model, we impose
sign restrictions on the parameters.2 The estimation procedure hence identifies a subset of
firm-level inefficiencies as exactly zero, which is an interesting feature of our model compared
to the conventional LASSO where identification of non-zero coefficients is of primary interest.
The LASSO has been applied to various selection problems, but our paper is the first to
consider its application to the stochastic frontier models for the identification of efficient
firms. We also propose an efficient optimization algorithm based on the coordinate descent
method, which significantly reduces computational costs.
Our estimator requires inefficiencies to be time-invariant, so it is best deployed when
measures of average sample inefficiency are appropriate or desired. If high frequency data
are available over a short time period (e.g., a year), the time-invariance assumption is arguably
reasonable.3 There are more flexible specifications that allow inefficiency to vary over time
while accounting for time-invariant firm heterogeneity (e.g., Greene, 2005, “true fixed-effect
SF model”), but they often come at the cost of more parametric assumptions. We discuss
these models and limitations of time-invariant inefficiency in greater detail in the next section.
1 Rho and Schmidt (2015) raise an identification issue for this model.
2 So, our proposed method is a constrained LASSO. The constrained LASSO has been introduced in the statistics literature (e.g., Hua et al., 2015), but to the best of our knowledge it has not yet been used in an economic context.
3 Our empirical example uses high frequency stop/search/arrest data from the Syracuse Police Department to estimate officer-level time-invariant search efficiency for a single year. Technically, our high frequency data are in the cross-sectional dimension, as each officer performs a certain number of vehicle searches in a year, and the analysis proceeds without regard to continuous time. In this sense, "high frequency" in this paper has a broader definition than it may first appear.
We analyze the asymptotic properties of the proposed estimator for the case $(N, T) \to \infty$,
where N is the number of firms and T is the number of time periods in the sample. We allow
for time-series dependence and cross-sectional heteroskedasticity in errors and covariates,
which is new for the analysis of panel SF models in the literature.4 Also, in our approach,
N is allowed to be much larger than T under proper moment conditions. In this case,
modeling a group of efficient firms is more desirable since the existence of many efficient
firms should be more apparent when markets are large and competitive. We show that the
proposed estimator consistently identifies a set of true efficient firms when the efficiency gap
between the efficient and the inefficient groups vanishes slowly enough with the sample size
and errors and covariates satisfy proper moment conditions. The LASSO estimator of the
maximal efficiency shows $\sqrt{\delta NT}$ consistency, where $\delta$ is the proportion of fully efficient firms in the sample, while the LSDV estimator exhibits $\sqrt{T}/(\log N)^2$ consistency. Consequently,
the LASSO estimators outperform LSDV in many panels, including short panels. This is
borne out in our simulation study. We also study the tuning parameter selection for the
LASSO, with which selection consistency still holds.
We apply our technique to a 2006 panel of Syracuse, NY, police stop/search/arrest data,
previously analyzed by Horrace and Rohlin (2016). Their focus was on estimating vehicle
stop rate differentials by race, and testing their significance for the entire force, ignoring
officer identifiers in the data. We use a linear probability model to model arrest rates
conditional on a vehicle search with time-invariant officer fixed effects, while controlling for
other features of officer ability and patrol assignment. Our LASSO technique identifies a
group of 45 out of 139 officers who are efficient at vehicle searches in 2006 (i.e., 45 officers
who are best at uncovering illegal items leading to a motorist arrest). Policy-makers might
use our linear probability model and the LASSO to identify a subset of efficient officers for
4 Park, Sickles and Simar (1998) study the asymptotic properties of the LSDV estimators under i.i.d. data. In this paper, we derive the convergence rates of the LSDV estimators under our setup.
performance recognition, for example.
The rest of this paper is organized as follows. The next section introduces the model and
the adaptive LASSO estimator. Section 3 provides some technical assumptions and derives
the oracle property of the estimator. Section 4 discusses the optimization algorithm and tuning parameter selection. Sections 5 and 6 provide simulation and empirical application results, and Section 7 concludes. All the proofs and additional simulation results are given in the
online supplementary material.
2 LASSO for Identifying Efficient Firms
2.1 LSDV estimation
We consider the panel SF model with time-invariant technical inefficiency (e.g., Schmidt and
Sickles, 1984) given as
$$y_{it} = \alpha_0 + x_{it}'\beta_0 + v_{it} - u_{0,i} \quad (1)$$

for $i = 1, ..., N$ and $t = 1, ..., T$, where $y_{it}$ is the logarithm of the scalar output of the $i$th firm in the $t$th period, $\alpha_0$ is a common intercept, $x_{it}$ is the logarithm of a $p \times 1$ input vector, and $\beta_0$ is the corresponding $p \times 1$ parameter vector of marginal effects. The regression equation has two error terms: the first, $v_{it}$, is a two-sided noise with $E[v_{it}|x_{it}, u_{0,i}] = 0$, and the second, $u_{0,i}$, is a time-invariant firm-specific inefficiency, which can be arbitrarily correlated with $x_{it}$. We suppose no cross-sectional dependence, but allow time-series dependence over the errors and covariates. Unlike the standard fixed-effect panel regression models, we restrict $u_{0,i} \ge 0$ for all $i$, but we do not impose any distributional assumption on this inefficiency.
The assumption of time-invariant inefficiency $u_{0,i}$ is somewhat restrictive, especially when we study
panel data over a long period of time. However, as mentioned earlier, if high frequency
data are available and measures of average sample inefficiency are desired, this approach can
still be employed.5 There are SF models that specify time-varying inefficiency with some
restrictive functional forms (e.g., Ahn, Lee and Schmidt, 2007). However, it is unclear how
to apply the LASSO in these cases to model a group of efficient firms.
Another limitation due to the time-invariant inefficiency is that the model (1) cannot
identify marginal effects for time-invariant regressors in xit, so their marginal effects are
absorbed into the inefficiency term. In this case, the interpretation of the firm-specific ineffi-
ciency u0,i can be subtle. There are SF models that include an additional term that accounts
for such time-invariant firm heterogeneity (e.g., Greene, 2005), but these models generally
require distributional assumptions on the noise and inefficiency terms. Our interest in this
paper is to specify a model amenable to semi-parametric estimation, so these approaches are
not appropriate for our case. If firm heterogeneity is more of a concern, practitioners may use
the panel data estimator of Hausman and Taylor (1981) where a set of time-invariant regres-
sors or instruments that are not correlated with fixed effects are used to control individual
heterogeneity. Another method one can adopt to minimize time-invariant heterogeneity is
the within-a-category comparison proposed by Feng and Horrace (2007) where comparisons
of fixed effects are made within groups of relatively homogeneous firms, rather than across
all the firms.
Existing studies estimate (1) using the standard least squares dummy variable (LSDV)
5 Fixed effects are still more generally employed to measure agent-specific effects. Examples include measurements of teacher quality (Rothstein, 2010; Chetty et al., 2014), school value-added (Angrist et al., 2017) and hospital efficacy (Friedson et al., 2019), where the quantities of interest are calculated from some form of fixed effects. Moreover, if high-frequency data for multiple periods (e.g., years) are available, we can partition the data along the time dimension (e.g., by year) and deploy separate high frequency models for each partition, while allowing inefficiency to vary over time.
method. More precisely, we rewrite (1) as
$$y_{it} = \alpha_{0,i} + x_{it}'\beta_0 + v_{it}, \quad (2)$$

where $\alpha_{0,i} = \alpha_0 - u_{0,i}$ is the firm-specific fixed effect. We can consistently estimate $\alpha_{0,i}$ (as $T \to \infty$) and $\beta_0$ (as $N$ or $T \to \infty$) by the standard within estimation, denoting the estimators $\hat{\alpha}_i$ and $\hat{\beta}$, respectively, provided $x_{it}$ does not include any time-invariant variables. In the LSDV approach, the frontier parameter $\alpha_0$ is estimated as

$$\hat{\alpha} = \max_{1 \le i \le N} \hat{\alpha}_i, \quad (3)$$

which can be verified to be consistent for $\alpha_0$ as $(N, T) \to \infty$ under the assumption that the density of $u_{0,i}$ is nonzero in the neighborhood of zero, so that $\min_{1 \le i \le N} u_{0,i} \to 0$ as $N \to \infty$ with probability approaching one (w.p.a.1) and consequently $\max_{1 \le i \le N} \alpha_{0,i} \to \alpha_0$ as $N \to \infty$ (e.g., Greene, 1980; Schmidt and Sickles, 1984). The individual firm inefficiency $u_{0,i}$ is then consistently estimated as

$$\hat{u}_i = \hat{\alpha} - \hat{\alpha}_i.$$

The estimator $\hat{\alpha}$ represents the maximal efficiency in the sample, and we interpret $\hat{u}_i$ as the inefficiency relative to the most efficient firm.
In practice, it is very unlikely that there are ties in the estimates $\hat{u}_i$. For this reason, all firms have strictly positive $\hat{u}_i$ values except the firm estimated as most efficient in the sample. Therefore, this approach has the limitation that it can identify only one (relatively most) efficient firm, even when there are multiple efficient firms with $u_{0,i} = 0$.
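To fix ideas, the LSDV steps in (2)-(3) can be sketched as follows. This is an illustrative sketch rather than the authors' code; the array layout ($y$ of shape $N \times T$, $x$ of shape $N \times T \times p$) and the function name are our own conventions.

```python
import numpy as np

def lsdv_frontier(y, x):
    """LSDV estimation of (2): within estimator for beta, then
    alpha_hat = max_i alpha_hat_i as in (3) and u_hat_i = alpha_hat - alpha_hat_i.

    y : (N, T) array of outputs; x : (N, T, p) array of inputs.
    """
    N, T, p = x.shape
    # within transformation: demean each firm's series over t
    y_w = y - y.mean(axis=1, keepdims=True)
    x_w = x - x.mean(axis=1, keepdims=True)
    beta = np.linalg.lstsq(x_w.reshape(N * T, p),
                           y_w.reshape(N * T), rcond=None)[0]
    alpha_i = (y - x @ beta).mean(axis=1)   # fixed-effect estimates alpha_hat_i
    alpha = alpha_i.max()                   # frontier estimate, equation (3)
    u = alpha - alpha_i                     # relative inefficiencies u_hat_i
    return beta, alpha, u
```

With noise-free data, the within regression recovers $\beta_0$ exactly and each $\hat{u}_i$ equals the gap to the largest fixed effect.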
2.2 Adaptive LASSO estimation
To overcome the aforementioned limitation in the LSDV approach, we instead estimate
(1) using the adaptive least absolute shrinkage and selection operator (adaptive LASSO) method, from which we can identify multiple efficient firms (i.e., all firms whose true $u_{0,i}$ is zero) by shrinking small values of $\hat{u}_i$ toward zero.
To this end, we first assume the following sparsity condition. We let $S = \{i : u_{0,i} = 0\}$ be the index set of efficient firms and $|C|$ denote the cardinality of a set $C$.
Assumption 1 $\delta = |S|/N \to \delta_0 \in (0, 1)$ as $N \to \infty$.
This sparsity assumption implies that $|S|$ firms in the sample are efficient and that the fraction of efficient firms does not vanish as $N \to \infty$, which plays an important role in the asymptotic analysis later. Note that the model (1) becomes the standard fixed-effect SF model when $|S| = 1$ and hence $\delta_0 = 0$; it becomes the neoclassical production model when $|S| = N$ and hence $\delta_0 = 1$. Although we suppose $p = \dim(\beta_0)$ is fixed in this paper, we can also allow $p$ to
increase with N and assume sparsity on β0, under which we can identify nonzero elements of
β0 as well. However, this result is already well-studied (e.g., Belloni, Chernozhukov, Hansen
and Kozbur, 2016; Caner, Han and Lee, 2018), so we focus on shrinkage estimators of u0,i in
this paper.
We let $\hat{\beta}$ be a consistent estimator of $\beta_0$ from (2), such as the standard within estimator:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \sum_{t=1}^{T} (\tilde{y}_{it} - \tilde{x}_{it}'\beta)^2, \quad (4)$$

where $\tilde{y}_{it} = y_{it} - \bar{y}_i$ with $\bar{y}_i = (1/T)\sum_{s=1}^{T} y_{is}$, and similarly for $\tilde{x}_{it}$. After concentrating out
$\beta_0$ in (1), the adaptive LASSO estimator for $\theta_0 = (\alpha_0, u_{0,1}, ..., u_{0,N})'$ is defined as

$$\hat{\theta}(\lambda) = (\hat{\alpha}(\lambda), \hat{u}_1(\lambda), ..., \hat{u}_N(\lambda))' = \arg\min_{\alpha, u_1, ..., u_N;\ u_i \ge 0\ \forall i} \left\{ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} - \alpha + u_i \right)^2 + \lambda \sum_{i=1}^{N} \pi_i u_i \right\}, \quad (5)$$

where $\lambda > 0$ is a tuning parameter and $\{\pi_i\}_{i=1}^{N}$ are data-dependent weights obtained from consistent initial estimates of $u_{0,i}$. In particular, we let $\pi_i = \hat{u}_i^{-\gamma}$ for some $\gamma > 1$, where $\hat{u}_i$ is the LSDV estimator described in the previous section.6 Unlike the original LASSO of Tibshirani (1996), the adaptive LASSO allows for unequal shrinkage across parameters depending on the data-dependent weight $\pi_i$, which yields the oracle property (see Fan and Li, 2001; Zou, 2006). However, it should be emphasized that $\hat{\theta}(\lambda)$ in (5) differs from the standard adaptive LASSO estimator of Zou (2006) since we impose sign restrictions $u_i \ge 0$ for all $i$ on a diverging number of parameters.
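For concreteness, the adaptive weights and the penalized objective in (5) can be written out as below. This is a sketch under our own array conventions; `adaptive_weights` and `lasso_objective` are hypothetical names, and the $1/N$ floor for a zero initial estimate follows footnote 6.

```python
import numpy as np

def adaptive_weights(u_lsdv, gamma=2.0):
    """pi_i = u_hat_i^{-gamma} from the initial LSDV estimates; the firm with
    u_hat_i = 0 is given the small value 1/N (footnote 6) so its weight is finite."""
    u = np.where(u_lsdv == 0, 1.0 / len(u_lsdv), u_lsdv)
    return u ** (-gamma)

def lasso_objective(alpha, u, y, x, beta, lam, pi):
    """Penalized least-squares objective in (5) for a candidate (alpha, u)
    with u_i >= 0. y : (N, T), x : (N, T, p)."""
    resid = y - x @ beta - alpha + u[:, None]
    return (resid ** 2).sum() + lam * (pi * u).sum()
```

A larger weight $\pi_i$ (a smaller initial $\hat{u}_i$) makes the penalty on $u_i$ heavier, which is exactly what pushes near-efficient firms to exact zeros.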
One important remark on (5) is that we estimate α0 and (u0,1, ..., u0,N)′ together in
one step. This is not feasible in the standard fixed-effect SF model because of the perfect
multicollinearity between the constant term and the individual dummies. In contrast, this
one-step estimation is feasible in our case due to the sparsity assumption and L1 penalty
term, which eliminates some of the individual dummies.
The main goal of this estimation is to identify two groups: efficient and inefficient firms.
Therefore, this approach seems similar to Bonhomme and Manresa (2015), who also consider
a latent group structure problem determined by group-specific fixed effects. However, their
methodology relies on minimization of a least squares criterion with respect to all possible
groupings, whereas we use the LASSO technique to identify the latent groups (efficient firms
vs. inefficient firms) under sign restrictions on the fixed effects.
6 Note that in the LSDV estimation, the firm with the largest fixed-effect estimate has a zero inefficiency estimate. For $\hat{u}_i = 0$, we use an arbitrarily small value (e.g., $1/N$) to construct the weight.
The adaptive LASSO problem in (5) is also related to the latent group structure model
by Su, Shi and Phillips (2016),7 or the fused LASSO by Tibshirani, Saunders, Rosset, Zhu
and Knight (2005). They penalize over pairwise-differences among the coefficient values
to produce group identification. However, our problem is different from theirs because we
impose sign restrictions on u0,i and allow that the size of the smallest (non-zero) inefficiency
shrinks to zero at a proper rate. In comparison, Su, Shi and Phillips (2016) assume that the
group-specific parameters in their model are separated from each other by a fixed distance.
3 Oracle Properties
The adaptive LASSO allows for unequal shrinkage for each parameter and results in the oracle property. This oracle property extends to our case under $(N, T)$-asymptotics, where $N$ can grow faster than $T$ when the errors and covariates satisfy proper moment conditions.
We assume the following conditions in our asymptotic analysis. We define

$$\eta = \min_{i \in S^c} u_{0,i} > 0.$$

Assumption 2 (1) (i) $E[v_{it}|x_{it}, u_{0,i}] = 0$ for all $i$ and $t$, and $\{(x_{it}, v_{it}) : t = 1, ..., T\}$ are independent over $i$; (ii) for each $i$, $\{(x_{it}, v_{it}) : t = 1, ..., T\}$ is strong mixing with mixing coefficients $\alpha[t] \le c_\alpha \rho^t$ for some $c_\alpha > 0$ and $\rho \in (0, 1)$; (iii) $\sup_{i \ge 1} \sup_{t \ge 1} E\|x_{it}\|^q < \infty$ and $\sup_{i \ge 1} \sup_{t \ge 1} E|v_{it}|^q < \infty$ for some $q \ge 4$.

(2) As $(N, T) \to \infty$, (i) $\hat{\beta} - \beta_0 = O_p((NT)^{-1/2})$; (ii) $T^{1/2}(\log N)^{-1} \to \infty$ and $N T^{1-q/2} (\log T)^{2q} \to 0$; (iii) $T^{1/2}(\log N)^{-1} \eta \to \infty$.

(3) As $(N, T) \to \infty$, $\lambda T^{-1/2} N^{1/2} \eta^{-\gamma} \to 0$ and $\lambda T^{(\gamma-1)/2} (\log N)^{-\gamma-1} \to \infty$ for some $\gamma > 1$.
7 Kutlu, Tran and Tsionas (2020) apply the shrinkage technique of Su, Shi and Phillips (2016) to parametric SF models to identify groups of firms sharing the same slope parameters.
In Assumption 2-(1), we rule out cross-sectional dependence but allow for time-series dependence and heteroskedasticity in the errors and covariates. In Assumption 2-(1)-(ii) and (iii), we require $(x_{it}, v_{it})$ to be a strong mixing process over $t$ with a geometric decay rate, and further restrict the moments of $\|x_{it}\|$ and $|v_{it}|$ to be finite up to a certain order. The tail restrictions and finite moment conditions allow us to use exponential inequalities for strong mixing processes (e.g., Merlevede, Peligrad and Rio, 2009) to bound misclassification probabilities and achieve selection consistency.8
Assumption 2-(2)-(i) holds for general M-estimators such as the within estimator under $(N, T) \to \infty$. Assumptions 2-(2)-(ii), (iii) and 2-(3) impose rate conditions on $N, T, \eta$ and $\lambda$ to ensure both selection and estimation consistency. Assumption 2-(2)-(ii) allows $N$ to grow much faster than $T$ when $q$ is large (i.e., when the tail probability of the error decays quickly). It therefore covers many panel structures, including short panels. Allowing for large $N$ (i.e., a large market) relative to $T$ is useful in our context since we consider time-invariant inefficiency and assume many efficient firms. The rate conditions also control the magnitude of the tuning parameter $\lambda$, so the LASSO procedure can select the zero coefficients correctly without yielding bias in the nonzero coefficient estimators in the limit.9 The assumption allows the nonzero inefficiencies to be close to zero (i.e., $\eta$ can be very small), but they must shrink slowly enough to be distinguished from the zero coefficients and to remain unaffected by shrinkage estimation.
We first derive the following lemma, which shows that the LSDV estimator of the frontier parameter $\alpha_0$ summarized in Section 2.1 is consistent.10

Lemma 1 Recall that $\hat{\alpha} = \max_{1 \le i \le N} \hat{\alpha}_i$, where $\hat{\alpha}_i$ is the LSDV estimator of $\alpha_{0,i}$ in (2). Then, under Assumptions 1, 2-(1) and 2-(2), as $(N, T) \to \infty$, $\hat{\alpha} - \alpha_0 = O_p\left((\log N)/T^{1/2}\right)$, where $\alpha_0$ is defined in (1).

8 Alternatively, exponential moment conditions can be employed as in Bonhomme and Manresa (2015).
9 In particular, Assumption 2-(3) implies that $\lambda$ should decrease as $N$ increases when $N \gg T$.
10 This lemma serves as a technical lemma to prove the theorems in this paper and also allows us to compare the convergence rate of $\hat{\alpha}$, the LSDV estimator, with that of $\hat{\alpha}(\lambda)$, the LASSO estimator.

This lemma implies $\hat{\alpha}$ is estimated from one of the efficient firms in the sample w.p.a.1 (i.e., $\Pr(\hat{\alpha} = \max_{i \in S} \hat{\alpha}_i) \to 1$ as $(N, T) \to \infty$). The rate in this lemma is identical to that derived in Park, Sickles and Simar (1998), but their result is obtained under i.i.d. data with exponential moment conditions imposed on the errors and covariates.11 The lemma can thus be seen as a generalization of their result.
Now we turn to the LASSO estimators. Let $\hat{S} = \{i : \hat{u}_i(\lambda) = 0\}$. We first establish the selection consistency.

Theorem 1 Suppose Assumptions 1 and 2 hold. Then, $\Pr(\hat{S} = S) \to 1$ as $(N, T) \to \infty$.
Theorem 1 implies that the LASSO consistently identifies the two latent groups provided the rate conditions on $N, T, \eta$ and $\lambda$ are satisfied. This, in turn, implies that in the limit the latent groups can be treated as known (i.e., the oracle information) and used for the estimation of $\alpha_0$ and the inefficiencies to improve their convergence rates.
We introduce the following assumptions and notations for the limiting distributions of
the LASSO estimators.
Assumption 3 (i) There exist positive constants $\sigma^2_{S_1}$, $\sigma^2_{S_2}$, $\sigma_{S_1 S_2}$, $\sigma^2_{S^c}$ and $\sigma^2_i$ for each $i \in S^c$ such that

$$\sigma^2_{S_1} = \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} v_{it} v_{ik}$$

$$\sigma^2_{S_2} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} x_{ik}' \right\} H_0^{-1} \Upsilon_S$$

$$\sigma_{S_1 S_2} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{\delta NT} \sum_{i \in S} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} \right\}$$

$$\sigma^2_{S^c} = \Upsilon_S' H_0^{-1} \left\{ \operatorname{plim}_{N,T\to\infty} \frac{1}{(1-\delta)NT} \sum_{i \in S^c} \sum_{t=1}^{T} \sum_{k=1}^{T} x_{it} v_{it} v_{ik} x_{ik}' \right\} H_0^{-1} \Upsilon_S$$

$$\sigma^2_i = \operatorname{plim}_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{T} v_{it} v_{ik},$$

where $\Upsilon_S = \operatorname{plim}_{N,T\to\infty} (\delta NT)^{-1} \sum_{i \in S} \sum_{t=1}^{T} x_{it}$ and $H_0 = \operatorname{plim}_{N,T\to\infty} (NT)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} x_{it} x_{it}' > 0$;

(ii) As $(N, T) \to \infty$, $(\delta NT)^{-1/2} \sum_{i \in S} \sum_{t=1}^{T} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S_1})$, $(\delta NT)^{-1/2} \sum_{i \in S} \sum_{t=1}^{T} \Upsilon_S' H_0^{-1} x_{it} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S_2})$, $((1-\delta)NT)^{-1/2} \sum_{i \in S^c} \sum_{t=1}^{T} \Upsilon_S' H_0^{-1} x_{it} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_{S^c})$ and $T^{-1/2} \sum_{t=1}^{T} v_{it} \xrightarrow{d} \mathcal{N}(0, \sigma^2_i)$ for each $i \in S^c$.

11 Recall that we impose only finite moment conditions for the errors and covariates and allow for time-series dependence.
Theorem 2 Suppose Assumptions 1, 2 and 3 hold. Then, as $(N, T) \to \infty$,

(i) $\sqrt{\delta NT}\,(\hat{\alpha}(\lambda) - \alpha_0) \xrightarrow{d} \mathcal{N}\left(0,\ \sigma^2_{S_1} + \delta^2 \sigma^2_{S_2} - 2\delta \sigma_{S_1 S_2} + \delta(1-\delta) \sigma^2_{S^c}\right)$;

(ii) $\sqrt{T}\,(\hat{u}_i(\lambda) - u_{0,i}) \xrightarrow{d} \mathcal{N}(0, \sigma^2_i)$ for each $i \in S^c$.
Theorem 2 shows that we can efficiently estimate the frontier and the firm-level inefficiency parameters using the LASSO estimator. Combined with Theorem 1, it therefore establishes the oracle property of the adaptive LASSO estimators. It is worth noting that $\hat{\alpha}(\lambda) - \alpha_0 = O_p\left((\delta NT)^{-1/2}\right)$, a much faster rate than that of the LSDV estimator, $\hat{\alpha}$, in Lemma 1. This is quite an intuitive result: the LSDV estimator uses only a single best firm's observations, whereas $\hat{\alpha}(\lambda)$ uses the $|S| \cdot T$ observations of the firms identified as efficient by the LASSO. As long as $\delta$ does not vanish as $N \to \infty$, which is a reasonable assumption for competitive markets, the LASSO estimator will hence be preferred.
4 Computation
4.1 Optimization algorithm
The $L_1$ penalty term in the LASSO objective function has no second derivative at the origin, so we cannot directly apply standard quadratic optimization algorithms such as Newton-Raphson. Many alternative optimization algorithms have been developed: local quadratic approximation (Fan and Li, 2001), least angle regression (Efron, Hastie, Johnstone and Tibshirani, 2004), and the coordinate descent algorithm (Friedman, Hastie and Tibshirani, 2010), among others.
In this section, we derive an efficient coordinate descent algorithm that accounts for the sign restrictions in our model. The algorithm uses preliminary inefficiency ranking information from the initial LSDV estimation, which allows us to skip a large number of irrelevant optimization steps.
The efficiency of our proposed optimization procedure can be understood as follows. In our problem, the Karush-Kuhn-Tucker (KKT) conditions12 for (5) imply that the coordinate descent algorithm boils down to successively updating $\hat{\alpha}(\lambda), \hat{u}_1(\lambda), ..., \hat{u}_N(\lambda)$ based on the following two equations:

$$\hat{\alpha}(\lambda) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} + \hat{u}_i(\lambda) \right)$$

$$\hat{u}_i(\lambda) = \max\left\{ 0,\ \hat{\alpha}(\lambda) - \frac{1}{T} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} \right) - \frac{\lambda}{2T} \pi_i \right\} \quad \text{for } i = 1, ..., N. \quad (6)$$
We note that the ordering of $\hat{\alpha}(\lambda) - \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}'\hat{\beta})$ in (6) follows the ordering of $\hat{u}_i$, since $\hat{u}_i = \hat{\alpha} - \hat{\alpha}_i$ and $\hat{\alpha}_i = \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}'\hat{\beta})$. Moreover, the shrinkage effect from the penalty term in (6), $\frac{\lambda}{2T}\pi_i$, is larger for smaller $\hat{u}_i$ since $\pi_i = \hat{u}_i^{-\gamma}$ with $\gamma > 1$. This implies that, for a given $\lambda$, if $\hat{u}_i \le \hat{u}_j$, then $\hat{u}_i(\lambda) \le \hat{u}_j(\lambda)$. Therefore, we can skip updates for all firms $i$ with $\hat{u}_i \le \hat{u}_j$ (and identify them as efficient firms) once $\hat{u}_j(\lambda)$ shrinks to 0.13 This reduces computational costs significantly when $N$ is large. Our proposed algorithm based on this idea is summarized as follows.

12 See the proof of Theorem 1 in the online supplementary material for more details.
1. Using $\hat{\beta}$ from the initial estimation, let
$$\hat{\alpha}^{(0)}_i = \frac{1}{T} \sum_{t=1}^{T} (y_{it} - x_{it}'\hat{\beta}), \quad \hat{\alpha}^{(0)} = \max_{1 \le i \le N} \hat{\alpha}^{(0)}_i, \quad \text{and} \quad \hat{u}^{(0)}_i = \hat{\alpha}^{(0)} - \hat{\alpha}^{(0)}_i$$
for each $i$. Define order statistics $\hat{\alpha}^{(0)}_{[1]} \le \hat{\alpha}^{(0)}_{[2]} \le \cdots \le \hat{\alpha}^{(0)}_{[N]}$ and $\hat{u}^{(0)}_{[1]} \ge \hat{u}^{(0)}_{[2]} \ge \cdots \ge \hat{u}^{(0)}_{[N]}$, so that $\hat{\alpha}^{(0)}_{[N]} = \max_{1 \le i \le N} \hat{\alpha}^{(0)}_i$ and $\hat{u}^{(0)}_{[N]} = \min_{1 \le i \le N} \hat{u}^{(0)}_i$. In this step, we have only one fully efficient firm, with $\hat{u}_{[N]} = \hat{u}^{(0)}_{[N]} = 0$.

2. For a given $\lambda$, check the KKT condition for the second best firm based on the sign of
$$\Delta_{[N-1]} = \hat{u}^{(0)}_{[N-1]} - \frac{\lambda}{2T} \pi_{[N-1]}.$$
In particular, if $\Delta_{[N-1]} \le 0$, let $\hat{u}^{(1)}_{[N-1]} = 0$ and update $\hat{\alpha}^{(0)}$ as $\hat{\alpha}^{(1)} = (\hat{\alpha}^{(0)}_{[N]} + \hat{\alpha}^{(0)}_{[N-1]})/2$. Using this new frontier parameter estimate $\hat{\alpha}^{(1)}$, update the rest of the inefficiencies as $\hat{u}^{(1)}_{[N-1-j]} = \hat{u}^{(0)}_{[N-1-j]} - (\hat{\alpha}^{(0)} - \hat{\alpha}^{(1)})$ for all $j \le N - 2$. If $\Delta_{[N-1]} > 0$, go to Step 4 below.

3. Sequentially repeat Step 2 for each $\Delta_{[N-k]}$, $k = 2, 3, ..., N-1$, as long as $\Delta_{[N-k]} \le 0$ holds. For each $k$, we let $\hat{u}^{(k)}_{[N-k]} = 0$ and update $\hat{\alpha}^{(k-1)}$ as $\hat{\alpha}^{(k)} = (1/(k+1)) \sum_{j=0}^{k} \hat{\alpha}^{(k-1)}_{[N-j]}$. We also update $\hat{u}^{(k)}_{[N-1-j]} = \hat{u}^{(k-1)}_{[N-1-j]} - (\hat{\alpha}^{(k-1)} - \hat{\alpha}^{(k)})$ for all $k \le j \le N - 2$.

4. If $\Delta_{[N-k]} > 0$ at some $k \ge 1$, update the non-zero inefficiencies (i.e., $\hat{u}_{[N-j]} > 0$ for $k \le j \le N - 1$) as $\hat{u}^{(k)}_{[N-j]} = \hat{u}^{(k-1)}_{[N-j]} - \frac{\lambda}{2T} \pi_{[N-j]}$ for all $k \le j \le N - 1$, and then report the results.14

13 This reasoning readily applies to the balanced panel case. For an unbalanced panel, the shrinkage effect is $(\lambda/2T_i)\pi_i$, where $T_i$ is the number of time periods for firm $i$, which does not necessarily preserve the ordering of $\hat{u}_i$. In this case, the standard coordinate descent algorithm based on the two equations above can be used.
This coordinate descent algorithm exploits the convexity of the objective function and the preliminary inefficiency ranking at the same time, which enables us to reach the minimum of the objective function quickly.
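Steps 1-4 above can be sketched as follows, assuming a balanced panel and taking the LSDV fixed-effect estimates $\hat{\alpha}_i$ as input. The function and variable names are ours, and the sketch performs a single ranking-based forward pass rather than iterating the two updates in (6).

```python
import numpy as np

def adaptive_lasso_frontier(alpha_hat, lam, T, gamma=2.0):
    """Ranking-based coordinate descent (Steps 1-4) for (5)-(6), balanced panel.

    alpha_hat : (N,) LSDV fixed-effect estimates alpha_hat_i.
    Returns (alpha_lasso, u_lasso); efficient firms get u_i(lambda) exactly 0.
    """
    N = len(alpha_hat)
    order = np.argsort(alpha_hat)          # ascending; order[-1] is the best firm
    a = alpha_hat[order]
    front = a[-1]                          # Step 1: initial frontier max_i alpha_hat_i
    u = front - a                          # initial LSDV inefficiencies; u[-1] = 0
    # adaptive weights pi_i = u_i^{-gamma}; the zero estimate gets 1/N (footnote 6)
    pi = np.where(u == 0, 1.0 / N, u) ** (-gamma)

    k = 0                                  # number of extra firms declared efficient
    while k < N - 1:                       # Steps 2-3: absorb firms while KKT allows
        j = N - 2 - k                      # position of the next-best candidate firm
        if u[j] - lam * pi[j] / (2 * T) > 0:
            break                          # Delta_[N-k] > 0: stop absorbing
        k += 1
        new_front = a[N - 1 - k:].mean()   # average alpha over the efficient set
        u[N - 1 - k:] = 0.0
        u[:N - 1 - k] -= front - new_front # shift the remaining inefficiencies
        front = new_front
    # Step 4: soft-threshold the remaining strictly positive inefficiencies
    pos = u > 0
    u[pos] = np.maximum(u[pos] - lam * pi[pos] / (2 * T), 0.0)

    u_out = np.empty(N)
    u_out[order] = u                       # map back to the original firm ordering
    return front, u_out
```

Because the weights are fixed at their initial values and the candidates are visited in rank order, each firm is touched at most once, which is the source of the computational savings described above.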
4.2 Tuning parameter choice
The performance of the adaptive LASSO estimator relies on an appropriate choice of the tuning parameter $\lambda$. Methods based on cross-validation and AIC-type criteria are known to result in over-selection (i.e., too many nonzero estimates), which in our context means under-selection of the efficient firms. Wang, Li and Tsai (2007) instead propose a tuning parameter choice based on a BIC-type criterion, which is shown to consistently estimate the correct model when it exists.
We consider a BIC-type criterion for the choice of $\lambda$, given by

$$\lambda^* = \arg\min_{\lambda}\ \log \hat{\sigma}^2(\lambda) + \frac{\phi_{NT}}{NT} |\hat{S}^c(\lambda)|, \quad (7)$$

where $\phi_{NT}$ is a sequence increasing with $N$ or $T$, $\hat{S}^c(\lambda) = \{i : \hat{u}_i(\lambda) > 0\}$, and

$$\hat{\sigma}^2(\lambda) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - x_{it}'\hat{\beta} - \hat{\alpha}(\lambda) + \hat{u}_i(\lambda) \right)^2$$

from (5) for a fixed $\lambda$. The following theorem proves that selection consistency still holds with the tuning parameter chosen by (7).

14 In fact, directly minimizing (5) using any typical method shrinks not only $\hat{u}_i(\lambda)$ but also $\hat{\alpha}(\lambda)$. However, this is an undesirable shrinkage bias, which may slow down the convergence of $\hat{\alpha}(\lambda)$, particularly when $N$ is large (Equation A.6 in the online supplementary material gives the explicit form of this bias). Therefore, in the spirit of post-LASSO estimation (e.g., Belloni and Chernozhukov, 2013), our algorithm skips the steps that induce such a shrinkage effect on $\hat{\alpha}(\lambda)$, achieving smaller finite sample bias for $\hat{\alpha}(\lambda)$. This omission does not alter any of the asymptotic results in Section 3.
Theorem 3 Suppose Assumptions 1 and 2 hold, and that $(\phi_{NT}/T)^{1/2}\, \eta^{-1} \to 0$ and $\phi_{NT}/(\log N)^2 \to \infty$. Then, as $(N, T) \to \infty$, $\Pr(\hat{S}(\lambda^*) = S) \to 1$, where $\lambda^*$ is given in (7).
Theorem 3 indicates that when $\phi_{NT}$ grows at an appropriate rate, we can consistently identify the true set of efficient firms using the tuning parameter chosen by (7). In particular, the conditions $(\phi_{NT}/T)^{1/2}\, \eta^{-1} \to 0$ and $\phi_{NT}/(\log N)^2 \to \infty$ ensure that the probabilities of under-fitting (i.e., some non-zero inefficiencies estimated as zero) and over-fitting (i.e., some zero inefficiencies estimated as non-zero), respectively, vanish asymptotically. The choice of $\phi_{NT}$ is crucial in practice to control such under- and over-fitting probabilities. From our simulations, we found that $0.1(\log N)^2 c_{NT}$ with $c_{NT} = \log(\log(NT/(N+T)))$ works well for various panel structures.15 We use it for our simulations and empirical application.
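As an illustration, the criterion (7) with the suggested $\phi_{NT}$ can be computed from per-$\lambda$ fit summaries as below; the function names are ours, and the residual variances and nonzero counts are assumed to come from LASSO fits at each grid point.

```python
import numpy as np

def phi_NT(N, T, nu=0.1):
    """Suggested penalty weight: 0.1 (log N)^2 c_NT with
    c_NT = log(log(NT/(N+T))), as in Section 4.2."""
    c_NT = np.log(np.log(N * T / (N + T)))
    return nu * np.log(N) ** 2 * c_NT

def bic_select(lam_grid, sigma2_grid, n_pos_grid, N, T):
    """Criterion (7): log sigma_hat^2(lam) + phi_NT/(NT) * |S_hat^c(lam)|.

    sigma2_grid : residual variance sigma_hat^2(lam) for each lambda
    n_pos_grid  : number of firms with u_hat_i(lam) > 0 for each lambda
    """
    crit = np.log(np.asarray(sigma2_grid)) \
         + phi_NT(N, T) / (N * T) * np.asarray(n_pos_grid)
    return lam_grid[int(np.argmin(crit))]
```

The first term rewards fit while the second penalizes each firm declared inefficient, so too small a $\lambda$ (many nonzero $\hat{u}_i$) is penalized through the count $|\hat{S}^c(\lambda)|$.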
5 Simulations
In this section, we study the finite sample performance of the LASSO estimator. We consider
the model (1) with $\alpha_0 = 1$, $\beta_0 = (1, 1, 1, 1)'$, $x_{it} \sim$ i.i.d. $\mathcal{N}(0, \Sigma)$, where the $(j_1, j_2)$th element of $\Sigma$ is $0.5^{|j_1 - j_2|}$ for $j_1, j_2 = 1, ..., 4$, and $\delta = 0.3$ (i.e., 30% of firms in the sample are fully efficient).16 The two-sided error $v_{it}$ is conditionally heteroskedastic and serially correlated such that $v_{it} = 0.25 v_{i,t-1} + \omega_{it}$ for $t = 2, ..., T$ and $v_{i1} = \omega_{i1}$, where $\omega_{it}|x_{it} \sim$ i.i.d. $\mathcal{N}(0, \sigma^2_{it})$ with $\sigma_{it} = 0.45$ if $\sum_{j=1}^{4} x_{itj} < 0$ and $\sigma_{it} = 1.45$ otherwise.17

15 A $\phi_{NT}$ that satisfies the rate conditions can take the form $\nu(\log N)^2 c_{NT}$, where $\nu$ is some positive constant that gives flexibility to control the degree of penalization in the criterion (similar to ERIC by Hui, Warton and Foster (2015)) and $c_{NT}$ is a diverging sequence whose rate of divergence is arbitrarily slow. Note that $c_{NT} = \log(\log(NT/(N+T))) \approx \min\{\log(\log N), \log(\log T)\}$ in our case. We also experimented with other types of selection criteria in the simulation study, including ERIC and $IC_{p1}$ by Bai and Ng (2002), and found that (7) performs best in this panel SF model.
In every simulation, each nonzero individual inefficiency u0,i is generated independently
as max{0.01, u}, where u is drawn from an exponential distribution with density (1/σu)e^{−u/σu} for some σu > 0;
the trimming ensures that all draws are strictly positive. We experiment with σu ∈ {1, 2, 4}.
Note that as σu gets smaller, the probability of small inefficiency draws gets higher, making
it more difficult for the LASSO to distinguish them from zero. This is particularly
difficult when T is small.18 Figure 1 shows the density of the inefficiency u0,i for each
σu value (left panel) and an example of draws from each case (right panel). We
can clearly see that the inefficiencies have high density near zero when σu = 1. For the penalty
function, we set γ = 2, and λ is selected by (7) from a grid search over 250 evenly spaced
points between 10^{−4} and 10T.19 We simulate each case 1,000 times for the combinations of
N ∈ {100, 200, 400, 1000} and T ∈ {10, 30, 50, 70}.
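The data-generating process above can be sketched in a few lines. This is our own illustrative reconstruction (the function name, seed, and the rule assigning which firms are efficient are assumptions), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(N=100, T=10, delta=0.3, sigma_u=1.0):
    """Sketch of the simulation design: correlated regressors, an AR(1),
    conditionally heteroskedastic two-sided error v_it, and
    trimmed-exponential inefficiencies u_{0,i}."""
    # Regressors: 4 correlated normals with Sigma[j1, j2] = 0.5**|j1 - j2|
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
    x = rng.multivariate_normal(np.zeros(4), Sigma, size=(N, T))
    # sigma_it = 0.45 if sum_j x_itj < 0, else 1.45
    sig = np.where(x.sum(axis=2) < 0, 0.45, 1.45)
    omega = rng.standard_normal((N, T)) * sig
    v = np.empty((N, T))
    v[:, 0] = omega[:, 0]
    for t in range(1, T):                    # v_it = 0.25 v_{i,t-1} + w_it
        v[:, t] = 0.25 * v[:, t - 1] + omega[:, t]
    # A fraction delta of firms is fully efficient (u_{0,i} = 0); the rest
    # draw u_{0,i} from Exp(sigma_u), trimmed below at 0.01.
    u0 = np.zeros(N)
    inefficient = rng.permutation(N)[int(delta * N):]
    u0[inefficient] = np.maximum(0.01, rng.exponential(sigma_u, inefficient.size))
    beta0, alpha0 = np.ones(4), 1.0
    y = alpha0 - u0[:, None] + x @ beta0 + v
    return y, x, u0

y, x, u0 = simulate_panel()
```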
[=== Figure 1 is here ===]
First, Table 1 reports and compares the results from the adaptive LASSO estimation
in (5) and the conventional LSDV approach described in Section 2.1. In particular, we
report the root mean squared errors (RMSE) of ULASSO = (u1(λ∗), ..., uN(λ∗))′ and ULSDV =
(u1, ..., uN)′; point estimates of α0 from αLASSO = α(λ∗) and αLSDV = α (= max1≤i≤N αi);
and the sample correlations between the ranking of U0,Sc (i.e., the nonzero inefficiencies) and the
rankings of their counterpart estimates ULASSO,Sc and ULSDV,Sc for a given S.20

16 Additional simulation results for δ ∈ {0.1, 0.9} are in the online supplementary material. As δ decreases, the finite sample performance of the LASSO estimators deteriorates, but we still observe notable improvements from the LASSO estimation compared to the LSDV.
17 The variances of ωit were chosen so that the overall variance of vit is approximately one.
18 In this case, the rate conditions on η in Assumption 2-(2)-(iii) and 2-(3) are likely to be violated.
19 We are free to choose the value of γ as long as it satisfies the rate conditions in Assumption 2-(3). From the asymptotic analysis, we can see that setting a higher value for γ ensures that the LASSO estimates zero coefficients as zero, but it also increases the probability of estimating (small) nonzero coefficients as zero. Therefore, in practice γ should be determined in light of this trade-off.
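The two accuracy measures can be computed as follows. This is a minimal sketch with toy inputs (the function names are ours):

```python
import numpy as np

def rmse(u_hat, u0):
    """Root mean squared error over all N inefficiency estimates."""
    return float(np.sqrt(np.mean((u_hat - u0) ** 2)))

def rank_corr_nonzero(u_hat, u0):
    """Rank correlation computed only over firms whose true inefficiency
    is nonzero (the set S^c), as in the Table 1 columns."""
    mask = u0 > 0
    a = np.argsort(np.argsort(u_hat[mask]))  # ranks of the estimates
    b = np.argsort(np.argsort(u0[mask]))     # ranks of the true values
    return float(np.corrcoef(a, b)[0, 1])

u0 = np.array([0.0, 0.0, 0.5, 1.2, 2.0])
u_hat = np.array([0.0, 0.1, 0.6, 1.0, 2.5])
print(rmse(u_hat, u0), rank_corr_nonzero(u_hat, u0))
```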
[=== Table 1 here ===]
The LASSO notably outperforms the LSDV in terms of the RMSE of U in all cases. Note
that as N increases, the RMSE of ULASSO decreases but that of ULSDV increases, leading
to a larger disparity between the two methods.21 When N = 1000, the RMSE of ULSDV is
almost three times that of ULASSO. As the asymptotic analysis implies, this is
mainly because of the faster convergence of αLASSO to the true value.
As the means and variances of αLASSO and αLSDV in Table 1 show, the distribution
of αLASSO is centered much closer to the true value (α0 = 1) than that of αLSDV, even
when T and σu are small, and the bias and variance of αLASSO decrease quickly as N or
T increases. In addition, the max operator that αLSDV uses to estimate α0 tends to pick
the most biased individual intercept estimate. Therefore, in the presence of multiple zero-inefficiency firms, the max operator produces a biased estimate of α0, which, in turn, leads
to bias in estimating the inefficiencies u0,i.22
The LASSO and the LSDV show similar rank correlation results, although the LASSO
appears to preserve the original ranking better than the LSDV when T and σu are small. This is
when there is large uncertainty in the inefficiency estimates, and the LASSO improves
ranking accuracy by estimating statistically indistinguishable small inefficiencies as zero.
20 More precisely, the entries in Table 1 (and also those in Table 2) are the average values for each measure over 1,000 replications, with the corresponding standard deviations in parentheses. Rank correlations are computed only among the inefficiencies whose true values are nonzero, that is, corr(R(U0,Sc), R(ULASSO,Sc)) and similarly for the LSDV, where R(·) is a mapping from estimates to rankings.
21 When N is very large (e.g., 1000), however, we find that the RMSE of ULASSO starts to increase. This is related to the form of φNT in (7). The impact of φNT on the selection performance, and consequently on the estimation of α and ui, is discussed below when we examine the selection accuracy of the LASSO estimation.
22 Wang and Schmidt (2009) also document the "upward" bias of LSDV estimators using simulations.
Second, Table 2 presents the selection accuracy of the LASSO estimation. In particular,
we report the probability of yielding a zero estimate for i ∈ S, PS = Pr(i ∈ Ŝ | i ∈ S); the
probability of yielding a nonzero estimate for i ∈ Sc, PSc = Pr(i ∈ Ŝc | i ∈ Sc); the estimated
proportion of efficient firms, δ̂; and the maximum of the u0,i whose true values are nonzero
but which are estimated as zero, representing the degree of misclassification (i.e., max_{i∈Ŝ∩Sc} u0,i; Max-miss).
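The four diagnostics just defined can be sketched as follows (the function name and the toy vectors are ours):

```python
import numpy as np

def selection_metrics(u_hat, u0):
    """Sketch of the Table 2 diagnostics: P_S (zero estimate given truly
    efficient), P_Sc (nonzero estimate given truly inefficient), the
    estimated efficient share delta_hat, and Max-miss (largest true
    inefficiency among firms misclassified as efficient)."""
    S, Sc = u0 == 0, u0 > 0
    in_Shat = u_hat == 0
    P_S = np.mean(in_Shat[S])        # Pr(i in S_hat | i in S)
    P_Sc = np.mean(~in_Shat[Sc])     # Pr(i in S_hat^c | i in S^c)
    delta_hat = np.mean(in_Shat)
    missed = u0[Sc & in_Shat]        # truly inefficient, estimated as zero
    max_miss = float(missed.max()) if missed.size else 0.0
    return P_S, P_Sc, delta_hat, max_miss

u0 = np.array([0.0, 0.0, 0.05, 0.8, 1.5])
u_hat = np.array([0.0, 0.0, 0.0, 0.7, 1.6])
print(selection_metrics(u_hat, u0))  # (1.0, 2/3, 0.6, 0.05)
```

In this toy example, the only firm misclassified as efficient is the one with near-zero true inefficiency (0.05), mirroring the pattern discussed below for Max-miss.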
[=== Table 2 here ===]
Both PS and PSc improve as T increases, but PSc decreases as N increases while PS
increases. The trade-off between PS and PSc as N increases is related to the form of
φNT in (7). Theorem 3 implies that φNT should grow faster than (log N)^2, which ensures
that the LASSO estimates (a diverging number of) zero coefficients as zero as N increases,
but at the same time the smallest inefficiency, η, should be sufficiently large not to
adversely increase the probability of estimating (small) nonzero coefficients as zero. In our
simulations, we allow for small inefficiencies, so the trade-off between PS and PSc is apparent
as N increases. This is particularly true when σu is small.23 However, note that most of
the inefficient firms incorrectly estimated as efficient (i.e., those with zero inefficiency
estimates) have near-zero inefficiency. The small values of Max-miss in Table 2 imply
that only firms with near-zero inefficiency could be incorrectly categorized as fully efficient by
the LASSO procedure.
More importantly, it is impressive that even when T is small, including (N, T) =
(1000, 10), δ̂ is quite close to the true proportion δ = 0.3 as long as σu is large. This
has an important implication: our approach can be used even for short panels, as long
as there are not too many firms with near-zero inefficiency. Hence, in practice, information on
the variance of u0,i is important when deciding whether to use the proposed LASSO approach. Cai,
Horrace and Lee (2021) study nonparametric identification of σu in the panel setup, where
it is allowed to be conditionally heteroskedastic.

23 The degree and pattern of this trade-off apparently depend on the choice of φNT, which ultimately affects the estimation of α and ui. Therefore, as with γ, in practice φNT should be chosen in light of this trade-off. However, we find that a φNT that is optimal over a wide range of N is difficult to find (e.g., our φNT appears to grow rather quickly as N increases, leading to an underestimation of α when N = 1000). The optimal choice of φNT is left for future research.
6 Empirical Application: Police Vehicle Search Efficiency in Syracuse, New York
In this section, we consider a police chief who selects a group of the best officers for annual
evaluation based on how successfully they carried out vehicle searches throughout the year.
The idea is that officers perform a cost-benefit analysis when deciding whether to search the vehicle
of a stopped motorist. The costs of a search are the opportunity cost of their time and effort
and the potential cost of being targeted for a "wrongful search" when the search fails to
uncover illegal items (contraband). The benefit of a successful search (one that uncovers
contraband) is the arrest of the motorist. Specifically, we model success rates (i.e. hit rates)
conditional on a search of a stopped vehicle among officers using a linear probability model,
and use the officer fixed effects in the model to calculate officer-specific success rates (i.e.,
search efficiency). We include several police productivity inputs and dummy variables to
control for heterogeneity due to differing levels of police experience and the location and time
of the vehicle searches.24
We use the panel of discretionary vehicle search activity by officer from 2006 to 2009
in the City of Syracuse, NY, which was previously analyzed by Horrace and Rohlin (2016)
in a different context. We use the data for 2006 only and exclude officers whose total
number of vehicle searches in the year was less than five. We also exclude stops made in
census tracts with fewer than five observations. Our final sample includes 139 field officers
and 2,863 observations (i.e. searches). Note that our sample is an unbalanced panel in which
each officer made a different number of searches, Ti, over the period.25

24 Defining police productivity by the success rate has a limitation: officers with a higher standard for guilt tend to have a higher success rate, since they would search only vehicles with a high probability of carrying contraband. We may consider a composite measure that accounts for both the quantity and quality of searches, which is left for future research.
The linear probability model is specified as follows:
$$\Pr(arrest_{it} = 1 \mid \alpha_0, x_{it}, z_i, u_{0,i}) = \alpha_0 + x_{it}'\beta_{0,1} + z_i'\beta_{0,2} - u_{0,i}$$
for i = 1, ..., 139 and t = 1, ..., Ti, where arrest_it is the binary outcome variable for officer i
at time t,26 which is 1 if the search results in an arrest of the motorist and 0 otherwise. The
explanatory variables, xit and zi, are time-varying and time-invariant, respectively; xit is
allowed to be correlated with u0,i, but zi is assumed to be strictly exogenous. As discussed
in Section 2.1, zi is included here to control for an important source of time-invariant heterogeneity among
officers: experience. Police Experience measures the contemporaneous experience level of
each officer, based on years of employment on the force.27 We could consider a measure
of experience based on the cumulative number of stops made by each officer. However, this
measure may be endogenous to the probability that a vehicle search leads to the discovery
of contraband (and an arrest). We instead use officer start date as a proxy for experience,
which is plausibly exogenous. Although zi is removed in the within estimation, β0,2 can
be estimated in the between equation.28 To capture possible nonlinearity in the relationship
25 We use the average number of searches among officers for the T in the BIC criterion (7).
26 In the data, we identify the exact time (hh:mm:ss) of each stop.
27 Note that our analysis is for one year and Experience is a yearly variable, so it is time-invariant in this analysis.
28 The between equation is ȳi· = α0 + zi′β0,2 + ςi, where ȳi· = (1/Ti) ∑_{t=1}^{Ti} (arrest_it − x_it′β̂1,LSDV) and ςi is the regression error that contains u0,i and the original two-sided error. This regression is valid as long as zi is exogenous to u0,i and the original two-sided error.
between Experience and the arrest rate, we include a third-order polynomial in zi. The time-varying explanatory variables, xit, include variables that control for other dimensions of
heterogeneity in search activity: motorist Youth; Dispersion and Scale of stop activity at
the officer level; and Census × Shift and Season dummies. Youth is a dummy for drivers under
25, and Dispersion and Scale are constructed from monthly stop activity to account
for police heterogeneity due to the different types of duties assigned to officers. Dispersion
measures the spatial dispersion of each officer's stop activity and Scale measures the
intensity with which officers perform duties; they are defined as
intensity with which officers perform duties, which are defined as
Dispersionit =
J∑j=1
(STijt∑Jj=1 STijt
)2−1
and Scaleit =
∑Jj=1 STijt∑N
k=1
∑Jj=1 STkjt
where STijt is the number of stops in census tract j in the month of t by officer i, and J
is the total number of census tracts. These variables address potential selectivity and
heterogeneity arising from the way in which the police chief assigns officers to specific
duties in specific parts of the city. For example, officers who tend to make more stops (ceteris
paribus) may be assigned to parts of the city where performing many stops is optimal from
the perspective of improving arrest rates. Census × Shift includes 99 dummies for different
combinations of census tracts and three work shifts (7am-3pm, 3pm-11pm and 11pm-7am),
and Season includes dummies for spring, summer and fall.
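A minimal sketch of these two constructed controls (the function name and the toy stop counts are ours): Dispersion_i is the inverse Herfindahl index of an officer's stop shares across tracts within a month, and Scale_i is the officer's share of all stops that month.

```python
import numpy as np

def dispersion_and_scale(ST):
    """ST is an (N, J) array of monthly stop counts, officers in rows and
    census tracts in columns, for a fixed month t."""
    row_tot = ST.sum(axis=1, keepdims=True)       # sum_j ST_ijt per officer
    shares = ST / row_tot                         # ST_ijt / sum_j ST_ijt
    dispersion = 1.0 / (shares ** 2).sum(axis=1)  # (sum_j share^2)^(-1)
    scale = ST.sum(axis=1) / ST.sum()             # officer share of all stops
    return dispersion, scale

ST = np.array([[4, 4, 0],    # officer spread evenly over two tracts
               [8, 0, 0]])   # officer concentrated in one tract
d, s = dispersion_and_scale(ST)
print(d, s)  # d = [2., 1.], s = [0.5, 0.5]
```

The spread-out officer gets Dispersion 2 (effectively active in two tracts) while the concentrated one gets 1, matching the inverse-Herfindahl interpretation.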
We first present the estimation results for β0,1 and β0,2.
[=== Table 3 and Figure 2 are here ===]
The estimates of β0,1 in Table 3 are intuitive. The negative coefficient on Youth implies
that the arrest rate for young motorists is on average lower than for older motorists, which implies
that officers searched young motorists without success (i.e., an arrest) more frequently
than older motorists. This may be interpreted as bias against young motorists, but
it may simply be that the signal of guilt that officers receive from young motorists
is noisier. Dispersion is positive but statistically insignificant; the positive estimate may
indicate that officers who carry out duties over a larger area obtain additional learning
opportunities, which enhances their ability to detect crime. Scale is negative and significant,
which can be seen as a result of a quality-quantity trade-off in search activity.
Figure 2 depicts the change in the arrest rate by police experience, with 95% confidence
intervals. The arrest rate improves as years of employment increase until around the tenth year, and
then decreases. Vehicle searches involve prediction tasks regarding the likelihood of arrest,
which may improve over time through learning-by-doing, but the inverted-U-shaped curve
implies that learning in policing is not a constant accumulation but involves a degradation
after a certain period.
We now turn to our results on police search efficiency. The LASSO estimates 32.4%
of officers (45 out of 139) as efficient. The distribution of the inefficiencies
is reported in Figure 3, where the blue histogram represents the distribution of the
inefficiencies from the conventional LSDV approach and the yellow one represents that from
the LASSO, with 32.4% of the mass concentrated at zero.29 The distribution of the
inefficiencies from the conventional LSDV looks bimodal, with two
peaks at around 0.2 and 0.6. It appears that the LASSO shrinks the inefficient mass of officers
at the first peak towards zero inefficiency, implying that this mass of officers is equally
efficient. After the LASSO application, the density of the inefficiencies becomes
more similar to the half-normal or exponential distributions that are typically assumed in
parametric SF models.

29 The black and red dotted lines are (kernel-smoothed) density functions estimated by the default ksdensity function in Matlab.
[=== Figure 3 is here ===]
The single-year analysis can be extended to a multi-year analysis in a straightforward way:
we can deploy separate single-year models for each year, allowing inefficiency to vary
across years. Therefore, if high-frequency data for multiple years are available, our approach
can allow for time-varying inefficiency and identify a group of efficient agents for each year.
7 Conclusion
We proposed an adaptive LASSO estimator to identify a group of efficient firms in the
panel stochastic frontier model. The method is particularly useful when the market size
is large, and we showed that it outperforms the conventional LSDV-based approach in many
respects. More generally, whenever a panel linear regression model has individual
fixed effects whose ranking contains important information, our
approach can identify a subset of the best (or worst) effects. This type of "best and the
rest" classification can hence be used as an adaptive sample-splitting method. The empirical
application demonstrates the practical value of the proposed method.
References
Ahn, S. C., Lee, Y. H. and Schmidt, P. (2007), ‘Stochastic frontier models with multiple
time-varying individual effects’, Journal of Productivity Analysis 27(1), 1–12.
Aigner, D., Lovell, C. and Schmidt, P. (1977), ‘Formulation and estimation of stochastic
frontier production function models’, Journal of Econometrics 6, 21–37.
Angrist, J. D., Hull, P. D., Pathak, P. A. and Walters, C. R. (2017), 'Leveraging lotteries
for school value-added: Testing and estimation', The Quarterly Journal of Economics
132(2), 871–919.
Bai, J. and Ng, S. (2002), ‘Determining the number of factors in approximate factor models’,
Econometrica 70(1), 191–221.
Belloni, A. and Chernozhukov, V. (2013), ‘Least squares after model selection in high-
dimensional sparse models’, Bernoulli 19(2), 521–547.
Belloni, A., Chernozhukov, V., Hansen, C. and Kozbur, D. (2016), ‘Inference in high-
dimensional panel models with an application to gun control’, Journal of Business &
Economic Statistics 34(4), 590–605.
Bonhomme, S. and Manresa, E. (2015), ‘Grouped patterns of heterogeneity in panel data’,
Econometrica 83(3), 1147–1184.
Cai, J., Horrace, W. C. and Lee, Y. (2021), Panel nonparametric conditional heteroskedastic
frontiers with application to CO2 emissions. Working paper.
Caner, M., Han, X. and Lee, Y. (2018), ‘Adaptive elastic net gmm estimation with many in-
valid moment conditions: Simultaneous model and moment selection’, Journal of Business
& Economic Statistics 36(1), 24–36.
Chetty, R., Friedman, J. N. and Rockoff, J. E. (2014), 'Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates', American Economic Review
104(9), 2593–2632.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), ‘Least angle regression’, The
Annals of Statistics 32(2), 407–451.
Fan, J. and Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its
oracle properties’, Journal of the American Statistical Association 96, 1348–1360.
Feng, Q. and Horrace, W. C. (2007), ‘Fixed-effect estimation of technical efficiency with
time-invariant dummies’, Economics Letters 95(2), 247–252.
Flores-Lagunes, A., Horrace, W. C. and Schnier, K. E. (2007), ‘Identifying technically effi-
cient fishing vessels: a non-empty, minimal subset approach’, Journal of Applied Econo-
metrics 22(4), 729–745.
Friedman, J., Hastie, T. and Tibshirani, R. (2010), 'Regularization paths for generalized
linear models via coordinate descent', Journal of Statistical Software 33(1), 1–22.
Friedson, A. I., Horrace, W. C. and Marier, A. F. (2019), ‘So many hospitals, so little infor-
mation: How hospital value-based purchasing is a game of chance’, Southern Economic
Journal 86(2), 773–799.
Greene, W. H. (1980), ‘Maximum likelihood estimation of econometric frontier functions’,
Journal of Econometrics 13(1), 27–56.
Greene, W. H. (2005), ‘Reconsidering heterogeneity in panel data estimators of the stochastic
frontier model’, Journal of Econometrics 126(2), 269–303.
Hausman, J. A. and Taylor, W. E. (1981), ‘Panel data and unobservable individual effects’,
Econometrica 49(6), 1377–1398.
Horrace, W. C. (2005), ‘On ranking and selection from independent truncated normal dis-
tributions’, Journal of Econometrics 126(2), 335–354.
Horrace, W. C. and Rohlin, S. M. (2016), 'How dark is dark? Bright lights, big city, racial
profiling', The Review of Economics and Statistics 98(2), 226–232.
Horrace, W. C. and Schmidt, P. (1996), ‘Confidence statements for efficiency estimates from
stochastic frontier models’, Journal of Productivity Analysis 7(2), 257–282.
Horrace, W. C. and Schmidt, P. (2000), ‘Multiple comparisons with the best, with economic
applications’, Journal of Applied Econometrics 15(1), 1–26.
Hua, Q., Zeng, P. and Lin, L. (2015), ‘The dual and degrees of freedom of linearly constrained
generalized lasso’, Computational Statistics and Data Analysis 86, 13–26.
Hui, F. K. C., Warton, D. I. and Foster, S. D. (2015), 'Tuning parameter selection for the
adaptive lasso using ERIC', Journal of the American Statistical Association 110(509), 262–
269.
Jondrow, J., Lovell, C., Materov, I. S. and Schmidt, P. (1982), ‘On the estimation of technical
inefficiency in the stochastic frontier production function model’, Journal of Econometrics
19(2-3), 233–238.
Kumbhakar, S. C., Parmeter, C. F. and Tsionas, E. G. (2013), ‘A zero inefficiency stochastic
frontier model’, Journal of Econometrics 172(1), 66–76.
Kutlu, L., Tran, K. C. and Tsionas, M. G. (2020), ‘Unknown latent structure and inefficiency
in panel stochastic frontier models’, Journal of Productivity Analysis 54, 75–86.
Merlevede, F., Peligrad, M. and Rio, E. (2009), 'Bernstein inequality and moderate deviations under strong mixing conditions', in High Dimensional Probability V, IMS Collections, Institute of Mathematical Statistics, pp. 273–292.
Park, B., Sickles, R. and Simar, L. (1998), ‘Stochastic panel frontiers: A semiparametric
approach’, Journal of Econometrics 84(2), 273–301.
Rho, S. and Schmidt, P. (2015), ‘Are all firms inefficient?’, Journal of Productivity Analysis
43(3), 327–349.
Rothstein, J. (2010), ‘Teacher quality in educational production: Tracking, decay, and stu-
dent achievement’, The Quarterly Journal of Economics 125(1), 175–214.
Schmidt, P. and Sickles, R. C. (1984), ‘Production frontiers and panel data’, Journal of
Business and Economic Statistics 2(4), 367–374.
Simar, L. and Wilson, P. W. (2009), ‘Inferences from cross-sectional, stochastic frontier
models’, Econometric Reviews 29(1), 62–98.
Su, L., Shi, Z. and Phillips, P. C. B. (2016), ‘Identifying latent structures in panel data’,
Econometrica 84(6), 2215–2264.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal
Statistical Society. Series B (Methodological) 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005), ‘Sparsity and
smoothness via the fused lasso’, Journal of the Royal Statistical Society: Series B (Statis-
tical Methodology) 67(1), 91–108.
Wang, H., Li, R. and Tsai, C.-L. (2007), 'Tuning parameter selectors for the smoothly clipped
absolute deviation method', Biometrika 94(3), 553–568.
Wang, W. S. and Schmidt, P. (2009), ‘On the distribution of estimated technical efficiency
in stochastic frontier models’, Journal of Econometrics 148(1), 36–45.
Wheat, P., Greene, W. and Smith, A. (2014), ‘Understanding prediction intervals for firm
specific inefficiency scores from parametric stochastic frontier models’, Journal of Produc-
tivity Analysis 42(1), 55–65.
Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American
Statistical Association 101(476), 1418–1429.
Figures and Tables
Table 1: Estimation Accuracy
                         RMSE                  Point estimate (α0 = 1)    Rank correlation
(N, T)       σu    ULASSO     ULSDV        αLASSO     αLSDV           LASSO     LSDV
(100, 10)    1     0.4222     0.9497       1.210      1.862           0.87      0.85
                   (0.1495)   (0.1787)     (0.214)    (0.195)         (0.045)   (0.044)
(100, 30)    1     0.2389     0.5517       1.097      1.498           0.93      0.93
                   (0.0643)   (0.1056)     (0.108)    (0.117)         (0.023)   (0.024)
(100, 50)    1     0.1816     0.4260       1.059      1.383           0.95      0.95
                   (0.0397)   (0.0839)     (0.075)    (0.093)         (0.017)   (0.017)
(100, 70)    1     0.1505     0.3608       1.045      1.326           0.96      0.96
                   (0.0249)   (0.0674)     (0.055)    (0.075)         (0.013)   (0.013)
(100, 10)    4     0.4618     0.9272       1.252      1.833           0.98      0.98
                   (0.1841)   (0.1876)     (0.244)    (0.205)         (0.007)   (0.007)
(100, 30)    4     0.2434     0.5387       1.102      1.483           0.99      0.99
                   (0.0677)   (0.1052)     (0.106)    (0.116)         (0.003)   (0.003)
(100, 50)    4     0.1813     0.4147       1.065      1.372           0.99      0.99
                   (0.0368)   (0.0824)     (0.069)    (0.090)         (0.002)   (0.002)
(100, 70)    4     0.1506     0.3523       1.048      1.316           1.00      1.00
                   (0.0289)   (0.0673)     (0.055)    (0.074)         (0.002)   (0.002)
(200, 10)    1     0.3602     1.0541       1.088      1.974           0.88      0.85
                   (0.0390)   (0.1711)     (0.109)    (0.183)         (0.031)   (0.031)
(200, 70)    1     0.1436     0.3967       1.018      1.365           0.97      0.96
                   (0.0111)   (0.0619)     (0.036)    (0.067)         (0.009)   (0.009)
(200, 10)    4     0.3878     1.0273       1.144      1.944           0.98      0.98
                   (0.0540)   (0.1744)     (0.117)    (0.188)         (0.005)   (0.005)
(200, 70)    4     0.1405     0.3909       1.027      1.359           1.00      1.00
                   (0.0108)   (0.0649)     (0.033)    (0.070)         (0.001)   (0.001)
(400, 10)    1     0.3587     1.1359       1.005      2.062           0.90      0.85
                   (0.0250)   (0.1512)     (0.074)    (0.161)         (0.022)   (0.020)
(400, 70)    1     0.1479     0.4295       0.997      1.400           0.97      0.96
                   (0.0111)   (0.0623)     (0.026)    (0.066)         (0.005)   (0.006)
(400, 10)    4     0.3650     1.1296       1.086      2.055           0.98      0.98
                   (0.0219)   (0.1676)     (0.072)    (0.179)         (0.003)   (0.003)
(400, 70)    4     0.1397     0.4280       1.012      1.398           1.00      1.00
                   (0.0070)   (0.0615)     (0.022)    (0.066)         (0.001)   (0.001)
(1000, 10)   1     0.3905     1.2564       0.933      2.191           0.92      0.85
                   (0.0279)   (0.1517)     (0.046)    (0.159)         (0.015)   (0.013)
(1000, 10)   4     0.3656     1.2368       1.035      2.170           0.99      0.98
                   (0.0139)   (0.1523)     (0.046)    (0.161)         (0.002)   (0.002)

Each entry contains the average value for each measure over 1,000 replications, and the corresponding standard deviations are in parentheses. Rank correlations are computed only among the inefficiencies whose true values are nonzero.
Table 2: Selection Accuracy

                         σu = 1                                  σu = 2                                  σu = 4
(N, T)       PS       PSc      δ̂        Max-miss    PS       PSc      δ̂        Max-miss    PS       PSc      δ̂        Max-miss
(100, 10)    0.6576   0.7936   0.3418   0.6661      0.6397   0.8889   0.2697   0.5842      0.6346   0.9393   0.2329   0.4938
             (0.1977) (0.1084) (0.1281) (0.2714)    (0.2102) (0.0682) (0.1034) (0.2993)    (0.2128) (0.0435) (0.0876) (0.3099)
(100, 30)    0.7209   0.8421   0.3268   0.4070      0.7209   0.9135   0.2768   0.3528      0.7369   0.9535   0.2536   0.2859
             (0.1705) (0.0805) (0.0997) (0.1665)    (0.1766) (0.0522) (0.0818) (0.1822)    (0.1680) (0.0335) (0.0657) (0.1891)
(100, 50)    0.7617   0.8595   0.3269   0.3278      0.7555   0.9239   0.2799   0.2776      0.7864   0.9570   0.2660   0.2343
             (0.1593) (0.0697) (0.0882) (0.1350)    (0.1615) (0.0451) (0.0721) (0.1478)    (0.1525) (0.0307) (0.0597) (0.1561)
(100, 70)    0.7894   0.8699   0.3279   0.2846      0.7961   0.9300   0.2878   0.2400      0.8102   0.9616   0.2700   0.1919
             (0.1400) (0.0641) (0.0790) (0.1112)    (0.1377) (0.0417) (0.0625) (0.1231)    (0.1410) (0.0280) (0.0540) (0.1320)
(200, 10)    0.7918   0.7183   0.4347   0.9195      0.7609   0.8504   0.3330   0.8453      0.7563   0.9211   0.2821   0.7515
             (0.1173) (0.0913) (0.0946) (0.2317)    (0.1276) (0.0596) (0.0745) (0.2519)    (0.1295) (0.0361) (0.0590) (0.2779)
(200, 70)    0.8698   0.8416   0.3719   0.3769      0.8708   0.9135   0.3218   0.3470      0.8814   0.9530   0.2973   0.2996
             (0.0824) (0.0528) (0.0566) (0.0985)    (0.0798) (0.0336) (0.0427) (0.1036)    (0.0776) (0.0219) (0.0336) (0.1180)
(1000, 10)   0.9445   0.5616   0.5902   1.4197      0.9110   0.7717   0.4331   1.3278      0.8936   0.8819   0.3507   1.2500
             (0.0271) (0.0523) (0.0434) (0.1832)    (0.0358) (0.0347) (0.0333) (0.1874)    (0.0410) (0.0223) (0.0259) (0.2138)

Each entry contains the average value for each measure over 1,000 replications, and the corresponding standard deviations are in parentheses.
Table 3: LSDV Estimates of β0,1
Estimate S.E.
Youth −0.041∗∗ 0.018
Dispersion 0.004 0.004
Scale −5.659∗∗∗ 2.000
Note: The linear probability model includes dummies for different combinations of census tracts and three work shifts, and dummies for seasons. Standard errors are clustered at the officer level. ***, **, and * indicate statistical significance at the 1%, 5% and 10% levels, respectively.
Figure 1: PDFs of Inefficiency with Different σu Values and an Example of Draws from Each PDF.
Figure 2: Change in Arrest Rate by Experience.
Figure 3: Distribution of Search Inefficiency
Supplementary Material for "LASSO for Stochastic Frontier Models with Many Efficient Firms"
By William C. Horrace, Hyunseok Jung and Yoonseok Lee
A. Proofs
Let $\kappa_{NT} = (\log N)/\sqrt{T}$. We first derive some technical lemmas.
Lemma A.1 Suppose Assumption 2-(1) and 2-(2)-(ii) hold. Then, for some 0 < Cx, Cv < ∞, as
(N, T) → ∞, we have

(a) $$\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big)=o(N^{-1}) \quad\text{and}\quad \max_{1\le i\le N}\Pr\Big(\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|\ge C_v\kappa_{NT}\Big)=o(N^{-1});$$

(b) $$\Pr\Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big)=o(1) \quad\text{and}\quad \Pr\Big(\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|\ge C_v\kappa_{NT}\Big)=o(1).$$
Proof of Lemma A.1 We only prove the first part of (a), since the proof of the second part of (a) is analogous and (a) implies (b): if (a) is true, then
$$\Pr\Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le \sum_{i=1}^{N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le N\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) = N\cdot o(N^{-1}) = o(1),$$
and similarly for the second part of (b).
To prove the first result of (a), we let $M_T=\sqrt{T}/(\log T)^2$ and $\mathbf{1}_{it}=\mathbf{1}\{\|x_{it}\|<M_T\}$. We define
$$\xi_{1,it}=x_{it}\mathbf{1}_{it}-E[x_{it}\mathbf{1}_{it}],\qquad \xi_{2,it}=x_{it}(1-\mathbf{1}_{it}),\qquad \xi_{3,it}=-E[x_{it}(1-\mathbf{1}_{it})].$$
Then $x_{it}-E[x_{it}]=\xi_{1,it}+\xi_{2,it}+\xi_{3,it}$, and thus we have
$$\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|\ge C_x\kappa_{NT}\Big) \le \max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|+\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|+\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\|\ge C_x\kappa_{NT}\Big).$$
We prove the first part of (a) by showing
$$\text{(a1)}\quad N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)=o(1),$$
$$\text{(a2)}\quad N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)=o(1),\quad\text{and}$$
$$\text{(a3)}\quad \max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\|=o(\kappa_{NT}).$$
To prove (a1), we let $\xi^{\varphi}_{1,it}=\varphi'\xi_{1,it}$ for some constant $p\times 1$ vector $\varphi$ with $\|\varphi\|=1$. Then, by Assumption 2-(1)-(ii), $\xi^{\varphi}_{1,it}$ is a zero-mean strong mixing process, not necessarily stationary, with the mixing coefficients satisfying $\alpha[t]\le c_\alpha\rho^t$ for some $c_\alpha>0$ and $\rho\in(0,1)$. In addition, $\max_{1\le t\le T}|\xi^{\varphi}_{1,it}|\le 2M_T$ almost surely by construction. We define
$$v_N^2=\max_{1\le i\le N}\sup_{t\ge 1}\Big\{\mathrm{var}(\xi^{\varphi}_{1,it})+2\sum_{s=t+1}^{\infty}\big|\mathrm{cov}(\xi^{\varphi}_{1,it},\xi^{\varphi}_{1,is})\big|\Big\},$$
which is bounded by Assumption 2-(1)-(ii) and (iii), and the Davydov inequality. Then, by Lemma S1.1 of Su, Shi and Phillips (2016), there exists a constant $C_0>0$ such that for any $T\ge 2$ and $C_x>0$,
$$N\cdot\max_{1\le i\le N}\Pr\Big(\Big|\frac{1}{T}\sum_{t=1}^{T}\xi^{\varphi}_{1,it}\Big|\ge\frac{C_x}{2}\kappa_{NT}\Big) \le N\exp\Big(-\frac{C_0C_x^2T^2\kappa_{NT}^2/4}{v_N^2T+4M_T^2+2C_xT\kappa_{NT}M_T(\log T)^2/2}\Big) = \exp\Big(-\Big\{\frac{C_0C_x^2(\log N)^2/4}{v_N^2+4/(\log T)^4+C_x\log N}-\log N\Big\}\Big).$$
Thus, by choosing $C_x$ sufficiently large, it follows that
$$N\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{1,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big)\to 0\quad\text{as }(N,T)\to\infty.$$
Next, by Assumption 2-(1)-(iii) and 2-(2)-(ii), and the Boole and Markov inequalities, we have
$$N\cdot\max_{1\le i\le N}\Pr\Big(\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{2,it}\Big\|\ge\frac{C_x}{2}\kappa_{NT}\Big) \le N\cdot\max_{1\le i\le N}\Pr\Big(\max_{1\le t\le T}\|x_{it}\|\ge M_T\Big) \le NT\max_{1\le i\le N}\max_{1\le t\le T}\Pr\big(\|x_{it}\|\ge M_T\big) \le \frac{NT}{M_T^q}\max_{1\le i\le N}\max_{1\le t\le T}E\|x_{it}\|^q = o(1).$$
Lastly, by Assumption 2-(1)-(iii), and the Hölder and Markov inequalities,
$$\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\xi_{3,it}\Big\| \le \max_{1\le i\le N}\max_{1\le t\le T}E\big\|x_{it}\mathbf{1}\{\|x_{it}\|\ge M_T\}\big\| \le \max_{1\le i\le N}\max_{1\le t\le T}\big(E\|x_{it}\|^{q/2}\big)^{2/q}\max_{1\le i\le N}\max_{1\le t\le T}\big\{\Pr(\|x_{it}\|\ge M_T)\big\}^{(q-2)/q} \le \max_{1\le i\le N}\max_{1\le t\le T}\big(E\|x_{it}\|^{q/2}\big)^{2/q}\max_{1\le i\le N}\max_{1\le t\le T}\Big(\frac{E\|x_{it}\|^q}{M_T^q}\Big)^{(q-2)/q} = O\big(M_T^{-(q-2)}\big) = o(\kappa_{NT}),$$
where we use the fact that $M_T^{q-2}\kappa_{NT}=T^{(q-3)/2}\log N/(\log T)^{2(q-2)}\to\infty$ for $q\ge 4$ in the last step.
Then, the desired result follows by combining (a1), (a2) and (a3). □
Proof of Lemma 1 First, note that
$$\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}'(\beta_0-\hat\beta)+v_{it}\}\Big| \le \Big(\max_{1\le i\le N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}-E[x_{it}]\}\Big\|+\max_{1\le i\le N}E\|x_{it}\|\Big)\big\|\hat\beta-\beta_0\big\| + \max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}v_{it}\Big|,$$
where $\max_{1\le i\le N}E\|x_{it}\|=O(1)$ and $\|\hat\beta-\beta_0\|=O_p((NT)^{-1/2})$ due to Assumption 2-(1)-(iii) and 2-(2)-(i). This implies that, for sufficiently large 0 < C < ∞,
$$\Pr\Big(\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}\{x_{it}'(\beta_0-\hat\beta)+v_{it}\}\Big|\ge C\kappa_{NT}\Big)=o(1) \qquad (A.1)$$
by Lemma A.1.
Recall $\eta=\min_{i\in S^c}u_{0,i}$ and $\hat\alpha_i=T^{-1}\sum_{t=1}^{T}(y_{it}-x_{it}'\hat\beta)=T^{-1}\sum_{t=1}^{T}(\alpha_0-u_{0,i}+x_{it}'(\beta_0-\hat\beta)+v_{it})$, where $u_{0,i}=0$ for all $i\in S$. Thus, it follows that
$$\min_{i\in S}\hat\alpha_i-\max_{i\in S^c}\hat\alpha_i = \min_{i\in S}\Big\{\frac{1}{T}\sum_{t=1}^{T}(\alpha_0+x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} - \max_{i\in S^c}\Big\{\frac{1}{T}\sum_{t=1}^{T}(\alpha_0-u_{0,i}+x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} \ge \min_{i\in S^c}u_{0,i} + \Big[\min_{i\in S}\Big\{\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\} - \max_{i\in S^c}\Big\{\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big\}\Big] \ge \eta-2\max_{1\le i\le N}\Big|\frac{1}{T}\sum_{t=1}^{T}(x_{it}'(\beta_0-\hat\beta)+v_{it})\Big| > \frac{\eta}{2}-O_p(\kappa_{NT}),$$
which implies
$$\Pr\Big(\min_{i\in S}\hat\alpha_i-\max_{i\in S^c}\hat\alpha_i>0\Big)\to 1 \qquad (A.2)$$
as (N, T) → ∞, since η > 0 and η/κNT → ∞ by Assumption 2-(2)-(iii). (A.2), in turn, implies Pr(α̂ = max_{i∈S} α̂i) → 1 as (N, T) → ∞, because α̂ is defined as max_{1≤i≤N} α̂i.
By (A.2), we can let α̂ = max_{i∈S} α̂i for sufficiently large (N, T), instead of α̂ = max_{1≤i≤N} α̂i.
Hence, for sufficiently large (N, T), we have
|α− α0| =
∣∣∣∣∣maxi∈S
{1
T
T∑t=1
(α0 + x′it(β0 − β) + vit
)}− α0
∣∣∣∣∣≤ max
1≤i≤N
∣∣∣∣∣ 1
T
T∑t=1
(x′it(β0 − β) + vit
)∣∣∣∣∣ = Op(κNT )
from A.1, which proves Lemma 1.
Since ui = α − αi = (α − α0) + (α0 − αi) = (α − α0) + (u0,i + α0,i − αi) so that |ui − u0,i| ≤
|α− α0|+ |αi − α0,i| ≤ 2 max1≤i≤N
∣∣∣ 1T
∑Tt=1 x
′it(β0 − β) + vit
∣∣∣ by the results above, we also have
Pr (|ui − u0,i| ≥ CκNT ) = o (1) (A.3)
for sufficiently large 0 < C <∞. �
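For intuition, the first-stage quantities in Lemma 1 (the firm-level mean residuals, their maximum as the frontier-intercept estimate, and the implied inefficiency estimates) are straightforward to compute. A minimal numerical sketch on simulated data, where all data-generating values and variable names are illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 50
beta0, alpha0 = np.array([1.0, -0.5]), 1.0
# Half the firms are fully efficient (u = 0); the rest draw positive inefficiencies.
u0 = np.where(np.arange(N) < N // 2, 0.0, rng.exponential(1.0, N))
x = rng.normal(size=(N, T, 2))
y = x @ beta0 + alpha0 - u0[:, None] + rng.normal(scale=0.3, size=(N, T))

# First-stage slope: pooled OLS after within-firm demeaning (fixed-effect estimator).
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
X, Y = xd.reshape(-1, 2), yd.reshape(-1)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Firm-level mean residuals, their maximum, and the implied inefficiencies.
alpha_i = (y - x @ beta_hat).mean(axis=1)
alpha_tilde = alpha_i.max()
u_tilde = alpha_tilde - alpha_i   # nonnegative by construction
print(abs(alpha_tilde - alpha0))  # small: Lemma 1 gives the kappa_NT rate
```

The `u_tilde` values are what enter the adaptive weights in the LASSO step.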
Proof of Theorem 1. For Equation (5) in the main text, we form the Lagrangian
\[
\mathcal{L}\left(\alpha,\{u_i\}_{i=1}^N,\{\rho_i\}_{i=1}^N\right)
=\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\alpha+u_i\right)^2
+\lambda\sum_{i=1}^N\pi_i u_i-\sum_{i=1}^N\rho_i u_i,
\]
where $\rho_i\ge 0$, $u_i\ge 0$, and $\rho_i u_i=0$ (complementary slackness) for all $i$. From the Karush-Kuhn-Tucker (KKT) conditions, we have
\begin{align}
\hat\alpha(\lambda)&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta+\hat u_i(\lambda)\right), \tag{A.4}\\
\hat u_i(\lambda)&=\max\left\{0,\;\hat\alpha(\lambda)-\frac1T\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)-\frac{\lambda}{2T}\pi_i\right\}. \tag{A.5}
\end{align}
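The KKT conditions (A.4)-(A.5) are exactly the two updates that a coordinate-descent implementation alternates: (A.5) applies a nonnegative soft-threshold to each inefficiency given the intercept, and (A.4) recenters the intercept given the inefficiencies. A minimal sketch of this fixed-point iteration (the function name, initialization, and tolerance are ours, not the paper's code):

```python
import numpy as np

def lasso_frontier(resid_mean, pi, lam, T, tol=1e-10, max_iter=10_000):
    """Iterate the KKT updates (A.4)-(A.5).

    resid_mean[i] = (1/T) * sum_t (y_it - x_it' beta_hat);
    pi = adaptive weights; lam = tuning parameter.
    Returns (alpha_hat, u_hat).
    """
    alpha = resid_mean.max()  # start at the first-stage intercept estimate
    u = np.zeros_like(resid_mean)
    for _ in range(max_iter):
        # (A.5): nonnegative soft-threshold update for each u_i given alpha
        u_new = np.maximum(0.0, alpha - resid_mean - lam * pi / (2 * T))
        # (A.4): alpha is the grand mean of residuals plus inefficiencies
        alpha_new = (resid_mean + u_new).mean()
        done = abs(alpha_new - alpha) + np.abs(u_new - u).max() < tol
        alpha, u = alpha_new, u_new
        if done:
            break
    return alpha, u
```

Starting at the maximum mean residual keeps each new intercept no larger than the previous one, so the iteration behaves as a monotone fixed-point map in this sketch; with $\lambda=0$ it reproduces the first-stage estimates exactly.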
Recall $\delta=|S|/N$ and let $\hat\delta=|\hat S|/N$, where $\hat S=\hat S(\lambda)=\{i:\hat u_i(\lambda)=0\}$. By plugging (A.5) into (A.4), we have
\begin{align*}
\hat\alpha(\lambda)
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)
+\frac{1}{NT}\sum_{i\in\hat S^c}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta+\hat u_i(\lambda)\right)\\
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta\right)
+\frac{1}{N}\sum_{i\in\hat S^c}\left(\hat\alpha(\lambda)-\frac{\lambda}{2T}\pi_i\right)\\
&=\frac{1}{NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+\alpha_0-u_{0,i}+v_{it}\right)
+\left(1-\hat\delta\right)\hat\alpha(\lambda)-\frac{\lambda}{2NT}\sum_{i\in\hat S^c}\pi_i,
\end{align*}
and hence
\[
\hat\alpha(\lambda)-\alpha_0
=\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)
-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i. \tag{A.6}
\]
This shows that $\hat\alpha(\lambda)$ is estimated as a common intercept for the firms classified as fully efficient by the LASSO, and that it contains a bias term due to the shrinkage applied to $\hat u_i(\lambda)$. From (A.5), it follows that, for $i\in\hat S^c$ (i.e., $\hat u_i(\lambda)>0$),
\begin{align*}
\hat u_i(\lambda)
&=\hat\alpha(\lambda)-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+\alpha_0-u_{0,i}+v_{it}\right)-\frac{\lambda}{2T}\pi_i\\
&=\frac{1}{\hat\delta NT}\sum_{j\in\hat S}\sum_{t=1}^T\left(x_{jt}'(\beta_0-\hat\beta)-u_{0,j}+v_{jt}\right)
-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)
-\frac{\lambda}{2\hat\delta NT}\sum_{j\in\hat S^c}\pi_j-\frac{\lambda}{2T}\pi_i.
\end{align*}
We prove the theorem by showing $S\subset\hat S$ and $S^c\subset\hat S^c$ w.p.a.1.

(i) We first prove $S\subset\hat S$ w.p.a.1 by showing $\Pr\left(\max_{i\in S}\hat u_i(\lambda)>0\right)\to 0$. Let $\tilde\tau=\max_{i\in S}\tilde u_i$. Then, from (A.5), for any $C>0$, we have
\begin{align}
\Pr\left(\max_{i\in S}\hat u_i(\lambda)>0\right)
&=\Pr\left(\max_{i\in S}\left\{\hat\alpha(\lambda)-\tilde\alpha_i-\frac{\lambda}{2T}\pi_i\right\}>0\right)\nonumber\\
&\le\Pr\left(\max_{i\in S}\left\{\hat\alpha(\lambda)-\tilde\alpha_i-\frac{\lambda}{2T}\pi_i\right\}>0,\;\tilde\tau\le C\kappa_{NT}\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right)\nonumber\\
&\le\Pr\left(\max_{i\in S}\left\{\frac{1}{\hat\delta NT}\sum_{j\in\hat S}\sum_{t=1}^T\left(x_{jt}'(\beta_0-\hat\beta)-u_{0,j}+v_{jt}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{j\in\hat S^c}\pi_j\right.\right.\nonumber\\
&\qquad\qquad\left.\left.-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2T}(C\kappa_{NT})^{-\gamma}\right\}>0\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right)\nonumber\\
&\le\Pr\left(2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left\{x_{it}'(\beta_0-\hat\beta)+v_{it}\right\}\right|-\frac{\lambda}{2T}(C\kappa_{NT})^{-\gamma}>0\right)+\Pr\left(\tilde\tau>C\kappa_{NT}\right), \tag{A.7}
\end{align}
where we use the facts that $u_{0,j}\ge 0$ and $\pi_j\ge 0$ for all $j$ in the last step. Then, by choosing sufficiently large $C<\infty$, we can easily show that the first term in (A.7) is $o(1)$ due to (A.1) and the fact that $((\lambda/T)\kappa_{NT}^{-\gamma})/\kappa_{NT}\to\infty$ as $(N,T)\to\infty$ by Assumption 2-(3). The second term in (A.7) is also $o(1)$ because
\[
\tilde\tau=\max_{i\in S}\tilde u_i=\max_{i\in S}\left\{(\tilde\alpha-\alpha_0)-(\tilde\alpha_i-\alpha_{0,i})\right\}
\le 2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left\{x_{it}'(\beta_0-\hat\beta)+v_{it}\right\}\right|
\]
and (A.1), where we use the fact that $u_{0,i}=0$ for $i\in S$.
(ii) Next, we prove $S^c\subset\hat S^c$ w.p.a.1. Define $D_i\equiv\{\hat u_i(\lambda)=0\}$, so that
\[
\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)=\Pr\left(\bigcup_{i\in S^c}D_i\right).
\]
Let $|S^c|=J$. We arbitrarily list the firms in $S^c$ and use an auxiliary index $[j]$, for $j=1,\dots,J$, to denote the $j$th firm on the list. Then we can partition $\bigcup_{i\in S^c}D_i$ into the disjoint sets $D_{[1]}\cap\left(\bigcup_{j=2}^J D_{[j]}\right)^c$, $D_{[2]}\cap\left(\bigcup_{j=3}^J D_{[j]}\right)^c$, $\dots$, and $D_{[J]}$. Therefore, we have
\[
\Pr\left(\bigcup_{i\in S^c}D_i\right)
=\sum_{j=1}^J\Pr\left(D_{[j]}\cap\Big(\bigcup_{k=j+1}^J D_{[k]}\Big)^c\right)
=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0,\;\hat u_{[j+1]}(\lambda)>0,\;\dots,\;\hat u_{[J]}(\lambda)>0\right),
\]
which is true regardless of the order of the firms on the list. So we list the firms in $S^c$ in ascending order of inefficiency, so that $u_{0,[1]}\le\dots\le u_{0,[j]}\le\dots\le u_{0,[J]}$. Then we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0,\;\hat u_{[j+1]}(\lambda)>0,\;\dots,\;\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)
\times\Pr\left(\hat u_{[j+1]}(\lambda)>0\;\Big|\;\hat u_{[j+2]}(\lambda)>0,\dots\right)\times\cdots\nonumber\\
&\qquad\cdots\times\Pr\left(\hat u_{[J-1]}(\lambda)>0\;\Big|\;\hat u_{[J]}(\lambda)>0\right)\times\Pr\left(\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\hat u_{[j]}(\lambda)=0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)-u_{0,i}+v_{it}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i\right.\nonumber\\
&\qquad\qquad\left.-\frac1T\sum_{t=1}^T\left(x_{[j]t}'(\beta_0-\hat\beta)-u_{0,[j]}+v_{[j]t}\right)-\frac{\lambda}{2T}\pi_{[j]}<0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right)\nonumber\\
&=\sum_{j=1}^J\Pr\left(\underbrace{u_{0,[j]}-\frac{\sum_{i\in\hat S}u_{0,i}}{\hat\delta N}}_{(*)}+\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i\right.\nonumber\\
&\qquad\qquad\left.-\frac1T\sum_{t=1}^T\left(x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right)-\frac{\lambda}{2T}\pi_{[j]}<0\;\Big|\;\hat u_{[j+1]}(\lambda)>0,\dots,\hat u_{[J]}(\lambda)>0\right). \tag{A.8}
\end{align}
We let $S^*=S^c\cap\hat S$ and $\hat\delta^*=|S^*|/N$. Then $(*)$ in the $j$th probability of (A.8) satisfies
\[
u_{0,[j]}-\frac{\sum_{i\in\hat S}u_{0,i}}{\hat\delta N}\ge u_{0,[j]}-\frac{\hat\delta^*u_{0,[j]}}{\hat\delta},
\]
since $u_{0,i}=0$ for all $i\in S$ and $u_{0,[j]}=\max_{i\in S^*}u_{0,i}$ in the $j$th event by construction, which further gives us
\[
u_{0,[j]}-\frac{\hat\delta^*}{\hat\delta}u_{0,[j]}=\frac{\delta}{\hat\delta}u_{0,[j]}\ge\delta u_{0,[j]}\ge\delta\eta, \tag{A.9}
\]
since $\hat\delta-\hat\delta^*=\delta$ and $\delta\le\hat\delta\le 1$ when $S\subset\hat S$.

Let $\tilde\eta=\min_{i\in S^c}\tilde u_i$ and $\bar\alpha=\left|\frac{1}{\hat\delta NT}\sum_{i\in\hat S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|$. Then, by choosing sufficiently large $0<C<\infty$, we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0\right)\nonumber\\
&\le\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT},\;\bar\alpha\le C\kappa_{NT},\;S\subset\hat S\right)\nonumber\\
&\quad+\Pr\left(\Vert\beta_0-\hat\beta\Vert>\kappa_{NT}\right)+\Pr\left(\bar\alpha>C\kappa_{NT}\right)+\Pr\left(\tilde\eta<\eta-C\kappa_{NT}\right)+\Pr\left(S\not\subset\hat S\right), \tag{A.10}
\end{align}
where $\Pr\left(\Vert\beta_0-\hat\beta\Vert>\kappa_{NT}\right)=o(1)$ by Assumption 2-(2)-(i); $\Pr\left(S\not\subset\hat S\right)=o(1)$ by the first part of this proof; $\Pr\left(\bar\alpha>C\kappa_{NT}\right)=o(1)$ by the fact that $\bar\alpha\le\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T x_{it}'(\beta_0-\hat\beta)+v_{it}\right|$ and (A.1); and $\Pr\left(\tilde\eta<\eta-C\kappa_{NT}\right)=o(1)$ by the fact that
\[
|\tilde\eta-\eta|\le|\tilde\eta-u_{0,\ell}|+|\tilde u_{\ell_0}-\eta| \tag{A.11}
\]
and (A.3), where $\ell=\arg\min_{i\in S^c}\tilde u_i$ and $\ell_0=\arg\min_{i\in S^c}u_{0,i}$.$^{30}$ Furthermore, we have
\[
\frac{\lambda}{2\hat\delta NT}\sum_{i\in\hat S^c}\pi_i+\frac{\lambda}{2T}\pi_{[j]}
\le\frac{\lambda}{2\hat\delta NT}(1-\hat\delta)N\tilde\eta^{-\gamma}+\frac{\lambda}{2T}\tilde\eta^{-\gamma}
=\frac{\lambda}{2\hat\delta T}\tilde\eta^{-\gamma}\le\frac{\lambda}{\hat\delta T}\tilde\eta^{-\gamma}, \tag{A.12}
\]
where we use the facts that $\hat S^c\subset S^c$ and $\delta\le\hat\delta\le 1$ when $S\subset\hat S$.

$^{30}$Note that $|\tilde\eta-\eta|\le|\tilde u_{\ell_0}-\eta|$ if $\tilde\eta>\eta$ and $|\tilde\eta-\eta|\le|\tilde\eta-u_{0,\ell}|$ if $\tilde\eta<\eta$.
Then, for the first term in (A.10), by combining (A.8), (A.9) and (A.12), we have
\begin{align}
&\Pr\left(\text{there exists } i\in S^c \text{ such that } \hat u_i(\lambda)=0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT},\;\bar\alpha\le C\kappa_{NT},\;S\subset\hat S\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\left|\frac1T\sum_{t=1}^T\left\{x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right\}\right|-\frac{\lambda}{\delta T}\tilde\eta^{-\gamma}<0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT},\;\tilde\eta\ge\eta-C\kappa_{NT}\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\left|\frac1T\sum_{t=1}^T\left\{x_{[j]t}'(\beta_0-\hat\beta)+v_{[j]t}\right\}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0,\;\Vert\beta_0-\hat\beta\Vert\le\kappa_{NT}\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{[j]t}-E[x_{[j]t}]\right\}\right\Vert+E\left\Vert x_{[j]t}\right\Vert\right)-\left|\frac1T\sum_{t=1}^T v_{[j]t}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0\right)\nonumber\\
&\le\sum_{j=1}^J\Pr\left(\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(C\kappa_{NT}+E\left\Vert x_{[j]t}\right\Vert\right)-\left|\frac1T\sum_{t=1}^T v_{[j]t}\right|-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}<0\right)\nonumber\\
&\quad+\sum_{j=1}^J\Pr\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{it}-E[x_{it}]\right\}\right\Vert>C\kappa_{NT}\right)\nonumber\\
&\le N\max_{1\le i\le N}\Pr\left(\left|\frac1T\sum_{t=1}^T v_{it}\right|>\mathfrak{R}_{NT}\right)+N\max_{1\le i\le N}\Pr\left(\left\Vert\frac1T\sum_{t=1}^T\left\{x_{it}-E[x_{it}]\right\}\right\Vert>C\kappa_{NT}\right), \tag{A.13}
\end{align}
where $\mathfrak{R}_{NT}=\delta\eta-C\kappa_{NT}-\kappa_{NT}\left(C\kappa_{NT}+E\Vert x_{it}\Vert\right)-\frac{\lambda}{\delta T}(\eta-C\kappa_{NT})^{-\gamma}$, and the first inequality also uses $\hat\delta\ge\delta$ on the event $S\subset\hat S$. We can then easily show that the two terms in (A.13) are $o(1)$ by an application of Lemma A.1 and the fact that
\[
\frac{\mathfrak{R}_{NT}}{\kappa_{NT}}
=\frac{\delta\eta}{\kappa_{NT}}-C-C\kappa_{NT}-E\Vert x_{it}\Vert-\frac{\lambda}{\delta T}\eta^{-\gamma}\kappa_{NT}^{-1}\left(1-C\kappa_{NT}/\eta\right)^{-\gamma}\to\infty
\]
as $(N,T)\to\infty$ by Assumptions 1 and 2. Thus, the proof is complete. $\square$
Proof of Theorem 2. By Theorem 1, w.p.a.1, we have
\[
\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)
=\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)
-\frac{\lambda}{2\sqrt{\delta NT}}\sum_{i\in S^c}\pi_i.
\]
The second term is $o_p(1)$ since
\[
\frac{\lambda}{\sqrt{\delta NT}}\sum_{i\in S^c}\pi_i
\le\sqrt{\frac{(1-\delta)^2}{\delta}}\,\lambda\sqrt{\frac{N}{T}}\,\eta^{-\gamma}\left(\frac{\tilde\eta}{\eta}\right)^{-\gamma}=o_p(1) \tag{A.14}
\]
by Assumption 2-(3) and the fact that
\[
\left|\frac{\tilde\eta}{\eta}-1\right|\le\frac{|\tilde\eta-\eta|}{\eta}=o_p(1),
\]
due to (A.11) and $\kappa_{NT}/\eta\to 0$ as $(N,T)\to\infty$ by Assumption 2-(2)-(iii).
Since $\hat\beta-\beta_0=\left(\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}'\right)^{-1}\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}$ and $\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}=\sum_{i\in S}\sum_{t=1}^T x_{it}v_{it}+\sum_{i\in S^c}\sum_{t=1}^T x_{it}v_{it}$, we have
\[
\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)
=\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T v_{it}
-\sqrt{\delta}\left(\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T x_{it}'\right)\left(\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}'\right)^{-1}\left(\frac{1}{\sqrt{NT}}\sum_{i=1}^N\sum_{t=1}^T x_{it}v_{it}\right)+o_p(1).
\]
We define
\[
\Upsilon_S=\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T x_{it}
\quad\text{and}\quad
H_0=\operatorname*{plim}_{N,T\to\infty}\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T x_{it}x_{it}',
\]
where $H_0>0$ by Assumption 3. We split the sample into $S$ and $S^c$ and define the two statistics
\begin{align*}
\Xi_{S,NT}&\equiv\frac{1}{\sqrt{\delta NT}}\sum_{i\in S}\sum_{t=1}^T\left\{v_{it}-\delta\Upsilon_S'H_0^{-1}x_{it}v_{it}\right\},\\
\Xi_{S^c,NT}&\equiv-\frac{1}{\sqrt{(1-\delta)NT}}\sum_{i\in S^c}\sum_{t=1}^T\sqrt{\delta(1-\delta)}\,\Upsilon_S'H_0^{-1}x_{it}v_{it},
\end{align*}
which are independent since the observations are cross-sectionally independent. By Assumption 3, we have
\[
\Xi_{S,NT}\overset{d}{\longrightarrow}N\left(0,\,\sigma_{S1}^2+\delta^2\sigma_{S2}^2-2\delta\sigma_{S1S2}\right)
\quad\text{and}\quad
\Xi_{S^c,NT}\overset{d}{\longrightarrow}N\left(0,\,\delta(1-\delta)\sigma_{S^c}^2\right)
\]
as $(N,T)\to\infty$, where
\begin{align*}
\sigma_{S1}^2&=\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T v_{it}v_{ik},\\
\sigma_{S2}^2&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S,\\
\sigma_{S1S2}&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{\delta NT}\sum_{i\in S}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}\right\},\\
\sigma_{S^c}^2&=\Upsilon_S'H_0^{-1}\left\{\operatorname*{plim}_{N,T\to\infty}\frac{1}{(1-\delta)NT}\sum_{i\in S^c}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S.
\end{align*}
Hence, $\sqrt{\delta NT}\left(\hat\alpha(\lambda)-\alpha_0\right)=\Xi_{S,NT}+\Xi_{S^c,NT}\overset{d}{\longrightarrow}N\left(0,\,\sigma_{S1}^2+\delta^2\sigma_{S2}^2-2\delta\sigma_{S1S2}+\delta(1-\delta)\sigma_{S^c}^2\right)$, and the desired result follows.$^{31}$
For the second result, for $i\in S^c$, we have
\begin{align*}
\sqrt{T}\left(\hat u_i(\lambda)-u_{0,i}\right)
&=\sqrt{T}\left(\hat\alpha(\lambda)-\alpha_0\right)-\frac{1}{\sqrt{T}}\sum_{t=1}^T x_{it}'(\beta_0-\hat\beta)-\frac{1}{\sqrt{T}}\sum_{t=1}^T v_{it}-\frac{\lambda}{2\sqrt{T}}\pi_i\\
&\equiv\Psi_{1,NT}+\Psi_{2i,NT}+\Psi_{3i,T}+\Psi_{4i,NT},
\end{align*}
where $\Psi_{1,NT}=O_p(1/\sqrt{\delta N})=o_p(1)$ from the first result, $\Psi_{2i,NT}=O_p(1/\sqrt{N})=o_p(1)$ since $\hat\beta-\beta_0=O_p(1/\sqrt{NT})$, and $\Psi_{4i,NT}=o_p(1)$ by a similar argument as in (A.14). Since $\Psi_{3i,T}\overset{d}{\longrightarrow}N(0,\sigma_i^2)$ as $T\to\infty$ by Assumption 3, where $\sigma_i^2=\operatorname{plim}_{T\to\infty}\frac1T\sum_{t=1}^T\sum_{k=1}^T v_{it}v_{ik}$ for each $i$, we have the desired result. $\square$

$^{31}$When $v_{it}$ is conditionally homoskedastic across $i$, we have $\sigma_{S2}^2=\sigma_{S^c}^2=\Upsilon_S'H_0^{-1}\left\{\lim_{T\to\infty}T^{-1}\sum_{t=1}^T\sum_{k=1}^T x_{it}v_{it}v_{ik}x_{ik}'\right\}H_0^{-1}\Upsilon_S$, and the limiting expression simplifies to $N\left(0,\,\sigma_{S1}^2+\delta\sigma_{S2}^2-2\delta\sigma_{S1S2}\right)$.
Proof of Theorem 3. We first define
\begin{align*}
\Lambda_-&=\{\lambda:\Pr(\hat S(\lambda)\supsetneq S)\to 1 \text{ as } (N,T)\to\infty\},\\
\Lambda_0&=\{\lambda:\Pr(\hat S(\lambda)=S)\to 1 \text{ as } (N,T)\to\infty\},\\
\Lambda_+&=\{\lambda:\Pr(\hat S(\lambda)\subsetneq S)\to 1 \text{ as } (N,T)\to\infty\},
\end{align*}
similarly to Hui, Warton and Foster (2015).$^{32}$ We denote the post-LASSO version of $\hat\theta(\lambda)$ by $\hat\theta_{\hat S(\lambda)}$,$^{33}$ the post-LASSO version of $\hat\sigma^2(\lambda)$ by $\hat\sigma^2_{\hat S(\lambda)}$, where
\[
\hat\sigma^2_{\hat S(\lambda)}=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda)}\right)^2,
\]
and the post-LASSO BIC by $\widetilde{\mathrm{BIC}}(\lambda)$,
\[
\widetilde{\mathrm{BIC}}(\lambda)=\log\hat\sigma^2_{\hat S(\lambda)}+\frac{\phi_{NT}}{NT}\left|\hat S^c(\lambda)\right|.
\]
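Operationally, Theorem 3 supports choosing $\lambda$ as the minimizer of the BIC over a grid: for each candidate $\lambda$, solve the LASSO problem and evaluate the log residual variance plus the penalty on the number of firms estimated as inefficient. A schematic sketch (the function name, the grid, and the simple choice of $\phi_{NT}$ in the test below are ours; in a full implementation the weights `pi` would come from the first-stage estimates):

```python
import numpy as np

def select_lambda(y, x, beta_hat, pi, lam_grid, phi_NT):
    """Pick lambda minimizing BIC(lam) = log sigma2(lam) + (phi_NT/(N*T)) * |S^c(lam)|."""
    N, T = y.shape
    e = y - np.einsum('itk,k->it', x, beta_hat)   # residuals y_it - x_it' beta_hat
    r = e.mean(axis=1)                            # firm-level mean residuals
    best = None
    for lam in lam_grid:
        alpha, u = r.max(), np.zeros(N)
        for _ in range(10_000):                   # iterate the KKT updates (A.4)-(A.5)
            u_new = np.maximum(0.0, alpha - r - lam * pi / (2 * T))
            alpha_new = (r + u_new).mean()
            done = abs(alpha_new - alpha) + np.abs(u_new - u).max() < 1e-10
            alpha, u = alpha_new, u_new
            if done:
                break
        sigma2 = ((e - alpha + u[:, None]) ** 2).mean()
        score = np.log(sigma2) + phi_NT / (N * T) * int((u > 0).sum())
        if best is None or score < best[0]:
            best = (score, lam, alpha, u)
    return best  # (BIC, lambda, alpha_hat, u_hat) at the minimizer
```

Firms with `u == 0` at the selected $\lambda$ form the estimated set of maximally efficient firms $\hat S(\lambda)$.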
The following lemma shows that, asymptotically, a $\lambda$ that yields an over-fitted or under-fitted model cannot be selected by $\widetilde{\mathrm{BIC}}(\lambda)$.

Lemma A.2. Suppose Assumptions 1 and 2 hold and there exists $\lambda_0\in\Lambda_0$. Then,
\[
\Pr\left(\inf_{\lambda\in\Lambda_-\cup\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1 \quad\text{as } (N,T)\to\infty.
\]

$^{32}$Recall Assumption 2-(3): i) $\lambda T^{-1/2}N^{1/2}\eta^{-\gamma}\to 0$; ii) $\lambda T^{(\gamma-1)/2}(\log N)^{-\gamma-1}\to\infty$ for some $\gamma>1$. Theorem 1 implies that, for $\lambda\in\Lambda_0$, both i) and ii) must be satisfied. For $\lambda\in\Lambda_+$, condition i) is satisfied but not ii); that is, $\lambda$ is not large enough, so some of the zero inefficiencies are estimated as nonzero, resulting in over-fitted models. For $\lambda\in\Lambda_-$, ii) is satisfied but not i), resulting in under-fitted models. In finite samples, under-fitted models include cases where some efficient firms are estimated as inefficient while some inefficient firms are estimated as efficient. However, Theorem 1 and its proof imply that we can ignore these cases asymptotically.

$^{33}$These post-LASSO estimates are simply the least-squares estimates given the estimated set of efficient firms, $\hat S(\lambda)$.
Proof of Lemma A.2. (i) We first show $\Pr\left(\inf_{\lambda\in\Lambda_-}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$. Let $\lambda_-\in\Lambda_-$. Since $\Pr(\hat S(\lambda_-)\supsetneq S)\to 1$ as $(N,T)\to\infty$, for sufficiently large $(N,T)$, we have
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_-)}
&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda_-)}\right)^2\\
&=\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^*}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)-u_{0,i}+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^{**}}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_-)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_-)}-u_{0,i}\right)+v_{it}\right)^2,
\end{align*}
where $S^*=S^c\cap\hat S(\lambda_-)$ and $S^{**}=S^c\cap\hat S^c(\lambda_-)$. Similarly, for large $(N,T)$,
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_0)}
&=\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\left(y_{it}-x_{it}'\hat\beta-\hat\theta_{i,\hat S(\lambda_0)}\right)^2\\
&=\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^*}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_0)}-u_{0,i}\right)+v_{it}\right)^2\\
&\quad+\frac{1}{NT}\sum_{t=1}^T\sum_{i\in S^{**}}\left(x_{it}'(\beta_0-\hat\beta)-\left(\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right)+\left(\hat u_{i,\hat S(\lambda_0)}-u_{0,i}\right)+v_{it}\right)^2.
\end{align*}
Then, for large $(N,T)$, it can be verified that
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}
&=\delta\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0-\frac{1}{\delta NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\quad+\frac{1}{N}\sum_{i\in S^*}\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&>\frac{1}{N}\sum_{i\in S^*}\left\{\hat\alpha_{\hat S(\lambda_-)}-\alpha_0+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&=\frac{1}{N}\sum_{i\in S^*}\left\{\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{j\in\hat S(\lambda_-)}\left(x_{jt}'(\beta_0-\hat\beta)+v_{jt}\right)-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}
+u_{0,i}-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\ge\frac{1}{N}\sum_{i\in S^*}\Bigg\{\underbrace{\left|u_{0,i}-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}\right|}_{(*)}
-\underbrace{\left|\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{j\in\hat S(\lambda_-)}\left(x_{jt}'(\beta_0-\hat\beta)+v_{jt}\right)-\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|}_{(**)}\Bigg\}^2
\end{align*}
by the reverse triangle inequality and the fact that
\[
\hat\alpha_{\hat S(\lambda_-)}-\alpha_0
=\frac{1}{\hat\delta NT}\sum_{t=1}^T\sum_{i\in\hat S(\lambda_-)}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{1}{\hat\delta N}\sum_{i\in S^*}u_{0,i},
\]
where $\hat\delta=\left|\hat S(\lambda_-)\right|/N$. Also note that $(*)$ is $O_p(1)$, or has the rate of $\eta$, which converges to zero more slowly than $(**)$.
Therefore, for large $(N,T)$, we have
\[
\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}
>\hat\delta^*\left\{\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2, \tag{A.15}
\]
where $\mathfrak{I}=\min_{i\in S^*}\left|u_{0,i}-\frac{1}{\hat\delta N}\sum_{j\in S^*}u_{0,j}\right|$ and $\hat\delta^*=\left|S^*\right|/N$.

Finally, note that, for any $\lambda_-\in\Lambda_-$,
\[
\widetilde{\mathrm{BIC}}(\lambda_-)-\widetilde{\mathrm{BIC}}(\lambda_0)
=\log\left\{1+\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^*
\ge\min\left\{\log 2,\;\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^*,
\]
and $\log 2-\frac{\phi_{NT}}{T}\hat\delta^*>0$ as $(N,T)\to\infty$ due to the condition that $(\phi_{NT}/T)^{1/2}\eta^{-1}\to 0$. Therefore, to prove $\Pr\left(\inf_{\lambda\in\Lambda_-}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$, it suffices to show that
\[
\inf_{\lambda\in\Lambda_-}\left\{\frac{\hat\sigma^2_{\hat S(\lambda_-)}-\hat\sigma^2_{\hat S(\lambda_0)}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\right\}-\frac{\phi_{NT}}{T}\hat\delta^* \tag{A.16}
\]
is positive w.p.a.1 as $(N,T)\to\infty$.

Inequality (A.15) implies that (A.16) is asymptotically greater than
\begin{align*}
&\frac{\hat\delta^*}{2}\,\hat\sigma^{-2}_{\hat S(\lambda_0)}\left\{\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2-\frac{\phi_{NT}}{T}\hat\delta^*\\
&=\frac{\phi_{NT}}{T}\hat\delta^*\left(\frac{1}{2\hat\sigma^2_{\hat S(\lambda_0)}}\left(\left(\frac{T}{\phi_{NT}}\right)^{1/2}\left[\mathfrak{I}-2\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right]\right)^2-1\right),
\end{align*}
which is asymptotically positive since $\hat\sigma^2_{\hat S(\lambda_0)}$ is bounded; $\mathfrak{I}$ is $O_p(1)$ or $O_p(\eta)$ and hence asymptotically dominates $\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|=O_p\!\left(\frac{\log N}{\sqrt{T}}\right)$ due to Assumption 2-(2)-(iii); and $\left(\frac{T}{\phi_{NT}}\right)^{1/2}\mathfrak{I}\to\infty$ by the condition that $(\phi_{NT}/T)^{1/2}\eta^{-1}\to 0$.
(ii) Next, we show $\Pr\left(\inf_{\lambda\in\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$. Let $\lambda_+\in\Lambda_+$. Similarly as in (i), for large $(N,T)$, it can be verified that
\begin{align*}
\hat\sigma^2_{\hat S(\lambda_+)}-\hat\sigma^2_{\hat S(\lambda_0)}
&\ge-\delta^{\circ}\left\{\hat\alpha_{\hat S(\lambda_0)}-\alpha_0-\frac{1}{\delta^{\circ}NT}\sum_{t=1}^T\sum_{i\in S^{\circ}}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2\\
&\quad-\delta^{\circ\circ}\max_{1\le i\le N}\left\{\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|+\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2,
\end{align*}
where $\delta^{\circ}=|S^{\circ}|/N$ and $\delta^{\circ\circ}=|S^{\circ\circ}|/N$ with $S^{\circ}=S\cap\hat S(\lambda_+)$ and $S^{\circ\circ}=S\cap\hat S^c(\lambda_+)$.

Therefore, to show $\Pr\left(\inf_{\lambda\in\Lambda_+}\widetilde{\mathrm{BIC}}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda_0)\right)\to 1$ as $(N,T)\to\infty$, it suffices to show that
\begin{align*}
\widetilde{\mathrm{BIC}}(\lambda_+)-\widetilde{\mathrm{BIC}}(\lambda_0)
&\ge\frac{\phi_{NT}}{T}\delta^{\circ\circ}
-\underbrace{\frac{\delta^{\circ}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\left\{\hat\alpha_{\hat S(\lambda_0)}-\alpha_0-\frac{1}{\delta^{\circ}NT}\sum_{t=1}^T\sum_{i\in S^{\circ}}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right\}^2}_{(*)}\\
&\quad-\underbrace{\frac{\delta^{\circ\circ}}{2\hat\sigma^2_{\hat S(\lambda_0)}}\max_{1\le i\le N}\left\{\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|+\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|\right\}^2}_{(**)}
\end{align*}
is positive w.p.a.1 as $(N,T)\to\infty$, which follows by the condition $\phi_{NT}/(\log N)^2\to\infty$, since $(**)$ is greater than $(*)$, but $(**)=O_p\!\left(\frac{(\log N)^2}{T}\right)$ because $\left|\hat\alpha_{\hat S(\lambda_0)}-\alpha_0\right|=O_p\!\left(\frac{1}{\sqrt{\delta NT}}\right)$ due to Theorem 2 and $\max_{1\le i\le N}\left|\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)\right|=O_p\!\left(\frac{\log N}{\sqrt{T}}\right)$.$^{34}$ $\square$

$^{34}$Even when $|S^{\circ\circ}|$ is finite, so that $\delta^{\circ\circ}=O\!\left(\frac1N\right)$ as $N\to\infty$, we obtain the same conclusion since $\delta^{\circ}\to\delta$ in this case, so $(*)=O_p\!\left(\frac{1}{NT}\right)$.
Next, to link the post-LASSO BIC and the LASSO BIC, we show that
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}=o_p\!\left(\frac{1}{NT}\right). \tag{A.17}
\]
Due to the shrinkage effect, we have $\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}>0$ and, similarly as in the proof of Lemma A.2 above, we can show that, for large $(N,T)$,
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}
=\delta\left\{\frac{\lambda}{2\delta NT}\sum_{i\in S^c}\pi_i\right\}^2+\frac{1}{N}\sum_{i\in S^c}\left\{\frac{\lambda}{2T}\pi_i\right\}^2,
\]
where we use the facts that $\hat\alpha(\lambda_0)-\alpha_0=\frac{1}{\delta NT}\sum_{t=1}^T\sum_{i\in S}\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2\delta NT}\sum_{i\in S^c}\pi_i$ and $\left(\hat\alpha(\lambda_0)-\alpha_0\right)-\left(\hat u_i(\lambda_0)-u_{0,i}\right)=\frac1T\sum_{t=1}^T\left(x_{it}'(\beta_0-\hat\beta)+v_{it}\right)-\frac{\lambda}{2T}\pi_i$ for $i\in S^c$, w.p.a.1 as $(N,T)\to\infty$. Then, using the results in the proof of Theorem 2, we have
\[
\hat\sigma^2(\lambda_0)-\hat\sigma^2_{\hat S(\lambda_0)}
\le\frac{1}{NT}\left\{\sqrt{\frac{(1-\delta)^2}{4\delta}}\,\lambda\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}\right\}^2
+\frac{1-\delta}{NT}\left\{\frac{\lambda}{2}\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}\right\}^2
=o_p\!\left(\frac{1}{NT}\right)
\]
since $\lambda\sqrt{\frac{N}{T}}\,\tilde\eta^{-\gamma}=o_p(1)$.

Finally, (A.17) and the fact that $\mathrm{BIC}(\lambda)>\widetilde{\mathrm{BIC}}(\lambda)$ for any $\lambda$, again due to the shrinkage effect, imply
\[
\mathrm{BIC}(\lambda)-\mathrm{BIC}(\lambda_0)>\widetilde{\mathrm{BIC}}(\lambda)-\widetilde{\mathrm{BIC}}(\lambda_0)+o_p\!\left(\frac{1}{NT}\right),
\]
which gives
\[
\Pr\left(\inf_{\lambda\in\Lambda_-\cup\Lambda_+}\mathrm{BIC}(\lambda)>\mathrm{BIC}(\lambda_0)\right)\to 1 \quad\text{as } (N,T)\to\infty.
\]
This means that, asymptotically, a $\lambda$ which yields an over-fitted or under-fitted model cannot be chosen by the BIC criterion, so the desired result follows. $\square$
B. Additional Simulations for δ ∈ {0.1, 0.9}
Table B.1: Estimation Accuracy: δ = 0.1
(Point estimate RMSE (α0 = 1); rank correlations in the last two columns.)

(N, T)      σu   U-LASSO    U-LSDV     α-LASSO   α-LSDV    LASSO     LSDV
(100, 10)   1    0.4537     0.8630     1.166     1.761     0.87      0.85
                 (0.1765)   (0.1820)   (0.272)   (0.204)   (0.041)   (0.039)
(100, 30)   1    0.2623     0.4822     1.059     1.420     0.94      0.93
                 (0.0753)   (0.1056)   (0.143)   (0.121)   (0.019)   (0.019)
(100, 50)   1    0.2014     0.3675     1.034     1.318     0.96      0.95
                 (0.0576)   (0.0830)   (0.108)   (0.095)   (0.013)   (0.014)
(100, 70)   1    0.1733     0.3089     1.025     1.266     0.97      0.96
                 (0.0481)   (0.0700)   (0.095)   (0.081)   (0.011)   (0.011)
(100, 10)   4    0.4987     0.7802     1.225     1.663     0.98      0.98
                 (0.1918)   (0.1880)   (0.294)   (0.217)   (0.006)   (0.006)
(100, 30)   4    0.2818     0.4585     1.103     1.390     0.99      0.99
                 (0.1003)   (0.1174)   (0.168)   (0.138)   (0.003)   (0.003)
(100, 50)   4    0.2136     0.3528     1.063     1.297     0.99      0.99
                 (0.0711)   (0.0914)   (0.124)   (0.107)   (0.002)   (0.002)
(100, 70)   4    0.1722     0.2914     1.041     1.245     1.00      1.00
                 (0.0453)   (0.0713)   (0.089)   (0.084)   (0.001)   (0.001)
(200, 10)   1    0.4011     0.9625     1.025     1.874     0.89      0.85
                 (0.0627)   (0.1703)   (0.153)   (0.185)   (0.029)   (0.026)
(200, 70)   1    0.1661     0.3502     0.985     1.313     0.97      0.96
                 (0.0191)   (0.0675)   (0.053)   (0.075)   (0.008)   (0.008)
(200, 10)   4    0.4327     0.8770     1.122     1.779     0.98      0.98
                 (0.0743)   (0.1721)   (0.178)   (0.193)   (0.004)   (0.004)
(200, 70)   4    0.1614     0.3353     1.017     1.295     1.00      1.00
                 (0.0187)   (0.0708)   (0.057)   (0.08)    (0.001)   (0.001)
(400, 10)   1    0.4168     1.0597     0.920     1.981     0.91      0.85
                 (0.0399)   (0.1713)   (0.086)   (0.184)   (0.021)   (0.018)
(400, 70)   1    0.1794     0.3868     0.952     1.353     0.97      0.96
                 (0.0217)   (0.0643)   (0.039)   (0.070)   (0.005)   (0.005)
(400, 10)   4    0.4097     0.9973     1.045     1.911     0.98      0.98
                 (0.0376)   (0.1728)   (0.123)   (0.188)   (0.003)   (0.003)
(400, 70)   4    0.1577     0.3799     0.999     1.346     1.00      1.00
                 (0.0103)   (0.0668)   (0.039)   (0.073)   (0.001)   (0.001)
(1000, 10)  1    0.4792     1.1787     0.822     2.108     0.93      0.85
                 (0.0430)   (0.1546)   (0.057)   (0.164)   (0.014)   (0.011)
(1000, 10)  4    0.4158     1.1115     0.970     2.037     0.99      0.98
                 (0.0276)   (0.1612)   (0.081)   (0.171)   (0.002)   (0.002)
Table B.2: Estimation Accuracy: δ = 0.9
(Point estimate RMSE (α0 = 1); rank correlations in the last two columns.)

(N, T)      σu   U-LASSO    U-LSDV     α-LASSO   α-LSDV    LASSO     LSDV
(100, 10)   1    0.2772     1.0699     1.175     1.994     0.84      0.81
                 (0.1068)   (0.1713)   (0.144)   (0.184)   (0.133)   (0.151)
(100, 30)   1    0.1415     0.6292     1.057     1.582     0.91      0.89
                 (0.0458)   (0.0996)   (0.153)   (0.107)   (0.090)   (0.099)
(100, 50)   1    0.1046     0.4901     1.018     1.455     0.94      0.92
                 (0.0368)   (0.0769)   (0.186)   (0.082)   (0.068)   (0.076)
(100, 70)   1    0.0886     0.4137     0.985     1.383     0.95      0.93
                 (0.0404)   (0.0640)   (0.228)   (0.069)   (0.053)   (0.056)
(100, 10)   4    0.2744     1.0646     1.174     1.988     0.96      0.96
                 (0.1062)   (0.1736)   (0.120)   (0.186)   (0.038)   (0.039)
(100, 30)   4    0.1382     0.6229     1.076     1.577     0.98      0.98
                 (0.0461)   (0.0975)   (0.056)   (0.104)   (0.026)   (0.027)
(100, 50)   4    0.0980     0.4889     1.049     1.455     0.98      0.98
                 (0.0323)   (0.0753)   (0.039)   (0.080)   (0.018)   (0.019)
(100, 70)   4    0.0799     0.4138     1.037     1.384     0.99      0.99
                 (0.0249)   (0.0660)   (0.031)   (0.071)   (0.018)   (0.018)
(200, 10)   1    0.1991     1.1702     1.088     2.099     0.89      0.83
                 (0.0439)   (0.1619)   (0.064)   (0.170)   (0.075)   (0.084)
(200, 70)   1    0.0683     0.4496     1.013     1.422     0.96      0.95
                 (0.0164)   (0.0598)   (0.067)   (0.063)   (0.028)   (0.032)
(200, 10)   4    0.1992     1.1657     1.091     2.095     0.97      0.97
                 (0.0441)   (0.1621)   (0.061)   (0.172)   (0.019)   (0.020)
(200, 70)   4    0.0628     0.4488     1.015     1.420     0.99      0.99
                 (0.0117)   (0.0575)   (0.017)   (0.061)   (0.007)   (0.007)
(400, 10)   1    0.1718     1.2552     1.046     2.190     0.91      0.84
                 (0.0205)   (0.1504)   (0.036)   (0.158)   (0.048)   (0.058)
(400, 70)   1    0.0656     0.4800     1.008     1.454     0.97      0.96
                 (0.0069)   (0.0573)   (0.011)   (0.060)   (0.017)   (0.020)
(400, 10)   4    0.1727     1.2531     1.050     2.187     0.98      0.98
                 (0.0221)   (0.1502)   (0.035)   (0.159)   (0.010)   (0.011)
(400, 70)   4    0.0591     0.4802     1.007     1.454     1.00      0.99
                 (0.0068)   (0.0567)   (0.010)   (0.060)   (0.003)   (0.003)
(1000, 10)  1    0.1674     1.3605     1.016     2.301     0.93      0.85
                 (0.0112)   (0.1436)   (0.021)   (0.150)   (0.026)   (0.035)
(1000, 10)  4    0.1641     1.3736     1.023     2.314     0.99      0.98
                 (0.0112)   (0.1461)   (0.019)   (0.152)   (0.005)   (0.005)
Table B.3: Selection Accuracy

δ = 0.1
                      σu = 1                                  σu = 2                                  σu = 4
(N, T)       P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss
(100, 10)    0.6822   0.7729   0.2726   0.7364      0.6438   0.8805   0.1720   0.6366      0.6557   0.9332   0.1257   0.5638
             (0.2362) (0.1175) (0.1241) (0.3008)    (0.2506) (0.0765) (0.0885) (0.3380)    (0.2465) (0.0447) (0.0592) (0.3286)
(100, 30)    0.7612   0.8197   0.2384   0.4694      0.7320   0.9066   0.1572   0.3982      0.7323   0.9509   0.1175   0.3220
             (0.2034) (0.0916) (0.0975) (0.1893)    (0.2275) (0.0569) (0.0681) (0.2026)    (0.2302) (0.0360) (0.0495) (0.2071)
(100, 50)    0.7945   0.8431   0.2207   0.3659      0.7625   0.9191   0.1490   0.3125      0.7761   0.9573   0.1161   0.2546
             (0.1916) (0.0769) (0.0821) (0.1427)    (0.2171) (0.0497) (0.0606) (0.1594)    (0.2083) (0.0318) (0.0438) (0.1708)
(100, 70)    0.8066   0.8585   0.2080   0.3132      0.7962   0.9246   0.1474   0.2625      0.8087   0.9606   0.1163   0.2148
             (0.1879) (0.0730) (0.0791) (0.1298)    (0.1991) (0.0441) (0.0534) (0.1285)    (0.1791) (0.0284) (0.0376) (0.1388)
(200, 10)    0.8186   0.6933   0.3579   0.9995      0.7713   0.8416   0.2197   0.9037      0.7444   0.9191   0.1473   0.7990
             (0.1404) (0.1013) (0.1019) (0.2437)    (0.1648) (0.0657) (0.0720) (0.2924)    (0.1685) (0.0421) (0.0509) (0.3007)
(200, 70)    0.8913   0.8257   0.2460   0.4118      0.8690   0.9117   0.1664   0.3651      0.8630   0.9550   0.1268   0.3110
             (0.0984) (0.0587) (0.0595) (0.1025)    (0.1152) (0.0376) (0.0420) (0.1161)    (0.1217) (0.0224) (0.0288) (0.1256)
(1000, 10)   0.9660   0.5136   0.5344   1.5388      0.9336   0.7472   0.3209   1.4418      0.8972   0.8789   0.1987   1.2904
             (0.0252) (0.0531) (0.0495) (0.2005)    (0.0419) (0.0418) (0.0408) (0.1961)    (0.0567) (0.0276) (0.0294) (0.2220)

δ = 0.9
                      σu = 1                                  σu = 2                                  σu = 4
(N, T)       P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss    P_S      P_S^c    δ̂        Max-miss
(100, 10)    0.7380   0.7468   0.6895   0.3797      0.7306   0.8555   0.6720   0.2943      0.7482   0.9222   0.6812   0.2093
             (0.1465) (0.1741) (0.1420) (0.2794)    (0.1564) (0.1270) (0.1470) (0.2897)    (0.1515) (0.0887) (0.1393) (0.2878)
(100, 30)    0.8134   0.7934   0.7527   0.2462      0.8140   0.8867   0.7439   0.1759      0.8207   0.9383   0.7448   0.1117
             (0.1188) (0.1756) (0.1162) (0.2235)    (0.1220) (0.1062) (0.1135) (0.1847)    (0.1192) (0.0770) (0.1097) (0.1622)
(100, 50)    0.8483   0.7935   0.7841   0.2218      0.8530   0.9048   0.7773   0.1272      0.8568   0.9467   0.7765   0.0804
             (0.1050) (0.1933) (0.1044) (0.2420)    (0.1037) (0.0991) (0.0966) (0.1475)    (0.1057) (0.0739) (0.0970) (0.1296)
(100, 70)    0.8707   0.8062   0.8030   0.2031      0.8697   0.9092   0.7918   0.1119      0.8739   0.9535   0.7912   0.0637
             (0.0978) (0.2192) (0.0999) (0.2808)    (0.0976) (0.0975) (0.0912) (0.1387)    (0.0947) (0.0682) (0.0867) (0.1075)
(200, 10)    0.8697   0.6501   0.8178   0.6744      0.8703   0.7940   0.8038   0.6103      0.8753   0.8882   0.7990   0.4721
             (0.0771) (0.1375) (0.0786) (0.2677)    (0.0766) (0.1035) (0.0744) (0.3021)    (0.0748) (0.0754) (0.0705) (0.3402)
(200, 70)    0.9404   0.7892   0.8674   0.2829      0.9476   0.8841   0.8644   0.2276      0.9509   0.9378   0.8620   0.1565
             (0.0429) (0.1144) (0.0445) (0.1503)    (0.0409) (0.0738) (0.0392) (0.1442)    (0.0422) (0.0568) (0.0396) (0.1513)
(1000, 10)   0.9675   0.5090   0.9198   1.2077      0.9670   0.7041   0.8999   1.2131      0.9697   0.8341   0.8894   1.1399
             (0.0154) (0.0675) (0.0189) (0.2175)    (0.0154) (0.0568) (0.0174) (0.2426)    (0.0144) (0.0419) (0.0152) (0.2432)