Sensitivity and dimensionality tests of DEA efficiency scores


European Journal of Operational Research 154 (2004) 410–422



Andrew Hughes a, Suthathip Yaisawarng b,*

a NSW Treasury, Level 26, Governor Macquarie Tower, 1 Farrer Place, Sydney 2000, Australia
b Department of Economics, Union College, Schenectady, NY 12308, USA

* Corresponding author. Tel.: +1-518-388-6606; fax: +1-518-388-6988. E-mail addresses: [email protected] (A. Hughes), [email protected] (S. Yaisawarng).

Abstract

Extending Hughes and Yaisawarng [J.L.T. Blank (Ed.), Public Provision and Performance: Contributions from Efficiency and Productivity Measurement, Elsevier Science B.V., The Netherlands, p. 277], this paper addresses issues of dimensionality for enhancing the credibility of data envelopment analysis (DEA) results in practical applications and assisting practitioners in making an appropriate selection of variables for a DEA model. The paper develops a structured approach to testing whether changing the number of model variables for a fixed sample size affects DEA results. Using a simulation method, the paper tests whether a calculated DEA efficiency score reflects the dimension of the model or the existence of inefficiency. An empirical illustration is included.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Data envelopment analysis; Sensitivity tests; Dimensionality tests

1. Introduction

Following the seminal work by Farrell (1957) on productive efficiency measurement and the work by Charnes et al. (1978), data envelopment analysis (DEA) has become a popular empirical method for measuring efficiency. (See Cooper et al. (2000) for a comprehensive list of DEA studies.) This method assigns an efficiency score to each operational unit based on how well it transforms a given set of inputs into outputs, relative to the best performers in the sample. Due to the nature of the technique, several factors, including the relationship between sample size and the number of model variables, may affect DEA results. For example, adding more model variables for a given sample size potentially yields higher efficiency scores for units in the sample. In the same way, fewer units in a sample for a given number of model variables may lead to higher efficiency scores. This is a dimensionality issue.

One way to address the issue of dimensions is to expand the sample size using pooled cross-section and time-series data. This method assumes no technological change over the sample periods (an assumption that can be problematic for industries undergoing substantial technical innovation). Another approach, proposed by Nunamaker (1985), is to include alternative sets of variables for a fixed sample size and to search for the best dimensions for management to pursue.

Another important issue is the selection of variables for a model. Given imperfect data, researchers are often required to make tradeoffs in selecting input and output variables. A small number of studies (see, for example, Valdmanis, 1992) calculate efficiency scores for a sample using several alternative sets of variables and compare the mean values of the results. These studies focus on sensitivity testing of average efficiency scores relative to a frontier constructed from the observed sample, rather than of individual scores and their ranks. Other methods focus on sensitivity testing in which frontiers are simulated using re-sampling techniques, including bootstrapping and jackknifing, to construct a confidence interval for each DEA efficiency score. (See, for example, Grosskopf and Yaisawarng, 1990; Ferrier et al., 1993.) Grosskopf (1996) provides a brief review of approaches to sensitivity testing, including a statistical test of DEA results. Hughes and Yaisawarng (2000) perform several sensitivity tests of their DEA efficiency scores obtained from four model specifications. None of these studies, however, provides a test of the impact of the dimensions of a DEA model on results.

DEA results should be independent of the dimension of the DEA model and robust across alternative proxy variables if the empirical results are to be used for devising appropriate policy recommendations. To date, there has been no systematic testing of the dimensionality of DEA efficiency scores relative to variable specifications.

This paper outlines an empirical procedure for DEA studies in situations where researchers are faced with a set of alternative proxy measures for a variable that are not perfect substitutes for each other. We examine various models using different sets of potential variables for a fixed sample size and develop a systematic method for testing the dimensionality of DEA results across alternative model specifications. Specifically, the paper applies a simulation method to test whether the number of model variables or the magnitude of inefficiency affects a calculated efficiency score. The goal of this exercise is to improve the process for selecting a suitable DEA model. The paper provides a rigorous empirical demonstration of the proposed procedure using a sample of 161 police patrols in the State of New South Wales (NSW), Australia, in 1995–1996.

The remainder of the paper is organised as follows. Section 2 presents a standard input-oriented DEA model and discusses sensitivity and dimensionality tests. Section 3 demonstrates the application of the proposed procedures for a sample of NSW police patrols in 1995–1996. Section 4 concludes the paper.

2. Methodology

This section consists of three subsections. Section 2.1 introduces a standard input-oriented DEA model and discusses the effects of additional model variables on DEA computed efficiency scores. It also discusses the effects of different types of frontier technology on efficiency scores. Section 2.2 discusses the sensitivity tests of DEA results for a fixed sample size. Results of the tests can be used to select a preferred model. Section 2.3 discusses a procedure for testing whether DEA computed efficiency scores reflect the existence of inefficiency or the dimension of the DEA model, using a simulation method.

2.1. A standard input-oriented DEA model

Consider an organisation consisting of several units performing similar tasks. Suppose there are $J$ units in the organisation. Each unit uses $N$ inputs to produce $M$ outputs. The objective of each unit is to minimise the inputs used to produce a predetermined level of outputs. Let $y_{mj}$ and $x_{nj}$ be the quantity of output $m$ produced by unit $j$, $m = 1, \ldots, M$, and the quantity of input $n$ used by unit $j$, $n = 1, \ldots, N$, respectively. An input-oriented DEA model used to calculate the efficiency score for unit $k$ is formulated as follows.¹

$$
\begin{aligned}
TE_k = \min\ & \lambda \\
\text{s.t.}\quad & y_{11}z_1 + y_{12}z_2 + \cdots + y_{1J}z_J \ge y_{1k}, \\
& \qquad \vdots \\
& y_{M1}z_1 + y_{M2}z_2 + \cdots + y_{MJ}z_J \ge y_{Mk}, \\
& x_{11}z_1 + x_{12}z_2 + \cdots + x_{1J}z_J \le \lambda x_{1k}, \\
& \qquad \vdots \\
& x_{N1}z_1 + x_{N2}z_2 + \cdots + x_{NJ}z_J \le \lambda x_{Nk}, \\
& z_1 + z_2 + \cdots + z_J = 1, \\
& z_1 \ge 0,\ z_2 \ge 0,\ \ldots,\ z_J \ge 0,
\end{aligned}
\tag{1}
$$

where $z_j$, $j = 1, \ldots, J$, are weights or the relative impact of unit $j$ on the target point for unit $k$. There are $(M + N + 1)$ constraints in this LP problem. The first $M$ constraints are the output constraints, one for each output. A constraint on output $m$ indicates that the linear combination of output $m$ produced by all $J$ units in the sample must be at least as large as output $m$ produced by unit $k$, the unit being assessed. The next $N$ constraints are the input constraints. In this case, a constraint on input $n$ indicates that the linear combination of input $n$ used by all $J$ units in the sample must be less than or equal to $\lambda$ times unit $k$'s current inputs, where $\lambda$ is a constant that satisfies all constraints. The smallest value of $\lambda$ is the technical efficiency score of unit $k$. The last constraint restricts the sum of the relative impacts to unity. This requirement permits the technology or the production frontier to have VRS, i.e., CRS in some output range and increasing or decreasing returns in other output ranges.² For details of this restriction, see Afriat (1972).

¹ Charnes et al. (1978) formulate a DEA model as a fractional linear programming (LP) problem, i.e.,

$$
\begin{aligned}
TE_k = \max\ & u_1 y_{1k} + u_2 y_{2k} + \cdots + u_M y_{Mk} \\
\text{s.t.}\quad & v_1 x_{1k} + v_2 x_{2k} + \cdots + v_N x_{Nk} = 1, \\
& u_1 y_{1j} + u_2 y_{2j} + \cdots + u_M y_{Mj} - (v_1 x_{1j} + v_2 x_{2j} + \cdots + v_N x_{Nj}) \le 0, \\
& \qquad j = 1, \ldots, J,\ j \ne k, \\
& u_m \ge 0,\ m = 1, \ldots, M; \qquad v_n \ge 0,\ n = 1, \ldots, N.
\end{aligned}
$$

The solution to this problem gives a technical efficiency score for unit $k$ as well as the set of output and input weights, relative to a constant returns to scale (CRS) technology. The LP formulation can be modified for a variable returns to scale (VRS) technology as suggested by Banker et al. (1984). The modified LP formulation is dual to that of (1). For further details, see Färe et al. (1985, 1994), Lovell (1993) or Steering Committee for the Review of Commonwealth/State Service Provision (1997).

The solution values from the LP problem in (1) are the technical efficiency score for unit $k$ and the values of the relative impacts $z_j$. When $z_j > 0$, unit $j$ is an efficient peer for unit $k$. When $z_j = 0$, unit $j$ is not an efficient peer for unit $k$. The LP problem to compute efficiency scores for the remaining units in the sample can be formulated in a similar fashion to (1). The only difference is that the values of outputs and inputs for unit $k$ on the right-hand side of the constraints are replaced with the outputs and inputs for the unit considered. To calculate efficiency scores for all $J$ units in the sample, the LP problem is solved $J$ times, once for each unit.
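To make the envelopment LP in (1) concrete, the following minimal sketch solves it for one unit with scipy.optimize.linprog and loops over all $J$ units. The function name, array layout and toy data are illustrative choices of ours, not code or data from the paper.

```python
# Sketch of the input-oriented, VRS DEA model in Eq. (1).
import numpy as np
from scipy.optimize import linprog

def dea_vrs_input(Y, X, k):
    """Technical efficiency of unit k.

    Y : (M, J) array of outputs y[m, j]
    X : (N, J) array of inputs  x[n, j]
    Returns (TE_k, z) where z are the peer weights.
    """
    M, J = Y.shape
    N, _ = X.shape
    # Decision vector: [z_1, ..., z_J, lambda]; minimise lambda.
    c = np.r_[np.zeros(J), 1.0]
    # Output constraints: -sum_j y[m, j] z_j <= -y[m, k]
    A_out = np.hstack([-Y, np.zeros((M, 1))])
    b_out = -Y[:, k]
    # Input constraints: sum_j x[n, j] z_j - lambda * x[n, k] <= 0
    A_in = np.hstack([X, -X[:, [k]]])
    b_in = np.zeros(N)
    A_ub = np.vstack([A_out, A_in])
    b_ub = np.r_[b_out, b_in]
    # VRS convexity constraint: sum_j z_j = 1
    A_eq = np.r_[np.ones(J), 0.0].reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0, None)] * J + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:J]

# Toy example: 5 units, 1 output, 1 input; the LP is solved once per unit.
# The fourth unit is dominated (score about 0.58); the rest lie on the frontier.
Y = np.array([[1.0, 2.0, 3.0, 2.5, 4.5]])
X = np.array([[2.0, 3.0, 4.0, 6.0, 9.0]])
scores = [dea_vrs_input(Y, X, k)[0] for k in range(Y.shape[1])]
print(np.round(scores, 3))
```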

Grosskopf (1986) shows that a technical efficiency score relative to a VRS frontier is at least as large as the technical efficiency score of the same unit relative to a CRS frontier, given the same number of model variables. This is because the VRS technology envelops the data points more tightly than the CRS frontier. As a consequence, there will be more efficient units on the VRS frontier compared to the number of efficient units on the CRS frontier. This result may be generalised to an increase in the number of constraints in a DEA model. For any fixed sample size, more constraints in the DEA formulation, in particular imposing a flexible frontier technology, increasing the number of inputs, and/or increasing the number of outputs, result in higher efficiency scores and more efficient units.³ This generalisation is consistent with Nunamaker's assertion (1985) that an efficient unit from a DEA model with a lower number of inputs remains efficient in the model with additional inputs. Thrall (1989) provides a transition theorem to reinforce Nunamaker's proposition under a fixed sample size but with increasing numbers of inputs and outputs.

² The unit restriction on the z's is removed if the frontier technology exhibits CRS, i.e., a proportional increase in all inputs leads to the same proportional increase in all outputs.

³ Sample sizes have a similar effect on DEA computed efficiency scores. For a given set of variables and frontier technology, the proportion of efficient units in a small sample is higher than that in a larger sample.

2.2. Sensitivity tests

Researchers are often confronted with a range of variables that reflect the activities of the organisation under consideration. These variables are seldom ideal. Each variable may not fully capture all aspects of the input or output under consideration. Due to the nature of DEA modelling, adding more variables not only inflates DEA efficiency scores but also potentially conceals the actual magnitude of inefficiency. Researchers therefore need to pay close attention to the selection of variables for a DEA model. A selection of suitable variables sometimes cannot be justified on theoretical grounds. In addition, there is a trade-off between a marginally improved DEA model specification (from an increase in the number of variables) and potentially inflated DEA efficiency scores. It is recommended that different model specifications be used and that the results be tested empirically.

Hughes and Yaisawarng (2000) perform several sensitivity tests of DEA results across specifications for a fixed sample size. Their procedures focus on three technical aspects: (1) the overall relationship, (2) variations of individual scores, and (3) an assessment of appropriate efficient peers. The authors use objective measures such as statistical techniques to capture the first two aspects and subjective measures such as qualitative analysis and expert opinions to address the third aspect. We summarise their procedures for a fixed sample size of J units and s model specifications below.

2.2.1. Overall relationship

(a) Between each pair of efficiency scores. This test concentrates on the ranking of units based on their efficiency scores, rather than the magnitude of the scores. The rationale is that the position of a unit, or its rank in the sample, is independent of the dimension of the DEA model, so the efficiency rankings in any two models should be the same if the results are robust or insensitive to the model specification. To test the sensitivity of the DEA results between each pair of model specifications, the null hypothesis is that no association exists between each pair of efficiency ranks. The test statistic is the Spearman rank correlation coefficient, which takes a value between -1 and +1. The Spearman rank correlation coefficient will be statistically significant if the DEA results are not sensitive between the two models.
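A minimal sketch of this pairwise test with scipy.stats.spearmanr follows; the score array and its placeholder values are illustrative, not the authors' data or code.

```python
# Pairwise Spearman rank correlations between efficiency scores from s models.
import numpy as np
from scipy.stats import spearmanr
from itertools import combinations

rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=(161, 4))   # placeholder (J, s) score matrix

for i, j in combinations(range(scores.shape[1]), 2):
    rho, pval = spearmanr(scores[:, i], scores[:, j])
    # A significant rho (e.g., p below 0.01) rejects "no association of ranks".
    print(f"Model {i + 1} vs Model {j + 1}: rho = {rho:.3f}, p = {pval:.4f}")
```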

(b) Across all s specifications. This test looks at the distribution of the efficiency rankings. The null hypothesis is that the probability distributions of efficiency rankings are the same across the s specifications. This hypothesis is tested using the Friedman non-parametric test, which not only requires no assumption regarding the distribution of efficiency rankings but also accounts for the possibility that efficiency scores from each model specification are not independent across the s specifications. The Friedman test statistic approximately follows a Chi-square distribution with s - 1 degrees of freedom, where s is the number of specifications. A failure to reject the null hypothesis verifies the robustness of the results.
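A companion sketch of the Friedman test with scipy.stats.friedmanchisquare is below; again the score matrix is a placeholder, not the paper's data.

```python
# Friedman test that the distribution of efficiency rankings is the same
# across the s model specifications.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
scores = rng.uniform(0.5, 1.0, size=(161, 4))   # placeholder (J, s) score matrix

# Each row (unit) is a block and each column (specification) a treatment;
# the test works on within-row ranks, so score magnitudes do not matter.
stat, pval = friedmanchisquare(*[scores[:, m] for m in range(scores.shape[1])])
print(f"Friedman chi-square (df = {scores.shape[1] - 1}): {stat:.3f}, p = {pval:.4f}")
```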

2.2.2. Variations of individual scores across s specifications

For this test, the attention is on the magnitude of the efficiency scores. Although efficiency scores depend on the number of variables included in the DEA model, it is important to analyse the variation in the actual efficiency scores for individual units in a sample across the s specifications to gain a better understanding of the causes of variation if the results are indeed sensitive to the model selection. The difference between the maximum and the minimum efficiency score for each unit across the s specifications is the range statistic used as a measure of variation. When s is relatively large, say at least 20, an alternative measure of variation is the standard deviation of the efficiency scores. If the DEA results are not sensitive to the number of variables included in the model, the range statistic and the standard deviation should be relatively small. However, the cut-off point for small or large variation is a matter of subjective judgement. Units receiving relatively large variation in efficiency scores across model specifications require further examination.
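A short sketch of the per-unit range statistic and standard deviation follows; the 0.10 cut-off is purely illustrative of the subjective judgement mentioned above.

```python
# Per-unit range and standard deviation of efficiency scores across s models.
import numpy as np

rng = np.random.default_rng(2)
scores = rng.uniform(0.5, 1.0, size=(161, 4))   # placeholder (J, s) score matrix

unit_range = scores.max(axis=1) - scores.min(axis=1)   # range statistic
unit_std = scores.std(axis=1, ddof=1)                  # alternative when s is large
flagged = np.flatnonzero(unit_range > 0.10)            # arbitrary cut-off for review
print(f"{flagged.size} units exceed the illustrative 0.10 range cut-off")
```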

2.2.3. Analysis of appropriate efficient peers

This procedure tests whether the model selects an appropriate set of best practice units whose management style the relatively inefficient unit can imitate. If DEA results are to be used to improve managerial practice, the model must pass this "reality" check. An appropriate efficient peer is a best practice performer in the sample that is similar to the inefficient unit in some respects (e.g., size, characteristics) and is able to lend insights useful for assisting the inefficient unit to improve. Both qualitative and quantitative information are important features to include in the criteria. However, the recipe for assessing appropriate efficient peers is unique to the sample and should be determined in consultation with experts in the particular area of research.

2.3. Dimensionality testing

Before appropriate policies can be designed and implemented to assist inefficient units in improving their performance, policymakers need to know whether inefficiencies exist across units in the organisation. Sensitivity tests alone provide insights into whether the DEA results are robust across model specifications but do not provide justification that these inefficiencies indeed exist. The apparent inefficiencies may be a result of dimensionality. This paper develops a dimensionality test to validate the existence of inefficiencies.

Tauer and Hanchar (1995) use a Monte-Carlo simulation technique to investigate dimensionality effects on DEA efficiency scores, where dimensionality refers to the number of variables (inputs and outputs) and/or the sample size. The authors generate simulated data sets from a uniform distribution and conclude that the effect of increasing the number of variables in the model is stronger than the effect of reducing the number of observations. This paper focuses on the dimensionality effect of varying the number of model variables for a fixed sample size.

We use a simulation method to perform the dimensionality test. Our procedure begins with the construction of k simulated samples; each sample consists of J units and (N + M) series of numbers that bear no technological relationship. Specifically, we draw a series of J random numbers from a normal distribution. Some of these random numbers will be negative; others will be positive. Since the actual, observed data are non-negative, we take the absolute values of these random numbers and treat them as the values of one random variable. We replicate this procedure until we have N + M series of numbers. These figures constitute one simulated sample similar to the actual data set of J units with N + M variables. Unlike the actual, observed data set, the numbers in the simulated sample are purely random. The efficiency scores computed from the simulated sample therefore reflect only the effect of dimensionality. We apply this mechanism k times to generate k simulated samples.

For each simulated sample, we calculate DEA input-oriented efficiency scores under the assumption of VRS technology⁴ and calculate the number of efficient units expressed as a percentage of the total units in the sample. The mean percentage of efficient units over the k replications is used to test the null hypothesis that the proportions of efficient units from the simulated data and the actual data are the same, using a Z statistic. A failure to reject the null hypothesis implies that the DEA computed efficiency scores merely reflect the dimension of the model. If, on the other hand, we find evidence to reject the null hypothesis, it suggests that there are inefficiencies across units in the sample; the DEA computed efficiency scores then capture the relative performance of units in the data set and are not driven solely by the number of variables in the model.

⁴ Tauer and Hanchar (1995) introduce the dimensionality test in the context of an output-oriented DEA model with CRS technology.
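A minimal sketch of this simulation procedure follows, reusing the dea_vrs_input() function from the earlier sketch. The split of the (N + M) random series into "outputs" and "inputs", and the form of the Z statistic (simulated mean minus actual share, over the simulated standard deviation divided by the square root of k), are our reading of the text, not code from the paper.

```python
# Dimensionality test of Section 2.3: random |N(0, 1)| data with no
# technological structure, DEA under VRS, share of efficient units.
import numpy as np

def pct_efficient(Y, X, tol=1e-6):
    """Percentage of units with a VRS efficiency score of (numerically) 1."""
    J = Y.shape[1]
    scores = np.array([dea_vrs_input(Y, X, k)[0] for k in range(J)])
    return 100.0 * np.mean(scores >= 1.0 - tol)

def dimensionality_test(p_actual, J, n_inputs, n_outputs, k=1000, seed=0):
    """p_actual: % efficient units in the real sample. Returns (mean, sd, Z)."""
    rng = np.random.default_rng(seed)
    sims = np.empty(k)
    for r in range(k):
        # (N + M) independent series of |N(0, 1)| draws, one per variable.
        Y_sim = np.abs(rng.standard_normal((n_outputs, J)))
        X_sim = np.abs(rng.standard_normal((n_inputs, J)))
        sims[r] = pct_efficient(Y_sim, X_sim)
    z = (sims.mean() - p_actual) / (sims.std(ddof=1) / np.sqrt(k))
    # One-tailed test: reject the null at the 1% level if z exceeds 2.33.
    # (With J = 161 and k = 1000 this is slow; reduce k for a quick check.)
    return sims.mean(), sims.std(ddof=1), z
```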

3. An empirical demonstration: A case of 1995–1996 NSW police patrols

This section illustrates the empirical procedures proposed in this paper, using a sample of NSW police patrols. Several DEA model specifications are developed from the available data sets. Since the data do not capture all aspects of police patrols' activities, conclusions must be interpreted strictly within the context of the measured model variables.

Bearing the above limitations in mind, we begin with an illustration of the variables selected for the various model specifications. Section 3.1 introduces the sample of NSW police patrols and the range of potential variables. Section 3.2 formulates possible sets of variables and presents alternative DEA model specifications. The theoretical relationship of efficiency scores across models is also summarised. Section 3.3 presents the demonstration results.

3.1. The sample

The sample comprised 161 local police patrol districts that existed in NSW in 1995–1996. There were 97 urban patrols covering the Sydney, Newcastle and Wollongong metropolitan areas and 64 regional patrols covering the rest of the State. Input-oriented models were used to reflect an objective of delivering services to the community with the minimum amount of resources. The NSW Police Service provided the data.

NSW police districts produce two broad types of services: law enforcement and crime prevention. In 1995–1996, the latter represented approximately 40% of police work. Numbers of incidents, charges, summons and major car accidents were used to capture law enforcement activities. Crime prevention activities were measured by the distance travelled by police cars (in kilometres) and the number of intelligence reports prepared by front-line police officers. Patrol intelligence officers reviewed each submitted intelligence report on local criminal activity by: (i) making an assessment of its quality, (ii) assigning it a security classification, and (iii) determining its distribution.

Police districts combine labor and capital resources to deliver services to the community. In this study, labor comprised police and civilian employees measured as annual-average, full-time equivalent staff. These figures include staff on leave, for example, sick leave, long-service leave or secondments to other police units. Three possible measures of the capital input were available: the number of police cars, the number of personal computers, and the area of station accommodation (measured in square metres). Each of these measures captured slightly different aspects of the capital input. For a detailed explanation of these variables, except intelligence reports, see Carrington et al. (1997).

Appendix A displays descriptive statistics of the entire sample.⁵ Appendices B and C summarise descriptive statistics for the urban and regional subgroups. In general, the number of intelligence reports had the largest variation relative to its mean among the variables. All patrols used both types of labor but were more intensive in their use of police officers than civilian employees. Capital inputs, especially the number of police cars and the number of personal computers, were relatively uniform across patrols. The regional patrols recorded, on average, significantly higher kilometres travelled and numbers of police cars than the urban patrols, reflecting their larger area of coverage. For a detailed discussion of the sample, see Hughes and Yaisawarng (2000).

⁵ Since DEA does not allow for error in data measurement, it is important that outliers are inspected before computing efficiency scores. This paper computes all possible output–input ratios and uses a boxplot method to identify potential outliers. All suspected outliers were referred to the NSW Police Service for assessment. The numbers in the sample reflect the corrections of all reported errors.

3.2. Police DEA model specifications

Following Hughes and Yaisawarng (2000), this paper tests the dimensionality of four possible DEA models summarised in Table 1.

Table 1
Specification of police DEA models

Variables                                   Model
                                            1    2    3    4
Outputs
  Law enforcement activities
    Incidents (excl. major car accidents)   ✓    ✓    ✓    ✓
    Charges                                 ✓    ✓    ✓    ✓
    Summons served                          ✓    ✓    ✓    ✓
    Major car accidents                     ✓    ✓    ✓    ✓
  Crime prevention activities
    Kilometres travelled                    ✓    ✓    ✓    ✓
    Intelligence reports                    ✓    ✓    ✓    ✓
Inputs
  Police                                    ✓    ✓    ✓    ✓
  Civilians                                 ✓    ✓    ✓    ✓
  Cars                                      ✓    ✓    ✓    ✓
  Personal computers                        ✓    ✓
  Area of station accommodation             ✓         ✓

Note: ✓ indicates that the variable is included in the model.

All models comprise six outputs, two labor inputs and a different number of capital inputs. The alternative measures of the capital input were tested for their sensitivity. Model 1 has the largest dimensions since it includes all three measures of capital. All units in the sample would therefore appear at their most efficient compared with their respective efficiency scores from the remaining three models. This is a reflection of dimensionality. Model 4, on the other hand, includes only one measure of capital. Efficiency scores from Model 4 would be the lowest compared to all other models. Models 2 and 3 each include two measures of capital input, and efficiency scores computed from these two models should lie somewhere between those from Models 1 and 4. The relationship between efficiency scores across models is theoretically predicted as follows:

$$ TE_1 \ge TE_2 \ \text{and} \ TE_3 \ge TE_4. \tag{2} $$

Note that the relationship between efficiency scores from Models 2 and 3 cannot be determined a priori since both specifications include the same number of model variables. All four models include a relatively large number of model variables, and the respective efficiency scores may reflect the dimensionality.

3.3. Demonstration results

The results consist of three parts. Sections 3.3.1 and 3.3.2 report the replicated results of pure technical efficiency scores for all DEA model specifications in Table 1 and of the sensitivity tests from Hughes and Yaisawarng (2000), with elaboration. Section 3.3.3 presents the results of the simulation study, which tests whether the DEA efficiency scores indicate the existence of inefficiency across units or simply reflect the particular dimension of the DEA model.

3.3.1. Pure productive efficiency results

Table 2 summarises pure technical efficiency scores for all four models. These scores were computed relative to VRS technology, i.e., the DEA formulation in (1), with differences in the number of model variables as shown in Table 1. The average efficiency scores range from 0.90 to 0.93 and fall within a 5% range across the models. The number of units with non-radial slacks was similar across specifications, representing between 10% and 35% of the sample. The magnitude of these slacks was small. The efficiency scores suggest the scope for potential improvement in relation to the best practice performers in the sample. Some patrols may have been able to improve more than others. For example, one urban patrol could have potentially reduced its inputs by significant amounts (from 35% to 50%) if it had operated at best practice. This patrol was ranked among the least efficient patrols in all models.

Table 2
Summary of pure technical efficiency results (161 observations)

                            Model
                            1        2        3        4
Average                     0.933    0.921    0.918    0.901
Minimum                     0.606    0.606    0.523    0.518
No. of efficient patrols    89       80       77       67

3.3.2. Sensitivity tests of DEA results

We replicate the suite of sensitivity tests of Hughes and Yaisawarng (2000) discussed in Section 2.2 and report the detailed analysis below. These results are used in conjunction with the dimensionality tests in the next section to enhance the credibility of the DEA efficiency scores.

3.3.2.1. Correlation tests. The Spearman rank correlation coefficients between each pair of efficiency scores were extremely high, ranging from 0.81 to 0.96. These correlation coefficients were statistically different from zero at the 1% level of significance, suggesting that the results were positively related and stable across specifications. The Friedman test of the probability distributions of efficiency rankings across the four models had a Chi-square statistic ($\chi^2_3$) of 4.023 with a p-value of 0.2590. There was no evidence to reject the null hypothesis that the distributions of efficiency rankings were stable across all four models. Our results were not sensitive to the model specifications.

3.3.2.2. Comparisons of individual efficiency scores.

An examination of individual efficiency scores across models revealed that our results were remarkably stable across all specifications. These results were not very sensitive to the different proxies for the capital input. As expected, moving from Model 1 to Model 4 resulted in fewer efficient patrols, partly due to a decrease in the dimensionality of the DEA model. However, the proportion of efficient-by-default patrols to the total number of efficient patrols in each specification remained stable in the 20–25% range. Each efficient-by-default patrol has a unique input and output combination compared with the other patrols in the sample and is compared only to itself in evaluating its efficiency; it does not serve as a peer for other inefficient patrols.

Detailed analysis of efficiency scores across the four models revealed that 78 patrols received identical efficiency scores (measured to three decimal places) in all models. Some of these patrols were efficient; others were not. The remaining 83 patrols received different scores across the four models. The majority of these 83 patrols had a moderate range of variation and were distributed across efficiency groups (e.g., high, low). Only a small number of patrols experienced large variation in efficiency scores across models. The change in relative efficiency across models could perhaps be attributed to a change in the input mix that led to different efficient peer groups or to the same peers with different weights.

A closer look at these 83 patrols revealed that 46 patrols received the same efficiency scores in at least two models. At least one of the three core inputs––police officers, civilian employees or police cars––effectively determined the patrol's efficiency. (Of these three, the number of police officers was the most important input.) Among the remaining 37 of the 83 patrols receiving different scores across the four models, we found that the difference between their highest and lowest scores across models ranged from 0.003 to 0.209. For example, an urban patrol was efficient by default in Models 1 and 3. It was 83% efficient when the number of police cars and the number of computers were used as proxies for capital (Model 2). Its efficiency score dropped to 79% when the number of police cars was the only capital proxy (Model 4). Two other patrols experienced the next highest variations in their scores for this group. Both were inefficient in all four models. The urban patrol appeared to be more efficient when the area of station accommodation was included in place of the number of personal computers. These results were broadly consistent with the general characteristics of patrols––urban patrols had smaller station areas and possibly higher numbers of computer terminals than regional patrols.

3.3.2.3. Identifying appropriate efficient peers. As discussed in Hughes and Yaisawarng (2000), the NSW Police Service recommended using a geographic and/or demographic profile to assess the appropriateness of peers. An inefficient urban patrol had appropriate efficient peers if one of the following criteria was satisfied: (1) all peers were urban patrols, or (2) a regional peer carried a weight of less than 10%. If an inefficient urban patrol had any regional peer with a weight greater than or equal to 10%, the geographic characteristics of the area as well as its population size were considered. A regional peer whose main town centre had a population of 2500 persons or fewer was an inappropriate peer for an urban patrol. Similar rules were used in evaluating peers for inefficient regional patrols. Any inefficient patrol that had at least one inappropriate peer was classified as a "problem" case. The relevant mixes of outputs and inputs of patrols in problem cases were compared.

Results were remarkably similar across models. The majority of inefficient patrols had appropriate peers. The number of problem cases ranged from 7 (Models 1 and 2) to 11 (Model 4). Three cases were common to all four models. They involved two inefficient urban patrols and one inefficient regional patrol. Each of these inefficient urban (regional) patrols had one regional (urban) peer with a weight ranging from 12% to 29%. The geographic characteristics of these peers were very different from those of the inefficient patrols. A comparison of input and output compositions suggested that the inappropriate peers for the remaining problem cases had very different mixes from the inefficient patrols, especially with respect to total incidents, major car accidents, police officers and police cars.

On the basis of the sensitivity results, Model 2 was preferable. Since DEA results were relatively stable across the four models, it could be inferred that adding more capital proxies did not result in a more accurate measure of efficiency. Instead, it inflated efficiency scores through an increase in dimensionality. Therefore, a model with lower dimensionality was more appealing. Model 2 was superior to Model 4, which had the fewest model variables among the four models considered, because Model 2 provided appropriate peers for most inefficient patrols and would be more useful for management purposes. There were only seven problem cases in Model 2, representing approximately 8.6% of inefficient patrols––the lowest percentage among the four models.

3.3.3. Dimensionality test

From a practitioner's standpoint, it is important for managers or policy makers to reward the "true" best performers and to assist inefficient units in improving their performance. Since the number of model variables in relation to sample size may overstate the number of efficient units, the effect of dimensions should be tested. In our empirical demonstration of testing dimensionality effects for all four models, we created 1000 simulated samples for each model using the procedures described in Section 2.3. Each simulated data set comprised 161 observations and included between 9 and 11 data series, so that the number of units and model variables were consistent with the actual sample of NSW police patrols. For example, a simulated data set used to test the dimensionality effect of Model 1 on DEA efficiency scores comprised 161 observations and included 11 data series (the number of variables included in Model 1). The values for each data series were all independent random numbers, and hence there was no structural relationship among the figures assigned to each observation.

For each DEA specification in Table 1, we computed efficiency scores for all 161 simulated observations using (1). Specifically, we included 11 series in the calculation of DEA efficiency scores to obtain simulated results for Model 1. Similarly, we included all nine series in the calculation to obtain simulated results for Model 4. Note that the simulated results for Models 2 and 3 should have been identical since both models had ten variables. The total number of efficient units from each model relative to each simulated sample was recorded as a percentage of total observations in the sample.⁶ This process was repeated for all 1000 simulated samples. Table 3 summarises the results from the actual data set and the simulated samples.

⁶ It is possible to modify the simulated sample to test whether an additional variable is relevant to the model. For example, one may wish to test whether the area of station accommodation in Model 1 is an important variable. To do this, one constructs a hybrid-simulated sample, which comprises actual data for all variables except that the actual area of station accommodation is replaced by a random number, and uses it to generate simulated results. After a number of replications, one can test whether the simulated results differ from the actual result. If the results differ statistically, the area of station accommodation is an important variable for inclusion in the DEA model.
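A minimal sketch of the hybrid simulation described in footnote 6 is given below. It reuses pct_efficient() from the earlier sketch; the replace-one-column-and-compare construction (and the use of the same Z form as above) is our own illustration of the footnote, not code from the paper.

```python
# Hybrid simulation: keep actual data for every variable except one input
# (e.g., station area), replace that column with |N(0, 1)| noise, and compare
# the share of efficient units with the share obtained from the actual data.
import numpy as np

def hybrid_variable_test(Y, X, test_input_row, k=1000, seed=0):
    rng = np.random.default_rng(seed)
    p_actual = pct_efficient(Y, X)
    sims = np.empty(k)
    for r in range(k):
        X_hyb = X.copy()
        X_hyb[test_input_row] = np.abs(rng.standard_normal(X.shape[1]))
        sims[r] = pct_efficient(Y, X_hyb)
    z = (sims.mean() - p_actual) / (sims.std(ddof=1) / np.sqrt(k))
    # A statistically significant difference suggests the replaced variable
    # carries information that the DEA model needs.
    return p_actual, sims.mean(), z
```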

The proportions of efficient patrols from the actual, observed sample were consistently lower than the mean proportions from the simulated samples (except for Model 4), given the same number of variables included in the model. The model that included additional variables (i.e., more dimensions) had a higher percentage of efficient units, as expected. We focused on the apparent differences between the proportions of efficient units from the actual data and the simulated samples. The null hypothesis was that, for each DEA specification, the percentage of efficient units from the actual sample should be the same as the average proportion of efficient units from 1000 replications of simulated samples. The alternative hypothesis was that the percentage of efficient units from the actual sample should be less than the average proportion of efficient units from the simulated samples. A one-tailed hypothesis test was used. The calculated Z statistics are displayed in the last row of Table 3.

Table 3
Summary of efficiency results from actual sample and simulated samples (161 observations)

                                          Model
                                          1        2        3        4
Number of variables                       11       10       10       9
% of eff. units from actual sample        55.3     49.7     47.8     41.6
% of eff. units from simulated samples a  58.7     50.4     50.4     41.2
Standard deviation                        4.52     4.74     4.74     4.69
Z statistic b                             23.79    4.67     17.35    -2.70

a The figures are the averages of the simulated percentages of efficient observations over 1000 replications.
b A Z statistic is computed to test the null hypothesis that the average percentage of efficient observations from the actual sample equals that from the simulated samples, against the alternative that it is less. The critical Z value for a one-tailed test at the 1% level of significance is 2.33.
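The paper does not state the Z formula explicitly; one reading that reproduces the reported values is the one-sample Z statistic for the mean of the 1000 simulated proportions, shown below with Model 1 as a worked instance. This inference is ours.

$$ Z = \frac{\bar{p}_{\text{sim}} - p_{\text{actual}}}{s_{\text{sim}}/\sqrt{k}}, \qquad \text{e.g., Model 1: } Z = \frac{58.7 - 55.3}{4.52/\sqrt{1000}} \approx 23.8. $$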

Since these Z statistics exceeded the critical Z value at the 1% level of significance for Models 1–3, we rejected the null hypothesis for these models. The DEA calculated efficiency scores from these three models reflected differences in the performance of police patrols in the sample. That is, inefficiencies exist across police patrols in the sample, and these inefficiencies are not a reflection of the dimensionality of the first three models. However, we could not reject the null hypothesis for Model 4, indicating that the Model 4 results reflected the dimensionality. Given the nine variables included in Model 4, the percentage of efficient units should have been less than 41.2% if the number of model variables had not inflated the DEA efficiency scores. Model 4 was not an appropriate model.

On the basis of dimensionality testing, efficiency scores obtained from Models 1–3 were not contaminated by the dimensionality of the model. These three models could be used in the performance analysis of the police patrols in the sample. Efficiency results from Model 4 were merely driven by the dimensionality and should not be used. Model 4 would have been acceptable had policy decision-makers made their judgement on the basis of the sensitivity tests alone. This finding suggests that empirical researchers should perform both sensitivity and dimensionality tests before reaching a conclusion as to which model is more appropriate. In our empirical demonstration, Model 2 is the best.

4. Conclusions

This paper extends a standard procedure for performance measurement using the DEA technique. To enhance the managerial usefulness and reliability of DEA results, the paper suggests that sensitivity and dimensionality tests be performed. We present a suite of sensitivity tests for DEA results and develop a procedure to test for model dimensionality using simulated samples. Although the sensitivity and dimensionality tests are illustrated in the context of the performance of NSW police patrols, these tests are applicable to any DEA study. The sensitivity and dimensionality tests are complementary.

Our empirical illustration suggests that NSW police patrols in 1995–1996 were, on average, between 90% and 93% efficient compared with the best performers in the sample. The sensitivity tests suggest that the results are broadly robust across the various model specifications. The preferred model, comprising six outputs and four inputs, adequately captured the measured activities of the police patrols and provided appropriate efficient peers to assist performance improvement by inefficient units. Further, the dimensionality test revealed that the results were different from the simulated results, except for Model 4. Hence, DEA computed efficiency scores were not driven by the number of model variables; rather, they reflected variations in the actual performance of police patrols in the sample, i.e., differences in patrols' ability to transform model input variables to deliver measured services. Our procedure validates the DEA results prior to their use for management purposes, including resource allocation, benchmarking, and performance-related bonuses along the lines suggested in Yaisawarng (2002).


Acknowledgements

Much of this research project was completed while S. Yaisawarng worked as a senior economist at the NSW Treasury. The views expressed in this paper are those of the authors and do not necessarily reflect those of the NSW Treasury, the NSW Police Service or the NSW Government. The authors thank Jim Baldwin of the NSW Police Service for his help with interpreting the data and Paul Stewart-Stand of Union College for his research assistance.

Appendix A. Descriptive statistics for NSW police patrols, 1995–1996 (161 observations)

                            Mean         Std. dev.    Minimum    Maximum
Reactive outputs
  Incidents                 3821.52      2451.42      378        13497
  Major car accidents       488.31       315.18       26         1760
  Summons                   28.34        28.59        0          213
  Charges                   643.91       433.28       106        2243
Pro-active outputs
  Km travelled by police    341505.54    163775.49    39085      766874
  Intelligence reports      359.58       562.59       25         6554
Inputs
  Police officers           57.63        29.62        10.64      150.64
  Civilian employees        6.52         5.66         0.37       37.77
  Police cars               10.18        4.97         2          27
  Personal computers        28.58        13.46        5          105
  Station area              1321.35      1018.25      112        7030

Appendix B. Descriptive statistics for NSW urban police patrols, 1995–1996 (97 observations)

                            Mean         Std. dev.    Minimum    Maximum
Reactive outputs
  Incidents                 4773.25      2494.53      1082       13497
  Major car accidents       636.51       303.26       176        1760
  Summons                   24.05        29.08        0          213
  Charges                   658.11       467.46       106        2243
Pro-active outputs
  Km travelled by police    276641.60    129113.90    90450      706183
  Intelligence reports      437.27       697.60       56         6554
Inputs
  Police officers           67.09        30.66        24.81      150.64
  Civilian employees        8.03         6.44         1.00       37.77
  Police cars               8.56         4.56         2          27
  Personal computers        30.84        13.90        8          105
  Station area              1134.66      985.61       112        6918


Appendix C. Descriptive statistics for NSW regional police patrols, 1995–1996 (64 observations)

                            Mean         Std. dev.    Minimum    Maximum
Reactive outputs
  Incidents                 2379.05      1507.19      378        5849
  Major car accidents       263.70       164.54       26         592
  Summons                   34.84        26.75        0          105
  Charges                   622.38       378.16       144        1961
Pro-active outputs
  Km travelled by police    439814.90    162547.80    39085      766874
  Intelligence reports      241.84       197.22       25         1024
Inputs
  Police officers           43.29        21.19        10.64      97.78
  Civilian employees        4.25         3.06         0.37       12.41
  Police cars               12.64        4.57         4          27
  Personal computers        25.17        12.08        5          67
  Station area              1604.30      1008.93      358        7030

References

Afriat, S., 1972. Efficiency estimation of production functions. International Economic Review 13, 563–598.

Banker, R.D., Charnes, A., Cooper, W.W., 1984. Some models for estimating technical and scale efficiencies in data envelopment analysis. Management Science 30 (9), 1078–1092.

Carrington, R., Puthucheary, N., Rose, D., Yaisawarng, S., 1997. Performance measurement in government service provision: The case of police services in New South Wales. Journal of Productivity Analysis 8 (4), 415–430.

Charnes, A., Cooper, W.W., Rhodes, E., 1978. Measuring the efficiency of decision making units. European Journal of Operational Research 2 (6), 429–444.

Cooper, W.W., Seiford, L.M., Tone, K., 2000. Data Envelopment Analysis: A Comprehensive Text with Models, Applications, References and DEA-Solver Software. Kluwer Academic Publishers, Boston.

Färe, R., Grosskopf, S., Lovell, C.A.K., 1985. The Measurement of Efficiency of Production. Kluwer-Nijhoff Publishing, Boston.

Färe, R., Grosskopf, S., Lovell, C.A.K., 1994. Production Frontiers. Cambridge University Press, Cambridge.

Farrell, M.J., 1957. The measurement of productive efficiency. Journal of the Royal Statistical Society Series A 120, 253–281.

Ferrier, G., Grosskopf, S., Hayes, K., Yaisawarng, S., 1993. Economies of diversification in the banking industry: A frontier approach. Journal of Monetary Economics 31, 229–249.

Grosskopf, S., 1986. The role of reference technology in measuring productive efficiency. The Economic Journal 96, 499–513.

Grosskopf, S., 1996. Statistical inference and nonparametric efficiency: A selective survey. Journal of Productivity Analysis 7 (2–3), 161–176.

Grosskopf, S., Yaisawarng, S., 1990. Economies of scope in the provision of local public services. National Tax Journal 43, 51–74.

Hughes, A., Yaisawarng, S., 2000. Efficiency of local police districts: A New South Wales experience. In: Blank, J.L.T. (Ed.), Public Provision and Performance: Contributions from Efficiency and Productivity Measurement. Elsevier Science B.V., Amsterdam, pp. 277–296.

Lovell, C.A.K., 1993. Production frontiers and productive efficiency. In: Fried, H.O., Schmidt, S.S., Lovell, C.A.K. (Eds.), The Measurement of Productive Efficiency. Oxford University Press, New York, pp. 3–67.

Nunamaker, T.R., 1985. Using data envelopment analysis to measure the efficiency of non-profit organisations: A critical evaluation. Managerial and Decision Economics 6 (1), 50–58.

Steering Committee for the Review of Commonwealth/State Service Provision, 1997. Data Envelopment Analysis: A Technique for Measuring the Efficiency of Government Service Delivery. AGPS, Canberra.

Tauer, L.W., Hanchar, J.J., 1995. Nonparametric technical efficiency with K firms, N inputs, and M outputs: A simulation. Agricultural and Resource Economics Review 24 (2), 185–189.

Thrall, R.M., 1989. Classification transitions under expansion of inputs and outputs in data envelopment analysis. Managerial and Decision Economics 10, 159–162.

Valdmanis, V., 1992. Sensitivity analysis for DEA models: An empirical example using public vs. NFP hospitals. Journal of Public Economics 48, 185–205.

Yaisawarng, S., 2002. Performance measurement and resource allocation. In: Fox, K. (Ed.), Efficiency in the Public Sector. Kluwer Academic Publishers, Boston, pp. 61–81.