1 Introduction

This chapter provides an overview of a particular aspect of stochastic frontier analysis (SFA). The SF model is typically used to estimate best-practice ‘frontier’ functions that explain production or cost and predict firm efficiency relative to these. Extensive reviews of the broad stochastic frontier (SF) methodology are undertaken by Kumbhakar and Lovell (2000), Murillo-Zamorano (2004), Coelli et al. (2005), Greene (2008), and Parmeter and Kumbhakar (2014). This review focuses on the many different uses of alternative distributional assumptions within SF models.

Section 2 begins with a brief account of the motivation and development of efficiency analysis and prediction based on the standard SF model. A key feature of SF models is the focus on the unobserved disturbance in the econometric model. This entails a deconvolution of the disturbance into a firm inefficiency component, quantification of which is the goal of the analysis, and a statistical noise term. Following this general outline, we discuss approaches to dealing with some key specification issues. Section 3 considers alternative distributional assumptions for inefficiency. Section 4 examines panel data issues. Section 5 considers modelling heteroskedasticity in error terms and its usefulness for policy analysis. Section 6 considers alternative noise distributions within SF models. Section 7 considers amendments to the standard SF model when the data contains efficient firms. Section 8 considers other received proposals relevant to appropriate distributional assumptions in SFA. Section 9 concludes.

2 Departure Points

The standard theory of the firm holds that firms seek to maximise profit. Under certain assumptions, a profit function exists that reflects the maximum profit attainable by the firm. The profit function is derived from the firm’s cost function, which represents the minimum cost given outputs and input prices, and its production function, which describes the firm’s technology. These are ‘frontier’ functions in the sense that they represent optimal outcomes that firms cannot improve upon given their existing technology. The duality of the production and cost functions was demonstrated by Shephard (1953). Debreu (1951) introduced the notion of a distance function to describe a multiple output technology and proposed that the radial distance of a producer’s outputs from the distance function be used as a measure of technical inefficiency. Koopmans (1951) provided a definition of technical efficiency.

The idea that firms might depart from profit maximisation was first suggested in passing by Hicks (1935), who speculated that firms with market power in particular may choose to enjoy some of their rents not as profit, but as reduced effort to maximise profits, or ‘a quiet life’. Later, Leibenstein (1966, 1975) discussed various empirical indications of firm-level ‘X-inefficiency’ and how it might arise. The debate between Leibenstein (1978) and Stigler (1976) highlighted two alternative characterisations of inefficiency: as a result of selective rationality and non-maximising behaviour, resulting in non-allocative welfare loss, or as the redistribution of rents within the firm, and therefore consistent with outwardly maximising behaviour. The latter characterisation essentially posits that firms are maximising an objective function including factors other than profit, and encompasses a wide range of specific hypotheses about firm behaviour. The revenue maximisation hypothesis of Baumol (1967), the balanced growth maximisation hypothesis of Marris (1964) and the expense preference hypothesis of Williamson (1963) are examples of hypotheses within which the firm (or its managers, given informational asymmetry between principal and agent) pursues other objectives jointly with profit or subject to a profit constraint. We should therefore bear in mind that when we discuss efficiency, it is relative to an objective that we define, and not necessarily that of the firm (or its agents).

The early literature on econometric estimation of cost functions includes the landmark studies of Johnston (1960) for the UK coal industry and Nerlove (1963) for US electricity generation. These authors focused primarily on estimation of the shape of the empirical cost or production functions. Typically, ordinary least squares (OLS) was used to estimate a linear model:

$$y_{i} = x_{i} \beta + \varepsilon_{i},$$
(1)

where yi is cost or output, \(\beta\) is a vector of parameters to be estimated, \(\varepsilon_{i}\) is a random error term, \(i = 1,2, \ldots ,I\) denotes an observed sample of data and xi is the vector of independent variables. In the case of a production function, independent variables include input quantities and other factors affecting production, while in the case of a cost frontier, independent variables include output quantities and input prices, along with other factors affecting cost (Shephard 1953). Commonly, the dependent and independent variables are logged, in order to linearise what is assumed to be a multiplicative functional form. Note that the estimation of (1) via least squares, where a symmetric error term is assumed, is only consistent with the idea of a frontier function if we assume that firms are all fully efficient, and that departures from the estimated frontier are explained purely by measurement error and other random factors, such as luck. This fact has motivated many alternative proposals that are consistent with the notion of a frontier. Farrell (1957) proposed the use of linear programming to construct, assuming constant returns to scale, a piecewise linear isoquant and to define technical inefficiency as the radial distance of the firm from this isoquant.
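To fix ideas, the following sketch estimates a log-linear (Cobb-Douglas style) specification of (1) by OLS on simulated data; the variable names and parameter values are purely illustrative.

```python
import numpy as np

# Simulated example: log-linear production function y_i = x_i*beta + e_i estimated by OLS.
rng = np.random.default_rng(42)
I = 200
log_labour = rng.normal(4.0, 0.5, I)       # hypothetical logged input quantities
log_capital = rng.normal(3.0, 0.5, I)
noise = rng.normal(0.0, 0.2, I)            # symmetric error: no allowance for inefficiency
log_output = 1.0 + 0.6 * log_labour + 0.3 * log_capital + noise

X = np.column_stack([np.ones(I), log_labour, log_capital])   # x_i includes a constant
beta_hat, *_ = np.linalg.lstsq(X, log_output, rcond=None)    # OLS estimate of beta
residuals = log_output - X @ beta_hat                        # attributed entirely to noise
print("OLS estimates:", beta_hat.round(3))
```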

An approach that amends (1) so that the error is one-sided yields a deterministic or ‘full’ frontier specification, in which the residuals are attributed entirely to inefficiency. Since a firm must be operating on or below its production frontier, and on or above its cost frontier, this means that \(s\varepsilon_{i} \le 0\), where s = 1 for a production frontier and s = −1 for a cost frontier. Aigner and Chu (1968) suggested linear or quadratic programming approaches to deterministic frontier estimation. Respectively, these minimise \(\sum\nolimits_{i = 1}^{I} {\left| {\varepsilon_{i} } \right|}\) or \(\sum\nolimits_{i = 1}^{I} {\varepsilon_{i}^{2} },\) subject to the constraint that \(s\varepsilon_{i} \le 0\). Schmidt (1976) noted that these are maximum likelihood (ML) estimators under the assumptions that the error term is exponentially or half normally distributed. Omitting the restriction that the residuals be one-sided leads to OLS and least absolute deviations (LAD) estimation, which would be ML estimation under the assumptions that \(\varepsilon_{i}\) follows the normal or Laplace distributions, two-sided counterparts of the half-normal and exponential distributions, respectively. Afriat (1972) proposed a deterministic frontier model in which \(\exp \left( {\varepsilon_{i} } \right)\) follows a two-parameter beta distribution, to be estimated via ML, which as Richmond (1974) noted is equivalent to assuming a gamma distribution for \(\varepsilon_{i}\). The usual regularity conditions for ML estimation do not hold for deterministic frontier functions, since the range of variation of the dependent variable depends upon the parameters. Greene (1980) points out that under certain specific assumptions, this irregularity is actually not the relevant constraint. Specifically, if both the density and first derivative of the density of \(\varepsilon\) converge to zero at the origin, then the log-likelihood function is regular for ML estimation purposes. Deterministic frontier models with gamma and lognormal error terms are examples.
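As an illustration of the Aigner and Chu (1968) linear programming approach, the sketch below fits a deterministic production frontier (s = 1) to simulated data by minimising the sum of absolute residuals subject to the one-sided constraint, using SciPy's linprog; the data-generating process and variable names are assumptions for the example only.

```python
import numpy as np
from scipy.optimize import linprog

# Deterministic (Aigner-Chu) production frontier via linear programming:
# choose b to minimise sum_i (x_i'b - y_i) subject to x_i'b >= y_i,
# i.e. residuals e_i = y_i - x_i'b are constrained to be non-positive.
rng = np.random.default_rng(0)
I = 100
x1 = rng.normal(4.0, 0.5, I)
y = 1.0 + 0.7 * x1 - rng.exponential(0.3, I)    # one-sided shortfall below the frontier
X = np.column_stack([np.ones(I), x1])

c = X.sum(axis=0)                                # objective: sum_i x_i'b (sum_i y_i is a constant)
res = linprog(c, A_ub=-X, b_ub=-y,               # -Xb <= -y  <=>  Xb >= y
              bounds=[(None, None)] * X.shape[1], method="highs")
beta_frontier = res.x
inefficiency = X @ beta_frontier - y             # the whole residual is attributed to inefficiency
print(beta_frontier.round(3), inefficiency.mean().round(3))
```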

Deterministic frontier models suffer from a serious conceptual weakness. They do not account for noise caused by random factors such as measurement error or luck. A firm whose production is impacted by a natural disaster might by construction appear to be inefficient. In order to account for measurement error, Timmer (1971) suggested amending the method so that the constraint \(s\varepsilon_{i} \le 0\) holds only with a given probability, thereby allowing a proportion of firms to lie above (below) the production (cost) frontier. However, this probability must be specified in advance in an arbitrary fashion. An alternative proposal made by Aigner et al. (1976) has the error drawn from a normal distribution with variance \(\sigma^{2} \theta\) when \(s\varepsilon_{i} \le 0\) and \(\sigma^{2} \left( {1 - \theta } \right)\) when \(s\varepsilon_{i} > 0\), where \(0 < \theta < 1\). Essentially, though this is not made explicit, this allows for normally distributed noise with variance \(\sigma^{2} \left( {1 - \theta } \right)\) and inefficiency implicitly following a half-normal distribution with variance \(\left( {1 - 2/\pi } \right) \sigma^{2} \left( {1 - \theta} \right)\), under the assumption that where \(s\varepsilon_{i} \le 0\) firms are fully efficient. The resulting likelihood function is that of a 50:50 mixture of two differently scaled normal distributions truncated at zero from the left and right, respectively. The discontinuity of this specification once again violates the standard regularity conditions for ML estimation.

The issues with the models suggested by Timmer (1971) and Aigner et al. (1976) stem in both cases from their peculiar assumption that firms must be fully efficient when \(s\varepsilon_{i} \le 0\), which remains rooted in an essentially deterministic view of frontier estimation. The current literature on SFA, which overcomes these issues, begins with Aigner et al. (1977) and Meeusen and van Den Broeck (1977). They proposed a composed error:

$$\varepsilon_{i} = v_{i} - su_{i}$$
(2)

where vi is a normally distributed noise term with zero mean, capturing random factors such as measurement error and luck, and ui is a non-negative random variable capturing inefficiency and is drawn from a one-sided distribution. Battese and Corra (1977) proposed an alternative parameterisation of the model. Given particular distributional assumptions about the two error components, the marginal distribution of the composed error \(\varepsilon_{i}\) may be derived by marginalising ui out of the joint density:

$$f_{\varepsilon } \left( {\varepsilon_{i} } \right) = \int\limits_{0}^{\infty } {f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)\text{d}u_{i} }$$
(3)

where \(f_{\varepsilon }\), fv, and fu are the density functions for \(\varepsilon_{i}\), vi, and ui, respectively. The half-normal and exponential distributions were originally proposed for ui. Assuming a normal distribution for vi, the resulting distributions for \(\varepsilon_{i}\) are the skew-normal distribution, studied by Weinstein (1964) and Azzalini (1985), and the exponentially modified Gaussian distribution originally derived by Grushka (1972).
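For the normal-half normal case, the marginal density in (3) has the skew-normal form \(f_{\varepsilon }\left( \varepsilon \right) = \left( {2/\sigma } \right)\phi \left( {\varepsilon /\sigma } \right)\varPhi \left( { - s\varepsilon \lambda /\sigma } \right)\), with \(\sigma^{2} = \sigma_{v}^{2} + \sigma_{u}^{2}\) and \(\lambda = \sigma_{u} /\sigma_{v}\). The sketch below evaluates this closed form and checks it against direct numerical integration of (3); the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm, halfnorm
from scipy.integrate import quad

def f_eps_closed_form(eps, sigma_v, sigma_u, s=1):
    """Skew-normal composed-error density for the normal-half normal model."""
    sigma = np.sqrt(sigma_v**2 + sigma_u**2)
    lam = sigma_u / sigma_v
    return (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-s * eps * lam / sigma)

def f_eps_numeric(eps, sigma_v, sigma_u, s=1):
    """Direct numerical evaluation of the convolution integral in (3)."""
    integrand = lambda u: norm.pdf(eps + s * u, scale=sigma_v) * halfnorm.pdf(u, scale=sigma_u)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

eps = -0.4   # a composed-error value from a production frontier (s = 1)
print(f_eps_closed_form(eps, sigma_v=0.3, sigma_u=0.5),
      f_eps_numeric(eps, sigma_v=0.3, sigma_u=0.5))
```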

The ultimate objective of SFA is deconvolution of estimated residuals into separate predictions for the noise and inefficiency components. The latter is the focus of efficiency analysis. Since the parameters of fu are outputs of the estimation process, we obtain an estimated distribution of efficiency, and as proposed by Lee and Tyler (1978), the first moment of this estimated distribution may be used to predict overall average efficiency. However, decomposing estimated residuals into observation-specific noise and efficiency estimates was elusive until Jondrow et al. (1982) suggested predicting based on the conditional distribution of \(\left. {u_{i} } \right|\varepsilon_{i}\), which is given by

$$f_{\left. u \right|\varepsilon } \left( {u_{i} |\varepsilon_{i} } \right) = \frac{{f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)}}{{f_{\varepsilon } \left( {\varepsilon_{i} } \right)}} .$$
(4)

They derived (4) for the normal-half normal and normal-exponential cases. The mean, \(E\left( {u_{i} |\varepsilon_{i} } \right)\), and mode, \(M\left( {u_{i} |\varepsilon_{i} } \right)\), of this distribution were proposed as predictors. Waldman (1984) examined the performance of these and other computable predictors. Battese and Coelli (1988) suggest the use of \(E\left[ {\left. {\exp \left( { - u_{i} } \right)} \right|\varepsilon_{i} } \right]\) when the frontier is log-linear. Kumbhakar and Lovell (2000) suggest that this is more accurate than \(\exp \left[ { - E\left( {u_{i} |\varepsilon_{i} } \right)} \right],\) especially when \(u_{i}\) is large. In practice, the difference often tends to be very small. It should be noted that the distribution of the efficiency predictions, \(E\left( {u_{i} |\varepsilon_{i} } \right)\), will not match the unconditional, marginal distribution of the true, unobserved ui. Wang and Schmidt (2009) derive the distribution of \(E\left( {u_{i} |\varepsilon_{i} } \right)\) and show that it is a shrinkage of ui towards \(E\left( {u_{i} } \right)\), with \(E\left( {u_{i} |\varepsilon_{i} } \right) - u_{i}\) approaching zero as \(\sigma_{v}^{2} \to 0\).
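Under normal-half normal assumptions, both predictors have simple closed forms based on the conditional distribution in (4), with \(\mu_{i}^{*} = - s\varepsilon_{i} \sigma_{u}^{2} /\sigma^{2}\) and \(\sigma_{*}^{2} = \sigma_{u}^{2} \sigma_{v}^{2} /\sigma^{2}\). The sketch below computes the Jondrow et al. (1982) and Battese and Coelli (1988) predictors from a vector of residuals; the numerical inputs are illustrative.

```python
import numpy as np
from scipy.stats import norm

def efficiency_predictors(eps, sigma_v, sigma_u, s=1):
    """JLMS E(u|eps) and Battese-Coelli E[exp(-u)|eps] for the normal-half normal model."""
    sigma2 = sigma_v**2 + sigma_u**2
    mu_star = -s * eps * sigma_u**2 / sigma2                 # location of u | eps before truncation
    sigma_star = np.sqrt(sigma_u**2 * sigma_v**2 / sigma2)
    z = mu_star / sigma_star
    jlms = mu_star + sigma_star * norm.pdf(z) / norm.cdf(z)  # E(u | eps)
    bc = np.exp(-mu_star + 0.5 * sigma_star**2) * norm.cdf(z - sigma_star) / norm.cdf(z)
    return jlms, bc                                          # bc is E[exp(-u) | eps]

eps = np.array([-0.6, -0.1, 0.2])   # residuals from a production frontier (s = 1)
jlms, bc = efficiency_predictors(eps, sigma_v=0.3, sigma_u=0.5)
print(jlms.round(3), bc.round(3), np.exp(-jlms).round(3))    # the last two typically differ little
```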

3 Alternative Inefficiency Distributions

The efficiency predictions of the stochastic frontier model are sensitive to the assumed distribution of ui. A number of alternatives have been proposed. Several two-parameter generalisations of the half-normal and exponential distributions, respectively, allow for greater flexibility in the shape of the inefficiency distribution, with non-zero modes in particular. The flexible forms generally enable testing against their simpler nested distributions. Stevenson (1980) proposed the truncated normal modelFootnote 1; Greene (1990) and Stevenson (1980) proposed gamma distributions. The truncated normal distribution, denoted \(N^{ + } \left( {\mu ,\sigma_{u}^{2} } \right)\), nests the half normal when its location parameter \(\mu\) (the pre-truncation mean) is zero, and its mode is \(\mu\) when \(\mu \ge 0\). The similar ‘folded normal distribution’ denoted \(\left| {N\left( {\mu ,\sigma_{u}^{2} } \right)} \right|\), i.e. that of the absolute value of an \(N\left( {\mu ,\sigma_{u}^{2} } \right)\) normal random variable, also nests the half normal when \(\mu\) is zero, but has a non-zero mode only when \(\mu \ge \sigma_{u}\) (Tsagris et al. 2014; Hajargasht 2014).

The gamma distribution with shape parameter k and scale parameter \(\sigma_{u}\) nests the exponential distribution when k = 1. A two-parameter lognormal distribution for ui, which resembles the gamma distribution, is adopted by Migon and Medici (2001). It is possible to adopt even more flexible distributional assumptions; Lee (1983) proposed using a very general four-parameter Pearson distribution for ui as a means of nesting several simpler distributions. On the other hand, Hajargasht (2015) proposed a one-parameter Rayleigh distribution for ui which has the attraction of being a parsimonious way of allowing for a non-zero mode. Griffin and Steel (2008) proposed a three-parameter extension of Greene’s two-parameter gamma model that nests the gamma, exponential, half-normal and (heretofore never considered) Weibull models. Some of these represent minor extensions of the base case models. In all cases, however, the motivation is a more flexible, perhaps less restrictive characterisation of the variation of efficiency across firms. In many cases, the more general formulations nest more restrictive, but common distributional forms.

The inefficiency distributions discussed above were proposed to enable more flexible distributional assumptions about ui. Other proposals have addressed specific practical and theoretical issues. One is the ‘wrong skew’ problem, which is discussed in more detail below. Broadly, the skewness of \(s\varepsilon_{i}\) should be negative, both in theory and as estimated from the data. In estimation, it often happens that the information extracted from the data suggests skewness in the wrong direction. This would seem to conflict with the central assumption of the stochastic frontier model. The problem for the theoretical specification is that, since \({\text{Skew}}\left( {\varepsilon_{i} } \right) = {\text{Skew}}\left( {v_{i} } \right) - s{\text{Skew}}\left( {u_{i} } \right) = - s{\text{Skew}}\left( {u_{i} } \right)\) when vi is symmetrically distributed, the skewness of the composed error \(\varepsilon_{i}\) is determined by that of ui. Therefore, imposing \({\text{Skew}}\left( {u_{i} } \right) > 0\) implies that \(- s{\text{Skew}}\left( {\varepsilon_{i} } \right) > 0.\) Since all of the aforementioned distributions for ui allow only for positive skewness, this means that the resulting SF models cannot handle skewness in the ‘wrong’ direction. An estimated model based on sample data will typically give an estimate of zero for \({\text{Var}}\left( {u_{i} } \right)\) if the estimated skewness (however obtained) goes in the wrong direction.

‘Wrong skew’ could be viewed as a finite sample issue, as demonstrated by Simar and Wilson (2010). Even when the assumed distribution of \(\varepsilon_{i}\) is correct, samples drawn from this distribution can have skewness in the ‘wrong’ direction with some probability that decreases with the sample size. Alternatively, it may indeed be the case that, though non-negative, the distribution of ui has a zero or negative skew, and therefore, our distributional assumptions need to be changed accordingly. To this end, Li (1996) and Lee and Lee (2014)Footnote 2 consider a uniform distribution, \(u_{i} \sim U\left( {a,b} \right)\), so that ui and \(\varepsilon_{i}\) are both symmetric, and Carree (2002) and Tsionas (2007) consider the binomial and Weibull distributions, respectively, which both allow for skewness in either direction. Arguably, these ‘solutions’ are ad hoc remedies to what might be a fundamental conflict between the data and the theory. Notwithstanding the availability of these remedies, negative skewness, appropriately defined, is a central feature of the model.

Also relevant here are SF models with ‘bounded inefficiency’. These are motivated by the idea that there is an upper bound on inefficiency beyond which firms cannot survive. Such a boundary could be due to competitive pressure, as suggested by Qian and Sickles (2008). However, we also consider that it could arise in monopolistic infrastructure industries which are subject to economic regulation, since depending on the strength of the regulatory regime, some inefficiency is likely to be tolerated.Footnote 3

Implementation of bounded inefficiency involves the right-truncation of one of the canonical inefficiency distributions found in the SF literature. The upper tail truncation point is a parameter that would be freely estimated and is interpreted as the inefficiency bound. Lee (1996) proposed a tail-truncated half-normal distribution for inefficiency, and Qian and Sickles (2008) and Almanidis and Sickles (2012) propose a more general ‘doubly truncated normal’ distribution (i.e. the tail truncation of a truncated normal distribution). Almanidis et al. (2014) discuss the tail-truncated half-normal, tail-truncated exponential and doubly truncated normal inefficiency distributions. The latter of these may have positive or negative skewness depending on its parameter values. In fact, it is clear that this may be true of the right-truncation of many other non-negative distributions with non-zero mode.

A difficulty with certain distributional assumptions is that the integral in (3) may not have a closed-form solution, so that there may not be an analytical expression for the log-likelihood function. This issue first arose in the SF literature in the case of the normal-gamma model, where the problem was addressed in several different ways. Stevenson (1980) noted that relatively straightforward closed-form expressions exist for integer values of the shape parameter k of the normal-gamma model, and derived the marginal density of \(\varepsilon_{i}\) for k = 0, k = 1, and k = 2. Restricting k to integer values gives the Erlang distribution, so this proposal amounts to a restrictive normal-Erlang model. The need to derive distinct formulae for every possible integer value of k makes this approach unattractive. Beckers and Hammond (1987) derived a complete log-likelihood for the normal-gamma model, but due to its complexity their approach has not been implemented. Greene (1990) approximated the integral using quadrature, but this approximation proved rather crude (Ritter and Simar 1997). An alternative approach, proposed by Greene (2003), is to approximate the integral via simulation in order to arrive at a maximum simulated likelihood (MSL) solution. For more detail on MSL estimation, see Train (2009). In the context of SFA, Greene and Misra (2003) note that the simulation approach could be used to approximate the integral in (3) for many distributional assumptions as long as the random variable ui can be simulated. Since the integral is the expectation of \(f_{v} \left( {\varepsilon_{i} + su_{i} } \right)\) given the assumed distribution for ui, it can be approximated by averaging over Q draws from the distribution of ui:

$$f_{\varepsilon } \left( {\varepsilon_{i} } \right) = \int\limits_{0}^{\infty } {f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)\text{d}u_{i} } \approx \frac{1}{Q}\sum\limits_{q = 1}^{Q} {f_{v} \left[ {\varepsilon_{i} + sF_{u}^{ - 1} \left( {d_{q} } \right)} \right]}$$
(5)

where dq is draw number q from the standard uniform distribution, transformed by the quantile function \(F_{u}^{ - 1}\) into a draw from the distribution of ui. In cases in which there is no analytical \(F_{u}^{ - 1}\), such as the normal-gamma model, the integral may nevertheless be expressed in terms of an expectation that may be approximated via simulation. Greene (2003) recommends using Halton sequences, which aim for good coverage of the unit interval, rather than random draws from the uniform distribution, in order to reduce the number of draws needed for a reasonable approximation of the integral.
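The sketch below illustrates the approximation in (5) for a normal-gamma model: Halton draws on the unit interval (via SciPy's qmc module, assuming SciPy 1.7 or later) are passed through the gamma quantile function, which SciPy evaluates numerically since no analytical inverse exists, and the noise density is averaged over the draws. This is a stylised version of the idea rather than Greene's (2003) exact formulation.

```python
import numpy as np
from scipy.stats import norm, gamma, qmc

def simulated_density(eps, sigma_v, k, theta, s=1, Q=500):
    """Approximate f_eps(eps) as in (5) for gamma-distributed u, using quasi-Monte Carlo draws."""
    halton = qmc.Halton(d=1, scramble=True, seed=123)
    d = halton.random(Q).ravel()                   # Q Halton points on the unit interval
    u_draws = gamma.ppf(d, a=k, scale=theta)       # F_u^{-1}(d_q), evaluated numerically
    eps = np.atleast_1d(eps)
    vals = norm.pdf(eps[:, None] + s * u_draws[None, :], scale=sigma_v)
    return vals.mean(axis=1)                       # average of f_v(eps + s*u) over the draws

print(simulated_density(np.array([-0.5, 0.1]), sigma_v=0.3, k=2.0, theta=0.25).round(4))
```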

As an alternative to simulation, various numerical quadrature approaches may be used. Numerical quadrature involves approximating an integral by a weighted sum of values of the integrand at various points. In many cases, this involves partitioning the integration interval and approximating the area under the curve within each of the resulting subintervals using some interpolating function. The advantage of quadrature over simulation lies in speed of computation, given the latter’s time-consuming need to obtain potentially large numbers of independent draws for each observation. However, it may be challenging to find appropriate quadrature rules in many cases. Another alternative, proposed by Tsionas (2012), is to approximate \(f_{\varepsilon }\) using the (inverse) fast Fourier transform of the characteristic function of \(f_{\varepsilon }\). The characteristic function, \(\varphi_{\varepsilon }\), is the Fourier transform of \(f_{\varepsilon }\), and as shown by Lévy’s inversion theorem (see Theorem 1.5.4 in Lukacs and Laha 1964), the inverse Fourier transform of the characteristic function can be used to obtain \(f_{\varepsilon }\). Since the Fourier transform of a convolution of two functions is simply the product of their Fourier transforms, i.e. \(\varphi_{\varepsilon } = \varphi_{v} \varphi_{u}\) (see Bracewell 1978, p. 110), \(\varphi_{\varepsilon }\) may be relatively simple even when \(f_{\varepsilon }\) has no closed form, and \(f_{\varepsilon }\) may be approximated by the inverse fast Fourier transform of \(\varphi_{\varepsilon }\). On the basis of Monte Carlo experiments, Tsionas (2012) finds that this is a faster method for approximating \(f_{\varepsilon }\) in the normal-gamma and normal-beta cases than either Gaussian quadrature or Monte Carlo simulation, with the former requiring a large number of quadrature points and the latter an even larger number of draws for comparable accuracy. This approach has not yet been adopted as widely as simulation, perhaps due to its relative complexity.
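The sketch below illustrates the underlying idea for the normal-exponential model, for which both the characteristic function and the density are known: the product \(\varphi_{\varepsilon } = \varphi_{v} \varphi_{u}\) is inverted numerically (here by simple trapezoidal quadrature of the inversion integral, rather than the fast Fourier transform that Tsionas (2012) uses for speed) and compared with the closed-form density. Parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

sigma_v, sigma_u, s = 0.3, 0.4, 1      # normal noise; exponential inefficiency with mean sigma_u

def phi_eps(t):
    """Characteristic function of eps = v - s*u as the product of the components' CFs."""
    phi_v = np.exp(-0.5 * (sigma_v * t) ** 2)
    phi_minus_su = 1.0 / (1.0 + 1j * s * sigma_u * t)
    return phi_v * phi_minus_su

def f_eps_inversion(x, T=100.0, n=20001):
    """Density recovered from the inversion integral f(x) = (1/2pi) int exp(-itx) phi(t) dt."""
    t = np.linspace(-T, T, n)
    integrand = np.exp(-1j * t * x) * phi_eps(t)
    return trapezoid(integrand, t).real / (2.0 * np.pi)

def f_eps_closed_form(x):
    """Exact normal-exponential composed-error density (exponentially modified Gaussian)."""
    lam = 1.0 / sigma_u
    return lam * np.exp(lam * s * x + 0.5 * (lam * sigma_v) ** 2) * \
        norm.cdf(-s * x / sigma_v - lam * sigma_v)

x = -0.5
print(f_eps_inversion(x), f_eps_closed_form(x))   # the two values should agree closely
```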

A natural question is which of the many alternatives discussed above is the most appropriate distribution for inefficiency. Unfortunately, theory provides little guidance on this question. Oikawa (2016) argues that a simple Bayesian learning-by-doing model such as that of Jovanovic and Nyarko (1996), in which a firm (or manager) maximises technical efficiency given prior beliefs about and previous realisations of an unknown technology parameter, supports a gamma distribution for inefficiency. However, Tsionas (2017) shows that this conclusion is sensitive to the sampling of, and assumed prior for, the firm-specific parameter, and that under alternative formulations there is no basis for favouring the gamma distribution (or any known distribution). Furthermore, both authors assume that firms maximise expected profits, whereas alternative behavioural assumptions may yield very different results. Of course, the choice of inefficiency distribution may be driven by practical considerations, such as a need to allow for wrong skewness or to estimate an upper bound on inefficiency. The question of which inefficiency distribution to use is an empirical one and leads us to consider testing in the context of SFA. As noted previously, some of the more flexible inefficiency distributions nest simpler distributions. In these cases, we may test down to the simpler nested models. For example, we may test down from the normal-gamma to the normal-exponential model by testing the null hypothesis that k = 1. We may test down from the normal-truncated normal (or the normal-folded normal) to the normal-half normal model by testing the null hypothesis that \(\mu = 0\). These are standard problems.

There are some remaining complications in the specification search for the SF model. We may wish to test for the presence of the one-sided error, often interpreted as a test for the presence of inefficiency. In this case, the errors are normally distributed under the null hypothesis \(H_{0} :\sigma_{u} = 0\). This is a non-standard problem because the scale parameter \(\sigma_{u}\) is at a boundary of the parameter space under H0. Case 5 in Self and Liang (1987) shows that where a single parameter of interest lies on the boundary of the parameter space under the null hypothesis, the likelihood ratio (LR) statistic follows a 50:50 mixture of \(\chi_{0}^{2}\), and \(\chi_{1}^{2}\) distributions, denoted \(\chi_{1:0}^{2}\), for which the 95% value is 2.706 (Critical values are presented in Kodde and Palm 1986). Lee (1993) finds that this is the case under \(H_{0} :\sigma_{u} = 0\) in the normal-half normal model. A Lagrange multiplier test for this case in the SF model is developed in Lee and Chesher (1986).
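The mixture result is straightforward to apply in practice: since the \(\chi_{0}^{2}\) component is a point mass at zero, the p-value of an observed LR statistic is half the \(\chi_{1}^{2}\) tail probability. A minimal sketch:

```python
from scipy.stats import chi2

def mixed_chi2_pvalue(lr_stat):
    """p-value under a 50:50 mixture of chi2(0) and chi2(1), as for H0: sigma_u = 0."""
    return 0.5 * chi2.sf(lr_stat, df=1)      # the chi2(0) mass contributes nothing to the tail

def mixed_chi2_critical(alpha=0.05):
    """Critical value c solving 0.5 * P(chi2(1) > c) = alpha."""
    return chi2.isf(2 * alpha, df=1)

print(mixed_chi2_critical(0.05))             # roughly 2.706, as tabulated by Kodde and Palm (1986)
print(mixed_chi2_pvalue(3.2))
```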

This result does not apply when fu has two or more parameters. Coelli (1995) states that, in the normal-truncated normal model, the LR statistic under \(H_{0} :\sigma_{u} = \mu = 0\) follows a 25:50:25 mixture of \(\chi_{0}^{2}\), \(\chi_{1}^{2}\) and \(\chi_{2}^{2}\) distributions, and that this is a special case of the result for two restrictions in Gouriéroux et al. (1982), which deals with inequality restrictions.Footnote 4 This result matches Case 7 in Self and Liang (1987), in which two parameters of interest lie on the boundary of the parameter space under the null. The test seems to have been incorrectly applied; under \(H_{0} :\sigma_{u} = \mu = 0\), only one parameter lies on the boundary. Equivalently, viewing the test as a one-tailed test of \(H_{0} : \sigma_{u} \le 0, \mu = 0\), we only have one inequality restriction. Case 6 in Self and Liang (1987), in which there are two parameters of interest, one on a boundary, and one not on a boundary, seems to be more applicable, suggesting a 50:50 mixture of \(\chi_{1}^{2}\) and \(\chi_{2}^{2}\) distributions, denoted \(\chi_{2:1}^{2}\). More fundamentally, \(H_{0} :\sigma_{u} = \mu = 0\) may not be the appropriate null hypothesis: when the scale parameter of the inefficiency distribution is set to zero, all other parameters of the distribution are in fact unidentified. Equivalently, a normal distribution for \(\varepsilon_{i}\) can be recovered in the normal-truncated normal case as \(\mu \to - \infty\), for any value of \(\sigma_{u}\). The general problem of testing when there are unidentified nuisance parameters under the null hypothesis is discussed by Andrews (1993a, b) and Hansen (1996). To our knowledge, this has not been addressed in the SF literature.

We may wish to choose between two non-nested distributions. In this case, Wang et al. (2011) suggest testing goodness of fit by comparing the distribution of the estimated residuals to the theoretical distribution of the compound error term. This is a simpler method than, for example, comparing the distribution of the efficiency predictions to the theoretical distribution of \(E\left( {u |\varepsilon } \right)\) as derived by Wang and Schmidt (2009), since the distribution of the compound error is much simpler. For example, as discussed previously, \(\varepsilon_{i}\) follows a skew-normal distribution in the normal-half normal model, and an exponentially modified Gaussian distribution in the normal-exponential model. Under alternative specifications, the distribution of the estimated residuals may become rather complex, however.
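As a rough illustration of this kind of residual-based check, the sketch below compares simulated composed errors from a normal-half normal production frontier with the implied theoretical skew-normal distribution, using SciPy's skewnorm parameterisation (shape \(-s\lambda\), location 0, scale \(\sigma\)) and a Kolmogorov-Smirnov statistic. Strictly, parameter estimation affects the null distribution of such a test, so this should be read as an informal diagnostic rather than the procedure of Wang et al. (2011).

```python
import numpy as np
from scipy.stats import halfnorm, skewnorm, kstest

# Simulated composed errors from a normal-half normal production frontier (s = 1).
rng = np.random.default_rng(7)
sigma_v, sigma_u, s = 0.3, 0.5, 1
eps = rng.normal(0, sigma_v, 2000) - s * halfnorm.rvs(scale=sigma_u, size=2000, random_state=rng)

# Implied theoretical distribution of eps: skew normal with shape -s*lambda and scale sigma.
sigma = np.sqrt(sigma_v**2 + sigma_u**2)
lam = sigma_u / sigma_v
result = kstest(eps, skewnorm.cdf, args=(-s * lam, 0.0, sigma))
print(result.statistic, result.pvalue)       # a small statistic / large p-value indicates good fit
```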

4 Panel Data

The basic panel data SF model in the contemporary literature is as in (1) with the addition of a t subscript to denote the added time dimension of the data:

$$y_{it} = x_{it} \beta + \varepsilon_{it},$$
(6)

where \(t = 1,2, \ldots ,T\). The composite error term is now

$$\varepsilon_{it} = \alpha_{i} + v_{it} - su_{it}.$$
(7)

Along with the usual advantages of panel data, Schmidt and Sickles (1984) identify three benefits specific to the context of SFA. First, under the assumption that inefficiency is either time invariant or that it varies in a deterministic way, efficiency prediction is consistent as \(T \to \infty .\) In contrast, this is not the case as \(N \to \infty\). Second, distributional assumptions can be rendered less important, or avoided altogether, in certain panel data specifications. In particular skewness in the residual distribution does not have to be the only defining factor of inefficiency. Instead, time persistence in inefficiency can be exploited to identify it from random noise. Third, it becomes possible, using a fixed-effects approach, to allow for correlation between inefficiency and the variables in the frontier.Footnote 5 In addition, the use of panel data allows for the modelling of dynamic effects.

In the context of panel data SF modelling, one of the main issues is the assumption made about the variation (or lack thereof) of inefficiency over time. Another is the way in which we control (or do not control) for firm-specific unobserved heterogeneity and distinguish it from inefficiency. For the purposes of this discussion, we divide the received panel data SF models into three classes: models in which inefficiency is assumed to be time-invariant, models in which inefficiency is time-varying, and models which control for unobserved heterogeneity with either time-invariant or time-varying inefficiency. To finish this section, we consider briefly multi-level panel datasets and the opportunities that they provide for analysis.

4.1 Time-invariant Efficiency

One approach to panel data SFA is to assume that efficiency varies between firms but does not change over time, as first proposed by Pitt and Lee (1981). Referring to (6) and (7), the basic panel data SF model with time-invariant efficiency assumes that \(\alpha_{i} = 0, u_{it} = u_{i}\), so that we have:

$$y_{it} = x_{it} \beta + v_{it} - su_{i}.$$
(8)

This specification has the advantage that prediction (or estimation) of ui is consistent as \(T \to \infty\). The appeal of this result is diminished given that the assumption of time-invariance is increasingly hard to justify as the length of the panel increases. In contrast to the cross-sectional case, there is no need to assume that ui is a random variable with a particular distribution, and therefore several different methods may be used to estimate (8), depending on our assumptions about ui.

Schmidt and Sickles (1984) proposed four alternative approaches. First, we may assume that ui is a firm-specific fixed effect and estimate the model using either a least squares dummy variable (LSDV) approach, in which ui is obtained as the estimated parameter on the dummy variable for firm i, or equivalently by applying the within transformation, in which case ui is obtained as firm i’s mean residual. Second, we may assume that ui is a firm-specific random effect and estimate the model using feasible generalised least squares (FGLS). The difference between the fixed-effects and random-effects approaches is that the latter assumes that the firm-specific effects are uncorrelated with the regressors, while the former does not. Third, Schmidt and Sickles (1984) suggested instrumental variable (IV) estimation of the error components model proposed by Hausman and Taylor (1981) and Amemiya and MaCurdy (1986), which allows for the firm-specific effect to be correlated with some of the regressors and uncorrelated with others, and is thus intermediate between the fixed-effects and random-effects models. Fourth, as in Pitt and Lee (1981), ui could be regarded as an independent random variable with a given distribution, as in the cross-sectional setting, with the model being estimated via ML.

The first three approaches share the advantage that no specific distributional assumption about ui is required. As a consequence, the estimated firm-specific effects are not restricted in sign, and firm-specific efficiency can only be measured relative to the best in the sample, not against an absolute benchmark. The estimated ui is given by

$$u_{i} = \mathop {\hbox{max} }\limits_{j} sa_{j} - sa_{i},$$
(9)

where ai is the estimated firm-specific effect for firm i. The fixed-effects specification has the advantage of allowing for correlation between ui and xit. But the drawback is that time-invariant regressors cannot be included, meaning that efficiency estimates will be contaminated by any differences due to time-invariant variables. The assumption that the regressors are uncorrelated with the error components (noise or inefficiency) can be examined using the Hausman test (Hausman 1978; Hausman and Taylor 1981). If this assumption appears to hold, a random effects approach such as Pitt and Lee (1981) may be preferred. Another approach is to estimate a correlated random-effects model using Chamberlain-Mundlak variables—see Mundlak (1978) and Chamberlain (1984)—to allow for correlation between the random effects and the regressors. Griffiths and Hajargasht (2016) propose correlated random effects SF models using Chamberlain-Mundlak variables to allow for correlation between regressors and error components, including inefficiency terms.
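A minimal sketch of the fixed-effects (within) variant on simulated data, with relative inefficiency recovered as in (9); the data-generating process is an assumption for illustration, and s = 1 denotes a production frontier.

```python
import numpy as np

# Schmidt-Sickles style within estimation with time-invariant inefficiency (s = 1).
rng = np.random.default_rng(1)
I, T, s = 50, 8, 1
u_true = np.abs(rng.normal(0, 0.4, I))             # time-invariant inefficiency (any one-sided draw)
x = rng.normal(4.0, 0.5, (I, T))
y = 2.0 + 0.7 * x - s * u_true[:, None] + rng.normal(0, 0.2, (I, T))

# Within transformation: demean by firm, estimate beta, then recover the firm effects a_i.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_hat = (x_dm * y_dm).sum() / (x_dm**2).sum()
a_i = (y - beta_hat * x).mean(axis=1)              # intercept plus -s*u_i plus averaged noise

u_hat = np.max(s * a_i) - s * a_i                  # relative inefficiency, as in (9)
print(round(beta_hat, 3), round(np.corrcoef(u_true, u_hat)[0, 1], 3))
```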

The ML approach to estimation of (8) was first suggested by Pitt and Lee (1981), who derived an SF model for balanced panel data with a half-normal distribution for ui and a normal distribution for vit. This model therefore nests the basic cross-sectional model of Aigner et al. (1977) when T = 1. As in the cross-sectional setting, alternative distributional assumptions may be made. Battese and Coelli (1988) generalise the Pitt and Lee (1981) model in two ways: first, by allowing for an unbalanced panel and second, by assuming a truncated normal distribution for ui. Normal-exponential, normal-gamma and normal-Rayleigh variants of the Pitt and Lee (1981) model are implemented in LIMDEP Version 11 (Greene 2016). As in the cross-sectional setting, parameter estimates and efficiency predictions obtained under the ML approach are more efficient than those from semi-parametric models if the distributional assumptions made are valid. If those assumptions are not valid, they may be inconsistent and biased. To be sure, the ability to test distributional assumptions is very limited.

4.2 Time-Varying Efficiency

Allowing for variation in efficiency over time is attractive for a number of reasons. As already noted, the assumption that efficiency is time-invariant is increasingly hard to justify as T increases. We would expect average efficiency to change over time. There may also be changes in the relative positions of firms, in terms of convergence or divergence in efficiency between firms, and potentially also changes in rankings through firms overtaking each other. A wide variety of time-varying efficiency SF specifications have been proposed, each differing with respect to their flexibility in modelling the time path of efficiency and each having their own advantages and disadvantages.

As Amsler et al. (2014) note, panel data SF specifications can be grouped into four categories with respect to how uit changes over time. The first of these, covered in the preceding section, comprises models with time-invariant efficiency, so that uit = ui. Second, we could assume independence of uit over t. In this case, we may simply estimate a pooled cross-sectional SF model, the possibility of unobserved heterogeneity notwithstanding. The advantages of this approach are the flexibility of uit—and by extension, that of \(E\left( {u_{it} |\varepsilon_{it} } \right)\)—over time, its simplicity and its parsimony, given that it adds no additional parameters to the model. However, the assumption of independence over time is clearly inappropriate.

Third, we may treat uit as varying deterministically over time. One approach is to include time-varying fixed or random effects, ait, with uit being given by

$$u_{it} = \mathop {\hbox{max} }\limits_{j} sa_{jt} - sa_{it}.$$
(10)

Of course, given that the \(N \le IT\) firm- and time-specific parametersFootnote 6 cannot all be separately identified, some structure must be imposed. Kumbhakar (1991, 1993) proposed combining firm-specific (but time-invariant) and time-specific (but firm-invariant) effects, such that \(a_{it} = \lambda_{i} + \lambda_{t}\), with time effects included for \(t = 2, \ldots ,T\). This imposes a common trend in uit among firms, albeit one that may be quite erratic. Lee and Schmidt (1993) proposed a specification, \(a_{it} = \lambda_{t} \alpha_{i}\), which again imposes a trend over time. This is common for all firms, but complicates estimation due to its non-linearity. An alternative approach is to specify that \(a_{it} = g\left( t \right)\) as proposed by Cornwell et al. (1990), who specifically suggested a quadratic time trend with firm-specific parameters, such that \(a_{it} = \lambda_{i} + \lambda_{i1} t + \lambda_{i2} t^{2}\). This specification is flexible, in that it allows for firms to converge, diverge or change rankings in terms of efficiency. Ahn et al. (2007) propose a specification which nests both the Lee and Schmidt (1993) and Cornwell et al. (1990) models, in which \(a_{it} = \sum\nolimits_{j = 1}^{p} {\lambda_{jt} \alpha_{ji} }\), thus allowing for arbitrary, firm-specific time trends. This specification nests the Lee and Schmidt (1993) model when p = 1, and the Cornwell et al. (1990) model when p = 3, \(\lambda_{1t} = 1\), \(\lambda_{2t} = t\), \(\lambda_{3t} = t^{2}\). The value of p is estimated along with the model parameters. The authors discuss estimation and identification of model parameters. Ahn et al. (2013) discuss estimation of this model when there are observable variables correlated with the firm-specific effects, but not with vit. An alternative approach based on factor modelling and allowing for arbitrary, smooth, firm-specific efficiency trends is proposed by Kneip et al. (2012).

Because semi-parametric models yield only relative estimates of efficiency, it is not possible to disentangle the effects of technical change (movement of the frontier) and efficiency change. An analogous approach in the context of parametric specifications is to use a ‘scaling function’, so that

$$u_{it} = g\left( t \right)u_{i} .$$
(11)

Here, ui is a time-invariant random variable following a one-sided distribution—as in the time-invariant specification of Pitt and Lee (1981)—and g(t) is a non-negative function of t. Kumbhakar (1990) proposed \(g\left( t \right) = 1/\left[ {1 + \exp \left( {\lambda_{1} t + \lambda_{2} t^{2} } \right)} \right]\); Battese and Coelli (1992) proposed \(g\left( t \right) = \exp \left[ {\lambda_{1} \left( {t - T} \right)} \right]\) and \(g\left( t \right) = \exp \left[ {\lambda_{1} \left( {t - T} \right) + \lambda_{2} \left( {t - T} \right)^{2} } \right]\). In each case, ui is assumed to follow a half-normal distribution. In these models, efficiency moves in the same direction for all firms, but there may be convergence of firms over time. In addition, with the exception of the one-parameter Battese and Coelli (1992) scaling function, these allow for non-monotonic trends in uit over time. However, they do not allow for changes in rank over time, which requires firm-specific time trends.

Cuesta (2000) generalised the one-parameter Battese and Coelli (1992) scaling function to allow for firm-specific time trends, so that \(g\left( t \right) = \exp \left[ {\lambda_{1i} \left( {t - T} \right)} \right]\). An extension to the two-parameter case would be straightforward. This allows for firm-specific time trends, as in the Cornwell et al. (1990) model, but again at the cost of increasing the number of parameters in the model by a factor of I. However, Wheat and Smith (2012) show that the Cuesta (2000) specification, unlike that of Battese and Coelli (1992), can lead to a counterintuitive ‘falling off’ of firms with high \(E\left( {u_{it} |\varepsilon_{it} } \right)\) in the final year of the sample. They propose a model in which \(g\left( t \right) = \exp \left[ {\lambda_{1i} \left( {t - \lambda_{2i} } \right)} \right]\), which does not have the same feature.Footnote 7 More generally, as Wheat and Smith (2012) note, the many different models that use scaling functions are sensitive to the precise form of g(t) in terms of parameter estimates, fit and efficiency predictions.
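To make the comparison concrete, the sketch below writes the scaling functions mentioned above as simple functions of t, as in (11); the λ values are arbitrary illustrative numbers, not estimates.

```python
import numpy as np

T = 10
t = np.arange(1, T + 1)

def g_kumbhakar_1990(t, lam1, lam2):
    return 1.0 / (1.0 + np.exp(lam1 * t + lam2 * t**2))

def g_battese_coelli_1992(t, lam1, T=T):
    return np.exp(lam1 * (t - T))

def g_cuesta_2000(t, lam1_i, T=T):                 # lam1_i is firm-specific
    return np.exp(lam1_i * (t - T))

def g_wheat_smith_2012(t, lam1_i, lam2_i):         # firm-specific slope and location
    return np.exp(lam1_i * (t - lam2_i))

u_i = 0.3                                          # a draw of the time-invariant inefficiency term
print(np.round(g_battese_coelli_1992(t, -0.05) * u_i, 3))   # u_it = g(t) * u_i, as in (11)
print(np.round(g_wheat_smith_2012(t, -0.05, 5.0) * u_i, 3))
```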

A fourth approach to time-variation of uit in panel data SF models is to allow for correlation between uit over time by assuming that \(\left( {u_{i1} , \ldots , u_{iT} } \right)\) are drawn from an appropriate multivariate distribution. Among their various proposals, Pitt and Lee (1981) suggested that \(\left( {u_{i1} , \ldots , u_{iT} } \right)\) could be drawn from a multivariate truncated normal distribution. They abandoned this approach, after noting that the likelihood function for this model involves intractable T-dimensional integrals.Footnote 8 In addition, Horrace (2005) showed that the marginal distribution of uit in this case is not truncated normal. However, as suggested by Amsler et al. (2014), it is possible to specify a multivariate distribution with the desired marginal distributions, and also obviate T-dimensional integration when evaluating \(\ln L\), by using a copula function. Sklar’s theorem—see Nelsen (2006, pp. 17–14)—states that any multivariate cumulative density function can be expressed in terms of a set of marginal cumulative density functions and a copula. For example, we have

$$H_{u} \left( {u_{i1} , \ldots , u_{iT} } \right) = C\left[ {F_{u1} \left( {u_{i1} } \right), \ldots ,F_{uT} \left( {u_{iT} } \right)} \right]$$
(12)

where Hu is a multivariate cumulative density function for \(\left( {u_{i1} , \ldots , u_{iT} } \right),\) C[.] is the copula function, and \(F_{u1} \left( {u_{i1} } \right), \ldots ,F_{uT} \left( {u_{iT} } \right)\) are the marginal cumulative density functions for uit for each time period. We would normally assume that Fut = Fu for all t, so that we have

$$H_{u} \left( {u_{i1} , \ldots , u_{iT} } \right) = C\left[ {F_{u} \left( {u_{i1} } \right), \ldots ,F_{u} \left( {u_{iT} } \right)} \right] .$$
(13)

From this, it can be seen that the probability density function is given by

$$h_{u} \left( {u_{i1} , \ldots , u_{iT} } \right) = \prod\limits_{t = 1}^{T} {\left[ {f_{u} \left( {u_{it} } \right)} \right]c\left[ {F_{u} \left( {u_{i1} } \right), \ldots ,F_{u} \left( {u_{iT} } \right)} \right]}$$
(14)

where c is the derivative of the copula. It follows from this that a multivariate density \(h\left( {u_{i1} , \ldots , u_{iT} } \right)\) with the desired marginal densities given by fu can be obtained by combining fu and Fu with an appropriate copula density c. Many different copula functions exist—it is beyond the scope of this chapter to review the various candidates—each embodying different dependence structures. Note that c = 1 relates to the special case of independence. This allows marginal distributions to be specified, but the problem of T-dimensional integration to evaluate the log-likelihood persists. For this reason, Amsler et al. (2014) propose and implement an alternative approach whereby instead of specifying a copula for \(\left( {u_{i1} , \ldots , u_{iT} } \right)\), a copula is specified for the composite errors \(\left( {\varepsilon_{i1} , \ldots , \varepsilon_{iT} } \right)\). In this case, we have

$$h_{\varepsilon } \left( {\varepsilon_{i1} , \ldots , \varepsilon_{iT} } \right) = \prod\limits_{t = 1}^{T} {\left[ {f_{\varepsilon } \left( {\varepsilon_{it} } \right)} \right]c\left[ {F_{\varepsilon } \left( {\varepsilon_{i1} } \right), \ldots ,F_{\varepsilon } \left( {\varepsilon_{iT} } \right)} \right]}$$
(15)

where \(h_{\varepsilon }\) is the multivariate distribution for \(\left( {\varepsilon_{i1} , \ldots , \varepsilon_{iT} } \right)\) and \(F_{\varepsilon }\) is the marginal cumulative density function for \(\varepsilon_{it}\). An appropriate marginal distribution for \(\varepsilon_{it}\) is chosen, such as the skew-normal distribution, and the correlation is now between the composite errors, introducing dependency between both error components. Amsler et al. (2014) take both approaches, estimating a model in which \(\left( {\varepsilon_{i1} , \ldots , \varepsilon_{iT} } \right)\) is drawn from a joint distribution as in (15) via ML, and a model in which \(\left( {u_{i1} , \ldots , u_{iT} } \right)\) is drawn from a joint distribution as in (14), while vit is assumed independent, via MSL. A Gaussian copula function is used in both cases. The authors discuss prediction of efficiency, which in this case is based on \(\left. {u_{it} } \right|\varepsilon_{i1} , \ldots , \varepsilon_{iT}\); this results in improved predictions relative to those based on \(\left. {u_{it} } \right|\varepsilon_{it}\), since the composite errors from all years are informative about uit when there is dependency between them.
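The construction in (13) and (14) is easy to illustrate by simulation. The sketch below draws \(\left( {u_{i1} , \ldots ,u_{iT} } \right)\) with half-normal marginals and dependence induced by a Gaussian copula, using an equicorrelation matrix as one simple way of restricting the correlation parameters; the values of \(\sigma_{u}\) and ρ are illustrative.

```python
import numpy as np
from scipy.stats import norm, halfnorm

def draw_u_gaussian_copula(I, T, sigma_u, rho, seed=0):
    """Draw (u_i1,...,u_iT) with half-normal marginals and a Gaussian copula (equicorrelation rho)."""
    rng = np.random.default_rng(seed)
    R = np.full((T, T), rho) + (1.0 - rho) * np.eye(T)   # copula correlation matrix
    z = rng.multivariate_normal(np.zeros(T), R, size=I)  # latent multivariate normal draws
    d = norm.cdf(z)                                      # uniform marginals (the copula scale)
    return halfnorm.ppf(d, scale=sigma_u)                # half-normal marginals via F_u^{-1}

u = draw_u_gaussian_copula(I=1000, T=5, sigma_u=0.5, rho=0.7)
print(np.corrcoef(u, rowvar=False).round(2))             # dependence of inefficiency across periods
```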

The copula approach proposed by Amsler et al. (2014) is attractive, since it can be seen as intermediate between the pooled SF approach and the approach of specifying SF models with deterministically time-varying uit. As such, it retains the advantage of the latter approach in allowing for dependency over time, without specifying a particular functional form for the time trend. It also obviates the large number of additional parameters otherwise needed to allow flexibility with respect to time trends. Rather than some factor of I, the number of new parameters is limited to the correlation coefficients \(\rho_{ts} \forall t \ne s\). A number of simplifying assumptions can be made to reduce the number of these while retaining flexibility. Firms may converge or diverge, or change rankings, using a relatively parsimonious specification under this approach.

4.3 Unobserved Heterogeneity

Aside from considerations of the appropriate way to model trends in uit over time, which is peculiar to the panel data SF context, more general panel data issues are also relevant. Primary among these is the need to account for possible unobserved heterogeneity between firms. In general, this means incorporating firm-specific effects which are time-invariant but not captured by the regressors included in the frontier. These may be either correlated or uncorrelated with the regressors, i.e. they may be fixed or random effects, respectively. In general, failure to account for fixed effects may bias parameter estimates, while failure to account for random effects generally will not.Footnote 9 In the SF context, failure to account for fixed or random effects means such effects may be attributed to uit.

A number of models have been proposed which incorporate fixed or random effects. These are interpreted as capturing unobserved heterogeneity rather than as inefficiency effects. Kumbhakar (1991) proposed extending the pooled cross-section model to incorporate firm and time effects uncorrelated with the regressors, so that

$$\varepsilon_{it} = v_{it} + a_{i} + a_{t} - su_{it}$$
(16)

where ai and at are firm- and time-specific fixed or random effects. In the fixed-effects case, Kumbhakar (1991) suggests estimation via ML with firm dummy variables, under the assumptions that ai, at, and vit are drawn from normal distributions with zero means and constant variances, and uit is drawn from a truncated normal distribution. A simplified version of this model, omitting at and treating ai as a fixed effect, was used by Heshmati and Kumbhakar (1994). This model was also considered by Greene (2004, 2005a, b), who proposed the specification

$$\varepsilon_{it} = v_{it} + a_{i} - su_{it}$$
(17)

where ai is a time-invariant fixed or random effect, and the specification is referred to as the ‘true fixed effects’ (TFE) or ‘true random effects’ (TRE) model, accordingly. In the TFE case, estimation proceeds by simply replacing the constant term in the standard pooled model with a full set of firm dummies and estimating the model via ML. However, evidence presented by Greene (2005b) from Monte Carlo experiments suggests that this approach suffers from the incidental parameters problem. As a result, Chen et al. (2014) propose an alternative ML approach based on the within transformation, which is not subject to this problem, and Belotti and Ilardi (2018) extend this approach to allow for heteroscedastic uit.

In the TRE case, Greene (2004, 2005a, b) proposed estimation of the model via MSL, assuming that \(a_{i} \sim N\left( {0,\sigma_{a}^{2} } \right)\). Greene (2005b) notes that the TRE approach—and indeed the standard SF model—can be seen as special cases of a random parameters model and proposes a random parameters specification incorporating heterogeneity in \(\beta\), so that

$$y_{it} - x_{it} \beta_{i} = \varepsilon_{it} = v_{it} - su_{it}$$
(18)

where \(\beta_{i}\) is assumed to follow a multivariate normal distribution with mean vector \(\beta\) and covariance matrix \(\sum\). The random intercept is \(\beta_{0i} = \beta_{0} + a_{i}\) in terms of the TRE notation. The model is estimated via MSL. The resemblance of this approach to the Bayesian SF specifications considered by Tsionas (2002) is noted. However, the Bayesian approach has the drawback of requiring some prior distribution to be chosen for all parameters, including those of fu. Greene (2008) notes that in the classical framework, ‘randomness’ of the parameters reflects technological heterogeneity between firms, whereas in the Bayesian framework, ‘randomness’ of the parameters is supposed to reflect the uncertainty of the analyst.Footnote 10

A discrete approximation to the random parameters SF model is possible using a latent class approach to capture heterogeneity in some or all of the \(\beta\) parameters, as proposed by Orea and Kumbhakar (2004). In this specification, each firm belongs to one of J classes, each class having a distinct technology, so that for class j, we have technology parameters \(\beta_{j}\). Class membership is unknown. Each firm is treated as belonging to class j with unconditional probability pj, where the unconditional probabilities are estimated as parameters after normalising such that \(\sum\nolimits_{j = 1}^{J} {p_{j} = 1}\) (leaving J − 1 additional parameters to be estimated). The model may be estimated via ML. Conditional probabilities of class membership for each observation are obtained by

$$p_{ij} = \frac{{p_{j} f_{\varepsilon } \left( {y_{it} - x_{it} \beta_{j} } \right)}}{{\sum\nolimits_{j = 1}^{J} {\left[ {p_{j} f_{\varepsilon } \left( {y_{it} - x_{it} \beta_{j} } \right)} \right]} }}$$
(19)
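A small sketch of (19): given the estimated unconditional class probabilities and the class-specific composed-error densities evaluated at each observation's residual, the conditional (posterior) probabilities follow from Bayes' rule. The normal-half normal density is used here purely as an example class density, and all numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

def f_eps_nhn(eps, sigma_v, sigma_u, s=1):
    """Normal-half normal composed-error density, used here as an example class density."""
    sigma = np.sqrt(sigma_v**2 + sigma_u**2)
    lam = sigma_u / sigma_v
    return (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-s * eps * lam / sigma)

def posterior_class_probs(resid_by_class, p):
    """Conditional class membership probabilities as in (19).
    resid_by_class: (J, N) array of residuals y_it - x_it*beta_j under each class's parameters."""
    dens = np.vstack([f_eps_nhn(e, sigma_v=0.3, sigma_u=0.5) for e in resid_by_class])
    weighted = p[:, None] * dens                     # p_j * f_eps(y - x*beta_j)
    return weighted / weighted.sum(axis=0, keepdims=True)

p = np.array([0.6, 0.4])                             # unconditional class probabilities
resid_by_class = np.array([[-0.2, 0.1, -0.5],        # hypothetical residuals under class 1 ...
                           [-0.6, 0.3, -0.1]])       # ... and under class 2
print(posterior_class_probs(resid_by_class, p).round(3))
```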

The primary issue with the TFE, TRE, and similar models is that any time-invariant effects are attributed to ai, when it is entirely possible that they should, partly or wholly, be attributed to uit. Several recent proposals therefore extend this modelling approach to allow for uit to be broken down into separate time-invariant and time-varying components capturing ‘persistent’ and ‘transient’ inefficiency effects, respectively. Thus,

$$u_{it} = w_{i} + w_{it}$$
(20)

where typically both wi and wit are random variables drawn from some one-sided distribution. A similar decomposition of uit was first suggested by Kumbhakar and Heshmati (1995), who proposed that \(u_{it} = a_{i} + w_{it}\) and Kumbhakar and Hjalmarsson (1995), who proposed \(u_{it} = a_{i} + \alpha_{t} + w_{it}\), where ai and \(\alpha_{t}\) are firm- and time-specific fixed or random effects, respectively.Footnote 11 Colombi et al. (2014) and Tsionas and Kumbhakar (2014) propose an extension of the TRE model, accordingly referred to as the generalised true random effects (GTRE) model, in which

$$\varepsilon_{it} = v_{it} + a_{i} - s\left( {w_{i} + w_{it} } \right)$$
(21)

This model therefore includes four error components, allowing for noise, unobserved heterogeneity, and persistent and transient inefficiency. Identification requires specific distributional assumptions to be made about either ai or wi, or both. The following distributional assumptions are typically made: \(v_{it} \sim N\left( {0,\sigma_{v}^{2} } \right)\), \(a_{i} \sim N\left( {0,\sigma_{a}^{2} } \right)\), and wi and wit follow half-normal distributions with constant variances. Each of the error components is assumed to be independent. Various approaches to estimation of the GTRE model have been proposed. Kumbhakar et al. (2014) suggest a multi-step approach. In the first step, a standard random effects panel data model is specified, including a noise component \(v_{it}^{*} = v_{it} + w_{it}\) and a time-invariant random effects component \(a_{i}^{*} = a_{i} + w_{i}\). This can be estimated via FGLS, avoiding any explicit distributional assumptions. Subsequently, the estimates of these error components are used as the dependent variables in separate constant-only SF models, which decompose them into their two-sided and one-sided components. This is straightforward to implement using standard software packages.
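A rough sketch of the multi-step idea on simulated data is given below. Step 1 uses the within estimator for β and splits the residual into a firm-mean part (containing \(a_{i} - sw_{i}\)) and a demeaned part (containing \(v_{it} - sw_{it}\)); step 2 fits constant-only normal-half normal SF models to each part by ML. This simplifies the first step relative to the FGLS random-effects estimation described above, ignores the distortion introduced by demeaning, and uses illustrative parameter values throughout.

```python
import numpy as np
from scipy.stats import norm, halfnorm
from scipy.optimize import minimize

def fit_constant_sf(e, s=1):
    """ML fit of a constant-only normal-half normal SF model to a vector of residuals e."""
    def negll(theta):
        c, log_sv, log_su = theta
        sv, su = np.exp(log_sv), np.exp(log_su)
        sig = np.sqrt(sv**2 + su**2)
        r = e - c
        ll = np.log(2) - np.log(sig) + norm.logpdf(r / sig) + norm.logcdf(-s * r * (su / sv) / sig)
        return -ll.sum()
    start = [e.mean(), np.log(e.std()), np.log(e.std())]
    res = minimize(negll, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1]), np.exp(res.x[2])        # constant, sigma_v-type, sigma_u-type

# Simulated GTRE data (production frontier, s = 1); all parameter values are illustrative.
rng = np.random.default_rng(3)
I, T, s = 200, 10, 1
a = rng.normal(0, 0.3, I)                                      # random firm effect
w_i = halfnorm.rvs(scale=0.4, size=I, random_state=rng)        # persistent inefficiency
w_it = halfnorm.rvs(scale=0.3, size=(I, T), random_state=rng)  # transient inefficiency
v_it = rng.normal(0, 0.2, (I, T))
x = rng.normal(4, 0.5, (I, T))
y = 1.0 + 0.7 * x + a[:, None] - s * (w_i[:, None] + w_it) + v_it

# Step 1: within estimator for beta; split residuals into firm means and deviations.
xd, yd = x - x.mean(1, keepdims=True), y - y.mean(1, keepdims=True)
beta = (xd * yd).sum() / (xd**2).sum()
resid = y - beta * x
a_star = resid.mean(axis=1)                 # time-invariant part: intercept + a_i - s*w_i (+ means)
e_star = (resid - a_star[:, None]).ravel()  # time-varying part: roughly v_it - s*w_it, demeaned

# Step 2: constant-only SF fits recover the scales of the one-sided components.
print("persistent:", np.round(fit_constant_sf(a_star, s), 3))   # third value estimates scale of w_i
print("transient: ", np.round(fit_constant_sf(e_star, s), 3))   # third value estimates scale of w_it
```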

Alternatively, Colombi et al. (2014) use the result that \(\varepsilon_{it}\) in the GTRE model is the sum of two random variables, each drawn from an independent closed skew-normal distribution.Footnote 12 As its name suggests, the closed skew-normal distribution is closed under summation—see Proposition 2.5.1 of González-Farı́as et al. (2004b) or Theorem 1 in González-Farı́as et al. (2004a). Therefore, \(\varepsilon_{it}\) follows a closed skew-normal distribution. This enables estimation of the model via ML. However, Filippini and Greene (2016) note that this is extremely challenging, since the log-likelihood involves the probability density function for a T-variate normal distribution and the cumulative density function for a T + 1-variate normal distribution. They proposed a simpler approach based on MSL, which exploits the fact that the GTRE model is simply the TRE model in which the time-invariant error component follows a skew-normal distribution. Colombi et al. (2014) show how to obtain predictions for wi and wit.

The attraction of the GTRE model is that it is quite general, in that it allows for the decomposition of the composite error into noise, random effects, persistent inefficiency, and transient inefficiency components. It also nests various simpler models, such as the TRE model, the standard pooled SF model, the Pitt and Lee (1981) model, and a standard random-effects model. However, Badunenko and Kumbhakar (2016) recently concluded on the basis of Monte Carlo experiments that the model is very limited in its ability to precisely predict the individual error components in practice, and suggest that the model may not outperform simpler models in many cases.

4.4 Multi-level Panel Datasets

This section has outlined how panel data allows for a richer characterisation of efficiency, and thus why panel data is desirable for undertaking efficiency analysis. Both Smith and Wheat (2012) and Brorsen and Kim (2013) have considered using data on a number of organisations over time, but disaggregated into sub-firm divisions (henceforth: plants) for each organisation. Thus, there are two levels of data

$$y_{ijt} = x_{ijt} \beta - su_{ijt} + v_{ijt}$$
(22)

which is (6) but with the addition of a plant subscript j. There are two key advantages to considering data of this form. Firstly, such an approach allows for the measurement of internal efficiency variation within an organisation, as well as simultaneously measuring efficiency against comparator organisations (external efficiency). Smith and Wheat (2012) propose a model (ignoring the time dimension for simplicity) in which \(u_{ij} = u_{i} + u_{ij}^{*}\), where ui is a one-sided component common to all of firm i’s plants, and \(u_{ij}^{*}\) is a plant-specific component assumed to follow a half-normal distribution. The authors suggest estimating the model using a two-step approach, in which ui is obtained from a fixed or random effect in the first step. Note that, in the one-period or pooled cross-section cases, this is simply the panel data specification of Kumbhakar and Hjalmarsson (1995) and Kumbhakar and Heshmati (1995).

Lai and Huang (2013) argue that there is likely to be intra-firm correlation between both plant-level efficiency and noise effects. Rather than allow for separate correlations between the vij and the uij, the authors propose a model in which the \(\varepsilon_{ij}\) are correlated such that \(\rho \left( {\varepsilon_{ij} ,\varepsilon_{il} } \right) = \rho\). The components vij and uij are assumed to be drawn from marginal normal and half-normal distributions, respectively, and the authors allow for correlation between the composed errors using a Gaussian copula.

Secondly, both Brorsen and Kim (2013) and Smith and Wheat (2012) demonstrate that there is a need to model costs at the level at which management autonomy resides. Failure to do so can result in misleading predictions of efficiency, as it mismatches the returns to scale properties of the cost function with efficiency. Brorsen and Kim (2013) used data on schools and school districts to show that if the model is estimated using data at the district level, returns to scale are found to be decreasing, rather than the schools being found to be inefficient. Ultimately, the aggregation bias results in correlation between the errors and the regressors, since true measures of scale/density (at the disaggregate level) are not included in the model.

5 Heteroscedasticity and Modelling Inefficiency

In many applications of SFA, the analyst is interested not only in the estimation or prediction of efficiency, but also in its variation in terms of a set of observable variables. However, the standard SF model assumes that ui is independent of observed variables. Many applications, including Pitt and Lee (1981) as an early example, take a two-step approach to modelling efficiency: first, a standard SF model is estimated and used to generate efficiency predictions, and second, these predictions are regressed on a vector of explanatory variables. However, the second-step regression violates the assumption of independence in the first step, and Wang and Schmidt (2002) show that the two-step approach is severely biased. Given that ui is a random variable, appropriate approaches involve specifying one or more parameters of the error distributions as a function of a set of covariates.

Deprins and Simar (1989a, b), Reifschneider and Stevenson (1991), Kumbhakar et al. (1991), Huang and Liu (1994), and Battese and Coelli (1995) all propose extensions of the basic SF model whereby

$$u_{i} = g\left( {z_{i} ,\delta } \right) + w_{i}$$
(23)

where zi is a vector of ‘environmental’ variables influencing inefficiency, \(\delta\) is a vector of coefficients, and wi is a random error. In the Deprins and Simar (1989a, b) specification, \(g\left( {z_{i} ,\delta } \right) = \exp \left( {z_{i} \delta } \right)\) and wi = 0, and the model may be estimated via non-linear least squares or via ML assuming \(v_{i} \sim N\left( {0, \sigma_{v}^{2} } \right)\).Footnote 13 Reifschneider and Stevenson (1991) propose restricting both components of ui to be non-negative, i.e. \(g\left( {z_{i} ,\delta } \right), w_{i} \ge 0\), though as Kumbhakar and Lovell (2000) and Greene (2008) note, this is not required for \(u_{i} \ge 0\). An alternative approach was proposed by Kumbhakar et al. (1991), in which \(g\left( {z_{i} ,\delta } \right) = 0\) and wi is the truncation at zero of a normally distributed variable with mean \(z_{i} \delta\) and variance \(\sigma_{u}^{2}\). Huang and Liu (1994) proposed a model in which \(g\left( {z_{i} ,\delta } \right) = z_{i} \delta\) and wi is the truncation at \(- z_{i} \delta\) of an \(N\left( {0,\sigma_{u}^{2} } \right)\) random variable. The latter two models are in fact equivalent, as noted by Battese and Coelli (1995). In simple terms, the model assumes that \(v_{i} \sim N\left( {0,\sigma_{v}^{2} } \right)\) and \(u_{i} \sim N^{ + } \left( {\mu_{i} ,\sigma_{u}^{2} } \right)\), where \(\mu_{i} = z_{i} \delta\). Note that a constant term is included in zi, so that the model nests the normal-truncated normal model of Stevenson (1980) and the normal-half normal model.
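
As an illustration, the following sketch estimates a normal-truncated normal frontier with \(\mu_{i} = z_{i} \delta\) by ML on simulated data, using the standard form of the log-likelihood implied by the assumptions above; the data-generating process, variable names, and starting values are illustrative assumptions rather than features of any of the cited papers.

```python
import numpy as np
from scipy.stats import norm, truncnorm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, sv, su = 500, 0.25, 0.4
x = rng.normal(size=(n, 2))
z = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant included in z
mu = z @ np.array([0.2, 0.3])
# inefficiency: left truncation at zero of N(mu_i, su^2)
u = truncnorm.rvs(a=-mu / su, b=np.inf, loc=mu, scale=su, random_state=rng)
y = 1.0 + x @ np.array([0.5, 0.3]) + rng.normal(scale=sv, size=n) - u

def neg_loglik(theta):
    b0, b, d = theta[0], theta[1:3], theta[3:5]
    sv_, su_ = np.exp(theta[5]), np.exp(theta[6])
    sigma = np.hypot(sv_, su_)
    lam = su_ / sv_
    eps = y - b0 - x @ b
    mu_i = z @ d
    ll = (-np.log(sigma) + norm.logpdf((eps + mu_i) / sigma)
          + norm.logcdf(mu_i / (sigma * lam) - eps * lam / sigma)
          - norm.logcdf(mu_i / su_))
    return -ll.sum()

# note: the frontier constant and the constant in mu_i are only weakly
# identified, so their estimates can be imprecise in small samples
res = minimize(neg_loglik, x0=np.zeros(7), method="BFGS")
print("delta_hat =", res.x[3:5].round(3), "converged:", res.success)
```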

Another set of models, motivated by the desire to allow for heteroskedasticity in ui, specify the scale parameter, rather than the location parameter, of the distribution of ui as a function of a set of covariates.Footnote 14 Reifschneider and Stevenson (1991) first proposed amending the normal-half normal model so that \(\sigma_{ui} = h\left( {z_{i} } \right), h\left( {z_{i} } \right) \in \left( {0,\infty } \right)\), but did not make any particular suggestions about \(h\left( {z_{i} } \right)\) other than noting that the function must be constrained to be non-negative. Caudill and Ford (1993) suggested the functional form \(\sigma_{ui} = \sigma_{u} \left( {z_{i} \gamma } \right)^{\alpha }\), which nests the standard homoskedastic normal-half normal model when \(\alpha = 0\). Caudill et al. (1995) suggested a slightly sparser specification in which \(\sigma_{ui} = \sigma_{u} \exp \left( {z_{i} \gamma } \right)\), and Hadri (1999) proposed a similar ‘doubly heteroskedastic’ SF model, \(\sigma_{vi} = \exp \left( {z_{i} \theta } \right), \sigma_{ui} = \exp \left( {z_{i} \gamma } \right)\).

The approaches discussed above can be combined for an encompassing model in which both the location and scale parameters are functions of zi. Wang (2002) proposed a model in which \(u_{i} \sim N^{ + } \left( {\mu_{i} ,\sigma_{ui}^{2} } \right)\), where \(\mu_{i} = z_{i} \delta\) and \(\sigma_{ui}^{2} = \exp \left( {z_{i} \gamma } \right)\), while Kumbhakar and Sun (2013) took this a step further, estimating a model in which \(u_{i} \sim N^{ + } \left( {\mu_{i} ,\sigma_{ui}^{2} } \right)\) and \(v_{i} \sim N\left( {0,\sigma_{vi}^{2} } \right)\), where \(\mu_{i} = z_{i} \delta\), \(\sigma_{vi} = \exp \left( {z_{i} \theta } \right)\), and \(\sigma_{ui} = \exp \left( {z_{i} \gamma } \right)\), effectively combining the Hadri (1999) ‘doubly heteroskedastic’ model with that of Kumbhakar et al. (1991), Huang and Liu (1994), and Battese and Coelli (1995).Footnote 15

Given the motivation of explaining efficiency in terms of zi, and since zi enters the model in a non-linear way, it is desirable to calculate the marginal effect of zli, the lth environmental variable, on efficiency. Of course, given that ui is a random variable, we can only predict the marginal effect of zl on predicted efficiency, and this means that the marginal effects formula used depends fundamentally on the efficiency predictor adopted. Where \(u_{i} \sim N^{ + } \left( {\mu_{i} ,\sigma_{ui}^{2} } \right), \mu_{i} = z_{i} \delta\), the parameter \(\delta_{l}\) is the marginal effect of zli on the mode of the distribution of ui, except when \(z_{i} \delta \le 0\). The derivative of the unconditional mode predictor is

$$\partial M\left( {u_{i} } \right)/\partial z_{li} = \left\{ {\begin{array}{*{20}c} {\delta_{l} , z_{i} \delta > 0} \\ {0, z_{i} \delta \le 0} \\ \end{array} } \right. .$$
(24)

Therefore, the unconditional mode yields a relatively simple marginal effect. Alternatively, Wang (2002) derived a marginal effects formula based on the derivative of the unconditional mean, \(\partial E\left( {u_{i} } \right)/\partial z_{li}\). As the author shows, since \(E\left( {u_{i} } \right)\) depends on the scale parameter, as well as the location parameter, of the distribution, marginal effects calculated using this formula can be non-monotonic even if zli enters both functions in a linear fashion. This lends itself to potentially useful discussion of the ‘optimal’ (i.e. efficiency maximising) level of zli. As noted by Hadri (1999), the variables entering \(\mu_{i}\), \(\sigma_{vi}\), and \(\sigma_{ui}\) need not be the same in practice.
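
The following sketch computes \(\partial E\left( {u_{i} } \right)/\partial z_{li}\) by finite differences under a Wang (2002)-type parameterisation, \(\mu_{i} = z_{i} \delta\) and \(\sigma_{ui}^{2} = \exp \left( {z_{i} \gamma } \right)\); the parameter values are purely illustrative, and numerical differentiation is used in place of the closed-form expression derived by Wang (2002).

```python
import numpy as np
from scipy.stats import norm

def mean_trunc_normal(mu, sigma_u):
    # E(u) for u ~ N+(mu, sigma_u^2), i.e. the left truncation at zero
    r = mu / sigma_u
    return mu + sigma_u * norm.pdf(r) / norm.cdf(r)

def E_u(z, delta, gamma):
    mu = z @ delta
    sigma_u = np.sqrt(np.exp(z @ gamma))   # sigma_ui^2 = exp(z_i * gamma)
    return mean_trunc_normal(mu, sigma_u)

delta = np.array([0.1, 0.4])      # illustrative coefficients (constant, z_1)
gamma = np.array([-1.0, -0.6])
z = np.array([1.0, 0.5])          # a single observation, constant term first

h = 1e-6
z_up, z_dn = z.copy(), z.copy()
z_up[1] += h
z_dn[1] -= h
me = (E_u(z_up, delta, gamma) - E_u(z_dn, delta, gamma)) / (2 * h)
print("marginal effect of z_1 on E(u_i):", round(me, 4))
```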

The efficiency prediction is usually based on the distribution of \(\left. {u_{i} } \right|\varepsilon_{i}\) (specifically its mean) rather than ui. Kumbhakar and Sun (2013) argue that marginal effects should be based on \(\partial E\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li}\) rather than \(\partial E\left( {u_{i} } \right)/\partial z_{li}\), and show that in this case, marginal effects depend upon the parameters not only of fu but also of fv and upon \(\varepsilon_{i}\), i.e. all of the model’s variables and parameters. Stead (2017) derives a marginal effects formula based on the conditional mode, \(\partial M\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li}\), which is somewhat simpler, particularly when both \(\sigma_{vi} = \sigma_{v}\) and \(\sigma_{ui} = \sigma_{u}\), in which case \(\partial M\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li} = \delta_{l} \left[ {\sigma_{v}^{2} /\left( {\sigma_{v}^{2} + \sigma_{u}^{2} } \right)} \right]\) when \(M\left( {u_{i} |\varepsilon_{i} } \right) > 0\). Note that the marginal effects formulae discussed so far relate to changes in predicted ui rather than predicted efficiency: Stead (2017) derives a marginal effect based on the Battese and Coelli (1988) predictor, \(\partial E\left[ {\left. {\exp \left( { - u_{i} } \right)} \right|\varepsilon_{i} } \right]/\partial z_{li}\), and notes that the other formulae may be transformed into efficiency space by multiplying by \(- \exp \left( { - \hat{u}_{i} } \right)\), where \(\hat{u}_{i}\) is the predictor for ui, since \(\partial \exp \left( { - \hat{u}_{i} } \right)/\partial z_{li} = - (\partial \hat{u}_{i} /\partial z_{li} ){ \exp }\left( { - \hat{u}_{i} } \right)\). The choice between conditional and unconditional marginal effects formulae is a choice between predicting marginal effects for specific observations and quantifying the relationship between environmental variables and inefficiency in general.

The idea that marginal effects should be based on a predictor of \(\left. {u_{i} } \right|\varepsilon_{i}\) rather than ui has the appeal that the marginal effects discussed are consistent with the preferred efficiency predictor, in the sense that they indicate the change in predicted efficiency resulting from a change in zli. On the other hand, such marginal effects are sensitive to changes in the frontier variables and parameters and the parameters of fv, despite the fact that efficiency is not specified in this way. Another drawback is that while \(\partial E\left( {u_{i} } \right)/\partial z_{li}\) and \(\partial M\left( {u_{i} } \right)/\partial z_{li}\) are parameters for which standard errors and confidence intervals may be estimated, \(\partial E\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li}\) and \(\partial M\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li}\) are random variables for which prediction intervals are the only appropriate estimate of uncertainty, making hypothesis testing impossible. Kumbhakar and Sun (2013) suggest a bootstrapping approach to derive confidence intervals for \(\partial E\left( {u_{i} |\varepsilon_{i} } \right)/\partial z_{li}\), but this is inappropriate since it treats \(\varepsilon_{i}\) as known.Footnote 16

Given the rather complex marginal effects implied by the models discussed above, alternative specifications with simpler marginal effects have been proposed. Simar et al. (1994) propose that zi should enter as a scaling function, such that \(u_{i} = f\left( {z_{i} \eta } \right)u_{i}^{*}\), where \(u_{i}^{*}\) is assumed to follow some non-negative distribution that does not depend on zi, and \(f\left( {z_{i} \eta } \right)\) is a non-negative scaling function similar to those used in Battese and Coelli (1992) type panel data models. Wang and Schmidt (2002) note several features of this formulation: first, the shape of the distribution of ui is the same for all observations, with \(f\left( {z_{i} \eta } \right)\) simply scaling the distribution; models with this property are described as having the ‘scaling property’. Second, it may yield relatively simple marginal effects expressions, e.g. when \(f\left( {z_{i} \eta } \right) = \exp \left( {z_{i} \eta } \right)\) or similar.Footnote 17 Third, as suggested by Simar et al. (1994), the \(\beta\) and \(\eta\) may be estimated via non-linear least squares without specifying a particular distribution for \(u_{i}^{*}\). The scaling property is discussed further by Alvarez et al. (2006), who suggested testing for the scaling property.
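
A minimal sketch of this idea follows, estimating \(\beta\), \(\eta\), and \(E\left( {u_{i}^{*} } \right)\) by non-linear least squares on simulated data with \(f\left( {z_{i} \eta } \right) = \exp \left( {z_{i} \eta } \right)\) and no distributional assumption on \(u_{i}^{*}\); variable names and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
n = 400
x = np.column_stack([np.ones(n), rng.normal(size=n)])
z = rng.normal(size=n)                        # no constant in z (normalisation)
u_star = np.abs(rng.normal(scale=0.5, size=n))
y = (x @ np.array([1.0, 0.5])
     + rng.normal(scale=0.2, size=n)
     - np.exp(0.6 * z) * u_star)

def residuals(theta):
    # E(y | x, z) = x*beta - exp(eta*z) * E(u*), so the residual is
    # y - x*beta + exp(eta*z) * mu_star
    beta, eta, mu_star = theta[:2], theta[2], theta[3]
    return y - x @ beta + np.exp(eta * z) * mu_star

fit = least_squares(residuals, x0=np.array([0.0, 0.0, 0.0, 0.1]))
print("beta:", fit.x[:2].round(3), "eta:", round(fit.x[2], 3),
      "E(u*):", round(fit.x[3], 3))
```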

More recently, Amsler et al. (2015) suggested an alternative parameterisation such that zi enters the model through the post-truncation, rather than the pre-truncation, parameters of fu. For example, the left truncation at zero of an \(N\left( {\mu_{i} ,\sigma_{ui}^{2} } \right)\) random variable, which we have denoted \(N^{ + } \left( {\mu_{i} ,\sigma_{ui}^{2} } \right)\), may be reparameterised in terms of \(E\left( {u_{i} } \right)\) and \(\text{VAR}\left( {u_{i} } \right)\); that is, fu may be expressed in terms of these parameters, and as a result, so may \(f_{\varepsilon }\). The authors show that marginal effects are simpler and easier to interpret when environmental variables enter the model such that \(E\left( {u_{i} } \right) = g\left( {z_{i} ,\delta } \right), \text{VAR}\left( {u_{i} } \right) = h\left( {z_{i} ,\gamma } \right)\) than when \(\mu_{i} = g\left( {z_{i} ,\delta } \right), \sigma_{ui}^{2} = h\left( {z_{i} ,\gamma } \right)\). This is intuitive, given that we predict based on post-truncation parameters of fu or \(f_{\left. u \right|\varepsilon }\). This approach is complicated somewhat by the requirement that \(E\left( {u_{i} } \right) > \text{VAR}\left( {u_{i} } \right)\), as shown by Eq. (3) in Barrow and Cohen (1954), Eq. (16) in Bera and Sharma (1999), and Lemma 1 of Horrace (2015). For this reason, the authors suggest a specification in which \(\text{VAR}\left( {u_{i} } \right) = \exp \left( {z_{i} \gamma } \right)\) and \(E\left( {u_{i} } \right) = \text{VAR}\left( {u_{i} } \right) + \exp \left( {z_{i} \delta } \right).\)

An additional motivation for the models discussed in this section is the analysis of production risk. Bera and Sharma (1999) proposed, in the context of a production frontier model, that \(\text{VAR}\left( {u_{i} |\varepsilon_{i} } \right)\) be used as a measure of ‘production uncertainty’ or risk. Note however that this is a far more restrictive measure than that used in the wider literature on production risk, which is variability of output, measured, for example, by \(\text{VAR}\left( {y_{i} } \right)\). Nevertheless, these models offer considerable flexibility in modelling production risk according to this definition. Just and Pope (1978) showed that a drawback of log-linear (non-frontier) production function specifications, in which \(q_{i} = \exp \left( {y_{i} } \right)\), is that the marginal production risk (i.e. the partial derivative of production risk) with respect to a given variable must always have the same sign as that variable’s marginal product. The authors proposed an alternative specification with an additive error term multiplied by a scaling function. This form allows variables to affect production and production risk in opposite directions, or to affect one but not the other. Kumbhakar (1993) and Battese et al. (1997) proposed SF variants of this model by including an inefficiency term ui. Note, however, that any SF model in which one or both error terms are heteroskedastic allows for observation-specific production risk.

6 Alternative Noise Distributions

In the standard SF model, the noise term is assumed to follow a normal distribution. In contrast to the many different proposals concerning the distribution of ui, discussed in Sect. 3, the distribution of vi has received relatively little attention. This is perhaps natural, given that the main focus of SFA is on estimation or prediction of the former component. Nevertheless, consideration of alternative distributions for vi is important for at least two main reasons. First, the standard model is not robust to outliers caused by noise, i.e. when the true noise distribution has thick tails. Second, and perhaps more importantly, the distribution of vi has implications for the deconvolution of \(\varepsilon_{i}\) into noise and inefficiency components. Specifically, the distribution of \(\left. {u_{i} } \right|\varepsilon_{i}\), on which efficiency prediction is typically based, is influenced by fv as well as fu, as shown in (4).

The latter point in particular is not trivial. A change in distributional assumptions regarding vi affects the degree to which the prediction \(E\left( {u_{i} |\varepsilon_{i} } \right)\) is shrunk towards \(E\left( {u_{i} } \right)\).Footnote 18 A change in the assumed noise distribution can even be sufficient to change the rankings of firmsFootnote 19 by altering the monotonicity properties of \(E\left( {u_{i} |\varepsilon_{i} } \right)\) with respect to \(\varepsilon_{i}\), which are in turn linked to the log-concavity properties of fv. Ondrich and Ruggiero (2001) prove that \(E\left( {u_{i} |\varepsilon_{i} } \right)\) is a weakly (strictly) monotonic function of \(\varepsilon_{i}\) for any weakly (strictly) log-concave fv. Since the normal density is strictly log-concave everywhere, \(E\left( {u_{i} |\varepsilon_{i} } \right)\) is a monotonic function of \(\varepsilon_{i}\) in the standard model. Under alternative noise distributions for which fv is not strictly log-concave everywhere, there may be a weakly monotonic or even non-monotonic relationship between \(E\left( {u_{i} |\varepsilon_{i} } \right)\) and \(\varepsilon_{i}\). Such relationships have been noted in several studies proposing alternative, heavy-tailed, noise distributions, which are discussed below.
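
These monotonicity properties can be inspected numerically. The sketch below computes \(E\left( {u_{i} |\varepsilon_{i} } \right)\) by quadrature for a half-normal ui paired first with normal and then with Cauchy noise, over a grid of \(\varepsilon_{i}\) values; the scale parameters are illustrative, and the example is only intended to show how the choice of fv shapes the conditional mean.

```python
import numpy as np
from scipy.stats import norm, halfnorm, cauchy
from scipy.integrate import quad

def cond_mean_u(eps, f_v, sigma_u=0.5):
    # E(u | eps) = int u f_v(eps + u) f_u(u) du / int f_v(eps + u) f_u(u) du
    num = quad(lambda u: u * f_v(eps + u) * halfnorm.pdf(u, scale=sigma_u),
               0, np.inf)[0]
    den = quad(lambda u: f_v(eps + u) * halfnorm.pdf(u, scale=sigma_u),
               0, np.inf)[0]
    return num / den

eps_grid = np.linspace(-2.0, 2.0, 9)
for label, f_v in [("normal", lambda e: norm.pdf(e, scale=0.3)),
                   ("cauchy", lambda e: cauchy.pdf(e, scale=0.3))]:
    vals = [cond_mean_u(e, f_v) for e in eps_grid]
    print(label, np.round(vals, 3))
```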

Nguyen (2010) proposed SF models with Cauchy and Laplace distributions for vi, pairing the former with half Cauchy and truncated Cauchy distributions for ui, and the latter with exponential and truncated Laplace distributions for ui.Footnote 20 Gupta and Nguyen (2010) derive a Cauchy-half Cauchy panel data model with time-invariant inefficiency. Horrace and Parmeter (2018) consider the Laplace-truncated Laplace and Laplace-exponential SF models further, showing that \(f_{\left. u \right|\varepsilon }\) (and therefore also \(E\left( {u_{i} |\varepsilon_{i} } \right)\), or for that matter any predictor based on \(f_{\left. u \right|\varepsilon }\)) is constant for \(s\varepsilon_{i} \ge 0\). The authors conjecture that the assumption of a Laplace distributed vi may be advantageous in terms of estimation of fu, and therefore for the deconvolution of the composed error. Fan (1991) showed that optimal rates of convergence in deconvolution problems decrease with the smoothness of the noise distribution and are considerably faster for ordinary smooth distributions, such as the Laplace, than for super smooth distributions, such as the normal distribution. Optimal convergence rates for nonparametric Gaussian deconvolution are discussed by Fan (1992). Horrace and Parmeter (2011) find that consistent estimation of the distribution of ui in a semiparametric SF model, in which \(v_{i} \sim N\left( {0,\sigma_{v}^{2} } \right)\) and fu is unknown, has a \(\ln n\) convergence rate. This implies that convergence rates when \(v_{i} \sim N\left( {0,\sigma_{v}^{2} } \right)\) are rather slow.

In the aforementioned proposals, the distribution of ui is the left truncation at zero of the distribution of vi. In many cases, this ensures that \(f_{\varepsilon }\) can be expressed analytically. Proposition 9 of Azzalini and Capitanio (2003) shows the density of the sum of a random variable and the absolute value of another random variable following the same elliptical distribution. Stead et al. (2018) propose the use of MSL to pair a thick-tailed distribution for vi with any given distribution for ui, and estimate a logistic-half normal SF model. The authors show that the model yields a narrower range of efficiency scores compared to the normal-half normal model.
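
A minimal MSL sketch in this spirit is given below, pairing logistic noise with a half-normal inefficiency term by averaging the noise density over a fixed set of half-normal draws; this is a simplified illustration on simulated data, with illustrative names and starting values, rather than the exact implementation of Stead et al. (2018).

```python
import numpy as np
from scipy.stats import logistic, norm
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, R = 500, 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (x @ np.array([1.0, 0.5])
     + logistic.rvs(scale=0.2, size=n, random_state=rng)
     - np.abs(rng.normal(scale=0.4, size=n)))

zeta = np.abs(rng.normal(size=(1, R)))   # common draws, fixed across iterations

def neg_simulated_loglik(theta):
    beta, sv, su = theta[:2], np.exp(theta[2]), np.exp(theta[3])
    eps = (y - x @ beta)[:, None]              # n x 1
    # f_eps(eps) ~= mean over draws of f_v(eps + su * |zeta|)
    f_eps = logistic.pdf(eps + su * zeta, scale=sv).mean(axis=1)
    return -np.sum(np.log(f_eps + 1e-300))

res = minimize(neg_simulated_loglik, x0=np.array([0.0, 0.0, -1.0, -1.0]),
               method="Nelder-Mead", options={"maxiter": 5000})
print("beta:", res.x[:2].round(3),
      "sigma_v:", round(np.exp(res.x[2]), 3),
      "sigma_u:", round(np.exp(res.x[3]), 3))
```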

There are two drawbacks of the above proposals for vi. First, they have fixed shapes, so there is no flexibility in the heaviness of their tails. Second, they do not nest the normal distribution, which makes testing against the standard SF model difficult. One potential noise distribution with neither of these shortcomings is the Student’s t distribution, which has a ‘degrees of freedom’ parameter \(\alpha\) that determines the heaviness of the tails, and which approaches the normal distribution as \(\alpha \to \infty\). Tancredi (2002) proposed an SF model in which vi and ui follow non-standard Student’s t and half t distributions, with scale parameters \(\sigma_{v}\) and \(\sigma_{u}\), respectively, and a common degrees of freedom parameter \(\alpha\). The author shows that \(f_{\left. u \right|\varepsilon } \to 0\) as \(s\varepsilon_{i} \to \infty\) and that \(E\left[ {\left. {\exp \left( { - u_{i} } \right)} \right|\varepsilon_{i} } \right]\) and \(\text{VAR}\left[ {\left. {\exp \left( { - u_{i} } \right)} \right|\varepsilon_{i} } \right]\) are non-monotonic functions of \(\varepsilon_{i}\). Wheat et al. (2019) estimate a t-half normal model via MSL, similarly finding that \(E\left( {u_{i} |\varepsilon_{i} } \right)\) is non-monotonic, decreasing with \(s\varepsilon_{i}\) at either tail, and discuss testing against the normal-half normal SF model. Bayesian estimation of the t-half t model, and of t-half normal, t-exponential, and t-gamma SF models, is discussed by Tchumtchoua and Dey (2007) and Griffin and Steel (2007), respectively.

Another proposal, which nests the standard SF model and allows for flexibility in the kurtosis of vi, is that of Wheat et al. (2017), in which vi follows a mixture of two normal distributions with zero means, variances \(\sigma_{v1}^{2}\) and \(\sigma_{v2}^{2}\), respectively, and mixing parameter p. This is often referred to as the contaminated normal distribution.Footnote 21 Alternatively, the model can be interpreted as a latent class model with two regimes having differing noise variances. The authors discuss efficiency prediction in latent class and mixture SF models, and show that \(E\left( {u_{i} |\varepsilon_{i} } \right)\) is non-monotonic in the contaminated normal-half normal case, as in the t-half normal case. Testing down to the standard SF model is less straightforward in this case, since there is an unidentified parameter under the null hypothesis.

The proposals discussed in this section have all been motivated to one degree or another by the need to accommodate outliers in a satisfactory way. An exception to this general rule is Bonanno et al. (2017), who propose an SF model with correlated error components—for a discussion of such models, see Sect. 8.1—in which the marginal distributions of vi and ui are skew logistic and exponential, respectively. The motivation in this case is to allow for non-zero efficiency predictions in the presence of ‘wrong skew’, which the model ascribes to the skewness of vi.

7 Presence of Efficient Firms

A number of papers have considered SFA in the case where some significant proportion of firms lie on the frontier—i.e. are fully efficient—and discussed SF specifications and efficiency prediction appropriate for this case, along with methods used to identify the subset of efficient firms.

Horrace and Schmidt (2000) discuss multiple comparisons with the best (MCB)—see Hsu (1981, 1984) for background on MCB—in which there are I populations each with their own distinct parameter values, ai, one of which—e.g. the maximum or the minimum—is the ‘best’ in some sense, against which we want to compare the remaining I − 1 populations. Rather than make individual comparisons, e.g. by testing \(H_{0} : a_{i} = a_{b}\) where \(a_{b} = \max_{j \ne i} sa_{j}\), MCB constructs joint confidence intervals for a vector of differences \(\left( {\begin{array}{*{20}c} {a_{b} - a_{1} } & {a_{b} - a_{2} } & \ldots & {a_{b} - a_{I - 1} } \\ \end{array} } \right)\). This is motivated by the need to consider the ‘multiplicity effect’ (Hochberg and Tamhane 1987), i.e. the fact that if a large enough number of comparisons are made, some differences are bound to appear significant. MCB is also concerned with constructing a set of populations which could be the best. Horrace and Schmidt (2000) discuss the application of MCB to derive such multivariate intervals in the context of the fixed effects, time-invariant efficiency panel SF model of Schmidt and Sickles (1984), and the selection of a set of efficient (or probably efficient) firms based on these.

An alternative approach proposed by Jung (2017) is to use a least absolute shrinkage and selection operator (LASSO) variant of the Schmidt and Sickles (1984) model. LASSO, introduced by Tibshirani (1996) in the context of OLS, is a method for variable selection and for penalising overfitting by shrinking parameter estimates towards zero, such that

$$\hat{\beta }_{LASSO} = \mathop {\text{argmin}}\limits_{\beta } \left[ {\frac{1}{I}\sum\limits_{i = 1}^{I} {\varepsilon_{i}^{2} } + \lambda \sum\limits_{k = 1}^{K} {\left| {\beta_{k} } \right|} } \right]$$
(25)

where K is the number of regressors, and \(\lambda\) is a tuning parameter that determines the strength of the penalty (or the degree of shrinkage). The constant term \(\beta_{0}\) is excluded from the penalty term. The penalty is such that it forces some of the coefficients to be exactly zero, hence its usefulness in variable selection. It is straightforward to extend the approach to a fixed-effects panel data model. Jung (2017) proposes extending the approach to the Schmidt and Sickles (1984) fixed effects SF model, in which \(\beta_{0} = \max_{j} sa_{j}\) and \(u_{i} = \max_{j} sa_{j} - sa_{i}\), and introduces an additional penalty term such that the inefficiency parameters are shrunk towards zero, and ui = 0 for a subset of firms. The author discusses the properties of the model, and in applying the model to a dataset used by Horrace and Schmidt (2000), notes that the resulting set of efficient firms is similar to that obtained using the MCB approach.
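
For reference, the following snippet illustrates the LASSO objective in (25) using scikit-learn, whose Lasso scales the squared loss by 1/(2I) so that its alpha corresponds to \(\lambda /2\); the data are simulated, and the snippet illustrates only the shrinkage-to-exactly-zero property that Jung (2017) exploits for the inefficiency parameters, not the author's estimator itself.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
I, K = 200, 10
X = rng.normal(size=(I, K))
beta_true = np.zeros(K)
beta_true[:3] = [1.0, -0.5, 0.25]        # only the first three regressors matter
y = 0.5 + X @ beta_true + rng.normal(scale=0.3, size=I)

fit = Lasso(alpha=0.05, fit_intercept=True)   # intercept excluded from penalty
fit.fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
print("estimates:", fit.coef_.round(3))
```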

Kumbhakar et al. (2013) proposed a zero inefficiency stochastic frontier (ZISF) model. The ZISF model adapts the standard parametric SF model to account for the possibility that a proportion, p, of the firms in the sample are fully efficient using a latent class approach in which ui = 0 with probability p. That is, the ZISF model is a latent class model in which

$$f_{\varepsilon } \left( {\varepsilon_{i} } \right) = pf_{v} \left( {\varepsilon_{i} } \right) + \left( {1 - p} \right)\int\limits_{0}^{\infty } {f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)\text{d}u_{i} }$$
(26)

where fv is the density of vi, the assumed noise distribution, and fu is the density of ui in the second regime. In the first regime, ui can be thought of as following a degenerate distribution at zero. The ZISF model nests the standard SF model when p = 0, and testing down to the SF model is a standard problem. On the other hand, testing \(H_{0} : p = 1\), i.e. that all firms are fully efficient, is more complicated, given that the splitting proportion p lies on the boundary of the parameter space in this case. The authors suggest that the LR statistic follows a \(\chi_{1:0}^{2}\) distribution.Footnote 22 That is, a 50:50 mixture of \(\chi_{0}^{2}\) and \(\chi_{1}^{2}\) distributions. However, Rho and Schmidt (2015) question the applicability of this result, noting an additional complication: under \(H_{o} :p = 1,\) \(\sigma_{u}\) is not identified. Equivalently, p is not identified under \(H_{o} :\sigma_{u} = 0.\) Simulation evidence provided by the authors suggests that estimates of these two parameters are likely to be imprecise when either is small.

Kumbhakar et al. (2013) suggest several approaches to efficiency prediction from the ZISF model. First, the authors suggest weighting regime-specific efficiency predictions by unconditional probabilities of regime membership. Since \(\hat{u}_{i} = 0\) in the first regime regardless of the predictor used, this amounts to using \(\left( {1 - p} \right)E\left( {u_{i} |\varepsilon_{i} } \right)\). This is clearly unsatisfactory, as each firm is assigned the same (unconditional) probabilities of regime membership. A preferable alternative, suggested by both Kumbhakar et al. (2013) and Rho and Schmidt (2015), is to use \(\left( {1 - p_{i} } \right)E\left( {u_{i} |\varepsilon_{i} } \right)\), where \(p_{i} = pf_{v} \left( {\varepsilon_{i} } \right)/f_{\varepsilon } \left( {\varepsilon_{i} } \right)\), which is a firm-specific probability conditional on \(\varepsilon_{i}\). Note that both \(\left( {1 - p} \right)E\left( {u_{i} |\varepsilon_{i} } \right)\) and \(\left( {1 - p_{i} } \right)E\left( {u_{i} |\varepsilon_{i} } \right)\) will yield non-zero predictions of ui for all i and any value of \(\varepsilon_{i}\) under the assumption that \(v_{i} \sim N\left( {0,\sigma_{v}^{2} } \right)\) (see the discussion of the monotonicity properties of \(E\left( {u_{i} |\varepsilon_{i} } \right)\) in Sect. 6), despite the fact that we expect pI efficient firms in the sample. Kumbhakar et al. (2013) suggest identifying firms as efficient when pi is greater than some cut-off point; however, the choice of such a cut-off point is arbitrary.
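
The sketch below illustrates these quantities for a normal-half normal ZISF specification, computing pi and the weighted predictor \(\left( {1 - p_{i} } \right)E\left( {u_{i} |\varepsilon_{i} } \right)\) from parameter values that are treated as already estimated and are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

p, sv, su = 0.3, 0.2, 0.5          # illustrative ZISF parameter "estimates"
eps = np.array([-0.6, -0.1, 0.2])  # composed-error residuals (production frontier)

sigma = np.hypot(sv, su)
lam = su / sv

# density of eps in each regime
f_regime1 = norm.pdf(eps, scale=sv)                               # u_i = 0
f_regime2 = (2 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-eps * lam / sigma)
f_eps = p * f_regime1 + (1 - p) * f_regime2

# conditional probability of the fully efficient regime
p_i = p * f_regime1 / f_eps

# E(u_i | eps_i) in the inefficient regime (normal-half normal result)
mu_star = -eps * su**2 / sigma**2
s_star = su * sv / sigma
E_u_cond = mu_star + s_star * norm.pdf(mu_star / s_star) / norm.cdf(mu_star / s_star)

u_hat = (1 - p_i) * E_u_cond
print("p_i:", p_i.round(3))
print("predicted u_i:", u_hat.round(3))
```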

Despite the ZISF model’s motivation, efficient firms cannot be identified on the basis of the resulting point predictions of efficiency or conditional probabilities of regime membership. Firms may be predicted as fully efficient if the conditional mode predictor is used, or possibly if an alternative distribution for vi is assumed (again, refer to Sect. 6), but this is equally true in the standard SF context. An appropriate approach to classifying firms would be to identify those whose minimum width prediction intervals for \(\left. {u_{i} } \right|\varepsilon_{i}\), analogous to those derived by Wheat et al. (2014) in the standard SF model, include zero.

There are trade-offs between each of the three proposed methods. Compared to the ZISF model, the MCB and LASSO approaches have the advantage that no particular distribution for ui is imposed, and efficient firms can be identified on the basis of hypothesis tests. In contrast, the ZISF model limits us to examining prediction intervals. On the other hand, Horrace and Schmidt (2000) and Jung (2017) assume time-invariant efficiency. While Horrace and Schmidt (2000) state that the MCB approach could be adapted to allow for time-varying efficiency (and the same may be true of the LASSO approach), the ZISF approach is the only one that can be applied to cross-sectional data. In addition, it would be straightforward to extend the ZISF approach to incorporate many features found in the SF literature.

8 Miscellaneous Proposals

In this section, we discuss several of the lesser and relatively tangential strands of the SF literature which have adopted novel distributional forms.

8.1 Correlated Errors

A common assumption across all of the aforementioned SF specifications is that the error components, including any noise, inefficiency, and random effects components, are distributed independently of one another.Footnote 23 Relaxing this assumption seems particularly justified in cases in which there are two or more inefficiency components. Independence between noise and inefficiency terms is usually assumed on the basis that noise represents random factors unrelated to efficiency. On the other hand, it has been argued that such factors may affect firm decision making and therefore efficiency.

Similar to the panel data case discussed in Sect. 4.1, one approach to modelling dependence between errors has been to specify some multivariate analogue to common distributional assumptions under independence. Schmidt and Lovell (1980), Pal and Sengupta (1999), and Bandyopadhyay and Das (2006) consider a bivariate normal distribution left truncated at zero with respect to the one-sided inefficiency component.Footnote 24 The two-sided component represents noise in the latter two cases and allocative inefficiency in the former. Pal and Sengupta (1999) likewise included allocative inefficiency components, which are assumed to follow a multivariate normal distribution. However, the marginal distributions of the error components are not those commonly used under independence and, more importantly, may be inappropriate. Bandyopadhyay and Das (2006) show that while the marginal distribution of ui in their model is half normal, that of vi is skew normal, with skewness determined by the correlation between the two error components. An unusual approach was proposed by Pal (2004), in which conditional distributions for the error components are specified directly along with their marginal distributions. Prediction of efficiency is based on \(f_{\left. u \right|\varepsilon }\) as in the case of independence.

The use of a copula function to allow for dependence between vi and ui was proposed by Smith (2008) and El Mehdi and Hafner (2014). Various alternatives are considered, including the Ali-Mikhail-Haq, Clayton, Farlie-Gumbel-Morgenstern, Frank, and Gaussian copulas. From Sklar’s theorem, the joint density fv,u is the product of the marginal densities and the density of the copula. It follows that (3) and (4) must be modified such that

$$f_{\varepsilon } \left( {\varepsilon_{i} } \right) = \int\limits_{0}^{\infty } {f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)c_{v,u} \left[ {F_{v} \left( {\varepsilon_{i} + su_{i} } \right),F_{u} \left( {u_{i} } \right)} \right]\text{d}u_{i} }$$
(27)

and

$$f_{{u_{i} |\varepsilon_{i} }} \left( {u_{i} |\varepsilon_{i} } \right) = \frac{{f_{v} \left( {\varepsilon_{i} + su_{i} } \right)f_{u} \left( {u_{i} } \right)c_{v,u} \left[ {F_{v} \left( {\varepsilon_{i} + su_{i} } \right),F_{u} \left( {u_{i} } \right)} \right]}}{{f_{\varepsilon } \left( {\varepsilon_{i} } \right)}}$$
(28)

where cv,u is the copula density. Gómez-Déniz and Pérez-Rodríguez (2015) specify a bivariate Sarmanov distribution for vi and ui with normal and half-normal marginal distributions, respectively. Again, the advantage of the copula approach is that the desired marginal distributions are obtained, with the dependence between the error components captured by cv,u.
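
As an illustration of (27), the sketch below evaluates the copula-modified density by quadrature for a Gaussian copula with normal noise and half-normal inefficiency marginals (s = 1); the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm, halfnorm
from scipy.integrate import quad

def gaussian_copula_density(a, b, rho):
    # clip to keep the normal quantile function finite at the tails
    a = np.clip(a, 1e-12, 1 - 1e-12)
    b = np.clip(b, 1e-12, 1 - 1e-12)
    q1, q2 = norm.ppf(a), norm.ppf(b)
    return (1 / np.sqrt(1 - rho**2)) * np.exp(
        -(rho**2 * (q1**2 + q2**2) - 2 * rho * q1 * q2) / (2 * (1 - rho**2)))

def f_eps(eps, sv=0.3, su=0.5, rho=0.4):
    # f_eps(eps) = int f_v(eps + u) f_u(u) c[F_v(eps + u), F_u(u)] du
    def integrand(u):
        return (norm.pdf(eps + u, scale=sv) * halfnorm.pdf(u, scale=su)
                * gaussian_copula_density(norm.cdf(eps + u, scale=sv),
                                          halfnorm.cdf(u, scale=su), rho))
    return quad(integrand, 0, np.inf)[0]

for e in (-0.5, 0.0, 0.5):
    print("eps =", e, " f_eps =", round(f_eps(e), 4))
```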

8.2 Sample Selection and Endogenous Switching

In the context of linear regression, the sample selection model of Heckman (1976, 1979) is such that

$$y_{i} = \left\{ {\begin{array}{*{20}l} {x_{i} \beta + \varepsilon_{i} , } \hfill & { d_{i} = 1} \hfill \\ {\text{unobserved, }} \hfill & { d_{i} = 0} \hfill \\ \end{array} } \right.,\quad d_{i} = I\left( {d_{i}^{ *} = z_{i} \alpha + w_{i} > 0} \right) ,$$
(29)

where symmetric error terms \(\varepsilon_{i}\) and wi are assumed to follow a bivariate normal distribution with zero means, variances \(\sigma_{\varepsilon }^{2}\) and 1, and correlation coefficient \(\rho\). Unless \(\rho = 0\), least squares will yield biased estimates. Since \(E\left( {y_{i} |x_{i} , d_{i} = 1} \right) = x_{i} \beta + \rho \sigma_{\varepsilon } f_{w} \left( {z_{i} \alpha } \right)/F_{w} \left( {z_{i} \alpha } \right)\), Heckman (1979) proposed a two-step, limited information method in which yi is regressed on xi and the inverse Mills’ ratio \(f_{w} \left( {z_{i} \hat{\alpha }} \right)/F_{w} \left( {z_{i} \hat{\alpha }} \right)\), where \(\hat{\alpha }\) is obtained from a single equation probit model estimated by ML. Alternatively, a full information ML approach may be used to estimate the parameters of the model simultaneously, as in Heckman (1976) and Maddala (1983).
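
A compact sketch of the two-step procedure on simulated data follows, with a probit first step estimated by ML and a second-step OLS regression including the inverse Mills’ ratio on the selected sample; variable names and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
err = multivariate_normal.rvs(mean=[0, 0], cov=[[0.25, 0.3], [0.3, 1.0]],
                              size=n, random_state=rng)
eps, w = err[:, 0], err[:, 1]
d = (z @ np.array([0.2, 0.8, -0.5]) + w > 0).astype(float)
y = np.where(d == 1, x @ np.array([1.0, 0.5]) + eps, np.nan)  # unobserved if d = 0

# Step 1: probit by ML for P(d = 1 | z)
def probit_negll(alpha):
    idx = z @ alpha
    return -np.sum(d * norm.logcdf(idx) + (1 - d) * norm.logcdf(-idx))

alpha_hat = minimize(probit_negll, x0=np.zeros(z.shape[1]), method="BFGS").x

# Step 2: OLS of y on x and the inverse Mills' ratio, selected sample only
imr = norm.pdf(z @ alpha_hat) / norm.cdf(z @ alpha_hat)
sel = d == 1
X2 = np.column_stack([x[sel], imr[sel]])
b2 = np.linalg.lstsq(X2, y[sel], rcond=None)[0]
print("beta_hat:", b2[:2].round(3), "rho*sigma_eps:", round(b2[2], 3))
```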

A similar problem is that of endogenous switching. The endogenous switching model of Heckman (1978) has two regimes, membership of which is dependent upon a binary switching dummy:

$$y_{i} = \left\{ {\begin{array}{*{20}l} {x_{i} \beta_{1} + \varepsilon_{1i} ,} \hfill & {d_{i} = 1} \hfill \\ {x_{i} \beta_{2} + \varepsilon_{2i} ,} \hfill & {d_{i} = 0} \hfill \\ \end{array} } \right.,\quad d_{i} = I\left( {d_{i}^{ *} = z_{i} \alpha + w_{i} > 0} \right)$$
(30)

where \(\varepsilon_{1i}\), \(\varepsilon_{2i}\) and wi are assumed to follow a trivariate normal distribution with zero means, and variances \(\sigma_{1\varepsilon }^{2}\), \(\sigma_{2\varepsilon }^{2}\), and \(\sigma_{w}^{2}\). The correlations of \(\varepsilon_{1i}\) and \(\varepsilon_{2i}\) with wi are given by \(\rho_{1}\) and \(\rho_{2}\), respectively, while \(\rho_{12}\) is the correlation between \(\varepsilon_{1i}\) and \(\varepsilon_{2i}\). Again, both two-step partial information and full information ML approaches may be used to estimate the parameters of the model.

In recent years, SF models incorporating sample selection and endogenous switching have been proposed. Bradford et al. (2001) and Sipiläinen and Oude Lansink (2005) use the Heckman (1979) two-step approach, including the estimated inverse Mills’ ratios from single equation probit selection and switching models, respectively, as independent variables in their SF models. However, this is inappropriate in non-linear settings such as SFA, since it is generally not the case that \(E\left[ {\left. {g\left( {x_{i} \beta + \varepsilon_{i} } \right)} \right|d_{i} = 1} \right] = g\left[ {x_{i} \beta + \rho \sigma_{\varepsilon } f_{w} \left( {z_{i} \alpha } \right)/F_{w} \left( {z_{i} \alpha } \right)} \right]\) where g is some non-linear function. Terza (2009) discusses ML estimation of non-linear models with endogenous switching or sample selection in general.

In the SF context, there are many alternative assumptions that may be made about the relationship between noise, inefficiency, and the stochastic component of the selection (or switching) equation. Perhaps the natural approach, implicit in Bradford et al. (2001) and Sipiläinen and Oude Lansink (2005), is to assume that the symmetric noise terms follow a multivariate normal distribution as in the linear model, while the inefficiency terms are drawn from independent one-sided univariate distributions. This is proposed by Greene (2010), who estimates an SF model with sample selection via MSL, and also by Lai (2015), who uses the result that, in both the sample selection and endogenous switching cases, \(f_{\left. \varepsilon \right|d}\) follows a closed skew-normal distribution when the inefficiency terms are truncated normal. This results in analytical log-likelihoods, and the author proposes to predict efficiency based on the distribution of \(\left. {u_{i} } \right|\left( {\varepsilon_{i} |d_{i} } \right)\), specifically using \(E\left[ {\left. {\exp \,\left( { - u_{i} } \right)} \right|\left( {\left. {\varepsilon_{i} } \right|d_{i} } \right)} \right]\).

Note that the distributional assumptions in Greene (2010) and Lai (2015) ensure appropriate marginal distributions for each error component, but do not allow for correlation between the inefficiency terms and the symmetric errors. Lai et al. (2009) introduce correlation between \(\varepsilon_{i}\) (rather than its components) and wi through a copula function. Departing from the usual approach, Kumbhakar et al. (2009) propose an SF model with an endogenous switching equation in which \(d_{i}^{*} = z_{i} \alpha + \delta u_{i} + w_{i}\). That is, they include the inefficiency term as a determinant of regime membership.Footnote 25 The various error components are assumed to be independent of one another, and both the log-likelihood of the model and \(E\left[ {\left. {u_{i} } \right|\left( {\varepsilon_{i} |d_{i} } \right)} \right]\) are obtained by quadrature.

8.3 Two-Tiered Models

SF methods have been widely applied outside of the context of production and cost frontier estimation. Most applications have utilised standard cross-section or panel data SF specifications, or some of the variants discussed above. However, one area of application which has seen its own distinct methodological developments is modelling of earnings determination. Polachek and Yoon (1987) proposed a ‘two-tiered’ SF (2TSF) model in which

$$\varepsilon_{i} = v_{i} - u_{i} + w_{i} ,$$
(31)

where vi is again a normally distributed noise component, and ui and wi follow exponential distributions with means \(\sigma_{u}\) and \(\sigma_{w}\), respectively.Footnote 26 The dependent variable is a worker’s actual wage. The ui component captures deviations from the firm’s reservation wage—i.e. the maximum wage offer the firm would make—as a result of incomplete information on the part of the employee. Similarly, wi captures deviations from the worker’s reservation wage—i.e. the minimum wage offer the worker would accept—as a result of incomplete information on the part of the employer. The inclusion of these two terms therefore allows estimation of the extent of average employee and employer incomplete information, and even observation-specific predictions of these. The assumption of exponentially distributed ui and wi makes derivation of \(f_{\varepsilon }\), and therefore the log-likelihood, straightforward. However, as in the standard SF model, alternative distributional assumptions have been proposed: Papadopoulos (2015) derives a closed form for \(f_{\varepsilon }\) when ui and wi follow half-normal distributions, and Tsionas (2012) estimates the model assuming that they follow gamma distributions via inverse fast Fourier transform of the characteristic function, as discussed in Sect. 3.
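
The composed error in (31) can also be handled numerically. The sketch below maximises a simulated likelihood that averages the normal noise density over fixed exponential draws for ui and wi; this is a brute-force illustration on simulated data, not the closed-form likelihood of Polachek and Yoon (1987), and all names and starting values are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, R = 500, 300
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (x @ np.array([1.0, 0.4])
     + rng.normal(scale=0.2, size=n)
     - rng.exponential(scale=0.3, size=n)    # u_i, mean sigma_u
     + rng.exponential(scale=0.2, size=n))   # w_i, mean sigma_w

e_u = rng.exponential(size=(1, R))   # unit-mean draws, fixed across iterations
e_w = rng.exponential(size=(1, R))

def neg_simulated_loglik(theta):
    beta = theta[:2]
    sv, su, sw = np.exp(theta[2:])
    eps = (y - x @ beta)[:, None]
    # f_eps(eps) ~= mean over draws of f_v(eps + su*e_u - sw*e_w)
    f_eps = norm.pdf(eps + su * e_u - sw * e_w, scale=sv).mean(axis=1)
    return -np.sum(np.log(f_eps + 1e-300))

res = minimize(neg_simulated_loglik, x0=np.array([0.0, 0.0, -1.5, -1.5, -1.5]),
               method="Nelder-Mead", options={"maxiter": 8000})
print("beta:", res.x[:2].round(3),
      "sigma_v, sigma_u, sigma_w:", np.exp(res.x[2:]).round(3))
```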

In general, developments of the 2TSF model have tended to parallel those of the standard SF model. A panel data 2TSF model was proposed by Polachek and Yoon (1996), in which

$$\varepsilon_{ift} = v_{ift} - u_{it} + w_{ft}$$
(32)

where the subscript f denotes the firm. The employee incomplete information component uit and the employer incomplete information component wft, which is assumed to be constant across all employees, are further decomposed such that \(u_{it} = u_{i} + u_{it}^{*}\) and \(w_{ft} = w_{f} + w_{ft}^{*}\), where ui and wf are time-invariant fixed effects and \(u_{it}^{*}\) and \(w_{ft}^{*}\) follow independent exponential distributions. It is clear that many alternative panel data specifications could be proposed, particularly considering the numerous possible extensions of the models discussed in Sect. 4.

In addition, and analogous to the models discussed in Sect. 5, modelling of ui and wi in terms of vectors of explanatory variables has been proposed. Assuming exponential ui and wi, Groot and Oosterbeek (1994) propose modelling the inverse signal-to-noise ratios \(\sigma_{v} /\sigma_{u}\) and \(\sigma_{v} /\sigma_{w}\) as linear functions of vectors zui and zwi. This specification introduces heteroskedasticity into each of the error components, but in a rather odd way, and is problematic in that it does not restrict \(\sigma_{u}\) or \(\sigma_{w}\) to be positive. This issue is resolved by Kumbhakar and Parmeter (2010), who propose a specification in which \(\sigma_{ui} = \exp \left( {z_{ui} d_{u} } \right)\) and \(\sigma_{wi} = \exp \left( {z_{wi} d_{w} } \right)\). Note that this model has the scaling property. Parmeter (2018) proposes estimating a 2TSF model with the scaling property, avoiding explicit distributional assumptions, by non-linear least squares.

Finally, tying back to the previous section, Blanco (2017) proposes an extension of the basic Polachek and Yoon (1987) model to account for sample selection, assuming that the symmetric error components follow a bivariate normal distribution, while the one-sided errors follow independent univariate exponential distributions.

9 Conclusion

The methodological literature on SFA has developed considerably since the first SF models were developed by Aigner et al. (1977) and Meeusen and van Den Broeck (1977). The defining feature of SF models is the focus on obtaining observation-specific predictions of inefficiency. This in turn requires prediction of an inefficiency error term which is present in tandem with a noise error. Hence, there is a deconvolution problem associated with the error in the model. As such, distributional assumptions are not just required to obtain ‘best’ estimates of the underlying frontier relationship (cost frontier, production frontier, etc.), but are also essential for enabling appropriate predictions of the quantity of interest: firm inefficiency.

This review has considered numerous ways in which SFA has been innovated, which in turn have involved the use of differing distributional forms. One strand of literature concerns alternative distributional assumptions for the inefficiency error term and, more recently, the noise error term. This raises the obvious question as to which to choose. Given that economic theory only requires the inefficiency error to be one-sided, it is generally an empirical matter as to which is to be preferred. Formulations which nest other forms as special cases have obvious appeal; however, there are also non-nested tests, such as those developed by Wang et al. (2011), to aid selection.

Another strand of literature considers alternative distributions in the presence of specific empirical issues. The ‘wrong-skew’ problem is a good example, where it is entirely plausible that inefficiency could be found to have skewness counter to the direction imposed by the use of the common half-normal, truncated-normal, or exponential inefficiency distributions. Without a change to the distributional assumptions, the model estimation would indicate no evidence of inefficiency, which is often difficult to justify in the context of knowledge and other available evidence of the performance of the industries to which these techniques are applied.

Other innovations include models for sample selection, the presence of efficient firms and two-tier SF models. Panel data is a data structure which greatly increases the scope of modelling possibilities. It potentially allows for construction of predictors of inefficiency without appeal to ‘full’ distributional assumptions on the noise and inefficiency (instead only requiring moment assumptions), by exploiting time persistency in inefficiency. Alternatively, full parametric approaches can be adopted, with the benefit of being able to obtain separate predictions for inefficiency—which may have both time-invariant and time-varying components—and time-invariant unobserved heterogeneity.

Finally, a strand of literature has developed characterising heteroskedasticity in the error components. This is of particular interest as it allows for quantification of the determinants of inefficiency, which is important in beginning to explain why there is a performance gap for a firm, in addition to providing a prediction of the size of such a gap. This, in turn, can be used by stakeholders to guide the implementation of performance improvements.

Overall it is misleading to think of SFA as representing a single approach to efficiency analysis. Instead, SFA characterises a broad set of models, where different approaches will be relevant given the empirical context. The limited scope of this review has excluded several topics such as nonparametric SF models, Bayesian SF models, metafrontiers, and estimation of distance functions. Inefficiency is an unobserved error component, and so by definition, the predictor of such an error will be sensitive to distributional assumptions regarding inefficiency and the other unobserved error components, such as noise and unobserved heterogeneity. Thus, the conclusion is that for any given empirical application of efficiency analysis, several SFA models will need to be considered in order to establish the sensitivity of the efficiency predictions to the distributional assumptions adopted. This review should provide a useful starting point for such an exercise.