1 Introduction

In the basic stochastic frontier model of Aigner et al. (1977) and Meeusen and van den Broeck (1977), all firms are inefficient to some degree. The one-sided error that represents technical inefficiency has a distribution (for example, half normal) for which zero is in the support, so that zero is a possible value, but the probability that a draw from a half normal exactly equals zero is still zero. This may be restrictive empirically, since it is plausible, or at least possible, that an industry may contain a set of firms that are fully efficient.

In this paper we allow the possibility that some firms are fully efficient. We introduce a parameter p which represents the probability that a firm is fully efficient. So the case of p = 0 corresponds to the usual stochastic frontier model and the case of p = 1 corresponds to the case of full efficiency (no one-sided error), while if 0 < p < 1 a fraction p of the firms are fully efficient and a fraction 1 − p are inefficient. This may be important because if some of the firms actually are fully efficient, the usual stochastic frontier model is misspecified and can be expected to yield biased estimates of the technology and of firms’ inefficiency levels.

This model is a special form of the latent class model considered by Caudill (2003), Orea and Kumbhakar (2004), Greene (2005) and others. It has the special feature that the frontier itself does not vary across the two classes of firms; only the existence or non-existence of inefficiency differs. Our model has previously been considered by Kumbhakar, Parmeter, and Tsionas (2013), hereafter KPT. See also Grassetti (2011). Our results were derived without knowledge of the KPT paper, but in this paper we will naturally focus on our results which are not in their paper.

The plan of the paper is as follows. In Sect. 2 we will present the model and give a brief summary of the basic results that are also in the KPT paper. These include the likelihood to be maximized, the form of the posterior probabilities of full efficiency for each firm, and the expression for the estimated inefficiencies for each firm. In Sect. 3 we provide some new results. We discuss identification issues. We give the generalization of the results of Waldman (1982), which establish that there is a stationary point of the likelihood at a point of full efficiency and that this point is a local maximum of the likelihood if the OLS residuals are positively skewed. We propose using logit or probit models to allow additional explanatory variables to affect the probability of a firm being fully efficient. We also discuss the problem of testing the hypothesis that p = 0. In Sect. 4 we present some simulations, and in Sect. 5 we give an empirical example. Finally, Sect. 6 gives our conclusions.

Since this paper is in a volume in honor of Lennart Hjalmarsson, it is appropriate to comment on Lennart’s contributions to efficiency and productivity analysis, and to ask whether there is any link between his work and this paper. Lennart was of course an extremely productive scholar and one of very broad interests. His most remarkable paper is arguably Førsund and Hjalmarsson (1974), “On the Measurement of Productive Efficiency”. This paper laid out in careful terms what it is that we seek to measure, several years before the first non-stone-age tools were developed to measure it. It is harder to find links between his work and this paper, because this is an SFA paper and he was at heart a DEA person. He did some influential work with Almas Heshmati and Subal Kumbhakar comparing the results from different models, including SFA and DEA, but when it came to actual applications he generally picked DEA. His applications were amazingly diverse, in terms of industry if not geographical location: Swedish banking, Swedish dairy plants, Swedish electrical distribution, Swedish social insurance offices, Swedish dairy farms, Swedish cement plants, and the Swedish pork industry, to give a partial list. He did occasionally peek at data from the rest of the world, for example banks of other Nordic countries or Colombian cement plants, but this was rare. The best specific link between his work and this paper is that in some of his applications he took interest in the percentage of firms that were 100 % efficient, which of course occurs naturally in DEA, or in the percentage of output produced by firms that were 100 % efficient. For example, in Berg et al. (1993), he commented on the fact that the percentage of loans produced by banks that are 100 % efficient was 44 % in Finland, 52 % in Norway and 72 % in Sweden. So we would like to think that he would have regarded the topic of this paper as interesting.

2 The model and basic results

We begin with the standard stochastic frontier model of the form:

$$ y_{i}={{\bf x}}_{i}^{\prime }\beta +\varepsilon_{i},\quad\varepsilon _{i}=v_{i}-u_{i},\quad u_{i}\geq0. $$
(1)

Here \(i = 1,\ldots, n\) indexes firms. We have in mind a production frontier so that y is typically log output and x is a vector of functions of inputs. The \(v_{i}\) are iid \(N\left(0,\sigma_{v}^{2}\right),\) the \(u_{i}\) are iid \(N^{+}\left( 0,\sigma_{u}^{2}\right)\) (i.e., half-normal), and x, v, and u are mutually independent (so x can be treated as fixed). We will refer to this model as the basic stochastic frontier (or basic SF) model.

We now define some standard notation. Let ϕ be the standard normal density, and \(\Upphi\) be the standard normal cdf. Let \(f_{v}\) and \(f_{u}\) represent the densities of v and u:

$$ \begin{aligned} f_{v}\left(v\right)&={\frac{1}{\sqrt{2\pi}\sigma_{v}}}\exp \left(-{\frac{v^{2}}{2\sigma_{v}^{2}}}\right)={\frac{1} {\sigma_{v}}}\phi\left({\frac{v}{\sigma_{v}}}\right), \\ f_{u}\left(u\right)&={\frac{2}{\sqrt{2\pi}\sigma_{u}}}\exp \left(-{\frac{u^{2}}{2\sigma_{u}^{2}}}\right)={\frac{2}{\sigma_{u}}}\phi \left({\frac{u}{\sigma_{u}}}\right),\quad u\geq 0. \\ \end{aligned} $$
(2)

Also define \(\lambda=\sigma_{u}/\sigma_{v}\) and \(\sigma^{2}=\sigma_{u}^{2}+\sigma_{v}^{2}.\) This implies that \(\sigma_{v}^{2}=\sigma^{2}/(1+\lambda^{2})\) and \(\sigma_{u}^{2}=\sigma^{2}\lambda^{2}/(1+\lambda^{2}).\) Finally, we let \(f_{\varepsilon}\) represent the density of \(\varepsilon=v-u:\)

$$ f_{\varepsilon}\left(\varepsilon\right)= {\frac{2}{\sigma}}\phi \left({\frac{\varepsilon} {\sigma}}\right)\left[1-\Upphi\left({\frac{\varepsilon\lambda} {\sigma}}\right)\right].$$
(3)

Now we define the model of this paper. Suppose there is an unobservable variable z i such that

$$ z_{i}=1\left(u_{i}=0\right)= \left\{\begin{array}{ll}1 &\hbox{if } u_{i}=0 \\ 0 &\hbox{if } u_{i}>0.\end{array}\right.$$

Define \(p=P\left(z_{i}=1\right)=P\left(u_{i}=0\right).\) We assume that \(u_{i}|z_{i}=0\) is distributed as \(N^{+}(0,\sigma_{u}^{2}),\) that is, half normal. Thus

$$ u_{i}=\left\{\begin{array}{ll}0&\hbox{ with probability}\,p \\ N^{+}\left(0,\sigma_{u}^{2}\right)&\hbox{ with probability}\,1-p.\end{array}\right. $$

This model contains the parameters \(\beta, \sigma_{u}^{2}, \sigma_{v}^{2},\) and p, or equivalently \(\beta, \lambda, \sigma^{2},\) and p.

We will follow the terminology of KPT and call this model the “zero-inefficiency stochastic frontier” (ZISF) model. The name refers to the fact that, in this model, the event \(u_{i}=0\) can occur with non-zero frequency. Note that

$$ \begin{aligned} f\left(\varepsilon\vert z=1\right)&=f_{v}\left(\varepsilon\right), \\ f\left(\varepsilon\vert z=0\right)&=f_{\varepsilon}\left(\varepsilon\right), \\ \end{aligned} $$

(where \(f_{v}\) and \(f_{\varepsilon}\) are defined in (2) and (3) above) and so the marginal (unconditional) density of \(\varepsilon\) is

$$ f_{p}\left(\varepsilon\right)=pf_{v}\left(\varepsilon\right) +\left(1-p\right)f_{\varepsilon}\left(\varepsilon\right). $$
(4)

Using this density, we can form the (log) likelihood for the model:

$$ \ln{L\left(\beta, \sigma_{u}^{2}, \sigma_{v}^{2},p\right)} =\sum\limits_{i=1}^{n}\ln{f_{p}\left(y_{i}-{{\bf x}}_{i}^{\prime}\beta\right)}. $$
(5)

We will estimate the model by MLE; that is, by maximizing \(\ln{L}\) with respect to \(\beta, \sigma_{u}^{2}, \sigma_{v}^{2},\) and p. Alternatively, the model may be parameterized in terms of \(\beta, \lambda, \sigma^{2},\) and p, with maximization over that set of parameters.
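As a concrete illustration, here is a minimal sketch (in Python; the function and variable names are ours and nothing below is taken from KPT's code) of the mixture density (4) and the log likelihood (5) under the \((\beta, \sigma_{u}, \sigma_{v}, p)\) parameterization.

```python
import numpy as np
from scipy.stats import norm

def zisf_loglik(params, y, X):
    """Log likelihood (5) of the ZISF model; params = (beta, sigma_u, sigma_v, p)."""
    k = X.shape[1]
    beta, sigma_u, sigma_v, p = params[:k], params[k], params[k + 1], params[k + 2]
    eps = y - X @ beta                                   # composed residuals
    sigma = np.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    f_v = norm.pdf(eps, scale=sigma_v)                   # density of v, Eq. (2)
    f_eps = (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-eps * lam / sigma)  # Eq. (3)
    f_p = p * f_v + (1.0 - p) * f_eps                    # mixture density, Eq. (4)
    return np.sum(np.log(f_p))
```

In practice the negative of this function would be handed to a numerical optimizer (for example, scipy.optimize.minimize), restricting \(\sigma_{u},\sigma_{v}>0\) and \(0\leq p\leq1.\)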

When we have estimated the model, we can obtain \(\hat{\varepsilon}_{i}=y_{i}-{{\bf x}}_{i}^{\prime }\hat\beta,\) an estimate of \(\varepsilon_{i}=y_{i}-{{\bf x}}_{i}^{\prime }\beta.\) Using Bayes rule, we can now update the probability that a particular firm is fully efficient, because \(\varepsilon_{i}\) is informative about that possibility. That is, we can calculate

$$ P\left(z_{i}=1\vert \varepsilon_{i}\right)={\frac{P\left(z_{i}=1\right)\;f\left(\varepsilon_{i}\vert z_{i}=1\right)} {f_{p}\left(\varepsilon_{i}\right)}}={\frac{pf_{v}\left(\varepsilon_{i}\right)} {f_{p}\left(\varepsilon_{i}\right)}}={\frac{pf_{v}\left(\varepsilon_{i}\right)} {pf_{v}\left(\varepsilon_{i}\right)+(1-p)f_{\varepsilon} \left(\varepsilon_{i}\right)}}. $$
(6)

We will call this the “posterior” probability that firm i is fully efficient. It is evaluated at \(\hat{p}, \hat{\varepsilon}_{i}\) and also \(\hat{\sigma}_{u}^{2}\) and \(\hat{\sigma}_{v}^{2},\) which enter into the densities \(f_{v}\) and \(f_{\varepsilon}.\) We put quotes around "posterior" because it is not truly the posterior probability of \(z_{i}=1\) in a Bayesian sense. (A true Bayesian posterior would give \(P\left(z_{i}=1\vert y_{i},x_{i}\right)\) and would have started with a prior distribution for the parameters \(\beta, \sigma_{u}^{2}, \sigma_{v}^{2},\) and p.)

We now wish to estimate (predict) u i for each firm. Following the logic of Jondrow, Lovell, Materov and Schmidt (1982) (hereafter JLMS), we define \(\hat{u}_{i}=E\left(u_{i}\vert\varepsilon_{i}\right).\) Now

$$ \begin{aligned} E\left(u_{i}\vert\varepsilon_{i}\right)&=E_{z\vert\varepsilon}E\left(u_{i}\vert \varepsilon_{i}, z_{i}\right)\\ &=P\left(z_{i}=1\vert\varepsilon_{i}\right)E\left(u_{i}\vert \varepsilon_{i}, z_{i}=1\right)+P\left(z_{i}=0\vert\varepsilon_{i}\right)E\left(u_{i}\vert \varepsilon_{i}, z_{i}=0\right) \\ &=P\left(z_{i}=0\vert\varepsilon_{i}\right)E\left(u_{i}\vert \varepsilon_{i}, z_{i}=0\right) \\ \end{aligned} $$
(7)

since \(u_{i}\equiv 0\) when \(z_{i}=1.\) But \(E\left(u_{i}\vert\varepsilon_{i}, z_{i}=0\right)\) is the usual expression from JLMS, and \(P\left(z_{i}=0\vert\varepsilon_{i}\right)=1-P\left(z_{i} =1\vert\varepsilon_{i}\right),\) which can be evaluated using Eq. (6) above. Therefore,

$$ \hat{u}_{i}=E\left(u_{i}\vert\varepsilon_{i}\right) ={\frac{\left(1-p\right)f_{\varepsilon}\left(\varepsilon_{i}\right)} {pf_{v}\left(\varepsilon_{i}\right)+\left(1-p\right)f_{\varepsilon} \left(\varepsilon_{i}\right)}}\times\sigma_{\ast}\left[{\frac{\phi\left(a_{i}\right)} {1-\Upphi\left(a_{i}\right)}}-a_{i}\right], $$
(8)

where \(a_{i}=\varepsilon_{i}{\lambda}/{\sigma}\) and \( \sigma_{\ast}={\sigma_{u}\sigma_{v}}/{\sigma} ={\lambda\sigma}/{(1+\lambda^{2})}.\)

A slight extension of this result, which is not in KPT, is to follow Battese and Coelli (1988) and define technical efficiency as \(TE=\exp{\left(-u\right)}.\) Correspondingly technical inefficiency would be \(1-TE=1-\exp{\left(-u\right)},\) which is only approximately equal to u (for small u). They provide the expression for \(E\left(TE\vert\varepsilon\right).\) In our model the expression is a little more complicated. The equivalent of (7) is that

$$ E(e^{-u_{i}}\vert\varepsilon_{i})=P(z_{i}=1\vert\varepsilon_{i})E(e^{-u_{i}}\vert \varepsilon_{i},z_{i}=1)+P(z_{i}=0\vert\varepsilon_{i})E(e^{-u_{i}}\vert \varepsilon_{i},z_{i}=0). $$

But now \(E(e^{-u_{i}}\vert \varepsilon_{i},z_{i}=1)=1\) and so both terms are non-zero. This leads to the expression

$$ \begin{aligned} \widehat{TE}_{i}&=E\left(e^{-u_{i}}\vert\varepsilon_{i}\right) \\ &={\frac{\left(1-p\right)f_{\varepsilon}\left(\varepsilon_{i}\right)} {pf_{v}\left(\varepsilon_{i}\right)+\left(1-p\right)f_{\varepsilon} \left(\varepsilon_{i}\right)}}\times{\frac{\Upphi\left({\frac{\mu_{i}^{\ast}} {\sigma_{\ast}}}-\sigma_{\ast}\right)} {\Upphi\left({\frac{\mu_{i}^{\ast}} {\sigma_{\ast}}}\right)}}\exp\left({\frac{\sigma_{\ast}^{2}} {2}}-\mu_{i}^{\ast}\right)+{\frac{pf_{v}\left(\varepsilon_{i}\right)} {pf_{v}\left(\varepsilon_{i}\right)+\left(1-p\right)f_{\varepsilon} \left(\varepsilon_{i}\right)}},\\ \end{aligned} $$
(9)

where \(\mu_{i}^{\ast}=-\varepsilon_{i}{\sigma_{u}^{2}}/{\sigma^{2}}, \sigma_{\ast}={\sigma_{u}\sigma_{v}}/{\sigma}\) (as above), and correspondingly \({\mu_{i}^{\ast}}/{\sigma_{\ast}}=-a_{i}\) where \(a_{i}=\varepsilon_{i}{\lambda}/{\sigma}\) (as above).

Note that the expression for \(\hat{u}_{i}\) is just a simple scaling of the JLMS expression. However, for \(\widehat{TE_{i}},\) this is not the case. \(\widehat{TE_{i}}\) is a scaling of the Battese–Coelli estimate, plus a non-zero additive term reflecting \(P(z_{i}=1\vert\varepsilon_{i})E(e^{-u_{i}}\vert \varepsilon_{i},z_{i}=1)=P(z_{i}=1\vert\varepsilon_{i}).\) As in Jondrow et al. (1982), the expression in either (8) or (9) would need to be evaluated at the estimated values of the parameters \((p, \sigma_{u}^{2},\) and \(\sigma_{v}^{2})\) and at \(\hat{\varepsilon}_{i}=y_{i}-{{\bf x}}_{i}^{\prime}\hat\beta.\)
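A sketch of the corresponding calculations for the “posterior” probability (6), the inefficiency estimate (8), and the efficiency estimate (9), again using our own function names and with estimated parameter values supplied by the user:

```python
import numpy as np
from scipy.stats import norm

def zisf_predictions(eps, sigma_u, sigma_v, p):
    """Return P(z=1|eps) from (6), u-hat from (8), and TE-hat from (9)."""
    sigma = np.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    sigma_star = sigma_u * sigma_v / sigma
    a = eps * lam / sigma
    f_v = norm.pdf(eps, scale=sigma_v)
    f_eps = (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-a)
    post1 = p * f_v / (p * f_v + (1.0 - p) * f_eps)      # P(z_i = 1 | eps_i), Eq. (6)
    post0 = 1.0 - post1
    jlms = sigma_star * (norm.pdf(a) / norm.cdf(-a) - a) # E(u | eps, z = 0), JLMS
    u_hat = post0 * jlms                                 # Eq. (8)
    mu_star = -eps * sigma_u**2 / sigma**2
    bc = (norm.cdf(mu_star / sigma_star - sigma_star)
          / norm.cdf(mu_star / sigma_star)) * np.exp(sigma_star**2 / 2.0 - mu_star)
    te_hat = post0 * bc + post1                          # Eq. (9)
    return post1, u_hat, te_hat
```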

In this paper we have maximized the likelihood by a direct optimization (numerical search) with respect to all of the parameters. An alternative would be the EM algorithm, which is applicable when the model can be viewed as one with missing data. In our case the missing data are the \(z_{i}\) that indicate whether or not the firm is fully efficient. If we knew the \(z_{i},\) the resulting likelihood (the “complete data likelihood”) would be simple. The EM algorithm alternates between two steps. The first (E) step is to replace the \(z_{i}\) by their expectations given the observed data and the tentative parameter values. These expectations are just the posterior probabilities given in (6) above. Then, taking these values as given, the second (M) step is to maximize the likelihood with respect to the remaining parameters. Also the value of p is updated as the average of the posterior probabilities. This procedure is continued until convergence. Greene (2012, pp. 1104–1106) gives computational details for the EM algorithm for a latent class model which is very similar to our model. The conventional wisdom [e.g. Greene (2012, pp. 1104–1106)] is that the EM algorithm can be very slow (many iterations until convergence) but that it is numerically very stable. In particular it is guaranteed that each iteration raises the likelihood value. We did not encounter any serious computational issues in estimating the model, so in our view the choice of method is mostly a matter of personal taste and/or software availability.
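A stylized sketch of such an EM loop for the ZISF model follows; the implementation choices (log parameterization of the standard deviations, Nelder–Mead in the M step) are ours and are not taken from Greene (2012) or KPT.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def densities(eps, sigma_u, sigma_v):
    """f_v and f_eps from Eqs. (2) and (3)."""
    sigma = np.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    f_v = norm.pdf(eps, scale=sigma_v)
    f_eps = (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-eps * lam / sigma)
    return f_v, f_eps

def em_zisf(y, X, beta, sigma_u, sigma_v, p, max_iter=1000, tol=1e-8):
    for _ in range(max_iter):
        # E step: posterior probabilities of full efficiency, Eq. (6)
        f_v, f_eps = densities(y - X @ beta, sigma_u, sigma_v)
        w = p * f_v / (p * f_v + (1.0 - p) * f_eps)

        # M step: maximize the expected complete-data log likelihood over
        # (beta, log sigma_u, log sigma_v), with the weights w held fixed
        def neg_q(theta):
            b, su, sv = theta[:-2], np.exp(theta[-2]), np.exp(theta[-1])
            fv, fe = densities(y - X @ b, su, sv)
            return -np.sum(w * np.log(fv) + (1.0 - w) * np.log(fe))

        res = minimize(neg_q, np.r_[beta, np.log(sigma_u), np.log(sigma_v)],
                       method="Nelder-Mead")
        beta_new = res.x[:-2]
        su_new, sv_new = np.exp(res.x[-2]), np.exp(res.x[-1])
        p_new = w.mean()                                  # update p as the mean posterior

        change = np.max(np.abs(np.r_[beta_new - beta, su_new - sigma_u,
                                     sv_new - sigma_v, p_new - p]))
        beta, sigma_u, sigma_v, p = beta_new, su_new, sv_new, p_new
        if change < tol:
            break
    return beta, sigma_u, sigma_v, p
```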

3 Extensions of the basic model

We now investigate some extensions of the basic results of the previous section. Most of the results in this section are not in KPT.

3.1 Identification issues

Some of the parameters are not identified under certain circumstances. When p = 1, so that all firms are fully efficient, \(\sigma_{u}^{2}\) is not identified. Conversely, when \(\sigma_{u}^{2}=0,\) p is not identified. In fact, the likelihood value is exactly the same when (i) \(\sigma_{u}^{2}=0\) and p is anything as when (ii) p = 1 and \(\sigma_{u}^{2}\) is anything. More generally, we might suppose that \(\sigma_{u}^{2}\) and p will be estimated imprecisely when a data set contains little inefficiency, since it will be hard to determine whether there is little inefficiency because \(\sigma_{u}^{2}\) is small or because p is close to one.

This issue of identification is relevant to the problem of testing the null hypothesis that p = 1 against the alternative that p < 1. This is a test of the null hypothesis that all firms are efficient against the alternative that some fraction (possibly all) of them are inefficient, and that is an economically interesting hypothesis. KPT suggest a likelihood ratio test of this hypothesis. As they note, the null distribution of their statistic is affected by the fact that the null hypothesis is on the boundary of the parameter space. They refer to Chen and Liang (2010, Case 2, p. 608) to justify an asymptotic distribution of \(\tfrac{1}{2}\chi_{0}^{2}+\tfrac{1}{2}\chi_{1}^{2}\) for the likelihood ratio statistic. However, it is not clear that this result applies, given that one of the parameters \((\sigma_{u}^{2})\) is not identified under the null that p = 1. Specifically, the argument of Chen and Liang (2010) depends on the existence and asymptotic normality of the estimator \(\hat{\eta}(\gamma_{0})\) (see p. 606, line 4), where \(\gamma_{0}\) corresponds to \(p_{0}\) (= 1), and where η corresponds to the other parameters of our model, including \(\sigma_{u}^{2}.\)

A more relevant reference, which KPT note but do not pursue, is Andrews (2001). This paper explicitly allows the case in which the parameter vector under the null may lie on the boundary of the maintained hypothesis and there may be a nuisance parameter that appears under the alternative hypothesis, but not under the null. See his Theorem 4, p. 707, for the relevant asymptotic distribution result, which unfortunately is considerably more complicated than the simple result (50–50 mixture of chi-squareds) of Chen and Liang (2010).

3.2 A stationary point for the likelihood

For the basic stochastic frontier model, let the parameter vector be \(\theta=(\beta^{\prime},\lambda,\sigma^{2})^{\prime}.\) Then Waldman (1982) established the following results. First, the log likelihood always has a stationary point at \(\theta^{*}=(\hat{\beta}^{\prime}, 0, \hat{\sigma}^{2})^{\prime},\) where \(\hat\beta=\hbox{OLS }\) and \(\hat{\sigma}^{2}=(\hbox{OLS sum of squared residuals})/n.\) Note that these parameter values correspond to \(\hat{\sigma}_{u}^{2}=0,\) that is, to full efficiency of each firm. Second, the Hessian matrix is singular at this point. It is negative semi-definite with one zero eigenvalue. Third, these parameter values are a local maximizer of the log likelihood if the OLS residuals are positively skewed. This is the so-called “wrong skew problem”.

The log likelihood for the ZISF model has a stationary point very similar to that for the basic stochastic frontier model. This stationary point is also a local maximum of the log likelihood if the least squares residuals are positively skewed.

Theorem 1

Let \(\theta=(\beta^{\prime},\lambda,\sigma^{2},p)^{\prime}\) and let \(\theta^{**}=(\hat\beta^{\prime},0,\hat{\sigma}^{2},\hat{p})^{\prime},\) where \(\hat\beta=\hbox{OLS}, \hat{\sigma}^{2}=(\hbox{OLS sum of squared residuals})/n,\) and where \(\hat{p}\) is any value in [0,1]. Then

  1. \(\theta^{**}\) is a stationary point of the log likelihood.

  2. The Hessian matrix is singular at this point. It is negative semi-definite with two zero eigenvalues.

  3. \(\theta^{**}\) with \(\hat p\in\left[0,1\right)\) is a local maximizer of the log likelihood function if and only if \(\sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{3}>0,\) where \(\hat{\varepsilon}_{i}=y_{i}-{{\bf x}}_{i}^{\prime}\hat\beta\) is the OLS residual.

  4. \(\theta^{**}\) with \(\hat p=1\) is a local maximizer of the log likelihood function if \(\sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{3}>0.\)

Proof

See Appendix. \(\square\)

As is typically done for the basic stochastic frontier model, we will presume that \(\theta^{**}\) is the global maximizer of the log likelihood when the residuals have positive ("wrong") skew. Note that at \(\theta^{**},\) we have \(\hat{\lambda}=0\) or equivalently \(\hat{\sigma}_{u}^{2}=0,\) and p is not identified when \(\sigma_{u}^{2}=0.\) We get the same likelihood value for any value of p. In our simulations (in Sect. 4) we will set \(\hat{p}=1\) in the case of wrong skew, since \(\hat{p}=1\) is another way of reflecting full efficiency. However, for a given data set, the value of \(\hat{p}\) does not matter when \(\theta=\theta^{**}.\)

Since \(\hbox{plim}\left(\left(1/n\right)\sum\hat{\varepsilon}_{i}^{3}\right) =E\left(\varepsilon_{i}-E\left(\varepsilon_{i}\right)\right)^3 =\sigma_{u}^{3}\sqrt{2/\pi}\left(1-p\right)\left(-4p^{2}+(8-3\pi)p +\pi-4\right)/\pi\leq0\) for any \(p\in[0,1],\) as the number of observations increases, the probability of a positive third moment of the OLS residuals goes to zero asymptotically. In a finite sample, the probability of a positive third moment increases when λ is small and/or p is near 0 or 1. See Table 1 below. The entries in Table 1 are based on simulations with 100,000 replications, with \(\sigma_{u}=1, \lambda=\sigma_{u}/\sigma_{v}, \lambda\in\left\{0.5,1,2\right\},\) and \(p\in\left\{0,0.1,\ldots,0.9\right\},\) for sample sizes 50, 100, 200, and 400.

Table 1 Frequency of a positive third moment of the OLS residuals
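To illustrate how such frequencies can be computed, here is a small Monte Carlo sketch; it assumes an intercept-only regression (so the OLS residuals are simply the demeaned values of \(\varepsilon_{i}\)), and the design point and number of replications shown are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def wrong_skew_freq(n, lam, p, reps=10_000, sigma_u=1.0):
    """Frequency of a positive third moment of the OLS residuals."""
    sigma_v = sigma_u / lam
    count = 0
    for _ in range(reps):
        u = np.abs(rng.normal(0.0, sigma_u, n))          # half-normal inefficiency draws
        u[rng.random(n) < p] = 0.0                       # fully efficient firms
        eps = rng.normal(0.0, sigma_v, n) - u
        e = eps - eps.mean()                             # OLS residuals, intercept only
        count += (e**3).sum() > 0
    return count / reps

print(wrong_skew_freq(n=100, lam=1.0, p=0.5))
```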

3.3 Models for the distribution of \(u_{i}\)

The ZISF model can be extended by allowing the distribution of \(u_{i}\) to depend on some observable variables \(w_{i}.\) For example, in our empirical analysis of Sect. 5, the \(w_{i}\) will include variables like the age and education of the farmer and the size of his household. These variables can be assumed to affect either \(P(z_{i}=1)\) or \(f(u_{i}\vert z_{i}=0)\) or both.

First consider the case in which we assume that \(w_{i}\) affects the distribution of \(u_{i}\) for the inefficient firms. A general assumption would be that the distribution of \(u_{i}\) conditional on \(w_{i}\) and on \(z_{i}=0\) is \(N^{+}(\mu_{i},\sigma_{i}^{2}),\) where \(\mu_{i}\) and/or \(\sigma_{i}^{2}\) depend on \(w_{i}.\) For example, in Sect. 5 we will assume the RSCFG model of Reifschneider and Stevenson (1991), Caudill and Ford (1993) and Caudill et al. (1995), under the specific assumptions that \(\mu_{i}=0\) and \(\sigma_{i}^{2}=\exp{(w_{i}^{\prime}\gamma)}.\) Another possible model is the KGMHLBC model of Kumbhakar et al. (1991), Huang and Liu (1994) and Battese and Coelli (1995), with \(\sigma_{i}^{2}=\sigma_{u}^{2}\) constant and with \(\mu_{i}=w_{i}^{\prime}\psi\) or \(\mu_{i}=c\exp{(w_{i}^{\prime}\psi)}.\) Wang (2002) proposes parameterizing both \(\mu_{i}\) and \(\sigma_{i}^{2}.\) See also Alvarez et al. (2006).

A second case is the one in which we assume that \(w_{i}\) affects \(P(z_{i}=1).\) For example, we could assume a logit model:

$$ P\left(z_{i}=1\vert w_{i}\right)={\frac{\exp{(w_{i}^{\prime}\delta)}}{1+\exp{(w_{i}^{\prime}\delta)}}}. $$

KPT, p. 68, make the same suggestion. A probit model would be another obvious possibility.

Finally, we can consider a more general model in which both \(P(z_{i}=1\vert w_{i})\) and \(f(u_{i}\vert z_{i}=0, w_{i})\) depend on \(w_{i},\) as above. We will estimate such a model in our empirical section.
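As an illustration of how such covariates would enter the estimation, here is a sketch of the log likelihood when \(P(z_{i}=1\vert w_{i})\) follows the logit model above; the names are ours, and W is assumed to be a matrix of the \(w_{i}\) including a constant column.

```python
import numpy as np
from scipy.stats import norm

def zisf_logit_loglik(beta, sigma_u, sigma_v, delta, y, X, W):
    """Log likelihood with a logit model for the probability of full efficiency."""
    eps = y - X @ beta
    sigma = np.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    p_i = 1.0 / (1.0 + np.exp(-(W @ delta)))             # P(z_i = 1 | w_i), logit link
    f_v = norm.pdf(eps, scale=sigma_v)
    f_eps = (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-eps * lam / sigma)
    return np.sum(np.log(p_i * f_v + (1.0 - p_i) * f_eps))
```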

3.4 Testing the hypothesis that p = 0

In this section, we discuss the problem of testing the null hypothesis \(H_{0}: p=0\) against the alternative \(H_{A}: p>0.\) The null hypothesis is that all firms are inefficient, so the basic stochastic frontier model applies. The alternative is that some firms are fully efficient and so the ZISF model is needed.

It is a standard result that, under certain regularity conditions, notably that the parameter value specified by the null hypothesis is an interior point of the parameter space, the likelihood ratio (LR), Lagrange multiplier (LM), and Wald tests all have the same asymptotic \(\chi^{2}\) distribution. However, in our case p cannot be negative, and therefore the null hypothesis that p = 0 lies on the boundary of the parameter space. This is therefore a non-standard problem. Unlike the case of testing the hypothesis that p = 1, however, there is no problem with the identification of the other parameters (nuisance parameters) \(\beta, \sigma_{u}^{2},\) and \(\sigma_{v}^{2},\) or \(\beta, \lambda,\) and \(\sigma^{2}.\) We need to restrict \(\sigma_{u}^{2}>0\) and \(\sigma_{v}^{2}>0\) so that the nuisance parameters are in the interior of the parameter space, and also because p would not be identified if \(\sigma_{u}^{2}=0.\) However, with these modest restrictions, this is only a mildly non-standard problem, which has been discussed by Rogers (1986), Self and Liang (1987), and Gouriéroux and Monfort (1995, chapter 21), for example.

We consider five test statistics: the likelihood ratio (LR), Wald, Lagrange multiplier (LM), modified Lagrange multiplier (modified LM), and Kuhn–Tucker (KT) tests. All of these except the LM test will have asymptotic distributions that are different from the usual \(\chi_{1}^{2}\) distribution.

We will assume that the log likelihood function \(\ln L_{n}(\theta)\) satisfies the usual conditions,

$$ \begin{aligned} {\frac{1}{\sqrt{n}}}{\frac{\partial \ln L_{n}(\theta_{0})}{\partial \theta}}\,\overset{d}{\rightarrow }\,& {\mathcal{N}}\left(0,{\mathcal{I}}_{0}\right), \\ {\frac{1}{n}}{\frac{\partial^{2}\ln L_{n}(\theta_{0})} {\partial\theta\partial\theta^{\prime}}} \,\overset{p}{\rightarrow }\,& {\mathcal{H}}_{0}=-{\mathcal{I}}_{0}, \\ \end{aligned} $$

where \(\theta=(\beta^{\prime},\sigma_{u}, \sigma_{v}, p)^{\prime},\) and the parameters other than p are away from the boundary of their parameter spaces. Define the restricted estimator \((\tilde\theta)\) and the unrestricted estimator \((\hat{\theta}){:}\)

$$ \begin{aligned}\tilde{\theta}&=\mathop{\hbox{argmax}}\limits_{\sigma_{u}\geq0,\sigma_{v}\geq0, p=0}\ln{L}_{n}(\theta), \\\hat{\theta}&=\mathop{\hbox{argmax}}\limits_{\sigma_{u}\geq0,\sigma_{v}\geq0, 0\leq p\leq 1}\ln{L}_{n}(\theta). \\ \end{aligned}$$

We also define \(l_{i}=\ln{f}(\varepsilon_{i}),\hat{s}_{i}=\partial l_{i}(\hat{\theta})/\partial\theta,\tilde{s}_{i}=\partial l_{i}(\tilde\theta)/\partial\theta,\hat{h}_{i}=\partial^{2}l_{i}(\hat{\theta})/\partial\theta\partial\theta^{\prime},\) and \(\tilde{h}_{i}=\partial^{2} l_{i}(\tilde\theta)/\partial\theta\partial\theta^{\prime}.\)

3.4.1 LR test

The LR statistic when testing \(H_{0}: p=0\) is \(\xi^{LR}=2(\ln{L}_{n}(\hat{\theta})-\ln{L}_{n}(\tilde{\theta})).\) Under standard regularity conditions, the asymptotic distribution of \(\xi^{LR}\) is a mixture of \(\chi_{0}^{2}\) and \(\chi_{1}^{2}\) with mixing weights 1/2, where \(\chi_{0}^{2}\) is defined as the point mass distribution at zero. That is, \( \xi^{LR}\overset{d}{\rightarrow} \tfrac{1}{2}\chi_{0}^{2}+ \tfrac{1}{2}\chi_{1}^{2}.\) This follows, for example, from Chen and Liang (2010), as cited by KPT.
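Because the \(\chi_{0}^{2}\) component is a point mass at zero, a p-value for a statistic with this limiting distribution is just one half of the \(\chi_{1}^{2}\) tail probability whenever the statistic is positive. A trivial sketch:

```python
from scipy.stats import chi2

def mixture_pvalue(stat):
    """p-value under the 50-50 mixture of chi^2_0 and chi^2_1."""
    return 0.5 * chi2.sf(stat, df=1) if stat > 0 else 1.0

print(mixture_pvalue(2.706))   # roughly 0.05, so 2.71 is the 5 % critical value
```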

3.4.2 Wald test

The Wald statistic for \(H_{0}: p=0\) is \( \xi^{W} = {\hat{p}^{2}}/{se\left(\hat{p}\right)^{2}}.\) Note that \(se(\hat{p})^2\) can be computed using the outer product of the score form of the variance matrix of \(\hat{\theta}, [(\sum\nolimits_{i=1}^{n}\hat{s}_{i}\hat{s}_{i}^{\prime})^{-1}],\) the Hessian form, \([(\sum\nolimits_{i=1}^{n}-\hat{h}_{i})^{-1}],\) or the Robust form, \([(\sum\nolimits_{i=1}^{n}-\hat{h}_{i})^{-1}(\sum\nolimits_{i=1}^{n}\hat{s}_{i} \hat{s}_{i}^{\prime})(\sum\nolimits_{i=1}^{n}-\hat{h}_{i})^{-1}].\) As with the LR statistic, \(\xi^{W}\overset{d}{\rightarrow} \tfrac{1}{2}\chi_{0}^{2}+ \tfrac{1}{2}\chi_{1}^{2}.\) Note that the non-standard nature of this result means that the “significance” of an estimated \(\hat{p}\) from the ZISF model cannot be assessed using standard results.
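For concreteness, a sketch of the three forms of the Wald statistic, assuming the user supplies the n × k matrix of per-observation score rows s and the k × k summed Hessian H, both evaluated at the unrestricted estimates; the function name and the p_index argument (the position of p in θ) are ours.

```python
import numpy as np

def wald_statistics(p_hat, s, H, p_index):
    """Wald statistic for H0: p = 0 under the OPG, Hessian, and Robust variance forms."""
    opg = np.linalg.inv(s.T @ s)                     # outer product of the scores
    hess = np.linalg.inv(-H)                         # Hessian form
    robust = hess @ (s.T @ s) @ hess                 # Robust (sandwich) form
    return {name: p_hat**2 / V[p_index, p_index]
            for name, V in [("OPG", opg), ("Hessian", hess), ("Robust", robust)]}
```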

3.4.3 LM test

The LM statistic for \(H_{0}: p=0\) is \( \xi^{LM}=(\sum\nolimits_{i=1}^{n}\tilde{s}_{i})^{\prime} \tilde{M}^{-1}(\sum\nolimits_{i=1}^{n}\tilde{s}_{i}).\) \(\tilde{M}\) can be either \([(\sum\nolimits_{i=1}^{n}\tilde{s}_{i}\tilde{s}_{i}^{\prime})]\) or \([(\sum\nolimits_{i=1}^{n}-\tilde{h}_{i})],\) in either case evaluated at \(\tilde\theta.\) Unlike the other statistics considered here, the LM statistic has the usual \(\chi_{1}^{2}\) distribution. It ignores the one-sided nature of the alternative, because it rejects for a large (in absolute value) positive or negative value of \(\sum\nolimits_{i=1}^{n}\tilde{s}_{i}.\) As pointed out by Rogers (1986), this may result in a loss of power relative to tests that take the one-sided nature of the alternative into account.

3.4.4 Modified LM test

The LM statistic has the usual \(\chi_{1}^{2}\) distribution because it does not take account of the one-sided nature of the alternative. A test that does take the one-sided nature of the alternative into account might have better power, and the modified LM statistic proposed by Rogers (1986) is motivated by this point. The modified LM statistic is:

$$ \xi^{{\rm modified}\, {\rm LM}} = \left\{\begin{array}{ll} \xi^{{\rm LM}},& \hbox{ if }\;\sum\nolimits_{i=1}^{n}\tilde{s}_{i}>0 \\ 0, & \hbox{otherwise}.\end{array}\right. $$

In the modified LM statistic, a positive score is taken as evidence against the null and in favor of the alternative p > 0, whereas a negative score is not. So a negative score is simply set to zero. The asymptotic distribution of \(\xi^{{\rm modified}\;{\rm LM}}\) is \(\tfrac{1}{2}\chi_{0}^{2}+\tfrac{1}{2}\chi_{1}^{2}.\)

3.4.5 KT test

Another variant of the score test statistic that takes account of the one-sided nature of the alternative is the KT (Kuhn–Tucker) statistic proposed by Gouriéroux et al. (1982). The KT statistic for \(H_{0}: p=0\) is:

$$ \xi^{KT}=\left(\sum\limits_{i=1}^{n}\tilde{s}_{i}-\sum\limits_{i=1}^{n} \hat{s}_{i}\right)^{\prime}\tilde{M}^{-1} \left(\sum\limits_{i=1}^{n}\tilde{s}_{i}-\sum\limits_{i=1}^{n}\hat{s}_{i}\right). $$

Here, \(\tilde{s}_{i}={\partial l_{i}(\tilde{\theta})}/{\partial\theta}\) and \(\hat{s}_{i}={\partial l_{i}(\hat{\theta})}/{\partial\theta},\) as given just before the beginning of Sect. 3.4.1. Also \(\tilde{M}\) can be either \(\sum\nolimits_{i=1}^{n}\tilde{s}_{i}\tilde{s}_{i}^{\prime}\) (OPG form) or \(\sum\nolimits_{i=1}^{n}(-\tilde{h}_{i})\) (Hessian form). When \(\hat{p}=0,\) which occurs with probability one half under the null, \(\sum\nolimits_{i=1}^{n}\tilde{s}_{i}=\sum\nolimits_{i=1}^{n}\hat{s}_{i},\) and the statistic will equal zero. Otherwise, when \(\hat{p}> 0, \sum\nolimits_{i=1}^{n}\hat{s}_{i}=0\) and the test statistic has the usual \(\chi_{1}^{2}\) distribution. Therefore, \(\xi^{KT}\overset{d}{\rightarrow} \tfrac{1}{2}\chi_{0}^{2}+ \tfrac{1}{2}\chi_{1}^{2}.\)
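A compact sketch of the three score-based statistics, assuming per-observation score rows evaluated at the restricted estimates (s_tilde) and at the unrestricted estimates (s_hat), and using the OPG form of \(\tilde{M};\) the names and the p_index argument are ours.

```python
import numpy as np

def score_tests(s_tilde, s_hat, p_index):
    """LM, modified LM, and KT statistics for H0: p = 0 (OPG form of M-tilde)."""
    g_tilde = s_tilde.sum(axis=0)                    # restricted score vector
    g_hat = s_hat.sum(axis=0)                        # unrestricted score vector
    M = s_tilde.T @ s_tilde                          # OPG form of M-tilde
    lm = g_tilde @ np.linalg.solve(M, g_tilde)
    # at the restricted MLE the score components other than the one for p are zero,
    # so the sign of the p component decides the modified LM statistic
    modified_lm = lm if g_tilde[p_index] > 0 else 0.0
    d = g_tilde - g_hat
    kt = d @ np.linalg.solve(M, d)                   # Kuhn-Tucker statistic
    return lm, modified_lm, kt
```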

3.4.6 The wrong skew problem, revisited

When the OLS residuals are positively skewed \((\sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{3}>0),\) we have \(\hat{\sigma}_{u}^{2}=0\) (or equivalently, \(\hat{\lambda}=0\)) and \(\hat{p}\) is not well defined. Also the information matrix, whether evaluated at \(\hat{\theta}\) or \(\tilde\theta,\) is singular. Specifically,

$$ \begin{aligned} \sum\limits_{i=1}^{n}\hat{s}_{i}\hat{s}_{i}^{\prime}&= \left( \begin{array}{cccc} {\frac{1}{\hat{\sigma}_{v}^{2}}}\sum\nolimits_{i=1}^{n} \hat{\varepsilon}_{i}^{2}x_{i}x_{i}^{\prime}& -\left(1-\hat{p}\right)\sqrt{{\frac{2}{\pi}}} {\frac{1}{\hat{\sigma}_{v}^{2}}}\sum\nolimits_{i=1}^{n} \hat{\varepsilon}_{i}^{2}x_{i}& {\frac{1}{\hat{\sigma}_{v}^{5}}}\sum\nolimits_{i=1}^{n} \hat{\varepsilon}_{i}^{3}x_{i}&0 \\ -\left(1-\hat{p}\right)\sqrt{{\frac{2}{\pi}}}{\frac{1}{\hat{\sigma}_{v}^{2}}} \sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{2}x_{i}^{\prime} &{\frac{2}{\pi}}\left(1-\hat{p}\right)^{2}{\frac{n} {\hat{\sigma}_{v}^{2}}}&-\left(1-\hat{p}\right)\sqrt{{\frac{2}{\pi}}} {\frac{1}{\hat{\sigma}_{v}^{5}}}\sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{3}&0 \\ {\frac{1}{\hat{\sigma}_{v}^{5}}}\sum\nolimits_{i=1}^{n} \hat{\varepsilon}_{i}^{3}x_{i}^{\prime}&-\left(1-\hat{p}\right) \sqrt{{\frac{2}{\pi}}}{\frac{1}{\hat{\sigma}_{v}^{5}}} \sum\nolimits_{i=1}^{n}\hat{\varepsilon}_{i}^{3} & {\frac{1}{\hat{\sigma}_{v}^{6}}}\sum\nolimits_{i=1}^{n} \hat{\varepsilon}_{i}^4-{\frac{n}{\hat{\sigma}_{v}^{2}}}&0 \\ 0&0&0&0 \\ \end{array}\right) \\ \sum\limits_{i=1}^{n}\tilde{s}_{i}\tilde{s}_{i}^{\prime}&= {\left( \begin{array}{cccc} {\frac{1}{\tilde\sigma_{v}^{2}}}\sum\nolimits_{i=1}^{n} \tilde{\varepsilon}_{i}^{2}x_{i}x_{i}^{\prime}& -\sqrt{{\frac{2}{\pi}}}{\frac{1}{\tilde{\sigma}_{v}^{2}}} \sum\nolimits_{i=1}^{n}\tilde{\varepsilon}_{i}^{2}x_{i}& {\frac{1}{\tilde\sigma_{v}^{5}}}\sum\nolimits_{i=1}^{n} \tilde{\varepsilon}_{i}^{3}x_{i}&0 \\ -\sqrt{{\frac{2}{\pi}}}{\frac{1}{\tilde\sigma_{v}^{2}}} \sum\nolimits_{i=1}^{n}\tilde{\varepsilon}_{i}^{2}x_{i}^{\prime} &{\frac{2}{\pi}}{\frac{n}{\tilde\sigma_{v}^{2}}}& -\sqrt{{\frac{2}{\pi}}}{\frac{1}{\tilde\sigma_{v}^{5}}} \sum\nolimits_{i=1}^{n}\tilde\varepsilon_{i}^{3}&0 \\ {\frac{1}{\tilde\sigma_{v}^{5}}}\sum\nolimits_{i=1}^{n} \tilde{\varepsilon}_{i}^{3}x_{i}^{\prime}&-\sqrt{{\frac{2}{\pi}}} {\frac{1}{\tilde\sigma_{v}^{5}}}\sum\nolimits_{i=1}^{n}\tilde\varepsilon_{i}^{3} & {\frac{1}{\tilde{\sigma}_{v}^{6}}}\sum\nolimits_{i=1}^{n} \tilde\varepsilon_{i}^4-{\frac{n}{\tilde\sigma_{v}^{2}}}&0 \\ 0&0&0&0 \\ \end{array}\right)} \\ \sum\limits_{i=1}^{n}\hat{h}_{i}&= {\left( \begin{array}{cccc} -\sum\nolimits_{i=1}^{n}{\frac{x_{i}x_{i}^{\prime}}{\hat{\sigma}_{v}^{2}}} & \left(1-\hat{p}\right)\sqrt{{\frac{2}{\pi}}} {\frac{1}{\hat{\sigma}_{v}^{2}}}\sum\nolimits_{i=1}^{n}x_{i}& 0& 0 \\ \left(1-\hat{p}\right)\sqrt{{\frac{2}{\pi}}} {\frac{1}{\hat{\sigma}_{v}^{2}}}\sum\nolimits_{i=1}^{n}x_{i}^{\prime} & -\left(1-\hat{p}\right)^{2} {\frac{2}{\pi}} {\frac{n}{\hat{\sigma}_{v}^{2}}} & 0& 0 \\ 0&0& {\frac{-2n}{\hat{\sigma}_{v}^2}}& 0 \\ 0&0&0& 0 \\ \end{array}\right)} \\ \sum\limits_{i=1}^{n}\tilde{h}_{i}&= {\left( \begin{array}{cccc} -\sum\nolimits_{i=1}^{n}{\frac{x_{i}x_{i}^{\prime}}{\tilde{\sigma}_{v}^{2}}} & \sqrt{{\frac{2}{\pi}}}{\frac{1}{\tilde{\sigma}_{v}^{2}}} \sum\nolimits_{i=1}^{n}x_{i}& 0& 0 \\ \sqrt{{\frac{2}{\pi}}}{\frac{1}{\tilde{\sigma}_{v}^{2}}} \sum\nolimits_{i=1}^{n}x_{i}^{\prime}& - {\frac{2}{\pi}} {\frac{n}{\tilde{\sigma}_{v}^{2}}} & 0& 0 \\ 0&0& {\frac{-2n}{\tilde{\sigma}_{v}^2}}& 0 \\ 0&0&0& 0 \\ \end{array}\right).} \\ \end{aligned} $$

All the matrices above are singular for any \(\hat{p}\in[0,1].\) Therefore, when the third moment of the OLS residuals is positive, only the LR statistic can be defined, and it equals zero. It remains to decide whether we should reject the null hypothesis when the OLS residuals have the wrong skew. Clearly, the LR test will not reject the null hypothesis, since the statistic equals zero under wrong skew. For the other tests, the statistic is undefined and it is not clear what to conclude. If we regard the wrong-skew cases as indicating that all firms are efficient, then it would be reasonable to reject the null hypothesis. However, as a practical matter, whether we reject the null hypothesis or not does not affect anything, because the estimated model collapses to the same model whether or not p = 0. It might be reasonable to simply say that p is not identified when the OLS residuals are incorrectly skewed. For a given data set, both the null and the alternative hypothesis would lead to the same results.

Assuming that \(\sigma_{u}^{2}>0,\) the wrong skew problem occurs with a probability that goes to zero asymptotically. However, as shown in Table 1, it can occur with non-trivial probability in finite samples. Also, the discussion above may be relevant even when the data do not have the wrong skew problem. The log likelihood has a stationary point at \(\theta^{**}\) regardless of the skew of the residuals. In the wrong skew case, the likelihood is perfectly flat in the p direction with \(\hat\beta= {\rm OLS},\) \(\hat{\lambda}=0,\) and \(\hat{\sigma}^{2}=(1/n)SSE.\) In the correct skew case, this is not true, but when λ is small, we expect that the partial derivative of the log likelihood with respect to p (evaluated at the MLE of the other parameters) would often be small in the vicinity of p = 0, so that the LM test and its variants might have low power. We will investigate this issue in the simulations of the next section.

3.5 Panel data

A complete treatment of this model with panel data is beyond the scope of this paper but we will make a few general comments. Now we have data \(y_{it}, x_{it}, z_{it}, i=1,\ldots,n, t=1,\ldots,T.\) We will think in terms of n being large and T being small (for example, 43 rice farms each observed for eight years).

If T is small, it is feasible to let any of the parameters of the model be different for different values of t. For example, we might let p be different for different time periods. This creates more parameters to estimate but no conceptual issues of estimation.

Let the errors be \(v_{it}\) and \(u_{it}\) and let \(\varepsilon_{it}=v_{it}-u_{it},\) the obvious generalization of (1) above. Then the density of \(\varepsilon_{it}\) is still as given in (4), and we can form the “likelihood” in (5), except that the sum would now be over \(t=1,\ldots, T\) as well as over i. This would commonly be called a quasi-likelihood. It is in fact the likelihood if the \(v_{it}\) and \(u_{it}\) are independent and identically distributed (iid) over t as well as i, in which case \(\varepsilon_{it}\) is also iid over t as well as i. However, if \(\varepsilon_{it}\) is not independent over t, then the true likelihood would depend on the joint distribution of \((\varepsilon_{i1},\ldots,\varepsilon_{iT}),\) and (5) is just an approximation that does not reflect the dependence over t of the \(\varepsilon_{it}.\) It is a standard econometric result that maximization of the quasi-likelihood yields a consistent estimate, although it is not efficient unless we really do have independence over t, and the conventionally-calculated standard errors of the estimates are wrong. Correct standard errors come from the so-called “sandwich form” for robust standard errors. See e.g., Hayashi (2000, section 8.7, p. 544).
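A sketch of such sandwich standard errors for the panel quasi-MLE, clustering the scores by firm; the per-observation score rows, the summed Hessian of the quasi-log-likelihood, and the firm identifiers are assumed to be supplied by the user's own estimation code.

```python
import numpy as np

def clustered_sandwich_cov(scores, H, firm_id):
    """Cluster-robust ("sandwich") covariance matrix for the quasi-MLE."""
    firms = np.unique(firm_id)
    # sum the score rows within each firm to allow for dependence over t
    S = np.vstack([scores[firm_id == f].sum(axis=0) for f in firms])
    bread = np.linalg.inv(-H)                        # (-Hessian)^{-1}
    meat = S.T @ S                                   # clustered outer product of scores
    return bread @ meat @ bread
```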

It is hard to specify a joint distribution for \((\varepsilon_{i1},\ldots,\varepsilon_{iT})\) because there is no natural joint distribution when the marginal distributions are non-normal. This problem is discussed, for the simpler case of the standard SFA model, by Amsler et al. (2014). They specify a joint distribution using a copula. This leads to conceptual and computational issues, for which the reader is referred to their paper. In the present context the quasi-MLE approach is probably the best we can reasonably hope for.

4 Simulations

We conducted simulations in order to investigate the finite sample performance of the ZISF model, and to compare it to the performance of the basic stochastic frontier model. We are interested both in parameter estimation and in the performance of tests of the hypothesis p = 0.

We consider a very simple data generating process: \(y_{i}=\beta+\varepsilon_{i},\) where as in Sect. 2 above, \(\varepsilon_{i}=v_{i}-u_{i}\) and u i is half-normal with probability 1 − p and u i  = 0 with probability p. We pick n = 200 and 500, β = 1, and σ u  = 1. We consider p = 0, 0.25, 0.5, and 0.75, and λ = 1, 2, and 5 (i.e., σ v  = 1, 0.5, and 0.2). Our simulations are based on 1,000 replications. Because the MLE’s were sensitive to the starting values used, we used several sets of starting values and chose the results with the highest maximized likelihood value.

Our experimental design was similar to that in KPT. They included a non-constant regressor, but in our experiments that made little difference. A more substantial difference is that we used n = 200 and 500 whereas they used n = 500 and 1,000.

There were some technical problems related to the facts that \(\sigma_{u}^{2}\) is not identified when p = 1, and p is not identified when \(\sigma_{u}^{2}=0.\) We define \(\hat{p}=1\) when \(\hat{\sigma}_{u}=0\) and \(\hat{\sigma}_{u}=0\) when \(\hat{p}=1.\) This would imply that when the OLS residuals have incorrect skew, the MLE would be \(\theta^{**}\) with \(\hat{p}=1.\) It was very seldom the case that \(\hat{\sigma}_{u}^{2}=0\) or \(\hat{p}=1\) other than in the wrong skew cases.

4.1 Parameter estimation

Table 2 contains the mean, bias, and MSE of the various parameter estimates, for the basic stochastic frontier model and for the ZISF model, for the case that n = 200. We also present the mean, bias, and MSE of the technical inefficiency estimates, and the mean of the "posterior" probabilities of full efficiency.

Table 2 Basic SF model versus ZISF model, all replications: n = 200

Unsurprisingly, the basic stochastic frontier model performs poorly except when p = 0 (in which case it is correctly specified). This is true for all three values of λ. We over-estimate technical inefficiency, because we act as if all firms are inefficient, whereas in fact they are not. This bias is naturally bigger when p is bigger.

For the ZISF model, the results depend strongly on the value of λ. When λ = 1, the results are not very good. Note in particular the mean values of \(\hat{p},\) which are 0.53, 0.49, 0.51, and 0.57 for p = 0, 0.25, 0.50, and 0.75, respectively. It is disturbing that the mean estimate of p does not appear to depend on the true value of p.

These problems are less severe for larger values of λ. The mean value of \(\hat{p}\) when p = 0 is 0.33 for λ = 2 and 0.16 for λ = 5. The estimates are considerably better for the other values of p. So basically the model performs reasonably well when λ is large enough and p is not too close to zero.

Table 3 is similar to Table 2 except that it reports the results only for the cases of correct skew (i.e., wrong skew cases are not included). This makes almost no difference for λ = 2 or 5, because there are very few wrong skew cases when λ = 2 or 5. For λ = 1, it matters more. However, the conclusions given above really do not change.

Table 3 Basic SF model versus ZISF model, correct skew replications: n = 200

Table 4 contains the same information as Table 2, except that now we have n = 500 rather than n = 200. The results are better for n = 500 than for n = 200, but a larger sample size does not really solve the problems that the ZISF model has in estimating p when p = 0 and/or λ = 1. For example, when p = 0, the mean \(\hat{p}\) for λ = 1, 2, 5 is 0.53, 0.33, 0.16 when n = 200 and 0.50, 0.30, 0.12 when n = 500. Reading the table in the other direction, when λ = 1, the mean \(\hat{p}\) for p = 0, 0.25, 0.5, 0.75 is 0.53, 0.49, 0.51, 0.57 when n = 200 and 0.50, 0.46, 0.48, 0.58 when n = 500. So again there are problems in estimating p when p = 0 or when λ is small.

Table 4 Basic SF model versus ZISF model, all replications: n = 500

It is perhaps not surprising that we encounter problems when we estimate the ZISF model when the true value of p is zero. Essentially, we are estimating a latent-class model with more classes than there really are. It is true that the class with zero probability contains no new parameters. If it did, they would not be identified and the results would presumably be much worse.

These results do not always agree with the summary of the results in KPT. KPT concentrate on the technical inefficiency estimates, and the only results they show explicitly for the parameter estimates (their Figure 3) are for n = 1,000, and λ = 5 and p = 0.25. We did successfully replicate their results, but n = 1,000 and λ = 5 is a very favorable parameter configuration. In their Sect. 3.1, they say the following about the case when the true p equals zero: "The ML estimator from the ZISF model is found to perform quite well. … Estimates of p were close to zero." It is not clear what parameter configuration this refers to, but in our simulations this is not true except when λ = 5. For smaller values of λ, the ZISF estimates of p when the true p = 0 are not very close to zero.

4.2 Testing the hypothesis p = 0

We now turn to the results of our simulations that are designed to investigate the size and power properties of the tests of the hypothesis p = 0, as discussed in Sect. 3.4 above. This hypothesis is economically interesting, and it is also practically important to know whether p = 0, since our model does not appear to perform well in that case. We would like to be able to recognize cases when p = 0 and just use the basic SF model in these cases.

The data generating process and parameter values for these simulations are as discussed above (in the beginning of Sect. 4). Specifically, the simulations are for n = 200 and n = 500.

We begin with the likelihood ratio (LR) test, which is the test that we believed ex ante would be most reliable. The results for n = 200 are given in Table 5. For each value of λ and p, we give the mean of the statistic (over the full set of 1,000 replications), the number of rejections and the frequency of rejection. The rejection rates in the rows corresponding to p = 0 are the size of the test, whereas the rejection rates in the rows corresponding to the positive values of p represent power.

Table 5 Likelihood ratio test, n = 200

Look first at the set of results for all replications. The size of the test is reasonable. It is undersized for λ = 1 and approximately correctly sized for λ = 2 and 5. However, the power is disappointing, except when λ is large. There is essentially no power, even against the alternative p = 0.75, when λ = 1. When λ = 2, power is 0.60 against p = 0.75, but only 0.24 against p = 0.50 and 0.06 against p = 0.25. Power is more reasonable when λ = 5.

Table 6 gives the same results for n = 500. Increasing n has little effect on the size of the test, but it improves the power. Power is still low when λ = 1 or when λ = 2 and p is not large.

Table 6 Likelihood ratio test, n = 500

In either case (n = 200 or 500), looking separately at the correct-skew cases does not change our conclusions.

In Tables 7 and 8, we give results for the Wald test, for n = 200 and 500, respectively. Since the Wald test is undefined in wrong-skew cases, we show the results only for the correct-skew cases. We consider separately the OPG, Hessian, and Robust forms of the test, as defined in Sect. 3.4.2 above. Regardless of which form of the test is used, the test is considerably over-sized. This is true for both sample sizes. The problem is worst for the Robust form and least serious for the OPG form, but there are serious size distortions in all three cases. Based on these results, the Wald test is not recommended.

Table 7 Wald test, n = 200
Table 8 Wald test, n = 500

In Tables 9 and 10, we give the results for the score-based tests (LM, modified LM, and KT). Once again the tests are undefined for wrong-skew cases so we report results only for the correct-skew cases. The (two-sided) LM test is the best of the three. It shows moderate size distortions and no power when λ = 1, but only modest size distortions when λ = 2 or 5. The modified LM test has bigger size distortions and less power when λ = 2 or 5. The KT test has the largest size distortions and is therefore not recommended.

Table 9 Score-based tests, n = 200
Table 10 Score-based tests, n = 500

Our results are easy to summarize. The likelihood ratio test is the best of the five tests we have considered, at least for these parameter values. It is the only one of the tests that does not over-reject the true null that p = 0. However, it does not have much power. That is, we will have trouble rejecting the hypothesis that the basic SF model is correctly specified, even if the ZISF model is needed and p is not close to zero. The exception to this pessimistic conclusion is the case when both p and λ are large, in which case the power of the test is satisfactory.

5 Empirical example

We apply the models defined in Sects. 2 and 3 to the Philippine rice data used in the empirical examples of Coelli et al. (2005, chapters 8–9). The Philippine data are composed of 43 farmers over 8 years and Coelli et al. (2005) estimate the basic stochastic frontier model with a trans-log production function, ignoring the panel nature of the observations. Their output variable is tonnes of freshly threshed rice, and the input variables are planted area (in hectares), labor, and fertilizer used (in kilograms). These variables are scaled to have unit means so the first-order coefficients of the trans-log function can be interpreted as elasticities of output with respect to inputs evaluated at the variable means. We follow the basic setup of Coelli et al. (2005) but estimate the extended models where some farms are allowed to be efficient, and the probability of farm i being efficient and/or the distribution of \(u_{i}\) depend on farm characteristics. Data on age of household head, education of household head, household size, number of adults in the household, and the percentage of area classified as bantog (upland) fields are used as farm characteristics that influence the probability of a farm being fully efficient and/or the distribution of the inefficiency. See Coelli et al. (2005, Appendix 2) for a detailed description of the data.

5.1 Model

We consider models based on the following specification:

$$ \begin{aligned} \ln{y}_{i}&=\beta_{0}+\theta t+\beta_{1}\ln{area_{i}}+\beta_{2}\ln{labor_{i}}+\beta_{3}\ln{npk_{i}}\\ &\quad +\beta_{11}{\frac{1}{2}}(\ln{area_{i}})^{2}+\beta_{12}\ln{area_{i}}\ln{labor_{i}}+\beta_{13}\ln{area_{i}}\ln{npk_{i}}\\ &\quad +\beta_{22}{\frac{1}{2}}(\ln{labor_{i}})^{2}+\beta_{23}\ln{labor_{i}}\ln{npk_{i}}+\beta_{33}{\frac{1}{2}}(\ln{npk_{i}})^{2}+v_{i}-u_{i},\end{aligned} $$
(10)
$$ \begin{aligned} u_{i}&\sim N^{+}(0,\sigma_{i}^{2}),\\\sigma_{i}^{2}&=\exp{(\gamma_{0}+age_{i}\gamma_{1}+edyrs_{i}\gamma_{2}+hhsize_{i}\gamma_{3}+nadult_{i}\gamma_{4}+banrat_{i}\gamma_{5})},\\ \end{aligned} $$
(11)
$$ P(z_{i}=1 \vert w_{i})={\frac{\exp{(\delta_{0}+age_{i}\delta_{1}+edyrs_{i}\delta_{2}+hhsize_{i}\delta_{3}+nadult_{i}\delta_{4}+banrat_{i}\delta_{5})}}{1+\exp{(\delta_{0}+age_{i}\delta_{1}+edyrs_{i}\delta_{2}+hhsize_{i}\delta_{3}+nadult_{i}\delta_{4}+banrat_{i}\delta_{5})}}} $$
(12)

where \(area_{i}\) is the size of planted area in hectares, \(labor_{i}\) is a measure of labor, \(npk_{i}\) is fertilizer in kilograms, \(age_{i}\) is the age of household head, \(edyrs_{i}\) is the years of education of the household head, \(hhsize_{i}\) is the household size, \(nadult_{i}\) is the number of adults in the household, and \(banrat_{i}\) is the percentage of area classified as bantog (upland) fields.

We assume a trans-log production function with time trend as in (10). We estimate the following models:

  a. the basic stochastic frontier model, in which \(\sigma_{i}^{2}\) is constant \((\equiv\sigma_{u}^{2})\) and \(P(z_{i}=1\vert w_{i})=0;\)

  b. the ZISF model in which \(\sigma_{u}^{2}\) is constant and \(P(z_{i}=1\vert w_{i})\) is constant \((\equiv p)\) but not necessarily equal to zero;

  c. the “heteroskedasticity” model in which p = 0 but \(\sigma_{i}^{2}\) is as given in (11);

  d. the “logit” model in which \(\sigma_{u}^{2}\) is constant but \(P(z_{i}=1\vert w_{i})\) is as given in (12);

  e. the “logit + heteroskedasticity” model in which \(\sigma_{i}^{2}\) is as given in (11) and \(P(z_{i}=1\vert w_{i})\) is as given in (12). A sketch of the implied mixture density appears after this list.
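The following sketch, using our own function names, writes out the mixture density implied by the most general model (e); models (a) through (d) can be obtained by restricting γ, δ, or both. Here W is assumed to contain the farm characteristics together with a constant column.

```python
import numpy as np
from scipy.stats import norm

def density_model_e(eps, W, gamma, delta, sigma_v):
    """Density of eps_i when sigma_i^2 follows (11) and P(z_i = 1 | w_i) follows (12)."""
    sigma_u = np.sqrt(np.exp(W @ gamma))             # Eq. (11): RSCFG-type variance
    sigma = np.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    p_i = 1.0 / (1.0 + np.exp(-(W @ delta)))         # Eq. (12): logit probability
    f_v = norm.pdf(eps, scale=sigma_v)
    f_eps = (2.0 / sigma) * norm.pdf(eps / sigma) * norm.cdf(-eps * lam / sigma)
    return p_i * f_v + (1.0 - p_i) * f_eps
```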

5.2 The estimates

The MLEs and their OPG standard errors are reported in Table 11.

Table 11 Model comparison

Consider first the results for the basic stochastic frontier model (first column of results in the table). The inputs are productive and there are roughly constant returns to scale. Average technical efficiency is about 70 %. The estimated value of λ is 2.75, and both that value and the sample size (n = 344) are big enough to feel confident about proceeding to the ZISF model and its extensions.

The next block of columns of results is for the ZISF model. Here we have \(\hat{p}=0.58,\) so a substantial fraction of the observations (farm-time period combinations) are characterized by full efficiency. The technology (effect of inputs on output) is not changed much from the basic SF model, but the intercept is lower and the level of technical efficiency is higher (between 85 and 90 %). Based on our simulations, this is a predictable consequence of finding that a substantial number of observations are fully efficient.

The next block of columns of results is for the heteroskedasticity model in which all farms are inefficient but the level of inefficiency depends on farm characteristics. A number of farm characteristics (age of the farmer, education of the farmer and percentage of bantog fields) have significant effects on the level of inefficiency. In this parameterization, a positive coefficient indicates that an increase in the corresponding variable makes a farm more inefficient. The model implies that farms where the farmer is older and more educated, and where the percentage of bantog fields is lower, tend to be more inefficient (less efficient). Or, saying the same thing the other way around, farms are more efficient on average if the farmer is younger and less educated and the percentage of bantog fields is higher. The effect of education is perhaps surprising. Because this model does not allow any farms to be fully efficient, we once again have a low level of average technical efficiency, about 72 %, which is similar to that for the basic SF model.

Next we consider the logit model in which the distribution of inefficiency is the same for all firms that are not fully efficient, but the probability of being fully efficient depends on farm characteristics according to a logit model. Now age of the farmer and percentage of bantog fields have significant effects on the probability of full efficiency, and the coefficient of household size is almost significant at the 5 % level (t statistic = −1.93). The results indicate that farms with younger farmers, smaller household size, and a larger proportion of bantog fields are more likely to be fully efficient. The results for age of the farmer and percentage of bantog fields are similar in nature to those for the heteroskedasticity model. The average level of efficiency is once again higher, about 86 %, which is very similar to the result for the ZISF model with constant p.

Finally, the last set of results is for the logit + heteroskedasticity model in which farm characteristics influence both the probability of being fully efficient and the distribution of inefficiency for those farms that are not fully efficient. Now none of the farm characteristics considered have significant effects on the distribution of inefficiency for the inefficient farms, but three of them (age of the farmer, household size and proportion of bantog fields) do have significant effects on the probability of being fully efficient. The coefficients for these three variables have the same signs as in the logit model without heteroskedasticity. It is interesting that we can estimate a model this complicated and still get significant results. Also, we note that, because this model allows a non-zero probability of full efficiency, we are back to a high average level of technical efficiency, between 85 and 90 %.

5.3 Model comparison and selection

We will now test the restrictions that distinguish the various models we have estimated. Based on the results of our simulations, we will use the likelihood ratio (LR) test. We immediately encounter some difficulties because, to use the LR test (or the other tests we considered in Sect. 3.4), the hypotheses should be nested, whereas not all of our models are nested. There are two possible nested hierarchies of models: (a) basic SF \(\subset\) ZISF \(\subset\) logit \(\subset\) logit-heteroskedasticity, and (b) basic SF \(\subset\) heteroskedasticity.

We begin with hierarchy (a). When we test the hypothesis that p = 0 in the ZISF model, we obtain LR = 5.07, which exceeds the 5 % critical value of 2.71 for the \(\tfrac{1}{2}\chi_{0}^{2}+\tfrac{1}{2}\chi_{1}^{2}\) distribution. So we reject the basic SF model in favor of the ZISF model. Next we test the ZISF model against the logit model. This is a standard test of the hypothesis that \(\delta_{1}=\delta_{2}=\delta_{3}=\delta_{4}=\delta_{5}=0\) in the logit model. The LR statistic of 23.96 exceeds the 5 % critical value for the \(\chi_{5}^{2}\) distribution (11.07), so we reject the ZISF model in favor of the logit model. Finally, we test the logit model against the logit-heteroskedasticity model. This is a standard test of the hypothesis that \(\gamma_{1}=\gamma_{2}=\gamma_{3}=\gamma_{4}=\gamma_{5}=0\) in the logit-heteroskedasticity model. The LR test statistic of 11.12 very marginally exceeds the 5 % critical value, so we reject the logit model in favor of the logit-heteroskedasticity model, but not overwhelmingly. Note that the logit model is rejected even though, in the logit-heteroskedasticity model, none of the individual \(\gamma_{j}\) in the heteroskedasticity portion of the model is individually significant.

Now consider hierarchy (b). We test the basic SF model against the heteroskedasticity model. This is a standard test of the hypothesis that \(\gamma_{1}=\gamma_{2}=\gamma_{3}=\gamma_{4}=\gamma_{5}=0\) in the heteroskedasticity model. The LR statistic of 17.04 exceeds the 5 % critical value, so we reject the basic SF model in favor of the heteroskedasticity model.

We cannot test the heteroskedasticity model against the logit-heteroskedasticity model, at least not by standard methods, since the restriction that would convert the logit-heteroskedasticity model into the heteroskedasticity model is \(\delta_{0}=-\infty\) and under this null the other \(\delta_{j}\) are unidentified. Still, the difference in log-likelihoods, which is 11.56, would appear to argue in favor of the logit-heteroskedasticity model.

In order to compare the models in a slightly different way, and to amplify on the comment at the end of the preceding paragraph, we will also consider some standard model selection criteria. We consider \(\hbox{AIC}=-2LF+2d\) (Akaike 1974), \(\hbox{BIC}=-2LF+d\ln{n}\) (Schwarz 1978) and \(\hbox{HQIC}=-2LF+2d\ln{(\ln{n})}\) (Hannan and Quinn 1979), where d is the number of estimated parameters, n is the number of observations, and LF is the log-likelihood value. Smaller values of these criteria indicate a “better” model. We note that all three criteria favor the logit model over the heteroskedasticity model, two of the three favor the logit-heteroskedasticity model over the heteroskedasticity model, and two of the three favor the logit model over the logit-heteroskedasticity model.
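A trivial sketch of these three criteria, with LF, d, and n as just defined:

```python
import numpy as np

def info_criteria(LF, d, n):
    """AIC, BIC, and HQIC for a model with log-likelihood LF, d parameters, n observations."""
    return {"AIC": -2 * LF + 2 * d,
            "BIC": -2 * LF + d * np.log(n),
            "HQIC": -2 * LF + 2 * d * np.log(np.log(n))}
```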

Based on the results of our hypothesis tests and the comparison of the model selection procedures, we conclude that a case could be made for either the logit model or the logit-heteroskedasticity model as the preferred model. As we saw above, the substantive conclusions from these two models were basically the same.

6 Concluding remarks

In this paper we considered a generalization of the usual stochastic frontier model. In this new “ZISF” model, there is a probability p that a firm is fully efficient. This model was proposed by Kumbhakar et al. (2013), who showed how to estimate the model by MLE, how to update the probability of a firm being fully efficient on the basis of the data, and how to estimate the inefficiency level of a specific firm.

We extend their analysis in a number of ways. We show that a result similar to that of Waldman (1982) holds in the ZISF model, namely, that there is always a stationary point of the likelihood at parameter values that indicate no inefficiency, and that this point is a local maximum if the OLS residuals are positively skewed. We show how to test the hypothesis that p = 0. We also provide a more comprehensive set of simulations than Kumbhakar et al. (2013) did.

Let \(\lambda=\sigma_{u}/\sigma_{v},\) a standard measure in the stochastic frontier literature of the relative size of technical inefficiency and statistical noise. The main practical implication of our simulations is that the ZISF model works well when neither λ nor p is small. However, we have trouble estimating p reliably, or testing whether it equals zero, when λ is small. And if the true p equals zero, we have trouble estimating it reliably unless λ is larger than is empirically plausible (e.g., λ = 5). Larger sample size obviously helps, but the above conclusions do not depend strongly on sample size in our simulations. Situations where the ZISF model may be useful therefore have the characteristics that (1) it is reasonable to suppose that some firms are fully efficient, and (2) the inefficiency levels of the inefficient firms are not small relative to statistical noise. Such situations do not seem implausible, and it is an empirical question as to how common they are.