1 Introduction

Most common panel regression models, such as the random- and fixed-effects models, assume that the regression slopes are equal across the cross section. In fact, in some parts of the literature the equal slope assumption is so common that it is hardly ever questioned. The main reason for this is twofold (see, e.g., Baltagi et al. 2008). First, it greatly simplifies estimation and inference. Indeed, as a referee to this journal points out, the main feature of panel data is the ability to selectively pool the information regarding the assumed common slope, while allowing great heterogeneity in other parts of the model. Second, if true, exploiting the equality of the slopes leads to large gains in efficiency. The problem is that if the assumption is false, panel techniques based on equal-slope models become inconsistent, causing misleading inference (see, e.g., Hsiao 2003, Chapter 6). It is therefore crucial to test for slope homogeneity before proceeding with an analysis based on this assumption.

The classical approach to testing the equal slope assumption is to employ a simple F-test (see, e.g., Hsiao and Pesaran 2008; Baltagi et al. 2008; Pesaran and Yamagata 2008). Alternatively, one may use the Hausman test of Pesaran et al. (1996), which compares two estimators, one constructed under the assumption of homogeneous slopes and one not. A third possibility is to use the Swamy-type test of Pesaran and Yamagata (2008), which is based on the dispersion of individual slope estimates around a suitable pooled estimator. Although very popular, these approaches suffer from at least two important shortcomings.

First, all three tests assume that the regression errors are cross-sectionally independent, a restriction that is likely to be violated in practice, especially in macroeconomic and financial applications involving country-level data, where strong intereconomy linkages can be expected. Recognizing this shortcoming, Phillips and Sul (2003) propose a Hausman-type test that is appropriate in the special case in which the dependence can be represented by a single common factor. The problem is, of course, that the common factor structure need not be correct, which would then invalidate the test. Another possibility, which allows for more general forms of cross-sectional dependence, is to use the seemingly unrelated regression (SUR) approach. The advantage of this approach is that, as long as the cross-sectional dimension, N, is “small” and the time series dimension, T, is “large,” the cross-sectional dependence can be allowed to be quite general. The drawback is that unless \(T \gg N\), the small-sample performance of SUR-based tests is often very poor (see Bun 2004).

The second shortcoming relates to the formulation of the hypotheses tested, which, depending on the problem at hand, may not be very informative. In particular, while the null hypothesis can certainly be formulated as that all the slopes are equal, the alternative hypothesis that there are at least some units for which homogeneity fails is typically too broad for any interesting economic conclusions; it could be that the slopes of all units are different, but it could also be that there is only a small fraction of units for which homogeneity fails. Some studies have considered clustering methods for grouping units such that the slopes are homogeneous within each group, but heterogeneous across groups (see, e.g., Kapetanios 2006; Lin and Ng 2012). Unfortunately, these methods are quite complicated to implement, which is probably also the reason why they are almost never used in practice.

In this paper, we develop several procedures to ascertain the homogeneity of a panel. The point of departure is a quite general data-generating process that allows, for example, serial correlation of unknown form and complex cross-sectional dependencies such as dynamic common factor models. In fact, except for some mild regularity conditions, there are virtually no restrictions on the forms of serial and cross-sectional dependence that can be permitted. We do require that the regressors are strictly exogenous, but the approach can be easily extended to allow more general types of regressors. Given this generality, corrections aimed at achieving asymptotically pivotal statistics are not really an option. In this paper, we therefore consider the block bootstrap as a means to obtain tests that are asymptotically valid. Two block bootstrap-based test procedures are considered: one is appropriate when testing the above-mentioned hypothesis of slope homogeneity versus at least some heterogeneity, while the other can be used to sequentially determine the units for which the slopes are equal. Both procedures are easy to implement and work well even when N and T are similar in magnitude.

The plan of the rest of the paper is as follows. Section 2 describes the model and assumptions, which are used in Sect. 3 to study the properties of the two test procedures considered. Section 4 presents the results of a small Monte Carlo study. Section 5 concludes.

2 Model and assumptions

Let \(y_{i,t}\) be generated as

$$\begin{aligned} y_{i,t} = \theta _i + \beta ^\prime _i x_{i,t} + \varepsilon _{i,t}, \end{aligned}$$
(1)

where \(\theta _{i}\) is regarded as a unit-specific intercept or fixed effect, \(x_{i,t}\) is an \(m\times 1\) vector of regressors, and \(\beta _i\) is a conformable vector of unknown slope coefficients. The purpose of this paper is to infer the extent to which \(\beta _i\) can be regarded as equal across i.

The assumptions that we will be working under are stated below, where \(\rightarrow _p\) and \(\rightarrow _d\) signify convergence in probability and distribution, respectively, and \(C\in (0, \infty )\) is a generic constant not depending on N or T. Let \(Q_{i,T} = T^{-1} (x_i^\prime M_{\tau } x_i)\), where \(x_i = (x_{i,1},\ldots ,x_{i,T})^\prime \) and \({M}_\tau = {I}_T -\tau _T (\tau ^\prime _T \tau _T)^{-1}\tau ^\prime _T\), \({I}_T\) is the \(T\times T\) identity matrix, and \(\tau _T =(1,\ldots ,1)'\) is a \(T \times 1\) vector of ones. It is also convenient to let \(\tilde{z}_t = (\tilde{z}_{1,t}^{\prime },\ldots ,\tilde{z}_{N,t}^{\prime })^\prime \), where \(\tilde{z}_{i,t} = (x_{i,t} - \mu _i) \varepsilon _{i,t}\) and \(\mu _i = E(x_{i,t})\). The sigma-field generated by \(\{\varepsilon _{i,n}\}_{n=1}^t\) (\(\{x_{i,n}\}_{n=1}^t\)) is henceforth denoted \(\mathcal {F}_{\varepsilon i,t}\) (\(\mathcal {F}_{x i,t}\)).

Assumption REGR. \(Q_{i,T} \rightarrow _p Q_i\) as \(T \rightarrow \infty \), where \(Q_i\) is a positive definite matrix.

Assumption ERR.

  (i) \(E(\varepsilon _{i,t} x_{j,n}) = 0\) for all i, t, j, and n;

  (ii) \(E(\varepsilon _{i,t}^2 | \mathcal {F}_{\varepsilon i,t-1} \cup \mathcal {F}_{x i,t}) = E(\varepsilon _{i,t}^2) = \sigma _i^2 \le C\) for all t;

  (iii) \(\Sigma = \mathrm {var}(T^{-1/2} \sum _{t=1}^T \tilde{z}_t)\) is positive definite;

  (iv) \(\{ (x_{i,t},\varepsilon _{i,t})\}_{t=1}^T\) is an \(\alpha \)-mixing sequence with mixing coefficient \(\alpha _i(n)\), which is such that \(\sup _{i=1,\ldots ,N} \alpha _i (n) \le \alpha (n)\), where \(\alpha (n) = Cn^{-a}\) for some \(a > r/(r-2)\) and \(r > 2\);

  (v) \(E( || x_{i,t} ||^{2r+p}) \le C\) and \(E( | \varepsilon _{i,t} |^{2r+p}) \le C\) for some \(p > 0\), and all i and t.

Assumptions ERR (i) and (ii) rule out non-strictly exogenous regressors (including lagged dependent variables) and heteroskedasticity across time. As we discuss in Remark 1 of Sect. 3, these assumptions are stronger than needed. However, since the main motivation of the paper is to account for cross-sectional (and serial) correlation, we have preferred to keep Assumptions ERR (i) and (ii) as they stand (see, e.g., Kapetanios 2008; Hidalgo 2003, for similar assumptions). These assumptions are needed not for the specific tests that we propose, but for the validity of our residual-based bootstrap, which can be easily modified to accommodate both endogenous regressors and errors that are heteroskedastic across time. The reason for sticking with the residual-based bootstrap is that it makes for relatively straightforward proofs. Assumption ERR (iv) allows for heterogeneous forms of serial correlation across the cross section, but imposes a uniform bound on the mixing coefficients. Many commonly encountered stochastic processes can be accommodated under this assumption (see Davidson 1994, Section 14.4). Note also that Assumption ERR does not impose any particular cross-correlation structure. In particular, it is not necessary to know whether the dependence is weak or strong (see Chudik et al. 2011; Pesaran and Tosetti 2010). The dependence also does not have to be static but may be dynamic in nature, as when \(\varepsilon _{i,t}\) is generated by a dynamic common factor model. The moment requirements in Assumption ERR (v) rule out time trends in the regressors.

3 The bootstrap test procedures

Let us denote by q the number of units that cannot be pooled; that is, q is the number of units for which \(\beta _i \ne \beta \), where \(\beta \) is the common value of \(\beta _i\) among the poolable units. The purpose of this paper is to make inference regarding q. Let us therefore denote by \(0 = q_1 < \cdots < q_K < N\) a set of K user-defined numbers, representing the numbers of non-poolable units to be considered in the testing. Let \(H_0(q_k)\) denote the null hypothesis that \(q = q_k\), where \(k = 1, \ldots ,K\), and let \(H_1(q_{k+1})\) denote the alternative hypothesis that \(q \ge q_{k+1}\).

3.1 A pooled test for testing \(q = 0\) versus \(q \ge 1\)

In this subsection, we consider the relatively simple problem of testing \(H_0(0)\) versus \(H_1(1)\); that is, the null hypothesis of homogeneity (\(q=0\)) is tested versus the alternative that there is at least one unit that cannot be pooled (\(q \ge 1\)). The reason for considering this testing problem separately is that under the null hypothesis the slopes of all the units are equal, which makes it possible to consider pooled test statistics in the spirit of much of the previous literature (see, e.g., Hsiao and Pesaran 2008; Baltagi et al. 2008; Pesaran and Yamagata 2008). The test statistic that we consider is a bootstrap Swamy-type homogeneity test and is given by

$$\begin{aligned} S = \sum _{i = 1}^N T \big ( {\hat{\beta }}_i - {\hat{\beta }}_\mathrm{{WFE}}\big )^\prime \frac{Q_{i,T}}{\hat{\sigma }_i^2} \big ( {\hat{\beta }}_i - {\hat{\beta }}_\mathrm{{WFE}}\big ), \end{aligned}$$

where \({\hat{\beta }}_i\) is the least-squares (LS) estimator of \(\beta _i\) when applied to cross-sectional unit i, \(\hat{\sigma }^2_i = T^{-1}\sum _{t=1}^T \hat{\varepsilon }_{i,t}^2\), \(\hat{\varepsilon }_{i,t} = y_{i,t} - \hat{\theta }_i - \hat{\beta }^\prime _i x_{i,t}\), and \({\hat{\beta }}_\mathrm{{WFE}}\) is the weighted fixed-effects (WFE) estimator, as given by

$$\begin{aligned} {\hat{\beta }}_\mathrm{{WFE}} = \left( \sum _{i = 1}^{N} \frac{x_i^\prime M_{\tau } x_i}{\hat{\sigma }_i^2} \right) ^{-1} \sum _{i = 1}^{N} \frac{x_i^\prime M_{\tau } {y}_i}{\hat{\sigma }_i^2}. \end{aligned}$$

As Hsiao (2003) and Pesaran and Yamagata (2008) show, under \(H_0\) and the assumption of independently identically distributed (iid) errors, \(S \rightarrow _d \chi ^2[m(N-1)]\) as \(T \rightarrow \infty \) with N held fixed. However, under the more general assumptions laid out above, this is no longer the case. The approach opted for here is therefore based on the bootstrap.
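To make the construction concrete, S and the WFE estimator can be computed as follows. This is an illustrative sketch under our own conventions (a balanced panel stored as numpy arrays, and a hypothetical function name `swamy_S`), not code from the paper:

```python
import numpy as np

def swamy_S(y, X):
    """Swamy-type homogeneity statistic S for a balanced panel.

    y : (N, T) array of dependent variables.
    X : (N, T, m) array of regressors.
    Returns (S, beta_WFE). Follows the formulas for S and the
    weighted fixed-effects estimator in the text.
    """
    N, T, m = X.shape
    # The within-transform M_tau amounts to demeaning each unit over time
    Xd = X - X.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)

    beta_i = np.empty((N, m))
    sigma2 = np.empty(N)
    for i in range(N):
        # Unit-by-unit LS estimate and residual variance
        beta_i[i] = np.linalg.lstsq(Xd[i], yd[i], rcond=None)[0]
        resid = yd[i] - Xd[i] @ beta_i[i]
        sigma2[i] = resid @ resid / T

    # Weighted fixed-effects estimator
    A = sum(Xd[i].T @ Xd[i] / sigma2[i] for i in range(N))
    b = sum(Xd[i].T @ yd[i] / sigma2[i] for i in range(N))
    beta_wfe = np.linalg.solve(A, b)

    # S = sum_i T (b_i - b_WFE)' (Q_{i,T}/sigma_i^2) (b_i - b_WFE),
    # where T * Q_{i,T} = Xd_i' Xd_i
    S = 0.0
    for i in range(N):
        d = beta_i[i] - beta_wfe
        S += d @ (Xd[i].T @ Xd[i]) @ d / sigma2[i]
    return S, beta_wfe
```

Under homogeneous slopes and well-behaved errors, `beta_wfe` should be close to the common slope and S should be of the order of its degrees of freedom.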

Algorithm BOOT.

  1. Estimate (1) using LS for each cross-sectional unit and organize the residuals, \(\hat{\varepsilon }_{i,t}\), in a \(T \times N\) matrix \(\hat{\varepsilon }= (\hat{\varepsilon }_{1},\ldots ,\hat{\varepsilon }_{T})'\), where \(\hat{\varepsilon }_{t} = (\hat{\varepsilon }_{1,t},\ldots ,\hat{\varepsilon }_{N,t})'\).

  2. Obtain \({\hat{\beta }}_\mathrm{{WFE}}\) and \(\hat{\theta }_i = \overline{y}_i - {\hat{\beta }}_\mathrm{{WFE}}\overline{x}_i\), where \(\overline{y}_i = T^{-1}\sum _{t=1}^T y_{i,t}\) with a similar definition of \(\overline{x}_i\).

  3. Choose a block length, l. Let \(J_{t} = (\hat{{\varepsilon }}_{t},\hat{{\varepsilon }}_{t+1},\ldots ,\hat{{\varepsilon }}_{t+l-1})^\prime \) be the block of l consecutive estimated errors starting at date t, and let \(I_1, I_2,\ldots \) be a sequence of iid random variables with a discrete uniform distribution on \(\{1,\dots ,(T-l+1)\}\). The \(T\times N\) matrix of pseudo errors, \({\varepsilon }^*\), is such that the first l rows are determined by \(J_{I_1}\), the next l rows are given by \(J_{I_2}\), and so on. The procedure is stopped when T rows have been generated.

  4. Simulate pseudo-data under \(H_0\) as \(y_{i,t}^* = \hat{\theta }_i + {\hat{\beta }}^\prime _\mathrm{{WFE}} x_{i,t} + \varepsilon _{i,t}^*\).

  5. Compute the bootstrap test statistic, \(S^*\), where \(S^*\) is calculated exactly as S, but with \((y_{i}, x_i)\) replaced by \((y_{i}^*, x_i)\).

  6. Repeat steps 3–5 B times.

  7. Select the bootstrap critical value as the \((1-\alpha )\)-quantile of the ordered \(S^{*}\) statistics.
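Step 3 is the key ingredient: whole cross-sectional rows of the residual matrix are resampled together, which preserves the cross-sectional dependence, while blocks of l consecutive rows preserve the serial dependence. A minimal sketch (the helper name `block_resample` and its interface are our own):

```python
import numpy as np

def block_resample(eps_hat, l, rng):
    """Moving-block resampling of a T x N residual matrix, as in step 3
    of Algorithm BOOT. Rows (cross sections) are kept intact; blocks of
    l consecutive rows are drawn with iid uniform starting points."""
    T = eps_hat.shape[0]
    n_blocks = int(np.ceil(T / l))
    # iid uniform starting indices on {0, ..., T - l}
    starts = rng.integers(0, T - l + 1, size=n_blocks)
    rows = np.concatenate([np.arange(s, s + l) for s in starts])[:T]
    return eps_hat[rows]
```

The resulting matrix has the same dimensions as the input, and each of its rows is an actual row of the original residual matrix.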

Remark 1

The above algorithm is an example of a residual-based bootstrap and is similar to the algorithms used in Bun (2004), Kapetanios (2008), Hidalgo (2003), and Zhou and Shao (2013), to mention a few. Alternatively, we may follow, for example, Freedman (1981) and Fitzenberger (1997) and block bootstrap \((y_{i,t}, x_{i,t}^\prime )\). Hence, instead of resampling \(\hat{{\varepsilon }}_{t}\) in step 3 of Algorithm BOOT, we resample \((y_{t}', x_{t}')'\), an \(N\times (1+m)\) matrix. The bootstrap test statistic is S based on \((y_{i}^*, x_i^*)\), the bootstrapped version of \((y_{i}, x_i)\). The main advantage of this resampling scheme is that it does not require homoskedasticity and strictly exogenous regressors.

Remark 2

Algorithm BOOT requires a choice of block length, l. A common approach is to set l as a deterministic function of T. Alternatively, one may follow, for example, Gonçalves and White (2005) and Gonçalves (2011) and set l according to the data-dependent rule of Andrews (1991) or Newey and West (1994), originally proposed for the purpose of bandwidth selection in long-run variance estimation (see Fitzenberger 1997, Section 3.4, for a discussion). In Sect. 4, we use Monte Carlo simulation to evaluate the effect of various rules for selecting l.

The asymptotic validity of the bootstrap procedure requires the following assumption on the block length.

Assumption BL. \(l \rightarrow \infty \) and \(l = o(\sqrt{T})\) as \(T \rightarrow \infty \).

Theorem 1

Suppose that Assumptions REGR and ERR hold and that \(T \rightarrow \infty \) with N held fixed. Under \(H_0(0)\),

$$\begin{aligned} S \rightarrow _d \sum _{j=1}^{kN} \lambda _j U_j^2, \end{aligned}$$

where \(\lambda _j \in [0,\infty )\) and \(U_j \sim N(0,1)\) independently across j. Under \(H_1(1)\),

$$\begin{aligned} S \rightarrow \infty . \end{aligned}$$

The corresponding result for \(S^*\) is given in Theorem 2.

Theorem 2

Suppose that Assumptions REGR, ERR, and BL hold and that \(T \rightarrow \infty \) with N held fixed. Under \(H_0(0)\) and \(H_1(1)\),

$$\begin{aligned} S^* \rightarrow _{d^*} \; \sum _{j=1}^{kN} \lambda _j U_j^2 \quad \hbox {in probability}, \end{aligned}$$

where \(\rightarrow _{d^*}\) signifies convergence in distribution conditional on the realization of the sample.

Remark 3

Together Theorems 1 and 2 establish the asymptotic validity of the proposed bootstrap test procedure. There are two requirements. First, S and \(S^*\) must converge to the same asymptotic null distribution. Second, under \(H_1(1)\), while \(S^*\) should converge to the same asymptotic distribution as under \(H_0(0)\), S should diverge.

Remark 4

The asymptotic results in Theorems 1 and 2 are based on letting \(T \rightarrow \infty \), while keeping N fixed. There are several reasons for considering such a large-T and fixed-N asymptotic framework. First, if T is fixed, then the above resampling scheme will not be asymptotically valid. Also, N plays no role in the proofs of Theorems 1 and 2. As there is no resampling in the cross-sectional dimension, there is no reason to expect the performance of the bootstrap test to become more (or less) accurate when N increases. Second, it is difficult to obtain distributional results for \(N \rightarrow \infty \) without imposing additional conditions on the cross-sectional dependence structure. Third, from a practical viewpoint, a test for slope equality may be of greatest interest in applications in which N is relatively small. In such cases, the condition that \(N \rightarrow \infty \) may be difficult to justify. There are, of course, applications in which it is more appropriate to rely on large-N asymptotics, but then other test statistics are likely to be more effective (see Pesaran and Yamagata 2008, for a discussion). However, in such large-N panels T is typically too small for estimation of individual slopes, and therefore, the analysis must be made conditional on the slopes being equal.

Remark 5

In contrast to, for example, Bun (2004), we make no attempt to obtain an asymptotically pivotal test statistic. Nevertheless, it should be noted that there is evidence that bootstrapping (asymptotically) pivotal test statistics leads to asymptotic refinements (see, e.g., Davidson and MacKinnon 1999). Therefore, if such a test statistic could be obtained, it might be preferable to the approach taken here.

3.2 A sequential test procedure for determining q

The test considered in the previous section is appropriate if one wishes to infer whether there is any evidence against poolability at all. The problem is that in many cases one would like to go further than just concluding that \(q > 0\) in case of a rejection, and in this section, we therefore consider a sequential test that can be used to pinpoint q. In so doing, we will assume that \(q_k = k-1\), where \(k = 1,\ldots ,K\) and \(K=N-1\), such that the number of units to be tested decreases by one at each iteration; later on we discuss how to proceed when \(q_1,\ldots ,q_K\) are set differently.

To test whether a particular unit i has the same slope coefficient vector as a certain benchmark unit b, we may use the following Wald test statistic:

$$\begin{aligned} W_i = T \big ( {\hat{\beta }}_i - {\hat{\beta }}_b \big )^\prime \big ( \hat{\sigma }_i^2 Q_{i,T}^{-1} + \hat{\sigma }_b^2 Q_{b,T}^{-1}\big )^{-1} \big ( {\hat{\beta }}_i - {\hat{\beta }}_b \big ). \end{aligned}$$

The idea is to apply this test statistic in a sequential fashion to determine the set of units with coefficient vector \(\beta _b = \beta \). The problem is that in doing so we are likely to end up with spurious rejections due to the multitude of tests; that is, we face the problem of controlling the overall significance level of the approach. To this end, we follow Smeekes (2015), who considers a bootstrap sequential unit root test to determine the stationary units in a panel.
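For concreteness, \(W_i\) can be computed as follows. This is a sketch under our own conventions (within-demeaned data passed directly, and a hypothetical function name `wald_Wi`):

```python
import numpy as np

def wald_Wi(yd_i, Xd_i, yd_b, Xd_b):
    """Wald statistic W_i comparing the slope of unit i with that of
    benchmark unit b. Inputs are within-demeaned data: a length-T
    dependent variable and a T x m regressor matrix for each unit."""
    T, m = Xd_i.shape

    def fit(yd, Xd):
        # Unit-specific LS slope, residual variance, and Q_{i,T}
        beta = np.linalg.lstsq(Xd, yd, rcond=None)[0]
        res = yd - Xd @ beta
        return beta, res @ res / T, Xd.T @ Xd / T

    b_i, s2_i, Q_i = fit(yd_i, Xd_i)
    b_b, s2_b, Q_b = fit(yd_b, Xd_b)
    d = b_i - b_b
    # V estimates the asymptotic variance of sqrt(T)(b_i - b_b)
    V = s2_i * np.linalg.inv(Q_i) + s2_b * np.linalg.inv(Q_b)
    return T * d @ np.linalg.solve(V, d)
```

Under the null that unit i shares the benchmark's slope, and given the assumptions of Sect. 2, the statistic is a nonnegative quadratic form whose bootstrap distribution is obtained from Algorithm SEQBOOT below.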

Let \(W_{(1)}, W_{(2)}, \ldots , W_{(N-1)}\) denote the \(N-1\) order statistics of \(W_{1}, \ldots , W_{b-1},\) \(W_{b+1}, \ldots , W_{N}\) defined as \(W_{(1)} \ge \cdots \ge W_{(N-1)}\). Denote by \(\hat{q}\) the estimated number of units that cannot be pooled with the benchmark. The sequential procedure is carried out as follows.

Algorithm SEQ.

  1. Set \(k = 0\).

  2. Test \(H_0(k)\) against \(H_1(k+1)\) using \(W_{(k+1)}\) as a test statistic. Reject \(H_0(k)\) if \(W_{(k+1)} > c_{\alpha }(W_{(k+1)})\), where \(c_{\alpha }(W_{(k+1)})\) is the appropriate critical value at significance level \(\alpha \).

  3. If \(H_0(k)\) is not rejected, set \(\hat{q} = k\), whereas if \(H_0(k)\) is rejected, set \(k = k+1\) and go back to step 2.

  4. Perform steps 2 and 3 until \(H_0(k)\) cannot be rejected anymore, and set \(\hat{q} = k\). If all null hypotheses up to \(H_0(N-1)\) are rejected, set \(\hat{q} = N\).
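The stopping rule above can be sketched as a short function; this is a skeleton only (the helper name and the assumption that the descending order statistics and their bootstrap critical values are precomputed are ours):

```python
def sequential_q(W_sorted, crit):
    """Algorithm SEQ skeleton.

    W_sorted : descending order statistics W_(1) >= ... >= W_(N-1).
    crit     : crit[k] is the bootstrap critical value for testing H0(k).
    Returns q-hat, the estimated number of non-poolable units.
    """
    for k in range(len(W_sorted)):
        if W_sorted[k] <= crit[k]:
            return k          # H0(k) not rejected: q-hat = k
    return len(W_sorted) + 1  # all N-1 nulls rejected: q-hat = N
```

For example, with \(N = 4\), statistics (10, 5, 2), and a common critical value of 4, the procedure rejects \(H_0(0)\) and \(H_0(1)\) but not \(H_0(2)\), so \(\hat{q} = 2\).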

We now focus on how to obtain appropriate critical values, \(c_{\alpha }(W_{(k+1)})\). Let \(\mathcal {D}_k = \{i : W_i \ge W_{(k)} \}\) denote the set of units for which \(W_i\) is larger than the kth-order statistic. The complement of \(\mathcal {D}_k\) is henceforth denoted \(\mathcal {D}_k^c\). Let us also denote by \(a_{(1:\mathcal {F})}\) the largest element of the set \(\{ a_i : i \in \mathcal {F} \}\). The following bootstrap algorithm will be used to obtain \(c_{\alpha }(W_{(k+1)})\).

Algorithm SEQBOOT.

  1. For each unit, estimate (1) by LS and organize the residuals in a \(T \times N\) matrix \(\hat{{\varepsilon }}\).

  2. Choose a block length, l, and obtain the matrix of bootstrap errors, \({{\varepsilon }}^*\), as described in step 3 of Algorithm BOOT.

  3. Simulate pseudo-data under \(H_0(k)\) as \(y_{i,t}^* = \hat{\theta }_i + {\hat{\beta }}_b^\prime x_{i,t} + \varepsilon _{i,t}^*\).

  4. Obtain \(W_i^*\) by applying \(W_i\) to \((y_{i,t}^*, x_{i,t})\) for all \(i \in \mathcal {D}_k^c\) and obtain the bootstrap test statistic as \(W_{(k+1)}^* = W_{(1:\mathcal {D}_k^c)}^*\).

  5. Repeat steps 2–4 B times.

  6. Select the bootstrap critical value \(c^*_{\alpha }(k+1)\) as the \((1-\alpha )\)-quantile of the ordered \(W_{(k+1)}^{*}\) statistics.

Remark 6

Note that the sequential procedure will not only estimate the number of poolable units, \(\hat{q}\), but will also identify those units. The set of units that are poolable (non-poolable) equals \(\mathcal {D}_{\hat{q}}^c\) (\(\mathcal {D}_{\hat{q}}\)).

In this study, the benchmark is a single unit. This is not necessary. In fact, as pointed out by Kapetanios (2003), \({\hat{\beta }}_b\) may be based on a pooled estimator. The problem with such an approach is that the pooled estimator may not make sense under the alternative of different slopes. Suppose, for example, that there are two groups of units, whose slopes are equal within each group but heterogeneous across groups. In this case, the pooled estimate will likely lie between the true coefficients, and therefore, the sequential procedure is likely to find evidence against poolability for all units. The use of a benchmark unit overcomes this problem.

The question is how to choose the benchmark unit. In some applications, there is a natural candidate, as when evaluating policy for a particular unit or when there is a “dominant” unit (see Pesaran and Chudik 2013). In other applications, the choice of benchmark may be less obvious. However, in many cases the researcher will have some a priori information as to the units that are most likely to be poolable, and in such circumstances, the benchmark unit may be picked at random from that set.

Denote by \(\mathcal {Z} = \{ i : \beta _i = \beta _b, i \ne b \}\) the set of units for which the null hypothesis \(\beta _i = \beta _b\) is true. In addition, let \(G_h\) denote the asymptotic distribution (as \(T \rightarrow \infty \)) of \(W_h\) for \(h \in \mathcal {Z}\). Lemma 1 establishes the asymptotic distributions of the relevant test statistics when testing \(H_0(k)\) against \(H_1(k+1)\).

Lemma 1

Suppose \(H_0(k)\) is tested against \(H_1(k+1)\) using Algorithms SEQ and SEQBOOT. Under Assumptions REGR and ERR, as \(T \rightarrow \infty \) with N held fixed

$$\begin{aligned}&W_{(k+1)} \rightarrow \infty \quad \hbox {if}\, k+1 \le q, \\&W_{(k+1)} \rightarrow _d G_{(1:\mathcal {Z})} \quad \hbox {if}\,k+1 > q, \\&W_{i}^* \rightarrow _{d^*} G_i \quad \hbox {for any}\, q\, \hbox {and all} \,i \in \mathcal {D}_k^{c}, \\&W_{(k+1)}^* \rightarrow _{d^*} G_{\big (1:\mathcal {D}_k^{c}\big )} \quad \hbox {for any}\, k\, \hbox {and} \,q. \end{aligned}$$

Lemma 1 states the asymptotic validity of the bootstrap approach. Here, \(G_{(1:\mathcal {Z})}\) represents the asymptotic distribution of \(W_{(1:\mathcal {Z})}\). Under Assumptions REGR and ERR, \(G_{(1:\mathcal {Z})}\) is unknown and an analytical approach to the sequential test is not feasible. Fortunately, the desired distribution can be obtained via bootstrapping.

Theorem 3

Under the conditions of Lemma 1,

$$\begin{aligned} \lim _{T \rightarrow \infty } P(\hat{q} = k)= & {} 0 \quad \text{ if } k+1 \le q, \\ \lim _{T \rightarrow \infty } P(\hat{q} = k)= & {} 1 - \alpha \quad \text{ if } k = q, \\ \lim _{T \rightarrow \infty } P(\hat{q} = k)\le & {} \alpha \quad \text{ if } k - 1 \ge q. \end{aligned}$$

Remark 7

The first part of Theorem 3 shows that asymptotically the probability of underestimating q is zero; that is, the probability that non-poolable units go undetected tends to zero as \(T \rightarrow \infty \). Also, as is evident from the third part, since the family-wise error rate (FWE) is at most \(\alpha \), the procedure is able to control the overall significance level.

While asymptotically irrelevant, the number of hypotheses tested and the true number of poolable units affect the finite-sample performance of the sequential test. In particular, for the overall procedure to have satisfactory power when the number of non-poolable units is large, the power at each step of the procedure must be close to one. If this is not the case, and if the number of tests is large, the probability of correctly labeling all non-poolable units is likely to be quite low. Consequently, the sequential procedure is mainly suited for panels in which N is relatively small. In Sect. 4, we elaborate on this issue.

As a partial solution to the problem of low power, we may follow the suggestion of Smeekes (2015) and apply the sequential procedure to another set of pre-specified null hypotheses. More specifically, instead of testing each unit against the benchmark, we may skip some units. That is, instead of considering \(q_k = k-1\) for \(k = 1,\ldots ,N-1\), we may consider another set of numbers \(q_1,\ldots ,q_K\), where \(q_k\) is not necessarily equal to \(k-1\). In this case, in the first iteration of Algorithm SEQ, \(H_0(q_1)\) is tested against \(H_1(q_2)\). If \(H_0(q_1)\) is rejected, in the second iteration \(H_0(q_2)\) is tested against \(H_1(q_3)\), and so on. By taking \(q_{k+1}-q_{k} > 1\), we reduce the number of tests conducted, which is likely to lead to an increase in the overall power of the testing procedure. There is, however, one major drawback of this approach. Since not every unit is tested against the benchmark, unless \(q \in \{q_k: k = 1,\ldots ,K\}\), \(\hat{q} = q_k\) cannot be interpreted as the number of non-poolable units, nor can \(\mathcal {D}_{\hat{q}}^c\) be interpreted as the set of poolable units. Instead, the finding that \(\hat{q} = q_k\) should be interpreted as indicating that \(q \in [q_{k-1}, q_{k+1}]\). The effect of skipping units is illustrated in the next section.

4 Monte Carlo simulations

In this section, we investigate the small-sample properties of the above tests using Monte Carlo simulations. We begin by considering \(S^*\). In this case, we focus on the performance of the test across different specifications of the dependence structure of the errors and across different block-length selection rules. We then proceed to evaluate the sequential bootstrap approach. Here, we will focus on the ability of the test to correctly identify the set of poolable units and to control the FWE.

4.1 Simulation design

The following data-generating process will be used to analyze the performance of \(S^*\):

$$\begin{aligned} y_{i,t}= & {} \theta _i + \beta _i x_{i,t} + \varepsilon _{i,t},\\ x_{i,t}= & {} \lambda _{x} f_{x,t} + v_{i,t},\\ f_{x,t}= & {} \rho _{x} f_{x,t-1} + e_{x,t},\\ v_{i,t}= & {} \rho _{x} v_{i,t-1} + u_{i,t}, \end{aligned}$$

where \(\theta _i \sim N(1,1)\), \(\lambda _{x} = \sqrt{0.5}\), \(\rho _{x}=0.5\), \(e_{x,t} \sim N(0,1-\rho _x^2)\), and \(u_{i,t} \sim N(0,(1-\rho _x^2)(1-\lambda _{x}^2))\). Three specifications for \(\varepsilon _{i,t}\) are considered.

  E1. \(\varepsilon _{i,t} \sim N(0,1)\).

  E2. \(\varepsilon _{t} = (\varepsilon _{1,t},\ldots ,\varepsilon _{N,t})^\prime \) is generated according to the following spatial first-order autoregressive model:

    $$\begin{aligned} \varepsilon _{t} = 0.5 W_N \varepsilon _{t} + e_{t}, \end{aligned}$$

    where \(e_t \sim N(0, I_N)\). The spatial weight matrix \(W_N\) is constructed as a first-order contiguity matrix in which each unit, except for the first and last, has one left and one right neighbor (see Anselin et al. 2008). To make the simulation results invariant to the variance of the error terms, \(\varepsilon _{i,t}\) is scaled by the square root of the ith diagonal element of the covariance matrix of \(\varepsilon _t\).

  E3. \(\varepsilon _{i,t}\) is generated according to the following dynamic common factor model:

    $$\begin{aligned} \varepsilon _{i,t}= & {} \lambda _{\varepsilon } f_{\varepsilon , t} + \eta _{i,t},\\ f_{\varepsilon , t}= & {} \rho _{\varepsilon } f_{\varepsilon , t-1} + e_{\varepsilon , t}, \end{aligned}$$

    where \(\lambda _{\varepsilon } = \sqrt{0.5}\), \(\eta _{i,t} \sim N(0,1-\lambda _{\varepsilon }^2)\), and \(e_{\varepsilon , t} \sim N(0,1-\rho _{\varepsilon }^2)\). When \(\rho _{\varepsilon } = 0\), \(\varepsilon _{i,t}\) is serially independent, whereas when \(\rho _{\varepsilon } \ne 0\), this is no longer the case.

Under the null, \(\beta _1 = \cdots = \beta _N = 1\), while under the alternative, \(\beta _{i} = 1\) for \(i = 1,\ldots ,\lfloor N/2\rfloor \) and \(\beta _{i} \sim N(1,0.04)\) for \(i = \lfloor N/2\rfloor + 1,\ldots ,N\), with \(\lfloor x \rfloor \) denoting the integer part of x. All tests are conducted at the 5% nominal level, and we take \(N \in \{5, 25, 50\}\) and \(T \in \{25, 50, 100\}\).
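The data-generating process above is straightforward to simulate. The sketch below covers the case with E3-type errors; the function name, the interface, the placement of the non-poolable units first in the cross section, and the use of a burn-in of 50 observations (as in the text) are our own choices:

```python
import numpy as np

def simulate_panel(N, T, q=0, rho_eps=0.3, burn=50, rng=None):
    """Simulate the Monte Carlo DGP with factor-driven regressors and
    E3 (dynamic common factor) errors; parameter values follow the text."""
    if rng is None:
        rng = np.random.default_rng()
    lam, rho = np.sqrt(0.5), 0.5
    Ttot = T + burn

    # Regressors: x_{i,t} = lam * f_{x,t} + v_{i,t}, both parts AR(1)
    f_x = np.zeros(Ttot)
    e_x = rng.normal(0, np.sqrt(1 - rho**2), Ttot)
    for t in range(1, Ttot):
        f_x[t] = rho * f_x[t - 1] + e_x[t]
    v = np.zeros((N, Ttot))
    u = rng.normal(0, np.sqrt((1 - rho**2) * (1 - lam**2)), (N, Ttot))
    for t in range(1, Ttot):
        v[:, t] = rho * v[:, t - 1] + u[:, t]
    x = lam * f_x + v

    # E3 errors: dynamic common factor plus idiosyncratic noise
    f_e = np.zeros(Ttot)
    e_f = rng.normal(0, np.sqrt(1 - rho_eps**2), Ttot)
    for t in range(1, Ttot):
        f_e[t] = rho_eps * f_e[t - 1] + e_f[t]
    eps = lam * f_e + rng.normal(0, np.sqrt(1 - lam**2), (N, Ttot))

    # Slopes: q non-poolable units (placed first) get U(0, 0.5) slopes
    theta = rng.normal(1, 1, N)
    beta = np.ones(N)
    if q > 0:
        beta[:q] = rng.uniform(0, 0.5, q)

    y = theta[:, None] + beta[:, None] * x + eps
    return y[:, burn:], x[:, burn:], beta  # discard the burn-in
```

The returned arrays have shape (N, T), with the first burn observations discarded to reduce the initial value effect.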

In case of the sequential testing procedure, we consider \(N \in \{10,20\}\). If \(N=10\), then \(q \in \{0, 1, 3, 6\}\), whereas if \(N = 20\), then \(q \in \{0, 1, 3, 6, 12\}\). In both cases, the benchmark unit is taken to be \(b=N\). The errors are generated as in E3 with \(\rho _{\varepsilon } = 0.3\). For all \(i \in \mathcal {Z}\), \(\beta _i = \beta _b = 1\), and for \(i \notin \mathcal {Z}\), \(\beta _i \sim U(0, 0.5)\). The significance level is taken to be 5%. To evaluate the ability of the sequential procedure to control the overall significance level, we calculate the empirical FWE as the proportion of tests with at least one false rejection. As a measure of power, we report the proportion of tests in which the estimated set of poolable units, \(\mathcal {D}_{\hat{q}}^c\), equals the true set, \(\mathcal {Z}\), henceforth referred to as “CP.” We also consider skipping units. In this case, we have to decide on a sequence of numbers \(q_1,\ldots ,q_K\) to be tested. In this section, \(q_{k+1} = q_{k} + \delta \), where \(\delta \in \{3,6\}\).

In a final set of experiments, we compare our sequential bootstrap test with two alternative methods frequently encountered in the literature on multiple testing, the Bonferroni and Holm procedures. Denote by \(\hat{p}_i\) the p value of \(W_i\). In the Bonferroni procedure, \(H_i : \beta _i = \beta _b\) is rejected if \(\hat{p}_i \le \alpha /(N-1)\). The Holm procedure consists of the following steps.

Holm algorithm.

  1. Set \(k = 1\).

  2. Let \(\hat{p}_{(1)} \le \cdots \le \hat{p}_{(N-1)}\) denote the ordered p values and \(H_{(1)}, \ldots , H_{(N-1)}\) the associated null hypotheses. If \(\hat{p}_{(k)} \ge \alpha /(N-k)\), accept \(H_{(k)}, H_{(k+1)}, \ldots , H_{(N-1)}\) and stop. If \(\hat{p}_{(k)} < \alpha /(N-k)\), proceed to step 3.

  3. Reject \(H_{(k)}\), set \(k = k + 1\), and go to step 2.

Compared with the Bonferroni procedure, the Holm criterion for rejecting \(H_{(k)}\) becomes increasingly less strict at larger p values. The Holm procedure is therefore expected to be more powerful.
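The Holm step-down rule can be sketched in a few lines. The helper name `holm_reject` is ours; with K = N − 1 hypotheses, the threshold \(\alpha /(K-k+1)\) at step k coincides with the \(\alpha /(N-k)\) of the algorithm above:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down procedure. Returns a boolean list, True where the
    corresponding null hypothesis is rejected at FWE level alpha."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])  # ascending p values
    reject = [False] * K
    for step, i in enumerate(order):
        if pvals[i] < alpha / (K - step):
            reject[i] = True
        else:
            break  # accept this and all hypotheses with larger p values
    return reject
```

For instance, with p values (0.001, 0.04, 0.03, 0.5) and \(\alpha = 0.05\), the smallest p value is compared with 0.05/4 = 0.0125 and rejected, while the next (0.03) exceeds 0.05/3 and the procedure stops.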

All results are based on 1,000 replications and 499 bootstrap draws. The first 50 time series observations are discarded to reduce the initial value effect.

4.2 Block-length selection rules

An important consideration in practice is the block length, l. In this section, we consider three rules for selecting l.
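To fix ideas about the role of l, recall that in block-bootstrap schemes of the moving-block type, overlapping blocks of l consecutive observations are drawn with replacement and concatenated to form a resample. The following is a minimal sketch of one such resampling step; the function name is ours.

```python
import numpy as np

def moving_block_resample(x, l, rng=None):
    """Draw one moving-block bootstrap resample of the series x.

    Overlapping blocks of length l are drawn with replacement and
    concatenated; the tail is truncated so the resample has the same
    length T as the original series.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    T = len(x)
    n_blocks = int(np.ceil(T / l))
    # Each block starts at a uniformly drawn point in {0, ..., T - l}.
    starts = rng.integers(0, T - l + 1, size=n_blocks)
    resample = np.concatenate([x[s:s + l] for s in starts])
    return resample[:T]
```

Larger values of l preserve more of the serial dependence within each block, which is precisely why the choice of l matters for dependent data.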

In the first rule, \(l = \lfloor 2 T^{1/3}\rfloor \), which amounts to block lengths of 6, 7, and 9 for sample sizes of 25, 50, and 100, respectively. These block lengths are within the range usually encountered in the literature. Although simple, setting l as a function of T means ignoring the covariance structure of the data (see Hall et al. 1995). A data-driven rule is therefore often preferable.

The second rule amounts to setting l according to the Newey and West (1994) automatic procedure for bandwidth selection. This approach not only accounts for the serial correlation of the data, but is also relatively easy to implement and has good small-sample properties (see Gonçalves and White 2005; Gonçalves 2011).
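A minimal sketch of this plug-in rule for the Bartlett kernel, following Newey and West (1994), might look as follows. The function name is ours, and the constants are the standard Bartlett-kernel ones; this is an illustration of the rule, not the exact implementation used in the simulations.

```python
import numpy as np

def newey_west_lag(v):
    """Automatic lag (bandwidth) selection of Newey and West (1994)
    for the Bartlett kernel, applied to a series v."""
    v = np.asarray(v, dtype=float)
    v = v - v.mean()
    T = len(v)
    # Preliminary truncation lag n = O(T^{2/9}), as recommended for
    # the Bartlett kernel.
    n = int(4 * (T / 100.0) ** (2.0 / 9.0))
    # Sample autocovariances sigma_0, ..., sigma_n.
    sigma = np.array([v[j:] @ v[:T - j] / T for j in range(n + 1)])
    s0 = sigma[0] + 2.0 * sigma[1:].sum()
    s1 = 2.0 * np.arange(1, n + 1) @ sigma[1:]
    # Plug-in bandwidth: gamma * T^{1/3}, truncated to an integer.
    gamma = 1.1447 * ((s1 / s0) ** 2) ** (1.0 / 3.0)
    return max(1, int(gamma * T ** (1.0 / 3.0)))
```

Because the selected lag grows with the estimated serial correlation, more persistent errors lead to longer blocks, which is the property the deterministic rule lacks.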

The Newey and West (1994) approach is designed for variance estimation; however, it is more appealing to consider a block-length selection rule designed for distribution estimation. The proposals of Hall et al. (1995) and Lahiri et al. (2007) are examples of nonparametric resampling methods that can be used for this purpose. In particular, while Hall et al. (1995) propose a rule based on subsampling, the rule of Lahiri et al. (2007) is based on the jackknife-after-bootstrap (JAB) method of Lahiri (2002). In this paper, we will employ the latter method because it is computationally more efficient and has been found to perform relatively well in simulations (see Lahiri et al. 2007).

4.3 Monte Carlo results

The results of the Monte Carlo simulations are reported in Tables 1, 2, 3, and 4. Table 1 reports the results for S and \(S^*\), whereas Tables 2, 3, and 4 report the results for the sequential procedure.

Table 1 Size and power of S and \(S^*\)
Table 2 FWE and CP of the sequential test
Table 3 FWE and CP of the sequential test when skipping units
Table 4 FWE and CP of the bootstrap, Bonferroni and Holm procedures

Table 1 provides the empirical size and power of S (the asymptotic test based on iid innovations) in the absence of serial correlation. Looking first at the results for E1, we see that S tends to over-reject when \(N \ge T\), which is in line with earlier findings in the literature (see, e.g., Pesaran and Yamagata 2008). By contrast, \(S^*\) tends to perform quite well for all combinations of N and T. Regarding the block-length selection rules, we see that the three rules considered lead to very similar performance. The deterministic rule is slightly oversized; however, the distortions vanish as T increases. When it comes to the behavior under the alternative hypothesis, all tests seem to have satisfactory power properties, with power rising in both N and T.

While serial correlation makes S oversized, cross-sectional dependence alone (\(\rho _{\varepsilon } = 0\)) makes it undersized. The bootstrap tests, on the other hand, tend to perform well in all cases considered, with good size accuracy and power. As expected, the tests based on the data-driven block-length selection rules generally perform best, especially when T is small. Looking next at the results for the case when \(\rho _{\varepsilon } \ne 0\), we see that the size of S is increasing in \(\rho _{\varepsilon }\). The bootstrap tests, on the other hand, continue to perform quite well when \(\rho _{\varepsilon } = 0.3\), although there is a slight tendency to reject too often when T is relatively small. The distortions are made worse by increasing \(\rho _{\varepsilon }\) to 0.6, in which case the largest distortions are obtained by using the JAB rule. In other words, the use of a data-driven block length does not automatically lead to better size accuracy. Specifically, although the block lengths selected by the data-driven rules are increasing in \(\rho _{\varepsilon }\), they do not increase sufficiently, which in turn leads to size distortions. However, while this means that the best performance is sometimes obtained by using the deterministic rule, the difference is not very large, with the data-driven rules leading to acceptable performance in most cases considered.

To summarize the results so far, we find that S displays substantial size distortions in the presence of cross-sectional and/or serial dependence. The bootstrap tests, on the other hand, generally show small size distortions and maintain satisfactory power in small samples. In particular, the bootstrap tests are robust to the specification of the cross-sectional dependence and also perform reasonably well in the presence of error serial correlation. Since the bootstrap tests are also very simple to implement, they should be well suited for applied work.

We now turn our attention to the results for the sequential testing procedure. Here we focus on the results based on the Newey and West (1994) rule, which are shown in Tables 2, 3, and 4. Looking first at FWE in Table 2, it is evident that the sequential test is able to control the overall significance level very well in small samples. This is true for all values of q. We also see that the ability to identify the true set of poolable units (as measured by CP) depends greatly on q when T is small. However, when T increases, CP approaches 95%, as predicted by our theoretical results.

Table 3 provides the results of the sequential test when skipping units. The first thing to note is that, when T is small, CP is much higher in Table 3 than in Table 2. This result suggests that, by skipping units, we may increase the ability of the sequential test to correctly identify the true set of poolable units. On the other hand, Table 3 reveals that the increase in power comes at the expense of higher FWE, at least for the smaller values of T. This trade-off between size and power also becomes evident when comparing the results for \(\delta = 3,\,6\).

Table 4 compares the bootstrap sequential test with the Bonferroni and Holm procedures. In this case, the errors are generated as in E1. Looking first at FWE, as expected, we see that all procedures are quite successful in controlling the overall significance level. We also see that the ability to identify the true set of poolable units, as measured by CP, differs, sometimes considerably, between the procedures. For example, looking at the results for the case when \(q = 12\) and \(T = 50\), we see that the bootstrap CP is almost twice that of the Bonferroni approach. In general, the bootstrap and Holm procedures are able to identify more non-poolable units than the Bonferroni approach can, and this advantage becomes more important as q increases, which is in accordance with our expectations. We also see that the bootstrap approach outperforms the Holm procedure in terms of CP (when \(q > 0\)) in all but three cases.

In conclusion, the bootstrap sequential test can control FWE in small samples. On the other hand, the ability to find the true set of poolable units depends largely on the sample size and the true number of poolable units. When the number of poolable units is small, and T is small, the sequential procedure will inevitably end up rejecting too few non-poolable units. We show, however, that this problem can be somewhat alleviated by skipping units. Finally, the simulation results indicate that even in the iid case, in which standard multiple testing procedures apply, the bootstrap approach offers improvements in the ability to identify non-poolable units.

5 Conclusions

In this paper, we address the issue of testing for slope homogeneity in panel data models with non-spherical errors. Previous slope homogeneity tests require modeling and estimation of the error dependence structure in order to be practical. This may be problematic in the presence of complicated dependencies of possibly unknown form. Therefore, in this paper we propose a bootstrap test that can be implemented with no specification or estimation of the dependence structure. Yet, the test can accommodate both cross-sectional and serial dependence in the error process. The results of a Monte Carlo study indicate that the bootstrap test has small size distortions and satisfactory power in small samples.