1 Introduction

Heteroskedasticity and cross-sectional and serial correlations in the error terms are important problems in panel regression models. There are two approaches to dealing with them. The first is to use the ordinary least squares (OLS) estimator together with standard errors that are robust to heteroskedasticity and correlations; see, for example, White (1980), Newey and West (1987), Liang and Zeger (1986), Arellano (1987), Driscoll and Kraay (1998), Hansen (2007a) and Vogelsang (2012), among others. A widely used class of robust standard errors is clustered standard errors; see, for example, Petersen (2009), Wooldridge (2010) and Cameron and Miller (2015). Bai et al. (2019) proposed a robust standard error with unknown clusters. In an interesting paper, Abadie et al. (2017) argued for caution in the application of clustered standard errors, since they may give rise to conservative confidence intervals. The second approach is to use the generalized least squares (GLS) estimator, which directly takes heteroskedasticity and cross-sectional and serial correlations into account in the estimation. It is well known that GLS is more efficient than OLS.

This paper focuses on the second approach. For panel models, the underlying covariance matrix involves a large number of parameters, so making GLS operational is important. We thus consider feasible generalized least squares (FGLS). Hansen (2007b) studied FGLS estimation that takes into account serial correlation and clustering in fixed effects panel and multilevel models. His approach requires the cluster structure to be known, which motivates our paper. We allow the cluster structure to be unknown, and control for heteroskedasticity and both serial and cross-sectional correlations by estimating the large error covariance matrix consistently. In a cross-sectional setting, Romano and Wolf (2017) obtained asymptotically valid inference for the FGLS estimator, combined with heteroskedasticity-consistent standard errors, without knowledge of the functional form of the conditional heteroskedasticity. Moreover, Miller and Startz (2018) adapted machine learning methods (i.e., support vector regression) to account for misspecified forms of heteroskedasticity.

In this paper, we consider (i) balanced panel data, (ii) the large-N, large-T case, and (iii) both serial and cross-sectional correlations with an unknown cluster structure. We introduce a modified FGLS estimator that eliminates the cross-sectional and serial correlation bias by proposing a high-dimensional error covariance matrix estimator. In addition, our proposed method is applicable when knowledge of the clusters is not available. Let \(u_t\) be an \(N\times 1\) vector of regression errors, whose definition will be made clear later. Following the idea of Bai and Liao (2017), the FGLS involves estimating an \(NT\times NT\) dimensional inverse covariance matrix \(\Omega ^{-1}\), where

$$\begin{aligned} \Omega =(Eu_{t}u_{s}') \end{aligned}$$

where each block \(Eu_{t}u_{s}'\) is an \(N\times N\) autocovariance matrix. No parametric structure is imposed on the serial or cross-sectional correlations. Assuming weak dependence, we apply nonparametric methods to estimate the covariance matrix. To address the estimation of serial autocorrelations, we employ the idea of Newey–West truncation. In the FGLS setting, this method is equivalent to “banding”, previously proposed by Bickel and Levina (2008b) for estimating large covariance matrices; we use it to band out off-diagonal \(N\times N\) blocks that are far from the diagonal. In addition, to control for the cross-sectional correlation, we assume that each \(N\times N\) block matrix is sparse, potentially resulting from cross-sectional correlations within clusters. We then estimate the blocks by applying the thresholding approach of Bickel and Levina (2008a). Thresholding is applied separately to the \(N\times N\) blocks formed by the time lags \(Eu_{t}u_{t-h}': h=0,1,2,\ldots \), which allows the cluster membership to change over time. A contribution of this paper is the theoretical justification for estimating the large error covariance matrix.

For the FGLS, it is crucial for the asymptotic analysis to prove that the effect of estimating \(\Omega \) is first-order negligible. In the usual low-dimensional settings that involve estimating an optimal weight matrix, such as optimal GMM estimation, it is well known that consistency of the inverse covariance matrix estimator is sufficient for the first-order asymptotic theory, e.g., Hansen (1982), Newey (1990), Newey and McFadden (1994). However, it turns out that when the covariance matrix is high-dimensional, not even the optimal convergence rate for estimating \(\Omega ^{-1}\) is sufficient. In fact, proving the first-order equivalence between the FGLS and the infeasible GLS (which uses the true \(\Omega ^{-1}\)) is a very challenging problem under the large-N, large-T setting. We provide a new theoretical argument to achieve this goal.

The banding and thresholding methods, which we employ in this paper, are two useful regularization methods. In the recent machine learning literature, they have been extensively exploited for estimating high-dimensional parameters. Moreover, in the econometric literature, nonparametric machine learning techniques have proven to be powerful tools; see Bai and Ng (2017), Chernozhukov et al. (2016), Chernozhukov et al. (2017) and Wager and Athey (2018), among others.

The rest of the paper is organized as follows. In Sect. 2, we describe the model and the large error covariance matrix estimator, and introduce the implementation of the FGLS estimator and its limiting distribution. In Sect. 3, we apply our methods to study the US divorce rate problem. Conclusions are provided in Sect. 4. All proofs and Monte Carlo studies are given in the online supplement.

Throughout this paper, let \(\nu _{\min }(A)\) and \(\nu _{\max }(A)\) denote the minimum and maximum eigenvalues of a matrix A, respectively. Also, we use \(\Vert A\Vert = \sqrt{\nu _{\max }(A'A)}\), \(\Vert A\Vert _{1} = \max _{i}\sum _{j}|A_{ij}|\) and \(\Vert A\Vert _{F} = \sqrt{\mathrm{tr}(A'A)}\) as the operator norm, \(\ell _1\)-norm and Frobenius norm of a matrix A, respectively. Note that if A is a vector, then \(\Vert A\Vert =\Vert A\Vert _{F}\) equals the Euclidean norm.

2 Feasible generalized least squares

We consider a linear model

$$\begin{aligned} y_{it} = x_{it}'\beta + u_{it}. \end{aligned}$$
(1)

The model (1) can be stacked and represented in full matrix notation as

$$\begin{aligned} Y = X\beta + U, \end{aligned}$$
(2)

where \(Y = (y_1',\cdots , y_T')'\) is the \(NT \times 1\) vector of \(y_{it}\), with each \(y_t\) being an \(N\times 1\) vector; \(X = (x_1',\cdots , x_T')'\) is the \(NT \times d\) matrix of \(x_{it}\), with each \(x_t\) being an \(N\times d\) matrix; and \(U = (u_1',\cdots , u_T')'\) is the \(NT \times 1\) vector of \(u_{it}\), with each \(u_t\) being an \(N\times 1\) vector.

Let \(\Omega = (Eu_{t}u_{s}')\) be an \(NT \times NT\) matrix, consisting of many block matrices. The (t,s)th block is the \(N \times N\) covariance matrix \(Eu_tu_s'\). We consider the following (infeasible) GLS estimator of \(\beta \):

$$\begin{aligned} \widetilde{\beta }_{GLS}^{inf} = (X'\Omega ^{-1}X)^{-1}X'\Omega ^{-1}Y. \end{aligned}$$
(3)

Note that \(\Omega \) is a high-dimensional conditional covariance matrix, which is very difficult to estimate. We aim to achieve the following: (i) obtain a “good” estimator of \(\Omega ^{-1}\) that allows an arbitrary form of weak dependence in \(u_{it}\), and (ii) show that the effect of replacing \(\Omega ^{-1}\) by \(\widehat{\Omega }^{-1}\) is asymptotically negligible.

We start with a population approximation of \(\Omega \) in order to gain intuition. Then, we propose an estimator of \(\Omega \) that takes into account both serial and cross-sectional correlations.

2.1 Population approximation

We start with a “banding” approximation to control serial correlations. Recall that \(\Omega = (Eu_{t}u_{s}')\), where the (t,s)th block is \(Eu_{t}u_{s}'\). Assuming serial stationarity and a strong mixing condition, \(Eu_{t}u_{s}'\) depends on (t,s) only through \(h=t-s\). Specifically, with slight abuse of notation, we can write \(\Omega _{t,s} = \Omega _{h} = Eu_{t}u_{t-h}'\). Note that for \(i\ne j\), it is possible that \(Eu_{it}u_{j,t-h}\ne Eu_{i,t-h}u_{jt}\), so \(\Omega _{h}\) is possibly non-symmetric for \(h>0\). On the other hand, \(\Omega \) is symmetric because \(\Omega _{s,t} = \Omega _{t,s}'\). The diagonal blocks are all equal to \(\Omega _{0}=Eu_{t}u_{t}'\), while the magnitudes of the elements of the off-diagonal blocks \(\Omega _{h}=Eu_{t}u_{t-h}'\) decay to zero as \(|h|\rightarrow \infty \) under the weak serial dependence assumption.

In the Newey–West spirit, \(\Omega \) can be approximated by \(\Omega ^{NW} = (\Omega _{t,s}^{NW})\), where each block can be written as \(\Omega _{t,s}^{NW}= \Omega _{h}^{NW}\) for \(h=t-s\). Here, \(\Omega _{h}^{NW}\) is an \(N \times N\) block matrix, defined as:

$$\begin{aligned} \Omega _{h}^{NW} = {\left\{ \begin{array}{ll}Eu_{t}u_{t-h}', &{} \text {if} \;\; |h| \le L\\ 0, &{} \text {if} \;\; |h| > L, \end{array}\right. } \end{aligned}$$

for some pre-determined \(L \rightarrow \infty \). For instance, as suggested by Newey and West (1994), we can set L equal to \(4(T/100)^{2/9}\). Note that \(\Omega _{h}^{NW} =\Omega _{-h}^{NW'}\). We regard \(\Omega ^{NW} = (\Omega _{h}^{NW})\) as the “population banding approximation.”

Next, we focus on the \(N\times N\) block matrix \(\Omega _{h}=Eu_{t}u_{t-h}'\) to control cross-sectional correlations. Based on the intuition that \(u_{it}\) is cross-sectionally weakly dependent, we assume that \(\Omega _{h}\) is a sparse matrix, that is, \(\Omega _{h,ij}=Eu_{it}u_{j,t-h}\) is “small” for “many” pairs (i,j). Then, \(\Omega _{h}\) can be approximated by a sparse matrix \(\Omega _{h}^{BL} = (\Omega _{h,ij}^{BL})_{N \times N}\) (Bickel and Levina (2008a)), where

$$\begin{aligned} \Omega _{h,ij}^{BL} = {\left\{ \begin{array}{ll}Eu_{it}u_{j,t-h}, &{} \text {if} \;\; |Eu_{it}u_{j,t-h}| >\tau _{ij}\\ ~ 0, &{} \text {if} \;\; |Eu_{it}u_{j,t-h}| \le \tau _{ij}, \end{array}\right. } \end{aligned}$$

for some pre-determined threshold \(\tau _{ij} \rightarrow 0\). We regard \(\Omega _{h}^{BL}\) as the “population sparse approximation.”

In summary, we approximate \(\Omega \) by an \(NT\times NT\) matrix \(({\widetilde{\Omega }}^{NT}_{t,s})\), where each block \(\widetilde{\Omega }^{NT}_{t,s}\) is an \(N\times N\) matrix, defined as: for \(h=t-s\),

$$\begin{aligned} {\widetilde{\Omega }}^{NT}_{t,s}:= {\left\{ \begin{array}{ll} \Omega _h^{BL} , &{} \text {if} \;\; |h| \le L \\ 0, &{} \text {if} \;\; |h| > L. \end{array}\right. } \quad \end{aligned}$$

Therefore, we use “banding” to control the serial correlation, and “sparsity” to control the cross-sectional correlation.

2.2 Implementation of feasible GLS

2.2.1 The estimator of \(\Omega \) and FGLS

Given the intuition of the population approximation, we construct the large covariance estimator as follows. First, we denote the OLS estimator of \(\beta \) by \(\widehat{\beta }_{OLS}\) and the corresponding residuals by \({\widehat{u}}_{it} = y_{it} - x_{it}'\widehat{\beta }_{OLS}\).

Now, we estimate the \(N\times N\) block matrix \(\Omega _{h}=Eu_{t}u_{t-h}'\). To do so, let

$$\begin{aligned} \widetilde{R}_{h,ij} ={\left\{ \begin{array}{ll} \frac{1}{T}\sum _{t=h+1}^{T}{\widehat{u}}_{it}{\widehat{u}}_{j,t-h}, &{} \text { if } h\ge 0\\ \frac{1}{T}\sum _{t=1}^{T+h}{\widehat{u}}_{it}{\widehat{u}}_{j,t-h}, &{} \text { if } h<0 \end{array}\right. },\quad \text { and } \widetilde{\sigma }_{h,ij} = {\left\{ \begin{array}{ll} \widetilde{R}_{h,ii}, &{} \text {if} \;\; i=j \\ s_{ij}(\widetilde{R}_{h,ij}), &{} \text {if} \;\; i \ne j, \end{array}\right. } \end{aligned}$$

where \(s_{ij}(\cdot ) : {\mathbb {R}} \rightarrow {\mathbb {R}}\) is a “soft-thresholding function” with an entry-dependent threshold \(\tau _{ij}\) such that

$$\begin{aligned} s_{ij}(z) = \text {sgn}(z)(|z| - \tau _{ij})_{+}, \end{aligned}$$

where \((x)_{+} = x\) if \(x\ge 0\), and zero otherwise. Here, \(\text {sgn}(\cdot )\) denotes the sign function; other thresholding functions, e.g., hard thresholding, are also possible. For the threshold value, we specify

$$\begin{aligned} \tau _{ij} = M\gamma _{T}\sqrt{|\widetilde{R}_{0,ii}|\;|\widetilde{R}_{0,jj}|}, \end{aligned}$$

for some pre-determined value \(M>0\), where \(\gamma _{T} = \sqrt{\frac{\log (LN)}{T}}\) is such that \(\max _{h \le L}\max _{i,j \le N}|\widetilde{R}_{h,ij}-Eu_{it}u_{j,t-h}| = O_{P}(\gamma _{T})\). Note that here we use an entry-dependent threshold \(\tau _{ij}\), which may vary across (i,j). Then, define

$$\begin{aligned} \widetilde{\Omega }_{h} = (\widetilde{\sigma }_{h,ij})_{N \times N}. \end{aligned}$$
(4)

Next, we define the (t,s)th block \(\widehat{\Omega }_{t,s}\) as an \(N \times N\) matrix: for \(h=t-s\),

$$\begin{aligned} \widehat{\Omega }_{t,s} = {\left\{ \begin{array}{ll} \omega (|h|,L)\widetilde{\Omega }_{h}, &{} \text {if} \;\; |h| \le L\\ 0, &{} \text {if} \;\; |h| > L. \end{array}\right. } \end{aligned}$$

Here, \(\omega (h,L)\) is a kernel function (see Andrews (1991) and Newey and West (1994)). We let \(\omega (h,L) = 1-h/(L+1)\) be the Bartlett kernel, where L is the bandwidth. Our final estimator of \(\Omega \) is the \(NT\times NT\) matrix:

$$\begin{aligned} \widehat{\Omega }=(\widehat{\Omega }_{t,s}). \end{aligned}$$

Here, \(\widehat{\Omega }\) is a nonparametric estimator, which does not impose a parametric structure on \(\Omega \).
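For concreteness, the construction of \(\widehat{\Omega }\) can be sketched in a few lines of code. The following is a minimal numpy sketch, not production code: the function name and the \(T\times N\) residual layout are our own conventions, and forming the full \(NT\times NT\) matrix explicitly is practical only for moderate N and T.

```python
import numpy as np

def banded_thresholded_cov(U_hat, L, M):
    """Sketch of the estimator Omega_hat = (Omega_hat_{t,s}) of Sect. 2.2.1.

    U_hat : (T, N) array of OLS residuals u_hat_{it}.
    L     : Newey-West bandwidth; blocks with |t - s| > L are set to zero.
    M     : threshold constant in tau_{ij} = M * gamma_T * sqrt(R0_ii * R0_jj).
    """
    T, N = U_hat.shape
    gamma_T = np.sqrt(np.log(L * N) / T)

    # Lagged sample autocovariance blocks: R[h][i, j] = (1/T) sum_t u_it u_{j,t-h}
    R = [U_hat[h:].T @ U_hat[:T - h] / T for h in range(L + 1)]

    # Entry-dependent thresholds tau_{ij}
    d0 = np.abs(np.diag(R[0]))
    tau = M * gamma_T * np.sqrt(np.outer(d0, d0))

    # Soft-threshold the off-diagonal entries of each lag block, as in Eq. (4)
    Omega_lag = []
    for h in range(L + 1):
        S = np.sign(R[h]) * np.maximum(np.abs(R[h]) - tau, 0.0)
        np.fill_diagonal(S, np.diag(R[h]))   # diagonal entries kept unthresholded
        Omega_lag.append(S)

    # Assemble the NT x NT matrix with Bartlett weights w(h, L) = 1 - |h|/(L+1)
    Omega = np.zeros((N * T, N * T))
    for t in range(T):
        for s in range(max(0, t - L), min(T, t + L + 1)):
            h = t - s
            w = 1.0 - abs(h) / (L + 1)
            block = Omega_lag[h] if h >= 0 else Omega_lag[-h].T  # Omega_{-h} = Omega_h'
            Omega[t * N:(t + 1) * N, s * N:(s + 1) * N] = w * block
    return Omega
```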

Finally, given \(\widehat{\Omega }\), we propose the feasible GLS (FGLS) estimator of \(\beta \) as

$$\begin{aligned} \widehat{\beta }_{FGLS} = [X'\widehat{\Omega }^{-1}X]^{-1}X'\widehat{\Omega }^{-1}Y. \end{aligned}$$

Note that the FGLS estimator defined above leaves two quantities for applied researchers to specify: (i) the constant \(M>0\) in the threshold \(\tau _{ij}\), and (ii) the Newey–West bandwidth L. We discuss the choice of these two quantities in Sect. 2.2.2 below.
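Given \(\widehat{\Omega }\), computing \(\widehat{\beta }_{FGLS}\) and its standard errors (cf. Theorem 2.1 below) reduces to two linear solves. Here is a hedged sketch continuing the numpy conventions above; the Cholesky factorization assumes \(\widehat{\Omega }\) has been tuned to be positive definite, as discussed in Sect. 2.2.2.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fgls(X, Y, Omega_hat):
    """FGLS estimate beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.

    X : (N*T, d) stacked regressors; Y : (N*T,) stacked outcomes,
    ordered time-major so that rows match the blocks of Omega_hat.
    """
    cf = cho_factor(Omega_hat)              # requires Omega_hat positive definite
    Oinv_X = cho_solve(cf, X)               # Omega_hat^{-1} X, no explicit inverse
    Oinv_Y = cho_solve(cf, Y)
    XtOX = X.T @ Oinv_X
    beta = np.linalg.solve(XtOX, X.T @ Oinv_Y)
    avar = np.linalg.inv(XtOX)              # estimated Avar = (X' Omega_hat^{-1} X)^{-1}
    se = np.sqrt(np.diag(avar))             # asymptotic standard errors
    return beta, se
```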

Remark 2.1

(Universal thresholding) We apply thresholding separately to the \(N\times N\) blocks \((\widetilde{\sigma }_{h,ij})_{N \times N}\), which are the estimated lagged blocks of \(Eu_{t}u_{t-h}': h=0,1,2,\ldots \). This allows the cluster membership to change over time; that is, the identities of the zero and nonzero elements of \(Eu_{t}u_{t-h}'\) can change with h. If it is known that the cluster membership (i.e., the identities of the nonzero elements) is time-invariant, then one could set \({\widetilde{\sigma }}_{h,ij}=0\) if \(\max _{h\le L}|{\widetilde{R}}_{h,ij}|\le \tau _{ij}\) for \(i\ne j\). This would potentially increase the finite sample accuracy of identifying the cluster membership.

2.2.2 Choice of tuning parameters

Our suggested covariance matrix estimator \(\widehat{\Omega }\) requires the choice of two tuning parameters, the bandwidth L and the threshold constant M. We write \(\widehat{\Omega }(M,L)=\widehat{\Omega }\) to make this dependence explicit. First, to choose the bandwidth L, we suggest using \(L^{*} = 4(T/100)^{2/9}\), as proposed by Newey and West (1994). For small T, we also recommend \(L \le 3\).
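As a quick illustration of this rule (our own snippet), the annual panel used in Sect. 3 has \(T=33\), for which the rule gives \(L=3\):

```python
T = 33                                    # e.g., years 1956-1988 as in Sect. 3
L_star = round(4 * (T / 100) ** (2 / 9))  # Newey-West rule: 4*(0.33)^{2/9} ~ 3.1, so L = 3
```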

As for the choice of the thresholding constant M, our recommended rule of thumb is any constant in the interval [0.5, 2]. Based on extensive simulation studies with various values of N and T, we find that \(M=1.8\) is a universally good choice.

Alternatively, M can be chosen through multifold cross-validation. To describe the procedure, we divide the data into \(P=\log (T)\) blocks \(J_1,\ldots ,J_P\), each of length \(T/\log (T)\), and take each of the P blocks in turn as the validation set. At the pth split, we denote by \(\widetilde{\Omega }_{0}^{p}\) the sample covariance matrix based on the validation set, defined by \(\widetilde{\Omega }_{0}^{p} = |J_{p}|^{-1}\sum _{t \in J_{p}}{\widehat{u}}_{t}{\widehat{u}}_{t}'\). Let \(\widetilde{\Omega }_{0}^{S,p}(M)\) be the thresholding estimator with threshold constant M using the training data \(\{{\widehat{u}}_{t}\}_{t \notin J_{p}}\). Finally, we choose the constant \(M^{*}\) by minimizing the cross-validation objective function

$$\begin{aligned} M^* = \arg \min _{c<M<\bar{C}}\frac{1}{P}\sum _{p=1}^{P}\Vert \widetilde{\Omega }_{0}^{S,p}(M)-\widetilde{\Omega }_{0}^{p}\Vert _{F}^2, \end{aligned}$$

where \({\bar{C}}\) is a large constant such that \(\widetilde{\Omega }_{0}^{S}({\bar{C}})\) is a diagonal matrix, which can be fixed at, e.g., \({\bar{C}}=3\), and c is a constant that guarantees the positive definiteness of \(\widehat{\Omega }(M,L)\) for all \(M>c\): for each fixed L,

$$\begin{aligned} c= \inf [M>0: \lambda _{\min }\{\widehat{\Omega }(C,L)\}>0, \forall C>M]. \end{aligned}$$

Here, \(\widetilde{\Omega }_{0}^{S}(M)\) is the soft-thresholded estimator defined in Eq. (4). The resulting estimator of \(\Omega \) is \({\widehat{\Omega }}(M^*,L^*)\). To determine c, one can plot \(\lambda _{\min }\{\widehat{\Omega }(C,L)\}\) as a function of C, fixing \(L=L^*\), and visually determine c.
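A hedged sketch of this cross-validation loop follows; the function and variable names are ours, the candidate grid and the lower bound c are taken as inputs (c determined as described above), and for simplicity the sketch cross-validates only the lag-0 block, as in the objective function.

```python
import numpy as np

def choose_M_by_cv(U_hat, candidates, c):
    """Pick the threshold constant M by multifold cross-validation.

    U_hat      : (T, N) array of residuals.
    candidates : grid of M values in (c, C_bar).
    c          : lower bound guaranteeing positive definiteness (Sect. 2.2.2).
    """
    T, N = U_hat.shape
    P = max(int(np.log(T)), 2)                # number of blocks, P = log(T)
    blocks = np.array_split(np.arange(T), P)  # contiguous blocks J_1, ..., J_P
    best_M, best_loss = None, np.inf
    for M in candidates:
        if M <= c:
            continue                          # keep Omega_hat(M, L) positive definite
        loss = 0.0
        for J in blocks:
            train = np.setdiff1d(np.arange(T), J)
            R_tr = U_hat[train].T @ U_hat[train] / len(train)  # training lag-0 covariance
            R_va = U_hat[J].T @ U_hat[J] / len(J)              # validation covariance
            gamma = np.sqrt(np.log(N) / len(train))
            d = np.abs(np.diag(R_tr))
            tau = M * gamma * np.sqrt(np.outer(d, d))
            S = np.sign(R_tr) * np.maximum(np.abs(R_tr) - tau, 0.0)
            np.fill_diagonal(S, np.diag(R_tr))                 # threshold off-diagonals only
            loss += np.sum((S - R_va) ** 2)                    # squared Frobenius norm
        if loss / P < best_loss:
            best_M, best_loss = M, loss / P
    return best_M
```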

In summary, Table 1 summarizes the recommended quantities for implementing the proposed FGLS estimator.

Table 1 Recommended choices for implementations

2.2.3 Incorporating known clusters

Note that an advantage of the method proposed in this paper is that it does not assume known cluster information (i.e., the number of clusters and the cluster membership). On the other hand, when clustering information is available, the method can be modified to incorporate it; this is particularly suitable when the number of clusters is small and the size of each cluster is large.

For example, let \(C_1,...,C_G\) be disjoint subsets of \(\{1,...,N\}\) representing known clusters, such that \(u_{it}\) and \(u_{js}\) are uncorrelated whenever i and j belong to different clusters, for any (t,s). Then we can re-arrange the \(N\times N\) matrix \(\Omega _h= Eu_tu_{t-h}'\) so that it decomposes into G disjoint diagonal blocks, with all off-diagonal blocks equal to zero:

$$\begin{aligned} \Omega _h = \begin{pmatrix} \Omega _{h, 1} &{}&{} \\ &{}\ddots &{}\\ &{}&{}\Omega _{h,G} \end{pmatrix}. \end{aligned}$$

It is assumed that G is small while the size of each diagonal block matrix is large. Within the gth (\(g\le G\)) diagonal block matrix, say \(\Omega _{h,g}\), we apply thresholding to further reduce the dimensionality. So we estimate \(\Omega _{h,g}\) by \({\widetilde{\Omega }}_{h,g}=({\widetilde{\sigma }}_{h,g, ij})\), where

$$\begin{aligned} \widetilde{\sigma }_{h, g, ij} = {\left\{ \begin{array}{ll} \widetilde{R}_{h,ii}, &{} \text {if} \;\; i=j, \text { and } i, j \in C_g \\ s_{ij}(\widetilde{R}_{h,ij}), &{} \text {if} \;\; i \ne j, \text { and } i, j \in C_g. \end{array}\right. } \end{aligned}$$

Putting these estimated diagonal blocks together, we obtain \({\widetilde{\Omega }}_h\), the estimated \(\Omega _h\).

The within-cluster thresholding thus allows unknown correlations within each cluster. In contrast, conventional clustered standard errors lose many degrees of freedom when the size of a cluster is too large (because each cluster is effectively treated as a “single observation”), resulting in conservative confidence intervals. See Cameron and Miller (2015) for further discussion.

Moreover, when the number of clusters is large and the size of each cluster is small, we are in the usual setting of clustered standard errors. One then does not need to apply thresholding, as the known clusters naturally form small diagonal blocks of \(\Omega _h\). Because the sizes of these blocks are small, sufficient degrees of freedom are retained, and it is straightforward to estimate \(\Omega _h\).
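To make the modification concrete, here is a hedged sketch (our own helper, not the authors' code) of the within-cluster thresholding applied to one lag block: entries across known clusters are zeroed out, and entries within each cluster are soft-thresholded as before.

```python
import numpy as np

def threshold_within_clusters(R_h, clusters, tau):
    """Estimate Omega_h under known disjoint clusters C_1, ..., C_G.

    R_h      : (N, N) sample lag-h autocovariance block.
    clusters : list of integer index arrays partitioning {0, ..., N-1}.
    tau      : (N, N) entry-dependent thresholds tau_{ij}.
    """
    out = np.zeros_like(R_h)                 # cross-cluster entries stay zero
    for C in clusters:
        idx = np.ix_(C, C)
        block = R_h[idx]
        out[idx] = np.sign(block) * np.maximum(np.abs(block) - tau[idx], 0.0)
    np.fill_diagonal(out, np.diag(R_h))      # diagonal kept unthresholded
    return out
```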

2.3 The effect of \(\widehat{\Omega }^{-1}-\Omega ^{-1}\)

A key step in proving the asymptotic properties of \(\widehat{\beta }_{FGLS}\) is to show that it is asymptotically equivalent to \(\widetilde{\beta }_{GLS}^{inf}\), that is:

$$\begin{aligned} \frac{1}{\sqrt{NT}}X'(\widehat{\Omega }^{-1}-\Omega ^{-1})U = o_{P}(1). \end{aligned}$$
(5)

In the usual low-dimensional settings that involve estimating an optimal weight matrix, such as optimal GMM estimation, it is well known that consistency of the inverse covariance matrix estimator is sufficient for the first-order asymptotic theory, e.g., Hansen (1982), Newey (1990), Newey and McFadden (1994). It turns out that when the covariance matrix is high-dimensional, not even the optimal convergence rate of \(\Vert \widehat{\Omega } - \Omega \Vert \) is sufficient. In fact, proving equation (5) is a very challenging problem. In the general case when both cross-sectional and serial correlations are present, our strategy is to use a careful expansion of \(\frac{1}{\sqrt{NT}}X'(\widehat{\Omega }^{-1}-\Omega ^{-1})U\). We proceed in two steps:

  1. Step 1:

    Show that \(\frac{1}{\sqrt{NT}}X'(\widehat{\Omega }^{-1}-\Omega ^{-1})U = \frac{1}{\sqrt{NT}}W'(\widehat{\Omega }-\Omega )\varepsilon + o_{P}(1),\) where \(W = \Omega ^{-1}X\), and \(\varepsilon = \Omega ^{-1}U\).

  2. Step 2:

    Show that \(\frac{1}{\sqrt{NT}}W'(\widehat{\Omega }-\Omega )\varepsilon = o_{P}(1)\).

Now, suppose \(\omega (h,L) =1\) and \(\Omega \approx \Omega ^{NW}\), and let \(A_{b_h} = \{(i,j) : |Eu_{it}u_{j,t-h}| \ne 0\}\) and \(A_{s_h} = \{(i,j) : |Eu_{it}u_{j,t-h}| = 0\}\). As for Step 2, we shall show that

$$\begin{aligned} \frac{1}{\sqrt{NT}}W'(\widehat{\Omega }-\Omega )\varepsilon \thickapprox \frac{1}{\sqrt{NT}}\sum _{|h| \le L}\sum _{i,j \in A_{b_h}}\sum _{t=h+1}^{T}w_{it}\varepsilon _{j,t-h}\frac{1}{T}\sum _{s=h+1}^{T}(u_{is}u_{j,s-h}-Eu_{it}u_{j,t-h}). \end{aligned}$$
(6)

Here, \(w_{it}\) is defined such that we can write \(W = (w_1',\cdots , w_T')'\) with each \(w_t\) being an \(N\times d\) matrix of \(w_{it}\); \(\varepsilon _{it}\) is defined similarly. Proving that (6) is \(o_P(1)\) in the presence of both serial and cross-sectional correlations is technically very challenging. We thus directly assume it to be \(o_{P}(1)\) as a high-level condition (see Assumption 2.4 in Sect. 2.4 below). To appreciate the need for this high-level condition, consider the following simple example.

A simple example To illustrate the key technical issue, consider a simple and ideal case where \(u_{it}\) is known and independent across both i and t, but with cross-sectional heteroskedasticity. In this case, the covariance matrix of the \(NT \times 1\) vector U is diagonal, with diagonal elements \(\sigma _{i}^2 = Eu_{it}^2\):

$$\begin{aligned} \Omega = \begin{pmatrix} D &{}&{} \\ &{}\ddots &{}\\ &{}&{}D \end{pmatrix}, \text { where } D = \begin{pmatrix} \sigma _{1}^2 &{}&{} \\ &{}\ddots &{}\\ &{}&{}\sigma _{N}^2 \end{pmatrix}. \end{aligned}$$

Then, a natural estimator for \(\Omega \) is

$$\begin{aligned} \widehat{\Omega } = \begin{pmatrix} {\widehat{D}} &{} &{} \\ &{} \ddots &{}\\ &{}&{}{\widehat{D}} \end{pmatrix}, \text { where } {\widehat{D}} = \begin{pmatrix} \widehat{\sigma }_{1}^2 &{} &{} \\ &{}\ddots &{}\\ &{}&{}\widehat{\sigma }_{N}^2 \end{pmatrix}, \end{aligned}$$

and \(\widehat{\sigma }_{i}^2=\frac{1}{T}\sum _{t=1}^{T}u_{it}^2\), because \(u_{it}\) is known. Then, the GLS estimator becomes:

$$\begin{aligned} \left( \frac{1}{NT}\sum _{i=1}^{N}\sum _{t=1}^{T}x_{it}x_{it}'\widehat{\sigma }_{i}^{-2}\right) ^{-1}\frac{1}{NT}\sum _{i=1}^{N}\sum _{t=1}^{T}x_{it}y_{it}\widehat{\sigma }_{i}^{-2}. \end{aligned}$$

A key step is to prove that the effect of estimating D is asymptotically negligible:

$$\begin{aligned} \frac{1}{\sqrt{NT}}\sum _{i=1}^{N}\sum _{t=1}^{T}x_{it}u_{it}(\widehat{\sigma }_{i}^{-2}-\sigma _{i}^{-2}) = o_{P}(1). \end{aligned}$$
(7)

It can be shown that the problem reduces to proving:

$$\begin{aligned} A \equiv \frac{1}{\sqrt{NT}}\sum _{i=1}^{N}\sum _{t=1}^{T}x_{it}u_{it}\sigma _{i}^{-2}\left( \frac{1}{T}\sum _{s=1}^{T}(u_{is}^2-Eu_{is}^2)\right) \sigma _{i}^{-2} = o_{P}(1). \end{aligned}$$
(8)

Under the simplified conditions of this example (\(u_{it}\) independent across both i and t), it is straightforward to calculate \(\mathrm{var}(A)\) and show that it converges to zero as \(N,T\rightarrow \infty \), regardless of whether \(N<T\) or not.

As for EA, straightforward calculations yield

$$\begin{aligned} EA = \frac{\sqrt{NT}}{T}\frac{1}{NT}\sum _{i=1}^{N}\sum _{t=1}^{T}E(x_{it}E(u_{it}^3|x_{it}))\sigma _{i}^{-4}. \end{aligned}$$

Generally, if \(u_{it}|x_{it}\) is non-Gaussian and asymmetric, then \(E(u_{it}^3|x_{it}) \ne 0\), and we require \(N/T \rightarrow 0\) to have \(EA \rightarrow 0\). Thus, to allow for non-Gaussian and asymmetric conditional distributions, \(N=o(T)\) turns out to be required in the GLS setting.
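To spell out the calculation behind the expression for EA (a sketch under the assumptions of this example, together with the added assumption \(E(u_{it}|x_{it})=0\)): since \((x_{it},u_{it})\) is independent across t, for \(s\ne t\) we have \(E[x_{it}u_{it}(u_{is}^2-Eu_{is}^2)] = E[x_{it}u_{it}]\,E[u_{is}^2-Eu_{is}^2]=0\), so only the \(s=t\) terms contribute:

$$\begin{aligned} EA = \frac{1}{\sqrt{NT}}\frac{1}{T}\sum _{i=1}^{N}\sum _{t=1}^{T}E\big [x_{it}u_{it}(u_{it}^2-Eu_{it}^2)\big ]\sigma _{i}^{-4} = \frac{\sqrt{NT}}{T}\frac{1}{NT}\sum _{i=1}^{N}\sum _{t=1}^{T}E\big (x_{it}E(u_{it}^3|x_{it})\big )\sigma _{i}^{-4}, \end{aligned}$$

where the second equality uses \(E[x_{it}u_{it}]Eu_{it}^2=0\) and \(E[x_{it}u_{it}^3]=E[x_{it}E(u_{it}^3|x_{it})]\).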

We do not explicitly impose \(N=o(T)\) as a formal assumption in this paper, but instead impose Assumption 2.4. On the one hand, when the distribution of \(u_{it}\) is symmetric, we do not require \(N=o(T)\): as shown in the above example, \(E(u_{it}^3|x_{it})=0\) is sufficient for \(EA\rightarrow 0\) and is satisfied by symmetric distributions. On the other hand, when \(u_{it}\) is non-symmetric, Assumption 2.4 implicitly requires \(N=o(T)\). Note that \(N=o(T)\) is a strong assumption in many microeconomic panel data applications. But as the simple example illustrates, if \(u_{it}|x_{it}\) is not symmetric, it is required for feasible GLS even when \(\Omega \) is diagonal. One possible approach to weakening this assumption is to remove the higher-order bias from \({\widehat{\Omega }}\). However, higher-order debiasing is a complicated procedure in the presence of general weak dependence, and we leave it for future research.

2.4 Asymptotic results of FGLS

We impose the following conditions, which regulate the sparsity and the serial weak dependence.

Assumption 2.1

  1. (i)

    \(\{u_{t},x_{t}\}_{t\ge 1}\) is strictly stationary. In addition, each \(u_{t}\) has zero mean vector, and \(\{u_{t}\}_{t\ge 1}\) and \(\{x_{t}\}_{t\ge 1}\) are independent.

  2. (ii)

    There are constants \(c_{1}, c_{2}> 0\) such that \(\lambda _{\min }(\Omega _{h})>c_{1}\) and \(\Vert \Omega _{h}\Vert _{1} < c_{2}\) for each fixed h.

  3. (iii)

Exponential tail: There exist \(r_{1}, r_{2}>0\) and \(b_{1}, b_{2} > 0\) such that for any \(s > 0\), \(i\le N\) and \(l \le d\),

    $$\begin{aligned} P(|u_{it}|> s) \le \exp (-(s/b_{1})^{r_1}),\quad P(|x_{it,l}|>s) \le \exp (-(s/b_2)^{r_2}). \end{aligned}$$
  4. (iv)

    Strong mixing: There exists \(\kappa \in (0,1)\) such that \( r_{1}^{-1}+r_{2}^{-1}+\kappa ^{-1}>1\), and \(C>0\) such that for all \(T>0\),

    $$\begin{aligned} \sup \limits _{A\in \mathcal {F}_{-\infty }^0, B \in \mathcal {F}_{T}^{\infty }}|P(A)P(B)-P(AB)| < \exp (-CT^{\kappa }), \end{aligned}$$

    where \(\mathcal {F}_{-\infty }^0\) and \(\mathcal {F}_{T}^{\infty }\) denote the \(\sigma \)-algebras generated by \(\{(x_{t},u_{t}) : t \le 0\}\) and \(\{(x_{t},u_{t}) : t \ge T\}\), respectively.

Condition (ii) requires that \(\Omega _{h}\) be well conditioned. Condition (iii) ensures a Bernstein-type inequality for weakly dependent data, and requires the underlying distributions to be thin-tailed. Condition (iv) is the standard \(\alpha \)-mixing condition, adapted to the large-N panel. In addition, we impose the following regularity conditions.

Assumption 2.2

  1. (i)

    There exists a constant \(C>0\) such that for all \(i\le N\) and \(t\le T\), \(E\Vert x_{it}\Vert ^{4}<C\) and \(Eu_{it}^{4}<C\).

  2. (ii)

    Define \(\xi _{T}(L) = \max _{t \le T}\sum _{|h| > L} \Vert Eu_tu_{t-h}'\Vert \). Then, \(\xi _{T}(L)\rightarrow 0.\)

  3. (iii)

    Define \(f_{T}(L)=\max _{t \le T}\sum _{|h|\le L}\Vert Eu_{t}u_{t-h}'(1-\omega (|h|,L))\Vert \). Then \(f_{T}(L) \rightarrow 0\).

Assumption 2.2 allows us to derive the convergence rate of the covariance matrix estimator. Condition (ii) extends the standard weak serial dependence condition in the panel data literature to the high-dimensional case; it allows us to employ the banding (Newey–West truncation) procedure. Condition (iii) is satisfied by various kernel functions used for HAC-type estimators. For the Bartlett kernel, for example,

$$\begin{aligned} \max _{t \le T}\sum _{|h|\le L}\Vert Eu_{t}u_{t-h}'(1-\omega (|h|,L))\Vert \le \frac{1}{L}\max _{t \le T}\sum _{|h|=0}^{\infty }\Vert Eu_{t}u_{t-h}'\Vert |h| \end{aligned}$$

converges to zero as \(L\rightarrow \infty \) as long as \(\max _{t \le T}\sum _{|h|=0}^{\infty }\Vert Eu_{t}u_{t-h}'\Vert |h| < \infty \).

In this paper, we assume that \(\Omega _{h}\) is a sparse matrix for each h and impose conditions similar to those in Bickel and Levina (2008a) and Fan et al. (2013). Write \(\Omega _{h} = (\Omega _{h,ij})_{N\times N}\), where \(\Omega _{h,ij}=Eu_{it}u_{j,t-h}\). For some \(q \in [0,1)\), we define

$$\begin{aligned} m_{N}=\max _{|h| \le L}\max _{i\le N}\sum _{j=1}^{N}|\Omega _{h,ij}|^q, \end{aligned}$$

as a measure of sparsity. We require that \(m_{N}\) be either bounded or grow slowly as \(N \rightarrow \infty \). In particular, when \(q=0\), \(m_{N} = \max _{|h|\le L}\max _{i \le N}\sum _{j=1}^{N}1(\Omega _{h,ij}\ne 0)\), which corresponds to the exact sparsity case.

Let

$$\begin{aligned} \gamma _{T} = \sqrt{\log (LN)/T}. \end{aligned}$$

Assumption 2.3

For any \(NT \times NT\) matrix M, we denote by \((M)_{ts,ij}\) the (i,j)th element of the (t,s)th block of M.

  1. (i)

    \(\sum _{|h|>L}\Vert \Omega _{h}\Vert _{1} = O(L^{-\alpha })\), for a constant \(\alpha >0\).

  2. (ii)

    \(\max _{i\le N,t\le T}\sum _{s=1}^{T}\sum _{j=1}^{N}|(\Omega ^{-1})_{ts,ij}| = O(1)\).

  3. (iii)

    There is \(q\in [0,1)\) such that \(Lm_{N}\gamma _{T}^{1-q}=o(1)\) holds. In addition,

    $$\begin{aligned} \sqrt{T}L^{2}m_{N}^2\gamma _{T}^{3-2q} =o(1),\;\; \text { and } \sqrt{NT} L^3m_{N}^3\gamma _{T}^{3-3q}=o(1). \end{aligned}$$
  4. (iv)

    \(\sqrt{NT} ( \xi _{T}(L) + f_{T}(L))^3=o(1)\) and \( L^{-\alpha }T\sqrt{NT}m_{N}\gamma _{T}^{1-q} =o(1). \)

Conditions (i)-(ii) require weak cross-sectional correlations. Condition (iii) imposes sparsity restrictions on the growth of \(m_N\), associated with q and with the rate at which L grows.

Remark 2.2

To understand Assumption 2.3, consider a simple case where \(Eu_{it}u_{j,t-h}\) is nonzero for only finitely many pairs \(i\ne j\). This corresponds to \(q=0\) and \(m_N=O(1)\). Then condition (iii) requires

$$\begin{aligned} \sqrt{N} L^3{\log ^{3/2}(LN)}=o(T) . \end{aligned}$$

In practice, the bandwidth L and \(\log (LN)\) both grow very slowly compared to N and T, so essentially this condition requires \(N=o(T^2)\). In addition, condition (iv) assumes that the autocorrelations decay sufficiently fast as \(L\rightarrow \infty \). Suppose both \(\xi _T(L)\) and \(f_T(L)\) decay at a polynomial rate in L (e.g., of order \(L^{-c_0}\)); then this condition requires the order of the polynomial, \(c_0\), to be sufficiently large.

Under all the above conditions, we show in the appendix the convergence of \(\Vert \widehat{\Omega }-\Omega \Vert \), which leads to the following proposition.

Proposition 2.1

Under Assumptions 2.1-2.2, for \(q \in [0,1)\) and \(\alpha >0\) such that Assumption 2.3 holds,

$$\begin{aligned} \sqrt{NT}(\widehat{\beta }_{FGLS}-\beta ) = \Gamma ^{-1}\left( \frac{1}{\sqrt{NT}}X'\Omega ^{-1}U\right) + \Gamma ^{-1}\left( \frac{1}{\sqrt{NT}}X'\Omega ^{-1}(\widehat{\Omega }-\Omega )\Omega ^{-1}U\right) +o_{P}(1), \end{aligned}$$

where \(\Gamma = E(X'\Omega ^{-1}X/NT)\).

As the above proposition shows, the effect of \(\widehat{\Omega }-\Omega \) appears as a “weighted average” in the second term on the right-hand side of the expansion. The negligibility of this term relies on the following high-level condition. Define \(W=\Omega ^{-1}X\) and \(\varepsilon = \Omega ^{-1}U\). Then \(W=(w_{1}',\cdots , w_{T}')'\) with each \(w_{t}\) being an \(N\times d\) matrix of \(w_{it}\), and \(\varepsilon _{it}\) is defined similarly.

Assumption 2.4

Let \(A_{b_h} = \{(i,j) : |Eu_{it}u_{j,t-h}| \ne 0\}\). Then,

$$\begin{aligned} \left\| \frac{1}{\sqrt{NT}}\sum _{h=0}^{L}\sum _{i,j \in A_{b_h}}{\mathbb {G}}_{T,ij}^{1}(h){\mathbb {G}}_{T,ij}^{2}(h) \right\| = o_{P}(1), \end{aligned}$$
(9)

where \({\mathbb {G}}_{T,ij}^{1}(h) = \frac{1}{\sqrt{T}}\sum _{t=h+1}^{T}(u_{it}u_{j,t-h}-Eu_{it}u_{j,t-h})\) and \({\mathbb {G}}_{T,ij}^{2}(h) = \frac{1}{\sqrt{T}}\sum _{t=h+1}^{T}w_{it}\varepsilon _{j,t-h}\).

Remark 2.3

While it is difficult to verify the above high-level condition in the presence of serial dependence, cross-sectional dependence, or both, the intuition can be understood in the simple i.i.d. case. Suppose \(u_{it}\) is independent across both i and t. Then we can set \(L=0\), and this condition becomes

$$\begin{aligned} A\equiv \frac{1}{\sqrt{NT}} \sum _{i=1}^N\frac{1}{T}\sum _{t=1}^T(u_{it}^2-Eu_{it}^2) \sum _{s=1}^Tx_{is}u _{is}\sigma _i^{-4}=o_P(1), \end{aligned}$$

which is (8) as discussed in Sect. 2.3. As discussed there, it is straightforward to see that \(\mathrm{var}(A)=o(1)\); proving \(EA=o(1)\) requires either that \(u_{it}\) has a symmetric distribution, so that \(Eu_{it}^3=0\), or that \(N=o(T)\) for asymmetric distributions. Similar conditions have been required for high-dimensional GLS problems, for instance by Bai and Liao (2017) for panel data models with interactive effects.

Then, we have the following limiting distribution.

Theorem 2.1

Suppose \(\mathrm {var}(U|X)=\mathrm {var}(U)=\Omega \). Under Assumptions 2.1-2.4, for \(q \in [0,1)\) and \(\alpha >0\) such that Assumption 2.3 holds, as \(N, T \rightarrow \infty \),

$$\begin{aligned} \sqrt{NT}(\widehat{\beta }_{FGLS}-\beta ) \overset{d}{\rightarrow } \mathcal {N}(0,\Gamma ^{-1}), \end{aligned}$$

where \(\Gamma =\lim _{N,T\rightarrow \infty } E(X'\Omega ^{-1}X/NT)\), which is assumed to exist. A consistent estimator of \(\Gamma \) is \(\widehat{\Gamma } = X'\widehat{\Omega }^{-1}X/NT\).

The asymptotic variance of the FGLS estimator is \(\text {Avar}(\widehat{\beta }_{FGLS})= \Gamma ^{-1}/NT\), which can be estimated by \((X'\widehat{\Omega }^{-1}X)^{-1}\). Asymptotic standard errors can then be obtained in the usual fashion from the asymptotic variance estimate.

3 Empirical study: Effects of divorce law reforms on divorce rates

The cause of the sharp increase in the US divorce rate in the 1960s-1970s is an important research question in the literature. During the 1970s, more than half of the US states liberalized their divorce systems, and the effects of the reforms on divorce rates have been investigated by many authors, such as Allen (1992) and Peters (1986). Controlling for state and year fixed effects, Friedberg (1998) suggested that state law reforms significantly increased divorce rates; she also assumed that unilateral divorce laws affected divorce rates permanently. However, according to the empirical evidence, divorce rates have been decreasing since 1975. The question therefore arises of whether the law reforms also contributed to the decrease in divorce rates. Wolfers (2006) revisited this question using a treatment effect panel data model and identified only temporary effects of the reforms on divorce rates. In particular, he used dummy variables for the first two years after the reforms, years 3-4, years 5-6, and so on. More specifically, the following fixed effect panel data model was considered:

$$\begin{aligned} y_{it} = \alpha _{i} + \mu _{t} + \sum _{k=1}^{8}\beta _{k}X_{it,k} + \delta _{i}t + u_{it}, \end{aligned}$$
(10)

where \(y_{it}\) is the divorce rate for state i and year t, \(\alpha _{i}\) is a state fixed effect, \(\mu _{t}\) is a year fixed effect, and \(\delta _{i}t\) is a state-specific linear time trend with unknown coefficient \(\delta _{i}\). Each \(X_{it,k}\) is a binary regressor indicating the kth two-year period after the reform, i.e., the treatment effect up to 2k years after the reform. Wolfers (2006) suggested that “the divorce rate rose sharply following the adoption of unilateral divorce laws, but this rise was reversed within about a decade.” He also concluded that “15 years after reform the divorce rate is lower as a result of the adoption of unilateral divorce, although it is hard to draw any strong conclusions about long-run effects.”

Both Friedberg (1998) and Wolfers (2006) used a weighted model, multiplying all variables by the square root of state population. In addition, they used ordinary OLS standard errors, which do not take into account heteroskedasticity or serial and cross-sectional correlations. However, standard errors can be biased when these correlations are disregarded. Therefore, we re-estimate the model of Wolfers (2006) using the proposed FGLS method, as well as OLS with the heteroskedasticity-robust standard errors of White (1980), the clustered standard errors of Arellano (1987), and the robust standard errors of Bai et al. (2019).

We use the same dataset as Wolfers (2006), which includes the divorce rate, state-level reform years, the binary regressors, and state population. Due to missing observations around the divorce law reforms, we exclude Indiana, New Mexico and Louisiana. As a result, we obtain a balanced panel from 1956 to 1988 for 48 states. We fit the models both with and without linear time trends, and use OLS and FGLS in each model to estimate \(\beta \).

The choice of the tuning parameters for implementing the FGLS follows the guidance in Table 1. Specifically, we set the bandwidth to \(L=3\), following the Newey and West (1994) rule \(L=4(T/100)^{2/9}\). The thresholding constants are chosen by the cross-validation method discussed in Sect. 2.2.2; specifically, \(M=1.8\) and \(M=1.9\) for the models with and without linear time trends, respectively. The Bartlett kernel is used in both the OLS robust standard errors and the FGLS estimation.

Model (10) is in fact more complicated than the model formally studied in this paper, due to the inclusion of linear time trends and fixed effects. While a theoretical study of models with trends would be challenging in the high-dimensional GLS setting, it is straightforward to handle them within the same FGLS framework by applying a projection transformation to eliminate the time trend. Specifically, let \( \ell =(1,2,...,T)' \) and \(P_\ell =I_T-\ell (\ell '\ell )^{-1}\ell '\). We define \( \widetilde{Y}_i=P_\ell (y_{i1},...,y_{iT})'\) and \( \widetilde{X}_i=P_\ell (X_{i1},...,X_{iT})' \), and then define \(\dot{\widetilde{y}}_{it}\) and \(\dot{\widetilde{X}}_{it}\) from \(\widetilde{y}_{it}\) and \(\widetilde{X}_{it}\) by further removing the fixed effects.
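A hedged sketch of this transformation for a single state follows (our own helper; the full two-way within transformation would additionally demean across states for the year effects).

```python
import numpy as np

def detrend_and_demean(Z):
    """Remove a linear time trend and the time mean from one state's series.

    Z : (T,) or (T, k) array holding y_i or X_i over t = 1, ..., T.
    """
    T = Z.shape[0]
    ell = np.arange(1, T + 1, dtype=float)
    P = np.eye(T) - np.outer(ell, ell) / (ell @ ell)  # P_ell = I - ell(ell'ell)^{-1}ell'
    Zt = P @ Z                                        # projects off the linear trend
    return Zt - Zt.mean(axis=0)                       # removes the state mean over time
```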

The estimates of \(\beta _{1}, \cdots , \beta _{8}\), with and without linear time trends, and their standard errors are summarized in Table 2 below. The OLS and FGLS estimates in both models are similar to each other. The results show that divorce rates rose soon after the law reforms but had fallen again within a decade. Interestingly, FGLS confirms the negative effects of the law reforms on divorce rates, specifically 11-15+ years after the reform in the model with state-specific linear time trends, and 9-15+ years after the reform in the model without them. In addition, the FGLS estimates for years 1-6 and 1-4 are positive and statistically significant in the models with and without linear time trends, respectively. For OLS, the coefficient estimates for years 3-4 and 7-15+ are significant in the model without linear time trends based on \(se_{BCL}\). In contrast, the OLS estimates are statistically significant only for years 1-4 when a linear time trend is added. According to the clustered standard error \(se_{CX}\), only the estimates for years 11-15+ are statistically significant in the model without trends.

Based on the OLS and FGLS estimation results with and without linear time trends, we draw the following conclusion: in the first 8 years, the overall trend of the divorce rate is increasing, although the law reforms begin to reduce the divorce rate after years 3-4; from 8 years after the reform onward, the law reforms have a negative effect on the divorce rate. Finally, we also note a noticeable difference between the magnitudes of the OLS and FGLS estimates. While this difference may be due to the small sample size, the estimates mostly lie within the 95% confidence intervals based on the other estimator. For instance, \({\widehat{\beta }}_{FGLS}= 0.133\) for the effect of years 1-2, which is within the 95% confidence interval constructed using \({\widehat{\beta }}_{OLS}\), namely \( [-0.0184, 0.5304] \). For another example, \({\widehat{\beta }}_{FGLS}= 0.165\) for the effect of years 3-4, which is within the corresponding OLS-based 95% confidence interval [0.050, 0.367]. We note that these confidence intervals are relatively wide, as a consequence of the relatively small sample size in this study.

Overall, the FGLS results are consistent with Wolfers (2006). The FGLS confirms that the law reforms significantly contributed to the subsequent decrease in divorce rates, specifically 9–15 years after the reform in the model without linear time trends, and 11–15 years after in the model with linear time trends. Wolfers (2006) de-emphasized the negative coefficients at the end of the period, as they are not robust to the inclusion of state-specific quadratic trends (which we do not employ in this paper). Nevertheless, we interpret the economic insight of these results as two sides of the same treatment, the law reforms: after the earlier dissolution of bad matches following the reforms, marital relations were gradually affected and changed.

Table 2 Empirical application: effects of divorce law reforms with state and year fixed effects. US state-level annual data, 1956-1988; the dependent variable is the divorce rate per 1000 persons per year

4 Conclusions

This paper considers generalized least squares (GLS) estimation for linear panel data models. By estimating the large error covariance matrix consistently, the proposed feasible GLS estimator is more efficient than ordinary least squares (OLS) in the presence of heteroskedasticity and serial and cross-sectional correlations. The covariance matrix used for the feasible GLS is estimated via banding and thresholding. We establish the limiting distribution of the proposed estimator, report Monte Carlo studies in the online supplement, and apply the method to an empirical study of US divorce law reforms.