1 Introduction

Linear mixed-effects models (LMMs) are among the most widely used tools for modeling data in applied statistics, and research on linear mixed models has grown rapidly in the last 10–15 years. This is due to the wide range of their applications to different types of data (clustered data such as repeated measures, longitudinal data, panel data, and small area estimation), spanning fields such as agriculture, economics, medicine, biology and sociology.

Some practical issues usually encountered in statistical analysis concern the choice of an appropriate model, the estimation of the parameters of interest and the assessment of the order or dimension of a model. This paper focuses on model selection, which is essential for making valid inference. The principle of model selection or model evaluation is to choose the “best approximating” model, within a class of competing models characterized by different numbers of parameters, by means of a suitable model selection criterion given a data set (Bozdogan 1987). The ideal selection procedure should lead to the “true” model, i.e., the unknown model behind the true process generating the observed data. In practice, one seeks, among a set of plausible candidate models, the parsimonious one that best approximates the “true” model.

The selection of only one model among a pool of candidate models is not a trivial issue in LMMs, and the different methods proposed in the literature over time are often not directly comparable. In fact, not only does the notation differ among papers and considerable confusion surround the software (R, SAS, MATLAB, etc.) to be used, but there is also a lack of landmarks allowing users to prefer one method over the others.

Hence, the main purpose of this review is to provide an overview of some useful components/factors characterizing each selection criterion, so that users can identify the method to apply in a specific situation. Moreover, we will also try to tidy up the notation used in the literature, by “translating,” where necessary, the symbols and formulas found in each paper to produce a common “language.” We begin by updating the recent review by Müller et al. (2013) and then add information about each selection criterion, such as the kind of effects that each method focuses on, the structure of the variance–covariance matrix, the model dimensionality, and the software used for implementing each method.

When coping with LMMs, it is not a good idea to assume independence or lack of correlation among the response observations. For example, in the case of repeated measures, data are collected on the same individual over time. Hence, the traditional linear regression model is not appropriate to describe the data. For a detailed description of the analogies and differences between linear mixed models and linear models, see Müller et al. (2013).

An important issue associated with LMM selection is related to the dimension of the fixed and random components. Most of the literature bases inference, selection and interpretation of models on the finite (fixed) dimensional case, which means that the number of parameters is less than the number of units. Recently, more attention has been given to the handling of high-dimensional settings, which requires more complex computational tools. The word “high-dimensional” refers to situations where the number of unknown parameters to be estimated is one or several orders of magnitude larger than the number of samples in the data (Bühlmann and van de Geer 2011). Furthermore, in LMMs the number of parameters can grow exponentially with the sample size, i.e., the number of effects is strictly related to the number of units; thus, as the sample size increases, the set of effects diverges. Only recently have some authors tried to make inference within the LMM framework in high-dimensional settings (Fan and Li 2012; Schelldorfer et al. 2011).

Model selection is a challenge in itself when one deals with the classic linear model. It becomes more complex when mixed models are involved, because of the presence of two kinds of effects with completely different characteristics and roles. Among others, a key aspect of linear mixed model selection is how to identify the truly important random effects, i.e., those whose coefficients vary among subjects. It is important to note that the exclusion of relevant random effects affects the estimation of the fixed effects: their variance–covariance matrix would be underestimated and the estimated variances of the fixed-effects estimates would be biased. The inclusion of irrelevant random effects, on the other hand, would lead to a singular variance–covariance matrix of the random effects, producing instability in the model (Ahn et al. 2012). As pointed out by Müller et al. (2013), most procedures focus exclusively on the selection of fixed effects. Only Chen and Dunson (2003) and Greven and Kneib (2010) worked on random part selection before Müller et al. (2013). Selecting only the random part raises obvious computational difficulties, which is why the researchers who have worked on the random effects after Müller et al. (2013) also optimize with respect to the fixed part, with the exception of Li and Zhu (2013). In recent years, in fact, procedures selecting both kinds of effects have become common.

It is worth noting that, since LMMs are a special case of generalized LMMs (GLMMs), we exclude from the current review all methods built mainly for selecting effects in GLMMs, such as Hui et al. (2017). Moreover, this review does not include works based on graphical tools for model selection when these graphical representations refer to methods already existing in the literature. This is the case, for example, of Sciandra and Plaia (2018), who adapt an available graphical representation to the class of mixed models in order to select the fixed effects conditioning on the random part and covariance structure, and of Singer et al. (2017), who discuss different diagnostic methods focusing on residual analysis but also addressing global and local influence, giving general guidelines for model selection.

This review mentions the available theoretical properties of the different methodologies, with comparisons among them where possible.

Müller et al. (2013) classified the proposed methods into four different kinds of procedures: information criteria (such as the Akaike information criterion and the Bayesian information criterion); shrinkage methods such as the LASSO and the adaptive LASSO; the Fence method; and Bayesian methods.

In this paper, we prefer to cluster the methods according to which part of the model, fixed, random or both, they focus on. The paper is organized as follows. In Sect. 2, we present the structure and notation of a linear mixed model and discuss some problems occurring in model selection. In Sect. 3, we give an overview of the model selection procedures within the LMM framework, following the classification proposed by Müller et al. (2013). In Sects. 4, 5 and 6, we describe the methods grouped according to the part of the model they select, i.e., fixed effects, random effects and both, respectively. Finally, we conclude with a brief discussion and some conclusions in Sect. 7. Moreover, to help the reader decide which method to prefer according to their own data, we include Tables 2 and 3, which summarize the main features of each method.

2 LMM and the linear mixed model selection problem

Suppose data are collected from m independent groups of observations (called clusters or subjects in longitudinal data). The response variable \(\varvec{Y}_i\) is specified in the linear mixed model at cluster level as follows:

$$\begin{aligned} \varvec{Y}_i=\varvec{X}_i\varvec{ \beta }+\varvec{Z}_i\varvec{b}_i+\varvec{\epsilon }_i,\;\;\; i=1,2,\ldots ,m, \end{aligned}$$
(1)

where \(\varvec{Y}_i\) is a \(n_i\) dimensional vector of observed responses, \(\varvec{X}_i\) and \(\varvec{Z}_i\) are the known \(n_i\times p\) and \(n_i\times q\) matrices of covariates related to the fixed effects and to the random effects, respectively, \(\varvec{\beta }\) is the p-vector of unknown fixed effects, \(\varvec{b}_i\) is the q-vector of unobserved and independent random effects and \(\varvec{\epsilon }_i\) is the vector of unobserved random errors. Let us assume that \(\varvec{b}_i\)s are independent of \(\varvec{\epsilon }_i\)s and that they are independent and identically distributed random variables for each group of observations in the following way:

$$\begin{aligned} \varvec{b}_i\sim N_q(0,\varvec{\Psi }),\;\;\;\;\; \varvec{\epsilon }_i\sim N_{n_i}(0,\varvec{\Sigma }), \end{aligned}$$
(2)

where \(\varvec{\Psi }\) is a \(q\times q\) positive definite matrix and \(\varvec{\Sigma }\) is a \(n_i\times n_i\) positive definite matrix. Consequently, the response vector follows a multivariate normal distribution, \(\varvec{Y}_i\sim N_{n_i}(\varvec{X}_i\varvec{\beta },\varvec{V}_i)\), where the variance–covariance matrix is given by \(\varvec{V}_i=\varvec{Z}_i\varvec{\Psi } \varvec{Z}_i^{'}+\varvec{\Sigma }\). The vectorized form of the model is:

$$\begin{aligned} \varvec{Y}=\varvec{X}\varvec{\beta }+\varvec{Zb}+\varvec{\epsilon }, \end{aligned}$$
(3)

where all elements now refer to all the clusters stacked together; therefore, \(\varvec{Y}\) is an n-dimensional vector (\(n = \sum n_i\)), \(\varvec{X}\) and \(\varvec{Z}\) are the known \(n\times p\) and \(n\times q\) matrices of covariates related to the fixed effects and to the random effects, respectively, \(\varvec{\beta }\) is the p-vector of unknown fixed effects, \(\varvec{b}\) is the q-vector of unobserved and independent random effects and \(\varvec{\epsilon }\) is the vector of unobserved random errors.
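As an illustration of the data structure implied by models (1) and (3), the following minimal sketch simulates clustered responses; all dimensions and parameter values are assumptions chosen only for this example.

```python
# Minimal sketch: simulating data from the LMM in Eq. (1).
# All dimensions and parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, n_i, p, q = 30, 5, 3, 2                 # clusters, cluster size, fixed/random dims
beta = np.array([1.0, -0.5, 2.0])          # fixed effects
Psi = np.array([[1.0, 0.3],
                [0.3, 0.5]])               # q x q random-effects covariance
sigma2 = 0.25                              # error variance, Sigma = sigma2 * I

X = rng.normal(size=(m, n_i, p))           # fixed-effects covariates per cluster
Z = rng.normal(size=(m, n_i, q))           # random-effects covariates per cluster
b = rng.multivariate_normal(np.zeros(q), Psi, size=m)   # b_i ~ N_q(0, Psi)
eps = rng.normal(scale=np.sqrt(sigma2), size=(m, n_i))  # eps_i ~ N(0, sigma2 * I)

# Y_i = X_i beta + Z_i b_i + eps_i, stacked over the m clusters
Y = np.einsum('ijk,k->ij', X, beta) + np.einsum('ijk,ik->ij', Z, b) + eps
print(Y.shape)   # (m, n_i)
```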

The selection of linear mixed-effects models implies the selection of the “true” fixed parameters and/or the “true” random effects. Even if a kind of estimate of \(\varvec{b}\) exists, the best linear unbiased predictor [BLUP, see Eq. (7)], the correct way to identify \(\varvec{b}\) requires estimating its \(q(q+1)/2\) variance–covariance parameters. Let \(\varvec{\tau }\) denote the s-vector collecting all the distinct components of the variance–covariance matrices \(\varvec{\Psi }\) and \(\varvec{\Sigma }\). A random effect is not relevant if its variance–covariance elements, for all observations, are zero (Ahn et al. 2012); hence, it suffices to correctly identify the nonzero diagonal components of \(\varvec{\Psi }\) (Wu et al. 2016) and their related covariance terms, so as to avoid the drawback of excluding random effects correlated with some explanatory variables.

We call \(\varvec{\theta }=(\varvec{\beta },\varvec{\tau })\) the overall set of parameters relevant in a linear mixed model. This set represents the whole group of parameters related to the true model generating the data. Let us denote a candidate linear mixed model by \(\textit{M} \in \mathcal {M}\), where \(\mathcal {M}\) is the countable set containing all candidate models involved in the selection. The number of candidate models depends on some contextual considerations: some variance–covariance components could be known or assumed to be known; some authors could focus only on nested models; or the classic null model (the one with intercept only) might not be admitted among the candidate models (see Sect. 7 for further details).

The conditional log-likelihood for model (3) is given by:

$$\begin{aligned} l(\varvec{\theta }|\varvec{b};\varvec{y})=\log f_{\varvec{y}}(\varvec{y}|\varvec{b};\varvec{\theta })=-\frac{1}{2}\bigg \{\log |\varvec{\Sigma }|+(\varvec{y}-\varvec{X}\varvec{\beta }-\varvec{Zb})^{'}{\varvec{\Sigma }}^{-1}(\varvec{y}-\varvec{X\beta }-\varvec{Zb})\bigg \} -\frac{n}{2}\log (2\pi ), \end{aligned}$$
(4)

while the marginal log-likelihood is:

$$\begin{aligned} l(\varvec{\theta };\varvec{y})= \log f_{\varvec{y}}(\varvec{y};\varvec{\theta })= -\frac{1}{2}\{\log |\varvec{V}|+(\varvec{y}-\varvec{X\beta })'\varvec{V}^{-1}(\varvec{y}-\varvec{X\beta })\}. \end{aligned}$$
(5)

For fixed \(\varvec{\tau }\), maximizing the log-likelihood yields the generalized least squares estimator of \(\varvec{\beta }\):

$$\begin{aligned} \hat{\varvec{\beta }}(\varvec{\tau })=(\varvec{X}'\varvec{V}^{-1}\varvec{X})^{-1}\varvec{X}'\varvec{V}^{-1}\varvec{y}. \end{aligned}$$
(6)

The most popular approach for predicting \(\varvec{b}\) is an empirical Bayesian method, which uses the posterior distribution \(f(\varvec{b}|\varvec{y})\) yielding the following BLUP prediction:

$$\begin{aligned} \hat{\varvec{b}}(\varvec{\tau })_{\mathrm{BLUP}}=\varvec{\Psi Z}'\varvec{V}^{-1}\{\varvec{y}-\varvec{X}\hat{\varvec{\beta }}(\varvec{\tau })\}. \end{aligned}$$
(7)

The same solutions of \(\hat{\varvec{\beta }}(\varvec{\tau })\) and \(\hat{\varvec{b}}(\varvec{\tau })_{\mathrm{BLUP}}\) can be obtained by solving Henderson’s linear mixed model equations (Müller et al. 2013):

$$\begin{aligned} \begin{bmatrix} \varvec{X}'\varvec{\Sigma }^{-1}\varvec{X} & \varvec{X}'\varvec{\Sigma }^{-1}\varvec{Z} \\ \varvec{Z}'\varvec{\Sigma }^{-1}\varvec{X} & \varvec{Z}'\varvec{\Sigma }^{-1}\varvec{Z}+\varvec{\Psi }^{-1} \end{bmatrix} \begin{bmatrix} \hat{\varvec{\beta }}(\varvec{\tau }) \\ \hat{\varvec{b}}(\varvec{\tau }) \end{bmatrix} = \begin{bmatrix} \varvec{X}'\varvec{\Sigma }^{-1}\varvec{y}\\ \varvec{Z}'\varvec{\Sigma }^{-1}\varvec{y} \end{bmatrix}. \end{aligned}$$
(8)
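The following sketch (an illustration under assumed dimensions, with the random effects stacked over the clusters; not the authors' code) computes the GLS estimate (6) and the BLUP (7) for simulated data and checks numerically that Henderson's equations (8) return the same solution.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
m, n_i, p, q = 10, 4, 3, 2
n = m * n_i
X = rng.normal(size=(n, p))
Z = block_diag(*[rng.normal(size=(n_i, q)) for _ in range(m)])   # block design for b
Psi = block_diag(*[np.array([[1.0, 0.2], [0.2, 0.5]])] * m)      # cov of stacked b
Sigma = 0.3 * np.eye(n)
beta_true = np.array([1.0, -1.0, 0.5])
b = rng.multivariate_normal(np.zeros(m * q), Psi)
y = X @ beta_true + Z @ b + rng.multivariate_normal(np.zeros(n), Sigma)

V = Z @ Psi @ Z.T + Sigma                                        # V = Z Psi Z' + Sigma
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)       # Eq. (6)
b_blup = Psi @ Z.T @ Vinv @ (y - X @ beta_hat)                   # Eq. (7)

Sinv = np.linalg.inv(Sigma)
A = np.block([[X.T @ Sinv @ X, X.T @ Sinv @ Z],
              [Z.T @ Sinv @ X, Z.T @ Sinv @ Z + np.linalg.inv(Psi)]])
rhs = np.concatenate([X.T @ Sinv @ y, Z.T @ Sinv @ y])
sol = np.linalg.solve(A, rhs)                                    # Henderson, Eq. (8)
assert np.allclose(sol[:p], beta_hat) and np.allclose(sol[p:], b_blup)
```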

Although consistent, the ML estimator of the variance–covariance parameters is known to be biased in small samples. Hence, restricted maximum likelihood (REML) estimators are used, based on the restricted log-likelihood:

$$\begin{aligned} l_R(\varvec{\tau })=-\frac{1}{2}\{\log |\varvec{V}|+\log |\varvec{X}'\varvec{V}^{-1}\varvec{X}|+\varvec{y}'\varvec{P}\varvec{y}\}, \end{aligned}$$
(9)

where \(\varvec{P}=\varvec{V}^{-1}-\varvec{V}^{-1}\varvec{X}(\varvec{X}'\varvec{V}^{-1}\varvec{X})^{-1}\varvec{X}'\varvec{V}^{-1}\) (Müller et al. 2013). The ML estimators of \(\varvec{\beta }\) and \(\varvec{\tau }\) will henceforth be denoted by \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\tau }}\), and the REML estimators by \(\hat{\varvec{\beta }}_R\) and \(\hat{\varvec{\tau }}_R\).
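For completeness, a small sketch (illustrative only) evaluating the restricted log-likelihood (9) for a given marginal covariance matrix; it could be maximized numerically over \(\varvec{\tau }\) to obtain \(\hat{\varvec{\tau }}_R\).

```python
import numpy as np

def reml_loglik(y, X, V):
    """Restricted log-likelihood of Eq. (9), up to an additive constant,
    for a given marginal covariance matrix V = Z Psi Z' + Sigma."""
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_X = np.linalg.slogdet(XtVinvX)
    return -0.5 * (logdet_V + logdet_X + y @ P @ y)
```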

It is important to note that in many papers dealing with LMMs the authors use \(\sigma ^2\)-scaled versions of \(\varvec{\Psi }\) and \(\varvec{\Sigma }\), i.e., \(\varvec{\Psi }=\sigma ^2\varvec{\Psi _*}\) and \(\varvec{\Sigma }=\sigma ^2\varvec{\Sigma _*}\). In the description of the methods, we will therefore use the subscript \(_*\) for variance–covariance matrices scaled by \(\sigma ^2\).

3 Introduction to model selection criteria

Within the framework of linear mixed-effect models, a large number of selection criteria are available in the literature. Model selection criteria are frequently set up by building estimators of discrepancy measures, which evaluate the distance between the “true” model and an approximating model fitted to the data.

3.1 AIC and its modifications

The most widely used criteria for model selection are the information criteria. Their application consists in finding the model that minimizes a function in the form of a loss function plus a penalty, the latter usually depending on model complexity. The Akaike information criterion (AIC), introduced by Akaike (1992), is the most popular method. It is based on the Kullback–Leibler distance between the true density generating the data, \(\varvec{y}\), and the approximating model fitted to the data, \(g(\varvec{\theta })\) (Vaida and Blanchard 2005). With his criterion, Akaike tried to combine point estimation and hypothesis testing into a single measure, thus formalizing the concept of finding a good approximation of the true model from a predictive point of view. In this sense, a good model is one able to generate predicted values (independent of the real data) as close as possible to the observed data. The Akaike information (AI) is given by \(-2E_{f(\varvec{y})}E_{f(\varvec{y}^*)}\log g\{\varvec{y}^*;\hat{\varvec{\theta }}(\varvec{y})\}\), where \(\hat{\varvec{\theta }}\) is an estimator of \(\varvec{\theta }\), while \(\varvec{y}^*\) represents the predictive set of data obtained from the fitted model and independent of \(\varvec{y}\). Vaida and Blanchard (2005) defined a new version of the AI by conditioning the distribution \(f(\varvec{y};\varvec{\theta })\) on the clusters. Hence, the conditional AI (cAI) uses the conditional distribution \(f(\varvec{y};\varvec{\theta },\varvec{b})\) as follows:

$$\begin{aligned} -2E_{f(\varvec{y},\varvec{b})}E_{f(\varvec{y}^*|\varvec{b})}\log g\{\varvec{y}^*;\hat{\varvec{\theta }}(\varvec{y}),\hat{\varvec{b}}(\varvec{y})\}, \end{aligned}$$

where \(\hat{\varvec{b}}(\varvec{y})\) is the estimator of \(\varvec{b}\). It should be noted that \(\varvec{y}^*\) and \(\varvec{y}\) have to be considered conditionally independent given \(\varvec{b}\) and drawn from the same conditional distribution \(f(.|\varvec{b})\). These two assumptions imply that they share the same random effects \(\varvec{b}\).

The underlying rationale of Akaike-type criteria is not to identify the true model generating the data, but the best approximation of it, one that adapts well to the data. The estimators of AI and cAI are known as the Akaike information criterion and the conditional Akaike information criterion, respectively, and both are biased in finite samples. Each approximates its information as minus twice the corresponding log-likelihood plus a penalty term, \(a_n(d_M)\), which tries to adjust for the bias. The marginal AIC, defined by Vaida and Blanchard (2005), has the following generic formula:

$$\begin{aligned} \text {mAIC}=-2l(\hat{\varvec{\theta }})+2a_n(p+q) \end{aligned}$$

where \(a_n=1\) or, in small samples, \(a_n=n/(n-p-q-1)\) (Vaida and Blanchard 2005; Sugiura 1978). The conditional Akaike information criterion (cAIC; Vaida and Blanchard 2005) provides a procedure for selecting variables in LMMs with the purpose of predicting specific clusters or random effects, since the mAIC is inappropriate when the focus is on the clusters and not on the population. For predicting at cluster level, the likelihood needs to be computed conditionally on the clusters, and the random effects \(\varvec{b}_i\) need to be treated as parameters. Hence, for computing the cAIC, the terms to estimate are the \(p+q+s\) quantities given by \(\varvec{\theta }\) and \(\varvec{b}\). If all the variance elements in \(\varvec{\tau }\) are known, the q random effects \(\varvec{b}\) are predicted by the best linear unbiased predictor (BLUP) or by an estimated version of the BLUP (Eq. 7). The generic formula for the cAIC is:

$$\begin{aligned} \text {cAIC}=-2l(\hat{\varvec{\theta }}|\hat{\varvec{b}})+2a_n (\rho + 1) \end{aligned}$$
(10)

where \(\rho \) is connected to the effective degrees of freedom used in estimating \(\varvec{\beta } \) and \(\varvec{b}\). Many authors (Shang and Cavanaugh 2008; Kubokawa 2011; Vaida and Blanchard 2005; Greven and Kneib 2010; Srivastava and Kubokawa 2010; Liang et al. 2008) have tried to reduce the bias of mAIC and cAIC by working on the penalty term in different ways, e.g., by considering the ML or the REML estimator of \(\varvec{\theta }\) and by distinguishing whether the variance–covariance matrices are known or unknown. A clear and complete overview of all the penalties used in the literature is available in Müller et al. (2013), Sects. 3.1 and 3.2.
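As a simple illustration (not a specific implementation from the cited papers), the mAIC and the generic cAIC of Eq. (10) can be computed from fitted log-likelihood values as follows; the choice of the penalty \(a_n\) and of \(\rho \) depends on the variant adopted.

```python
def marginal_aic(loglik_marginal, n, p, q, small_sample=False):
    """mAIC = -2 l(theta_hat) + 2 a_n (p + q), with the small-sample
    correction a_n = n / (n - p - q - 1) of Sugiura (1978)."""
    a_n = n / (n - p - q - 1) if small_sample else 1.0
    return -2.0 * loglik_marginal + 2.0 * a_n * (p + q)

def conditional_aic(loglik_conditional, rho, a_n=1.0):
    """Generic cAIC of Eq. (10): -2 l(theta_hat | b_hat) + 2 a_n (rho + 1)."""
    return -2.0 * loglik_conditional + 2.0 * a_n * (rho + 1.0)
```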

3.2 Mallows’ \(C_p\)

Another criterion, based on a discrepancy measure (the Gauss discrepancy) and used for choosing the model nearest to the true one, is Mallows’ \(C_p\):

$$\begin{aligned} \text {C}_{{p}}=\frac{\hbox {SSE}_p}{\hat{\sigma }^2}-n+2p, \end{aligned}$$

with \(\hbox {SSE}_p\) and p representing, respectively, the error sum of squares and the number of parameters of the reference model, and \(\hat{\sigma }^2\) an estimate of \(\sigma ^2\) (Gilmour 1996). Some variants of Mallows’ \(C_p\) are provided by Kubokawa (2011) and are clearly presented by Müller et al. (2013).
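A direct transcription of the formula above (the inputs are assumed to come from a fitted reference model):

```python
def mallows_cp(sse_p, sigma2_hat, n, p):
    """Mallows' C_p: SSE_p / sigma2_hat - n + 2p."""
    return sse_p / sigma2_hat - n + 2 * p
```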

3.3 BIC

The Bayesian information criterion is based on the marginal distribution of \(\varvec{y}\), whose computation requires full prior information about all the parameters \((\varvec{\beta },\varvec{\tau })\):

$$\begin{aligned} f_\pi (\varvec{y})=\int \int f_m(\varvec{y}|\varvec{\beta },\varvec{\tau })\pi (\varvec{\beta },\varvec{\tau })\hbox {d}\varvec{\beta } \hbox {d}\varvec{\tau }. \end{aligned}$$
(11)

BIC, proposed by Schwarz (1978), is an approximation of \(-2\log \{f_\pi (\varvec{y})\}\), free of any prior distribution setup:

$$\begin{aligned} \text {BIC}=-2l(\hat{\varvec{\theta }})+(p+q)\log (N). \end{aligned}$$
(12)

Since BIC is a Bayesian procedure for model selection, it requires prior distributions. Kubokawa and Srivastava (2010) derived the expression of the EBIC, an intermediate method between BIC and full Bayesian variable selection tools. The EBIC procedure employs a partial non-subjective prior distribution only for the parameters of interest, leaving the nuisance parameters free of distributional assumptions.

3.4 Shrinkage

Even for classic linear models, computing information criteria for variable selection is often not feasible when p and/or q are large, i.e., in high-dimensional settings. Hence, shrinkage methods such as the least absolute shrinkage and selection operator, LASSO (Tibshirani 1996), and its extensions, such as the adaptive LASSO, ALASSO (Zou 2006), the elastic net (Zou and Hastie 2005) and the smoothly clipped absolute deviation, SCAD (Fan and Li 2012), have been proposed in the literature. With these techniques, thanks to a penalization system, some coefficients are shrunk toward zero while, at the same time, those truly influential on the response are estimated to be nonzero. The shrinkage procedures can be applied either to the least squares or to the likelihood function. For the sake of simplicity, the penalized likelihood function is reported for the case of the classical linear model:

$$\begin{aligned} -\sum _{i=1}^{n}l_i(\varvec{\beta };\varvec{y}_i)+n\sum _{j=1}^{p}p_{\lambda }(||\varvec{\beta }||_\ell ), \end{aligned}$$
(13)

where \(||\varvec{\beta }||_\ell \) is the \(\ell \)-norm of \(\varvec{\beta }\); the \(\ell _1\) norm corresponds to the LASSO, while \(\ell _2\) leads to ridge estimation. The adaptive LASSO is an extension of the LASSO. It adds weights depending on the \(\ell \)-norm of \(\varvec{\beta }\), i.e., \(p_\lambda (||\varvec{\beta }||_\ell )=\lambda _j||\varvec{\beta }||_\ell /2\), with \(\lambda _j=\lambda /||\varvec{\beta }||_\ell \), where \(\ell \) is an additional parameter often set equal to 1.
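Penalized criteria like (13) with the \(\ell _1\) penalty are typically optimized coordinate-wise via soft-thresholding; the sketch below shows this building block (a generic illustration, not the specific algorithms reviewed here), with the adaptive LASSO obtained by rescaling the penalty coefficient-wise.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used in coordinate-descent LASSO:
    shrinks z toward zero and sets it exactly to zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Adaptive LASSO: coefficient-specific penalties lam_j = lam / |beta_init_j|,
# where beta_init is an assumed initial (e.g., unpenalized) estimate.
```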

The generic SCAD penalty on \(\varvec{\theta }\) introduced by Fan and Li (2001) works on the first derivative of \(p_{\lambda }(|\varvec{\theta }|)\):

$$\begin{aligned} p'_{\lambda _j}(|\varvec{\theta }|)=\lambda \biggl \{I(\varvec{\theta }\le \lambda )+\frac{(a\lambda -\varvec{\theta })_+}{(a-1)\lambda } I(\varvec{\theta }>\lambda )\biggr \}. \end{aligned}$$
(14)

For the solution of \(\varvec{\theta }\), Fan and Li (2001) provided an algorithm via local quadratic approximations.
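A sketch of the SCAD penalty derivative in Eq. (14), with the value a = 3.7 commonly recommended by Fan and Li (2001):

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """First derivative of the SCAD penalty, Eq. (14), evaluated at |theta|."""
    t = np.abs(theta)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))
```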

3.5 MDL principle

The minimum description length (MDL) principle originates from the data compression literature and from Rissanen (1986), who developed it to “understand” the observed data; it represents a valid statistical criterion for selecting linear mixed models. This method aims to detect, among a pool of candidate models, the model that best approximates the observed data through a data compression process based on the code length needed to describe the data. A model can be described using fewer symbols than those necessary to describe the data. Usually, this criterion is used for independent data. Li et al. (2014) propose an MDL criterion for fixed effects selection when there is correlation between observations within clusters. The criterion is presented as a good trade-off between AIC, thanks to its asymptotic optimality, and BIC, because of its consistency property. The proposed criterion is a hybrid form of MDL that merges a two-stage description length and the mixture MDL for dependent data.

4 Fixed effects selection

AIC and its modifications consist in finding the model that minimizes a function in the form of a loss function plus a penalty, which measures model complexity. Kawakubo and Kubokawa (2014) and Kawakubo et al. (2014) propose a modified conditional AIC and a conditional AIC under covariate shift in small area estimation (SAE), respectively. For linear mixed model selection in SAE, random intercept model selection in particular, Marhuenda et al. (2013) work on two variants of AIC and two variants of the Kullback symmetric divergence criterion (KIC), defined as:

$$\begin{aligned} \text {KIC}=-2\log f(\varvec{y}|\hat{\varvec{\theta }})+3(p+1). \end{aligned}$$

Kawakubo and Kubokawa (2014) and Kawakubo et al. (2018) provide a modified version of the exact cAIC (McAIC), because the cAIC suggested by Vaida and Blanchard (2005) is highly biased when the candidate models do not include the true model generating the data (underspecified cases). They assume that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\), and extend the cAIC to a procedure that is valid both in overspecified cases (situations in which the true model is included among the candidate models) and in underspecified ones. The modified conditional AIC is given by:

$$\begin{aligned} \text {McAIC}=-2\log f(y|\hat{\varvec{b}}_j,\hat{\varvec{\beta }}_j,\hat{\sigma }^2_j)+\widehat{\Delta _{\mathrm{cAI}}}, \end{aligned}$$
(15)

where \(\widehat{\Delta _{\mathrm{cAI}}}\) is the estimated bias correction of the cAIC, given by:

$$\begin{aligned} \widehat{\Delta _{\mathrm{cAI}}}=B^*+\widehat{B}_1+\widehat{B}_2+\widehat{B}_3, \end{aligned}$$
(16)

where \(B^*\) is a function of \(\varvec{V}^{-1}\) and \(B_1\), \(B_2\) and \(B_3\) are functions of \(\varvec{V}\) and \(\varvec{X}\). The authors show that \(B^*\), \(\widehat{B}_1\), \(\widehat{B}_2\) and \(\widehat{B}_3\) have distributions proportional to \(\chi ^2\) with suitably specified degrees of freedom and that, in the overspecified case, \(\widehat{\Delta _{\mathrm{cAI}}}\) reduces to \(B^*\), i.e., the McAIC coincides with the cAIC of Vaida and Blanchard (2005).

When the variable selection problem focuses on finding a set of significant variables for a good prediction, Kawakubo et al. (2014) propose a cAIC under covariate shift (CScAIC). They derive the cAIC of Vaida and Blanchard (2005) under the covariate shift for both known and unknown variances \({\sigma }^2\) and \(\varvec{\Psi }_*\) and with \(\varvec{\Sigma }_*\) assumed to be known.

The proposed criterion replaces, in the formula of the classic cAIC, the conditional density of \(\varvec{y}\) (the vector of the observed responses) given \(\varvec{b}\) with the conditional density, given \(\varvec{b}\), of \(\tilde{\varvec{y}}\), the vector of responses in the “predictive model” \(\tilde{\varvec{y}}=\tilde{\varvec{X}}\varvec{\beta }+\tilde{\varvec{Z}}\varvec{b}+\tilde{\varvec{\epsilon }}\), an LMM with the same regression coefficients \(\varvec{\beta }\) and random effects \(\varvec{b}\) but different, shifted covariates:

$$\begin{aligned} \text {CScAIC}=-2\log g(\tilde{\varvec{y}}|\hat{\varvec{b}},\hat{\varvec{\beta }},\hat{\sigma }^2)+B^*_c, \end{aligned}$$
(17)

where \(\sigma ^2\) is unknown and estimated by its ML estimator and \(B^*_c\) is the bias correction.

Lombardía et al. (2017) introduce a mixed generalized Akaike information criterion, xGAIC, for SAE models. One typical model used in the field of SAE is the Fay–Herriot model, a particular type of LMM containing only one random effect, the intercept. The clusters are represented by areas, and the model in Eq. (1) for each area reduces to: \(\varvec{y}_i=\varvec{x}_i^{'}\varvec{\beta }+\varvec{b}_i+\varvec{\epsilon }_i\), with \(i=1,2,\ldots ,m\).

Instead of the usual AIC types based only on the marginal or the conditional log-likelihood, the authors propose a new AIC based on a combination of both log-likelihood functions. The quasi-log-likelihood used for deriving the new statistic is the following:

$$\begin{aligned} \log (l_{\varvec{x}})=-\frac{1}{2}m\log (2\pi )-\frac{1}{2}\log |\varvec{V}|-\frac{1}{2}(\varvec{Y}-\varvec{\mu })^{'}\varvec{V}^{-1}(\varvec{Y}-\varvec{\mu }), \end{aligned}$$
(18)

where \(\varvec{\mu }=E(\varvec{Y}|\varvec{b})\). The generalized degrees of freedom (xGDF), linked to the quasi-log-likelihood in Eq. (18), takes into account the expectation and covariance with respect to the marginal distribution of \(\varvec{Y}\):

$$\begin{aligned} \text {xGDF}=\sum _{i=1}^{m}\frac{\partial E_{\varvec{y}}(\hat{\varvec{\mu }}_i)}{\partial (\varvec{X}_i\varvec{\beta })}=\sum _{i=1}^{m}\sum _{j=1}^{m}\varvec{V}^{ij}\hbox {cov}(\hat{\varvec{\mu }}_i,\varvec{y}_j), \end{aligned}$$
(19)

where \(\varvec{V}^{ij}\) is the ij-element of the matrix \(\varvec{V}^{-1}\). Combining the \(\log (l_{\varvec{x}})\) with xGDF, the mixed generalized AIC is finally defined as:

$$\begin{aligned} \text {xGAIC}= -2\log (l_{\varvec{x}})+\hbox {xGDF}. \end{aligned}$$
(20)

Han (2013) derives the closed form of the unbiased conditional AIC when the linear mixed model reduces to the Fay–Herriot model. The author proposes a suitable cAIC for three different approaches to fitting the model: the unbiased quadratic estimator (UQE), the REML estimator and the ML estimator. The unbiased cAIC for the Fay–Herriot model has the same form as for the classical LMM with i.i.d. errors (see Eq. 10), where the degrees of freedom are measured by \(\Phi =\sum _{i=1}^{m}\frac{\partial \varvec{X}_i\hat{\varvec{\beta }}}{\partial \varvec{Y}_i}=\hbox {tr}(\frac{\partial \varvec{X}^{'}\hat{\varvec{\beta }}}{\partial \varvec{Y}})\), which is computationally expensive because \(\varvec{X}_i\hat{\varvec{\beta }}\) is not a linear estimator through \(\hat{\sigma }^2_{{{b}}}\) and the derivatives therein depend on the specific method chosen for estimating \(\sigma _{b}^2\):

$$\begin{aligned} \text {cAIC}=-2\log f(\varvec{y}|\varvec{b},\varvec{\theta })+2\Phi . \end{aligned}$$
(21)

If \(\hat{\sigma }^2_{{{b}}}=0\), whatever method is used to estimate it, then \({\Phi }=p\); otherwise, when \(\hat{\sigma }^2_{{b}}>0\), \({\Phi }\) is computed differently depending on the estimation method. If the unbiased quadratic estimation method is used, then:

$$\begin{aligned} \Phi =\hat{\rho }+2(m-p)^{-1}r^{'}S\varvec{\Sigma }^{-1}P^{*}r_{s}. \end{aligned}$$
(22)

If \(\hat{\sigma }^2_{{{b}}}>0\) is the REML or ML estimate:

$$\begin{aligned} \Phi =\hat{\rho }-2\bigg (\frac{\partial \hat{s}}{\partial \varvec{\sigma }_{\varvec{b}}^2} \bigg )^{-1}r_s^{'}\hat{\varvec{\Sigma }}^{-1}P^{*}S\varvec{\Sigma }^{-1}P^{*}r_{s}, \end{aligned}$$
(23)

with \(\frac{\partial \hat{s}}{\partial {\sigma }_{b}^2}=tr\big ((\varvec{\Sigma }^{-1}P^{*})^2\big )-2r_s^{'}\hat{\varvec{\Sigma }}^{-1}P^{*}r_{s}\) in the REML case or \(\frac{\partial \hat{s}}{\partial {\sigma }_{b}^2}=\hbox {tr}(\varvec{\Sigma }^{-2})-2r_s^{'}\hat{\varvec{\Sigma }}^{-1}P^{*}r_{s}\) in the ML case, \(P^{*}=I-\varvec{X}(\varvec{X}^{'}\varvec{\Sigma }^{-1}\varvec{X})^{-1}\varvec{\Sigma }^{-1}\), r the residuals from the OLS estimation of \(\varvec{\beta }\) and \(r_s=\varvec{\Sigma }^{-1}P^{*}\varvec{Y}\) the standardized residuals from the GLS estimation of \(\varvec{\beta }\). The closed-form cAIC turns out to be an unbiased estimator of the conditional AI for the Fay–Herriot model.

Lahiri and Suntornchost (2015) are worth mentioning for their contribution to the selection of fixed effects in LMMs with applications to SAE models, even if their proposal does not concern a modification of an information criterion. The authors define alternatives to the usual mean square error and mean square total, estimating them by \(\widehat{\hbox {MSE}}=\hbox {MSE}-\overline{D}_w\) and \(\widehat{\hbox {MST}}=\hbox {MST}-\overline{D}\), respectively, where \(\overline{D}_w=\sum _{i=1}^{m}((1-h_{ii})D_i)/(m-p)\) and \(\overline{D}=\sum _{i=1}^{m}D_i/m\), with \(h_{ii}=\varvec{x}_i^{'}(\varvec{X}^{'}\varvec{X})^{-1}\varvec{x}_i\). They suggest using \(\widehat{\hbox {MSE}}\) and \(\widehat{\hbox {MST}}\) because, under standard regularity conditions, these measures tend to the true MSE and MST with probability one as the number of areas increases. However, since for small areas \(\widehat{\hbox {MSE}}\) and \(\widehat{\hbox {MST}}\) could be negative, the authors propose alternative estimates through the function \(h(\varvec{x},\varvec{b})\) in Eq. (24), which guarantees positive values:

$$\begin{aligned} h(\varvec{x},\varvec{b})=\frac{2\varvec{x}}{1+\exp \big (\frac{2\varvec{b}}{\varvec{x}}\big )}. \end{aligned}$$
(24)

This function yields the new estimators in the following way: \(\widehat{\hbox {MSE}}=h(\hbox {MSE},\overline{D}_w)\) and \(\widehat{\hbox {MST}}=h(\hbox {MST},\overline{D})\).
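A direct transcription of Eq. (24) (the inputs are assumed to be the raw MSE/MST values and the corresponding correction terms):

```python
import numpy as np

def h(x, b):
    """Adjustment of Lahiri and Suntornchost (2015), Eq. (24): behaves like
    x - b when b is small relative to x, but always returns a positive value."""
    return 2.0 * x / (1.0 + np.exp(2.0 * b / x))

# MSE_hat = h(MSE, D_bar_w);  MST_hat = h(MST, D_bar)
```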

Kubokawa and Srivastava (2010) derive an exact expression of the empirical Bayes information criterion (EBIC) for selecting the fixed effects in a linear mixed model. Their criterion represents an intermediate solution between BIC and full Bayes variable selection methods, because it exploits the partitioning of the vector of parameters (\(\varvec{\beta },\varvec{\tau }_*,\sigma \)) into two sub-vectors, one for the parameters of interest (\(\varvec{\beta }\)) and the other for the nuisance parameters (\(\varvec{\tau }_*,\sigma \)). Specifically, it works with a partial non-subjective prior distribution only for the parameters of interest, omitting a prior setup for the nuisance parameters and applying the Laplace approximation to them. The full prior distribution \(\pi (\varvec{\beta },\varvec{\tau })\) can be written through a proper prior distribution, \(\pi _1(\varvec{\beta }|\varvec{\tau },\lambda )\), which is not completely subjective because of its dependence on an unknown hyperparameter \(\lambda \):

$$\begin{aligned} \pi (\varvec{\beta },\varvec{\tau })=\pi _1(\varvec{\beta }|\varvec{\tau },\lambda )\pi _2(\varvec{\tau }). \end{aligned}$$

The authors derive the EBIC starting from the BIC, but they approximate the marginal distribution of \(\varvec{y}\), \(f(\varvec{y})\), with one of its two components, i.e., the conditional marginal density based on the partial prior distribution, \(m_1(\varvec{y}|\varvec{\tau },\lambda )\):

$$\begin{aligned} m_1(\varvec{y}|\varvec{\tau },\lambda )=\int f(\varvec{y}|\varvec{\beta },\varvec{\tau })\pi _1(\varvec{\beta }|\varvec{\tau },\lambda )\hbox {d}\varvec{\beta }. \end{aligned}$$

After estimating \(\lambda \), \(\hat{\lambda }=arg \max _{\lambda }{m_1(\varvec{y}|\hat{\varvec{\tau }},\lambda )}\) using a consistent estimator of \(\varvec{\tau }\), the EBIC is obtained as follows:

$$\begin{aligned} \text {EBIC}= & {} -2\log \{m_1(\varvec{y}|\hat{\varvec{\tau }},\hat{\lambda })\}+\dim (\varvec{\theta })\log (n)\\= & {} -2\log \{m_1(\varvec{y}|\hat{\sigma }^2,\hat{\varvec{\tau }_*},\hat{\lambda })\}+(d+1)\log (n). \end{aligned}$$

The derivation of the EBIC neglects the full prior distribution and uses the non-subjective prior distribution \(\pi _1(\varvec{\beta }|\sigma ^2,\lambda )\), assuming that, conditionally on \(\sigma ^2\), it follows a multivariate normal distribution:

$$\begin{aligned} \pi _1(\varvec{\beta }|\sigma ^2,\lambda )=N_p(0,\sigma ^2\lambda ^{-1}W), \end{aligned}$$

with an unknown scalar \(\lambda \) and a known \(p\times p\) matrix W. A possible choice for W is the so-called Zellner’s q-prior, \(W_q=n(\varvec{X}'\varvec{X})^{-1}\). The authors demonstrate the consistency of the EBIC.

Wenren and Shang (2016) and Wenren et al. (2016) work on conditional and marginal conceptual predictive statistics for linear mixed model selection, respectively. The conditional \(C_p\) is formalized both when \(\sigma ^2\) and \(\varvec{\Psi }_*\) are known and when they are unknown. The marginal \(C_p\) proves useful in two situations: when the sample size is small and when there is high correlation between the observations. Wenren et al. (2016) propose a modified variant of Mallows’ \(C_p\) when there is correlation between observations, even if unknown. They work under the assumption that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\), and assume that the estimator of the correlation matrix (for the candidate model) is consistent. The modified \(C_p\) (\(\hbox {MC}_p\)) is formalized as follows:

$$\begin{aligned} \text {MC}_{\text {p}}=\frac{\hbox {SS}_{\mathrm{RES}}}{\hat{\sigma }^2}+2p-{n}, \end{aligned}$$
(25)

where \(\hbox {SS}_{\mathrm{RES}}\) is the residual sum of squares of the candidate model and \(\hat{\sigma }^2\) is an asymptotically unbiased estimator of \(\sigma ^2\), computed for the largest candidate model. \(\hbox {MC}_p\) is a biased estimator of the expectation of the transformed marginal Gauss discrepancy; however, it is an unbiased estimator of \(\Delta _{C_p}(\varvec{\theta })\) if the true model is included in the pool of candidate models. For better performance, they also provide a more accurate estimator:

$$\begin{aligned} \text {IMC}_{\text {p}}=\frac{(n-p_*-2)\hbox {SS}_{\mathrm{Res}}}{\hbox {SS}_{\mathrm{Res}}^*}+2p-{n}+2, \end{aligned}$$
(26)

where the symbol * refers to the largest candidate model. \(\hbox {IMC}_p\) turns out to be an asymptotically unbiased estimator of the expected overall transformed Gauss discrepancy. It is preferred to \(\hbox {MC}_p\) because it avoids the bias introduced by using \(\frac{1}{\hat{\sigma }^2}\) to estimate \(\frac{1}{{\sigma }^2}\).
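For illustration, Eqs. (25) and (26) translate directly into the following functions (the residual sums of squares are assumed to come from the candidate model and from the largest candidate model):

```python
def modified_cp(ss_res, sigma2_hat, n, p):
    """MC_p of Eq. (25)."""
    return ss_res / sigma2_hat + 2 * p - n

def improved_mcp(ss_res, ss_res_star, n, p, p_star):
    """IMC_p of Eq. (26); starred quantities refer to the largest candidate model."""
    return (n - p_star - 2) * ss_res / ss_res_star + 2 * p - n + 2
```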

Wenren and Shang (2016) provide another conceptual predictive statistic for selecting a linear mixed model when one is interested in predicting specific clusters or random effects. Inspired by the cAIC and the conditional Mallows’ \(C_p\), they construct two versions of the conditional \(C_p\) (\(\hbox {CC}_p\)), according to whether the variance components are known or unknown. They, too, work under the assumption that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\). Assuming that \(\sigma ^2\) and \(\varvec{\Psi }_*\) are known, they combine a goodness-of-fit term with a penalty term and propose the \(\hbox {CC}_{p}\) defined as:

$$\begin{aligned} \text {CC}_{\text {p}}=\frac{\hbox {SS}_{\mathrm{Res}}}{\sigma ^2}+K, \end{aligned}$$
(27)

where \(K=2\rho -{n}\) is the penalty term based on the effective degrees of freedom \(\rho =tr(H_1)\) (Hodges and Sargent 2001). If the variance components are unknown, \(\varvec{\Psi }_*\) is replaced by its ML estimate \(\hat{\varvec{\Psi }}_*\) or its REML estimate \(\hat{\varvec{\Psi }}_{*R}\). The effective degrees of freedom \(\rho \) are also estimated, \(\hat{\rho }=tr(\hat{H}_1)\), where \(\hat{H}_1=\hat{H}_1(\hat{\varvec{\Psi }}_*)\) or \(\hat{H}_1=\hat{H}_1(\hat{\varvec{\Psi }}_{*R})\), and \(\sigma ^2\) is estimated in the largest candidate model (\(^*\)) through \( \hat{\sigma }^2=\frac{\hbox {SS}^*_{\mathrm{Res}}}{N-p_*}\), an unbiased estimator of \(\sigma ^2\). For further details about \(\hat{H}_1\) see Hodges and Sargent (2001). Substituting the variance components by their estimators in a suitable way, the conditional \(C_p\) becomes:

$$\begin{aligned} \text {CC}_{\text {p}}=({n}-p_*)\frac{\hbox {SS}_{\mathrm{Res}}}{\hbox {SS}^*_{\mathrm{Res}}}+\hat{K}, \end{aligned}$$
(28)

with \(\hat{K}=2\hat{\rho }-{n}\) indicating the (ML or REML) estimated penalty term.

Kuran and Özkale (2019) also provide a conditional conceptual predictive statistic in the framework of LMMs, but they apply a ridge estimator to overcome multicollinearity problems. Like Wenren and Shang (2016), they work under the assumption that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\). When multicollinearity has to be managed, one usually deletes one or more variables related to the fixed effects, but this can have non-negligible consequences: the fitted candidate model could be misspecified. For this reason, the authors resort to the ridge estimator and the ridge predictor for LMMs proposed by Liu and Hu (2013) and Özkale and Can (2017):

$$\begin{aligned} \hat{\varvec{\beta }}_k=(\varvec{X}^{'}\varvec{V}_*^{-1}\varvec{X}+kI_p)^{-1}\varvec{X}^{'}\varvec{V}_*^{-1}\varvec{y}, \end{aligned}$$
(29)
$$\begin{aligned} \hat{\varvec{b}}_k=\varvec{\Psi }_{*}\varvec{Z}^{'}\varvec{V}_{*}^{-1}(\varvec{y}-\varvec{X}\hat{\varvec{\beta }}_k), \end{aligned}$$
(30)

where k, a positive real number, represents the ridge biasing parameter. Its value is selected by minimizing a generalized cross-validation criterion in the prediction step and by minimizing the scalar mean square error of the ridge regression in the estimation step (see Özkale and Can 2017). Following Wenren and Shang (2016), they propose two versions of the conditional conceptual predictive statistic, depending on whether \(\sigma ^2\) and \(\varvec{\Psi }_*\) are known or not. The proposed criteria are the same as the \(\hbox {CC}_p\) in Eqs. (27) and (28), substituting the effective degrees of freedom under the ridge estimator for LMMs, \(\rho _k=\hbox {tr}(H_{1k})\), for \(\rho \), \(\hat{\rho }_k=\hbox {tr}(\hat{H}_{1k})\) for \(\hat{\rho }\) and \(\hbox {SS}_{\mathrm{Res},k}=(\varvec{y}-\hat{\varvec{y}}_k)^{'}(\varvec{y}-\hat{\varvec{y}}_k)\) for \(\hbox {SS}_{\mathrm{Res}}\), where \(H_{1k}=I_n-\varvec{V}_{*}^{-1}[I_n-\varvec{X}(\varvec{X}^{'}\varvec{V}_{*}^{-1}\varvec{X}+kI_p)^{-1}\varvec{X}^{'}\varvec{V}_{*}^{-1}]\).
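A minimal sketch of the ridge estimator (29) and ridge predictor (30); the scaled matrices \(\varvec{V}_*\) and \(\varvec{\Psi }_*\) and the biasing parameter k are assumed to be given.

```python
import numpy as np

def lmm_ridge(y, X, Z, Psi_star, V_star, k):
    """Ridge estimator and ridge predictor for LMMs, Eqs. (29)-(30)."""
    Vinv = np.linalg.inv(V_star)
    p = X.shape[1]
    beta_k = np.linalg.solve(X.T @ Vinv @ X + k * np.eye(p), X.T @ Vinv @ y)  # Eq. (29)
    b_k = Psi_star @ Z.T @ Vinv @ (y - X @ beta_k)                            # Eq. (30)
    return beta_k, b_k
```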

Li et al. (2014) propose a two-stage method based on the MDL principle. When \(\varvec{\beta }\) is the only unknown parameter, encoding the estimated parameter represents the first stage; then, the whole data sequence is encoded with the distribution \(f_{\hat{\varvec{\theta }}}\). The resulting total code length used for transmission is equivalent to BIC:

$$\begin{aligned} L(\varvec{y})=L(\varvec{y}|\hat{\varvec{\theta }})+L(\hat{\varvec{\theta }})= -\log {f_{\hat{\varvec{\theta }}}}(\varvec{y})+\frac{p}{2}\log (m). \end{aligned}$$

The penalty term, which measures the precision used to encode each parameter, is \(\log (m)/2\) under a uniform distribution. The authors follow the idea of the mixture MDL proposed by Hansen and Yu (2003), which assumes a mixture distribution induced by a user-defined probability distribution \(w(\varvec{\theta })\) on the parameter space \(\varvec{\Theta }\). They assume that \(\varvec{\Sigma }=\sigma ^2I_{n_i}\) and \(\varvec{\beta }\sim N(0,c\sigma ^2(X'_i\varvec{\Psi }_{*i}^{-1}\varvec{X}_i)^{-1})\), where the hyperparameter c is a nonnegative scalar. For \(\sigma ^2\), an inverse gamma distribution with parameters (a, 3/2) is assumed. Hence, the mixture description length of \(\varvec{y}\) is expressed as:

$$\begin{aligned} -\log m(\varvec{y})=-\log \int f_{\varvec{\theta }}(\varvec{y})w(\varvec{\theta })\hbox {d}\varvec{\theta }. \end{aligned}$$

The code length is minimized with respect to \(c\ge 0\) and the resulting \(\hat{c}\) is plugged into the code length expression, leading to the \(\hbox {lMDL}_0\) criterion. The expression of the final code length, with only \(\varvec{\beta }\) unknown and ignoring the impact of \(\varvec{b}\), is:

$$\begin{aligned} \left\{ \begin{array}{ll} \frac{1}{2}\left\{ \sum _{i=1}^{n}\varvec{y}_i'\varvec{\Sigma }_i^{-1}\varvec{y}_i-\hbox {FSS}_{\sigma }+p\left[ 1+\log \left( \frac{\hbox {FSS}_{\sigma }}{p}\right) \right] +\log n \right\} ,&{}\quad \text {if}\; \hbox {FSS}_{\sigma }>p,\\ \frac{1}{2} \sum _{i=1}^{n}\varvec{y}_i'\varvec{\Sigma }_i^{-1}\varvec{y}_i,&{}\quad \text {otherwise},\end{array} \right. \end{aligned}$$

where \(\hbox {FSS}_{\sigma }=(\sum _{i=1}^{n}\varvec{y}_i'\varvec{\Sigma }_i^{-1}\varvec{X}_i)(\sum _{i=1}^{n}\varvec{X}_i'\varvec{\Sigma }_i^{-1}\varvec{X}_i)^{-1}(\sum _{i=1}^{n}\varvec{X}_i'\varvec{\Sigma }_i^{-1}\varvec{y}_i)\) and \((\log n)/2\) is the code length needed to transmit \(\hat{c}\). If \(\hbox {FSS}_{\sigma }\le p\), then \(\hat{c}=0\), which implies that all fixed effects are null. The \(\hbox {lMDL}_0\) criterion has the same structure as penalized likelihoods such as AIC and BIC, but with a data-adaptive penalty that depends on the covariance matrices. In the most realistic case, with \((\sigma ^2,\varvec{\Psi }_*)\) unknown, the two-stage mixture MDL principle consists in estimating \(\varvec{\Psi }_*\) and plugging it into the code length. Minimizing the code length function with respect to a and c leads to an even more complex lMDL structure. The authors show that the MDL criteria possess the selection consistency of BIC for finite-dimensional models.
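A sketch of the \(\hbox {lMDL}_0\) code length displayed above, with only \(\varvec{\beta }\) unknown and the cluster-level covariance matrices treated as known; the per-cluster lists of responses, design matrices and covariance blocks are assumptions of this illustration.

```python
import numpy as np

def lmdl0(y_list, X_list, Sigma_list):
    """lMDL_0 code length (beta unknown, covariance matrices known)."""
    p = X_list[0].shape[1]
    n = len(y_list)                      # number of clusters entering the sums
    yy = 0.0
    xy = np.zeros(p)
    xx = np.zeros((p, p))
    for y_i, X_i, S_i in zip(y_list, X_list, Sigma_list):
        Sinv = np.linalg.inv(S_i)
        yy += y_i @ Sinv @ y_i
        xy += X_i.T @ Sinv @ y_i
        xx += X_i.T @ Sinv @ X_i
    fss = xy @ np.linalg.solve(xx, xy)   # FSS_sigma
    if fss > p:
        return 0.5 * (yy - fss + p * (1.0 + np.log(fss / p)) + np.log(n))
    return 0.5 * yy                      # c_hat = 0: all fixed effects null
```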

Marino et al. (2017) give an important contribution to the selection of relevant covariates in LMMs, since their proposal addresses mixed models with missing data. Their work deals with the selection of covariates in multilevel models and is hence applicable to linear mixed models, which are two-level models. The authors work under the assumption that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_{*}\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\) and that part of the covariates are ignorably missing, hence imputable. They propose to identify the covariates with missing data, to perform multiple imputation producing L complete datasets and, in the end, to stack all these datasets into one single wide complete dataset. The generic linear mixed model in Eq. (1) is then rewritten, taking the imputed datasets into account, as follows:

$$\begin{aligned} \mathbf Y _i=\sum _{l=1}^{L}\sum _{g=1}^{G}(\mathbf X _{ig}^{(l)}\varvec{\beta }_g^{(l)})+\varvec{Z}_i^{(\bullet )}\varvec{b}_i+\varvec{\epsilon }_i,\;\;\; i=1,2,\ldots ,m;\; g=1,\ldots ,G;\; l=1,\ldots ,L; \end{aligned}$$
(31)

where \(\varvec{X}_{ig}^{(l)}\) represents the g-th predictor for the i-th cluster from the l-th imputed dataset. After stacking all datasets into one, grouping the variables relevant for imputation, the model can be rewritten in a compact way:

$$\begin{aligned} \mathbf Y _i=\mathbf X _i^{(\bullet )}\varvec{ \beta }^{(\bullet )}+\varvec{Z}_i^{(\bullet )}\varvec{b}_i+\varvec{\epsilon }_i,\;\;\; \end{aligned}$$
(32)

where \(\mathbf X _i^{(\bullet )}=(\mathbf X _{i1}^{(\bullet )},\mathbf X _{i2}^{(\bullet )},\ldots ,\mathbf X _{iG}^{(\bullet )})^{'}\) contains all the imputation data and \(\varvec{ \beta }^{(\bullet )}\) is the related G-vector of parameters. To identify the relevant covariates, the authors suggest a shrinkage estimation process, i.e., maximizing the profile penalized REML log-likelihood built for the model extended to the imputed datasets:

$$\begin{aligned} Q_R(\varvec{\beta }^{(\bullet )})=l_R(\varvec{ \beta }^{(\bullet )},\sigma ^2,\varvec{\Psi }_{*})-\lambda \sum _{g=1}^{G}\sqrt{u_g}||\beta ^{(\bullet )}_g||, \end{aligned}$$
(33)

where \(\lambda \) is a positive tuning parameter and \(u_g\) is the number of covariates, belonging to group g, that contain imputed data. In the case of no missing data or of a single imputation, the optimal penalized solution is obtained through the classical LASSO penalization. Because of computational issues, instead of maximizing Eq. (33) the authors prefer to solve a different optimization problem, through an iterative algorithm, concerning the following penalized function:

$$\begin{aligned} Q_R^2(\varvec{ \beta }^{(\bullet )})=l_R(\varvec{ \beta }^{(\bullet )},\sigma ^2,\varvec{\Psi }_{*})-\sum _{g=1}^{G}\varvec{\tau }^2_g-\lambda ^2\sum _{g=1}^{G}\frac{u_g}{4\varvec{\tau }^2_g}[||\beta ^{(\bullet )}_g||]^2, \end{aligned}$$
(34)

Hossain et al. (2018) propose a non-penalty Stein-like shrinkage estimator and then an adaptive version of the same estimator. The approach first uses a non-penalty shrinkage estimator (SE) and then applies an adaptive measure related to the number of restrictions, which quantifies the distance between the restricted and the full model. The procedure works as follows: the log-likelihood function is maximized under the postulated restricted parameter space, using a Lagrange multiplier vector, to get a restricted estimator (RE) for \(\varvec{\beta }\); this allows building the profile log-likelihood for estimating \(\varvec{\tau }\). Once the RE for \(\varvec{\theta }=(\varvec{\beta },\varvec{\tau })\) is available, the likelihood ratio test statistic \(D_{m}=2[l(\hat{\varvec{\theta }}|\varvec{\theta })-l(\hat{\varvec{\theta }}_{\mathrm{RE}}|\varvec{\theta })]\) is introduced, which allows defining the pretest estimator (PT) for \(\varvec{\beta }\):

$$\begin{aligned} \hat{\varvec{\beta }}_{\text {PT}}=\hat{\varvec{\beta }}-I(D_{m}\le \chi ^2_{r,\alpha })(\hat{\varvec{\beta }}-\hat{\varvec{\beta }}_{\mathrm{RE}}). \end{aligned}$$
(35)

Since \(\hat{\varvec{\beta }}_{\mathrm{PT}}\) is a discontinuous function of \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\beta }}_{\mathrm{RE}}\) and depends on the \(\alpha \)-level chosen a priori by the user, an adaptive shrinkage estimator is built as follows:

$$\begin{aligned} \hat{\varvec{\beta }}_{\text {PSE}}=\hat{\varvec{\beta }}_{\mathrm{RE}}+(1-(r-2)D_{m}^{-1})(\hat{\varvec{\beta }}-\hat{\varvec{\beta }}_{\mathrm{RE}}),\quad r\ge 3, \end{aligned}$$
(36)

The shrinkage estimator in Eq. (36) is, actually, a linear combination of \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\beta }}_{\mathrm{RE}}\), with weights determined by the shrinkage factor \((r-2)D_{m}^{-1}\). The final estimator proposed by the authors is the positive-part shrinkage estimator, which retains only the positive part of the shrinkage weight in Eq. (36), since the SE is not a convex combination of \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{\beta }}_{\mathrm{RE}}\).
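For illustration, Eqs. (35) and (36) with the positive-part rule can be coded as follows (the unrestricted and restricted estimates, the statistic \(D_m\) and the number of restrictions r are assumed to be available from the fitting step):

```python
import numpy as np
from scipy.stats import chi2

def pretest_estimator(beta_hat, beta_re, D_m, r, alpha=0.05):
    """Pretest estimator, Eq. (35): keep the restricted estimate when the
    likelihood ratio statistic does not exceed the chi-square threshold."""
    return beta_re if D_m <= chi2.ppf(1 - alpha, df=r) else beta_hat

def positive_part_shrinkage(beta_hat, beta_re, D_m, r):
    """Positive-part version of the shrinkage estimator in Eq. (36), r >= 3."""
    w = max(1.0 - (r - 2) / D_m, 0.0)    # weight truncated at zero
    return np.asarray(beta_re) + w * (np.asarray(beta_hat) - np.asarray(beta_re))
```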

Only two papers discuss the selection of fixed effects in a linear mixed model in a high-dimensional setting: Rohart et al. (2014) and Ghosh and Thoresen (2018).

In many fields, one has to manage quite a large number of covariates. Thus, if interest focuses on obtaining optimal inference, choosing only the relevant covariates is particularly important.

Ghosh and Thoresen (2018) contribute to linear mixed-effects model selection with a non-concave penalization for the selection of fixed effects. Their procedure maximizes a penalized likelihood in which non-concave penalties are implemented, considering \(\varvec{\Sigma }=\sigma ^2I_{n_i}\). A general objective function (involving a general non-convex optimization):

$$\begin{aligned} Q_{n,\lambda }(\varvec{\beta },\varvec{\eta })=L_n(\varvec{\beta },\varvec{\eta })+\sum _{j=1}^{p}P_{n,\lambda }(|{\beta }_j|), \end{aligned}$$
(37)

has to be minimized with respect to \((\varvec{\beta },\varvec{\eta })\) for a general loss function, \(L(\varvec{\beta },\varvec{\eta })\), which is assumed to be convex only in \(\varvec{\beta }\) and non-convex in \(\varvec{\eta }\). Two situations can be distinguished: one in which the number of fixed effects is smaller than the number of observations (\(p<n\)), and a high-dimensional setup where p is of non-polynomial (NP) order in the sample size n.

Under appropriate assumptions on the penalty, as n increases, \(\max \{p''_{\lambda _n}(|\varvec{\beta }|)\}\rightarrow 0\) and \(\frac{p'_{\lambda _n}(\varvec{\theta })}{\lambda _n}>0\). Moreover, the true parameter \(\varvec{\beta }_0\) is divided into two sub-vectors, \(\varvec{\beta }_0 =(\varvec{\beta }^{(1)'}_0,\varvec{\beta }^{(2)'}_0)'\), where \(\varvec{\beta }^{(2)}_0\) is a null vector. If \(\lambda _n \rightarrow 0\) and \(\sqrt{n}\lambda _n \rightarrow \infty \) as n increases, a local minimizer exists that satisfies \(\hat{\varvec{\beta }}^{(2)}=0\). In the high-dimensional case, when p is of non-polynomial (NP) order in the sample size, the SCAD penalty should be considered in order to obtain an estimator that is simultaneously consistent and satisfies the oracle property (Fan and Li 2001) of variable selection optimality for any suitably chosen regularization sequence \({\lambda _n}\). Under some further assumptions (presented extensively in Ghosh and Thoresen 2018), a local minimizer is obtained that, with probability tending to one as n increases, satisfies \(\hat{\varvec{\beta }}^{(2)}=0\) and has an estimated active set coinciding with the true active set of the fixed-effect parameters. The estimators \(\hat{\varvec{\beta }}^{(1)}\) and \(\hat{\varvec{\eta }}\) are asymptotically normally distributed under both dimensional settings.

Rohart et al. (2014) focus on the selection of the fixed effects in a high-dimensional linear mixed model, suggesting the addition of an \(\ell _1\)-penalty on \(\varvec{\beta }\) to the log-likelihood of the complete data. This penalization is useful when the number of fixed effects is greater than the number of observations: it shrinks some coefficients to zero. They propose an iterative multicycle expectation conditional maximization (ECM) algorithm to solve the minimization problem for the objective function:

$$\begin{aligned} g(\varvec{\theta };\varvec{x})=-2L(\varvec{\theta };\varvec{x})+\lambda ||\varvec{\beta }||_1, \end{aligned}$$
(38)

The algorithm consists of four steps and converges when three stopping criteria, based, respectively, on \(||\varvec{\beta }^{[t+1]}-\varvec{\beta }^{[t]}||^2\), \(||\varvec{b}^{[t+1]}_k-\varvec{b}^{[t]}_k||^2\) and \(||L(\varvec{\theta }^{[t+1]},\varvec{x})-L(\varvec{\theta }^{[t]},\varvec{x})||^2\), are fulfilled. Since the estimation of \(\varvec{\theta }\) is biased, a good choice is to use the algorithm only for estimating the support of \(\varvec{\beta }\) and, after that, to estimate \(\varvec{\theta }\) by a classic mixed model estimation based on the model that contains only the J relevant fixed effects: \(\varvec{y}=\varvec{X}\varvec{\beta }_J+\varvec{Z}\varvec{b}+\varvec{\epsilon }\). The regularization parameter \(\lambda \) is tuned with the BIC,

$$\begin{aligned} \lambda _{\text {BIC}}=\min _\lambda \{\log |\varvec{V}_\lambda |+(\varvec{y}-\varvec{X}\hat{\varvec{\beta }}_\lambda )'V_\lambda ^{-1}(\varvec{y}-\varvec{X}\hat{\varvec{\beta }}_\lambda )+d_\lambda \log (n)\}, \end{aligned}$$
(39)

where \(d_\lambda \) is the number of nonzero variance–covariance parameters plus the number of nonzero fixed-effect coefficients. By substituting the LASSO method in the second step with any other variable selection method that optimizes a criterion, the algorithm becomes a general multicycle ECM. All these considerations hold assuming independence between the random effects, i.e., when there are q random effects corresponding to q grouping factors. As regards the selection of the random effects, a random effect whose variance becomes very small at one step of the algorithm is removed. The algorithm produces the same results and enjoys the same theoretical properties as the lmmLasso method (Schelldorfer et al. 2011) when the variances are known or assumed to be known, but it is much faster.
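The tuning step in Eq. (39) amounts to evaluating a BIC-type criterion on a grid of \(\lambda \) values and keeping the minimizer; a sketch of the criterion is given below (the fitted quantities \(\hat{\varvec{\beta }}_\lambda \), \(\varvec{V}_\lambda \) and \(d_\lambda \) are assumed to come from the ECM fit at each \(\lambda \)).

```python
import numpy as np

def bic_lambda(y, X, beta_lam, V_lam, d_lam):
    """BIC-type criterion of Eq. (39) for one value of lambda."""
    _, logdet = np.linalg.slogdet(V_lam)
    resid = y - X @ beta_lam
    return logdet + resid @ np.linalg.solve(V_lam, resid) + d_lam * np.log(len(y))

# lambda_BIC is the grid value minimizing bic_lambda(...)
```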

5 Random effects selection

Testing whether random effects exist is equivalent to testing the hypothesis that their variance–covariance matrix is made of zeros. Some authors, like Zhang et al. (2016), worked on the identification of the covariance structure of the random effects, and others, such as Wang (2016), provided some characterizations of the response covariance matrix that cause model non-identifiability. The common perspective of these works lies in providing a preliminary analysis before the selection of the effects in a linear mixed model, without providing a tool for testing the significance of the random effects. Li and Zhu (2013), instead, introduced a test for evaluating the existence of random effects in semi-parametric mixed models for longitudinal data, proposing a projection method. The two authors created a test with two estimates of the error variance, one consistent under the null hypothesis and the other consistent under both the null and the alternative. The idea was to compare the two estimates under the alternative hypothesis, leading to rejection of the null hypothesis for large values of the test statistic. The test, however, turned out to be neither stable nor powerful, because of the projection of the \(\varvec{Z}\) variables onto the space spanned by the \(\varvec{X}\) variables. Hence, the authors propose a similar but more powerful test in the LMM framework, without projections. No assumptions on the random effects or the random errors are needed to develop the test. The test statistic is built using the trace of the variance–covariance matrix of the random effects:

$$\begin{aligned} T_{m\Omega }=\frac{\hbox {tr}(\hat{A})}{\sqrt{(\hat{k}-3\hat{\sigma }^4)\hbox {tr}\{\hbox {diag}^2(M^{\mathrm{tr}}_{0m})\}+2\hat{\sigma }^4\hbox {tr}\{(M^{\mathrm{tr}}_{0m})^2\}}}\xrightarrow {d}N(0,1), \quad \text {m}\rightarrow \infty . \end{aligned}$$
(40)

Under the alternative, the same test converges in distribution to \(N(m_\Omega ,1)\), where

$$\begin{aligned} m_\Omega =\frac{k_0\{c_{11}-q_1+(q_1-1)c_{13}\}\hbox {tr}(\varvec{\Sigma _{z}}Q_{10})}{\sqrt{(k-3\sigma ^4)C_{\mathrm{diag}}+2\sigma ^4C_{\mathrm{tr}}}}, \end{aligned}$$
(41)

with \(c_{11}\) and \(c_{13}\) estimates of variance–covariance matrices related to the scaled \(\varvec{Z}\), and \(C_{\mathrm{tr}}\) and \(C_{\mathrm{diag}}\) two nonnegative constants such that \(\lim _{m\rightarrow \infty }[m\cdot \hbox {tr}\{\hbox {diag}^2(M_{0m}^{\mathrm{tr}})\}]=C_{\mathrm{diag}}\) and \(\lim _{m\rightarrow \infty }[m\cdot \hbox {tr}\{(M_{0m}^{\mathrm{tr}})^2\}]=C_{\mathrm{tr}}\). The test turns out to be consistent not only under the null hypothesis but also under the alternative. Even when the rate of convergence is slower than \(m^{-1/2}\), the test remains consistent. Furthermore, the test performs well even when high correlations between \(\varvec{Z}\) and \(\varvec{X}\) are present.
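
As an illustration of how the statistic in Eq. (40) is assembled once its ingredients have been estimated, the following sketch (not the authors' code; all inputs are assumed to be produced by their estimation step) standardizes the trace of \(\hat{A}\) so that it can be compared with a standard normal quantile.

```python
import numpy as np

def trace_test_statistic(A_hat, M0m, sigma2_hat, kurt_hat):
    """Assemble the standardized trace statistic of Eq. (40).

    A_hat, M0m, sigma2_hat (error variance) and kurt_hat (the fourth-moment
    estimate k-hat) are assumed to come from the authors' estimation step;
    this helper only combines them.  Large values lead to rejecting the null
    hypothesis of no random effects (reference distribution: N(0, 1)).
    """
    diag_sq = np.sum(np.diag(M0m) ** 2)      # tr{diag^2(M_0m)}
    tr_sq = np.trace(M0m @ M0m)              # tr{(M_0m)^2}
    var_term = (kurt_hat - 3.0 * sigma2_hat ** 2) * diag_sq \
        + 2.0 * sigma2_hat ** 2 * tr_sq
    return np.trace(A_hat) / np.sqrt(var_term)
```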

6 Fixed and random effects selection

In most real applications, the problem is to identify the important predictors corresponding not only to the fixed effects but also to the random part of the model. The joint selection of the two types of effects has drawn increasing attention in recent years. Most of the proposed procedures are related to shrinkage methods, as a joint look at Tables 2 and 3 confirms. Joint effect selection through penalized functions can be based on a two-stage procedure, which considers fixed and random effects separately, or on a one-stage procedure, which considers them jointly. Bondell et al. (2010) underlined that, in a separate selection, a change in the structure of one set of effects can lead to considerably different choices of variables for the other set of effects. Lin et al. (2013), on the other hand, argued that greater computational efficiency is achieved if the effects are selected separately. The number of stages employed in the shrinkage methods is reported in Table 1.

Table 1 Settings of LMM selection procedures with shrinkage

Braun et al. (2012) propose a predictive cross-validation (CV) criterion for the selection of covariates or random effects in linear mixed-effects models with serial correlation. Their approach is based on the logarithmic score and the continuous ranked probability score (CRPS). Wang and Schaalje (2009) use point predictions, while Braun et al. (2012) focus on the whole predictive distribution, inspired by the proper scoring rules suggested by Gneiting and Raftery (2007) and the “mixed” cross-validation approach provided by Marshall and Spiegelhalter (2003). In detail, they use a very common proper score, the logarithmic score (LS), which evaluates the log predictive density \(f(\varvec{y})\) at the observed value \(\varvec{y}_{\mathrm{obs}}\), and the CRPS, which is sensitive to distance, i.e., it measures how close a predictive value is to the observed value through a weighting system. With the univariate Gaussian as predictive distribution, the CRPS has the following form:

$$\begin{aligned} \hbox {CRPS}(\varvec{Y},\varvec{y}_{\mathrm{obs}})= \sigma \bigg [\frac{1}{\sqrt{\pi }}-2\varphi \bigg (\frac{\varvec{y}_{\mathrm{obs}}-\varvec{\mu }}{\sigma }\bigg )-\frac{\varvec{y}_{\mathrm{obs}}-\varvec{\mu }}{\sigma }\bigg (2\Phi \bigg (\frac{\varvec{y}_{\mathrm{obs}}-\varvec{\mu }}{\sigma }\bigg )-1\bigg )\bigg ], \end{aligned}$$
(42)

where \(\varphi \) and \(\Phi \) indicate the p.d.f. and the distribution function of a standardized Gaussian variable, respectively. The “mixed” cross-validation approach fits a model to the whole dataset. Once the hyperparameters have been estimated from all the data, one observation at a time is left out and the LS and the CRPS are computed for it. Finally, the cross-validation mean scores \(\overline{\hbox {LS}}_{\mathrm{CV}}\) and \(\overline{\hbox {CRPS}}_{\mathrm{CV}}\) are calculated by averaging over the left-out observations. The \(\overline{\hbox {LS}}_{\mathrm{CV}}\) is asymptotically equivalent to the cAIC, but it is preferable to a full cross-validation approach because only one model is fitted at the beginning instead of one model for each observation left out.
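
For a Gaussian predictive distribution, the CRPS in Eq. (42) and the “mixed” cross-validation mean score can be computed directly. The sketch below is a minimal illustration of these two formulas; it assumes that the predictive means and standard deviations of the left-out observations have already been obtained with hyperparameters estimated once from the full data.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y_obs, mu, sigma):
    """CRPS of a Gaussian predictive distribution N(mu, sigma^2) at y_obs,
    written in the positively oriented form of Eq. (42)."""
    z = (y_obs - mu) / sigma
    return sigma * (1.0 / np.sqrt(np.pi) - 2.0 * norm.pdf(z)
                    - z * (2.0 * norm.cdf(z) - 1.0))

def crps_cv_mean(y_left_out, pred_means, pred_sds):
    """'Mixed' cross-validation mean CRPS: the predictive means and standard
    deviations of the left-out observations are assumed to be based on
    hyperparameters estimated once from the full data."""
    scores = [crps_gaussian(y, m, s)
              for y, m, s in zip(y_left_out, pred_means, pred_sds)]
    return float(np.mean(scores))
```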

Schmidt and Smith (2016) focus on model selection when the number of models involved in the process is huge. They introduce a parameter subset selection (PSS) algorithm. The technique consists of ranking the parameters by their significance in order to establish the influential ones. The basic assumptions on the variance–covariance matrices of the random effects and of the random errors are \(\varvec{\Psi }\) and \(\sigma ^2I_{n_i}\), respectively. The methodology is based on the asymptotic approximation of standard errors, measured through a normalization of the estimated standard deviations for each parameter. The proposed method works as follows: first, an estimate of the error variance is computed; then, using a local sensitivity matrix—containing the derivatives with respect to all fixed and random parameters for each i-th observation—the variance–covariance matrix with all variances and correlations for the fixed and the random effects is estimated (the authors suggest using, for instance, the Moore–Penrose pseudoinverse). An estimate of the standard error of each parameter is then available, \(\sqrt{\hbox {Cov}(k,k)}\), and it is used to obtain a selection score for each k-th parameter in the i-th individual: \(\alpha _{k_i}=|\hbox {st.err.}_k/\hat{\varvec{\theta }}_{k_i}|\). A small selection score corresponds to a significant parameter. A ranking of all selection scores is created by assigning a selection index \(\gamma _{k_i}\) according to the position reached by each \(\alpha _{k_i}\) in the ordering. For each parameter a global selection index \(\Gamma _k=\sum _{i=1}^{m}\gamma _{k_i}\) is calculated, so that the smallest values of this global index correspond to the most significant parameters across all the clusters. If two or more parameters lead to the same \(\Gamma _k\), the parameter with the smallest selection scores over all m individuals is chosen as the most significant one. It is worth noting that, since the PSS is repeated m times, the m sets of parameter rankings will all be different, because the random-effect parameter estimates differ across individuals. The PSS algorithm assigns to the standard errors the role of measuring parameter uncertainty: the parameters with the smallest selection scores are the most significant ones and those with the smallest uncertainty.
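
The ranking step of the PSS procedure can be illustrated with a short sketch; the array shapes and the function name below are our own choices, not taken from Schmidt and Smith (2016).

```python
import numpy as np

def pss_global_index(theta_hat, std_err):
    """Ranking step in the spirit of the PSS algorithm.

    theta_hat : (m, K) parameter estimates, one row per individual
    std_err   : (m, K) corresponding standard errors

    Returns the global selection index Gamma_k for each of the K parameters;
    smaller values point to more significant parameters.
    """
    # selection scores alpha_{k_i} = |st.err._k / theta_hat_{k_i}|
    alpha = np.abs(std_err / theta_hat)
    # within each individual, rank the scores (1 = smallest, most significant)
    gamma = alpha.argsort(axis=1).argsort(axis=1) + 1
    # global selection index: sum of the ranks over the m individuals
    return gamma.sum(axis=0)

# ordering of the parameters from most to least significant:
# order = np.argsort(pss_global_index(theta_hat, std_err))
```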

Rocha and Singer (2018) propose exploratory methods, based on fitting standard regression models to the individual response profiles or to the rows of the sample within-units covariance matrix (in the case of balanced data), as supplementary tools for selecting a linear mixed-effects model. As concerns the choice of the fixed effects, they examine profile plots and suitable hypothesis tests. Assuming homoscedastic conditional independence, the model in Eq. (1) is rewritten as:

$$\begin{aligned} \varvec{y}_i=\varvec{X}^{*}_i\varvec{\beta }^{*}_i+\varvec{\epsilon }_i, \end{aligned}$$
(43)

where \(\varvec{X}_i^{*}\) contains the variables common to \(\varvec{X}_i\) and \(\varvec{Z}_i\) as well as those unique to each of them, and \(\varvec{\beta }^{*}_i\) contains the \(p+k\) parameters related to the fixed and the random effects. To test whether the generic k-th element of \(\varvec{\beta }\) is null, they propose the following test statistic:

$$\begin{aligned} t=\frac{\overline{\varvec{\beta }}^{*}_k}{n^{-1}\sqrt{\hat{\sigma }^2\hbox {diag}_k[(\sum _{i=1}^{m}\varvec{X}_i^{*^{'}}\varvec{X}^{*}_i)^{-1}]}}\sim t_v , \end{aligned}$$
(44)

where the degrees of freedom \(v=\sum _{i=1}^{m}n_i-m(p+q)\) and the estimated \(\hat{\sigma }^2\) is given by \(\sum _{i=1}^{m}\frac{n_i-(p+q)}{v}\hat{\sigma }^2_i\), with:

$$\begin{aligned} \hat{\sigma }^2_i=\frac{1}{n_i-(p+q)}\varvec{Y}_i^{'}[I_{n_i}-\varvec{X}_i^{*}(\varvec{X}_i^{*^{'}}\varvec{X}_i^{*})^{-1}\varvec{X}_i^{*^{'}}]\varvec{Y}_i\text {.} \end{aligned}$$
(45)

The variance of \(\hat{{\beta }}_{ik}^{*}\), \(i=1,2,\ldots ,m,\) is expected to be equal to the k-th diagonal term of \(\sigma ^2(\varvec{X}_i^{*^{'}}\varvec{X}_i^{*})^{-1}\) when the variance of the corresponding random coefficient, \({b}_{ik}\), is null. Otherwise, we might expect a larger variability of \(\hat{{\beta }}_{ik}^{*}\) around its mean. The k-th element of \(\hat{\varvec{\beta }}_{i}^{*}\), \(\hat{{\beta }}_{ik}^{*}\), follows a \(\mathscr {N}({\beta }_{ik}^{*};v_{ik}\sigma ^2)\) distribution, where \(v_{ik}=\hbox {diag}_k\{(\varvec{X}_i^{*^{'}}\varvec{X}^{*}_i)^{-1}\}\). Therefore, \(\hat{{\beta }}_{ik}^{*}/\sqrt{v_{ik}}\sim \mathscr {N}({\beta }_{ik}^{*}/\sqrt{v_{ik}};\sigma ^2)\). Letting \(\hat{w}_{ik}=\hat{\beta }_{ik}^{*}/\sqrt{v_{ik}}\) and \(\overline{w}_{k}=\sum _{i=1}^{m}\hat{w}_{ik}/m\), it follows that:

$$\begin{aligned} t(\hat{w}_k)=\sqrt{n/(n-1)}(\hat{w}_{ik}-\overline{w}_k)/\hat{\sigma }\sim t_v \text {.} \end{aligned}$$
(46)

Thus, for each k we expect around \(\alpha \%\) of the values of \(t(\hat{w}_k)\) to fall outside the Bonferroni-corrected confidence interval with global significance level \(\alpha ^{*}\%=\alpha /(m(p+q))\), namely \([t_v(\alpha ^{*}/2),t_v(1-\alpha ^{*}/2)]\), where \(t_v(\delta )\) denotes the 100\(\delta \%\) percentile of the t distribution with v degrees of freedom. A larger percentage of points outside that interval suggests that \(b_{ik}\) may be a random coefficient. Combining the two test statistics in Eqs. (44) and (46) makes it possible to detect which effects are statistically significant in the selection procedure. Another way to select the random effects requires the assumption of homoscedastic conditional independence, i.e., data collected at the same time points. In this case, the number of units is the same for each i-th individual, and hence it is possible to estimate a single variance–covariance matrix \(\varvec{V}\) as \(S-\hat{\sigma }^2I_n\), where \(S=(m-1)^{-1}\sum _{i=1}^{m}(\varvec{y}_i-\overline{\varvec{y}})(\varvec{y}_i-\overline{\varvec{y}})^{'}\). By fitting polynomial models of the same degree to the rows of S, the exploratory analysis obtained along these lines becomes an additional tool for the selection of the random effects.
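
The exploratory check based on Eq. (46) amounts to counting, for each candidate coefficient, how many standardized subject-specific estimates fall outside the Bonferroni-corrected t interval. The sketch below illustrates this idea under the notation of the text; it is a simplified illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import t as t_dist

def random_coefficient_screen(beta_star, v, sigma_hat, n, p, q, alpha=0.05):
    """Fraction of standardized subject-specific estimates falling outside the
    Bonferroni-corrected t interval (Eq. 46); values well above alpha suggest
    that the corresponding coefficient may be random.

    beta_star : (m, K) subject-specific estimates beta*_{ik}
    v         : (m, K) diagonal terms v_{ik} of (X*_i' X*_i)^{-1}
    sigma_hat : pooled residual standard deviation (Eq. 45)
    n         : total number of observations, sum of the n_i
    p, q      : numbers of fixed and random candidate effects
    """
    m, _ = beta_star.shape
    w = beta_star / np.sqrt(v)                               # w_hat_{ik}
    t_vals = np.sqrt(n / (n - 1.0)) * (w - w.mean(axis=0)) / sigma_hat
    df = n - m * (p + q)                                     # degrees of freedom v
    alpha_star = alpha / (m * (p + q))                       # Bonferroni correction
    crit = t_dist.ppf(1.0 - alpha_star / 2.0, df)
    return np.mean(np.abs(t_vals) > crit, axis=0)
```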

6.1 One-stage shrinkage procedures

Chen et al. (2015) propose a variable selection methodology for ANOVA-type linear mixed models in a high-dimensional setting. They focus on the selection of the fixed effects and on testing the existence of the random effects. The authors assume that \(\hbox {cov}(\varvec{b}_i)=\sigma ^2_iI_{n_i}\) and \(\varvec{\Sigma }=\sigma ^2I\), without setting any distributional assumption for \(\varvec{Y}\). The selection of the fixed effects is made through the SCAD penalty. With the main purpose of removing the heteroscedasticity and correlation of the response variable, they modify the model in Eq. (1) through an orthogonalization based on \(\varvec{Z}_\bot \). Let \(\mathscr {M}(\varvec{Z})\) be the vector space spanned by the columns of \(\varvec{Z}\), \(\varvec{Z}_\bot \) such that \(\varvec{Z}_\bot ^{'}\varvec{Z}=0\), and \(\mathscr {M}(\varvec{Z})^\bot \) the orthogonal complement of \(\mathscr {M}(\varvec{Z})\); therefore:

$$\begin{aligned} \varvec{Z}_\bot ^{'}\varvec{Y}=\varvec{Z}_\bot ^{'}\varvec{X}\varvec{\beta }+\varvec{Z}_\bot ^{'}\varvec{\epsilon }. \end{aligned}$$
(47)

A sparse estimate of \(\varvec{\beta }\) can be obtained by minimizing:

$$\begin{aligned} Q(\varvec{\beta })=\frac{1}{2}(\varvec{Y}-\varvec{X}\varvec{\beta })^{'}P_{(\varvec{Z})_\bot }(\varvec{Y}-\varvec{X}\varvec{\beta })+n\sum _{j=1}^{p}p_{\lambda }(|{\beta }_j|), \end{aligned}$$
(48)

where \(P_{(\varvec{Z})_\bot }=\varvec{Z}_\bot \varvec{Z}_\bot ^{'}\) is the orthogonal projection matrix onto the space \(\mathscr {M}(\varvec{Z})^{\bot }\) and \(p_{\lambda }(\cdot )\) is the SCAD penalty. Setting \(\varvec{Y}^{*}=\varvec{Z}^{'}_\bot \varvec{Y}\) and \(\varvec{X}^{*}=\varvec{Z}^{'}_\bot \varvec{X}\), the algorithm for minimizing \(Q(\varvec{\beta })\), the convergence test and the selection of the thresholding parameters can be applied to Eq. (48) without additional effort. Once the fixed-effect parameters are estimated, the authors focus on the selection of the random effects, i.e., on detecting whether some \(\sigma _i=0\). The formal hypothesis system is:

$$\begin{aligned} H_0: \sigma ^2_k=0, k\in \mathscr {D} \leftrightarrow H_a:\exists \mathscr {D}_{*}\subseteq \mathscr {D}, s.t., \sigma ^2_k>0, k\in \mathscr {D}_{*}, \end{aligned}$$
(49)

where \(\mathscr {D}\) is a subset of \(\{1,2,\ldots ,q\}\). Two estimators are proposed for \(\sigma ^2\): one, \(\hat{\sigma }^2\), consistent even if the null hypothesis does not hold, and the other, \(\hat{\sigma }^2_0\), consistent only under the null hypothesis. Denoting by \(\hat{l}\hat{=}\{i:\hat{\beta }_i\ne 0\}\) the set of relevant fixed effects, once the fixed parameters have been estimated, and by \(W_{\hat{l}}\hat{=}(\varvec{X}_{\hat{l}},\varvec{Z})\) the corresponding covariate matrix together with the design matrix of the random effects, the first estimator of \(\sigma ^2\) is defined as:

$$\begin{aligned} \hat{\sigma }^2=\frac{\varvec{Y}^{'}P_{(W_{\hat{l}})\bot }\varvec{Y}}{tr[P_{(W_{\hat{l}})\bot }]}, \end{aligned}$$
(50)

where \(P_{(W_{\hat{l}})\bot }\) is the orthogonal projection matrix onto the space \(\mathscr {M}(W_{\hat{l}})^{\bot }\), while the estimator that is consistent only under the null hypothesis is:

$$\begin{aligned} \hat{\sigma }^2_0=\frac{\varvec{Y}^{'}P_{(W_{\hat{l},- \mathscr {D}})\bot }\varvec{Y}}{tr[P_{(W_{\hat{l},- \mathscr {D}})\bot }]}, \end{aligned}$$
(51)

Let us assume that \(\mathscr {D}=\mathscr {D}_1 \cup \mathscr {D}_2\) with \(\mathscr {D}_1\hat{=}\{k:k\in \mathscr {D}, m_k\rightarrow \infty \) when \(n\rightarrow \infty \}\) and \(\mathscr {D}_2\hat{=}\{k:k\in \mathscr {D}, m_k=O(1)\}\). Under \(H_0\) in (49), under certain conditions and assuming that \(\mathscr {D}_1\) is empty, the authors build a test for assessing the existence of at least one of the random effects based on the difference between (50) and (51), which converges in distribution to \(\chi ^2(g)\), where g represents the dimension of the space \(\mathscr {M}(P_{(W_{l,-\mathscr {D}})_\bot } Z_\mathscr {D})\). If instead \(\mathscr {D}_1\) contains at least one element, then, noting that \(\hat{\sigma }^2-\hat{\sigma }^2_0=\varvec{Y}^{'}M_{n,\hat{l}}\varvec{Y}\), with \(M_{n,\hat{l}}\hat{=}\frac{P_{(W_{\hat{l}})_\bot }}{tr(P_{(W_{\hat{l}})_\bot })}-\frac{P_{(W_{\hat{l},-\mathscr {D}})_\bot }}{tr(P_{(W_{\hat{l},-\mathscr {D}})_\bot })}\), the test to be considered is:

$$\begin{aligned} T_{nG,\hat{l}}(\gamma )=\frac{\varvec{Y}^{'}M_{n,\hat{l}}\varvec{Y}}{\hat{\sigma }^2\sqrt{\gamma tr\{\hbox {diag}^2(M_{n,\hat{l}})\}+2tr\{M_{n,\hat{l}} \}}}\xrightarrow {d}N(0,1)\quad \text {as}\quad n\rightarrow \infty ,\quad \end{aligned}$$
(52)

where \(\gamma \) indicates the kurtosis parameter that can be estimated with any consistent estimator.
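
Computationally, the key ingredients of this approach are the orthogonal transformation that annihilates the random effects and the two variance estimators of Eqs. (50)–(51). The following sketch illustrates them with generic linear-algebra tools; the helper names are ours, and the SCAD optimization itself is omitted.

```python
import numpy as np
from scipy.linalg import null_space

def orthogonal_complement(Z):
    """Columns form an orthonormal basis of M(Z)^perp, so that Z_perp' Z = 0."""
    return null_space(Z.T)

def transform_for_fixed_effects(Y, X, Z):
    """Y* = Z_perp' Y and X* = Z_perp' X: in the transformed model the random
    effects are annihilated, so a SCAD-penalized least-squares routine can be
    applied directly to (Y*, X*) as in Eq. (48)."""
    Z_perp = orthogonal_complement(Z)
    return Z_perp.T @ Y, Z_perp.T @ X

def projection_variance_estimate(Y, W):
    """Variance estimators of the form of Eqs. (50)-(51): Y' P_perp Y / tr(P_perp),
    where P_perp projects onto the orthogonal complement of the columns of W.
    With W = (X_lhat, Z) this gives sigma^2-hat; removing from W the columns of Z
    indexed by D gives sigma^2_0-hat, and their difference drives the test in (52)."""
    n = W.shape[0]
    P_perp = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    return (Y @ P_perp @ Y) / np.trace(P_perp)
```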

Fan et al. (2014) propose a robust estimator for jointly selecting the fixed and the random effects. Their variable selection methodology is robust against outliers in both the response and the covariates. The variance–covariance matrix of the random effects is factorized using the Cholesky decomposition \(\varvec{\Psi }=\Lambda \Gamma \Gamma ^{'}\Lambda \), where \(\Lambda =\hbox {diag}(\nu _1,\nu _2,\ldots ,\nu _q)\) is a diagonal matrix and \(\Gamma \) is a triangular matrix with 1 on its diagonal. Hence, the random effects \(\varvec{b}_i\) are now replaced by \(\Lambda \Gamma \varvec{b}_i^{*}\). It is worth noting that setting one element of \(\Lambda \) to zero implies that all elements of the corresponding row and column of \(\varvec{\Psi }\) are zero too, i.e., the corresponding random effect is not significant. To obtain a robust estimator that does not suffer from the impact of outliers in the covariates, they introduce weights \(w_{ij}\), functions of the Mahalanobis distance:

$$\begin{aligned} w_{ij}=\min \bigg \{1,\bigg \{\frac{d_0}{(\varvec{x}_{ij}-m_{\varvec{x}})^{'}S_{\varvec{x}}^{-1}(\varvec{x}_{ij}-m_{\varvec{x}})}\bigg \}^{\frac{\delta }{2}},\bigg \{\frac{b_0}{(\varvec{z}_{ij}-m_{\varvec{z}})^{'}S_{\varvec{z}}^{-1}(\varvec{z}_{ij}-m_{\varvec{z}})}\bigg \}^{\frac{\delta }{2}}\bigg \}, \end{aligned}$$
(53)

where the parameter \(\delta \ge 1\), and \(d_0\) and \(b_0\) are the 95th percentiles of the chi-square distributions with degrees of freedom equal to the dimensions of \(x_{ij}\) and \(z_{ij}\), respectively. \(S_{\varvec{x}}\) and \(S_{\varvec{z}}\) are the median absolute deviations and \(m_{\varvec{x}}\) and \(m_{\varvec{z}}\) the medians of the covariates and of the random-effect variables, respectively. To reduce the impact of outliers in the response variable, each of its elements is modified by subtracting the quantity \(\upsilon _{ij}\) defined in Eq. (54), based on the studentized residuals \(r_{ij}=y_{ij}-x^{'}_{ij}\beta -z^{'}_{ij}\Lambda \Gamma \varvec{b}_i^{*}\):

$$\begin{aligned} \upsilon _{ij}=\hbox {sign}(r_{ij})(|r_{ij}|-c)\sigma I(|r_{ij}|>c). \end{aligned}$$
(54)

The robust log-likelihood is then defined as:

$$\begin{aligned} l^R(\varvec{\theta })=\log \int (\sigma ^{2})^{-\frac{mq+n}{2}}\exp \bigg \{ -\frac{1}{2\sigma ^2}\big \Vert W^\frac{1}{2} (\varvec{y}^{*}-\varvec{X}\varvec{\beta }-\varvec{Z}(I_m\otimes \Lambda )(I_m\otimes \Gamma )\varvec{b}^{*})\big \Vert ^2\bigg \}\times \exp \bigg \{ -\frac{1}{2\sigma ^2}\varvec{b}^{*^{'}}\varvec{b}^{*}\bigg \}\,\mathrm{d}\varvec{b}^{*}. \end{aligned}$$
(55)

To guarantee the consistency of the estimators, a correction has to be applied to \(l^R(\varvec{\theta })\):

$$\begin{aligned} l^R_C(\varvec{\theta })=l^R(\varvec{\theta })-a_{m}(\varvec{\theta }), \end{aligned}$$
(56)

with \(a_{m}(\varvec{\theta })=\sum _{i=1}^{m}a_i(\varvec{\theta })\) such that \(\frac{\partial }{\partial \varvec{\theta }}a_i(\varvec{\theta })=E_{\varvec{\theta }}\bigg [\frac{\partial l_i^R(\varvec{\theta })}{\partial \varvec{\theta }}\bigg ]\).

Selection and estimation of the fixed and random effects are obtained by maximizing:

$$\begin{aligned} Q^R(\varvec{\theta })=l_c^R(\varvec{\theta })-n\left( \sum _{j=1}^{p}p_{\lambda _n}(|\beta _j|)+\sum _{j=1}^{q}p_{\lambda {m}}(|\nu _j|)\right) , \end{aligned}$$
(57)

where \(p_{\lambda _n}(\cdot )\) and \(p_{\lambda _{m}}(\cdot )\) are shrinkage penalties whose parameters \(\lambda _n\) and \(\lambda _{m}\) control the amount of shrinkage, while \(\overline{\varvec{\beta }}_j\) and \(\overline{\nu }_j\) are the un-penalized maximum likelihood estimators from Eq. (55). The authors propose the ALASSO penalty, which uses these un-penalized estimators as adaptive weights, to control the amount of shrinkage. For selecting \(\lambda _{m}\) the authors minimize the following BIC criterion:

$$\begin{aligned} \text {BIC}(\lambda )=-\frac{1}{2}\log |\hat{\varvec{V}}|-\frac{1}{2}||\varvec{y}-\varvec{X}\hat{\varvec{\beta }}||_{\hat{\varvec{V}}}^2+\log (m)||\hat{\varvec{\theta }}_{\lambda }||_{0}, \end{aligned}$$
(58)

where \(\hat{\sigma }^2\), part of \(\hat{\varvec{V}}\), is the median absolute deviation estimate, \(\hat{\varvec{\beta }}\) and \(\hat{\varvec{V}}\) are obtained as robust estimators and, finally, \(||\hat{\varvec{\theta }}_{\lambda }||_{0}\) denotes the zero norm, which counts the number of nonzero elements of \(\hat{\varvec{\theta }}_{\lambda }\).

Taylor et al. (2012) extend the two-parameter \(L_r\) penalty of Frank and Friedman (1993) and Fu (1998) in order to obtain a new mixed-model penalized likelihood, useful for selecting both the random and the fixed effects. The extended linear mixed model considers a set of penalized effects \(\varvec{a}\), containing a subset of the effects:

$$\begin{aligned} \varvec{y}|\varvec{b}\sim N(\varvec{X}\varvec{\beta }+\varvec{Z}\varvec{b}+\varvec{Ma},\varvec{\Sigma }),\qquad \varvec{y}\sim N(\varvec{X}\varvec{\beta }+\varvec{Ma}, V(\varvec{\tau })). \end{aligned}$$
(59)

The authors use the scaled variance–covariance matrices \(\varvec{\Sigma }_*=\varvec{\Sigma }/\sigma ^2\) and \(V(\varvec{\tau })_*=V(\varvec{\tau })/\sigma ^2\) and define \(\varvec{a}\) as a potentially large vector of k effects, with \(k<p+s\) and \(k<n\), associated with the covariate matrix \(\varvec{M}\). The penalized likelihood involves the \(L_r\) class of penalties with \(0<r<1\):

$$\begin{aligned} l=\log f(\varvec{y},\varvec{\theta })-\sum _{j=1}^{k}p_\lambda (|\varvec{a}_j|;r), \end{aligned}$$
(60)

with the penalty term given by \(p_\lambda (|\varvec{a}_j|;r)=\lambda ((|\varvec{a}_j|+1)^r-1)/r,\lambda >0\). Considering a simple setting with \(\sigma ^2=1\) and \(\varvec{M}\) with orthonormal columns, an unbiased OLS estimator for \(\varvec{a}\) is obtained and then updated through an iterative process:

$$\begin{aligned} \varvec{a}_{j(s+1)}=sign(\hat{\varvec{a}}_j)(|\hat{\varvec{a}}_j|-\lambda ^*)_+. \end{aligned}$$
(61)

This penalty is singular at the origin; hence, a local quadratic approximation, based on the derivative of the penalty, is introduced:

$$\begin{aligned} p_\lambda (|\varvec{a}_j|;r)\approx \frac{1}{2}(\lambda (|\varvec{a}_{js}|+1)^{r-1}/|\varvec{a}_{js}|)\varvec{a}_j^2, \end{aligned}$$
(62)

Thus, introducing a penalized term estimated iteratively, as shown, is equivalent to inserting pseudo-random effects in the linear mixed model. This suffices to guarantee Henderson's results for the estimation (REML estimates of \(\varvec{\tau }\)) and the prediction of both kinds of effects. By thresholding the elements of \(|\varvec{a}_{s+1}|\) with an optimal rule, a partition of the estimates into nonzero and zero components \((\varvec{a}_{1,s+1},\varvec{a}_{2,s+1})\) is obtained. The zero set \((\varvec{a}_{2,s+1},\varvec{M}_{2,s+1})\) is discarded from the set of information, and the nonzero set replaces \(\varvec{a}\) at the next iteration, until the iterative penalized REML estimates converge.
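
A minimal sketch of the three building blocks just described—the \(L_r\) penalty, its local quadratic approximation (Eq. 62) and the thresholding update (Eq. 61)—is given below; the surrounding REML machinery of Taylor et al. (2012) is omitted and the function names are ours.

```python
import numpy as np

def lr_penalty(a, lam, r):
    """L_r-type penalty: lambda * ((|a| + 1)^r - 1) / r, with 0 < r < 1."""
    return lam * ((np.abs(a) + 1.0) ** r - 1.0) / r

def lqa_weight(a_current, lam, r, eps=1e-8):
    """Coefficient of the local quadratic approximation in Eq. (62):
    p(|a|; r) is replaced by 0.5 * [lambda * (|a_s| + 1)^(r-1) / |a_s|] * a^2,
    evaluated at the current iterate a_s (eps avoids division by zero)."""
    return lam * (np.abs(a_current) + 1.0) ** (r - 1.0) / np.maximum(np.abs(a_current), eps)

def soft_threshold_update(a_ols, lam_star):
    """Update of Eq. (61) in the orthonormal setting: sign(a) * (|a| - lambda*)_+."""
    return np.sign(a_ols) * np.maximum(np.abs(a_ols) - lam_star, 0.0)
```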

Li et al. (2018) propose a doubly regularized approach for selecting both the fixed and the random effects in two cases: (a) finite dimension of the fixed and/or random effects; (b) fixed and/or random effects whose number increases as the sample size goes to infinity. Their approach sets \(\varvec{\Sigma }=\sigma ^2I_{n_i}\) and \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_{*}=\sigma ^2LL^{'}\) (Cholesky decomposition), with L a lower triangular matrix with positive diagonal elements. The authors apply a double regularization (an \(\ell _1\)-norm penalty for \(\varvec{\beta }\) and an \(\ell _2\)-norm penalty for the \(\varvec{\Psi }_{*}\) parameters) to the log-likelihood function \(l(\varvec{\beta },\sigma ^2,\varvec{\Psi }_{*})\) (equivalent to Eq. 5) in the case with \(m<p\). Hence, the objective function to maximize for estimating \(\varvec{\beta }\), \(\sigma ^2\) and \(\varvec{\Psi }_{*}\) is the following:

$$\begin{aligned} Q(\varvec{\beta },\varvec{L},\sigma ^2)=\ell (\varvec{\beta },\sigma ^2,\varvec{L})-\lambda _1\sum _{j=1}^{p}|\beta _j|-\lambda _2\sum _{k=2}^{q}\sqrt{L_{k1}^2+\cdots +L_{kq}^2}. \end{aligned}$$
(63)

For the case \(m>p\), they modify \(l(\cdot )\) in Eq. (63) with the following function:

$$\begin{aligned} \ell _m(\varvec{\beta },\sigma ^2,\varvec{L})=-\frac{1}{2}\sum _{i=1}^{m}\log |\sigma ^2\varvec{V}_{*i}|-\frac{1}{2}\log \bigg |\sigma ^{-2}\sum _{i=1}^{m}\varvec{X}_i^{'}\varvec{V}_{*i}^{-1}\varvec{X}_i\bigg |-\frac{1}{2\sigma ^2}\sum _{i=1}^{m}(\varvec{Y}_i-\varvec{X}_i\varvec{\beta })^{'}\varvec{V}^{-1}_{*i}(\varvec{Y}_i-\varvec{X}_i\varvec{\beta }). \end{aligned}$$
(64)

The authors propose an algorithm, as effective as the Newton–Raphson algorithm, for estimating \(\varvec{\beta }\) and L step by step, exploiting the fact that the penalty function in Eq. (63) is separable.

Pan and Shang (2018b) propose a simultaneous selection procedure for fixed and random effects. Assume that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\) and \(\varvec{\Sigma }=\sigma ^2I_{n_i}\), let \(\varvec{\psi }\) contain the \(\frac{q(q+1)}{2}\) unique elements of \(\varvec{\Psi }_*\), and denote by \(\varvec{\theta }_{*}\) the vector collecting (\(\varvec{\beta },\varvec{\psi }\)). The authors maximize the following penalized profile likelihood function:

$$\begin{aligned} Q(\varvec{\theta }_{*})= & {} p(\varvec{\theta }_{*})-\lambda _{m}\rho (|\varvec{\theta }_{*}|)\nonumber \\= & {} -\frac{1}{2}\sum _{i=1}^{m}\log |\varvec{V}_{i*}|-\frac{n}{2}\log \left( \sum _{i=1}^{m}(\varvec{y}_i-\varvec{X}_i\varvec{\beta })^T\varvec{V}_{i*}^{-1}(\varvec{y}_i-\varvec{X}_i\varvec{\beta })\right) -\lambda _{m}\rho (|\varvec{\theta }_{*}|),\nonumber \\ \end{aligned}$$
(65)

where \(\lambda _{m}\) is the tuning parameter controlling the amount of shrinkage and \(\rho (|\varvec{\theta }_{*}|)\) is the adaptive LASSO function \(\rho (|\varvec{\theta }_{*}|)=|\varvec{\theta }_{*}|/|\varvec{\tilde{\theta }}_{*}|\), with \(\varvec{\tilde{\theta }}_{*}\), the MLE of \(\varvec{\theta }_{*}\), used as the initial weight vector. To maximize Eq. (65), the authors use the Newton–Raphson algorithm, with a local quadratic approximation of \(|\varvec{\theta }_{*}|\) at each iteration step.
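
The adaptive-LASSO term in Eq. (65) is easy to write down explicitly; the sketch below shows it together with the penalized profile objective, where `profile_loglik` is a hypothetical user-supplied function returning \(p(\varvec{\theta }_{*})\).

```python
import numpy as np

def adaptive_lasso_penalty(theta, theta_mle, lam, eps=1e-8):
    """Adaptive-LASSO term lambda_m * sum_j |theta_j| / |theta_mle_j| of Eq. (65);
    the unpenalized MLE supplies the weights (eps guards tiny initial estimates)."""
    weights = 1.0 / np.maximum(np.abs(theta_mle), eps)
    return lam * np.sum(weights * np.abs(theta))

def penalized_profile_objective(profile_loglik, theta, theta_mle, lam):
    """Objective of Eq. (65): profile log-likelihood minus the adaptive-LASSO
    penalty; profile_loglik is a user-supplied function theta -> p(theta)."""
    return profile_loglik(theta) - adaptive_lasso_penalty(theta, theta_mle, lam)
```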

6.2 Two-stage shrinkage methods

One issue with the application of one-stage shrinkage methods is that the combined dimension of the fixed and random effects is higher than the dimension of each of the two steps considered separately (Lin et al. 2013). The computational efficiency also depends on the penalized log-likelihood used for the selection of the random effects: the REML is preferred by Lin et al. (2013) and Pan (2016). The reasoning behind this choice is intuitive and underlined by Lin et al. (2013): REML estimators are unbiased and seem to be more robust to outliers than ML estimators. Furthermore, REML estimators do not involve the fixed effects.

Lin et al. (2013) propose a two-stage model selection procedure based on REML and pathwise coordinate optimization, inspired by the algorithm suggested by Friedman et al. (2007). The mixed model is formulated assuming that \(\varvec{\Sigma }=\sigma ^2I_{n_i}\). In detail, during the first stage the random effects are selected by maximizing the restricted log-likelihood penalized with the adaptive LASSO penalty:

$$\begin{aligned} Q^R(\varvec{\tau })=l^R(\varvec{\tau })-\lambda _{1,{m}}\sum _{j=1}^{s}\lambda _j w_j|{\Psi }_j|, \end{aligned}$$
(66)

where \( {\Psi }_j\) is the j-th diagonal element of \( \varvec{\Psi }\) and \(w_j \) is a known weight. Because of the non-differentiable nature of the objective function, the Newton–Raphson algorithm is used to maximize \(Q^R(\varvec{\tau })\), after locally approximating the penalty function by a quadratic function. Once the variance–covariance matrix has been estimated, it is treated as known when the following penalized log-likelihood function is maximized to estimate the fixed effects:

$$\begin{aligned} Q^f(\varvec{\beta })=-\frac{1}{2}\sum _{i=1}^{m}(\varvec{y}_i-\varvec{X}_i\varvec{\beta })'\varvec{V}_i^{-1}(\varvec{y}_i-\varvec{X}_i\varvec{\beta }) -\lambda \sum _{j=1}^{p} w_j|\beta _j|. \end{aligned}$$
(67)

Wu et al. (2016) propose an orthogonalization-based approach, which first selects the fixed effects and then the random effects. All the selection steps are based on least squares and no specific distributional assumption is required. This method is suggested when the dimension of the fixed effects is not large. The mixed model considers \(\varvec{\Sigma }=\sigma ^2I\), and the selection procedure first applies a QR decomposition of the random-effects design matrices in order to obtain a homogeneous linear regression model (which does not depend on the random effects). To select the fixed effects, it suffices to minimize, with respect to \(\varvec{\beta }\), the sum of squared residuals with a SCAD penalization, thanks to the possibility of obtaining a nearly unbiased estimate (Fan and Li 2001):

$$\begin{aligned} S_1(\varvec{\beta })=\frac{1}{2}(\varvec{Y}-\varvec{X}\varvec{\beta })'P_{\varvec{z}'}(\varvec{Y}-\varvec{X}\varvec{\beta })+(n-ms)\sum _{j=1}^{p}p_{\lambda 1}(|\beta _j|), \end{aligned}$$
(68)

where \(P_{\varvec{z}'}=I-\varvec{Z}(\varvec{Z}'\varvec{Z})^{-1}\varvec{Z}'\) is an idempotent matrix and \(p_{\lambda 1}(|\varvec{\beta }_j|) \) is a function whose first derivative depends on the tuning parameter \(\lambda \). A ridge-type iterative process is used to obtain \(\hat{\varvec{\beta }}\) approximately:

$$\begin{aligned} \hat{\varvec{\beta }}^{k+1}=(\varvec{X}'P_{\varvec{z}'}\varvec{X}+(n-ms)\sum (\lambda _1,\hat{\varvec{\beta }}^k))^{-1}\varvec{X}'P_{\varvec{z}'}\varvec{Y}, \end{aligned}$$
(69)

while to estimate the variance–covariance parameters \((\varvec{\Psi },\sigma ^2)\) they consider:

$$\begin{aligned} W^*_2(\varvec{\Psi },\sigma ^2)=\frac{1}{2}\sum _{i=1}^{m}((\varvec{y}_i-\varvec{x}_i\hat{\varvec{\beta }})\otimes (\varvec{y}_i-\varvec{x}_i\hat{\varvec{\beta }})-vec(\varvec{V}_i))' \end{aligned}$$
(70)
$$\begin{aligned} \times ((\varvec{y}_i-\varvec{x}_i\hat{\varvec{\beta }})\otimes (\varvec{y}_i-\varvec{x}_i\hat{\varvec{\beta }})-\hbox {vec}(\varvec{V}_i)), \end{aligned}$$
(71)

where \(\varvec{V}_i\) stands for the variance–covariance matrix of \(\varvec{Y}_i\), \(\otimes \) for the Kronecker product and \(\hat{\varvec{\beta }}\) for the estimates of the fixed effects obtained previously. The objective function \(S_2(\varvec{\tau })\) with the SCAD penalty then becomes:

$$\begin{aligned} S_2(\varvec{\tau })=\frac{1}{2}\sum _{i=1}^{m}(\tilde{\varvec{Y}}-\varvec{u}_i\varvec{\tau })'(\hat{\varvec{V}}_i\otimes \hat{\varvec{V}}_i)^{-1}(\tilde{\varvec{Y}}-\varvec{u}_i\varvec{\tau })+ \sum _{i=1}^{m}n_i^2\sum _{j=1}^{(q^2+q)/2+1}p_{\lambda 2}(|\tau _j|),\qquad \quad \end{aligned}$$
(72)

and again it is solved iteratively, yielding the ridge-type estimate of \(\varvec{\tau }\):

$$\begin{aligned} \hat{\varvec{\tau }}^{k+1}=(U'\hat{W}^{-k}U+\sum _{i=1}^{m}n_i^2\sum _{\lambda 2}(\hat{\varvec{\tau }}^k))^{-1}U'\hat{W}^{-k}\tilde{\varvec{Y}}, \end{aligned}$$
(73)

where W is a block-diagonal matrix whose blocks are given by \(W_i=\varvec{V}_i\otimes \varvec{V}_i\), \(\tilde{\varvec{Y}} \) is the bias-corrected \(\varvec{Y}\) and \(\varvec{u}_i \) is a function of \( \varvec{z}_i\otimes \varvec{z}_i\).
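
The fixed-effect step of this procedure reduces to the ridge-type iteration of Eq. (69) combined with a local quadratic approximation of the SCAD penalty. The sketch below illustrates this iteration under the stated assumptions (the SCAD derivative with the conventional constant a = 3.7 comes from Fan and Li 2001); it is not the authors' implementation.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """First derivative of the SCAD penalty (Fan and Li 2001); a = 3.7 is the
    conventional choice."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))

def ridge_iteration_beta(Y, X, Z, lam, n_minus_ms, n_iter=50, tol=1e-6, eps=1e-8):
    """Iterative ridge-type update of Eq. (69) for the fixed effects, with a local
    quadratic approximation of the SCAD penalty; coefficients driven towards zero
    should be thresholded to exactly zero afterwards."""
    n = Z.shape[0]
    P = np.eye(n) - Z @ np.linalg.pinv(Z.T @ Z) @ Z.T      # projection P_{z'}
    XtPX, XtPY = X.T @ P @ X, X.T @ P @ Y
    beta = np.linalg.pinv(XtPX) @ XtPY                     # unpenalized starting value
    for _ in range(n_iter):
        lqa = scad_derivative(beta, lam) / np.maximum(np.abs(beta), eps)
        beta_new = np.linalg.solve(XtPX + n_minus_ms * np.diag(lqa), XtPY)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```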

Ahn et al. (2012) provide a class of robust thresholding and shrinkage procedures for selecting both types of effects in linear mixed models. The robustness is guaranteed because they deal with non-normal correlated data and do not assume any distribution for the random effects and the errors. For the estimation of the variance components, a moment-based loss function is built. To ensure the desired sparse structure, they employ a hard-thresholding estimator \(\hat{\varvec{\Psi }}^H = [\hat{\sigma }_{ij}^H]\), defined as \(\hat{\sigma }_{ij}^H=\widetilde{\sigma }_{ij}I(|\widetilde{\sigma }_{ij}|>\nu )\), where \(I(\cdot )\) is the indicator function and \(\nu \ge 0\) is the parameter which controls the thresholding. Although \(\hat{\varvec{\Psi }}^H\) is consistent, it may not be positive semi-definite for small sample sizes. Hence, a sandwich estimator with a shrinkage penalty is obtained by minimizing the following function:

$$\begin{aligned} Q_R(D)= \sum _{i=1}^{m}\sum _{j=1 }^{n_i-1}\sum _{k=j+1}^{n_i}(\widetilde{y}_{ijk}-z_{ij}'D\widetilde{\varvec{\Psi }}Dz_{jk})^2+\lambda \sum _{i=1}^{q}d_i, \qquad \text {subject to all} \ d_i \ge 0, \forall i=1,\ldots ,q. \end{aligned}$$

To select the fixed effects, using \(\varvec{V}=\varvec{Z}\widetilde{\varvec{\Psi }}\varvec{Z}'+\hat{\sigma }^2_\epsilon I_{n}\), a feasible generalized least squares (FGLS) estimator for \(\varvec{\beta }\) is computed as the minimizer of the following objective function:

$$\begin{aligned} Q_F(\varvec{\beta })=L_F(\varvec{\beta }|\hat{\varvec{\Psi }},\hat{\sigma }^2_{{\epsilon }})+\varvec{\tau }\sum _{j=1}^{p}w_j|\beta _j|, \end{aligned}$$

where data are transformed and \(w_j\)’s are data-dependent weights.
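
The hard-thresholding step is straightforward to illustrate; the sketch below applies the entrywise thresholding rule defined above to an initial covariance estimate and checks positive semi-definiteness, the failure of which motivates the sandwich estimator with shrinkage penalty. The function names are ours.

```python
import numpy as np

def hard_threshold_covariance(Psi_tilde, nu):
    """Entrywise hard thresholding: entries of the initial estimate with
    absolute value not exceeding nu are set to zero."""
    return np.where(np.abs(Psi_tilde) > nu, Psi_tilde, 0.0)

def is_positive_semidefinite(M, tol=1e-10):
    """Check that may fail for the thresholded matrix in small samples,
    which is what motivates the sandwich estimator with shrinkage penalty."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))
```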

Pan (2016) and Pan and Shang (2018a) propose a shrinkage method for selecting the two kinds of effects separately. The use of the profile log-likelihood leads to a more efficient and stable computational procedure. Recalling the linear mixed model, let us assume that \(\varvec{\Psi }=\sigma ^2\varvec{\Psi }_*\), \(\varvec{\Sigma }=\sigma ^2I_{n_i}\) and that \(\varvec{\psi }\) contains the \(\frac{q(q+1)}{2}\) unique elements of \(\varvec{\Psi }_*\). The profile and the restricted profile log-likelihood functions are, respectively:

$$\begin{aligned} p(\varvec{\beta },\varvec{\psi })= & {} -\frac{1}{2}\sum _{i=1}^{m}\log |\varvec{V}_i|-\frac{n}{2}\log \left( \sum _{i=1}^{m}(\varvec{y}_i-\varvec{X}_i\varvec{\beta })^T\varvec{V}_i^{-1}(\varvec{y}_i-\varvec{X}_i\varvec{\beta })\right) ,\nonumber \\ \end{aligned}$$
(74)
$$\begin{aligned} p_R(\varvec{\psi },\sigma )= & {} -\frac{1}{2}\log \left| \sum _{i=1}^{m}\varvec{X}_i^T\varvec{V}_i^{-1}\varvec{X}_i\right| -\frac{1}{2}\sum _{i=1}^{m}\log |\varvec{V}_i| \nonumber \\&\quad -\,\frac{1}{2}(n-p)\log \left[ \sum _{i=1}^{m}(\varvec{y}_i-\varvec{X}_i\tilde{\varvec{\beta }})^T\varvec{V}_i^{-1}(\varvec{y}_i-\varvec{X}_i\tilde{\varvec{\beta }})\right] , \end{aligned}$$
(75)

The random covariance structure is selected by maximizing the penalized restricted profile log-likelihood with the adaptive LASSO, but a factorization of the vector containing the variance–covariance elements of \(\varvec{\Psi }_*\) into (\(\varvec{d}, \; \varvec{\gamma }\)) has to be carried out beforehand, with \(\varvec{d}\) representing the vector of the diagonal elements and \(\varvec{\gamma }\) the vector of parameters that can vary freely:

$$\begin{aligned} Q_R(\varvec{\psi })=p_R(\varvec{\psi })-\lambda _{1m}\sum _{j=1}^{q}w_{1j}|d_j|, \end{aligned}$$
(76)

where \(\lambda _{1m}\) is the tuning parameter and \(w_{1j}=1/|\tilde{d}_j|\) are weights used to reach the optimality of the solution, with \(\tilde{\varvec{d}}\) a root-n consistent estimator of \(\varvec{d}\). The Newton–Raphson algorithm is first applied to maximize the penalized restricted profile likelihood function, leading to \(\hat{\varvec{V}}\), and then to maximize the penalized profile likelihood function:

$$\begin{aligned} Q_F(\varvec{\beta })=p_F(\varvec{\beta })-\lambda _{2m}\sum _{j=1}^{p}w_{2j}|\beta _j|, \end{aligned}$$
(77)

where \(p_F(\varvec{\beta })\) is the profile log-likelihood, \(\lambda _{2m}\) is the tuning parameter for fixed effect selection and \(w_{2j}\) are weights computed as the inverse of \(|\tilde{\varvec{\beta }}_j|\), considering that \(\tilde{\varvec{\beta }}\) is the MLE of \(\varvec{\beta }\). When the algorithm converges, the maximizer of the penalized profile log-likelihood is obtained. Hence, the set of suitable covariates is identified.

Fan and Li (2001) stated that “the penalty functions have to be singular at the origin to produce sparse solutions (many estimated coefficients are zero), to satisfy certain conditions to produce continuous models (for stability of model selection), and to be bounded by a constant to produce nearly unbiased estimates for large coefficients.” The estimator obtained through the penalty functions should enjoy three important properties: asymptotic unbiasedness, to avoid modeling bias; sparsity, i.e., as a thresholding rule, the estimator should shrink some estimated coefficients to zero in order to reduce model complexity; and continuity in the data, to avoid instability in model prediction. In short, they showed that the choice of the shrinkage parameter should guarantee the well-known oracle properties of the resulting estimator: the penalized likelihood estimator is root-n consistent if \(\lambda _{n}\rightarrow 0\), a set of estimated parameters is set to 0, and the remaining estimators converge asymptotically to a normal distribution when \(\sqrt{n}\lambda _n\rightarrow \infty \).

Hossain et al. (2018) show that, under certain regularity conditions and for fixed alternatives \(B_{H_a}=\delta \ne 0\), as n increases the estimators \(\hat{\varvec{\beta }}_{PT}\) (see Eq. 35), \(\hat{\varvec{\beta }}_{PSE}\) (see Eq. 36) and the positive-part shrinkage estimator converge in probability to \(\hat{\varvec{\beta }}\), and they derive the asymptotic joint normality of the unrestricted and restricted estimators, of which the three estimators are a function. Fan et al. (2014) demonstrate that their proposed robust estimator enjoys all the properties defined by Liski and Lisk (2008). Chen et al. (2015) demonstrate only the sparsity and consistency parts of the oracle property, but not the asymptotic distribution. Li et al. (2018) show the “sparsistency” property, which ensures selection consistency for the true signals of both fixed and random effects; hence, they provide analytical proofs of consistency and sparsity, but nothing about the distributional form. Pan and Shang (2018b) demonstrate that their procedure fulfills the consistency and sparsity properties, without addressing asymptotic normality. Marino et al. (2017) only refer to Rubin (2004), according to whom “a small number of imputations can lead to high-quality inference.” As concerns Rohart et al. (2014), there is no mention of the asymptotic properties fulfilled by their final estimator. Pan (2016), Pan and Shang (2018a), Ahn et al. (2012) and Lin et al. (2013) demonstrate that, if \(\lambda \rightarrow 0\) and \(\sqrt{{m}}\lambda \rightarrow \infty \) as \({m}\rightarrow \infty \), the estimators produced by their two-stage model selection are \(\sqrt{m}\)-consistent and possess the oracle properties, i.e., sparsity and asymptotic normality (asymptotically, the proposed approaches can discover the subset of significant predictors). In other words, for an oracle procedure, the covariates with nonzero coefficients will be identified with probability tending to one, and the estimates of the nonzero coefficients have the same asymptotic distribution as under the true model (Pan 2016). All these statements are valid if an appropriate tuning parameter is chosen.

Consistent variable selection depends on the choice of the tuning parameter. Shrinkage procedures yield estimates treating the tuning parameters as known, but in practice they are not. Hence, they have to be tuned over a pool of values, from the largest to the smallest, identifying a path through the model space. After constructing the path and reducing the parameter space, one can apply a direct approach (information criteria, cross-validation and so forth) to better identify the important variables. For this reason, shrinkage methods are usually employed in the case of many variables, since they do not need to consider all possible models (\(2^{p+q}\)). The most widely used methods in the literature for tuning the parameter which controls the regularization are cross-validation and the BIC. “A more rigorous theoretical argument justifying the use of the BIC criterion for the \(\ell _1\) penalized MLE in high-dimensional linear mixed-effects models is missing: the BIC has been empirically found to perform reasonably well” (Schelldorfer et al. 2011). This seems to be generally valid for other shrinkage methods: there is no theoretical justification for employing the BIC. Fan et al. (2014) highlight that their choice of selecting the shrinkage parameter through the BIC criterion is due to the fact that GCV leads to over-fitted models and the AIC does not seem to be consistent when the true model has a sparse structure. The BIC criterion on which the authors base their selection of \(\lambda _n\) is the following:

$$\begin{aligned} \text {BIC}(\lambda )=-\frac{1}{2}\log |\hat{\varvec{V}}|-\frac{1}{2}||\varvec{y}-\varvec{X}\hat{\varvec{\beta }}||^2_{\hat{\varvec{V}}}+\log (m)||\hat{\varvec{\theta }}_\lambda ||_0, \end{aligned}$$
(78)

where \(\hat{\varvec{V}}=\hbox {diag}(\hat{\varvec{V}}_1,\hat{\varvec{V}}_2,\ldots ,\hat{\varvec{V}}_m)\) and the generic \(\hat{\varvec{V}}_i\), \(\hat{\varvec{\beta }}\), \(\hat{\varvec{\Psi }_{*}}\) are the robust estimates contained in \(\hat{\varvec{\theta }}_{\lambda }\) upon convergence of the EM algorithm. Because of the over-fitting problems using GCV, Marino et al. (2017) choose the BIC criterion for the selection of the tuning parameter:

$$\begin{aligned} \text {BIC}(\lambda )=-2l_R(\varvec{ \beta }^{(\bullet )},\hat{\sigma }^2,\hat{\varvec{\Psi }}_{*})+q\times ln(n), \end{aligned}$$
(79)

where \(l_R(\varvec{ \beta }^{(\bullet )},\hat{\sigma }^2,\hat{\varvec{\Psi }}_{*})\) is the REML log-likelihood function related to the model in (32).

Li et al. (2018) select the two tuning parameters by minimizing a variant of the BIC proposed by Wang (2016):

$$\begin{aligned} \text {BIC}=-2p_R(\varvec{\beta },L)+\bigg [d_{\varvec{\beta }}+\frac{(1+d_{\varvec{\Psi }_*})d_{\varvec{\Psi }_*}}{2} \bigg ]\log (n), \end{aligned}$$
(80)

where \(p_R(\varvec{\beta },L)\) is the profile log-likelihood in Eq. (75), and \(d_{\varvec{\beta }}\) and \(d_{\varvec{\Psi }_*}\) are the numbers of nonzero elements in \(\varvec{\beta }\) and on the diagonal of \(\varvec{\Psi }_{*}\), respectively. Pan (2016) and Pan and Shang (2018a) propose to minimize the BIC, the AIC or the generalized CV (GCV) as possible criteria for selecting the optimal tuning parameter. These criteria have to be computed with the corresponding profile likelihoods, shown in Eqs. (74) and (75), to identify the tuning parameter for the fixed part and the random part, respectively. The degrees of freedom needed to compute the three criteria also refer to the fixed effects in one case (the number of nonzero \(\hat{\varvec{\beta }}\)’s) and to the random part in the other case (the number of nonzero elements in \(\hat{\varvec{\psi }}\)). Pan and Shang (2018b) select the optimal \(\lambda \) by minimizing the BIC criterion, where the degrees of freedom take into account the number of nonzero elements in \(\varvec{\theta }_{*}\). The tuning parameters \((\lambda _1,\lambda _2)\) are selected by Wu et al. (2016) with a CV or GCV technique. Taylor et al. (2012) and Ahn et al. (2012) choose the tuning parameter that minimizes the BIC criterion; Taylor et al. (2012) focus on the value of r (from a fixed grid, see Eq. (60)) which leads to the minimum BIC, after obtaining convergence of the penalized REML estimators:

$$\begin{aligned} \text {BIC}=-2l(\hat{\varvec{\beta }},\hat{\varvec{a}},\hat{\varvec{\tau }})+\log (m)\#df, \end{aligned}$$
(81)

where \(l(\cdot )\) is the un-penalized (since it involves \(\varvec{a}\) as fixed effects) marginal log-likelihood over the random effects \(\varvec{b}\) evaluated at the REML estimates of \(\varvec{\tau }\) and \(\#df\) represents the number of nonzero elements in \(\hat{\varvec{a}}\). Ahn et al. (2012) work on a modified version of the BIC, similar to the RSS ratio, for both the fixed effects and the random effects:

$$\begin{aligned} \text {BIC}_R(\nu )=\frac{L_0(\varvec{\Psi }^H_\nu )}{L_0(\varvec{\Psi })}+\frac{\log (n)}{n}\times df1, \end{aligned}$$
(82)
$$\begin{aligned} \text {BIC}_F(\varvec{\tau })=\frac{L_F(\hat{\varvec{\beta }}_{\varvec{\tau }}|\hat{\varvec{\Psi }},\hat{\sigma }^2)}{L_F(\hat{\varvec{\beta }}_G|\hat{\varvec{\Psi }},\hat{\sigma }^2)}+\frac{\log (n)}{n}\times df2 , \end{aligned}$$
(83)

where \(\hat{\varvec{\beta }}_G\) is the FGLS estimator and df1 and df2 represent the numbers of nonzero components on the diagonal of \(\hat{\varvec{\Psi }}^H\) and in \(\hat{\varvec{\beta }}_{\varvec{\tau }}\), respectively. The degrees of freedom measure the effective model dimension. Unlike Bondell et al. (2010) and Ibrahim et al. (2011), where the degrees of freedom considered are, respectively, the sample size n and the cluster size m, in the methods discussed above the number of parameters that can vary freely is connected to the nonzero parameters in the working model (fixed components and variance–covariance elements of the random effects). As pointed out by Müller et al. (2013), the number of nonzero estimated components related to the tuning parameter is not equivalent to the number of independent parameters, as is instead the case for linear models.

The main characteristics associated with shrinkage procedures available in the literature are summarized in Table 1.

7 Review of simulations

Almost all the authors have performed at least one simulation to measure and demonstrate the reliability of their own procedure. As in a meta-analysis, we have collected the simulations but, since the results are not directly comparable, the tables synthesize the main parameters characterizing the simulations. We followed the setting of Müller et al. (2013) for the sake of continuity. Considering Table 2, the smaller the values of \(\min |\varvec{\beta }|/\sigma \) and \(\min \{\hbox {ev}(\varvec{\Psi }/\sigma ^2)\}\), the more difficult the selection of the true model for \(\varvec{\beta }\) and \(\varvec{\tau }\). Nevertheless, it is worth noting that these values say nothing about the goodness of fit of the models or the actual ability of the methods, once applied, to identify the true values of \(\varvec{\beta }\) and \(\varvec{\tau }\), since they refer to the initial settings of the simulations and not to their results. As Müller et al. (2013) underlined, one could consider these simulations as a mere meta-analysis: the results obtained are not directly comparable, because the authors use different measures to assess the performance of their methods.

Table 2 Summary of settings used for the simulations

It is worth noting that all simulations are run with a moderate number of random effects (for both the full and the true model) and of variance–covariance parameters, except for those of Li et al. (2018) and Ahn et al. (2012). A large number of fixed effects occurs in the full models of Chen et al. (2015), Ghosh and Thoresen (2018) and Rohart et al. (2014).

To determine the set of candidate models for \(\varvec{\beta }\), \(|M_{\varvec{\beta }}|\), the authors do not follow the same criterion. Some authors focus only on the covariates, and in this sense \(|M_{\varvec{\beta }}|\) is equal to \(2^{p-1}\) (so the intercept is not counted in the size of \(\varvec{\beta }\)). Others instead treat p as the total number of fixed regression parameters, including the intercept, and thus the candidate models are \(2^p\). Furthermore, some authors, such as Kawakubo et al. (2014), state that they exclude the null model (i.e., the model containing only the intercept) from \(|M_{\varvec{\beta }}|\).

Kawakubo and Kubokawa (2014) found that both the McAIC and a model averaging procedure depending on the McAIC (which has more appropriate weights) work better than the cAIC in terms of prediction errors. They prove empirically the same results in the case of small area prediction, which is the topic on which Kawakubo et al. (2014) and Lombardía et al. (2017) focus. They show, therefore, a prediction error improvement of the CScAIC with respect to the cAIC. Compared to the mAIC, cAIC and BIC, the EBIC of Kubokawa and Srivastava (2010) is the criterion which, by simulation, leads to a better selection of the true model as the number of covariates and the number of clusters increase. These results constitute empirical evidence of the consistency property of the EBIC. Lombardía et al. (2017), instead, compared the extended generalized AIC they defined (20) with the conditional AIC defined by Vaida and Blanchard (2005). They discovered that the xGAIC for the Fay–Herriot model presents better performance in terms of correct classification rates of the true model. As the number of covariates increases, the xGAIC performs better and better (in a scenario with three variables it always leads to the correct model), whereas the vAIC selects a model with fewer fixed effects 44% of the time. Wenren and Shang (2016) show that their proposed conditional criteria perform more efficiently than the classic Mallows' \(C_p\) when more significant fixed effects are added. A large number of units per cluster is required if one works with the random effects within clusters (for instance, small area estimation) or if one wants to obtain a less biased estimate of the penalty term. Wenren et al. (2016) show by simulation that their two marginal \(C_p\)-type criteria perform better, in selecting the correct model, than the mAIC and mBIC in particular situations: when observations are few and highly correlated, or when the true model is included in all candidate models and includes more significant fixed-effect variables. Kuran and Özkale (2019) compare the performance of their conditional ridge \(C_p\) with the \(\hbox {CC}_p\) of Wenren and Shang (2016), in both cases of known and unknown variance–covariance matrices of the random effects and of the random errors. Furthermore, they use different values of the ridge parameter and compare various models (with different numbers of explanatory variables). They show that the percentages of choosing the true model are quite good and comparable for all the \(C_p\) statistics, and that they increase as the number of fixed effects increases. When the ridge parameter increases, the numbers of individuals and of units are quite small and the correlation between explanatory variables is not high, the \(\hbox {CRC}_p\) outperforms the \(\hbox {CC}_p\).

Focusing on the shrinkage selection procedures, Hossain et al. (2018) compare the performance, in terms of mean squared prediction errors, reached by their PT and PSE estimators against the unrestricted MLE, the restricted MLE, the LASSO and the ALASSO. They show that their methodology, as the sample size increases and the number of active covariates decreases, leads to better performance than all the other estimators except the restricted MLE. Ghosh and Thoresen (2018) aim to demonstrate the good performance of the SCAD penalty compared with the \(\ell _1\) penalization. By simulation, they point out that both in a low-dimensional and in a high-dimensional setting the two penalties correctly select the true fixed effects. With respect to \(\ell _1\), SCAD concentrates on a smaller active set of \(\varvec{\beta }\), especially in the high-dimensional case. Marino et al. (2017) compare their penalized likelihood procedure for multilevel models with missing data with the LASSO method applied to data without missing values, used as a benchmark reference; they also compare the performance of their method with the regularized LASSO on complete-case data. When missing data are present in the dataset, the proposed methodology performs better, especially as the number of imputations increases. Using only one imputation does not produce large benefits. On the other hand, the methodology is quite good at identifying the correct model when the number of imputations and the number of units increase. Rohart et al. (2014) reached the same results as Schelldorfer et al. (2011) in the case of known variances, but with a much faster algorithm. It is worth noting that their method can be computationally combined with other procedures. The orthogonalization-based SCAD procedure of Wu et al. (2016) is very efficient in selecting the fixed effects as the total number of units increases, but it needs to be improved for the selection of the random effects. Pan (2016) compared the ability of his two-stage procedure to correctly identify the two kinds of effects with that of Ahn et al. (2012) and Bondell et al. (2010). He found that the percentage of effects (taken both separately and together) correctly identified was higher than for the others and rose as the number of clusters increased. Only in the case of a non-normal distribution assumed for \(\varvec{\epsilon }\) did the method proposed by Ahn et al. (2012) perform better, since it does not need any distributional assumption. Pan (2016) also compares the computational efficiency of his model selection with that of Bondell et al. (2010) and concludes that his algorithm takes less time to converge. There are two probable reasons: \(\sigma ^2\) is not included in the profile log-likelihood used by Pan (2016), and a two-stage procedure for selecting both kinds of effects is faster than a procedure involving only one step. Lin et al. (2013) used the same simulation settings as Bondell et al. (2010), which is why their results are missing in Table 2: they are available in Table 2 of Müller et al. (2013). The robust selection method presented by Fan et al. (2014) has been shown to lead to the same results as the equivalent non-robust method when the data contain no outliers.
On the other hand, when outliers are present in the data (in both the response variable and the covariates), they have no influence on the estimates obtained with the robust method, while the non-robust methodology leads to over-fitting, with lower correct-fit percentages and higher mean squared errors of the estimated parameters as a consequence. The robust selection method is, however, perturbed by outliers when these appear only in the response variable or only in the covariates.

Table 3 Settings of LMM selection procedures for all the procedures analyzed in the review

In high-dimensional settings where the focus is on selecting both the fixed and the random effects, Li et al. (2018) used in their simulations two ways of controlling the tuning parameters: a non-adaptive regularization (NAR), which chooses the tuning parameter from a simple grid of values, and an adaptive regularization (AR), which attributes weights to the different penalty parameters. The AR methodology leads to a smaller estimation bias for the variance components and to a better control of the false discovery rate. Chen et al. (2015) obtained good selection performance in terms of a low proportion of parameters not shrunk to zero when they should have been, or shrunk to zero by mistake. Furthermore, they obtained accurate results in terms of bias and standard deviations of the estimates. They also conducted some simulations excluding the fixed effects from the selection, and they found that in all situations the fixed-effect selection never affects the power performance.

The parameter subset selection method proposed by Schmidt and Smith (2016) leads to better performance compared with other techniques, including the LASSO, ALASSO and M-ALASSO.

As specified at the beginning of this review, our purpose is to give a clear outline of most of the methodologies for linear mixed models available in the literature. Hence, Table 3 summarizes the features that identify each procedure: the part of the model it focuses on (fixed and/or random effects), the dimensionality of the linear mixed model used and the structure of the variance–covariance matrices. Dimensionality indicates the level of the number of parameters (\(\varvec{\theta }=(\varvec{\beta },\varvec{\tau })\)) involved in the model. We included not only the methods mentioned in this article, but also those contained in Müller et al. (2013), in order to provide a global view of all the methodologies. Looking jointly at Table 2 of Müller et al. (2013) and Tables 2 and 3, it becomes obvious that most model selection procedures focusing on the selection of both the fixed and the random part in cases of medium and/or high dimensionality involve a shrinkage procedure. The shrinkage methods are computationally more efficient and statistically accurate (Bülmann and van de Geer 2011; Müller et al. 2013).

8 Review of real examples

LMMs are widely used in medical statistics and biostatistics. To enrich this review, we take a brief look at the real examples described in some of the papers listed.

Ahn et al. (2012), Pan (2016) and Hossain et al. (2018) describe the Amsterdam Growth and Health Study, widely used in the literature. The Amsterdam Growth and Health Study data were collected to explore the relationship between lifestyle and health in adolescence and young adulthood. In growing toward independence, the lifestyle habits of teenagers change substantially with respect to physical activity, food intake, tobacco smoking, etc. Accordingly, their health perspective may also change. Individual changes in growth and development can be studied by observing and measuring the same participant over a long period of time. The Amsterdam growth and health longitudinal study was designed to monitor the growth and health of teenagers and to develop future effective interventions for adolescence. A total of 147 subjects in the Netherlands participated in the study, and they were measured over 6 time points; thus, the total number of observations is 882. The continuous response variable of interest was the total serum cholesterol expressed in mmol/l. Pan (2016) also analyses a second dataset, the colon cancer data. The goal of the analysis was to estimate the cost attributable to colon cancer after initial diagnosis by cancer stage, comorbidity, treatment regimen and other patient characteristics. The data report aggregate Medicare spending on a cohort of 10,109 colon cancer patients up to 5 years after initial hospitalization, and these data are considered as the response for a linear mixed model.

Taylor et al. (2012) applied their method to the detection of quantitative trait loci (QTL) in a wheat quality dataset. The data were obtained from a two-phase experiment conducted in 2006, involving a wheat population of 180 double haploid (DH) lines derived from the crossing of two favored varieties. Data were collected from two phases of experimentation: an initial field trial and a milling laboratory experiment. A partially replicated design approach was used at both experimental phases, and the field trial was designed as a randomized block design. The analysis considers a very large set of candidate variables, and the matrix \(\mathbf {a}\) in Eq. (59) has dimension \(390 \times 1\).

Jiang et al. (2008) considered a dataset from a survey conducted in Guatemala on the use of modern prenatal care for pregnancies in which some form of care was used. They apply the fence method to the selection of the fixed covariates in a variance component logistic model. Again, a rather large number of covariates is involved.

Marino et al. (2017) worked on a dataset provided by the Healthy Directions–Small Business (HD-SB) study conducted by Sorensen et al. (2005). Recent epidemiological studies have linked dietary patterns and physical inactivity to multiple cancers and chronic diseases. One of the main purposes of the study was to determine whether a cancer-prevention intervention (based on occupational health and health promotion) could significantly reduce red meat consumption or significantly improve the mean consumption of fruits and vegetables, the level of physical activity, smoking cessation and the reduction of exposure to occupational carcinogens. The HD-SB study was a randomized, controlled trial conducted between 1999 and 2003 as part of the Harvard Center Prevention Program Project. The study population consisted of twenty-six small manufacturing worksites employing multi-ethnic, low-wage workers. Participating worksites were randomized to either an 18-month intervention group or a minimal-intervention control group. There were 974 respondents, but only 793 of them provided complete information, so about 18.5% of the data were missing. The survey involved a large number of variables, grouped into different areas: health behaviors, red meat consumption, physical activity, consumption of multivitamins and sociodemographic characteristics. The authors considered 15 covariates and built a linear mixed model with the mean consumption of fruit and vegetables at follow-up as the response. They applied their methodology for missing data with 1, 3 and 5 imputations, comparing the results with the analysis based on the complete cases.

Fan et al. (2014) applied their robust method to a longitudinal progesterone dataset, available on Diggle P.J.’s homepage: https://www.lancs. The dataset contains 492 urine samples from 34 women in a menstrual cycle study, where each woman contributed from 11 to 28 samples. The menstrual cycle length was standardized for all women to a reference 28-day cycle. The authors analyzed a linear mixed model with the log-transformed progesterone level as response variable, a random intercept and 7 fixed effects: age, bmi, time, the square of time and the three first-order interactions among age, bmi and time.
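
As a minimal sketch of this model structure, written in lme4 syntax: the data frame and variable names below are placeholders (the real data are not reproduced here), and Fan et al. (2014) fit this structure with their robust procedure rather than with lmer's standard (RE)ML estimation.

```r
## Random-intercept LMM with 7 fixed effects: age, bmi, time, time^2 and the
## three two-way interactions among age, bmi and time (toy data stand in for
## the progesterone measurements).
library(lme4)

set.seed(123)
prog <- data.frame(woman = factor(rep(1:34, each = 14)),
                   age   = rep(round(runif(34, 20, 40)), each = 14),
                   bmi   = rep(round(runif(34, 18, 32), 1), each = 14),
                   time  = rep(seq(-1, 1, length.out = 14), times = 34))
u <- rnorm(34, sd = 0.4)                            # woman-specific intercepts
prog$logP <- 1 + 0.02 * prog$age - 0.03 * prog$bmi + 0.5 * prog$time -
  0.3 * prog$time^2 + u[as.integer(prog$woman)] + rnorm(nrow(prog), sd = 0.3)

fit <- lmer(logP ~ age + bmi + time + I(time^2) +
              age:bmi + age:time + bmi:time + (1 | woman),
            data = prog)
summary(fit)
```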

Li et al. (2018) analyze two datasets in their paper. The first comes from a longitudinal randomized controlled trial involving 423 adolescents from a Hispanic population in New York City whose parents were affected by HIV. The main purpose was to investigate a negative state of mind (measured by the Basic Symptoms Inventory, a score described in Weiss 2005) over six years (each person was visited about 11.5 times on average). Six variables were involved in the original dataset: treatment (or control) group, age, gender, Hispanic (\(1=\hbox {Yes}\), \(0=\hbox {No}\)), visit time (expressed as the logarithm of years) and visit season. The authors worked on a linear mixed model containing the six covariates plus the two-way interactions between treatment and time, gender and Hispanic, for a total of 10 predictors, included in both types of effects. Their regularization procedure was applied both in the non-adaptive version and in the adaptive version (with weights based on the inverse of the estimates obtained from the ridge-penalization procedure). Their second dataset comes from a clinical study that investigated a possible relationship between protein signatures and post-transplant renal function in kidney-transplant recipients. The study involved 95 renal transplant patients, and its main purpose was to identify which proteins had a significant influence on the longitudinal trajectory of renal function, measured by the patients’ glomerular filtration rate (GFR).

Lombardía et al. (2017) analyzed data from surveys conducted through the behavioral risk factors information system in Galicia (2010–2011). The survey used a stratified random sampling design with equal allocation by sex and age group. Forty-one areas from the 53 counties in Galicia were involved in the survey. The authors estimated the prevalence of smokers (aged at least 16), separately by sex. The minimum domain sample size was 44 for men and 48 for women. The response variable in the Fay–Herriot model employed was the logarithmic transformation of the number of smokers. There were 14 covariates in total, classified into four groups: age, degree of urbanization, activity and educational level.

Han (2013) analyzed a public health dataset on obesity released by the U.S. Centers for Disease Control and Prevention, which carried out a large health study (6971 people) in the United States (51 counties of California) between 2006 and 2010, with the information collected through surveys. The purpose of the author was to estimate county-level obesity rates for the female Hispanic population of working age (18–64).

Bondell et al. (2010) consider a recent study of the association between the total nitrate concentration in the atmosphere (TNO3, ug/m\(^3\)) and a set of measured predictors. Nitrate is one of the major components of fine particulate matter (PM2.5) across the USA; however, it is one of the most difficult components to simulate accurately with numerical air quality models. Identifying the empirical relationships between nitrate concentrations and a set of observed variables that can act as surrogates for the different nitrate formation and loss pathways can support this research and allow for more accurate simulation of air quality. To formulate these relationships, data from the U.S. EPA Clean Air Status and Trends Network (CASTNet) sites are used. The CASTNet dataset consists of multiple sites with repeated measurements of pollution and meteorological variables at each site: the mean ambient particulate ammonium concentration (NH4, ug/m\(^3\)), the mean ambient particulate sulfate concentration (SO4, ug/m\(^3\)), relative humidity (RH, %), ozone (O3, ppb), precipitation (P, mm/h), solar radiation (SR, W/m\(^2\)), temperature (T, \(^\circ \)C), temperature difference between the 9 m and 2 m probes (TD, \(^\circ \)C) and scalar wind speed (WS, m/s). The same data were used by Li et al. (2014) to apply their proposed MDL procedure. A subset of the CASTNet dataset was instead employed by Chen et al. (2015), who focused on five sites across the eastern USA (2001–2009) and kept TNO3, NH4 and SO4 as original variables, while the other variables were aggregated to seasonal values, taking the maximum for O3 and the mean for the others. The total number of observations was 175, and in the two-way random effects model time and site were included as main random effects.

Ghosh and Thoresen (2018) investigated the effects of the intake of oxidized and non-oxidized fish oil on inflammatory markers in a randomized study of 52 subjects (a dataset already studied in the literature). Inflammatory markers were measured at baseline and after three and seven weeks. They use the data to investigate whether there are associations between gene expressions measured at baseline and the level of the inflammatory marker ICAM-1 throughout the study. From a vast set of genes, they initially selected 506 genes having absolute correlation of at least 0.2 with the response at any time point, so that the total number of fixed effects considered becomes \(p = 512\). On the other hand, after removing the missing observations in the response variable they obtain \(n = 150\) observations, making this a high-dimensional selection problem. Furthermore, owing to the longitudinal structure of the data, they also included random effect components in the model: a random intercept and a random slope.
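
The correlation-based prescreening step just described can be sketched in a few lines of R; the simulated gene expression matrix and ICAM-1 responses below are placeholders, while the 0.2 threshold and the "any time point" rule follow Ghosh and Thoresen (2018).

```r
## Keep a gene if its absolute correlation with the response at ANY of the
## measurement occasions is at least 0.2 (illustrative data; 506 genes pass
## this screen in the real dataset).
set.seed(42)
n_subj <- 52; n_genes <- 2000; n_times <- 3
gene_expr <- matrix(rnorm(n_subj * n_genes), n_subj, n_genes)  # baseline expressions
icam      <- matrix(rnorm(n_subj * n_times), n_subj, n_times)  # ICAM-1 at 3 visits

max_abs_cor <- apply(abs(cor(gene_expr, icam)), 1, max)  # one value per gene
selected    <- which(max_abs_cor >= 0.2)
length(selected)
```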

Finally, Rohart et al. (2014) apply their approach to a real dataset from a project in which hundreds of pigs were studied, with the aim of shedding light on the relationships between some phenotypes of interest and metabolic data. Linear mixed models are appropriate here because the observations are repeated data collected in different environments (groups of animals reared together under the same conditions), and some individuals were also genetically related, introducing a family effect. The dataset consisted of 506 individuals from three breeds, eight environments and 157 families; the metabolic data contained \(p=375\) variables, and the phenotype investigated was the daily feed intake (DFI).

Li and Zhu (2013) applied their covariance-based test to the well-known pig weight dataset, containing the weights of 48 pigs measured over nine successive weeks.

9 Discussion and conclusion

In this paper, we have discussed most of the model selection procedures for linear mixed models available to date. The purpose of our review is to allow users to easily identify the type of method they need, according to characteristics such as the number of clusters and/or the number of units per cluster, the part of the model to be selected (fixed and/or random), the dimension of the model and the structure of the variance–covariance matrices. For all methods, a description of the simulations, where available, is reported in Table 2; the purpose is to give an idea of the model settings, not to provide evidence about which methods are best. We used essentially the same notation as Müller et al. (2013) to stay aligned with the previous review and, hence, to facilitate comparison of the various methods over time. This review is not only an update of Müller’s review (Müller et al. 2013), but also an attempt to cluster the procedures from a different point of view: the part of the model to be selected, fixed and/or random. Indeed, this is one of the main issues when looking for an appropriate method. Moreover, particular attention is given to the software used, together with the implementation and the availability of the code.

This review mentions the available theoretical properties of the different methodologies, with comparisons among them where possible. Particular importance is given to the shrinkage methods (focused on the selection of fixed and/or random effects), since these procedures rely on the oracle properties established by Fan and Li (2001).

Through simulations, the authors considered in this review try to achieve the best result, i.e., to identify the optimal model among a pool of candidate models rather than the true model. Several issues affect the choice of the optimal model, one of which is the dimension of the pool of candidate models (\(2^{p+s}\)): the larger this set \(\mathcal {M}\), the lower the computational efficiency. This is illustrated by the fence methods and a number of Bayesian methods reported in Müller et al. (2013), as well as by the two-stage procedures of Sect. 6.2, which select the two kinds of effects separately, thus reducing the overall number of candidate models, as the simple calculation below illustrates.
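
As a purely illustrative calculation (the values of \(p\) and \(s\) are hypothetical): with \(p=10\) candidate fixed effects and \(s=5\) candidate random effects, a joint search covers \(2^{p+s}=2^{15}=32{,}768\) candidate models, whereas selecting the two parts separately requires at most \(2^{10}+2^{5}=1056\) model evaluations.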

Over time, greater attention has been given to the generalization of \(\varvec{\Sigma }\) in Eq. (2): the scaled version \(\sigma ^2\varvec{\Sigma }_*\) has replaced \(\sigma ^2I_{n_i}\), whereas, with the exception of Shang and Cavanaugh (2008), the scaled version \(\sigma ^2\varvec{\Psi }_*\) is assumed for \(\varvec{\Psi }\). There is still little theoretical support for a fully general specification of the variance–covariance matrices of both types of effects.
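
As a minimal sketch of what a scaled, non-identity within-group covariance \(\sigma ^2\varvec{\Sigma }_*\) can look like in practice, the following R code (using the nlme package) fits a random-intercept model with an AR(1) within-group correlation and occasion-specific residual variances; the data frame and variable names are simulated placeholders, not taken from any of the papers reviewed.

```r
## Random-intercept LMM whose within-group covariance is sigma^2 * Sigma_*
## with AR(1) correlation and heteroscedastic (visit-specific) variances.
library(nlme)

set.seed(7)
d <- data.frame(id   = factor(rep(1:30, each = 6)),
                time = rep(1:6, times = 30),
                x    = rnorm(180))
d$visit <- factor(d$time)
d$y <- 1 + 0.5 * d$x + rnorm(30, sd = 0.8)[as.integer(d$id)] + rnorm(180)

fit <- lme(fixed = y ~ x,
           random = ~ 1 | id,
           correlation = corAR1(form = ~ time | id),   # AR(1) part of Sigma_*
           weights = varIdent(form = ~ 1 | visit),     # visit-specific variances
           data = d)
summary(fit)
```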

Most of the methods were implemented in R, using various packages or the authors’ own code (not published in any package); some authors, however, do not specify the software used at all (see Table 3). As in a meta-analysis, we gathered the simulations presented in the papers reviewed; since the results are not directly comparable, the tables synthesize the main parameters characterizing each simulation.

Hence, the main purpose of this review was to provide an overview of some useful components/factors characterizing each selection criterion, so that users can identify which method to apply in a specific situation. In addition, an effort was made to tidy up the notation used in the literature, by “translating,” if necessary, the symbols and formulas in each paper into a common “language.”