1 Introduction

Finite mixtures present a very attractive modeling framework for increasing model flexibility without the high-dimensional parameter spaces used in non-parametric or mixed modeling (McLachlan and Peel 2000). Often, a regular statistical model is too rigid to adequately represent possible heterogeneity in the population, and this heterogeneity can often be captured by a mixture of parametric models. Such mixtures have been successfully applied in a wide variety of fields. Wedel and Kamakura (1999), for instance, devote two chapters of their book on market segmentation to this topic, whereas Schlattmann (2009) has written an entire book about medical applications of finite mixture models. However, despite its popularity and frequent usage, this type of model still poses some complications. The most important of these is selecting the correct number of components (McLachlan and Peel 2000), which we will refer to as order selection. Not surprisingly, this problem has generated a lot of theoretical and applied research, and many order selection methods have been suggested in the literature by now. However, most of the simulation results that have been presented either disagree with each other or were obtained in very idealized settings where the model assumptions matched the simulation settings. Therefore, in this paper, we investigate violations of standard model assumptions in finite mixtures of linear regression models, in the hope of partly filling this gap. We compare several old and new order selection methods using two different types of estimation, unpenalized and penalized.

The rest of this paper is structured as follows. In Sect. 2 some technical and practical background will be given about (fitting) a mixture model of linear regressions. In Sect. 3 we present a non-exhaustive overview of various popular and some lesser known but rather effective methods to select the number of components in a mixture model. In this section we also give an overview of some published results. In Sect. 4 the design and results of our simulation study will be presented and discussed. Finally, in Sect. 5 we present our conclusions and some potential lines of further research.

2 Technical background

2.1 Finite mixtures of linear regressions

Suppose a population consists of \(K\) subpopulations \(S_k\) indexed by \(k=1,\ldots ,K\). Within each of these subpopulations, suppose it makes sense to model a univariate random variable \(Y\) as a linear combination of \(p\) explanatory variables denoted by the vector \(\varvec{x}\). Then, for a random sample of size \(n\) across the subpopulations, we have

$$\begin{aligned} \left\{ \begin{array}{ll} y_i=\beta _{01}+\beta _{11}x_{i1}+\cdots +\beta _{p1}x_{ip}+\epsilon _{i1}\qquad \text{ if }\quad i\in S_1 \\ \quad \;\vdots \\ y_i=\beta _{0k}+\beta _{1k}x_{i1}+\cdots +\beta _{pk}x_{ip}+\epsilon _{ik}\qquad \text{ if }\quad i\in S_k \\ \quad \;\vdots \\ y_i=\beta _{0K}+\beta _{1K}x_{i1}+\cdots +\beta _{pK}x_{ip}+\epsilon _{iK}\quad \text{ if }\quad i\in S_K \\ \end{array}\right. \end{aligned}$$
(1)

where \(i=1,\ldots ,n\). Note that the subpopulations are assumed to be mutually exclusive and exhaustive. The error terms within each component are assumed to be i.i.d. normal with mean 0 and variance \(\sigma _k^2\) and independent across the subpopulations. The vector of regression coefficients will be denoted by \(\varvec{\beta }=\left( \varvec{\beta }_1,\ldots ,\varvec{\beta }_K\right) ^T\) where \(\varvec{\beta }_k=\left( \beta _{0k},\beta _{1k},\ldots ,\beta _{pk}\right) ^T\). Let \(\varvec{z}\) be a single trial realization of a random multinomial variable with parameter vector \(\varvec{\pi }=\left( \pi _1,\ldots ,\pi _K\right) ^T\) which indicates from which subpopulation \(Y\) originates. Therefore, if an observation \(i\) belongs to component \(k\), \(\varvec{z}_i\) is a vector of 0s with a 1 at the \(k\)th position. The parameters \(\pi _k,k=1,\ldots ,K\), indicate the relative size of the subpopulations in the entire population under consideration. From its definition it follows that \(\sum _{k=1}^K \pi _k = 1\) and that \(\pi _k\ge 0, \forall k=1,\ldots ,K\). The joint distribution of \(y\) and \(\varvec{z}\), conditional on \(\varvec{x}\), can now be written as

$$\begin{aligned} f\left( y,\varvec{z}|\varvec{x},\varvec{\varPsi }\right) =\prod _{k=1}^K \left[ \pi _k \mathcal N \left( y|\varvec{\beta }_k^T\varvec{x},\sigma _k^2\right) \right] ^{z_{ik}} \end{aligned}$$
(2)

where \(\mathcal N \left( y|\varvec{\beta }_k^T\varvec{x},\sigma _k^2\right) \) denotes the normal density of a variable \(y\) with mean \(\varvec{\beta }_k^T\varvec{x}\) and variance \(\sigma _k^2\), \(\varvec{x}\) includes an intercept term and \(\varvec{\varPsi }=\left( \varvec{\pi },\varvec{\beta },\sigma _1^2,\ldots ,\sigma _K^2 \right) ^T\) denotes the complete parameter vector. Note that one of the elements of \(\varvec{\pi }\) is redundant due to the summation restriction given above. The complete data log likelihood, i.e. the joint log likelihood of \(y\) and \(\varvec{z}\) for the sample, can then be expressed as

$$\begin{aligned} LL_c\left( \varvec{\varPsi }\right) =\sum _{i=1}^n \sum _{k=1}^K z_{ik} \left\{ \log \left( \pi _k\right) + \log \left[ \mathcal N \left( y_i|\varvec{\beta }_k^T\varvec{x}_i,\sigma _k^2 \right) \right] \right\} . \end{aligned}$$
(3)

A finite mixture model of linear regressions now arises when the subpopulation indicator variable \(\varvec{z}\) is not observed (or inherently unobservable). In this case, one has to resort to working with the marginal distribution of \(Y\) (marginalized over \(\varvec{Z}= \left( \varvec{z}_1,\ldots ,\varvec{z}_n\right) \)) and the marginal distribution of \(y\), conditional on \(\varvec{x}\), becomes

$$\begin{aligned} f\left( y|\varvec{x},\varvec{\varPsi } \right) =\sum _{k=1}^K \pi _k \mathcal N \left( y|\varvec{\beta }_k^T\varvec{x},\sigma _k^2 \right) \end{aligned}$$
(4)

and the corresponding observed log likelihood of the sample is

$$\begin{aligned} LL\left( \varvec{\varPsi }\right) =\sum _{i=1}^n \log \left[ \sum _{k=1}^K \pi _k \mathcal N \left( y_i|\varvec{\beta }_k^T\varvec{x}_i,\sigma _k^2 \right) \right] . \end{aligned}$$
(5)

This model is the finite mixture model of normal linear regressions that was introduced by Quandt (1972) and Quandt and Ramsey (1978) for a mixture of two switching regressions and later generalized to an arbitrary number of components by Desarbo and Cron (1988). The relative sizes of the subpopulations are called mixture proportions or mixture weights, and the densities in the subpopulations are called the component densities, which are conditional on component membership and the explanatory variables. Note that, in the absence of any other information, the mixture proportions are the a priori probabilities that a randomly sampled subject belongs to a specific component. Maximizing the observed log likelihood (5) can be done in a variety of ways (all iterative, as there is no closed-form solution) and is usually done with the expectation-maximization (EM) algorithm (Dempster et al. 1977), which works with (3) rather than (5). Every iteration of the EM algorithm consists of two steps, an expectation step and a maximization step. In the expectation step the expected value of the complete data log likelihood (3), conditional on the current parameter values and the observed data, is calculated. This expression is subsequently maximized with respect to the model parameters in the maximization step, yielding a new set of parameter values. Dempster et al. (1977) showed that iterating these two steps amounts to maximizing the observed log likelihood, which is the goal. Calculating the conditional expected value of (3) is straightforward, as the only random terms are the \(z_{ik}\), which are binary indicator variables and enter (3) linearly. So, for a general iteration \((t+1)\), the expectation step consists of calculating

$$\begin{aligned} E\left[ Z_{ik}|y_i,\varvec{x}_i,\varvec{\varPsi }^{(t)}\right] =\frac{\pi _k^{(t)}\mathcal N \left( y_i|\varvec{\beta }_k^{T(t)}\varvec{x}_i,\sigma _k^{2(t)} \right) }{\sum _{l=1}^K\pi _l^{(t)}\mathcal N \left( y_i|\varvec{\beta }_l^{T(t)}\varvec{x}_i,\sigma _l^{2(t)} \right) }\equiv \tau _{ik}^{(t+1)}. \end{aligned}$$
(6)

The \(\tau _{ik}^{(t+1)}\) can be viewed as the posterior probability that an observation with observed values \(y_i\) and \(\varvec{x}_i\) belongs to component \(k\). Maximizing (3), with the \(z_{ik}\) replaced by the estimated \(\tau _{ik}^{(t+1)}\), now yields the following closed-form solutions

$$\begin{aligned} \pi _k^{(t+1)}&= \frac{ \sum _{i=1}^n \tau _{ik}^{(t+1)}}{n}\nonumber \\ \varvec{\beta }_k^{(t+1)}&= \left( \varvec{X}^T\varvec{W}_k^{(t+1)}\varvec{X}\right) ^{-1}\varvec{X}^T\varvec{W}_k^{(t+1)}\varvec{y}\nonumber \\ \sigma _k^{2(t+1)}&= \frac{\left( \varvec{y}-\varvec{X}\varvec{\beta }_k^{(t+1)}\right) ^T\varvec{W}_k^{(t+1)}\left( \varvec{y}-\varvec{X}\varvec{\beta }_k^{(t+1)}\right) }{\sum _{i=1}^n \tau _{ik}^{(t+1)}} \end{aligned}$$
(7)

where \(\varvec{X}\) is the \(n \times (p+1)\) design matrix including an intercept column, \(\varvec{y}\) is the vector of outcomes and \(\varvec{W}_k^{(t+1)}\) is a diagonal matrix with diagonal elements \(\tau _{1k}^{(t+1)},\ldots ,\tau _{nk}^{(t+1)}\). The updated parameter estimates are then plugged into (6) to start a new iteration, and the algorithm is carried out until some convergence criterion is satisfied. A nice property of the EM algorithm is that the observed log likelihood cannot decrease from one iteration to the next (Dempster et al. 1977).
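For concreteness, the two steps can be sketched as follows in Python/NumPy. This is an illustration under our notation (the function name `em_step` and the data layout are our own), not the code used in this study; `X` is assumed to already contain the intercept column.

```python
import numpy as np
from scipy.stats import norm

def em_step(X, y, pi, beta, sigma2):
    """One EM iteration for a K-component mixture of linear regressions.

    X: (n, p+1) design matrix including the intercept column,
    pi: (K,) mixture proportions, beta: (K, p+1) coefficients,
    sigma2: (K,) component variances.
    """
    n, K = X.shape[0], pi.shape[0]

    # E-step: posterior membership probabilities tau_ik, Eq. (6).
    dens = np.column_stack([
        pi[k] * norm.pdf(y, loc=X @ beta[k], scale=np.sqrt(sigma2[k]))
        for k in range(K)
    ])
    tau = dens / dens.sum(axis=1, keepdims=True)

    # M-step: closed-form updates, Eq. (7).
    pi_new = tau.mean(axis=0)
    beta_new = np.empty_like(beta)
    sigma2_new = np.empty_like(sigma2)
    for k in range(K):
        w = tau[:, k]                 # diagonal of W_k
        XtW = X.T * w                 # X^T W_k without forming the full matrix
        beta_new[k] = np.linalg.solve(XtW @ X, XtW @ y)
        resid = y - X @ beta_new[k]
        sigma2_new[k] = (w @ resid**2) / w.sum()

    # Observed log likelihood, Eq. (5), for monitoring the iterations.
    loglik = np.log(dens.sum(axis=1)).sum()
    return pi_new, beta_new, sigma2_new, loglik
```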

2.2 Mixture regression in practice

There are some considerations to be made for a practical implementation of finite mixture models. First of all, the log likelihood of a finite mixture model frequently has multiple local optima (McLachlan and Peel 2000). Therefore, for a particular set of starting values, the EM algorithm is only guaranteed to find a local maximum (or a saddle point in pathological cases; McLachlan and Krishnan 2008), not the global maximum (if this exists). In order to increase the probability of locating the desired optimum, it is recommended to apply the EM algorithm from a variety of starting points (McLachlan and Peel 2000) and select the solution with the highest log likelihood value. This strategy is, however, no guarantee of success. Then there is still the matter of selecting appropriate starting values. While there has been some research on obtaining good starting values (see for instance Karlis and Xekalaki 2003 for univariate normal and Poisson data and Biernacki et al. 2003 for multivariate normal data), as far as we know there are no results for mixtures of linear regressions. Viele and Tong (2002) proposed the following strategy for obtaining a random set of starting parameters (a code sketch follows the list):

  • Generate the mixture proportions \(\varvec{\pi }\) as a random draw from a Dirichlet distribution with parameter vector \(\left( 1,\ldots ,1\right) \).

  • For every component \(k=1,\ldots ,K\), select a random sample of \(p+1\) observations \(\left( \varvec{X}_r,\varvec{y}_r\right) \) without replacement from the data and obtain \(\varvec{\beta }_k\) as the exact solution \(\varvec{\beta }_k=\varvec{X}_r^{-1}\varvec{y}_r\) of the resulting square linear system.

  • Generate the component variances as random draws from a uniform distribution with support \([0,s^2_{(1)}]\). Here, \(s^2_{(1)}\) denotes the estimated mean squared error obtained from a regular one-component regression analysis.
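A minimal sketch of this starting-value recipe (the function name and the random generator argument are ours; `s2_one` denotes the mean squared error of a one-component least squares fit):

```python
import numpy as np

def random_start(X, y, K, s2_one, rng):
    """Random starting values in the spirit of Viele and Tong (2002)."""
    n, p1 = X.shape                        # p1 = p + 1, X includes intercept
    pi = rng.dirichlet(np.ones(K))         # flat Dirichlet(1, ..., 1)
    beta = np.empty((K, p1))
    for k in range(K):
        # Exact fit through p+1 randomly chosen observations.
        idx = rng.choice(n, size=p1, replace=False)
        beta[k] = np.linalg.solve(X[idx], y[idx])
    sigma2 = rng.uniform(0.0, s2_one, size=K)   # uniform on [0, s2_one]
    return pi, beta, sigma2
```

Here `rng` would be, e.g., `np.random.default_rng()`; if the sampled \(p+1\) rows happen to be singular, one would simply redraw.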

We have compared this procedure in some small simulation studies with two other procedures. The first alternative procedure only differs in how \(\varvec{\beta }\) is generated. For each component \(k=1,\ldots ,K\), an intercept is randomly drawn from a uniform distribution with support \([\min _i y_i,\max _i y_i]\) and all other coefficients are initialized at 0. The second alternative procedure consists of randomly assigning each sample point to one of the \(K\) components. We have done this by hard assignment (assign each observation to exactly one component) and by soft assignment (assign each observation to every component with random weights). The EM algorithm is then started from the M-step by considering the assignment as the initial E-step. In our results we found that the strategy of Viele and Tong (2002) performed favorably compared to the alternatives.

Second, the EM algorithm is generally known to converge slowly, linearly or even sublinearly (McLachlan and Krishnan 2008). Usually, the algorithm is stopped when the log likelihood and/or the parameter estimates do not change much over the last iterations (McLachlan and Peel 2000). However, due to its slow convergence rate, one can erroneously stop the algorithm too early, i.e. before convergence. Lindstrom and Bates (1988) call such a rule a ‘measure of lack of progress but not of actual convergence’. Böhning et al. (1994) used Aitken’s acceleration to derive a suitable stopping criterion for a linearly convergent sequence. At each iteration (starting from the third), one estimates the stationary value of the log likelihood from the last three log likelihood values as \(l_{\infty }^{(t+1)}=l^{(t)}+\frac{1}{1-a^{(t)}}(l^{(t+1)}-l^{(t)})\), where for simplicity of notation \(l^{(t)}\) denotes the log likelihood value at iteration \(t\) and \(a^{(t)}=\frac{l^{(t+1)}-l^{(t)}}{l^{(t)}-l^{(t-1)}}\) denotes the estimated rate of convergence of the sequence of log likelihood values. We also use this method to decrease the computation time caused by the multiple random starts, as it predicts the stationary log likelihood without requiring the parameters to converge. Each set of starting parameters is iterated until the difference between successive estimates of the stationary log likelihood value is smaller than \(10^{-9}\). The solution with the highest estimated stationary log likelihood is then taken as the optimal solution. However, for some selection criteria (see below) accurate estimates of the parameter values are also necessary. Therefore, the best solution is iterated further until the difference between the actual log likelihood and the estimated stationary log likelihood is smaller than \(10^{-12}\) and the maximum absolute change in the estimated component variances is smaller than \(10^{-9}\). The latter criterion is added because Abbi et al. (2008) found that the variance parameters have the slowest convergence rate.
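The stopping rule can be sketched as follows, reusing the `em_step` function from the earlier sketch (again an illustration; a production implementation would also guard against the numerical instability of the rate estimate near convergence):

```python
def aitken_stationary(l_prev, l_curr, l_next):
    """Aitken estimate of the stationary log likelihood from three
    successive log likelihood values (Böhning et al. 1994)."""
    a = (l_next - l_curr) / (l_curr - l_prev)      # estimated rate
    return l_curr + (l_next - l_curr) / (1.0 - a)

def run_em(X, y, pi, beta, sigma2, tol=1e-9, max_iter=10_000):
    """Iterate em_step until successive Aitken estimates of the
    stationary log likelihood differ by less than tol."""
    lls, ll_inf_prev = [], None
    for _ in range(max_iter):
        pi, beta, sigma2, ll = em_step(X, y, pi, beta, sigma2)
        lls.append(ll)
        if len(lls) >= 3:
            ll_inf = aitken_stationary(*lls[-3:])
            if ll_inf_prev is not None and abs(ll_inf - ll_inf_prev) < tol:
                break
            ll_inf_prev = ll_inf
    return pi, beta, sigma2, lls[-1]
```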

Third, for finite mixture models with normal components and component-specific variance parameters (or covariance matrices), the log likelihood has no global maximum (McLachlan and Peel 2000). In a mixture of normal linear regressions with \(K>1\) components, one can make the log likelihood infinite by taking any \(p+1\) sample points and putting them in a separate component. The fitted hyperplane in this component will then have a perfect fit and its estimated component variance will be 0. Such a solution is obviously neither desired nor useful. A simple solution to this problem would be to put an equality constraint on the variance terms across the components, but this might be too restrictive. Components whose variance tends to 0 are, however, not really a problem in practice, as the computer will detect these and one can simply discard such ‘solutions’. A far more serious problem is the potential existence of ‘spurious’ solutions. McLachlan and Peel (2000, p. 99) describe these as ‘solutions with relatively large local maxima that occur as a consequence of a fitted component having a very small (but nonzero) variance’. Hence, these solutions converge to parameter values which are very close to, but not on, the edge of the parameter space (\(\sigma ^2_k\) and \(\pi _k\) close to 0). Usually, these solutions are not interesting as they accommodate some random local pattern that will most likely not generalize outside the sample. Although only a relatively small number of observations belong to these components, their contribution to the log likelihood may be so high that such a solution has a larger sample log likelihood than the desired local maximum (containing meaningful components) and hence masks the desired solution. Dealing with such solutions (i.e. eliminating spurious solutions) will usually require some judgement from the researcher, which is impossible in a simulation study. In our implementation, there are two ways for a local solution to be discarded. The first is when an estimated component SD becomes smaller than \(10^{-10}\), to avoid singularities. The second is when an estimated mixture proportion becomes smaller than \(\frac{p+1}{n}\), as this is the boundary value of the effective sample size with which a regression plane with \(p+1\) coefficients can be estimated. Other solutions to the singular/spurious component problem include restricting the parameter space or penalizing the likelihood, which is the subject of the next section.

2.3 Penalizing the likelihood

Hathaway (1985) proposed to solve the unboundedness of the likelihood by constraining the parameter space such that \(\min _{k,k^{\prime }}(\frac{\sigma _k}{\sigma _{k^{\prime }}})\ge c >0\) over all combinations of \(k,k^{\prime }=1,\ldots ,K\). This formulation ensures that the log likelihood has a global maximum which is not singular. Furthermore, by choosing the right \(c\) one can also get rid of the spurious solutions. On the other hand, this constraint restricts the solution space and might exclude the desired solution if \(c\) is too large. A simpler approach is to penalize the likelihood, as proposed by Ciuperca et al. (2003) and Chen et al. (2008). For finite mixture models of univariate normal distributions, Ciuperca et al. (2003) proved that when \(K\) is known a priori their penalized likelihood estimator is consistent, and Chen et al. (2008) proved that their version of the penalized likelihood estimator is consistent even when \(K\) is unknown. The latter result was generalized to (unconditional) multivariate normal distributions by Chen and Tan (2009). In all three papers the conjugate prior distribution for the component variances is used as the penalty function, which makes this method a variant of maximum a posteriori estimation. Ciuperca et al. (2003) showed in a small example how the penalized likelihood method can outperform the method of Hathaway (1985) when \(c\) is too large. Chen et al. (2008) and Chen and Tan (2009) showed through simulation that their penalized estimator gives similar and sometimes better parameter estimates in terms of bias and variance compared to the unpenalized approach. In this paper, the approach of Chen et al. (2008) is adopted, which results in a penalized log likelihood of the following form

$$\begin{aligned} LL\left( \varvec{\varPsi }\right) =\sum _{i=1}^n \log \left[ \sum _{k=1}^K \pi _k \mathcal N \left( y_i|\varvec{\beta }_k^T\varvec{x}_i,\sigma _k^2 \right) \right] -a_n \sum _{k=1}^K\left( \frac{s^2_{(1)}}{\sigma ^2_k}+\log \sigma ^2_k\right) \end{aligned}$$
(8)

where \(a_n\) is a constant which depends on the sample size and moderates the influence of the penalty function. The penalty function in (8) is equivalent to putting an inverse-gamma prior on the component variances with mode at \(s^2_{(1)}\). This mode is based on the sample data and is taken to be the estimated variance of the error term in a one-component regression. Maximizing (8) is a well-posed maximization problem with a global maximum in the interior of the parameter space. This, however, does not make (8) concave (there can still be numerous local optima) and therefore does not rid us of the necessity of starting the EM algorithm from different points. Penalizing the likelihood in this way modifies only the estimation of the component variances in the M-step; all other equations in (6) and (7) remain the same. The new closed-form solution in an EM iteration \((t+1)\) is

$$\begin{aligned} \sigma _k^{2(t+1)}=\frac{\left( \varvec{y}-\varvec{X}\varvec{\beta }_k^{(t+1)}\right) ^T\varvec{W}_k^{(t+1)}\left( \varvec{y}-\varvec{X}\varvec{\beta }_k^{(t+1)}\right) +2a_ns^2_{(1)}}{\sum _{i=1}^n \tau _{ik}^{(t+1)}+2a_n}. \end{aligned}$$
(9)

From (9) it can be seen that when \(a_n\) goes to 0 as \(n\) goes to \(\infty \), the penalized estimator is asymptotically equivalent to the unpenalized estimator. However, for a non-zero \(a_n\) in a finite sample, the component variances can never become 0. The resulting estimator in (9) resembles the James-Stein estimator, which is known to decrease the mean squared error of an original estimator by introducing a relatively small bias (James and Stein 1961; Chen and Tan 2009).
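In code, the only change to the M-step of the earlier EM sketch is the variance update (a sketch; `a_n` would be, e.g., `n ** -0.5` and `s2_one` the one-component regression MSE):

```python
def sigma2_penalized(X, y, beta_k, tau_k, a_n, s2_one):
    """Penalized M-step update of a component variance, Eq. (9)."""
    resid = y - X @ beta_k
    return (tau_k @ resid**2 + 2.0 * a_n * s2_one) / (tau_k.sum() + 2.0 * a_n)
```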

In order to validate these results for mixtures of linear regressions and to select an appropriate \(a_n\), we carried out some simulations. We simulated 200 sets of true parameters for a mixture regression model with true number of components \(K=2\) and 3. The mixture proportions were uniformly drawn from the sets \(\pi _1\in \{0.2,0.3,0.4,0.5\}\) for \(K=2\) and \((\pi _1,\pi _2)\in \{(0.2,0.2),(0.2,0.3),(0.2,0.4),(0.3,0.3),(0.3333,0.3333)\}\) for \(K=3\). The regression coefficients were drawn from \(U[-2,2]\) and the component variances from \(U[0.5,2]\), where \(U[a,b]\) denotes a continuous uniform distribution with support \([a,b]\). For each of these \(2\times 200\) sets of true parameters, a thousand samples were generated with sample sizes \(n=300\) and \(n=600\). Every sample had \(p=3\) explanatory variables which were drawn from \(U[0,10]\). A sample of size \(n\) is generated by drawing a single trial multinomial variable with the mixture proportions as parameter vector, so each observation is labeled as belonging to one specific component. Then, for each observation, the dependent variable \(y_i\) is drawn from a normal distribution with mean \(\varvec{\beta }_k^T\varvec{x}_i\) and variance \(\sigma ^2_k\). Estimation was carried out from 9 random sets of starting parameters plus the true parameter vector, using unpenalized estimation and penalized estimation with five specifications of the penalty constant, \(a_n=n^{-\frac{1}{j}}\) with \(j=1,\ldots ,5\); each estimator used the same starting values. It is expected that the solution obtained by starting from the true parameter values will converge most of the time to the desired local optimum. The solutions from the random starts, however, can converge to spurious solutions which may have a larger sample log likelihood. The quality of the estimation procedures is therefore judged by their ability to recover the parameters, which we measure by the mean squared error (MSE) of the estimates with respect to the true parameters.
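The data generating step of this small study can be sketched as follows (our illustration; `beta` has shape \((K, p+1)\) including the intercepts):

```python
import numpy as np

def simulate_mixture_regression(n, pi, beta, sigma2, rng):
    """One sample from a mixture of linear regressions with covariates
    drawn uniformly on [0, 10] and normal within-component errors."""
    K, p1 = beta.shape
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p1 - 1))])
    z = rng.choice(K, size=n, p=pi)              # component labels
    mu = (X @ beta.T)[np.arange(n), z]           # component-specific means
    y = rng.normal(mu, np.sqrt(np.asarray(sigma2)[z]))
    return X, y, z
```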

Tables 1 and 2 present the average mean squared error over the 200 sets of random parameters for \(K=2\) and 3 respectively, with the SDs of the MSE shown in brackets. From Table 1 one can see that larger penalties decrease the average MSE (except for the variance parameters) in the case of two components. Hence, including a penalty term decreases the risk of landing in a spurious solution. For the component variances, the average MSE decreases initially but then rapidly increases beyond the unpenalized average MSE; for larger penalties, the induced bias in the variance estimates offsets the decreased variance of the estimates. From Table 2 it seems that the optimal penalty term with respect to average MSE lies somewhere between the two extremes for most parameters. Both tables demonstrate that the differences between the estimation methods become smaller for larger samples, which is expected as larger samples decrease both the number of spurious solutions and the value of the penalty term. Furthermore, we can see that the average MSE is larger for 3-component models than for 2-component models. This seems logical as more complex models will likely introduce more local optima and hence probably more spurious optima. Larger sample sizes also decrease the average MSE, reflecting the consistency of both estimators. It is also apparent that the intercept parameters are estimated relatively poorly. This is most likely caused by the fact that these parameters are estimated at the boundary of the design space of the explanatory variables; if one is interested in estimating the intercepts precisely, better experimental designs are warranted. Note also the very large average and SD of the MSE of the intercept terms for the unpenalized estimator in the upper part of Table 2. This is due to one very large outlier (estimated MSE of almost 96) for which the unpenalized method deviated extremely from the true solution despite the inclusion of the true parameters among the starting values.

Table 1 Average MSE (SD MSE) for 2 component models
Table 2 Average MSE (SD MSE) for 3 component models

From Tables 1 and 2 it appears that including a penalty term pays off with respect to the MSE. However, it is hard to determine the optimal value of the penalty constant from these tables. We have therefore summarized the results further by summing the parameter-wise average MSE. As the different types of parameters have different ranges, the MSEs were first divided by the square of their range to make the errors comparable.

Figure 1 shows the total average MSE relative to the unpenalized estimator. From this plot it appears that a penalty constant of \(n^{-\frac{1}{2}}\) performs best in our very limited grid search, although the difference with \(n^{-1}\) is very small. Chen and Tan (2009) also found in their simulations that a penalty constant of \(n^{-\frac{1}{2}}\) performed best relative to no penalty and \(n^{-1}\). It might pay dividends to search for the optimal penalty constant over a much finer grid (and for an optimal penalty function), but this is beyond the scope of this paper. As was shown empirically, the penalized maximum likelihood estimator has on average a smaller MSE and is able to steer the EM algorithm away from spurious optima. Therefore, we hypothesize that this estimator can improve model selection for finite mixture models, as most of the (non-Bayesian) selection criteria are based on the maximum log likelihood and/or the maximum likelihood estimates.

Fig. 1 Total average MSE relative to the unpenalized estimator

3 Order selection

3.1 Order selection criteria

Mixture models can be used for two main purposes: density estimation or approximation, and model-based clustering (McLachlan and Peel 2000). A mixture model can ‘semi-parametrically’ estimate densities, as any distributional form can be mimicked by adding enough components (see for instance Marron and Wand 1992). In model-based clustering, the components represent real but unobserved (or perhaps inherently unobservable) groups in a population and thus have a meaningful interpretation. In both cases the number of components is often unknown a priori. Order selection in finite mixture models consists of finding the appropriate number of components based on the observed data. For density estimation, order selection has largely been resolved, as criteria such as \(AIC\) and \(BIC\) appear to select a suitable number of components (McLachlan and Peel 2000). In a model-based clustering context, however, order selection is a hard problem for which no general solution exists yet (McLachlan and Peel 2000; Nylund et al. 2007).

An obvious method to determine the number of components would seem to be the likelihood ratio test, because a model with \(K\) components is nested in a model with \(K+1\) components. Unfortunately, the limiting distribution of the test statistic is not the usual \(\chi ^2\) distribution with degrees of freedom equal to the difference in the number of parameters. The reason is that the regularity conditions used in the derivation of this limiting distribution are violated for mixture models (Ghosh and Sen 1985). Moreover, Seidel et al. (2000a, b) and Seidel and Sevcikova (2004) have demonstrated that the distribution of the likelihood ratio test statistic depends on the particular implementation of the EM algorithm: they showed how different starting strategies, different stopping rules and different ways of handling spurious components affect this distribution in mixtures of exponential distributions. As a way out, McLachlan (1987) suggested a parametric bootstrap approach, in which one generates \(B\) datasets under the null hypothesis (\(H_0: K=K_0\)) and subsequently calculates the likelihood ratio test statistic for each bootstrap sample. Unfortunately, the number of bootstrap samples \(B\) will likely have to be high in order to achieve sufficient power. Furthermore, for every bootstrap sample one has to repeat the estimation procedure used on the original sample, which generally requires multiple starts. This results in a computationally burdensome procedure, especially in a simulation setting, and therefore this selection method will not be used here. Burnham and Anderson (2002) give another justification for this decision, as they vehemently argue throughout their book that hypothesis testing procedures are not designed for model selection. Such tests therefore lack theoretical justification in model selection, whereas information criteria such as \(AIC\) are specifically designed for model selection and should be better suited for order selection in mixture models. Furthermore, Sarstedt (2008) searched applications of mixture regression models in marketing journals between 2000 and 2006 and found that none of the 32 articles used a likelihood ratio test or a bootstrapped version for model selection. In most articles \(BIC\) was used to select the number of components, followed by \(AIC\) and some variants thereof, suggesting that in practice the bootstrap test is not really used. Another type of model selection method which will not be considered here are methods based on the Fisher information matrix, because approximations to this matrix are only valid for very large samples, especially for mixture models (McLachlan and Peel 2000), and inaccurate estimates would only introduce extra variability into the order selection. In what follows, the selection methods which were used in our simulation study will be discussed.

Burnham and Anderson (2002) classify model selection methods into three broad classes: optimization of a selection criterion, hypothesis testing, and ad hoc methods. As mentioned previously, hypothesis testing will not be used here. We start by reviewing some criteria belonging to the first class, the information criteria. Most of these criteria were derived for general statistical models and not for order selection in finite mixture models specifically. It should also be noted that for all subsequent criteria, the model in the candidate model set for which the respective criterion is minimized is the selected model. The best known information criterion is most likely \(AIC\), Akaike's information criterion, which is defined as

$$\begin{aligned} \textit{AIC}=-2LL\left( \hat{\varvec{\varPsi }}\right) +2n_p \end{aligned}$$
(10)

where \(LL\left( \hat{\varvec{\varPsi }}\right) \) is the log likelihood of the data evaluated at the maximum likelihood estimates and \(n_p\) denotes the number of parameters in the model, which equals \((p+3)K-1\) for mixture regressions with \(p\) explanatory variables in each component. Akaike (1974) derived \(AIC\) as an estimate of the (directed) Kullback–Leibler divergence between the true model and the fitted model. The term \(n_p\) is a bias-correction term, as the maximized log likelihood is a positively biased estimator of the expected Kullback–Leibler information. Despite popular belief, \(AIC\) does not require that the true model is in the set of candidate models (Konishi and Kitagawa 1996; Burnham and Anderson 2002), but the approximations in its derivation do require the same regularity conditions as the likelihood ratio test (Titterington et al. 1985; McLachlan and Peel 2000). Several authors have noticed that it tends to overfit, i.e. select too many components, in a finite mixture context (McLachlan and Peel 2000), but it is still used, as shown by Sarstedt (2008). Hurvich and Tsai (1989) developed a bias-corrected version of \(AIC\) for regular linear models with normal errors. Burnham and Anderson (2002), however, also advocate its use in other contexts unless the underlying probability distribution deviates strongly from a normal one. Finite mixtures of normal distributions are not normal and can approximate any continuous distribution to a desired degree of accuracy (McLachlan and Peel 2000); hence, it would seem that this correction will not work well in the mixture context. The corrected \(AIC\), denoted by \(AIC_c\), is equal to \(AIC+\frac{2n_p(n_p+1)}{n-n_p-1}\). It is straightforward to see that this additional penalty is positive for finite sample sizes, so the total penalty is larger than that of \(AIC\), and that it tends to 0 as the sample size increases.

Whereas \(AIC\) is derived from the directed Kullback–Leibler divergence between the truth and the approximating model, Cavanaugh (1999) used the symmetric Kullback–Leibler divergence between truth and approximation. He showed that optimizing this criterion leads to \(KIC=-2LL\left( \hat{\varvec{\varPsi }}\right) +3n_p\), short for Kullback information criterion, which has a larger penalty than \(AIC\). Cavanaugh (2004) also derived a bias-corrected version \(KIC_c=-2LL\left( \hat{\varvec{\varPsi }}\right) +n\log \left( \frac{n}{n-n_p+1}\right) +\frac{n\{(n-n_p+1)(2n_p+1)-2\}}{(n-n_p-1)(n-n_p+1)}\) and showed that it also holds as an approximation for non-linear models. Cavanaugh (1999) argued that \(KIC\) might be a more sensitive measure of departure from the truth than \(AIC\). Interestingly, Bozdogan (1993) conjectured that the asymptotic log likelihood ratio for nested mixture models follows a non-central \(\chi ^2\) distribution, from which he derived that the penalty in (10) should be \(3n_p\), giving the same formula as Cavanaugh's \(KIC\). Another modification of \(AIC\) was suggested by Bhansali and Downham (1977), who proposed increasing the penalty term to \(4n_p\) based on simulations of autoregressive models; we will denote this criterion by \(AIC4\).

One property of \(AIC\) is that it is not a consistent criterion. A consistent model selection criterion is one which, as the sample size grows, asymptotically selects the true model with probability 1, provided that the true model is in the candidate set of models (Burnham and Anderson 2002). Several such consistent criteria have been derived in the literature. It should be noted that a consistent criterion is no longer an estimator of the relative Kullback–Leibler divergence and hence no longer efficient (Burnham and Anderson 2002; Yang 2005). Efficiency here means that, as the sample size tends to infinity, an efficient information criterion selects the model in the candidate model set which has the smallest expected squared prediction error. Perhaps the most famous among the consistent criteria is \(BIC\) (Schwarz 1978), the Bayesian information criterion, which is defined as

$$\begin{aligned} \textit{BIC}=-2LL\left( \hat{\varvec{\varPsi }}\right) +n_p \log (n) \end{aligned}$$
(11)

and can be derived as a large sample approximation of the logarithm of the integrated likelihood (integrated over the parameter space). Using \(BIC\) amounts to selecting the model with the largest posterior probability without specifying priors. McLachlan and Peel (2000) note that the derivation of (11) requires regularity conditions which break down for finite mixture models. However, like \(AIC\), \(BIC\) is still used in practice, as indicated by Sarstedt (2008). It has been reported that \(BIC\) underfits finite mixtures (i.e. selects a model with too few components) for small sample sizes (McLachlan and Peel 2000). \(BIC\) was independently derived as a special case by Rissanen (1986) based on coding theory and the principle of minimum description length. There also exists an adjusted version of \(BIC\), denoted by \(aBIC\), which mitigates underfitting in small samples by replacing the sample size \(n\) in (11) by \(\frac{n+2}{24}\) (Sclove 1987). Liang et al. (1992) mention two other modifications of (11) with penalty terms \(2n_p \log (n)\) and \(5n_p \log (n)\); these criteria will be denoted by \(MDL2\) and \(MDL5\) respectively. Hannan and Quinn (1979) derived another consistent criterion, \(HQ\), which replaces the penalty \(2n_p\) in (10) by \(2n_p \log \left( \log (n)\right) \) and thus has a smaller penalty than \(BIC\). Bozdogan (1987) proposed a consistent modification of (10), namely \(CAIC=-2LL\left( \hat{\varvec{\varPsi }}\right) +n_p\left[ \log (n)+1\right] \), which increases the penalty for any sample size. For non-trivial sample sizes (larger than 55) we can order most of these criteria from the smallest penalty to the largest as \(AIC\), \(KIC\), \(AIC4\), \(BIC\), \(CAIC\), \(MDL2\), \(MDL5\); \(AIC_c\), \(KIC_c\), \(aBIC\) and \(HQ\) fall somewhere in between, depending on the sample size and the dimension of the parameter vector. In general, \(AIC\) will select larger models as it has the lowest penalty term, which may cause overfitting, as reported in the literature. \(MDL5\) on the other hand will select small models, as its penalty term is by far the largest, and will therefore be most prone to underfitting.
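For reference, all of the log likelihood based criteria above can be computed from three ingredients, as in the following sketch (`ll` is the maximized log likelihood of a candidate model; the dictionary layout is our own):

```python
import numpy as np

def information_criteria(ll, n_p, n):
    """Log likelihood based order selection criteria; the model
    minimizing a criterion over the candidate set is selected."""
    m2ll = -2.0 * ll
    return {
        "AIC":  m2ll + 2 * n_p,
        "AICc": m2ll + 2 * n_p + 2 * n_p * (n_p + 1) / (n - n_p - 1),
        "KIC":  m2ll + 3 * n_p,
        "KICc": m2ll + n * np.log(n / (n - n_p + 1))
                     + n * ((n - n_p + 1) * (2 * n_p + 1) - 2)
                     / ((n - n_p - 1) * (n - n_p + 1)),
        "AIC4": m2ll + 4 * n_p,
        "BIC":  m2ll + n_p * np.log(n),
        "aBIC": m2ll + n_p * np.log((n + 2) / 24),
        "CAIC": m2ll + n_p * (np.log(n) + 1),
        "HQ":   m2ll + 2 * n_p * np.log(np.log(n)),
        "MDL2": m2ll + 2 * n_p * np.log(n),
        "MDL5": m2ll + 5 * n_p * np.log(n),
    }
```

For a mixture regression with \(p\) explanatory variables per component, \(n_p=(p+3)K-1\).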

Naik et al. (2007) derived a mixture regression criterion in the spirit of \(AIC\) for simultaneous selection of the number of components and the number of explanatory variables per component, \(MRC\), which has the following formula

$$\begin{aligned} \textit{MRC}=\sum _{k=1}^K n \hat{\pi }_k \log (\hat{\sigma }^2_k)+\sum _{k=1}^K \frac{n \hat{\pi }_k(n \hat{\pi }_k+\hat{p}_k)}{n \hat{\pi }_k-\hat{p}_k-2} -2\sum _{k=1}^K n \hat{\pi }_k \log (\hat{\pi }_k) \end{aligned}$$
(12)

where \(\hat{p}_k=\mathrm{trace}\left( \hat{\varvec{X}}_k\left( \hat{\varvec{X}}_k^T\hat{\varvec{X}}_k\right) ^{-1}\hat{\varvec{X}}_k^T\right) \), \(\hat{\varvec{X}}_k=\hat{\varvec{W}}_k^{1/2}\varvec{X}\) and \(\hat{\varvec{W}}_k\) is a diagonal matrix with elements \(\hat{\tau }_{1k},\ldots ,\hat{\tau }_{nk}\). The first term in (12) measures the lack of fit, and hence minimizing it alone would lead to larger models. This tendency is countered by the second term, which penalizes retaining many explanatory variables, and by the third term, which penalizes the number of components. When \(K=1\), (12) is equal to \(AIC_c\), and for large samples it is equivalent to \(AIC\). Similar to Cavanaugh (1999), Hafidi and Mkhadri (2010) derived an information criterion based on the symmetric Kullback–Leibler divergence, which we will call \(MRC_k\) and which is defined as \(MRC+\sum _{k=1}^K \left( \hat{p}_k+1\right) \).
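A sketch of (12) in code (our illustration; `tau` is the \(n\times K\) matrix of estimated posterior probabilities):

```python
import numpy as np

def mrc(X, tau, sigma2, kullback=False):
    """MRC of Naik et al. (2007), Eq. (12); with kullback=True the
    MRC_k variant of Hafidi and Mkhadri (2010)."""
    n, K = tau.shape
    pi_hat = tau.mean(axis=0)
    total, p_hats = 0.0, []
    for k in range(K):
        Xk = np.sqrt(tau[:, k])[:, None] * X       # W_k^{1/2} X
        p_hat = np.trace(Xk @ np.linalg.solve(Xk.T @ Xk, Xk.T))
        p_hats.append(p_hat)
        nk = n * pi_hat[k]
        total += (nk * np.log(sigma2[k])
                  + nk * (nk + p_hat) / (nk - p_hat - 2)
                  - 2 * nk * np.log(pi_hat[k]))
    return total + (sum(p_hats) + K if kullback else 0.0)
```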

Besides the information criteria we also consider some classification-based methods, which were specifically developed for finite mixture models but not for mixtures of (linear) regressions. These methods take classification into account and tend to select models which are able to classify the observations convincingly. It can be shown that the estimated complete data log likelihood equals the sample log likelihood minus the entropy of the matrix of estimated posterior probabilities (Hathaway 1986):

$$\begin{aligned} LL_c\left( \hat{\varvec{\varPsi }},\hat{\varvec{\tau }}\right) =LL\left( \hat{\varvec{\varPsi }}\right) +\sum _{k=1}^K \sum _{i=1}^n \hat{\tau }_{ik} \log \hat{\tau }_{ik} \end{aligned}$$
(13)

where \(\hat{\varvec{\tau }}\) denotes the matrix of posterior probabilities and the second term on the right-hand side is the negative of the estimated entropy \(EN(\hat{\varvec{\tau }})\). Biernacki and Govaert (1997) suggested using this quantity for order selection: multiplying (13) by \(-2\) yields the classification likelihood criterion (\(CLC\)). Biernacki and Govaert (1997) found that this criterion works well for well-separated components and equal mixture proportions. Banfield and Raftery (1993) also used the classification likelihood to derive an approximate Bayesian criterion called the approximate weight of evidence, \(AWE=-2 LL_c\left( \hat{\varvec{\varPsi }},\hat{\varvec{\tau }}\right) + 2 n_p (\frac{3}{2}+\log n)\); note that its penalty term is very large. Celeux and Soromenho (1996) proposed to use the entropy directly to select the number of components through the normalized entropy criterion \(NEC=\frac{EN(\hat{\varvec{\tau }})}{LL\left( \hat{\varvec{\varPsi }}\right) -LL_{(1)}}\), where \(LL_{(1)}\) denotes the maximized log likelihood of a one-component model. As this criterion is undefined for \(K=1\), Biernacki et al. (1999) modified it by setting \(NEC\) to 1 in that case. Since \(CLC\) and \(NEC\) do not penalize model complexity, they might tend to overfit, which can be overcome by including such a penalty. Furthermore, \(BIC\) does not take the mixture context into account. Biernacki et al. (1998) proposed to solve both problems with the integrated classification likelihood criterion, which is defined as

$$\begin{aligned} ICL=CLC+2n \sum _{k=1}^K \hat{\pi }_k \log (\hat{\pi }_k) + (n_p-K+1)\log (n)-2F(n\hat{\pi }_1,\ldots ,n\hat{\pi }_K) \end{aligned}$$
(14)
$$\begin{aligned} F(n_1,\ldots ,n_K)=\sum _{k=1}^K \log \left( \Gamma (n_k+\alpha )\right) -\log \left( \Gamma (n+K\alpha )\right) -K\log \left( \Gamma (\alpha )\right) +\log \left( \Gamma (K\alpha )\right) \end{aligned}$$
(15)

where \(\Gamma (\cdot )\) is the gamma function and \(\alpha \) represents the parameter of a Dirichlet prior distribution on \(\varvec{\pi }\). The Jeffreys non-informative prior takes \(\alpha =1/2\), which is what Biernacki et al. (1998) use and what will be used here. Biernacki et al. (2000) also provide a large sample \(BIC\)-like approximation to \(ICL\), namely \(ICL\)-\(BIC=CLC+n_p\log (n)\), and found that this approximation does not differ much from using (14). An overview of all order selection criteria considered can be found in Table 3.
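The classification-based criteria can be sketched as follows (our illustration; `tau` again denotes the matrix of estimated posterior probabilities and `ll_one` the one-component log likelihood):

```python
import numpy as np
from scipy.special import gammaln

def classification_criteria(ll, ll_one, tau, n_p, alpha=0.5):
    """CLC, NEC, ICL (Eqs. 14-15) and its ICL-BIC approximation."""
    n, K = tau.shape
    en = -np.sum(tau * np.log(np.clip(tau, 1e-300, None)))  # entropy EN(tau)
    clc = -2.0 * (ll - en)              # -2 times the complete data log lik.
    nec = en / (ll - ll_one) if K > 1 else 1.0
    pi_hat = tau.mean(axis=0)
    nk = n * pi_hat
    # log of the term F in Eq. (15), via log-gamma for numerical stability.
    F = (np.sum(gammaln(nk + alpha)) - gammaln(n + K * alpha)
         - K * gammaln(alpha) + gammaln(K * alpha))
    icl = (clc + 2 * n * np.sum(pi_hat * np.log(pi_hat))
           + (n_p - K + 1) * np.log(n) - 2 * F)
    return {"CLC": clc, "NEC": nec, "ICL": icl,
            "ICL-BIC": clc + n_p * np.log(n)}
```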

Table 3 Overview of order selection criteria

3.2 Previous results

Selecting the correct number of components has been extensively studied in the literature. These simulation studies vary in the type of models considered, the selection methods used and the settings of the simulation design (experimental factors). In this section, some of these studies will be reviewed.

In the context of mixtures of multinomial distributions (also known as latent class analysis) several extensive simulation studies have been performed. Yang (2006) found that \(aBIC\) was generally the best criterion; for large samples \(BIC\) and \(CAIC\) also performed well. Dias (2007) concluded that \(BIC\) outperforms several complete-information based criteria. Yang and Yang (2007) also found that \(aBIC\) was the best performing information criterion and mention \(KIC\) as a good alternative. Cutler and Windham (1994) simulated mixtures of multivariate normal components and found that \(ICOMP\) was superior to both \(AIC\) and \(BIC\). In a small-scale simulation, McLachlan and Ng (2000) found that \(ICL\) and \(ICL\)-\(BIC\) outperformed \(BIC\) and \(AIC\) and showed that \(AIC\) tends to overfit. Celeux and Soromenho (1996) also performed some simulations for both univariate and multivariate mixtures of normal distributions. They found that \(AIC\) has a slight tendency to select too many components, that \(BIC\) tends to select too few, and that \(NEC\) and \(ICOMP\) generally perform best. Nylund et al. (2007) concluded that \(BIC\) is the best information criterion for both mixtures of multinomial distributions and mixtures of normal distributions; they also showed that a parametric bootstrap of the likelihood ratio test outperforms \(BIC\). An interesting study is that of Fonseca and Cardoso (2007), who compared the performance of several selection measures on 42 real datasets where the true number of components is known. For the categorical datasets, they found that \(KIC\) worked best, as it selected the correct number of components in 95 % of the cases. For continuous data, they used multivariate normal models and found that \(BIC\) works best, with a success rate of 77 %. In the datasets with mixed types of data (both continuous and categorical) they found that \(ICL\)-\(BIC\) performed best (80 % success rate). They also noted that the performance of the \(AIC\) family of information criteria and of \(ICL\)-\(BIC\) varied a lot across the different types of data. From their results, it can be seen that \(BIC\) has the highest average success rate, followed by \(CAIC\); \(CLC\) on the other hand performs worst on average, followed by \(AWE\). Steele and Raftery (2009) also performed a small-scale simulation study for univariate mixtures of normal distributions. They found that \(BIC\) outperformed \(AIC\) and \(ICL\)-\(BIC\) and that \(AIC\) had a slight tendency to overfit. Jedidi et al. (1997) found that \(BIC\), and to a lesser extent \(CAIC\), work well in mixtures of structural equation models. Andrews and Currim (2003a) showed that \(KIC\) outperforms \(ICOMP\), \(BIC\) and a validation sample method in mixtures of logistic regressions. In the context of mixtures of growth models, Lubke and Neale (2006) found that \(AIC\) and \(aBIC\) outperform \(BIC\) and \(CAIC\). Tofighi and Enders (2008) also found that \(aBIC\) works well and that \(BIC\) performs poorly for this type of model.

Hawkins et al. (2001) were the first to systematically investigate model selection in finite mixtures of univariate linear regressions using an extensive simulation study. The factors in the experiment were the true number of mixing components (1 to 4), the mixture proportions, and the parameters of the component regression models, which were condensed into one measure of separation between the components. They compared order selection based on 22 selection criteria derived from the log likelihood, an approximation to the Fisher information matrix, and several approximations to the complete data log likelihood and the complete data Fisher information matrix; they also included two classification-based measures. In general they concluded that the model selection performance of all criteria decreased as the true number of components increased and in the presence of small mixture proportions, and increased when the components were better separated. For a small number of components (1 or 2) they found that \(ICOMP\) was the second worst criterion (better only than the log likelihood itself), while \(BIC\) and to a lesser extent \(AWE\) performed best in that situation. For larger numbers of components no criterion outperformed the others in all circumstances. They could, however, conclude that \(AIC\), \(KIC\), \(ICOMP\), \(BIC\) and \(AWE\) as a group performed better than the other measures, which were based on approximations of the complete data Fisher information matrix or on the posterior probabilities. Finally, they also noted that \(KIC\) did not systematically outperform \(AIC\) or vice versa. Andrews and Currim (2003b) investigated the performance of \(AIC\), \(KIC\), \(BIC\), \(CAIC\), \(ICOMP\), a validation sample log likelihood and \(NEC\) in a simulation of mixtures of linear regression models with repeated observations per subject. They varied eight factors: the true number of components, the mean separation between component coefficients, the number of subjects, the number of observations per subject, the number of explanatory variables, the \(R^2\) within the components, the minimum mixture proportion and the measurement level of the explanatory variables. They found that \(KIC\) was the best criterion in all experimental conditions, followed by \(BIC\) and the validation log likelihood; \(ICOMP\), \(NEC\) and \(AIC\) on the other hand did not perform well. Oliveira-Brochado and Martins (2008) performed a simulation study similar to Andrews and Currim (2003b). They added another experimental factor differentiating between normal errors and uniform errors, and compared 26 selection criteria. They found that overall \(KIC\), \(ICL\)-\(BIC\), \(HQ\) and \(AIC4\) (in that order) performed best and that \(AIC\), \(AIC_c\) and \(ICOMP\) had the largest tendency to overfit. Most of the classification-based criteria on the other hand showed high rates of underfitting. Both studies also showed that the performance of the criteria generally increases when the true model is less complex (fewer components and explanatory variables), when the separation between the components increases, when the sample size grows and in the absence of very small components. Surprisingly, Oliveira-Brochado and Martins (2008) found that error misspecification had only a small negative effect.
Finally, Sarstedt (2008) investigated the performance of \(AIC\), \(KIC\), \(BIC\) and \(CAIC\) in mixtures of univariate regressions while varying the sample size systematically between 100 and 500. He found that \(CAIC\), and to a lesser degree \(BIC\), performed well across all sample sizes; \(KIC\) only performed well for sample sizes larger than 250, and \(AIC\) performed poorly in all experimental conditions.

4 Simulation study

4.1 Experimental design

The design of our simulation study largely follows Hawkins et al. (2001). The number of explanatory variables \(p\) is set to 3 in all true models. All explanatory variables are drawn from uniform distributions with support \([0,10]\). The regression coefficients (including the intercept) and the component variances are drawn from uniform distributions with support \([-2,2]\) and \([0.5,2]\) respectively, to increase generalizability. As a measure of separation between the components we calculated the average distance between the component regression hyperplanes, as in Hawkins et al. (2001). The distance between two components \(k\) and \(l\) at a specific point \(\varvec{x}\) equals

$$\begin{aligned} M_{kl}=\sqrt{\frac{\left( \varvec{\beta }_k^T\varvec{x}-\varvec{\beta }_l^T\varvec{x}\right) ^2}{\sigma ^2_k+\sigma ^2_l}}. \end{aligned}$$
(16)

We evaluated this distance at 50 evenly spaced grid points between 0 and 10 in each of the 3 dimensions and took the average as the separation between components \(k\) and \(l\), as sketched below.
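A sketch of this computation (our own code; the exact grid conventions are an assumption):

```python
import numpy as np

def average_separation(beta_k, beta_l, s2_k, s2_l, n_grid=50, p=3):
    """Average of the distance (16) between two component hyperplanes
    over a uniform grid on [0, 10]^p (intercept added to each point)."""
    grid_1d = np.linspace(0, 10, n_grid)
    axes = np.meshgrid(*([grid_1d] * p), indexing="ij")
    pts = np.column_stack([ax.ravel() for ax in axes])   # (n_grid**p, p)
    X = np.column_stack([np.ones(pts.shape[0]), pts])
    diff = X @ (np.asarray(beta_k) - np.asarray(beta_l))
    return np.mean(np.sqrt(diff**2 / (s2_k + s2_l)))
```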

The experimental factors are:

  • \(K^*\) True number of components: 1, 2 or 3;

  • \(n\) Sample size: 100, 300 or 600;

  • \(\varvec{\pi }\) The mixture proportions: equal (\(1/K^*\)) or unequal with \(\varvec{\pi }=(0.34,0.66)\) for \(K^*=2\) and \(\varvec{\pi }=(0.25,0.25,0.5)\) for \(K^*=3\);

  • \(t\) Type of model (mis)specification: 1–9.

A type \(t=1\) indicates no misspecification and is the only specification (together with \(t=5\) and some instances of \(t=9\)) where the true model is in the set of candidate models. Type 2 means that after the true data generation 3 independent explanatory variables were added to the sample. This is a situation which frequently arises when researchers are unsure which variables are relevant; the data used for estimation thus contain superfluous, uninformative variables. Misspecification type 3 indicates that after data generation one of the explanatory variables was dropped from the sample (we arbitrarily took the last one). This mimics a situation where an important variable is not known to be related to the dependent variable. In both cases, due to the independence of the explanatory variables, the regression coefficients can be expected to still be estimated without bias when the model is estimated with \(K^*\) components. It is, however, expected that with type 2 misspecification the order selection procedures can capitalize on the higher dimensionality of the parameter space and hence prefer models with more components, which would lead to overfitting. In situation 3, the parameter space has a smaller dimension and it might therefore be harder to pick up the true number of components. As the importance of the dropped explanatory variable is not uniform across the components (its regression coefficient varies across the components), specific components may become much harder to find when \(|\beta _{3k}|\) is large, whereas the detection of others with small \(|\beta _{3k}|\) might hardly be influenced. It is therefore expected that this misspecification increases the rate of underfitting. Misspecification type 4 means that the true data generating mechanism includes an interaction (arbitrarily taken between explanatory variables 2 and 3) whereas the model is estimated without this effect. The estimated regression coefficients will no longer be unbiased as the explanatory variables are correlated with the omitted but real interaction effect; it is unclear how this will affect model selection. Type 5 is not a real model misspecification as it indicates that the explanatory variables are correlated. The design matrix for this type was generated according to Falk (1999) with all correlations set to 0.5. For all types 1–5 the errors are normally distributed as specified earlier. Misspecification type 6 indicates that the normal error terms are transformed to have a higher kurtosis and type 7 that they are transformed to have skewed errors. The transformations were done according to Fleishman's method (Fleishman 1978). The type 6 errors were transformed to have an excess kurtosis of 4, whereas the type 7 errors were transformed to have an excess kurtosis of 4 and a skewness of 1.5.
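One way to implement Fleishman's method is to solve the standard third-order moment equations numerically; the sketch below (our own, including the choice of solver and starting point) produces the type 7 errors:

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, ex_kurt):
    """Coefficients of Y = a + b*Z + c*Z**2 + d*Z**3 (Z standard normal)
    such that Y has mean 0, variance 1 and the requested skewness and
    excess kurtosis; these are the published Fleishman (1978) equations."""
    def moment_equations(coef):
        b, c, d = coef
        return (
            b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,            # unit variance
            2*c * (b**2 + 24*b*d + 105*d**2 + 2) - skew,    # skewness
            24 * (b*d + c**2 * (1 + b**2 + 28*b*d)
                  + d**2 * (12 + 48*b*d + 141*c**2 + 225*d**2)) - ex_kurt,
        )
    b, c, d = fsolve(moment_equations, x0=(1.0, 0.0, 0.0))
    return -c, b, c, d                                      # a = -c

# Type 7 errors: excess kurtosis 4 and skewness 1.5 (type 6: skewness 0).
a, b, c, d = fleishman_coefficients(1.5, 4.0)
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
eps = a + b*z + c*z**2 + d*z**3     # rescale by sigma_k before use
```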

The effect of these transformations is illustrated in Fig. 2 for standard normal variables. It can readily be seen that type 6 makes the tails of the error distribution heavier than those of a normal distribution. On the one hand this makes it easier to find the real components, but on the other hand it may lead to extra components which accommodate the outlying observations. It is therefore expected that this type of misspecification will lead to overfitting. For misspecification type 7, the error terms are asymmetric, which will most likely also lead to overfitting; Titterington et al. (1985), for instance, showed that it is practically impossible to differentiate a lognormal distribution (which is skewed) from a mixture of 2 normal distributions. Misspecification type 8 is a case where the errors within a component are heteroskedastic, meaning that within each component the normal errors have different variances depending on the size of the explanatory variables. This was achieved by multiplying the error of observation \(i\) belonging to component \(k\) by \(\exp (\frac{\sum _{j=1}^p x_{ij}}{5p}-0.3)\); afterwards the errors were multiplied by the appropriate scaling factor to give them the required average variance within each component (see the sketch below). It is expected that this will also lead to increased overfitting, as the regions with higher error variability might accommodate multiple components. As suggested by an anonymous reviewer, we also included a setting (\(t=9\)) where the model is a mixture of linear time series models, taken to be the mixture autoregressive model introduced by Wong and Li (2000). In this model the components are linear autoregressive time series models (\(AR(p_k)\)) where each component can have a different autoregressive order \(p_k\). One can interpret the components as representing different regimes which manifest themselves as different temporal dependencies among the observations, depending on the time period; alternatively, this model can be used as a more flexible tool to model the conditional distribution of the observed variable at different time points. Here we only consider first-order stationary models, and because in practice the autoregressive order of the time series is not known, we assume that an upper bound of 3 can be established. This means that in each component the autoregressive order is assumed to be 3 during estimation, while it is uniformly set between 1 and 3 in the data generating process. The design matrix here is formed by the past observations of the observed variable, and the coefficients of the non-contributing autoregressive terms are set to 0 in the data generating process. Note that there is a distinction here between order selection in a mixture model, which is taken to be the selection of the number of components, and autoregressive order selection within a component, which means selecting the number of past observations used to model the observed variable. Here we only consider order selection, as this is the most important, necessary first step in model selection for mixture autoregressive models, as noted by Wong and Li (2000). As a measure of separation between components we used the unconditional means and variances of the observed variable within each component. An overview of the different model specifications can be found in Table 4.
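The heteroskedastic (type 8) errors can be generated as in the following sketch (our reading of the description above; the rescaling convention is an assumption):

```python
import numpy as np

def heteroskedastic_errors(X_cov, sigma2_k, rng):
    """Type 8 errors: scale standard normal errors by
    exp(sum_j x_ij / (5p) - 0.3) and rescale so the average variance
    within the component is sigma2_k. X_cov excludes the intercept."""
    n, p = X_cov.shape
    scale = np.exp(X_cov.sum(axis=1) / (5 * p) - 0.3)
    eps = rng.standard_normal(n) * scale
    return eps * np.sqrt(sigma2_k / np.mean(scale**2))
```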

Table 4 Overview of model (mis)specifications
Fig. 2 Kernel density plot of error distributions

The design is full factorial and was executed with 1,000 replications. For each replication and combination of factor settings, a set of parameters and a design matrix were generated as specified above. Component membership was generated by drawing a sample of size \(n\) from a multinomial distribution with parameter vector \(\varvec{\pi }\). The dependent variables \(y_i\) were then generated as draws from a normal distribution with mean \(\varvec{\beta }_k^T\varvec{x}_i\) for the relevant component \(k\) and a (potentially transformed) variance; a sketch of this step is given below. Models where \(K^*=1\) were fitted with \(K=1,\ldots ,3\), models with \(K^*=2\) with \(K=1,\ldots ,5\) and models with \(K^*=3\) with \(K=1,\ldots ,6\), where \(K^*\) denotes the true number of components. Estimation was done with an unpenalized and a penalized EM algorithm with 200 random starts for \(K>1\); the penalty constant was taken to be \(n^{-\frac{1}{2}}\). As measures of performance we consider the root mean squared error of estimation and the success rate of the order selection criteria with respect to the known true number of components \(K^*\). However, as the correct model is not always in the set of candidate models, a model with \(K\ne K^*\) may sometimes be the more appropriate one. Therefore we also consider a validation sample of size 1,000 generated from the true data generating model. The estimated model with the highest log likelihood in the validation sample is counted as the success with respect to out-of-sample prediction, since this log likelihood estimates the Kullback–Leibler divergence up to a constant (Burnham and Anderson 2002).
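A minimal sketch of one replication of this data generating step, assuming the untransformed, normally distributed errors of types 1–5 (all names are ours; the inputs are numpy arrays):

import numpy as np

def generate_replication(X, betas, sigmas, pi, rng):
    # X: n x (p+1) design matrix (including the intercept column),
    # betas: K x (p+1), sigmas and pi: length-K arrays.
    n = X.shape[0]
    z = rng.choice(len(pi), size=n, p=pi)       # component memberships
    means = np.einsum('ij,ij->i', X, betas[z])  # beta_{z_i}^T x_i
    y = rng.normal(means, sigmas[z])            # add N(0, sigma_k^2) errors
    return y, z

For specification types 6–8 the normal draws would be replaced by the transformed errors described above.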

4.2 Results and discussion

Before analyzing the model selection results, we take a look at the convergence of the unpenalized and the penalized estimators; a bare-bones sketch of the underlying EM algorithm is given after this paragraph. From Table 5 it can be seen that models with \(K = 2\) components were always estimable. Note also that models with only \(K = 1\) component are always estimable and identical for both estimators. Furthermore, there is almost no difference between the two estimation procedures (in terms of convergence) when the true number of components is \(K^* = 1\). For a higher number of true components, however, there are large differences between the two procedures. Note that for models with true number of components \(K^* = 3\), the penalized estimator did not always converge when estimating with this number of components. These instances generally occurred when one component lay very close to another, making the two virtually indistinguishable, especially when the sample size was small. Such solutions were discarded, as selecting the correct number of components is then impossible.
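The sketch below is a generic, unpenalized EM iteration for a \(K\)-component mixture of linear regressions from one random start, not the exact implementation used here; the penalized variant would add the penalty term (with constant \(n^{-\frac{1}{2}}\)) to the M-step objective. Runs in which a component variance collapses towards 0 are exactly the degenerate, spurious solutions discussed above.

import numpy as np
from scipy.stats import norm

def em_mix_reg(X, y, K, n_iter=500, tol=1e-8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    resp = rng.dirichlet(np.ones(K), size=n)  # random-start responsibilities
    loglik_old = -np.inf
    for _ in range(n_iter):
        # M-step: weighted least squares and variance update per component
        pi = resp.mean(axis=0)
        betas, sigmas = [], []
        for k in range(K):
            w = resp[:, k]
            beta_k = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
            resid = y - X @ beta_k
            betas.append(beta_k)
            sigmas.append(np.sqrt((w * resid**2).sum() / w.sum()))
        # E-step: responsibilities proportional to pi_k * component density
        dens = np.column_stack([pi[k] * norm.pdf(y, X @ betas[k], sigmas[k])
                                for k in range(K)])
        loglik = np.log(dens.sum(axis=1)).sum()
        resp = dens / dens.sum(axis=1, keepdims=True)
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return np.array(betas), np.array(sigmas), pi, loglik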

Table 5 Rates of properly converged estimations

From Table 5 it can also be observed that the penalized likelihood estimator can partly serve as an order selection tool by not converging to any non-spurious solution when \(K>K^*\). This also happens for the unpenalized estimator, but much less frequently. Finally, it can be noted that models with \(K > K^*\) could be estimated more frequently for specification types 3, 8 and especially 9.

Next we consider the root mean squared error of both estimators when \(K^*>1\) and \(K=K^*\), because for one-component models the estimators are identical and for \(K\ne K^*\) the model structure is too different from the truth to allow an easy comparison. As before, the results are adjusted for the different ranges of the different types of parameters. Figure 3 depicts the root mean squared error for the 3 levels of sample size and the 9 types of specification. From this figure it is immediately clear that on most occasions the penalized estimates have smaller root mean squared errors than the unpenalized estimates. Furthermore, when the unpenalized estimator is better, its advantage is smaller than that of the penalized estimator in the reverse cases. Hence, once again, penalization leads to better estimation. It can also be seen that both estimators perform better in larger samples and that specification types 1, 4 and 5 can be estimated accurately. Specification types 8 and especially 3, on the other hand, lead to the worst estimates.

Fig. 3 Root mean squared error differentiated by sample size (left) and by specification type (right). Note that the axes in both panels are on a logarithmic scale

Figures 4 and 5 present the rates of underfitting, correct fitting and overfitting with respect to the true number of components for each selection method and specification type. Both figures show the difference between the selected and the true number of components: values larger than 0 indicate overfitting, i.e. the selected number of components exceeds the true number, values below 0 represent underfitting, and 0 represents correct selection of the number of components (a minimal sketch of this tabulation is given below). The dark bars are the rates based on the penalized estimator whereas the lighter bars represent the unpenalized estimator. A striking difference can be noticed between penalized and unpenalized estimation. All but two of the model selection criteria have a higher or equal probability of selecting the correct number of components when based on the penalized estimator, and when they do worse, it is only marginally so. When they do better, on the other hand, the difference is often quite substantial. What also becomes clear from Fig. 4 is that several criteria have a tendency to overfit whereas others tend to underfit. The \(AIC\) family (\(AIC\), \(AIC_c\), \(KIC\), \(KIC_c\) and \(AIC4\)) belongs to the overfitting group when wrong; especially \(AIC\) and \(AIC_c\) can overfit severely. \(aBIC\) and \(HQ\) also tend to overfit. \(MRC\), \(MRC_k\), \(MDL2\), \(MDL5\), \(CLC\), \(AWE\), \(ICL\), \(ICL\)-\(BIC\) and \(NEC\), on the other hand, tend to underfit when they are wrong, whereas \(BIC\) and \(CAIC\) can err in both directions. Finally, it appears that on average, over all experimental settings, \(CAIC\), \(MRC\), \(MRC_k\), \(MDL2\), \(ICL\), \(ICL\)-\(BIC\), \(BIC\) and \(CLC\) perform best, while \(AIC\), \(AIC_c\) and \(aBIC\) perform worst.

Figure 5 is constructed similarly to Fig. 4, but the rates are now split up according to specification type. Again, there is a striking difference between the performance based on the penalized estimator and that based on the unpenalized estimator. We can also see that, on average, specification types 1, 4 and 5 behave very similarly: the model selection criteria have a high success rate for these types and do not seem to systematically overfit or underfit. With a slightly worse performance, and a slight tendency to overfit, adding superfluous covariates (\(t = 2\)) also appears quite innocuous. Dropping a covariate, \(t = 3\), however, has a large negative effect, though it does not systematically lead to overfitting or underfitting. Error misspecification also decreases the average success rates and tends to lead to overfitting. This is not surprising, as the errors for these types naturally lend themselves to being fit by more than 1 normal component. Finally, it can be observed that for the autoregressive models the probability of selecting the correct number of components is, on average, only slightly larger than the probability of underfitting by exactly 1 component, and smaller than the overall probability of underfitting.
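The plotted quantity can be tabulated directly; a minimal sketch (names ours):

import numpy as np

def fit_rates(selected_K, true_K):
    # Relative frequency of each value of (selected - true); negative
    # values are underfitting, 0 is correct, positive values overfitting.
    diff = np.asarray(selected_K) - np.asarray(true_K)
    values, counts = np.unique(diff, return_counts=True)
    return dict(zip(values.tolist(), counts / diff.size))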

Fig. 4 Rates of underfitting, correct fitting and overfitting by order selection method. The dark bars represent the penalized estimator and the lighter bars represent the unpenalized estimator

Fig. 5 Rates of underfitting, correct fitting and overfitting by specification type. The dark bars represent the penalized estimator and the lighter bars represent the unpenalized estimator

Tables 6, 7 and 8 summarize the main results. Table 6 contains the results for the case where the true number of components is 1 (\(K^* = 1\)). Order selection here entails the important decision of whether there actually is heterogeneity in the population, in the form of multiple groups, or not. Only the results of order selection based on the penalized estimator are presented, as every selection method performed better with this estimator. It can readily be seen that the best performing criteria here are \(AWE\), \(MDL5\), \(MRC\), \(MRC_k\), \(ICL\) and \(ICL\)-\(BIC\). It should be noted, however, that the performance of these criteria in this case is not a completely reliable quality measure, as a success rate of 100 % can be achieved simply by making the penalty term on the number of parameters large enough. On the other side of the spectrum, \(AIC\) and \(AIC_c\) perform dreadfully, overfitting in nearly every case. A curious result is that the performance of several criteria diminishes, or at least does not increase, with larger sample sizes across all types of true model specification. These criteria are \(AIC\), \(AIC_c\), \(KIC\), \(KIC_c\), \(AIC4\), \(CAIC\) and \(MDL2\). This is not a desirable property, as more information should lead to better inference; it implies that, as the number of observations per parameter increases, these criteria tend to overfit. With respect to the type of model (mis)specification, one can observe that even without any misspecification \(AIC\), \(AIC_c\), \(KIC\), \(KIC_c\) and \(aBIC\) perform poorly. Among the selection criteria that consistently perform well, there is not much of a drop from no misspecification to the various other specification types. Another notable effect is that \(BIC\), \(CAIC\) and \(CLC\) swing between excellent and very poor performance across the types of model specification. \(BIC\) and \(CAIC\) appear sensitive to error misspecification and to specification type 3, a missing covariate: their success rates drop sharply for larger sample sizes in these settings. \(CLC\), on the other hand, appears most sensitive to small samples, as its success rate increases sharply from \(n = 100\) to 300. On average, error misspecification causes the most severe problems for the various selection methods.
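For reference, the classical criteria discussed here are simple functions of the maximized log likelihood, the number of free parameters \(m\) and the sample size \(n\). The sketch below shows the textbook forms of a few of them; the mixture-specific criteria of Sect. 3, such as \(ICL\)-\(BIC\) or \(NEC\), additionally involve the estimated entropy of the component memberships and are omitted here.

import numpy as np

def classical_criteria(loglik, m, n):
    aic = -2.0 * loglik + 2.0 * m
    return {'AIC':  aic,
            'AICc': aic + 2.0 * m * (m + 1) / (n - m - 1),
            'BIC':  -2.0 * loglik + m * np.log(n),
            'CAIC': -2.0 * loglik + m * (np.log(n) + 1),
            'HQ':   -2.0 * loglik + 2.0 * m * np.log(np.log(n))}

Order selection then amounts to fitting \(K = 1,\ldots ,K_{max}\) components and picking the \(K\) whose fitted model minimizes the chosen criterion.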

Table 6 Success rates of the penalized estimator with respect to the true number of components when \(K^* = 1\)
Table 7 Success rates of the penalized estimator with respect to the true number of components when \(K^* = 2\)
Table 8 Success rates of the penalized (P) and unpenalized (U) estimator with respect to the true number of components when \(K^* = 3\)

Table 7 summarizes the main results for the two-component models, which are more challenging since there are more parameters per observation and a selection method can err in two directions. It turns out that equal versus unequal component sizes does not have a large impact on the success rates of the selection criteria, so this differentiation is not shown here. Again, only the results based on the penalized estimator are shown, as every selection method did better, or nearly as well, with this estimator. There is, however, one small exception: \(ICL\), \(ICL\)-\(BIC\) and \(NEC\) perform substantially better based on the unpenalized estimator in the case of specification type 9, the mixture autoregressive model. Unfortunately, substantially better does not imply well, as their success rates only increase to about 0.52–0.66, 0.32–0.46 and 0.27–0.35 for sample sizes \(n=100\), 300 and 600 respectively. On average, \(MDL2\), \(MRC\), \(MRC_k\), \(ICL\), \(ICL\)-\(BIC\) and \(NEC\) perform best here across all other experimental settings. \(AIC\), \(AIC_c\) and \(aBIC\) again perform poorly, and \(AIC\), \(AIC_c\), \(KIC\) and \(KIC_c\) again show decreasing success rates as the sample size increases, owing to their overfitting nature. We can also observe that several specification types have a substantial impact on otherwise well performing criteria. Type 9 appears to have the largest impact, followed by type 3; error misspecifications also have a negative impact, but to a lesser extent than the former two. It can further be seen that selection criteria that tend to overfit in many specifications achieve higher success rates in the mixture autoregressive model, whereas the criteria that tend to underfit perform very badly for this specification type, especially compared with the other model specifications. The latter criteria do appear to be relatively robust to error misspecifications, but not to a missing covariate, \(t=3\). \(BIC\) and \(CAIC\) again vary wildly between excellent performance for specification types 1, 2, 4 and 5 and poor performance under error misspecification, where their success rates decrease substantially for larger samples. \(MDL2\) appears to be the only criterion that performs well when there is a missing covariate, in the sense that it achieves good performance for large samples, whereas \(KIC\), \(KIC_c\), \(AIC4\) and \(BIC\) achieve good rates in small samples but decreasing success rates as samples grow. It can also be observed that the inclusion of superfluous explanatory variables, a missing interaction term or multicollinearity does not seem very detrimental to the criteria that perform well in correctly specified models, although these specifications do have a large negative impact on \(AIC\) and \(AIC_c\).

Table 8 presents the results of order selection based on the penalized and the unpenalized estimator when the true number of components was \(K^* = 3\). The results are again averaged over the component size equality setting, as the success rates were very similar for both settings. Contrary to the previous cases, however, there is a substantial difference between order selection based on the penalized and on the unpenalized estimator in several experimental settings. These differences can be seen as small sample effects on \(MRC\), \(MRC_k\), \(MDL2\), \(MDL5\) and \(AWE\): these criteria tend to be too conservative here owing to the smaller number of observations per parameter. Hence, their selection performance can be improved by considering spurious optima, as the mean squared errors indicate that the penalized estimator is, on average, better for these sample sizes. There are also substantial differences for the mixture autoregressive model and the missing covariate specification in smaller samples; \(KIC_c\), \(AIC4\), \(CAIC\), \(BIC\), \(ICL\), \(ICL\)-\(BIC\) and \(NEC\) are most affected here. For the last 3 criteria in this group, however, the resulting inference under both estimators is poor for both specification types. Once again \(AIC\) and \(AIC_c\) perform worse in larger samples, but in these settings they generally start from relatively high success rates. The model specification with the lowest success rates is once again the autoregressive model: none of the criteria considered here can be called reliable for selecting the correct order in this specification type, with maximum success rates well below 0.6. The second hardest specification is once again a missing covariate, and only \(CAIC\), \(BIC\) and \(MDL2\) do relatively well here, with respectable success rates for larger samples. Specification types 1, 2, 4 and 5 again behave very similarly. Misspecified errors have a negative impact on most criteria, although several criteria perform well for one or more of these settings. Unfortunately, no clear best criterion emerges here, except perhaps \(MDL2\) if one disregards the autoregressive specification. On average, disregarding specification types 3 and 9, \(CAIC\), \(BIC\) and \(MDL2\) perform best.

The supplementary material contains a figure similar to Fig. 4. The difference is that this figure relates to the maximum log likelihood in a validation sample generated by the true data generating process (the sketch below illustrates this computation). Only cases where the true number of components differed from the out-of-sample optimal model are considered; not surprisingly, these are mostly cases with specification types 3, 6, 7, 8 and 9. An interesting pattern emerges. The criteria which were relatively poor at selecting the true number of components, \(AIC\), \(AIC_c\), \(KIC\), \(KIC_c\), \(AIC4\), \(HQ\) and \(aBIC\), have relatively high success rates here, whereas \(MRC\), \(MRC_k\), \(MDL2\), \(MDL5\), \(CLC\), \(AWE\), \(ICL\) and \(ICL\)-\(BIC\) perform very badly; the latter criteria tend to underfit with respect to prediction. \(CAIC\) and \(BIC\) fall somewhere in between. In conclusion, there appears to be a trade-off between selecting the true number of components and selecting the model which best predicts future samples. Hence, \(AIC\) and its relatives in fact do what they were designed to do. Unfortunately, the success rates are not overwhelming, averaging around 0.4. Perhaps this performance would increase with larger sample sizes, and if Burnham and Anderson (2002) are right that simple models do not exist (i.e. the true model effectively has an infinite number of parameters), the \(AIC\) family of efficient selection criteria would be preferred. However, selecting the correct number of components can also be very important, and we feel it is preferable to remedy misspecification by data transformations or different model specifications rather than by adding components which are not represented in the population.
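A sketch of the out-of-sample comparison, assuming the fitted parameters from the estimation step above (names ours):

import numpy as np
from scipy.stats import norm

def validation_loglik(X_val, y_val, pi, betas, sigmas):
    # Log likelihood of a fitted K-component mixture of regressions in
    # the validation sample; among the candidate models, the one with
    # the highest value is the out-of-sample winner (an estimate of the
    # Kullback-Leibler divergence up to a constant).
    dens = np.column_stack([pi[k] * norm.pdf(y_val, X_val @ betas[k], sigmas[k])
                            for k in range(len(pi))])
    return np.log(dens.sum(axis=1)).sum()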

In order to study the results of the penalized estimator while controlling for the separation between the components, Table 9 presents the odds ratios of the experimental factors for the cases where \(K^*>1\). These odds ratios were calculated from a logistic regression for each order selection method, with correct selection of the true number of components taken as a success (a sketch of this computation is given below). Two of the covariates in the model warrant some clarification. First, the minimum separation between the components is included in the models. In case \(K^*=2\) this is simply the separation between components 1 and 2; in case \(K^*=3\), the minimum of the three pairwise separations is taken, because the components for which the separation is minimal will be the hardest to separate. Second, the covariate ‘max est’ represents the maximum number of components for which a proper solution was found and is treated as a continuous effect. This covariate was included because it limits the possible amount of overfitting. To illustrate the interpretation of the table entries, consider the estimated odds ratio of \(AIC\) with respect to the sample size factor \(n = 300\) vs \(n = 600\). This odds ratio was estimated at 1.27, indicating that the odds of a success, i.e. of selecting the true number of components when using \(AIC\), were on average approximately 1.3 times larger in a sample of size 300 than in a sample of size 600, controlling for the other experimental factors. Table entries marked by a \({\dag }\) are not significantly different from 1 at a significance level of 5 %.
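These logistic regressions can be sketched as follows; the data frame and all of its column names are hypothetical stand-ins for our replication-level results of one criterion:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative random stand-in for the replication-level results of one
# criterion (all column names hypothetical).
rng = np.random.default_rng(0)
m = 2000
df = pd.DataFrame({
    'success':   rng.integers(0, 2, m),
    'n':         rng.choice([100, 300, 600], m),
    'K_true':    rng.choice([2, 3], m),
    'equal_pi':  rng.choice([0, 1], m),
    'spec_type': rng.choice(range(1, 10), m),
    'min_sep':   rng.uniform(1, 4, m),
    'max_est':   rng.choice([3, 4, 5, 6], m)})
res = smf.logit('success ~ C(n) + C(K_true) + C(equal_pi) + C(spec_type)'
                ' + min_sep + max_est', data=df).fit()
odds_ratios = np.exp(res.params)      # the entries reported in Table 9
not_significant = res.pvalues > 0.05  # entries marked with a dagger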

Table 9 Odds ratios of selecting the true number of components for the penalized estimator by logistic regression for \(K^*=2, 3\)

Again it can be seen that several criteria perform worse in larger samples: \(AIC\), \(AIC_c\), \(KIC\), \(KIC_c\), \(AIC4\), \(CAIC\), \(BIC\), \(HQ\) and \(NEC\). It is also noticeable that a small sample size has a very large negative effect on \(aBIC\) (which is counterintuitive), \(MDL2\), \(MDL5\) and \(AWE\). The effect of the true number of components is also very dissimilar across the different selection criteria, and several of the estimated odds ratios are very far from 1. Criteria that tend to overfit appear to perform much better when there are 3 components, whereas criteria that tend to underfit perform better when there are 2 components. Hence, for a fixed sample size, more true parameters to estimate favors the overfitting criteria. Equal versus unequal mixture proportions also has different effects across the methods, but these effects are much smaller than those of the sample size or the number of components. We can conclude that in most cases the criteria which performed better for smaller samples also perform better in case of equal mixture proportions and with 3 true components; conversely, selection methods which perform better for larger samples tend to perform better in case of unequal mixture proportions and with 2 true components. This distinction largely coincides with a criterion's proneness to overfit or underfit, respectively. As expected, the odds of successfully selecting the true number of components increase when the minimum separation increases. The criteria which do not perform well across experimental conditions seem to be less affected by the separation, and for \(AIC4\), \(CAIC\) and \(BIC\), which did perform well in many settings, the effect seems relatively small. Furthermore, the performance of all criteria decreased as the range of models which could be fitted increased, with one exception: \(NEC\), the criterion which showed the highest rate of underfitting by exactly 1 component.

Focusing on the group of order selection criteria which, on average, performed best (\(MDL2\), \(MRC\), \(MRC_k\), \(ICL\)-\(BIC\), \(ICL\), \(CAIC\), \(BIC\) and \(CLC\), in that order), one can see that, controlling for all other factors, the effect of the various model (mis)specifications relative to no misspecification is much larger than it appeared earlier. Including superfluous explanatory variables has a large negative effect on \(MDL2\), and omitting a relevant explanatory variable has by far the largest negative effect on all of these criteria. The effects of excluding a real interaction and of multicollinearity are largely similar within this group, with \(CAIC\), \(BIC\) and \(MDL2\) appearing most robust. Within this group, heteroskedasticity within the components has the largest negative effect of all error misspecifications on \(MRC\), \(MRC_k\), \(ICL\) and \(ICL\)-\(BIC\); these criteria seem relatively robust to heavy-tailed and skewed errors, though. \(MDL2\), \(CAIC\) and \(BIC\), on the other hand, appear most affected by skewed errors, and the latter two criteria are much less robust when the errors are misspecified. Curiously enough, \(MDL5\) and \(AWE\) actually performed better for heavier tailed or skewed error specifications than under no misspecification, which would indicate that such misspecifications counter their tendency to underfit caused by their large penalty terms; on the other hand, these criteria were heavily affected by heteroskedastic errors. Finally, the autoregressive specification appears to have the second largest negative effect on all these criteria, with \(MRC\), \(MRC_k\), \(ICL\) and \(ICL\)-\(BIC\) least affected.

5 Conclusion

Order selection in finite mixture models is not a simple problem, and our simulation study confirms this: different experimental settings influence the order selection criteria differently. Nevertheless, some clear results emerge for the criteria that seek to select the true number of components rather than to minimize the expected prediction error. For order selection, the newly developed mixture criteria (\(MRC\) and \(MRC_k\)) perform rather well on many occasions, especially for larger samples; similar things can be said about \(MDL2\), \(CLC\), \(ICL\) and \(ICL\)-\(BIC\). For the traditional model selection criteria, \(AIC\) and \(BIC\), we can conclude that they often have lower success rates as the sample size increases. \(BIC\), however, shows good performance, especially in smaller samples and for model specification types which are not very detrimental to the overall success rates, whereas \(AIC\) still tends to overfit there. \(CAIC\) behaves quite similarly to \(BIC\) and also performs well in many settings. Based on our findings, we would therefore recommend selection methods which have been specifically derived for finite mixture models, or the lesser known \(MDL2\), in larger samples, and \(BIC\) or \(CAIC\) in smaller samples.

Furthermore, there is some evidence that including irrelevant explanatory variables, excluding interaction effects or multicollinearity are not very detrimental to order selection, provided one chooses a suitable criterion. Omitting an important explanatory variable, on the other hand, has a substantial negative effect on all criteria. We also found that distributional misspecification of the error terms has a non-uniform negative effect on the selection criteria. Several criteria proved nearly flawless at rejecting additional components in a one-component autoregressive model; in multiple-component autoregressive models, however, one is better off using one of the better overfitting criteria (\(AIC4\) or \(KIC_c\)), or \(CAIC\) or \(BIC\). In conclusion, none of the selection criteria was robust to every sort of (mis)specification we tested. A limitation of our simulation is that all (mis)specifications were present in all components; furthermore, we only tested one particular ‘amount’ of misspecification each time, rather than a range of mild to severe misspecifications. Both settings could be interesting avenues for further research.

There is one nearly constant, positive effect present in all our results: it pays to penalize. Appropriately penalizing the likelihood resulted in fewer spurious solutions, which had a positive effect on the estimation error of the model parameters and on the performance of the order selection criteria. Hence, a useful future research direction could be to further investigate the choice of a penalty function, the choice of its tuning parameters and its (finite sample) properties for normal mixture models, in order to avoid spurious solutions.