1 Introduction

1.1 Aim and Scope

In multiple applications the collected data may be best described by a multimodal probability mass or density function, meaning that the empirical distribution contains several regions with high probability mass. Mixture models are a powerful mathematical tool that allows for characterizing such heterogeneous populations, which are believed to consist of multiple homogeneous subpopulations. A great multitude of statistical problems can be cast into the mixture model framework: linear inverse and deconvolution problems, random effects models, repeated measures and measurement error models, empirical and hierarchical Bayes, latent class and latent trait models, clustering, robustness and contamination models, hidden mixture structures, random coefficient regression models and many others [43]. A very important class of mixture models is the class of finite mixtures. These models, which assume a finite number of components, have proved to be very useful and flexible enough to model a vast range of random phenomena, thus receiving much attention from both theoretical and practical viewpoints.

In some mixture model applications there is no uncertainty about the number of components in the mixture. This is the case where the components correspond to a well-known existing partition of the population. However, on many occasions this situation is far from realistic and practitioners encounter either scarce or completely absent a priori information about the actual number of mixture components. In such cases, this number has to be inferred from the data along with the parameters of the component densities. Correct identification of mixture complexity may be of primary interest in itself or may be followed by efficient estimation of all parameters. Due to its practical importance, the problem of selecting the optimal mixture complexity has been addressed in numerous statistical publications, and we will point out many seminal works as we proceed.

The objectives of the present survey are to

  • provide the theoretical background of the reviewed methods for estimating the complexity of a finite mixture;

  • assess the performance of these methods under various scenarios;

  • suggest modifications that enhance the performance of some of the methods in particular settings;

  • identify universal methods that provide stable and accurate results throughout most of the scenarios for different distribution families or single out scenarios under which certain approaches may be preferred to others.

The number of methods devoted to estimating the true number of components in a mixture is undoubtedly too large to be thoroughly described in a single survey. Thus, we restrict attention to a selected subset of approaches that have the merit of being applicable in very general settings; i.e., for wide classes of finite mixtures of distributions. One of the main goals of this survey is to uncover the extent to which each of the methods is successful in consistently estimating the true complexity for various sample sizes. The estimation techniques reviewed below can be split into three main categories:

  1. methods built upon the determinants of the Hankel matrix of moments of the mixing distribution;

  2. methods based on penalized minimum distance between the unknown probability density and a consistent estimator thereof; the distances considered in this survey are the Hellinger as well as the \(L_2\)-distance;

  3. likelihood ratio test (LRT)-based techniques.

Some of the key criteria on which we based our choice of techniques were:

  a) a cohesive mathematical theory behind the method, including asymptotic consistency;

  b) infrequent reference in the literature as well as relatively rare usage in practice despite the coherent theoretical basis;

  c) feasibility of implementation using any programming language, e.g. Python, R, Julia, Matlab, etc.

Pseudocodes for the algorithms discussed in this work are given in Appendix D in the supplementary materials. For completeness, other interesting methods for estimating the complexity of a finite mixture are mentioned in Sect. 8.

Although not strictly part of a survey, performance enhancements such as the ones we bring to some of the methods through specific modifications are almost inevitable. In fact, the original versions of some of the approaches reviewed here are of limited practical value without further adjustment. These modifications, which will be described in separate subsections, include resorting to some judicious scaling in the case of the Hankel-based methods or using the bootstrap instead of penalization for the approaches based on minimum distance estimation. In Sect. 6, we report the results of an extensive numerical study which we carried out for different mixture distributions and various numbers of components with the goal of comparing the performance of the techniques reviewed in this survey.

Several examples involving the estimation of mixture complexity for real data sets using the discussed methods are presented. The data sets were taken from various fields such as geology, insurance and lexicography.

1.2 Organization of the Paper

The paper is organized as follows:

  • Section 2 provides some basic background on mixture models, mentions major works on mixture model estimation techniques and gives a brief overview of these approaches.

  • Section 3 outlines the theoretical foundation of the original method based on the determinants of the Hankel matrix of moments of the mixing distribution as proposed in [21]. In the same section, two modifications of this approach that yield improved results are presented. The section also gives a concise description of a neural network extension of the Hankel matrix approach, proposing possible working configurations of a multilayer perceptron for mixtures of Gaussian, Poisson and geometric densities.

  • Section 4 describes methods based on minimum distance estimation. The section relies to a very large extent on the works [69, 74] and [20]. We re-examine the estimation techniques that use the Hellinger and the \(L_2\) distances when combined with two different penalties. In the same section, motivated by the idea of enhancing the original method, we propose a modification based on a bootstrap procedure instead of penalization.

  • Section 5 presents the estimation approach based on the LRT combined with a bootstrap procedure as described in [38].

  • Section 6 comprises the results of a comparative numerical study where all of the above-mentioned techniques are tested on simulated data under various scenarios. Furthermore, the same section contains a discussion of the settings in which certain methods can be favored as they seem to outperform their counterparts.

  • Section 7 encompasses several real data sets that were analysed using the studied approaches and compares the obtained results.

  • Section 8 mentions a number of papers where other techniques for mixture complexity estimation, not addressed in the present survey, are considered.

  • Section 9 summarizes the findings and outlines the limitations of all methods reviewed in this survey.

  • Appendices A, B, C and D, presented as supplementary materials to this paper, include additional examples, tables with detailed simulation results, proofs of the theoretical results that are relevant for the methods described in the manuscript and pseudocodes clarifying and simplifying the implementation of the discussed techniques.

2 Finite Mixture Models: General Scope

2.1 Notation and Basic Definitions

We start with defining the terminology that will be used throughout the survey. In the sequel, a real vector of dimension r will be denoted \(\varvec{v}_r\) and its components by \(v_1, \ldots , v_r\). A class of real vectors of dimension r will also bear the subscript r in its notation. When manipulating several vectors of dimension r we will index them as \(\varvec{v}_{r,1}, \varvec{v}_{r, 2}, \ldots \). A random sample of i.i.d. random variables will be denoted \((X_1, \ldots , X_n)\). A sequence of random variables (for example, one converging weakly) will be denoted by \(Y^{(n)}\). A class of densities which depend on some vector of parameters of dimension r will not necessarily bear the subscript r.

Suppose that some population of interest, represented abstractly by a random variable X, consists of a finite number \(m \in {\mathbb {N}} \) of subpopulations. Each subpopulation is generated by some random process that can be modeled by an individual or component distribution, e.g. normal, exponential, Poisson, geometric, etc. We will assume that each of the component distributions admits a density with respect to some common dominating measure \(\mu \). Furthermore, the component density is assumed to be parametrized through some unknown vector \(\varvec{\phi }_d \in \varvec{\Phi } \subseteq {\mathbb {R}}^d, \; d\ge 1\). To keep the manuscript to a reasonable length, we confine our attention in this survey to the one-dimensional case; i.e. the random variable \(X \in {\mathbb {R}}\). In the sequel, the dominating measure \(\mu \) is either the Lebesgue measure in case the distribution of the components is absolutely continuous, or the counting measure in case this distribution is discrete. In the latter case, all the examples considered here treat distributions that are supported on the set of non-negative integers. Let \({\mathcal {F}} = \big \{f_{\varvec{\phi }_d}: \varvec{\phi }_d \in \varvec{\Phi } \big \}\) be the family of densities to which the components belong. If \({\mathcal {X}}\) is the support of X, then X is said to have an m-component mixture distribution with density

$$\begin{aligned} f_{\varvec{\theta }_{p_m}}(x) = \int _{\Phi } f_{\varvec{\phi }_d}(x) dG(\varvec{\phi }_d) = \sum _{j=1}^{m} \pi _j f_{\varvec{\phi }_{d,j}}(x) \end{aligned}$$
(2.1)

for all \(x \in {\mathcal {X}}\), where

$$\begin{aligned}&\varvec{\theta }_{p_m} = (\pi _1,\dots ,\pi _{m},\varvec{\phi }_{d,1}^T,\dots ,\varvec{\phi }_{d,m}^T)^T \in {\mathcal {S}}_{m-1} \times \varvec{\Phi }^m: = \varvec{\Theta }_{p_m}, \nonumber \\&{\mathcal {S}}_{m-1} = \big \{ (\pi _1,\ldots , \pi _m)^T \in [0, 1]^m: \sum _{j=1}^m \pi _j =1 \big \}. \end{aligned}$$
(2.2)

\({\mathcal {S}}_{m-1}\) is the \((m-1)\)-dimensional simplex and \(\varvec{\Phi }^m\) is the Cartesian product \(\{(\varvec{\phi }_{d,1}, \ldots , \varvec{\phi }_{d,m})^T: \varvec{\phi }_{d,i} \in \varvec{\Phi }, i =1, \ldots , m \}\) with \(p_m = md + m-1\).

Above, G is a discrete distribution defined on \(\Phi \) with at most m jump points at \(\varvec{\phi }_{d,1}, \ldots , \varvec{\phi }_{d,m}\), and the integral representation in (2.1) is given here only to draw attention to the fact that finite mixtures are part of a much bigger family of mixtures where G can be any distribution function, known often under the name of “mixing distribution”. In the sequel, we will refer to either the probability density or probability mass function defined in (2.1) as the mixed density and to \(\pi _j, j =1, \ldots , m \) as the mixing probabilities.
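For concreteness, the mixed density in (2.1) can be evaluated and sampled with a few lines of code; the following is a minimal Python sketch for the Gaussian case with hypothetical parameter values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 3-component Gaussian mixture: mixing probabilities and component parameters
pi = np.array([0.3, 0.4, 0.3])       # mixing probabilities, sum to 1 (element of S_{m-1})
mu = np.array([10.0, 13.0, 17.0])    # component means
sigma = np.array([1.0, 1.0, 1.0])    # component standard deviations

def mixture_pdf(x, pi, mu, sigma):
    """Evaluate the mixed density sum_j pi_j f_{phi_j}(x) of (2.1)."""
    x = np.atleast_1d(x)
    comp = norm.pdf(x[:, None], loc=mu, scale=sigma)  # n x m matrix of component densities
    return comp @ pi

def mixture_rvs(n, pi, mu, sigma, seed=0):
    """Draw n observations: pick a component label first, then draw from that component."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(pi), size=n, p=pi)        # latent component memberships
    return rng.normal(loc=mu[labels], scale=sigma[labels])

x_sample = mixture_rvs(1000, pi, mu, sigma)
print(mixture_pdf([9.0, 13.0], pi, mu, sigma))
```

Sampling first draws the latent component label and then draws from the corresponding component, which mirrors the latent-variable view used by the EM algorithm in Sect. 2.3.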

We define the family of m-component mixture densities as the set

$$\begin{aligned} {\mathcal {F}}_m= & {} \Big \{ \pi _1 f_{\varvec{\phi }_{d,1}} + \ldots + \pi _m f_{\varvec{\phi }_{d,m}}, (\pi _1, \ldots , \pi _m)^T \in {\mathcal {S}}_{m-1}, (f_{\varvec{\phi }_{d,1}}, \ldots , f_{\varvec{\phi }_{d,m}}) \in {\mathcal {F}}^m \Big \} \nonumber \\= & {} \Big \{f_{\varvec{\theta }_{p_m}}: \varvec{\theta }_{p_m} \in \varvec{\Theta }_{p_m} \Big \}, \end{aligned}$$

where \(f_{\varvec{\theta }_{p_m}}\) is given by (2.1).

Suppose we observe n random variables \(X_1, \dots , X_n \in {\mathcal {X}}\) which are i.i.d. with unknown density \(f_0 \in \bigcup _{m\ge 1}{\mathcal {F}}_m\). What is the value of m that can be assigned to this density based on the observed data? It is clear that such a value needs to target the most parsimonious representation of the mixture. Estimation of the true complexity of \(f_0\) cannot be presented without touching upon this point, which is discussed in the next section.

2.2 Identifiability and Complexity of Mixture Models

We will now touch upon the identifiability issues arising within the mixture distributions framework, which is a crucial point when the aim is to estimate the true complexity. Identifiability of general mixtures of some additively closed family of distributions was proved in the pioneering work of Teicher [65], who recognized the importance of settling the issue of identifiability before launching into estimation of the mixing distribution. Several articles have been devoted to proving identifiability of finite mixtures of some particular classes of distributions such as finite mixtures of normal or gamma distributions; see e.g. [66]. For identifiability results in other classes or review papers on the subject we refer to [19, 35, 36, 43, 49, 68].

The identifiability of a mixture is defined as follows: a finite mixture with respect to the family \({\mathcal {F}}\) is said to be identifiable if for any \(m \ge 1\) and any two elements \(f_{\varvec{\theta }_{p_m}}\) and \(f_{\varvec{\theta }'_{p_m}}\) in \({\mathcal {F}}_m\) satisfying the equality

$$\begin{aligned} f_{\varvec{\theta }_{p_m}} (x) = f_{\varvec{\theta }'_{p_m}}(x), \ \ x \in {\mathcal {X}} \end{aligned}$$

there exists a permutation \( \sigma : \{1, \ldots , m \} \mapsto \{1, \ldots , m \}\) such that the components of \(\varvec{\theta }_{p_m}\) and \(\varvec{\theta }'_{p_m}\) are equal up to the permutation \(\sigma \); i.e., \(\pi _{\sigma (i)} = \pi '_{i}, \ \varvec{\phi }_{d, \sigma (i)} = \varvec{\phi }'_{d, i}\) for \(i=1, \ldots , m\).

Different techniques have been developed to show identifiability. One of the most important results is the one shown in [76], which says that the characterizing condition of identifiability is linear independence of the family \({\mathcal {F}}\). Other characterizations or sufficient conditions can be built upon this result by resorting, for example, to additional properties of the elements of \({\mathcal {F}}\) or to Fourier/Laplace transforms (see [5, 36, 39]).

When identifiability holds, it is natural to think of the most economic representation of the finite mixture under study. Indeed, we have the inclusions

$$\begin{aligned} {\mathcal {F}}_m \subset {\mathcal {F}}_{m+1} \end{aligned}$$
(2.3)

for all \(m \ge 1\), and hence we can introduce the following definition: the index of economical representation for some finite mixture density \(f \in \bigcup _{m \ge 1} {\mathcal {F}}_m\) is defined as

$$\begin{aligned} m(f)=\min \big \{ m \in {\mathbb {N}}: f \in {\mathcal {F}}_m \big \}. \end{aligned}$$

This index is exactly what is called the complexity (or order). Note that this number has to be unique, an immediate consequence of identifiability. Also, from a practical point of view, m(f) corresponds to the number of all the components that are actually part of the total population: all the mixing probabilities \(\pi _j, j \in \{1, \ldots , m(f)\}\) should satisfy \(\pi _j > 0\) by the very definition of m(f). The term identifiability is used here with some abuse as the components of \(\varvec{\theta }_{p_{m(f)}}\) are unique up to some permutation (whereas the mixed density is invariant under the m! permutations of the component labels). One can of course require for example that the mixing probabilities are labeled so that \(\pi _1< \dots < \pi _m\) in case they are all different. We will be following this convention when reporting the simulation results in Sect. 6.

The discussion above lays the ground for this survey. In the sequel, we shall assume that the identifiability assumption holds. Also, the notation \(m(f_0) = m_0\) will be used, where \(f_0\) is the unknown density in \({\mathcal {F}}_{m_0}\) from which we observe a random sample. The true complexity or order, \(m_0\), as well as the true parameter vector

$$\begin{aligned} \varvec{\theta }_{0}:=\varvec{\theta }_{p_{m_0}} \in \varvec{\Theta }_{p_{m_0}} \end{aligned}$$

will be assumed to be unknown. The main goal of the methods reviewed further is to consistently estimate \(m_0\). An estimation procedure can be (but does not necessarily have to be) accompanied by the estimation of \(\varvec{\theta }_0\).

2.3 Popular Approaches to Mixture Model Estimation

Mixture model estimation has a long history. The early mixture model estimation techniques date back to the end of the 19th century, when S. Newcomb [52] suggested an iterative reweighting scheme to compute the Maximum Likelihood (ML) estimator of the common mean of a mixture, in known proportions, of a finite number of univariate normal populations with known variances. This scheme is regarded by many as a precursor of the well-known Expectation-Maximization (EM) algorithm.

A few years later K. Pearson [56] described an analytical and a graphical solution to estimating the first five moments of an asymmetrical empirical distribution, which he was aiming to break up into two univariate normal curves. Graphical solutions for mixture model estimation remained a focus of attention until the second half of the 20th century ([14, 34, 58]).

Between 1912 and 1922 R. Fisher [29] attempted to popularize the ML approach to fitting mixtures. The evolution of the ML approach is considered in detail in [3]. In particular, Fisher analyzed extensions of the method of moments toward the likelihood equations as a way of increasing the quality of the estimates, which later caused a dispute with Pearson ([28, 57]). Around the 1950s C. R. Rao [59] used Fisher’s scoring method to estimate the parameters of a mixture of two Gaussian distributions with common variance, and soon after, ML estimation for identifying the number of components as well as for parameter estimation in finite mixture models was addressed in numerous publications, such as [22, 72, 73].

These days the most well-studied and widely used approach to computing ML estimates for finite mixture models, as defined in (2.1), is the EM algorithm, elaborately described in [23], the seminal work that greatly stimulated the efficient use of mixture models. The EM algorithm is implemented by assuming that there are latent variables that link every observation to one of the components, which, together with the observed data, yield the complete data.

We will summarize the main idea behind this algorithm. To that end consider two sample spaces within the mixture model framework:

  1. the sample space of the incomplete observations, where only the realizations of the random variable X are observed, but no information on the mixing distribution \(G(\varvec{\phi }_d)\) is available;

  2. the sample space of the complete observations, the estimation of which can be performed explicitly.

For the sake of simplicity consider the one-dimensional case, \(\Phi \subseteq {\mathbb {R}}\). In this case we denote \(\varvec{\phi }_d\) simply by \(\phi \). The extension to the multidimensional case is possible but complicates the derivations.

Let \(x = (x_1,\dots ,x_n)\) be the observed realizations of the random variable X, and let \(z = ({\mathbf {z}}_1,\dots ,{\mathbf {z}}_n)\) denote the realizations of the corresponding unobserved (or latent) random vectors indicating which of the components \(j=1,\dots ,m\) the observation \(x_i, i=1,\dots ,n\), comes from. In other words, \({\mathbf {z}}_i, i =1, \ldots , n \) are realizations of a multinomial distribution with probabilities \(\pi _1, \ldots , \pi _m\), and we have that

$$\begin{aligned} z_{ij} = {\left\{ \begin{array}{ll} 1, \text { if } x_i \in j^{th} \text { component} \\ 0, \text { otherwise}. \end{array}\right. } \end{aligned}$$

The pairs \(\varvec{y}_i = (x_i, \varvec{z}_i)\), for \(i=1,\dots , n\) are i.i.d. and they are usually referred to as the complete or augmented data. Let \(\varvec{y} = (\varvec{y}_1, \ldots , \varvec{y}_n)\). For a stipulated mixture complexity \(m \in {\mathbb {N}}\), let us denote by \(l^c_{\varvec{\theta }_{p_m}}\) the log-likelihood of the complete data; i.e.,

$$\begin{aligned} l^c_{\varvec{\theta }_{p_m}}(\varvec{y})= & {} \sum _{i=1}^{n} \sum _{j=1}^{m} z_{ij} \log \big ( \pi _{j} f_{\phi _{j}}(x_i) \big ) \\= & {} \sum _{i=1}^n \sum _{j=1}^m z_{ij} \log (f_{ \phi _{j}}(x_i)) + \sum _{j=1}^m \log (\pi _j) \sum _{i=1}^n z_{ij}. \end{aligned}$$

On the other hand, the log-likelihood of the observed data x is given by

$$\begin{aligned} l_{\varvec{\theta }_{p_m}}(x)= \sum _{i=1}^n \log \Big ( \sum _{j=1}^m \pi _j f_{\phi _{j}}(x_i)\Big ). \end{aligned}$$

It can be shown that the MLE

$$\begin{aligned} \hat{\varvec{\theta }}_{p_m} = {{\,\mathrm{arg\,max}\,}}_{\varvec{\theta }_{p_m} \in \varvec{\Theta }_{p_m}} l_{\varvec{\theta }_{p_m}}(x). \end{aligned}$$
(2.4)

can be obtained by alternating between an expectation step and a maximization step involving the complete log-likelihood \(l^c_{\varvec{\theta }_{p_m}}\). This is precisely what the well-known EM-algorithm does. In the first step, the conditional expectation of \(l^c_{\varvec{\theta }_{p_m}}(\varvec{y})\) given the observed data x is computed under the current parameter. Then, the obtained expression is maximized over the parameter space and the maximizer becomes the new parameter. These two steps are repeated until convergence. If s denotes the current iteration, then it is easy to show that the E-step is completed by computing the conditional expectation of the multinomial vectors \(\varvec{z}_i\) given the observed data x. This yields for \(i=1, \ldots , n\) and \(j=1, \ldots , m\)

$$\begin{aligned} {\hat{z}}^{(s)}_{ij} = \frac{{\hat{\pi }}^{(s-1)}_j f_{{\hat{\phi }}^{(s-1)}_j}(x_i)}{\sum _{l=1}^m {\hat{\pi }}_l^{(s-1)} f_{{\hat{\phi }}^{(s-1)}_l}(x_i)} \end{aligned}$$

where \(({\hat{\pi }}^{(s-1)}_1, \ldots , {\hat{\pi }}^{(s-1)}_m, {\hat{\phi }}^{(s-1)}_1, \ldots , {\hat{\phi }}^{(s-1)}_m)\) is the estimate obtained at the \((s-1)\)-th step. Note that the maximizing mixing probabilities are easily obtained and are explicitly given in the s-th M-step by the expression

$$\begin{aligned} {\hat{\pi }}_j^{(s)} = \frac{1}{n} \sum _{i=1}^n {\hat{z}}_{ij}^{(s)}, \end{aligned}$$

for \(j=1, \dots , m\). To obtain \({\hat{\phi }}^{(s)}_j, j=1, \ldots , m\), a numerical method might be required in case a closed form is not available. The optimization procedure then seeks to find at least a local maximum, as finding the global maximum is not always possible. As noted in [50], the latter often occurs in the case of Gaussian mixtures with non-homogeneous dispersions (unequal covariance matrices). Components that contain a single observation, several identical observations or several nearly identical observations result in estimated covariance matrices that are singular or nearly singular, which causes the likelihood function to be unbounded. Gaussian mixtures with homogeneous components result in covariance matrices that are restricted in the parameter space and thus do not have this problem. For references on the EM-algorithm, see e.g. [23] and [48].

The description given above treats one given m, a candidate for the true mixture complexity. To obtain an estimator for \(m_0\), the true complexity, one can resort to maximizing a penalized version of the observed log-likelihood. This means that the log-likelihood will be augmented by a penalty term depending on the model complexity. Several widely used examples of this technique include the Akaike Information Criterion (AIC) [2], Bayes Information Criterion (BIC) [62], Integrated Completed Likelihood (ICL) [10], Laplace-Empirical Criterion (LEC) [49], Normalized Entropy Criterion (NEC) [9] and many others [50]. These only differ in the form of the penalty function, and we will concentrate on the two criteria that have gained most popularity in practice: the Bayesian Information Criterion (BIC) and the Integrated Completed Likelihood (ICL). While BIC is most widely used for performing model selection tasks, ICL is most frequently applied for solving clustering problems.

The general idea is to treat the task of choosing the number of components in the mixture as a model selection problem by considering a sequence of models \({\mathcal {M}}_1, \dots , {\mathcal {M}}_M\) for \(m=1,\dots ,M\) with associated prior probabilities \(p({\mathcal {M}}_m)\), which are often taken to be equal. By Bayes’ Theorem, the posterior probability of model \({\mathcal {M}}_m\), given the observed data \(\varvec{x}\), is proportional to the probability of the data given the model multiplied by the model’s prior. Under regularity assumptions, it can be shown that twice the logarithm of the posterior probability of the mixture model with m components can be well approximated by

$$\begin{aligned} \text {BIC}_{m} = 2 l_{\hat{\varvec{\theta }}_{p_m}}(\varvec{x}) - \nu _{{\mathcal {M}}_m} \log n \end{aligned}$$

where \(\nu _{{\mathcal {M}}_m} = p_m\) is the number of independent parameters in the model and \( \hat{\varvec{\theta }}_{p_m}\) is the MLE of \(\varvec{\theta }_{p_m}\). The true complexity is then estimated by finding the integer m which maximizes \(\text {BIC}_{m}\).

Given the discussion above, finding the number of components in the mixture that maximizes \(m \mapsto \text {BIC}_{m}\) is equivalent to choosing the mixture model with the greatest a posteriori probability. Some of the advantages of the BIC approach are that it is easy to implement, can be used for comparing non-nested models and was shown to be consistent for choosing the correct number of components in [40].
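As a rough illustration of the BIC-based selection, the following Python sketch uses scikit-learn's GaussianMixture for univariate Gaussian mixtures; note that scikit-learn's bic() returns \(-2 l_{\hat{\varvec{\theta }}_{p_m}}(\varvec{x}) + p_m \log n\), i.e. the negative of \(\text {BIC}_{m}\) above, so it is minimized rather than maximized.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_m_by_bic(x, m_max=8):
    """Fit univariate Gaussian mixtures for m = 1, ..., m_max and pick the BIC-optimal order.

    scikit-learn's .bic() returns -2*loglik + p_m*log(n), the negative of BIC_m above,
    so the best model is the one *minimizing* it.
    """
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    bics = [GaussianMixture(n_components=m, n_init=5, random_state=0).fit(X).bic(X)
            for m in range(1, m_max + 1)]
    return int(np.argmin(bics)) + 1, bics

m_hat, bics = select_m_by_bic(x_sample)   # x_sample: 1-d array of observations
```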

The ICL approach uses the log-likelihood of the complete data and replaces the unobserved labels \(z_{ij}, 1 \le i \le n, 1 \le j \le m\) by their maximum a posteriori (MAP) estimator, that is

$$\begin{aligned} {\hat{z}}^*_{ij} = {\left\{ \begin{array}{ll} 1, \text { if } j = {{\,\mathrm{arg\,max}\,}}_{1 \le k \le m} {\hat{z}}_{ik} \\ 0, \text { otherwise}. \end{array}\right. } \end{aligned}$$

Thus, for the mixture model with m components

$$\begin{aligned} \text {ICL}_{m} = 2 l^c_{\hat{\varvec{\theta }}_{p_m}}\big (\varvec{x}, \varvec{z}^*\big ) - p_m \log n. \end{aligned}$$

A very useful relationship between \(\text {BIC}_{m}\) and \(\text {ICL}_{m}\) can be shown:

$$\begin{aligned} \text {ICL}_m = \text {BIC}_m+\sum _{i=1}^n \sum _{j=1}^m {\hat{z}}_{ij} \log {\hat{z}}_{ij}. \end{aligned}$$
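Continuing the hypothetical scikit-learn sketch above, \(\text {ICL}_{m}\) can be obtained from \(\text {BIC}_{m}\) and the fitted responsibilities \({\hat{z}}_{ij}\) via this relationship; the small constant only guards against \(\log 0\).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gm, X, eps=1e-300):
    """ICL_m = BIC_m + sum_{i,j} z_hat_ij * log z_hat_ij (the entropy term is <= 0).

    BIC_m here follows the paper's sign convention (larger is better), hence -gm.bic(X).
    """
    z_hat = gm.predict_proba(X)                    # n x m matrix of responsibilities
    return -gm.bic(X) + np.sum(z_hat * np.log(z_hat + eps))

X = np.asarray(x_sample).reshape(-1, 1)            # x_sample: 1-d array of observations
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(icl(gm, X))
```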

It has been shown in [31] that in some cases (e.g. for mixtures of Gaussians) evaluating the likelihood at the maximum a posteriori (MAP) estimator instead of the MLE helps the EM algorithm to avoid singularities or degeneracies.

Regularization and variable selection techniques have also found their application in this setting. For example, [55] proposed an estimation technique for Gaussian mixtures in the context of a clustering problem, where the likelihood function is augmented by an \(L_1\)-norm penalty term \(-\lambda \sum _{j=1}^{m} \sum _{k=1}^p |\mu _{jk}|\), where \(\mu _{jk}\) is the k-th coordinate of the j-th mean vector, and derived a modification of an EM algorithm fitted for the purpose. The \(L_1\) penalty can shrink some of the fitted means toward 0, thus leading to the most parsimonious model.

Example 1: EM solution for the mixture of Gaussian distributions. For a finite mixture of univariate Gaussian distributions with the parameter vector \(\varvec{\theta } = \big ( \pi _1, \dots , \pi _m, (\mu _1, \sigma _1), \dots , (\mu _m, \sigma _m)\big )\) and the mixture density given by

$$\begin{aligned} f_{\varvec{\theta }}(x) = \sum _{j=1}^{m} \pi _j \frac{1}{\sqrt{2 \pi } \sigma _j} \exp \Big (-\frac{1}{2}\Big ( \frac{x-\mu _j}{\sigma _j}\Big )^2 \Big ), \end{aligned}$$

the E-step at the s-th iteration will update the probabilities given the current parameter vector \(\varvec{\theta }^{(s-1)}\)

$$\begin{aligned} {\hat{z}}_{ij}^{(s)} = \frac{{\hat{\pi }}_j^{(s-1)}\frac{1}{\sqrt{2 \pi } {\hat{\sigma }}_j^{(s-1)}} \exp \Big (-\frac{1}{2}\Big ( \frac{x_i-{\hat{\mu }}_j^{(s-1)}}{{\hat{\sigma }}_j^{(s-1)}}\Big )^2 \Big )}{\sum _{j'=1}^m {\hat{\pi }}_{j'}^{(s-1)}\frac{1}{\sqrt{2 \pi } {\hat{\sigma }}_{j'}^{(s-1)}} \exp \Big (-\frac{1}{2}\Big ( \frac{x_i-{\hat{\mu }}_{j'}^{(s-1)}}{{\hat{\sigma }}_{j'}^{(s-1)}}\Big )^2 \Big )}. \end{aligned}$$

The M-step provides the following solutions:

$$\begin{aligned} {\hat{\pi }}_j^{(s)}=\frac{\sum _{i=1}^n{\hat{z}}_{ij}^{(s)}}{n}, \quad {\hat{\mu }}_j^{(s)}=\frac{\sum _{i=1}^n{\hat{z}}_{ij}^{(s)} x_i}{\sum _{i=1}^n{\hat{z}}_{ij}^{(s)}}, \quad \big ({\hat{\sigma }}_j^{(s)}\big )^2=\frac{\sum _{i=1}^n{\hat{z}}_{ij}^{(s)} (x_i-{\hat{\mu }}_j^{(s)})^2}{\sum _{i=1}^n{\hat{z}}_{ij}^{(s)}}. \end{aligned}$$
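The E- and M-steps of Example 1 translate directly into code; the following is a minimal Python/NumPy sketch with a crude random initialization and none of the safeguards against degenerate components discussed above.

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, m, n_iter=200, tol=1e-8, seed=0):
    """EM for a univariate m-component Gaussian mixture, following Example 1.

    Returns estimated mixing probabilities, means and standard deviations.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.size
    # crude initialization: equal weights, random means from the data, common spread
    pi = np.full(m, 1.0 / m)
    mu = rng.choice(x, size=m, replace=False)
    sigma = np.full(m, x.std())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities z_hat_ij (posterior component probabilities)
        dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # n x m
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates of pi, mu, sigma
        nk = z_hat.sum(axis=0)
        pi = nk / n
        mu = (z_hat * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((z_hat * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        # observed-data log-likelihood at the parameters used in this E-step
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma
```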

Further examples (for the mixtures of geometric and Poisson distributions) can be found in Appendix A in the supplementary materials.

Concluding this section, it is necessary to point out that a great amount of research has been carried out in this area, and multiple software applications have been developed for working with mixture models, in particular with the Gaussian mixture models that are most frequently used in practice. Most of the software is suited for model-based classification and, in particular, offers the possibility of finding the ML estimates via the EM algorithm. We refer interested practitioners to the R packages Mclust [63] and mixtools [7] or the Matlab package MIXMOD [11].

3 Methods Based on the Hankel Matrices

The method of moments is generally considered to be less efficient when compared to maximum likelihood. Nonetheless, as justly argued in [46], there are situations where the method of moments reveals a nice mathematical structure. This is the case for the problem of estimating the true complexity of some finite mixture. As we will see below, the number of support points of a discrete mixing distribution with a finite number of jumps can be elegantly linked to whether the determinant of a special matrix of moments is equal to zero. Such a matrix is known under the name of a Hankel matrix.

We devote this section entirely to the estimation approaches based on the determinants of Hankel matrices of moments of the mixing distribution. The original method, with which we will start, was proposed in [21]. In addition to the original approach, we will describe a couple of its possible extensions. [21] motivated the method with a number of appealing features:

  1. it gives consistent estimators under some mild conditions;

  2. it requires no a priori upper bound on the unknown order of the mixture;

  3. it comes with low computational time as it does not involve estimation of the mixture parameters.

Another attractive property, not mentioned by the authors, is that the method bears a universal character and can be applied to continuous as well as discrete distributions with no modifications, provided that the moment generating function of the distribution exists.

For the reader to be able to appreciate the elegant argument standing behind the method, we shall recall next the key theoretical results furnishing its basis.

3.1 The Main Theoretical Results and Basic Approach

Recall that we have confined the present study to the one-dimensional case, where \({\Phi } \subseteq {\mathbb {R}}\). For a given integer \(m \ge 1\) define the set

$$\begin{aligned} {\mathcal {C}}_{2m}= & {} \Big \{ (c_1,\dots ,c_{2 m})^T \in {\mathbb {R}}^{2m}: \exists \ \text {some distribution function }G\text { on }\Phi \text { such that} \\&\ \ c_j = \int _{\Phi } \phi ^j \,dG(\phi ) \ \ \text {for }j \in \{1,\ldots , 2m\} \Big \}. \end{aligned}$$

In other words, the component \(c_j\) is equal to the j-th moment of some distribution function G. For convenience, we will write \(\varvec{c}_{2m} = (c_1,\dots ,c_{2 m})^T\) for any given real numbers \(c_j, j =1, \ldots , 2m\). In [21] this set is defined more generally with a non-negative measure G.

For \(\varvec{c}_{2m} \in {\mathbb {R}}^{2m}\), the Hankel matrix associated with this vector is the \((m +1) \times (m+1)\) real symmetric matrix, denoted \(H(\varvec{c}_{2m})\) and given by

$$\begin{aligned}{}[H(\varvec{c}_{2m})]_{i,j}=c_{i+j-2}, \quad 1 \le i,j \le m+1, \end{aligned}$$

with \([H(\varvec{c}_{2m})]_{1,1} = c_0 =1\). More explicitly, we have that

$$\begin{aligned} H(\varvec{c}_{2m}) = \begin{pmatrix} 1 &{} c_{1} &{} c_{2} &{} \dots &{} c_{m} \\ c_{1} &{} c_{2} &{} &{} &{} \\ c_{2} &{} &{} &{} &{} \vdots \\ \vdots &{} &{} &{} \ddots &{} \\ c_{m} &{} &{} \dots &{} &{} c_{2m} \end{pmatrix}. \end{aligned}$$
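Building \(H(\varvec{c}_{2m})\) from a vector of (estimated) moments is straightforward; a minimal Python sketch, illustrated on the moments of a point mass (a discrete mixing distribution with a single support point), for which Theorem 3.1 below predicts a vanishing determinant:

```python
import numpy as np

def hankel_matrix(c):
    """Build the (m+1)x(m+1) Hankel matrix H(c_{2m}) with entries c_{i+j-2} and c_0 = 1.

    `c` is the vector (c_1, ..., c_{2m}) of (estimated) moments of the mixing distribution.
    """
    c = np.concatenate(([1.0], np.asarray(c, dtype=float)))   # prepend c_0 = 1
    m = (len(c) - 1) // 2
    return np.array([[c[i + j] for j in range(m + 1)] for i in range(m + 1)])

# Example: a point mass at phi = 2 has moments c_j = 2^j, so det H(c_2) is 0 (one support point)
H = hankel_matrix([2.0, 4.0])
print(np.linalg.det(H))   # ~0 up to floating point error
```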

Next we state the key result which links the true complexity of a finite mixture to the Hankel matrix of moments. See also Proposition 1 in [21].

Theorem 3.1

For a given \(\varvec{c}_{2m} \in {\mathbb {R}}^{2m}\), the Hankel matrix \(H(\varvec{c}_{2m})\) is positive semidefinite if and only if \(\varvec{c}_{2m} \in {\mathcal {C}}_{2m}\). Furthermore, \(D_{m}:=\det \big ( H(\varvec{c}_{2m}) \big )=0 \) if and only if every distribution function G satisfying \(c_j = \int _{\Phi } \phi ^j dG(\phi )\) for \(j =1, \ldots , 2m\) is discrete with at most m support points.

Now we explain how the result above can be applied in the context of estimating the complexity of a finite mixture. Consider \(f_0\), a finite mixture of densities which belong to some family \({\mathcal {F}}\) and let \(G_0\) be the associated discrete distribution function with true complexity \(m_0\). Then, Theorem 3.1 says that

$$\begin{aligned} m_0 = \inf \{m \in {\mathbb {N}}: D_m=0\}, \end{aligned}$$
(3.1)

where \( D_m\), as above in Theorem 3.1, is the determinant of \(H(\varvec{c}_{2m})\) with

$$\begin{aligned} \varvec{c}_{2m} = \left( \int _{\Phi } \phi dG_0(\phi ), \ldots , \int _{\Phi } \phi ^{2m} dG_0(\phi ) \right) . \end{aligned}$$

In other words, the correct order of the mixture is the first integer which sets the determinant to zero. But the theorem implies also that \(D_m =0\) for all \(m \ge m_0\). This characterizing feature of the true complexity is exploited to construct a sensible estimator. Indeed, assuming that it is possible based on the random sample \((X_1, \ldots , X_n) \) to obtain a strongly consistent estimator of any j-th moment of \(G_0\), \({{\hat{c}}}_j\) say, then the Hankel estimator of \(m_0\) proposed in [21] is given by

$$\begin{aligned} {\hat{m}}_n = {{\,\mathrm{arg\,min}\,}}_{m \in {\mathbb {N}}} \Big \{\vert {\widehat{D}}_{m} \vert + a_m l_n \Big \} \end{aligned}$$
(3.2)

where

$$\begin{aligned} {\widehat{D}}_{m}=\det \big (H(\hat{\varvec{c}}_{2m}) \big ), \ \ \text {with} \ \ \hat{\varvec{c}}_{2m} = \big ({{\hat{c}}}_1, \ldots , {{\hat{c}}}_{2m})^T, \end{aligned}$$

\(\{a_m\}_{m \ge 1}\) is a positive and strictly increasing sequence, and \(\{l_n\}_{n \ge 1}\) a positive sequence satisfying \(\lim _{n \rightarrow \infty } l_n=0\) (we have omitted writing the subscript n in the notation of the estimators of the moments and determinants).
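A minimal sketch of the estimator (3.2) in Python, with the penalty \(a_m = m\), \(l_n = \log n/\sqrt{n}\) used later in the simulations, and with the estimated moments \({{\hat{c}}}_j\) of the mixing distribution assumed to be supplied (e.g. as computed in Example 2 below):

```python
import numpy as np

def hankel_order_estimate(c_hat, n, m_max=8,
                          a=lambda m: m, l=lambda n: np.log(n) / np.sqrt(n)):
    """Estimator (3.2): argmin over 1 <= m <= m_max of |det H(c_hat_{2m})| + a_m * l_n.

    `c_hat` holds at least the first 2*m_max estimated moments of the mixing distribution.
    """
    crit = []
    for m in range(1, m_max + 1):
        c = np.concatenate(([1.0], np.asarray(c_hat[:2 * m], dtype=float)))  # prepend c_0 = 1
        H = np.array([[c[i + j] for j in range(m + 1)] for i in range(m + 1)])
        crit.append(abs(np.linalg.det(H)) + a(m) * l(n))
    return int(np.argmin(crit)) + 1, crit
```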

Clearly, the term \( a_m l_n \) is acting as a penalty. Adding a penalty term to \(\vert {{\widehat{D}}}_m \vert \) is necessary because otherwise minimizing \(m \mapsto \vert {{\widehat{D}}}_m\vert \) alone might yield an inconsistent estimator. In fact, strong consistency of \({{\hat{c}}}_j\) implies that \(\vert {{\widehat{D}}}_{m} \vert \) is a strongly consistent estimator of the true value \(\vert D_m \vert = D_m\) (see our remark below). Since the latter is equal to 0 for all \(m \ge m_0\), \(\vert {{\widehat{D}}}_{m} \vert \) will be close to 0 for all \(m \ge m_0\), which might result in choosing a value which is strictly larger than \(m_0\). Under some additional assumptions, consistency of \({\hat{m}}_n\) as defined above in (3.2) can be established as shown in Theorem 1 of [21]. We recall this result below.

Theorem 3.2

If for all integers \(j, m \ge 1\) we have that

$$\begin{aligned} {\hat{c}}_j \rightarrow c_j \quad \text {and} \quad \frac{{\widehat{D}}_{m} - \ D_m}{l_n} \rightarrow 0 \end{aligned}$$

almost surely as \(n \rightarrow \infty \), then \({\hat{m}}_n \rightarrow m_0 \quad \text {a.s.}\) as \(n \rightarrow \infty \).

Remark 3.1

Recall that \(D_m = \det \big (H(\varvec{c}_{2m}) \big )\). Thus, \(D_m\), seen as a multivariate real function of \(c_1, \ldots , c_{2m}\) (the components of \(\varvec{c}_{2m}\)), is infinitely differentiable. Thus, if \({{\hat{c}}}_j\) is a strongly consistent estimator of \(c_j\) for any integer \(j \ge 1\), then \({{\widehat{D}}}_m = \det \big (H(\hat{\varvec{c}}_{2m}) \big )\) is also a strongly consistent estimator of \(D_m\). Furthermore, if \(\hat{\varvec{c}}_{2m}\) converges weakly to \(\varvec{c}_{2m}\) in the multivariate sense, as in the case where a multivariate Central Limit Theorem applies, then the estimator \({{\widehat{D}}}_m\) will converge to \(D_m\) at a rate that is as fast as that of \(\hat{\varvec{c}}_{2m}\). Typically, the estimators \({{\hat{c}}}_j\) will result from considering some empirical estimators which we know to be asymptotically normal. Below, we will touch upon this point in some more detail.

Remark 3.2

In the light of Remark 3.1, the condition \(({{\widehat{D}}}_m - D_m)/l_n \rightarrow _{a.s.} 0, \ \ \forall \ m \in {\mathbb {N}}\) made in Theorem 3.2 is satisfied in case \(({{\hat{c}}}_j - c_j)/l_n \rightarrow _{a.s.} 0\) for all \(j \in {\mathbb {N}}\). A typical situation is when \(\sqrt{n} ({{\hat{c}}}_j - c_j ) \rightarrow _d {\mathcal {N}}(0, \sigma ^2_j)\) (for some \(\sigma _j > 0\)) and \(l_n\) is such that \(\sqrt{n} l_n \rightarrow \infty \) in addition to \(l_n \rightarrow 0\).

Without going into the full proof of Theorem 3.2, let us give some intuition for the condition \(({{\widehat{D}}}_m - D_m)/l_n \rightarrow _{a.s.} 0, \ \ \forall \ m \in {\mathbb {N}}\). We have that

$$\begin{aligned} \vert {\widehat{D}}_{m} \vert +a_m l_n = \left\{ \begin{array}{ll} l_n \left( \left| \frac{{\widehat{D}}_{m} - D_m}{l_n} + \frac{D_m}{l_n} \right| +a_m \right) , \ \ \text {for }m < m_0 \\ l_n \left( \vert \frac{{\widehat{D}}_{m} - 0}{l_n} \vert + a_m \right) , \ \ \text {for }m \ge m_0. \end{array} \right. \end{aligned}$$

From the characterization of \(m_0\) in (3.1), it follows that \(D_m \ne 0\) for \(m < m_0\), implying that

$$\begin{aligned} \left| \frac{{\widehat{D}}_{m} - D_m}{l_n} + \frac{D_m}{l_n} \right| \rightarrow \infty \end{aligned}$$

almost surely as \(n \rightarrow \infty \), whereas

$$\begin{aligned} \left| \frac{{\widehat{D}}_{m} - 0}{l_n} \right| + a_m \rightarrow a_m \end{aligned}$$

for all \(m \ge m_0\), with \(a_m> a_{m_0}, \forall \ m> m_0\) since the sequence \(\{a_m\}_{m \ge 1}\) is assumed to be strictly increasing. Thus, we expect the minimum of the penalized criterion to be achieved at \(m_0\) as \(n \rightarrow \infty \).

The statement about consistency of \({{\hat{m}}}_n\) can be refined under additional regularity conditions. More precisely, suppose that for any integer \(m \ge 1\) there exist integrable functions \(\psi _j\) and functions \(f_j\) for \(j = 1, \ldots , 2m \) such that the j-th moment of \(G_0\) is given by

$$\begin{aligned} c_j = f_j\big ( {\mathbb {E}}[\varvec{\psi }_{2m}(X)]\big ), \end{aligned}$$

where \( {\mathbb {E}}[\varvec{\psi }_{2m}(X)]= \big ( {\mathbb {E}}[\psi _1(X)], \ldots , {\mathbb {E}}[\psi _{2m}(X)]\big )^T\). Define the estimator \({{\hat{m}}}_n\) the same way as above with

$$\begin{aligned} {\hat{c}}_j = f_j\left( n^{-1} \sum _{i=1}^n \varvec{\psi }_{2m}(X_i)\right) , \ j =1, \ldots , 2m. \end{aligned}$$

We have the following theorem. See also Theorem 2 in [21].

Theorem 3.3

Denote by \(\varvec{f}_{2m} \) the multivariate function defined as \(\varvec{f}_{2m} (\varvec{t}_{2m}) = (f_1(\varvec{t}_{2m}), \ldots , f_{2m}(\varvec{t}_{2m}))\) for \(\varvec{t}_{2m} = (t_1, \ldots , t_{2m})^T \in {\mathbb {R}}^{2m}\). Suppose that for any \(m \le m_0\),

  • \( \varvec{t}_{2m} \mapsto \det \big (H (\varvec{f}_{2m}(\varvec{t}_{2m})) \big ) \) is Lipschitz with respect to some norm on \({\mathbb {R}}^{2m}\),

  • for any \(m \le m_0 \) the generating functions \(u \mapsto \int \exp ( u \psi _j(x)) f_0(x) d\mu (x) \) exist in a neighborhood of 0 for all \(j =1, \ldots , 2m\).

Furthermore, assume that \( n^{1/2} l_n \rightarrow \infty \). Then, there exists a constant \(d > 0\) and integer \(n_0 > 0\) such that for all \(n \ge n_0\)

$$\begin{aligned} {\mathbb {P}}({\hat{m}}_n \ne m_0) \le 4 m_0 e^{-d n l^2_n}. \end{aligned}$$

The main argument in the proof uses judicious upper bounds on the probabilities \({\mathbb {P}}({{\hat{m}}}_n < m_0) \) and \( {\mathbb {P}}({{\hat{m}}}_n > m_0)\) based on concentration inequalities that involve the Cramér transform of the logarithm of the generating function of the centered random variables \(\psi _j(X) - {\mathbb {E}}[\psi _j(X)]\) for \(j \in \{1, \ldots , 2m\}\) and \(m \le m_0\). Before commenting on the result itself, we would like to give some examples, which are relevant for the simulations section coming ahead.

Example 2: Mixture of Gaussian distributions. Consider a finite mixture of Gaussian distributions with density

$$\begin{aligned} f_0(x) = \pi _1 \varphi (x- \theta _1) + \ldots + \pi _{m_0} \varphi (x- \theta _{m_0}), \ \ x \in {\mathbb {R}} \end{aligned}$$

with \(\varphi (x) = 1/\sqrt{2\pi } \exp (-x^2/2)\), and \(\theta _1, \ldots , \theta _{m_0} \in {\mathbb {R}}\). If \(X \sim f_0\), then X has the same distribution as \(Z + Y\) where \(Z \sim {\mathcal {N}}(0,1)\) and \(Y \sim G_0\) with \(G_0\) the mixing distribution with support points \(\theta _1, \ldots , \theta _{m_0}\) and mixing probabilities \(\pi _1, \ldots , \pi _{m_0}\) such that Y and Z are independent. Thus, for any \(j \ge 1\)

$$\begin{aligned} {\mathbb {E}}(X^j) = \sum _{k=0}^{j} \left( {\begin{array}{c}j\\ k\end{array}}\right) {\mathbb {E}}(Y^k) {\mathbb {E}}(Z^{j-k}) = \sum _{k=0}^{j} \left( {\begin{array}{c}j\\ k\end{array}}\right) c_k \mu _{j-k} \end{aligned}$$

where \(\mu _0 = 1\) and for an integer \(r \ge 1\)

$$\begin{aligned} \mu _r= {\left\{ \begin{array}{ll} 0, \text {if} \;r\; \text {is odd} \\ (r-1)!!, \, \text {if} \;r\; \text {is even}, \end{array}\right. } \end{aligned}$$

where \(x!!\) denotes the semifactorial (double factorial) of x.

Thus, the vector of moments \(\varvec{c}_{2m}\) satisfies the triangular linear system \(\varvec{c}_{2m} = B V\) where \(B = A^{-1}\) and A is the lower triangular \((2m) \times (2m) \) matrix with entries \(A_{jk} = \left( {\begin{array}{c}j\\ k\end{array}}\right) \mu _{j-k}\) for \(k \le j\),

$$\begin{aligned} A= \left( \begin{array}{ccccccc} 1 &{} 0 &{} 0 &{} \ldots &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} \ldots &{} 0 &{} 0 &{} 0 \\ 3 &{} 0 &{} 1 &{} 0 &{} \ldots &{} 0 &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ 0 &{} \left( {\begin{array}{c}2m\\ 2\end{array}}\right) (2m-3)!! &{} 0 &{} \left( {\begin{array}{c}2m\\ 4\end{array}}\right) (2m-5)!! &{} 0 &{} \ldots &{} 1 \end{array} \right) \end{aligned}$$

and

$$\begin{aligned} V = \left( {\mathbb {E}}[X], {\mathbb {E}}[X^2] -1, \ldots , {\mathbb {E}}[X^{2m}] - (2m -1)!! \right) ^T. \end{aligned}$$

In this case, we have \(c_j = \sum _{k=1}^{2m} B_{j k} \left( {\mathbb {E}}[X^k] - (k-1)!! \ {\mathbb {I}}_{k \in 2 {\mathbb {N}}} \right) \). Thus, for location mixtures of Gaussian distributions we have shown that

$$\begin{aligned} \psi _j(x) = x^j, \ \ \text {and} \ \ f_j(\varvec{t}_{2m}) = \sum _{k=1}^{2m} B_{j k} \left( t_k - (k-1)!! \ {\mathbb {I}}_{k \in 2 {\mathbb {N}}} \right) \end{aligned}$$
(3.3)

for \(j \in \{1, \ldots , 2m\}\).
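In code, the estimates \({{\hat{c}}}_j\) of (3.3) can be obtained by solving the triangular system \(A \varvec{c}_{2m} = V\) directly rather than inverting A; the following Python sketch (the helper name is ours) implements this for location mixtures of standard Gaussians.

```python
import numpy as np
from math import comb

def double_factorial(r):
    """r!! (semifactorial); returns 1 for r <= 0."""
    return 1 if r <= 0 else int(np.prod(np.arange(r, 0, -2)))

def mixing_moments_gaussian_location(x, m):
    """Estimate c_1, ..., c_{2m}, the moments of the mixing distribution G_0,
    for a location mixture of standard Gaussians, following (3.3).

    Solves A c = V with A_{jk} = C(j,k) * mu_{j-k} and V_j = mean(X^j) - mu_j,
    where mu_r is the r-th moment of N(0,1).
    """
    x = np.asarray(x, dtype=float)
    k_max = 2 * m
    emp = np.array([np.mean(x ** j) for j in range(1, k_max + 1)])   # empirical raw moments of X
    # moments of N(0,1): mu_r = 0 for odd r, (r-1)!! for even r (mu_0 = 1)
    mu = np.array([0.0 if r % 2 else double_factorial(r - 1) for r in range(0, k_max + 1)])
    A = np.zeros((k_max, k_max))
    for j in range(1, k_max + 1):
        for k in range(1, j + 1):
            A[j - 1, k - 1] = comb(j, k) * mu[j - k]
    V = emp - mu[1:]
    return np.linalg.solve(A, V)   # = B V = estimated moments c_hat_{2m}
```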

More examples are available in Appendix A in the supplementary materials.

Now we turn to commenting on Theorem 3.3. Although the result of that theorem seems to give an actual guarantee on the consistency of \({{\hat{m}}}_n\), the exponential bound on the probability of being wrong about \(m_0\) depends on \(n_0\) and a constant d which are unknown. In case d is small and \(n_0\) quite big, consistency will not be observed for moderate and even big sample sizes: one would need an unrealistically huge number of observations to find the true complexity. Another problem is the estimation of the moments \(c_j, j =1, \ldots , 2m\) for large values of m. Although the method does not, in theory, require an upper bound on m while finding the minimum of \(\vert {{\widehat{D}}}_m \vert + a_m l_n\), in practice one has to choose some maximum admissible value for the mixture complexity. For large values of m the j-th moment \(c_j\) can become very large. When this is combined with a low-quality estimator \({{\hat{c}}}_j\), \({{\hat{D}}}_m\) may be far away from 0, which is known to be the theoretical value for \(m \ge m_0\). Such a phenomenon is illustrated using the mixture of Gaussian distributions

$$\begin{aligned} f_0(x) = 0.3 \varphi (x- 10) + 0.4 \varphi (x- 13) + 0.3 \varphi (x- 17). \end{aligned}$$
(3.4)

In Table 1 we give the first 8 theoretical moments \(c_j\) of the mixing distribution and the mean value of their estimates \({{\hat{c}}}_j\) based on 100 replications for each of the sample sizes shown in the table. Table 2 gives the corresponding mean value of \({{\hat{D}}}_m\) as well as its penalized versions with \(a_m = m\) and \(l_n = \log n/\sqrt{n}\) or \(l_n = \sqrt{\log n}/\sqrt{n}\) for \(m \in \{1,2,3, 4\}\) computed on the basis of the same replications. It is clear from the values of Table 2 that \({{\hat{m}}}_n =1\) even for this very well-separated mixture and for the large sample size \(n =10^4\).

Table 1 The true and estimated moments \(c_j\) and \({{\hat{c}}}_j\) for \( j \in \{1, \ldots , 8 \}\) of the mixing distribution of the 3-component mixture of Gaussian distributions given in (3.4)
Table 2 The mean value of \(\vert {{\widehat{D}}}_m\vert \) and the penalized criterion \(\vert {{\widehat{D}}}_m\vert + m l_n, \ m \in \{1, 2, 3, 4 \}\) with the penalties \(l_n = \log n/\sqrt{n}\) and \(l_n = \sqrt{\log n}/\sqrt{n}\), for the 3-component mixture of Gaussian distributions given in (3.4)

Next, we examine what happens in a 2-component mixture of geometric distributions. To this end, we consider the pmf

$$\begin{aligned} f_0(x) = 0.4 (1-0.3) 0.3^x + 0.6 (1-0.8) 0.8^x, \ \ \ x \in \{0,1,2, \ldots \}. \end{aligned}$$
(3.5)

The parametrization we chose implies that \(f_0\) is a mixture of geometric distributions with success probabilities 0.7 and 0.2 respectively. In Table 3 one can see that the moments are accurately estimated for all sample sizes. However, the results of Table 4 indicate that the estimator \({{\hat{m}}}_n\) often fails to pick the correct mixture complexity, here \(m_0=2\), for the penalties \(a_m l_n = m \log n/\sqrt{n}\) and \(a_m l_n = m \sqrt{\log n}/\sqrt{n}\).

Our decision to take the penalty \(a_m l_n = m \log n/\sqrt{n}\) is based on the recommendation made in [21]. The second penalty \(a_m l_n = m \sqrt{\log n}/\sqrt{n}\) was added in these simulations in order to compare the results obtained with the basic approach of [21] with the first modification we propose below and which is based on scaling the estimates of the determinants.

The inconsistency noted in these examples, despite the nice theoretical guarantees of convergence of \({{\hat{m}}}_n\), is due to different reasons. In the Gaussian mixture, the penalty plays almost no role as \(\vert {{\widehat{D}}}_m \vert \) dominates, with values that blow up as we let m take larger values. For this reason, the estimator picks \(m=1\) in all cases. In the geometric mixture, the picture is completely reversed since the moments \(c_j \in (0,1)\) and hence get smaller for larger orders j. This causes \(\vert {{\widehat{D}}}_m \vert \) to decrease with m. In this case, the penalty dominates and again \(m=1\) is often found as the minimizer of the penalized criterion.

Table 3 The true and estimated moments \(c_j\) and \({{\hat{c}}}_j\) for \( j \in \{1, \ldots , 6 \}\) of the mixing distribution of the 2-component mixture of geometric distributions given in (3.5)
Table 4 The mean value of \(\vert {{\widehat{D}}}_m\vert \) and the penalized criterion \(\vert {{\widehat{D}}}_m\vert + m l_n, m \in \{1,2,3\}\) with the penalties \(l_n =\log n/\sqrt{n}\) and \(l_n = \sqrt{\log n}/\sqrt{n}\) for the 2-component mixture of geometric distributions given in (3.5)

The basic approach of [21] can suffer from serious underestimation (or overestimation) of the true mixture complexity even when the sample size is very large. Moreover, the question of how to choose the penalty term \(a_m l_n\) is not really settled in [21]. In fact, a penalty which would work for a certain family of distributions could perform miserably for another. A traditional approach here would be to resort to cross-validation to decide on an optimal choice for the penalty. Although this is a very important aspect of the problem, we choose not to pursue it here.

3.2 Modification of the Basic Approach Using Scaling

The discussion above reveals that while the estimator defined by [21] is quite appealing, it is very difficult to achieve consistency in practice. The main problem is that the method does not exploit any knowledge about the variability of \(\vert {{\widehat{D}}}_{m} \vert \). Without integrating the information about how these random variables behave stochastically (for n large enough), it would be almost impossible to say, for example, whether a small value obtained for \(\vert {{\widehat{D}}}_m \vert \) can be seen as an indication that the true determinant is really equal to 0. One way of circumventing the above issue is to use a rescaled version of this estimator. The starting point here is the already noted fact that the true determinant of the Hankel matrix of moments \(\varvec{c}_{2m} \mapsto D_m \) is an infinitely differentiable function on \({\mathbb {R}}^{2m}\). Thus, assuming that we can apply the Central Limit Theorem to the vector of estimators \(\hat{\varvec{c}}_{2m}\), then for any fixed \(m> m_0\) we get by applying the \(\delta \)-method that

$$\begin{aligned} \sqrt{n} \left( {\widehat{D}}_{1} - D_{1}, \ldots , {\widehat{D}}_{m_0-1} - D_{m_0-1}, {\widehat{D}}_{m_0} - 0, \ldots , {\widehat{D}}_{m} - 0 \right) ^T \rightarrow _d \varvec{W}_m\qquad \end{aligned}$$
(3.6)

where \(\varvec{W}_m= (W_1, \ldots , W_m)^T \sim {\mathcal {N}}(\varvec{0}, \Sigma )\), with \(\Sigma \) some nonnegative definite matrix of dimension \(m \times m\).

Although the covariance matrix \(\Sigma \) is unknown, it can be estimated using re-sampling techniques. Here, we focus only on estimating the diagonal elements of \(\Sigma \), \(\sigma ^2_1, \ldots , \sigma ^2_m\). By sampling n times with replacement from the original sample \((X_1, \ldots , X_n)\) we obtain a bootstrap sample \((X^*_1, \ldots , X^*_n)\) which can be used to compute the bootstrap determinants \(\{{\widehat{D}}^*_{1},\dots ,{\widehat{D}}^*_{m}\} \). Repeating this procedure B times allows us to estimate \(\sigma _j/\sqrt{n} \) by the standard deviation \({\hat{\sigma }}^*_j \) of the bootstrap sample \(({\widehat{D}}^{*b}_{j}, b =1, \ldots , B)\) for \(j = 1, \ldots , m\). As the setting here is very standard, it follows that \( \sqrt{n} \, {\hat{\sigma }}^*_j \approx \sigma _j, \ j =1, \ldots , m\) in probability as \(n, B \rightarrow \infty \).

As mentioned in the previous section, the true order of the mixture can be assumed to be smaller than some given value \(m = m_{max}\); i.e., the search for the minimizer of the penalized criterion will be performed in the set \(\{1, \ldots , m_{max}\}\). Assuming that \(m_{max} > m_0\), define the rescaled vector

$$\begin{aligned} \left( \frac{{\widehat{D}}_{1}}{{\hat{\sigma }}^*_1}, ..., \frac{{\widehat{D}}_{m_0-1}}{{\hat{\sigma }}^*_{m_0-1}}, \frac{{\widehat{D}}_{m_0}}{{\hat{\sigma }}^*_{m_0}}, ..., \frac{{\widehat{D}}_{m_{max}}}{{\hat{\sigma }}^*_{m_{max}}} \right) ^T := \left( Y^{(n)}_{1}, ..., Y^{(n)}_{m_0-1}, Y^{(n)}_{m_0},..., Y^{(n)}_{m_{max}}\right) ^T. \end{aligned}$$

Thus, we redefine the estimator \({{\hat{m}}}_n\) as

$$\begin{aligned} {{\hat{m}}}_n = {{\,\mathrm{arg\,min}\,}}_{ m \in \{1, \ldots , m_{max} \}} \big \{\vert Y^{(n)}_{m} \vert + a_m \sqrt{n} l_n \big \}. \end{aligned}$$
(3.7)
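A sketch of the scaled estimator (3.7): the moment estimator is passed in as a callable (e.g. the Gaussian location helper sketched in Sect. 3.1), and the diagonal of \(\Sigma \) is estimated by the nonparametric bootstrap as described above. All names are ours and purely illustrative.

```python
import numpy as np

def scaled_hankel_order_estimate(x, moment_estimator, m_max=8, B=500,
                                 a=lambda m: m,
                                 l=lambda n: np.sqrt(np.log(n)) / np.sqrt(n),
                                 seed=0):
    """Estimator (3.7): rescale each |D_hat_m| by its bootstrap standard deviation.

    `moment_estimator(sample, m_max)` must return the first 2*m_max estimated
    moments of the mixing distribution.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size

    def determinants(sample):
        c_hat = moment_estimator(sample, m_max)
        dets = []
        for m in range(1, m_max + 1):
            c = np.concatenate(([1.0], c_hat[:2 * m]))
            H = np.array([[c[i + j] for j in range(m + 1)] for i in range(m + 1)])
            dets.append(np.linalg.det(H))
        return np.array(dets)

    D_hat = determinants(x)
    # bootstrap the determinants to estimate their standard deviations sigma_j / sqrt(n)
    boot = np.array([determinants(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    sigma_star = boot.std(axis=0)
    penalties = np.array([a(m) for m in range(1, m_max + 1)]) * np.sqrt(n) * l(n)
    crit = np.abs(D_hat / sigma_star) + penalties
    return int(np.argmin(crit)) + 1
```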

We will not give a formal proof of consistency of \({{\hat{m}}}_n\). The latter, however, can be intuitively seen to hold since it follows from the weak convergence in (3.6) that

$$\begin{aligned} \vert Y^{(n)}_{m} \vert \rightarrow \infty , \ \ \text {for }m=1, \ldots , m_0-1, \end{aligned}$$

and

$$\begin{aligned} (\vert Y^{(n)}_{m_0}\vert , \ldots , \vert Y^{(n)}_{m_{max}} \vert )^T \rightarrow _d (\vert Y_1 \vert , \ldots , \vert Y_{m_{max} - m_0+1} \vert )^T \end{aligned}$$

where \((Y_1, \ldots , Y_{m_{max} - m_0 +1})^T \) is a multivariate Gaussian vector with a covariance matrix having all its diagonal terms equal to 1. On the one hand, this implies that for any integer \(m \in \{1, \ldots , m_0-1\}\) the probability that m is the location of the minimum should decrease as \(n \rightarrow \infty \). On the other hand, for \(m \ge m_0\) the penalty \(a_m \sqrt{n} l_n\) becomes the dominating term. Since \(a_m\) increases with m, the minimum is achieved at \(m_0\) with increasing probability.

In the following, the examples considered above will be revisited using this modified approach to see to what extent the estimation accuracy is improved. More specifically, we use the scaling approach described above to compute the proportion of times the alternative estimator \({{\hat{m}}}_n\) defined in (3.7) is equal to the true complexity in 100 independent replications. In both examples, we have taken \(m_{max} = 8\). The number of bootstrap replications was taken to be \(B=500\). From Tables 5 and 6, we see that the results improve drastically with the scaling method for the sample sizes \(n=1000, 10000\), with recovery of the true complexity at or close to 100%. The improvement seems to be more pronounced with the choice of penalty \(m \sqrt{\log n}/\sqrt{n}\). Thus, one conclusion that can be drawn here is that the method would greatly benefit from comparing the performance of different penalties. As mentioned above, such a comparison can be done using some cross-validation approach.

Table 5 Proportion of the time the scaled Hankel estimator is equal to \(m_0=3\) in the example of the finite mixture of Gaussian densities given in (3.4)
Table 6 Proportion of the time the scaled Hankel estimator is equal to \(m_0=2\) in the example of the finite mixture of geometric probability mass functions given in (3.5)

3.3 Modification of the Basic Approach Using Bootstrap

A specific feature of the Hankel matrix of moments methods discussed previously is the possibility of estimating the order of the mixture without estimating the parameters. However, it seems that there might be a high price to pay for avoiding this part: some of the essential features of the mixture may not be captured by the determinant alone, which can lead to the wrong answer with a “bad” penalty, even for very large sample sizes. Furthermore, in many applications it may be desirable to obtain the estimator of the order of the mixture as well as the estimators of all the parameters. We describe here another modification that is suited for this purpose. It is in essence a sequential testing procedure in which some statistic computed from the data is compared with a critical value obtained e.g. by re-sampling from the assumed theoretical model.

This statistic can be taken to be either the determinant of the Hankel matrix \({\widehat{D}}_{m}\), as in the basic approach proposed by [21], or its rescaled version as described in the previous section. The idea is to replace minimizing the objective function in (3.2) or (3.7) by taking a reject/accept decision regarding whether the current value of m is equal to the true complexity. We describe this procedure only when the statistic is taken to be equal to \({{\widehat{D}}}_m\) since the modifications are rather obvious for the scaled version thereof:

  • for \(m \in \{1, \ldots , m_{max}\}\), compute \({{\widehat{D}}}_m\) and the maximum likelihood estimator (MLE) \(\hat{\varvec{\theta }}_{p_m} \) of \(\varvec{\theta }_{p_m}\) under the m-component model, based on the given sample \(X_1, \ldots , X_n\);

  • from \(f_{\hat{\varvec{\theta }}_{p_m}} \), the corresponding estimate of the mixture density, generate a large number, B, of samples of size n to obtain a sequence of statistics \(\{ {\widehat{D}}^{*b}_{m} \}_{1 \le b \le B}\). Let \({\hat{q}}_{m,B, \alpha /2}\) and \({\hat{q}}_{m,B, 1-\alpha /2}\) be the empirical \((\alpha /2)\)- and \((1-\alpha /2)\)-quantiles based on this bootstrap sample;

  • if \(m = m_{max}\) or \( {\hat{q}}_{m,B, \alpha /2} \le \widehat{D}_m \le {\hat{q}}_{m,B, 1-\alpha /2}\), then take \({{\hat{m}}}_n = m\), otherwise repeat the previous steps with \(m+1.\)
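The three steps above can be sketched as follows, with the model-specific pieces left as hypothetical callables: an MLE fit (e.g. via the EM algorithm of Sect. 2.3), a sampler from the fitted mixture, and the statistic (here \({\widehat{D}}_{m}\)).

```python
import numpy as np

def sequential_hankel_bootstrap(x, fit_mle, sample_from, statistic,
                                m_max=8, B=500, alpha=0.05):
    """Sequential bootstrap test for the mixture order (sketch of the steps above).

    Hypothetical callables, to be supplied for the chosen mixture family:
      fit_mle(x, m)          -> MLE of the m-component mixture parameters (e.g. via EM)
      sample_from(theta, n)  -> n observations drawn from the fitted mixture f_theta
      statistic(x, m)        -> the test statistic, e.g. the Hankel determinant D_hat_m
    """
    n = len(x)
    for m in range(1, m_max + 1):
        D_hat = statistic(x, m)
        theta_hat = fit_mle(x, m)
        # parametric bootstrap: B samples of size n from the fitted m-component mixture
        boot = np.array([statistic(sample_from(theta_hat, n), m) for _ in range(B)])
        q_lo, q_hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
        if (q_lo <= D_hat <= q_hi) or m == m_max:
            return m   # H_0^m: m_0 = m is not rejected (or the search range is exhausted)
```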

The procedure described above is not new in the context of estimating a mixture complexity. In fact, a similar approach will be encountered below, with the only difference that it is based on either a minimum distance statistic or a likelihood ratio statistic (see Sects. 4 and 5). In a nutshell, one sequentially tests

$$\begin{aligned} H^m_0: \, m_0 = m \ \ \text {versus} \ \ H^m_1: \, m_0 > m \end{aligned}$$
(3.8)

and declares as an estimate for \(m_0\) the first value of m for which \(H^m_0\) is not rejected.
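To make the procedure concrete, the following minimal sketch (in Python) implements the sequential accept/reject loop described above. The helpers `hankel_determinant`, `fit_mle` and `sample_mixture` are hypothetical, family-specific routines that a user would have to supply; this is an illustration of the logic rather than a definitive implementation.

```python
import numpy as np

def sequential_hankel_bootstrap(x, hankel_determinant, fit_mle, sample_mixture,
                                m_max=8, B=500, alpha=0.05, seed=None):
    """Sequential bootstrap estimate of the mixture order m_0 (sketch).

    hankel_determinant(x, m)      -> estimated determinant D_m from the sample x,
    fit_mle(x, m)                 -> MLE of the m-component model,
    sample_mixture(theta, n, rng) -> sample of size n from the fitted mixture;
    all three are hypothetical, family-specific helpers.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    for m in range(1, m_max + 1):
        d_hat = hankel_determinant(x, m)
        theta_hat = fit_mle(x, m)
        # bootstrap the determinant under the fitted m-component model
        d_star = np.array([hankel_determinant(sample_mixture(theta_hat, n, rng), m)
                           for _ in range(B)])
        q_lo, q_hi = np.quantile(d_star, [alpha / 2, 1 - alpha / 2])
        if m == m_max or q_lo <= d_hat <= q_hi:
            return m  # H_0^m not rejected (or maximal order reached)
    return m_max
```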

In Tables 7 and 8 we report the proportion of the time the sequential procedure described above gives the correct mixture complexity for the same finite Gaussian and geometric mixtures given in (3.4) and (3.5), respectively.

Table 7 Proportion of the time the modified Hankel estimator is equal to \(m_0=3\) in the example of the finite mixture of Gaussian densities given in (3.4)
Table 8 Proportion of the time the modified Hankel estimator is equal to \(m_0=2\) in the example of the finite mixture of geometric probability mass functions given in (3.5)

From the simulation results reported in Tables 7 and 8 we can see that this second modification of the original Hankel matrix method is less successful for the Gaussian mixture but still works well for the geometric one. This might again be explained by the large values of the higher-degree moments of the Gaussian distribution, which heavily impact the quality of estimation of the determinants. This is not at all an issue with the geometric distribution, whose moments are much easier to estimate.

3.4 Extension of the Hankel Matrix Approach using Neural Networks

The conclusions drawn from the simulation study summarized in Sect. 6 indicate the need to search for a more reliable and universal mixture order estimation technique that would yield more precise estimates irrespective of the underlying scenario. Obviously, the approaches we have already examined yield estimators which depend on the features involved in a non-linear fashion. For this reason we turn our attention to a popular statistical tool designed specifically for modelling nonlinearities between sets of input and output variables, Neural Networks (NNs), in the hope that they might identify patterns and relationships that the other approaches cannot capture.

Over the past decade the amount of research carried out in the field of NNs has grown exponentially, and a great multitude of NN types and classes have been designed to successfully solve a wide range of problems. The simplest of these tasks, like image labelling or pattern recognition, are usually solved by feed-forward network architectures such as the multi-layer perceptron (MLP), convolutional neural network (CNN) or radial basis function network (RBFN). More sophisticated tasks such as speech recognition or text translation require more complex interactions between the layers of the network, which are implemented in architectures such as long short-term memory (LSTM).

At this stage of our research we are not aiming at estimating the whole mixture model (finding the optimal complexity as well as all its parameters) but only pursue the goal of identifying the number of subgroups in the population based on the observed data. Thus, when cast into the NN framework, the problem of estimating the number of components in a mixture can be viewed as a supervised multiclass classification problem, and the relevant questions that need to be addressed are

  • discovering the most informative features to be fed into the NN;

  • devising an adequate design for the training set;

  • proposing the optimal NN architecture;

  • choosing the learning algorithm;

  • determining whether a universal architecture for multiple families of distributions can be found;

  • understanding whether using NNs is beneficial compared to the other methods.

It seems natural to start the search for a well-suited network by browsing first through the simplest class of NNs: the multilayer perceptron (MLP). Despite their relative simplicity, networks with just two layers can approximate any continuous functional mapping [12]. One of the simplest possible architectures of the considered model, with only two hidden layers, is depicted in Fig. 1.

Fig. 1 Sample MLP architecture

The input features are \(x_1, \dots , x_d\), \(d \ge 1\); the first and the second hidden layers consist of u and r neurons, respectively. The weight vector \(\varvec{\omega } = (\omega ^{(1)}_{1,1}, \dots , \omega ^{(3)}_{6,r} )^T\) is learned by optimizing the loss function \(J(\varvec{\omega })\), which is taken to be the cross-entropy function most often used in multiclass classification tasks such as ours:

$$\begin{aligned} J(\varvec{\omega }) = -\frac{1}{m}\sum _{k=1}^m\big [ z_k \log \big ( {\hat{p}}_k (\varvec{\omega }) \big ) + (1 - z_k) \log \big (1 - {\hat{p}}_k (\varvec{\omega }) \big ) \big ], \end{aligned}$$

where

$$\begin{aligned} z_k = {\left\{ \begin{array}{ll} 1, \; \text {if } \; m_0 = k, \text { with } k \in \{1,2,3,4,5,6\}, \\ 0, \; \text {otherwise } \end{array}\right. } \end{aligned}$$

are the known labels for each generated vector of features in the training sample.

Given this architecture, the choice of the optimal configuration reduces to deciding on the number of hidden layers in the network, the number of neurons in each layer and the corresponding activation functions, the loss function and the learning algorithm.

For estimating the order of a mixture using a NN, one needs an appropriate assumption on the maximal possible number of components. We take this maximum to be 6 for our task.

Recall from Sect. 3 that Hankel’s criterion is backed by elegant statistical theory; however, the method’s performance turns out to be rather poor in practice. We use Hankel matrix determinants as inputs to the MLP in order to improve the estimation results by exploiting the information concentrated in the determinants without resorting to any penalty function. Our experience reveals that using the sequence of the first 6 Hankel matrix determinants of the mixing distribution as inputs leads to better results than the other tested options. For this reason we regard this approach as an extension of the original Hankel technique.

Using the sequence of Hankel determinants as inputs resulted in high performance for the geometric mixtures, but showed poorer performance for the Poisson mixture: the determinants for the Poisson mixtures tend to blow up, while those for the geometric mixtures stay within the interval [0, 1]. For that reason the inputs for the Poisson mixture had to be modified. One modification that led to good performance uses the relative changes in the absolute values of the Hankel determinants:

$$\begin{aligned} x_k = {\left\{ \begin{array}{ll} |{\hat{D}}_1|, \; k=1, \\ \frac{|{\hat{D}}_k|-|{\hat{D}}_{k-1}|}{|{\hat{D}}_{k-1}|}, \; k \in \{2, \dots , 6\}. \end{array}\right. } \end{aligned}$$
(3.9)
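For illustration, the transformation (3.9) can be coded in a few lines; this is a minimal sketch assuming the first 6 estimated determinants are already available.

```python
import numpy as np

def determinant_features(d_hats):
    """Map the first 6 Hankel determinant estimates to the inputs (3.9):
    x_1 = |D_1| and x_k = (|D_k| - |D_{k-1}|) / |D_{k-1}| for k = 2, ..., 6."""
    a = np.abs(np.asarray(d_hats, dtype=float))
    x = np.empty_like(a)
    x[0] = a[0]
    x[1:] = (a[1:] - a[:-1]) / a[:-1]   # relative changes of the absolute values
    return x
```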

To ensure the variety of the training examples, the characteristics and structure of the data used for prediction should be scrutinized and taken into account: the training set should be designed to be as representative as possible of the data of interest. The following procedure can serve as a useful example. For the considered mixture distributions the parameter of each mixture component is chosen without replacement from a grid with a pre-specified step to ensure that the component parameters are distinct. A grid on [0.05, 1] with step 0.05 for the geometric distribution should do well in most applications. A grid on [1, 20] with step 1 can be taken for the Poisson distribution if the expected rate of occurrences is believed not to exceed 20 by much and the parameters of the subpopulations are separated by at least 1 unit; the step can be reduced if more precision is desired. The mixing proportions can be taken on a grid with an appropriate step size in a similar way; in this case replacement is allowed and the drawn values should be normalized to sum to 1. The mixtures with the parameters obtained in this way are then used for generating samples. It is also useful to enrich the training data set by simulating several times from each of the resulting mixtures to account for possible variation in the sampled populations during training.
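The following sketch illustrates the training-set recipe above for geometric mixtures; the feature extractor `extract_features` (e.g. the determinant-based features described earlier) is passed in by the user, and the random draws from the grids are a shorthand for the more exhaustive enumeration one would use in practice.

```python
import numpy as np

def geometric_training_examples(extract_features, n_mixtures_per_order=50,
                                reps=5, n=10000, m_max=6, seed=None):
    """Generate (features, label) pairs for training, following the grid recipe."""
    rng = np.random.default_rng(seed)
    param_grid = np.round(np.arange(0.05, 1.0001, 0.05), 2)    # grid on [0.05, 1]
    features, labels = [], []
    for m0 in range(1, m_max + 1):
        for _ in range(n_mixtures_per_order):
            p = rng.choice(param_grid, size=m0, replace=False)  # distinct components
            w = rng.choice(param_grid, size=m0, replace=True)   # proportions, then normalize
            w = w / w.sum()
            for _ in range(reps):                               # several samples per mixture
                comp = rng.choice(m0, size=n, p=w)
                sample = rng.geometric(p[comp])
                features.append(extract_features(sample))
                labels.append(m0)
    return np.array(features), np.array(labels)
```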

The 6 neurons of the output layer are further processed by the softmax activation function \(\sigma : {\mathbb {R}}^6 \longrightarrow [0,1]^6\), which ensures that the estimated class probabilities lie between 0 and 1 and sum up to 1:

$$\begin{aligned} {\hat{p}}_k = \sigma ({\hat{q}})_{k} = \frac{e^{{\hat{q}}_k}}{\sum _{k'=1}^6 e^{{\hat{q}}_{k'}}}, \end{aligned}$$

where \({\hat{q}}_k\) is the resulting value of the k-th output neuron.

When optimizing the loss function, the value of the learning rate becomes important: when it is too small, the weights of the NN are hardly updated and much time is needed for the network to find a solution; when it is too large, the weights are updated in large increments and overshooting the optimum becomes highly probable. In the case of mixture order estimation the value of the learning rate is influenced by the values of the parameters of the mixtures in the training set: an efficient learning rate for the geometric mixtures, where the parameters lie within (0, 1] and thus higher precision is required to separate the components, will be smaller than for Poisson or Gaussian mixtures, where the parameters can range from 0 to 20.

The output of the MLP is a vector of estimates of the class probabilities, that is, the probabilities that an observation (represented by either a sample from a mixture distribution or a vector of alternative relevant features) belongs to each of the 6 possible classes: \({\hat{p}}_k , k \in \{1, 2, 3, 4, 5, 6\}\). The predicted number of components in the mixture is taken to be the class with the highest estimated probability:

$$\begin{aligned} {\hat{m}} = {{\,\mathrm{arg\,max}\,}}_k {\hat{p}}_k. \end{aligned}$$
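As a small illustration, the softmax transformation and the arg-max decision can be written as follows (a numerically stabilized version of the formulas above).

```python
import numpy as np

def predict_order(q_hat):
    """Return the predicted number of components and the class probabilities
    from the raw values of the 6 output neurons."""
    q = np.asarray(q_hat, dtype=float)
    p_hat = np.exp(q - q.max())
    p_hat /= p_hat.sum()                       # softmax: probabilities in [0, 1] summing to 1
    return int(np.argmax(p_hat)) + 1, p_hat    # classes are labelled 1, ..., 6
```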

The search for a successful model (done using the KerasTuner library [54]) requires examination of a large parameter space even for a simple network such as an MLP. Combinations of several hyperparameters of the NN were tracked in order to identify the optimal architecture (a schematic of such a search is sketched in code after the list):

  • activation function: relu, tanh, sigmoid

  • number of layers: \(1, 2, \dots , 9, 10\)

  • neurons in each hidden layer: \(10, 25, 40, \dots , 280, 295, 310\)

  • dropout layer after the last hidden layer: dropout rate between 0 and 0.1

  • learning rate for the optimization algorithm: \(10^{-2}, 10^{-3} , 10^{-4}\)
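The sketch below shows how such a search could be set up with KerasTuner; the particular hyperparameter names, the feature dimension (6 determinant-based inputs) and the random-search wrapper are our own illustrative choices, not a description of the exact code used for the study.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(6,)))                       # 6 input features
    activation = hp.Choice("activation", ["relu", "tanh", "sigmoid"])
    for i in range(hp.Int("num_layers", 1, 10)):
        model.add(tf.keras.layers.Dense(hp.Int(f"units_{i}", 10, 310, step=15),
                                        activation=activation))
    model.add(tf.keras.layers.Dropout(hp.Float("dropout", 0.0, 0.1)))
    model.add(tf.keras.layers.Dense(6, activation="softmax"))   # class probabilities
    lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=100)
# tuner.search(x_train, y_train, validation_split=0.2, epochs=50)
```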

A set of 10000 samples was used for training the NN, the motivation being that taking the moments and determinants as the features requires quality estimation thereof. Therefore, while a sample of this size is rarely available in real-world datasets, the emphasis was placed on finding a neural network that performs well when the estimates are good. In practice, a neural network trained with 10000 samples still predicts well when the test sample size is much smaller.

Unfortunately, we were not able to find a single MLP specification that would work equally well for all considered families of distributions - Gaussian, geometric and Poisson. Table 9 presents three different MLP configurations for Gaussian, geometric and Poisson mixtures that achieved satisfactory performance in our simulations. The predicted class probabilities as well as prediction accuracy on a number of test cases for networks with the denoted specifications can be found in Sect. 6.

Table 9 NN configurations for Gaussian, geometric and Poisson mixtures

4 Methods Based on Minimum Distance Estimators

4.1 General Setting

The estimation techniques discussed in this section are mainly based on the works [69, 74, 75]; additional relevant references will be mentioned below. In a nutshell, these techniques use the minimal distance between a consistent estimator of \(f_0\), \({{\hat{f}}}_n\) say, and its best approximation in the class of finite mixtures with m components, \(m \ge 1\). Using the notation of Sect. 2, this means that the estimator of the true \(m_0\) will be based on the projection of \({{\hat{f}}}_n\) onto the class \({\mathcal {F}}_m\), \(m \ge 1\), in a sense to be made precise. As noted above, these classes are nested; i.e., \({\mathcal {F}}_m \subset {\mathcal {F}}_{m+1}\). Thus, the basic approach stops at the first m for which the projection onto \({\mathcal {F}}_{m+1} \) does not bring a substantial improvement over the projection onto \({\mathcal {F}}_m\). One main feature of the methods investigated in this section is that one estimates the parameters of the mixture as well as the mixture complexity.

In the above-mentioned papers, the projection onto \({\mathcal {F}}_m\) makes use of the Hellinger or the \(L_2\) distance. The estimation method suggested in [74] is built upon a model selection procedure that sequentially fits the nested mixture models. Thus, the method is reminiscent of the sequential testing approach already encountered above for the second modified version of the determinant of the Hankel matrix of moments. At each iteration the procedure searches over a larger class by adding one component to the mixture and finds the best model within that class, until adding another component brings no further benefit in the sense that it decreases the objective loss function by an amount smaller than a specified tolerance level. In the sequel, we consider only the situation where the dominating measure, \(\mu \), is either the counting or the Lebesgue measure. In the first case, the nonparametric estimator of \(f_0\) is the empirical probability mass function given by

$$\begin{aligned} {\hat{f}}_n(x)=\frac{1}{n} \sum _{i=1}^{n} {\mathbb {I}}(X_i=x), \; x \in {\mathcal {X}}. \end{aligned}$$
(4.1)

In the second one, where X is absolutely continuous, we consider a kernel density estimator with fixed or random bandwidth \(c_n\):

$$\begin{aligned} {\hat{f}}_n(x) = \frac{1}{n c_n}\sum _{i=1}^n K \bigg ( \frac{x-X_i}{c_n} \bigg ), \end{aligned}$$
(4.2)

where K is some standard kernel function.
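Both nonparametric estimators are straightforward to code; the following minimal sketch implements (4.1) and (4.2), with a Gaussian kernel as an (assumed) choice of K.

```python
import numpy as np

def empirical_pmf(x, support):
    """Empirical probability mass function (4.1) evaluated on a given support."""
    x = np.asarray(x)
    return np.array([np.mean(x == s) for s in support])

def kde(x, grid, c_n):
    """Kernel density estimator (4.2) with bandwidth c_n and a Gaussian kernel."""
    x, grid = np.asarray(x, dtype=float), np.asarray(grid, dtype=float)
    u = (grid[:, None] - x[None, :]) / c_n
    return np.exp(-0.5 * u**2).mean(axis=1) / (c_n * np.sqrt(2 * np.pi))
```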

4.2 The Minimum Distance Estimator: The Basic Approach

In the following, let \({\mathcal {F}}\) denote the class of densities that are mixed. Also, let \({\mathcal {D}}\) be either the Hellinger or the \(L_2\) distance; that is, for two densities f and g with respect to \(\mu \) we have either

$$\begin{aligned}&{\mathcal {D}}^2(f, g) = \frac{1}{2} \int _{{\mathcal {X}}} \left( \sqrt{f(x)} - \sqrt{g(x)} \right) ^2 d\mu (x) \\&\quad = 1 -\int _{{\mathcal {X}}} \sqrt{f(x)} \sqrt{g(x)} d\mu (x) = {\mathcal {H}}^2(f,g) \end{aligned}$$

or

$$\begin{aligned} {\mathcal {D}}^2(f, g) = \int _{{\mathcal {X}}} \left( f(x) - g(x) \right) ^2 d\mu (x) = L_2^2(f, g). \end{aligned}$$

Recall the following notation from Sect. 2: For a given

$$\begin{aligned} \varvec{t}_{p_m} = (\pi _1, \ldots , \pi _m, \varvec{\phi }_{d,1}, \ldots , \varvec{\phi }_{d,m})^T \in \Theta _{p_m} \end{aligned}$$

where \(p_m = m(d +1) -1\), we denote by \(f_{\varvec{t}_{p_m}}\) the m-component mixture density given by \(f_{\varvec{t}_{p_m}}(x) = \pi _1 f_{\varvec{\phi }_{d,1}}(x) + \ldots + \pi _m f_{\varvec{\phi }_{d,m}}(x)\). For a given density f, we now consider the functional

$$\begin{aligned} \varvec{\theta }_{p_m}^{{\mathcal {D}}}(f) = \Big \{ \varvec{\theta }_{p_m} \in \varvec{\Theta }_{p_m}: {\mathcal {D}}(f_{\varvec{\theta }_{p_m}}, f) = \min _{\varvec{t}_{p_m} \in \varvec{\Theta }_{p_m}} {\mathcal {D}}(f_{\varvec{t}_{p_m}},f) \Big \}. \end{aligned}$$

provided that the minimum exists. Here, \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\) denotes the set of all minimizers as uniqueness of the solution is not guaranteed. Note that our notation is different from the one used in [8, 74, 75] and [69]. In case \(f = {{\hat{f}}}_n\), we have the following definition: for a given m and the non-parametric estimator \({\hat{f}}_n\) defined above in (4.1) or (4.2), the minimum distance estimator with respect to \({\mathcal {D}}\) of \(\varvec{\theta }_{p_m}\) is defined as

$$\begin{aligned} \hat{\varvec{\theta }}^{{\mathcal {D}}}_{p_m} = \varvec{\theta }_{p_m}^{{\mathcal {D}}}({{\hat{f}}}_n) \end{aligned}$$
(4.3)

provided that a minimizer exists.

Proving the existence of a minimizer \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\) for a given density f requires careful argumentation under some regularity conditions. When \({\mathcal {D}}\) is the Hellinger distance, Theorem 1 in [8] proves this existence under the conditions that \(\varvec{t}_{p_m} \mapsto f_{\varvec{t}_{p_m}}(x)\) is continuous for almost every \(x \in {\mathcal {X}}\), that the mixture is identifiable and that the parameter space \(\Theta _{p_m} = {\mathcal {S}}_{m-1} \times \Phi ^m\) is compact. The same theorem also proves uniqueness of \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\) in case f is itself a finite mixture. In other words, if \( f = f_{\varvec{\theta }_{p_{m}}}\), then \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f) = \varvec{\theta }_{p_m} \); this can be easily seen as an immediate consequence of identifiability. When \({\mathcal {D}}\) is the \(L_2\) distance and \(\mu \) is the counting measure on the set of non-negative integers, [69] shows a similar result while relaxing the compactness condition on the parameter space. The main building block in the proof is to show that the mapping

$$\begin{aligned} \varvec{t}_{p_m} \mapsto \sum _{x=0}^\infty \big (f(x) - f_{\varvec{t}_{p_m}}(x) \big )^2 = \Vert f - f_{\varvec{t}_{p_m}} \Vert ^2_2 \end{aligned}$$

is continuous. In the discrete setting considered in [69], a proof of the continuity property can be based on a slightly different argument. Indeed, if \(\varvec{t}^{(k)}_{p_m}\) is a sequence converging to \(\varvec{t}_{p_m}\) as \(k \rightarrow \infty \), then by Minkowski’s inequality (also used in page 4252 of [69])

$$\begin{aligned} \Big \vert \Vert f - f_{\varvec{t}_{p_m}} \Vert _2 - \Vert f - f_{\varvec{t}^{(k)}_{p_m}} \Vert _2 \Big \vert&\le \Vert f_{\varvec{t}_{p_m}} - f_{\varvec{t}^{(k)}_{p_m}} \Vert _2 \\&\le \sum _{x=0}^\infty \vert f_{\varvec{t}_{p_m}}(x) - f_{\varvec{t}^{(k)}_{p_m}}(x) \vert , \end{aligned}$$

where the second inequality uses that a pmf is always bounded by 1.

The latter sum converges to 0 by continuity of \(\varvec{t}_{p_m} \mapsto f_{\varvec{t}_{p_m}}(x)\) and an application of Scheffé’s theorem. For existence of a minimizer when \({\mathcal {D}}\) is the \(L_2\)-distance, [69] assumes that for the pmf f to be projected and for any m there exist some compact set C (which depends on m, although we omit writing this dependence explicitly) and some \(\varvec{\theta }^*_{p_m}\) such that

$$\begin{aligned} \inf _{\varvec{t}_{p_m} \in \Theta _{p_m} \setminus C} {\mathcal {D}}(f,f_{\varvec{t}_{p_m}}) > {\mathcal {D}}(f, f_{\varvec{\theta }^*_{p_m}}). \end{aligned}$$

Such an assumption is not needed in case \(\Theta _{p_m}\) is itself compact. Also, the compact set C is rather abstract and one only needs to exhibit its existence in some way. It is clear that even when \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\) is not a singleton, we have \(f_{\varvec{\theta }_{p_m}} = f_{\varvec{\theta }'_{p_m}}\) a.e. for two minimizers \(\varvec{\theta }_{p_m}, \varvec{\theta }'_{p_m} \in \varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\). When \(f = {{\hat{f}}}_n\), we will denote the corresponding density \(f_{\varvec{\theta }_{p_m}}\) (or \(f_{\varvec{\theta }'_{p_m}}\)) by \({\widehat{f}}^{{\mathcal {D}}}_m\). Note that we have omitted the subscript n and replaced \(p_m\) by m for the sake of lighter notation. By definition of \(\varvec{\theta }_{p_m}^{{\mathcal {D}}}(f)\) we have

$$\begin{aligned} {\widehat{f}}^{{\mathcal {D}}}_m = {{\,\mathrm{arg\,min}\,}}_{g \in {\mathcal {F}}_m} {\mathcal {D}}({{\hat{f}}}_n, g). \end{aligned}$$
(4.4)

For \(f= f_0 \in {\mathcal {F}}_{m_0}\) we write

$$\begin{aligned} f^{{\mathcal {D}}}_m = {{\,\mathrm{arg\,min}\,}}_{g \in {\mathcal {F}}_m} {\mathcal {D}}(f_0,g). \end{aligned}$$
(4.5)

Note that \(f^{{\mathcal {D}}}_{m} = f_0\) for all \(m \ge m_0\). The roles of \({\widehat{f}}^{{\mathcal {D}}}_m\) and \(f^{{\mathcal {D}}}_m\) will become clear below. Although our notation for those projections is different from the one used in the aforementioned papers, our choice is driven by our desire to maintain some notational coherence throughout this survey.

Now, we describe how the basic approach works with the minimum distance estimators. The estimation procedure as outlined in [74] is much inspired by the work of [37]. The latter paper is however mainly focused on estimating the true complexity of a finite mixture of Gaussian distributions using the Kullback-Leibler divergence instead of the Hellinger (or \(L_2\)) distance. There are two starting points of the basic approach. The first one is to note that the true mixture complexity \(m_0\) satisfies

$$\begin{aligned} m_0= & {} \min \{ m: {\mathcal {D}}^2(f_0, f_m^{{\mathcal {D}}}) = 0\} \end{aligned}$$
(4.6)
$$\begin{aligned}= & {} \min \{ m: {\mathcal {D}}^2(f_0, f_m^{{\mathcal {D}}}) = {\mathcal {D}}^2(f_0, f_{m+1}^{{\mathcal {D}}}) \} \nonumber \\= & {} \min \{ m: {\mathcal {D}}^2(f_0, f_m^{{\mathcal {D}}}) \le {\mathcal {D}}^2(f_0, f_{m+1}^{{\mathcal {D}}}) \}. \end{aligned}$$
(4.7)

While the identity in (4.6) is a direct consequence of identifiability, the one in (4.7) is less obvious. Note that this identity is proved if we show that for \(m \le m_0-1\), \({\mathcal {D}}(f_0, f^{{\mathcal {D}}}_m) > {\mathcal {D}}(f_0, f^{{\mathcal {D}}}_{m+1})\). A proof of this fact when \({\mathcal {D}}\) is the Hellinger distance can be found in p. 1485 of [74], where an auxiliary lemma (Lemma A.4) was used. For the case where \({\mathcal {D}}\) is the \(L_2\) distance, and for the sake of completeness, we give a proof in Appendix C in the supplementary materials.

Based on the discussion above, it seems natural to search for the first m which minimizes some empirical version of the distance \({\mathcal {D}}^2(f_0, f_m^{{\mathcal {D}}})\), namely \({\mathcal {D}}({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_{m})\). The second starting point is to recall once more the inclusion \({\mathcal {F}}_m \subset {\mathcal {F}}_{m+1}\). This implies that

$$\begin{aligned} {\mathcal {D}}({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_{m+1}) \le {\mathcal {D}}({\hat{f}}_n,{\widehat{f}}^{\mathcal {D}}_m). \end{aligned}$$

The inequality above means that without penalization it is in principle impossible to find a finite order which can be taken as an estimator of \(m_0\). This overfitting is accounted for by adding a penalty term which is proportional to the number of parameters in the mixture model. This yields the following criterion

$$\begin{aligned} {\mathcal {D}}^2({\hat{f}}_n,{\widehat{f}}^{{\mathcal {D}}}_m) + b_n v_m, \end{aligned}$$
(4.8)

for some chosen sequences \(\{v_m\}_m\) and \(\{b_n\}_n\) such that the former is increasing and the latter satisfies \(\lim _{n \rightarrow \infty } b_n =0\). Note that in the works [74, 75] and [69], \(b_n v_m /n\) is taken instead. Now, mimicking the property in (4.7) gives rise to the following definition of the minimum distance estimator

$$\begin{aligned} {\hat{m}}_n&= \min \big \{m: {\mathcal {D}}^2({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_m) + b_n v_m \le {\mathcal {D}}^2({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_{m+1}) +b_n v_{m+1}\big \} \\&= \min \big \{ m: {\mathcal {D}}^2({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_m) \le {\mathcal {D}}^2({\hat{f}}_n,{\hat{f}}^{{\mathcal {D}}}_{m+1}) + \alpha _{n, m}\big \}, \end{aligned}$$

with \(\alpha _{n, m} = b_n (v_{m+1} - v_m)\).

If this minimum does not exist, then \({\hat{m}}_n=\infty \). The term \(\alpha _{n,m}\) can be seen as a threshold so that an integer m is declared to be the estimator when the projection of \({{\hat{f}}}_n\) on the class \({\mathcal {F}}_{m+1}\) yields an insignificant change in comparison with its projection on the previous class \({\mathcal {F}}_m\).
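The selection rule can be transcribed directly; in the sketch below, `project` is a hypothetical routine returning the minimum-distance fit of an m-component mixture to \({{\hat{f}}}_n\) (e.g. by numerical optimization), and `dist2` is the chosen squared distance \({\mathcal {D}}^2\).

```python
import numpy as np

def estimate_order(f_hat, project, dist2, b_n, v, m_max=10):
    """Return the first m for which
    D^2(f_hat, f_hat_m) <= D^2(f_hat, f_hat_{m+1}) + alpha_{n,m},
    where alpha_{n,m} = b_n (v_{m+1} - v_m) and v[0], v[1], ... stand for v_1, v_2, ...
    """
    d2 = [dist2(f_hat, project(f_hat, m)) for m in range(1, m_max + 2)]
    for m in range(1, m_max + 1):
        alpha_nm = b_n * (v[m] - v[m - 1])        # threshold alpha_{n,m}
        if d2[m - 1] <= d2[m] + alpha_nm:
            return m
    return None                                   # corresponds to m_hat = infinity
```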

Strong consistency of the minimum distance estimator defined above was stated in Theorem 1 of [75] for the Hellinger distance and in the consistency theorem section of [69]. In the former, the nonparametric estimator \({{\hat{f}}}_n\) is taken to be a kernel estimator, as defined above in (4.2), with bandwidth \(c_n\) such that \(\lim _{n \rightarrow \infty } \left( c_n + (nc_n)^{-1}\right) = 0\). In the latter, \({{\hat{f}}}_n\) is the empirical probability mass function given in (4.1). Now, the only condition assumed on \(\alpha _{n,m}\) in these convergence theorems is that \(\alpha _{n,m} \rightarrow 0\) as \(n \rightarrow \infty \), which is of course guaranteed by the fact that \(\lim _{n \rightarrow \infty } b_n = 0\). However, we believe that this statement is not accurate as is, since the penalty needs to depend on the rate of convergence of \({{\hat{f}}}_n\) to the true density \(f_0\). This rate of convergence is known to depend on the smoothness of \(f_0\) and the kernel K. In Appendix C in the supplementary materials, we explain why the condition \(\lim _{n \rightarrow \infty } \alpha _{n,m} =0\) is not enough.

4.3 Modification of the Basic Approach Via Bootstrap

In this section we propose a modification of the basic approach, based on the minimal distance between a non-parametric estimator and its projection onto the class of m-component mixtures, augmented with a given threshold. In this modification we resort to a parametric bootstrap procedure in order to avoid a poor choice of the threshold. For a given integer \(m \ge 1\), consider the hypothesis testing problem as in (3.8)

$$\begin{aligned} H^m_0: m_0 = m \ \ \ \text {vs.} \ \ \ H^m_1: m_0 > m, \end{aligned}$$

and recall the estimator \({{\widehat{f}}}^{{\mathcal {D}}}_m\) as defined in (4.3). Let us now define

$$\begin{aligned} \Delta _m = {\mathcal {D}}({{\widehat{f}}}^{{\mathcal {D}}}_m,{\hat{f}}_n)-{\mathcal {D}}({{\widehat{f}}}^{{\mathcal {D}}}_{m+1},{\hat{f}}_n). \end{aligned}$$

To obtain the distribution of \(\Delta _m\) under the null hypothesis \(H^m_0\), we draw B independent samples of size n from the fitted density \({{\widehat{f}}}^{{\mathcal {D}}}_m\). For \(b =1, \ldots , B\), we compute the non-parametric estimators \({{\hat{f}}}^{(b)}_n\), \({\widehat{f}}^{{\mathcal {D}}, (b)}_{m}\) and \({\widehat{f}}^{{\mathcal {D}}, (b)}_{m+1}\) and the corresponding difference

$$\begin{aligned} \Delta ^{(b)}_m = {\mathcal {D}}({{\widehat{f}}}^{{\mathcal {D}}, (b)}_m,{\hat{f}}^{(b)}_n)-{\mathcal {D}}({{\widehat{f}}}^{{\mathcal {D}}, (b)}_{m+1},{\hat{f}}^{(b)}_n). \end{aligned}$$

If \({\hat{q}}_{B, \alpha /2}\) and \({\hat{q}}_{B, 1-\alpha /2}\) denote the empirical \(\alpha /2\) and \((1-\alpha /2)\)-quantiles based on the bootstrap sample \((\Delta ^{(1)}_m, \ldots , \Delta ^{(B)}_m)\), then \(H^m_0\) is rejected if

$$\begin{aligned} \Delta _m \notin [{\hat{q}}_{B, \alpha /2}, {\hat{q}}_{B, 1-\alpha /2}] \end{aligned}$$

and the current candidate m is replaced by \(m+1\). We take as our estimator of \(m_0\) the first m for which the null hypothesis is not rejected. Here, we report simulation results for the two examples of finite mixtures considered above; see (3.4) and (3.5). In these simulations, \(B=500\) and the sample sizes considered are \(n =100, 1000, 10000\). The performance of the method proposed in this section is assessed through the proportion of times the estimator is equal to the true complexity (3 and 2, respectively). We used the Hellinger distance for the Gaussian mixture and the \(L_2\) distance for the geometric mixture.
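A compact sketch of one step of this bootstrap test is given below; `nonparam_est`, `project`, `sample_from` and `dist` stand for the nonparametric estimator (4.1)/(4.2), the minimum-distance fit in \({\mathcal {F}}_m\), resampling from a fitted mixture and the chosen distance \({\mathcal {D}}\), respectively, all of which are hypothetical user-supplied helpers.

```python
import numpy as np

def reject_H0(x, m, nonparam_est, project, sample_from, dist,
              B=500, alpha=0.05, seed=None):
    """Parametric-bootstrap accept/reject decision for H_0^m: m_0 = m,
    based on Delta_m = D(f_m, f_n) - D(f_{m+1}, f_n)."""
    rng = np.random.default_rng(seed)
    n = len(x)

    def delta(sample):
        f_n = nonparam_est(sample)
        return dist(project(f_n, m), f_n) - dist(project(f_n, m + 1), f_n)

    fitted_m = project(nonparam_est(x), m)           # m-component fit to the data
    boot = np.array([delta(sample_from(fitted_m, n, rng)) for _ in range(B)])
    q_lo, q_hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return not (q_lo <= delta(x) <= q_hi)            # True: reject H_0^m, move to m + 1
```

The estimator of \(m_0\) is then the first m for which this test does not reject.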

Table 10 Proportion of the time the modified minimum-distance estimator is equal to \(m_0=3\) in the example of the finite mixture of Gaussian densities given in (3.4)
Table 11 Proportion of the time the modified minimum-distance estimator is equal to \(m_0=2\) in the example of the finite mixture of geometric densities given in (3.5)

From Tables 10 and 11 one can see that coupling the bootstrap with distance minimization gives very promising results. The downside of this modification remains essentially the computational burden that comes with the resampling step.

5 Sequential Likelihood Ratio Tests with Bootstrap

Likelihood-based methods play a central role in statistical inference. Among these methods the likelihood ratio test (LRT) is one of the most widely used in practice. The LRT has a simple interpretation and enjoys the property of being invariant under re-parametrization; see for example [41]. Furthermore, whenever certain regularity conditions are satisfied, Wilks’ theorem [71] says that, under the null hypothesis, the LRT statistic converges weakly to a chi-square distribution as the sample size grows to \(\infty \). The degrees of freedom of the limiting chi-square distribution are determined by the difference between the dimension of the whole parameter space and that of the null space. However, as already mentioned above, the regularity conditions required for this asymptotic theory do not hold in the mixture problem. The issue is that it is always possible to write a mixture model with m components as a model with \(m+1\) components (or more) by setting some of the mixing probabilities to 0. Hence, the true parameter under the null hypothesis lies on the boundary of the alternative space. To better explain the problem, let us consider the example of testing whether a random sample was generated from a (single) Gaussian distribution \({\mathcal {N}}(\theta ,1)\) versus a 2-component Gaussian mixture, where each of the components has variance equal to 1. If \(f_0\) denotes the true density, then the goal is to test

$$\begin{aligned} H_0: f_0(x) = \varphi (x- \theta ) \ \ \text {vs.} \ \ H_1: f_0(x) = \pi _1 \varphi (x-\theta _1) + (1-\pi _1) \varphi (x-\theta _2) \end{aligned}$$

for some \(\theta \in {\mathbb {R}}\) and \( \theta _1 \ne \theta _2 \in {\mathbb {R}}\) and \(\pi _1 \in (0,1)\). If we define here the likelihood ratio as the ratio of the maximum likelihoods over the null and the whole space, then the maximization needs to be done on \({\mathbb {R}}\) (the null space) and

$$\begin{aligned} \Big \{(\theta _1, \theta _2, \pi _1): \theta _1, \theta _2 \in {\mathbb {R}}, \ \ \pi _1 \in [0,1] \Big \} = {\mathbb {R}}^2 \times [0,1] \end{aligned}$$

(equal to the whole parameter space; i.e., the union of the null and alternative spaces). Thus, a density under \(H_0\) lies on the boundary in the sense that \(f_0\) results from either setting the mixing probability \(\pi _1\) to 1 or 0 and taking \(\theta _1 = \theta \) (or \(\theta _2=\theta \)), or letting \(\theta _1 =\theta _2\) while \(\pi _1 \) is arbitrary. In the classical setting, one of the main arguments leading to the chi-square distribution as the weak limit is a second-order Taylor expansion of the log-likelihood at the global MLE around the true parameter. We refer here to the proof of Theorem 22 in [26] for well-explained and rigorous arguments. The argument works because the true parameter is an interior point and has a unique representation as an element of the whole parameter space. In the mixture model setting this argument breaks down because, as can be seen from the example above, the true parameter has different representations under the null hypothesis, where it lies on the boundary. This non-standard situation has triggered a strong interest in both computational and theoretical investigations of the limit distribution of the LR statistic under the null hypothesis. In this context, we can refer to [1, 13, 64], and [47]. For a nice review of the papers on this subject, we refer to Section 6.5 in [49] among others. The main message that one can take from these works is that when the limit distribution of the LRT can be simply described, it is a mixture of chi-square distributions. In more complicated cases, this limit distribution is given more abstractly by \(\sup _{s \in {\mathcal {S}}} \max (0, Y_s)^2\), where Y is some well-defined centered Gaussian process and \({\mathcal {S}}\) is a suitable set. See for example [47], where it is shown that \({\mathcal {S}}\) is the set of cluster points of some generalized score.

In this section we review the sequential testing procedure based on the LRT and the use of resampling to approximate the distribution of the test statistic under the null hypothesis. This method was proposed in [38] for estimating the true complexity of a finite mixture of Poisson distributions. Although the focus there was on that family, the approach can certainly be extended to other distributions. Consider again the hypothesis testing problem

$$\begin{aligned} H^m_0: m_0=m \quad \text {vs.} \quad H^{m}_1: m_0 > m. \end{aligned}$$

Using the same notation as above, we define the maximum likelihood under \(H^m_0\) and \(H^m_1\) as

$$\begin{aligned} L_{\varvec{X}}(\hat{\varvec{\theta }}_{p_m}) = \sup _{\varvec{\theta }_{p_m} \in \Theta _{p_m}} L_{\varvec{X}}(\varvec{\theta }_{p_m}), \ \ \text {and} \ \ \ L_{\varvec{X}}(\hat{\varvec{\theta }}_{p_{m+1}}) = \sup _{\varvec{\theta }_{p_{m+1}} \in \Theta _{p_{m+1}}} L_{\varvec{X}}(\varvec{\theta }_{p_{m+1}}) \end{aligned}$$

where \(L_{\varvec{X}} \) denotes the likelihood function based on the sample \(\varvec{X} = (X_1, \ldots , X_n)\) from the unknown mixture, and \(p_m = m (d+1) -1\). As in [38], consider the log-likelihood ratio statistic

$$\begin{aligned} \lambda = -2 \left( \log L_{\varvec{X}}(\hat{\varvec{\theta }}_{p_m}) - \log L_{\varvec{X}}(\hat{\varvec{\theta }}_{p_{m+1}}) \right) \end{aligned}$$
(5.1)

A mixture with a smaller number of components is rejected in favor of the larger model whenever the log-likelihood ratio statistic is large; otherwise, in the absence of strong evidence against it, the mixture is declared to have m components. Exactly as in Sects. 3 and 4, the decision against \(H^m_0\) is taken sequentially, starting with \(m=1\), until \(H^m_0\) cannot be rejected, in which case m is declared to be the estimator of \(m_0\). In view of the issues related to deriving the asymptotic distribution of the log-likelihood ratio statistic, one can resort to a parametric bootstrap. As this has already been described above, the details of the procedure are omitted.
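For completeness, a minimal sketch of the bootstrap approximation for one test of \(H^m_0\) is given below; `fit_mle`, `loglik` and `sample_from` are hypothetical, family-specific helpers (MLE of an m-component model, its log-likelihood and resampling from a fitted model).

```python
import numpy as np

def lrt_bootstrap_pvalue(x, m, fit_mle, loglik, sample_from, B=500, seed=None):
    """Parametric-bootstrap p-value for H_0^m: m_0 = m based on the statistic (5.1)."""
    rng = np.random.default_rng(seed)
    n = len(x)

    def lam(sample):
        theta_m, theta_m1 = fit_mle(sample, m), fit_mle(sample, m + 1)
        return -2.0 * (loglik(theta_m, sample) - loglik(theta_m1, sample))

    lam_obs = lam(x)
    theta_null = fit_mle(x, m)                                  # fit under H_0^m
    lam_boot = np.array([lam(sample_from(theta_null, n, rng)) for _ in range(B)])
    return float(np.mean(lam_boot >= lam_obs))                  # reject if below, e.g., 0.05
```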

We give the results of this procedure for the two finite mixtures with \(m_0=3\) and 2 already considered above.

Table 12 Proportion of the time the LRT with bootstrap estimator is equal to \(m_0=3\) in the example of the finite mixture of Gaussian densities given in (3.4)
Table 13 Proportion of the time the LRT with bootstrap estimator is equal to \(m_0=2\) in the example of the finite mixture of geometric probability mass functions given in (3.5)

6 Simulation Results

In this section we describe the simulation results obtained for sample sizes \(n \in \{50, 100, 500, 1000, 5000, 10000\}\) using the procedures described in the previous sections for finite mixtures of Gaussian, geometric and Poisson distributions with \(m_0 \in \{2, 3, 4 \}\). Throughout this study, unless explicitly stated otherwise, the standard deviations of the Gaussian mixture components are assumed to be 1. For the minimum distance-based methods, we follow the recommendations made in the relevant papers [75] and [69] and use the following two penalty functions \(\alpha _{n, m}\), with m denoting the stipulated mixture order and n the sample size (a direct transcription in code is given after the list):

  • the penalty based on the Akaike Information Criterion (AIC)

    \(\alpha _{n, m} = \frac{0.6}{n} \log \Big (\frac{m+1}{m}\Big )\) for the \(L_2\) distance and \( \alpha _{n, m} = \frac{2}{n}\) for the Hellinger distance

  • the penalty based on the Schwarz Bayesian Criterion (SBC)

    \(\alpha _{n, m} = 0.6 \frac{\log n}{n} \log \Big (\frac{m+1}{m}\Big )\) for the \(L_2\) distance and \(\alpha _{n, m} = \frac{\log n}{n}\) for the Hellinger distance.
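The thresholds above are simple functions of n and m; a minimal transcription under the same conventions is:

```python
import numpy as np

def alpha_aic(n, m, distance="L2"):
    """AIC-based threshold alpha_{n,m} used in the simulations."""
    return 0.6 / n * np.log((m + 1) / m) if distance == "L2" else 2.0 / n

def alpha_sbc(n, m, distance="L2"):
    """SBC-based threshold alpha_{n,m} used in the simulations."""
    return (0.6 * np.log(n) / n * np.log((m + 1) / m)
            if distance == "L2" else np.log(n) / n)
```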

For all simulations, we used 500 replications to compute the frequencies of an exact recovery of the true complexity. These are displayed in Tables 28, 29, 30, 31, 32, 33 and 34 to be found in Appendix B in the supplementary materials.

Clearly, the performance of the NNs cannot be directly compared to that of the other methods we have considered. One reason is that the NN framework is very different from the other approaches in terms of the training and testing procedures applied: the neural network needs to encounter as many different mixtures as possible in order to learn the latent structure, while the other techniques (except for the Hankel approach) seek the best fit for a given sample from the target mixture. We still report the NN results along with the performances of the other methods, with a few reservations, so that the reader can put together the full picture. Several points need to be borne in mind when reading and interpreting the NN performance:

  • the values reported are the predicted class probabilities that reflect how sure the network is that an observation (one of the tested mixtures) belongs to each of the represented classes;

  • the bin \(\ge 5\) can be slightly misleading in this case, as it reflects the summed probability for the classes with 5 and 6 components; in a few instances the probability of one of the other classes is larger than each of the probabilities for 5 and 6 taken separately but smaller than their sum; still, the results as given provide an insight into how well the NN can tell the classes apart;

  • a sample size of 10000 was used for computing the input features for the NN, thus the results are reported in the respective table cell;

  • 10000 samples were used to train the NN;

  • the accuracy measure, which shows in how many instances the NN was able to correctly predict the true number of components, is reported separately.

The accuracy measure was computed by generating 100 different samples from one of the selected mixtures (these did not occur in the training set), and the percentage of times the network gave out the correct prediction was computed.

It follows from the simulations that the bootstrap modification of the Hankel matrix determinants method shows reasonable results for the geometric and Poisson mixtures but fails to produce accurate estimates for all 2-component and 3-component Gaussian mixtures, persistently underestimating the number of components. This underestimation is likely the result of the “exploding” moments and determinants issue inherent in the Hankel matrices of the mixing distribution for finite normal mixtures. The issue was covered in Sect. 3, and the simulation study illustrates it once again.

For a well-separated 2-component Gaussian mixture all methods (except for the Hankel matrix determinants approach with bootstrap) perform well even for a sample size as small as \(n=50\). The BIC and ICL methods demonstrate high performance in this setting (estimating the number of components correctly in more than 95% of the cases for the small sample sizes \(n=50,100\) and in 100% for larger ones) and so do the minimum Hellinger distance methods. For the estimation of mixture complexity via the minimum Hellinger distance procedures (the \(L_2\) distance approach is not applicable in the case of continuous distributions) we used a KDE with bandwidth 0.5. The Hankel matrix determinants modification with bootstrap and scaling shows slightly inferior results for small and average sample sizes when compared to BIC, ICL and the minimum Hellinger distance methods but performs well for sample sizes \(n \ge 5000\). The LRT approach, for the available sample sizes \(n \in \{50, 100, 500, 1000\}\), shows more modest results in this setting, although the relative frequency of correctly estimated cases is still high. For a well-separated Gaussian mixture with a low mixing proportion for one of the components the LRT method works well irrespective of the sample size, estimating the number of components correctly in more than 90% of the instances for \(n \ge 100\) and outperforming many other methods (except for BIC and ICL) for small sample sizes. The performance of all techniques drops significantly for mixtures that are not well separated (e.g. a 2-component Gaussian mixture with overlapping regions). The NN extension of the Hankel method demonstrates good results in all 3 cases for the 2-component Gaussian mixtures, even when the modes are located close to each other, which manifests itself in the estimated class probabilities as well as in high accuracy rates of 100%, 96% and 100%, respectively.

The BIC and ICL methods, which proved superior to the other methods in many scenarios for the Gaussian mixtures, do not perform as well for the mixtures of geometric distributions. For the 2-component mixtures the simulation results indicate that the minimum Hellinger distance with bootstrap and the LRT techniques tend to outperform all other tested methods. The penalized Hellinger and \(L_2\)-distance methods perform well whenever the mixture is well-separated (the weights do not have to be well-balanced, however) and \(n \ge 500\). For small and medium sample sizes the minimum distance-based methods with penalties show rather poor results. The techniques using the Hankel matrix determinants also enjoy high performance for well-separated mixtures when the mixing proportions are similar or the number of observations is large, identifying the number of components correctly in 95% of the cases. The NNs also demonstrate less confidence when applied to the geometric mixtures, as indicated by a higher level of spread in the estimated class probabilities, although the accuracy rate of 59% can be considered a rather decent success measure for the challenging first scenario. The accuracy amounts to 100% for the less demanding case with the well-separated mixture.

Generally, for \(n \ge 5000\) observations all methods, with the exception of the minimum \(L_2\) distance approach using the SBC-based penalty, allow for correct estimation in at least \(90\%\) of the cases.

For well-separated 2-component Poisson mixtures all methods perform well even for very small sample sizes. The BIC, ICL and the two penalized MHD methods often demonstrate accuracy which is close to absolute. The minimum distance-based approaches with the AIC-based term seem to work better than those with the SBC-based penalty whenever the mixtures are either not well separated or one of the components is scarcely represented. Whenever the mixture is less challenging, the approaches involving the SBC-based penalty tend to achieve higher accuracy than those involving the AIC penalty. Also, choosing the Hellinger distance for the minimum distance techniques seems to yield slightly better performance than using the \(L_2\) distance (whenever the components are not too close). In the case of well-separated 2-component Poisson mixtures the scaled bootstrap version of the Hankel matrix approach seems to outperform the bootstrap modification. For a well-separated mixture with a low mixing proportion of one of the components the LRT and the scaled version of the Hankel matrix determinants techniques seem to be an appropriate choice. Approaches using the minimum \(L_2\) distance perform rather poorly for all sample sizes in this setting, while all Hellinger distance-based approaches provide good estimates when the number of observations is large. The NNs seem to be able to learn the 2-component Poisson mixtures well: the accuracy rates for 3 out of the 4 scenarios are 100%, the only exception being the mixture with very closely located components, where the NN fails.

For a well-separated 3-component Gaussian mixture with similar mixing proportions the BIC and ICL methods show outstanding performance even for small sample sizes. All methods (except for the Hankel matrix determinants with bootstrap) perform relatively poorly for small sample sizes and very well for sample sizes \(n\ge 5000\), where their accuracy reaches 95%. The Hankel matrix method with bootstrap and scaling again shows slightly worse results for small and average sample sizes but performs well for sample sizes \(n \ge 5000\). The minimum Hellinger distance approach with the AIC-based term shows better performance than the other estimators from the same group for average sample sizes. The LRT method delivers poor results for small sample sizes but achieves more than 95% accuracy already for \(n=500\). For a well-separated mixture with a low mixing proportion for one of the components the picture is similar, with a slightly lower performance. The BIC method outperforms the other methods, although it gives correct predictions in more than 80% of the cases only for \(n \ge 5000\). The LRT method seems to outperform all other methods (except for the BIC) for small sample sizes, providing the correct estimate in almost 70% of the instances for \(n=500\). In general the results of almost all methods are rather poor for small sample sizes, improving as the number of observations grows. For average sample sizes the minimum Hellinger distance with the AIC-based term shows a significant improvement, estimating the mixture complexity correctly in more than 70% of the cases, and all Hellinger distance-based methods perform well for sample sizes \(n \ge 5000\). The NN again demonstrates good results in all 3 cases for the 3-component Gaussian mixtures, achieving accuracy rates of 90%, 96% and 100%, respectively. For a mixture that is not well separated all methods perform quite poorly, being able to identify only 2 components out of 3 in most of the instances. The LRT method, for the available sample sizes, shows estimation results which are only marginally better than those achieved by the other methods. The BIC approach also performs poorly in this case, often either underestimating or overestimating the number of components in the mixture, but still shows better results when compared to the other methods.

Geometric mixtures with 3 components seem to be a challenging task for all methods considered in this survey. The minimum Hellinger distance approach with bootstrap and the LRT technique are still able to show better results for small and average sample sizes when compared to the other methods whenever the mixtures are well-separated, and it is likely that these methods would show better performance for large sample sizes. For the first 3-component mixture, which is not well-separated, none of the methods estimates the complexity correctly, systematically underestimating it. In this scenario the methods with the AIC-based penalty once again show better performance than their counterparts using the SBC-based thresholds. The group of distance-based approaches as well as the cluster of Hankel matrix techniques also perform poorly in this setting, even when the sample size is large and the mixture is well-separated. The 3-component mixtures seem to be challenging for the NNs as well, causing the accuracy rate for the two scenarios to drop to 65% and 44%, respectively.

In the case of well-separated 3-component Poisson mixtures the BIC method achieves perfect performance for \(n \ge 5000\), while the LRT technique outperforms all other tested methods for small sample sizes; one could expect it to be very effective for larger sample sizes as well. The ICL persistently underestimates the number of components in the 3-component mixture even when the components are far apart. Whenever the Poisson mixture is well-separated, the modification of the Hankel matrix approach with bootstrap tends to be more accurate than the Hankel matrix approach with bootstrap and scaling. All methods relying on the minimum Hellinger distance estimate the complexity correctly in more than 90% of the instances for \(n \ge 1000\), and the LRT techniques perform well for average sample sizes. The bootstrap modification of the Hankel matrix determinants approach and the minimum \(L_2\) distance-based methods also perform well whenever the mixture is well-separated and the mixing proportions are approximately the same for all components, with no component dominating the others, provided the number of observations is large enough. For average sample sizes and “unbalanced” mixtures the results of these methods are rather poor when compared to the previously listed approaches. The NN achieves accuracy rates of 83% and 76% for the two well-separated mixtures, but is not able to correctly predict the order of the mixture with very closely located components.

Poisson mixtures with 4 components seem to be quite a challenge for all methods, and a large number of observations is needed to obtain decent performance.

7 Applications to Real Data

In this section we demonstrate how the approaches discussed in the previous sections perform on real data. Wherever applicable, the bootstrap sample size is taken to be 1000; for the LRT-based approach, the observed LRT statistic is always compared with the \(95\%\)-quantile, while for the other methods using the bootstrap (the methods relying on the Hankel matrix determinants and the minimum distance-based procedures), the \(2.5\%\)- and \(97.5\%\)-quantiles are used.

We will begin the analysis by considering the Old Faithful Geyser Data, first published in [4]. The data comprises waiting times between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The waiting times between the eruptions of the Old Faithful geyser are assumed to be well described by a finite Gaussian mixture model.

Unfortunately, the techniques based on the Hankel matrix determinants do not appear to be an appropriate choice in this setting. To make use of the translation non-parametric Hankel approach (as outlined in Section 3.1 of [21]), one should make sure that the assumption of equal standard deviations across the mixture components holds, which does not seem to be the case for the Old Faithful data. The NN extension cannot be used for these data either, as for the NN training we assumed a known variance of 1 for all components in the mixture, which is not the case here.

The BIC and ICL methods applied to the Old Faithful data result in different estimates of the optimal number of components: the maximal BIC value corresponds to 3 groups while the ICL chooses 4. The BIC and ICL values for these data can be found in Table 14.

Table 14 BIC and ICL values for the Old Faithful Data

Implementing the minimum Hellinger distance method with an automatically selected bandwidth for the KDE and the AIC-based penalty, one obtains that the optimal mixture should have 2 components. The resulting estimated 2-component mixture (in green) as well as its 2 components (in dark red) are plotted in Fig. 2 along with the empirical distribution of the waiting times.

Fig. 2 Estimated mixture model for the Old Faithful dataset (\(\text {MHD}_{bt}\))

Changing the penalty term from the AIC-based to the SBC-based one yields a slightly different parameter vector estimate; however, the choice of the number of components remains the same. The differences of the squared Hellinger distance values and the AIC/SBC-based thresholds can be found in Table 15. The minimum Hellinger distance method with bootstrap and the LRT approach also yield a 2-component mixture.

The minimum distance methods and the LRT attain slightly different parameter estimates (the estimated parameters for the BIC and ICL approaches are similar to those of the LRT, as in all these cases the MLE is used). The results have been gathered into a single table (Table 16) for ease of comparison.

Table 15 Hellinger distance measures and thresholds for the Old Faithful dataset
Table 16 Estimated parameter vectors for the distance-based and likelihood-based approaches

We now consider the data taken from the 1952 annual report of a pension fund, containing information on the number of children of 4075 widows entitled to the fund's support. The dataset first appeared in [67]. The data do not appear to be simply a random sample from a Poisson distribution, as the number of zeros (widows with no children) appears to be too large. This issue was treated in [67] by fitting a mixture of two processes, one of which is a Dirac distribution at 0 while the other follows a Poisson distribution. We attempted to fit a mixture of Poisson distributions to the data using several of the discussed approaches in order to verify the above-mentioned population heterogeneity assumption.

The BIC and the ICL methods estimate 2 and 1 components, respectively, for this data set; the values are given in Table 17.

Table 17 BIC and ICL values for the Children Data

The estimated parameters are (0.66, 0.34; 0.031, 1.115) for the BIC approach and (1, 0.4) for the ICL.

The non-parametric non-scaled and scaled Hankel matrix approaches with respective penalty terms \(\frac{m \log (n)}{\sqrt{n}}\) and \(m \log (n)\) both yield an estimated number of components in the Poisson mixture equal to 2, as can be seen from Table 18.

Table 18 Absolute values of non-scaled Hankel matrix determinants for the Children Data

The parametric non-scaled Hankel matrix determinants approach with bootstrap, as well as the corresponding scaled version, yields a 2-component mixture of Poisson distributions with maximum likelihood parameter estimates (0.66, 0.34, 0.03, 1.11). This agrees with the population heterogeneity hypothesis of [67]. The estimated Poisson mixture along with the empirical distribution is shown in Fig. 3.

Fig. 3 Estimated Mixture Model for the Children Dataset (\(\text {HM}_{bt}\))

The minimum Hellinger and \(L_2\) distance methods with AIC-based and SBC-based penalties and bootstrap, as well as the LRT approach, also yield the same number of components and very similar parameter estimates. Table 19 displays the differences of the squared Hellinger and \(L_2\) distances along with the corresponding thresholds.

Table 19 \(\Delta _m\) for Hellinger and \(L_2\) distances and thresholds for the Children dataset

The NN predicts 2 components with very high probability; the estimated class probabilities output by the NN are reported in Table 20.

Table 20 Predicted class probabilities for the Children Data

Thus all methods (except for the ICL) applied to these data agree on the optimal number of components, identifying it as 2 and thereby confirming the heterogeneity hypothesis.

The dataset used for fitting a mixture of geometric distributions is the Shakespeare dataset, analyzed in the seminal work [24], which comprises counts of how many times each word was used by William Shakespeare in his writings. The data are organized as follows: the number of words Shakespeare used only once is 14376, the number of words that occur exactly 10 times in his writings is 363, and so on. The goal set in [24] was to use the observed word frequencies to estimate the unobserved number of words that Shakespeare knew but did not use in his writings. This problem is known under the name of “species richness” and can be solved using a variety of approaches. One such approach was considered in [6], where the theoretical rationale for using a mixture of geometric distributions in such a setting is laid out.

The BIC and ICL methods both select 2 as the optimal complexity for the Shakespeare dataset, as can be deduced from Table 21, which contains the BIC and ICL values for \(1, \dots , 5\) components.

Table 21 BIC and ICL values for the Shakespeare Data

Both the non-parametric non-scaled and the scaled Hankel matrix techniques estimate the number of components as 2, as can be seen from Table 22.

Table 22 Absolute values of non-scaled Hankel matrix determinants for the Shakespeare Data

The parametric Hankel matrix determinants approach with bootstrap applied to the Shakespeare data yields an estimate of 2 for the number of components in the mixture of geometric distributions. The same number of components is obtained when the scaled version of this approach is used.

In contrast to the Hankel matrix determinants procedures, the minimum Hellinger and \(L_2\) distance methods with AIC-based and SBC-based penalties and bootstrap suggest 3 components, with slightly different parameter estimates: (0.4148, 0.3758, 0.2094, 0.8406, 0.2902, 0.0490) for the Hellinger distance and (0.3839, 0.3870, 0.2291, 0.8622, 0.3203, 0.0523) for the \(L_2\) distance. The squared differences for the Hellinger and \(L_2\) distances and the AIC/SBC-based thresholds can be found in Table 23.

Table 23 \(\Delta _m\) for Hellinger and \(L_2\) distances and thresholds for the Shakespeare dataset

The estimated mixture of geometric distributions using the Hellinger distance approach with bootstrap, all of its components and the empirical distribution are plotted in Fig. 4.

Fig. 4 Estimated Mixture Model for the Shakespeare Dataset (\(L_2E_{AIC}\))

The LRT approach results in a 3-component mixture with the estimated parameter vector (0.4270, 0.3744, 0.1986, 0.8311, 0.2741, 0.0446).

The minimum Hellinger and \(L_2\) distance methods and the LRT approach thus agree with each other on the number of components and produce very similar parameter estimates, in contrast with the Hankel matrix and information-criterion procedures.
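To make the reported parameter vectors concrete, the snippet below evaluates the pmf of such a fitted three-component geometric mixture. It assumes the ordering \((\pi _1, \pi _2, \pi _3, p_1, p_2, p_3)\) with geometric success probabilities \(p_j\) and support \(\{1, 2, \dots \}\); this parameterization is an assumption and should be checked against the convention used in the survey.

```python
import numpy as np

def geometric_mixture_pmf(x, params):
    """pmf of a finite mixture of geometric distributions on {1, 2, ...};
    `params` = (pi_1, ..., pi_m, p_1, ..., p_m). The parameterization is
    assumed for illustration and may differ from the survey's convention."""
    params = np.asarray(params, dtype=float)
    m = len(params) // 2
    weights, probs = params[:m], params[m:]
    x = np.asarray(x, dtype=float)[:, None]
    return (weights * probs * (1.0 - probs) ** (x - 1)).sum(axis=1)

# LRT-based estimates for the Shakespeare data as reported above.
theta_lrt = (0.4270, 0.3744, 0.1986, 0.8311, 0.2741, 0.0446)
print(geometric_mixture_pmf([1, 2, 3], theta_lrt))  # pmf at the first counts
```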

The NN predicts the maximal possible number of components for the Shakespeare data, which is 6, with the class probabilities reported in Table 24.

Thus, for the Shakespeare data the various methods do not agree on the optimal number of components, with estimates ranging from 2 to 6 components, and they also provide different estimates of the model parameters. The final model choice in this case is left to the researcher, who might have an insight into which number of components makes the most sense (e.g. according to the parts of speech the different words belong to). Table 25 summarizes the estimates for all reviewed methods and all datasets.

Table 24 Predicted class probabilities for the Shakespeare Data
Table 25 Results of all reviewed methods used on the real-world datasets

8 Other Work on Mixture Complexity Estimation

The current survey certainly does not cover all the existing methods for estimating the complexity of a finite mixture.

Before mentioning some other interesting references, we would like to draw the reader’s attention to the seminal work of Bruce Lindsay which, among other things, brought a very novel way of viewing the nonparametric maximum likelihood estimator (NPMLE) of a general mixture distribution; see [44] and [45]. The novelty resides in considering this estimator from a geometric perspective. One of the most important results deriving from it is that, under some simple conditions, the NPMLE of a general mixture is a finite mixture with complexity not exceeding the number of distinct observations. Not surprisingly, all the papers on the methods reviewed and implemented in this survey refer to one of Lindsay’s works on mixture models. This continues to hold true for the other approaches we would now like to mention and which we believe are worth bringing to the reader’s attention.

There have been many papers in which penalization of some goodness-of-fit criterion was proposed, with rigorous proofs of consistency or some other guarantee for the resulting estimator of the true number of components. In [42] the penalized NPMLE was considered with a penalization function \(\alpha _{n, m}\) satisfying \(\alpha _{n, m+1} > \alpha _{n, m}\) and \(\limsup _{n \rightarrow \infty } \alpha _{n, m}/n =0\). Under some regularity conditions on the mixture model, it is rigorously shown that the estimator is at least equal to the true number of components with probability 1 as the sample size \(n \rightarrow \infty \). The method requires only computation of the NPMLE for a number of values of m, which can be done using, for example, a support reduction algorithm as described in [70] or [33]. As it is not known whether this penalized NPMLE is actually consistent, [18] constructed a penalized minimum-distance estimator which can be thought of as a precursor of the minimum distance estimators reviewed in Sect. 4. Two main differences are to be noted though: firstly, [18] consider distances between distribution functions instead of densities; secondly, the penalization function takes the form \(-c_n \sum _{j=1}^m \log \pi _j\), where \(\pi _j, j=1, \ldots , m\) are the mixing probabilities and \((c_n)_n\) is a sequence converging to 0 as \(n \rightarrow \infty \). Consistency of the penalized minimum distance estimator of the true complexity is shown when one chooses \(c_n\) such that the distance between the empirical distribution function and the true mixture distribution is \(O(c_n)\) almost surely.

In the special case where one wants to decide between \(m_0=1\) (homogeneity) and \(m_0=2\), [17] propose a method based on modifying the likelihood ratio test. The modification operates first on the log-likelihood function by adding a negative penalty of the form \(C \log (4 \pi (1-\pi ))\), with \(\pi >0\) the mixing probability and \(C > 0\) some chosen constant. The penalty clearly discourages the MLE, under the null hypothesis, from fitting a mixing probability that is close to 0. The modification is motivated by the desire to overcome the issues associated with the nesting \({\mathcal {F}}_1 \subset {\mathcal {F}}_{2}\) (the boundary problem and the non-uniqueness of the representation of the null hypothesis) already described in some detail in Sect. 5.
If \(\pi f_{\phi _1} + (1-\pi ) f_{\phi _2}\) is the mixture density, and if we denote by \(l_n\) the modified log-likelihood, then the modified LRT statistic is given by \(2 \left( l_n({\hat{\pi }}, {\hat{\phi }}_1, {\hat{\phi }}_2) - l_n(1/2, {\hat{\phi }}, {\hat{\phi }})\right) \), where \(({\hat{\pi }}, {\hat{\phi }}_1, {\hat{\phi }}_2)\) and \({\hat{\phi }}\) are the MLEs under the alternative and null hypotheses, respectively. One very interesting theoretical result is that this statistic is shown to converge weakly to a (1/2) : (1/2) mixture of a Dirac mass at 0 and a \(\chi ^2_{(1)}\) distribution. This limit distribution can then be used to construct an asymptotic critical region for rejecting homogeneity. As for the constant C, it is recommended to take \(C = \log M\) if it is believed that \(\phi _1, \phi _2 \in [-M, M]\), although the results do not seem to be very sensitive to other choices of C.
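A sketch of this modified LRT for homogeneity, written here for Poisson components purely as an illustration (the method of [17] is more general), could look as follows. The numerical optimization, the starting values and the default value of C are assumptions of this sketch; as noted above, [17] recommend \(C = \log M\) when the component parameters are believed to lie in \([-M, M]\).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson, chi2

def modified_loglik(x, pi, lam1, lam2, C):
    """Mixture log-likelihood plus the penalty C * log(4 * pi * (1 - pi))."""
    mix = pi * poisson.pmf(x, lam1) + (1.0 - pi) * poisson.pmf(x, lam2)
    return np.sum(np.log(mix)) + C * np.log(4.0 * pi * (1.0 - pi))

def modified_lrt(x, C=1.0):
    """Modified LRT of one vs. two Poisson components; asymptotic p-value
    based on the (1/2) * delta_0 + (1/2) * chi^2_1 limit distribution."""
    x = np.asarray(x, dtype=float)
    # Null fit: single Poisson, MLE is the sample mean; penalty is 0 at pi = 1/2.
    lam0 = x.mean()
    ll0 = modified_loglik(x, 0.5, lam0, lam0, C)
    # Alternative fit: maximize the penalized log-likelihood numerically
    # from a couple of starting points (a crude strategy for the sketch).
    def neg(par):
        pi, lam1, lam2 = par
        if not (1e-6 < pi < 1 - 1e-6 and lam1 > 1e-8 and lam2 > 1e-8):
            return np.inf
        return -modified_loglik(x, pi, lam1, lam2, C)
    starts = [(0.5, 0.5 * lam0 + 1e-3, 1.5 * lam0 + 1e-3),
              (0.3, 0.1 * lam0 + 1e-3, 2.0 * lam0 + 1e-3)]
    best = min((minimize(neg, s, method="Nelder-Mead") for s in starts),
               key=lambda r: r.fun)
    stat = max(0.0, 2.0 * (-best.fun - ll0))
    # Limit: point mass at 0 with prob 1/2, chi^2 with 1 df with prob 1/2.
    pval = 0.5 * chi2.sf(stat, df=1) if stat > 0 else 1.0
    return stat, pval
```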

Bayesian approaches have also been used to make inference about the number of components of a mixture. In this framework, this number is viewed as a random variable drawn from some prior distribution, and the corresponding posterior distribution is then derived and subsequently used for inference purposes. For mixtures of Gaussian distributions whose means and variances are regarded as random variables drawn from a Dirichlet process, we refer to the work of [25]; the posterior distribution can be approximated using Monte Carlo methods. In [60] the authors make use of reversible jump Markov chain Monte Carlo methods to conduct a more general Bayesian analysis while restricting attention to Gaussian mixtures. In their analysis, the authors place themselves in the setting where no strong prior information on the components of such a mixture is available. In their work, the reader finds a thorough sensitivity analysis, including the dependence of the posterior distribution of the number of components on the chosen prior for the means and variances. In the very interesting work of [16] the authors study the frequentist properties of Bayesian estimators of the true order in nested models, including mixture models. Bounds on underestimation and overestimation by the Bayesian estimators are obtained. In particular, it is shown that, under some regularity conditions, the probability of underestimation decays exponentially. For further articles using Bayesian theory for clustering, we refer to [30, 51, 53, 61] and the references therein.

We finish this section by drawing the reader’s attention to the existence of a whole body of literature on mixture estimation and clustering at the intersection of Statistics and Computing. This includes research papers on extensions or modifications of the famous EM-algorithm and on the numerical implementation of various information criteria. We refer to [15], where an entropy criterion was considered, derived from a simple relationship between the likelihood and the classification likelihood of a mixture. In [27] an algorithm based on the Minimum Message Length (MML) criterion was implemented with the aim of selecting the best overall mixture model given the observed data (using a variant of the EM-algorithm). The very recent paper [32] presents a novel form of cross-validation which is adaptive to the data. The paper contains an excellent literature review, and the ideas discussed there are highly relevant, especially in connection with the question of how to choose the penalty function in the penalized methods reviewed above.

9 Some Conclusions

As acknowledged in the literature on mixture model estimation and supported by the results presented in Sect. 6, estimation of the number of components in a mixture distribution is a challenging task. None of the methods examined in the previous sections of the present survey can be regarded as a reliable universal tool that can be applied in any setting without second thoughts.

The widespread use of the BIC and ICL techniques among practitioners is justified. These methods are easy to implement, do not require much computational time and outperform the other methods reviewed here in many settings. Whenever ICL tends to underestimate the number of components, which happens when the components are not well separated, BIC does not seem to exhibit the same behavior, producing more reliable estimates of the mixture order.

The LRT approach appears to be beneficial in terms of accuracy for small and average sample sizes, in particular in settings where the mixture is not well separated, where some components noticeably dominate the others, or where the actual number of components in the mixture is large (e.g. 3 or 4 components). The disadvantage of this technique is the large amount of computation time it requires.

The bootstrap methods perform better on average than their penalty-based counterparts. In most of the settings the MHD approach with bootstrap is more accurate than both MHD with the AIC-based penalty term and MHD with the SBC-based term. The undeniable advantage of the procedures using the bootstrap is that the obtained estimates do not depend on the form of the penalty term, so errors resulting from a poor choice thereof can be avoided. The disadvantage, however, is the computational intensity of the procedure.

In settings where the mixture has well-separated components, the LRT and MHD-with-bootstrap approaches provide an improvement over the other methods whenever the number of components is more than 2. If a mixture comprises only 2 components and many observations are available, any of the methods can be applied.

The distance-based methods and the LRT approach can be used when the parameter values need to be estimated as well. If there is no such requirement, methods based on the Hankel matrix of moments of the mixing distribution can be used. For small sample sizes the scaled version of the Hankel matrix-based method achieves better performance than the other methods, but in general it provides neither better accuracy than the other methods nor a significant benefit in terms of computational time.