
1 Dual Representation of the \(\varphi \)-Divergences and Tests

We consider CASM divergences (see [15] for definitions and properties):

$$\begin{aligned} D_\varphi (Q,P) = {\left\{ \begin{array}{ll} \int \varphi (\frac{dQ}{dP}) dP \text { if } Q \ll P \\ + \infty \text { otherwise} \end{array}\right. } \end{aligned}$$

where Q and P are probability measures on the same probability space. Extensions to divergences between probability measures and signed measures can be found in [22]. Dual formulations of divergences can be found in [7, 16]. Another interpretation of these formulations can be found in [9, Section 4.6]. They are widely considered in statistics, data analysis and machine learning (see e.g. [4, 20]).

As in [1], let \(\mathcal F\) be some class of \(\mathcal B\)-measurable (Borel) real valued functions and let \(\mathcal M_{\mathcal F} = \{P\in \mathcal M : \int |f| dP < \infty , \forall f\in \mathcal F\}\), where \(\mathcal M\) is the space of probability measures. Let \(P^*\in \mathcal M\) be arbitrary; in the following sections it will be the underlying true unknown probability law in a statistical context. Assume that \(\varphi \) is differentiable and strictly convex. Then, for all \(P\in \mathcal M_{\mathcal F}\) such that \(D_\varphi (P,P^*)\) is finite and \(\varphi '(dP/dP^*)\) belongs to \(\mathcal F\), \(D_\varphi \) admits the dual representation (see Theorem 4.4 in [6]):

$$\begin{aligned} D_\varphi (P,P^*) = \sup _{f\in \mathcal F} \int f dP - \int \varphi ^\#(f) dP^*, \end{aligned}$$
(1)

where \(\varphi ^\#(x) = \sup _{t\in \mathbb R} tx-\varphi (t)\) is the Fenchel-Legendre convex conjugate. Moreover, the supremum is uniquely attained at \(f = \varphi '(dP/dP^*)\).

This result can be used in two directions. First, a statistical model, e.g. a parametric model \(\{P_\theta : \theta \in \varTheta \}\) in which each \(P_\theta \) is absolutely continuous with respect to some dominating measure \(\mu \), naturally induces a family \(\mathcal F = \{\varphi '(p_\theta /p_{\theta '}) : \theta ,\theta '\in \varTheta \}\). This is the main framework of this paper.

Conversely, a class of functions \(\mathcal F\) determines which pairs of distributions P and Q can be compared, namely those such that \(\varphi '(dP/dQ)\in \mathcal F\), and it induces a divergence \(D_\varphi \) on these pairs. A typical example is the logistic model.

The KLm divergence is defined by the generator \(\varphi : x\in \mathbb R^{+*} \mapsto -\log x + x -1\) and leads to the maximum likelihood estimator for both forms of estimation, one of which, the supremal estimator, is defined below (see Remark 3.2 in [7]).
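
For the reader's convenience, here is the short computation behind this claim, written as a sketch under the extra assumption that P and \(P^*\) are mutually absolutely continuous. The generator gives

$$\begin{aligned} \varphi '(x) = 1 - \frac{1}{x}, \qquad \varphi ^\#(y) = \sup _{t>0}\,\bigl \{ty + \log t - t + 1\bigr \} = -\log (1-y), \quad y<1. \end{aligned}$$

At the optimal dual function \(f = \varphi '(dP/dP^*) = 1 - dP^*/dP\) one has \(\int f dP = 0\) and \(\varphi ^\#(f) = \log (dP/dP^*)\), so the right-hand side of (1) reduces to \(\int \log (dP^*/dP) dP^* = D_\varphi (P,P^*)\). The second integral in (1) is thus an expected log-likelihood ratio under \(P^*\), which explains why replacing \(P^*\) by the empirical measure leads back to maximum likelihood.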

We consider in this paper the problem of testing the number of components in a mixture model. This question has been considered by various authors. [2, 10, 12, 14, 17] considered likelihood ratio tests and pointed out difficulties due to the fact that the likelihood ratio statistic is unbounded with respect to n. [17] prove that its distribution is driven by a \(\log \log n\) term in a specific simple Gaussian mixture model, so the test statistic needs to be calibrated accordingly. But first, as stated by [17], the convergence to the limit distribution is extremely slow, which makes this result impractical. And second, it seems very difficult to derive the corresponding term for a different model, let alone for a general situation.

Our approach to this problem is suggested by the dual representation of the divergence. For the KLm divergence, it amounts to considering the maximum likelihood estimator itself as a test statistic instead of the usual maximum value of the likelihood function. This leads to a well-defined limiting distribution for the test statistic under the null. This holds for a class of estimators obtained by substituting KLm by any regular divergence. This approach also eliminates the curse of irregularity encountered by many authors for the problem of testing the number of components in a mixture.

Since we are interested in composite hypotheses, there is no guarantee in this context that the likelihood ratio test is the best (in terms of uniform power), as is usually assumed (e.g. [8, 18]); moreover, [13] showed the difficulties that likelihood ratio tests can encounter in this context.

[8] considered tests based on an estimate of the minimum divergence between the true distribution and the null model. Here we make use of the uniqueness of the optimiser in the dual representation of the divergence (1) and of the supremal divergence estimator introduced by [7]. An immediate practical advantage of this choice compared to estimating the minimum divergence is that one less optimisation is needed. Moreover, [23] showed that this estimator is robust for several choices of the divergence.

Our procedure for composite hypotheses consists in the aggregation of simple tests in the spirit of [11]. [5] used a similar aggregation procedure for testing between two distributions under noisy data and obtained some control of the resulting test power.

2 Notation and Hypotheses

Let \(\{f_1(\,.\,;\theta _1):\theta _1\in \varTheta _1\}\), \(\varTheta _1\subset \mathbb R^p\), and \(\{f_2(\,.\,;\theta _2):\theta _2\in \varTheta _2\}\), \(\varTheta _2\subset \mathbb R^q\), be probability density families with respect to a \(\sigma \)-finite measure \(\lambda \) on \((\mathcal X, \mathcal B)\). For some fixed open interval \(]a,b[ \ni 0\), let \(\varTheta \subset ]a,b[ \times \varTheta _1 \times \varTheta _2\), and

$$\begin{aligned} g_{\pi ,\theta } = (1-\pi ) f_1(\,.\,;\theta _1) + \pi f_2(\,.\,;\theta _2) \end{aligned}$$

for any \((\pi ,\theta )\in \varTheta \) with \(\theta = (\theta _1,\theta _2)\).

Assume that \(x_1,\dots ,x_n\in \mathbb R\) have been observed and are modelled as a realisation of the i.i.d. sample \(X_1,\dots ,X_n\) whose distribution \(\mathbb P^* := g_{\pi ^*,\theta ^*}.\lambda \) is known up to the parameters \((\pi ^*,\theta ^*)\in \varTheta \). Our aim is to test the hypothesis \(H_0 : \pi ^* = 0\).
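
As a purely illustrative instance (the Gaussian components and the parameter values below are our own choices, not required by the theory), the mixture density \(g_{\pi ,\theta }\) and a sample under \(H_0\) can be encoded as follows.

```python
import numpy as np
from scipy.stats import norm

def g_mix(x, pi, theta1, theta2):
    """g_{pi,theta}(x) = (1 - pi) f1(x; theta1) + pi f2(x; theta2).

    Illustrative choice: f1 and f2 Gaussian, theta1 = (mu1, sigma1), theta2 = (mu2, sigma2),
    and lambda the Lebesgue measure on the real line.
    """
    mu1, sigma1 = theta1
    mu2, sigma2 = theta2
    return (1.0 - pi) * norm.pdf(x, loc=mu1, scale=sigma1) + pi * norm.pdf(x, loc=mu2, scale=sigma2)

# Under H0 (pi* = 0) the data are drawn from the single component f1(.; theta1*):
rng = np.random.default_rng(0)
x_sample = rng.normal(loc=0.0, scale=1.0, size=500)   # theta1* = (0, 1)
```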

Assume that \(g_{\pi , \theta } = g_{\pi ^*, \theta ^*} \Rightarrow \pi = \pi ^*, \theta _1 = \theta _1^* \text { and, if } \pi ^* \ne 0 \text {, } \theta _2 = \theta _2^*\).

Let g be a probability density with respect to \(\lambda \) such that \(Supp(g) \subset Supp(g_{\pi ,\theta })\) for any \((\pi ,\theta )\in \varTheta \) and such that

$$\begin{aligned} \forall (\pi ,\theta ) \in \varTheta , \int \left| \varphi '(\frac{g}{g_{\pi ,\theta }}) \right| g d\lambda < \infty . \end{aligned}$$

Let us define for any \((\pi ,\theta ) \in \varTheta \),

$$\begin{aligned} m_{\pi ,\theta } : x\in \mathcal X \mapsto \int \varphi '\Bigl (\frac{g}{g_{\pi ,\theta }}\Bigr ) g d\lambda - \varphi ^\# \Bigl ( \frac{g}{g_{\pi ,\theta }} \Bigr ) (x) \end{aligned}$$

and assume that \((\pi ,\theta )\mapsto m_{\pi ,\theta }(x)\) is continuous for any \(x\in \mathcal X \). Let us also assume that

$$\begin{aligned} \forall (\tilde{\pi },\tilde{\theta })\in \varTheta , \exists r_0>0 / \forall r \le r_0, \ P^*\, \bigl |\sup _{d((\tilde{\pi },\tilde{\theta }),(\pi ,\theta ))< r} m_{\pi ,\theta } \bigr | < \infty \end{aligned}$$

where \(d(\cdot ,\cdot )\) denotes the Euclidean distance and where, as usual, the operator-type notation \(\mathbb P^*Y\) denotes the expectation—with respect to the probability measure \(\mathbb P^*\)—of the random variable Y.

Theorem 1

For any \((\pi ^*,\theta ^*)\in \varTheta \)

$$\begin{aligned} D_\varphi (g.\lambda ,g_{\pi ^*,\theta ^*}.\lambda ) = \sup _{(\pi ,\theta )\in \varTheta } P^*m_{\pi ,\theta }, \end{aligned}$$

which we call the supremal form of the divergence. Moreover, the supremum is uniquely attained at \((\pi ,\theta ) = (\pi ^*,\theta ^*)\).

Definition 1

Let \(\mathbb {P}_{n}\) denote the empirical measure pertaining to the sample \(X_{1},\dots ,X_{n}\). Define

$$\begin{aligned} (\hat{\pi },\hat{\theta }) := \arg \max _{\left( \pi ,\theta \right) \in \varTheta }\mathbb {P}_{n}m_{\pi ,\theta } \end{aligned}$$

the supremal estimator of \(\left( \pi ^{*},\theta ^{*}\right) \).

The existence of \(( \hat{\pi },\hat{\theta })\) can be guaranteed by assuming that \(\varTheta \) is compact. When uniqueness does not hold, consider any maximiser. This class of estimators was introduced in [7].

3 Consistency of the Supremal Divergence Estimator

Let us first state the consistency of the supremal divergence estimator of the proportion and of the parameters of the existing component when the parameters of the non-existing component are held fixed, uniformly over the latter.

Here and below, by abuse of notation, we let \(\varphi '\bigl (\frac{g}{g_{\pi ,\theta }}\bigr )\) stand for \(x\mapsto \varphi '\bigl (\frac{g(x)}{g_{\pi ,\theta }(x)}\bigr )\), and so on.

Remark that, for \(\pi ^* = 0\) and any \(\theta _1^*\in \varTheta _1\) and \(\theta _2\in \varTheta _2\), we can unambiguously write \(m_{\pi ^*,\theta _1^*}\) for \(m_{\pi ^*,\theta _1^*,\theta _2}\) since the parameter \(\theta _2\) is not involved in the expression of \(m_{0,\theta _1^*,\theta _2}\).

Theorem 2

Assume that \(\pi ^* = 0\) and, for any \(\theta _2\in \varTheta _2\), let \((\hat{\pi }(\theta _2),\hat{\theta }_1(\theta _2))\in ]a,b[ \times \varTheta _1\) be such that

$$\begin{aligned} \inf _{\theta _2\in \varTheta _2}P_nm_{\hat{\pi }(\theta _2),\hat{\theta }_1(\theta _2),\theta _2} \ge P_nm_{\pi ^*,\theta _1^*} - o_{P^*}(1). \end{aligned}$$
(2)

Then

$$\begin{aligned} \sup _{\theta _2\in \varTheta _2} d\bigl ( (\hat{\pi }(\theta _2), \hat{\theta }_1(\theta _2)), (0, \theta _1^*) \bigr ) \xrightarrow [n\rightarrow \infty ]{P^*} 0. \end{aligned}$$

The convergence holds a.s. in the particular case of (2) when, a.s.,

$$\begin{aligned} \forall \theta _2\in \varTheta _2, (\hat{\pi }(\theta _2),\hat{\theta }_1(\theta _2)) \in \mathop {\textrm{argmax}}\limits _{(\pi ,\theta _1) \in ]a,b[ \times \varTheta _1} P_nm_{\pi ,\theta _1,\theta _2}. \end{aligned}$$
(3)
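
Since, for the KLm generator, the supremal estimator coincides with the maximum likelihood estimator and does not depend on g (see Sect. 6), a maximiser as in (3) can in that case be computed with a generic optimiser. The sketch below is illustrative only: the Gaussian components, the starting point and the box constraints are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mixture_density(x, pi, mu1, mu2):
    # Illustrative model: g_{pi,theta} = (1 - pi) N(mu1, 1) + pi N(mu2, 1), unit variances for simplicity.
    return (1.0 - pi) * norm.pdf(x, loc=mu1) + pi * norm.pdf(x, loc=mu2)

def pi_hat_klm(x, theta2, pi_bounds=(-0.2, 0.95), mu1_bounds=(-10.0, 10.0)):
    """Sketch of (3) for the KLm generator: maximise the average log-likelihood over
    (pi, theta1) in ]a,b[ x Theta_1 with theta2 (= mu2 here) held fixed."""
    def neg_criterion(par):
        pi, mu1 = par
        dens = mixture_density(x, pi, mu1, theta2)
        # clip keeps the log finite if a slightly negative pi makes the density vanish somewhere
        return -np.mean(np.log(np.clip(dens, 1e-300, None)))
    res = minimize(neg_criterion, x0=np.array([0.0, np.mean(x)]),
                   bounds=[pi_bounds, mu1_bounds], method="L-BFGS-B")
    return res.x  # (pi_hat(theta2), theta1_hat(theta2))
```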

4 Asymptotic Distribution of the Supremal Divergence Estimator

Under \(H_0\) (\(\pi ^* = 0\)), the joint asymptotic distribution of \((\hat{\pi }(\theta _2),\hat{\pi }(\theta _2'))\) is provided by the following theorem. The interior of \(\varTheta \) will be denoted by \(\mathring{\varTheta }\).

Theorem 3

Let \(\theta _2\in \varTheta _2\) and \(\theta _2'\in \varTheta _2\) be such that \((\pi ^*,\theta _1^*,\theta _2)\in \mathring{\varTheta }\) and \((\pi ^*,\theta _1^*,\theta _2')\in \mathring{\varTheta }\). Write

$$\begin{aligned} \varTheta (\theta _2)&= \{(\pi ,\theta _1)\in ]-\infty ,1[\times \varTheta _1 : (\pi ,\theta _1,\theta _2)\in \varTheta \} \\ \varTheta (\theta _2')&= \{(\pi ,\theta _1)\in ]-\infty ,1[\times \varTheta _1 : (\pi ,\theta _1,\theta _2')\in \varTheta \}, \end{aligned}$$

and let \((\hat{\pi },\hat{\theta }_1)\) and \((\hat{\pi }',\hat{\theta }_1')\) be such that

$$\begin{aligned} (\hat{\pi },\hat{\theta }_1)&\in \mathop {\textrm{argmax}}\limits _{(\pi ,\theta _1)\in \varTheta (\theta _2)} P_nm_{\pi ,\theta _1,\theta _2} \\ (\hat{\pi }',\hat{\theta }_1')&\in \mathop {\textrm{argmax}}\limits _{(\pi ,\theta _1)\in \varTheta (\theta _2')} P_nm_{\pi ,\theta _1,\theta _2'}. \end{aligned}$$

Assume that \(\pi ^* = 0\).

Moreover, assume that:

  • \((\pi ,\theta _1)\in \varTheta (\theta _2)\mapsto m_{\pi ,\theta _1,\theta _2}(x)\) (resp. \((\pi ,\theta _1)\in \varTheta (\theta _2')\mapsto m_{\pi ,\theta _1,\theta _2'}(x)\)) is differentiable \(\lambda \)-a.e. with derivative \(\psi _{\pi ,\theta _1} = \begin{pmatrix} \frac{\partial }{\partial \pi }m_{\pi ,\theta _1,\theta _2} \\ \frac{\partial }{\partial \theta _1}m_{\pi ,\theta _1,\theta _2} \end{pmatrix}_{|(\pi ,\theta _1)}\) (resp. \(\psi _{\pi ,\theta _1}' = \begin{pmatrix} \frac{\partial }{\partial \pi }m_{\pi ,\theta _1,\theta _2'} \\ \frac{\partial }{\partial \theta _1}m_{\pi ,\theta _1,\theta _2'} \end{pmatrix}_{|(\pi ,\theta _1)}\)) such that \(P^* \psi _{\pi ^*,\theta _1^*} = 0\) (resp. \(P^* \psi _{\pi ^*,\theta _1^*}' = 0\)).

  • \((\pi ,\theta _1)\in \varTheta (\theta _2) \mapsto P^* \psi _{\pi ,\theta _1}\) (resp. \((\pi ,\theta _1)\in \varTheta (\theta _2') \mapsto P^* \psi _{\pi ,\theta _1}'\)) is differentiable at \({\pi ^*,\theta _1^*}\) with invertible derivative matrix \(H = D{\left( P^*\psi \right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } \) (resp. \(H' = D{\left( P^*\psi '\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } \)).

  • \(\{\psi _{\pi ,\theta _1} : (\pi ,\theta _1)\in \varTheta (\theta _2) \}\) and \(\{\psi _{\pi ,\theta _1}' : (\pi ,\theta _1)\in \varTheta (\theta _2') \}\) are \(P^*\)-Donsker.

  • \(\int (\psi _{\hat{\pi },\hat{\theta }_1}(x) - \psi _{\pi ^*,\theta _1^*}(x))^2 dP^*(x) \xrightarrow []{P^*}0\) and \(\int (\psi _{\hat{\pi },\hat{\theta }_1}'(x) - \psi _{\pi ^*,\theta _1^*}'(x))^2 dP^*(x) \xrightarrow []{P^*}0\).

Assume that \(H {=} D{\left( P^*\psi \right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } {=} P^*D^2{\left( h\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } \) (resp. \(H' {=} D{\left( P^*\psi '\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } {=}\) \( P^*D^2{\left( h'\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } \)) with \(P^* |D^2{\left( h\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } | < \infty \) (resp. \(P^* |D^2{\left( h'\right) }_{\left| {\tiny \begin{pmatrix} \pi ^*\\ \theta _1^*\end{pmatrix}}\right. } | < \infty \)) where \(h : (\pi ,\theta _1)\in \varTheta (\theta _2)\mapsto m_{\pi ,\theta _1,\theta _2}(x)\) (resp. \(h' : (\pi ,\theta _1)\in \varTheta (\theta _2')\mapsto m_{\pi ,\theta _1,\theta _2'}(x)\)).

Then with \(a_n\) (resp. \(a_n'\)) being the (1, 1)-entry of the matrix \(H_n^{-1} \cdot \bigl (P_n\psi _{\hat{\pi },\hat{\theta }_1}\psi _{\hat{\pi },\hat{\theta }_1}^T\bigr ) \cdot H_n^{-1}\) (resp. of \(H_n'^{-1} \cdot \bigl ( P_n\psi _{\hat{\pi },\hat{\theta }_1}'\psi _{\hat{\pi },\hat{\theta }_1}'^T\bigr ) \cdot H_n'^{-1}\)), where \(H_n\) (resp. \(H_n'\)) denotes the Hessian matrix of \((\pi ,\theta _1) \mapsto P_nm_{\pi ,\theta _1,\theta _2}\) (resp. of \((\pi ,\theta _1) \mapsto P_nm_{\pi ,\theta _1,\theta _2'}\)) at the point \((\hat{\pi },\hat{\theta }_1)\) and is assumed to be invertible with high probability, one gets

$$\begin{aligned} \begin{pmatrix} \sqrt{\frac{n}{a_n}}(\hat{\pi }- \pi ^*)\\ \sqrt{\frac{n}{a_n'}}(\hat{\pi }' - \pi ^*)\end{pmatrix} \xrightarrow {\mathcal L} \mathcal N(0,U) \end{aligned}$$

with

$$\begin{aligned} U = \left( \begin{array}{cc} 1 &{} \frac{b}{\sqrt{a a'}} \\ \frac{b}{\sqrt{a a'}} &{} 1 \\ \end{array} \right) \end{aligned}$$
(4)

where b is the (1, 1)-entry of the matrix \(H^{-1} \cdot \bigl (P^*\psi _{\pi ^*,\theta _1^*}\psi _{\pi ^*,\theta _1^*}'^T\bigr ) \cdot H'^{-1}\), and a (resp. \(a'\)) the (1, 1)-entry of the matrix \(H^{-1} \cdot \bigl (P^*\psi _{\pi ^*,\theta _1^*}\psi _{\pi ^*,\theta _1^*}^T\bigr ) \cdot H^{-1}\) (resp. of \(H'^{-1} \cdot \bigl (P^*\psi _{\pi ^*,\theta _1^*}'\psi _{\pi ^*,\theta _1^*}'^T\bigr ) \cdot H'^{-1}\)).

This result naturally generalises to k-tuples. The marginal result for \(\theta \) actually also holds when \(\pi ^* > 0\) and \(\theta _2 = \theta _2^*\), which is useful to control the power of the test procedure to be defined.

Let us consider as a test statistic \(T_n = \sup _{\theta _2} \sqrt{\frac{n}{a_n}} \hat{\pi }\) and let us reject \(H_0\) when \(T_n\) takes large values. It seems sensible to standardise \(\hat{\pi }\) for each value of \(\theta _2\) so that, under \(H_0\), it is asymptotically distributed as a \(\mathcal N(0,1)\) and the (standardised) values of \(\hat{\pi }\) for different values of \(\theta _2\) can be compared. In practice, the asymptotic variance has to be estimated, hence the substitution of \(\hat{\pi }\), \(\hat{\theta }_1\), and \(P_n\) for \(\pi ^*\), \(\theta _1^*\), and \(P^*\) in \(H^{-1} P^* \psi _{\pi ^*,\theta _1^*}\psi _{\pi ^*,\theta _1^*}^T H^{-1}\). This choice is justified in [19].
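
A minimal numerical sketch of this plug-in (assuming a vectorised implementation of \(m_{\pi ,\theta _1,\theta _2}\) with \(\theta _2\) fixed, and using central finite differences in place of analytic derivatives, an implementation shortcut of our own) could look as follows.

```python
import numpy as np

def sandwich_a_n(m, par_hat, x, eps=1e-4):
    """Plug-in estimate a_n of Theorem 3: (1,1)-entry of H_n^{-1} (P_n psi psi^T) H_n^{-1}.

    m(par, x): returns the vector (m_{par}(x_1), ..., m_{par}(x_n)) for fixed theta2,
               where par = (pi, theta1 components) are the free parameters;
    par_hat:   the supremal estimate (pi_hat, theta1_hat);
    x:         the observed sample.
    """
    par_hat = np.asarray(par_hat, dtype=float)
    k, n = par_hat.size, len(x)

    # psi: per-observation gradient of par -> m_par(x_i) at par_hat (central differences)
    psi = np.zeros((n, k))
    for j in range(k):
        e = np.zeros(k); e[j] = eps
        psi[:, j] = (m(par_hat + e, x) - m(par_hat - e, x)) / (2 * eps)

    # H_n: Hessian of par -> P_n m_par at par_hat
    def emp(par):
        return np.mean(m(par, x))
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = eps
            ej = np.zeros(k); ej[j] = eps
            H[i, j] = (emp(par_hat + ei + ej) - emp(par_hat + ei - ej)
                       - emp(par_hat - ei + ej) + emp(par_hat - ei - ej)) / (4 * eps ** 2)

    H_inv = np.linalg.inv(H)
    S = psi.T @ psi / n                     # P_n psi psi^T
    return (H_inv @ S @ H_inv)[0, 0]        # a_n
```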

The Bonferroni aggregation rule is not sensible here since the tests for different values of \(\theta _2\) are obviously not independent, so that such a procedure would lead to a conservative test. Hence the need, in Theorem 3, for the joint asymptotic distribution, so as to take into account the dependence between the values of \(\hat{\pi }\) for different \(\theta _2\). This leads to the study of the asymptotic distribution of \(T_n\), which should be the distribution of \(\sup W\) where W is a Gaussian process whose covariance structure is given by Theorem 3. This is proved in the next section.

5 Asymptotic Distribution of the Supremum of Supremal Divergence Estimators

\(H_0\) is assumed to hold in this section.

We first state that the asymptotic distribution of \(T_n\) is that of the supremum of a Gaussian process with covariance \(\frac{b}{\sqrt{aa'}}\), as in (4).

We then state that the distribution of the latter can be approximated by maximising, over a finite grid of values of \(\theta _2\), the Gaussian process with covariance \(\frac{b_n}{\sqrt{a_na_n'}}\), where \(a_n\), \(a_n'\), and \(b_n\) are estimates of the corresponding quantities.

Let X be the centred Gaussian process over \(\varTheta _2\) with

$$\begin{aligned} \forall \theta _2, \theta _2' \in \varTheta _2, r(\theta _2,\theta _2') = \textrm{Cov}(X_{\theta _2}, X_{\theta _2'}) = \frac{b}{\sqrt{aa'}} \end{aligned}$$

where a and b are defined in Theorem 3.

Theorem 4

Under general regularity conditions pertaining to the class of derivatives of m (Glivenko-Cantelli classes), we have

$$\begin{aligned} \Bigl (\sqrt{\tfrac{n}{a_n}}\, (\hat{\pi }(\theta _2) - \pi ^*)\Bigr )_{\theta _2\in \varTheta _2} \xrightarrow []{\mathcal L} X. \end{aligned}$$

This results from [21] when \(dim(\varTheta _2) = 1\) and [24] when \(dim(\varTheta _2) > 1\).

Theorem 5

Under the same general regularity conditions as above, we have

$$\begin{aligned} T_n \xrightarrow []{\mathcal L} \sup _{\theta _2\in \varTheta _2}X(\theta _2). \end{aligned}$$

The proof of the last result when \(dim(\varTheta _2) = 1\) makes use of the fact that \(\theta _2 \mapsto \hat{\pi }(\theta _2)\) is càdlàg ([21]). This is a reasonable assumption, which holds in the examples which we considered. We are eager for counter-examples! When \(dim(\varTheta _2) > 1\), the result also holds, by [3].

Let now \(X^n\) be the centred Gaussian process over \(\varTheta _2\) with

$$\begin{aligned} \forall \theta _2, \theta _2' \in \varTheta _2, \textrm{Cov}(X^n_{\theta _2}, X^n_{\theta _2'}) = \frac{b_n}{\sqrt{a_n a_n'}} \end{aligned}$$

where \(a_n\), \(a_n'\) are defined in Theorem 3 and \(b_n\) is defined analogously.

Theorem 6

Let, for any \(\delta > 0\), \(\varTheta _2^\delta \) be a finite set such that \(\forall \theta _2\in \varTheta _2, \exists \tilde{\theta }_2\in \varTheta _2^\delta / \Vert \theta _2-\tilde{\theta }_2\Vert \le \delta \). Then

$$\begin{aligned} M_n^\delta = \sup _{\theta _2\in \varTheta _2^\delta } X^n_{\theta _2} \xrightarrow [\begin{array}{c} n \rightarrow \infty \\ \delta \rightarrow 0 \end{array}]{\mathcal L} M = \sup _{\theta _2\in \varTheta _2} X_{\theta _2}. \end{aligned}$$
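
In practice, the law of \(M_n^\delta \) can be approximated by Monte Carlo simulation of the centred Gaussian vector on the grid. The sketch below is one possible implementation; the plain Cholesky factorisation and the small diagonal jitter are our own implementation choices.

```python
import numpy as np

def critical_value(R, alpha=0.05, n_sim=100_000, seed=0, jitter=1e-10):
    """Monte Carlo (1 - alpha)-quantile of M_n^delta = max over the grid Theta_2^delta of the
    centred Gaussian vector with correlation matrix R (entries b_n / sqrt(a_n a_n'))."""
    rng = np.random.default_rng(seed)
    k = R.shape[0]
    L = np.linalg.cholesky(R + jitter * np.eye(k))     # jitter keeps R numerically positive definite
    draws = rng.standard_normal((n_sim, k)) @ L.T      # rows ~ N(0, R)
    return np.quantile(draws.max(axis=1), 1.0 - alpha)
```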

6 Algorithm

Our algorithm for testing that the data was sampled from a single-component mixture (\(H_0 : \pi ^* = 0\)) against a two-component mixture (\(H_1 : \pi ^* > 0\)) is presented in Algorithm 1.

In this algorithm, \(\hat{\pi }(\theta _2)\) is defined in (2) and (3). It depends on g. The above theorems hold as long as g fulfils \(Supp(g) \subset Supp(g_{\pi ,\theta })\) for any \((\pi ,\theta )\in \varTheta \). However, g has to be chosen with care: the constants in the asymptotic distribution in Theorem 3 depend on it, and [23] argue that the choice of g can influence the robustness properties of the procedure.

The choice of \(\varphi \) is also obviously crucial (see also [23] for the induced robustness properties).

The choices of \(\varphi \) and g are important practical questions, which are work in progress.

As already stated, the supremal estimator for the modified Kullback-Leibler divergence \(\varphi : x \in \mathbb R^{+*} \mapsto -\log x + x - 1\) is the usual maximum likelihood estimator. In this instance the estimator does not depend on g.

[Algorithm 1]
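
The original pseudocode of Algorithm 1 is not reproduced here. The following rough sketch only assembles the steps described above (grid of \(\theta _2\) values, standardised estimates, Monte Carlo critical value); the callables estimate, variance and covariance are hypothetical placeholders for the routines sketched in the previous sections, not part of the authors' procedure.

```python
import numpy as np

def mixture_test(x, theta2_grid, estimate, variance, covariance,
                 alpha=0.05, n_sim=100_000, seed=0):
    """Test H0: pi* = 0 against a two-component mixture, along the lines of Sect. 6.

    estimate(x, theta2)        -> (pi_hat(theta2), theta1_hat(theta2))   (e.g. (3) for KLm)
    variance(x, theta2)        -> a_n of Theorem 3 at theta2
    covariance(x, t, t_prime)  -> b_n of Theorem 3 for the pair (t, t_prime)
    """
    n, k = len(x), len(theta2_grid)

    pi_hat = np.array([estimate(x, t)[0] for t in theta2_grid])
    a_n = np.array([variance(x, t) for t in theta2_grid])

    # Standardised estimates and the test statistic T_n
    z = np.sqrt(n / a_n) * pi_hat
    T_n = z.max()

    # Estimated correlation matrix b_n / sqrt(a_n a_n') on the grid
    R = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            R[i, j] = R[j, i] = covariance(x, theta2_grid[i], theta2_grid[j]) / np.sqrt(a_n[i] * a_n[j])

    # Monte Carlo critical value for the supremum of the grid Gaussian process (Theorem 6)
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R + 1e-10 * np.eye(k))
    sup_draws = (rng.standard_normal((n_sim, k)) @ L.T).max(axis=1)
    c_alpha = np.quantile(sup_draws, 1.0 - alpha)

    return {"T_n": T_n, "critical_value": c_alpha, "reject_H0": T_n > c_alpha}
```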