
1 Introduction

MLE (maximum likelihood estimation) is one of the most important estimation methods in statistics [4, 11]. In data engineering it plays a crucial role, in particular in EM clustering [15]; in information theory it can be “identified” with the cross-entropy, which, jointly with the Kullback-Leibler divergence, plays a basic role in computer science [6]. In this paper we discuss MLE in the case when the considered density is Gaussian with center belonging to a given set. We were inspired by the ideas presented in [5] and consider estimation in various subclasses of normal densities.

One of the crucial questions in data analysis is how to choose the best coordinate system and define a distance which “optimally” reflects the internal structure of the data [3, 8, 12, 17, 18, 20]. A similar role is played by the Mahalanobis distance in discriminant analysis [9]. In general, we first need to decide whether or not we allow a translation of the origin of the coordinate system. Next we usually consider one of the following:

  • no change in coordinates;

  • possibly different change of scale separately in each coordinate;

  • arbitrary coordinates.

It turns out that the value of the likelihood function, when we restrict ourselves to Gaussian densities, can be naturally interpreted as a measure of how well a given coordinate system fits the data. Thus in this paper we search, in each of the above situations, for the coordinates which best describe (with respect to MLE) the given dataset \(\mathcal {Y}\subset \mathbb {R}^N\).

At the end of the introduction let us mention that our results can also be used in various density estimation and clustering problems which use Gaussian models [1, 5], in particular when the model consists of Gaussians with centers satisfying certain constraints.

2 Entropy and Gaussian Random Variables

Let X be a random variable with density \(f_{X}\). The differential entropy

$$\begin{aligned} H(X):=\int -\ln (f_{X}(y)) f_{X}(y) dy \end{aligned}$$
(1)

tells us the asymptotic expected amount of memory needed to code X [6]; thus the differential code-length optimized for X is given by \(-\ln (f_{X}(x))\).

If we want to code Y (a continuous variable with density \(g_{Y}\)) with the code optimized for X, we obtain the cross-entropy, introduced in [6, 10] (we follow the notation from [16]):

$$\begin{aligned} H^{\times }(Y\Vert X):=\int g_{Y}(y) \cdot (-\ln f_{X}(y))dy. \end{aligned}$$
(2)

If A is an invertible linear operator, then \( H^{\times }(AY\Vert AX)=H^{\times }(Y\Vert X)+\ln |{\mathrm {det}}(A)|. \) Since we consider X only through its density \(f_{X}\), we will commonly use the notation

$$\begin{aligned} H^{\times }(Y\Vert f_{X}):=\int g_{Y}(y) \cdot (-\ln f_{X}(y))dy. \end{aligned}$$
(3)

Roughly speaking, \(H^{\times }(Y\Vert f)\) denotes (asymptotically) the memory needed to code the random variable Y with the code optimized for the density f. In the case of a given dataset \(\mathcal {Y}\subset \mathbb {R}^N\) we interpret \(\mathcal {Y}\) as a uniform discrete variable Y on \(\mathcal {Y}\). Consequently, our formula reduces to

$$\begin{aligned} H^{\times }(\mathcal {Y}\Vert f):= H^{\times }(Y \Vert f) =- \frac{1}{|\mathcal {Y}|} \sum _{x \in \mathcal {Y}} \ln (f(x)), \end{aligned}$$
(4)

where \(| \mathcal {Y}|\) denotes the cardinality of the set \(\mathcal {Y}\).
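To make formula (4) concrete, the following Python sketch (our own illustration, not taken from any of the cited works) evaluates the empirical cross-entropy of a dataset with respect to an arbitrary coding density; the Gaussian log-density helper and the sample roughly echoing Example 1 below are hypothetical ingredients introduced only for this example.

```python
import numpy as np

def cross_entropy(data, log_density):
    """Empirical cross-entropy H^x(Y || f) from formula (4):
    minus the average log-density of the coding density over the dataset."""
    return -np.mean([log_density(x) for x in data])

def gaussian_log_density(m, cov):
    """Log-density of the normal density N_(m, cov), used here as an example coding density."""
    m, cov = np.asarray(m, dtype=float), np.asarray(cov, dtype=float)
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    N = len(m)
    def log_f(x):
        d = np.asarray(x, dtype=float) - m
        return -0.5 * (N * np.log(2 * np.pi) + logdet + d @ inv @ d)
    return log_f

# sample data, roughly echoing Example 1 below
rng = np.random.default_rng(0)
Y = rng.multivariate_normal(mean=[3.0, 4.0], cov=[[1.0, 0.3], [0.3, 0.6]], size=500)
print(cross_entropy(Y, gaussian_log_density([0.0, 0.0], np.eye(2))))
```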

In our investigations we are interested in the (best) coding of Y by densities chosen from a set of densities \(\mathcal {F}\), and thus we will need the following definition.

Definition 1

By the cross-entropy of Y with respect to a family of coding densities \(\mathcal {F}\) we understand

$$\begin{aligned} H^{\times }(Y\Vert \mathcal {F}):=\inf _{f \in \mathcal {F}}H^{\times }(Y\Vert f). \end{aligned}$$
(5)

Observe that the search for the density f with minimal cross-entropy leads exactly to maximum likelihood estimation. Thus in general the calculation of \(H^{\times }(Y\Vert \mathcal {F})\) is nontrivial, as it is equivalent to finding the ML estimator.

As in many statistical and information-theoretic problems, the basic role in our investigations is played by Gaussian densities. We recall that the normal variable with mean \(\mathrm {m}\) and covariance matrix \(\varSigma \) has the density \( \mathcal {N}\!_{(\mathrm {m},\varSigma )}(x) = \frac{1}{\sqrt{(2\pi )^N {\mathrm {det}}(\varSigma )}}e^{-\frac{1}{2}\Vert x-\mathrm {m}\Vert _{\varSigma }^2}, \) where \(\Vert x-\mathrm {m}\Vert _{\varSigma }\) is the Mahalanobis norm, \( \Vert x-\mathrm {m}\Vert _{\varSigma }^2:=(x-\mathrm {m})^T\varSigma ^{-1}(x-\mathrm {m}) \), see [13]. The differential entropy of the Gaussian distribution is given by

$$ H(\mathcal {N}\!_{(\mathrm {m},\varSigma )})=\frac{N}{2}\ln (2\pi e)+\frac{1}{2}\ln ({\mathrm {det}}(\varSigma )). $$

From now on, if not otherwise specified, we assume that all the considered random variables have finite second moments and that they have values in \(\mathbb {R}^N\). For a random variable Y, by \(\mathrm {m}_Y=E(Y)\) we denote its mean, and by \(\varSigma _Y\) its covariance matrix, that is \( \varSigma _Y=E((Y-\mathrm {m}_Y) \cdot (Y-\mathrm {m}_Y)^T). \)

We will need the following result, which says that the cross-entropy of an arbitrary random variable Y with respect to a normal density can be computed from the mean and covariance matrix of Y alone.

Theorem 1

([4], Theorem 5.59). Let Y be a random variable with finite covariance matrix. Then for arbitrary \(\mathrm {m}\) and positive-definite covariance matrix \(\varSigma \) we have

$$\begin{aligned} \begin{array}{l} H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) = \frac{N}{2} \ln (2\pi )+\frac{1}{2}\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2_{\varSigma } + \frac{1}{2}\mathrm {tr}(\varSigma ^{-1}\varSigma _Y)+\frac{1}{2}\ln ( {\mathrm {det}}(\varSigma )). \end{array} \end{aligned}$$
(6)

Remark 1

Suppose that we are given a data set \(\mathcal {Y}\). Then we usually understand the data as a sample realization of a random variable Y. Consequently, as an estimator of the mean of Y we use the mean \(\mathrm {m}_{\mathcal {Y}} = \frac{1}{|\mathcal {Y}|} \sum \limits _{y \in \mathcal {Y}}y\) of the data \(\mathcal {Y}\), and as the covariance we use the ML estimator \(\frac{1}{|\mathcal {Y}|}\sum \limits _{y \in \mathcal {Y}}(y-\mathrm {m}_{\mathcal {Y}})(y-\mathrm {m}_{\mathcal {Y}})^T\).
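As a numerical sanity check of Theorem 1 (continuing the hypothetical helpers from the previous sketch), the closed form (6), evaluated with the sample mean and the ML covariance estimator, coincides with the empirical average of \(-\ln \mathcal {N}\!_{(\mathrm {m},\varSigma )}\) over the data:

```python
def gaussian_cross_entropy(mY, SY, m, cov):
    """Closed form (6): H^x(Y || N_(m, cov)) computed from the mean mY and covariance SY of Y."""
    N = len(mY)
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    d = np.asarray(mY) - np.asarray(m)
    return 0.5 * (N * np.log(2 * np.pi) + d @ inv @ d + np.trace(inv @ SY) + logdet)

mY = Y.mean(axis=0)                          # sample mean of the data
SY = (Y - mY).T @ (Y - mY) / len(Y)          # ML covariance estimator (division by |Y|)
m, cov = np.array([1.0, -1.0]), np.diag([2.0, 0.5])
print(gaussian_cross_entropy(mY, SY, m, cov))
print(cross_entropy(Y, gaussian_log_density(m, cov)))   # same value, computed directly
```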

As a direct corollary we obtain the formula for the optimal choice of origin.

Proposition 1

Let Y be a random variable and \(\varSigma \) a fixed covariance matrix. Let M be a nonempty closed subset of \(\mathbb {R}^N\). Among all normal coding densities \(\mathcal {N}\!_{(\mathrm {m},\varSigma )}\) with \(\mathrm {m}\in M\), the minimal cross-entropy is realized for the \(\mathrm {m}\in M\) which minimizes \(\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma }\), and equals

$$ \begin{array}{l} \inf \limits _{\mathrm {m}\in M}H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) = \frac{1}{2} \big (d^2_{\varSigma }(\mathrm {m}_Y;M)+\mathrm {tr}(\varSigma ^{-1}\varSigma _Y) + \ln ({\mathrm {det}}(\varSigma ))+N\ln (2\pi )\big ), \end{array} $$

where \(d_{\varSigma }\) denotes the Mahalanobis distance.

Consequently, if \(M=\mathbb {R}^N\) the minimum is realized for \(\mathrm {m}=\mathrm {m}_Y\) and equals

$$ \begin{array}{l} \inf \limits _{\mathrm {m}\in \mathbb {R}^N}H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) =\frac{1}{2} \left( \mathrm {tr}(\varSigma ^{-1}\varSigma _Y)+\ln ({\mathrm {det}}(\varSigma ))+N\ln (2\pi )\right) . \end{array} $$
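When M is a finite candidate set, Proposition 1 turns into a straightforward search; the sketch below (an assumption made only for illustration, reusing the hypothetical mY and SY from the previous snippet) picks the center in M closest to \(\mathrm {m}_Y\) in the Mahalanobis norm and evaluates the resulting cross-entropy.

```python
def best_center(mY, SY, cov, M):
    """Proposition 1 for a finite set M: choose the m in M with the smallest
    Mahalanobis distance to mY and return it together with the minimal cross-entropy."""
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    N = len(mY)
    dists = [(np.asarray(m) - mY) @ inv @ (np.asarray(m) - mY) for m in M]
    k = int(np.argmin(dists))
    value = 0.5 * (dists[k] + np.trace(inv @ SY) + logdet + N * np.log(2 * np.pi))
    return M[k], value

M = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), mY]
print(best_center(mY, SY, np.eye(2), M))     # with mY in M, the optimum is mY itself
```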

It turns out that our basic MLE problem, when we restrict ourselves to Gaussian densities, can be naturally interpreted as a search for the optimal rescaling (the optimal choice of coordinate system).

Remark 2

Let us start with the one-dimensional case. If we allow a translation of the origin of the coordinate system, we usually apply the standardization/normalization \(s:Y \rightarrow (Y-\mathrm {m}_Y)/\sigma _Y\). In the multivariate case the normalization is given by the transformation \(s:Y \rightarrow \varSigma _Y^{-1/2}(Y-\mathrm {m}_Y)\). Then the coordinates become uncorrelated and the covariance matrix is the identity. Computing the distance between the transformed points x, y:

$$ \Vert sx-sy\Vert ^2 \!=\! (sx-sy)^T (sx-sy) \! = \! (x-y)^T \varSigma _Y^{-1}(x-y) $$

we arrive naturally at the Mahalanobis distance \(\Vert x-y\Vert _{\varSigma }^2=(x-y)^T \varSigma ^{-1}(x-y)\). If we do not allow a translation of the origin, we usually only scale each coordinate by dividing it by its mean (the unit scale then plays the normalizing role, as the mean of each coordinate becomes one).

To study the question of what the optimal procedure is, we need a criterion to compare different coordinate systems. Suppose that we are given a basis \(\mathrm {v}=(v_1,\ldots ,v_N)\) of \(\mathbb {R}^N\) and an origin \(\mathrm {m}\) of the coordinate system. Then by \(\mathcal {N}\!_{[\mathrm {m},\mathrm {v}]}\) we denote the “normalized” Gaussian density with respect to the basis \(\mathrm {v}\) centered at \(\mathrm {m}\), that is

$$ \begin{array}{l} \mathcal {N}\!_{[\mathrm {m},\mathrm {v}]}(\mathrm {m}+x_1v_1+\ldots +x_N v_N) =\frac{1}{(2\pi )^{N/2}|{\mathrm {det}}(\mathrm {v})|}e^{-(x_1^2+\ldots +x_N^2)/2}. \end{array} $$

Then as a measure of fitness of the coordinate system \([\mathrm {m},\mathrm {v}]\) we take the cross-entropy \(H^{\times }(Y\Vert \mathcal {N}\!_{[\mathrm {m},\mathrm {v}]})\).
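The fitness measure \(H^{\times }(Y\Vert \mathcal {N}\!_{[\mathrm {m},\mathrm {v}]})\) is easy to evaluate directly; the following sketch (again our own illustration, built on the hypothetical helpers above) implements the normalized density \(\mathcal {N}\!_{[\mathrm {m},\mathrm {v}]}\) for a basis given by the columns of a matrix V.

```python
def coordinate_fitness(data, m, V):
    """Cross-entropy H^x(data || N_[m, v]): a measure of how well the coordinate system
    with origin m and basis vectors given by the columns of V fits the data."""
    m, V = np.asarray(m, dtype=float), np.asarray(V, dtype=float)
    N = len(m)
    Vinv = np.linalg.inv(V)
    _, logabsdet = np.linalg.slogdet(V)
    def log_density(y):
        x = Vinv @ (np.asarray(y) - m)       # coordinates of y in the basis v
        return -0.5 * (x @ x) - 0.5 * N * np.log(2 * np.pi) - logabsdet
    return -np.mean([log_density(y) for y in data])

# usage: the standard coordinate system with the origin at the sample mean
print(coordinate_fitness(Y, Y.mean(axis=0), np.eye(2)))
```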

3 Rescaling

Let us first consider the question of how we should uniformly rescale the classical coordinates to optimally “fit” the data. Assume that the origin of the coordinate system is fixed at \(\mathrm {m}\). This means that we search for the s at which \( s \rightarrow H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},s\mathrm {I})}) \) attains its minimum. Since

$$\begin{aligned} \begin{array}{l} H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},s\mathrm {I})}) = \frac{1}{2}([\mathrm {tr}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2]s^{-1} + N \ln (s)+N \ln (2\pi )), \end{array} \end{aligned}$$
(7)

an elementary calculation shows that the above function attains its minimum

$$ \frac{N}{2}(\ln [\mathrm {tr}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2]+\ln (2\pi e/N)) $$

for \(s=[\mathrm {tr}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2]/N\). Thus we have arrived at the following theorem.

Theorem 2

Let Y be a random variable with invertible covariance matrix and let \(\mathrm {m}\) be fixed. Then the infimum defining \(H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0})\) is attained at \(s=(\mathrm {tr}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2)/N\), and equals

$$ \begin{array}{l} H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0}) = \frac{N}{2}(\ln [\mathrm {tr}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2]+\ln (2\pi e/N)). \end{array} $$
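A minimal sketch of Theorem 2 (assuming the hypothetical helpers from Sect. 2 above): compute the optimal isotropic scale s and compare the resulting minimal cross-entropy with the closed form (6).

```python
def optimal_uniform_scale(mY, SY, m):
    """Theorem 2: optimal isotropic scale s for the family N_(m, s*I)
    together with the corresponding minimal cross-entropy."""
    N = len(mY)
    c = np.trace(SY) + np.sum((np.asarray(m) - mY) ** 2)
    return c / N, 0.5 * N * (np.log(c) + np.log(2 * np.pi * np.e / N))

s, val = optimal_uniform_scale(mY, SY, [0.0, 0.0])
# sanity check: the same value via the closed form (6) with Sigma = s*I
print(val, gaussian_cross_entropy(mY, SY, [0.0, 0.0], s * np.eye(2)))
```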
Fig. 1. The original data set with the optimal coordinate system (the new “optimal” basis is marked by the bold arrows) in the case of the family \(\left\{ \mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\right\} _{s>0}\) (left) and the data in the new basis (right).

Example 1

Let \(\mathcal {Y}\) be a realization of the normal random variable Y with \(\mathrm {m}_{Y}=[3,4]^T\) and \(\varSigma _Y = \left[ \begin{array}{cc}1 & 0.3\\ 0.3 & 0.6\end{array}\right] \), and let \(\mathrm {m}= [0,0]^T\). In Fig. 1 we present a sample \(\mathcal {Y}\) with the coordinate system (marked by bold black segments) obtained from Theorem 2, together with the data in the new basis.

Observe that the above minimum depends only on the trace of the covariance matrix of Y and the Euclidean distance of \(\mathrm {m}\) from \(\mathrm {m}_Y\). If we allow a change of the origin, we clearly have to put the origin at \(\mathrm {m}_Y\):

Corollary 1

Let Y be a random variable with invertible covariance matrix. Then the infimum defining \( H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0,\mathrm {m}\in \mathbb {R}^N})\) is attained at \(\mathrm {m}=\mathrm {m}_Y\), \(s=\frac{1}{N}\mathrm {tr}(\varSigma _Y)\), and equals

$$ H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0,\mathrm {m}\in \mathbb {R}^N}) = \tfrac{N}{2}(\ln (\mathrm {tr}(\varSigma _Y))+\ln (\tfrac{2\pi e}{N})). $$

Corollary 2

Let \(\mathcal {Y}=(y_1,\ldots ,y_n)\) be a given data set. Assume that we want to move the origin to \(\mathrm {m}\) and uniformly rescale the coordinates. Then

$$ s \rightarrow (s-\mathrm {m})/\sqrt{\tfrac{1}{N}(\mathrm {tr}(\varSigma _{\mathcal {Y}})+\Vert \mathrm {m}-\mathrm {m}_{\mathcal {Y}}\Vert ^2)} $$

is the optimal rescaling, where \(\varSigma _{\mathcal {Y}}\) denotes the covariance matrix of \(\mathcal {Y}\). If we additionally allow a change of the origin, we should put \(\mathrm {m}=\mathrm {m}_{\mathcal {Y}}\), and consequently the rescaling takes the form \(s \rightarrow (s-\mathrm {m}_{\mathcal {Y}})/\sqrt{\mathrm {tr}(\varSigma _{\mathcal {Y}})/N}\).

Applying the above, we obtain that in the one-dimensional case the rescaling takes the form \(s \rightarrow (s-\mathrm {m}_{\mathcal {Y}})/\sigma _{\mathcal {Y}}\) (if we allow a change of origin), and \(s \rightarrow s/\sqrt{\mathrm {m}_{\mathcal {Y}}^2+\sigma _{\mathcal {Y}}^2}\) (if we fix the origin at zero).

Fig. 2. The original data set with the optimal coordinate system in the case of the family \(\left\{ \mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\right\} _{s>0,\mathrm {m}\in \mathbb {R}^N}\) (left) and the data in the new basis (right).

Example 2

Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1. In Fig. 2 we present a sample \(\mathcal {Y}\) with the coordinate system obtained from Corollary 1, together with the data in the new basis.

Now we consider the case when we allow each coordinate \(Y_i\) of \(Y=(Y_1,\ldots ,Y_N)\) to be rescaled separately. For simplicity we consider a splitting into two blocks, \(\mathbb {R}^N=\mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\). For densities \(f_1\) and \(f_2\) on \(\mathbb {R}^{N_1}\) and \(\mathbb {R}^{N_2}\), respectively, we define the product density \(f_1 \otimes f_2\) on \(\mathbb {R}^N=\mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\) by the formula

$$ (f_1 \otimes f_2)(x_1,x_2):=f_1(x_1) \cdot f_2(x_2), $$

for \((x_1,x_2) \in \mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\). Given density families \(\mathcal {F}_1\) and \(\mathcal {F}_2\), we put \(\mathcal {F}_1 \otimes \mathcal {F}_2:=\{f_1 \otimes f_2:f_1 \in \mathcal {F}_1, f_2 \in \mathcal {F}_2\}\). Let \( Y:(\varOmega ,\mu ) \rightarrow \mathbb {R}^{N_1} \times \mathbb {R}^{N_2} \) be a random variable and let \(Y_1:\varOmega \rightarrow \mathbb {R}^{N_1}\) and \(Y_2:\varOmega \rightarrow \mathbb {R}^{N_2}\) denote the first and second coordinates of Y (observe that in general \(Y_1\) and \(Y_2\) are not independent random variables). One can easily observe the following.

Proposition 2

Let \(\mathcal {F}_1\) and \(\mathcal {F}_2\) denote coding density families in \(\mathbb {R}^{N_1}\) and \(\mathbb {R}^{N_2}\), respectively, and let \(Y:\varOmega \rightarrow \mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\) be a random variable. Then

$$ H^{\times }(Y\Vert \mathcal {F}_1 \otimes \mathcal {F}_2)=H^{\times }(Y_1 \Vert \mathcal {F}_1)+H^{\times }(Y_2\Vert \mathcal {F}_2). $$
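For the empirical cross-entropy (4) this additivity is immediate, since \(\ln (f_1 \otimes f_2)=\ln f_1+\ln f_2\); the short check below (reusing the hypothetical helpers from Sect. 2) verifies it numerically for fixed coding densities, and Proposition 2 follows by taking the infima over \(\mathcal {F}_1\) and \(\mathcal {F}_2\) separately.

```python
# product coding density: log (f1 ⊗ f2)(y) = log f1(y1) + log f2(y2)
f1 = gaussian_log_density([0.0], [[2.0]])    # coding density on the first coordinate
f2 = gaussian_log_density([1.0], [[0.5]])    # coding density on the second coordinate
product = lambda y: f1(y[:1]) + f2(y[1:])

lhs = cross_entropy(Y, product)
rhs = cross_entropy(Y[:, :1], f1) + cross_entropy(Y[:, 1:], f2)
print(lhs, rhs)                              # identical up to floating-point error
```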
Fig. 3. The original data set with the optimal coordinate system in the case of separately rescaled coordinates when we allow a change of origin (left) and the data in the new basis (right).

The above result means that if we allow the coordinates to be rescaled separately, we can treat them as separate random variables. Thus we obtain the following theorem.

Theorem 3

Let \(\mathcal {Y}\) be a data set, and let \(\mathcal {Y}_k\) denote the set of its k-th coordinates. Then the optimal rescaling of the k-th coordinate is given by

$$ \begin{array}{l} \mathcal {Y}_k \ni s \rightarrow (s-\mathrm {m}_{\mathcal {Y}_k})/\sigma _{\mathcal {Y}_k} \quad \text {(if we allow a change of origin)}, \\ \mathcal {Y}_k \ni s \rightarrow s/\sqrt{\mathrm {m}_{\mathcal {Y}_k}^2+\sigma _{\mathcal {Y}_k}^2} \quad \text {(if we fix the origin at zero)}. \end{array} $$
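The following sketch (our own, under the same assumptions as the previous snippets) applies Theorem 3 to a dataset stored as rows of a matrix, in both variants.

```python
def rescale_coordinates(data, allow_origin_change=True):
    """Theorem 3: rescale each coordinate of the dataset separately.
    With a change of origin this is the usual standardization; with the origin
    fixed at zero each coordinate is divided by sqrt(mean^2 + variance)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    sigma = data.std(axis=0)                 # ML estimator (division by |Y|)
    if allow_origin_change:
        return (data - mean) / sigma
    return data / np.sqrt(mean ** 2 + sigma ** 2)

Z = rescale_coordinates(Y)                   # each coordinate: zero mean, unit variance
print(Z.mean(axis=0), Z.std(axis=0))
```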
Fig. 4. The original data set with the optimal coordinate system in the case of separately rescaled coordinates when we do not allow a change of origin (left) and the data in the new basis (right).

Example 3

Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1. In Fig. 4 we present a sample \(\mathcal {Y}\) with the coordinate system obtained from Theorem 3 when the origin is fixed at zero, together with the data in the new basis. In Fig. 3 we present a sample \(\mathcal {Y}\) with the coordinate system obtained from Theorem 3 when we allow a change of origin, together with the data in the new basis.

4 Main Result

We find the optimal coordinate system in the general case by applying an approach similar to that of [19]. To do so, we need a simple consequence of the famous von Neumann trace inequality. Next we discuss the optimal rescaling when we move the origin to the mean of the data.

In most of our further results the following proposition will play an important role. In its proof we use the well-known von Neumann trace inequality [7, 14]:

Theorem [von Neumann trace inequality]. Let E, F be complex \(N \times N\) matrices. Then

$$\begin{aligned} |\mathrm {tr}(EF) | \le \sum _{i=1}^N s_i(E)\cdot s_i(F), \end{aligned}$$
(8)

where \(s_i(D)\) denote the singular values of the matrix D, ordered decreasingly.

We also need the Sherman-Morrison formula [2]:

Theorem [Sherman-Morrison formula]. Suppose A is an invertible square matrix and u, v are column vectors such that \(1 + v^T A^{-1}u \ne 0\). Then

$$ (A+uv^T)^{-1} = A^{-1} - {A^{-1}uv^T A^{-1} \over 1 + v^T A^{-1}u}. $$
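Since the Sherman-Morrison formula is used in the proof of Theorem 4 below, a quick numerical check may be helpful; the snippet (an illustration only, reusing the numpy import from the sketches in Sect. 2) verifies the identity on random inputs.

```python
# numerical check of the Sherman-Morrison formula on random, well-conditioned inputs
rng_sm = np.random.default_rng(1)
A = rng_sm.normal(size=(4, 4)) + 4 * np.eye(4)
u, v = rng_sm.normal(size=(4, 1)), rng_sm.normal(size=(4, 1))
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / (1 + v.T @ Ainv @ u)
print(np.allclose(lhs, rhs))                 # True
```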

Let us recall that for a symmetric nonnegative-definite matrix the eigenvalues coincide with the singular values. Given \(\lambda _1,\ldots ,\lambda _N \in \mathbb {R}\), by \(S_{\lambda _1,\ldots ,\lambda _N}\) we denote the set of all symmetric matrices with eigenvalues \(\lambda _1,\ldots ,\lambda _N\).

Proposition 3

Let B be a symmetric nonnegative matrix with eigenvalues \(\beta _1 \ge \ldots \ge \beta _N \ge 0\). Let \(0 \le \lambda _1 \le \ldots \le \lambda _N\) be fixed. Then

$$ \min _{A \in S_{\lambda _1,\ldots ,\lambda _N}} \mathrm {tr}(AB)=\sum _i \lambda _i \beta _i. $$

Proof

Let \((e_i)\) denote an orthonormal basis built from the eigenvectors of B, and let the operator \(\bar{A}\) be defined in this basis by \(\bar{A}(e_i)=\lambda _i e_i\). Then trivially

$$ \min _{A \in S_{\lambda _1,\ldots ,\lambda _N}} \mathrm {tr}(AB) \le \mathrm {tr}(\bar{A}B)=\sum _i \lambda _i \beta _i. $$

To prove the reverse inequality we use the von Neumann trace inequality. Let \(A \in S_{\lambda _1,\ldots ,\lambda _N}\) be arbitrary. We apply inequality (8) to \(E=\lambda _N \mathrm {I}-A\) and \(F=B\). Since E and F are symmetric nonnegative-definite matrices, their eigenvalues \(\lambda _N-\lambda _i\) and \(\beta _i\) coincide with their singular values, and therefore by (8)

$$\begin{aligned} \begin{array}{l} \mathrm {tr}((\lambda _N\mathrm {I}-A)B) \le \sum \nolimits _i(\lambda _N-\lambda _i)\beta _i = \lambda _N \sum \nolimits _i \beta _i -\sum \nolimits _i \lambda _i \beta _i. \end{array} \end{aligned}$$
(9)

Since

$$ \mathrm {tr}((\lambda _N\mathrm {I}-A)B)=\lambda _N \sum _i \beta _i -\mathrm {tr}(AB), $$

from inequality (9) we obtain that \( \mathrm {tr}(AB) \ge \sum _i \lambda _i \beta _i. \)
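A numerical illustration of Proposition 3 (a sketch with names of our own choosing): the lower bound \(\sum _i \lambda _i \beta _i\) is attained when A shares the eigenvectors of B, with ascending \(\lambda \) matched to descending \(\beta \), and is respected by random symmetric matrices with the same spectrum.

```python
lam = np.array([0.5, 1.0, 2.0])              # eigenvalues of A: lambda_1 <= ... <= lambda_N
beta = np.array([3.0, 1.5, 0.2])             # eigenvalues of B: beta_1 >= ... >= beta_N
gen = np.random.default_rng(2)
Q = np.linalg.qr(gen.normal(size=(3, 3)))[0] # random orthogonal matrix
B = Q @ np.diag(beta) @ Q.T

lower_bound = np.sum(lam * beta)
A_opt = Q @ np.diag(lam) @ Q.T               # shares the eigenvectors of B
print(np.trace(A_opt @ B), lower_bound)      # equal

for _ in range(1000):                        # random symmetric A with the same spectrum
    U = np.linalg.qr(gen.normal(size=(3, 3)))[0]
    A = U @ np.diag(lam) @ U.T
    assert np.trace(A @ B) >= lower_bound - 1e-9
```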

Now we proceed to the main result of the paper. For \(M \subset \mathbb {R}^N \), by \( \mathcal {G}_{M} \) we denote the set of Gaussian densities with mean \(\mathrm {m}\in M\).

Theorem 4

Let \(\mathrm {m}\in \mathbb {R}^N\) be fixed and let \(\mathcal {G}_{\{\mathrm {m}\}}\) denote the set of Gaussians with mean \(\mathrm {m}\). Then \(H^{\times }(Y\Vert \mathcal {G}_{\{\mathrm {m}\}})\) equals

$$ \frac{1}{2}\left( \ln (1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2_{\varSigma _Y})+\ln ({\mathrm {det}}(\varSigma _Y))+N\ln (2\pi e)\right) , $$

and is attained for \( \varSigma =\varSigma _Y + (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T. \)

Proof

Let us first observe that by applying the substitution

$$ A=\varSigma _Y^{1/2}\varSigma ^{-1}\varSigma _Y^{1/2}, v=\varSigma _Y^{-1/2}(\mathrm {m}-\mathrm {m}_Y), $$

we obtain

$$\begin{aligned} \begin{array}{ll} H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) &{}= \frac{1}{2} (\mathrm {tr}(\varSigma ^{-1}\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2_{\varSigma } + \ln ({\mathrm {det}}(\varSigma ))+N\ln (2\pi )) \\[1ex] &{}=\frac{1}{2}(\mathrm {tr}(\varSigma ^{-1}\varSigma _Y)+(\mathrm {m}-\mathrm {m}_Y)^T\varSigma ^{-1}(\mathrm {m}-\mathrm {m}_Y) \\[1ex] &{}\quad - \ln ({\mathrm {det}}(\varSigma ^{-1}\varSigma _Y))+\ln ({\mathrm {det}}(\varSigma _Y))+N\ln (2\pi )) \\[1ex] &{}=\frac{1}{2}\big (\mathrm {tr}(A)+v^TAv-\ln ({\mathrm {det}}(A))+\ln ({\mathrm {det}}(\varSigma _Y)) +N\ln (2\pi )\big ). \end{array} \end{aligned}$$
(10)

Then A is a symmetric positive-definite matrix. Conversely, given a symmetric positive-definite matrix A, we can uniquely recover \(\varSigma \) by the formula

$$\begin{aligned} \varSigma =\varSigma _Y^{1/2}A^{-1}\varSigma _Y^{1/2}. \end{aligned}$$
(11)

Thus finding the minimum of (10) reduces to finding a symmetric positive-definite matrix A which minimizes the value of

$$\begin{aligned} \mathrm {tr}(A)+v^TAv-\ln ({\mathrm {det}}(A)). \end{aligned}$$
(12)

Let us first consider \(A \in S_{\lambda _1,\ldots ,\lambda _N}\), where \(0 < \lambda _1 \le \ldots \le \lambda _N\) are fixed. Our aim is to minimize

$$ v^TAv=\mathrm {tr}(v^TAv)=\mathrm {tr}(A \cdot (vv^T)). $$

We fix an orthonormal basis whose first element is \(v/\Vert v\Vert \); applying Proposition 3 with \(B=vv^T\) (a consequence of the von Neumann trace inequality) we obtain that the above is minimized when v is an eigenvector of A corresponding to \(\lambda _1\), and thus the minimum equals \( \lambda _1 \Vert v\Vert ^2. \) Consequently we arrive at the minimization problem

$$ \lambda _1 (1+\Vert v\Vert ^2)+\sum _{i>1}\lambda _i-\sum _i \ln \lambda _i. $$

Now one can easily verify that the minimum of the above is realized for

$$ \lambda _1=1/(1+\Vert v\Vert ^2), \quad \lambda _i=1 \text{ for } i >1, $$

and then (12) equals \( N+\ln (1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma _Y}^2), \) while the matrix A realizing this minimum is given by \( A=\mathrm {I}-\frac{vv^T}{1+\Vert v\Vert ^2}. \) Consequently, the minimal value of (10) is

$$ \frac{1}{2}\left( \ln (1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2_{\varSigma _Y})+\ln ({\mathrm {det}}(\varSigma _Y))+N\ln (2\pi e)\right) . $$

By (11) and the Sherman-Morrison formula, this minimum is attained for

$$ \begin{array}{l} \varSigma =\varSigma _Y^{1/2}(I\!-\!\frac{\varSigma _Y^{-1/2}(\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T\varSigma _Y^{-1/2}}{1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma _Y}^2} )^{-1}\varSigma _Y^{1/2} =\varSigma _Y\! +\! (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T. \end{array} $$
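A numerical sketch of Theorem 4 (reusing the hypothetical helpers and the sample from Sect. 2): the covariance \(\varSigma _Y + (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T\) reproduces the stated minimal value of the cross-entropy, while perturbed covariances give larger values.

```python
def best_gaussian_with_fixed_mean(mY, SY, m):
    """Theorem 4: optimal covariance and minimal cross-entropy over all
    Gaussian coding densities with the mean fixed at m."""
    d = np.asarray(m, dtype=float) - np.asarray(mY, dtype=float)
    cov = SY + np.outer(d, d)
    maha2 = d @ np.linalg.inv(SY) @ d        # squared Mahalanobis norm of m - mY
    _, logdetSY = np.linalg.slogdet(SY)
    N = len(mY)
    value = 0.5 * (np.log1p(maha2) + logdetSY + N * np.log(2 * np.pi * np.e))
    return cov, value

m0 = np.array([0.0, 0.0])
cov_opt, val_opt = best_gaussian_with_fixed_mean(mY, SY, m0)
print(val_opt, gaussian_cross_entropy(mY, SY, m0, cov_opt))           # equal
print(gaussian_cross_entropy(mY, SY, m0, cov_opt + 0.1 * np.eye(2)))  # strictly larger
```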
Fig. 5. The original data set with the optimal coordinate system in the case of the family \(\mathcal {G}_{\{(0,0)\}}\) (left) and the data in the new basis (right).

Example 4

Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1 and let \(\mathrm {m}=[0,0]^T\) be fixed. In Fig. 5 we present a sample \(\mathcal {Y}\) with the coordinate system obtained from Theorem 4, together with the data in the new basis.

5 Conclusion

In this paper we have shown that MLE in the class of Gaussian densities can be understood, equivalently, as the search for the coordinates which best describe a given dataset \(\mathcal {Y}\subset \mathbb {R}^N\). The main result of the paper gives the formula for the optimal coordinate system in the case when the mean of the Gaussian density satisfies certain constraints.

Our work can be used in density estimation and clustering algorithms which use different Gaussian models.