Abstract
We show that maximum likelihood estimation (MLE) in the class of Gaussian densities can be understood as the search for the coordinate system which “optimally” underlines the internal structure of the data. In particular, this allows us to search for the optimal coordinate system when the origin is fixed at a given point.
P. Spurek—The paper was supported by the National Centre of Science (Poland) Grant No. 2013/09/N/ST6/01178.
J. Tabor— The paper was supported by the National Centre of Science (Poland) Grant No. 2014/13/B/ST6/01792.
1 Introduction
MLE (maximum likelihood estimation) is one of the most important estimation methods in statistics [4, 11]. In data engineering it plays a crucial role, in particular in EM clustering [15]; in information theory it can be “identified” with the cross-entropy, which jointly with the Kullback-Leibler divergence plays a basic role in computer science [6]. In this paper we discuss MLE in the case when the considered density is Gaussian with center belonging to a given set. We were inspired by the ideas presented in [5] and consider estimations in various subclasses of normal densities.
One of the crucial questions in data analysis is how to choose the best coordinate system and define a distance which “optimally” underlines the internal structure of the data [3, 8, 12, 17, 18, 20]. A similar role is played by the Mahalanobis distance in discriminant analysis [9]. In general, we first need to decide whether or not we allow the translation of the origin of the coordinate system. Next we usually consider one of the following:
- no change in coordinates;
- a possibly different change of scale separately in each coordinate;
- arbitrary coordinates.
It turns out that the value of the likelihood function, when we restrict to Gaussian densities, can be naturally interpreted as a measure of the fitness of a given coordinate system to the data. Thus in this paper we search for the coordinates which, in each of the above situations, best describe (with respect to MLE) the given dataset \(\mathcal {Y}\subset \mathbb {R}^N\).
At the end of the introduction let us mention that our results can be also used in various density estimation and clustering problems which use Gaussian models [1, 5], in particular in the case when we consider the model consisting of Gaussians with centers satisfying certain constraints.
2 Entropy and Gaussian Random Variables
Let X be a random variable with density \(f_{X}\). The differential entropy
\[ H(X) := -\int f_{X}(x)\ln f_{X}(x)\,dx \]
tells us the asymptotic expected amount of memory needed to code X [6], and thus the code-length optimized for X is given by \(-\ln (f_{X}(x))\).
If we want to code Y (a continuous variable with density \(g_{Y}\)) with the code optimized for X, we obtain the cross-entropy, which was presented in [6, 10] (we follow the notation from [16]):
\[ H^{\times }(Y\Vert X) := -\int g_{Y}(x)\ln f_{X}(x)\,dx. \]
If A is an invertible linear operator, then \( H^{\times }(AY\Vert AX)=H^{\times }(Y\Vert X)+\ln |{\mathrm {det}}(A)|. \) Since we consider X only from the point of view of its density \(f_{X}\), we will commonly use the notation
\[ H^{\times }(Y\Vert f_{X}) := H^{\times }(Y\Vert X). \]
Roughly speaking, \(H^{\times }(Y\Vert f)\) denotes (asymptotically) the memory needed to code the random variable Y with the code optimized for the density f. In the case of a given dataset \(\mathcal {Y}\subset \mathbb {R}^N\) we interpret \(\mathcal {Y}\) as a uniform discrete variable Y on \(\mathcal {Y}\). Consequently, our formula reduces to
\[ H^{\times }(Y\Vert f) = -\frac{1}{|\mathcal {Y}|}\sum _{y\in \mathcal {Y}}\ln f(y), \]
where \(|\mathcal {Y}|\) denotes the cardinality of the set \(\mathcal {Y}\).
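As a quick illustration, the empirical cross-entropy above can be computed directly from a dataset and a coding density. The following NumPy sketch (function names are ours, not from the paper) codes a standard-normal sample with the standard-normal density:

```python
import numpy as np

def cross_entropy(Y, f):
    """Empirical cross-entropy H^x(Y || f) = -(1/|Y|) * sum_{y in Y} ln f(y)."""
    return -np.mean([np.log(f(y)) for y in Y])

def std_normal_pdf(x):
    """Density of N(0, I) on R^N."""
    N = len(x)
    return (2 * np.pi) ** (-N / 2) * np.exp(-0.5 * x @ x)

rng = np.random.default_rng(0)
Y = rng.standard_normal((2000, 2))
H = cross_entropy(Y, std_normal_pdf)
# For Y ~ N(0, I) this approaches the differential entropy N/2 * ln(2*pi*e)
```

When the coding density matches the true density of the sample, the empirical cross-entropy converges to the differential entropy, here \(\ln(2\pi e)\) for \(N=2\).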
In our investigations we are interested in (best) coding for Y by densities chosen from a set of densities \(\mathcal {F}\), and thus we will need the following definition.
Definition 1
By the cross-entropy of Y with respect to a family of coding densities \(\mathcal {F}\) we understand
\[ H^{\times }(Y\Vert \mathcal {F}) := \inf _{f\in \mathcal {F}} H^{\times }(Y\Vert f). \]
Observe that the search for the density f with minimal cross-entropy leads exactly to maximum likelihood estimation. Thus in general the calculation of \(H^{\times }(Y\Vert \mathcal {F})\) is nontrivial, as it is equivalent to finding the ML estimator.
As is the case in many statistical or data-information problems, the basic role in our investigations is played by the Gaussian densities. We recall that the normal variable with mean \(\mathrm {m}\) and covariance matrix \(\varSigma \) has the density \( \mathcal {N}\!_{(\mathrm {m},\varSigma )}(x) = \frac{1}{\sqrt{(2\pi )^N {\mathrm {det}}(\varSigma )}}e^{(-\frac{1}{2}\Vert x-\mathrm {m}\Vert _{\varSigma }^2)}, \) where \(\Vert x-\mathrm {m}\Vert _{\varSigma }\) is the Mahalanobis norm \( \Vert x-\mathrm {m}\Vert _{\varSigma }^2:=(x-\mathrm {m})^T\varSigma ^{-1}(x-\mathrm {m}) \), see [13]. The differential entropy of the Gaussian distribution is given by
\[ H(\mathcal {N}\!_{(\mathrm {m},\varSigma )}) = \frac{N}{2}\ln (2\pi e) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma ). \]
From now on, if not otherwise specified, we assume that all the considered random variables have finite second moments and that they have values in \(\mathbb {R}^N\). For a random variable Y, by \(\mathrm {m}_Y=E(Y)\) we denote its mean, and by \(\varSigma _Y\) its covariance matrix, that is \( \varSigma _Y=E((Y-\mathrm {m}_Y) \cdot (Y-\mathrm {m}_Y)^T). \)
We will need the following result, which says that the cross-entropy of an arbitrary random variable Y versus normal can be computed just from the knowledge of covariance and mean of Y.
Theorem 1
([4], Theorem 5.59). Let Y be a random variable with finite covariance matrix. Then for arbitrary \(\mathrm {m}\) and positive-definite covariance matrix \(\varSigma \) we have
\[ H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) = \frac{N}{2}\ln (2\pi ) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma ) + \frac{1}{2}{\mathrm {tr}}(\varSigma ^{-1}\varSigma _Y) + \frac{1}{2}\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma }^2. \]
Remark 1
Suppose that we are given a data set \(\mathcal {Y}\). Then we usually understand the data as a sample realization of a random variable Y. Consequently as an estimator for the mean of Y we use the mean \(\mathrm {m}_{\mathcal {Y}} = \frac{1}{|\mathcal {Y}|} \sum \limits _{y \in \mathcal {Y}}y\) of the data \(\mathcal {Y}\) and as the covariance we use the ML estimator \(\frac{1}{|\mathcal {Y}|}\sum \limits _{y \in \mathcal {Y}}(y-\mathrm {m}_{\mathcal {Y}})(y-\mathrm {m}_{\mathcal {Y}})^T\).
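A minimal sketch of the estimators from Remark 1 (the function name is ours); note the \(1/|\mathcal{Y}|\) normalization of the ML covariance, in contrast to the unbiased \(1/(|\mathcal{Y}|-1)\) version:

```python
import numpy as np

def ml_mean_cov(Y):
    """Sample mean and the ML (1/n, biased) covariance estimator of Remark 1."""
    Y = np.asarray(Y, dtype=float)
    m = Y.mean(axis=0)
    C = (Y - m).T @ (Y - m) / len(Y)  # 1/|Y| normalization, not 1/(|Y|-1)
    return m, C

rng = np.random.default_rng(1)
Y = rng.multivariate_normal([3.0, 4.0], [[1.0, 0.3], [0.3, 0.6]], size=500)
m, C = ml_mean_cov(Y)
```

The same estimator is available as `np.cov(Y.T, bias=True)`, which can serve as a cross-check.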
As a direct corollary we obtain the formula for the optimal choice of origin.
Proposition 1
Let Y be a random variable and \(\varSigma \) be a fixed covariance matrix. Let M be a nonempty closed subset of \(\mathbb {R}^N\). Among all normal coding densities \(\mathcal {N}\!_{(\mathrm {m},\varSigma )}\) with \(\mathrm {m}\in M\), the minimal cross-entropy is realized for that \(\mathrm {m}\in M\) which minimizes \(\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma }\), and equals
\[ \frac{N}{2}\ln (2\pi ) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma ) + \frac{1}{2}{\mathrm {tr}}(\varSigma ^{-1}\varSigma _Y) + \frac{1}{2}d_{\varSigma }(\mathrm {m}_Y,M)^2, \]
where \(d_{\varSigma }\) is the Mahalanobis distance. Consequently, if \(M=\mathbb {R}^N\), the minimum is realized for \(\mathrm {m}=\mathrm {m}_Y\) and equals
\[ \frac{N}{2}\ln (2\pi ) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma ) + \frac{1}{2}{\mathrm {tr}}(\varSigma ^{-1}\varSigma _Y). \]
It turns out that our basic MLE problem, when we restrict to Gaussian densities, can be naturally interpreted as the search for the optimal rescaling (the optimal choice of coordinate system).
Remark 2
Let us start with the one-dimensional case. If we allow the translation of the origin of the coordinate system, we usually apply the standardization/normalization \(s:Y \rightarrow (Y-\mathrm {m}_Y)/\sigma _Y\). In the multivariate case the normalization is given by the transformation \(s:Y \rightarrow \varSigma _Y^{-1/2}(Y-\mathrm {m}_Y)\). Then the coordinates are uncorrelated, and the covariance matrix is the identity. Taking the distance between the transformations of points x, y:
\[ \Vert s(x)-s(y)\Vert ^2 = (x-y)^T\varSigma _Y^{-1}(x-y), \]
we arrive naturally at the Mahalanobis distance \(\Vert x-y\Vert _{\varSigma }^2=(x-y)^T \varSigma ^{-1}(x-y)\). If we do not allow the translation of the origin, we usually only scale each coordinate by dividing it by its mean (then the unit scale plays the normalizing role, as after the transformation the mean of each coordinate is one).
To study the question of what the optimal procedure is, we need a criterion to compare different coordinate systems. Suppose that we are given a basis \(\mathrm {v}=(v_1,\ldots ,v_N)\) of \(\mathbb {R}^N\) and an origin of the coordinate system \(\mathrm {m}\). Then by \(\mathcal {N}\!_{[\mathrm {m},\mathrm {v}]}\) we denote the “normalized” Gaussian density with respect to the basis \(\mathrm {v}\) with center at \(\mathrm {m}\) (the Gaussian which becomes standard normal in the coordinates given by \([\mathrm {m},\mathrm {v}]\)), that is
\[ \mathcal {N}\!_{[\mathrm {m},\mathrm {v}]} := \mathcal {N}\!_{(\mathrm {m},VV^T)}, \quad \text{where } V=[v_1,\ldots ,v_N] \text{ is the matrix with columns } v_i. \]
Then as a measure of fitness of the coordinate system \([\mathrm {m},\mathrm {v}]\) we understand the cross-entropy \(H^{\times }(Y\Vert \mathcal {N}\!_{[\mathrm {m},\mathrm {v}]})\).
3 Rescaling
Let us first consider the question of how we should uniformly rescale the classical coordinates to optimally “fit” the data. Assume that we have fixed the origin of the coordinate system at \(\mathrm {m}\) and that we want to find how we should (uniformly) rescale the coordinates to optimally fit the data. This means that we search for the s for which \( s \rightarrow H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},s\mathrm {I})}) \) attains its minimum. Since
\[ H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},s\mathrm {I})}) = \frac{N}{2}\ln (2\pi ) + \frac{N}{2}\ln s + \frac{1}{2s}\big({\mathrm {tr}}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2\big), \]
a trivial calculation shows that the above function attains its minimum
\[ \frac{N}{2}\ln (2\pi e) + \frac{N}{2}\ln \Big(\frac{{\mathrm {tr}}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2}{N}\Big) \]
for \(s=({\mathrm {tr}}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2)/N\). Thus we have arrived at the following theorem.
Theorem 2
Let Y be a random variable with invertible covariance matrix and let \(\mathrm {m}\) be fixed. Then the minimum of \(H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0})\) is realized for \(s=({\mathrm {tr}}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2)/N\), and equals
\[ \frac{N}{2}\ln (2\pi e) + \frac{N}{2}\ln \Big(\frac{{\mathrm {tr}}(\varSigma _Y)+\Vert \mathrm {m}-\mathrm {m}_Y\Vert ^2}{N}\Big). \]
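These closed forms are easy to check numerically. The NumPy sketch below (function name ours) computes the cross-entropy of Theorem 1, the optimal uniform scale of Theorem 2 for the data of Example 1, and confirms that other scales do no better:

```python
import numpy as np

def gauss_cross_entropy(m_Y, S_Y, m, S):
    """H^x(Y || N(m, S)) via Theorem 1: depends on Y only through m_Y, S_Y."""
    N = len(m_Y)
    d = m - m_Y
    Sinv = np.linalg.inv(S)
    return 0.5 * (N * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                  + np.trace(Sinv @ S_Y) + d @ Sinv @ d)

m_Y = np.array([3.0, 4.0])
S_Y = np.array([[1.0, 0.3], [0.3, 0.6]])
m = np.zeros(2)
N = 2

s_opt = (np.trace(S_Y) + np.sum((m - m_Y) ** 2)) / N       # Theorem 2
H_opt = gauss_cross_entropy(m_Y, S_Y, m, s_opt * np.eye(N))
H_closed = N / 2 * np.log(2 * np.pi * np.e) + N / 2 * np.log(s_opt)
```

Here `s_opt` equals \((1.6 + 25)/2 = 13.3\), and `H_opt` coincides with the closed-form minimum of Theorem 2.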
Example 1
Let \(\mathcal {Y}\) be a realization of the normal random variable Y with \(\mathrm {m}_{Y}=[3,4]^T\) and \(\varSigma = \left[ \begin{array}{cc}1&{}0.3\\ 0.3&{}0.6\end{array}\right] \), and let \(\mathrm {m}= [0,0]^T\). In Fig. 1 we present a sample \(\mathcal {Y}\) with the coordinate system (marked by bold black segments) obtained by Theorem 2 and the data in the new basis.
Observe that the above minimum depends only on the trace of the covariance matrix of Y and the Euclidean distance of \(\mathrm {m}\) from \(\mathrm {m}_Y\). If we allow the change of the origin, we should clearly put the origin at \(\mathrm {m}_Y\):
Corollary 1
Let Y be a random variable with invertible covariance matrix. Then the minimum of \( H^{\times }(Y\Vert \{\mathcal {N}\!_{(\mathrm {m},s \mathrm {I})}\}_{s>0,\mathrm {m}\in \mathbb {R}^N})\) is realized for \(\mathrm {m}=\mathrm {m}_Y\), \(s=\frac{1}{N}\mathrm {tr}(\varSigma _Y)\), and equals
\[ \frac{N}{2}\ln (2\pi e) + \frac{N}{2}\ln \Big(\frac{{\mathrm {tr}}(\varSigma _Y)}{N}\Big). \]
Corollary 2
Let \(\mathcal {Y}=(y_1,\ldots ,y_n)\) be a given data set. Assume that we want to move the origin to \(\mathrm {m}\) and uniformly rescale the coordinates. Then
\[ s \rightarrow \frac{s-\mathrm {m}}{\sqrt{({\mathrm {tr}}(\varSigma _{\mathcal {Y}})+\Vert \mathrm {m}-\mathrm {m}_{\mathcal {Y}}\Vert ^2)/N}} \]
is the optimal rescaling, where \(\varSigma _{\mathcal {Y}}\) is the covariance of \(\mathcal {Y}\). If we additionally allow the change of the origin, we should put \(\mathrm {m}=\mathrm {m}_{\mathcal {Y}}\), and consequently the rescaling takes the form \(s \rightarrow (s-\mathrm {m}_{\mathcal {Y}})/\sqrt{\mathrm {tr}(\varSigma _{\mathcal {Y}})/N}\).
Applying the above, we obtain that in the one-dimensional case the rescaling takes the form \(s \rightarrow (s-\mathrm {m}_{\mathcal {Y}})/\sigma _{\mathcal {Y}}\) (if we allow the change of origin), and \(s \rightarrow s/\sqrt{\mathrm {m}_{\mathcal {Y}}^2+\sigma _{\mathcal {Y}}^2}\) (if we fix the origin at zero).
Example 2
Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1. In Fig. 2 we present a sample \(\mathcal {Y}\) with the coordinate system obtained by Corollary 1 and the data in the new basis.
Now we consider the case when we allow rescaling each coordinate \(Y_i\) of \(Y=(Y_1,\ldots ,Y_N)\) separately. For simplicity of notation we consider a splitting into two factors, \(\mathbb {R}^N=\mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\). For densities \(f_1\) and \(f_2\) on \(\mathbb {R}^{N_1}\) and \(\mathbb {R}^{N_2}\), respectively, we define the product density \(f_1 \otimes f_2\) on \(\mathbb {R}^N=\mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\) by the formula
\[ (f_1 \otimes f_2)(x_1,x_2) := f_1(x_1)\, f_2(x_2) \]
for \((x_1,x_2) \in \mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\). Given density families \(\mathcal {F}_1\) and \(\mathcal {F}_2\), we put \(\mathcal {F}_1 \otimes \mathcal {F}_2:=\{f_1 \otimes f_2:f_1 \in \mathcal {F}_1, f_2 \in \mathcal {F}_2\}\). Let \( Y:(\varOmega ,\mu ) \rightarrow \mathbb {R}^{N_1} \times \mathbb {R}^{N_2} \) be a random variable and let \(Y_1:\varOmega \rightarrow \mathbb {R}^{N_1}\) and \(Y_2:\varOmega \rightarrow \mathbb {R}^{N_2}\) denote the first and second coordinate of Y (observe that in general \(Y_1\) and \(Y_2\) are not independent random variables). One can easily observe that:
Proposition 2
Let \(\mathcal {F}_1\) and \(\mathcal {F}_2\) denote coding density families in \(\mathbb {R}^{N_1}\) and \(\mathbb {R}^{N_2}\), respectively, and let \(Y:\varOmega \rightarrow \mathbb {R}^{N_1} \times \mathbb {R}^{N_2}\) be a random variable. Then
\[ H^{\times }(Y\Vert \mathcal {F}_1 \otimes \mathcal {F}_2) = H^{\times }(Y_1\Vert \mathcal {F}_1) + H^{\times }(Y_2\Vert \mathcal {F}_2). \]
The above result means that if we allow rescaling the coordinates separately, we can treat them as separate random variables. Thus we obtain the following theorem.
Theorem 3
Let \(\mathcal {Y}\) be a data set, and let \(\mathcal {Y}_k\) denote the set containing its k-th coordinates. Then the optimal rescaling of the k-th coordinate is given by
\[ s \rightarrow \frac{s-\mathrm {m}_{\mathcal {Y}_k}}{\sigma _{\mathcal {Y}_k}} \ \text{(if we allow the change of origin)}, \qquad s \rightarrow \frac{s}{\sqrt{\mathrm {m}_{\mathcal {Y}_k}^2+\sigma _{\mathcal {Y}_k}^2}} \ \text{(if we fix the origin at zero)}. \]
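The coordinate-wise rescaling can be sketched as follows (NumPy; the helper name and flag are ours), covering both the fixed-origin and free-origin variants and using the ML (1/n) variance per coordinate:

```python
import numpy as np

def rescale_coordinates(Y, fix_origin_at_zero=False):
    """Optimal per-coordinate rescaling in the spirit of Theorem 3."""
    Y = np.asarray(Y, dtype=float)
    m = Y.mean(axis=0)
    var = Y.var(axis=0)                   # ML (1/n) estimator of sigma_k^2
    if fix_origin_at_zero:
        return Y / np.sqrt(m ** 2 + var)  # y_k -> y_k / sqrt(m_k^2 + sigma_k^2)
    return (Y - m) / np.sqrt(var)         # y_k -> (y_k - m_k) / sigma_k

rng = np.random.default_rng(2)
Y = rng.multivariate_normal([3.0, 4.0], [[1.0, 0.3], [0.3, 0.6]], size=400)
Z_free = rescale_coordinates(Y)                         # standardized coordinates
Z_fixed = rescale_coordinates(Y, fix_origin_at_zero=True)
```

After the free-origin rescaling each coordinate has mean 0 and variance 1; after the fixed-origin rescaling each coordinate has second moment 1.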
Example 3
Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1. In Fig. 3 we present a sample \(\mathcal {Y}\) with the coordinate system obtained by Theorem 3 (when we allow the change of origin) and the data in the new basis. In Fig. 4 we present a sample \(\mathcal {Y}\) with the coordinate system obtained by Theorem 3 (when we fix the origin at zero) and the data in the new basis.
4 Main Result
We find the optimal coordinate system in the general case by applying an approach similar to that from [19]. To do so, we need a simple consequence of the famous von Neumann trace inequality. Next we discuss the optimal rescaling if we move the origin of the coordinate system to the mean of the data.
In most of our further results the following proposition plays an important role. In its proof we use the well-known von Neumann trace inequality [7, 14]:
Theorem [von Neumann trace inequality]. Let E, F be complex \(N \times N\) matrices. Then
\[ |\mathrm {tr}(EF)| \le \sum _{i=1}^{N} s_i(E)\, s_i(F), \qquad (8) \]
where \(s_i(D)\) denote the decreasingly ordered singular values of the matrix D.
We also need the Sherman-Morrison formula [2]:
Theorem [Sherman-Morrison formula]. Suppose A is an invertible square matrix and u, v are column vectors such that \(1 + v^T A^{-1}u \ne 0\). Then
\[ (A+uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1+v^T A^{-1}u}. \]
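The formula is easy to sanity-check numerically; the NumPy sketch below (variable names ours, arbitrary test data) compares it against a direct matrix inversion:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # comfortably invertible
u = rng.standard_normal(4)
v = rng.standard_normal(4)

Ainv = np.linalg.inv(A)
denom = 1 + v @ Ainv @ u                          # must be nonzero for the formula
sherman_morrison = Ainv - np.outer(Ainv @ u, v @ Ainv) / denom
direct = np.linalg.inv(A + np.outer(u, v))
```

The rank-one update `sherman_morrison` and the direct inverse `direct` agree to floating-point precision.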
Let us recall that for a symmetric positive matrix the eigenvalues coincide with the singular values. Given \(\lambda _1,\ldots ,\lambda _N \in \mathbb {R}\), by \(S_{\lambda _1,\ldots ,\lambda _N}\) we denote the set of all symmetric matrices with eigenvalues \(\lambda _1,\ldots ,\lambda _N\).
Proposition 3
Let B be a symmetric nonnegative matrix with eigenvalues \(\beta _1 \ge \ldots \ge \beta _N \ge 0\), and let \(0 \le \lambda _1 \le \ldots \le \lambda _N\) be fixed. Then
\[ \min _{A \in S_{\lambda _1,\ldots ,\lambda _N}} \mathrm {tr}(AB) = \sum _{i=1}^{N} \lambda _i\beta _i. \]
Proof
Let \((e_i)\) denote the orthonormal basis built from the eigenvectors of B (ordered so that \(Be_i=\beta _i e_i\)), and let the operator \(\bar{A}\) be defined in this basis by \(\bar{A}(e_i)=\lambda _i e_i\). Then trivially
\[ \mathrm {tr}(\bar{A}B) = \sum _{i=1}^{N} \lambda _i\beta _i. \]
To prove the reverse inequality we use the von Neumann trace inequality. Let \(A \in S_{\lambda _1,\ldots ,\lambda _N}\) be arbitrary. We apply inequality (8) to \(E=\lambda _N \mathrm {I}-A\), \(F=B\). Since E and F are symmetric nonnegatively defined matrices, their eigenvalues \(\lambda _N-\lambda _i\) and \(\beta _i\) coincide with their singular values, and therefore by (8)
\[ \mathrm {tr}((\lambda _N \mathrm {I}-A)B) \le \sum _i (\lambda _N-\lambda _i)\beta _i. \qquad (9) \]
Since
\[ \mathrm {tr}((\lambda _N \mathrm {I}-A)B) = \lambda _N\sum _i \beta _i - \mathrm {tr}(AB) \quad \text{and} \quad \sum _i (\lambda _N-\lambda _i)\beta _i = \lambda _N\sum _i \beta _i - \sum _i \lambda _i\beta _i, \]
from inequality (9) we obtain that \( \mathrm {tr}(AB) \ge \sum _i \lambda _i \beta _i. \)
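A numerical sanity check of Proposition 3 (NumPy sketch; variable names ours): pairing the increasing \(\lambda \)'s with the decreasing \(\beta \)'s in the eigenbasis of B attains the minimum, and random symmetric matrices with the same spectrum never do better.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 4
lam = np.sort(rng.uniform(0.1, 2.0, N))           # 0 <= lam_1 <= ... <= lam_N
beta = np.sort(rng.uniform(0.0, 2.0, N))[::-1]    # beta_1 >= ... >= beta_N >= 0

Q, _ = np.linalg.qr(rng.standard_normal((N, N)))  # random orthogonal matrix
B = Q @ np.diag(beta) @ Q.T                       # symmetric, eigenvalues beta

A_bar = Q @ np.diag(lam) @ Q.T                    # minimizer: diagonal in B's eigenbasis
t_min = np.trace(A_bar @ B)                       # = sum_i lam_i * beta_i

traces = []
for _ in range(200):                              # random A in S_{lam_1,...,lam_N}
    P, _ = np.linalg.qr(rng.standard_normal((N, N)))
    traces.append(np.trace(P @ np.diag(lam) @ P.T @ B))
```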
Now we proceed to the main result of the paper. For \(M \subset \mathbb {R}^N \), by \( \mathcal {G}_{M} \) we denote the set of Gaussian densities with mean \(\mathrm {m}\in M\).
Theorem 4
Let \(\mathrm {m}\in \mathbb {R}^N\) be fixed and let \(\mathcal {G}_{\{\mathrm {m}\}}\) denote the set of Gaussians with mean \(\mathrm {m}\). Then \(H^{\times }(Y\Vert \mathcal {G}_{\{\mathrm {m}\}})\) equals
\[ \frac{N}{2}\ln (2\pi e) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma _Y) + \frac{1}{2}\ln \big(1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma _Y}^2\big) \]
and is attained for \( \varSigma =\varSigma _Y + (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T. \)
Proof
Let us first observe that by applying the substitution
\[ A = \varSigma _Y^{1/2}\varSigma ^{-1}\varSigma _Y^{1/2}, \qquad v = \varSigma _Y^{-1/2}(\mathrm {m}-\mathrm {m}_Y), \]
we obtain
\[ H^{\times }(Y\Vert \mathcal {N}\!_{(\mathrm {m},\varSigma )}) = \frac{N}{2}\ln (2\pi ) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma _Y) + \frac{1}{2}\big(\mathrm {tr}(A) + v^TAv - \ln {\mathrm {det}}(A)\big). \qquad (10) \]
Then A is a symmetric positive matrix. Conversely, given a symmetric positive matrix A we can uniquely recover \(\varSigma \) by the formula
\[ \varSigma = \varSigma _Y^{1/2}A^{-1}\varSigma _Y^{1/2}. \qquad (11) \]
Thus finding the minimum of (10) reduces to finding a symmetric positive matrix A which minimizes the value of
\[ \mathrm {tr}(A) + v^TAv - \ln {\mathrm {det}}(A). \]
Let us first consider \(A \in S_{\lambda _1,\ldots ,\lambda _N}\), where \(0 < \lambda _1 \le \ldots \le \lambda _N\) are fixed; then \(\mathrm {tr}(A)\) and \(\ln {\mathrm {det}}(A)\) are fixed, and our aim is to minimize
\[ v^TAv. \]
We fix an orthonormal basis such that \(v/\Vert v\Vert \) is its first element; then by applying the von Neumann trace inequality (Proposition 3 with \(B=vv^T\)) we obtain that the above is minimized when v is an eigenvector of A corresponding to \(\lambda _1\), and thus the minimum equals \( \lambda _1 \Vert v\Vert ^2. \) Consequently we arrive at the minimization problem
\[ \min _{0 < \lambda _1 \le \ldots \le \lambda _N} \Big(\sum _i \lambda _i + \lambda _1\Vert v\Vert ^2 - \sum _i \ln \lambda _i\Big). \qquad (12) \]
Now one can easily verify that the minimum of the above is realized for
\[ \lambda _1 = \frac{1}{1+\Vert v\Vert ^2}, \qquad \lambda _2 = \ldots = \lambda _N = 1, \]
and then (12) equals \( N+\ln (1+\Vert v\Vert ^2) = N+\ln (1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma _Y}^2), \) while the A minimizing it is given by \( A=\mathrm {I}-\frac{vv^T}{1+\Vert v\Vert ^2}. \) Consequently the minimal value of (10) is
\[ \frac{N}{2}\ln (2\pi e) + \frac{1}{2}\ln {\mathrm {det}}(\varSigma _Y) + \frac{1}{2}\ln \big(1+\Vert \mathrm {m}-\mathrm {m}_Y\Vert _{\varSigma _Y}^2\big), \]
and by (11) and the Sherman-Morrison formula it is attained for
\[ \varSigma = \varSigma _Y + (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T. \]
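Theorem 4 can be verified numerically. The NumPy sketch below (function name ours, data from Example 1) checks that the covariance \(\varSigma _Y + (\mathrm {m}-\mathrm {m}_Y)(\mathrm {m}-\mathrm {m}_Y)^T\) attains the claimed closed-form value of the cross-entropy from Theorem 1, and that random positive-definite alternatives never beat it:

```python
import numpy as np

def gauss_cross_entropy(m_Y, S_Y, m, S):
    """H^x(Y || N(m, S)) via Theorem 1."""
    N = len(m_Y)
    d = m - m_Y
    Sinv = np.linalg.inv(S)
    return 0.5 * (N * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                  + np.trace(Sinv @ S_Y) + d @ Sinv @ d)

m_Y = np.array([3.0, 4.0])
S_Y = np.array([[1.0, 0.3], [0.3, 0.6]])
m = np.zeros(2)
v = m - m_Y

S_opt = S_Y + np.outer(v, v)             # optimal covariance from Theorem 4
H_opt = gauss_cross_entropy(m_Y, S_Y, m, S_opt)

d2 = v @ np.linalg.inv(S_Y) @ v          # ||m - m_Y||^2_{Sigma_Y}
H_closed = (np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(S_Y))
            + 0.5 * np.log(1 + d2))      # closed form with N = 2

rng = np.random.default_rng(5)
others = []
for _ in range(200):                     # random positive-definite covariances
    W = rng.standard_normal((2, 2))
    others.append(gauss_cross_entropy(m_Y, S_Y, m, W @ W.T + 0.1 * np.eye(2)))
```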
Example 4
Let \(\mathcal {Y}\) be a realization of the normal random variable Y from Example 1 and let \(\mathrm {m}\) be fixed. In Fig. 5 we present a sample \(\mathcal {Y}\) with the coordinate system obtained by Theorem 4 and the data in the new basis.
5 Conclusion
In this paper we show that MLE in the class of Gaussian densities can be equivalently understood as the search for the coordinates which best describe the given dataset \(\mathcal {Y}\subset \mathbb {R}^N\). The main result of the paper provides the formula for the optimal coordinate system in the case when the mean of the Gaussian density satisfies certain constraints.
Our work can be used in density estimation and clustering algorithms which use different Gaussian models.
References
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821 (1993)
Bartlett, M.S.: An inverse matrix adjustment arising in discriminant analysis. Ann. Math. Stat. 22(1), 107–111 (1951)
Borg, I., Groenen, P.: Modern Multidimensional Scaling: Theory and Applications. Springer, Heidelberg (2005)
Van den Bos, A.: Parameter Estimation for Scientists and Engineers. Wiley Online Library, New York (2007)
Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28(5), 781–793 (1995)
Cover, T., Thomas, J., Wiley, J., et al.: Elements of Information Theory. Wiley Online Library, New York (1991)
Grigorieff, R.: A note on von Neumann's trace inequality. Math. Nachr. 151, 327–328 (1991)
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2011)
Krishnaiah, P.: Handbook of Statistics. North-Holland, New York (1988)
Kullback, S.: Information Theory and Statistics. Dover Publications, New York (1997)
Lehmann, E., Casella, G.: Theory of Point Estimation. Springer, New York (1998)
De Maesschalck, R., Jouan-Rimbaud, D., Massart, D.: The Mahalanobis distance. Chemometr. Intell. Lab. Syst. 50(1), 1–18 (2000)
Mahalanobis, P.C.: On the generalised distance in statistics. Proc. Nat. Inst. Sci. 2, 49–55 (1936)
Mirsky, L.: A trace inequality of John von Neumann. Monatshefte für Mathematik 79(4), 303–306 (1975)
Ng, S., Krishnan, T., McLachlan, G.: The EM algorithm. In: Gentle, J.E., Härdle, W.K., Mori, Y. (eds.) Handbook of Computational Statistics: Concepts and Methods. Springer Handbooks of Computational Statistics, pp. 139–172. Springer, Heidelberg (2004)
Nielsen, F., Nock, R.: Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 55(6), 2882–2904 (2009)
Raykov, T., Marcoulides, G.: An Introduction to Applied Multivariate Analysis. Routledge, London (2008)
Rencher, A.: Methods of Multivariate Analysis. Wiley Online Library, New York (1995)
Theobald, C.: An inequality with application to multivariate analysis. Biometrika 62(2), 461–466 (1975)
Timm, N.: Applied Multivariate Analysis. Springer, New York (2002)
© 2017 Springer International Publishing AG
Spurek, P., Tabor, J. (2017). Maximum Likelihood Estimation and Optimal Coordinates. In: Świątek, J., Tomczak, J. (eds) Advances in Systems Science. ICSS 2016. Advances in Intelligent Systems and Computing, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-48944-5_1
Print ISBN: 978-3-319-48943-8
Online ISBN: 978-3-319-48944-5