
1 Introduction

Data processing techniques, such as mathematical formulas, statistical methods and machine learning algorithms, require a set of tools for evaluating knowledge extracted from data. There are several ways to perform such estimation. Based on the research presented in [1], we consider the following estimation methods: referential, predictive and informational.

In unsupervised learning it is impossible to use referential or predictive estimation. Referential methods require a reference model that is believed to represent our best guess about the knowledge hidden in the data. Predictive methods are also limited, as they require some external evaluation of predictions. Therefore, the only reliable way to estimate the results of unsupervised learning is informational estimation.

1.1 Underfitting/Overfitting of Information Measures

Unfortunately, informational estimation, like many other methods for solving optimization problems, suffers from overfitting. Overfitting is a situation where a model corresponds not only to the relations between variables but also to random noise in the data [2]. Model incompleteness can be caused either by bias (an error that occurs because the chosen model is not able to describe the dependencies in the data) or by variance (an error that occurs due to increased sensitivity to noise). Bias and variance, in turn, are related to model complexity, a quantity derived from the type of the model, the amount of input data and the number of parameters. A model that is too simple has high bias, and, vice versa, a model that is too complex has high variance.

Let us show that informational measures suffer from overfitting. To do so, consider Fisher’s iris data set [3] and estimate the mutual information of the instance variable N with each of two random variables A and B [4], where A is the real class of an instance and B is some unique identifier, with respect to the original probability distribution of instances. This data set contains \(N=150\) instances of 3 different classes, therefore \(P(n)=\frac{1}{150}\), \(P(a)=\frac{1}{3}\) and \(P(b)=\frac{1}{150}\). The mutual information MI(Y, X) of two discrete random variables is determined by the formula

$$\begin{aligned} MI(Y,X)=\sum _x\sum _yP(x,y)\log \frac{P(x,y)}{P(x)P(y)}. \end{aligned}$$
(1)

Therefore, \(MI(A,N)\approx 1.58\) and \(MI(B,N)\approx 7.23\).
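
These two values can be reproduced with a few lines of code; the following Python sketch (our illustration, not part of the original experiments) evaluates Eq. (1) for the setup above:

```python
# A minimal sketch (ours): N indexes the 150 instances, A is the class
# label (assumed here to be 3 equal classes in consecutive blocks of 50)
# and B is a unique identifier.
from math import log2

def mutual_information(joint):
    """MI(Y, X) per Eq. (1) from a joint probability table {(y, x): p}."""
    px, py = {}, {}
    for (y, x), p in joint.items():
        py[y] = py.get(y, 0.0) + p
        px[x] = px.get(x, 0.0) + p
    return sum(p * log2(p / (py[y] * px[x]))
               for (y, x), p in joint.items() if p > 0)

N = 150
joint_AN = {(n // 50, n): 1 / N for n in range(N)}   # class vs. instance
joint_BN = {(n, n): 1 / N for n in range(N)}         # identifier vs. instance

print(mutual_information(joint_AN))   # ~1.58 = log2(3)
print(mutual_information(joint_BN))   # ~7.23 = log2(150)
```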

One can use mutual information either to compare a classification result with a reference model or to estimate the amount of knowledge gained after data processing. Since we are interested in unsupervised learning, the second case is relevant here. For estimating the amount of extracted knowledge, mutual information is used in some algorithms for constructing decision trees, where preference is given to attributes with the highest mutual information. For a decision tree over the data set above one would therefore choose the attribute B as the first node, but the resulting model, consisting of 150 classes, would be far too complicated. This is an example of overfitting.

There are various ways of avoiding overfitting:

  • Cross-validation,

  • Regularization,

  • Prior probabilities,

  • Bayes factor and others [1].

One of the typical methods for solving the problem described above is the regularization of the mutual information MI(X, Y) using the entropy H(Y). The ratio

$$\begin{aligned} IGR(X,Y)=\frac{MI(X,Y)}{H(Y)} \end{aligned}$$
(2)

is called the information gain ratio [9].

However, there are cases where IGR is not applicable. One such case is where Y is completely determined by X, i.e. there is a function \(f:X\rightarrow Y\). In this case \(H(Y|X)=0\). Then, since

$$\begin{aligned} MI(Y,X)=H(Y)-H(Y|X) \end{aligned}$$
(3)

it holds that,

$$\begin{aligned} H(Y|X)=0\Rightarrow MI(Y,X)=H(Y) \end{aligned}$$
(4)

therefore,

$$\begin{aligned} H(Y|X)=0\Rightarrow IGR(X,Y)=1 \end{aligned}$$
(5)

Thus, if there is a function \(f:X\rightarrow Y\), IGR is useless. A classification task is essentially a search for a function that assigns labels to data; therefore, IGR is useless for estimating classification in unsupervised learning. The example described above is exactly such a classification.
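
A small self-contained check of this degenerate case (our illustration, with base-2 logarithms and a toy variable X uniform over 6 values): whenever Y is a function of X, MI(X, Y) equals H(Y) and IGR(X, Y) is 1, regardless of how useful the labeling actually is.

```python
# A small self-contained check (ours): whenever Y = f(X), IGR(X, Y) = 1.
# Here X is uniform over 6 values and Y = X mod 2; base-2 logarithms.
from math import log2

def entropy(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

xs = list(range(6))
pairs = [(x, x % 2) for x in xs]      # Y is completely determined by X
h_x = entropy([x for x, _ in pairs])
h_y = entropy([y for _, y in pairs])
h_xy = entropy(pairs)                  # joint entropy H(X, Y)
mi = h_x + h_y - h_xy                  # MI(X, Y) = H(X) + H(Y) - H(X, Y)
print(mi / h_y)                        # IGR(X, Y) per Eq. (2): 1.0
```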

1.2 Objective Priors

In this paper we suggest considering objective priors for preventing overfitting/underfitting. A prior probability distribution of a random variable is a distribution that characterizes its value before obtaining experimental data. There are informative and uninformative, or objective, prior distributions. An uninformative (objective) distribution expresses vague or general information about a variable and has the following advantages:

  • invariance with respect to parameter structure;

  • inverse dependence on model complexity;

  • independence from subjective assumptions.

Nowadays, applications of objective priors in regression and classification tasks are actively researched [5,6,7,8].

An example of objective prior usage is adjusted mutual information [10]. The idea of adjusted mutual information is to adjust the mutual information of two random variables using the joint probability distribution of their partitions. Suppose there are N points, two partitions U and V, and the numbers of points \(a_i=|U_i|\) for \(U_i\in U, i=1...R\) and \(b_j=|V_j|\) for \(V_j\in V, j=1...C\); then the total number of ways to distribute the N points over the two partitions U and V is \(\varOmega \)

$$\begin{aligned} \varOmega =\frac{(N!)^2}{\prod _ia_i!\prod _jb_j!} \end{aligned}$$
(6)

Any two joint partitions U and V can be represented as a contingency table M

$$\begin{aligned} M=[n_{ij}]^{i=1...R}_{j=1...C} \end{aligned}$$
(7)

Suppose there is some contingency table M; then there are w different ways to distribute the points so that this M is obtained

$$\begin{aligned} w=\frac{N!}{\prod _i\prod _jn_{ij}!} \end{aligned}$$
(8)

Thus, the probability P(M|a, b) of some M with respect to the set \(\mathcal {M}\) of all possible contingency tables is determined by the formula

$$\begin{aligned} P(M|a,b)=\frac{w}{\varOmega } \end{aligned}$$
(9)

Mutual information MI(M) of the contingency table M is determined by the formula

$$\begin{aligned} MI(M)=\sum _i\sum _j\frac{n_{ij}}{N}\log \frac{N\cdot n_{ij}}{a_ib_j} \end{aligned}$$
(10)

then the average mutual information over all possible joint partitions of random variables X and Y is defined as the expectation of mutual information E(MI(M)|a, b)

$$\begin{aligned} E(MI(X,Y))=E(MI(M)|a,b)=\sum _{M\in \mathcal {M}}MI(M)P(M|a,b) \end{aligned}$$
(11)

Finally, adjusted mutual information AMI(X, Y) is defined as follows

$$\begin{aligned} AMI(X,Y)=\frac{MI(X,Y)-E(MI(X,Y))}{\max (H(X),H(Y))-E(MI(X,Y))} \end{aligned}$$
(12)

Thus, the hypergeometric distribution of two joint partitions is used in adjusted mutual information as an objective prior. The most natural way to use adjusted mutual information is to estimate a clustering result. But again, it estimates the similarity of a resulting partition to a reference partition, while we are interested in a method for estimating knowledge gain.
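
For comparison, adjusted mutual information is readily available in scikit-learn; the following illustration (ours) shows how it corrects the overfitting observed with raw mutual information in Sect. 1.1:

```python
# An illustration (ours, using scikit-learn): against the true iris
# classes, the "unique identifier" labeling reaches the largest raw MI
# attainable, yet its AMI is ~0, while a perfect 3-class labeling
# scores AMI = 1.
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

y = load_iris().target                  # true classes: 3 classes of 50 instances
unique_id = list(range(len(y)))         # one "cluster" per instance

print(mutual_info_score(y, unique_id))           # = H(y) in nats (~1.10)
print(adjusted_mutual_info_score(y, y))          # 1.0
print(adjusted_mutual_info_score(y, unique_id))  # ~0.0
```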

2 A Bayesian Information Criterion Based on an Objective Prior

In this paper we propose a new method of informational estimation, intended to estimate the knowledge gained as a result of data processing, which makes it useful in unsupervised learning. The method is applicable to functionally dependent random variables and is based on the Bayes criterion for optimal decisions and an objective prior probability distribution of partitions.

2.1 Basic Definitions

Suppose there is a function \(f:X\rightarrow Y\); then for every \(y_i\in Y\) there is an inverse image \(X_i\subseteq X\)

$$\begin{aligned} X_i=\{x:f(x)=y_i\} \end{aligned}$$
(13)

We consider the partition \(part(X)=X_1\cup ...\cup X_k\) as the resulting model of X, whose knowledge gain we are going to evaluate. The structure of a partition is described by the number of subsets k and a number partition \(\bar{n}\) of \(n=|X|\), such that \(n_i=|X_i|\) and \(n_1+...+n_k=n\).
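
For illustration (ours), the structure \((k,\bar{n})\) of the partition induced by a function can be read off directly from the sizes of its inverse images:

```python
# A tiny sketch (ours): the structure (k, n_bar) of the partition induced
# by a function f, read off from the sizes of its inverse images, Eq. (13).
from collections import Counter

def partition_structure(f, X):
    """Return (k, n_bar): block count and sorted block sizes of {f^-1(y_i)}."""
    sizes = Counter(f(x) for x in X)                 # |X_i| for every image value y_i
    n_bar = tuple(sorted(sizes.values(), reverse=True))
    return len(n_bar), n_bar

print(partition_structure(lambda x: x % 3, range(12)))   # (3, (4, 4, 4))
```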

We use relative entropy, or Kullback-Leibler divergence [11], for the basic information gain estimate

$$\begin{aligned} D_{KL}(X||Y)=\sum _iP(x_i)\log \frac{P(x_i)}{Q(x_i)} \end{aligned}$$
(14)

Considering \(P(x_i)=\frac{|X_i|}{|X|}\) and \(Q(x_i)=\frac{1}{|X|}\), we get

$$\begin{aligned} D_{KL}(part(X)||X)=\sum _i\frac{|X_i|}{|X|}\log |X_i| \end{aligned}$$
(15)

However, we propose to adjust the relative entropy with an objective prior to prevent overfitting/underfitting.
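
The relative entropy of Eq. (15) depends only on the block sizes of the partition; a minimal sketch (ours, assuming base-2 logarithms):

```python
# A minimal sketch of Eq. (15) (ours, base-2 logarithms): the relative
# entropy of a partition of X against the uniform distribution over X,
# computed from the block sizes alone.
from math import log2

def relative_entropy(block_sizes):
    """D_KL(part(X) || X) = sum_i |X_i|/|X| * log |X_i|."""
    n = sum(block_sizes)
    return sum(n_i / n * log2(n_i) for n_i in block_sizes)

print(relative_entropy([150]))          # log2(150) ~ 7.23: a single block, k = 1
print(relative_entropy([50, 50, 50]))   # log2(50)  ~ 5.64: the 3 iris classes
print(relative_entropy([1] * 150))      # 0.0: 150 singletons, k = n
```

Note that the single-block partition attains the largest value and the all-singleton partition the smallest, which is exactly why the raw quantity needs to be adjusted by a prior.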

2.2 The Bayesian Criterion for Optimal Decision

Preventing overfitting/underfitting means finding an optimal structure of a model. In decision theory, various criteria are used to find the optimal choice. Given a set of all possible actions A, a set of states of nature \(\varTheta \) with probability distribution \(P(\theta )\), and a utility function \(U(a, \theta )\) for an action \(a\in A\) and a state of nature \(\theta \in \varTheta \), consider the Bayes criterion for choosing an optimal action \(a^*\) [12]

$$\begin{aligned} a^*_b=\mathop {\hbox {argmax}}\limits _i\sum _jP(\theta _j)U(a_i, \theta _j) \end{aligned}$$
(16)

Given a set X, we consider the set of all possible functions \(X^X\) on X, which induces the set of all possible partitions Part(X) of X. Every partition \(part(X)\in Part(X)\) has a certain structure \(k_i,\bar{n}_j\). Assuming \(A=\{k\}\), \(\varTheta =\{\bar{n}\}\) and \(U(a_i, \theta _j)=D_{KL}(part_{ij}(X)||X)\), we get the expected relative entropy \(E(D_{KL}(part_k(X)||X))\), \(ED_{KL}\) for short, of a partition \(part_k(X)\) for a given number of subsets k

$$\begin{aligned} E(D_{KL}(part_k(X)||X))=\sum _jP(\bar{n}_j)D_{KL}(part_{kj}(X)||X) \end{aligned}$$
(17)

and the Bayesian criterion for optimal number of subsets \(k^*_b\)

$$\begin{aligned} k^*_b=\mathop {\hbox {argmax}}\limits _i\sum _jP(\bar{n}_j)D_{KL}(part_{ij}(X)||X) \end{aligned}$$
(18)

It only remains to define the probability distribution \(P(\bar{n})\).

2.3 The Objective Prior Distribution of Number Partitions

Let us define the objective prior distribution \(P(\bar{n})\) for the Bayes criterion of the optimal number of subsets \(k^*_b\). Given a set X, we denote the set of all possible functions on X as \(F=X^X\) and the set of functions that result in some number partition \(\bar{n}_i\) as \(F_i\subseteq F\). Let \(|X|=n\), let m be the number of distinct part sizes occurring in \(\bar{n}\), and let \(v_1,...,v_m\) be their multiplicities, so that \(v_1+...+v_m=k\); then

$$\begin{aligned} P(\bar{n}_i)=\frac{|F_i|}{|F|} \end{aligned}$$
(19)

where \(|F|=n^n\) and

$$\begin{aligned} |F_i|=\frac{n!}{n_1!...n_k!}\frac{n!}{k!(n-k)!}\frac{k!}{v_1!...v_m!} \end{aligned}$$
(20)

The objective prior \(P(\bar{n}_i)\), and therefore the criterion \(k^*_b\), is defined for every set X.
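
Putting Eqs. (15) and (18)-(20) together, a brute-force sketch of the criterion might look as follows (our illustration, not the authors' implementation; it enumerates every number partition of n, so it is only practical for small n, and base-2 logarithms are assumed):

```python
# A brute-force sketch (ours) of the objective prior of Eqs. (19)-(20)
# and the Bayesian criterion k*_b of Eq. (18). All number partitions of n
# are enumerated, so this is only practical for small n.
from collections import Counter
from fractions import Fraction
from math import factorial, log2

def number_partitions(n, max_part=None):
    """Yield every number partition of n as a non-increasing tuple."""
    if n == 0:
        yield ()
        return
    if max_part is None:
        max_part = n
    for first in range(min(n, max_part), 0, -1):
        for rest in number_partitions(n - first, first):
            yield (first,) + rest

def prior(nbar, n):
    """P(nbar_i) = |F_i| / n^n, Eq. (19), with |F_i| taken from Eq. (20)."""
    k = len(nbar)
    f_i = factorial(n)
    for n_i in nbar:                       # n! / (n_1! ... n_k!)
        f_i //= factorial(n_i)
    f_i *= factorial(n) // (factorial(k) * factorial(n - k))   # n! / (k!(n-k)!)
    f_i *= factorial(k)                    # k! / (v_1! ... v_m!)
    for v in Counter(nbar).values():       # multiplicities of repeated part sizes
        f_i //= factorial(v)
    return Fraction(f_i, n ** n)

def relative_entropy(nbar):
    """D_KL(part(X) || X) of Eq. (15), computed from the block sizes."""
    n = sum(nbar)
    return sum(n_i / n * log2(n_i) for n_i in nbar)

def k_bayes(n):
    """k*_b of Eq. (18): argmax over k of the expected relative entropy."""
    expected = Counter()
    for nbar in number_partitions(n):
        expected[len(nbar)] += float(prior(nbar, n)) * relative_entropy(nbar)
    return max(expected, key=expected.get), dict(expected)

k_star, ed_kl = k_bayes(12)
print(k_star, ed_kl[k_star])   # optimal number of subsets for n = 12
```

Exact Fractions keep the prior free of rounding before the expectation is accumulated in floating point; for \(n=75\), as in Sect. 3, a more efficient enumeration of the several million number partitions would be required.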

2.4 Other Criteria

There are other criteria in decision theory. If the probability distribution \(P(\theta )\) is unknown, one can assume the states of nature \(\varTheta \) to be equiprobable

$$\begin{aligned} a^*_l=\mathop {\hbox {argmax}}\limits _i\sum _j\frac{1}{|\varTheta |}U(a_i, \theta _j) \end{aligned}$$
(21)

or minimize possible loss

$$\begin{aligned} a^*_w=\mathop {\hbox {argmax}}\limits _i\,\min _jU(a_i, \theta _j) \end{aligned}$$
(22)

These are the Laplace and Wald criteria, respectively.

Performing analogous substitutions, we get the Laplace criterion for the optimal number of subsets \(k^*_l\)

$$\begin{aligned} k^*_l=\mathop {\hbox {argmax}}\limits _i\sum _j\frac{1}{|\{\bar{n}\}|}D_{KL}(part_{ij}(X)||X) \end{aligned}$$
(23)

and the Wald criterion for \(k^*_w\)

$$\begin{aligned} k^*_w=\mathop {\hbox {argmax}}\limits _i\,\min _jD_{KL}(part_{ij}(X)||X) \end{aligned}$$
(24)

However, the last criterion does not help with underfitting, as for any n it holds that \(k^*_w=1\), because

$$\begin{aligned} \min D_{KL}(part_1(X)||X)=\max D_{KL}(part(X)||X)=\log n \end{aligned}$$
(25)

while the Laplace criterion \(k^*_l\) may be used if the probability \(P(\bar{n})\) is too complex to calculate.
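
Under the same enumeration, the two criteria can be sketched as follows (our illustration, base-2 logarithms; since the factor \(1/|\{\bar{n}\}|\) in Eq. (23) is constant in k, plain sums are compared for the Laplace criterion):

```python
# A compact sketch (ours) of the Laplace (Eq. 23) and Wald (Eq. 24)
# criteria for small n; base-2 logarithms.
from math import log2

def number_partitions(n, max_part=None):
    """Yield every number partition of n as a non-increasing tuple."""
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part or n), 0, -1):
        for rest in number_partitions(n - first, first):
            yield (first,) + rest

def d_kl(nbar):
    """D_KL(part(X) || X) of Eq. (15) from the block sizes."""
    n = sum(nbar)
    return sum(n_i / n * log2(n_i) for n_i in nbar)

def k_laplace_and_wald(n):
    by_k = {}
    for nbar in number_partitions(n):
        by_k.setdefault(len(nbar), []).append(d_kl(nbar))
    k_l = max(by_k, key=lambda k: sum(by_k[k]))   # Laplace: unweighted sum
    k_w = max(by_k, key=lambda k: min(by_k[k]))   # Wald: maximin
    return k_l, k_w

print(k_laplace_and_wald(12))   # the Wald choice is always k = 1, as noted above
```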

3 Evaluating Unsupervised Learning Results

Consider Fisher’s iris data set, mentioned above, to demonstrate the proposed method. We compare the original label distribution with \(k=3\) against smaller and larger numbers of classes, \(k=2\) and \(k=4\) respectively. We use the k-means clustering algorithm for \(k=2\) and \(k=4\) and compare the expected relative entropy of the particular partitions. The results are given in Table 1. As we can see, partitioning a set of 150 objects yields higher expected relative entropy for \(k=4\) than for \(k=3\) and \(k=2\).

Table 1. Results of the comparison

However, this function has a global maximum, i.e. there is a \(k^*_b\) such that for \(k'>k^*_b\) the expected relative entropy decreases. To show this, let us consider another experiment. The idea is to simulate numerous results of various unsupervised classification and clustering runs to obtain a posterior distribution and to calculate the expected relative entropy for every possible k. Thus, we obtain a posterior number of subsets \(k^*_p\) with the maximum expected relative entropy and compare it with \(k^*_b\).

Due to complexity issues we consider \(n=75\). The plot of the expected relative entropy for \(k=1,...,75\) is given in Fig. 1. As we can see, \(\max (ED_{KL})=1.20\times 10^{-2}\) for \(k^*_b=47\).

Fig. 1. Expected entropy for \(n=75\) and \(k=1,...,75\)

To simulate the posterior distribution, we considered various clustering methods that do not require k as an input parameter, such as hierarchical clustering [13], affinity propagation [14], mean shift [15] and DBSCAN [16], but it turned out that their results carry a subjective bias, caused by hyperparameter values set by the user or by the logic of the algorithm itself, which may differ from the objective dependencies underlying the data. Therefore, we suggest assigning labels randomly, assuming that for a large number of samples s the distribution would be similar to real results.
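
A sketch of this simulation follows (ours); we assume that "assigning labels randomly" means drawing, for each of the n points, one of n possible labels uniformly at random, i.e. sampling a uniform function from \(X^X\) as in Sect. 2.3, and we use base-2 logarithms.

```python
# A sketch (ours) of the random-labeling simulation: each sample draws a
# uniform random function X -> X, matching the |F| = n^n model of Sect. 2.3.
import random
from collections import Counter, defaultdict
from math import log2

def simulate_k_posterior(n=75, samples=10_000, seed=0):
    rng = random.Random(seed)
    totals = defaultdict(float)            # accumulated D_KL per observed k
    for _ in range(samples):
        labels = [rng.randrange(n) for _ in range(n)]
        sizes = Counter(labels).values()   # block sizes of the induced partition
        totals[len(sizes)] += sum(s / n * log2(s) for s in sizes)
    # empirical analogue of Eq. (17): average over *all* samples, so that
    # rarely occurring values of k are weighted down by their frequency
    expected = {k: total / samples for k, total in totals.items()}
    return max(expected, key=expected.get), expected

k_p, _ = simulate_k_posterior()
print(k_p)   # the posterior optimum k*_p for the drawn samples
```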

The result of the experiment is given in Fig. 2. As we can see, \(k^*_p\) approaches \(k^*_b\) and becomes equal to it after roughly 1000 samples. Thus, we demonstrated that the prior criterion \(k^*_b\) has a global maximum and correlates with the posterior distribution.

Fig. 2. \(k^*_p\) for \(n=75\) and \(s=1,...,10000\)

4 Conclusion

Summing up, we suggested that informational estimation is the only reliable way to evaluate results of unsupervised learning and demonstrated that information measures, like mutual information, suffer from underfitting and overfitting. We considered objective priors as a method for preventing underfitting/overfitting, since they have particular advantages, and proposed a new method for evaluating a partition of a set, considered as an unsupervised classification or clustering result. It combines, on the one hand, the Bayesian criterion for optimal decisions, treating the number of subsets as the set of actions, the corresponding number partitions as states of nature and relative entropy as the utility function, and, on the other hand, the objective prior probability distribution of number partitions, defined with respect to the set of all possible functions on the underlying set. We illustrated the application of the resulting criterion on Fisher’s iris data set by comparing the original label distribution with results of clustering with different numbers of clusters, and demonstrated that the criterion has a global maximum and correlates with the posterior distribution for large numbers of samples.