
1 Introduction

The class imbalance phenomenon, i.e., largely different fractions of examples from different classes in the learning and testing sequences, is known to cause trouble when learning and assessing the quality of classifiers. The reason is that most of the known classifiers tend to give priority to the largest class in the learning sequence. This, in turn, leads to poor generalization properties. On the other hand, class imbalance is unavoidable when classifiers are used for detecting rare events (e.g., faults in production processes or the diagnosis of rare diseases).

Many attempts have been proposed to circumvent this difficulty. They can be roughly clustered as follows.

  1.

    Data editing strategies that attempt to artificially increase the fraction of minority class (classes) examples in the learning and testing sequences. Typically, this is achieved either by re-sampling (over-sampling) the minority class, by under-sampling the majority class, or by combining the two. These approaches, although useful in many cases, have one common drawback, namely, they distort the a priori class probabilities, which – in turn – may lead to an undesirable voting preference for the minority class.

  2.

    Attaching a high cost to the misclassification of minority class examples, in particular by using dedicated metrics.

  3.

    Designing classifiers dedicated to coping with the class imbalance phenomenon.

Our approach differs from the above. Namely, we take several popular classifiers and propose to rank them from the viewpoint of their robustness against class imbalance in the data. In addition to the popular classifiers, we also consider the classifier for matrix normal distributions (see [5, 8, 9]).

The second challenge in comparing the robustness of classifiers against class imbalance is the choice of criteria for their comparison. Again, a number of criteria are advocated in the literature. For this reason, we propose to use pair-wise comparisons of classifiers, for which several criteria are calculated. This approach originated with Slowinski [11] and its applicability is still growing (see [4]).

As empirical material for the case studies we take raw images of the laser additive manufacturing process (see [8] for a more detailed description of why this process is important).

An important issue in our case study is that we supply raw images as the inputs of the classifiers. This approach seems to be of importance for at least two reasons, namely,

  • it demonstrates that easily available PCs can classify images successfully and cheaply, since the process of feature extraction is time-consuming (expensive);

  • the results of the comparisons of classifiers are not biased by a human-dependent way of extracting features.

The paper is organized as follows.

  • In Sect. 2 we provide the description of a modified classifier for matrix normal distributions.

  • The well-known classifiers selected for comparison are listed and briefly commented on in Sect. 3.

  • In Sect. 4 we describe the methodology of testing and comparisons as well as their results.

  • Section 5 contains conclusions, while in the Appendix we summarize the known properties of matrix normal distributions.

2 A Modified Classifier for Matrix Normal Distributions

2.1 MND as Class Densities and Their Estimation

We assume that the probability distributions of gray-level images from class \(j=1,\, 2,\dots ,\, J\) follow MND with probability density functions (p.d.f.’s) \(f_j(\mathbf {X})\) defined in the Appendix.

MND densities have a special covariance structure in comparison to general multivariate Gaussian densities. Namely, their overall covariance matrix factorizes into an inter-row and an inter-column covariance matrix, so no separate row-column cross-covariances appear, which makes the estimation much easier (see the Appendix).

Further on, we assume that we have J learning sequences of the following form: \(\mathbf {X}_i^{(j)}\), \(i=1,\, 2,\ldots N_j\), \(j=1,\, 2, \ldots ,\, J\).

Denote by \(p_j>0\), \(j=1,\, 2, \ldots ,\, J\), \(\sum p_j=1\), a priori class probabilities. It is well known that in a general case the MAP classifier assigns \(\mathbf {X}\) to class \(j^*\) such that

$$\begin{aligned} j^*= {\text {arg}}\max _j [p_j\, f_j(\mathbf {X})] , \end{aligned}$$
(1)

where \({\text {arg}}\max _j {\text {[ ]}}\) stands for the argument for which the maximum is attained. It is also well known that this rule is the optimal one when the 0-1 loss function is used (see, e.g., [2]).

For a symmetric and positive definite matrix A, define the following function: \(\kappa (A)=\frac{\lambda _{max}(A)}{\lambda _{min}(A)}\), i.e., the condition number of A, which indicates how large the numerical errors committed when calculating the inverse of A may be. Select a threshold \(0<\kappa _{max} < 100 \).

A Modified Matrix Normal Distribution Classifier (MMNDCL)

I. The Learning Phase

  • Step (L1) Collect J learning sequences (for each class) of the following form: \(\mathbf {X}_i^{(j)}\), \(i=1,\, 2,\ldots N_j\), \(j=1,\, 2, \ldots ,\, J\).

  • Step (L2) Estimate the class mean matrices and the a priori class probabilities (with \(N=\sum _{j=1}^{J} N_j\) denoting the total length of the learning sequences) as follows

    $$\begin{aligned} \hat{M}_j=N_j^{-1}\, \sum _{i=1}^{N_j} \mathbf {X}_i^{(j)}, \quad \hat{p}_j=N_j/N, \quad j=1,\, 2, \ldots ,J. \end{aligned}$$
    (2)
  • Step (L3) Calculate the maximum likelihood estimates (MLE) of the inter-row and inter-column covariance matrices by solving the following set of equations:

    $$\begin{aligned} \hat{U}_j= & {} \frac{1}{N_j\, m}\, \sum _{i=1}^{N_j} (\mathbf {X}_i^{(j)}-\hat{\mathbf {M}}_j)\, \hat{V}_j^{-1}\, (\mathbf {X}_i^{(j)}-\hat{\mathbf {M}}_j)^T, \end{aligned}$$
    (3)
    $$\begin{aligned} \hat{V}_j= & {} \frac{1}{N_j\, n}\, \sum _{i=1}^{N_j} (\mathbf {X}_i^{(j)}-\hat{\mathbf {M}}_j)^T\, \hat{U}_j^{-1}\, (\mathbf {X}_i^{(j)}-\hat{\mathbf {M}}_j) \end{aligned}$$
    (4)

    for \(j=1,\, 2,\ldots ,\, J\). Equations (3) and (4) can be solved by the flip-flop method.

  • Step (L4) Estimate the normalization constants of class densities as follows:

    $$\begin{aligned} \hat{c}_j = (2\, \pi )^{0.5\,{n\, m}}\, {\text {det}}[\hat{U}_j]^{0.5\,m}\, {\text {det}}[\hat{V}_j]^{0.5\,n}. \end{aligned}$$
    (5)
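
For concreteness, the learning phase can be sketched in a few lines of numpy (our own illustration, not code from the paper; we assume the learning sequence of class j is stored as an array of shape \((N_j, n, m)\)):

```python
import numpy as np

def learn_class(X, n_sweeps=1):
    """Steps (L2)-(L4) for a single class; X has shape (N_j, n, m)."""
    N, n, m = X.shape
    M = X.mean(axis=0)                       # Step (L2): class mean matrix
    R = X - M                                # centered images
    V = np.eye(m)                            # initial inter-column covariance
    for _ in range(n_sweeps):                # Step (L3): flip-flop, Eqs. (3)-(4)
        U = sum(r @ np.linalg.inv(V) @ r.T for r in R) / (N * m)
        V = sum(r.T @ np.linalg.inv(U) @ r for r in R) / (N * n)
    # Step (L4): normalization constant of Eq. (5), kept in the log domain
    # to avoid numerical overflow for larger images
    _, logdetU = np.linalg.slogdet(U)
    _, logdetV = np.linalg.slogdet(V)
    log_c = 0.5 * (n * m * np.log(2.0 * np.pi) + m * logdetU + n * logdetV)
    return M, U, V, log_c
```

The estimates \(\hat{p}_j=N_j/N\) of Eq. (2) are obtained by simple counting over all classes.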

II. The Recognition Phase

  • Step 1 Acquire \(\mathbf {X}\) to be classified.

  • Step 2 Check whether all the inequalities:

    $$\begin{aligned} \kappa (\hat{U}_j) < \kappa _{max},\; j=1,\, 2, \ldots ,\, J \end{aligned}$$
    (6)

    as well as

    $$\begin{aligned} \kappa (\hat{V}_j) < \kappa _{max},\; j=1,\, 2, \ldots ,\, J \end{aligned}$$
    (7)

    are fulfilled. If so, go to Step 3; otherwise, go to Step 4.

  • Step 3 Classify the new image (matrix) \(\mathbf {X}\) according to the following rule:

    $$\begin{aligned} \hat{j} = {\text {arg}} \min _{1\le j \le J} \left[ \frac{1}{2}\, {\text {tr}}[\hat{U}_j^{-1}(\mathbf {X} - {\hat{\mathbf {M}}}_j)\,\hat{V}_j^{-1}\, (\mathbf {X} - {\hat{\mathbf {M}}}_j)^T ] -\log (\hat{p}_j/\hat{c}_j) \right] , \end{aligned}$$
    (8)

    where \(\hat{j}\) is the predicted class for \(\mathbf {X}\).

    Acquire the next image (matrix) \(\mathbf {X}\) for classification and repeat (8).

  • Step 4 Classify the new image (matrix) \(\mathbf {X}\) according to the nearest mean rule, i.e., classify it to the class

    $$\begin{aligned} \tilde{j}={\text {arg}}\min _j \, ||\mathbf {X}- \hat{M}_j||^2, \end{aligned}$$
    (9)

    where the squared distance \(||\mathbf {X}- \hat{M}_j||^2\) is defined as follows:

    $$\begin{aligned} ||\mathbf {X}- \hat{M}_j||^2\,=\, {\text {tr}}[(\mathbf {X} - {\hat{\mathbf {M}}}_j)\, (\mathbf {X} - {\hat{\mathbf {M}}}_j)^T ]. \end{aligned}$$
    (10)

    If the class \(\tilde{j}\) in (9) is selected in a sufficiently sure way, e.g., if the following condition holds for a pre-specified \(\zeta >0\)

    $$\begin{aligned} (1+\zeta )\, ||\mathbf {X}- \hat{M}_{\tilde{j}}||^2 < ||\mathbf {X}- \hat{M}_j||^2, \quad j\ne \tilde{j}, \end{aligned}$$
    (11)

    then update the estimates of \(\hat{U}_{\tilde{j}}\) and \(\hat{V}_{\tilde{j}}\) by adding the current \(\mathbf {X}\) to the learning sequence as \((\mathbf {X},\, \tilde{j})\). Independently of whether condition (11) is fulfilled, go to Step 1.

It was proved in [14] that it suffices to perform only one flip-flop iteration in Step (L3) in order to obtain efficient estimates of \(U_j\) and \(V_j\).
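
The recognition phase admits an equally short sketch (again our own illustration; the condition numbers of Eqs. (6)–(7) are obtained from numpy, and the on-line update of Step 4, condition (11), is omitted for brevity):

```python
def classify(X, models, p_hat, kappa_max=100.0):
    """models: list of (M, U, V, log_c) tuples from learn_class; p_hat: priors."""
    # Step 2: condition-number safeguard, Eqs. (6)-(7); for a symmetric positive
    # definite matrix, np.linalg.cond returns lambda_max / lambda_min
    if all(np.linalg.cond(U) < kappa_max and np.linalg.cond(V) < kappa_max
           for _, U, V, _ in models):
        # Step 3: the MAP rule of Eq. (8)
        scores = [0.5 * np.trace(np.linalg.inv(U) @ (X - M)
                                 @ np.linalg.inv(V) @ (X - M).T)
                  - np.log(p) + log_c
                  for (M, U, V, log_c), p in zip(models, p_hat)]
    else:
        # Step 4: the nearest mean rule, Eqs. (9)-(10)
        scores = [np.sum((X - M) ** 2) for M, _, _, _ in models]
    return int(np.argmin(scores))            # index of the predicted class
```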

2.2 The Methodology of Cross-Validation Testing

In order to test MMNDCL algorithm and to verify its robustness against the class imbalance, we used the cross-validation (CV) methodology in the following extensive version.

  • Step 1 From the set of 900 images, select at random (with equal probabilities) a learning sequence of length 450 and denote it by \(L_{450}\). Use the rest of the sequence, denoted by \(T_{450}\), for testing.

  • Step 2 Learn MMNDCL, using the \(L_{450}\) sequence.

  • Step 3 Test the classifier from Step 2 by applying it to \(T_{450}\); calculate and store the accuracy and the other quality criteria (recall, precision, etc., see the next subsection).

  • Step 4 Repeat Steps 1–3 1000 times.

  • Step 5 Provide the averages of the quality indicators, obtained in Step 3, as the outputs.

Notice that this is an intensive testing procedure: when learning MMNDCL, we have to estimate two mean matrices and four covariance matrices, each of them 1000 times.
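
In code, the CV procedure boils down to the following loop (a sketch under the same assumptions as above; `quality_criteria` stands for a routine computing the criteria of Sect. 4.1, a sketch of which is given there):

```python
rng = np.random.default_rng(0)

def cv_test(images, labels, n_repeats=1000):
    """images: array of shape (900, n, m); labels: class indices 0 and 1."""
    results = []
    for _ in range(n_repeats):                    # Step 4: repeat Steps 1-3
        perm = rng.permutation(len(images))       # Step 1: random 450/450 split
        learn, test = perm[:450], perm[450:]
        # Step 2: learn one MND model and one prior estimate per class
        models = [learn_class(images[learn][labels[learn] == j]) for j in (0, 1)]
        p_hat = [np.mean(labels[learn] == j) for j in (0, 1)]
        # Step 3: classify the testing half and store the quality criteria
        pred = np.array([classify(X, models, p_hat) for X in images[test]])
        results.append(quality_criteria(labels[test], pred))
    return np.mean(results, axis=0)               # Step 5: average the criteria
```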

3 Classifiers Selected for Verifying Their Robustness Against the Class Imbalance

The following classifiers are selected for comparison.

(a) The MMND classifier in the version described in the previous section.

(b) The logistic regression classifier with the L2 regularization coefficient equal to 1. We refer the reader to [12] for a contemporary description of this classifier.

(c) The naive Bayes classifier. Despite its simplicity, this classifier works quite well in many applications, and it is of interest to check its robustness against class imbalance.

(d) The feed-forward, sigmoidal neural network classifier with the following parameters: two hidden layers, each containing 900 nodes with \(\tanh \) activation functions. An L2 regularization coefficient equal to 0.1 was used (see [3]).

(e) The random forest (RF) classifier, proposed in the famous paper of Breiman [1], in which a proof of consistency is also provided. The popularity of RF classifiers is still growing. In our experiments, the number of generated trees was 200.

(f) The support vector machine (SVM) classifier, currently considered the classifier of first choice in most applications. In our experiments, Gaussian radial basis functions were used and the soft margin parameter was set to 8.

(g) The nearest neighbors classifier, a golden classic. Its consistency and other properties are investigated in [2]. In the experiments reported in the next section, the version with 10 nearest neighbors (referred to as 10-NN) is used.

We refer the reader to [2] for a wide and deep discussion of classifiers and their properties.

4 The Results of Testing Classifiers by Cross-Validation

Before providing the results of testing the classifiers, we briefly discuss the criteria selected for the comparisons. We also provide a short description of the methodology of comparing classifiers when multiple criteria are used.

4.1 Criteria Selected for Comparisons

When testing a two-class classifier on a large number of examples, we collect the following data:

  • TP – the number of true positive examples,

  • TN – the number of true negative cases,

  • FP – the number of false positive examples,

  • FN – the number of false negative cases.

Thus, the total number of test cases is \(FP+FN+TP+TN\).

The following widely used measures of classifier quality are selected for further comparisons.

Accuracy. The accuracy (Acc) is defined as the ratio of all properly classified patterns to all the patterns in the testing sequence:

$$\begin{aligned} Acc=\frac{TP+TN}{FP+FN+TP+TN}. \end{aligned}$$
(12)

It is well known that Acc is not quite adequate, especially when we are faced with a large class imbalance, since a classifier can attain seemingly high accuracy while misclassifying all (or most) items from the minority class.

Recall. The recall (Rec), also known as the sensitivity, is defined as

$$\begin{aligned} Rec = TP/(TP + FN), \end{aligned}$$
(13)

i.e., it is the proportion of positive patterns that are correctly classified. It does not take into account TN and FP cases.

Precision. The precision (Prec), also known as the positive predictive value, is defined as

$$\begin{aligned} Prec = TP/(TP + FP) \end{aligned}$$
(14)

and is, in fact, the accuracy among the patterns classified as positive. It does not take into account TN and FN cases.

F1 Score. The F1 score (F1sc) attempts to reduce the drawbacks of Rec and Prec measures by calculating their harmonic mean:

$$\begin{aligned} F1sc= 2.0\, Prec\, Rec/(Prec + Rec). \end{aligned}$$
(15)

Although F1sc is more informative than Prec and Rec separately, it still neglects TN cases, which are of importance in class imbalance settings.

Matthews Correlation Coefficient. A widely accepted alternative to F1sc is the Matthews Correlation Coefficient (MCC) that is defined as follows:

$$\begin{aligned} MCC= \frac{(TP\,TN - FP\,FN)}{ \sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)} }. \end{aligned}$$
(16)

MCC takes into account all the entries of a classifier’s confusion matrix and is easy to interpret. Namely, if MCC is close to +1, then the classifier at hand provides good predictions. Conversely, an MCC of about \(-1\) indicates that the classifier’s decisions are systematically inverted, so that it is advisable to exchange the roles of the “true” and “false” classes. Finally, when MCC is near zero, the classifier is not a good predictor at all, i.e., tossing a fair coin would provide comparable results.
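
All five criteria are straightforward to compute from two label vectors; the following sketch (our own, with a two-class problem and non-degenerate denominators assumed) also serves as the `quality_criteria` routine referenced in Sect. 2.2:

```python
import math

def quality_criteria(y_true, y_pred, positive=1):
    """Return (Acc, Rec, Prec, F1sc, MCC) for a two-class problem."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    acc = (tp + tn) / (tp + tn + fp + fn)              # Eq. (12)
    rec = tp / (tp + fn)                               # Eq. (13)
    prec = tp / (tp + fp)                              # Eq. (14)
    f1 = 2.0 * prec * rec / (prec + rec)               # Eq. (15)
    mcc = (tp * tn - fp * fn) / math.sqrt(             # Eq. (16)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, rec, prec, f1, mcc
```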

4.2 Multiple Criteria Sorting for Assessing Classifiers Quality

The above discussion of the quality measures of classifiers indicates that all of them, although widely used, have their drawbacks. For this reason, we propose to apply all of them in our case study. This leads to the need to select a method for multiple criteria sorting.

Problems of multiple criteria sorting (ranking) of objects have a long history, documented in a large number of papers. For our purpose of sorting the classifiers according to the above criteria, we use a simplified version of the approach proposed in [4] (see also [11] for a discussion of the fundamental notion of pair-wise comparisons).

Denote by \(a_1,\, a_2,\, \ldots ,\, a_7\) the set of algorithms (classifiers) to be compared. Let \(g_1,\, g_2,\, \ldots ,\, g_5\) stand for the set of criteria defined in the previous subsection. Then,

$$\begin{aligned} \bar{g}(a_i){\mathop {=}\limits ^{def}} [g_1(a_i),\, g_2(a_i),\, \ldots ,\, g_5(a_i)],\quad i=1,\, 2,\ldots ,\, 7 \end{aligned}$$
(17)

is the vector of criteria evaluated for algorithm \(a_i\).

Select \(\epsilon _k>0\) as the level of uncertainty of the k-th criterion, i.e., if \(|g_k(a_i)-g_k(a_j)|<\epsilon _k\), then \(a_i\) and \(a_j\) are considered equivalent with respect to the k-th criterion, \( k=1,\, 2,\ldots ,\, 5\). When algorithms (classifiers) \(a_i\) and \(a_j\), \(i\ne j\), are compared as one pair, the following rules of adding scores to their total scores (denoted by \(S_i\) and \(S_j\), respectively) are applied.

Scoring the comparison of \(a_i\) and \(a_j\)

For \( k=1,\, 2,\ldots ,\, 5\) perform the following steps.

  • Step (C1) If \(|g_k(a_i)-g_k(a_j)|<\epsilon _k\), do not change \(S_i\) and \(S_j\) and set k to \(k+1\) (Step (C2) is not performed).

  • Step (C2) If \(g_k(a_i) > g_k(a_j)\), then set \(S_i:=S_i+1\) and \(S_j:=S_j-1\); symmetrically, if \(g_k(a_i) < g_k(a_j)\), then set \(S_j:=S_j+1\) and \(S_i:=S_i-1\). Set k to \(k+1\) and go to Step (C1), unless \(k>5\), in which case finish the comparison of \(a_i\) and \(a_j\).

Overall Comparison.

Initialize the all-pairs comparison by setting \(S_i=0\), \( i=1,\, 2,\ldots ,\, 7\). Perform Steps (C1) and (C2) for all pairs of algorithms \(i\ne j\), \(i,\, j=1,\, 2,\ldots ,\, 7\). Sort the \(S_i\)’s as the output of the all-pairs comparisons and consider the algorithm with the largest \(S_i\) the winner.

Remarks.

  (1)

    In the next subsection \(\epsilon _k=0.01\) for \( k=1,\, 2,\ldots ,\, 5\) was selected.

  (2)

    An easy generalization of the above approach to multi-criteria comparisons is to attach nonnegative weights to the criteria and to use them in Step (C2) instead of \(\pm 1\), but we skip this generalization in what follows.
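
A minimal sketch of the all-pairs scoring (our own illustration; `g` is the \(7\times 5\) array of criteria values from Eq. (17), and `eps` plays the role of \(\epsilon _k\), here taken equal for all criteria as in Remark (1)):

```python
def all_pairs_scores(g, eps=0.01):
    """g[i, k]: value of criterion g_k for algorithm a_i; returns the totals S_i."""
    n_alg, n_crit = g.shape
    S = [0] * n_alg                          # initialize S_i = 0
    for i in range(n_alg):
        for j in range(i + 1, n_alg):        # each unordered pair compared once
            for k in range(n_crit):
                if abs(g[i, k] - g[j, k]) < eps:
                    continue                 # Step (C1): equivalent on criterion k
                if g[i, k] > g[j, k]:        # Step (C2): reward the better algorithm
                    S[i] += 1; S[j] -= 1
                else:
                    S[j] += 1; S[i] -= 1
    return S                                 # the largest S_i marks the winner
```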

4.3 The Empirical Material

As the empirical material for the comparisons we selected 900 images of the laser-based additive manufacturing process. In [8] it was explained in detail why it is important to distinguish the cases when the laser head is in the middle of the wall being constructed (class 1) from the cases when it is near the endpoints of the wall (class 2). Roughly speaking, when it is recognized that the laser head is near the endpoints of the wall, it is desirable to reduce the laser power in order to prevent the endpoints from becoming too thick.

Clearly, one can expect the empirical material to contain a smaller number of examples from Class 2 than from Class 1, since the laser head moves much longer along the middle of the wall than near its endpoints. Indeed, the testing sequence contained 29 images from Class 2 out of all 450 images, and a similar fraction held in the learning sequence.

Typical examples of images from Classes 1 and 2 are shown in Fig. 1 (upper row). These original images have the size \(111\times 241\). Down-sampled images of the size \(10\times 22\) were supplied as the inputs of the classifiers (see Fig. 1 – lower row). Notice that images such as those in the lower row of Fig. 1 were the inputs of the tested classifiers, without any feature extraction.
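
The down-sampling itself can be reproduced with any standard image resizing routine, for instance as below (a sketch only; the paper does not specify the exact interpolation used, so scikit-image's default is assumed):

```python
from skimage.transform import resize

def downsample(image):
    """Reduce a gray-level image of size 111 x 241 to the 10 x 22 classifier input."""
    return resize(image, (10, 22), anti_aliasing=True)
```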

Fig. 1. Examples of images: Classes 1 and 2 (from the left). Original images in the upper row and down-sampled images in the lower row.

Table 1. The confusion matrix obtained when the MMNDCL (as described in Sect. 2) is applied to the testing sequence of 450 images.
Table 2. The confusion matrices obtained when the logistic regression classifier (left panel) and the naive Bayes classifier (right panel) are applied to the testing sequence of 450 images.
Table 3. The confusion matrices obtained when the artificial neural network classifier (left panel) – with two hidden layers of 900 nodes each and \(\tanh \) activation functions – and the random forest classifier (right panel) are applied to the testing sequence of 450 images.
Table 4. The confusion matrices obtained when the SVM classifier (left panel) and the 10-NN classifier (right panel) are applied to the testing sequence of 450 images.
Table 5. The summary of the tests of the classifiers for robustness against class imbalance. In columns 3–7 the values of the criteria for each classifier are displayed. Column 8 contains the scores collected by each classifier in the all-pairs comparisons (see Sect. 4.2). In column 9 the classifiers are ranked according to the scores in column 8.

4.4 The Robustness Against the Class Imbalance – The Results of CV Testing

In this subsection we provide the comparisons of the classifiers that are important from the viewpoint of their robustness against class imbalance. The conclusions are based on the values of the criteria and on the multiple criteria sorting of classifiers discussed in the previous subsections. Firstly, we display the confusion matrix for each classifier. Then, in Table 5, we provide the comparison of the classifiers and their sorting according to the methodology of all-pairs comparisons described in Subsect. 4.2.

As expected in the class imbalance case, the confusion matrices in Tables 1, 2, 3 and 4 and the Acc column in Table 5 display very high accuracies for all the classifiers. However, according to the rest of the criteria (columns 4–7 in Table 5), these classifiers differ essentially. In particular, when the methodology of all-pairs comparisons (see Subsect. 4.2) is applied, they collect largely different scores (see column 8 in Table 5). In the analysis of this column, notice that a classifier dominating all the other classifiers with respect to all the criteria would collect at most plus 30 points (six opponents times five criteria). Conversely, a classifier dominated by all the other six classifiers would gain minus 30 points.

From this point of view, the MMNDCL, logistic regression, random forest, SVM and 10-NN classifiers collected non-negative scores, which means that they are – to some extent – robust against class imbalance. On the other hand, and somewhat unexpectedly, the naive Bayes and the neural network classifiers gained large negative scores in the comparisons.

In column 9 of Table 5 the ranking of the classifiers is presented, based on the scores shown in column 8. Formally, the winner is the MMND classifier, when the methodology of Subsect. 4.2 is used. Its success can be explained by the fact that it is essentially based on the nearest mean rule (in the Mahalanobis or the Euclidean distance). Notice, however, that the winning MMNDCL is only slightly better than the 10-NN classifier, the last one with a non-negative score. Notice also that both the MMNDCL and the 10-NN classifiers require only 10–30 images for proper functioning. This is in contrast to all the other competing methods. On the other hand, the losers, i.e., the naive Bayes and the neural network classifiers, are more global and require considerably longer learning sequences than are usually at our disposal.

On the other hand, when only the MCC criterion is considered, the winner is the random forest method, while the MMNDCL is ranked at positions 4–5.

5 Conclusions

A modified MAP classifier for images (matrices) having a matrix normal distribution was extensively tested on down-sampled images of the laser additive manufacturing process. In parallel, the well-known classifiers were tested using the same sequence of images. The main aim of the tests was to check the robustness of all these classifiers against class imbalance.

The conclusions are the following:

  (1)

    There is a group of classifiers with positive scores in column 8 of Table 5. They can be considered more robust against class imbalance than the other classifiers, i.e., the neural network and the naive Bayes ones.

  (2)

    The highest overall score in column 8 of Table 5 was collected by the MMNDCL method.

  (3)

    The above conclusions are based on extensive comparisons, but they are restricted to only one learning-testing sequence of images. These conclusions are – to some extent – confirmed by tests on another sequence of real-life images, namely, by attempts to classify images of an industrial gas burner (see [10, 13]).

Summarizing, although our attempt at selecting a group of classifiers that are robust against class imbalance seems promising, it is highly desirable to verify these findings on other sets of real-life data.