1 Introduction

Correct classification rate or accuracy, C, is one of the most common performance metrics for classification tasks. However, this metric is not reliable when the classification of imbalanced datasets is considered [26, 28], because it cannot capture the accuracy level of each class in the results. Imbalanced datasets are those where one or several classes have a much lower prior probability in the training set, and they pose a difficult challenge for Machine Learning (ML) researchers [34], because standard error metrics tend to ignore minority classes, reducing only the error of the majority ones.

As a result, complementary and alternative performance metrics for classification have been proposed to take the prior probabilities into account, based on accuracy rates computed independently for each class, partially compensating for data skewness. Examples of this type of metric are the Geometric Mean (GM) [1, 31] and the adjusted GM [4], or the F1 metric [16, 44] (the harmonic mean of precision and sensitivity or recall) and the adjusted F metric [36]. However, while these metrics can be effective in cases of moderate imbalance, the need for specific metrics arises in the case of heavily skewed data [21].

In this work, we focus on obtaining binary classification models using the C and F1 metrics simultaneously and analysing their behaviour by means of a multi-objective evolutionary algorithm (MOEA) [11], which can alleviate the negative effects that imbalanced datasets have on these metrics. Given that the F1 metric evaluates how balanced the classification is across the two classes of a problem, this methodology can lead to classification models with a trade-off between global accuracy and recall for the minority class (sensitivity).

On the other hand, this theoretical study is supported by an empirical study with 26 datasets from the UCI repository [3] with different degrees of skewness and a complex donor-recipient matching model for a liver transplantation problem [12]. The results are analysed with three metrics: the aforementioned C and F1, the latter calculated on the positive class and denoted F1Pos, plus a third metric, F1Neg, calculated on the negative class. The study shows that using the majority class as the positive class instead of the minority class can result in better performance for both classes, depending on the dataset and the percentage of patterns classified as positive. This can be explained by the fact that, in some cases, the noise of minority classes can mislead the results obtained for metrics such as F1. The classifiers obtained with the pair of metrics (C,F1) are shown to yield performance as balanced as possible in both metrics, without the need to apply resampling techniques.

In summary, the main contributions of this paper are the following:

  • We introduce a theoretical study on the relationship between the C and F1 metrics, which is evaluated both graphically and empirically. This study shows how the constraints presented may limit (or not) an algorithm in the process of finding good classifiers, depending on the shape of the feasible region of solutions. Moreover, it provides graphical information about the space of possible solutions attainable by the algorithm.

  • Given a binary classifier B(α), we address the problem of finding the parameters α which maximise the metrics C and F1. If the C value is evaluated and optimised together with F1, an overall success for both classes of a binary classification problem can be achieved. Unlike previous works, which try to obtain a high global accuracy, this proposal simultaneously analyses the accuracy and the F1 level, which is especially useful when the costs of misclassification are different but not exactly known. In this way, the study tries to alleviate the negative effects that imbalanced datasets have on the F1 metric, the value of F1 being increased by slightly decreasing the value of C.

  • On the other hand, binary classification models, which separate positive and negative patterns, can be built by choosing between two alternatives: taking either the majority class or the minority class as the positive one. We experimentally show that, in certain datasets, considering the majority class as the positive one instead of using the minority class leads to models with better values in C and F1.

Having explained the motivation of this work, the rest of the paper is organised as follows: In Section 2, the C and F1 metrics are presented and their properties are discussed. Section 3 derives the theoretical relation between the values of the contingency matrix and the discussed metrics. Section 4 shows the experiments performed and the results obtained. The conclusions are finally drawn in Section 5.

2 Performance metrics for binary classification

Although ML algorithms can be evaluated using theory (deriving generalized error bounds) [37], empirical evaluation remains the most common approach for algorithm assessment. Evaluation techniques based on multiple experiments are frequently considered in practice [14, 17, 41].

A metric evaluating a classifier must be estimated from all the available samples, which are usually split into training and test sets [25]. The classifier is first designed using training samples, and then it is evaluated based on its classification performance on the test samples. Different performance metrics are used to select the classification model which performs better for a given problem. In this context, class imbalance learning refers to those classification problems where the datasets present a skewed class distribution. In these imbalanced binary classification problems, patterns of a class are highly outnumbered by those of the other class, the minority class being heavily under-represented [2, 7]. Although the minority class is in many problems the class of interest, training a classifier with an imbalanced dataset often produces models biased towards the majority class (sometimes leading to trivial classifiers where all patterns are classified as coming from the majority class), which perform poorly on the minority class [44, 45].

2.1 Traditional metrics

The behaviour of a binary classifier can be described by the number of patterns of the positive class correctly recognised (true positives, TP), the number of patterns of the negative class correctly recognised (true negatives, TN), and by the number of examples which were incorrectly assigned either to the positive class (false positives, FP) or to the negative one (false negatives, FN). These four quantities constitute a confusion matrix, as shown in Table 1.

Table 1 An example of a binary confusion matrix

From the binary confusion matrix, the following metrics can be defined:

  • Global performance, which can be evaluated using the accuracy, C:

    $$ C = \frac{TP+TN}{TP+FP+TN+FN}, $$

    that is, the rate of correct predictions.

  • Precision, P, is a metric of exactness, representing how many of the examples predicted as positive actually belong to the positive class:

    $$ P = \frac{TP}{TP+FP}, $$
  • Recall or sensitivity, R, also known as TP Rate and Positive Precision, is a metric of completeness, and it represents how many positive examples are classified correctly:

    $$ R = \frac{TP}{TP+FN}. $$
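As a quick illustration (a minimal sketch of ours, not taken from the paper), these three quantities can be computed directly from the four confusion-matrix counts; the counts used below are hypothetical:

```python
# Sketch: computing C, P and R from the confusion-matrix counts above.

def accuracy(tp, fp, tn, fn):
    """C: rate of correct predictions over all patterns."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """P: fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R: fraction of actual positives classified correctly."""
    return tp / (tp + fn)

# Illustrative counts for an imbalanced problem: accuracy looks high
# even though P and R on the positive class are only modest.
tp, fp, tn, fn = 20, 10, 160, 10
print(accuracy(tp, fp, tn, fn))  # 0.9
print(precision(tp, fp))         # ~0.667
print(recall(tp, fn))            # ~0.667
```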

Other traditional metrics for imbalanced problems are Kubat's G-mean [31] and the AUC (Area Under the ROC Curve) [5]. All these metrics have been claimed to be effective in evaluating classification performance in binary imbalanced learning scenarios [18, 26], but they present disadvantages. For example, neither P nor R detects changes in TN, which is a serious problem when the classes have distinct features and both the positive and negative classes are well-defined (e.g. male or female). The retrieval of a positive class, the discrimination between classes, or the balance between retrieval from both classes are problem-dependent tasks.

In this first study, we focus on the C metric, which takes into account the four values of the confusion matrix of a binary classification problem, and also on the F1 metric, which uses three of the four values (it does not detect changes in TN) but considers the values of P and R through their harmonic mean, as will be shown in the next section.

2.2 F1 metric

Another alternative for dealing with imbalanced data is the F-metric (F1) [16, 44]. This metric is especially useful when the cost of misclassification for the different classes is not exactly known.

The F-metric [40] (or balanced F1-score) is given by:

$$ F_{1} = \frac{2}{\frac{1}{P}+\frac{1}{R}}=\frac{2PR}{P+R}, $$

that is, the harmonic mean of P and R. It tends towards the lower of the two and is a summary indicator that can be generalised to Fβ:

$$ F_{\beta} = (1+\beta^{2})\frac{PR}{\beta^{2}P+R}. $$

The Fβ metric allows us to assign β times as much importance to R as to P. The most popular choice is β = 1, i.e. assigning the same importance to P and R, leading to Fβ=1 = F1. F1 reaches its best value at 1 (perfect P and R) and its worst at 0 (P and/or R = 0).
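The following sketch (ours, not from the paper; the input values are arbitrary) makes the weighting role of β explicit:

```python
# Sketch: F-beta from P and R. beta > 1 weights R more heavily,
# beta < 1 weights P more heavily, and beta = 1 recovers F1.

def f_beta(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0  # convention: F-beta -> 0 when P = R = 0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.9, 0.3
print(f_beta(p, r, beta=1.0))  # 0.45, pulled towards min(P, R)
print(f_beta(p, r, beta=2.0))  # ~0.346, closer to R
print(f_beta(p, r, beta=0.5))  # ~0.643, closer to P
```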

F1 can be used to obtain optimal classification models with high R and P in class imbalance situations due to the trade-off between both metrics [30]. It is commonly used as a criterion for classifier selection [8]. F1 is invariant to changes in the TN count, so it does not capture a classifier's ability to recognise negative examples. Such a metric is more applicable to domains with a multi-modal negative class taken as "everything not positive" [41]. F1 is also invariant to uniform changes of positive and negative counts [41], that is, it is stable with respect to a uniform increase of the data size (scalar multiplication of the confusion matrix). If we expect that, for different data sizes, the same proportion of examples will exhibit positive and negative characteristics, that is, if the size of a stratified sample does not affect the performance of the classifier, then this invariant metric is a good choice.

Improving a classifier using F1 is generally not easy, as the resulting optimisation problem is non-convex. Therefore, various approximation methods have been proposed. An efficient algorithm to maximise a non-convex approximation to F1 for logistic regression was presented in [27], showing its effectiveness on an information extraction problem; this method can fail if the initial estimate of the expected F1 is unreasonable. In [39], a variant of the standard Support Vector Machine (SVM) that optimises an approximation to F1 was proposed. An efficient algorithm for maximising a convex lower bound of F1 for SVMs was introduced in [29].

Liu et al. [33] proposed a novel method which maximises an approximate F1 including an L1-regulariser. In [10], the authors presented a non-convex loss function for F1 maximisation in the presence of outliers, proposing a formulation based on the elastic net regulariser (combination of L1 and L2 penalties).

Now, we redefine F1 to better study its optimisation. First, we expand the definition of F1 as follows:

$$ \begin{array}{@{}rcl@{}} F_{1} &=& \frac{2(TP)^{2}}{2(TP)^{2}+(TP)(FP)+(TP)(FN)}\\ &=& \frac{2(TP)}{2(TP)+FP+FN}. \end{array} $$

Although it is not common, we will empirically study the performance of the classifier for the negative class, which can be similarly expressed as:

$$ \begin{array}{@{}rcl@{}} F_{1\text{Neg}} &=& \frac{2(TN)^{2}}{2(TN)^{2}+(TN)(FP)+(TN)(FN)}\\ &=& \frac{2(TN)}{2(TN)+FP+FN}. \end{array} $$

Let f be the ratio of patterns of the positive class:

$$ \begin{array}{@{}rcl@{}} f=\frac{TP+FN}{TP+FN+TN+FP}. \end{array} $$

Let z be the ratio of true positive patterns obtained by the classifier:

$$ z=\frac{TP}{TP+FN+TN+FP}. $$

Then, the confusion matrix can be expressed (in ratios) as:

$$ \begin{array}{@{}rcl@{}} \left( \begin{array} {cc} TP & FN \\ FP & TN \end{array}\right) \equiv \left( \begin{array}{cc} z & \quad f-z \\ 1-f-C+z & \quad C-z \end{array}\right), \end{array} $$
(1)

Taking into account the equivalence shown in (1), F1 can be simplified to:

$$ F_{1} = \frac{2z^{2}}{2z^{2}+z(1-C)}=\frac{1}{1+\frac{1-C}{2z}}. $$

F1 can be maximised by minimising \(\frac {1-C}{2z}\), assuming that z≠ 0. In this way, the minimisation problem can be expressed as:

$$ \min \left( (1-C)-\lambda(2z)\right), $$

where λ is a positive constant; thus, F1 is maximised by maximising both C and z. When z → 0, the maximisation of F1 is difficult because, as will be shown, F1 → 0.
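As a numerical sanity check of this reformulation (a sketch of ours, not part of the original experiments), we can sample ratio confusion matrices consistent with (1) and verify that F1 computed from the counts matches 2z/(2z + 1 − C):

```python
import random

# Sanity check (a sketch, not from the paper): draw ratio matrices
# consistent with (1) and verify that F1 = 2z / (2z + 1 - C).
random.seed(0)
for _ in range(10_000):
    f = random.uniform(0.05, 0.95)             # ratio of positive patterns
    C = random.uniform(0.05, 0.95)             # accuracy
    z = random.uniform(max(0.0, f + C - 1.0),  # feasible range of z,
                       min(C, f))              # see Proposition 1 below
    tp, fn, fp, tn = z, f - z, 1 - f - C + z, C - z
    f1_counts = 2 * tp / (2 * tp + fp + fn)
    f1_closed = 2 * z / (2 * z + 1 - C)
    assert abs(f1_counts - f1_closed) < 1e-12
```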

3 Relation between F1 and C metrics

We analyse and obtain the relation between the C and F1 metrics based on the ratio of patterns of the positive class (f). To carry out this analysis, it is necessary to study the boundaries of the values that the C and F1 metrics can take for a binary classification problem.

If we represent C on the abscissa of the plane and F1 on the ordinate, as shown in Fig. 1, a classifier can be represented within the square [0,1] × [0,1] in terms of the values that it could take with respect to both metrics, where (1,1) is the optimal point. We refer to the area (shaded region) containing the values that can be reached by both metrics as the feasible region, while the values outside that area form what we call the infeasible region. Note that the feasible area will be attainable or not depending on the difficulty of the dataset considered.

Fig. 1 Feasible region (shaded region) for different values of f

In the following sections, the boundaries of the infeasible region for the pair of metrics (C,F1), taken as a joint measure of global classification success (C) and priority-class accuracy (F1), are analysed, together with a more formal and detailed explanation of the region and its representation.

3.1 Range of values of F1 for each C

We want to obtain the range of possible values of F1 for each C ∈ [0,1]. Taking into account the above-mentioned terms z and f, the metrics can be expressed as:

$$ R = \frac{z}{f}, $$
$$ P= \frac{z}{1-f-C + 2z}, $$
$$ F_{1} = \frac{2z}{2z + 1-C}. $$
(2)

Firstly, we analyse the extreme cases:

  • Suppose C = 0.

    Then, only if z = 0 (TP = 0), R = P = 0 and F1 → 0.

  • Suppose C = 1.

    Then, only if z = f, R = P = F1 = 1.

Secondly, we analyse the most common case, 0 < C < 1, for which the F1 value is given by (2).

Proposition 1

The following constraint is always fulfilled, ensuring that all entries of the confusion matrix are positive:

$$ \max\{0,f+C-1\}<z<\min\{C,f\}. $$
(3)

Proof

Given that:

  • C,f > 0 ⇒ 0 < min{C,f}.

  • C,f < 1 ⇒ f + C − 1 < min{C,f}.

Then: max{0,f + C − 1} < min{C,f} □
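This proposition can also be checked empirically. The following sketch (ours) draws random confusion matrices with strictly positive entries and confirms that z always falls strictly inside the bounds of (3):

```python
import random

# Empirical check of Proposition 1 (a sketch): for confusion matrices
# with strictly positive entries, max{0, f+C-1} < z < min{C, f} holds.
random.seed(1)
for _ in range(10_000):
    tp, fn, fp, tn = (random.uniform(1, 100) for _ in range(4))
    n = tp + fn + fp + tn
    f, z, C = (tp + fn) / n, tp / n, (tp + tn) / n
    assert max(0.0, f + C - 1.0) < z < min(C, f)
```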

Now we analyse the relation between F1 and C. It will be based on the variation of F1 = F1(z) for each value of C.

Lemma 1

F1(z) is an increasing function.

Proof

Given that the derivative of F1(z) is:

$$ \frac{\partial F_{1}(z )}{\partial z} = \frac{2 (1-C )}{(2z + 1-C )^{2}} >0, $$

it is clear that this derivative is always positive, provided that C is constant. □

Using inequality (3) and given that F1(z) is increasing:

$$ M =\max\{F_{1}(0), F_{1}(f+C-1)\} < F_{1}(z) < \min\{F_{1}(C), F_{1}(f)\}=m. $$
(4)

Substituting these values of z in (2), we obtain the following expressions:

$$ F_{1}(0)= 0, F_{1}(f+C-1) = \frac{2 (f+C-1 )}{2f+C-1}, $$
$$ F_{1}(C) = \frac{2C}{1+C}, F_{1}(f) = \frac{2f}{2f + 1-C}. $$

From these expressions and inequality (4), the following relations can be obtained:

  • If F1(0) < F1(C),

    $$ 0<\frac{2C}{1+C}, \text{ and then, } C>0. $$
  • If F1(0) < F1(f),

    $$ 0<\frac{2f}{2f + 1-C}, \text{ and then, } f>0 \text{ and } 2f + 1-C>0. $$
    (5)

3.2 Boundaries of the infeasible region

We analyse the boundaries according to the relative values of C and F1. The maximum and minimum values in (4) are called M and m, respectively, and based on these values, we determine the functions which limit the infeasible region (boundaries). There are four different cases:

  1. M = 0 and m = F1(C).

    In this case,

    M = 0 is only possible if F1(f + C − 1) < F1(0) = 0, i.e.:

    $$ \frac{2(f+C-1)}{2f+C-1} < 0 \Rightarrow f+C-1<0 \Rightarrow C<1-f, $$
    (6)

    given that 2f + 1 − C > 0 (see (5)).

    On the other hand, if F1(C) < F1(f) and m = F1(C) then:

    $$ \begin{array}{@{}rcl@{}} \frac{2C}{1+C}<\frac{2f}{2f + 1-C},\\ C^{2}-C(f + 1)+f>0,\\ (C-f)(C-1)>0,\\ \left\{\begin{array}{l} C-f<0 \Rightarrow C<f,\\ C-1<0 \Rightarrow C<1. \end{array} \right. \end{array} $$
    (7)

    From (6) and (7), C < min{f,1 − f}, therefore:

    $$ \begin{array}{@{}rcl@{}} 0<F_{1}(z)< \frac{2C}{1+C}. \end{array} $$
  2. M = 0 and m = F1(f).

    In this case,

    if M = 0, from (6), C < 1 − f, and if m = F1(f), then F1(f) < F1(C). Consequently:

    $$ \begin{array}{@{}rcl@{}} \frac{2f}{2f + 1-C}<\frac{2C}{1+C}, \end{array} $$

    and following the same steps as in (7), f < C; since from (6) C < 1 − f, we have f < C < 1 − f, and consequently \(f<\frac{1}{2}\) and:

    $$ \begin{array}{@{}rcl@{}} 0<F_{1}(z)< \frac{2f}{2f + 1-C}. \end{array} $$
  3. M = F1(f + C − 1) and m = F1(C).

    In this case,

    M = F1(f + C − 1) holds if \(F_{1}(f+C-1)=\frac{2(f+C-1)}{2f+C-1}>0\), which requires f + C − 1 > 0.

    On the other hand, m = F1(C) if F1(C) < F1(f), and operating as in case 1, then C < f.

    From both inequalities, we have that 1 − f < C < f, so, in this case:

    $$ \frac{2(f+C-1)}{2f+C-1}<F_{1}(z)<\frac{2C}{1+C}. $$

    This can only happen if \(f>\frac {1}{2}\).

  4. M = F1(f + C − 1) and m = F1(f).

    M = F1(f + C − 1) implies that f + C − 1 > 0, and m = F1(f) implies that f < C. Then, C > max{f,1 − f} and:

    $$ \frac{2(f+C-1)}{2f+C-1}<F_{1}(z)<\frac{2f}{2f + 1-C}. $$

Rearranging the results by using the value of f (if \(f<\frac{1}{2}\), then min{f,1 − f} = f and max{f,1 − f} = 1 − f; if \(f>\frac{1}{2}\), then min{f,1 − f} = 1 − f and max{f,1 − f} = f), and calling the lower limit Lf(C) and the upper one Uf(C), we find that, in both cases, the boundaries are the same:

$$ \begin{array}{@{}rcl@{}} & L_{f}(C) = \left\{ \begin{array}{lc} 0 & 0 \leq C \leq 1-f\\ \frac{2(f+C-1)}{2f+C-1} & 1-f \leq C \leq 1 \end{array}\right., & \\ & U_{f}(C) = \left\{ \begin{array}{lc} \frac{2C}{1+C} & 0 \leq C \leq f\\ \frac{2f}{2f + 1-C} & f \leq C \leq 1 \end{array}\right.. & \end{array} $$

We use non-strict inequalities (≤) for the limits because Lf(C) and Uf(C) are continuous functions (their values coincide at both sides of C = 1 − f and C = f, respectively).
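For reference, the two boundary functions translate directly into code. The sketch below (ours, with illustrative values) evaluates the feasible F1 interval for a given C and f, which is how plots such as Fig. 1 can be reproduced:

```python
# Sketch: boundaries of the feasible region of (C, F1) for a given f.

def lower_bound(c, f):
    """L_f(C): lower limit of the feasible F1 values."""
    return 0.0 if c <= 1 - f else 2 * (f + c - 1) / (2 * f + c - 1)

def upper_bound(c, f):
    """U_f(C): upper limit of the feasible F1 values."""
    return 2 * c / (1 + c) if c <= f else 2 * f / (2 * f + 1 - c)

# Illustrative point: feasible F1 interval at C = 0.8 when f = 0.35.
print(lower_bound(0.8, 0.35), upper_bound(0.8, 0.35))  # 0.6, ~0.778
```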

3.3 Representation and feasibility

Once the limits of the infeasible region have been calculated, the graphical representation of the feasible region in Fig. 1 for different values of f (0.35 and 0.65) can be properly understood.

It is clear that the whole shaded region is feasible. Given any point (x0,y0), it is possible to find a classifier (confusion matrix) such that C = x0 and F1 = y0. This can be done by taking:

$$ z= \frac{y_{0}(1-x_{0})}{2(1-y_{0})}=\frac{F_{1}(1-C)}{2(1-F_{1})}. $$

Note that, for a given value of C (or F1), there are many possible values for F1 (or C). In this way, the performance of a classifier when considering C and F1 can be represented by using this kind of plot to have an idea of the range of possible improvement for a given value of C or F1, according to the feasible region which can still be explored.
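The constructive argument above can be made concrete with a short sketch (ours; the target point and f are illustrative): given a feasible point (x0, y0) and the class ratio f, it recovers z and the full ratio confusion matrix:

```python
# Sketch: build a ratio confusion matrix achieving C = x0 and F1 = y0.

def matrix_for(x0, y0, f):
    """Return (TP, FN, FP, TN) as ratios for the target point (x0, y0)."""
    z = y0 * (1 - x0) / (2 * (1 - y0))
    tp, fn, fp, tn = z, f - z, 1 - f - x0 + z, x0 - z
    # All four ratios must be non-negative for the point to be feasible.
    assert min(tp, fn, fp, tn) >= 0, "point outside the feasible region"
    return tp, fn, fp, tn

tp, fn, fp, tn = matrix_for(x0=0.8, y0=0.7, f=0.35)
print(tp + tn)                      # recovers C  = 0.8
print(2 * tp / (2 * tp + fp + fn))  # recovers F1 = 0.7
```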

4 Experimental validation

This section presents the experiments performed in this paper to study the relation between C and F1, to validate the feasible region derived in the previous section, and to assess the possibility of simultaneously optimising C and F1 in a binary classifier B(α).

Firstly, we present the MOEA considered for the optimization of a population of B(α) classifiers, in this case Artificial Neural Networks (ANNs), where α comprises the structure and weights of the net. We also describe a mono-objective version of the algorithm, implemented for comparison purposes: an Evolutionary Algorithm (EA) for the optimization of ANNs guided by only one of the two proposed metrics, C or F1, and applying the same operators and the same type of ANNs as the MOEA. Finally, we describe the datasets used in the experimentation and the results obtained.

4.1 Algorithms to optimize C and F1

The two proposed metrics, C and F1, are used for guiding a MOEA called MPENSGA2 (Memetic Pareto Evolutionary NSGA2), described in detail in [19] and based on the original algorithm NSGA2 [13]. MPENSGA2 is based on the evolution of ANNs as binary classifiers [20, 43], where both the structure and the weights of the ANNs are optimised.

The pseudocode of MPENSGA2 is shown in Algorithm 1. The population size is established as N = 100. Five mutation operators are used in this algorithm: four structural mutators (add neurons, delete neurons, add links, delete links) and one parametric mutator (adding random noise to the links). The probability of choosing a type of mutator and applying it to an individual is equal to 1/5. With regard to the add and delete link mutations, links are added or deleted first between the input layer and the hidden layer and then between the hidden layer and the output layer. Specifically, we randomly add or delete 30% of the total number of links in the input-hidden layers, and 5% of the total in the hidden-output layers. Weights are assigned using a uniform distribution defined over two intervals, [− 5,5] for connections between the input layer and the hidden layer and [− 10,10] for connections between the hidden layer and the output layer. The number of neurons to be added or deleted is chosen randomly from {1,2}. All these values have been obtained experimentally and are sufficiently robust. For a more detailed description of the parameters of the algorithm, see [19].
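To make the mutation step concrete, the following minimal sketch (ours; the ANN is abstracted as simple bookkeeping, not the authors' implementation) selects one of the five mutators with probability 1/5 and applies the proportions given above:

```python
import random

# Hedged sketch of the mutation step: the network is abstracted as
# neuron/link counts only; the real operators act on an ANN genotype.
rng = random.Random(0)
OPERATORS = ["add_neurons", "delete_neurons",
             "add_links", "delete_links", "parametric"]

def mutate(net):
    op = rng.choice(OPERATORS)  # each operator chosen with p = 1/5
    if op.endswith("links"):
        sign = 1 if op == "add_links" else -1
        net["in_hid_links"] += sign * round(0.30 * net["in_hid_links"])
        net["hid_out_links"] += sign * round(0.05 * net["hid_out_links"])
    elif op.endswith("neurons"):
        sign = 1 if op == "add_neurons" else -1
        net["hidden_neurons"] += sign * rng.randint(1, 2)
    else:
        pass  # parametric mutation: add random noise to existing weights
    return net

print(mutate({"hidden_neurons": 5, "in_hid_links": 40, "hid_out_links": 5}))
```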

Taking into account the feasible region defined above and the concept of Pareto dominance, one point in (C,F1) space dominates another if it has greater C and equal or greater F1, or greater F1 and equal or greater C. Thus, the most competitive classifiers will tend towards the upper right part of the feasible region as the evolutionary process of the MOEA progresses.
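A sketch of this dominance test (ours) in the (C,F1) objective space, where both objectives are maximised:

```python
# Sketch: Pareto dominance in (C, F1); both objectives are maximised.

def dominates(a, b):
    """True if a = (C_a, F1_a) dominates b = (C_b, F1_b)."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

print(dominates((0.90, 0.70), (0.88, 0.70)))  # True: better C, equal F1
print(dominates((0.90, 0.60), (0.85, 0.70)))  # False: incomparable points
```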

On the other hand, we have adapted the EA described in [38] in order to compare the simultaneous multi-objective optimization of C and F1 with the mono-objective optimization of each of the two functions. Algorithm 2 shows the pseudocode of the EA. The population size is established at N = 100, and the weight intervals for the links are the same as those used in the MOEA, as are the mutation operators: four structural mutators and one parametric mutator. The activation function of the neurons in the hidden layer is the sigmoidal function (as in the MOEA).

Algorithm 1 Pseudocode of the MPENSGA2 algorithm

Therefore, four methodologies appear in the comparison process, two corresponding to the MOEA (attending to the C or F1 extreme of the Pareto front) and two corresponding to the EA (optimizing only C or only F1). Note that the MOEA is run once, while the EA has to be run twice (once for each objective, C or F1).

Algorithm 2 Pseudocode of the EA

In order to check whether the null hypothesis can affect the performance of this pair of metrics, we will consider two approaches for each dataset used in the experimental procedure:

  1. The null hypothesis is that the positive class is the minority one.

  2. The null hypothesis is that the positive class is the majority one.

Although it might seem that the first hypothesis would yield the best results for improving the value of F1 for the minority class, we will see that, in some cases, the second approach is able to improve the classification of the less frequent class.

4.2 Datasets and experimental design

In this work, 26 benchmark datasets obtained from the UCI repository [3] have been considered, presenting different levels of imbalance, together with a complex and interesting dataset named "Madre", corresponding to a donor-recipient matching problem in liver transplantation [6]. This real transplant dataset consists of data from liver transplants performed in 11 Spanish units, including all the transplants performed between January 1, 2007, and December 31, 2008. Recipient and donor characteristics were reported at the time of transplant: 16 recipient characteristics, 16 donor characteristics and 9 operative factors were reported for each donor-recipient pair. The end-point variable is 3-month graft mortality.

All these datasets passed through the following preprocessing steps: categorical attributes were expanded into the corresponding binary vectors, and then each attribute was normalized to the interval [− 1,1]. Multiclass datasets were reduced to binary classification using one of two procedures: 1) choosing one label to represent the positive class and combining the others to form the negative class, e.g., Ecoli 3 ("imU") versus all ("cp", "im", "pp", "om", "omL", "imL", "imS"); or 2) selecting only two labels among all the possible ones, e.g., Yeast (2 versus 4), where the examples labeled as 2 ("me2") and 4 ("cyt") were chosen to represent the positive and negative classes. Both procedures were applied following the suggestions in the literature [9, 46].

The degree of imbalance of a dataset can be indicated by the value of f, previously defined as the ratio of patterns of the class considered as positive, or by the Imbalance Ratio, IR, defined as the ratio of the number of instances in the majority class to the number of instances in the minority class [22]. Datasets with different levels of imbalance were selected in order to better explore the relation between the C and F1 metrics. The values of f are specified depending on the null hypothesis selected: 1) the positive class is the majority one (f-Maj.), or 2) the positive class is the minority one (f-Min.). Table 2 shows the features of each dataset, ordered from highest to lowest imbalance.

Table 2 Characteristics of the datasets

For all datasets, a stratified 10-fold cross-validation was conducted, with 3 repetitions of the complete 10-fold procedure. Each dataset was run using both the majority class and the minority class as the positive label, resulting in 30 runs for each null hypothesis and dataset.
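This protocol is straightforward to reproduce. Below is a sketch using scikit-learn (our stand-in, since the paper's own splitting code is not given):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold

# Sketch of the evaluation protocol: stratified 10-fold CV repeated
# 3 times yields the 30 runs per dataset and null hypothesis.
X, y = load_breast_cancer(return_X_y=True)  # stand-in UCI-style dataset
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
n_runs = 0
for train_idx, test_idx in cv.split(X, y):
    n_runs += 1  # train on train_idx; evaluate C, F1Pos, F1Neg on test_idx
print(n_runs)  # 30
```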

4.3 Results

First of all, we graphically evaluate whether the theoretical constraints derived for the simultaneous optimisation of the C and F1 metrics are fulfilled in the experimentation with the MOEA. Figure 2, divided into two parts, shows the graphical representation for a specific run on the Bands dataset:

  • Figure 2a and b include training and test results when the algorithm is run using the majority class as the positive one.

  • Figure 2c and d include training and test results when the algorithm is run using the minority class as the positive one.

Fig. 2 Graphical representation of the Pareto fronts obtained with the MOEA for the Bands dataset in training, and corresponding test values for the models of the Pareto fronts

Figure 2a and c represent the values obtained in C and F1 for each individual (classification model) of the MOEA population on the training set, showing the non-dominated individuals (red circles) forming a Pareto front. Figure 2b and d show the results obtained by the same models for both metrics, but in this case on the test set. It is important to highlight that there are no Pareto fronts in the test set, since these fronts are always obtained in training. Therefore, these figures show where the models obtained during training are located when the test set is applied to them. Moreover, the individuals that were in the first Pareto front in the training figures may now be located in a worse zone of the (C,F1) feasible region (lower values of C and F1), depending on the generalization capacity of the models. Similar conclusions and subfigures can be found in Fig. 3 for the Liver dataset. As can be seen, the class used as the positive label has a clear influence on the shape of the feasible region and the spread of the members of the Pareto fronts.

Fig. 3 Graphical representation of the Pareto fronts obtained with the MOEA for the Liver dataset in training, and corresponding test values for the models of the Pareto fronts

We evaluate the models of the main Pareto front with maximum training C and with maximum training F1, since the MOEA returns a Pareto front considering both objectives (C and F1); that is, we evaluate the extremes of each front. This is done for each of the 30 runs, for each null hypothesis and for each dataset. To clarify the experimentation, an illustrative example is shown in Fig. 4, where the reader can observe several models of the main Pareto front from one of the 30 runs carried out for the Bands dataset (using the minority class as the positive label).

Fig. 4 Extreme models (C and F1) of the main Pareto front of the MOEA for the Bands dataset in training

Tables 3, 4, 5 and 6 include, for each dataset and methodology used, the average test values of C, F1Pos and F1Neg over the 30 runs when considering the majority or the minority class as the positive label. Regarding the notation used, MOEAC or MOEAF1 refers to the multi-objective algorithm considering the average values obtained at the C or F1 extreme, respectively. MONOC or MONOF1 refers to the mono-objective methodology considering the average values obtained by optimizing only C or only F1, respectively. The best results of the MOEA when comparing the two null hypotheses are marked in boldface. The results obtained with the EA methodologies are compared to the corresponding row of the MOEA (using the majority or minority class as the positive label, depending on the table). If the mono-objective methodology obtains the best results, it is marked with an asterisk (*).

Table 3 Average test results over 30 runs for the datasets where considering the majority class as the positive label (Maj.) with the MOEA leads to better C and F1 for the positive and negative classes (F1Pos and F1Neg) in the C and F1 extremes
Table 4 Average test results over 30 runs for the datasets where considering the majority class as the positive label (Maj.) with the MOEA leads to better C and F1Pos in the C and F1 extremes
Table 5 Average test results over 30 runs for the datasets where considering the minority class as the positive label (Min.) with the MOEA leads to better C and F1 for the positive and negative classes (F1Pos and F1Neg) in the C and F1 extremes
Table 6 Average test results over 30 runs for the datasets where considering the minority class as the positive label (Min.) with the MOEA leads to better C and F1Pos in the C extreme. Nevertheless, in the F1 extreme, considering the majority class as the positive label (Maj.) leads to better C and F1Pos

From now on, we refer to the results obtained in F1 for the class considered as positive as F1Pos. Note that the comparison between considering the majority class or the minority one as positive (Maj. or Min., respectively) with the MOEA is done attending only to the C extreme or the F1 extreme. Therefore, when comparing the values of F1, it is important to take into account which null hypothesis and which extreme, C or F1, are being observed. In this way, the value of F1Pos obtained when the positive class is the majority one (Maj.) should be compared against the value of F1Neg obtained when the positive class is the minority one (Min.). Similarly, the value of F1Neg when the positive class is the majority one (Maj.) should be compared against the value of F1Pos when it is the minority one (Min.). Finally, the value of C must be directly compared with its counterpart (Maj. C vs. Min. C). As a clarifying example, the results shown in Fig. 5 correspond to the Hepatitis dataset, and they are compared as follows: for the extreme with the best value in C, considering the majority class as positive, the value F1Pos = 0.893 is compared against the value F1Neg = 0.869 that corresponds to the same extreme but considering the minority class as positive. In the same way, for this C extreme, the value F1Neg = 0.515 (considering the majority class as positive) is compared against the value F1Pos = 0.474 of the minority class considered as positive. Finally, the value C = 82.74 (considering the majority class as positive) can be directly compared with the value C = 79.51 (minority class considered as positive). The same comparison procedure is applied to the F1 extreme. Regarding the mono-objective methodology, the values obtained optimizing only C or F1 are shown considering the majority class as the positive label, which is the null hypothesis with the best results in the Hepatitis dataset using the MOEA.

Fig. 5 Example of the comparison procedure in the experimental setup

That said, the datasets are grouped in the four tables named above as follows:

  • Table 3 shows the 4 datasets where considering the majority class as the positive label (Maj.) using the MOEA leads to better C and F1 for both the positive and negative classes (F1Pos and F1Neg) in the C and F1 extremes. In this case, the results obtained by the EA methodologies are compared to the MOEA ones using the majority class as the positive label, because this hypothesis obtains better results in these datasets.

  • Table 4 shows the 10 datasets where considering the majority class as the positive label (Maj.) using the MOEA leads to better C and F1 for the positive class (F1Pos) in the C and F1 extremes.

  • Table 5 shows the 11 datasets where considering the minority class as the positive label (Min.) using the MOEA leads to better C and F1 for both the positive and negative classes (F1Pos and F1Neg) in the C and F1 extremes.

  • Finally, Table 6 shows the 2 datasets where considering the minority class as the positive label (Min.) using the MOEA leads to better C and F1 for the positive class (F1Pos) in the C extreme. Nevertheless, in the F1 extreme, the opposite holds: considering the majority class as the positive label (Maj.) leads to better C and F1 for the positive class (F1Pos).

Taking into account this experimental design and comparison process, the following conclusions can be drawn about considering the majority class or the minority one as the positive class:

  • Majority class as positive label: Tables 3 and 4 show that the value of C decreases when the minority class is considered as positive instead of the majority one.

    For this case, Table 3 includes the datasets for which the use of the majority class as the positive label improves C and F1 for both classes (F1Pos and F1Neg) and for both extremes of the Pareto front. In this way, when the minority class is considered as positive, C is positively correlated with F1Pos and F1Neg (i.e. the three values decrease) for the Hepatitis, Ionos, HorseColic and HeartStatlog datasets. For example, in the Hepatitis dataset, the value Maj.-C = 82.74 becomes Min.-C = 79.51, the value Maj.-F1Pos = 0.893 becomes Min.-F1Neg = 0.869, and the value Maj.-F1Neg = 0.515 becomes Min.-F1Pos = 0.474. The IR of these datasets is not too high, from 1.250 in HeartStatlog to 3.844 in Hepatitis. The mono-objective methodologies do not obtain better results than the multi-objective ones when the majority class is considered as the positive label.

    Note that there are cases in which the average values of C, F1Pos and F1Neg are similar or even equal when comparing both extremes. This happens when the values of C and F1 are considerably high; in that case, the models obtained are close to the (1,1) optimal point of the feasible region, which can be narrow (see Fig. 1). This can cause the obtained models to be different in terms of their ANN structure but with similar or equal performance. There could even be cases in which one run provides a main Pareto front with only one individual or model.

    On the other hand, Table 4 includes the datasets where considering the majority class as positive leads to better C and F1Pos in the C and F1 extremes: Ecoli (3 versus all), SaHeart, Madre, Pima, SpectfHeart, Glass (1 versus all), BreastC, Bands, Haberman and Glass (0 versus all). In this case, a positive correlation is shown between C and F1Pos when the minority class is considered as positive. Nevertheless, the correlation between C and F1Neg is negative. For example, in Ecoli (3 versus all), when the value Maj.-C = 91.87 becomes Min.-C = 91.27 (i.e. it decreases), the value Maj.-F1Neg = 0.547 becomes Min.-F1Pos = 0.577 (i.e., it increases). The IR of these datasets is diverse, from 1.704 in Bands to 8.600 in Ecoli (3 versus all).

    Regarding the mono-objective methodologies, only for Ecoli (3 versus all) are their results better than the multi-objective ones in the three metrics considered (C, F1Pos and F1Neg). For SaHeart, Madre, SpectfHeart, Haberman and Glass (0 versus all), they also achieve slightly better results, but only for some of the three metrics, sometimes causing the value of C to decrease.

  • Minority class as positive label: Continuing with the interpretation of the results, in Table 5 the value of C is shown to increase when the minority class is considered as positive (instead of the majority one).

    For the datasets of this table, the C, F1Pos and F1Neg values are improved in the C and F1 extremes, i.e. C is positively correlated with F1Pos and F1Neg. The IR of the datasets varies significantly, from 1.148 in HouseVoting to 15.329 in Sick. In this way, obtaining good results in C and F1 when the minority class is considered as positive seems to be independent of the imbalance level and is probably related to the structure of the dataset. This empirically shows that there is no reason why this null hypothesis (minority class as positive) should be preferred when the imbalance level is high.

    Regarding the mono-objective methodologies, they obtain better results than the multi-objective ones only in three datasets: BreastW-Diagnostic, BreastW-Original and HouseVoting.

  • Majority or minority class as positive label: Finally, Table 6 shows 2 datasets, German and Liver, with behaviours different from those discussed so far: in the C extreme, the value of the C metric increases when considering the minority class as positive instead of the majority one, being positively correlated with F1Neg and negatively with F1Pos. On the other hand, if we observe the F1 extreme, the value of C decreases under the same consideration, C being positively correlated with F1Pos and negatively correlated with F1Neg. The IR of these datasets is not high, from 1.379 in Liver to 2.333 in German.

    The mono-objective methodologies obtain better results than the multi-objective ones only in the F1Neg and F1Pos metrics for the German dataset, when the positive label is assigned to the minority and majority class, respectively.

In some cases, the MOEA may stagnate due to the feasibility constraints derived from the relation between C and F1 and due to the value of f (ratio of patterns of the class considered as positive). Furthermore, the number of non-dominated solutions can be small, and these solutions can even be identical in terms of their C and F1 values. In these cases, the classifier is not sufficiently trained, and the mono-objective methodologies may obtain better results in generalisation, although this depends greatly on the dataset. Figure 6 shows the feasible region for the Glass (0 versus All) and Card datasets when the majority class and the minority one are considered as the positive class, respectively. As can be seen, for C values greater than 0.7, the space of solutions begins to narrow drastically, and, under these circumstances, it would be advisable to also experiment with mono-objective algorithms.

Fig. 6 Graphical representation of the narrowing of the feasible region for the Glass (0 versus All) and Card datasets for C values greater than 0.7

5 Conclusions

This work presents the theoretical constraints associated with the relation between the C and F1 metrics, as a function of the ratio of patterns of the positive class of a binary classifier. We propose representing binary classifiers as points in a two-dimensional plot according to their C and F1 performances. Using this representation, the constraints limit the feasible region in such a way that this region is wider when the values of C and F1 are relatively low. This representation can give us an idea of the range of possible improvement for a given value of C or F1.

The results show that the theoretical constraints are fulfilled. The MOEA is able to optimise a Pareto front of binary classifiers where both the C and F1 values are acceptable, showing high accuracy both globally and for the positive and negative classes. On the other hand, it has also been shown that the mono-objective methodologies generally obtain worse results when optimizing only C or F1 instead of optimising both metrics simultaneously. Their use should only be considered when the feasible region is narrow due to the constraints derived from the relation between both metrics or due to the value of f (ratio of patterns of the class labelled as positive).

For some datasets, using the majority class instead of the minority one as the positive label results in better performance in both C and F1 for both classes, particularly in those datasets where the degree of imbalance is lower. This option should be considered by decision makers when training binary classifiers, given that the positive label is generally assumed to be the minority one, on the understanding that better results will be obtained. This leads to a performance that, in some cases, seems to be independent of the imbalance level and is probably related to the structure of the dataset.

It is also observed that the C and F1Pos metrics are correlated in all datasets tested, except in German and Liver when the C extreme is considered. With respect to C and F1Neg, they are negatively correlated in the 10 datasets of Table 4 and in the German and Liver datasets when the F1 extreme is considered.

Finally, the use of an MOEA leads to acceptable results for both C and F1 in a good number of datasets, according to the experiments performed. The mono-objective methodologies need to be run twice, once for each metric, which increases the computational cost and does not always lead to better results with respect to the multi-objective methodology.

As future research lines, we plan to extend the findings of this paper to a multiclass classification environment by considering, for example, a multi-objective evolutionary algorithm based on decomposition [35]. Moreover, an automatic method to choose the best null hypothesis for a given classification problem could be designed, based on the analysis of the dataset and the classifier.

On the other hand, with the rise in popularity of deep learning methods and their recent promising results in many classification and forecasting applications [15, 24, 42], our methodology could be extended to deep structures. In this direction, recent works have applied evolutionary techniques with simple learning modules [23, 32] in order to simultaneously optimize different objectives.