1 Introduction

In the field of machine learning, classification problems are focused on assigning each input vector to one of a finite number of discrete categories [8], given a set of training data with pre-labelled examples. In this context, special considerations should be taken into account if the labels exhibit an ordering relation, i.e. they are naturally ordered according to the variable definition. For example, financial trading could be assisted by ordinal classification techniques predicting not only a binary decision of buying an asset, but also the amount of investment. The decision could categorised by {“no investment”, “low investment”, “medium investment”, “huge investment”}. Machine learning methods should consider the natural order among the classes and penalise differently the errors. Confusing a “no investment” instance with a “huge investment” should be associated a higher cost than a “little investment” prediction for the same instance. Ordinal classification [34] (also known as ordinal regression) deals with this kind of problems by trying to exploit the ordinal relation between labels and imposing it in the models to learn.

The classification with monotonicity constraints, also known as monotonic classification [6], is an ordinal classification problem where a monotonic restriction can be found: a higher value of an attribute in an example, fixing other values, should not decrease its class assignment. The monotonicity of relations between the dependent and explanatory variables is very usual as a prior knowledge form in data classification [40]. To illustrate this, consider a credit card application [14]. A $1000 to $2000 income may be considered a medium value of income in a data set. If a customer A has a medium income, a customer B has a low income (i.e. less than $1000) and the rest of input attributes remain the same, there is a relationship of partial order between A and B: \(B < A\). Considering that the application estimates the lending quantities as the output class, it is quite obvious that the loan that the system should give to customer B should not be greater than the one given to customer A. If it is, a monotonicity constraint is violated in the decision.

At least, two important reasons are identified for explaining why knowledge about monotonicity should be exploited in a learning task [7]. First, monotonicity imposes constraints on the prediction function. This decreases the size of the hypothesis space and also the complexity of the model. Second, in many cases, the domain experts decide the acceptance or rejection of the trained models based on their consistency with respect to the domain knowledge, regardless of their accuracy.

In this paper, we will give a quick snapshot of ordinal and monotonic classifications problems, emphasizing the issues they have in common, the evaluation measures and the most important approaches already proposed. Also, the open issues and present trends in both related classification problems will be examined, suggesting several open challenges and new directions to devote efforts in the near future.

The rest of the paper is organized as follows. We first formalize the ordinal and monotonic classification problems (Sect. 2). Then, in Sect. 3, we describe several performance metrics widely used in the two problems. Afterwards, we enumerate the main methods proposed for tackling these problems (Sect. 4). Open issues and challenges are pointed out in Sect. 5. Finally, some concluding remarks are given in Sect. 6.

2 Problem definition

A standard classification problem consists of predicting the category y of an input pattern \({\mathbf {x}}\), where \(y\in {\mathcal {Y}}=\{{\mathcal {C}}_1,{\mathcal {C}}_2,\ldots ,{\mathcal {C}}_Q\}\) and \({\mathbf {x}}\in {\mathcal {X}}\subseteq {\mathbb {R}}^K\). The objective is to find a classifier \(r:{\mathcal {X}}\rightarrow {\mathcal {Y}}\) to categorise new patterns. The classifier has to be learnt from a training set of N points, \(D=\{({\mathbf {x}}_i,y_i), i=1, \ldots , N\}\). For ordinal and monotonic classification problems, a natural label ordering is included in the form \({\mathcal {C}}_1\prec {\mathcal {C}}_2\prec \cdots \prec {\mathcal {C}}_Q\), where \(\prec \) is an order relation. The position of the label in the ordinal scale is used by many ordinal classification measures and algorithms, which can be expressed by \({\mathcal {O}}({\mathcal {C}}_q)=q, q=1, \ldots , Q\).

The difference between ordinal classification and other supervised problems is now established. Comparing the problem to nominal classification, the order between class labels makes that two different elements of the set \({\mathcal {Y}}\) can be always compared by using the relation \(\prec \), and this is not possible under the nominal classification setting. If compared to regression (where \(y\in {\mathbb {R}}\)), although the standard \(<\) operator can be used to order real values in \({\mathbb {R}}\), labels in ordinal classification (\(y\in {\mathcal {Y}}\)) do not carry metric information, so it is not possible to establish the distance between two given labels (following the example of Sect. 1, the distance between “low investment” and “medium investment” could be significantly higher than that between “no investment” and “low investment”, and there is no principle way to measure it without a priori knowledge of the problem).

The case of monotonic classification is a particular case of ordinal classification, where there are monotonicity constraints between features and decision classes, i.e. \({\mathbf {x}}\succeq {\mathbf {x}}' \rightarrow f({\mathbf {x}}) \ge f(\mathbf {x'})\) [40], where \({\mathbf {x}}\succeq {\mathbf {x}}'\) means that \({\mathbf {x}}\) dominates \({\mathbf {x}}'\), i.e. \(x_k\ge x_k', k=1, \ldots , K\).

In monotonic classification, we have to define the concepts of monotonic classifier and monotonic data set. A monotonic classifier is one that will not violate monotonicity constraints, those given previously. There will be pure monotonic classifiers, whose decisions will be always monotonic between the independent variables and the dependent one; and approximate monotonic classifiers, which try to learn models as monotonic as possible, namely, predictions with the lowest number of monotonic violations.

A training data set D is monotonic if and only if all the pairs of examples i, j are monotonic with respect to each other [5]: \({\mathbf {x}}_{i}\succeq {\mathbf {x}}_{j}\,\rightarrow y_i\ge y_j, \forall _{i,j}\). Some monotonic classifiers require pure monotonic data sets to successfully learn, although there are others that are capable of learning from non-monotonic data sets as well. Even using pure monotonic data sets as input, there are monotonic classifiers that build approximate monotonic models.

3 Performance metrics

Ordinal and monotonic classifiers can be evaluated using different metrics [3, 12, 18, 48]. The two most common metrics are the mean zero-one error (MZE) and the mean absolute error (MAE). The first one is defined as:

$$\begin{aligned} \mathrm{MZE} = 1-\mathrm{Acc} = \frac{1}{N}\sum ^{N}_{i=1}\llbracket y^*_i\ne y_i \rrbracket , \end{aligned}$$

where \(y_i\) and \(y^*_i\) are the true and predicted labels, respectively, and Acc is the accuracy of the classifier. The range of MZE is [0, 1]. It is related to global performance, but the order is not considered. It is also known as 0/1 loss or standard misclassification rate. A way to include order information in the evaluation is to make use of the MAE metric, which is the average deviation in absolute value of the predicted rank (\({\mathcal {O}}(y^*_i)\)) from the true one (\({\mathcal {O}}(y_i)\)) [3]:

$$\begin{aligned} \mathrm{MAE} = \frac{1}{N}\sum ^{N}_{i=1}|{\mathcal {O}}(y_i)-{\mathcal {O}}(y^*_i)|, \end{aligned}$$

where \(\mathrm{MAE}\in [0,Q-1]\). This measure is also referred to as absolute error or rank loss.

Many binary classifiers assign scores to the different examples, and then a threshold is used for separating negative samples from positive ones. In this context, the area under the receiver operating characteristics (AUC) curve is one of the most commonly used metrics for evaluating the performance of a binary classifier, independently of the threshold used. Basically, it estimates the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one [23]. AUC has been extended to ordinal classification problems [61], by assuming classifiers based on a scoring function with \(Q-1\) different thresholds (threshold models), in such a way that each class corresponds to an interval delimited by these thresholds (more details on threshold models are given in Sect. 4.1). Consequently, AUC for ordinal classification is based on measuring the probability that a pattern from a given class is correctly ranked (according to the score function) with respect to patterns of the remaining classes.

The same measures described above are also used in monotonic classification for estimating the generalization performance of the trained models over test data. This behaviour could produce some negative effects, which will be discussed in Sect. 5.2.

Regarding the quantification of the monotonicity in predictions, which is a particular condition in monotonic classification when noise is present, there are several metrics.

The first is the non-monotonic index (NMI). This measurement was defined by Ben-David in [5] as the rate of number of violations of monotonicity divided by the total number of pairs of examples (excluding the pairs formed by themselves) in a data set: \(\mathrm{NMI}(D)=\frac{\sum _{i=1}^{N}\sum _{j=1}^{N}m_{i,j}}{N^2-N}\), where \(m_{i,j}\) is equal to 1 if the pair formed by \({\mathbf {x}}_{\mathbf{i}}\) and \({\mathbf {x}}_{\mathbf{j}}\) is non-monotonic. To aggregate standard estimation scores of decision trees with quantification of monotonicity, the order-ambiguity-score (A) is computed, as shown in the next equation, using the concept of NMI:

$$\begin{aligned} A = \left\{ \begin{array}{ll} 0, &{} \quad \text {if } \mathrm{NMI} = 0,\\ -(\log _2 \mathrm{NMI})^{-1}, &{} \quad \text {otherwise}. \end{array} \right. \end{aligned}$$

Other alternative definitions are the following ones:

  • Non-monotonicity index 1 (NMIO) [19], defined as the number of clash-pairs divided by the total number of pairs of examples in the data set: \(\mathrm{NMIO}= \frac{1}{N(N-1)}\sum _{{\mathbf {x}} \in D} {\text {NClash}}({\mathbf {x}})\). \({\text {NClash}}({\mathbf {x}})\) is the number of examples from D that do not meet the monotonicity restrictions (or clash) with respect to \({\mathbf {x}}\).

  • Non-monotonicity index 2 (NMIT) [47], slightly different to the previous, is defined as the number of non-monotonic examples divided by the total number of examples: \(\mathrm{NMIT}= \frac{1}{N}\sum _{{\mathbf {x}} \in D} {\text {Clash}}({\mathbf {x}})\), where \({\text {Clash}}({\mathbf {x}})\) = 1 if \({\mathbf {x}}\) clashes with at least one example in D, and 0 otherwise. If \({\text {Clash}}({\mathbf {x}})\) = 1, \({\mathbf {x}}\) is called a non-monotone example.

We stress that these three last indices range in the unit interval, so they can be conveniently expressed as percentages.

4 Main methods

In this section, we briefly describe the most relevant methods proposed in the specialized literature for ordinal and monotonic classification.

4.1 Ordinal classification classifiers

According to [34], ordinal classification methods can be classified into the following families:

  • Naïve approaches include those methods which simplify ordinal classification into other standard problems, by making some assumptions. For example, all the different labels \(\{{\mathcal {C}}_1,{\mathcal {C}}_2,\ldots ,{\mathcal {C}}_Q\}\) can be mapped to real values \(\{r_1,r_2,\ldots ,r_Q\}\) [58], where \(r_i\in {\mathbb {R}}\), and then standard regression techniques [8] (such as neural networks, support vector regression\(\ldots \)) can be applied. Another option is to consider nominal classification and simply ignore ordering information. Moreover, different misclassification costs can be assigned according to order information, resulting in cost-sensitive classification, where the cost matrix is usually related to the absolute difference of ranks of true and predicted classes (in a similar way to the costs assumed by MAE). The three options imply different assumptions which may hamper the learning process: regression and cost-sensitive classification methods assume a distance or a cost between labels which is generally unknown (being more sensitive to the representation of the labels rather than its ordering [35]), nominal classification ignores order information (generally requiring more training data [35]).

  • Ordinal binary decompositions are based on decomposing the ordinal classification task into a set of binary classification subtasks, in the same vein that multiclass classification problems are frequently simplified into several binary tasks using the well-known One-Versus-One or One-Versus-All schemes [29]. Ordering information can be imbued in these decompositions by considering, for example, that a pattern of class \({\mathcal {C}}_q\) should be classified as positive by all binary classifiers corresponding to classes with a lower or equal rank, i.e. \(f_k({\mathbf {x}})=1, \forall _k{\mathcal {C}}_k\preceq {\mathcal {C}}_q\) (Ordered Partitions scheme) [34].

  • Threshold models are based on assuming that the ordinal class labels originate from consecutive intervals of an unobservable one-dimensional latent variable. These models learn two different elements from training data: (1) a one-dimensional projection function corresponding to an estimation of the latent variable, and (2) a set of \(Q-1\) thresholds which divides the projection in Q classes (each class being defined by an interval). This scheme has been considered by many proposals in the literature, adapting linear logistic regression [46], support vector machines [16], discriminant analysis [57] or Gaussian processes [15] to the context of ordinal classification.

  • Augmented binary classification the reduction framework of Lin and Li [43] approaches ordinal classification problem by reducing it to binary classification with additional input variables and specific weights for the extended patterns. A previous method exploiting the same idea is the data replication method of Cardoso et al. [11], the main difference being that it is limited to the absolute cost, whereas the framework of Lin and Li [43] can be used with any V-shaped cost matrix. On the other hand, data replication proposal includes a parameter s which limits the number of adjacent classes considered [11], in order to reduce the number of additional data points generated by the approach.

4.2 Monotonic classifiers

There are five main families of methods that deal with monotonicity constraints in classification:

  • Instance-based learning methods are pioneers and well-known in the field of monotonic classification. Next, we will describe the three most important techniques in detail: OLM, OSDL and k-NN.

    • The ordinal learning model (OLM) [4] is a very simple algorithm that learns ordinal concepts by eliminating non-monotonic pairwise inconsistencies. The generated concepts can be viewed as rules. During the learning phase, each example is checked against every rule in a rule-base, which is initially empty. If an example is inconsistent with a rule in the rule-base, one of them is selected at random while the other is discarded, but if the example is selected, it must be checked for consistency against all the other monotonicity rules. If it passes this consistency test, it is added as a rule. Consequently, the rule-base is kept monotonic at all times. Classification is done conservatively. All the rules are checked in decreasing order of class values against an attribute vector, and the vector is classified as the class of the first rule that covers it. If such a rule does not exist, the attribute vector is assigned the lowest possible class.

    • The ordinal stochastic dominance learner [42] (OSDL) is based on the concept of ordinal stochastic dominance. The stochastic order computes when a random variable is bigger than another. Considering this order, stochastic dominance can be established as a form of stochastic order. In this case, a probability distribution over possible predictions can be ranked. The ranking depends on the nature of the data set. Stochastic dominance refers to a set of relations that may hold between a pair of distributions.

    • The monotonic k-NN was proposed in [22]. This method consists of two steps. In the first step, the training data is made monotone by relabelling as few cases as possible. This relabelled data set may be considered as the monotone classifier with the smallest error index in the training data. In the second step, a modified nearest neighbour rule is used to predict the class labels of new data, so that violations of the restrictions of monotonicity will not occur.

    Recently, some approaches that hybridize rule induction and instance-based learning, such as the nested generalized example learning, have appeared in monotonic classification [30, 31].

  • Decision trees A monotone extension of ID3 (MID) was proposed by Ben-David [5], using an additional impurity measure for splitting the total ambiguity score. However, the resulting tree may not be monotone anymore even when starting from a monotone data set. MID defines the total-ambiguity-score as the sum of the entropy score of ID3 and the order-ambiguity-score. This last score is defined in terms of the NMI of the tree (see Sect. 3). Makino et al. [44] proposed a monotone (or positive) decision tree (P-DT) and a quasi-monotone (quasi-positive) decision tree (QP-DT) extension of ID3 in the two-class setting. They start from a monotone training set and demand, in the case of QP-DT, that monotonicity is (only) guaranteed on this training set, while in the case of P-DT the tree (or equivalently, the derived rule base) is required to be monotone. These methods have been nontrivially extended in [53] to the multi-class problem, accommodating also continuous attributes. In addition to the fact that these approaches start from a monotone training set, the main technique for guaranteeing (quasi-)monotonicity is by adding at each step, if necessary, new data generated from the data in the previous step. A splitting criteria thought for monotonic classification has been proposed in [10]. The criterion aims at reducing the numbers of non-monotone pairs of points in the resulting branches. It chooses the split with the least number of conflicts. Another way to achieve monotone classification models in a post-processing step is by pruning classification trees [25]. This method prunes the parent of the non-monotone leaf that provides the largest reduction in the number of non-monotonic leaves’ pairs. Here, similar accuracy is reported, with increased comprehensibility. Isotonic regression is also used for relabelling non-monotone leaf nodes of the decision tree [38]. As for explicit monotonic trees, we can find some representatives proposed in the literature. MDT [41] aimed to predict the implicit ordering in terms of pair comparison in the original classification. In [36], the authors propose a rank generalization of Shannon mutual information, namely rank mutual information and underline that this measure is both sensitive to monotonicity and robust to noisy data. Then, this measure is used to build binary tree classifiers guaranteed to have a weak form of monotonicity (rule monotonicity), in the case the starting data set is monotone consistent. They call this algorithm REMT and show that it behaves well compared to both monotone and non-monotone classifiers. An extension of the interval valued attribute decision tree to deal with monotonic classification is given in [63], which selects extended attributes by minimizing rank mutual information to generate a decision tree. Recently, in [45], the authors presented a binary tree classifier, RDMT(H), parametrized by a discrimination measure H used for splitting and other three pre-pruning parameters. According to them, RDMT(H) guarantees a weak form of monotonicity on the resulting tree.

  • Ensemble learning techniques have been proposed for classification with monotonicity constraints. For instance, a boosting-like technique for ordinal classification problems related to decision rules has been proposed in [21, 39]. Ensembles of bagged decision rules have been considered in [9] and bagged decision trees in [56]. This last paper considered global constraints in ordinal classification by imposing the ordinal constraints in a decision function and avoiding over-regularised decision spaces. A straightforward scheme based on Random Forest and ensemble pruning was proposed in [33]. Finally, in [54], the authors developed a method of fusing monotonic decision trees.

  • Neural networks have been also applied on monotonic classification. Total and partial monotonic neural networks were examined in [20], and an adaptation of neural networks that imposes monotonicity constraints on the weights connecting the hidden layer with the output layer was presented in [26].

  • Data preprocessing and construction Another trend in monotonic classification is to preprocess the data [32] in order to “monotonize” the data set, rejecting or relabelling the examples that violate the monotonic restrictions. There are two consolidated techniques for obtaining monotonic data sets, either generating artificial data [47, 52] or by relabelling existing data sets [22, 24, 55]. The second option is the preferred for addressing real data sets with classifiers that exclusively work with monotonic data.

5 Open issues

This section discusses some specific problems and issues associated to ordinal and monotonic classification, establishing aspects of the field which should receive further attention by the research community.

5.1 Open issues in ordinal classification

Focusing on ordinal classification, the main problems are related to the performance metrics and the data sets used for evaluating the classifiers. Specifically:

  • First of all, it is important to outline the necessity of taking ordering information into account. Many works in the machine learning field use ordinal classification data sets, ignoring the order of the categories. This can decrease the performance of the obtained model [34]. Some authors have previously studied whether there are performance improvements when considering order information. In [37], ordinal meta-models were compared against nominal ones, concluding that such ordinal methods may yield better performance. Indeed, much more differences can be found when considering specific ordinal classification methods (instead of meta-models) [34]. Another study [7] argues that ordinal classifiers may not present meaningful advantages over the analogue non-ordinal methods, based on accuracy and Cohen’s Kappa statistic [17]. However, the results in [34] show that statistically significant differences are found when using measures which take the order into account, such as the MAE. In this way, it is extremely important to consider ordinal performance metrics to evaluate the benefits of applying ordinal classifiers.

  • Independently of the performance differences, there are additional advantages on the use of ordinal classification models. For example, threshold models allow projection of the patterns into a real line, according to the latent variable value estimated for each pattern. This additional information can be very useful to detect uncertain predictions (close to the thresholds of its category) or to rank patterns in the same class.

  • Proportional odds model (POM) is a linear model based on extending binary logistic regression to ordinal classification [46]. As such, its training is very fast but its performance is generally low [34], because many real world problems require nonlinear decision boundaries. This fact is important, given that the POM and its variants are the most widely used ordinal classification methods in areas such as medical sciences or psychology [27, 60, 62].

  • Public repositories (such as UCI [2], Keel [1], or mldata.org [50]) include benchmark classification data sets with ordinal classification problems. Indeed, there are many previous works where these data sets are treated as standard classification. A careful examination of the data sets in these repositories is needed to understand the nature of the target variable and separate ordinal classification tasks from standard multiclass classification (although for monotonic classification it makes sense to consider binary problems, ordinal classification problems must be, at least, three-class problems). However, there are some ordinal classification works [15, 16, 43] which consider the repository provided by Chu et al. [15]. These data sets are not real ordinal classification problems but regression ones, which are turned into ordinal classification by discretising the target into Q different bins with equal frequency or equal width. Validating new algorithms using only these data sets can result in misleading conclusions, because it is clear that they can be simpler than real ordinal classification problems. When using equal frequency binning, class imbalance is suppressed, given that all classes are assigned the same number of patterns. For equal width binning, all classes are assigned intervals of the same width in the latent variable, simplifying the problem. Furthermore, there are observed values of the actual target regression variable (although they are ignored), so the classification problem can be simpler than problems where these values are not available and there are only categories. In any case, we do not neglect the opportunity of using these data sets to check how the algorithms perform in a more controlled environment, but we think they should be complemented with real ordinal classification data sets in order to test the performance of the classifiers in realistic settings.

  • Finally, the different performance metrics used to evaluate ordinal classifiers make different assumptions. For example, MAE assigns proportional costs to all pairs of consecutive categories, in such a way that misclassifying a pattern of class \({\mathcal {C}}_3\) as class \({\mathcal {C}}_1\) is exactly twice more costly than assigning it a \({\mathcal {C}}_2\). This may not be the case for many real problems, where the costs of misclassifications may be very different depending on the classes evaluated. One possibility could be the use of association metrics [18], which evaluate the relative order of the patterns, but not the exact labels, i.e. if patterns are well sorted using the ordinal classifier with respect to sorting established by the true labels. Again, association metrics should be used together with MAE or other alternative metrics, because they evaluate different aspects of the classifiers (e.g. a classifier which shifts all the predictions one class in the ordinal scale will score the same value for association metrics).

  • Ordinal classification data sets are unbalanced in nature, because extreme classes tend to be associated to rare events. Uneven pattern distributions pose a serious hindrance for classifier training, in such a way that minority classes tend to be ignored by the obtained classifiers. There have been many efforts for tackling this problem in the binary or multiclass classification contexts [13, 28], but few works adapts these techniques to ordinal classification [51]. Specific characteristics of ordinal classification tasks should be taken into account when developing new strategies for alleviating this problem.

Fig. 1
figure 1

Process of transformation of the 10-fcv to the new one by removing the non-monotonic instances from the test data sets

5.2 Open issues in monotonic classification

This section will be devoted to point out some of the existing open issues we have detected in monotonic classification.

  • As we have indicated before, most of the same measures described for standard ordinal classification are usually used in experimental comparisons of monotonic classifiers for estimating the generalization performance, such as MZE or MAE. This can cause a major deviation and misinterpretation of the results reported over real problems and learned concepts. It is worth mentioning that any performance metric is estimated over a set of test examples, and these test examples are obtained by using statistical validations applied over the original data. Thus, it seems obvious to expect that if the original data set has inconsistencies or non-monotonicity violations, part of these flaws will be inherited in the test partitions. It is frequent and avoidable in standard classification, but it is more critical in monotonic classification. The errors derived from misclassifications of test examples that are really non-monotonic will influence the final estimate of performance measure. If we want to learn a monotonic model, we hope to predict unseen examples satisfying monotonic constraints. In this manner, the experimental evaluations should achieve conclusions from a link between measures of generalization performance and measures for estimating the degree of monotonicity in the predictions, such as NMI.

  • An attempt of reducing the previous effect was suggested in [30], when tackling data sets with noise and possible non-monotonic violations. The measures MAcc and MMAE (monotonic accuracy and MAE, respectively) were used to estimate the errors over test data. Both do not consider the non-monotonic comparable examples in the estimate. The reason for this is to ensure that future examples will fulfil the monotonicity assumption. These metrics serve as monotonicity level measurements of the predictions carried out. To address this, the validation process followed is presented in Fig. 1 where the partitions used in 10-fcv are modified conserving the training data sets but removing the non-monotonic instances in the test data sets. The process of conflictive instance removal searches for fair comparisons and it is based on a deterministic greedy algorithm, avoiding randomness (see Algorithm 1).

    figure a
  • NMI is a well-known metric used in monotonic classification for estimating the degree of monotonicity in a set of predictions. It is also used to compute the degree of monotonicity in models, especially in interpretable models such as decision trees [5, 36] or decision rules [4]. More complex models built from other algorithms such as ensembles or neural networks also contain hidden monotonicity violations in their models. This emphasizes the fact that obtaining monotonic predictions is as important as obtaining monotonic models.

  • The most widely used scheme for preparing the data to learners that explicit require complete monotonic data is to relabel the data [24]. However, relabelling is not always the best approach to preprocess data that contains noise, inconsistencies and harmful examples [32]. The problem may not be only present in the class label, so other techniques such as edition, noise filtering and attribute values correction could work well in these cases.

  • Possible extensions of the monotonic classification problem could consider different priorities among the explanatory attributes or different degrees of non-monotonicity among examples. The former extension refers to prioritize some attributes over others, depending on the background taken from the problem itself. As an example in credit risk, it may be more critical to predict a favourable loan to a first customer with lower income than another customer whose loan was denied, instead of predicting the same when considering the attribute “assets”, keeping the remaining attribute values unchanged. The later extension is very related to the former one, because it assumes the existence of different degrees of non-monotonicity between two examples, which must be calculated by using the explanatory variables in one way or another. Also, the spacial separation of data points could influence this factor.

  • Currently, monotonic classification is seen as a natural extension of classical or ordinal classification. Other predictive learning paradigms that require some interpretation of the results can benefit of monotonic models or predictions in certain real applications, such as Subgroup Discovery [49], Semi-Supervised learning [59] or Multi-Label learning.

6 Conclusions

This paper has presented a review and analysis of two supervised classification tasks highly related: ordinal and monotonic classification. Both are concerned with the classification of patterns into naturally ordered categories, although the latter considers constraints of monotonicity between input and target variables. A common framework and notation is given for both kinds of problems, and the main existing techniques used for them are categorized and reviewed.

Moreover, the paper has uncovered some of the pitfalls and problems hidden in both fields, which are mainly related to the performance metrics, the data sets and the experimental design used for the evaluation of new ordinal or monotonic classification methodologies. We think that our analysis can serve as a motivation for developing specific experimental strategies for these tasks or as a set of recommendations for the application of existing ordinal and monotonic methodologies to other fields of study.