1 Introduction

Learning from noisy data is an important topic in machine learning, data mining and pattern recognition, as real-world data sets may suffer from imperfections in data acquisition, transmission, storage, integration and categorization. Indeed, over the last few decades, noisy data have attracted a considerable amount of attention from researchers and practitioners, and the research community has developed numerous techniques and algorithms in order to deal with the issue [30, 41, 147].

These approaches include the development of learning algorithms which are robust to noise as well as data preprocessing techniques that remove or “repair” noisy instances. Although noise can affect both input and class attributes, class noise is generally considered more harmful to the learning process, and methods for dealing with class noise are becoming more frequent in the literature [147].

Class noise may have many causes, such as errors or subjectivity in the data labeling process, as well as the use of inadequate information for labeling. For instance, in some medical applications, the true status of a disease can only be determined by expensive or invasive procedures, some of which can only be carried out after a patient's death. Another cause is that data labeling by domain experts is generally costly, so several applications use labels which are automatically defined by autonomous taggers (e.g., sentiment analysis polarization [70]) or by non-domain experts. This approach is common in, e.g., social media analysis [70], where hashtags used by users or information provided by a pool of non-domain experts (crowdsourcing) are used to derive labels.

Even though class noise is predominant in the literature (see Frénay and Verleysen [30] and Nettleton et al. [90] for recent surveys and comparison studies), most research has focused on noise handling in binary classification problems. However, new real-life problems have motivated the development of classification paradigms beyond binary classification [50]. These paradigms include ordinal [48], multiclass [25], multilabel [51] and multi-instance [52] classification, as well as learning from data streams and non-stationary environments [26] and jointly exploiting related tasks [28]. Due to the ubiquity of noise, it is of fundamental importance to better understand the relationships and implications of class noise within these paradigms. Each paradigm has its own particularities, which impose new challenges and research questions for noise handling. Although research on class noise handling in these paradigms is somewhat present in the literature (as will be discussed further in this paper), it remains quite scarce and requires a general discussion of issues, challenges and research practices. This paper aims to discuss open-ended challenges and future research directions for learning from class-noisy data, focusing on non-binary classification problems. The main contributions of this paper are:

  • We discuss current research, as well as the need to adapt existing techniques or to develop new ones for handling class noise within non-binary classification paradigms.

  • We also discuss issues related to the simulation of noise scenarios (inclusion of artificial noise) within these paradigms, an experimental artifact frequently adopted in the analysis of noise handling techniques. These issues are important for simulating noise scenarios that may occur in real-world applications and can serve as the basis for standardizing procedures, providing objective grounds for assessing the robustness of learning methods.

  • We present some important open-ended issues and offer some possible solutions to the existing problems.

We believe this discussion will encourage researchers and practitioners to explore the problem of class noise handling in new scenarios and different learning paradigms in more detail. The rest of this paper is organized as follows: Sect. 2 presents an overview of techniques and methods for learning from class noise. Section 3 presents a discussion of class noise in the context of six non-binary classification problems: multiclass, multilabel, multitask, multi-instance, ordinal and data stream classification. Section 4 is devoted to presenting the latest advances in each non-binary classification problem and their respective open issues. Finally, Sect. 5 presents some concluding remarks.

2 An overview of learning with noise

The development of supervised learning algorithms and methods that can handle data sets with class noise is a problem of great practical importance, as numerous real-world problems are noisy [89]. Class noise may have many causes [30], such as insufficient or imperfect information [11]; errors in data labeling due to subjectivity [133], the use of automated processes [119] or of non-domain experts [55]; or encoding or communication problems [12]. In this section, we introduce the basics of label noise: Sect. 2.1 briefly introduces the problem of learning with class noise; Sect. 2.2 presents a class noise taxonomy based on the statistical mechanisms that introduce noise; Sect. 2.3 discusses some potential impacts of noisy labels on learning algorithms; and Sect. 2.4 is devoted to the different techniques developed to cope with class noise.

2.1 Problem statement

A training data set D is composed of a collection of examples (X, Y). Examples are pairs \((\mathbf {x_i},\mathbf {y_i})\), where \(\mathbf {x_i} \in X\) is described by attributes from the input space \(\mathcal {X}\), and its corresponding target \(\mathbf {y_i} \in Y\) is formed by attributes from the output space \(\mathcal {Y}\), associated with the underlying and unknown function \(\mathcal {F} : \mathcal {X} \rightarrow \mathcal {Y}\). A learning algorithm aims to create, from D, a model function \(\mathcal {H}:\mathcal {X}\rightarrow \mathcal {Y}\), with \(\mathcal {H} \approx \mathcal {F}\), that can be applied to predict the target class variables of new instances. In this paper, Y can be formed by a single class attribute, as in the binary and multiclass classification cases, or by more than one class attribute, as in the multilabel classification case.

Noise may corrupt the values of attributes in X and Y. In this paper, we restrict the discussion to the target (class) attributes, as class noise is known to be more harmful to the learning process [147]. In this case, instead of the true target values Y, the learner has access to the corrupted values \(\hat{Y}\). The objective of noise handling methods is to learn a good approximation \(\mathcal {H'}:\mathcal {X}\rightarrow \mathcal {Y}\), with \(\mathcal {H'} \approx \mathcal {F}\), from the noisy data set \(D' = (X,\hat{Y})\). To this end, the side effects of noisy instances (those for which \(Y \ne \hat{Y}\)) on the learning process should be diminished. Methods for dealing with class-noisy data include data- and algorithm-level approaches. In this paper, we are interested in noise introduced by stochastic processes, rather than noise introduced deliberately by an adversarial agent [130]. A noise model aims to describe some of the statistical properties of how noise appears in a data set.

2.2 Class noise models

In Frénay and Verleysen [30], a taxonomy of class noise mechanisms was proposed. This taxonomy is inspired by the missing value taxonomy [69], adapted to the class noise context. It is based on four random variables: X is the feature vector, Y is the true class, \(\hat{Y}\) is the observed class, and E is a binary variable indicating whether noise is present, that is, \(p_e = P(E=1) = P(Y\ne \hat{Y})\). This taxonomy is only applicable to binary and multiclass problems.

Fig. 1 Simplified statistical dependence model from [30] summarizing the class noise typologies. The arrows indicate statistical dependence. Marked arrows (a), (b) and (c) are relevant for the three statistical models considered: noise completely at random (NCAR) (a); noise at random (NAR) (a) and (b); and noise not at random (NNAR) (a), (b) and (c). The dashed line represents the implicit dependence between the input variables and the class labels

According to the statistical dependencies among these four variables, three different statistical models arise. They are briefly represented in Fig. 1, where arrows depict statistical dependence among the variables. In this taxonomy, the occurrence of class noise is assumed to be a stochastic process, and the probability \(p_e\) of an instance being mislabeled is categorized into three groups:

  • Noise Completely at Random (NCAR) This type of noise occurs in a completely stochastic way, and the probability of an instance being mislabeled depends neither on its class nor on the other predictive features. Note that in binary classification problems, NCAR noise is symmetrical, since an asymptotically identical proportion of instances is corrupted in both classes.

  • Noise at Random (NAR) In this type of noise, the probability of a mislabeling is dependent on the value of the actual class, i.e., it can assume different values for different classes. (Some classes might be noisier than others.) However, the probability of an instance being mislabeled is the same for all instances within each class. NCAR class noise is a particular case of NAR class noise, where the probabilities are the same for all classes.

  • Noise not at Random (NNAR) Both the NCAR and NAR models assume that class noise applies uniformly over the instance space, but this is not always true. In NNAR, the probability of an instance being mislabeled depends on the feature space: for example, instances near class boundaries, which are similar to instances of another class, are more likely to be mislabeled (a simulation sketch of the three mechanisms follows this list).
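
For concreteness, the three mechanisms can be simulated in a few lines of code. The following is a minimal sketch, assuming integer class labels stored in a NumPy array; the function names and the feature-dependent probability function p_of_x are illustrative choices of ours, not an established API.

```python
import numpy as np

rng = np.random.default_rng(0)

def _flip(y, classes, flip_mask):
    """Replace each flagged label by a different class chosen at random."""
    y = y.copy()
    for i in np.where(flip_mask)[0]:
        y[i] = rng.choice([c for c in classes if c != y[i]])
    return y

def inject_ncar(y, classes, p_e):
    """NCAR: the same flip probability p_e for every instance."""
    return _flip(y, classes, rng.random(len(y)) < p_e)

def inject_nar(y, classes, p_per_class):
    """NAR: the flip probability depends only on the true class."""
    probs = np.array([p_per_class[c] for c in y])
    return _flip(y, classes, rng.random(len(y)) < probs)

def inject_nnar(X, y, classes, p_of_x):
    """NNAR: the flip probability also depends on the features, e.g.
    p_of_x may return higher values near class boundaries."""
    probs = np.apply_along_axis(p_of_x, 1, X)
    return _flip(y, classes, rng.random(len(y)) < probs)
```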

This taxonomy is sufficient for binary classification problems (as it is based on the missing value case, where a value is either missing or not), but, as argued in Sect. 3, non-binary and nonstandard classification problems may require additional dependencies. To the best of our knowledge, such dependencies are not properly discussed in the literature.

2.3 Impact of class noise in learning algorithms

Class noise may have significant impacts on the learning process. The deterioration in classification performance is perhaps the most frequently studied problem in the literature, from both theoretical and experimental perspectives.

Theoretical studies generally depart from assumptions about the data distribution (e.g., data come from a multivariate Gaussian distribution with identical covariance matrices [81] or unequal covariance matrices [61, 83]), combined with models with known mathematical properties (e.g., a linear discriminant function), to establish bounds on errors in the presence of class noise.

The impact of class noise on some specific algorithms has also been addressed theoretically in the literature. Bi and Jeske [8] study the behavior of logistic regression in comparison with discriminant analysis under the NCAR noise assumption, concluding that logistic regression is less affected. In Okamoto and Yugami [93], average-case analysis was used to investigate the behavior of class noise in instance-based learning algorithms. The impact of class noise on decision tree induction was studied in Quinlan [103].

The PAC-learning framework has also been used to theoretically study the effect of class noise in classification tasks. In Angluin and Laird [3], the NCAR noise model is assumed and PAC learning is used to estimate the noise rate in a data set, provided that less than 50% of the instances are noisy. In recent studies [56, 107], PAC learning is used to provide insights into the conditions under which learning under different class noise models (and not only NCAR) can be successful.

From an empirical perspective, studies dealing with noisy instances are often performed using simulation, by artificially introducing noise according to some noise assumptions [90]. The reasons are twofold: first, few data sets with known noisy instances are available; and second, simulated experiments allow control over the parameters and characteristics of the noise introduced in the data set. However, noise can be artificially introduced following different procedures, which may influence the behavior of the methods for dealing with noisy data. To alleviate this problem, it is important to standardize the nomenclature for artificial noise inclusion, as in the taxonomy discussed in the previous section. This paper extends this taxonomy by including nonstandard classification problems, as well as more detailed noise models for multiclass classification.

Some studies reported in the literature have also shown the impact of class noise on feature selection [143], an increase in model complexity [1] and the distortion of observed frequencies [32], among others.

2.4 Strategies for coping with noise

There are traditionally two different strategies for dealing with class noise, each related to the phase in which it operates:

  • Algorithm-level approaches These aim to design robust classifiers that are less influenced by noise and do not require any previous noise treatment [82, 120].

  • Data-level approaches These attempt to remove or cleanse the noise present in the data before applying a classifier [12, 37]. This is a popular option when using a robust learner is unfeasible or inappropriate, or when the aim is to improve the results of robust learners; it also enables the selection of any standard classification algorithm.

Recent studies have enriched this traditional taxonomy. For instance, algorithm-level approaches can be differentiated into approaches that model the presence of noise either explicitly or implicitly. Data-level approaches, when analyzed in detail, also comprise filtering and reparation methods. Since this categorization of noise treatment methods is richer than the traditional one, we divide the methods for dealing with noise into five categories:

  • Data Cleaning Methods A possible approach for dealing with noise consists in cleaning the noisy instances in the data set, by removing them or repairing their labels. However, identifying class-noisy instances is not an easy task, and several approaches have been proposed. These include using data characteristic measures [38], building predictive models for identification [150], cost-sensitive models [148], ensembles [113], nearest neighborhood [60], the influence of the instance on model creation [141] and graph-based methods [115], to name a few (a minimal majority-vote ensemble filter is sketched after this list).

  • Noise Reparation This category includes techniques that aim to correct or repair noisy instances by changing the noisy label for the estimated correct one [124]. We must differentiate between pure relabeling (or correcting) methods, which only change the noisy instance label [91, 140], and complete noise cleaners, which are able to both remove and relabel noisy instances [86]. As some authors point out [147], noise reparation in the training data, even when the test data are also noisy, will enhance the model's performance compared to doing nothing about the noise.

  • Noise Robust Methods These methods are able to learn useful models without noise modeling or data cleaning, even when some amount of label noise is present. Theoretical studies have investigated noise robustness [75], and different loss functions have been studied for the binary classification problem [5]. Overfitting avoidance techniques, such as noise robust splitting criteria in decision trees [104] and regularization [57], are generally used to increase noise robustness. Another common practice is the use of ensembles, either to increase the robustness of the base classifiers [113, 117] or to detect noisy instances during the building process [111]. Even popular disciplines such as deep learning evaluate their robustness against label noise, which is crucial in real-world applications [110].

  • Probabilistic Noise Tolerant Methods Some approaches for dealing with noise incorporate a probabilistic noise model into model construction. This is designed to model the (unknown) noise-free distribution by decoupling the data and noise generation processes. Some examples include Bayesian approaches, which incorporate the prior probability distribution of noise [27]; subpopulation approaches, which use a mixture model of normal and noise distributions [116]; and belief functions, which model the degree of certainty in a class label [73].

  • Model-based Noise Tolerant Methods Some artifacts can be incorporated into a learning algorithm in order to increase its tolerance to noise. For instance, SVMs [44, 65, 137], neural networks [58] and ensembles [99] can incorporate mechanisms for noise tolerance.
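
To make the data cleaning category concrete, the sketch below implements a classical majority-vote ensemble filter in the spirit of the ensemble-based identification approaches cited above (it is not a reimplementation of any specific method such as INFFC): an instance is flagged as noisy when the majority of out-of-fold predictions disagree with its recorded label. The helper name majority_vote_filter is our own.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def majority_vote_filter(X, y, cv=10):
    """Flag an instance as noisy when most out-of-fold classifiers
    disagree with its recorded label (classical majority filtering)."""
    learners = [DecisionTreeClassifier(), KNeighborsClassifier(5), GaussianNB()]
    preds = np.array([cross_val_predict(clf, X, y, cv=cv) for clf in learners])
    votes_against = (preds != np.asarray(y)).sum(axis=0)
    return votes_against > len(learners) / 2  # boolean mask of suspects

# Removal keeps only the trusted instances; a repairing variant would
# instead relabel the suspects with the ensemble's majority prediction:
# mask = majority_vote_filter(X, y); X_clean, y_clean = X[~mask], y[~mask]
```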

In Fig. 2, we depict a combined taxonomy derived from Frénay and Verleysen [30] and related to the traditional approaches described in the literature. As the reader may notice, the Data Cleaning Methods category is directly related to the data-level approaches, while we have opted to distinguish between pure noise filtering methods and repairing algorithms that relabel the identified noisy instances.

Fig. 2 Taxonomy of noise treatment methods, adapted from [30] to meet the previous nomenclature

The reader must take into account that, when dealing with noisy data, the statistical model which induced the noise should be identified. Failing to do so will result in severe bias in the resulting preprocessed data or in the created model. From the missing values literature [69], it is known that filtering instances is acceptable when the NAR or NCAR noise models apply. If NNAR noise is detected, more complex strategies should be used: since the noise introduction depends on the input attributes, the latter cannot be safely used to identify mislabeled instances.

From a statistical point of view, a classifier estimates the probability P(Y|X). Following the dependencies depicted in Fig. 1, the NAR case results in \(P(\hat{Y}|E,Y)\), which is independent of X and thus does not interfere with the estimation of P(Y|X) made by the classifier. However, in the NNAR case we have \(P(\hat{Y}|X,E,Y)\), which is not statistically independent of the distribution of X. As a result, the noise will bias a traditional classifier, which assumes that P(X|Y) is the only dependence between the class label and the observed (input) values.

The NNAR framework implies that the classifier must be aware of the dependence between the input values and the observed (and possibly untrue) class label [75]. In these cases, the usage of probabilistic noise tolerant methods is recommended, as they enable the practitioner to model such dependence on both the input values and the observed class. Other approaches can also be used, such as risk minimization, where different loss functions have been analyzed under the NNAR assumption [6].

3 Noise characterization in non-binary classification problems

While learning from noisy data is a challenge in itself, open-ended problems in machine learning make learning from noisy data even more difficult to handle. This section analyzes the class noise problem in the context of non-binary classification problems, including multiclass, multilabel, multitask, multi-instance, ordinal and data stream classification. Each problem presents characteristics which require specialized noise handling techniques, as well as particularities that must be addressed when designing experimental evaluations. Even some well-studied problems, such as noise handling in multiclass classification, have peculiarities which have not been properly discussed in the literature.

3.1 Noise in multiclass classification

In multiclass classification, each instance is assigned to one class from the set \(\mathcal {Y} = \{c_1,\ldots ,c_{m}\}\), where \(m > 2\) is the cardinality of \(\mathcal {Y}\), i.e., the number of possible classes. Class noise in this case occurs when the true label \(c_i\) of an instance is erroneously replaced by another label from the set \(\mathcal {Y} {\setminus } \{c_i\}\).

A larger number of classes is a source of additional trade-offs for machine learning algorithms [4]. A similar phenomenon happens in noise modeling. As seen in Sect. 2, the traditional statistical taxonomy of label noise considers three categories, which take into account whether the occurrence of class noise depends on the true class or on the input attribute values. Although this taxonomy is still applicable to multiclass problems, the multiclass scenario introduces additional stochastic dependencies concerning how the other classes influence the definition of the observed noisy class.

In the cases where noise does not depend on the data features (NCAR and NAR), the noise process can be characterized in terms of the labeling (or transition) matrix [30, 98], as shown in Eq. 1.

$$\begin{aligned} y = \left( \begin{array}{ccc} \gamma _{11} & \cdots & \gamma _{1m} \\ \vdots & \ddots & \vdots \\ \gamma _{m1} & \cdots & \gamma _{mm} \end{array} \right) \end{aligned}$$
(1)

where \(\gamma _{ij}\), for \(i,j \in \{1,\ldots ,m\}\), is the probability that an instance whose true label is \(c_i\) is (mis)labeled as \(c_j\). Each row of the labeling matrix sums to 1, as each instance must be assigned to exactly one label.

In the NCAR case, the probability \(p_e\) of an instance being mislabeled depends neither on the class nor on the input space. Therefore, the probability \(\gamma _{ii}\) of an instance not being mislabeled is the same for all classes (as it does not depend on the class values) and is equal to \(1 - p_e\). Thus, all entries on the main diagonal of the labeling matrix have the same value.

The NAR case allows asymmetrical mislabeling rates for the different classes. Let \(\mathcal {P} = \{p_1, \ldots , p_{m} \}\) be the mislabeling rates associated with each class and \(\Pi = \{\pi _{1},\ldots ,\pi _{m}\}\) the class distribution. The overall noise probability is proportional to the error probability of each class and its corresponding distribution, i.e., \(p_e = \sum _i^m p_i \pi _i \), and \(\gamma _{ii} = 1 - p_i\). Note that NCAR is a special case of NAR in which \(p_i\) is the same for all m classes.

Equation 2 shows the labeling matrix for binary class problems. Note that, as only two classes are available, the matrix is \(2\times 2\). Under NCAR, \(p_1 = p_2 = p_e\), and for both NAR and NCAR, the probabilities \(\gamma _{ij}\) for \(i\ne j\) are uniquely and unambiguously defined as the complement of \(\gamma _{ii}\), as each row must sum to one.

$$\begin{aligned} y_\mathrm{bin} = \left( \begin{array}{cc} 1 - p_1 & p_1 \\ p_2 & 1 - p_2 \end{array} \right) \end{aligned}$$
(2)

However, in problems with more than two classes, the distribution of the noise probabilities is no longer uniquely defined, as the noise mass of each class must be distributed among the remaining classes. There are numerous possibilities to consider, some of them quite arbitrary [62]; here we highlight some general cases which may be of practical interest (a sketch constructing the corresponding labeling matrices follows the list).

  • Uniform distribution The most common approach considered in the literature is the uniform case [41], where the observed noise is evenly distributed among the remaining classes. This is equivalent to saying that, when noise for a certain class occurs, all remaining classes are equally probable as the observed label of the instance. The corresponding labeling matrix is shown in Eq. 3. In the NCAR and NAR cases, the probabilities \(\gamma _{ij}\) (for \(i \ne j\)) are \(\nicefrac {p_e}{m - 1}\) (as \(p_1,\ldots ,p_m=p_e\)) and \(\nicefrac {p_i}{m - 1}\), respectively.

    $$\begin{aligned} y_\mathrm{uniform} = \left( \begin{array}{ccc} 1- p_1 & \cdots & \nicefrac {p_1}{m - 1} \\ \vdots & \ddots & \vdots \\ \nicefrac {p_m}{m - 1} & \cdots & 1- p_m \end{array} \right) \end{aligned}$$
    (3)
  • Natural distribution An approach that is less considered in the literature but may occur in practice is the natural distribution case, where the noise distribution is proportional to the natural distribution of the remaining classes. That is to say that, when noise for a certain class occurs, another class replaces it with a probability proportional to the natural class distribution. A possible situation where a natural noise distribution occurs is when the class is derived from the readings of a sensor (e.g., the graduation of a disease based on a screening test): a random Gaussian noise in the sensor would generate a natural distribution in the class noise. The labeling matrix is shown in Eq. 4. For the NCAR and NAR cases, the noise probabilities for each class are \(\gamma _{ij} = \nicefrac {\pi _{j}}{1-\pi _i}\,p_e\) (as \(p_1,\ldots ,p_m=p_e\)) and \(\gamma _{ij} = \nicefrac {\pi _{j}}{1-\pi _{i}}\,p_i\), respectively, where \(\pi _{i},\pi _{j}\) are the class distributions for \(i \ne j\).

    $$\begin{aligned} y_\mathrm{natural} = \left( \begin{array}{ccc} 1-p_1 & \cdots & \nicefrac {\pi _{m}}{1-\pi _{1}}\,p_1 \\ \vdots & \ddots & \vdots \\ \nicefrac {\pi _{1}}{1-\pi _{m}}\,p_m & \cdots & 1- p_m \end{array} \right) \end{aligned}$$
    (4)
  • Default class In the default class case, whenever label noise occurs, and regardless of the true value, the instance is incorrectly assigned to one particular class. This type of noise may occur when, for instance, a default value (e.g., “other”) is assigned when the true class could not be determined. Let \(c_d\) be the default class. A labeling matrix for the (arbitrarily chosen) default class 3 is shown in Eq. 5. Then, for NCAR and NAR, \(\gamma _{ij} = p_e\) and \(\gamma _{ij} = p_i\) if \(j=d\), and \(\gamma _{ij} = 0\) otherwise.

    $$\begin{aligned} y_\mathrm{default} = \left( \begin{array}{ccccc} 1-p_1 & 0 & p_1 & \cdots & 0 \\ 0 & 1-p_2 & p_2 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & p_m & \cdots & 1-p_m \end{array} \right) \end{aligned}$$
    (5)
  • Blockwise In the blockwise case, there are correlations among the noise patterns of some classes, so that they form sub-matrices (or blocks) within the transition matrix. A common pattern studied in the literature is the pairwise case [147, 149], in which two classes \(c_1\) and \(c_2\) have their true classes mislabeled with some probability (for instance, in handwritten digit classification, the digits 3 and 8 have similar shapes and may be mislabeled at the same rate in the data set). In this case, only two cells outside the diagonal of the labeling matrix are nonzero. However, blocks may involve more than two classes. Furthermore, some data sets may have more than one block (for instance, in handwritten digit classification, two pairwise blocks may exist, e.g., digits 3–8 and 1–7), and the blocks may have different noise patterns (e.g., one can be NAR and the other NCAR). A labeling matrix with two (arbitrarily chosen) blocks is shown in Eq. 6.

    $$\begin{aligned} y_\mathrm{block} = \left( \begin{array}{ccccc} 1-p_1 & p_1 & \cdots & 0 & 0 \\ p_2 & 1-p_2 & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1-p_{m-1} & p_{m-1} \\ 0 & 0 & \cdots & p_{m} & 1-p_m \end{array} \right) \end{aligned}$$
    (6)
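
The labeling matrices above translate directly into code. The following is a minimal sketch under the NAR setting (per-class rates p and, for the natural pattern, class priors pi); the function names are our own, and the blockwise pattern is omitted since it amounts to placing pairwise \(2\times 2\) blocks on the diagonal.

```python
import numpy as np

def uniform_matrix(p):
    """Uniform pattern (Eq. 3): class i keeps its label w.p. 1 - p[i];
    the noise mass p[i] is spread evenly over the other classes."""
    m = len(p)
    G = np.tile((np.asarray(p) / (m - 1))[:, None], (1, m))
    np.fill_diagonal(G, 1 - np.asarray(p))
    return G

def natural_matrix(p, pi):
    """Natural pattern (Eq. 4): the noise mass of class i is split in
    proportion to the prior pi of the remaining classes."""
    m = len(p)
    G = np.array([[pi[j] / (1 - pi[i]) * p[i] for j in range(m)] for i in range(m)])
    np.fill_diagonal(G, 1 - np.asarray(p))
    return G

def default_matrix(p, d):
    """Default-class pattern (Eq. 5): all noise mass goes to class d."""
    m = len(p)
    G = np.eye(m)
    for i in range(m):
        if i != d:
            G[i, i], G[i, d] = 1 - p[i], p[i]
    return G

# Observed labels are then sampled row-wise from the true labels y:
# y_hat = np.array([rng.choice(m, p=G[c]) for c in y])
```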

To illustrate the need to take into account the peculiarities of the multiclass case, we use artificially generated data sets.Footnote 1 The data sets are created with 5 classes and two attributes, with class centers evenly distributed along a circle of radius \(\sqrt{\# \text {of classes}}\) (\(\sqrt{5}\)) and a standard deviation of 1. We generated data sets with four different class distributions, according to a geometric progression which sums to 1, using four different ratios: 1, 0.75, 0.5 and 0.25. These ratios were selected to generate data with different class distributions, as shown in Fig. 3. A ratio of 1 means that the class proportions are equal for all classes (so we have 2,000 points for each class), whereas a ratio of 0.25 means that the second most frequent class has \(\nicefrac {1}{4}\) of the instances of the most frequent class, the third most frequent class has \(\nicefrac {1}{4}\) of the instances of the second most frequent, and so on (so the class sizes are about 7500, 1875, 472, 121 and 30 instances). Similarly, for ratios of 0.75 and 0.5, the number of instances reduces in proportions of \(\nicefrac {3}{4}\) and \(\nicefrac {1}{2}\), respectively. A pictorial representation of the four data sets is shown in Fig. 3, where colors indicate class values.
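
A minimal sketch of this generator is shown below (the function name and seed are our own choices); it reproduces the geometric class proportions and the circular arrangement of class centers described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_circle_data(m=5, n=10_000, ratio=0.5):
    """Gaussian classes with centers evenly spaced on a circle of
    radius sqrt(m); class sizes follow a geometric progression."""
    props = ratio ** np.arange(m)
    props /= props.sum()                       # class priors summing to 1
    sizes = (props * n).astype(int)            # truncation may drop a few points
    angles = 2 * np.pi * np.arange(m) / m
    centers = np.sqrt(m) * np.c_[np.cos(angles), np.sin(angles)]
    X = np.vstack([rng.normal(centers[k], 1.0, (sizes[k], 2)) for k in range(m)])
    y = np.repeat(np.arange(m), sizes)
    return X, y
```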

Fig. 3 Pictorial representation of the artificial data sets. a Ratio = 1. b Ratio = 0.75. c Ratio = 0.5. d Ratio = 0.25

Figure 4 illustrates the difference in behavior, in terms of balanced error rate, of four learning algorithms (J48 decision tree, k-nearest neighbors with k=5, linear discriminant analysis and naïve Bayes) when we artificially introduce class noise into 2.5% to 50% of the instances, in steps of 5%, following the uniform and natural distribution class noise patterns. These experiments were performed using tenfold cross-validation, and artificial noise was introduced only in the training set. As expected, there are no perceptible differences when the classes are equally balanced (the “1” column): corrupting the class uniformly and corrupting it according to the natural class distribution have virtually the same effect, since for balanced classes the natural distribution is itself uniform. However, as the class proportions become more skewed, artificial noise introduced following the natural class distribution tends to be more harmful to the learning algorithms than noise introduced uniformly (except for kNN), for the same noise level.Footnote 2

Fig. 4 Difference between uniform and natural multiclass noise patterns in an artificial data set

Although the influence of different noise patterns in multiclass classification is clear, these aspects are seldom discussed in the literature. To illustrate this problem, we ran INFFC [113] and CNC-NOS [72], two recently proposed techniques for treating noise, using the same experimental setup described earlier. The results are depicted in Fig. 5, where uniform noise introduction is shown in Fig. 5a and noise introduction according to the natural class proportions is shown in Fig. 5b. Even though we are dealing with a simple problem, it is possible to see a difference in the performance of the two treatment techniques under the two scenarios. INFFC and CNC-NOS can reasonably recover from the noise when it is introduced following a uniform pattern in almost all cases. (The lines corresponding to treated data are almost flat in most cases, indicating that the methods can successfully recover from increasing noise ratios.) However, the treatment methods are less effective when noise is introduced according to the natural class distribution for highly imbalanced data sets, where the differences in class distribution are larger. (The lines corresponding to treated data bend upwards for larger noise rates for the data sets generated with ratios of 0.25 and 0.5 from minority to majority classes.) This experiment shows that even state-of-the-art methods behave differently for different types of noise patterns.

Fig. 5 Comparison of treated (using INFFC and CNC-NOS) and untreated data. a Uniform noise introduction. b Natural noise introduction

The multiclass case can also imply further issues in the NNAR case (in which the class noise also depends on the input space). Some class combinations may be differently influenced by different regions of the input space. There may be regions, for example, where the likelihood of class noise for some class combinations is higher than for others, while in other regions other class combinations are more likely to be confused. For example, different class boundaries may influence the likelihood of class noise in different regions of the feature space, implying different blockwise interactions in different regions. Another example involves high-density regions, whose instances may be more likely to be mislabeled as majority classes or as a default class.

Figure 6 shows an experiment similar to that of Fig. 4, except that the likelihood of an instance being mislabeled increases with its distance to the class center. Although the shapes in both figures are similar, we can observe an increase in the error rates of the algorithms, especially J48.Footnote 3 It is expected that noise handling in more challenging multiclass classification tasks will become much more difficult.
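
A minimal sketch of this distance-dependent (NNAR) corruption is given below; the linear scaling of the mislabeling probability with the distance to the class center (capped at max_p) is an illustrative assumption, as the exact scaling used in the experiment is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

def nnar_flip(X, y, centers, classes, max_p=0.5):
    """NNAR sketch: the mislabeling probability grows linearly with the
    distance of an instance to its own class center (at most max_p)."""
    dist = np.linalg.norm(X - centers[y], axis=1)
    probs = max_p * dist / dist.max()          # farther from center => noisier
    flip = rng.random(len(y)) < probs
    y_hat = y.copy()
    for i in np.where(flip)[0]:
        y_hat[i] = rng.choice([c for c in classes if c != y[i]])
    return y_hat
```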

Fig. 6 Difference between uniform and natural multiclass noise patterns in an artificial data set, where instances far from the class center are more likely to be mislabeled

We repeated the experimental setup of Fig. 5, this time using the NNAR noise setup of Fig. 6 to introduce the artificial noise; the results are shown in Fig. 7. It is interesting to observe that the noise treatment techniques behave differently in this case for the natural and uniform class distribution patterns. CNC-NOS is much less effective in alleviating the noise problem. When the class distribution is more skewed (ratios 0.5 and 0.25), CNC-NOS is sometimes worse than not treating the data at all. Furthermore, for natural noise patterns, the performance degradation of CNC-NOS occurs at lower noise rates. This degradation can be explained by the fact that CNC-NOS was designed under the assumption of NCAR noise, but the difference between uniform and natural noise could be studied further.

Fig. 7 Comparison of treated (using INFFC) and untreated data. a Uniform noise introduction. b Natural noise introduction

3.2 Noise in multilabel classification

In multilabel classification, an instance can be assigned to multiple target labels at the same time.Footnote 4 For instance, given a set of topics in a news outlet, a news article can be classified as both sport and politics if it covers riots against the exorbitant cost of funding the World Cup. More formally, given a set \(\mathcal {Y} = \{l_1, \ldots , l_{m}\}\), instances are associated with a subset \(Y \subseteq \mathcal {Y}\), where the cardinality of Y can be any value between 1 (i.e., \(Y \ne \emptyset \)) and the number of labels m. The set Y is called the set of relevant labels, while the complement \(\mathcal {Y} {\setminus } Y\) is considered irrelevant for that instance. To facilitate notation, we define a binary vector \(\mathbf {y} = (y_{1},\ldots ,y_{m})\), where \(y_i=1\) indicates the presence (relevance) and \(y_i=0\) the absence (irrelevance) of \(l_i\) in the labeling of an instance.

Dealing with noise in multilabel classification is a very important topic, as numerous applications use “soft labels”: labels which are not assigned by a domain expert but are derived from automatic taggers [101] or from non-experts via crowdsourcing [92], both of which are known to introduce noise. However, the influence of noise in multilabel classification has so far been little studied. For each instance, multilabel noise may occur in two situations: (a) an irrelevant label is incorrectly included in the relevant label set or (b) a relevant label is incorrectly left out of the relevant label set.

A straightforward approach for dealing with multilabel noise is to apply data transformation methods [45, 51] and deal with the noise in the transformed data sets. Two possible approaches are to consider each label individually (i.e., use a strategy similar to binary relevance and treat a binary noise problem for each label) or each label set (i.e., use a strategy similar to label power sets and treat a multiclass noise problem over the possible label sets). This transforms the problem of dealing with multilabel noise into binary or multiclass noise problems, so that methods natively proposed for those problems can be applied to handle multilabel noise.
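
As a minimal sketch of the binary relevance route, the function below applies an arbitrary binary noise filter (such as the majority-vote filter sketched in Sect. 2.4) independently to each label column; the helper names are our own.

```python
import numpy as np

def filter_by_binary_relevance(X, Y, binary_filter):
    """Apply a binary noise filter independently to each column of the
    0/1 label matrix Y (binary relevance transformation); binary_filter
    is any function (X, y) -> boolean mask of suspected noisy entries."""
    suspects = np.zeros(Y.shape, dtype=bool)
    for j in range(Y.shape[1]):
        suspects[:, j] = binary_filter(X, Y[:, j])
    # A repair strategy could flip only the suspicious entries of Y,
    # instead of dropping whole instances flagged for a single label.
    return suspects
```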

Dealing with the absence of relevant labels is also challenging. Some labels may be very rare (i.e., they suffer from a “label imbalance” problem, the class imbalance problem in the multilabel context [16]). Filter techniques are not directly applicable in this case, and noise reparation methods may perform poorly due to the scarcity of some labels, as those techniques are generally data driven and cannot work well in the absence of information.

Another problem that may influence noise patterns in the multilabel case is label dependence. In several multilabel applications, labels are correlated, i.e., the occurrence/absence of one label may influence or be influenced by the occurrence/absence of others. For instance, multilabels can be organized into a hierarchical [127] or a graphical [123] structure. Furthermore, this structure may or may not be known at training time [87].

There are two types of independence to consider [24]: marginal independence and conditional independence. When marginal independence is violated, there are global correlations among labels that are not affected by the input space. In other words, the product of the marginal distributions of the labels does not characterize the overall distribution of labels [24]. On the other hand, a conditional independence violation occurs when the correlation among labels is concentrated in some regions of the input space. This is to say that some data characteristics increase or decrease the co-occurrence of labels.

Therefore, when considering label noise patterns (e.g., for experimental simulations on multilabel classification), there are three cases to be considered:

  • Individual Label Noise In which noise patterns are considered in isolation for each label, and the inclusion of noise in one label does not influence the occurrence of noise in other labels. As each label is considered in isolation, the individual noise pattern of each label can be assumed to be NCAR, NAR or NNAR. In the NCAR case, for a particular label, the noise probability for relevance and irrelevance is the same. In NAR, different noise probabilities for relevance and irrelevance can be modeled, and in NNAR, the relevance and irrelevance noise probabilities also depend on the input space. Furthermore, as labels are binary (a label is either relevant or not), a noise occurrence for one label implies switching the value of that label for an instance from 0 to 1, or vice versa.

  • Joint Label Noise In which noise patterns take into account correlations among labels. The stronger the correlation, the higher the probability that different correlated labels of an instance are corrupted together. This is likely to occur, for instance, if the labeling follows a hierarchical process: if an error occurs at a higher level in the hierarchy, it will propagate to the lower levels. To model this dependence, a label correlation model should be taken into account. A model widely adopted in the literature for balancing model complexity and capacity is a sequential chaining structure [19, 87]: \(l_i\) is conditioned on \(\{l_1,\ldots ,l_{i-1}\}\). This model is based on the product rule of probability and is the basis of classifier chains [108] for dealing with label dependence. A possible way to model errors in the label chain is shown in Eq. 7.

    $$\begin{aligned} p\left( \tilde{l}_i,e\,|\,\{l_{i-1},\ldots ,l_1\}\right) = p\left( \tilde{l}_i\,|\,\{\tilde{l}_{i-1},\ldots ,\tilde{l}_1\}\right) \times p\left( e\,|\, \{\tilde{l}_{i-1},\ldots ,\tilde{l}_1\},\tilde{l}_i\right) \end{aligned}$$
    (7)

    The first term is due to the label chaining and refers to the influence of the previous labels on the definition of the actual label. The second term refers to the probability that a labeling error occurs, conditioned on the label chain. Given the label chain, this probability may not depend on the actual label or the instance space (a case analogous to NCAR); it may depend only on the relevance/irrelevance of the actual label (analogous to NAR); or it may depend on both the current label and the instance space (analogous to NNAR).

  • Joint Instance and Label Noise In which noise patterns take into account correlations between the labels and the input space. This situation is similar to Joint Label Noise, except that the label correlation also depends on the instance space. A possible formulation considering the chain rule is shown in Eq. 8.

    $$\begin{aligned} p\left( \tilde{l}_i,e\,|\,\{\mathbf {x},l_{i-1},\ldots ,l_1\}\right) = p\left( \tilde{l}_i\,|\,\{\mathbf {x},\tilde{l}_{i-1},\ldots ,\tilde{l}_1\}\right) \times p\left( e\,|\, \{\mathbf {x},\tilde{l}_{i-1},\ldots ,\tilde{l}_1\},\tilde{l}_i\right) \end{aligned}$$
    (8)

    The second term models the error, considering the correlation between labels and the instance space. Given a label chain that also depends on the instance space, we may have cases analogous to NCAR, NAR and NNAR.

A problem with considering a chaining model for the label correlations is that an ordering among the labels must be chosen. This problem has been tackled with a combination of classifiers using different orderings [19], and it would be interesting to analyze the interactions of these methods with noise. Another problem is that, for problems with a large number of labels, the correlation model can become quite complex. A possible way to overcome this is to consider only correlations within subsets of labels, as in [134], since it is unlikely that all labels influence each other.
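
A minimal sketch contrasting individual and joint label noise is given below. The joint variant mimics the experimental procedure used later in this section (relevant labels of a corrupted instance are flipped together with high probability); the function names and the parameter p_joint are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def individual_label_noise(Y, p):
    """Individual noise: every entry of the 0/1 label matrix Y is
    flipped independently with probability p."""
    flip = rng.random(Y.shape) < p
    return np.where(flip, 1 - Y, Y)

def joint_label_noise(Y, p, p_joint=0.9):
    """Joint noise sketch: an instance is selected for corruption with
    probability p; its relevant labels are then flipped together, each
    with probability p_joint, so correlated labels are corrupted jointly."""
    Y = Y.copy()
    hit = rng.random(Y.shape[0]) < p
    for i in np.where(hit)[0]:
        for k in np.flatnonzero(Y[i] == 1):
            if rng.random() < p_joint:
                Y[i, k] = 0          # a relevant label is dropped
    return Y
```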

Figures 8 and 9 show the behavior of multilabel classifiers on artificially generated data sets. The artificial data sets were generated as follows: 10,000 data points are generated in five clusters (of similar size), with two attributes generated similarly to Fig. 3. However, the clusters are labeled from a set of five possible labels. The probability of each label is generated according to a geometric progression with ratios of 1, 0.75 and 0.5, ranging from all labels being equally probable (ratio 1) to the most frequent label being twice as probable as the second most frequent (ratio 0.5), and so on. The label cardinality is about two, meaning that on average two out of the five labels are present. For each cluster, the probability of one of the labels is set to zero (meaning this label is not present in that cluster).

In Fig. 8, the label correlation is zero (that is, the occurrence of one label does not influence the occurrence of the others), while in Fig. 9 the label correlation is 0.5, according to the procedure described in Leisch et al. [63]. Noise was introduced both independently (individual label noise) and taking label correlations into account (positive labels have a 90% chance of being corrupted together when there are more than two positive labels in the same instance). Four different multilabel learning algorithms were used. Three of them follow a problem transformation approach, converting the multilabel problem into a set of single-label problems and using standard binary classifiers (we use J48 and kNN as base classifiers): binary relevance (BR) converts the multilabel problem into independent binary classification problems; classifier chain (CC) trains a chain of classifiers where at each step the data are augmented with the previously trained labels; and dependent binary relevance (DBR) uses the values of all other labels. Finally, rFerns uses a version of the random ferns of Ozuysal et al. [94] adapted to the multilabel case. Experiments were carried out using tenfold cross-validation.

Fig. 8 F1-measure for a multilabel data set with uncorrelated labels

Analyzing the figures, it is interesting to note that, when labels are correlated, the noise introduction pattern which jointly corrupts positive labels has a stronger negative effect on the behavior of the learning algorithms (measured in terms of the average F1-measure over all labels). This phenomenon is even more accentuated for the algorithms designed to take label correlation into account (CC and DBR). The results on this simple artificial data set illustrate the importance of considering joint label noise patterns in research on noise in multilabel problems.

Fig. 9 F1-measure for a multilabel data set with correlated labels

3.3 Noise in multitask problems

Multitask learning [13] is a machine learning paradigm which aims to improve the performance of learning algorithms by learning classifiers for multiple tasks together, potentially exploiting commonalities between them. More formally, given a set of supervised tasks \(\mathcal {T} = \{T_1,\ldots ,T_m\}\), where for each task \(T_i\) a set of examples \(\{X_i,Y_i\}\) is available, the goal is to learn a set of m models \(\mathcal {H}_i: X_i \rightarrow Y_i\) simultaneously from the available examples.

The main argument in favor of multitask learning is that the relations between the tasks make it advantageous to learn the tasks simultaneously instead of learning each task independently. For instance, one can learn web-ranking functions for different languages, where each language represents a task. Although it is possible to learn an independent ranking function for each language in isolation, learning them simultaneously makes it possible to exploit cross-lingual information that enhances the rankings [15]. Another example is the prediction of disease progression as measured by cognitive scores, learning a predictor simultaneously at multiple time points [145].

There are several options for exploiting the commonalities. The different tasks may share some features (for instance, data that come from a relational database may have some features in common), or similar feature extraction techniques can be used (the same algorithm could be used to extract features from images [139], for instance). Another possibility is that the same object is described by different sets of uncorrelated features (multiview learning [121]), and the different views are used to improve classification [10].

A straightforward approach for (artificial) noise modeling is to consider independent noise modeling for each task. In this scenario, different tasks may have different noise patterns, which can be NCAR, NAR or NNAR. However, noise simulation patterns could also take into account the commonalities among the tasks under consideration. For instance, when tasks share the class information (as in multiview learning), the same noise pattern should be considered for all tasks.

An area of research closely related to multitask learning is domain adaptation, or transfer learning [95], where a source domain, generally with abundant training data, is used to help the learning of a target domain for which very little training data are available. For instance, it is possible to use data from a city where a large labeled data set is available to improve predictions in other cities [129]. It would be interesting to investigate transfer learning approaches for noise handling.

Figure 10 shows a comparison of different noise scenarios with and without transfer learning. For this experiment, data from the source and target domains were generated according to settings 1 and 2 from Friedman [31], respectively, using the klaR package.Footnote 5 The source domain contains 5000 data points, and the target domain 2000 data points. Both domains have three classes and six features. The source domain uses equal spherical covariance matrices, whereas the target domain uses unequal spherical covariance matrices. In the figure, the x-axis corresponds to uniform artificial noise introduction in the target domain, ranging from 0 to 50% in steps of 5%. We used the same procedure for artificial noise introduction in the source domain, although for the sake of visualization only the cases of 0, 25 and 50% are shown. We used the incremental stochastic gradient descent training of support vector machines from the sklearn package. The figure shows the accuracy, estimated using fivefold cross-validation, with and without knowledge transfer between the tasks: the model is initialized with the source data and then tuned with the target data. Analyzing the graph, we can observe that the benefit of domain adaptation is reduced when the source is corrupted with noise. Furthermore, with large noise rates in both the source and target domains, the combination is worse than using the target domain alone.
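
This warm-start procedure can be sketched with scikit-learn's SGDClassifier, whose partial_fit method allows initializing the model on the source domain and then tuning it on the target domain. The data arrays (X_src, y_src, X_tgt, y_tgt) are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# (X_src, y_src): large, possibly noisy source domain (hypothetical arrays)
# (X_tgt, y_tgt): small target domain
clf = SGDClassifier(loss="hinge", random_state=0)   # linear SVM trained by SGD

classes = np.unique(np.concatenate([y_src, y_tgt]))
clf.partial_fit(X_src, y_src, classes=classes)      # initialize on the source
clf.partial_fit(X_tgt, y_tgt)                       # then tune on the target

# The no-transfer baseline is simply SGDClassifier().fit(X_tgt, y_tgt);
# heavy label noise in the source can erase the benefit of the warm start.
```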

Fig. 10 A comparison of different noise scenarios with and without knowledge transfer

3.4 Noise in multi-instance learning

In the multi-instance learning framework, training examples consist of “bags” (collections of instances) rather than singleton instances [2]. This allows complex problems to be represented by decomposing them into pieces [20]. For example, consider image classification: multiple patches can be extracted from an image, each patch associated with an instance, so a bag containing all the patch instances represents an example (image). Another example is text classification, where sections of a document are represented by instances and the document itself is represented by the bag of these instances.

The typical formulation assumes that bags belong to two classes (generally called positive and negative). Furthermore, only the labels of the bags are known (i.e., the individual labels of the instances within each bag are unknown). For example, in object recognition, labels are associated with an image: the presence of an object is known, while its exact location within the image is not. A bag is labeled as positive if it contains at least one positive instance, and negative otherwise. More formally, the data set is composed of pairs \(\{\mathcal {B}_i,y_i\}\), where \(\mathcal {B}_i\) is a bag and \(y_i\) is its class. Each \(\mathcal {B}_i\) is a set of j instances \(\{\mathbf {B}_{i1},\ldots ,\mathbf {B}_{ij}\}\) (the index i identifies the bag and the index j the instances within the bag) and, generally, the class \(y_i \in \{0,1\}\). Although multiclass labels may be used [96, 136], the use of multi-instance multilabel learning is also common [146].

In multi-instance classification, class noise may occur when a bag is incorrectly labeled. In the binary case, this means that a bag that does not have any positive instance is incorrectly labeled as positive, or a bag which has at least one positive instance is incorrectly labeled as negative. However, it is also possible to analyze noise at the instance level. Multi-instance learning can be seen as a noise handling technique for one-tier noise (noise within only the positive class), where negative instances inside a positive bag can be considered a kind of “noisy instance” of the positive concept [67]. A common approach is the diverse density estimator [76], whose objective is to build a model which takes into account the evidence required to classify positive instances from the union of the positive bags, filtering out its intersection with the negative bags.

The noisy-OR approach [97] assumes that there is at least one positive instance inside a positive bag and no positive instances in the negative bags. This approach considers the instances within a bag as independent and computes the probability of a class given the positive bags using the noisy-OR, as shown in Eq. 9:

$$\begin{aligned} p\left( c_i\,|\,\mathcal {B}_i^{+}\right) \propto 1 - \prod _j \left( 1-p(\mathbf {B}_{ij}^+ \in c_i)\right) \end{aligned}$$
(9)

where \(p(\mathbf {B}_{ij}^+ \in c_i)\) is the probability of an instance in a positive bag \(\mathcal {B}_i^+\) being positive, which can be calculated in several ways. In [64], a boosting approach was used to compute \(p(\mathbf {B}_{ij}^+ \in c_i)\) for video taxonomy classification, in order to cope with noisy positive bags. Possibly noisy negative bags were ignored, as the authors argue that this type of noise is negligible in that kind of application.
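
The noisy-OR combination of Eq. 9 is straightforward to compute, as the short sketch below shows (the function name is our own); how the instance-level probabilities are estimated is left to the chosen model.

```python
import numpy as np

def noisy_or(instance_probs):
    """Noisy-OR bag probability (Eq. 9): the bag is positive unless all
    of its instances fail to be positive (instance independence)."""
    instance_probs = np.asarray(instance_probs)
    return 1.0 - np.prod(1.0 - instance_probs)

# A bag with instances positive w.p. 0.1, 0.2 and 0.7:
# noisy_or([0.1, 0.2, 0.7]) == 1 - 0.9 * 0.8 * 0.3 == 0.784
```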

A similar approach to the noisy-OR, but with fewer independence assumptions, is to consider the most likely instances within each bag [76]. In Li and Vasconcelos [66], the idea of considering the most likely instances (which the authors call top instances) was used to develop a large margin classifier for multi-instance classification. Their approach is applicable to positive and negative bags and thus handles both noisy positive and noisy negative bags.

The modeling of different noise scenarios should consider how bags are formed. If bags are created from labeled examples (e.g., bags composed of chunks of pre-labeled images), the noise scenarios can ignore the influence of the instances within a bag and consider only noise at the bag level. From this perspective, approaches similar to the binary, multiclass or multilabel cases can be considered, depending on the type of bag labels.

Figure 11 shows the results on an artificially generated data set using the milr R package [18].Footnote 6 The data were generated with 2000 bags. Each instance has 10 features randomly generated with a mean of zero and a standard deviation of 1. The number of instances within each bag was sampled uniformly from the range 2–5. The label of each instance was determined according to a logit link, with the parameter \(\beta \) randomly drawn from − 5 to + 5, independently for each feature. A bag is labeled as positive if it contains at least one positive instance. Noise was introduced in two ways: at the bag level and at the instance level. At the bag level, bag labels are flipped with probability p, while at the instance level, instance labels are corrupted with probability p. Changing the labels of instances may change the label of a bag if all positive instances of a positive bag are corrupted to negative, or if a single negative instance is corrupted to positive.
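
A minimal sketch of the two corruption procedures is given below (function names are our own); note how instance-level noise only changes a bag label indirectly, through the at-least-one-positive rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def bag_level_noise(bag_labels, p):
    """Flip each 0/1 bag label directly with probability p."""
    bag_labels = np.asarray(bag_labels)
    flip = rng.random(len(bag_labels)) < p
    return np.where(flip, 1 - bag_labels, bag_labels)

def instance_level_noise(bags, p):
    """Flip 0/1 instance labels w.p. p, then re-derive each bag label
    with the standard rule: positive iff at least one positive instance."""
    noisy = [np.where(rng.random(len(b)) < p, 1 - np.asarray(b), b) for b in bags]
    return np.array([int(b.any()) for b in noisy])

# A positive bag only turns negative when *all* its positive instances
# are corrupted, so the two procedures distort bag labels differently.
```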

Fig. 11 A comparison of bag noise and instance noise in multi-instance classification

The bag classification error (estimated using fivefold cross-validation) of five different multi-instance learning algorithms is compared in Fig. 11: CitationKNN (uses the Hausdorff distance and the citation concept to find the bag nearest neighbors), MiBoost (a boosting adaptation to multi-instance learning), MiSMO (sequential minimal optimization for an SVM adaptation to multi-instance learning), MiWrapper (combines the scoring predictions of the individual instances within a bag) and SimpleMI (converts the multi-instance problem into a single-instance problem by computing the average of all instances within a bag). Although for some algorithms the two noise types show a similar performance degradation, for CitationKNN and SimpleMI there are some notable differences. The degradation under instance noise is lower than under bag noise for CitationKNN at almost all artificially generated noise levels, while for SimpleMI, although it is slightly higher at lower noise levels, it is lower at intermediate noise levels.

However, in some multi-instance problems, bags are not formed from a single object, but by joining instances that have some characteristics in common. For instance, in [77] multi-instance learning was used for stock market prediction. Positive bags were formed from stocks which had an increase in value from an instant \(t-1\) to the instant t, and negative bags from stocks that did not show an increase. Bags are labeled positive under the assumption that the increase in some stocks is due to fundamental reasons and would lead to an increase in value at some instant \(t+1\), whereas negatively labeled bags assume that the decrease in value of all their stocks is due to fundamental reasons and would not lead to an increase in value at a future instant \(t+1\). Both assumptions are unlikely to hold for all time stamps; thus, noisy labeled bags can occur. Furthermore, these noisy bags may depend on the instances that form the bags. For instance, the stock market may be in a high-volatility period, during which (most of) the stocks go up or down irrespective of fundamental reasons. It would be interesting to analyze approaches for dealing with bag noise that take into account noise due to bag formation.

3.5 Noise in ordinal classes

For some machine learning tasks, classes have a natural ordering relation [48]. For instance, a user can use the ordinal scale {bad, average, good, very good, excellent} for a movie review. More formally, the set \(\mathcal {Y} = \{c_1,\ldots ,c_{m}\}\) has an ordering constraint \(c_i \prec c_j, \forall i < j\), where \(\prec \) is an order relation. A common assumption is that the data are arranged according to an ordinal scale, i.e., only the ordering among the classes matters, not their relative differences. In other words, the classes do not carry metric information, and they only serve as a qualitative indicator.

A naïve approach to defining class noise in ordinal problems is similar to that for multiclass problems (Sect. 3.1), where a class \(c_i\) is erroneously replaced by another class in \(\mathcal {Y} {\setminus } \{c_i\}\). However, this definition considers each mislabeling equally damaging, whereas, due to the inherent ordering among classes, some mislabelings are more harmful than others: mislabeling between adjacent labels may have a lower impact on the learning phase than mislabeling between distant labels.

In this sense, it would be interesting to investigate cost-sensitive noise handling techniques. Indeed, cost-sensitive learning is one approach for ordinal classification [125], and approaches such as cost-weighted binary decomposition [68] could be used to extend binary class noise dealing techniques. Another approach for extending binary noise handling techniques to the ordinal case is to exploit ordinal decomposition [48]. Instead of the one-versus-all and one-versus-one decompositions common in multiclass classification, ordinal decomposition builds binary subproblems indexed by i: ordered-partition (\(pos=\{c_j| j >i \}\) and \(neg = \{c_k| k \le i\}\)), one-versus-next (\(pos=\{c_i\}\) and \(neg = \{c_{i+1}\}\)), one-versus-follows (\(pos = \{c_i\}\) and \(neg = \{c_j|j>i\}\)) and one-versus-previous (\(pos = \{c_i\}\) and \(neg = \{c_j|j<i\}\)) decompositions.
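As an illustration, the ordered-partition decomposition can be written compactly. The following is a minimal sketch assuming 0-based integer labels; a binary noise filter could then be applied to each column independently.

```python
import numpy as np

def ordered_partition_targets(y, m):
    """Ordered-partition decomposition of ordinal labels y in {0, ..., m-1}:
    subproblem i asks "is the class greater than c_i?", i.e.,
    pos = {c_j | j > i} and neg = {c_k | k <= i}. Returns an (n, m-1)
    binary matrix with one column per subproblem."""
    y = np.asarray(y)
    return np.stack([(y > i).astype(int) for i in range(m - 1)], axis=1)

# Example with m = 5 ordinal classes
targets = ordered_partition_targets([0, 2, 4, 1], m=5)
```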

Ordinal classes may require the ordering to be included in noise modeling. As in the multiclass case, arbitrary patterns may be used, but some are of particular practical interest. Let \(\mathcal {P} = \{p_1,\ldots ,p_m\}\) be the mislabeling probabilities of the classes, \(\gamma _{ij}\) the probability of an instance of class \(c_i\) being mislabeled as class \(c_j\) for \(i,j \in \{1,\ldots ,m\}\), and let \(d_{ij} = |i-j|\) be the distance (on the ordinal scale) between classes \(c_i\) and \(c_j\). Some noise patterns of interest include (a code sketch follows the list):

  • Adjacent label noise In this pattern, adjacent classes are more likely to have their labels mixed. A possible adjacent noise distribution follows a linear decay, defined as \(\gamma _{ij} = \nicefrac {m-d_{ij}}{S_i}\) for \(j \ne i\), where \(S_i = \sum _{j \ne i} (m - d_{ij})\) normalizes the probabilities. Another possibility is a square decay, where the probability of an instance of class \(c_i\) being mislabeled as class \(c_j\) is \(\gamma _{ij} = \nicefrac {(m-d_{ij})^2}{S_i^{(2)}}\), with \(S_i^{(2)} = \sum _{j \ne i} (m - d_{ij})^2\).

  • Diametrical label noise In this pattern, diametrical (opposite) classes are more likely to have their labels mixed. Linear and square decay approaches can be defined as \(\gamma _{ij} = \nicefrac {d_{ij}}{S_i}\) and \(\gamma _{ij} = \nicefrac {d_{ij}^2}{S_i^{(2)}}\), respectively, with \(S_i = \sum _{j \ne i} d_{ij}\) and \(S_i^{(2)} = \sum _{j \ne i} d_{ij}^2\).

  • Optimistic label noise In the optimistic case, the probability of a class \(c_i\) being mislabeled as class \(c_j\) is higher for \(j > i\) than for \(j < i\). One possibility is to adopt fixed rates \(p_{\succ } > p_{\prec }\), where \(\gamma _{ij} = p_{\succ }\) for \(j>i\) and \(\gamma _{ij} = p_{\prec }\) for \(j<i\). Other possible approaches are adaptations of the adjacent and diametrical label noise patterns, restricted to \(j > i\) and setting \(\gamma _{ij} = 0\) for \(j < i\).

  • Pessimistic label noise The pessimistic case is the opposite of the optimistic one, i.e., the probability of a class \(c_i\) being mislabeled as class \(c_j\) is lower for \(j > i\) than for \(j < i\). One possibility is to adopt fixed rates \(p_{\succ } < p_{\prec }\), where \(\gamma _{ij} = p_{\succ }\) for \(j>i\) and \(\gamma _{ij} = p_{\prec }\) for \(j<i\). Other possible approaches are adaptations of the adjacent and diametrical label noise patterns, restricted to \(j < i\) and setting \(\gamma _{ij} = 0\) for \(j > i\).
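The sketch referenced above shows how the linear-decay adjacent and diametrical patterns translate into label-transition matrices for noise simulation. It assumes 0-based classes, a single global noise rate in place of per-class probabilities \(p_i\), and the per-class normalization \(S_i\) introduced above; these simplifications are ours.

```python
import numpy as np

def ordinal_noise_matrix(m, pattern="adjacent", noise_rate=0.2):
    """Build an m x m transition matrix: gamma[i, j] is the probability that
    an instance of class i is observed as class j. Rows sum to 1 and the
    diagonal keeps probability 1 - noise_rate."""
    d = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))  # d_ij = |i - j|
    if pattern == "adjacent":        # nearby classes more likely
        w = (m - d).astype(float)
    elif pattern == "diametrical":   # distant classes more likely
        w = d.astype(float)
    else:
        raise ValueError(pattern)
    np.fill_diagonal(w, 0.0)                    # no mass on the true class
    w /= w.sum(axis=1, keepdims=True)           # per-class normalizer S_i
    return (1 - noise_rate) * np.eye(m) + noise_rate * w

# Example: corrupt 1000 labels over 5 ordinal classes with adjacent noise
gamma = ordinal_noise_matrix(5, "adjacent", noise_rate=0.2)
rng = np.random.default_rng(0)
true_y = rng.integers(0, 5, size=1000)
noisy_y = np.array([rng.choice(5, p=gamma[y]) for y in true_y])
```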

Fig. 12 A comparison of optimistic, uniform and pessimistic noise in ordinal classification

A special case within ordinal classification is monotonic classification, where there are monotonic constraints between (some) features and the ordinal class [47]. For instance, the credit rating of a bank customer may depend monotonically on their income and assets. It has been argued that properly taking such constraints into account may improve classification performance in monotonic classification problems [7]. Instances violating the monotonic constraints can also be considered noisy [22, 54]. The generation of noisy monotonic data sets is data dependent, and some approaches can be found in Daniels and Velikova [22] and [84].
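As a sketch of how monotonicity violations can be detected, and violating instances flagged as noise candidates, the following assumes all features are monotonically increasing with the class; the pairwise counting heuristic is our illustration, not a method from the cited works.

```python
import numpy as np

def monotonicity_violations(X, y):
    """Count, for each instance, the monotone pairs it violates: if x_a
    dominates x_b on every feature, its class should not be lower.
    Instances involved in many violating pairs are noise candidates.
    O(n^2) sketch, kept simple for clarity."""
    n = len(y)
    violations = np.zeros(n, dtype=int)
    for a in range(n):
        for b in range(n):
            if np.all(X[a] >= X[b]) and y[a] < y[b]:
                violations[a] += 1   # both members of the pair are suspects
                violations[b] += 1
    return violations
```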

Figure 12 shows a comparison between two ordinal classifiers on an artificial data set. Data were generated according to a bivariate Gaussian distribution with mean 0 and standard deviation 1, with 1000 data points. Data are divided into five classes according to the radial distance to the data centroid: the first ordinal class is the closest to the centroid, whereas the last is the furthest from it. We simulate three types of noise: uniform, where a corrupted instance can assume the value of any other class with equal probability; optimistic, where a noisy instance has a 90% probability of being corrupted to a greater class value (when possible); and pessimistic, where a noisy instance has a 90% probability of being corrupted to a lower class value (when possible). Two ordinal classifiers were compared: rpartScore [35], an ordinal classifier based on decision trees, and ordinalForest [53], an ordinal classifier based on random forests. The figure shows the Youden agreement index, a metric commonly used in ordinal classification. It is interesting to observe that performance under uniform artificial noise lies somewhere between that under optimistic and pessimistic artificial noise. Furthermore, the performance curves for pessimistic and optimistic noise “cross” as the percentage of artificial noise increases. This reflects a possible tendency of the ordinal classification algorithms to prefer lower or greater class value predictions. Analyzing the behavior of ordinal and monotonic classification algorithms under different noise patterns could be an interesting topic for further research and may help shed some light on understanding the properties of these algorithms.
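The corruption scheme of this experiment can be sketched as follows. The quintile-based class assignment and the boundary fallback are our assumptions; the 90% directional preference follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Radial five-class ordinal data: 1000 bivariate Gaussian points, classes
# assigned by distance to the centroid (class 0 closest, class 4 furthest).
X = rng.normal(size=(1000, 2))
r = np.linalg.norm(X - X.mean(axis=0), axis=1)
y = np.digitize(r, np.quantile(r, [0.2, 0.4, 0.6, 0.8]))

def corrupt(y, m, rate, mode, rng):
    """Corrupt a fraction `rate` of labels. 'optimistic' prefers a greater
    class with probability 0.9 (when possible), 'pessimistic' a lower one,
    'uniform' picks any other class with equal probability."""
    y = y.copy()
    for k in np.where(rng.random(len(y)) < rate)[0]:
        greater = list(range(y[k] + 1, m))
        lower = list(range(0, y[k]))
        if mode == "uniform":
            y[k] = rng.choice(greater + lower)
            continue
        go_up = rng.random() < (0.9 if mode == "optimistic" else 0.1)
        pool = greater if (go_up and greater) or not lower else lower
        y[k] = rng.choice(pool)   # boundary classes fall back to the other side
    return y

y_noisy = corrupt(y, m=5, rate=0.2, mode="pessimistic", rng=rng)
```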

3.6 Noisy data streams and non-stationary environments

In a data stream environment, data are provided as an ordered sequence of examples where, in general, the objective is to predict the class or value of new instances in the stream given some knowledge about the class membership or values of previous instances [33]. In other words, we have a sequence of instances \((\mathbf {x}_t,y_t)\), where \(\mathbf {x}_t\) represents the feature values of the instance at time t, and \(y_t\) is its (unknown) class value. Data may arrive one instance at a time (online) or in sets (batch). In the online case, only a single instance \(\mathcal {S}_t = \{(\mathbf {x}_t,y_t)\}\) is provided at each time stamp, whereas in the batch case a set \(\mathcal {S}_t = \{(\mathbf {x}_t^1,y_t^1),\ldots ,(\mathbf {x}_t^n,y_t^n)\}\) is available at each time stamp.

This streaming environment introduces several requirements for learning and has consequences for dealing with class noise. First, time and memory constraints require fast and memoryless techniques and may limit the use of some noise dealing methods. Instances may arrive at a fast pace and in general should be processed only once [42]. Therefore, noise dealing techniques that work as preprocessors should obey these constraints.

Another important aspect is dealing with the non-stationary nature of the data [26]. Evolving data streams require adaptive methods responsive to changes in the data distribution. These changes are known as drifts and may be abrupt, gradual, incremental or reoccurring [36]. In an abrupt drift, the data distribution suddenly changes (e.g., the behavior pattern of users upon the release of a new web portal). A gradual drift occurs when data described by different distributions appear together during a transition period (e.g., a new product is released, but an older version remains in the market for some time). An incremental drift occurs when there are several intermediate steps between the initial and the final distributions (e.g., the behavior of citizens as the gradual implementation of a new subway system takes place). Finally, in a reoccurring drift, a concept that occurred previously may reappear (e.g., seasonal shopping patterns). Drifts may be random or may have some predictive properties [85].
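For instance, a gradual drift can be simulated by mixing two concepts with a time-dependent probability. The sigmoid mixing below mirrors the scheme used by common stream generators; the exact parameterization here is our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def gradual_drift_stream(concept_a, concept_b, n, center, width):
    """Sketch of a gradual drift: at time t, sample from concept_b with a
    sigmoid probability that rises around `center` over roughly `width`
    steps. concept_a/concept_b are callables returning one (x, y) pair."""
    for t in range(n):
        p_b = 1.0 / (1.0 + np.exp(-4.0 * (t - center) / width))
        yield concept_b() if rng.random() < p_b else concept_a()
```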

Furthermore, the changes in data distribution can also be characterized by how the posterior probability of classes \(p(y_t|\mathbf {x}_t)\) (the decision boundaries) is affected. This distribution can be decomposed as:

$$\begin{aligned} p(y_t|\mathbf {x}_t) = \frac{p(y_t)p(\mathbf {x}_t|y_t)}{p(\mathbf {x}_t)}. \end{aligned}$$

Changes in the data distribution may occur due to changes in any of the terms [40]:

  • the prior probabilities of classes may vary (i.e., \(p(y_{t_i})\ne p(y_{t_j})\));

  • the probability of an instance with features \(\mathbf {x}\) may change over time (i.e., \(p(\mathbf {x}_{t_i})\ne p(\mathbf {x}_{t_j})\));

  • the posterior probability of classes may change (i.e., \(p(y_{t_i}|\mathbf {x}_{t_i})\ne p(y_{t_j}|\mathbf {x}_{t_j})\));

where \(t_i\) and \(t_j\) are two different time stamps. Drifts can be categorized according to whether or not they influence the decision boundaries: in a virtual concept drift, the data distribution changes without affecting the decision boundaries, whereas in a real concept drift the changes in data distribution alter the class boundaries.

Distinguishing drifts from noise is an important issue in data streams [36]. It is desirable that algorithms detect and react to drifts, but be robust to noise. There are three possible situations involving noise and data streams in the non-stationary case:

  • Stationary noise and evolving data stream In this case, the noise generation process does not change over time; only the data distribution does. For instance, in a monitoring and control application, the monitored process may present some drift, but data come from the same (noisy) sensors. Although the process that generates the noise is stationary, the “observed” noisy instances may vary over time if the noise is not completely random (i.e., if it depends on the class or input space). With this in mind, it would be interesting to investigate low (computational) cost approaches for jointly reacting to drift detection and noise identification. For instance, detection of changes in class priors [144] may be combined with ranking approaches for noise identification [71] by only modifying the noise detection threshold (see the sketch after this list). This approach can easily adapt to changes in class priors without high computational effort (no new noise model is necessary).

  • Evolving noise and stationary data stream Although the underlying data generation process is stationary, the noise pattern can change over time. This may occur, for instance, because of an increase in the noise ratio as a sensor degrades (incremental noise change), less noise introduction after quality enhancements in data processing (abrupt noise change) or a less experienced employee verifying labels while the expert is on holiday (reoccurring noise change). In this case, although the observed data distribution may change, such changes are due to variations of the noise distribution over time, not changes in the stationary data stream. Within this scenario, it would be interesting to further research handling changes in the observed data distribution at the noise level, rather than changing the predictive model. This requires noise dealing techniques that can adapt to changes in noise patterns, such as anomaly detection [14]. Another important issue is that a test-and-train approach is often assumed for data streams, where the labels of instances are provided after prediction, for performance verification and drift analysis. Since this feedback may itself carry noisy labels, it would need to be handled carefully to properly differentiate drifts from noise.

  • Evolving noise and evolving data stream This is a more challenging, yet quite common, scenario [34]. Besides changes in the data generation process, noise patterns also evolve over time. For instance, many problems involve multiple heterogeneous data streams, with the insertion of new sources and the removal of old ones [152]. In these problems, new stream sources may have different noise patterns, as well as different dynamics. In general, it is very difficult to distinguish and react to noise fluctuations versus real drifts. Quickly reacting to changes in the observed data could be a misleading choice due to overreaction to noise, while designing a noise robust system may lead to underreaction to data drifts. Both cases show a degradation in performance: the former due to lack of robustness to noise, the latter due to insufficient flexibility to handle drifts. Investigating ways to combine robustness and stability while remaining flexible and maintaining context awareness to data change is an interesting research direction [34].
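As a sketch of the low-cost idea in the first bullet (stationary noise, evolving class priors), the following adapts only a per-class noise-score threshold to a running prior estimate, without refitting any noise model. The class names, the threshold rule and the window length are illustrative assumptions, not taken from the cited works.

```python
import numpy as np
from collections import deque

class PriorAdaptiveNoiseFilter:
    """Track class priors over a sliding window and adapt a per-class
    noise-score threshold accordingly (illustrative sketch)."""

    def __init__(self, n_classes, window=1000, base_threshold=0.8):
        self.n_classes = n_classes
        self.recent = deque(maxlen=window)
        self.base = base_threshold

    def observe(self, y):
        """Record the latest label and return the running prior estimate."""
        self.recent.append(y)
        counts = np.bincount(np.fromiter(self.recent, dtype=int),
                             minlength=self.n_classes)
        return counts / counts.sum()

    def is_noisy(self, noise_score, y):
        """noise_score in [0, 1] comes from any base ranking detector
        (higher = more suspicious). Rarer classes get a stricter threshold
        so minority instances are not over-filtered as priors change."""
        priors = self.observe(y)
        threshold = self.base + (1.0 - self.base) * (1.0 - priors[y])
        return noise_score > threshold
```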

Figure 13 depicts a simulation of a data stream of 100,000 data points, with a gradual drift starting from the \(50{,}000\mathrm{th}\) instance. The stream is generated using MOA, following the procedure described in Street and Kim [118]. We added two types of noise, in the original stream concept and in the gradual drifting mechanism, from 0% (no noise) to 50%, in steps of 5%. In this figure, each plot corresponds to a different noise ratio in the main stream, and the lines in each plot correspond to different noise ratios in the drifting mechanism. The plots show the prequential error rate, over a window of 1000 instances, of an incremental Naïve Bayes classifier. Analyzing the graphs, it is clear that noise in the main data stream consistently degrades performance (the prequential error rate increases with the noise rate in the main stream). Furthermore, noise in the drifting pattern also contributes to degrading performance: the higher the noise rate related to the drift, the lower the prequential accuracy compared to a noise-free drift.
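The prequential (test-then-train) protocol used for this figure can be sketched as follows, assuming a model exposing predict and partial_fit methods; the names are our assumptions.

```python
from collections import deque

def prequential_error(stream, model, window=1000):
    """Test-then-train evaluation: predict each incoming (x, y), record the
    0/1 loss, then update the model; report the error over a sliding window."""
    losses, curve = deque(maxlen=window), []
    for x, y in stream:
        losses.append(int(model.predict(x) != y))   # test first ...
        model.partial_fit(x, y)                     # ... then train
        curve.append(sum(losses) / len(losses))     # windowed error rate
    return curve
```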

Fig. 13 Simulation of a noisy stream with a noisy drifting concept

4 Emerging topics and challenges in noisy non-binary classification

In the previous section, we presented and formalized noise occurrence in non-binary classification domains. In this section, we introduce the current open challenges for the same topics, stressing the areas that deserve more attention.

4.1 Issues in multiclass noise classification

Although some studies consider multiclass noise problems, different multiclass noise patterns impose numerous challenges, some of them infrequently addressed in the literature. Even state-of-the-art methods for dealing with binary class noise present considerable variation in performance under different multiclass noise patterns at the same noise ratio. Despite this, these issues are seldom considered in the literature. Some aspects of this topic that could be studied further include the following:

  • Would these different types of noise patterns pose the same or different challenges when dealing with multiclass noise?

  • Which one would be more difficult to tackle?

  • Which aspects of the problem would be more affected when considering different multiclass noise patterns?

  • How do existing methods behave considering these different noise patterns?

One interesting topic for further research is how to extend methods originally developed for binary classes to the multiclass case. Data transformation approaches for converting multiclass into binary problems, e.g., One-versus-One (OVO) or One-versus-All (OVA), could be applied [112]. However, research on this topic generally involves noise completely at random (NCAR) with a uniform class noise distribution. Investigating how these approaches are affected by different noise patterns is an interesting research topic. For instance, when applying a filter using an OVA decomposition, does the order in which class noise is removed matter? If so, is this influence stronger for different noise distributions among the classes?
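A minimal sketch of the OVA-based extension discussed above follows, assuming a black-box binary filter that returns a boolean mask of suspect instances; the “any” aggregation rule is one of several possible choices and is our assumption.

```python
import numpy as np

def ova_noise_filter(X, y, binary_filter):
    """Extend a binary noise filter via One-versus-All: build one binary
    problem per class and flag an instance as noisy if any of the binary
    filters flags it. `binary_filter(X, y01)` is assumed to return a
    boolean mask of suspected noisy instances."""
    flags = np.zeros(len(y), dtype=int)
    for c in np.unique(y):                 # processing order may matter, as asked above
        y01 = (y == c).astype(int)
        flags += binary_filter(X, y01)
    return flags > 0                       # "any" aggregation rule
```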

Another open-ended problem is the relationship between imbalanced classification [100] and multiclass noise. It is reported in the literature that noise in minority classes is more harmful than in majority classes [126]. However, multiclass imbalance [128] has further issues to consider, as multiple predominant or infrequent classes may occur. It is unclear what learning difficulties multiclass noise can cause under highly imbalanced class distributions, and how to handle them effectively is an open-ended issue. Furthermore, different noise patterns can change the observed class ratio, which may influence the behavior of class imbalance techniques. Uniform class noise, for instance, may mask the observed class ratio of multiple rare classes even at low noise levels. Default class noise may also introduce an artificial predominant class, thus generating an artificially imbalanced problem. Possible ways to handle noise in imbalanced problems include cost-sensitive noise handling [148, 151] and the development of class-ratio-aware filtering approaches [114] in the multiclass context.

4.2 Issues in noisy multilabel classification

There are many issues to consider when dealing with multilabel noise which are not properly addressed by data transformation. A first consideration is that an instance may be “partially noisy,” in the sense that the presence or absence of some of its labels may be due to noise while other labels are correct. In this context, filter-based techniques need to balance the trade-offs between partial noisy/non-noisy label identification for the instance. On the one hand, completely filtering out a partially noisy instance causes the correct labels of that instance to be lost. On the other hand, keeping an instance with partially noisy labels might negatively impact the correct classification of those labels. A binary relevance-like transformation could be used to deal with label noise individually, as sketched below. However, this approach would restrict the application to algorithms based on binary relevance and ignores label correlations.
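Such a binary relevance-like treatment can flag individual (instance, label) pairs instead of whole instances. The following is a sketch under the same black-box binary filter assumption as before, and it deliberately exhibits the label-correlation blindness noted above.

```python
import numpy as np

def label_wise_noise_flags(X, Y, binary_filter):
    """Run a binary noise filter per label column of the n x q label matrix
    Y and flag (instance, label) pairs rather than whole instances, so an
    instance with one suspicious label keeps its remaining, presumably
    correct, labels. `binary_filter(X, y01)` is an assumed black box
    returning a boolean mask over instances."""
    n, q = Y.shape
    flags = np.zeros((n, q), dtype=bool)
    for j in range(q):
        flags[:, j] = binary_filter(X, Y[:, j])
    return flags   # caller may repair or mask flagged labels, not drop rows
```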

Several research questions regarding noisy multilabel learning and label dependence need to be further investigated:

  • Is the performance of multilabel classifiers designed to incorporate known label dependence affected by the presence of noise?

  • Could noise considerably affect approaches that infer label dependencies from data?

  • If the artificial noise generation process does not consider label correlation, are the models unbiased with respect to the dependencies present in data without the artificial noise?

  • What happens if the correlations among labels in the true data generation process are different from the correlations among noise labeling?

  • Could label dependence be used to develop new techniques for dealing with multilabel noise efficiently?

Another topic that could be further studied is the influence of noisy instances in label distribution learning [39, 43, 132]. Label distribution learning is an emerging paradigm which includes both single-label and multilabel learning as special cases and aims to learn a distribution over the possible labels representing the degree to which each label describes the instance. Few studies consider noisy labeled instances in label distribution learning. Chen and Kämäräinen [17] exploit the reliability of labeled images to incorporate noise information into label distribution learning for age estimation. In He et al. [49], actual labels are replaced by an estimated label distribution to handle label ambiguity. The problem of learning with incomplete supervised information in the context of multilabel learning was studied in Xu and Zhou [135].

4.3 Issues in multitask problems

Although it is possible to deal with noise in each task individually, multitask applications present interesting topics for further research regarding noise handling across related tasks. How noise affects multitask approaches, especially if the individual tasks have different noise patterns, is an interesting research topic. Studies regarding inter-task noise patterns are also of interest: in [106], structural covariance noise among regression tasks is investigated, and it would be interesting to develop related studies for multitask classification problems.

Developing multitask approaches to deal with noise is also interesting. Methods that work as data preprocessors, such as filters or noise reparation techniques, could be enhanced by evaluating noisy instances across tasks together. Another possibility is to investigate ways to transfer/adapt knowledge for dealing with noise from one domain to another. Xiao et al. [131] proposed an approach based on convolutional neural networks for transferring noise information from a small, manually cleaned noise-free data set to a large noisy training set in the context of image annotation. Similar approaches and applications in other domains are interesting topics for further research.

4.4 Issues in multi-instance learning

While multi-instance learning embraces the concept of negative examples within a bag at model creation time, it would be interesting to develop or extend noise filtering techniques that work as preprocessors, in order to evaluate whether preprocessing noisy instances within each bag improves the performance of multi-instance learning algorithms. For instance, filters that remove instances with a high likelihood of being negative from positive bags may help the learning phase.
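One such filter could score instances in positive bags against a model of negative instances and prune the most negative-looking ones. The following is a hypothetical sketch: the scoring model, the keep ratio and all names are our assumptions.

```python
import numpy as np

def prune_positive_bags(pos_bags, neg_bags, fit_negative_model, keep_ratio=0.5):
    """Hypothetical within-bag filter: fit a model of negative instances,
    score instances inside each positive bag, and drop those most likely to
    be negative, keeping at least one instance per bag.
    `fit_negative_model(X)` is an assumed callable returning a scoring
    function score(X) -> array, higher meaning more "negative"-looking."""
    score = fit_negative_model(np.vstack(neg_bags))
    pruned = []
    for bag in pos_bags:
        order = np.argsort(score(bag))                    # least negative first
        keep = max(1, int(np.ceil(keep_ratio * len(bag))))
        pruned.append(bag[np.sort(order[:keep])])         # preserve row order
    return pruned
```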

Although such approaches can deal with noisy instances within each bag, dealing with bags with noisy labels can be more difficult. As positive bags may contain both positive and negative instances, and the individual labels or the ratio of positive/negative instances within a bag are unknown, dealing with noisy bags can be very challenging [77], especially if instance correlation inside the bag influences noise occurrence [21]. A bag filtering algorithm, for instance, should evaluate all instances within a bag before deciding whether it should be removed. Other approaches may include the conversion of bags into single instances [52], as sketched below. A possible drawback of this approach is data scarcity, as the size of the data set is reduced to the number of bags.
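The bag-to-single-instance conversion mentioned above can be as simple as averaging, in the spirit of SimpleMI; note how the resulting sample size equals the number of bags, which is the data-scarcity drawback just discussed.

```python
import numpy as np

def simple_mi(bags, bag_labels):
    """Represent each bag by the mean of its instances, yielding one
    single-instance example per bag (SimpleMI-style conversion)."""
    X = np.vstack([bag.mean(axis=0) for bag in bags])   # one row per bag
    y = np.asarray(bag_labels)
    return X, y
```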

4.5 Issues in ordinal classes

Dealing with noise in ordinal classification has mostly been explored in monotonic classification, where monotonicity violations are particularly important, as some monotonic classification learning algorithms require a completely monotone training set [47]. Relabeling noisy monotonic instances is a technique commonly used in the literature to deal with monotonic noise [22, 29, 105]. However, relabeling may not be a good option for some data sets, and other noise dealing techniques could be considered. At this point, though, there is no other preprocessing technique available for dealing with noisy instances in monotonic or ordinal classification, which constitutes an important open issue.

4.6 Data stream and non-stationary environment issues

Some approaches have been used to deal with noise in data streams. For instance, ensemble learning has been used to improve robustness to class noise in data streams [85, 142]. Other approaches use one-class classifiers [59] and instance selection/active learning [74, 153] to identify possible outliers in data streams. A warning state has also been used to increase robustness to noise [9]: in this state, instances that deviate from the data distribution are first tagged as noise, and drift adaptation only takes place if the warning state changes to out-of-control due to an increase in such instances. However, these approaches often assume that only the data stream evolves while the noise is stationary, and further studies involving evolving noise patterns are welcome.

Another topic for further research relates to evolution in the input and class sets. Some applications involve the creation of new features [79] or the appearance of new classes [78] as the stream flows. New features may be created to differentiate or be selected from data chunks [46, 80], with implications for noise patterns that depend on the input space. In some stream applications, the set of possible classes is not fixed, and new classes may appear [122]. Distinguishing between a new class and noise requires more research, especially in evolving noise scenarios.

Multiclass [23] and multilabel [102] data streams inherit the issues of their stationary counterparts, now compounded by possibly non-stationary class noise. For instance, in multiclass stream settings, better preprocessing techniques may be needed to reduce noise correlation between blocks of classes, while in multilabel streams label dependencies may vary over time.

5 Concluding remarks

An important issue in classification is learning from data with class noise, as noisy classes may decrease predictive performance and hamper model induction. This problem has been addressed in the literature from both theoretical and practical points of view, and many different techniques to address class noise have been proposed, including noise robust/tolerant techniques and data cleaning methods.

However, most noise dealing techniques have mainly been developed and validated on binary class problems. We have shown that, even in simple synthetically generated multiclass problems, the performance of some state-of-the-art noise treatment methods varies considerably across different noise patterns and skewed class distributions. We stress that these issues have not been properly discussed in the literature.

This paper also considers the class noise problem in the context of nonstandard classification problems. We examine class noise in multiclass, multilabel, multitask, multi-instance, ordinal and data stream classification problems. Although research involving class noise has been conducted within these frameworks, they have received considerably less attention than class noise handling in binary classification.

For each of these classification problems, we analyze current trends and point out open-ended questions and future research directions. We also consider particular points of interest for artificial noise modeling in these classification scenarios. These are important for avoiding pitfalls and for designing simulated artificial noise experiments with scenarios that are closer to real applications and allow more contrastive comparisons.

We believe that this paper can serve as a motivation to expand class noise research beyond binary classification, by pointing out new topics for developing techniques and studies that consider the challenges and particularities of non-binary classification problems.

As a final remark, it is important to stress that even though there are more studies considering binary class noise models than non-binary ones, many challenges in binary class noise remain unsolved and require further investigation. For instance, it is known that not all noisy instances affect learning algorithms equally [88]. Parameter tuning [138] and classifier evaluation [109] under class noise are difficult tasks to perform. With the objective of noise cleaning in mind, further studies on the use of hybrid methods to reduce incorrect filtering, in addition to noise filtering and noise cleaning approaches, are necessary.