1 Introduction

Class imbalance is a noteworthy characteristic of data obtained from several real-world domains. Naturally occurring biases in the real world give rise to varying numbers of points in different classes of a dataset. Multi-label datasets, mostly obtained from real-world sources (Li et al. 2014; Katakis et al. 2008), are no exception. In a multi-label dataset, an instance is associated with more than one possible label. Let \(\mathcal {D}\) be a multi-label dataset with \(\mathcal {L}\) labels, \(\mathcal {D}=\{(\textbf{x}_{i},\textbf{Y}_{i}), 1\le i\le n\}\), where \(\textbf{x}_{i}\) denotes a feature vector and \(\textbf{Y}_{i}\) denotes its membership over the \(\mathcal {L}\) labels. \(\textbf{Y}_{i}=\{y_{i1},y_{i2},\ldots , y_{i\mathcal {L}}\}\), and each \(y_{ij}\) is either 0 (negative class) or 1 (positive class). The task is to correctly predict the class (0 or 1) of a test instance for each of the \(\mathcal {L}\) labels. In a two-class dataset, we term the class with the higher number of instances the majority class and the class with the lower number of instances the minority class. In the yeast dataset (Elisseeff and Weston 2001), the imbalance ratio (the ratio of the majority set cardinality to the minority set cardinality) is greater than 1.5 for 12 out of 14 labels. Alternatively, for 12 out of 14 labels in the yeast dataset, one class has at least 50% more points than the other. It is also observed that different labels of a multi-label dataset possess differing degrees of imbalance. This aspect further complicates the issue and calls for dedicated, label-specific handling of class imbalance in a multi-label context.
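As a concrete illustration, the per-label imbalance ratios can be computed directly from a binary label matrix. The following minimal Python sketch is ours (the random label matrix and the 1.5 threshold are only illustrative) and is not part of the original study.

```python
import numpy as np

def per_label_imbalance_ratios(Y):
    """Y: (n_instances, n_labels) binary label matrix.
    Returns the majority-to-minority cardinality ratio for each label."""
    Y = np.asarray(Y)
    pos = Y.sum(axis=0)                 # class-1 (positive) count per label
    neg = Y.shape[0] - pos              # class-0 (negative) count per label
    majority = np.maximum(pos, neg)
    minority = np.minimum(pos, neg)
    return majority / np.maximum(minority, 1)   # guard against an empty minority class

# Hypothetical usage: count labels whose imbalance ratio exceeds 1.5.
rng = np.random.default_rng(0)
Y_toy = (rng.random((2417, 14)) < 0.25).astype(int)   # stand-in label matrix
ratios = per_label_imbalance_ratios(Y_toy)
print(int((ratios > 1.5).sum()), "of", ratios.size, "labels have an imbalance ratio > 1.5")
```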

Data preprocessing is a popular technique for handling the class imbalance of datasets. It aims to reduce the difference in cardinalities of the classes in a dataset by (1) removing points from the majority class (undersampling the majority class) or (2) adding synthetic points to the minority class (oversampling the minority class). This mitigates the bias towards the majority class in the classifier modeling phase and helps detect minority instances. In undersampling, points are removed from the majority class to reduce the difference between the majority and minority class cardinalities. It also reduces the overall training data volume, thereby reducing the computation of classifier modeling. Undersampling is a convenient option for multi-label datasets, as they are typically large in both the number of points and the number of features. We should also remember that the positive and negative class memberships vary across the labels of a multi-label dataset. Even though the feature vectors reside in the same locations of the feature space for all labels, their changing memberships lead to different majority and minority point configurations.

In this work, we propose a natural neighborhood-based undersampling scheme (NaNUML) to deal with the class imbalance of multi-label datasets. Due to the disparate ranges of imbalance ratios and the diverse distributions of majority and minority points across the labels, we resort to label-specific undersampling. We look at the mutual co-locations of the majority and the minority points within a neighborhood to find the majority candidates to be undersampled. Our principal aim is to find and remove the majority points that overlap with many minority points. Removing the majority points from the overlapped space improves the recognition of the minority points in those regions.

To find the majority points overlapping the minority spaces, we employ the technique of the natural nearest neighborhood (Zhu et al. 2016). Two points \(\textbf{p}\) and \(\textbf{q}\) are natural neighbors of each other if (1) \(\textbf{p}\) is a k-nearest neighbor of \(\textbf{q}\) and (2) \(\textbf{q}\) is a k-nearest neighbor of \(\textbf{p}\). Unlike neighbor identification via directional, one-sided nearness (as in the k-nearest neighborhood), natural neighbors are computed based on the mutual nearness of two points (hence, the relation is commutative). The relative nearness of two points (relative to their neighborhoods) is instrumental in establishing the neighborhood relation. The mutual nearness protocol of natural neighborhoods aids in the efficient identification of majority and minority class overlaps. The other significant advantage of the natural neighborhood scheme is that the neighborhood size ‘k’ is computed without human intervention or a parameter optimization phase. This characteristic is helpful in any machine learning context, and our scheme enjoys this advantage. In NaNUML, a single natural nearest neighbor search is sufficient to compute the label-specific natural neighbor information for all labels.

For each label, we compute the minority natural neighbor count of the majority points. A high minority neighbor count for a majority point indicates an increased overlap with the minority space (as well as with the minority points). Hence, the majority points with higher minority natural neighbor counts are potential candidates for undersampling. Accordingly, we remove the majority points in decreasing order of their minority neighbor counts; the majority point with the highest minority neighbor count is removed first. The undersampled majority set and the original minority set form the augmented training set, which is used to learn a set of label-specific classifiers.

The major highlights of our work are as follows:

  • We undersample the label-specific majority points to obtain an augmented yet reduced training set for each label.

  • We employ a parameter-optimization-free technique to compute the neighbors of the points. The computation of the neighbors is based on a mutual nearness calculation, which helps in an enhanced identification of the majority–minority overlaps.

  • This is the first work to introduce the paradigm of natural neighborhoods in multi-label learning.

  • While undersampling the majority class, we also preserve the key lattice points of the majority class by preserving (and not allowing the undersampling of) the majority points (top 10%) with the highest majority natural neighbor count.

  • The natural neighborhood search is not label-dependent and depends on the distribution of the points in the feature space. Hence, only one natural neighbor search is required (for all labels).

  • The outcomes from an experimental study involving twelve real-world multi-label datasets, seven competing methods (multi-label learners and generic class-imbalance focused learning paradigms), and four evaluating metrics indicate the proposed method’s competence over other competing learners.

2 Related works

This work focuses on the class-imbalance aspect of multi-label learning. Accordingly, our review of existing works covers both aspects: (1) class-imbalance learning and (2) multi-label learning in general.

Several diversified approaches are followed in the domain of class-imbalance learning to mitigate the bias towards the majority class (He and Garcia 2009). Algorithm-based methods are among the earliest in this field. These methods mostly function in one of two ways: (1) by shifting the decision boundary away from the minority class to add more region in its favor, or (2) by employing a cost-sensitive learning framework in which the misclassification of minority instances incurs a higher penalty. Other approaches, such as kernel-based methods, multi-objective optimization methods, and ensemble-based learners, also pursue the same goal.

Data preprocessing is a popular technique for handling the class-imbalance problem (Ali et al. 2019). Here, the schemes aim to balance the cardinalities of the majority and the minority classes. This can be done in the following ways: (1) undersampling, or removing points from the majority class (Pereira et al. 2020; Tahir et al. 2012), (2) oversampling, or adding synthetic points to the minority class (Charte et al. 2015; Chawla et al. 2002), and (3) hybrid sampling, where both undersampling and oversampling are involved (Choirunnisa and Lianto 2018; Ludera 2021). The data sampling step occurs before the classification step, and classifier modeling is done on the augmented data obtained through preprocessing.

Research attention to multi-label learning dates back to the beginning of this century (Joachims 1998; Godbole and Sarawagi 2004). The community's ongoing efforts have produced several ways of handling the task (Moyano et al. 2018).

Multi-label methods are principally classified into (1) problem transformation approaches, in which several classifiers are modeled and learned to facilitate overall multi-label learning of the data at various levels of label association (these are further classified into first-order, second-order, and higher-order approaches according to the degree of label association in the classifiers (Zhang and Wu 2015; Sadhukhan and Palit 2020; Tsoumakas et al. 2011; Fürnkranz et al. 2008)), and (2) algorithm adaptation approaches, which tweak an existing classifier, such as a Support Vector Machine, a nearest-neighborhood-based classifier, or a random forest, to accommodate multi-label learning (Gonzalez-Lopez et al. 2018; Nam et al. 2014; Liu et al. 2018; Siblini et al. 2018).

Researchers in multi-label learning were quick to notice the issue of class imbalance in multi-label datasets (Tahir et al. 2012). We should note that handling the class-imbalance issue in multi-label datasets is considerably more intricate than in traditional single-label datasets. The principal causes are (1) the multi-output nature, where the degree of imbalance in each output varies from the others, and (2) a set of imbalance ratios, one for each label. Data preprocessing, being a popular choice, has been explored in multi-label contexts. MLeNN (Charte et al. 2014) uses the Edited Nearest Neighbor rule to undersample majority points whose label sets are similar to those of their neighbors in a multi-label dataset. ML-RUS undersamples instances associated with the majority classes of a multi-label dataset, while ML-ROS clones (replicates) examples with minority labels, to facilitate better learning of imbalanced multi-label datasets (Pereira et al. 2020). ML-SMOTE resorts to oversampling of the minority classes to balance the cardinalities of the majority and the minority classes of the labels (Charte et al. 2015). Liu and Tsoumakas (2020) couple majority class undersampling with ensembles of classifier chains to tackle the class imbalance issue. COCOA (Zhang et al. 2020) presents a scheme in which the asymmetric distribution of classes and the pairwise label correlations are considered, and a three-way learner is produced for each pair of labels. Daniels and Metaxas (2017) exploit Hellinger forests to design an imbalance-aware multi-label classifier. In LIIML (Sadhukhan and Palit 2019), an imbalance-informed label-specific feature set is constructed for the labels, followed by a cost-sensitive learning scheme to learn the multi-label datasets. In the next section, we briefly describe the intuition and working principles of the natural nearest neighborhood.

3 Principles of natural nearest neighborhood

Let us have a set of points \(\textbf{x}_{1}, \textbf{x}_{2},\ldots ,\textbf{x}_{n}\), and suppose we want to find the natural neighbors of \(\textbf{x}_{i}\) from the given search space (excluding itself). For some \(k=\alpha \) \((\alpha \ge 1)\), we say that \(\textbf{x}_{i}\) is a natural neighbor of \(\textbf{x}_{j}\) (at \(k=\alpha \)) if \(\textbf{x}_{i}\) is an \(\alpha \)-nearest neighbor of \(\textbf{x}_{j}\) and \(\textbf{x}_{j}\) is also an \(\alpha \)-nearest neighbor of \(\textbf{x}_{i}\) (Zhu et al. 2016). Let \(NN(\textbf{x}_{i})\) denote the set of natural neighbors of \(\textbf{x}_{i}\) and \({KNN}_{\alpha }(\textbf{x}_{i})\) denote the set of \(\alpha \)-nearest neighbors of \(\textbf{x}_{i}\). Then \( \textbf{x}_{j}\in NN(\textbf{x}_{i}) \Longleftrightarrow (\textbf{x}_{i}\in {KNN}_{\alpha }(\textbf{x}_{j}))\cap (\textbf{x}_{j}\in {KNN}_{\alpha }(\textbf{x}_{i}))\).

The authors of Zhu et al. (2016) also describe the procedure for selecting the natural neighbor eigenvalue \(\lambda \) (the neighborhood size). In a dataset, we note the minimum k-value at which every point has at least one natural neighbor. Let this critical k-value be \(\beta \). The natural neighbor eigenvalue \(\lambda \) is computed from \(\beta \). According to the authors,

$$\begin{aligned} \lambda =\sqrt{\beta } \end{aligned}$$

Unlike the k-nearest neighborhood search or the reverse nearest neighborhood search, the natural neighborhood search retrieves a symmetric neighborhood configuration of a dataset. We can identify the true majority and minority class overlaps via this symmetric (hand-shake) neighborhood configuration. In NaNUML, the natural neighbor eigenvalue of each dataset is computed and used in the subsequent stages for undersampling the majority class. The proposed approach is described in the next section.
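A minimal sketch of this search is given below, assuming a NumPy feature matrix X and scikit-learn's NearestNeighbors. The function name and the k_max cap are our own choices; the code simply follows the description above (grow k until every point has at least one mutual neighbor, record that k as \(\beta \), and set \(\lambda =\sqrt{\beta }\)) and is not a reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X, k_max=50):
    """Return (lambda_, neighbor_sets): the natural neighbor eigenvalue and,
    for each point, its set of mutual (natural) neighbors at k = lambda_."""
    n = X.shape[0]
    cap = min(k_max, n - 1)
    nn = NearestNeighbors(n_neighbors=cap + 1).fit(X)
    # knn[i, 1:k+1] are the k nearest neighbors of point i
    # (column 0 is the point itself, assuming distinct rows).
    knn = nn.kneighbors(X, return_distance=False)

    def mutual_sets(k):
        kth = [set(knn[i, 1:k + 1]) for i in range(n)]
        return [{j for j in kth[i] if i in kth[j]} for i in range(n)]

    beta = None
    for k in range(1, cap + 1):
        if all(len(s) > 0 for s in mutual_sets(k)):
            beta = k                       # smallest k giving every point a natural neighbor
            break
    if beta is None:
        beta = cap                         # fall back if the search cap is reached
    lambda_ = max(1, int(round(np.sqrt(beta))))   # eigenvalue: lambda = sqrt(beta)
    return lambda_, mutual_sets(lambda_)
```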

4 NaNUML approach

Let a multi-label dataset be denoted \(\mathcal {D}\), and the number of labels be \(\mathcal {L}\).

Algorithm 1 NaNUML

\(\mathcal {D}=\{(\textbf{x}_{i},\textbf{y}_{i}), 1\le i\le n\}\), where \(\textbf{x}_{i}\) denotes the ith feature vector and \(\textbf{y}_{i}\) denotes its class information corresponding to the \(\mathcal {L}\) labels. \(\textbf{y}_{i}=\{y_{i1},y_{i2},\ldots , y_{i\mathcal {L}}\}\), and each \(y_{ij}\) is either 0 (negative membership) or 1 (positive membership). For example, \(y_{14}=1\) signifies that \(\textbf{x}_{1}\) belongs to the positive class of the 4th label. Our primary task is to predict the correct membership of the test points for all the labels.

  1. Finding the natural neighbors of points in \(\mathcal {D}\): Following the natural neighbor principles, we find the natural neighbors of all points in \(\mathcal {D}\) for \(k=\lambda \) (where \(\lambda \) is the natural neighbor eigenvalue). \(\lambda \) is specific to a dataset. Let \(\mathcal {N}(\textbf{x}_{i})\) be the natural neighbor set of \(\textbf{x}_{i}\).

    $$\begin{aligned} \mathcal {N}(\textbf{x}_{i})=\{\textbf{x}_{j}; (\textbf{x}_{i}\in KNN_{\lambda }(\textbf{x}_{j}))\cap (\textbf{x}_{j}\in KNN_{\lambda }(\textbf{x}_{i}))\},\quad i=1,2,\ldots ,n \end{aligned}$$
    (1)

    This step is common for all labels as the labels share the same feature points.

  2. Imbalance ratios of the labels and the number of points to be removed: For each label, the points belonging to the positive and negative classes are segregated into two mutually exclusive sets. In a multi-label dataset, the positive class usually qualifies as the minority class, and the negative class becomes the majority class. Class inversion can indeed occur, where the negative and positive classes exchange their roles; however, for clarity and consistency, we denote the positive and negative classes as the minority and majority classes, respectively. Let \(\mathcal {D}_{M(j)}\) and \(\mathcal {D}_{m(j)}\) be the majority and the minority classes of label j, respectively.

    $$\begin{aligned} \begin{aligned} \mathcal {D}_{M(j)}=\{\textbf{x}_{i};1\le i\le n\text { and } y_{ij}=0\}\\ \mathcal {D}_{m(j)}=\{\textbf{x}_{i};1\le i\le n\text { and } y_{ij}=1\}\\ \mathcal {D}=\mathcal {D}_{M(j)}\cup \mathcal {D}_{m(j)} \end{aligned} \end{aligned}$$
    (2)

    For each label, we compute the cardinality of the undersampled set from the difference between the cardinalities of the majority and the minority classes. Let \(u_{j}\) be the number of points to be removed from \(\mathcal {D}_{M(j)}\). Let \(\alpha \) be a number such that \(0 < \alpha \le 1\).

    $$\begin{aligned} u_{j}=\textit{max}(\alpha \times (|\mathcal {D}_{M(j)}|-|\mathcal {D}_{m(j)}|),0), \quad j=1,2,\ldots ,\mathcal {L} \end{aligned}$$
    (3)

    \(\alpha \) allows us to choose the number of points to be removed from the majority point set. When \(\alpha =1\), we equate the cardinality of the undersampled majority point set with that of the minority point set. After the undersampling, the difference in cardinalities between the undersampled majority class and the minority class equals \((1-\alpha )\) times the original difference between the two sets. Note that when there is an inversion of the positive and the negative class for a label [the majority class (class 0) has fewer points than the minority class (class 1)], \((|\mathcal {D}_{M(j)}|-|\mathcal {D}_{m(j)}|)\) is negative and \(u_{j}\) is 0; we do not remove any point for that label.

  3. Finding the majority points to be undersampled for each label and generating the augmented dataset: For each label, we find the natural neighbor counts of the majority points. The majority point set and the minority point set vary across the labels depending on the label-specific memberships of the points. We segregate each count into two mutually exclusive counts: (1) the majority natural neighbor count and (2) the minority natural neighbor count. Let \({count}_{M(i)(j)}\) and \({count}_{m(i)(j)}\) denote the majority natural neighbor count and the minority natural neighbor count, respectively, of an instance \(\textbf{x}_{i}\) for label j.

    $$\begin{aligned} \begin{aligned} {count}_{M(i)(j)}&=|\{\textbf{x}_{k}: (\textbf{x}_{k} \in \mathcal {N}(\textbf{x}_{i})) \text { and } (\textbf{x}_{k}\in \mathcal {D}_{M(j)}) \text { and } (\textbf{x}_{i}\in \mathcal {D}_{M(j)})\}| \\ {count}_{m(i)(j)}&=|\{\textbf{x}_{k}: (\textbf{x}_{k} \in \mathcal {N}(\textbf{x}_{i})) \text { and } (\textbf{x}_{k}\in \mathcal {D}_{m(j)}) \text { and } (\textbf{x}_{i}\in \mathcal {D}_{M(j)})\}| \end{aligned} \end{aligned}$$
    (4)
    • Finding the label-specific majority points that are key structural components and preserving them from undersampling: We explore the majority natural neighbor counts to find the key structural points of the majority set. The points with the highest majority natural neighbor counts are selected as the key structural points, and the top 10% of the majority points are kept out of the undersampling in the next phase (even if their minority counts are high).

    • Finding the majority points to be removed from the remaining set of points: For a label j, we look at the minority natural neighbor counts of the remaining majority points. The majority point with the highest minority natural neighbor count is removed (undersampled) first from the majority set. This undersampling procedure continues (in decreasing order of the minority natural neighbor counts of the majority points) until \(u_{j}\) points are removed. A majority point lying in a region where the majority and minority classes overlap will have a high minority natural neighbor count and is a good candidate for removal.

    Let \(\mathcal {U}_{(j)}\) be the set of points removed from the majority set \(\mathcal {D}_{M(j)}\). The undersampled majority set for label j, \(\mathcal{U}\mathcal{M}_{(j)}\), is obtained by taking the difference of \(\mathcal {U}_{(j)}\) from \(\mathcal {D}_{M(j)}\).

    $$\begin{aligned} \mathcal{U}\mathcal{M}_{(j)}=\mathcal {D}_{M(j)}{\setminus }\mathcal {U}_{(j)}, \quad j=1,2,\ldots ,\mathcal {L} \end{aligned}$$
    (5)

    The undersampled training set for label j, \(\mathcal{U}\mathcal{D}_{(j)}\) is obtained by taking the union of \(\mathcal{U}\mathcal{M}_{(j)}\) and \(\mathcal {D}_{m(j)}\).

    $$\begin{aligned} \mathcal{U}\mathcal{D}_{(j)}=\mathcal{U}\mathcal{M}_{(j)}\cup \mathcal {D}_{m(j)}, \quad j=1,2,\ldots ,\mathcal {L} \end{aligned}$$
    (6)

    \(\mathcal{U}\mathcal{D}_{(j)}\) is used to train the label-specific classifier for label j, and this classifier is subsequently used to make the predictions for label j. A code sketch of this label-specific procedure is given below.
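The following minimal Python sketch illustrates the per-label undersampling step, reusing the natural neighbor sets sketched in Sect. 3. The function and variable names (undersample_label, preserve_frac, etc.) are ours, and the code is an illustration of the steps above rather than the reference implementation.

```python
import numpy as np

def undersample_label(Y, j, neighbor_sets, alpha=0.5, preserve_frac=0.10):
    """Return the instance indices kept for label j after NaNUML-style undersampling.
    Y: (n, L) binary label matrix; neighbor_sets[i]: natural neighbors of point i."""
    n = Y.shape[0]
    maj = np.where(Y[:, j] == 0)[0]           # majority (negative) class of label j
    mino = np.where(Y[:, j] == 1)[0]          # minority (positive) class of label j
    u_j = max(int(alpha * (len(maj) - len(mino))), 0)        # Eq. (3)

    maj_set, min_set = set(maj), set(mino)
    maj_count = {i: sum(k in maj_set for k in neighbor_sets[i]) for i in maj}
    min_count = {i: sum(k in min_set for k in neighbor_sets[i]) for i in maj}

    # Preserve the top 10% of majority points by majority natural neighbor count.
    n_preserve = int(np.ceil(preserve_frac * len(maj)))
    preserved = set(sorted(maj, key=lambda i: -maj_count[i])[:n_preserve])

    # Among the remaining majority points, remove those with the highest minority counts.
    removable = [i for i in maj if i not in preserved]
    removable.sort(key=lambda i: -min_count[i])
    removed = set(removable[:u_j])

    kept = [i for i in range(n) if i in min_set or (i in maj_set and i not in removed)]
    return kept          # indices of the augmented training set for label j
```

Each label-specific classifier is then fitted on the rows indexed by kept (features X[kept], targets Y[kept, j]).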

Remarks In this work, we suggest preserving \(10\%\) of the majority points as the key structural components of the majority class. In datasets with an imbalance ratio \(r>10\), this imposes an upper limit on \(\alpha \):

$$\begin{aligned} \alpha =\frac{0.9r}{r-1} \end{aligned}$$
(7)

Consequently, it is not possible to equate the cardinalities of the minority and the undersampled majority classes when \(r>10\). Our experimental exploration of \(\alpha \) shows that this is a fair trade-off: removing too many majority points can distort the majority class. If it is of utmost necessity to balance the cardinalities of the majority and minority classes, this has to be done by lessening the degree of preservation.
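For completeness, the bound in Eq. (7) follows from the preservation constraint, since at most \(90\%\) of the majority points may be removed:

$$\begin{aligned} u_{j}=\alpha \,(|\mathcal {D}_{M(j)}|-|\mathcal {D}_{m(j)}|)\le 0.9\,|\mathcal {D}_{M(j)}| \;\Longrightarrow \; \alpha \le \frac{0.9\,|\mathcal {D}_{M(j)}|}{|\mathcal {D}_{M(j)}|-|\mathcal {D}_{m(j)}|}=\frac{0.9\,r}{r-1} \end{aligned}$$

This upper limit drops below 1 exactly when \(r>10\); for instance, \(r=20\) caps \(\alpha \) at \(0.9\times 20/19\approx 0.95\).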

In order, we present the Experimental Setup, Results and Discussion, and Conclusion in the following three sections.

5 Experimental setup

  • Datasets We have performed the experiments on 12 real-world multi-label datasets listed in Table 1.Footnote 1 Here, instances, inputs, and labels indicate the number of instances, the number of features, and the number of labels, respectively, in each dataset. Type indicates the nominal or numeric nature of the features. The number of unique label combinations present in a dataset is indicated by Distinct label sets. Cardinality is the average number of labels per instance, and Density is Cardinality divided by the number of labels. We have pre-processed the datasets according to the recommendations in Zhang et al. (2020) and He and Garcia (2009). Labels having a very high degree of imbalance (an imbalance ratio of 50 or greater) or having too few positive samples (fewer than 20 in this case) are removed. For the text datasets (medical, enron, rcv1-s1, rcv1-s2), only the input-space features with high document frequencies are retained.

    Table 1 Description of datasets
  • Comparing algorithms Seven schemes, comprising (1) six multi-label learning schemes and (2) one generic class-imbalance-focused learner, are employed in the empirical study. The multi-label learners involved in the study are COCOA (Zhang et al. 2020), THRESHL (Pillai et al. 2013), IRUS (Tahir et al. 2012), CLR (Fürnkranz et al. 2008), RAKEL (Tsoumakas et al. 2011) and ECC (Read et al. 2011). In COCOA, several imbalance-focused multi-class learners are implemented on the Weka platform using the J48 decision tree with undersampling, where the number of coupling class labels is set as K = min(\(\mathcal {L}-1, 10\)). IRUS is a label-specific undersampling scheme like the proposed method, NaNUML, where \(\mathcal {L}\) classifiers are trained, one for each label; each label-specific classifier is trained using label-specific undersampled training data. IRUS is an ensemble method, and the random undersampling is repeated several times to produce a classifier ensemble. THRESHL also learns in a label-specific setting with one classifier for each label; its scheme is to maximize the F-scores in a hold-out setting to find the classification threshold. CLR is a second-order learning scheme that exploits pairwise label correlations. In ECC, the classification output of a label is used as an input feature for predicting the succeeding labels, thereby involving the correlations of the labels. RAKEL is a higher-order learning approach where a set of overlapping and non-overlapping subsets of labels is considered, and multi-class classifiers are learned on the power sets of these label subsets. RML (Tahir et al. 2012) is the generic class-imbalance learner used in the comparative study; in RML, the macro-averaging F measure is used as the optimization metric while modeling the classifier. In IRUS, the C4.5 decision tree is used as the base learner. In RAKEL, the recommended settings of \(k = 3\) and the number of subsets \(m = 2q\) are employed. In ECC, an ensemble size of 100 is chosen. In CLR, a synthetic label is used to differentiate between the relevant and the irrelevant labels. In NaNUML, we have used a Support Vector Machine classifier with a linear kernel, and the regularization parameter is set to 1.

  • Evaluating metrics Four multi-label domain-specific metrics, namely macro-averaging \(\text {F}_{1}\), macro-averaging AUC, average precision, and ranking loss, are used to evaluate the performance of the proposed and the competing methods. They are briefly described as follows; a computational sketch is provided at the end of this list:

    • Macro-averaging \(F_{1}\): it is the average of all the label-specific \(F_{1}\) scores. Let \(F_{1j}\) be the \(F_{1}\) score for label j. The higher the macro averaging \(F_{1}\) score, the better the performance.

      $$\begin{aligned} \text {Macro } F_{1}=\frac{1}{\mathcal {L}}\sum _{j=1}^{\mathcal {L}} F_{1j} \end{aligned}$$
      (8)
    • Macro-averaging AUC: it is the average of the label-specific AUC scores over the \(\mathcal {L}\) labels. Let \({AUC}_{j}\) be the AUC score for label j. The higher the macro-averaging AUC score, the better the learner’s performance.

      $$\begin{aligned} \text {Macro } AUC=\frac{1}{\mathcal {L}}\sum _{j=1}^{\mathcal {L}} AUC_{j} \end{aligned}$$
      (9)
    • Average precision: average precision evaluates the average fraction of relevant labels ranked higher than a particular relevant label. It is desirable that, for each instance, the relevant labels are predicted with higher scores (greater confidence) than the irrelevant (absent) ones. Let \(\mathcal {R}(\textbf{x}_{i},l_k)=\{l_j \mid rank(\textbf{x}_{i},l_j)\le rank(\textbf{x}_{i},l_k), l_j \in \textbf{Y}_{i} \}\).

      $$\begin{aligned} \text {Average Precision} =\frac{1}{n}\sum _{i=1}^{n}\frac{1}{|\textbf{Y}_{i}|}\sum _{l_k\in \textbf{Y}_{i}}\frac{|\mathcal {R}(\textbf{x}_{i},l_k)|}{rank (\textbf{x}_{i},l_k)} \end{aligned}$$
      (10)
    • Ranking loss: it evaluates the fraction of mis-ordered label pairs, i.e., pairs in which an irrelevant label is ranked at least as high as a relevant one. \(\mathbf {Y'}_{i}\) denotes the set of labels not belonging to \(\textbf{x}_{i}\). The lower the value, the better the performance.

      $$\begin{aligned} \text {Ranking loss}=\frac{1}{n}\sum _{i=1}^{n}\frac{1}{|\textbf{Y}_{i}||\mathbf {Y'}_{i}|} \left| \{(y_{k},y_{j}) : f_{k}(\textbf{x}_{i}) \le f_{j}(\textbf{x}_{i}),\; (y_{k},y_{j}) \in \textbf{Y}_{i} \times \mathbf {Y'}_{i}\} \right| \end{aligned}$$
      (11)
    • Statistical significance test We have conducted the Wilcoxon signed-rank test to statistically evaluate the difference in the methods’ performance. We have conducted the tests for pairs of methods: (NaNUML-50%, NaNUML-100%, or the best of the two) versus each competing method, on the results obtained from all four evaluating metrics. The evaluations are made at the \(p=0.05\) significance level.
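For reference, all four metrics can be computed with scikit-learn on a held-out test set. The sketch below is ours and assumes a binary ground-truth matrix Y_true, a binary prediction matrix Y_pred, and a real-valued score matrix Y_score, each of shape (n, \(\mathcal {L}\)).

```python
from sklearn.metrics import (f1_score, roc_auc_score,
                             label_ranking_average_precision_score,
                             label_ranking_loss)

def evaluate(Y_true, Y_pred, Y_score):
    """Compute the four evaluation metrics used in this study."""
    return {
        "macro_f1": f1_score(Y_true, Y_pred, average="macro"),
        # roc_auc_score requires every label to have both classes present in Y_true
        "macro_auc": roc_auc_score(Y_true, Y_score, average="macro"),
        "avg_precision": label_ranking_average_precision_score(Y_true, Y_score),
        "ranking_loss": label_ranking_loss(Y_true, Y_score),
    }
```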

6 Results and discussion

We have randomly partitioned each dataset into two equal (or nearly equal), mutually exclusive halves to construct a training set and a test set for a single run. For each run, we have obtained the results on the four metrics. The values in the tables are the mean scores obtained from ten experiment runs. The scores obtained on macro-averaging \(F_{1}\), macro-averaging AUC, average precision, and ranking loss are shown in Tables 2, 3, 4, and 5, respectively. NaNUML (NaNUML-50% and NaNUML-100%) has obtained the best scores on macro-averaging \(F_{1}\) in 9 out of 12 datasets. Of the nine best scores obtained, NaNUML-50% obtains four and NaNUML-100% obtains five. COCOA (two) and RML (one) obtain the remaining three best performances. This result indicates NaNUML's appropriateness for handling class-imbalance problems in a multi-label context. The performance of NaNUML on macro-averaging AUC is somewhat subdued compared to that on macro-averaging \(F_{1}\): NaNUML has obtained the best scores in only 6 out of 12 datasets. The remaining best scores are shared by COCOA (3 out of 6), CLR (2 out of 6), and ECC (3 out of 6). Between NaNUML-50% and NaNUML-100%, the latter attains a relatively better performance. NaNUML has attained the best scores on average precision in 7 out of 12 datasets; NaNUML-50% achieves six of those cases, and only one is achieved by NaNUML-100%. ECC has attained the remaining five best scores. The probable reason for the drop in performance of NaNUML-100% is the deletion of some majority instances, which leads to the loss of some pertinent information. On ranking loss, NaNUML has the lowest loss values in 7 out of 12 cases; of these, NaNUML-50% and NaNUML-100% achieve four and three, respectively. ECC and CLR achieve four and one of the remaining best scores, respectively.

We report the statistical significance of the improvements achieved by NaNUML; the results of the statistical significance test are presented in Table 6. On macro-averaging \(F_{1}\), the performance of NaNUML (the best of NaNUML-50% and NaNUML-100%) is statistically superior to that of all competing methods. Concerning macro-averaging AUC, NaNUML delivers a statistically significant improvement against three competing methods and fails to do so against three: COCOA, CLR, and ECC. This finding is in congruence with the data presented in Table 3. On average precision and ranking loss, NaNUML obtains statistically superior performance against four competing methods, and NaNUML's performance is statistically comparable to that of COCOA and ECC. We should also note that only in one case has NaNUML-100% achieved a statistically inferior performance (against COCOA, on average precision). The above results ascertain the appropriateness of the proposed method, NaNUML, over existing schemes dedicated to multi-label learning and class-imbalance mitigation. It is also to be noted that, being an undersampling scheme, NaNUML reduces the complexity associated with classifier modeling.

Table 2 Macro \(F_{1}\) results
Table 3 Macro AUC results
Table 4 Average precision results
Table 5 Ranking loss results
Table 6 Results of Wilcoxon signed rank test (two-tailed) at \(p=0.05\)

6.1 The role of \(\alpha \)

The degree of undersampling performed on an imbalanced label is quantified and controlled through \(\alpha \). It is varied between 0 and 1, where \(\alpha =0\) signifies no undersampling and \(\alpha =1\) leads to equal cardinalities of the majority and the minority classes. A low value of \(\alpha \) lets the imbalance prevail, whereas a too-high value can distort the majority class. Hence, it is a critical parameter. We vary \(\alpha \) over the values 0.25, 0.5, 0.75, and 1 and observe the corresponding variation in multi-label performance. We explore six datasets: three with numeric features (CAL500, Yeast, and Scene) and three with nominal features (Medical, Llog, and Enron). The plots are shown in Fig. 1.

On macro-averaging \(F_1\), the best scores are obtained at \(\alpha =0.5\) in three cases, at \(\alpha =1\) in two cases, and at \(\alpha =0.75\) in one case. At \(\alpha =0.25\), quite low macro-averaging \(F_1\) scores are obtained. The results indicate that a medium-to-high \(\alpha \) range is effective in improving the recognition of the minority class of each label. On ranking loss, the best (lowest) scores are obtained at \(\alpha =0.5\) in three cases, at \(\alpha =0.75\) in two cases, and at \(\alpha =0.25\) in one case. At \(\alpha =1\), the ranking loss is quite high.

A cumulative analysis of the two metrics reveals that a low \(\alpha \) can be detrimental to recognizing the minority class instances. On the contrary, a high value of \(\alpha \) leads to a distortion of the majority class, thereby increasing the ranking loss. Given these two findings, fixing \(\alpha \) between 0.5 and 0.75 is judicious. A sketch of the \(\alpha \) sweep follows.
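The sketch below reuses the hypothetical undersample_label and evaluate helpers from the earlier sketches, together with a linear SVC as in the reported experiments; it assumes that each label retains both classes after undersampling.

```python
import numpy as np
from sklearn.svm import SVC

def sweep_alpha(X, Y, X_test, Y_test, neighbor_sets,
                alphas=(0.25, 0.5, 0.75, 1.0)):
    """Train one linear SVC per label at each alpha and collect the four metrics."""
    n_labels = Y.shape[1]
    results = {}
    for alpha in alphas:
        scores = np.zeros((X_test.shape[0], n_labels))
        for j in range(n_labels):
            kept = undersample_label(Y, j, neighbor_sets, alpha=alpha)  # Sect. 4 sketch
            clf = SVC(kernel="linear", C=1.0).fit(X[kept], Y[kept, j])
            scores[:, j] = clf.decision_function(X_test)
        preds = (scores > 0).astype(int)
        results[alpha] = evaluate(Y_test, preds, scores)                # Sect. 5 sketch
    return results
```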

Fig. 1 Exploration of \(\alpha \) on six datasets. SVC is used as the base classifier

6.2 Exploration of classifiers

NaNUML is a data pre-processing method that reduces the skewness in the representation of the majority and the minority classes. To assess its intrinsic efficacy, it is essential to pair it with different classifiers. We explore five classifiers in this study: a k-nearest neighbors classifier with \(k=5\), a Decision Tree classifier, a Support Vector Machine classifier with a linear kernel and regularization parameter \(=\) 1, a Random Forest classifier with depth level \(=\) 5 and number of estimators \(=\) 10, and an AdaBoost classifier (with a depth-1 Decision Tree as the base classifier). The same six datasets and two metrics used in the previous experiment are employed, and the classifier configurations are sketched below. The outcomes are shown in Fig. 2. For macro-averaging \(F_1\), the AdaBoost classifier performs best. On ranking loss, the SVC renders the lowest loss values in most cases. The Decision Tree classifier does not fare well on either metric. The Random Forest classifier delivers an admissible performance on the numeric datasets.
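The classifier configurations described above can be instantiated in scikit-learn roughly as follows. This is our reading of the stated hyperparameters; note that the AdaBoost base-learner argument is named base_estimator in older scikit-learn releases.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# One entry per classifier explored in this subsection.
classifiers = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "Linear SVM (C=1)": SVC(kernel="linear", C=1.0),
    "Random Forest": RandomForestClassifier(max_depth=5, n_estimators=10),
    "AdaBoost (depth-1 trees)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1)),
}
```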

Fig. 2 Exploration of different classifiers on six NaNUML-undersampled datasets. \(\alpha \) is fixed at 0.5 for all cases

7 Conclusion

In this work, we have presented a novel label-specific undersampling scheme, NaNUML, for multi-label datasets. NaNUML is based on the parameter-free natural neighbor search, and the critical factor, the neighborhood size ‘k’, is determined without invoking any parameter optimization. In our scheme, we eliminate the majority instances that lie close to the minority class. In addition, we preserve the critical lattice points of the majority class by looking at the majority natural neighbor counts of the majority points. Another advantage of the scheme is that we require only one natural neighbor search for all labels. Undersampling schemes have the intrinsic characteristic of reducing the complexity of the classifier modeling phase (through the reduction in training data), and NaNUML is no exception. The performance of NaNUML indicates its ability to mitigate the class-imbalance issue in multi-label datasets to a considerable extent.

In our future work, we would like to design a natural-neighborhood-based oversampling scheme for class-imbalanced datasets. We would also like to explore if we can incorporate label correlations in our undersampling scheme.