1 Introduction

Rough set theory, introduced by Pawlak (1992), is an effective tool for handling the vagueness, imprecision and uncertainty (Zadeh 1965) in data. After more than 30 years of development, it has been widely applied to Feature Selection (Min and Xu 2016; Swiniarski and Skowron 2003; Wang et al. 2018), Pattern Recognition (Dai et al. 2013; Hu et al. 2016), Granular Computing (Huang and Li 2018; Pedrycz and Chen 2011, 2015; Peter et al. 2003; Polkowski and Artiemjew 2015; Wang 2017; Wang et al. 2017; Zhi and Li 2018), Knowledge Discovery (Mi et al. 2004; Wu et al. 2016) and so on. In particular, as pointed out by Chen et al. (2012), attribute reduction (Ju et al. 2017; Xu et al. 2016) is one of the most representative topics in rough set theory and can be distinguished from other Feature Selection techniques. This is mainly because: (1) attribute reductions have clear semantic explanations; (2) many measures developed in rough set theory can be used to design the constraints of attribute reduction.

Presently, attribute reduction is understood as removing redundant attributes subject to a given constraint. Various measures, such as approximate quality (Pawlak and Skowron 2007) and conditional entropy (Hu et al. 2006), have been employed to define constraints. Note that most of these measures are derived from the relationship between the conditional attributes and the decision attribute. Therefore, the values over the decision attribute, i.e., the labels, are required for exploring attribute reduction. From this point of view, most attribute reductions can only be performed over data without missing labels.

In many real-world applications (Chen and Chang 2011; Chen and Chen 2009; Chen and Tanuwijaya 2011; Wang and Chen 2008), acquiring complete data is a difficult task. Generally speaking, the difficulty in labeling samples has two aspects. On the one hand, the correct labels may be unknown. For example, when computer technology is used to aid medical institutions in analyzing medical images, it is practically impossible for doctors or even experienced experts to locate the nidus (Zhou and Li 2010). Consequently, the cases of illness (labels) may be unknown. On the other hand, labeling samples is hard, expensive and time consuming because it requires considerable effort. For instance, in an era abounding with social media and connectivity, web users are increasingly engaged in interacting and sharing through their phones or computers. Along with this process, data become extremely complex, and it is hard for a network supervisor to label all samples. To sum up, data containing both labeled and unlabeled samples can be seen everywhere, and such data are referred to as partially labeled data (Dai et al. 2017; Liu et al. 2018).

Up to now, most previous attribute reductions have failed to consider partially labeled data. Therefore, the motivations of this paper are: (1) how to handle partially labeled data when realizing attribute reduction; (2) how to preserve or even improve the various performances of the reducts derived from partially labeled data. For these reasons, we propose a novel attribute reduction approach for partially labeled data. The main process of our approach includes two steps: (1) considering that approximate quality and the binary relation can be used to evaluate the importance of an attribute over labeled and unlabeled samples, respectively, a new importance can be expressed by combining these two measures; (2) a significance function [also called a fitness function (Yang and Yao 2018)] can be constructed based on this new importance, and correspondingly, a heuristic algorithm can be re-designed for computing a reduct over partially labeled data.

Note that the neighborhood rough set (Hu et al. 2008b) is employed to realize the topic addressed in this paper, mainly because, compared with other rough sets, the neighborhood rough set has obvious advantages. Taking the comparison between the neighborhood rough set and the fuzzy rough set (Dubois and Prade 1990) as an example, it is well known that (1) similar to the classical rough set, the approximations obtained by the neighborhood rough set are crisp, while those obtained by the fuzzy rough set are still fuzzy; (2) the neighborhood rough set provides a framework for handling continuous or even mixed data (Hu et al. 2008a), while the fuzzy rough set is limited to continuous data.

The main contributions of our research include two aspects: (1) a new importance measure for attribute reduction over partially labeled data is proposed, and a heuristic algorithm based on this importance is re-designed for computing a reduct over partially labeled data; (2) by introducing the neighborhood rough set into our approach, the experimental results over several UCI datasets demonstrate that our approach is effective in selecting qualified attributes from partially labeled data.

The rest of this paper is organized as follows. Section 2 reviews some basic notions of rough set theory and the definition of attribute reduction. The new importance for partially labeled data and the corresponding neighborhood attribute reduction approach are presented in Sect. 3. Experiments are conducted and the experimental results are analyzed in Sect. 4. Finally, we conclude with some remarks and perspectives for future work in Sect. 5.

2 Preliminary knowledge

2.1 Neighborhood rough set

As one of the most important extended models of the classical rough set (Dou et al. 2016; Skowron and Stepaniuk 1996; Wojna 2005; Xu et al. 2017; Yang et al. 2011a, b), the neighborhood rough set has received wide attention. Since different radii used in the neighborhood rough set can characterize the similarity between samples at different scales, the neighborhood rough set is more flexible and more adaptive for complex data (Hu et al. 2008a).

In rough set theory, a decision system can be described by a pair \(\text{DS}=\langle U, AT \cup \{d\}\rangle\), in which the universe U is a nonempty and finite set of samples, AT is a nonempty and finite set of conditional attributes, and d is the decision attribute. Furthermore, \(\forall x \in U, d(x)\) indicates the label of sample x.

Given a decision system DS, we assume that the values of the decision attribute are discrete; then an equivalence relation over d can be defined as \(\text{IND}_{d}=\{(x,y)\in U \times U: d(x)=d(y)\}\). By \(\text{IND}_{d}\), a partition \(U/ \text{IND}_{d}=\{X_{1}, X_{2},\ldots , X_{n}\}\) is derived. In rough set theory, \(X_{k} \in U/ \text{IND}_{d}\) is called the k-th decision class. In particular, the decision class which contains sample x is denoted by \([x]_{d}\).

Given a decision system DS, \(\forall x\in U\), \(\forall A \subseteq AT\) and a radius \(\sigma \in [0,1]\), the size of the neighborhood of x related to A can be defined as (Hu et al. 2008b):

$$\begin{aligned} \sigma _{A}(x)&= \min _{y\in U\wedge y\ne x}\triangle _{A}(x,y)+\sigma \cdot \left( \max _{y\in U\wedge y\ne x}\triangle _{A}(x,y)\right.\\&\quad \left. -\min _{y\in U\wedge y\ne x}\triangle _{A}(x,y)\right) , \end{aligned}$$
(1)

where \(\triangle _{A}\) is a distance function; the Euclidean distance is employed to derive \(\triangle _{A}(x,y)\) in this paper. Note that the construction of Eq. (1) aims to avoid an undesirable case: if a small radius were used directly as the distance threshold, the neighborhood of a sample \(x \in U\) might contain only x itself. In that case, any two samples could be distinguished from each other, which makes the learning process meaningless. With Eq. (1), the neighborhood of x always contains at least its nearest neighbor.

Following Eq. (1), the neighborhood relation in DS can be defined as:

$$\begin{aligned} \delta _{A}=\{(x,y)\in U\times U:\triangle _{A}(x,y)\le \sigma _{A}(x)\}. \end{aligned}$$
(2)

Correspondingly, the neighborhood of x related to A can be defined as:

$$\begin{aligned} \delta _{A}(x)=\{y\in U: \triangle _{A}(x,y)\le \sigma _{A}(x)\}. \end{aligned}$$
(3)
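For concreteness, the following Python sketch (illustrative only; the experiments in this paper were implemented in MATLAB) shows one way Eqs. (1)–(3) might be computed with the Euclidean distance. The function name `neighborhoods` and the toy data are ours, not part of the original work.

```python
import numpy as np

def neighborhoods(X, sigma=0.1):
    """Return the index set delta_A(x) for every sample x (Eqs. (1)-(3))."""
    n = X.shape[0]
    # pairwise Euclidean distances Delta_A(x, y)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    nbrs = []
    for i in range(n):
        others = np.delete(dist[i], i)                    # distances to all y != x
        d_min, d_max = others.min(), others.max()
        radius = d_min + sigma * (d_max - d_min)          # sigma_A(x), Eq. (1)
        nbrs.append(set(np.where(dist[i] <= radius)[0]))  # Eq. (3); contains x itself
    return nbrs

# toy usage: four samples described by two attributes
X = np.array([[0.10, 0.20], [0.15, 0.25], [0.80, 0.90], [0.82, 0.88]])
print(neighborhoods(X, sigma=0.1))
```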

Definition 1

Given a decision system DS, \(\forall A \subseteq AT\) and \(\forall X_{k} \in U/{\mathrm{IND}}_{d}\), the neighborhood lower and upper approximations of \(X_{k}\) in terms of A are defined as:

$$\begin{aligned}\underline{X_{k}}_{A}=\{x\in U:\delta _{A}(x)\subseteq X_{k}\}; \end{aligned}$$
(4)
$$\begin{aligned}\overline{X_{k}}_{A}=\{x\in U:\delta _{A}(x)\cap X_{k}\ne {\emptyset }\}. \end{aligned}$$
(5)

The pair \([\underline{X_{k}}_{A},\overline{X_{k}}_{A}]\) is called a neighborhood rough set of \(X_{k}\).
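A minimal sketch of Eqs. (4) and (5), assuming the neighborhoods \(\delta _{A}(x)\) have already been computed as index sets (e.g., by the sketch above); the helper name `approximations` is hypothetical.

```python
def approximations(nbrs, X_k):
    """Neighborhood lower/upper approximations of a decision class X_k (Eqs. (4)-(5)).

    nbrs : list of sets, nbrs[i] = delta_A(x_i) given as sample-index sets
    X_k  : set of sample indices forming the decision class
    """
    lower = {i for i, nb in enumerate(nbrs) if nb <= X_k}   # delta(x) is a subset of X_k
    upper = {i for i, nb in enumerate(nbrs) if nb & X_k}    # delta(x) meets X_k
    return lower, upper

# toy usage with hand-made neighborhoods of four samples
nbrs = [{0, 1}, {0, 1, 2}, {2, 3}, {2, 3}]
print(approximations(nbrs, X_k={0, 1}))   # lower = {0}, upper = {0, 1}
```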

2.2 Some measurements

Up to now, many measures have been proposed to describe the certainty or uncertainty in data from the viewpoint of rough sets. Similar to the classical rough set, approximate quality (Pawlak 1992) can also be used to describe the degree of certainty in the neighborhood rough set. The definition is as follows.

Definition 2

Given a decision system DS, \(\forall A \subseteq AT\), the approximate quality of d related to A is defined as:

$$\begin{aligned} \gamma _{A}(d)=\frac{\left| \bigcup ^{n}_{k=1}\underline{X_{k}}_{A}\right| }{|U|}, \end{aligned}$$
(6)

where |X| denotes the cardinality of set X.

Obviously, \(0\le \gamma _{A}(d) \le 1\) holds. The higher the value of approximate quality, the higher the certainty in the data.
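Since \(x \in \delta _{A}(x)\), a sample lies in \(\bigcup ^{n}_{k=1}\underline{X_{k}}_{A}\) exactly when all of its neighbors share its label. The sketch below uses this observation to compute \(\gamma _{A}(d)\); the function name and toy data are hypothetical.

```python
def approximate_quality(nbrs, labels):
    """gamma_A(d), Eq. (6): fraction of samples lying in the lower
    approximation of their own decision class."""
    pos = sum(
        all(labels[j] == labels[i] for j in nb)    # delta(x) is a subset of [x]_d
        for i, nb in enumerate(nbrs)
    )
    return pos / len(labels)

# toy usage: the same hand-made neighborhoods, with two decision classes
nbrs = [{0, 1}, {0, 1, 2}, {2, 3}, {2, 3}]
print(approximate_quality(nbrs, labels=[0, 0, 1, 1]))   # 0.75
```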

Besides approximate quality, conditional entropy is another widely accepted measure in rough set theory, which can characterize uncertainty. Presently, many definitions of conditional entropy (Hu et al. 2010; Wei et al. 2013; Zhang et al. 2016; Zhu and Wen 2012) have been proposed for different requirements. A typical form of conditional entropy (Hu et al. 2006) is shown in Definition 3.

Definition 3

Given a decision system DS, \(\forall A \subseteq AT\), the conditional entropy of d related to A is defined as:

$$\begin{aligned} {\mathrm{ENT}}_{A}(d)=-\frac{1}{|U|}\sum _{x\in U}\log \frac{|\delta _{A}(x)\cap [x]_{d}|}{|\delta _{A}(x)|}. \end{aligned}$$
(7)

Different from approximate quality, the certainty is higher when the value of conditional entropy is lower.
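A corresponding sketch of Eq. (7); because \(x \in \delta _{A}(x)\), the ratio inside the logarithm is always positive. The natural logarithm is used here (the base only rescales the value), and the function name and toy data are again ours.

```python
import math

def conditional_entropy(nbrs, labels):
    """ENT_A(d), Eq. (7)."""
    total = 0.0
    for i, nb in enumerate(nbrs):
        same = sum(labels[j] == labels[i] for j in nb)   # |delta(x) ∩ [x]_d|, always >= 1
        total += math.log(same / len(nb))
    return -total / len(labels)

# toy usage: only the second sample has a neighbor with a different label
nbrs = [{0, 1}, {0, 1, 2}, {2, 3}, {2, 3}]
print(conditional_entropy(nbrs, labels=[0, 0, 1, 1]))    # about 0.101
```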

2.3 Neighborhood classifier

A classifier can be used to evaluate the generalization performance of attributes (Chen et al. 2001). In the neighborhood rough set, the neighborhood classifier (NEC) proposed by Hu et al. (2008b) is frequently used. Given a test sample, the neighborhood classifier uses the majority rule over the labels of its neighbors to determine the label of the test sample. The detailed process is shown in Algorithm 1.

[Algorithm 1: Neighborhood classifier]
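Algorithm 1 itself appears as a figure in the original paper; the following is only a rough, assumption-laden sketch of a majority-vote neighborhood classifier in that spirit (in particular, computing the radius of a test sample from its distances to the training samples is our simplification).

```python
import numpy as np
from collections import Counter

def nec_predict(X_train, y_train, X_test, sigma=0.1):
    """Majority vote over the training samples falling into the test
    sample's neighborhood (a sketch in the spirit of Algorithm 1)."""
    preds = []
    for x in X_test:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        radius = dist.min() + sigma * (dist.max() - dist.min())  # radius per test sample
        nbr_labels = y_train[dist <= radius]                     # never empty: the nearest sample qualifies
        preds.append(Counter(nbr_labels).most_common(1)[0][0])   # majority rule
    return np.array(preds)

# toy usage
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(nec_predict(X_train, y_train, np.array([[0.15, 0.15], [0.85, 0.85]])))   # [0 1]
```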

2.4 Attribute reduction

In general, the purpose of attribute reduction is to delete redundant or irrelevant attributes such that the remaining attributes still meet a given constraint. Based on this purpose, attribute reduction with respect to approximate quality can be defined as follows.

Definition 4

Given a decision system DS, \(\forall A \subseteq AT\), A is referred to as a \(\gamma\)-reduct if and only if

1. \(\gamma _{A}(d) \ge \gamma _{AT}(d)\);

2. \(\forall B \subset A, \gamma _{B}(d) < \gamma _{A}(d)\).

Based on Definition 4, the approach used to find a reduct is not only worth addressing but also important. Up to now, the heuristic algorithm based on a greedy strategy has been widely used for finding reducts because of its lower time consumption (Wang et al. 2016; Yang et al. 2019). In the framework of a heuristic algorithm, the most significant attribute in each iteration is determined by a significance function (Yang and Yao 2018). For example, the significance function in terms of approximate quality is as follows.

Definition 5

Given a decision system DS, if \(A \subset AT\), then \(\forall a \in AT-A\), its significance with respect to approximate quality is:

$$\begin{aligned} {\mathrm{Sig}}_{\gamma }(a, A, d)=\gamma _{A\cup \{a\}}(d)-\gamma _{A}(d). \end{aligned}$$
(8)

The detailed process of the heuristic approach to computing a reduct in terms of approximate quality is shown in Algorithm 2.

[Algorithm 2: Heuristic algorithm for computing a reduct in terms of approximate quality]
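Algorithm 2 likewise appears as a figure; the sketch below merely illustrates the greedy strategy described above: starting from the empty set, repeatedly add the attribute with the largest significance (Eq. (8)) until the approximate quality of the selected subset reaches that of AT. All names are hypothetical and the implementation is simplified for clarity.

```python
import numpy as np

def gamma(X, labels, attrs, sigma=0.1):
    """Approximate quality gamma_A(d) restricted to the attribute subset `attrs`."""
    sub = X[:, sorted(attrs)]
    dist = np.sqrt(((sub[:, None, :] - sub[None, :, :]) ** 2).sum(axis=-1))
    pos = 0
    for i in range(len(labels)):
        others = np.delete(dist[i], i)
        r = others.min() + sigma * (others.max() - others.min())
        nb = np.where(dist[i] <= r)[0]
        pos += int(all(labels[j] == labels[i] for j in nb))
    return pos / len(labels)

def gamma_reduct(X, labels, sigma=0.1):
    """Greedy forward selection guided by Sig_gamma (a sketch of Algorithm 2).
    Maximising gamma(A ∪ {a}) is equivalent to maximising Eq. (8),
    since gamma(A) is constant within one iteration."""
    all_attrs = set(range(X.shape[1]))
    target = gamma(X, labels, all_attrs, sigma)
    A, current = set(), 0.0
    while current < target and A != all_attrs:
        best = max(all_attrs - A, key=lambda a: gamma(X, labels, A | {a}, sigma))
        A.add(best)
        current = gamma(X, labels, A, sigma)
    return A

# toy usage: the second attribute alone already separates the two classes
X = np.array([[0.1, 0.1], [0.2, 0.15], [0.15, 0.9], [0.25, 0.95]])
print(gamma_reduct(X, labels=[0, 0, 1, 1], sigma=0.1))   # {1}
```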

3 Attribute reduction for partially labeled data

Many significance functions, such as the one used in Algorithm 2, are derived from labeled samples. It follows that such a significance function can evaluate the significance of each attribute only in terms of labeled samples. However, if unlabeled samples exist in the data, then the expected measure cannot be obtained from them. Moreover, if the unlabeled samples are simply ignored, then the information carried by these samples is wasted. For these reasons, the attribute reduction described in the above section cannot be directly applied to partially labeled data.

Hence, by considering the importance of an attribute based on both labeled and unlabeled samples, we propose a combined importance. In this strategy, the neighborhood approximate quality is used to measure the importance of an attribute in terms of the labeled samples, while the neighborhood relation is employed to measure the importance of an attribute in terms of the unlabeled samples. The general process of constructing the new importance is shown in Fig. 1.

Fig. 1 Process of constructing new importance

Through many experiments, it is easy to observe that as the number of used attributes increases, the value of approximate quality tends to become higher while the cardinality of the neighborhood relation tends to become lower. Therefore, the higher the importance of an attribute, the greater the value of approximate quality and the finer the neighborhood relation. From this point of view, the new importance is defined in Definition 6.

Definition 6

Given a decision system DS, let \(U=l\cup ul\), where l is the set of labeled samples and \(ul\) is the set of unlabeled samples, \(\forall A \subseteq AT\), the importance of A is defined by:

$$\begin{aligned} {\text{IMP}}(A)=\alpha \cdot \frac{\gamma ^{l}_{A}(d)}{\gamma ^{l}_{AT}(d)}+\beta \cdot \frac{|\delta ^{ul}_{AT}|}{|\delta ^{ul}_{A}|}, \end{aligned}$$
(9)

where \(\gamma ^{l}_{A}(d)\) and \(\gamma ^{l}_{AT}(d)\) are computed over the labeled samples in l, while \(|\delta ^{ul}_{AT}|\) and \(|\delta ^{ul}_{A}|\) are obtained over the unlabeled samples in \(ul\). In addition, \(\alpha +\beta =1\) and both values are in [0, 1].

Example 1

Let us consider the example of partially labeled data shown in Table 1, which contains ten samples and four conditional attributes. Among these ten samples, seven are labeled and three are unlabeled.

Suppose that we want to compute \(\text{IMP}(\{a_{1}\})\) in Table 1 by Definition 6 when the radius used in the neighborhood is 0.1.

1. For the first step of computation, we have
   $$\begin{aligned} \gamma ^{l}_{AT}(d) = 0.2857 \quad {\textit{and}}\quad |\delta ^{ul}_{AT}|=6. \end{aligned}$$

2. For the second step of computation, we have
   $$\begin{aligned} \gamma ^{l}_{\{a_{1}\}}(d) = 0\quad {\textit{and}}\quad |\delta ^{ul}_{\{a_{1}\}}|=6. \end{aligned}$$

Therefore, if \(\alpha\) and \(\beta\) are set to 0.8 and 0.2, then \(\text{IMP} (\{a_{1}\}) = 0.8\cdot (0/0.2857) + 0.2\cdot (6/6) = 0.2\) is obtained by Eq. (9).
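A minimal sketch of how Eq. (9) might be computed, assuming (as a simplification) that the neighborhoods of unlabeled samples are formed among the unlabeled samples only and that \(\gamma ^{l}_{AT}(d) > 0\); all function names and the toy data are hypothetical.

```python
import numpy as np

def nbr_sets(X, sigma):
    """delta_A(x) as index sets over the rows of X (Eqs. (1)-(3))."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    out = []
    for i in range(len(X)):
        others = np.delete(dist[i], i)
        r = others.min() + sigma * (others.max() - others.min())
        out.append(set(np.where(dist[i] <= r)[0]))
    return out

def importance(Xl, yl, Xu, A, AT, alpha=0.8, beta=0.2, sigma=0.1):
    """IMP(A), Eq. (9): gamma over labeled samples combined with the size of
    the neighborhood relation over unlabeled samples (gamma^l_AT assumed > 0)."""
    def gamma(attrs):
        nbrs = nbr_sets(Xl[:, sorted(attrs)], sigma)
        return sum(all(yl[j] == yl[i] for j in nb) for i, nb in enumerate(nbrs)) / len(yl)

    def rel_size(attrs):
        # |delta^{ul}_attrs| = number of pairs (x, y) in the neighborhood relation
        return sum(len(nb) for nb in nbr_sets(Xu[:, sorted(attrs)], sigma))

    return alpha * gamma(A) / gamma(AT) + beta * rel_size(AT) / rel_size(A)

# toy usage: five labeled and three unlabeled samples over two attributes
Xl = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.5, 0.5]])
yl = np.array([0, 0, 1, 1, 0])
Xu = np.array([[0.3, 0.3], [0.6, 0.7], [0.7, 0.6]])
print(importance(Xl, yl, Xu, A={0}, AT={0, 1}))   # IMP of the single-attribute subset {a_1}
```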

By Definition 6, the attribute reduction for partially labeled data with respect to IMP can be defined as follows.

Table 1 A small example of partially labeled data

Definition 7

Given a decision system DS, \(\forall A \subseteq AT\), A is referred to as an IMP-reduct if and only if

1. \({\text{IMP}}(A) \ge {\text{IMP}}(AT)\);

2. \(\forall B \subset A\), \({\text{IMP}}(B) < {\text{IMP}}(A)\).

Immediately, the significance function designed for IMP and the attribute reduction algorithm based on IMP can be constructed as follows. Note that to simplify our algorithm and reduce its time consumption, we focus on a suboptimal solution instead of the optimal solution (Yang et al. 2013, 2014). Therefore, the second condition in Definition 7 is not taken into account in this paper.

Definition 8

Given a decision system DS, if \(A \subset AT\), then \(\forall a \in AT-A\), its significance with respect to IMP is:

$$\begin{aligned} {\text{Sig}}_{\mathrm{IMP}}(a, A)=\text{IMP}(A\cup \{a\})-\text{IMP}(A). \end{aligned}$$
(10)
[Algorithm: Heuristic computation of an IMP-based reduct]
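The concrete reduct-computation algorithm is given as a figure in the original paper; the sketch below only illustrates a greedy search guided by \({\text{Sig}}_{\mathrm{IMP}}\), which keeps adding the most significant attribute until the first condition of Definition 7 is met. The callable `imp` stands in for IMP computed from data (for example, by the sketch after Example 1); all names are hypothetical.

```python
def imp_reduct(attrs_all, imp):
    """Greedy forward selection guided by Sig_IMP, Eq. (10); `imp` maps an
    attribute subset to its IMP value (a sketch of the IMP-based algorithm)."""
    attrs_all = set(attrs_all)
    target = imp(attrs_all)
    A = set()
    while A != attrs_all:
        # attribute with the largest Sig_IMP(a, A) = IMP(A ∪ {a}) - IMP(A)
        best = max(attrs_all - A, key=lambda a: imp(A | {a}))
        A.add(best)
        if imp(A) >= target:            # first condition of Definition 7
            break
    return A

# toy usage with a made-up importance function standing in for IMP
toy_imp = lambda s: len(s) / 4.0
print(imp_reduct(range(4), toy_imp))    # keeps adding attributes until IMP(A) >= IMP(AT)
```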

4 Experiments

4.1 Datasets

To evaluate the various performances of our CIMR (the heuristic algorithm that computes reducts based on IMP), 12 real-world datasets from the UCI machine learning repository have been employed in the following experiments. Table 2 summarizes detailed statistics of the datasets used in our experiments.

Table 2 Characteristics of the experimental datasets

4.2 Experimental setup

All the experiments have been carried out on a personal computer with Windows 7, an Intel Core i5-3337U CPU (1.80 GHz) and 4.00 GB memory. The programming language is MATLAB R2014a.

In our experiments, tenfold cross-validation is employed to evaluate the effectiveness of the different methods. For each training set, we randomly divide the samples into two groups (labeled and unlabeled samples) with ratios of 7:3, 5:5 and 3:7, i.e., three different ratios (30%, 50% and 70%) of missing labels. Moreover, each complete training set without missing labels (0% of missing labels) is retained for CAQR (the heuristic algorithm that computes reducts based on approximate quality). Ten different neighborhood radii are then used: \(\sigma = 0.03, 0.06,\ldots , 0.3.\)
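Purely for illustration (the paper's experiments were carried out in MATLAB), the label-masking step for one training fold might be sketched as follows; the function and parameter names are our own.

```python
import numpy as np

def mask_labels(n_train, missing_ratio, seed=0):
    """Randomly split a training fold into labeled / unlabeled index sets;
    e.g. missing_ratio = 0.3 corresponds to the 7:3 labeled-to-unlabeled split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_train)
    n_unlabeled = int(round(missing_ratio * n_train))
    return idx[n_unlabeled:], idx[:n_unlabeled]      # labeled indices, unlabeled indices

labeled_idx, unlabeled_idx = mask_labels(100, missing_ratio=0.3)
print(len(labeled_idx), len(unlabeled_idx))          # 70 30
```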

To set \(\alpha\) and \(\beta\) for a better performance of our approach, we conduct CIMR with different settings beforehand. Tenfold cross-validation is also employed, and CIMR is performed over each training set. By executing the derived reducts over each test set, the NEC-based classification accuracies are compared. Note that for each setting of \(\alpha\) and \(\beta\), with the three ratios of missing labels, three reducts can be derived by CIMR from each training set. Correspondingly, three groups of classification accuracies can be obtained from each test set, and the averages of these classification accuracies are compared as shown in Table 3. It is not difficult to observe that if \(\alpha\) and \(\beta\) are set to 0.8 and 0.2, then CIMR may be more effective. From this point of view, we set \(\alpha\) and \(\beta\) to 0.8 and 0.2 in our experiments.

Table 3 Classification accuracies among different settings of \(\alpha\) and \(\beta\) (greater values are in bold)

4.3 Experimental results and analyses

In the following, with respect to the three ratios (30%, 50% and 70%) of missing labels, the reducts derived by CAQR are denoted as 30%-AQR, 50%-AQR and 70%-AQR, respectively. Similarly, the reducts obtained by CIMR are denoted as 30%-IMR, 50%-IMR and 70%-IMR, respectively. Note that CAQR is only executed over the labeled samples, and the reduct derived from the complete data with 0% of missing labels is denoted as 0%-AQR.

In our experiments, the lengths, values of approximate quality, values of conditional entropy and NEC-based classification accuracies derived from the reducts are compared. The detailed experimental results are shown in the following.

From an investigation of Table 4, it is not difficult to observe: (1) with the same ratios of missing labels, IMRs are longer than AQRs; (2) even if CAQR is executed over the complete data, the lengths of the 0%-AQRs are still smaller. This is mainly because both the value of approximate quality derived from the labeled samples and the neighborhood relation obtained from the unlabeled samples must be considered simultaneously, i.e., the constraint used in CIMR is stricter than that in CAQR.

Table 4 Comparisons among lengths of reducts (greater values are in bold)
Fig. 2 Comparisons among values of approximate quality derived by reducts

Fig. 3 Comparisons among the values of conditional entropy derived by reducts

Fig. 4 Comparisons among classification accuracies derived by reducts

By Fig. 2, it is not difficult to observe the following.

1. In general, the values of approximate quality derived by IMRs are higher than or equal to those derived by AQRs. As shown in Table 4, IMRs are longer than AQRs. Therefore, we may conclude that a reduct with more attributes may bring a higher value of approximate quality.

2. If the radius is greater, then the derived value of approximate quality tends to be lower. This may be because, with a greater radius, the value of approximate quality derived from the raw data tends to be lower, which leads to looser constraints in both CAQR and CIMR. It follows that the corresponding reducts may offer worse performance in improving the value of approximate quality.

By Fig. 3, the values of conditional entropy derived by IMRs are lower than or equal to those derived by AQRs. Combined with Table 4, we may conclude that a reduct with more attributes may bring a lower value of conditional entropy. Moreover, similar to the results related to the values of approximate quality, it is easy to see that if the used radius is greater, then the value of conditional entropy derived by a reduct tends to be higher.

By Fig. 4, we can observe that, with the same ratios of missing labels, the classification accuracies derived by IMRs are higher than or equal to those derived by AQRs. This may be because, in the process of computing reducts, CAQR ignores the unlabeled samples, so that more samples are used in CIMR. Accordingly, if the complete data are used, then the classification accuracies derived by the 0%-AQRs may be higher.

From a thorough investigation of Figs. 2, 3 and 4, we can observe that, from the viewpoint of the used ratio, when the ratio of missing labels is 30%, the 30%-IMRs offer the best performances. This may imply that although we have found an effective algorithm for dealing with partially labeled data, if the ratio of missing labels is too high, then our algorithm becomes much less effective.

4.4 Statistics comparisons among the results

In the following, to further analyze the previous results from a statistical viewpoint, one-way analysis of variance (one-way ANOVA) (Fisher 1921) is employed to compare the results.

One-way ANOVA can be used to compare the averages of two or more groups of data. In the context of this paper, it is used to test whether the performances of two different types of reducts differ significantly. For example, for a missing-label ratio of 30%, with 10 different neighborhood radii, 10 different 30%-IMRs can be derived, and the corresponding 10 values of approximate quality can be collected into one group. Similarly, the 10 values of approximate quality in terms of the 10 different 30%-AQRs can be collected into the other group. Immediately, the difference between the values of approximate quality in these two groups can be tested by one-way ANOVA.

Note that the p value returned by one-way ANOVA is computed under the null hypothesis that the groups are drawn from populations with the same average. If the p value is higher than the default 5% significance level (0.05), it indicates that the averages of the groups are similar; otherwise, the averages of the groups are significantly different. The detailed p values are shown in Tables 5, 6 and 7.
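Such a comparison can be reproduced with any standard statistics package; for illustration, the sketch below applies SciPy's `f_oneway` to two made-up groups of ten values (the numbers are not results from the paper).

```python
from scipy import stats

# ten values of approximate quality per group, one per neighborhood radius
# (illustrative numbers only, not results from the paper)
imr_30 = [0.91, 0.89, 0.88, 0.87, 0.85, 0.84, 0.82, 0.80, 0.79, 0.77]
aqr_30 = [0.90, 0.88, 0.87, 0.85, 0.84, 0.82, 0.81, 0.79, 0.77, 0.75]

f_stat, p_value = stats.f_oneway(imr_30, aqr_30)
print(p_value > 0.05)   # True: the group averages are not significantly different
```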

Table 5 p values of one-way ANOVA for comparing the values of approximate quality derived by reducts
Table 6 p values of one-way ANOVA for comparing the values of conditional entropy derived by reducts
Table 7 p values of one-way ANOVA for comparing classification accuracies of reducts

Following the results shown in Tables 5, 6 and 7, it is not difficult to observe that in most cases the returned p values are higher than 0.05. Therefore, we may conclude that the performances of the IMRs derived by our approach are not worse than those of the AQRs.

5 Conclusions and future perspectives

In this paper, we have designed a new algorithm for finding reducts over partially labeled data. The main contributions of this paper are: (1) a new importance measure for evaluating attributes is proposed, and this importance can be used to search for suitable attributes over partially labeled data; (2) through various comparative experiments, the final results indicate that our approach is effective in handling partially labeled data. The following topics deserve our further investigation.

1. We have realized our algorithm only with the concept of approximate quality in this paper; some other measures, such as conditional entropy and the neighborhood decision error rate, will be further applied.

2. Attribute reduction can be considered as a preliminary step of data processing; the classification performances of different classifiers based on our reducts will be further explored.