1 Introduction

Rough set theory [1] is an effective method for dealing with vague, imprecise, or uncertain data and has been widely used in various fields such as machine learning, pattern recognition, and data mining [2,3,4,5]. In rough set theory, each attribute or attribute subset induces an indiscernibility relation. On this basis, a rough set is defined by two approximations, called the lower and upper approximations, to represent vague, imprecise, or uncertain concepts [6]. Attribute reduction is a primary research topic in rough sets [7,8,9,10,11,12,13,14,15,16,17,18,19]. It aims to remove irrelevant and redundant attributes while retaining important attributes so as to maintain the discriminating power of the data. Many rough sets-based attribute reduction methods have been proposed, including positive region-based, discernibility matrix-based, and information entropy-based methods. Among them, discernibility matrix-based methods have received extensive attention because of their simplicity and ease of implementation [20,21,22,23,24]; heuristic methods initialized with core attributes are often used to generate optimal reducts.

Existing rough sets-based attribute reduction algorithms mainly deal with fully labeled or fully unlabeled data. However, in many practical tasks, such as web-page categorization [25], intrusion detection [26], and medical diagnosis [27], unlabeled training objects are easily available, but labeled ones are difficult to obtain because labeling objects is labor intensive and expensive. Using only a small number of labeled objects often causes the model to overfit the data, resulting in weak generalization ability. Additionally, learning a supervised model without considering unlabeled objects wastes a massive amount of exploitable data. Thus, how to utilize a large number of unlabeled objects to improve learning performance has emerged as a hot research topic in machine learning.

In the rough sets field, many scholars have studied attribute reduction based on the discernibility matrix. Wei et al. [20] developed a discernibility matrix-based incremental attribute reduction algorithm to generate the optimal reduct of dynamic data. Ma et al. [21] constructed a compressed binary discernibility matrix for group dynamic data and developed an incremental attribute reduction algorithm that considers both single and group dynamic objects. By optimizing the space required to store the discernibility matrix, Liu et al. [22] designed an incremental attribute reduction method for fused decision tables. Additionally, some attribute reduction methods have been proposed for partially labeled data. Dai et al. [28] introduced the concept of the discernibility pair and developed two attribute reduction measures for partially labeled categorical data. Based on mutual information, Hu et al. [29] defined a significance measure for attributes in partially labeled data and utilized it as heuristic information to speed up the attribute reduction process. Xie et al. [30] proposed two types of induced hypergraphs for partially labeled decision systems and designed a fast algorithm based on low-complexity heuristics to compute the optimal reduct. Unlike traditional methods that use only one fitness function, Liu et al. [31] introduced an ensemble voting mechanism to select a more appropriate semi-supervised reduct by constructing multiple fitness functions. Gao et al. [32] generated proxy labels for unlabeled data using prior class-distribution information and developed a granular conditional entropy measure for semi-supervised attribute reduction. Some related semi-supervised learning methods have also been proposed. Wang et al. [33] used Gaussian kernel-based fuzzy rough sets to measure the inconsistency of unlabeled objects and provided an active learning model based on SVM. By integrating three-way decision theory and cost-sensitive learning, Min et al. [34] developed an active learning model based on the k-nearest neighbor classifier. By introducing the idea of tri-partition from three-way decision, Gao et al. [23] proposed a three-way co-decision model to improve semi-supervised learning performance. In addition, rough set theory has been successfully applied to practical semi-supervised tasks, such as short text classification [35], defect detection [36], and relationship categorization [37,38], among others [39].

The aforementioned studies primarily focus on rough sets-based semi-supervised attribute reduction or practical applications. However, little work has been devoted to constructing semi-supervised models that directly learn from partially labeled data using rough sets. Tri-training [40] is a typical disagreement-based semi-supervised model that employs three learners to learn from each other using unlabeled data, but it suffers from the weak diversity of its base learners and the low quality of the selected unlabeled data. In this study, we propose a rough sets-based tri-trade model for partially labeled data. The primary contributions are as follows:

  1.

    To address the attribute reduction problem for partially labeled data, a new semi-supervised discernibility matrix is proposed, based on which a beam search-based heuristic attribute reduction algorithm is designed to generate optimal semi-supervised reducts. The semi-supervised discernibility matrix considers both labeled and unlabeled data and allows for a certain degree of inconsistency, which contributes to improving the robustness and adaptability of semi-supervised attribute reduction.

  2.

    To learn from unlabeled data, a tri-trade model that uses three diverse semi-supervised reducts to train base classifiers is constructed, and a novel data editing technique is developed to reliably identify useful unlabeled data. By selecting useful unlabeled objects while eliminating mislabeled ones, the proposed data editing technique enables the base classifiers to learn from each other on high-quality unlabeled data.

  3.

    To obtain insight into the proposed model, a theoretical analysis is offered from the perspective of noise learning. Furthermore, extensive experiments are carried out to validate the effectiveness of the proposed model, and good results are achieved.

The rest of this paper is organized as follows. Section 2 introduces basic concepts of rough sets and semi-supervised learning. Section 3 provides a detailed description of the proposed tri-trade model for partially labeled data as well as a theoretical analysis. Section 4 reports the experimental results and analysis. Finally, Section 5 summarizes the paper.

2 Preliminaries

In this section, some concepts related to rough sets and semi-supervised learning are briefly reviewed. A detailed description of these theories can be found in [1, 6, 41,42,43,44,45].

2.1 Rough sets

In rough set theory, data of interest can be represented in an information system [6]. An information system is a quadruple, denoted as IS = (U,A,V,f), where U is a non-empty finite set of objects, called the universe; A is a non-empty finite set of attributes; V is the union of attribute domains, i.e., \(V=\bigcup _{a\in A} V_{a}\), where \(V_{a}\) denotes the domain of attribute a; and f is the information function such that \(f(x,a) \in V_{a}\) for each \(x\in U\) and \(a\in A\), which assigns a unique value to each attribute of an object in U. When the attribute set A can be partitioned into a condition attribute set C and a decision attribute set D with \(C\cap D = \emptyset \), the information system is also called a decision table or decision information system [6].

Definition 1

Let IS = (U,A = CD,V,f) be a decision table. For any non-empty attribute subset \(B\subseteq A\), the indiscernibility relation induced by B is defined as:

$$ I N D(B) = \{(x, y) \in U \times U: a(x) = a(y), \forall a \in B\} $$
(1)

Definition 2

Let IS = (U,A = CD,V,f) be a decision table and IND(B) be the equivalence relation induced by an attribute subset \(B\subseteq A\), the set of equivalence classes of U induced by IND(B) is denoted as:

$$ U / I N D(B)=\left\{[x]_{B}: x \in U\right\}, $$
(2)

where \([x]_{B}=\left \{ y\in U:(x,y) \in IND(B) \right \}\) is called the equivalence class of x under the equivalence relation IND(B).
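As a concrete illustration, the following minimal Python sketch (with a hypothetical toy table; all names are illustrative) computes the partition \(U/IND(B)\) by grouping objects that agree on every attribute in B:

```python
from collections import defaultdict

def partition(U, B):
    """Group the objects of U into the equivalence classes of IND(B).

    U: list of dicts mapping attribute name -> value (one dict per object).
    B: iterable of attribute names.
    """
    classes = defaultdict(list)
    for i, x in enumerate(U):
        key = tuple(x[a] for a in B)  # objects agreeing on all a in B share a key
        classes[key].append(i)
    return list(classes.values())

# toy universe: three objects described by attributes a1 and a2
U = [{"a1": 0, "a2": 1}, {"a1": 0, "a2": 1}, {"a1": 1, "a2": 0}]
print(partition(U, ["a1", "a2"]))     # [[0, 1], [2]]
```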

Definition 3

Let IS = (U,A = CD,V,f) be a decision table. For any subset X of U , the lower and upper approximations with respect to an attribute subset \(B\subseteq A\) are defined as:

$$ \begin{array}{@{}rcl@{}} \underline{B}(X)&=&\left\{x \in U \mid[x]_{B} \subseteq X\right\} \\ \overline{B}(X)&=&\left\{x \in U \mid[x]_{B} \cap X \neq \emptyset\right\} \end{array} $$
(3)

\(\underline {B}(X)\) is also called the B-positive region of X over U, denoted as \(POS_{B}(X)\). The set difference between \(\overline {B}(X)\) and \(\underline {B}(X)\) is called the B-boundary region of X over U, denoted as \(BND_{B}(X)\). The set of objects outside \(\overline {B}(X)\) is called the B-negative region of X over U, denoted as \(NEG_{B}(X)\), namely \(NEG_{B}(X)=U-\overline {B}(X)\).
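Continuing the sketch above (reusing the hypothetical `partition` helper), the three regions of Definition 3 can be computed directly from the equivalence classes; here X is given as a set of object indices:

```python
def approximations(U, B, X):
    """Lower and upper approximations of X (a set of object indices) w.r.t. IND(B)."""
    lower, upper = set(), set()
    for eq in partition(U, B):        # partition() as sketched after Definition 2
        eq = set(eq)
        if eq <= X:
            lower |= eq               # [x]_B entirely contained in X
        if eq & X:
            upper |= eq               # [x]_B intersects X
    return lower, upper

X = {0, 1}
low, up = approximations(U, ["a1"], X)
pos, bnd = low, up - low              # POS_B(X) and BND_B(X)
neg = set(range(len(U))) - up         # NEG_B(X) = U - upper approximation
```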

Definition 4

Let IS = (U,A = CD,V,f) be a decision table and \(U / D= \left \{Y_{1}, Y_{2},{\dots } , Y_{\lvert U/D \rvert } \right \} \) be the partition derived from the decision attribute D over U. The positive, boundary, and negative regions of D given an attribute subset \(B\subseteq C\) are defined as:

$$ \begin{array}{@{}rcl@{}} P O S_{B}(D) & =& \bigcup\limits_{Y_{i} \in U / D} \underline{B}\left( Y_{i}\right) \\ B N D_{B}(D) &=& \bigcup\limits_{Y_{i} \in U / D}\left( \overline{B}\left( Y_{i}\right)-\underline{B}\left( Y_{i}\right)\right) \\ N E G_{B}(D) & =& U-\bigcup\limits_{Y_{i} \in U / D} \overline{B}\left( Y_{i}\right) \end{array} $$
(4)

Definition 5

Let IS = (U,A = CD,V,f) be a decision table. For an attribute subset S of C, S is a reduct of C if and only if:

  1)

    \(POS_{S}(D) = POS_{C}(D)\) and

  2)

    \(\forall a \in S,POS_{S-\left \{ a \right \}}(D)\neq POS_{S}(D)\)

Meanwhile, the classification ability of an attribute or attribute subset can be represented by a discernibility matrix, whose entries describe the discernible information for each pair of objects with different decisions. Formally, the discernibility matrix, the core attribute, and the discernibility matrix-based reduct are defined as follows.

Definition 6

Let IS = (U,A = CD,V,f) be a decision table. The element of the discernibility matrix M is denoted as:

$$ e_{i j}=\left\{\begin{array}{ll} \left\{a \in C \mid a\left( x_{i}\right) \neq a\left( x_{j}\right)\right\}, & d\left( x_{i}\right) \neq d\left( x_{j}\right) \\ \varnothing, & \text { otherwise } \end{array}\right. $$
(5)

In the discernibility matrix, if two objects have different decisions, the corresponding element is the set of discernible attributes, i.e., the attributes on which the two objects take different values; otherwise, the element is empty.

Definition 7

Let IS = (U,A = CD,V,f) be a decision table and M be the discernibility matrix of IS. An attribute aC is a core attribute if and only if there exists a singleton e in M such that \(e=\left \{a\right \}\).

Definition 8

Let IS = (U,A = CD,V,f) be a decision table and M be the discernibility matrix of IS. For an attribute subset S of C, S is a reduct of C if and only if:

  1.

    \(\forall e \in M, e \neq \emptyset \Rightarrow S\cap e \neq \emptyset \) and

  2.

    \(\forall a \in S\), letting \(S^{\prime} = S-\{a\}\), \(\exists e \in M, e \neq \emptyset \wedge S^{\prime}\cap e = \emptyset \)

Different from the positive region-based reduct in Definition 5, the reduct in Definition 8 is a minimal subset of attributes that intersects with every non-empty element in the discernibility matrix. In other words, a discernibility matrix-based reduct is a jointly sufficient and individually necessary attribute subset to discriminate all objects in the original data.
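The following Python sketch (the data layout and helper names are illustrative assumptions, not the paper's implementation) builds the matrix of Definition 6 and checks the conditions of Definitions 7 and 8:

```python
def discernibility_matrix(U, C, d):
    """Non-empty entries of the matrix in Definition 6.

    U: list of attribute dicts; C: condition attribute names; d: decision list.
    """
    M = []
    for i in range(len(U)):
        for j in range(i + 1, len(U)):
            if d[i] != d[j]:
                e = frozenset(a for a in C if U[i][a] != U[j][a])
                if e:
                    M.append(e)
    return M

def core(M):
    """Core attributes: the singleton entries of the matrix (Definition 7)."""
    return {next(iter(e)) for e in M if len(e) == 1}

def is_reduct(M, S):
    """Check the two conditions of Definition 8 for a candidate subset S."""
    S = set(S)
    hits_all = all(S & e for e in M)                      # condition 1
    necessary = all(                                      # condition 2
        any(not ((S - {a}) & e) for e in M) for a in S)
    return hits_all and necessary
```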

2.2 Semi-supervised learning

In semi-supervised learning, a given partially labeled dataset \(U = L\cup N\) contains a set of labeled objects \(L=\left \{ (x_{i},y_{i}) \right \}_{i=1}^{l}\) and a set of unlabeled objects \(N=\left \{ (x_{j},?) \right \}_{j=l+1}^{l+n}\), where \(x_{i}\) and \(x_{j}\) are described by m attributes, \(y_{i}\) belongs to one of the k classes, and \(l\ll n\). Generally, semi-supervised learning can be classified as semi-supervised clustering, semi-supervised classification, and semi-supervised regression [42, 46, 47]. This paper mainly concentrates on semi-supervised classification.

Semi-supervised classification aims to exploit a large amount of unlabeled data to improve a learner trained only on labeled data. Semi-supervised classification methods can be roughly classified as low-density separation methods, generative methods, graph-based methods, and disagreement-based methods [47]. Self-training [48] is a classic semi-supervised method that retrains the learner on self-labeled objects. More specifically, the model first trains a classifier on labeled objects and then iteratively annotates the most confidently predicted unlabeled objects to retrain the classifier. Co-training [23] is a disagreement-based model that enables two classifiers to learn from each other on unlabeled data. Standard co-training requires two sufficient and redundant views of the data. On each view, a classifier is trained on labeled objects, and then the two classifiers share some unlabeled objects with high-confidence predicted labels to improve each other. Tri-training [40] is another popular disagreement-based model. It resamples the set of labeled objects to obtain three labeled training sets, on each of which a base classifier is trained. In each iteration of tri-training, if two classifiers make the same prediction on an unlabeled object, this object, with its predicted label, is used to update the third classifier, until the stopping condition is met.

However, standard tri-training suffers from several problems. On the one hand, due to the constraint of a single view, resampling inevitably leads to high redundancy in the generated data. In particular, when only a few labeled objects are provided, the quality of the generated data is difficult to guarantee. On the other hand, unlabeled objects are evaluated only by the consistency of the base classifiers, without considering their confidence and uncertainty. If the base classifiers are weak, unlabeled objects may be mislabeled and classification noise is introduced. Therefore, it is highly desirable to improve the mechanism of training base classifiers and the strategy of selecting unlabeled objects.

3 Tri-trade for partially labeled data

In this section, the overall framework of the proposed model is first described. After that, a semi-supervised attribute reduction algorithm based on discernibility matrix is presented. Subsequently, the tri-trade model is proposed based on three distinct semi-supervised reducts. Finally, the effectiveness of the model is analyzed theoretically.

3.1 Overall framework

Tri-training is an efficient semi-supervised model that employs three classifiers to learn from unlabeled data. However, due to the single-view constraint, the tri-training model suffers from the problem of high redundancy of generated data after resampling. In fact, some datasets, particularly those with a large number of attributes, may generally be reduced to multiple attribute subsets, each of which can completely and competently represent the original data. In addition, these attribute subsets describe the original data from different perspectives, resulting in diverse induction biases. Therefore, by utilizing the diversity of multiple reduced subspaces, we can construct an effective multiview tri-trade model for partially labeled data, which is illustrated in Fig. 1.

Fig. 1 Framework of the proposed tri-trade model

Different from standard tri-training, the tri-trade model employs the attribute reduction technique to generate different views. More specifically, a semi-supervised discernibility matrix is first constructed for partially labeled data, and a heuristic algorithm is designed to generate three distinct semi-supervised reducts. On each reduct (view), a base classifier is trained using the initially labeled data. Then, by utilizing the data editing technique, two base classifiers select some confidently predicted unlabeled data to update the third classifier. When no classifier can be updated, the algorithm terminates and yields a final classifier by combining the three refined classifiers. In the following subsections, we elaborate on the details of the proposed model.

3.2 Discernibility matrix-based semi-supervised attribute reduction

In rough sets, traditional discernibility matrix-based attribute reduction methods are often used to deal with completely labeled or unlabeled data. However, in semi-supervised tasks, objects are only partially labeled. To deal with partially labeled data, a new discernibility matrix is developed. In a traditional discernibility matrix, discernible information is generated only from labeled objects or unlabeled objects. Intuitively, a reduct for partially labeled data should distinguish all kinds of objects. Thus, it is desirable that a semi-supervised discernibility matrix can consider both labeled and unlabeled objects. For this purpose, a semi-supervised discernibility matrix is constructed as follows.

Definition 9

Let \(PS=(U=L\cup N,A=C\cup {D},V^{\prime },f)\) be a partially labeled decision table. The non-empty elements of the semi-supervised discernibility matrix SM are defined as:

$$ e_{i j}=\left\{\begin{array}{cc} \left\{a \in C \mid a\left( x_{i}\right) \neq a\left( x_{j}\right)\right\}, & d\left( x_{i}\right) \neq d\left( x_{j}\right) \wedge\left( x_{i} \in L \wedge x_{j} \in L\right) \\ \left\{a \in C \mid a\left( x_{i}\right) \neq a\left( x_{j}\right)\right\}, & \left( x_{i} \in L \wedge x_{j} \in N\right) \vee\left( x_{i} \in N \wedge x_{j} \in L\right) \\ \left\{a \in C \mid a\left( x_{i}\right) \neq a\left( x_{j}\right)\right\}, & x_{i} \in N \wedge x_{j} \in N \end{array}\right. $$
(6)

In the definition, labeled objects with different labels are compared to generate discernible information. Due to the decision uncertainty of unlabeled data, all unlabeled objects are discerned from each other. In addition, to distinguish all kinds of objects, discernible information between labeled and unlabeled objects is generated. In the following, an example is given to illustrate the proposed discernibility matrix.
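A minimal Python sketch of Definition 9 (the data layout is a hypothetical choice): labeled pairs are compared only when their decisions differ, while every pair involving an unlabeled object is compared:

```python
def semi_supervised_matrix(L, N, C):
    """Non-empty entries of the semi-supervised matrix SM (Definition 9).

    L: list of (attribute-dict, label) pairs; N: list of attribute dicts.
    """
    objs = [(x, y) for x, y in L] + [(x, None) for x in N]
    SM = []
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            (xi, yi), (xj, yj) = objs[i], objs[j]
            # skip only pairs of labeled objects sharing the same decision
            if yi is None or yj is None or yi != yj:
                e = frozenset(a for a in C if xi[a] != xj[a])
                if e:
                    SM.append(e)
    return SM
```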

Example 1

Let \(PS=(U=L\cup N,A=C\cup {D},V^{\prime },f)\) be the partially labeled data shown in Table 1, where \(U=\left \{x_{1},x_{2},{\dots } ,x_{8} \right \}\), \(C=\left \{a_{1},a_{2},a_{3},a_{4},a_{5}\right \},V_{a}=\left \{0,1\right \}\) for every aC, and \(V_{D}=\left \{d_{1},d_{2},?\right \}\).

Table 1 A partially labeled data

In Table 1, there are two labeled objects and six unlabeled objects. According to Definition 9, the semi-supervised discernibility matrix shown in Table 2 can be derived. In Table 2, labeled objects with different labels are compared to generate the elements in L × L; unlabeled objects are compared to generate the elements in N × N; labeled and unlabeled objects are compared to generate the elements in L × N.

Table 2 The semi-supervised discernibility matrix of partially labeled data in Table 1

To perform semi-supervised attribute reduction based on the proposed discernibility matrix, we first introduce the concepts of the relevant set and the complement set of attributes:

Algorithm 1 Beam search algorithm for attribute reduction based on semi-supervised discernibility matrix.

Definition 10

Let \(PS=(U=L\cup N,A=C\cup {D},V^{\prime },f)\) be a partially labeled data and SM be the semi-supervised discernibility matrix of PS. Then, for an attribute subset \(B\subseteq C\), its relevant set is defined as:

$$ R M_{S M}(B)=\{e \in S M \mid \exists a \in B: a \in e\} $$
(7)

Definition 11

Let \(PS=(U=L\cup N,A=C\cup {D},V^{\prime },f)\) be a partially labeled data and SM be the semi-supervised discernibility matrix of PS. Then, for an attribute subset \(B\subseteq C\), the complement set with respect to its relevant set is defined as:

$$ O M_{SM}(B)=\{e -B \lvert e \in RM_{SM}(B)\} $$
(8)

According to the above definitions, the relevant set of an attribute subset consists of the matrix elements that contain at least one attribute of the subset, and the complement set comprises these elements after the attributes of the subset have been removed.
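Under the same hypothetical representation of SM as a list of attribute sets, the two operators might be sketched as:

```python
def relevant_set(SM, B):
    """RM_SM(B): matrix entries containing at least one attribute of B (Eq. 7)."""
    B = set(B)
    return [e for e in SM if B & e]

def complement_set(SM, B):
    """OM_SM(B): the relevant entries with the attributes of B removed (Eq. 8)."""
    B = set(B)
    return [e - B for e in relevant_set(SM, B)]
```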

Based on the set operators presented above, an attribute reduction algorithm can be designed to generate reducts for partially labeled data. However, finding all reducts, or finding a minimal reduct, i.e., a reduct with the minimum number of attributes, is NP-hard. Thus, heuristic algorithms are preferred. By designing reasonable heuristic costs, heuristic algorithms can quickly obtain optimal reducts. Due to its simplicity and efficiency, the greedy forward search strategy of iteratively adding attributes is widely used in practical applications. However, greedy forward search tends to fall into local optima. Therefore, this paper employs beam search, a forward heuristic search that explores several optimal reducts in parallel; these reducts can then serve as views for semi-supervised learning. More specifically, beam search utilizes a breadth-first strategy to find optimal reducts. In each iteration, candidate attributes are sorted according to the heuristic cost, and a certain number (called the beam width) of attributes with the minimum costs are preserved. Since the tri-trade model requires three distinct reducts to train three base classifiers, the beam width is set to three. In the attribute reduction process, attributes with strong discriminating power should be preferentially selected, and optimal reducts should contain as few attributes as possible. Therefore, to evaluate the cost of each attribute, the heuristic function is defined as follows:

$$ \text{heuristic cost}(a)=\frac{\lvert S \rvert}{\lvert C \rvert} + \frac{\lvert SM \rvert - \lvert R M_{S M}(a) \rvert}{\lvert SM \rvert} $$
(9)

For an attribute a, the heuristic cost consists of two parts: the ratio of the number of selected attributes \(\lvert S \rvert \) in the current attribute subset to the number of all attributes \(\lvert C \rvert \), which aims at minimizing the number of attributes contained in the reducts, and the fraction of matrix elements that a fails to discern, i.e., the elements of SM outside its relevant set \(RM_{SM}(a)\), which aims at preferentially selecting attributes with strong discriminating power. By using this heuristic information, the beam search algorithm can select the most important attributes to generate multiple reducts. This process can be described by Algorithm 1.

The algorithm starts with the construction of a semi-supervised discernibility matrix for partially labeled data. Since the core attributes have unique discriminating power (see Definition 7), the attribute subset is initialized with the core attributes to accelerate the search process (step 1 and step 2). In each iteration, the algorithm selects the three attributes with the minimum heuristic costs and discards their relevant sets. The search process terminates when the semi-supervised discernibility matrix becomes empty (step 3 to step 9). Finally, three optimal reducts are yielded, each of which has a nonempty intersection with every nonempty element of the semi-supervised discernibility matrix, thus maintaining the same discriminating power as the whole set of condition attributes.
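The following Python sketch outlines Algorithm 1 under the representations used above. It is a simplified reading of the pseudocode (branch bookkeeping and tie handling are illustrative, and the heuristic of equation (9) is evaluated on the entries each branch has not yet covered), with the beam width defaulting to three:

```python
def heuristic_cost(a, S, rest, C):
    """Eq. (9): favor small subsets and attributes whose relevant set
    covers many of the remaining matrix entries.  len(S) + 1 is |S| after
    adding a; `rest` holds the entries not yet covered by S."""
    covered = sum(1 for e in rest if a in e)        # |RM(a)| on remaining entries
    return (len(S) + 1) / len(C) + (len(rest) - covered) / len(rest)

def beam_reduction(SM, C, width=3):
    """Beam search over the semi-supervised matrix: a sketch of Algorithm 1."""
    core = {next(iter(e)) for e in SM if len(e) == 1}   # start from the core
    beams = [(core, [e for e in SM if not (core & e)])]
    while any(rest for _, rest in beams):
        candidates = []
        for S, rest in beams:
            if not rest:                                # branch already a reduct
                candidates.append((0.0, S, rest))
                continue
            for a in set(C) - S:
                S2 = S | {a}
                rest2 = [e for e in rest if a not in e]
                candidates.append((heuristic_cost(a, S, rest, C), S2, rest2))
        candidates.sort(key=lambda t: t[0])             # stable: never compares sets
        beams = [(S, rest) for _, S, rest in candidates[:width]]
    return [sorted(S) for S, _ in beams]
```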

Suppose that the partially labeled data contains \(\lvert U \rvert \) objects described by \(\lvert C \rvert \) condition attributes. The time cost of constructing a semi-supervised discernibility matrix is \(O(\lvert C \rvert \lvert U \rvert ^{2})\). In each iteration, the algorithm selects the optimal three attributes while deleting the corresponding relevant sets from the matrix. In the worst case, the matrix is empty after \(\lvert C \rvert \) rounds of attribute selection. Therefore, the time cost of computing the optimal reducts based on the semi-supervised discernibility matrix is \(O(\lvert C \rvert ^{2} \lvert U \rvert ^{2})\). As a whole, Algorithm 1 has a total time cost of \(O(\lvert C \rvert \lvert U \rvert ^{2} + \lvert C \rvert ^{2} \lvert U \rvert ^{2})\), which simplifies to \(O(\lvert C \rvert ^{2} \lvert U \rvert ^{2})\), and a total space cost of \(O(\lvert C \rvert \lvert U \rvert ^{2})\).

3.3 Multi-view tri-trade model for partially labeled data

In classic rough sets-based learning methods, the model typically employs a single classifier and mainly addresses labeled data. However, partially labeled data usually comprise relatively few labeled objects and a considerable quantity of unlabeled ones. When labeled data are limited, a learning model with a single classifier can hardly provide satisfactory results. Tri-training is a disagreement-based model that has been proven to be effective for partially labeled data [40]. Unfortunately, tri-training suffers from the low diversity of its base learners and the poor quality of the selected unlabeled data. Based on Algorithm 1, we can obtain three optimal reducts of partially labeled data. Since each reduct is a jointly sufficient attribute subset that can completely describe the overall data, and the beam search process, which starts from different branches in parallel, ensures a certain diversity among the reducts, each reduct can be approximated as a sufficient and redundant view. Thus, we can utilize these reducts to improve tri-training.

In addition, not all unlabeled data are conducive to the learning model, and the selection of unlabeled objects is another key factor for the success of semi-supervised learning. Standard tri-training generates pseudo-labeled objects by majority voting. More specifically, if two classifiers make a consistent prediction on an unlabeled object, this object is annotated with the pseudo-label and considered a useful object for updating the third classifier. However, in some circumstances (particularly in early iterations), since the initially labeled objects are insufficient to train strong base classifiers, a considerable number of objects may be wrongly classified.

The data editing technique is a commonly used method for error estimation; it aims to enhance the quality of the training set by identifying and excluding mislabeled objects from the learning process. To improve the quality of the training set, Zhang et al. [49] proposed a co-trade model based on data editing to improve co-training. The co-trade model first constructs a weighted graph over the labeled and unlabeled objects using the k-nearest neighbor method to describe their proximity in the attribute space. Based on the manifold assumption that objects with high similarity in the input space should have similar labels, the cut edge weight statistic is then used to explicitly evaluate the labeling confidence of unlabeled objects. Through the data editing technique and the co-training mechanism, the co-trade model can obtain high-quality labeled object sets to improve the base classifiers. Motivated by this, the data editing technique is introduced into tri-training to improve the quality of the generated pseudo-labeled objects. More specifically, in each iteration of tri-training, two of the classifiers use the data editing technique to explicitly estimate the labeling confidence of unlabeled objects and collaboratively select unlabeled objects to generate pseudo-labels for the third classifier. This process is described by Algorithm 2.
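As a rough illustration of the editing idea (not the exact cut edge weight test of [49]), the following Python sketch scores each pseudo-labeled object by how strongly its k nearest neighbors agree with its label; the names and the distance measure are assumptions:

```python
import numpy as np

def labeling_confidence(X, y, k=5):
    """Neighborhood-agreement sketch: an object whose nearest neighbors
    mostly share its (pseudo-)label gets high confidence.

    X: (n, m) array of encoded attribute values; y: array of pseudo-labels.
    A simplified stand-in for the cut edge weight statistic, not that test.
    """
    n = len(X)
    conf = np.zeros(n)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)   # Hamming-like distance on codes
        dist[i] = np.inf                      # exclude the object itself
        nbrs = np.argsort(dist)[:k]
        w = 1.0 / (1.0 + dist[nbrs])          # closer neighbors weigh more
        agree = (y[nbrs] == y[i]).astype(float)
        conf[i] = (w * agree).sum() / w.sum() # 1 - normalized cut-edge weight
    return conf
```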

Algorithm 2 The selection of high-confidence unlabeled data using data editing.

In Algorithm 2, all parameters are initialized first (step 1 and step 2). In the iterative process, the objects in the unlabeled set are first predicted by the classifiers clf1 and clf2 under each view (step 4). Then, a neighborhood graph is constructed to explicitly evaluate the labeling confidence (step 5). Under each view of the unlabeled set, the unlabeled data are sorted by labeling confidence in descending order, and an object subset \(N_{i}^{*}\) is chosen with the minimal expected prediction error \(\epsilon ^{\prime }_{i}\). Finally, the two classifiers share labeling information to refine each other (step 6 and step 7). The iterative process terminates when the prediction error of either classifier increases on the original labeled set or when the expected prediction error of the classifiers no longer decreases. After the iteration stops, the two classifiers return the pseudo-labeled objects on which they make the same prediction over \(N_{1}^{*} \cup N_{2}^{*}\) (step 13). Since the unlabeled sets N1 and N2 have been filtered by the data editing technique and the initial classifiers clf1 and clf2 are improved by the co-training process, the final two classifiers can be combined to yield pseudo-labeled objects with high confidence.
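A loose Python sketch of the final selection step (step 13); the classifier interface follows scikit-learn's `predict` convention, and the hypothetical parameter `top` stands in for the subset size chosen by the expected-error criterion \(\epsilon ^{\prime }_{i}\):

```python
import numpy as np

def co_edit_select(clf1, clf2, X_N, conf1, conf2, top=50):
    """Each view keeps its most confident objects (per the editing
    statistic above); only objects on which the two refined classifiers
    agree are returned with their pseudo-labels."""
    keep1 = set(np.argsort(-conf1)[:top])     # view 1: highest confidence
    keep2 = set(np.argsort(-conf2)[:top])     # view 2: highest confidence
    cand = np.array(sorted(keep1 & keep2), dtype=int)
    if cand.size == 0:
        return cand, cand
    p1, p2 = clf1.predict(X_N[cand]), clf2.predict(X_N[cand])
    agree = p1 == p2
    return cand[agree], p1[agree]             # object indices and pseudo-labels
```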

Without loss of generality, assume that the partially labeled data has \(\lvert L \rvert \) labeled objects and \(\lvert N \rvert \) unlabeled objects described by \(\lvert C \rvert \) attributes, and that the time cost of training a base classifier is approximately \(O(\lvert C \rvert \lvert U \rvert )\). The time and space costs of constructing the neighborhood graph are both approximately \(O(\lvert U \rvert ^{2})\). In each iteration, the two classifiers provide new pseudo-labeled objects for each other. Since the iterations converge quickly, the time cost of training these classifiers is approximately \(O(\lvert C \rvert \lvert U \rvert )\). Thus, the total time cost of Algorithm 2 is \(O(\lvert U \rvert ^{2})\), and its space cost is \(O(\lvert U \rvert ^{2})\).

To optimize the tri-training model, the tri-trade model is developed. By applying the beam search-based attribute reduction algorithm to the semi-supervised discernibility matrix, three distinct attribute subsets are generated from the original attribute set, on which three base classifiers are trained. By utilizing the proposed data editing technique, the quality of unlabeled objects is explicitly estimated and labeling information is reliably shared. The tri-trade procedure is presented in Algorithm 3.

Algorithm 3 Tri-trade model for partially labeled data.

In Algorithm 3, three base classifiers are trained on three distinct reducts of the original condition attributes. After initializing all parameters, the classifiers iteratively learn from each other on unlabeled data. More specifically, in each round of tri-training, the classification error rate of each classifier is first estimated. Since it is difficult to estimate the classification error on the unlabeled set, only the original labeled set is tested here, based on the heuristic assumption that the unlabeled objects have a similar distribution to the labeled objects. In detail, the estimated classification error rate is the proportion of the objects misclassified by both clfj and clfk among the objects consistently predicted by clfj and clfk. When there is no degradation in the performance of the combination of clfj and clfk, high-confidence labeled objects are selected by the data editing technique, and clfi is updated with a certain number of newly labeled objects; otherwise, the classifier clfi does not change. It should be noted that the three classifiers are updated simultaneously at the end of each round, so the data editing process for one classifier has no impact on the updates of the others within the same round. When no classifier can be updated, the algorithm terminates and yields the final classifier by combining the three retrained classifiers.
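The error estimate described above can be written compactly; a sketch assuming numpy arrays and scikit-learn-style classifiers:

```python
import numpy as np

def joint_error(clf_j, clf_k, X_L, y_L):
    """Estimated error rate of the pair (clf_j, clf_k) on the original
    labeled set: among the objects on which the two classifiers agree,
    the fraction that they both misclassify."""
    pj, pk = clf_j.predict(X_L), clf_k.predict(X_L)
    agree = pj == pk
    if not agree.any():
        return 0.5                    # no agreement: no usable evidence
    return float(((pj != np.asarray(y_L)) & agree).sum() / agree.sum())
```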

Assume the partially labeled data have \(\lvert L \rvert \) labeled objects and \(\lvert N \rvert \) unlabeled objects described by \(\lvert C \rvert \) attributes, where \(\lvert U \rvert =\lvert L \rvert +\lvert N \rvert \). In each round of tri-training, two of the classifiers iteratively label objects for the third classifier using the data editing technique in Algorithm 2. The time cost of this process is \(O(\lvert U \rvert ^{2}+\lvert C \rvert \lvert U \rvert )\), and its space cost is \(O(\lvert C \rvert \lvert U \rvert )\). In the worst case, Algorithm 3 terminates after \(\lvert N \rvert \) rounds of tri-training. Therefore, based on three distinct reducts of a given partially labeled dataset, the time cost of Algorithm 3 is at most \(O(\lvert U \rvert ^{3})\), and its total space cost is approximately \(O(\lvert C \rvert \lvert U \rvert )\).

3.4 Theoretical analysis of the tri-trade model effectiveness

In the tri-training model, the resampled data may deviate from the original data distribution; moreover, the generated data exhibit high redundancy. Unlike the tri-training model, the tri-trade model trains base classifiers on three distinct reducts. From the perspective of attribute reduction, each reduct is a jointly sufficient subset of attributes and preserves the same discriminating power as the original attribute set. Furthermore, the beam search attribute reduction algorithm ensures that the three reducts share as few attributes as possible; thus, each reduct describes the original data from a distinct view. The studies in [46] demonstrated that the co-training process can work well when the classifiers have a large diversity, which underpins the effectiveness of the proposed model for partially labeled data.

Another key factor for the success of tri-training is the quality of unlabeled objects. The tri-trade model employs the data editing technique to explicitly estimate the labeling confidence of unlabeled objects and utilizes the jointly predictive results of two classifiers. Additionally, a certain number of useful objects are selected for the third classifier to update only if the estimated performance of the classifier does not deteriorate. In essence, the principles of noise learning are implicitly embedded in the tri-training model. According to noise learning theory [49], the following formula holds:

$$ m=\frac{c}{\epsilon^{2}(1-2\eta)^{2}} , $$
(10)

where 𝜖 is the expected worst classification error rate, η denotes the upper bound of the classification noise rate, and c is a constant for a specific learning task. By reformulating equation (10), the following utility function for the classification noise rate can be derived:

$$ u=\frac{c}{(1-2\eta)^{2}}=m\epsilon^{2} , $$
(11)

To lower the classification noise rate, the utility function should be reduced in each iteration, i.e., \(u^{\prime }<u\). The following inequality can be obtained:

$$ m^{\prime}{\epsilon^{\prime}}^{2} < m\epsilon^{2} , $$
(12)

Since the iterative process satisfies \(\epsilon ^{\prime } < \epsilon \), a sufficient condition for inequality (12) is:

$$ m'{\epsilon^{\prime}} < m\epsilon , $$
(13)

and the following constraints can be derived:

$$ 0<\frac{\epsilon^{\prime}}{\epsilon}< \frac{m}{m^{\prime}} $$
(14)

Note that \(m^{\prime }{\epsilon ^{\prime }}\) may not be smaller than m𝜖 since \(m^{\prime }\) may be much larger than m. In this case, the function \(subsample(L^{\prime }_{i},\left \lceil \frac {\epsilon l_{i}}{\epsilon ^{\prime }_{i}}-1 \right \rceil )\) randomly selects a certain number of objects from \(L^{\prime }_{i}\). Let the integer s denote the size of \(L^{\prime }_{i}\) after subsampling. If it satisfies:

$$ s=\left \lceil \frac{m\epsilon }{\epsilon^{\prime}}-1 \right \rceil $$
(15)

the constrained condition of inequality (13) can be satisfied as well. In this case, m needs to satisfy the following condition:

$$ m>\frac{\epsilon^{\prime}}{\epsilon - \epsilon^{\prime}}, $$
(16)

According to inequality (14), the proposed tri-trade model updates a classifier on some unlabeled objects only when the estimated error rate does not increase. According to (15) and (16), the classifier selects a certain number of unlabeled objects in each iteration to satisfy the constraint of inequality (13), thus reducing (or at least maintaining) the classification noise rate. Therefore, the tri-trade model can make efficient use of unlabeled data to enhance its performance.
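A small numeric illustration of these constraints (the numbers are hypothetical): with m = 100 and 𝜖 = 0.30 in the previous round, growing the training set to m′ = 150 at 𝜖′ = 0.25 violates inequality (12), so the set is subsampled per equation (15):

```python
import math

eps, m = 0.30, 100            # previous round: error rate and training-set size
eps_new, m_new = 0.25, 150    # current round after adding pseudo-labeled objects

if m_new * eps_new ** 2 < m * eps ** 2:       # inequality (12) already satisfied
    used = m_new                              # keep all pseudo-labeled objects
else:
    used = math.ceil(m * eps / eps_new - 1)   # subsample per equation (15)
    assert used * eps_new < m * eps           # inequality (13) is restored
print(used)                                   # 119: 119 * 0.25 = 29.75 < 30.0
```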

4 Empirical analysis

The experiments serve two purposes. One is to evaluate the effectiveness of the discernibility matrix-based semi-supervised attribute reduction algorithm; the other is to compare the performance of the proposed model with other semi-supervised methods. All experiments were carried out on a Windows 10 machine with an Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz and 32GB RAM, and all code was implemented in Python 3.7 using PyCharm.

4.1 Investigated datasets and experimental design

In the experiments, twelve UCI datasets are tested. It should be noted that some of the datasets are multiclass. To construct binary classification datasets, the class with the most objects is treated as the positive class, and the remaining objects are grouped into the negative class. Table 3 reports detailed information about all datasets. The first column of the table contains the names of the selected datasets; the second column \(\lvert C \rvert \) and the third column \(\lvert U \rvert \) give the number of attributes and objects in each dataset, respectively; the fourth column “POS/NEG” gives the percentages of positive and negative objects; the fifth column “Missing” indicates whether the dataset has missing values; and the last column “Inconsistency” records the number of inconsistent objects. Note that none of these datasets comes with naturally partitioned, redundant views.

Table 3 Investigated datasets

To facilitate the experiments, the missing values of attributes in each dataset are replaced by the average or most frequent values of the respective attributes. Since the proposed model is designed for partially labeled data with categorical attributes, each numerical attribute must be discretized into a categorical one. Due to its simplicity and effectiveness, equal-frequency binning with three bins [50] is employed in the experiments.
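For instance, a numerical column can be discretized with pandas (a sketch; the values are made up):

```python
import pandas as pd

# a hypothetical numerical attribute discretized into three equal-frequency bins
col = pd.Series([0.2, 1.5, 3.1, 4.8, 5.0, 7.7, 8.2, 9.9, 10.4])
binned = pd.qcut(col, q=3, labels=["low", "mid", "high"])
print(binned.tolist())   # ['low', 'low', 'low', 'mid', 'mid', 'mid', 'high', 'high', 'high']
```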

To evaluate the performance of the proposed method, ten 10-fold cross-validation tests are employed in the experiments. In each fold, 90% of the objects are selected as the training set, while the remaining 10% are treated as the test set. According to the label rate, the training set is further randomly divided into a set of labeled objects L and a set of unlabeled objects N. The label rates include 1%, 5%, 10%, 15%, and 20%. Under each label rate, the training set is divided ten times independently and randomly. For instance, assuming that there are 1000 objects, 900 objects in each fold are selected as the training set and the remaining 100 objects are regarded as the test set. When the label rate is 10%, 90 objects with labels are placed in the labeled set L, while the remaining 810 objects, after removing their labels, are placed in the unlabeled set N; this division between L and N is repeated ten times independently and randomly. Here, the POS/NEG ratio in L, N, and the test set is kept consistent with that of the original dataset.
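The stratified L/N split within a training fold can be reproduced with scikit-learn (a sketch with synthetic data standing in for a real fold):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical training fold: 900 objects with 10 categorical attributes
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(900, 10))
y_train = rng.integers(0, 2, size=900)

# 10% label rate, stratified so L and N keep the original POS/NEG ratio
X_L, X_N, y_L, y_N = train_test_split(
    X_train, y_train, train_size=0.10, stratify=y_train, random_state=0)
# y_N is discarded downstream: N is treated as unlabeled during training
```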

To investigate the effectiveness of the beam search-based attribute reduction algorithm for partially labeled data, all selected datasets are tested at a label rate of 10%, and the attribute reduction results over the ten 10-fold cross-validations are collected. Table 4 shows the statistical results, including the maximum, minimum, and average number of attributes in the reducts, which are listed in the third, fourth, and fifth columns, respectively. In addition, the ground-truth reduct information, i.e., the result of attribute reduction at a 100% label rate, is collected for comparison. The last column provides a comparison between the semi-supervised reduct and the ground-truth supervised reduct, denoted as the approximation rate, i.e., the ratio of the average number of attributes in the semi-supervised reduct to that in the ground-truth supervised reduct.

Table 4 Results of semi-supervised attribute reduction under the label rate of 10%

Table 4 shows that the number of attributes is reduced after semi-supervised attribute reduction. It is worth mentioning that the core attributes are always retained in the reducts. The reason may be that, in the semi-supervised discernibility matrix, some objects can only be discriminated by core attributes. Compared to the ground-truth reduct, the average approximation rate of semi-supervised attribute reduction across all datasets is 72.87%. Notably, at a label rate of 10%, the approximation rates on the datasets “credit-a”, “parkinson”, “turkiye”, and “wall” are greater than 80%. These results indicate the effectiveness of the proposed attribute reduction method for partially labeled data.

4.2 The effectiveness of the tri-trade method

To evaluate the performance of the proposed tri-trade model, we compare it with the unsupervised Laplace score and the supervised Fisher score. Both are standard filter-style methods that assign a score to each attribute and then select the k attributes with the highest scores. The main idea of the Fisher score [51] is to identify attributes with strong distinguishing power, reflected as a small intraclass distance and a large interclass distance. The main idea of the Laplace score [52] is to construct a nearest neighbor graph over all data and evaluate the importance of each attribute according to its locality-preserving power [50]. In addition, we compare the proposed tri-trade model to traditional semi-supervised methods, namely self-training, co-training, co-trade, and tri-training. Self-training [48] is a self-taught method with only a single classifier: the initial classifier is trained on labeled data and is iteratively refined with its most confident self-labeled data. Co-training [23] is a multi-view disagreement-based method. It trains two initial classifiers on two attribute sets, and in each iteration one classifier updates the other with high-confidence objects. Since most datasets lack naturally partitioned views, the requirements of co-training are difficult to satisfy. However, it has been proven [46] that unlabeled objects can still improve the performance of co-training by randomly splitting the original attribute set into two subsets. Therefore, in the experiments, the attributes in each dataset are randomly partitioned into two disjoint sets of nearly equal size. Co-trade [49] improves on co-training: it selects high-quality unlabeled objects by a data editing technique to refine the base classifiers. Tri-training [40] is another disagreement-based method, but it uses three base classifiers. The settings for all selected methods are shown in Table 5.

Table 5 Experimental settings

In Table 5, to demonstrate the potential of the proposed tri-trade model, fully supervised learning, i.e., learning at a label rate of 100%, is set up for comparison. In addition, supervised learning is also performed on the reducts obtained from attribute reduction based on the Laplace score over L ∪ N and the Fisher score over L, respectively. The number of attributes k remains the same as in the optimal reduct of the semi-supervised discernibility matrix-based method. In addition, self-training trains its base classifier on the optimal reduct obtained from attribute reduction based on the semi-supervised discernibility matrix over L ∪ N. In co-training and co-trade, two views are generated by randomly dividing the original attribute set into two disjoint subsets of equal size. Tri-training obtains three labeled object subsets by resampling over L, while the tri-trade model obtains three views by semi-supervised discernibility matrix-based attribute reduction over L ∪ N. In the proposed tri-trade model, the maximum number of iterations of data editing is set to 30; however, empirical results reveal that the training process terminates within 10 rounds in most cases. It should be noted that, compared to co-training, co-trade, and tri-training, the tri-trade model uses a subset of attributes rather than all of them.

In the experiments, the label rate is set to 10%. Two different base classifiers, J48 and Naive Bayes, are employed, and ten 10-fold cross-validations are performed to evaluate the performance. The average classification error rates of the selected methods are recorded in Tables 6 and 7. The column “Max” indicates the average error rate of fully supervised learning on each dataset, and the third to ninth columns give the average error rates of the other methods in Table 5. The “Avg.” row shows the average classification error rate of each method over all selected datasets. The best classification result among all methods on each dataset is highlighted in bold.

Table 6 Average performance of the selected methods using J48 classifier at 10% label rate
Table 7 Average performance of the selected methods using Naive Bayes classifier at 10% label rate

As shown in Tables 6 and 7, there is a significant difference in performance among the selected methods. When evaluated by the number of datasets with the best classification performance, the proposed tri-trade model is always the winner. More specifically, when using J48, the tri-trade model wins on 7 out of 12 datasets, while the other methods win on at most 2; when using Naive Bayes, the tri-trade model wins on 8 out of 12 datasets, while the other methods win on at most 2. When evaluated by the average classification error rate, the tri-trade model achieves 23.02% when using J48, which outperforms the fully supervised method (23.24%); in contrast, all other methods perform worse than the tri-trade model. Impressively, the average classification error rate of the tri-trade model is 27.35% when using Naive Bayes, which is even better than that of the fully supervised method (29.66%). In summary, the classification performance of the tri-trade model surpasses the other semi-supervised methods and is even better than that of the fully supervised method. These results indicate that the tri-trade model can effectively exploit unlabeled data to enhance its performance.

To further evaluate the potential of the proposed model, the methods in Table 5 are run at label rates of 1%, 5%, 10%, 15%, and 20%. Figures 2 and 3 show the average error rates of all methods. Note that “Max” refers to the performance of a single classifier at a label rate of 100%.

Fig. 2 Average error rates of the selected methods under different label rates (J48)

Fig. 3 Average error rates of the selected methods under different label rates (Naive Bayes)

As shown in the figures, both the Fisher score and the Laplace score perform poorly on most datasets, since their reducts lose discernibility and they do not utilize unlabeled data to improve the classifier. For instance, in Figs. 2(a), 3(b), (c), and (e), the performance of both the Fisher score and the Laplace score is far inferior to that of the other methods.

Self-training is a single-view model in which unlabeled data are self-labeled. In general, self-training yields less desirable outcomes, as illustrated in Figs. 2(g), (h), 3(f), (g), (h), and (l). One reason may be that the initially labeled data are not representative, so the generalization ability of the base classifier is unstable. Furthermore, the utilization of unlabeled objects affects performance: self-training inevitably mislabels some objects, which further degrades performance.

Co-training and co-trade are disagreement-based methods that employ two classifiers, but their overall performance is not satisfactory. The main reason is that the co-training paradigm requires the original attribute set to be naturally partitioned, while in the experiments the subspaces for the two classifiers are formed by randomly splitting the whole attribute set in half. Obviously, this does not guarantee the quality of the two base classifiers. Therefore, the label information exchanged by the two classifiers may contain noise, even though co-trade imposes a restriction on the exchange of unlabeled objects, resulting in poor performance. Figures 2(c), (d), and (l) demonstrate this trend.

Tri-training employs three classifiers to determine which unlabeled objects to label, yet its performance remains deficient. The reason may be twofold: the quality of the resampled data is not guaranteed, leading to a lack of diversity among the generated classifiers, and the pseudo-labels produced by majority voting are insufficiently accurate. These factors lead to the unstable performance of tri-training, which can be observed in Fig. 3(e), (i), and (k).

Compared to the resampling operation in tri-training, the tri-trade model trains its base classifiers on distinct reduced subspaces, each of which is a sufficient attribute subset that maintains the same discriminating power as the whole attribute set. By exploiting unlabeled data, the tri-trade model achieves impressive performance. It estimates the labeling confidence explicitly and generates the pseudo-labeled objects using two enhanced classifiers. It carefully selects unlabeled objects for learning, and the base classifiers are updated only when the pseudo-labeled objects have a positive influence. Therefore, the tri-trade model can enhance performance by utilizing truly useful unlabeled objects.

Overall, the proposed tri-trade model outperforms all the other methods under different label rates. Note that on some large datasets, such as “polish”, “turkiye”, and “wall”, the number of labeled objects is sufficient to train a powerful classifier even when the label rate is below 20%. However, even in these cases, the tri-trade model is still effective. In addition, the performance on some datasets is especially excellent, as seen in Figs. 2(f), (g), and 3(j), where the average error rate of the tri-trade model is much lower than that of the other methods. These experimental results demonstrate the superiority of the semi-supervised attribute reduction as well as the data editing technique, showing that the tri-trade model has great potential for learning from partially labeled data.

5 Conclusion

In many real-world scenarios, unlabeled data are massive while labeled data are scarce. The strategy of selecting and utilizing unlabeled data is thus essential for learning from partially labeled data. In this study, a novel tri-trade model is proposed for partially labeled data. To obtain multiple distinct views from partially labeled data, a semi-supervised attribute reduction algorithm based on the discernibility matrix is developed. Moreover, a new data editing technique is introduced to explicitly estimate the labeling confidence and to cautiously select unlabeled objects to improve the base classifiers. Theoretical analysis and comparative experiments on UCI datasets reveal that the proposed tri-trade model achieves prominent performance compared with other methods. Admittedly, the proposed model is only applicable to partially labeled data with categorical attributes, which means that numerical attributes must be discretized. Extending the model to deal with partially labeled data containing both categorical and numerical attributes is worth investigating in the future. Additionally, it is worthwhile to explore other effective strategies for evaluating the labeling confidence of unlabeled data.