1 Introduction

Classification problems are ubiquitous in daily life. In artificial intelligence, how to tackle classification problems effectively by applying classification algorithms has become an important research direction. Among the many classification algorithms, the decision tree algorithm is regarded as one of the key approaches to classification, owing to its high-quality classification results and interpretable classification process. As the name suggests, a decision tree is a tree-shaped classifier built from decision rules, and its construction can be summarized in two parts: (1) create a split node using a decision rule; (2) create branches and leaf nodes [21]. Accordingly, there are two popular research directions in decision tree learning, namely the decision rule (also known as the partition measure) and the structure of the tree classifier. In this paper, we concentrate on the decision rule. Decision rules can be derived from many theories, but for a decision tree algorithm the selection of a specific rule depends on whether it yields correct decision logic. Therefore, finding the most appropriate decision rule among the various candidate theories has become an important research direction in decision tree learning.

As is well known, the representative decision tree algorithms include Iterative Dichotomiser 3 (ID3) [24], C4.5 [25] and the Classification And Regression Tree (CART) [1]: ID3 uses information gain from information theory as the decision rule, C4.5 uses the information gain ratio, and CART uses the Gini index from probability theory. On the basis of these three algorithms, many researchers have constructed novel decision tree algorithms based on different theories [6]. Laber et al. [11] proposed an efficient algorithm with a penalty term in the loss function to support the construction of shallow decision trees. Klaus et al. [2] proposed a gradient-based model tree algorithm that improves classification accuracy while maintaining interpretability. To deal with continuous data, the classical decision tree algorithms usually resort to discretization. However, discretization typically destroys the information structure of the data, which may degrade the classification results. As a result, researchers have turned to other, more adaptable theories.

Rough set theory, proposed by Pawlak [23], is a widely used tool in feature selection [3, 5, 10, 20, 32, 35]. After years of development, various variants of rough set theory have been proposed for handling complex data environments [31, 36]. For example, to avoid the problems caused by the rigorous equivalence relation, Ziarko proposed the variable precision rough set model, which introduces a variable precision threshold. To deal with continuous data, Lin et al. [12, 13] proposed the neighborhood rough set model, which introduces a neighborhood structure into rough sets; it is the most commonly used method for handling continuous data in rough set theory. Therefore, rough set theory, with its strong generalizability, is well suited to decision tree learning [9]. For instance, Wang et al. [27] proposed a method that combines ordinal decision trees and fuzzy rough set-based attribute reduction. Yao et al. [33] introduced the notion of attribute purity degree and proposed a decision tree algorithm based on attribute dependency and this notion. Zhai et al. [34] proposed an algorithm called the "tolerance rough fuzzy decision tree", which is based on tolerance rough fuzzy sets.

As mentioned above, the neighborhood rough set model is an effective extension of the Pawlak rough set model for continuous data [16, 30]. It can handle continuous data without discretization, thereby avoiding the loss of information structure that discretization causes. Many scholars have explored improved decision tree algorithms based on the neighborhood rough set model. Xie et al. [28] proposed the neighborhood equivalence granule structure based on neighborhood algebraic similarity and a variable precision threshold; they further introduced the notion of neighborhood Gini index and proposed a neighborhood decision tree algorithm based on variable-precision neighborhood equivalence granules. Liu et al. [15] proposed an improved ID3 algorithm (called DIGGI) based on variable precision neighborhood rough sets, and the experimental results show that DIGGI is effective and considerably more accurate. Liu et al. [14] pointed out that a neighborhood system possesses a geometric structure, and proposed the notion of neighborhood geometric similarity to avoid the contradiction with the transitivity of the equivalence relation that arises when only the neighborhood algebraic similarity is considered. They then proposed an improved decision tree algorithm that uses, as its decision rule, the attribute dependency in the variable precision neighborhood rough set induced by this novel neighborhood similarity. However, it should be noted that the boundary region is not considered in either of these two algorithms.

In rough set theory, the boundary region consists of those elements that cannot be described precisely by the existing knowledge when approximating a target set. Therefore, the boundary region is regarded as one of the critical clues to the uncertainty of knowledge. In fact, the boundary region not only serves as an important explanation of the uncertainty of knowledge, but can also be regarded as an essential description of the inherent extent of that uncertainty. When describing the relationship between two specific concepts within the existing knowledge, the boundary region may be an appropriate option [18].

The application of the boundary region can be divided into the following two parts [19]. First, computing the boundary region tells us whether the target set can be represented precisely by the existing knowledge, that is, whether a boundary region exists at all. Second, the boundary region also reflects how difficult it is to apply the existing knowledge to the approximate representation of a concept, that is, the size of the boundary region. Therefore, the boundary region can be used not only for the qualitative analysis of the uncertainty of knowledge, but also for its quantitative analysis.

However, some problems arise when we attempt to apply the boundary region to the comparative quantitative analysis of multiple attributes. For instance, the boundary region is a set of elements, and sets cannot be compared directly. Even though quantification can be achieved by taking the cardinality of this set, the comparison remains unconvincing because the cardinality of the boundary region has no fixed range. Since such a comparative quantitative analysis is a novel way, beyond attribute dependency, of exploring the relationship among multiple attributes, it is worth studying. Furthermore, if this comparative quantitative analysis is used to describe the relationship between different condition attributes and the decision attribute, we can obtain the degree of difficulty of approximately representing the decision attribute by each condition attribute, that is, the influence of each condition attribute on the decision. Hence, boundary region-based decision rules may be well suited to decision tree learning. For instance, Luo et al. [17] proposed a new feature evaluation criterion that combines the dependency measure defined on the positive region with the class separability degree of the boundary region. Parthaláin et al. [22] examined a rough set-based feature selection technique that considers the number of objects in the boundary region and the distance of those objects from the lower approximation. However, these works apply specific samples within the boundary region rather than the boundary region itself. Therefore, in this paper, we focus on the direct application of the boundary region.

At present, the application of a single decision rule still has problems, even when a boundary region-based measure is used. For example, different attributes may attain the same value of the single decision rule, forcing the algorithm to select an attribute at random, and a single rule cannot examine attributes comprehensively. Therefore, to avoid equal values on different attributes, it is necessary to introduce multiple decision rules.

Hence, to construct a boundary region-based measure that can be mixed with the attribute dependency, two requirements are considered. First, the boundary region-based measure should exert the same influence on the mixed measure as the attribute dependency; otherwise, the mixed measure would be biased towards one of its components when evaluating attributes. Second, the boundary region-based measure should have the same monotonicity as the attribute dependency; otherwise, the mixed measure could not faithfully reflect the true situation of the attributes.

In this paper, we first use the boundary region to induce a measure, called the boundary coefficient, that reflects the degree of difficulty in the approximation between two attributes. Second, we propose the notion of boundary mixed attribute dependency by combining the boundary coefficient with the attribute dependency, so that the new measure takes both the boundary region and the attribute dependency into account. Finally, we adopt the boundary mixed attribute dependency as the decision rule to construct a novel decision tree algorithm. The experimental results show that, with a slight increase in the number of leaf nodes, the total running time decreases and the maximum average accuracy reaches 0.9518.

The contributions of this paper are given as follows. (1) We use the boundary region to induce a novel measure, i.e., boundary coefficient; (2) We propose the notion of boundary mixed attribute dependency by combining the boundary coefficient with the attribute dependency; (3) A novel decision tree algorithm is constructed based on the boundary mixed attribute dependency.

The remaining part of this paper is organized as follows. Section 2 reviews the neighborhood rough set model, the similarity in neighborhood system and the variable precision neighborhood rough set model. Section 3 gives the definitions of boundary coefficient and boundary mixed attribute dependency, and proposes the novel decision tree algorithm. Section 4 shows the experimental results. Finally, Section 5 concludes the paper.

2 Preliminaries

2.1 Neighborhood rough set model

As an extension of Pawlak rough set model, the neighborhood rough set model has received much attention from scholars, since it can deal with continuous data without discretization [13].

In rough sets, an information table is a 4-tuple \(\left\langle U, A, f, V^{subB} \right\rangle \), where U is a sample set called the universe, A is an attribute set describing the sample characteristics, f is an information function, and \(V^{subB}\) is the value range of the attribute subset \(subB \subseteq A\). If the attribute set A is divided into two disjoint subsets C and D, then the information table is also called a decision table, where C is the set of condition attributes and D is the set of decision attributes.

Given a continuous information table \(\left\langle U, A, f, V^{subB} \right\rangle \), for any two samples \(s^a, s^b \in U\) and any attribute subset \( subB \subseteq C\), the Minkowski distance between \(s^a\) and \(s^b\) under subB is defined as follows:

$$\begin{aligned} MDF^{subB} \left( s^a,s^b \right) =\left( \sum \limits _{g\in subB}{\left| s_{g}^{a}-s_{g}^{b} \right| }^p \right) ^{\frac{1}{p}}, \end{aligned}$$
(1)

where \(s_{g}^{a}\) and \(s_{g}^{b}\) respectively denote the attribute values of samples \(s^a\) and \(s^b\) on attribute g, and MDF denotes the distance metric.

To simplify the distance calculation, in this paper we set \(p=1\); in this case, MDF is also called the Manhattan distance.

Given a continuous information table \(\left\langle U, A, f, V^{subB} \right\rangle \), for any sample \(s^a \in U\) and any attribute subset \(subB \subseteq C\), let \(\eta \) denote the neighborhood radius; the \(\eta -\)neighborhood class of \(s^a\) under the attribute subset subB is defined as follows [7]:

$$\begin{aligned} N^{subB}_\eta \left( s^a \right) =\left\{ s^b \mid MDF^{subB} \left( s^a,s^b \right) \le \eta \right\} . \end{aligned}$$
(2)

Formula (2) is also called the \(\eta -\)neighborhood granulation, and \(N^{subB}_\eta \left( s^a \right) \) is also called the \(\eta -\)neighborhood granule of \(s^a\).
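To make the two granulation formulas concrete, the following minimal Python sketch (hypothetical helper names, not the authors' code) computes the Manhattan distance of formula (1) with \(p=1\) and the \(\eta\)-neighborhood granule of formula (2) on a small toy continuous decision table.

```python
# Minimal sketch (hypothetical names, not the authors' code): the Manhattan
# distance of formula (1) with p = 1 and the eta-neighborhood granule of
# formula (2), evaluated on a small toy continuous decision table.
import numpy as np

def manhattan(data, a, b, subB):
    """MDF^subB(s^a, s^b) with p = 1: sum of absolute differences on subB."""
    return float(np.sum(np.abs(data[a, subB] - data[b, subB])))

def neighborhood(data, a, subB, eta):
    """N^subB_eta(s^a): indices of all samples within distance eta of s^a."""
    return {b for b in range(len(data)) if manhattan(data, a, b, subB) <= eta}

# toy universe U of 5 samples described by 2 continuous condition attributes
data = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.80, 0.70],
                 [0.82, 0.68],
                 [0.50, 0.50]])
subB = [0, 1]   # attribute subset subB = C
eta = 0.2       # neighborhood radius
print(neighborhood(data, 0, subB, eta))   # {0, 1}: the eta-neighborhood granule of s^1
```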

Let \(\left\langle U, NR \right\rangle \) be a neighborhood approximation space, where U represents the universe and NR represents the neighborhood relation. For any target set \(X \subseteq U\), the lower and upper approximations of X are defined as follows [7]:

$$\begin{aligned} \underline{NR}X=\left\{ s^a \in U \mid \eta \left( s^a\right) \subseteq X \right\} . \end{aligned}$$
(3)
$$\begin{aligned} \overline{NR}X=\left\{ s^a \in U \mid \eta \left( s^a\right) \cap X\ne \emptyset \right\} . \end{aligned}$$
(4)
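As an illustration of formulas (3) and (4), the following sketch (reusing the `neighborhood` helper and the toy `data`, `subB` and `eta` from the previous sketch; the helper name `lower_upper` is hypothetical) collects the samples whose granules are contained in, or intersect, a target set.

```python
# Sketch of formulas (3) and (4), reusing `neighborhood`, `data`, `subB` and
# `eta` from the previous sketch; the helper name `lower_upper` is hypothetical.
def lower_upper(data, X, subB, eta):
    U = range(len(data))
    granules = {a: neighborhood(data, a, subB, eta) for a in U}
    lower = {a for a in U if granules[a] <= X}   # granule of s^a contained in X
    upper = {a for a in U if granules[a] & X}    # granule of s^a intersects X
    return lower, upper

X = {0, 1, 4}                     # a target set
print(lower_upper(data, X, subB, eta))
```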

2.2 Similarity in neighborhood system

In [14], Liu et al. introduced four similarity measures into the neighborhood system and constructed four neighborhood algebraic similarities, which reflect the degree of similarity between two neighborhood granules. Moreover, Liu et al. pointed out that the neighborhood system has spatial properties; hence, in terms of the distance between two neighborhood granules, they defined a geometric similarity. By combining these two kinds of similarities, four novel neighborhood similarities were constructed.

2.2.1 Neighborhood similarity

Neighborhood similarity is constructed by combining neighborhood algebraic similarity and neighborhood geometric similarity. It considers both the algebraic and the geometric perspectives of the neighborhood system and can resolve the transitivity contradiction of the equivalence relation.

Definition 1

[14] For any \( s^a,s^b \in U \) and attribute subset \( subB\subseteq C \), let \( MDF^{subB}\left( s^a,s^b \right) \) represent the Minkowski distance between samples \(s^a\) and \(s^b\) under attribute subset subB, and let \( \eta \) represent the neighborhood radius. Four neighborhood similarities are respectively defined as follows.

1. The neighborhood Jaccard geometric similarity under attribute subset subB is defined as:

$$\begin{aligned} NRJGS^{subB}\left( s^a,s^b \right) =\left( 1-\frac{MDF^{subB}\left( s^a,s^b \right) }{2\eta } \right) \times \frac{\left| N^{subB}_\eta \left( s^a \right) \cap N^{subB}_\eta \left( s^b \right) \right| }{\left| N^{subB}_\eta \left( s^a \right) \cup N^{subB}_\eta \left( s^b \right) \right| }. \end{aligned}$$
(5)

2. The neighborhood Optimistic geometric similarity under attribute subset subB is defined as:

$$\begin{aligned} NROGS^{subB}\left( s^a,s^b \right) =\left( 1-\frac{MDF^{subB}\left( s^a,s^b \right) }{2\eta } \right) \times \frac{\left| N^{subB}_\eta \left( s^a \right) \cap N^{subB}_\eta \left( s^b \right) \right| }{\min \left( \left| N^{subB}_\eta \left( s^a \right) \right| ,\left| N^{subB}_\eta \left( s^b \right) \right| \right) }. \end{aligned}$$
(6)

3. The neighborhood Pessimistic geometric similarity under attribute subset subB is defined as:

$$\begin{aligned} NRPGS^{subB}\left( s^a,s^b \right) =\left( 1-\frac{MDF^{subB}\left( s^a,s^b \right) }{2\eta } \right) \times \frac{\left| N^{subB}_\eta \left( s^a \right) \cap N^{subB}_\eta \left( s^b \right) \right| }{\max \left( \left| N^{subB}_\eta \left( s^a \right) \right| ,\left| N^{subB}_\eta \left( s^b \right) \right| \right) }. \end{aligned}$$
(7)

4. The neighborhood Average geometric similarity under attribute subset subB is defined as:

$$\begin{aligned} NRAGS^{subB}\left( s^a,s^b \right) =\left( 1-\frac{MDF^{subB}\left( s^a,s^b \right) }{2\eta } \right) \times \frac{\left| N^{subB}_\eta \left( s^a \right) \cap N^{subB}_\eta \left( s^b \right) \right| }{average\left( \left| N^{subB}_\eta \left( s^a \right) \right| ,\left| N^{subB}_\eta \left( s^b \right) \right| \right) }. \end{aligned}$$
(8)

In this paper, we use \( NRS^{subB}\left( s^a,s^b \right) \) to represent the above four neighborhood similarities.
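The following sketch illustrates Definition 1 (hypothetical names, reusing `manhattan` and `neighborhood` from Section 2.1): each similarity multiplies the geometric factor \(1-MDF/2\eta\) by one of the four algebraic overlap ratios of the two neighborhood granules.

```python
# Illustrative sketch of Definition 1 (hypothetical names, reusing `manhattan`
# and `neighborhood` from Section 2.1): each similarity multiplies the
# geometric factor (1 - MDF/2*eta) by one of four algebraic overlap ratios.
def neighborhood_similarity(data, a, b, subB, eta, kind="jaccard"):
    Na = neighborhood(data, a, subB, eta)
    Nb = neighborhood(data, b, subB, eta)
    geometric = 1.0 - manhattan(data, a, b, subB) / (2.0 * eta)
    inter = len(Na & Nb)
    if kind == "jaccard":          # NRJGS, formula (5)
        denom = len(Na | Nb)
    elif kind == "optimistic":     # NROGS, formula (6)
        denom = min(len(Na), len(Nb))
    elif kind == "pessimistic":    # NRPGS, formula (7)
        denom = max(len(Na), len(Nb))
    else:                          # "average": NRAGS, formula (8)
        denom = (len(Na) + len(Nb)) / 2.0
    return geometric * inter / denom

print(neighborhood_similarity(data, 0, 1, subB, eta))   # NRJGS of the first two toy samples
```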

2.3 Variable precision neighborhood rough set model induced by neighborhood similarity

Based on the neighborhood similarity introduced in Section 2.2.1, a variable precision neighborhood rough set model can be constructed. In this model, the classical neighborhood equivalence relation is replaced by the variable precision neighborhood equivalence relation induced by the neighborhood similarity, which avoids the problems caused by the rigorous neighborhood equivalence relation.

Owing to the addition of the variable precision threshold, the variable precision neighborhood equivalence relation can be constructed. To mark this difference, in this paper the decision table \(\left\langle U, A, f, V^{subB} \right\rangle \) is replaced by the variable neighborhood decision table \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \).

Definition 2

[14] Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C \) and any tuple \((s^a, s^b) \in U \times U\), the variable precision neighborhood equivalence relation is defined as follows:

$$\begin{aligned} VPNER_{\left( \eta ,\mu \right) }^{subB}=\left\{ \left( s^a, s^b \right) \in U\times U \mid NRS^{subB}\left( s^a, s^b \right) \ge \mu \right\} . \end{aligned}$$
(9)

Definition 3

[14] Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any sample \(s\in U\) and any attribute subset \(subB\subseteq C\), the variable precision neighborhood equivalence classes induced by the variable precision neighborhood equivalence relation are defined as follows:

$$\begin{aligned} U/VPNER_{\left( \eta ,\mu \right) }^{subB}=\left\{ \left[ s \right] _{VPNER_{\left( \eta ,\mu \right) }^{subB}} \mid s\in U \right\} =\left\{ S_{\left( \eta ,\mu \right) }^{1},\ldots ,S_{\left( \eta ,\mu \right) }^{X} \right\} , \end{aligned}$$
(10)

where \(S_{\left( \eta ,\mu \right) }^{x}\) \((x=1,2,\ldots ,X)\) denotes the x-th variable precision neighborhood equivalence class.

When \(\mu =1\), the variable precision neighborhood granule degenerates into the rigorous neighborhood equivalence granule. When \(\eta =0\) and \(\mu =1\), it further degenerates into the Pawlak equivalence granule. Hence, variable precision neighborhood granules are in fact a specific type of neighborhood granules.
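A minimal sketch of Definitions 2 and 3 follows, under the assumption that the classes of formula (10) are read off directly from the relation of formula (9); the function name `vp_classes` is hypothetical and reuses `neighborhood_similarity` from Section 2.2.1.

```python
# Minimal sketch of Definitions 2 and 3 (hypothetical name `vp_classes`,
# reusing `neighborhood_similarity`): two samples are related when their
# neighborhood similarity reaches the variable precision threshold mu, and the
# classes of formula (10) are read off row by row from that relation.
def vp_classes(data, subB, eta, mu, kind="jaccard"):
    U = range(len(data))
    related = {a: frozenset(b for b in U
                            if neighborhood_similarity(data, a, b, subB, eta, kind) >= mu)
               for a in U}
    return set(related.values())          # the family {S^1, ..., S^X}

print(vp_classes(data, subB, eta, mu=0.7))
```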

Definition 4

[14] Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. Assume that \(U/D=\{Y^1,Y^2,\ldots ,Y^Z\}\) denotes the partition of U induced by D. For any attribute subset \(subB\subseteq C\), the lower approximation and the upper approximation of D are respectively defined as follows:

$$\begin{aligned} \underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=\bigcup _{z=1}^Z{\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}Y^z}, \end{aligned}$$
(11)
$$\begin{aligned} \overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=\bigcup _{z=1}^Z{\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}Y^z}, \end{aligned}$$
(12)

where

$$\begin{aligned} \underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}Y^z=\bigcup \left\{ S_{\left( \eta ,\mu \right) }^{x}\in U/VPNER_{\left( \eta ,\mu \right) }^{subB} \mid S_{\left( \eta ,\mu \right) }^{x}\subseteq Y^z \right\} ,\quad z=1,2,\ldots ,Z, \end{aligned}$$
$$\begin{aligned} \overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}Y^z=\bigcup \left\{ S_{\left( \eta ,\mu \right) }^{x}\in U/VPNER_{\left( \eta ,\mu \right) }^{subB} \mid S_{\left( \eta ,\mu \right) }^{x}\cap Y^z\ne \emptyset \right\} ,\quad z=1,2,\ldots ,Z. \end{aligned}$$

In the above formulas, \(\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D\) is called the positive region of D, denoted by \( POS_{\left( \eta ,\mu \right) }^{subB}D \).

Definition 5

[14] Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) the attribute dependency of D on subB is defined as follows:

$$\begin{aligned} r_{\left( \eta ,\mu \right) }\left( D,subB \right) =\frac{\left| \underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D \right| }{\left| U \right| }. \end{aligned}$$
(13)
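The following sketch (hypothetical names, reusing `vp_classes` from above) illustrates Definitions 4 and 5: the lower approximation of D collects every variable precision class contained in a single decision class, and the attribute dependency of formula (13) is its relative cardinality.

```python
# Sketch of Definitions 4 and 5 (hypothetical names, reusing `vp_classes`):
# the lower approximation of D collects every variable precision class that is
# contained in a single decision class, and formula (13) is its relative size.
def attribute_dependency(data, labels, subB, eta, mu, kind="jaccard"):
    decision_classes = [set(np.where(labels == y)[0]) for y in np.unique(labels)]
    lower = set()
    for S in vp_classes(data, subB, eta, mu, kind):
        if any(S <= Y for Y in decision_classes):   # S is pure: contained in some Y^z
            lower |= S                              # contributes to the positive region
    return len(lower) / len(data), lower            # r(D, subB) and POS(D)

labels = np.array([0, 0, 1, 1, 0])                  # toy decision attribute
r, pos = attribute_dependency(data, labels, subB, eta, mu=0.7)
print(r, pos)
```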

3 An improved decision tree algorithm based on boundary mixed attribute dependency

In this section, we introduce the boundary coefficient and propose a novel decision rule, the boundary mixed attribute dependency, which is induced by the boundary coefficient and the attribute dependency. Moreover, an improved decision tree algorithm is constructed with this novel decision rule.

3.1 Boundary coefficient and boundary mixed attribute dependency

In rough set theory, the boundary region is used to describe the uncertainty when approximating a target set using the existing knowledge. The larger the uncertainty, the larger the boundary region. Therefore, the boundary region is a good perspective for constructing the decision rule in decision tree learning. In this subsection, we first propose a novel measure called boundary coefficient, and then propose the notion of boundary mixed attribute dependency by combining the boundary coefficient with the attribute dependency.

Definition 6

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) the boundary region of D with respect to subB is defined as follows:

$$\begin{aligned} BND_{\left( \eta ,\mu \right) }^{subB}(D)=\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D-\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D. \end{aligned}$$
(14)

Definition 7

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) the boundary coefficient of D with respect to subB is defined as follows:

$$\begin{aligned} BC_{\left( \eta ,\mu \right) }^{subB}(D)=1-\frac{\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| }{\left| U \right| }. \end{aligned}$$
(15)

The basic idea behind the definition of the boundary coefficient can be divided into two parts. (1) To ensure that the boundary coefficient and the attribute dependency exert the same weighting influence on the final mixed decision rule, the value range of the boundary coefficient should be [0,1]. We therefore normalize the cardinality of the boundary region by the cardinality of the universe, i.e., \(\frac{\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| }{\left| U \right| }\), so that the computation of the final mixed decision rule is biased towards neither the boundary coefficient nor the attribute dependency.

(2) To ensure that the boundary coefficient acts on the final mixed decision rule in the same way as the attribute dependency, its monotonicity should match that of the attribute dependency, that is, the larger the attribute dependency, the larger the boundary coefficient. Since the monotonicity of \(\frac{\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| }{\left| U \right| }\) is opposite to that of the attribute dependency, we take \(1- \frac{\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| }{\left| U \right| }\) as the boundary coefficient of D with respect to subB. This avoids the situation in which a larger attribute dependency corresponds to a smaller boundary coefficient, so that the computation result of the final mixed decision rule correctly reflects the true situation of the attributes.

The boundary coefficient is a measure induced by the boundary region; it reflects the degree of roughness, or difficulty, encountered when approximating the target set.
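A minimal sketch of Definitions 6 and 7, under the same toy setting and reusing `vp_classes` and the helpers from Section 2, is given below; the names are hypothetical.

```python
# Sketch of Definitions 6 and 7 under the same toy setting: the boundary
# region of formula (14) is the difference between the upper and lower
# approximations of D, and the boundary coefficient of formula (15) rescales
# its size to [0, 1] so that it increases together with the attribute dependency.
def boundary_coefficient(data, labels, subB, eta, mu, kind="jaccard"):
    decision_classes = [set(np.where(labels == y)[0]) for y in np.unique(labels)]
    classes = vp_classes(data, subB, eta, mu, kind)
    lower, upper = set(), set()
    for Y in decision_classes:
        for S in classes:
            if S <= Y:
                lower |= S       # part of the lower approximation of D
            if S & Y:
                upper |= S       # part of the upper approximation of D
    boundary = upper - lower                     # BND(D), formula (14)
    return 1.0 - len(boundary) / len(data)       # BC(D),  formula (15)

print(boundary_coefficient(data, labels, subB, eta, mu=0.7))
```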

Property 1

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) we have that:

  1. (1)

    \( BC_{\left( \eta ,\mu \right) }^{subB}(D) \in [0,1] \);

  2. (2)

    If \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =0\), then \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =0\);

  3. (3)

    If \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =1\) and \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=U\), then \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =1\).

Proof

  1. (1)

    When \( BC_{\left( \eta ,\mu \right) }^{subB}(D) \) takes the minimum value, it means that \(\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| \) takes the maximum value, i.e., |U|. So we can obtain that the minimum value of \( BC_{\left( \eta ,\mu \right) }^{subB}(D)\) is 0. When \( BC_{\left( \eta ,\mu \right) }^{subB}(D) \) takes the maximum value, it means that \(\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| \) takes the minimum value, i.e., 0. So we can obtain that the maximum value of \( BC_{\left( \eta ,\mu \right) }^{subB}(D) \) is 1.

  2. (2)

    If \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =0\), then we have that \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D-\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=U\), which means that \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=U\) and \(\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=\emptyset \). Hence, we have that \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =0\).

  3. (3)

    If \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =1\) and \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=U\), then we can obtain that \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D-\) \( \underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D =\emptyset \), which means that \(\overline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=\underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D=U\). Hence, we have that \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =1\).

Property 1 discusses the value range of \( BC_{\left( \eta ,\mu \right) }^{subB}(D)\). From Property 1, we can clearly see that the value range of \( BC_{\left( \eta ,\mu \right) }^{subB}(D)\) is [0, 1], which ensures that the final mixed decision rule receives the same impact from the boundary coefficient and the attribute dependency. Moreover, (2) and (3) of Property 1 ensure that when the boundary coefficient takes its minimum or maximum value, the attribute dependency takes the corresponding extreme value under the stated conditions, so the two measures agree at the extremes.\(\square \)

Theorem 1

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any \(subB^1, subB^2 \subseteq C,\) if \( subB^1 \subseteq subB^2\), then we have that \( BC_{\left( \eta ,\mu \right) }^{subB^1}(D) \le BC_{\left( \eta ,\mu \right) }^{subB^2}(D) \).

Proof

Since \( subB^1 \subseteq subB^2 \), we can obtain that the partition of \(subB^1\) is rougher than that of \(subB^2\), that is, for any \(s\in U,\) \( \left[ s \right] _{VPNER_{\left( \eta ,\mu \right) }^{_{}subB^2}} \subseteq \left[ s \right] _{VPNER_{\left( \eta ,\mu \right) }^{_{}subB^1}} \). Hence, we have that \( POS_{\left( \eta ,\mu \right) }^{subB^1} D \subseteq POS_{\left( \eta ,\mu \right) }^{subB^2} D \), which means that \(\frac{\left| BND_{\left( \eta ,\mu \right) }^{subB^2}\left( D \right) \right| }{\left| U \right| } \le \frac{\left| BND_{\left( \eta ,\mu \right) }^{subB^1}\left( D \right) \right| }{\left| U \right| }\). So we can obtain that \( BC_{\left( \eta ,\mu \right) }^{subB^1}(D)\) \( \le BC_{\left( \eta ,\mu \right) }^{subB^2}(D)\).

From Theorem 1, it can be seen that the more knowledge there is, that is, \( subB^1 \subseteq subB^2 \), the finer the characterization of the target set, that is, \( \left[ s \right] _{VPNER_{\left( \eta ,\mu \right) }^{_{}subB^2}} \subseteq \left[ s \right] _{VPNER_{\left( \eta ,\mu \right) }^{_{}subB^1}} \). Hence, more information is available for making a decision, that is, the boundary coefficient becomes larger. So the behavior of the boundary coefficient is consistent with the way we learn the rules of the world.\(\square \)

Definition 8

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) by combining the boundary coefficient with the attribute dependency, the boundary mixed attribute dependency of D with respect to subB is defined as follows:

$$\begin{aligned} BMAD_{\left( \eta ,\mu \right) }^{subB}(D)=BC_{\left( \eta ,\mu \right) }^{subB}(D)*r_{\left( \eta ,\mu \right) }\left( D,subB \right) =\left( 1-\frac{\left| BND_{\left( \eta ,\mu \right) }^{subB}\left( D \right) \right| }{\left| U \right| }\right) *\frac{\left| \underline{VPNER_{\left( \eta ,\mu \right) }^{subB}}D \right| }{\left| U \right| }. \end{aligned}$$
(16)

Since the boundary mixed attribute dependency is constructed by combining the boundary coefficient and the attribute dependency, it can reflect both the uncertainty and the dependency of D with respect to subB. Therefore, the boundary mixed attribute dependency is a good decision rule for decision tree learning.
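To show how the boundary mixed attribute dependency can act as a decision rule, the following sketch (hypothetical helpers, not the authors' implementation of Algorithm 1) combines the earlier `attribute_dependency` and `boundary_coefficient` sketches according to formula (16) and, as is usual for a partition measure, selects the candidate attribute with the largest value.

```python
# Sketch of formula (16) used as a decision rule (hypothetical helpers, not the
# authors' implementation of Algorithm 1): BMAD is the product of the boundary
# coefficient and the attribute dependency, and the split attribute is taken to
# be the candidate with the largest BMAD value.
def bmad(data, labels, subB, eta, mu, kind="jaccard"):
    r, _ = attribute_dependency(data, labels, subB, eta, mu, kind)
    bc = boundary_coefficient(data, labels, subB, eta, mu, kind)
    return bc * r                                  # formula (16)

def select_split_attribute(data, labels, candidates, eta, mu):
    scores = {att: bmad(data, labels, [att], eta, mu) for att in candidates}
    return max(scores, key=scores.get)             # attribute maximizing BMAD

print(select_split_attribute(data, labels, candidates=[0, 1], eta=0.2, mu=0.7))
```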

Property 2

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any attribute subset \(subB\subseteq C,\) we have that:

  1. (1)

    \( BMAD_{\left( \eta ,\mu \right) }^{subB}(D) \in [0,1] \);

  2. (2)

    If \( BMAD_{\left( \eta ,\mu \right) }^{subB}(D) =0\), then \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =0\) or \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =0\);

  3. (3)

    If \( BMAD_{\left( \eta ,\mu \right) }^{subB}(D) =1\), then \(r_{\left( \eta ,\mu \right) }\left( D,subB \right) =1\) and \( BC_{\left( \eta ,\mu \right) }^{subB}(D) =1\).

Proof

According to Property 1, it is easy to prove Property 2, so we omit the details.\(\square \)

Theorem 2

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table, where \(\eta \) represents the neighborhood radius and \( \mu \) represents the variable precision threshold. For any \(subB^1, subB^2 \subseteq C,\) if \( subB^1 \subseteq subB^2\), then we have that \( BMAD_{\left( \eta ,\mu \right) }^{subB^1}(D) \le BMAD_{\left( \eta ,\mu \right) }^{subB^2}(D) \).

Table 1 A variable neighborhood decision table
Table 2 The results of neighborhood granulation
Table 3 The results of neighborhood similarity on \(att_1\)
Table 4 The results of neighborhood similarity on \(att_2\)

Proof

Since \( subB^1 \subseteq subB^2,\) according to Theorem 1, we can obtain that \( POS_{\left( \eta ,\mu \right) }^{subB^1}D \subseteq POS_{\left( \eta ,\mu \right) }^{subB^2}D \). By Definition 5, we can further obtain that \( r_{\left( \eta ,\mu \right) }\left( D,subB^1 \right) \le r_{\left( \eta ,\mu \right) }\left( D,subB^2 \right) \). Moreover, according to Theorem 1, we can obtain that \( BC_{\left( \eta ,\mu \right) }^{subB^1}(D) \le BC_{\left( \eta ,\mu \right) }^{subB^2}(D) \). Therefore, by Definition 8, we have that \( BMAD_{\left( \eta ,\mu \right) }^{subB^1}(D) \le BMAD_{\left( \eta ,\mu \right) }^{subB^2}(D)\).

Theorem 2 discusses the monotonicity of \(BMAD_{\left( \eta ,\mu \right) }^{subB}(D)\). From Theorem 2 we can notice that the more knowledge we have, the larger the attribute dependency, and the larger \(BMAD_{\left( \eta ,\mu \right) }^{subB}(D)\). Hence, the change of the boundary mixed attribute dependency is also consistent with the way we learn the rules of the world.

Table 5 The results of neighborhood equivalence relation
Table 6 The results of computation

Example 1

Let \(\left\langle U, A, f, V^{subB}, \eta , \mu \right\rangle \) be a variable neighborhood decision table (as shown in Table 1), where U represents the universe, \(\{att_1, att_2\}\) represents the set of condition attributes, and \(\{d\}\) represents the set of decision attributes. Suppose that the neighborhood radius \(\eta \) equals 0.2 and the variable precision threshold \( \mu \) equals 0.7 (Tables 2, 3, 4, and 5).

From Table 6, we can draw the following conclusions. On the one hand, the boundary mixed attribute dependency of D with respect to \(att_1\) is larger than that with respect to \(att_2\). On the other hand, the boundary coefficient evaluates the two condition attributes in a manner consistent with the attribute dependency. Hence, Example 1 illustrates the effectiveness of the boundary coefficient and the boundary mixed attribute dependency.

Algorithm 1

Decision tree algorithm based on the boundary mixed attribute dependency.

3.2 Algorithm design

In this subsection, a novel decision tree algorithm is constructed by regarding the boundary mixed attribute dependency as the partition measure.

Before constructing the algorithm, it should be noted that the value of the neighborhood radius plays an important role in neighborhood granulation. However, in most applications of the neighborhood rough set model, the value of the neighborhood radius is chosen subjectively by the authors. Therefore, in this paper, the improved neighborhood radius of [14] is used.

Definition 9

[14] Let \( subB \subseteq C \) be a subset of condition attributes, let \(sta\left( subB \right) \) represent the standard deviation of the attribute values of subB, let NRC represent the neighborhood radius coefficient, and let \(\overline{s}\) represent the average value of subB. A novel neighborhood radius is then defined as follows:

$$\begin{aligned} \eta ^{subB}=\frac{sta\left( subB \right) }{NRC}, \end{aligned}$$
(17)

where \(sta(subB)=\sqrt{\frac{1}{G}\sum _{g=1}^G{\left( s^g-\overline{s} \right) ^2}}\), G is the number of samples, and \(s^g\) denotes the attribute value of the g-th sample on subB.

The neighborhood radius coefficient is a constant used to control the size of the neighborhood radius. In this paper, we set \(NRC=2\). The flowchart of Algorithm 1 is shown in Fig. 1.
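A one-line sketch of Definition 9 with \(NRC=2\) follows (hypothetical name, reusing the toy `data` from Section 2.1); the population standard deviation matches the \(\frac{1}{G}\) normalization above.

```python
# One-line sketch of Definition 9 with NRC = 2 (hypothetical name, reusing the
# toy `data`): the radius of an attribute is its (population) standard
# deviation divided by the neighborhood radius coefficient.
def neighborhood_radius(data, att, nrc=2.0):
    return float(np.std(data[:, att])) / nrc       # eta = sta(att) / NRC

print(neighborhood_radius(data, 0))
```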

The time complexity of Algorithm 1 is mainly determined by Step 1. During the neighborhood granulation, the neighborhood class of every sample must be computed, so the time complexity is \( O(\left| C \right| *\left| U \right| ^2 ) \). The subsequent computation of the variable precision equivalence partition also iterates over all pairs of samples and has the same time complexity, \( O(\left| C \right| *\left| U \right| ^2 ) \). Moreover, the computation of the partition measure is performed once per attribute, so its time complexity is \( O(\left| C \right| ) \). Therefore, the time complexity of Algorithm 1 is \( O(\left| C \right| *\left| U \right| ^2 ) \).

The time complexities of the other algorithms are analyzed carefully in [15], so the details are omitted here. The time complexity of the algorithm in [14] is \( O(\left| C \right| *\left| U \right| ^2 ) \), and the time complexities of the ID3, C4.5 and CART algorithms are all \(O(\left| C \right| *\left| U \right| *\log \left| U \right| )\).

Fig. 1

The flowchart of Algorithm 1

4 Experimental analysis

In this section, we use simulation experiments to verify the effectiveness of Algorithm 1. We choose 14 datasets from the UCI Machine Learning Repository; Table 7 details their information. We select the ID3 [24], C4.5 [25] and CART [1] algorithms (for these three algorithms, the continuous attributes in each dataset are discretized into 5 intervals by equal distance partition) and the algorithm proposed in [14] for comparison with Algorithm 1. Moreover, we use three evaluation metrics, namely accuracy, number of leaves and running time, to assess the effectiveness of each algorithm. The experiments are repeated 4 times; in each iteration, 10-fold cross-validation is used to evaluate the accuracy and the number of leaves, and we report the average accuracy and average number of leaves. The running time is the total time over the 4 iterations. Since the ID3, C4.5 and CART algorithms discretize the continuous attributes by equal distance partition, their running times are very short; hence, in terms of running time, we only compare the algorithm proposed in [14] with Algorithm 1.

Table 7 The information of the 14 datasets
Table 8 The running time(s)

In the experiments, the variable precision threshold \(\mu \) is set to 0.8. The experiments are run in MATLAB2021b, and the hardware environment of experiments is Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz and 16.0 GB RAM.

Since both the algorithm in [14] and Algorithm 1 can be instantiated with any of the four neighborhood similarities, each of them yields four sets of experimental results. To make the comparisons clearer, in the remainder of this paper we use AD(NRJGS) to denote the decision tree algorithm proposed in [14] based on the neighborhood Jaccard geometric similarity, and BMAD(NRJGS) to denote Algorithm 1 based on NRJGS; the other notations are defined analogously. In addition, we only compare algorithms built on the same neighborhood algebraic similarity; for instance, AD(NRJGS) is compared with BMAD(NRJGS), because both use the Jaccard similarity.

From Table 8 and Fig. 2, it can be seen that on most datasets the running time of Algorithm 1 is less than that of the algorithm proposed in [14], and the advantage of Algorithm 1 becomes clearer as the number of samples increases. In this case, the extra cost of computing the boundary coefficient in Algorithm 1 is worthwhile for the higher accuracy obtained. Nevertheless, the running time of Algorithm 1 is still high, which indicates that it is not yet appropriate for large datasets.

Table 9 and Fig. 3 show the accuracy comparisons. We can observe that the accuracy of Algorithm 1 is the highest, with the highest average accuracy reaching 0.9518. In particular, the accuracy on the crayo dataset reaches 1 for all four neighborhood similarities. Compared with the three classical algorithms, the accuracy improvement reaches 10%; compared with the algorithm in [14], the improvement is above 2%, even though the accuracy of all four neighborhood similarities in [14] already exceeds 93%. Hence, the improvement is significant.

Fig. 2

The accuracy comparison between algorithms

Table 9 The accuracy
Fig. 3

The running time comparison between algorithms

Table 10 and Fig. 4 compare the numbers of leaves of the algorithms. We can see that Algorithm 1 produces more leaves than the classical algorithms, which indicates that its degree of fitting is higher. Compared with the algorithm in [14], there is little difference in the number of leaves, and the variation is slight and acceptable. Therefore, the above experimental results show that Algorithm 1 is meaningful and effective.

5 Conclusion and future work

In order to take the change of the boundary region into account in an improved decision tree algorithm based on rough set theory, the boundary coefficient is proposed in this paper. The boundary coefficient reflects the difficulty of approximately representing the decision with the available knowledge, that is, the possibility of selecting an attribute to start the partition in the decision tree algorithm. Then, the boundary mixed attribute dependency is introduced by combining the boundary coefficient with the attribute dependency and is used as the decision rule. This decision rule performs better at selecting attributes than the single attribute dependency in decision tree learning. The experimental results show that the accuracy of our algorithm is higher than that of the compared algorithms, with a maximum of 0.9518. Moreover, in terms of running time and number of leaves, our algorithm also performs well. However, the running time is still too long, because Algorithm 1 must process all samples to form the neighborhood granules and then process all neighborhood granules to form the variable precision equivalence relation, which results in a long running time together with high accuracy. Therefore, further research is needed on how to reduce the running time while maintaining the high accuracy and how to apply Algorithm 1 to larger datasets, for example from a local perspective.

Table 10 The number of leaves
Fig. 4

The number of leaves comparison between algorithms