1 Introduction

Computational intelligence systems often require a preprocessing stage called feature selection or attribute reduction. Feature reduction lowers the computational complexity of the knowledge processing task; it often brings the additional benefit of improved accuracy, and it reduces the volume of feature data that must be collected. Research in feature selection [1], or attribute reduction, is ongoing; its main goal is to optimize the feature set while retaining the quality and originality of the data objects. Rough sets [2, 3], introduced by Pawlak, have become a popular mathematical tool for dealing with vagueness, imprecision, and uncertainty in knowledge extraction and data analysis. Their advantage is that they depend on no external inputs and do not transform the existing data, which makes them well suited to dimensionality reduction. Other applications of rough sets include rule generation and prediction [4]. Traditional rough sets [5] can be applied directly to discrete data, whereas real-valued data need an additional discretization step. To deal with real-valued data, several works [6,7,8,9] have combined fuzzy systems with rough sets for attribute reduction. When rough set reduction is performed on real-valued data, discretization may incur a loss of information in some applications [10]. To avoid such a loss, the fuzzy–rough approach can be employed; it mandates only fuzzification of the data, which requires no further information except the number of membership functions to be used for each attribute.

Fuzzy approximation is the basis of most of the literature on fuzzy–rough reduction [11]. A dependency degree-based reduct algorithm was employed in [6] and applied to Web categorization. Fuzzy entropy was used in the fuzzy–rough reduction process in [10, 12] with a greedy approach. Eric et al. [8] used a discernibility matrix-based approach for fuzzy–rough reduction. Jensen and Shen proposed fuzzy similarity-based approaches built on the fuzzy boundary region, the fuzzy lower approximation, and the fuzzy discernibility matrix [7]. In [13], the authors utilized principal component analysis for attribute reduction. Fuzzy–rough reduction aims to find an optimal set of attributes from the available list; however, finding the best combination is an NP-hard problem, so metaheuristic search techniques are needed to find the best possible combination among the given attributes [14]. Examples of metaheuristic search techniques include genetic algorithms [15], simulated annealing [16], cuckoo search [17], and tabu search [18]. Ant colony optimization (ACO), bee colony optimization [19], particle swarm optimization [20], and social cognitive optimization [21] fall under the category of swarm intelligence-based techniques [22]. ACO is a metaheuristic global search algorithm introduced by Dorigo [23]. Jensen and Shen [24] proposed ACO search for attribute reduction using the fuzzy dependency degree, which in turn is derived from the fuzzy positive region and the fuzzy lower approximation. Liangjun et al. [25] proposed ACOAR, a rough set-based ACO that uses the rough set dependency degree; however, it is mostly suitable for discrete data sets. In [14], Ravi Kiran Varma et al. proposed mutual information-based attribute reduction using ACO. ACO has also proven fruitful for attribute reduction in [26,27,28]. Hulian and Zong did similar work on attribute reduction using rough sets with swarm optimization [29]. Ant colony optimization has also been used in vehicular traffic control systems [30].

This paper demonstrates the ability of fuzzy entropy-based fuzzy rough attribute selection with ant colony optimization as the global search technique; the results are compared with similar works in this area [7, 10, 12]. A quick reduct algorithm is developed that uses fuzzy entropy alone, with no need for the fuzzy dependency degree; to avoid the disadvantages of local search, ACO is then employed to obtain the best reduct. Similar work was reported by Varma et al. [31], who applied fuzzy rough sets and ACO to eliminate unimportant features of a real-time intrusion detection system (IDS) data set.

2 Background

2.1 Fuzzy Rough Sets and Attribute Selection

In rough sets [2, 12], the approximations of the original object set are crisp; in fuzzy rough sets, the lower and upper approximations are fuzzy. For real-valued data, discretization has to be performed before applying rough set reduction, and this process may lose information in some cases. In fuzzy rough sets, the attributes are instead converted to fuzzy data, with each attribute divided into membership groups, thereby avoiding this information loss [32]. Rough sets can be treated as a special case of fuzzy rough sets in which every element of the lower approximation has a membership of 1; in fuzzy rough sets, the elements of the lower approximation take memberships in [0, 1] and can therefore handle uncertainty more flexibly than the crisp counterpart [10].
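As an illustration of this fuzzification step, the sketch below (Python with hypothetical helper names; the paper's own implementation was in Java and is not reproduced here) maps a real-valued attribute onto evenly spaced triangular membership functions, the type of function used in Sect. 5:

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership with feet a, c and peak b (assumes a < b < c)."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def fuzzify(column, k=3):
    """Map one real-valued attribute onto k evenly spaced triangular fuzzy
    sets over its observed range; returns an (n_objects, k) membership
    matrix, i.e. one fuzzy set per column (assumes the range is nonzero)."""
    lo, hi = float(column.min()), float(column.max())
    step = (hi - lo) / (k - 1)          # peak-to-peak spacing
    peaks = np.linspace(lo, hi, k)
    return np.stack([triangular(column, p - step, p, p + step) for p in peaks],
                    axis=1)
```

Only the number of membership functions per attribute (`k`) is supplied, matching the claim in Sect. 1 that fuzzification requires no other external information.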

An information system \({\mathbb{Z}}\) can be represented as a set of finite objects \({\mathbb{O}}\) and a set of finite attributes \({\mathbb{T}}\), \({\mathbb{Z}} = \left( {{\mathbb{O}},{\mathbb{T}}} \right)\). The attributes \({\mathbb{T}}\) of the table comprise conditional as well as decision attributes, \({\mathbb{T}} = \left\{ {T_{C} \cup T_{D} } \right\}\) [33]. Just as the crisp equivalence class is central to rough sets, the fuzzy equivalence class is central to fuzzy rough sets; here both the conditional and the decision features are fuzzy. Definitions of the fuzzy equivalence (similarity) relation and of the fuzzy lower and upper approximations can be found in [10].

Typically, fuzzy rough attribute selection procedures use the lower approximation (the positive region); the upper approximation is not employed. For any attribute \(t\), the fuzzy equivalence classes it generates are denoted by \({\mathbb{O}}/IND\left( t \right)\); for instance, if \(t\) is fuzzified by two membership functions, the partition \({\mathbb{O}}/IND\left( t \right)\) consists of those two fuzzy sets. In calculating the dependency degree for fuzzy rough reduction, \({\mathbb{O}}/IND\left( {\mathcal{W}} \right)\), in short form \({\mathbb{O}}/{\mathcal{W}}\), has to be obtained. In rough sets, \({\mathbb{O}}/{\mathcal{W}}\) is simply the set of objects that are indiscernible between the attributes \(t_{1} ,t_{2}\), given \({\mathcal{W}} = \left\{ {t_{1} ,t_{2} } \right\}\). In fuzzy rough reduction, the set \({\mathbb{O}}/{\mathcal{W}}\) is based on the Cartesian product, as shown below:

$${\mathbb{O}}/{\mathcal{W}} = \otimes \left\{ {{\mathbb{O}}/IND\left( t \right)\,:\,t \in {\mathcal{W}}} \right\}$$
(1)

If \({\mathcal{W}} = \left\{ {l, m} \right\}\) and \({\mathbb{O}}/IND\left( l \right) = \left\{ { l_{1} ,l_{2} } \right\} \,{\text{and}}\,{\mathbb{O}}/IND\left( m \right) = \left\{ { m_{1} ,m_{2} } \right\}\), then

$${\mathbb{O}}/{\mathcal{W}} = \left\{ {l_{1} \cap m_{1} , l_{1} \cap m_{2} , l_{2} \cap m_{1} , l_{2} \cap m_{2} } \right\}$$
(2)
where the membership of an object \(o\) in each such intersection is computed with the min operator:

$$\mu_{{E_{1} \cap E_{2} \cap \cdots \cap E_{n} }} \left( o \right) = \min \left( {\mu_{{E_{1} }} \left( o \right),\mu_{{E_{2} }} \left( o \right), \ldots ,\mu_{{E_{n} }} \left( o \right)} \right).$$
(3)
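To make Eqs. (1)–(3) concrete, the following sketch (hypothetical helper names, not from the paper) forms \({\mathbb{O}}/{\mathcal{W}}\) by taking every Cartesian combination of per-attribute fuzzy sets and intersecting the memberships with min:

```python
from itertools import product
import numpy as np

def partition(memberships):
    """O/W per Eqs. (1)-(3). `memberships` is a list with one
    (n_objects, n_fuzzy_sets) matrix per attribute in W; the result is a
    list of 1-D membership arrays, one per intersection l_i ∩ m_j ∩ ..."""
    per_attr = [[m[:, j] for j in range(m.shape[1])] for m in memberships]
    return [np.minimum.reduce(np.stack(combo)) for combo in product(*per_attr)]
```

For \({\mathcal{W}} = \left\{ {l, m} \right\}\) with two fuzzy sets each, this yields exactly the four classes of Eq. (2).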

3 Fuzzy Entropy-Based Quick Reduct

The information entropy of an event \(E\) with \(n\) outcomes was proposed by Shannon [34] and is given as:

$${\mathbb{H}}\left( E \right) = - \mathop \sum \limits_{j = 1}^{n} p_{j} \log_{2} p_{j} .$$
(4)

Recall from Sect. 2.1 that an information system is \({\mathbb{Z}} = \left( {{\mathbb{O}},{\mathbb{T}}} \right)\) with \({\mathbb{T}} = \left\{ {T_{C} \cup T_{D} } \right\}\). Let \({\mathbb{F}}_{1} , {\mathbb{F}}_{2} , \ldots ,{\mathbb{F}}_{n}\) be the fuzzy membership subsets of an attribute; then the fuzzy entropy of a given membership subset \({\mathbb{F}}_{j}\) is:

$${\mathbb{H}}\left( {{\mathbb{F}}_{j} } \right) = - \mathop \sum \limits_{{Q \in {\mathbb{O}}/T_{D} }} p\left( {Q |{\mathbb{F}}_{j} } \right) \log_{2} p\left( {Q |{\mathbb{F}}_{j} } \right),$$
(5)

where the decision-relative probability, with \(\left| \cdot \right|\) denoting fuzzy cardinality, is given by:

$$p(Q |{\mathbb{F}}_{j} )= \frac{{\left| { Q \cap {\mathbb{F}}_{j} } \right|}}{{\left| {{\mathbb{F}}_{j} } \right|}}$$
(6)

The fuzzy entropy of an attribute subset \({\mathcal{W}}\) is given by Parthalain et al. [12] as:

$$\upxi\left( {\mathcal{W}} \right) = \mathop \sum \limits_{{{\mathbb{F}}_{j } \in {\mathbb{O}}/{\mathcal{W}} }} \frac{{\left| {{\mathbb{F}}_{j} } \right|}}{{\mathop \sum \nolimits_{{X_{j } \in {\mathbb{O}}/{\mathcal{W}} }} \left| {X_{j} } \right|}}{\mathbb{H}}\left( {{\mathbb{F}}_{j} } \right).$$
(7)
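A minimal sketch of Eqs. (5)–(7), assuming (consistently with [12]) that the fuzzy cardinality \(\left| \cdot \right|\) is the sigma-count, i.e. the sum of memberships; the helper names are ours, not the paper's:

```python
import numpy as np

def fuzzy_entropy(classes, decisions):
    """xi(W) of Eq. (7). `classes` is the list O/W produced by partition();
    `decisions` holds one boolean mask per crisp decision class Q in O/T_D."""
    total = sum(f.sum() for f in classes)   # denominator of Eq. (7)
    xi = 0.0
    for f in classes:
        card = f.sum()                      # sigma-count |F_j|
        if card == 0:
            continue
        h = 0.0
        for q in decisions:
            p = f[q].sum() / card           # Eq. (6): |Q ∩ F_j| / |F_j|
            if p > 0:
                h -= p * np.log2(p)         # Eq. (5)
        xi += (card / total) * h            # Eq. (7) weighting
    return xi
```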

This fuzzy entropy can be used to single out the best contributing attributes, much as the fuzzy dependency degree can, except that for fuzzy entropy smaller is better, whereas for the fuzzy dependency degree larger is better. Earlier fuzzy entropy-based algorithms [10] used both the dependency degree and fuzzy entropy; we propose a fuzzy entropy-based quick attribute reduction that does not need to calculate the fuzzy dependency degree at all, and hence computes the reduct in a shorter time with comparable or improved classification accuracy. The hill climbing-based fuzzy entropy quick reduct algorithm is shown in Fig. 1.

Fig. 1
figure 1

Fuzzy entropy-based quick reduct algorithm

3.1 Fuzzy Entropy-Based Quick Reduct Algorithm (FEQR)

See Fig. 1.
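Since Fig. 1 is not reproduced here, the following sketch gives the greedy loop it describes, built on the helpers above; the stopping rule (halt when no candidate lowers \(\xi\)) is our simplification of the figure:

```python
def feqr(attr_memberships, decisions):
    """Greedy hill-climbing fuzzy entropy quick reduct (a sketch of Fig. 1).
    `attr_memberships` maps attribute index -> membership matrix."""
    reduct, best = [], float("inf")
    while True:
        candidate, cand_xi = None, best
        for a in attr_memberships:          # try each unused attribute
            if a in reduct:
                continue
            mats = [attr_memberships[i] for i in reduct + [a]]
            xi = fuzzy_entropy(partition(mats), decisions)
            if xi < cand_xi:                # smaller fuzzy entropy is better
                candidate, cand_xi = a, xi
        if candidate is None:               # no attribute improves xi: stop
            return reduct
        reduct.append(candidate)
        best = cand_xi
```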

4 Attribute Reduction Using Fuzzy Entropy and Ant Colony Optimization

To overcome the disadvantage of hill climbing-based attribute reduction, which offers no guarantee of a globally best solution, we propose a hybrid intelligent attribute reduction system that uses ACO to search for the globally best attribute set. ACO is an evolutionary swarm-based computational intelligence algorithm proposed by Dorigo et al. [35, 36]. A colony of ants works together toward the common goal of finding the globally best solution: \(n\) ants are released for \(m\) iterations, and in each iteration one ant's solution is marked as the iteration best; the path searched by that ant is reinforced with a pheromone trail, as with biological ants. A second quantity, the heuristic information, guides an ant's probabilistic decision of which node to select next during solution construction [37]. Thus two values drive ACO solution construction: the pheromone and the heuristic information. In this work, we propose relative fuzzy entropy as the heuristic information in the ACO search for the globally best attribute set. An ant selects the next attribute \(y\) from the present attribute \(x\) using:

$$\eta \left( {x,y} \right) = \xi \left( {S_{{{\bar{\text{a}}}}} } \right) - \xi \left( {S_{{{\bar{\text{a}}}}} \cup y} \right)$$
(8)

\(p_{x,y}^{{{\bar{\text{a}}}}} \left( {{\text{ITR}}_{n} } \right)\) is the probability that ant ā selects the next attribute \(y\) from the current attribute \(x\) at the \(n\)th iteration, \(\tau_{x,y}\) is the pheromone concentration and \(\eta_{x,y}\) the heuristic information on edge \(\left( {x,y} \right)\), and \(\alpha\) and \(\beta\) are the tuning factors for pheromone and heuristic information, respectively:

$$p_{x,y}^{{{\bar{\text{a}}}}} \left( {{\text{ITR}}_{n} } \right) = \frac{{{\tau }_{x,y}^{\alpha } {\eta }_{x,y}^{\beta } \left( {{\text{ITR}}_{n} } \right)}}{{\mathop \sum \nolimits_{{z \in \left( {\varvec{T}_{\varvec{C}} - S_{{{\bar{\text{a}}}}} } \right)\varvec{ }}} {\tau }_{x,z}^{\alpha } {\eta }_{x,z}^{\beta } \left( {{\text{ITR}}_{n} } \right)}}, y \in \left( {\varvec{T}_{\varvec{C}} - S_{{{\bar{\text{a}}}}} } \right)$$
(9)
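The two quantities of Eqs. (8) and (9) can be sketched as follows; `xi_fn` is assumed to wrap the fuzzy entropy of Sect. 3 for a given attribute subset, and the floor on \(\eta\) (which can be negative when adding \(y\) raises the entropy) is our guard, not the paper's:

```python
import numpy as np

def select_next(x, subset, candidates, tau, xi_fn, alpha=1.0, beta=1.0):
    """Roulette-wheel choice of the ant's next attribute y per Eq. (9)."""
    base = xi_fn(subset)
    # Eq. (8): relative fuzzy entropy (entropy drop) as heuristic information
    eta = np.array([max(base - xi_fn(subset + [y]), 1e-12) for y in candidates])
    weights = tau[x, np.array(candidates)]**alpha * eta**beta
    return int(np.random.choice(candidates, p=weights / weights.sum()))
```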

The proposed algorithm of ant colony optimization of fuzzy entropy-based fuzzy rough feature reduction is shown in Fig. 2.

Fig. 2
figure 2

Ant colony optimization of fuzzy entropy-based fuzzy rough reduction algorithm (ACOFEFR)

4.1 Ant Colony Optimization of Fuzzy Entropy-Based Fuzzy Rough Reduction Algorithm (ACOFEFR)

In this algorithm, a group of ants is released to traverse the nodes (attributes) in search of the best possible attribute set, guided by the pheromone and heuristic values. The process is repeated for a number of iterations, decided on a trial and error basis. In each iteration every ant finds a solution, but only the global best solution is retained. In the first iteration, an ant starts at a random node and selects the next attribute by calculating the selection probability, which depends on the heuristic and pheromone values; from the second iteration onwards there is no randomness, and even the first node is chosen by the selection probability function. Trial and error also showed that updating the pheromone after every ant's solution produced better results than updating once per iteration; the evaporation of pheromone, however, is done only once per iteration. Relative fuzzy entropy is used as the heuristic information (Eq. (8)), and the probability selection formula is Eq. (9). The fuzzy entropy value plays a crucial role in the convergence of the ants' solutions.
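A condensed sketch of the loop in Fig. 2, using `select_next` from above; parameter names and the random start in every iteration are our simplifications (the paper removes start randomness after the first iteration), while the per-ant pheromone deposit and per-iteration evaporation follow the text:

```python
import random
import numpy as np

def acofefr(n_attrs, xi_fn, n_ants=3, n_iters=5, rho=0.1):
    """ACOFEFR sketch: each ant grows an attribute subset until the fuzzy
    entropy stops falling; the smallest subset found is kept as global best."""
    tau = np.ones((n_attrs, n_attrs))        # pheromone matrix
    best = None
    for _ in range(n_iters):
        for _ in range(n_ants):
            sol = [random.randrange(n_attrs)]
            xi = xi_fn(sol)
            while len(sol) < n_attrs:
                cands = [a for a in range(n_attrs) if a not in sol]
                y = select_next(sol[-1], sol, cands, tau, xi_fn)
                new_xi = xi_fn(sol + [y])
                if new_xi >= xi:             # no entropy drop: solution done
                    break
                sol.append(y)
                xi = new_xi
            if best is None or len(sol) < len(best):
                best = sol                   # keep the global best reduct
            for x, y in zip(sol, sol[1:]):   # deposit after every ant's solution
                tau[x, y] += 1.0 / len(sol)
        tau *= 1.0 - rho                     # evaporate once per iteration
    return best
```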

5 Result Analysis and Discussion

Several experiments were conducted to support the proposed algorithms, using UCI machine learning data sets as the standard benchmark. All experiments were run on a Windows 7-based i3 machine with 4 GB RAM in a Java programming environment. The C4.5 decision tree classifier [38] was used to evaluate the performance of the proposed algorithms; the ACO parameters are shown in Table 1. Triangular membership functions were used in the fuzzification process. The solution provided by the algorithm depends on the number of ants released and the number of iterations over which ants are released. The program was initially run with 10 ants and 10 iterations; trial and error then showed that an optimal, consistent solution was attained with three ants and five iterations for all the data sets below. Since the advantage of fuzzy entropy-based reduction lies with real-valued data, the experiments were conducted on data sets that are real-valued or a combination of real and discrete values.

Table 2 presents the results over 15 UCI data sets [39]. Column 2 names the data set; columns 3 and 4 give its numbers of attributes and samples; column 5 is the size of the reduced attribute set obtained by the greedy hill climbing-based fuzzy entropy quick reduct (FEQR), and column 6 the size obtained by the proposed ACOFEFR; columns 7 and 8 list the attributes selected by FEQR and ACOFEFR, respectively; column 9 is the classification accuracy of the J48 decision tree on the full attribute set, and columns 10 and 11 are the accuracies on the reduced sets from FEQR and ACOFEFR, respectively. The table makes clear, from the reduct sizes found by the two algorithms, that the ACO-based global search strategy has a great advantage over the greedy hill climbing FEQR: except for Iris, Wine, Yeast, and Diabetes, ACOFEFR produced a smaller attribute set than FEQR. For the Olitos data set, ACOFEFR selected five attributes where FEQR selected 15 of the 25, while keeping the classification accuracy unchanged. For most data sets, the classification accuracies with the ACOFEFR-reduced attribute sets are very close to, or even better than, the full attribute accuracies. The process of solution construction by one ant is shown in Fig. 3 as a sample, for the Diabetes data set: the ant starts at attribute 3 and selects subsequent attributes based on the selection probability, which indirectly depends on fuzzy entropy, constructing the solution {3, 2, 6}, which is the reduced attribute set.

Table 1 ACO parameters
Table 2 Results over 15 UCI data sets
Fig. 3
figure 3

Process of solution construction by one ant

Table 3 compares the proposed algorithm with similar works in this area that used fuzzy entropy-based algorithms for reduction. The proposed approach produced better results for the Cleveland, Iris, Olitos, and Wine data sets, with smaller attribute sets and nearly the same classification accuracies. The '—' in Table 3 for the Iris data set indicates that the result is not available. For Iris, our algorithm selected the best two attributes while maintaining the accuracy mark; for Wine, our approach outperformed the others by selecting only four attributes out of 13; for Olitos, comparable accuracy was retained while the attributes were reduced to five, the best among the compared methods; and for Cleveland, the best attribute set was likewise attained.

Table 3 Comparison of proposed algorithm

6 Conclusion

This paper proposed an ant colony optimization-based fuzzy entropy attribute reduction algorithm that produces a globally better solution than greedy hill climbing-based approaches. The proposed algorithm proved feasible, attaining comparable and sometimes better results on real-valued data sets in terms of both the number of attributes removed and the classification accuracy. The approach is well suited to real-valued data sets, since fuzzification preserves the originality of the data much better than discretization. Another feature of our algorithm is that it does not need to calculate the fuzzy dependency degree. However, fuzzy-based attribute reduction suffers from higher computational complexity than non-fuzzy techniques such as rough set reduction on discretized data.