1 Introduction

Predictive analytics [1] is an emerging field that combines statistical approaches, machine-learning techniques, modeling, and data mining to analyze historical data and predict patterns that can be applied to future events. The predictive process also examines risk factors and opportunities so that an effective decision can be made in response to a user request. Predictive analysis is used in many domains, such as financial services, retail, capacity planning, fraud detection, healthcare systems [2], marketing, actuarial science, and child protection. These applications require large amounts of data for analyzing patterns of future needs, because the users of every organization or business request many resources on a daily basis. Around 2.5 quintillion bytes of data are requested and accessed from different sources each day, all of which can feed an effective predictive analysis process. Such a large volume of data is difficult to collect, and managing the related databases is also difficult; these difficulties are addressed by the concept of big data [3].

Although big data provides access to vast collections of data, it must handle various challenges, such as sharing, transfer, curation, analysis, and visualization. Big data is characterized [4] by volume, velocity, variety, and veracity. Among these characteristics, volume is especially important because data growth is expected to reach 40 zettabytes by 2040 owing to the growth of businesses in both the private and government sectors. According to a survey conducted in 2012, 2.8 zettabytes of data had been created, but only 5% of that data had been used for effective research and predictive analytics. This information is collected from various sources, including the health, retail, and banking sectors [5], and is then accelerated into real-time applications. The collected data are processed by applying business intelligence (BI) tools [6], which handle big data effectively.

BI relies on online analytical processing (OLAP) techniques [7] to examine the collected data, while forward-looking big-data analytics require causal analysis, optimization [8], predictive modeling, text mining, statistical analysis, and forecasting. The steps of the big-data-based predictive analytics process are shown in Fig. 1.

Fig. 1 Predictive analytics process

Figure 1 shows the standard predictive analytics process [9], which includes steps such as report analysis, monitoring, and predictive analytics. Each step helps in recognizing a particular pattern from past, present, and future perspectives. During this analytics process, big-data drivers are used for analyzing the data, which reduces power consumption and storage cost and provides high-speed networking and multi-core processing. In addition, the big-data-based [10] predictive analytics process has several benefits, such as faster achievement of business goals, easier examination of complex predictive models using causal factors, easier integration with traditional databases, effective scalability, and effective processing of unstructured data. Because of these benefits, the predictive analytics process uses the concept of big data while examining patterns; the processing steps of predictive analytics with big data are shown in Fig. 2.

Fig. 2 Big data and predictive analytics process

According to Fig. 2, the user-requested project or data are analyzed, and the related decision is handled by applying big-data concepts. Initially, the data are collected according to the user's request and then fed into the data-analysis process [11]. The data-analysis process includes data cleaning (normalization, the min–max technique, and replacement of missing values by the average, the median, or other methods), integration, data extraction (statistical features and structural information), feature selection (genetic algorithm, particle swarm optimization, firefly algorithm, artificial bee colony, greedy algorithm, wrapper methods, etc.), and pattern recognition (linear discriminant analysis, support vector machines, neural networks, decision trees, the k-nearest neighbor method, etc.). From the derived pattern, a particular predictive analytics process is applied, which functions according to the statistics, modeling, and deployment steps. During the analysis, users may request any type of data, such as medical, business, retail, or banking information, which may be stored in the database as images, numerical information, or voice data. Most users request data in image and numerical form, which are used to create an effective predictive analytic system. Therefore, the main contribution of this work is to improve the accuracy of the predictive analytic system by selecting optimized features, thereby reducing the error rate in an effective manner.
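As a hedged illustration only, and not the authors' implementation, the chain of cleaning, feature selection, and pattern recognition described above can be sketched as a short scikit-learn pipeline; the toy data, the number of retained features, and the SVM classifier are assumptions made purely for demonstration.

```python
# Hypothetical sketch of the generic analysis chain described above:
# cleaning (scaling) -> feature selection -> pattern recognition.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # assumed toy data: 200 samples, 50 raw features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # assumed binary target

pipeline = Pipeline([
    ("clean", MinMaxScaler()),                 # data cleaning / normalization step
    ("select", SelectKBest(f_classif, k=10)),  # placeholder for the feature-selection step
    ("classify", SVC(kernel="rbf")),           # pattern-recognition step (SVM)
])
pipeline.fit(X, y)
print("training accuracy:", pipeline.score(X, y))
```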

The rest of the paper is organized as follows: Sect. 2 reviews the related work, Sect. 3 describes the FGACO approach, Sect. 4 examines the efficiency of the feature selection process in big-data-based predictive analytics, and Sect. 5 presents the conclusion.

2 Related work

Ayhan et al. [12] analyzed aviation data developed from an internal research and development project at Boeing Research and Technology and an advanced air-traffic management process. The resulting data have high dimensionality and are therefore difficult to process, so the data are collected, correlated, and stored in a data warehouse. The collected data are processed with the help of a custom tool developed by Embry–Riddle Aeronautical University, which effectively answers user queries against the large database. The efficiency of the system was evaluated in the air-traffic data domain, and the developed system retrieves the particular user-requested pattern in an effective manner.

Saravanakumar et al. [13] developed an effective medical prediction system using big-data techniques. Healthcare centers contain mostly unstructured data that are difficult to process owing to their high dimensionality. To deal with the dimensionality issues, the authors used the Hadoop MapReduce framework. Initially, medical diabetes data were collected from the non-communicable disease (NCD) database, which comprises diabetes mellitus records. Once collected, the data were processed with Hadoop MapReduce to examine disease patterns that can help treat patients in the future. The developed system effectively identifies related diseases, and the process is affordable and accessible.

Dhar et al. [14] discussed the importance of predictive analytics systems in healthcare centers, because healthcare centers face several risks when examining patient data. Therefore, a predictive system was introduced for examining user queries, and a particular pattern was developed while minimizing the risk factors. In addition, the predictive system improves overall performance. Boukenze et al. [15] implemented a predictive analytics system for analyzing chronic kidney disease. Initially, data related to chronic kidney disease were collected from a larger set of data. The collected data were processed by applying data-mining and machine-learning approaches to reduce the difficulties present in big data. After data collection, the decision-tree (C4.5) algorithm was applied to recognize disease-related patterns. The recognized patterns were then used for further research and treatment.

Gulati et al. [16] investigated student dropout trends using data-mining approaches. The collected student information had high dimensionality, which increased the difficulty of data processing and of investigating dropout features. Therefore, the optimal features were selected from the collected information using association-rule mining. The features were selected based on the rules, which improved the accuracy of pattern prediction. The efficiency of the system was evaluated with the help of college management and teacher data, and the effective features were determined using the Weka feature selection tool. Muthukrishnan et al. [17] used various regression methods, such as ordinary least squares (OLS), least-squares regression, and ridge regression, for investigating the collected data. The collected data had high dimensionality and were therefore difficult to process, so the optimal features were selected using the regression methods, which improves prediction accuracy. The predicted patterns were implemented in an R package and simulated in that environment.

As previously mentioned, big data consists of a large volume of data, which increases the difficulty of analyzing user requests. Therefore, the dimensionality of the data should be reduced to improve the prediction or pattern-recognition process. Several optimization methods [18] are used for examining the data present in the database; this work uses the FGACO approach for detecting optimized features from the database, because other methods fail to choose optimized features relevant to the user request or queries, have poor accuracy in feature selection, and may eliminate some important features. In addition, this work uses multiple datasets, such as a cancer dataset, a genetics dataset, protein data, and bank marketing data, for examining the feature selection process, because well-selected features improve the prediction accuracy.

3 Materials and methods

This section discusses the FGACO approach for the big-data-based feature selection process. Big data consists of a large volume of data, which is difficult to process owing to its high dimensionality; lower dimensionality improves the overall system efficiency. In addition, the reduced feature set focuses only on the particular user-requested features, which improves decision making with respect to the user query. First, data are collected from the big-data [19] repository, from which particular user-requested patterns are detected. In this work, the Varibench protein dataset [20], the Protein Data Bank (PDB) dataset [21], the bank marketing dataset [22], and the lung cancer dataset [23] were used for developing the predictive analytics process. After collection, each piece of data was examined before processing, because the collected data may be affected by noise, which reduces the overall system efficiency. The noise present in the data was removed before processing using a normalization step. In this work, the data were normalized using the min–max along with Z-score normalization [24] approach in order to eliminate inconsistent data from the collection. This normalization has several useful properties, such as extension to any number of independent datasets, scaling of each element, and applicability to all dataset values. Because this paper employs different datasets in the dimensionality-reduction process to improve predictive analytics, min–max along with Z-score normalization was used. The normalization is carried out as follows:

$$Y = \frac{v_{i} - E}{\operatorname{std}(E)} \times \left( \frac{A - \min (A)}{\max (A) - \min (A)} \right) \times \left( D - C \right) + C$$
(1)

In Eq. (1), Y is the normalized value of a particular input, \(v_{i}\) is the value of a particular column, A is the original input data, and C and D are the pre-defined boundary values of the data.

From Eq. (1), std(E) represents the standard deviation of the input, which is computed as follows:

$$\operatorname{std}(E) = \sqrt{\frac{1}{n - 1}\sum_{i = 1}^{n} \left( v_{i} - E \right)^{2}}$$
(2)

Then, E is denoted as the mean value, which is calculated as follows:

$$E = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} v_{i}$$
(3)
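A minimal numerical sketch of the combined scaling in Eqs. (1)–(3) is shown below; treating the original input A and the column values \(v_i\) as the same one-dimensional column, and choosing the boundaries C = 0 and D = 1, are assumptions made for illustration.

```python
# Hedged sketch of the combined Z-score and min-max scaling of Eqs. (1)-(3).
import numpy as np

def combined_normalize(v, C=0.0, D=1.0):
    """Normalize a 1-D column v following Eq. (1): the Z-score term times the
    min-max term, rescaled into the assumed pre-defined boundaries [C, D]."""
    v = np.asarray(v, dtype=float)
    E = v.mean()                    # Eq. (3): mean of the column
    std_E = v.std(ddof=1)           # Eq. (2): sample standard deviation (n - 1)
    z_term = (v - E) / std_E        # Z-score component of Eq. (1)
    minmax_term = (v - v.min()) / (v.max() - v.min())  # min-max component of Eq. (1)
    return z_term * minmax_term * (D - C) + C

column = np.array([12.0, 15.0, 11.0, 19.0, 14.0])   # assumed sample column
print(combined_normalize(column))
```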

The data are thus normalized based on the above min–max along with Z-score normalization approach, which effectively converts unstructured data into structured data and scales each input to improve the predictive analytics process. The normalized data [25] are fed into the subsequent feature selection process, because this step largely determines the quality of the predictive analytics process. Before the FGACO approach is discussed, its overall processing structure is shown in Fig. 3.

Fig. 3 FGACO-based feature selection process

3.1 Feature selection process

Feature selection is the process of selecting a subset of features from the search space. The selected features minimize cost and time and improve the reliability of the subsequent processes. Several methods [26], such as the genetic algorithm (GA), particle swarm optimization (PSO), the bat algorithm (BA), and rough set theory (RST) [27], have been used to reduce the dimension of the feature space and to select optimized features, but they consume a lot of time and reduce the system's reliability. Therefore, other methods, such as the gravitational search algorithm (GSA), the firefly algorithm (FA), ant colony optimization (ACO), and the artificial bee colony (ABC), have been introduced for selecting suitable features from the feature space. The following sections describe the process and procedure of each algorithm and of the combined process for selecting optimized features from the feature space.

3.1.1 Gravitational search algorithm for feature selection process

Rashedi et al. developed the GSA [28] for examining the features of particular data using Newton's law of gravitation [29], which states that any two particles attract each other with a force directed along the line connecting them. This force is directly proportional to the product of their masses and inversely proportional to the square of the distance between them; in other words, the gravitational attraction between two data points decreases as the distance between them increases. The GSA selects features through the following steps: identification of the search space, initialization of the agents in the search space, calculation of the fitness function, and updating of the best and worst feature values depending on their acceleration, velocity, and direction. Initially, the agents are initialized in the search space with values between 0 and 1. The search space consists of a collection of N agents that occupy particular positions in the m-dimensional space, represented as follows:

$$Y_{i} = \left( {y_{i}^{1} , \ldots y_{i}^{d} \ldots y_{i}^{n} } \right), \quad i = 1,2 \ldots N$$
(4)

In Eq. (4), \(y_{i}^{d}\) is the position of the ith agent in dimension d. The defined agents have values between 0 and 1, which facilitates effective analysis of the features. After defining the agents, the mass value of each agent is calculated using the fitness value as follows:

$${\text{ma}}_{i} \left( t \right) = \frac{{{\text{fitness}} _{i} \left( t \right) - {\text{worst}} \left( t \right)}}{{{\text{best}} \left( t \right) - {\text{worst}} \left( t \right)}}$$
(5)

Based on the mai(t) value, the normalized mass of each agent is calculated as follows:

$$M_{i}(t) = \frac{\mathrm{ma}_{i}(t)}{\sum_{j = 1}^{N} \mathrm{ma}_{j}(t)}$$
(6)

In the above equations, fitnessi(t) is the fitness value of each agent at a particular time t, and worst(t) and best(t) are defined as follows:

$${\text{best}}\left( t \right) = \hbox{min} \,{\text{fitness}} _{j} \left( t \right)\quad j \in 1, \ldots N$$
(7)
$${\text{worst}} \left( t \right) = \hbox{max} \,{\text{fitness}} _{j} \left( t \right)\quad j \in 1, \ldots N$$
(8)

After estimating the agent masses, the force between the agents and the distance between them are computed; the force is directly proportional to the product of the two masses and inversely related to the distance between the agents, and it is estimated as follows:

$$F_{ij}^{d} = G\left( t \right)\frac{{M_{i} \left( t \right)*M_{j} \left( t \right)}}{{\left( {D_{ij} \left( t \right)} \right)^{n} + \varepsilon }}$$
(9)

In Eq. (9), \(F_{ij}^{d}\) represents the magnitude of the gravitational force acting between masses i and j in the dth dimension.

G(t) is the gravitational constant at a particular time t, Mi(t) and Mj(t) are the mass values of the two agents, and Dij(t) is the distance between them. G(t) is computed as follows:

$$G(t) = G(G_{0}, t)$$
(10)

In Eq. (10), G0 is the initial gravitational value at the first iteration.

Then, trapping in local optima is reduced by using the best fitness and mass values; the total force is computed as follows:

$$F_{i}^{d}(t) = \sum_{j \in K\text{best},\, j \ne i} \mathrm{rand}_{j}\, F_{ij}^{d}(t)$$
(11)

In Eq. (11), randj is a random number with a value between 0 and 1.

During this process, each agent exerts a force on every other agent present in the search space, and this process is repeated until the force no longer reaches the other agents. According to Newton's second law, the acceleration of an agent in the search space is calculated as follows:

$$a_{i}^{d} \left( t \right) = \frac{{F_{i}^{d} \left( t \right)}}{{M_{i} \left( t \right)}}$$
(12)

Then, the agents search the search space for optimal features. During this process, the velocity of each agent is estimated using the above-estimated acceleration value, which is defined as follows:

$$v_{i}^{d} \left( {t + 1} \right) = {\text{rand}}_{j} *v_{i}^{d} \left( t \right) + a_{i}^{d} \left( t \right)$$
(13)

where randj is a random number with a value between 0 and 1.

After the new velocity values are calculated, they are converted into probability values; smaller velocities are mapped to values near zero and larger velocities to values near one, as follows:

$$S\left( v_{i}^{d}(t) \right) = \left| \tanh \left( v_{i}^{d}(t) \right) \right|$$
(14)

During this process, the agents move through the search space in order to detect the optimized features, which are identified by estimating the relationship between the agents. At each time step, the relationship between the agents is estimated, and the position is updated as follows:

$$x_{i}^{d} \left( {t + 1} \right) = x_{i}^{d} \left( t \right) + v_{i}^{d} \left( {t + 1} \right)$$
(15)
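To make the update rules of Eqs. (5)–(15) concrete, a hedged single-iteration sketch over a toy population is given below; the population size, the placeholder fitness values, the gravitational constant, and the use of a plain Euclidean distance in the force term are assumptions, not the authors' exact implementation.

```python
# Hedged single-iteration sketch of the GSA updates (Eqs. 5-15) over toy agents.
import numpy as np

rng = np.random.default_rng(1)
N, dim = 20, 8                        # assumed number of agents and dimensions
X = rng.random((N, dim))              # agent positions in [0, 1]
V = np.zeros((N, dim))                # agent velocities
fitness = rng.random(N)               # placeholder fitness values (smaller is better)

best, worst = fitness.min(), fitness.max()      # Eqs. (7)-(8)
ma = (fitness - worst) / (best - worst)         # Eq. (5)
M = ma / ma.sum()                               # Eq. (6): normalized masses

G_t, eps = 1.0, 1e-9                  # assumed gravitational constant and softening term
F = np.zeros((N, dim))
for i in range(N):
    for j in range(N):
        if i == j:
            continue
        diff = X[j] - X[i]
        dist = np.linalg.norm(diff)   # Euclidean distance is an assumption for D_ij
        # Eq. (9) magnitude with Eq. (11)-style random weighting
        # (the restriction to the Kbest agents is omitted in this sketch).
        F[i] += rng.random() * G_t * M[i] * M[j] / (dist + eps) * diff

a = F / (M[:, None] + eps)            # Eq. (12): acceleration
V = rng.random((N, dim)) * V + a      # Eq. (13): velocity update
S = np.abs(np.tanh(V))                # Eq. (14): probability-like transform
X = np.clip(X + V, 0.0, 1.0)          # Eq. (15): position update, kept in [0, 1]
print("mean probability value:", S.mean())
```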

Based on the above process, updating the agents' values helps to determine the optimum feature values. If an agent's absolute velocity value is close to the feature mass value, the feature is considered optimal, and the remaining features are eliminated from the feature space. The GSA-based feature selection process has several benefits, such as ease of implementation, low computational cost, and fast convergence. Although the GSA is effective and uses only two parameters, mass and velocity, its computation time remains its main drawback. Next, the feature selection process is examined with the help of the FA.

3.1.2 Fireflies algorithm (FA) for feature selection process

The next feature selection method is the FA, which is based on the flashing behavior of fireflies; each firefly has two properties, namely a light intensity and an attractiveness value. The FA selects optimized features using these two properties. Initially, the intensity value is estimated by calculating the minimum or maximum value of the feature. Then, the attractiveness of a feature is estimated by finding the distance between features, calculated as follows:

$$d_{H}(X, Y) = \max \left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y),\; \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\},$$
(16)

where sup refers to the supremum, inf refers to the infimum, and dH(X, Y) represents the Hausdorff distance between the feature sets X and Y.

Then, the estimated features are ranked according to their attractiveness and intensity values. Based on this ranking, the feature with the highest attractiveness and intensity values is considered the most optimized feature. This process is repeated until the optimized features are obtained. The FA-based feature selection process has several advantages, such as robustness, ease of use, and highly precise feature detection. Although the FA detects features with high precision, it converges slowly, which reduces the overall efficiency of the search for optimal features.
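A hedged sketch of this attractiveness-and-intensity ranking is given below; the toy data, the use of the feature range as the intensity value, and the averaging of pairwise Hausdorff distances of Eq. (16) into a single attractiveness score are assumptions made for illustration.

```python
# Hedged sketch of the FA-style feature ranking described above.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(2)
X = rng.random((100, 6))                    # assumed dataset: 100 samples, 6 features

def hausdorff(u, v):
    """Symmetric Hausdorff distance between two 1-D feature columns (Eq. 16)."""
    u, v = u.reshape(-1, 1), v.reshape(-1, 1)
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

n_features = X.shape[1]
intensity = X.max(axis=0) - X.min(axis=0)   # assumed intensity proxy: feature range
attractiveness = np.array([                 # mean Hausdorff distance to all other features
    np.mean([hausdorff(X[:, i], X[:, j]) for j in range(n_features) if j != i])
    for i in range(n_features)
])

# Rank features by the combined attractiveness and intensity score; keep the top half.
score = attractiveness + intensity
selected = np.argsort(score)[::-1][: n_features // 2]
print("selected feature indices:", selected)
```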

3.1.3 Ant colony optimization (ACO)-based feature selection process

Ant colony optimization is a probabilistic optimization method [30] based on the food-searching behavior of ants. During foraging, an ant searches for food along a random path and, if it finds food, returns along the same path. While wandering, the ant deposits a chemical pheromone so that it can recognize the path on its return, and this also helps other ants recognize that path. The pheromone evaporates quickly, but its density guides the search and helps avoid premature convergence. Based on this behavior, optimal features are selected from a feature set by constructing a graph that defines the paths between features [31]: each feature is a node, and edges connect the nodes. Based on these links, the path with the minimum number of nodes is selected. During this process, the feature transition and pheromone values are updated continuously to investigate the optimal features. The probabilistic transition rule is as follows:

$$P_{i}^{k}(t) = \frac{\left| t_{i}(t) \right|^{\alpha} \left| n_{i} \right|^{\beta}}{\sum_{u \in j^{k}} \left| t_{u}(t) \right|^{\alpha} \left| n_{u} \right|^{\beta}}, \quad \text{if } i \in j^{k},$$
(17)

where jk is the set of feasible features, ti is the pheromone value of feature i, ni is its heuristic desirability, and α and β are parameters that control the relative influence of the pheromone and heuristic information.

The estimated transition probability is used to balance the pheromone intensity. The pheromone value is then updated to compensate for pheromone evaporation as follows:

$$\Delta t_{i}^{k} = \left\{ {\begin{array}{*{20}l} {\phi .\gamma \left( {s^{k} \left( t \right)} \right) + \frac{{\phi .\left( {n - \left| {S^{k} \left( t \right)} \right|} \right)}}{n}} \hfill & {{\text{if}}\,i \in s^{k} \left( t \right)} \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.,$$
(18)

where sk(t) is the selected feature subset.

This process is repeated until the optimal features are detected from the feature set. The ACO method has several advantages, such as effective adaptation to dynamic environments and avoidance of convergence to local optima.
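As a hedged sketch only, the quantities in Eqs. (17) and (18) can be illustrated as follows; the pheromone and heuristic values, the influence parameters, and the subset quality are assumptions rather than values from this work.

```python
# Hedged sketch of the ACO quantities in Eqs. (17)-(18) for feature selection.
import numpy as np

rng = np.random.default_rng(3)
n_features = 10
tau = np.ones(n_features)            # pheromone trail per feature
eta = rng.random(n_features)         # assumed heuristic desirability per feature
alpha, beta = 1.0, 2.0               # assumed influence parameters

def transition_probabilities(candidates):
    """Eq. (17): probability of choosing each feasible feature."""
    weights = (tau[candidates] ** alpha) * (eta[candidates] ** beta)
    return weights / weights.sum()

def pheromone_increment(subset, quality, phi=0.5):
    """Eq. (18): reward features in the selected subset; the reward grows with
    the subset quality and shrinks as the subset gets larger."""
    delta = np.zeros(n_features)
    delta[subset] = phi * quality + phi * (n_features - len(subset)) / n_features
    return delta

candidates = np.arange(n_features)
p = transition_probabilities(candidates)
subset = rng.choice(candidates, size=4, replace=False, p=p)   # sample a candidate subset
tau = tau + pheromone_increment(subset, quality=0.8)          # reinforce its features
print("transition probabilities:", p.round(3), "chosen subset:", subset)
```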

According to the above feature selection approaches and their benefits, this work employs the FGACO approach for selecting optimized features from the dataset to develop an effective predictive analytic system. The developed FGACO system selects the optimized features with minimum time and cost and also effectively eliminates issues related to convergence to local optima. Details of the FGACO method are given in the following section.

3.1.4 FGACO-based feature selection process

The proposed feature selection process is FGACO, which combines the aforementioned benefits when analyzing features from the feature space. Initially, the data are collected and organized in the feature space, and the noise present in the data is removed using the min–max along with Z-score normalization process described in Eq. (1). After inconsistent data are eliminated from the feature space, the probability transition value of each feature is calculated using Eq. (17). Along with the transition values, the attractiveness and intensity values of a particular feature are estimated by calculating the distance between features. As discussed above, the FA typically uses the Euclidean or Hausdorff distance; in this work, the Minkowski distance is used for estimating the attractiveness value, which is calculated as follows:

$${\text{attractiveness}} = \left( {\mathop \sum \limits_{i = 1}^{n} \left| {x_{i} - y_{i} } \right|^{p} } \right)^{1/p} ,$$
(19)

where n represents the number of features in the feature set and p is the order of the Minkowski distance.
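A minimal sketch of this Minkowski-based attractiveness is given below; the feature vectors and the order p = 3 are assumptions for illustration.

```python
# Hedged sketch of the Minkowski-based attractiveness of Eq. (19).
import numpy as np

def minkowski_attractiveness(x, y, p=3):
    """Eq. (19): Minkowski distance of order p between two feature vectors."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

x = np.array([0.2, 0.5, 0.9])                 # assumed feature vector
y = np.array([0.1, 0.7, 0.4])                 # assumed feature vector
print(minkowski_attractiveness(x, y))         # order p = 3
print(minkowski_attractiveness(x, y, p=2))    # p = 2 reduces to the Euclidean distance
```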

After the attractiveness value is estimated, the intensity value is estimated using the minimum and maximum values of the feature. These values identify near-optimal features, and the selection is further refined by applying the gravitational search algorithm. The algorithm functions according to Newton's law, so the acceleration, velocity, and mass are calculated for each recognized feature in the m-dimensional space. First, the mass value of the feature is evaluated using Eqs. (5) and (6). Based on the mass value, the change in force direction is analyzed by estimating the distance between features using Eq. (9). The best fitness value among all calculated values is identified using Eq. (11). Depending on these values, the acceleration and velocity of the feature are examined using Eqs. (12) and (13). Based on this information, the probability value of the feature is calculated and the feature position is updated using Eq. (15). The updated agent values are then used to determine the optimal features: the calculated velocity value is compared with the mass value, and if it is close to the mass value, the feature is considered optimal and the remaining features are eliminated. During this process, the pheromone value is continuously updated using Eq. (18). The procedure is repeated until all features present in the feature space have been examined. The FGACO processing steps are listed below.

Algorithm steps for FGACO

Step 1: Collect and initialize all features or data from the dataset

Step 2: Remove the inconsistent and irrelevant data from the dataset as follows:

\(Y = \frac{v_{i} - E}{\operatorname{std}(E)} \times \left( \frac{A - \min (A)}{\max (A) - \min (A)} \right) \times \left( D - C \right) + C\)

Step 3: Calculate the transition probability values of the features as follows: \(P_{i}^{k}(t) = \frac{\left| t_{i}(t) \right|^{\alpha} \left| n_{i} \right|^{\beta}}{\sum_{u \in j^{k}} \left| t_{u}(t) \right|^{\alpha} \left| n_{u} \right|^{\beta}}, \quad \text{if } i \in j^{k}\)

Step 4: Along with the probability value, estimate the attractiveness value of each feature by computing the distance between features:

\({\text{attractiveness}} = \left( {\mathop \sum \limits_{i = 1}^{n} \left| {x_{i} - y_{i} } \right|^{p} } \right)^{1/p}\)

Step 5: Find the intensity value of each feature by determining its minimum and maximum values.

Step 6: Calculate the mass value of each feature according to its attractiveness and intensity values:

\(M_{i}(t) = \frac{\mathrm{ma}_{i}(t)}{\sum_{j = 1}^{N} \mathrm{ma}_{j}(t)}\)

\({\text{ma}}_{i} \left( t \right) = \frac{{{\text{fitness}}_{i} \left( t \right) - {\text{worst}}\left( t \right)}}{{{\text{best}}\left( t \right) - {\text{worst}}\left( t \right)}}\)

\({\text{best}}\left( t \right) = { \hbox{min} }\,{\text{fitness}}_{j} \left( t \right)\quad j \in 1, \ldots N\)

\({\text{worst}}\left( t \right) = { \hbox{max} }\,{\text{fitness}}_{j} \left( t \right)\quad j \in 1, \ldots N\)

Step 7: Based on the above process, estimate the change of force direction by

\(F_{ij}^{d} = G\left( t \right)\frac{{M_{i} \left( t \right)*M_{j} \left( t \right)}}{{\left( {D_{ij} \left( t \right)} \right)^{n} + \varepsilon }}\)

Step 8: According to the feature mass and force direction, examine the best fitness value as:

\(F_{i}^{d}(t) = \sum_{j \in K\text{best},\, j \ne i} \mathrm{rand}_{j}\, F_{ij}^{d}(t)\)

Step 9: Calculate the feature acceleration and velocity value:

\(a_{i}^{d} \left( t \right) = \frac{{F_{i}^{d} \left( t \right)}}{{M_{i} \left( t \right)}}\)

\(v_{i}^{d} \left( {t + 1} \right) = {\text{rand}}_{j} *v_{i}^{d} \left( t \right) + a_{i}^{d} \left( t \right)\)

Step 10: Convert the calculated velocity value into the probability value:

\(S\left( v_{i}^{d}(t) \right) = \left| \tanh \left( v_{i}^{d}(t) \right) \right|\)

Step 11: Update the position of each feature value as follows:

\(x_{i}^{d} \left( {t + 1} \right) = x_{i}^{d} \left( t \right) + v_{i}^{d} \left( {t + 1} \right)\)

Step 12: Compare the calculated velocity value with the feature mass value; if it is nearer to the mass value, it is considered the optimal feature.

Step 13: Update the pheromone value as follows:

\(\Delta t_{i}^{k} = \left\{ {\begin{array}{*{20}l} {\phi .\gamma \left( {s^{k} \left( t \right)} \right) + \frac{{\phi .\left( {n - \left| {S^{k} \left( t \right)} \right|} \right)}}{n}} \hfill & {{\text{if}}\, i \in s^{k} \left( t \right)} \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.\)

Step 14: Repeat this process continuously until the optimal features can be detected from the feature set.
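A hedged, schematic sketch of these steps is given below. It strings together the ACO transition rule (Eq. 17), the Minkowski attractiveness (Eq. 19), and the GSA mass, force, velocity, and position updates (Eqs. 5–15), followed by the pheromone reinforcement of Eq. (18); the toy data, the fitness proxy, the parameter values, the choice of inter-feature distance, and the closeness rule in Step 12 are all assumptions and do not reproduce the authors' implementation.

```python
# Hedged, schematic sketch of the FGACO steps above (not the authors' implementation).
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((150, 12))                  # assumed normalized data with 12 candidate features
n = X.shape[1]

def minkowski(a, b, p=3):                  # Eq. (19); the order p = 3 is an assumption
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Pairwise distances between feature columns, used both for the attractiveness (Eq. 19)
# and as the inter-feature distance in Eq. (9) - an interpretation, not a given.
D = np.array([[minkowski(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
attractiveness = D.sum(axis=1) / (n - 1)
intensity = X.max(axis=0) - X.min(axis=0)  # Step 5: assumed intensity proxy

tau = np.ones(n)                           # pheromone per feature
eta = attractiveness                       # assumed heuristic desirability
alpha, beta, phi = 1.0, 2.0, 0.5           # assumed parameters
velocity = np.zeros(n)
position = rng.random(n)                   # scalar position per feature, for Eq. (15)

for iteration in range(30):                # assumed iteration budget
    # Step 3: transition probabilities (Eq. 17).
    w = (tau ** alpha) * (eta ** beta)
    prob = w / w.sum()

    # Step 6: placeholder cost (smaller is better) and GSA masses (Eqs. 5-8).
    fitness = 1.0 / (prob * (attractiveness + intensity) + 1e-12)
    best, worst = fitness.min(), fitness.max()
    ma = (fitness - worst) / (best - worst)
    M = ma / (ma.sum() + 1e-12)

    # Steps 7-9: force, acceleration, and velocity (Eqs. 9-13).
    force = np.array([sum(rng.random() * M[i] * M[j] / (D[i, j] + 1e-9)
                          for j in range(n) if j != i) for i in range(n)])
    accel = force / (M + 1e-12)
    velocity = rng.random(n) * velocity + accel

    # Steps 10-11: probability transform and position update (Eqs. 14-15).
    S = np.abs(np.tanh(velocity))
    position = np.clip(position + velocity, 0.0, 1.0)

    # Step 12 (assumed rule): keep features whose velocity-based score is closest to their mass.
    gap = np.abs(S - M)
    selected = np.where(gap <= np.median(gap))[0]

    # Step 13: pheromone reinforcement for the selected subset (Eq. 18).
    tau[selected] += phi * S[selected].mean() + phi * (n - selected.size) / n

print("selected feature indices:", selected)
```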

According to the above algorithm steps, the effective features are selected from the feature space using the FGACO algorithm, which selects the features with minimum time and cost. In addition, the combined algorithm effectively eliminates local convergence in the feature set and detects the optimized features in a dynamic environment. The effective analysis of the features increases the overall efficiency of the feature selection process. The selected features then support future decision making through the predictive analytics process, which models the features and detects a particular pattern with the help of classification or clustering methods, such as linear discriminant analysis, support vector machines, neural networks, the k-nearest neighbor method, self-organizing map-based clustering, k-means clustering, and other clustering approaches. These methods successfully examine the relevant patterns using training features, which depend on the user-requested queries. The efficiency of the FGACO method is examined using the experimental results explained in the following section.

4 Experimental results

This section discusses the efficiency of the FGACO-based feature selection process. The method was implemented in MATLAB, a multi-paradigm, fourth-generation programming environment, using the Feature Selection Library (FSLib, 2016), which provides extensive functionality for examining features. In addition, FSLib reduces dimensionality at a reasonable computational cost. With this experimental setup, the efficiency of the system was examined using different benchmark datasets: the Varibench protein dataset, the PDB dataset, the bank marketing dataset, and the lung cancer dataset. Details of each dataset are given next.

4.1 Varibench protein dataset

The Varibench protein dataset contains a collection of biological datasets with different protein details, such as protein stability, protein tolerance, transcription details, and splice sites. Moreover, the dataset has RNA and DNA sequence details, which help predict various disorders. Table 1 shows some of the details included in the Varibench protein dataset.

Table 1 Varibench protein dataset variations

4.2 PDB dataset

Another dataset is the PDB, which consists of several attributes relevant to various nucleic acids, proteins, and other biological information. In addition, the dataset consists of 103,514 structure factor files, 9057 NMR restraint files, 28,267 chemical shift files, etc. Table 2 lists the total entries present in the PDB dataset.

Table 2 PDB dataset details

4.3 Lung cancer dataset

The lung cancer dataset consists of cancer-related information collected from different patients. The dataset contains 32 instances, each with 57 attributes, which are used for the cancer prediction process.

4.4 Bank marketing dataset

The bank marketing dataset consists of data collected from a Portuguese banking institution, with the information recorded from phone calls. The dataset contains 45,211 examples collected from user orders, and each entry has 20 attributes that help predict future decisions based on user requests.

Using the above benchmark datasets, the efficiency of the FGACO system was evaluated with various metrics, such as the number of selected features, the error rate of the selected features, sensitivity, specificity, accuracy, and the time taken by the feature selection methods. The feature selection method effectively analyzes the dataset details, and the important, optimum features are selected according to the user request. In addition, the method selects features from any dataset in a dynamic environment with minimum time. The efficiency of FGACO is compared with that of the traditional GSA, FA, and ACO methods. The number of features selected from the different datasets is shown in Table 3.

Table 3 Number of selected features of different datasets

Table 3 clearly shows that the FGACO method selects the optimized features from different datasets when compared to other traditional methods. The resultant graph is shown in Fig. 4.

Fig. 4 Graphical representation of number of selected features

According to Fig. 4, the FGACO method selects the minimum number of features from the different datasets: 163 features from the Varibench dataset, 175 from the PDB dataset, 194 from the lung cancer dataset, and 183 from the bank marketing dataset. The selected subsets are much more optimized than those of the other traditional methods, namely GSA (400 features for the Varibench dataset, 435 for PDB, 426 for the lung cancer dataset, and 415 for the bank marketing dataset), FA (353, 387, 369, and 351 features, respectively), and ACO (314, 339, 320, and 341 features, respectively). Thus, the FGACO method selects optimized features from any kind of dataset, unlike the other methods. Although the method selects the minimum number of features, it also achieves a minimal error rate compared with the other methods. The obtained mean-square-error values are presented in Table 4. The root-mean-square error (RMSE) is computed from the differences between the predicted and observed values as follows:

$${\text{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{t = 1}^{n} \left( {\hat{y}_{i} - y_{i} } \right)^{2} }}{n}}$$
(20)
Table 4 Error rate

In Eq. (20), \(\hat{y}_{i}\) is the predicted value and yi is the observed value; the mean square error (MSE) is the average of the squared differences between the observed and predicted values. The values in Table 4 were calculated on this basis.
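As a small hedged example, the error metrics of Eq. (20) can be computed as follows; the predicted and observed vectors are assumptions for illustration.

```python
# Hedged sketch of the error metrics of Eq. (20) on assumed values.
import numpy as np

y_pred = np.array([0.91, 0.12, 0.55, 0.77])   # \hat{y}_i: predicted values (assumed)
y_obs = np.array([0.88, 0.15, 0.60, 0.70])    # y_i: observed values (assumed)

mse = np.mean((y_pred - y_obs) ** 2)          # mean square error
rmse = np.sqrt(mse)                           # root-mean-square error, Eq. (20)
print(f"MSE = {mse:.5f}, RMSE = {rmse:.5f}")
```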

Table 4 clearly shows that the FGACO method selects the optimized features with a minimal error rate from different datasets when compared to other traditional methods, such as GSA, FA, and ACO. The resulting graph is shown in Fig. 5.

Fig. 5 Graphical representation of error rates

From Fig. 5, it can be seen that the FGACO method selects the minimum number of features with a minimal error rate from the different datasets. FGACO exhibits a minimum MSE of 0.0945 for the Varibench dataset, 0.08175 for the PDB dataset, 0.0919 for the lung cancer dataset, and 0.8183 for the bank marketing dataset. In addition, it has a minimum RMSE of 0.09164 for the Varibench dataset, 0.089 for the PDB dataset, 0.0832 for the lung cancer dataset, and 0.091 for the bank marketing dataset. The selected features have a minimal error rate compared with the other traditional methods, namely GSA (MSE/RMSE of 0.89/0.9562 for the Varibench dataset, 0.867/0.954 for PDB, 0.8426/0.921 for the lung cancer dataset, and 0.7915/0.934 for the bank marketing dataset), FA (0.745/0.8332, 0.7387/0.821, 0.7369/0.819, and 0.6352/0.832, respectively), and ACO (0.634/0.7243, 0.6339/0.712, 0.6320/0.712, and 0.5341/0.698, respectively). According to this analysis, the FGACO method achieves the minimum error rate, which increases the accuracy of the feature selection method; this accuracy is measured using the sensitivity and specificity metrics. These metrics are used to investigate whether the FGACO method correctly recognizes the optimized features, and they are estimated as follows:

$${\text{Sensitivity}} = \frac{{{\text{True}}\,{\text{Positive}}}}{{\left( {{\text{True}}\,{\text{Positive}} + {\text{False}}\,{\text{Negative}}} \right)}}$$
(21)
$${\text{Specificity}} = \frac{{{\text{True}}\,{\text{Negative}}}}{{\left( {{\text{True}}\,{\text{Negative}} + {\text{False}}\,{\text{Positive}}} \right)}}$$
(22)
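A minimal hedged example of Eqs. (21) and (22) computed from confusion-matrix counts is shown below; the counts themselves are assumptions.

```python
# Hedged sketch of Eqs. (21)-(22) from assumed confusion-matrix counts.
tp, fn, tn, fp = 96, 4, 97, 3                 # assumed true/false positive/negative counts

sensitivity = tp / (tp + fn)                  # Eq. (21)
specificity = tn / (tn + fp)                  # Eq. (22)
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}")
```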

Based on Eqs. (21) and (22), the sensitivity and specificity of the FGACO method are evaluated, and the obtained values are presented in Table 5.

Table 5 Sensitivity and specificity

Table 5 clearly shows that, compared with the other traditional methods, the FGACO method selects the optimized features related to the user requests with higher true positive rates across the different datasets. The graph is shown in Fig. 6.

Fig. 6 Graphical representation of sensitivity

Figure 6 shows that the FGACO method selects the optimized features from the different datasets with a high sensitivity value: 96.23% for the Varibench dataset, 98.23% for the PDB dataset, 98.63% for the lung cancer dataset, and 98.43% for the bank marketing dataset. The selected features are strongly optimized compared with those of the other traditional methods, namely GSA (sensitivity of 79.03% for Varibench, 75.03% for PDB, 76.03% for the lung cancer dataset, and 79.4% for the bank marketing dataset), FA (82.45%, 81.45%, 83.45%, and 83.67%, respectively), and ACO (91.3%, 92.3%, 93.3%, and 94.2%, respectively). Thus, the FGACO method selects the optimized features with a high sensitivity rate from any kind of dataset. In addition to selecting the minimum number of features with a high sensitivity rate, the method should also achieve a high specificity rate on all datasets, because specificity measures how well the feature selection method detects and eliminates irrelevant features, i.e., true negatives. The obtained specificity values are shown in Fig. 7.

Fig. 7 Performance values of specificity

Figure 7 shows that the FGACO method selects the optimized features from the different datasets with a high specificity rate: 97.21% for the Varibench dataset, 97.31% for the PDB dataset, 98.6% for the lung cancer dataset, and 98.21% for the bank marketing dataset. The selected features are highly optimized compared with those of the other traditional methods, namely GSA (specificity of 78.23% for Varibench, 76.32% for PDB, 77.23% for the lung cancer dataset, and 80.36% for the bank marketing dataset), FA (83.42%, 83.32%, 83.52%, and 82.67%, respectively), and ACO (91.45%, 93.13%, 94.78%, and 93.56%, respectively). Thus, the FGACO method selects the optimized features with a high specificity rate from any kind of dataset. The increased specificity and sensitivity rates increase the overall accuracy of the feature selection process. The obtained accuracy values are presented in Table 6.

Table 6 Feature selection accuracy

Table 6 clearly shows that the FGACO method selects the optimized features with high accuracy when compared to the other traditional methods. The resultant graph is shown in Fig. 8.

Fig. 8 Graphical representation of accuracy

Figure 8 shows that the FGACO method selects the optimized features from the different datasets with high accuracy, which means it effectively selects the features according to the user query: 98.45% for the Varibench dataset, 98.57% for the PDB dataset, 97.93% for the lung cancer dataset, and 98.90% for the bank marketing dataset. These accuracies indicate that the features are highly optimized compared with those of the other traditional methods, namely GSA (83.56% for Varibench, 82.78% for PDB, 83.51% for the lung cancer dataset, and 83.46% for the bank marketing dataset), FA (86.65%, 85.79%, 86.43%, and 86.09%, respectively), and ACO (88.96%, 87.56%, 88.32%, and 88.56%, respectively). Thus, the FGACO method selects the optimized features with high accuracy from any kind of dataset. Moreover, the FGACO method detects the features in minimal time; the obtained time values are presented in Table 7.

Table 7 Time

Table 7 clearly shows that the FGACO method selects the optimized features with minimum time when compared to the other traditional methods. The resultant graph is shown in Fig. 9.

Fig. 9 Graphical representation of time

Figure 9 shows that the FGACO method selects the optimized features from the different datasets in minimum time: 10.45 ms for the Varibench dataset, 9.57 ms for the PDB dataset, 9.93 ms for the lung cancer dataset, and 8.90 ms for the bank marketing dataset. These times are much lower than those of the other traditional methods, namely GSA (23.89 ms for the Varibench dataset, 24.78 ms for PDB, 23.98 ms for the lung cancer dataset, and 23.64 ms for the bank marketing dataset), FA (19.76 ms, 19.32 ms, 18.80 ms, and 18.58 ms, respectively), and ACO (14.78 ms, 14.21 ms, 14.69 ms, and 13.98 ms, respectively). Thus, FGACO selects the optimized features from a large dataset in minimum time and with a minimal error rate, and the retrieved features exhibit high accuracy compared with the other traditional methods. The selected features are then used for further decision handling or prediction, which is done with the help of classification and clustering. The FGACO method is therefore effective for different kinds of data.

5 Conclusion

This paper examined the FGACO-based feature selection process and the effective use of the selected features in a predictive analytics process. During the analysis, the data were collected from four different datasets, namely the Varibench, PDB, lung cancer, and bank marketing datasets. After the data were collected, the noise present in them was eliminated with the help of the min–max along with Z-score normalization process. Then, the transition probability, attractiveness, intensity, mass, acceleration, and velocity values of the features were estimated. The calculated velocity value of each feature was compared with its mass value; if it was close to the mass value, the feature was considered optimal, and otherwise it was treated as a local feature. The efficiency of the FGACO method was evaluated using MATLAB, and the feature selection method achieved an average accuracy of 98.4625%. In addition, the FGACO method selects features in minimum time compared with other feature selection methods. In future work, the feature selection process can be further enhanced by applying an estimation process for optimized fitness values.