
1 Introduction

Over the past decade, enthusiasm for Feature Selection (FS) techniques has grown steadily, and FS has proven to be a real catalyst for model building. To use machine learning methods effectively, pre-processing of the data is essential. Feature selection is one of the most common and important pre-processing techniques and has become an integral part of the machine learning process. It is the process of identifying the right features and removing inaccurate, unwanted or noisy data. This process speeds up data mining algorithms and improves predictive accuracy. Our interest is concentrated on high-dimensional data: the sheer amount of such data has posed the greatest challenge to existing machine learning methods.

In the supervised learning setting, feature selection provides the designer with a feature subset chosen by one of the following criteria: a subset of a specified size that maximizes an evaluation measure, or the subset that represents the best overall compromise between size and evaluation score. Feature selection methods have changed dramatically over the last few years, and many papers now address domains with hundreds to tens of thousands of variables. New approaches are being proposed to deal with these challenging tasks, which involve many irrelevant and redundant variables and often only a few training examples. We call "variables" the raw input variables and "features" the variables constructed from the raw inputs; we use "variables" and "features" interchangeably when the distinction does not matter for the selection algorithms, e.g. in the text classification problem, where documents are represented by vectors whose dimension equals the size of the vocabulary and whose entries are word counts. The benefits of feature selection include better visualization and understanding of the data, reduction in dimensionality and storage requirements, shorter training and usage times, and mitigation of the curse of dimensionality to improve predictive performance.

The feature selection problem under supervised learning can be stated as follows: given a set of candidate features, select a subset defined by one of three approaches: the subset of a specified size that optimizes an evaluation measure, the smallest subset that satisfies a given constraint on the evaluation measure, or the subset that offers the best trade-off between its size and its evaluation score. A smaller feature set also makes it possible to better understand the results obtained by the inducer and reduces its storage requirements. An irrelevant feature does not contribute to prediction, but not every relevant feature is necessarily useful either. A further benefit of feature selection is that the time spent training the model is greatly reduced because a limited number of parameters is used.

The authors of a paper [1] discuss the robustness of feature selection techniques, which deserves attention when analyzing selected feature subsets. They show that these techniques are useful for high-dimensional domains with small sample sizes. In addition, they investigate the effects of integrating feature selection strategies into classification, providing a new strategy for model selection.

Another paper [2] surveys the landscape of feature selection: it provides a basic hierarchical taxonomy of feature selection techniques, discusses their use and diversity across common areas of application, and introduces several important current and upcoming bioinformatics applications.

This paper is organized as follows: Sect. 1 introduces the concept of Feature Selection. The need for Feature Selection is highlighted in Sect. 2. The various categories of Feature Selection techniques are explained in Sect. 3. Section 4 surveys the literature. The methodology adopted in this article is explained in Sect. 5. Experiments performed and results obtained are discussed in Sect. 6. The paper is concluded in Sect. 7.

2 Need of Feature Selection

Feature selection, as the name suggests, is the process of picking out the important features from a given data set, or in simple terms, of reducing the number of input variables while developing a predictive model; it is also referred to as variable selection. Nowadays, datasets contain abundant information collected from countless IoT devices and sensors. Such datasets have a huge number of attributes, not all of which are useful; a large number of features makes a model bulky, time-consuming, high-dimensional and harder to deploy in production.

Feature selection is vital because noisy data needs to be removed. Datasets often contain many low-frequency features, features of mixed types, too many features relative to the number of samples, and overly complex models, and real-world samples are not homogeneous with the training and test samples. All of these points make it very important for the data to be cleaned properly. Data cleaning is the most important and tedious task in data mining; if the data is not cleaned and arranged properly, obtaining the desired results takes much longer. Feature selection also reduces the computation time needed to build the model and helps us focus on what is most important. It reduces the risk of over-fitting, which is related to the curse of dimensionality: as the dimensionality of the feature space increases, the number of possible configurations grows exponentially, which becomes very difficult to handle with large amounts of data. It improves accuracy, since less misleading data leads to better accuracy, and it reduces training time, since less data means that algorithms train faster. Feature selection also reduces the computational cost of modelling [3].

Considering all these points, we realize how important it is to select the correct features accurately. A feature may not seem important on its own, yet it can be related to other attributes. Feature selection methods help with these problems by reducing dimensionality, redundancy and noise without much loss of the total information. By applying feature selection methods, we also learn the significance of each feature in improving the model.

Feature selection can be done manually by data analysts, but when the number of attributes is enormous it is not feasible for a person to do so, and it may take a very long time to complete a given project. This is where the various feature selection methods come in.

3 Feature Selection Techniques

Feature Selection techniques can be categorized into the following categories. The technique to be applied depends on the type of data, i.e. whether it is numerical or categorical.

3.1 Filter Methods

Filter methods do not depend on any learning algorithm; instead, they rely on the general characteristics of the training data. Filter-based feature selection methods [4] use statistical measures to assess the correlation or dependency between the input variables and the output, so that the most relevant features can be kept and the rest filtered out. The statistical measure must be chosen carefully based on the data types of the input variable and the output or response variable. In filter methods, the most relevant attributes are selected without using any machine learning algorithm: various statistical tests such as the chi-square test, ANOVA, LDA, information gain, Fisher score, correlation coefficient and variance threshold are applied, and features are selected on the basis of their scores for correlation with the outcome variable. Here, correlation is used in a broad, subjective sense.
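To make the filter idea concrete, the following is a minimal R sketch (R being the environment used for the experiments in Sect. 6) of a simple filter-style screening: a variance threshold followed by a correlation ranking. The data frame and its column names are synthetic placeholders, not the paper's dataset.

```r
# Minimal filter-style screening: variance threshold + correlation ranking.
# Synthetic data; column names are illustrative placeholders.
set.seed(42)
n  <- 200
df <- data.frame(
  Shipping.Cost = rnorm(n),
  Discount      = rnorm(n),
  Constant.Col  = rep(1, n)                    # zero-variance feature
)
df$Sales <- 3 * df$Shipping.Cost + rnorm(n)    # make one feature truly relevant

predictors <- setdiff(names(df), "Sales")

# 1. Variance threshold: drop features whose variance is (near) zero.
variances  <- sapply(df[predictors], var)
predictors <- predictors[variances > 1e-8]

# 2. Rank the remaining features by absolute correlation with the target.
scores <- sapply(predictors, function(p) abs(cor(df[[p]], df$Sales)))
sort(scores, decreasing = TRUE)                # higher score = more relevant
```

Note that the ranking above never trains a model: the scores come purely from statistical properties of the data, which is what distinguishes filter methods from the wrapper methods described next.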

3.2 Wrapper Methods

Wrapper methods use a learning algorithm to evaluate candidate features. Wrapper methods [5] come in various forms; in general, a subset of features is selected, a model is trained on it, and the process is repeated over many candidate subsets. These procedures involve a large number of steps, and the choice of method depends on the input data and on whether the output is numerical or categorical.
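As an illustration of the wrapper idea, the following is a hedged R sketch using recursive feature elimination (RFE) from the caret package with a random forest as the underlying learner; the data are synthetic and the settings are illustrative, not a configuration used in this paper.

```r
# Wrapper-style selection with recursive feature elimination (RFE).
# Requires the caret and randomForest packages; data are synthetic.
library(caret)

set.seed(42)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
y <- 2 * X$x1 - 1.5 * X$x3 + rnorm(n)          # only x1 and x3 matter

# RFE repeatedly trains the learner (here a random forest), scores candidate
# feature subsets by cross-validated performance and drops the weakest features.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
fit  <- rfe(X, y, sizes = c(1, 2, 3), rfeControl = ctrl)

predictors(fit)   # the feature subset chosen by the wrapper
fit$results       # resampled performance for each subset size
```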

3.3 Hybrid Methods

Hybrid methods combine properties of both Filter and Wrapper methods: typically, a filter stage first reduces the candidate feature set, and a wrapper stage then searches this reduced set using the learning algorithm.

3.4 Embedded Methods

There is another, more advanced, category referred to as embedded feature selection; in this paper we apply two embedded techniques. The difference between filter and wrapper methods is that wrapper methods measure the "usefulness" of features based on classifier performance, whereas filter methods capture the intrinsic properties of the features via univariate statistics rather than cross-validation performance. Implementations of these algorithms exist in Java, Python and R. Each technique has its advantages and disadvantages; for example, some feature selection techniques can only be applied to data without a very large number of attributes, while in other cases accuracy improves when there are more variables.

In this paper, we have applied two Feature selection algorithms, namely Gradient Boosting Machines (GBM) and Random Forests (RF) [6]. These two techniques are described as follows:

3.4.1 Gradient Boosting Machines

In Gradient Boosting [7, 8], additive regression models are built by iteratively fitting a simple base learner, by least squares, to the current pseudo-residuals at each iteration. The objective of this method is to find a function F*(x) that maps x to y such that, over the joint distribution of all (y, x) values, the expected value of some specified loss function Ψ(y, F(x)) is minimized. This relation is depicted in Eq. (1).

$$ F^{*} \left( x \right) = \arg \min E_{y,x} \Psi \left( {y,F\left( x \right)} \right) $$
(1)

where y is the random output or dependent variable and x = {x1, x2, …, xn} is a set of random input variables.
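To make the iterative fitting to pseudo-residuals concrete, the following is a toy R sketch of gradient boosting with squared-error loss, for which the pseudo-residuals are simply y − F(x); it uses shallow rpart trees as base learners and synthetic data, and is only an illustration of the idea behind Eq. (1), not the implementation used in the experiments.

```r
# Toy gradient boosting with squared-error loss: at every iteration a small
# regression tree is fit (by least squares) to the pseudo-residuals y - F(x).
library(rpart)

set.seed(42)
n  <- 300
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- sin(2 * pi * df$x1) + 0.5 * df$x2 + rnorm(n, sd = 0.1)

M     <- 100                         # number of boosting iterations
nu    <- 0.1                         # shrinkage (learning rate)
F_hat <- rep(mean(df$y), n)          # F_0(x): best constant under squared loss
trees <- vector("list", M)

for (m in seq_len(M)) {
  df$r       <- df$y - F_hat                           # pseudo-residuals
  trees[[m]] <- rpart(r ~ x1 + x2, data = df,
                      control = rpart.control(maxdepth = 2))
  F_hat      <- F_hat + nu * predict(trees[[m]], df)   # additive update
}

mean((df$y - F_hat)^2)               # training error shrinks as M grows
```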

3.4.2 Random Forests

Random Forests [9,10,11] were developed as an extension of the popular ensemble technique called Bagging. Random Forest is a tree-based ensemble technique in which each tree depends on a set of random variables. Random Forests are used for prediction in various domains, for example estimating traffic flow and road congestion [12].

Equation (2) shows the mean square generalization error in Random Forests for a numeric predictor h(x)

$$ E_{X,Y} \left( {Y - h\left( X \right)} \right)^{2} $$
(2)

The Random Forest predictor is constructed by averaging over the k trees \(\left\{ {h\left( {x,\uptheta _{k} } \right)} \right\}\).

Equation (3) states the limiting case as the number of trees in the forest tends to infinity:

$$ E_{X,Y} \left( {Y - av_{k} h\left( {X,\uptheta _{k} } \right)} \right)^{2} \to E_{X,Y} \left( {Y - E_{\uptheta } h\left( {X,\uptheta } \right)} \right)^{2} $$
(3)
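The averaging behind Eqs. (2) and (3) can be checked with a small R sketch using the randomForest package: the forest prediction for a new point equals the mean of the individual tree predictions. The data below are synthetic.

```r
# Random forest regression: the forest prediction is the average of the
# individual tree predictions, as in Eq. (3). Synthetic data.
library(randomForest)

set.seed(42)
n  <- 300
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(n, sd = 0.1)

rf <- randomForest(y ~ x1 + x2, data = df, ntree = 500)

new_x <- data.frame(x1 = 0.5, x2 = 0.5)
pred  <- predict(rf, new_x, predict.all = TRUE)

pred$aggregate               # the forest prediction
mean(pred$individual[1, ])   # equals the average over the 500 trees
```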

4 Literature Survey

This section reviews the work done by some researchers.

This paper [13] introduces a new feature selection algorithm based on the wrapper approach using neural networks. An important property of the algorithm is that the neural network architecture is determined automatically during the selection process. The algorithm uses a constructive approach that exploits correlation information when selecting features and determining the network architecture. The experimental results demonstrate the effectiveness of this constructive approach to feature selection with integrated architecture determination.

In this work [14], the authors propose an ensemble-based multi-filter feature selection method that combines the output of four filter methods to achieve an optimal selection. They also conduct a detailed evaluation of the proposed method on the NSL-KDD benchmark intrusion detection dataset with a decision tree classifier. The findings show that their approach can effectively reduce the number of features from 41 to 13 and achieves a higher detection rate and classification accuracy compared to other classification strategies.

Recent advances [4] in computer technology in terms of speed and cost, together with the ability to acquire and process large amounts of data in a timely manner, have stimulated interest in data mining applications that extract useful information from data. Machine learning has become one of the most widely used methods in these applications. In this study, the authors examined various inter-class distance-based feature selection measures in relation to their effectiveness in pre-processing the input data for decision tree induction. The results of their study showed that, in general, the distance-based measures lead to better performance compared to the expectation-based measures.

In this paper [15], the authors present a review of state-of-the-art information-theoretic feature selection methods. The concepts of feature relevance and redundancy are carefully defined, as is the Markov blanket. The problem of proper feature selection is explained. A unifying theoretical framework is developed which can retrospectively derive successful heuristic selection criteria, making explicit the approximations made in each case. A number of open issues in the field are also presented.

The authors in this paper [16] applied three Machine Learning techniques, namely Support Vector Machines, Multilayer Perceptron and Random Forests, to predict the residency of teachers in Indian universities in relation to Information and Communication Technology awareness. Results showed that Random Forests was far better at prediction than the other two algorithms, with a prediction accuracy of 72.8% and a prediction time of 0.12 s.

Feature Selection is an important phase in Data Analysis and Machine Learning. In a paper [17], the authors applied Machine Learning algorithms to analyze a few categories of cyber attacks.

In another paper [18], feature selection is performed to select genes from microarray datasets, with the objective of applying it to cancer diagnosis. Support Vector Machines are then applied for classification.

A paper [19] introduces the concept of variable and feature selection and its objectives. The article emphasizes various aspects such as feature construction, feature ranking, multivariate feature selection and some assessment methods to validate features.

5 Methodology

The research outline adopted in this paper is depicted in Fig. 1. It consists of several phases, which are explained as follows.

Fig. 1. Methodology

5.1 Dataset Collection

A dataset of store sales was collected; its parameters are described in Table 1.

Table 1. Dataset Parameter Description

5.2 Algorithm Application

As aforementioned, Random Forests and Gradient Boosting Machines were applied on the collected dataset for Feature Selection.
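A minimal sketch of this step in R is given below; the data frame store_sales, its column names and the hyper-parameter values are illustrative assumptions, not the exact dataset or settings used in the experiments.

```r
# Sketch of feature importance with RF and GBM in R.
# The data frame below stands in for the store-sales dataset; its column
# names and the hyper-parameters are illustrative, not the paper's settings.
library(randomForest)
library(gbm)

set.seed(42)
n <- 500
store_sales <- data.frame(
  Shipping.Cost    = runif(n, 1, 50),
  Discount         = runif(n, 0, 0.3),
  Quantity         = sample(1:20, n, replace = TRUE),
  Customer.Segment = factor(sample(c("Consumer", "Corporate"), n, replace = TRUE))
)
store_sales$Sales <- 20 * store_sales$Shipping.Cost +
  100 * store_sales$Quantity + rnorm(n, sd = 50)

# Random Forests: permutation (%IncMSE) and node-impurity importance.
rf_fit <- randomForest(Sales ~ ., data = store_sales,
                       ntree = 500, importance = TRUE)
importance(rf_fit)                 # importance scores per feature
varImpPlot(rf_fit)                 # quick visual ranking

# Gradient Boosting Machine: relative influence of each feature.
gbm_fit <- gbm(Sales ~ ., data = store_sales,
               distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)
summary(gbm_fit, n.trees = 1000)   # relative influence values summing to 100
```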

5.3 Results Comparison

The results of feature selection obtained from both algorithms, RF and GBM, were compared.

5.4 Proposed Algorithm

Since the results of GBM are better as compared to Random Forests, we propose GBM for feature selection.

6 Experiments and Results

In this paper, Random Forests and Gradient Boosting Machines were applied on the dataset with the objective of Feature Selection. The experiments have been performed using R Studio. Table 2 shows the results obtained from both algorithms.

Table 2. Feature Importance given by RF and GBM

It is evident from the results shown in Table 2 that GBM is better at feature selection and importance as compared to RF. Therefore, we propose GBM for the purpose of Feature Selection.

Table 2 shows the priority order of the parameters responsible for store sales as given by both the RF and GBM methods. The GBM method is more accurate compared to the RF method. Using the GBM method, we were able to arrange the parameters in priority order, with the parameter having the maximum value at the top (Shipping.Cost) and the parameter with the least value at the bottom (Customer.Segment). The GBM method returns numeric importance values, and from these values we are also able to produce the feature importance graph. This graph is easier to interpret visually and helps in deciding which parameter is more important and contributes more to store sales.

Table 3 shows the percentage assigned by GBM to each parameter in sequence.

Table 3. Feature importance percentage by GBM

Figure 2 shows the various features, ordered by importance using the GBM technique. It clearly shows that the parameter Shipping.Cost has the highest importance value of all the parameters.

Fig. 2. Feature importance using GBM
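A chart of this kind can be reproduced from the fitted GBM's relative influence; the sketch below assumes the gbm_fit object from the sketch in Sect. 5.2.

```r
# Reproduce a relative-influence bar chart from the fitted GBM
# (continues the sketch in Sect. 5.2; assumes `gbm_fit` exists).
rel_inf <- summary(gbm_fit, plotit = FALSE)    # data frame: var, rel.inf

op <- par(mar = c(5, 10, 2, 2))                # widen left margin for labels
barplot(rev(rel_inf$rel.inf),
        names.arg = as.character(rev(rel_inf$var)),
        horiz = TRUE, las = 1,
        xlab = "Relative influence (%)",
        main = "Feature importance using GBM")
par(op)
```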

7 Conclusion and Future Scope

This paper presents a survey of the feature selection methods proposed in the literature. A few popular categories of feature selection methods, namely filter, wrapper and embedded methods, were introduced. From the embedded category, Random Forests and Gradient Boosting Machines were applied on a dataset for selecting important features. Experiments were performed in R Studio, and the obtained results show that GBM performs better than RF in terms of feature selection. In future, other well-known feature selection techniques can be evaluated.