Keywords

1 Introduction

Android has dominated the mobile device industry. Just in 2019, the percentage of mobile devices using the Android operating system was around 76 % [1]. This operating system is supported by various types of applications in different markets which provides grateful functionality to its users. However, the total number of malware applications on the android market has boomed catastrophically, due to being an open-source software and also being one of the most used mobile operating systems. Moreover, many malicious applications are found on unofficial Android markets, where security issues receive less attention. Furthermore, as mentioned formerly, the market share for Android is considerably high compared to other operating systems. Therefore, Android is the hot target of attackers to gain more control of the system. The machine learning approaches have been proposed in recent years as one of the effective solutions in Android malware detection systems [2]. However, data mining researchers consider high-dimensional data analysis as a challenge. Feature selection known as the process of selecting a subset of features that contribute most to the classification process in machine learning activities brings many advantages in the reduction of the data dimensionality and enhancement of classifier efficiency [3]. The effectiveness of these methods has been established in improving the learning performance as well as the high- dimensional data processing. The feature selection technique aims at reducing the irrelevant features, reducing the computation cost, increase prediction performance and gaining a better understanding or representation of the data [4]. Although feature selection methods have been used widely in other fields such as intrusion detection systems [5], the majority of papers in the area of Android malware detection tend to select the essential features by rationalizing which require a deep understanding of the nature of the features involved in Android applications. The research in [2] and [6] have provided review papers on various types of features which have been applied in Android malware detection systems. The results of their investigations demonstrate that despite the advantages of feature selection approaches they have been applied in around 8% to 13% of the reviewed papers. To the best of our knowledge, the majority of feature selection approaches currently used in Android malware detection systems encounter difficulties concerning the nature of the problem. The feature selection problem can be modelled as two conflicting objectives which are to minimize the number of features while offering a higher classification performance (minimize classification error). The single objective feature-selection approaches applied in Android malware detection systems are not able to confront both objectives simultaneously while the multi-objective techniques could be considered as a potential solution to the aforementioned issue [7]. Hence, we would attempt to assess the impact of multi-objective feature selection techniques compared to single-objective approaches in Android malware detection systems. To achieve this objective, the following research questions are defined:

  1. 1.

    Can multi-objective techniques perform better compared to single-objective feature selection approaches in Android malware detection systems?

  2. 2.

    Is there any special deficiency in current multi-objective approaches which could be improved?

Consequently, the gap mentioned is hoped to be filled by using multi-objective techniques as a feature selection method. To address the first question, the proposed framework applies the multi-objective genetic algorithm, namely NSGAII, for the feature selection purpose. The research done in [8] introduces the redundancy issue as one of the intrinsic issues in NSGAII algorithm. Hence, a modified version of the NSGAII method which removes the redundant solutions have been proposed in this paper which has been tested on two well-known datasets in the field of Android malware detection systems. The most significant contributions of this paper can be folded as follows:

  1. 1.

    To the best of our knowledge, there is no other research in the literature assessing the impact of the multi-objective feature selection approaches in malware detection in Android systems.

  2. 2.

    The multi-objective feature selection technique used in this research is a modified version of the standard NSGAII method, which improves the diversity of the solutions.

The rest of this paper is constructed as follows, wherein Sect. 2, we give a brief overview of related works. Section 3 describes the methodology used in the proposed Android malware detection system by defining the dataset in use, the modified feature selection, and the malware detection approach, respectively. The experimental results are discussed in Sect. 4, where the evaluation metrics are defined prior to the results achieved in the experiments. Finally, the conclusions of the research are outlined in Sect. 5.

2 Background

As this study is aimed at improving feature selection approaches used in Android malware detection systems, the literature is explored on this subject. Coronado-De-Alba et al. [9] examined the impact of two different feature selection methods called chi-square and relief approach. The malware samples in this dataset are derived from the Drebin dataset. However, the benign samples are gathered from Google Play, and third-party stores after evaluation by the Virustotal tool [10]. The features used in this experiment are divided into permissions, intents, hardware, and software categories. They have introduced Random Forest and random committee as the most efficient meta learner classifiers. In the next step, Random Forest with 200 random trees is selected while using the random committee as the base classifier, which showed the best outcomes on an unbalanced dataset and without feature selection. However, they report these feature selection methods with the ensemble of classifiers to show better efficiency compared to single classifiers. As a result, they suggest these methods in situations where velocity plays a vital role to cope with the dataset size. Zhao et al. [11] proposed a novel feature selection method called FrequenSel. In this research, more than 32000 features have been gathered before feature selection. These features are divided into APIs, permissions, actions, IP, and URLs. They have mentioned the distribution bias in favor of benign apps features and the long-tail effect as the main issues regarding traditional feature selection methods such as information-gain and chi-square. To solve this problem, the features selected by their algorithm should follow two conditions. First of all, its usage percentage should be higher in malware compared to benign samples. Next, a threshold has been introduced that ensures the proper occurrence of features in all malware samples. The proposed framework depicts better results compared to information gain and chi-square. The research done in [12] has proposed a novel architecture in Android malware detection systems. They have gathered permissions, and API function calls from the Android app samples. In the experiments, the benign app is gathered by Google Play, and the malware data are taken from the Malgenome project. The results of their experiments show that the API calls have a higher impact on the efficiency of the models compared to the permission features. As a result, they have selected a higher range of API calls after the feature selection process. In the feature selection step, ANOVA and SVM-RFE approaches have been applied to rank the features. Finally, they have chosen the first 300 API calls features in the sorted list as well as the top 80 features in the permission set.

3 Methodology

In this section, the datasets used to evaluate the proposed method are described initially. Next, we have defined the feature selection approach and the two-step modifications applied to the feature selection technique, and we would explain how These modifications are done to improve the diversity of the solutions. Finally, the malware detection approach, which divides the samples into benign and malware categories, is discussed in detail.

3.1 Dataset

The experiments have been conducted on two datasets consists of real-world Android application samples, details of which are presented in Table 1. Each of this dataset contains 215 features. We have selected these datasets as they are widely used in the research community. During the dataset generation process, AXMLprinter2 is used to extract permissions and intents from the manifest file. Moreover, the extra features in API calls are derived from the .dex files using reverse engineering by Baksmali disassembler. Finally, the most important ad libraries introduced in [13] have been excluded to improve the quality of API call feature extraction.

Table 1. The sample distributions in Drebin [14], and Malgenome [15] datasets

3.2 Feature Selection

In order to remove the irrelevant features, we have applied feature selection approaches which lead to the dimensionality reduction of the datasets and potential improvement of the classification performance. As mentioned previously, the majority of the projects in this field of study have chosen the list of features based on rationalizations which requires a deep understanding regarding the nature of features available in the datasets. To the best of our knowledge, the multi-objective approaches have not been used for feature selection purposes in Android malware detection systems. This provides a gap for further investigation. In our previous research [5], we proposed an intrusion detection framework based on a modified multi-objective feature selection approach called NSGAII.Hence, it provides enough motivation to assess the effectiveness of this method for feature selection purpose in Android malware detection systems. The modification which was applied to the aforementioned research was improving the diversity of solutions by removing the identical redundant solutions. In this research, two levels of modification have been applied to the standard NSGAII technique. Algorithm 1 demonstrates the steps involved in the modification process where the duplicate solutions referring to exactly the same feature sets are removed in the first level. In the second step, the solutions referring to various feature sets are evaluated to ensure that they report distinct objective functions. In the following paragraphs, we would describe the modification process in detail.

figure a

Modified NSGAII

Step 1: The presence of duplicate solutions in standard NSGAII techniques have been mentioned as a significant threat to diversity [16] and convergence speed [8] of this algorithm. In our previous research [5], we modified the NSGAII technique to remove the duplicate solutions in the merged population. Hence, we applied the same modification technique to remove the duplicate solutions and improve the diversity and convergence speed of the NSGAII algorithm in the first modification step in this research as well.

Step 2: Although in step 1, we have removed the identical solutions referring to the exactly the same solutions, there is still the possibility that the solutions with other degrees of similarity are still present in the offered solutions. In Step 2, we would attempt to remove the solutions reporting exactly the same objective-functions since we would consider these solutions as redundant solutions that their presence is not adding any extra value to the quality of the proposed solution, and this fact provides the motivation to apply the modification in the second step. To ascertain which of the overlapping solutions may provide higher value, we have attempted to discover which one has the highest similarity with a single objective method available in the Weka environment, namely CFS. The reason for choosing a feature selection method is the fact that these solutions are referring to various feature sets proposed by the NSGAII technique. It should be mentioned that the CFS technique could be replaced by any other feature selection approach available in the state-of-the-art. To achieve the highest similarity degree, the linear association between each of the overlapping solutions (i and j) and the array achieved by CFS is calculated using the correlation-coefficient concept. The linear association between the two- vectors, \(A_1 = (A_{1},A_{2},\ldots ,A_{n})\) and \(B = (B_{1},B_{2},\ldots ,B_{n})\), is called the Pearson correlation, and can be calculated according to Eq. 1.

$$\begin{aligned} correlation (A,B)= \frac{\sum _{i=1}^{n} (A_i-\overline{A} )(B_i-\overline{B} )}{\sqrt{\sum _{i=1}^{n} (A_i-\overline{A} )^2}{\sqrt{\sum _{i=1}^{n} (B_i-\overline{B} )^2}} } \end{aligned}$$
(1)

Finally, the solution which reports higher degree of correlation using Eq. 1 is chosen and the other one is eliminated from the list of offered solutions.

3.3 Malware Detection

In the malware detection step, we would like to divide the solutions into a binary classification of malware and benign categories. The feature selection technique defined in Sect. 3.2 provides a list of solutions which provide various feature subsets. Among all of these solutions, we would select the solution with minimum \(f_2\) value which is equivalent to the error rate reported for the corresponding feature set. Next, we would reduce the size of the dataset according to the proposed list of feature sets. Afterwards, the Random Forest classifier, as an ensemble method and peered review machine learning technique found in literature [9], would be applied on the datasets to evaluate the efficiency of the proposed solution and divide the samples into benign and malware categories.

4 Experimental Results

In this section, we would initially describe the evaluation metrics used to assess the effectiveness of the proposed feature sets. Afterwards, the results of two experiments on Drebin and Malgenome datasets are reported according to the aforementioned metrics.

4.1 Evaluation Metrics

Weighted F Measure is applied in the research conducted in [17] to evaluate the efficiency of the method. The dataset used in this research is extracted from this research. Hence, we apply the same metric as described in Eq. 2 by WFM.

$$\begin{aligned} WFM = \frac{ (F_m * N_m ) +(F_b *N_b) }{ ( N_m + N_b )} \end{aligned}$$
(2)

Where the \(F_m\), \(F_b\), \(N_m\), and \(N_b\) used in the evaluation process refer to the F Measure and number of instances in malware and benign instances, respectively. The F value in this equation correlates with the F Measure value where it can be calculated according to the Eq. 3.

$$\begin{aligned} F~Measure = 2*(\frac{Precision*Recall}{Precision+Recall} ) \end{aligned}$$
(3)

While the Precision and Recall are formulated as follows:

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(4)
$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(5)

where TP, TN, FP, and FN are equal to the following values:

  • True Positive (TP): the number of correctly classified malware files.

  • True Negative (TN): the number of correct classification of benign samples as benign.

  • False Negative (FN): Malicious applications wrongly classified as benign.

  • False Positive (FP): the number of misclassified legitimate applications.

4.2 Results

This section describes the experimental results obtained with the improved NASGII over the datasets selected for these experiments. The proposed feature selection solution was implemented using Matlab R2019a. Next, the Random Forest algorithm is applied to the reduced dataset in the Waikato Environment for Knowledge Analysis (Weka 3.8). The data analyses are performed using a PC with Intel Core i7 processor, 2.1 GHz speed and 8 GB RAM. To evaluate the performance of each model, a 10-fold cross-validation resampling method was applied. This method is selected for the advantages it has in guaranteeing the independence of the validation set from the training one and the validity of the trained classifier against any unseen sample.

Experiment One on Drebin Dataset:

Figure 1 demonstrates the feature subset solutions proposed by the modified multi-objective NSGAII. Each \(*\) symbol in this fig is considered as the demonstration of an individual solution. Further detail about these solutions can be found in Table 2 where the Num, NF, MSE, and ratio refer to the order of solution on the Pareto front, the number of selected features, Mean-square-error (MSE) of the solution achieved by the artificial neural network (ANN), and the percentage of the selected features in each solution compared to the full features available in the original datasets. We have chosen the last solution with the minimum MSE value, which corresponds to 186 features of total 215 features available in Drebin dataset.

Table 2. Pareto front in Drebin
Table 3. Pareto front in Malgenome
Fig. 1.
figure 1

Pareto front derived from Drebin

Next, the features unavailable on the feature subset solution are eliminated from the dataset, and the dataset is fed into the malware detection step where the Random Forest technique available in Weka environment would classify the samples into malware and benign categories. Table 4 demonstrates the confusion matrix derived from the classification approach on Drebin dataset. We have compared the proposed method with two single objective feature- selection techniques called CFS and FilterSubsetEval. To achieve a uniform structure for comparison, we have initially applied each of these techniques on Drebin dataset. Afterwards, the malware detection approach is applied to the reduced size datasets using Random Forest. Moreover, the proposed method results are compared to another research called Droid-Fusion [17], in which they have published the final version of Drebin and Malgenome datasets used in our research. Table 5 reports the results of this comparison, where the proposed method shows higher efficiency in all performance metrics compared to both single objective feature selection approaches. However, it demonstrates better results in three out of five metrics in comparison with Drebin. To evaluate the efficiency of the method in both benign and Malware samples, the precision and recall factors have been reported in malware and benign classes separately instead of a single average factor in all samples.

Table 4. Drebin confusion matrix
Table 5. Evaluation measures for Drebin dataset
Fig. 2.
figure 2

Pareto front derived from Malgenome

Experiment Two on Malgenome Dataset:

The proposed feature selection has been applied on Malgenome dataset as well. Figure 2 demonstrates the feature subset solutions offered by the proposed multi-objective approach in this research. Twelve distinct solutions have been offered by this technique where the solution with the least MSE rate has been chosen to be used to reduce the feature subset size of the Malgenome dataset. Table 3 demonstrates the details of the offered solutions by the multi-objective technique. We have chosen the last solution in this table where 115 features out of 215 features available on Malgenome dataset have been chosen which is equivalent to 53% of the total features. Next, the reduced size dataset is classified using Random Forest classifier. The confusion matrix derived from this process can be found in Table 6. We have compared the proposed method with the same state-of-the-art mentioned in the Drebin experiment. The evaluation results of our experiment demonstrate the superiority of the proposed multi-objective feature selection technique compared to two single objective feature selection methods, namely CFS and FilterSubsetEval in all factors and its improvement in three out of five metrics compared to Droid-Fusion [17] (Table 7).

Table 6. Malgenome confusion matrix
Table 7. Evaluation measures for Malgenome dataset

5 Conclusion

By selecting the relevant features and by applying different machine learning algorithms, it is possible to effectively detect Android malware. The aim of this study was to explore the use of a multi-objective function for feature-selection purpose on Android malware detection systems in order to create the optimal feature subsets. Hence, a NSGAII-ANN method is applied to construct the basis for the feature-selection stage. To improve the proposed framework, we have modified the traditional NSGAII method to solve one of the most significant deficiencies in NSGAII algorithm by a redundant solution removal approach in two stages. Moreover, the Random Forest method is applied in order to evaluate the selected subsets. Experimental results show that the proposed method is superior to other methods found in literature in terms of improving the classification metrics for both datasets. In the future, we would like to test our proposed solution on other real datasets covering a broader range of malware, and we would like to apply and compare different machine learning algorithms. As Android malware detection techniques are mostly applied on imbalanced network traffic datasets, different experiments should be performed applying oversampling and downsampling techniques, expecting to achieve higher accuracy on both the minority class and the entire datasets.