1 Introduction

Android has become the fastest-growing mobile OS. According to the Web analytics firm StatCounter [1], in March 2017 Android led the worldwide OS internet usage market share with 37.93%, overtaking even Windows. The prominence of Android also underscores the influence of portable devices in our lives. These devices have become storehouses of the details we provide to the apps that manage our day-to-day activities such as scheduling events, online shopping, and chatting. Virginia Tech researchers [2] recently discovered that various apps on our mobiles have been collaborating to trade information secretly, which may result in security breaches. Even the latest initiative for enhancing Android security, Google Play Protect [3], a new built-in antivirus program, failed the Android antivirus tests conducted by the independent German lab AV-TEST [4]: it detected only 65.8% of the latest malware and just 79.2% of month-old malware. Such weak detection rates highlight the need for more robust Android malware detection tools. According to internal AV-TEST statistics [4], over 18 million malware samples are concealed in Android apps. Moreover, many third-party application stores provide additional channels for spreading malware.

The above-mentioned reasons motivated us to experiment with and explore new methods for detecting Android malware. Most Android malware detection systems rely on three approaches: static analysis, dynamic analysis, and hybrid variants of the two. Static analysis has an edge over dynamic analysis because it is performed in a non-runtime environment and can detect issues early, before the app is executed. Dynamic analysis, however, may reveal concerns that cannot be detected during static analysis.

Various machine learning algorithms have been applied to detect Android malware using static features [5,6,7], and ELM has been explored for Android malware detection in the past [8,9,10]. In our study, however, we examine the effect of filter feature selection methods and their ensembles on the performance of ELM. Moreover, the features used in our study are API classes, which differ from those used earlier with ELM (permissions, API calls, binder, memory, battery, CPU, network, and Dalvik instructions).

The API classes defined in Android API level 26 [11] number in the thousands and constitute the features used in our study. It has been shown that insignificant features may be removed without affecting the performance of neural network based Intrusion Detection Systems [12]. Feature selection is therefore an important step that not only eliminates useless features but also results in faster execution and a simpler machine learning model [13]. Feature selection methods are categorized as filter, wrapper, and embedded methods. We use filter methods for selecting relevant features because they operate directly on the dataset and are not affected by the biases of any classifier. They are also fast compared to the other methods.

Compared to a single machine learning model, combining the output of multiple prediction models, known as ensemble learning, has been observed to achieve better performance [14]. Apart from classification, the ensemble approach has also been applied to feature selection. In [15], five different filters were used to select different subsets of features, which were used to train and test different classifiers whose outputs were combined using simple voting. In another study [16], three feature selection methods were combined using union, intersection, and multi-intersection techniques. In our work, we compare the effect of performing the ensemble at the feature selection level and at the prediction level, aiming for better accuracy. We experiment with various filter feature selection methods and their ensembles and present a comparative study of their effect on the accuracy of ELM on a corpus of 14,762 Android apps.

The rest of the paper is organized as follows. Section 2 describes filter feature selection methods. Section 3 introduces ELM. In Sect. 4, we present the research design. Section 5 presents the experimental results and analysis. Finally, we conclude in Sect. 6 along with possible future work.

2 Filter Feature Selection Methods

The essence of filter feature selection methods is that they rely on a statistical measure and use a feature ranking technique as the core criterion for selecting features by ordering. Filter methods are applied prior to classification to filter out less important features. The relevance of a feature is measured in terms of its usefulness in differentiating between classes; a ranking is then generated based on relevance, and subsets of features are selected according to a threshold value. From the wide range of filter feature selection methods available, we use the following three, which can be applied to categorical/binary data since the features extracted in our study are binary.

  • Chi-Square [17] measures the extent of independence between a feature and a class and can be compared to the chi-square distribution with one degree of freedom to judge extremeness. The initial assumption is that the feature and the class are independent of each other. A high chi-square score signifies dependence between the feature and the class, and thus the relevance of the feature.

  • OneR [18] builds one rule for each feature in the dataset i.e., it learns from a one-level decision tree and then ranks features based on the fact that features which result in more accurate trees are considered to be more relevant.

  • Relief [19] algorithm works by repeated random sampling of an instance from the dataset, computing its nearest neighbor from the same class as well as from different class and then calculating the worth of a feature based on its ability to discriminate between instances from different classes.
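As an illustration of the first of these filters, the chi-square score of a single binary feature against a binary class label can be computed from a 2×2 contingency table. The sketch below is a minimal pure-Python example; the function name and toy data are our own, not from the paper:

```python
def chi_square_score(feature, labels):
    """Chi-square statistic for a binary feature vs. a binary class label.

    Builds the 2x2 contingency table of observed counts and compares it
    with the counts expected under independence (1 degree of freedom).
    """
    n = len(feature)
    observed = [[0, 0], [0, 0]]                 # observed[feature][class]
    for f, c in zip(feature, labels):
        observed[f][c] += 1
    row = [sum(observed[0]), sum(observed[1])]  # feature value marginals
    col = [observed[0][0] + observed[1][0],
           observed[0][1] + observed[1][1]]     # class label marginals
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                score += (observed[i][j] - expected) ** 2 / expected
    return score

# A feature perfectly aligned with the class gets a high score ...
relevant = chi_square_score([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0])
# ... while an uninformative feature scores lower.
irrelevant = chi_square_score([1, 0, 1, 0, 1, 0], [1, 1, 1, 0, 0, 0])
```

A higher score here ranks the feature as more relevant for separating benign from malicious apps.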

3 Extreme Learning Machine

In this section, we describe the ELM concept. Feedforward neural networks have been used extensively in the past decade due to their capability of approximating complex nonlinear mappings directly from input samples and providing models for various artificial and natural events [20]. However, one of the major bottlenecks of feedforward neural networks is their learning speed, which is slow due to the gradient-based learning algorithms used for training and the iterative tuning required for all parameters of the network [21]. ELM [21, 22] was originally proposed for training single hidden layer feedforward neural networks (SLFNs). The core concept of ELM is that the hidden layer of an SLFN need not be tuned: its parameters can be assigned randomly and independently, and a simple generalized inverse operation on the hidden layer output matrix determines the output weights of the network [21]. Since no iterative tuning is involved, unlike in gradient descent based learning algorithms, ELM is fast and easy to implement. ELM also aims to reach the smallest training error and the smallest norm of output weights [21, 22]. Apart from these advantages, studies have shown that ELM has lower computational complexity, minimal optimization constraints, better scalability, and good generalization performance.

Given N unique samples of input data, the SLFN with L hidden nodes (additive or RBF nodes) is represented as:

$$ f_{L} \left( x \right) = \sum\nolimits_{i = 1}^{L} {\beta_{i} G\left( {c_{i} ,a_{i} ,x} \right)} $$
(1)

where (ci, ai) are the learning parameters of the ith hidden node, βi is the weight vector linking ith hidden node to the output node, G(ci, ai, x) is the output of the ith hidden node w.r.t the input x.

The fact that standard SLFNs with L hidden nodes can approximate the N input samples \( \left( {x_{j} ,t_{j} } \right) \in R^{n} \times R^{m} \) with zero error implies that there exist βi, ci, and ai such that

$$ \sum\nolimits_{i = 1}^{L} {\beta_{i} G\left( {c_{i} ,a_{i} ,x_{j} } \right) = t_{j} ,\quad j = 1,2, \ldots ,N} $$
(2)

The above equation can be compactly written as

$$ H\beta = T $$
(3)

where,

$$ H \left( {c_{1} , \ldots c_{L} ,a_{1} , \ldots a_{L} ,x_{1} , \ldots x_{N} } \right) = \left[ {\begin{array}{*{20}c} {G\left( {c_{1} ,a_{1} ,x_{1} } \right)} & \cdots & {G\left( {c_{L} ,a_{L} ,x_{1} } \right)} \\ \vdots & \ddots & \vdots \\ {G\left( {c_{1} ,a_{1} ,x_{N} } \right)} & \cdots & {G\left( {c_{L} ,a_{L} ,x_{N} } \right)} \\ \end{array} } \right]_{N \times L} $$
$$ \beta = \left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ \vdots \\ {\beta_{L}^{T} } \\ \end{array} } \right]_{L \times m} ,\quad {\text{T}} = \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ \vdots \\ {t_{N}^{T} } \\ \end{array} } \right]_{N \times m} $$

βT denotes the transpose of the vector β. H is called the hidden layer output matrix [23]. The ith row of H is the output vector of the hidden layer w.r.t input xi, and the jth column of H is the jth hidden node’s output vector w.r.t inputs x1, x2,…, xN. The hidden node parameters ci and ai need not be tuned and may be assigned randomly. It has been proved in theory [22, 24, 25] that SLFNs with randomly chosen additive or RBF hidden nodes have the potential of universal approximation. Thus, independent of the training data, the hidden nodes can be generated randomly: for N unique training samples and L (≤N) randomly generated hidden nodes, the output vector T and the hidden layer output matrix H form a linear system, and the output weights β are estimated as:

$$ \hat{\beta } = H^{ + } T $$
(4)

where H+ is the Moore-Penrose generalized inverse [26] of H. Thus, the output weights can be calculated in a single step, without iterative tuning of any control parameters.

The ELM algorithm can be summarized as follows [21]:

Given a training set \( \left\{ {\left( {x_{i} ,t_{i} } \right)} \right\}_{i = 1}^{N} \subset R^{n} \times R^{m} \), the hidden node output function G(ci, ai, x) and L hidden nodes:

  1. Randomly assign hidden node parameters (ci, ai), i = 1, 2, …, L;

  2. Calculate the output weight vector β: β = H+T.
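The two steps above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the sigmoid hidden node function, the toy XOR data, and all names are our own assumptions.

```python
import numpy as np

def elm_train(X, T, L, rng):
    """Train a basic ELM: random hidden layer, analytic output weights."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, L))   # random input weights (a_i)
    b = rng.standard_normal(L)                 # random hidden biases (c_i)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T               # Moore-Penrose solution (Eq. 4)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy binary problem: XOR, which a linear model cannot separate.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
rng = np.random.default_rng(0)
W, b, beta = elm_train(X, T, L=10, rng=rng)
preds = elm_predict(X, W, b, beta)
```

With L greater than the number of samples, H generically has full row rank, so the pseudoinverse reproduces the targets exactly on the training set, matching the zero-error claim of Eq. (2).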

4 Research Design

The research framework of this study is shown in Fig. 1. Our work uses static features (API classes). Three filter feature selection methods (Chi-Square, OneR, and Relief) are used to generate three subsets of features, which are combined by taking their union so as to use the maximum number of relevant features; classification is then performed on this combined subset. We refer to this as the pre-classification ensemble approach. Its rationale is to take advantage of the strengths of the individual selectors and to remove the burden of deciding which feature selection method would work best for a domain [27]. In another experiment, the feature subsets generated by the three filter feature selection methods are used to train and test three classifiers whose outputs are aggregated by majority vote; we refer to this as the post-classification ensemble approach. The idea behind it is that combining multiple classifiers yields more robust solutions [28]: if the features selected by each feature selection method are diverse, so will be their classifications, and hence the better the ensemble. In majority voting, every classifier makes a prediction (votes) for each instance of the test set, and the final prediction is the one that receives the maximum votes. The classifier must be the same in all experiments in order to make fair comparisons, and we use ELM throughout. Ensemble learning has been used earlier for Android malware detection [6, 8, 29, 30] to improve model performance, since it builds on the concepts of diversity and generalization. In our study, we compare introducing diversity at the feature selection level versus the prediction level.

Fig. 1.
figure 1

Research framework

4.1 Dataset Preparation

The dataset is prepared from 7381 benign and 7381 malicious Android apps. The benign apps are downloaded from the Google Play store [31], and the malicious apps are collected from various sources (AndroTracker [32], the Drebin database [33], and VirusTotal [34]). The dataset is partitioned into a training set (70%) and a test set (30%). To generalize the performance estimates, 10-fold cross-validation was performed while generating the classifier models in our experiment.

4.2 Feature Extraction

The Android API level 26 consists of 4140 API classes, each of which serves as a feature: we set the corresponding feature to 1 if an API belonging to that class is used in an app, and 0 otherwise. The Androguard tool [35] is used to reverse engineer the Android apps. In total, 4140 binary features are extracted per app.
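This encoding can be sketched as follows. The helper below is a hypothetical illustration (the function, the mini-universe of class names, and the example app are ours): it maps the set of API classes observed in a decompiled app onto the fixed binary feature vector.

```python
def binary_feature_vector(used_classes, all_api_classes):
    """Map the API classes used by one app to a 0/1 feature vector.

    all_api_classes fixes the feature order (4140 classes for API level 26);
    a position is 1 iff the app uses an API from that class.
    """
    used = set(used_classes)
    return [1 if cls in used else 0 for cls in all_api_classes]

# Hypothetical three-class universe for illustration only.
universe = ["android.telephony.SmsManager",
            "java.net.URL",
            "android.location.LocationManager"]
vec = binary_feature_vector({"java.net.URL"}, universe)  # -> [0, 1, 0]
```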

4.3 Implementation Details

Firstly, features having zero variance are removed from the dataset, as they contribute nothing to classification. After removal, we were left with 2392 features. The three filter feature selection methods (Chi-Square, OneR, and Relief) are then applied to these 2392 features to generate three feature rankings. Most previous work uses several thresholds that keep different percentages of the most relevant features [36]; however, the thresholds depend on the dataset being used. For our study, we use six threshold values to reduce the dimensionality of the data. One is the log2(n) threshold, where n is the total number of features, following the recommendation in the literature to select log2(n) metrics for software quality prediction [37]. The other five keep the top 1%, 5%, 25%, 50%, and 75% of the most relevant features in the final ordered ranking obtained from each filter feature selection method. The classifier used in all experiments is ELM.
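For concreteness, the six subset sizes for the 2392 remaining features can be computed as below. The rounding choices are our assumption; the paper does not state them.

```python
import math

def threshold_sizes(n):
    """Number of top-ranked features kept under each of the six thresholds."""
    sizes = [int(math.log2(n))]                       # log2(n) threshold
    sizes += [round(n * p) for p in (0.01, 0.05, 0.25, 0.50, 0.75)]
    return sizes

sizes = threshold_sizes(2392)  # -> [11, 24, 120, 598, 1196, 1794]
```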

In the pre-classification ensemble approach, for each of the six thresholds we take the union of the feature subsets generated by the three filter feature selection methods, and this union is then used for classification.
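The union step can be sketched as follows; the ranking lists and the value of k are illustrative placeholders, not data from the study.

```python
def pre_classification_subset(rankings, k):
    """Union of the top-k features from each filter's ranking.

    rankings: list of feature lists, each ordered from most to least
    relevant by one filter method (Chi-Square, OneR, Relief).
    """
    selected = set()
    for ranking in rankings:
        selected.update(ranking[:k])
    return selected

chi2_rank = ["f3", "f1", "f7", "f2"]
oner_rank = ["f1", "f5", "f3", "f4"]
relief_rank = ["f2", "f3", "f1", "f6"]
union_top2 = pre_classification_subset([chi2_rank, oner_rank, relief_rank], k=2)
# union of {f3, f1}, {f1, f5}, {f2, f3} -> {f1, f2, f3, f5}
```

Because the three rankings overlap only partially, the union grows with k, which is why irrelevant features creep in at high thresholds.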

In the post-classification ensemble approach, each feature subset generated is used separately for classification thus resulting in three individual classifiers. Then, the output of each classifier is combined by using the majority vote.
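Majority voting over the three classifiers' predictions can be sketched as below; the vote lists are illustrative only.

```python
from collections import Counter

def majority_vote(predictions):
    """Final label per test instance: the most common of the votes.

    predictions: list of per-classifier prediction lists of equal length.
    """
    final = []
    for votes in zip(*predictions):
        final.append(Counter(votes).most_common(1)[0][0])
    return final

# Hypothetical votes from the Chi-Square-, OneR-, and Relief-based classifiers.
p_chi2 = [1, 0, 1, 1]
p_oner = [1, 1, 0, 1]
p_relief = [0, 0, 1, 1]
final = majority_vote([p_chi2, p_oner, p_relief])  # -> [1, 0, 1, 1]
```

With three binary classifiers a majority always exists, so no tie-breaking rule is needed.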

R statistical software [38] is used to perform the experiments. For feature selection, we used the FSelector package [39] and for performing classification using ELM, we used the elmNN package [40].

4.4 Evaluation Measures

In this research, to assess the effectiveness of our proposed system, we analyzed the following measures: accuracy (percentage of correctly identified applications); precision (percentage of actual malicious apps amongst the predicted malicious apps); recall (percentage of actual malicious apps predicted amongst the total malicious apps); F-measure (harmonic mean of precision and recall); and area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is a graphical plot that illustrates how the diagnostic ability of a classifier changes as its internal threshold changes. The area under the ROC curve summarizes the performance of a classifier in a single number between 0 and 1: the closer it is to 1, the better the classifier.
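The first four measures follow directly from the confusion-matrix counts; below is a minimal pure-Python sketch (with malicious as the positive class, and the labels being toy data of our own):

```python
def evaluation_measures(y_true, y_pred):
    """Accuracy, precision, recall, F-measure for binary labels (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

acc, prec, rec, f1 = evaluation_measures([1, 1, 0, 0], [1, 0, 0, 1])
```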

5 Experimental Results and Analysis

In our experiments, we apply the proposed framework to predict Android malware. In this section, we present the performance measure scores achieved by 10-fold cross-validation criterion of the following experiments - single ELM classifier, three individual feature selection based ELM classifiers (Chi-Square ranked features+ELM, OneR ranked features+ELM, Relief ranked features+ELM), pre-classification ensemble and lastly, post-classification ensemble. The accuracy, precision, recall, F-measure, and area under ROC curve for six different thresholds values of selected features are shown in Tables 1, 2, 3, 4 and 5 respectively.

Table 1. Comparison of prediction accuracy
Table 2. Comparison of precision
Table 3. Comparison of recall
Table 4. Comparison of F-measure
Table 5. Comparison of area under ROC curve

As shown in Table 1, for all thresholds the accuracy of the post-classification ensemble is higher than that achieved by each individual feature selection method, but it is lower than the accuracy achieved by the pre-classification ensemble for low threshold values. Moreover, the pre-classification ensemble outperforms the individual feature selection methods only for low threshold values; as the number of features increases, its accuracy shows no improvement. The highest accuracy (0.9496) is achieved by the pre-classification ensemble with the top 1% of features. The prediction accuracy of a single ELM classifier using all the features is 0.9264, so the introduction of the ensemble increased the accuracy by 2.32%. Because many irrelevant features may enter the union combination as the threshold increases, the accuracy of the pre-classification ensemble degrades. The post-classification ensemble, however, is more immune to the introduction of irrelevant features: irrespective of the threshold, it increases accuracy compared to the individual feature selection methods.

As shown in Tables 2 and 3, the highest precision (0.9426) is again achieved by the pre-classification ensemble with the top 1% of features, but the post-classification ensemble exhibits the highest recall (0.9684) with the top 25% of features. Although the precision and recall values fluctuate for the pre- and post-classification ensembles, the F-measure follows the same pattern as the accuracy, as shown in Table 4: the post-classification ensemble achieves a higher F-measure than the individual feature selection methods for all thresholds, while the pre-classification ensemble achieves higher F-measure values for lower thresholds (highest F-measure of 0.95 with the top 1% of features) and degrades as the threshold increases. Compared to the single ELM classifier, the ensemble increases precision by 2.97%, recall by 2.58%, and F-measure by 2.24%. Table 5 shows that the area under the ROC curve also increased, by 1.24%, for the ensemble relative to the single ELM classifier. The area under the ROC curve likewise agrees with the accuracy and F-measure results, with the highest value of 0.9864 for the pre-classification ensemble with the top 1% of features.

6 Conclusion

Android malware detection systems require classifiers with high accuracy, and one way of improving accuracy is ensemble learning. In this paper, we employ ensemble learning at the feature selection level and at the prediction level to address the problem of Android malware detection. API classes are extracted from Android apps to constitute our binary feature set. At the feature selection level, we used three filter feature selection methods and combined their feature subsets with the union combination, in which all features selected by any of the three methods are used; since this ensemble step is performed prior to classification, we call it the pre-classification ensemble. At the prediction level, ensemble learning is performed by taking the majority vote of the predictions of three individual classifiers, each trained on one of the feature subsets generated by the three filter methods; since the ensemble step follows classification, we call it the post-classification ensemble. The significance of the ensemble lies not only in the improved performance of the models, as confirmed by the experimental results, but also in relieving us of the burden of selecting an appropriate feature selection method or classification algorithm for a specific dataset. The results indicate that the pre-classification ensemble performs well for low threshold values of the feature subsets and then starts degrading as the threshold increases, the reason being the introduction of irrelevant features. The post-classification ensemble, however, is not affected by the thresholds, since its predictions are based on a majority vote of three classifiers, each built on a different feature selection method.

As future work, we propose to compare the pre- and post-classification ensemble strategies for other machine learning algorithms. Also, we used only those filter methods that can be applied to binary data; the same comparison can be explored with numeric features and other filter feature selection methods.