Keywords

1 Introduction

Smartphones are ubiquitous today, serving as a portable way to enter personal data, accounts, contacts, and communication services. The Android platform has become the most dominant mobile OS since 2012. Android privacy is seriously jeopardized because of Android’s open nature due to which innumerous adware, ransomware, etc., are disguised in a large number of non-malicious apps. Given this, smartphones have been tempting victims of cyber-attacks over the years. During installation, Android device users who do not regularly check or understand app permissions allow malwares to be installed and gain access to sensitive resources, even without understanding the risk. Users’ private data, such as IMEI, SMS data, call logs, contact list, and application usage statistics, is the main target for intruders, posing a significant threat to mobile users’ confidentiality and security. Therefore, immediate measures are needed to identify and deal with malware for Android devices.

In this research article, we present a machine learning (ML)-based solution for classifying applications into benign and malware classes using an extreme learning machine (ELM) classifier whose parameters are optimized by particle swarm optimization (PSO) algorithm. We had a collection of 2118 Android applications (1700 benign + 418 malicious) as APK files [1]. We generated a dataset by unpacking these APK files into source code and extracted static information including permissions, actions, intents, services, and receivers from AndroidManifest.xml files. Using this, we created a feature set consisting of over 15,000 features. Then, we preprocessed our dataset using feature selection techniques before feeding it into the aforementioned model for classification of Android applications into benign and malicious. The explained architecture is depicted in Fig. 1.

Fig. 1
figure 1

Architecture of the proposed model

2 Related Work

Substantial research has been devoted in applying different machine learning and deep learning algorithms to classify malicious Android apps based on features which are extracted from static as well as dynamic analysis. In [2], authors have implemented a bio-inspired hybrid intelligent method for detecting Android malware (HIMDAM). This research classifies Android apps by applying extreme learning machine (ELM) as a classifier. Further, the model’s accuracy was increased by using evolving spiking neural networks (eSNNs). In this paper [3], an engine for Android malware detection (DroidDetector) based on deep learning is proposed which uses deep belief network (DBN) architecture giving 96.76% testing accuracy. A feature set consisting of 192 features was extracted using static and dynamic analysis of Android applications. This work [4] proposes DroidDeep which merges the concept of static analysis and deep learning. Firstly, more than 30,000 static features were extracted and then fed into a deep learning model based on DBN to reduce the number of features. Finally, the new feature set was put into a support vector machine (SVM) model to detect Android malware. In this paper [5], detection of probable anomalies in two malware datasets was done using a high-performance extreme learning machine (HP-ELM). The results show that this approach achieved a maximum accuracy of 95.92% for top three features. In this study [6], an efficient model is introduced which is based on feature extraction and ELM classifier to detect Android malware. This model utilizes three different feature extraction methods including independent component analysis (ICA), Karhunen–Loeve transform (KLT), and principal component analysis (PCA) and then finally compiles the results using various stacking methods. In [7], authors have proposed a fully connected deep neural network approach for malware detection resulting with maximum accuracy of 94.65%. The dataset used here consists of 331 features including information about permissions in API. This paper [8] presents an effective machine learning approach based on Bayesian classification model. The features were extracted through static analysis, and total of 58 features were selected. DroidMat [9] is a model that collects static information including API calls, required permissions, etc., from AndroidManifest.xml file and then enhances the malware detection efficiency using K-means algorithm. Singular value decomposition (SVD) was used to determine the cluster’s number, and finally, KNN algorithm was used to classify apps as benign or malicious.

3 Methodology

Static analysis for the identification of mobile malware is a fast and inexpensive approach. This method tests an application without any code execution and detects malware before the inspected application is executed. Dynamic analysis detects malware after the program under review is run or during it. Hybrid analysis consists of static as well as dynamic analysis. We are concentrating on the static analysis of mobile applications in this research.

The proposed system consists of three phases—feature extraction, feature selection and detection using machine learning models. All the phases are discussed in detail in this section.

3.1 Creating the Dataset (Feature Extraction)

We received a total of 2118 Android applications consisting of 1700 benign and 418 malicious applications [1]. The corresponding APK files were unpacked to their source code using APK tool. For static analysis, AndroidManifest.xml files were fetched. For each application, this XML file was parsed using Python 3 modules, and information related to permissions, actions, features, and receivers was gathered. These parameters would become the features of the dataset which would be used to train and test the classifier to detect malware in Android applications. Figure 2 diagrammatically represents the feature extraction process.

Fig. 2
figure 2

Feature extraction process for an Android application

In this way, after parsing all the applications, we obtained a total of 15,939 features consisting of the aforementioned parameters. Note that any feature is Boolean or binary, meaning that its feature value is 1 whenever a feature appears in an application; otherwise, it is 0.

3.2 Feature Selection

One possible solution to reduce the amount of features is to use small datasets, resulting in a smaller feature set. A second approach for reducing the number of the features is to consider a limited set of possible features, e.g. permissions for Android only. A third approach is to use methods of feature ranking and feature selection, e.g. univariate statistical tests (chi-square), evaluation of correlation-based features, information gain, gain ratio, and community-based identification and elimination of recursive features. The strength of the relationship between an individual feature and the response variable is determined by these methods. Thus, a single score indicating this relationship is obtained by each feature. Then, according to these scores, all the features in the feature vector are ranked, and the top k features are considered.

For this research, we have applied the chi-square (χ2) [10] test to find out the correlation among the features and to select the necessary features from these 15,939 features.

For a relationship with a given number of variables and sample size, the chi-square method compares the differences between the actual and the predicted results. Degrees of freedom (DOF) is considered for these experiments to determine whether, based on the cumulative number of variables and samples, a specific null hypothesis can be rejected. χ2 provides a way to assess how well a data sample fits the characteristics of the wider population the sample is supposed to represent. If the sample data does not match the population’s predicted properties that we are interested in, then we may not want to use this sample to draw conclusions about the larger population.

We apply the sklearn’s select KBest function for feature selection and use the chi-square test to evaluate the best features. We receive the scores and alpha values (p_values) for each feature. The rejection region lies when the value to alpha drops below 0.05. This indicates that the features with p_values < 0.05 are not useful for the classification of our results, and we can discard them.

After selecting the features using chi-square, we finally get a dataset with 2654 features which will be fed to the machine learning classifiers in the next phase.

3.3 Model Design

Computational models are built which are capable of generalizing a concept when a machine learning algorithm is trained with a sample of input data. In our situation, trained models generalize whether or not an application is malicious or not. Therefore, when a new unidentified dataset is used by this model, it should be able to synthesize it, understand it, and provide us a reliable result.

Firstly, we are going to split the dataset into two groups: 80% for training and 20% for testing. Next, to understand its efficiency, we will run the primitive ELM. To compare the proposed ELM with particle swarm optimization outcomes, the outcomes of the native ELM will be regarded as the gold standard [11,12,13].

3.4 Extreme Leaning Machine (ELM)

Machine learning methods have gained interest in artificial intelligence in recent years, primarily because of their contributions to various disciplines. A specific case of machine learning techniques is the extreme learning machine. ELM can be labelled as supervised learning algorithm competent of figuring out linear and nonlinear problems of classification. ELM is described by conventional artificial neural network architectures as the single layer feedforward neural network (SLFN), where it is not necessary to compute input weights and hidden layer biases iteratively [11]. An example of an ELM model is shown in Fig. 3.

Fig. 3
figure 3

ELM architecture

For each hidden neuron, let IW be (L × n) input weights and B be (L × 1) bias values and O be (C × L) output weights. In a problem with N observations for a multi-category classification (K-distinct classes) {Xi, Ti, i = 1, 2, …, N − 2, N − 1, N}, the ELM network outputs with L hidden layer neurons are identified as:-

$$y_{k} ~ = ~\mathop \sum \limits_{{j = 1}}^{L} O_{{kj}} F_{j} \left( {{\text{IW}},~B,~X_{i} } \right),~\quad k = 1,2, \ldots ,K - 1,K$$
(1)

For the jth hidden neuron, the output is Fj(.), and the corresponding activation function is F(.).

The target t is defined as

$$t_{i}^{p} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {{\text{if}}\quad k_{i} = p} \hfill \\ 0 \hfill & {{\text{Otherwise}}\quad k = 1,~2,~ \ldots ,~K - 1,K} \hfill \\ \end{array} } \right.~$$
(2)

In the matrix representation, it can be written as—Y = OYl.

where Yl is an L × n matrix, which is represented as

$$Y_{l} = ~\left[ {\begin{array}{*{20}c} {F_{1} \left( {{\text{IW}},~B,~X_{1} } \right)} & \cdots & {F_{1} \left( {{\text{IW}},~B,~X_{N} } \right)} \\ \vdots & \ddots & \vdots \\ {F_{L} \left( {{\text{IW}},~B,~X_{1} } \right)} & \cdots & {F_{L} \left( {{\text{IW}},~B,~X_{N} } \right)} \\ \end{array} } \right]$$
(3)

In our case, K = 2 for binary classification. For a given number of hidden layer neurons in the ELM algorithm, the input weights (IW) and bias (B) are initialized with random values. The output weights (O) are determined analytically by considering the classifier output (Y) equal to the Boolean class label (t):

$$O = YY_{l}^{ + }$$
(4)

where \(Y_{l}^{ + }\) is defined as the Moore–Penrose pseudo-inverse of matrix Yl.

3.5 Optimization Technique—Particle Swarm Optimization (PSO)

PSO belongs to the group of techniques of swarm intelligence, influenced by the behavioural interactions of animals that huddle together such as ants, fireflies, and bees. It is an algorithm based on the interaction of individuals which consist of a population, in a search space to evaluate for promising regions. The action of an individual is influenced by either the best personal value achieved or the best global value achieved. Using fitness parameters just like that of evolutionary algorithms, the performance of each individual is determined. The population is called as a swarm, and particles are referred as individuals. The particles remember their best position in the past in the PSO and the optimal global position the particles have ever reached. This property lets particles search more rapidly in multidimensional space [12].

Let us consider a swarm with S number of individuals. Every Pi (i = 1, 2, …, S − 2, S − 1, S) particle in the population is defined by

  1. (i)

    Its current pi(k) position, referring to a probable solution to the optimization problem at iteration k;

  2. (ii)

    Its vi(k) velocity;

  3. (iii)

    The optimal Pbesti(k) position attained in its previous path.

Let Gbest(k) be the optimal global position that the particles of the swarm have found across all trajectories. Optimality of position is determined by using one or more fitness parameters specified in accordance with the optimization problem considered. The particles travel according to the following equations during the search process:

$$v_{i} \left( {k + 1} \right) = Wv_{i} + ~c_{1} r_{1} \left( {{\text{Pbest}}_{i} \left( k \right) - p_{i} \left( k \right)} \right) + ~c_{2} r_{2} \left( {{\text{Gbest}}_{i} \left( k \right) - p_{i} \left( k \right)} \right)$$
(5)
$$p_{i} \left( {k + 1} \right) = p_{i} \left( k \right) + v_{i} \left( {k + 1} \right)$$
(6)

These steps are shown as a flow chart in Fig. 4. The values of constants in the above equations such as W, c1 and c2 should be calculated before the experiment for the velocity updating process. The parameters c1 and c2 denote the weighting of the terms of stochastic acceleration pulling each particle towards the positions of Pbesti and Gbest. The parameters r1 and r2 are random variables in the range of [0, 1].

Fig. 4
figure 4

Native PSO flow chart

3.6 PSO Algorithm for ELM Optimization

The following major steps need to be taken to create an ELM classifier which uses PSO to optimize its parameters. These steps are as follows:

  1. 1.

    Randomly generate a swarm with population of S particles.

  2. 2.

    Initialize the vectors of velocity parameter vi (i = 1, 2, 3, …, S) for each particle with random values.

  3. 3.

    Train an ELM classifier for each particle position of the particle pi (i = 1, 2, 3, …, S), and calculate the respective accuracy/fitness function.

  4. 4.

    Initialize the Pbesti of the particle with its starting position configuration.

  5. 5.

    Detect Gbest representing the maximum value of the fitness function observed over all paths pursued.

  6. 6.

    Update the velocities and the positions of the particles using the equations mentioned in the previous section.

  7. 7.

    For every candidate particle pi, create an ELM model, train the classifier, and calculate the accuracy.

  8. 8.

    Update the particle’s best position (Pbesti) if the current accuracy is greater than the previous attained maximum value for Pbesti.

  9. 9.

    If maximum number of iterations have been completed, then proceed to step 10, else go to step 5.

  10. 10.

    Select the best global value from the swarm of these particles, and use its values for input weights and biases to train the final ELM classifier which will have the best accuracy from all the models that PSO will have trained.

  11. 11.

    Use this ELM classifier to classify applications into benign and malware classes (Fig. 5).

    Fig. 5
    figure 5

    Method representation used in the proposed PSO-ELM. ELM is trained and produces a learning model. This model offers accuracy and loss, all of which are used as the fitness by the PSO algorithm to be evaluated. Finally, for the ELM, the PSO algorithm measures new weights and biases. This approach works when a stop condition is met.

4 Results

After finalizing our dataset by reducing the feature set to 2654 features using chi-square feature selection method, the proposed experimental model was divided into two major experiments.

The first experiment consisted of training a native ELM classifier using the dataset for different values of various parameters. The second experiment was the main focus of this research work. The previously developed ELM classifier was optimized using particle swarm optimization (PSO) and was trained using the same dataset for same parameter settings as the previous experiment and the results were compared.

These experiments were performed using the Python environment on a system with the specifications—Core i5-9300H @2.40 GHz with 8 GB memory.

4.1 Experiment 1

An extreme learning machine (ELM) classifier was implemented in Python using various modules like NumPy, SciPy, etc. The whole experiment was conducted twice using two activation functions—Relu and Sigmoid [13]. In each experiment, we trained a number of ELM classifiers with different number of hidden neurons starting from 100 till 1000 neurons with a step increase of 100 neurons. For a specific number of hidden neurons, ELM classifier was tested 31 times, and the mean of these accuracies was considered.

The maximum average accuracy achieved using Relu as an activation function was 93.28% at 600 neurons in hidden layer, whereas Sigmoid achieved a maximum average accuracy of 94.12% at 500 neurons in hidden layer.

4.2 Experiment 2

Particle swarm optimization (PSO) was used to improve the performance by optimizing the values of input weights and biases of ELM classifiers. This experiment was also conducted twice with the same activation functions and same iterations for the number of hidden neurons. But at each step, ELM classifier was optimized using PSO algorithm explained in Sects. 3.5 and 3.6. The parameter settings used in the PSO algorithm for this experiment are mentioned in Table 1.

Table 1 Values for PSO constants

The maximum accuracy achieved using Relu as an activation function was 98.58% at 600 neurons in hidden layer, whereas Sigmoid achieved a maximum accuracy of 98.82% at 400 neurons in hidden layer.

The complete results of both the experiments are described in Table 2, and their comparison is visualized in Fig. 6.

Table 2 Comparison of accuracies
Fig. 6
figure 6

Comparison of accuracies for various classifiers with different parameters

5 Conclusion and Future Work

Malware poses a serious threat to Android applications in today’s world because data security and data privacy are of important concern to users. To tackle this problem, in this research, we came up with a modified approach to classify malicious applications from benign ones using machine learning techniques and optimized it with swarm intelligence. We extracted a total of 15,949 features which were used to create the dataset. After applying feature selection techniques, we reduced the dataset to achieve better results using our proposed model. Extreme learning machine (ELM) was used to classify the applications, which is a native method that provided us average results. Particle swarm optimization (PSO) was used to improve the performance of the classifier which gave us a maximum accuracy of 98.82% which is far superior than the traditional algorithms. The proposed idea gives us a fine-tuned model which can tackle the problem efficiently and can be of great help to expert users.

For further enhancements, we plan to upgrade this model using other machine learning and deep learning techniques and aim to optimize it further using different swarm intelligence algorithms. Work done in this research is of static analysis, and we plan to update our dataset by adding new features extracted through dynamic analysis.