1 Introduction

Android gained popularity in 2011 owing to its open-source nature and the number of free apps in its official Play Store.Footnote 1 According to the statistics,Footnote 2 more than 2.87 million free apps are present in the Google Play Store. The working of Android apps depends upon permissions. At the time of installation, Android apps request certain permissions that are required for their proper functioning. On a daily basis, cyber-criminals take advantage of these permissions and develop malware-infected apps for smartphone users. According to the survey done by Kaspersky Security Network,Footnote 3 millions of malware-infected apps are still submitted to the Google Play Store and third-party app stores.

According to the report published by Gartner,Footnote 4 the growth of smartphones is expected to increase by 11% in the upcoming year. During the pandemic, everyone depended upon apps for their jobs. At the time of installation and at run-time, Android apps demand certain permissions. Google classifies these permissionsFootnote 5 as “normal” or “dangerous.” Normal permissions have no impact on the user’s privacy. In contrast, dangerous permissions have a great effect on the user’s privacy. The fault lies in the underlying permission model of Android apps.

In the literature [12, 14,15,16,17,18,19,20,21,22,23,24], a number of authors have proposed Android malware detection frameworks using supervised and unsupervised machine learning techniques. The main limitation of this work is that researchers and academicians used limited datasets. To achieve a better detection rate, this research article proposes a framework based on hybrid artificial intelligence techniques that combine the functional link artificial neural network (FLANN) with the clonal selection algorithm (CSA), particle swarm optimization (PSO) and the genetic algorithm (GA), i.e., FLANN-CSA (FCSA), FLANN-PSO (FPSO and MFPSO) and FLANN-genetic (FGA and AFGA). This study also focuses on the effectiveness of feature selection techniques, i.e., principal component analysis (PCA) and rough set analysis (RSA), which are used to reduce the complexity of the proposed model by minimizing the number of inputs.

The generic steps followed in this research paper to identify malware-infected apps are shown in Fig. 1. In the first step, we collect Android Application Package (.apk) files from different repositories. In the second step, we extract dynamic features and form the feature dataset. In the third step, feature selection techniques are implemented, and the features chosen by these approaches are carried forward. In the last step, we validate the developed models by using two performance parameters, i.e., accuracy and F-measure.

Fig. 1 Phases involved in this research article

The unique and novel contributions of this study are as follows:

  • To build an efficient and effective malware detection model, more than five million Android apps are utilized in this study.

  • Dynamic analysis is performed on the collected Android apps, and 1844 unique features are extracted.

  • In this chapter, five different hybrid functional link artificial neural networks are proposed.

The rest of the chapter is organized as follows. Related work is discussed in Sect. 2. The collection of .apk files and the formulation of the feature dataset are discussed in Sect. 3. The implemented feature selection techniques are discussed in Sect. 4. Section 5 discusses the proposed hybrid machine learning algorithms. The experimental setup of the proposed framework is discussed in Sect. 6. The outcomes of the experiment are discussed in Sect. 7. Finally, the chapter is concluded in Sect. 8.

2 Related Work

Hou et al. [7] proposed a malware detection framework named “Droiddelver” based on Application Programming Interface (API) calls extracted from smali files. The proposed model was built by using 5000 different Android apps and a deep belief network as the machine learning technique. The empirical outcome reveals that the proposed model was able to detect 96.66% of malware-infected apps. Hou et al. [6] proposed a malware detection model named “Deep4MalDroid,” developed on the basis of a dynamic analysis approach called component traversal, which follows the code routines of particular Android apps. Based on the extracted features, they constructed weighted directed graphs and then applied deep learning as the machine learning algorithm. An experiment was performed by using 3000 Android apps, and the model detected 91.4% of malware-infected apps.

Mahindru and Singh [25] proposed a dynamic analysis-based approach built by using 123 features. An experiment was performed by using 11,000 distinct Android apps and five different machine learning algorithms. The malware detection model developed by using simple logistic achieved a higher detection rate as compared to the others. Hou et al. [8] developed a framework entitled “HinDroid” based on the relationships between API calls, building higher-level semantics that require more effort for attackers to evade. An experiment was performed by using two different datasets; i.e., one contains 1834 distinct Android apps, and the second contains 30,000 distinct Android apps. The proposed malware detection framework was able to identify 99.01% of malware-infected apps. Martín et al. [26] developed a model named “MOCDroid” that is based on a genetic algorithm. An experiment was performed by using 17,135 Android apps and achieved an accuracy of 94.60%. Tong and Yan [30] proposed a hybrid approach that works on the combination of static and dynamic features. The experiment was carried out by utilizing 2000 different Android apps while considering API calls as a feature. The proposed malware detection model achieved a detection rate of 90.19%.

Karbab et al. [10] developed a malware detection model named “MalDozer” that is based on the principle of deep learning techniques. The developed model uses the behavior of API calls to recognize the behavior of benign and malware apps. The developed framework was tested on 38 K benign apps and 33 K malware-infected apps and achieved an F1-score of 96–99%. Cai et al. [4] proposed a dynamic malware detection approach that used calls and inter-component communication as features. An experiment was performed by using 34,343 Android apps, and the proposed framework achieved an accuracy of 97%. Kim et al. [11] developed a malware detection model on the basis of multimodal deep learning. Features were extracted from the manifest file, dex file and shared libraries for developing the model. The developed model was tested with 41,260 Android apps and achieved an accuracy of 98%. Yerima et al. [34] proposed a detection model entitled “DroidFusion” that is based on the principle of feature selection techniques and implements multiple machine learning algorithms. The proposed malware detection model was tested with 55,018 distinct smartphone apps and achieved a detection rate of 97%. Shen et al. [28] developed a malware detection model that works on the principle of information flow analysis. The developed model uses the structure of information flows to learn behavior patterns, which helps in distinguishing between benign and malware apps. An experiment was performed by using 8598 Android apps and achieved an accuracy of 82%.

Arora et al. [1] developed a malware detection framework that works on graphs constructed from the permissions extracted from distinct Android apps. An experiment was performed by using 5993 Android apps and achieved a detection rate of 95.44%. Xiao et al. [32] developed a model by using deep learning principles. The proposed model is built by using system call sequences and long short-term memory as the machine learning technique. An experiment was performed by using 7103 Android apps and achieved an accuracy of 96.6%. Mahindru and Sangal [14] developed a malware detection framework entitled “DeepDroid” by using significant features selected by feature selection approaches and deep learning as the machine learning technique. The experimental outcome reveals that the framework built by using principal component analysis (PCA) as the feature selection technique achieved a higher detection rate as compared to other techniques. Kumar et al. [12] built a detection framework by utilizing three different data sampling approaches, three different feature selection approaches and seven distinct classifier approaches. The outcome reveals that the framework developed by using the upscale sampling technique and ELM with a polynomial kernel achieved a higher detection rate as compared to the others.

Mahindru and Sangal [16] developed a malware detection framework that works on the basis of semi-supervised machine learning techniques. The proposed framework is developed by using four different feature subset selection approaches and LLGC as the machine learning algorithm. The empirical result reveals that the framework built using rough set analysis as the feature selection approach achieved a detection rate of 97.8%. Mahindru and Sangal [17] developed a malware detection model entitled “GADroid” that is built by using a genetic algorithm as the feature selection approach. The selected features are then used to build the model by using deep learning as the machine learning technique. The experiment was performed on 560,142 distinct Android apps, and the developed model was able to achieve an accuracy of 98.6%.

Mahindru and Sangal [19] developed a model named “PARUDroid.” The proposed model is able to detect 98.8% of malware-infected apps. Table 1 describes the frameworks developed in the literature. Previous malware detection models were developed with limited datasets and achieved high accuracy on those limited datasets. On the basis of the related work, the following research questions are addressed in this research article:

Table 1 Malware detection frameworks that are available in the literature

RQ1. Which malware detection model is more effective in detecting malware from real-world apps?

This question helps in identifying the malware detection model that is most effective in detecting malware from real-world apps. To answer this question, distinct malware detection models are developed in this study and compared using two performance parameters, i.e., F-measure and accuracy.

RQ2. Is the proposed malware detection framework able to identify malware on Android devices?

To examine this question, the proposed framework is compared with existing malware detection models presented in the literature.

RQ3. Do the features selected using feature selection approaches have any impact on malware detection models?

To answer this question, the model developed in this research article using all extracted features is compared with the models developed by using feature selection techniques.

3 Collection of .apk Files and Formulation of Features Dataset

A collection of five million distinct Android apps is used in this research article. Benign .apk files are collected from slideme,Footnote 6 mumayi,Footnote 7 hiapk,Footnote 8 appchina,Footnote 9 Google’s play store,Footnote 10 Android,Footnote 11 gfan,Footnote 12 and pandaapp,Footnote 13 whereas malware-infected apps are collected from the Android Malware Genome project [35]; 1929 botnet samples were collected from [9] and from AndroMalShareFootnote 14 along with their package names. Table 2 lists the distinct categories of Android apps along with their counts. Dynamic analysis was performed by using the principle mentioned in [19]. After that, we divided the extracted features into the categories to which they belong. The formulation of the feature dataset is given in Table 3.

4 Feature Selection Techniques

Relevant features play an important role in the effectiveness and efficiency of malware detection models. In this research article, two different feature selection approaches are considered to select relevant features, i.e., principal component analysis (PCA) and rough set theory.

Table 2 Collected android application packages (.apk)
Table 3 Formulation of feature datasets

4.1 Principal Component Analysis (PCA)

To map the data into a low-dimensional space, PCA is considered as a feature selection technique. Figure 2 demonstrates the steps that are followed while selecting features using PCA.

Fig. 2 Framework of PCA calculation
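As an illustration of the PCA-based selection, the following Python sketch computes the correlation matrix of a toy feature table, extracts its eigenvalues and keeps the components whose eigenvalue is greater than 1, which is the criterion stated in Sect. 7.1. It is a minimal sketch under stated assumptions, not the implementation used in this chapter: the toy data, the function names and the use of power iteration with deflation (chosen only to avoid external dependencies) are hypothetical, and every feature column is assumed to have non-zero variance.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the feature columns (non-constant columns assumed)."""
    n, p = len(rows), len(rows[0])
    centered = []
    for j in range(p):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        centered.append([value - mean for value in column])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def eigen_decompose(matrix, iterations=500):
    """Eigenpairs of a symmetric matrix by power iteration with deflation."""
    p = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _component in range(p):
        v = [float(k + 1) for k in range(p)]  # deterministic start vector
        value = 0.0
        for _step in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # what is left of the matrix is numerically zero
                value = 0.0
                break
            v = [x / norm for x in w]
            # Rayleigh quotient: eigenvalue estimate for the current unit vector
            value = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
        pairs.append((value, v))
        # Deflation: remove the component just found before searching for the next one
        m = [[m[i][j] - value * v[i] * v[j] for j in range(p)] for i in range(p)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data: 4 apps x 3 dynamic features (column 2 duplicates column 1).
rows = [(1, 2, 1), (2, 4, -1), (3, 6, -1), (4, 8, 2)]
eigenpairs = eigen_decompose(correlation_matrix(rows))
kept = [pair for pair in eigenpairs if pair[0] > 1.0]  # eigenvalue > 1 criterion
print("eigenvalues:", [round(value, 4) for value, _ in eigenpairs])
print("components kept:", len(kept))
```

On real data a numerical library would normally replace the hand-written eigen-solver; the selection rule itself stays the same.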

4.2 Rough Set Theory

Rough set theory is used to eliminate irrelevant features by using approximation, reduced attributes and the information method. The steps that are followed in rough set theory are shown in Fig. 3.

Fig. 3 Rough set theory framework
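To make the idea of attribute reduction concrete, the following Python sketch computes the rough set dependency degree (the fraction of apps whose feature values determine the class label unambiguously) and greedily builds a reduced attribute set. It is only a sketch under stated assumptions: the greedy, QuickReduct-style search is one common heuristic and is not claimed to be the exact procedure of Fig. 3, and the toy data and function names are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices that are indiscernible on the chosen attributes."""
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(index)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: share of rows lying in blocks with a single class label."""
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def quick_reduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree the most."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    while dependency(rows, labels, reduct) < target:
        candidates = [a for a in all_attrs if a not in reduct]
        best = max(candidates, key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(best)
    return reduct


# Hypothetical toy data: 6 apps x 3 binary features; the label copies feature 0.
apps = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0)]
labels = [1, 1, 0, 0, 1, 0]  # 1 = malware, 0 = benign
reduct = quick_reduct(apps, labels)
print("selected feature indices:", reduct)
```

In this toy table only feature 0 is needed to separate the two classes, so the other two features are discarded as irrelevant.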

5 Proposed Hybrid Machine Learning Techniques

In this section, we discuss the machine learning algorithms that are developed by combining FLANN with the genetic algorithm, clonal selection and particle swarm optimization for detecting malware in Android apps.

5.1 Functional Link Artificial Neural Network (FLANN)

In this research article, FLANN is implemented to detect malware in Android apps. FLANN is based on a single-layer artificial neural network (ANN) architecture that is capable of making complex decisions. The computational cost of a conventional ANN is very high, whereas that of FLANN is much lower owing to the absence of hidden layers. Figure 4 demonstrates the basic architecture of FLANN.

Fig. 4 Architecture of FLANN

The output is computed by using the following equation:

$$\begin{aligned} \hat{z}=\sum \limits _{i=1}^n W_ia_i \end{aligned}$$
(1)

where z and \(\hat{z}\) are the actual and estimated values, W is the weight vector and \(a_i\) is the function block that is defined by using

$$\begin{aligned} A=[1, a_1, \sin {\pi a_1}, \cos {\pi a_1}, a_2, \sin {\pi a_2}, \cos {\pi a_2},\ldots ] \end{aligned}$$
(2)

The weights are updated as:

$$\begin{aligned} W_i(k+1)=W_i(k)+\alpha e_i(k)a_i(k) \end{aligned}$$
(3)

where \(\alpha \) is the learning rate and \(e_i\) is the error value that is determined as:

$$\begin{aligned} e_i=z_i-\hat{z}_i \end{aligned}$$
(4)
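Equations (1)–(4) can be sketched in a few lines of Python. The sketch below builds the trigonometric function block of Eq. (2), computes the output of Eq. (1) and applies the weight update of Eqs. (3) and (4) for a single output neuron. The function names, the learning rate of 0.1 and the toy input are assumptions made for illustration only.

```python
import math


def expand(features):
    """Function block of Eq. (2): [1, a1, sin(pi a1), cos(pi a1), a2, ...]."""
    block = [1.0]
    for a in features:
        block += [a, math.sin(math.pi * a), math.cos(math.pi * a)]
    return block


def predict(weights, block):
    """Eq. (1): weighted sum of the function block."""
    return sum(w * a for w, a in zip(weights, block))


def train_step(weights, features, target, alpha=0.1):
    """Eqs. (3) and (4): one error-driven update of the weights."""
    block = expand(features)
    error = target - predict(weights, block)  # Eq. (4)
    updated = [w + alpha * error * a for w, a in zip(weights, block)]  # Eq. (3)
    return updated, error


# Hypothetical toy sample: one normalized feature, target 1.0 (malware).
weights = [0.0] * len(expand([0.5]))
weights, first_error = train_step(weights, [0.5], 1.0)
_, second_error = train_step(weights, [0.5], 1.0)
print(first_error, second_error)
```

Because there is no hidden layer, each update touches only one weight per element of the function block, which is where the low computational cost of FLANN comes from.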

5.2 FLANN-Genetic (FGA) Technique

This technique is very effective during learning, where it is mainly used to update the weights. A functional link neural network of the form ‘\(a-x\)’ is considered; i.e., the network contains \(a\) input neurons and \(x\) output neurons.

Weights are calculated using the following equation:

$$\begin{aligned}W_a= \left\{ \begin{array}{ll} -\frac{y_{ad+2}*10^{d-2}+y_{ad+3}*10^{d-3}+\cdots +y_{(a+1)d}}{10^{d-2}}&{}\text{ if }\,0\le y_{ad+1}< 5\\ \frac{y_{ad+2}*10^{d-2}+y_{ad+3}*10^{d-3}+\cdots +y_{(a+1)d}}{10^{d-2}}&{} \text{ if }\,5\le y_{ad+1}\le 9\end{array} \right. \nonumber \end{aligned}$$
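The decoding rule above can be illustrated with a short Python sketch. It assumes, as the indices in the equation suggest, that a chromosome is a flat list of decimal digits, that each weight is encoded by a gene of d digits, that the first digit of a gene fixes the sign and that the remaining d-1 digits give the magnitude. The function name and the sample chromosome are hypothetical.

```python
def decode_weight(digits, a, d):
    """Decode weight W_a from the a-th gene (0-based) of a digit chromosome.

    digits[a*d] plays the role of y_{ad+1} (sign digit); the following
    d-1 digits play the role of y_{ad+2} ... y_{(a+1)d} (magnitude).
    """
    gene = digits[a * d:(a + 1) * d]
    sign_digit, magnitude_digits = gene[0], gene[1:]
    value = sum(y * 10 ** (d - 2 - j) for j, y in enumerate(magnitude_digits))
    magnitude = value / 10 ** (d - 2)
    # Sign digit 5..9 gives a positive weight, 0..4 a negative weight.
    return magnitude if sign_digit >= 5 else -magnitude


# Hypothetical chromosome holding two weights with d = 5 digits per gene.
chromosome = [8, 4, 6, 2, 3, 3, 0, 5, 0, 0]
print(decode_weight(chromosome, 0, 5), decode_weight(chromosome, 1, 5))
```

With d = 5, the gene 8 4 6 2 3 decodes to a positive weight because its first digit is at least 5, whereas the gene 3 0 5 0 0 decodes to a negative weight.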

5.3 Adaptive FLANN-Genetic (AFGA) Technique

This approach adapts two parameters as it advances, i.e., the probability of mutation \((P_m)\) and the probability of crossover \((P_c)\). The updated values of \((P_m)_{k+1}\) and \((P_c)_{k+1}\) are calculated by using the following equations:

$$\begin{aligned} (P_m)_{k+1}= & {} (P_m)_i-\frac{C_2*n}{5} \end{aligned}$$
(5)
$$\begin{aligned} (P_c)_{k+1}= & {} (P_c)_i-\frac{C_1*n}{5} \end{aligned}$$
(6)
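As a small numerical illustration of Eqs. (5) and (6), the Python sketch below lowers a probability linearly with the generation number n. The parameter values are hypothetical, and the floor at zero is an added safeguard that is not part of the equations.

```python
def adapt_probability(p_initial, c, n, divisor=5):
    """Eqs. (5)/(6): p_initial - c * n / divisor.

    The max(0.0, ...) floor is an assumption added here so that the
    result stays a valid probability; it is not stated in the chapter.
    """
    return max(0.0, p_initial - c * n / divisor)


# Hypothetical settings: initial mutation probability 0.1, C_2 = 0.001.
for generation in (0, 10, 50):
    print(generation, adapt_probability(0.1, 0.001, generation))
```

Both probabilities shrink as the generations advance, so the search explores widely at first and changes the population less aggressively later on.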

5.4 FLANN Particle Swarm Optimization (FPSO) Technique

It is based on the principles of particle swarm optimization and the functional link neural network. PSO is utilized to update the weights in the learning phase. Figure 5 represents the execution of PSO. The fitness value, velocity and position are calculated as:

$$\begin{aligned} F_i= & {} 1/{E_i}\end{aligned}$$
(7)
$$\begin{aligned} V_{k+1}^i= & {} V_k^i+C_1*R_1*(Pbest_k^i-X_k^i)+C_2*R_2*(Gbest_k^n-X_k^i)\end{aligned}$$
(8)
$$\begin{aligned} X_{k+1}^i= & {} X_k^i+V_{k+1}^i \end{aligned}$$
(9)

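where X is the position of a particle and V is its velocity, \(C_1\) and \(C_2\) are the acceleration coefficients, \(R_1\) and \(R_2\) are random numbers, and Pbest and Gbest denote the best position found so far by the particle itself and by the whole swarm, respectively.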

Fig. 5 Flowchart representing PSO execution
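Equations (7)–(9) can be sketched in Python as follows. Each particle is taken to be a candidate FLANN weight vector; the coefficient values C1 = C2 = 2.0, the per-dimension random draws and all names are assumptions made for illustration, and the error passed to the fitness function is assumed to be non-zero.

```python
import random


def fitness(error):
    """Eq. (7): fitness is the reciprocal of the (non-zero) error."""
    return 1.0 / error


def pso_step(positions, velocities, pbest, gbest, c1=2.0, c2=2.0, rng=random):
    """One swarm update following Eqs. (8) and (9); returns new lists."""
    new_positions, new_velocities = [], []
    for x, v, p in zip(positions, velocities, pbest):
        new_v = []
        for xj, vj, pj, gj in zip(x, v, p, gbest):
            r1, r2 = rng.random(), rng.random()  # R1, R2 drawn per dimension
            new_v.append(vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj))  # Eq. (8)
        new_velocities.append(new_v)
        new_positions.append([xj + vj for xj, vj in zip(x, new_v)])  # Eq. (9)
    return new_positions, new_velocities


# Hypothetical swarm of two particles, each holding two weights.
rng = random.Random(0)
positions = [[0.0, 0.0], [1.0, 1.0]]
velocities = [[0.0, 0.0], [0.0, 0.0]]
pbest = [[0.0, 0.0], [1.0, 1.0]]
gbest = [1.0, 1.0]
positions, velocities = pso_step(positions, velocities, pbest, gbest, rng=rng)
print(positions)
```

A particle that already sits on both its personal best and the global best keeps its position, while every other particle is pulled toward the two best positions by a random amount.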

5.5 FLANN-Clonal Selection Algorithm (FCSA) Approach

FCSA is a hybrid approach that combines the clonal selection algorithm with the functional link neural network [13].

5.6 Modified-FLANN Particle Swarm Optimization (MFPSO) Technique

The main difference between the PSO and MFPSO approaches is that MFPSO includes a mutation stage just after the completion of the first stage. The following equation is used to calculate the updated mutation value.

$$\begin{aligned} (P_m)_{k+1}=(P_m)_i-\frac{C*n}{10} \end{aligned}$$
(10)

where \(P_m\) is the first state of mutation and n is the generation number.

6 Experimental Setup

In this section of the chapter, we discuss the experimental setup used to determine whether the developed malware detection model is effective. Six different hybrid functional link artificial neural network machine learning algorithms are implemented in this chapter. Figure 6 demonstrates the proposed framework. In the first phase, feature selection techniques, i.e., PCA and rough set theory, are implemented to select significant features. In the second phase, the min-max approach is implemented to normalize the features. Distinct malware detection models are then developed by using six different machine learning techniques. After that, a confusion matrix is constructed by using the technique mentioned in [23, 24]. By comparing the malware detection models, the most suitable model is selected and compared with the existing frameworks mentioned in the literature. If its detection rate is higher than that of the existing frameworks, the proposed framework is considered useful; otherwise, it is not.
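The min-max normalization used in the second phase can be sketched as follows. The sketch rescales every feature column to the range [0, 1]; mapping a constant column to 0.0 is an assumption made here to avoid division by zero, and the function name and sample values are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale each feature column to [0, 1] using (v - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0 (assumption)
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical feature table: 3 apps x 2 features (the second is constant).
print(min_max_normalize([[2, 10], [4, 10], [6, 10]]))
```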

Fig. 6 Proposed framework, i.e., ANNDroid
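The two performance parameters can be derived from the confusion matrix as sketched below. The chapter takes its equations from the cited literature; the sketch assumes the standard definitions (accuracy as the share of correct predictions, F-measure as the harmonic mean of precision and recall) and uses hypothetical labels in which 1 marks a malware-infected app.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary malware/benign classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of apps that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (standard definition)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions for eight apps.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(actual, predicted)
print(accuracy(tp, fp, fn, tn), f_measure(tp, fp, fn))
```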

7 Outcomes

This section presents the outcomes obtained by applying the feature selection approaches and machine learning algorithms.

7.1 Feature Selection Approaches

With PCA, the relevant features are those whose eigenvalue is greater than 1, whereas the features selected using rough set analysis are based on heuristic search. The features selected using PCA and rough set analysis are shown in Fig. 7.

Fig. 7 Features selected using PCA and rough set analysis

Table 4 Measured accuracy and F-measure using PCA
Table 5 Measured accuracy and F-measure using rough set analysis

7.2 Machine Learning Approaches

Tables 4 and 5 present the accuracy and F-measure values measured using PCA and rough set analysis, computed using the equations mentioned in the literature [18, 19]. From the tables, it may be inferred that:

  • The highest detection rate is represented by the bold value.

  • Models developed using feature selection techniques achieved a higher detection rate as compared to the model developed using the complete set of extracted features.

  • The model developed using FLANN-genetic achieved a higher detection rate as compared to FLANN-PSO and FLANN-CSA.

Fig. 8 Box-plot diagram of accuracy and F-measure using PCA

Fig. 9 Box-plot diagram of accuracy and F-measure using RSA

To determine whether the developed malware detection models are effective, box-plot diagrams of the individual models are constructed. Figures 8 and 9 demonstrate the box-plot diagrams for accuracy and F-measure using the feature selection approaches. From the figures, it can be concluded that:

  • Based on Figs. 8 and 9, the model developed by using RSA as the feature selection technique achieves a higher detection rate.

  • On the basis of Fig. 9, it is seen that the model developed by using FLANN-genetic has a higher median value and few outliers. The model built by using RSA achieved a higher detection rate as compared to PCA.
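The quantities read from a box plot (median, quartiles and outliers) can be computed as in the following sketch. The accuracy values are hypothetical and do not come from Tables 4 and 5, and the 1.5 × IQR fence is the conventional outlier rule, assumed here because the chapter does not state which rule its plots use.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles and outliers using the conventional 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical accuracy values (in %) of one model over several app categories.
scores = [90, 91, 92, 93, 94, 95, 96, 70]
summary = box_plot_summary(scores)
print(summary)
```

A model with a high median and few outliers, as reported for FLANN-genetic, performs well consistently across the evaluated data rather than only on a few favorable subsets.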

7.3 Comparison with Existing Developed Frameworks

To determine whether the developed malware detection model is effective in detecting malware, it is compared in this chapter with existing frameworks present in the literature. To perform this experiment, a freely available dataset, i.e., Drebin [2], is considered. Table 6 presents the comparison with existing approaches or frameworks presented in the literature.

Table 6 Comparison of developed model with available frameworks present in the literature

7.3.1 Experimental Findings

In this chapter, a framework is developed by using Android apps and a hybrid artificial neural network. Based on the outcomes, this study is able to answer the questions discussed in Sect. 2.

RQ1: In the present study, six different machine learning techniques are used to develop malware detection models. Based on Tables 4 and 5, it can be inferred that the model built using FLANN-genetic is more effective in detecting malware-infected Android apps.

RQ2: Yes, the proposed detection model is effective in identifying malware-infected apps when compared to existing frameworks present in the literature.

RQ3: From Tables 4 and 5, it can be concluded that feature selection techniques have a significant role in building the malware detection model. Models developed using feature selection techniques are very effective when compared to the model developed using all extracted features.

8 Conclusion

This chapter focused on developing malware detection models by using distinct Android apps. In addition, it is observed that the feature selection approach plays a significant role in selecting the relevant features from all extracted features. Moreover, the model developed using the hybrid approach is more capable of detecting malware as compared to previously developed frameworks.