1 Introduction

Detection of malware on smartphones has become a major concern for the research community. By the end of 2019, the number of Android users worldwide reached 3.3 billion.Footnote 1 Android is based on the Linux kernel and provides useful services such as security configuration, process management, and others. The primary reasons for the growth of the Android operating system are its open nature and freely available apps. At the end of July 2019,Footnote 2 the Google Play store hosted 2.7 million free and paid apps. Downloads from the Google Play store increased by 13%Footnote 3 over the previous year. The Android operating system follows the principle of privilege separation, whereby each app has its own distinct system identity, i.e., a group-ID and a Linux user-ID. Each app runs in a process sandbox and must request permission to use resources that lie outside its sandbox. Depending on the sensitivity of a permission, the system either grants it automatically or prompts the user to approve or reject the request. Permissions granted by users include access to the calendar, camera, body sensors, microphone, contacts, location, SMS, and device storage. To defend the official Google marketFootnote 4 from malware-infected apps, Google introduced Google Bouncer in 2012, which scans new apps at the time of their launch. However, Google Bouncer has a limitation: it can easily be fingerprinted.Footnote 5 It is not very difficult for malware apps to bypass Google's security check, enter the official Google market,Footnote 6 and ultimately reach users' devices. By taking advantage of permissions, cyber-criminals build malware apps on a daily basis and invite users to install them. More than two billion active Android devices are present in the market.Footnote 7 To overcome the drawbacks of Bouncer and to protect Android devices, Google introduced Google Play Protect.
Google Play Protect has the capability to protect data in real time. However, according to a study,Footnote 8 G-Data security experts counted 4.18 million malware applications by the end of 2019 and discovered over 750,000 new malware applications during the first quarter of 2020.

Android apps work on a permission-based model [15]. The Android operating system provides protection at four levels, categorizing permissions asFootnote 9 "signature", "signature or system", "normal", and "dangerous". In our study, we do not consider "signature" and "signature or system" permissions because they are system-granted. We only consider "normal" and "dangerous" permissions, which are granted by the user. Normal permissions do not pose any risk to the user's privacy: if such a permission is listed in the manifest file, the system grants it automatically. Dangerous permissions, on the other hand, give access to the user's confidential data, and it is entirely up to the user to grant or revoke a permission or set of permissions.
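As an illustration of this two-level view, the sketch below partitions an app's requested permissions into "normal" and "dangerous". The protection-level table is a small hypothetical excerpt for illustration, not the full Android permission list.

```python
# Minimal sketch: split one app's requested permissions by protection level.
# PROTECTION_LEVELS is an illustrative excerpt, not the complete Android list.
PROTECTION_LEVELS = {
    "android.permission.INTERNET": "normal",
    "android.permission.VIBRATE": "normal",
    "android.permission.READ_CONTACTS": "dangerous",
    "android.permission.SEND_SMS": "dangerous",
    "android.permission.CAMERA": "dangerous",
}

def split_by_level(requested):
    """Return (normal, dangerous) permission lists for one app."""
    normal = [p for p in requested if PROTECTION_LEVELS.get(p) == "normal"]
    dangerous = [p for p in requested if PROTECTION_LEVELS.get(p) == "dangerous"]
    return normal, dangerous

normal, dangerous = split_by_level(
    ["android.permission.INTERNET", "android.permission.SEND_SMS"]
)
```

In this view, only the "dangerous" list triggers a user prompt; the "normal" list is granted silently at install time.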

Selection of the right feature or feature sets has a great effect on the performance of malware detection [45, 46, 49, 62]. Feature selection is the procedure of selecting appropriate features from the total available features. Feature selection approaches are classified into two distinct groups: feature ranking methods and feature subset selection methods [45, 46, 49, 62]. Feature ranking orders the features on the basis of a scoring function [49, 62], whereas feature subset selection discovers an optimal feature subset [46]. In our study, we implemented ten distinct feature selection approaches to select the best features and retained only those feature sets that have excellent discriminatory power.
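The ranking family can be contrasted with a minimal sketch. The scoring function below (absolute difference of per-class feature means) is a toy criterion of our own choosing for illustration, not one of the paper's ten approaches: it scores each feature independently and sorts them, which is exactly the "ranking" pattern.

```python
# Sketch of a feature ranking approach: score each column independently,
# then order features by score. The score here is a toy criterion
# (|mean over malware rows - mean over benign rows|).
def rank_features(X, y):
    """Return [(feature_index, score), ...] sorted best-first."""
    n_feats = len(X[0])
    scores = []
    for j in range(n_feats):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        score = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
        scores.append((j, score))
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Toy Boolean data: feature 0 separates the classes, feature 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
ranking = rank_features(X, y)
```

A subset selection method would instead search over groups of columns jointly (e.g., greedily adding the feature that most improves a model), trading the ranking method's speed for the ability to capture feature interactions.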

In the literature [45, 46, 49, 62], researchers and academicians have applied distinct machine learning algorithms based on classification, regression, and clustering to develop Android malware detection models. The main flaw in their work is that they used labelled data sets to develop the detection models. To overcome this issue, in this study we consider five distinct unsupervised machine learning algorithms [i.e., K-means, farthest-first clustering, filtered clustering, density-based clustering, and self-organizing map (SOM)] to develop a model for Android malware detection.
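For intuition on the unsupervised setting, here is a minimal one-dimensional, two-cluster K-means sketch in pure Python: no class labels are consulted anywhere, and samples are grouped purely by distance to the cluster centroids. This is a didactic toy, not the paper's implementation.

```python
# Minimal 1-D, two-cluster K-means sketch. Labels are never used:
# points are assigned to the nearer centroid, centroids are recomputed,
# and the process repeats until (here) a fixed iteration budget is spent.
def kmeans_1d(points, iters=20):
    """Return the two clusters (sorted) found by 1-D K-means with k=2."""
    c0, c1 = min(points), max(points)  # simple, deterministic seeding
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)  # recompute centroids
    return sorted(a), sorted(b)

# Two well-separated groups of (hypothetical) feature scores.
low, high = kmeans_1d([0.1, 0.2, 0.15, 0.9, 0.95, 1.0])
```

In a detection pipeline the resulting clusters would then be interpreted (e.g., the cluster whose members trip anti-virus scanners is labelled "malware") only after clustering has finished, which is what keeps the training itself label-free.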

The phases followed in developing the malware detection model are demonstrated in Fig. 1. In the first stage, we collect Android application package (.apk) files from different repositories and identify their classes. In the second stage of our experiment, we extract permissions and API calls from the collected .apk files and consider them as features. In the third stage, we select the best features by using ten different feature selection approaches. Next, with the help of the selected features, we train five different unsupervised machine learning algorithms and build models. We compare the developed models using distinct performance parameters, i.e., intra-cluster distance, inter-cluster distance, accuracy, and F-measure. In the last stage, we validate our proposed model against existing techniques available in the literature.

Fig. 1 Flow chart of the proposed work

The novel and unique contributions of this research paper are:

  • To the best of our knowledge, this is the first work in which 500,000 unique apps belonging to 30 different categories of Android apps are collected. The extracted features are publicly available for researchers and academicians.Footnote 10 To build an effective and efficient malware detection model, we extract permissions, the rating of an app, the number of user downloads of an app, and API calls as features, and achieve a detection rate of 98.8% when compared to distinct anti-virus scanners.

  • We propose a new approach based on unsupervised machine learning that selects relevant features using feature selection. Our empirical results reveal that the suggested model is able to detect 98.4% of unknown malware in real-world apps.

  • The proposed framework is able to detect malware in Android apps using a 100% unlabelled data set.

  • In this study, we apply t test analysis to investigate whether there is a significant difference among the features selected by the feature selection approaches.

  • The proposed malware detection approach is able to detect malware in less time when compared to distinct anti-virus scanners available in the market.

The rest of the paper is organized as follows. In Sect. 2, we discuss work related to Android malware detection. In Sect. 3, we discuss the Android permission model. In Sect. 4, we present the formulation of the data set. Section 5 presents the feature selection approaches implemented in this study. In Sect. 6, we discuss the different machine learning algorithms. In Sect. 7, we present the different techniques used in the literature to detect malware in real-world apps. Section 8 describes the performance parameters, and the experimental setup is presented in Sect. 9. Section 10 contains the experimental results, i.e., which model is best at detecting malware in real-world apps. Finally, we discuss threats to validity in Sect. 11, and the conclusion of this empirical study is presented in Sect. 12.

2 Related work

Enck et al. [28] proposed the Kirin framework, which helps detect malware apps based on the permissions they request at installation time. Kirin is based on a set of rules that help mitigate the effect of malware in Android apps. Suarez-Tangil et al. [65] examined whether cloud-based detection or on-device detection is more power-efficient. They suggested a power model to compare both methods with the help of machine learning algorithms. Empirical results reveal that the cloud-based detection method is more effective and the better choice to detect malware. Cui et al. [26] proposed a malware detection model based on cloud computing using network packets. They used the principles of data mining to reduce the number of packet branches by learning whether a packet is useful for malware detection or not. They proposed SMMDS, which works on the principles of machine learning algorithms to detect malware. Chen et al. [23] proposed a solution that monitors the behavior of smartphones when they send a user's private information to an external source. But the solution provided in their study is not effective, because it cannot support real-time detection. Narudin et al. [53] proposed STREAM, which automatically installs and runs Android apps and extracts features from them. The extracted features are then used to train machine learning classifiers to detect malware in Android apps. STREAM has a disadvantage: it takes a lot of system resources and time to load the data. Wei et al. [73] built a malware detection model based on the anomalous behavior of Android apps. They developed a model considering network information as a feature, using the Naïve Bayes and Logistic machine learning algorithms, and achieved a higher accuracy rate. Ali et al. [11] suggested a malware detection model based on a Gaussian mixture. They collected features based on hardware utilization, such as CPU, memory, battery, and so on, and trained the model with the help of a Gaussian mixture. But the model proposed by them has a limitation: it needs a remote server for computation. Dixon et al. [27] developed a model using the behavior of a smartphone's battery life when infected by malware. But the model proposed by them is not able to detect some sophisticated malware.

Tong and Yan [67] proposed a hybrid approach to detect Android malware using individual system calls and sequential system calls related to accessing files and networks. Their approach is able to detect the behavior of an unknown app and achieved a detection rate of 91.76%. But the presented approach has a limitation: it cannot support real-time detection. Quan et al. [59] used three different feature sets, i.e., native code, system calls, and API calls, to detect Android malware. The detection rate depends upon a predefined threshold value. Ng et al. [54] developed a model using the Dendritic Cell Algorithm, considering system calls as features. They selected the best features by implementing statistical methods and achieved a higher detection rate. Sheen et al. [63] proposed an Android malware detection system considering API calls and permissions as features. They chose features using the Relief algorithm to train three different classifiers: J48, SVM, and Naïve Bayes. The detection rate is good, but the system also consumes a large amount of resources and its computational burden is too high. Fung et al. [31] proposed a decision model, RevMatch, which works on the principle of malware detection history to decide whether an Android app is infected with malware or not. This approach does not provide real-time detection. Babaagba and Adesanya [12] compared the performance of supervised and unsupervised machine learning algorithms, with and without feature selection approaches. Their empirical study was performed on 149 Android apps, and the results reveal that the model developed with a feature selection approach and a supervised machine learning algorithm achieved a higher detection rate than the model developed using an unsupervised machine learning algorithm. Yewale and Singh [78] proposed a malware detection model based on opcode frequency. Experiments were performed on 100 distinct files, achieving a 96.67% detection rate using SVM as the machine learning algorithm.

Enck et al. [29] proposed TaintDroid, which works on the principle of tracking information flow in the network. TaintDroid tracks the malicious behavior of apps communicating through files, program variables, and inter-process messages. The process is too time-consuming to label an app as benign or malware. Abawajy and Kelarev [2] proposed ICFS, which detects Android malware by incorporating feature selection approaches and machine learning classifiers. Guo et al. [32] modelled smartphone network behavior using Naïve Bayes as the machine learning algorithm. They built patterns from benign and malware apps to discover malware among unlabeled apps. In a recent study [68], the authors presented the mechanism by which an app breaches user privacy to gain access to private user data. They proposed a general and novel defence solution to protect resources and data in Android-based devices. Rahman and Saha [61] proposed StackDroid, a multi-level architecture used to minimize the error rate. They detected malware at two different levels: in the first level they consider multi-layer perceptron, stochastic gradient descent, random forest, and extremely randomized trees, and in the second level they consider extreme gradient boosting as the machine learning classifier to detect Android malware. Barrera et al. [13] proposed a methodology that works on the principles of the permission model by implementing a self-organizing map (SOM) on a collected data set of 1,100 Android apps. They analyzed the Android permission model, which is used to investigate malware apps on Android. The SOM they implemented gives a 2-dimensional visualization of high-dimensional data.

Alazab et al. [4] proposed an effective classification model that combines permission requests and API calls. The API calls were divided into three groups, i.e., the ambiguous group, the risky group, and the disruptive group. Experiments were performed on 27,891 malware-infected Android apps, and the proposed model achieved an F-measure of 94.3%. Xiao et al. [75] proposed a novel detection approach based on the principle of deep learning. In their study, the authors treat the semantic information in system call sequences as natural language and construct a classifier based on the long short-term memory (LSTM) language model. Empirical results reveal that the proposed approach achieved a detection rate of 96.6%. Yuxin and Siyi [79] proposed a malware detection model based on the principle of the Deep Belief Network (DBN). In their study, they compare the performance of the proposed model with three baseline malware detection models that use support vector machines, decision trees, and the k-nearest neighbor algorithm as classifiers. Experimental results indicate that the autoencoder can effectively model the underlying structure of the input data and significantly reduce the dimensions of the feature vectors.

Vinayakumar et al. [69] proposed an effective zero-day malware detection model based on image processing techniques with optimal parameters for machine learning algorithms (MLAs) and deep learning architectures. Experiments were performed on two distinct data sets, and they achieved detection rates of 96.2% and 98.8% with the proposed detection model. Arora et al. [9] proposed a framework named PermPair that constructs and compares graphs for malware and normal samples by extracting permission pairs from the manifest file of an app. Empirical results reveal that the proposed scheme is successful in detecting malicious samples with an accuracy of 95.44% when compared to mobile anti-malware apps. Lee et al. [42] proposed a malware detection model that learns the generalized correlation between obfuscated string patterns from an application's package name and the certificate owner name. Experimental results reveal that the proposed model is robust to obfuscation and sufficiently lightweight for Android devices.

Alzaylaee et al. [6] proposed DL-Droid, based on the principle of deep learning. Experiments were performed on 30,000 distinct Android apps. Empirical results reveal that DL-Droid can achieve up to a 97.8% detection rate with dynamic features only and a 99.6% detection rate with dynamic and static features combined. Ma et al. [44] proposed a malware detection approach based on the control flow graph of the app to obtain API information. On the basis of the API information, they constructed three different data sets, based on Boolean, frequency, and time-series representations. Using these three data sets, three different malware detection models were constructed. Experiments were conducted on 10,010 benign applications and 10,683 malicious applications. The results show that the detection model achieves 98.98% detection precision. Jerbi et al. [36] proposed artificial malware-based detection (AMD), which is based on extracted malware patterns that were generated artificially. Experiments were performed on balanced and imbalanced data sets and achieved an accuracy of 99.69% for the balanced data set and 99.64% for the imbalanced data set. Table 1 lists the framework name or the authors who proposed each approach in the literature, along with its detection type, features used, implemented algorithm, place of analysis, and the major observations of the study.

The previous research mentioned above has the following limitations: use of a limited data set, a high detection rate only on a limited data set, high computational burden, implementation of a limited number of feature selection approaches, implementation of a limited number of classification algorithms using a 100% labelled data set, and inability to detect sophisticated malware. To overcome the first limitation, in this study we collect 500,000 Android apps belonging to thirty different categories from the repositories mentioned in Sect. 4. Further, to overcome the other limitations, we implement ten distinct feature selection approaches on the extracted feature data set (i.e., permissions and API calls are considered as features in this study). Next, the selected features are used as input to develop models using unsupervised machine learning algorithms (meaning no labelled data is required to develop the models), so that a suitable model is built to identify malware in real-world apps.

Table 1 Dynamic-analysis-based smartphone malware detection approaches presented in the literature

2.1 Research questions

To develop a malware detection model for Android with a better detection rate and to cover the gaps present in the literature, we consider the following research questions in this paper:

RQ1. Which malware detection model is most appropriate to detect malware from real-world apps?

This question helps in finding the most appropriate model for malware detection in Android. In this work, we build 50 distinct models by considering ten distinct feature selection approaches and five different machine learning techniques. Further, to identify which model is more appropriate for malware detection, we consider four distinct performance parameters in our study.

RQ2. Is the presented malware detection framework effective at detecting malware on Android devices?

The goal of this question is to investigate the performance of our malware detection approach. For this, we compare the performance of our developed model with some existing techniques available in the literature.

RQ3. Does a subset of features perform better than all the extracted features for the task of detecting whether an app is malicious or not?

The aim of this question is to evaluate the features and investigate their relationship with benign and malware apps. Distinct kinds of feature reduction approaches are considered for finding subsets of features that are able to detect whether an app is malicious or not.

RQ4. Among the different implemented feature ranking approaches, which approach works best for detecting whether an Android app belongs to the benign class or the malware class?

In a feature ranking approach, the efficiency of the machine learning algorithms is affected by the characteristics and nature of the malware data set. Distinct approaches with various criteria are implemented to rank the collected feature sets. Four distinct performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure, and accuracy, are considered in this study to compare the distinct feature-ranking approaches.

RQ5. Among the applied feature subset selection approaches, which approach performs best for the task of detecting malware in Android apps?

To determine the subset of features that is appropriate for detecting whether an Android app is benign or malware, we consider feature subset selection approaches. In this work, we compare the distinct approaches using four performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure, and accuracy.
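The two cluster-quality criteria can be sketched concretely. In this hedged sketch, intra-cluster distance is the average Euclidean distance of a cluster's points to its own centroid (lower is tighter), and inter-cluster distance is the distance between two cluster centroids (higher is better separated); the data points are hypothetical 2-D feature vectors.

```python
# Sketch of the intra-cluster and inter-cluster criteria used as
# performance parameters: tight clusters have small intra-cluster
# distance, well-separated clusters have large inter-cluster distance.
import math

def centroid(cluster):
    """Component-wise mean of a list of equal-length points."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intra_cluster(cluster):
    """Average distance of points to their own centroid."""
    c = centroid(cluster)
    return sum(euclid(p, c) for p in cluster) / len(cluster)

def inter_cluster(c1, c2):
    """Distance between the two cluster centroids."""
    return euclid(centroid(c1), centroid(c2))

# Hypothetical 2-D feature vectors for two clusters.
benign = [[0.0, 0.0], [0.0, 2.0]]
malware = [[4.0, 0.0], [4.0, 2.0]]
intra = intra_cluster(benign)
inter = inter_cluster(benign, malware)
```

When comparing feature selection approaches, one then prefers the approach whose clusters minimize the intra-cluster value while maximizing the inter-cluster value.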

RQ6. How do the feature subset selection approaches compare with the feature ranking approaches?

In this paper, a pair-wise t test is used to determine whether the feature subset selection approaches are more appropriate than the feature ranking approaches or whether both behave equally well.
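A paired t test of this kind can be sketched as follows. The accuracy values are hypothetical, and only the t statistic is computed (the p-value lookup against the t distribution is omitted); a large |t| over the paired differences is what would indicate a significant difference between the two approaches.

```python
# Minimal paired t-test sketch: t = mean(d) / (sd(d) / sqrt(n)),
# where d are the per-run differences between two approaches.
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b of equal length."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical accuracies of two approaches over the same five runs.
subset_acc = [0.97, 0.96, 0.98, 0.97, 0.96]
ranking_acc = [0.95, 0.94, 0.95, 0.96, 0.94]
t = paired_t_statistic(subset_acc, ranking_acc)
```

Pairing by run matters: it cancels run-to-run variation (same data split, same seed) so that only the between-approach difference contributes to the statistic.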

RQ7. Do the feature selection approaches affect the outcome of the unsupervised machine learning approaches?

It is seen that a number of feature selection approaches perform extremely well with specific unsupervised machine learning methods. Therefore, in this research work, distinct feature selection approaches are evaluated using distinct unsupervised machine learning approaches to measure their performance. Further, we also examine how the performance of each feature selection approach varies across the distinct unsupervised machine learning approaches.

3 Android permission model

Android security depends on a permission-based model that gates access to functionality or features that could impact the user's privacy. Android apps are written in the Java language and run in a Dalvik virtual machine. Android apps demand permissions from users at installation and at run time. Permissions such as sending messages, making phone calls, or access to the vibrator or keeping the device screen on are requested by an app. The user can grant or reject a permission request made by an app. This facility of granting or revoking permissions is available from Android version 6.0 onwards. Currently in Android, there are more than 110 different features whose access is gained through these permissions. Android is designed so that if any third-party app demands new functionality, the developer has the privilege to define a new permission, known as a developer-defined or self-defined permission. Taking advantage of this, cyber-criminals define new permissions on a regular basis so that they can access users' private data for their own benefit.

Fig. 2 Class diagram of com.ToGoHome Android app

For an in-depth analysis of the permission model, we built an app named "ToGoHome".Footnote 11 Figure 2 represents the class diagram of the Android app, showing the classes, attributes, and methods involved. The package contains 10 distinct classes that contain the logic of cab reservation. In the first step, when the user clicks on the "HomePage" menu and searches for a cab, control is directed toward the class named "Pickup", whose data is collected from the current location of the user using the Global Positioning System (GPS), and the class named "myLocationListener" is called. The "Pickup" class exposes seven different methods, i.e., updateDisplayDate:void, updateDisplayTime:void, GetDistance:void, checkAddress:bool, onConfigurationChanged:void, onActivityResult:void, and onCreate:void. If the user confirms his pick-up location through GPS, the booking is confirmed by calling "BookingDetails". While travelling, the user tracks his live location by calling the "Route" class.Footnote 12 After completing the journey, the app calls the "Pay" class, which directs the user to the payment website. Throughout this process, the following Android permissions are used: ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION, ACCESS_NETWORK_STATE, INTERNET, SEND_SMS, and RECEIVE_SMS.

4 Formulation of data set

Figure 3 demonstrates the phases followed in extracting features from Android apps. In the first phase, we identify the URLs from which Android apps are to be collected (mentioned in Sect. 4.1). In the second phase, we use an app crawler to download the apps from the identified URLs. Our app crawler can download as many apps as possible without impacting the Android app repository. To perform dynamic analysis of the collected Android apps, we use Android Studio as an emulator (mentioned in Sect. 4.2). Further, we write a program in the Java language to extract permissions and API calls from the apps and save them into a .csv file for developing the Android malware detection model. The extracted features are publicly available for researchers and academicians.Footnote 13

Fig. 3 Extraction of features from .apk files

4.1 Collection of .apk files

Previous studies mentioned in Table 1 used only limited data sets of Android apps to examine associations with the malware or benign class; in addition, in the literature [72, 80], academicians and researchers did not mention the categories to which the apps belong. Therefore, it is not possible to draw generic conclusions relevant to all Android apps and systems. To overcome this gap, we collect apps of thirty different categories, which generalize and strengthen our outcomes. We collect 550,000 .apk files to build our data set, from Google's Play store,Footnote 14 hiapk,Footnote 15 appchina,Footnote 16 Android,Footnote 17 mumayi,Footnote 18 gfan,Footnote 19 slideme,Footnote 20 and pandaapp.Footnote 21 Among these 550,000 benign .apk files, 500,000 are distinct. Further, features are extracted after deleting virus-infected apps reported by VirusTotalFootnote 22 and Microsoft Windows Defender.Footnote 23 VirusTotal identifies malware-affected apps by using 70 different antivirus engines simultaneously. A total of 55,000 malware samples were collected from three different repositories. 1,929 botnet samples were collected from [38], covering 14 distinct botnet families. The Android Malware Genome project [80] contains a data set of 1,200 malware samples that covers the currently present Android malware families. We collected about 56,871 samples from AndroMalShareFootnote 24 along with their package names. After removing duplicate packages from the collected data set, 50,000 unique malware samples remain in our study. Both benign and malware apps were collected from the above-mentioned sources at the end of December 2018. Table 2 shows the number of .apk files belonging to different categories, i.e., business, comics, communication, education, and so on. To better differentiate between benign and malware apps, we consider .apk files belonging to the normal, trojan, backdoor, worm, botnet, and spyware familiesFootnote 25 mentioned in Table 2.
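Duplicate removal of the kind described above can be sketched by content hashing: files whose digests match are treated as one sample. The byte strings below stand in for real .apk contents; in practice the full bytes of each downloaded file would be hashed.

```python
# Sketch of de-duplicating collected .apk files by SHA-256 digest.
# The byte strings are stand-ins for real .apk file contents.
import hashlib

def unique_samples(blobs):
    """Keep the first occurrence of each distinct file content."""
    seen, unique = set(), []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

apks = [b"apk-A", b"apk-B", b"apk-A"]  # third file duplicates the first
deduped = unique_samples(apks)
```

Hashing full file contents catches byte-identical re-uploads across markets; catching repackaged variants (same code, different signature) would require package-name or code-level comparison instead.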

Table 2 Categories of .apk files belonging to their respective classes

4.2 Extraction of features

After collecting unique samples of .apk files from the various sources mentioned in the previous subsection, we extract permissions and API calls from each .apk file. Extraction of permissions and API calls is performed with the help of an emulator (in our study we use Android Studio). The emulator provides the same API level and execution environment as our smartphones provide. In our study, to extract permissions and API calls from Android apps, we use Android version 6.0 Marshmallow (i.e., API level 23) and form our data set for the experiments. Previously developed frameworks or approaches used earlier versions of Android to extract features. There are two reasons for selecting this Android version: first, it asks the user to grant or revoke permission to use the resources of the smartphone, and second, it covers 28.1% of Android devices, which is higher than the other versions present in the market.Footnote 26 To extract features from the collected .apk files, we execute each of them in the emulator and extract permissions from "AndroidManifest.xml" using self-written Java code. These permissions are demanded by apps during installation and at run time. Repeating the same process, we extract permissions from 500,000 different Android apps and record them in .csv file format. This extracted data set, listing the names of the permissions, is publicly available for researchers.Footnote 27,Footnote 28 Previous researchers used a limited set of features to develop malware detection models. To overcome this gap, in this study we collect 1,532 permissions and 310 API calls, which help in building an effective and efficient Android malware detection model. Hence, each collected app can be represented as a 1,842-dimensional Boolean vector, where "1" implies that the app requires the permission and "0" implies that it does not. It is very common for distinct apps to request a similar set of permissions for their execution. The permissions overview given by GoogleFootnote 29 is used to describe the behavior of a permission, i.e., "dangerous" or "normal".
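The Boolean encoding described above can be sketched as follows. The vocabulary here is a tiny hypothetical subset of the paper's 1,842 features (permission names plus API call names), used only to show the "1 if present, 0 otherwise" mapping.

```python
# Sketch: encode one app as a Boolean vector over a fixed feature
# vocabulary. VOCAB is a tiny illustrative subset of the real 1,842
# features (permissions + API calls), not the actual list.
VOCAB = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.telephony.SmsManager.sendTextMessage",
]

def to_boolean_vector(app_features):
    """1 where the app uses the vocabulary feature, 0 otherwise."""
    present = set(app_features)
    return [1 if f in present else 0 for f in VOCAB]

vec = to_boolean_vector(
    ["android.permission.INTERNET",
     "android.telephony.SmsManager.sendTextMessage"]
)
```

Keeping the vocabulary order fixed across all apps is what makes the resulting rows comparable: every .csv column then means the same permission or API call for every sample.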

4.3 Formulation of feature sets

Several approaches have been developed for Android malware detection [1, 13, 45, 46, 49, 50, 74]. In this study, we divide the extracted API calls and permissions into thirty different feature sets, which help in developing the malware detection model. Table 3 displays the basic descriptions of the feature sets considered in our work.

Table 3 Formulation of feature sets (permissions, API calls, number of user downloads of an app, and rating of an app)

5 Feature selection approaches

Previous studies show that a number of authors applied different feature ranking approaches to detect malware in Android apps and achieved good detection rates. This indicates that the outcome of a malware detection model relies on the features that are taken as input to design the model. Selecting suitable feature sets is an essential data preprocessing task in machine learning. In the field of malware detection, some researchers have used selection approaches to select an appropriate set of features. In this paper, we implement ten distinct feature selection approaches on a large collection of 1,842 features (divided into thirty distinct feature sets) to identify the best subset of features, which assists us in detecting malware with a better detection rate and also minimizes the number of misclassification errors. Feature ranking approaches and feature subset selection approaches can be defined in the following manner [45, 46, 49, 62]:

  • Feature ranking approaches These approaches use certain conclusive criteria to rank the features. On the basis of their ranks, appropriate features can then be selected to build the model [49, 62].

  • Feature subset selection approaches These approaches aim to search for a subset of features that has good detective capability [4, 46].

5.1 Feature ranking approaches

These approaches rank features separately without applying any training algorithm. The ranking of features depends upon their score. On the basis of our investigation of previous studies, the majority of approaches are capable of calculating a grade for every feature. In this research, we employ six different ranking approaches to rank the features; they are explained below.

5.1.1 Gain-ratio feature selection

In this selection approach, feature ranking work on the prediction of the gain-ratio in relation to the class [49, 55]. The “Z” known as the gain-ratio of feature is determined as:

$$\begin{aligned} \text{ Gain-Ratio }=\frac{Gain(Z)}{SplitInfo_Z(X)}, \end{aligned}$$
(1)

where Gain \((Z)=I(X)-E(Z)\) and X denotes the set containing m instances belonging to n distinct classes. The expected information needed to classify a given sample is calculated by the following equation:

$$\begin{aligned} I(X)=-\sum _{i=1}^{n}p_i\log _2 (p_i). \end{aligned}$$
(2)

Here \(p_i\) is the probability that a random sample belongs to class \(C_i\) and is estimated by \(z_i/z,\) where \(z_i\) is the number of instances of class \(C_i\) and z is the total number of instances.

Let \(z_{ij}\) denote the number of instances of class \(C_i\) in subset \(X_j,\) where the feature Z partitions X into subsets \(X_1, \ldots , X_t.\) The expected information based on this partitioning by Z is given by

$$\begin{aligned} E(Z)=\sum _{j=1}^{t}\frac{z_{1j}+z_{2j}+\cdots +z_{nj}}{z}\, I(X_j). \end{aligned}$$
(3)

\(SplitInfo_Z(X)\) is measured by utilizing the following equation:

$$\begin{aligned} SplitInfo_Z(X)=-\sum _{j=1}^t \frac{|X_j|}{|X|}\log _2 \left( \frac{|X_j|}{|X|}\right) \end{aligned}$$
(4)

The value of \(SplitInfo_Z(X)\) represents the information generated by splitting the training data set X into t partitions corresponding to the t outcomes of a test on the feature Z.
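The gain-ratio computation of Eqs. (1)–(4) can be sketched in a few lines of Python (an illustrative toy example, not the study's MATLAB implementation; the helper names and sample data are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """I(X): expected information needed to classify a sample (Eq. 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Gain(Z) / SplitInfo_Z(X) for a discrete feature Z (Eq. 1)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    # E(Z): expected information after partitioning X by the values of Z (Eq. 3)
    e_z = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - e_z
    # SplitInfo_Z(X): information generated by the split itself (Eq. 4)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# toy data: a binary permission feature vs. benign(0)/malware(1) labels
perm = [1, 1, 0, 0, 1, 0]
label = [1, 1, 0, 0, 1, 1]
print(round(gain_ratio(perm, label), 3))  # 0.459
```

Features would then be ranked by this score, from highest to lowest.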

5.1.2 Chi-squared test

This test is employed to examine the independence of two events [49, 57]; in our work, features are ranked by the significance of their chi-squared statistic with respect to the class. A higher value implies a stronger dependence between the feature and the class, and consequently such features can be regarded as more relevant for detecting malware in Android apps. Chi-squared attribute evaluation measures the worth of a feature by computing the value of the chi-squared statistic with respect to the class. The initial hypothesis \(H_0\) is the assumption that the two features are unrelated, and it is tested by the chi-squared formula:

$$\begin{aligned} \chi ^2=\sum _{i=1}^r\sum _{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} \end{aligned}$$
(5)

where \(O_{ij}\) is an observed frequency and \(E_{ij}\) is the expected (theoretical) frequency asserted by the null hypothesis. The greater the value of \(\chi ^2,\) the stronger the evidence against the hypothesis \(H_0\).
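As a small illustration of Eq. (5), the statistic can be computed over a contingency table of feature presence versus class; the table values below are made up:

```python
def chi_squared(table):
    """Chi-squared statistic over an r x c contingency table of observed
    frequencies O_ij; E_ij is derived from the marginals under H0."""
    r, c = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(r)) for j in range(c)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_sums[i] * col_sums[j] / total  # E_ij under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# rows: feature present / absent; columns: malware / benign
print(round(chi_squared([[30, 10], [5, 25]]), 2))  # 23.33
```

The larger the statistic, the higher the feature is ranked.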

5.1.3 Information-gain feature selection

In Info-gain, features are selected on the basis of their relation to the class [49, 55]. In this approach, entropy is used as a criterion of impurity in a training set S (the data set), which makes it possible to define a measure reflecting the additional information about Y (a random feature) provided by X (a second random feature), i.e., the amount by which the entropy of Y decreases. It is given by

$$\begin{aligned} IG = H(Y) - H(Y|X) = H(X) - H(X|Y) \end{aligned}$$
(6)

The information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the IG criterion is that it is biased in favor of features with more values even when they are not more informative [49, 55].
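Eq. (6) can be illustrated for discrete variables with a short sketch (a hypothetical toy example, not the study's code):

```python
import math
from collections import Counter

def H(xs):
    """Shannon entropy of a discrete variable."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def info_gain(x, y):
    """IG = H(Y) - H(Y|X) for a discrete feature x and class label y."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    h_y_given_x = sum(len(g) / n * H(g) for g in groups.values())
    return H(y) - h_y_given_x

x = [0, 0, 1, 1]        # e.g. a permission bit
y = [0, 0, 1, 1]        # the feature fully determines the class
print(info_gain(x, y))  # 1.0
```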

5.1.4 OneR feature selection

The OneR feature selection approach is also used for grading features [49, 55]. To rank individual features it utilizes a simple one-rule classification mechanism: continuous features are handled by dividing their range of values into a few disjoint intervals using a straightforward discretization method. In this study, we consider the features with better classification rates.

5.1.5 Principal component analysis (PCA)

Attribute reduction is accomplished by implementing PCA on our collected data set. PCA transforms a high-dimensional data space into a low-dimensional one; the features present in the low-dimensional space are of prime importance for detecting malware [49, 70]. Since the correlation among several features is high, PCA is utilized to map them to a new set of features that are not highly correlated, named principal component domain features. Further, a small number of principal components is sufficient to identify significant patterns in the data. The detailed phases of PCA are demonstrated in Fig. 4.

Fig. 4
figure 4

Framework of PCA calculation

The feature data set is collected in the form of an \(m \times n\) matrix that contains n data samples and m extracted features. In the second phase, normalization of the feature data set is performed by computing

$$\begin{aligned} \mu _j=\frac{1}{n}\sum _{i=1}^n x_i^j \end{aligned}$$

and replacing \(x^j\) with \((x^j-\mu _j)\). Next, we calculate the eigenvalues and eigenvectors of the covariance matrix in the MATLAB environment. To select the first k principal components, we perform the following steps:

while (i = 1 to m) do
  evaluate \(cumvar = \sum _{i=1}^k \lambda _{ii} \big / \sum _{i=1}^m \lambda _{ii}\)
  if \((cumvar \ge 0.99)\) or \((1 - cumvar \le 0.01)\) then
    return k (99% of the variance is retained)
  end if
end while

Here cumvar denotes the cumulative variance and \(\lambda _{ii}\) represents the eigenvalues, sorted in descending order.

After this evaluation, the reduced feature sets are selected for training purposes.
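The PCA phases above (mean normalization, eigendecomposition of the covariance matrix, and retaining the leading k components until 99% of the variance is preserved) can be sketched as follows. This is an illustrative NumPy version with made-up data; the study performed the computation in the MATLAB environment:

```python
import numpy as np

def pca_reduce(X, var_retained=0.99):
    """Project X onto the first k principal components, where k is the
    smallest number of components whose cumulative variance >= threshold."""
    Xc = X - X.mean(axis=0)                  # subtract mu_j from each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumvar = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumvar, var_retained) + 1)
    return Xc @ eigvecs[:, :k], k

# toy data: four features, of which two are near-duplicates of the others
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 2))])
Z, k = pca_reduce(X)
print(k)  # only 2 of the 4 components are needed to retain 99% of the variance
```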

5.1.6 Logistic regression analysis

For feature ranking, univariate logistic regression (ULR) analysis is considered to verify the degree of importance of every feature set [25, 49]. In the current work, we consider two parameters of the LR model to discover the importance of every feature and to rank the feature sets:

  1.

    Value of regression coefficient The coefficient of a feature indicates the degree of correlation of that feature set with malware.

  2.

    p value The p value, i.e., level of significance, shows whether the correlation is statistically significant.

5.2 Feature subset selection approaches

These approaches are employed to detect the appropriate subset of features which jointly have the best discriminative capability. They are based on the hypothesis that a model built from a well-chosen subset of features achieves a better detection rate and a lower number of misclassification errors than a model built from other feature combinations or from single features. Several approaches are feasible for identifying the right subset of features for detecting malware. In this work, four distinct feature subset selection approaches are considered; they are depicted below:

5.2.1 Correlation based feature selection

This approach selects a subset of features that are highly correlated with the class (i.e., benign or malware). In this research paper, Pearson's correlation (r: coefficient of correlation) has been used for measuring the dependency among features. A higher value of "r" between feature sets indicates a strong relation among these features. It implies that there is a statistical reason to expect that classes having lower (or higher) values of one feature also have lower (or higher) values of other highly correlated features [49].
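A minimal sketch of this idea, assuming binary permission features and a 0/1 class vector (the feature names, threshold and data are hypothetical):

```python
import math

def pearson_r(x, y):
    """Coefficient of correlation r between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_relevant(features, labels, threshold=0.5):
    """Keep features whose |r| with the class (0=benign, 1=malware) is high."""
    return [name for name, vals in features.items()
            if abs(pearson_r(vals, labels)) >= threshold]

feats = {"SEND_SMS": [1, 1, 1, 0], "CAMERA": [1, 0, 1, 0]}
labels = [1, 1, 1, 0]
print(select_relevant(feats, labels))  # ['SEND_SMS', 'CAMERA']
```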

5.2.2 Rough set analysis

This approach is an approximation of a conventional set, in terms of a pair of feature sets which provide the upper and the lower approximation of the original data set [46, 56]. Its main application is mining imperfect data, and here it is used to select a reduced set of features from the extracted feature sets. RSA uses three distinct notions: approximations, reduced attributes and the information system. The steps pursued to obtain the reduced subset by utilizing RSA are mentioned below and also demonstrated in Fig. 5.

(i) Approximation Let \(A=(C,Z), X\subseteq Z\) and \(Y\subseteq C.\) The upper \(({\bar{X}}Y)\) and lower \(({\underline{X}}Y)\) approximations of X are utilized to estimate Y. The upper approximation includes all the objects which may belong to the set, and the lower approximation includes all objects which certainly belong to the set. \({\bar{X}}Y\) and \({\underline{X}}Y\) are computed by utilizing the subsequent equations:

$$\begin{aligned} {\bar{X}}Y= & {} \{y_i\in U\mid [y_i]_{Ind(B)}\cap Y\ne 0\} \end{aligned}$$
(7)
$$\begin{aligned} {\underline{X}}Y= & {} \{y_i\in U\mid [y_i]_{Ind(B)}\subseteq Y\}, \end{aligned}$$
(8)

where \([y_i]_{Ind(B)}\) denotes the equivalence class of \(y_i\) under the relation Ind(B).

(ii) Reduced attributes The accuracy of the approximation of the set Z with respect to \(A\subseteq B\) is determined as:

$$\begin{aligned} \mu _B(A)=\frac{card({\underline{B}}Z)}{card({\bar{B}}Z)} \end{aligned}$$
(9)

The cardinality of a set is the number of elements contained in its lower or upper approximation. Further, all possible feature sets are considered whose accuracy is equivalent to the accuracy of the extracted feature sets.

(iii) Information system It is defined as \(Z=(C,B)\), where C is a universe consisting of a non-empty finite set of objects and B is a set of attributes with a finite number of elements. For each \(b \in B\), there exists a function \(F_b:C\rightarrow V_b\), where \(V_b\) denotes the value set of attribute b. For each \(A\subseteq B,\) there exists an equivalence relation known as the A-indiscernibility relation, \(IND_A(Z)\), which can be defined as:

$$\begin{aligned} IND_A(Z)=\{(x, y) \in C^2\mid \forall a\in A,\ a(x)=a(y)\}. \end{aligned}$$
(10)
Fig. 5
figure 5

Rough set theory framework

5.2.3 Consistency subset evaluation approach

This technique evaluates the importance of a subset of attributes by the level of consistency in the class values when the training instances are projected onto the subset. The consistency rate is calculated with the help of the inconsistency rate, where two data elements are considered inconsistent if they have the same feature values but belong to different class labels (i.e., benign or malware). For this work, the target variable has two distinct values (0 for benign apps and 1 for malware apps). Suppose a group of features (GF) has Z samples comprising z distinct instances, such that \(Z=X_1+X_2+\cdots +X_z.\) Suppose instance \(X_i\) appears in A samples in total, of which \(A_0\) samples are labeled 0 and \(A_1\) are labeled 1, with \(A=A_0+A_1.\) If \(A_1\) is less than \(A_0,\) then the inconsistency count for the instance \(X_i\) is \(INC=A-A_0.\) The inconsistency rate (INCR) of the feature set is computed by utilizing the succeeding equation:

$$\begin{aligned} INCR=\frac{\sum _{i=1}^z INC_i}{Z} \end{aligned}$$
(11)
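Eq. (11) can be illustrated with a small sketch; the helper and toy data below are hypothetical:

```python
from collections import Counter

def inconsistency_rate(rows, labels):
    """INCR: for each distinct feature pattern, count its samples minus the
    majority-class count, summed over patterns and divided by the total Z."""
    patterns = {}
    for row, y in zip(rows, labels):
        patterns.setdefault(tuple(row), []).append(y)
    inc = sum(len(ys) - max(Counter(ys).values()) for ys in patterns.values())
    return inc / len(labels)

# two apps share the same feature values but different labels -> inconsistent
rows = [[1, 0], [1, 0], [0, 1], [1, 1]]
labels = [1, 0, 0, 1]
print(inconsistency_rate(rows, labels))  # 0.25
```

Subsets with a lower inconsistency rate are preferred.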

5.3 Filtered subset evaluation

Filtered subset evaluation applies a subset evaluator to data that has been passed through an arbitrary filter [40]. The filtering technique is not based on any induction algorithm, which makes filtered subset evaluation scalable and fast. Figure 6 demonstrates the steps followed to find the subset of features by utilizing the filter method.

Fig. 6
figure 6

Feature selection by utilizing filter approach

6 Machine learning techniques

Various authors have applied unsupervised machine learning classifiers such as K-mean [18, 52] and Self-Organizing Maps (SOM) [18, 20] to detect Android malware, but they applied the clustering algorithms on limited data sets. To overcome this gap, in this study we implement five different clustering algorithms on our extracted data set: SOM, K-mean, farthest first clustering, filtered clustering and density-based clustering. The choice of clustering algorithms is based on their performance in the literature [18, 20, 52, 74].

6.1 Self-organizing maps (SOM)

SOM is a type of artificial neural network (ANN) that is trained with the help of unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction.Footnote 30 A SOM consists of neurons which have the same dimensionality as the input space and are arranged in a rectangular or hexagonal grid. SOM neurons can be viewed as pointers in the input space, where more neurons point to regions with a high concentration of inputs [13]. The training algorithm can be summarized in four basic steps:

  1.

    Initialize neuron weights \(W_i=[w_{i1}, w_{i2}, \ldots , w_{ij}]^T \in R^j.\)

  2.

    Present an input pattern \(x=[x_1, x_2, \ldots , x_j]^T\in R^j,\) corresponding to an app whose permissions are expressed in the form of a bit string. Calculate the distance between pattern x and each neuron weight \(W_R,\) and thereby identify the winning or best matching neuron Z as follows:

    $$\begin{aligned} \Vert x-W_z\Vert =min_R\{\Vert x-W_R\Vert \} \end{aligned}$$
    (12)

    In our study, we employed Euclidean distance as the distance metric, normalized to the range [0, 1].

  3.

    Adjust the weights of the winning neuron Z and all neighboring units:

    $$\begin{aligned} w_R(t+1)=w_R(t)+h_{ZR}(t)[x(t)-W_R(t)], \end{aligned}$$
    (13)

    where R is the index of the neighbor and t is an integer, the discrete time coordinate. The neighborhood kernel \(h_{ZR}(t)\) is a function of time and of the distance between neighbor neuron R and winning neuron Z. \(h_{ZR}(t)\) defines the region of influence that the input has on the SOM and consists of two components, the neighborhood function \(h(\Vert \cdot \Vert ,t)\) and the learning rate function \(\alpha (t):\)

    $$\begin{aligned} h_{ZR}(t)=h(\Vert b_r-b_z\Vert ; t)\alpha (t) \end{aligned}$$
    (14)

    where b is the location of the neurons.

    In our study, we used a Gaussian neighborhood function; the neighborhood kernel with the Gaussian function is

    $$\begin{aligned} h_{ZR}(t)=\exp \bigl (-\frac{\Vert b_r-b_z\Vert ^2}{2\sigma ^2(t)}\bigr )\alpha (t), \end{aligned}$$
    (15)

    where \(\sigma (t)\) defines the width of the kernel.

  4.

    Repeat steps 2–3 until the convergence criterion is satisfied.
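The four steps above can be sketched as a minimal NumPy SOM; the grid size, the exponential decay schedules and the toy permission bit-strings are illustrative assumptions, not the study's configuration:

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=50, alpha0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: steps 1-4 above with a Gaussian neighborhood
    kernel and exponentially decaying learning rate / kernel width."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, X.shape[1]))            # step 1: init weights
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        alpha = alpha0 * np.exp(-t / epochs)             # learning rate alpha(t)
        sigma = sigma0 * np.exp(-t / epochs)             # kernel width sigma(t)
        for x in X:
            z = np.argmin(np.linalg.norm(W - x, axis=1))  # step 2: winner Z
            d2 = np.sum((pos - pos[z]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2)) * alpha    # Gaussian h_ZR(t)
            W += h[:, None] * (x - W)                     # step 3: update
    return W

# toy permission bit-strings (1 = permission requested)
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], float)
W = train_som(X)
print(W.shape)  # (16, 3)
```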

6.2 K-mean

K-mean adopts the method of vector quantization. K-mean clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.Footnote 31 The K-mean algorithm works as follows:

  1.

    Specify the number of clusters K.

  2.

    Initialize centroids by first shuffling the data set and then randomly selecting K data points as centroids, without replacement.

  3.

    Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters is not changing.

    • Compute the sum of the squared distance between data points and all centroids.

    • Assign each data point to the closest cluster (centroid).

    • Compute the centroid of each cluster by taking the average of all data points that belong to it.
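The steps above can be sketched as follows (an illustrative NumPy version with toy data; the study's models were built in the MATLAB environment):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-mean sketch following the three steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: init
    for _ in range(iters):
        # assign each point to its closest centroid
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # step 3: stop when stable
            break
        centroids = new
    return centroids, assign

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, assign = kmeans(X, k=2)
print(assign[0] != assign[2])  # True: the two well-separated pairs end up apart
```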

6.3 Farthest first

Farthest first is based on the principle of a bounded metric space in which the first point is selected arbitrarily and each successive point is as far as possible from the set of previously selected points.Footnote 32 The working of farthest first clustering is described below [41]:

For each \(X_i=[x_{i,1},x_{i,2},\ldots ,x_{i,m}]\) in D that is described by m categorical attributes, we use \(f(x_{i,j}|D)\) to denote the frequency count of attribute value \(x_{i,j}\) in the data set. Then, a scoring function is designed for evaluating each point, defined as:

$$\begin{aligned} Score(X_i)={\sum _{j=1}^m f(x_{i,j}|D). } \end{aligned}$$
(16)
  1.

    Farthest first traversal (D: data set, k: integer) {

  2.

    randomly select first center;

  3.

    // select centers

  4.

    for (i = 2, \(\ldots\), k) {

  5.

    for (each remaining point) { calculate distance to the current center set; }

  6.

    select the point with maximum distance as new center; }

  7.

    // assign remaining points

  8.

    for (each remaining point) {

  9.

    calculate the distance to each cluster center;

  10.

    put it in the cluster with minimum distance; } }
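The listing above can be made runnable as a small sketch; for simplicity it uses Euclidean distance on numeric toy points, whereas the scoring function in Eq. (16) is designed for categorical attributes:

```python
import math

def farthest_first(points, k):
    """Farthest-first traversal: pick the first center arbitrarily, then
    repeatedly pick the point farthest from the current center set."""
    centers = [points[0]]                  # first center (arbitrary choice)
    while len(centers) < k:
        # each point's distance to its nearest chosen center; take the max
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    # assign remaining points to the cluster with minimum distance
    clusters = {i: [] for i in range(k)}
    for p in points:
        i = min(range(k), key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    return centers, clusters

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
centers, clusters = farthest_first(pts, 2)
print(centers)  # [(0, 0), (10, 1)]
```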

6.4 Filtered cluster

A filter is a special subset of a partially ordered set; in filtered clustering, each cluster is labeled with an ID: if the cluster belongs to the first class it is labeled 1, otherwise it is labeled 0. The working of the filtered clustering algorithm is described below [37]:

  1.

    In only one scan of the data set, derive the \(\frac{l(l-1)}{2}\) contingency tables necessary for computing the previously presented adequacy indices.

  2.

    Run the genetic algorithm using the fitness function \(fit(SA, SA_*)\),Footnote 33 where:

    • a chromosome codes (corresponds to) a subset of SA;

    • each gene of the chromosome codes an attribute of SA (so there are l genes);

    • each gene of a chromosome has a binary value: the gene value is 1 (resp. 0) if its associated attribute is present (resp. absent) in the subset of SA coded by the chromosome.

  3.

    Select the best subspace found by the genetic algorithm.

6.5 Density-based cluster

Density-based clustering is based on the notion of density: it grows clusters according to the density of neighboring objects or according to some density function. The working of density-based clustering is described below [24]:

  1.

    initialize \(t_c = 0\);

  2.

    initialize an empty hash table \(grid\_list\);Footnote 34

  3.

    while data stream is active do

  4.

    read record \(x = (x_1, x_2, \ldots , x_d)\);

  5.

    determine the density grid g that contains x;

  6.

    if (g not in \(grid\_list\)) insert g into \(grid\_list\);

  7.

    update the characteristic vector of g;

  8.

    if \(t_c == gap\) then

  9.

    call \(initial\_clustering\)(\(grid\_list\));

  10.

    end if

  11.

    if \(t_c\) mod \(gap == 0\) then

  12.

    detect and remove sporadic grids from \(grid\_list\);

  13.

    call \(adjust\_clustering\)(\(grid\_list\));

  14.

    end if

  15.

    \(t_c = t_c + 1\);

  16.

    end while

  17.

    end procedure

7 Comparison of proposed model with different existing techniques

To validate whether our proposed framework is able to achieve a higher detection rate, we compare the results of our proposed model with three different techniques, mentioned below:

  (a)

    Comparison of results with previously used classifiers To validate whether our proposed model detects malware as well as previously used classifiers, we calculate two performance parameters, Accuracy and F-measure, for the newly proposed model and the existing models. In addition, we also compare our proposed framework with the existing frameworks present in the literature.

  (b)

    Comparison of results with different anti-virus scanners To compare the performance of our model for malware detection, we chose ten distinct available anti-virus scanners and compared their detection rates with the detection rate of the proposed model.

  (c)

    Detection of known and unknown malware families Further, to evaluate how reliably our proposed model detects known and unknown malware families, we test both with our proposed model and calculate the detection accuracy.

8 Evaluation of performance parameters

In this section of the paper, we discuss the definitions of the performance parameters utilized while evaluating our proposed model for malware detection. All of these parameters are calculated from the confusion matrix, which consists of the actual and detected classification information produced by the detection models. Table 4 demonstrates the confusion matrix for the malware detection model. In the present work, four performance parameters, namely inter-cluster distance, intra-cluster distance, F-measure and accuracy, are utilized for measuring the performance of the malware detection approaches.

Table 4 Confusion matrix to classify whether an Android app (.apk) is benign or malware

Inter-cluster distance For each of these techniques, we first calculate the centroid and then the centroid Euclidean distance. For N d-dimensional data points \(\{\mathbf {X}_i\},\) \(i= 1,2,3,\ldots ,N,\) in a cluster, the centroid \(\mathbf {C}_0,\) as defined in,Footnote 35 is given by

$$\begin{aligned} \mathbf {C_0}=&\ \frac{\sum _{i=1}^{N}\mathbf {X_i}}{N}. \end{aligned}$$
(17)

Next, we define the centroid Euclidean distance between the centroids of two clusters. Given the centroids \(C_{01}\) and \(C_{02}\) of two clusters, the centroid Euclidean distance or inter-cluster distance between them is defined by

$$\begin{aligned} D_0= ((C_{01}-C_{02})^2)^{\frac{1}{2}}. \end{aligned}$$
(18)

Intra-cluster distance To calculate the intra-cluster distance, we find root-mean-square-total-sample standard deviation (RMSSTD). This is defined by

$$\begin{aligned} RMSSTD=\sqrt{\frac{\sum _{j=1}^{p}\bar{s_j}^2}{p}}, \end{aligned}$$
(19)

where \(\bar{s_j}\) denotes the standard deviation of the jth attribute and p is the number of features. The smaller the value, the more homogeneous the observations are with respect to the variables, and vice versa. Since the root-mean-square is scale dependent, it should only be used to compare the homogeneity of data sets whose variables are measured on similar scales.
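Eqs. (17)–(19) can be illustrated with a short sketch over two toy clusters (hypothetical data, not the study's):

```python
import numpy as np

def inter_cluster_distance(c1, c2):
    """Euclidean distance between two cluster centroids (Eqs. 17-18)."""
    a = np.mean(c1, axis=0)                 # centroid C_01
    b = np.mean(c2, axis=0)                 # centroid C_02
    return float(np.linalg.norm(a - b))

def rmsstd(X):
    """Intra-cluster distance: root-mean-square of per-attribute std (Eq. 19)."""
    s = np.std(X, axis=0, ddof=1)           # standard deviation of each attribute
    return float(np.sqrt(np.mean(s ** 2)))

c1 = np.array([[0.0, 0.0], [0.0, 2.0]])
c2 = np.array([[3.0, 0.0], [3.0, 2.0]])
print(inter_cluster_distance(c1, c2))       # 3.0
print(round(rmsstd(np.vstack([c1, c2])), 3))
```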

F-measure The F-measure is the harmonic mean of the precision and recall values used in information retrieval [16]. Precision shows how many applications are in the right cluster with respect to the cluster size, and recall shows how many applications are in the right cluster with respect to the total number of applications of that class. Let i denote the class label and j the cluster; the recall and precision for class i and cluster j are then defined as:

$$\begin{aligned} Recall(i,j)= & {} \frac{n_{i,j}}{n_i} \end{aligned}$$
(20)
$$\begin{aligned} Precision(i,j)= & {} \frac{n_{i,j}}{n_j} \end{aligned}$$
(21)

where \(n_{i,j}\) is the number of applications with class label i in cluster j, \(n_i\) is the number of applications with class label i, and \(n_j\) denotes the number of applications in cluster j. The F-measure for class i and cluster j is given as:

$$\begin{aligned} F(i,j)=\frac{2*Recall(i,j)*Precision(i,j)}{Recall(i,j)+Precision(i,j)}. \end{aligned}$$
(22)

The total F-measure of clustering process is given by:

$$\begin{aligned} F=\sum _i \frac{n_i}{n} \max _j F(i,j) \end{aligned}$$
(23)

where n is the total number of applications.

Accuracy Let \(n_{i,j}\) be the number of applications with class label i in cluster j, \(n_{j,i}\) the number of applications with class label j in cluster i, \(n_i\) the number of applications with class label i, and \(n_j\) the number of applications in cluster j. Then the accuracy becomes:

$$\begin{aligned} Accuracy=\frac{n_{i,j}+n_{j,i}}{n_{i,j}+n_{j,i}+n_i+n_j}. \end{aligned}$$
(24)
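As an illustrative sketch, the overall F-measure of Eq. (23) can be computed from (class label, cluster) pairs; the helper function and toy data are hypothetical, with recall taken per class size and precision per cluster size:

```python
from collections import Counter

def total_f_measure(assignments):
    """Overall F from (class_label, cluster_id) pairs: for each class take
    the best F(i, j) over clusters, weighted by the class proportion."""
    n = len(assignments)
    n_ij = Counter(assignments)                 # apps of class i in cluster j
    n_i = Counter(i for i, _ in assignments)    # class sizes
    n_j = Counter(j for _, j in assignments)    # cluster sizes
    total = 0.0
    for i in n_i:
        best = 0.0
        for j in n_j:
            rec = n_ij[(i, j)] / n_i[i]         # recall(i, j)
            prec = n_ij[(i, j)] / n_j[j]        # precision(i, j)
            if rec + prec > 0:
                best = max(best, 2 * rec * prec / (rec + prec))
        total += n_i[i] / n * best
    return total

# perfect clustering: each class sits in its own cluster -> F = 1.0
pairs = [("malware", 0), ("malware", 0), ("benign", 1), ("benign", 1)]
print(total_f_measure(pairs))  # 1.0
```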

9 Experimental setup

In the present section, we introduce the experimental setup used to evaluate the performance of our developed malware detection models. Five distinct unsupervised machine learning algorithms are implemented on thirty different categories of Android apps mentioned in Table 2. All these data sets have varying numbers of benign and malware apps, which are adequate for our analysis. Figure 7 demonstrates the framework of our proposed model, named SemiDroid. In the very first step, feature ranking and feature subset selection approaches are applied to the extracted feature data set. In the next step, we use the min-max normalization approach to normalize the data. This approach is based on the principle of linear transformation, which brings each data point \(D_{q_i}\) of feature Q to a normalized value that lies between 0 and 1. The following equation is used to find the normalized value of \(D_{q_i}:\)

$$\begin{aligned} Normalized(D_{q_i})=\frac{D_{q_i}-min(Q)}{max(Q)-min(Q)}, \end{aligned}$$

where min(Q) and max(Q) are the minimum and maximum values of attribute Q, respectively. In the third step, the significant features are used to train distinct machine learning techniques. In the next step, we construct a confusion matrix and calculate the performance parameters, i.e., Accuracy and F-measure. Next, we compare the performance of the developed malware detection models and select the best one. At last, we compare the performance of our proposed malware detection model with the existing techniques available in the literature and with distinct anti-virus scanners. If the performance of our proposed model is better than the existing techniques then it is useful; conversely, if the performance is not enhanced then the proposed model is not useful.
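The min-max step can be sketched as (a toy illustration):

```python
def min_max_normalize(values):
    """Linear transformation of each data point of a feature to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```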

Fig. 7
figure 7

Proposed framework i.e., SemiDroid

The subsequent steps are pursued while selecting a subset of features to build the malware detection model that decides whether an app is benign or malware. Feature selection approaches are implemented on 30 different data sets of Android apps. Hence, a total of 1650 ((1 selection using all extracted features + 10 feature selection approaches) × 30 data sets (subsets of different feature sets particular to data sets determined after conducting feature selection) × 5 detection methods) distinct detection models have been built in this work. Below we provide step-by-step details of our approach:

  1.

    In the present work, four feature subset selection approaches and six feature ranking approaches are implemented on 30 different feature sets to select the right set of features for malware detection.

  2.

    The subsets of features obtained from the aforementioned procedure are given as input to the machine learning classifiers. To compare the developed models, we use 20-fold cross-validation. Cross-validation is a statistical learning approach that is utilized to evaluate and compare models by dividing the data into two portions: one portion is utilized to train the model and the remaining portion is utilized to verify it [40]. The data is initially separated into K equally sized segments; K−1 folds are utilized to train the model and the remaining fold is utilized for testing. K-fold cross-validation thus allows the data set to be used for both testing and training. For this study, 20-fold cross-validation is utilized to analyze the models, i.e., the data sets are segregated into 20 portions. The outcomes of all built malware detection models are compared with each other by employing two distinct performance measures: F-measure and accuracy.

  3.

    SemiDroid, i.e., the proposed model built by utilizing the above two steps, is validated against the existing techniques developed in the literature to review whether the built malware detection model is useful or not.

10 Results of performed experiment

In the current section of the paper, the relationship between the different feature sets and malware detection at the class level is presented. Sets of features are used as input, with a given ratio of benign and malware apps within an experiment. Intra-cluster distance, inter-cluster distance, F-measure and accuracy are used as performance assessment parameters to compare the malware detection models built by using the five different unsupervised machine learning algorithms. To depict the experimental results we utilize the abbreviations given in Table 5 corresponding to the actual names.

Table 5 Used naming convention in this study

10.1 Feature ranking approaches

Six feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, information gain feature evaluation, logistic regression analysis, OneR feature evaluation and principal component analysis) are implemented on the distinct feature sets. Each approach utilizes distinct performance parameters to rank the features. For the first four feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, OneR feature evaluation and information gain), the top \(\lceil \log _2 a \rceil\) features are selected as the subset, where a is the number of features in the original data set (for this work \(a=20\)). However, in the case of ULR, those features are selected which possess a positive value of the regression coefficient and a p value below 0.05, and in the case of PCA, only those features are selected which have an eigenvalue greater than 1. The features selected using the feature ranking approaches are demonstrated in Fig. 8.

10.2 Feature subset selection approaches

In the present work, four distinct feature subset selection approaches are implemented on the thirty data sets of Android apps, one after another. Feature subset selection approaches work on the hypothesis that selecting the best features from the available ones yields models with better accuracy and fewer misclassification errors. The search strategies are heuristic search [17] (for correlation based feature selection and rough set analysis) and best first search [14] (for consistency subset evaluation and filtered subset evaluation). Later, the resulting subsets of features are used as input for building a model to decide whether an app is benign or malware. The sets of features selected by the feature subset selection approaches are demonstrated in Fig. 9.

Fig. 8
figure 8

Feature ranking approaches

Fig. 9
figure 9

Feature subset selection approaches

10.3 Machine learning techniques

Eleven subsets of features (1 considering all extracted features + 10 resulting from the implemented feature selection approaches) are used as input to build the malware detection models. The hardware utilized to carry out this study is a Core i7 processor with a 1TB hard disk and 8GB RAM. The detection models are built using the MATLAB environment. Figure 10 demonstrates the implemented unsupervised machine learning algorithms on our collected data set; red crosses represent the normal permissions and blue crosses represent the dangerous permissions. From Fig. 10, we observe that the clusters formed by using SOM, density-based clustering, K-mean and filtered clustering have overlapping normal and dangerous permissions. Only farthest first clustering formed clusters without overlapping of the permissions. Further, the performance of each detection model is measured by using four performance parameters: intra-cluster distance, inter-cluster distance, F-measure and accuracy.

Fig. 10
figure 10

Unsupervised machine learning algorithms

Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 present the performance values obtained for the distinct data sets by utilizing SOM, K-mean, farthest first clustering, filtered clustering and density based clustering. On the basis of these tables, it may be concluded that:

Table 6 Intra and inter cluster distance using SOM
Table 7 Intra and inter cluster distance using K-mean
Table 8 Intra and inter cluster distance using filtered clustering
Table 9 Intra and inter cluster distance using density based cluster
Table 10 Intra and inter cluster distance using farthest first cluster
Table 11 Accuracy and F-measure using SOM
Table 12 Accuracy and F-measure using K-mean
Table 13 Accuracy and F-measure using filtered clustering
Table 14 Accuracy and F-measure using density-based clustering
Table 15 Accuracy and F-measure using farthest first clustering
  • Bold values indicate the highest detection rate when compared to other values in a specific row.

  • Values of inter-cluster and intra-cluster distance are calculated by using Eqs. 18 and 19.

  • F-measure and accuracy are measured by using Eqs. 22 and 24.

  • Models developed by considering the features selected by feature selection approaches as input are able to detect malware more effectively than the model developed by using all extracted feature sets.

  • The model constructed by considering FS4 as input achieved a higher detection rate when compared to the models developed by using the other feature selection approaches.

  • The model built with farthest first clustering and the FS4 feature set achieved the highest detection rate among all developed models.
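Equations 18, 19, 22 and 24 are defined earlier in the paper; the sketch below assumes the common definitions (mean point-to-centroid distance for intra-cluster distance, mean centroid-to-centroid distance for inter-cluster distance, and accuracy/F-measure obtained after mapping each cluster to its majority class), which may differ in detail from the paper's exact equations. The toy data is hypothetical.

```python
import numpy as np

def intra_inter(X, labels):
    """Intra: mean distance from each point to its cluster centroid.
    Inter: mean pairwise distance between cluster centroids."""
    cents = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
    intra = np.mean([np.linalg.norm(x - cents[c]) for x, c in zip(X, labels)])
    cs = list(cents.values())
    inter = np.mean([np.linalg.norm(cs[i] - cs[j])
                     for i in range(len(cs)) for j in range(i + 1, len(cs))])
    return intra, inter

def cluster_accuracy_f1(labels, y_true):
    """Map each cluster to its majority class, then score it as a
    classifier (malware taken as the positive class, encoded 1)."""
    y_pred = np.empty_like(y_true)
    for c in np.unique(labels):
        mask = labels == c
        y_pred[mask] = np.bincount(y_true[mask]).argmax()
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = float(np.mean(y_pred == y_true))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1

# Two tight, well-separated toy clusters with matching true labels.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
intra, inter = intra_inter(X, labels)
acc, f1 = cluster_accuracy_f1(labels, y)
```

A low intra-cluster distance with a high inter-cluster distance indicates compact, well-separated clusters, which is the behavior the tables report for farthest first clustering.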

Fig. 11
figure 11

Box-plot diagram of accuracy

Fig. 12
figure 12

Box-plot diagram of F-measure

In this research paper, five distinct unsupervised machine learning algorithms and ten distinct feature selection approaches are used to select the features that help identify Android malware more effectively. To determine which developed model is most capable of detecting malware, we construct box-plot diagrams for each model. Box-plot diagrams help identify the model best suited for malware detection on the basis of a small number of outliers and a better median value. Figures 11 and 12 demonstrate the box-plot diagrams of accuracy and F-measure for every developed model; the x-axis presents the feature selection techniques. Each figure includes eleven box-plots: one for the full set of extracted features, four for the feature subset selection approaches and six for the feature ranking approaches. On the basis of these box-plot diagrams, we make the following observations:

  • The models constructed with the five distinct unsupervised machine learning algorithms and FS4 achieved higher median values with few outliers. On the basis of the box-plot diagrams in Figs. 11 and 12, the model developed with FS4 as the feature selection approach gives a better detection rate than the other developed approaches.

  • From the box-plot diagrams, we observe that the model built with the farthest first clustering algorithm and FS4 has few outliers and a higher median value. This means that this model achieves better results in distinguishing malware from benign apps than the others.
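The box-plot comparison above rests on two statistics: the median and the number of outliers under Tukey's rule. A short sketch of those statistics, with hypothetical accuracy values standing in for the real per-data-set results, looks like this:

```python
import numpy as np

def boxplot_stats(values):
    """Median and outlier count under Tukey's rule (the statistics a
    box-plot displays): outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = int(np.sum((v < lo) | (v > hi)))
    return med, outliers

# Hypothetical accuracy values (one per data set) for two feature sets.
fs4_acc = [0.95, 0.96, 0.97, 0.96, 0.95, 0.96]
all_acc = [0.80, 0.93, 0.91, 0.92, 0.90, 0.91]
m4, o4 = boxplot_stats(fs4_acc)
m_all, o_all = boxplot_stats(all_acc)
# The preferred model combines the higher median with fewer outliers.
```

In this toy example the FS4-style series has the higher median and no outliers, mirroring the selection criterion used for Figs. 11 and 12.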

Fig. 13
figure 13

t test analysis (p value)

Table 16 Performance of distinct feature selection approaches after calculating the mean difference
Fig. 14
figure 14

Cost-benefit value

10.4 Comparison of results

To identify which of the implemented feature selection approaches and machine learning algorithms works best, or whether all of them perform equally well, we employed the pair-wise t test in our study.

1. Feature selection approaches In this study, two sets are formed for each feature selection approach; each approach has 150 distinct data points (5 machine learning techniques \(\times\) 30 data sets). The t test is performed on pairs of feature selection approaches and the corresponding p values are compared to measure statistical significance. The outcome of the t test study is demonstrated in Fig. 13b, where a circle filled with green denotes a p value \(> 0.05\) (no significant difference) and a circle filled with red denotes a p value \(\le 0.05\) (significant difference). From Fig. 13b it is clear that the majority of cells contain green circles, which means there is no significant difference among the employed feature selection approaches. Further, from the mean differences given in Table 16, we observe that the feature sets obtained with FS4 give the best outcomes when compared with the other implemented feature selection approaches.
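The pair-wise comparison above can be sketched as a paired t statistic over matched performance values. The sketch below uses only the standard library, and the two accuracy series are hypothetical stand-ins for the real per-combination results:

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples,
    e.g., accuracies of two feature selection approaches over the same
    machine-learning-technique / data-set combinations."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical accuracies of FS4 vs. another approach on the same data sets.
fs4 = [0.96, 0.95, 0.97, 0.96, 0.95]
other = [0.91, 0.92, 0.90, 0.93, 0.91]
t, df = paired_t(fs4, other)
# |t| above the two-tailed critical value (about 2.776 at p = 0.05, df = 4)
# would indicate a statistically significant difference.
```

In practice `scipy.stats.ttest_rel` returns the same statistic together with the p value directly.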

In the present work, we also compare the developed models on the basis of a cost-benefit analysis. For every feature selection approach, the cost-benefit value is computed with the following equation:

$$\begin{aligned} Cost{\text{-}}Benefit=(Based_{cost}+Benefit_{cost})/2. \end{aligned}$$
(25)

Here, \(Based_{cost}\) depends on the correlation between the selected feature set and the class error. \(Based_{cost}\) is calculated from the following equation:

$$\begin{aligned} Based_{cost}=Accuracy \ (SM)\times \rho _{SM.fault}. \end{aligned}$$
(26)

Here, \(Accuracy \ (SM)\) is the classification accuracy of the malware detection model built with the selected feature set, and \(\rho _{SM.fault}\) is the multiple correlation coefficient between the selected feature set and the error; a model with high accuracy and a high multiple correlation coefficient therefore achieves a high \(Based_{cost}.\) Let NAM be the number of features in the full set and NSM the number of features retained after applying a feature selection approach. \(Benefit_{cost}\) can be calculated from the following equation:

$$\begin{aligned} Benefit_{cost}=(NAM-NSM)/NAM \end{aligned}$$
(27)

The feature selection approach that achieves the highest cost-benefit value is the foremost feature selection approach, as proposed in [22]. Figure 14a, b demonstrates the cost-benefit values of the distinct feature selection approaches. On the basis of Fig. 14a, b, we observe that FS4 achieved a higher median cost-benefit measure than the other approaches.
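Equations 25 to 27 combine into a short computation, sketched below with the second cost term read as the benefit of discarding features; the accuracy, correlation coefficient and feature counts are hypothetical values chosen for illustration.

```python
def cost_benefit(accuracy, rho, nam, nsm):
    """Cost-benefit per Eqs. 25-27: the based cost weights classification
    accuracy by the multiple correlation coefficient rho, and the benefit
    term rewards discarding features (nam = total number of features,
    nsm = number of selected features)."""
    based = accuracy * rho           # Eq. 26
    benefit = (nam - nsm) / nam      # Eq. 27
    return (based + benefit) / 2     # Eq. 25

# Hypothetical values: a feature set keeping 5 of 25 features at 0.96 accuracy.
cb = cost_benefit(accuracy=0.96, rho=0.9, nam=25, nsm=5)
```

Under this reading, an approach scores well only when it keeps the feature count small while preserving accuracy and correlation, which is exactly how FS4 earns its high median in Fig. 14.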

Table 17 Mean difference between the performance of different unsupervised methods

2. Machine learning techniques In our study, we applied eleven different feature subsets (i.e., 1 containing all features + 10 from feature selection approaches) to thirty different Android app data sets, examining four performance parameters, i.e., intra-cluster distance, inter-cluster distance, F-measure and accuracy, each with 330 data points [(1 set of all features + 10 feature selection methods) \(\times\) 30 data sets]. Figure 13a demonstrates the outcomes of the t test analysis. On the basis of Fig. 13a, it is noticeable that there is a significant difference among these techniques, because the p value is smaller than 0.05. Further, from the mean differences given in Table 17, farthest first clustering gives the best outcome when compared to the other machine learning techniques.

3. Feature subset selection and feature ranking approaches For this study, the pair-wise t test is used to identify which class of feature selection approach works better. For both implemented classes (i.e., feature subset selection and feature ranking), sample pairs of performance values are studied; the outcomes of the t test on the averaged results are summarized in Table 18. In this research paper, five distinct machine learning algorithms are applied to thirty different Android app categories with accuracy and F-measure as performance parameters, and for each feature selection approach a total of two sets are used: feature subset selection with 360 distinct data points (4 feature subset selection approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets) and feature ranking with 540 distinct data points (6 feature ranking approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets). On the basis of Table 18, there is no significant variation between the two classes, because the p value is greater than 0.05. However, on the basis of the mean difference, feature subset selection approaches give better results than feature ranking approaches. Finally, the cost-benefit analysis demonstrated in Fig. 14 shows that feature subset selection and feature ranking have nearly the same cost-benefit values, which means that models built with features selected by either class have nearly the same averaged cost and benefit.

Fig. 15
figure 15

Diagram of box-plot showing performance of different classifiers

Table 18 t test analysis among feature subset selection approaches and feature ranking approaches

10.5 Evaluation of proposed framework i.e., SemiDroid

10.5.1 Comparison of results with previously used classifiers

Fig. 16
figure 16

t test analysis (p value)

In addition to finding the best approach for building an accurate malware detection model, this study also compares it with the supervised machine learning approaches most often used in the literature: SVM with three distinct kernels (linear, polynomial and RBF), decision tree analysis, logistic regression, neural network and the Naïve Bayes classifier. Figure 15 demonstrates the box-plot diagrams of F-measure and accuracy for these commonly used classifiers and for the five machine learning algorithms implemented in this paper. On the basis of Fig. 15, we observe that farthest first clustering has the highest median value along with a small number of outliers.
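One of the supervised baselines above, the Naïve Bayes classifier, can be sketched in a few lines for binary permission features. This is an illustrative Bernoulli Naïve Bayes with Laplace smoothing, not the exact baseline implementation used in the comparison, and the toy permission matrix is hypothetical.

```python
import numpy as np

class BernoulliNB:
    """Minimal Bernoulli Naive Bayes over binary permission features,
    with Laplace smoothing."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        # P(feature = 1 | class), Laplace-smoothed to avoid log(0)
        self.p = np.array([(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
                           for c in self.classes])
        return self

    def predict(self, X):
        # log-likelihood of each sample under each class, plus log prior
        ll = (X @ np.log(self.p).T + (1 - X) @ np.log(1 - self.p).T
              + np.log(self.prior))
        return self.classes[np.argmax(ll, axis=1)]

# Toy data: malware apps (label 1) tend to request the first permission.
X = np.array([[1., 1., 0.], [1., 0., 0.], [0., 0., 1.], [0., 1., 1.]])
y = np.array([1, 1, 0, 0])
pred = BernoulliNB().fit(X, y).predict(X)
```

Unlike the clustering techniques above, such supervised baselines need labeled training apps, which is one motivation for preferring the unsupervised SemiDroid pipeline.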

The pair-wise t test is also applied to decide which machine learning approach yields the best performance. The outcomes of the t test for the distinct machine learning approaches are demonstrated in Fig. 16. On the basis of Fig. 16, in a number of cases there is a significant difference among these machine learning techniques, because the p value is smaller than 0.05. Further, from the mean differences in Table 19, farthest first clustering achieves better results than the other supervised machine learning techniques.

Table 19 Mean difference between performance of different supervised machine learning technique

In addition, we compare our proposed malware detection model (i.e., SemiDroid) with existing frameworks and approaches developed in the literature. Table 20 shows the name, goal, methodology, deployment, data set and detection rate of the suggested approaches and frameworks.

Table 20 Comparison with previously developed frameworks/approaches

10.5.2 Comparison of results with different anti-virus scanners

Although farthest first clustering performs better than the machine learning techniques used in the literature, it must ultimately be compared with the anti-virus products commonly used in practice for Android malware detection. For this experiment, we selected 10 different anti-virus scanners available in the market and applied them to our collected data set; the performance of the proposed framework is comparatively better than most of them. Table 21 shows the results of the experiment. The detection rates of the anti-virus scanners vary considerably: the best scanner detected 96.2% of the Android malware, while certain scanners identified only 82% of the malicious samples, likely because they are not specialized in detecting Android malware. Using 1000 Android apps, our proposed framework, SemiDroid, gives a detection rate of 98.8% and outperforms all 10 anti-virus scanners. From this, we can say that our proposed framework is more effective at detecting malware than the distinct anti-virus scanners.

Table 21 Comparison of proposed framework i.e., SemiDroid with distinct anti-virus scanners
Table 22 Top malware families used in our data set

10.5.3 Detection of known and unknown malware families

Detection of known malware families In this section, we check whether our proposed framework is capable of detecting malware from known families. For this experiment, we select 20 samples of each family (our study covers samples from 81 different families, shown in Table 22) and train our selected model with them. Farthest first clustering detects, on average, 98.8% of malware apps. The family names and the number of samples used per family are given in Table 22, and the detection rate of our proposed framework for each family is illustrated in Fig. 17a, b.

Fig. 17
figure 17

Detection rate of proposed framework farthest first clustering

Detection of unknown malware families To check whether farthest first clustering is capable of detecting unknown malware families, we trained our proposed framework on a random selection of 10 families and tested it on the remaining 71 families in the data set. Table 23 shows the results of farthest first clustering when trained on the 10 selected families. From Table 23, we can say that if farthest first clustering is trained on a small number of known family samples, sufficient to generalize the behavior of most malware families, it gives a good detection rate.
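The unknown-family protocol above amounts to holding out entire families rather than individual samples. A minimal sketch of such a split, with hypothetical family and app names, could look like this:

```python
import random

def family_holdout_split(samples_by_family, n_train_families, seed=0):
    """Split malware families into disjoint train/test sets so the model
    is evaluated only on families it never saw during training, e.g.,
    train on a random 10 of 81 families and test on the other 71."""
    fams = sorted(samples_by_family)
    rng = random.Random(seed)
    train_fams = set(rng.sample(fams, n_train_families))
    train = [s for f in fams if f in train_fams
             for s in samples_by_family[f]]
    test = [s for f in fams if f not in train_fams
            for s in samples_by_family[f]]
    return train, test

# Hypothetical toy data: 5 families, 2 samples each; hold 2 for training.
data = {f"fam{i}": [f"app{i}_{j}" for j in range(2)] for i in range(5)}
train, test = family_holdout_split(data, n_train_families=2)
```

Keeping the family partition disjoint is what makes the reported detection rate a genuine test of generalization to unseen malware families.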

In summary, our proposed framework detects Android malware more effectively than several anti-virus scanners that regularly update their signature definitions. In addition, it identifies Android malware efficiently even when trained with a limited number of malware families.

Table 23 Detection of SemiDroid to detect unknown malware families

10.5.4 Experimental findings

This section presents the comprehensive conclusions of our experimental work. The empirical study was conducted on thirty different categories of Android apps using five different unsupervised machine learning techniques, i.e., SOM, K-means, filtered clustering, density-based clustering and farthest first clustering. On the basis of the experimental results, this research paper is able to answer the questions raised in Sect. 2.

RQ1. In this paper, we applied five distinct machine learning algorithms to build models that detect whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, the model built with farthest first clustering, using the feature set selected by FS4 as input, gives better outcomes than the others.

RQ2. To answer RQ2, Fig. 17 and Tables 20 and 21 were analyzed. It is found that the model built with farthest first clustering is capable of detecting malware in real-world apps.

RQ3. In the present paper, four distinct feature subset selection approaches and six distinct feature ranking approaches are used to identify smaller subsets of features. Using these approaches, we selected the best possible subsets of features for building a model that identifies whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, in a number of cases a reduced subset of features builds a better detection model than the full set of extracted features.

RQ4. In the present paper, six distinct feature ranking approaches are used to discover reduced subsets of features. On the basis of the t test study, feature selection with the PCA approach gives better outcomes than the other ranking approaches.

RQ5. In this paper, four distinct feature subset selection approaches are used to find reduced subsets of features. On the basis of the t test study, feature selection with FS4 gives persuasively better outcomes than the other approaches.

RQ6. In this work, the pair-wise t test is used to identify whether feature subset selection approaches perform better than feature ranking approaches or whether both perform equally well. The t test outcomes show no significant difference between feature subset selection and feature ranking; however, the mean difference values show that feature subset selection approaches give better results than feature ranking.

RQ7. On the basis of Sect. 9, we observe that the performance of the feature selection approaches varies with the machine learning technique used. Further, the choice of machine learning algorithm for building a malware detection model that decides whether an app is malware or not depends on the feature selection approach.

11 Threats to validity

This section presents the threats to validity experienced while performing the experiments:

  1. (i)

    Construct validity The presented malware detection models only detect whether an app is benign or malware; they do not state how many permissions and API calls are required to detect malware.

  2. (ii)

    External validity Cyber-criminals develop malware on a daily basis to misuse user information. In this work, we considered 81 different malware families to train the model, and our proposed model is capable of detecting malware from known and unknown families. The research can be extended by training the model with more malware families, making it capable of detecting more malware apps in the real world.

  3. (iii)

    Internal validity This threat lies in the consistency of the data used in the study. We collected data from the sources mentioned in Sect. 4; any error in the information not mentioned in those sources was not considered in this work. Although we cannot claim that the data considered for the experiment is 100% accurate, we believe it has been collected consistently.

12 Conclusion

This work emphasizes designing a malware detection framework using a selected set of features that helps identify whether an Android app belongs to the malware class or the benign class. The experiments were performed with the assistance of thirty different categories of Android apps.

The empirical results indicate that it is feasible to determine a small subset of features. The malware detection model built with this determined set of features detects malware and benign apps with a lower misclassification error and better accuracy. Further, the results of the malware detection model are influenced by the feature selection approach.

After performing an in-depth analysis, we found that the AA, BU, LS, PE, RA and TO feature sets, obtained through the feature selection approaches, are relevant detectors for malware. Further, on the basis of the mean difference, the model built with the selected set of features as input gives a better detection rate than the model built with all extracted features. Moreover, the model built with farthest first clustering gives better outcomes than the other techniques.

Finally, on the basis of the cost-benefit analysis, the features selected with FS4 achieved the highest median cost-benefit value among the approaches, and the model built with farthest first clustering is capable of detecting 98.8% of known and unknown malware in real-world apps.

The proposed malware detection models only detect whether an app is malware or benign. The work can be extended to develop a model that predicts whether a particular feature is capable of detecting malware. Moreover, this study can be replicated on other Android app repositories using soft computing models to attain a better malware detection rate.