Abstract
With the exponential growth in Android apps, Android-based devices are becoming targets of attackers in the “silent battle” of cyberspace. Protecting Android-based devices from malware has become more complex and crucial for academicians and researchers. The main vulnerability lies in the underlying permission model of Android apps. Android apps demand a permission or permission set at the time of their installation. In this study, we consider permissions and API calls as features that help in developing a model for malware detection. To select appropriate features or feature sets from thirty different categories of Android apps, we implemented ten distinct feature selection approaches. With the help of the selected feature sets, we developed distinct models by using five different unsupervised machine learning algorithms. We conducted an experiment on 500,000 distinct Android apps belonging to thirty distinct categories. Empirical results reveal that the model built by considering rough set analysis as the feature selection approach and farthest first as the machine learning algorithm achieved the highest detection rate of 98.8% in detecting malware from real-world apps.
1 Introduction
Detection of malware on smartphones has become a major concern for the research community. At the end of 2019, the number of Android users was 3.3 billion throughout the world.Footnote 1 Android is based on the Linux kernel and provides useful services such as security configuration, process management and others. The primary reason for the growth of the Android operating system is its open nature and freely available apps. At the end of July 2019,Footnote 2 Android had 2.7 million free and paid apps in its play store. There was an increase of 13%Footnote 3 in the downloading of apps from the Google play store with respect to previous years. The Android operating system is based on the principle of privilege separation, where each app has its own distinct system identity, i.e., group-ID and Linux user-ID. Each app runs in a process sandbox and must request permission to use resources that are not present in its sandbox. Depending on the permission's sensitivity, the system grants permission automatically or prompts the user to approve or reject the request. Permissions granted by users include access to the calendar, camera, body sensors, microphone, contacts, location, SMS, and storage of the device. To defend the Google official marketFootnote 4 from malware-infected apps, Google introduced Google Bouncer in the year 2012, which scans new apps at the time of their launch. But Google Bouncer has a limitation: it can easily be fingerprinted.Footnote 5 It is not very difficult for malware apps to bypass Google's security check, enter the Google official marketFootnote 6 and ultimately reach users' devices. By taking advantage of these permissions, cyber-criminals build malware apps on a daily basis and invite users to install them. More than two billion active Android devices are present in the market.Footnote 7 To overcome the drawbacks of the Bouncer and to protect Android devices, Google introduced Google Play Protect in the market.
Google Play Protect has the capability to protect data in real time. However, according to a study,Footnote 8 G-Data security experts counted 4.18 million malware applications until the end of the year 2019 and discovered over 750,000 new malware applications during the first quarter of 2020.
Android apps work on a permission-based model [15]. The Android operating system provides protection at four levels, categorizing permissions asFootnote 9 “signature”, “signature or system”, “normal” and “dangerous”. In our study, we do not consider “signature” and “signature or system” permissions because they are granted by the system. We only consider “normal” and “dangerous” permissions, which are granted by the user. Normal permissions do not pose any risk to the user's privacy; if such a permission is listed in the manifest file, the system grants it automatically. On the other hand, dangerous permissions give access to the user's confidential data. However, it is purely up to the user to grant or revoke a permission or set of permissions.
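The normal/dangerous distinction above can be sketched as a simple lookup over an app's requested permissions. The table below is a small illustrative subset of the Android protection-level catalogue, not the full list used in the paper:

```python
# Sketch: partition an app's requested permissions into "dangerous"
# vs. "normal". DANGEROUS is a small illustrative subset only.
DANGEROUS = {
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.CAMERA",
    "android.permission.RECORD_AUDIO",
}

def split_by_risk(requested):
    """Return (dangerous, normal) permission lists for one app."""
    dangerous = sorted(p for p in requested if p in DANGEROUS)
    normal = sorted(p for p in requested if p not in DANGEROUS)
    return dangerous, normal

app_perms = [
    "android.permission.INTERNET",       # normal: granted automatically
    "android.permission.SEND_SMS",       # dangerous: user must approve
    "android.permission.READ_CONTACTS",  # dangerous: user must approve
]
d, n = split_by_risk(app_perms)
print(d)  # ['android.permission.READ_CONTACTS', 'android.permission.SEND_SMS']
print(n)  # ['android.permission.INTERNET']
```

In a real pipeline the lookup table would come from Google's permissions overview rather than being hard-coded.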
The selection of the right feature or feature sets has a great effect on the performance of malware detection [45, 46, 49, 62]. A feature selection approach is a procedure to select appropriate features from the total available features. Feature selection approaches are classified into two distinct groups: one group contains feature ranking methods and the other contains feature subset selection methods [45, 46, 49, 62]. Feature ranking orders the features on the basis of a scoring function [49, 62], whereas feature subset selection aims to discover the optimal feature subset [46]. In our study, we implemented ten distinct feature selection approaches to select the best features and retain only those feature sets which have excellent discriminatory power.
In the literature [45, 46, 49, 62], researchers and academicians have applied distinct machine learning algorithms based on classification, regression and clustering to develop Android malware detection models. The main flaw in their work is that they used labelled data sets to develop the malware detection model. To overcome this issue, in this study, we consider five distinct unsupervised machine learning algorithms [i.e., K-means, farthest first clustering, filtered clustering, density-based clustering and self-organizing map (SOM)] to develop a model for Android malware detection.
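Of the listed algorithms, farthest first is the one the paper later reports as best. A minimal sketch of the underlying farthest-first traversal (Gonzalez's heuristic) on toy feature vectors is shown below; the two synthetic blobs merely stand in for benign and malware feature vectors and are not the paper's data:

```python
import numpy as np

def farthest_first_centers(X, k, seed=0):
    """Pick k cluster centres by farthest-first traversal: the first
    centre is chosen at random, each subsequent centre is the point
    farthest from all centres chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance of every point to its nearest existing centre
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def assign(X, centers):
    """Label each point with the index of its nearest centre."""
    d = np.stack([np.linalg.norm(X - c, axis=1) for c in centers])
    return np.argmin(d, axis=0)

# Two well-separated blobs standing in for benign vs. malware vectors.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
centers = farthest_first_centers(X, k=2)
labels = assign(X, centers)
print(len(set(labels.tolist())))  # 2
```

Because clustering is unsupervised, the resulting cluster indices carry no benign/malware meaning by themselves; mapping clusters to classes is a separate step.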
The phases followed in developing the malware detection model are demonstrated in Fig. 1. In the first stage, we collect Android application package (.apk) files from different repositories and identify their classes. In the second stage, we extract permissions and API calls from the collected .apk files and consider them as features. In the third stage, we select the best features by using ten different feature selection approaches. Next, with the help of the selected features, we train five different unsupervised machine learning algorithms and build models. We compare the developed models with the help of distinct performance parameters, i.e., intra-cluster distance, inter-cluster distance, accuracy and F-measure. At the last stage, we validate our proposed model against existing techniques available in the literature.
The novel and unique contributions of this research paper are:
-
To the best of our knowledge, this is the first work in which 500,000 unique apps belonging to 30 different categories of Android apps are collected. The extracted features are publicly available for researchers and academicians.Footnote 10 To build an effective and efficient malware detection model, we extract permissions, the rating of an app, the number of users who downloaded the app, and API calls as features, and achieved a detection rate of 98.8% when compared to distinct anti-virus scanners.
-
We propose a new approach which works on the principle of unsupervised machine learning by selecting relevant features using feature selection. Our empirical results reveal that the suggested model is able to detect 98.4% of unknown malware from real-world apps.
-
The proposed framework is able to detect malware from Android apps by using a 100% unlabelled data set.
-
In this study, we applied t-test analysis to investigate whether the features selected by the feature selection approaches differ significantly.
-
The proposed malware detection approach is able to detect malware in less time when compared to distinct anti-virus scanners available in the market.
The rest of the paper is organized as follows. In Sect. 2, we discuss the work related to Android malware detection. In Sect. 3, we discuss the Android permission model. In Sect. 4, we present the formulation of the data set. Section 5 presents the feature selection approaches implemented in this study. In Sect. 6, we discuss the different machine learning algorithms. In Sect. 7, we present the different techniques used in the literature to detect malware from real-world apps. Section 8 presents the performance parameters, and the experimental setup is presented in Sect. 9. Section 10 contains the experimental results, i.e., which model is best at detecting malware from real-world apps. Finally, in Sect. 11 we discuss the threats to validity, and the conclusion of this empirical study is presented in Sect. 12.
2 Related work
Enck et al. [28] proposed the Kirin framework, which helps in detecting malware apps based on the permissions they request at installation time. Kirin is based on a set of rules which help to mitigate the effect of malware in Android apps. Suarez-Tangil et al. [65] examined whether cloud-based detection or on-device detection is more power saving. They suggested a power model to compare both methods with the help of machine learning algorithms. Empirical results reveal that cloud-based detection is more effective and a better choice for detecting malware. Cui et al. [26] proposed a malware detection model based on cloud computing using network packets. They used the principles of data mining to reduce the branches of packets by determining whether the knowledge gathered from a packet is useful for malware detection or not. They proposed SMMDS, which works on the principles of machine learning algorithms to detect malware. Chen et al. [23] proposed a solution which monitors the behavior of smartphones when they send users' private information to an external source. But the solution provided in their study is not effective, because it cannot support real-time detection. Narudin et al. [53] proposed STREAM, which automatically installs and runs Android apps and extracts features from them. The extracted features are then used to train machine learning classifiers to detect malware from Android apps. STREAM has a disadvantage: it takes a lot of system resources and time to load the data. Wei et al. [73] built a malware detection model based on the anomalous behavior of Android apps. They developed a model considering network information as a feature, using the Naïve Bayes and Logistic machine learning algorithms, and achieved a higher accuracy rate. Ali et al. [11] suggested a malware detection model based on a Gaussian mixture.
They collected features based on hardware utilization, such as CPU, memory, battery and so on, and trained the model with the help of a Gaussian mixture. But the model proposed by them has a limitation: it needs a remote server for computation. Dixon et al. [27] developed a model by using the behavior of the battery life of a smartphone when infected by malware. But the model proposed by them is not able to detect some sophisticated malware.
Tong and Yan [67] proposed a hybrid approach to detect malware on Android by using individual system calls and sequential system calls related to accessing files and networks. Their approach is able to detect the behavior of an unknown app and achieved a detection rate of 91.76%. But the presented approach has a limitation: it cannot support real-time detection. Quan et al. [59] used three different feature sets, i.e., native code, system calls and API calls, to detect malware on Android. The detection rate depends upon a predefined threshold value. Ng et al. [54] developed a model by using the Dendritic Cell Algorithm and considered system calls as features. They selected the best features by implementing statistical methods and achieved a higher detection rate. Sheen et al. [63] proposed an Android malware detection system considering API calls and permissions as features. They chose features by using the Relief algorithm to train three different classifiers: J48, SVM and Naïve Bayes. The detection rate is good, but the system consumes a number of resources and its computational burden is too high. Fung et al. [31] proposed a decision model, RevMatch, which works on the principle of malware detection history to decide whether an Android app is infected with malware or not. This approach does not provide real-time detection. Babaagba and Adesanya [12] compared the performance of supervised and unsupervised machine learning algorithms, with and without feature selection approaches. The empirical study was performed on 149 Android apps and the results reveal that the model developed with a feature selection approach and a supervised machine learning algorithm achieved a higher detection rate than the model developed using an unsupervised machine learning algorithm. Yewale and Singh [78] proposed a malware detection model based on opcode frequency. Experiments were performed on 100 distinct files and achieved a 96.67% detection rate by using SVM as the machine learning algorithm.
Enck et al. [29] proposed TaintDroid, which works on the principle of tracking information flow in the network. TaintDroid tracks the malicious behavior of apps communicating through files, program variables and inter-process messages. The process is too time consuming to label an app as benign or malware. Abawajy and Kelarev [2] proposed ICFS, which detects malware on Android by incorporating feature selection approaches and machine learning classifiers. Guo et al. [32] modelled smartphone network behavior using Naïve Bayes as the machine learning algorithm. They built patterns from benign and malware apps to discover malware among unlabeled apps. In a recent study [68], the authors presented the mechanism by which an app breaches user privacy to gain users' private data. They proposed a general and novel defence solution to protect resources and data on Android-based devices. Rahman and Saha [61] proposed StackDroid, a multi-level architecture used to minimize the error rate. They detected malware at two different levels: in the first level they consider the multi-layer perceptron, stochastic gradient descent, random forest and extremely randomized trees, and in the second level they consider extreme gradient boosting as the machine learning classifier to detect malware on Android. Barrera et al. [13] proposed a methodology which works on the principles of the permission model by implementing a self-organizing map (SOM) on a collected data set of 1,100 Android apps. They analyzed the Android permission model, which is used to investigate malware apps on Android. The SOM implemented by them gives a 2-dimensional visualization of high-dimensional data.
Alazab et al. [4] proposed an effective classification model that combines permission requests and API calls. The API calls were divided into three different groups, i.e., an ambiguous group, a risky group, and a disruptive group. Experiments were performed on 27,891 malware-infected Android apps and the proposed model achieved an F-measure of 94.3%. Xiao et al. [75] proposed a novel detection approach based on the principle of deep learning. In their study, the authors treat the semantic information in system call sequences as natural language and construct a classifier based on the long short-term memory (LSTM) language model. Empirical results reveal that the proposed approach achieved a detection rate of 96.6%. Yuxin and Siyi [79] proposed a malware detection model based on the principle of the Deep Belief Network (DBN). In their study, they compare the performance of the proposed model with three baseline malware detection models, which use support vector machines, decision trees, and the k-nearest neighbor algorithm as classifiers. Experimental results indicate that the autoencoder can effectively model the underlying structure of input data and significantly reduce the dimensions of feature vectors.
Vinayakumar et al. [69] proposed an effective zero-day malware detection model based on the principle of image processing techniques with optimal parameters for machine learning algorithms (MLAs) and deep learning architectures. Experiments were performed on two distinct data sets, and they achieved detection rates of 96.2% and 98.8% with the proposed detection model. Arora et al. [9] proposed a framework named PermPair, which constructs and compares graphs for malware and normal samples by extracting the permission pairs from the manifest file of an app. Empirical results reveal that the proposed scheme is successful in detecting malicious samples with an accuracy of 95.44% when compared to mobile anti-malware apps. Lee et al. [42] proposed a malware detection model that learns the generalized correlation between obfuscated string patterns from an application's package name and the certificate owner name. Experimental results reveal that the proposed model is robust to obfuscation and sufficiently lightweight for Android devices.
Alzaylaee et al. [6] proposed DL-Droid, based on the principle of a deep learning model. Experiments were performed on 30,000 distinct Android apps. Empirical results reveal that DL-Droid can achieve up to a 97.8% detection rate with dynamic features only and a 99.6% detection rate with dynamic + static features. Ma et al. [44] proposed a malware detection approach based on the control flow graph of the app to obtain API information. On the basis of the API information, they constructed three different data sets based on Boolean, frequency, and time-series representations. Using these three data sets, three different malware detection models were constructed. Experiments were conducted on 10,010 benign applications and 10,683 malicious applications. The results show that the detection model achieves 98.98% detection precision. Jerbi et al. [36] proposed artificial malware-based detection (AMD), which is based on extracted malware patterns that were generated artificially. Experiments were performed on balanced and imbalanced data sets and achieved an accuracy of 99.69% for the balanced data set and 99.64% for the imbalanced data set. Table 1 lists the frameworks or authors who proposed the approaches developed in the literature, with the detection type, features used, implemented algorithm, place of analysis, and major observations of each study.
The previous research work mentioned above has the following limitations: use of a limited data set, a high detection rate only on a limited data set, high computational burden, implementation of a limited number of feature selection approaches, implementation of a limited number of classification algorithms using a 100% labelled data set, and inability to detect sophisticated malware. To overcome the first limitation, in this study we collect 500,000 Android apps belonging to thirty different categories from the different repositories mentioned in Sect. 4. Further, to overcome the other limitations, we implement ten distinct feature selection approaches on the extracted feature data set (i.e., permissions and API calls are considered as features in this study). Next, the selected features are used as input to develop models using unsupervised machine learning algorithms (meaning no labelled data is required to develop the models), so that a suitable model is built to identify malware from real-world apps.
2.1 Research questions
To develop a malware detection model for Android with a better detection rate and to cover the gaps present in the literature, we consider the following research questions in this research paper:
RQ1. Which malware detection model is most appropriate to detect malware from real-world apps?
This question helps in finding the most appropriate model for malware detection in Android. In this work, we build 50 distinct models by considering ten distinct feature selection approaches and five different machine learning techniques. Further, to identify which model is more appropriate for malware detection, we consider four distinct performance parameters in our study.
RQ2. Is the presented malware detection framework effective in detecting malware from Android devices?
The goal of this question is to investigate the performance of our malware detection approach. For this, we compare the performance of our developed model with some existing techniques available in the literature.
RQ3. Does a subset of features perform better than all the extracted features for the task of detecting whether an app is malicious or not?
The aim of this question is to evaluate the features and investigate their relationship with benign and malware apps. Distinct kinds of feature reduction approaches are considered for finding subsets of features which are able to detect whether an app is malicious or not.
RQ4. Among the different implemented feature ranking approaches, which approach works best for the task of detecting whether an Android app belongs to the benign class or the malware class?
In the feature ranking approach, the efficiency of the machine learning algorithms is affected by the characteristics and nature of the malware data set. Distinct approaches are implemented with various criteria to rank the collected feature sets. Four distinct performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure and accuracy, are considered in this study to compare the distinct feature ranking approaches.
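The two cluster-quality criteria can be computed as sketched below. The paper does not spell out its exact definitions, so these are the common ones: mean distance of points to their own cluster centroid (intra-cluster, lower is tighter) and minimum distance between centroids (inter-cluster, higher means better separation):

```python
import numpy as np

def intra_cluster_distance(X, labels):
    """Mean distance of each point to its own cluster centroid."""
    total, n = 0.0, 0
    for k in np.unique(labels):
        pts = X[labels == k]
        centroid = pts.mean(axis=0)
        total += np.linalg.norm(pts - centroid, axis=1).sum()
        n += len(pts)
    return total / n

def inter_cluster_distance(X, labels):
    """Minimum pairwise distance between cluster centroids."""
    cents = [X[labels == k].mean(axis=0) for k in np.unique(labels)]
    return min(np.linalg.norm(a - b)
               for i, a in enumerate(cents)
               for b in cents[i + 1:])

# Toy example: two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
print(intra_cluster_distance(X, labels))  # 0.5
print(round(inter_cluster_distance(X, labels), 2))  # 14.14
```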
RQ5. Among the applied feature subset selection approaches, which approach performs best for the task of detecting malware from Android apps?
To determine the subsets of features which are appropriate for detecting whether an Android app is benign or malware, we consider feature subset selection approaches. In this work, we compare the distinct approaches by using four performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure and accuracy.
RQ6. How do the feature subset selection approaches compare with the feature ranking approaches?
In this paper, a pair-wise t-test is used to determine whether the feature subset selection approaches are more appropriate than the feature ranking approaches or whether both of them behave equally well.
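A pair-wise (dependent) t-test compares two families of approaches evaluated on the same set of models. A minimal self-contained sketch follows; the accuracy values are made-up illustrative numbers, not the paper's results:

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired t-test on two equal-length samples
    of per-model scores: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-model accuracies for the two families of approaches.
ranking_acc = [0.95, 0.96, 0.94, 0.97, 0.95]
subset_acc  = [0.97, 0.98, 0.96, 0.98, 0.97]

t = paired_t_statistic(ranking_acc, subset_acc)
# Two-tailed critical value for df = 4 at the 5% level is about 2.776,
# so |t| > 2.776 rejects the hypothesis of equal mean accuracy.
print(abs(t) > 2.776)  # True
```

In practice `scipy.stats.ttest_rel` performs the same test and also returns the p-value directly.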
RQ7. Do the feature selection approaches affect the outcome of the unsupervised machine learning approaches?
It is seen that a number of feature selection approaches perform extremely well with specific unsupervised machine learning methods. Therefore, in this research work, distinct feature selection approaches are evaluated using distinct unsupervised machine learning approaches to measure their performance. Further, this question also emphasizes the variation in performance of the unsupervised machine learning approaches over the distinct feature selection approaches.
3 Android permission model
Android security depends upon a permission-based model that governs access to functionality or features which could impact the user's privacy. Android apps are written in the Java language and run in a Dalvik virtual machine. Android apps demand permissions from the user at installation and run time. Permissions such as sending messages, making phone calls, access to the vibrator or keeping the device screen on are requested by an app. The user can grant or reject a permission request made by an app; this facility of granting or revoking permissions is available from Android version 6.0 onwards. Currently in Android, there are more than 110 different features which are accessed by using these permissions. Android is designed in such a way that if any third-party app demands new functionality, the developer has the privilege to define a new permission as a developer-defined or self-defined permission. By taking advantage of this, cyber-criminals define new permissions on a regular basis so that they can access users' private data for their own benefit.
For an in-depth analysis of the permission model, we built an app named “ToGoHome”.Footnote 11 Figure 2 represents the class diagram of the Android app, showing its classes, attributes and methods. The package contains 10 distinct classes which contain the logic of cab reservation. In the very first step, when the user clicks on the “HomePage” menu and searches for a cab, it directs toward the class named “Pickup”, whose data is collected from the current location of the user by using the Global Positioning System (GPS), and calls the class named “myLocationListener”. The “Pickup” class contains seven different methods, i.e., updateDisplayDate:void, updateDisplayTime:void, GetDistance:void, checkAddress:bool, onConfigurationChanged:void, onActivityResult:void and onCreate:void. If the user confirms his pick-up location through GPS, then his booking is confirmed by calling “BookingDetails”. While travelling, the user tracks his live location by calling the “Route” class.Footnote 12 After completing the journey, our app calls the “Pay” class, which directs towards the payment website. Throughout this process, the following Android permissions are used: ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION, ACCESS_NETWORK_STATE, INTERNET, SEND_SMS, and RECEIVE_SMS.
4 Formulation of data set
Figure 3 demonstrates the phases followed in extracting features from Android apps. In the very first phase, we identify the URLs from which Android apps are to be collected (mentioned in Sect. 4.1). In the second phase, we take the help of an app crawler to download the apps from the identified URLs. Our developed app crawler can download as many apps as possible and does not have any impact on the Android app repository. To perform dynamic analysis of the collected Android apps, we use Android Studio as an emulator (mentioned in Sect. 4.2). Further, we write a program in the Java language, extract permissions and API calls, and save them into a .csv file for developing the Android malware detection model. The extracted features are publicly available for researchers and academicians.Footnote 13
4.1 Collection of .apk files
Previous studies mentioned in Table 1 used only limited data sets of Android apps to examine their associations with the malware or benign class, and, in addition, in the literature [72, 80] academicians and researchers did not mention the categories to which the apps belong. Therefore, it is not possible to draw generic conclusions relevant to all Android apps and systems. To overcome this gap, we collect apps of thirty different categories, which are used to generalize and strengthen our outcomes. We collected the Android apps to build our data set from promised repositories. We collected 550,000 .apk files from Google's play store,Footnote 14 hiapk,Footnote 15 appchina,Footnote 16 Android,Footnote 17 mumayi,Footnote 18 gfan,Footnote 19 slideme,Footnote 20 and pandaapp.Footnote 21 Among these 550,000 benign .apk files, 500,000 are distinct. Further, the features are extracted after deleting virus-infected apps reported by VirusTotalFootnote 22 and Microsoft Windows Defender.Footnote 23 VirusTotal identifies malware-affected apps by using 70 different antivirus engines simultaneously. A total of 55,000 malware samples were collected from three different promised repositories. 1929 botnet samples were collected from [38], which consist of 14 distinct botnet families. The Android Malware Genome project [80] contains a data set of 1200 malware samples that covers the currently present Android malware families. We collected about 56,871 samples from AndroMalShareFootnote 24 along with their package names. After removing duplicate packages from the collected data set, we have 50,000 unique malware samples left in our study. Both benign and malware apps were collected from the above-mentioned sources at the end of December 2018. Table 2 shows the number of .apk files belonging to different categories, i.e., business, comics, communication, education and so on.
To better differentiate between benign and malware apps, we consider .apk files belonging to the normal, trojan, backdoor, worm, botnet and spyware familiesFootnote 25 mentioned in Table 2.
4.2 Extraction of features
After collecting unique samples of .apk files from the various sources mentioned in the previous subsection, we extract permissions and API calls from each .apk file. The extraction of permissions and API calls is performed with the help of an emulator (in our study we use Android Studio). The emulator provides the same API level and execution environment as a smartphone. In our study, to extract permissions and API calls from Android apps, we use Android system version 6.0 Marshmallow (i.e., API level 23) and form our data set for the experiments. Previously developed frameworks or approaches used earlier versions of Android to extract features. There are two reasons for selecting this Android version: first, it asks the user to revoke or grant permission to use the resources of the smartphone, and second, it covers 28.1% of Android devices, which is higher than the other versions present in the market.Footnote 26 To extract features from the collected .apk files, we execute each of them in an emulator and extract permissions from “AndroidManifest.xml” by using self-written Java code. These permissions are demanded by apps during installation and at run time. By repeating the same process, we extract permissions from 500,000 different Android apps and record them in the .csv file format. This extracted data set listing the names of the permissions is publicly available for researchers.Footnote 27,Footnote 28 Previous researchers used limited sets of features to develop models for malware detection. To overcome this gap, in this study we collect 1532 permissions and 310 API calls, which help in building an effective and efficient Android malware detection model. Hence, each collected app can be represented as an 1842-dimensional Boolean vector, where “1” implies that the app requires the permission and “0” implies that the permission is not required.
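The Boolean encoding described above can be sketched as follows. The vocabulary here is a tiny stand-in for the paper's 1842 permissions and API calls:

```python
# Sketch: encode one app's extracted permissions/API calls as a Boolean
# vector over a global feature vocabulary. FEATURES is illustrative only;
# the paper's vocabulary has 1532 permissions + 310 API calls = 1842.
FEATURES = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.telephony.SmsManager.sendTextMessage",
]

def to_boolean_vector(extracted, vocabulary=FEATURES):
    """Return 1 for each vocabulary feature the app requests/uses."""
    present = set(extracted)
    return [1 if f in present else 0 for f in vocabulary]

row = to_boolean_vector([
    "android.permission.INTERNET",
    "android.telephony.SmsManager.sendTextMessage",
])
print(row)  # [1, 0, 0, 1]
```

Stacking one such row per app yields the .csv feature matrix that the feature selection approaches operate on.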
It is very common for distinct apps to request a similar set of permissions for their execution. The permissions overview given by GoogleFootnote 29 is used to describe the behavior of a permission, i.e., “dangerous” or “normal”.
4.3 Formulation of feature sets
Several approaches have been developed for Android malware detection [1, 13, 45, 46, 49, 50, 74]. In this study, we divide the extracted API calls and permissions into thirty different feature sets, which help in developing the malware detection model. Table 3 displays the basic descriptions of the feature sets considered in our work.
5 Feature selection approaches
On the basis of previous studies, it is seen that a number of authors applied different feature ranking approaches to detect malware from Android apps and achieved good detection rates. This indicates that the outcome of a malware detection model relies on the features that are taken as input to design the model. Selecting suitable feature sets is an essential data preprocessing task in machine learning. In the field of malware detection, some researchers have used selection approaches to choose appropriate sets of features. In this paper, we implemented ten distinct types of feature selection approaches on a large collection of 1842 features (divided into thirty distinct feature sets) to identify the best subset of features, which assists us in detecting malware with a better detection rate and also minimizes the number of misclassification errors. Feature ranking approaches and feature subset selection approaches can be defined in the following manner [45, 46, 49, 62]:
-
Feature ranking approaches These approaches use certain conclusive elements to rank the features. On the basis of their ranks, appropriate features can then be selected to build the model [49, 62].
-
Feature subset selection approaches These approaches aim to search for a subset of features which has good detective capability [4, 46].
5.1 Feature ranking approaches
These approaches rank features individually without applying any training algorithm; the ranking of a feature depends on its score. On the basis of our investigation of previous studies, the majority of such approaches compute a grade for every feature. In this research, we employ six different ranking approaches, explained below:
5.1.1 Gain-ratio feature selection
In this selection approach, features are ranked by their gain-ratio with respect to the class [49, 55]. The gain-ratio of a feature Z is determined as:
where Gain \((Z)=I(X)-E(Z)\) and X depicts the set including m instances with n different classes. The expected information necessary to classify a given sample is calculated by utilizing the following equation:
Here, \(P_i\) is the probability that a random sample belongs to class \(C_i\) and is measured by \(z_i/z.\)
The number of instances of class \(C_i\) in subset \(N_j\) is given by \(z_{ij}.\) The expected information based on the partitioning into subsets by F is presented by
\(SplitInfo_F(X)\) is measured by utilizing the following equation:
The value of \(SplitInfo_F(X)\) represents the information generated by splitting the training data set X into t partitions corresponding to the t outcomes of a test on attribute Z.
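Under the standard definitions above (entropy I, expected information E(Z) and SplitInfo), the gain-ratio of a single categorical feature can be sketched as follows; this is an illustrative implementation, not the tool used in the study:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(X): expected information needed to classify a sample."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Gain(Z) / SplitInfo(Z) for one categorical feature Z."""
    total = len(labels)
    # E(Z): weighted entropy of the partitions induced by the feature values
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    expected = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - expected          # Gain(Z) = I(X) - E(Z)
    split_info = entropy(feature_values)       # SplitInfo_F(X)
    return gain / split_info if split_info else 0.0
```

For the Boolean permission features used here, each feature simply splits the data into a "requested" and a "not requested" partition.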
5.1.2 Chi-squared test
This test is employed to examine the independence of two events [49, 57]; in our work, features are ranked by the value of the chi-squared statistic with respect to the class. A higher value implies rejection of the independence assumption, and consequently such features can be regarded as more relevant for detecting malware in Android apps. The initial hypothesis \(H_0\) is the assumption that the two features are unrelated, and it is tested by the chi-squared formula:
where \(O_{ij}\) is an observed frequency and \(E_{ij}\) is the expected (theoretical) frequency asserted by the null hypothesis. The greater the value of \(\chi ^2\), the stronger the evidence against the hypothesis \(H_0\).
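For the binary permission features used in this work, the statistic reduces to a 2×2 contingency table; a minimal sketch, with the expected counts derived under the independence hypothesis \(H_0\):

```python
def chi_squared(feature, labels):
    """Chi-squared statistic between a binary feature and a binary class label,
    computed from the 2x2 observed/expected contingency table."""
    n = len(labels)
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        observed[(f, y)] += 1
    f_tot = {f: sum(observed[(f, y)] for y in (0, 1)) for f in (0, 1)}
    y_tot = {y: sum(observed[(f, y)] for f in (0, 1)) for y in (0, 1)}
    stat = 0.0
    for f in (0, 1):
        for y in (0, 1):
            expected = f_tot[f] * y_tot[y] / n   # E_ij under independence
            if expected:
                stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat
```

A feature that perfectly tracks the class yields the maximum statistic, while an unrelated feature yields zero.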
5.1.3 Information-gain feature selection
In information-gain feature selection, features are selected on the basis of their relation to the class [49, 55]. Here entropy is used as a criterion of impurity in a training set S (the data set): information gain is a measure reflecting the additional information about Y (a random feature) provided by X (a second random feature), i.e., the amount by which the entropy of Y decreases. It is given by
The information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the IG criterion is that it is biased in favor of features with more values even when they are not more informative [49, 55].
5.1.4 OneR feature selection
The OneR feature selection approach grades features using a simple classification mechanism [49, 55]: a one-rule classifier is built for each feature, and continuous features are handled by dividing their range of values into a few disjoint intervals using a straightforward method. In this study, we retain the features with the better classification rates.
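A hedged sketch of OneR scoring for a single discrete feature: the feature's one-rule classifier predicts the majority class for each feature value, and the feature is ranked by that rule's accuracy.

```python
from collections import Counter, defaultdict

def one_r_score(feature, labels):
    """OneR: predict the majority class for each feature value and score the
    feature by the resulting classification rate."""
    by_value = defaultdict(list)
    for v, y in zip(feature, labels):
        by_value[v].append(y)
    # For each value, the rule gets the majority-class instances right
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_value.values())
    return correct / len(labels)
```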
5.1.5 Principal component analysis (PCA)
Attribute reduction is accomplished by applying PCA to our collected data set. PCA transforms a high-dimensional data space into a low-dimensional one, and the features obtained in the low-dimensional space have extreme importance for detecting malware [49, 70]. Since correlations among several features are high, PCA is utilized to map these features to new ones that are not highly correlated, named principal-component-domain features. Further, a small number of principal components is sufficient to identify significant patterns in the data. The detailed phases of PCA are demonstrated in Fig. 4.
The feature data set is collected in the form of an \(m \times n\) matrix that contains n data samples and m extracted features. In the second phase, normalization of the feature data set is performed using the equation
and each \(x^j\) is replaced with \((x^j-\mu _j)\). Next, we calculate the eigenvalues and eigenvectors in the MATLAB environment. To select the first k principal components from the covariance matrix, we perform the following steps:
while (i = 1 to m) do evaluate cumvar = \(\sum _{i=1}^k \lambda _{ii}\)
where cumvar denotes the cumulative variance and \(\lambda\) represents the eigenvalues sorted in descending order.
After this evaluation, the reduced feature sets are selected for training purposes.
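The phases above (mean normalization, covariance matrix, eigen decomposition, cumulative-variance cut-off) can be sketched in a few lines; the 95% variance threshold here is an illustrative choice, not one fixed by the study:

```python
import numpy as np

def pca_reduce(X, variance_threshold=0.95):
    """Project X onto the first k principal components, where k is the
    smallest number whose cumulative variance (cumvar) exceeds the threshold."""
    X = X - X.mean(axis=0)                      # mean-normalize each feature
    cov = np.cov(X, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigen decomposition
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumvar = np.cumsum(eigvals) / eigvals.sum() # cumulative variance
    k = int(np.searchsorted(cumvar, variance_threshold) + 1)
    return X @ eigvecs[:, :k]                   # project onto k components
```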
5.1.6 Logistic regression analysis
For feature ranking, univariate logistic regression (ULR) analysis is used to verify the degree of importance of every feature set [25, 49]. In the current work, we consider two parameters of the LR model to discover the importance of every feature and to rank the feature sets. The parameters for logistic regression analysis are as follows:
-
1.
Value of regression coefficient The coefficient of a feature indicates the degree of correlation of that feature set with malware.
-
2.
p value The p value, i.e., the level of significance, shows whether the correlation is statistically significant.
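A self-contained sketch of univariate logistic regression with a Wald test for the coefficient, assuming a binary feature x and class label y that are not perfectly separable; it illustrates the two ranking parameters (coefficient value and p value) rather than reproducing the exact statistical package used in the study:

```python
from math import exp, sqrt, erf

def univariate_logistic(x, y, iters=25):
    """Fit y ~ sigmoid(b0 + b1*x) by Newton-Raphson and return the slope b1
    together with its two-sided Wald p value (normal approximation)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                      # gradient of log-likelihood
            g1 += (yi - p) * xi
            h00 += w                          # (negative) Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:                      # degenerate data: stop
            break
        b0 += (h11 * g0 - h01 * g1) / det     # Newton step: H^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    se = sqrt(h00 / det)                      # sqrt of [H^{-1}]_{11}
    z = b1 / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return b1, p_value
```

A feature would be kept when its coefficient is positive and its p value falls below 0.05, matching the selection rule stated later in Sect. 10.1.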
5.2 Feature subset selection approaches
These approaches are employed to detect subsets of features which jointly have the best predictive capability. They are based on the hypothesis that the developed model achieves a better detection rate and a lower number of misclassification errors when features are judged in combination rather than in isolation. Several approaches are feasible for identifying the right subset of features for detecting malware; in this work, four distinct feature subset selection approaches are considered. The implemented approaches are described below:
5.2.1 Correlation based feature selection
This approach selects a subset of features that are particularly related to the class (i.e., benign or malware). In this research paper, Pearson's correlation coefficient (r) has been used to find dependencies among features. A higher value of r between feature sets indicates a strong relation among those features: instances with low (or high) values of one feature tend to have low (or high) values of the other highly correlated features [49].
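A minimal sketch of the Pearson correlation coefficient used by this approach:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's coefficient of correlation r between two feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```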
5.2.2 Rough set analysis
Rough set analysis (RSA) approximates a conventional set by a pair of feature sets which provide the upper and the lower approximation of the original set [46, 56]. This formal approximation depicts the upper and lower limits of the original data set, and its typical application is mining knowledge from imperfect data. We use this approach to select a reduced set of features from the extracted feature sets. RSA uses three notions: approximations, reduced attributes and the information system. The steps pursued to get the reduced subset using RSA are mentioned below and also demonstrated in Fig. 5.
(i) Approximation Let \(A=(C,Z), X\subseteq Z\) and \(Y\subseteq C.\) The X-upper \((\overline{X}Y)\) and X-lower \((\underline{X}Y)\) approximations of X are utilized to estimate Y. The upper approximation includes all the objects which may belong to the set, and the lower approximation includes all objects which certainly belong to the set. \(\overline{X}Y\) and \(\underline{X}Y\) are computed by utilizing the following equations:
where \([y_i]_{Ind(C)}\) denotes the equivalence class of \(y_i\) under the relation Ind(C).
(ii) Reduced attributes The accuracy evaluation of the set Z (Acc(Z)) in \(A\subseteq B\) is determined as:
The number of objects contained in the upper or lower approximation of the set Z is called the cardinality of the corresponding set. Further, all possible feature sets are considered whose accuracy is equivalent to the accuracy of the extracted feature sets.
(iii) Information system It is defined as \(Z=(C,B)\), where C is a universe consisting of a non-empty finite set of objects and B is a finite set of attributes. For each \(b \in B\), there exists a function \(F_b:C\rightarrow V_b\), where \(V_b\) denotes the value domain of attribute b. For each \(A\subseteq B,\) there exists an equivalence relation known as the B-indiscernibility relation Ind(Z). Ind(Z) can be defined as:
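The two core constructions, indiscernibility classes and the lower/upper approximations, can be sketched as follows (an illustrative toy implementation, with objects given as attribute-value dictionaries):

```python
def indiscernibility_classes(objects, attrs):
    """Partition objects into equivalence classes of Ind(C): objects with
    equal values on all attributes in attrs are indiscernible."""
    classes = {}
    for obj, values in objects.items():
        key = tuple(values[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def approximations(classes, target):
    """Lower approximation: classes entirely inside the target set.
    Upper approximation: classes that overlap the target set."""
    lower = set().union(*([c for c in classes if c <= target] or [set()]))
    upper = set().union(*([c for c in classes if c & target] or [set()]))
    return lower, upper
```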
5.2.3 Consistency subset evaluation approach
This technique evaluates the worth of a subset of attributes by the level of consistency in the class values when the training instances are projected onto the subset of attributes. The consistency rate is calculated with the help of the inconsistency rate, where two data elements are considered inconsistent if they have the same feature values but belong to different class labels (i.e., benign or malware). In this work, the target variable takes two distinct values (0 for benign apps and 1 for malware apps). A group of features (GF) has Z samples and z distinct instances such that \(Z=X_1+X_2+\cdots +X_z.\) Instance \(X_i\) appears in A samples in total, of which \(A_0\) samples are labeled 0 and \(A_1\) samples are labeled 1, with \(A=A_0+A_1.\) If \(A_1\) is less than \(A_0,\) the inconsistency count for the instance \(X_i\) is \(INC=A-A_0.\) The inconsistency rate (INCR) of a feature set is computed by utilizing the following equation:
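Following the definition above, a compact sketch of the inconsistency rate of a candidate feature subset (rows are the instances restricted to that subset):

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels):
    """INCR: for each distinct feature pattern appearing A times, the
    inconsistency count is A minus the majority label count; INCR is the
    total inconsistency count divided by the number of samples."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[tuple(row)].append(y)
    inc = sum(len(ys) - Counter(ys).most_common(1)[0][1]
              for ys in groups.values())
    return inc / len(labels)
```

A subset with a low inconsistency rate separates benign from malware apps almost as well as the full feature set.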
5.3 Filtered subset evaluation
Filtered subset evaluation applies a subset evaluator to data that has been passed through an arbitrary filtering approach [40]. The filtering technique is not based on any induction algorithm, which makes filtered subset evaluation scalable and fast. Figure 6 demonstrates the steps followed to find a subset of features using the filter method.
6 Machine learning techniques
Various authors have applied unsupervised machine learning algorithms such as K-means [18, 52] and Self-Organizing Maps (SOM) [18, 20] to detect Android malware, but they applied these clustering algorithms to limited data sets. To overcome this gap, in this study we implement five different clustering algorithms on our extracted data set: SOM, K-means, farthest-first clustering, filtered clustering and density-based clustering. The choice of clustering algorithms is based on their performance in the literature [18, 20, 52, 74].
6.1 Self-organizing maps (SOM)
SOM is a type of artificial neural network (ANN) that is trained with unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction.Footnote 30 A SOM consists of neurons which have the same dimensionality as the input space and are arranged in a rectangular or hexagonal grid. SOM neurons can be viewed as pointers in the input space, in which more neurons point to regions with a high concentration of inputs [13]. The training algorithm can be summarized in four basic steps:
-
1.
Initialize neuron weights \(W_i=[w_{i1}, w_{i2}, \ldots , w_{ij}]^T \in R^j.\)
-
2.
Input pattern \(x=[x_1, x_2, \ldots , x_j]^T\in R^j.\) The input pattern corresponds to an app whose permissions are expressed in the form of a bit string. Calculate the distance between pattern x and each neuron weight \(W_R,\) and thereby identify the winning neuron (best matching neuron) Z as follows
$$\begin{aligned} \Vert x-W_z\Vert =min_R\{\Vert x-W_R\Vert \} \end{aligned}$$(12)In our study, we employed Euclidean distance as the distance metric, normalized to the range [0, 1].
-
3.
Adjust the weights of the winning neuron Z and all neighboring units
$$\begin{aligned} w_R(t+1)=w_R(t)+h_{ZR}(t)[x(t)-W_R(t)], \end{aligned}$$(13)where R is the index of the neighbor and t is an integer, the discrete time coordinate. The neighborhood kernel \(h_{ZR}(t)\) is a function of time and of the distance between neighbor neuron R and winning neuron Z. \(h_{ZR}(t)\) defines the region of influence that the input has on the SOM and consists of two components: the neighborhood function \(h(\Vert .\Vert ,t)\) and the learning rate function \(\alpha (t)\):
$$\begin{aligned} h_{ZR}(t)=h(\Vert b_r-b_z\Vert ; t)\alpha (t) \end{aligned}$$(14)where b is the location of a neuron.
In our study, we used a Gaussian neighborhood function; the neighborhood kernel with the Gaussian function is
$$\begin{aligned} h_{ZR}(t)=\exp \bigl (-\frac{\Vert b_r-b_z\Vert ^2}{2\sigma ^2(t)}\bigr )\alpha (t), \end{aligned}$$(15)where \(\sigma (t)\) defines the width of the kernel.
-
4.
Repeat steps 2–3 until the convergence criterion is satisfied.
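The four steps above can be sketched as a minimal SOM training loop; the grid size, epoch count and linear decay schedules are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM: random initial weights, winner by Euclidean distance,
    Gaussian neighborhood kernel with decaying width and learning rate."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, X.shape[1]))          # step 1: init weights
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)                  # alpha(t)
        sigma = sigma0 * (1.0 - t / epochs) + 1e-3     # sigma(t), kept > 0
        for x in X:
            z = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # step 2: winner
            d2 = np.sum((pos - pos[z]) ** 2, axis=1)   # grid distance to winner
            h = np.exp(-d2 / (2.0 * sigma ** 2)) * lr  # Gaussian kernel h_ZR(t)
            W += h[:, None] * (x - W)                  # step 3: update weights
    return W                                           # step 4: loop = epochs
```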
6.2 K-mean
K-means adopts the method of vector quantization. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.Footnote 31 The K-means algorithm works as follows:
-
1.
Specify number of clusters K.
-
2.
Initialize centroids by first shuffling the data set and then randomly selecting K data points for the centroids without replacement.
-
3.
Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters stops changing:
-
Compute the sum of the squared distance between data points and all centroids.
-
Assign each data point to the closest cluster (centroid).
-
Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
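The steps above can be sketched as follows (the random initialization and iteration cap are illustrative choices):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Plain K-means: pick k random points as centroids, assign each point to
    the nearest centroid, recompute centroids as cluster means, and stop when
    the assignment no longer changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for it in range(iters):
        # Euclidean distance of every point to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break                                   # assignments are stable
        labels = new_labels
        for j in range(k):                          # recompute cluster means
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```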
6.3 Farthest first
Farthest first is based on the principle of a bounded metric space in which the first point is selected arbitrarily and each successive point is as far as possible from the set of previously selected points.Footnote 32 The working of farthest-first clustering is described below [41]:
For each \(X_i=[x_{i,1},x_{i,2},\ldots ,x_{i,m}]\) in D, described by m categorical attributes, we use \(f(x_{i,j}|D)\) to denote the frequency count of attribute value \(x_{i,j}\) in the data set. Then, a scoring function is designed for evaluating each point, which is defined as:
1. Farthest-first traversal (D: data set, k: integer) {
2. randomly select the first center;
3. // select centers
4. for (i = 2, \(\ldots\), k) {
5. for (each remaining point) { calculate the distance to the current center set; }
6. select the point with maximum distance as the new center; }
7. // assign remaining points
8. for (each remaining point) {
9. calculate the distance to each cluster center;
10. put it into the cluster with minimum distance; } }
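A runnable sketch of the traversal on numeric vectors (the scoring-function variant for categorical data is omitted; Euclidean distance is an illustrative choice):

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Farthest-first traversal: pick the first center at random, make each
    new center the point farthest from the current center set, then assign
    every point to its nearest center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                # step 2: random start
    for _ in range(1, k):                              # steps 4-6
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])           # farthest point so far
    centers = np.array(centers)
    # steps 8-10: assign every point to its nearest center
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers
```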
6.4 Filtered cluster
A filtered cluster is a special subset of a partially ordered set in which each cluster is labeled with an ID: if the cluster belongs to the first class, it is labeled 1; otherwise it is labeled 0. The working of the filtered clustering algorithm is described below [37]:
1. In a single scan of the data set, derive the \(\frac{l(l-1)}{2}\) contingency tables necessary for computing the previously presented adequacy indices.
2. Run the genetic algorithm using the fitness function \(fit(SA, SA_*)\),Footnote 33 where:
- a chromosome codes (corresponds to) a subset of SA;
- each gene of the chromosome codes an attribute of SA (so there are l genes);
- each gene of a chromosome has a binary value: the gene value is 1 (resp. 0) if its associated attribute is present (resp. absent) in the subset of SA coded by the chromosome.
3. Select the best subspace found by the genetic algorithm.
6.5 Density-based cluster
Density-based clustering is based on the notion of density: it grows clusters according to the density of neighboring objects or according to some density function. The working of density-based clustering is described below [24]:
1. initialize \(t_c = 0\);
2. initialize an empty hash table \(grid\_list\);Footnote 34
3. while data stream is active do
4. read record \(x = (x_1, x_2, \ldots , x_d)\);
5. determine the density grid g that contains x;
6. if (g not in \(grid\_list\)) insert g into \(grid\_list\);
7. update the characteristic vector of g;
8. if \(t_c == gap\) then
9. call \(initial\_clustering\)(\(grid\_list\));
10. end if
11. if \(t_c\) mod \(gap == 0\) then
12. detect and remove sporadic grids from \(grid\_list\);
13. call \(adjust\_clustering\)(\(grid\_list\));
14. end if
15. \(t_c = t_c + 1\);
16. end while
17. end procedure
7 Comparison of proposed model with different existing techniques
To validate whether our proposed framework is able to achieve a higher detection rate, we compare the results of our proposed model with three different techniques mentioned below:
-
(a)
Comparison of results with previously used classifiers To validate that our proposed model detects malware as effectively as previously used classifiers, we calculate two performance parameters, accuracy and F-measure, for the newly proposed model and the existing models. In addition, we compare our proposed framework with existing frameworks present in the literature.
-
(b)
Comparison of results with different anti-virus scanners To compare the performance of our model for malware detection, we chose ten distinct available anti-virus scanners and compared their detection rates with the detection rate of the proposed model.
-
(c)
Detection of known and unknown malware families Further, to evaluate how reliably our proposed model detects known and unknown malware families, we test known and unknown malware families against our proposed model and calculate the detection accuracy.
8 Evaluation of performance parameters
In this section, we discuss the definitions of the performance parameters used to evaluate our proposed model for malware detection. All of these parameters are calculated from the confusion matrix, which consists of the actual and detected classification information produced by the detection models. Table 4 demonstrates the confusion matrix for a malware detection model. In the present work, four performance parameters, namely inter-cluster distance, intra-cluster distance, F-measure and accuracy, are utilized for measuring the performance of the malware detection approaches.
Inter-cluster distance For each technique, we first calculate the centroid and then the centroid Euclidean distance. For N d-dimensional data points \(\{\mathbf {X}_i\},\) \(i= 1,2,3,\ldots ,N,\) in a cluster, the centroid \(\{\mathbf {C}_0\}\), as defined inFootnote 35, is given by
Next, we define the centroid Euclidean distance between the centroids of two clusters. Given the centroids \(C_{01}\) and \(C_{02}\) of two clusters, the centroid Euclidean distance, or inter-cluster distance, between them is defined by
Intra-cluster distance To calculate the intra-cluster distance, we find the root-mean-square total-sample standard deviation (RMSSTD). This is defined by
where \(\bar{s_j}\) denotes the standard deviation of attribute j and p is the number of features. The smaller the value, the more homogeneous the observations are with respect to the variables, and vice versa. Since the root-mean-square is scale dependent, it should only be used to compare the homogeneity of data sets whose variables are measured on similar scales.
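A sketch of the centroid, the inter-cluster distance, and one common form of RMSSTD (the exact pooling convention in the paper's equation may differ):

```python
import numpy as np

def centroid(points):
    """C_0: component-wise mean of the N d-dimensional points in a cluster."""
    return np.asarray(points, dtype=float).mean(axis=0)

def inter_cluster_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return float(np.linalg.norm(centroid(cluster_a) - centroid(cluster_b)))

def rmsstd(X):
    """Root-mean-square total-sample standard deviation over p attributes:
    pooled sum of squared deviations divided by p*(n-1)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    ss = ((X - X.mean(axis=0)) ** 2).sum()
    return float(np.sqrt(ss / (p * (n - 1))))
```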
F-measure The F-measure is the harmonic mean of the precision and recall values used in information retrieval [16]. Precision shows how many applications are in the right cluster with respect to the cluster size, and recall shows how many applications are in the right cluster with respect to the total number of applications of that class. Let i denote the class label and j denote the cluster; then the precision and recall for class i and cluster j are defined as:
where \(n_{i,j}\) is the number of applications with class label i in cluster j, \(n_i\) is the number of applications with class label i and \(n_j\) denotes the number of applications in cluster j. The F-measure for class i and cluster j is given as:
The total F-measure of clustering process is given by:
where n is the total number of applications.
Accuracy Let \(n_{i,j}\) be the number of applications with class label i in cluster j, \(n_{j,i}\) the number of applications with class label j in cluster i, \(n_i\) the number of applications with class label i, and \(n_j\) the number of applications in cluster j. Then the accuracy becomes:
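The per-class precision, recall and total F-measure can be sketched as follows; weighting each class by its size and taking the best-matching cluster per class is one common convention and is an assumption here:

```python
from collections import Counter

def clustering_f_measure(class_labels, cluster_labels):
    """Total F-measure: for each class i, take the best F(i, j) over all
    clusters j (precision = n_ij / n_j, recall = n_ij / n_i), weighted by
    the class size n_i / n."""
    n = len(class_labels)
    n_i = Counter(class_labels)                     # applications per class
    n_j = Counter(cluster_labels)                   # applications per cluster
    n_ij = Counter(zip(class_labels, cluster_labels))
    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            nij = n_ij.get((i, j), 0)
            if nij:
                prec, rec = nij / size_j, nij / size_i
                best = max(best, 2 * prec * rec / (prec + rec))
        total += (size_i / n) * best
    return total
```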
9 Experimental setup
In the present section, we introduce the experimental setup used to evaluate the performance of our malware detection models. Five distinct unsupervised machine learning algorithms are implemented on the thirty different categories of Android apps mentioned in Table 2. All of these data sets have varying numbers of benign and malware apps, which are adequate to perform our analysis. Figure 7 demonstrates the framework of our proposed model, named SemiDroid. In the first step, feature ranking and feature subset selection approaches are applied to the extracted feature data set. In the next step, we use the min-max normalization approach to normalize the data. This approach is based on the principle of linear transformation, which maps each data point \(D_{q_i}\) of feature Q to a normalized value that lies between 0 and 1. The following equation is used to find the normalized value of \(D_{q_i}:\)
where min(Q) and max(Q) are the minimum and maximum values of attribute Q, respectively. In the third step, we train models on the significant features by implementing the distinct machine learning techniques. In the next step, we construct a confusion matrix and calculate the performance parameters, i.e., accuracy and F-measure. Next, we compare the performance of the developed malware detection models and select the best one. At last, we compare the performance of our proposed malware detection model with existing techniques available in the literature and with distinct anti-virus scanners: if the performance of our proposed model is better than the existing techniques, then it is useful; otherwise, it is not.
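The min-max step can be sketched as:

```python
def min_max_normalize(values):
    """Linear transformation of each data point of a feature Q to [0, 1]:
    (x - min(Q)) / (max(Q) - min(Q))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]
```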
The following measures are pursued when selecting a subset of features to build the malware detection model that decides whether an app is benign or malware. Feature selection approaches are implemented on 30 different data sets of Android apps. Hence, a total of 1650 ((1 model with all extracted features + 10 feature selection approaches) × 30 data sets (subsets of different feature sets particular to the data sets determined after conducting feature selection) × 5 detection methods) different detection models have been built in this work. Below we provide step-by-step details of our approach:
-
1.
In the present work, four feature subset selection approaches and six feature ranking approaches are implemented on 30 different feature sets to select the right set of features for malware detection.
-
2.
The subsets of features obtained from the aforementioned procedure are given as input to the machine learning algorithms. To compare the developed models, we use the 20-fold cross-validation method. Cross-validation is a statistical learning approach utilized to evaluate and compare models by dividing the data into two different portions [40]: one portion is utilized to train the model and the remaining portion to verify it. The data is initially separated into K equal-sized segments; K−1 folds are utilized to train the model and the remaining fold is utilized for testing. K-fold cross-validation has the important property of utilizing the data set for both testing and training. In this study, 20-fold cross-validation is utilized to analyze the models, i.e., the data sets are segregated into 20 portions. The outcomes of all built malware detection models are compared with each other by employing two distinct performance measures: F-measure and accuracy.
-
3.
SemiDroid, i.e., the proposed model built by utilizing the above two steps, is validated against the existing techniques developed in the literature to review whether the built malware detection model is useful or not.
10 Results of performed experiment
In the current section of the paper, the relationship between the different feature sets and malware detection at the class level is presented. Sets of features are used as input, and the ratio of benign to malware apps within an experiment is reported. Intra-cluster distance, inter-cluster distance, F-measure and accuracy are used as performance assessment parameters to compare the malware detection models built using the five different unsupervised machine learning algorithms. To present the experimental results, we utilize the abbreviations given in Table 5 corresponding to the actual names.
10.1 Feature ranking approaches
Six feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, information gain feature evaluation, logistic regression analysis, OneR feature evaluation and principal component analysis) are implemented on the distinct feature sets. Each approach utilizes distinct criteria to rank the features, and the top \(\lceil \log _2 a \rceil\) features out of "a" features are considered to build a model for detecting malware. For the first four feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, OneR feature evaluation and information gain), the top \(\lceil \log _2 a \rceil\) features are selected as the feature subset, where a is the number of features in the original data set (for this work \(a=20\)). In the case of ULR, those features are selected which possess a positive regression coefficient and a p value below 0.05, and in the case of PCA, only those features are selected which have an eigenvalue greater than 1. The features selected by the feature ranking approaches are demonstrated in Fig. 8.
10.2 Feature subset selection approaches
In the present work, four distinct feature subset selection approaches are implemented on the thirty data sets of Android apps one after another. Feature subset selection works on the hypothesis of building models with better accuracy and fewer misclassification errors while selecting the best features from those available. The search strategies are heuristic search [17] (for correlation-based feature selection and rough set analysis) and best-first search [14] (for consistency subset evaluation and filtered subset evaluation). These selected subsets of features are then used as input for building a model that decides whether an app is benign or malware. The sets of features selected by the feature subset selection approaches are demonstrated in Fig. 9.
10.3 Machine learning techniques
Eleven subsets of features (1 comprising all extracted features + 10 resulting from the implemented feature selection approaches) are used as input to build models for malware detection. The hardware utilized to carry out this study is a Core i7 processor with a 1 TB hard disk and 8 GB RAM; the detection models are built in the MATLAB environment. Figure 10 demonstrates the implemented unsupervised machine learning algorithms on our collected data set, where red crosses represent normal permissions and blue crosses represent dangerous permissions. From Fig. 10, we observe that the clusters formed by SOM, density-based clustering, K-means and filtered clustering have overlapping normal and dangerous permissions; only farthest-first clustering forms clusters without overlapping permissions. Further, the performance of each detection model is measured using four performance parameters: intra-cluster distance, inter-cluster distance, F-measure and accuracy.
Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 present the performance values obtained for the distinct data sets by utilizing SOM, K-means, farthest-first clustering, filtered clustering and density-based clustering. On the basis of these tables, it may be concluded that:
-
Bold values indicate the highest detection rate when compared to other values in a specific row.
-
Values of the inter-cluster and intra-cluster distance are calculated using Eqs. 18 and 19.
-
F-measure and accuracy are measured using Eqs. 22 and 24.
-
Models developed by considering the features selected by the feature selection approaches as input detect malware more effectively than the model developed using all extracted feature sets.
-
The model constructed by considering FS4 as input achieved a higher detection rate than the models developed using the other feature selection approaches.
-
The model built using farthest-first clustering with FS4 as input achieved the highest detection rate among all developed models.
In this research paper, five distinct unsupervised machine learning algorithms and ten distinct feature selection approaches are considered to select features which help identify Android malware more effectively. To find out which developed model is more capable of detecting malware, we construct box-plot diagrams for the individual models. Box-plot diagrams help identify which model is best suited for malware detection on the basis of a small number of outliers and a better median value. Figures 11 and 12 demonstrate the box-plot diagrams of F-measure and accuracy for every developed model. The x-axis of the diagrams presents the feature selection techniques. Each figure includes eleven box-plots: one for all extracted feature sets, four for the feature subset selection approaches and six for the feature ranking approaches. On the basis of the box-plot diagrams, we make the following observations:
-
The models constructed by considering the five distinct unsupervised machine learning algorithms with FS4 achieved a higher median value along with few outliers. On the basis of the box-plot diagrams demonstrated in Figs. 11 and 12, the model developed by considering FS4 as the feature selection approach gives a better detection rate when compared to the other developed approaches.
-
From the box-plot diagrams, we observe that the model built by considering the farthest-first machine learning algorithm with FS4 has few outliers and a higher median value. This means that the model developed using RSA for detecting malware and benign apps achieved better results when compared to the others.
10.4 Comparison of results
To identify whether, out of the implemented feature selection approaches and machine learning algorithms, one technique works best or all of them perform equally well, we employed the pair-wise t test in our study.
1. Feature selection approaches In this study, two sets are formed for each pair of feature selection approaches; each feature selection approach has 150 distinct data points (5 machine learning techniques \(\times\) 30 data sets). The t test is performed on the distinct feature selection approaches and the respective p value is compared to measure statistical significance. The outcome of the t test study is demonstrated in Fig. 13b, where two different symbols represent the p value: a circle filled with green denotes a p value \(> 0.05\) (no significant difference) and a circle filled with red denotes a p value \(\le 0.05\) (significant difference). After observing Fig. 13b, it is clear that the majority of the cells are filled with green circles. This means that there is no significant difference among the employed feature selection approaches. Further, by examining the mean differences given in Table 16, we observe that the feature sets obtained by considering FS4 give the best outcomes when compared with the other implemented feature selection approaches.
In the present work, we also compare the developed models on the basis of cost-benefit analysis. For every feature selection approach, the cost-benefit is computed with the following equation:

\(Cost\text{-}Benefit = \dfrac{Based_{cost} + Based_{benefit}}{2}\)

Here, \(Based_{cost}\) depends on the correlation between the selected feature set and the class error, and is calculated as:

\(Based_{cost} = Accuracy(SM) \times \rho_{SM.fault}\)

where \(Accuracy(SM)\) is the classification accuracy of the malware detection model built from the selected feature set and \(\rho _{SM.fault}\) is the multiple correlation coefficient between the selected feature set and the error. Since the proposed model has high accuracy and a high multiple correlation coefficient, it achieves a high \(Based_{cost}\). With NAM denoting the total number of extracted features and NSM the number of features retained after applying a feature selection approach, \(Based_{benefit}\) is calculated as:

\(Based_{benefit} = \dfrac{NAM - NSM}{NAM}\)
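Taking the definitions of \(Based_{cost}\) and \(Based_{benefit}\) at face value, the computation reduces to a few lines; the numeric inputs in the usage note are hypothetical:

```python
def cost_benefit(accuracy_sm, rho_sm_fault, nam, nsm):
    """Cost-benefit of one feature selection approach.

    accuracy_sm  : Accuracy(SM), accuracy of the model built on the
                   selected feature set
    rho_sm_fault : multiple correlation coefficient between the selected
                   features and the error class
    nam, nsm     : number of extracted / selected features
    """
    based_cost = accuracy_sm * rho_sm_fault   # Based_cost
    based_benefit = (nam - nsm) / nam         # Based_benefit
    return (based_cost + based_benefit) / 2   # Cost-Benefit
```

For example, a hypothetical approach that keeps 10 of 100 features with 98.8% accuracy and \(\rho = 0.9\) would score `cost_benefit(0.988, 0.9, 100, 10)`, rewarding both accuracy and aggressive feature reduction.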
The feature selection approach that achieves the higher cost-benefit value is the foremost feature selection approach, as proposed in [22]. Figure 14a, b demonstrates the cost-benefit of the distinct feature selection approaches. On the basis of Fig. 14a, b, we observed that FS4 achieved the highest median Cost-Benefit measure when matched against the other approaches.
2. Machine learning techniques In our study, we applied eleven different feature subsets (i.e., all extracted features plus the outputs of the 10 feature selection approaches) to thirty different Android app data sets, examining four performance parameters, i.e., intra-cluster distance, inter-cluster distance, F-measure and accuracy, each with 330 data points [(1 all-features set + 10 feature selection approaches) \(\times\) 30 data sets]. Figure 13a demonstrates the outcomes of the t test analysis. On the basis of Fig. 13a, it is noticeable that there is a significant difference among these techniques, since the p value is smaller than 0.05. Further, from the mean difference values given in Table 17, farthest first clustering gives the best outcome when compared to the other machine learning techniques.
3. Feature subset selection and feature ranking approaches Here, a pair-wise t test is used to identify which class of feature selection approach works better. For both implemented classes (i.e., feature subset selection and feature ranking), sample pairs of performance measurements are studied; the t test outcomes for the averaged feature subset selection and feature ranking results are summarized in Table 18. Five distinct machine learning algorithms are applied to thirty different Android categories with accuracy and F-measure as performance parameters, yielding two samples: feature subset selection with 360 distinct data points (4 feature subset selection approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets) and feature ranking with 540 distinct data points (6 feature ranking approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets). On the basis of Table 18, there is no significant difference between the two classes of approaches, because the p value comes out to be greater than 0.05. However, the mean difference values show that feature subset selection approaches give better results than feature ranking approaches. On the basis of the Cost-Benefit analysis demonstrated in Fig. 14, both feature subset selection and feature ranking have nearly the same Cost-Benefit value, i.e., the averaged cost and benefit of models built from features selected by either class of approach are nearly equal.
10.5 Evaluation of the proposed framework, i.e., SemiDroid
10.5.1 Comparison of results with previously used classifiers
In addition to finding the best approach for building an accurate malware detection model, this study also compares it with the supervised machine learning approaches most often used in the literature, such as SVM with three distinct kernels (linear, polynomial and RBF), decision tree analysis, logistic regression, neural network and the Naïve Bayes classifier. Figure 15 demonstrates box-plot diagrams for the F-measure and accuracy of these commonly utilized classifiers and the five machine learning algorithms implemented in this paper. On the basis of Fig. 15, we observed that farthest first clustering has the highest median value, along with a small number of outliers.
A pair-wise t test is also applied to decide which machine learning approach yields the best performance. The outcomes of the t test study for the distinct machine learning approaches are demonstrated in Fig. 16. On the basis of Fig. 16, in a number of cases there is a significant difference among these machine learning techniques, because the p value is smaller than 0.05. Further, from the mean difference values in Table 19, farthest first clustering achieved better results when compared to the supervised machine learning techniques.
In addition, we compare our proposed malware detection model (i.e., SemiDroid) with existing frameworks and approaches developed in the literature. Table 20 shows the name, goal, methodology, deployment, data set and detection rate of each.
10.5.2 Comparison of results with different anti-virus scanners
Although farthest first clustering performs better than the machine learning techniques used in the literature, it must ultimately be compared with the common anti-virus products used in practice for Android malware detection. For this experiment, we selected 10 different anti-virus scanners available in the market and applied them to our collected data set; Table 21 shows the results. The detection rates of the anti-virus scanners vary considerably: the best scanner detected 96.2% of the Android malware, while certain scanners identified only 82% of the malicious samples, likely because they are not specialized in detecting Android malware. On 1000 Android apps, our proposed framework, SemiDroid, gives a detection rate of 98.8% and outperforms all 10 anti-virus scanners. From this, we can say that our proposed framework detects malware more efficiently than the anti-virus scanners considered.
10.5.3 Detection of known and unknown malware families
Detection of known malware families In this section, we check whether our proposed framework is capable of detecting malware of known families. For this experiment, we selected 20 samples of each family (in our study, we consider samples of the 81 different families shown in Table 22) and trained our selected model with them. Farthest first clustering detects, on average, 98.8% of the malware apps. The family names and the samples used for each family can be found in Table 22, and the detection rate of our proposed framework for each family is illustrated in Fig. 17a, b.
Detection of unknown malware families To check whether farthest first clustering is capable of detecting unknown malware families, we trained our proposed framework with 10 families selected at random and tested it on the remaining 71 families present in the data set. Table 23 shows the results of farthest first clustering when trained with the 10 selected families. From Table 23, we can say that training farthest first clustering with a small number of known-family samples, enough to generalize the behavior of most malware families, already gives a good detection rate.
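The unknown-family experiment amounts to splitting the data by family rather than at random. A sketch, with an sklearn-style estimator standing in for the farthest first model and all names illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def leave_families_out(X, y, families, train_families, estimator):
    """Train on samples from a few chosen families, test on the rest.

    `families` gives one family label per sample (benign samples carry a
    sentinel such as "benign"); `estimator` is any fit/predict model.
    Returns the detection rate on the held-out malware samples.
    """
    train = np.isin(families, list(train_families))
    model = estimator.fit(X[train], y[train])
    pred = model.predict(X[~train])
    held_out_malware = y[~train] == 1
    return float((pred[held_out_malware] == 1).mean())

# Toy data: families F2 and F3 are never seen during training.
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [1, 1], [1, 0.9], [0.9, 1], [1.1, 1], [0.95, 1.05]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
families = np.array(["benign", "benign", "benign",
                     "F1", "F1", "F2", "F3", "F3"])
rate = leave_families_out(X, y, families, {"benign", "F1"},
                          KNeighborsClassifier(n_neighbors=1))
```

Because the unseen families sit near the known malicious region of feature space, the held-out samples are still flagged, which is the generalization the experiment probes.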
In summary, our proposed framework detects Android malware more effectively than several anti-virus scanners that regularly update their signature definitions. In addition, it identifies Android malware efficiently even when trained with a limited number of malware families.
10.5.4 Experimental findings
The comprehensive conclusions of our experimental work are presented in this section. The empirical study was conducted on thirty different categories of Android apps, considering five different unsupervised machine learning techniques, i.e., SOM, K-means, filter clustering, density-based clustering and farthest first clustering. On the basis of the experimental results, this research paper answers the questions raised in Sect. 2.
RQ1. In this paper, we applied five distinct machine learning algorithms to build a model that detects whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, the model built by employing farthest first clustering with the feature set selected by FS4 as input gives better outcomes than the others.
RQ2. To answer RQ2, Fig. 17 and Tables 20 and 21 were analyzed. The model built with farthest first clustering is capable of detecting malware from real-world apps.
RQ3. In the present paper, four distinct feature subset selection approaches and six distinct feature ranking approaches are used to identify smaller subsets of features. With these approaches, we selected the best possible feature subsets for building a model that identifies whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, in a number of cases a reduced subset of features is better for building a detection model than the full set of extracted features.
RQ4. In the present paper, six distinct feature ranking approaches are used to discover a reduced subset of features. On the basis of the t test study, feature selection with the PCA approach gives better outcomes than the other ranking approaches.
RQ5. In this paper, four distinct feature subset selection approaches are used to find a reduced subset of features. On the basis of the t test study, feature selection with FS4 gives convincingly better outcomes than the other approaches.
RQ6. In this work, a pair-wise t test is utilized to identify whether feature subset selection approaches perform better than feature ranking approaches or both perform equally well. On the basis of the t test outcomes, there is no significant difference between feature subset selection and feature ranking; however, the mean difference values show that feature subset selection approaches give better results than feature ranking.
RQ7. On the basis of Sect. 9, we observe that the performance of the feature selection approaches varies with the machine learning technique used. Further, the choice of machine learning algorithm for building a malware detection model depends on the feature selection approach.
11 Threats to validity
In this section, we discuss the threats to validity encountered while performing the experiment:
(i) Construct validity The models presented in this work only detect whether an app is benign or malware; they do not state how many permissions and API calls are required to detect malware.
(ii) External validity Cyber-criminals develop malware on a daily basis to misuse user information. In this work, we considered 81 different malware families to train the model, and our proposed model is capable of detecting malware from known and unknown families. The research can be extended by training the model with more malware families so that it detects more malware apps from the real world.
(iii) Internal validity This threat lies in the consistency of the data used in this study. We collected data from the different sources mentioned in Sect. 4; any errors not reported by those sources were not considered in this work. We cannot claim that the data considered for the experiment is 100% accurate, but we believe it has been collected consistently.
12 Conclusion
This work emphasizes designing a malware detection framework using a selected set of features that helps identify whether an Android app belongs to the malware class or the benign class. The experiments were performed on thirty different categories of Android apps.
Empirical results indicate that it is feasible to determine a small subset of features. The malware detection model built with this determined set of features is able to separate malware from benign apps with a lower misclassification error and better accuracy. Further, the results of the malware detection model are influenced by the feature selection approach.
After performing an in-depth analysis, we found that the AA, BU, LS, PE, RA and TO feature sets, obtained with the feature selection approaches, are relevant detectors for malware. Further, on the basis of mean difference, the model built with the selected set of features as input gives a better detection rate than the model built with all the extracted features. Moreover, the model built with farthest first clustering gives better outcomes than the other techniques.
Finally, on the basis of the cost-benefit analysis, the features selected by FS4 achieved the highest median Cost-Benefit value among the approaches, and the model built with farthest first clustering is capable of detecting 98.8% of known and unknown malware from real-world apps.
The models proposed in this work only detect whether an app is malware or benign. The work can be extended to develop a malware detection model that predicts whether a particular feature is capable of detecting malware. Moreover, this study can be replicated on other Android app repositories using soft computing models to attain a better malware detection rate.
Notes
Mahindru, Arvind (2020), “Android permissions dataset, Android Malware and benign Application Data set (consist of permissions and API calls)”, Mendeley Data, V3, doi: 10.17632/b4mxg7ydb7.3.
Testing was performed on a local system.
The live location of the user is shown on Google Maps, which is pre-installed on Android-based devices.
Mahindru, Arvind (2020), “Android permissions dataset, Android Malware and benign Application Data set (consist of permissions and API calls)”, Mendeley Data, V3, doi: http://dx.doi.org/10.17632/b4mxg7ydb7.3
Malware families are identified by VirusTotal.
A data set DS is composed of a set (O) of n objects described by a set (SA) of l attributes.
\(grid\_list\) consists of attributes.
References
Aafer Y, Du W, Yin H (2013) Droidapiminer: mining api-level features for robust malware detection in android. In: International conference on security and privacy in communication systems, Springer, pp 86–103
Abawajy J, Kelarev A (2017) Iterative classifier fusion system for the detection of android malware. IEEE Transactions on Big Data
Alam MS, Vuong ST (2013) Random forest classification for detecting android malware. In: 2013 IEEE international conference on green computing and communications and IEEE Internet of Things and IEEE cyber, physical and social computing, IEEE, pp 663–669
Alazab M, Alazab M, Shalaginov A, Mesleh A, Awajan A (2020) Intelligent mobile malware detection using permission requests and API calls. Future Gener Comput Syst 107:509–521
Almin SB, Chatterjee M (2015) A novel approach to detect android malware. Procedia Comput Sci 45:407–417
Alzaylaee MK, Yerima SY, Sezer S (2020) DL-droid: deep learning based android malware detection using real devices. Comput Secur 89:101663
Amos B, Turner H, White J (2013) Applying machine learning classifiers to dynamic android malware detection at scale. In: 2013 9th international wireless communications and mobile computing conference (IWCMC), IEEE, pp 1666–1671
Andriatsimandefitra R, Tong VVT (2015) Detection and identification of android malware based on information flow monitoring. In: 2015 IEEE 2nd international conference on cyber security and cloud computing, IEEE, pp 200–203
Arora A, Peddoju SK, Conti M (2019) Permpair: Android malware detection using permission pairs. IEEE Trans Inf Forensics Secur 15:1968–1982
Arp D, Spreitzenbarth M, Hubner M, Gascon H, Rieck K, Siemens C (2014) Drebin: effective and explainable detection of android malware in your pocket. NDSS 14:23–26
Attar AE, Khatoun R, Lemercier M (2014) A gaussian mixture model for dynamic detection of abnormal behavior in smartphone applications. In: 2014 global information infrastructure and networking symposium (GIIS), IEEE, pp 1–6
Babaagba KO, Adesanya SO (2019) A study on the effect of feature selection on malware analysis using machine learning. In: Proceedings of the 2019 8th international conference on educational and information technology, pp 51–55
Barrera D, Kayacik HG, Oorschot PCV, Somayaji A (2010) A methodology for empirical analysis of permission-based security models and its application to android. In: Proceedings of the 17th ACM conference on computer and communications security, pp 73–84
Bibi KF, Banu MN (2015) Feature subset selection based on filter technique. In: 2015 international conference on computing and communications technologies (ICCCT), IEEE, pp 1–6
Birendra C (2016) Android permission model. arXiv preprint arXiv:160704256
Blair DC (1979) Information retrieval, 2nd ed. C. J. van Rijsbergen. J Am Soc Inf Sci 30(6):374–375. https://doi.org/10.1002/asi.4630300621. https://ideas.repec.org/a/bla/jamest/v30y1979i6p374-375.html
Blessie EC, Karthikeyan E (2012) Sigmis: a feature selection algorithm using correlation based method. J Algorithms Comput Technol 6(3):385–394
Burguera I, Zurutuza U, Nadjm-Tehrani S (2011) Crowdroid: behavior-based malware detection system for android. In: Proceedings of the 1st ACM workshop on security and privacy in smartphones and mobile devices, pp 15–26
Cai H, Meng N, Ryder B, Yao D (2018) Droidcat: effective android malware detection and categorization via app-level profiling. IEEE Trans Inf Forensics Secur 14(6):1455–1470
Canbek G, Baykal N, Sagiroglu S (2017) Clustering and visualization of mobile application permissions for end users and malware analysts. In: 2017 5th international symposium on digital forensic and security (ISDFS), IEEE, pp 1–10
Caviglione L, Gaggero M, Lalande JF, Mazurczyk W, Urbański M (2015) Seeing the unseen: revealing mobile malware hidden communications via energy consumption and artificial intelligence. IEEE Trans Inf Forensics Secur 11(4):799–810
Chaikla N, Qi Y (1999) Genetic algorithms in feature selection. In: IEEE SMC’99 conference proceedings. 1999 IEEE international conference on systems, man, and cybernetics (Cat. No. 99CH37028), IEEE, vol 5, pp 538–540
Chen PS, Lin SC, Sun CH (2015) Simple and effective method for detecting abnormal internet behaviors of mobile devices. Inf Sci 321:193–204
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 133–142
Cruz AEC, Ochimizu K (2009) Towards logistic regression models for predicting fault-prone code across software projects. In: 2009 3rd international symposium on empirical software engineering and measurement, IEEE, pp 460–463
Cui B, Jin H, Carullo G, Liu Z (2015) Service-oriented mobile malware detection system based on mining strategies. Pervas Mobile Comput 24:101–116
Dixon B, Mishra S (2013) Power based malicious code detection techniques for smartphones. In: 2013 12th IEEE international conference on trust, security and privacy in computing and communications, IEEE, pp 142–149
Enck W, Ongtang M, McDaniel P (2009) On lightweight mobile phone application certification. In: Proceedings of the 16th ACM conference on computer and communications security, pp 235–245
Enck W, Gilbert P, Han S, Tendulkar V, Chun BG, Cox LP, Jung J, McDaniel P, Sheth AN (2014) Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans Comput Syst (TOCS) 32(2):1–29
Faruki P, Ganmoor V, Laxmi V, Gaur MS, Bharmal A (2013) Androsimilar: robust statistical feature signature for android malware detection. In: Proceedings of the 6th international conference on security of information and networks, pp 152–159
Fung CJ, Lam DY, Boutaba R (2014) Revmatch: An efficient and robust decision model for collaborative malware detection. In: 2014 IEEE network operations and management symposium (NOMS), IEEE, pp 1–9
Guo DF, Sui AF, Shi YJ, Hu JJ, Lin GZ, Guo T (2014) Behavior classification based self-learning mobile malware detection. JCP 9(4):851–858
Han W, Xue J, Wang Y, Liu Z, Kong Z (2019) Malinsight: a systematic profiling based malware detection framework. J Netw Comput Appl 125:236–250
Holland B, Deering T, Kothari S, Mathews J, Ranade N (2015) Security toolbox for detecting novel and sophisticated android malware. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, IEEE, vol 2, pp 733–736
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) Hindroid: an intelligent android malware detection system based on structured heterogeneous information network. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1507–1515
Jerbi M, Dagdia ZC, Bechikh S, Said LB (2020) On the use of artificial malicious patterns for android malware detection. Comput Secur 92:101743
Jouve PE, Nicoloyannis N (2005) A filter feature selection method for clustering. In: International symposium on methodologies for intelligent systems, Springer, pp 583–593
Kadir AFA, Stakhanova N, Ghorbani AA (2015) Android botnets: What URLs are telling us. In: International conference on network and system security, Springer, pp 78–91
Karbab EB, Debbabi M, Derhab A, Mouheb D (2018) Maldozer: automatic framework for android malware detection using deep learning. Digit Investig 24:S48–S59
Kohavi R, John GH et al (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
Kumar M, et al. (2013) An optimized farthest first clustering algorithm. In: 2013 Nirma University international conference on engineering (NUiCONE), IEEE, pp 1–5
Lee WY, Saxe J, Harang R (2019) Seqdroid: obfuscated android malware detection using stacked convolutional and recurrent neural networks. In: Deep Learning applications for cyber security, Springer, pp 197–210
Lindorfer M, Neugschwandtner M, Weichselbaum L, Fratantonio Y, Veen VVD, Platzer C (2014) Andrubis–1,000,000 apps later: a view on current android malware behaviors. In: 2014 third international workshop on building analysis datasets and gathering experience returns for security (BADGERS), IEEE, pp 3–17
Ma Z, Ge H, Liu Y, Zhao M, Ma J (2019) A combination method for android malware detection based on control flow graphs and machine learning algorithms. IEEE Access 7:21235–21245
Mahindru A, Sangal A (2019) Deepdroid: feature selection approach to detect android malware using deep learning. In: 2019 IEEE 10th international conference on software engineering and service science (ICSESS), IEEE, pp 16–19
Mahindru A, Sangal A (2020a) Feature-based semi-supervised learning to detect malware from android. Automated software engineering: a deep learning-based approach. Springer, Berlin, pp 93–118
Mahindru A, Sangal A (2020b) Feature-based semi-supervised learning to detect malware from android. Automated software engineering: a deep learning-based approach. Springer, Berlin, pp 93–118
Mahindru A, Sangal A (2020a) Gadroid: a framework for malware detection from android by using genetic algorithm as feature selection approach. Int J Adv Sci Technol 29(5):5532–5543
Mahindru A, Sangal A (2020b) Perbdroid: effective malware detection model developed using machine learning classification techniques. A journey towards bio-inspired techniques in software engineering. Springer, Berlin, pp 103–139
Mahindru A, Singh P (2017) Dynamic permissions based android malware detection using machine learning techniques. In: Proceedings of the 10th innovations in software engineering conference, pp 202–210
Martinelli F, Mercaldo F, Saracino A (2017) Bridemaid: an hybrid tool for accurate detection of android malware. In: Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp 899–901
Milosevic N, Dehghantanha A, Choo KKR (2017) Machine learning aided android malware classification. Comput Electr Eng 61:266–274
Narudin FA, Feizollah A, Anuar NB, Gani A (2016) Evaluation of machine learning classifiers for mobile malware detection. Soft Comput 20(1):343–357
Ng DV, Hwang JIG (2014) Android malware detection using the dendritic cell algorithm. In: 2014 international conference on machine learning and cybernetics, IEEE, vol 1, pp 257–262
Novakovic J (2010) The impact of feature selection on the accuracy of naïve bayes classifier. In: 18th telecommunications forum TELFOR, vol 2, pp 1113–1116
Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11(5):341–356
Plackett RL (1983) Karl pearson and the chi-squared test. International Statistical Review/Revue Internationale de Statistique 59–72
Portokalidis G, Homburg P, Anagnostakis K, Bos H (2010) Paranoid android: versatile protection for smartphones. In: Proceedings of the 26th annual computer security applications conference, pp 347–356
Quan D, Zhai L, Yang F, Wang P (2014) Detection of android malicious apps based on the sensitive behaviors. In: 2014 IEEE 13th international conference on trust, security and privacy in computing and communications, IEEE, pp 877–883
Rahman M (2013) Droidmln: a markov logic network approach to detect android malware. In: 2013 12th international conference on machine learning and applications, IEEE, vol 2, pp 166–169
Rahman SSMM, Saha SK (2018) Stackdroid: evaluation of a multi-level approach for detecting the malware on android using stacked generalization. In: International conference on recent trends in image processing and pattern recognition, Springer, pp 611–623
Shabtai A, Kanonov U, Elovici Y, Glezer C, Weiss Y (2012) “Andromaly”: a behavioral malware detection framework for android devices. J Intell Inf Syst 38(1):161–190
Sheen S, Anitha R, Natarajan V (2015) Android based malware detection using a multifeature collaborative decision fusion approach. Neurocomputing 151:905–912
Shen T, Zhongyang Y, Xin Z, Mao B, Huang H (2014) Detect android malware variants using component based topology graph. In: 2014 IEEE 13th international conference on trust, security and privacy in computing and communications, IEEE, pp 406–413
Suarez-Tangil G, Tapiador JE, Peris-Lopez P, Pastrana S (2015) Power-aware anomaly detection in smartphones: an analysis of on-platform versus externalized operation. Pervas Mobile Comput 18:137–151
Tam K, Khan SJ, Fattori A, Cavallaro L (2015) Copperdroid: automatic reconstruction of android malware behaviors. In: Ndss
Tong F, Yan Z (2017) A hybrid approach of mobile malware detection in android. J Parallel Distrib Comput 103:22–31
Tramontana E, Verga G (2019) Mitigating privacy-related risks for android users. In: 2019 IEEE 28th international conference on enabling technologies: infrastructure for collaborative enterprises (WETICE), IEEE, pp 243–248
Vinayakumar R, Alazab M, Soman K, Poornachandran P, Venkatraman S (2019) Robust intelligent malware detection using deep learning. IEEE Access 7:46717–46738
Wang W, Wang X, Feng D, Liu J, Han Z, Zhang X (2014) Exploring permission-induced risk in android applications for malicious application detection. IEEE Trans Inf Forensics Secur 9(11):1869–1882
Wang W, Zhao M, Wang J (2019) Effective android malware detection with a hybrid model based on deep autoencoder and convolutional neural network. J Ambient Intell Humaniz Comput 10(8):3035–3043
Wei F, Li Y, Roy S, Ou X, Zhou W (2017) Deep ground truth analysis of current android malware. In: International conference on detection of intrusions and malware, and vulnerability assessment, Springer, pp 252–276
Wei TE, Mao CH, Jeng AB, Lee HM, Wang HT, Wu DJ (2012) Android malware detection via a latent network behavior analysis. In: 2012 IEEE 11th international conference on trust, security and privacy in computing and communications, IEEE, pp 1251–1258
Wu DJ, Mao CH, Wei TE, Lee HM, Wu KP (2012) Droidmat: Android malware detection through manifest and API calls tracing. In: 2012 seventh Asia joint conference on information security, IEEE, pp 62–69
Xiao X, Zhang S, Mercaldo F, Hu G, Sangaiah AK (2019) Android malware detection based on system call sequences and LSTM. Multimed Tools Appl 78(4):3979–3999
Xu R, Saïdi H, Anderson R (2012) Aurasium: practical policy enforcement for android applications. In: Presented as part of the 21st \(\{\)USENIX\(\}\) security symposium (\(\{\)USENIX\(\}\) Security 12), pp 539–552
Yang L, Ganapathy V, Iftode L (2011) Enhancing mobile malware detection with social collaboration. In: 2011 IEEE third international conference on privacy, security, risk and trust and 2011 IEEE third international conference on social computing, IEEE, pp 572–576
Yewale A, Singh M (2016) Malware detection based on opcode frequency. In: 2016 international conference on advanced communication control and computing technologies (ICACCCT), IEEE, pp 646–649
Yuxin D, Siyi Z (2019) Malware detection based on deep learning algorithm. Neural Comput Appl 31(2):461–472
Zhou Y, Jiang X (2012) Dissecting android malware: characterization and evolution. In: 2012 IEEE symposium on security and privacy, IEEE, pp 95–109
Zhu HJ, Jiang TH, Ma B, You ZH, Shi WL, Cheng L (2018) Hemd: a highly efficient random forest-based malware detection framework for android. Neural Comput Appl 30(11):3353–3361
Zhu HJ, You ZH, Zhu ZX, Shi WL, Chen X, Cheng L (2018b) Droiddet: effective and robust detection of android malware using static analysis along with rotation forest model. Neurocomputing 272:638–646
Mahindru, A., Sangal, A.L. SemiDroid: a behavioral malware detector based on unsupervised machine learning techniques using feature selection approaches. Int. J. Mach. Learn. & Cyber. 12, 1369–1411 (2021). https://doi.org/10.1007/s13042-020-01238-9