1 Introduction

Detection of malware on smartphones has become a major concern for the research community. By the end of 2019, the number of Android users worldwide reached 3.3 billion.Footnote 1 Android is based on the Linux kernel and provides useful services such as security configuration, process management, and others. The primary reasons for the growth of the Android operating system are its open nature and freely available apps. At the end of July 2019,Footnote 2 the Google Play store hosted 2.7 million free and paid apps. Downloads from the Google Play store increased by 13%Footnote 3 over the previous year. The Android operating system follows the principle of privilege separation, whereby each app has its own distinct system identity, i.e., a group-ID and a Linux user-ID. Each app runs in a process sandbox and must request permission to use resources that lie outside its sandbox. Depending on the sensitivity of a permission, the system either grants it automatically or prompts the user to approve or reject the request. Permissions granted by users include access to the calendar, camera, body sensors, microphone, contacts, location, SMS, and device storage. To defend the official Google marketFootnote 4 from malware-infected apps, Google introduced Google Bouncer in 2012, which scans new apps at the time of their launch. However, Google Bouncer has a limitation: it can easily be fingerprinted.Footnote 5 It is not very difficult for malware apps to bypass Google's security check, enter the official Google market,Footnote 6 and ultimately reach users' devices. By taking advantage of permissions, cyber-criminals build malware apps on a daily basis and invite users to install them. More than two billion active Android devices are present in the market.Footnote 7 To overcome the drawbacks of Bouncer and to protect Android devices, Google introduced Google Play Protect.
Google Play Protect has the capability to protect data in real time. However, according to a study,Footnote 8 G-Data security experts counted 4.18 million malware applications by the end of 2019 and discovered over 750,000 new malware applications during the first quarter of 2020.

Android apps work on a permission-based model [15]. The Android operating system provides protection at four levels, categorizing permissions asFootnote 9 "signature", "signature or system", "normal", and "dangerous". In our study, we do not consider "signature" and "signature or system" permissions because they are system-granted. We only consider "normal" and "dangerous" permissions, which are granted by the user. Normal permissions do not pose any risk to the user's privacy: if such a permission is listed in the manifest file, the system grants it automatically. Dangerous permissions, on the other hand, give access to the user's confidential data, and it is entirely up to the user to grant or revoke a permission or set of permissions.
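As an illustration of this two-level view, the sketch below partitions an app's requested permissions into "normal" and "dangerous". The protection-level table is a small hypothetical excerpt for illustration, not the full Android permission list.

```python
# Minimal sketch: split one app's requested permissions by protection level.
# PROTECTION_LEVELS is an illustrative excerpt, not the complete Android list.
PROTECTION_LEVELS = {
    "android.permission.INTERNET": "normal",
    "android.permission.VIBRATE": "normal",
    "android.permission.READ_CONTACTS": "dangerous",
    "android.permission.SEND_SMS": "dangerous",
    "android.permission.CAMERA": "dangerous",
}

def split_by_level(requested):
    """Return (normal, dangerous) permission lists for one app."""
    normal = [p for p in requested if PROTECTION_LEVELS.get(p) == "normal"]
    dangerous = [p for p in requested if PROTECTION_LEVELS.get(p) == "dangerous"]
    return normal, dangerous

normal, dangerous = split_by_level(
    ["android.permission.INTERNET", "android.permission.SEND_SMS"]
)
```

In this view, only the "dangerous" list triggers a user prompt; the "normal" list is granted silently at install time.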

Selection of the right feature or feature sets has a great effect on the performance of malware detection [45, 46, 49, 62]. Feature selection is the procedure of selecting appropriate features from the total available features. Feature selection approaches are classified into two distinct groups: feature ranking methods and feature subset selection methods [45, 46, 49, 62]. Feature ranking orders the features on the basis of a scoring function [49, 62], whereas feature subset selection discovers an optimal feature subset [46]. In our study, we implemented ten distinct feature selection approaches to select the best features and retained only those feature sets that have excellent discriminatory power.
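The ranking family can be contrasted with a minimal sketch. The scoring function below (absolute difference of per-class feature means) is a toy criterion of our own choosing for illustration, not one of the paper's ten approaches: it scores each feature independently and sorts them, which is exactly the "ranking" pattern.

```python
# Sketch of a feature ranking approach: score each column independently,
# then order features by score. The score here is a toy criterion
# (|mean over malware rows - mean over benign rows|).
def rank_features(X, y):
    """Return [(feature_index, score), ...] sorted best-first."""
    n_feats = len(X[0])
    scores = []
    for j in range(n_feats):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        score = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
        scores.append((j, score))
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Toy Boolean data: feature 0 separates the classes, feature 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
ranking = rank_features(X, y)
```

A subset selection method would instead search over groups of columns jointly (e.g., greedily adding the feature that most improves a model), trading the ranking method's speed for the ability to capture feature interactions.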

In the literature [45, 46, 49, 62], researchers and academicians have applied distinct machine learning algorithms based on classification, regression, and clustering to develop Android malware detection models. The main flaw in their work is that they used labelled data sets to develop the detection models. To overcome this issue, in this study we consider five distinct unsupervised machine learning algorithms [i.e., K-means, farthest-first clustering, filtered clustering, density-based clustering, and self-organizing map (SOM)] to develop a model for Android malware detection.
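For intuition on the unsupervised setting, here is a minimal one-dimensional, two-cluster K-means sketch in pure Python: no class labels are consulted anywhere, and samples are grouped purely by distance to the cluster centroids. This is a didactic toy, not the paper's implementation.

```python
# Minimal 1-D, two-cluster K-means sketch. Labels are never used:
# points are assigned to the nearer centroid, centroids are recomputed,
# and the process repeats until (here) a fixed iteration budget is spent.
def kmeans_1d(points, iters=20):
    """Return the two clusters (sorted) found by 1-D K-means with k=2."""
    c0, c1 = min(points), max(points)  # simple, deterministic seeding
    for _ in range(iters):
        a = [p for p in points if abs(p - c0) <= abs(p - c1)]
        b = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)  # recompute centroids
    return sorted(a), sorted(b)

# Two well-separated groups of (hypothetical) feature scores.
low, high = kmeans_1d([0.1, 0.2, 0.15, 0.9, 0.95, 1.0])
```

In a detection pipeline the resulting clusters would then be interpreted (e.g., the cluster whose members trip anti-virus scanners is labelled "malware") only after clustering has finished, which is what keeps the training itself label-free.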

The phases followed in developing the malware detection model are demonstrated in Fig. 1. In the first stage, we collect Android application package (.apk) files from different repositories and identify their classes. In the second stage of our experiment, we extract permissions and API calls from the collected .apk files and consider them as features. In the third stage, we select the best features by using ten different feature selection approaches. Next, with the help of the selected features, we train five different unsupervised machine learning algorithms and build models. We compare the developed models using distinct performance parameters, i.e., intra-cluster distance, inter-cluster distance, accuracy, and F-measure. In the last stage, we validate our proposed model against existing techniques available in the literature.

Fig. 1 Flow chart of the proposed work

The novel and unique contributions of this research paper are:

  • To the best of our knowledge, this is the first work in which 500,000 unique apps belonging to 30 different categories of Android apps are collected. The extracted features are publicly available for researchers and academicians.Footnote 10 To build an effective and efficient malware detection model, we extract permissions, the rating of an app, the number of user downloads of an app, and API calls as features, and achieve a detection rate of 98.8% when compared to distinct anti-virus scanners.

  • We propose a new approach based on unsupervised machine learning that selects relevant features using feature selection. Our empirical results reveal that the suggested model is able to detect 98.4% of unknown malware in real-world apps.

  • The proposed framework is able to detect malware in Android apps using a 100% unlabelled data set.

  • In this study, we apply t test analysis to investigate whether there is a significant difference among the features selected by the feature selection approaches.

  • The proposed malware detection approach is able to detect malware in less time when compared to distinct anti-virus scanners available in the market.

The rest of the paper is organized as follows. In Sect. 2, we discuss work related to Android malware detection. In Sect. 3, we discuss the Android permission model. In Sect. 4, we present the formulation of the data set. Section 5 presents the feature selection approaches implemented in this study. In Sect. 6, we discuss the different machine learning algorithms. In Sect. 7, we present the different techniques used in the literature to detect malware in real-world apps. Section 8 describes the performance parameters, and the experimental setup is presented in Sect. 9. Section 10 contains the experimental results, i.e., which model is best at detecting malware in real-world apps. Finally, we discuss threats to validity in Sect. 11, and the conclusion of this empirical study is presented in Sect. 12.

2 Related work

Enck et al. [28] proposed the Kirin framework, which helps detect malware apps based on the permissions they request at installation time. Kirin is based on a set of rules that help mitigate the effect of malware in Android apps. Suarez-Tangil et al. [65] examined whether cloud-based detection or on-device detection is more power-efficient. They suggested a power model to compare both methods with the help of machine learning algorithms. Empirical results reveal that the cloud-based detection method is more effective and the better choice to detect malware. Cui et al. [26] proposed a malware detection model based on cloud computing using network packets. They used the principles of data mining to reduce the number of packet branches by learning whether a packet is useful for malware detection or not. They proposed SMMDS, which works on the principles of machine learning algorithms to detect malware. Chen et al. [23] proposed a solution that monitors the behavior of smartphones when they send a user's private information to an external source. But the solution provided in their study is not effective, because it cannot support real-time detection. Narudin et al. [53] proposed STREAM, which automatically installs and runs Android apps and extracts features from them. The extracted features are then used to train machine learning classifiers to detect malware in Android apps. STREAM has a disadvantage: it takes a lot of system resources and time to load the data. Wei et al. [73] built a malware detection model based on the anomalous behavior of Android apps. They developed a model considering network information as a feature, using the Naïve Bayes and Logistic machine learning algorithms, and achieved a higher accuracy rate. Ali et al. [11] suggested a malware detection model based on a Gaussian mixture. They collected features based on hardware utilization, such as CPU, memory, battery, and so on, and trained the model with the help of a Gaussian mixture. But the model proposed by them has a limitation: it needs a remote server for computation. Dixon et al. [27] developed a model using the behavior of a smartphone's battery life when infected by malware. But the model proposed by them is not able to detect some sophisticated malware.

Tong and Yan [67] proposed a hybrid approach to detect Android malware using individual system calls and sequential system calls related to accessing files and networks. Their approach is able to detect the behavior of an unknown app and achieved a detection rate of 91.76%. But the presented approach has a limitation: it cannot support real-time detection. Quan et al. [59] used three different feature sets, i.e., native code, system calls, and API calls, to detect Android malware. The detection rate depends upon a predefined threshold value. Ng et al. [54] developed a model using the Dendritic Cell Algorithm, considering system calls as features. They selected the best features by implementing statistical methods and achieved a higher detection rate. Sheen et al. [63] proposed an Android malware detection system considering API calls and permissions as features. They chose features using the Relief algorithm to train three different classifiers: J48, SVM, and Naïve Bayes. The detection rate is good, but the system also consumes a large amount of resources and its computational burden is too high. Fung et al. [31] proposed a decision model, RevMatch, which works on the principle of malware detection history to decide whether an Android app is infected with malware or not. This approach does not provide real-time detection. Babaagba and Adesanya [12] compared the performance of supervised and unsupervised machine learning algorithms, with and without feature selection approaches. Their empirical study was performed on 149 Android apps, and the results reveal that the model developed with a feature selection approach and a supervised machine learning algorithm achieved a higher detection rate than the model developed using an unsupervised machine learning algorithm. Yewale and Singh [78] proposed a malware detection model based on opcode frequency. Experiments were performed on 100 distinct files, achieving a 96.67% detection rate using SVM as the machine learning algorithm.

Enck et al. [29] proposed TaintDroid, which works on the principle of tracking information flow in the network. TaintDroid tracks the malicious behavior of apps communicating through files, program variables, and inter-process messages. The process is too time-consuming to label an app as benign or malware. Abawajy and Kelarev [2] proposed ICFS, which detects Android malware by incorporating feature selection approaches and machine learning classifiers. Guo et al. [32] modelled smartphone network behavior using Naïve Bayes as the machine learning algorithm. They built patterns from benign and malware apps to discover malware among unlabeled apps. In a recent study [68], the authors presented the mechanism by which an app breaches user privacy to gain access to private user data. They proposed a general and novel defence solution to protect resources and data in Android-based devices. Rahman and Saha [61] proposed StackDroid, a multi-level architecture used to minimize the error rate. They detected malware at two different levels: in the first level they consider multi-layer perceptron, stochastic gradient descent, random forest, and extremely randomized trees, and in the second level they consider extreme gradient boosting as the machine learning classifier to detect Android malware. Barrera et al. [13] proposed a methodology that works on the principles of the permission model by implementing a self-organizing map (SOM) on a collected data set of 1,100 Android apps. They analyzed the Android permission model, which is used to investigate malware apps on Android. The SOM they implemented gives a 2-dimensional visualization of high-dimensional data.

Alazab et al. [4] proposed an effective classification model that combines permission requests and API calls. The API calls were divided into three groups, i.e., the ambiguous group, the risky group, and the disruptive group. Experiments were performed on 27,891 malware-infected Android apps, and the proposed model achieved an F-measure of 94.3%. Xiao et al. [75] proposed a novel detection approach based on the principle of deep learning. In their study, the authors treat the semantic information in system call sequences as natural language and construct a classifier based on the long short-term memory (LSTM) language model. Empirical results reveal that the proposed approach achieved a detection rate of 96.6%. Yuxin and Siyi [79] proposed a malware detection model based on the principle of the Deep Belief Network (DBN). In their study, they compare the performance of the proposed model with three baseline malware detection models that use support vector machines, decision trees, and the k-nearest neighbor algorithm as classifiers. Experimental results indicate that the autoencoder can effectively model the underlying structure of the input data and significantly reduce the dimensions of the feature vectors.

Vinayakumar et al. [69] proposed an effective zero-day malware detection model based on image processing techniques with optimal parameters for machine learning algorithms (MLAs) and deep learning architectures. Experiments were performed on two distinct data sets, and they achieved detection rates of 96.2% and 98.8% with the proposed detection model. Arora et al. [9] proposed a framework named PermPair that constructs and compares graphs for malware and normal samples by extracting permission pairs from the manifest file of an app. Empirical results reveal that the proposed scheme is successful in detecting malicious samples with an accuracy of 95.44% when compared to mobile anti-malware apps. Lee et al. [42] proposed a malware detection model that learns the generalized correlation between obfuscated string patterns from an application's package name and the certificate owner name. Experimental results reveal that the proposed model is robust to obfuscation and sufficiently lightweight for Android devices.

Alzaylaee et al. [6] proposed DL-Droid, based on the principle of deep learning. Experiments were performed on 30,000 distinct Android apps. Empirical results reveal that DL-Droid can achieve up to a 97.8% detection rate with dynamic features only and a 99.6% detection rate with dynamic and static features combined. Ma et al. [44] proposed a malware detection approach based on the control flow graph of the app to obtain API information. On the basis of the API information, they constructed three different data sets, based on Boolean, frequency, and time-series representations. Using these three data sets, three different malware detection models were constructed. Experiments were conducted on 10,010 benign applications and 10,683 malicious applications. The results show that the detection model achieves 98.98% detection precision. Jerbi et al. [36] proposed artificial malware-based detection (AMD), which is based on extracted malware patterns that were generated artificially. Experiments were performed on balanced and imbalanced data sets and achieved an accuracy of 99.69% for the balanced data set and 99.64% for the imbalanced data set. Table 1 lists the framework name or the authors who proposed each approach in the literature, along with its detection type, features used, implemented algorithm, place of analysis, and the major observations of the study.

The previous research mentioned above has the following limitations: use of a limited data set, a high detection rate only on a limited data set, high computational burden, implementation of a limited number of feature selection approaches, implementation of a limited number of classification algorithms using a 100% labelled data set, and inability to detect sophisticated malware. To overcome the first limitation, in this study we collect 500,000 Android apps belonging to thirty different categories from the repositories mentioned in Sect. 4. Further, to overcome the other limitations, we implement ten distinct feature selection approaches on the extracted feature data set (i.e., permissions and API calls are considered as features in this study). Next, the selected features are used as input to develop models using unsupervised machine learning algorithms (meaning no labelled data is required to develop the models), so that a suitable model is built to identify malware in real-world apps.

Table 1 Dynamic-analysis-based smartphone malware detection approaches presented in the literature

2.1 Research questions

To develop a malware detection model for Android with a better detection rate and to cover the gaps present in the literature, we consider the following research questions in this paper:

RQ1. Which malware detection model is most appropriate to detect malware from real-world apps?

This question helps in finding the most appropriate model for malware detection in Android. In this work, we build 50 distinct models by considering ten distinct feature selection approaches and five different machine learning techniques. Further, to identify which model is more appropriate for malware detection, we consider four distinct performance parameters in our study.

RQ2. Is the presented malware detection framework effective at detecting malware on Android devices?

The goal of this question is to investigate the performance of our malware detection approach. For this, we compare the performance of our developed model with some existing techniques available in the literature.

RQ3. Does a subset of features perform better than all the extracted features for the task of detecting whether an app is malicious or not?

The aim of this question is to evaluate the features and investigate their relationship with benign and malware apps. Distinct kinds of feature reduction approaches are considered for finding subsets of features that are able to detect whether an app is malicious or not.

RQ4. Among the different implemented feature ranking approaches, which approach works best for detecting whether an Android app belongs to the benign class or the malware class?

In a feature ranking approach, the efficiency of the machine learning algorithms is affected by the characteristics and nature of the malware data set. Distinct approaches with various criteria are implemented to rank the collected feature sets. Four distinct performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure, and accuracy, are considered in this study to compare the distinct feature-ranking approaches.

RQ5. Among the applied feature subset selection approaches, which approach performs best for the task of detecting malware in Android apps?

To determine the subset of features that is appropriate for detecting whether an Android app is benign or malware, we consider feature subset selection approaches. In this work, we compare the distinct approaches using four performance criteria, i.e., intra-cluster distance, inter-cluster distance, F-measure, and accuracy.
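The two cluster-quality criteria can be sketched concretely. In this hedged sketch, intra-cluster distance is the average Euclidean distance of a cluster's points to its own centroid (lower is tighter), and inter-cluster distance is the distance between two cluster centroids (higher is better separated); the data points are hypothetical 2-D feature vectors.

```python
# Sketch of the intra-cluster and inter-cluster criteria used as
# performance parameters: tight clusters have small intra-cluster
# distance, well-separated clusters have large inter-cluster distance.
import math

def centroid(cluster):
    """Component-wise mean of a list of equal-length points."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def intra_cluster(cluster):
    """Average distance of points to their own centroid."""
    c = centroid(cluster)
    return sum(euclid(p, c) for p in cluster) / len(cluster)

def inter_cluster(c1, c2):
    """Distance between the two cluster centroids."""
    return euclid(centroid(c1), centroid(c2))

# Hypothetical 2-D feature vectors for two clusters.
benign = [[0.0, 0.0], [0.0, 2.0]]
malware = [[4.0, 0.0], [4.0, 2.0]]
intra = intra_cluster(benign)
inter = inter_cluster(benign, malware)
```

When comparing feature selection approaches, one then prefers the approach whose clusters minimize the intra-cluster value while maximizing the inter-cluster value.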

RQ6. How do the feature subset selection approaches compare with the feature ranking approaches?

In this paper, a pair-wise t test is used to determine whether the feature subset selection approaches are more appropriate than the feature ranking approaches or whether both behave equally well.
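A paired t test of this kind can be sketched as follows. The accuracy values are hypothetical, and only the t statistic is computed (the p-value lookup against the t distribution is omitted); a large |t| over the paired differences is what would indicate a significant difference between the two approaches.

```python
# Minimal paired t-test sketch: t = mean(d) / (sd(d) / sqrt(n)),
# where d are the per-run differences between two approaches.
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b of equal length."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical accuracies of two approaches over the same five runs.
subset_acc = [0.97, 0.96, 0.98, 0.97, 0.96]
ranking_acc = [0.95, 0.94, 0.95, 0.96, 0.94]
t = paired_t_statistic(subset_acc, ranking_acc)
```

Pairing by run matters: it cancels run-to-run variation (same data split, same seed) so that only the between-approach difference contributes to the statistic.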

RQ7. Do the feature selection approaches affect the outcome of the unsupervised machine learning approaches?

It is seen that a number of feature selection approaches perform extremely well with specific unsupervised machine learning methods. Therefore, in this research work, distinct feature selection approaches are evaluated using distinct unsupervised machine learning approaches to measure their performance. Further, we also examine how the performance of each feature selection approach varies across the distinct unsupervised machine learning approaches.

3 Android permission model

Android security depends on a permission-based model that gates access to functionality or features that could impact the user's privacy. Android apps are written in the Java language and run in a Dalvik virtual machine. Android apps demand permissions from users at installation and at run time. Permissions such as sending messages, making phone calls, or access to the vibrator or keeping the device screen on are requested by an app. The user can grant or reject a permission request made by an app. This facility of granting or revoking permissions is available from Android version 6.0 onwards. Currently in Android, there are more than 110 different features whose access is gained through these permissions. Android is designed so that if any third-party app demands new functionality, the developer has the privilege to define a new permission, known as a developer-defined or self-defined permission. Taking advantage of this, cyber-criminals define new permissions on a regular basis so that they can access users' private data for their own benefit.

Fig. 2 Class diagram of com.ToGoHome Android app

For an in-depth analysis of the permission model, we built an app named "ToGoHome".Footnote 11 Figure 2 represents the class diagram of the Android app, showing the classes, attributes, and methods involved. The package contains 10 distinct classes that contain the logic of cab reservation. In the first step, when the user clicks on the "HomePage" menu and searches for a cab, control is directed toward the class named "Pickup", whose data is collected from the current location of the user using the Global Positioning System (GPS), and the class named "myLocationListener" is called. The "Pickup" class exposes seven different methods, i.e., updateDisplayDate:void, updateDisplayTime:void, GetDistance:void, checkAddress:bool, onConfigurationChanged:void, onActivityResult:void, and onCreate:void. If the user confirms his pick-up location through GPS, the booking is confirmed by calling "BookingDetails". While travelling, the user tracks his live location by calling the "Route" class.Footnote 12 After completing the journey, the app calls the "Pay" class, which directs the user to the payment website. Throughout this process, the following Android permissions are used: ACCESS_FINE_LOCATION, ACCESS_COARSE_LOCATION, ACCESS_NETWORK_STATE, INTERNET, SEND_SMS, and RECEIVE_SMS.

4 Formulation of data set

Figure 3 demonstrates the phases followed in extracting features from Android apps. In the first phase, we identify the URLs from which Android apps are to be collected (mentioned in Sect. 4.1). In the second phase, we use an app crawler to download the apps from the identified URLs. Our app crawler can download as many apps as possible without impacting the Android app repository. To perform dynamic analysis of the collected Android apps, we use Android Studio as an emulator (mentioned in Sect. 4.2). Further, we write a program in the Java language to extract permissions and API calls from the apps and save them into a .csv file for developing the Android malware detection model. The extracted features are publicly available for researchers and academicians.Footnote 13

Fig. 3 Extraction of features from .apk files

4.1 Collection of .apk files

Previous studies mentioned in Table 1 used only limited data sets of Android apps to examine associations with the malware or benign class; in addition, in the literature [72, 80], academicians and researchers did not mention the categories to which the apps belong. Therefore, it is not possible to draw generic conclusions relevant to all Android apps and systems. To overcome this gap, we collect apps of thirty different categories, which generalize and strengthen our outcomes. We collect 550,000 .apk files to build our data set, from Google's Play store,Footnote 14 hiapk,Footnote 15 appchina,Footnote 16 Android,Footnote 17 mumayi,Footnote 18 gfan,Footnote 19 slideme,Footnote 20 and pandaapp.Footnote 21 Among these 550,000 benign .apk files, 500,000 are distinct. Further, features are extracted after deleting virus-infected apps reported by VirusTotalFootnote 22 and Microsoft Windows Defender.Footnote 23 VirusTotal identifies malware-affected apps by using 70 different antivirus engines simultaneously. A total of 55,000 malware samples were collected from three different repositories. 1,929 botnet samples were collected from [38], covering 14 distinct botnet families. The Android Malware Genome project [80] contains a data set of 1,200 malware samples that covers the currently present Android malware families. We collected about 56,871 samples from AndroMalShareFootnote 24 along with their package names. After removing duplicate packages from the collected data set, 50,000 unique malware samples remain in our study. Both benign and malware apps were collected from the above-mentioned sources at the end of December 2018. Table 2 shows the number of .apk files belonging to different categories, i.e., business, comics, communication, education, and so on. To better differentiate between benign and malware apps, we consider .apk files belonging to the normal, trojan, backdoor, worm, botnet, and spyware familiesFootnote 25 mentioned in Table 2.
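Duplicate removal of the kind described above can be sketched by content hashing: files whose digests match are treated as one sample. The byte strings below stand in for real .apk contents; in practice the full bytes of each downloaded file would be hashed.

```python
# Sketch of de-duplicating collected .apk files by SHA-256 digest.
# The byte strings are stand-ins for real .apk file contents.
import hashlib

def unique_samples(blobs):
    """Keep the first occurrence of each distinct file content."""
    seen, unique = set(), []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

apks = [b"apk-A", b"apk-B", b"apk-A"]  # third file duplicates the first
deduped = unique_samples(apks)
```

Hashing full file contents catches byte-identical re-uploads across markets; catching repackaged variants (same code, different signature) would require package-name or code-level comparison instead.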

Table 2 Categories of .apk files belonging to their respective classes

4.2 Extraction of features

After collecting unique samples of .apk files from the various sources mentioned in the previous subsection, we extract permissions and API calls from each .apk file. Extraction of permissions and API calls is performed with the help of an emulator (in our study we use Android Studio). The emulator provides the same API level and execution environment as our smartphones provide. In our study, to extract permissions and API calls from Android apps, we use Android version 6.0 Marshmallow (i.e., API level 23) and form our data set for the experiments. Previously developed frameworks or approaches used earlier versions of Android to extract features. There are two reasons for selecting this Android version: first, it asks the user to grant or revoke permission to use the resources of the smartphone, and second, it covers 28.1% of Android devices, which is higher than the other versions present in the market.Footnote 26 To extract features from the collected .apk files, we execute each of them in the emulator and extract permissions from "AndroidManifest.xml" using self-written Java code. These permissions are demanded by apps during installation and at run time. Repeating the same process, we extract permissions from 500,000 different Android apps and record them in .csv file format. This extracted data set, listing the names of the permissions, is publicly available for researchers.Footnote 27,Footnote 28 Previous researchers used a limited set of features to develop malware detection models. To overcome this gap, in this study we collect 1,532 permissions and 310 API calls, which help in building an effective and efficient Android malware detection model. Hence, each collected app can be represented as a 1,842-dimensional Boolean vector, where "1" implies that the app requires the permission and "0" implies that it does not. It is very common for distinct apps to request a similar set of permissions for their execution. The permissions overview given by GoogleFootnote 29 is used to describe the behavior of a permission, i.e., "dangerous" or "normal".
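The Boolean encoding described above can be sketched as follows. The vocabulary here is a tiny hypothetical subset of the paper's 1,842 features (permission names plus API call names), used only to show the "1 if present, 0 otherwise" mapping.

```python
# Sketch: encode one app as a Boolean vector over a fixed feature
# vocabulary. VOCAB is a tiny illustrative subset of the real 1,842
# features (permissions + API calls), not the actual list.
VOCAB = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.telephony.SmsManager.sendTextMessage",
]

def to_boolean_vector(app_features):
    """1 where the app uses the vocabulary feature, 0 otherwise."""
    present = set(app_features)
    return [1 if f in present else 0 for f in VOCAB]

vec = to_boolean_vector(
    ["android.permission.INTERNET",
     "android.telephony.SmsManager.sendTextMessage"]
)
```

Keeping the vocabulary order fixed across all apps is what makes the resulting rows comparable: every .csv column then means the same permission or API call for every sample.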

4.3 Formulation of feature sets

Several approaches have been developed for Android malware detection [1, 13, 45, 46, 49, 50, 74]. In this study, we divide the extracted API calls and permissions into thirty different feature sets, which help in developing the malware detection model. Table 3 displays the basic descriptions of the feature sets considered in our work.

Table 3 Formulation of feature sets (permissions, API calls, number of user downloads of an app, and rating of an app)

5 Feature selection approaches

Previous studies show that a number of authors applied different feature ranking approaches to detect malware in Android apps and achieved good detection rates. This indicates that the outcome of a malware detection model relies on the features that are taken as input to design the model. Selecting suitable feature sets is an essential data preprocessing task in machine learning. In the field of malware detection, some researchers have used selection approaches to select an appropriate set of features. In this paper, we implement ten distinct feature selection approaches on a large collection of 1,842 features (divided into thirty distinct feature sets) to identify the best subset of features, which assists us in detecting malware with a better detection rate and also minimizes the number of misclassification errors. Feature ranking approaches and feature subset selection approaches can be defined in the following manner [45, 46, 49, 62]:

  • Feature ranking approaches These approaches use certain conclusive criteria to rank the features. On the basis of their ranks, appropriate features can then be selected to build the model [49, 62].

  • Feature subset selection approaches These approaches aim to search for a subset of features that has good detective capability [4, 46].

5.1 Feature ranking approaches

These approaches rank features separately without applying any training algorithm. The ranking of features depends upon their score. On the basis of our investigation of previous studies, the majority of approaches are capable of calculating a grade for every feature. In this research, we employ six different ranking approaches to rank the features; they are explained below.

5.1.1 Gain-ratio feature selection

In this selection approach, feature ranking work on the prediction of the gain-ratio in relation to the class [49, 55]. The “Z” known as the gain-ratio of feature is determined as:

$$\begin{aligned} \text{ Gain-Ratio }=\frac{Gain(Z)}{SplitInfo_Z(X)}, \end{aligned}$$
(1)

where Gain \((Z)=I(X)-E(Z)\) and X denotes the set containing m instances belonging to n distinct classes. The expected information needed to classify a given sample is calculated by the following equation:

$$\begin{aligned} I(X)=-\sum _{i=1}^{n}p_i\log _2 (p_i). \end{aligned}$$
(2)

Here \(p_i\) is the probability that a random sample belongs to class \(C_i\) and is estimated by \(z_i/z,\) where \(z_i\) is the number of instances of class \(C_i\) and z is the total number of instances.

Let \(z_{ij}\) denote the number of instances of class \(C_i\) in subset \(X_j,\) where the feature Z partitions X into subsets \(X_1, \ldots , X_t.\) The expected information based on this partitioning by Z is given by

$$\begin{aligned} E(Z)=\sum _{j=1}^{t}\frac{z_{1j}+z_{2j}+\cdots +z_{nj}}{z}\, I(X_j). \end{aligned}$$
(3)

\(SplitInfo_Z(X)\) is measured by utilizing the following equation:

$$\begin{aligned} SplitInfo_Z(X)=-\sum _{j=1}^t \frac{|X_j|}{|X|}\log _2 \left( \frac{|X_j|}{|X|}\right) \end{aligned}$$
(4)

The value of \(SplitInfo_Z(X)\) represents the information generated by splitting the training data set X into t partitions corresponding to the t outcomes of a test on the feature Z.
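The gain-ratio computation of Eqs. (1)–(4) can be sketched in a few lines of Python (an illustrative toy example, not the study's MATLAB implementation; the helper names and sample data are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """I(X): expected information needed to classify a sample (Eq. 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Gain(Z) / SplitInfo_Z(X) for a discrete feature Z (Eq. 1)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    # E(Z): expected information after partitioning X by the values of Z (Eq. 3)
    e_z = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - e_z
    # SplitInfo_Z(X): information generated by the split itself (Eq. 4)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# toy data: a binary permission feature vs. benign(0)/malware(1) labels
perm = [1, 1, 0, 0, 1, 0]
label = [1, 1, 0, 0, 1, 1]
print(round(gain_ratio(perm, label), 3))  # 0.459
```

Features would then be ranked by this score, from highest to lowest.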

5.1.2 Chi-squared test

This test is employed to examine the independence of two events [49, 57]; in our work, features are ranked by the significance of their chi-squared statistic with respect to the class. A higher value implies a stronger dependence between the feature and the class, and consequently such features can be regarded as more relevant for detecting malware in Android apps. Chi-squared attribute evaluation measures the worth of a feature by computing the value of the chi-squared statistic with respect to the class. The initial hypothesis \(H_0\) is the assumption that the two features are unrelated, and it is tested by the chi-squared formula:

$$\begin{aligned} \chi ^2=\sum _{i=1}^r\sum _{j=1}^c\frac{(O_{ij}-E_{ij})^2}{E_{ij}} \end{aligned}$$
(5)

where \(O_{ij}\) is an observed frequency and \(E_{ij}\) is the expected (theoretical) frequency asserted by the null hypothesis. The greater the value of \(\chi ^2,\) the stronger the evidence against the hypothesis \(H_0\).
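As a small illustration of Eq. (5), the statistic can be computed over a contingency table of feature presence versus class; the table values below are made up:

```python
def chi_squared(table):
    """Chi-squared statistic over an r x c contingency table of observed
    frequencies O_ij; E_ij is derived from the marginals under H0."""
    r, c = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(r)) for j in range(c)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_sums[i] * col_sums[j] / total  # E_ij under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# rows: feature present / absent; columns: malware / benign
print(round(chi_squared([[30, 10], [5, 25]]), 2))  # 23.33
```

The larger the statistic, the higher the feature is ranked.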

5.1.3 Information-gain feature selection

In Info-gain, features are selected on the basis of their relation to the class [49, 55]. In this approach, entropy is used as a criterion of impurity in a training set S (the data set), which makes it possible to define a measure reflecting the additional information about Y (a random feature) provided by X (a second random feature), i.e., the amount by which the entropy of Y decreases. It is given by

$$\begin{aligned} IG = H(Y) - H(Y|X) = H(X) - H(X|Y) \end{aligned}$$
(6)

The information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the IG criterion is that it is biased in favor of features with more values even when they are not more informative [49, 55].
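Eq. (6) can be illustrated for discrete variables with a short sketch (a hypothetical toy example, not the study's code):

```python
import math
from collections import Counter

def H(xs):
    """Shannon entropy of a discrete variable."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def info_gain(x, y):
    """IG = H(Y) - H(Y|X) for a discrete feature x and class label y."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    h_y_given_x = sum(len(g) / n * H(g) for g in groups.values())
    return H(y) - h_y_given_x

x = [0, 0, 1, 1]        # e.g. a permission bit
y = [0, 0, 1, 1]        # the feature fully determines the class
print(info_gain(x, y))  # 1.0
```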

5.1.4 OneR feature selection

The OneR feature selection approach is also used for grading features [49, 55]. To rank individual features it utilizes a simple one-rule classification mechanism: continuous features are handled by dividing their range of values into a few disjoint intervals using a straightforward discretization method. In this study, we consider the features with better classification rates.

5.1.5 Principal component analysis (PCA)

Attribute reduction is accomplished by implementing PCA on our collected data set. PCA transforms a high-dimensional data space into a low-dimensional one; the features present in the low-dimensional space are of prime importance for detecting malware [49, 70]. Since the correlation among several features is high, PCA is utilized to map them to a new set of features that are not highly correlated, named principal component domain features. Further, a small number of principal components is sufficient to identify significant patterns in the data. The detailed phases of PCA are demonstrated in Fig. 4.

Fig. 4
figure 4

Framework of PCA calculation

The feature data set is collected in the form of an \(m \times n\) matrix that contains n data samples and m extracted features. In the second phase, normalization of the feature data set is performed by computing

$$\begin{aligned} \mu _j=\frac{1}{n}\sum _{i=1}^n x_i^j \end{aligned}$$

and replacing \(x^j\) with \((x^j-\mu _j)\). Next, we calculate the eigenvalues and eigenvectors of the covariance matrix in the MATLAB environment. To select the first k principal components, we perform the following steps:

while (i = 1 to m) do
  evaluate \(cumvar = \sum _{i=1}^k \lambda _{ii} \big / \sum _{i=1}^m \lambda _{ii}\)
  if \((cumvar \ge 0.99)\) or \((1 - cumvar \le 0.01)\) then
    return k (99% of the variance is retained)
  end if
end while

Here cumvar denotes the cumulative variance and \(\lambda _{ii}\) represents the eigenvalues, sorted in descending order.

After this evaluation, the reduced feature sets are selected for training purposes.
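The PCA phases above (mean normalization, eigendecomposition of the covariance matrix, and retaining the leading k components until 99% of the variance is preserved) can be sketched as follows. This is an illustrative NumPy version with made-up data; the study performed the computation in the MATLAB environment:

```python
import numpy as np

def pca_reduce(X, var_retained=0.99):
    """Project X onto the first k principal components, where k is the
    smallest number of components whose cumulative variance >= threshold."""
    Xc = X - X.mean(axis=0)                  # subtract mu_j from each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumvar = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumvar, var_retained) + 1)
    return Xc @ eigvecs[:, :k], k

# toy data: four features, of which two are near-duplicates of the others
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 2))])
Z, k = pca_reduce(X)
print(k)  # only 2 of the 4 components are needed to retain 99% of the variance
```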

5.1.6 Logistic regression analysis

For feature ranking, univariate logistic regression (ULR) analysis is considered to verify the degree of importance of every feature set [25, 49]. In the current work, we consider two parameters of the LR model to discover the importance of every feature and to rank the feature sets:

  1.

    Value of regression coefficient The coefficient of a feature indicates the degree of correlation of that feature set with malware.

  2.

    p value The p value, i.e., level of significance, shows whether the correlation is statistically significant.

5.2 Feature subset selection approaches

These approaches are employed to detect the appropriate subset of features which jointly have the best discriminative capability. They are based on the hypothesis that a model built from a well-chosen subset of features achieves a better detection rate and a lower number of misclassification errors than a model built from other feature combinations or from single features. Several approaches are feasible for identifying the right subset of features for detecting malware. In this work, four distinct feature subset selection approaches are considered; they are depicted below:

5.2.1 Correlation based feature selection

This approach selects a subset of features that are highly correlated with the class (i.e., benign or malware). In this research paper, Pearson's correlation (r: coefficient of correlation) has been used for measuring the dependency among features. A higher value of "r" between feature sets indicates a strong relation among these features. It implies that there is a statistical reason to expect that classes having lower (or higher) values of one feature also have lower (or higher) values of other highly correlated features [49].
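A minimal sketch of this idea, assuming binary permission features and a 0/1 class vector (the feature names, threshold and data are hypothetical):

```python
import math

def pearson_r(x, y):
    """Coefficient of correlation r between two feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_relevant(features, labels, threshold=0.5):
    """Keep features whose |r| with the class (0=benign, 1=malware) is high."""
    return [name for name, vals in features.items()
            if abs(pearson_r(vals, labels)) >= threshold]

feats = {"SEND_SMS": [1, 1, 1, 0], "CAMERA": [1, 0, 1, 0]}
labels = [1, 1, 1, 0]
print(select_relevant(feats, labels))  # ['SEND_SMS', 'CAMERA']
```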

5.2.2 Rough set analysis

This approach is an approximation of a conventional set, in terms of a pair of feature sets which provide the upper and the lower approximation of the original data set [46, 56]. Its main application is mining imperfect data, and here it is used to select a reduced set of features from the extracted feature sets. RSA uses three distinct notions: approximations, reduced attributes and the information system. The steps pursued to obtain the reduced subset by utilizing RSA are mentioned below and also demonstrated in Fig. 5.

(i) Approximation Let \(A=(C,Z), X\subseteq Z\) and \(Y\subseteq C.\) The upper \(({\bar{X}}Y)\) and lower \(({\underline{X}}Y)\) approximations of X are utilized to estimate Y. The upper approximation includes all the objects which may belong to the set, and the lower approximation includes all objects which certainly belong to the set. \({\bar{X}}Y\) and \({\underline{X}}Y\) are computed by utilizing the subsequent equations:

$$\begin{aligned} {\bar{X}}Y= & {} \{y_i\in U\mid [y_i]_{Ind(B)}\cap Y\ne 0\} \end{aligned}$$
(7)
$$\begin{aligned} {\underline{X}}Y= & {} \{y_i\in U\mid [y_i]_{Ind(B)}\subseteq Y\}, \end{aligned}$$
(8)

where \([y_i]_{Ind(B)}\) denotes the equivalence class of \(y_i\) under the relation Ind(B).

(ii) Reduced attributes The accuracy of the approximation of the set Z with respect to \(A\subseteq B\) is determined as:

$$\begin{aligned} \mu _B(A)=\frac{card({\underline{B}}Z)}{card({\bar{B}}Z)} \end{aligned}$$
(9)

The cardinality of a set is the number of elements contained in its lower or upper approximation. Further, all possible feature sets are considered whose accuracy is equivalent to the accuracy of the extracted feature sets.

(iii) Information system It is defined as \(Z=(C,B)\), where C is a universe consisting of a non-empty finite set of objects and B is a set of attributes with a finite number of elements. For each \(b \in B\), there exists a function \(F_b:C\rightarrow V_b\), where \(V_b\) denotes the value set of attribute b. For each \(A\subseteq B,\) there exists an equivalence relation known as the A-indiscernibility relation, \(IND_A(Z)\), which can be defined as:

$$\begin{aligned} IND_A(Z)=\{(x, y) \in C^2\mid \forall a\in A,\ a(x)=a(y)\}. \end{aligned}$$
(10)
Fig. 5
figure 5

Rough set theory framework

5.2.3 Consistency subset evaluation approach

This technique evaluates the importance of a subset of attributes by the level of consistency in the class values when the training instances are projected onto the subset. The consistency rate is calculated with the help of the inconsistency rate, where two data elements are considered inconsistent if they have the same feature values but belong to different class labels (i.e., benign or malware). For this work, the target variable has two distinct values (0 for benign apps and 1 for malware apps). Suppose a group of features (GF) has Z samples comprising z distinct instances, such that \(Z=X_1+X_2+\cdots +X_z.\) Suppose instance \(X_i\) appears in A samples in total, of which \(A_0\) samples are labeled 0 and \(A_1\) are labeled 1, with \(A=A_0+A_1.\) If \(A_1\) is less than \(A_0,\) then the inconsistency count for the instance \(X_i\) is \(INC=A-A_0.\) The inconsistency rate (INCR) of the feature set is computed by utilizing the succeeding equation:

$$\begin{aligned} INCR=\frac{\sum _{i=1}^z INC_i}{Z} \end{aligned}$$
(11)
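Eq. (11) can be illustrated with a small sketch; the helper and toy data below are hypothetical:

```python
from collections import Counter

def inconsistency_rate(rows, labels):
    """INCR: for each distinct feature pattern, count its samples minus the
    majority-class count, summed over patterns and divided by the total Z."""
    patterns = {}
    for row, y in zip(rows, labels):
        patterns.setdefault(tuple(row), []).append(y)
    inc = sum(len(ys) - max(Counter(ys).values()) for ys in patterns.values())
    return inc / len(labels)

# two apps share the same feature values but different labels -> inconsistent
rows = [[1, 0], [1, 0], [0, 1], [1, 1]]
labels = [1, 0, 0, 1]
print(inconsistency_rate(rows, labels))  # 0.25
```

Subsets with a lower inconsistency rate are preferred.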

5.3 Filtered subset evaluation

Filtered subset evaluation applies a subset evaluator to data that has been passed through an arbitrary filter [40]. The filtering technique is not based on any induction algorithm, which makes filtered subset evaluation scalable and fast. Figure 6 demonstrates the steps followed to find the subset of features by utilizing the filter method.

Fig. 6
figure 6

Feature selection by utilizing filter approach

6 Machine learning techniques

Various authors have applied unsupervised machine learning classifiers such as K-mean [18, 52] and Self-Organizing Maps (SOM) [18, 20] to detect Android malware, but they applied the clustering algorithms on limited data sets. To overcome this gap, in this study we implement five different clustering algorithms on our extracted data set: SOM, K-mean, farthest first clustering, filtered clustering and density-based clustering. The choice of clustering algorithms is based on their performance in the literature [18, 20, 52, 74].

6.1 Self-organizing maps (SOM)

SOM is a type of artificial neural network (ANN) that is trained with the help of unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction.Footnote 30 A SOM consists of neurons which have the same dimensionality as the input space and are arranged in a rectangular or hexagonal grid. SOM neurons can be viewed as pointers in the input space, where more neurons point to regions with a high concentration of inputs [13]. The training algorithm can be summarized in four basic steps:

  1.

    Initialize neuron weights \(W_i=[w_{i1}, w_{i2}, \ldots , w_{ij}]^T \in R^j.\)

  2.

    Present an input pattern \(x=[x_1, x_2, \ldots , x_j]^T\in R^j,\) corresponding to an app whose permissions are expressed in the form of a bit string. Calculate the distance between pattern x and each neuron weight \(W_R,\) and thereby identify the winning or best matching neuron Z as follows:

    $$\begin{aligned} \Vert x-W_z\Vert =min_R\{\Vert x-W_R\Vert \} \end{aligned}$$
    (12)

    In our study, we employed Euclidean distance as the distance metric, normalized to the range [0, 1].

  3.

    Adjust the weights of the winning neuron Z and all neighboring units:

    $$\begin{aligned} w_R(t+1)=w_R(t)+h_{ZR}(t)[x(t)-W_R(t)], \end{aligned}$$
    (13)

    where R is the index of the neighbor and t is an integer, the discrete time coordinate. The neighborhood kernel \(h_{ZR}(t)\) is a function of time and of the distance between neighbor neuron R and winning neuron Z. \(h_{ZR}(t)\) defines the region of influence that the input has on the SOM and consists of two components, the neighborhood function \(h(\Vert \cdot \Vert ,t)\) and the learning rate function \(\alpha (t):\)

    $$\begin{aligned} h_{ZR}(t)=h(\Vert b_r-b_z\Vert ; t)\alpha (t) \end{aligned}$$
    (14)

    where b is the location of the neurons.

    In our study, we used a Gaussian neighborhood function; the neighborhood kernel with the Gaussian function is

    $$\begin{aligned} h_{ZR}(t)=\exp \bigl (-\frac{\Vert b_r-b_z\Vert ^2}{2\sigma ^2(t)}\bigr )\alpha (t), \end{aligned}$$
    (15)

    where \(\sigma (t)\) defines the width of the kernel.

  4.

    Repeat steps 2–3 until the convergence criterion is satisfied.
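The four steps above can be sketched as a minimal NumPy SOM; the grid size, the exponential decay schedules and the toy permission bit-strings are illustrative assumptions, not the study's configuration:

```python
import numpy as np

def train_som(X, grid=(4, 4), epochs=50, alpha0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: steps 1-4 above with a Gaussian neighborhood
    kernel and exponentially decaying learning rate / kernel width."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, X.shape[1]))            # step 1: init weights
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        alpha = alpha0 * np.exp(-t / epochs)             # learning rate alpha(t)
        sigma = sigma0 * np.exp(-t / epochs)             # kernel width sigma(t)
        for x in X:
            z = np.argmin(np.linalg.norm(W - x, axis=1))  # step 2: winner Z
            d2 = np.sum((pos - pos[z]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2)) * alpha    # Gaussian h_ZR(t)
            W += h[:, None] * (x - W)                     # step 3: update
    return W

# toy permission bit-strings (1 = permission requested)
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], float)
W = train_som(X)
print(W.shape)  # (16, 3)
```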

6.2 K-mean

K-mean adopts the method of vector quantization. K-mean clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.Footnote 31 The K-mean algorithm works as follows:

  1.

    Specify the number of clusters K.

  2.

    Initialize centroids by first shuffling the data set and then randomly selecting K data points as centroids, without replacement.

  3.

    Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters is not changing.

    • Compute the sum of the squared distance between data points and all centroids.

    • Assign each data point to the closest cluster (centroid).

    • Compute the centroid of each cluster by taking the average of all data points that belong to it.
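The steps above can be sketched as follows (an illustrative NumPy version with toy data; the study's models were built in the MATLAB environment):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-mean sketch following the three steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: init
    for _ in range(iters):
        # assign each point to its closest centroid
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        assign = d.argmin(axis=1)
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # step 3: stop when stable
            break
        centroids = new
    return centroids, assign

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centroids, assign = kmeans(X, k=2)
print(assign[0] != assign[2])  # True: the two well-separated pairs end up apart
```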

6.3 Farthest first

Farthest first is based on the principle of a bounded metric space in which the first point is selected arbitrarily and each successive point is as far as possible from the set of previously selected points.Footnote 32 The working of farthest first clustering is described below [41]:

For each \(X_i=[x_{i,1},x_{i,2},\ldots ,x_{i,m}]\) in D that is described by m categorical attributes, we use \(f(x_{i,j}|D)\) to denote the frequency count of attribute value \(x_{i,j}\) in the data set. Then, a scoring function is designed for evaluating each point, defined as:

$$\begin{aligned} Score(X_i)={\sum _{j=1}^m f(x_{i,j}|D). } \end{aligned}$$
(16)
  1.

    Farthest first traversal (D: data set, k: integer) {

  2.

    randomly select first center;

  3.

    // select centers

  4.

    for (i = 2, \(\ldots\), k) {

  5.

    for (each remaining point) { calculate distance to the current center set; }

  6.

    select the point with maximum distance as new center; }

  7.

    // assign remaining points

  8.

    for (each remaining point) {

  9.

    calculate the distance to each cluster center;

  10.

    put it in the cluster with minimum distance; } }
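The listing above can be made runnable as a small sketch; for simplicity it uses Euclidean distance on numeric toy points, whereas the scoring function in Eq. (16) is designed for categorical attributes:

```python
import math

def farthest_first(points, k):
    """Farthest-first traversal: pick the first center arbitrarily, then
    repeatedly pick the point farthest from the current center set."""
    centers = [points[0]]                  # first center (arbitrary choice)
    while len(centers) < k:
        # each point's distance to its nearest chosen center; take the max
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    # assign remaining points to the cluster with minimum distance
    clusters = {i: [] for i in range(k)}
    for p in points:
        i = min(range(k), key=lambda j: math.dist(p, centers[j]))
        clusters[i].append(p)
    return centers, clusters

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
centers, clusters = farthest_first(pts, 2)
print(centers)  # [(0, 0), (10, 1)]
```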

6.4 Filtered cluster

A filter is a special subset of a partially ordered set; in filtered clustering, each cluster is labeled with an ID: if the cluster belongs to the first class it is labeled 1, otherwise it is labeled 0. The working of the filtered clustering algorithm is described below [37]:

  1.

    In only one scan of the data set, derive the \(\frac{l(l-1)}{2}\) contingency tables necessary for computing the previously presented adequacy indices.

  2.

    Run the genetic algorithm using the fitness function \(fit(SA, SA_*)\),Footnote 33 where:

    • a chromosome codes (corresponds to) a subset of SA;

    • each gene of the chromosome codes an attribute of SA (so there are l genes);

    • each gene of a chromosome has a binary value: the gene value is 1 (resp. 0) if its associated attribute is present (resp. absent) in the subset of SA coded by the chromosome.

  3.

    Select the best subspace found by the genetic algorithm.

6.5 Density-based cluster

Density-based clustering is based on the notion of density: it grows clusters according to the density of neighboring objects or according to some density function. The working of density-based clustering is described below [24]:

  1.

    initialize \(t_c = 0\);

  2.

    initialize an empty hash table \(grid\_list\);Footnote 34

  3.

    while data stream is active do

  4.

    read record \(x = (x_1, x_2, \ldots , x_d)\);

  5.

    determine the density grid g that contains x;

  6.

    if (g not in \(grid\_list\)) insert g into \(grid\_list\);

  7.

    update the characteristic vector of g;

  8.

    if \(t_c == gap\) then

  9.

    call \(initial\_clustering\)(\(grid\_list\));

  10.

    end if

  11.

    if \(t_c\) mod \(gap == 0\) then

  12.

    detect and remove sporadic grids from \(grid\_list\);

  13.

    call \(adjust\_clustering\)(\(grid\_list\));

  14.

    end if

  15.

    \(t_c = t_c + 1\);

  16.

    end while

  17.

    end procedure

7 Comparison of proposed model with different existing techniques

To validate whether our proposed framework is able to achieve a higher detection rate, we compare the results of our proposed model with three different techniques, mentioned below:

  (a)

    Comparison of results with previously used classifiers To validate whether our proposed model detects malware as well as previously used classifiers, we calculate two performance parameters, Accuracy and F-measure, for the newly proposed model and the existing models. In addition, we also compare our proposed framework with the existing frameworks present in the literature.

  (b)

    Comparison of results with different anti-virus scanners To compare the performance of our model for malware detection, we chose ten distinct available anti-virus scanners and compared their detection rates with the detection rate of the proposed model.

  (c)

    Detection of known and unknown malware families Further, to evaluate how reliably our proposed model detects known and unknown malware families, we test both with our proposed model and calculate the detection accuracy.

8 Evaluation of performance parameters

In this section of the paper, we discuss the definitions of the performance parameters utilized while evaluating our proposed model for malware detection. All of these parameters are calculated from the confusion matrix, which consists of the actual and detected classification information produced by the detection models. Table 4 demonstrates the confusion matrix for the malware detection model. In the present work, four performance parameters, namely inter-cluster distance, intra-cluster distance, F-measure and accuracy, are utilized for measuring the performance of the malware detection approaches.

Table 4 Confusion matrix to classify whether an Android app (.apk) is benign or malware

Inter-cluster distance For each of these techniques, we first calculate the centroid and then the centroid Euclidean distance. For N d-dimensional data points \(\{\mathbf {X}_i\},\) \(i= 1,2,3,\ldots ,N,\) in a cluster, the centroid \(\mathbf {C}_0,\) as defined in,Footnote 35 is given by

$$\begin{aligned} \mathbf {C_0}=&\ \frac{\sum _{i=1}^{N}\mathbf {X_i}}{N}. \end{aligned}$$
(17)

Next, we define the centroid Euclidean distance between the centroids of two clusters. Given the centroids \(C_{01}\) and \(C_{02}\) of two clusters, the centroid Euclidean distance or inter-cluster distance between them is defined by

$$\begin{aligned} D_0= ((C_{01}-C_{02})^2)^{\frac{1}{2}}. \end{aligned}$$
(18)

Intra-cluster distance To calculate the intra-cluster distance, we find root-mean-square-total-sample standard deviation (RMSSTD). This is defined by

$$\begin{aligned} RMSSTD=\sqrt{\frac{\sum _{j=1}^{p}\bar{s_j}^2}{p}}, \end{aligned}$$
(19)

where \(\bar{s_j}\) denotes the standard deviation of the jth attribute and p is the number of features. The smaller the value, the more homogeneous the observations are with respect to the variables, and vice versa. Since the root-mean-square is scale dependent, it should only be used to compare the homogeneity of data sets whose variables are measured on similar scales.
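Eqs. (17)–(19) can be illustrated with a short sketch over two toy clusters (hypothetical data, not the study's):

```python
import numpy as np

def inter_cluster_distance(c1, c2):
    """Euclidean distance between two cluster centroids (Eqs. 17-18)."""
    a = np.mean(c1, axis=0)                 # centroid C_01
    b = np.mean(c2, axis=0)                 # centroid C_02
    return float(np.linalg.norm(a - b))

def rmsstd(X):
    """Intra-cluster distance: root-mean-square of per-attribute std (Eq. 19)."""
    s = np.std(X, axis=0, ddof=1)           # standard deviation of each attribute
    return float(np.sqrt(np.mean(s ** 2)))

c1 = np.array([[0.0, 0.0], [0.0, 2.0]])
c2 = np.array([[3.0, 0.0], [3.0, 2.0]])
print(inter_cluster_distance(c1, c2))       # 3.0
print(round(rmsstd(np.vstack([c1, c2])), 3))
```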

F-measure The F-measure is the harmonic mean of the precision and recall values used in information retrieval [16]. Precision shows how many applications are in the right cluster with respect to the cluster size, and recall shows how many applications are in the right cluster with respect to the total number of applications of that class. Let i denote the class label and j the cluster; the recall and precision for class i and cluster j are then defined as:

$$\begin{aligned} Recall(i,j)= & {} \frac{n_{i,j}}{n_i} \end{aligned}$$
(20)
$$\begin{aligned} Precision(i,j)= & {} \frac{n_{i,j}}{n_j} \end{aligned}$$
(21)

where \(n_{i,j}\) is the number of applications with class label i in cluster j, \(n_i\) is the number of applications with class label i, and \(n_j\) denotes the number of applications in cluster j. The F-measure for class i and cluster j is given as:

$$\begin{aligned} F(i,j)=\frac{2*Recall(i,j)*Precision(i,j)}{Recall(i,j)+Precision(i,j)}. \end{aligned}$$
(22)

The total F-measure of clustering process is given by:

$$\begin{aligned} F=\sum _i \frac{n_i}{n} \max _j F(i,j) \end{aligned}$$
(23)

where n is the total number of applications.

Accuracy Let \(n_{i,j}\) be the number of applications with class label i in cluster j, \(n_{j,i}\) the number of applications with class label j in cluster i, \(n_i\) the number of applications with class label i, and \(n_j\) the number of applications in cluster j. Then the accuracy becomes:

$$\begin{aligned} Accuracy=\frac{n_{i,j}+n_{j,i}}{n_{i,j}+n_{j,i}+n_i+n_j}. \end{aligned}$$
(24)
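As an illustrative sketch, the overall F-measure of Eq. (23) can be computed from (class label, cluster) pairs; the helper function and toy data are hypothetical, with recall taken per class size and precision per cluster size:

```python
from collections import Counter

def total_f_measure(assignments):
    """Overall F from (class_label, cluster_id) pairs: for each class take
    the best F(i, j) over clusters, weighted by the class proportion."""
    n = len(assignments)
    n_ij = Counter(assignments)                 # apps of class i in cluster j
    n_i = Counter(i for i, _ in assignments)    # class sizes
    n_j = Counter(j for _, j in assignments)    # cluster sizes
    total = 0.0
    for i in n_i:
        best = 0.0
        for j in n_j:
            rec = n_ij[(i, j)] / n_i[i]         # recall(i, j)
            prec = n_ij[(i, j)] / n_j[j]        # precision(i, j)
            if rec + prec > 0:
                best = max(best, 2 * rec * prec / (rec + prec))
        total += n_i[i] / n * best
    return total

# perfect clustering: each class sits in its own cluster -> F = 1.0
pairs = [("malware", 0), ("malware", 0), ("benign", 1), ("benign", 1)]
print(total_f_measure(pairs))  # 1.0
```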

9 Experimental setup

In the present section, we introduce the experimental setup used to evaluate the performance of our developed malware detection models. Five distinct unsupervised machine learning algorithms are implemented on thirty different categories of Android apps mentioned in Table 2. All these data sets have varying numbers of benign and malware apps, which are adequate for our analysis. Figure 7 demonstrates the framework of our proposed model, named SemiDroid. In the very first step, feature ranking and feature subset selection approaches are applied to the extracted feature data set. In the next step, we use the min-max normalization approach to normalize the data. This approach is based on the principle of linear transformation, which brings each data point \(D_{q_i}\) of feature Q to a normalized value that lies between 0 and 1. The following equation is used to find the normalized value of \(D_{q_i}:\)

$$\begin{aligned} Normalized(D_{q_i})=\frac{D_{q_i}-min(Q)}{max(Q)-min(Q)}, \end{aligned}$$

where min(Q) and max(Q) are the minimum and maximum values of attribute Q, respectively. In the third step, the significant features are used to train distinct machine learning techniques. In the next step, we construct a confusion matrix and calculate the performance parameters, i.e., Accuracy and F-measure. Next, we compare the performance of the developed malware detection models and select the best one. At last, we compare the performance of our proposed malware detection model with the existing techniques available in the literature and with distinct anti-virus scanners. If the performance of our proposed model is better than the existing techniques then it is useful; conversely, if the performance is not enhanced then the proposed model is not useful.
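The min-max step can be sketched as (a toy illustration):

```python
def min_max_normalize(values):
    """Linear transformation of each data point of a feature to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```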

Fig. 7
figure 7

Proposed framework i.e., SemiDroid

The subsequent steps are pursued while selecting a subset of features to build the malware detection model that decides whether an app is benign or malware. Feature selection approaches are implemented on 30 different data sets of Android apps. Hence, a total of 1650 ((1 selection using all extracted features + 10 feature selection approaches) × 30 data sets (subsets of different feature sets particular to data sets determined after conducting feature selection) × 5 detection methods) distinct detection models have been built in this work. Below we provide step-by-step details of our approach:

  1.

    In the present work, four feature subset selection approaches and six feature ranking approaches are implemented on 30 different feature sets to select the right set of features for malware detection.

  2.

    The subsets of features obtained from the aforementioned procedure are given as input to the machine learning classifiers. To compare the developed models, we use 20-fold cross-validation. Cross-validation is a statistical learning approach that is utilized to evaluate and compare models by dividing the data into two portions: one portion is utilized to train the model and the remaining portion is utilized to verify it [40]. The data is initially separated into K equally sized segments; K−1 folds are utilized to train the model and the remaining fold is utilized for testing. K-fold cross-validation thus allows the data set to be used for both testing and training. For this study, 20-fold cross-validation is utilized to analyze the models, i.e., the data sets are segregated into 20 portions. The outcomes of all built malware detection models are compared with each other by employing two distinct performance measures: F-measure and accuracy.

  3.

    SemiDroid, i.e., the proposed model built by utilizing the above two steps, is validated against the existing techniques developed in the literature to review whether the built malware detection model is useful or not.

10 Results of performed experiment

In the current section of the paper, the relationship between the different feature sets and malware detection at the class level is presented. Sets of features are used as input, with a given ratio of benign and malware apps within an experiment. Intra-cluster distance, inter-cluster distance, F-measure and accuracy are used as performance assessment parameters to compare the malware detection models built by using the five different unsupervised machine learning algorithms. To depict the experimental results we utilize the abbreviations given in Table 5 corresponding to the actual names.

Table 5 Used naming convention in this study

10.1 Feature ranking approaches

Six feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, information gain feature evaluation, logistic regression analysis, OneR feature evaluation and principal component analysis) are implemented on the distinct feature sets. Each approach utilizes distinct performance parameters to rank the features. For the first four feature ranking approaches (gain-ratio feature evaluation, Chi-squared test, OneR feature evaluation and information gain), the top \(\lceil \log _2 a \rceil\) features are selected as the subset, where a is the number of features in the original data set (for this work \(a=20\)). However, in the case of ULR, those features are selected which possess a positive value of the regression coefficient and a p value below 0.05, and in the case of PCA, only those features are selected which have an eigenvalue greater than 1. The features selected using the feature ranking approaches are demonstrated in Fig. 8.

10.2 Feature subset selection approaches

In the present work, four distinct feature subset selection approaches are implemented on the thirty data sets of Android apps, one after another. Feature subset selection approaches work on the hypothesis that selecting the best features from the available ones yields models with better accuracy and fewer misclassification errors. The search strategies are heuristic search [17] (for correlation based feature selection and rough set analysis) and best first search [14] (for consistency subset evaluation and filtered subset evaluation). Later, the resulting subsets of features are used as input for building a model to decide whether an app is benign or malware. The sets of features selected by the feature subset selection approaches are demonstrated in Fig. 9.

Fig. 8
figure 8

Feature ranking approaches

Fig. 9
figure 9

Feature subset selection approaches

10.3 Machine learning techniques

Eleven subsets of features (1 considering all extracted features + 10 resulting from the implemented feature selection approaches) are used as input to build the malware detection models. The hardware utilized to carry out this study is a Core i7 processor with a 1TB hard disk and 8GB RAM. The detection models are built using the MATLAB environment. Figure 10 demonstrates the implemented unsupervised machine learning algorithms on our collected data set; red crosses represent the normal permissions and blue crosses represent the dangerous permissions. From Fig. 10, we observe that the clusters formed by using SOM, density-based clustering, K-mean and filtered clustering have overlapping normal and dangerous permissions. Only farthest first clustering formed clusters without overlapping of the permissions. Further, the performance of each detection model is measured by using four performance parameters: intra-cluster distance, inter-cluster distance, F-measure and accuracy.

Fig. 10
figure 10

Unsupervised machine learning algorithms

Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15 present the performance values obtained for the distinct data sets by utilizing SOM, K-mean, farthest first clustering, filtered clustering and density based clustering. On the basis of these tables, it may be concluded that:

Table 6 Intra and inter cluster distance using SOM
Table 7 Intra and inter cluster distance using K-mean
Table 8 Intra and inter cluster distance using filtered clustering
Table 9 Intra and inter cluster distance using density based cluster
Table 10 Intra and inter cluster distance using farthest first cluster
Table 11 Accuracy and F-measure using SOM
Table 12 Accuracy and F-measure using K-mean
Table 13 Accuracy and F-measure using filtered clustering
Table 14 Accuracy and F-measure using density-based clustering
Table 15 Accuracy and F-measure using farthest first clustering
  • Bold values indicate the highest detection rate when compared to other values in a specific row.

  • Values of inter-cluster and intra-cluster distance are calculated by using Eqs. 18 and 19.

  • F-measure and accuracy are measured by using Eqs. 22 and 24.

  • Models developed by considering the features selected by feature selection approaches as input are able to detect malware more effectively than the model developed by using all extracted feature sets.

  • The model constructed by considering FS4 as input achieved a higher detection rate when compared to the models developed by using the other feature selection approaches.

  • The model built with farthest first clustering and the FS4 feature set achieved the highest detection rate among all developed models.
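Equations 18, 19, 22 and 24 are defined earlier in the paper; the sketch below assumes the common definitions (mean point-to-centroid distance for intra-cluster distance, mean centroid-to-centroid distance for inter-cluster distance, and accuracy/F-measure obtained after mapping each cluster to its majority class), which may differ in detail from the paper's exact equations. The toy data is hypothetical.

```python
import numpy as np

def intra_inter(X, labels):
    """Intra: mean distance from each point to its cluster centroid.
    Inter: mean pairwise distance between cluster centroids."""
    cents = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
    intra = np.mean([np.linalg.norm(x - cents[c]) for x, c in zip(X, labels)])
    cs = list(cents.values())
    inter = np.mean([np.linalg.norm(cs[i] - cs[j])
                     for i in range(len(cs)) for j in range(i + 1, len(cs))])
    return intra, inter

def cluster_accuracy_f1(labels, y_true):
    """Map each cluster to its majority class, then score it as a
    classifier (malware taken as the positive class, encoded 1)."""
    y_pred = np.empty_like(y_true)
    for c in np.unique(labels):
        mask = labels == c
        y_pred[mask] = np.bincount(y_true[mask]).argmax()
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = float(np.mean(y_pred == y_true))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1

# Two tight, well-separated toy clusters with matching true labels.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
intra, inter = intra_inter(X, labels)
acc, f1 = cluster_accuracy_f1(labels, y)
```

A low intra-cluster distance with a high inter-cluster distance indicates compact, well-separated clusters, which is the behavior the tables report for farthest first clustering.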

Fig. 11
figure 11

Box-plot diagram of accuracy

Fig. 12
figure 12

Box-plot diagram of F-measure

In this research paper, five distinct unsupervised machine learning algorithms and ten distinct feature selection approaches are used to select the features that help identify Android malware more effectively. To determine which developed model is most capable of detecting malware, we construct box-plot diagrams for each model. Box-plot diagrams help identify the model best suited for malware detection on the basis of a small number of outliers and a better median value. Figures 11 and 12 demonstrate the box-plot diagrams of accuracy and F-measure for every developed model; the x-axis presents the feature selection techniques. Each figure includes eleven box-plots: one for the full set of extracted features, four for the feature subset selection approaches and six for the feature ranking approaches. On the basis of these box-plot diagrams, we make the following observations:

  • The models constructed with the five distinct unsupervised machine learning algorithms and FS4 achieved higher median values with few outliers. On the basis of the box-plot diagrams in Figs. 11 and 12, the model developed with FS4 as the feature selection approach gives a better detection rate than the other developed approaches.

  • From the box-plot diagrams, we observe that the model built with the farthest first clustering algorithm and FS4 has few outliers and a higher median value. This means that this model achieves better results in distinguishing malware from benign apps than the others.
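The box-plot comparison above rests on two statistics: the median and the number of outliers under Tukey's rule. A short sketch of those statistics, with hypothetical accuracy values standing in for the real per-data-set results, looks like this:

```python
import numpy as np

def boxplot_stats(values):
    """Median and outlier count under Tukey's rule (the statistics a
    box-plot displays): outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = int(np.sum((v < lo) | (v > hi)))
    return med, outliers

# Hypothetical accuracy values (one per data set) for two feature sets.
fs4_acc = [0.95, 0.96, 0.97, 0.96, 0.95, 0.96]
all_acc = [0.80, 0.93, 0.91, 0.92, 0.90, 0.91]
m4, o4 = boxplot_stats(fs4_acc)
m_all, o_all = boxplot_stats(all_acc)
# The preferred model combines the higher median with fewer outliers.
```

In this toy example the FS4-style series has the higher median and no outliers, mirroring the selection criterion used for Figs. 11 and 12.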

Fig. 13
figure 13

t test analysis (p value)

Table 16 Performance of distinct feature selection approaches after calculating the mean difference
Fig. 14
figure 14

Cost-benefit value

10.4 Comparison of results

To identify which of the implemented feature selection approaches and machine learning algorithms works best, or whether all of them perform equally well, we employed the pair-wise t test in our study.

1. Feature selection approaches In this study, two sets are formed for each feature selection approach; each approach has 150 distinct data points (5 machine learning techniques \(\times\) 30 data sets). The t test is performed on pairs of feature selection approaches and the corresponding p values are compared to measure statistical significance. The outcome of the t test study is demonstrated in Fig. 13b, where a circle filled with green denotes a p value \(> 0.05\) (no significant difference) and a circle filled with red denotes a p value \(\le 0.05\) (significant difference). From Fig. 13b it is clear that the majority of cells contain green circles, which means there is no significant difference among the employed feature selection approaches. Further, from the mean differences given in Table 16, we observe that the feature sets obtained with FS4 give the best outcomes when compared with the other implemented feature selection approaches.
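The pair-wise comparison above can be sketched as a paired t statistic over matched performance values. The sketch below uses only the standard library, and the two accuracy series are hypothetical stand-ins for the real per-combination results:

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples,
    e.g., accuracies of two feature selection approaches over the same
    machine-learning-technique / data-set combinations."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical accuracies of FS4 vs. another approach on the same data sets.
fs4 = [0.96, 0.95, 0.97, 0.96, 0.95]
other = [0.91, 0.92, 0.90, 0.93, 0.91]
t, df = paired_t(fs4, other)
# |t| above the two-tailed critical value (about 2.776 at p = 0.05, df = 4)
# would indicate a statistically significant difference.
```

In practice `scipy.stats.ttest_rel` returns the same statistic together with the p value directly.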

In the present work, we also compare the developed models on the basis of a cost-benefit analysis. For every feature selection approach, the cost-benefit value is computed with the following equation:

$$\begin{aligned} Cost{\text{-}}Benefit=(Based_{cost}+Benefit_{cost})/2. \end{aligned}$$
(25)

Here, \(Based_{cost}\) depends on the correlation between the selected feature set and the class error. \(Based_{cost}\) is calculated from the following equation:

$$\begin{aligned} Based_{cost}=Accuracy \ (SM)\times \rho _{SM.fault}. \end{aligned}$$
(26)

Here, \(Accuracy \ (SM)\) is the classification accuracy of the malware detection model built with the selected feature set, and \(\rho _{SM.fault}\) is the multiple correlation coefficient between the selected feature set and the error; a model with high accuracy and a high multiple correlation coefficient therefore achieves a high \(Based_{cost}.\) Let NAM be the number of features in the full set and NSM the number of features retained after applying a feature selection approach. \(Benefit_{cost}\) can be calculated from the following equation:

$$\begin{aligned} Benefit_{cost}=(NAM-NSM)/NAM \end{aligned}$$
(27)

The feature selection approach that achieves the highest cost-benefit value is the foremost feature selection approach, as proposed in [22]. Figure 14a, b demonstrates the cost-benefit values of the distinct feature selection approaches. On the basis of Fig. 14a, b, we observe that FS4 achieved a higher median cost-benefit measure than the other approaches.
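Equations 25 to 27 combine into a short computation, sketched below with the second cost term read as the benefit of discarding features; the accuracy, correlation coefficient and feature counts are hypothetical values chosen for illustration.

```python
def cost_benefit(accuracy, rho, nam, nsm):
    """Cost-benefit per Eqs. 25-27: the based cost weights classification
    accuracy by the multiple correlation coefficient rho, and the benefit
    term rewards discarding features (nam = total number of features,
    nsm = number of selected features)."""
    based = accuracy * rho           # Eq. 26
    benefit = (nam - nsm) / nam      # Eq. 27
    return (based + benefit) / 2     # Eq. 25

# Hypothetical values: a feature set keeping 5 of 25 features at 0.96 accuracy.
cb = cost_benefit(accuracy=0.96, rho=0.9, nam=25, nsm=5)
```

Under this reading, an approach scores well only when it keeps the feature count small while preserving accuracy and correlation, which is exactly how FS4 earns its high median in Fig. 14.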

Table 17 Mean difference between the performance of different unsupervised methods

2. Machine learning techniques In our study, we applied eleven different feature subsets (i.e., 1 containing all features + 10 from feature selection approaches) to thirty different Android app data sets, examining four performance parameters, i.e., intra-cluster distance, inter-cluster distance, F-measure and accuracy, each with 330 data points [(1 set of all features + 10 feature selection methods) \(\times\) 30 data sets]. Figure 13a demonstrates the outcomes of the t test analysis. On the basis of Fig. 13a, it is noticeable that there is a significant difference among these techniques, because the p value is smaller than 0.05. Further, from the mean differences given in Table 17, farthest first clustering gives the best outcome when compared to the other machine learning techniques.

3. Feature subset selection and feature ranking approaches For this study, the pair-wise t test is used to identify which class of feature selection approach works better. For both implemented classes (i.e., feature subset selection and feature ranking), sample pairs of performance values are studied; the outcomes of the t test on the averaged results are summarized in Table 18. In this research paper, five distinct machine learning algorithms are applied to thirty different Android app categories with accuracy and F-measure as performance parameters, and for each feature selection approach a total of two sets are used: feature subset selection with 360 distinct data points (4 feature subset selection approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets) and feature ranking with 540 distinct data points (6 feature ranking approaches \(\times\) 3 machine learning techniques \(\times\) 30 data sets). On the basis of Table 18, there is no significant variation between the two classes, because the p value is greater than 0.05. However, on the basis of the mean difference, feature subset selection approaches give better results than feature ranking approaches. Finally, the cost-benefit analysis demonstrated in Fig. 14 shows that feature subset selection and feature ranking have nearly the same cost-benefit values, which means that models built with features selected by either class have nearly the same averaged cost and benefit.

Fig. 15
figure 15

Diagram of box-plot showing performance of different classifiers

Table 18 t test analysis among feature subset selection approaches and feature ranking approaches

10.5 Evaluation of proposed framework i.e., SemiDroid

10.5.1 Comparison of results with previously used classifiers

Fig. 16
figure 16

t test analysis (p value)

In addition to finding the best approach for building an accurate malware detection model, this study also compares it with the supervised machine learning approaches most often used in the literature: SVM with three distinct kernels (linear, polynomial and RBF), decision tree analysis, logistic regression, neural network and the Naïve Bayes classifier. Figure 15 demonstrates the box-plot diagrams of F-measure and accuracy for these commonly used classifiers and for the five machine learning algorithms implemented in this paper. On the basis of Fig. 15, we observe that farthest first clustering has the highest median value along with a small number of outliers.
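One of the supervised baselines above, the Naïve Bayes classifier, can be sketched in a few lines for binary permission features. This is an illustrative Bernoulli Naïve Bayes with Laplace smoothing, not the exact baseline implementation used in the comparison, and the toy permission matrix is hypothetical.

```python
import numpy as np

class BernoulliNB:
    """Minimal Bernoulli Naive Bayes over binary permission features,
    with Laplace smoothing."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        # P(feature = 1 | class), Laplace-smoothed to avoid log(0)
        self.p = np.array([(X[y == c].sum(axis=0) + 1) / (np.sum(y == c) + 2)
                           for c in self.classes])
        return self

    def predict(self, X):
        # log-likelihood of each sample under each class, plus log prior
        ll = (X @ np.log(self.p).T + (1 - X) @ np.log(1 - self.p).T
              + np.log(self.prior))
        return self.classes[np.argmax(ll, axis=1)]

# Toy data: malware apps (label 1) tend to request the first permission.
X = np.array([[1., 1., 0.], [1., 0., 0.], [0., 0., 1.], [0., 1., 1.]])
y = np.array([1, 1, 0, 0])
pred = BernoulliNB().fit(X, y).predict(X)
```

Unlike the clustering techniques above, such supervised baselines need labeled training apps, which is one motivation for preferring the unsupervised SemiDroid pipeline.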

The pair-wise t test is also applied to decide which machine learning approach yields the best performance. The outcomes of the t test for the distinct machine learning approaches are demonstrated in Fig. 16. On the basis of Fig. 16, in a number of cases there is a significant difference among these machine learning techniques, because the p value is smaller than 0.05. Further, from the mean differences in Table 19, farthest first clustering achieves better results than the other supervised machine learning techniques.

Table 19 Mean difference between performance of different supervised machine learning technique

In addition, we compare our proposed malware detection model (i.e., SemiDroid) with existing frameworks and approaches developed in the literature. Table 20 shows the name, goal, methodology, deployment, data set and detection rate of the suggested approaches and frameworks.

Table 20 Comparison with previously developed frameworks/approaches

10.5.2 Comparison of results with different anti-virus scanners

Although farthest first clustering performs better than the machine learning techniques used in the literature, it must ultimately be compared with the anti-virus products commonly used in practice for Android malware detection. For this experiment, we selected 10 different anti-virus scanners available in the market and applied them to our collected data set; the performance of the proposed framework is comparatively better than most of them. Table 21 shows the results of the experiment. The detection rates of the anti-virus scanners vary considerably: the best scanner detected 96.2% of the Android malware, while certain scanners identified only 82% of the malicious samples, likely because they are not specialized in detecting Android malware. Using 1000 Android apps, our proposed framework, SemiDroid, gives a detection rate of 98.8% and outperforms all 10 anti-virus scanners. From this, we can say that our proposed framework is more effective at detecting malware than the distinct anti-virus scanners.

Table 21 Comparison of proposed framework i.e., SemiDroid with distinct anti-virus scanners
Table 22 Top malware families used in our data set

10.5.3 Detection of known and unknown malware families

Detection of known malware families In this section, we check whether our proposed framework is capable of detecting malware from known families. For this experiment, we select 20 samples of each family (our study covers samples from 81 different families, shown in Table 22) and train our selected model with them. Farthest first clustering detects, on average, 98.8% of malware apps. The family names and the number of samples used per family are given in Table 22, and the detection rate of our proposed framework for each family is illustrated in Fig. 17a, b.

Fig. 17
figure 17

Detection rate of proposed framework farthest first clustering

Detection of unknown malware families To check whether farthest first clustering is capable of detecting unknown malware families, we trained our proposed framework on a random selection of 10 families and tested it on the remaining 71 families in the data set. Table 23 shows the results of farthest first clustering when trained on the 10 selected families. From Table 23, we can say that if farthest first clustering is trained on a small number of known family samples, sufficient to generalize the behavior of most malware families, it gives a good detection rate.
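The unknown-family protocol above amounts to holding out entire families rather than individual samples. A minimal sketch of such a split, with hypothetical family and app names, could look like this:

```python
import random

def family_holdout_split(samples_by_family, n_train_families, seed=0):
    """Split malware families into disjoint train/test sets so the model
    is evaluated only on families it never saw during training, e.g.,
    train on a random 10 of 81 families and test on the other 71."""
    fams = sorted(samples_by_family)
    rng = random.Random(seed)
    train_fams = set(rng.sample(fams, n_train_families))
    train = [s for f in fams if f in train_fams
             for s in samples_by_family[f]]
    test = [s for f in fams if f not in train_fams
            for s in samples_by_family[f]]
    return train, test

# Hypothetical toy data: 5 families, 2 samples each; hold 2 for training.
data = {f"fam{i}": [f"app{i}_{j}" for j in range(2)] for i in range(5)}
train, test = family_holdout_split(data, n_train_families=2)
```

Keeping the family partition disjoint is what makes the reported detection rate a genuine test of generalization to unseen malware families.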

In summary, our proposed framework detects Android malware more effectively than several anti-virus scanners that regularly update their signature definitions. In addition, it identifies Android malware efficiently even when trained with a limited number of malware families.

Table 23 Detection of SemiDroid to detect unknown malware families

10.5.4 Experimental findings

This section presents the comprehensive conclusions of our experimental work. The empirical study was conducted on thirty different categories of Android apps using five different unsupervised machine learning techniques, i.e., SOM, K-means, filtered clustering, density-based clustering and farthest first clustering. On the basis of the experimental results, this research paper is able to answer the questions raised in Sect. 2.

RQ1. In this paper, we applied five distinct machine learning algorithms to build models that detect whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, the model built with farthest first clustering, using the feature set selected by FS4 as input, gives better outcomes than the others.

RQ2. To answer RQ2, Fig. 17 and Tables 20 and 21 were analyzed. It is found that the model built with farthest first clustering is capable of detecting malware in real-world apps.

RQ3. In the present paper, four distinct feature subset selection approaches and six distinct feature ranking approaches are used to identify smaller subsets of features. Using these approaches, we selected the best possible subsets of features for building a model that identifies whether an app is benign or malware. On the basis of Tables 6, 7, 8, 9, 10, 11, 12, 13, 14 and 15, in a number of cases a reduced subset of features builds a better detection model than the full set of extracted features.

RQ4. In the present paper, six distinct feature ranking approaches are used to discover reduced subsets of features. On the basis of the t test study, feature selection with the PCA approach gives better outcomes than the other ranking approaches.

RQ5. In this paper, four distinct feature subset selection approaches are used to find reduced subsets of features. On the basis of the t test study, feature selection with FS4 gives persuasively better outcomes than the other approaches.

RQ6. In this work, the pair-wise t test is used to identify whether feature subset selection approaches perform better than feature ranking approaches or whether both perform equally well. The t test outcomes show no significant difference between feature subset selection and feature ranking; however, the mean difference values show that feature subset selection approaches give better results than feature ranking.

RQ7. On the basis of Sect. 9, we observe that the performance of the feature selection approaches varies with the machine learning technique used. Further, the choice of machine learning algorithm for building a malware detection model that decides whether an app is malware or not depends on the feature selection approach.

11 Threats to validity

This section presents the threats to validity experienced while performing the experiments:

  1. (i)

    Construct validity The presented malware detection models only detect whether an app is benign or malware; they do not state how many permissions and API calls are required to detect malware.

  2. (ii)

    External validity Cyber-criminals develop malware on a daily basis to misuse user information. In this work, we considered 81 different malware families to train the model, and our proposed model is capable of detecting malware from known and unknown families. The research can be extended by training the model with more malware families, making it capable of detecting more malware apps in the real world.

  3. (iii)

    Internal validity This threat lies in the consistency of the data used in the study. We collected data from the sources mentioned in Sect. 4; any error in the information not mentioned in those sources was not considered in this work. Although we cannot claim that the data considered for the experiment is 100% accurate, we believe it has been collected consistently.

12 Conclusion

This work emphasizes designing a malware detection framework using a selected set of features that helps identify whether an Android app belongs to the malware class or the benign class. The experiments were performed with the assistance of thirty different categories of Android apps.

The empirical results indicate that it is feasible to determine a small subset of features. The malware detection model built with this determined set of features detects malware and benign apps with a lower misclassification error and better accuracy. Further, the results of the malware detection model are influenced by the feature selection approach.

After performing an in-depth analysis, we found that the AA, BU, LS, PE, RA and TO feature sets, obtained through the feature selection approaches, are relevant detectors for malware. Further, on the basis of the mean difference, the model built with the selected set of features as input gives a better detection rate than the model built with all extracted features. Moreover, the model built with farthest first clustering gives better outcomes than the other techniques.

Finally, on the basis of the cost-benefit analysis, the features selected with FS4 achieved the highest median cost-benefit value among the approaches, and the model built with farthest first clustering is capable of detecting 98.8% of known and unknown malware in real-world apps.

The proposed malware detection models only detect whether an app is malware or benign. The work can be extended to develop a model that predicts whether a particular feature is capable of detecting malware. Moreover, this study can be replicated on other Android app repositories using soft computing models to attain a better malware detection rate.