1 Introduction

The Android platform has dominated the smartphone market for years. With more than three billion devices currently running Android, it is the most popular end-user operating system in the world. Unsurprisingly, its enormous user base, coupled with the popularity of mobile apps, has led hackers to release numerous malicious applications. Symantec (2019) reported that in 2018 it detected an average of 10,573 mobile malware samples per day, found that one in 36 mobile devices had high-risk apps installed, and that one in 14.5 apps accessed high-risk user data.

To detect Android malware, the research community has proposed several approaches. These approaches build detection models using sequences of API call features Tobiyama et al. (2016); Karbab et al. (2018); Onwuzurike et al. (2019), the use of API call features Sharma and Dash (2014); Chan and Song (2014); Yerima et al. (2015); Arp et al. (2014), or the frequency of API call features Aafer et al. (2013); Garcia et al. (2018). API call features represent invocations of Android APIs. Some approaches Enck et al. (2009); Sanz et al. (2013); Huang et al. (2013); Liu and Liu (2014); Sharma and Dash (2014); Chan and Song (2014); Arp et al. (2014); Lindorfer et al. (2015) categorize Android APIs according to privilege levels (known as Android permissions). In Android, APIs are classified into four privilege levels — normal, signature, dangerous, and special. These approaches rely on the observation that malware typically requires privileged operations (i.e., dangerous permissions) such as reading/sending SMS, reading contacts, and reading location. Given that modern malware often uses reflection and system (native API) calls to hide its true behaviour and implement its malicious functionality, some approaches such as Garcia et al. (2018); Suarez-Tangil et al. (2017); Afonso et al. (2015) utilize features that represent native API calls and reflection, in an attempt to further distinguish malware from benign apps. In addition to permission uses, Kim et al. (2018) also investigated the use of app components as features. Hence, a study of the significance of these features for Android malware detection on a common benchmark would be beneficial.

The API calls can be extracted at various abstraction levels such as method, class, package, and family. Since there are millions of unique methods in Android, some approaches Garcia et al. (2018); Onwuzurike et al. (2019); Ikram et al. (2019) abstract API calls at the class and package levels. This reduces the number of features significantly and yet produces comparable or even better results Garcia et al. (2018); Onwuzurike et al. (2019); Ikram et al. (2019) than using API calls at the method level.
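To make the abstraction concrete, the sketch below shows how a fully qualified API call can be reduced to its class and package signature (the helper function and the example usage are ours, for illustration only):

```python
def abstract_api_call(method_signature: str):
    """Abstract a fully qualified method signature to class level and package level."""
    class_sig, _, _method = method_signature.rpartition(".")  # drop the method name
    package_sig, _, _cls = class_sig.rpartition(".")          # drop the class name
    return class_sig, package_sig

print(abstract_api_call("android.telephony.SmsManager.sendTextMessage"))
# -> ('android.telephony.SmsManager', 'android.telephony')
```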

To extract these features, two types of techniques are generally used — static analysis Arp et al. (2014); Chan and Song (2014); Yang et al. (2018); Garcia et al. (2018); Onwuzurike et al. (2019); Ikram et al. (2019) and dynamic analysis Dini et al. (2012); Tobiyama et al. (2016); Afonso et al. (2015). Typically, static analysis-based features cover more information, since static analysis can reason about the whole program code, whereas dynamic analysis-based features are limited to the code that is actually executed. On the other hand, static analysis may struggle with complex code, such as obfuscated code, and modern malware is usually crafted with obfuscation Garcia et al. (2018). In general, static analysis and dynamic analysis complement each other. Hence, some approaches, such as Lindorfer et al. (2015), perform both analyses and use both types of features.

Once these features have been extracted using program analyses, machine learning classifiers such as Support Vector Machines (SVM), K-Nearest Neighbours, and Random Forest are trained on them to build malware detectors. For instance, DadiDroid Ikram et al. (2019) and MaMaDroid Onwuzurike et al. (2019) used all three classifiers mentioned above; RevealDroid Garcia et al. (2018) used SVM; Huang et al. (2013) used AdaBoost, Naive Bayes, Decision Tree, and SVM. In parallel, other studies Tobiyama et al. (2016); McLaughlin et al. (2017); Karbab et al. (2018); Xu et al. (2018) have focused on deep learning classifiers, such as Convolutional Neural Networks and Recurrent Neural Networks, to build malware detectors. Deep learning classifiers use several neural network layers to learn various levels of representation and extract higher-level features from the given lower-level ones. Hence, they generally have a built-in feature selection process and are better at learning complex patterns. On the other hand, they generally come with a much larger cost in terms of computational resources. Deep learning classifiers also typically have more parameters to tune and require intensive fine-tuning to match the characteristics of the dataset.

In terms of evaluating malware detection performance, cross-validation or random-split schemes are commonly used in the literature Lindorfer et al. (2015); Arp et al. (2014); Afonso et al. (2015); Karbab et al. (2018). However, as reported by Allix et al. (2016) and Pendlebury et al. (2019), these evaluation schemes are biased because data from the ‘future’ is used to train the classifier. Fu and Cai (2019) showed that the F-measure drops from 90% to 30% when training and test data are split with a one-year gap. Additionally, Pendlebury et al. (2019) reported a spatial bias issue, where the evaluation does not consider the realistic distribution between malware and benign samples.

In view of the different types of features, the different types of underlying analyses used for feature extraction, and the different types of classifiers that have been proposed, there is a need for a comprehensive evaluation of the current state of the art in Android malware classification on a common benchmark. There is also a need to evaluate performance in a time- and space-aware setting. Hence, in this study, we evaluate the malware detection accuracy of features, analyses, and classifiers on a common benchmark. Our evaluation includes a comparison between 14 types of features, a comparison between a conventional machine learning classifier and a deep learning classifier, a study of the impact of additional features (such as native API calls and reflection) and of combined static and dynamic features, and an assessment of the robustness of features over the evolution of Android.

The experiments are conducted on a benchmark of 13,772 apps (7,860 benign apps and 5,912 malware samples) released from 2010 to 2020. Benign samples were collected from the Androzoo repository Allix et al. (2016), while malware samples were collected from both the Androzoo and Drebin Arp et al. (2014) repositories. We extract static features from the call graphs of Android package (APK) code and dynamic features by executing each app in an Android emulator using our in-house intent fuzzer combined with Android’s Monkey testing framework Android (2019).

Our preliminary study, documented in our conference paper Shar et al. (2020), compared the performance of sequence-of-API-calls features with use-of-API-calls features and compared the performance of un-optimized classifiers. This paper extends that work and makes the following new contributions:

  • We conduct a more systematic evaluation of the performance of features and classifiers. More specifically, we evaluate performance in a time- and space-aware setting in which classifiers are trained on older apps and tested on newer apps, and in which the distribution of benign and malware samples is representative of the in-the-wild malware-to-benign ratio. These biases were not considered in our previous work.

  • We significantly increase the size of our dataset. Our earlier work used a dataset of 6,971 apps. In this extension, we use a dataset of 13,772 apps collected over a period of 11 years.

  • We analyze sequence/use/frequency of API call features at two different abstraction levels — class and package. We also consider additional features that characterize reflection, native API calls, permission uses, and app component uses in our evaluation.

  • We perform a series of optimizations on the deep learning classifier and the conventional machine learning classifier and compare their performance.

More specifically, the new research questions investigated in this study are:

  • RQ1: Features. Which types of features perform the best? Are class-level features or package-level features better? Are static analysis-based features or dynamic analysis-based features better? Finding. Permission-use features perform the best; package-level features generally perform better than class-level features; static features generally perform better than dynamic features.

  • RQ2: Classifiers. When optimized, which type of classifiers — conventional machine learning (ML) classifier or deep learning (DL) classifier — performs better? Finding. In our previous work Shar et al. (2020), the un-optimized DL classifiers did not perform as well as the best conventional ML classifier (Random Forest). In this evaluation, we observed that when optimized, the DL classifier (Recurrent Neural Network) performs better than the conventional ML classifier (Random Forest) on sequence-type features.

  • RQ3: Additional features. Does the inclusion of features that characterize reflection, native API calls, and API calls that are classified as dangerous (dangerous permissions) improve the malware detection accuracy? Does combining static analysis-based and dynamic analysis-based features help? Finding. Overall, the inclusion of reflection, native API call, and dangerous permission features does not improve performance significantly; combining static and dynamic features in a naive manner results in worse performance.

  • RQ4: Robustness. How robust are the malware detectors against evolution in Android framework and malware development? Finding. Generally, the performance of malware detectors is sensitive to changes in Android framework and malware development.

Data Availability

The scripts used in our experiments and sample datasets are available at our GitHub page.Footnote 1 We provide more detailed results and the complete dataset upon request. The rest of the paper is organized as follows.

Section 2 discusses related work and motivates our work. Section 3 describes the methodology — it explains the data collection and feature extraction processes, and the machine learning and deep learning classifiers we optimized and used. Section 4 presents the empirical comparisons and discusses the results. Section 5 draws conclusions from this study and provides insights for Android malware researchers. Section 6 provides concluding remarks and proposals for future studies.

2 Related Work on Android Malware Detection

Surveys

Naway and Li (2018) reviewed the use of deep learning in combination with program analysis for Android malware detection. Recently, Liu et al. (2022) also reviewed the use of deep learning for Android malware defenses. In contrast to Naway and Li (2018), Liu et al. additionally reviewed critical aspects of using deep learning to prevent and defend against malicious behaviors (e.g., malware evolution, adversarial malware detection, deployment, malware families). However, the contribution of both studies is a literature survey focusing on the use of deep learning for Android malware detection, rather than an empirical study like ours.

Empirical studies

There are a few empirical studies Allix et al. (2015, 2016); Ma et al. (2019); Cai (2020) in the literature that contrast different types of features and classifiers for detecting Android malware. Among them, Zhuo et al.’s study Ma et al. (2019) is closest to ours, as it also investigates static sequence/use/frequency features extracted from control flow graphs. The main differences between Zhuo et al.’s study and ours are that a) we consider both static and dynamic analysis, b) we evaluate the use of native calls, reflection, permissions, and API calls at the class level and package level, c) they evaluate only a DL algorithm whereas we evaluate both conventional ML and DL algorithms, and d) most importantly, Zhuo et al.’s study applied cross-validation for performance evaluation, which introduces temporal and spatial biases, whereas our evaluation takes measures to address these biases. In general, the other studies focus on a single dimension such as features, analyses, classifiers, or temporal and spatial aspects. By contrast, our study looks at all those aspects and evaluates them on a common benchmark.

Allix et al. (2016) conducted a large-scale empirical study on the dataset sizes used in Android malware detection approaches. Allix et al. (2015) also investigated the relevance of the timeline in the construction of training datasets. Both studies Allix et al. (2015, 2016) observed that the performance of malware detectors dropped significantly when they were tested against malware in the wild, i.e., malware that was not seen during training. Allix et al. (2015) presents a critical literature review of Android malware classification based on supervised machine learning. They define a dataset to be historically coherent when the apps in the training set are all historically anterior to all the apps in the testing set. According to their experiment, when the dataset is not historically coherent, classification performance (e.g., F-measure) is artificially inflated. According to their literature review, a relevant portion of the papers uses historically incoherent datasets, causing results to be biased. Another study Pendlebury et al. (2019) additionally discussed the importance of a space-aware setting that considers the realistic distribution of malware and benign samples during both training and testing. We took measures to mitigate these two biases in our evaluations. The need to retrain an ML-based malware detector is defined by Cai (2020) as the sustainability problem. Cai (2020) compares five malware detectors, revealing limitations with respect to the sustainability of the learned models. Our results confirm these findings. These existing studies were conducted on limited types of analyses (static analysis) and features (e.g., sequence of API calls), and a limited span of app release years (\(\le \) 3 years). Our work addresses this gap by investigating the relevance of the timeline in the construction of datasets representing different types of features extracted from apps released over a wide time span of 11 years. We provide complementary, additional findings to these existing studies.

Static analysis-based features

Several approaches rely on static analysis to extract features from the app, such as permissions Enck et al. (2009); Wu et al. (2012); Sanz et al. (2013); Huang et al. (2013); Liu and Liu (2014); Sharma and Dash (2014); Chan and Song (2014); Arp et al. (2014); Suarez-Tangil et al. (2017), the sequence of API calls McLaughlin et al. (2017); Chen et al. (2016); Shen et al. (2018); Karbab et al. (2018); Onwuzurike et al. (2019); Shi et al. (2020); Zou et al. (2021), the use of API calls Sharma and Dash (2014); Zhang et al. (2014); Chan and Song (2014); Yerima et al. (2015); Arp et al. (2014); Suarez-Tangil et al. (2017); Ikram et al. (2019); Xu et al. (2019); Bai et al. (2020); Wu et al. (2021), or the frequency of API calls Aafer et al. (2013); Chen et al. (2016); Fan et al. (2016); Garcia et al. (2018). A few approaches Garcia et al. (2018); Suarez-Tangil et al. (2017) also relied on features that characterize native API calls and reflection. Since these approaches evaluate the various types of features independently, and the majority of them were not evaluated in a time- and/or space-aware manner, our work addresses this by evaluating all these types of features on a common benchmark in a time- and space-aware manner. In addition, our study evaluates features extracted not only with static analysis but also with dynamic analysis, and with both static and dynamic analysis combined. We also evaluate these features on both ML and DL classifiers. Considering that analysis at the method level leads to millions of features, resulting in long training times and high memory consumption, some approaches Onwuzurike et al. (2019); Ikram et al. (2019); Yang et al. (2018) abstracted features at the class, package, family, or entity level to save memory and time. Our study evaluates features at the class level and the package level.

Dynamic analysis-based features

Dynamic analysis-based approaches such as Dini et al. (2012); Tobiyama et al. (2016); Afonso et al. (2015); Spreitzenbarth (2013) have mainly focused on features based on native API calls (system calls). Narudin et al. (2016) evaluate the performance of five ML classifiers on network features (API calls that involve network communication) extracted with dynamic analysis. Most dynamic analysis approaches have largely relied on the Monkey (UI) test generator Naway and Li (2018). However, the Monkey test generator only exercises UI components and can miss component interactions. In contrast to these approaches, our approach employs a combination of the Monkey test generator and intent fuzzing.

Hybrid analysis-based features

As reported by Liu et al. (2022), possibly due to the high computational cost, very few approaches Yuan et al. (2014); Lindorfer et al. (2014); Alshahrani et al. (2019); Spreitzenbarth (2013); Bläsing et al. (2010) combine static analysis and dynamic analysis. Moreover, these approaches focus on extracting specific features that are generally considered dangerous, such as sending SMS and connecting to the Internet. For example, Droid-sec Yuan et al. (2014) uses features that characterize permissions requested and permission use, which are coarse-grained and prone to false positives Enck et al. (2009). DDefender Alshahrani et al. (2019) uses features based on permissions, network activities, and native API calls. It also relies on the Monkey tool in its dynamic analysis; thus it may not generate all the events that a malware sample can trigger. Mobile-Sandbox Spreitzenbarth (2013) applies static analysis of the manifest file and bytecode to guide the dynamic analysis process. It then analyzes native API calls during the application’s execution. AASandbox Bläsing et al. (2010) uses static analysis to extract suspicious code patterns, such as the use of Runtime.exec() and functions related to reflection. During the dynamic step, AASandbox runs the app in a controlled environment and monitors system calls. In contrast to the above-mentioned approaches, we evaluate more types of features and evaluate both conventional machine learning and deep learning classifiers. We also employ a combination of the Monkey test generator and intent fuzzing to cover both UI events and component interactions. Marvin Lindorfer et al. (2015) also uses both static analysis and dynamic analysis to extract features similar to those extracted in our work, including permissions, reflection, native calls, and Java classes. However, its classifier is evaluated by randomly splitting training and test data, without considering the timeline in the construction of the training data, which could produce biased results.

Robust classifiers

While Zhang et al. (2020) proposes a way to mitigate the problem of model aging, Fu and Cai (2019), MaMaDroid Onwuzurike et al. (2019), Afonso et al. (2015), and RevealDroid Garcia et al. (2018) propose the use of features that could be robust against the evolution of apps (timeline). Our empirical study complements their work by evaluating which combination of features, program analyses, and classifiers produces robust malware detectors, on a common benchmark.

3 Methodology

This section explains the workflow of our empirical study. As illustrated in Fig. 1, it consists of three phases. In the first phase, static analysis is used to extract manifest files and call graphs from benign and malware apps, and dynamic analysis is used to generate execution traces. In the second phase, various features — sequence/use/frequency of API calls at the class level and package level, permission uses, and app component uses — are extracted from the call graphs and execution traces. Each type of feature forms a distinct dataset. Each record in a dataset, representing an app, is tagged with its known label. In the last phase, classifiers — Random Forest (RF) and Recurrent Neural Network (RNN) — are trained and tested on the labeled datasets in a time- and space-aware setting, producing the evaluation results.

The following subsections discuss each phase in detail. As a running example, we use a malicious app called com.test.mygame released in 2017, which has been flagged as malware by 27 anti-virus engines. It is a variant of the SmsPay malware, where a legitimate app is repackaged with covert functions to send and receive SMS messages, potentially causing unexpectedly high phone charges.

Fig. 1 The workflow of our experiments

3.1 Program Analysis

In this phase, static analysis and dynamic analysis are performed on the given Android Application Packages (APKs).

Static analysis

Given an APK, we use apktoolFootnote 2 to extract the Android manifest file and FlowDroid Arzt et al. (2014) to extract the call graph. The call graph contains paths from the public entry points of the app to program termination. These paths contain sequences of API calls. FlowDroid is based on Soot (2018): Soot first converts the given APK (i.e., the DEX code) into an intermediate representation called Jimple, and FlowDroid performs flow analysis on the Jimple code. The analysis is flow- and context-sensitive. FlowDroid also handles common native API calls; using heuristics, it tracks data flow across some commonly used native calls.

Dynamic analysis

Static call graphs characterize all possible program behaviors in terms of API calls. However, static analysis has inherent limitations, such as dealing with code obfuscation and reflection. FlowDroid can only resolve reflective API calls when the arguments used in the call are all string constants. Dynamic analysis can overcome this limitation. Hence, the goal of dynamic analysis here is to execute test inputs and observe concrete program behaviors. Since mobile apps are event driven in general, a good test generator needs to be able to generate various kinds of events. In Android, events are typically triggered by means of inter-component communication (intent messages sent by app components) or GUI inputs. Hence, we use two different test generators — an Intent fuzzer and a GUI fuzzer. Our Intent fuzzer was developed in our previous work Demissie et al. (2020). First, it analyzes the call graph of the app to extract paths from public entry points (i.e., inter-/intra-component communication interfaces) to the leaf nodes. As in the static analysis phase, we generate the call graph of the app using Soot with the FlowDroid plugin for Android. The call graph is then traversed forward in a depth-first-search manner, starting from the root node until a leaf node is reached. The output of this step is a set of paths from component entry points to the different leaf nodes (method calls without outgoing edges). Once the list of paths is available, the Intent fuzzer generates inputs in an attempt to execute each path (target). The given app is first instrumented to collect execution traces and then installed and executed on a fresh Android emulator. The fuzzer is seeded with statically collected values (such as static strings) from the app. The generated inputs are Intent messages that are sent to the app under test via Android Debug Bridge (ADB) commands. With ADB’s privilege, we can also invoke private components as well as send events that can only be generated by the system (e.g., BOOT_COMPLETED). Execution traces are collected using the ADB logcat command. A genetic algorithm guides the test generation, where the fitness function is defined based on the coverage of nodes in the target path. Our goal is to maximize coverage and collect as many traces as possible; the collected traces are also used to guide the test generation.
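For illustration, the minimal sketch below shows how fuzzed Intents can be delivered and traces collected over ADB; the component name, extras, and helper functions are illustrative and not part of the actual fuzzer implementation.

```python
import subprocess

def send_intent(package, component, action, extras):
    """Deliver one fuzzed Intent to the app under test via ADB (activities shown here;
    broadcast receivers would use 'am broadcast -a <action>' instead)."""
    cmd = ["adb", "shell", "am", "start", "-n", f"{package}/{component}", "-a", action]
    for key, value in extras.items():
        cmd += ["--es", key, value]   # string extras; other types use --ei, --ez, ...
    subprocess.run(cmd, check=False)

def collect_trace():
    """Dump the current logcat buffer, which contains the instrumentation output."""
    out = subprocess.run(["adb", "logcat", "-d"], capture_output=True, text=True)
    return out.stdout

# Example: exercise one statically discovered entry point of the running example
# (the component name and extra key are hypothetical).
send_intent("com.test.mygame", ".PaymentReceiverActivity",
            "android.intent.action.VIEW", {"phone_no": "10086"})
trace = collect_trace()
```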

While the Intent fuzzer exercises code parts that involve inter-/intra-component communication, it does not address user interactions through the GUI. Therefore, to complement our Intent fuzzer, we use Google’s Android Monkey GUI fuzzer Android (2019). Monkey comes with the Android SDK and randomly generates GUI input events, such as taps, text input, or toggling WiFi, in an attempt to trigger abnormal app behaviors. We used Monkey because its random exploration has been found to yield higher statement coverage than tools using advanced exploration techniques Choudhary et al. (2015). By complementing Monkey’s approach with other strategies (in this case, inter-/intra-component communication), we expect that the coverage can be further improved.
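As a sketch, Monkey can be driven from the same harness; the event count, seed, and throttle below are illustrative and not the exact values used in our experiments.

```python
import subprocess

def run_monkey(package, events=1000, seed=42, throttle_ms=200):
    """Inject pseudo-random GUI events into the app under test via Monkey."""
    subprocess.run([
        "adb", "shell", "monkey",
        "-p", package,                    # restrict events to the target package
        "-s", str(seed),                  # fixed seed for reproducible event streams
        "--throttle", str(throttle_ms),   # pause between events (ms)
        "-v",                             # verbose output
        str(events),                      # number of pseudo-random events to inject
    ], check=False)

run_monkey("com.test.mygame")
```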

We measured the coverage achieved by this approach. Since code coverage is difficult to measure due to the use of libraries, we measured component coverage, i.e., the ratio of the components that are executed during dynamic analysis to the components that are listed in the Android manifest file. Component coverage is shown in the histogram in Fig. 2. While the average component coverage is approximately 43%, a remarkable number of apps reach 100% coverage. This degree of coverage is in line with results in the literature Choudhary et al. (2015).
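A minimal sketch of this metric, with hypothetical component names, is shown below:

```python
def component_coverage(declared_components, executed_components):
    """Ratio of manifest-declared components observed during dynamic analysis."""
    declared = set(declared_components)
    executed = set(executed_components) & declared
    return len(executed) / len(declared) if declared else 0.0

declared = {"MainActivity", "PaymentReceiverActivity", "SmsReceiver", "SyncService"}
executed = {"MainActivity", "SmsReceiver"}
print(component_coverage(declared, executed))  # -> 0.5
```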

3.2 Features Extraction

From the call graphs and the execution traces generated in the previous phase, we extract sequence features, use features, and frequency features at the class level and package level. Each type of feature forms a distinct dataset. From the extracted API calls, we identify API calls that require dangerous permissions. We also identify native API calls (e.g., API calls that access system services and hardware devices). Finally, we identify reflection (i.e., classes that start with java.lang.reflect) and mark these as additional features. From the Android manifest files, we also extract features that represent permission uses (permission requests) and Android component uses, which likewise form distinct datasets.

Fig. 2 Histogram of component coverage

Note that the API calls we extract here are abstracted at the class level and package level. The rationale for choosing class- and package-level features instead of method-level features is to reduce the number of features, following recent state-of-the-art approaches Garcia et al. (2018); Yang et al. (2018); Onwuzurike et al. (2019); Ikram et al. (2019). Method-level features would result in millions of features and significantly longer training times. Those approaches have reported that, despite this cost, the classifiers may not achieve better accuracy, since the feature vectors of the samples would be sparse, and abstracted API call features characterize Android malware even better. The abstraction also provides robustness against API changes in the Android framework, because methods are often subject to change and deprecation. Figure 3 shows an example of an API at different levels.

Regarding the extraction of dangerous-permission features, we implemented an in-house tool that crawls the Android permission documentation websiteFootnote 3 and maps API calls to dangerous permissions. This tool is similar to PScout Au et al. (2012), but PScout only supports up to Android 5.1.1. Our tool supports Android 11 (API 30).Footnote 4

Sequence Features Extraction. We extract sequences of API calls from call graphs and execution traces. Given a call graph, we traverse it in a depth-first-search manner and extract class/package signaturesFootnote 5 as we traverse (hence, a sequence). If there is a loop, the signature is traversed only once. Note that we only extract Android framework classes/packages, Java classes/packages, and standard org classes/packages (org.apache, org.xml, etc.). This is because it is common for malware to be obfuscated to circumvent malware detectors. The obfuscation often involves renaming custom (user-defined) libraries and classes/packages. Hence, a malware detector trained on custom libraries and classes/packages will not be robust against obfuscation. A study Rastogi et al. (2013) has shown that a simple renaming obfuscation can prevent popular anti-malware products from detecting the transformed malware samples. Hence, we filter out classes/packages that are not from the above-mentioned standard packages. Similarly, we extract classes/packages from the execution traces; however, since execution traces are already sequences, depth-first search is not necessary. An excerpt of the sequence of API calls extracted from the repackaged malware app com.test.mygame is shown in Fig. 4.
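A minimal sketch of this extraction, assuming the call graph is given as an adjacency map of fully qualified method signatures (this is an illustration, not the FlowDroid/Soot API):

```python
ALLOWED_PREFIXES = ("android.", "java.", "org.apache.", "org.xml.")  # standard APIs only

def extract_class_sequence(call_graph, entry_point):
    """Depth-first traversal that records class signatures of standard API calls."""
    sequence, visited, stack = [], set(), [entry_point]
    while stack:
        method_sig = stack.pop()
        if method_sig in visited:     # each node is visited once, so loops are not repeated
            continue
        visited.add(method_sig)
        class_sig = method_sig.rsplit(".", 1)[0]        # abstract method -> class
        if class_sig.startswith(ALLOWED_PREFIXES):      # filter out custom packages
            sequence.append(class_sig)
        stack.extend(reversed(call_graph.get(method_sig, [])))
    return sequence
```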

Fig. 3 An example of an API and its package, class, and method

Next, we discretize the sequences of API calls extracted above so that they can be processed by the classifiers. More precisely, we replace each unique class/package signature with an identifier, resulting in a sequence of numbers. We build a dictionary that maps each class to its identifier. During the testing or deployment phase, we may encounter unknown API calls. To address this, (1) we use a large dictionary that covers over 160k class signatures and 4,605 package signatures from standard libraries, and (2) we replace all unknown signatures with a fixed identifier.

The length of the sequences varies from one app to another. The sequence length determines the number of features, and to have a fixed number of features, it is necessary to unify the length of the sequences. Since we have two types of API call sequences — from call graphs and from execution traces — we chose two different uniform sequence lengths. Initially, we extracted the whole sequences. We then took the median length of the sequences from call graphs as the uniform sequence length, denoted \(L_{cg}\), for call-graph-based sequence features, and the median length of the sequences from execution traces as the uniform length, denoted \(L_{tr}\), for execution-trace-based sequence features.Footnote 6 If the length of a given sequence is less than L, we pad the sequence with zeros; if it is longer than L, we trim it to L from the right. Hence, for each app, we end up with a sequence of numbers that forms a feature vector. Each number in the sequence corresponds to the categorical value of a feature. The number of features is the uniform sequence length L. As a result, we obtain static-sequence features from call graphs at the class level and package level, denoted as ssfc and ssfp, respectively. Likewise, we obtain dynamic-sequence features from execution traces at the class level and package level, denoted as dsfc and dsfp respectively. As an example, Table 1 shows a sample dataset containing sequence features.
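A minimal sketch of the encoding and length unification described above; the dictionary contents and the value of L are illustrative only.

```python
UNKNOWN_ID = 1   # fixed identifier for signatures missing from the dictionary
PAD_ID = 0       # padding value

def encode_and_pad(signature_sequence, dictionary, L):
    """Replace each signature with its identifier, then pad or trim to length L."""
    ids = [dictionary.get(sig, UNKNOWN_ID) for sig in signature_sequence]
    if len(ids) < L:
        ids += [PAD_ID] * (L - len(ids))   # pad with zeros
    return ids[:L]                         # trim from the right if too long

dictionary = {"android.telephony.SmsManager": 2, "java.lang.reflect.Method": 3}
print(encode_and_pad(["android.telephony.SmsManager", "java.lang.reflect.Method"],
                     dictionary, L=5))
# -> [2, 3, 0, 0, 0]
```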

Use Features Extraction

We extract the use of API calls at the class level and package level from call graphs and execution traces. The extraction process is the same for both call graphs and execution traces. We initially build a database that stores unique classes and packages. Again, for obfuscation resiliency, we only consider Android framework, Java, and standard org classes, as in the extraction of sequence features. Given a call graph or an execution trace, we scan the files and extract the class signatures and the package signatures (sequence does not matter in this case). Each unique class or package in our database corresponds to a feature (Table 5). The value of a feature is 1 if the corresponding class/package is found in the given call graph or execution trace; otherwise, it is 0. As a result, we obtain static-use features from call graphs at the class level and package level, denoted as sufc and sufp, respectively. Likewise, we obtain dynamic-use features from execution traces at the class level and package level, denoted as dufc and dufp respectively. Table 2 shows a sample dataset containing use features at the class level.
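For illustration, a use-feature vector can be derived as in the sketch below (the vocabulary and observed signatures are illustrative):

```python
def use_feature_vector(observed_signatures, vocabulary):
    """1 if the class/package appears in the call graph or trace, else 0."""
    observed = set(observed_signatures)
    return [1 if sig in observed else 0 for sig in vocabulary]

vocabulary = ["android.telephony.SmsManager", "java.lang.reflect.Method",
              "android.location.LocationManager"]
print(use_feature_vector(
    ["android.telephony.SmsManager", "java.lang.reflect.Method"], vocabulary))
# -> [1, 1, 0]
```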

Fig. 4 An excerpt of the sequence of API calls from a malware sample. It shows the sequence of API calls that require dangerous permissions (Telephony and Sms) and invoke a (potentially malicious) functionality via reflection

Table 1 An excerpt of sequence features extracted from static call graphs. Sequence length L is fixed at 21,000 for dynamic features and 85,000 for static features, which are the median lengths observed in our datasets
Fig. 5 AndroidManifest snippet showing permission and component definition

Table 2 An excerpt of use features including additional (native calls and reflection) features

Frequency Features Extraction

We extract the frequency of API calls from call graphs and execution traces in a similar way to the use-of-API-calls features, except that, for each unique class/package signature, we record the number of its occurrences in the given call graph or execution trace instead of recording the value 1 to denote its presence. As a result, we obtain static-frequency features from call graphs at the class level and package level, denoted as sffc and sffp respectively. Likewise, we obtain dynamic-frequency features from execution traces at the class level and package level, denoted as dffc and dffp respectively. Table 3 shows a sample dataset containing frequency features.
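The only change relative to the use-feature sketch above is that the 0/1 indicator becomes an occurrence count, as sketched below:

```python
from collections import Counter

def frequency_feature_vector(observed_signatures, vocabulary):
    """Occurrence count of each class/package instead of a 0/1 indicator."""
    counts = Counter(observed_signatures)
    return [counts[sig] for sig in vocabulary]
```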

Permission and App Component Features Extraction

The Android manifest file specifies the permissions requested and the app components used by the app. Some approaches have used features that characterize permission uses Enck et al. (2009); Chan and Song (2014); Arp et al. (2014); Lindorfer et al. (2015) and app component uses Kim et al. (2018) to detect Android malware. Therefore, it is important to analyze these features as well. We wrote a Python script to extract these features from the Android manifest files. Figure 5 shows a snippet of an AndroidManifest file. Line 1 shows the definition of the permission RECEIVE_BOOT_COMPLETED, which the app wishes to be granted in order to receive the system notification when the device completes booting. Line 3 shows the definition of a Broadcast Receiver app component, RestartServiceReceiver, that will handle the boot-complete system notification. Table 4 shows a sample dataset containing permission-use features.
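Such a script can be as simple as the sketch below, which parses the text manifest produced by apktool (the function name is ours, not a published API):

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_manifest_features(manifest_path):
    """Collect requested permissions and declared components from a decoded manifest."""
    root = ET.parse(manifest_path).getroot()
    permissions = [e.get(ANDROID_NS + "name") for e in root.iter("uses-permission")]
    components = [e.get(ANDROID_NS + "name")
                  for tag in ("activity", "service", "receiver", "provider")
                  for e in root.iter(tag)]
    return permissions, components
```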

Table 5 shows a summary of the features (datasets) extracted in this study. There are 14 types of features based on Type and Level of features and Analysis method used.

3.3 Classifiers

In the last phase, classifiers are trained and tested on the datasets. The following describes the classifiers used in our evaluations.

Table 3 An excerpt of frequency features including additional (native calls and reflection) features
Table 4 An excerpt of permission-use features

3.3.1 Deep Learning (DL) Classifier

Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher level features from raw input features. Deep learning classifiers typically comprise an input layer, one or more hidden layers, and an output layer. In our previous work Shar et al. (2020), we studied three kinds of DL classifiers — standard deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN). However, in this work, we decided to use only one DL classifier due to the huge amount of computation required for tuning and evaluating DL classifiers in general. We chose RNN and our rationale is as follows:

Table 5 Characteristics of the features (datasets) extracted

The main principles behind CNNs are sparse interactions, parameter sharing, and equivariant representations to implement filter operators (i.e., kernels), which fit the image recognition problem particularly well. In our context, however, API call features hardly enjoy these properties. The Recurrent Neural Network (RNN) is suitable for learning serial events such as language processing or speech recognition Deng et al. (2014). Unlike feed-forward neural networks such as the standard DNN and CNN, an RNN can use its internal memory to process arbitrary sequences of inputs. More specifically, an RNN has memory units, which retain information about previous inputs or the state of the hidden layers, and its output depends on previous inputs, i.e., which API was used last affects which API is used next. Hence, by design, the RNN is suitable for sequence-type features. Furthermore, in our previous work Shar et al. (2020), we observed that the RNN performs well on use features. Therefore, we opted for the RNN in our evaluation.

For use and frequency features, we use an RNN with one input layer, one LSTM layer, one hidden layer, and an output layer with a Softmax function. The input layer accepts use or frequency features as vectors (Section 3.2). Each vector represents an app instance. These vectors are directly fed to the LSTM layer. The LSTM layer mitigates the vanishing-error problem by fixing the weights of the hidden layers to avoid error decay and by retaining not all information from the input but only the selected information required for future outputs.
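The text above does not prescribe a particular DL framework; purely as an illustration, a model with this layout could be sketched in Keras as follows, where the layer sizes and dropout ratio are placeholders rather than the tuned values reported in Table 6:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_rnn(n_features, hidden_sz=128, p=0.2):
    """Input -> LSTM -> hidden dense layer -> softmax output, as described above."""
    model = Sequential([
        # Each use/frequency feature is presented as one time step of a scalar value.
        LSTM(hidden_sz, input_shape=(n_features, 1)),
        Dense(hidden_sz, activation="relu"),
        Dropout(p),
        Dense(2, activation="softmax"),   # benign vs. malware
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```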

Unlike use and frequency features, sequence features are not suitable for direct feeding into the LSTM layer, because the numerical feature values would then be treated as frequency values by the classifier. As discussed in Karbab et al. (2018); McLaughlin et al. (2017), this requires an additional vectorization technique that preserves the sequential patterns. Therefore, for sequence features, we add a vectorization step as follows: the RNN input layer accepts the sequence features of each app instance (Section 3.2) as a vector, and each class/package identifier in the input vector is transformed into a vector using one-hot encoding McLaughlin et al. (2017); Tobiyama et al. (2016). The output of this input layer is then fed to the LSTM layer. As an alternative to one-hot encoding, embedding techniques such as word2vec Mikolov et al. (2013), apk2vec Narayanan et al. (2018), node2vec Grover and Leskovec (2016), and graph2vec Narayanan et al. (2017) could also be applied. However, we leave the evaluation of different embedding techniques in the Android malware detection context as future work.

3.3.2 Conventional Machine Learning (ML) Classifier

Random Forest (RF) has been shown to be a highly accurate classifier for malware detection Eskandari and Hashemi (2012). In our previous work Shar et al. (2020), the RF classifier was evaluated to be the best among the ML classifiers. Since we are not comparing the performance of ML classifiers against each other in this extension, we use only the RF classifier as the flagship of the conventional ML classifiers.Footnote 7 RF is an ensemble of classifiers using many decision tree models Barandiaran (1998). A different subset of the training data is selected, with replacement, to train each tree. The remaining training data serves to estimate the error and the variable importance. We used Scikit-learn Pedregosa et al. (2011) to run the RF classifier. As with the RNN, we applied one-hot encoding for sequence features.
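A minimal Scikit-learn sketch is shown below; the hyper-parameter values here are placeholders, not the tuned values reported in Table 7.

```python
from sklearn.ensemble import RandomForestClassifier

def train_rf(X_train, y_train, n_estimators=100, min_samples_split=2, max_depth=None):
    """Fit a Random Forest on one feature dataset (feature vectors from Section 3.2)."""
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 min_samples_split=min_samples_split,
                                 max_depth=max_depth)
    return clf.fit(X_train, y_train)
```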

3.3.3 Optimizing the Classifiers

We tuned the hyper-parameters of both classifiers to achieve optimal performances as follows.

Tuning the hyper-parameters of RNN

For tuning the parameters, we sampled the data from years 2013 and 2014 (see Table 8), which is never used as test data in our experiments. In total, this data contains about 1,000 malware and 1,000 benign samples. During preliminary tuning, we observed that different datasets require different parameter configurations for improved results. In this preliminary phase, it took about 10 days to tune a relatively small dataset (dufc), and it would take about 30 days each for the larger ones. Since it is intractable to tune for every dataset, we decided to tune only for the dsfc, dufc, and dffc datasets. We then used the optimal configuration of dsfc for the other sequence-type datasets, i.e., dsfp, ssfc, and ssfp, and did the same for the use and frequency datasets. We used Optuna, a hyper-parameter optimization framework Akiba et al. (2019), to tune the following hyper-parameters (a minimal usage sketch follows the list):

  • Optimizer (ADAM, SGD, or RMSprop)

  • Learning rate (lr)

  • Number of neurons in the hidden layer (hidden_sz)

  • Dropout ratio (p)

  • Number of epochs

  • Weight decay
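The sketch below illustrates how these parameters can be tuned with Optuna; the search ranges, trial count, and the training/evaluation helpers (train_rnn, f1_on_holdout) and data variables are hypothetical stand-ins, not our exact setup.

```python
import optuna

def objective(trial):
    params = {
        "optimizer": trial.suggest_categorical("optimizer", ["ADAM", "SGD", "RMSprop"]),
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "hidden_sz": trial.suggest_int("hidden_sz", 32, 512),
        "p": trial.suggest_float("p", 0.0, 0.5),
        "epochs": trial.suggest_int("epochs", 5, 50),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
    }
    model = train_rnn(params, X_tune, y_tune)     # hypothetical training helper
    return f1_on_holdout(model, X_val, y_val)     # hypothetical evaluation helper

study = optuna.create_study(direction="maximize")  # maximize the F-measure
study.optimize(objective, n_trials=100)
print(study.best_params)
```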

Table 6 shows the tuned hyper-parameter values and the F-measure results before and after hyper-parameter optimization.

Table 6 Results of RNN before tuning and after tuning, on the benchmark of apps from years 2013 and 2014. F1 (bf.) represents the results before optimization; F1 (aft.) represents the results after optimization

Tuning the hyper-parameters of RF. Scikit-learn provides two widely used tuning utilities — exhaustive grid search and randomized parameter optimization — for auto-tuning the hyper-parameters of a given classifier on a given dataset.Footnote 8 We combined both tuning methods as follows:

We first apply randomized parameter optimization, which conducts a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This gives us a good combination of hyper-parameter values efficiently. We then widen those hyper-parameter values to a reasonable rangeFootnote 9 and use exhaustive grid search to find the best hyper-parameter values within that range. We followed the same process as for tuning the RNN classifier. That is, we used the same apps from years 2013 and 2014 as the basis for tuning the RF classifier, and we only tuned for the dsfc, dufc, and dffc datasets. This results in the optimized hyper-parameters of Random Forest for Android malware classification shown in Table 7.
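A minimal sketch of this two-stage tuning with Scikit-learn is shown below; the parameter ranges, iteration counts, and the tuning data variables are illustrative, not our exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

def tune_rf(X_tune, y_tune):
    """Stage 1: randomized search over broad ranges; Stage 2: exhaustive grid search
    in a narrow band around the stage-1 result."""
    rand = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={
            "n_estimators": list(range(50, 1001, 50)),
            "min_samples_split": list(range(2, 21)),
            "max_depth": [None] + list(range(5, 51, 5)),
        },
        n_iter=50, scoring="f1", n_jobs=-1)
    rand.fit(X_tune, y_tune)
    best = rand.best_params_

    grid = GridSearchCV(
        RandomForestClassifier(),
        param_grid={
            "n_estimators": [max(10, best["n_estimators"] - 50),
                             best["n_estimators"], best["n_estimators"] + 50],
            "min_samples_split": [best["min_samples_split"],
                                  best["min_samples_split"] + 2],
            "max_depth": [best["max_depth"]],
        },
        scoring="f1", n_jobs=-1)
    grid.fit(X_tune, y_tune)
    return grid.best_params_
```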

3.4 Data Preprocessing

Imbalanced data causes the learning algorithm to be biased towards the dominant class, resulting in misclassification of the minority class. One effective way to improve classifier performance is the synthetic generation of minority instances during the training phase. In our experiments, we use the synthetic minority oversampling technique (SMOTE) Chawla et al. (2002) to balance the training data.
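With the imbalanced-learn library, this step reduces to a single call, as in the sketch below (the variable names are assumptions):

```python
from imblearn.over_sampling import SMOTE

def balance_training_data(X_train, y_train, seed=0):
    """Oversample the minority class in the training split only; the test split
    keeps its realistic malware-to-benign ratio (Section 4.1)."""
    return SMOTE(random_state=seed).fit_resample(X_train, y_train)
```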

Table 7 Results of RF before tuning and after tuning, on the benchmark of apps from year 2013 and year 2014. F1 (bf.) represents the results before optimization; F1 (aft.) represents the results after optimization; n_estimators represents the number of trees used; min_samples_split represents the minimum samples required for splitting a branch; max_depth represents the maximum depth of the tree.

4 Evaluation

This section presents the experimental comparison results of features, analyses, and classifiers for Android malware detection. Specifically, we investigate the following research questions:

  • RQ1: Features. Which types of features perform better?

  • RQ2: Classifiers. When optimized, which type of classifiers — conventional machine learning classifier or deep learning classifier — performs better?

  • RQ3: Additional features. Does the inclusion of features that characterise reflection, native API calls, and API calls classified as dangerous (dangerous permissions) improve the malware detection accuracy? Does combining static analysis-based and dynamic analysis-based features help?

  • RQ4: Robustness. How robust are the malware detectors against evolution in Android framework and malware development?

4.1 Experiment Design

Dataset

Our benchmark consists of 13,772 apps — 7,860 benign samples and 5,912 malware samples. The apps were released between 2010 and 2020. Benign samples were collected from the Androzoo repository Allix et al. (2016). Malware samples were collected from the Androzoo repository Allix et al. (2016) and the Drebin repository Arp et al. (2014). The labeling of malware samples is confirmed by at least 10 antivirus engines via VirusTotal.Footnote 10 Zhao et al. (2021) highlighted the importance of considering sample duplication; that is, a dataset might contain the same or very similar apps with minor modifications, which might cause duplication bias. To avoid this bias, we randomized the download process. Initially, we downloaded over 50k samples from the repositories. However, as we evaluate the use of both static and dynamic analysis-based features, we had to filter out samples that could not be analyzed by both the static and dynamic analysis tools. When using FlowDroid Arzt et al. (2014) to extract call graphs, some of the apps caused exceptions. The main bottleneck, however, was dynamic analysis, as our intent-fuzzing test generation tool encountered crashes or exceptions for several apps. Therefore, we were not able to extract features in those cases. Note that these are limitations of the underlying program analysis tools, and the objective of this experiment is to compare features and classifiers, not to assess the feature extraction components. We took the intersection of the apps that could be analyzed by both the static and dynamic analysis tools and ended up with 13,772 apps. Several malware samples in our datasets are obfuscated. This is important to reflect the real-world setting, because malware authors heavily rely on obfuscation to hide their true behaviors. Table 8 shows the statistics of the datasets according to app release years.

For comparison, Table 9 shows the dataset sizes used by Android malware detection approaches in related work. Note that, in contrast to these studies, we evaluate different types of features and both conventional machine learning and deep learning classifiers. Hence, it was intractable for us to use a larger dataset. Nevertheless, our dataset size is comparable to the sizes used in some recent studies such as Shen et al. (2018); Yang et al. (2018).

Table 8 Dataset Statistics

Performance measure

We use the F-measure (F1) to evaluate performance, a standard measure typically used for evaluating malware detection accuracy Garcia et al. (2018); Onwuzurike et al. (2019). The F1 score reports an optimal blend (harmonic mean) of precision and recall, rather than a simple average, because it punishes extreme values. A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0. It is computed as \(F1 = 2 * (precision * recall)/(precision + recall)\).

Evaluation Procedure

To avoid the temporal bias problem discussed in Allix et al. (2016); Pendlebury et al. (2019), we split the data based on release years. We then train the classifier on the data released in a sequence of years and test it on the data released in the subsequent years. To avoid the spatial bias problem discussed in Pendlebury et al. (2019), we sample the malware instances from the test dataset so that the malware-to-benign ratio is 18%.Footnote 11 We note from Pendlebury et al. (2019) that the malware-to-benign ratio in the wild ranges from 6% to 18%, and we evaluated the features and classifiers with both ratios. However, we discuss the results based on the 18% ratio only.
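For illustration, one time- and space-aware train/test split can be sketched as follows; the sample records with 'year' and 'label' fields are an assumed representation of our datasets, not their actual format.

```python
import random

def time_space_aware_split(samples, train_years, test_years, malware_ratio=0.18):
    """Train on older apps, test on newer apps, and downsample test malware so that
    the malware-to-benign ratio matches the assumed in-the-wild ratio."""
    train = [s for s in samples if s["year"] in train_years]
    test_benign = [s for s in samples if s["year"] in test_years and s["label"] == 0]
    test_malware = [s for s in samples if s["year"] in test_years and s["label"] == 1]
    k = min(len(test_malware), int(len(test_benign) * malware_ratio))
    test = test_benign + random.sample(test_malware, k)
    return train, test
```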

Our general evaluation procedure to investigate our research questions is as follows: For a given feature (listed in Table 5), we run 21 training and test experiments with the given classifier (RF or RNN) as shown in Table 10.

Table 9 Statistics of datasets of some popular malware detection approaches
Table 10 Time- and space-aware train and test procedure used in our experiments

Hardware used

The experiments were performed on two Linux machines: 1) a 40-core Intel CPU E5-2640 at 2.40GHz with 330GB RAM, and 2) a 12-core Intel CPU E5-2603 at 1.70GHz with 204GB RAM. It took about three months to extract call graphs and execution traces from all 50k-plus samples, about one month to extract the features from the final benchmark of 13,772 samples, and about three months to conduct the machine learning experiments.

4.2 RQ1: Comparison among Features

To investigate this research question, we compare the performance of the 14 types of features listed in Table 5. Since we are not comparing the performance of the classifiers in this case, we use only the Random Forest classifier to evaluate the features. For each feature type listed in Table 5, we run 21 training and test experiments with the RF classifier, as shown in Table 10.

Fig. 6 Comparison of features based on F1 scores. See Table 5 regarding the feature notations

Table 11 P-values of the Wilcoxon rank-sum test between each pair of features

Figure 6 shows the boxplot, mean, and standard deviation of F1 scores of Random Forest classifier with 14 different types of features based on the 21 train and test evaluations. We apply the Wilcoxon rank-sum test to perform pairwise comparison among features. For each feature, we perform the Wilcoxon rank-sum test against a different feature and test whether its F1 scores are statistically the same as the F1 scores of that feature (null hypothesis). The corresponding p-values are reported in Table 11.
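With SciPy, each pairwise test reduces to a single call, as in the sketch below (the score arrays are hypothetical stand-ins for the 21 per-experiment F1 scores of two feature types):

```python
from scipy.stats import ranksums

f1_feature_a = [0.62, 0.58, 0.65, 0.60, 0.63]   # e.g., permission-use scores
f1_feature_b = [0.41, 0.44, 0.39, 0.45, 0.42]   # e.g., dynamic-use scores

stat, p_value = ranksums(f1_feature_a, f1_feature_b)
print(p_value < 0.05)   # reject the null hypothesis at alpha = 0.05
```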

We assume a standard significance level of 95% (\(\alpha =0.05\)), i.e., we reject the null hypothesis if the p-value is \(<0.05\). Table 12 shows the comparison result of each feature against the other features, based on the p-values reported in Table 11. Each feature, say f, in a given row is compared against the features in the columns. The label < or > in a given cell indicates whether the feature f is worse than or better than the feature listed in the corresponding column. The label ! denotes that there is no significant difference. For example, in the first row, the feature dsfc is compared against the other features. It performs worse than the dufc, dufp, sufc, sufp, dffc, dffp, sffc, sffp, and pu features; it has no statistically significant difference with the other features.

Table 12 Comparison of features. The label ! denotes no statistical difference; the labels < and > denote whether the F1 scores of a feature are statistically worse or better than the other feature, respectively

Overall, we observe that the permission-use feature (see row ‘pu’) significantly outperformed all other features, except the class-level and package-level static-use features (sufc and sufp). It achieved the best mean F1 score, at 0.64. The second best type of feature is the package-level static-use feature (sufp), with a mean F1 score of 0.57. The component-use feature performed the worst, with a mean F1 score of 0.24. In general, we observe that package-level features achieve better or equal F1 scores compared with their class-level counterparts, e.g., sufp=0.57 vs sufc=0.5 and sffp=0.5 vs sffc=0.47, except in the dynamic-use case. This result is consistent with the observation made by Onwuzurike et al. (2019). We also observe that static features achieve better F1 scores than their dynamic counterparts, e.g., sufp=0.57 vs dufp=0.41 and sufc=0.5 vs dufc=0.45, except in the dynamic-sequence case. Sequence features did not perform well in general, as they all achieved mean F1 scores below 0.35. These low F1 scores show that Android malware detection is not actually a solved problem, even though the majority of approaches in the literature report near-perfect accuracy scores in their experiments. We believe this is because those approaches did not take into account the biases that we considered in our experiments.

Fig. 7 Comparison between the optimized ML classifier and the optimized DL classifier based on F1 scores. RF-dsfp denotes the Random Forest classifier tested with package-level dynamic sequence features; RNN-dsfp denotes the Recurrent Neural Network classifier tested with package-level dynamic sequence features, and similarly for the rest. The last box plot shows the F1 scores of MaMaDroid Onwuzurike et al. (2019), which is used as a baseline comparison

4.3 RQ2: Optimized DL Classifier vs Optimized Conventional ML Classifier

In this section, we compare the performance of the RF classifier and the RNN classifier based on the following 7 types of features: dsfp, ssfp, dufp, sufp, dffp, sffp, and pu. We omitted class-level features and component-use features because a) those datasets contain a large number of features, and it would be computationally intractable to run all of them with the deep learning classifier, and b) RQ1 already established that the omitted features do not perform as well as the others. To provide a baseline, we additionally compare our classifiers against a state-of-the-art approach, MaMaDroid Onwuzurike et al. (2019). The train and test procedure is the same as the one applied in RQ1.

Figure 7 shows the boxplot of F1 scores of Random Forest classifier and RNN classifier based on the 7 types of features evaluated with the 21 train and test procedure (Table 10). Assuming a significance level of 95% (\(\alpha =0.05\)), we apply Wilcoxon rank-sum test to perform the following pairwise comparisons:

  1. RF-dsfp vs RNN-dsfp

  2. RF-ssfp vs RNN-ssfp

  3. RF-dufp vs RNN-dufp

  4. RF-sufp vs RNN-sufp

  5. RF-dffp vs RNN-dffp

  6. RF-sffp vs RNN-sffp

  7. RF-pu vs RNN-pu

Table 13 shows the comparison results between the RF classifier and the RNN classifier based on the Wilcoxon rank-sum tests. In our previous work Shar et al. (2020), we observed that the un-optimized RNN classifier performed poorly compared to the ML classifiers. Here, we see that optimization improves the RNN classifier's performance, especially for sequence features, where the RNN performed statistically better than RF in terms of mean F1. On the other hand, the RF classifier performed better than the RNN on four other feature types, though not statistically significantly, especially the frequency and permission features. Overall, the RNN achieved statistically better performance than RF in 2 out of 7 cases, whereas RF performed better, though not statistically significantly, in 4 out of 7 cases.

For the sake of completeness, we also evaluated the RNN classifier using word embedding for sequence features (dsfp and ssfp). It achieved mean F1 scores of 0.325 and 0.354 for the dsfp and ssfp datasets, respectively. This result is not better than that of the RNN classifier with one-hot encoding but is still better than the RF classifier. These results align with the general agreement that RNNs are suitable for learning serial events Deng et al. (2014), especially since we used an LSTM-based RNN that can effectively capture both long-term and short-term dependencies. On the other hand, we note that word embedding was much more efficient, as it produces more compact vectors than one-hot encoding Mikolov et al. (2013). The time taken to train the RNN with word embedding was in the order of hours, whereas the time taken to train the RNN with one-hot encoding was in the order of days, for one round of training.

It may be surprising that the DL classifier, the more advanced classifier, does not perform significantly better than the ML classifier, except for sequence-type features. However, recent empirical studies Xu et al. (2018); Liu et al. (2018) also found that DL classifiers are not always the overall winner. Even though those studies were conducted in different application domains (predicting relatedness in Stack Overflow posts Xu et al. (2018) and generating commit messages Liu et al. (2018)), they performed optimizations of the classifiers similar to ours and used similar experiment designs. Typically, a DL classifier needs thorough fine-tuning to the characteristics of the data. Although fine-tuning was done, it was only done on the year 2013 and year 2014 data. App characteristics change with the evolution of Android, and this degrades the performance of both types of classifiers, but it seems to affect the DL classifier more. This is discussed in more detail in Section 4.5. Note that fine-tuning to fit all the data is intractable, as it is computationally expensive, and it would also bias the results.

Table 13 Wilcoxon test of F1 scores for the RF and RNN classifiers. At a significance level of 0.05, RNN performs statistically better than RF for the dsfp and ssfp datasets

Note that our previous work observed that the Random Forest classifier achieved the best performance overall. Hence, we chose Random Forest as the flagship of the conventional ML algorithms to compare against a DL algorithm. As a sanity check, we also evaluated Logistic Regression and linear Support Vector Machines on package-level static-frequency features using the same training and test procedure. These classifiers achieved mean F1 scores of 0.48 and 0.41, respectively. In comparison, the RF classifier achieved 0.503. Hence, the RF classifier achieved the better result.

To provide a baseline comparison, we additionally compare our classifiers against a state-of-the-art malware detector, MaMaDroid Onwuzurike et al. (2019), which is based on sequence-type features. MaMaDroid models the sequences obtained from the call graph of an app as Markov chains. Sequences are extracted at class level, package level, and family level. Four types of classifiers (Random Forest, 1-Nearest Neighbor, 3-Nearest Neighbor, and Support Vector Machines) are used to learn from the extracted sequence features. For data preprocessing, Principal Component Analysis is applied. Random Forest achieved the best results in MaMaDroid's experiments. We used the MaMaDroid tool (Footnote 12), as-is, to extract the sequence features from our benchmark apps. For the sake of consistency, we extracted package-level features (Footnote 13). We then used the same configuration of the Random Forest classifier stated in MaMaDroid Onwuzurike et al. (2019). The last boxplot in Fig. 7 shows the F1 scores of the MaMaDroid classifier evaluated on our datasets with the same train and test procedure as in Table 10. As we can observe in Fig. 7, MaMaDroid achieved performance similar to our classifiers with sequence-type features, but generally it does not perform as well as the other classifier+feature configurations we used here.
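To illustrate the underlying idea (a simplified sketch of Markov-chain modeling over package-level sequences, not MaMaDroid's actual implementation), each app's call sequence can be turned into a flattened transition-probability matrix; the package vocabulary and sequence below are hypothetical.

```python
import numpy as np

def markov_chain_features(call_sequence, packages):
    """Flatten the package-to-package transition probabilities of one app."""
    idx = {p: i for i, p in enumerate(packages)}
    counts = np.zeros((len(packages), len(packages)))
    for src, dst in zip(call_sequence, call_sequence[1:]):
        counts[idx[src], idx[dst]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize each row into transition probabilities (rows with no calls stay zero)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return probs.flatten()  # one feature vector per app

# Hypothetical package vocabulary and package-level call sequence
packages = ["android.net", "java.io", "self-defined"]
sequence = ["android.net", "java.io", "java.io", "self-defined", "android.net"]
print(markov_chain_features(sequence, packages))
```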


4.4 RQ3: Additional Features

In this RQ, we perform two kinds of comparisons: (1) determining whether additional features, which represent native calls, reflection, and API calls that require dangerous permissions, improve the performance; and (2) determining whether combining the static analysis-based features and the dynamic analysis-based features (hence "hybrid" features) improves the performance. For both comparisons, we use Random Forest as the classifier.

Regarding the first kind of comparison, we evaluate the RF classifiers trained with additional features on the following datasets: dsfp, ssfp, dufp, sufp, dffp, and sffp. 'With additional features' means that a given dataset is concatenated with its corresponding additional features. For example, dsfp 'with' denotes dynamic-sequence features concatenated with the sequences of native calls, reflections, and API calls that require dangerous permissions. The train and test procedure is the same as the one applied in RQ1.

Fig. 8 Comparison of "without" and "with" additional features. dsfp 'with' denotes that dynamic-sequence features are concatenated with the sequences of native calls, reflections, and API calls that require dangerous permissions; likewise for the others

Figure 8 shows the box plots of the F1 scores 'without' and 'with' additional features. Similar to RQ2, we apply the Wilcoxon rank-sum test to perform pairwise comparisons, and Table 14 reports the F1 means and the statistical test results. We observe that the performance improves significantly for the dynamic-sequence features when the additional features are included. The F1 mean also increases for the static-sequence and dynamic-use features, but the improvements are not statistically significant. The F1 mean actually decreases for the other types of features.

Table 14 Wilcoxon test of F1 scores for "without" and "with" additional features. The "without" and "with" columns show the F1 means. Only the dynamic-sequence features show a statistically significant improvement when the additional features are incorporated

To explain this behavior, we performed a principal component analysis of the static-use datasets containing only the additional features, i.e., the use of native API calls, reflection, and dangerous permissions. Figure 9 shows the PCA plot of the six most significant components for the year 2015 to year 2020 datasets. As shown in the figure, the data points of malware samples largely overlap with those of benign samples. Therefore, there is little difference between malware samples and benign samples in terms of their use of the additional features.
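A minimal sketch of how such a projection could be produced with scikit-learn; the feature matrix and labels below are randomly generated placeholders, not our actual additional-feature data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_add = rng.random((300, 20))          # placeholder: apps x additional features
y = rng.integers(0, 2, size=300)       # placeholder labels: 1 = malware, 0 = benign

pca = PCA(n_components=6)
X_proj = pca.fit_transform(X_add)

# Scatter the two most significant components, colored by class
plt.scatter(X_proj[y == 0, 0], X_proj[y == 0, 1], c="blue", alpha=0.5, label="benign")
plt.scatter(X_proj[y == 1, 0], X_proj[y == 1, 1], c="gold", alpha=0.5, label="malware")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```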

This overlap can be explained by the fact that it is legitimate for mobile apps to use those features to implement their services. That is, mobile apps do need to request dangerous permissions to access the camera, microphone, heart rate (body sensor), etc. It is also common to use native calls to access system services, such as reading and writing files, and to use reflection to dynamically load new functionality. For example, Fig. 10 shows an excerpt of API calls extracted from a benign app, biart.com.flashlight, that we sampled from our dataset. It contains uses of native API calls to access system services and of dangerous permissions to use the camera device.

We note that both benign and malware apps use API call features as well, and yet API call features can still discriminate malware. This is likely because each set of additional features looks at a specific aspect of app behavior, e.g., whether an app uses a dangerous permission or not, whereas API call features cover the complete app behavior based on call graphs or execution traces; thus, the specific behaviors covered by the additional features may already be implicitly covered by the API call features. Hence, we believe that API call features better profile app behaviors and that the additional features do not further discriminate malware.

Fig. 9 Principal component analysis (6 components) of additional features used in malware and benign apps. Yellow color indicates malware and blue color indicates benign apps

Fig. 10 An excerpt of API calls found in a benign app sample

Regarding the second kind of comparison, we combine static analysis-based features and dynamic analysis-based features to determine whether the hybrid features would improve the performance. We concatenate static-sequence features and dynamic-sequence features, denoted as hsfp = ssfp \(\Vert \) dsfp. Table 15 shows an example of hsfp. Likewise, we concatenate static-use features with dynamic-use features, and static-frequency features with dynamic-frequency features, denoted as hufp and hffp, respectively. We then perform the 21 training and test evaluations on these three new types of features using Random Forest as the classifier. Note that we simply concatenate the two types of features without any data processing.
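A minimal sketch of this concatenation; the matrices below are randomly generated placeholders, aligned so that each row corresponds to the same app in both feature sets.

```python
import numpy as np

rng = np.random.default_rng(0)
ssfp = rng.random((100, 300))   # placeholder static-sequence feature matrix
dsfp = rng.random((100, 300))   # placeholder dynamic-sequence feature matrix

# Hybrid features: plain column-wise concatenation, with no further processing
hsfp = np.hstack([ssfp, dsfp])
print(hsfp.shape)               # (100, 600)
```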

Table 15 An excerpt of hybrid-sequence features
Fig. 11 F1 scores for "without" and "with" combining features

Table 16 Wilcoxon test of F1 scores for "without" and "with" combining features. No statistically significant difference was observed at a significance level of 0.05

Figure 11 shows the F1 scores for "without" and "with" combining the static analysis-based features and the dynamic analysis-based features. Table 16 shows the Wilcoxon test results. As we can observe, the F1 mean actually decreases when the two types of features are combined, although there is no statistically significant difference according to the Wilcoxon tests. This is likely due to overlapped features from the two analyses, since both analyses extract features from the same app. For example, both analyses extract the package android.net as a feature. Assuming use features, static analysis will report the value 1 for this feature if it detects the presence of this package in the call graph, but dynamic analysis will report the value 0 for the same feature if it does not observe the execution of this package at runtime. Conversely, static analysis will report the value 0 for the android.net feature if it does not detect the presence of this package in the call graph, whereas dynamic analysis will report the value 1 for android.net if the app invokes this package using dynamic code loading, which is not present in the static call graph. Hence, the conflicting values in the overlapped features may confuse the classifier, resulting in worse performance. Dealing with such overlapped features deserves a separate, thorough investigation, as it requires investigating how to leverage the different types of information conveyed by static and dynamic analyses and how to extract the semantic meaning provided by these analyses together, rather than simply concatenating the two types of features.


4.5 RQ4: Robustness Against Android Evolution

In this research question, we investigate which combination of classifiers and features is most robust against the evolution of Android over time. Figure 12 shows the F1 scores of different classifier-feature combinations over time. In Fig. 12, we observe that most of the classifier-feature combinations show similar patterns in terms of F1 scores over time, which means that those features are all sensitive to changes in Android permissions and API calls, and in malware construction. For example, in late 2015, Google released Android 6, which introduced a redesigned app permission model. Unlike in previous versions, apps are no longer automatically granted all the permissions they request at install time. Users are required to grant or deny the specified permissions when an application needs to use them for the first time, and users can also revoke these permissions at any time. This caused a shift in the characteristics of benign apps in terms of permission and API usage. Furthermore, malware authors are constantly advancing their malware to bypass detection mechanisms, for example, by using obfuscation or applying adversarial learning Shahpasand et al. (2019). Adversarial learning Huang et al. (2011) is a technique that generates samples (e.g., malware variants) which are carefully crafted/perturbed to evade detection. Clearly, such changes in Android permissions and API calls, and in malware construction, affect malware detection performance.

Fig. 12 Performance vs Time

Fig. 13 Principal component analysis of permission-use features from malware apps. Yellow color indicates malware and blue color indicates benign apps

Based on Fig. 12, among the classifier+feature combinations, the RF classifier with permission-use features (RF-pu), followed by the RNN classifier with permission-use features (RNN-pu), could be considered the most robust. When trained on the year 2010-2014 dataset (Fig. 12a), all other combinations did not achieve F1 scores above 0.65 on the datasets from subsequent test years, whereas RF-pu and RNN-pu maintained F1 scores above 0.65, except for test years 2017 and 2018. We also observe that the RF classifier with static-use features (RF-sufp) is an interesting combination. When trained on the year 2010-2014 dataset, it did not perform well; but when trained with more data, i.e., the year 2010-2015 dataset and subsequent ones, it produced a performance similar to RF-pu and RNN-pu. Its counterpart RNN-sufp did not perform quite as well, and it is likely that RNN needs further fine-tuning in this case. When there is sufficient training data, RF-sufp may be considered another robust classifier+feature combination.

We expected that the performance of the classifier+feature combinations would generally decrease over time. As observed in Fig. 12a, this is the case from year 2015 to year 2018. However, the performance actually improves in years 2019 and 2020, especially for RF-pu and RNN-pu. To understand this behavior, we performed a PCA of permission-use features in malware apps from years 2010-2014 versus malware apps from year 2019, and another PCA of permission-use features in malware apps from years 2010-2014 versus malware apps from year 2020. The goal is to analyze the difference in characteristics between malware from those different release years. The result is shown in Fig. 13. We observe that the malware characteristics in terms of permission use are similar. To further investigate the behavior shown in Fig. 12a, we extracted the permission-use features that are most informative for the Random Forest's classification decisions (Footnote 14). We found that the most informative features from years 2010-2014 and from years 2019 and 2020 commonly include READ_PHONE_STATE, SEND_SMS, READ_SMS, and GET_TASKS. Therefore, it is likely that those common features improved the detection performance for the year 2019 and year 2020 datasets.
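A sketch of how such informative features can be extracted from a trained Random Forest using its impurity-based importances; the data and permission names below are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 30))                   # placeholder permission-use matrix
y = rng.integers(0, 2, size=500)            # placeholder labels
names = [f"PERMISSION_{i}" for i in range(X.shape[1])]  # hypothetical permission names

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance and report the top ten
for i in np.argsort(rf.feature_importances_)[::-1][:10]:
    print(names[i], round(rf.feature_importances_[i], 4))
```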

Other commonly informative permission-use features across years (i.e., 2015, 2016, 2017, 2018) include ACCESS_WIFI_STATE, CHANGE_WIFI_STATE, INSTALL_SHORTCUT, INTERNET, and WRITE_EXTERNAL_STORAGE. Likewise, we analyzed the most informative static-use features across years; they include org.apache.http.conn, org.apache.http.client, java.security.cert, java.lang.annotation, android.net.wifi, android.transition, android.support.v4.accessibilityservice, android.media.session, javax.net, android.telephony, com.google.ads.mediation, and com.google.android.gms.ads. The functionality of these APIs ranges from network connection and telephony services to media and advertisement services. Hence, these APIs can be considered good predictors of malware.

To evaluate whether a time-aware and space-aware evaluation setting is important, we also ran 10-fold cross validation on the RF classifier with all the datasets combined (from year 2010 to year 2020). Table 17 compares the results. As shown in Table 17, the cross-validated results are clearly better than the results of the time-aware and space-aware evaluation setting (Table 10). That is, time and space biases unfairly inflate the reported results. Allix et al. Allix et al. (2015) reported that the F1 scores of Android malware classifiers were lower than 0.7 in a time-aware scenario. Similarly, our best classifier achieved an F1 mean score of 0.64. Fu and Cai Fu and Cai (2019) also reported that F1 scores dropped from about 90% to below 30% within a span of one year. Our results not only corroborate the results of previous studies Allix et al. (2015); Fu and Cai (2019) but also confirm that the biased improvement occurs regardless of the features used. From this observation, we can conclude that the timeline is an important aspect of malware detection. That is, a malware detector should be re-trained whenever possible.
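The following sketch contrasts the two evaluation settings; the feature matrix, labels, and release years are synthetic placeholders, not our benchmark data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 50))                    # placeholder features
y = rng.integers(0, 2, size=1000)             # placeholder labels
years = rng.integers(2010, 2021, size=1000)   # placeholder release years

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# (1) 10-fold cross validation: mixes "future" malware into the training folds
cv_f1 = cross_val_score(rf, X, y, cv=10, scoring="f1").mean()

# (2) Time-aware split: train only on apps released up to 2014, test on later apps
train, test = years <= 2014, years > 2014
rf.fit(X[train], y[train])
ta_f1 = f1_score(y[test], rf.predict(X[test]))

print(f"10-fold CV F1 = {cv_f1:.3f}  vs  time-aware F1 = {ta_f1:.3f}")
```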

Table 17 Comparison of F1 mean scores between ten-fold cross validation and time- and space-aware classification settings

4.6 Threats to Validity

Here we discuss the main threats to the validity of our findings.

Threats to conclusion validity are concerned with issues that affect the ability to draw the correct conclusion. To limit this threat, we applied a statistical test (i.e., the Wilcoxon rank-sum test) that is non-parametric and thus does not assume the experimental data to be normally distributed. Additionally, to increase the heterogeneity of samples in the dataset, we considered apps from multiple sources (Androzoo and Drebin), released over multiple years (from 2010 to 2020).

Threats to internal validity concern the subjective factors that might have affected the results. To limit this threat, apps were randomly selected and downloaded from the markets among those that satisfy our experimental settings (years 2010 to 2020) and experimental constraints (that they work with the FlowDroid static analysis tool and with the Monkey testing tool).

Threats to construct validity concern the data collection and analysis procedures. Labeling apps as benign/malware is based on a standard approach, that is, (i) relying on the VirusTotal classification available as metadata for apps from Androzoo; and (ii) manually recognized malicious behavior for apps from Drebin. Empirical results are based on the F-measure, which is a standard performance measure. Moreover, to limit bias, we split the training data and the test data based on their release years and a realistic malware-to-benign distribution. A threat regarding the analysis procedure is code coverage. As explained in Section 3.1, we used a combination of a GUI fuzzer and an Intent fuzzer so as to cover both GUI events and inter-component communications, which are typical and essential behaviors in mobile apps. However, like any other test generation-based approach, the code coverage of our test generator is limited. Although we apply a genetic algorithm, a state-of-the-art technique for Intent generation developed in our previous work Demissie et al. (2020), it was not able to generate test cases (Intents) for some of the paths in the call graphs. This could result in missing information in the dynamic features, and we acknowledge that this may explain why static features perform better than dynamic features.

Regarding the analysis of sequence features, we trimmed call sequences that are too long, taking into account the variance in sequence length among apps (see Section 3.2; a small illustrative sketch of such trimming is shown after this paragraph). One may argue that this could result in missing information in the sequence features. However, our rationale is that using a longer sequence length results in many zero-valued features for most of the apps, i.e., many redundant features. We performed some preliminary experiments using a longer sequence length and observed that the performance actually decreases. Another analysis-related threat concerns the extraction of the API-permission mapping (used to extract dangerous permission features). We relied on the official Android documentation, which includes the mappings for public APIs only. The mappings for hidden and private APIs (which can be invoked through reflection) were not included. Thus, we acknowledge that such APIs, which may fall in the dangerous permission category, would be missed by our approach. However, our argument here is that undocumented APIs change frequently, and it is intractable for us to document them comprehensively, especially since we are dealing with versions across 11 years. Also, from a malware detection point of view, we believe that relying on a more consistent (official) list of APIs to build a malware detector is more robust.
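For illustration, the trimming mentioned at the start of this paragraph could be done as in the following sketch; the maximum length and the example sequences are hypothetical, not our actual cut-off.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical integer-encoded API-call sequences of varying length
sequences = [
    [3, 7, 7, 12, 5],
    [9, 1],
    [4, 4, 8, 2, 6, 11, 4, 4, 9, 13, 2, 7],
]

# Truncate overly long sequences and zero-pad short ones to a common length
trimmed = pad_sequences(sequences, maxlen=8, truncating="post", padding="post")
print(trimmed.shape)   # (3, 8)
```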

Threats to external validity concern the generalization of our findings due to the relatively small size of our dataset compared to the literature (Table 9). This is due to our consideration of several features and types of analyses (static and dynamic). By contrast, existing work that uses larger datasets tends to focus on static analysis. However, as both static analysis- and dynamic analysis-based features are relevant and useful for malware detection, we decided to evaluate both in this work. Despite our best efforts, we were able to analyse only 13,772 apps due to the time taken and the computational complexity of our analyses. In particular, our test generation tool took a long time to complete. It also encountered compatibility issues due to changes across different versions of the Android platform, and we had to adapt our tool. To mitigate the issue, we considered apps from multiple app stores released over 11 years.

5 Insights

For antivirus vendors

In RQ1, we found that features at the permission level or package level produce the best performance, while they are also computationally more efficient than more fine-grained features at the class level. Deep learning algorithms have recently been used in the context of Android malware detection. They have the ability to learn hierarchical features and complex sequential features, but this usually comes at the cost of carefully fine-tuning the hyper-parameters, which may take considerable time. On the other hand, conventional machine learning classifiers have been shown to be effective at Android malware detection. In particular, an ensemble classifier like Random Forest aggregates multiple classifiers to learn complex patterns and achieves good classification results without much hyper-parameter tuning. In our experiments, we tuned both types of classifiers, but in RQ2 we observed that tuning Random Forest takes much less time and effort than tuning RNN, the deep learning classifier, and yet the results are comparable, except for sequence features. Hence, our recommendation to antivirus vendors is that it is more cost-effective to use conventional machine learning classifiers for Android malware detection when using feature types other than sequences. In RQ4, we learnt that malware detectors' performance is sensitive to changes in the Android framework and in malware construction. Our recommendation to antivirus vendors is to take these findings into consideration when building and evaluating malware detectors, and to update them often.

For research community

In RQ1, we observed that dynamic features generally do not perform as well as static features. As discussed in Section 4.6, this could be due to the code coverage limitation of our test generator. Essentially, the test generator fails to generate test inputs when the target path requires satisfying certain conditions in the application logic or when the path involves user interaction (e.g., a click action). Researchers could improve on this aspect by combining dynamic test generation with static constraint solving techniques such as Thome et al. Thomé et al. (2017) for more effective test generation. In RQ3, we learnt that features that characterize reflection, native API calls, and dangerous permissions, on top of API call features, do not further discriminate Android malware from benign apps. In Android, all these features, including native API calls, reflection, and dangerous permissions, are designed to be used to serve various functional purposes. However, malicious apps often abuse them to conduct malicious activities like accessing sensitive information. Hence, the empirical study conducted in this work is not complete in this regard. Distinct apps might have very different functionalities. What is considered legitimate for a particular set of apps (e.g., sharing contacts for a messaging app) can be considered malicious behavior for other apps (e.g., a piece of malware that steals contacts, to be later used by spammers). A more accurate ML model should also take into consideration the main functionalities that are declared by an app, such as the ones proposed in Yang et al. (2017); Demissie et al. (2018). Hence, future studies should investigate the use of clustering to group apps with similar functionalities and evaluate based on clusters of those similar apps. On another note, we found that combining static-based features and dynamic-based features does not result in better performance. But in this case, we simply concatenated the two types of features without any data preprocessing to filter overlapped or redundant features. Future studies could consider applying an appropriate feature reduction technique, such as Principal Component Analysis, t-distributed Stochastic Neighbor Embedding, Multidimensional Scaling, or Isometric Mapping, to deal with overlapped features, as in the sketch below.
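As a rough sketch of this suggestion (hypothetical data; any of the techniques mentioned above could be substituted for PCA), the reduction could be placed in a pipeline before the classifier:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_hybrid = rng.random((500, 600))     # placeholder concatenated static+dynamic features
y = rng.integers(0, 2, size=500)      # placeholder labels

# Reduce overlapped/redundant dimensions before classification
clf = make_pipeline(
    PCA(n_components=50),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(X_hybrid, y)
```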

In RQ4, we learnt that cross validation, which is typically used in Android malware detection approaches, allows malware "from the future" to be part of the training sets and thus produces biased results. Allix et al. Allix et al. (2015) observed that such a biased construction of training datasets has a positive impact on the performance of the classifiers and thus makes the results unreliable. In addition, Pendlebury et al. Pendlebury et al. (2019) reported an issue with spatial bias, where the evaluation does not consider the realistic distribution between malware and benign samples. Our studies produced similar findings, despite the different types of features we used. Therefore, researchers in the Android malware detection community should re-validate their proposed state-of-the-art approaches, taking into consideration the temporal and spatial biases.

6 Conclusion

In this work, we evaluated various techniques commonly used for building Android malware detectors. More specifically, we evaluated 14 types of features, applying both static and dynamic analyses to extract them. We evaluated two types of classifiers (a conventional machine learning classifier and a deep learning classifier). We also evaluated additional features (native API calls, reflection, and APIs that require dangerous permissions) and combined (static+dynamic) features. We investigated which types of features perform better; evaluated which types of classifiers perform better when optimized; evaluated whether additional features can improve the performance; and evaluated which combination of features and classifiers is most robust against the evolution of Android. We conducted all the experiments in a time- and space-aware setting, on a common benchmark containing 7,860 benign samples and 5,912 malware samples collected over a period of 11 years (from year 2010 to 2020). We observed that permission-use features performed the best among the features, followed by static-use package-level features; that package-level features represent a good abstraction level as they perform well and are computationally efficient; and that static features perform better than dynamic features. We also observed that, even when optimized, the deep learning algorithm does not always perform better than the conventional machine learning algorithm. Due to the tendency of benign apps to also use reflection, native API calls, and APIs that require dangerous permissions, the inclusion of those features does not further improve the accuracy of malware classification. Lastly, we found that malware classifiers need to be updated whenever possible, regardless of the features and classifiers used, as they are sensitive to changes in Android APIs and malware construction. In future work, we intend to further investigate other deep learning classifiers, given that we evaluated only one deep learning classifier in this work due to the time and resources required for optimization and evaluation. We also intend to investigate the effect of clustering the apps based on their functional similarities and performing the training and testing according to the clusters of apps.