1 Introduction

The Android operating system, which is Linux-based, has become very popular over the years. It was designed primarily for touchscreen mobile devices such as smartphones and tablets. These devices are increasingly used to access services, such as messaging, video/music sharing and e-commerce transactions, that were previously available only on PCs. Consequently, the platform has attracted malware developers who target mobile users [3, 10].

Malicious software that degrades device performance and steals personal data, such as contacts, media and personal messages, without the user’s knowledge must be detected. Machine learning techniques can be used to classify software into two categories: malware and safeware. This classification can be based on the XML file called “Android Manifest” that is present in each Android application. It provides essential information to the operating system, such as the first class to use when starting the app or the permissions used by the application [3].

Only the permissions declared in this file can be used by the application, and only after the user has been asked to grant them. If the application tries to use a permission that is not declared in the Android Manifest file, the execution fails. Unfortunately, many users tend to grant permissions to unknown applications, which is how malicious software infects the device [3].

Thus, users must be made aware of the type of software they are installing so that they do not fall prey to malicious software and lose essential data from their mobile devices. In this project, we use the DREBIN dataset. It contains 5,560 malicious applications from 179 different malware families. The samples were collected from August 2010 to October 2012 [5].

2 Literature Survey

There are two approaches to detecting malware in Android operating systems. The first is a signature-based approach, which generates a signature for every kind of malware and compares it with the application [1]. Typical antivirus software (e.g., Norton and McAfee) uses signature-based methods to identify malware. However, attackers can easily evade these methods, for example by changing signatures through code obfuscation or repackaging [11]. The second is behavioural detection, in which the behaviour of an application is observed at runtime to identify malicious intent [1, 3].

In recent years, there has been an increasing trend of using machine learning to overcome the challenges mentioned above and to develop automatic and intelligent malware detection methods. These techniques are capable of discovering patterns that allow them to detect previously unseen malware samples and to identify the malware families of malicious samples. These systems can be classified into two categories: dynamic analysis and static analysis [3, 11].

Dynamic analysis [12,13,14] involves accumulating information regarding API calls, environmental variables and data transmission during the execution of an application. Dynamic analysis gives precise predictions and has lower false positive rates [3, 11].

Static analysis involves two parts: feature extraction and classification. The features are first extracted from the source file, and a model is then created to identify the malware families. A number of known systems follow this scheme. DroidMat performs static analysis, using the manifest file and source code to extract features and applying k-means clustering and k-NN classification [7]. DREBIN uses the manifest file to extract features from 5,560 applications and uses SVM as a classifier [3, 5, 11].

Some effort has been made to integrate static and dynamic analyses for better performance. Dynamic analysis could be used to reduce the false positives obtained after static analysis, but doing so could in turn increase the false positives if a particular path is not executed during the dynamic analysis [3, 8, 9, 11].

Permissions accessed by Android applications have been studied extensively to understand malicious intent. The Android operating system provides a coarse-grained mandatory access control (MAC). A permission-based classifier can identify more than 81% of malicious software. It can be used as a preliminary malware check before a complete second analysis [3].

These applications are classified as malicious or benign according to the combination of permissions they require. The DREBIN dataset contains 5,560 applications from 179 different malware families together with their respective permissions, making it a sufficient dataset [3, 5].

The DREBIN dataset, in comparison with older datasets, gives a better overall false positive rate. Most machine learning algorithms achieve accuracy rates greater than 85% on this dataset. DREBIN performs better than older datasets and 9 out of 10 popular anti-virus scanners [2]. The analysis of the DREBIN dataset is fast, usually taking less than a second on computer systems and less than a few seconds on a smartphone [3, 5].

On this dataset, Random Forest Classifier shows better results than Naive Bayes and Logistic Regression [2]. The Random Forest Classifier is most suited for high-dimensional data modelling. It is easy to use as it can handle all data types and manage dataset inconsistencies easily [4]. Support Vector Machine is a good choice for a classifier as it gives high precision and recall values. DREBIN data has embedded feature sets that make it suitable to run the SVM algorithm. Compared to the other two methods, the SVM algorithm takes longer to run but gives accurate results [6].

3 Dataset Description

The DREBIN approach used in this paper for malware detection and classification is to gather as many features as possible from the application’s manifest and code and to embed these features into a joint vector space in which the features are grouped into sets. Machine learning techniques are then used to identify patterns in these features. The features collected from each application carry one of the following properties, which are grouped into eight sets [3]:

  • feature: set S1 (Hardware Components)

  • permission: set S2 (Requested Permissions)

  • activity, service receiver, provider, service: set S3 (App Components)

  • intent: set S4 (Filtered Intents)

  • api call: set S5 (Restricted API Calls)

  • real permission: set S6 (Used Permissions)

  • call: set S7 (Suspicious API Calls)

  • url: set S8 (Network Addresses)

Because the number of features is very large, the actual feature contents have not been used. Some feature types take thousands of distinct values, and few values are shared across application files. One-hot encoding these values would therefore make the feature vectors grow very quickly, so it has not been used in the algorithms. Instead, the number of feature properties in each feature set has been counted and stored in the dataset. A feature vector of size eight is used, in which each entry holds the count for one feature set, and the output is True (malware) or False (not malware). The input vectors look as follows (Fig. 1):

Fig. 1 A snapshot of the feature_vectors_data.csv file
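The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the `prefix::value` line format, the underscore spelling of the prefixes and the helper name `count_vector` are assumptions made for the example.

```python
from collections import Counter

# Map each feature prefix to one of the eight feature sets S1..S8.
# Assumption: every feature is stored as a "prefix::value" line and the
# prefixes are spelled with underscores (e.g. "service_receiver").
PREFIX_TO_SET = {
    "feature": "S1",
    "permission": "S2",
    "activity": "S3",
    "service_receiver": "S3",
    "provider": "S3",
    "service": "S3",
    "intent": "S4",
    "api_call": "S5",
    "real_permission": "S6",
    "call": "S7",
    "url": "S8",
}
SET_ORDER = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]


def count_vector(lines):
    """Return the eight per-set feature counts for one application."""
    counts = Counter()
    for line in lines:
        prefix, sep, _value = line.strip().partition("::")
        if sep and prefix in PREFIX_TO_SET:
            counts[PREFIX_TO_SET[prefix]] += 1
    return [counts[name] for name in SET_ORDER]


# Hypothetical feature lines for a single application.
sample_app = [
    "feature::android.hardware.touchscreen",
    "permission::android.permission.INTERNET",
    "permission::android.permission.SEND_SMS",
    "activity::.MainActivity",
    "intent::android.intent.action.MAIN",
    "call::getDeviceId",
    "url::example.com",
]
vector = count_vector(sample_app)
print(vector)
```

Each application is thus reduced to eight integers, regardless of how many distinct feature values it contains.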

4 Proposed Work

Using the dataset, machine learning techniques are applied to classify the applications as malware or non-malware. As the use of Android apps keeps growing, detecting the malicious behaviour of such apps becomes necessary for users’ security, privacy and safe usage. Our approach is to detect malware using machine learning techniques that classify apps as malicious or benign and to suggest the better detection method.

4.1 Sampling Data

A graph comparing the count of malware with the count of safeware+malware is plotted (Fig. 2). There are 5560 malware samples and 129013 samples in total (safeware+malware), which leaves 129013 − 5560 = 123453 safeware samples. The large difference between these values shows that the data is imbalanced. To balance the data, we use upsampling and downsampling. Since malware samples are far fewer in number, malware is treated as the minority class, and the remaining samples are treated as the majority class.

  • Upsampling: It is the process of randomly duplicating examples of the minority class (sampling with replacement) until it is as large as the majority class. In this dataset, the minority class is upsampled using the resample method of the scikit-learn library with the number of samples set to 123453 and a random state of 123. After upsampling, both classes contain 123453 samples (Fig. 3).

  • Downsampling: It is the process of randomly removing examples from the majority class until it is as small as the minority class. In this dataset, the majority class is downsampled using the resample method with the number of samples set to the length of the minority class (i.e., 5560) and a random state of 123. After downsampling, both classes contain 5560 samples (Fig. 4).

For both the upsampled and the downsampled data, the data is preprocessed, standardised and split into a 70% training set and a 30% testing set.
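The resampling and splitting steps can be sketched with scikit-learn as shown below. The data here is a small synthetic stand-in (1000 safeware and 50 malware rows of eight count features) rather than the DREBIN counts. Fitting the scaler on the training portion only and using a stratified split are assumptions of this sketch, not details stated in the text, and `make_split` is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Synthetic stand-in for the eight count features of each class.
X_safe = rng.poisson(5, size=(1000, 8))   # majority class (safeware)
X_mal = rng.poisson(8, size=(50, 8))      # minority class (malware)

# Upsampling: draw minority rows with replacement up to the majority size.
X_mal_up = resample(X_mal, replace=True, n_samples=len(X_safe), random_state=123)

# Downsampling: draw majority rows without replacement down to the minority size.
X_safe_down = resample(X_safe, replace=False, n_samples=len(X_mal), random_state=123)


def make_split(X_neg, X_pos):
    """Stack both classes, split 70/30 and standardise the features."""
    X = np.vstack([X_neg, X_pos]).astype(float)
    y = np.concatenate([np.zeros(len(X_neg), dtype=int),
                        np.ones(len(X_pos), dtype=int)])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=123, stratify=y)
    scaler = StandardScaler().fit(X_train)  # statistics from training data only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


up = make_split(X_safe, X_mal_up)
down = make_split(X_safe_down, X_mal)
print("upsampled train/test rows:", up[0].shape[0], up[1].shape[0])
print("downsampled train/test rows:", down[0].shape[0], down[1].shape[0])
```

With the sizes given in the text, `n_samples` would be 123453 for upsampling and 5560 for downsampling.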

Fig. 2 A graph between the count of malware and safeware+malware in the DREBIN dataset

Fig. 3 A graph between the count of malware and safeware+malware after upsampling the DREBIN dataset

Fig. 4 A graph between the count of malware and safeware+malware after downsampling the DREBIN dataset

4.2 Logistic Regression

Logistic regression is a classification algorithm that is used to assign observations to a discrete set of classes. In this project, binary classification has been employed, as we are classifying the software into two categories, viz. malware and safeware. After the data has been split into training and testing sets, the model is trained with the LogisticRegression class from the linear_model module of scikit-learn. The labels of the test data are then predicted, and the metric functions are applied to these labels to calculate the accuracy, precision, recall and F1 score of the model. The same procedure was followed for unsampled, upsampled and downsampled data to calculate the results.
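The train, predict and score procedure described above might look like the following sketch. The two Gaussian clusters are synthetic placeholders for the eight-dimensional count vectors, and `max_iter=1000` is an assumed setting, since the text does not report hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic placeholder data: label 0 = safeware, label 1 = malware.
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 8)),
               rng.normal(1.5, 1.0, size=(300, 8))])
y = np.array([0] * 300 + [1] * 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

model = LogisticRegression(max_iter=1000)  # assumed setting
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# The four metrics named in the text, with malware (1) as the positive class.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(scores)
```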

4.3 Random Forest Classifier

A random forest classifier is a classification method that combines many decision trees, each built on a randomly selected subset of the dataset. It builds multiple decision trees and then merges their outputs to get a more accurate and stable prediction. The model is trained with the RandomForestClassifier class from the ensemble module of scikit-learn. The test data labels are then predicted, and the metric functions are used to calculate the accuracy, precision, recall and F1 score of the model. The same procedure is followed for unsampled, upsampled and downsampled data to calculate the results.
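A minimal sketch of this step is given below, again on synthetic data generated with `make_classification`. The choice of `n_estimators=100` is an assumption. The fitted forest exposes its individual trees through `estimators_` and a per-feature importance score through `feature_importances_`, which here would indicate which of the eight feature-set counts the trees rely on most.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic placeholder for the eight count features and binary labels.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

forest = RandomForestClassifier(n_estimators=100, random_state=123)  # assumed size
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# The ensemble consists of n_estimators independently grown decision trees.
print("number of trees:", len(forest.estimators_))
print("feature importances:", forest.feature_importances_.round(3))
print(classification_report(y_test, y_pred, target_names=["safeware", "malware"]))
```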

4.4 Support Vector Machines

Support Vector Machine is a popular supervised learning algorithm used for classification. The algorithm aims to find the best decision boundary (hyperplane) that separates the n-dimensional feature space into classes, so that new data points can be assigned to the correct category. The hyperplane is determined by the extreme points of each class, called support vectors. The model is trained with the Support Vector Classifier (SVC) class from the svm module of scikit-learn. Prediction and metric calculation on unsampled, upsampled and downsampled data follow the same procedure as for the other classification methods.
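The role of the support vectors can be seen in a short sketch like the one below. It uses synthetic data and the RBF kernel, which is the scikit-learn default for `SVC`; the kernel actually used in the experiments is not stated, so this is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder for the eight count features and binary labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)

# SVMs are sensitive to feature scale, so standardise first.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf")  # assumed kernel (scikit-learn default)
svm.fit(scaler.transform(X_train), y_train)
y_pred = svm.predict(scaler.transform(X_test))

# Only a subset of the training points ends up defining the boundary.
n_support = svm.support_vectors_.shape[0]
print(f"{n_support} of {len(X_train)} training points are support vectors")
print("test accuracy:", svm.score(scaler.transform(X_test), y_test))
```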

5 Results and Analysis

The output plots, along with the results, are given below. We notice that all three classification methods have successfully classified the software in the given DREBIN dataset as malware or safeware.

Formulae used for the calculations:

$$\begin{aligned} \text{True Positives \%} = \frac{TP}{P} \end{aligned}$$
$$\begin{aligned} \text{True Negatives \%} = \frac{TN}{N} \end{aligned}$$
$$\begin{aligned} \text{Accuracy} = \frac{TP + TN}{P + N} \end{aligned}$$
$$\begin{aligned} \text{Precision} = \frac{TP}{TP + FP} \end{aligned}$$
$$\begin{aligned} \text{Recall} = \frac{TP}{P} = \frac{TP}{TP + FN} = \text{True Positives \%} \end{aligned}$$
$$\begin{aligned} F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned}$$

(TP = True Positives, TN = True Negatives, FN = False Negatives, FP = False Positives, P = Positives = TP+FN, N = Negatives = FP+TN)

Here, Positives are Malware and Negatives are Safeware.
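As a worked example of these definitions, the sketch below computes each quantity from the four confusion-matrix counts. The counts (TP = 90, TN = 80, FP = 20, FN = 10) are made up for illustration and are not results from this study; `metrics_from_counts` is a hypothetical helper name.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    p = tp + fn            # all actual positives (malware)
    n = fp + tn            # all actual negatives (safeware)
    precision = tp / (tp + fp)
    recall = tp / p        # identical to the true positive percentage
    return {
        "true_positive_rate": tp / p,
        "true_negative_rate": tn / n,
        "accuracy": (tp + tn) / (p + n),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts only.
example = metrics_from_counts(tp=90, tn=80, fp=20, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

For these counts, the accuracy is 170/200 = 0.85, the precision is 90/110 ≈ 0.818, the recall is 90/100 = 0.9 and the F1 score is 180/210 ≈ 0.857.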

Confusion matrix representation (Fig. 5):

Fig. 5 Representation of a confusion matrix

Below are the confusion matrices of all three classifications performed on unsampled, upsampled and downsampled data:

Table 1 Results of classifiers on unsampled, upsampled and downsampled data

From the results given in Table 1, we can observe that classification performed on the original dataset, which has not undergone resampling, produces inconsistent results. It is also clear that the three classification methods perform better when the dataset has been upsampled or downsampled [3] (Figs. 6, 7, 8, 9, 10).

Fig. 6 Confusion matrix obtained for Random Forest Classifier method performed on the non-sampled data

Fig. 7 Confusion matrix obtained for Random Forest Classifier method performed on the upsampled data

We can see that the Random Forest Classifier gives the best results, with an accuracy of 0.993 on the upsampled dataset and a slightly lower accuracy on the downsampled dataset. It produced 99.7% true positives and 78.8% true negatives. The Logistic Regression method gives an accuracy of around 0.82. The Support Vector Machine gives an accuracy of 0.866 when upsampled and 0.863 when downsampled, but these values are much lower than those of the Random Forest Classifier. Hence, we can conclude that the Random Forest Classifier is the best of the three methods for malware classification (Figs. 11, 12, 13, 14).

Fig. 8 Confusion matrix obtained for Random Forest Classifier method performed on the downsampled data

Fig. 9 Confusion matrix obtained for Logistic Regression method performed on the non-sampled data

Fig. 10 Confusion matrix obtained for Logistic Regression method performed on the upsampled data

Fig. 11 Confusion matrix obtained for Logistic Regression method performed on the downsampled data

Fig. 12 Confusion matrix obtained for SVM method performed on the non-sampled data

Fig. 13 Confusion matrix obtained for SVM method performed on the upsampled data

Fig. 14 Confusion matrix obtained for SVM method performed on the downsampled data

6 Conclusion and Future Work

In this paper, static analysis has been carried out using features such as “permissions” and “API calls” extracted from the AndroidManifest.xml file and the application code, after which the results of the classification carried out by the three methods have been analysed and compared. We have concluded from the results that the Random Forest Classifier is the most effective of the three methods for malware classification. For future work, the software can be classified using dynamic analysis by extracting system calls.