1 Introduction

Traditionally, malware detection and classification have relied on pattern matching against signatures extracted from specific malware samples. While simple and efficient, signature scanning is easily defeated by a number of well-known evasive strategies. This fact has given rise to statistical and machine learning-based techniques, which are more robust to code modification. In response, malware writers have developed advanced forms of malware that alter the statistical and structural properties of their code, which can cause statistical models to fail.

In this chapter, we compare deep learning (DL) models for malware classification. For most of our deep learning models, we use image-based features, but we also experiment with opcode features. The DL models considered include a wide variety of neural network techniques, including multilayer perceptrons (MLP), several variants of convolutional neural networks (CNN), and vanilla recurrent neural networks (RNN), as well as the advanced RNN architectures known as long short-term memory (LSTM) and gated recurrent units (GRU). We also experiment with a complex stacked model that combines both LSTM and GRU. In addition, we consider transfer learning, in the form of the ResNet152 and VGG-19 architectures.

The remainder of this chapter is organized as follows. In Sect. 2 we provide relevant background information, including a discussion of related work, an overview of the various learning techniques considered, and we introduce the dataset used in this research. Section 3 is the heart of the chapter, with detailed results from a wide variety of malware classification experiments. Section 4 concludes the chapter and provides possible directions for future work.

2 Background

In this section, we discuss related work and we introduce the various learning techniques that are considered in this research. We also discuss the dataset that we use in our malware classification experiments. In addition, we provide the specifications of the hardware and software that we use to conduct the extensive set of experiments that are summarized in Sect. 3.

2.1 Related Work

To the best of our knowledge, image-based analysis was first applied to the malware problem in [16], where high-level “gist” descriptors are used as features. More recently, [44] confirmed the results in [16] and presented an alternative deep learning approach that produces equally good—if not slightly better—results, without the extra work required to extract gist descriptors.

Transfer learning, where the output layer of an existing pre-trained DL model is retrained for a specific task, is often used in image analysis. Such an approach allows for efficient training, as a new model can take advantage of a vast amount of learning that is embedded in the pre-trained model. Leveraging the power of transfer learning has been shown to yield strong image-based malware detection and classification results [44].

There is a vast malware analysis literature involving classic machine learning techniques. Representative examples include [2, 5, 8, 25, 28, 42]. Intuitively, we might expect models based on image analysis to be somewhat stronger and more robust, as compared to models that rely on opcodes, byte n-grams, or similar statistical features that are commonly used in malware research.

The work presented in this chapter can be considered an extension of the work in [6], where image-based transfer learning is applied to the malware classification problem. We have extended this previous work in multiple dimensions, including a larger, more challenging, and more realistic dataset. In addition, we perform much more experimentation with a much wider variety of techniques, and we consider a large range of hyperparameters in each case.

2.2 Learning Techniques

In this section, we provide a brief introduction to each of the learning techniques considered in this chapter. Additional details on most of the learning techniques discussed here can be found in [27], which includes examples of relevant applications of the techniques. We provide additional references for the techniques discussed below that are not considered in [27].

2.2.1 Multilayer Perceptron

A perceptron computes a weighted sum of its inputs, yielding a decision boundary in the form of a hyperplane, and based on a threshold, a perceptron can be used to define a classifier. It follows that a perceptron cannot provide ideal separation in cases where the data itself is not linearly separable. This is a severe limitation, as something as elementary as the XOR function is not linearly separable.
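For concreteness, the perceptron decision rule can be written as follows, where \(w\) is the weight vector, \(b\) is the bias (threshold), and \(x\) is the input vector:

\[
f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0, \\ 0 & \text{otherwise,} \end{cases}
\]

so the decision boundary is the hyperplane \(w \cdot x + b = 0\).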

A multilayer perceptron (MLP) is an artificial neural network that includes multiple (hidden) layers in the form of perceptrons. Unlike a single layer perceptron, MLPs are not restricted to linear decision boundaries, and hence an MLP can accurately model more complex functions. The relationship between perceptrons and MLPs is very much analogous to the relationship between linear support vector machines (SVM) and SVMs based on nonlinear kernel functions.

Training an MLP would appear to be challenging since we have hidden layers between the input and output, and it is not clear how changes to the weights in these hidden layers will affect each other or the output. Today, MLPs are generally trained using backpropagation. The discovery that backpropagation can be used for training neural networks was a major breakthrough that made deep learning practical.

2.2.2 Convolutional Neural Network

Generically, artificial neural networks use fully connected layers. The advantage of a fully connected layer is that it can deal effectively with correlations between any points within training vectors. However, for large training vectors, fully connected layers are infeasible, due to the vast number of weights that must be learned.

In contrast, a convolutional neural network (CNN) is designed to deal with local structure. A convolutional layer cannot be expected to perform well when significant information is not local. The benefit of CNNs is that convolutional layers can be trained much more efficiently than fully connected layers, due to the reduced number of weights.

For images, most of the important structure (edges and gradients, for example) is local. Hence, CNNs are an ideal tool for image analysis and, in fact, CNNs were developed precisely for image classification. However, CNNs have performed well in a variety of other problem domains. In general, any problem for which local structure predominates is a candidate for CNNs.

2.2.3 Recurrent Neural Network

MLPs and CNNs are feedforward neural networks, that is, the data feeds directly through the network, with no “memory” of previous feature vectors. In a feedforward network, each input vector is treated independently of other input vectors. While feedforward networks are appropriate for many problems, they are not well suited for dealing with sequential data.

In some cases, it is necessary for a classifier to have memory. Suppose that we want to tag parts of speech in English text (i.e., noun, verb, etc.). This is not feasible if we only look at words in isolation. For example, the word “all” can be an adjective, adverb, noun, or pronoun, and its part of speech can only be determined by considering its context. A recurrent neural network (RNN) provides a way to add memory (or context) to a feedforward neural network.

RNNs are trained using a variant of backpropagation known as backpropagation through time (BPTT). A problem that is particularly acute in BPTT is that the gradient calculation tends to become unstable, resulting in “vanishing” or “exploding” gradients. To overcome these problems, we can limit the number of time steps, but this also serves to limit the utility of RNNs. Alternatively, we can use specialized RNN architectures that enable the gradient to flow over long time periods. Both long short-term memory and gated recurrent units are examples of such specialized RNN architectures. We discuss these two RNN architectures next.

2.2.4 Long Short-Term Memory

Long short-term memory (LSTM) networks are a class of RNN architectures that are designed to deal with long-range dependencies. That is, LSTM can deal with extended “gaps” between the appearance of a feature and the point at which it is needed by the model. In plain vanilla RNNs this is generally not possible, due to vanishing gradients.

The key difference between an LSTM and a generic vanilla RNN is that an LSTM includes an additional path for information flow. That is, in addition to the hidden state, there is a so-called cell state that can be used to, in effect, store information from previous steps. The cell state is designed to serve as a gradient “highway” during backpropagation. In this way, the gradient can “flow” much further back with less chance that it will vanish (or explode) along the way.
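For reference, one standard formulation of the LSTM update is given below, where \(x_t\) is the input, \(h_t\) the hidden state, \(c_t\) the cell state, \(\sigma\) the logistic sigmoid, and \(\odot\) elementwise multiplication:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\widetilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \widetilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]

Note that the cell state \(c_t\) is updated additively, which is what allows the gradient to flow relatively unimpeded during BPTT.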

As an aside, we note that the LSTM architecture has been one of the most commercially successful learning techniques ever developed. Among many other applications, LSTMs play a critical role in Google Allo [11], Google Translate [43], Apple’s Siri [13], and Amazon Alexa [9].

2.2.5 Gated Recurrent Unit

Due to its wide success, many variants on the LSTM architecture have been considered. Most such variants are slight, with only minor changes from a standard LSTM. However, a gated recurrent unit (GRU) is a fairly radical departure from an LSTM. Although the internal state of a GRU is somewhat complex and less intuitive than that of an LSTM, there are fewer parameters in a GRU. As a result, it is easier to train a GRU than an LSTM, and consequently less training data is required.

2.2.6 ResNet152

Whereas LSTM uses a complex gating structure to ease gradient flow, a residual network (ResNet) defines additional connections that correspond to identity layers. These identity layers allow a ResNet model to, in effect, skip over layers during training, which serves to effectively reduce the depth when training and thereby mitigate gradient pathologies. Intuitively, ResNet is able to train deeper networks by training over a considerably shallower network in the initial stages, with later stages of training serving to flesh out the intermediate connections. This approach was inspired by pyramidal cells in the brain, which have a similar characteristic, in the sense that they bridge “layers” of neurons [26].

ResNet152 is a specific deep ResNet architecture that has been pre-trained on a vast image dataset. As one of our two examples of transfer learning, we use this architecture, which includes an astounding 152 layers. That is, we use the ResNet152 model, where we only retrain the output layer specifically for our malware classification problem.

2.2.7 VGG-19

VGG-19 is a 19-layer convolutional neural network that has been pre-trained on a dataset containing more than \(10^6\) images [24]. This architecture has performed well in many contests, and it has been generalized to a variety of image-based problems. Here, we use the VGG-19 architecture and pre-trained model as one of our two examples of transfer learning for image-based malware classification.

2.3 Dataset

Our dataset consists of 20 malware families. Three of these malware families, namely, Winwebsec, Zeroaccess, and Zbot, are from the Malicia dataset [15], while the remaining 17 families are taken from the massive malware dataset discussed in [12]. This latter dataset is almost half a terabyte and contains more than 500,000 malware samples in the form of labeled executable files.

Table 1 lists the 20 families used in this research, along with the type of malware present in each family. Next, we briefly discuss each of these 20 malware families.

Table 1 Type of each malware family

 

Adload:

downloads an executable file, stores it remotely, executes the file, and disables proxy settings [29].

Agent:

downloads trojans or other software from a remote server [30].

Alureon:

exfiltrates usernames, passwords, credit card information, and other confidential data from an infected system [35].

BHO:

can perform a variety of actions, guided by an attacker [32].

CeeInject:

uses advanced obfuscation to avoid being detected by antivirus software [34].

Cycbot.G:

connects to a remote server, exploits vulnerabilities, and spreads through a backdoor [3].

DelfInject:

sends usernames, passwords, and other personal and private information to an attacker [20].

FakeRean:

pretends to scan the system, notifies the user of supposed issues, and asks the user to pay to clean the system [36].

Hotbar:

is adware that shows ads on webpages and installs additional adware [1].

Lolyda:

sends information from an infected system and monitors the system. It can share user credentials and network activity with an attacker [21].

Obfuscator:

tries to obfuscate or hide itself to defeat malware detectors [37].

Onlinegames:

steals login information and tracks user keystroke activity [22].

Rbot:

gives control to attackers via a backdoor that can be used to access information or launch attacks, and it serves as a gateway to infect additional sites [38].

Renos:

downloads software that claims the system has spyware and asks for a payment to remove the nonexistent spyware [31].

Startpage:

changes the default browser homepage and can perform other malicious activities [33].

Vobfus:

is a worm that downloads malware and spreads through USB drives or other removable drives [39].

Vundo:

displays pop-up ads and it can download files. It uses advanced techniques to defeat detection [40].

Winwebsec:

displays alerts that ask the user for money to fix nonexistent security issues [41].

Zbot:

is installed through email and shares a user’s personal information with attackers. In addition, Zbot can disable a firewall [23].

Zegost:

creates a backdoor on an infected machine [4].

The number of samples per malware family for the various features is given in Table 2. The “Binaries” column lists the number of binary executable files available, the “Images” column lists the number of binaries that were successfully converted to images, and the “Opcodes” column lists the number of samples from which a sufficient number of opcodes were extracted. From the table we see that 26,413 samples are used in our image-based experiments, and 25,901 samples are used in our opcode-based experiments.

Table 2 Samples per malware family

2.4 Hardware

Table 3 lists the hardware configuration of the machine used for the experiments reported in this chapter. This machine was assembled for the purpose of training deep learning models and it is highly optimized for this task.

Table 3 Hardware characteristics

2.5 Software

For our deep learning neural network experiments, we have used PyTorch [18]. For general data processing and related operations, we employ both NumPy [17] and Pandas [14]. All code that was developed as part of this project is available at [19].

3 Deep Learning Experiments and Results

In this section, we present results of a wide variety of neural network-based experiments. First, we consider MLP experiments, followed by CNN experiments, and then RNN experiments. We consider a large number of CNN and RNN cases. We conclude this section with a pair of models based on transfer learning. The MLP, CNN, and transfer learning models are based on image features, while the RNN experiments use opcode sequences.

We consider various image sizes, in each case using square images. To generate a square image from an executable, we first specify a width N, with the height determined by the size of the sample. We then resize the image so that it is \(N\times N\), which has the effect of stretching or shrinking the height, as required.
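As an illustration, the following sketch shows one way to perform this conversion using NumPy and Pillow. This is our reading of the general approach, not the exact script used in this research, and the default width of 256 is simply an example.

import numpy as np
from PIL import Image

def exe_to_square_image(path, n=256):
    """Convert a binary executable to an n-by-n gray-scale image."""
    data = np.fromfile(path, dtype=np.uint8)      # one byte per pixel
    height = int(np.ceil(len(data) / n))
    padded = np.zeros(height * n, dtype=np.uint8)
    padded[:len(data)] = data                     # pad the final row with zeros
    img = Image.fromarray(padded.reshape(height, n), mode="L")
    return img.resize((n, n))                     # stretch or shrink the height to n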

3.1 Multilayer Perceptron Experiments

We experimented with various perceptron-based neural networks. The model we present here uses a square input image and has four hidden layers, each using the popular rectified linear unit (relu) activation function. The output from the final hidden layer is passed to a fully connected output layer. The output layer is used to classify the sample—since we have 20 classes of malware in our dataset, the output vector is 20-dimensional. The hyperparameters used for these MLP experiments are given in Table 4.

Table 4 MLP model parameters
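A minimal PyTorch sketch of an MLP of this shape is shown below; the hidden layer width and image width are illustrative placeholders, since the actual values are among the hyperparameters of Table 4.

import torch.nn as nn

class MalwareMLP(nn.Module):
    """Four relu hidden layers followed by a 20-class output layer."""
    def __init__(self, image_width=64, hidden=512, num_classes=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                   # square image to vector
            nn.Linear(image_width * image_width, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),                 # 20-dimensional output
        )

    def forward(self, x):
        return self.net(x)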

Figure 2 gives the confusion matrix for the best results obtained in our MLP experiments. The hyperparameters used for this best case are those shown in boldface in Table 4. In this case, the DelfInject and Obfuscator families have the lowest detection rates, with both only slightly above 50% accuracy. The overall accuracy is 0.8644.

3.2 Convolutional Neural Network Experiments

We have conducted a large number of convolutional neural network (CNN) experiments. In this section we first discuss CNN experiments based on two-dimensional images. Then we consider one-dimensional CNN experiments, where the malware images are vectorized. We also present results for CNN experiments using opcodes extracted from PE files, as opposed to forming images based on the raw byte values in the executable files. The opcodes were extracted using objdump, and we use the resulting mnemonic opcode sequence (eliminating operands, labels, etc.) as features. The hyperparameters tested for all of these CNN experiments are given in Table 5.

Table 5 CNN model parameters
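The opcode extraction step described above can be sketched as follows; the objdump invocation and the filtering rules are our assumptions about the general approach, and the details used in this research may differ.

import subprocess

def extract_opcodes(path, max_opcodes=None):
    """Disassemble a PE file with objdump and return its mnemonic sequence."""
    out = subprocess.run(["objdump", "-d", path],
                         capture_output=True, text=True).stdout
    opcodes = []
    for line in out.splitlines():
        parts = line.split("\t")
        # Disassembly lines look like: "  401000:\t55\tpush   %ebp"
        if len(parts) >= 3 and parts[2].strip():
            mnemonic = parts[2].split()[0]
            if not mnemonic.startswith("."):     # skip data directives
                opcodes.append(mnemonic)
    return opcodes[:max_opcodes] if max_opcodes else opcodes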

3.2.1 Two-Dimensional Image CNNs

Based on two-dimensional image features, we test the CNN model hyperparameters listed under “CNN 2-d” in Table 5. All of these 2-d CNN experiments use two convolutional layers and three fully connected layers. The first convolutional layer takes as input a square gray-scale image with one channel and outputs data with 12 channels using a kernel size of three, padding of two, and a stride of one. A relu activation and max pooling are applied to the result before passing it to the second convolutional layer. This second layer outputs data with 16 channels, with the other parameters being the same as the first convolutional layer. Again, relu activation and max pooling are applied before passing data to the first fully connected layer. This first fully connected layer outputs a vector of dimension 120. After applying relu activation, the data is passed to the second fully connected layer, which reduces the output to a 90-dimensional vector. Finally, relu activation is again applied and the data passes to the last fully connected layer, which is used to classify the sample, and hence is 20-dimensional. For all image sizes less than 1024, we execute our CNN 2-d models for 50 epochs; for the case of \(1024\times 1024\) images, we use 8 epochs due to the cost of training on these large images.
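The following PyTorch sketch reflects our reading of this 2-d architecture; the \(2\times 2\) max pooling kernel is an assumption, as its size is not stated above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MalwareCNN2d(nn.Module):
    """Two convolutional layers followed by three fully connected layers."""
    def __init__(self, image_width=64, num_classes=20):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 12, kernel_size=3, stride=1, padding=2)
        self.conv2 = nn.Conv2d(12, 16, kernel_size=3, stride=1, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        # Each convolution (kernel 3, pad 2, stride 1) grows the width by two,
        # and each 2x2 max pool halves it
        n = ((image_width + 2) // 2 + 2) // 2
        self.fc1 = nn.Linear(16 * n * n, 120)
        self.fc2 = nn.Linear(120, 90)
        self.fc3 = nn.Linear(90, num_classes)

    def forward(self, x):                        # x: (batch, 1, width, width)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)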

The best overall accuracy obtained for our CNN 2-d experiments is 0.8955. Figure 3 gives the confusion matrix for the best case. We note that the Obfuscator family is again the most difficult to distinguish.

3.2.2 Vectorized Image CNNs

Recent work has shown promising results for malware classification using one-dimensional CNNs on “image” data [10]. Consequently, we experiment with flattened images, that is, we use images that are one pixel in height. A possible advantage of this approach is that it avoids the dependence of two-dimensional results on the width chosen for the images. We perform two sets of such experiments, which we denote as CNN 1-d and CNN 1-d refined, the latter of which considers additional fine-tuning parameters. The hyperparameters tested for these two cases are given in Table 5.

Our CNN 1-d model uses two one-dimensional convolutional layers, followed by three fully connected layers. The first convolutional layer takes in an image with one channel and outputs data with 28 channels based on a kernel size of three. The second convolutional layer outputs data with 16 channels and again uses a kernel of size three. The first fully connected layer outputs a vector of 120 dimensions, which is reduced to 90 dimensions by the second fully connected layer which, in turn, is reduced to 20 dimensions by the third (and last) fully connected layer. We have applied relu activations in all layers. A sketch of this 1-d architecture follows.
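In this PyTorch sketch, the absence of padding and pooling is an assumption, as is the placeholder sequence length.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MalwareCNN1d(nn.Module):
    """Two 1-d convolutional layers followed by three fully connected layers."""
    def __init__(self, seq_len=4096, num_classes=20):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 28, kernel_size=3)
        self.conv2 = nn.Conv1d(28, 16, kernel_size=3)
        # With kernel size three and no padding, each convolution
        # shortens the sequence by two positions
        self.fc1 = nn.Linear(16 * (seq_len - 4), 120)
        self.fc2 = nn.Linear(120, 90)
        self.fc3 = nn.Linear(90, num_classes)

    def forward(self, x):                        # x: (batch, 1, seq_len)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)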

The confusion matrix for our best CNN 1-d case is given in Fig. 4. The overall accuracy in this case is 0.8664. A handful of families (Agent, Alureon, DelfInject, Obfuscator, and Rbot) have accuracies below 80%, which accounts for the majority of the loss in accuracy.

The CNN 1-d refined tests use the same basic setup as our CNN 1-d experiments, but with different selections of hyperparameters. As expected, these additional parameters improved on the CNN 1-d case, as the best overall accuracy attained for our CNN 1-d refined experiments is 0.8932. Qualitatively, the CNN 1-d refined results are similar (per family) to the CNN 1-d experiments, so we have omitted the confusion matrix for this case.

3.2.3 Opcode-Based CNNs

We also apply 2-d CNNs to opcode features. For each malware sample, we use the first N opcodes from each binary file, where \(N\in \{500,5000\}\). We also experiment with various other parameters, as indicated in Table 5.

The results for the best choice of parameters for our opcode-based CNN experiments are summarized in the confusion matrix in Fig. 5. Perhaps not surprisingly, the results in this case are relatively weak, with an overall accuracy of 0.8282. However, it is interesting to note from the confusion matrix that some of the families that are consistently misclassified at high rates by image-based CNN models are classified with high accuracy by this opcode-based approach. For example, DelfInject is classified at no better than about 71% in our previous CNN experiments, but it is classified with greater than 90% accuracy using the opcode-based features.

3.3 Recurrent Neural Networks

Next, we consider a variety of experiments based on various recurrent neural network (RNN) architectures. Specifically, we employ plain vanilla RNN, LSTM, and GRU models. We also consider a complex LSTM-GRU stacked model. The hyperparameters tested in these experiments are summarized in Table 6.

Table 6 RNN model parameters

3.3.1 Vanilla RNN, LSTM, and GRU

We have trained our plain vanilla RNNs, LSTMs, and GRU-based models using 20 epochs in each case, with a learning rate of 0.001, a batch size of 128, and based on the first 500 opcodes from each malware sample. We performed multiple experiments with various other parameters, as given in Table 6. In addition, we have applied a dropout layer with 0.3 probability for all models with more than one layer.
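A minimal PyTorch sketch of the GRU variant is given below. The embedding layer that maps opcode mnemonics to vectors, along with the embedding and hidden dimensions, are our assumptions, since the opcode encoding is not specified above; substituting nn.LSTM or nn.RNN for nn.GRU yields the corresponding LSTM and vanilla RNN models.

import torch.nn as nn

class OpcodeGRU(nn.Module):
    """GRU classifier over a sequence of opcode mnemonics."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128,
                 num_layers=2, num_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True,
                          dropout=0.3 if num_layers > 1 else 0.0)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: opcode indices, shape (batch, 500)
        out, _ = self.gru(self.embed(x))
        return self.fc(out[:, -1, :])            # classify from the final time step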

The vanilla RNN experiments performed poorly, with an overall accuracy of just 0.7294, and hence we omit the confusion matrix for this case. On the other hand, both the LSTM and GRU models perform well, with accuracies of 0.8916 and 0.9003, respectively. The confusion matrix for the GRU case is given in Fig. 6. Since the LSTM results are so similar, we omit the LSTM confusion matrix. From Fig. 6, we see that, qualitatively, the results of our GRU experiments more closely match those of the CNN opcode-based experiments than the CNN image-based experiments. However, quantitatively, our GRU opcode-based experiments yield significantly better results than our CNN opcode-based experiments.

3.3.2 Stacked LSTM-GRU Model

As in [7], we have also experimented with stacked LSTM and GRU layers. The experiments in this chapter test more parameters and we use a larger dataset, as compared to [7]. A configuration option, which we refer to as LG, is used to decide whether the LSTM is stacked on top of the GRU (\(\text{ LG } = \text{ false }\) in this case) or GRU is stacked on top of the LSTM (\(\text{ LG } = \text{ true }\)). For example, when LG is “true,” opcode inputs are first passed to LSTM layers, with the output of the LSTM (i.e., the hidden cells) becoming input to the GRU layers. The output of the GRU is then passed to fully connected layers that are used to classify the input data. We have applied a dropout layer with 0.3 probability for models with more than one layer.
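A sketch of the LG = true configuration is given below, under the same assumptions about opcode embedding and layer sizes as in the previous sketch; for LG = false, the two recurrent blocks are simply swapped.

import torch.nn as nn

class StackedLSTMGRU(nn.Module):
    """LSTM layers feeding into GRU layers (the LG = true configuration)."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128,
                 num_layers=2, num_classes=20):
        super().__init__()
        drop = 0.3 if num_layers > 1 else 0.0
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=drop)
        self.gru = nn.GRU(hidden_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, dropout=drop)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: opcode indices
        lstm_out, _ = self.lstm(self.embed(x))
        gru_out, _ = self.gru(lstm_out)          # LSTM hidden states feed the GRU
        return self.fc(gru_out[:, -1, :])        # fully connected classification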

The best overall accuracy we obtain for our stacked LSTM-GRU experiments is 0.8990; the confusion matrix for this case is given in Fig. 7. This is somewhat disappointing, as it is in between the results obtained for our LSTM and GRU models.

3.4 Transfer Learning

Finally, we have considered two popular image-based transfer learning models, namely ResNet152 and VGG-19. These models have been pre-trained on large image datasets, and we simply retrain the last few layers for the malware dataset under consideration, while the earlier layers are frozen during training. The parameters used in these experiments are summarized in Table 7.

Table 7 Transfer learning model parameters

For ResNet152, the model parameters for layer four were unfrozen for training. We also added two more fully connected layers for training. ResNet152 is pre-trained on 1000 classes and hence its last fully connected layer has an output dimension of 1000. We reduce this output dimension to 500 via another fully connected layer, and an additional fully connected layer further reduces the output dimension to 20, which is the number of classes in our dataset.

For VGG-19, we froze all layers except 34, 35, and 36. As with ResNet152, we added two more layers of fully connected neurons to reduce the output dimension from 1000 to 20.

For all of our transfer learning experiments, we use a batch size of 256 and train each model for 20 epochs with learning rates of 0.001 and 0.0001. Both ResNet152 and VGG-19 expect image dimensions of \(224\times 224\) and hence we resize our \(256\times 256\) images to \(224\times 224\).
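The ResNet152 setup might be sketched in PyTorch as follows; the relu activations between the appended layers and the replication of the gray channel to three channels are our assumptions, as neither detail is specified above.

import torch.nn as nn
from torchvision import models, transforms

def build_resnet152_transfer(num_classes=20):
    """ResNet152 with layer four and two appended fully connected layers trainable."""
    model = models.resnet152(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False              # freeze the pre-trained weights
    for param in model.layer4.parameters():
        param.requires_grad = True               # unfreeze layer four
    # Keep the original 1000-way output and append 1000 -> 500 -> 20
    model.fc = nn.Sequential(
        model.fc, nn.ReLU(),
        nn.Linear(1000, 500), nn.ReLU(),
        nn.Linear(500, num_classes),
    )
    return model

# Resize 256x256 malware images to the expected 224x224 input
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate gray channel (assumption)
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])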

The performance of these transfer learning models was the best of our deep learning experiments, with ResNet152 achieving an overall accuracy of 0.9150 and VGG-19 doing slightly better at 0.9216. The confusion matrix for VGG-19 is given in Fig. 8; we omit the confusion matrix for ResNet152 since it is similar, but marginally worse. As compared to the other image-based deep learning models we have considered, we see marked improvement in the classification accuracy of the most challenging families, such as Obfuscator.

3.5 Discussion

The results of the malware classification experiments discussed in this section are summarized in Fig. 1. We see that among the deep learning techniques, the image-based pre-trained models, namely, ResNet152 and VGG-19, perform best, with VGG-19 classifying more than 92% of the samples correctly. The best of our other (i.e., not pre-trained) image-based models achieved slightly less than 90% accuracy.

Fig. 1 Comparison of results

Although the opcode-based results performed relatively poorly overall, it is interesting to note that they were able to classify some families with higher accuracy than any of the image-based models. This suggests that a model that combines both image features and opcode features might be more effective than either approach individually.

4 Conclusions and Future Work

Malware classification is a fundamental and challenging problem in information security. Previous work has indicated that treating malware executables as images and applying image-based techniques can yield strong classification results.

In this chapter, we provided results from a vast number of learning experiments, comparing deep learning techniques using image-based features to some cases involving opcode features. For our deep learning techniques, we focused on multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN), including long short-term memory (LSTM) and gated recurrent units (GRU). We also experimented with the image-based transfer learning techniques ResNet152 and VGG-19. Among these techniques, the image-based transfer learning models performed the best, with the best classification accuracy exceeding 92%.

For future work, additional transfer learning experiments would be worthwhile, as there are many more parameters that could be tested. Larger and more diverse datasets could be considered. In addition, it would be interesting to consider both image-based and opcode features as part of a combined classification technique. As noted above, the opcode-based techniques perform worse overall, but they do provide better results for some families that are particularly challenging to distinguish based only on image features.