1 Introduction

In this chapter, we discuss a variety of topics related to deep learning, with the primary focus on popular neural network-based architectures. We survey various malware-related applications of each architecture considered. Each topic is discussed in some detail, with additional references for further reading provided in all cases.

This chapter can be viewed as a companion to the survey [78], which covers classic machine learning techniques and their applications in cybersecurity research. Our focus here is on neural networks and deep learning, and with respect to applications, we focus most of our attention on malware-related topics, but we do mention other applications within the broader information security domain.

For the sake of completeness, we begin with an introduction to artificial neural networks (ANNs), which includes a brief history of neural networks. We then introduce a wide variety of architectures and techniques, including convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM), residual networks (ResNet), and generative adversarial networks (GAN). We also discuss related techniques, such as word embeddings (including Word2Vec), and we briefly mention ensemble techniques and transfer learning in passing.

2 A Brief History of ANNs

The concept of an artificial neuron [26, 82] is not new, as the idea was proposed by McCulloch and Pitts in the 1940s [52]. However, modern computational neural networking really begins with the perceptron, which was first proposed by Rosenblatt in the late 1950s [68].

An artificial neuron with three inputs is illustrated in Fig. 1. In the original McCulloch-Pitts formulation, \(X_i\in \{0,1\}\), \(w_i\in \{+1,-1\}\), and the output \(Y\in \{0,1\}\). The threshold T determines whether the output Y is 0 (inactive) or 1 (active), based on \(\sum w_i X_i\). The thinking was that a neuron either fires or it does not (thus, \(Y\in \{0,1\}\)), and the inputs would come from other neurons (thus, \(X_i\in \{0,1\}\)), while the weights \(w_i\) specify whether an input is excitatory (increasing the chance of the neuron firing) or inhibitory (decreasing the chance of the neuron firing). Whenever \(\sum w_i X_i > T\), the excitatory response wins, and the neuron fires; otherwise, the inhibitory response wins and the neuron does not fire.

Fig. 1
figure 1

Artificial neuron

A perceptron is considerably less restrictive than a McCulloch–Pitts artificial neuron, as the \(X_i\) and \(w_i\) can be real-valued. Since we want to use a perceptron as a binary classifier, the output is generally taken to be binary. McCulloch and Pitts chose such a restrictive formulation because they were trying to model logic functions. At the time, it was felt that encoding elementary logic into artificial neurons would be the key step to constructing systems with artificial intelligence. However, that point of view has certainly not panned out, while the additional generality offered by the perceptron formulation has proven extremely useful.

Given a real-valued input vector \(X=(X_{0},X_{1},\ldots ,X_{n-1})\), a perceptron can be viewed as a function of the form

$$ f(X) = \sum _{i=0}^{n-1} w_{i} X_{i} + b , $$

that is, a perceptron computes a weighted sum of the components. Based on a threshold, a perceptron can be used to define a binary classifier. For example, we could classify a sample X as “type 1” provided that \(f(X) > T\), for some specified threshold T, and otherwise classify X as “type 0.”
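As a minimal sketch of this idea (with hypothetical weights, bias, and threshold chosen purely for illustration), a perceptron-based classifier can be written as follows.

```python
import numpy as np

def perceptron_classify(X, w, b, T):
    """Classify sample X as 1 if the weighted sum exceeds threshold T, else 0."""
    f = np.dot(w, X) + b   # f(X) = sum_i w_i X_i + b
    return 1 if f > T else 0

# Illustrative (hypothetical) weights, bias, and threshold
w = np.array([0.5, -0.25, 1.0])
b = 0.1
T = 0.0

print(perceptron_classify(np.array([1.0, 2.0, 0.5]), w, b, T))  # 1 ("type 1") or 0 ("type 0")
```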

In the case of two-dimensional input, a perceptron computes a function of the form

$$\begin{aligned} f(x,y) = w_{0} x + w_{1} y + b . \end{aligned}$$
(1)

Since the decision boundary \(f(x,y)=T\) is a line, it follows that a perceptron cannot provide ideal separation in cases where the data itself is not linearly separable.

There was considerable research into ANNs in the 1950s and 1960s, and that era is often described as the first “golden age” of AI and neural networks. But the gold turned to lead in 1969 when an influential work by Minsky and Papert [55] emphasized the limitations of perceptrons. Specifically, they observed that the XOR function is not linearly separable, which implies that a single perceptron cannot model something as elementary as XOR. The OR, AND, and XOR functions are illustrated in Fig. 2, where we see that OR and AND are linearly separable, while XOR is not.

Fig. 2
figure 2

OR and AND are linearly separable but XOR is not

As the name suggests, a multilayer perceptron (MLP) is an ANN that includes multiple (hidden) layers in the form of perceptrons. An example of an MLP with two hidden layers is given in Fig. 3, where each edge represents a weight that is to be determined. Unlike a single-layer perceptron, MLPs are not restricted to linear decision boundaries, and hence an MLP can accurately model the XOR function. However, the perceptron training method proposed by Rosenblatt [68] cannot be used to effectively train an MLP [44]. To train a single perceptron, simple heuristics will suffice, assuming that the data is linearly separable. From a high-level perspective, training a single perceptron is somewhat analogous to training a linear SVM, except that for a perceptron, we do not require that the margin (i.e., minimum separation) be maximized. However, training an MLP would appear to be challenging since we have hidden layers between the input and output, and it is not clear how changes to the weights in these hidden layers will affect each other, let alone the output.
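As a small illustration of these points, the following sketch trains a tiny MLP on the four XOR cases using scikit-learn; the hidden layer size, activation, solver, and random seed are arbitrary illustrative choices, and results on such a tiny dataset can depend on the seed.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR cases: a single perceptron cannot separate these,
# but an MLP with one hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Illustrative hyperparameters; the weights are learned from the data
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', random_state=1, max_iter=1000)
mlp.fit(X, y)
print(mlp.predict(X))  # ideally [0 1 1 0]
```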

Fig. 3
figure 3

MLP with two hidden layers

As an aside, it is interesting to note that for SVMs, we deal with data that is not linearly separable by employing a soft margin (i.e., we allow for training errors) and by the use of the so-called “kernel trick,” where we map the input data to a higher dimensional feature space using a (nonlinear) kernel function. In contrast, perceptrons (in the form of MLPs) overcome the limitation of linear separability by the use of multiple layers. For an MLP, it is almost as if the nonlinear kernel function has been embedded directly into the model itself through the use of hidden layers, as opposed to a user-specified explicit kernel function, as is the case for an SVM.

One possible advantage of the MLP approach over an SVM is that for an MLP, the equivalent of the kernel function is, in effect, derived from the data and refined through the training process. In contrast, for an SVM, the kernel function is selected by a human, and once selected it does not change. In machine learning, removing those pesky humans from the learning process is a good thing. However, a possible tradeoff is that significantly more training data will likely be needed for an MLP, as compared to an SVM, due to the greater data requirement involved in learning the equivalent of a kernel function.

As another aside, we note that from a high-level perspective, it is possible to view MLPs as combining some aspects of SVMs (i.e., specifically, nonlinear decision boundaries) and HMMs (i.e., hidden layers). Also, we will see that the backpropagation algorithm that is used to train MLPs includes a forward pass and backward pass, which is eerily reminiscent of the training process that is used for HMMs.

As yet another aside, we note that an MLP is a feedforward neural network, which means that there are no loops—the input data and intermediate results feed directly through the network. In contrast, a recurrent neural network (RNN) can have loops, which gives an RNN a concept of memory but can also add significant complexity.

In the book Perceptrons: An Introduction to Computational Geometry, published in 1969, Minsky and Papert [55] made much of the perceived shortcoming of perceptrons—in particular, the aforementioned inability to model XOR. This was widely viewed as a devastating criticism at the time, as it was believed that successful AI would need to capture basic principles of logic. Although it was known that perceptrons with multiple layers (i.e., MLPs) can model XOR, at the time, nobody knew how to efficiently train MLPs. Minsky and Papert’s work was highly influential and is frequently blamed for the relative lack of interest in the field—a so-called “AI winter”—that persisted throughout the 1970s and into the early 1980s.

By 1986, there was renewed interest in ANNs, thanks in large part to the work of Rumelhart, Hinton, and Williams [70], who developed a practical means of training MLPs—the method of backpropagation. For details on backpropagation, see [80], for example.

It is worth noting that there was another “AI winter” that lasted from the late 1980s through the early 1990s (at least). The proximate cause of this most recent AI winter was that the hype far outran the limited successes that had been achieved. Although deep learning has now brought ANNs back into vogue, your author (a doubting Thomas, and proud of it) is not convinced that the current artificial intelligence mania will prove any less artificial than previous AI “summers” which, on the whole, yielded mostly disappointment. Some of the ridiculous statements being made today [28] lead your eminently sensible author to believe that the hype is already hopelessly out of control.Footnote 1

Next, we discuss deep learning, which builds on the foundation of ANNs. We can view the relationship between ANNs and deep learning as being somewhat akin to that of Markov chains and HMMs, for example. That is, ANNs serve as a basic technology that can be used to build a powerful machine learning technique, analogous to the way that an HMM is built on the foundation of an elementary Markov chain. But, before we get into the details of deep learning, we consider the topic from a high-level perspective.

3 Why Deep Learning?

It is sometimes claimed that the major advantage of deep learning arises when the amount of training data is large. For example, the tutorial [35] gives a graph similar to that in Fig. 4, which purports to show that deep learning will continue to achieve improved results as the size of the dataset grows, whereas other machine learning techniques will plateau at some relatively early point. That is, models generated by non-deep learning techniques will “saturate” relatively quickly, and once this saturation point is reached, more data will not yield improved models.Footnote 2 In contrast, deep learning is supposed to continue learning, essentially without limit as the volume of training data increases, or at least it will plateau at a much higher level. Of course, even if this is entirely true, there are practical computational constraints since more data requires more computing power for training.

Fig. 4
figure 4

Model performance as a function of the amount of training data

4 Decisions, Decisions

The essence of machine learning is that when training a model, we minimize the need for input from those fallible humans. That is, we want our machine learning models to be data-driven, in the sense that the models learn as much as possible directly from the data itself, with minimal human intervention. However, any machine learning technique will require some human decisions—for HMMs, we specify the number of hidden states; for SVMs, we specify the kernel function; and so on.

For ANNs in general, and deep learning in particular, the following design decisions are relevant [22].

  • The depth of an ANN refers to the number of hidden layers. The “deep” in deep learning indicates that we employ ANNs with lots of hidden layers, where “lots” seems to generally mean as many as possible, based on available computing power.

  • The width of an ANN is the number of neurons per layer, which need not be the same in each layer.

  • In an MLP, for example, nonlinearity is necessary, and this is achieved through the activation functions (also known as transfer functions). Most activation functions used in deep learning are designed to mimic a step function—examples include the sigmoid (or logistic) function

    $$ f(x) = \frac{1}{1+e^{-x}} , $$

    the hyperbolic tangent

    $$ f(x) = \tanh (x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} , $$

    the inverse tangent (also known as arctangent)

    $$ f(x) = \tan ^{-1}(x) , $$

    and the rectified linear unit (ReLU)

    $$ f(x) = \max \{0,x\} = \left\{ \begin{array}{ll} x & \text{ if } x > 0 \\ 0 & \text{ otherwise}. \end{array}\right. $$

    Note that the softmax function is a generalization of the sigmoid function to multiclass problems.

    The graph of each of the activation functions given above is illustrated in Fig. 5, and a small numerical sketch of these functions appears just after this list. As of this writing, ReLU is the most popular activation function. Numerous variants of the ReLU function are also used, including the leaky ReLU and exponential linear unit (ELU).

  • In addition to activation functions, we also specify an objective function. The objective function is the function that we are trying to optimize and typically represents the training error.

  • A bias node may be included (or not) in any hidden layer. Each bias node generates a constant value and hence is not connected to any previous layer. When present, a bias node allows the activation function to be shifted. In the perceptron example given in (1), the bias corresponds to the y-intercept b.
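As a small numerical illustration of the activation functions listed above, the following sketch simply evaluates each of them at a few arbitrary points; the softmax is included for comparison.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Softmax generalizes the sigmoid to a vector of class scores
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / np.sum(e)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))      # values in (0, 1)
print(np.tanh(x))      # values in (-1, 1)
print(np.arctan(x))    # values in (-pi/2, pi/2)
print(relu(x))         # max{0, x}
print(softmax(x))      # nonnegative values that sum to 1
```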

For the sake of comparison with our favorite non-deep learning technique, the depth of an HMM can be viewed as the order of the underlying Markov model. Typically, for HMMs, we only consider models of order one (in which case, the current state depends only on the previous state), but it is possible to consider higher order models. The width of an HMM might be viewed as being determined by N, the number of hidden states. But, regardless of the order of the model or the choice of N, there is really only one hidden layer in any HMM. The fact that an HMM is based on linear operations implies that adding multiple hidden layers would have no effect, as the multiple layers would be equivalent to a single layer. Furthermore, the A and B matrices of an HMM can be viewed as its activation functions (with the B matrix corresponding to the output layer), and \(P({\mathscr {O}}\,|\,\lambda )\) corresponds to the objective function in an ANN. Note that these functions are all linear in an HMM, while at least some of the activation functions must be nonlinear in any true multilayer ANN, such as an MLP.

Neural networks are trained using the backpropagation algorithm, which is a special case of a more general technique known as reverse mode automatic differentiation. For additional details on the topic of backpropagation, see, for example, [80].

The remainder of this chapter is focused on various neural network-based architectures and related topics. For each topic covered, we discuss research in the field of malware analysis.

Fig. 5
figure 5

Activation functions

5 Multilayer Perceptrons

We have already discussed multilayer perceptrons (MLP) in some detail. MLPs are in some sense one of the most generic neural network architectures—when someone speaks of a neural network in general, there is a good chance that they have an MLP in mind.

5.1 Overview of MLPs

Recall that Fig. 3 is an example of an MLP with two hidden layers. Each edge in the figure represents a weight that is to be determined via training, and backpropagation is an efficient and effective way to train such a network. The advantage of an MLP is that it is not restricted to linear decision boundaries.

5.2 MLPs in Malware Analysis

MLPs are extremely popular, and in most fields, they are one of the first learning techniques considered. Information security is no exception, as MLPs have been applied to nearly every security problem where deep learning techniques are applicable. Not surprisingly, large numbers of malware research papers employ MLPs. For example, in [5] MLPs are trained on progressively more generic malware families, yielding quantifiable results on the inherent tradeoff between the generality of the training data and accuracy. The research in [74] shows that a straightforward ensemble of various learning algorithms—including MLPs—can generate significantly stronger results than any of the component techniques. The paper [85] uses MLPs as part of an Android malware detection technique.

Another field in information security where MLPs have played a very prominent role is in intrusion detection systems (IDS). For example, the paper [57] uses MLPs in a novel multiclass IDS approach.

6 Convolutional Neural Networks

In this section, we provide an introduction to one of the most important and widely used learning techniques—CNN. After a brief overview, we introduce discrete convolutions with the focus on their specific application to CNNs. We then consider a simplified example that serves to illustrate various aspects of CNNs.

6.1 Overview of CNNs

Generically, ANNs use fully connected layers. A fully connected layer can deal effectively with correlations between any points within the training vectors, regardless of whether those points are close together, far apart, or somewhere in between. In contrast, a CNN is designed to deal with local structure—a convolutional layer cannot be expected to perform well when crucial information is not local. A key benefit of CNNs is that convolutional layers can be trained much more efficiently than fully connected layers.

For images, most of the important structure (edges and gradients, for example) is local. Hence, CNNs would seem to be an ideal tool for image analysis and, in fact, CNNs were developed for precisely this problem. However, CNNs have performed well in a variety of other problem domains. In general, any problem for which there exists a data representation where local structure predominates is a candidate for a CNN. In addition to images, local structure is crucial in fields such as text analysis and speech analysis, for example.

6.2 Convolutions and CNNs

A discrete convolution is a sequence that is itself a composition of two sequences and is computed as a sum of pointwise products. Let \(c=x*y\) denote the convolution of sequences \(x=(x_0,x_1,x_2,\ldots )\) and \(y=(y_0,y_1,y_2,\ldots )\). Then the \(k^{\text{th}}\) element of the convolution is given by

$$ c_{k} = \sum _{k=i+j} x_{i} y_{j} = \sum _{i} x_{i} y_{k-i}. $$

We can view this process as x being a “filter” (or kernel) that is applied to the sequence y over a sliding window.

For example, if \(x=(x_0,x_1)\) and \(y=(y_0,y_1,y_2,y_3,y_4)\), we find

$$ c = x*y = \bigl (x_0y_1+x_1y_0,x_0y_2+x_1y_1,x_0y_3+x_1y_2,x_0y_4+x_1y_3\bigr ). $$

If we reverse the order of the elements of x, then we have

$$ c = \bigl (x_0y_0+x_1y_1,x_0y_1+x_1y_2,x_0y_2+x_1y_3,x_0y_3+x_1y_4\bigr ) $$

which is, perhaps, a slightly more natural and intuitive way to view the convolution operation.

Again, we can view x as a filter that is applied to the sequence y. Henceforth, we define this filtering operation as convolution with the order of the elements of the filter reversed.
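As a concrete illustration (the filter and sequence values below are arbitrary), the following sketch applies a length-2 filter to a sequence, both with and without reversing the filter.

```python
import numpy as np

def filter_1d(x, y):
    """Apply filter x to sequence y over a sliding window (no padding)."""
    n, m = len(x), len(y)
    return np.array([np.dot(x, y[k:k + n]) for k in range(m - n + 1)])

x = np.array([1.0, -1.0])                 # illustrative filter
y = np.array([2.0, 4.0, 3.0, 1.0, 5.0])   # illustrative sequence

print(filter_1d(x, y))                    # filtering (filter not reversed)
print(np.convolve(x, y, mode='valid'))    # convolution (filter reversed)
```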

We can define an analogous filtering (or discrete convolution) operation in two or three dimensions. For the two-dimensional case, suppose that \(A=\{a_{ij}\}\) is an \(N\times M\) matrix representing an image and \(F=\{f_{ij}\}\) is an \(n\times m\) filter. Let \(C=\{c_{ij}\}\) be the convolution of F with A. As in the one-dimensional case, we denote this convolution as \(C = F*A\). In this two-dimensional case, we have

$$ c_{ij} = \sum _{k=0}^{n-1}\sum _{\ell =0}^{m-1} f_{k,\ell } a_{i+k,j+\ell }, $$

where \(i=0,1,\ldots ,N-n\) and \(j=0,1,\ldots ,M-m\). That is, we simply apply the filter F at each offset of A to create the new—and slightly smaller—matrix that we denote as C. The three-dimensional case is completely analogous to the two-dimensional case.
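A direct, deliberately unoptimized sketch of this two-dimensional filtering operation follows; the \(3\times 3\) diagonal filter and random binary image are illustrative placeholders, not the examples in the figures below.

```python
import numpy as np

def filter_2d(F, A):
    """Apply an n x m filter F at every offset of an N x M matrix A (valid region only)."""
    n, m = F.shape
    N, M = A.shape
    C = np.zeros((N - n + 1, M - m + 1))
    for i in range(N - n + 1):
        for j in range(M - m + 1):
            C[i, j] = np.sum(F * A[i:i + n, j:j + m])   # sum of pointwise products
    return C

F = np.eye(3)                                 # illustrative "diagonal" filter
A = np.random.randint(0, 2, size=(16, 16))    # random 16 x 16 binary image
print(filter_2d(F, A).shape)                  # (14, 14), slightly smaller than A
```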

We could simply define filters as we see fit, with each filter designed to correspond to a specific feature.Footnote 3 But since we are machine learning aficionados, for CNNs, we let the data itself determine the filters. Therefore, training a CNN can be viewed as determining filters, based on the training data. As with any respectable neural network, we can train CNNs via backpropagation.

Suppose that A represents an image and we train a CNN on the image A. Then the first convolutional layer is trained directly on the image. The filters determined at this first layer will correspond to fairly intuitive features, such as edges, basic shapes, and so on. We can then apply a second convolutional layer, that is, we apply a similar convolutional process, but the output of the first convolutional layer is the input to this second layer. At the second layer, filters are trained based on features of features. Perhaps not surprisingly, these second layer filters correspond to more abstract features of the original image A, such as the “texture.” We can repeat this convolution of convolutions step again and again, at each layer obtaining filters that correspond to features representing a higher degree of abstraction, as compared to the previous layer. The final layer of a CNN is not a convolution layer but is instead a typical fully-connected layer that can be used to classify based on complex image characteristics (e.g., “cat” versus “dog”). In addition, so-called pooling layers can be used between some of the convolutional layers. Pooling layers are simple—no training is involved—and serve primarily to reduce the dimensionality of the problem. Below, we give a simple example that includes a pooling layer.

In addition to having multiple convolutional layers, at each layer, we can (and generally will) stack several convolutions on top of each other. These filters are all initialized randomly, so they can all learn different features. In fact, for a typical color image, the image itself can be viewed as consisting of three layers, corresponding to the R, G, and B components in the RGB color scheme. Hence, for color images, the filters for the first convolutional layer will be three-dimensional, while subsequent convolutional layers can—and, typically, will—be three-dimensional as well, due to the stacking of multiple convolutions/filters at each layer. For simplicity, in our example, we only consider a black-and-white two-dimensional image, and we only apply one convolution at each layer.

Before considering a simple example, we note that there are advantages of CNNs that are particularly relevant in the case of image analysis. For a generic neural network, each pixel would typically be treated as a separate neuron, and for any reasonable size of image, this would result in a huge number of parameters, making training impractical. In contrast, at the first layer of a CNN, each filter is applied over the entire image, and at subsequent layers, we apply filters over the entire output of the previous layer. One effect of this approach is that it greatly reduces the number of parameters that need to be learned. Furthermore, by sliding the filter across the image as a convolution, we obtain a degree of translation invariance, i.e., we can detect image features that appear at different offsets. This can be viewed as reducing the overfitting that would otherwise likely occur.

The bottom line is that CNNs represent an efficient and effective technique that was developed specifically for image analysis. However, CNNs are not restricted to image data, and can be useful in any problem domain where local structure is dominant.

6.3 Example CNN

Now we turn our attention to a simple example that serves to illustrate some of the points discussed above. Suppose that we are dealing with black-and-white images, where each pixel is either 0 or 1, with 0 representing white and 1 representing black.Footnote 4 Further, suppose that the black-and-white images under consideration are \(16\times 16\) pixels in size. An example of such an image appears in Fig. 6.

Fig. 6
figure 6

\(16\times 16\) black-and-white image

In Fig. 7, we give some \(3\times 3\) filters. For example, the output of the filter in Fig. 7a is maximized when it aligns with a diagonal segment. Figure 8 shows the result of applying the convolution represented by the filter in Fig. 7a to the smiley face image in Fig. 6.

Fig. 7
figure 7

Examples of filters

Fig. 8
figure 8

First convolutional layer (\(3\times 3\) filter from Fig. 7a)

We note that, for the convolution in Fig. 8, the maximum value of 6 does indeed occur only at the three offsets where the (main) diagonal segments are all black and the off-diagonal elements are all white. These maximum values correspond to convolutions over the red boxes in Fig. 9.

Fig. 9
figure 9

Maximum convolution values in Fig. 8

In a CNN, so-called pooling layers are often intermixed with convolutional layers. As with a convolutional layer, in a pooling layer, we slide a window of a fixed size over the image. But whereas the filter in a convolutional layer is learned, in a pooling layer an extremely simple filter is specified and remains unchanged throughout the training. As the name implies, in max pooling, we simply take the maximum value within the filter window. An illustration of max pooling is given in Fig. 10.

Fig. 10
figure 10

Max pooling layer (\(2\times 2\), non-overlapping)

Instead of a max pooling scheme, sometimes average pooling is used. In any case, pooling can be viewed as a downsampling operation, which has the effect of reducing the dimensionality, and thus easing the computational burden of training subsequent convolutional layers.Footnote 5 To increase the downsampling effect, pooling usually uses non-overlapping windows. Note that the dimensionality reduction of pooling could also be achieved by a convolutional layer that uses a larger stride through the data, and in [75], for example, it is claimed that such an approach results in no loss in accuracy for the resulting CNN.
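For concreteness, a minimal sketch of \(2\times 2\) non-overlapping max pooling (with arbitrary input values) is given below.

```python
import numpy as np

def max_pool(C, size=2):
    """Non-overlapping max pooling with a size x size window."""
    N, M = C.shape
    pooled = np.zeros((N // size, M // size))
    for i in range(0, N - size + 1, size):
        for j in range(0, M - size + 1, size):
            pooled[i // size, j // size] = np.max(C[i:i + size, j:j + size])
    return pooled

C = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 6, 3, 3],
              [2, 1, 7, 2]])
print(max_pool(C))  # [[4. 5.] [6. 7.]]
```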

An illustration of the first convolutional layer for a color image is given in Fig. 11. In this case, a three-dimensional filter is applied over the R, G, and B components in the RGB color scheme. The example in Fig. 11 is meant to indicate that five different filters are being trained. Since each filter is initialized randomly, they can all learn different features. At the second convolutional layer, we can again train three-dimensional filters, based on the output of the first convolutional layer. This process is repeated for any additional convolutional layers.

There are several possible ways to visualize the filters in convolutional layers. For example, in [89], a de-convolution technique is used to obtain the results in Fig. 12. Here, each row is a randomly selected filter and the columns, from left to right, correspond to training epochs 1, 2, 5, 10, 20, 30, 40, and 64. From layer 4, we see that the training images must be faces. In general, it is apparent that the filters are learning progressively more abstract features as the layer increases.

Fig. 11
figure 11

First convolutional layer with stack of five filters (RGB image)

Fig. 12
figure 12

Visualizing convolutions [89]

A fairly detailed discussion of CNNs can be found at [38], while the paper [15] provides some interesting insights. For a more intuitive discussion, see [37], and if you want to see lots of nice pictures, take a look at [16]. More details on convolutions can be found in [61].

6.4 CNNs in Malware Analysis

CNNs have proven their worth in a wide variety of security-related applications. Some of these applications, such as image spam detection [1, 10, 72], are obvious and relatively straightforward applications of CNNs. However, other security domains that do not have any apparent image-based component have also had success with CNNs.

By treating executable files as images, researchers have been able to leverage the strengths of CNNs for malware detection, classification, and analysis. For example, the papers [6] and [88] treat executable files as images, and obtain state-of-the-art results for the malware detection problem. In particular, the research in [88] makes extensive use of transfer learning, whereby the output layers of previously trained CNNs are retrained for the malware detection problem. This results in fast training times and very high malware classification accuracies.

The research in [34] compares CNNs to so-called extreme learning machines (ELM), a topic that we discuss below, in Sect. 10. The best CNN results in [34] are obtained using a one-dimensional CNN trained on the raw bytes of executable files. In [86], CNNs are successfully applied to a combination of static and dynamic features.

Fig. 13
figure 13

Feedforward neural network with two hidden layers

7 Recurrent Neural Networks

An example of a feedforward neural network with two hidden layers is given in Fig. 13. This type of neural network has no “memory” in the sense that each input vector is treated independently of other input vectors. Hence, such a feedforward network is not well suited to deal with sequential data.

In some cases, it is necessary for a classifier to have memory. For example, if we want to tag parts of speech in English text (i.e., noun, verb, and so on), this is not feasible if we only look at words in isolation. The word “all,” for example, can be an adjective, adverb, noun, or even a pronoun, and the only way to determine which is the case is to consider the context. A recurrent neural network (RNN) provides a way to add memory (or context) to a feedforward neural network.

To convert a feedforward neural network into an RNN, we treat the output of the hidden states as another input. For the neural network in Fig. 13, the corresponding generic RNN is illustrated in Fig. 14. The structure in Fig. 14 implies that there is a time step involved, that is, we train (and score) based on a sequence of input vectors. Of course, we cannot consider infinite sequences, and even if we could, the influence of feature vectors that occurred far back in time is likely to be minimal.

Fig. 14
figure 14

Network in Fig. 13 as an RNN

The RNN in Fig. 14 can be “unrolled,” as illustrated in Fig. 15. Note that in this case, we use f to represent the hidden layer or layers, while the notation \(X_t\) is used to represent \((x_0,x_1)\) at time step t from the un-unrolled RNN in Fig. 14 and, similarly, \(Y_t\) corresponds to \((y_0,y_1,y_2)\) at time t. From the unrolled form, it is clear that any RNN can be treated as a special case of a feedforward neural network, where the intermediate hidden layers (f in our notation) all have identical structures and weights. We can take advantage of this special structure to efficiently train an RNN using a (slight) variant of backpropagation, known as backpropagation through time (BPTT).
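As a minimal sketch of this unrolled computation, the following code performs the forward pass of a plain sequence-to-sequence RNN; the weight matrices and inputs are random placeholders, and the same weights are reused at every timestep.

```python
import numpy as np

def rnn_forward(X_seq, W_h, W_x, W_y, h0):
    """Unrolled forward pass of a simple RNN: identical weights at each timestep."""
    h = h0
    outputs = []
    for X_t in X_seq:
        h = np.tanh(W_h @ h + W_x @ X_t)   # new hidden state from old state and input
        outputs.append(W_y @ h)            # output Y_t at time t
    return outputs, h

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 2, 4, 3, 5
X_seq = [rng.normal(size=n_in) for _ in range(T)]
W_h = rng.normal(size=(n_hidden, n_hidden))
W_x = rng.normal(size=(n_hidden, n_in))
W_y = rng.normal(size=(n_out, n_hidden))

Y_seq, h_T = rnn_forward(X_seq, W_h, W_x, W_y, np.zeros(n_hidden))
print(len(Y_seq), Y_seq[0].shape)  # 5 (3,)
```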

Fig. 15
figure 15

Unrolled RNN (sequence-to-sequence model)

Before briefly turning our attention to BPTT, we illustrate some variants of a generic RNN. An RNN such as that illustrated in Fig. 15 is known as a sequence-to-sequence model, since each input sequence \((X_0,X_1,\ldots ,X_{n-1})\) corresponds to an output sequence \((Y_0,Y_1,\ldots ,Y_{n-1})\). In Fig. 16a, we have illustrated a many-to-one example of an RNN, that is, the case where an input sequence of the form \((X_0,X_1,\ldots ,X_{n-1})\) corresponds to the single output \(Y_{n-1}\). At the other extreme, Fig. 16b illustrates a one-to-many RNN, where the single input \(X_{0}\) corresponds to the output sequence \((Y_0,Y_1,\ldots ,Y_{n-1})\).

Fig. 16
figure 16

Variants of the generic RNN in Fig. 15

A many-to-one model might be appropriate for part-of-speech tagging, for example, while a one-to-many RNN could be used for music generation. An example of an application where a sequence-to-sequence (or many-to-many) RNN would be appropriate is machine translation. There are numerous possible variants of the sequence-to-sequence RNN. Also, note that a feedforward neural network, such as that in Fig. 13, can be viewed as a one-to-one RNN.

Multilayer RNNs can also be considered. This can be viewed as training multiple RNNs simultaneously, with the first RNN trained on the input data, the second RNN trained on the hidden states of the first RNN, and so on. A two-layer (sequence-to-sequence) RNN is illustrated in Fig. 17. Of course, more layers are possible, but the training complexity will increase, and hence only “shallow” RNN architectures (in terms of the number of layers) are generally considered.

Fig. 17
figure 17

Two-layer RNN

7.1 Backpropagation Through Time

RNNs can be viewed as neural networks that are designed specifically for time series or other sequential data. With an RNN, the number of parameters is reduced so as to ease the training burden. This situation is somewhat analogous to CNNs, which are designed to efficiently deal with local structure (e.g., in images). That is, both CNNs and RNNs serve to make training more efficient—as compared to generic feedforward neural networks—for specific classes of problems. Backpropagation through time (BPTT) is simply an ever-so-slight variation on backpropagation that is optimized for training RNNs.

In Fig. 18, we give a detailed view of a many-to-one (actually, two-to-one) RNN. In this case, we see that the 10 weights, \((w_0,w_1,\ldots ,w_9)\) must be determined via training.

Fig. 18
figure 18

Simple RNN example

In Fig. 19, we give a neural network that is essentially the fully connected version of the RNN in Fig. 18. Note that in this fully connected version, there are 20 parameters to be determined. In an RNN, we assume that the data represents sequential input and hence the reduction in the number of weights is justified, since we are simply eliminating from consideration cases where the past is influenced by the future.Footnote 6

Fig. 19
figure 19

Fully connected analog of Fig. 18

It is well known that gradient issues are a concern when training neural networks in general, and are a particularly acute issue with generic RNNs. In an RNN, the further that we attempt to backpropagate through time, the more likely that the gradient will “explode” or “vanish” or oscillate between extremes. The details of the exploding gradient and vanishing gradient are beyond the scope of this survey; for more information on these topics, see [79], for example.

Next, we turn our attention to specialized RNN architectures that are designed to mitigate the gradient issues that plague generic RNNs. Specifically, we consider LSTM networks in some detail and we then briefly discuss a variant of LSTM known as gated recurrent units (GRU). In fact, a vast number of variants of the LSTM architecture have been developed. However, according to the extensive empirical study in [23], “none of the variants can improve upon the standard LSTM architecture significantly.”

7.2 Long Short-Term Memory

In addition to being a tongue twister, LSTM networks are a class of RNN architectures that are designed to deal with long-range dependencies. That is, LSTM can deal with “gaps” between the appearance of a feature and the point at which it is needed by the model [23]. The claim to fame of LSTM is that it can reduce the effect of a vanishing gradient, which is what enables such models to account for longer range dependencies [30].

Before outlining the main ideas behind LSTM, we note that the LSTM architecture has been one of the most commercially successful learning techniques ever developed. Among many other applications, LSTMs have been used in Google Allo [39], Google Translate [84], Apple’s Siri [46], and Amazon Alexa [25]. However, recently, the dominance of LSTM may have begun to wane. ResNet has been shown to have theoretical advantages over LSTM, and it outperforms LSTM in a wide range of applications [63].

Figure 20 illustrates an LSTM. The obvious difference from a generic vanilla RNN is that an LSTM has two lines entering and exiting each state. As in a standard RNN, one of these lines represents the hidden state, while the second line is designed to serve as a gradient “highway” during backpropagation. In this way, the gradient can “flow” much further back with less chance that it will vanish along the way.

Fig. 20
figure 20

LSTM

In Fig. 21, we expand one of the LSTM cells \(L_t\) that appear in Fig. 20. Here, \(\sigma \) is the sigmoid function, \(\tau \) is the hyperbolic tangent (i.e., \(\tanh \)) function, the operators “\(\times \)” and “\(\text {}+\text {}\)” are pointwise multiplication and addition, respectively, while “\(\text {}\Vert \text {}\)” indicates concatenation of vectors. The vector \(i_t\) is the “input” gate, \(f_t\) is the “forget” gate, and \(o_t\) is the “output” gate. The vector \(g_t\) is an intermediate gate and does not have a cool name, but is sometimes referred to as the “gate” gate [47], which, come to think of it, is especially cool. We have much more to say about these gates below.

Fig. 21
figure 21

One timestep of an LSTM

The gate vectors that appear in Fig. 21 are computed as

$$\begin{aligned}\nonumber \begin{aligned} f_t&= \sigma \Biggl (W_{ f} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_f\Biggr ) \\ i_t&= \sigma \Biggl (W_{i} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_i\Biggr ) \end{aligned} \qquad \begin{aligned} g_t&= \tau \Biggl (W_{ g} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_g\Biggr ) \\ o_t&= \sigma \Biggl (W_{ o} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_o\Biggr ), \end{aligned} \end{aligned}$$

while the outputs are

$$\begin{aligned} c_t&= (f_t \odot c_{t-1}) \oplus (i_t \odot g_t) \\ h_t&= o_t \odot \tau (c_t), \end{aligned}$$

where “\(\odot \)” is pointwise multiplication and “\(\oplus \)” is the usual pointwise addition. Note that each of the weight matrices is \(n\times 2n\).

In matrix form, ignoring the bias terms b, we have

$$ \left( \begin{array}{c} i_t \\ f_t \\ o_t \\ g_t \end{array}\right) = \left( \begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tau \end{array}\right) W \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ), $$

where \(X_{t}\) and \(h_{t-1}\) are column vectors of length n, and W is the \(4n\times 2n\) weight matrix

$$ W = \left( \begin{array}{c} W_{ i} \\ W_{ f} \\ W_{ o} \\ W_{ g} \end{array}\right) $$

Further, each of the gates \(i_t\), \(f_t\), \(o_t\), and \(g_t\) is a column vector of length n. Recall that the sigmoid \(\sigma \) squashes its input to be within the range of 0 to 1, whereas the \(\tanh \) function \(\tau \) gives output within the range of \(-1\) to \(+1\).

To highlight the intuition behind LSTM, we follow a similar approach to that given in the excellent presentation [47]. Specifically, we focus on the extreme cases, that is, we assume that the output of each sigmoid \(\sigma \) is either 0 or 1, and each hyperbolic tangent \(\tau \) is either \(-1\) or \(+1\). Then the forget gate \(f_t\) is a vector of 0s and 1s, where the 0s tell us the elements of \(c_{t-1}\) that we forget and the 1s indicate the elements to remember. In the middle section of the diagram, the input gate \(i_t\) and gate gate \(g_t\) together determine which elements of \(c_{t-1}\) to increment or decrement. Specifically, when element j of \(i_t\) is 1 and element j of \(g_t\) is \(+1\), we increment element j of \(c_{t-1}\). And if element j of \(i_t\) is 1 and element j of \(g_t\) is \(-1\), then we decrement element j of \(c_{t-1}\). This serves to emphasize or de-emphasize particular elements in the new-and-improved cell state \(c_t\). Finally, the output gate \(o_t\) determines which elements of the cell state will become part of the hidden state \(h_t\). Note that the hidden state \(h_t\) is fed into the output layer of the LSTM. Also note that before the cell state is operated on by the output gate, the values are first squeezed down to be within the range of \(-1\) to \(+1\) by the \(\tau \) function.

Of course, in general, the LSTM gates are not simply counters that increment or decrement by 1. But the intuition is the same; that is, the gates keep track of incremental changes, thus allowing relevant information to flow over long distances via the cell state. In this way, LSTM negates some of the limitations caused by vanishing gradients.
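A minimal sketch of one LSTM timestep, following the gate equations above with random placeholder weights (in practice, these would be learned via backpropagation through time), might look as follows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, h_prev, c_prev, W, b):
    """One LSTM timestep; W is 4n x 2n and b has length 4n, as in the text."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, X_t]) + b   # affine map of (h_{t-1} ; X_t)
    i_t = sigmoid(z[0:n])            # input gate
    f_t = sigmoid(z[n:2 * n])        # forget gate
    o_t = sigmoid(z[2 * n:3 * n])    # output gate
    g_t = np.tanh(z[3 * n:4 * n])    # "gate" gate
    c_t = f_t * c_prev + i_t * g_t   # new cell state
    h_t = o_t * np.tanh(c_t)         # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
n = 4
W = rng.normal(size=(4 * n, 2 * n))
b = np.zeros(4 * n)
h_t, c_t = lstm_step(rng.normal(size=n), np.zeros(n), np.zeros(n), W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```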

7.3 Gated Recurrent Units

As mentioned above, there are a large number of variants of the basic LSTM architecture. Most of these involve only minor changes to a standard LSTM. A gated recurrent unit (GRU), on the other hand, is a fairly radical departure from an LSTM. Although the internal state of a GRU is somewhat complex and, perhaps, less intuitive than that of an LSTM, there are fewer parameters in a GRU, and hence a GRU is easier to train and requires less training data. The wiring diagram for a GRU is given in Fig. 22.

Fig. 22
figure 22

One timestep of a GRU

The gate vectors that appear in Fig. 22 are computed as

$$\begin{aligned} z_t&= \sigma \Biggl (W_{z} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_z\Biggr ) \\ r_t&= \sigma \Biggl (W_{r} \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ) + b_r\Biggr ) \\ g_t&= \tau \Biggl (W_{g} \biggl (\begin{array}{c} r_t \odot h_{t-1} \\ X_{t} \end{array}\biggr ) + b_g\Biggr ), \end{aligned}$$

while the output is

$$ h_t = \bigl (z_t \odot g_t\bigr ) \oplus \bigl ((1-z_t) \odot h_{t-1}\bigr ), $$

where “\(\odot \)” is pointwise multiplication and “\(\oplus \)” is the usual pointwise addition. Note that each of the weight matrices is \(n\times 2n\).

In matrix form, ignoring the bias terms b, we have

$$ \left( \begin{array}{c} z_t \\ r_t \\ g_t \end{array}\right) = \left( \begin{array}{c} \sigma \\ \sigma \\ \tau \end{array}\right) W \biggl (\begin{array}{c} h_{t-1} \\ X_{t} \end{array}\biggr ), $$

where \(X_{t}\) and \(h_{t-1}\) are column vectors of length n, and W is the \(3n\times 2n\) weight matrix

$$ W = \left( \begin{array}{c} W_{ z} \\ W_{ r} \\ W_{ g} \end{array}\right) $$

Each of the gates \(z_t\), \(r_t\), and \(g_t\) is a column vector of length n.

The intuition behind a GRU is that it replaces the input, forget, and output gates of an LSTM with just two gates—an “update” gate \(z_t\) and a “reset” gate \(r_t\). The GRU update gate serves a similar purpose to the combined output and forget gates of an LSTM. Specifically, the update gate serves to determine what to output (or write) and what to forget. The function \(1-z_t\) in the GRU implies that anything that is not output must be forgotten. Thus, the GRU is less flexible than an LSTM, since an LSTM allows us to independently select elements for output and elements that are forgotten. The GRU reset gate and the LSTM input gate each serve to combine new input with previous memory.

The gating in a GRU is more complex and somewhat less intuitive as compared to that found in an LSTM. In any case, the most radical departure of the GRU from the LSTM architecture is that there is no cell state in a GRU. This implies that any memory must be stored in the hidden state \(h_t\). This simplification (as compared to an LSTM) relies on the fact that in a GRU, the write and forget operations have been combined.

7.4 Recursive Neural Network

We mention in passing that recursive neural networks can be viewed as generalizing recurrent neural networks.Footnote 7 In a recursive neural network, we can recurse over any hierarchical structure, with trees being the archetypal example. Then training can be accomplished via backpropagation through structure (BPTS), often using stochastic gradient descent for simplicity. In contrast, a recurrent neural network is restricted to one particular structure—that of a linear chain.

7.5 Last Word on RNNs

RNNs are useful in cases where the input data is sequential. Generic RNN architectures are subject to vanishing and exploding gradients, which limit the length of the history (or gaps) that can effectively be incorporated into such models. Relatively complex RNN-based architectures—such as LSTM and its variants—have been developed that can better handle such gradient issues. These architectures have proven to be commercially successful across a wide range of products.

A good general discussion of RNNs can be found in [59], and an overview of various RNN-specific topics—with links to many relevant articles—is available at [58]. A more detailed (mathematical) description can be found in Chap. 10 of [20]. The slides at [47] provide a good general introduction to RNNs, with nice examples and a brief, but excellent, discussion of LSTM.

7.6 RNNs in Malware Analysis

In a commercial sense, LSTMs are surely the most successful deep learning technique yet developed, so it is not surprising that LSTMs have been successfully applied to the malware detection problem [50]. Both LSTMs and GRUs—along with CNNs—are considered in [2], with the authors claiming a major improvement over relevant previous work. The paper [31] considers an adversarial attack, where the attacker can defeat a system that uses RNNs based on API calls.

There are many applications of RNNs in areas of information security outside of the malware domain. In [87], CNN and LSTM architectures are used to detect cybersecurity events, based on social networking messages. Other infosec applications of LSTMs include generating security ontologies [19], network security [49], breaking CAPTCHAs [11], host-based intrusion detection [41], and network anomaly detection [13], among others.

8 Residual Networks

At the time of this writing, residual network (ResNet) is considered the state of the art in deep learning for many image analysis problems. A residual network is one in which instead of approximating a function F(x), we approximate the “residual,” which is defined as \(H(x) = F(x)-x\). Then the desired solution is given by \(F(x)=H(x)+x\).
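As a toy illustration of this idea (not the architecture of [27]; the weights below are random placeholders rather than a trained network), a residual block computes the residual H(x) with a few layers and then adds the input x back via a shortcut connection.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Two-layer block that learns the residual H(x); the shortcut adds x back."""
    H = W2 @ relu(W1 @ x)   # residual mapping H(x)
    return relu(H + x)      # F(x) = H(x) + x  (identity "shortcut" connection)

rng = np.random.default_rng(2)
n = 8
W1 = rng.normal(size=(n, n)) * 0.1
W2 = rng.normal(size=(n, n)) * 0.1
x = rng.normal(size=n)
print(residual_block(x, W1, W2).shape)  # (8,)
```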

The original motivation for considering residuals was based on the observation that deeper networks sometimes produce worse results, even when vanishing gradients are not the cause [27]. This is somewhat counterintuitive, as the network should simply learn identity mappings when a model is deeper than necessary. To overcome this “degradation” problem, the authors of [27] experiment with residual mappings and provide extensive empirical evidence that the resulting ResNet architecture yields improved results as compared to standard feedforward networks for a variety of problems. The authors of [27] conjecture that the success of ResNet follows from the fact that the identity map corresponds to a residual of zero, and “if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.”

Whereas LSTM uses a complex gating structure to ease gradient flow, ResNet defines additional connections that correspond to identity layers. This enables ResNet to deal with vanishing gradients, as well as the aforementioned degradation problem. These identity layers allow a ResNet model to skip over layers during training, which serves to effectively reduce the minimum depth when training. Intuitively, ResNet is able to train deeper networks by, in effect, training over a considerably shallower network in the initial stages, with later stages of training serving to flesh out the intermediate connections. This approach was inspired by pyramidal cells in the brain, which have a similar characteristic in the sense that they bridge “layers” of neurons [76].

A very high-level illustrative example of a ResNet architecture is given in Fig. 23, where each curved edge represents an identity transformation. Note that in this case, the identity transformations enable the model to skip over two layers. In principle, ResNet would seem to be applicable to any flavor of deep neural network, but in practice, it seems to be applied primarily to CNNs.

Fig. 23
figure 23

Example of a ResNet architecture

If a ResNet has N identity paths, then the network contains \(2^N\) distinct feedforward networks. For example, the ResNet in Fig. 23 can be expanded into the graph in Fig. 24. Note that most of the paths in a ResNet are relatively short.

Fig. 24
figure 24

Another view of the ResNet architecture in Fig. 23

Surprisingly, the paper [81] provides evidence that in spite of being trained simultaneously, the multiple paths in a ResNet “show ensemble-like behavior in the sense that they do not strongly depend on each other.” And perhaps an even more surprising result in [81] shows that “only the short paths are needed during training, as longer paths do not contribute any gradient.” In other words, a deep ResNet architecture is more properly viewed as a collection of multiple, relatively shallow networks.

8.1 ResNet in Malware Analysis

At the time of this writing, ResNet is a relative newcomer and the level of research in the security domain is somewhat limited. Nevertheless, ResNet architectures have shown promise for dealing with the usual suspects, namely malware analysis [40, 66] and intrusion detection [43, 83].

9 Generative Adversarial Network

Let \(\{X_i\}\) be a collection of samples and \(\{Y_i\}\) a corresponding set of class labels. In statistics, a discriminative model is one that models the conditional probability distribution \(P(Y\,|\,X)\). Such a discriminative model can be used to classify samples—given an input X of the same type as the training samples \(\{X_i\}\), the model enables us to easily determine the most likely class of X by simply computing \(P(Y\,|\,X)\) for each class label Y.

In contrast, a model is said to be generative if it models the joint probability distribution of X and Y, which we denote as P(X, Y). Such a model is called “generative” because, by sampling from this distribution, we can generate new pairs \((X_i,Y_i)\) that fit the probability distribution. Note that we can produce a discriminative model from a generative model, since

$$ P(Y\,|\,X) = \frac{P(X,Y)}{P(X)}. $$

Therefore, in some sense, a generative model is inherently more general than a discriminative model.

Consider, for example, hidden Markov models (HMM) [77], which are a popular class of classic machine learning techniques. An HMM is defined by the three matrices in \(\lambda =(A,B,\pi )\), where \(\pi \) is the initial state distribution, A contains the transition probability distributions for the hidden states, and B consists of the observation probability distributions corresponding to the hidden states. If we train an HMM on a given dataset, then we can easily generate samples that match the probability distributions of the HMM. To generate such samples, we first randomly select an initial state based on the probabilities in \(\pi \). Then we repeat the following steps until the desired observation sequence length is reached: Randomly select an observation based on the current state, using the probabilities in B, and randomly select the next state, based on the probabilities in A. The resulting observation sequence will be indistinguishable (in the HMM sense) from the data that was used to train the HMM.
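The following sketch generates an observation sequence in exactly this way, using a small, arbitrary model \(\lambda =(A,B,\pi )\) chosen purely for illustration.

```python
import numpy as np

def generate_hmm_sequence(A, B, pi, length, rng):
    """Generate an observation sequence from the HMM lambda = (A, B, pi)."""
    states = np.arange(A.shape[0])
    observations = np.arange(B.shape[1])
    obs_seq = []
    state = rng.choice(states, p=pi)                          # initial state from pi
    for _ in range(length):
        obs_seq.append(rng.choice(observations, p=B[state]))  # observation from B
        state = rng.choice(states, p=A[state])                # next state from A
    return obs_seq

# Illustrative 2-state, 3-observation model (arbitrary probabilities)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

rng = np.random.default_rng(3)
print(generate_hmm_sequence(A, B, pi, 10, rng))
```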

From the discussion in the previous paragraph, it is clear that a trained HMM is a generative model. However, it is more typical to use an HMM as a discriminative model. In discriminative mode, we determine a threshold, then we classify a given observation sequence as matching the model if its HMM score is above the specified threshold. This example shows that in practice, it is easy to use a generative model as a discriminative model.

On the other hand, while a trained SVM serves to classify samples, we could not use such a model to generate samples that match the training set. Thus, an SVM is an example of a discriminative model.

In the realm of deep learning, a discriminative network is designed to classify samples, while a generative network is designed to generate samples that “fit” the training data. From the discussion above, it is clear that we can always obtain a discriminative model from a generative model. Intuitively, it would seem that training a (more general) generative model in order to obtain a (more specific) discriminative model would be undesirable since we do not need the full generality of the model. However, reality appears to be somewhat more subtle. In [60], it is shown that for one generative–discriminative pair (naïve Bayes and logistic regression) the discriminative models do indeed have a lower asymptotic error; however, the generative models consistently converge faster. This suggests that with limited training data, a generative model might produce a superior discriminative model, as compared to directly training the corresponding discriminative model. In any case, in the realm of deep learning, discriminative models dominate, with an example of a typical application being image classification. In contrast, generative models have only recently come into vogue, with an example application being the creation of fake images.

Now, suppose that when training a discriminative neural network, in addition to the real training data, we generate “fake” training samples that follow a similar probability distribution as the real samples. Further, suppose that these fake training samples are designed to trick the discriminative network into making classification mistakes. Such samples would tend to improve the training of the network, thus making it stronger and more effective than if we had restricted the training to only the real data.

Although this idea is intuitively appealing, several problems arise when trying to implement a training technique based on fake samples. For one thing, we generally do not know the distribution of the training set, which often lives in an extremely high dimensional space of great complexity. Another issue is that during training, the discriminative network is constantly evolving, so determining samples that are likely to trick the network is a moving target. Another concern is that if the fake training samples are too difficult—or too easy—to distinguish at any point in the training process, we are unlikely to see any improvement over simply using the real training data.

Several techniques have been proposed to try to take advantage of fake data so as to improve the training process. In the case of a generative adversarial network (GAN), we use a neural network to generate the fake data—a generative network is trained to defeat a discriminative network. Furthermore, the discriminative and generative networks are trained simultaneously in a minimax game. This approach sidesteps the complications involved in trying to model the probability distribution of the training samples. In fact, the generative network in a GAN simply uses random noise as its underlying probability distribution.

To summarize, a GAN consists of two competing neural networks—a generative network and a discriminative network—with the generative network creating fake data that is designed to defeat the discriminative network. The two networks are trained simultaneously following a game-theoretic approach. In this way, both networks improve, with the ultimate objective being a discriminative model (and/or a generative model) that is stronger than it would have been if it was trained only on the real training data.

We define two neural networks, namely a discriminator \(D(x;\theta _d)\), and a generator \(G(z;\theta _g)\), where \(\theta _d\) consists of the parameters of the discriminator network, and \(\theta _g\) consists of the parameters of the generator network. Here, we describe the training process in terms of images, but other types of data could be used. Also, to simplify the notation, we suppress the dependence on \(\theta _d\) and \(\theta _g\) in the remainder of this discussion, except where it is essential for understanding and may not be clear from context.

The generator G(z) produces a fake image (based on the random seed value z) with the goal of tricking the discriminator into believing it is a real training image. In contrast, the discriminator D(x) returns a value in the range of 0 to 1 that can be viewed as its estimate of the probability that the image x is real. For example, \(D(x)=1\) means that the discriminator is completely certain that the image is real, while \(D(x)=0\) tells us that the discriminator is sure that the image is fake, and \(D(x)=1/2\) implies that the discriminator is clueless. Note that the discriminator must deal with both real and fake images, while the generator is only concerned with generating fake images that trick the discriminator.

The generator G wins if D thinks its fake images are real. Thus, we can train G by making \(1-D(G(z))\) as close to zero as possible or, equivalently, by minimizing \(\text {log}(1-D(G(z)))\). On the other hand, D wins if it can distinguish the fake images from real images so, ideally, when training D, we want \(D(x)=1\) when x is a real image, and \(D(G(z))=0\) for fake images G(z). Therefore, we can train D by maximizing \(D(x)(1-D(G(z)))\) or, equivalently, by maximizing \(\text {log}(D(x)) + \text {log}(1-D(G(z)))\). We want the D and G models to be in competition, so they can strengthen each other. This can be accomplished by formulating the training in terms of the minimax game

$$\begin{aligned} \min _{G}\max _{D} \Bigl (E\bigl (\text {log}(D(x))\bigr ) + E\bigl (\text {log}(1-D(G(z)))\bigr ) \Bigr ), \end{aligned}$$
(2)

where E is the expected value, relative to the implied probability distribution. Specifically, for the \(\max \) over D, the expectation is with respect to the real sample distribution which has parameters \(\theta _d\), while for the \(\min \) over G, the expectation is with respect to the fake sample distribution, which is specified by the parameters \(\theta _g\).

In the case of stochastic gradient descent (or ascent), at each iteration, we consider one real sample x and one fake sample G(z). Then, due to the \(\max \) in equation (2), we first perform gradient ascent to update the discriminator network D. This is followed by gradient descent to update the generator network G. Of course, both of these steps rely on backpropagation.

Note that for the discriminator network D, the backpropagation error term involves

$$ \text {log}\bigl (D(x)\bigr ) + \text {log}\bigl (1-D(G(z))\bigr ), $$

while for the generator network G, the error term involves only

$$\begin{aligned} \text {log}\bigl (1 - D(G(z))\bigr ). \end{aligned}$$
(3)

Of course, in practice, we would typically use a minibatch of, say, m real samples and m fake samples at each update of D and G, rather than a strict stochastic gradient descent/ascent.

There is one technical issue that arises when attempting to train the generator network G as outlined above. As illustrated in Fig. 25, the gradient of the expression in (3) is nearly flat for values of D(G(z)) near zero. This implies that, early in training, when the generator network is sure to be extremely weak—and hence the discriminator can easily identify most G(z) images as fake—it will be difficult for the G network to learn. From Fig. 25, we also see that

$$\begin{aligned} \text {log}\bigl (D(G(z))\bigr ) \end{aligned}$$
(4)

is relatively steep near zero. Hence, instead of training G via gradient descent involving equation (3), we perform gradient ascent based on (4). Note that we have simply replaced the problem of minimizing \(1-D(G(z))\) with the equivalent problem of maximizing the probability D(G(z)).

Fig. 25
figure 25

Gradient of generator network G

The algorithm for training a GAN is summarized in Fig. 26. In some applications, letting \(\texttt {iters}=1\) works best, while in others, \(\texttt {iters}>1\) yields better results. In the latter case, we update the discriminator network D multiple times for each update of the generator network G, which suggests that the generator might otherwise overwhelm the discriminator, that is, the generator is in some sense easier to train. Finally, while a GAN certainly is an advanced architecture, it is important to realize that training reduces to a fairly straightforward application of gradient ascent and descent; a minimal sketch of such a training loop is given after Fig. 26.

Fig. 26
figure 26

GAN training algorithm
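To make the training procedure concrete, the following is a minimal sketch of one training epoch, assuming PyTorch. The networks D and G, their optimizers, the data loader, and the noise dimension z_dim are illustrative placeholders rather than part of the algorithm in Fig. 26, and the generator update uses the alternative objective in (4).

```python
# A minimal sketch (PyTorch assumed) of one epoch of GAN training.
# D maps an image to a probability in [0, 1]; G maps a noise vector to an image.
# The loader, optimizers, and z_dim are illustrative placeholders.
import torch
import torch.nn as nn

def train_gan_epoch(D, G, loader, opt_d, opt_g, z_dim=100, iters=1):
    bce = nn.BCELoss()
    for real, _ in loader:
        m = real.size(0)
        ones = torch.ones(m, 1)
        zeros = torch.zeros(m, 1)

        # Update D: gradient ascent on log D(x) + log(1 - D(G(z))),
        # implemented as descent on the equivalent binary cross-entropy loss.
        for _ in range(iters):
            z = torch.randn(m, z_dim)
            fake = G(z).detach()                    # do not backpropagate into G here
            loss_d = bce(D(real), ones) + bce(D(fake), zeros)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()

        # Update G with the objective of equation (4):
        # maximize log D(G(z)), i.e., minimize BCE(D(G(z)), 1).
        z = torch.randn(m, z_dim)
        loss_g = bce(D(G(z)), ones)
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```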

As with LSTM, there are a vast number of variations on the basic GAN approach outlined here; see [48] for a list of nearly 50 such variants. Additional sources of information on GANs include the original paper on the subject [21] and the excellent slides at [48].

9.1 GANs in Malware Analysis

GANs seem to show promise for dealing with some of the most challenging problems in information security. For example, GANs have been applied with some success to zero-day malware detection [32, 42]. In addition, the generative aspect of a GAN can be used to create challenging security problems in the “lab,” thus enabling researchers to consider defenses against potential threats before those threats arise in a real-world setting [67].

10 Extreme Learning Machines

As with most aspects of extreme learning machines (ELMs), the origin of the technique is somewhat controversial. The unfortunate terminology of “Extreme Learning Machine” was apparently first used in [24]. Regardless of the origin of the technique, ELMs are essentially randomized feedforward neural networks that effectively minimize the cost of training.

An ELM consists of a single layer of hidden nodes, where the weights between inputs and hidden nodes are randomly initialized and remain unchanged throughout training. The weights that connect the hidden nodes to the output are trained, but due to the simple structure of an ELM, these weights can be determined by solving linear equations—more precisely, by solving a linear regression problem. Since no backpropagation is required, ELMs are far more efficient to train, as compared to other neural network architectures. However, since the weights in the hidden layer are not optimized, we will typically require more weights in an ELM, which implies that the testing phase may be somewhat more costly, as compared to a network trained by backpropagation. Nevertheless, in applications where models must be trained frequently, ELMs can be competitive.

Fig. 27
figure 27

Architecture of an ELM model

Consider the ELM architecture shown in Fig. 27, where X denotes the input layer, H is the hidden layer, and Y is the output layer. In this example, there are N samples of the form \((x_i, y_i)\) for \(i=1,2, \ldots , N\), where \(x_i = (x_{i1}, x_{i2}, \ldots , x_{in})^T\) is the feature vector for sample i and \(y_i = (y_{i1}, y_{i2}, \ldots , y_{im})^T\) are the output labels, where T indicates the transposition operation. Then the input and output for the ELM are \(X = (x_1, x_2, \ldots , x_N)\) and \(Y = (y_1, y_2, \ldots , y_N)\), respectively. In this example, the hidden layer H has \(\ell \) neurons. We denote the activation function of the hidden layer as g(x).

To train an ELM, we randomly select the weight matrix that connects the input layer X to the hidden layer H. We denote this randomly assigned weight matrix as \(W = (w_1, w_2, \ldots , w_\ell )^T\), where each \(w_i\) is a column vector of input weights for hidden neuron i. We also randomly select the bias matrix B for this same layer. During the training phase, both W and B remain unchanged.

After W and B have been initialized, the output of the hidden layer H is given by

$$ H=g(WX + B) . $$

The output of the ELM is denoted as Y and is calculated as

$$ Y=H\beta , $$

where \(\beta \) is the weight matrix for the output layer.

The values of the output weights \(\beta \), which connect the hidden layer to the output layer, are learned via linear least squares, and can be computed using \(H^\dagger \), the Moore–Penrose generalized inverse of H, as discussed below. It is worth emphasizing that the only parameters that are learned in the ELM are the elements of \(\beta \).

Given that Y is the desired output, the least-squares solution of the system can be found as follows. We denote the Moore–Penrose generalized inverse of H as \(H^\dagger \), which is defined as

$$ H^\dagger = {\left\{ \begin{array}{ll} (H^TH)^{-1}H^T \text { if }H^TH~\text {is nonsingular}\\ H^T(HH^T)^{-1} \text { if }HH^T~\text {is nonsingular} \end{array}\right. } $$

Then the desired solution \(\beta \) is given by

$$ \beta =H^\dagger Y. $$

After calculating \(\beta \), the training phase ends. For each test sample x, the output Y can be calculated as

$$ Y=g\bigl (C(x)\bigr ) \beta , $$

where C(x) is defined below. The entire training process is extremely efficient, particularly in comparison to the backpropagation technique that is typically used to train neural networks [80].
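To emphasize just how simple the training computation is, the following is a minimal NumPy sketch of the equations above. For convenience, samples are stored as rows (rather than the column-oriented notation used above), and the choice of the hyperbolic tangent for the activation function g is simply an assumption for this example.

```python
# Minimal NumPy sketch of ELM training and prediction, following the equations
# above. Samples are rows here; g = tanh is an assumed activation function.
import numpy as np

def elm_train(X, Y, n_hidden, seed=0):
    """X: (N, n) feature matrix, Y: (N, m) target matrix."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.standard_normal((n, n_hidden))   # random input weights (never trained)
    B = rng.standard_normal(n_hidden)        # random biases (never trained)
    H = np.tanh(X @ W + B)                   # hidden-layer output H = g(WX + B)
    beta = np.linalg.pinv(H) @ Y             # beta = H^dagger Y (least squares)
    return W, B, beta

def elm_predict(X, W, B, beta):
    return np.tanh(X @ W + B) @ beta         # Y = H beta
```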

For the research reported in this chapter, we use the Python implementation of ELMs given in [17]. This implementation uses input activations that are a weighted combination of two functions referred to as an “MLP” kernel and an “RBF” kernel—we employ the same terminology here. The MLP kernel is simply the linear operation

$$ M(x) = Wx + B, $$

where the weights W and biases B are randomly selected from a normal distribution. This is the kernel function that is typically associated with a standard ELM.

The RBF kernel is considerably more complex and is based on generalized radial basis functions as defined in [18]. The details of this RBF kernel go beyond the scope of this chapter; see [18] for additional information and, in particular, examples where this kernel is applied to train ELMs. We use the notation \(R(x)\) to represent the RBF kernel. Also, it is worth noting that the RBF kernel is much more costly to compute, and hence its use does somewhat negate one of the major advantages of an ELM.

The input activations are given by

$$\begin{aligned} C(x) = \alpha M(x) + (1 - \alpha ) R(x), \end{aligned}$$
(5)

where \(0\le \alpha \le 1\) is a user-specified mixing parameter. Note that for \(\alpha =1\) we use only the MLP kernel \(M(x)\), while for \(\alpha =0\) only the RBF kernel \(R(x)\) is used.

10.1 ELMs in Malware Analysis

In [34], ELMs are compared to CNNs for malware classification, and it is shown that ELMs can outperform CNNs in some cases. This is impressive since ELMs have training times that are only a small fraction of those required for comparable CNNs. ELMs have also been applied to malware detection on the Android platform in [90], where the training is based on static features, with reasonably strong results. In [71], the authors consider the effectiveness of a technique that they refer to as high-performance extreme learning machines (HP-ELM). By varying the features and activation functions of their HP-ELM architecture, they achieve high accuracy on a challenging dataset. A two-layer ELM is applied to the malware detection problem in [33]. A partially connected network is used between the input and the first hidden layer, and this layer is aggregated with a fully connected network in the second layer. The authors utilize an ensemble to improve the accuracy and robustness of the resulting ELM-based system.

11 Word Embedding Techniques

Word2Vec is a technique for embedding terms in a high-dimensional space, where the term embeddings are obtained by training a shallow neural network. After the training process, words that are more similar in context will tend to be closer together in the Word2Vec space.

Perhaps surprisingly, meaningful algebraic properties also hold for Word2Vec embeddings. For example, according to [53], if we let

$$ w_0=\text {``king''}, w_1=\text {``man''}, w_2=\text {``woman''}, w_3=\text {``queen''} $$

and \(V(w_i)\) is the Word2Vec embedding of word \(w_i\), then \(V(w_3)\) is the vector that is closest—in terms of cosine similarity—to

$$ V(w_0) - V(w_1) + V(w_2). $$

Results such as this indicate that Word2Vec embeddings capture significant aspects of the semantics of the language.
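This analogy can be reproduced with pretrained Word2Vec vectors, for example, using the gensim library as sketched below. The model name refers to the publicly distributed Google News vectors; downloading that model (and the use of gensim itself) is an assumption of this example, not something used elsewhere in this chapter.

```python
# Sketch of the king - man + woman analogy using gensim's pretrained vectors.
# Requires the gensim package and a network connection to fetch the model.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")    # pretrained Word2Vec vectors
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)                                # expected to rank "queen" first
```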

The focus of this section is Word2Vec, but before discussing this popular and effective word embedding technique, we consider a couple of alternatives. First, we discuss simple embedding strategies based on hidden Markov models. Then we briefly consider a word embedding technique that uses PCA. Finally, we discuss the main ideas behind Word2Vec.

11.1 HMM2Vec

To begin, we consider individual letter embeddings, as opposed to word embeddings. We call the letter embedding technique considered here Letter2Vec.

Recall that an HMM is defined by the three matrices A, B, and \(\pi \), and is denoted as \(\lambda =(A,B,\pi )\). The \(\pi \) matrix contains the initial state probabilities, A contains the hidden state transition probabilities, and B consists of the observation probability distributions corresponding to the hidden states. Each of these matrices is row stochastic, that is, each row satisfies the requirements of a discrete probability distribution. Notation-wise, we let N be the number of hidden states, M be the number of distinct observation symbols, and T be the length of the observation (i.e., training) sequence. Note that M and T are determined by the training data, while N is a user-defined parameter. For more details on HMMs, see [77] or Rabiner’s fine paper [65].

Suppose that we train an HMM on a sequence of letters extracted from English text, where we convert all uppercase letters to lowercase and discard any character that is not an alphabetic letter or word-space. Then \(M=27\); we select \(N=2\) hidden states and use \(T=50{,}000\) observations for training. Note that each observation is one of the \(M=27\) symbols (letters plus word-space). For the example discussed below, the sequence of \(T=50{,}000\) observations was obtained from the Brown corpus of English [7]. Of course, any source of English text could be used.

For one specific case, an HMM trained with the parameters listed in the previous paragraph yields the B matrix in Table 1. Observe that this B matrix gives us two probability distributions over the observation symbols—one for each of the hidden states. We observe that one hidden state essentially corresponds to vowels, while the other corresponds to consonants. This simple example nicely illustrates the concept of machine learning, as no a priori assumption was made concerning consonants and vowels, and the only parameter we selected was the number of hidden states N. Through the training process, the model learned a crucial aspect of English directly from the data. This illustrative example is discussed in more detail in [77] and originally appeared in Cave and Neuwirth’s classic paper [8].

Table 1 Final \(B^T\) for HMM

Suppose that for a given letter \(\ell \), we define its Letter2Vec representation \(V(\ell )\) to be the corresponding row of the matrix \(B^T\) in Table 1. Then, for example,

(6)

Next, we consider the distance between these Letter2Vec representations. Instead of using Euclidean distance, we measure the cosine similarity.

The cosine similarity of vectors X and Y is the cosine of the angle between the two vectors. Let \(S(X,Y)\) denote the cosine similarity between vectors X and Y. Then for \(X=(X_0,X_1,\ldots ,X_{n-1})\) and \(Y=(Y_0,Y_1,\ldots ,Y_{n-1})\),

$$ S(X,Y) = \frac{\displaystyle \sum _{i=0}^{n-1} X_i Y_i}{ \sqrt{\displaystyle \sum _{i=0}^{n-1} X_i^2}\sqrt{\displaystyle \sum _{i=0}^{n-1} Y_i^2}} $$

In general, we have \(-1\le S(X,Y)\le 1\), but since our Letter2Vec encoding vectors consist of probabilities—and hence are non-negative values—we always have \(0\le S(X,Y)\le 1\).

When considering cosine similarity, the length of the vectors is irrelevant, as we are only considering the angle between vectors. Consequently, we might want to consider vectors of length one, \(\widetilde{X}=X/\Vert X\Vert \) and \(\widetilde{Y}=Y/\Vert Y\Vert \), in which case the cosine similarity simplifies to the dot product

$$ S(\widetilde{X},\widetilde{Y}) = \displaystyle \sum _{i=0}^{n-1} \widetilde{X}_i \widetilde{Y}_i $$

Henceforth, we use the notation \(\widetilde{X}\) to indicate a vector X that has been normalized to be of length one.
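Computing the cosine similarity is straightforward; a short NumPy sketch is given below.

```python
# Cosine similarity S(X, Y) for two vectors, computed as in the formula above.
import numpy as np

def cosine_similarity(X, Y):
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

def normalize(X):
    # After normalizing to unit length, the similarity reduces to a dot product.
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X)
```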

For the vector encodings in (6), we find that for the vowels “a” and “e”, the cosine similarity is \(S(V({{\,\mathrm{a}\,}}),V({{\,\mathrm{e}\,}}))=0.9999\). In contrast, the cosine similarity of the vowel “a” and the consonant “t” is \(S(V({{\,\mathrm{a}\,}}),V({{\,\mathrm{t}\,}}))=0.0372\). The normalized vectors \(\widetilde{V}({{\,\mathrm{a}\,}})\) and \(\widetilde{V}({{\,\mathrm{t}\,}})\) are illustrated in Fig. 28. Using the notation in this figure, the cosine similarity is \(S(V({{\,\mathrm{a}\,}}),V({{\,\mathrm{t}\,}}))=\cos (\theta )\).

Fig. 28
figure 28

Normalized vectors \(\widetilde{V}({{\,\mathrm{a}\,}})\) and \(\widetilde{V}({{\,\mathrm{t}\,}})\)

These results indicate that these Letter2Vec encodings—which are derived from a trained HMM—provide useful information on the similarity (or not) of pairs of letters. Note that we could obtain a vector encoding of any dimension by simply training an HMM with the number of hidden states N equal to the desired dimension.
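For concreteness, the sketch below shows how such Letter2Vec vectors could be extracted from a trained HMM using the hmmlearn package. The use of hmmlearn's CategoricalHMM, the variable text holding the training corpus, and the two-state configuration mirror the example above, but they are assumptions of this sketch, not details of the original experiment.

```python
# Sketch of Letter2Vec: train an HMM with N hidden states on a letter sequence
# and use columns of the B matrix (emissionprob_) as letter embeddings.
# Assumes the hmmlearn package (CategoricalHMM) and English text in `text`.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

alphabet = "abcdefghijklmnopqrstuvwxyz "            # 26 letters plus word-space
symbols = np.array([[alphabet.index(c)] for c in text if c in alphabet])

hmm = CategoricalHMM(n_components=2, n_iter=100, random_state=0)
hmm.fit(symbols)                                     # B matrix is hmm.emissionprob_

def letter2vec(letter):
    # Column of B (row of B^T) corresponding to this letter.
    return hmm.emissionprob_[:, alphabet.index(letter)]
```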

Our HMM-based approach to Letter2Vec encoding is interesting, but we want to encode words, not letters. Analogous to the Letter2Vec embeddings discussed above, we could train an HMM on words and then use the columns of the resulting B matrix (equivalently, the rows of \(B^T\)) to define word vectors. The state of the art for Word2Vec uses a dataset corresponding to \(M=10{,}000\), \(N=300\) and \(T=10^9\). Training an HMM with similar parameters would be decidedly non-trivial, as the work factor is on the order of \(N^2 T\).

While the word embedding technique discussed in the previous paragraph—we call it HMM2Vec—is plausible, it has some potential limitations. Perhaps the biggest issue with HMM2Vec is that we typically train an HMM based on a Markov model of order one. This means that the current state only depends on the immediately preceding state. By basing our word embeddings on such a model, the resulting vectors would likely provide only a very limited sense of context. While we can train HMMs using models of a higher order, the work factor would be prohibitive.

11.2 PCA2Vec

Another option for generating embedding vectors is to apply PCA to a matrix of pointwise mutual information (PMI). To construct a PMI matrix, based on a specified window size W, we compute the probabilities \(P(w_i,w_j)\) for all pairs of words \((w_i,w_j)\) that occur within a window W of each other in the dataset, and we also compute \(P(w_i)\) for each individual word \(w_i\). Then we define the PMI matrix \(X = \{x_{ij}\}\) as

$$ x_{ij} = \text {log}\biggl (\frac{P(w_j,w_i)}{P(w_i)P(w_j)}\biggr ) = \text {log}\,P(w_j,w_i) - \text {log}\,P(w_i) - \text {log}\,P(w_j). $$

Let \(X_i\) be column i of X. We use \(X_i\) as the feature vector for word \(w_i\) and perform PCA (using SVD) based on these \(X_i\) feature vectors. As usual, we project the feature vectors \(X_i\) onto the resulting eigenspace. Finally, by choosing the N dominant eigenvalues for this projection, we obtain word embedding vectors of length N.
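A minimal sketch of this construction is given below. The tokenized corpus tokens, the handling of unseen word pairs (simply left at zero), and the use of NumPy's SVD for the PCA step are assumptions of this example.

```python
# Sketch of PCA2Vec: build a PMI matrix from windowed co-occurrence counts,
# then project the word vectors onto the top-dim principal components via SVD.
# `tokens` is an assumed list of words; smoothing details are glossed over.
import numpy as np
from collections import Counter

def pca2vec(tokens, window=2, dim=2):
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pair_counts[(w, tokens[j])] += 1

    total_words = len(tokens)
    total_pairs = sum(pair_counts.values())
    pmi = np.zeros((V, V))                  # unseen pairs remain zero
    for (wi, wj), c in pair_counts.items():
        p_ij = c / total_pairs
        p_i = word_counts[wi] / total_words
        p_j = word_counts[wj] / total_words
        pmi[index[wi], index[wj]] = np.log(p_ij / (p_i * p_j))

    # PCA via SVD: keep the dim dominant components as the word embeddings.
    U, S, _ = np.linalg.svd(pmi, full_matrices=False)
    embeddings = U[:, :dim] * S[:dim]
    return {w: embeddings[index[w]] for w in vocab}
```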

It is shown in [56] that these embedding vectors have many properties similar to those of Word2Vec embeddings, with the author providing examples analogous to those we give in the next section. Interestingly, it may be beneficial in some applications to omit a few of the dominant eigenvectors when determining the PCA2Vec embedding vectors [45].

For more details on using PCA to generate word embeddings, see [45]. The aforecited blog [56] gives an intuitive introduction to the topic.

11.3 Word2Vec

Word2Vec uses an approach similar to the HMM2Vec concept outlined above. But, instead of using an HMM, Word2Vec is based on a shallow (one hidden layer) neural network. Analogous to HMM2Vec, in Word2Vec, we are not interested in the resulting model itself; instead, we make use of the learning that is represented by the trained model to define word embeddings. Next, we consider the basic ideas behind Word2Vec. Our presentation is fairly similar to that found in the excellent tutorial [51].

Suppose that we have a vocabulary of size M. We encode each word as a “one-hot” vector of length M. For example, suppose that our vocabulary consists of the set of \(M=8\) words

$$\begin{aligned} W&= (w_0,w_1,w_2,w_3,w_4,w_5,w_6,w_7) \\&= (\text {``for''}, \text {``giant''}, \text {``leap''}, \text {``man''}, \text {``mankind''}, \text {``one''}, \text {``small''}, \text {``step''}) \end{aligned}$$

Then we encode “for” and “man” as

$$ E(w_0)=E(\text {``for''}) = 10000000\ \ \text {and} \ \ E(w_3)=E(\text {``man''}) = 00010000, $$

respectively.

Now, suppose that our training data consists of the phrase

$$\begin{aligned} \text {``one small step for man one giant leap for mankind''}. \end{aligned}$$
(7)

To obtain training samples, we specify the window size, and for each offset, we use all pairs of words within the specified window. For example, if we select a window size of two, then from (7), we obtain the training pairs in Table 2.

Consider the pair “(for,man)” from the fourth row in Table 2. As one-hot vectors, this training pair corresponds to input 10000000 and output 00010000.
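The sketch below generates such training pairs and their one-hot encodings from the phrase in (7), using a window size of two; it is only meant to illustrate the construction of the training data.

```python
# Sketch: generate (input, output) skip-gram training pairs with window size 2
# from the phrase in (7), plus one-hot encodings over the 8-word vocabulary.
import numpy as np

phrase = "one small step for man one giant leap for mankind".split()
vocab = sorted(set(phrase))        # for, giant, leap, man, mankind, one, small, step
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

window = 2
pairs = []
for i, center in enumerate(phrase):
    for j in range(max(0, i - window), min(len(phrase), i + window + 1)):
        if j != i:
            pairs.append((center, phrase[j]))    # e.g., ("for", "man")

x, y = one_hot("for"), one_hot("man")            # one training pair as one-hot vectors
```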

A neural network similar to that in Fig. 29 is used to generate Word2Vec embeddings. The input is a one-hot vector of length M representing the first element of a training pair, such as those in Table 2, and the network is trained to output the second element of the ordered pair. The hidden layer consists of N linear neurons and the output layer uses a softmax function to generate M probabilities, where \(p_i\) is the probability of the output vector corresponding to \(w_i\) for the given input.

Table 2 Training data

Observe that the Word2Vec network in Fig. 29 has NM weights that are to be determined, as represented by the blue lines from the hidden layer to the output layer. For each output node \(\omega _i\), there are N edges (i.e., weights) from the hidden layer. The N weights that connect to output node \(\omega _i\) form the Word2Vec embedding \(V(w_i)\) of the word \(w_i\).

As mentioned above, the state of the art in Word2Vec for English text is based on a vocabulary of \(M=10{,}000\) words, and embedding vectors of length \(N=300\). These embeddings are obtained by training on a set of about \(10^9\) samples. Clearly, training a model of this magnitude is an extremely challenging computational task, as there are \(3\times 10^6\) weights to be determined, not to mention a huge number of training samples to deal with. Most of the complexity of Word2Vec comes from tricks that are used to make it feasible to train such a large network with a massive amount of data.

One trick that is used to speed training in Word2Vec is the subsampling of frequent words. Common words such as “a” and “the” contribute little to the model, so these words can appear in training pairs at a much lower rate than they are present in the training text.

The most significant work-saving trick that is used in Word2Vec is so-called “negative sampling.” When training a neural network, each training sample potentially affects all of the weights of the model. Instead of adjusting all of the weights, in Word2Vec, only a small number of weights are modified per training sample. For example, suppose that the output vector of a training pair corresponds to word \(w_0\). Then the “positive” weights are those of output node \(\omega _0\), and all of the corresponding weights are modified. In addition, a small subset of the \(M-1\) “negative” words (i.e., every word in the dataset except \(w_0\)) is selected, and only the weights of the corresponding output nodes are modified. The distribution used to select the negative subset is biased toward more frequent words.
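In practice, these tricks are exposed as parameters of off-the-shelf implementations. For example, a skip-gram Word2Vec model can be configured in the gensim library roughly as follows; the toy corpus and the specific parameter values are merely illustrative.

```python
# Sketch of training a skip-gram Word2Vec model with gensim, showing where the
# window size, subsampling of frequent words, and negative sampling appear.
from gensim.models import Word2Vec

sentences = [["one", "small", "step", "for", "man",
              "one", "giant", "leap", "for", "mankind"]]   # toy corpus

model = Word2Vec(
    sentences=sentences,
    vector_size=300,     # embedding dimension N
    window=2,            # context window size
    sg=1,                # use the skip-gram formulation
    negative=5,          # number of negative samples per training pair
    sample=1e-3,         # subsampling threshold for frequent words
    min_count=1,
)

vector = model.wv["man"]   # the learned embedding V("man")
```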

A high-level discussion of Word2Vec can be found in [3], while a very nice and intuitive—yet reasonably detailed—introduction is given in [51]. The original paper describing Word2Vec is [53] and an immediate follow-up paper discusses a variety of improvements that mostly serve to make training practical for large datasets [54].

Fig. 29
figure 29

Neural network for Word2Vec embeddings

11.4 Word Embeddings in Malware Analysis

Word2Vec is fairly popular in the malware detection literature. For example, in [64] Word2Vec models based on machine code form the basis for a malware detection technique, while in [12], an Android malware detection scheme dubbed DroidVecDeep uses Word2Vec results as features in deep belief networks [29]. The recent malware research in [9] considers multiple word embedding techniques (Word2Vec, HMM2Vec, and PCA2Vec) based on opcode sequences. Better results are obtained in most cases, as compared to using raw opcode sequences, which indicates that word embeddings are a useful form of feature engineering. The paper [36] considers Word2Vec and HMM2Vec embeddings for malware classification, with strong results obtained in many cases. In [62], word embeddings are used as part of a scheme that can successfully distinguish points in time where significant evolution has occurred within a malware family.

Word2Vec has proven surprisingly useful in a variety of security applications beyond the malware domain. Such applications range from network-based anomaly detection [4] to analyzing the evolution of cyberattacks [73].

12 Conclusion

In this chapter, we have provided details on a wide array of deep learning techniques that have proven useful in the field of malware analysis. We began with an introduction to the historical development of neural network-based techniques and related topics. This was followed by a discussion of several popular modern architectures. Specifically, we covered the following architectures: multilayer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory (LSTM), gated recurrent units (GRU), residual networks (ResNet), generative adversarial networks (GAN), extreme learning machines (ELM), and Word2Vec. For each of these architectures, we cited representative examples of relevant malware-related research, and in most cases, we also mentioned other applications related to information security.