1 Introduction

Most network security configurations allow DNS data to pass through. Attackers therefore often embed malware commands in DNS data and manage installed malware through a command-and-control (C&C) server. Much of this malware, botnets in particular, uses Domain Generation Algorithms (DGAs) to dynamically generate large numbers of pseudorandom domain names. These domain names are called algorithmically-generated domains (AGDs); some of them are selected as masks for malware commands and used to connect to the C&C server. To completely shut down such a botnet, defenders need to intercept all AGDs generated by the malware.

Existing solutions largely build DGA botnet detection models on top of linguistic features. Unfortunately, relying on linguistic properties has a potential drawback: malware authors may deliberately evade them, while deriving a new set of features is rather challenging. Some techniques incorporate contextual information (such as manual features [1, 2]) to further improve performance. However, this is a time-consuming and costly process that cannot meet the needs of many real-world security applications requiring real-time detection and prevention [3]. To address this challenge, we propose a lightweight semantic and visual feature extraction, and conduct a set of experiments to confirm its validity. This feature extraction is based on an in-depth analysis of semantic and visual features, and it is much simpler than existing DGA detection methods based on manual features. This strategy allows a very large number of DGA domain names to be scanned quickly.

Compared to other well-studied features such as traffic features and linguistic features, visual features are spatially rich and able to encode the visual concepts of domain names without dissecting the malware. The intuition behind using visual features is the inconsistent nomenclature of normal and DGA domain names: much of the previous work on DGA detection has shown that many DGA domain names are composed of random characters or words, unlike legitimate domain names. Thus, the visualized images of normal and DGA domain names should differ significantly. The basic motivation is to identify the structure of related malicious domains, which may be useful in identifying well-known DGAs. Introducing visual features helps to find hidden spatial sequence patterns, in a way loosely analogous to the human visual system, and hence yields better accuracy than existing methods. We believe that the proposed methodology, along with the results, will serve as a benchmark for forthcoming DGA botnet detection proposals and research endeavors.

The challenge of catching dynamically generated malicious domain names has led to recent interest in detecting DGA botnets with deep learning algorithms. In contrast to traditional approaches, deep learning algorithms learn features automatically instead of relying on manual or expert-defined features. Traditional deep learning algorithms [4], such as the 1-dimensional convolutional neural network (CNN [5]) and long short-term memory (LSTM [6]), which are commonly used to detect DGA botnets, can handle contextual relationships and local order information. However, a single model (such as CNN or LSTM) or a traditional sequential joint model (such as CNN + LSTM [4]) easily loses deep features during feature mining. For example, when a traditional LSTM processes a long sequence, features from the front of the sequence are lost, so the deep features cannot be captured completely and classification suffers. It is therefore important to find a better algorithm for learning and storing the useful information in DGA domain names. To address this challenge, we propose a novel Two-Stream network-based deep learning framework, which enforces the learning of correlation and correspondence between textual semantics and visual concepts. To highlight the significance of this study, we compared the performance of our framework with that of other existing deep learning-based methods. Overall, compared with the six baselines (see Sect. 4 for details), the performance gain in F-score is about 0.08% to 0.82% on the four datasets.

Our goal in this paper is to develop a deep learning-based framework for detecting DGA botnets without any reverse engineering of malware. We applied the Two-Stream network, with two independent encoders, to the binary and multiclass tasks and measured its performance on multimodal features from the perspective of a network administrator. In the proposed Two-Stream network, one stream encodes the semantic features and the other the visual features. Projecting the semantic and visual feature vectors into a shared vector space through the Two-Stream network provides strong associations between textual semantics and visual concepts, which has the potential to improve feature extraction. This strategy allows us to obtain good performance while keeping complexity as low as possible.

In summary, we mainly make the following contributions:

  • We introduce the use of multimodal information for detecting DGA botnets. To the best of our knowledge, this is the first study that extracts multimodal information consisting of textual semantics and visual concepts.

  • We propose a novel deep learning-based framework to encode different types of multimodal information. The proposed TS-ASRCaps automatically learns multimodal representations from the data, bypassing the human effort of feature engineering.

  • From a practical perspective, the proposed method is attractive due to its ease of implementation, acceptable computational complexity, and high execution efficiency. The experimental results show that the proposed model significantly outperforms state-of-the-art methods.

The rest of this paper is organized as follows. The next section summarizes previous research in detecting DGA botnets. Section 3 outlines our approach, including feature extraction and classification. Section 4 presents our experimental results. Section 5 concludes the paper and discusses future work.

2 Related Work

One of the most common approaches to DGA botnet detection is to cast the task as binary or multiclass classification, where each domain is assigned a label. Many methods have focused on the analysis of DNS traffic to recognize botnets [3, 7]. These DNS traffic-based methods require DNS traffic data from a top-level domain name server or a recursive resolver. Schiavoni et al. [7] proposed a mechanism called Phoenix, which characterizes DGAs using a combination of string and IP-based features; Phoenix uses a set of fingerprints to label new DGA domains. Mowbray et al. [8] proposed a procedure for DGA detection that reveals and identifies client IP addresses with an unusual distribution of second-level string lengths in DGA domain names (from DNS query data). Sivaguru et al. [9] selected test and training sets based on observation time and known seeds, and evaluated the robustness of tree-ensemble models built on manual features against deep neural networks that learn features automatically from domain names. However, these methods have limited generalization capability in the modeling process.

In addition to the above studies, a more effective line of work uses deep learning techniques to identify DGA domains. Deep learning approaches such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have recently been proposed. These deep learning-based methods significantly outperform traditional machine learning-based methods [1, 10,11,12] in classification accuracy, at the price of more complex training and larger datasets. A typical approach feeds the raw characters of a domain name into the deep learning model as semantic features, without the need for expert features [13,14,15,16], so that the model learns and classifies automatically; this also facilitates real-time detection of DGAs. In this case, all extracted character features are first converted to numbers and then truncated or padded to a fixed length.

Furthermore, CNNs have been widely used in fields such as image classification and video recognition. Recently, Catania et al. [5] provided a performance analysis and comparison of a CNN designed specifically for DGA detection. RNNs have also been successful in DGA botnet detection, since they can capture local sequential information. To overcome the multiclass imbalance problem, Tran et al. [6] proposed a novel cost-sensitive LSTM, called LSTM.MI, which introduces a cost item into the backpropagation learning procedure to measure the identification importance among classes. However, RNN-based methods still suffer from an inability to completely record the context of long texts. Yu et al. [4] discussed two types of deep neural networks (CNN and LSTM) based on character features for DGA domain detection; these can automatically identify and learn semantic knowledge.

Overall, many methods for DGA botnet detection have achieved good performance using a deep learning model as the classifier to decide whether a given domain is malicious. However, they operate only on character-level semantic features and do not consider visual features. For DGA botnet detection, predictive accuracy should generally increase when additional visual information is added.

3 Our Approach

3.1 Problem Definition

We formulated our tasks as binary and multiclass classification. Given a set \(S = \left\{ {\left( {x^{i} ,y^{i} } \right)} \right\}_{i = 1}^{N}\), \(x^{i}\) is the model input representing a domain instance, and \(y^{i}\) denotes the label of that instance.

Our goal is to find a proper function \(f\left( \cdot \right)\) that scores the class assignment of an instance. The class with the largest output probability is then taken as the final class predicted by the classifier.

3.2 The Overview of TS-ASRCaps Framework

In this paper, we propose a deep learning framework (TS-ASRCaps) for feature extraction and detection of DGAs based on multimodal information. The framework comprises two major processes: feature analysis (Sect. 3.3) and classification (Sect. 3.4). More concretely, we designed a novel ATTSRNN- and CapsNet-based Two-Stream network to model semantic and visual knowledge. Combining natural language processing with image processing, the semantic and visual features are first extracted from domain names and then fed into the two streams of the Two-Stream network, respectively. Lastly, the Two-Stream network is evaluated on the fused classification score, and a classification loss is adopted to optimize the whole network.

3.3 Feature Analysis

The accuracy of classification is directly affected by the quality of the extracted features. The multimodal vectors are divided into two types: semantic- and visual-based vectors.

3.3.1 Semantic Features

As previously noted, some deep learning methods based on character features have been proposed, under which DGA detection can be considered classification of short character strings. Previous research has demonstrated the effectiveness of character features. Inspired by the reports of previous work [4, 6], we extracted the characters of DGA domain names as semantic features, so that each character contained in a domain name is represented by a number.
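As a concrete illustration, the following is a minimal sketch of such a character-level encoding; the vocabulary, the reserved padding index, and the fixed length of 50 (matching the image length in Sect. 3.3.2) are illustrative assumptions rather than the exact mapping used in our implementation.

```python
# Sketch of character-level semantic feature extraction.
# Vocabulary and fixed length are assumptions; index 0 is reserved for padding.
MAX_LEN = 50
VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789-._"
CHAR2IDX = {c: i + 1 for i, c in enumerate(VOCAB)}

def encode_domain(domain: str) -> list:
    """Map each character of a domain name to a number, then pad/truncate."""
    ids = [CHAR2IDX.get(c, 0) for c in domain.lower()[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

print(encode_domain("example.com"))  # [5, 24, 1, 13, 16, 12, 5, 38, 3, 15, 13, 0, ...]
```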

3.3.2 Visual Features

Additionally, we extracted visual features from the domain names. A typical approach [16] represents executable malware binaries as one-channel gray-scale images by scanning the binary file and converting every byte value into an image pixel. For example, Su et al. [17] proposed a lightweight approach to classify malware in IoT environments: after generating images, they first analyzed the converted binary images and then applied a lightweight convolutional neural network to detect DDoS malware.

Inspired by this, we extracted gray-scale images from domain names and classified them according to the similarity of their image texture. This strategy lets researchers intuitively understand the spatial patterns and structures of domain names, so the visual features provide an effective input for subsequent model learning. Our main assumption is that image samples from different DGA families have distinct texture characteristics, because different DGA domains are generated by different algorithms. Furthermore, we treated the domain names as binary data [17,18,19]: each character is represented by a corresponding one-channel gray-scale pixel, and each transformed gray-scale image contains certain layouts and textures. Figure 1 illustrates the visualization process for extracting visual features. The “Domain to Image Converter” first transforms a domain into a gray-scale image, and the visual features are then represented and stored in a database. Converting domains to the corresponding images only requires creating the input vectors for subsequent model learning, which is a very fast operation.

Fig. 1 Example of the visualization process

Algorithm 1 illustrates the visualization process for converting DGA domain names to gray-scale images. First, the characters of the domain name are stored in a 1-D array of length L = 50: domain names longer than 50 characters are truncated at the 50th character, and shorter ones are padded with the special character ‘0’ until their length reaches 50. This array can be treated as a 2-D matrix of a specified width and height; for simplicity, the width and height of the image are fixed (Algorithm 1, Lines 2–3). Finally, we convert the characters of the domain name to the pixels of the image (Algorithm 1, Lines 4–10).

Algorithm 1 Converting a domain name to a gray-scale image
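Since Algorithm 1 appears as a figure, the following Python sketch reproduces its steps as described in the text above; the 5 × 10 image shape is an assumption chosen only so that width × height equals L = 50.

```python
import numpy as np

L = 50                  # fixed array length from the text
HEIGHT, WIDTH = 5, 10   # assumed fixed image shape with HEIGHT * WIDTH == L

def domain_to_image(domain: str) -> np.ndarray:
    """Convert a domain name into a one-channel gray-scale image (sketch)."""
    s = domain[:L].ljust(L, "0")                  # truncate at 50 chars, pad with '0'
    pixels = np.frombuffer(s.encode("ascii"), dtype=np.uint8)  # char -> pixel value
    return pixels.reshape(HEIGHT, WIDTH)          # 1-D array viewed as a 2-D matrix

print(domain_to_image("example.com"))
```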

3.4 Two-Stream Architecture

In this part, we introduce the Two-Stream architecture, which includes three components: Attention (ATT), Sliced Recurrent Neural Network (SRNN), and Capsule Network (CapsNet). The overall flow of the proposed TS-ASRCaps is shown in Fig. 2. The choice of the optimized Two-Stream architecture is based on the experiments in Sect. 4.

Fig. 2 Flow diagram of the TS-ASRCaps

3.5 The Semantic Stream

3.5.1 Embedding Layer

One-hot encoding is a typical method for handling categorical data: it encodes words into a sparse matrix that is used as the input of a deep learning model. However, one-hot encoding suffers from high dimensionality and the curse of dimensionality. Instead, we encode the semantic features into dense vectors of real values with a trainable embedding layer, which helps with both high dimensionality and data sparsity. The generated embedding maps the semantic knowledge into a fixed \(n \times m\) matrix, where n is the number of semantic features; each semantic feature is stored as a \(1 \times m\) vector represented by a row of the matrix.
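A trainable embedding layer of this kind can be declared in a few lines of Keras; this is a hedged sketch in which the vocabulary size n, embedding dimension m, and sequence length are placeholders, not the values from Table 2.

```python
from tensorflow import keras

n, m, max_len = 40, 128, 50   # placeholder dimensions
semantic_input = keras.Input(shape=(max_len,), dtype="int32")
# Each integer-encoded character is mapped to a dense, trainable 1 x m row vector.
embedded = keras.layers.Embedding(input_dim=n, output_dim=m)(semantic_input)
# `embedded` has shape (batch, max_len, m) and feeds the semantic stream.
```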

3.5.2 Independently Recurrent Unit

To prevent exploding and vanishing gradients in recurrent neural networks (RNNs), a variant called the independently recurrent neural network (IndRNN) has been proposed [20], which allows the network to learn long-term dependencies. The LSTM was developed to address gradient explosion and vanishing effectively when the network converges, but its hyperbolic tangent and sigmoid activation functions cause gradient decay over layers [20]. Compared with the LSTM, the neurons in the same layer of an IndRNN are independent of each other and connected across layers, and the gradient can be propagated effectively over different time steps. Through backpropagation of the gradient over time, the IndRNN memorizes more effectively when learning content and processing long sequences. Thus, the IndRNN can make full use of long-range information while preventing gradient explosion and vanishing, and its efficient gradient propagation over time steps also results in faster processing than the LSTM.

The standard IndRNN structure with a non-saturated activation function \(\sigma\) such as ReLU can be described as:

$${\mathbf{h}}_{t} = \sigma \left( {{\mathbf{W}}x_{t} + {\mathbf{U}} \odot {\mathbf{h}}_{t - 1} + {\mathbf{b}}} \right),$$
(1)

where \(\odot\) denotes the Hadamard (element-wise) product, and \({\mathbf{U}}\) and \({\mathbf{W}}\) are the recurrent weight and the input weight, respectively. Since spatial patterns are aggregated independently (i.e., through \({\mathbf{W}}\)) over time (i.e., through \({\mathbf{U}}\)), neurons at different times in the same layer are independent of each other.
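A minimal NumPy sketch of the recurrence in Eq. (1) may help: the recurrent weight is a vector, so each neuron mixes only its own previous state via the Hadamard product, unlike the full matrix recurrence of a vanilla RNN. The dimensions and the ReLU choice below are illustrative.

```python
import numpy as np

def indrnn_step(x_t, h_prev, W, u, b):
    """One IndRNN step, Eq. (1): h_t = relu(W x_t + u * h_{t-1} + b).
    `u * h_prev` is the Hadamard product: neuron i sees only its own history."""
    return np.maximum(0.0, W @ x_t + u * h_prev + b)

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                          # toy dimensions
W = rng.normal(size=(d_hid, d_in)) * 0.1
u = rng.uniform(-1.0, 1.0, size=d_hid)       # one recurrent weight per neuron
b = np.zeros(d_hid)

h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):       # a length-5 toy sequence
    h = indrnn_step(x_t, h, W, u, b)
```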

However, the IndRNN only uses forward context during sequence processing. To also integrate future context, we use the bidirectional independently recurrent neural network (Bi-IndRNN), which captures both past and future context information. The standard Bi-IndRNN structure is shown in Fig. 3. It can be described as:

$$\overrightarrow{\mathbf{h}}_{t} = \sigma \left( {{\mathbf{W}}x_{t} + {\mathbf{U}} \odot \overrightarrow{\mathbf{h}}_{t - 1} + {\mathbf{b}}} \right),$$
(2)
$$\overleftarrow{\mathbf{h}}_{t} = \sigma \left( {{\mathbf{W}}x_{t} + {\mathbf{U}} \odot \overleftarrow{\mathbf{h}}_{t - 1} + {\mathbf{b}}} \right),$$
(3)

where each input in the forward and backward directions is associated with the hidden states \(\overrightarrow{\mathbf{h}}_{t}\) and \(\overleftarrow{\mathbf{h}}_{t}\), respectively. We then concatenate the two vectors to form the output of the Bi-IndRNN, so that each hidden state \({\mathbf{h}}_{t}\) contains information about the whole input sequence.

Fig. 3 The Bi-IndRNN structure

3.5.3 Sliced Recurrent Neural Networks

To improve on the traditional RNN connection structure, a new structure called the sliced recurrent neural network (SRNN) has been proposed [21, 22], which can obtain high-level semantic information from input sequences, not just character-level information.

The input sequence is sliced into several minimum subsequences of equal length. At each layer, the SRNN works on each subsequence simultaneously through the improved RNN connection structure. Furthermore, the Bi-IndRNN is integrated as the recurrent unit of the SRNN and applied to each subsequence, so the Bi-IndRNNs can be computed in parallel, exploiting the advantages of both Bi-IndRNN and SRNN. The hidden state of each minimum subsequence on the 0-th layer can be described as:

$${\mathbf{h}}_{t}^{1} = BiIndRNN^{0} \left( {mss_{{t - l_{0} \sim t}}^{0} } \right),$$
(4)

where \(mss\) denotes a minimum subsequence on the 0-th layer, t indexes the position in the sequence, \(l_{0}\) denotes the minimum subsequence length, and a Bi-IndRNN is used on each layer. On the p-th layer, the last hidden state of the subsequences can be described as:

$${\mathbf{h}}_{t}^{p + 1} = BiIndRNN^{p} \left( {h_{{t - l_{p} }}^{p} \sim h_{t}^{p} } \right),$$
(5)

where \(l_{p}\) denotes the subsequence length of the p-th layer.

In this way, information is obtained within many short subsequences, and the important information is then transmitted in parallel through the multi-layer structure from the 0-th layer to the top layer.
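The slicing idea can be summarized in a few lines; the sequence length, slice count, and dimensions below are illustrative, and the recurrent unit is left abstract (the paper uses the Bi-IndRNN).

```python
import numpy as np

def slice_sequence(seq: np.ndarray, n_slices: int) -> np.ndarray:
    """Slice a (length, dim) sequence into n_slices equal-length subsequences
    so the recurrent unit can process them in parallel."""
    length, dim = seq.shape
    assert length % n_slices == 0, "slices must have equal length"
    return seq.reshape(n_slices, length // n_slices, dim)

seq = np.random.rand(50, 128)        # e.g. 50 embedded characters
subseqs = slice_sequence(seq, 5)     # 5 subsequences of length 10, shape (5, 10, 128)
# The last hidden state of each subsequence becomes one step of the input
# sequence for the next SRNN layer, up to the top layer (Eqs. (4)-(5)).
```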

3.5.4 Attention Layer

To distill the semantic features of the SRNN hidden layer, we stacked an attention module [23] on the top layer of the SRNN. Since each character contributes differently to the semantic expression of a domain name, the attention mechanism extracts the characters that matter most to the meaning; critical semantic features are obtained by considering the probability weight distribution, and the representations of the informative characters are then aggregated into vectors. The context vector \({\mathbf{s}}_{i}\) for the attention allocation coefficients \(\alpha_{ij}\) is generated as follows:

$${\mathbf{s}}_{i} = \sum\limits_{j = 1}^{n} {\alpha_{ij} ann_{j} } .$$
(6)

The weight coefficients of the attention mechanism are calculated as follows:

$$\alpha_{ij} = \frac{\exp \left( {e_{ij} } \right)}{\sum\nolimits_{k = 1}^{n} {\exp \left( {e_{ik} } \right)} },$$
(7)

where

$$e_{ij} = a\left( {{\mathbf{h}}_{i - 1} ,ann_{j} } \right)$$

is an alignment model that scores how well the inputs around position j and the output at position i match [23]. The score is based on the hidden state \({\mathbf{h}}_{i - 1}\) of the previous layer and the j-th annotation \(ann_{j}\).
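The weighting of Eqs. (6)–(7) is a standard softmax-weighted sum; the NumPy sketch below uses illustrative dimensions and assumes the alignment model a(·) has already produced the scores.

```python
import numpy as np

def attention_context(ann: np.ndarray, e: np.ndarray) -> np.ndarray:
    """Eqs. (6)-(7): softmax over alignment scores e_ij, then a weighted
    sum of the annotations ann_j gives the context vector s_i."""
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()      # Eq. (7), numerically stabilized
    return alpha @ ann               # Eq. (6)

ann = np.random.rand(50, 64)     # 50 annotations from the SRNN layer (toy dims)
e = np.random.rand(50)           # alignment scores produced by a(.)
s_i = attention_context(ann, e)  # 64-dim context vector
```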

3.6 The Visual Stream

The visual stream also begins with an embedding layer. All visual features are treated as specially learnt words and embedded into a specific visual vector space. This strategy enables us to better identify combinations and patterns of features.

The capsule network [24,25,26] is effective for image-related inference tasks (e.g., image classification). We introduced a version of CapsNet to extract visual features from the gray-scale images, which helps improve classification accuracy. Our main assumption is that the CapsNet will be able to successfully detect DGA botnets using raw pixel values extracted from domain names. Each capsule performs complex internal calculations on its inputs and learns how to represent and reconstruct a given sample, like an autoencoder. Compared with a CNN [5], the CapsNet converts scalars into vectors, which store features better. Furthermore, the dynamic routing algorithm ensures a more accurate output from low-level capsule vectors to higher-level parent capsules; in this way, the dynamic routing-based capsule network can encode the intrinsic spatial relationships between parts and the whole. These advantages make CapsNet a promising alternative to the standard CNN.

3.6.1 Convolutional Layer

In this network, a standard convolutional layer is used to generate effective features from different positions of the embedded vector matrix in the previous embedding layer. The convolutional operations are computed by:

$$z_{l} = f\left( {{\mathbf{W}}^{a} \circ X_{l:l + k - 1} + b} \right),$$
(8)
$${\mathbf{Z}}^{(i)} = \left[ {z_{1} ,z_{2} , \ldots ,z_{L - k + 1} } \right],$$
(9)

where \(\circ\) denotes element-wise multiplication. The convolutional filter is denoted by \({\mathbf{W}}^{a} \in R^{{K_{1} \times V_{1} }}\), where \(K_{1}\) and \(V_{1}\) describe the window size of the filter. All extracted features are collected into one feature map by sliding the filter over the embedded vector matrix.

3.6.2 Capsule Layer

Next, a convolutional capsule layer with a group-convolution operation transforms the feature maps into primary capsules. Capsules apply a subset of the filters used in the previous layer to represent each element of the current layer. More formally, the features at the same position in all feature maps are encapsulated into a corresponding capsule by the \(1\times 1\) filters \({\mathbf{W}}^{b} = \left\{ {w_{1} , \ldots ,w_{v} } \right\} \in R^{{V_{2} }}\) shared across different windows. Thus, a capsule vector \(\varvec{p}_{i}\) is computed by:

$$p_{ij} = z_{i} \cdot w_{j} \in R,$$
(10)
$$\varvec{p}_{i} = g(\left[ {p_{i1} ,p_{i2} , \ldots ,p_{{iV_{2} }} } \right]) \in R^{{V_{2} }} ,$$
(11)

where “[]” is the concatenation operator, \(V_{2}\) is the dimension of a capsule vector, and \(g\) is a non-linear squashing function. Specifically, the length \(\left\| {\varvec{p}_{i} } \right\|\) of each capsule is constrained to the unit interval [0, 1] by the squashing function:

$$g(x) = squash(x) = \frac{\left\| x \right\|^{2} }{1 + \left\| x \right\|^{2} }\frac{x}{\left\| x \right\|}.$$
(12)

Through a linear transformation matrix \({\mathbf{W}}^{c}\), the prediction vector \(\hat{\varvec{u}}_{j|i}\) from the i-th child capsule in the first capsule layer to the j-th parent capsule in the subsequent capsule layer is generated by:

$$\hat{\varvec{u}}_{j|i} = {\mathbf{W}}^{c} \varvec{u}_{i} .$$
(13)

Then, the high-level capsule \(\varvec{v}_{j}\) is calculated as a weighted sum over all prediction vectors \(\hat{\varvec{u}}_{j|i}\):

$$\varvec{v}_{j} = g\left( {\sum\limits_{i} {c_{j|i} \hat{\varvec{u}}_{j|i} } } \right),$$
(14)

where \(c_{j|i}\) are the coupling coefficients determined by the dynamic routing process.
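To make the capsule computations concrete, the sketch below implements the squashing function of Eq. (12) and a simplified routing-by-agreement loop producing the parent capsules of Eq. (14); the shapes and the three routing iterations are illustrative assumptions.

```python
import numpy as np

def squash(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Eq. (12): shrink vector length into [0, 1) while keeping its direction."""
    norm_sq = (x ** 2).sum(axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * x / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat: np.ndarray, n_iter: int = 3) -> np.ndarray:
    """Route prediction vectors u_hat (n_child, n_parent, dim) to parents."""
    b = np.zeros(u_hat.shape[:2])                             # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs
        v = squash((c[..., None] * u_hat).sum(axis=0))        # Eq. (14)
        b = b + (u_hat * v[None]).sum(axis=-1)                # agreement update
    return v

u_hat = np.random.rand(32, 10, 16)   # 32 child capsules, 10 parents, dim 16
v = dynamic_routing(u_hat)           # parent capsule outputs, shape (10, 16)
```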

3.7 Fusion Vector

The outputs of the two streams are combined directly by the concatenation operation “[]”, so that the semantic and visual features interact closely. The multimodal vector \(\varvec{m}\) is obtained by:

$$\varvec{m} = \left[ {\varvec{s},\varvec{v}} \right].$$
(15)

After fusion, the multimodal vector is passed on to further layers and used to calculate the class probabilities for the classification task. The multimodal vector \(\varvec{m}\) thus serves as the feature input for binary or multiclass classification, on which we let the Sigmoid/Softmax function operate.

For the binary classification, the Sigmoid function is used as the output layer to model the binary probabilities:

$$y_{binary} = Sigmoid \, \left(\varvec{m} \right).$$
(16)

For the multiclass classification, the Softmax function is used as the output layer to model the multiclass probabilities:

$$y_{multiclass} = Softmax \, \left(\varvec{m} \right).$$
(17)

Our network is trained by minimizing the cross-entropy loss, and the objective function is optimized with the gradient-based Adam algorithm. Using the same architecture for both the binary and multiclass classification subtasks makes developing and evaluating a given design simpler.
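The following is a hedged Keras sketch of the fusion and output stage of Eqs. (15)–(17): the two stream outputs are concatenated and passed to a Sigmoid (binary) or Softmax (multiclass) head, trained with cross-entropy and Adam at the stated learning rate of 0.001. The stream output dimensions are placeholders, not the values from Table 2.

```python
from tensorflow import keras

s = keras.Input(shape=(64,), name="semantic_stream")  # placeholder dims
v = keras.Input(shape=(64,), name="visual_stream")
m = keras.layers.Concatenate()([s, v])                # Eq. (15): m = [s, v]

y_binary = keras.layers.Dense(1, activation="sigmoid")(m)   # Eq. (16)
# For the multiclass subtask one would instead use, per Eq. (17):
# y_multi = keras.layers.Dense(n_classes, activation="softmax")(m)

model = keras.Model([s, v], y_binary)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
```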

4 Experiments and Evaluation

For a comprehensive assessment, we evaluated TS-ASRCaps in a binary experiment (Is it a DGA?) and a multiclass experiment (Which DGA?), describing the details of our experimental setup and evaluation metrics. We conducted two evaluation studies to answer the following research questions:

  • Study I: How accurate is TS-ASRCaps in detecting DGA and non-DGA domains, and how does it compare to other state-of-the-art approaches that address the same problem?

  • Study II: Can TS-ASRCaps distinguish one DGA algorithm from another with high precision?

4.1 Dataset and Metrics

Dataset Our model is evaluated on three widely used benchmark datasets containing both DGA and normal domains. The normal domains were obtained from the Alexa top 1 million domains [27]; the DGA domains were obtained from the DGA repositories of OSINT [28], Lab360 [29], and Andrey Abakumov [30]. Table 1 lists the descriptions of the three datasets, where #Num is the number of DGA and normal domain names. The OSINT DGA feed from Bambenek Consulting used in our experiment consists of thirty DGA families with a varying number of examples per class, downloaded on April 11, May 7, and July 10, 2019.

Table 1 Main datasets used in our evaluation studies

Along with these three public datasets, we also collected domains generated from real DNS traces at our university in January 2020, referred to as the XJU dataset. For all users connected to the university network, DNS queries are sent to the DNS server, and we collected all DNS traffic by mirroring the ports of that server. After deduplication, 88,913 of the queried domains were unique (86,075 normal domains and 2838 DGA domains). Statistics for this dataset are listed in Table 1.

In our experiments, 80% of the samples were randomly selected as the training set and the remainder as the test set. Note that, to avoid biases and overfitting, this evaluation did not involve any re-sampling, and the two sets have no samples in common.

Evaluation metrics To evaluate the classification results, we used common performance metrics to quantify the classifier numerically: accuracy (ACC), precision (P), recall (R), and F-score (F); the goal of DGA classification research is a high F-score. Additionally, following [6], we use micro and macro averages to average results over classes. In the micro average, smaller classes count for less than larger classes, which makes it the better performance predictor for this paper; in the macro average, all classes are weighted equally regardless of the number of elements in each class.
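For readers unfamiliar with the two averaging schemes, the difference is easy to see with scikit-learn on toy labels (illustrative values only):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]   # class 0 is large, classes 1 and 2 are small
y_pred = [0, 0, 0, 1, 1, 2]
print(f1_score(y_true, y_pred, average="micro"))  # weights every sample equally
print(f1_score(y_true, y_pred, average="macro"))  # weights every class equally
```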

Experimental Set-up We also provide the parameter settings of TS-ASRCaps, as shown in Table 2.

Table 2 Hyper parameter setting

Among them, Semantic and Visual denote the dimensions of the semantic and visual features, respectively, and Embedding denotes the dimension of the generated embedding vectors. SRNN-layers, Bi-IndRNN-layers, Attention-layers, and CapsNet-layers indicate the number of SRNN, Bi-IndRNN, Attention, and CapsNet layers, respectively; the Bi-IndRNN is used as the recurrent unit of the SRNN. Bi-IndRNN-units indicates the number of Bi-IndRNN neurons, Capsule-numbers the number of capsules in the capsule layer, and Capsule-dimensions the dimension of a capsule. Dropout indicates that the dropout technique is used against over-fitting during training, Batch is the batch size, and Epochs is the number of training iterations. The loss function used in TS-ASRCaps is the cross-entropy, and the optimizer is Adam with a learning rate of 0.001.

4.2 Study I: DGA Botnet Detection

The Performance Comparison for DGA Botnet Detection For comparison, we start from the original proposals in the literature [31], which provides five state-of-the-art deep learning models for DGA botnet detection: Endgame, Invincea, CMU, MIT, and NYU. We reimplemented these models as a set of baselines. In addition, following [6], we also reimplemented the LSTM-MI architecture. It is worth mentioning that Endgame, CMU, MIT, and LSTM-MI are also used in recent research [9]. All baselines were built using the descriptions and specified parameters of the existing papers [6, 31]. Overall, we compared our proposed model against the following six baselines:

  • Endgame Model [4, 31]: a long short-term memory network (LSTM) designed specifically for DGA detection.

  • CMU Model [31]: a standard bidirectional language model for DGA detection, consisting of a forward and a backward LSTM.

  • NYU Model [4, 5, 31]: a standard convolutional neural network (CNN) originally proposed by [32] and adapted for DGA detection by [4, 5], and [31].

  • Invincea Model [31]: an extended CNN with a parallel architecture, adapted by [31].

  • MIT Model [31]: a hybrid neural network consisting of an embedding layer, a CNN layer, and an LSTM layer, originally proposed for character-level text processing by [33].

  • LSTM-MI [6, 31]: the cost-sensitive LSTM model for DGA detection proposed by [6], which introduces cost items into the backpropagation learning procedure to account for the identification importance among classes.

These models are used independently to model the information without the Two-Stream mode. The results are listed in Table 3. All models performed well on the four benchmark datasets. LSTM-MI, Endgame, and NYU come closest to our TS-ASRCaps in overall performance, which demonstrates the strong generalization ability of these models.

Table 3 TS-ASRCaps versus baselines for DGA detection on the four benchmark datasets (the size of dataset used in this experiment is described in Table 1)

Although these perform better than the other methods, they are still not comparable to the results of TS-ASRCaps. TS-ASRCaps substantially outperforms all baselines, since it takes the mutual relationship between semantic and visual features into account. Figure 4 shows the ROC curves for the four benchmark datasets, confirming the superior performance of TS-ASRCaps in detecting DGA botnets.

Fig. 4 ROC curves for the four benchmark datasets

Is the Two-Stream network redundant? In this part, we conducted experiments to evaluate (1) the effectiveness of multimodal features and (2) the contributions of each component of TS-ASRCaps. As TS-ASRCaps comprises a set of contiguous components, namely the Attention Mechanism (ATT), Sliced Recurrent Neural Network (SRNN), and Capsule Network (CapsNet), we designed four models to investigate the necessity and benefits of these components.

  • OS-CapsNet (One-Stream CapsNet): designed to test whether visual features and the CapsNet are necessary for DGA detection; it adopts one branch of the CapsNet to learn the visual information.

  • OS-CNN (One-Stream CNN): to verify the necessity of the CapsNet, we also designed a baseline OS-CNN to learn the visual information.

  • OS-SRNN (One-Stream SRNN without Attention Mechanism): to verify the necessity of semantic features and the SRNN, we designed a baseline OS-SRNN that uses one branch of the SRNN to learn the semantic information.

  • OS-ATTSRNN (One-Stream SRNN with Attention Mechanism): unlike OS-SRNN, one branch of the SRNN with an attention mechanism learns the semantic information and models the hidden context by calculating attention weights, verifying the necessity of the attention mechanism. OS-ATTSRNN is a straightforward combination of the SRNN and attention models.

As shown in Table 4, OS-CapsNet performs best in the One-Stream mode; it is the only configuration that uses a CapsNet, as the CapsNet is not tested in combination with the SRNN. On the OSINT dataset, OS-CapsNet reaches an accuracy of 99.26%, a significant improvement over OS-CNN, because OS-CapsNet can encode the intrinsic spatial relationships between parts and the whole. Moreover, the ensemble of SRNN with attention (OS-ATTSRNN) substantially improves the overall performance compared to OS-SRNN, with an accuracy of 98.69% and an F-score of 99.32% on the XJU dataset, because OS-ATTSRNN can capture the importance of each context character while OS-SRNN cannot. The results in Table 4 affirm the effectiveness of the proposed semantic and visual features.

Table 4 Ablation studies of TS-ASRCaps on the four benchmark datasets

As expected, TS-ASRCaps outperformed all baselines, since the ensemble method benefits from incorporating both Two-Stream network outputs. Multimodal knowledge for the detection task can be learned from the DGA data because the semantic and visual embeddings are shared across the network. Again, the ensemble leads to a dramatic increase in performance, showing that the ATTSRNN and CapsNet are complementary.

The Performance Comparison for the Proportion of Training and Testing Sets To demonstrate the reliability of TS-ASRCaps, we investigated the impact of the number of training samples on classification accuracy. The experimental settings follow Sect. 4.1: we trained the model on a small number of samples and tested it on unseen samples.

Note that there are no common samples between the two sets. In this experiment, we found that the number of training samples plays a crucial role in model performance. As shown in Fig. 5, we varied the proportions of the training and testing sets. When TS-ASRCaps used only 20% of the data for training, the classification accuracy was 99.14%. This indicates that TS-ASRCaps achieves good results even with few training samples, demonstrating its strong feature-collection ability, although the accuracy has not yet reached its best level.

Fig. 5 The comparative classification performance for various proportions of training and testing sets

Furthermore, as the number of training samples increases, the model fits the data distribution better: the stronger representational ability brings the model output closer to the desired target and thus improves performance. The experimental results indicate that TS-ASRCaps is sensitive to the number of training samples, and a modest increase in training sample size improves the performance of the model.

4.3 Study II: Multiclass

The Performance Comparison for Familial Classification To assess the performance of the proposed model, we report the familial classification results of TS-ASRCaps against the rival methods (the hyper-parameters of all models are fixed without tuning). Precision, recall, and F-score are displayed in Tables 5 and 6 for Endgame, Invincea, CMU, MIT, NYU, and LSTM-MI. Based on the additional information obtained by detecting the specific DGA malware family, anti-malware providers can validate the results and make optimal detection decisions with higher confidence. Our goal is to retain a high F-score on the non-DGA (Alexa) class while increasing the micro- and macro-averaged F-scores on the DGA classes. TS-ASRCaps has a more balanced F-score across all classes, which again shows why the ensemble of SRNN and CapsNet works: they complement one another. Figure 6 shows the multiclass classification performance of TS-ASRCaps for each DGA family.

Table 5 Multiclass classification results in terms of precision, recall and F-score—part I
Table 6 Multiclass classification results in terms of precision, recall and F-score—part II
Fig. 6 Normalized confusion matrix of the TS-ASRCaps classifier

We discovered that some families, such as cryptolocker, tempedreve, hesperbot, fobber, and dircrypt, were misclassified most of the time; these observations are in line with [6]. All baselines recognize some DGA families well and perform poorly on others, perhaps because the seriously imbalanced class distribution makes the classifier less likely to learn useful information for those families. According to [6], another possible reason may be the uniform distribution of the letters generated by those malware families. Overall, even for the hard-to-classify families, TS-ASRCaps remains comparable to these state-of-the-art techniques, while retaining a high F-score on the non-DGA (Alexa) class.

In addition, to analyze the weighting process of attention, we drew the attention distributions of the attention layer, as shown in Fig. 7. The color preceding a domain name denotes the weight in the attention matrix and does not necessarily indicate non-DGAs or DGAs. The attention layer acts somewhat as an optimized feature extractor on the sequences of semantic feature vectors produced by the preceding SRNN layer, and each attention cell indicates what the semantic feature was weighting. We observed that after attention learning, some characters of a domain obtain a high weight at the attention layer while others do not; in other words, the key semantic features are collected and the irrelevant parts are ignored. This shows that our attention layer effectively selects the vital semantic features for DGA classification.

Fig. 7 Examples of the visualization results of the attention matrix

4.4 Efficiency

Analyzing a domain name is divided into two phases: feature extraction and classification. For the feature extraction phase, previous studies such as [24] have demonstrated the effectiveness of character features for deep learning models in DGA botnet detection; those methods are more efficient than ours because they extract only character features, whereas we additionally extract visual features as classifier input. In practice, extracting the multimodal features is a very fast operation, and the time taken to process one sample for feature extraction is almost negligible. Another point of discussion is the efficiency of our method in classifying DGAs, since classification time is critical for potential users. To assess the overall runtime, we measured the processing time for feature extraction and report the average runtime for processing one sample. Furthermore, we timed prediction for 100 k domains, including both feature extraction and classification. Figure 8 displays the distribution of overall runtimes for all models; the five-pointed star denotes the average runtime and the segment inside the box the median.

Fig. 8 Frequency distribution of runtimes

For our method, the average is just 0.015 s for processing one domain name on a single GPU (GeForce GTX 1060 with 8 GB RAM), of which 0.0005 s is spent on feature extraction and 0.015 s on classification. Although our method is slower than some existing methods, the additional computation cost is almost negligible. Additionally, as shown in Table 3, the baseline methods achieve competitive results on some datasets but fail to adapt to others: Endgame and LSTM-MI perform quite well on OSINT, Lab360, and AR but poorly on XJU, and Invincea performs favorably on OSINT but worse on the other datasets. Our proposed method performs consistently well on all datasets, demonstrating good generalization ability. Given the substantially superior performance of TS-ASRCaps over the other state-of-the-art techniques, the additional cost it incurs is justified. Overall, our method allows for computationally inexpensive feature extraction and classification, so we can claim that the proposed method facilitates real-time detection of DGAs.

5 Discussions

In this paper, we explored a novel Two-Stream network that simultaneously captures the semantic distribution and spatial context information contained in DGA domain names, without relying on any other complex or expert features. We evaluated our framework from two aspects: detecting DGA versus non-DGA domains, and distinguishing one DGA algorithm from another. To the best of our knowledge, this is the first application of multimodal deep learning to DGA botnet detection.

Though TS-ASRCaps performed extremely well in our experiments, there is still room for improvement. We will extend our system to incorporate new multimodal features, which may promote the representation of higher-level concepts. Another important consideration is the modularity of the system: future work could replace some components of this architecture with newer, better-performing versions.