
1 Introduction

Contemporary malware families installed on infected computers typically use domain generation algorithms (DGAs) [17]. According to DGArchive, there are 72 different DGA families. The number of DGA families is likely to increase in the near future because DGAs make botnet takedown more difficult [17]. The technique used by malware families that rely on DGAs is called domain fluxing [19]. A DGA generates a large number of pseudo-random domain names, and a subset of these domain names is used to establish a connection to the command and control (C2C) server. For the communication to succeed, the malware author has to register only a small subset of the domain names. Once communication is established, the malicious author can start executing malicious activities, while the collected information is passed to the botmaster. The botmaster then issues instructions to the bots and sometimes even updates the malware itself. Analysis of DNS traffic provides a way to detect malicious activities hosted by botnets. In recent days, botnets have become a primary means of carrying out many malicious activities [26].

In recent days, botnets have remained a serious threat to the Internet service community. Thus detecting DGAs has been a significant problem in the domain of cyber security [11]. DNS blacklisting was the most commonly used method for detecting DGA domain names in earlier days. The significance of DNS blacklisting for DGA analysis was studied by Kührer et al. [11]. The study used both public and private blacklists. Private blacklists are prepared by vendors, and the experimental results show that they performed better than the public blacklists, whose performance varied across DGA malware families. The authors suggest that DNS blacklisting is very useful and can be combined with other approaches to provide a more appropriate level of protection. Another approach is to reverse engineer the malware along with its DGA to identify the seed. Once the seed is known, subsequent domain names can be registered, and those registered domain names act as an impersonator C2C server to seize the botnet. This process is typically called sinkholing [23]. Once the botnet is seized, the adversary has to redeploy a new botnet with revised seeds to continue the malicious activities. Both blacklisting and sinkholing are time-consuming and resource-intensive. More importantly, blacklisting completely fails to detect new types of domain names or variants of existing ones, and sinkholing has a low success rate in detecting new types of DGA domain names and variants of existing ones. Later, DGA classifiers were built using machine learning (ML) algorithms. Such a classifier sits in the network, captures the DNS requests, and looks for DGA domain names. Once a DGA domain name is detected, it alerts the network administrator to examine the origin of the DGA further. The existing works on ML based detection are classified into retrospective and real-time.
Retrospective detection methods cluster the traffic and estimate the statistical properties of each cluster for classification [3, 43, 44]. To enhance the detection rate, retrospective methods use other contextual information such as WHOIS information, NXDomain responses, and HTTP headers. Most of the existing methods belong to retrospective detection and have several issues when deployed in real-time systems [10, 43, 44]. Real-time detection methods, on the other hand, act on the domain name alone to detect DGA domain names. Most of the ML based real-time detection methods rely on feature engineering. These methods are easy to evade and require extensive domain knowledge to extract features that distinguish legitimate from DGA domain names [20]. In recent days, to avoid the feature engineering phase, deep learning has been leveraged in the field of cyber security [25, 27, 28, 29, 30, 31, 32, 34, 35, 37, 39]. In [42] the authors proposed LSTM based DGA detection and categorization, a method that can be deployed in any environment. Generally, deep learning architectures are prone to multiclass imbalance problems. A few DGA families contain only a few samples of domain names. Deep learning architectures are therefore biased towards the classes that have more samples, and as a result DGA families with very few samples remain undetected. Additionally, deep learning based DGA detection is more robust in an adversarial environment than DGA detection based on classical machine learning (CML). To handle the multiclass imbalance problem, [24] proposed Cost-Sensitive LSTM, which performed better than the Cost-Insensitive LSTM architecture. Consequently, in this work we use Cost-Sensitive LSTM and additionally consider other Cost-Sensitive deep learning architectures, evaluating their performance on the AmritaDGA data set. The main contributions of the proposed work are given below:

  • This work proposes DeepDGA-MINet, which uses Cost-Sensitive deep learning architectures that can handle the multiclass imbalance in DGA family categorization. The performance of various Cost-Sensitive deep learning architectures is shown on the AmritaDGA data set.

  • A detailed experimental analysis of various Cost-Sensitive deep learning architectures is shown on two different types of testing data sets. These data sets are completely disjoint and include time information, so models evaluated on them help to assess zero day malware detection.

The rest of this chapter is organized as follows: Sect. 2 discusses the background details of DNS, botnet, DGA, Keras embedding, deep learning architectures, and Cost-Sensitive deep learning architectures. Section 3 discusses the related works on the application of deep learning to DGA analysis. Section 4 describes the data set. Section 5 discusses the statistical measures. Section 6 presents the proposed framework. Section 7 presents the experimental results and observations. Finally, conclusion, future work, and discussions are presented in Sect. 8.

2 Background

2.1 Domain Name System (DNS)

The domain name system (DNS) is a critical component of the Internet service system. It maintains a distributed database that translates domain names to Internet protocol (IP) addresses and vice versa. DNS is thus a main component of nearly all network services and has been a main target for attackers. A domain name is the name of a particular application in the Internet service system and follows the naming convention defined by DNS. The parts (labels) of a domain name are separated by dots; each label is at most 63 characters long, and the full domain name is at most 253 characters long. Generally, the right-most element of a domain name is the root and the left-most element is the host label. DNS manages domain names in a hierarchy called the domain name space, which is divided into different authorities called DNS zones. The hierarchy is shown in Fig. 37.1 and represents the organizational structure. A domain name that includes both the host label and the root is called a fully qualified domain name (FQDN). Primarily, there are two types of DNS server: recursive and non-recursive. A recursive server contacts a nearby DNS server if it does not hold the requested information. This opens the possibility of various attacks such as denial of service (DoS), distributed denial of service (DDoS), and DNS cache poisoning.
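The structure of a domain name described above can be illustrated with a minimal sketch. The function name split_domain and the returned field names are hypothetical; the length limits used are the standard DNS limits of 63 characters per label and 253 characters for the full name.

```python
def split_domain(fqdn: str) -> dict:
    """Split a domain name into labels and check the standard DNS length limits."""
    name = fqdn.rstrip(".").lower()          # a trailing dot denotes the root
    labels = name.split(".")
    return {
        "labels": labels,
        "host_label": labels[0],             # left-most label
        "tld": labels[-1],                   # right-most label below the root
        "valid_length": len(name) <= 253
        and all(1 <= len(label) <= 63 for label in labels),
    }


info = split_domain("www.example.com.")
print(info["labels"])        # ['www', 'example', 'com']
print(info["valid_length"])  # True
```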

Fig. 37.1 An overview of domain name system

2.2 Botnet

A botnet is a network of compromised computers that is remotely controlled by a botmaster or bot herder. The compromised computers run the same malicious code, and each compromised computer in the network is called a bot. The botmaster frequently updates the code of the bots to evade current detection methods. A bot uses DGAs to establish a communication channel to a command and control (C2C) server. Botnet behavior was recently discussed in detail by Alomari et al. [2]. Botnets are commonly used by cyber criminals to carry out various types of malicious activities and have become a serious threat to Internet services. Recent botnets use a fluxing approach to establish a communication point between bot and C2C server. Two types of fluxing are mostly used: IP flux and domain flux. This work focuses on domain flux, which uses a DGA. The DGA is shared between the botmaster and the bots. While trying to establish a connection to the C2C server, a DGA may generate many failed DNS queries.

Based on their architecture, botnets are grouped into three categories [6]: centralized, decentralized, and hybrid. In a centralized architecture, a botmaster controls all the connected bots from a single point called the command and control server (C2C server). Centralized botnet architectures use star or hierarchical topologies and the Internet relay chat (IRC) and Hyper Text Transfer Protocol (HTTP) protocols. A decentralized architecture contains more than one C2C server and uses a peer-to-peer protocol. A hybrid architecture is a combination of the centralized and decentralized architectures.

2.3 Domain Generation Algorithms (DGAs)

Most recent malware families use a DGA instead of hardcoded addresses [17]. A DGA is an algorithm that generates a large number of pseudo-random domain names from a seed and appends a top level domain (TLD) such as .com or .edu to each of them. The seed can be anything; date and time information is most commonly used. The seed is shared between the botmaster and the bots.
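To make the idea concrete, the following is a toy sketch of a date-seeded generator. It does not correspond to any real malware family: the function name toy_dga, the use of SHA-256, the label length, and the counter scheme are all assumptions chosen for illustration. It only shows that two parties sharing the algorithm and the seed derive the same list of names.

```python
import hashlib
from datetime import date


def toy_dga(seed_date: date, count: int = 5, length: int = 12, tld: str = ".com") -> list:
    """Generate `count` pseudo-random domain names from a date seed (illustrative only)."""
    domains = []
    for i in range(count):
        # Hash the shared seed together with a counter; both sides can recompute it.
        digest = hashlib.sha256(f"{seed_date.isoformat()}-{i}".encode()).hexdigest()
        # Map each hex digit to a lowercase letter to form the pseudo-random label.
        label = "".join(chr(ord("a") + int(ch, 16) % 26) for ch in digest[:length])
        domains.append(label + tld)
    return domains


# Botmaster and bot run the same call and obtain the same candidate list.
print(toy_dga(date(2019, 1, 1)))
```

Because the list depends only on the seed, a defender who recovers the algorithm and the seed can also enumerate it in advance.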

2.4 Domain Name Representation Using Keras Embedding

In this work, Keras embedding is used for domain name representation. First, a dictionary containing the unique characters of the DGA data set is formed. Generally, it includes an extra position to handle an unknown character in the testing phase. Each character in a domain name is replaced by its index in the dictionary. Keras embedding then transforms each index in the domain name vector into an N dimensional continuous vector, where N is a hyperparameter. This type of representation captures the similarity among the characters in a domain name. The Keras embedding takes the following parameters as input:

  • Dictionary-size: The number of unique characters

  • Embedding-length: The size of the embedding vector dimension

  • Input-length: The size of the input vector

We used a Gaussian distribution to initialize the weights at the beginning of training. The weights are fine-tuned during backpropagation, jointly with the other deep learning layers.
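The lookup performed by an embedding layer can be sketched in NumPy as follows. This is not the Keras implementation; it only mimics the table lookup. The sizes 39, 128, and 91 are the values reported in the experimental section of this chapter, while the standard deviation 0.05, the extra row for index 0, and the a=1, b=2, ... character mapping are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

dictionary_size = 39      # number of unique characters
embedding_length = 128    # N, the embedding dimension
input_length = 91         # length of the padded index vector

# Assumption: one extra row so that index 0 can serve as padding / unknown character.
# Weights are drawn from a Gaussian distribution; training would fine-tune them.
embedding_matrix = rng.normal(0.0, 0.05, size=(dictionary_size + 1, embedding_length))


def embed(index_vector):
    """Replace every character index by its row of the embedding matrix."""
    return embedding_matrix[np.asarray(index_vector)]


indices = np.zeros(input_length, dtype=int)
indices[:6] = [7, 15, 15, 7, 12, 5]   # "google" under a hypothetical a=1, b=2, ... mapping
vectors = embed(indices)
print(vectors.shape)  # (91, 128)
```

Identical characters map to identical vectors, so any similarity between characters has to be learned into the rows of the table during training.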

2.5 Deep Learning Architectures

Deep learning is an advanced form of classical machine learning (CML) [13]. Deep learning architectures have the capability to obtain an optimal feature representation from raw input samples. Generally, there are two types of deep learning architectures: convolutional neural networks (CNNs) and recurrent structures (RSs) such as the recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU). CNNs are primarily used on data with spatial properties, and RSs are used on data with time or sequence information. Basic information along with mathematical details for RSs and CNN is discussed below.

The recurrent neural network (RNN), its enhanced model long short-term memory (LSTM) [13], and the minimized version of LSTM called gated recurrent unit (GRU) [13] belong to the family of RSs. They are most commonly used on sequential data tasks. The structure of RSs looks similar to that of classical feed-forward networks (FFN), except that the neurons in RSs additionally contain a self-recurrent connection. All RSs are trained using backpropagation through time (BPTT). The RNN suffers from vanishing and exploding gradients when the network is trained over many time-steps [13]. LSTM was introduced to handle this issue. It replaces the simple RNN unit with a memory block, which has the capability to carry important information across time-steps. A memory block contains a memory cell and three gating functions, the input gate, output gate, and forget gate, which control the memory cell. However, LSTM contains more parameters. Later a minimized version of LSTM, named GRU, was introduced. GRU achieves performance comparable to LSTM and is computationally less expensive. The basic units of RNN and LSTM are shown in Fig. 37.2 and that of GRU in Fig. 37.3. The computational functions for RNN, LSTM, and GRU are defined mathematically as follows:

Fig. 37.2 Architecture of recurrent neural network (RNN) unit (left) and long short-term memory (LSTM) memory block (right)

Fig. 37.3 Architecture of unit in gated recurrent unit (GRU)

Generally RSs take an input sequence x = (x_1, x_2, …, x_T) (where x_t ∈ R^d) and map it to a hidden vector sequence h = (h_1, h_2, …, h_T) and an output sequence o = (o_1, o_2, …, o_T) from t = 1 to T by iterating the following equations:

Recurrent Neural Network (RNN)

$$\displaystyle \begin{aligned} {h_t} = \sigma ({w_{xh}}{x_t} + {w_{hh}}{h_{t - 1}} + {b_h})\end{aligned} $$
(37.1)
$$\displaystyle \begin{aligned} {o_t} = \,sf({w_{ho}}{h_t} + {b_o}) \end{aligned} $$
(37.2)

Long Short-Term Memory (LSTM)

$$\displaystyle \begin{aligned} {i_t} = \sigma ({w_{xi}}{x_t} + {w_{hi}}{h_{t - 1}} + {w_{ci}}{c_{t - 1}} + {b_i})\end{aligned} $$
(37.3)
$$\displaystyle \begin{aligned} {f_t} = \sigma ({w_{xf}}{x_t} + {w_{hf}}{h_{t - 1}} + {w_{cf}}{c_{t - 1}} + {b_f})\end{aligned} $$
(37.4)
$$\displaystyle \begin{aligned} {c_t} = {f_t} \odot {c_{t - 1}} + {i_t} \odot \tanh ({w_{xc}}{x_t} + {w_{hc}}{h_{t - 1}} + {b_c})\end{aligned} $$
(37.5)
$$\displaystyle \begin{aligned} {o_t} = \sigma ({w_{xo}}{x_t} + {w_{ho}}{h_{t - 1}} + {w_{co}}{c_t} + {b_o})\end{aligned} $$
(37.6)
$$\displaystyle \begin{aligned} {h_t} = {o_t} \odot \tanh ({c_t}) \end{aligned} $$
(37.7)

Gated Recurrent Unit (GRU)

$$\displaystyle \begin{aligned} {u_t} = \sigma ({w_{xu}}{x_t} + {w_{hu}}{h_{t - 1}} + {b_u})\end{aligned} $$
(37.8)
$$\displaystyle \begin{aligned} {f_t} = \sigma ({w_{xf}}{x_t} + {w_{hf}}{h_{t - 1}} + {b_f})\end{aligned} $$
(37.9)
$$\displaystyle \begin{aligned} {c_t} = \tanh ({w_{xc}}{x_t} + {w_{hc}}({f_t} \odot {h_{t - 1}}) + {b_c})\end{aligned} $$
(37.10)
$$\displaystyle \begin{aligned} {h_t} = {u_t} \odot {h_{t - 1}} + (1 - {u_t}) \odot {c_t} \end{aligned} $$
(37.11)

where the w terms denote weight matrices, the b terms denote bias vectors, σ is the sigmoid activation function, sf at the output layer denotes the softmax activation function, tanh denotes the hyperbolic tangent activation function, ⊙ denotes element-wise multiplication, and i, h, f, o, c denote the input, hidden, forget, output, and cell activation vectors. In GRU the input gate and forget gate are combined into an update gate u.
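As an illustration of how the LSTM equations (37.3)-(37.7) are iterated over a sequence, the following NumPy sketch performs the forward pass only. The dictionary keys mirror the weight names in the equations. Treating the peephole weights w_ci, w_cf, and w_co as vectors applied element-wise, the hidden size of 8, and the random initialization are assumptions of this sketch rather than details given in the chapter.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time-step following Eqs. (37.3)-(37.7)."""
    i_t = sigmoid(p["w_xi"] @ x_t + p["w_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["w_xf"] @ x_t + p["w_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["w_xc"] @ x_t + p["w_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["w_xo"] @ x_t + p["w_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(d, n, rng):
    """Random weights for input size d and hidden size n (illustrative initialization)."""
    p = {}
    for g in "ifco":
        p[f"w_x{g}"] = rng.normal(0.0, 0.1, (n, d))
        p[f"w_h{g}"] = rng.normal(0.0, 0.1, (n, n))
        p[f"b_{g}"] = np.zeros(n)
    for g in "ifo":
        p[f"w_c{g}"] = rng.normal(0.0, 0.1, n)   # peephole weights, applied element-wise
    return p


rng = np.random.default_rng(0)
d, n, T = 128, 8, 5                 # embedding size, hidden size, sequence length
params = init_params(d, n, rng)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(T, d)):  # iterate t = 1..T
    h, c = lstm_step(x_t, h, c, params)
print(h.shape)  # (8,)
```

The final hidden vector h summarizes the whole sequence and is what a classification layer would consume.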

The convolutional neural network (CNN) is a type of deep learning architecture which is most commonly used in spatial data analysis [13]. A CNN is primarily composed of three different sections: convolution, pooling, and fully connected layers. The convolution operation slides a set of filters over the domain name vector and extracts features. The collection of features produced by a convolutional layer is called a feature map. Because the feature map is large, pooling is used to reduce its dimension; pooling can be max, min, or average pooling. Finally, the reduced feature representation is passed into a fully connected layer for classification. Alternatively, the output of the pooling layer can be passed into RSs to extract the sequence information among the characters in the domain name. This type of hybrid architecture is shown in Fig. 37.4.
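The convolution and pooling steps can be sketched in NumPy as follows, under stated assumptions: the filter width of 3, the 4 filters, the ReLU activation, and the pool size of 2 are illustrative choices, not the configuration used in this chapter. The input is taken to be a (91, 128) matrix of character embeddings.

```python
import numpy as np


def conv1d(seq, filters, bias):
    """Valid 1-D convolution: seq is (T, d), filters is (k, d, m); returns (T-k+1, m)."""
    k = filters.shape[0]
    T = seq.shape[0]
    out = np.empty((T - k + 1, filters.shape[2]))
    for t in range(T - k + 1):
        window = seq[t:t + k]                         # (k, d) slice under the filters
        out[t] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                       # ReLU activation (an assumption)


def max_pool1d(feature_map, pool):
    """Non-overlapping max pooling along the sequence axis."""
    T = (feature_map.shape[0] // pool) * pool         # drop the ragged tail, if any
    return feature_map[:T].reshape(-1, pool, feature_map.shape[1]).max(axis=1)


rng = np.random.default_rng(0)
embedded = rng.normal(size=(91, 128))                 # stand-in for an embedded domain name
filters = rng.normal(0.0, 0.1, size=(3, 128, 4))
feature_map = conv1d(embedded, filters, np.zeros(4))
pooled = max_pool1d(feature_map, 2)
print(feature_map.shape, pooled.shape)  # (89, 4) (44, 4)
```

The pooled output is still a sequence (here 44 steps of 4 features), which is why it can be fed either to a fully connected layer or to a recurrent layer in the hybrid setting.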

Fig. 37.4 An overview of combination of convolutional neural network (CNN) and long short-term memory (LSTM) architectures

2.6 Employing Cost-Sensitive Model for Deep Learning Architectures to Handle Multiclass Imbalance Problem

All deep learning architectures aim to minimize the cost function of the network, which compares the actual output y^l with the target t^l, where l indexes the output neurons. The cost function for a softmax output layer is defined as follows:

$$\displaystyle \begin{aligned} E(t) = - \sum_{S \in samples} {\sum_l {{t^l}} } (t)\log {y^l}(t) \end{aligned} $$
(37.12)

Generally, gradient descent with a truncated version of real-time recurrent learning (RTRL) is used to minimize the cost function [13]. As Eq. (37.12) indicates, deep learning architectures weight the samples of every class equally. Thus, they are prone to the class imbalance problem: they are biased towards the classes that have more samples and perform worse at detecting DGA families that are under-represented in the training data set [42]. Cost-Sensitive learning is an important approach in many real-time data mining applications and is capable of handling the multiclass imbalance problem [18].

Various methods exist to convert the Cost-Insensitive LSTM into a Cost-Sensitive method [48]. One of the most commonly used is to balance the training samples through oversampling or undersampling [49]. In [48] the authors reported that resampling is not an efficient way of dealing with class imbalance in multiclass applications. Later, several threshold-based methods were introduced. In [12] the authors proposed Cost-Sensitive neural networks that handle the multiclass imbalance problem by using an error function that minimizes the expected cost. However, class imbalance was not the main target of their experiments. Subsequently, [24] introduced Cost-Sensitive LSTM, which incorporates the misclassification costs into the backward pass of LSTM. Each sample S is coupled with a cost item c[class(S), k], where k and class(S) denote the predicted and actual class, respectively. A cost weight is assigned based on the frequency of samples of a class. Generally, the cost items indicate the classification importance.

$$\displaystyle \begin{aligned} E(t) = - \sum_{S \in samples} {\sum_l {{t^l}} } (t)\log {y^l}(t)\,c[class(S),k] \end{aligned} $$
(37.13)

Based on Eq. (37.13), the basic equations for all deep learning architectures are changed by including the cost item. A cost item typically controls the magnitude of weight updates [9].

Initially, the cost matrix for the input data samples is not known. A genetic algorithm can be used to identify the optimal cost matrix. However, this requires more time and is considered a difficult task [41]. Let us assume that all data samples of a class have equal cost. c[i, i] denotes the misclassification cost of class i, which is derived from the class distribution as

$$\displaystyle \begin{aligned} c[i,i] = {\left( {\frac{1}{{{n_i}}}} \right)^\gamma } \end{aligned} $$
(37.14)

where γ ∈ [0, 1] is a hyperparameter and n_i is the number of samples in class i. With γ = 1, c[i, i] is inversely proportional to the class size n_i; with γ = 0, every class receives a cost of 1 and the deep learning architectures are Cost-Insensitive.
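The effect of γ and of the cost item on the loss can be sketched as follows. The sketch makes a simplifying assumption: only the per-class cost c[i, i] of Eq. (37.14) is used, so each sample is scaled by the cost of its actual class. The function names and the toy class counts are hypothetical.

```python
import numpy as np


def class_costs(class_counts, gamma):
    """c[i, i] = (1 / n_i) ** gamma for every class i, as in Eq. (37.14)."""
    n = np.asarray(class_counts, dtype=float)
    return (1.0 / n) ** gamma


def cost_sensitive_cross_entropy(y_true, y_pred, costs, eps=1e-12):
    """Cross-entropy in which each sample is scaled by the cost of its actual class."""
    y_true = np.asarray(y_true)                        # integer class labels, shape (S,)
    p_true = y_pred[np.arange(len(y_true)), y_true]    # predicted probability of the true class
    return float(-np.sum(costs[y_true] * np.log(p_true + eps)))


counts = [10000, 100]                       # a majority and a minority class (toy numbers)
y_true = [0, 1]
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])

for gamma in (0.0, 0.5, 1.0):
    costs = class_costs(counts, gamma)
    print(gamma, costs, cost_sensitive_cross_entropy(y_true, y_pred, costs))
```

With γ = 0 both classes have cost 1 and the loss reduces to the ordinary cross-entropy; as γ grows, errors on the minority class weigh relatively more than errors on the majority class.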

3 Related Works on Domain Generation Algorithms (DGAs) Analysis

A detailed review of the detection of malicious domain names is reported by Zhauniarovich et al. [47]. In earlier days, blacklisting was the most commonly used method. Blacklisting completely fails to detect new kinds or variants of DGA based domain names. Later, many approaches based on machine learning (ML) were introduced. These ML based solutions are mostly retrospective, which means that they build clusters based on statistical properties [3, 43, 44]. Such methods are not efficient for real-time DGA domain name detection. Additionally, the retrospective methods take advantage of additional information obtained from HTTP headers, NXDomains, and passive DNS information. Later, real-time detection based on ML was introduced. These methods act on a per domain basis, which means that they extract different features from the domain name and pass them into ML algorithms for classification [20]. However, these ML based solutions rely on feature engineering, which is considered a daunting task, and they are vulnerable in an adversarial environment.

Recently, deep learning has been leveraged for DGA detection, which completely avoids feature engineering [33, 42]. In [42] the authors proposed a method for DGA detection and categorization. The method uses LSTM, which looks for DGA domain names on a per domain basis. The method performed well when compared to the benchmark classical methods based on HMM, and the results were also compared with feature engineering methods. In [33] the authors proposed a method to collect DNS logs inside an Ethernet LAN and used deep learning architectures such as RNN and LSTM to analyze the DNS logs. The results were compared with a classical method, feature engineering with a Random Forest classifier. A detailed experimental analysis is shown for various data sets collected in real-time and from public sources. Various deep learning architectures such as RNN, GRU, LSTM, CNN, and CNN-LSTM were evaluated for DGA detection and categorization in [40]. For the comparative study, bigram with logistic regression and feature engineering with a Random Forest classifier were used. In all the experiments, the deep learning architectures performed well when compared to the classical methods. In [26] the authors developed a cyber-threat situational awareness framework using DNS data. They showed a method to collect the DNS logs at an Internet service provider level and used a deep learning architecture for DNS data analysis with the aim of detecting DGA domain names. In [45] the authors proposed a method to automatically label the data as DGA or non-DGA and used a deep learning architecture for DNS data analysis. For the comparative study, 11 different feature sets were extracted based on domain knowledge and passed into a Random Forest classifier. All the models were evaluated on a very large data set collected from both public sources and real-time DNS streams. 
The deep learning model, particularly CNN, performed well when compared to the feature based approach, and the system performance was shown in a live stream deployment. In [14] the authors evaluated the performance of recurrent networks on a very large data set consisting of 61 different DGA malware families. In recent days, many deep learning architectures based on character level embedding have been introduced for text applications in the field of NLP. To leverage these models, [46] evaluated the performance of various benchmark deep learning architectures with character based models for DGA detection and compared them with classical methods, namely feature engineering with Random Forest and multilayer perceptron (MLP) classifiers. The methods based on deep learning with character level embedding performed better than the classical methods. Various ImageNet models such as AlexNet, VGG, SqueezeNet, InceptionNet, and ResNet were adapted for DGA detection by Feng et al. [7]. They used a preprocessing step to convert the domain name into image format and followed a transfer learning approach. In [15] the authors evaluated the performance of various supervised learning models such as LSTM, recurrent SVM, CNN with LSTM, and bidirectional LSTM and compared them with the classical methods HMM, C4.5, ELM, and SVM on a data set of 38 DGA families collected in real-time. In [5] the authors proposed a method which uses recurrent networks for DGA detection. The method takes advantage of side information from the WHOIS database, because DGA families with a high average Smashword score are very difficult to detect from the domain name alone on a per domain basis. The Smashword score is the average n-gram (n ranges [3–5]) intersection with words from an English word dictionary. It is thus a measure of the closeness between DGA domain names and English words. 
In [24] the authors proposed Cost-Sensitive LSTM to handle multiclass imbalance in DGA family detection. The proposed method showed a 7% improvement in both precision and recall when compared to the Cost-Insensitive LSTM. Additionally, the Cost-Sensitive LSTM showed better performance in detecting 5 additional DGA based bot families. In [22] the authors evaluated the performance of various benchmark character based models for DGA detection and categorization. These models are based on an ensemble of human engineered and machine learned features. Time and seed are taken into account when selecting the data sets for training and testing. This methodology makes it possible to evaluate the robustness of the trained classifiers in identifying domain names generated by the same families at different times or after seed changes. They also state that their method performed well for detecting DGAs with a time dependent seed when compared to time invariant DGAs. They also evaluated the best performing model on real-time DNS traffic and showed that many of the legitimate domain names are flagged as legitimate. This is mainly because Alexa is not completely free of malicious domain names in real-time DNS traffic. In [16] the authors proposed a unique deep learning architecture called spoofnet, which correlates both DNS and URL data to detect malicious activities. Subsequently, the spoofnet architecture was evaluated on various DGA and URL data sets and additionally employed for spam email detection [38]. To meet zero day malware detection, [37] incorporated time information in generating the training and testing data sets. To leverage various character based benchmark models, [35] adapted these approaches to DGA analysis.

4 Description of Data Set

To measure the performance of Cost-Sensitive deep learning architectures, we have used the AmritaDGA data set [36]. This data set was used as part of the DMD-2018 shared task. Along with the data set, a baseline system is publicly available for further research. The data set contains domain names collected from publicly available sources and from real-time DNS traffic inside an Ethernet LAN. Additionally, the data set was designed with the time information in mind, so models trained on it can be evaluated for zero day malware detection. The data set includes two types of testing data sets: the Testing 1 data set is formed from publicly available sources and the Testing 2 data set is formed from DNS traffic inside an Ethernet LAN. The statistics of AmritaDGA are shown in Table 37.1. The data set was used for two tasks, binary and multiclass classification. Binary classification aims at classifying a domain name as either legitimate or DGA, and multiclass classification assigns a domain name to its DGA family.

Table 37.1 AmritaDGA data set used in DMD-2018 shared task

5 Statistical Measures

To measure the performance of the trained models of the various deep learning architectures, we adopted several statistical measures. These measures are computed from the following quantities: positive (PD), a legitimate domain name; negative (ND), a DGA domain name; true positive (T_PD), a legitimate domain name that is predicted as legitimate; true negative (T_ND), a DGA domain name that is predicted as a DGA domain name; false positive (F_PD), a DGA domain name that is predicted as legitimate; and false negative (F_ND), a legitimate domain name that is predicted as a DGA domain name. T_PD, T_ND, F_PD, and F_ND are obtained from the confusion matrix, in which each row holds the domain name samples of a predicted class and each column holds the domain name samples of an actual class. The statistical measures considered in this study are defined as follows:

$$\displaystyle \begin{aligned} Accuracy = \frac{{{T_{PD}} + {T_{ND}}}}{{{T_{PD}} + {T_{ND}} + {F_{PD}} + {F_{ND}}}} \end{aligned} $$
(37.15)
$$\displaystyle \begin{aligned} Recall = \frac{{{T_{PD}}}}{{{T_{PD}} + {F_{ND}}}} \end{aligned} $$
(37.16)
$$\displaystyle \begin{aligned} Precision = \frac{{{T_{PD}}}}{{{T_{PD}} + {F_{PD}}}} \end{aligned} $$
(37.17)
$$\displaystyle \begin{aligned} F1\text{-}score = \frac{{2 \times Recall \times Precision}}{{Recall + Precision}} \end{aligned} $$
(37.18)
$$\displaystyle \begin{aligned} TPR = \frac{{{T_{PD}}}}{{{T_{PD}} + {F_{ND}}}} \end{aligned} $$
(37.19)
$$\displaystyle \begin{aligned} FPR = \frac{{{F_{PD}}}}{{{F_{PD}} + {T_{ND}}}} \end{aligned} $$
(37.20)

Accuracy estimates the fraction of correctly classified domain names. Precision estimates the fraction of domain names predicted as positive that are actually positive. Recall, also called sensitivity or TPR, estimates the fraction of positive domain names that are classified as positive. FPR estimates the fraction of negative domain names that are wrongly classified as positive, and F1-score is the harmonic mean of precision and recall.
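The measures above can be computed directly from the four confusion-matrix counts. The following sketch uses made-up counts purely for illustration; the function name binary_metrics is hypothetical, and TPR is computed as recall.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the statistical measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)                 # also the true positive rate (TPR)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    fpr = fp / (fp + tn)                    # false positive rate
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "tpr": recall,
        "fpr": fpr,
    }


# Illustrative counts only, not results from this chapter.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```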

6 Proposed Architecture: DeepDGA-MINet

The proposed architecture, named DeepDGA-MINet, is shown in Fig. 37.5, and a detailed overview is shown in Fig. 37.6. It contains 3 main sections: (1) data collection, (2) Cost-Sensitive deep learning layers, and (3) classification.

Fig. 37.5 Proposed architecture: DeepDGA-MINet

Fig. 37.6 A detailed overview of DeepDGA-MINet

In data collection, the system passively collects the DNS logs inside an Ethernet LAN and stores them in a NoSQL database. The domain name information is then extracted from the DNS logs and passed into the Cost-Sensitive deep learning layers. These layers implicitly include a character level embedding layer, which maps the domain name characters into a numeric representation. The character level embedding layer is trained together with the Cost-Sensitive deep learning layers, so that the similarity among characters is learned during backpropagation. The Cost-Sensitive deep learning layers then extract significant features from the character level embedding vectors. Finally, the feature set is passed into the fully connected layer for classification. This layer uses the softmax activation function with the categorical cross-entropy loss function. The softmax and categorical cross-entropy loss functions are defined mathematically as follows:

$$\displaystyle \begin{aligned} Soft\max {(x)_i} = \frac{{{e^{{x_i}}}}}{{\sum\nolimits_{j = 1}^n {{e^{{x_j}}}} }} \end{aligned} $$
(37.21)
$$\displaystyle \begin{aligned} loss(p,e) = - \sum\nolimits_x {p(x)} \log (e(x)) \end{aligned} $$
(37.22)

where x denotes an input, and p and e denote the true probability distribution and the predicted probability distribution, respectively. The Adam optimizer is used to minimize the loss function. Finally, the classification results are displayed in the Front End Broker.
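The two functions in Eqs. (37.21) and (37.22) can be sketched in NumPy as follows. The argument names target and predicted, the max-subtraction for numerical stability, and the small constant eps are choices of this sketch, not part of the chapter's definitions; in the loss, the target distribution weights the logarithm of the predicted distribution.

```python
import numpy as np


def softmax(x):
    """Softmax of a score vector; subtracting the max does not change the result."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """-sum over classes of target * log(predicted)."""
    return float(-np.sum(target * np.log(predicted + eps)))


scores = np.array([1.0, 2.0, 3.0])      # raw outputs of the last layer (toy values)
probs = softmax(scores)
target = np.array([0.0, 0.0, 1.0])      # one-hot encoding of the actual class
print(probs.round(4))
print(categorical_cross_entropy(target, probs))
```

With a one-hot target, the loss is simply the negative log of the probability assigned to the actual class, so it approaches 0 as that probability approaches 1.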

7 Experiments, Results, and Observations

The configuration details of the deep learning architectures are reported in Table 37.2. In this research all the deep learning architectures are implemented in TensorFlow [1] with the Keras [4] higher level library, and the experiments are run on GPU enabled TensorFlow on an Nvidia GK110BGL Tesla K40. All deep learning architectures are trained using the AmritaDGA data set. To monitor the training accuracy over a large number of epochs, we used a validation data set consisting of 20% of the training data set, selected at random. The domain name samples are transformed into numeric vectors using Keras embedding. For this purpose a dictionary of 39 unique characters is built. The character list is given below:

  • abcdefghijklmnopqrstuvwxyz0123456789._ -

Table 37.2 Detailed configuration of deep learning architectures

Using the dictionary, the characters of a domain name are transformed into indexes. The maximum length of a domain name is 91, so any domain name shorter than 91 characters is padded with 0s. The index vector is passed into Keras embedding, which takes three parameters: a Dictionary-size of 39, an Embedding-length of 128, and an Input-length of 91. The deep learning layers follow the Keras embedding, and their configuration details are reported in Table 37.2. A fully connected layer follows the deep learning layers for classification. All the trained models of the various deep learning architectures are tested on the two AmritaDGA testing data sets, Testing 1 and Testing 2. For comparison, the results of the DMD-2018 shared task participated systems and of the AmritaDGA baseline system are reported in Tables 37.3, 37.4, 37.5, and 37.6. The baseline system and all the submitted entries of the DMD-2018 shared task use Cost-Insensitive deep learning architectures. The proposed Cost-Sensitive deep learning architectures performed better than the baseline system and all the submitted entries of the DMD-2018 shared task, as shown in Tables 37.7 and 37.8. The class-wise results of the proposed method are reported in Table 37.9 for the Testing 1 data set and in Table 37.10 for the Testing 2 data set. Cost-Sensitive models are also well suited to real-time DGA detection, because most real-time data sets are highly imbalanced. This work focused only on achieving the best performance relative to the baseline system and the other submitted entries of the DMD-2018 shared task; however, the proposed method is expected to perform well on other data sets and in real-time detection of DGA domain names. The results obtained by all the models are mostly close to one another, with the LSTM model outperforming the other deep learning architectures. The reported results can be further improved through parameter tuning, because the optimal parameters have a direct impact on the performance of deep learning models [13].
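The character-to-index transformation and zero padding described earlier in this section can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the names `CHARSET`, `CHAR_TO_INDEX`, and `encode_domain` are hypothetical, the space in the printed character list is treated as an extraction artifact so that the dictionary holds 39 characters, index 0 is reserved for padding, padding is applied on the right, and characters outside the dictionary are skipped. The source does not state these details, and Keras's own `pad_sequences` utility pads on the left by default.

```python
# Assumed 39-character dictionary: 26 letters, 10 digits, '.', '_' and '-'.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789._-"
MAX_LEN = 91    # maximum domain name length reported in the text
PAD_INDEX = 0   # assumption: 0 is reserved for padding, so characters start at 1

CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(CHARSET)}


def encode_domain(domain, max_len=MAX_LEN):
    """Turn a domain name into a fixed-length list of character indexes."""
    # Assumption: lowercase the input and drop characters missing from the dictionary.
    indexes = [CHAR_TO_INDEX[ch] for ch in domain.lower() if ch in CHAR_TO_INDEX]
    indexes = indexes[:max_len]  # truncate anything longer than max_len
    # Assumption: pad on the right with zeros up to max_len.
    return indexes + [PAD_INDEX] * (max_len - len(indexes))


vec = encode_domain("example.com")
print(len(vec), vec[:11])
```

Each resulting index vector has length 91 and could then be fed to an embedding layer that maps every index to a 128-dimensional vector, matching the Embedding-length and Input-length values given in the text.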

Table 37.3 Detailed Testing 1 data set results of DMD-2018 shared task participated systems [36]
Table 37.4 Detailed Testing 2 data set results of DMD-2018 shared task participated systems [36]
Table 37.5 Detailed Testing 1 data set results of AmritaDGA baseline system for multiclass classification [36]
Table 37.6 Detailed Testing 2 data set results of AmritaDGA baseline system for multiclass classification [36]
Table 37.7 Detailed Testing 1 data set results of the proposed method—deep learning architectures based on Cost-Sensitive data mining concept
Table 37.8 Detailed Testing 2 data set results of the proposed method—deep learning architectures based on Cost-Sensitive data mining concept
Table 37.9 Class-wise test results of the proposed method for Testing 1 data set of AmritaDGA
Table 37.10 Class-wise test results of the proposed method for Testing 2 data set of AmritaDGA

8 Conclusion, Future Works, and Discussions

This work proposes the DeepDGA-MINet tool, which collects a live stream of DNS queries and checks for DGA domain names on a per domain basis. It applies Cost-Sensitive deep learning based methods to handle the multiclass imbalance problem. Each class, or DGA family, is associated with cost items, and these are introduced directly into the backpropagation learning algorithm. The proportion of cost is a hyperparameter and is selected by hyperparameter tuning. The Cost-Sensitive deep learning architectures perform better than the Cost-Insensitive deep learning architectures. Moreover, the various Cost-Sensitive deep learning architectures perform almost equally well. Hence, a voting methodology can be employed to enhance the DGA domain detection rate; this remains a significant direction for future work. This work considered only 20 DGA families, and performance is reported for classifying a domain name into its corresponding DGA family. Investigating the performance of Cost-Sensitive deep learning architectures on a larger number of DGA families therefore remains another direction for future work. In addition, hyperparameter tuning was not carried out for the deep learning architectures in this work. Because hyperparameters have a direct impact on the performance of deep learning architectures, a proper investigation of hyperparameter tuning remains a further direction for future work.