
1 Introduction

The web, short for the World Wide Web, plays a central role in the development of the Information Age and has become the primary tool through which billions of people interact on the Internet. Currently, the majority of services on the Internet are provided by web applications offering a myriad of information, entertainment, education, commercial and governmental utilities. However, the state of web security is not encouraging. For cyber-criminals, the web has become a main venue for spreading malware and launching cyber-attacks in support of a wide range of cybercrimes, including information theft, fraud, espionage and blackmail. As early as 2008, Symantec [1] observed that attackers tended to adopt stealthier and more focused techniques targeting computers through the web instead of trying to penetrate networks with high-volume broadcast attacks, and that web-based vulnerabilities had outnumbered traditional computer security concerns, with the majority of effective malicious activities targeting the web. According to Trustwave [2], hackers are increasingly focusing on and succeeding with application layer attacks.

Among the numerous web security protection solutions, the web application firewall (WAF) is a type of application firewall that applies specifically to web applications. By inspecting HTTP traffic, it can prevent attacks stemming from web application security flaws, such as SQL injection [3], cross-site scripting (XSS) [4], and path traversal [5]. However, current WAFs typically work in a rule-based mode and rely heavily on signatures to detect and prevent attacks. Their rules must characterize and generalize well enough to cover normal and malicious behaviors, yet in practice updating rules against newly emerging attacks is a time-consuming and labor-intensive task. Notably, the renaissance of machine learning, especially the rise of deep learning, provides new ideas for solving this problem: a mathematical model can be built from sample data to make predictions or decisions without explicit instructions. Inspired by this, we explore how to use deep learning techniques to design a novel and effective WAF, which we call DeepWAF. In this paper, we systematically discuss the approach for using two currently popular deep learning models, namely, convolutional neural network (CNN) and long short-term memory (LSTM), to build web attack detection models.

The rest of the paper is organized as follows. The related work is introduced in Sect. 2. The details of DeepWAF are described in Sect. 3. Experimental results and discussions are presented in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Related Work

Considerable research on web attack detection and prevention has been conducted [6, 7]. Such research ranges from narrow solutions that prevent only specific attacks to generic methods that aim to provide comprehensive protection for web applications.

SQL injection is one of the most common web attacks; thus, a large number of protection methods have been proposed specifically to counter SQL injection attacks [8, 9, 10, 11, 12]. Kar et al. [13] presented an approach for detecting SQL injection attacks by modeling SQL queries as a graph of tokens and using the centrality measure of nodes to train a support vector machine (SVM).

XSS attacks are a type of injection attack in which malicious scripts are injected into the targeted website. Gupta et al. [14] provided a comprehensive analysis of the exploitation, detection and prevention mechanisms of XSS attacks. XSS attacks are generally divided into two categories: stored and reflected. Defenses against stored attacks usually rely on client protections that monitor the outgoing HTTP responses [15]. Reflected attacks are generally mitigated by sanitizing user input [16, 17].

HTTP parameter pollution is a special type of attack that supplies multiple HTTP parameters with the same name and may cause a web application to interpret values in unanticipated ways, thus allowing it to be exploited to bypass input validation, trigger application errors or modify internal variable values. Balduzzi et al. [18] presented an automated approach for the discovery of HTTP parameter pollution vulnerabilities in web applications to prevent attackers from compromising application logic to perform attacks.
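As a minimal illustration of why duplicated parameter names are ambiguous, the sketch below parses a query string with Python's standard `urllib.parse.parse_qs`. The query string and the "first occurrence" and "last occurrence" policies are hypothetical and are not taken from [18] or from any particular web framework.

```python
from urllib.parse import parse_qs

# Hypothetical query string in which the parameter "id" is supplied twice.
query = "id=1&id=2%20or%201=1"

# parse_qs keeps every value for a repeated name, in order of appearance.
params = parse_qs(query)

# Two components that apply different policies see different values.
first_wins = params["id"][0]   # e.g. an input validator that reads the first value
last_wins = params["id"][-1]   # e.g. a back end that reads the last value

print(params)
print(first_wins, "|", last_wins)
```

If a validator and the application disagree on which occurrence to use, the validated value is not the value that is eventually processed.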

Protection techniques against other types of web attacks have also been explored. For example, Su and Wassermann [19] proposed a method for preventing command injection based on context-free grammar and compiler parsing techniques. Tajbakhsh and Bagherzadeh [20] presented a framework for preventing local file inclusion attacks. Han [21] introduced a system to detect directory traversal attacks by analyzing web server logs. Saxe and Berlin [22] used a character-level CNN to detect malicious URLs, file paths and registry keys.

Unlike the above work, some research has concentrated on uniform solutions that detect or prevent many types of attacks. Kruegel et al. [23, 24] presented a multi-model approach to detect web-based attacks. They built several statistical detection models on different features, including attribute length, attribute character distribution, attribute order, and access frequency. Corona et al. [25] formulated query analysis with hidden Markov models (HMM) to detect attacks on web applications, and presented SuStorID [26, 27], a multiple classifier system that models legitimate inputs to web services. Zolotukhin et al. [28] analyzed HTTP logs to detect web attacks and employed support vector data description (SVDD), K-means and density-based spatial clustering of applications with noise (DBSCAN) to model normal user behaviors.

Choras and Kozik [29] proposed a model consisting of patterns obtained from HTTP requests using graph-based segmentation techniques and dynamic programming to detect cyberattacks on web applications. Bronte et al. [30] proposed an anomaly detection approach that utilizes three cross-entropy measures, for parameters, values and data types, which quantify the deviation between learned request profiles and a new web request. Zhang et al. [31] designed a CNN model to detect web attacks, and the experimental results showed that the model achieves a high detection rate and a low false alarm rate.

The above work has made great achievements, but only a few studies have tried to develop protection solutions by using machine learning techniques. Defending web applications is very difficult because attacks are both numerous and diverse. It is therefore worthwhile to use machine learning, especially deep learning techniques, to develop effective protection solutions that are easy to implement and capable of learning. In this paper, we systematically present how to apply two currently popular deep learning models, i.e., CNN and LSTM, and their combinations to the detection of web attacks.

3 DeepWAF

In this section, we describe the details of DeepWAF. First, the architecture of DeepWAF is introduced. Second, the HTTP request preprocessing algorithm is described. Finally, the four types of detection models, i.e., CNN, LSTM, CNN-LSTM and LSTM-CNN, are presented.

3.1 Architecture of DeepWAF

Figure 1 shows the architecture of DeepWAF with the main focus on the detection phase. Because DeepWAF is a machine learning-based detection system, it must be trained with real web requests before deployment in a real environment to provide protection for web applications. In practical use, DeepWAF can be deployed inline as a reverse proxy.

Fig. 1. Architecture of DeepWAF.

DeepWAF is composed of four modules: parser, preprocessor, detector and responder. The typical process for DeepWAF to detect a malicious web request is as follows. First, the request to the web server is parsed and analyzed by the parser into HTTP headers and body. Next, the preprocessor preprocesses the HTTP request and generates a URL sequence that can be fed to the detector. Then the detector detects whether the request is normal or malicious based on the built-in deep learning models. Finally, the responder performs suitable actions according to the detection results. For example, it can forward the request to the web server if the detection result is normal but may drop it if malicious. Since the parser and the responder are similar to those in ordinary WAFs, the following will focus on the implementation of the preprocessor and the detector.
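The four-module flow described above can be sketched as a chain of functions. This is only an illustrative sketch: all function names are hypothetical, the preprocessor is reduced to lowercasing and splitting, and the detector is a keyword stand-in where DeepWAF would use a trained deep learning model.

```python
import re
from typing import List


def toy_parser(raw: str) -> str:
    """Parser stand-in: return the URL field of the request line."""
    request_line = raw.split("\r\n", 1)[0]
    _method, url, _version = request_line.split(" ", 2)
    return url


def toy_preprocessor(url: str) -> List[str]:
    """Preprocessor stand-in: lowercase and split on a few special characters."""
    return [t for t in re.split(r"[/?&=+]", url.lower()) if t]


def toy_detector(tokens: List[str]) -> int:
    """Detector stand-in: 1 = malicious, 0 = normal.

    A keyword test is used here only so that the sketch runs; it is NOT
    the paper's detector, which is a trained CNN/LSTM model.
    """
    return 1 if "select" in tokens else 0


def toy_responder(label: int) -> str:
    """Responder stand-in: forward normal requests, drop malicious ones."""
    return "drop" if label == 1 else "forward"


def handle(raw: str) -> str:
    """Run parser -> preprocessor -> detector -> responder on one raw request."""
    return toy_responder(toy_detector(toy_preprocessor(toy_parser(raw))))


print(handle("GET /shop/list?id=3 HTTP/1.1\r\nHost: example.test\r\n\r\n"))
print(handle("GET /shop/list?id=1+union+select+password HTTP/1.1\r\nHost: example.test\r\n\r\n"))
```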

3.2 Preprocessing the HTTP Request

Web attacks exclusively leverage the HTTP protocol to perform malicious activities. If a web server is attacked, that means it receives one or more malicious HTTP requests. Based on this, DeepWAF is designed by inspecting HTTP requests to detect the server-side web attacks. Like other WAFs, DeepWAF can also support the HTTPS protocol by copying the private key used by the server.

The following snippet shows a GET HTTP request from the dataset HTTP DATASET CSIC 2010 [32]. An HTTP request consists of a request line, several request headers and an optional message body (for the POST request). The request line is composed of three components: the HTTP-method, the HTTP-URL and the HTTP-version. Because the vast majority of web attacks are implemented by manipulating the HTTP-URL, and the dataset used in our experiments contains attacks only in the HTTP-URL, we take the HTTP-URL as the detection object. However, without loss of generality, our detection method can be applied to other fields of the HTTP request. A special case is the POST request, whose message body can be exploited by injection attacks. For the POST request, the detection object is therefore defined as the combination of the HTTP-URL and the HTTP-body. For convenience, the detection object is simply called URL in later sections.

figure a
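The request structure described above can be sketched as follows. The helper names are hypothetical, the sample requests are made up rather than taken from the dataset, and joining the URL and the body with "?" is an assumption, since the text only states that the two are combined for POST requests.

```python
from typing import Dict, Tuple


def split_request(raw: str) -> Tuple[str, str, str, Dict[str, str], str]:
    """Split a raw HTTP request into method, URL, version, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, url, version, headers, body


def detection_object(raw: str) -> str:
    """Return the string that is handed to the preprocessor."""
    method, url, _version, _headers, body = split_request(raw)
    if method.upper() == "POST" and body:
        # Assumption: URL and body are joined with "?" so that the body
        # is later split in the same way as a query string.
        return url + "?" + body
    return url


raw_get = "GET /shop/view.jsp?id=3 HTTP/1.1\r\nHost: example.test\r\n\r\n"
raw_post = (
    "POST /cart/add HTTP/1.1\r\nHost: example.test\r\n"
    "Content-Length: 10\r\n\r\nid=3&qty=1"
)
print(detection_object(raw_get))
print(detection_object(raw_post))
```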

The procedure of the HTTP request preprocessing, which converts the HTTP request into a URL sequence that can be fed to the detector, is shown in Algorithm 1. The main steps are Decode, Lowercase and Split. Since the HTTP URL allows users to encode special characters, attackers often leverage these encodings to hide attack payloads. To effectively detect web attacks, the URL should therefore be decoded first. Because the URL is treated as case-insensitive, we lowercase all the characters in the URL, which reduces the size of the training vocabulary. Finally, the URL is split into a sequence at special characters such as “/”, “?”, “&”, “=” and “+”. In practice, the preprocessing may be continuously optimized according to the detection results.

For the above HTTP request, one result of the preprocessing is as follows.

figure b
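The Decode, Lowercase and Split steps can be sketched in a few lines of Python. This is a minimal sketch rather than a reproduction of Algorithm 1: the delimiter set covers only the characters named in the text plus whitespace (the full set behind "etc." is unknown), a single pass of percent-decoding is assumed, and the sample URLs are hypothetical.

```python
import re
from typing import List
from urllib.parse import unquote

# Assumed delimiter set: "/", "?", "&", "=", "+" and whitespace.
DELIMITERS = re.compile(r"[/?&=+\s]")


def preprocess(url: str) -> List[str]:
    """Turn a URL (the detection object) into a token sequence."""
    decoded = unquote(url)        # Decode: undo percent-encoding once
    lowered = decoded.lower()     # Lowercase: shrink the vocabulary
    # Split: cut at the delimiters and discard empty fragments
    return [token for token in DELIMITERS.split(lowered) if token]


print(preprocess("/Shop/Item.jsp?id=2&name=Red%20Wine"))
print(preprocess("/a%2Fb?x=1+2"))
```

Note that the delimiters themselves are discarded, so an encoded "/" and a literal "/" leave no trace in the resulting sequence.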

3.3 CNN- and LSTM-Based Detection Models

CNN Model.

CNN was initially designed for image recognition but has become a versatile model used for a wide array of tasks. CNN can recognize local or high-order structural features of the input. For example, in our detection model, CNN might learn that a request containing the words “table”, “select”, “from”, etc. is malicious. The architecture of the CNN-based detection model is shown in Fig. 2. The one-hot encodings X of the URL sequence are input to the embedding layer. The embedding vectors E are convolved in the convolutional layer with filters of different sizes: if the size of E is l × k, the filter sizes are set to s × k (s = 3, 4, 5, …), where k equals the embedding dimension and s takes different values. The max-pooling (over time) layer takes the largest element from each feature map output by the convolutional layer and concatenates these elements to pass to the Softmax layer. The Softmax layer outputs a label “0” or “1”, which indicates whether the request is normal (label “0”) or malicious (label “1”).

Fig. 2. CNN-based detection model.
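The convolution and max-over-time pooling described above can be illustrated with a small pure-Python sketch. The matrices below are toy values rather than learned weights, and the ReLU activation is an assumption, since the text does not name the activation function.

```python
from typing import List

Matrix = List[List[float]]


def conv_max_over_time(E: Matrix, W: Matrix, bias: float = 0.0) -> float:
    """Slide one s x k filter W over the l x k embedding matrix E and
    return the largest activation over all positions."""
    l, s = len(E), len(W)
    if l < s:
        raise ValueError("sequence is shorter than the filter; pad it first")
    feature_map = []
    for start in range(l - s + 1):
        total = bias
        for i in range(s):
            for j in range(len(W[i])):
                total += E[start + i][j] * W[i][j]
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return max(feature_map)                   # max-pooling over time


# Toy embeddings: l = 4 tokens, k = 2 dimensions.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two toy filters with heights s = 2 and s = 3.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
# One pooled value per filter, concatenated into the feature vector
# that would be passed on to the Softmax layer.
pooled = [conv_max_over_time(E, W) for W in filters]
print(pooled)
```

Because each filter contributes exactly one pooled value, the length of the feature vector depends only on the number of filters, not on the length of the URL sequence.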

LSTM Model.

LSTM is a variant of the recurrent neural network (RNN), which has been proven to perform extremely well on sequential data. In our detection model, LSTM might be able to remember that the word “from” appearing in a malicious URL sequence usually follows the word “select”. The architecture of the LSTM-based detection model is shown in Fig. 3. The length of the time steps is the same as the length of the URL sequence. The embedding vectors of the one-hot encodings are sequentially distributed to different LSTM units. Then, the outputs of all the LSTM units are gathered together to be input to the Softmax layer.

Fig. 3. LSTM-based detection model.
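To make the sequential processing concrete, the sketch below implements the standard LSTM cell update in pure Python and unrolls it over a sequence, one time step per token. The parameter layout (a dictionary of gate name to input weights, recurrent weights and bias) and the toy weights are assumptions for illustration; they are not the trained DeepWAF parameters.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W.x + U.h + b row by row."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, GateParams]) -> Tuple[Vector, Vector]:
    """One LSTM time step with input (i), forget (f), output (o) gates
    and candidate values (g)."""
    pre = {name: affine(W, x, U, h_prev, b) for name, (W, U, b) in params.items()}
    i = [sigmoid(v) for v in pre["i"]]
    f = [sigmoid(v) for v in pre["f"]]
    o = [sigmoid(v) for v in pre["o"]]
    g = [math.tanh(v) for v in pre["g"]]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


def run_lstm(sequence: List[Vector], params: Dict[str, GateParams],
             hidden_size: int) -> List[Vector]:
    """Feed the embedding vectors in order and collect every step's output."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


# Toy setup: 1-dimensional embeddings, 1 hidden unit, mostly zero weights.
params = {name: ([[0.0]], [[0.0]], [0.0]) for name in "ifog"}
params["g"] = ([[1.0]], [[0.0]], [0.0])
outputs = run_lstm([[1.0], [0.0], [1.0]], params, hidden_size=1)
print(outputs)
```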

CNN-LSTM Model.

The CNN-LSTM model is a combination of CNN and LSTM. As Fig. 4 shows, the convolutional layer receives the embedding vectors as input. Its output is pooled and then fed to the LSTM layer. The output of the LSTM layer is input to the Softmax layer. The intuition behind the CNN-LSTM model is that the CNN will extract structure features, from which the LSTM will learn the sequential features to classify the input.

Fig. 4. CNN-LSTM-based detection model.

LSTM-CNN Model.

The LSTM-CNN model is a combination of LSTM and CNN. As Fig. 5 shows, the LSTM layer receives the embedding vectors as input. Its output is directly input to the convolutional layer. The output of the convolutional layer is pooled and then input to the Softmax layer. The intuition behind the LSTM-CNN model is that the LSTM generates new sequential encodings of the input, from which the CNN extracts structural features to classify the input.

Fig. 5. LSTM-CNN-based detection model.

4 Experiments

To evaluate the performance of the models in detecting web attacks, we experimented on the HTTP DATASET CSIC 2010 dataset [32].

4.1 Data Preparation

The HTTP DATASET CSIC 2010 dataset contains tens of thousands of web requests automatically generated at the Information Security Institute of CSIC (Spanish National Research Council), and has been widely used for testing web attack detection systems. The dataset contains 36,000 normal requests and 24,668 malicious requests. The malicious requests include web attacks such as SQL injection, XSS, buffer overflow, information gathering, and file disclosure.

As shown in Table 1, we randomly select approximately 70% of the dataset as training data, approximately 5% as the validation data, and the remaining approximately 25% as the testing data. We train the detection models using the “training data”, tune the parameters using the “validation data” and then test the performance of the detection models on the unseen “testing data”.

Table 1. Experimental data distribution.
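A random 70% / 5% / 25% split of this kind can be sketched as follows. The function name, the fixed seed and the use of integer percentages are assumptions for illustration; the exact per-class counts used in the experiments are those of Table 1 and are not reproduced by this sketch.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(samples: Sequence, train_pct: int = 70, val_pct: int = 5,
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the samples and cut them into training, validation and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # fixed seed for repeatability
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # the remaining ~25%
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```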

4.2 Parameter Settings and Evaluating Criteria

Based on empirical experience, we set the necessary hyperparameters as Table 2 shows. The embedding dimension is set to 128. The CNN utilizes 4 types of filters with sizes of 3 × 128, 4 × 128, 5 × 128 and 6 × 128. The number of each type of filter is 128. For the LSTM model, the dimensionality of the output space, i.e., the number of hidden units, is set to 64. We train the models with the batch training approach. The learning rate is set to 1e-3, and the batch size is 128.

Table 2. Hyperparameter settings.
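For reference, the settings stated in the text can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the paragraph above. The derived pooled feature size assumes one max-over-time value per filter, as in the CNN model of Sect. 3.3.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hyperparameters:
    embedding_dim: int = 128
    filter_heights: Tuple[int, ...] = (3, 4, 5, 6)
    filters_per_height: int = 128
    lstm_hidden_units: int = 64
    learning_rate: float = 1e-3
    batch_size: int = 128

    def filter_shapes(self) -> List[Tuple[int, int]]:
        """Each filter spans s tokens and the full embedding dimension."""
        return [(s, self.embedding_dim) for s in self.filter_heights]

    def pooled_feature_size(self) -> int:
        """One max-over-time value per filter (assumed pooling scheme)."""
        return len(self.filter_heights) * self.filters_per_height


hp = Hyperparameters()
print(hp.filter_shapes())
print(hp.pooled_feature_size())
```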

To evaluate the detection models, we adopted criteria usually used in intrusion detection systems, i.e., detection rate and false alarm rate, as well as criteria used in machine learning, i.e., precision, recall, F1-measure and accuracy. We use TP (true positive) to represent the number of malicious requests that are correctly detected as malicious. FP (false positive) represents the number of normal requests that are incorrectly detected as malicious. TN (true negative) represents the number of normal requests that are correctly detected as normal. FN (false negative) represents the number of malicious requests that are incorrectly detected as normal. The evaluation criteria are defined as follows. Note that the recall has the same definition as the detection rate.

$$ \mathrm{Detection\ rate} = \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1} $$
$$ \mathrm{False\ alarm\ rate} = \frac{FP}{FP + TN} \tag{2} $$
$$ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{3} $$
$$ F_{1}\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4} $$
$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{5} $$
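Equations (1) to (5) translate directly into code. The sketch below is a straightforward transcription; the function name is hypothetical and the confusion-matrix counts in the usage example are made-up numbers, not results from the experiments.

```python
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the evaluation criteria of Eqs. (1)-(5) from the four counts."""
    recall = tp / (tp + fn)                       # Eq. (1), detection rate
    false_alarm_rate = fp / (fp + tn)             # Eq. (2)
    precision = tp / (tp + fp)                    # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (4)
    accuracy = (tp + tn) / (tp + fp + tn + fn)    # Eq. (5)
    return {
        "detection_rate": recall,
        "false_alarm_rate": false_alarm_rate,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```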

4.3 Experimental Results

The detection model must first be adequately trained on the training data to perform well on the testing data, i.e., effectively detect web attacks. In practice, the testing data (i.e., the requests to be detected) are unknown to us, so we can only improve the performance of the detection models with training and validation data.

In the experiment, we first observe the model performance on training and validation data, and then adjust the training strategies based on validation accuracy. Finally, we evaluate the detection models on the testing data.

Training Results

There are two commonly used methods to improve the generalization of the detection model during the training phase: selecting an adequate number of training epochs and applying dropout. We first trained each model for 10 epochs with dropout added after the max-pooling layer at a keeping probability of 0.5, and then adjusted the settings depending on the results. The training accuracy and loss were recorded at every step, and the validation accuracy and loss were recorded every 100 steps. The results are shown in Fig. 6, where blue curves denote the training metrics and orange curves denote the validation metrics. The CNN, LSTM and LSTM-CNN models exhibit good performance, with accuracy rapidly rising above 95% and loss decreasing towards 0 on both the training and validation data. The CNN-LSTM model is less satisfactory: it fits the training data well but has a large generalization error on the validation data. The results also show that 10 epochs of training are sufficient for these models to achieve stable performance.

Fig. 6. Training results of the four types of detection models.

Effects of Dropout

In this part, we test the effects of dropout. Dropout has a tunable hyperparameter p, the probability of retaining a neuron in the network, also called the keeping probability. A small p means that few neurons are active during training, and p = 1 means that dropout is not applied. We added dropout after the max-pooling layers and trained the models with different keeping probabilities. The results are shown in Table 3. Since the LSTM model does not contain a max-pooling layer, no dropout is applied to it, and its validation accuracy is always 96.11%. For the CNN and LSTM-CNN models, dropout contributes very little to model performance: the validation accuracy varies little with p. For the CNN-LSTM model, however, dropout has a significant negative impact on the validation accuracy, i.e., it increases the generalization error. Whenever dropout is applied, whatever value the keeping probability takes (i.e., p = 0.2, 0.5 or 0.8), the validation accuracy is significantly lower than that without dropout (i.e., p = 1). This also explains why the CNN-LSTM model does not perform as well as the other models in Fig. 6, where all the models were trained with dropout at a keeping probability of 0.5.

Table 3. Effects of dropout.
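The role of the keeping probability p can be illustrated with a short sketch of dropout at training time. Rescaling the surviving activations by 1/p ("inverted dropout") is a common convention and is assumed here; the text does not state which variant was used, and the function name is hypothetical.

```python
import random
from typing import List


def dropout(values: List[float], keep_prob: float, rng: random.Random) -> List[float]:
    """Keep each activation with probability keep_prob, zero it otherwise."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if keep_prob == 1.0:
        return list(values)              # p = 1: dropout is not applied
    return [
        v / keep_prob if rng.random() < keep_prob else 0.0   # assumed 1/p rescaling
        for v in values
    ]


rng = random.Random(0)
pooled = [1.0, 2.0, 3.0, 4.0]
for p in (0.2, 0.5, 0.8, 1.0):
    print(p, dropout(pooled, p, rng))
```

With a small p most positions are zeroed, which is harmless for a vector of independent pooled features but removes elements of an ordered sequence when the output is consumed by an LSTM.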

We retrained the CNN-LSTM model without dropout, and the training results are shown in Fig. 7. The CNN-LSTM model now performs well on both the training and validation data.

Fig. 7. Training results of CNN-LSTM without dropout.

We believe that this placement of dropout is improper for the CNN-LSTM model. The dropout is added after the max-pooling layer and before the LSTM layer, where it randomly drops some neurons at training time. This is damaging for the LSTM, which learns from sequential information, part of which is removed by the dropout. We conclude that if the CNN and LSTM are sequentially combined to form a CNN-LSTM model, it is not appropriate to apply dropout before the LSTM, because doing so undermines the LSTM’s learning process.

Given the above results, the four types of detection models (i.e., CNN, LSTM, CNN-LSTM and LSTM-CNN) are all trained for 10 epochs without dropout.

Detection Results

After completing the training, we ran the trained models on the testing data to evaluate their performance in detecting web attacks. The detection results are shown in Table 4. In terms of intrusion detection evaluation criteria, each detection model achieves both a high detection rate (average approximately 94%) and a low false alarm rate (average approximately 2%). In terms of machine learning evaluation criteria, every model achieves satisfactory performance with high precision (average 96.92%), recall (average 94.27%), F1-measure (average 95.57%) and accuracy (average 96.44%). Because the difference between models on each criterion is very small (approximately 1–2 percentage points), it is hard to determine which model is the best.

Table 4. Detection results.

All the models achieved satisfactory detection results, which were obtained using only the basic CNN and LSTM models with little hyperparameter tuning. The detection results can be expected to improve with better-tuned hyperparameter values. The results suggest that machine learning has great potential in the field of web attack detection.

Discussions and Case Studies

In this subsection, we provide an intuitive grasp of the number of false negatives (FN) and the number of false positives (FP), and carry out case studies to explain why some requests are incorrectly detected.

As stated above, the testing dataset contains 9,000 normal requests and 6,167 malicious requests. The FN and FP of different detection models are shown in Table 5, where “COM” represents the number of requests that are incorrectly detected by all four types of models. Specifically, the same 233 malicious requests are incorrectly reported as normal by all four models, and the same 21 normal requests are incorrectly reported as malicious, which demonstrates that these detection models are more likely to produce the same false negatives but different false positives. Theoretically, if we construct an ensemble model with these four types of models, the detection rate can be increased to 96.22% (i.e., 1 - 233/6,167), and the false alarm rate can be decreased to 0.23% (i.e., 21/9,000), but that will be time consuming.

Table 5. FN and FP of different models.
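The two ensemble figures quoted above follow from the counts of commonly misclassified requests, as the short calculation below shows. The variable names are hypothetical. Note that the two figures correspond to different voting rules: the detection rate bound is reached by flagging a request if any model flags it, and the false alarm bound by flagging a request only if all models flag it, so they are best-case bounds rather than a single joint operating point.

```python
test_malicious = 6167   # malicious requests in the testing data
test_normal = 9000      # normal requests in the testing data
common_fn = 233         # malicious requests missed by all four models
common_fp = 21          # normal requests flagged by all four models

# "Flag if any model flags": only the commonly missed attacks remain undetected.
best_detection_rate = 1 - common_fn / test_malicious

# "Flag only if all models flag": only the commonly flagged normal requests remain.
best_false_alarm_rate = common_fp / test_normal

print(f"detection rate bound:   {best_detection_rate:.2%}")
print(f"false alarm rate bound: {best_false_alarm_rate:.2%}")
```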

We choose a false negative and a false positive for case studies. The following snippets show two requests. The upper one is a malicious testing request that is incorrectly detected as normal, and the lower one is a normal request in the training data. The two requests are very similar except that the upper request contains a “%2F”, which is the encoding of “/”. In our preprocessing algorithm, “/” is regarded as a special character used to split the URL and does not appear in the URL sequence. In other words, the two requests yield essentially the same URL sequence after preprocessing, which explains why the upper request is incorrectly detected as normal.

figure c
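The effect described in this case study can be reproduced with a small sketch. The two URLs are hypothetical stand-ins, not the requests from the dataset, and the delimiter set covers only the characters named in Sect. 3.2. The second tokenizer, which keeps delimiters as tokens, is one possible refinement shown only for contrast; it was not evaluated in this paper.

```python
import re
from typing import List
from urllib.parse import unquote


def tokens_dropping_delimiters(url: str) -> List[str]:
    """Decode, lowercase and split; delimiters are discarded."""
    return [t for t in re.split(r"[/?&=+]", unquote(url).lower()) if t]


def tokens_keeping_delimiters(url: str) -> List[str]:
    """Hypothetical variant: the capturing group makes re.split return the
    delimiters as tokens as well."""
    return [t for t in re.split(r"([/?&=+])", unquote(url).lower()) if t]


normal = "/shop/view.jsp?name=abc"
suspicious = "/shop/view.jsp?name=abc%2F"   # extra encoded "/"

print(tokens_dropping_delimiters(normal))
print(tokens_dropping_delimiters(suspicious))
print(tokens_keeping_delimiters(suspicious))
```

With delimiters discarded, both URLs map to the same token sequence, so no model trained on such sequences can tell them apart.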

The following snippet shows a normal testing request that is detected as malicious. Through analysis, we find that the following request contains some strings such as “pasar”, “por” and “caja”, which never appear in the training vocabulary. Such types of requests are very likely to be detected as malicious by the detection models.

figure d
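The out-of-vocabulary effect can be illustrated as follows. How DeepWAF encodes unseen tokens is not stated in the text, so the `<unk>` index used here is an assumption, and the training sequence is hypothetical; only the tokens "pasar", "por" and "caja" come from the case study.

```python
from typing import Dict, List


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an index to every token seen in the training sequences."""
    vocab = {"<pad>": 0, "<unk>": 1}        # assumed reserved entries
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(seq: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to indices; unseen tokens fall back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in seq]


def oov_ratio(seq: List[str], vocab: Dict[str, int]) -> float:
    """Fraction of tokens that never appeared in the training data."""
    return sum(tok not in vocab for tok in seq) / len(seq)


training_sequences = [["shop", "public", "pay.jsp"]]       # hypothetical
vocab = build_vocab(training_sequences)
test_sequence = ["shop", "public", "pasar", "por", "caja"]

print(encode(test_sequence, vocab))
print(oov_ratio(test_sequence, vocab))
```

A request whose tokens are mostly unseen carries little information that the model has learned to associate with normal traffic, which is consistent with the false positive observed here.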

The above case studies can be used for further improvements, which we leave as future work.

5 Conclusion

We present a novel web application firewall called DeepWAF by using deep learning techniques to detect web attacks. We first described the architecture of DeepWAF. Then we provided detailed explanations of the HTTP request preprocessing and the principles of the proposed four types of detection models based on CNN, LSTM, CNN-LSTM and LSTM-CNN. Finally, we evaluated the detection models on the dataset of HTTP DATASET CSIC 2010 and verified their good performance in detecting web attacks.

We tried only the basic CNN and LSTM models with little hyperparameter tuning. Future work can concentrate on adopting more sophisticated deep learning models, tuning model hyperparameters and inspecting all the fields of the HTTP request, which may lead to more powerful web attack detection models.