1 Introduction

The number of computer security incidents has been steadily growing over the past few years, and the early detection of network intrusions has become a hot topic. A wide corpus of research in this field is available. Many research proposals strongly rely on public intrusion datasets, such as UNSW-NB15 (Moustafa and Slay, 2015), NDSec-1 2016 (Beer et al., 2017), and CICIDS2017 (Sharafaldin et al., 2018), which are leveraged for designing, evaluating, and comparing novel intrusion detection systems (IDS). In turn, the ever-increasing research effort on intrusion detection techniques has stimulated the production of a large number of public intrusion datasets (Ring et al., 2019). Public datasets typically provide ready-to-use network packets and labeled records — also known as network flow records — collected under normative operations and attack conditions. Network flow records consist of categorical and numeric features that provide context data and summary statistics computed from the packets exchanged between a source computer and a destination across a network. Commonly used features include, but are not limited to, source-destination IP address and port, duration, number and length of packets, flag counts, and the min, max, mean, and standard deviation of the packet inter-arrival time. The availability of public datasets and flow records makes it straightforward to develop machine and deep learning models for intrusion detection. Not surprisingly, the intersection of intrusion detection and machine learning is an extremely hyped research topic (Liu and Lang, 2019). A plethora of attack detectors have recently spread in the literature; notably, some of these detectors achieve astonishing results. For example, the solutions proposed in Kshirsagar and Kumar (2021) and Ali and Cotae (2018) achieve an accuracy of 0.999 and 0.996, respectively. At first glance, intrusion detection may seem a perfectly solved problem with no room for further improvement.

Successful attacks reported every day in the news suggest that the prompt detection of intrusions is still an open issue. Unfortunately, most of the existing — and impressive — intrusion detection results hold only in the context of the individual datasets that were used to obtain them. We believe that results obtained on top of synthetic and “lab-made” attacks, such as those provided by many public datasets, may not apply to real-life production networks. Synthetic intrusion datasets simply cannot capture the complexity and uncertainty of production networks, which stem from the ever-evolving sophistication of the attacks, the presence of heterogeneous and non-stationary workloads, bad or incomplete configuration of real-life servers, and the lack of proper defense mechanisms. As a consequence, any attempt to learn intrusion detectors on top of a public dataset may lead to partial — if not incorrect — patterns, which cannot be used to make general and rigorous security claims on the effectiveness of a given IDS technique. In our opinion, the implications of using public datasets for advancing the state of the practice in intrusion detection and cybersecurity remain quite opaque.

This paper explores the proposition above with a focus on the detection of denial of service (DoS) attacks. As typically done in the current literature in the area, detection is pursued on top of network flow records. Our analysis is based on two independent datasets: CICIDS2017 and USB-IDS-1. The former is a public dataset and a well-established intrusion detection benchmark that is gaining ever-increasing attention from the security community; the latter is a recent dataset collected within a private network infrastructure at our premises. Both datasets consist of network flow records collected during normative operations and DoS attacks. CICIDS2017 and USB-IDS-1 are closely related, in that they both (i) use CICFlowMeter for extracting flow records from the raw packets, (ii) are based on the same mixture of DoS attacks, and (iii) provide attacks done against an installation of the Apache web server. More importantly, USB-IDS-1 is collected in a much simpler network and provides attacks that are clearly observable and effective. Another interesting feature of USB-IDS-1 is that attacks are executed both in the case of “no defense” from DoS and with two state-of-the-practice defense modules; it provides the same attacks as CICIDS2017 plus additional variants obtained under defense.

Our analysis is based on a twofold experiment. First, we learn an intrusion detector on top of CICIDS2017, which encompasses flow records from benign traffic and various types of DoS attacks. We use three machine learning models to learn the detectors, i.e., decision tree, random forest, and deep neural network, in order to make sure that our claims are not biased by a given model. Second, we test the detectors against “held out,” i.e., not used for learning, network flow records of both (i) CICIDS2017 and (ii) the different attack-defense combinations available in USB-IDS-1. The analysis done in this paper falls within the larger scope of transfer learning (Pan and Yang, 2010), in that we learn a predictive function for a “target” domain (USB-IDS-1) based on a “source” domain (CICIDS2017), given the same learning task (intrusion detection) in both domains. Moreover, since the target and source domains are different — although related, as said above — while the learning task is the same, our specific setting is called transductive transfer learning (Pan and Yang, 2010).

The effectiveness of the detectors is assessed with the consolidated metrics of precision, recall, and F1 score. Not surprisingly, all the IDS models assessed easily achieve precision between 0.96 and 1, and recall between 0.97 and 0.99, when trained and tested on top of an individual dataset, such as CICIDS2017. These figures are extremely high and consistent with the existing literature on IDSs applied to public datasets. However, the attempt to transfer such highly performing models to an unseen dataset reveals different findings. As for USB-IDS-1 — attacks emulated with no server-side defense, i.e., the collection setting of CICIDS2017 — the detectors successfully transferred to just one attack, i.e., hulk, which is detected with 0.98 recall by the random forest. Surprisingly, the detectors performed quite poorly in the case of slow attacks: at best, they are detected with recall equal to 0.8 (slowloris) and 0.7 (slowhttptest) by the random forest and deep neural network, respectively, although the detectors were trained to detect these attacks. Detection gets even worse for mitigated variants of the attacks, elicited by hardening the configuration of the victim server through the defense modules: in this case, USB-IDS-1 hulk goes almost totally undetected (0.06 recall by the decision tree at best), while the best value of recall for slowloris and slowhttptest in the case of defense is 0.28 and 0.32, respectively (deep neural network).

The results provide two surprising findings: IDS models trained to detect a given attack might be ineffective at revealing “weaker” variants of the same attack; moreover, a minor change with respect to the data collection environment of the training dataset — such as the enablement of the defense modules — invalidates the patterns learned by an IDS model. In a previous paper (Catillo et al., 2021b), we documented a preliminary critique experiment with one type of attack and one machine learning technique. Here we present a much wider study, which features a substantial data collection effort, four attack types, two server-side defense modules, and three machine learning techniques. Overall, these elements contribute to more comprehensive experiments and findings along different directions on the subject. The experiments indicate that results obtained within the “ideal” world of a synthetic dataset may not transfer in practice.

The rest of this paper is organized as follows. Section 2 presents related work in the area. Section 3 describes the experimental testbed underlying USB-IDS-1, attacks, defenses, and how experiments have been conducted. Section 4 provides the analysis framework and the machine learning techniques assessed in this study. Section 5 presents results, lessons learned from our experiment, and a comparison of the results across the techniques assessed. Section 6 describes the threats to validity of our study and how they have been mitigated, while Sect. 7 concludes the paper and provides future perspectives of our work.

2 Related work

2.1 Public intrusion detection datasets

The evaluation of intrusion detection techniques relies on the availability of intrusion datasets. In general, most network intrusion detection research is still based on simulated datasets, due to the scarce availability of real network data. As a matter of fact, datasets composed of network packets or flow records from real-life environments are not easily accessible because of privacy issues. Since benchmark datasets might be — at least in theory — a good basis to evaluate and compare the quality of different network intrusion detection systems, many public intrusion detection datasets have been released within the security community over the years. These typically contain time-ordered events (e.g., network packet traces, flow summaries, and log entries) generated in synthetic environments under normative conditions and multiple intrusion scenarios. Each data point is labeled, i.e., assigned the corresponding “normal” or “attack” class. The number of detected attacks and false alarms on a given dataset may be used for evaluating the detector under test.

Network traffic provided by an intrusion detection dataset is typically captured either in packet-based or flow-based format. Packet files are an ordered collection of network packets originating from one or several benign or malicious sources. Flow-based data, instead, are more concise and usually contain metadata of network connections. Most often, the network flow records contained in common datasets are organized in comma-separated values files, specially crafted for applying modern machine learning techniques. The quality and usability of datasets reflect their suitability to assist researchers in developing advanced detection techniques. At present, numerous solutions available in the literature capitalize on well-known intrusion detection datasets, achieving high levels of accuracy and recall, often close to 1.

The earliest effort to create a public intrusion detection dataset dates back to the DARPA (Defense Advanced Research Projects Agency) evaluation program, which led in 1999 to a comprehensive and realistic benchmarking dataset named KDD-CUP’99. This is a highly cited dataset and a common subject of study in intrusion detection. It contains summarized labeled flow data with a broad selection of intrusions and attack-free instances simulated within a military network. The dataset encompasses 7 weeks of network traffic and consists of about 5 million lines. Even if KDD-CUP’99 is now two decades old, it is still extensively used in the field of intrusion detection and machine learning, as in Tavallaee et al. (2010). However, it has been heavily criticized by researchers as being unrepresentative of real-life network conditions (McHugh, 2000; Kayacık and Zincir-Heywood, 2005). This is also true for the more recent NSL-KDD (Tavallaee et al., 2009), a cleaned-up variant of the KDD-CUP’99 dataset, reduced in size and with duplicates removed.

A more recent public intrusion detection dataset is UNSW-NB15 (Moustafa and Slay, 2015). This dataset, released by the Australian Centre for Cyber Security (ACCS), was created in 2015 by means of the IXIA Perfect Storm tool, used as a normal and abnormal traffic generator. It includes nine categories of modern attacks in addition to normal traffic. In particular, it leverages the CVE vulnerability database to retrieve information about the latest types of attacks used and discovered in the security community. The dataset is labeled and accessible both in comma-separated values and in packet capture (pcap) file format.

It is worth pointing out that it remains challenging to create and maintain a representative dataset that meets real-world criteria. An attempt to overcome this limitation is the UGR’16 dataset (Maciá-Fernández et al., 2017), proposed by the University of Granada. It consists of 4 months of anonymized network traffic flows (NetFlow) gathered from a Spanish Internet Service Provider (ISP) facility. In particular, it is a labeled dataset and contains synthetically generated attack flows, which are added to the normal ISP network traffic.

Other known public intrusion datasets are NDSec-1 (Beer et al., 2017), MILCOM2016 (Bowen et al., 2016), and TRAbID (Viegas et al., 2017). They are all accessible both as labeled network flows and as pcap files, and they contain different types of attacks.

A public intrusion detection dataset that has gained strong popularity among security researchers is certainly CICIDS2017 (Sharafaldin et al., 2018). It was released by the Canadian Institute for Cybersecurity (CIC) in 2017 and simulates real-world network data. Details about the CICIDS2017 dataset are discussed in Sect. 4.1.

Finally, another recent dataset is USB-IDS-1 (Catillo et al., 2021a). It provides ready-to-use normal and abnormal labeled network flow records and considers both network traffic and application-level facets, such as the defense modules of the victim server under attack. The USB-IDS-1 testbed, collection procedure, dataset, and reference webpage are presented in Sect. 3.

The interested reader is referred to Ring et al. (2019) for a survey of existing literature on intrusion detection datasets. Table 1 summarizes the aforementioned public intrusion detection datasets and their key features in chronological order.

Table 1 Comparative summary of public intrusion detection datasets

2.2 Denial of service detection

Ready-to-use public datasets are fostering research contributions by a very large base of academics and practitioners. The ever-growing sophistication of the attacks has attracted significant interest from the research community, which has focused on specialized detection mechanisms. Frequently, intrusion detectors are implemented by means of specially crafted artificial neural networks or well-known classifiers, which are able to detect almost all the attacks contained in the dataset used for the training phase.

(Deep) neural network techniques

Modern machine learning and deep learning techniques have been shown to be effective for intrusion detection and, as such, a wide literature on the detection of DoS attacks has been produced recently. The solution in Wankhede and Kshirsagar (2018) is specifically focused on DoS detection: a neural-network-based approach relying on a simple multi-layer perceptron is compared to the random forest technique. In Ali and Cotae (2018), instead, the authors propose a method that leverages the Bayesian regularization (BR) backpropagation and scaled conjugate gradient (SCG) descent backpropagation algorithms. The results are promising for the detection of DoS attacks: in particular, the model achieves an accuracy of 99.6% with Bayesian regularization and of 97.7% with scaled conjugate gradient descent. In order to address the challenges of DoS detection, Nguyen et al. (2018) propose an intrusion detection system that leverages a convolutional neural network model. The authors evaluate the performance of the proposed method using the UNSW-NB15 and NSL-KDD datasets; the results are valuable when compared to state-of-the-art DoS detection methods. A machine-learning-based DoS detection system is presented in Filho et al. (2019): the authors use an inference-based approach and achieve a detection rate of 96%. The paper by Lee et al. (2019) is also focused on DoS detection: well-known machine learning approaches (e.g., naïve Bayes and logistic regression) are used to distinguish normative conditions from malicious ones. Qu et al. (2019) propose the statistic-enhanced directed batch growth self-organizing mapping (SE-DBGSOM), a recent model based on self-organizing maps (SOM) for DoS attack detection; the proposal is evaluated on the CICIDS2017 dataset. More recently, the detection of different classes of anomalies — including DoS activity — has been addressed by means of system log analysis and a semi-supervised deep autoencoder: the proposed approach, called AutoLog, achieves up to 99% recall and 98% precision across different system logs (Catillo et al., 2022).

Other techniques

Traditional intrusion detection approaches are based on well-known classification techniques. For example, a comparative analysis of different classifiers is reported in Ahmim et al. (2019). All the algorithms are evaluated by means of the CICIDS2017 dataset; the output of two of the classifiers is fed into the third together with the original input data. The authors achieve an accuracy of 96.67% by using the reduced error pruning (REP) tree, the JRip algorithm, and forest penalizing attributes (PA) classifiers. Kshirsagar and Kumar (2021) report a feature reduction approach based on the combination of filter-based algorithms, namely information gain ratio (IGR), correlation (CR), and relief (ReF). The proposed approach reduces the number of features and leverages a rule-based classifier called projective adaptive resonance theory (PART) in order to detect DoS attacks; mixed methods of this kind thus target specific attacks, such as DoS, through feature reduction. The authors obtain 99.95% accuracy with the CICIDS2017 dataset. The paper by Sacramento et al. (2018), instead, proposes FlowHacker, which aims to detect malicious traffic on top of network flows by capitalizing on unsupervised machine learning and threat intelligence: the approach is validated with both the ISCX public dataset and real data from an Internet Service Provider.

2.3 Our contribution

The collection of real-life datasets has historically required expensive networked assets, specialized traffic generators, and considerable design preparation. As mentioned above, many intrusion detection datasets have been proposed over the years and many detectors have been tuned and tested by means of these data. However, work that looks more critically at these datasets has emerged recently. In particular, some studies, such as Silva et al. (2020), consider the quality of the data by analyzing statistical flaws that might introduce bias in the model training phase. The paper by Catillo et al. (2021c) analyzes the representativeness of the attacks provided by public intrusion detection datasets and demonstrates a partial ineffectiveness of the attacks in the presence of defense mechanisms and suitable server configurations. Kenyon et al. (2020) report a detailed analysis of the main issues affecting public intrusion detection datasets. In particular, the authors state that public datasets do not reflect real-life conditions, and therefore analyses performed against them may be of questionable value.

It is worth pointing out that recent studies also take into consideration the performance of a detector in terms of generalization to data different from those belonging to the original training dataset, with the aim of estimating its adaptability. The importance of a thorough understanding of the underlying data and their features, as well as of the produced results, is stressed in Sommer and Paxson (2010). In particular, the authors examine the surprising imbalance between the considerable amount of research on machine learning-based anomaly detection pursued by the academic intrusion detection field and the lack of operational deployments of such systems. This problem is also addressed in Ahmad et al. (2021), where the authors highlight the low performance of machine learning intrusion detection systems in real-life environments.

The paper closest to our work is Verkerken et al. (2021). The authors evaluate four unsupervised algorithms on two recent datasets, CICIDS2017 and CSE-CIC-IDS2018. These results are then used as a baseline for estimating the generalization strength of the models, by testing them on data different from those belonging to the original dataset on which each model was trained. The obtained results show that all models can achieve high classification scores on an individual dataset but fail to directly transfer those high scores to a second “never-seen-before” — but related — dataset.

Our main original contribution here is an analysis that paves the way for considerations on the gap between the extensive academic research in intrusion detection systems and actual deployments of such systems. While there is a substantial and valuable body of research on attack detection — as for many of the papers referenced in Sect. 2.2 — we propose a “data oriented” perspective for assessing the transferability of machine learning models. Our findings can potentially shift the current research focus, which strives for the maximization of detection rates, in the direction of optimal and transferable machine learning models for practical applications.

3 Data collection

The data used for testing the transferability of machine learning models come from USB-IDS-1, a flow-based dataset we developed at the University of Sannio. In the following, we describe the basics of the experimental testbed underlying USB-IDS-1, the DoS tools and the defenses used, and the data collection procedure, and we provide some insights into the effectiveness of the attacks therein.

3.1 Experimental testbed

The dataset was collected within a private network infrastructure at the University of Sannio. The experimental testbed consists of three Ubuntu 18.04 LTS nodes — each equipped with an Intel Xeon E5-2650V2 8-core (with multithreading) 2.60 GHz CPU and 64 GB RAM — within a local area network (LAN). The experimental testbed is sketched in Fig. 1 and its components are described in the following.

The “victim” node hosts an installation of the Apache web server 2.4.29. The web server supports a variety of modules — including security-related ones — that either come by default with the server installation or can be downloaded and plugged in later on by the site owner. It is worth noting that the modules can be enabled/disabled by adjusting the configuration of the server. As for the specific deployment of the “victim” node in our testbed, we focus on well-established server-side modules — namely evasive and reqtimeout (presented in Sect. 3.2) — intended to defend from various DoS attacks. USB-IDS-1 accounts for data collected within various settings, which pertain to attacks done against the server both with “no defense” in place and with the defense modules in hand enabled.

The web server is operated with default thread limits, maximum workers, and other similar parameters. The default configuration typically provides a good balance between performance and overhead, tailored to the most common use cases; as such, we opted for this configuration to avoid any potential bias or inadvertent mitigation of the attacks caused by configuration changes other than the defense modules assessed in this study.

The “attacker” node in Fig. 1 generates the DoS traffic workload against the victim server. We emulate a mixture of DoS attacks — included in USB-IDS-1 — that leverage both traditional flooding activities and the intrinsic design of the HTTP protocol, as in the case of slow attacks. Overall, the attacks are carried out by means of public DoS scripts and command line utility programs, which represent the state of the practice in the security community. As with the defense modules, the DoS tools hosted by the “attacker” node are addressed in Sect. 3.2.

Fig. 1 Experimental testbed

The “client” node runs httperf, a well-known workload generator. The generator makes it possible to set a desired level of workload (measured in reqs/s), consisting of HTTP requests, by means of several parameters. In our testbed, httperf is used to probe the victim server at regular intervals by collecting several convenient metrics that summarize its operational status, such as the throughput, i.e., the HTTP requests accomplished by the web server within the time unit (measured in reqs/s), and the response time, i.e., the time taken to serve a request. In the context of this work, we use the throughput to characterize the status of the server in response to the load generated by httperf.

3.2 DoS tools and defense modules

The attacks and pertinent information on the DoS tools that we used to generate them are briefly listed in the following:

  • hulk: implements an HTTP flood attack, which spawns a large volume of obfuscated and unique requests; as a consequence, the requests are hard to detect by means of signatures. The main goal of the tool is to overwhelm a victim web server with randomly generated header and URL parameter values. We use a popular hulk implementation for our experiments.

  • TCP flood: another well-known DoS attack, which issues TCP connection requests in order to lock server-side ports and make the server unable to accept connections from legitimate clients. It can be considered a flooding attack. For our experiments, we used a TCP flood script available on GitHub.

  • slowloris: implements a DoS attack by means of a low-bandwidth approach, which exploits a weakness in the management of TCP fragmentation of the HTTP protocol. We launched this attack by means of a well-known python script. It implements a slow header attack by sending incomplete HTTP requests (i.e., without ever ending the header). If the server closes a malicious connection, it is re-established by the attack tool.

  • slowhttptest: a tool that allows launching slow application-layer DoS attacks. For our experiments, we used slowhttptest in the “slowloris” mode, which sends incomplete HTTP requests to the victim server.

As for the defense modules that are recommended for hardening web server installations against DoS attacks, we consider:

  • evasive: a consolidated defense module intended to protect a server from DoS, DDoS, and brute-force attacks. In order to enable the evasive module, it is necessary to download it and configure it in the baseline server installation, which we did according to the instructions from a well-detailed tech blog. The module stores incoming and previous IP addresses and Universal Resource Identifiers (URIs) in a table, which is used to look up whether a specific HTTP request should be allowed or not.

  • reqtimeout: the module aims to mitigate slow DoS attack types and — differently from evasive — is typically enabled by default in the baseline server after installation from the standard Ubuntu repository, which means that disabling it may require explicit changes to the configuration. The module allows setting — according to the environment and domain where the web server is deployed — minimum data rates and timeouts for receiving HTTP request headers and bodies from clients. These conditions need to be met in order to keep a connection open.

As mentioned, both evasive and reqtimeout can be seamlessly enabled/disabled by adjusting the configuration and re-starting the web server, as done with the victim server in our testbed.

3.3 Collection procedure

For each DoS tool listed in Sect. 3.2, we run three independent attack experiments against the web server, which is started with (i) “no defense” modules at all in place, (ii) evasive module on, and (iii) reqtimeout module on, respectively (attacks are run one by one, i.e., either with “no defense” or one module per experiment at a time).

The duration of each experiment is 600 s, which is long enough to observe the effect of the attacks in our testbed; the web server is exercised with a workload L = 1000 reqs/s by httperf during the entire progression of the attack. It should be noted that in attack-free conditions, in response to the workload of 1000 reqs/s, the throughput of the server is also equal to 1000 reqs/s. Given that in our testbed the only source of legitimate activity is the “client” node, at any time a throughput lower than 1000 reqs/s points out the presence of DoS activity. Each experiment is performed according to the following schedule:

  1. Setup: adjustment of the configuration of the web server — in order to reflect one of the “no defense,” evasive, or reqtimeout scenarios — and boot of the web server;

  2. Metrics collection: boot of httperf, which continuously exercises the web server with 1000 reqs/s and collects the throughput during the whole progression of the experiment;

  3. Attack: execution of a DoS attack by means of one of the tools in hand; the web server is under both benign load and DoS traffic, as shown in Fig. 1;

  4. Experiment completion: shutdown of the attack tool, httperf, and the web server, and storage of the throughput observations for subsequent analysis.

In order to ensure independent experimental conditions between pairs of subsequent experiments, we clear the logs of the web server, stop httperf, the attack scripts, and the web server, and reboot the nodes. The experimental campaign led to 12 experiments, where we collected data for all the combinations in the Cartesian product (hulk, TCP flood, slowloris, slowhttptest)\(\times\)(“no defense,” evasive, reqtimeout). During each experiment, we captured the network traffic into a pcap packet capture file by means of tcpdump, as sketched below. Moreover, the pcap files obtained after the campaign (one file per experiment) were processed with CICFlowMeter in order to generate network flow records in the form of comma-separated values (csv), which we use for testing the machine learning algorithms.
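The capture step can be scripted as in the minimal sketch below, which only illustrates the per-experiment tcpdump invocation; the network interface and output file names are illustrative assumptions, and the pcap-to-csv conversion with CICFlowMeter is left as a placeholder because its invocation depends on the release in use.

```python
# Minimal sketch of the per-experiment capture step; the interface ("eth0") and
# the output file name are illustrative, not the actual testbed values.
import subprocess
import time

capture = subprocess.Popen(["tcpdump", "-i", "eth0", "-w", "hulk_no_defense.pcap"])
time.sleep(600)        # duration of one experiment (Sect. 3.3)
capture.terminate()
capture.wait()

# The resulting pcap file is then converted into a csv of flow records with
# CICFlowMeter; the exact invocation depends on the release in use and is omitted.
```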

USB-IDS-1 is freely available to the community and all the csv files can be downloaded for research use through our institutional webpage.

3.4 Effectiveness of the attacks

Figure 2 shows the throughput of the victim server (measured by means of httperf) during the progression of the attacks. In all cases, the attack starts at t=15 s from the beginning of the experiment. For each attack, the throughput is shown in the case of “no defense,” evasive, and reqtimeout. Interestingly, the attacks cause a variety of outcomes. We note either a progressive degradation of the throughput for hulk (Fig. 2a) or periodic drops for TCP flood (Fig. 2b). On the other hand, slow DoS attacks are characterized by their typical “on-off” behavior, which means that the throughput drops sharply from 1000 to 0 reqs/s in a few seconds (Fig. 2c and d).

All the attacks prove to be effective against the server, in that they significantly impact the throughput. As for the defense modules, it should be noted that they provide scarce, if not zero, protection from the attacks. For example, Fig. 2a indicates that evasive (\(\triangle\)-marked series) does delay the throughput depletion of the server when compared to the baseline “no defense” run (\(\bullet\)-marked series); however, it is not a long-term defense. Similarly, Fig. 2c indicates that slowloris under reqtimeout (\(\times\)-marked series) is almost as effective as the corresponding “no defense” run (\(\bullet\)-marked series), except for sporadic and marginal spikes. TCP flood is mitigated by none of the modules; evasive is not effective against slow DoS attacks.

In principle, the reader may think that the attacks done against a defended server are harmless and not worth detecting; however, we found that this is not the case. With respect to our dataset, attacks under defense are still capable of significantly disrupting operations. As such, modern IDS techniques should be properly tuned even when the server is hardened.

Fig. 2 Throughput of the victim server during the progression of the attacks in the case of “no defense” (\(\bullet\)), evasive (\(\triangle\)), and reqtimeout (\(\times\)) for each attack (a-d)

4 Analysis method

4.1 CICIDS2017 and relationship with USB-IDS-1

CICIDS2017

The dataset consists of benign traffic synthesized from the abstract behavior of 25 users, mixed with malicious traffic from many common attacks. In order to create the dataset, the proposing group used a laboratory environment with attacker and victim networks (Sharafaldin et al., 2018). The attacker was a Kali Linux node and the victim an Ubuntu 16.04 system running an Apache web server.

The dataset is provided both as a set of pcap files and as bidirectional labeled flow records (csv files). In the latter format, each record is a labeled flow, obtained from the network traffic by means of the CICFlowMeter tool and identified by 84 features (attack label included). These are mainly network traffic features, along with the label stating whether the record belongs to normal traffic or to an attack.

The data capture period started at 9 a.m. on Monday, July 3, 2017, and ended at 5 p.m. on Friday, July 7, 2017, for a total of 5 days. Monday is the “normal day” and contains only benign traffic; in the morning and afternoon of Tuesday, Wednesday, Thursday, and Friday, attacks were performed in addition to normal traffic. These attacks belong to the categories Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS. DoS attacks, such as hulk, slowloris, and slowhttptest, were performed and captured on “Wednesday,” i.e., the “DoS day.”

Relationship between CICIDS2017 and USB-IDS-1

The datasets share several important similarities in the technologies, DoS attacks, and victim servers, which makes our attempt to transfer intrusion detection models learned from CICIDS2017 to USB-IDS-1 perfectly reasonable. Both datasets rely on CICFlowMeter for generating network flow records on top of the raw packets: as a consequence, the flow records of the two datasets consist of the same number and ordering of features. It should be noted that CICFlowMeter has undergone several updates since its original version; more recently, the authors of Engelen et al. (2021) provided an improved version of CICFlowMeter aiming to solve some issues pertaining to the termination of TCP flows, along with a new release of the CICIDS2017 csv files. For the sake of uniform evaluation and to avoid any potential technological bias, our study is based on the version of CICFlowMeter used to generate the original CICIDS2017. USB-IDS-1 includes flow records collected in the face of hulk, slowloris, and slowhttptest — the same mixture of DoS attacks and reference tools of CICIDS2017 — plus TCP flood, which is not present in CICIDS2017 and can thus be regarded as some sort of “zero-day” attack. The Apache web server is the victim in both datasets; however, it is worth noting that the original CICIDS2017 paper (Sharafaldin et al., 2018) does not disclose the configuration of the victim server or the enablement of any defense module at the time the data were collected. Since not stated otherwise, it is reasonable to assume that CICIDS2017 was collected under the default configuration of the server with no specific defense in place, which corresponds to our USB-IDS-1 “no defense” scenario.

There are two facets of USB-IDS-1 that should have made it successful — at least in principle — to transfer intrusion detection models from CICIDS2017. First, USB-IDS-1 is collected through a much simpler network, consisting of three nodes with only httperf being used as a normative workload generator: as a consequence, data collection at our private premises is not affected by strong sources of uncertainty or uncontrollable factors. Second, all the attacks are effective and clearly “stand out,” as shown in Fig. 2. Moreover, the attacks are executed both with no server-side protection from DoS and with two state-of-the-practice defense modules: overall, we aim to resemble the data collection conditions of CICIDS2017 (i.e., the “no defense” scenario) plus additional variants of the attacks — slightly mitigated although as effective as in “no defense” — to make sure that our claims are not biased by a specific defense configuration.

4.2 Data preprocessing and analysis framework

Figure 3 provides a visual representation of the analysis framework. As in any machine learning experiment, we preprocess the input dataset — the CICIDS2017 “Wednesday” file — and put it in a format suitable for the analysis to be performed. First, we remove non-relevant or biasing features, i.e., the timestamp and id of the flow records, the source address and port, and the destination address and port. The remainder consists of 78 features (label included) per flow record. It is worth pointing out explicitly that the presented experiment is performed in a binary classification scenario. As such, all the flows produced within different types of attacks are considered as belonging to a unique generic class named ATTACK — encoded with the numeric label 0 — whereas BENIGN flow records are assigned the label 1. A minimal preprocessing sketch is shown below.
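The sketch uses pandas; the file name and the exact column names are assumptions (CSV headers may vary slightly across CICFlowMeter releases), so they should be adapted to the files actually in use.

```python
# Sketch of the preprocessing step described above (file and column names are assumed).
import pandas as pd

df = pd.read_csv("Wednesday-workingHours.pcap_ISCX.csv")   # hypothetical file name
df.columns = df.columns.str.strip()                        # headers may carry stray spaces

# Remove the non-relevant or biasing features.
drop_cols = ["Flow ID", "Timestamp", "Source IP", "Source Port",
             "Destination IP", "Destination Port"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# Binary classification: every attack type collapses into ATTACK (0), BENIGN is 1.
df["Label"] = (df["Label"].str.strip() == "BENIGN").astype(int)
```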

Fig. 3 IDS learning and evaluation framework

The flow records contained in the aforementioned CICIDS2017 “Wednesday” file are split into three disjoint subsets used for the training, validation, and test of the IDS models (presented in Sect. 4.3). While splitting the file, we adopt a stratified sampling strategy with no replacement. This means that (i) the ratio between benign and attack classes of the original file is preserved in the output splits and (ii) each flow record of the original file is assigned to a unique split. The original CICIDS2017 file contains 692,703 flow records, 1,297 of which were discarded due to the presence of malformed or unsuitable values (e.g., “Infinity” or “NaN”). The remaining 691,406 total flow records are divided as follows:

  • CICIDS-TRAINING: 70% of the total (i.e., 483,982), divided into 307,778 BENIGN and 176,204 ATTACK flow records;

  • CICIDS-VALIDATION: 15% of the total (i.e., 103,707), divided into 65,952 BENIGN and 37,755 ATTACK flow records;

  • CICIDS-TEST: 15% of the total (i.e., 103,707), divided into 65,952 BENIGN and 37,755 ATTACK flow records.

It should be noted that the three splits above sum up to 691,396 flow records, i.e., 10 fewer than the abovementioned total (691,406). This is due to rounding down to the preceding integer, performed whenever the chosen percentages do not return an integer number of records to be assigned to a given split. A sketch of the split is reported below.
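The sketch reuses the preprocessed DataFrame from the previous sketch; the random seed is an arbitrary choice, and the handling of malformed values mirrors the discarding step described above.

```python
# Stratified 70/15/15 split with no replacement, as described above.
import numpy as np
from sklearn.model_selection import train_test_split

df = df.replace([np.inf, -np.inf], np.nan).dropna()   # discard "Infinity"/"NaN" records

y = df["Label"]
X = df.drop(columns=["Label"])

# The benign/attack ratio is preserved in every split; each record ends up in one split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```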

According to Fig. 3, CICIDS-TRAINING and CICIDS-VALIDATION are used to learn the IDS models; CICIDS-TEST, jointly with the records of the USB-IDS-1 collection (arranged in individual csv files corresponding to specific attack-defense combinations), represents the test sets used for evaluating the IDS models. CICIDS-TEST and all the USB-IDS-1 files provide held-out benign and attack flow records, i.e., records not seen at all by the IDS models during the learning phase.

Evaluation is based on the typical metrics of precision (P), recall (R), and F1 score (F). They are computed from the number of true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP) obtained by running the IDS models against the test sets. For instance, a TN is a BENIGN record of the test set that is classified as BENIGN by the model; a FN is an ATTACK record of the test set that is deemed BENIGN by the model. The metrics are computed as follows:

$$P=\frac{TP}{TP+FP} \qquad R=\frac{TP}{TP+FN} \qquad F=2\cdot \frac{P\cdot R}{P+R}$$
(1)
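The metrics of Eq. (1) can be computed directly from the confusion counts, as in the sketch below; treating ATTACK (encoded as 0 in the preprocessing above) as the positive class is an assumption consistent with the definitions of TP and FN given in the text.

```python
# Straightforward implementation of Eq. (1); ATTACK (label 0) is the positive class.
from sklearn.metrics import confusion_matrix

def prf(y_true, y_pred):
    # With labels=[1, 0] the flattened matrix is (TN, FP, FN, TP) for positive=ATTACK.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```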

4.3 Machine learning models

Our experiments are based on three popular machine learning techniques, i.e., decision tree, random forest, and deep feed-forward neural network, which we use to learn an intrusion detector.

Decision tree

Decision trees are widely used in the literature to learn IDS models because of their capability to infer explicable rules for classifying network flow records (Li and Ye, 2003). They are supervised learning algorithms, in that the model is trained on a set of labeled data, and are typically used to solve classification problems, i.e., when the label is categorical or qualitative, as in our case study (BENIGN or ATTACK). For a classification problem, a decision tree is a tree where each node represents a predicate tested on a feature, each link represents a decision, and each leaf represents an outcome. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data. Given a dataset, a decision tree groups together data points that are similar to each other and looks for the best rules that separate dissimilar data points, until a given degree of similarity is reached. In general, the main goal is to make the best splits between nodes, so as to optimally divide the data points into the correct categories. In the basic structure of a decision tree, the root node is the beginning of the tree, an internal node is a sub-node that might be further split into additional sub-nodes, and a leaf node is a sub-node that cannot be split any further and represents a possible outcome. The depth of a tree is defined by the number of levels, not including the root node. To classify a data point, one starts at the root of the decision tree and follows the branch indicated by the outcome of each test until a leaf node is reached; the class associated with that leaf is the resulting classification. A toy example of this traversal is sketched below.
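The sketch illustrates the traversal just described on a single flow record; the feature names and thresholds are purely hypothetical and are not the rules learned in this study.

```python
# Toy decision tree: each internal node tests a predicate on one flow feature,
# each leaf yields a class. Features and thresholds are hypothetical.
def classify(flow: dict) -> str:
    if flow["flow_duration"] > 1_000_000:        # root node
        if flow["bwd_packets_per_s"] < 1.0:      # internal node
            return "ATTACK"                      # leaf
        return "BENIGN"                          # leaf
    if flow["syn_flag_count"] > 10:              # internal node
        return "ATTACK"                          # leaf
    return "BENIGN"                              # leaf

print(classify({"flow_duration": 2_500_000,
                "bwd_packets_per_s": 0.4,
                "syn_flag_count": 0}))           # -> ATTACK
```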

Random forest

As suggested by its name, the random forest is based on a large number of individual decision trees, which operate as an ensemble. It is a widely used technique in intrusion detection to implement attack classifiers (Resende and Drummond, 2018). Again, it is a supervised algorithm; it uses a modified tree learning procedure that inspects a random subset of the features in the learning process. The algorithm is based on the construction of multiple decision trees (the forest), each working as a classifier. A random forest creates n different trees by using a number of feature subsets. Each tree produces a classification result, and the result of the classification model is obtained by majority voting: the input data point is assigned to the class that obtains the highest number of votes. There is a direct relationship between the number of trees in the forest and the results it can achieve: in general, the larger the number of trees, the more accurate the result.

Deep neural network

Over the last few years, artificial neural networks have been successfully applied to intrusion detection (Shenfield et al., 2018). Deep neural networks are the quintessential deep learning models. In general, a neural network is made of many interconnected neurons. Each neuron takes input values and multiplies them by weights; the weights are fixed during testing, but during training they change in order to “tune” the network. The weighted inputs are then summed with a bias value, and the sum is transformed into an output value according to the activation function of the neuron. The output value of a neuron is often the input of another neuron. Neurons are connected together according to a specific network architecture. There is typically an input layer (containing a number of neurons equal to the number of input features in the data), an output layer (containing a number of neurons equal to the number of classes), and hidden layers, containing any number of neurons. There can be multiple hidden layers, which allow the neural network to learn more complex decision boundaries. In a “feed forward” neural network — the type of deep neural network adopted in this study — the information flows forward through the input nodes, the hidden layers, and, finally, the output nodes. The learning process involves adjusting the weights of the network to improve the accuracy of the results, which is done by minimizing the observed errors. Learning stops when the examination of additional data points does not usefully reduce the error rate.

4.4 Implementation and learning of the IDS models

The machine learning models are implemented by means of well-known, state-of-the-art libraries. In particular, we capitalize on the Python implementations of the decision tree and random forest provided by the scikit-learn package. The deep neural network, instead, is implemented in Python by means of the Keras library. It is worth pointing out that all the models are learned from the same training set (CICIDS-TRAINING) and validation set (CICIDS-VALIDATION), as shown in Fig. 3: the former is used to train the model; the latter is used to validate the model, i.e., to assess its performance while varying the hyperparameters and to drive their final selection. More importantly, CICIDS-TEST is not used at all during the learning stage.

Decision tree

The specific “tree-like” structure that is learned from the training set depends on user-supplied hyperparameters, which drive the creation of the predicates (i.e., nodes) meant to classify a given input data point — a network flow record in this study — as well as the connections and depth of the tree. Among the most typical hyperparameters, max_depth regulates the length of the path from the root to the furthest leaf of the tree; min_samples_leaf indicates the minimum number of data points to be collected at a leaf during the training phase. Hyperparameters should be set after proper tuning. For example, a high max_depth in conjunction with a low min_samples_leaf value tends to produce an excessive number of leaves and classification paths, which cause the tree to overfit the training data. In order to set the hyperparameters and to assess the corresponding impact on the goodness of the classification, we conduct a sensitivity analysis by learning trees with different max_depth and min_samples_leaf values on CICIDS-TRAINING and measuring the F1 score they achieve when classifying the data points in CICIDS-VALIDATION.

Fig. 4 Sensitivity of the F1 score of the decision tree model with respect to max_depth and min_samples_leaf

The results are summarized in Fig. 4, which shows the sensitivity of the F1 score of the decision tree model with respect to the hyperparameters, measured with CICIDS-VALIDATION. We observe that there is no significant benefit in increasing max_depth above 10. In fact, the only significant improvement for any value of min_samples_leaf is obtained by raising max_depth from 5 to 10: the F1 score flattens from 10 on. Regarding min_samples_leaf, the F1 score decreases as min_samples_leaf increases — as expected — until it drops sharply at min_samples_leaf=10000. For example, Fig. 4 reports the F1 score obtained with max_depth=10 while varying min_samples_leaf: the F1 score ranges between 0.994 and 0.954. A reasonable tradeoff is represented by the following selection: max_depth=10 and min_samples_leaf=500. It is worth noting that many of the contributions in the area of intrusion detection overlook the sensitivity of the decision tree to these hyperparameters. A sketch of the sensitivity analysis is reported below.
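The sketch mirrors the sensitivity analysis just described, reusing the splits from Sect. 4.2; the exact grid of hyperparameter values and the random seed are assumptions, since only the endpoints discussed above are reported in the text.

```python
# Sensitivity analysis sketch: learn a tree per (max_depth, min_samples_leaf)
# pair on CICIDS-TRAINING and score it on CICIDS-VALIDATION.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

for max_depth in (5, 10, 15, 20):                      # assumed grid values
    for min_samples_leaf in (100, 500, 1000, 5000, 10000):
        dt = DecisionTreeClassifier(max_depth=max_depth,
                                    min_samples_leaf=min_samples_leaf,
                                    random_state=42)
        dt.fit(X_train, y_train)
        f1 = f1_score(y_val, dt.predict(X_val), pos_label=0)   # ATTACK is positive
        print(max_depth, min_samples_leaf, round(f1, 3))

# Selected tradeoff: max_depth=10, min_samples_leaf=500.
```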

Random forest

The random forest classifier is trained similarly to the decision tree, using labeled BENIGN and ATTACK network flow records. According to the aforementioned sensitivity analysis, the best values for the max_depth and min_samples_leaf parameters are 10 and 500, respectively. An additional crucial parameter for the random forest is n_estimators, which regulates the number of trees in the forest; we selected n_estimators=50 after tuning with CICIDS-VALIDATION. Each tree gives a vote that indicates its decision on the class of the network flow record. Therefore, the targets are discrete class labels and the algorithm takes the most voted one as the final prediction. A minimal training sketch is shown below.
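The sketch uses the hyperparameters reported above and reuses the variables of the previous sketches; the random seed is an arbitrary choice.

```python
# Random forest with the selected hyperparameters (n_estimators=50,
# max_depth=10, min_samples_leaf=500).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=50, max_depth=10,
                            min_samples_leaf=500, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)     # CICIDS-TEST or a USB-IDS-1 csv loaded the same way
```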

Deep neural network

As for the two aforementioned techniques, we train our deep neural network with both BENIGN and ATTACK flow records. The selected network is made up of seven layers comprising 77-100-100-100-100-100-2 neurons, where 77 is the number of features selected for the analysis. The rectified linear unit (ReLU) activation has been selected for all the hidden layers, while the softmax activation function is used for the output layer. We train the deep neural network on the CICIDS-TRAINING data points for 200 epochs, using the Adadelta optimizer with learning rate lr=0.1 and a batch size of 1024. The design of a deep neural network requires setting many hyperparameters that are subject to fine-tuning. As for any machine learning study, the final selection of the hyperparameters has been guided by experimental tests carried out by analyzing the outcome of the model with respect to CICIDS-VALIDATION. A minimal sketch of the network is shown below.
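The Keras sketch follows the architecture and training settings just described; the loss function and the absence of feature scaling are assumptions, since they are not stated in the text.

```python
# 77-100-100-100-100-100-2 feed-forward network with ReLU hidden layers and a
# softmax output, trained with Adadelta (lr=0.1), 200 epochs, batch size 1024.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(77,)),                 # 77 input features
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(2, activation="softmax"),     # BENIGN vs. ATTACK
])
model.compile(optimizer=keras.optimizers.Adadelta(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",   # assumed loss
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, batch_size=1024)
```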

5 Results

We evaluate each IDS model with “held out” test sets of network flow records, according to the analysis framework in Fig. 3. The test sets encompass both (i) CICIDS-TEST and (ii) the individual USB-IDS-1 files collected for the various attack-defense combinations. The former is intended to prove that the models are not flawed from their inception (i.e., they perform well on the individual dataset they were learned from); the latter aims to provide a deep insight into the transferability of the models to an unseen — although closely related, as explained above — dataset, such as USB-IDS-1. The results are presented in the following by machine learning technique; Sect. 5.4 provides a comparative discussion of the results and the lessons learned on transferability with respect to the specific machine learning model, attacks, and defenses.

5.1 Decision tree

Fig. 5 Evaluation of the decision tree model across CICIDS2017 and USB-IDS-1 test sets for each attack (a-d)

We test the decision tree with CICIDS-TEST and obtain precision, recall, and F1 score equal to 0.99, 0.98, and 0.99, respectively. Not surprisingly, all the values are close to 1, i.e., almost “perfect” detection. In fact, this is the finding of most of the papers on IDSs when it comes to applying machine learning techniques to public intrusion datasets.

Figure 5 shows precision, recall, and F1 score achieved by the decision tree — learned from CICIDS2017 — when tested with USB-IDS-1 data. The results are arranged by attack (i.e., one subfigure per attack); for each attack, we report precision, recall, and F1 score obtained by attempting the detection of the attack in the case of “no defense,” evasive, and reqtimeout. For each subfigure, the metrics obtained by the decision tree applied to CICIDS-TEST are reproduced by the three leftmost bars in order to ease comparison and visualization of the results.

Figure 5 returns a quite different picture when compared to the “perfect” detection seen above. We observe that the decision tree achieves satisfactory results in the case of “no defense” and reqtimeout for USB-IDS-1 hulk (Fig. 5a): precision, recall, and F1 score equal 1, 0.97, and 0.99, respectively, thus in line with the results obtained with CICIDS-TEST. However, this is the only successful exception. For example, the decision tree fails to detect USB-IDS-1 hulk in the case of evasive: differently from the previous case, recall is 0.06, which is strongly unsatisfactory. TCP flood (Fig. 5b) goes completely undetected by the decision tree regardless of the enablement of the defense. As for USB-IDS-1 slowloris (Fig. 5c), the performance of the decision tree is fairly acceptable (although not good enough for a fruitful detector to be deployed in a real-life production environment) in the “no defense” and evasive settings, where the recall is 0.79. Similarly to the hulk attack, the enablement of a defense module, such as reqtimeout, is detrimental to the decision tree: in this case, the recall drops to 0.24, which is unsatisfactory. Finally, slowhttptest (Fig. 5d) goes undetected by the decision tree regardless of the defenses.

5.2 Random forest

Fig. 6 Evaluation of the random forest model across CICIDS2017 and USB-IDS-1 test sets for each attack (a-d)

The random forest model achieves 1, 0.99, and 0.99 precision, recall, and F1 score with CICIDS-TEST, respectively. These figures are higher than those of the decision tree above and — again — in line with the literature on IDSs trained and tested on top of an individual dataset.

Figure 6 shows precision, recall, and F1 score achieved by the random forest — learned from CICIDS2017 — when tested with the USB-IDS-1 dataset. The arrangement of the results is the same as Fig. 5. For each attack, we report precision, recall, and F1 score obtained in “no defense,” evasive, and reqtimeout; moreover, for each subfigure, we reproduce the metrics obtained by the random forest applied to CICIDS-TEST.

The results in Fig. 6 confirm most of the findings inferred with the decision tree, with some interesting additions. Again, a “perfect” detection model obtained within an individual dataset does not transfer to a closely related dataset. As for the decision tree, USB-IDS-1 hulk (Fig. 6a) is very well detected in the case of “no defense” and reqtimeout (precision and recall equal 1 and 0.99, respectively); however, recall drops to 0 when the evasive module is enabled. Again, TCP flood (Fig. 6b) goes completely undetected by the random forest regardless of the enablement of the defense. As for the decision tree, slowloris (Fig. 6c) is weakly detected by the random forest, with recall equal to 0.80 in the “no defense” and evasive settings; recall drops to 0.13 after the enablement of reqtimeout. Differently from the decision tree, the random forest can also detect slowhttptest (Fig. 6d) to some extent: recall is 0.7 in “no defense” and evasive and 0.3 in the case of reqtimeout. It is worth noting that slowhttptest went almost undetected with the decision tree: this finding indicates that it is of the utmost importance to look at different models — as we did in this study — to make rigorous claims on transferability.

Fig. 7 Evaluation of the deep neural network model across CICIDS2017 and USB-IDS-1 test sets for each attack (a-d)

5.3 Deep neural network

Finally, we test the deep neural network with CICIDS-TEST and obtain precision, recall, and F1 score equal to 0.96, 0.97, and 0.96, respectively. These figures are lower than those of the decision tree and random forest; however, they are still remarkable. Again, they have been obtained by testing the deep neural network with records taken from the same dataset used for training. Figure 7 shows precision, recall, and F1 score of the deep neural network on the CICIDS2017 and USB-IDS-1 test sets according to the arrangement by attack and defense used above; for each subfigure, we reproduce the metrics obtained by the deep neural network applied to CICIDS-TEST.

Again, Fig. 7 confirms the results seen with the decision tree and random forest. For example, USB-IDS-1 hulk (Fig. 7a) is detected in the case of “no defense” and reqtimeout, although with much lower recall, i.e., 0.67 and 0.66, respectively; recall drops to 0.05 when the evasive module is enabled. TCP flood (Fig. 7b) goes completely undetected, which confirms the finding obtained for both the decision tree and random forest. Differently from the decision tree and random forest, the deep neural network cannot detect USB-IDS-1 slowloris (Fig. 7c); finally, the deep neural network performs as well as the random forest in the case of slowhttptest (Fig. 7d).

5.4 Lessons learned and comparative analysis

There are several interesting lessons that can be learned from our analysis. First, we observe that it is quite straightforward to obtain a highly performing machine learning model on top of an individual dataset. All the models assessed easily achieve precision between 0.96 and 1, and recall between 0.97 and 0.99, when trained and tested with CICIDS2017 itself. Surprisingly, these “apparently” effective IDS models do not transfer to a much simpler network with clearly observable and effective attacks, such as those in USB-IDS-1. As a consequence, it is hard to see if and how models learned from a public dataset, such as CICIDS2017, would generalize to a real-life network affected by all the sources of complexity and uncertainty that do not exist in our small-scale, controlled testbed.

Table 2 provides the recall of the decision tree (DT), random forest (RF), and deep neural network (DNN) — previously presented in Figs. 5, 6, and 7 — by USB-IDS-1 attack-defense combination. Table 2 is meant to foster comparison and to “visualize” the transferability of the models: the lower the recall, the darker the cell. As for the precision, it is close (if not equal) to 1 for almost all attack-defense combinations, as can be noted in Figs. 5, 6, and 7: we hypothesize that the models tend to produce few false positives because the normative traffic underlying USB-IDS-1 is simpler and more predictable than that of CICIDS2017. In the following, TCP flood is discussed first; considerations on the remaining attacks are then presented with respect to the defense.

TCP flood is detected by none of DT, RF, and DNN, regardless of whether the defense is enabled: as shown in Table 2 (row corresponding to TCP flood), the recall of the models is always 0 or 0.01. TCP flood can be seen as a "zero-day" attack for CICIDS2017. With respect to the results in hand, it is reasonable to suppose that none of the models inferred from CICIDS2017 could possibly be transferred with success to a "zero-day" attack. We also hypothesize that other learning approaches, such as unsupervised or semi-supervised learning, might be helpful to detect never-seen-before or "zero-day" attacks; however, testing these techniques is out of the scope of the work at this stage.
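Purely as an illustration of the kind of approach we allude to (and not something evaluated in this study), a semi-supervised anomaly detector could be fitted on benign flows only, so that never-seen-before attacks may still surface as outliers. The sketch below uses an Isolation Forest; the feature matrices `X_benign_train` and `X_usb_test` are hypothetical names of ours.

```python
# Illustrative only: an anomaly detector fitted on benign flows, so that
# "zero-day" attacks can surface as outliers without attack labels at
# training time. `X_benign_train` and `X_usb_test` are hypothetical.
from sklearn.ensemble import IsolationForest

detector = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
detector.fit(X_benign_train)   # benign flows only

# IsolationForest returns +1 for inliers and -1 for outliers;
# map outliers to the "attack" class (1) for comparison with supervised models.
pred = (detector.predict(X_usb_test) == -1).astype(int)
```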

Table 2 Recall of decision tree (DT), random forest (RF), and deep neural network (DNN) by USB-IDS-1 attack-defense combination
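As an illustration of how a Table-2-like view can be produced, the sketch below groups the USB-IDS-1 test records by attack and defense and computes the recall of a previously trained classifier for each cell. The column names, the `FEATURES` list, and the `model` object are assumptions of ours, not names taken from the released artifacts.

```python
# Minimal sketch: per attack-defense recall of a model trained on CICIDS2017
# and tested on USB-IDS-1. `model`, `usb_df`, and `FEATURES` are illustrative.
import pandas as pd
from sklearn.metrics import recall_score

rows = []
for (attack, defense), group in usb_df.groupby(["attack", "defense"]):
    y_true = group["label"].values                 # 1 = attack, 0 = benign
    y_pred = model.predict(group[FEATURES].values)
    # zero_division=0 guards against groups that contain no positive records
    rows.append({"attack": attack, "defense": defense,
                 "recall": recall_score(y_true, y_pred, zero_division=0)})

table2 = pd.DataFrame(rows).pivot(index="attack", columns="defense", values="recall")
print(table2.round(2))
```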

Attacks in “no defense” mode

According to Table 2 ("no defense" column), two out of the three models learned from CICIDS2017, i.e., DT and RF, successfully transfer to USB-IDS-1 hulk: in fact, recall is 0.97 and 0.98, respectively, which is highly satisfactory. However, hulk is the only successful exception. In fact, recall drops significantly in the case of slow DoS attacks. At best, slowloris is detected with 0.80 recall by the RF; recall is worse for slowhttptest, i.e., 0.70, for both RF and DNN. These detection figures are not satisfactory for a production environment. The finding is quite surprising and unexpected because (i) just as for hulk, CICIDS2017 does contain records of slow DoS attacks, and (ii) the slow DoS attacks of USB-IDS-1 are straightforward and proven to be 100% disruptive, as shown in Fig. 2c and d (\(\bullet\)-marked "no defense" series) in Sect. 3.4.

Attacks in evasive and reqtimeout mode

The enablement of the defense modules is detrimental to detection. As for the evasive column in Table 2, it can be noted that the recall obtained for USB-IDS-1 hulk drops to 0.06 (DT), 0 (RF), and 0.05 (DNN). Differently from the "no defense" mode, the hulk knowledge inferred from CICIDS2017 does not transfer to the same attack under defense. It is worth noting that hulk remains harmful also in the case of evasive, as shown in Fig. 2a, which makes our finding even more interesting. Similar considerations can be made for slow attacks in the case of reqtimeout. At best, the recall of slowloris and slowhttptest is 0.28 and 0.32, respectively. Again, it is worth noting that both slowloris and slowhttptest severely impact the server in the case of reqtimeout, as shown in Fig. 2c and d. This finding indicates that models trained to detect a given attack might be ineffective at revealing "weaker" variants of the same attack.

As previously mentioned, the research group that published CICIDS2017 does not disclose configuration-related aspects, such as the presence of defense modules. We hypothesize that CICIDS2017 attacks were conducted with no defense modules in place. In this respect, our findings indicate that a minor difference with respect to the data gathering environment of the public dataset, such as the enablement of a defense module, can totally invalidate an IDS model inferred on top of it. It must be noted that defense modules are just one marginal example of the large number of uncontrollable factors (e.g., sophistication of the attacks, workloads, and configuration) that characterize a production network. Public intrusion datasets provide only a limited and incomplete view: our analysis demonstrates that a single variation of these factors changes everything. Overall, the implications of using public datasets for advancing the state-of-the-practice in real-life networks and for driving general and rigorous security claims on machine learning and IDS techniques remain quite opaque.

6 Threats to validity

As with any data-driven study, there may be concerns regarding the validity and generalizability of the results. We discuss them according to the four aspects of validity listed in Wohlin et al. (2000).

Construct validity

The study builds around the intuition that learning intrusion detectors on top of a public dataset may lead to partial, if not incorrect, patterns, which cannot be used in production networks. This construct has been investigated in the context of two datasets: the well-established CICIDS2017 and the recent USB-IDS-1. We also use three well-founded machine learning techniques, namely decision tree, random forest, and a deep neural network. The study is supported by extensive experimentation leveraging widely consolidated methods, deep learning frameworks, and evaluation metrics.

Internal validity

The results and key findings of this paper are based on direct measurement experiments, in which we analyze network flow records from the reference datasets under normative operations and attacks. Most notably, attacks are carried out by means of public DoS scripts and command-line utility programs, which represent the state of the practice in the security community. We also consider the role of the defense, which is overlooked by the current literature on public datasets. The use of such diverse attacks and defenses aims to mitigate internal validity threats.

Conclusion validity

Conclusions have been inferred by assessing two independent datasets and the sensitivity of key results with respect to different factors, such as the machine learning techniques, attacks, and defenses. Moreover, we made sure to correctly generate the models in hand: for example, the decision tree is assessed by varying critical parameters, such as max_depth and min_samples_leaf. We present an extensive discussion of the results. The key findings of the study are consistent across the machine learning algorithms, attacks, and defenses, which provides a reasonable level of confidence in the analysis.
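A parameter sweep of this kind can be carried out with a standard cross-validated grid search, as in the minimal sketch below; the training matrices `X_train` and `y_train` and the specific grid values are assumptions of ours, not the exact settings of the study.

```python
# Minimal sketch: selecting max_depth and min_samples_leaf for the decision
# tree via cross-validated grid search on the CICIDS2017 training split.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [5, 10, 20, None],       # illustrative values
    "min_samples_leaf": [1, 5, 10, 50],   # illustrative values
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```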

External validity

The steps of our analysis can be applied to other datasets consisting of network flow records. Nowadays, there are many products for capturing network packets and generating flow records, which makes our approach definitely feasible in practice. In fact, in this paper, we successfully used two independent datasets, four attack tools, and two defense modules to mitigate external validity threats. We are confident that the experimental details provided in the paper would support the replication of our study by future researchers and practitioners.

7 Conclusion

Intrusion detection is one of the main challenges faced by the network security community. The recent spread of machine learning techniques has boosted the performance of intrusion detection systems. In general, machine learning techniques play a pivotal role in the early classification of attacks. Many proposals in the literature leverage public intrusion detection datasets and achieve detection rates that are often very impressive. Despite all of this research, machine learning applications for intrusion detection have not been widely adopted in real-life production networks. This is due to a number of factors, including the ever-evolving sophistication of the attacks, the presence of heterogeneous workloads, the complexity of configuration, and the lack of defense mechanisms of real-life servers. In essence, machine learning techniques could be ineffective under realistic traffic conditions.

This paper proposed an investigation of machine learning in the context of two independent datasets, with a focus on DoS attacks. We trained three "promising" detectors, i.e., decision tree, random forest, and deep neural network, on the flow records of CICIDS2017, considering both benign traffic and DoS attacks. First, we evaluated each IDS model on held-out data from CICIDS2017 and obtained a recall between 0.97 and 0.99 for the models in hand, which demonstrates that it is straightforward to obtain a highly performing machine learning model on top of an individual dataset. Subsequently, we tested the same models with the USB-IDS-1 dataset, which contains various attack-defense combinations. The detectors successfully transferred to just one attack, i.e., hulk ("no defense"), while they performed quite poorly in the case of slow attacks. Detection gets even worse for mitigated variants of the attacks.

It is worth pointing out that both datasets consist of network flow records collected during normative operations and DoS attacks, and they are closely related. This consideration prompted us to train a model with the first dataset and to test it with the second. While the learned normal/attack model should still be "theoretically" applicable and transferable, we observe a significant drop in classification performance. This finding contributes to establishing new knowledge in the area and poses novel challenges. In general, the ideal conditions that characterize most intrusion detection datasets do not generalize to real-life environments. This paves the way for further research aimed at improving the transferability of machine learning models.

Our analysis highlights the gap between academic intrusion detection prototypes and potential real-world applications. The results are extremely relevant both to the release of new datasets and to the implementation of transferable machine learning models. In the future, we will extend the analysis to further types of attacks, victim servers, and datasets. Another important research avenue pertains to learning models from USB-IDS-1 and testing them against other existing datasets, in order to support the development of more useful datasets for practical purposes. Last but not least, we will investigate semi-supervised and unsupervised learning techniques because, at least in theory, they might be less sensitive to the issues pointed out by this paper.