1 Introduction

Current networks are exposed to attacks from malicious users everyday. Among others, we may stress flooding (that causes denial of service), port scanning (that searches for vulnerabilities of the system), password guessing (trying to make un unauthorized login), or buffer overflow attacks (a exploit that can allow to gain root privileges). Therefore, intrusion detection is a very important task for providing security and integrity in information systems. Analyzing the information gathered by security audit mechanisms, intrusion detection systems (IDS) apply several rules that discriminate between legitimate events or an undesirable use of the system (Vasilomanolakis et al. 2015).

When referring to IDS, we must stress two main categories (Debar et al. 1999): (1) Misuse detection, that are trained with an already established set of known attacks (Lee and Stolfo 2000); and (2) Anomaly detection that determines a profile for the “normal behavior”, and sets up as “attack” any access with different characteristics from the standard (Patcha and Park 2007). In this paper, we will focus on a static context, which implies the use of the first type of approaches.

Among different machine learning solutions, Computational Intelligent and Soft Computing techniques have shown a very robust behavior for detecting both known and unseen intrusion attacks and to recognize normal network traffic (Wu and Banzhaf 2010; Guo et al. 2014). In this area of research, fuzzy systems are a valuable tool as they are capable for modeling complex and dynamic systems. This is due to their good properties for representing uncertain knowledge, as well as for presenting a smooth adaptation to the context. We consider the use of fuzzy rule-based classification systems (FRBCSs) (Ishibuchi et al. 2004), an extension to classical rule-based systems, where its antecedents and consequent are composed of fuzzy logic statements. Furthermore, we focus on fuzzy sets with linguistic labels, allowing the output system to present a higher interpretability degree for the expert (Gacto et al. 2011).

In our last study on the topic (Elhag et al. 2015), we focused on the development of a methodology based on this paradigm for misuse detection. Specifically, we made use of the Fuzzy Association Rule-based Classification for High-Dimensional problems (FARC-HD) (Alcalá-Fdez et al. 2011), a linguistic Evolutionary Fuzzy System (EFS) (Fernandez et al. 2015) in synergy with the One-vs-One (OVO) class decomposition technique (Galar et al. 2011), in which binary subproblems are obtained by confronting all possible pair of classes. The high potential of this fuzzy rule learning approach was determined by the goodness in the correct identification for all types of attacks, including rare attack categories.

However, in the context of IDS, there are several metrics of performance to be optimized. Among others, we must stress the attack detection rate (ADR), which stands for the accuracy obtained for the attack classes managed as a whole, and the false alarm rate (FAR), i.e., the number of false positives. Additionally, we have emphasized the significance of identifying correctly every single type of attack, as it allows one being able to develop a different task in each case. For the aforementioned reasons, a Multi-Objective Evolutionary Algorithm (MOEA) (Coello-Coello et al. 2007) could be the best-suited approach for the optimization of these different measures in order to achieve several solutions which can fit different conditions (Fernandez et al. 2015; Mohammadi Shanghooshabad and Saniee Abadeh 2016).

In this work, we propose the integration of an MOEA approach within the learning procedure of the FARC-HD algorithm aiming for the achievement of a wide set of accurate solutions. The synergy between the multi-objective process and the fuzzy representation has already provided several advantages under different scenarios such as pixel classification from remote sensing imagery (Alok et al. 2016) or market clearing of joint energy and reserves auctions (Goroohi Sardou and Ameli 2016). Specifically, we must stress the following ones:

  • Any desired metrics can be selected for the optimization.

  • It allows a decision among several configurations (codified in each solution) depending on the final requirements of the system.

  • The smoothness related to the linguistic fuzzy terms allows the model to provide a better separability among different types of alarms.

  • Finally, a better understanding of the working procedure of the IDS is implicit to the use of linguistic fuzzy rules with a low number of antecedent values.

The goodness of our novel methodology will be tested using several datasets from the area of IDS. Specifically, we have selected the standard KDDCUP’99 dataset (Lee and Stolfo 2000), the NSL-KDD (Tavallaee et al. 2009), and the Gure-KDDCup (Perona et al. 2008). This way, the experimental results will be directly comparable with most of the Computational Intelligence approaches for intrusion detection. Specifically, for the evaluation of the experimental results, we will compare them versus the standard FARC-HD algorithm (Alcalá-Fdez et al. 2011), as well as our previous approach on the topic, i.e., FARC-HD-OVO (Elhag et al. 2015), which was shown to outperform the state-of-the-art EFS algorithms for misuse detection. Finally, we will complement our comparison with the classical C4.5 decision tree (Quinlan 1993).

The remainder of the manuscript is organized as follows. Section 2 introduces some preliminary concepts, including the context of IDS, some basic notions on FRBCSs and the details for the working procedure of FARC-HD. Next, Sect. 3 presents our proposal for the development of the methodology integrating the MOEA and the fuzzy rule learning in misuse detection. Next, in Sect. 4 we set up the experimental framework, including the features of the KDDCUP’99 dataset, metrics of performance, algorithms for comparison and their parameters. Then, the analysis of the performance of this novel approach with respect to the state-of-the-art is shown in Sect. 5. Finally, Sect. 6 summarizes and concludes the work.

2 Preliminaries: intrusion detection systems and fuzzy rule-based classification systems

In this section, we will introduce several concepts that will present a clearer definition of the work context, as well as the features of the type of classifier that will be used to build our final solution. In this way, a summary of the main concepts for IDS will be presented first in Sect. 2.1. Next, in Sect. 2.2, we will recall some brief details on FRBCSs. Finally, in Sect. 2.3, we will describe the features of the FARC-HD algorithm, which has been selected as baseline technique for this research.

2.1 Intrusion detection systems

In this data age, we are witnessing how computer systems are creating, processing, and sharing an overwhelming quantity of information. According to this fact, computer security must be regarded as a critical issue, so that the unauthorized access to this data from a computer and/or computer network, could imply a significant problem, as it compromises the integrity, confidentiality and availability of the resources (Chebrolu et al. 2005). This issue is of extreme importance in several application areas such as have medical (Mitchell and Chen 2015) or power systems (Pan et al. 2015). Therefore, a wide amount of computer security tools such as antiviruses, firewalls, data encryption, and so on. In addition to this, there are some complementary tools that monitor the activity of the network in order to detect and block intrusions.

Anomalous activities are thus identified by IDSs, which comprise the process of monitoring and analyzing events occurring in a computer system or network in order to detect anomalous activity (Vasilomanolakis et al. 2015). In particular, IDS can be split into two categories according to the detection methods they employ, including (1) misuse detection and (2) anomaly detection.

The main difference between both types of systems is related to whether they use a signature detection or anomaly detection paradigm. Misuse detection systems take the majority of IDSs, and use an established set of known attack patterns, and then monitor the net trying to match incoming packets and/or command sequences to the signatures of known attacks (Lee and Stolfo 2000). Hence, decisions are made based on the prior knowledge acquired from the model. The main advantage of this type of IDS is that they provide high detection accuracy with few false positives, but with the disadvantage that they are not able to detect new attacks other than those previously stored in the database.

On the other hand, anomaly detection IDS have the ability to detect new attacks, but at the cost of increasing the number of false positives. In an initial phase, the anomaly-based IDS is trained in order to obtain a normal profile of activity in the system (Patcha and Park 2007). The learned profiles of normal activity are customized for every system, making it quite difficult for an attacker to know with certainty what activities it can carry out without getting detected. Then, incoming traffic is processed in order to detect variations in comparison with the normal activity, in which case it will be considered as a suspicious activity. In addition to the higher number of false alarms raised, another disadvantage of the development a system of these characteristics is the higher the complexity compared to the case of misuse detection.

2.2 Introduction to FRBCSs

Any classification problem consists of m training patterns \(x_p = (x_{p1},\ldots ,x_{pn},C_p)\), \(p=1,2,\ldots ,m\) from M classes where \(x_{pi}\) is the ith attribute value (\(i = 1,2,\ldots ,n\)) of the pth training pattern.

In this work, we use fuzzy rules of the following form for our FRBCSs:

$$\begin{aligned} \begin{array}{l} \text{ Rule } R_j: \text{ If } x_1 \text{ is } A_{j1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{jn}\\ \text{ then } \text{ Class } = C_j \text{ with } \text{ RW }_j \end{array} \end{aligned}$$
(1)

where \(R_j\) is the label of the jth rule, \(x = (x_1, \ldots ,x_n)\) is an n-dimensional pattern vector, \(A_{ji}\) is an antecedent fuzzy set, \(C_j\) is a class label, and \(\text{ RW }_j\) is the rule weight (Ishibuchi and Yamamoto 2005). We use triangular MFs as antecedent fuzzy sets.

When a new pattern \(x_p\) is selected for classification, then the steps of the fuzzy reasoning method (Cordón et al. 1999) are as follows:

  1. 1.

    Matching degree, that is, the strength of activation of the if-part for all rules in the Rule Base with the pattern \(x_p\). A conjunction operator (t-norm) T, is applied in order to carry out this computation:

    $$\begin{aligned} \mu _{A_j}(x_p)= & {} T(\mu _{A_{j1}}(x_{p1}),\ldots , \mu _{A_{jn}}(x_{pn})), \qquad \nonumber \\ j= & {} 1,\ldots , L \end{aligned}$$
    (2)
  2. 2.

    Association degree To compute the association degree of the pattern \(x_p\) with the M classes according to each rule in the Rule Base. To this aim, a combination operator h, is applied to combine the matching degree with the rule weight (RW). In our case, this association degree only refers to the consequent class of the rule (i.e., \(k = \text{ Class }(R_j))\).

    $$\begin{aligned} b^k_j= & {} h\left( \mu _{A_j} (x_p),\text{ RW }^k_j \right) , \quad \nonumber \\ k= & {} 1,\ldots ,M; \qquad j = 1,\ldots , L \end{aligned}$$
    (3)
  3. 3.

    Pattern classification soundness degree for all classes We use an aggregation function f, which combines the positive degrees of association calculated in the previous step.

    $$\begin{aligned} Y_k = f\left( b^k_j , j = 1,\ldots , L \text{ and } b^k_j > 0\right) , \qquad k = 1,\ldots ,M \end{aligned}$$
    (4)
  4. 4.

    Classification We apply a decision function F over the soundness degree of the system for the pattern classification for all classes. This function will determine the class label l corresponding to the maximum value.

    $$\begin{aligned} F(Y_1,\ldots , Y_M) = \arg \max (Y_k), \qquad [k=1,\ldots ,M] \end{aligned}$$
    (5)

where L denotes the number of rules in the Rule Base and M the number of classes of the problem.

2.3 Baseline fuzzy classifier: FARC-HD algorithm

The great challenge for IDS is to develop a system that can guarantee both a high attack detection rate and a low false alarm rate. In order to overcome this, Data Mining techniques, especially Soft Computing and Computational Intelligence approaches, have become essential pieces for addressing this problem (Wu and Banzhaf 2010).

Among these, FRBCSs (Ishibuchi et al. 2004) have emerged as a valuable solution in accordance with their inner properties. First, the use of fuzzy labels allows a smoother management for the user profile, i.e., decreasing false alarms. Additionally, security itself includes fuzziness in the definition between normal and abnormal behavior. Another positive feature for FRBCSs is their extension to EFSs (Fernandez et al. 2015), i.e., the hybridization between Fuzzy Systems and Evolutionary Algorithms (Eiben and Smith 2003), which leads to a leap of quality as the fuzzy system is adapted to the context of the problem.

Among different methods based in EFSs for their application in IDS, in our last work on the topic, we have shown the goodness of the FARC-HD approach (Alcalá-Fdez et al. 2011; Elhag et al. 2015) over the most representative algorithms of this paradigm. FARC-HD algorithm is based on association discovery (Zhang and Zhang 2002), whose integration with classification allows the achievement of precise and interpretable models. In addition, it has shown to be quite robust for high dimensional problems, a feature that is very significant in the case of IDS (Bostani and Sheikhan 2017).

Fig. 1
figure 1

Learning stages for the FARC-HD algorithm

In summary, the FARC-HD method is based on the following three stages (as depicted in Fig. 1):

  • Stage 1Fuzzy association rule extraction for classification A search tree is employed to list all possible frequent fuzzy item sets and to generate fuzzy association rules for classification, limiting the depth of the branches in order to find a small number of short (i.e., simple) fuzzy rules.

  • Stage 2Candidate rule pre-screening Afterward, the rule generation, the size of the rule set can be too large to be interpretable by the end user. Therefore, a pre-selection of the most interesting rules is carried out by means of a “subgroup discovery” mechanism based on an improved weighted relative accuracy measure (wWRAcc’) (Kavsek and Lavrac 2006).

  • Stage 3Genetic rule selection and lateral tuning Finally, in order to obtain a compact and accurate set of rules within the context of each problem, an evolutionary process will be carried out in a combination for the selection of the rules with a tuning of membership function, as its positive synergy has been shown in previous work on the topic (Alcala et al. 2007; Casillas et al. 2005).

3 Proposed methodology: a multi-objective evolutionary fuzzy system approach for IDS

The premise behind our research is to develop a misuse detection tool that may provide several advantages in this area. Specifically, we have focused on three different issues that are of high interest for the community in IDS. First, the output model must be interpretable by the final user, i.e., it should easily allow to explain the phenomena that has been detected. Second, it must have the ability to adapt to different requirements. Finally, it should be able not only to recognize an suspicious behavior, but also to identify the particular type of attack with a both good recall and precision.

To accomplish the first goal, we will use as baseline classifier the FARC-HD approach. As described in the previous section, this FRBCS is based on linguistic fuzzy labels, which enable rules’ semantic to be closer to human natural language (Gacto et al. 2011). In addition to the former, during the learning stage of FARC-HD, the maximum total length of the final rules can be limited to few antecedents. In case of problems with a large number of variables, this is a clear advantage for the compactness of the output model.

The second and third issues are closely related. We must take into account that in the area of IDS there are many possible ways to determine the quality of the software system, which will be described with detail in Sect. 4.3). Therefore, for the user, it might be important to decide the desired main objectives of the IDS among all the possible metrics of performance (Kudlacik et al. 2016). However, most of these may be in conflict one with another. A clear main example is the achievement of a high attack detection rate and a low false alarm rate. Therefore, our proposed approach must be able not only to derive a different behavior according to the user requirements, but also to be robust enough even in case of conflicting objectives.

In accordance with the former, we must tune FARC-HD toward the achievement of the highest results for a given set of IDS metrics of performance. A straightforward way for addressing the former task is by carrying out an optimization of the initial FARC-HD RB so that we may accomplish the user objectives with a high-quality fuzzy system.

Considering that we must manage a multi-objective problem, the most straightforward way to carry out this process is to build a single function that aggregates every objectives. However, this implies two main drawbacks: (1) to determine the optimal weight for each one of the objectives; and (2) it is only possible to obtain a single compromise solution.

For this reason, in these cases, it is more reliable to proceed via an MOEA, i.e., a Pareto-optimization (Coello-Coello et al. 2007). It allows the optimization of several objectives by evolving a set of non-dominated solutions. This “non-dominance” term refers to the case when a given solution \(s_1\) is equal or better than another \(s_2\) for all objectives but one. For example, in a two-objective minimization problem, \(s_1\) and \(s_2\) are considered to be a couple of non-dominated solutions when \(s_1 = (0.0, 0.5)\) and \(s_2 = (0.1, 0.3)\); on the contrary, \(s_1\) dominates \(s_3\) when \(s_3 = (0.1, 0.6)\).

There are two main advantages related to this procedure. The first one is to make the search space broader, thus resulting in a better exploration ability. The second one is providing a set of solutions from which the user or expert can select the one that is better suited to the system’s requirements. We must stress that both issues cover our initial premises for obtaining a robust IDS.

As stated previously, our proposal is acting on the optimization procedure of FARC-HD, namely the “Genetic rule selection and lateral tuning” that is carried out in the last stage of the FARC-HD algorithm (Alcalá-Fdez et al. 2011). Instead of carrying out a single evolutionary process, we substitute it with an MOEA approach that is able to generate different Knowledge Bases in accordance with the objectives selected.

For developing our model, we will make use of the NSGA-II algorithm (Deb et al. 2002), as it is widely known for being a high-performance MOEA. Its two main features are first the fitness evaluation of each solution based on both the Pareto ranking and a crowding measure, and the other is an elitist generation update procedure.

In order to codify the solutions, and following the same than in the original optimization process for FARC-HD, we will make use of a chromosome with two well differentiate parts: one (RS) for the rule selection and another one (DB) for the tuning of the Data Base:

  • The first part will have a binary codification. The length of this part of the solution will be the number of rules selected in Stage 2 (pre-screening) of the FARC-HD process. Therefore, a 0 value means that the rule will not take part for generating the classification model, whereas a 1 value stands for the opposite case.

  • The second part will have a real codification in the range [0, 1]. There will be as many genes as different labels contained in the Data Base. Following the 2-tuples linguistic representation (Alcala et al. 2007; Fernández et al. 2010; Herrera and Martínez 2000), values in [0.0, 0.5) imply a lateral displacement to the left, whereas values in the (0.5, 1] are displacements to the right in the original universe of discourse. Accordingly, the initial state of the linguistic label corresponds to the value 0.5.

Chromosomes will be evaluated jointly with aims at obtaining the best synergy between both characteristics, instead of optimizing them separately. This issue is based on the fact that it is not clearly defined which the best order for carrying our both processes is. An initial chromosome will be built with all binary genes equal to ‘1’ and real genes to ‘0.5’ in order to implement the standard case study, i.e., the original Knowledge Base, whereas the remaining individuals will be generated at random.

In order to proceed with the evaluation of the chromosomes, the decoding implies the generation of a new Knowledge Base from the configuration parameters given in the solution. Then, this model will be applied to training set in order to obtain the values for the performance metrics selected as objectives. The list of most commonly used performance metrics in the context of IDS will be introduced in Sect. 4.3.

Finally, for the sake of allowing a better exploitation in the search, we have limited the number of objectives to two. In the experimental study (Sect. 5), we will present several case studies from which we can conclude the best synergy between the selected performance metrics.

4 Experimental framework

This section is devoted to establish the configuration for the experiments in order to carry out a thorough analysis. With this aim, Sect. 4.1 introduce the features of the selected benchmark problems. The learning algorithms and their configuration parameters will be presented in Sect. 4.2. Finally, the metrics of performance applied to compare the results obtained with the different classifiers are described in Sect. 4.3.

4.1 Benchmark data: IDS problems

Among different benchmark problems for IDS, the KDDCUP’99 dataset is possibly the widest used one, being a standard until today (Benferhat et al. 2013; Khor et al. 2012; Chung and Wahid 2012). It was obtained by the Information System Technology (IST) group of Lincoln laboratories at MIT university under contract of DARPA and in collaboration with ARFL (Lee and Stolfo 2000). It consisted of an environment of a local area network (LAN) that simulates a typical U.S. Air Force LAN, including several weeks of raw TCP dump data with normal activities and various types of attacks.

It comprises 41 attributes in total, which are divided three main groups: intrinsic features (extracted from the headers’ area of the network packets), content features (extracted from the contents area of the network packets), traffic features (extracted with information about previous connections).

Class labels are divided into normal and attack activities. This last class can be further divided into particular types of attack, which are basically grouped into four major categories, namely:

  • Denial of Service (DOS): make some machine resources unavailable or too busy to answer to legitimate users requests (SYN flooding).

  • Probing (PRB): Surveillance for information gathering or known vulnerabilities about a network or a system (port scanning).

  • Remote To Local (R2L): use vulnerability in order to obtain unauthorized access from a remote machine (password guessing).

  • User To Root (U2R): exploit vulnerabilities on a system to gain local super-user (root) privileges (buffer overflow attack).

From this classical and well-known dataset, several alternative problems have been generated trying to both overcome some of the limitations from the original data (such as the number of redundant records), and to update its inner characteristics for novel IDS requirements. Specifically, we have made use of the NSL-KDD (Tavallaee et al. 2009) and the Gure-KDDCUP (Perona et al. 2008) datasets.

The NSL-KDD dataset contains an informed subsample from the original KDDCUP’99 dataset. First, all redundant records in the train and test sets were removed, so that classifiers will not be biased toward more frequent records. Then, instances were selected in accordance to their difficulty level, whose value is computed regarding the number of correct hits obtained by 7 different learners, each of which executed three times. The number of attributes and classes remain the same with respect to the KDDCUP’99 dataset.

Gure-KDDCUP problem is a novel IDS dataset which followed an extraction process similar to that of KDDCUP’99. In this sense, authors processed “TCP dump” files with bro-idsFootnote 1 and stored each connection with its attributes. Finally, connections were labeled based on the connections-class files (“tcpdump.list”) provided by MIT.

In the NSL-KDD dataset, the number of records in the train and test sets are reasonable (1,074,992 records), making affordable to run the experiments on the complete set. However, in both KDDCUP’99 and Gure-KDDCUP problems, the total amount of data places them in the context of Big Data (Fernandez et al. 2014), i.e., affecting the scalability of current approaches. For this reason, usually a small portion of the whole data is randomly selected for its use with standard classifiers.

In accordance with the former, for the KDDCUP’99 dataset we will select just a 10% of the instances for our experiments. This implies a total of 494,021 connections. Then, we have also removed all duplicated instances, reducing the data to a total of 145,585 examples.

Analogously to the previous case, authors of the Gure-KDDCUP dataset made available a subset with a 6% of the records. This sample contains all no-flood attacks matched with tcpdump.list and a random subsample of normal connections matched with tcpdump.list, thus including 178,858 connections. It is important to point out that, on the contrary to KDDCUP’99 and NSL-KDDCUP problem, this particular case study does not contain any record for the PRB attack type.

Finally, in order to carry out a validation procedure of the results, we have selected a hold-out methodology. Specifically, we will employ a 10% of the datasets for training and the remaining 90% for test. However, in order to take into account the original distribution of classes, we will include a 50% of instances for U2R in both training and test. Table 1 shows the final distribution of examples for each partition/class.

Table 1 Number of examples per class in each dataset partition for KDDCUP’99, NSL-KDD and Gure-KDDCUP problems
Table 2 Case studies for different metrics of performance selected as objectives within the MOEA approach

4.2 Algorithms and parameters

In this paper, we have considered several algorithms for a fair analysis of the behavior of our proposal. The choice of FARC-HD (Alcalá-Fdez et al. 2011) as the baseline classifier makes mandatory to include its original version for this study. Additionally, we must make use of its multi-classifier extension, i.e., FARC-HD-OVO (Elhag et al. 2015), as it was shown to be one of the most accurate EFSs for the task of intrusion detection. Finally, we will include C4.5 (Quinlan 1993) in the experimental study as a state-of-the-art rule induction algorithm. In what follows, we detail the configuration of the parameters for each approach:

  1. 1.

    FARC-HD (Alcalá-Fdez et al. 2011) First, we have selected 5 labels per variables for the fuzzy sets, product t-norm as conjunction operator and additive combination for the inference procedure. As specific parameters of the learning stage, we have set up the minimum support to 0.05 and the minimum confidence to 0.8. Finally, we have fixed the maximum depth of the tree to a value of 3, and the k parameter for the pre-screening to 2. For more details about these parameters, please refer to Alcalá-Fdez et al. (2011).

    We must stress that this configuration will be shared for all three models based on FARC-HD, i.e., the standard approach, FARC-HD-OVO, and our proposed model FARC-HD-MOEA.

  2. 2.

    FARC-HD-OVO (Elhag et al. 2015) The learning procedure will be performed using all possible pairs of classes. In order to aggregate the outputs of each binary classifier into a single solution, we will make use of the preference relations solved by Non-Dominance Criterion (ND) (Fernández et al. 2010).

  3. 3.

    FARC-HD-MOEA The parameters of the NSGA-II MOEA have been set up as follows: 50 individuals as population size, with 20000 generations. The crossover and the mutation (per gen) probabilities are 0.9 and 0.025, respectively.

  4. 4.

    C4.5 (Quinlan 1993) For C4.5 we have set a confidence level of 0.25, the minimum number of item-sets per leaf was set to 2 and the application of pruning was used to obtain the final tree. We must point out that, for the sake of allowing the output model to be compact and interpretable, we have carried out an extensive pruning. Specifically, we have limited the maximum depth of the tree to 3. Therefore, rules obtained from C4.5 will be of the same length than those learned by the FARC-HD algorithms, establishing a fair comparison between both techniques.

4.3 Performance metrics for IDS

In the specialized literature for IDS in general, and for misuse detection in particular, authors have made use of several metrics of performance for the evaluation of their results in comparison with the state-of-the-art. In this paper, we have selected different measures which will allow us to analyze the behavior of our approach under several perspectives:

  1. 1.

    Accuracy It stands for the global percentage of hits. In our case (IDS), its contribution is low as it does not take into account the individual accuracies of each class, but it has been selected as a classical measure.

    $$\begin{aligned} \text{ Acc } = \frac{\sum _{i=1}^{C}{TP_i}}{N} \end{aligned}$$
    (6)

    where C stands for the number of classes, N stands for the number of examples and \(\text{ TP }_i\) is the number of True Positives of the ith class.

  2. 2.

    Mean F-measure In the binary case, the standard f-measure computes a trade-off between precision and recall of both classes. In this case, we compute the average for the F-measure achieved for each class (taken as positive) and the remaining ones (taken as a whole as negative):

    $$\begin{aligned} \text{ MFM }= & {} \frac{\sum _{i=1}^{C}{\text{ FM }_i}}{C} \end{aligned}$$
    (7)
    $$\begin{aligned} \text{ FM }_i= & {} \frac{2 \cdot \text{ Recall }_i \cdot \text{ Precision }_i}{\text{ Recall }_i + \text{ Precision }_i} \end{aligned}$$
    (8)
    $$\begin{aligned} \text{ Precision }_i= & {} \frac{\text{ TP }_i}{\text{ TP }_i + \text{ FP }_i} \end{aligned}$$
    (9)
    $$\begin{aligned} \text{ Recall }_i= & {} \frac{\text{ TP }_i}{\text{ TP }_i + \text{ FN }_i} \end{aligned}$$
    (10)

    where \(\text{ TP }_i\), \(\text{ FP }_i\) and \(\text{ FN }_i\) are the number of true positives, false positives and false negatives of the ith class, respectively, percentage).

  3. 3.

    Average accuracy It is computed as the average of the individual hits for each class. For this reason, it is also known as the average recall:

    $$\begin{aligned} \text{ AvgAcc } = \frac{1}{C}\sum _{i=1}^{C}{\text{ Recall }_i} \end{aligned}$$
    (11)
  4. 4.

    Attack accuracy In this case we omit the “Normal” instances and we focus in checking whether we guess correctly the different “Attack” types individually.

    $$\begin{aligned} \text{ AttAcc } = \frac{1}{C-1}\sum _{i=2}^{C}{\text{ Recall }_i} \end{aligned}$$
    (12)

    In this case, the first class (\(i=1\)) is considered to be the “Normal” class.

  5. 5.

    Attack detection rate It stands for the accuracy rate for the attack classes. Therefore, it is computed as:

    $$\begin{aligned} \text{ ADR } = \frac{\sum _{i=2}^{C}{\text{ TP }_i}}{\sum _{i=2}^{C}{\text{ TP }_i+\text{ FN }_i}} \end{aligned}$$
    (13)

    Reader must take into account that also in this case, the first class (\(i=1\)) is considered to be the “Normal” class.

  6. 6.

    False alarm rate In this case, we focus on the “Normal” examples, and we check which is the percentage of “false negatives” found, i.e., those instances identified as “alarms” but which are actually normal behavior.

    $$\begin{aligned} \text{ FAR } = \frac{\text{ FP }_1}{\text{ TP }_1 + \text{ FP }_1} \end{aligned}$$
    (14)

    As in the former metric (ADR), the “Normal” class has the first index (\(i=1\)).

Table 3 Experimental results in CASE STUDY 1 (AttDet vs. FAR)
Table 4 Experimental results in CASE STUDY 2 (AvgAcc vs. FAR)
Table 5 Experimental results in CASE STUDY 3 (AttAcc vs. FAR)
Table 6 Experimental results in CASE STUDY 4 (MfM vs. FAR)
Table 7 Experimental results in CASE STUDY 5 (MfM vs. AvgAcc)

5 Experimental study

This section includes the experimental results and the analysis of the former to support the goodness of our proposed approach. In particular, we will first perform a selection of the best combination of objectives to carry out the search, i.e., those whose synergy allows a better exploration of the search space and thus show an good overall performance (Sect. 5.1). Then, once the best parameters have been found, we will contrast the behavior of this novel methodology with that of the state-of-the-art for rule learning and fuzzy rule learning, i.e., the C4.5 decision tree (Quinlan 1993), the original FARC-HD algorithm (Alcalá-Fdez et al. 2011), and FARC-HD-OVO (Elhag et al. 2015) (Sect.  5.2). Finally, we provide additional information for the experimental study by showing the whole Pareto fronts for all case studies in training and test, as well as the confusion matrices in the test partition (Sect. 5.3).

5.1 Analysis of the best objectives’ combination

The field of IDS has several metrics of performance depending on the user’s requirements. Section 4.3 introduced the main objectives that are commonly used to measure the quality of the designed models to address this task. In this work, we do not seek a particular goal, but an “multi-purpose” method that will be able to adapt well under several scenarios, trying to maximize the precision and recall among all classes, regardless of their type.

According to the previous fact, our first step is to find the better combination of objectives that leads to a robust average performance. Among all available metrics, we consider that the “False Alarm Rate” (FAR) is the mandatory one to control the bias toward the majority class, i.e., Normal activity. Therefore, we will carry out the combination for all the remaining metrics together with the former one. Additionally, we will also combine the “mean F-measure” with the “average accuracy” as both try to balance the correct recognition among all classes. Specifically, all combinations are summarized in Table 2. We must point out that the standard accuracy was not considered as it does not take into account the individual rates for the different concepts of the problem.

Tables 3, 4, 5, 6 and 7 show the experimental results for the training and test partitions of the three selected problems, namely KDDCUP’99, NSL-KDDCUP and Gure-KDDCUP. All performance measures are considered, as introduced in Sect.  4.3: Accuracy (Acc), Mean F-measure (MFM), Average Accuracy (AvgAcc), Attack average accuracy (AttAcc), Attack Detection Rate (ADR), and False Alarm Rate (FAR). Since we must extract a single KB from a single solution of the final set, we have chosen three different points from the Pareto. Specifically, we have considered the maximal values for each objective, as well as the “knee point” (Branke et al. 2004), since this solution is likely to be the most relevant to the decision maker, as it represents the compromise between both objectives.

Table 8 Complete experimental results for our proposed approach (FARC-HD-MOEA (MfM+FAR)) versus the state-of-the-art (FARC-HD, FARC-HD-OVO, and C4.5) over the reduced KDDCUP’99 dataset for different metrics of performance: accuracy (Acc), mean F-measure (MFM), average accuracy (AvgAcc), attack average accuracy (AttAcc), attack detection rate (ADR), and false alarm rate (FAR)
Table 9 Complete experimental results for our proposed approach (FARC-HD-MOEA (MfM \(+\) FAR)) versus the state-of-the-art (FARC-HD, FARC-HD-OVO, and C4.5) over the NSL-KDD dataset for different metrics of performance: accuracy (Acc), mean F-measure (MFM), average accuracy (AvgAcc), attack average accuracy (AttAcc), attack detection rate (ADR), and false alarm rate (FAR)
Table 10 Complete experimental results for our proposed approach (FARC-HD-MOEA (MfM \(+\) FAR)) versus the state-of-the-art (FARC-HD, FARC-HD-OVO, and C4.5) over the Gure-KDDCUP dataset for different metrics of performance: accuracy (Acc), mean F-measure (MFM), average accuracy (AvgAcc), attack average accuracy (AttAcc), attack detection rate (ADR), and false alarm rate (FAR)
Table 11 Comparison of number of rules (#Rules) and average number of antecedents (#Avg. Ant.) for the algorithms selected in the experimental study

We acknowledge that if we focus on individual measures, the case study that has selected it for the learning process will achieve unequivocally the highest results. It is interesting to point out that, when the “False alarm rate” metric is selected in combination with different metrics, the obtained results vary. This fact implies the benefits of the use of the MOEA to seek for a wide variety of IDS depending on the desired behavior.

Among all case studies that have been analyzed, the one that probably shows the best results is the combination between “mean F-measure” and “False alarm rate.” Specifically, if we focus on the values obtained using the “best MfM,” we must highlight a superior behavior in most of the metrics, or at least a similar performance, thus stressing the advantage of this configuration for the search procedure.

5.2 Comparison versus the state-of-the-art

Once we have selected the best combination of objectives for the genetic learning, we will analyze the goodness of this approach versus the methods from the state-of-the-art selected in the experimental framework, i.e., the original FARC-HD (Alcalá-Fdez et al. 2011) and FARC-HD-OVO (Elhag et al. 2015), and the C4.5 decision tree (Quinlan 1993). Experimental results are divided with respect to the benchmark IDS problem. Performance values for the KDDCUP’99 dataset are shown in Table 8, for the NSL-KDD dataset in Table 9, and for the Gure-KDDCUP in Table 10.

Table 12 Confusion Matrix in the test partition for the FARC-HD-MOEA approach (MfM vs. FAR, best solution for MfM) in the KDDCUP’99 dataset
Table 13 Confusion Matrix in the test partition for the FARC-HD-MOEA approach (MfM vs. FAR, best solution for MfM) in the NSL-KDD dataset
Fig. 2
figure 2

Pareto front obtained in the test stage with FARC-HD-MOEA approach. Objectives selected during the search were the mean F-measure (MfM) and the false alarm rate (FAR). a Pareto front in KDDCUP’99 dataset with FARC-HD-MOEA approach. b Pareto front in KDDCUP’99 dataset with FARC-HD-MOEA approach

Fig. 3
figure 3

Pareto front obtained in the test stage with FARC-HD-MOEA approach. Objectives selected during the search were the mean F-measure (MfM) and the false alarm rate (FAR). a Pareto front in KDDCUP’99 dataset with FARC-HD-MOEA approach. b Pareto front in NSL-KDD dataset with FARC-HD-MOEA approach

For all selected problems, FARC-HD-MOEA improves the results of FARC-HD is most of the considered metrics of performance. We must recall that the same configuration is shared by both approaches. In this sense, the initial KB obtained afterward stages 1 and 2 (refer to Sect. 2.3) will be the same for both models. This fact suggests the goodness in the design and capabilities of the proposed MOEA optimization procedure versus the standard Genetic Algorithm when dealing with IDS problems. In particular, we must stress the differences with respect to the values of the mean f-measure, average accuracy, and attack accuracy are especially remarkable, improving up to 10–15 points in some cases.

Fig. 4
figure 4

Pareto front obtained in the test stage with FARC-HD-MOEA approach. Objectives selected during the search were the mean F-measure (MfM) and the false alarm rate (FAR). a Pareto front in KDDCUP’99 dataset with FARC-HD-MOEA approach. b Pareto front in Gure-KDD dataset with FARC-HD-MOEA approach

Regarding the comparison versus FARC-HD-OVO, we must stress that the performance values are more similar in this case. However, there is a clear advantage of our novel proposed technique, which relies in the use of a single classifier, instead of a whole ensemble. This fact avoids a combination among different outputs during the inference, which can degrade the system response.

When our proposed FARC-HD-MOEA is contrasted versus the C4.5 decision tree, we observe an interesting behavior. Whereas global metrics of performance such as accuracy and/or attack detection rate are usually higher for C4.5, the goodness of FARC-HD-MOEA lies in the ability of providing a good average recognition, if we focus on the mean f-measure and the average accuracy. Furthermore, the false alarm rate obtained with our approach is always below the 1%, whereas C4.5 raises this value for the NSL-KDD dataset up to the 4% (Tables 111213).

In accordance with the whole analysis that has been carried out, we must stress that our FARC-HD-MOEA proposal is a robust choice for the IDS problem. Its main advantage in contrast to the remaining methods is its ability to achieve a good trade-off between recall (average accuracy) and precision (mean F-measure). This issue implies that our approach reaches a high average performance for all concepts/classes of the problem. Additionally, it enhances the false alarm rate from the baseline FARC-HD approach, also maintaining a very close value compared to that of the remaining algorithms. Finally, we have observed that a low number of simple (compact) linguistic rules are enough to cover the whole problem space accurately.

Table 14 Confusion Matrix in the test partition for the FARC-HD-MOEA approach (MfM vs. FAR, best solution for MfM) in the GURE-KDD dataset

5.3 Complementary results: Pareto front of solutions and confusion matrices

For the sake of complementing this study, we show in Figs.  2, 3 and 4 the complete Pareto front obtained from the genetic optimization for the KDDCUP’99, NSL-KDD and GURE-KDD datasets. In all cases of study, we may observe a wide amount of non-dominated solutions from both the training and test sets, all of which are homogeneously distributed in the solution space. This issue reflects the good properties of the search procedure, as it covers a wide amount of different cases from which the expert can select the most appropriate one for a desired profile of behavior.

Finally, we include in Tables 12, 13, 14 the confusion matrices obtained in the test partition for FARC-HD-MOEA algorithm for all three IDS problems. This is done with aims at showing additional information for the experimental results, so that any interested research could reproduce and extend the current study for additional future work.

6 Concluding remarks

In the context of IDS, profiles may change depending on the users’ requirements. In this sense, we must usually face confronting objectives for the working procedure of this type of system, mainly between achieving a low number of false alarms, and a good average recognition for different types of attacks.

With this aim, in this research, we have proposed the integration of a MOEA within a linguistic fuzzy association rule mining, the FARC-HD algorithm. Specifically, the genetic optimization has been focused on the last stage to carry out the rule selection and data base tuning. The goodness for the use of the MOEA is related to the simultaneous optimization of different metrics of performance in the scenario of IDS. The aim for this procedure is being able to both extending the search space and obtaining a wide amount of accurate solutions. By doing so, the final user may select the most suitable classification system for the current work context.

On the first part of our analysis, we have considered several case studies depending on the combination of metrics for the learning process. Among them, we have found out that the synergy between the mean F-measure and the false alarm rate was the most relevant for achieving good average results under the different IDS benchmark problems. Nevertheless, we must recall any other configuration could be also valuable according to the final requirements of the application.

Finally, the comparison of this approach versus the state-of-the-art, which included both the original FARC-HD classifier, FARC-HD with OVO, and the C4.5 decision tree, supported the high quality of our novel methodology. In particular, we remarked the good trade-off obtained between precision and interpretability for all cases of study, in accordance with the length of the RB and average antecedents per rule.