1 Introduction

With the growth of cyber threats and attacks in computer networks, finding modern methods to detect intrusions and to defend against these threats has become a significant challenge in computer systems [1, 2].

In general, intrusion detection systems are responsible for identifying any unauthorized use of system resources, as well as abuse or harm caused by both internal and external users. Intrusion detection and prevention is one of the main mechanisms for securing computer systems and is generally deployed alongside firewalls as a complement to them [3, 4].

Generally, an intrusion detection system can be classified based on its intrusion detection method, architecture, and type of intrusion response in the network [4,5,6]. Basically, intrusion detection systems provide three general functions: monitoring and evaluating, detecting, and responding to security threats. Moreover, the main types of intrusion detection methods are anomaly (abnormal behavior) detection and misuse detection (i.e., signature-based detection). There are also several types of architectures for intrusion detection systems, which can be classified into three classes: host-based (HIDS), network-based (NIDS), and distributed (DIDS) intrusion detection systems. While a firewall simply allows traffic to pass toward its destination, intrusion detection technologies perform complex analyses of network threats and vulnerabilities, and they can detect and remove attacks hidden within the legitimate traffic that passes the firewall [7]. Although intrusion prevention approaches exist, they cannot always resist all available cyber-attacks [8]. The task of intrusion detection systems in computer systems is to detect threats and anomalies and to notify system administrators so that they can take appropriate actions to eliminate and prevent them. In its simplest form, an intrusion detection system monitors a variety of heterogeneous resources in the system and collects data from them in order to detect anomalies and threats [9, 10].

Among the existing cyber security attacks, the advanced persistent threat (APT) attack is one of the newest and most modern cyber security attacks and has claimed many individuals and organizations as victims. This attack consists of a set of complex and long-term actions taken against specific individuals, organizations, or companies [11]. In a typical cyber security attack, an attacker tries to quickly enter the network, steal information, and exit the network so that intrusion detection systems are less likely to detect the attack. However, in an APT attack, the goal is to achieve continuous access to the system and its data. To intrude into computer systems through unknown access paths, an attacker must constantly rewrite code and apply sophisticated evasion methods. Some complex APT attacks require full-time monitoring and management. Therefore, such attacks usually target defense and financial organizations that hold confidential and sensitive data and information. APT attacks usually infiltrate the network through a spear-phishing attack, which is a social engineering style of attack. The next step in this attack is to find a valid username and password to log in to software systems [12,13,14].

In general, an APT attack involves complex efforts carried out by a group of hackers focusing on a single goal. The aim of these efforts is to penetrate software systems and remain undetected for a long time with the lowest possible level of traceability [11, 14, 15]. No single technology or process can stop APT attacks, and traditional security methods are not able to defend against them. Therefore, different layers of defense, threat intelligence, and advanced skills are needed to counter APT attacks.

The cyber-kill chain is a continuous model and process that demonstrates how activities related to malicious targets are carried out. This chain penetrates computer networks as a specific sequence, and if any of the steps in this chain is blocked, the attack fails [16]. In this paper, an APT detection approach based on the kill chain model, called “APT-Dt-KC”, is proposed for detecting APT attacks by adapting the cyber-kill chain model to the fuzzy features of APT attacks. In fact, this paper proposes a more effective, more accurate, and earlier approach to detecting an APT attack by classifying security alerts based on the cyber-kill chain model. In this regard, the Pearson correlation test is used to correlate and pre-process a wide range of data. Then, a Bayesian algorithm is applied to train and evaluate the threshold values, and finally, fuzzy prioritization is performed through a fuzzy analytic hierarchy process to detect and classify all types of attacks. Therefore, the main contributions of this paper are summarized as follows:

  • We apply Pearson correlation test in APT-Dt-KC to correlate and preprocess data to reduce the amount of processing data in the process of detecting an APT attack. To the best of our knowledge, this method has not yet been used to preprocess a wide range of data in the detection of APT attacks based on the cyber-kill chain model.

  • We exploit a new hybrid approach based on the cyber-kill chain model to detect APT attacks. In APT-Dt-KC, the Bayesian classification algorithm is used to train and evaluate threshold values in the classification structure. The fuzzy analytic hierarchy process is also employed for fuzzy prioritization and attack classification.

  • We evaluate the performance of APT-Dt-KC in terms of intrusion detection rate, training time, accuracy, and computational efficiency with DT-EnSVM approach in [17] by performing several experiments using KDD Cup 99 dataset. DT-EnSVM is a support vector-based approach which combines both the ensemble learning and data transformation techniques in the process of APT attack detection.

The rest of the paper is organized as follows: In Sect. 2, the concepts and definitions of APT attacks and the cyber-kill chain model, as well as the methods applied in this paper, are presented. In Sect. 3, the existing approaches for network intrusion detection, especially techniques for detecting APT attacks, are explained. In Sect. 4, the proposed approach (i.e., APT-Dt-KC) for detecting APT attacks using the cyber-kill chain model is explained in more detail. In Sect. 5, the proposed approach is simulated, and the experimental results are compared with the DT-EnSVM approach in [17]. Finally, in Sect. 6, after a general conclusion, suggestions are presented as future works.

2 Concepts and definitions

This section describes the concepts and definitions related to the APT attack, the cyber-kill chain model, and the applied techniques in this paper.

2.1 Advanced persistent threat (APT) attack

The APT attack uses a variety of techniques to infiltrate the system and collect sensitive information. In this type of attack, when an attacker realizes that he has been detected, he changes the method of attack and uses other methods to infiltrate the system [15]. Another feature of this attack is its complexity, including the abuse of zero-day vulnerabilities. This attack utilizes various social engineering techniques to infiltrate the system [18]. A zero-day attack is a computer attack that exploits an unknown vulnerability in the software. This means that a zero-day exploit (i.e., software that uses a security hole to launch an attack) is used or shared by attackers before the developer of the target software becomes aware of the vulnerability. Zero-day attacks occur during the vulnerability window period. The vulnerability window includes the following steps [15, 18]:

  • Developers produce software that contains an unknown vulnerability.

  • Attackers discover the vulnerability before developers take any action to fix it.

  • An attacker writes exploits for vulnerabilities that are either unknown to developers or known but not yet completely closed.

  • Developers become aware of the vulnerability and try to eliminate it by developing a fix.

  • The fix becomes freely available; plug-in components and customization of the detection process can then be applied by the service provider to reduce the vulnerability without restriction.

Conceptually, when an event caused by a zero-day attack occurs, users who make custom improvements to cover the vulnerability and effectively close the damage window differ from users who cover the vulnerability with ready-made software. Meanwhile, some users may not be aware of the vulnerability at all, so their systems remain unprotected. Therefore, the length of the vulnerability window depends on how quickly each user covers the vulnerability. It is difficult to measure the length of the vulnerability window because attackers do not report a vulnerability when they discover it, and developers do not want this information to be released for commercial or security reasons. Developers may also be unaware that they are being attacked through the zero-day vulnerability while repairing it. In general, the specific characteristics of an APT attack include achieving its targets, using complex techniques, having a good chance of exploiting zero-day vulnerabilities, intruding on the target continuously, and remaining established for as long as possible [15, 18, 19].

2.2 Cyber kill-chain model

The cyber-kill chain model is developed to define the various stages of a cyber-attack. This model can be used to analyze, detect, and prevent different types of cyber security attacks [20, 21]. In general, the cyber-kill chain model has seven phases: (1) Reconnaissance, (2) Weaponization, (3) Delivery, (4) Exploitation, (5) Installation, (6) Command and Control, and (7) Actions on Objectives [20,21,22,23,24,25]. As shown in Fig. 1, these seven consecutive steps in the cyber-kill chain provide information on the adversary’s tactics, techniques, and procedures (TTPs).

Fig. 1 Lockheed Martin cyber kill chain model [22]

2.2.1 Reconnaissance

Reconnaissance is the first phase of the cyber-kill chain model, during which an attacker collects network or endpoint information about a target. The endpoint can be an individual, an organization, or part of a target network's hardware/software. In this phase, the attacker performs hidden investigations about the existence of the target and identifies potential methods and ways to compromise the network. Investigations in this phase also provide information on what types of malicious objects can be deployed on the target network without being detected by the cyber-security defense. Furthermore, backdoors in the target network are located. In addition, attackers determine an appropriate set of intrusion targets located in the target network. If the purpose is to steal personal information, the attacker must identify a way to establish a bilateral link, so that he can first enter the network, find the information of interest, and then exfiltrate it from outside the network. If the purpose is to destroy the network, he can proceed in another way. Regardless, the attacker seeks to find the network's or users' vulnerabilities for illegal use and access. An example of a unilateral link is spam phishing email, which, in addition to its traditional purpose, is used to steal credentials, deliver malware, and send malicious files as attachments to a specific user. This example shows how an attacker exploits an email service to attack a network. An example of a bilateral link is finding open ports using a port scanner; after an open port is found, bilateral communication can be established through Telnet connections.

2.2.2 Weaponization

The second phase of the kill chain model is weaponization, during which the attacker creates a deliverable malware payload. The attacker uses the information collected during the reconnaissance phase to plan which vulnerability should be exploited, what payload should be delivered, and how it should be delivered through the discovered backdoor. There are two types of malware payloads in this phase [23]:

  • Malware payloads that do not require to communicate with the attacker, such as viruses and worms.

  • Malware payloads that need to communicate with the attacker to receive command and control signals or send the stolen information to the attacker.

Payloads of the second type are known as Remote Access Trojans (RATs). A RAT requires both a client and a command-and-control server. The RAT client is the destination that receives the actual malware payload and is configured to communicate with the command-and-control server, which is located on the Internet and controlled by the attacker. For example, in the reconnaissance phase, the attacker notices that the email system of a particular university does not allow sending and receiving *.exe files but does allow *.pdf files in emails. On the other hand, the attacker notices that the professors regularly receive and open PDF files emailed by their students. Therefore, the attacker creates a RAT with the ability to communicate with the command-and-control server and embeds it in a PDF file, called myCV.pdf, which is sent as an attachment via a phishing email.

2.2.3 Delivery

After the malware payload has been developed and the backdoor for payload reception has been identified, the delivery phase is performed. Malware can be delivered by tricking or forcing the user to interact with it, or it can be delivered automatically by exploiting weak points of protocols or software packages. For example, an email can carry an attachment file that delivers the malware payload. Delivery is an important part of ensuring a successful attack without being detected by the existing security mechanisms. Therefore, adversaries design their attacks in such a way that they are not tracked, and they must hide the source of the attack from security and forensics experts. In addition, adversaries use a variety of delivery methods to increase their success rate. It is very difficult to find exploitative malware that does not require user interaction, because such malware uses an inherent defect in the protocol, program, or software to deliver the payload. This inherent defect is called a software vulnerability and requires a software patch to reduce it [22].

2.2.4 Exploitation

After the successful delivery of the malware payload to the target computer, the exploitation phase begins by installing the malware inside the target computer. The following conditions must be met to begin the malware installation [24]:

  • Malware must have the necessary permissions to be installed on the target computer.

  • The target computer's operating system or software must be able to install the malware without additional requirements. For example, a malware built for the Linux operating system cannot be installed on a Windows operating system.

  • The anti-malware defense of the target computer should not be able to detect this malware; otherwise, the attack will fail due to broken cyber chains.

The exploitation phase does not actually perform the installation, but rather prepares the environment for the installation phase of the cyber-kill chain model. However, this phase is closely tied to the installation phase, since all of the installation prerequisites must be prepared by the exploitation phase. In order to deliver the malware payload for installation, there must be some form of software or hardware flaw that the malware payload can exploit for installation or execution. Such flaws are called Common Vulnerabilities and Exposures (CVE).

2.2.5 Installation

Computer infection starts during the installation phase. If the malware is an executable file, or the malicious activity is based on code injection or an internal threat, the installation phase is not needed. However, if the malware needs to be installed on the target computer, the delivery phase should place the dropper or downloader on the destination computer, and the exploitation phase should be completed by disabling security services and finding weaknesses in the operating system so that malware installation can begin. In this phase, the malware is installed, and the installed files either provide supporting libraries and operating system files or download those files through the downloader and dropper packages. In addition, the malware installation updates operating system files using authorized permissions. Malware changes the appearance of its files by changing the file format or hiding the files from user access. Advanced malware can also modify its memory footprint to prevent detection by sandbox algorithms or behavior-based anti-malware systems. The installation phase not only establishes the backdoor on the target victim but also ensures that the attackers are able to communicate permanently with the victim computer. It should be noted that this phase does not yet begin communicating with the command-and-control mechanism for malware activity. Moreover, this phase differs from the exploitation phase: the exploitation phase establishes that the malicious packages are ready to be installed and that all of its requirements are met, while in the installation phase the real malware payload starts to establish a local base inside the victim computer.

2.2.6 Command and control

The command-and-control phase of the cyber-kill chain model is necessary because of the following reasons [15]:

  • To steal information (i.e. passwords, financial data, intellectual property, etc.) from the target computer using some tools such as Key Loggers, Zeus, and Trojan.Coinbitclip.

  • To send instructions to the malware on the target computer, for example to connect the malware to other parts of the target computer, execute it, or enable encryption for ransomware activities.

It should be noted that endpoint defense mechanisms and network monitoring services play an important role in detecting illegal connections to the network, and this phase is the last step at which malicious activities can be prevented. There are two major types of command-and-control servers based on their communication:

  • Servers that collect meta-information about compromised nodes via proactive messages.

  • Servers that actively communicate with target nodes by issuing commands to the victim node to perform further malicious activities.

In addition, command-and-control servers can be classified into direct and indirect communication classes [15]:

  • In direct communication, the malware on the victim node contains a list of command-and-control server IP addresses, so that if a particular IP is blocked, the malware communicates with another one. This feature is called durability.

  • In indirect communications, attackers use legitimate intermediary nodes to communicate. A group of nodes is compromised to establish a communication link, while the source of the link is hidden from the victim's view. Thus, a botnet is created to re-establish the communication path from the victim to the source.

Attackers may have different ways to connect to the destination node. They may use email protocols for malware payload delivery. They may utilize one or more HTTP connections to establish an outbound link. In addition, they can use different compression mechanisms for data exfiltration. The attacker constantly changes his tactics and techniques to escape detection. However, the common feature of all these techniques is network traffic, and if the endpoint security mechanism fails to detect the presence of communicating malware, the network defense can do so [24].

2.2.7 Actions on objectives

This is the last phase of the cyber-kill chain model, and it is responsible for carrying out the attack against the targets. In this phase, if the malware is on the target computer, the execution of the malicious function starts either through the instructions of the command-and-control server or independently. This phase is also known as the explosive phase, in which the cyber-kill chain has been successfully completed. The following are the main classes of actions that can be taken in this phase:

  • Data Exfiltration: Stealing the confidential information from the network.

  • Ransomware: Attackers hold the victim's information hostage through encryption, block network resources, and demand a ransom in exchange for restoring access.

  • Cyber Terrorism: Attackers damage the system by deleting data or destroying files completely.

In general, the cyber-kill chain model is a continuous model that demonstrates how the attacker reaches the target node. This model is based on the hypothesis that attackers attempt to infiltrate computer networks in a sequential, incremental, and advanced manner. In this structured model, if any phase of the cyber-kill chain is blocked, the attack will not succeed. In fact, cyber-security experts use the cyber-kill chain model to seek early detection of cyber threats. The seven consecutive phases of the cyber-kill chain model provide information on the adversary's tactics, techniques, and procedures (TTPs) [25].

2.3 Bayesian inference and its relationship with the cyber-kill chain model

Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian classifiers are statistical classifiers built on Bayesian probability. The accuracy of a Bayesian classifier is high and can be significantly increased by using special functions. The learning method in this approach is supervised learning, and the approach treats the predictor variables as independent; therefore, it is called simple Bayes or Naïve Bayes [26].

Simple Bayesian classification requires separating the feature attributes, including identifying the features and their classes. After classification, the data form a set of feature attributes, and the learning samples are generated at the same time. The classification results obtained from the classifier are mainly determined by the feature attributes and the quality of the learning samples.

The behaviors of cyber-attacks can be modeled by exploiting the Bayesian classification approach. In this structure, malicious behaviors are used to indicate the events of a cyber-attack [11]. When n types of attack behaviors (behavior 1, behavior 2, …, behavior n) are captured, the probability of an APT attack is greater than or equal to Qn.

Generally, Bayesian inference system architecture consists of the following modules [11]:

  • Alarm Processing Module: The whole system, which runs constantly, is driven by alerts. The alarm processing module includes the alarm receiver.

  • Information Integration and Analysis Module: This module performs fuzzy search in parallel according to the received alert. Search results are stored in the internal cache. The module performs the corresponding functions, alarm classification, and priority allocation.

  • Bayesian Learning Module: Learning includes computing the sample statistics and tuning the Bayesian engine. The sample statistics include the prior probabilities of the APT and NOAPT classes, together with the conditional probabilities of an APT attack given single or multiple behaviors observed by the defender. Bayesian engine tuning involves the weight of each behavior and its participation in the evaluation of an APT attack. The learning module not only performs the learning task but also tests the system. The configuration information is passed to the Bayesian engine through continuous testing and storage. Information is stored for in-depth analysis in case of discrepancies between the detected and predicted results of the system.

  • Bayesian Engine: After receiving the results from the Bayesian learning module, the Bayesian engine is ready to receive a variety of event alerts, such as alerts for APT attacks. Many real-time alerts are first collected in the information integration and analysis module. Then, together with the baseline information, the alerts are evaluated by the Bayesian evaluation program. This process divides the events into three cases: APT cases, threat-less cases, and gray threat cases. A gray case indicates that the event cannot be decided from the current information, and such events are stored in a default state. Gray results are predicted according to the thresholds, and the threshold values are adjusted after several simulation tests. A minimal numerical sketch of this evaluation is given below.
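To make the evaluation performed by the Bayesian engine concrete, the following minimal sketch applies Bayes' theorem to score the probability of an APT given a set of captured behaviors under the naive independence assumption. The behavior names, priors, and conditional probabilities are hypothetical placeholders, not values taken from this paper or from [11].

```python
# Minimal sketch of Bayesian APT scoring; all probabilities below are
# hypothetical placeholders used only for illustration.

# Prior probabilities of the two classes (APT vs. NOAPT).
prior = {"APT": 0.05, "NOAPT": 0.95}

# Hypothetical conditional probabilities P(behavior | class).
likelihood = {
    "APT":   {"port_scan": 0.70, "spear_phishing": 0.60, "c2_beacon": 0.50},
    "NOAPT": {"port_scan": 0.10, "spear_phishing": 0.05, "c2_beacon": 0.01},
}

def posterior_apt(observed_behaviors):
    """Return P(APT | observed behaviors) under the naive independence assumption."""
    score = {}
    for cls in prior:
        p = prior[cls]
        for b in observed_behaviors:
            p *= likelihood[cls][b]
        score[cls] = p
    return score["APT"] / sum(score.values())  # normalize over both classes

print(posterior_apt(["port_scan"]))                      # weak evidence
print(posterior_apt(["port_scan", "spear_phishing"]))    # stronger evidence, higher posterior
```

In the architecture above, such a posterior would then be compared against the learned thresholds to label an event as an APT case, a threat-less case, or a gray case.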

2.4 Analytic hierarchy process (AHP)

The analytic hierarchy process (AHP) is a powerful and flexible multi-criteria decision-making method by which complex problems can be solved at different levels. It is called a hierarchical model because it organizes the problem as a tree-like hierarchy. The AHP method combines both objective and subjective evaluations into an integrated structure based on pairwise-comparison scales and helps analysts organize the essential aspects of a problem into a hierarchical framework [27, 28].

The advantages of this method are as follows: evaluating the consistency of decision makers' judgments, creating pairwise comparisons in choosing the optimal solution, the ability to consider the criteria and sub-criteria in evaluating options, and the ability to achieve the best solution through pairwise comparisons.

Generally, AHP supports the decision-making process and focuses on the importance of a decision maker's intuitive judgments as well as the consistency of comparing alternative options in the decision-making process. One advantage of this method is that it organizes tangible and intangible factors systematically and offers a structured but relatively simple solution to decision problems [27]. In this method, the decision-making problem is divided into different levels of objectives, criteria, sub-criteria, and options. Therefore, various options are involved in decision making, and it is possible to analyze the sensitivity of criteria and sub-criteria. Sensitivity analysis examines how the ranking of options changes as the weights of the criteria change. Another advantage of this method is that it determines the degree of consistency and inconsistency of the decision. The AHP method simplifies complex issues by analyzing them [28]. The selection of criteria is the first part of the AHP method; candidates are then evaluated based on the identified criteria (the words "alternatives" and "candidates" are used interchangeably). The reason for calling this method hierarchical is that we must first start from the goals and strategies of the organization at the top of the pyramid and, by expanding them, identify the criteria down to the bottom of the pyramid. This method is one of the most widely used methods for ranking and determining the importance of factors, in which each of the criteria is prioritized using pairwise comparisons. It is difficult to form a matrix of pairwise comparisons if there are many alternatives. The aim of the AHP method is to select the best option based on different criteria through pairwise comparison. This method is also used to weight the criteria. Since increasing the number of elements in each cluster makes pairwise comparisons difficult, decision criteria are usually subdivided into sub-criteria [29].

  • Criterion: a parameter that is selected as a quality component.

  • Option: An item that is selected from the available items.

The following models are used as the most widely used models in the AHP method:

  • Goal-Criterion

  • Goal-Criterion-Sub-Criterion

  • Goal-Criterion-Option

  • Goal-Criterion-Sub-Criterion-Option

In the AHP method, one may want to weight only the criteria, or there may be sub-criteria whose weights must also be determined. In this process, evaluating the relative importance of the decision criteria and comparing the decision options under each criterion are performed through pairwise comparisons, which include the following three steps [29]:

  • Create a comparison matrix at each level of the hierarchy, starting from the second level.

  • Calculate the relative weights for each element of the hierarchy.

  • Estimate the consistency ratio to check the consistency of the judgments. (A minimal sketch of the overall process is given below.)
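As a minimal end-to-end illustration of the Goal-Criterion-Option model described in this subsection, the sketch below combines hypothetical criterion weights (as obtained from pairwise comparisons) with per-criterion option scores to rank the options; all numbers are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical criterion weights obtained from pairwise comparisons (they sum to 1).
criterion_weights = np.array([0.5, 0.3, 0.2])

# Hypothetical scores of three options under each criterion
# (rows = options, columns = criteria); each column sums to 1.
option_scores = np.array([
    [0.6, 0.2, 0.3],
    [0.3, 0.5, 0.3],
    [0.1, 0.3, 0.4],
])

# Final priority of each option = weighted sum of its per-criterion scores.
final_priority = option_scores @ criterion_weights
ranking = np.argsort(final_priority)[::-1]

print("priorities:", final_priority)        # e.g. [0.42, 0.36, 0.22]
print("best option index:", ranking[0])     # option 0 ranks first here
```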

2.5 Pearson correlation method

Pearson correlation is a parametric statistical method that shows the intensity and direction of the relationship between two variables. This method, like other correlation methods, considers the relationships of variables in pairs; that is, if the relationship between two variables A and B is measured with or without the presence of a third variable such as C, the value of this relationship stays the same. When examining the correlation of two variables, if both variables are measured on a ratio scale, the Pearson product-moment correlation coefficient is used. If the population correlation coefficient is ρ and the correlation coefficient computed from a sample of size n is r, then a nonzero r may be obtained purely by chance. For this reason, the significance test of the correlation coefficient is used. This test examines whether the two variables are independent, in other words, whether the population correlation coefficient is zero or not. This coefficient measures the correlation between two interval or ratio variables and lies between +1 and −1. If the obtained value is positive, the changes in the two variables occur in the same direction, namely as one variable increases, so does the other; conversely, if the value of r is negative, the two variables act in opposite directions; that is, as the value of one variable increases, the value of the other decreases, and vice versa. If the obtained value is zero, there is no relationship between the two variables. A value of +1 indicates a complete positive correlation, while a value of −1 indicates a complete negative correlation [30].

If the data distribution of the variables is normal, the Pearson correlation coefficient is used to measure the correlation. The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient or the bivariate correlation coefficient, is used to calculate the strength of the linear relationship between two variables. Its range varies from −1 to +1. The closer this value is to +1, the stronger and more positive the relationship between the two variables; in other words, as one variable increases, the other also increases, and vice versa. The closer the value of this coefficient is to −1, the stronger and more negative the relationship between the two variables; in other words, as one variable increases, the other decreases, and as one decreases, the other increases [30]. Therefore, the hypotheses are:

  • Null hypothesis: The correlation coefficient between two variables is zero.

  • Alternative hypothesis: The correlation coefficient between two variables is not zero.

In the proposed method, this coefficient is used to establish correlations between the components, improving the correlation structure of the data while also reducing the level of redundancy.
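As a minimal sketch of this significance test, the snippet below computes r for two synthetic variables and tests the null hypothesis that the population correlation is zero; in APT-Dt-KC, the variables would instead be candidate alert features, and the 5% significance level is an illustrative choice.

```python
import numpy as np
from scipy.stats import pearsonr

# Two synthetic variables standing in for a pair of candidate features.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + rng.normal(scale=0.5, size=200)   # constructed to correlate with a

r, p_value = pearsonr(a, b)   # p-value for the null hypothesis rho = 0

if p_value < 0.05:
    print(f"r = {r:.3f}: the correlation is statistically significant")
else:
    print(f"r = {r:.3f}: no evidence of a nonzero correlation")
```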

3 Related works

Many intelligent algorithms have been applied to improve the detection capability of an intrusion detection system. Among these methods, ensemble learning has received increasing interest and has been shown to obtain better performance than single learning methods. Besides, the intrusion detection performance is also highly dependent on the quality of the training data. In [17], an effective intrusion detection framework based on an SVM ensemble with feature augmentation is proposed. Specifically, a quality-improvement technique is used to provide concise, high-quality training data, and an SVM ensemble is applied to build the intrusion detection model.

Singh et al. [31] have proposed the application of Security Onion (SecOn) to develop the network security monitoring (NSM) and intrusion detection system (IDS) in the context of SCADA cyber physical security. They have applied a cyber kill-chain model to demonstrate the different stages of attacks and associated mechanisms.

In the current development of online social networks and information technologies, the capture of group privacy may lead to individual privacy violations. In this regard, Kim et al. [32] have studied the privacy kill chain, which uses group privacy as a tool to capture individual privacy. They have shown how this kill chain makes the protection of group privacy necessary from social, legal, ethical, commercial, and technical perspectives.

Shameli-Sendi et al. [33] have proposed a new approach for automated response systems to assess the value of the loss that could be suffered by a compromised resource. This approach uses a feedback-based mechanism that can measure the quality of response and provides a great assist to represent the risk level of different applications. Note that the proposed approach uses a new online mechanism to activate and deactivate the responses based on the online risk effect. Furthermore, it can determine the main factors in risk assessment and efficiency calculation with very high complexity.

Duncan et al. [34] have proposed a hybrid method of attack tree and kill chain to identify multiple indirect detections in cloud computing. An attack tree, as defined by Schneier [35], represents attacks as a tree structure in which the root node is the attack target and the leaf nodes are actions or events. In particular, the use of attack trees makes it possible to identify all possibilities of detecting attacks in the cloud environment. Basically, overlaying the attack tree on top of the kill chain increases the chances of indirect detection from the tree itself, as well as allowing the cloud provider to determine how far an attack has progressed after a suspicious activity.

Hoffman [36] has proposed using Markov processes to model some cyber-attacks. In the proposed method, two example theoretical models of cyber-attack recursion cycles are presented, referred to as the cyber-kill chain model with iterations. The proposed models are based on homogeneous continuous-time Markov chains. None of the available solutions describing cyber-attack kill chains includes special steps such as a start and an end. Therefore, a generalized cyber-attack life cycle has recently been proposed that includes two additional phases [37]. The first step is to identify the attacker's requirements. The last step in the cyber-attack process is to finish the attack and remove the traces of the attackers' activities. The aim is to provide analytical stochastic models and a theory of cyber kill chains with iterations. The steps of the modeled cyber-kill chains are understood according to the steps given in [38] and [39]. It is also assumed that some phases of cyber-attack processes may be skipped by the attackers, that cyber-attacks may be abandoned by attackers or stopped by cyber defense systems at any time, and that new cyber-attacks may then begin. The proposed models are based on homogeneous Markov chains.

In [40], the problem of customizing a dynamic quarantine and recovery (QAR) scheme for an organization is addressed in such a way that the APT impact can be minimized. In this approach, the expected effect of APT under a QAR scheme is estimated based on an epidemic model.

In [41], the decision tree learning methods (i.e., Bayesian network and deep learning) are used to detect and classify different types of APT attacks. The proposed method can improve the detection accuracy by analyzing the data through a deep learning model. Moreover, it can improve deep learning training by testing the existing data through the Maxout method and cross-validation to avoid over-fitting and increase generalizability.

A theoretical framework is built in [42] for describing an APT information-based attack over the internal network. In particular, the mathematical framework of this approach includes an initial input model for selecting entry points and a targeted attack model for information collection, strategy decision-making, weaponization, and lateral movement.

A new APT defense mechanism known as the DBAR-based APT defense mechanism is proposed in [43], which can overcome the main drawbacks of the DAR-based APT defense mechanism. It is expected that it is efficiently applicable on the Software-Defined Network (SDN) patterns. In [43], the problem is reduced to a differential game problem based on a new dynamic model that describes the expected security situation of the organization network. Therefore, it finds a cost-effective DBAR strategy in terms of Nash equilibrium.

In [44], a multiple machine learning classifier is used to identify APT attacks by matching the alerts at different stages of the cyber-kill chain model. The proposed approach uses feature selection to reduce the impact of this on the overall prediction accuracy to detect APT attacks.

A detection method is proposed in [45] to detect APT attacks based on abnormal behaviors of network traffic using machine learning. Abnormal behavior of APT attacks in network traffic, which includes domain and IP, is evaluated, and classified based on a random forest classification algorithm to draw conclusions about the behavior of APT attacks. This method combines the behavior and characteristics of IP and domain.

Table 1 provides a general comparison of the existing approaches, along with performance measurement criteria, scalability level, detection accuracy, and additional useful features.

Table 1 Comparison of the existing approaches

Compared with the existing approaches, our proposed approach (i.e., APT-Dt-KC) aims to increase the accuracy, reduce the false positive and false negative rates, and ultimately improve the detection rate. Since APT-Dt-KC uses an intrusion detection method based on a fuzzy hierarchical structure, it is expected that attacks will be detected with better accuracy and lower false positive and false negative rates.

4 The proposed approach

In this section, we describe the APT-Dt-KC approach for detecting APT attacks based on the cyber-kill chain model. The flowchart of our proposed approach is shown in Fig. 2. In general, it is necessary to create a database of real-time alerts, which is used in the process of detecting APT attacks in APT-Dt-KC. As shown in Fig. 2, the data preprocessing phase is performed using the Pearson correlation test. This method is used to optimize the input information for training with the Bayesian algorithm. Next, the Bayesian algorithm classifies the data based on three components: the detection threshold, the prediction threshold, and the gray results. Then, the rest of the preprocessed data is used as test components, and their outputs are employed as input parameters of the analytic hierarchy process to prioritize attacks. The output of this phase is a classification with a known priority for each attack. In the following, we describe these phases in more detail.

Fig. 2 Flowchart of APT-Dt-KC approach

4.1 Pre-processing and correlation of parameters

In general, different attacks have different evaluation parameters. Therefore, several parameters must be evaluated to detect APT attacks. Existing studies have shown that various parameters have been introduced for the evaluation and detection of cyber-attacks, depending on the type and severity of the attacks. Some of these parameters are used for evaluating cyber-attacks, but most of them are left unused. It should be noted that examining a large number of parameters may increase the overhead and ultimately reduce the efficiency of the proposed intrusion detection solution [46, 47].

In this paper, the correlation coefficient method is used in the preprocessing and correlation phase to reduce the number of parameters used to detect APT attacks. The correlation coefficient is a statistical tool for determining the type and degree of the relationship between two quantitative variables; it indicates both the intensity and the type of the relationship. This coefficient lies between −1 and 1, and it is equal to zero when there is no relationship between the two variables [46].

The correlation between two variables can be measured using a variety of computational methods. The Pearson correlation coefficient, the Spearman correlation coefficient, and Kendall's tau correlation coefficient are the most common methods for calculating the correlation between variables. In general, there are three cases: (1) if both variables are measured on a rank scale, Kendall's tau correlation coefficient is used; (2) if both variables are proportional and continuous, the Pearson correlation coefficient is used; and (3) if both variables are proportional and discrete, the Spearman correlation coefficient is used. Therefore, given the correlation between the parameters considered here, the most suitable method is the Pearson correlation coefficient. Perhaps the most widely used bivariate correlation statistic is the Pearson product-moment correlation coefficient, commonly called the Pearson correlation and denoted by r [46, 47]. The Pearson coefficient shows how strong the linear relationship between quantitative variables is. The main application of the Pearson correlation coefficient is when the variables are parametric; that is, they have a normal distribution and are measured at an interval/ratio level. Here, distance and level refer to the Euclidean distance between the parameters and the available ratings for these parameters; if the ratings of two parameters are on the same level, the distance between them can be measured. In general, the Pearson correlation coefficient between two random variables is equal to their covariance divided by the product of their standard deviations. For a statistical population, the correlation coefficient can be defined as Eq. (1) [48]:

$$ \rho_{X,Y} = \frac{{{\text{cov}} (X,Y)}}{{\sigma_{X} \sigma_{Y} }} = \frac{{E\left[ {\left( {X - \mu_{X} } \right)\left( {Y - \mu_{Y} } \right)} \right]}}{{\sigma_{X} \sigma_{Y} }} $$
(1)

where cov is the covariance, \(\sigma_{X}\) is the standard deviation of the variable X, \(\sigma_{Y}\) is the standard deviation of Y, \(\mu_{X}\) is the mean of X, \(\mu_{Y}\) is the mean of Y, and E represents the mathematical expectation.

Usually, the Pearson correlation coefficient for a statistical sample with n data pairs \(\left( {X_{i} ,Y_{i} } \right)\) is defined using Eq. (2) [7]:

$$ r = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - \overline{X}} \right)\left( {Y_{i} - \overline{Y}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {X_{i} - \overline{X}} \right)^{2} } \sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {Y_{i} - \overline{Y}} \right)^{2} } }} $$
(2)

Equation (2) can be summarized as Eq. (3).

$$ r = \frac{1}{n - 1}\mathop \sum \limits_{i = 1}^{n} \left( {\frac{{X_{i} - \overline{X}}}{{s_{X} }}} \right)\left( {\frac{{Y_{i} - \overline{Y}}}{{s_{Y} }}} \right) $$
(3)

where parameters \(\overline{X}\), \(\overline{Y}\), \(s_{X}\) and \(s_{Y}\) represent the mean of X, mean of Y, the standard deviation of X, and the standard deviation of Y, respectively, defined by Eqs. (4), (5), (6) and (7):

$$ \overline{X} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} X_{i} $$
(4)
$$ \overline{Y} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} Y_{i} $$
(5)
$$ s_{X} = \sqrt {\frac{1}{n - 1}\mathop \sum \limits_{i = 1}^{n} \left( {X_{i} - \overline{X}} \right)^{2} } $$
(6)
$$ s_{Y} = \sqrt {\frac{1}{n - 1}\mathop \sum \limits_{i = 1}^{n} \left( {Y_{i} - \overline{Y}} \right)^{2} } $$
(7)

where \(X_{i}\) and \(Y_{i}\) are the observed values of the two parameters.

After determining the level of correlation and direction, it is necessary to evaluate the intensity of relationship. To interpret this intensity, various classifications are proposed depending on the given applications. These classifications are used to correlate data and remove useless data from the mass of data. By correlating data, a very large portion of useless evaluations for unnecessary data can be removed. Therefore, those parameters used to detect attacks are correlated using the Pearson correlation method, and the unused parameters are discarded. The interpretation of relationship intensity in Pearson correlation is shown in Table 2.

Table 2 Interpretation of relationship intensity in Pearson correlation
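A minimal sketch of this redundancy-reduction idea is given below, assuming the candidate detection parameters are columns of a numeric table. The 0.9 cut-off standing in for a "very strong" correlation and the helper name drop_redundant_features are illustrative choices, not values or names prescribed by Table 2 or by the paper.

```python
import numpy as np
import pandas as pd

def drop_redundant_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature of every pair whose |Pearson r| exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Illustrative usage with random data standing in for the evaluation parameters.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(500, 5)), columns=list("ABCDE"))
data["F"] = 0.95 * data["A"] + rng.normal(scale=0.1, size=500)   # nearly duplicates A

print(drop_redundant_features(data).columns.tolist())   # column F is dropped as redundant
```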

4.2 Threshold training and evaluation using Bayesian algorithm

Generally, Bayesian decision theory is a probabilistic method for reasoning. It assumes that the given variables follow certain probability distributions, and these probabilities, together with the observed data, can be used to make decisions.

The simple Bayesian classification model (NBC) is a simple Bayesian probability model based on Bayesian decision theory. This classifier is simple to implement, fast in classification, and high in accuracy, and it is one of the most widely used classification models in machine learning [49].

The advantages of this algorithm are as follows: (1) classifying test data using this algorithm is easy and fast; (2) when the independence condition is met, a naïve Bayesian classifier performs better than other models such as logistic regression and requires only a small amount of training data.

Given a data set with K features, the values of these K features are assumed to be discrete. The aim of classification is to predict the class of each item in the test set, which is one part of the data set; the other part is the learning set, which is used to perform the simple Bayesian learning. For a particular sample whose feature values are \(a_{1}\) to \(a_{k}\), the probability that the sample belongs to class \(c_{i}\) is \(P\left( {C = c_{i} {|}A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} } \right)\). According to Bayesian decision theory, we have Eq. (8) [49]:

$$ P\left( {C = c_{i} {|}A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} } \right) = \frac{{P\left( {A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} {|}C = c_{i} } \right)P\left( {C = c_{i} } \right)}}{{P\left( {A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} } \right)}} $$
(8)

where \(P\left( {C = c_{i} } \right)\) is the prior probability and can easily be estimated from the learning set. In the given dataset, \(P\left( {A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} } \right)\) is the same for every class \(c_{i}\), so it can be ignored in the comparison, and we assume that the feature values are independent. Therefore, Eqs. (9) and (10) are as follows:

$$ P\left( {A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} } \right) = 1 $$
(9)
$$ P\left( {A_{1} = a_{1} , \ldots ,{ }A_{k} = a_{k} {|}C = c_{i} } \right) = P\left( {A_{1} = a_{1} {|}C = c_{i} } \right) \ldots P\left( {A_{k} = a_{k} {|}C = c_{i} } \right) $$
(10)

By replacing Eqs. (9) and (10) in Eq. (8), the method used in simple Bayesian classification is applied in this paper. In other words, we have Eq. (11):

$$ V_{{{\text{NBC}}}} \left( x \right) = \arg \max P\left( {C = c_{i} } \right)\prod P\left( {A_{j} = a_{j} {|}C = c_{i} } \right) $$
(11)

where \(V_{{{\text{NBC}}}} \left( x \right)\) represents the output of the simple Bayesian classification, the arg max is taken over the classes \(c_{i}\) with prior probability \(P\left( {C = c_{i} } \right)\), and \(P\left( {A_{j} = a_{j} {|}C = c_{i} } \right)\) is the conditional probability of the feature value \(a_{j}\) given class \(c_{i}\).

Theoretically, simple Bayesian classification has the lowest misclassification rate among classification algorithms [26, 49]. However, it is difficult to assume that actual network behaviors are independent. In general, each computer network has its own unique features, which can directly affect the results of intrusion detection methods. Therefore, feature weighting is added to the simple Bayesian classification, so that each attribute is given a different weight reflecting its influence on these relationships. Different weights produce different results in the simple Bayesian classification and strongly affect the intrusion detection method. The key point in applying simple Bayesian classification to an intrusion detection system is how to determine the weights of the different features. The probability calculated above is used to determine the threshold values. Accordingly, by evaluating the threshold values, the results are expressed in three modes (i.e., gray, prediction, and detection). The prediction mode represents the case in which an intrusion is inferred from the current information; the detection mode represents the case in which an intrusion is fully identified and detected; and the gray mode indicates that an intrusion cannot be decided from the current information.
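The following minimal sketch illustrates Eq. (11) together with the feature weighting and the three threshold modes described above. The priors, conditional probabilities, feature weights, and threshold values are hypothetical placeholders; in APT-Dt-KC they are learned and tuned from the training data.

```python
import math

# Hypothetical class priors and conditional probabilities P(A_j = a_j | C = c_i).
priors = {"attack": 0.3, "normal": 0.7}
cond_prob = {
    "attack": {"src_bytes=high": 0.6, "failed_logins=many": 0.4},
    "normal": {"src_bytes=high": 0.1, "failed_logins=many": 0.02},
}
# Hypothetical per-feature weights reflecting their importance in a given network.
weights = {"src_bytes=high": 1.5, "failed_logins=many": 1.0}

DETECTION_T, PREDICTION_T = 0.9, 0.6   # illustrative threshold values

def classify(observed):
    """Return (posterior of 'attack', mode) using weighted naive Bayes in log space."""
    log_score = {}
    for c in priors:
        s = math.log(priors[c])
        for f in observed:
            s += weights[f] * math.log(cond_prob[c][f])
        log_score[c] = s
    # Normalize the two scores into a posterior probability of the 'attack' class.
    m = max(log_score.values())
    exp_score = {c: math.exp(v - m) for c, v in log_score.items()}
    p_attack = exp_score["attack"] / sum(exp_score.values())
    if p_attack >= DETECTION_T:
        return p_attack, "detection"
    if p_attack >= PREDICTION_T:
        return p_attack, "prediction"
    return p_attack, "gray"

print(classify(["src_bytes=high", "failed_logins=many"]))   # high posterior -> detection mode
```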

4.3 Fuzzy analytical hierarchy process

APT-Dt-KC uses a fuzzy AHP [7] to classify different alerts and determine the correlation among parameters in order to detect an APT attack. AHP is a multiple attribute decision making (MADM) method used to select one solution among several options. This process reflects humans' natural behavior and thought: it analyzes complex problems, simplifies them, and then solves them [50].

In this method, the input data must first be fuzzified. Various membership functions are used for fuzzifying numerical values. A common approach is to define the membership functions using mathematical relations, because neutrality is maintained in this case. Membership functions are defined in different forms, the most common of which are the triangular membership function and the bell membership function [27]. The advantage of the triangular membership function is that, when it is used, more theoretical arguments can be applied to prove theorems, while the most important feature of the bell function is that it is closer to human thinking. The membership degree \(\mu_{A} \left( x \right)\) indicates the degree to which the element x belongs to the fuzzy set A. If the membership degree of an element is zero, that element is completely outside the set, while if the membership degree is equal to one, the element is completely inside the set. A membership degree between zero and one indicates a gradual (partial) membership. In this paper, the triangular membership function is used for the fuzzification of values. The triangular membership function can be computed using Eq. (12) [28]:

$$ \mu_{A} \left( x \right) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {x < a_{1} } \hfill \\ {\frac{{x - a_{1} }}{{a_{2} - a_{1} }}} \hfill & {a_{1} \le x \le a_{2} } \hfill \\ {\frac{{a_{3} - x}}{{a_{3} - a_{2} }}} \hfill & {a_{2} \le x \le a_{3} } \hfill \\ 0 \hfill & {x > a_{3} } \hfill \\ \end{array} } \right. $$
(12)

where \(a_{1}\), \(a_{2}\), and \(a_{3}\) are the x-axis breakpoints of the fuzzy set A. The value 0 indicates the lowest membership degree, while the value 1 indicates the highest membership degree in a fuzzy set. It should be noted that values between \(a_{1}\) and \(a_{2}\) correspond to an increasing membership degree, while values between \(a_{2}\) and \(a_{3}\) correspond to a decreasing membership degree in the fuzzy set [27].
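A minimal sketch of the triangular membership function in Eq. (12) is shown below; the breakpoints a1, a2, and a3 used in the example are illustrative values, not parameters taken from the paper.

```python
def triangular_membership(x: float, a1: float, a2: float, a3: float) -> float:
    """Membership degree of x for a triangular fuzzy set with breakpoints a1 <= a2 <= a3 (Eq. 12)."""
    if x < a1 or x > a3:
        return 0.0
    if x <= a2:
        return (x - a1) / (a2 - a1)   # rising edge
    return (a3 - x) / (a3 - a2)       # falling edge

# Illustrative fuzzy set "medium alert severity" with breakpoints 2, 5, 8.
for x in (1, 3.5, 5, 7, 9):
    print(x, triangular_membership(x, 2, 5, 8))
```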

In general, the hierarchical analysis process includes the following steps:

  • AHP Modeling Process: In this step, all the decision-making goals are considered as a hierarchy of decision elements that are related to each other. Decision elements are decision-making criteria and options. Figure 3 shows the structure of AHP in which the desired goals can be achieved using various available options and criteria.

  • Creating a pairwise comparison decision matrix: According to each criterion (parameter), the pair comparison matrix of different options is built. Moreover, the pair comparison matrix of criteria is created to obtain the weight of decision-making criteria. In this method, the elements at each level are compared with the corresponding elements in the higher level and their weights are calculated. These weights are called the mean weights and they are combined to achieve the final weight of each option. Now, consider a matrix n × n where n represents the number of criteria. The decision matrix A and the relationship between its elements can be defined using Eqs. (13) and (14), respectively.

    $$ A = \left[ {\begin{array}{*{20}c} {a_{11} } & \cdots & {a_{1n} } \\ \vdots & \ddots & \vdots \\ {{ }a_{{{\text{n}}1}} } & \cdots & {a_{{{\text{nn}}}} } \\ \end{array} } \right] $$
    (13)
    $$ a_{{{\text{ij}}}} = \frac{1}{{a_{{{\text{ji}}}} }}{\text{ and }}a_{{{\text{ii}}}} = 1 $$
    (14)

    The element of row i and column j in matrix A which is \(a_{{{\text{ij}}}}\) indicates the importance of objective i relative to objective j. This importance is measured with an integer in the range of 1 to 9. Since the importance of each factor relative to itself is one, then the diagonal elements of the matrix of pairwise comparisons are equal to one, and the other elements of the matrix are different and inverse of each other. Therefore, this matrix is a square and invertible matrix in which the importance of the factors relative to each other is determined according to Table 3. The standard preference table can have different modes. It should be noted that increasing the number of modes in this table leads to an increase in computational overhead. Therefore, the selection of these values must be done accurately based on the balance between accuracy and efficiency.

  • Calculate the criteria weight and score of each option: Now, a pairwise comparison matrix of criteria is generated (i.e., Matrix A). There are different methods for obtaining the weight of the criteria in this matrix. One of the methods to approximate the weight of the criteria in this matrix is normalization that takes place during the following steps:

    1. Step 1: For each column of matrix A, the sum of the elements is calculated, and each element of the column is divided by this sum. The result is a new matrix in which the elements of each column sum to one.

    2. Step 2: To approximate the weight of each criterion, the average of each row is calculated, giving an \(n\times 1\) vector. This vector determines the weight of each criterion, denoted by w, which indicates the weight of each parameter for testing and comparison. Each weight is a value between 0 and 1, and the weights sum to 1. In this way, the importance of each option for each criterion is obtained. For example, if a problem contains n criteria and m options, performing this calculation yields an n × m matrix that indicates the importance of all options for the given criteria.

    3. Step 3: This step includes the consistency test, which involves two parameters: the consistency index (CI) and the consistency ratio (CR). These two parameters are calculated using Eqs. (15) and (16), respectively.

      $$ {\text{CI}} = \frac{X - n}{{n - 1}} \quad X = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \frac{{\left( {A \times W} \right)_{i} }}{{w_{i} }}$$
      (15)
      $$ {\text{CR }} = \frac{{{\text{CI}}}}{{{\text{RI}}}} $$
      (16)
Fig. 3 The structure of AHP

Table 3 Standard preference table

where CI, RI, and CR represent the consistency index, random index, and consistency ratio, respectively, X is the largest eigenvalue of the n-order matrix, and W is the weight vector of the parameters. Each weight lies between 0 and 1, and the weights sum to 1. The random index takes different values for different numbers of criteria, as shown in Table 4. If the consistency ratio is less than 0.1, the result is acceptable and matrix A is considered consistent; otherwise, the procedure returns to Step 1 and the operations are repeated.

Table 4 Relationship between random index values and number of criteria
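The sketch below follows Steps 1–3 above: the weights are approximated by column normalization and row averaging, and the consistency index and ratio are then computed from Eqs. (15) and (16). The pairwise judgments are hypothetical, and the random-index values are the standard Saaty values that Table 4 is assumed to contain.

```python
import numpy as np

# Standard Saaty random-index values, indexed by the number of criteria n
# (assumed here to match Table 4).
RI = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(A: np.ndarray):
    """Approximate criterion weights by column normalization and compute CI and CR."""
    n = A.shape[0]
    normalized = A / A.sum(axis=0)          # Step 1: each column now sums to one
    w = normalized.mean(axis=1)             # Step 2: row averages approximate the weights
    x = ((A @ w) / w).mean()                # largest-eigenvalue estimate X in Eq. (15)
    ci = (x - n) / (n - 1)                  # Eq. (15)
    cr = ci / RI[n]                         # Eq. (16)
    return w, ci, cr

# Hypothetical pairwise comparison matrix for three criteria (reciprocal, ones on the diagonal).
A = np.array([
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
])

w, ci, cr = ahp_weights(A)
print("weights:", w.round(3), "CI:", round(ci, 4), "CR:", round(cr, 4))
print("consistent" if cr < 0.1 else "revise the judgments")
```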

5 Performance evaluation

In this section, we first explain our simulation environment and its related settings and then we show the experimental results along with their analysis.

5.1 Experimental settings

Our proposed approach (i.e., APT-Dt-KC) was simulated along with the proposed approach in [17] (i.e., DT-EnSVM) using MATLAB 2015. All the experiments were performed on a computer with an Intel Core i7-6700 @ 3.40 GHz processor with 16 GB RAM and Windows 7.

In all the experiments, the KDD Cup 99 dataset [17], generated by the MIT Lincoln Laboratory, was used to evaluate our proposed approach. This dataset, which has been the most common standard dataset for evaluating intrusion detection systems since 1999, is based on the data recorded in the DARPA 98 project. It contains approximately 4 GB of raw binary data collected by the TCPDump tool from network traffic, corresponding to about 5 million connection records, each about 100 bytes in size with 41 attributes and a label indicating normal traffic or an attack. Due to the large volume of this dataset, most studies use the 10% reference subset provided with the standard dataset, and we also used this 10% subset in our experiments. It includes 494,021 records covering 23 attack types and the normal class. In this dataset, the attack records are grouped into four main classes: denial of service (DoS), unauthorized access from a remote machine (R2L), unauthorized escalation to root privileges (U2R), and probing. In our experiments, these attacks were used as representatives of APT attacks according to the fuzzy features of the cyber kill chain and the correspondence of these kinds of attacks with some phases of the cyber kill chain model. In this classification, we considered the DoS, R2L, U2R, and probing attacks for the information collection ("Reconnaissance"), intrusion ("Weaponization" and "Delivery"), deployment ("Exploitation" and "Installation"), and information stealing ("Command and Control" and "Actions on Objectives") phases, respectively. To distinguish normal communication from attacks, each network communication record in KDD99 is represented by 41 attributes, and each record is marked as an attack alert or a normal alert. It should be noted that the 23 types of cyber-attacks that could be used in an APT attack each belong to one of the four given attack classes.
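As a rough illustration of how the 10% KDD99 subset can be loaded and its individual attack labels grouped into the four classes described above, consider the following Python sketch. The use of pandas, the column naming, and the local file name are assumptions for illustration; the grouping of labels follows the commonly used KDD99 convention rather than code from the original MATLAB simulation.

```python
import pandas as pd

# Commonly used grouping of the attack labels in the 10% KDD99 training subset
# into the four high-level classes (DoS, Probe, R2L, U2R).
ATTACK_CLASS = {
    **dict.fromkeys(["back", "land", "neptune", "pod", "smurf", "teardrop"], "DoS"),
    **dict.fromkeys(["ipsweep", "nmap", "portsweep", "satan"], "Probe"),
    **dict.fromkeys(["ftp_write", "guess_passwd", "imap", "multihop", "phf",
                     "spy", "warezclient", "warezmaster"], "R2L"),
    **dict.fromkeys(["buffer_overflow", "loadmodule", "perl", "rootkit"], "U2R"),
}

def load_kdd10(path="kddcup.data_10_percent"):   # hypothetical local file name
    """Read the 10% subset (41 features plus a label) and attach the attack class."""
    df = pd.read_csv(path, header=None)
    df.columns = [f"f{i}" for i in range(41)] + ["label"]
    df["label"] = df["label"].str.rstrip(".")    # raw labels end with a trailing dot
    df["attack_class"] = df["label"].map(ATTACK_CLASS).fillna("normal")
    return df
```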

The general structure of the APT-Dt-KC approach consists of five functional components and three memory components. The functional components include data processing, classification, detection, training, and decision making based on the fuzzy classification structure. Since the data forms used in detecting APT attacks have different modes and the training dataset changes dynamically in both methods (i.e., the Bayesian classification method and the fuzzy AHP method), three types of memory were considered in the APT-Dt-KC approach. One memory was used to sort all the candidate data for training; a cache memory was employed to link the training data of each procedure in the Bayesian algorithm; finally, a cache memory was applied to link the training data during the process of detecting APT attacks. The types of sorted data in these three memories were represented as small point, large point, and object, respectively. Other structures used in the simulation were as follows:

  • Raw Data Processing Structure: In the APT-Dt-KC approach, raw data records were converted to point-type data after processing and stored in the normal memory. They were then used as the input training data for the Bayesian algorithm.

  • Structure of the Training Algorithm: In this structure, data were converted to large-point data and stored in the cache of device memory. This structure runs the training process of the Bayesian algorithm until the selected data points form the final training structure.

  • Hierarchical Classification: Data in this structure were converted to data types that can be evaluated by the fuzzy structure and stored in the memory of the decision-making and classification model. This structure aligns the decision-making process with the marked objects.

  • Creation of the Detection-Based Classification Structure: This structure implemented the APT-Dt-KC classifier using the Bayesian algorithm and the AHP method.

  • Testing the Detection-Based Classification Structure: This structure tested the classifier stored in memory using a test dataset, and the statistical results obtained from analyzing these data were presented as graphs.

The Bayesian algorithm achieves its best detection rate when the amounts of training data from the two classes are balanced. In our experiments, the amount of normal data and the ratios of the different intrusion types were set to deal with this imbalance problem. Let D represent a training dataset; the distribution of each class in D is shown in Table 5.

Table 5 Class distribution in network communication records
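To illustrate how such ratio-based balancing could be carried out before training the Bayesian classifier, the following sketch subsamples each class by a target fraction and fits a Gaussian naive Bayes model from scikit-learn as a stand-in for the Bayesian classifier. The fractions shown are placeholders and do not reproduce the distribution in Table 5.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def balance_by_ratio(X, y, ratios, seed=0):
    """Keep a target fraction of each class so that normal traffic
    does not overwhelm the (much rarer) attack classes.
    X: feature matrix, y: class labels (both NumPy arrays)."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls, frac in ratios.items():
        idx = np.flatnonzero(y == cls)
        keep.append(rng.choice(idx, size=int(len(idx) * frac), replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Placeholder fractions: keep all attack records, subsample normal traffic.
ratios = {"normal": 0.2, "DoS": 1.0, "Probe": 1.0, "R2L": 1.0, "U2R": 1.0}
# Xb, yb = balance_by_ratio(X, y, ratios)
# clf = GaussianNB().fit(Xb, yb)   # Bayesian classifier trained on the balanced subset
```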

To compare the performance of the APT-Dt-KC approach, the classifiers generated by the Bayesian algorithm must be tested separately. Since the general comparisons and their results do not depend on the amount of training data or the distribution of each data class, only 10% of the training data was used. Assume that T1 represents the experimental data used to compare the available approaches; the class distribution of T1 is given in Table 6. To examine this trend further, another dataset, called T2, was also used; the distribution of its classes is almost identical to that of the entire 10% KDD99 dataset. To remove the differences in scale among the features and to prevent features with large values from dominating features with a smaller numerical range, the data were normalized and then preprocessed using the Pearson correlation test. In our experiments, 90% of the given data was selected to test the intrusion detection model and the other 10% was used to train it; by running the model 10 times, each subset had an equal chance of appearing in the training and testing sets. In the evaluation phase, the performance of the APT-Dt-KC and DT-EnSVM approaches was evaluated with feature enhancement, using KDD99 as the dataset and T1 as the experimental data. In this phase, the feature-enhancement training scheme could be applied directly to multi-fold problems to evaluate the DT-EnSVM approach. In the training process, the intrusion detection system trained a single classifier that provides a model for each class of attacks. A suitable value was required for the similarity coefficient β, which indicates the similarity of congestion in the dataset and can be estimated to lie between 0.2 and 0.4; in our experiments, β was set to 0.25. Using this structure-based classifier, all data in T1 could be classified.
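A minimal sketch of how this normalization and Pearson-correlation-based feature preprocessing might look is given below; the min-max normalization and the correlation threshold of 0.1 are assumptions for illustration, since the paper does not specify these details.

```python
import numpy as np

def preprocess(X, y, corr_threshold=0.1):
    """Min-max normalize the features, then keep only those whose absolute
    Pearson correlation with the (binary) label exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Min-max normalization so features with large values cannot dominate
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0                       # avoid division by zero for constant features
    Xn = (X - X.min(axis=0)) / rng
    # Pearson correlation of each feature column with the label
    Xc = Xn - Xn.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = 1.0
    corr = (Xc * yc[:, None]).sum(axis=0) / denom
    keep = np.abs(corr) >= corr_threshold
    return Xn[:, keep], keep                  # reduced feature matrix and selected-feature mask
```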

Table 6 Confusion matrix for T1 dataset

Finally, the performance of the APT-Dt-KC approach was evaluated and compared with that of the DT-EnSVM approach in terms of training time, precision rate, accuracy, false positive rate, false negative rate, false alarm rate (FAR), and detection rate (DR). A false positive is an alert raised when no attack has actually occurred, while a false negative means that a real attack was not detected. Each experiment was performed 5 times, and the average values were reported as the results.
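The following sketch shows how these metrics can be derived from the entries of a binary confusion matrix. The formulas are standard definitions; treating the false alarm rate as the false positive rate is an assumption, since the paper does not define FAR explicitly.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard intrusion-detection metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy":  (tp + tn) / total,
        "precision": tp / (tp + fp),
        "DR":        tp / (tp + fn),   # detection rate (recall on the attack class)
        "FPR":       fp / (fp + tn),   # false positive rate
        "FNR":       fn / (tp + fn),   # false negative rate
        "FAR":       fp / (fp + tn),   # false alarm rate, here taken to equal the FPR
    }

# Example with made-up counts:
print(detection_metrics(tp=950, fp=30, tn=970, fn=50))
```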

5.2 Experimental results

In the DT-EnSVM approach, the support vector-based classification process was applied to the training scheme, and four classifiers were trained to differentiate each class from the other training attacks. Data records in the T1 dataset were tested using a support vector machine classifier built on a voting strategy: each decision of each binary classifier was counted as a vote, and the final class was assigned by the maximum number of votes. If the maximum number of votes did not identify a single class, the data point was labeled as unknown. In contrast, in the APT-Dt-KC approach, before training and testing, the data were preprocessed and optimized for further processing by the Pearson correlation test and then fed into the training cycle as a set of processed features. Since the input data were already normalized and optimized, redundancy in the preprocessing stage was significantly reduced. In our experiments, 10% of the data was used for training and the remaining 90% for evaluation. Based on the evaluation results, the confusion matrix could be extracted for the APT-Dt-KC approach and for the DT-EnSVM approach with feature improvement. In general, a confusion matrix represents the performance of a classification algorithm; it is most commonly used for supervised learning, although it can also be applied to unsupervised learning, and it is used to determine evaluation indicators such as precision and accuracy, which are discussed later. The confusion matrices extracted for the T1 dataset from the support vector machine classifier with feature improvement and from our proposed approach are shown in Tables 6a and 6b, respectively. Based on these confusion matrices, the APT-Dt-KC approach is more efficient than the DT-EnSVM approach.
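As a rough sketch of the voting strategy described above for DT-EnSVM, the function below aggregates binary-classifier decisions into votes and returns "unknown" when no single class receives the maximum number of votes; the function name and the representation of decisions are assumptions made for illustration.

```python
from collections import Counter

def vote(binary_decisions):
    """binary_decisions: list of class labels predicted by the pairwise
    (one-vs-one) binary classifiers for a single data record."""
    counts = Counter(binary_decisions)
    top_count = max(counts.values())
    winners = [cls for cls, c in counts.items() if c == top_count]
    # A unique majority determines the class; ties are labeled as unknown.
    return winners[0] if len(winners) == 1 else "unknown"

# Example: votes from the pairwise classifiers over four attack classes
print(vote(["DoS", "DoS", "Probe", "DoS", "R2L", "U2R"]))   # -> "DoS"
print(vote(["DoS", "Probe", "DoS", "Probe", "R2L", "U2R"])) # -> "unknown"
```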

As explained in Sect. 4, a combination of the Bayesian algorithm and fuzzy AHP is applied in the APT-Dt-KC approach. In the training strategy of APT-Dt-KC, the parameter M controls the convergence of these two algorithms. The detection rate and training time for the T1 dataset when this parameter was varied from 1 to 6 in steps of 1 are shown in Fig. 4. Considering the trade-off between efficiency and accuracy, the value of M was set to 4. The value of M determines the data points that will be used for training in the next step of the Bayesian algorithm in both the APT-Dt-KC and DT-EnSVM approaches.

Fig. 4 Detection rate and training time with different values of M
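A simple sketch of how such a sweep over M could be organized is given below; the training and evaluation routines are passed in as placeholders because they stand for the approach-specific procedures, which are not reproduced here.

```python
import time

def sweep_m(train_fn, detection_rate_fn, data, m_values=range(1, 7)):
    """Measure training time and detection rate for M = 1..6 (cf. Fig. 4).
    train_fn(data, M) and detection_rate_fn(model, data) are placeholders
    for the approach-specific training and evaluation procedures."""
    results = {}
    for m in m_values:
        start = time.perf_counter()
        model = train_fn(data, m)
        results[m] = {
            "training_time": time.perf_counter() - start,
            "detection_rate": detection_rate_fn(model, data),
        }
    return results
```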

The confusion matrices extracted from the DT-EnSVM approach and the APT-Dt-KC approach for the T2 dataset are shown in Tables 7a and 7b, respectively. The purpose of evaluating this dataset is to analyze the scalability of the APT-Dt-KC approach on a larger amount of data. Based on Table 7, the APT-Dt-KC approach is more efficient in the classification phase than the DT-EnSVM approach.

Table 7 Confusion matrix for T2 dataset

The comparative results for the T1 and T2 datasets are shown in Tables 8 and 9, respectively. Based on these results, the APT-Dt-KC approach outperforms the DT-EnSVM approach in terms of detection rate, false positive rate, and false negative rate. The training time of the APT-Dt-KC approach is also shorter than that of DT-EnSVM, owing to the elimination of redundant evaluated alerts.

Table 8 Comparison of evaluation results in T1 dataset
Table 9 Comparison of evaluation results in T2 dataset

The detection accuracy rates of the APT-Dt-KC approach and the DT-EnSVM approach on the T1 and T2 datasets are shown in Fig. 5. Based on the experimental results, the accuracy of the APT-Dt-KC approach is higher than that of DT-EnSVM on both datasets.

Fig. 5 Accuracy rate of attack detection in T1 and T2 datasets

After the attack classification process, the level and priority of the attacks should be identified. Since the APT-Dt-KC approach applies the AHP algorithm to assess the level of the attacks, it can rank them; thus, all attacks can be classified, prioritized, and evaluated according to how likely they are to occur. The experimental results for the attack detection phase on the T1 dataset are shown in Fig. 6, where the false positive and false negative rates of the APT-Dt-KC approach are lower than those of the DT-EnSVM approach.

Fig. 6 Rates of false positive and false negative in the phase of attack detection in T1 dataset

The mean detection rates of the APT-Dt-KC approach and the DT-EnSVM approach on the T1 dataset are shown in Fig. 7. The experimental results show that the final detection rate of APT-Dt-KC is better than that of DT-EnSVM. Since the detection rate in the classification process was also higher for APT-Dt-KC, it can be concluded that the APT-Dt-KC approach provides better detection over the whole process.

Fig. 7 Mean detection rate in T1 dataset

The use of the data preprocessing step and the Bayesian algorithm, which converges faster and is more accurate than the support vector machine, increases the efficiency of the APT-Dt-KC approach and improves its detection process compared with the DT-EnSVM approach.

To evaluate the scalability of the APT-Dt-KC approach, the T2 dataset was used in our experiments. The false positive and false negative rates of the APT-Dt-KC and DT-EnSVM approaches on the T2 dataset are shown in Fig. 8. As can be seen in Fig. 8, these values are smaller for APT-Dt-KC than for DT-EnSVM, which means that the APT-Dt-KC approach achieves higher accuracy on large datasets.

Fig. 8 Rates of false positive and false negative in T2 dataset

The mean detection rates of the APT-Dt-KC and DT-EnSVM approaches on the T2 dataset are shown in Fig. 9. The experimental results confirm the efficiency of the APT-Dt-KC approach compared with DT-EnSVM at this larger scale.

Fig. 9 Mean detection rate in T2 dataset

Table 10 shows the accuracy, false alarm rate (FAR), and detection rate (DR) of the APT-Dt-KC and DT-EnSVM approaches. Based on the experimental results, the APT-Dt-KC approach achieves better accuracy, FAR, and DR than the DT-EnSVM approach. These improvements stem from the lower false positive and false negative rates of the APT-Dt-KC approach.

Table 10 Comparison of experimental results for other metrics

6 Conclusion and future works

In this paper, a new approach called APT-Dt-KC was proposed for detecting APT attacks. The APT-Dt-KC approach used prioritization and selection based on Bayesian classification to choose the training data, which significantly increased its runtime efficiency. It also used AHP to solve the multi-criteria optimization problem through a prioritization-based classification method, thereby improving the mean detection rate for all data classes. The APT-Dt-KC approach had a lower training overhead than the DT-EnSVM approach since it used only the data points around the boundary between the two classes to construct the classifier. The training process for intrusion detection in the DT-EnSVM approach was suitable for multi-criteria cases with various problems, but it had more difficulty balancing the training and test data; the APT-Dt-KC approach overcomes these problems. Finally, in the APT-Dt-KC approach, the training and testing processes can be carried out in parallel since they are designed as separate modules.

Since our proposed approach uses the AHP technique to solve the multi-criteria optimization problem, it incurs a high computational overhead, which is its main limitation on large datasets. In future work, this problem can be addressed by using other optimization techniques. Moreover, evolutionary algorithms such as the bee colony algorithm can be combined with the APT-Dt-KC approach to improve the efficiency and flexibility of the intrusion detection system. In addition, other classifiers, such as neural network-based methods, can be applied within the APT-Dt-KC approach as a new path for detecting APT attacks.