1 Introduction

In the present era, organizations have moved their services online so that they are accessible 24/7, which helps grow business and revenue. When the Internet was designed, the main objectives were fast data transfer, fast processing, and identification of packet tampering [1]. The number of Internet users and Internet of Things (IoT) devices is multiplying exponentially every day because of the easy access and decentralized nature of the Internet. A Denial of Service (DoS) attack overwhelms the victim service and denies access to genuine users. A DoS attack is easily launched from a single device that continuously forwards random traffic to a victim service. However, because it originates from a single source, it can be readily identified, filtered, and traced back so that legal action can be taken [1,2,3]. A Distributed Denial of Service (DDoS) attack severely degrades the performance of a victim service or, occasionally, makes it unavailable [3]. A DDoS attack is mounted by compromising multiple devices and using these bots to send unnecessary traffic towards a victim service in a coordinated manner. Therefore, it is challenging to detect a DDoS attack with high accuracy in real time.

According to a Kaspersky Lab report [4], the first quarter (Q1) of 2018 saw significant growth in both the number and the duration of attacks compared with the last quarter (Q4) of 2017. The report attributes the rising incidence of DDoS attacks to various causes, such as the exponential increase in non-secure IoT devices, user-friendly attack tools, and security defects in the network. The volume of DDoS attacks keeps increasing each year, even though researchers have introduced effective and powerful detection, mitigation, and traceback mechanisms. Figure 1 shows how the volume of DDoS attacks increases every year.

Fig. 1. Year-wise DDoS attack volume size from the year 2007 to 2018.

The first DDoS attack was witnessed in June-July 1999 and reported in August 1999; it targeted a single computer system at the University of Minnesota using 227 compromised systems [5]. The attack, launched using the DDoS tool Trinoo [6], was sustained for almost two days. More recently, in 2018, GitHub experienced the largest DDoS attack on record, at around 1.35 terabits per second (Tbps). However, GitHub recovered from this attack within 8 min [7].

Peng et al. [8] categorized DDoS defense systems into four comprehensive categories: Prevention, Detection, Traceback, and Mitigation. Further, a DDoS defense system can be deployed at the victim-end (destination-end), at the source-end, in the intermediate (core) network, or in a distributed manner [2, 9]. Bhuyan et al. [10] analyzed each deployment location and concluded that a victim-end DDoS defense system is preferable. The reasons are: (i) it is deployed near the victim and hence closely watches network traffic, (ii) victim-end deployment is quite simple and cost effective, and (iii) it receives aggregated network traffic for analysis, which improves detection accuracy and lessens the false positive rate. However, a victim-end defense system needs to process a large amount of network traffic flows, and sometimes the system can itself become a victim of a DDoS attack. Therefore, there is a demand for systems that exploit the benefits of a victim-end defense system and efficiently analyze a massive amount of network traffic on a cluster of nodes to discriminate DDoS attacks from legitimate traffic. Apache Hadoop [11] is an open source, reliable, scalable, and distributed framework. It is one of the most powerful frameworks for storing and processing a huge amount of data, i.e., Big Data, on a cluster of nodes. In this paper, we implemented a victim-end Hadoop based DDoS defense framework that detects DDoS attack traffic using the MapReduce programming model [17], and validated it using real datasets (MIT Lincoln LLDDoS1.0, CAIDA) and live traffic generated using the proposed testbed.

The rest of the paper is organized as follows. Sect. 2 discusses existing literature on DDoS detection using the Hadoop framework, Sect. 3 presents the proposed Hadoop based detection system, and Sect. 4 presents the methodology. Section 5 presents details of our experimental setup, Sect. 6 presents the performance results, and Sect. 7 gives concluding remarks.

2 Related Work

In this section, we outline the existing literature on combating DDoS attacks with the Hadoop framework. Researchers have proposed numerous powerful solutions to fight DDoS attacks, most of which address volume based detection in a Hadoop framework. However, attack incidents are still increasing linearly. Modern attackers generate low rate DDoS attacks by compromising millions of devices, which can readily circumvent a volume based detection system.

Lee and Lee [12] proposed a Hadoop based DDoS attack detection system. They implemented a counter based detection algorithm using the MapReduce programming model and deployed it in a testbed. However, they validated the defense system using offline batch processing only. According to their performance evaluation, the system requires approximately 25 min for 500 GB and 47 min for 1 TB of network traffic, whereas approximately 5 to 10 min is enough to crash a victim service and refuse access to legitimate users. Khattak et al. [13] proposed a Hadoop based DDoS forensics framework using the MapReduce programming model. They applied a “horizontal threshold” and a “vertical threshold” within a distinct time window. They verified the defense system using the real MIT Lincoln LLS-DDoS-1.0 [23] dataset and efficiently detected high rate DDoS (HR-DDoS) attacks. However, the defense system was validated only using offline batch processing, and a low rate DDoS (LR-DDoS) attack can easily circumvent it. Zhao et al. [14] proposed a Hadoop and HBase based DDoS detection framework using a neural network. They implemented a testbed on a cloud platform comprising a victim web server, attacker nodes, and the defense system. However, the defense system demands considerable time for the training and testing phases.

Dayama et al. [15] proposed a Hadoop based DDoS protection framework. They used the MapReduce programming model to implement a detection algorithm based on a threshold value (a count of the number of requests) to discriminate DDoS attack flows from genuine traffic flows. However, if a sophisticated attacker performs an LR-DDoS attack, it will circumvent the defense system, whereas in the case of a flash event [16], genuine users can be treated as attack traffic.

Hameed et al. [18] proposed a Hadoop based framework to combat DDoS attacks. They designed a detection algorithm for four influential attack types (ICMP, UDP, TCP-SYN, and HTTP-GET) using the MapReduce programming model, and extended their own work [19] by proposing the HADEC framework to detect HR-DDoS attacks within a fair time. They generated attack traffic using the Mausezahn tool [20] and added legitimate traffic. The HADEC framework comprises a traffic capturing server, a detection server (Namenode), and data nodes (ranging from 2 to 10). A threshold value (500 and 1000) is used to discriminate between attack traffic and genuine network traffic. However, the traffic capturing server accounts for almost 77% of the total detection time, and because of the threshold value (500 and 1000), an LR-DDoS attack can easily circumvent the defense system and be treated as legitimate traffic. Chhabra et al. [21] presented a Hadoop based forensic analytics system for DDoS, implemented using a supervised machine learning algorithm. They validated the framework using the CAIDA dataset and claimed 99.34% detection accuracy. However, the proposed system requires considerable time for the training and testing phases. Also, they validated the proposed system only using real datasets.

Most of the existing literature uses volume based detection methods to discriminate DDoS attacks from legitimate network traffic. Nowadays, sophisticated attackers compromise millions of unsecured devices and originate an LR-DDoS attack from each device, so that a tremendous amount of useless network traffic is directed towards the victim server. In this paper, we propose a victim-end Hadoop based DDoS defense system using an information theory metric, i.e., Shannon entropy [22].

3 Proposed Hadoop Based DDoS Detection Framework

In this section, we propose a victim-end Hadoop based DDoS defense system that employs Shannon entropy. The detection framework consists of two phases: (i) the network traffic sniffing phase and (ii) the detection process phase. In the sniffing phase, live traffic is captured using the Wireshark network traffic sniffer tool and stored in the Hadoop Distributed File System (HDFS). In the detection process phase, resources are allocated with the help of Yet Another Resource Negotiator (YARN) to perform the detection job using the MapReduce programming model. The architecture of the proposed system is depicted in Fig. 2.

Fig. 2. Logical architecture of proposed framework.

Figure 2 comprises three stages: (i) live traffic is captured from legitimate and attacker nodes, (ii) the captured traffic is stored in HDFS and YARN allocates resources for analyzing the network traffic flows, and (iii) the traffic is analyzed using the MapReduce programming model, the result is stored on HDFS, and the system decides whether the traffic is a DDoS attack or normal traffic.

4 Methodology

Information theory plays a significant role in mathematics, physics, statistics, mechanical engineering, civil engineering, computer science and engineering, and many other areas. Information theory based detection metrics have often been used in anomaly detection research over the past several years because they offer a notable divergence between anomalous and legitimate packets. However, in Hadoop based DDoS detection, information theory metrics are seldom used to detect attack traffic.

4.1 Shannon Entropy

Shannon entropy can be defined mathematically as

$$ SE = - \sum\nolimits_{i = 1}^{m} {\frac{P_{i}}{S}\log \frac{P_{i}}{S}} $$
(1)

where Pi is the total number of requests from the ith source IP in the time window, and S is defined as

$$ S = \sum\nolimits_{i = 1}^{m} {P_{i}} $$
(2)

Our detection framework uses the Shannon entropy metric to discriminate DDoS attacks from legitimate traffic flows. Let T be the sampling period in which the incoming packets are X1, X2, X3, …, Xn, with the time window set to 1 to analyze network traffic flows. Here n is the total number of packets, t is the total number of time windows, and m is the total number of packets in each time window; its value may differ between time windows and may be zero (if there are no incoming packets). Hence n = m1 + m2 + m3 + … + mt. These quantities are required for Eq. (1).
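The windowed entropy computation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function names, the `(timestamp, source_ip)` input format, the use of seconds as the window unit, and the natural logarithm are all assumptions, since the text does not fix them. The entropy is taken in its standard form with a leading minus sign.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(counts):
    """Shannon entropy of per-source request counts (natural log, assumed)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # empty window: no packets, entropy taken as zero
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def entropy_per_window(packets, window=1.0):
    """Group (timestamp, source_ip) pairs into fixed windows and
    return {window_index: entropy of source IPs in that window}.

    The window length unit (seconds) is an assumption.
    """
    windows = defaultdict(Counter)
    for ts, src in packets:
        windows[int(ts // window)][src] += 1
    return {w: shannon_entropy(list(c.values())) for w, c in sorted(windows.items())}
```

For example, a window with two sources sending one request each has entropy log 2, while a window in which a single source sends every request has entropy zero.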

4.2 Dataset Used

The proposed framework is validated using different live network traffic scenarios, as listed in Table 1. The real CAIDA dataset is also used to validate the proposed framework.

Table 1. Live traffic scenario.

Real datasets such as MIT Lincoln and live traffic (i.e., the no-attack scenario) are used to form the baseline behavior of the proposed framework, from which the average value (µ) and standard deviation (σ) are obtained.

4.3 Detection Algorithm

In the Hadoop framework, a data processing job consists of two parts: a Mapper job and a Reducer job. Each network traffic block (default size 128 MB) is processed by one mapper; for example, if the network traffic data file is split into 10 blocks, then 10 mappers execute concurrently on the cluster of nodes (datanodes). The reducer job is performed by one datanode, which is chosen by the YARN manager.

figure a
figure b
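The mapper and reducer roles can be illustrated with a small, purely local simulation in Python. It is a sketch under stated assumptions and not the algorithm from the paper's listings: the record format (`timestamp,src_ip,...` as comma-separated text), the function names `mapper`, `reducer`, and `run_job`, the one-second window, and the natural logarithm are all hypothetical choices. On a real cluster each mapper would process one HDFS block and the framework would perform the shuffle.

```python
import math
from collections import Counter, defaultdict


def mapper(line, window=1.0):
    """Emit ((window_id, source_ip), 1) for one 'timestamp,src_ip,...' record."""
    fields = line.strip().split(",")
    if len(fields) < 2:
        return  # skip malformed records
    try:
        ts = float(fields[0])
    except ValueError:
        return  # skip header lines or non-numeric timestamps
    yield (int(ts // window), fields[1]), 1


def reducer(pairs):
    """Sum counts per (window, source IP), then compute entropy per window."""
    per_window = defaultdict(Counter)
    for (win, src), n in pairs:
        per_window[win][src] += n
    result = {}
    for win, counts in per_window.items():
        total = sum(counts.values())
        result[win] = -sum(
            (c / total) * math.log(c / total) for c in counts.values()
        )
    return result


def run_job(lines):
    """Local stand-in for the map, shuffle, and reduce stages."""
    pairs = [kv for line in lines for kv in mapper(line)]
    return reducer(pairs)
```

Because every mapper emits only `(key, 1)` pairs, the map stage parallelizes across blocks without coordination, and the single reducer sees the full per-window distribution that the entropy needs.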

5 Experimental Setup

In this section, we explain the details of our proposed experimental testbed. As shown in Fig. 3, the Hadoop based testbed consists of one sniffing node (victim), one namenode (master), three datanodes (slaves), and multiple traffic generator nodes (legitimate and attacker).

Fig. 3. Testbed: Hadoop based detection framework.

In Fig. 3, multiple attacker and legitimate systems generate live network traffic flows and send them towards the capturing server (victim) node (the live traffic scenarios are listed in Table 1, Sect. 4.2). The namenode is responsible only for monitoring and for managing the metadata of the data blocks stored on the datanodes. The datanodes (DN1, DN2, and DN3) run the mapper and reducer jobs that discriminate legitimate traffic from DDoS attack traffic. The SecondaryNamenode periodically checkpoints the Namenode's metadata, which helps recovery in case the Namenode fails.

6 Results and Discussion

To validate the proposed framework, live traffic is generated using the testbed; the details of the generated traffic are given in Table 1, Sect. 4.2. We calibrated the threshold value using Eq. (3), taking the average value (µ) and standard deviation (σ) from the baseline behavior (i.e., the no-attack scenario).

$$ th \, = \, \mu \, \pm \, k*\sigma $$
(3)

where µ is the mean of the entropy values over the time windows, k is the tolerance factor, and σ is the standard deviation of the entropy values. To measure the performance of the proposed framework, we use detection accuracy, false positive rate, and false negative rate, which are defined in Eqs. (4), (5) and (6). Table 2 shows the results for each performance parameter. The four parameters of the confusion matrix, True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), are required to calculate the performance metrics. Detection accuracy is the fraction of attack events detected correctly. False Positive Rate (FPR) is the percentage of normal traffic reported as attack traffic. False Negative Rate (FNR) is the percentage of attack traffic reported as legitimate traffic. The tolerance factor k is chosen at the point where the FPR and FNR curves cross each other (k = 1.0 in our use case). This provides a tradeoff between detection accuracy and false positive rate.
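A minimal sketch of the threshold in Eq. (3) is given below. The function names are hypothetical, the population standard deviation is an assumption (the text does not say whether the sample or population form is used), and treating any entropy outside the band µ ± kσ as an attack is one possible reading of the ± sign.

```python
import statistics


def calibrate_threshold(baseline_entropies, k=1.0):
    """Return (lower, upper) = (mu - k*sigma, mu + k*sigma) from baseline entropies."""
    mu = statistics.mean(baseline_entropies)
    sigma = statistics.pstdev(baseline_entropies)  # population form, assumed
    return mu - k * sigma, mu + k * sigma


def is_attack(entropy, lower, upper):
    """Flag a time window whose entropy falls outside the baseline band."""
    return entropy < lower or entropy > upper
```

With a baseline of 1.0, 2.0, and 3.0 and k = 1.0, the band is roughly 1.18 to 2.82, so a window with entropy 3.5 would be flagged and a window with entropy 2.0 would not.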

Table 2. Result indicating tradeoff between detection accuracy and FPR.
$$ DetectionAccuracy = \frac{TP}{TP + FN} $$
(4)
$$ FPR = \frac{FP}{TN + FP} $$
(5)
$$ FNR = \frac{FN}{TP + FN} $$
(6)
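The three metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions, with FNR = FN / (TP + FN); the function name and the example counts are illustrative only and are not results from the paper.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (detection_accuracy, fpr, fnr) from confusion-matrix counts."""
    attacks = tp + fn      # all windows that truly contain attack traffic
    legitimate = tn + fp   # all windows that truly contain legitimate traffic
    accuracy = tp / attacks if attacks else 0.0
    fpr = fp / legitimate if legitimate else 0.0
    fnr = fn / attacks if attacks else 0.0
    return accuracy, fpr, fnr


# Hypothetical counts: 50 attack windows and 100 legitimate windows.
acc, fpr, fnr = detection_metrics(tp=45, tn=90, fp=10, fn=5)
```

With these definitions, detection accuracy and FNR always sum to one, since both are fractions of the same set of attack windows.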

Threshold calibration (i.e., selection of the tolerance factor) is shown in Fig. 4. The tolerance factor is selected at the point where the FPR and FNR curves cross each other (k = 1.0 in our case).
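One way to locate the crossing point is to sweep candidate values of k and keep the one where FPR and FNR are closest. The following sketch assumes labelled entropy values for legitimate and attack windows are available; the function names, the candidate grid, and the sample values are hypothetical, and the two-sided band µ ± kσ with population standard deviation is an assumption.

```python
import statistics


def rates_for_k(baseline, legit, attack, k):
    """Return (FPR, FNR) when windows outside mu +/- k*sigma are flagged."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    lower, upper = mu - k * sigma, mu + k * sigma

    def flagged(entropy):
        return entropy < lower or entropy > upper

    fp = sum(1 for e in legit if flagged(e))       # legitimate reported as attack
    fn = sum(1 for e in attack if not flagged(e))  # attack reported as legitimate
    return fp / len(legit), fn / len(attack)


def pick_k(baseline, legit, attack, candidates):
    """Choose the candidate k whose FPR and FNR are closest to each other."""
    def gap(k):
        fpr, fnr = rates_for_k(baseline, legit, attack, k)
        return abs(fpr - fnr)

    return min(candidates, key=gap)


# Illustrative entropy values only, not measurements from the testbed.
baseline = [1.0, 2.0, 3.0]
legit = [1.0, 2.0, 2.5, 3.0]
attack = [2.9, 3.5, 4.0, 5.0]
best_k = pick_k(baseline, legit, attack, [0.5, 1.0, 1.5, 2.0])
```

A small k gives a narrow band with many false positives, and a large k gives a wide band with many false negatives, which is why the two curves cross.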

Fig. 4. Threshold calibration.

Figure 5 shows the receiver operating characteristic (ROC) curve, which plots detection accuracy against false positive rate. Entropy values calculated for attack traffic and legitimate traffic flows are shown in Fig. 6. The entropy values of low rate DDoS attack traffic are higher than those of legitimate traffic because a large number of hosts send traffic to the victim server, which helps us discriminate attack traffic from legitimate traffic flows.
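A simple worked case shows why more sources raise the entropy. The numbers here are hypothetical and are not taken from the experiments. Suppose every one of the m sources in a time window sends the same number of requests, so that each source contributes a fraction 1/m of the traffic. Taking Shannon entropy in its standard form with a base-2 logarithm (the base is an assumption), the sum reduces to

$$ SE = - \sum\nolimits_{i = 1}^{m} {\frac{1}{m}\log_{2} \frac{1}{m}} = \log_{2} m $$

With m = 8 legitimate sources this gives SE = 3 bits, whereas m = 1024 bots, each sending at a low rate, give SE = 10 bits. The entropy therefore grows with the number of distinct senders even when no single sender is conspicuous by volume.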

Fig. 5. ROC curve.

Fig. 6. Discrimination of DDoS attack and legitimate traffic.

7 Conclusions

DDoS attacks are a major threat to Internet-based services. In this paper, we have proposed a victim-end Hadoop based DDoS detection framework. The proposed defense framework computes the entropy of source IP addresses on a cluster of nodes to discriminate between legitimate and attack network traffic. It is observed that the proposed defense framework recognizes application layer DDoS attacks (LR-DDoS) with a high detection rate (97%). The proposed system efficiently handles a large amount of network traffic with quick response.