1 Introduction

Network traffic, commonly defined as the amount of data being transferred across a network at a specific time, is increasing at a drastic rate as the Internet continues to grow in scope and complexity [1]. Network traffic can also be measured in terms of bandwidth or transmission capacity and is an important factor when determining the quality and speed of a network. The emergence of more and more applications running on Internet Protocol (IP) networks in different fields, including not only traditional Internet services such as WWW, FTP and e-mail but also multimedia services such as multimedia streaming, P2P file sharing and gaming, has driven network bandwidth from hundreds of Mbps to busier and faster wireless networks of more than 10 Gbps [2]. It is therefore crucial for networks to be monitored in order to understand their behavior in terms of applications, bandwidth usage and utilization of network resources, and to detect network anomalies and security issues, hence preventing overall network performance degradation or failure. The two main operations encompassing network analytics are traffic monitoring and traffic classification. Network traffic monitoring tools are employed by administrators to check availability and maintain system stability by fixing network problems promptly and ensuring network security. Traffic classification, on the other hand, helps to identify the different applications and protocols that utilize the network's resources. While network analytics is not essential for private networks, it is an indispensable tool for large business operators: it gives them a better understanding of their networks and eventually enables them to make smarter, data-driven decisions to attain the desired operational outcomes and meet customers' needs. In other words, the process involves the study of network data and statistics to identify trends and patterns for easy detection and elimination of anomalies [3]. An overview of recent publications that have proposed interesting approaches to IP traffic classification is given next.

In [4], Parsaei et al. applied ML algorithms to traffic captured from a Software-Defined Network (SDN). Four ML algorithms, namely feedforward neural network, Multi-Layer Perceptron (MLP), Levenberg–Marquardt and Naïve Bayes, were used. To identify specific flows, features such as source port, destination port, source IP, destination IP and transport layer protocol were used. Testing of the classifier models yielded an accuracy of 95.6% for the feedforward network, 97% for MLP and Levenberg–Marquardt, and 97.6% for the Naïve Bayes algorithm. The study successfully attained its objective of minimizing controller processing overhead and network traffic. In [5], a comparative analysis of ML algorithms for the classification of traffic from Internet applications was performed. For data collection, real-time network traffic was captured for one minute using the Wireshark software, and the Weka toolkit was used for classification. Traffic from WWW, DNS, FTP, P2P and Telnet applications was targeted. The classification model was constructed by applying four machine learning algorithms, namely Naive Bayes, Bayes Net, C4.5 and Support Vector Machine (SVM). It was found that the C4.5 algorithm gave the highest classification accuracy at 79%. The results also revealed that the recall and precision values for the DNS and WWW applications were lower than those of the remaining applications.

In [6], Singh and Agrawal conducted a classification of IP traffic using an ML approach. The performance of five ML algorithms was evaluated based on parameters such as classification accuracy, training time, and precision and recall values. It was found that, for the full feature dataset, the Bayes Net classifier gave the best classification accuracy of 85.3%. Recall and precision values of 100% were recorded with Bayes Net for FTP, P2P, VoIP and IM. In [7], Sohi et al. made use of three ML algorithms, Bayes Net, RBF and C4.5, for classifying Internet traffic into educational and non-educational applications. The educational websites used included IEEE, Science Direct and SparkNotes, while the non-educational ones included BitTorrent and Yahoo Messenger. It was found that Bayes Net gave a classification accuracy of 76.6%, making it the most accurate among the three classifiers. It also outperformed the RBF and C4.5 classifiers in terms of recall and precision for both educational and non-educational Internet applications.

In [8], the authors presented several criteria to assess existing network data capture mechanisms. An extensive review of state-of-the-art network data collection techniques, such as packet-, flow- and log-based methods, was performed with an in-depth analysis of their benefits and drawbacks, using the proposed criteria as a means for systematic evaluation. The evaluation criteria were based on system performance indicators such as instantaneity, effectiveness, scalability and expense, among others. A number of open problems and several possibilities for future research were also identified. In [9], a study on the selection of features from network traffic for the detection of anomalies was presented. The work focused on data preprocessing and outlined the importance of feature selection. This step helps to remove redundant features and hence allows for faster processing and storage of data by reducing resource consumption. To evaluate the performance of the selected feature sets, ML algorithms such as KNN, Naïve Bayes, Decision Trees, Artificial Neural Networks (ANN) and SVM were deployed. The authors assessed the performance of the classifiers with datasets consisting of 41, 30 and 16 features. It was observed that the classifiers performed better with feature sets of smaller size. With 41 features, the Bayes classifier showed a high false positive rate, considering almost every new sample to be an attack. Its performance greatly improved with 16 features, but at the cost of reduced anomaly detection power.

Building upon the works previously described, this paper aims at analysing the network traffic of an 802.11 wireless LAN by first capturing a maximum amount of traffic information from ongoing sessions of Internet applications using three network monitoring tools, namely PRTG, Wireshark and Colasoft Capsa. The applications employed are YouTube, Skype, BitTorrent, Google Drive, Browsing and FTP sharing. Traffic generated during downloading, uploading, streaming and idle states is also captured. The collected data are then used in the evaluation of 8 ML classification algorithms, serving as analytic tools. Moreover, the effect of anomalies in the form of a DDoS attack and rogue servers on the network performance is also examined.

The remainder of this paper proceeds as follows: Sect. 2 describes how each traffic capture tool is used for feature extraction. Section 3 describes the classification algorithms used for the analytics and how the analytics are performed with the Weka Toolkit. Section 4 describes the system model used for capturing and analyzing network traffic for different applications, states and anomalies. Section 5 presents the extracted features and the classification results of each scheme, together with an evaluation and analysis of the performance of each classifier with different feature sets. Section 6 concludes the paper with some recommendations for future work.

2 Feature extraction tools

Based on previous research, three network monitoring tools were chosen for this study: PRTG, Wireshark and Capsa. Their main features are outlined in the following subsections.

2.1 PRTG

PRTG [10, 11] is a network monitoring tool developed by Paessler. While PRTG is not capable of functioning as an intrusion detection system, it acts as a preventive system and warns against anomalous activities in a network.

Key features of PRTG:

  • Monitoring of network performance in terms of bandwidth and application usage.

  • Monitoring of system usage (CPU loads, free memory, free disk space) of hardware devices.

  • Makes use of a statistical approach by setting threshold values for traffic parameters, and hence detects and alerts on anomalies such as unexpected load peaks, abnormally heavy traffic, downtimes and slow servers. Spikes in activity can signal a threat.

  • User-friendly graphics engine that makes network activity accessible in the form of tables and graphs and hence facilitates analysis of network usage.

  • Efficient database system that provides storage of raw monitoring data and a report generator to create both live and scheduled reports as CSV, HTML or XML data files.

  • Network analysis modules for automatic discovery of network devices and sensors.

Several sensors are used by PRTG to track and display network traffic. Four sensors were deployed for traffic capture and feature extraction: the Windows Network Card, Ping [12], DNS and system health sensors.
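To illustrate how such sensor readings can be turned into classification features, the following minimal Python sketch loads a PRTG report exported as a CSV file and attaches a class label. The file name and column headers are hypothetical and depend on the configured sensors and report template; the sketch only shows the general shape of the preprocessing, not the exact procedure used in this study.

import pandas as pd

# Load a PRTG report exported as CSV (hypothetical file and column names;
# the actual headers depend on the sensors and the report template).
report = pd.read_csv("prtg_report.csv")

# Keep the columns used as classification features in this sketch:
# traffic volume from the network card sensor, ping time and the
# system-health readings (CPU load, available memory).
feature_cols = ["Traffic In (kbit/s)", "Traffic Out (kbit/s)",
                "Ping Time (ms)", "CPU Load (%)", "Available Memory (MB)"]
features = report[feature_cols].dropna().copy()

# Label every row with the application running during the capture;
# the label column is what the ML classifiers will later predict.
features["class"] = "YouTube"
features.to_csv("prtg_youtube_features.csv", index=False)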

2.2 Wireshark

Wireshark [13, 14] is an open-source network protocol analyser or sniffer that captures and displays data traversing a network in the form of packets. The main features of Wireshark include:

  • Ability to perform live capture of packets and deep offline analysis of protocols and packet contents.

  • Reading of live data from several interfaces such as IEEE 802.11, Ethernet, Bluetooth, ATM, USB, among others.

  • Provides powerful filters for selecting specific protocols for analysis.

  • Use of coloring rules to highlight packets for quick and easy identification of different protocols.

The captured traffic obtained from Internet applications is saved as CSV files for further processing.
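As a rough illustration of this step, the sketch below aggregates a packet-level CSV export from Wireshark into per-interval features. The file name is an assumption, and the column names follow Wireshark's default packet-list columns (No., Time, Source, Destination, Protocol, Length); the 15-second interval matches the sampling interval used later in this study.

import pandas as pd

# One row per packet, as produced by Wireshark's CSV export
# (hypothetical file name; columns follow the default packet list).
packets = pd.read_csv("wireshark_youtube.csv")

# Group the packet-level rows into 15-second intervals so that each
# aggregated row becomes one sample for the classifiers.
packets["interval"] = (packets["Time"] // 15).astype(int)
flows = packets.groupby("interval").agg(
    packet_count=("No.", "count"),
    total_bytes=("Length", "sum"),
    mean_length=("Length", "mean"),
    tcp_packets=("Protocol", lambda p: (p == "TCP").sum()),
    udp_packets=("Protocol", lambda p: (p == "UDP").sum()),
)

flows["class"] = "YouTube"   # label for the application being captured
flows.to_csv("wireshark_youtube_features.csv", index=False)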

2.3 Colasoft Capsa free

Colasoft Capsa [15] is a free network traffic and protocol analyser with a rich set of features [16]. It provides graphical statistics for the global network as well as for specific nodes in a dashboard tab. A graphical display of both broadcast and multicast packets [17] traversing the network is obtained with Capsa. It also gives the packet count for TCP and UDP traffic, along with the number of TCP FIN and TCP RST packets sent, and allows the displayed data to be saved in CSV format. Protocol statistics include features such as sent and received packets and bytes as well as average packets per second.

3 Classification of network traffic using machine learning in Weka

This section describes the main classification algorithms used and how the classification was performed using Weka.

3.1 Classification algorithms

Machine Learning techniques help to identify different applications and protocols in a network by grouping them based on packet flow parameters. These include minimum, maximum and mean number of packets, packet length, flow duration, traffic rate, volume, etc. ML classification techniques can be of two types: supervised and unsupervised [5].

In the supervised learning technique, a completely labeled dataset is required in order to classify unknown samples. This dataset is used to train a model which then predicts output responses for a new set of data. The unsupervised machine learning approach does not rely on a labeled dataset. It cannot be applied directly for classification because the output classes are unknown.

A set of 8 ML algorithms is used in this work: Naive Bayes, Bayesian Network (Bayes Net), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Radial Basis Function (RBF) Neural Network, K-Nearest Neighbours (KNN), Bagging and the C4.5 Decision Tree. A detailed description of these techniques can be found in references [18,19,20,21,22].
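For readers who wish to reproduce a comparable experiment outside Weka, the sketch below assembles approximate scikit-learn counterparts of these algorithms. It is only an analogue of the Weka set-up used in this study: Bayes Net and the RBF network have no direct scikit-learn implementation and are omitted, and the CART decision tree stands in for C4.5 (J48).

from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Approximate counterparts of the Weka classifiers used in this work.
# Bayes Net and the RBF neural network are omitted (no direct scikit-learn
# implementation); DecisionTreeClassifier (CART) approximates C4.5/J48.
classifiers = {
    "Naive Bayes": GaussianNB(),
    "MLP":         MLPClassifier(max_iter=500),
    "SVM":         SVC(),
    "KNN":         KNeighborsClassifier(n_neighbors=3),
    "Bagging":     BaggingClassifier(),
    "C4.5 (CART)": DecisionTreeClassifier(),
}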

3.2 Classification using Weka Toolkit

The classification process was performed using the Weka toolkit [23], which is used as a data mining tool to implement IP traffic classification with ML algorithms. The overall process involves feeding the feature sets, containing information about each sample together with its label, into the machine learning algorithm to generate a classifier model. The efficiency and accuracy of the obtained model in capturing a pattern are then determined by comparing the labels generated by the model for the inputs in a test set with the correct labels for those inputs. This classification process is illustrated in Fig. 1.

Fig. 1
figure 1

ML Classification in Weka
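A minimal sketch of this train-and-test workflow is given below, again using scikit-learn as an analogue of Weka. The labelled CSV file name and its 'class' column are assumptions, and the 66/34 percentage split mirrors Weka's default hold-out evaluation rather than the exact procedure reported here.

import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labelled feature set: the "class" column holds the
# application label, all remaining columns are numeric features.
data = pd.read_csv("capsa_applications.csv")
X, y = data.drop(columns=["class"]), data["class"]

# Hold out part of the samples so each model is scored on inputs it
# has never seen, mirroring the Weka train/test procedure above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=1)

for name, clf in {"Naive Bayes": GaussianNB(),
                  "KNN": KNeighborsClassifier(n_neighbors=3)}.items():
    start = time.time()
    clf.fit(X_train, y_train)                     # build the classifier model
    train_time = time.time() - start
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}, training time = {train_time:.2f} s")

The full classifier line-up from the previous sketch can be substituted into the loop to compare all eight schemes in the same way.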

The performance of the classifiers was based on the following criteria:

  1. (i)

    Classification Accuracy.

    Accuracy is the simplest metric deployed to evaluate a classifier. It gives the percentage of inputs in the test set that the classifier correctly labeled.

    $$Accuracy = \frac{\sum TP + \sum TN}{\text{Total number of samples}}$$
    (1)

    where True Positives (TP) are relevant items correctly identified as relevant and True Negatives (TN) are irrelevant items correctly identified as irrelevant.

To define the remaining parameters, False Positives (FP) and False Negatives (FN) are also used. FP denotes irrelevant items incorrectly identified as relevant, while FN represents relevant items incorrectly identified as irrelevant.

  2. (ii)

    Precision (P).

    Precision indicates the proportion of items identified as relevant that are actually relevant and is given by:

    $$Precision = \frac{TP}{TP + FP}.$$
    (2)
  3. (iii)

    Recall (R).

    Recall indicates the proportion of relevant items that are correctly identified:

    $$Recall = \frac{TP}{TP + FN}.$$
    (3)
  4. (iv)

    The F-Measure (or F-Score).

    This combines precision and recall into a single score and is the harmonic mean of the two.

    $$F{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$
    (4)
  5. (v)

    The Confusion Matrix.

The confusion matrix summarises the performance of a classifier. For a two-class problem in which P denotes the first class and N the second, the confusion matrix can be represented as shown in Fig. 2. A small worked example of these metrics follows Fig. 2.

Fig. 2
figure 2

Confusion matrix [24]
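As a worked example of Eqs. (1)–(4), the short Python snippet below evaluates the metrics on a hypothetical two-class confusion matrix whose counts are chosen purely for illustration.

# Hypothetical binary confusion matrix (illustrative values only).
TP, FN = 45, 5     # relevant samples labelled correctly / incorrectly
FP, TN = 10, 40    # irrelevant samples labelled incorrectly / correctly

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # Eq. (1) -> 0.85
precision = TP / (TP + FP)                                  # Eq. (2) -> 0.818
recall    = TP / (TP + FN)                                  # Eq. (3) -> 0.90
f_measure = 2 * precision * recall / (precision + recall)   # Eq. (4) -> 0.857

print(accuracy, precision, recall, f_measure)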

4 Experimental set-up and testing procedures

The overall set-up for the experiments is shown in Fig. 3. The tests were performed on a PC connected to a Wi-Fi network. For this project, a workstation with a 2.70 GHz Intel Core i5 CPU, 4 GB RAM and a 64-bit Windows 10 operating system was used. The network interface discovery feature in PRTG, Capsa and Wireshark was enabled to monitor IP traffic for the Wi-Fi network on the PC.

Fig. 3
figure 3

Overall implemented system

Data was captured for a duration of 30 min in intervals of 15 s for the on-going session of each application and state. For the classification of applications, three datasets of 700 samples each were built from raw data captured from the three monitoring tools and were saved as CSV files.

Streaming, uploading, downloading and idle states were considered for further classification. The datasets for state classification consisted of 470 samples.

As for anomaly classification, datasets of 700 samples with three classes labeled as normal, DDoS and Rogue Servers were used. The ‘normal’ class was obtained by running Internet applications under normal conditions.

The network traffic monitoring tools as well as the Weka classifier application were run on the PC. The classification algorithms were used to classify six different Internet applications, namely video streaming on YouTube, file download and upload via Google Drive, browsing, video conferencing, FTP transfer and P2P file sharing. The experiment was performed for four different states in which the PC can be set, namely streaming, uploading, downloading and idle. Moreover, the classification of two different anomalies, namely DDoS attacks and rogue servers, was also investigated. Details of these testing conditions are given in the following sub-sections.
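The sketch below illustrates, under assumed file names, how per-application feature files exported from one of the monitoring tools can be merged into a single labelled dataset of the kind described above.

import pandas as pd

# Hypothetical per-application feature files, one per captured application.
applications = ["youtube", "googledrive", "skype",
                "browsing", "ftp", "bittorrent"]

frames = []
for app in applications:
    df = pd.read_csv(f"capsa_{app}_features.csv")
    df["class"] = app                  # label each sample with its application
    frames.append(df)

# Concatenate into one labelled dataset ready for Weka or scikit-learn.
dataset = pd.concat(frames, ignore_index=True)
dataset.to_csv("capsa_applications.csv", index=False)
print(len(dataset), "samples,", dataset["class"].nunique(), "classes")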

4.1 Applications and protocols

Most Internet applications operate according to the client/server model in the application layer of the TCP/IP model. A client is a device that requests information and a server is the device that responds to the request. The format of the requests and responses exchanged between clients and servers is generally defined by application layer protocols [25].

The applications monitored in this study are hereafter described.

  1. (i)

    Online (Real-Time) Streaming.

Real-time streaming implies sending audio or video data that is played by the receiver at the other end with a negligible and consistent delay. This process can involve only a sender and a receiver, hence point-to-point, or one sender and several receivers, called broadcast. Real-time streaming prioritises the quick and continuous delivery of data. For this purpose, the User Datagram Protocol (UDP) is used to deliver continuous information without re-sending dropped packets as TCP does [26].

Application used: YouTube.

  2. (ii)

    Upload and Download via e-mail.

Upload refers to the transfer of data from a client to a server, while data transfer from a server to a client is called download. During e-mail operations, the Mail User Agent (MUA), or e-mail client application, is usually used. The e-mail client uses the Post Office Protocol (POP) to receive e-mail messages from an e-mail server, while the Simple Mail Transfer Protocol (SMTP) allows e-mail to be sent from either a client or a server.

Application used: Google Drive.

  3. (iii)

    Video Conferencing.

Video conferencing via the Internet makes use of Voice over Internet Protocol (VoIP). VoIP technology enables voice to be transmitted over the Internet as a digital signal [27].

Application used: Skype.

  4. (iv)

    Web Browsing.

World Wide Web (WWW) services are accessible through a web server. To establish a connection to a web service on a server, the web browser uses the Hypertext Transfer Protocol (HTTP). The server runs background services so that the files requested by the client are available. The browser converts the information received from the server into plain text or HTML format and displays it for the user.

Application used: WWW services.

  5. (v)

    FTP Transfer.

File Transfer Protocol (FTP) enables file transfer between a client and a server. FTP establishes two connections between the client and the server for a successful transfer. The first connection, used for commands and replies, is made by the client to the server on TCP port 21. The second connection is then made over TCP port 20 for the actual file transfer. A client-side sketch of such a transfer is given at the end of this subsection.

Application used: FileZilla Server.

  6. (vi)

    P2P Applications.

A Peer-to-Peer (P2P) application is one where a device can behave as both the client and the server during the same transfer process. P2P implies requesting information from other computers rather than from a dedicated server; each peer can therefore act as both client and server. Both ends can set up a connection and have equal priority.

Application used: BitTorrent.
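The sketch below illustrates the client side of the FTP transfer described in item (v), using Python's standard ftplib module. The server address, credentials and file name are hypothetical; the control connection is opened on TCP port 21, while the data connection for the actual transfer is negotiated separately.

from ftplib import FTP

# Hypothetical FTP server (e.g. the FileZilla Server used in this study),
# reachable on the default control port 21.
ftp = FTP()
ftp.connect("192.168.1.10", 21)      # control connection: commands and replies
ftp.login("user", "password")        # hypothetical credentials

# Retrieve a file over the separately negotiated data connection.
with open("sample_file.bin", "wb") as fh:
    ftp.retrbinary("RETR sample_file.bin", fh.write)

ftp.quit()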

4.2 Network anomalies

Much work has been done in the area of network anomaly detection. The problem is usually approached using Artificial Intelligence and Machine Learning techniques.

In this project, 2 types of anomalies are investigated: (1) Distributed Denial of Service (DDoS) and (2) Rogue Servers.

A DDoS attack refers to the disruption of the normal traffic of a server by bombarding the targeted server with excessive Internet traffic, eventually jamming the network infrastructure and preventing desired traffic from reaching its destination [28]. For this research work, a DDoS attack is generated through code written in JavaScript which continuously opens new tabs in Google Chrome, preventing the user from accessing the network and servers, loading the network infrastructure, and slowing down or completely shutting down the operation of Internet applications. The code was run in the NetBeans IDE.

Rogue servers are set up on a network to disrupt access to a target server. They make use of the Dynamic Host Configuration Protocol (DHCP), a network protocol that allows an IP address from a given range to be automatically assigned to a computer by a server. Rogue server attacks are launched by attackers in the form of sniffing and reconnaissance attacks, among others [29, 30]. To create rogue servers in the system under study, a script was written in JavaScript consisting of three rogue servers, each made to listen on ports 50300, 50302 and 50305 respectively. These port numbers fall within the dynamic/private port range of 49152–65535. The code was run on Node.js.
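For illustration, the following is a Python analogue of the Node.js rogue-server script described above: three TCP listeners bound to the same dynamic-range ports. It is a simplified stand-in for the actual code used in the experiments.

import socket
import threading

# Ports allocated to the three rogue servers (dynamic/private range).
ROGUE_PORTS = [50300, 50302, 50305]

def rogue_server(port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(5)
    print(f"Rogue server listening on port {port}")
    while True:
        conn, _addr = srv.accept()   # accept incoming clients and drop them
        conn.close()

for port in ROGUE_PORTS:
    threading.Thread(target=rogue_server, args=(port,), daemon=True).start()

threading.Event().wait()             # keep the listeners running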

5 Results and analysis

5.1 Features extracted from monitoring tools

Table 1 shows the list of features obtained from the three monitoring tools.

Table 1 List of extracted features from PRTG, Wireshark and Capsa

The performance and efficiency of the 8 ML classifiers were tested for the classification of applications, states and anomalies. The applications are YouTube, Google Drive, Skype, Browsing, FTP and BitTorrent. The four states are Downloading, Uploading, Streaming and Idle. For the classification of anomalies, a feature set with three classes labeled as ‘normal’, ‘DDoS’ and ‘Rogue servers’ is used.

5.2 Classification of applications

The classification accuracy (A), training time (T) and root mean square error (RMSE) obtained from the ML algorithms for classification of applications characterized by traffic flow features extracted from PRTG, Capsa and Wireshark are tabulated below (Table 2).

Table 2 Evaluation metrics for classification of applications

For application classification based on PRTG features, the KNN algorithm gives the best accuracy, at 98.7%, but also the highest training time, at 16.3 s. The RBF neural network is therefore considered the best classifier in this case, with a classification accuracy of 98.3%, a training time of 0.99 s and a root mean squared error of 7.1%.

For the Capsa feature set, Naïve Bayes best classifies the applications with an accuracy of 100%, shortest training time of 0.03 s and zero error.

The KNN classifier again has the highest classification accuracy, but its high training time of 7.9 s also makes it unsuitable for classifying applications using Wireshark features. Bayes Net gives an accuracy of 99.6% with the shortest training time, making it the most efficient application classifier in the Wireshark case.

Figure 4 compares the classification accuracy obtained with the three feature sets in order to identify the most appropriate one for classifying applications.

Fig. 4
figure 4

Comparison of classification accuracy between PRTG, Capsa and Wireshark

KNN has the best accuracy for all three tools, but at the cost of high training times. Besides, most ML classifiers best classify applications characterized by features from Capsa, except for the SVM classifier, which gives higher accuracy with the Wireshark features.

Figure 5 below displays the precision and recall values obtained from the ML classifiers in the classification of applications with PRTG features.

Fig. 5
figure 5

Precision and Recall of ML classifiers from classification of applications using PRTG dataset

It can be seen that Google Drive is the best classified application in terms of precision: all 8 ML classifiers give a precision value of 1, representing 100% precision. Since the Google Drive application was used to upload a 150 MB video file to the server, the traffic generated included a larger amount of sent bytes and packets and a smaller volume of incoming data compared to the other applications, and could therefore be easily distinguished by the ML classifiers.

It can also be seen that the Bayes Net, Naïve Bayes and RBF classifiers give 100% precision for the BitTorrent, Browsing, FTP and Google Drive applications, while the RBF network gives 100% precision for all applications except YouTube.

SVM is the worst classifier, with very low recall values for most applications compared to the other ML schemes. Bayes Net gives 100% recall for BitTorrent, FTP, Skype and YouTube, and MLP gives the same for BitTorrent, FTP, Google Drive and Skype. However, the RBF network gives 100% recall for five applications and is hence chosen as the best ML classifier in this case.

The behavior of RBFNN and SVM can be further explained by their respective confusion matrices as in Table 3.

Table 3 Confusion matrices for RBFNN and SVM

It can be clearly observed that all applications are correctly identified as themselves in the case of RBFNN, while SVM fails to distinguish between the different applications: 25 YouTube samples, 17 Google Drive samples and 28 Skype samples are classified as Browsing, and 38 FTP and 14 BitTorrent instances are also classified as Browsing. This validates the high FP rate of 62.2% obtained for the Browsing application as tabulated above.

Figure 6 shows the precision and recall values obtained from the ML classifiers with Capsa features. Google Drive proved to be the best classified application, with 100% precision and recall from all ML classifiers except SVM. Naïve Bayes, RBFNN, MLP and KNN give 100% recall and precision for all applications considered, while SVM has a very poor performance. Combined with the accuracy results in Table 2, Naïve Bayes is the best classifier for application classification based on Capsa features.

Fig. 6
figure 6

Precision and Recall of ML classifiers from classification of applications using Capsa dataset

The behavior of Naïve Bayes and SVM can be further explained by their respective confusion matrices, as shown in Table 4.

Table 4 Confusion matrices for Naive Bayes and SVM

All 238 instances in the Capsa test set are correctly identified by Naïve Bayes. SVM, on the other hand, classifies all Google Drive, Skype, FTP and BitTorrent samples as YouTube. This validates the high FP rate of 0.881 (88.1%) obtained for the YouTube application.

The precision and recall values obtained from the ML classifiers with Wireshark features are as shown in Fig. 7. BitTorrent application is best classified in terms of Precision while FTP is best classified in terms of Recall. An F-score of 100% is obtained by 5 classifiers for both applications. BitTorrent is easily distinguished by the classifiers since it consists of Peer-to-Peer sharing and involves more UDP packets than other applications.

Fig. 7
figure 7

Precision and Recall of ML classifiers from classification of applications using Wireshark dataset

It can also be observed that C4.5 Decision Tree demonstrates high efficiency by giving 100% precision for five of the six applications. It also gives the best recall value for all applications except for Browsing. On the other hand, Naïve Bayes classifier gives the worst performance in terms of both precision and recall. Although Bayes net gave the highest classification accuracy, it is less reliable than C4.5 in terms of recall and precision.

The classification accuracy and training time of C4.5 was found to be 99.1% and 0.38 s respectively, making it an acceptable ML scheme for the classification of applications.

The behavior of C4.5 DT and Naïve Bayes can be further explained by their respective confusion matrices as shown in Table 5.

Table 5 Confusion matrices for C4.5 DT and Naive Bayes

Most applications are correctly identified as themselves in the case of C4.5 DT; only two Browsing instances are mistaken for the YouTube application. On the contrary, Naïve Bayes largely fails to distinguish between the different applications.

Table 6 gives the average values of the performance evaluation metrics, i.e., TP and FP rates, precision (P), recall (R) and F-measure (F), for the overall classification of the six applications.

Table 6 Average precision and recall values for overall classification of applications

It further confirms that the Capsa feature set is the best for the classification of applications. KNN, MLP, Naïve Bayes and RBF Network give 100% precision and recall for Capsa, while none of them gives ideal values for PRTG. As for Wireshark, recall and precision values of 1 are only obtained with C4.5 and KNN. Thus, it can be deduced that for application classification, better performance is achieved by the ML algorithms when a dataset with more features is used. Also, only the Capsa dataset contains detailed information about IP, TCP and UDP traffic, which largely contributes to the proper classification of Internet applications.

5.3 Classification of states

The classification accuracy, training time and root mean square error obtained from the ML algorithms in the classification of the four states, namely Downloading, Uploading, Streaming and Idle are tabulated in Table 7.

Table 7 Algorithms’ performance evaluation metrics for state classification

Figure 8 compares the accuracy of the ML algorithms.

Fig. 8
figure 8

Comparison of classification accuracy between PRTG, Capsa and Wireshark for state classification

For state classification based on PRTG features, the Bayes Net, Naïve Bayes, MLP, KNN and Bagging algorithms give 100% classification accuracy. However, Bayes Net and Naïve Bayes are considered the best classifiers in this case, with a training time of 0.02 s and zero RMSE.

For the Capsa feature set, Naïve Bayes best classifies the states with an accuracy of 100%, shortest training time of 0.02 s and zero error.

The best classifier based on classification accuracy and training time using Wireshark feature set is Bayes Net.

Bayes Net displays the best overall classification performance for all three tools. Besides, most ML classifiers best classify states characterized by features from PRTG, except for the SVM classifier, which gives higher accuracy with Wireshark features.

The precision and recall values obtained from the ML classifiers in the classification of states with PRTG features are summarised in Fig. 9. In terms of precision, it can be seen that Uploading and Downloading are the best classified states: all 8 ML classifiers give a precision value of 1, representing 100% precision. It can be difficult to differentiate between uploading and downloading sessions of the same video file, as both mainly involve TCP traffic. However, downloading implies a larger volume of incoming traffic and less outgoing traffic, and vice versa for the uploading state. It can therefore be deduced that PRTG provides concise features that fully contribute to distinguishing and classifying these two states.

Fig. 9
figure 9

Precision and Recall of ML classifiers from classification of states using PRTG dataset

As for the other two states, i.e., Streaming and Idle, they are perfectly classified by all algorithms, except C4.5 and SVM for Idle and Streaming respectively.

From the recall chart, it is clearly seen that SVM is the worst classifier with very low recall value for 3 out of 4 states. Bayes Net, Naïve Bayes, MLP, RBFNN, KNN and Bagging are equally good classifiers and they give 100% recall and precision for all four states. Streaming is the best classified state in terms of recall.

The behavior of Bayes Net and SVM can be further explained by their respective confusion matrices as shown in Table 8.

Table 8 Confusion matrices for Bayes Net and SVM

All 160 instances in the PRTG test set are correctly identified by Bayes Net. SVM, on the other hand, classifies almost half of the Downloading samples, 20 Idle state samples and 24 out of 44 Uploading samples as Streaming. However, no Streaming samples are classified as other states. This is why a poor FP rate of 51.6% but a perfect TP rate of 100% are obtained for the SVM classifier with the PRTG feature set.

The precision and recall values obtained from the ML classifiers with Capsa features are summarized in Fig. 10. SVM gives non-zero precision and recall only for the Idle state. In addition, 100% recall is achieved with all ML algorithms for the Idle state. Thus, the features extracted from Capsa are best suited for the classification of the Idle state. The least amount of traffic is generated when the PC is idle and not being used for Internet applications, so the Idle state is easier to differentiate from the other three states.

Fig. 10
figure 10

Precision and Recall of ML classifiers from classification of states using Capsa dataset

Naïve Bayes, RBF network, MLP, KNN and Bagging give 100% precision and recall for all four states.

The behavior of the best classifiers and SVM can be further explained by their respective confusion matrices as shown in Table 9.

Table 9 Confusion matrices

SVM is unable to distinguish the states: 158 of the 160 samples in the test set are identified as Idle. This is reflected in the high false positive rate of 98.3% given by SVM for the Idle state in the table above.

The precision and recall values obtained from the ML classifiers in the classification of states with Wireshark features are shown in Fig. 11. Bayes Net gives 100% precision for the downloading, uploading and idle states and 100% recall for the downloading, streaming and uploading states, making it the most efficient algorithm for state classification using Wireshark features. On the other hand, the lowest precision and recall are obtained with Naïve Bayes and the RBF network, which explains their low classification accuracy values.

Fig. 11
figure 11

Precision and Recall of ML classifiers from classification of states using Wireshark dataset

During the Idle state, the protocols traversing the network are mainly ARP and ICMP requests, in contrast to the other states, which involve TCP and UDP packets. Since Wireshark characterizes samples by protocol, the Idle state is the one most easily identified and classified by the ML algorithms in terms of precision and recall.

Table 10 gives the Confusion matrices for the Bayes Net and Naive Bayes classifiers.

Table 10 Confusion matrices for Bayes Net and Naive Bayes

It can be clearly observed that all states are correctly identified as themselves in the case of Bayes Net, while Naïve Bayes fails to distinguish between the different states: Downloading samples are classified as Uploading, Idle and Streaming, 13 Uploading samples are classified as Streaming and Idle, and 20 Streaming samples are classified as Uploading and Idle. This validates the high FP rate of 31.4% obtained for the Idle state, 20.7% for Uploading and 7.4% for Streaming.

Table 11 gives the average values of the performance evaluation metrics for the overall classification of the 4 states.

Table 11 Average precision and recall values for overall classification of states

It can be concluded that state classification using Wireshark gives the poorest performance among the three monitoring tools. Moreover, 100% precision and recall are obtained with 6 out of 8 ML classifiers using PRTG compared to 5 classifiers when using Capsa. It can be confirmed that the PRTG feature set is the best for state classification. The PRTG set contains 17 features while that of Capsa consists of 30 features. However, three important features are provided by PRTG that enhance the performance of ML classifiers in the classification of states. They are system health, CPU load and available memory.

5.4 Classification of anomalies

The classification accuracy, training time and root mean square error obtained from the ML algorithms in the classification of two types of anomalies, namely DDoS and Rogue Servers, along with a class of normal traffic, characterized by traffic flow features extracted from PRTG, Capsa and Wireshark are tabulated in Table 12.

Table 12 Classification accuracy of ML algorithms for classifying anomalies

For anomaly classification based on PRTG features, the MLP and KNN algorithms give the best accuracy, at 99.1%. However, KNN has the highest training time, at 46 s. MLP is therefore considered the best classifier due to its lower training time of 2.88 s and RMSE of 7.7%.

For the Capsa feature set, MLP best classifies the anomalies with an accuracy of 99.1%, but at the cost of a relatively high training time of 4.1 s. The second best classifier is C4.5 Decision Tree. It gives a classification accuracy of 96.6%, considerably shorter training time of 0.33 s and RMSE of 14.4%.

The Bagging classifier has the highest classification accuracy for anomaly classification using Wireshark features, giving an accuracy of 75.2% with a training time of 0.27 s. Moreover, Naïve Bayes is the only classifier that takes 0 s to train the model using the Wireshark dataset, but its low classification accuracy of 52.5% makes it ineffective for classification.

Figure 12 compares the classification accuracy obtained with the three feature sets in order to identify the most appropriate one for classifying anomalies.

Fig. 12
figure 12

Comparison of classification accuracy between PRTG, Capsa and Wireshark

The precision and recall values obtained from the ML classifiers in the classification of anomalies with PRTG features are summarised in Fig. 13.

Fig. 13
figure 13

Precision and Recall of ML classifiers from classification of anomalies using PRTG dataset

An overall view of the two bar charts above shows that MLP is the classifier with 100% precision for DDoS and Rogue Servers, and 100% recall for DDoS and Normal class samples. On the other hand, the lowest precision and recall are obtained with Naïve Bayes. DDoS samples are relatively better classified in terms of recall and precision with PRTG features: during the DDoS attack, a significantly larger volume of traffic was recorded, which makes this class easily distinguishable from the others.

The behavior of MLP and Naïve Bayes can be further explained by their respective confusion matrices as in Table 13.

Table 13 Confusion matrices for MLP and Naive Bayes

It can be clearly observed that MLP classifies all instances without fail, except for 2 Rogue Server samples that it wrongly identifies as Normal. On the other hand, many false positives are obtained with Naïve Bayes: it mistakes 35 DDoS samples and 51 Rogue Server samples for Normal. This is why an FP rate of 56.2% is obtained with Naïve Bayes for the Normal class.

The precision and recall values obtained from the ML classifiers in the classification of anomalies with Capsa features are summarised in Fig. 14.

Fig. 14
figure 14

Precision and Recall of ML classifiers from classification of anomalies using Capsa dataset

From Fig. 14, it can be seen that the Normal class is the best classified in terms of precision. 100% precision is obtained with SVM, MLP, KNN and C4.5. The second diagram reveals that Rogue server anomaly is better classified in terms of recall.

Overall, the KNN algorithm exhibits the best classification results, while SVM displays the poorest classification performance, with very low recall values for two out of three classes (Table 14).

Table 14 Confusion matrices for KNN and SVM

KNN exhibits ideal classification results, in contrast to SVM, which wrongly classifies Normal and DDoS instances as Rogue Servers, explaining its high FP rate of 83.2%.

The precision and recall values obtained from the ML classifiers in the classification of anomalies with Wireshark features are summarised in Fig. 15.

Fig. 15
figure 15

Precision and Recall of ML classifiers from classification of anomalies using Wireshark dataset

Using the Wireshark feature set, DDoS is better classified in terms of precision and Rogue Servers in terms of recall. An important factor to consider here is that Wireshark provides source and destination port features. Since the Rogue Server attack implies connections to specific designated ports, Rogue Server samples have a greater chance of being distinguished from other samples and are therefore best classified by the ML algorithms. The classification performance varies from one ML algorithm to another and no best algorithm can be deduced from the above two figures. However, it can be clearly observed, even without further analysis, that the Wireshark feature set is not the best option for the classification of anomalies.

Table 15 gives the average values of the performance evaluation metrics for the overall classification of the 3 anomaly classes.

Table 15 Average precision and recall values for overall classification of anomalies

It can be concluded that anomaly classification using Wireshark gives the poorest performance among the three monitoring tools.

Moreover, both PRTG and Capsa features result in approximately the same performance from the ML classifiers. Only MLP and KNN give 100% precision and recall. Therefore, both feature extraction tools generate features that are suitable for the classification of anomalies.

6 Conclusion

The aim of this paper was to capture Internet traffic from web applications using three traffic monitoring tools (PRTG, Colasoft Capsa and Wireshark) and to deploy eight Machine Learning algorithms for the classification of six applications and four states derived from them. The states included Downloading, Uploading, Streaming and Idle. Two anomalies, namely DDoS and Rogue Server attacks, were also generated during traffic capture and were classified using the Weka Toolkit. It was noted that Capsa allowed for the extraction of the largest number of features, followed by PRTG. The classification results obtained showed that the performance of the ML classifiers varies in each case. It was further observed that the Capsa feature set was best suited for the classification of applications due to its large number of features, while the PRTG feature set outperformed that of Capsa in the classification of states. An important implication of this observation is that the contribution of the individual features to classification is more relevant than the overall number of features present in a dataset. Finally, the ML algorithms gave their poorest performance in the classification of anomalies. A possible explanation is the presence of only three classes and the high level of similarity between them. From an overall perspective, classification based on the Wireshark feature set displayed the worst results. Additionally, the SVM classifier gave the poorest performance in the overall classification of Internet traffic, which validates the fact that SVM is largely affected by irrelevant and noisy samples. On the other hand, KNN showed the highest classification accuracy in most cases, but it takes a significantly longer time to train the classifier model. The Naïve Bayes algorithm can be chosen as an alternative for its robustness to irrelevant samples. This study makes it clear that feature selection is an imperative step in the classification of IP traffic. The main limitation encountered in this work is that, due to resource constraints, network traffic capture was carried out for short intervals of time, resulting in fewer samples for classification. These observations lead to the conclusion that ML classification is a reliable technique for the analysis of Internet traffic, given an appropriate set of features. Interesting future work will be to optimize the performance of the ML algorithms by using a larger number of samples, and to perform a deeper analysis of the Capsa feature set by deducing the cost of generating individual features and eliminating those that barely contribute to classification performance, hence reducing network resource consumption.