1 Introduction

Traffic classification is widely recognized as a prominent concept in network security and management. The growing number of users and online applications has increased Internet traffic, and about 80 % of this traffic belongs to peer-to-peer applications [1, 2]. The growth of bandwidth requirements and the restricted capacity of communication lines have created a remarkable need to improve the quality of network resource utilization. To this end, network traffic classification can help enhance network quality, licensing, accounting, and security [3].

The asymmetric architecture of today's network access links can be considered one of the most important practical examples of network classification; it is based on the hypothesis that clients generally download more than they upload. However, the pervasiveness of symmetric-demand applications (such as peer-to-peer (P2P) applications, voice over IP (VoIP), and video calls) has caused clients' demands to deviate from this assumption [4]. Therefore, it is necessary to provide application-level knowledge for allocating enough resources to such applications in order to supply a satisfactory experience for clients. Identification and classification of the traffic of the various applications and services on the network are considered the primary step in managing networks. Furthermore, given the rapid growth of malware, which tries to conceal its traffic to evade intrusion detection systems and firewalls, traffic classification has become a necessary initial step in network security systems and intrusion detection against cyber threats [5,6,7].

Network traffic classification methods are generally categorized into five groups: port-based, payload inspection, statistical, pattern matching, and machine learning methods [2, 8, 9]. Port-based methods rely on the port numbers in the TCP/UDP header. In spite of the simplicity and speed of these methods, they cannot classify all protocols owing to the existence of dynamic and private ports. Payload inspection, in turn, is a very general term for examining the payload of a packet and is often referred to as deep packet inspection. Although many techniques have been presented to increase the efficiency of payload inspection, privacy, encryption, complexity, and high processing time are still identified as its drawbacks [2].

Given that traffic classification methods that do not need access to the packet payload cannot be used for identifying all protocols, and that legal restrictions may be imposed to prevent access to the payload and protect user privacy, payload-based approaches are often impractical. Additionally, when the payload is encrypted, access to it is denied altogether. Consequently, statistical classification methods avoid these issues by utilizing payload-independent parameters, including inter-arrival time, packet length, and flow length. It must be mentioned that although statistical methods cannot provide remarkable accuracy, they are able to carry out the classification at high speed [10, 11].

Pattern matching methods are another group of network traffic classification methods that have been utilized in this field for a long time. Because they need to read the contents of packets, and reading encrypted data is difficult, they face several barriers and need to overcome problems such as scalability for processing multi-GB connections and supporting large volumes of signatures [2, 12]. On the other hand, with the growth of machine learning, such methods have become significant in the field of network traffic classification. Although machine learning based methods are able to overcome some of the limitations and insufficiencies of the previous methods, besides requiring feature engineering, they still face two crucial challenges [6, 13]. Firstly, they cannot support real-time classification owing to the high processing load of the large amounts of data on the network. Secondly, they suffer from overfitting when confronted with unbalanced data, i.e., data that are unevenly distributed among the various classes [5, 6].

Specifically, unbalanced data not only decrease classification efficiency but also increase the misclassification rate, which imposes an impractical cost on the network and causes crucial challenges for the security and management of network resources. To overcome these issues, re-sampling methods have been used in the pre-processing step of machine learning pipelines to balance the data. Nonetheless, loss of information and the production of false information are known as the potential disadvantages of these methods [1, 14].

In recent years, deep learning methods have also attracted remarkable attention for network traffic classification owing to the fact that they do not need handcrafted features and are able to extract efficient features automatically without human intervention [15, 16]. Notably, class imbalance also has a negative influence on the classification efficiency of these methods. To overcome this insufficiency, a deep learning based encrypted traffic classification framework is proposed in this paper. The proposed method integrates a Convolutional Neural Network (CNN) with cost-sensitive learning to provide a classification model that manages the problem of unbalanced data, entitled Cost-Sensitive CNN (CSCNN). The proposed method employs a cost matrix that assigns a cost to each misclassification according to the distribution of each class. This cost is then applied to the network during the training phase. The cost matrix is created in the pre-processing step using the data distribution of the various classes, where a higher cost is assigned to the minority classes in comparison to the majority classes.

To evaluate the efficiency of the proposed method for the tasks of traffic classification, application identification, and traffic description, the ISCX VPN-nonVPN [17] dataset was used in our experiments. According to the obtained results, CSCNN not only achieved higher performance in comparison to other existing methods but was also able to identify more minority class samples correctly. This can be attributed to the fact that CSCNN assigns a higher cost to the minority classes and a lower cost to the majority classes. CSCNN then uses these costs to update the weights during the training phase, which makes the model more sensitive to the minority classes. The advantages of the proposed traffic classification method that make it superior to other classification schemes are as follows:

  • The proposed method is based on deep neural networks, and therefore there is no need for experts to extract features related to the network traffic.

  • The proposed method can decrease the influence of unbalanced data, especially class imbalance, on the efficiency of traffic classification by using a cost-sensitive matrix that assigns a higher cost to the minority classes and a lower cost to the majority classes.

  • To the best of our knowledge, deep learning has rarely been employed for the task of traffic classification. Nevertheless, we compared our proposed method with two of the most prominent deep learning based methods for all tasks of traffic classification, application identification, and traffic description on the ISCX VPN-nonVPN dataset.

We also compared our proposed method with machine learning based methods that have conducted their experiments on this dataset. Based on the empirical results, it can be claimed that CSCNN has superior classification performance compared to both deep learning and machine learning based methods.

The rest of this paper is organized as follows. The various kinds of network traffic classification methods, along with the solutions to the problem of unbalanced data in traffic classification, are explained in Sect. 2. Background on deep neural networks is provided in Sect. 3. The proposed method, which tries to enhance traffic classification performance on unbalanced data by training a cost-sensitive CNN, is described in Sect. 4. Section 5 includes the results of the experiments and their analysis, while the conclusion and possible future research are presented in Sect. 6.

2 Related Work

This section provides an overview of the most significant traffic classification methods as well as techniques for overcoming the class imbalance problem. In particular, network traffic classification methods are classified into five categories [6]: (1) port-based, (2) payload inspection, (3) pattern matching, (4) statistical, and (5) machine learning methods.

Port-based methods are the oldest and most famous methods in this field, where classification is carried out using the information in the TCP/UDP header of the packets [18]. In other words, port-based methods use the information in the packet headers to extract port numbers that are related to a specific application. Considering that this extraction is not a difficult process and that port numbers are not affected by encryption, these methods are not only popular for being easy and fast but are also commonly used in firewalls and Access Control Lists (ACL) [19]. However, besides these advantages, problems such as port misuse, port transmission, network address translation, and port randomness have decreased their efficiency [20], and only 30–70 % of current Internet traffic can be classified using port-based classification methods [21, 22].

Payload inspection methods, commonly known as Deep Packet Inspection (DPI), employ the information in the application layer to perform classification, where predefined patterns such as the signature and regular expression of each protocol are used to distinguish protocols from each other [2]. Notably, the need to update the patterns after each new protocol is released, as well as user privacy, can be considered the main weaknesses of these methods. In this regard, Sherry et al. [23] presented a system that could solve the privacy issue by inspecting the encrypted payload without decryption. However, their proposed method was only able to process HTTPS traffic.

Following a similar line of research, pattern matching methods are another group of network traffic classification methods that have been used in this field for a long time [24]; they compare the packet content with a set of predefined rules in string format. However, these methods are also confronted with particular limitations, such as limited expressiveness, and they are unable to cope with complex services. To overcome these drawbacks, regular expressions and finite automata are commonly employed in these methods to derive suitable patterns for classification [25, 26].

Statistical methods try to overcome these problems by utilizing payload-independent factors, such as inter-arrival time, packet length, and flow length, to perform classification [27, 28]. In other words, they are based on the assumption that the traffic of each application has unique statistical features that can be efficiently used to classify its underlying traffic. However, having access to only a small part of the statistical flow information in real-time traffic may jeopardize their performance. In this regard, protocol fingerprints based on the Probability Density Function (PDF) of the packets' inter-arrival time and normalized thresholds were presented by Crotti et al. [29], and their obtained accuracy was about 91 % for a group of protocols such as HTTP, Post Office Protocol 3 (POP3), and Simple Mail Transfer Protocol (SMTP). Similarly, the PDF of the packet size was considered by Wang et al. [30]. The accuracy of their technique was about 87 %, while it was able to identify a broader range of protocols such as File Transfer Protocol (FTP), Internet Message Access Protocol (IMAP), SSH, and TELNET.

With the advancement of machine learning methods, they have attracted many researchers to the task of traffic classification because they are able to automatically build a model from a dataset [2]. Generally, machine learning methods are divided into supervised and unsupervised techniques. Supervised techniques require labeled data to perform classification, while unsupervised techniques can operate without any prior information about the samples. In this regard, Auld et al. [31] used a Bayesian neural network to classify P2P protocols and obtained 99 % accuracy. Moore et al. [22] used the Naïve Bayes classifier on the same application and achieved 96 % accuracy. Additionally, artificial neural networks have been used for traffic classification and acquired superior performance compared to Naïve Bayes [32]. It is worth mentioning that two of the most prominent methods that conducted their experiments on the ISCX VPN-nonVPN traffic dataset were also based on machine learning. In particular, Gil et al. [17] utilized time-related features to classify the network traffic using the C4.5 decision tree and the k-nearest neighbor technique and obtained 92 % recall for classifying the six major traffic classes. Yamansavascilar et al. [33] also used the k-nearest neighbor technique and obtained an accuracy of 94 % for classifying 14 classes of applications on the same dataset.

Although machine learning based methods have obtained remarkable results, they still face obstacles including feature extraction and feature selection, which are mostly performed with the help of an expert. Hence, feature engineering is costly, time-consuming, and prone to human mistakes. To overcome these issues, deep learning methods have obtained considerable attention in the field of traffic classification, since they do not require any handcrafted features and their highly flexible architectures can learn directly from raw data [34,35,36]. In this regard, Chen et al. [34] proposed a method named Seq2Img that utilized a CNN to classify IP traffic. In their method, stream sequences were converted into an image and a CNN was employed to perform traffic classification. Wang et al. [35] proposed a method that learned the low-level spatial features of network traffic using a deep CNN and a Long Short-Term Memory (LSTM) network. Notably, a combination of CNN and Recurrent Neural Network (RNN) has also recently been utilized for the task of traffic classification [37, 38]. Following a similar line of research, the Datanet method [37] was proposed to efficiently manage distributed smart home networks, where the classification was performed using three deep learning models: a multilayer perceptron, a stacked auto-encoder, and a CNN. In particular, Wang et al. [39] and Lotfollahi et al. [1] used deep neural networks for traffic classification and performed their experiments on the ISCX VPN-nonVPN traffic dataset. Wang et al. [39] presented a method that integrated feature extraction, feature selection, and classification into a unified end-to-end model. Lotfollahi et al. [1] presented the Deep Packet method, which leverages a combination of a stacked auto-encoder and a CNN to classify network traffic.

Notably, in spite of the fact that deep learning based methods have made significant improvements in traffic classification [1, 37], class imbalance is still a common problem in this field, which can generally result in discriminatory and biased classification towards the majority classes. There are two prominent approaches for dealing with unbalanced data: (1) Data-level approaches, which re-balance the distribution of classes in the pre-processing phase. Since data-level approaches are independent of the classification algorithm, they are flexible. Random Under-Sampling (RUS) and Random Over-Sampling (ROS) are two famous re-sampling techniques used for this aim. In RUS, some samples are randomly selected and eliminated from the majority classes; in contrast, in ROS some minority samples are randomly duplicated [40]. Information loss and fake knowledge production are the primary problems that the RUS and ROS techniques are respectively confronted with [41]. (2) Algorithm-level approaches, which alter the learning process by inserting an extra specialized mechanism into the original algorithm to determine the classifier's sensitivity toward the minority classes. Cost-sensitive learning is a technique commonly utilized to deal with class imbalance problems, where variable costs are assigned to each class and the algorithm then uses these costs during the training process to update the weights. The goal of this technique is to assign a higher cost to minority class samples while decreasing the overall learning cost.
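For illustration, the following minimal sketch contrasts the two re-sampling techniques on a hypothetical unbalanced dataset; the array names, class labels, and sample counts are ours for illustration and do not come from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_under_sample(X, y, majority_label, n_keep):
    """RUS: randomly drop majority-class samples so that only n_keep remain."""
    maj_idx = np.where(y == majority_label)[0]
    other_idx = np.where(y != majority_label)[0]
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    sel = np.concatenate([kept, other_idx])
    return X[sel], y[sel]

def random_over_sample(X, y, minority_label, n_total):
    """ROS: randomly duplicate minority-class samples until n_total exist."""
    min_idx = np.where(y == minority_label)[0]
    extra = rng.choice(min_idx, size=n_total - len(min_idx), replace=True)
    sel = np.concatenate([np.arange(len(y)), extra])
    return X[sel], y[sel]

# Hypothetical unbalanced traffic dataset: 1000 majority vs. 50 minority samples.
X = rng.random((1050, 4))
y = np.array([0] * 1000 + [1] * 50)
X_rus, y_rus = random_under_sample(X, y, majority_label=0, n_keep=50)   # information loss
X_ros, y_ros = random_over_sample(X, y, minority_label=1, n_total=1000) # duplicated samples
```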

In this area of research, some methods have recently been presented that try to integrate class-specific costs into deep neural networks, such as CNNs [15, 42], DNNs [43, 44], and auto-encoders [45]. Khan et al. [44] incorporated a cost-sensitive setting into a CNN to learn feature representations and classifier parameters for both majority and minority classes. This approach can be used for both binary and multiclass problems. To the best of our knowledge, no previous study has used cost-sensitive techniques along with deep neural networks for the task of traffic classification with the aim of overcoming the class imbalance problem. Accordingly, this is the first study that tries to deal with the class imbalance problem in encrypted traffic classification by generating a cost matrix based on the class distribution and using it during model training for updating the weights.

3 Background on Deep Neural Networks

Neural networks are commonly recognized as computing systems that consist of many basic and highly interconnected elements for processing information. These networks are made of an extensive number of building blocks (neurons) that are connected to each other via links (connections) with particular weights. During the training process, a large number of data samples are fed to the neural network and a learning algorithm (backpropagation) adjusts the weights to obtain the desired output. Deep neural networks are a particular kind of neural network that contains many hidden layers; they have become more feasible in recent years due to the rapid enhancement of computational power and the accessibility of Graphics Processing Units (GPUs) [46]. They have been successfully utilized in various domains including computer vision, image processing, natural language processing, information retrieval, and, in particular, traffic classification [8, 47]. Since the proposed method of this paper focuses on using a CNN for traffic classification, more details about this network and how it is used for traffic classification are provided in the remainder of this section.

A CNN is a typical deep neural network that can perform automatic feature extraction using layers composed of convolutional operations and is generally suitable for sequential data such as language [1, 48, 49]. The main concept of a CNN is to extract local features from the input at the lower layers and then combine them into more complex features at the higher layers. In this regard, the CNN leverages the convolutional layer as its basic building block, which takes an \(N\times N\) square of neurons and a filter \(w\) of size \(m\times m\) as input. By applying the convolutional operation to these two matrices, the output features of each layer \({c}^{l}\), of size \((N-m+1)\times (N-m+1)\), are obtained. Thereafter, \(f\) is applied as the activation function to learn more complex features (Eq. 1). A pooling operation is also applied over the obtained feature maps to aggregate multiple low-level features and reduce computational costs [8].

$${c}_{ij}^{l}=f\left(\sum _{a=0}^{m-1}\sum _{b=0}^{m-1}{w}_{ab}{c}_{(i+a)(j+b)}^{l-1}\right)$$
(1)
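For illustration, a minimal NumPy sketch of the valid convolution in Eq. (1) is given below, assuming ReLU as the activation function \(f\); the input size N, the filter size m, and the random values are arbitrary.

```python
import numpy as np

def conv_layer(prev, w, f=lambda z: np.maximum(z, 0.0)):
    """Valid 2-D convolution of Eq. 1: output size (N-m+1) x (N-m+1)."""
    N, m = prev.shape[0], w.shape[0]
    out = np.empty((N - m + 1, N - m + 1))
    for i in range(N - m + 1):
        for j in range(N - m + 1):
            # c^l_{ij} = f( sum_{a,b} w_{ab} * c^{l-1}_{(i+a)(j+b)} )
            out[i, j] = f(np.sum(w * prev[i:i + m, j:j + m]))
    return out

c_prev = np.random.rand(8, 8)   # c^{l-1}: N = 8
w = np.random.rand(3, 3)        # filter:  m = 3
c_next = conv_layer(c_prev, w)  # shape (6, 6)
```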

Since network traffic is essentially sequential data, and the structures of byte, packet, session, and the whole traffic are very similar to the structures of character, word, sentence, and document in natural language processing, a CNN can also be used for encrypted traffic classification. In this regard, consider \({x}_{i}\in {\mathbb{R}}^{k}\) to be a k-dimensional vector that corresponds to the ith traffic byte in the session or flow. Therefore, \({x}_{1:n}={x}_{1}\oplus {x}_{2}\oplus \dots \oplus {x}_{n}\) represents a flow of length \(n\), where \(\oplus\) is the concatenation operation. In general, \({x}_{i:i+j}\) denotes the concatenation of traffic bytes \({x}_{i}, {x}_{i+1}, \dots , {x}_{i+j}\). Then, the convolutional operation with a filter \(w\in {\mathbb{R}}^{hk}\) is applied to a window of \(h\) traffic bytes to generate a new feature \({c}_{i}\in \mathbb{R}\) (Eq. 2).

$${c}_{i }=f(w\circ {x}_{i:i+h-1}+b)$$
(2)

Here, \(b\) is the bias term, \(f\) is the activation function (ReLU), and \(\circ\) refers to the dot product between the convolutional filter and the traffic sub-matrix. The filter is applied to each possible window of traffic bytes \(\{{x}_{1:h}, {x}_{2:h+1},\dots , {x}_{n-h+1:n}\}\) to generate the feature map \(c\) (Eq. 3).

$$c= \left[{c}_{1}, {c}_{2}, \dots , {c}_{n-h+1}\right]$$
(3)

Then, a max-pooling operation is applied over the obtained feature map to take the maximum value as the next feature (Eq. 4).

$$\widehat{c}=\text{max}\left(c\right)$$
(4)

The resulting features are passed to a fully connected SoftMax layer to determine the probability distribution over classes for the input session or flow (Eq. 5).

$$y=Softmax({W}^{\left(s\right)}\widehat{c}+b)$$
(5)
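As a concrete illustration, the following minimal NumPy sketch traces Eqs. (2)–(5) for a single convolutional filter; the flow length, embedding dimension, window size, number of classes, and random values are all hypothetical choices of ours.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical flow of n traffic bytes, each embedded as a k-dimensional vector.
n, k, h, C = 20, 8, 3, 4                # flow length, embedding size, window size, classes
x = np.random.rand(n, k)                # x_{1:n}
w = np.random.rand(h, k)                # convolutional filter covering h traffic bytes
b = 0.1                                 # bias term of Eq. 2

# Eqs. 2-3: slide the filter over every window x_{i:i+h-1} to build the feature map c.
c = np.array([relu(np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])

# Eq. 4: max-pooling keeps the strongest response of this filter.
c_hat = c.max()

# Eq. 5: a fully connected SoftMax layer maps the pooled feature to class probabilities.
W_s = np.random.rand(C, 1)
b_s = np.zeros(C)
y = softmax(W_s @ np.array([c_hat]) + b_s)
```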

Overall, it can be stated that a CNN is an ideal choice for network traffic classification because it is able to capture spatial dependencies between adjacent bytes in network packets, which results in finding discriminative patterns for every class of protocols/applications and can lead to accurate traffic classification.

4 Methodology

Unbalanced data have a considerable influence on the efficiency of CNNs and can cause over-fitting during the training process [42]; different methods have been proposed to fill this lacuna. Existing methods [1, 35] generally utilize re-sampling techniques that need expert knowledge to specify the majority and minority classes. Besides being time-consuming and costly, these techniques not only lead to the removal of many data patterns but can also generate false data patterns. In this regard, a Cost-Sensitive CNN (CSCNN) method for the task of encrypted traffic classification is proposed in this paper. A schematic representation of the proposed method is illustrated in Fig. 1.

Fig. 1
figure 1

Schematic representation of the proposed method

As shown, the proposed method includes two separate phases. In the first phase, the network traffic data are cleaned to become suitable as the input of the neural network, while the second phase consists of the proposed CSCNN method. More details about the used dataset and the proposed method are provided in the following.

4.1 Dataset

Generally, self-collected traffic or the private traffic of security companies has been used for the evaluation of encrypted traffic classification methods, which has resulted in incompatibility between the reported results. In other words, public datasets have rarely been employed for evaluation in this field, and since classical machine learning requires handcrafted features as input, most of the existing public datasets are feature datasets rather than traffic datasets. However, Gil et al. [17] generated a public dataset for the task of encrypted traffic classification (the ISCX VPN-nonVPN dataset) that includes the captured traffic of different applications in pcap format files. Each file was labeled according to the application that produced the packets (e.g., Skype and Hangouts) along with the activity the application was engaged in during the capture session (e.g., voice call, chat, file transfer, or video call). The dataset also contains packets captured over Virtual Private Network (VPN) sessions. Like the non-VPN traffic, the VPN traffic was captured for different applications, such as Skype, while performing different activities, like voice calls, video calls, and chat.

With the above-mentioned issues in mind, the ISCX VPN-nonVPN dataset was used in our experiments in order to provide a fair comparison between our proposed method and other existing methods for the task of encrypted traffic classification.

4.2 Pre-Processing

Since the file format of the ISCX VPN-nonVPN dataset is not suitable to be used as the input of the proposed CSCNN method, the raw traffic of this dataset must be pre-processed to generate the required format. The pre-processing phase contains seven fundamental steps, which are explained in the following:

  1. (1)

    Data integration and re-labeling: All pcap files are merged to form a single dataset in the integration phase. Since the pcap files are labeled according to their applications, they must be re-labeled for application identification and traffic description. Re-labeling resulted in 17 and 14 classes for application identification and traffic description, respectively. More details about the re-labeled classes are reported in Table 1.

  2. (2)

    Converting bits to bytes: The values in the dataset are stored in hexadecimal form between 0x00 and 0xff, each comprising eight bits. To reduce the input size, the data is first converted to byte format and then to a value between 0 and 255.

  3. (3)

    Discarding irrelevant information: Since the dataset was collected in a real-world simulation, it contains several useless packets that are not suitable for modeling and must be removed. As a matter of fact, the ISCX VPN-nonVPN dataset contains TCP segments with the SYN, ACK, or FIN flags set, which are essential for the three-way handshake that establishes or terminates a connection; however, these segments do not carry any useful information about the application that generated them and must be discarded. Moreover, Domain Name Service (DNS) segments, which are not relevant to application identification or traffic classification, must also be removed from the dataset.

  4. (4)

    Packet truncation: Since neural networks need fixed-length inputs and the packet length varies widely throughout the dataset, the length of the packets must be unified. To this end, the size of all packets is fixed to 1480 bytes by truncation or zero-padding.

  5. (5)

    Normalization: To achieve higher efficiency, all the packet bytes are divided by 255 so that the input values are in the range [0, 1]. A sketch combining this step with the previous truncation step is given after this list.

  6. (6)

    Cost matrix formulation: The cost matrix γ is formulated in this step of the pre-processing phase and is later applied to the output of the last layer of the CNN to alter the network's weights based on the various costs. Whereas a CNN commonly assigns the highest score to the output class, the goal of the cost matrix is to assign the maximum cost to the minority classes while lower costs are assigned to the other (majority) classes. In a cost matrix, the diagonal of the matrix is known as the utility vector; this vector represents correct classification and is set to zero. All other costs are non-negative, i.e., \({\gamma }_{i,j}>0\). An example of a cost matrix for a dataset with four classes is presented in Table 2, where a 4 × 4 matrix is generated in which all cells are larger than zero, except those on the diagonal, which are always set to zero.

  7. (7)

    Removing the data-link header: Since the ISCX VPN-nonVPN dataset was captured at the data-link layer, the data-link header includes physical link information such as the Media Access Control (MAC) address, which is necessary for transmitting frames over the network but is neither essential nor informative for application identification or traffic description. Therefore, the Ethernet header is eliminated in the pre-processing phase.
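As referenced in step (5), the following minimal sketch illustrates the truncation/zero-padding and normalization steps on a single raw packet; pcap parsing and the filtering of handshake and DNS segments (step 3) are assumed to have been done already, and all names and values are illustrative.

```python
import numpy as np

PACKET_LEN = 1480  # fixed packet length used in the pre-processing step

def to_fixed_vector(packet_bytes: bytes) -> np.ndarray:
    """Truncate or zero-pad a raw packet to 1480 bytes and scale it to [0, 1]."""
    values = np.frombuffer(packet_bytes[:PACKET_LEN], dtype=np.uint8)
    padded = np.zeros(PACKET_LEN, dtype=np.float32)
    padded[:len(values)] = values          # byte values already lie in 0..255
    return padded / 255.0                  # normalization step

# Hypothetical raw packet bytes (data-link header, handshake, and DNS segments
# are assumed to have been removed in the earlier steps).
raw = bytes(range(256)) * 3                # 768 bytes -> zero-padded to 1480
x = to_fixed_vector(raw)                   # shape (1480,), values in [0, 1]
```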

Table 1 Number of samples per class: (A) Traffic description (B) Application Identification
Table 2 An example of a cost matrix for a dataset with four classes

Consequently, it can be stated that if the algorithm classifies a sample accurately, there is no cost. Otherwise, the proposed algorithm assigns a cost to the misclassification based on the distribution of the corresponding classes. The cost of each class is computed using the following equation (Eq. 6).

$$\text{Cost}_{c_i \to c_j} = \begin{cases} 1 - \dfrac{N_{c_i}}{N_{c_i}+N_{c_j}}, & \text{if } \dfrac{N_{c_i}}{N_{c_j}} > 1 \\[2ex] \dfrac{N_{c_i}}{N_{c_i}+N_{c_j}}, & \text{otherwise} \end{cases}$$
(6)
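For illustration, a minimal sketch that builds the cost matrix γ from per-class sample counts following Eq. (6) is given below; the class counts are hypothetical.

```python
import numpy as np

def build_cost_matrix(class_counts):
    """Cost matrix of Eq. 6: zero on the diagonal, distribution-based costs elsewhere."""
    n = np.asarray(class_counts, dtype=float)
    num_classes = len(n)
    gamma = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        for j in range(num_classes):
            if i == j:
                continue                       # correct classification carries no cost
            if n[i] / n[j] > 1:                # N_ci / N_cj > 1
                gamma[i, j] = 1 - n[i] / (n[i] + n[j])
            else:                              # N_ci / N_cj <= 1
                gamma[i, j] = n[i] / (n[i] + n[j])
    return gamma

# Hypothetical class distribution with one clear minority class.
gamma = build_cost_matrix([5000, 3000, 800, 100])
```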

4.3 Cost-Sensitive Convolutional Neural Network (CSCNN)

Based on previous studies, it can be concluded that using cost-sensitive learning during the training of a neural network can result in higher efficiency compared to data-level methods. Cost-sensitive learning is a subfield of machine learning that considers the cost of prediction errors during the training of a model. It is closely related to the field of imbalanced learning and involves explicitly defining and using costs during the training process. In this regard, a Cost-Sensitive CNN (CSCNN) is proposed in this paper that tries to automatically learn appropriate features for the minority and majority classes in order to improve the efficiency of traffic classification.

The main notion behind cost-sensitive learning is to prioritize minority class instances when misclassification occurs. Accordingly, a cost is assigned to each misclassification type, where misclassifications of the minority classes obtain higher values than misclassifications of the majority classes. Once a misclassification happens, the cost function is activated and increases the loss value by applying the corresponding cost, which is either pre-defined by the user or automatically assigned by a technique. Therefore, cost-sensitive learning results in a higher cost value when a minority class instance is misclassified than when a majority class instance is misclassified. By considering this cost while updating the parameters of the neural network, the training process becomes more sensitive to the minority classes. Unlike previous methods that utilized a user-defined matrix, the proposed CSCNN automatically adjusts the cost of each misclassification using the data distribution. The details of the proposed CSCNN are presented in Fig. 2; it includes three primary steps: (1) forming a cost matrix, (2) learning features using a CNN, and (3) applying a cost-sensitive function.

Fig. 2
figure 2

Cost-Sensitive CNN algorithm

Notably, the goal of the proposed method is to decrease the influence of unbalanced data on the efficiency of encrypted network traffic classification by focusing on learning from data whose classes carry uneven penalties or costs when making a prediction. To this end, a cost matrix is formed in the first step by computing the distribution of samples over all classes. Thereafter, the data is fed to the CNN to carry out the classification in the second step, which can be considered an end-to-end strategy that directly learns the nonlinear relationship between the traffic input and the expected output label rather than dividing the problem into sub-problems. A two-layer CNN is utilized, where each convolution layer is followed by a ReLU activation function and a max-pooling layer. A SoftMax classifier, which is a multi-class version of logistic regression, is finally used to specify the output classes.

Since the aim is to train on the training dataset but to stop training at the point where performance on a validation dataset starts to degrade, an early stopping technique is utilized to prevent over-fitting. Accordingly, if the value of the loss function on the validation set does not change for several iterations, the training process stops. To standardize the input and stabilize the learning process, as well as to reduce the number of training epochs and accelerate learning, the batch normalization technique is also used during training.
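A minimal sketch of how these two mechanisms can be wired into training is given below, assuming a Keras/TensorFlow implementation; the patience value is our own assumption rather than a reported setting.

```python
from tensorflow import keras

# Stop training once the validation loss stops improving for several consecutive epochs.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the loss on the held-out validation set
    patience=5,                  # epochs with no improvement before stopping (an assumption)
    restore_best_weights=True,   # roll back to the best weights seen so far
)
# Batch normalization is inserted as a layer inside the network, e.g.
#   keras.layers.BatchNormalization()
# and the callback is later passed to model.fit(..., callbacks=[early_stop]).
```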

Thereafter, after each misclassification, the actual and predicted classes are first identified by a cost-sensitive layer (third step). The misclassification cost is then determined using the cost matrix and used to modify the outputs. Ultimately, the specified cost is applied to the outputs of the actual and predicted classes. In summary, the classification process is as follows:

  1. (1)

    The first convolutional layer makes use of a set of learnable filters, where the input data is processed with 8 filters (filter size = [1, 3]). Each filter moves one step after each convolutional operation (stride 1). Convolving the same filters at every position in the input matrix allows features to be extracted automatically.

  2. (2)

    The features obtained from the convolution layer are fed to ReLU as the activation function to learn complex patterns in the data (Eq. 7).

$$\text{ReLU}\left(x\right)=\max\left(0, x\right)$$
(7)
  1. (3)

    After applying the ReLU function, the results are processed through the pooling layer to reduce the dimension of the features. The pooling layer operates on each feature map separately to create a new set of pooled feature maps of the same number. Max-pooling with a pool size of [1, 2] and a step size of 1 is used in the proposed method, which operates as follows (Eq. 8).

$$\text{maxpooling}\left[{x}_{1}, {x}_{2}, {x}_{3}, {x}_{4}\right] = \max\left({x}_{1}, {x}_{2}, {x}_{3}, {x}_{4}\right)$$
(8)
  1. (4)

    The three aforementioned layers (i.e., convolutional layer, ReLU, and max-pooling) are then added to the network again with the same settings described in layers 1, 2, and 3.

  2. (5)

    These features are then passed to a fully connected SoftMax layer whose output is the class probability vector (Eq. 9).

$${f}_{\theta }\left(x\right)=\frac{1}{\sum _{j=1}^{C}{e}^{{y}_{j}}}\left[{e}^{{y}_{1}}, {e}^{{y}_{2}}, \dots , {e}^{{y}_{C}} \right]=\left[p({y}_{i}=1|{x}_{i}), p({y}_{i}=2|{x}_{i}), \dots , p({y}_{i}=C|{x}_{i}) \right]$$
(9)
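A minimal Keras sketch of the layer stack described in steps (1)–(5) is given below; the input shape (one packet as a 1 × 1480 array of normalized bytes) and the number of output classes are our assumptions for illustration, and the cost-sensitive layer discussed next is not included here.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 14  # e.g. the traffic-description task; 17 for application identification

model = keras.Sequential([
    layers.Input(shape=(1, 1480, 1)),                     # one packet of 1480 normalized bytes
    # Steps 1-3: convolution (8 filters of size [1, 3], stride 1) + ReLU + max-pooling [1, 2]
    layers.Conv2D(8, (1, 3), strides=1, activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 2), strides=1),
    # Step 4: the same three layers repeated with identical settings
    layers.Conv2D(8, (1, 3), strides=1, activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 2), strides=1),
    # Step 5: fully connected SoftMax layer producing the class probabilities of Eq. 9
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
```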

A cost-sensitive strategy is then utilized to address the class imbalance problem during the feature learning process by modifying the cost function whose weight and bias parameters are learned by the SoftMax classifier to minimize the cost. In fact, the purpose of the cost-sensitive strategy is to punish different classification errors with different costs. The basic idea behind this strategy is that larger values in the output layer yield higher probabilities after the SoftMax layer than smaller output values. Consequently, CSCNN tries to decrease the predicted class output value and increase the actual class output value using Eqs. (10) and (11).

$${y}_{k}={y}_{P}-{\gamma }_{i,k}\times {y}_{P}$$
(10)
$${y}_{i}={y}_{A}+{\gamma }_{i,k}\times {y}_{A}$$
(11)

Here, the terms yP and yA respectively refer to the outputs of the predicted class and the actual class, while yk and yi respectively represent the new outputs of the predicted and actual classes. The cross-entropy cost function is then altered and a new cost function is introduced. The new function takes the y and p values as inputs and returns a loss value for each class. After modifying the outputs of the actual and predicted classes, the probability values are recalculated using the SoftMax function (denoted pk and pi for the predicted (yk) and actual (yi) classes, respectively). The new loss values for the predicted and actual classes are computed using Eqs. (12) and (13), respectively.

$$-\left(y\log \left({p}_{k}\right) +\left(1-y\right)\log \left(1-{p}_{k}\right) \right)$$
(12)
$$-\left(y\log \left({p}_{i}\right) +\left(1-y\right)\log \left(1-{p}_{i}\right) \right)$$
(13)
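For illustration, the following minimal sketch traces Eqs. (10)–(13) for a single misclassified sample; the raw output scores, class indices, and cost value are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical raw output scores of the network for one misclassified sample.
outputs = np.array([2.1, 0.4, 1.3, 0.2])   # class 0 is (wrongly) predicted
pred, actual = 0, 2                        # k = predicted class, i = actual class
gamma_ik = 0.8                             # cost taken from the cost matrix

# Eq. 10: decrease the predicted-class output; Eq. 11: increase the actual-class output.
y_k = outputs[pred]   - gamma_ik * outputs[pred]
y_i = outputs[actual] + gamma_ik * outputs[actual]
modified = outputs.copy()
modified[pred], modified[actual] = y_k, y_i

# Recompute the probabilities with SoftMax, then the per-class losses of Eqs. 12-13.
p = softmax(modified)
p_k, p_i = p[pred], p[actual]
loss_pred   = -np.log(p_k)       # Eq. 12 with y = 1 for the predicted class
loss_actual = -np.log(1 - p_i)   # Eq. 13 with y = 0 for the actual class
```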

Since the value of y is equal to 1 for the predicted class and zero for the actual class, the predicted class activates the first part of the equation, i.e. − log(p), while the actual class activates the second part, i.e. − log(1 − p). Because the probability p of the predicted class is decreased (by reducing its output value yk), the value of − log(pk) becomes larger. Furthermore, as the probability p of the actual class is increased (by enlarging its output yi), the (1 − p) term in − log(1 − pi) is reduced and this loss term also grows. Consequently, the outputs of the loss function are increased for both classes, which leads to an increase in the overall loss over the various classes.

In general, the proposed CSCNN aims to decrease the neural network cost by imposing penalties on the various misclassifications. These costs are specified using the class distribution in such a way that classes with fewer samples obtain higher costs and the majority classes obtain lower costs, aiming to train a network that is more sensitive to the minority classes.

5 Results and Discussion

In order to demonstrate the efficiency of our proposed method, we carried out various experiments in terms of traffic classification, traffic description, and application identification. All implementations were conducted using Python as the programming language within the Anaconda environment, with TensorFlow as the deep learning backend. The learning rate and the number of epochs were set to 0.1 and 100, respectively. The sizes of the convolutional filters and the pooling filter were set to 1 × 12 and 1 × 3, respectively. To prevent overfitting, a dropout layer with a masking probability of 0.4 was applied for regularization. A fully connected network with two hidden layers was implemented to perform the final classification. Stochastic Gradient Descent (SGD) and cross-entropy were utilized as the optimizer and loss function, respectively. For training, 90 % of the data was randomly selected as the training set and the remaining 10 % was used as the test set. The implementations were run on a system with an Intel Xeon E5-2620 2.0 GHz processor and 8 GB of RAM running Windows Server 2008.
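For reference, a minimal, self-contained Keras sketch of this training configuration is given below; the placeholder random data, the single convolution block, and the 14-class output are simplifications and assumptions of ours rather than the exact implementation.

```python
import numpy as np
from tensorflow import keras
from sklearn.model_selection import train_test_split

# Hypothetical pre-processed data: 1000 packets of 1480 normalized bytes, 14 classes.
X = np.random.rand(1000, 1, 1480, 1).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 14, size=1000), num_classes=14)

# 90 % of the data for training, the remaining 10 % held out as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = keras.Sequential([
    keras.layers.Input(shape=(1, 1480, 1)),
    keras.layers.Conv2D(8, (1, 12), activation="relu"),   # 1 x 12 convolutional filters
    keras.layers.MaxPooling2D(pool_size=(1, 3)),          # 1 x 3 pooling filter
    keras.layers.Flatten(),
    keras.layers.Dropout(0.4),                            # dropout with masking probability 0.4
    keras.layers.Dense(14, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),  # SGD, learning rate 0.1
              loss="categorical_crossentropy",                    # cross-entropy loss
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, validation_split=0.1)
```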

To evaluate the proposed method, five commonly used metrics, namely Accuracy, Recall, Precision, F1 Score, and False Alarm Rate (FAR), were employed in our experiments; their equations are provided below, where TP, TN, FP, and FN respectively refer to true positives, true negatives, false positives, and false negatives.

$${\text{Accuracy}} = \frac{TP+TN}{FP+FN+TP+TN}$$
(14)
$${\text{Recall}} = \frac{TP}{TP+FN}$$
(15)
$${\text{Precision}} = \frac{TP}{TP+FP}$$
(16)
$${\text{F1 Score}} = \frac{2\times {\rm{Recall}}\times {\rm{Precision}}}{{\rm{Recall}}+{\rm{Precision}}}$$
(17)
$${\text{FAR}} = \frac{FP}{FP+TN}$$
(18)
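For completeness, a minimal sketch that computes these five metrics from the per-class confusion-matrix counts is given below; the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Per-class metrics of Eqs. 14-18 from the confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    recall    = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1        = 2 * recall * precision / (recall + precision)
    far       = fp / (fp + tn)          # false alarm rate
    return accuracy, recall, precision, f1, far

# Hypothetical counts for a single class.
print(classification_metrics(tp=95, tn=880, fp=10, fn=15))
```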

As previously mentioned, deep neural networks have rarely been employed for the task of traffic classification, and most of the existing methods have used particular self-collected traffic datasets in their experiments. Moreover, the efficiency of deep learning based methods is highly dependent on the hardware used. Consequently, comparing the proposed method with the state of the art is complicated. Nonetheless, in order to demonstrate the superior performance of the proposed method, we compared it with two of the most prominent methods in this field, namely Deep Packet [1] and Datanet [37], in terms of traffic classification, traffic description, and application identification on the ISCX dataset. The proposed method is also compared with machine learning based methods, whose results were taken from their original papers. More details about the obtained results are provided in the following.

5.1 Classification Results

The performance of the proposed method in classifying VPN and Non-VPN traffic, compared to the Deep Packet [1] and Datanet [37] methods, is investigated in this section. The confusion matrices for all three methods are illustrated in Fig. 3. As can be seen, the proposed CSCNN method is able to detect VPN and Non-VPN traffic with 98 % accuracy, while only 2 % of the traffic was misclassified. On the other hand, Deep Packet [1] and Datanet [37] respectively obtained accuracies of 97 and 94 % for Non-VPN classification and 96 and 92 % for VPN classification. Clearly, the proposed CSCNN method has superior performance for this classification task.

Fig. 3
figure 3

Confusion matrices of classification for VPN and Non-VPN classes

To provide a broader comparison, the results obtained from the confusion matrices of Fig. 3 are illustrated in Fig. 4 based on the five metrics of accuracy, recall, f-measure, precision, and false alarm rate. Although CSCNN has slightly higher performance than Deep Packet [1] and Datanet [37] on the four measures of accuracy, recall, f-measure, and precision, their performances can be considered relatively identical, which can be attributed to the balanced distribution of the class data, since each of the VPN and Non-VPN classes includes about 50 % of the total data. Moreover, the false alarm rate of CSCNN on Non-VPN traffic is lower than that of the two other methods.

Fig. 4
figure 4

Comparison of the two-class classification performance of CSCNN, Deep Packet, and Datanet methods

5.2 Traffic Description Results

Traffic description is another task that can be considered for evaluating the performance of the proposed method. In this regard, the confusion matrices of CSCNN, Deep Packet [1], and Datanet [37] are illustrated in Fig. 5, where the rows of the confusion matrices correspond to the actual classes and the columns to the predicted labels; the matrices are therefore row-normalized. Notably, darker diagonal elements indicate that a method classifies each application with minor confusion, because they represent the correctly classified results. By carefully observing the confusion matrices of these three methods, it is clear that the number of misclassifications of CSCNN is lower than those of the other two methods, which clearly demonstrates the efficiency of our proposed method.

Fig. 5
figure 5

Confusion matrices of traffic description (14 classes)

To provide a broader comparison, the results obtained from the confusion matrices of Fig. 5 are illustrated in Fig. 6 based on the five metrics of accuracy, recall, f-measure, precision, and false alarm rate. As can be seen, CSCNN presented better results than the other two methods. The advantage of CSCNN is clearest in the false alarm rate comparison, since the higher error rates of Deep Packet [1] and Datanet [37] on the minority class samples lead to higher false alarm values for those methods.

Fig. 6
figure 6

Comparison of the traffic description performance of CSCNN, Deep Packet, and Datanet methods

Moreover, this value is higher for the Chat, File transfer, Streaming, and VPN: Streaming classes compared to the other classes. The accuracy of CSCNN is about 96.9 %, while this value is 94.3 and 96.71 % for Deep Packet [1] and Datanet [37], respectively. The accuracies of all three methods are thus relatively close to each other, which can be attributed to the fact that this measure does not account for the distribution of the classes. In this regard, it can be claimed that accuracy is not an appropriate measure for evaluating the classification of unbalanced data. In contrast, recall and precision are better suited to evaluating the classification of unbalanced data. Notably, the recall and precision values are about 97.4 % for CSCNN, while these values are lower for Deep Packet [1] and Datanet [37]. The comparison of the CSCNN, Deep Packet [1], and Datanet [37] methods for traffic description according to the five metrics of accuracy, recall, f-measure, precision, and false alarm rate is also presented in Table 3. The higher performance of CSCNN on all evaluation metrics shows that it has effectively extracted and learned the discriminative features from the training set and can successfully perform traffic description.

Table 3 Traffic description performance of CSCNN, Deep Packet, and Datanet method

5.3 Application Identification Results

Application identification is another task used to evaluate the performance of the proposed method. In this regard, the confusion matrices of CSCNN, Deep Packet [1], and Datanet [37] are illustrated in Fig. 7, where the rows of the confusion matrices correspond to the actual classes and the columns to the predicted labels; the matrices are therefore row-normalized. As is clear, CSCNN can correctly classify the YouTube, Tor, Torrent, and AIM Chat classes. Moreover, by carefully observing the confusion matrices of these three methods, it is obvious that CSCNN classifies samples more efficiently, and its misclassification rate is lower than those of the other two methods for application identification.

Fig. 7
figure 7

Confusion matrices of application identification (17 classes)

To provide a broader comparison, the results obtained from the confusion matrices of Fig. 7 are illustrated in Fig. 8 and Table 4 based on the five metrics of accuracy, recall, f-measure, precision, and false alarm rate. As can be seen, CSCNN presented better results than the other two methods. The accuracy of CSCNN is about 97.9 %, while this value is 96.2 and 96.1 % for Deep Packet [1] and Datanet [37], respectively, and the average recall is about 98.6, 96.4, and 93.9 % for CSCNN, Deep Packet [1], and Datanet [37], respectively. The higher performance of CSCNN on all evaluation metrics shows that it has effectively extracted and learned the discriminative features from the training set and can successfully distinguish each application.

Fig. 8
figure 8

Comparison of the application identification performance of CSCNN, Deep Packet, and Datanet methods

Table 4 Application identification performance of CSCNN, Deep Packet, and Datanet method


5.4 Training Time Analysis

Considering that the training time of deep neural networks is highly dependent on the hardware on which they are implemented (modern GPUs, for example, can significantly reduce the training time), training time cannot be considered a fair measure for comparing the efficiency of deep learning based methods and has rarely been explored as an evaluation metric. However, in order to provide an analysis of the time complexity of our proposed method, we plot the training accuracy of CSCNN compared to Deep Packet [1] and Datanet [37] in Fig. 9. As is clear, the classification accuracies of these three methods over the number of epochs are very close to each other for the two-class classification (Fig. 9.A), which can be attributed to the fact that the two classes are relatively balanced. In the case of traffic description and application identification, it is obvious that not only does CSCNN achieve higher accuracy, but it also reaches its maximum accuracy after only four epochs and therefore converges faster than the other two methods.

Fig. 9
figure 9

Training accuracy comparison based on the number of epochs

Training time per epoch is another factor that can be considered for the time analysis. In this regard, the runtimes of the three methods, namely CSCNN, Deep Packet [1], and Datanet [37], over the number of epochs for the tasks of traffic description and application identification are depicted in Fig. 10. As is clear, the training times required by Deep Packet [1] and Datanet [37] are very close to each other, while CSCNN requires more time per epoch. Nevertheless, although CSCNN requires more training time per epoch, it converges in fewer epochs (about four) than the other two methods. Therefore, it can be concluded that the difference between their training times is not critical. However, it is necessary to mention that choosing an optimal model for the traffic classification task is not straightforward, since the definition of "optimal" is not well-defined and there is always a tradeoff between the model complexity (training and test speed) and its performance.

Fig. 10
figure 10

Training time comparison based on the number of epochs

5.5 Discussion

As previously mentioned, in order to demonstrate the efficiency of our proposed method, we carried out various experiments and explored its performance on the tasks of traffic classification, traffic description, and application identification compared to two of the most prominent methods, namely Deep Packet [1] and Datanet [37]. CSCNN presented higher performance, with an F1-score of about 97.4 and 96.3 % for traffic description and application identification, respectively, which is not only higher than the two other methods (Tables 3 and 4) but also implies that it is capable of accurately classifying the packets.

To provide a broader comparison, we also compared our proposed method with other existing methods that evaluated their approaches on the ISCX VPN-nonVPN dataset. As mentioned in Sect. 2, Gil et al. [17] and Yamansavascilar et al. [33] used this dataset in their experiments. However, it must be emphasized that they utilized handcrafted features based on network traffic flows, while CSCNN does not require any handcrafted features and considers the network traffic at the packet level, and therefore can be more applicable in real-world settings. Table 5 presents the results of comparing the performance of CSCNN with these existing methods. In light of the previously mentioned analysis, it can be concluded that CSCNN has higher performance than both machine learning and deep learning based methods.

Table 5 Comparison between CSCNN and other existing methods on the ISCX VPN-nonVPN dataset

It is worth mentioning that Wang et al. [39] also proposed a similar method for traffic description on the ISCX VPN-nonVPN dataset and obtained 100 % precision. However, their result is seriously questionable because their best result was obtained by utilizing packets containing all headers from all five layers of the Internet protocol stack. Considering that the source and destination IP addresses are unique for each application, they presumably relied primarily on this feature for classification, in which case a much simpler classifier could have handled the classification task. In contrast, we masked the IP address fields in our pre-processing steps to avoid this phenomenon.

6 Conclusion

With the rapid development of the Internet and particularly of online applications, accurately classifying Internet traffic has become one of the prominent issues in the field of computer networks. On the other hand, with the enormous growth of deep learning models in various applications and their remarkable results, and considering that they do not need any handcrafted features and are able to learn high-level representations of the input data while extracting valuable features automatically, they have also received considerable attention for the task of traffic classification.

Because standard learning methods are designed to minimize the overall error without considering the class distribution, they are generally biased toward the majority classes and are less sensitive to minority class samples. In this regard, the convergence and generalization of a classification method can easily be affected by the problem of unbalanced data. To fill this lacuna, a Cost-Sensitive Convolutional Neural Network (CSCNN) is proposed in this paper that deals with the class imbalance issue in encrypted traffic classification. Accordingly, a cost matrix is generated based on the class distribution and is then utilized during the training process to modify the weights. To demonstrate the efficiency of the proposed method, it was compared with machine learning and deep learning based methods on the ISCX VPN-nonVPN dataset for the tasks of traffic classification, traffic description, and application identification. Based on the experimental results, it can be concluded that CSCNN has higher efficiency than both machine learning and deep learning based methods.

In future work, CSCNN can be applied to more complex tasks, including multi-channel classification, distinguishing between different types of Skype traffic such as chat, voice, and video calls, and Tor traffic. Applying a cost-sensitive strategy to other models such as SAE or RNN is also worth exploring.