1 Introduction

According to the 42nd Statistical Report on Internet Development in China issued by the China Internet Network Information Center, the number of Internet users in China had reached 802 million by mid-2018 [1]. With the increasing popularity of the Internet, its scale keeps growing; meanwhile, various new network applications and services are emerging. Network technology, represented by wireless communications and networks, has become one of the necessities of people's daily life. Network traffic classification and application identification are the basis for solving many network management problems and are of great significance to network security, intrusion detection, and quality-of-service guarantees. In the last two decades, several network traffic classification methods have been developed. The earliest traffic classifiers are mostly port-based methods that use the port number to identify applications [2]. The advantages of port-based classification are simplicity and speed. However, with more and more applications using dynamic port numbers, encryption, and re-encapsulation, this type of method is no longer effective. For example, P2P applications evade identification by using port masquerading. As a result, the accuracy of port-based classification is below 70% for some applications [3].

To overcome these limitations, deep packet inspection (DPI) [4] emerged as a payload-based classification technique. DPI searches and matches the application-layer payload of a traffic stream, so that it can identify different types of network traffic. This type of method can accurately identify unencrypted traffic, but the matching process consumes considerable computation and memory. In addition, some privacy policies do not allow sensitive information or payload content to be inspected. A well-known open-source DPI application is the nDPI tool [5], which can handle encrypted traffic and supports a large number of protocols, but nDPI cannot identify certain types of network traffic and labels them as unknown. Behavioral classification methods based on host behavior features can discover new network traffic behaviors without analyzing packet payloads. The research in [6] shows that the connection graphs generated by client-server applications differ from those of P2P applications. For instance, this type of method classifies traffic by counting the communicating hosts and considering the numbers of ports and transport-layer protocols. Behavioral classification methods usually achieve high classification accuracy with low computational cost; however, they only work well at end hosts and endpoints. In addition, changes in the network environment can invalidate classification methods based on host behavior.

With the continuous improvement of hardware and software technology, intelligent traffic classification methods based on machine learning have attracted wide attention in recent years [7,8,9,10,11,12]. Machine learning methods generally do not depend on application-layer payload information, but only need the header information of packets. They build machine learning models that classify network traffic based on its statistical characteristics. A well-known example of this type of method is intrusion detection systems, which apply machine learning techniques such as neural networks, decision trees, support vector machines (SVM), Bayesian networks, etc. SVM has become one of the most promising methods because it can solve high-dimensional nonlinear problems and maximize the margin in the kernel space [13, 14]. For example, an SVM-based method classifies Internet traffic based on the statistical characterization of the payload [15]. An incremental SVM with an attenuation factor was proposed in [8]; in its updating process, the support vectors and their corresponding weights are adjusted to adapt to changes of the classification hyperplane. A proximal SVM was proposed to build an online network traffic classifier, whose advantage is that it only requires solving a fast and simple system of linear equations [16].

Traditional machine learning traffic classifiers train on the network traffic data as a whole in batch mode. Batch learning is suitable for small-scale classification, but its computational complexity is high for large-scale network data. In addition, when the data distribution of network traffic changes over time, especially when new training data arrive, the classification performance of batch learning degrades significantly. Thus, batch learning methods cannot adapt to practical applications with massive, streaming network traffic data. Online learning is one of the effective ways to solve this problem. Compared with traditional batch learning methods, online learning methods can achieve better generalization performance in this setting: an online learner continuously updates the classifier with newly arrived data points, so that its classification performance improves step by step. Up to now, several successful SVM-based online learning methods have been proposed for network traffic classification. These methods can basically be grouped into two types: (1) reducing the computational complexity by stochastic gradient descent [17, 18]; (2) selecting representative data points to keep the training set small [19, 20].

However, online SVMs are almost always sensitive to noise; meanwhile, they usually assume that the training data are class balanced. In practice, noise and class imbalance are common in network traffic. For example, HTTP always generates more traffic than other applications, such as P2P and VoIP [21]. In order to establish a more effective online SVM for network traffic classification, in this study we propose a scalable kernel convex hull online SVM called SKCHO-SVM, which controls the size of the training set based on the scalable kernel convex hull in noisy and imbalanced network traffic scenarios. The SKCHO-SVM method consists of two stages: (1) in the offline stage, discard the noise points, select the training data points based on the scalable kernel convex hull, and train a robust classifier on the basis of pinball-SVM; (2) in the online stage, re-compute the scalable kernel convex hull and update the classifier step by step. Note that the scalable convex hull is computed in both the offline and online stages; we perform this procedure periodically to keep the size of the training set within a reasonable range.

The advantages of SKCHO-SVM are as follows: (1) it can deal with noisy network traffic data efficiently, since outliers are discarded before classifier training and the noise insensitivity of pinball-SVM is inherited. (2) By dynamically adjusting the scalable kernel convex hull used as training data, SKCHO-SVM can be applied in large-scale network traffic scenarios. (3) The scalable kernel convex hull can relieve class imbalance, since it extracts the convex hull vertices according to the profile of the data rather than the number of data points. (4) It is proved theoretically that training on the scalable kernel convex hull achieves almost the same classification performance as training on all the data.

The rest of this paper is organized as follows. Section 2 reviews related work on SVM and pinball-SVM. Section 3 describes the proposed SKCHO-SVM method in detail. Section 4 presents the classification performance analysis of SKCHO-SVM. The experimental performance of SKCHO-SVM is evaluated in Sect. 5. The conclusions are given in the final section.

2 Related works

The classification principle of traditional SVM is to build an optimal classification hyperplane that maximizes the minimum margin between the data points of different classes. In the binary classification task, given the dataset X = {x1, x2, …, xN} and its corresponding class label set Y = {y1, y2, …, yN}, the decision function of the SVM classifier is

$$ f\left(\mathbf{x}\right)={\mathbf{w}}^T\mathbf{x}+b $$
(1)

where w and b are the weight vector and bias parameter, respectively. Based on the maximum margin strategy, the corresponding classification hyperplane of the SVM classifier can be obtained as

$$ \underset{\left\Vert f\right\Vert =1}{\max}\left\{\min f\left(\mathbf{I}\right)+\min f\left(\mathbf{II}\right)\right\} $$
(2)

where the index sets I and II are defined as I = {i | yi = 1} and II = {i | yi = −1}, respectively, and f(I) and f(II) denote f(I) = {yif(xi), i ∈ I} and f(II) = {yif(xi), i ∈ II}, respectively. Equation (2) shows that traditional SVM is sensitive to noise points, especially to noise around the boundary of the classification hyperplane.

To solve this problem, inspired by the application of quantiles in statistics, the pinball loss function has been introduced into the traditional SVM framework [22,23,24]. Huang et al. [22] proposed the asymmetric least squares SVM, called aLS-SVM, which uses the pinball loss function in the least squares SVM model. Huang et al. [23] proposed an asymmetric v-tube support vector regression model to deal with the noise problem in regression tasks. Huang et al. [24] proposed pin-SVM by replacing the hinge loss function in the L1-SVM with the pinball loss function. Pin-SVM aims to maximize the quantile distance instead of the margin between the two classes.

Suppose the discrete scalar set U = {u1, u2, …, um} satisfies u1 ≥ u2 ≥ … ≥ um. The q-lower quantile of the set U, denoted minq{U}, can be written as

$$ {\min}^q\left\{U\right\}=\left\{t:t\in \mathbf{R},\ t\ \text{is larger than}\ q\ \text{percent of the}\ {u}_i\right\},\quad 0\le q\le 1 $$
(3)

The classification hyperplane of pin-SVM can be obtained as

$$ \underset{\left\Vert f\right\Vert =1}{\max}\left\{{\min}^q f\left(\mathbf{I}\right)+{\min}^q f\left(\mathbf{II}\right)\right\}. $$
(4)

Substituting Eq. (1) into Eq. (4) and using the quantile definition in Eq. (3), the pin-SVM problem can be represented as

$$ \underset{\left\Vert \mathbf{w}\right\Vert =1,\,b}{\max}\left\{{\min}_{i\in \mathbf{I}}^q\ {y}_i\left({\mathbf{w}}^T{\mathbf{x}}_i+b\right)+{\min}_{i\in \mathbf{II}}^q\ {y}_i\left({\mathbf{w}}^T{\mathbf{x}}_i+b\right)\right\} $$
(5)

Using the τ/(1 + τ)-lower quantile, the pinball loss Lτ(u) in pin-SVM is defined as

$$ {L}_{\tau }(u)=\left\{\begin{array}{ll}u, & \mathrm{if}\ u\ge 0,\\ {}-\tau u, & \mathrm{otherwise}.\end{array}\right. $$
(6)

where pinball parameter τ is equal to q/(1 − q). Substituting Eq. (6) into Eq. (5), we can obtain

$$ \underset{\left\Vert \mathbf{w}\right\Vert =1,b}{\mathit{\max}}\left\{ argmin\sum \limits_{i\in \mathbf{I}}{L}_{\tau}\left(1-{y}_i\left({\mathbf{w}}^T{\mathbf{x}}_i+b\right)\right)+ argmin\sum \limits_{i\in \mathbf{I}\mathbf{I}}{L}_{\tau}\left(1-{y}_i\left({\mathbf{w}}^T{\mathbf{x}}_i+b\right)\right)\right\} $$
(7)

From Eq. (7), we can see that maximizing the quantile distance between the two classes is insensitive to noise, especially to noise around the classification hyperplane, which improves the robustness of the classifier. Based on this strategy, letting ϕ(⋅) be the mapping from Rd to the kernel space, the optimization problem of pin-SVM is represented as

$$ {\displaystyle \begin{array}{c}\underset{\mathbf{w},b,\xi }{\mathit{\min}}\frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+\frac{C}{N}\sum \limits_{i=1}^N{\xi}_i,\\ {}s.t.{y}_i\left({\mathbf{w}}^T\phi \left({\mathbf{x}}_i\right)+b\right)\ge 1-{\xi}_i,\\ {}{y}_i\left({\mathbf{w}}^T\phi \left({\mathbf{x}}_i\right)+b\right)\le 1+\frac{1}{\tau }{\xi}_i.\end{array}} $$
(8)

where N is the number of training points, C is the regularization parameter, and ξi is the slack variable. When τ = 0, pin-SVM is equivalent to the standard L1-SVM; thus, L1-SVM can be considered a special case of pin-SVM. Besides its insensitivity to noise, it has been theoretically shown that minimizing the pinball loss can be viewed as a trade-off between within-class scatter minimization and misclassification error minimization. The solution of Eq. (8) can be obtained by solving a quadratic programming (QP) problem, and the computational complexity of pin-SVM is O(N3). Therefore, pin-SVM can only deal with classification tasks on small-scale datasets; for large-scale, or even medium-sized, datasets its computational cost is prohibitive.
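To make the loss concrete, the following is a minimal sketch of the pinball loss in Eq. (6) and the unconstrained pin-SVM objective; the function names and the linear-kernel setting are our own illustration (the paper works in a Gaussian kernel space), not part of the original formulation.

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss L_tau(u) from Eq. (6): u if u >= 0, otherwise -tau * u."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)

def pin_svm_objective(w, b, X, y, C, tau):
    """Unconstrained pin-SVM objective:
    0.5 * ||w||^2 + (C / N) * sum_i L_tau(1 - y_i (w^T x_i + b))."""
    margins = 1.0 - y * (X @ w + b)
    return 0.5 * np.dot(w, w) + (C / len(y)) * pinball_loss(margins, tau).sum()

# toy usage: two points per class
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(pin_svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0, tau=0.05))
```

When τ = 0, `pinball_loss` reduces to the hinge-style penalty on positive margins, which matches the remark above that L1-SVM is a special case of pin-SVM.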

3 Scalable kernel convex hull online SVM

We describe the proposed SKCHO-SVM method in detail in this section. First, we introduce the scalable convex hull in the kernel space in Sect. 3.1; it is the core component of SKCHO-SVM. Based on the definition of the scalable kernel convex hull, a small number of scalable convex hull vertices are computed and used as training points for the classifier. In Sect. 3.2, we present a detailed description of SKCHO-SVM. A theoretical analysis of SKCHO-SVM is provided in Sect. 4.

3.1 Scalable kernel convex hull

The solution of traditional SVM is equivalent to finding an optimal classification hyperplane, which separates the boundary points of the two classes with the maximum margin. Geometrically, the convex hull accurately describes the geometric structure of a set of data points.

Definition 1 (convex hull) [25]. The convex hull of the given dataset X is the smallest convex set containing all data points in X

$$ co\left(\mathbf{X}\right)=\left\{\sum \limits_{i=1}^n{\alpha}_i{\mathbf{x}}_i\ \middle|\ {\mathbf{x}}_i\in \mathbf{X},\ {\alpha}_i\ge 0,\ \sum \limits_{i=1}^n{\alpha}_i=1\right\} $$
(9)

Any point in X can be written as a linear combination of convex hull vertices in co(X)

$$ {\mathbf{x}}_i=\sum \limits_{{\mathbf{x}}_t\in co\left(\mathbf{X}\right)}{\lambda}_{i,t}{\mathbf{x}}_t $$
(10)

where \( \sum \limits_{{\mathbf{x}}_t\in co\left(\mathbf{X}\right)}{\lambda}_{i,t}=1 \) and λi, t ≥ 0.

By introducing the nonlinear kernel mapping ϕ(⋅), the input data in the kernel space are represented as X = {x1, …, xn} ↦ ϕ(X) = {ϕ(x1), …, ϕ(xn)}. The kernel convex hull of the dataset ϕ(X) is defined as co(ϕ(X)). Any point ϕ(xi) in ϕ(X) can be written as a linear combination of convex hull vertices in co(ϕ(X)):

$$ \phi \left({\mathbf{x}}_i\right)=\sum \limits_{\phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)}{\mu}_{i,t}\phi \left({\mathbf{x}}_t\right) $$
(11)

where \( \sum \limits_{\phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)}{\mu}_{i,t}=1 \) and μi, t ≥ 0.

In order to ensure the solvability of the kernel convex hull, it is necessary to “relax” the requirements for the kernel convex hull. For this goal, a scalable threshold ε is introduced in Eq. (11), and then we obtain the definition of scalable kernel convex hull.

Definition 2 (scalable kernel convex hull). The linear relationship between the kernel convex hull vertices and ϕ(xi) in the kernel space satisfies:

$$ \underset{{\mathbf{x}}_i\in \mathbf{X}}{\mathit{\max}}\mathit{\min}{\left\Vert \phi \left({\mathbf{x}}_i\right)-\sum \limits_{\phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)}{\mu}_{i,t}\phi \left({\mathbf{x}}_t\right)\right\Vert}^2\le \varepsilon $$
(12)

where 0 ≤ μi, t ≤ 1 and \( \sum \limits_{\phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)}{\mu}_{i,t}=1 \).

The role of ε has two aspects. First, the scalable threshold ε indicates the degree to which the approximation error can be tolerated. Second, different values of ε adjust the size of the kernel convex hull; we discuss this in detail in Sect. 5.2.2. Then, each point ϕ(xi) in the kernel space can be written as the following linear combination of kernel convex hull vertices:

$$ \phi \left({\mathbf{x}}_i\right)=\sum \limits_{\phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)}{\gamma}_{i,t}\phi \left({\mathbf{x}}_t\right)+{\boldsymbol{\updelta}}_i $$
(13)

where ‖δi‖2 ≤ ε and \( {\gamma}_{i,t}=\left\{\begin{array}{ll}{\mu}_{i,t}, & \mathrm{if}\ \phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right)\ \mathrm{and}\ \phi \left({\mathbf{x}}_i\right)\notin co\left(\phi \left(\mathbf{X}\right)\right),\\ {}0, & \mathrm{otherwise}.\end{array}\right. \)

Since ε is a very small positive constant, each component in δi is very small.

Based on the definition in Eq. (13), the geometric structure of the dataset ϕ(X) in the kernel space can be represented by the scalable kernel convex hull, whose vertices are boundary points of ϕ(X). The classification mechanism of SVM depends on finding the exact boundary in the optimal kernel space. It is therefore reasonable to use a small number of scalable kernel convex hull vertices to train an SVM-based classifier, and SKCHO-SVM can guarantee that it is safe to delete the data points inside the scalable kernel convex hull.
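For completeness, we add the routine expansion showing that the kernel-space distance in Definition 2 can be evaluated from kernel values alone, without computing ϕ explicitly:

$$ {\left\Vert \phi \left({\mathbf{x}}_i\right)-\sum \limits_{t}{\mu}_{i,t}\phi \left({\mathbf{x}}_t\right)\right\Vert}^2=k\left({\mathbf{x}}_i,{\mathbf{x}}_i\right)-2\sum \limits_{t}{\mu}_{i,t}k\left({\mathbf{x}}_i,{\mathbf{x}}_t\right)+\sum \limits_{t}\sum \limits_{s}{\mu}_{i,t}{\mu}_{i,s}k\left({\mathbf{x}}_t,{\mathbf{x}}_s\right), $$

where k(⋅,⋅) is the kernel function. For the Gaussian kernel, k(xi, xi) is a constant, which is exactly the term discarded later when Eq. (15) is rewritten in matrix form.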

3.2 Construction of SKCHO-SVM method

The proposed SKCHO-SVM consists of two stages: (1) discard the noise points and compute the scalable kernel convex hull as the initial training set to train the pinball-SVM classifier; (2) re-compute the scalable kernel convex hull and update the pinball-SVM classifier step by step in the online stage.

3.2.1 Compute the scalable kernel convex hull

Noise points, also called isolated points, are usually far away from most of the normal points in terms of their geometric distribution [26]. Most noise points are located around the boundary of the data, which means they are likely to be recognized as vertices of the scalable kernel convex hull. Thus, we must delete the noise points before computing the scalable kernel convex hull. Inspired by [27], we define noise points as follows. Given a point xi, we compute the distances dker(xi, xj) between xi and every other point xj (xj ∈ X, xj ≠ xi) in the kernel space: \( {d}_{\mathrm{ker}}^2\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)={\left\Vert \phi \left({\mathbf{x}}_i\right)-\phi \left({\mathbf{x}}_j\right)\right\Vert}^2 \). Then, we count the number k of points xj whose distance \( {d}_{\mathrm{ker}}^2\left({\mathbf{x}}_i,{\mathbf{x}}_j\right) \) is no larger than θ,

$$ k=\mathrm{count}\left({d}_{\mathrm{ker}}^2\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)\le \theta \right). $$
(14)

If the obtained count k is less than the given number \( \overset{\sim }{\alpha } \), xi is identified as a noise point. Note that this deletion process can easily be performed in parallel, which greatly reduces its running time. After deleting a large part of the noise points around the boundary of the data X, we use kernel support vector data description (KSVDD) [28] to compute the minimum enclosing ball of X, called MEB(ϕ(X)), which is the smallest hard ball containing all data points in ϕ(X). Let c be the center of the hypersphere of MEB(ϕ(X)). All data points are sorted in descending order of the kernel-space distance dker(xi, c) between each point xi and c. We take the data points on the hypersphere boundary as the initial set of the scalable convex hull co(ϕ(X)). Following the descending order, each point xi (xi ∈ X and xi ∉ co(ϕ(X))) is then checked in sequence to determine whether it is a scalable convex hull vertex as follows

$$ \underset{\boldsymbol{\upmu}}{\min}{\left\Vert \phi \left({\mathbf{x}}_i\right)-\sum \limits_{j=1}^{\left| co\left(\phi \left(\mathbf{X}\right)\right)\right|}{\mu}_{i,j}\phi \left({\mathbf{x}}_j\right)\right\Vert}^2,\quad s.t.\ \phi \left({\mathbf{x}}_j\right)\in co\left(\phi \left(\mathbf{X}\right)\right),\ 0\le {\mu}_{i,j}\le 1,\ \sum \limits_{j=1}^{\left| co\left(\phi \left(\mathbf{X}\right)\right)\right|}{\mu}_{i,j}=1. $$
(15)

In this study, we consider the commonly used Gaussian kernel, for which ϕ(xi)Tϕ(xi) is a constant whose value has no influence on the solution of Eq. (15). Discarding this term, Eq. (15) can be expressed in matrix form as

$$ {\displaystyle \begin{array}{c}\underset{\boldsymbol{\upmu}}{\mathit{\min}}{\boldsymbol{\upmu}}^T co{\left(\phi \left(\mathbf{X}\right)\right)}^T co\left(\phi \left(\mathbf{X}\right)\right)\boldsymbol{\upmu} -2\phi {\left({\mathbf{x}}_i\right)}^T co\left(\phi \left(\mathbf{X}\right)\right)\boldsymbol{\upmu}, \\ {}s.t.\sum \limits_{j=1}^{\left| co\left(\phi \left(\mathbf{X}\right)\right)\right|}{\mu}_{i,j}=1,0\le {\mu}_{i,j}\le 1.\end{array}} $$
(16)

Obviously, Eq. (16) is a convex quadratic programming (QP) problem. Substituting the optimal μ back into the objective, for a given threshold ε, if point xi satisfies \( {\left\Vert \phi \left({\mathbf{x}}_i\right)-\sum \limits_{j=1}^{\left| co\left(\phi \left(\mathbf{X}\right)\right)\right|}{\mu}_{i,j}\phi \left({\mathbf{x}}_j\right)\right\Vert}^2>\varepsilon \), then xi is regarded as a scalable convex hull vertex in the kernel space and is added to the set co(ϕ(X)). Otherwise, xi is regarded as a non-vertex, since it can be represented linearly by the current convex hull vertices.
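As an illustration of the two computations above, the following is a minimal sketch of the kernel-distance noise filter of Eq. (14) and the scalable-hull vertex check of Eq. (16), solved here with a generic SLSQP solver instead of the SMO routine used in the paper. The function names, the Gaussian-kernel parameterization, and the seeding of the hull with the point farthest from the kernel-space mean (a stand-in for the KSVDD center) are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def remove_noise(X, sigma, theta, alpha_tilde):
    """Eq. (14): keep points having at least alpha_tilde kernel-space neighbors within theta."""
    K = gaussian_gram(X, X, sigma)
    d2 = 2.0 - 2.0 * K                               # ||phi(x_i) - phi(x_j)||^2 for a Gaussian kernel
    counts = (d2 <= theta).sum(axis=1) - 1           # exclude the point itself
    return X[counts >= alpha_tilde]

def reconstruction_gap(x_i, hull, sigma):
    """Solve Eq. (16) for one candidate and return ||phi(x_i) - sum_j mu_j phi(x_j)||^2."""
    K = gaussian_gram(hull, hull, sigma)             # Gram matrix of the current hull vertices
    k_i = gaussian_gram(x_i[None, :], hull, sigma)[0]
    m = len(hull)
    obj = lambda mu: mu @ K @ mu - 2.0 * k_i @ mu    # k(x_i, x_i) = 1 is the dropped constant
    cons = [{"type": "eq", "fun": lambda mu: mu.sum() - 1.0}]
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
                   constraints=cons, method="SLSQP")
    return 1.0 + res.fun                             # add the constant back

def scalable_kernel_convex_hull(X, sigma=1.0, eps=1e-3):
    """Greedy hull construction: visit points in descending kernel distance from the
    kernel-space mean and keep those whose reconstruction gap exceeds eps."""
    K = gaussian_gram(X, X, sigma)
    d2_to_mean = 1.0 - 2.0 * K.mean(axis=1) + K.mean()
    order = np.argsort(-d2_to_mean)
    hull = [X[order[0]]]                             # seed with the farthest point
    for idx in order[1:]:
        if reconstruction_gap(X[idx], np.array(hull), sigma) > eps:
            hull.append(X[idx])
    return np.array(hull)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))
    X_clean = remove_noise(X, sigma=1.0, theta=1.5, alpha_tilde=5)
    hull = scalable_kernel_convex_hull(X_clean, sigma=1.0, eps=1e-3)
    print(len(X_clean), "points kept,", len(hull), "hull vertices")
```

Each retained vertex only requires solving one small QP whose size equals the current hull, which is why the hull can be grown progressively as described above.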

3.2.2 Train pinball-SVM classifier

For the small fraction of noise points distributed inside the data region, SKCHO-SVM uses pin-SVM as the classification model to further reduce their impact. Inheriting pin-SVM's insensitivity to noise located around the classification hyperplane, SKCHO-SVM obtains a noise-insensitive classifier by maximizing the quantile distance between the two classes in the kernel space. The unconstrained primal problem of SKCHO-SVM is described as

$$ \underset{\mathbf{w},b}{\mathit{\min}}\frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+\frac{C}{N}\sum \limits_{t=1}^Ml\left(\mathbf{w},b,\phi \left({\mathbf{x}}_t\right)\right) $$
(17)

where M is the size of the scalable kernel convex hull and l(w, b, ϕ(xt)) is the pinball loss function, l(w, b, ϕ(xt)) = max {−τ[1 − yt(wTϕ(xt) + b)], 1 − yt(wTϕ(xt) + b)}.

3.2.3 Update classifier online

In this stage, we re-compute the scalable kernel convex hull with the newly arrived data points and then update the classifier with the new scalable convex hull. Traditional online learning methods often use one of two strategies to update the training set: the first directly adds new points to the current training set; the second combines the new points with the current support vectors to form a new training set. However, as training data accumulate, the training efficiency of the first strategy declines sharply, so it may not be suitable for real-time classification in large-scale network traffic scenarios. The second strategy adopts the support vectors of the online classifier as training data, but the support vectors are only related to the classification hyperplane constructed by the classifier and cannot comprehensively represent the contour information of the data in the kernel space; when the distribution of newly arrived points differs from that of the historical training points, the classification performance of this strategy degrades. At this stage, we first run the noise deletion method of Eq. (14) to judge whether the newly arrived points are noise. Then, we use the decision function in Eq. (1) to classify each newly arrived point (or batch of points) \( \overset{\sim }{\mathbf{x}} \) and obtain its class label \( \overset{\sim }{y} \). If the absolute value of its decision function is below a limit value, we consider \( \overset{\sim }{\mathbf{x}} \) a scalable convex hull vertex and add it to the current training set:

$$ \left|{\mathbf{w}}^T\phi \left(\overset{\sim }{\mathbf{x}}\right)+b\right|\le 1+\beta $$
(18)

where β is a constant. In this case, the current pin-SVM is re-trained on the new training set; otherwise, neither the training set nor the pin-SVM is updated. Note that as newly arrived points accumulate, once the size of the training set exceeds a given threshold, we re-compute the scalable kernel convex hull using Eq. (16).
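A minimal sketch of this online update step is given below; `decision_value`, the `retrain` and `recompute_hull` callbacks, and the size threshold are illustrative placeholders, with the actual classifier training following the pin-SVM of Sect. 3.2.2 and the hull re-computation following Eq. (16).

```python
import numpy as np

def online_update(x_new, decision_value, train_set, retrain,
                  beta=0.1, recompute_hull=None, max_train_size=5000):
    """One step of the online stage.

    decision_value : f(x_new) = w^T phi(x_new) + b from the current classifier.
    retrain        : callback that re-trains pin-SVM on the updated training set.
    recompute_hull : callback that re-computes the scalable kernel convex hull.
    """
    if abs(decision_value) <= 1.0 + beta:             # Eq. (18): x_new lies in the margin band,
        train_set.append(x_new)                       # so treat it as a scalable hull vertex
        if recompute_hull is not None and len(train_set) > max_train_size:
            train_set[:] = list(recompute_hull(train_set))   # periodically shrink the training set
        retrain(train_set)
    # otherwise neither the training set nor the classifier is updated
    return train_set

# toy usage with stub callbacks
train_set = [np.zeros(12)]
online_update(np.ones(12), decision_value=0.7, train_set=train_set, retrain=lambda s: None)
print(len(train_set))   # 2: the new point fell inside the margin band and was added
```

Noise filtering of the newly arrived points via Eq. (14) is assumed to have been applied before this step is called.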

3.2.4 Algorithm description and computation complexity analysis

Based on the analysis above, the SKCHO-SVM procedure can be summarized as Algorithm 1.

Algorithm 1 The SKCHO-SVM procedure

Here, we discuss the computational complexity of SKCHO-SVM. In the first stage, we use the sequential minimal optimization (SMO) technique to solve Eqs. (15) and (16) in a progressive way, so the scalable convex hull of X is obtained progressively; the time complexity of the first stage is \( O\left({M}_0^2\right) \), where M0 is the number of boundary points found by KSVDD. In the second stage, we also use the SMO technique to train pinball-SVM, and its time complexity is O(M2), where M is the size of the scalable convex hull obtained in the first stage. The third stage is the online update stage of SKCHO-SVM, in which Eq. (18) is used to judge whether a newly arrived point is a scalable convex hull vertex; the computational complexity of this stage is linear. Thus, the total computational complexity of SKCHO-SVM is \( O\left({M}_0^2+{M}^2\right) \). Since M is much smaller than the number of training points, SKCHO-SVM can be applied to large-scale online classification tasks. In addition, the QP solution of SKCHO-SVM guarantees a globally optimal solution.

4 SKCHO-SVM performance analysis

To theoretically analyze the classification performance of the scalable kernel convex hull, we substitute all training points into the unconstrained objective function of pinball-SVM and denote it F1(w, b):

$$ \underset{\mathbf{w},b}{\mathit{\min}}{F}_1\left(\mathbf{w},b\right)=\underset{\mathbf{w},b}{\mathit{\min}}\frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+\frac{C}{N}\sum \limits_{i=1}^Nl\left(\mathbf{w},b,\phi \left({\mathbf{x}}_i\right)\right) $$
(19)

where l(w, b, ϕ(xi)) = max {−τ[1 − yi(wTϕ(xi) + b)], 1 − yi(wTϕ(xi) + b)}.

We denote the unconstrained objective function of SKCHO-SVM, i.e., Eq. (17), by F2(w, b). In order to compare the relationship between F1(w, b) and F2(w, b), we use the coefficients γi,t obtained in Eq. (13), written here as ri,t (1 ≤ i ≤ N, 1 ≤ t ≤ M), to construct a new unconstrained objective function F3(w, b):

$$ \underset{\mathbf{w},b}{\mathit{\min}}{F}_3\left(\mathbf{w},b\right)=\frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+\frac{C}{N}\sum \limits_{i=1}^Nl\left(\mathbf{w},b,{\mathbf{u}}_i\right) $$
(20)

where l(w, b, ui) = max {−τ[1 − yi(wTui + b)], 1 − yi(wTui + b)} and \( {\mathbf{u}}_i={\sum}_{t=1}^M{r}_{i,t}\phi \left({\mathbf{x}}_t\right),\ \phi \left({\mathbf{x}}_t\right)\in co\left(\phi \left(\mathbf{X}\right)\right) \). Different from Eq. (19), the N training points in Eq. (20) are linearly represented by the scalable kernel convex hull.

Theorem 1.

Let F3(w, b) and F2(w, b) be defined in Eqs. (20) and (17), respectively. Then F3(w, b) ≤ F2(w, b).

Proof. Since \( \sum \limits_{t=1}^M{r}_{i,t}=1, \) we have the following:

$$ {\displaystyle \begin{array}{l}{L}_3\left(w,b,{Z}^{\ast}\right)=\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \left[1-{y}_i\left({w}^T\sum \limits_{t=1}^M{r}_{i,t}\phi \left({x}_t\right)+b\right)\right],\kern0.33em 1-{y}_i\left({w}^T\sum \limits_{t=1}^M{r}_{i,t}\phi \left({x}_t\right)+b\right)\right\}\\ {}\kern5.75em =\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right],\kern0.33em \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]\right\}\\ {}\begin{array}{l}\kern5.75em \le \frac{C}{N}\sum \limits_{i=1}^N\sum \limits_{t=1}^M\mathit{\max}\left\{-\tau {r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right],\kern0.33em {r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]\right\}\\ {}\kern5.75em =\frac{C}{N}\sum \limits_{t=1}^M\mathit{\max}\left\{-\tau \left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right],\kern0.33em \left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]\right\}\sum \limits_{i=1}^N{r}_{i,t}\\ {}\kern5.75em ={L}_2\left(w,b,\phi (X)\right)\end{array}\end{array}} $$
(21)

Adding the term (1/2)‖w‖2 to both sides of the above inequality, we have F3(w, b) ≤ F2(w, b). ■

Theorem 2.

\( -\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{{y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i,-\tau {y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i\right\}\le {F}_1\left(\mathbf{w},b\right)\hbox{-} {F}_3\left(\mathbf{w},b\right)\le \frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-{y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i,\tau {y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i\right\} \)

Proof.

$$ {\displaystyle \begin{array}{l}{L}_1\left(w,b,X\right)=\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \left[1-{y}_i\left({w}^T\phi \left({x}_i\right)+b\right)\right],\kern0.33em 1-{y}_i\left({w}^T\phi \left({x}_i\right)+b\right)\right\}\\ {}\kern5.5em =\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \left[1-{y}_i\left({w}^T\left(\sum \limits_{t=1}^M{r}_{i,t}\phi \left({x}_t\right)+{\delta}_i\right)+b\right)\right],\kern0.33em 1-{y}_i\left({w}^T\left(\sum \limits_{t=1}^M{r}_{i,t}\phi \left({x}_t\right)+{\delta}_i\right)+b\right)\right\}\\ {}\begin{array}{l}\kern5.5em =\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]+\tau {y}_i{w}^T{\delta}_i,\kern0.33em \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]-{y}_i{w}^T{\delta}_i\right\}\\ {}\kern5.5em \le \frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right],\kern0.33em \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]\right\}+\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-{y}_i{w}^T{\delta}_i,\tau {y}_i{w}^T{\delta}_i\right\}\\ {}\kern5.75em ={L}_3\left(w,b,X\right)+\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-{y}_i{w}^T{\delta}_i,\tau {y}_i{w}^T{\delta}_i\right\}\end{array}\end{array}} $$
(22)

Adding the term (1/2)‖w‖2 to both sides of the above inequality, we have \( {F}_1\left(\mathbf{w},b\right)-{F}_3\left(\mathbf{w},b\right)\le \frac{C}{N}\sum \limits_{i=1}^N\max\left\{-{y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i,\tau {y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i\right\} \). Similarly,

$$ {\displaystyle \begin{array}{l}{L}_1\left(w,b,X\right)=\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \left[1-{y}_i\left({w}^T\phi \left({x}_i\right)+b\right)\right],\kern0.33em 1-{y}_i\left({w}^T\phi \left({x}_i\right)+b\right)\right\}\\ {}\kern5.5em \ge \frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-\tau \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right],\kern0.33em \sum \limits_{t=1}^M{r}_{i,t}\left[1-{y}_i\left({w}^T\phi \left({x}_t\right)+b\right)\right]\right\}-\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{{y}_i{w}^T{\delta}_i,-\tau {y}_i{w}^T{\delta}_i\right\}\\ {}\kern5.25em ={L}_3\left(w,b,X\right)-\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{{y}_i{w}^T{\delta}_i,-\tau {y}_i{w}^T{\delta}_i\right\}\end{array}} $$
(23)

Thus we have \( -\frac{C}{N}\sum \limits_{i=1}^N\max\left\{{y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i,-\tau {y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i\right\}\le {F}_1\left(\mathbf{w},b\right)-{F}_3\left(\mathbf{w},b\right)\le \frac{C}{N}\sum \limits_{i=1}^N\max\left\{-{y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i,\tau {y}_i{\mathbf{w}}^T{\boldsymbol{\updelta}}_i\right\} \). ■

Theorem 3.

Let \( \left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right) \) be the optimal solution of F1(w, b) and \( \left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right) \) be the optimal solution of F2(w, b). Then \( {F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)-{F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\le C\sqrt{C\varepsilon} \).

Proof. According to the duality theory of pinball-SVM, we have \( \left\Vert {\mathbf{w}}_2^{\ast}\right\Vert \le \sqrt{C} \).

From Theorem 2,

$$ {\displaystyle \begin{array}{l}{F}_1\left({w}_1^{\ast },\kern0.33em {b}_1^{\ast}\right)-{F}_3\left({w}_2^{\ast },\kern0.33em {b}_2^{\ast}\right)\le \frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-{y}_i{w_2^{\ast}}^T{\delta}_i,\tau {y}_i{w_2^{\ast}}^T{\delta}_i\right\}\\ {}\kern11.75em \le \frac{C}{N}\sum \limits_{i=1}^N\left\Vert {w}_2^{\ast}\right\Vert \left\Vert {\delta}_i\right\Vert \le \frac{C}{N}\sum \limits_{i=1}^N\sqrt{C\varepsilon}\\ {}\kern11.75em =C\sqrt{C\varepsilon}\end{array}} $$
(24)

Since \( \left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right) \) is the optimal solution of F1(w, b), we have \( {F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)\le {F}_1\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right) \). Using Theorem 1, we obtain:

$$ {\displaystyle \begin{array}{rl}{F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)-{F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)& \le {F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)-{F}_3\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\\ {}& \le {F}_1\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)-{F}_3\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\\ {}& \le C\sqrt{C\varepsilon}.\ \blacksquare \end{array}} $$

Theorem 4.

\( {F}_1\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)-{F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)\le 2C\sqrt{C\varepsilon} \).

Proof. Let \( \left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right) \) be the optimal solution of F3(w, b) and \( \left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right) \) be the optimal solution of \( {L}_3\left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right) \), the dual form of F3(w, b); thus \( {F}_3\left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right)={L}_3\left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right) \). Similarly, let \( \left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \) be the optimal solution of \( {L}_2\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \), the dual form of F2(w, b); thus \( {F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)={L}_2\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \). Suppose \( h\left(\overset{\sim }{\boldsymbol{\upalpha}},\overline{\boldsymbol{\upalpha}}\right)=\left\{\left({a}_t,{\hat{a}}_t\right):{a}_t=\sum \limits_{i=1}^N{r}_{i,t}\tilde{a}_{i}\ \mathrm{and}\ {\hat{a}}_t=\sum \limits_{i=1}^N{r}_{i,t}{\overline{a}}_i\right\} \); then \( h\left({\overset{\sim }{\boldsymbol{\upalpha}}}_2,{\overline{\boldsymbol{\upalpha}}}_2\right)=\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \), and thus \( \left({\overset{\sim }{\boldsymbol{\upalpha}}}_2,{\overline{\boldsymbol{\upalpha}}}_2\right) \) is also the optimal solution of \( {L}_3\left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right) \):

$$ {\displaystyle \begin{array}{rl}{L}_2\left(h\left({\overset{\sim }{\boldsymbol{\upalpha}}}_2,{\overline{\boldsymbol{\upalpha}}}_2\right)\right)& =\sum \limits_{t=1}^M{\alpha}_t-\frac{1}{2}\sum \limits_{t=1}^M\sum \limits_{s=1}^M{\alpha}_t{\alpha}_s{y}_t{y}_s\phi {\left({\mathbf{x}}_t\right)}^T\phi \left({\mathbf{x}}_s\right)\\ {}& =\sum \limits_{t=1}^M\sum \limits_{i=1}^N{r}_{i,t}\tilde{a}_{i}-\frac{1}{2}\sum \limits_{t=1}^M\sum \limits_{s=1}^M\sum \limits_{i=1}^N\sum \limits_{j=1}^N{r}_{i,t}{r}_{j,s}\tilde{a}_{i}\tilde{a}_{j}{y}_t{y}_s\phi {\left({\mathbf{x}}_t\right)}^T\phi \left({\mathbf{x}}_s\right)\end{array}} $$
(25)

Since \( \sum \limits_{t=1}^M{r}_{i,t}=1, \) the above formula can be simplified as follows:

$$ {L}_2\left(h\left({\overset{\sim }{\boldsymbol{\upalpha}}}_2,{\overline{\boldsymbol{\upalpha}}}_2\right)\right)=\sum \limits_{i=1}^N\tilde{a}_{i}-\frac{1}{2}\sum \limits_{i=1}^N\sum \limits_{j=1}^N\tilde{a}_{i}\tilde{a}_{j}{y}_t{y}_s\phi {\left({\mathbf{x}}_t\right)}^T\phi \left({\mathbf{x}}_s\right)={L}_3\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) $$
(26)

Thus, \( {L}_2\left(h\left({\overset{\sim }{\boldsymbol{\upalpha}}}_2,{\overline{\boldsymbol{\upalpha}}}_2\right)\right)={L}_2\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right)={F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)={L}_3\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \). At the same time, \( {F}_3\left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right)={L}_3\left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right) \). We thus have \( {L}_3\left({\boldsymbol{\upalpha}}_3,{\hat{\boldsymbol{\upalpha}}}_3\right)\ge {L}_3\left({\boldsymbol{\upalpha}}_2,{\hat{\boldsymbol{\upalpha}}}_2\right) \), i.e.,

$$ {F}_3\left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right)\ge {F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right) $$
(27)

From Theorem 1, \( {F}_3\left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right)\le {F}_3\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\le {F}_2\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right) \). Thus,

$$ {F}_3\left({\mathbf{w}}_3^{\ast },{b}_3^{\ast}\right)={F}_3\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right). $$
(28)

We can further obtain that

$$ -\frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{{y}_i{{\mathbf{w}}_1^{\ast}}^T{\boldsymbol{\updelta}}_i,-\tau {y}_i{{\mathbf{w}}_1^{\ast}}^T{\boldsymbol{\updelta}}_i\right\}\le {F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)\hbox{-} {F}_3\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right) $$
(29)
$$ {F}_1\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\hbox{-} {F}_3\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)\le \frac{C}{N}\sum \limits_{i=1}^N\mathit{\max}\left\{-{y}_i{{\mathbf{w}}_2^{\ast}}^T{\boldsymbol{\updelta}}_i,\tau {y}_i{{\mathbf{w}}_2^{\ast}}^T{\boldsymbol{\updelta}}_i\right\} $$
(30)

Thus,

$$ {\displaystyle \begin{array}{rl}{F}_1\left({\mathbf{w}}_2^{\ast },{b}_2^{\ast}\right)-{F}_1\left({\mathbf{w}}_1^{\ast },{b}_1^{\ast}\right)& \le \frac{C}{N}\sum \limits_{i=1}^N\left[\max\left\{-{y}_i{{\mathbf{w}}_2^{\ast}}^T{\boldsymbol{\updelta}}_i,\tau {y}_i{{\mathbf{w}}_2^{\ast}}^T{\boldsymbol{\updelta}}_i\right\}+\max\left\{{y}_i{{\mathbf{w}}_1^{\ast}}^T{\boldsymbol{\updelta}}_i,-\tau {y}_i{{\mathbf{w}}_1^{\ast}}^T{\boldsymbol{\updelta}}_i\right\}\right]\\ {}& \le \frac{C}{N}\sum \limits_{i=1}^N\left(\left\Vert {\mathbf{w}}_2^{\ast}\right\Vert \left\Vert {\boldsymbol{\updelta}}_i\right\Vert +\left\Vert {\mathbf{w}}_1^{\ast}\right\Vert \left\Vert {\boldsymbol{\updelta}}_i\right\Vert \right)\\ {}& \le \frac{C}{N}\sum \limits_{i=1}^N2\sqrt{C\varepsilon}=2C\sqrt{C\varepsilon}.\ \blacksquare \end{array}} $$

The scalable threshold ε is a very small positive number. It can be seen from Theorems 1–4 that, compared with a pinball-SVM trained on all training points, the proposed SKCHO-SVM only uses the scalable kernel convex hull to train the classifier, yet the optimal classification results of the two methods are very close.

5 Experiments

In the following, we evaluate the performance of the proposed SKCHO-SVM method on four large-scale network traffic datasets and compare it with four classification methods. Our experiments are organized as follows. Datasets and experiment settings are introduced in Sect. 5.1. The experimental analysis of SKCHO-SVM is presented in Sect. 5.2. In Sect. 5.3, the comparison experiments are reported in terms of classification performance and running time.

In order to evaluate the robustness of SKCHO-SVM, two different noise strategies are designed in the experiments. First, following [22, 23], we add noise to 5% of the data points, located both at the boundary and inside the data; the added noise is Gaussian white noise with zero mean and a variance equal to 5% of that of the data. The data boundary in the kernel space is obtained by the SVDD method. Second, 2% of the data points are randomly selected and perturbed with Gaussian white noise with zero mean and a variance equal to 10% of that of the data. The intensity of noise mode 2 is greater than that of noise mode 1, while the number of noise points in noise mode 1 is greater than that in noise mode 2.
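The noise injection can be sketched as follows; the per-feature interpretation of "variance of 5% of the data" and the omission of the SVDD-based boundary-point targeting are our own simplifying assumptions.

```python
import numpy as np

def add_feature_noise(X, frac_points, var_frac, seed=0):
    """Add zero-mean Gaussian noise to a random fraction of points.

    The noise variance of each feature is var_frac times that feature's variance;
    the boundary/interior placement used in the paper (via SVDD) is omitted here.
    """
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    idx = rng.choice(len(X), size=int(frac_points * len(X)), replace=False)
    std = np.sqrt(var_frac * X.var(axis=0))
    X_noisy[idx] += rng.normal(0.0, std, size=(len(idx), X.shape[1]))
    return X_noisy

# noise mode 1: 5% of the points, 5% variance; noise mode 2: 2% of the points, 10% variance
X = np.random.default_rng(42).normal(size=(1000, 12))
X_mode1 = add_feature_noise(X, frac_points=0.05, var_frac=0.05, seed=1)
X_mode2 = add_feature_noise(X, frac_points=0.02, var_frac=0.10, seed=2)
```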

The datasets for the online learning methods consist of three parts: the initial training set, the online learning set, and the testing set. Following [27], we divide each dataset into ten equal and independent subsets: five subsets for initial training, two subsets for online learning, and the remaining three subsets for testing. The online learning subsets are further divided equally into five parts, and all methods are updated with one part at a time.

5.1 Datasets and experiment settings

In our experiments, we use the classic Moore network traffic datasets [29], which consist of several separate datasets, each captured during a different period of a 24-h day. Each dataset contains tens of thousands of traffic flows, which are derived from header information. Four datasets, named Day1, Day2, Day3, and SiteB, are selected for experimental comparison. The traffic flows of Day1, Day2, and Day3 were collected on three weekdays in 2003, 2004, and 2006 from site A, and those of SiteB were collected on a weekday in 2007. The original data points in these four datasets have 248 features, from which twelve features are selected by the correlation-based filtering mechanism following [30]. Table 1 shows the number of traffic flows of each class in each dataset.

Table 1 The number of traffic flows of each class in four network traffic datasets

From Table 1, we can easily see that class imbalance exists in real-world network traffic. Traffic flows from "WWW" account for the largest proportion of all flows, while the numbers of flows from "CHAT", "GAMES", and "GRID" are minimal. In this work, we delete the traffic flows of "GAMES" and "GRID" because they are very few. Network traffic classification is generally evaluated in terms of real-time performance, accuracy, and robustness: real-time performance reflects the ability to classify online network traffic quickly, while accuracy and robustness reflect the ability to correctly classify network traffic and to tolerate noise. In this study, we adopt training time, precision, and recall as the performance indexes.

Since SVM is a classic binary classifier, we use the one-against-all strategy to handle multiclass classification: we train one SVM for each traffic category to separate the data points of that category from the points of all other categories. The Gaussian kernel is used for all SVM classifiers. All parameters are selected by five-fold cross-validation on the initial training set. The regularization parameter C and the kernel width σ of the Gaussian kernel are taken from the sets {10−3, …, 103} and {10−2, …, 102}, respectively. The scalable parameter ε of SKCHO-SVM is selected from the set {10−4, 10−3, 10−2}, and the parameter τ of the pinball loss function is set to 0.05. Based on extensive experiments, the parameters β, θ, and \( \overset{\sim }{\alpha } \) in SKCHO-SVM are set to 0.1, 0.5, and 20, respectively. All parameters of the comparison methods use their default settings. Our experiments are implemented in MATLAB on a computer with a 2.6 GHz dual-core CPU, 8 GB RAM, and the Windows operating system.
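The one-against-all setup and parameter search described above can be sketched as follows; `train_binary` and `decision` stand in for the pinball-SVM trainer and decision function of Sect. 3.2.2, and the accuracy-based model selection is our own illustrative choice.

```python
import numpy as np
from itertools import product

def one_vs_all_fit(X, y, classes, train_binary, **params):
    """Train one binary classifier per traffic class (that class vs. all the rest)."""
    return {c: train_binary(X, np.where(y == c, 1, -1), **params) for c in classes}

def one_vs_all_predict(models, decision, X):
    """Assign each flow to the class whose binary classifier gives the largest decision value."""
    labels = list(models.keys())
    scores = np.column_stack([decision(models[c], X) for c in labels])
    return np.array([labels[i] for i in scores.argmax(axis=1)])

def grid_search(X, y, classes, train_binary, decision, n_folds=5, seed=0):
    """Five-fold cross-validation over C in {1e-3, ..., 1e3} and sigma in {1e-2, ..., 1e2}."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    best, best_acc = None, -1.0
    for C, sigma in product([10.0 ** p for p in range(-3, 4)],
                            [10.0 ** p for p in range(-2, 3)]):
        accs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            models = one_vs_all_fit(X[trn], y[trn], classes, train_binary, C=C, sigma=sigma)
            accs.append(np.mean(one_vs_all_predict(models, decision, X[val]) == y[val]))
        if np.mean(accs) > best_acc:
            best, best_acc = (C, sigma), float(np.mean(accs))
    return best
```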

5.2 SKCHO-SVM on four network traffic datasets

An effective online learning strategy is to preserve the representative points and discard the un-representative ones among the newly arrived data. Which points are representative and how many are enough are two critical issues for online learning. In this subsection, the performance of SKCHO-SVM is examined as follows: we first discuss the percentage of scalable kernel convex hull vertices in each class, and then discuss the running time and classification performance.

5.2.1 Average percentage of scalable kernel convex hull in its corresponding class and its running time

As discussed in Sect. 3, the number of scalable kernel convex hull vertices is related to the Gaussian kernel parameter σ and the scalable parameter ε. We examine the number of scalable kernel convex hull vertices and the running time under different σ and ε on six classification cases: WWW in Day1, WWW in Day2, WWW in Day3, WWW in SiteB, MAIL in Day1, and P2P in Day3. The experimental results under noise mode 1 are shown in Tables 2 and 3, respectively. We can see that:

Table 2 Average percentage of scalable kernel convex hull vertices in corresponding class (%) in noise mode 1
Table 3 Average running time of computing scalable kernel convex hull vertices in corresponding class (seconds) in noise mode 1
  1.

    The number of scalable kernel convex hull vertices increases with the Gaussian kernel parameter σ. This is because the value of σ shapes the feature space: when its value is small, the distances between points in the kernel space are small and the distribution is concentrated, so fewer convex hull vertices are obtained; conversely, when its value is large, the distances between points are large and the distribution is dispersed, so more convex hull vertices are obtained.

  2.

    When the scalable threshold is small, fewer points satisfy the tolerance condition in Eq. (12), so more convex hull vertices are obtained and the running time is longer. On the contrary, when the scalable threshold is large, fewer convex hull vertices are obtained and the running time is shorter.

5.2.2 Classification performance with different ε

According to Tables 2 and 3, the average percentage of scalable kernel convex hull vertices in each class is sensitive to the scalable parameter ε. As is well known, the number and distribution of training data directly affect the performance of SVM. In order to examine the relationship between classification performance and the scalable threshold ε, we experimentally study the average testing performance of SKCHO-SVM with different ε on six classification tasks: WWW in Day1, WWW in Day2, WWW in Day3, WWW in SiteB, MAIL in Day1, and P2P in Day3. Figures 1 and 2 show their precision and recall under noise mode 1, respectively. From these two figures, we can see that classification performance is closely related to the scalable parameter ε. A smaller ε tends to produce more scalable kernel convex hull vertices; the more vertices, the longer the running time of vertex selection, but the better the precision and recall, because more vertices represent the contour distribution of the data in the feature space more faithfully. Therefore, in practical network traffic classification applications, we need to balance the running time against the number of convex hull vertices. In the following experiments, we fix ε = 10−3.

Fig. 1

Average precision of SKCHO-SVM with different values of parameter ε in noise mode 1

Fig. 2

Average recall of SKCHO-SVM with different values of parameter ε in noise mode 1

5.3 Comparison experiments

In this section, we compare SKCHO-SVM with pinball-SVM (as the baseline classifier) and three online SVMs (online-FastKDE [31], LASVM [32], and OCVM [27]) in terms of the training time of the offline learning stage, the updating time of the online updating stage, and the classification performance on the testing data. Pinball-SVM learns the classifier in batch mode and updates it with both old data points and newly arrived data points. Online-FastKDE is the online learning version of the FastKDE method; it updates the classifier based on a simple selection strategy that keeps 1% of the training set.

The average running time of the offline learning stage of all comparison methods under noise mode 1 is presented in Table 4. We can see that the offline training time of SKCHO-SVM is obviously shorter than that of the other methods, except for online-FastKDE, because online-FastKDE trains the classifier with a simple selection strategy and only selects a small number of data points. Pinball-SVM has the longest offline training time on all datasets, because it trains the classifier on the whole training set, which is the largest among all comparison methods. The initial training of LASVM is based on an SMO strategy called REPROCESS, and its offline training time is larger than that of SKCHO-SVM. OCVM selects the convex hull as training data to train the classifier, but its computation is relatively complicated compared with SKCHO-SVM; thus, the offline training time of OCVM is also larger than that of SKCHO-SVM.

Table 4 Average running time of offline learning stage of SKCHO-SVM and comparison methods on noise mode 1

The average running time of the online updating stage of all comparison methods under noise mode 1 is presented in Fig. 3. We can see from Fig. 3 that the updating time of SKCHO-SVM is much shorter than those of pinball-SVM, LASVM, and OCVM, and is comparable with that of online-FastKDE on the four network traffic datasets. The reason is that SKCHO-SVM removes most of the redundant data points in the offline learning stage, so the classifier built on the remaining data points is much faster to update than pinball-SVM, and, using the definition of the scalable kernel convex hull in Eq. (12), extracting convex hull vertices is faster than in LASVM and OCVM. In addition, we can see that as the number of network traffic flows increases, the overall training time of SKCHO-SVM does not increase dramatically. Pinball-SVM updates the classifier in batch mode, so its running time in the online learning stage grows much larger than that of the other methods as newly arrived points accumulate.

Fig. 3

Average running time of updating learning stage of SKCHO-SVM and four comparison methods on noise mode 1. a Day1. b Day2. c Day3. d SiteB

The average classification performance (in terms of precision and recall) of all comparison methods on the four network traffic datasets under noise mode 1 is shown in Tables 5, 6, 7, and 8. We can see that SKCHO-SVM obtains the best precision and recall. The reasons are as follows: (1) SKCHO-SVM deletes the noise points around the boundary of the data in the first step of the offline learning stage, so it obtains relatively "clean" data; meanwhile, SKCHO-SVM is insensitive to noise thanks to the pinball loss function. (2) SKCHO-SVM computes the scalable convex hull vertices in the kernel space based on Eq. (12), and the obtained vertices are used as training data in both the offline and online learning stages; since the training data of SKCHO-SVM are dynamically adjusted, the structural information of the data is retained well. (3) SKCHO-SVM efficiently relieves the imbalance problem in network traffic classification: it selects the convex hull of each class separately and obtains a small number of points per class, so the sizes of the training data of different classes are roughly equal. In our experiments, when the size of a class is less than 500, all of its data points are used as training data.

Table 5 Average performance for testing data points of all methods on Day1 in noise mode 1
Table 6 Average performance (%) for testing data points of all methods on Day2 in noise mode 1
Table 7 Average performance (%) for testing data points of all methods on Day3 in noise mode 1
Table 8 Average performance (%) for testing data points of all methods on SiteB in noise mode 1

The simple selection strategy in online-FastKDE cannot distinguish noise from normal points, so noise points become training points in both the offline and online learning stages. LASVM uses the REPROCESS strategy to remove non-support vectors during the update process; since REPROCESS relies on the Karush-Kuhn-Tucker (KKT) conditions of SVMs, LASVM is sensitive to noise and outliers. In addition, LASVM may discard some support vectors of the classifier in the online learning stage, so some important discriminative information may be lost. For the WWW, MAIL, and BULK classes, pinball-SVM achieves satisfactory classification performance, since the data points in these classes are sufficient. However, for the ATTACK, P2P, DATABASE, MULTIMEDIA, and SERVICES classes, pinball-SVM achieves high precision but low recall, since class imbalance skews the classification hyperplane.

In terms of noise distribution, noise points can be divided into two categories: those distributed around the boundary of the data and those distributed within the data area. In the following, we evaluate the second category. The average performance (in terms of precision and recall) of all comparison methods on the four network traffic datasets under noise mode 2 is shown in Tables 9, 10, 11, and 12. Since the intensity of noise mode 2 is greater than that of noise mode 1, the classification performance of online-FastKDE and LASVM decreases with the increasing noise intensity; both are sensitive to noise. The proposed SKCHO-SVM and pinball-SVM achieve the best precision and recall, and the performance gap between them is lower than 0.1%, which verifies that the pinball loss function is robust and insensitive to feature noise. Under high-intensity noise, OCVM shows lower precision and recall than SKCHO-SVM and pinball-SVM, because OCVM can only delete noise points around the boundary of the data, so its precision and recall deteriorate when the noise points are distributed within the data area. The experimental results also show that, using the scalable kernel convex hull, SKCHO-SVM is faster and more efficient than the four comparison methods. Therefore, SKCHO-SVM is an effective tool for online network traffic classification tasks.

Table 9 Average performance (%) for testing data points of all methods on Day1 in noise mode 2
Table 10 Average performance (%) for testing data points of all methods on Day2 in noise mode 2
Table 11 Average performance (%) for testing data points of all methods on Day3 in noise mode 2
Table 12 Average performance (%) for testing data points of all methods on SiteB in noise mode 2

6 Conclusion

In this study, to address large-scale noisy network traffic classification, we propose a novel online learning method called SKCHO-SVM, which takes advantage of both the scalable kernel convex hull and the pinball loss function. In the offline learning stage and the online updating stage, SKCHO-SVM distinguishes noise points and uses the scalable kernel convex hull to dynamically select a small number of data points as training data. The selected scalable convex hull vertices represent the profile of the network traffic data in the kernel space. Thus, SKCHO-SVM efficiently builds an online network traffic classifier that achieves high classification performance with low computational complexity. However, we only study the feature noise problem in this paper; how to apply SKCHO-SVM to network traffic classification tasks with label noise is an interesting direction for future work. In addition, the extracted attributes of the network traffic flows are only twelve-dimensional, and we will further study the efficiency of SKCHO-SVM on high-dimensional network traffic classification tasks.