1 Introduction

Online learning has proven effective for efficiently building accurate and reliable models from a sequence of data elements. Unlike regular batch machine learning algorithms, which suffer from long training times and heavy memory consumption, online learning models are typically fast to construct, highly scalable, and memory efficient. Owing to these advantages, online learning algorithms have been successfully used in many real-world applications, such as online advertising [14], weather condition prediction [11], and computational finance [10].

Various algorithms have been developed to tackle online binary classification tasks, and they can be broadly divided into two types: linear methods and kernel methods. Linear methods construct linear predictive models very quickly. Well-known examples include online gradient descent (OGD) [18], forward backward splitting (FOBOS) [7], regularized dual averaging (RDA) [19] and follow-the-regularized-leader (FTRL) [13, 14]. However, linear models are not always the right choice. Linear online algorithms may fail to produce effective outcomes when the inputs are not linearly separable, which is common in real-world applications. To overcome this issue, researchers introduced kernel functions into online learning methods, giving rise to the field of online kernel learning. Kernel-based estimators handle non-separability in the input space by implicitly mapping the instances to a high-dimensional feature space. One key limitation of classical online kernel methods is that the functional representation of the estimator becomes more complex as the number of observations grows. More specifically, the learner must maintain a support vector (SV) set during the online learning process. Whenever a newly arrived instance is misclassified, it is added to the SV set immediately. Thus the complexity of the estimator and the memory it demands can grow linearly over time, eventually exhausting memory for a potentially infinite input data sequence.

Several approaches have been proposed to address the scalability issues of online kernel learning. One line of work, usually referred to as “budget online kernel learning” [5], bounds the number of SVs by a fixed size during training. The two widely acknowledged budget maintenance strategies are removal and projection. The former simply evicts a selected SV when the number of SVs exceeds the budget. It is adopted by many algorithms, such as Forgetron [6], randomized budget perceptron (RBP) [3], and naive online \(R_{reg}\) minimisation algorithm (NORMA) [9]. The latter further projects the selected SV onto the remaining ones, as explored in algorithms like Projectron [15] and online manifold regularization (OMR) [2]. Budget strategies do relieve the memory pressure to some extent, but the existing budget online kernel methods are either too simple to achieve promising results or too slow in practice. The other promising line of work uses a functional approximation scheme [20]. Unlike the budget maintenance strategy, this kind of scheme tackles the problem by approximating the kernel function itself: an explicit mapping can be derived from the approximation, making it possible to project data from the input space to a computable high-dimensional feature space. Combined with linear online learning algorithms like OGD, nonlinear kernel-based models can then be trained in an efficient linear manner. As far as we know, Fourier online gradient descent (FOGD) has succeeded in reducing the time cost following this idea [12]. To reduce the required memory, the final model should be stored sparsely; that is, the number of non-zero coefficients in the final model parameter should be small. However, even with the \(L_1\) penalty, FOGD can hardly produce sparse models. Consequently, it may run into memory problems when the dimension of the feature space becomes too high.

To retain the advantages of linear online models while also producing sparse solutions, we propose a Fourier follow-the-regularized-leader (FFTRL) algorithm in this paper. FFTRL adopts the random Fourier feature technique to approximate shift-invariant kernels and introduces sparsity using the FTRL algorithm. Theoretical analysis and experiments on FFTRL are also provided in this paper.

The rest of the paper is organized as follows. Section 2 details the proposed method. Experimental results and analysis are presented in Sect. 3 and the conclusion is given in Sect. 4.

2 Proposed Method

2.1 Algorithm Description

The proposed FFTRL is an online kernel learning method for binary classification tasks. The goal of FFTRL is to learn a final mapping or hypothesis \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) from a sequence of data elements \(\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),\dots ,(\textbf{x}_T,y_T)\}\), where \(\textbf{x}_t\in \mathbb {R}^n\) is the t-th training instance, \(y_t\in \{+1,-1\}\) is the corresponding class label, and n and T are the number of features and samples, respectively. Generally, a convex loss function \(l(f(\textbf{x}),y):\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) is used to penalize the deviation of the estimation \(f(\textbf{x})\) from the exact class label y. Further, we assume \(\mathcal {H}_k\) is a reproducing kernel Hilbert space (RKHS). The function \(k(\cdot ,\cdot ):\mathbb {R}^n\times \mathbb {R}^n\rightarrow \mathbb {R}\) is the reproducing kernel of \(\mathcal {H}_k\) if and only if, with respect to the inner product \(\langle \cdot ,\cdot \rangle \) of \(\mathcal {H}_k\),

  1. \(k(\textbf{x},\cdot )\in \mathcal {H}_k\) for \(\forall \textbf{x}\in \mathbb {R}^n\);

  2. \(\langle f,k(\textbf{x},\cdot )\rangle =f(\textbf{x})\) for \(\forall \textbf{x}\in \mathbb {R}^n\) and \(\forall f\in \mathcal {H}_k\).

In classical online kernel learning, the computation of kernel functions increases the complexity of algorithms. Inspired by FOGD, FFTRL represents a kernel mapping in a linear manner. Namely,

$$\begin{aligned} k(\textbf{x}_j,\textbf{x}_m)\thickapprox \textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m), \end{aligned}$$
(1)

where the superscript \(\textsf{T}\) means the operation of a vector or matrix transpose, \(\textbf{x}_j\) and \(\textbf{x}_m\) are arbitrary instances in the sequence, and \(\textbf{z}(\textbf{x}_j)\) is an approximate image of \(\textbf{x}_j\) in the feature space.

Let \(f(\textbf{x})=\textbf{w}^\textsf{T}\textbf{z}(\textbf{x})\), where \(\textbf{w}\) is the weight vector. Then the loss function can be represented as \(l(\textbf{w},\textbf{z}(\textbf{x}),y)\). To find \(\textbf{z}(\textbf{x})\) related to \(k(\cdot ,\cdot )\), we introduce random Fourier features [16], a kernel approximation technique that works for shift-invariant kernels such as the Gaussian and Laplacian kernels. Such kernels have the form \(k(\textbf{x}_j,\textbf{x}_m)=k(\varDelta \textbf{x})\), where \(\varDelta \textbf{x}=\textbf{x}_j-\textbf{x}_m\) is the difference between the two instances. Bochner’s theorem implies that a positive definite kernel function \(k(\varDelta \textbf{x})\) is the Fourier transform of a proper probability density function \(p(\textbf{u})\) with a random variable \(\textbf{u}\in \mathbb {R}^n\) [17]. Namely,

$$\begin{aligned} k(\varDelta \textbf{x})=\int p(\textbf{u})e^{i\textbf{u}^\textsf{T}\varDelta \textbf{x}} \,d\textbf{u}, \end{aligned}$$
(2)

where i is the imaginary unit. Conversely, suppose we are given a kernel of this kind. By calculating the inverse Fourier transform of the kernel \(k(\varDelta \textbf{x})\), we can obtain

$$\begin{aligned} p(\textbf{u})=\left( \frac{1}{2\pi }\right) ^n \int e^{-i\textbf{u}^\textsf{T}(\varDelta \textbf{x})}k(\varDelta \textbf{x}) \,d(\varDelta \textbf{x}). \end{aligned}$$
(3)

For example, given a Gaussian kernel \(k(\textbf{x}_j,\textbf{x}_m)=\exp (-\Vert \textbf{x}_j-\textbf{x}_m\Vert ^2_2 / 2\sigma ^2)\) with the kernel parameter \(\sigma >0\), we have the corresponding distribution \(p(\textbf{u})=\mathcal {N} (\textbf{0},\sigma ^{-2}\textbf{I})\) with the identity matrix \(\textbf{I}\). According to (2), the kernel function can be expressed as an expectation over \(\textbf{u}\) drawn from the distribution \(p(\textbf{u})\). In other words, we have

$$\begin{aligned} \int p(\textbf{u})e^{i\textbf{u}^\textsf{T}\varDelta \textbf{x}} \,d\textbf{u} =E_{\textbf{u}}[e^{i\textbf{u}^\textsf{T}\textbf{x}_j} e^{-i\textbf{u}^\textsf{T}\textbf{x}_m}], \end{aligned}$$
(4)

where \(E_{\textbf{u}}[\cdot ]\) denotes the expectation with respect to \(\textbf{u}\). Using Euler’s formula, and noting that the kernel is real-valued so that only the real part remains, we can rewrite (4) as

$$\begin{aligned}&E_{\textbf{u}}[\cos (\textbf{u}^\textsf{T}\textbf{x}_j)\cos (\textbf{u}^\textsf{T}\textbf{x}_m)+\sin (\textbf{u}^\textsf{T}\textbf{x}_j)\sin (\textbf{u}^\textsf{T}\textbf{x}_m)] \nonumber \\ =&E_{\textbf{u}}\left[ [\sin (\textbf{u}^\textsf{T}\textbf{x}_j),\cos (\textbf{u}^\textsf{T}\textbf{x}_j)][\sin (\textbf{u}^\textsf{T}\textbf{x}_m),\cos (\textbf{u}^\textsf{T}\textbf{x}_m)]^\textsf{T}\right] . \end{aligned}$$
(5)

According to (5), we can take \(\textbf{z}(\textbf{x})=[\sin (\textbf{u}^\textsf{T}\textbf{x}),\cos (\textbf{u}^\textsf{T}\textbf{x})]^\textsf{T}\) as a new representation (image) of instance \(\textbf{x}\). Since the kernel function \(k(\varDelta \textbf{x})\) equals the expectation of the inner product of \(\textbf{z}(\textbf{x}_j)\) and \(\textbf{z}(\textbf{x}_m)\), we can draw D samples \(\textbf{u}_1,\dots ,\textbf{u}_D\) independently from the distribution p and construct the image of \(\textbf{x}\) as

$$\begin{aligned} \textbf{z}(\textbf{x})=\left[ \sin (\textbf{u}_1^\textsf{T}\textbf{x}),\cos (\textbf{u}_1^\textsf{T}\textbf{x}),\dots ,\sin (\textbf{u}_D^\textsf{T}\textbf{x}),\cos (\textbf{u}_D^\textsf{T}\textbf{x})\right] ^\textsf{T} \in \mathbb {R}^{2D}. \end{aligned}$$
(6)

Now we can avoid computing the kernel function, because we have explicit images in the high-dimensional feature space induced by the corresponding kernel function. If the number of samples D is large enough, the approximation error becomes small enough to be neglected. Thus, online kernel learning in the original space is transformed into linear online learning in a high-dimensional feature space.
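The construction in (6) can be illustrated with a minimal sketch for the Gaussian kernel. The function names, the sample points, and the values of D and \(\sigma\) below are hypothetical and chosen only for illustration. The sketch also scales the features by \(1/\sqrt{D}\); this normalisation is an assumption taken from the standard random Fourier feature construction, so that the inner product averages over the D samples rather than summing them, and it is not written explicitly in (6).

```python
import math
import random


def sample_frequencies(n, D, sigma, rng):
    """Draw D frequency vectors u_1..u_D from N(0, sigma^{-2} I) in R^n."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(n)] for _ in range(D)]


def feature_map(x, U):
    """Return [sin(u_1.x), cos(u_1.x), ..., sin(u_D.x), cos(u_D.x)] / sqrt(D)."""
    scale = 1.0 / math.sqrt(len(U))  # assumption: normalisation not shown in (6)
    z = []
    for u in U:
        proj = sum(ui * xi for ui, xi in zip(u, x))
        z.append(scale * math.sin(proj))
        z.append(scale * math.cos(proj))
    return z


def gaussian_kernel(a, b, sigma):
    """Exact Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


rng = random.Random(0)
U = sample_frequencies(n=3, D=2000, sigma=1.5, rng=rng)
x_j, x_m = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]

# Inner product of the explicit images versus the exact kernel value.
approx = sum(a * b for a, b in zip(feature_map(x_j, U), feature_map(x_m, U)))
exact = gaussian_kernel(x_j, x_m, sigma=1.5)
print(f"approximate: {approx:.4f}, exact: {exact:.4f}")
```

Increasing D should shrink the gap between the two printed values, at the cost of a longer feature vector of length 2D.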

To produce sparsity in the online process, we introduce FTRL, which combines the ways in which FOBOS and RDA treat the regularization terms and the model parameter \(\textbf{w}\). In the t-th round, FFTRL updates the weight vector \(\textbf{w}_{t+1}\) as follows:

$$\begin{aligned} \textbf{w}_{t+1}=\mathop {\arg \min }\limits _{\textbf{w}} \left\{ \textbf{w}^\textsf{T}\left( \sum _{s=1}^t \textbf{g}_s\right) +\frac{1}{2} \sum _{s=1}^{t}\Vert \sqrt{\boldsymbol{\sigma }_s}\odot (\textbf{w}-\textbf{w}_s)\Vert _2^2+\lambda \Vert \textbf{w}\Vert _1\right\} , \end{aligned}$$
(7)

where \(\textbf{g}_s=\nabla _{\textbf{w}_s}l(\textbf{w}_s,\textbf{z}(\textbf{x}_s),y_s)\) is the gradient in the sth iteration, \(\odot \) is the element-wise multiplication operator, \(\boldsymbol{\sigma }_s=\left[ \sigma _{s,1},\dots ,\sigma _{s,2D}\right] ^{\textsf{T}}\in \mathbb {R}^{2D}\) is the parameter related to the current learning rate, and \(\lambda \) is a positive regularization parameter. We discuss \(\boldsymbol{\sigma }_s\) later.

The basic idea behind FTRL is to minimize the loss accumulated over the online training process, which yields a low-regret solution in the current round. Therefore, FFTRL uses the cumulative gradient, the first term of (7), as a linear approximation of the cumulative loss. The second term in (7) works as a stabilization penalty that prevents \(\textbf{w}\) from oscillating excessively across iterations, while the third term is an \(L_1\) penalty. With \(\lambda >0\), the \(L_1\) penalty drives many coordinates of \(\textbf{w}\) exactly to zero, which is how FFTRL produces sparsity.
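Because every term in (7) separates across coordinates, the minimizer can be written in closed form, which makes the source of sparsity explicit. The derivation below is a sketch; the symbols \(\textbf{v}_t\) and \(\boldsymbol{\kappa }_t\) are introduced here only for illustration, and \(\kappa _{t,h}>0\) is assumed. Define

$$\begin{aligned} \textbf{v}_t=\sum _{s=1}^{t}\textbf{g}_s-\sum _{s=1}^{t}\boldsymbol{\sigma }_s\odot \textbf{w}_s, \qquad \boldsymbol{\kappa }_t=\sum _{s=1}^{t}\boldsymbol{\sigma }_s. \end{aligned}$$

Expanding the squared term in (7) and dropping everything that does not depend on \(\textbf{w}\), the objective becomes \(\sum _{h=1}^{2D}\left( v_{t,h}w_h+\frac{1}{2}\kappa _{t,h}w_h^2+\lambda |w_h|\right) \), so that each coordinate is solved independently by soft-thresholding:

$$\begin{aligned} w_{t+1,h}={\left\{ \begin{array}{ll} 0, &{} |v_{t,h}|\le \lambda , \\ -\dfrac{v_{t,h}-\lambda \,\textrm{sgn}(v_{t,h})}{\kappa _{t,h}}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

A coordinate is therefore exactly zero whenever its accumulated, stabilizer-adjusted gradient stays within \([-\lambda ,\lambda ]\).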

Moreover, if one feature variable varies more rapidly than another, it is reasonable for the learning rate on that feature variable to decline faster. Thus, FFTRL uses a per-coordinate learning rate instead of a global learning rate such as \(\eta _t=\frac{1}{\sqrt{t}}\) (\(t>0\)) for all features. In other words, the learning rate is calculated independently for each feature. Let \(\boldsymbol{\eta }_t=[\eta _{t,1},\dots ,\eta _{t,2D}] \in \mathbb {R}^{2D}\) be the learning rate used in FFTRL. We measure the rate of change in a given dimension by the corresponding gradient component. Without loss of generality, let \(g_{t,h}\) be the h-th entry of \(\textbf{g}_t\). Then, the corresponding learning rate in the h-th dimension can be expressed as

$$\begin{aligned} \eta _{t,h}=\frac{\alpha }{\beta +\sqrt{\sum _{s=1}^{t}g_{s,h}^2}} \end{aligned}$$
(8)

for \(t>0\), where \(\alpha >0\) and \(\beta >0\) are two parameters that need to be tuned for good performance. When \(t=0\), no gradient has been observed and the sum in (8) is empty, so \(\eta _{0,h}=\alpha /\beta \) for all h. For \(\boldsymbol{\sigma }_s\), its h-th component is defined as

$$\begin{aligned} \sigma _{s,h}=\frac{1}{\eta _{s,h}}-\frac{1}{\eta _{s-1,h}}. \end{aligned}$$
(9)
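As a small numerical illustration of (8) and (9) for a single coordinate h, the sketch below reads (9) with the same round index s on both sides. The gradient values and the settings of \(\alpha\) and \(\beta\) are hypothetical and are not the tuned values used in the experiments.

```python
import math


def learning_rate(sq_grad_sum, alpha, beta):
    """Eq. (8): eta = alpha / (beta + sqrt(sum of squared gradients))."""
    return alpha / (beta + math.sqrt(sq_grad_sum))


alpha, beta = 0.5, 1.0         # illustrative values only
grads = [0.8, -0.3, 0.0, 1.2]  # hypothetical g_{s,h} for one coordinate h

sq_sum = 0.0
eta_prev = learning_rate(sq_sum, alpha, beta)  # eta_{0,h} = alpha / beta
etas, sigmas = [], []
for g in grads:
    sq_sum += g * g
    eta = learning_rate(sq_sum, alpha, beta)
    sigmas.append(1.0 / eta - 1.0 / eta_prev)  # Eq. (9)
    etas.append(eta)
    eta_prev = eta

# The sigmas telescope: their sum equals 1/eta_t - 1/eta_0.
total = sum(sigmas)
print(etas, sigmas, total)
```

The learning rate never increases, and a round with a zero gradient component leaves it unchanged, so the corresponding \(\sigma _{s,h}\) is zero.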

The detailed description of FFTRL is summarized in Algorithm 1. For training data arriving sequentially, we first construct the new representation of an instance using the explicit mapping \(\textbf{z}(\textbf{x})\) in (6) and then perform sparse linear online learning using FTRL. The overall time complexity of the FTRL update in one round is O(D).
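The two steps described above, mapping each instance with (6) and then applying a sparse per-coordinate FTRL update, can be sketched as follows. This is an illustrative sketch and not a reproduction of Algorithm 1. The class name, the toy data stream, and all hyper-parameter values are hypothetical. It uses the hinge-loss subgradient and the soft-thresholding solution of (7), and it assumes the widely used FTRL-Proximal convention in which \(\textbf{w}_1=\textbf{0}\) and the accumulated stabilization weight of coordinate h equals \((\beta +\sqrt{\sum _s g_{s,h}^2})/\alpha \).

```python
import math
import random


class FFTRLSketch:
    """Illustrative random-Fourier-feature FTRL learner (hypothetical names)."""

    def __init__(self, n, D, sigma, alpha, beta, lam, seed=0):
        rng = random.Random(seed)
        # Frequencies for the Gaussian kernel: u ~ N(0, sigma^{-2} I).
        self.U = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(n)]
                  for _ in range(D)]
        self.alpha, self.beta, self.lam = alpha, beta, lam
        self.w = [0.0] * (2 * D)   # weights, w_1 = 0
        self.v = [0.0] * (2 * D)   # sum_s g_s - sum_s sigma_s * w_s
        self.sq = [0.0] * (2 * D)  # sum_s g_{s,h}^2

    def features(self, x):
        """Explicit image z(x) as in (6), without extra scaling."""
        z = []
        for u in self.U:
            p = sum(ui * xi for ui, xi in zip(u, x))
            z += [math.sin(p), math.cos(p)]
        return z

    def predict(self, z):
        return sum(wi * zi for wi, zi in zip(self.w, z))

    def update(self, x, y):
        """Process one labelled instance; return True if it was misclassified."""
        z = self.features(x)
        margin = y * self.predict(z)
        if margin < 1:  # hinge subgradient is -y * z here, zero otherwise
            for h, zh in enumerate(z):
                g = -y * zh
                new_sq = self.sq[h] + g * g
                sigma_h = (math.sqrt(new_sq) - math.sqrt(self.sq[h])) / self.alpha
                self.v[h] += g - sigma_h * self.w[h]
                self.sq[h] = new_sq
                if abs(self.v[h]) <= self.lam:
                    self.w[h] = 0.0  # L1 penalty zeroes this coordinate
                else:
                    kappa = (self.beta + math.sqrt(new_sq)) / self.alpha
                    sign = 1.0 if self.v[h] > 0 else -1.0
                    self.w[h] = -(self.v[h] - sign * self.lam) / kappa
        return margin <= 0


# Toy stream: points inside a disc are positive, the rest negative.
rng = random.Random(1)
model = FFTRLSketch(n=2, D=50, sigma=1.0, alpha=0.1, beta=1.0, lam=0.5)
mistakes, T = 0, 500
for _ in range(T):
    x = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    y = 1 if x[0] ** 2 + x[1] ** 2 < 1.5 else -1
    mistakes += model.update(x, y)
zeros = sum(1 for wi in model.w if wi == 0.0)
print(f"mistake rate: {mistakes / T:.3f}, zero weights: {zeros} of {len(model.w)}")
```

Each update touches every one of the 2D coordinates once, in line with the O(D) cost of the FTRL step; computing the feature map itself additionally requires D inner products in \(\mathbb {R}^n\).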

2.2 Theoretical Analysis

We further analyze the theoretical properties of our proposed method. For simplicity, \(l_t(f)\) represents \(l(f(\textbf{x}_t),y_t)\), and \(l_t(\textbf{w})\) is \(l_t(\textbf{w}_t,\textbf{z}(\textbf{x}_t),y_t)\). In the following, we show that the regret of our algorithm is sub-linear, which indicates the effectiveness of FFTRL.

Theorem 1

Assume that the original data is contained by a ball in \(\mathbb {R}^n\) of diameter \(\tilde{R}\). Let \(k(\textbf{x},\textbf{x}')=k(\varDelta \textbf{x})\) be a positive definite and shift-invariant kernel, and \(l(f(\textbf{x}),y):\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}\) be a convex loss function that is Lipschitz continuous with Lipschitz constant L. Assume that \(\textbf{w}_1,\dots ,\textbf{w}_T\) is the sequence of model parameters generated by FFTRL (Algorithm 1) under the mild condition that the learning rate \(\eta _{t,h}=\eta _t\) for every dimension in the same iteration, where \(\Vert \textbf{w}_t\Vert _2\le R\). With probability at least \(1-2^8(\frac{\varsigma _p\tilde{R}}{\epsilon })^2\exp (\frac{-D\epsilon ^2}{4(n+2)})\), the following inequality

$$\begin{aligned} \sum _{t=1}^{T}l_t(\textbf{w}_t)-\sum _{t=1}^{T}l_t(f^*)\le \frac{(1+\epsilon )\Vert f^*\Vert ^2_1}{2\eta _T}+L^2\sum _{t=1}^{T}\eta _t+\frac{3R^2}{2\eta _T}+\sqrt{2D}\lambda R+\epsilon LT\Vert f^*\Vert _1 \end{aligned}$$

holds true for any \(f^*(\textbf{x})=\sum _{t=1}^{T}\alpha _{t}^*k(\textbf{x}_t,\textbf{x})\), where \(\Vert f^*\Vert _1=\sum _{t=1}^{T}|\alpha _t^*|\), \(\varsigma _p^2=E_p[\textbf{u}^\textsf{T}\textbf{u}]\) is the second moment of the Fourier transform of the kernel function \(k(\cdot ,\cdot )\) given that \(p(\textbf{u})\) is the probability density function calculated by (3), and \(\epsilon \) is a small positive constant.

Proof

Given \(f^*(\textbf{x})=\sum _{t=1}^{T}\alpha _t^*k(\textbf{x}_t,\textbf{x})\) as the optimal solution of FFTRL, we have the corresponding linear model \(\textbf{w}^*=\sum _{t=1}^{T}\alpha _{t}^*\textbf{z}(\textbf{x}_t)\). First of all, we have to bound the regret of the sequence \(\textbf{w}_1,\dots ,\textbf{w}_T\) learned by FFTRL with respect to the optimal linear model \(\textbf{w}^*\) in the new feature space. According to the regret analysis of the FTRL algorithm with strongly convex regularizers (Lemma 2.3.) [18], we have:

$$\begin{aligned} \sum _{t=1}^{T}(l_t(\textbf{w}_t)-l_t(\textbf{w}^*))\le L^2\sum _{t=1}^{T}\eta _t+r_{1:T}(\textbf{w}^*)+\psi (\textbf{w}^*), \end{aligned}$$
(10)

where \(r_{1:T}(\textbf{w}^*)=\sum _{t=1}^{T} r_t(\textbf{w}^*)\). Let \(r_t(\textbf{w})=\frac{\sigma _t}{2}\Vert \textbf{w}-\textbf{w}_t\Vert _2^2\) and \(\psi (\textbf{w})=\lambda \Vert \textbf{w}\Vert _1\). Then, the cumulative sum of regularizers becomes

$$\begin{aligned} r_{1:T}(\textbf{w}^*)+\psi (\textbf{w}^*)=\frac{1}{2}\sum _{t=1}^{T}\sigma _t\Vert \textbf{w}^*-\textbf{w}_t\Vert _2^2+\lambda \Vert \textbf{w}^*\Vert _1, \end{aligned}$$
(11)

which is exactly the same as the regularization term in (7).

For \(r_{1:T}(\textbf{w}^*)\), we can infer that

$$\begin{aligned} r_{1:T}(\textbf{w}^*)&=\frac{1}{2}\sum _{t=1}^{T}\sigma _t\Vert \textbf{w}^*-\textbf{w}_t\Vert _2^2 \nonumber \\&\le \frac{1}{2}\sum _{t=1}^{T}\sigma _t (\Vert \textbf{w}^*\Vert _2^2-2\langle \textbf{w}^*,\textbf{w}_t \rangle +\Vert \textbf{w}_t\Vert ^2_2) \nonumber \\&\le \frac{1}{2}\sum _{t=1}^{T}\sigma _t (\Vert \textbf{w}^*\Vert _2^2+3R^2)=\frac{1}{2\eta _T}(\Vert \textbf{w}^*\Vert _2^2+3R^2). \end{aligned}$$
(12)

For \(\psi (\textbf{w}^*)\), it is upper-bounded by \(\sqrt{2D}\lambda R\) according to the arithmetic-geometric mean inequality (AGMI). The regret bound (10) now becomes

$$\begin{aligned} \sum _{t=1}^{T}(l_t(\textbf{w}_t)-l_t(\textbf{w}^*))\le L^2\sum _{t=1}^{T}\eta _t+\frac{\Vert \textbf{w}^*\Vert _2^2+3R^2}{2\eta _T}+\sqrt{2D}\lambda R \end{aligned}$$
(13)

Next, we examine the difference between \(\sum _{t=1}^{T}l_t(\textbf{w}^*)\) and \(\sum _{t=1}^{T}l_t(f^*)\). According to the uniform convergence of random Fourier features (Claim 1 in [16]), with probability at least \(1-2^8(\frac{\varsigma _p\tilde{R}}{\epsilon })^2\exp (\frac{-D\epsilon ^2}{4(n+2)})\), we have

$$\begin{aligned} \forall j,m,~~ |\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)-k(\textbf{x}_j,\textbf{x}_m)|<\epsilon . \end{aligned}$$
(14)

In other words, the larger the number of samples D, the smaller the probability that the approximated kernel value differs from the true kernel value by more than \(\epsilon \). We further assume \(k(\textbf{x}_j,\textbf{x}_m)\le 1\); then \(\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)\le 1+\epsilon \), which leads to

$$\begin{aligned} \Vert \textbf{w}^*\Vert _2^2 \le (1+\epsilon )\Vert f^*\Vert ^2_1. \end{aligned}$$
(15)

With (14), we have:

$$\begin{aligned} \left| \sum _{t=1}^{T}l_t(\textbf{w}^*)-\sum _{t=1}^{T}l_t(f^*)\right|&\le \sum _{t=1}^{T}\left| l_t(\textbf{w}^*)-l_t(f^*)\right| \nonumber \\&\le L\sum _{t=1}^{T}\sum _{j=1}^{T}|\alpha _j^*||\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_t)-k(\textbf{x}_j,\textbf{x}_t)| \nonumber \\&\le \epsilon L \sum _{t=1}^{T}\sum _{j=1}^{T}\left| \alpha _j^*\right| =\epsilon LT\Vert f^*\Vert _1. \end{aligned}$$
(16)

Combining (13), (15) and (16) completes the proof.
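To see how the bound of Theorem 1 scales with T, consider as an illustration the global schedule \(\eta _t=1/\sqrt{t}\); this particular choice is an assumption made here for concreteness and is not the per-coordinate rate in (8). Since \(1/\sqrt{t}\le 2(\sqrt{t}-\sqrt{t-1})\), the sum telescopes and

$$\begin{aligned} \sum _{t=1}^{T}\eta _t=\sum _{t=1}^{T}\frac{1}{\sqrt{t}}\le 2\sqrt{T}, \qquad \frac{1}{\eta _T}=\sqrt{T}. \end{aligned}$$

Substituting into the bound of Theorem 1 gives

$$\begin{aligned} \sum _{t=1}^{T}l_t(\textbf{w}_t)-\sum _{t=1}^{T}l_t(f^*)\le \left( \frac{(1+\epsilon )\Vert f^*\Vert ^2_1+3R^2}{2}+2L^2\right) \sqrt{T}+\sqrt{2D}\lambda R+\epsilon LT\Vert f^*\Vert _1. \end{aligned}$$

The first term grows as \(\sqrt{T}\) and the second does not depend on T. The last term is linear in T with coefficient \(\epsilon \), so the overall bound is sub-linear only to the extent that the approximation error \(\epsilon \) is small relative to \(1/\sqrt{T}\), which according to (14) requires a sufficiently large number of samples D.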

3 Experiments

3.1 Description of Data and Algorithms Involved

To validate the performance of our proposed algorithm, we conducted extensive experiments on online binary classification tasks. We first introduce the datasets used in our experiments and then describe the algorithms used for comparison.

Table 1 shows the details of eight publicly available datasets, where the first five datasets can be downloaded from the KEEL dataset repository [1] and the remaining three are available at the LIBSVM website [4]. We followed the common setting of online binary classification tasks, in which each dataset is divided into training and test sets. We adopted the original splits of training and test sets for datasets downloaded from the LIBSVM website. For KEEL datasets, a random 4 : 1 training–test split was performed.

Table 1. Information of eight publicly available datasets used in experiments.

In experiments, our proposed method was first compared with NORMA and ACCOSVM for regular online kernel classification, which are solved in primal and dual spaces, respectively.

  • “NORMA” [9]: Online gradient descent for kernel SVM without budget.

  • “ACCOSVM” [8]: An accelerator for online SVM combining quadratic programming and window techniques.

Further, we included three state-of-the-art budget online kernel learning algorithms for comparison with FFTRL. Namely,

  • “BNORMA” [9]: The budgeted version of NORMA using removal strategy.

  • “Forgetron” [6]: Budget perceptron using the removal strategy.

  • “Projectron” [15]: Budget perceptron using the projection strategy.

Finally, we included an algorithm that shares a similar idea with our proposed method:

  • “FOGD” [12]: Online gradient descent using random Fourier features for kernel approximation.

3.2 Experimental Setting

All the experiments were carried out in Python 3.6 on a PC running Windows 10 with a 2.9 GHz Intel Core i7 processor and 16 GB RAM. To make a fair comparison, all algorithms shared the following setup. The Gaussian kernel was used as the kernel function \(k(\cdot ,\cdot )\), and the hinge loss was taken as the convex loss function. Since the hinge loss is non-smooth, a subgradient was used instead of the gradient; it is non-zero only when \(yf(\textbf{x})<1\).

The budget size in budget online learning algorithms and the number of samples in FOGD and our proposed method were set to 100 and 200, respectively, following the same setups in [12]. The learning rate related parameter \(\beta \) in our algorithm was set to 1 following the recommendation in [14]. Other hyper-parameters were selected by a standard 5-fold cross validation on the training set, including the kernel bandwidth \(\sigma \), the learning rate related parameter \(\alpha \) for FFTRL, the regularization parameter \(\lambda \) for FFTRL, NORMA and BNORMA, the initial learning rate \(\eta _0\) for FOGD, NORMA, and BNORMA, and C for ACCOSVM. Then, each algorithm was retrained on the training set with its best hyper-parameters five times, with the instances shuffled differently in each run. The mean and standard deviation of the mistake rate on the training set, training time, accuracy on the test set, and test time were reported as the final results.

3.3 Results and Analysis

Table 2 summarizes the evaluation results on the eight datasets, where the best results are in bold. Note that the test process of NORMA on the Ijcnn1 dataset was stopped early after 10,000 s, and the value obtained for the instances tested up to that point is reported in italics. From Table 2, we can draw the following conclusions.

Table 2. Comparison of online kernel algorithms on 8 benchmark binary classification datasets.

First, we found that budget online kernel classification algorithms run much faster than the regular ones (namely, NORMA and ACCOSVM) in both training and testing. This means that scalable online kernel methods are more practical in terms of time efficiency. However, budget online kernel classification algorithms generally make more mistakes on the training set and obtain lower accuracy on the test set. Information may be lost when budget strategies are adopted, which underlines the importance of exploring effective techniques for budget online kernel learning algorithms. The same phenomenon appears within the family of budget online algorithms. We notice that Projectron takes more time in training and testing but obtains better results than both BNORMA and Forgetron on five out of eight datasets, since the projection strategy is more complex than simply removing an SV. The trade-off between accuracy and time efficiency should be analyzed for each specific situation.

Second, we compared the two kernel approximation methods (FOGD and FFTRL) with the budget online kernel classification algorithms. As listed in Table 2, FOGD takes the least time in training, and our proposed method FFTRL shows competitive results too. Both algorithms train much faster than any of the budget online kernel algorithms. We attribute the time efficiency of kernel approximation methods to the linear online learning framework. Moreover, both FOGD and FFTRL also show better mistake rates and accuracy in most cases, which demonstrates that the kernel approximation scheme is suitable for large-scale online learning.

Finally, we analyzed the performance of FFTRL. Somewhat surprisingly, FFTRL obtains the lowest mistake rate or the highest accuracy on several datasets, and even outperforms NORMA on some of them (such as Spambase, Coil2000, A7a and Ijcnn1). The reasons may lie in two aspects. The first is the appropriate choice of the sample number D. According to the conclusions from [12], choosing a too large value of D will result in under-fitting for small datasets, and choosing a too small value of D will result in over-fitting. The second is the per-coordinate learning rate. Except for FFTRL, all the gradient-based algorithms adopt a global learning rate schedule. However, in the online setting the learning rate should reflect our confidence in each dimension, which suggests that a global learning rate schedule is not the optimal choice. Besides, FFTRL also produces a sparser model than FOGD, as expected. Unfortunately, the benefits of sparsity brought to FFTRL are largely obscured by the efficiency of the linear learning framework, since the test times of FOGD and FFTRL are generally the same. To validate the advantage of our proposed method over FOGD, we list the number of zero components in the weight vector \(\textbf{w}\) in Table 3, where the number of zero coefficients in FOGD is taken as the baseline. Table 3 shows that the model generated by FFTRL is much sparser than that of FOGD.

Table 3. Sparsity promotion of FFTRL against FOGD.

4 Conclusion

In this paper, we present a novel sparse algorithm, FFTRL, for solving large-scale online kernel binary classification tasks. The basic idea of FFTRL is to approximate a kernel function via a functional approximation technique, which enables us to transform the original online kernel learning task into an approximate linear online learning task. Random Fourier features are used as the kernel approximation scheme, which induces a new high-dimensional feature space. We further adopt FTRL to find a sparse solution in the new feature space. In theory, we analyze the regret bound of our proposed algorithm.

We performed extensive experiments to evaluate the performance of FFTRL and other state-of-the-art online kernel learning methods. Our promising results show that FFTRL enjoys both time efficiency and accuracy. Moreover, the sparsity produced by FFTRL fits the need of high dimensional and large-scale data scenarios, making FFTRL suitable for real-world applications. In future work, we plan to extend our work by exploring the field of multi-label online classification tasks.