FFTRL: A Sparse Online Kernel Classification Algorithm for Large Scale Data

Su, Changzhi; Zhang, Li; Zhao, Lei

doi:10.1007/978-3-031-44207-0_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14254)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Online kernel learning is an efficient way to deal with nonlinear large-scale data. Fourier online gradient descent (FOGD) improves the training speed of online kernel learning. However, FOGD has a high space complexity when the number of features is relatively high because it lacks sparsity. In this paper, we propose a new sparse online kernel classification algorithm for large-scale data, called Fourier follow-the-regularized-leader (FFTRL). Existing budget (sparse) online kernel learning methods attempt to bound the number of support vectors through budget maintenance strategies; however, these strategies are unsuitable for FOGD. By introducing the proximal algorithm follow-the-regularized-leader, FFTRL achieves sparsity in a different way. By applying random Fourier features as the kernel approximation scheme, FFTRL finds the optimal sparse solution in a linear manner. The regret bound analysis shows the feasibility of FFTRL in theory. Comprehensive experiments were carried out on public datasets to compare the performance of FFTRL with related online kernel algorithms. Promising results show that our proposed method enjoys both high accuracy and time efficiency while still producing sparse models, opening a window for obtaining sparsity in online kernel learning.

1 Introduction

It has been shown that online learning can efficiently build accurate and reliable models from a sequence of data elements. Unlike regular batch machine learning algorithms, which suffer from massive training time and memory consumption, online learning models are fast to construct, highly scalable, and memory efficient. Due to these advantages, online learning algorithms have been successfully used in many real-world applications, such as online advertising [14], weather condition prediction [11], and computational finance [10].

Various algorithms have been developed to tackle online binary classification tasks, and they can be broadly divided into two types: linear and kernel methods. Linear methods can construct linear predictive models very quickly. Well-known examples include online gradient descent (OGD) [18], forward backward splitting (FOBOS) [7], regularized dual averaging (RDA) [19] and follow-the-regularized-leader (FTRL) [13, 14]. However, linear models are not always the right choice. Linear online algorithms may fail to produce effective outcomes when faced with linearly non-separable inputs, which are common in real-world applications. To overcome this issue, researchers introduced kernel functions into online learning methods, giving rise to the field of online kernel learning. Kernel-based estimators sidestep non-separability in the input space by implicitly mapping the instances to a high-dimensional feature space. One key limitation of classical online kernel methods is that the functional representation of the produced estimator becomes more complex as the number of observations grows. More specifically, the learner must maintain a support vector (SV) set during the online learning process. Whenever a newly arrived instance is misclassified, it is added to the SV set immediately. Thus the complexity of the estimator and the memory it demands increase linearly over time, causing memory overflow for a potentially infinite input data sequence.

Several approaches have been proposed to handle the scalability issues of online kernel learning. One line of work, usually referred to as “budget online kernel learning” [5], tries to bound the number of SVs within a fixed size during the training process. The two major widely acknowledged budget maintenance strategies are removal and projection. The former simply evicts a selected SV when the number of SVs exceeds the budget. It is adopted by many algorithms, such as Forgetron [6], randomized budget perceptron (RBP) [3], and naive online $R_{reg}$ minimisation algorithm (NORMA) [9]. The latter further projects the selected SV onto the remaining ones, and is explored in algorithms like Projectron [15] and online manifold regularization (OMR) [2]. Budget strategies do relieve the pressure to some extent, but the existing budget online kernel methods are either too simple to achieve promising results or too slow to be practical. The other promising line of work uses a functional approximation scheme [20]. Unlike the budget maintenance strategy, this kind of scheme tackles the problem in a mathematically elegant way. An explicit mapping can be derived by approximating a kernel function, making it possible to project data from the input space to a computable high-dimensional feature space. Combined with linear online learning algorithms like OGD, nonlinear kernel-based algorithms can then be trained in an efficient linear manner. As far as we know, Fourier online gradient descent (FOGD) has successfully reduced the time cost following this idea [12]. To reduce the required memory, the final model should be stored sparsely; that is, the number of non-zero coefficients in the final model parameter should be small. However, even when employing the $L_1$ penalty, FOGD can hardly produce sparse models. Consequently, it may run into memory usage problems when the dimension of the feature space becomes too high.
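To make the removal strategy concrete, the following is a minimal sketch of a kernel perceptron whose SV set is capped by evicting the oldest SV. It is a generic illustration only: the class name `BudgetKernelPerceptron`, the oldest-first eviction rule, and the toy data stream are assumptions made here, and the code does not reproduce Forgetron, RBP, or any other cited algorithm.

```python
import math
import random
from collections import deque


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


class BudgetKernelPerceptron:
    """Kernel perceptron; the SV set is capped by oldest-first removal."""

    def __init__(self, budget=None, sigma=1.0):
        # deque(maxlen=None) is unbounded. With maxlen=budget, appending to a
        # full deque silently evicts the oldest support vector.
        self.svs = deque(maxlen=budget)
        self.sigma = sigma

    def score(self, x):
        return sum(y * gaussian_kernel(sv, x, self.sigma) for sv, y in self.svs)

    def update(self, x, y):
        """Process one labelled instance; return True if it was a mistake."""
        mistake = y * self.score(x) <= 0
        if mistake:
            self.svs.append((x, y))  # misclassified instances become SVs
        return mistake


# Toy stream with random labels (hypothetical data, for illustration only).
rng = random.Random(0)
stream = [([rng.uniform(-1, 1), rng.uniform(-1, 1)], rng.choice([-1, 1]))
          for _ in range(200)]

unbounded = BudgetKernelPerceptron(budget=None, sigma=0.5)
capped = BudgetKernelPerceptron(budget=5, sigma=0.5)
unbounded_mistakes = sum(unbounded.update(x, y) for x, y in stream)
capped_mistakes = sum(capped.update(x, y) for x, y in stream)
print(len(unbounded.svs), len(capped.svs))
```

Without a budget, the SV set grows by one on every mistake, so prediction cost and memory grow with the stream; with a budget, both stay bounded at the price of discarding information.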

In order to take advantage of linear online models and produce sparsity simultaneously, we propose a Fourier follow-the-regularized-leader (FFTRL) algorithm in this paper. FFTRL adopts the random Fourier feature technique to approximate shift-invariant kernels and introduces sparsity using the FTRL algorithm. Theoretical analysis and experiments on FFTRL are also provided in this paper.

The rest of the paper is organized as follows. Section 2 details the proposed method. Experimental results and analysis are presented in Sect. 3 and the conclusion is given in Sect. 4.

2 Proposed Method

2.1 Algorithm Description

The proposed FFTRL is an online kernel learning method for binary classification tasks. The goal of FFTRL is to learn a final mapping or hypothesis $f:\mathbb {R}^n\rightarrow \mathbb {R}$ from a sequence of data elements $\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),\dots ,(\textbf{x}_T,y_T)\}$, where $\textbf{x}_t\in \mathbb {R}^n$ is the $t$th training instance, $y_t\in \{+1,-1\}$ is the corresponding class label, and $n$ and $T$ are the number of features and samples, respectively. Generally, a convex loss function $l(f(\textbf{x}),y):\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}$ is used to penalize the deviation of the estimate $f(\textbf{x})$ from the true class label $y$. Further, we assume $\mathcal {H}_k$ is a reproducing kernel Hilbert space (RKHS). Thus, the function $k(\cdot ,\cdot ):\mathbb {R}^n\times \mathbb {R}^n\rightarrow \mathbb {R}$ is defined as the reproducing kernel of $\mathcal {H}_k$ if and only if it implements the inner product $\langle \cdot ,\cdot \rangle $ such that

1. $k(\textbf{x},\cdot )\in \mathcal {H}_k$ for all $\textbf{x}\in \mathbb {R}^n$;
2. $\langle f,k(\textbf{x},\cdot )\rangle =f(\textbf{x})$ for all $\textbf{x}\in \mathbb {R}^n$ and all $f\in \mathcal {H}_k$.

In classical online kernel learning, the computation of kernel functions increases the complexity of algorithms. Inspired by FOGD, FFTRL represents a kernel mapping in a linear manner. Namely,

$$\begin{aligned} k(\textbf{x}_j,\textbf{x}_m)\thickapprox \textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m), \end{aligned}$$

(1)

where the superscript $\textsf{T}$ means the operation of a vector or matrix transpose, $\textbf{x}_j$ and $\textbf{x}_m$ are arbitrary instances in the sequence, and $\textbf{z}(\textbf{x}_j)$ is an approximate image of $\textbf{x}_j$ in the feature space.

Let $f(\textbf{x})=\textbf{w}^\textsf{T}\textbf{z}(\textbf{x})$, where $\textbf{w}$ is the weight vector. Then the loss function can be represented as $l(\textbf{w},\textbf{z}(\textbf{x}),y)$. To find $\textbf{z}(\textbf{x})$ related to $k(\cdot ,\cdot )$, we introduce random Fourier features [16], a kernel approximation technique that works for shift-invariant kernels such as the Gaussian and Laplacian kernels. Such kernels have the form $k(\textbf{x}_j,\textbf{x}_m)=k(\varDelta \textbf{x})$, where $\varDelta \textbf{x}=\textbf{x}_j-\textbf{x}_m$ is the difference between the two instances. Bochner’s theorem implies that a positive definite kernel function $k(\varDelta \textbf{x})$ is the Fourier transform of a proper probability density function $p(\textbf{u})$ with a random variable $\textbf{u}\in \mathbb {R}^n$ [17]. Namely,

$$\begin{aligned} k(\varDelta \textbf{x})=\int p(\textbf{u})e^{i\textbf{u}^\textsf{T}\varDelta \textbf{x}} \,d\textbf{u}, \end{aligned}$$

(2)

where $i$ is the imaginary unit. Conversely, suppose a suitable kernel is given. By calculating the inverse Fourier transform of the kernel $k(\varDelta \textbf{x})$, we can obtain

$$\begin{aligned} p(\textbf{u})=\left( \frac{1}{2\pi }\right) ^n \int e^{-i\textbf{u}^\textsf{T}(\varDelta \textbf{x})}k(\varDelta \textbf{x}) \,d(\varDelta \textbf{x}). \end{aligned}$$

(3)

For example, given a Gaussian kernel $k(\textbf{x}_j,\textbf{x}_m)=\exp (-\Vert \textbf{x}_j-\textbf{x}_m\Vert ^2_2 / 2\sigma ^2)$ with the kernel parameter $\sigma >0$, we have the corresponding distribution $p(\textbf{u})=\mathcal {N} (\textbf{0},\sigma ^{-2}\textbf{I})$ with the identity matrix $\textbf{I}$. According to (2), the kernel function can be expressed as an expectation over $\textbf{u}$ drawn from the distribution $p(\textbf{u})$. In other words, we have

$$\begin{aligned} \int p(\textbf{u})e^{i\textbf{u}^\textsf{T}\varDelta \textbf{x}} \,d\textbf{u} =E_{\textbf{u}}[e^{i\textbf{u}^\textsf{T}\textbf{x}_j} e^{-i\textbf{u}^\textsf{T}\textbf{x}_m}], \end{aligned}$$

(4)

where $E_{\textbf{u}}[\cdot ]$ denotes the expectation with respect to $\textbf{u}$. Using Euler’s formula, we can rewrite (4) as

$$\begin{aligned}&E_{\textbf{u}}[\cos (\textbf{u}^\textsf{T}\textbf{x}_j)\cos (\textbf{u}^\textsf{T}\textbf{x}_m)+\sin (\textbf{u}^\textsf{T}\textbf{x}_j)\sin (\textbf{u}^\textsf{T}\textbf{x}_m)] \nonumber \\ =&E_{\textbf{u}}\left[ [\sin (\textbf{u}^\textsf{T}\textbf{x}_j),\cos (\textbf{u}^\textsf{T}\textbf{x}_j)][\sin (\textbf{u}^\textsf{T}\textbf{x}_m),\cos (\textbf{u}^\textsf{T}\textbf{x}_m)]^\textsf{T}\right] . \end{aligned}$$

(5)

According to (5), we can take $\textbf{z}(\textbf{x})=[\sin (\textbf{u}^\textsf{T}\textbf{x}),\cos (\textbf{u}^\textsf{T}\textbf{x})]^\textsf{T}$ as a new representation (image) of instance $\textbf{x}$. Since the kernel function $k(\varDelta \textbf{x})$ equals the expectation of the inner product of $\textbf{z}(\textbf{x}_j)$ and $\textbf{z}(\textbf{x}_m)$, we can draw $D$ samples $\textbf{u}_1,\dots ,\textbf{u}_D$ independently from the distribution $p$ and construct the image of $\textbf{x}$ as

$$\begin{aligned} \textbf{z}(\textbf{x})=\left[ \sin (\textbf{u}_1^\textsf{T}\textbf{x}),\cos (\textbf{u}_1^\textsf{T}\textbf{x}),\dots ,\sin (\textbf{u}_D^\textsf{T}\textbf{x}),\cos (\textbf{u}_D^\textsf{T}\textbf{x})\right] ^\textsf{T} \in \mathbb {R}^{2D}. \end{aligned}$$

(6)

Now, we no longer need to evaluate the kernel function because we have explicit images in the high-dimensional feature space induced by the corresponding kernel function. If the number of samples $D$ is large enough, the approximation error can reasonably be neglected. Thus, online kernel learning in the original space is transformed into linear online learning in a high-dimensional feature space.
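The construction in (6) can be sketched in a few lines of Python. The function names, the toy inputs, and the seed below are hypothetical. One assumption deserves emphasis: the sketch scales the features by $1/\sqrt{D}$, which (6) as printed does not show, so that the inner product is the Monte Carlo average $\frac{1}{D}\sum _{i=1}^{D}\cos (\textbf{u}_i^\textsf{T}\varDelta \textbf{x})$ and approximates $k$ as required by (1).

```python
import math
import random


def sample_frequencies(n, D, sigma, rng):
    """Draw D frequency vectors u ~ N(0, sigma^-2 I) for the Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(n)] for _ in range(D)]


def rff_map(x, freqs):
    """Map x to [sin(u_1.x), cos(u_1.x), ..., sin(u_D.x), cos(u_D.x)] / sqrt(D)."""
    scale = 1.0 / math.sqrt(len(freqs))  # assumed normalisation, see lead-in
    z = []
    for u in freqs:
        proj = sum(ui * xi for ui, xi in zip(u, x))
        z.extend((scale * math.sin(proj), scale * math.cos(proj)))
    return z


def gaussian_kernel(a, b, sigma):
    """Exact Gaussian kernel, used only to check the approximation."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


rng = random.Random(0)
sigma = 1.5
freqs = sample_frequencies(n=3, D=2000, sigma=sigma, rng=rng)
x_j, x_m = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]   # hypothetical instances
z_j, z_m = rff_map(x_j, freqs), rff_map(x_m, freqs)
approx = sum(a * b for a, b in zip(z_j, z_m))
exact = gaussian_kernel(x_j, x_m, sigma)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The frequencies are drawn once and reused for every instance; after that, each kernel evaluation is replaced by a dot product between two vectors of length $2D$.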

To produce sparsity in the online process, we introduce FTRL that comprehensively considers the differences between FOBOS and RDA on regularization terms and model parameter $\textbf{w}$. In the tth round, FFTRL performs the update of the weight vector $\textbf{w}_{t+1}$ as follows:

$$\begin{aligned} \textbf{w}_{t+1}=\mathop {\arg \min }\limits _{\textbf{w}} \left\{ \textbf{w}^\textsf{T}\left( \sum _{s=1}^t \textbf{g}_s\right) +\frac{1}{2} \sum _{s=1}^{t}\Vert \sqrt{\boldsymbol{\sigma }_s}\odot (\textbf{w}-\textbf{w}_s)\Vert _2^2+\lambda \Vert \textbf{w}\Vert _1\right\} , \end{aligned}$$

(7)

where $\textbf{g}_s=\nabla _{\textbf{w}_s}l(\textbf{w}_s,\textbf{z}(\textbf{x}_s),y_s)$ is the gradient in the sth iteration, $\odot $ is the element-wise multiplication operator, $\boldsymbol{\sigma }_s=\left[ \sigma _{s,1},\dots ,\sigma _{s,2D}\right] ^{\textsf{T}}\in \mathbb {R}^{2D}$ is the parameter related to the current learning rate, and $\lambda $ is a positive regularization parameter. We discuss $\boldsymbol{\sigma }_s$ later.

The basic idea behind FTRL is to minimize the loss accumulated in the online training process, which yields a low-regret solution in the current round. Therefore, FFTRL uses the cumulative gradient to approximate the cumulative loss, which is the first term of (7). The second term in (7) works as a stabilization penalty that keeps $\textbf{w}$ from oscillating excessively across iterations, while the third term is an $L_1$ penalty. With $\lambda >0$, FFTRL is effective at producing sparsity.
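To see where the exact zeros come from, note that (7) separates across coordinates. The short derivation below is not spelled out in the source; it follows the standard per-coordinate argument for FTRL-Proximal in [14], and the shorthand $z_{t,h}$ and $S_{t,h}$ is introduced here for illustration. With $z_{t,h}=\sum _{s=1}^{t}(g_{s,h}-\sigma _{s,h}w_{s,h})$ and $S_{t,h}=\sum _{s=1}^{t}\sigma _{s,h}$, expanding the quadratic term shows that the $h$th coordinate of (7) is, up to an additive constant,

$$\begin{aligned} \min _{w_h}\left\{ z_{t,h}w_h+\frac{S_{t,h}}{2}w_h^2+\lambda |w_h|\right\} . \end{aligned}$$

Assuming $S_{t,h}>0$, setting the subgradient to zero gives the soft-thresholding solution

$$\begin{aligned} w_{t+1,h}={\left\{ \begin{array}{ll} 0, &{} |z_{t,h}|\le \lambda ,\\ -\dfrac{z_{t,h}-\textrm{sgn}(z_{t,h})\lambda }{S_{t,h}}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Every coordinate whose accumulated statistic satisfies $|z_{t,h}|\le \lambda $ is therefore set exactly to zero, rather than merely shrunk toward zero.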

Moreover, if one feature variable varies more rapidly than another, it is reasonable for the learning rate on that feature to decline faster. Thus, FFTRL uses a per-coordinate learning rate instead of a global learning rate such as $\eta _t=\frac{1}{\sqrt{t}}$ ($t>0$) for all features. In other words, the learning rate is calculated independently for each feature. Let $\boldsymbol{\eta }_t=[\eta _{t,1},\dots ,\eta _{t,2D}] \in \mathbb {R}^{2D}$ be the learning rate used in FFTRL. We measure the rate of change by the gradient component in each dimension. Without loss of generality, let $g_{t,h}$ be the $h$th entry in $\textbf{g}_t$. Then, the corresponding learning rate in the $h$th dimension can be expressed as

$$\begin{aligned} \eta _{t,h}=\frac{\alpha }{\beta +\sqrt{\sum _{s=1}^{t}g_{s,h}^2}} \end{aligned}$$

(8)

for $t>0$, where $\alpha >0$ and $\beta >0$ are two parameters that need to be tuned for good performance. When $t=0$, no gradient has been observed yet, so the sum of squared gradients is zero and $\eta _{0,h}=\alpha /\beta $ for all $h$. The $h$th component of $\boldsymbol{\sigma }_s$ is then defined as

$$\begin{aligned} \sigma _{s,h}=\frac{1}{\eta _{s,h}}-\frac{1}{\eta _{s-1,h}}. \end{aligned}$$

(9)
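As a small numeric illustration of (8) and (9), the sketch below computes the learning rates and the increments $\sigma _{s,h}$ for a single coordinate. It reads the right-hand side of (9) with index $s$, i.e. $\sigma _{s,h}=1/\eta _{s,h}-1/\eta _{s-1,h}$, which is an interpretation made here; the function names and the gradient values are hypothetical.

```python
import math


def learning_rate(sq_grad_sum, alpha, beta):
    """Eq. (8): eta = alpha / (beta + sqrt(sum of squared gradients))."""
    return alpha / (beta + math.sqrt(sq_grad_sum))


def sigma_sequence(grads, alpha, beta):
    """Return [sigma_1, ..., sigma_t] for one coordinate's gradient history."""
    sigmas, sq_sum = [], 0.0
    prev_inv_eta = 1.0 / learning_rate(0.0, alpha, beta)  # 1/eta_0 = beta/alpha
    for g in grads:
        sq_sum += g * g
        inv_eta = 1.0 / learning_rate(sq_sum, alpha, beta)
        sigmas.append(inv_eta - prev_inv_eta)  # Eq. (9), read with index s
        prev_inv_eta = inv_eta
    return sigmas


grads = [3.0, 0.0, 4.0]   # hypothetical entries g_{1,h}, g_{2,h}, g_{3,h}
sigmas = sigma_sequence(grads, alpha=0.5, beta=1.0)
print(sigmas)
```

A coordinate that receives a zero gradient keeps its learning rate and contributes $\sigma _{s,h}=0$, and the increments telescope: their sum over $s\le t$ equals $1/\eta _{t,h}-1/\eta _{0,h}$.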

The detailed description of FFTRL is summarized in Algorithm 1. For training data arriving sequentially, we first construct the new representation of an instance using the explicit mapping $\textbf{z}(\textbf{x})$ in (6) and then perform sparse linear online learning using FTRL. The overall time complexity of FTRL in one update round is $O(D)$.
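The procedure just described can be sketched as follows. This is not the authors' Algorithm 1, which is not reproduced in this text; it is a minimal illustration that combines the feature map (6) with the per-coordinate closed-form FTRL-Proximal update of [14] and the hinge loss used in the experiments. The class name `FFTRLSketch`, the $1/\sqrt{D}$ feature scaling, and the use of $1/\eta _{t,h}$ as the quadratic coefficient in the closed form are assumptions.

```python
import math
import random


class FFTRLSketch:
    """Random Fourier features + per-coordinate FTRL-Proximal with L1."""

    def __init__(self, n, D, sigma, alpha, beta, lam, seed=0):
        rng = random.Random(seed)
        # u_1..u_D ~ N(0, sigma^-2 I): spectral density of the Gaussian kernel.
        self.freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(n)]
                      for _ in range(D)]
        self.alpha, self.beta, self.lam = alpha, beta, lam
        self.z = [0.0] * (2 * D)  # sum_s (g_s - sigma_s * w_s), per coordinate
        self.n = [0.0] * (2 * D)  # sum_s g_s^2, per coordinate

    def features(self, x):
        scale = 1.0 / math.sqrt(len(self.freqs))  # assumed normalisation
        phi = []
        for u in self.freqs:
            proj = sum(ui * xi for ui, xi in zip(u, x))
            phi.extend((scale * math.sin(proj), scale * math.cos(proj)))
        return phi

    def weights(self):
        """Closed-form minimiser of (7); exact zeros where |z_h| <= lam."""
        w = []
        for z_h, n_h in zip(self.z, self.n):
            if abs(z_h) <= self.lam:
                w.append(0.0)
            else:
                inv_eta = (self.beta + math.sqrt(n_h)) / self.alpha  # 1/eta, Eq. (8)
                w.append(-(z_h - math.copysign(self.lam, z_h)) / inv_eta)
        return w

    def partial_fit(self, x, y):
        """One online round; returns True if the prediction was a mistake."""
        phi, w = self.features(x), self.weights()
        f = sum(wh * ph for wh, ph in zip(w, phi))
        if y * f < 1:  # hinge loss is active: subgradient is -y * phi
            for h, ph in enumerate(phi):
                g = -y * ph
                # sigma = 1/eta_t - 1/eta_{t-1} for this coordinate, Eq. (9)
                sigma_h = (math.sqrt(self.n[h] + g * g)
                           - math.sqrt(self.n[h])) / self.alpha
                self.z[h] += g - sigma_h * w[h]
                self.n[h] += g * g
        return y * f <= 0


# Hypothetical XOR-like stream, for illustration only.
rng = random.Random(1)
model = FFTRLSketch(n=2, D=50, sigma=1.0, alpha=0.5, beta=1.0, lam=0.01)
mistakes = 0
for _ in range(500):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 1 if x[0] * x[1] > 0 else -1
    mistakes += model.partial_fit(x, y)
w = model.weights()
print(mistakes / 500, sum(1 for wh in w if wh == 0.0), "of", len(w), "zero")
```

Only the two length-$2D$ vectors `z` and `n` are stored; the weights are recovered on demand from them, so the update itself is linear in $D$ once the features are computed.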

2.2 Theoretical Analysis

We further analyze the theoretical properties of our proposed method. For simplicity, $l_t(f)$ denotes $l(f(\textbf{x}_t),y_t)$, and $l_t(\textbf{w})$ denotes $l(\textbf{w},\textbf{z}(\textbf{x}_t),y_t)$. In the following, we show that the regret of our algorithm is sub-linear, which indicates the effectiveness of FFTRL.

Theorem 1

Assume that the original data is contained in a ball in $\mathbb {R}^n$ of diameter $\tilde{R}$. Let $k(\textbf{x},\textbf{x}')=k(\varDelta \textbf{x})$ be a positive definite and shift-invariant kernel, and $l(f(\textbf{x}),y):\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}$ be a convex loss function that is Lipschitz continuous with Lipschitz constant $L$. Assume that $\textbf{w}_1,\dots ,\textbf{w}_T$ is the sequence of model parameters generated by FFTRL (Algorithm 1) under the mild condition that the learning rate $\eta _{t,h}=\eta _t$ for every dimension in the same iteration, where $\Vert \textbf{w}_t\Vert _2\le R$. With probability at least $1-2^8(\frac{\varsigma _p\tilde{R}}{\epsilon })^2\exp (\frac{-D\epsilon ^2}{4(n+2)})$, the following inequality

$$\begin{aligned} \sum _{t=1}^{T}l_t(\textbf{w}_t)-\sum _{t=1}^{T}l_t(f^*)\le \frac{(1+\epsilon )\Vert f^*\Vert ^2_1}{2\eta _T}+L^2\sum _{t=1}^{T}\eta _t+\frac{3R^2}{2\eta _T}+\sqrt{2D}\lambda R+\epsilon LT\Vert f^*\Vert _1 \end{aligned}$$

holds true for any $f^*(\textbf{x})=\sum _{t=1}^{T}\alpha _{t}^*k(\textbf{x}_t,\textbf{x})$, where $\Vert f^*\Vert _1=\sum _{t=1}^{T}|\alpha _t^*|$, $\varsigma _p^2=E_p[\textbf{u}^\textsf{T}\textbf{u}]$ is the second moment of the Fourier transform of the kernel function $k(\cdot ,\cdot )$ given that $p(\textbf{u})$ is the probability density function calculated by (3), and $\epsilon $ is a small positive constant.

Proof

Given $f^*(\textbf{x})=\sum _{t=1}^{T}\alpha _t^*k(\textbf{x}_t,\textbf{x})$ as the optimal solution of FFTRL, we have the corresponding linear model $\textbf{w}^*=\sum _{t=1}^{T}\alpha _{t}^*\textbf{z}(\textbf{x}_t)$. First of all, we have to bound the regret of the sequence $\textbf{w}_1,\dots ,\textbf{w}_T$ learned by FFTRL with respect to the optimal linear model $\textbf{w}^*$ in the new feature space. According to the regret analysis of the FTRL algorithm with strongly convex regularizers (Lemma 2.3.) [18], we have:

$$\begin{aligned} \sum _{t=1}^{T}(l_t(\textbf{w}_t)-l_t(\textbf{w}^*))\le L^2\sum _{t=1}^{T}\eta _t+r_{1:T}(\textbf{w}^*)+\psi (\textbf{w}^*), \end{aligned}$$

(10)

where $r_{1:T}(\textbf{w}^*)=\sum _{t=1}^{T} r_t(\textbf{w}^*)$. Let $r_t(\textbf{w})=\frac{\sigma _t}{2}\Vert \textbf{w}-\textbf{w}_t\Vert _2^2$ and $\psi (\textbf{w})=\lambda \Vert \textbf{w}\Vert _1$. Then, the cumulative sum of regularizers becomes

$$\begin{aligned} r_{1:T}(\textbf{w}^*)+\psi (\textbf{w}^*)=\frac{1}{2}\sum _{t=1}^{T}\sigma _t\Vert \textbf{w}^*-\textbf{w}_t\Vert _2^2+\lambda \Vert \textbf{w}^*\Vert _1, \end{aligned}$$

(11)

which is exactly the same as the regularization term in (7).

For $r_{1:T}(\textbf{w}^*)$, we can infer that

$$\begin{aligned} r_{1:T}(\textbf{w}^*)&=\frac{1}{2}\sum _{t=1}^{T}\sigma _t\Vert \textbf{w}^*-\textbf{w}_t\Vert _2^2 \nonumber \\&\le \frac{1}{2}\sum _{t=1}^{T}\sigma _t (\Vert \textbf{w}^*\Vert _2^2-2\langle \textbf{w}^*,\textbf{w}_t \rangle +\Vert \textbf{w}_t\Vert ^2_2) \nonumber \\&\le \frac{1}{2}\sum _{t=1}^{T}\sigma _t (\Vert \textbf{w}^*\Vert _2^2+3R^2)=\frac{1}{2\eta _T}(\Vert \textbf{w}^*\Vert _2^2+3R^2). \end{aligned}$$

(12)

For $\psi (\textbf{w}^*)$, it is upper-bounded by $\sqrt{2D}\lambda R$ according to the arithmetic-geometric mean inequality (AGMI). The regret bound (10) now becomes
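One way to verify this bound is sketched below. It uses the Cauchy–Schwarz inequality rather than the inequality named above, and it assumes that $\textbf{w}^*$ satisfies the same norm constraint $\Vert \textbf{w}^*\Vert _2\le R$ as the iterates, which the text does not state explicitly. Since $\textbf{w}^*\in \mathbb {R}^{2D}$,

$$\begin{aligned} \psi (\textbf{w}^*)=\lambda \sum _{h=1}^{2D}|w^*_h|\cdot 1\le \lambda \sqrt{\sum _{h=1}^{2D}1^2}\sqrt{\sum _{h=1}^{2D}(w^*_h)^2}=\sqrt{2D}\lambda \Vert \textbf{w}^*\Vert _2\le \sqrt{2D}\lambda R. \end{aligned}$$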

$$\begin{aligned} \sum _{t=1}^{T}(l_t(\textbf{w}_t)-l_t(\textbf{w}^*))\le L^2\sum _{t=1}^{T}\eta _t+\frac{\Vert \textbf{w}^*\Vert _2^2+3R^2}{2\eta _T}+\sqrt{2D}\lambda R \end{aligned}$$

(13)

Next, we examine the difference between $\sum _{t=1}^{T}l_t(\textbf{w}^*)$ and $\sum _{t=1}^{T}l_t(f^*)$. According to the uniform convergence of random Fourier features (Claim 1 in [16]), with probability at least $1-2^8(\frac{\varsigma _p\tilde{R}}{\epsilon })^2\exp (\frac{-D\epsilon ^2}{4(n+2)})$, we have

$$\begin{aligned} \forall j,m,~~ |\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)-k(\textbf{x}_j,\textbf{x}_m)|<\epsilon . \end{aligned}$$

(14)

In other words, the more random features we sample, the smaller the probability that the approximated kernel value differs from the true kernel value by more than the constant $\epsilon $. We further assume $k(\textbf{x}_j,\textbf{x}_m)\le 1$; then we have $\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)\le 1+\epsilon $, which leads to

$$\begin{aligned} \Vert \textbf{w}^*\Vert _2^2 \le (1+\epsilon )\Vert f^*\Vert ^2_1. \end{aligned}$$

(15)
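The step from (14) to (15) can be expanded as follows. This is a sketch under the slightly stronger assumption $|k(\textbf{x}_j,\textbf{x}_m)|\le 1$, which holds for the Gaussian kernel, so that (14) gives $|\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)|\le 1+\epsilon $. Using $\textbf{w}^*=\sum _{t=1}^{T}\alpha _t^*\textbf{z}(\textbf{x}_t)$,

$$\begin{aligned} \Vert \textbf{w}^*\Vert _2^2&=\sum _{j=1}^{T}\sum _{m=1}^{T}\alpha _j^*\alpha _m^*\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m) \le \sum _{j=1}^{T}\sum _{m=1}^{T}|\alpha _j^*||\alpha _m^*||\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_m)| \nonumber \\&\le (1+\epsilon )\left( \sum _{j=1}^{T}|\alpha _j^*|\right) ^2=(1+\epsilon )\Vert f^*\Vert _1^2. \end{aligned}$$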

With (14), we have:

$$\begin{aligned} \left| \sum _{t=1}^{T}l_t(\textbf{w}^*)-\sum _{t=1}^{T}l_t(f^*)\right|&\le \sum _{t=1}^{T}\left| l_t(\textbf{w}^*)-l_t(f^*)\right| \nonumber \\&\le L\sum _{t=1}^{T}\sum _{j=1}^{T}|\alpha _j^*||\textbf{z}(\textbf{x}_j)^\textsf{T}\textbf{z}(\textbf{x}_t)-k(\textbf{x}_j,\textbf{x}_t)| \nonumber \\&\le \epsilon L \sum _{t=1}^{T}\sum _{j=1}^{T}\left| \alpha _j^*\right| =\epsilon LT\Vert f^*\Vert _1. \end{aligned}$$

(16)

Combining (13), (15) and (16) completes the proof.
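To make the growth rate of the bound in Theorem 1 explicit, consider as an illustration the schedule $\eta _t=c/\sqrt{t}$ with a constant $c>0$; this particular schedule is an assumption made here and is not prescribed by the theorem. Then $1/\eta _T=\sqrt{T}/c$ and $\sum _{t=1}^{T}\eta _t=c\sum _{t=1}^{T}t^{-1/2}\le 2c\sqrt{T}$, so the bound becomes

$$\begin{aligned} \sum _{t=1}^{T}l_t(\textbf{w}_t)-\sum _{t=1}^{T}l_t(f^*)\le \left( \frac{(1+\epsilon )\Vert f^*\Vert _1^2+3R^2}{2c}+2cL^2\right) \sqrt{T}+\sqrt{2D}\lambda R+\epsilon LT\Vert f^*\Vert _1. \end{aligned}$$

The first two terms grow as $O(\sqrt{T})$. The last term is linear in $T$ for a fixed $\epsilon $, so sub-linear regret additionally requires $\epsilon $ to shrink with $T$ (for instance $\epsilon =O(1/\sqrt{T})$), which by (14) calls for a correspondingly larger number of random features $D$.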

3 Experiments

3.1 Description of Data and Algorithms Involved

To validate the performance of our proposed algorithm, we conducted extensive experiments on online binary classification tasks. We first introduce the datasets used in our experiments and then describe the algorithms used for comparison.

Table 1 shows the details of eight publicly available datasets, where the first five datasets can be downloaded from the KEEL dataset repository [1] and the remaining three are available at the LIBSVM website [4]. We followed the common setting of online binary classification tasks, in which each dataset is divided into training and test sets. We adopted the original training and test splits for the datasets downloaded from the LIBSVM website. For the KEEL datasets, a random 4 : 1 training–test split was performed.

Table 1. Information of eight publicly available datasets used in experiments.

In experiments, our proposed method was first compared with NORMA and ACCOSVM for regular online kernel classification, which are solved in primal and dual spaces, respectively.

“NORMA” [9]: Online gradient descent for kernel SVM without budget.
“ACCOSVM” [8]: An accelerator for online SVM combining quadratic programming and window techniques.

Further, we included three state-of-the-art budget online kernel learning algorithms for comparison with FFTRL. Namely,

“BNORMA” [9]: The budgeted version of NORMA using removal strategy.
“Forgetron” [6]: Budget perceptron using the removal strategy.
“Projectron” [15]: Budget perceptron using the projection strategy.

Finally, we included an algorithm that shares a similar idea with our proposed method:

“FOGD” [12]: Online gradient descent using random Fourier features for kernel approximation.

3.2 Experimental Setting

All the experiments were carried out in Python 3.6 on a PC running Windows 10 with a 2.9 GHz Intel Core i7 processor and 16 GB RAM. To make a fair comparison, all algorithms adopted the same setup. The Gaussian kernel was used as the kernel function $k(\cdot ,\cdot )$, and the hinge loss was taken as the convex loss function. Since the hinge loss is non-smooth, a subgradient was used instead of the gradient; it is nonzero only when $yf(\textbf{x})<1$.

The budget size in budget online learning algorithms and the number of samples in FOGD and our proposed method were set to 100 and 200, respectively, following the setup in [12]. The learning-rate-related parameter $\beta $ in our algorithm was set to 1 following the recommendation in [14]. Other hyper-parameters were selected by standard 5-fold cross validation on the training set, including the kernel bandwidth $\sigma $, the learning-rate-related parameter $\alpha $ for FFTRL, the regularization parameter $\lambda $ for FFTRL, NORMA and BNORMA, the initial learning rate $\eta _0$ for FOGD, NORMA, and BNORMA, and $C$ for ACCOSVM. Then, the model with the best hyper-parameters was refitted on the training set five times, with the instances shuffled differently in each run. The mean and standard deviation of the mistake rate on the training set, the training time, the accuracy on the test set, and the test time were reported as the final results.

3.3 Results and Analysis

Table 2 summarizes the evaluation results on the eight datasets, where the best results are in bold. Note that the test process of NORMA on the Ijcnn1 dataset was stopped early after 10,000 s, and the result for the instances tested up to that point is reported in italics. From Table 2, we can draw the following conclusions.

Table 2. Comparison of online kernel algorithms on 8 benchmark binary classification datasets.

First, we found that budget online kernel classification algorithms run much faster than the regular ones (namely, NORMA and ACCOSVM) in both the training and test processes. This means that scalable online kernel methods are more practical in terms of time efficiency. However, budget online kernel classification algorithms generally make more mistakes on the training set and thus obtain lower accuracy on the test set. Adopting budget strategies incurs a potential loss of information, which underlines the importance of exploring effective techniques for budget online kernel learning algorithms. The same phenomenon occurs within the family of budget online algorithms. We notice that Projectron takes more time in training and testing but obtains better results than both BNORMA and Forgetron on five out of eight datasets, since the projection strategy is more complex than simply removing an SV. The trade-off between accuracy and time efficiency should be analyzed for each specific situation.

Second, we compared the two kernel approximation methods (FOGD and FFTRL) with the budget online kernel classification algorithms. As listed in Table 2, FOGD takes the least time in training, and our proposed method FFTRL is competitive. Both algorithms train far faster than any of the budget online kernel algorithms. We attribute the time efficiency of kernel approximation methods to the linear online learning framework. Moreover, both FOGD and FFTRL also show a better mistake rate and accuracy in most cases, which demonstrates that the kernel approximation scheme is suitable for large-scale online learning.

Finally, we analyzed the performance of FFTRL. It is somewhat surprising that FFTRL achieves the lowest mistake rate or the highest accuracy, and even outperforms NORMA on some datasets (such as Spambase, Coil2000, A7a and Ijcnn1). The reasons may lie in two aspects. The first is the appropriate choice of the number of samples $D$. According to the conclusions from [12], choosing too large a value of $D$ will result in under-fitting for small datasets, and choosing too small a value of $D$ will result in over-fitting. The second is the well-designed per-coordinate learning rate. Apart from FFTRL, all the gradient-based algorithms adopt a global learning rate schedule. However, in the online setting the learning rate should reflect our confidence in each dimension, which indicates that a global learning rate schedule is not the optimal choice. Besides, FFTRL also produces a sparser model than FOGD, as expected. Unfortunately, the benefits of sparsity for FFTRL are largely obscured by the efficiency of the linear learning framework, since the test times of FOGD and FFTRL are generally the same. To validate the advantage of our proposed method over FOGD, we list the number of zero components in the weight vector $\textbf{w}$ in Table 3, where the number of zero coefficients in FOGD is taken as the baseline. From Table 3, we can clearly see that the model generated by FFTRL is much sparser than that of FOGD.

Table 3. Sparsity promotion of FFTRL against FOGD.

4 Conclusion

In this paper, we present a novel sparse algorithm, FFTRL, for solving large-scale online kernel binary classification tasks. The basic idea of FFTRL is to approximate a kernel function via a functional approximation technique, which enables us to transform the original online kernel learning task into an approximate linear online learning task. Random Fourier features are used as the kernel approximation scheme, which induces a new high-dimensional feature space. We further adopt FTRL to find a sparse solution in the new feature space. In theory, we analyze the regret bound of our proposed algorithm.

We performed extensive experiments to evaluate the performance of FFTRL and other state-of-the-art online kernel learning methods. Our promising results show that FFTRL enjoys both time efficiency and accuracy. Moreover, the sparsity produced by FFTRL fits the need of high dimensional and large-scale data scenarios, making FFTRL suitable for real-world applications. In future work, we plan to extend our work by exploring the field of multi-label online classification tasks.

References

1. Alcalá-Fdez, J., et al.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple Valued Logic Soft Comput. 17, 255–287 (2011)
2. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7(11) (2006)
3. Cavallanti, G., Cesa-Bianchi, N., Gentile, C.: Tracking the best hyperplane with a simple budget perceptron. Mach. Learn. 69(2), 143–167 (2007)
4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 1–27 (2011)
5. Crammer, K., Kandola, J., Singer, Y.: Online classification on a budget. In: Advances in Neural Information Processing Systems, vol. 16 (2003)
6. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 37(5), 1342–1372 (2008)
7. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
8. Guo, H., Zhang, A., Wang, W.: An accelerator for online SVM based on the fixed-size KKT window. Eng. Appl. Artif. Intell. 92, 103637 (2020)
9. Kivinen, J., Smola, A.J., Williamson, R.C.: Online learning with kernels. IEEE Trans. Signal Process. 52(8), 2165–2176 (2004)
10. Li, B., Zhao, P., Hoi, S.C., Gopalkrishnan, V.: PAMR: passive aggressive mean reversion strategy for portfolio selection. Mach. Learn. 87(2), 221–258 (2012)
11. Li, X., Plale, B., Vijayakumar, N., Ramachandran, R., Graves, S., Conover, H.: Real-time storm detection and weather forecast activation through data mining and events processing. Earth Sci. Inf. 1(2), 49–57 (2008)
12. Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning. J. Mach. Learn. Res. 17(47), 1 (2016)
13. McMahan, B.: Follow-the-regularized-leader and mirror descent: equivalence theorems and l1 regularization. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 525–533. JMLR Workshop and Conference Proceedings (2011)
14. McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1222–1230 (2013)
15. Orabona, F., Keshet, J., Caputo, B.: The projectron: a bounded kernel-based perceptron. In: Proceedings of the 25th International Conference on Machine Learning, pp. 720–727 (2008)
16. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems 20 (2007)
17. Rudin, W.: Fourier Analysis on Groups. Courier Dover Publications (2017)
18. Shalev-Shwartz, S., et al.: Online learning and online convex optimization. Found. Trends® Mach. Learn. 4(2), 107–194 (2012)
19. Xiao, L.: Dual averaging method for regularized stochastic learning and online optimization. In: Advances in Neural Information Processing Systems 22 (2009)
20. Zhang, K., Lan, L., Wang, Z., Moerchen, F.: Scaling up kernel SVM on limited resources: a low-rank linearization approach. In: Artificial Intelligence and Statistics, pp. 1425–1434. PMLR (2012)

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, China
Changzhi Su, Li Zhang & Lei Zhao

Corresponding author

Correspondence to Li Zhang.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

About this paper

Cite this paper

Su, C., Zhang, L., Zhao, L. (2023). FFTRL: A Sparse Online Kernel Classification Algorithm for Large Scale Data. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14254. Springer, Cham. https://doi.org/10.1007/978-3-031-44207-0_17

DOI: https://doi.org/10.1007/978-3-031-44207-0_17
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44206-3
Online ISBN: 978-3-031-44207-0
eBook Packages: Computer Science, Computer Science (R0)
