Introduction

Online learning techniques have made great progress in recent years [1,2,3,4,5,6,7,8,9,10]. In general, online learning has several characteristics: (1) the samples arrive sequentially in a stream and only one new sample is available in each online learning round; (2) the label of the newly arrived sample is predicted by the current classifier before the true label of the sample is revealed; (3) when a new sample is misclassified, the classifier should be updated in time to improve its generalization ability; and (4) the classifier can be updated without re-training on all the previously seen samples.

In the literature, much attention has been paid to online supervised learning, e.g., [11,12,13,14], where the true labels are available in the online training process. However, in practice, we frequently face online semi-supervised learning problems [15,16,17], such as the human categorization problem. In [16], Zhu et al. designed a series of experiments to demonstrate that human learning behavior is closely related to the semi-supervised learning pattern. Furthermore, Gibson et al. [18] applied the learned semi-supervised model to human learning tasks. In human learning, learners incrementally learn the classes of various objects from the surrounding environment, where only a few objects are labeled by a knowledgeable source. This scenario can actually be regarded as online semi-supervised learning, that is, the label of a newly arrived sample is unavailable or presented only sporadically during the online process.

In this paper, we focus on the online semi-supervised learning (OS2L) problem. Several OS2L algorithms have been proposed in the past several years. By using a heuristic method to greedily label the unlabeled examples, Babenko et al. [19] and Grabner et al. [20] tried to solve OS2L problems within an online supervised learning framework. Dyer et al. [21] presented a semi-supervised learning (SSL) framework called COMPOSE (COMPacted Object Sample Extraction), in which a few labeled samples are given initially, and then an SSL problem is solved based on the currently labeled samples and new unlabeled samples, which follow a drifting distribution. To reduce the computational complexity of manifold construction in the online training process, Kveton et al. [22] and Farajtabar et al. [23] proposed the harmonic solution for manifold regularization on an approximate graph.

Using online convex programming, Goldberg et al. [24] proposed an online manifold learning framework for SSL in a kernel space with stochastic gradient descent. In addition, they extended their method to online active learning by adding an optional component to select the instances to be labeled [25]. Sun et al. [26, 27] exploited the Fenchel conjugate of the hinge loss and a gradient ascent method to solve the dual problem of their online manifold learning model. The algorithms in [24, 26, 27] are derived using online gradient methods, and can thus be regarded as solving off-line semi-supervised learning models by stochastic gradient methods. However, none of these stochastic gradient methods obtains an exact solution, because they do not directly solve the constrained optimization problem involved.

In practice, an exact solution is preferable, since it usually achieves a more accurate result with higher efficiency. In this paper, we propose an algorithm with an analytical solution for the online semi-supervised learning problem. Specifically, we propose a novel online manifold regularization learning model in a reproducing kernel Hilbert space (RKHS), exploiting the internal geometric information of the unlabeled data and taking advantage of kernel methods. In each iteration of online training, by considering the newly arrived sample and the previous samples, an online model based on a constrained optimization problem is presented, and the exact solution of the proposed model is obtained with the help of the Lagrange dual problem. Meanwhile, a fast learning algorithm (named FMOMR) is presented by introducing an approximation technique to compute the derivative of the manifold term. In addition, the regularization parameter of the proposed model can be regarded as a forgetting factor, which provides a reasonable and consistent way to control the number of support vectors. Owing to these merits, the proposed online predictors experimentally exhibit an accuracy comparable to that of the batch algorithm LapSVM.

This paper substantially extends our previous work [28] by providing (a) a fast algorithm for the proposed model (“Fast Algorithm of the Proposed Model” section), (b) several buffering strategies (“Buffering Strategies” section), (c) a brief theoretical analysis of the proposed algorithms (“Theory Analysis” section), (d) more experiments (“Action Video Categorization” section), and (e) some background knowledge (“Background Knowledge” section).

The rest of this paper is organized as follows. The background knowledge is briefly reviewed in the “Background Knowledge” section, and the proposed model and algorithms are detailed in the “Online Manifold Learning with Kernels” section. After presenting experimental results in the “Experiments” section, we conclude the paper in the “Conclusion” section.

Background Knowledge

The background knowledge consists of two parts: the Lagrange dual problem and LapSVM, a batch manifold learning model.

Lagrange Dual Problem

The Lagrange dual technique is frequently used for solving the primal optimization problem in RKHS. Thus, we give a brief review of the Lagrange dual problem in this section. Consider the primal constrained optimization problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} \min_x& \quad f_0(x) \\ \text{s.t.} &\quad f_i(x)\leq0,\quad i=1,...,p\\ &\quad h_i(x) = 0,\quad i=1,...,q, \end{array}\end{array} $$
(1)

where \(x\in\mathbb{R}^n\) and \(f_0(x)\) is the objective function to be minimized subject to the \(p\) inequality constraints \(f_i(x)\leq 0\) and the \(q\) equality constraints \(h_i(x)=0\). The Lagrange function of Eq. 1 is

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} L(x,\alpha,\beta) = f_0(x) +\sum\limits_{i=1}^p\alpha_if_i(x)+\sum\limits_{i=1}^q\beta_ih_i(x) \end{array} \end{array} $$
(2)

where \(\alpha = (\alpha_1,...,\alpha_p)^T\) and \(\beta = (\beta_1,...,\beta_q)^T\) are the Lagrange multipliers. By minimizing (2) over \(x\), the Lagrange dual function \(g\) is defined as:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} g(\alpha,\beta) = \min_x \quad L(x,\alpha,\beta) \end{array} \end{array} $$
(3)

Then the Lagrange dual problem of Eq. 1 is to maximize (3)

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} \max_{\alpha,\beta}&\quad g(\alpha,\beta) \\ \text{s.t.}& \quad \alpha_i\geq 0,\quad i=1,...,p. \end{array} \end{array} $$
(4)

Strong duality, that is, the optimal values of Eqs. 1 and 4 being equal, holds under Slater’s condition [29]: the primal problem is convex and there exists \(x_0\) such that \(f_i(x_0)<0\), \(i=1,...,p\). Therefore, the solution of the primal problem can be obtained by solving its Lagrange dual problem. In fact, the standard Laplacian SVM (an SVM based on manifold regularization) is commonly solved by the Lagrange dual technique, as reviewed below.
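Before turning to LapSVM, here is a toy illustration of the construction above (our own example, not taken from the paper): consider minimizing \(\frac{1}{2}\|x\|^2\) subject to \(1-a^Tx\leq 0\) with \(a\neq 0\). The Lagrangian is \(L(x,\alpha)=\frac{1}{2}\|x\|^2+\alpha(1-a^Tx)\) with \(\alpha\geq 0\); minimizing over \(x\) gives \(x=\alpha a\), so the dual function is

$$g(\alpha) = \alpha-\frac{1}{2}\alpha^2\|a\|^2. $$

Maximizing over \(\alpha\geq 0\) yields \(\alpha^{*}=1/\|a\|^2\) and the dual optimum \(1/(2\|a\|^2)\), which equals the primal optimum attained at \(x^{*}=a/\|a\|^2\); Slater’s condition holds here since any \(x\) with \(a^Tx>1\) is strictly feasible.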

Manifold Regularization for Semi-supervised Learning

The Laplacian support vector machine (LapSVM) [30] is derived by adding a manifold regularization term to the support vector machine (SVM). Given the labeled training data \((x_1,y_1),...,(x_l,y_l)\) and the unlabeled training data \(x_{l+1},...,x_{l+u}\), where \(x_{i}\in \mathcal {X}\) and \(y_i\in\{-1,1\}\), LapSVM is given by the following optimization problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \underset{f\in\mathcal{H}_K}{\min} &\frac{1}{l}\sum\limits_{i=1}^{l}(1-y_if(x_i))_{+}+\gamma_A||f||_K^2\\ &+\frac{\gamma_I}{(u+l)^2}\mathbf{f}^TL\mathbf{f}. \end{array} \end{array} $$
(5)

where \(\gamma_A\), \(\gamma_I\) are trade-off parameters, \(\mathbf{f} = [f(x_1),...,f(x_{l+u})]\), \(K: \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\) is a Mercer kernel and \(\mathcal {H}_{K}\) is the associated RKHS of functions \(f: \mathcal {X} \rightarrow \mathbb {R}\) with the corresponding norm \(\|\cdot\|_K\). In particular, the graph Laplacian \(L\) is defined by \(L = D-W\), where \(W\) is the edge-weight matrix and \(D\) is the diagonal degree matrix defined by \(D_{ii} = {\sum }_{j=1}^{l+u}W_{ij}\), \(i = 1,...,l+u\). By the Representer Theorem (see Theorem 2 in [30]), the optimal solution of Eq. 5 can be expressed as

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllll} f^{*}(x) = \sum\limits_{i=1}^{l+u}\alpha_i^{*}K(x,x_i). \end{array} \end{array} $$
(6)

By adding a bias term b to the above formula, the primal problem (5) can be rewritten as:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \underset{\alpha\in \mathbb{R}^{l+u},\xi\in\mathbb{R}^l}{\min} \frac{1}{l}\sum\limits_{i=1}^{l}\xi_i+\gamma_A\alpha^TK\alpha+\frac{\gamma_I}{(u+l)^2}\alpha^TKLK\alpha \\ \text{s.t.}\quad y_i\left( \sum\limits_{j=1}^{l+u}\alpha_jK(x_i,x_j)+b\right)\geq 1-\xi_i,i=1,...,l \\ \xi_i \geq 0, i=1,...,l. \end{array} \end{array} $$

where \(\alpha = [\alpha_1,...,\alpha_{l+u}]^T\). Then, we can obtain the Lagrangian:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllllll} L(\alpha,\xi,b,\beta,\zeta) \\ \qquad= \frac{1}{l}\sum\limits_{i=1}^{l}\xi_i+\frac{1}{2}\alpha^T\left( 2\gamma_AK+2\frac{\gamma_I}{(l+u)^2}KLK\right)\alpha \\ \qquad\quad-\sum\limits_{i=1}^{l}\beta_i\left( y_i\left( {\sum}_{j=1}^{l+u}\alpha_jK(x_i,x_j)+b\right)-1+\xi_i\right)\\\qquad\quad-\sum\limits_{i=1}^l\zeta_i\xi_i. \end{array} \end{array} $$

where \(\beta_i\), \(\zeta_i\) are Lagrange multipliers.

To obtain the minimum with respect to \(b\) and \(\xi\), consider the conditions \(\partial L/\partial b = 0\) and \(\partial L/\partial \xi_i = 0\). Thus, a reduced Lagrangian can be formulated as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} L^{R}(\alpha,\beta) &=& \frac{1}{2}\alpha^T\left( 2\gamma_AK+2\frac{\gamma_I}{(u+l)^2}KLK\right)\alpha \\ &&-\alpha^TKJ^TY\beta+\sum\limits_{i=1}^l\beta_i. \end{array} \end{array} $$

where \(J = [I\ 0]\) is an \(l \times (l+u)\) matrix with \(I\) the \(l \times l\) identity matrix (assuming the first \(l\) points in the training set are labeled) and \(Y\) is a diagonal matrix with \(Y_{ii} = y_i\) for \(i = 1,...,l\). By taking the derivative of the reduced Lagrangian with respect to \(\alpha\), we have

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} \frac{\partial{L^R}}{\partial{\alpha}} = \left( 2\gamma_AK+2\frac{\gamma_I}{(u+l)^2}KLK\right)\alpha-KJ^TY\beta. \end{array} \end{array} $$
(7)

Setting this derivative to zero, we get:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} \alpha = \left( 2\gamma_AI+2\frac{\gamma_I}{(u+l)^2}LK\right)^{-1}J^TY\beta^{*}. \end{array} \end{array} $$
(8)

Substituting this back into the reduced Lagrangian, an optimization problem with respect to \(\beta\) is derived as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \beta^{*} =& \underset{\beta\in\mathbb{R}^l}{\max}\sum\limits_{i=1}^l\beta_i-\frac{1}{2}\beta^TQ\beta \\ \text{s. t. } & \sum\limits_{i=1}^{l}\beta_iy_i = 0 \\ &0\leq \beta_i \leq \frac{1}{l},i=1,...,l \end{array} \end{array} $$
(9)

where

$$\begin{array}{@{}rcl@{}} Q = YJK\left( 2\gamma_AI+2\frac{\gamma_I}{(u+l)^2}LK\right)^{-1}J^TY \end{array} $$

By solving (9) and using Eqs. 6 and 8, we can obtain the optimal solution \(f^{*}(x)\). However, training the LapSVM classifier with all the training data can be very slow when the data size is large. To improve the computational efficiency for online learning, we propose a novel online manifold learning model based on a constrained optimization problem, which is presented in the next section.
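To make the pipeline above concrete, the following minimal sketch is our own illustration, not the authors' code: the function and variable names are ours, the bias term \(b\) is omitted, and a generic SLSQP routine stands in for a dedicated QP solver. It assembles \(K\), \(W\), \(L\) and \(Q\) for a small data set, solves the dual (9), and recovers \(\alpha\) via Eq. 8.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def rbf(A, B, sigma):
    # RBF similarity: exp(-||a - b||^2 / (2*sigma^2))
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma ** 2))

def lapsvm_fit(X_lab, y_lab, X_unl, gamma_A=1.0, gamma_I=1.0,
               sigma_K=1.0, sigma_W=1.0):
    """Sketch of the LapSVM dual solve (Eqs. 5-9); y_lab is a +/-1 array."""
    X = np.vstack([X_lab, X_unl])            # first l points are labeled
    l, u = len(X_lab), len(X_unl)
    n = l + u

    K = rbf(X, X, sigma_K)                   # Mercer kernel matrix
    W = rbf(X, X, sigma_W)                   # fully connected edge weights
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W

    J = np.hstack([np.eye(l), np.zeros((l, u))])   # J = [I 0]
    Y = np.diag(y_lab.astype(float))

    # Q = Y J K (2*gamma_A*I + 2*gamma_I/(l+u)^2 * L K)^{-1} J^T Y   (Eq. 9)
    M = 2 * gamma_A * np.eye(n) + (2 * gamma_I / n ** 2) * (L @ K)
    Minv = np.linalg.inv(M)
    Q = Y @ J @ K @ Minv @ J.T @ Y
    Q = 0.5 * (Q + Q.T)                      # symmetric in exact arithmetic

    # Dual QP (9): max sum(beta) - 0.5*beta^T Q beta,
    #              s.t. sum_i beta_i y_i = 0 and 0 <= beta_i <= 1/l.
    obj = lambda b: 0.5 * b @ Q @ b - b.sum()
    grad = lambda b: Q @ b - np.ones(l)
    cons = {"type": "eq", "fun": lambda b: b @ y_lab}
    res = minimize(obj, np.zeros(l), jac=grad, method="SLSQP",
                   bounds=[(0.0, 1.0 / l)] * l, constraints=[cons])
    beta = res.x

    alpha = Minv @ J.T @ Y @ beta            # expansion coefficients (Eq. 8)
    return alpha, X

def lapsvm_predict(alpha, X_train, X_test, sigma_K=1.0):
    # f*(x) = sum_i alpha_i K(x, x_i)   (Eq. 6; bias omitted in this sketch)
    return np.sign(rbf(X_test, X_train, sigma_K) @ alpha)
```

In practice one would use a dedicated QP solver and recover the bias from the KKT conditions; the sketch only shows how the matrices of Eqs. 5–9 fit together.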

Online Manifold Learning with Kernels

In this section, the proposed model is presented in detail. In the “Online Model Based on Manifold Regularization” section, a model based on manifold regularization is proposed for online semi-supervised learning in an RKHS. In the “Online Algorithm of the Proposed Model” section, the proposed model is solved by exploiting the properties of the Lagrange dual problem. Several fast learning strategies are presented in the “Fast Learning Strategies” section. A brief theoretical analysis of the proposed algorithms is presented in the “Theory Analysis” section.

Online Model Based on Manifold Regularization

Assume that the current learning data for semi-supervised learning are \((x_1,y_1,\delta_1),(x_2,y_2,\delta_2),...,(x_t,y_t,\delta_t)\), where \(x_{i}\in \mathcal {X}\) is a point, \(y_{i}\in \mathcal {Y}=\{-1,1\}\) is its label and \(\delta_i\) is a flag indicating whether the label \(y_i\) is available (\(y_i\) is available if and only if \(\delta_i = 1\)). At round \(t\), the current predictor is \(h_t(x) = \text{sign}(f_t(x))\), and \(f_0\) is set to 0 in our algorithm. In online semi-supervised learning, when a new sample \((x_{t+1},y_{t+1},\delta_{t+1})\) arrives, the function \(f_{t+1}\) is updated based on the current decision function \(f_t\) and the implicit feedback, that is, the manifold structure of the samples. The detailed process of online manifold learning is presented in Fig. 1: a new input \(x_t\) is provided to the current predictor, which computes the decision value \(f(x_t)\). Thereafter, the learner updates the decision function according to the available feedback: if the label of \(x_t\) is available, the learner updates the classifier with both the explicit feedback \(y_t\) and the implicit feedback from the manifold structure of the samples; otherwise, the classifier is updated based on the implicit feedback only. The process continues until no more new samples arrive, and the final predictor \(h_T(x) = \text{sign}(f_T(x))\) (where \(T\) is the final round) is used for classification tasks.

Fig. 1 Online semi-supervised learning based on manifold regularization framework
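In code, the protocol of Fig. 1 amounts to a simple loop. The sketch below is our own schematic (the streaming interface is assumed by us, and `predict` and `update` are placeholders for the predictor \(h_t\) and the model updates derived in the following sections); it only fixes the control flow.

```python
def online_semi_supervised(stream, update, predict):
    """Generic OS2L loop of Fig. 1 (a sketch).

    stream  -- iterable of (x_t, y_t, delta_t); delta_t == 1 iff y_t is revealed
    update  -- callable performing one model update (e.g., a MOMR/FMOMR step)
    predict -- callable mapping (model, x) to a decision value f(x);
               it should treat model=None as the zero function f_0 = 0
    """
    model = None                       # corresponds to f_0 = 0
    for x_t, y_t, delta_t in stream:
        f_x = predict(model, x_t)      # decision value of the current predictor
        y_hat = 1 if f_x >= 0 else -1  # h_t(x_t) = sign(f_t(x_t))
        if delta_t == 1:
            # explicit feedback y_t plus the implicit manifold feedback
            model = update(model, x_t, y=y_t)
        else:
            # implicit feedback only (manifold structure of the samples)
            model = update(model, x_t, y=None)
    return model
```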

Suppose that \(K(\cdot,\cdot)\) is a chosen kernel function over the training samples and \(\mathcal {H}\) is the corresponding RKHS. Then, according to the Representer Theorem [31], we can write \(f_t\) and \(f_{t+1}\) as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllll} f_t(\cdot) &= \sum\limits_{i=1}^t\alpha_i^{t}K(x_i,\cdot), \\ f_{t+1}(\cdot) &= \sum\limits_{i=1}^t\alpha_i^{t+1}K(x_i,\cdot)+\alpha_{t+1}^{t+1}K(x_{t+1},\cdot). \end{array} \end{array} $$
(10)

In the online learning process, our aim is to update \(\{\alpha_i^{t+1}\}_{i=1}^{t+1}\) from \(\{\alpha_i^{t}\}_{i=1}^{t}\) by a proper algorithm. Considering the trade-off between the amount of progress made on each round and the amount of information retained from previous rounds, and compromising among the classification error, the manifold constraint and the complexity of \(f\) as in LapSVM, our online semi-supervised learning model with manifold regularization is presented as follows:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \underset{f,\xi_{t+1}}{\min}& \frac{1}{2}\|f-f_{t}\|_{\mathcal{H}}^{2}+\frac{\lambda_{1}}{2}\|f\|^{2}_{\mathcal{H}}+C\delta_{t+1}\xi_{t+1}\\&+ \frac{1}{2}\lambda_{2}\sum\limits_{i=1}^{t}(f(x_{i})-f(x_{t+1}))^{2}w_{it+1} \\ \text{s.t.} \quad &y_{t+1}f(x_{t+1})\geq 1-\xi_{t+1}, \xi_{t+1} \geq 0 \end{array} \end{array} $$
(11)

where \(\frac {1}{2}\|f-f_{t}\|_{\mathcal {H}}^{2}\) measures the difference between \(f\) and the previous \(f_t\), \(\|f\|^{2}_{\mathcal {H}}\) controls the complexity of the decision function \(f\), \({\sum }_{i=1}^{t}(f(x_{i})-f(x_{t+1}))^{2}w_{i,t+1}\) is the manifold regularizer, which depends on the edge weights \(w_{i,t+1}\), \(f\) and \(x_i\), \(\xi_{t+1}\) is the slack variable denoting a possible error on the newly arrived data \((x_{t+1},y_{t+1},\delta_{t+1})\) after \(f\) is determined, and \(\lambda_1\), \(\lambda_2\) and \(C\) are trade-off parameters balancing the complexity, the manifold regularizer and the classification error.

In the objective function of Eq. 11, the manifold structure of the samples is reflected in the term \({\sum }_{i=1}^{t}(f(x_{i})-f(x_{t+1}))^{2}w_{i,t+1}\), which can be regarded as an implicit feedback. This regularization term encourages the new sample to receive a decision value similar to those of its nearby samples on the manifold. The solution of the proposed model is presented in the next section.
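As a side remark (a standard identity, not an equation from the paper): if we form a graph over \(x_1,...,x_{t+1}\) with symmetric weights whose only nonzero entries are \(w_{i,t+1}=w_{t+1,i}\), the manifold term above is exactly the Laplacian quadratic form used in LapSVM,

$$\sum\limits_{i=1}^{t}(f(x_i)-f(x_{t+1}))^2w_{i,t+1} = \frac{1}{2}\sum\limits_{i,j=1}^{t+1}w_{ij}(f(x_i)-f(x_j))^2 = \mathbf{f}^TL\mathbf{f}, $$

where \(L=D-W\) and \(\mathbf{f}=[f(x_1),...,f(x_{t+1})]^T\). Under the representation (10), \(\mathbf{f}=K\alpha\), so this term becomes \(\lambda_2\alpha^TKLK\alpha\), which is the source of the matrix \(\lambda_2KLK\) appearing in the following sections.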

Online Algorithm of the Proposed Model

In this section, we give a detailed solution of the proposed model by exploiting the properties of the Lagrange dual problem. Assuming that \(\delta_{t+1} = 1\) (if \(\delta_{t+1} = 0\), the solution of Eq. 11 can be obtained by a similar process), the Lagrange dual problem of Eq. 11 is

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllll} \underset{\gamma_{t+1}}{\max}\underset{f,\xi_{t+1}}{\min}\quad &{L(f,\xi_{t+1},\gamma_{t+1},\beta_{t+1})} \\ \text{s.t.} \quad &\gamma_{t+1} \geq 0, \quad \beta_{t+1} \geq 0 \end{array} \end{array} $$
(12)

where γ t+1 and β t+1 are the Lagrange multipliers corresponding to the constraints y t+1 f(x t+1) ≥ 1 − ξ t+1 and ξ t+1 ≥ 0, respectively, and

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} {L(f,\xi_{t+1},\gamma_{t+1},\beta_{t+1})}& =& \frac{1}{2}\|f-f_t\|_{\mathcal{H}}^2+\frac{\lambda_1}{2}\|f\|^2_{\mathcal{H}} \\ &&+\frac{1}{2}\lambda_2\sum\limits_{i=1}^{t}(f(x_i)\,-\,f(x_{t+1}))^2w_{it+1} \\ &&-\gamma_{t+1}(y_{t+1}f(x_{t+1})-1+\xi_{t+1}) \\ &&+C\xi_{t+1}-\beta_{t+1}\xi_{t+1} \end{array} \end{array} $$

By solving the Lagrange dual problem of Eq. 11 (the details can be found in the Appendix), we can obtain the new classifier at time t + 1:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} &f_{t+1}(x) = \sum\limits_{i=1}^{t+1}\alpha_i^{t+1}K(x_{i},x), \\ &h_{t+1} = \text{sign}({f_{t+1}(x)}), \end{array} \end{array} $$
(13)

where

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \alpha^{t+1} =A^{-1}(K\tilde{\alpha}^{t}+\delta_{t+1} y_{t+1}\gamma_{t+1}^{*}J). \end{array} \end{array} $$

The above process is summarized in Algorithm 1. In Algorithm 1, when the first sample arrives, the value of \(\alpha_1\) is set to 1.

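For illustration, one round of the exact update can be sketched as follows. This is our own Python sketch, not the authors' pseudocode: we read \(J\) as \(Ke\) (the kernel column of the new sample), take \(\tilde{\alpha}^t\) as \(\alpha^t\) padded with a zero, compute \(\overline{\gamma}_{t+1}\) following Eq. 33 of the Appendix, and clip it to the box constraint \(0\leq\gamma_{t+1}\leq C\) of the dual.

```python
import numpy as np

def momr_step(alpha_t, K, L, y_new, delta_new, lam1, lam2, C):
    """One round of the exact MOMR update (a sketch of Algorithm 1).

    alpha_t -- coefficients after round t, shape (t,)
    K, L    -- kernel matrix and graph Laplacian over all t+1 samples,
               with the new sample x_{t+1} in the last position
    """
    n = K.shape[0]                          # t + 1 samples including x_{t+1}
    alpha_tilde = np.append(alpha_t, 0.0)   # previous alpha^t padded with 0
    e = np.zeros(n); e[-1] = 1.0            # unit vector selecting x_{t+1}
    J = K @ e                               # our reading: J = K e

    A = K + lam1 * K + lam2 * (K @ L @ K)   # A = K + lam1*K + lam2*K L K
    A_inv = np.linalg.inv(A)                # O(n^3): the bottleneck noted below

    if delta_new == 1:                      # label available: solve for gamma*
        gamma_bar = (1.0 - y_new * (J @ A_inv @ K @ alpha_tilde)) / (J @ A_inv @ J)
        gamma = float(np.clip(gamma_bar, 0.0, C))   # box constraint 0 <= gamma <= C
        correction = y_new * gamma * J
    else:                                   # unlabeled: implicit feedback only
        correction = np.zeros(n)

    return A_inv @ (K @ alpha_tilde + correction)   # alpha^{t+1}
```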

However, there are two difficulties in performing Algorithm 1: (1) to compute \(\alpha^{t+1}\), we need the inverse of the matrix \(A = K + \lambda_1 K + \lambda_2 KLK\), which is expensive to calculate when \(t\) is very large; (2) in the online learning process, online manifold learning algorithms with kernel functions have to store the whole sequence up to the current round. As a result, the set of support vectors grows unboundedly, which limits the applicability of the online algorithms. Therefore, we present a fast algorithm to solve the proposed model and introduce several buffering strategies to reduce the number of support vectors.

Fast Learning Strategies

Fast Algorithm of the Proposed Model

In this section, we propose a fast algorithm to solve the proposed model. Note that if we let \(\lambda_2 = 0\), calculating the inverse matrix can be avoided. Therefore, in order to retain the benefit of manifold regularization while improving the computational efficiency, we use an approximate term to replace (31).

Consider formula (30); by replacing the term \(\lambda_2 KLK\alpha\) with \(\lambda_2 KLK\alpha^t\), we have

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllll} \frac{\partial {L}^R}{\partial \alpha} \approx &(K+\lambda_1K)\alpha+\lambda_2KLK\alpha_t\\ &-K\alpha^t-Jy_{t+1}\gamma_{t+1}\\ \end{array} \end{array} $$
(14)

This approximation is reasonable because the term \(\frac {1}{2}\|f-f_{t}\|_{\mathcal {H}}^{2}\) in our model controls the distance of the predicted \(f\) from the previous \(f_t\), which guarantees that the difference between \(\alpha^{t+1}\) and \(\alpha^t\) is not very large. In addition, the convex function \(M(\alpha) = \alpha^T\lambda_2 KLK\alpha\) is continuous and differentiable, so we have

$$\begin{array}{@{}rcl@{}} \begin{array}{llllllll} \frac{\partial M}{\partial \alpha^{t+1}} \approx \frac{\partial M}{\partial \alpha^{t}} \\ \end{array} \end{array} $$

that is,

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \lambda_2KLK\alpha^{t+1} \approx \lambda_2KLK\alpha^{t} . \end{array} \end{array} $$

Now, setting Eq. 14 to zero, we get

$$\begin{array}{@{}rcl@{}} \begin{array}{llllllll} {\alpha} = &\frac{1}{1+\lambda_1}[(I-\lambda_2LK)\alpha_t+ey_{t+1}\gamma_{t+1}]\\ \end{array} \end{array} $$
(15)

Taking the derivative of Eq. 29 with respect to γ t+1 we get:

$$\begin{array}{@{}rcl@{}} \begin{array}{lllllll} \frac{\partial {L}^R}{\partial \gamma_{t+1}} = 1-y_{t+1}\alpha^TJ=0 \end{array} \end{array} $$
(16)

Substituting (15) into (16), we have

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \overline{\gamma}_{t+1} = 1+\lambda_1-y_{t+1}J^T(I-\lambda_2LK)\alpha^t \end{array} \end{array} $$
(17)

Let the approximate solution of Eq. 30 be \(\hat {\alpha }_{t+1}\) and \(\hat {\gamma }^{*}_{t+1}\). Hence,

$$\begin{array}{@{}rcl@{}} \hat{\gamma}_{t+1}^{*} = \left\{ \begin{array}{lcl} 0,\quad \text{if} \quad \overline{\gamma}_{t+1}\leq 0 \\ C,\quad \text{if} \quad \overline{\gamma}_{t+1}\geq C \\ \overline{\gamma}_{t+1},\quad \text{otherwise} \end{array} \right. \end{array} $$
(18)

Similar to Eq. 13, the classifier obtained at time t + 1 is:

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} &f_{t+1}(x) = \sum\limits_{i=1}^{t+1}\hat{\alpha}_i^{t+1}K(x_{i},x), \\ &h_{t+1} = \text{sign}({f_{t+1}(x)}), \end{array} \end{array} $$
(19)

where

$$\begin{array}{@{}rcl@{}} \begin{array}{lllll} \hat{\alpha}_{t+1} = &\frac{1}{1+\lambda_1}\left[(I-\lambda_2LK)\alpha_t+e\delta_{t+1}y_{t+1}\hat{\gamma}^{*}_{t+1}\right] \end{array} \end{array} $$

The above process is summarized in Algorithm 2.

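One round of FMOMR then reduces to a few matrix-vector products. The sketch below is our own illustration (not the authors' pseudocode), assuming that \(J\) is the kernel column \(Ke\) of the new sample and that \(\alpha^t\) is padded with a zero for \(x_{t+1}\).

```python
import numpy as np

def fmomr_step(alpha_t, K, L, y_new, delta_new, lam1, lam2, C):
    """One round of the fast FMOMR update, following Eqs. 15, 17 and 18.

    alpha_t -- coefficients after round t, shape (t,); K and L cover all
               t+1 samples, with x_{t+1} in the last position.
    """
    n = K.shape[0]
    alpha = np.append(alpha_t, 0.0)          # previous alpha^t padded with 0
    e = np.zeros(n); e[-1] = 1.0             # selects the new sample x_{t+1}
    k_new = K @ e                            # kernel column of x_{t+1} (our reading of J)

    # (I - lam2*L*K) alpha^t via matrix-vector products: O(n^2), no inverse
    M = alpha - lam2 * (L @ (K @ alpha))

    if delta_new == 1:
        gamma_bar = 1.0 + lam1 - y_new * (k_new @ M)    # Eq. 17
        gamma = float(np.clip(gamma_bar, 0.0, C))       # Eq. 18
        return (M + e * y_new * gamma) / (1.0 + lam1)   # Eq. 15
    return M / (1.0 + lam1)                  # unlabeled: manifold feedback only
```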

The main computation in Eqs. 15 and 17 is the matrix product \(L\times K\). By its definition, \(L = D-W\) is a sparse matrix (\(D\) is a diagonal matrix and \(W\) is a matrix in which only \(2t\) elements are non-zero), which means the computational complexity of \(L\times K\) is only \(O(t^2)\). Therefore, the computational complexity of Algorithm 2 is \(O(t^2)\). Note that the computational cost still becomes very high as \(t\) increases, which limits the scalability of the proposed algorithms. Therefore, we present several buffering strategies to improve the scalability of the online algorithms in the next section.

Buffering Strategies

In practice, kernel-based discriminative algorithms have been shown to perform very well on semi-supervised learning problems [30, 32]. However, in the online learning process, the set of support vectors will grow unboundedly, which limits the applicability of the online manifold regularization algorithms. To address this problem, we present several approaches to bound the size of the support set.

Buffering strategies [5, 24, 33] keep a fixed number of support vectors for online learning. Let the buffer size be τ. There are several different strategies:

(1) Buffer-N. The oldest sample in the buffer is replaced with the new incoming sample after each online learning round.

(2) Buffer-U. The oldest unlabeled sample in the buffer is replaced with the new incoming sample after each online learning round. When the buffer is filled with labeled samples, the oldest labeled point is evicted from the buffer.

To handle a more general case, one can instead remove the sample with the smallest \(|\alpha_i^t|\) at round \(t\), where \(|\cdot|\) denotes the absolute value. As suggested in [24], we choose Buffer-U as the buffering strategy for all our experiments.
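A minimal sketch of the Buffer-U eviction step is given below (our own function name and data layout; the smallest-\(|\alpha|\) rule appears as a commented alternative).

```python
import numpy as np

def evict_buffer_u(buffer, alpha, labeled, tau):
    """Buffer-U eviction (a sketch; the data layout is our own assumption).

    buffer  -- list of stored samples (support vectors), oldest first
    alpha   -- array of coefficients alpha_i^t aligned with `buffer`
    labeled -- list of booleans: True if the stored sample arrived with a label
    tau     -- maximum buffer size
    """
    if len(buffer) <= tau:
        return buffer, alpha, labeled
    # Evict the oldest unlabeled sample; if only labeled samples remain,
    # evict the oldest labeled one. (The corresponding row/column of K and
    # L must be dropped as well.)
    unlabeled = [i for i, lab in enumerate(labeled) if not lab]
    idx = unlabeled[0] if unlabeled else 0
    # Alternative rule mentioned above: evict the sample whose coefficient
    # has the smallest magnitude:
    #   idx = int(np.argmin(np.abs(alpha)))
    keep = [i for i in range(len(buffer)) if i != idx]
    return ([buffer[i] for i in keep],
            np.asarray(alpha)[keep],
            [labeled[i] for i in keep])
```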

Theory Analysis

In this section, we give a brief theoretical analysis of the proposed algorithms.

Theorem 1

Suppose that \(K(\cdot,\cdot)\) is a chosen kernel function over the training samples and \(\mathcal {H}\) is the corresponding RKHS; then (13) is exactly the solution of the primal problem (11).

Proof

Let

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} c_t^1(f) =& 1-\xi_{t+1}-y_{t+1}f(x_{t+1}), \\ c_t^2(f)=&-\xi_{t+1}. \end{array} \end{array} $$

Apparently, \(c_t^1(f)\) and \(c_t^2(f)\) are continuous. In addition, since the objective function of Eq. 11 is convex, by the Convex Duality Theorem of [34] (see Theorem 14.37 on page 532), the optimal value of the primal problem (11) is equal to that of its Lagrange dual problem. Therefore, according to the above derivation, the result (13) is exactly the solution of the primal problem (11). □

Theorem 1 implies that MOMR is an exact algorithm for the proposed model (11). Next, we analyze the relationship between the proposed MOMR and FMOMR.

Theorem 2

Suppose \(\lambda_2 = 0\); then the solutions of Eqs. 13 and 19 are equivalent.

Proof

Suppose λ 2 = 0, we have

$$\begin{array}{@{}rcl@{}} \begin{array}{llllllll} J^TA^{-1}J &= e^TK(K+\lambda_1K)^{-1}Ke \\ &= \frac{1}{1+\lambda_1}e^TKe = \frac{1}{1+\lambda_1}. \end{array} \end{array} $$
(20)

Therefore, from Eq. 33, we get

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \overline{\gamma}_{t+1} &= \frac{1-y_{t+1}J^TA^{-1}K\tilde{\alpha}^t}{J^TA^{-1}J}\\ &= 1+\lambda_1-y_{t+1}J^T\tilde{\alpha}^t. \end{array} \end{array} $$
(21)

which is equivalent to Eq. 17 if λ 2 = 0.

Similarly, by substituting λ 2 = 0 into (31) and (15) respectively, we have

$$\begin{array}{@{}rcl@{}} \begin{array}{llllll} \hat{\alpha}_{t+1} = {\alpha}_{t+1} = \frac{1}{1+\lambda_1}(\tilde{\alpha}^{t}+ey_{t+1}\gamma_{t+1}). \end{array} \end{array} $$
(22)

Theorem 2 is reasonable because Algorithm 2 is obtained only by approximating the derivative of the manifold regularization term. In addition, since (31) and (15) are continuous with respect to \(\lambda_2\), (15) is an appropriate approximation of Eq. 31 when \(\lambda_2\) is very small.

Experiments

In this section, to verify their effectiveness, we compare the proposed algorithms, MOMR and FMOMR, with two online manifold regularization algorithms and a batch algorithm on three data sets (see the “Handwritten Digit Recognition”, “Face Recognition” and “Action Video Categorization” sections).

In all the experiments, the RBF kernel \(k(x_{i},x_{j})=\exp ({-{\|x_{i}-x_{j}\|^{2}}/(2{{\sigma _{K}^{2}}}}))\) is used for classification. The edge weight is \(w_{ij}=\exp ({-{\|x_{i}-x_{j}\|^{2}}/(2{{\sigma _{W}^{2}}}}))\), which defines a fully connected graph. The label rate of the training samples is 2%.

In our experiments, we focus on online manifold regularization algorithms derived from the dual problem. Therefore, we compare the performance of our algorithms with an online manifold regularization algorithm based on Example-Associate Update (denoted OMR-EA), an online manifold regularization algorithm based on Overall Update (denoted OMR-Overall) [26], and the batch manifold regularization algorithm LapSVM [30]. As suggested in [26], the step sizes of OMR-EA and OMR-Overall are set to a small value, 0.01.

All the evaluations share the same buffering strategy, Buffer-U, but employ different buffer sizes (\(B\in\{50,100,150,200\}\)). The parameter values \(\sigma_K\), \(\sigma_W\), \(\lambda_1\) and \(\lambda_2\) are selected by five-fold cross validation on the first 500 samples of the training data, where \(\sigma_K,\sigma_W\in\{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}\) and \(\lambda_1,\lambda_2\in\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2}\}\). In addition, the value of the parameter \(C\) is set to 1 for the proposed algorithms MOMR and FMOMR. The computational efficiency of all the algorithms is evaluated in terms of their CPU running time (in seconds). All the experiments are implemented in Matlab on a PC with an Intel(R) Core(TM) 3.2 GHz CPU, 4 GB RAM and the Windows 7 operating system.
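The parameter-selection protocol just described can be sketched as a plain grid search; the following is our own illustration (not the authors' code), where `evaluate` is a hypothetical callable that trains the online classifier on the training folds with the given parameters and returns its validation accuracy, and `X`, `y`, `delta` are NumPy arrays ordered by arrival time.

```python
import itertools
import numpy as np

def select_parameters(X, y, delta, evaluate, n_folds=5):
    """Five-fold grid search over the parameter grids used in the experiments."""
    X, y, delta = X[:500], y[:500], delta[:500]        # first 500 arriving samples
    sigmas = [2.0 ** p for p in range(-3, 4)]          # grid for sigma_K, sigma_W
    lams = [10.0 ** p for p in range(-5, 3)]           # grid for lambda_1, lambda_2
    folds = np.array_split(np.arange(len(X)), n_folds)

    best, best_acc = None, -np.inf
    for sK, sW, l1, l2 in itertools.product(sigmas, sigmas, lams, lams):
        accs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.hstack([folds[j] for j in range(n_folds) if j != k])
            accs.append(evaluate(X[trn], y[trn], delta[trn], X[val], y[val],
                                 sigma_K=sK, sigma_W=sW, lam1=l1, lam2=l2))
        if np.mean(accs) > best_acc:
            best_acc = np.mean(accs)
            best = dict(sigma_K=sK, sigma_W=sW, lam1=l1, lam2=l2)
    return best
```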

All four online algorithms are performed in the same way, which can be divided into two steps: (1) online processing: training a classifier with each newly arrived sample using the online algorithm; (2) test: testing the final model on a test set. In contrast, the batch algorithm LapSVM is trained with all the visible samples in each learning round. We repeat all the experiments ten times (each with an independent random permutation of the training samples), and the results presented below are averaged over the ten trials.

Handwritten Digit Recognition

In this section, we perform an evaluation experiment on the MNIST data set [35]. We focus on the binary classification task of separating “6” from “8” (MNIST6VS8). The sizes of the training set and test set are 11,769 and 1,932, respectively. Some images of the MNIST6VS8 data set are presented in Fig. 2.

Fig. 2 Some images of the MNIST6VS8 data set. The top two rows are images of “6” while the bottom two rows are images of “8”

The test accuracies are summarized in Table 1. From the results, the test accuracies of MOMR and FMOMR are comparable to those of LapSVM and higher than those of OMR-EA and OMR-Overall. This is reasonable because MOMR exactly solves the proposed model, whereas OMR-EA and OMR-Overall [26] are obtained using stochastic gradient methods. Moreover, the performance of the fast algorithm FMOMR is very similar to that of MOMR, which implies that FMOMR is a proper approximation of the proposed model.

Table 1 Test accuracies (%) of MOMR, FMOMR, OMR-EA, OMR-Overall, and LapSVM on the MNIST6VS8 data set with different buffer sizes

The online updating time is presented in Fig. 3. With respect to the updating time, we can see that (a) MOMR is comparable to the other three online algorithms when the buffer size is small, and (b) FMOMR is comparable to the online algorithms OMR-EA and OMR-Overall and much faster than the off-line algorithm LapSVM. This is reasonable because each sample is trained only once by the online algorithms, and the buffering strategy avoids repeated training.

Fig. 3 Cumulative running time of online updating the classifiers with different buffer sizes on the MNIST6VS8 data set

Face Recognition

This experiment is performed on the FACEMIT data set [36], which contains 361-dimensional images of faces and non-faces. A balanced subset of size 5000 is randomly sampled from FACEMIT and divided into a training set and a test set of equal size. Some images of the FACEMIT data set are presented in Fig. 4.

Fig. 4 Some images of the FACEMIT data set. The top four rows are images of faces while the bottom four rows are images of non-faces

The test accuracies are summarized in Table 2. We can make the following comments: (a) the test accuracies of MOMR and FMOMR are higher than those of the other algorithms; (b) FMOMR surpasses the other algorithms in test accuracy, which further demonstrates that the proposed fast approximate algorithm FMOMR is reasonable and efficient.

Table 2 On the FACEMIT, test accuracies (%) of MOMR, FMOMR, OMR-EA, OMR-Overall, and LapSVM with different buffer sizes

The online updating times of the five algorithms are presented in Fig. 5. It can be seen that FMOMR is the fastest of the five algorithms. Additionally, the difference between MOMR and FMOMR increases as the buffer size grows. This can be explained by the fact that the computational complexities of MOMR and FMOMR are \(O(B^3)\) and \(O(B^2)\), respectively. Note that in Fig. 5 the curves are plotted on a single logarithmic axis. Therefore, MOMR consumes more time than FMOMR as \(B\) increases. However, when \(B\) is small, both MOMR and FMOMR run very fast, so the difference in cumulative running time between MOMR and FMOMR is insignificant.

Fig. 5 Cumulative running time of online updating the classifiers with different buffer sizes on the FACEMIT data set

Action Video Categorization

Further, we evaluate our methods on a kind of multi-manifold data, action videos. A video is made up of many static frames that are coherent in content and space, especially in action videos. We adopt the UCF YouTube dataset [37], which consists of 1168 video sequences captured under uncontrolled conditions. It is a challenging dataset owing to tremendous variations in camera motion, object pose, cluttered background, viewpoint, illumination, etc. We select two action categories, biking and diving (some frames are shown in Fig. 6), both of which have good continuity. We utilize dense trajectories [38] to describe the actions in the videos; this method extracts essential features representing actions and is robust to fast irregular motions and shot boundaries. Then, 10,000 frames are sampled from each of these two action categories (so the total number of frames is 20,000). They are divided into a training set and a test set with a 1:1 proportion for our experiment. The task is to classify these frames into the two action categories.

Fig. 6 Some frames from the videos of biking and diving. The top two rows are frames of biking while the bottom two rows are frames of diving

The test accuracies are summarized in Table 3. We can make the following comments: (a) the test accuracies of MOMR and FMOMR are comparable to that of the off-line algorithm LapSVM and higher than those of the two online algorithms OMR-EA and OMR-Overall; (b) the proposed algorithms MOMR and FMOMR achieve about 2% improvement over the other two online algorithms when the buffer size is small; and (c) all the online algorithms improve significantly as the buffer size increases.

Table 3 Test accuracies (%) of MOMR, FMOMR, OMR-EA, OMR-Overall, and LapSVM on a subset of the UCF YouTube dataset [37] with different buffer sizes

The online updating times of the five algorithms are presented in Fig. 7. With respect to the running time, it can be seen that (a) FMOMR is faster than the other compared online algorithms, and (b) all the compared online algorithms are much faster than the off-line algorithm LapSVM.

Fig. 7 Cumulative running time of online updating the classifiers with different buffer sizes on the UCF YouTube dataset

Considering the above results, it can be inferred that the proposed algorithms rank first among the five algorithms in terms of both test accuracy and running time. The proposed fast algorithm FMOMR performs best among the compared online algorithms on the three data sets because (a) FMOMR is the fastest of the compared algorithms, and (b) in terms of generalization performance, FMOMR is better than OMR-EA and OMR-Overall and comparable to the batch algorithm LapSVM.

Additionally, the test accuracy is higher with a larger buffer, but the time cost also increases with the buffer size. In practice, the buffer size can be used to trade off the accuracy and the time cost of online classifiers. An appropriate buffer size can be chosen by cross validation on the first N arriving samples, where N is a predefined number.

Conclusion

Based on the manifold-regularized online model, we derive an analytical solution of the constrained optimization problem by exploiting the Lagrange dual problem. The proposed idea yields two new algorithms for the online semi-supervised learning problem. Experimental results verify the effectiveness and validity of the proposed algorithms.

In fact, the proposed algorithms can solve not only semi-supervised learning problems but also online supervised learning problems (this can be done in the algorithm MOMR by deleting the manifold regularization term from the objective function of (11), that is, by setting \(\lambda_2\) to 0). In future work, we will extend the proposed algorithms to solve some specific online learning problems.