1 Introduction

Support vector machines (SVMs) are a state-of-the-art tool for pattern classification and regression problems [1–3], originating from the idea of structural risk minimization in statistical learning theory. With the aid of the kernel trick, SVMs can learn a nonlinear decision function that is linear in a potentially high-dimensional feature space [4]. In practice, SVMs have been applied to a variety of domains such as object detection, text categorization, bioinformatics and image classification.

In order to reduce the computational cost of SVMs, proximal support vector machines (PSVMs) [5] have been proposed. Compared with SVMs, which solve a convex optimization problem, PSVMs only solve a linear system of equations with time complexity \(O(d^{3})\), where \(d\) is the dimension of the examples. In essence, PSVMs classify the examples by a hyperplane while still guaranteeing the maximum margin. Mangasarian and Wild [6] proposed generalized eigenvalue proximal SVMs (GEPSVMs), an extension of PSVMs for binary classification. Instead of finding a single hyperplane as in PSVMs, GEPSVMs find two nonparallel hyperplanes such that each hyperplane is as close as possible to examples from one class and as far as possible from examples from the other class. The two hyperplanes are obtained from the eigenvectors corresponding to the smallest eigenvalues of two related generalized eigenvalue problems. Jayadeva et al. [7] proposed another nonparallel hyperplane classifier called twin SVMs (TSVMs), which generate two nonparallel hyperplanes such that each hyperplane is closer to one class and at a certain distance from the other class. The formulation of TSVMs differs from that of GEPSVMs and is similar to that of SVMs. TSVMs solve a pair of quadratic programming problems (QPPs), whereas SVMs solve a single QPP. This strategy of solving two smaller-sized QPPs rather than one large QPP makes TSVMs faster than standard SVMs [7]. Experimental results [8] show that the nonparallel hyperplane classifiers given by TSVMs can indeed improve on the performance of conventional SVMs [9–14].

In many machine learning tasks [15–18], labeled examples are often difficult and expensive to obtain, while unlabeled examples may be relatively easy to collect. Semi-supervised learning, which deals with this situation, has attracted a great deal of attention in the last decade. If the unlabeled data are properly used, it can outperform the counterpart supervised learning approaches. Several extensions of SVMs and TSVMs from supervised to semi-supervised learning have been proposed, e.g., transductive SVMs, semi-supervised support vector machines, Laplacian support vector machines (LapSVMs) and Laplacian twin support vector machines (LapTSVMs) [19–24]. LapTSVMs [24] are a successful combination of semi-supervised learning and TSVMs, providing a generalized framework of twin support vector machines for learning from labeled and unlabeled data. By choosing appropriate parameters, LapTSVMs can degenerate to TSVMs [25, 26]. Experimental results show that LapTSVMs are superior to LapSVMs and TSVMs in classification accuracy while requiring less training time.

In many real-world applications, multi-modal data are very common because of the use of different measuring methods (e.g., infrared and visual cameras) or different media (e.g., text, video and audio) [27]. For example, a web page can be represented by one vector for the words in the page text and another vector for the words in the anchor text of a hyperlink. In content-based web-image retrieval, an image can be simultaneously described by visual features and the text surrounding the image. Multi-view learning (MVL) is an emerging direction which aims to improve classifiers by leveraging the complementarity and consistency among distinct views [28–30]. The theories on MVL can be classified into four categories: canonical correlation analysis, the effectiveness of co-training, generalization error analysis for co-training, and generalization error analysis for other MVL approaches [27].

SVM-2K is a successful combination of MVL and SVMs which couples the maximum margin and multi-view regularization principles to leverage two views for improved classification performance [31]. Farquhar et al. [31] provided a theoretical analysis illuminating the effectiveness of SVM-2K, showing a significant reduction in the Rademacher complexity of the corresponding function class. Sun and Shawe-Taylor characterized the generalization error of multi-view sparse SVMs [32] and multi-view LapSVMs (MvLapSVMs) [33] in terms of the margin bound and derived the empirical Rademacher complexity of the considered function classes [34]. MvLapSVMs integrate three regularization terms, respectively on the function norm, the manifold and multi-view regularization, into the objective function. However, although LapTSVMs are superior to LapSVMs, no multi-view extension of LapTSVMs exists so far. In this paper, we extend LapTSVMs to a new framework named multi-view Laplacian twin support vector machines (MvLapTSVMs), which combines two views by introducing a constraint of similarity between the two one-dimensional projections identifying two distinct TSVMs from the two feature spaces. Compared with MvLapSVMs, there are two main differences. First, although LapSVMs and LapTSVMs both use a manifold regularization term for semi-supervised learning, they differ in principle; MvLapSVMs are based on LapSVMs while MvLapTSVMs are based on LapTSVMs. Second, MvLapTSVMs combine the two views in the constraints rather than in the objective function. Experimental results validate that our proposed methods are effective.

The remainder of this paper proceeds as follows. Section 2 briefly reviews related work including SVMs, TSVMs, LapSVMs, LapTSVMs and SVM-2K. Section 3 introduces our proposed linear MvLapTSVMs and kernel MvLapTSVMs. After reporting experimental results in Section 4, we give conclusions in Section 5.

2 Related work

In this section, we briefly review SVMs, TSVMs, LapSVMs, LapTSVMs and SVM-2K. They constitute the foundation of our subsequent proposed methods.

2.1 SVMs and TSVMs

Suppose there are \(l\) examples represented by a matrix \(A\) whose ith row \(A_{i}\) (\(i=1,2,\cdots,l\)) is the ith example. Let \(y_{i}\in\{1,-1\}\) denote the class to which the ith example belongs. For simplicity, here we only review the linearly separable case [1]. Then, we need to determine \(w\in R^{d}\) and \(b\in R\) such that

$$ y_{i}(A_{i}w+b)\geq 1. $$
(1)

The hyperplane described by \(w^{\top}x+b=0\) lies midway between the bounding hyperplanes given by \(w^{\top}x+b=1\) and \(w^{\top}x+b=-1\). The margin of separation between the two classes is given by \(\frac{2}{\|w\|}\), where \(\|w\|\) denotes the \(\ell_{2}\) norm of \(w\). Support vectors are those training examples lying on the above two bounding hyperplanes. The standard SVMs [1] are obtained by solving the following problem

$$\begin{array}{@{}rcl@{}} &&\min\limits_{w,b} \,\,\,\frac{1}{2}w^{\top}w \\ &&\text{s.t.}\,\,\,\,\,\forall i:y_{i}(A_{i}w+b)\geq 1. \end{array} $$
(2)

The decision function is

$$ f(x)=\text{sign}(w^{\top}x+b). $$
(3)
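
For concreteness, problem (2) can be handed to a generic quadratic programming solver. The following minimal sketch assumes NumPy and the cvxopt QP solver; the helper name and the toy data are ours, not part of the original formulation.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def train_linear_svm(A, y):
    """Hard-margin linear SVM, primal form of problem (2).
    A: (l, d) data matrix, y: (l,) labels in {+1, -1}. Returns (w, b).
    A sketch intended for small, linearly separable data only."""
    l, d = A.shape
    # Variable z = [w; b]; minimize (1/2) w^T w.
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)
    P[d, d] = 1e-8                       # tiny ridge on b for numerical stability
    q = np.zeros(d + 1)
    # y_i (A_i w + b) >= 1  <=>  -y_i (A_i w + b) <= -1.
    G = -np.hstack([y[:, None] * A, y[:, None]])
    h = -np.ones(l)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]

# Toy usage: two separable clusters.
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2, 0.3, (20, 2)), rng.normal(-2, 0.3, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
w, b = train_linear_svm(A, y)
pred = np.sign(A @ w + b)                # decision function (3)
```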

Then we introduce TSVMs [7]. Suppose the examples belonging to classes 1 and −1 are represented by matrices \(A_{+}\) and \(B_{-}\), whose sizes are \((l_{1}\times d)\) and \((l_{2}\times d)\), respectively. We define two matrices \(A\), \(B\) and four vectors \(v_{1}\), \(v_{2}\), \(e_{1}\), \(e_{2}\), where \(e_{1}\) and \(e_{2}\) are vectors of ones of appropriate dimensions and

$$\begin{array}{@{}rcl@{}} A&=&(A_{+} , e_{1}), \,\,B=(B_{-},e_{2}),\\ &&v_{1}= \left( \begin{array}{c} w_{1}\\ b_{1} \end{array}\right), \,\,v_{2}= \left( \begin{array}{c} w_{2}\\ b_{2} \end{array}\right). \end{array} $$
(4)

TSVMs obtain two nonparallel hyperplanes

$$ w_{1}^{\top}x+b_{1}=0\,\,\,\,\,\text{and}\,\,\,\,w_{2}^{\top}x+b_{2}=0 $$
(5)

around which the examples of the corresponding class get clustered. The classifier is given by solving the following QPPs separately (TSVM1)

$$\begin{array}{@{}rcl@{}} &&\min\limits_{v_{1},q_{1}}\,\,\,\frac{1}{2}(Av_{1})^{\top}(Av_{1})+c_{1}e_{2}^{\top}q_{1}\\ &&\text{s.t.} \,\,\,\,-Bv_{1}+q_{1}\succeq e_{2},\,\,q_{1}\succeq 0, \end{array} $$
(6)

(TSVM2)

$$\begin{array}{@{}rcl@{}} &&\min\limits_{v_{2},q_{2}}\,\,\,\frac{1}{2}(Bv_{2})^{\top}(Bv_{2})+c_{2}e_{1}^{\top}q_{2}\\ &&\text{s.t.} \,\,\,\, Av_{2}+q_{2}\succeq e_{1},\,\,q_{2}\succeq 0, \end{array} $$
(7)

where \(c_{1}\), \(c_{2}\) are nonnegative parameters and \(q_{1}\), \(q_{2}\) are slack vectors of appropriate dimensions. The label of a new example \(x\) is determined by the minimum of \(|x^{\top}w_{r}+b_{r}|\) (\(r=1,2\)), which are the perpendicular distances of \(x\) to the two hyperplanes given in (5).
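
As an illustration, the pair of QPPs (6) and (7) can also be passed to a generic QP solver. The sketch below is ours (cvxopt assumed); it stacks each primal problem over the variable \([v; q]\) and classifies a new example by the rule just described.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def solve_tsvm_qpp(M_obj, S, c, ridge=1e-8):
    """Solve  min_{v,q} 1/2 ||M_obj v||^2 + c 1^T q  s.t.  S v + q >= 1, q >= 0.
    TSVM1 (6) uses M_obj = A, S = -B; TSVM2 (7) uses M_obj = B, S = A."""
    n, m = M_obj.shape[1], S.shape[0]
    P = np.zeros((n + m, n + m))
    P[:n, :n] = M_obj.T @ M_obj + ridge * np.eye(n)   # ridge only for stability
    q = np.hstack([np.zeros(n), c * np.ones(m)])
    G = np.block([[-S, -np.eye(m)],                   # -(S v + q) <= -1
                  [np.zeros((m, n)), -np.eye(m)]])    # -q <= 0
    h = np.hstack([-np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    return np.array(sol['x']).ravel()[:n]             # v = (w, b)

def train_tsvm(A_plus, B_minus, c1, c2):
    A = np.hstack([A_plus, np.ones((len(A_plus), 1))])    # A = (A_+, e_1)
    B = np.hstack([B_minus, np.ones((len(B_minus), 1))])  # B = (B_-, e_2)
    v1 = solve_tsvm_qpp(A, -B, c1)   # hyperplane clustered around class +1
    v2 = solve_tsvm_qpp(B, A, c2)    # hyperplane clustered around class -1
    return v1, v2

def predict_tsvm(x, v1, v2):
    x_aug = np.append(x, 1.0)
    return 1 if abs(x_aug @ v1) <= abs(x_aug @ v2) else -1
```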

2.2 LapSVMs

LapSVMs combine manifold regularization and SVMs [22]. Suppose \(x_{1},\cdots,x_{l+u}\in R^{d}\) represent a set of examples including \(l\) labeled examples and \(u\) unlabeled examples. A matrix \(W\in R^{(l+u)\times(l+u)}\) represents the similarity of every pair of examples

$$ W_{ij}=exp\left(-\parallel x_{i}-x_{j}\parallel^{2}/2\sigma^{2}\right), $$
(8)

where σ is a scale parameter. The manifold regularization can be written as

$$\begin{array}{@{}rcl@{}} Reg(f)&=&\frac{1}{2}\sum\limits_{i,j=1}^{l+u}W_{ij}(f(x_{i})-f(x_{j}))^{2}\\ &=&\sum\limits_{i=1}^{l+u}\left(\sum\limits_{j=1}^{l+u}W_{ij}\right)f^{2}(x_{i})-\sum\limits_{i,j=1}^{l+u}W_{ij}f(x_{i})f(x_{j})\\ &=&\mathbf{f}^{\top}(V-W)\mathbf{f}=\mathbf{f}^{\top}L\mathbf{f}, \end{array} $$
(9)

where the function \(f: R^{d}\rightarrow R\) and \(\mathbf{f}=[f(x_{1}),\cdots,f(x_{l+u})]^{\top}\). The matrix \(V\) is diagonal with the ith diagonal entry \(V_{ii}={\sum}_{j=1}^{l+u}W_{ij}\). The matrix \(L=V-W\) is the graph Laplacian of \(W\). LapSVMs have the following optimization problem

$$\begin{array}{@{}rcl@{}} \min\limits_{f\in \mathcal {H}}&\frac{1}{l}&\!\!\sum\limits_{i=1}^{l}(1-y_{i}f(x_{i}))_{+}+\gamma_{A}\|f\|^{2}\\ &+&\frac{\gamma_{I}}{(u+l)^{2}}\sum\limits_{i,j=1}^{l+u}W_{ij}(f(x_{i})-f(x_{j}))^{2}, \end{array} $$
(10)

where \(\mathcal{H}\) is the RKHS induced by a kernel, and \(\gamma_{A}\) and \(\gamma_{I}\) are the ambient and intrinsic regularization coefficients, respectively.
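
The graph Laplacian used in (9) and (10), and again in the methods below, can be computed directly from (8). The short sketch below (the function name is ours) also numerically checks the identity \(\mathbf{f}^{\top}L\mathbf{f}=\frac{1}{2}\sum_{i,j}W_{ij}(f(x_{i})-f(x_{j}))^{2}\) of (9).

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Gaussian similarity W of (8) and graph Laplacian L = V - W of (9).
    X: (l+u, d) matrix stacking the labeled and unlabeled examples."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    V = np.diag(W.sum(axis=1))
    return W, V - W

# Sanity check of identity (9): f^T L f = 1/2 * sum_ij W_ij (f_i - f_j)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
f = rng.normal(size=10)
W, L = graph_laplacian(X, sigma=0.8)
assert np.isclose(f @ L @ f, 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2))
```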

2.3 LapTSVMs

LapTSVMs [24] extend TSVMs from supervised learning to semi-supervised learning, keeping the square loss and hinge loss functions of TSVMs and adding a manifold regularization term in the same sense as LapSVMs. The optimization problems of LapTSVMs can be written as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{w_{1},b_{1},\xi}\frac{1}{2}\|A_{+}w_{1}+e_{1}b_{1}\|^{2}+c_{1}e_{2}^{\top}\xi+\frac{1}{2}c_{2}\left(\|w_{1}\|^{2}+{b_{1}^{2}}\right)\\ &&{\kern2.8pc}+\frac{1}{2}\,c_{3}\left(w_{1}^{\top}M^{\top}+e^{\top}b_{1}\right)L(Mw_{1}+eb_{1})\\ &&\text{s.t.}-(B_{-}w_{1}+e_{2}b_{1})+\xi\succeq e_{2},\,\xi\succeq 0, \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &&\min\limits_{w_{2},b_{2},\eta}\frac{1}{2}\|B_{-}w_{2}+e_{2}b_{2}\|^{2}+c_{1}e_{1}^{\top}\eta+\frac{1}{2}c_{2}\left(\|w_{2}\|^{2}+{b_{2}^{2}}\right)\\ &&{\kern2.8pc}+\frac{1}{2}\,c_{3}\left(w_{2}^{\top}M^{\top}+e^{\top}b_{2}\right)L(Mw_{2}+eb_{2})\\ &&\text{s.t.}-(A_{+}w_{2}+e_{1}b_{2})+\eta\succeq e_{1},\,\eta\succeq 0, \end{array} $$
(12)

where \(M\) contains all the labeled and unlabeled data, \(L\) is the graph Laplacian, \(e_{1}\), \(e_{2}\) and \(e\) are vectors of ones of appropriate dimensions, \(w_{1}\), \(b_{1}\), \(w_{2}\), \(b_{2}\) are classifier parameters, \(c_{1}\), \(c_{2}\) and \(c_{3}\) are nonnegative parameters, and \(\xi\) and \(\eta\) are slack vectors of appropriate dimensions. The dual problems of (11) and (12) can respectively be written as

$$\begin{array}{@{}rcl@{}} &&\max\limits_{\alpha}e_{2}^{\top}\alpha-\frac{1}{2}\alpha^{\top}G\left(H^{\top}H+c_{2}I+c_{3}J^{\top}LJ\right)^{-1}G^{\top}\alpha\\ &&\text{s.t.}\,\,\,0\preceq \alpha \preceq c_{1}e_{2}, \end{array} $$
(13)
$$\begin{array}{@{}rcl@{}} &&\max_{\beta}e_{1}^{\top}\beta-\frac{1}{2}\beta^{\top}H\left(G^{\top}G+c_{2}I+c_{3}J^{\top}LJ\right)^{-1}H^{\top}\beta\\ &&\text{s.t.}\,\,\,0\preceq \beta \preceq c_{1}e_{1}, \end{array} $$
(14)

where

$$\begin{array}{@{}rcl@{}} v_{1}&=&\left( \begin{array}{c} w_{1}\\ b_{1} \end{array}\right), \,\,v_{2}=\left( \begin{array}{c} w_{2}\\ b_{2} \end{array}\right),\\ H&=&(A_{+},\,e_{1}),\,J=(M,\,e),\,G=(B_{-},\,e_{2}). \end{array} $$
(15)

\(\alpha\) and \(\beta\) are vectors of nonnegative Lagrange multipliers and \(I\) is an identity matrix of appropriate dimensions. Then \(v_{1}\) and \(v_{2}\) can be obtained as

$$\begin{array}{@{}rcl@{}} &v_{1}=-\left(H^{\top}H+c_{2}I+c_{3}J^{\top}LJ\right)^{-1}G^{\top}\alpha, \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} &v_{2}=-\left(G^{\top}G+c_{2}I+c_{3}J^{\top}LJ\right)^{-1}H^{\top}\beta. \end{array} $$
(17)

According to matrix theory, it can be easily proved that \(H^{\top}H+c_{2}I+c_{3}J^{\top}LJ\) is a positive definite matrix. LapTSVMs obtain two nonparallel hyperplanes

$$ w_{1}^{\top}x+b_{1}=0\,\,\,\,\,\text{and}\,\,\,\,w_{2}^{\top}x+b_{2}=0. $$
(18)

The label of a new example \(x\) is determined by the minimum of \(|x^{\top}w_{r}+b_{r}|\) (\(r=1,2\)), which are the perpendicular distances of \(x\) to the two hyperplanes given in (18).
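
Putting (13)–(17) together, LapTSVM training reduces to two box-constrained QPs followed by two linear solves. The sketch below is our own illustration of these steps (cvxopt assumed for the QPs), not the authors' implementation.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def solve_box_qp(Q, upper):
    """min_a 1/2 a^T Q a - 1^T a  s.t.  0 <= a <= upper  (duals (13)/(14))."""
    Q = 0.5 * (Q + Q.T)                            # symmetrize for the solver
    m = Q.shape[0]
    G = np.vstack([np.eye(m), -np.eye(m)])
    h = np.hstack([upper * np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(Q), matrix(-np.ones(m)), matrix(G), matrix(h))
    return np.array(sol['x']).ravel()

def train_laptsvm(A_plus, B_minus, M, L, c1, c2, c3):
    ones = lambda n: np.ones((n, 1))
    H = np.hstack([A_plus, ones(len(A_plus))])     # H = (A_+, e_1)
    G_ = np.hstack([B_minus, ones(len(B_minus))])  # G = (B_-, e_2)
    J = np.hstack([M, ones(len(M))])               # J = (M, e)
    I = np.eye(H.shape[1])
    S1 = H.T @ H + c2 * I + c3 * J.T @ L @ J       # positive definite
    S2 = G_.T @ G_ + c2 * I + c3 * J.T @ L @ J
    alpha = solve_box_qp(G_ @ np.linalg.solve(S1, G_.T), c1)   # dual (13)
    beta = solve_box_qp(H @ np.linalg.solve(S2, H.T), c1)      # dual (14)
    v1 = -np.linalg.solve(S1, G_.T @ alpha)        # eq. (16)
    v2 = -np.linalg.solve(S2, H.T @ beta)          # eq. (17)
    return v1, v2

def predict_laptsvm(x, v1, v2):
    x_aug = np.append(x, 1.0)
    return 1 if abs(x_aug @ v1) <= abs(x_aug @ v2) else -1
```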

2.4 SVM-2K

Suppose that we are given two views of the same data: view 1 is represented by a feature projection \(\phi_{A}\) with the corresponding kernel function \(k_{A}\), and view 2 is represented by a feature projection \(\phi_{B}\) with the corresponding kernel function \(k_{B}\). The two-view data are then given by the set \(S=\{(\phi_{A}(x_{1}),\phi_{B}(x_{1})),\cdots,(\phi_{A}(x_{n}),\phi_{B}(x_{n}))\}\). SVM-2K [31] combines the two views by introducing a constraint of similarity between two one-dimensional projections identifying two distinct SVMs from the two feature spaces:

$$ |\langle w_{A},\phi_{A}(x_{i})\rangle+b_{A}-\langle w_{B},\phi_{B}(x_{i})\rangle-b_{B}|\leq \eta_{i}+\epsilon $$
(19)

where \(w_{A}\), \(b_{A}\) (\(w_{B}\), \(b_{B}\)) are the weight vector and threshold of the first (second) SVM. The SVM-2K method solves the following optimization problem for the classifier parameters \(w_{A}\), \(b_{A}\), \(w_{B}\), \(b_{B}\)

$$\begin{array}{@{}rcl@{}} &&\min\limits_{w_{A},w_{B},q_{1i},q_{2i},\eta_{i}} \frac{1}{2}\|w_{A}\|^{2}+\frac{1}{2}\|w_{B}\|^{2}+c_{1}\sum\limits_{i=1}^{n}q_{1i}\\ &&{\kern5.1pc}+c_{2}\sum\limits_{i=1}^{n}q_{2i}+D\sum\limits_{i=1}^{n}\eta_{i}\\ &&\text{s.t.}{\kern.5pc}|\langle w_{A},\phi_{A}(x_{i})\rangle+b_{A}-\langle w_{B},\phi_{B}(x_{i})\rangle-b_{B}|\leq \eta_{i}+\epsilon, {} \\ &&{\kern1.5pc}y_{i}(\langle w_{A},\phi_{A}(x_{i})\rangle+b_{A})\geq 1-q_{1i},\\ &&{\kern1.5pc}y_{i}(\langle w_{B},\phi_{B}(x_{i})\rangle+b_{B})\geq 1-q_{2i},\\ &&{\kern1.5pc}q_{1i}\geq 0,q_{2i}\geq 0,\eta_{i}\geq 0,\text{all for}\,\, 1\leq i\leq n, \end{array} $$
(20)

where \(D\), \(c_{1}\), \(c_{2}\), \(\epsilon\) are nonnegative parameters and \(q_{1i}\), \(q_{2i}\), \(\eta_{i}\) are slack variables. Let \(\hat {w}_{A}\), \(\hat {w}_{B}\), \(\hat {b}_{A}\), \(\hat {b}_{B}\) be the solution to this optimization problem. The final SVM-2K decision function is \(f(x)=\frac {1}{2}(\langle \hat {w}_{A},\phi _{A}(x)\rangle +\hat {b}_{A}+\langle \hat {w}_{B},\phi _{B}(x)\rangle +\hat {b}_{B})\). The dual formulation of the above optimization problem can be written as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{{\xi_{i}^{A}},{\xi_{j}^{A}},{\xi_{i}^{B}},{\xi_{j}^{B}},{\alpha_{i}^{A}},{\alpha_{i}^{B}}}\frac{1}{2}\sum\limits_{i,j=1}^{n}\left({\xi_{i}^{A}}{\xi_{j}^{A}}k_{A}(x_{i},x_{j})\right.\\ &&{\kern6pc}\left.+{\xi_{i}^{B}}{\xi_{j}^{B}}k_{B}(x_{i},x_{j})\right)-\sum\limits_{i=1}^{n}\left({\alpha_{i}^{A}}+{\alpha_{i}^{B}}\right)\\ &&\text{s.t.}\kern1pc{\xi_{i}^{A}}={\alpha_{i}^{A}}y_{i}-\beta_{i}^{+}+\beta_{i}^{-},\\ &&\kern2pc{\xi_{i}^{B}}={\alpha_{i}^{B}}y_{i}+\beta_{i}^{+}-\beta_{i}^{-},\\ &&{\kern2pc}\sum\limits_{i=1}^{n}{\xi_{i}^{A}}=\sum\limits_{i=1}^{n}{\xi_{i}^{B}}=0,\\ &&{\kern2pc}0\leq \beta_{i}^{+},\beta_{i}^{-},\beta_{i}^{+}+\beta_{i}^{-} \leq D,\\ &&{\kern2pc}0\leq \alpha_{i}^{A/B}\leq c_{1/2}, \end{array} $$
(21)

where \({\alpha _{i}^{A}}\), \({\alpha _{i}^{B}}\), \(\beta _{i}^{+}\), \(\beta _{i}^{-}\) are nonnegative Lagrange multipliers and we have taken \(\epsilon=0\). The prediction function for each view is given by

$$ f_{A/B}(x)=\sum\limits_{i=1}^{n}\xi_{i}^{A/B}k_{A/B}(x_{i},x)+b_{A/B}. $$
(22)
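
Given a solution of (21), prediction in SVM-2K only needs the per-view functions (22) and their average. A minimal sketch follows; the function names are ours, and the dual weights and biases are assumed to be already available from a solver.

```python
import numpy as np

def svm2k_view_output(x, X_train, xi, b, kernel):
    """Per-view output f_{A/B}(x) of eq. (22); xi and b come from solving (21)."""
    k_row = np.array([kernel(x_i, x) for x_i in X_train])
    return float(xi @ k_row) + b

def svm2k_predict(xA, xB, XA, XB, xiA, xiB, bA, bB, kA, kB):
    """Final SVM-2K decision: sign of the average of the two view outputs."""
    fA = svm2k_view_output(xA, XA, xiA, bA, kA)
    fB = svm2k_view_output(xB, XB, xiB, bB, kB)
    return int(np.sign(0.5 * (fA + fB)))

# Example kernel choice (ours): a linear kernel on each view.
linear_kernel = lambda u, v: float(u @ v)
```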

3 Our proposed methods

3.1 Linear MvLapTSVMs

In this subsection, we extend LapTSVMs to multi-view learning. On view 1, positive examples are represented by \(A_{1}^{\prime }\) and negative examples by \(B_{1}^{\prime }\); on view 2, positive examples are represented by \(A_{2}^{\prime }\) and negative examples by \(B_{2}^{\prime }\). The optimization problems of linear MvLapTSVMs can be written as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{w_{1},w_{2},b_{1},b_{2},q_{1},q_{2},\eta}\frac{1}{2}\|A_{1}^{\prime}w_{1}+e_{1}b_{1}\|^{2}+\frac{1}{2}\|A_{2}^{\prime}w_{2}+e_{1}b_{2}\|^{2}\\ &&{\kern5.5pc}+c_{1}e_{2}^{\top}q_{1}+c_{2}e_{2}^{\top}q_{2}\\ &&{\kern5.5pc}+\frac{1}{2}c_{3}\left(\|w_{1}\|^{2}+{b_{1}^{2}}+\|w_{2}\|^{2}+{b_{2}^{2}}\right)\\ &&{\kern5.5pc}+\frac{1}{2}c_{4}\left[(w_{1}^{\top}M_{1}^{'\top}+e^{\top}b_{1})\right.\\ &&{\kern6pc}L_{1}(M_{1}^{\prime}w_{1}+eb_{1})\\ &&{\kern5.5pc}\left.+(w_{2}^{\top}M_{2}^{\prime\top}+e^{\top}b_{2})L_{2}(M_{2}^{\prime}w_{2}+eb_{2})\right]\\ &&{\kern6pc}+De_{1}^{\top}\eta\\ &&{\kern4.5pc}\text{s.t.}{\kern.5pc}|A_{1}^{\prime}w_{1}+e_{1}b_{1}-A_{2}^{\prime}w_{2}-e_{1}b_{2}|\preceq \eta,\\ &&{\kern5.7pc}-B_{1}^{\prime}w_{1}-e_{2}b_{1}+q_{1}\succeq e_{2},\\ &&{\kern5.7pc}-B_{2}^{\prime}w_{2}-e_{2}b_{2}+q_{2}\succeq e_{2},\\ &&{\kern6pc}q_{1}\succeq 0,\,q_{2}\succeq 0,\\ &&{\kern6pc}\eta\succeq 0, \end{array} $$
(23)
$$\begin{array}{@{}rcl@{}} &&\min\limits_{w_{3},w_{4},b_{3},b_{4},q_{3},q_{4},\zeta}\frac{1}{2}\|B_{1}^{\prime}w_{3}+e_{2}b_{3}\|^{2}+\frac{1}{2}\|B_{2}^{\prime}w_{4}+e_{2}b_{4}\|^{2}\\ &&{\kern5.5pc}+c_{1}e_{1}^{\top}q_{3}+c_{2}e_{1}^{\top}q_{4}\\ &&{\kern5.5pc}+\frac{1}{2}c_{3}\left(\|w_{3}\|^{2}+{b_{3}^{2}}+\|w_{4}\|^{2}+{b_{4}^{2}}\right)\\ &&{\kern5.5pc}+\frac{1}{2}c_{4}\left[(w_{3}^{\top}M_{1}^{\prime\top}+e^{\top}b_{3})\right.\\ &&{\kern6pc}L_{1}(M_{1}^{\prime}w_{3}+eb_{3})\\ &&{\kern5.5pc}\left.+(w_{4}^{\top}M_{2}^{\prime\top}+e^{\top}b_{4})L_{2}(M_{2}^{\prime}w_{4}+eb_{4})\right]\\ &&{\kern5.5pc}+He_{2}^{\top}\zeta\\ &&\text{s.t.}{\kern.5pc}|B_{1}^{\prime}w_{3}+e_{2}b_{3}-B_{2}^{\prime}w_{4}-e_{2}b_{4}|\preceq \zeta,\\ &&{\kern1.5pc}-A_{1}^{\prime}w_{3}-e_{1}b_{3}+q_{3}\succeq e_{1},\\ &&{\kern1.5pc}-A_{2}^{\prime}w_{4}-e_{1}b_{4}+q_{4}\succeq e_{1},\\ &&{\kern1.9pc}q_{3}\succeq 0,\,q_{4}\succeq 0,\\ &&{\kern1.9pc}\zeta\succeq 0, \end{array} $$
(24)

where \(M_{1}^{\prime }\) contains all the labeled and unlabeled data from view 1 and \(M_{2}^{\prime }\) contains all the labeled and unlabeled data from view 2. \(L_{1}\) and \(L_{2}\) are the graph Laplacians of view 1 and view 2, respectively. \(e_{1}\), \(e_{2}\) and \(e\) are vectors of ones of appropriate dimensions. \(w_{1}\), \(b_{1}\), \(w_{2}\), \(b_{2}\), \(w_{3}\), \(b_{3}\), \(w_{4}\), \(b_{4}\) are classifier parameters, \(c_{1}\), \(c_{2}\), \(c_{3}\), \(c_{4}\), \(D\) and \(H\) are nonnegative parameters, and \(q_{1}\), \(q_{2}\), \(q_{3}\), \(q_{4}\), \(\eta\) and \(\zeta\) are slack vectors of appropriate dimensions.

The Lagrangian of the optimization problem (23) is given by

$$\begin{array}{@{}rcl@{}} L&=&\frac{1}{2}\|A_{1}^{\prime}w_{1}+e_{1}b_{1}\|^{2}+\frac{1}{2}\|A_{2}^{\prime}w_{2}+e_{1}b_{2}\|^{2}+c_{1}e_{2}^{\top}q_{1}\\ &+&c_{2}e_{2}^{\top}q_{2}+\frac{1}{2}c_{3}\left(\|w_{1}\|^{2}+{b_{1}^{2}}+\|w_{2}\|^{2}+{b_{2}^{2}}\right)\\ &+&\frac{1}{2}c_{4}\left[(w_{1}^{\top}M_{1}^{'\top}+e^{\top}b_{1})L_{1}(M_{1}^{\prime}w_{1}+eb_{1})\right.\\ &+&\left.(w_{2}^{\top}M_{2}^{'\top}+e^{\top}b_{2})L_{2}(M_{2}^{\prime}w_{2}+eb_{2})\right]\\ &+&De_{1}^{\top}\eta-\beta_{1}^{\top}(\eta-A_{1}^{\prime}w_{1}-e_{1}b_{1}+A_{2}^{\prime}w_{2}+e_{1}b_{2})\\ &-&\beta_{2}^{\top}(A_{1}^{\prime}w_{1}+e_{1}b_{1}-A_{2}^{\prime}w_{2}-e_{1}b_{2}+\eta)\\ &-&\alpha_{1}^{\top}(-B_{1}^{\prime}w_{1}-e_{2}b_{1}+q_{1}-e_{2})\\ &-&\alpha_{2}^{\top}(-B_{2}^{\prime}w_{2}-e_{2}b_{2}+q_{2}-e_{2})\\ &-&\lambda_{1}^{\top}q_{1}-\lambda_{2}^{\top}q_{2}-\sigma^{\top}\eta, \end{array} $$
(25)

where \(\alpha_{1}\), \(\alpha_{2}\), \(\beta_{1}\), \(\beta_{2}\), \(\lambda_{1}\), \(\lambda_{2}\) and \(\sigma\) are vectors of nonnegative Lagrange multipliers. We take partial derivatives of the above Lagrangian and set them to zero

$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial w_{1}}&=&A_{1}^{'\top}(A_{1}^{\prime}w_{1}+e_{1}b_{1})+c_{3}w_{1}\\ &+&c_{4}M_{1}^{'\top}L_{1}(M_{1}^{\prime}w_{1}+eb_{1})\\ &+&A_{1}^{'\top}\beta_{1}-A_{1}^{'\top}\beta_{2}+B_{1}^{'\top}\alpha_{1}=0, \end{array} $$
$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial b_{1}}&=&e_{1}^{\top}(A_{1}^{\prime}w_{1}+e_{1}b_{1})+c_{3}b_{1}+c_{4}e^{\top}L_{1}(M_{1}^{\prime}w_{1}+eb_{1})\\ &+&e_{1}^{\top}\beta_{1}-e_{1}^{\top}\beta_{2}+e_{2}^{\top}\alpha_{1}=0,\\ \frac{\partial L}{\partial w_{2}}&=&A_{2}^{'\top}(A_{2}^{\prime}w_{2}+e_{1}b_{2})+c_{3}w_{2}\\ &+&c_{4}M_{2}^{'\top}L_{2}(M_{2}^{\prime}w_{2}+eb_{2})\\ &-&A_{2}^{'\top}\beta_{1}+A_{2}^{'\top}\beta_{2}+B_{2}^{'\top}\alpha_{2}=0,\\ \frac{\partial L}{\partial b_{2}}&=&e_{1}^{\top}(A_{2}^{\prime}w_{2}+e_{1}b_{2})+c_{3}b_{2}+c_{4}e^{\top}L_{2}(M_{2}^{\prime}w_{2}+eb_{2})\\ &-&e_{1}^{\top}\beta_{1}+e_{1}^{\top}\beta_{2}+e_{2}^{\top}\alpha_{2}=0, \end{array} $$
(26)
$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial q_{1}}=c_{1}e_{2}-\alpha_{1}-\lambda_{1}=0,\\ &\frac{\partial L}{\partial q_{2}}=c_{2}e_{2}-\alpha_{2}-\lambda_{2}=0,\\ &\frac{\partial L}{\partial \eta}=De_{1}-\beta_{1}-\beta_{2}-\sigma=0. \end{array} $$

We define

$$\begin{array}{@{}rcl@{}} &&A_{1}=\left(A_{1}^{\prime},\,e_{1}\right), \,\,A_{2}=\left(A_{2}^{\prime},\,e_{1}\right),\\ &&B_{1}=\left(B_{1}^{\prime},\, e_{2}\right),\,\,B_{2}=\left(B_{2}^{\prime},\, e_{2}\right),\\ &&J_{1}=\left(M_{1}^{\prime},\, e\right),J_{2}=\left(M_{2}^{\prime},\,e\right),v_{1}=\begin{pmatrix}w_{1}\\b_{1}\end{pmatrix},v_{2}=\begin{pmatrix}w_{2}\\b_{2}\end{pmatrix}. \end{array} $$
(27)

From the above equations, we obtain

$$\begin{array}{@{}rcl@{}} A_{1}^{\top}A_{1}v_{1}&+&c_{3}v_{1}+c_{4}J_{1}^{\top}L_{1}J_{1}v_{1}+A_{1}^{\top}\beta_{1}\\ &-&A_{1}^{\top}\beta_{2}+B_{1}^{\top}\alpha_{1}=0, \end{array} $$
(28)
$$\begin{array}{@{}rcl@{}} A_{2}^{\top}A_{2}v_{2}&+&c_{3}v_{2}+c_{4}J_{2}^{\top}L_{2}J_{2}v_{2}-A_{2}^{\top}\beta_{1}\\ &+&A_{2}^{\top}\beta_{2}+B_{2}^{\top}\alpha_{2}=0. \end{array} $$
(29)

It follows that

$$\begin{array}{@{}rcl@{}} v_{1}&=&\left(A_{1}^{\top}A_{1}+c_{3}I+c_{4}J_{1}^{\top}L_{1}J_{1}\right)^{-1}\\ &&[A_{1}^{\top}(\beta_{2}-\beta_{1})-B_{1}^{\top}\alpha_{1}], \end{array} $$
(30)
$$\begin{array}{@{}rcl@{}} v_{2}&=&\left(A_{2}^{\top}A_{2}+c_{3}I+c_{4}J_{2}^{\top}L_{2}J_{2}\right)^{-1}\\ &&[A_{2}^{\top}(\beta_{1}-\beta_{2})-B_{2}^{\top}\alpha_{2}]. \end{array} $$
(31)

We substitute (30), (31) into (25) and get

$$\begin{array}{@{}rcl@{}} L&=&(\alpha_{1}+\alpha_{2})^{\top}e_{2}-\frac{1}{2}\left[(\beta_{2}-\beta_{1})^{\top}A_{1}-\alpha_{1}^{\top}B_{1}\right]\left(A_{1}^{\top}A_{1}\right.\\ &+&\left.c_{3}I+c_{4}J_{1}^{\top}L_{1}J_{1}\right)^{-1} \left[A_{1}^{\top}(\beta_{2}-\beta_{1})-B_{1}^{\top}\alpha_{1}\right]\\ &-&\frac{1}{2}\left[(\beta_{1}-\beta_{2})^{\top}A_{2}-\alpha_{2}^{\top}B_{2}\right]\left(A_{2}^{\top}A_{2}+c_{3}I\right.\\ &+&\left.c_{4}J_{2}^{\top}L_{2}J_{2}\right)^{-1}\left[A_{2}^{\top}(\beta_{1}-\beta_{2})-B_{2}^{\top}\alpha_{2}\right]. \end{array} $$
(32)

Therefore, the dual optimization formulation is

$$\begin{array}{@{}rcl@{}} &&\min\limits_{\xi_{1},\xi_{2},\alpha_{1},\alpha_{2}}\frac{1}{2}\xi_{1}^{\top}\left(A_{1}^{\top}A_{1}+c_{3}I+c_{4}J_{1}^{\top}L_{1}J_{1}\right)^{-1}\xi_{1}\\ &&{\kern3pc}+\frac{1}{2}\xi_{2}^{\top}\left(A_{2}^{\top}A_{2}+c_{3}I+c_{4}J_{2}^{\top}L_{2}J_{2}\right)^{-1}\xi_{2}\\ &&{\kern3pc}-(\alpha_{1}+\alpha_{2})^{\top}e_{2}\\ &&{\kern.8pc}\text{s.t.}{\kern.5pc}\xi_{1}=A_{1}^{\top}(\beta_{2}-\beta_{1})-B_{1}^{\top}\alpha_{1},\\ &&{\kern2.3pc}\xi_{2}=A_{2}^{\top}(\beta_{1}-\beta_{2})-B_{2}^{\top}\alpha_{2},\\ &&{\kern2.6pc}0\preceq \beta_{1},\beta_{2},\beta_{1}+\beta_{2} \preceq De_{1},\\ &&{\kern2.6pc}0\preceq \alpha_{1/2}\preceq c_{1/2}e_{2}. \end{array} $$
(33)

Applying the same techniques to (24), we obtain its corresponding dual optimization formulation as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{\rho_{1},\rho_{2},\omega_{1},\omega_{2}}\frac{1}{2}\rho_{1}^{\top}\left(B_{1}^{\top}B_{1}+c_{3}I+c_{4}J_{1}^{\top}L_{1}J_{1}\right)^{-1}\rho_{1}\\ &&{\kern3.5pc}+\frac{1}{2}\rho_{2}^{\top}\left(B_{2}^{\top}B_{2}+c_{3}I+c_{4}J_{2}^{\top}L_{2}J_{2}\right)^{-1}\rho_{2}\\ &&{\kern3.5pc}-(\omega_{1}+\omega_{2})^{\top}e_{1}\\ &&{\kern.7pc}\text{s.t.}{\kern1pc}\rho_{1}=B_{1}^{\top}(\gamma_{2}-\gamma_{1})-A_{1}^{\top}\omega_{1},\\ &&{\kern2.7pc}\rho_{2}=B_{2}^{\top}(\gamma_{1}-\gamma_{2})-A_{2}^{\top}\omega_{2},\\ &&{\kern3.1pc}0\preceq \gamma_{1},\gamma_{2},\gamma_{1}+\gamma_{2} \preceq He_{2},\\ &&{\kern3.1pc}0\preceq \omega_{1/2}\preceq c_{1/2}e_{1}, \end{array} $$
(34)

where the augmented vectors \(u_{1}=\begin {pmatrix}w_{3}\\b_{3}\end {pmatrix},\,u_{2}=\begin {pmatrix}w_{4}\\b_{4}\end {pmatrix}\) are given by

$$\begin{array}{@{}rcl@{}} u_{1}&=&\left(B_{1}^{\top}B_{1}+c_{3}I+c_{4}J_{1}^{\top}L_{1}J_{1}\right)^{-1}\\ &&{\kern.5pc}\left[B_{1}^{\top}(\gamma_{2}-\gamma_{1})-A_{1}^{\top}\omega_{1}\right], \end{array} $$
(35)
$$\begin{array}{@{}rcl@{}} u_{2}&=&\left(B_{2}^{\top}B_{2}+c_{3}I+c_{4}J_{2}^{\top}L_{2}J_{2}\right)^{-1}\\ &&{\kern.5pc}\left[B_{2}^{\top}(\gamma_{1}-\gamma_{2})-A_{2}^{\top}\omega_{2}\right]. \end{array} $$
(36)

For an example \(x\) with the two views \(x_{1}^{\prime }\) and \(x_{2}^{\prime }\), let \(x_{1}=(x_{1}^{\prime },1)\) and \(x_{2}=(x_{2}^{\prime },1)\). If \(\frac {1}{2}(|x_{1}^{\top }v_{1}|+|x_{2}^{\top }v_{2}|)\leq \frac {1}{2}(|x_{1}^{\top }u_{1}|+|x_{2}^{\top }u_{2}|)\), the example is classified to class +1, otherwise to class −1.
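
Once the multipliers of (33) and (34) have been obtained from a QP solver, the classifier parameters follow from the linear solves (30)–(31) and (35)–(36), and prediction uses the rule just stated. The sketch below is our own illustration with assumed variable names, not the authors' code.

```python
import numpy as np

def recover_pair(X1, X2, Y1, Y2, J1, J2, L1, L2, c3, c4,
                 mult_a1, mult_a2, mult_b1, mult_b2):
    """Linear solves (30)-(31): X1, X2 play the role of A_1, A_2 and Y1, Y2 of B_1, B_2.
    For (35)-(36), call it again with the roles of the A and B matrices swapped and
    the multipliers (omega, gamma) of problem (34)."""
    I = np.eye(X1.shape[1])
    S1 = X1.T @ X1 + c3 * I + c4 * J1.T @ L1 @ J1
    S2 = X2.T @ X2 + c3 * I + c4 * J2.T @ L2 @ J2
    p1 = np.linalg.solve(S1, X1.T @ (mult_b2 - mult_b1) - Y1.T @ mult_a1)
    p2 = np.linalg.solve(S2, X2.T @ (mult_b1 - mult_b2) - Y2.T @ mult_a2)
    return p1, p2

def predict_linear_mvlaptsvm(x1_prime, x2_prime, v1, v2, u1, u2):
    """Decision rule stated above: x1_prime, x2_prime are the two views of one example."""
    x1 = np.append(x1_prime, 1.0)
    x2 = np.append(x2_prime, 1.0)
    pos = 0.5 * (abs(x1 @ v1) + abs(x2 @ v2))
    neg = 0.5 * (abs(x1 @ u1) + abs(x2 @ u2))
    return 1 if pos <= neg else -1
```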

Now we compare SVM-2K and MvLapTSVMs. SVM-2K is a multi-view supervised learning method built on SVMs, while MvLapTSVMs are multi-view semi-supervised learning methods built on TSVMs. Suppose the number of examples from either class equals \(l/2\). SVM-2K solves a single QPP with computational complexity \(O((2l)^{3})\), while MvLapTSVMs solve a pair of QPPs with computational complexity \(O(2l^{3})\). Regarding hyper-parameter selection, SVM-2K has three hyper-parameters to select, whereas MvLapTSVMs have five. Therefore, MvLapTSVMs are more efficient for multi-view learning in terms of computational complexity.

3.2 Kernel MvLapTSVMs

Now we extend the linear MvLapTSVMs to the nonlinear case. The kernel-induced hyperplanes are:

$$\begin{array}{@{}rcl@{}} &K\{x_{1}^{\top},C_{1}^{\top}\}\lambda_{1}+b_{1}=0, &K\{x_{2}^{\top},C_{2}^{\top}\}\lambda_{2}+b_{2}=0,\\ &K\{x_{1}^{\top},C_{1}^{\top}\}\lambda_{3}+b_{3}=0, &K\{x_{2}^{\top},C_{2}^{\top}\}\lambda_{4}+b_{4}=0, \end{array} $$
(37)

where \(K\) is a chosen kernel function defined by \(K\{x_{i},x_{j}\}=(\Phi(x_{i}),\Phi(x_{j}))\), and \(\Phi(\cdot)\) is a nonlinear mapping from the low-dimensional input space to a high-dimensional feature space. \(C_{1}\) and \(C_{2}\) denote the training examples from view 1 and view 2 respectively, that is, \(C_{1}=(A^{'\top }_{1},B^{'\top }_{1})^{\top }\), \(C_{2}=(A^{'\top }_{2},B^{'\top }_{2})^{\top }\).
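
The blocks \(K\{\cdot,C^{\top}\}\) appearing below are ordinary Gram matrices between a set of examples and the stacked training set of a view. A small sketch follows; the RBF kernel is our example choice, not prescribed by the paper, and the variable names are ours.

```python
import numpy as np

def rbf_gram(X, C, gamma=1.0):
    """Gram matrix K{X, C^T} with an RBF kernel: K_ij = exp(-gamma * ||x_i - c_j||^2)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(C ** 2, axis=1)[None, :]
          - 2.0 * X @ C.T)
    return np.exp(-gamma * sq)

# C1, C2 stack the positive and negative training examples per view, as in the text:
# C1 = np.vstack([A1_prime, B1_prime]); C2 = np.vstack([A2_prime, B2_prime])
# K1 = rbf_gram(C1, C1); K2 = rbf_gram(C2, C2)       # full kernel matrices K_1, K_2
# K_A1C1 = rbf_gram(A1_prime, C1)                    # block K{A_1', C_1^T} used in (38)
```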

The optimization problems can be written as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{\lambda_{1},\lambda_{2},b_{1},b_{2},q_{1},q_{2},\eta}\frac{1}{2}\|K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}+e_{1}b_{1}\|^{2}\\ &&{\kern5.5pc}+\frac{1}{2}\|K\left\{A_{2}^{\prime},C_{2}^{\top}\right\}\lambda_{2}+e_{1}b_{2}\|^{2}\\ &&{\kern5.5pc}+c_{1}e_{2}^{\top}q_{1}+c_{2}e_{2}^{\top}q_{2}+\frac{1}{2}c_{3}\left(\lambda_{1}^{\top}K_{1}\lambda_{1}\right.\\ &&{\kern5.5pc}+\left.{b_{1}^{2}}+\lambda_{2}^{\top}K_{2}\lambda_{2}+{b_{2}^{2}}\right)+\frac{1}{2}c_{4}\left[(\lambda_{1}^{\top}K_{1}\right.\\ &&{\kern5.5pc}+\left.e^{\top}b_{1}\right)L_{1}(K_{1}\lambda_{1}+eb_{1})+\left(\lambda_{2}^{\top}K_{2}\right.\\ &&{\kern5.5pc}+\left.e^{\top}b_{2}\right)L_{2}(K_{2}\lambda_{2}+eb_{2})]+De_{1}^{\top}\eta\\ &&\text{s.t.}{\kern.5pc}|K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}+e_{1}b_{1}\\ &&{\kern1.4pc}-K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}-e_{1}b_{2}|\preceq \eta,\\ &&{\kern1.4pc}-K\{B_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}-e_{2}b_{1}+q_{1}\succeq e_{2},\\ &&{\kern1.4pc}-K\{B_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}-e_{2}b_{2}+q_{2}\succeq e_{2},\\ &&{\kern1.6pc}q_{1}\succeq 0,\,q_{2}\succeq 0,\\ &&{\kern1.6pc}\eta\succeq 0, \end{array} $$
(38)
$$\begin{array}{@{}rcl@{}} &&\min\limits_{\lambda_{3},\lambda_{4},b_{3},b_{4},q_{3},q_{4},\zeta}\frac{1}{2}\|K\left\{B_{1}^{\prime},C_{1}^{\top}\right\}\lambda_{3}+e_{2}b_{3}\|^{2}\\ &&{\kern5.3pc}+\frac{1}{2}\|K\left\{B_{2}^{\prime},C_{2}^{\top}\right\}\lambda_{4}+e_{2}b_{4}\|^{2}+c_{1}e_{1}^{\top}q_{3}\\ &&{\kern5.3pc}+c_{2}e_{1}^{\top}q_{4}+\frac{1}{2}c_{3}\left(\lambda_{3}^{\top}K_{1}\lambda_{3}+{b_{3}^{2}}\right.\\ &&{\kern5.3pc}+\left.\lambda_{4}^{\top}K_{2}\lambda_{4}+{b_{4}^{2}}\right)+\frac{1}{2}c_{4}\left[(\lambda_{3}^{\top}K_{1}\right.\\ &&{\kern5.3pc}+e^{\top}b_{3})L_{1}(K_{1}\lambda_{3}+eb_{3})+(\lambda_{4}^{\top}K_{2}\\ &&{\kern5.3pc}+\left.e^{\top}b_{4})L_{2}(K_{2}\lambda_{4}+eb_{4}){\vphantom{A^{T}}}\right]+He_{2}^{\top}\zeta\\ &&\text{s.t.}{\kern.5pc}|K\left\{B_{1}^{\prime},C_{1}^{\top}\right\}\lambda_{3}+e_{2}b_{3}-K\left(B_{2}^{\prime},C_{2}^{\top}\right)\lambda_{4}\\ &&{\kern1.5pc}-e_{2}b_{4}|\preceq \zeta,-K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{3}-e_{1}b_{3}+q_{3}\succeq e_{1},\\ &&{\kern1.5pc}-K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{4}-e_{1}b_{4}+q_{4}\succeq e_{1},\\ &&{\kern1.8pc}q_{3}\succeq 0,\,q_{4}\succeq 0,\\ &&{\kern1.8pc}\zeta\succeq 0, \end{array} $$
(39)

where \(K_{1}\) and \(K_{2}\) represent the kernel matrices of view 1 and view 2, respectively. \(L_{1}\) and \(L_{2}\) are the graph Laplacians of view 1 and view 2. \(e_{1}\), \(e_{2}\) and \(e\) are vectors of ones of appropriate dimensions. \(\lambda_{1}\), \(b_{1}\), \(\lambda_{2}\), \(b_{2}\), \(\lambda_{3}\), \(b_{3}\), \(\lambda_{4}\), \(b_{4}\) are classifier parameters, \(c_{1}\), \(c_{2}\), \(c_{3}\) and \(c_{4}\) are nonnegative parameters, and \(q_{1}\), \(q_{2}\), \(q_{3}\), \(q_{4}\), \(\eta\) and \(\zeta\) are slack vectors of appropriate dimensions.

The Lagrangian of the optimization problem (38) is given by

$$\begin{array}{@{}rcl@{}} L&=&\frac{1}{2}\|K\left\{A_{1}^{\prime},C_{1}^{\top}\right\}\lambda_{1}+e_{1}b_{1}\|^{2}+\frac{1}{2}\|K\left\{A_{2}^{\prime},C_{2}^{\top}\right\}\lambda_{2}\\ &+&e_{1}b_{2}\|^{2}+c_{1}e_{2}^{\top}q_{1}+c_{2}e_{2}^{\top}q_{2}+\frac{1}{2}c_{3}\left(\lambda_{1}^{\top}K_{1}\lambda_{1}+{b_{1}^{2}}\right.\\ &+&\left.\lambda_{2}^{\top}K_{2}\lambda_{2}+{b_{2}^{2}}\right)+\frac{1}{2}c_{4}\left[(\lambda_{1}^{\top}K_{1}+e^{\top}b_{1})L_{1}(K_{1}\lambda_{1}\right.\\ &+&\left.eb_{1})+(\lambda_{2}^{\top}K_{2}+e^{\top}b_{2})L_{2}(K_{2}\lambda_{2}+eb_{2})\right]+De_{1}^{\top}\eta\\ &-&\beta_{1}^{\top}\left(\eta-K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}-e_{1}b_{1}+K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}\right.\\ &+&\left.e_{1}b_{2}{\vphantom{A^T}}\right)-\beta_{2}^{\top}\left(K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}+e_{1}b_{1}-K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}\right.\\ &-&\left.e_{1}b_{2}+\eta{\vphantom{A^T}}\right)-\alpha_{1}^{\top}\left(-K\{B_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}-e_{2}b_{1}+q_{1}-e_{2}\right)\\ &-&\alpha_{2}^{\top}\left(-K\{B_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}-e_{2}b_{2}+q_{2}-e_{2}\right)\\ &-&\xi_{1}^{\top}q_{1}-\xi_{2}^{\top}q_{2}-\sigma^{\top}\eta, \end{array} $$
(40)

where \(\alpha_{1}\), \(\alpha_{2}\), \(\beta_{1}\), \(\beta_{2}\), \(\xi_{1}\), \(\xi_{2}\) and \(\sigma\) are vectors of nonnegative Lagrange multipliers.

We take partial derivatives of the above equation and let them be zero

$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \lambda_{1}}&=&K\left\{A_{1}^{\prime},C_{1}^{\top}\right\}^{\top}\left(K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}+e_{1}b_{1}\right)+c_{3}K_{1}\lambda_{1}\\ &+&c_{4}K_{1}L_{1}(K_{1}\lambda_{1}+eb_{1})+K\left\{A_{1}^{\prime},C_{1}^{\top}\right\}^{\top}\beta_{1}\\ &-&K\left\{A_{1}^{\prime},C_{1}^{\top}\right\}^{\top}\beta_{2}+K\left\{B_{1}^{\prime},C_{1}^{\top}\right\}^{\top}\alpha_{1}=0,\\ \frac{\partial L}{\partial b_{1}}&=&e_{1}^{\top}\left(K\{A_{1}^{\prime},C_{1}^{\top}\}\lambda_{1}+e_{1}b_{1}\right)+c_{3}b_{1}\\ &+&c_{4}e^{\top}L_{1}(K_{1}\lambda_{1}+eb_{1})+e_{1}^{\top}\beta_{1}-{e_{1}^{T}}\beta_{2}+e_{2}^{\top}\alpha_{1}\!=0,{}\\ \frac{\partial L}{\partial \lambda_{2}}&=&K\left\{A_{2}^{\prime},C_{2}^{\top}\right\}^{\top}\left(K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}+e_{1}b_{2}\right)+c_{3}K_{2}\lambda_{2}\\ &+&c_{4}K_{2}L_{2}(K_{2}\lambda_{2}+eb_{2})-K\left\{A_{2}^{\prime},C_{2}^{\top}\right\}^{\top}\beta_{1}\\ &+&K\left\{A_{2}^{\prime},C_{2}^{\top}\right\}^{\top}\beta_{2}+K\left\{B_{2}^{\prime},C_{2}^{\top}\right\}^{\top}\alpha_{2}=0,\\ \frac{\partial L}{\partial b_{2}}&=&e_{1}^{\top}\left(K\{A_{2}^{\prime},C_{2}^{\top}\}\lambda_{2}+e_{1}b_{2}\right)+c_{3}b_{2}\\ &+&c_{4}e^{\top}L_{2}(K_{2}\lambda_{2}+eb_{2})-e_{1}^{\top}\beta_{1}+e_{1}^{\top}\beta_{2}+e_{2}^{\top}\alpha_{2}\!=0, \end{array} $$
(41)
$$\begin{array}{@{}rcl@{}} &&\frac{\partial L}{\partial q_{1}}=c_{1}e_{2}-\alpha_{1}-\xi_{1}=0,\\ &&\frac{\partial L}{\partial q_{2}}=c_{2}e_{2}-\alpha_{2}-\xi_{2}=0,\\ &&\frac{\partial L}{\partial \eta}=De_{1}-\beta_{1}-\beta_{2}-\sigma=0.\\ \end{array} $$

Let

$$\begin{array}{@{}rcl@{}} &&H_{\phi}=\left(K\{A_{1}^{\prime},C_{1}^{\top}\},e_{1}\right),\,\,G_{\phi}=\left(K\{B_{1}^{\prime},C_{1}^{\top}\},e_{2}\right),\\ &&O_{\phi}=\left(\begin{smallmatrix} K_{1}& 0 \\ 0 & 1\end{smallmatrix}\right),J_{\phi}=(K_{1},\,e),\,\,Q_{\phi}=\left(K\{A_{2}^{\prime},C_{2}^{\top}\},e_{1}\right),\\ &&P_{\phi}=\left(K\{B_{2}^{\prime},C_{2}^{\top}\},e_{2}\right),U_{\phi}=\bigl(\begin{smallmatrix} K_{2} & 0 \\ 0 & 1\end{smallmatrix}\bigr),\,\,F_{\phi}=(K_{2},\,e),\\ &&\theta_{1}=\begin{pmatrix}\lambda_{1}\\b_{1}\end{pmatrix},\,\,\theta_{2}=\begin{pmatrix}\lambda_{2}\\b_{2}\end{pmatrix}. \end{array} $$
(42)

From the above equations, we obtain

$$\begin{array}{@{}rcl@{}} H_{\phi}^{\top}H_{\phi}\theta_{1}&+&c_{3}O_{\phi}\theta_{1}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\theta_{1}+H_{\phi}^{\top}\beta_{1}-H_{\phi}^{\top}\beta_{2}\\ &+&G_{\phi}^{\top}\alpha_{1}=0, \end{array} $$
(43)
$$\begin{array}{@{}rcl@{}} Q_{\phi}^{\top}Q_{\phi}\theta_{2}&+&c_{3}U_{\phi}\theta_{2}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\theta_{2}-Q_{\phi}^{\top}\beta_{1}+Q_{\phi}^{\top}\beta_{2}\\ &+&P_{\phi}^{\top}\alpha_{2}=0. \end{array} $$
(44)

It follows that

$$\begin{array}{@{}rcl@{}} \theta_{1}&=&\left(H_{\phi}^{\top}H_{\phi}+c_{3}O_{\phi}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\right)^{-1}\\ &&\left[H_{\phi}^{\top}(\beta_{2}-\beta_{1})-G_{\phi}^{\top}\alpha_{1}\right], \end{array} $$
(45)
$$\begin{array}{@{}rcl@{}} \theta_{2}&=&\left(Q_{\phi}^{\top}Q_{\phi}+c_{3}U_{\phi}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\right)^{-1}\\ &&\left[Q_{\phi}^{\top}(\beta_{1}-\beta_{2})-P_{\phi}^{\top}\alpha_{2}\right]. \end{array} $$
(46)

We substitute (45), (46) into (40) and get

$$\begin{array}{@{}rcl@{}} L&=&(\alpha_{1}+\alpha_{2})^{\top}e_{2}-\frac{1}{2}\left[(\beta_{2}-\beta_{1})^{\top}H_{\phi}-\alpha_{1}^{\top}G_{\phi}\right]\\ &&\left(H_{\phi}^{\top}H_{\phi}+c_{3}O_{\phi}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\right)^{-1}\\ &&\left[H_{\phi}^{\top}(\beta_{2}-\beta_{1})-G_{\phi}^{\top}\alpha_{1}\right]-\frac{1}{2}\left[(\beta_{1}-\beta_{2})^{\top}Q_{\phi}\right.\\ &&-\left.\alpha_{2}^{\top}P_{\phi}\right]\left(Q_{\phi}^{\top}Q_{\phi}+c_{3}U_{\phi}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\right)^{-1}\\ &&\left[Q_{\phi}^{\top}(\beta_{1}-\beta_{2})-P_{\phi}^{\top}\alpha_{2}\right]. \end{array} $$
(47)

Therefore, the dual optimization formulation is

$$\begin{array}{@{}rcl@{}} &&\min\limits_{\xi_{1},\xi_{2},\alpha_{1},\alpha_{2}}\frac{1}{2}\xi_{1}^{\top}\left(H_{\phi}^{\top}H_{\phi}+c_{3}O_{\phi}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\right)^{-1}\xi_{1}\\ &&{\kern3pc}+\frac{1}{2}\xi_{2}^{\top}\left(Q_{\phi}^{\top}Q_{\phi}+c_{3}U_{\phi}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\right)^{-1}\\ &&{\kern3pc}\xi_{2}-(\alpha_{1}+\alpha_{2})^{\top}e_{2} \end{array} $$
(48)
$$\begin{array}{@{}rcl@{}} &&\text{s.t.}{\kern1pc}\xi_{1}=H_{\phi}^{\top}(\beta_{2}-\beta_{1})-G_{\phi}^{\top}\alpha_{1},\\ &&{\kern2pc}\xi_{2}=Q_{\phi}^{\top}(\beta_{1}-\beta_{2})-P_{\phi}^{\top}\alpha_{2},\\ &&{\kern2pc}0\preceq\beta_{1},\beta_{2},\beta_{1}+\beta_{2} \preceq De_{1},\\ &&{\kern2pc}0\preceq \alpha_{1/2}\preceq c_{1/2}e_{2}. \end{array} $$

Correspondingly, the dual optimization formulation for (39) is

$$\begin{array}{@{}rcl@{}} &&\min\limits_{\rho_{1},\rho_{2},\omega_{1},\omega_{2}} \frac{1}{2}\rho_{1}^{\top}\left(G_{\phi}^{\top}G_{\phi}+c_{3}O_{\phi}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\right)^{-1}\\ &&{\kern3.5pc}\rho_{1}+\frac{1}{2}\rho_{2}^{\top}\left(P_{\phi}^{\top}P_{\phi}+c_{3}U_{\phi}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\right)^{-1}\\ &&{\kern3.5pc}\rho_{2}-(\omega_{1}+\omega_{2})^{\top}e_{1}\\ &&{\kern1.5pc}\text{s.t.}{\kern1pc}\rho_{1}=G_{\phi}^{\top}(\gamma_{2}-\gamma_{1})-H_{\phi}^{\top}\omega_{1},\\ &&{\kern3.5pc}\rho_{2}=P_{\phi}^{\top}(\gamma_{1}-\gamma_{2})-Q_{\phi}^{\top}\omega_{2},\\ &&{\kern3.5pc}0\preceq \gamma_{1},\gamma_{2},\gamma_{1}+\gamma_{2} \preceq He_{2},\\ &&{\kern3.5pc}0\preceq \omega_{1/2}\preceq c_{1/2}e_{1}, \end{array} $$
(49)

where the augmented vectors \(\pi _{1}=\begin {pmatrix}\lambda _{3}\\b_{3}\end {pmatrix},\,\pi _{2}=\begin {pmatrix}\lambda _{4}\\b_{4}\end {pmatrix}\) are given by

$$\begin{array}{@{}rcl@{}} \pi_{1}&=&\left(G_{\phi}^{\top}G_{\phi}+c_{3}O_{\phi}+c_{4}J_{\phi}^{\top}L_{1}J_{\phi}\right)^{-1}\\ &&\left[G_{\phi}^{\top}(\gamma_{2}-\gamma_{1})-H_{\phi}^{\top}\omega_{1}\right], \end{array} $$
(50)
$$\begin{array}{@{}rcl@{}} \pi_{2}&=&\left(P_{\phi}^{\top}P_{\phi}+c_{3}U_{\phi}+c_{4}F_{\phi}^{\top}L_{2}F_{\phi}\right)^{-1}\\ &&\left[P_{\phi}^{\top}(\gamma_{1}-\gamma_{2})-Q_{\phi}^{\top}\omega_{2}\right]. \end{array} $$
(51)

Suppose an example \(x\) has two views \(x_{1}\) and \(x_{2}\). If \(\frac {1}{2}(|K\{x_{1}^{\top },C_{1}^{\top }\}\lambda _{1}+b_{1}|+|K\{x_{2}^{\top },C_{2}^{\top }\}\lambda _{2}+b_{2}|)\leq \frac {1}{2}(|K\{x_{1}^{\top },C_{1}^{\top }\}\lambda _{3}+b_{3}|+|K\{x_{2}^{\top },C_{2}^{\top }\}\lambda _{4}+b_{4}|)\), the example is classified to class +1, otherwise to class −1.
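
A sketch of this kernel decision rule, reusing a Gram-matrix routine such as rbf_gram above, is given below; the function and variable names are ours.

```python
import numpy as np

def predict_kernel_mvlaptsvm(x1, x2, C1, C2, lam1, b1, lam2, b2,
                             lam3, b3, lam4, b4, gram):
    """Kernel decision rule of Section 3.2: x1, x2 are the two views of a new example."""
    k1 = gram(x1[None, :], C1).ravel()   # row vector K{x_1^T, C_1^T}
    k2 = gram(x2[None, :], C2).ravel()   # row vector K{x_2^T, C_2^T}
    pos = 0.5 * (abs(k1 @ lam1 + b1) + abs(k2 @ lam2 + b2))
    neg = 0.5 * (abs(k1 @ lam3 + b3) + abs(k2 @ lam4 + b4))
    return 1 if pos <= neg else -1
```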

4 Experimental results

In this section, we evaluate our proposed MvLapTSVMs on three real-world datasets from the UCI Machine Learning Repository: ionosphere classification, handwritten digits classification and advertisement classification. Details about the three datasets are listed in Table 1.

Table 1 Datasets.

4.1 Ionosphere

The ionosphere datasetFootnote 1 was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere; “Bad” returns are those that do not, and their signals pass through the ionosphere. The dataset includes 351 instances in total, divided into 225 “Good” (positive) instances and 126 “Bad” (negative) instances.

In our experiments, we regard the original data as the first view. Then we apply PCA to capture 99% of the data variance while reducing the dimensionality from 34 to 21, and regard the resultant data as the second view. We compare MvLapTSVMs with single-view LapTSVMs (LapTSVM1 means applying the LapTSVM method to one view and LapTSVM2 means applying it to the other view), SVM-2K and multi-view TSVMs (MvTSVMs)Footnote 2. The experimental results vary with the amount of unlabeled data used. We select the regularization parameters from the range \([2^{-7},2^{7}]\) with exponential growth 0.5. The linear kernel is chosen for this dataset. We first select 70 labeled and 70 unlabeled examples as the training set (i.e., l = 70, u = 70); the unlabeled examples are randomly selected from both classes and the size of the test set is 71. The results are in the second column of Table 2. We then select 70 labeled and 140 unlabeled examples as the training set (i.e., l = 70, u = 140), with the unlabeled examples again randomly selected from both classes and a test set of size 71. The results are in the third column. Each experiment is repeated five times. The experimental results are reported in Table 2.
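
A sketch of this data preparation and parameter grid is given below; scikit-learn is assumed for PCA, and the function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def make_two_views(X):
    """View 1: the original features (34-dimensional for ionosphere).
    View 2: PCA projection retaining 99% of the variance (21 components in our runs)."""
    pca = PCA(n_components=0.99, svd_solver='full').fit(X)
    return X, pca.transform(X)

# Regularization parameters searched over [2^-7, 2^7] with exponential growth 0.5.
param_grid = 2.0 ** np.arange(-7, 7.5, 0.5)
```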

Table 2 Classification accuracies and standard deviations (%) on Ionosphere.

4.2 Handwritten digits

The handwritten digits datasetFootnote 3 consists of features of handwritten digits (0∼9) extracted from a collection of Dutch utility maps. It contains 2000 examples (200 examples per class), with view 1 being the 76 Fourier coefficients and view 2 being the 64 Karhunen-Loève coefficients of each example image.

In this experiment, we compare MvLapTSVMs with single-view LapTSVMs, SVM-2K and MvTSVMs. Because TSVMs are designed for binary classification while the handwritten digits dataset contains 10 classes, we use the three pairs (1,7), (2,4) and (3,9) as binary classification tasks. We select the regularization parameters from the range \([2^{-7},2^{7}]\) with exponential growth 0.5. We select 160 labeled and 160 unlabeled examples as the training set (i.e., l = 160, u = 160). Half of the unlabeled data come from one class and the other half from the other class. The size of the test set is 80. The Gaussian kernel is chosen for this dataset. Each experiment is repeated five times. The experimental results are reported in Table 3.

Table 3 Classification accuracies and standard deviations (%) on Handwritten digits.

4.3 Advertisement

The advertisement datasetFootnote 4 [35] consists of 3279 examples, including 459 ad images (positive examples) and 2820 non-ad images (negative examples). One view describes the image itself (words in the image's URL, alt text and caption), while the other view contains all other features (words from the URLs of the page that contains the image and of the page the image points to).

In this experiment, we randomly select 700 examples from the dataset. We select the regularization parameters from the range \([2^{-7},2^{7}]\) with exponential growth 0.5. The Gaussian kernel is chosen for this dataset. We use u = 100 unlabeled examples, randomly selected from both classes. Each experiment is repeated five times. The experimental results are shown in Fig. 1.

Fig. 1 Classification accuracies (%) of four methods on Advertisement

4.4 Analysis of the results

MvLapTSVMs obtain good performance by combining the two views in the constraints and are better than the corresponding single-view LapTSVMs. The second, third and sixth rows in Table 2 show that MvLapTSVMs are superior to single-view LapTSVMs with the same labeled examples and different numbers of unlabeled examples. Similarly, the second, third and sixth rows in Table 3 show that MvLapTSVMs are superior to single-view LapTSVMs on the different digit-pair classification problems. From Fig. 1, with varying training sizes, we can conclude that MvLapTSVMs are superior to single-view LapTSVMs. MvLapTSVMs can also exploit unlabeled examples to improve classification accuracy compared with supervised methods such as MvTSVMs and SVM-2K. The fourth, fifth and sixth rows in Table 2 show that MvLapTSVMs are superior to MvTSVMs and SVM-2K with the same labeled examples and different numbers of unlabeled examples. Similarly, the fourth, fifth and sixth rows in Table 3 show that MvLapTSVMs are superior to MvTSVMs and SVM-2K on the different digit-pair classification problems. From Fig. 1, with varying training sizes, MvLapTSVMs are also superior to MvTSVMs and SVM-2K.

5 Conclusion

In this paper, we extended LapTSVMs to multi-view learning and proposed a new framework called MvLapTSVMs, which combines two views by introducing a constraint of similarity between the two one-dimensional projections identifying two distinct TSVMs from the two feature spaces. MvLapTSVMs construct a decision function by solving two quadratic programming problems, and we derived their dual formulations using Lagrangian optimization techniques. MvLapTSVMs were further extended to a kernel version. Experimental results on real-world datasets indicate that MvLapTSVMs are better than the corresponding single-view and supervised learning methods.