1 Introduction

1.1 Background

Multiview learning has been an important topic in the machine learning community [5, 21, 22, 24, 32, 34, 35, 38, 40, 41]. In traditional machine learning problems, we usually assume that a data point has one feature vector to represent its input information. For example, in an image recognition problem, we can extract a visual feature vector from an image using a texture descriptor [9, 14, 25–28, 36, 39]. In this case, the texture is one view of the image. However, there can be more than one view of an image: besides the texture view, we can also extract feature vectors from other views, including shape and color. Another example is the classification of scientific articles, where we may extract a feature vector from the main text of an article [4, 6, 11, 16, 18, 19, 23, 29]. However, the main text is just one view of the article, and we can also extract features from other views, such as the abstract and the reference list. Multiview learning argues that we should learn from more than one view to represent the data and construct a predictor. The motivation for multiview learning is that a single-view data representation is usually incomplete, and different views can provide complementary information for the learning problem. In multiview learning, the input of a data point is not one single feature vector of one single view, but multiple feature vectors representing different views. The target of multiview learning is to learn a predictor that takes the multiple view feature vectors and predicts one single output for a data point. The problem of multiview learning can be classified into two types, supervised multiview learning and unsupervised multiview learning.

  • Supervised multiview learning refers to the problem of learning from a data set where both the multiview input and the output are available for each data point [10, 15, 20]. In this problem, the output is usually a class label or a continuous response. The learning problem is to build a predictive model from the training set that predicts the output of an input data point, with the help of the input–output pairs of the training set.

  • Unsupervised multiview learning refers to the problem of clustering a set of data points, where only the multiview inputs of each data point are given [7, 33, 44]. In this problem, the outputs of the data points are not available.

In this paper, we investigate the problem of supervised multiview learning and propose a novel algorithm to solve it. The proposed method is based on the assumption that the different views of a data point are generated from one single intact feature vector, and that the view generation is performed by a linear transformation. We try to recover the intact feature vector of each data point from its multiview feature vectors, guided by its corresponding output, i.e., its binary class label.

1.2 Relevant works

Several multiview learning methods have been proposed. We summarize the state of the art as follows.

  • Zhang et al. [43] proposed a local learning (LL) method for the multiview learning problem, and designed a local predictive model for each data point based on its multiview inputs. The local predictive model is learned on the nearest neighbors of each data point.

  • Sindhwani et al. [31] proposed a co-training (CT) algorithm for multiview learning problems to improve the classification performance of each view. The method is based on multiview regularization, enforcing agreement and smoothness over both labeled and unlabeled data points.

  • Quadrianto [30] proposed a multiview learning algorithm to solve the problem of view disagreement (VD), i.e., the case where different views of one single data point do not belong to the same class. The method uses a conditional entropy criterion to detect disagreement among the views, and removes the data points with view disagreement from the training set.

  • Zhai [42] proposed a multiview metric learning method with global consistency and local smoothness (GL) for multiview learning with partially labeled data. The method simultaneously considers global consistency and local smoothness, assuming that the different views share a latent feature space and imposing global consistency and local structure on the learning procedure.

  • Chen et al. [2] proposed a statistical subspace (SS) multiview representation method that leverages both multiview dependencies and supervision information. The method is based on a subspace Markov network over the multiview latent variables and assumes that the multiple views and the class labels are conditionally independent. The algorithm maximizes the data likelihood while minimizing the classification error.

1.3 Contributions

In this paper, we propose a novel supervised multiview learning method. The method is based on the assumption that the different multiview inputs of a data point share a single discriminative intact representation. Under this assumption, although a data point has several views, one single intact feature vector exists for that data point. This intact feature vector is assumed to be discriminative, i.e., it represents the class information of the data point. Moreover, the feature vector of each view of a data point can be obtained from the intact vector by applying a linear view-conditional transformation to it. In this way, if we learn the discriminative intact feature vector for each training data point, we can learn a classifier in the intact space with the help of the class labels of the training data points. To this end, we propose a novel method to learn the hidden intact feature vectors, the view-conditional transformation matrices, and the classifier in the intact space simultaneously. We define an intact feature vector for each data point and a transformation matrix for each view. The feature vector of one view of a data point can then be reconstructed as the product of the corresponding transformation matrix and the intact feature vector. The reconstruction error for each view of each data point is measured by the Cauchy error estimator [8, 12]. To learn the optimal intact feature vectors and view-conditional transformation matrices, we propose to minimize the Cauchy errors. Moreover, because the intact feature vectors are assumed to be discriminative, we also argue that a classifier can be designed in the intact space to minimize the classification error. Thus, we also propose to learn a linear classifier in the intact space and use the hinge loss to measure the classification error of the training set in the intact space [1, 3]. To learn the optimal classifier parameter and the intact feature vectors, we propose to minimize the hinge loss with regard to both the classifier parameter and the intact feature vectors.

To model the problem, we propose a joint optimization problem for learning the intact vectors, the view-conditional transformation matrices, and the classifier parameter vector. The objective function of this problem is composed of two error terms and three regularization terms. The first error term is the view reconstruction error, measured by the Cauchy estimator over all data points and views. The second error term is the classification error over the intact feature vectors of all training data points, measured by hinge losses. The three regularization terms are squared \(\ell _2\) norm terms over the intact feature vectors, the view-conditional transformation matrices, and the classifier parameter vector. The purpose of imposing the squared \(\ell _2\) norm on these variables is to reduce the complexity of the learned outputs. To minimize the proposed objective function, we adopt an alternate optimization strategy, i.e., when the objective function is minimized with regard to one variable, the other variables are fixed. The optimization with regard to each variable is conducted by the gradient descent algorithm.

The contributions of this paper are of three parts:

  1. We propose a novel supervised multiview learning framework by simultaneously learning intact feature vectors, view-conditional transformation matrices, and the classifier parameter vector.

  2. We build a novel optimization problem for this learning task, considering both the view reconstruction problem and the classifier learning problem.

  3. We develop an iterative algorithm to solve this optimization problem, based on an alternate optimization strategy and the gradient descent algorithm.

1.4 Paper organization

This paper is organized as follows: In Sect. 2, the proposed method for supervised multiview learning is introduced. We first model the problem as the minimization of an objective function and then solve it with an iterative algorithm. In Sect. 3, the proposed iterative algorithm is evaluated. We first analyze its sensitivity to the parameters, then compare it to some state-of-the-art algorithms, and finally test its running time performance. In Sect. 4, we conclude the paper.

2 Methods

In this section, we introduce the proposed supervised multiview learning method.

2.1 Problem modeling

We assume we are dealing with a supervised binary classification problem with multiview data. A training data set of n data points is given, \(X = \{\theta _1, \ldots , \theta _n\}\), where \(\theta _i=({\mathbf{x}}_i^1, \ldots , {\mathbf{x}}_i^m, y_i)\) is the ith data point. The information of each data point is composed of feature vectors of m views and a binary class label \(y_i\). \({\mathbf{x}}_i^j\in {\mathbb {R}}^{d_j}\) is the \(d_j\)-dimensional feature vector of the jth view of the ith data point, and \(y_i\in \{+1,-1\}\) is the binary class label of the ith data point. The problem of supervised multiview learning is to learn a predictive model from the training set that can predict a binary class label from the multiview input of a test data point. We assume there is an intact vector \({\mathbf{z}}_i\in {\mathbb {R}}^d\) for the ith data point, and that its jth view \({\mathbf{x}}_i^j\) can be reconstructed by a linear transformation,

$${\mathbf{x}}_i^j \leftarrow W_j {\mathbf{z }}_i,$$
(1)

where \(W_j\in {\mathbb {R}}^{{d_j} \times d}\) is the view-conditional linear transformation matrix for the jth view. Please note that the view-conditional transformation matrix is shared by the same view of all the data points. By learning both \(W_j\) and \({\mathbf{z}}_i\), we can recover the hidden intact vector of the ith data point, \({\mathbf{z}}_i\), and use it for the classification problem. To this end, we propose to minimize the reconstruction error. The reconstruction error is measured by the Cauchy error estimator, \(E({\mathbf{x}}_i^j, W_j {\mathbf{z}}_i)\),

$$E({\mathbf{x}}_i^j, W_j {\mathbf{z}}_i) = \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right).$$
(2)

This error estimator has been shown to be robust, and the constant \(c\) acts as a scale (offset) parameter. We propose to minimize this error estimator over all data points and all views with regard to both \({\mathbf{z}}_i, i = 1, \ldots , n\), and \(W_j, j = 1, \ldots, m\),

$$\min _{{\mathbf{z}}_i|_{i=1}^n,W_j|_{j=1}^m} \left\{ \sum\limits_{i=1}^n \sum\limits_{j=1}^m E({\mathbf{x}}_i^j, W_j {\mathbf{z}}_i) = \sum\limits_{i=1}^n \sum\limits_{j=1}^m \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) \right\}$$
(3)
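For concreteness, the reconstruction objective in (2)–(3) can be evaluated with a few lines of NumPy. The following is a minimal sketch in which the container names (`X_views`, `Ws`, `Z`) and their shapes are illustrative assumptions, not part of the formal model.

```python
import numpy as np

def cauchy_error(x, W, z, c=1.0):
    """Cauchy error estimator of Eq. (2) for one view of one data point."""
    r = x - W @ z                                   # reconstruction residual
    return np.log(1.0 + (r @ r) / c**2)

def reconstruction_objective(X_views, Ws, Z, c=1.0):
    """Summed Cauchy reconstruction error of Eq. (3).

    X_views[j]: (n, d_j) matrix of jth-view features,
    Ws[j]:      (d_j, d) view-conditional transformation matrix,
    Z:          (n, d) matrix whose rows are the intact vectors z_i.
    """
    return sum(cauchy_error(x, W_j, z, c)
               for X_j, W_j in zip(X_views, Ws)
               for x, z in zip(X_j, Z))
```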

Moreover, we also assume that the intact feature vectors of the data points are discriminative and represent the class information; thus, the intact feature vectors should minimize a classification loss function over the data set. We propose to learn the intact feature vector of the ith data point by jointly learning a linear classifier to predict its class label, \(y_i\). The classifier is designed as a linear function,

$$y_i \leftarrow {\varvec{\omega}}^\top {\mathbf{z}}_i$$
(4)

The classification error can be measured by the hinge loss function,

$$L(y_i, {\varvec{\omega }}^\top {\mathbf{z}}_i) = \max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i).$$
(5)

Minimizing this loss function yields a large-margin classifier. To learn the optimal classifier and the discriminative intact feature vectors, we propose to minimize the classification error, measured by the hinge loss, over all the training data points,

$$\min_{{\mathbf{z}}_i|_{i=1}^n, {\varvec{\omega }}} \left\{ \sum _{i=1}^n L(y_i, {\varvec{\omega }}^\top {\mathbf{z}}_i) = \sum _{i=1}^n \max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z }}_i) \right\}$$
(6)
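As a sketch, the classification loss in (5)–(6) over a batch of intact vectors can be computed as follows; the array names and shapes are assumptions made for illustration.

```python
import numpy as np

def hinge_loss_sum(Z, y, w):
    """Total hinge loss of Eq. (6); Z is (n, d), y is (n,) in {+1, -1}, w is (d,)."""
    margins = y * (Z @ w)                    # y_i * w^T z_i for every data point
    return np.maximum(0.0, 1.0 - margins).sum()
```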

Moreover, to prevent over-fitting, we propose to minimize the squared \(\ell _2\) norms of the variables to regularize the learning of \({\mathbf{z}}_i\), \(W_j\), and \({\varvec{\omega }}\),

$$\min _{{\mathbf{z}}_i|_{i=1}^n, W_j|_{j=1}^m,{\varvec{\omega }}} \left\{ R({\mathbf{z}}_i|_{i=1}^n, W_j|_{j=1}^m,{\varvec{\omega }}) = \sum _{i=1}^n \Vert {\mathbf{z}}_i\Vert _2^2 + \sum _{j=1}^m \Vert W_j\Vert _2^2 + \Vert {\varvec{\omega }}\Vert _2^2 \right\}.$$
(7)

Our overall learning problem is obtained by considering the view-conditional reconstruction problem in (3), the classifier learning problem in the intact space in (6), and the regularization in (7),

$$\begin{aligned} &\min _{{\mathbf{z}}_i|_{i=1}^n,W_j|_{j=1}^m,{\varvec{\omega }}} \left\{ \sum _{i=1}^n \sum _{j=1}^m E({\mathbf{x}}_i^j, W_j {\mathbf{z}}_i) + \alpha \sum _{i=1}^n L(y_i, {\varvec{\omega }}^\top {\mathbf{z}}_i) + \gamma R({\mathbf{z}}_i|_{i=1}^n, W_j|_{j=1}^m,{\varvec{\omega }}) \right. \\&\quad = \sum _{i=1}^n \sum _{j=1}^m \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) + \alpha \sum _{i=1}^n \max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i) \\&\quad \left. + \gamma \left( \sum _{i=1}^n \Vert {\mathbf{z}}_i\Vert _2^2 + \sum _{j=1}^m \Vert W_j\Vert _2^2 + \Vert {\varvec{\omega }}\Vert _2^2 \right) \right\} , \end{aligned}$$
(8)

where \(\alpha\) is a tradeoff parameter that balances the view-conditional reconstruction terms and the classification error terms, and \(\gamma\) is a tradeoff parameter that balances the view-conditional reconstruction terms and the regularization terms. By optimizing this problem, we can learn intact feature vectors that represent the multiview inputs of the data points and are also discriminative.
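Putting the pieces together, the value of the joint objective (8) for a given set of variables can be evaluated as in the following sketch; as before, the array names and shapes are assumptions made for illustration.

```python
import numpy as np

def misc_objective(X_views, Ws, Z, w, y, alpha, gamma, c=1.0):
    """Value of the joint objective in Eq. (8).

    X_views[j]: (n, d_j) features of view j, Ws[j]: (d_j, d) transformation,
    Z: (n, d) intact vectors (rows), w: (d,) classifier, y: (n,) labels in {+1, -1}.
    """
    recon = 0.0
    for X_j, W_j in zip(X_views, Ws):
        R = X_j - Z @ W_j.T                                        # residuals of view j, shape (n, d_j)
        recon += np.log(1.0 + (R**2).sum(axis=1) / c**2).sum()     # Eq. (3)
    clf = np.maximum(0.0, 1.0 - y * (Z @ w)).sum()                 # Eq. (6)
    reg = (Z**2).sum() + sum((W**2).sum() for W in Ws) + w @ w     # Eq. (7)
    return recon + alpha * clf + gamma * reg
```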

2.2 Optimization

To solve the optimization problem in (8), we adopt the alternate optimization strategy. The optimization is conducted in an iterative algorithm: when one variable is considered, the others are fixed, and after a variable is updated, it is fixed in the next iteration while another variable is updated. In the following subsections, we discuss how to update each variable.

2.2.1 Updating \({\mathbf{z}}_i\)

When we want to update \({\mathbf{z}}_i\), we only consider this single variable and fix all other variables. Thus we have the following optimization problem,

$$\begin{aligned} \min_{{\mathbf{z}}_i} \left\{ \sum _{j=1}^m \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) + \alpha \max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i) + \gamma \Vert {\mathbf{z}}_i\Vert _2^2 \right\} . \end{aligned}$$
(9)

The second term, \(\max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i)\), is convex but not differentiable, which makes it difficult to optimize directly with gradient descent. Thus we rewrite it as follows,

$$\max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z }}_i) = \left\{ \begin{array}{ll} 1 - y_i {\varvec{\omega }}^\top {\mathbf{z }}_i, & {\text {if}}\, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i > 0\\ 0, & {\text{otherwise}}. \end{array}\right.$$
(10)

We define an indicator variable, \(\beta _i\), to indicate which of the above cases holds,

$$\beta _i = \left\{ \begin{array}{ll} 1, &\quad {\text {if}}\, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i > 0\\ 0, &\quad {\text{otherwise}}, \end{array}\right.$$
(11)

and rewrite (10) as follows,

$$\max (0, 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i) = \beta _i \left( 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i \right)$$
(12)
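The following small sketch illustrates that, when \(\beta _i\) is computed from the current variables as in (11), the product form (12) reproduces the hinge loss exactly (variable names are illustrative):

```python
import numpy as np

def hinge_via_beta(y_i, w, z_i):
    """Hinge loss written in the product form of Eq. (12)."""
    margin = y_i * (w @ z_i)
    beta_i = 1.0 if 1.0 - margin > 0 else 0.0   # indicator of Eq. (11)
    return beta_i * (1.0 - margin)              # equals max(0, 1 - margin)
```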

Please note that \(\beta _i\) is also a function of \({\mathbf{z}}_i\); however, we first compute it using the \({\mathbf{z}}_i\) solved in the previous iteration, and then fix it to update \({\mathbf{z}}_i\) in the current iteration. In this way, (9) is rewritten as

$$\begin{aligned} \min_{{\mathbf{z}}_i} \left\{ \sum _{j=1}^m \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) + \alpha \beta _i \left( 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i \right) + \gamma \Vert {\mathbf{z}}_i\Vert _2^2 = g({\mathbf{z}}_i) \right\} , \end{aligned}$$
(13)

where \(g({\mathbf{z}}_i)\) is the objective of this optimization problem. To minimize \(g({\mathbf{z}}_i)\), we use the gradient descent algorithm, which updates \({\mathbf{z}}_i\) by moving it along the negative gradient direction of \(g({\mathbf{z}}_i)\),

$${\mathbf{z}}_i \leftarrow {\mathbf{z}}_i - \mu \nabla g({\mathbf{z}}_i),$$
(14)

where \(\mu\) is the descent step, and \(\nabla g({\mathbf{z}}_i)\) is the gradient function of \(g({\mathbf{z}}_i)\). We set \(\nabla g({\mathbf{z}}_i)\) as the partial derivative of \(g({\mathbf{z}}_i)\) with regard to \({\mathbf{z}}_i\),

$$\begin{aligned} \nabla g({\mathbf{z}}_i) &= \frac{\partial g({\mathbf{z}}_i)}{\partial {\mathbf{z}}_i} = - \sum _{j=1}^m \frac{\frac{2 W_j^\top ( {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i )}{c^2}}{\left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) } - \alpha \beta _i y_i {\varvec{\omega }}+ 2 \gamma {\mathbf{z}}_i \\ &= - \sum _{j=1}^m \frac{2 W_j^\top ( {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i )}{\left( c^2 + \left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2\right) } - \alpha \beta _i y_i {\varvec{\omega }}+ 2 \gamma {\mathbf{z}}_i. \end{aligned}$$
(15)

By substituting (15) into (14), we have the final updating rule of \({\mathbf{z}}_i\),

$${\mathbf{z}}_i \leftarrow {\mathbf{z}}_i - \mu \left( - \sum _{j=1}^m \frac{2 W_j^\top ( {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i )}{\left( c^2 + \left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2\right) } - \alpha \beta _i y_i {\varvec{\omega }}+ 2 \gamma {\mathbf{z}}_i \right).$$
(16)
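A minimal sketch of the resulting update step for one intact vector, following the gradient in (15) and the rule in (16), is given below; the variable names are illustrative assumptions.

```python
import numpy as np

def update_z(z_i, x_views_i, Ws, w, y_i, beta_i, alpha, gamma, mu, c=1.0):
    """One gradient-descent step on z_i following Eqs. (15)-(16).

    x_views_i is the list of the m view feature vectors x_i^1, ..., x_i^m of
    data point i; Ws is the list of view-conditional matrices W_1, ..., W_m.
    """
    grad = -alpha * beta_i * y_i * w + 2.0 * gamma * z_i     # hinge and regularization terms
    for x_ij, W_j in zip(x_views_i, Ws):
        r = x_ij - W_j @ z_i                                 # residual of view j
        grad += -2.0 * (W_j.T @ r) / (c**2 + r @ r)          # Cauchy term of Eq. (15)
    return z_i - mu * grad                                   # Eq. (16)
```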

2.2.2 Updating \(W_j\)

When we want to optimize \(W_j\), we fix all other variables and only consider \(W_j\) itself. The optimization problem reduces to

$$\min _{W_j} \left\{ \sum _{i=1}^{n} \log \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) + \gamma \Vert W_j\Vert _2^2 = f(W_j) \right\},$$
(17)

where \(f(W_j)\) is the objective function of this problem. To solve this problem, we also update \(W_j\) by using the gradient descent algorithm,

$$W_j \leftarrow W_j - \mu \nabla f(W_j),$$
(18)

where \(\nabla f(W_j)\) is the gradient function of \(f(W_j)\),

$$\begin{aligned} \nabla f(W_j) &= \frac{\partial f(W_j)}{\partial W_j} = - \sum _{i=1}^{n} \frac{\frac{2({\mathbf{x}}_i^j - W_j{\mathbf{z}}_i){\mathbf{z}}_i^\top }{c^2}}{ \left( 1 + \frac{\left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2}{c^2} \right) }+ 2 \gamma W_j \\ &= - \sum _{i=1}^{n} \frac{2({\mathbf{x}}_i^j - W_j{\mathbf{z}}_i){\mathbf{z}}_i^\top }{ \left( c^2 + \left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2 \right) } + 2 \gamma W_j. \end{aligned}$$
(19)

Substituting (19) into (18), we have the final updating rule of \(W_j\),

$$W_j \leftarrow W_j - \mu \left( - \sum _{i=1}^{n} \frac{2({\mathbf{x}}_i^j - W_j{\mathbf{z}}_i){\mathbf{z}}_i^\top }{ \left( c^2 + \left\| {\mathbf{x}}_i^j - W_j {\mathbf{z}}_i \right\| _2^2 \right) }+ 2 \gamma W_j \right).$$
(20)
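A corresponding sketch of the update step for one view-conditional transformation matrix, following (19)–(20), is given below; names and shapes are assumptions.

```python
import numpy as np

def update_W(W_j, X_j, Z, gamma, mu, c=1.0):
    """One gradient-descent step on W_j following Eqs. (19)-(20).

    X_j is the (n, d_j) matrix of jth-view features; Z is the (n, d) matrix
    whose rows are the current intact vectors.
    """
    grad = 2.0 * gamma * W_j                                 # regularization term
    for x_ij, z_i in zip(X_j, Z):
        r = x_ij - W_j @ z_i                                 # residual of data point i
        grad += -2.0 * np.outer(r, z_i) / (c**2 + r @ r)     # Cauchy term of Eq. (19)
    return W_j - mu * grad                                   # Eq. (20)
```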

2.2.3 Updating \({\varvec{\omega }}\)

When we want to update \({\varvec{\omega }}\) to minimize the objective function of (8), we fix the other variables and only consider \({\varvec{\omega }}\). Thus the problem in (8) reduces to

$$\min _{{\varvec{\omega }}} \left\{ \alpha \sum _{i=1}^n \beta _i \left( 1 - y_i {\varvec{\omega }}^\top {\mathbf{z}}_i \right) + \gamma \Vert {\varvec{\omega }}\Vert _2^2 = h({\varvec{\omega }}) \right\}.$$
(21)

Please note that \(\beta _i\) is actually a function of \({\varvec{\omega }}\). However, similar to the strategy used to solve \({\mathbf{z}}_i\), we compute it according to the \({\varvec{\omega }}\) solved in the previous iteration and fix it to update \({\varvec{\omega }}\) in the current iteration. When \(\beta _i, i=1, \ldots , n\) are fixed, we update \({\varvec{\omega }}\) to minimize \(h({\varvec{\omega }})\) by using the gradient descent algorithm,

$${\varvec{\omega }}\leftarrow {\varvec{\omega }}- \mu \nabla h({\varvec{\omega }}),$$
(22)

where \(\nabla h({\varvec{\omega }})\) is the gradient function of \(h({\varvec{\omega }})\), and it is defined as follows,

$$\nabla h({\varvec{\omega }}) = \frac{\partial h({\varvec{\omega }})}{\partial {\varvec{\omega }}} =- \alpha \sum _{i=1}^n \beta _i y_i {\mathbf{z}}_i + 2 \gamma {\varvec{\omega }}.$$
(23)

By substituting (23) into (22), we have the final updating rule for \({\varvec{\omega }}\),

$${\varvec{\omega }}\leftarrow {\varvec{\omega }}- \mu \left( - \alpha \sum _{i=1}^n \beta _i y_i {\mathbf{z}}_i + 2 \gamma {\varvec{\omega }}\right) .$$
(24)
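The classifier update of (23)–(24) reduces to a single vectorized step; a sketch with assumed array shapes follows.

```python
import numpy as np

def update_w(w, Z, y, beta, alpha, gamma, mu):
    """One gradient-descent step on the classifier parameter, Eqs. (23)-(24).

    Z is (n, d), y and beta are (n,) vectors; beta holds the fixed indicators.
    """
    grad = -alpha * (beta * y) @ Z + 2.0 * gamma * w    # Eq. (23), beta fixed
    return w - mu * grad                                # Eq. (24)
```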

2.3 Iterative algorithm

After obtaining the updating rules of all the variables, we can design an iterative algorithm for the learning problem. This iterative algorithm has one outer FOR loop and two inner FOR loops. The outer FOR loop corresponds to the main iterations. The two inner FOR loops correspond to the updating of the n intact feature vectors of the n data points and the updating of the m view-conditional transformation matrices. The algorithm is given in Algorithm 1. The iteration number T is determined by cross-validation in our experiments.

  • Algorithm 1. Iterative algorithm for multiview intact and single-view classifier learning (MISC).

  • Input: Training data set, \(({\mathbf{x }}_1^1, \ldots , {\mathbf{x }}_1^m, y_1), \ldots , ({\mathbf{x }}_n^1, \ldots , {\mathbf{x }}_n^m, y_n)\).

  • Input: Tradeoff parameters, \(\alpha\) and \(\gamma\).

  • Input: Maximum iteration number, T.

  • Initialization: \({\mathbf{z}}_i^0, i=1,\ldots ,n\), \(W_j^0,j=1,\ldots , m\) and \({\varvec{\omega }}^0\).

  • For \(t=1,\ldots , T\)

    • Update descent step, \(\mu ^t \leftarrow \frac{1}{t}\)

    • For \(i=1,\ldots ,n\)

      Update \(\beta _i^t\) as follows,

      $$\begin{aligned} \beta _i^t = \left\{ \begin{matrix} 1, &\quad {\text{if}}\, 1 - y_i {{\varvec{\omega }}^{t-1}}^\top {\mathbf{z}}_i^{t-1} > 0\\ 0, &\quad {\text{otherwise}}. \end{matrix}\right. \end{aligned}$$
      (25)

      Update \({\mathbf{z}}_i^t\) by fixing \(W_j^{t-1}, j=1,\ldots , m\), \(\beta _i^{t}\) and \({\varvec{\omega }}^{t-1}\),

      $${\mathbf{z}}_i^t \leftarrow {\mathbf{z}}_i^{t-1} - \mu ^t \left( - \sum _{j=1}^m \frac{2 {W_j^{t-1}}^\top ( {\mathbf{x}}_i^j - {W_j^{t-1}} {\mathbf{z}}_i^{t-1} )}{\left( c^2 + \left\| {\mathbf{x}}_i^j - {W_j^{t-1}} {\mathbf{z}}_i^{t-1} \right\| _2^2\right) } - \alpha \beta _i^t y_i {\varvec{\omega }}^{t-1} + 2 \gamma {\mathbf{z}}_i^{t-1} \right) .$$
      (26)
    • End of For

    • For \(j=1,\ldots ,m\)

      Update \(W_j^t\) by fixing \({\mathbf{z}}_i^{t}, i=1,\ldots , n\),

      $$W_j^t \leftarrow W_j^{t-1} - \mu ^t \left( - \sum _{i=1}^{n} \frac{2({\mathbf{x}}_i^j - {W_j^{t-1}}{\mathbf{z}}_i^{t}){{\mathbf{z}}_i^t}^\top }{ \left( c^2 + \left\| {\mathbf{x}}_i^j - W_j^{t-1} {\mathbf{z}}_i^t \right\| _2^2 \right) }+ 2 \gamma W_j^{t-1} \right).$$
      (27)
    • End of For

    • Update \({\varvec{\omega }}^t\) by fixing \(\beta _i^t, i=1, \ldots , n\) and \({\mathbf{z}}_i^t, i=1, \ldots , n\),

      $${\varvec{\omega }}^t \leftarrow {\varvec{\omega }}^{t-1} - \mu ^t \left( - \alpha \sum _{i=1}^n \beta _i^t y_i {\mathbf{z}}_i^t + 2 \gamma {\varvec{\omega }}^{t-1} \right).$$
      (28)
  • End of For

  • Output: \(W_j^T, j=1, \ldots , m\), \({\mathbf{z}}_i^T, i=1, \ldots , n\), and \({\varvec{\omega }}^T\).

As we can see from the algorithm, in the main FOR loop, the descent step \(\mu\) is first updated; then the hinge loss indicator variables, \(\beta _i, i=1,\ldots ,n\), and the intact feature vectors are updated; next, the view-conditional transformation matrices, \(W_j, j=1, \ldots , m\), are updated; and finally, the classifier parameter \({\varvec{\omega }}\) is updated.
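Algorithm 1 can be assembled from the update steps sketched in Sects. 2.2.1–2.2.3. The following outline is a sketch that assumes the `update_z`, `update_W`, and `update_w` helpers defined above are in scope, and that `X_views[j]` is an (n, d_j) matrix of jth-view features; the random initialization and the intact dimension `d` are illustrative choices, not prescribed by the text.

```python
import numpy as np

def misc_train(X_views, y, d, alpha, gamma, T, c=1.0, seed=0):
    """Sketch of Algorithm 1 (MISC), using the update steps sketched above."""
    # assumes update_z, update_W, update_w from the sketches in Sects. 2.2.1-2.2.3
    rng = np.random.default_rng(seed)
    n, m = len(y), len(X_views)
    Z = rng.normal(scale=0.01, size=(n, d))        # intact vectors z_i (rows)
    Ws = [rng.normal(scale=0.01, size=(X_j.shape[1], d)) for X_j in X_views]
    w = np.zeros(d)                                # classifier parameter omega
    for t in range(1, T + 1):
        mu = 1.0 / t                               # descent step mu^t
        beta = np.zeros(n)
        for i in range(n):
            x_views_i = [X_j[i] for X_j in X_views]
            beta[i] = 1.0 if 1.0 - y[i] * (w @ Z[i]) > 0 else 0.0   # Eq. (25)
            Z[i] = update_z(Z[i], x_views_i, Ws, w, y[i], beta[i],
                            alpha, gamma, mu, c)                    # Eq. (26)
        for j in range(m):
            Ws[j] = update_W(Ws[j], X_views[j], Z, gamma, mu, c)    # Eq. (27)
        w = update_w(w, Z, y, beta, alpha, gamma, mu)               # Eq. (28)
    return Z, Ws, w
```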

3 Experiments

In this section, we evaluate the proposed algorithm experimentally on a few real-world supervised multiview learning problems.

3.1 Benchmark data sets

3.1.1 PASCAL VOC 07 data set

The first data set used in the experiments is the PASCAL VOC 07 data set [13]. This data set contains 9963 images of 20 different object classes. Each image is represented by two different views, a visual view and a tag view. To extract the feature vector of the visual view of an image, we extract local visual features (SIFT) from the image and represent them as a histogram. For the tag view, we use the histogram vector of the user tags of the image as the feature vector.

3.1.2 CiteSeer data set

The second data set is the CiteSeer data set [37]. This data set contains 3312 documents of 6 classes. Each document has three views: the text view, the inbound reference view, and the outbound reference view.

3.1.3 HMDB data set

The third data set is the HMDB data set, a video database for the human motion recognition problem [17]. This data set contains 6849 video clips of 51 action classes. To represent each video clip, we extract 3D Harris corners and describe them with two different types of local features, the histogram of oriented gradients (HOG) and the histogram of oriented flow (HOF). We then represent each clip by two feature vectors of two views, the histograms of the HOG and HOF features.

3.2 Experiment protocols

To conduct the experiments, we split each data set into 10 non-overlapping folds and use 10-fold cross-validation for the training–testing procedure. Each fold is used as the test set in turn, and the remaining 9 folds are used as the training set. The proposed algorithm is applied to the training set to obtain the view-conditional transformation matrices and the classifier parameter. The learned view-conditional transformation matrices and the classifier parameter are then used to represent and classify the data points in the test set. To handle the multiclass problem, we use the one-vs-all strategy.
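The text above does not spell out how a test point is mapped into the intact space before classification. One natural reading, sketched below purely as an assumption, is to estimate the test point's intact vector by minimizing the Cauchy reconstruction error plus the squared \(\ell _2\) regularizer with the learned \(W_j\) held fixed (the hinge term is dropped since the test label is unknown), and then classify with the sign of \({\varvec{\omega }}^\top {\mathbf{z}}\); the one-vs-all wrapper applies this per class.

```python
import numpy as np

def infer_intact(x_views, Ws, gamma, c=1.0, steps=100, mu=0.1):
    """Assumed test-time inference: gradient descent on the Cauchy reconstruction
    error plus the squared l2 regularizer, with the learned matrices Ws fixed."""
    z = np.zeros(Ws[0].shape[1])
    for _ in range(steps):
        grad = 2.0 * gamma * z
        for x_j, W_j in zip(x_views, Ws):
            r = x_j - W_j @ z
            grad += -2.0 * (W_j.T @ r) / (c**2 + r @ r)
        z -= mu * grad
    return z

def predict(x_views, Ws, w, gamma, c=1.0):
    """Binary prediction for one multiview test point."""
    z = infer_intact(x_views, Ws, gamma, c)
    return 1 if w @ z >= 0 else -1
```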

3.3 Performance measures

To measure the classification performance over the test set, we use the classification accuracy. The classification accuracy is defined as follows,

$${\text{Classification accuracy}}= \frac{{\text{Number of correctly classified test data points}}}{{\text{Number of total test data points}}}.$$
(29)

It is obvious that a better algorithm should be able to obtain a higher classification accuracy.

3.4 Experiment results

In this experiment, we first study the sensitivity of the algorithm to the parameters, which are \(\alpha\) and \(\gamma\).

3.4.1 Sensitivity to parameters

To study the performance of the proposed algorithm under different tradeoff parameters, \(\alpha\) and \(\gamma\), we run the algorithm with parameter values of 0.1, 1, 10, 100, and 1000 and measure the resulting classification performance. Figure 1 illustrates the performance on the PASCAL VOC 07 data set with respect to the tradeoff parameter \(\alpha\). The proposed algorithm achieves stable performance across all settings of \(\alpha\). In Fig. 2, the performance against the tradeoff parameter \(\gamma\) is also shown. From this figure, we can see that the algorithm is likewise stable to changes in the value of \(\gamma\). This suggests that MISC is not sensitive to the choice of the tradeoff parameters.

Fig. 1 Sensitivity curve of \(\alpha\) over the PASCAL VOC 07 data set

Fig. 2 Sensitivity curve of \(\gamma\) over the PASCAL VOC 07 data set

3.4.2 Comparison to state-of-the-art algorithms

We compare the proposed algorithm to the following methods: the multiview learning algorithm using local learning (LL) proposed by Zhang et al. [43], the multiview learning algorithm using co-training (CT) proposed by Sindhwani et al. [31], the multiview learning algorithm based on view disagreement (VD) proposed by Quadrianto [30], the multiview learning algorithm with global consistency and local smoothness (GL) proposed by Zhai [42], and the multiview representation method using statistical subspace learning (SS) proposed by Chen et al. [2]. The error bars of the classification accuracy of the compared methods over the three data sets are given in Figs. 3, 4 and 5. From the figures, we find that the proposed method, MISC, stably outperforms the other algorithms on all the data sets. Even on the most difficult data set, HMDB, MISC achieves an accuracy of about 0.4. The multiple view data are optimally combined by MISC to find the latent intact space and the optimal classifier in that space. The main reason for this is the robustness of the proposed algorithm: it can appropriately handle the complementarity between multiple views and learn a discriminative hidden intact space with the help of classifier learning.

Fig. 3 Results of comparison of different algorithms over the PASCAL VOC 07 data set

Fig. 4 Results of comparison of different algorithms over the CiteSeer data set

Fig. 5 Results of comparison of different algorithms over the HMDB data set

4 Conclusions

We propose a novel multiview learning algorithm that learns the intact vectors of the training data points and a classifier in the intact space. The intact vector is assumed to be a hidden but critical vector for each data point, from which its multiple view feature vectors can be obtained by view-conditional transformations. Moreover, we also assume that the intact vectors are discriminative, i.e., they can be separated by a linear function according to their classes. We propose a novel optimization problem to model the learning of both the intact vectors and the classifier, and develop an iterative algorithm to solve it. This algorithm outperforms other multiview learning algorithms on benchmark data sets and is stable with respect to the tradeoff parameters.