1 Introduction

1.1 Background: what is a multi-view dataset

Multi-view datasets are widely used in real-world applications. Each multi-view instance (data point) can be represented by multiple forms (views), and each view consists of multiple kinds of information (features). For example, a web page dataset consists of multiple web pages (instances), and each web page can be described by four forms (views), namely text, image, video, and audio. Taking the text view as an example, its description includes four features, namely size, color, content, and font [1].

For a multi-view dataset, Fig. 1 shows the mathematical description and Table 1 summarizes the notation. In this figure, red, dark blue, and green represent different views, blue indicates different instances, and light blue represents other views. Suppose that X is a multi-view dataset consisting of n instances. The ith instance \(x_i\) can be represented by v views, and its jth view \(x_i^j\) has \(d_j\) features. Letting \(x_{it}^{j}\) be the tth feature of \(x_i^j\), we have \(x_i^j=\left( \begin{array}{c} x_{i1}^{j} \\ x_{i2}^{j} \\ \vdots \\ x_{id_j}^{j}\\ \end{array} \right)\), and the dimension of \(x_i^j\) is \(d_j\times 1\). Based on \(x_i^j\), we let \(x_i=\left( \begin{array}{c} x_i^1 \\ x_i^2 \\ \vdots \\ x_i^v \\ \end{array} \right)\), whose dimension is \(d\times 1\) with \(d=\sum \nolimits _{j=1}^{v}d_j\). We then let \(X_{j}=(x_1^j,x_2^j,\ldots ,x_n^j)\) be the jth instance matrix of X, which consists of the jth views of all instances; the dimension of \(X_j\) is \(d_j \times n\). According to \(X_j\), \(X=\left( \begin{array}{c} X_1 \\ X_2 \\ \vdots \\ X_v \\ \end{array} \right)\) and its dimension is \(d\times n\).
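To make the notation concrete, the following minimal sketch (in Python/NumPy; the view dimensions and instance count are illustrative assumptions, not taken from any dataset in this paper) builds a toy multi-view dataset as a list of view matrices \(X_j\) of size \(d_j\times n\) and stacks them into the full matrix X of size \(d\times n\).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                      # number of instances
view_dims = [4, 3, 6]      # d_j for each of the v = 3 views (illustrative values)

# X_j is the j-th instance matrix: each column x_i^j is the j-th view of instance i.
X_views = [rng.standard_normal((d_j, n)) for d_j in view_dims]

# Stacking the view matrices row-wise gives X with dimension d x n, d = sum_j d_j.
X = np.vstack(X_views)
assert X.shape == (sum(view_dims), n)

# The i-th instance x_i is the i-th column of X; its j-th view x_i^j is the
# corresponding column of X_views[j].
i = 2
x_i = X[:, i]
x_i_view0 = X_views[0][:, i]
```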

Fig. 1 An expression and example of multi-view datasets

Table 1 Definition of a multi-view dataset

1.2 Background: what is a multi-view dataset with incomplete data

Multi-view learning machines have been developed and widely used in many fields, including multi-view clustering [2], handwritten digit recognition [3], human gait recognition [4], image recognition [5, 6], and so on [7]. However, all of these machines assume that the multi-view datasets are processed without any loss of view or feature information. This assumption is likely to be violated in practice for a number of reasons, including temporary sensor failures or man-made faults. Indeed, due to such reasons, four kinds of multi-view datasets exist in real-world applications. Figure 2 uses a dataset with 3 views for elaboration. In this figure, consider sub-figure ‘(a)-full data’ first. If all views and features are obtained for all instances, we regard this dataset as a multi-view dataset with full data (abbr. MVFD). In sub-figure ‘(b)-incomplete views,’ if instances lose the information of a whole view, we regard the dataset as a multi-view dataset with incomplete views (abbr. MVIV). In sub-figure ‘(c)-incomplete features,’ we define a multi-view dataset with incomplete features (abbr. MVIF) if instances only lose the information of some features. Finally, sub-figure ‘(d)-incomplete data’ shows a multi-view dataset with incomplete data (abbr. MVID), in which instances lose the information of both a whole view and some features.

Fig. 2 An example of four kinds of multi-view datasets. In this figure, question marks represent the lost information

For further description, we use Fig. 3, also given in [8], to show a real case. This figure takes a camera network as an example, where multiple cameras capture the same scene from different angles at the same time. As [8] notes, in common cases, multi-view learning machines utilize all information provided by these cameras for learning. However, some cameras could be temporarily out of action for natural or man-made reasons, and thus some multi-view instances that are missing some views will be obtained (see the question marks). Moreover, the cameras might be functional but suffer from occlusions, such that the views will have missing features (see the crying expressions). In the worst case, missing views and missing features can occur simultaneously, and we also regard such a case as missing data.

Fig. 3 Incomplete views and incomplete features in multi-view learning

1.3 Classical learning machines

In order to process MVIV, MVIF, and MVID, several algorithms have been developed. A straightforward strategy for handling missing views is to remove any instances with incomplete views. However, this significantly reduces the number of instances available for learning, and the removed instances may be useful for designing a better learning machine. Another classical approach is the conventional matrix completion algorithm, which operates on instance matrices. But when the distribution of the missing features is relatively concentrated, or when entire columns (rows) of features are missing from the matrix, some conventional matrix completion algorithms [9,10,11] become ineffective.

Thus, in order to overcome this disadvantage, Xu et al. developed multi-view learning with incomplete views (MVL-IV) [8]Footnote 1. In this learning machine, for each view, MVL-IV first forms the low-rank assumption matrix of the corresponding instance matrix. Then, MVL-IV decomposes the low-rank assumption matrix into a feature matrix and a coefficient matrix. The feature matrix can be regarded as another representation of the instance matrix, and the coefficient matrix can be used to transform the feature matrix into the instance matrix. In MVL-IV, different views have distinct feature matrices while sharing the same coefficient matrix. By minimizing the total difference between each low-rank assumption matrix and the product of its decomposition matrices over all views, MVL-IV tries to recover as many features or views as possible. Furthermore, by doing so, MVL-IV exploits the connections among multiple views and suggests that different views are generated from a shared subspace, which enables MVL-IV to estimate the incomplete views by integrating the information from the other observed views through this subspace.

1.4 Problems

In MVL-IV, there exist two main problems.

First, in MVL-IV, different views share the same coefficient matrix, which indicates that different views are generated from a shared subspace. This assumption is restrictive since different views possess diverse information and exhibit differences.

Second, in terms of the model, MVL-IV only pays attention to the quantity of recovery rather than its quality. According to Sect. 1.3, MVL-IV tries to recover as many features or views as possible, but whether the recovered data lead to good classification performance is not taken into account.

1.5 Proposal

In order to solve the above two problems, we use the following solutions and develop a new multi-view learning machine with incomplete data (NMVL-IV)Footnote 2.

First, in MVL-IV, different views share the same coefficient matrix and are generated from a shared subspace. In our work, we suggest that different views are generated from different subspaces and that each view has its own coefficient matrix.

Second, in our work, we pay attention not only to the quantity of recovery but also to its quality. Simply speaking, when we recover the missing data, we should minimize two quantities simultaneously: one is the total difference between each low-rank assumption matrix and the product of its decomposition matrices over all views, and the other is the classification error on the recovered data.

What’s more, in order to obtain the classification error for a multi-view dataset, we train a classifier for each view and name it a view-based classifier. By doing so, we encourage the recovered data to be classified correctly in all views as far as possible.

1.6 Contribution and intuition

Compared with MVL-IV, the proposed NMVL-IV makes at least two contributions. (1) NMVL-IV considers the distinct information of different views and does not regard them as coming from a shared subspace. (2) NMVL-IV recovers the missing data while guaranteeing both the quantity of recovery and its quality.

From these contributions, we can also draw the intuitions behind them. (1) For the recovery, different views should play different roles. (2) Recovering the missing data depends on both the low-rank assumption matrices and the influence of the view-based classifiers.

1.7 Structure

The rest of this paper is organized as follows. Section 2 reviews the related work on MVL-IV. Section 3 presents the framework of the proposed NMVL-IV. Experiments on NMVL-IV are reported in Sect. 4. Section 5 gives the conclusion and future work. Section 6 is the Appendix.

2 Framework of MVL-IV

MVL-IV is a method for recovering the missing data (including views and features) by exploiting the connections (e.g., the consistency and complementarity) among multiple views. Its framework is given as follows.

Suppose there is a multi-view dataset with n instances and v views, so it has v instance matrices \(\{X_1,\ldots ,X_v\}\) where \(X_j\in \mathbb {R}^{d_j\times n}\). Each \(X_j\) has a low-rank assumption matrix \(Z_j\in \mathbb {R}^{d_j\times n}\), and in MVL-IV, \(Z_j\) can be decomposed into the form \(Z_j=U_jW_j\), where \(U_j\in \mathbb {R}^{d_j\times r_j}\) is a feature matrix and \(W_j\in \mathbb {R}^{r_j\times n}\) is a coefficient matrix. Here, \(r_j\) is the rank of \(Z_j\). MVL-IV supposes that the ranks of the \(Z_j\)s are at most r, and its objective function is given in Eq. (1), where \(U_j\in \mathbb {R}^{d_j\times r}\), \(W_j\in \mathbb {R}^{r\times n}\), \(Z_j\in \mathbb {R}^{d_j\times n}\), and r is an integer that needs to be tuned. Moreover, \(\left| \left| \star \right| \right| _F\) is the Frobenius norm of a matrix. Furthermore, \(P_{O_j}(A)\) is the projection onto the subspace of sparse matrices with nonzeros restricted to the index set \(O_j\), which records the observed entries of \(X_j\) on the jth view.

$$\begin{aligned}&\min \limits _{U_j,W_j,Z_j}\quad \frac{1}{2}\sum \limits _{j=1}^{v}\left| \left| U_jW_j-Z_j\right| \right| _F^2 \\&\quad {s.t.} \quad P_{O_j}(Z_j)=P_{O_j}(X_j), \quad \forall j\in [1,v]. \end{aligned}$$
(1)

In MVL-IV, \(U_j\) is a feature matrix that serves as another representation of \(X_j\); its columns can be regarded as a set of features. \(W_j\) is a coefficient matrix that transforms \(U_j\) into its corresponding \(X_j\); each column of \(W_j\) can be treated as the coefficients for linearly combining these features to reconstruct the corresponding instance on the jth view.

MVL-IV further suggests that different views have distinct ‘feature’ matrices (i.e., \(\{U_j\}_{j=1}^{v}\)) but correspond to the same coefficient matrix (i.e., W), an assumption also used in [12,13,14]. Thus, Eq. (1) can be transformed into Eq. (2), which is the final objective function of MVL-IV.

$$\begin{aligned}&\min \limits _{U_j,W,Z_j}\quad \frac{1}{2}\sum \limits _{j=1}^{v}\left| \left| U_jW-Z_j\right| \right| _F^2 \\&\quad s.t. \quad P_{O_j}(Z_j)=P_{O_j}(X_j), \quad \forall j\in [1,v]. \end{aligned}$$
(2)

Then, in order to solve this objective function, MVL-IV adopts the successive over-relaxation (SOR) method to update the objective variables and compute the residual ratio until the algorithm converges. MVL-IV then obtains the optimal \(U_j\), W, and \(Z_j\), and thus the missing data can be recovered.
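As a rough illustration of the recovery idea behind Eq. (2) (not the exact SOR scheme of [8]), the hedged sketch below performs a simple alternating least-squares update of the \(U_j\)s, the shared W, and the \(Z_j\)s: the observed entries of \(X_j\) are kept fixed, and the unobserved entries are filled from \(U_jW\). All function and variable names, the iteration count, and the use of pseudo-inverses are assumptions for this sketch.

```python
import numpy as np

def mvl_iv_sketch(X_views, masks, r, n_iter=100, seed=0):
    """Toy alternating recovery in the spirit of Eq. (2).

    X_views : list of d_j x n arrays with missing entries set to 0.
    masks   : list of boolean d_j x n arrays, True where the entry is observed (O_j).
    r       : assumed common rank bound for the Z_j's.
    """
    rng = np.random.default_rng(seed)
    n = X_views[0].shape[1]
    U = [rng.standard_normal((Xj.shape[0], r)) for Xj in X_views]
    W = rng.standard_normal((r, n))
    Z = [Xj.copy() for Xj in X_views]

    for _ in range(n_iter):
        # Update each feature matrix U_j with W fixed (least squares).
        for j, Zj in enumerate(Z):
            U[j] = Zj @ W.T @ np.linalg.pinv(W @ W.T)
        # Update the shared coefficient matrix W with all U_j fixed.
        A = sum(Uj.T @ Uj for Uj in U)
        B = sum(Uj.T @ Zj for Uj, Zj in zip(U, Z))
        W = np.linalg.pinv(A) @ B
        # Update Z_j: keep the observed entries of X_j, fill the rest from U_j W.
        for j, (Xj, Oj) in enumerate(zip(X_views, masks)):
            Z[j] = np.where(Oj, Xj, U[j] @ W)
    return U, W, Z
```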

3 New multi-view learning machine with incomplete data (NMVL-IV)

In order to address the problems of MVL-IV, we develop NMVL-IV, whose framework is given as follows.

3.1 Data preparation

Suppose X is a multi-view dataset with n instances and v views. The definitions of views, features, instances, instance matrices, and data are the same as those given in Sect. 1.1. Namely, \(x_{it}^{j}\) is the tth feature of \(x_i^j\), where \(x_i^j\) is the jth view of the ith instance \(x_i\). Based on these \(x_i^j\)s (\(i=1,2,\ldots ,n\)), \(X_{j}=(x_1^j,x_2^j,\ldots ,x_n^j)\) is the jth instance matrix. Then X can be represented by n instances \(\{x_1,\ldots ,x_i,\ldots ,x_n\}\) or v instance matrices \(\{X_1,\ldots ,X_j,\ldots ,X_v\}\). What’s more, for \(x_i^j\), its real class label is \(y_i^j\). If \(y_i^j=1\) (or \(y_i^j=-1\)), \(x_i^j\) actually belongs to the positive (or negative) class.

3.2 Goal and model

According to what we said before, the idea behind NMVL-IV to recover data involves two aspects. One is the diversification of views, and the other is quantity and quality of recovery.

First, in real-world applications, different views of a multi-view dataset are generated from different subspaces, possess diverse structures and information, and play different roles in the recovery. Moreover, if the dataset has incomplete views or features, different views always exert their respective influence. In other words, since different views are supposed to be generated from different subspaces, each should have its own corresponding coefficient matrix rather than sharing the same one. With this operation, when we recover a multi-view dataset with incomplete views, the recovery for each view keeps its particularity. For example, when we recover a web page dataset, the recovery for the text view accords more with the characteristics of text, and the recovery for the video view accords more with the characteristics of video. If we utilize MVL-IV for recovery, its procedure enforces uniformity and does not preserve this uniqueness, which does not accord with the diversity of a multi-view dataset.

Second, in terms of recovery, two objectives should be considered: a large quantity and a high quality. A large quantity means that when we recover the data with incomplete views, more of the missing information should be recovered. If this objective is not achieved, then even if we select a strong classifier, the classification results will not improve, since more information, including useful information, is lost; this does not guide the training of a classifier well. A high quality means that when we recover the data with incomplete views, the similarity between the recovered information and the missing information should be high. If this objective is not achieved, then even if we recover a large amount of missing information, the recovered information cannot represent the original data because the similarity is low, and it may disturb the training of the classifier. Thus, guaranteeing both the quantity and the quality of recovery is instrumental when processing a multi-view dataset with incomplete views. If we use MVL-IV for recovery, we can only guarantee the quantity of recovery, not its quality.

Thus, in order to realize these two aspects, we adopt the following methods.

For the first aspect, we suppose each view has its own coefficient matrix rather than sharing the same one. Concretely speaking, as stated in Sect. 2, for each view, its instance matrix \(X_{j}\) has a low-rank assumption matrix \(Z_j\) which can be decomposed into the form \(Z_j=U_jW_j\). Then different views should have different \(W_j\)s, and when we design the objective function of NMVL-IV, we should consider the \(W_j\)s rather than the single shared coefficient matrix W used in MVL-IV.

For the second aspect, in order to measure the quantity of recovery, we follow the idea of MVL-IV, namely minimizing the difference between \(Z_j\) and \(U_jW_j\) over all views, i.e., \(\sum \nolimits _{j=1}^{v} \left| \left| U_jW_j-Z_j\right| \right| _F^2\). In order to measure the quality of recovery, we utilize view-based classifiers. Namely, for each view, its real label matrix is \(Y_j=diag(y_1^j,\ldots ,y_i^j,\ldots ,y_n^j)\in \mathbb {R}^{n\times n}\) and the corresponding classifier is \(f(U_j,W_j)\). Here, diag represents the diagonalization operation, and the output of this classifier is the predicted label matrix \(\bar{Y_j}\in \mathbb {R}^{n\times n}\). Then \(\sum \nolimits _{j=1}^{v} \left| \left| f(U_j,W_j)-Y_j\right| \right| _2^2\) can be used to measure the quality of recovery, where \(\left| \left| \star \right| \right| _2\) is the 2-norm. A smaller value means the recovered data achieve better classification performance across all views.

Based on the above two aspects, the goal of our NMVL-IV is to minimize these two terms simultaneously, so we formulate the model of NMVL-IV as Eq. (3). Since we treat the two aspects as equally important, the balance parameters of both terms are set to \(\frac{1}{2}\). Moreover, in MVL-IV, the authors use \(O_j\) to record the existence of the features in \(X_j\), but this operation is somewhat tedious [8]. Thus, in our work, we make an adjustment. Since the element 0 has no influence on the final optimal results, for the incomplete data \(X_j\), we use 0 to replace the missing entries so that \(X_j\) and the related \(U_j\), \(W_j\), \(Z_j\) can appear in the objective function, i.e., Eq. (3).

$$\begin{aligned}&\min \limits _{U_j,W_j,Z_j} \\&\quad J=\frac{1}{2}\sum \limits _{j=1}^{v} \left( \left| \left| U_jW_j-Z_j\right| \right| _F^2+\left| \left| f(U_j,W_j)-Y_j\right| \right| _2^2\right) \end{aligned}$$
(3)

3.3 Realization

In Sect. 3.2, we introduced the model of NMVL-IV. However, in order to realize this model, we must specify the form of the view-based classifiers. As we know, there are many kinds of classifiers, for example linear classifiers [15, 16] and nonlinear classifiers [17, 18], and the best way to process a specific classification task and dataset is to select the most feasible classifier. Here, in order to illustrate how to realize the model in a simple and convenient way, we select a classical linear classifier form to represent \(f(U_j,W_j)\) and replace \(f(U_j,W_j)-Y_j\). See Eq. (4), where \(p^j=\{p_1^j,\ldots ,p_t^j,\ldots ,p_{d_j}^j\}\in \mathbb {R}^{1\times d_j}\) is the weight vector of the jth view, \(v^j\in \mathbb {R}^{1\times 1}\) is the bias of the jth view, and \(b_i^j\in \mathbb {R}^{1\times 1}\) is a slack parameter for the corresponding \(x_i^j\). Although this form may not be feasible for some specific problems, if we can obtain a better recovery performance than MVL-IV with this linear classifier form, we can expect that the model of NMVL-IV will bring an even better performance on specific problems once feasible classifier forms are selected.

$$\begin{aligned} f(U_j,W_j)-Y_j=\sum \limits _{i=1}^{n} (y_i^j({p^j}x_i^j+v^j)-1-b_i^j)^2 \end{aligned}$$
(4)

What’s more, letting \(P_j=p^j\), \(V_j=(v^j,v^j,\ldots ,v^j)\in \mathbb {R}^{1\times n}\), \(I_j=(1,1,\ldots ,1)\in \mathbb {R}^{1\times n}\), and \(B_j=(b_1^j,\ldots ,b_i^j,\ldots ,b_n^j)\in \mathbb {R}^{1\times n}\), the objective function of NMVL-IV, namely Eq. (3) with the classifier form of Eq. (4) substituted, can be transformed as follows.

$$\begin{aligned}&\min _{U_j,W_j,Z_j,P_j,V_j,B_j}\quad J \\&\quad =\frac{1}{2}\sum \limits _{j=1}^{v}\left( \left| \left| U_jW_j-Z_j\right| \right| _F^2+\left| \left| (P_jU_jW_j+V_j)Y_j\right. \right. \right. \\&\qquad \left. \left. \left. -I_j-B_j\right| \right| _2^2\right) \end{aligned}$$
(5)

Since \(\left| \left| U_jW_j-Z_j\right| \right| _F^2=tr[(U_jW_j-Z_j)^T(U_jW_j-Z_j)]\) and \(\left| \left| (P_jU_jW_j+V_j)Y_j-I_j-B_j\right| \right| _2^2=[(P_jU_jW_j+V_j)Y_j-I_j-B_j][(P_jU_jW_j+V_j)Y_j-I_j-B_j]^T\) where tr is the trace of a matrix, Eq. (5) is transformed to Eq. (6).

$$\begin{aligned}&\min \limits _{U_j,W_j,Z_j,P_j,V_j,B_j} \\&\quad J=\frac{1}{2}\sum \limits _{j=1}^{v}\left\{ tr[(U_jW_j-Z_j)^T(U_jW_j-Z_j)]\right. \\&\qquad +\left[ (P_jU_jW_j+V_j)Y_j-I_j-B_j\right] \left[ \left( P_jU_jW_j\right. \right. \\&\qquad \left. \left. \left. +V_j\right) Y_j-I_j-B_j\right] ^T\right\} \end{aligned}$$
(6)
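As a hedged illustration of what Eq. (6) computes, the following sketch (Python/NumPy; all variable names, shapes, and the container layout are assumptions for this sketch) evaluates the objective J for given per-view variables, using the trace form of the quantity term and the linear view-based classifier of Eq. (4) for the quality term.

```python
import numpy as np

def nmvl_iv_objective(U, W, Z, P, V, B, Y):
    """Evaluate J in Eq. (6) for per-view variable lists.

    For each view j: U[j] is d_j x r, W[j] is r x n, Z[j] is d_j x n,
    P[j] is 1 x d_j, V[j] and B[j] are 1 x n, and Y[j] is the n x n
    diagonal label matrix diag(y_1^j, ..., y_n^j).
    """
    J = 0.0
    for Uj, Wj, Zj, Pj, Vj, Bj, Yj in zip(U, W, Z, P, V, B, Y):
        n = Zj.shape[1]
        Ij = np.ones((1, n))
        # Quantity-of-recovery term: ||U_j W_j - Z_j||_F^2 written as a trace.
        R = Uj @ Wj - Zj
        J += np.trace(R.T @ R)
        # Quality-of-recovery term: squared error of the linear view-based classifier.
        E = (Pj @ Uj @ Wj + Vj) @ Yj - Ij - Bj   # 1 x n row vector
        J += float(np.sum(E * E))
    return 0.5 * J
```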

Since six groups of variables need to be optimized in Eq. (6), i.e., \(U_j,W_j,Z_j,P_j,V_j,B_j\), we adopt the alternating minimization strategy to solve it. Simply speaking, we first fix \(P_j\), \(V_j\), and \(B_j\) and compute the partial derivatives of J w.r.t. \(U_j\), \(W_j\), and \(Z_j\). Setting \(\frac{\partial J}{\partial U_j}\), \(\frac{\partial J}{\partial W_j}\), and \(\frac{\partial J}{\partial Z_j}\) to zero gives the solutions of \(U_j\), \(W_j\), and \(Z_j\). Second, we fix \(U_j\), \(W_j\), and \(Z_j\) and compute the partial derivatives of J w.r.t. \(P_j\), \(V_j\), and \(B_j\) so as to obtain their solutions. Details of the derivations can be found in the Appendix, namely Sect. 6.

After updating \(U_j\), \(W_j\), \(Z_j\), \(P_j\), \(V_j\), and \(B_j\), we can compute J. Let J(k) denote the value of J at the kth iteration and \(J(k+1)\) its value at the \((k+1)\)th iteration. If the stop criterion, i.e., Eq. (7), is satisfied, where \(\varepsilon\) is a small threshold, we say that the optimal \(U_j\), \(W_j\), \(Z_j\), \(P_j\), \(V_j\), and \(B_j\) have been obtained.

$$\begin{aligned} \frac{\left| \left| J(k+1)-J(k)\right| \right| _2}{\left| \left| J(k)\right| \right| _2}<\varepsilon \end{aligned}$$
(7)

When we obtain the optimal solutions of \(U_j\) and \(W_j\), i.e., \(U_j^o\) and \(W_j^o\), we can use \(Z_j^o=U_j^oW_j^o\) to recover the missing data. Each column of \(Z_j^o\) represents a recovered instance and corresponds to an instance in \(X_j\). Similarly, once we obtain the optimal \(P_j\), \(V_j\), and \(B_j\), the optimal view-based classifiers are also obtained.
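The overall procedure can be sketched as the following outer loop (Python; a minimal sketch, assuming the per-view closed-form updates of the Appendix are available as callables and that the variables live in some container — all names here are placeholders, not part of the method's specification).

```python
def nmvl_iv_train(variables, update_uwz, update_pvb, objective, eps=1e-4, max_iter=200):
    """Alternating-minimization loop with the stop criterion of Eq. (7).

    `update_uwz` and `update_pvb` are placeholder callables implementing the
    per-view closed-form updates derived in the Appendix (Eqs. (10)-(23));
    `objective` evaluates J of Eq. (6).
    """
    J_prev = objective(variables)
    for _ in range(max_iter):
        variables = update_uwz(variables)   # fix P_j, V_j, B_j; solve for U_j, W_j, Z_j
        variables = update_pvb(variables)   # fix U_j, W_j, Z_j; solve for P_j, V_j, B_j
        J_curr = objective(variables)
        # Stop criterion of Eq. (7): relative change of J below a small threshold.
        if abs(J_curr - J_prev) / max(abs(J_prev), 1e-12) < eps:
            break
        J_prev = J_curr
    return variables
```

After convergence, the recovered instance matrix of each view would be formed as \(Z_j^o=U_j^oW_j^o\), as described above.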

3.4 Further discussion

Here, we discuss six issues about our NMVL-IV further.

The first issue is that, as stated above, once the stop criterion, i.e., Eq. (7), is satisfied, we obtain the optimal \(U_j\), \(W_j\), \(Z_j\), \(P_j\), \(V_j\), and \(B_j\) for the jth view. But how do we know that these are optimal when much of the information is missing? Three reasons support this claim. (1) To solve Eq. (6), we adopt the alternating minimization strategy, which, as many references have noted, is a feasible method for solving constrained minimization problems [8, 19, 20]. (2) According to Eq. (6), we minimize two aspects: one concerns the distinct information of the views, and the other concerns the quantity and quality of recovery. Therefore, when we solve this equation, the missing data are being recovered simultaneously; even though much information is missing before the solution, this does not affect the optimization procedure. (3) MVL-IV, which is the basis of our NMVL-IV, uses a similar way to update the low-rank assumption matrix and its decomposition matrices, and once its optimal results are obtained, the missing data are also recovered simultaneously. According to these three reasons, even though much information is missing before the optimization, we can still obtain the optimal results with the alternating minimization strategy.

The second issue is how the number of missing views is assessed at the beginning of the optimization process. According to the first issue and the optimization procedure, recovery and optimization are carried out simultaneously: once we obtain the optimal results, the missing data are also recovered. Moreover, according to Sects. 3.2 and 3.3, we do not need to know beforehand how many views or features are missing. In the later experiments, we will see that no matter how much data are lost, our NMVL-IV still yields the best recovery performance.

The third issue is why we do not take the normalization of features in different views into consideration. As we know, for a multi-view dataset, features in different views may have different scales, which complicates the design of classifiers. Two kinds of solutions have been developed. The first solution is to find optimized projection matrices such that features can be normalized to the same dimensionality. For example, let \(X_1\in \mathbb {R}^{d_1\times n}\) and \(X_2\in \mathbb {R}^{d_2\times n}\) be two instance matrices with different dimensions. We can find two optimized projection matrices \(W_1\in \mathbb {R}^{d_1\times q}\) and \(W_2\in \mathbb {R}^{d_2\times q}\) and then use \(W_1^TX_1\in \mathbb {R}^{q\times n}\) and \(W_2^TX_2\in \mathbb {R}^{q\times n}\) as the new representations of the original features; after the projection, the original features with different dimensions are transformed into new forms with the same dimension (a minimal illustration is given after this paragraph). Widely used methods related to this solution include canonical correlation analysis (CCA) [21], locality preserving CCA (LPCCA) [22], local discriminant CCA (LDCCA) [23], canonical sparse cross-view correlation analysis (CSCCA) [24], and globalized and localized canonical correlation analysis (GLCCA) [25]. The second solution is to design an objective function for each view individually and sum them over all views. For example, for \(X_1\in \mathbb {R}^{d_1\times n}\) and \(X_2\in \mathbb {R}^{d_2\times n}\), we construct an objective function for each view, say \(\min\; f(X_1)\) and \(\min\; f(X_2)\), and then sum them to construct the final one, namely \(\min\; f(X_1)+f(X_2)\). In this solution, since \(f(X_j)\) only involves \(X_j\) and the dimensions of the parameters in this individual objective function match those of \(X_j\) and are unrelated to the other instance matrices, we do not need to normalize features in different views beforehand. Widely used methods related to this solution include MVL-IV [8], the semi-supervised multi-view multi-label classification learning method based on nonnegative matrix factorization (NMF-SSMM) [26], and semi-supervised dimension reduction for multi-label and multi-view learning (SSDR-MML) [27]. In general, both kinds of solutions are able to process multi-view classification tasks, but compared with the second solution, methods of the first kind hide a potential flaw: if the new representations are not generated by feasible projection matrices, the performance will be worse than that obtained with the original features. Moreover, since MVL-IV also adopts the second solution and does not normalize the features, NMVL-IV uses the same solution, namely designing the objective function for each view individually and summing over all views.
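The following minimal sketch (Python/NumPy) illustrates the first normalization strategy only, i.e., projecting two views of different dimensions into a shared q-dimensional space. The sizes are arbitrary, and the projection matrices here are random placeholders; CCA-style methods would learn them from data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, q = 8, 10, 6, 4           # illustrative sizes

X1 = rng.standard_normal((d1, n))    # view 1: d1 x n
X2 = rng.standard_normal((d2, n))    # view 2: d2 x n

# In CCA-like methods, W1 and W2 are learned; random matrices stand in here.
W1 = rng.standard_normal((d1, q))
W2 = rng.standard_normal((d2, q))

X1_new = W1.T @ X1                   # q x n
X2_new = W2.T @ X2                   # q x n: both views now share dimension q
```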

The fourth issue is why we put all view data together to train the model rather than training the model on each view individually. As stated in the third issue, we design the objective function for each view individually and sum over all views to train the model. The reasons for this operation are twofold: (1) our basic method MVL-IV also puts all view data together to train the model rather than training on each view individually; (2) putting all view data together allows the information of all views to be considered simultaneously, whereas training on each view individually does not. For these reasons, we put all view data together to train the model.

The fifth issue is how to process instances with specific forms. Our NMVL-IV can process incomplete data that cover text, image, video, and audio instances. These kinds of instances are collected in different forms, so how to process them is important. Since, with the aid of computer technology, text, image, video, and audio can be transformed into numeric matrices or vectors, we can also use mathematical methods to process incomplete data represented by text, image, video, and audio.

The sixth issue is the difference between our NMVL-IV and the synthetic minority over-sampling technique (SMOTE) [28] and other related re-sampling methods, including the keep it simple and straightforward metric (KISSME) learning method [29] and the selection-based re-sampling ensemble (SRE) [30]. As we know, in order to design a high-performance classifier, useful and sufficient information is necessary. But due to temporary sensor failures or man-made faults, some instances will lose useful views or features. To overcome this issue, two kinds of strategies can be chosen: one is recovery and the other is re-sampling. Recovery means that one recovers the missing information based on the collected data, while re-sampling means that one repeatedly collects data from a source. Although re-sampling can guarantee the number of instances, it cannot guarantee that the recollected instances possess useful and sufficient information if the sampling equipment is still out of work. Different from those re-sampling methods, when we collect a dataset and recover the incomplete data with consideration of both the quantity and the quality of recovery, our NMVL-IV can guarantee the usefulness and sufficiency of the information. This is the main difference between ours and other re-sampling methods.

3.5 Final flow chart

For convenience, Fig. 4 shows the final flow chart of our NMVL-IV, and we briefly summarize the flow of NMVL-IV as follows.

Fig. 4 Flow chart of NMVL-IV

For a collected multi-view dataset, we first decompose the low-rank assumption matrix of each instance matrix into a feature matrix and a coefficient matrix. Second, we construct the specific objective function and initialize the parameters. Third, we update the parameters with Eqs. (10)–(23). After one iteration of updates, if the stop criterion is satisfied, we use the optimal parameters to recover the missing data and design the optimal view-based classifiers; otherwise, we start the next iteration and update the parameters again until convergence.

4 Experiments

4.1 Experimental setting

In order to show the effectiveness of the proposed NMVL-IV, we conduct experiments on several real-world applications, including clustering, regression, classification, bipartite ranking, and image retrieval. Among these applications, clustering, regression, and classification are also considered for MVL-IV [8], so the corresponding missing-data (view/feature) settings can be found in [8]; the missing-data settings for bipartite ranking and image retrieval also follow [8]. As we know, in order to process different applications effectively, we need to select feasible models, and in general these models differ. But as stated in Sect. 3.3, in order to illustrate how to realize the model in a simple and convenient way, we select a classical linear classifier form. Although this linear model is only feasible for simple tasks, for example when the pending data are linearly separable, if NMVL-IV with the linear model brings a better recovery performance than MVL-IV and ultimately processes these tasks with better results, we can expect that NMVL-IV with feasible models will process these specific problems and tasks even better.

Moreover, for comparison, several recovery algorithms are utilized, including the shared Gaussian process latent variable model (SGPLVM) [31], OptSpace [11], MVL-IV [8], self-representation-based matrix completion methods (SR-LS, SR-LR, and SR-Sp)Footnote 3 [32], and nonlinear matrix completion (NLMC) [33]. Besides these recovery algorithms, another algorithm, the multi-label transduction matrix completion model (MTMC) [34], is also widely used, but MTMC aims to process multi-label tasks rather than multi-view tasks, so we do not select MTMC for comparison in this work. Moreover, after recovery, some methods including weight-based canonical sparse cross-view correlation analysis (WCSCCA) [35] can be used to further preprocess the recovered data. Indeed, some other methods, including CCA [21], the convex multi-view subspace learning algorithm (MCSL) [36], and the factorized latent spaces with structured sparsity algorithm (FLSSS) [37], can also be used to further process the recovered data; however, when we choose CCA, MCSL, or FLSSS for further processing, the conclusions are similar to those obtained with WCSCCA, and reference [35] has validated that WCSCCA outperforms CCA. Thus, we only choose WCSCCA for further processing here. In our experiments, for clarity of description, terms of the form ‘A-WCSCCA’ indicate that we recover the data with recovery algorithm ‘A’ first and then use WCSCCA for further processing, while terms of the form ‘A’ indicate that we recover the data with recovery algorithm ‘A’ without WCSCCA.

Furthermore, several learning machines and datasets are used in these real-world applications; their details are introduced in the respective experimental subsections.

4.2 Comparison about clustering problem

Here, we select three benchmark datasets for experiments: Mfeat, Reuters, and Corel [1]. Detailed information about these three datasets can be found in [1]; we briefly review them as follows. (1) MfeatFootnote 4 consists of handwritten digits (0-9) [38], and each instance consists of six views, i.e., Fourier coefficients of the character shapes (fou), profile correlations (fac), Karhunen–Loève coefficients (kar), pixel averages in \(2\times 3\) windows (pix), Zernike moments (zer), and morphological features (mor). (2) ReutersFootnote 5 consists of machine-translated documents written in five different languages, i.e., English (EN), French (FR), German (GR), Italian (IT), and Spanish (SP). Each language can be treated as a view [39, 40], and each document can be translated from one language to another. The documents are also categorized into six different topics. (3) CorelFootnote 6 is extracted from a Corel image collection [38] and consists of 68040 photographs from various categories. In our experiments, we randomly select 1000 photographs from 10 categories, with 100 photographs per category. The 10 categories are C0-Africa, C1-Beach, C2-Building, C3-Buses, C4-Dinosaurs, C5-Elephants, C6-Flowers, C7-Horses, C8-Mountains, and C9-Food. For this dataset, four views are given: color histogram (abbr. Col-h), color histogram layout (abbr. Col-hl), color moments (abbr. Col-m), and co-occurrence texture (abbr. Coo-t). For these datasets, digits, topics, and categories are treated as classes.

For each dataset with n instances, we randomly select \(\left| \varOmega \right|\) instances and randomly remove one view from each selected instance. Then we use NMVL-IV, SGPLVM [31], OptSpace [11], MVL-IV [8], SR-LS, SR-LR, SR-Sp [32], and NLMC [33] to recover the missing views. Among these algorithms, the proposed NMVL-IV and MVL-IV can recover the missing views and conduct multi-view learning simultaneously, while the other algorithms have to recover the missing views before conducting multi-view learning. After recovering the missing views for these datasets, we can use WCSCCA [35] to further process the recovered data. After that, we use kernel-based weighted multi-view clustering (KWMVC) [2] and weighted multi-view clustering (WMVC) [1] to accomplish the subsequent clustering task. In order to evaluate the clustering performance, accuracy, true-positive rate (\(acc^{+}\)), true-negative rate (\(acc^{-}\)), positive predictive value (PPV), F-measure, G-mean, and normalized mutual information (NMI) are used; the first six criteria can be found in [41] and the last one in [42]. Moreover, the clustering time (CT) (in seconds) is also reported. The parameter settings of KWMVC and WMVC follow the respective references.
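As an illustration of the evaluation step only, the snippet below (Python with scikit-learn; the label arrays are placeholders and the clustering itself is assumed to have been produced by KWMVC or WMVC beforehand) shows how NMI can be computed for a clustering result.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Placeholder ground-truth classes and cluster assignments for six instances.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])

nmi = normalized_mutual_info_score(y_true, y_pred)
print(f"NMI = {nmi:.3f}")   # 1.0 here, since the partitions match up to relabeling
```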

Figures 5, 6, 7, 8, 9, and 10 show the performances of KWMVC and WMVC with different recovery algorithms on the different datasets, and Figs. 11, 12, 13, 14, 15, and 16 show the performances of KWMVC and WMVC with different recovery algorithms and WCSCCA on the different datasets.

Fig. 5 Performances on Mfeat and recovery algorithms used with KWMVC

Fig. 6 Performances on Reuters and recovery algorithms used with KWMVC

Fig. 7 Performances on Corel and recovery algorithms used with KWMVC

Fig. 8 Performances on Mfeat and recovery algorithms used with WMVC

Fig. 9 Performances on Reuters and recovery algorithms used with WMVC

Fig. 10 Performances on Corel and recovery algorithms used with WMVC

According to the experimental results, it is found that no matter which dataset and which clustering algorithm are used, the clustering criteria except CT show downward trends on average as \(\left| \varOmega \right| /n\) increases. Moreover, in terms of the first seven criteria, NMVL-IV brings a better clustering performance. In terms of CT, the clustering time with NMVL-IV is not the longest. With WCSCCA used, the performances of the compared clustering methods are also enhanced in general. Although in some cases the performance with WCSCCA is not the best, this does not affect the average results and final conclusions. All the clustering results indicate that the proposed NMVL-IV can recover the missing views well and leads to the best clustering performance on average.

Fig. 11 Performances on Mfeat and recovery algorithms used with WCSCCA and KWMVC

Fig. 12 Performances on Reuters and recovery algorithms used with WCSCCA and KWMVC

Fig. 13 Performances on Corel and recovery algorithms used with WCSCCA and KWMVC

Fig. 14 Performances on Mfeat and recovery algorithms used with WCSCCA and WMVC

Fig. 15 Performances on Reuters and recovery algorithms used with WCSCCA and WMVC

Fig. 16 Performances on Corel and recovery algorithms used with WCSCCA and WMVC

4.3 Comparison about regression problem

Here, the FG-Net Aging [8, 43] and Hotel Review datasets are used for experiments. As noted in [8], the FG-Net Aging dataset is commonly used for age estimation, while Hotel Review is commonly used for the rating prediction problem. The dimension and view information can be found in [8]. We regard age estimation and rating prediction as two regression problems and compare the recovery algorithms on them. For each algorithm, the support vector regression model (SVR) [44] and interpretable regression trees (IRT) [45] are utilized as the downstream regression models. Each algorithm is evaluated using the leave-one-person-out strategy on the FG-Net Aging dataset [46] and fivefold cross-validation on the Hotel Review dataset.

In the regression task, we consider the missing-feature setting on both the training and test sets. Generally speaking, for each instance matrix \(X_j\), we remove \(\left| \varOmega \right|\) features from each column (i.e., \(x_i^j\)). Then we use the recovery algorithms to recover the missing features (and WCSCCA for further processing). The proposed NMVL-IV and MVL-IV recover the missing features and conduct multi-view learning simultaneously, while the other algorithms must recover the missing features before conducting multi-view learning. After reconstructing the missing features on each view, we can conduct the regression problems.

First, for the age estimation problem, let the estimated age be \(\hat{y_i}\) and the ground-truth age be \(y_i\); we then use the mean absolute error (MAE) [see Eq. (8)] as the evaluation measure. A lower MAE means the estimated ages are closer to the ground-truth ages. Tables 2 and 3 show the MAEs of the compared recovery algorithms on the FG-Net dataset with SVR and IRT, respectively. From these tables, it is found that the proposed NMVL-IV has the smallest MAE, which means NMVL-IV can effectively reconstruct missing features and conduct multi-view learning in an integrated framework. What’s more, MAE increases with \(\left| \varOmega \right| /n\), indicating that more missing features bring worse recovery results and reduced regression performance. Furthermore, with WCSCCA used, the MAE is smaller and the regression performance is better in every case.

$$\begin{aligned} MAE=\frac{1}{n}\sum \limits _{i=1}^{n}\left| \hat{y_i}-y_i\right| \end{aligned}$$
(8)
Table 2 MAEs of the compared recovery algorithms on the FG-Net dataset with SVR used
Table 3 MAEs of the compared recovery algorithms on the FG-Net dataset with IRT used

Second, for rating prediction problem, we use Eq. (9) to measure the performance.

$$\begin{aligned} R2=1-\frac{\sum \nolimits _{i=1}^n(\hat{y_i}-y_i)^2}{\sum \nolimits _{i=1}^n(y_i-\bar{y_i})^2} \end{aligned}$$
(9)

where \(\bar{y_i}\) is the average of the \(y_i\)s. Different from MAE, a larger R2 indicates a better regression performance. Tables 4 and 5 show the related experimental results. From these tables, it is found that NMVL-IV has the best R2, which also validates the effectiveness of our NMVL-IV. Moreover, R2 decreases as \(\left| \varOmega \right| /n\) increases, which again indicates that more missing features lead to worse recovery results and regression performance. Similarly, with WCSCCA used, the R2 is larger and the regression performance is better in every case.
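For completeness, the two regression criteria of Eqs. (8) and (9) can be computed directly from predictions as in the following sketch (Python/NumPy; the arrays are placeholders, not values from the paper's experiments).

```python
import numpy as np

y_true = np.array([23.0, 31.0, 45.0, 52.0])   # placeholder ground-truth targets
y_pred = np.array([25.0, 30.0, 40.0, 55.0])   # placeholder predictions

mae = np.mean(np.abs(y_pred - y_true))        # Eq. (8): mean absolute error
r2 = 1.0 - np.sum((y_pred - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (9)
```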

Table 4 Prediction R2 of the compared recovery algorithms on the Hotel Review dataset with SVR used
Table 5 Prediction R2 of the compared recovery algorithms on the Hotel Review dataset with IRT used

4.4 Comparison about classification problem

Here, we use Mfeat, Reuters, Corel, and TRECVID2003 [8] for experiments and take advantage of the multiple-view multiple learner (MVML) [3] and the support vector machine (SVM) [47] to classify the recovered datasets. Concretely speaking, the settings of Mfeat, Reuters, and Corel are the same as before, and TRECVID2003 is composed of 1078 manually labeled video shots belonging to five categories; its setting is given in [8]. For each dataset, 70\(\%\) of the instances are used for training and the rest for testing. For these datasets, we consider the missing-data setting with the following two steps. First, for each dataset, we randomly select \(\left| \varOmega \right|\) instances from the n instances and randomly remove one view from each selected instance. Second, for each instance matrix \(X_j\), we remove \(\left| \varOmega \right|\) features from each remaining column (i.e., each remaining \(x_i^j\)); a remaining \(x_i^j\) is one that has not been removed in the first step.
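A hedged sketch of this two-step missing-data setting is given below (Python/NumPy; the function name, the toy sizes, and the choice of removal counts are illustrative assumptions, not the exact experimental script).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_missing(X_views, n_omega, n_feat_missing, rng):
    """Two-step missing-data setting (counts here are illustrative).

    Step 1: for n_omega randomly chosen instances, remove one randomly chosen view.
    Step 2: for every remaining column of each view, remove n_feat_missing features.
    Missing entries are marked False in the returned masks and zeroed in the data.
    """
    v = len(X_views)
    n = X_views[0].shape[1]
    masks = [np.ones_like(Xj, dtype=bool) for Xj in X_views]

    chosen = rng.choice(n, size=n_omega, replace=False)
    for i in chosen:                       # step 1: drop a whole view of instance i
        j = rng.integers(v)
        masks[j][:, i] = False

    for j, Xj in enumerate(X_views):       # step 2: drop features from remaining columns
        for i in range(n):
            if masks[j][:, i].any():
                drop = rng.choice(Xj.shape[0],
                                  size=min(n_feat_missing, Xj.shape[0]),
                                  replace=False)
                masks[j][drop, i] = False

    X_obs = [np.where(Mj, Xj, 0.0) for Xj, Mj in zip(X_views, masks)]
    return X_obs, masks

# Example: 3 toy views, drop a view from 2 instances and 1 feature per remaining column.
views = [rng.standard_normal((4, 6)), rng.standard_normal((3, 6)), rng.standard_normal((5, 6))]
X_obs, masks = make_missing(views, n_omega=2, n_feat_missing=1, rng=rng)
```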

In the experiments, we first use the recovery algorithms to recover the missing data and then select MVML and SVM to classify the recovered datasets. Figures 17, 18, 19, and 20 show the related experimental results. It is found that whether MVML or SVM is used, the proposed NMVL-IV brings the best performance on average. Moreover, the classification accuracy shows a downward trend as \(\left| \varOmega \right| /n\) increases.

Fig. 17 Classification accuracy with different recovery algorithms and different datasets when MVML used

Fig. 18 Classification accuracy with different recovery algorithms and different datasets when SVM used

Fig. 19 Classification accuracy with different recovery algorithms and different datasets when WCSCCA and MVML used

Fig. 20 Classification accuracy with different recovery algorithms and different datasets when WCSCCA and SVM used

4.5 Comparison about bipartite ranking problem

Here we use the Reuters dataset and learning machines including semi-supervised multi-view ranking (SmVR) [48], multi-view random forest dissimilarity (RFDIS) [49], the view transformation model based on a multilayer perceptron (VTM-MLP) [50], and passive-aggressive based on the max-out function (PAMO) [51] for experiments. In this part, we want to show that our proposed NMVL-IV can improve the performance of a learning machine on bipartite ranking. For Reuters, we still consider the missing-data setting, which is the same as the one used for the classification problem. Following what SmVR has done [48], Table 6 shows the average precision (AvP) and area under the ROC curve (AUC) for SmVR, RFDIS, VTM-MLP, and PAMO. Because WCSCCA brings a better performance when used to further process the recovered data, for convenience, we only show the difference between NMVL-IV and MVL-IV. In this table, ‘SmVR’ means we use SmVR to process Reuters without recovery, ‘SmVR-MVL-IV’ means we use SmVR to process Reuters after recovery with MVL-IV, and ‘SmVR-NMVL-IV’ means we use SmVR to process Reuters after recovery with NMVL-IV. The other terms are defined analogously.

From this table, we can see that with NMVL-IV, more missing data can be recovered, which improves the performances of SmVR, RFDIS, VTM-MLP, and PAMO in terms of both AUC and AvP.

Table 6 AvP and AUC for SmVR, RFDIS, VTM-MLP, PAMO in terms of bipartite ranking performance on the dataset Reuters

4.6 Comparison about image retrieval problem

Here, the Corel dataset and the representative instance and feature mapping for instances (RIFM-I) [52] learning machine are used for experiments. RIFM-I is a classical learning machine for the image retrieval problem, and we still consider the missing-data setting. For the recovery algorithms, we only select MVL-IV for comparison and do not use WCSCCA for further processing. Similar to the experiments in [52], Tables 7, 8, and 9 show the classification confusion matrices of RIFM-I on Corel-NMVL-IV, Corel-MVL-IV, and the original Corel, while Fig. 21 illustrates the retrieval precisions of RIFM-I on Corel-NMVL-IV, Corel-MVL-IV, and Corel. According to the experimental results, it is found that with NMVL-IV, RIFM-I performs better on the Corel dataset. Especially for C0-Africa, C3-Buses, C5-Elephants, C6-Flowers, and C8-Mountains, compared with the performance on the original Corel without any recovery algorithm, NMVL-IV brings a higher improvement; in general, an improvement of at least 5\(\%\) is obtained.

The reason for these improvements lies in the difference between our proposed NMVL-IV and the compared MVL-IV in terms of methodology and problem formulation. According to Sects. 2 and 3 and Eqs. (2) and (5), compared with MVL-IV, our proposed NMVL-IV considers the distinct information of the views as well as both the quantity and the quality of recovery. Indeed, different views possess their respective discriminant information and should be generated from different subspaces, whereas MVL-IV suggests that these views are generated from a shared subspace, which neglects the differences between views. What’s more, MVL-IV only pays attention to the quantity of recovery, but NMVL-IV considers both the quantity and the quality of recovery through the view-based classifiers, which makes the recovered data achieve good classification performance after recovery. Since the differences between views are larger for C0-Africa, C3-Buses, C5-Elephants, C6-Flowers, and C8-Mountains, our NMVL-IV brings a higher improvement on them. This reason can also explain the better performance of NMVL-IV on the clustering, regression, classification, and bipartite ranking tasks.

Table 7 Confusion matrix of RIFM-I on Corel-NMVL-IV
Table 8 Confusion matrix of RIFM-I on Corel-MVL-IV
Table 9 Confusion matrix of RIFM-I on Corel
Fig. 21 Illustration of retrieval precisions of RIFM-I on Corel-NMVL-IV, Corel-MVL-IV, and Corel

4.7 Comparison about computational complexity

Besides the experiments on real-world applications, we also discuss the computational complexities of NMVL-IV and the compared algorithms, including SGPLVM [31], OptSpace [11], MVL-IV [8], SR-LS, SR-LR, SR-Sp [32], and NLMC [33].

SGPLVM, developed in 2005, is used to overcome the disadvantage of CCA; namely, SGPLVM can be treated as a nonlinear extension of CCA. However, according to the framework of SGPLVM, the quantity and quality of its recovery are limited, and its computational complexity is \(O(n^3)\). Here, n is the number of instances of a multi-view dataset X.

OptSpace, developed in 2010, is a conventional matrix completion algorithm. During the procedure of OptSpace, it selects \(\left| E\right|\) features of X and composes a subset M; by reconstructing M, the incomplete data can be recovered. Since the rank of M is r, the computational complexity of OptSpace is \(O(\left| E\right| rlog_2n)\).

MVL-IV, which was reviewed in Sect. 2, was developed in 2015. Its computational complexity is \(O(vr^3+4r^2d+6drn)\). In general, since r, v, and d are much smaller than n in real-world applications, the computational complexity can be reduced to O(6drn).

SR-LS, SR-LR, and SR-Sp were developed in 2017. SR-LR requires computing a singular value decomposition in each iteration, which has considerably high computational complexity on big matrices. In contrast, SR-LS and SR-Sp do not need a singular value decomposition in each iteration and hence are applicable to large-scale problems. Thus, the computational complexity of SR-LR is \(O(n^3)\), while those of SR-LS and SR-Sp are both somewhat smaller than \(O(n^3)\).

NLMC, developed in 2018, is proposed to recover missing entries of data matrices with nonlinear structures. It minimizes the rank (approximated by the Schatten p-norm) of a matrix in the feature space given by a nonlinear mapping of the data (input) space, where the kernel trick is used to avoid carrying out the unknown nonlinear mapping explicitly. Its computational complexity is comparable with that of rank-minimization-based linear matrix completion (LMC) methods, namely \(O(n^3)\).

For our developed NMVL-IV, the computational complexities of obtaining \(U_j\), \(W_j\), \(Z_j\), \(P_j\), \(V_j\), and \(B_j\) are \(O(6r_jn^2+2n^2)\), \(O(4n^2)\), \(O(d_jr_jn)\), \(O(2d_jnr_j+2n+2d_jn)\), \(O(2n+2nr_j)\), and \(O(2n+2nr_j)\), respectively. Since n is much smaller than \(n^2\), the total computational complexity is \(O(6vn^2(\sum \nolimits _{j=1}^{v}r_j+1))\).

According to the above discussion, the computational complexity of our NMVL-IV is theoretically neither the largest nor the smallest.

Moreover, in practice, Table 10 gives the average recovery times for all the used datasets with these methods. For convenience, we set the average recovery time of OptSpace on each dataset to 1. In this table, \(\left| \varOmega \right| /n\) is set to 0.2, 0.5, and 0.7, respectively, and the last column shows the rank of NMVL-IV in terms of time, where 1 means the recovery time is the longest and 8 means it is the shortest. According to this table, we can also see that the computational complexity of our NMVL-IV is neither the largest nor the smallest. However, according to all the above experiments, NMVL-IV has the best performance in recovering incomplete data, and this advantage offsets the shortcoming in computational complexity.

Table 10 Comparison about average time (in seconds) for recovery

5 Conclusion and future work

Multi-view problems have attracted many scholars, and some related solutions have been developed, but traditional solutions cannot process multi-view datasets with incomplete data. In order to address this issue, algorithms including multi-view learning with incomplete views (MVL-IV) have been developed. MVL-IV exploits the connections among multiple views and suggests that different views are generated from a shared subspace such that the missing data can be recovered. However, it has two defects: one is the assumption that different views are generated from a shared subspace, and the other is that it considers only the quantity of recovery, not its quality. Here, we propose a new multi-view learning machine with incomplete data (NMVL-IV) to overcome these two defects. NMVL-IV treats different views as being derived from different subspaces, and its model considers both the quantity of recovery and its quality. The related experiments validate that the proposed NMVL-IV can recover the missing data much better.

Although NMVL-IV can recover missing data better, it still has the following limitations. First, in our NMVL-IV, we suppose each instance has a label, which means NMVL-IV is only suitable for supervised datasets; we will extend the model of NMVL-IV to semi-supervised datasets. Second, the classifier used in the model of NMVL-IV is linear; in our future work, we will utilize nonlinear classifiers to realize the model and investigate whether they can bring better recovery results. Third, local information of the dataset is not considered in NMVL-IV, so it is necessary to consider it in the future. Fourth, in our NMVL-IV, for convenience, we do not consider the weights of features and views and regard their roles as the same. But in some cases, different views and features play different roles and have different influences on the training of classifiers. Since there is existing work, including WMVC [1], a multi-view clustering method that considers the weights of views and features, in our future work, we will combine WMVC with NMVL-IV so as to consider and balance the weights of features and views in the optimization problem. With such a method, the influence of different features and views can be reflected.

6 Appendix

Here, we show the detailed procedure for solving Eq. (6).

  (1)

    First, we fix \(P_j\), \(V_j\), and \(B_j\) and compute the partial derivatives of J w.r.t. \(U_j\), \(W_j\), and \(Z_j\) as follows.

    $$\begin{aligned}&\frac{\partial J}{\partial U_j}=U_jW_jW_j^T-Z_jW_j^T \\&\quad +P_j^T[(P_jU_jW_j+V_j)Y_j-I_j-B_j](W_jY_j)^T \end{aligned}$$
    (10)
    $$\begin{aligned}&\frac{\partial J}{\partial W_j}=U_j^TU_jW_j-U_j^TZ_j \\&\quad +(P_jU_j)^T[(P_jU_jW_j+V_j)Y_j-I_j-B_j]Y_j^T \end{aligned}$$
    (11)
    $$\begin{aligned}&\frac{\partial J}{\partial Z_j}=Z_j-U_jW_j \end{aligned}$$
    (12)

    Then we let \(\frac{\partial J}{\partial U_j}=0\) and get

    $$\begin{aligned}&U_jW_jW_j^T+P_j^TP_jU_jW_jY_j(W_jY_j)^T \\&\quad =Z_jW_j^T+P_j^T[I_j+B_j-V_jY_j](W_jY_j)^T, \end{aligned}$$
    (13)

    let \(\frac{\partial J}{\partial W_j}=0\) and get

    $$\begin{aligned}&U_j^TU_jW_j+(P_jU_j)^TP_jU_jW_jY_jY_j^T \\&\quad =(P_jU_j)^T(I_j+B_j-V_jY_j)Y_j^T+U_j^TZ_j \end{aligned}$$
    (14)

    and let \(\frac{\partial J}{\partial Z_j}=0\) and get

    $$\begin{aligned} Z_j=U_jW_j \end{aligned}$$
    (15)

    For Eq. (13), since \(W_jY_j(W_jY_j)^T\) is invertible, we can further get

    $$\begin{aligned}&U_jW_jW_j^T[W_jY_j(W_jY_j)^T]^{-1}+P_j^TP_jU_j \\&\quad =\left\{ Z_jW_j^T+P_j^T\left[ I_j+B_j\right. \right. \\&\qquad \left. \left. -V_jY_j\right] (W_jY_j)^T\right\} [W_jY_j(W_jY_j)^T]^{-1} \end{aligned}$$
    (16)

    while for Eq. (14), since \((P_jU_j)^TP_jU_j\) is also invertible, we can further get

    $$\begin{aligned}&[(P_jU_j)^TP_jU_j]^{-1}U_j^TU_jW_j+W_jY_jY_j^T \\&\quad =[(P_jU_j)^TP_jU_j]^{-1}\left\{ (P_jU_j)^T\left( I_j+B_j\right. \right. \\&\qquad \left. \left. -V_jY_j\right) Y_j^T+U_j^TZ_j\right\} \end{aligned}$$
    (17)

    Now, reference [53] gives the explicit solution to the matrix equation \(AX + XB = C\). Thus, for Eq. (16), we let \(P_j^TP_j\) be A, \(W_jW_j^T[W_jY_j(W_jY_j)^T]^{-1}\) be B, and \(\{Z_jW_j^T+P_j^T[I_j+B_j-V_jY_j](W_jY_j)^T\}[W_jY_j(W_jY_j)^T]^{-1}\) be C, while for Eq. (17), we let \([(P_jU_j)^TP_jU_j]^{-1}U_j^TU_j\) be A, \(Y_jY_j^T\) be B, and \([(P_jU_j)^TP_jU_j]^{-1}\{(P_jU_j)^T(I_j+B_j-V_jY_j)Y_j^T+U_j^TZ_j\}\) be C. After that, we use the method given in [53] to obtain the solutions of \(U_j\) and \(W_j\) (a numerical sketch of this step is given at the end of this appendix).

  (2)

    Second, we fix \(U_j\) and \(W_j\) and compute the partial derivatives of J w.r.t. \(P_j\), \(V_j\), and \(B_j\). Then we have

    $$\begin{aligned} \frac{\partial {J}}{\partial {P_j}}= & {} [(P_jU_jW_j+V_j)Y_j-I_j-B_j](U_jW_jY_j)^T \end{aligned}$$
    (18)
    $$\begin{aligned} \frac{\partial {J}}{\partial {V_j}}= & {} [(P_jU_jW_j+V_j)Y_j-I_j-B_j]Y_j^T \end{aligned}$$
    (19)
    $$\begin{aligned} \frac{\partial {J}}{\partial {B_j}}= & {} [(P_jU_jW_j+V_j)Y_j-I_j-B_j] \end{aligned}$$
    (20)

    We also let \(\frac{\partial {J}}{\partial {P_j}}\), \(\frac{\partial {J}}{\partial {V_j}}\), and \(\frac{\partial {J}}{\partial {B_j}}\) be zeros and get

    $$\begin{aligned} P_j= & {} [(I_j+B_j)Y_j^{-1}-V_j](U_jW_j)^{\dagger } \end{aligned}$$
    (21)
    $$\begin{aligned} V_j= & {} (I_j+B_j)Y_j^{-1}-P_jU_jW_j \end{aligned}$$
    (22)
    $$\begin{aligned} B_j= & {} (P_jU_jW_j+V_j)Y_j-I_j \end{aligned}$$
    (23)

    where \(\dagger\) represents the pseudo-inverse of a matrix.
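As a supplement to the derivations above, the following hedged sketch (Python/NumPy with SciPy; all array names and the per-view calling convention are assumptions for illustration, not the exact implementation used in the experiments) shows how the Sylvester-form update of Eq. (16) could be solved numerically, e.g., with scipy.linalg.solve_sylvester, together with the closed-form updates of Eqs. (21)–(23).

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_Uj(Wj, Zj, Pj, Vj, Bj, Yj, Ij):
    """Solve Eq. (16) for U_j, written as A U_j + U_j B = C (a Sylvester equation)."""
    WY = Wj @ Yj
    G = np.linalg.inv(WY @ WY.T)                     # [W_j Y_j (W_j Y_j)^T]^{-1}
    A = Pj.T @ Pj                                    # d_j x d_j
    B = Wj @ Wj.T @ G                                # r x r
    C = (Zj @ Wj.T + Pj.T @ (Ij + Bj - Vj @ Yj) @ WY.T) @ G   # d_j x r
    return solve_sylvester(A, B, C)

def update_PVB(Uj, Wj, Pj, Vj, Bj, Yj, Ij):
    """Closed-form fixed-point updates of Eqs. (21)-(23), applied sequentially."""
    UW = Uj @ Wj
    Y_inv = np.linalg.inv(Yj)                        # Y_j is diagonal with +/-1 entries
    Pj = ((Ij + Bj) @ Y_inv - Vj) @ np.linalg.pinv(UW)    # Eq. (21), pseudo-inverse
    Vj = (Ij + Bj) @ Y_inv - Pj @ UW                      # Eq. (22)
    Bj = (Pj @ UW + Vj) @ Yj - Ij                         # Eq. (23)
    return Pj, Vj, Bj
```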