1 Introduction

Multi-view data, collected from diverse sources, comprises multiple representations and is ubiquitous in real-world scenarios, such as videos with audio and visual tracks, or images pairing raw RGB data with descriptive text. These diverse views offer both consistent and complementary information. The aim of Multi-view Clustering (MVC) is to integrate this multi-faceted information into a unified structure, facilitating the unsupervised grouping of samples with similar structures. However, a predominant assumption in most MVC methodologies [7, 13, 27, 31, 32] is that sample information is completely observed across all views. In reality, data collection and transmission often violate this assumption, giving rise to Incomplete Multi-view Clustering (IMVC) challenges [15, 22, 28].

In the realm of IMVC, inconsistent views or incomplete data create gaps in the data. These gaps, illustrated in Fig. 1, stem either from the Partially View-missing Problem (PVP) [15, 22, 28] caused by inconsistent views (i.e., a sample has empty vectors in some views, as shown in Fig. 1(a)), or from the more pervasive, yet less explored, Partially Data-missing Problem (PDP) caused by data incompleteness (i.e., a sample has representations in all views but some values are missing, as shown in Fig. 1(b)). Traditional strategies for remedying PDP typically pad missing values with imputation techniques [5, 6, 9] before applying MVC to the imputed data. However, this imputation-based approach can falter, especially when applied without domain knowledge of the data structure, risking damage to intrinsic structures. Moreover, inaccurately imputed views can distort the fusion process in existing MVC techniques, potentially undermining clustering outcomes.

Fig. 1. Interpretation of PVP and PDP in IMVC problems.

To address the above issues, our contributions are threefold:

  • We propose an imputation-free framework built on a matrix correction method to address the partially data-missing problem in the IMVC community, which naturally avoids the uncertainty introduced by inaccurate imputation and directly obtains high-quality affinity matrices through matrix correction.

  • We introduce a matrix correction algorithm to effectively and efficiently estimate distance matrices on incomplete data with a theoretical guarantee. Specifically, it starts with initial distance matrices estimated from incomplete data and then corrects these estimates to satisfy specific properties via a convex optimization approach.

  • We design a two-stage strategy, i.e., correction-clustering (as shown in Fig. 2), that can be combined with any affinity-oriented MVC method, enabling existing MVC methods to perform well on IMVC problems. Extensive experiments demonstrate that our strategy achieves superior and robust clustering performance under a wide range of missing ratios compared to imputation-based approaches.

Fig. 2. The framework of our imputation-free approach for PDP problems. It adopts a novel two-stage strategy that first corrects affinity (distance) matrices in all views, and then combines with MVC methods to achieve clustering.

2 Related Work

Multi-view Clustering. MVC assumes that all samples are fully observed in all views. One popular roadmap of MVC is to separate the data based on affinity matrices constructed in all views. Those affinity-based clustering methods include spectral clustering [13, 14, 32, 42], kernel-based clustering [7, 21, 31, 41], and graph clustering [27, 29, 36, 37].

Incomplete-view Multi-view Clustering. Incomplete-view MVC methods focus on scenarios where some samples have missing views or no view contains all samples, and they use the observed views to recover the missing ones. Traditional IMVC models generate the consensus representation or affinity of view-missing data via matrix factorization [15, 22, 28], kernel-based [10, 23, 24], or graph-based [30, 38, 40] methods.

Incomplete-value Multi-view Clustering. In contrast to view-level missing, incomplete-value MVC addresses value-level (data-level) missing, where each sample in any view may contain some missing values. A feasible solution is to first impute missing values and then apply multi-view clustering methods. In practice, statistical imputation techniques [12, 20], such as zero, mean, and k-nearest neighbor (kNN) imputation [39], have been widely used. Besides, matrix completion [5, 6, 9] is a representative machine learning technique that solves a matrix factorization problem. Unfortunately, it is difficult to accurately estimate missing values from the observed data, especially at a large missing ratio, and there is no guarantee on the quality of imputation. This motivates us to design an imputation-free approach.

3 Methodology

To seek a reliable solution to incomplete-value MVC, we propose an imputation-free framework with a two-stage strategy, i.e., correction-clustering, as illustrated in Fig. 2. Our work centers on the matrix correction technique [18, 34], which improves distance/affinity matrix estimation for incomplete data.

3.1 Distance Estimation

Consider a multi-view dataset with v views and n samples. Denote \(X^{(i)} = \{x_1^{(i)}, \cdots , x_n^{(i)}\} \in \mathbb {R}^{d_i\times n}\) as the data matrix in view i, where \(d_i\) is the view-specific dimension. For simplicity, we consider an incomplete data matrix in a single-view, i.e., \(X^o = \{x_1^o, \cdots , x_n^o\} \in \mathbb {R}^{d \times n}\), where \(x_i^o\) represents the observed i-th sample that may contain missing values.

If samples are not fully observed, their pairwise distance has to be estimated. For two incompletely observed samples \(x_i^o,x_j^o \in \mathbb {R}^{d}\), denote \(I_{i,j} \subseteq \{1,\cdots ,d\}\) as the index set recording the positions of features that are observed in both samples. Assuming \(I_{i,j}\) is not empty, denote \(x_i^o(I_{i,j}) \in \mathbb {R}^{|I_{i,j}|}\) as a vector of selected values in \(x_i^o\) on \(I_{i,j}\). Then, the pairwise Euclidean distance \(d_{i,j}^o\) can be estimated from their commonly observed features by [18]

$$\begin{aligned} d_{i,j}^o = \Vert x_i^o(I_{i,j}) - x_j^o(I_{i,j})\Vert _2 \cdot \sqrt{\frac{d}{|I_{i,j}|}} \in [0, +\infty ). \end{aligned}$$
(1)

The estimated Euclidean distance matrix is then obtained as \(D^o = \{d_{i,j}^o\} \in \mathbb {R}^{n\times n}\). Moreover, all distance-based kernel matrices can be calculated from \(D^o\) accordingly, such as the widely used Gaussian kernel \(K^o = \{\exp (-(d^o_{i,j})^2/\sigma ^2)\} \in \mathbb {R}^{n \times n}\).
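For concreteness, a minimal NumPy sketch of Eq. (1) is given below; the function name and the NaN encoding of missing values are our own conventions, not taken from the original implementation, and the double loop is illustrative rather than optimized.

```python
import numpy as np

def estimate_distance_matrix(X):
    """Estimate pairwise Euclidean distances via Eq. (1) from an incomplete
    data matrix X of shape (d, n), with missing values encoded as NaN."""
    d, n = X.shape
    observed = ~np.isnan(X)                 # mask of observed features
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            common = observed[:, i] & observed[:, j]  # index set I_{i,j}
            m = common.sum()
            if m == 0:
                continue                    # I_{i,j} empty: no estimate possible
            diff = X[common, i] - X[common, j]
            # Rescale by sqrt(d / |I_{i,j}|) to compensate for unobserved features.
            D[i, j] = D[j, i] = np.linalg.norm(diff) * np.sqrt(d / m)
    return D
```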

3.2 Distance and Affinity Correction

To correct an initial distance matrix \(D^o\) calculated via Eq. (1), we introduce the distance correction method [18, 34] and employ a Laplacian kernel matrix, i.e., \(K^o = \exp (-\gamma D^o)\). Exploiting the PSD property of the Laplacian kernel [25], we correct the initial distance matrix \(D^o\) by solving the following problem:

$$\begin{aligned} \hat{D} = \mathop {\arg \min }_{D \in \mathbb {R}^{n \times n}}~ \Vert D-D^o\Vert ^2_F \end{aligned}$$
(2)

subject to

$$\begin{aligned} \left\{ \begin{aligned} & d_{i,i} = 0,~\forall ~1\le i \le n \\ & d_{i,j} = d_{j,i} \ge 0,~\forall ~1 \le i\ne j \le n \\ & \exp (-\gamma D) \succeq 0 \end{aligned} \right. \end{aligned}$$

where \(\succeq 0\) denotes the positive semi-definiteness (PSD). However, the problem defined above is hard to solve due to the PSD constraint in the exponential form. Thus, we change the decision variable from D to \(K = \exp (-\gamma D)\) and reformulate the problem under an efficient approximation:

$$\begin{aligned} \hat{K} = \mathop {\arg \min }_{K \in \mathbb {R}^{n \times n}}~ \Vert K - K^o\Vert _F^2 \end{aligned}$$
(3)

subject to

$$\begin{aligned} \left\{ \begin{aligned} & k_{i,i} = 1,~\forall ~1\le i \le n \\ & k_{i,j} = k_{j,i} \in [0,1],~\forall ~1 \le i\ne j \le n \\ & K \succeq 0 \end{aligned} \right. \end{aligned}$$

which can be solved by Dykstra's projection method [2, 8, 16–19, 34, 35].
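A minimal sketch of this step is shown below, assuming only NumPy; the parameter defaults and stopping rule are illustrative. Dykstra's method alternates projections onto the PSD cone and onto the box constraints, each carrying its own correction term, so the iterates converge to the point of the intersection nearest to \(K^o\).

```python
import numpy as np

def correct_kernel(K0, n_iter=200, tol=1e-7):
    """Nearest-matrix problem (3) via Dykstra's projection: project K0 onto
    the intersection of the PSD cone and {k_ii = 1, k_ij = k_ji in [0, 1]}."""
    K = K0.copy()
    P = np.zeros_like(K)   # correction term for the PSD projection
    Q = np.zeros_like(K)   # correction term for the box/diagonal projection
    for _ in range(n_iter):
        # Projection onto the PSD cone: clip negative eigenvalues to zero.
        Y = K + P
        w, V = np.linalg.eigh((Y + Y.T) / 2)
        Y_psd = (V * np.clip(w, 0.0, None)) @ V.T
        P = Y - Y_psd
        # Projection onto the symmetric box with unit diagonal.
        Z = Y_psd + Q
        Z = (Z + Z.T) / 2
        K_next = np.clip(Z, 0.0, 1.0)
        np.fill_diagonal(K_next, 1.0)
        Q = Z - K_next
        if np.linalg.norm(K_next - K) < tol:
            return K_next
        K = K_next
    return K
```

The corrected distance matrix is then recovered as \(\hat{D} = -\log (\hat{K})/\gamma \), with \(\hat{K}\) clipped away from zero to keep distances finite.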

Algorithm 1. Correction-clustering Strategy Based on Distance Correction.
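Combining the pieces, a sketch of this two-stage strategy might look as follows; it reuses the helpers sketched above, and `multiview_cluster` is a stand-in for any affinity-based MVC algorithm (e.g., AASC or CRSC), not a function from the paper.

```python
import numpy as np

def correction_clustering(views, multiview_cluster, gamma=1.0):
    """Two-stage strategy (Algorithm 1, sketch). `views` is a list of
    incomplete (d_i, n) data matrices with NaNs for missing values."""
    corrected = []
    for X in views:
        D0 = estimate_distance_matrix(X)   # Stage 1a: initial estimate, Eq. (1)
        K0 = np.exp(-gamma * D0)           # Laplacian kernel of the estimate
        K_hat = correct_kernel(K0)         # Stage 1b: correction, problem (3)
        D_hat = -np.log(np.maximum(K_hat, 1e-12)) / gamma
        np.fill_diagonal(D_hat, 0.0)
        corrected.append(D_hat)
    return multiview_cluster(corrected)    # Stage 2: affinity-based MVC
```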

3.3 Theoretical Analysis

Theoretical Guarantee. In addition to the convergence guarantee [4], the proposed method has a theoretical guarantee [18] on correction performance: the corrected matrix is a provably improved estimate of the unknown ground-truth.

Theorem 1

\(||D^*-\hat{D}||_F^2 \le ||D^*-D^o||_F^2\). The equality holds if and only if \(K^o \succeq 0\), i.e., \(\hat{K} = K^o, \hat{D} = D^o\).

Complexity Analysis. The time complexity of distance correction is \(O(n^3)\) per iteration, dominated by the eigenvalue decomposition (EVD) in the projection onto the PSD cone. Nevertheless, highly efficient algorithms can accelerate the EVD step, such as parallel algorithms [3] and the randomized singular value decomposition [11]. The storage complexity is \(O(n^2)\), required to keep the whole matrix in memory.
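As one example of such acceleration, the PSD projection can use a truncated EVD when only a few eigenpairs carry significant mass; the sketch below uses SciPy's `eigsh`, and the choice of k is an assumption to be tuned per dataset rather than a prescription from the paper.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def project_psd_truncated(Y, k=50):
    """Approximate projection onto the PSD cone using only the top-k
    eigenpairs (assumes the significant positive spectrum fits in k)."""
    S = (Y + Y.T) / 2
    w, V = eigsh(S, k=k, which="LA")  # k largest (algebraic) eigenvalues
    w = np.clip(w, 0.0, None)         # drop any negative tail
    return (V * w) @ V.T
```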

4 Experiments

4.1 Experimental Setup

Datasets. The experiments are carried out on two benchmark datasets, as shown in Table 1: 1) MSRC-v1 [33]: a scene recognition dataset containing 210 images with 6 views; 2) ORL: a face-image dataset containing 400 images with 3 views. All experiments are run with 10 random seeds on a ThinkStation with a 2.1 GHz Intel Core i7-12700 CPU and 32 GB RAM.

Table 1. Statistics of the two benchmark multi-view datasets.

Implementation. All data is normalized to \([-1,1]\). In each view, the values of each sample vector are missing completely at random (MCAR) with a given missing ratio r, e.g., \(70\%\). The incomplete clustering task first obtains multiple distance/affinity matrices through imputation or correction methods and then runs multi-view clustering algorithms to produce clustering results.
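A small helper of the kind used to simulate this MCAR setting might look as follows (our own illustrative code, not the paper's):

```python
import numpy as np

def mask_mcar(X, r, seed=0):
    """Independently drop each entry of X with probability r (the missing
    ratio), marking it as NaN; MCAR by construction."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()
    X_missing[rng.random(X.shape) < r] = np.nan
    return X_missing
```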

Baselines. The proposed distance correction method is compared with several representative imputation methods from two categories: 1) statistical methods: ZERO, MEAN, and \(\boldsymbol{k}\)NN [39] impute each missing value with zero, the mean value, or the average of its k-nearest neighbors (\(k=10\)), respectively; 2) machine learning methods: SVT [5] performs low-rank matrix completion with singular value thresholding, GROUSE [1] conducts low-rank matrix completion via Grassmannian rank-one update subspace estimation, FNNM [6] performs low-rank matrix completion by factor nuclear norm minimization, and KFMC [9] utilizes kernelized-factorization matrix completion.

Multi-view Clustering Algorithms. To verify the quality of affinity matrices obtained by imputation or correction methods, we choose popular affinity-based multi-view clustering algorithms to perform clustering, including: 1) spectral clustering: CRSC [14] and AASC [13] employ the co-regularization or affinity aggregation to optimize spectral clustering, respectively; 2) graph-based clustering: AWP [27] and CGD [29] generate a weighted graph by fusing multiple graph matrices; 3) kernel-based clustering: RMKKM [7] and MVCLFA [31] improve the robustness via multiple kernel k-means clustering or late fusion alignment maximization, respectively.

Evaluation Metrics. Clustering performance is evaluated by three commonly used metrics, i.e., clustering accuracy (ACC), normalized mutual information (NMI), and purity (PUR), each ranging between 0 and 1; higher is better. The average results over 10 random seeds are reported for all experiments.
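NMI and purity are available in standard libraries, while ACC requires an optimal matching between predicted and true cluster labels; a common implementation sketch (not the paper's code) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Clustering ACC: best one-to-one mapping of predicted clusters to
    ground-truth labels via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                        # contingency counts
    rows, cols = linear_sum_assignment(-cost)  # maximize matched counts
    return cost[rows, cols].sum() / y_true.size

# Usage: acc = clustering_accuracy(y, y_hat)
#        nmi = normalized_mutual_info_score(y, y_hat)
```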

4.2 Incomplete Clustering on Single-View Data

We select the first view of each multi-view dataset as a new single-view dataset, where sample values are missing completely at random with a missing ratio r varying from 20% to 80%. To deal with this, we first obtain an estimated Euclidean distance matrix \(\hat{D}\) from the incomplete data \(X^o\). Then we calculate the corresponding Gaussian kernel \(\hat{K} = \exp (-\hat{D}^2/\sigma ^2)\) with \(\sigma = \text {median}\{\hat{D}\}\) as the input to the standard spectral clustering algorithm [26], which applies the normalized cut and is available as a popular MATLAB package. As shown in Fig. 3, the proposed method yields significant improvements in clustering metrics (i.e., ACC, NMI, PUR) in almost all experiments, with the least performance degradation. Even at a large missing ratio (e.g., 80%), our method maintains reliable performance, demonstrating its robustness.
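The paper's pipeline runs on a MATLAB spectral clustering package; an equivalent Python sketch, reusing the helpers above and scikit-learn's precomputed-affinity spectral clustering as a stand-in, could read:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def single_view_pipeline(X, n_clusters, gamma=1.0):
    """Sect. 4.2 pipeline (sketch): corrected distances -> Gaussian kernel
    with the median heuristic -> spectral clustering."""
    D0 = estimate_distance_matrix(X)
    K_hat = correct_kernel(np.exp(-gamma * D0))
    D_hat = -np.log(np.maximum(K_hat, 1e-12)) / gamma
    np.fill_diagonal(D_hat, 0.0)
    sigma = np.median(D_hat)                 # median heuristic for sigma
    A = np.exp(-(D_hat ** 2) / sigma ** 2)   # Gaussian affinity from Sect. 4.2
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(A)
```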

Fig. 3. Incomplete single-view clustering results on MSRC-v1 and ORL datasets using the standard spectral clustering algorithm.

4.3 Incomplete Clustering on Multi-view Data

Now, we further investigate multi-view clustering performance. In each view, sample values are randomly replaced with NA values under a missing ratio r. Following the setting in Sect. 4.2, results on distance-based Gaussian kernels are shown in Fig. 4 with a fixed \(70\%\) missing ratio. The experimental results show that our approach consistently achieves better performance, in terms of ACC, NMI, and PUR, than all compared imputation methods. The proposed framework is thus effective and robust, and therefore more reliable for incomplete multi-view clustering tasks.

Fig. 4. Incomplete multi-view clustering results on MSRC-v1 and ORL datasets using different multi-view clustering algorithms under a fixed 70% missing ratio.

Fig. 5. Consensus affinity matrices obtained by the AASC multi-view clustering algorithm on MSRC-v1 and ORL datasets with a \(70\%\) missing ratio.

4.4 Quality Visualization of Affinity Matrices

To assess the quality of affinity matrices, we visualize consensus affinity matrices. We obtain Euclidean distance matrices through imputation methods or our correction method for all views. The AASC multi-view clustering algorithm [13] is then used to fuse the multiple distance matrices into a consensus affinity matrix. A high-quality affinity matrix should exhibit a clear block-diagonal structure. Our consensus affinity matrices, illustrated in Fig. 5, capture a strong clustering structure that is nearly identical to the ground-truth. This, in turn, leads to improved clustering performance compared to the ZERO and MEAN methods, whose affinity matrices lack clear block structures and are plagued with noise.

4.5 Motivation and Results Summary

When dealing with incomplete data, common imputation methods rely on domain knowledge of data structures and lack theoretical guarantees for the imputed data. To tackle this issue, we introduce a matrix correction technique that utilizes convex optimization to correct an estimated distance matrix and ensure certain properties are satisfied. Our approach leverages the positive semi-definite (PSD) property of the Laplacian kernel to improve the estimated distance with a theoretical guarantee of accuracy. As a result, our correction-clustering strategy outperforms traditional imputation-based strategies on incomplete clustering tasks in both single-view and multi-view datasets.

5 Conclusion and Future Work

Partially missing data is a significant issue in incomplete multi-view clustering, yet it has received relatively little attention in the research community. Traditional imputation methods can lead to inaccurate results and degrade performance. To address this challenge, we propose a novel imputation-free and unified framework for incomplete-value multi-view clustering. Our framework includes a distance correction method, combined with a two-stage correction-clustering strategy that integrates with existing multi-view clustering algorithms.

Our proposed framework outperforms existing imputation-based strategies, as demonstrated by extensive experiments. Our matrix correction algorithm provides high-quality Euclidean distance matrices that are closely aligned with the unknown ground-truth, resulting in improved performance in single-view spectral clustering. Additionally, our algorithm achieves better multi-view clustering performance by improving consensus affinity matrices. Overall, our framework provides a valuable tool for various data mining applications, particularly those involving incomplete clustering.

In future work, we plan to study missing data imputation in incomplete multi-view clustering and extend our framework to handle other types of missing data, such as missing views or modalities. We also aim to apply our framework to other real-world datasets and practical applications to further validate its effectiveness.