Abstract
The assumption of data completeness plays a significant role in the effectiveness of current Multi-view Clustering (MVC) methods. However, data collection and transmission would unavoidably breach this assumption, resulting in the Partially Data-missing Problem (PDP). A common remedy is to first impute missing values and then conduct MVC methods, which may cause performance degeneration due to inaccurate imputation. To address these issues in PDP, we introduce an imputation-free framework that utilizes a matrix correction technique, employing a novel two-stage strategy termed ’correction-clustering’. In the first stage, we correct distance matrices derived from incomplete data and compute affinity matrices. Following this, we integrate them with affinity-based MVC methods. This approach circumvents the uncertainties associated with inaccurate imputations, enhancing clustering performance. Comprehensive experiments show that our method outperforms traditional imputation-based techniques, yielding superior clustering results across various levels of missing data.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Multi-view data, stemming from diverse sources, encompasses multiple representations ubiquitous in real-world scenarios, like videos with audible and visual facets or images with raw RGB space data paired with descriptive text. These diverse views offer both consistent and supplementary information. The aim of Multi-view Clustering (MVC) is to assimilate this multi-faceted information into a unified structure, facilitating the unsupervised clustering of data samples with similar structures. However, a predominant assumption in most MVC methodologies [7, 13, 27, 31, 32] is the complete observation of sample information across all views. In reality, data collection and transmission can breach this assumption, resulting in Incomplete Multi-view Clustering (IMVC) challenges [15, 22, 28].
In the realm of IMVC, inconsistencies in views or incomplete data contribute to data gaps. These voids, illustrated in Fig. 1, stem either from Partially View-missing Problem (PVP) [15, 22, 28] due to inconsistent views (i.e., one sample has empty vectors in some views, as shown in Fig. 1 (a)), or from the more pervasive, yet less explored, Partially Data-missing Problem (PDP) caused by data incompleteness (i.e., one sample has representations in all views but with some missing values, as shown in Fig. 1 (b)). Traditional strategies aimed at remedying PDP often involve padding missing values via imputation techniques [5, 6, 9] before applying MVC on the imputed data. However, this imputation-based approach can falter, especially when applied without domain knowledge on data structures, risking damage to intrinsic structures. Moreover, inaccurately imputed views can distort the fusion process in existing MVC techniques, potentially undermining clustering outcomes.
To address the above issues, our contributions are threefold:
-
We propose an imputation-free framework with the matrix correction method to deal with partially data-missing problems in the IMVC community, which can naturally avoid the uncertainty error caused by inaccurate imputation and directly obtain high-quality affinity matrices through matrix correction.
-
We introduce a matrix correction algorithm to effectively and efficiently estimate distance matrices on incomplete data with a theoretical guarantee. Specifically, it starts with initial distance matrices estimated from incomplete data and then corrects these estimates to satisfy specific properties via a convex optimization approach.
-
We design a two-stage strategy, i.e., correction-clustering (as shown in Fig. 2), to combine with all affinity-oriented MVC methods, which makes existing MVC methods great again on IMVC problems. Extensive experiments demonstrate that our strategy achieves superior and robust clustering performance under a wide range of missing ratios, compared to the imputation-based approaches.
2 Related Work
Multi-view Clustering. MVC assumes that all samples are fully observed in all views. One popular roadmap of MVC is to separate the data based on affinity matrices constructed in all views. Those affinity-based clustering methods include spectral clustering [13, 14, 32, 42], kernel-based clustering [7, 21, 31, 41], and graph clustering [27, 29, 36, 37].
Incomplete-view Multi-view Clustering. Incomplete-view MVC methods focus on scenarios where partial samples have missing views or there is no view containing all samples, and they use the observed-view information to recover the missing views. Traditional IMVC models generate the consensus representation or affinity of view-missing data via matrix factorization [15, 22, 28], kernel-based [10, 23, 24], or graph-based [30, 38, 40] methods.
Incomplete-value Multi-view Clustering. In contrast to view-level missing, incomplete-value MVC aims at value-level (data-level) missing, where each sample in any view may contain some missing values. A feasible solution is to first impute missing values and then perform multi-view clustering methods. In practice, statistical imputation techniques [12, 20], such as zero, mean imputation, and k-nearest neighbor (kNN) imputation [39], have been widely used. Besides, matrix completion [5, 6, 9] is a representative machine learning-based technique that solves a matrix factorization problem. Unfortunately, it is difficult to accurately estimate missing values based on the observed data, especially for a large missing ratio, and there is no guarantee on the quality of imputation. This motivates us to design an imputation-free approach.
3 Methodology
To seek a reliable solution to incomplete-value MVC, we propose an imputation-free framework with a two-stage strategy, i.e., correction-clustering, as illustrated in Fig. 2. Our work mainly resides in the matrix correction technique [18, 34] to improve the distance/affinity matrix estimation for incomplete data.
3.1 Distance Estimation
Consider a multi-view dataset with v views and n samples. Denote \(X^{(i)} = \{x_1^{(i)}, \cdots , x_n^{(i)}\} \in \mathbb {R}^{d_i\times n}\) as the data matrix in view i, where \(d_i\) is the view-specific dimension. For simplicity, we consider an incomplete data matrix in a single-view, i.e., \(X^o = \{x_1^o, \cdots , x_n^o\} \in \mathbb {R}^{d \times n}\), where \(x_i^o\) represents the observed i-th sample that may contain missing values.
If samples are not fully observed, their pairwise distance has to be estimated. For two incompletely observed samples \(x_i^o,x_j^o \in \mathbb {R}^{d}\), denote \(I_{i,j} \subseteq \{1,\cdots ,d\}\) as the index set recording the positions of features that are observed in both samples. Assuming \(I_{i,j}\) is not empty, denote \(x_i^o(I_{i,j}) \in \mathbb {R}^{|I_{i,j}|}\) as a vector of selected values in \(x_i^o\) on \(I_{i,j}\). Then, the pairwise Euclidean distance \(d_{i,j}^o\) can be estimated from their commonly observed features by [18]
The estimated Euclidean distance matrix are obtained by \(D^o = \{d_{i,j}^o\} \in \mathbb {R}^{n\times n}\). Moreover, all distance-based kernel matrices can be calculated from \(D^o\) accordingly, such as the widely used Gaussian kernel \(K^o = \{\exp (-(d^0_{i,j})^2/\sigma ^2)\} \in \mathbb {R}^{n \times n}\).
3.2 Distance and Affinity Correction
To correct an initial distance matrix \(D^o\) calculated via Eq. (1), we introduce the distance correction method [18, 34] and resort to a Laplacian kernel matrix, i.e., \(K^o = \exp (-\gamma D^o)\). Considering the PSD property of the Laplacian kernel [25], we correct the initial distance matrix \(D^o\) by solving the following problem:
subject to
where \(\succeq 0\) denotes the positive semi-definiteness (PSD). However, the problem defined above is hard to solve due to the PSD constraint in the exponential form. Thus, we change the decision variable from D to \(K = \exp (-\gamma D)\) and reformulate the problem under an efficient approximation:
subject to
which can be solved by the Dykstra’s projection method [2, 8, 16,17,18,19, 34, 35].
3.3 Theoretical Analysis
Theoretical Guarantee. Despite the convergence guarantee [4], the proposed method also have a nice theoretical guarantee [18] on the correction performance that we provide an improved estimate of the unknown ground-truth.
Theorem 1
\(||D^*-\hat{D}||_F^2 \le ||D^*-D^o||_F^2\). The equality holds if and only if \(K^o \succeq 0\), i.e., \(\hat{K} = K^o, \hat{D} = D^o\).
Complexity Analysis. The time complexity of distance correction is \(O(n^3)\) per iteration, which mainly comes from the eigenvalue decomposition (EVD) in the projection operation onto the PSD space. Nevertheless, we can apply highly efficient algorithms for EVD operation to accelerate the algorithms, such as parallel algorithm [3] and the randomized singular value decomposition algorithm [11]. The storage complexity is \(O(n^2)\) to store the whole matrix in memory.
4 Experiments
4.1 Experimental Setup
Datasets. The experiments are carried out on two benchmark datasets as shown in Table 1: 1) MSRC-v1Footnote 1 [33]: a scene recognition dataset containing 210 images with 6 views; 2) ORLFootnote 2: a face-image dataset containing 400 images with 3 views. All the experiments are carried out for 10 random seeds on a ThinkStation with 2.1 GHz Intel i7-12700 Core and 32GB RAM.
Implementation. All data is normalized to \([-1,1]\). In each view, the values of each sample vector are missing completely at random (MCAR) with a given missing ratio r, e.g., \(70\%\). The incomplete clustering task is to first obtain multiple distance/affinity matrices through imputation or correction methods and then conduct multi-view clustering algorithms to get clustering results.
Baselines. The proposed distance correction method is compared with several representative imputation methods from two categories: 1) statistical methods: ZERO, MEAN, \(\boldsymbol{k}\)NN [39] impute the missing value by zero, mean value or an average value of k-nearest neighbors (\(k=10\)), respectively; 2) machine learning methods: SVT [5] makes low-rank matrix completion with singular value thresholding, GROUSE [1] conducts low-rank matrix completion via Grassmanian rank-one update subspace estimation, FNNM [6] performs low-rank matrix completion by factor nuclear norm minimization, and KFMC [9] utilizes a kernelized-factorization matrix completion.
Multi-view Clustering Algorithms. To verify the quality of affinity matrices obtained by imputation or correction methods, we choose popular affinity-based multi-view clustering algorithms to perform clustering, including: 1) spectral clustering: CRSC [14] and AASC [13] employ the co-regularization or affinity aggregation to optimize spectral clustering, respectively; 2) graph-based clustering: AWP [27] and CGD [29] generate a weighted graph by fusing multiple graph matrices; 3) kernel-based clustering: RMKKM [7] and MVCLFA [31] improve the robustness via multiple kernel k-means clustering or late fusion alignment maximization, respectively.
Evaluation Metrics. The clustering performance is evaluated by three commonly used metrics, i.e., clustering accuracy (ACC), normalized mutual information (NMI), and purity (PUR), which range between 0 and 1. The higher the better. The average results of 10 random seeds are reported for all experiments.
4.2 Incomplete Clustering on Single-View Data
We select the first view in the multi-view dataset as a newly single-view dataset, where the values of samples are missing completely at random with a given missing ratio r varying from 20% to 80%. To deal with it, we first obtain an estimated Euclidean distance matrix \(\hat{D}\) from incomplete data \(X^o\). Then we calculate a corresponding Gaussian kernel \(\hat{K} = \exp (-\hat{D}^2/\sigma ^2)\) with \(\sigma = \text {median}\{\hat{D}\}\) as the input of the standard spectral clustering algorithm [26], which applies the normalized cut and is a popular packageFootnote 3 in the MATLAB library. As shown in Fig. 3, the proposed method shows significant improvement in clustering metrics (i.e., ACC, NMI, PUR) in almost all experiments with the least performance degeneration. Even for a large missing ratio (e.g., 80%), our method still maintains reliable performance and shows its robustness.
4.3 Incomplete Clustering on Multi-view Data
Now, we further investigate the multi-view clustering performance. All samples in each view randomly replace their values with the NA values under a missing ratio r. Same as the setting in Sect. 4.2, the results on distance-based Gaussian kernels are shown in Fig. 4 with a fixed \(70\%\) missing ratio. The experimental results show that our approach consistently achieves better performance, in terms of ACC, NMI and PUR, against all compared imputation methods. Thus, the proposed framework shows effectiveness and robustness and therefore more reliable on incomplete multi-view clustering tasks.
4.4 Quality Visualization of Affinity Matrices
To assess the quality of affinity matrices, we visualize consensus affinity matrices. We obtain Euclidean distance matrices through imputation methods or our correction method for all views. The AASC multi-view clustering algorithm [13] is used to fuse multiple distance matrices to a consensus affinity matrix. A high-quality affinity matrix should have a clear block diagonal structure. Our consensus affinity matrices, illustrated in Fig. 5, demonstrate a remarkable ability to capture a strong clustering structure that is nearly identical to the ground-truth. This, in turn, leads to improved clustering performance as compared to the ZERO and MEAN methods whose affinity matrices lack clear block structures and are plagued with noise.
4.5 Motivation and Results Summary
When dealing with incomplete data, common imputation methods rely on domain knowledge of data structures and lack theoretical guarantees for the imputed data. To tackle this issue, we introduce a matrix correction technique that utilizes convex optimization to correct an estimated distance matrix and ensure certain properties are satisfied. Our approach leverages the positive semi-definite (PSD) property of the Laplacian kernel to improve the estimated distance with a theoretical guarantee of accuracy. As a result, our correction-clustering strategy outperforms traditional imputation-based strategies on incomplete clustering tasks in both single-view and multi-view datasets.
5 Conclusion and Future Work
Partially missing data is a significant issue in incomplete multi-view clustering, yet it has received relatively little attention in the research community. Traditional imputation methods can lead to inaccurate results and degrade performance. To address this challenge, we propose a novel imputation-free and unified framework for incomplete-value multi-view clustering. Our framework includes a distance correction method, combined with a two-stage correction-clustering strategy that integrates with existing multi-view clustering algorithms.
Our proposed framework outperforms existing imputation-based strategies, as demonstrated by extensive experiments. Our matrix correction algorithm provides high-quality Euclidean distance matrices that are closely aligned with the unknown ground-truth, resulting in improved performance in single-view spectral clustering. Additionally, our algorithm achieves better multi-view clustering performance by improving consensus affinity matrices. Overall, our framework provides a valuable tool for various data mining applications, particularly those involving incomplete clustering.
In future work, we plan to study missing data imputation in incomplete multi-view clustering and extend our framework to handle other types of missing data, such as missing views or modalities. We also aim to apply our framework to other real-world datasets and practical applications to further validate its effectiveness.
References
Balzano, L., Nowak, R., Recht, B.: Online identification and tracking of subspaces from highly incomplete information. In: 2010 48th Annual Allerton conference on Communication, Control, and Computing (Allerton), pp. 704–711. IEEE (2010)
Bauschke, H.H., Borwein, J.M.: Dykstra’s alternating projection algorithm for two sets. J. Approx. Theory 79(3), 418–443 (1994)
Berry, M.W., Mezher, D., Philippe, B., Sameh, A.: Parallel algorithms for the singular value decomposition. In: Handbook of Parallel Computing and Statistics, pp. 133–180. Chapman and Hall/CRC (2005)
Boyle, J.P., Dykstra, R.L.: A method for finding projections onto the intersection of convex sets in Hilbert spaces. In: Advances in Order Restricted Statistical Inference, pp. 28–47. Springer, New York (1986). https://doi.org/10.1007/978-1-4613-9940-7_3
Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Candes, E., Recht, B.: Exact matrix completion via convex optimization. Commun. ACM 55(6), 111–119 (2012)
Du, L., et al.: Robust multiple kernel K-means using L21-Norm. In: 24th International Joint Conference on Artificial Intelligence (2015)
Dykstra, R.L.: An algorithm for restricted least squares regression. J. Am. Stat. Assoc. 78(384), 837–842 (1983)
Fan, J., Udell, M.: Online high rank matrix completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8690–8698 (2019)
Guo, J., Ye, J.: Anchors bring ease: an embarrassingly simple approach to partial multi-view clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 118–125 (2019)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Hasan, M.K., Alam, M.A., Roy, S., Dutta, A., Jawad, M.T., Das, S.: Missing value imputation affects the performance of machine learning: a review and analysis of the literature (2010–2021). Inf. Med. Unlocked 27, 100799 (2021)
Huang, H.C., Chuang, Y.Y., Chen, C.S.: Affinity aggregation for spectral clustering. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 773–780. IEEE (2012)
Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spectral clustering. In: Advances in Neural Information Processing Systems 24 (2011)
Li, S.Y., Jiang, Y., Zhou, Z.H.: Partial multi-view clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28 (2014)
Li, W.: Estimating jaccard index with missing observations: a matrix calibration approach. In: Advances in Neural Information Processing Systems, vol. 28, pp. 2620–2628. Canada (2015)
Li, W.: Scalable calibration of affinity matrices from incomplete observations. In: Asian Conference on Machine Learning, pp. 753–768. PMLR, Bangkok, Thailand (2020)
Li, W., Yu, F.: Calibrating distance metrics under uncertainty. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 219–234. Springer (2022). https://doi.org/10.1007/978-3-031-26409-2_14
Li, W., Yu, F., Ma, Z.: Metric nearness made practical. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 8648–8656 (2023)
Lin, W.-C., Tsai, C.-F.: Missing value imputation: a review and analysis of the literature (2006–2017). Artif. Intell. Rev. 53(2), 1487–1509 (2020). https://doi.org/10.1007/s10462-019-09709-4
Liu, J., et al.: Optimal neighborhood multiple kernel clustering with adaptive local kernels. IEEE Trans. Knowl. Data Eng. (2020)
Liu, J., et al.: Self-representation subspace clustering for incomplete multi-view data. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2726–2734 (2021)
Liu, X., et al.: Efficient and effective regularized incomplete multi-view clustering. IEEE Trans. Pattern Anal. Mach. Intell. 43(8), 2634–2646 (2020)
Liu, X.: Multiple kernel \(k\) k-means with incomplete kernels. IEEE Trans. Pattern Anal. Mach. Intell. 42(5), 1191–1204 (2019)
Nader, R., Bretto, A., Mourad, B., Abbas, H.: On the positive semi-definite property of similarity matrices. Theoret. Comput. Sci. 755, 13–28 (2019)
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems 14 (2001)
Nie, F., Tian, L., Li, X.: Multiview clustering via adaptively weighted procrustes. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2022–2030 (2018)
Shao, W., He, L., Yu, P.S.: Multiple incomplete views clustering via weighted nonnegative matrix factorization with \(L_{2,1}\) regularization. In: Appice, A., Rodrigues, P.P., Santos Costa, V., Soares, C., Gama, J., Jorge, A. (eds.) ECML PKDD 2015. LNCS (LNAI), vol. 9284, pp. 318–334. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23528-8_20
Tang, C., et al.: CGD: multi-view clustering via cross-view graph diffusion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5924–5931 (2020)
Wang, S., et al.: Highly-efficient incomplete large-scale multi-view clustering with consensus bipartite graph. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9776–9785 (2022)
Wang, S., et al.: Multi-view clustering via late fusion alignment maximization. In: 28th International Joint Conference on Artificial Intelligence, pp. 3778–3784 (2019)
Xia, R., Pan, Y., Du, L., Yin, J.: Robust multi-view spectral clustering via low-rank and sparse decomposition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28 (2014)
Xu, N., Guo, Y., Wang, J., Luo, X., Kong, X.: Multi-view clustering via simultaneously learning shared subspace and affinity matrix. Int. J. Adv. Rob. Syst. 14(6), 1729881417745677 (2017)
Yu, F., Bao, R., Mao, J., Li, W.: Highly-efficient Robinson-Foulds distance estimation with matrix correction. In: (to appear) 26th European Conference on Artificial Intelligence (2023)
Yu, F., Zeng, Y., Mao, J., Li, W.: Online estimation of similarity matrices with incomplete data. In: Uncertainty in Artificial Intelligence, pp. 2454–2464. PMLR (2023)
Zhan, K., Nie, F., Wang, J., Yang, Y.: Multiview consensus graph clustering. IEEE Trans. Image Process. 28(3), 1261–1270 (2018)
Zhan, K., Zhang, C., Guan, J., Wang, J.: Graph learning for multiview clustering. IEEE Trans. Cybern. 48(10), 2887–2895 (2017)
Zhang, P., et al.: Adaptive weighted graph fusion incomplete multi-view subspace clustering. Sensors 20(20), 5755 (2020)
Zhang, S.: Nearest neighbor selection for iteratively KNN imputation. J. Syst. Softw. 85(11), 2541–2552 (2012)
Zhao, H., Liu, H., Fu, Y.: Incomplete multi-modal visual data grouping. In: 25th International Joint Conference on Artificial Intelligence, pp. 2392–2398 (2016)
Zhou, S., et al.: Multiple kernel clustering with neighbor-kernel subspace segmentation. IEEE Trans. Neural Netw. Learn. Syst. 31(4), 1351–1362 (2019)
Zong, L., Zhang, X., Liu, X., Yu, H.: Weighted multi-view spectral clustering based on spectral perturbation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Acknowledgements
We appreciate the anonymous reviewers for their helpful feedback that greatly improved this paper. The work of Fangchen Yu and Wenye Li was supported in part by Guangdong Basic and Applied Basic Research Foundation (2021A1515011825), Guangdong Introducing Innovative and Entrepreneurial Teams Fund (2017ZT07X152), Shenzhen Science and Technology Program (CUHKSZWDZC0004), and Shenzhen Research Institute of Big Data Scholarship Program. The work of Yuqi Ma was supported in part by CUHKSZ-SRIBD Joint PhD Program. The work of Jianfeng Mao was supported in part by National Natural Science Foundation of China under grant U1733102, in part by the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen under grant B10120210117, and in part by CUHK-Shenzhen under grant PF.01.000404.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yu, F., Shi, Z., Ma, Y., Mao, J., Li, W. (2024). From Incompleteness to Unity: A Framework for Multi-view Clustering with Missing Values. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1965. Springer, Singapore. https://doi.org/10.1007/978-981-99-8145-8_9
Download citation
DOI: https://doi.org/10.1007/978-981-99-8145-8_9
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8144-1
Online ISBN: 978-981-99-8145-8
eBook Packages: Computer ScienceComputer Science (R0)