Abstract
In this paper, we consider the weakly supervised multi-target regression problem where the observed data is partially or imprecisely labelled. The model of the multivariate normal distribution over the target vectors represents the uncertainty arising from the labelling process. The proposed solution is based on the combination of a manifold regularisation method, the use of the Wasserstein distance between multivariate distributions, and a cluster ensemble technique. The method uses a low-rank representation of the similarity matrix. An algorithm for constructing a co-association matrix with calculation of the optimal number of clusters in a partition is presented. To increase the stability and quality of the ensemble clustering, we use k-means with different distance metrics. The experimental part presents the results of numerical experiments with the proposed method on artificially generated data and real data sets. The results show the advantages of the proposed method over existing solutions.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
- Weakly supervised learning
- Multi-target regression
- Manifold regularization
- Low-rank matrix representation
- Cluster ensemble
- Co-association matrix
1 Introduction
Weakly supervised learning is a type of machine learning technique in which a model is trained using incomplete, imprecise, or ambiguous supervision signals, rather than using fully correctly labeled data. Weak supervision often arises in real problems for various reasons. This may be due to an expensive data labeling process, poor accuracy of sensors, insufficient expert qualifications or human error. For example, there is weak supervision in cases where the labeling is obtained using crowdsourcing techniques: for each object there is a set of different (possibly inaccurate) labels, the quality of which depends on the skills of the performers. In addition to that, some objects may remain unlabeled if there is not enough budget for them.
Another example is the task of detecting objects in an image [1]. Bounding boxes are a common way to represent the location and extent of objects detected in an image or video frame in object detection tasks. A bounding box is a rectangular box that surrounds the object and is defined by its four corners or coordinates. In some difficult cases, such as detecting objects in medical CT scans, the bounding boxes can be very inaccurate and may highlight unwanted pixels. Moreover, the process of labeling CT images is very time-consuming, so it is not possible to label many objects.
Generally, there are three types of weak supervision: incomplete supervision, inaccurate supervision and inexact supervision [2]. In this work, we focus on the first two types of weak supervision. In particular, we assume that only a small part of the objects have labels, while the labels can be uncertain, and for most of the dataset there are no labels at all.
We propose an algorithm for solving the multi-target weakly supervised regression problem using Wasserstein metric, manifold regularization and a co-association matrix as the similarity matrix. We follow the transductive setting, which means that the objects from test data can be used during training and the task is to find the labels only for these objects. The algorithm for calculating the weighted average co-association matrix is also improved. Finally, we compare the proposed algorithm with existing algorithms of supervised learning and weakly supervised learning on synthetic and real data.
2 Problem Description
Let \( X = \{x_{1},\dots ,x_{n}\}\), \(x_i = (x_{i}^1,\dots ,x_{i}^{p})^\top \in \mathbb {R}^p\) are sampled from distribution \(\mathcal {P}_X\), where n is the number of objects in the sample and p is the dimensionality of the feature space. In turn, \(Y= \{y_{1},\dots ,y_{n}\}\), \(y_i = (y_{i}^1,\dots ,y_{i}^{m})^\top \in \mathbb {R}^m\) are target labels, where m is the dimensionality of the target feature space.
In the semi-supervised transductive learning problem, a dataset \(X \times Y = \{(x_1, y_1)\), \(\dots , (x_n, y_n)\}\) is considered, but the target features \(\{y_1, ..., y_{n_1}\} = Y_1 \subseteq Y\) are only known for a small part of the available data \(\{x_1, ..., x_{n_1}\} = X_1 \subseteq X\). The rest of the objects \(\{x_{n_1 +1 }, ..., x_n\} = X_0 \subseteq X\) are unlabeled. The task is to predict the labels \(Y_0 = \{y_{n_1+1}, ..., y_n\}\) as accurately as possible according to some criterion.
To model the uncertainty of the observed labels, we use a multivariate normal distribution. We suppose that for each i-th data point, \(i=1,\dots ,n_1\), the value \(y_i\) of the target feature is a realization of a random variable \(Y_i\) with a cumulative distribution function (cdf) \(F_i(y)\) defined on \(D_Y \subset \mathbb {R}^m\):
where \(\mu _i \in \mathbb {R}^m\) is a mean vector, \(\varSigma _i \in \mathbb {R}^{m\times m}\) is a covariance matrix, \(i = 1, \dots , n\). The overall degree of uncertainty can be interpreted as \(\mathbb {T}_i = |\varSigma _i|\): the larger it is, the greater the uncertainty of the label. Accordingly, for strictly labeled objects, it is expected that \(\mathbb {T}_i \approx 0\).
The task is to determine \(F_i(y)\) for \(i=n_1+1,\dots ,n\) following an objective criterion.
3 Related Work
The work [3] provides algorithms WSR-RBF and WSR-LRCM for solving the weakly supervised regression problem in the transductive formulation in the case of a one-dimensional target variable. It uses a univariate normal distribution to model inaccuracy:
where \(\sigma _i\) is an indicator of inaccuracy. Then it is proposed to solve the optimization problem by minimizing the distance between the predicted and real distributions using manifold regularization. To approximate the similarity matrix in WSR-LRCM, the co-association matrix is used and to obtain the co-association matrix, the cluster ensemble and the k-means algorithm are used. The WSR-RBF variant uses a weight matrix based on the RBF kernel instead of a low-rank representation:
where \(h=\Vert x_i-x_j\Vert \), and \(\ell \) is a parameter.
However, the presented algorithm does not generalize to the multidimensional case. To solve a multi-target regression, it is necessary to train a separate model for each target variable. With this approach, it is possible to effectively solve those problems in which the target variables are independent of each other. If the target variables are not independent, for example, in the problem of object detection [1], these dependencies will be lost during training. These dependencies can be taken into account by using the distance between multivariate distributions, such as the Wasserstein distance [4].
The article [5] presents a detailed analysis of the co-association matrix and the algorithm for its construction. However, it relies on the basic version of the k-means algorithm, which has significant drawbacks, including the use of a single metric option and the uncertainty in choosing the appropriate number of clusters. In [7] the authors analyze the influence of metrics other than Euclidean on the quality of clustering by the k-means algorithm.
4 Proposed Method
Let
-
\(F^* = \{F_1^*, ..., F_{n_1}^*, ..., F_n^*\}\) be the set of arbitrary multivariate normal cdf’s, each \(F_i^*\) is represented by a pair \((a_i, \mathbb {S}_i)\);
-
\(F = \{ F_1, ..., F_{n_1} \}\) be the set of known cdf’s, each \(F_i\) is represented by a pair \((\mu _i, \varSigma _i)\).
In the following, we assume, both \(\varSigma _i\) and \(\mathbb {S}_i\) to be positive-definite matrices. Therefore, they are admitting Cholesky decomposition: \(\varSigma _i = \varSigma _i^{1/2} {\varSigma _i^{1/2}}^\top \), \(\mathbb {S}_i = \mathbb {S}_i^{1/2} {\mathbb {S}_i^{1/2}}^\top \). We denote elements of \(\mathbb {S}_i^{1/2}\) as \(s^i_{jk}\), and elements of \(\varSigma _i^{1/2}\) as \(\sigma ^i_{jk}\).
4.1 Objective Functional
Consider the following optimization problem:
where \(\mathcal {W}\) is a 2-Wasserstein metric [4] (also known as Kantorovich-Rubenstein distance), \(\gamma >0\) is a parameter, and matrix \(W=(W_{ij})\) represents the similarity measures between elements of dataset. For two multivariate Gaussian distributions \(N(\mu _0, \varSigma _0)\) and \( N(\mu _1, \varSigma _1)\), 2-Wasserstein distance is
Following [3], we also add the regularisation term with parameter \(\beta > 0\). We can rewrite the objective as
4.2 Optimal Solution
To find the optimal solution, we differentiate (3) with respect to elements of \(a_i\) and \(\mathbb {S}_i^{1/2}\), \(i = 1, ..., n\):
Given that the matrices \(\varSigma ^{1/2}_i\) are lower triangular, we introduce an auxiliary operation \(\text {vec}_2: \mathbb {R}^{m\times m} \rightarrow \mathbb {R}^{\frac{m(m+1)}{2}}\) that transforms all elements above the main diagonal (including the main diagonal elements) into a row-by-row vector. Also, a lower triangular matrix can be obtained from a vector using an operation \(\text {vec}_2^{-1}: \mathbb {R}^{\frac{m(m+1)}{2}} \rightarrow \mathbb {R}^{m\times m}\). Similarly, operation \(\text {vec}_3: \mathbb {R}^{n\times m\times m} \rightarrow \mathbb {R}^{n\times \frac{m(m+1)}{2}}\) (as well as \(\text {vec}_3^{-1}: \mathbb {R}^{n\times \frac{m(m+1)}{2}} \rightarrow \mathbb {R}^{n\times m\times m}\)) can be defined for three-dimensional tensors whose elements are lower triangular matrices. Let us denote
Then the solution of the optimization problem can be given in the matrix form
where L is the Laplacian matrix, i.e., \(L = D - W\), D is a diagonal matrix with elements \(D_{ii} = \sum \limits _{j} W_{ij}\). If we assume that there is exist \(V \in \mathbb {R}^{n\times q}, q \ll n\), such that \(W = VV^\top \) then
where \(G = B + 2\gamma D\). By using the Woodbury identity [6], the inverse operator \(B + 2\gamma L\) in the solution, that takes \(O(n^3)\) operations, can be represented as
where G is diagonal matrix (and therefore can be inverted in linear time), \(I-2\gamma V^\top G^{-1} V \in \mathbb {R}^{q \times q}\). Therefore it takes \(O(nq + q^3)\) to perform the inverse, which reduces the computations significantly, since by the assumption \(q \ll n\). Finally, we get:
In the article [3] it is shown that the weighted average co-association matrix can be used as a similarity matrix. By definition, the weighted average co-association matrix is
where \(H_1,\dots ,H_r\) are the co-association matrices for partitions \(P_1, ...,P_r\) with elements indicating whether a pair \(x_i\), \(x_j\) belong to the same cluster of this partition or not, \(\omega _1,\dots ,\omega _r\) are weights of ensemble elements, \(\omega _l \ge 0\), \(\sum \omega _l=1\). This matrix has a low-rank representation:
where \(V=[V_1 V_2 \dots V_r]\) is a block matrix, \(V_l=\sqrt{\omega _l}\, Z_l\), \(Z_l \in \mathbb {R}^{n \times K_l}\) is the cluster assignment matrix for lth partition: \(Z_l(i,k)=\mathbb {I}[c(x_i)=k]\), \(i=1,\dots ,n\), \(k=1,\dots ,K_l\) and \(K_l\) is the number of clusters in partition \(P_l\), \(K_l \ll n\). It is also shown that the Laplacian matrix L for the matrix H can be written in the following form:
Now the optimal solution (5) can be found by using the low-rank representation of the similarity matrix (7) and the diagonal matrix (8).
5 Co-association Matrix: Multimetricity and Optimality
To obtain a low-rank similarity matrix representation, we will use a weighted average co-association matrix as the similarity matrix. However, the standard algorithm for calculating the weighted average co-association matrix [5] has a number of disadvantages:
-
The k-means algorithm using the Euclidean metric can only find spherical clusters, so some complex relationships in the data may not be found as a result of clustering;
-
The result is strongly influenced by both the choice of the desired number of clusters for the k-means algorithm and the number of different partitions in the ensemble.
To solve these problems, we decided to improve the algorithm for calculating the weighted average co-association matrix. Firstly, we propose to average the co-association matrix over the distance metrics used in the k-means algorithm. Secondly, we propose to use only optimal partitions in terms of cluster validity index in the ensemble in order to reduce the influence of unnecessary partitions and reduce the size of the ensemble.
5.1 Multimetric Weighted Average Co-association Matrix
Let \(\{M_t\}_{t=1}^d\) be the set of metrics that can be used in the k-means algorithm as the distance between points, for example, the Minkowski distance of order p. Then for each metric from this set, an arbitrary set of partitions variants \(\{P_{l}^{M_t}\}_{l=1}^{r^{M_t}}\) can be obtained using cluster ensemble. Similarly, for each partition, the co-association matrix \(H_l^{M_t}\) can be found [3]. Then we define the multimetric weighted average co-association matrix as follows:
where \(\omega _1^{M_t},\dots ,\omega _r^{M_t}\) are weights of ensemble elements, \(\omega _l^{M_t} \ge 0\), \(\sum \limits _{l=1}^{r^{M_t}} \omega _l^{M_t}=1\) for each \(M_t\), \(t = 1, ..., d\).
It should be noted that the clustering quality index, on which partition weights \(\omega _l^{M_t}\) depend, should use the selected metric as the distance between points. That is why we assume that \(\sum \limits _{l=1}^{r^{M_t}} \omega _l^{M_t}=1\) for each \(M_t\), \(t = 1, ..., d\) rather than \(\sum \limits _{t=1}^{d}\sum \limits _{l=1}^{r^{M_t}} \omega _l^{M_t}=1\). As a further improvement, co-association matrices can also be weighted.
Thus, by using different metrics, we can obtain different partitions and reduce the impact of some negative effects arising from the use of the Euclidean distance. For example, in [7] it is shown that using the city blocks metric can reduce the impact of the curse of dimensionality.
5.2 Optimal Weighted Average Co-association Matrix
In general, the number of clusters in each partition is a hyperparameter. For example, in [3] two different set of parameters are used:
-
The ensemble size \(r=10\), the number of clusters \(K_i\) in i-th partition: \(K_i=2+i\), \(i=1,...,r\);
-
The ensemble size \(r=10\), the number of clusters \(K_i\) in i-th partition: \(K_i=100+i\), \(i=1,...,r\).
However, this choice may not be optimal. So, in the first case, for partitions with a small number of clusters, the weights can be extremely small, which means that their influence on the weighted average co-association matrix will be insignificant. In the second case, in addition to the high computational complexity of finding partitions with a large number of clusters, all resulting partitions can be similar to each other and have almost the same weights. Also, in both cases, it is not guaranteed that at least one optimal partition will be found in terms of any criterion: for example, a partition that achieves a local optimum of the cluster validity index.
We propose another algorithm that calculates weighted average co-association matrix with optimal partitions. The matrix \(H^*\) thus obtained is called optimal weighted average co-association matrix. This matrix is optimal in the sense that only optimal partitions according to the cluster validity index are used in its calculation. Below is an algorithm for calculating the optimal weighted average co-association matrix by steps:
6 C-WSR Algorithm
We formulate three main variants of the Correlated Weakly Supervised Regression (C-WSR) algorithm:
-
RBF: Radial Basis Function to calculate the similarity matrix is used;
-
LRCM: a low-rank representation of the weighted average co-association matrix to calculate the similarity matrix is used;
-
LROMCM: a low-rank representation of the optimal multimetric weighted average co-association matrix (10) to calculate the similarity matrix is used.
7 Experimental Results
In this section, we will compare three variants of the proposed Correlated Weakly Supervised Regression (C-WSR) algorithm. We use the MWD metric when comparing with weakly supervised learning algorithms WSR-RBF and WSR-LRCM from [3] and MAE when comparing with supervised learning algorithms such as Multivariate Linear Regression and gradient boosting from framework XGBoost on real data:
Since the WSR-RBF and WSR-LRCM algorithms can only be used in a single target scenario, we train a separate model for each target variable. To calculate multimetric weighted average co-association matrix, we use Minkowski metric \(\rho _p\) with different \(p\in \{1, 2, \infty \}\) and Silhouette as index cluster validity to determine the weights and the optimal number of clusters.
For experiments, we used an AMD Ryzen 9 3850X processor with a clock frequency of 3.5 GHz and 64 GB of RAM.
7.1 Monte-Carlo Simulation
For the Monte Carlo simulation, we generated a dataset of 1000 objects from a mixture of multivariate normal distributions \(\mathcal {N}(\mu _k^*, \varSigma _k^*)\), \(\mu _k^*=(8k+1,8k+2,...,8k+d_x) \in \mathbb {R}^m\), \(\varSigma _k^* = \text {diag}(1, ..., 1) \in \mathbb {R}^{d_x\times d_x}\), \(d_x=8\) and \(k \in \{1,2,3\}\).
For objects generated from the k-th component, we assume that the target function is equal to \(Y_k = k + \varepsilon _k\), where \(\varepsilon _k\) is a random variable with \(d_y\)-dimensional normal distribution function \(\mathcal {N}(0, D_k D_k^\top )\), \(D_k\) is random lower-triangular matrix with elements sampled from normal distribution and \(d_y = 4\).
To insure the weak supervision, we assumed 10\(\%\) of the dataset to be strictly labeled, 20\(\%\) of the dataset consists of inaccurately labeled objects and the remaining 70\(\%\) of objects are unlabeled. To model the inaccurate labeling, we use the parameters defined in (1): \(\varSigma _i=\varSigma _Y\), where \(\varSigma _Y\) is a covariance matrix of the target function over labeled data. For strictly labeled objects, we assume that the matrix \(\varSigma _i\) is a zero matrix.
For the WSR-LRCM and C-WSR-LRCM algorithms, we used a cluster ensemble of size \(r=30\) and the number of clusters \(K_i\) in i-th partition: \(K_i=2+i\), \(i=1,...,30\). The C-WSR-LROMCM algorithm uses parameters \(r=10\), \(k_{\text {min}}=2\) and \(k_{\text {max}}=30\). Regularization coefficients \(\beta =0.001\) and \(\gamma =0.001\) are set for all algorithms. The obtained quality metrics were averaged over 100 runs. The results are presented in Table 1.
7.2 CO/NOx Dataset
For CO/NOx dataset [8] we use carbon monoxide (CO) and nitrogen oxides (NOx) emissions for year 2015 as regression targets. This dataset contains 11 features that describe the characteristics of a gas turbine and include 36733 observations.
1\(\%\) of data is assumed to be strictly labeled, \(9\%\) is assumed to be labeled inaccurately, and \(90\%\) of data is considered unlabeled. Since the dataset is large, to model the inaccurate labeling, we estimate the mean vectors \(\mu _i\) and covariance matrices \(\varSigma _i\) by 50 nearest neighbours. For strictly labeled objects, the exact label is used as the mean vector \(\mu _i\), and \(\varSigma _i\) is equal to the zero matrix.
As with synthetic data, regularization coefficients \(\beta =0.001\) and \(\gamma =0.001\) are set. For the WSR-LRCM and C-WSR-LRCM algorithms, a cluster ensemble of size \(r=30\) is used with the number of clusters \(K_i\) in i-th partition: \(K_i=10\,+\,i\), \(i=1,...,30\). The C-WSR-LROMCM algorithm trained with parameters \(r=10\), \(k_{\text {min}}=2\) and \(k_{\text {max}}=50\). Supervised learning algorithms (Multivariate Linear Regression (MLR) and XGBoost (XGB)) are trained only on strictly labeled objects. Note that due to the large amount of data in the dataset, finding the inverse matrix in the RBF variant requires a significant amount of computing resources, especially RAM. The results are presented in Table 2.
Thus the results of the experiments show the considerable improvements in the accuracy for the proposed method.
8 Conclusion
In this paper, we considered the problem of multi-target weakly supervised regression with noisy labelling in a transductive setting. Using the multivariate normal distribution, we described an imprecision model in the multi-output case. We also proposed an algorithm for solving the optimisation problem using the Wasserstein metric and manifold regularisation. To speed up the solution of the optimisation problem, we used the cluster ensemble to obtain the co-association matrix and the low-rank representation technique to compress the resulting matrices.
The presented algorithm has shown its advantage over existing machine learning algorithms that cannot use uncertain multidimensional labels during training. We have also made several important improvements to the calculation of the weighted average co-association matrix by introducing an optimal multimetric weighted average co-association matrix. The new approach can significantly improve the quality and stability of the algorithm, and also simplifies the search for optimal hyperparameters to solve each specific problem.
As a further improvement, one can try different distances between distributions in the optimisation problem, and another promising idea would be to use deep learning approaches to find the co-association matrix. It is also worth considering other imprecision models: for example, using different types of multivariate distributions than normal.
References
Yang, Z., Mahajan, D., Ghadiyaram, D., Nevatia, R., Ramanathan, V.: Activity driven weakly supervised object detection, pp. 2912–2921 (2019). https://doi.org/10.1109/CVPR.2019.00303
Zhou, Z.H.: A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5, 44–53 (2017)
Berikov, V., Litvinenko, A.: Weakly supervised regression using manifold regularization and low-rank matrix representation. In: Pardalos, P., Khachay, M., Kazakov, A. (eds.) MOTOR 2021. LNCS, vol. 12755, pp. 447–461. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-77876-7_30
Bogachev, V.I., Kolesnikov, A.: The Monge-Kantorovich problem: achievements, connections, and perspectives Russ. Math. Surv. 67, 785–890 (2012)
Berikov, V.B.: Cluster ensemble with averaged co-association matrix maximizing the expected margin. In: International Conference on Discrete Optimization and Operations Research (DOOR 2016), vol. 1623, pp. 489–500. CEUR-WS.org (2016)
Higham N.: Accuracy and Stability of Numerical Algorithms. SIAM (2002)
Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the surprising behavior of distance metrics in high dimensional space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44503-X_27
Kaya, H., Tüfekci, P., Uzun, E.: Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS. Turk. J. Electr. Eng. Comput. Sci. 27, 4783–4796 (2019)
Acknowledgements
The work was carried out with the financial support of the Russian Science Foundation, project 22-21-00261. Special thanks to Vladimir Kondratiev for participating in the discussion and experiments.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kalmutskiy, K., Cherikbayeva, L., Litvinenko, A., Berikov, V. (2023). Multi-target Weakly Supervised Regression Using Manifold Regularization and Wasserstein Metric. In: Khachay, M., Kochetov, Y., Eremeev, A., Khamisov, O., Mazalov, V., Pardalos, P. (eds) Mathematical Optimization Theory and Operations Research: Recent Trends. MOTOR 2023. Communications in Computer and Information Science, vol 1881. Springer, Cham. https://doi.org/10.1007/978-3-031-43257-6_27
Download citation
DOI: https://doi.org/10.1007/978-3-031-43257-6_27
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43256-9
Online ISBN: 978-3-031-43257-6
eBook Packages: Computer ScienceComputer Science (R0)