
1 Introduction

In the past few years, advances in single-cell RNA sequencing (scRNA-seq) technology have opened a new window onto biological mechanisms at the single-cell level and enabled scientists to explore gene expression profiles cell by cell [1, 2]. By mining and analyzing scRNA-seq data, we can study cell heterogeneity and identify subpopulations. Identifying cell types from scRNA-seq data, usually cast as an unsupervised clustering problem, facilitates the extraction of meaningful biological information. With a clustering model, cells that are highly similar are grouped into the same cluster. Because of biological factors and technical limitations, however, scRNA-seq data tend to be high-dimensional, sparse and noisy. Consequently, classical clustering methods such as K-means [3] and Spectral Clustering (SC) [4] are not well suited to scRNA-seq data, and unreliable clustering results compromise downstream analysis.

To address the difficulties of clustering scRNA-seq data, researchers have proposed numerous clustering methods. For instance, building on an in-depth study of shared nearest neighbors, Xu and Su developed SNN-Cliq, a quasi-clique-based clustering method that performs well on high-dimensional single-cell data [5]. Based on multiple-kernel learning, Wang et al. proposed SIMLR, which jointly performs dimensionality reduction and clustering [6]. Park et al. proposed MPSSC, which modifies the SC framework by adding a sparse structure constraint and constructs the similarity matrix from multiple doubly stochastic affinity matrices [7]. Jiang et al. proposed the Corr model, which takes into account pairwise cell differentiability correlation and variance [8].

At the same time, researchers have also proposed a number of subspace clustering methods and shown that the similarity obtained by subspace clustering based on low-rank representation (LRR) is more robust than the pairwise similarities used in the methods mentioned above [9, 10]. For example, Liu et al. proposed the LatLRR method, which integrates feature extraction and subspace learning into a unified framework to better cope with severely corrupted observations [11]. Zheng et al. presented SinNLRR, a low-rank-based clustering method that exploits the global information of the data by imposing low-rank and non-negative constraints on the similarity matrix [10]. To explore the local information of the data, Zhang et al. proposed SCCLRR, which builds on SinNLRR by adding local feature descriptions to capture both global and local information [9]. Zheng et al. proposed the AdaptiveSSC method based on subspace learning to address the noise and high dimensionality of single-cell data, achieving improved performance on multiple experimental datasets [12].

In this paper, we propose a single-cell clustering method called Hypergraph regularization sparse low-rank representation with similarity constraint based on tired random walk (THSLRR), which aims to capture the global structure and local information of scRNA-seq data simultaneously in subspace learning. Concretely, on the basis of the sparse LRR model, a hypergraph regularization based on manifold learning is introduced to mine the complex high-order relationships in scRNA-seq data. At the same time, a similarity constraint based on the tired random walk (TRW) further improves the learning ability of the model. The final sparse low-rank symmetric matrix \({Z}^{*}\) obtained by THSLRR is further processed to learn the affinity matrix \(H\), which is then used for spectral clustering of single cells, t-distributed stochastic neighbor embedding (t-SNE) [13] visualization of cells, and gene prioritization. Figure 1 illustrates the overall workflow and applications of THSLRR.

Fig. 1. The framework of THSLRR for scRNA-seq data analysis.

2 Method

2.1 Sparse Low-Rank Representation

The LRR model is a representative subspace clustering method that is widely used in data mining, machine learning and other fields. Its central objective is to find the lowest-rank representation of the data with respect to a given dictionary [14]. Given the scRNA-seq data matrix \(X=[{X}_{1},{X}_{2},\dots ,{X}_{n}]\in {R}^{m\times n}\), where \(m\) represents the number of genes and \(n\) is the number of cells, the LRR formulation is as follows:

$$\underset{Z,E}{\mathit{min}}{\Vert Z\Vert }_{*}+\gamma {\Vert E\Vert }_{2,1}\; {\textit{s.t.}} \; X=XZ+E.$$
(1)

Here, \({\Vert *\Vert }_{*}\) denotes the nuclear norm of a matrix and \({\Vert *\Vert }_{\mathrm{2,1}}\) is the \({l}_{\mathrm{2,1}}\) norm. \(E\) is the error term and \(Z\) is the coefficient matrix to be optimized toward the lowest rank. \(\gamma >0\) is a parameter balancing the influence of the error term.

The sparse representation model obtains a sparse coefficient matrix that reveals the close relationships between data points, which is equivalent to solving the following optimization problem:

$$\underset{Z}{\mathit{min}}{\Vert Z\Vert }_{1}\; {\textit{s.t.}} \; X=XZ,$$
(2)

where \({\Vert *\Vert }_{1}\) is the \({l}_{1}\) norm. We further combine the sparse and low-rank constraints to extract salient features and remove noise, yielding the sparse LRR model:

$$\underset{Z,E}{\mathit{min}}{\Vert Z\Vert }_{*}+\lambda {\Vert Z\Vert }_{1}+\gamma {\Vert E\Vert }_{\mathrm{2,1}}\; {\textit{s.t.}} \; X=XZ+E.$$
(3)

Here, \(\lambda\) and \(\gamma\) are regularization parameters.
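For reference, the terms of the sparse LRR objective in Eq. (3) can be evaluated directly. The following NumPy sketch (the function names are ours, and feasibility of the constraint \(X=XZ+E\) is assumed to be handled by the solver) illustrates the three norms involved:

```python
import numpy as np

def nuclear_norm(Z):
    # ||Z||_*: sum of the singular values
    return np.linalg.svd(Z, compute_uv=False).sum()

def l1_norm(Z):
    # ||Z||_1: sum of the absolute entries
    return np.abs(Z).sum()

def l21_norm(E):
    # ||E||_{2,1}: sum of the column-wise l2 norms (one column per cell)
    return np.linalg.norm(E, axis=0).sum()

def sparse_lrr_objective(Z, E, lam, gamma):
    # Objective value of Eq. (3); the constraint X = XZ + E is not checked here.
    return nuclear_norm(Z) + lam * l1_norm(Z) + gamma * l21_norm(E)
```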

2.2 Hypergraph Regularization

Extracting local information from high-dimensional, sparse and noisy data is also a problem worth considering. We therefore exploit a hypergraph to encode higher-order geometric relationships among multiple sample points, which more fully extracts the underlying local information of scRNA-seq data.

For a given hypergraph \(G=(V,E,W)\), \(V=\{{v}_{1},{v}_{2},\dots , {v}_{n}\}\) is the set of vertices, \(E=\{{e}_{1},{e}_{2},\dots , {e}_{r}\}\) is the set of hyperedges, and \(W\) is the hyperedge weight matrix. The incidence matrix \(R\) of the hypergraph \(G\) is defined as follows:

$$R\left( {v,e} \right) = \left\{ {\begin{array}{*{20}l} 1 & {if\;v \in e} \\ 0 & {others} \end{array} } \right.$$
(4)

The weight \(w\left({e}_{i}\right)\) of hyperedge \({e}_{i}\) is obtained by the following formula:

$$w\left({e}_{i}\right)=\mathop{\sum}\nolimits_{{\{{v}_{i},v}_{j}\}\in {e}_{i}}{exp}^{-\frac{{\Vert {v}_{i}-{v}_{j}\Vert }_{2}^{2}}{{\delta }^{2}}},$$
(5)

where \(\delta =\mathop{\sum}\nolimits_{{\{{v}_{i},v}_{j}\}\in {e}_{i}}{\Vert {v}_{i}-{v}_{j}\Vert }_{2}^{2}/k\), and \(k\) represents the number of nearest neighbors of each vertex. The degree \(d\left(v\right)\) of vertex \(v\) is as follows:

$$d\left(v\right)=\mathop{\sum}\limits_{e\in E}w\left(e\right)R\left(v,e\right).$$
(6)

The degree \(g(e)\) of hyperedge \(e\) is as follows:

$$g(e)=\mathop{\sum}\limits_{v\in V}R\left(v,e\right).$$
(7)

Then, we obtain the non-normalized hypergraph Laplacian matrix \({L}_{hyper}\), as shown below:

$${L}_{hyper}={D}_{v}-R{W}_{H}{\left({D}_{H}\right)}^{-1}{R}^{T}.$$
(8)

Here, the vertex degree matrix \({D}_{v}\), the hyperedge degree matrix \({D}_{H}\) and the hyperedge weight matrix \({W}_{H}\) are diagonal matrices whose diagonal elements are \(d\left(v\right)\), \(g(e)\) and \(w\left(e\right)\), respectively.

Let \({z}_{i}\) and \({z}_{j}\) be the representations of the original data points \({x}_{i}\) and \({x}_{j}\) under the new basis; the hypergraph regularization term is then formulated as follows:

$$\begin{aligned} \underset{Z}{\mathit{min} }\; \frac{1}{2}\mathop{\sum}\limits_{e\in E}\mathop{\sum}\limits_{\left(i,j\right)\in e}\frac{w\left(e\right)}{g\left(e\right)}{\Vert {z}_{i}-{z}_{j}\Vert }^{2} & =\, \underset{Z}{min}\; tr\left(Z\left({D}_{v}-R{W}_{H}{\left({D}_{H}\right)}^{-1}{R}^{T}\right){Z}^{T}\right) \\ &=\,\underset{Z}{min }\; tr\left(Z{L}_{hyper}{Z}^{T}\right) \end{aligned}$$
(9)
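As a concrete illustration of the construction above, the following sketch builds the hypergraph Laplacian of Eq. (8) with one hyperedge per cell (the cell together with its \(k\) nearest neighbors). This is a common construction and may differ in details from our implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hypergraph_laplacian(X, k=5):
    """Unnormalized hypergraph Laplacian, Eq. (8).
    X: (m genes, n cells); one hyperedge per cell = {cell} U {its k nearest neighbors}."""
    n = X.shape[1]
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X.T).kneighbors(X.T)

    R = np.zeros((n, n))                           # incidence matrix, Eq. (4)
    w = np.zeros(n)                                # hyperedge weights, Eq. (5)
    for i in range(n):
        R[idx[i], i] = 1
        delta = (dist[i, 1:] ** 2).sum() / k       # bandwidth delta as defined after Eq. (5)
        w[i] = np.exp(-(dist[i, 1:] ** 2) / (delta ** 2 + 1e-12)).sum()

    W_H = np.diag(w)
    D_v = np.diag(R @ w)                           # vertex degrees, Eq. (6)
    D_H = np.diag(R.sum(axis=0))                   # hyperedge degrees, Eq. (7)
    return D_v - R @ W_H @ np.linalg.inv(D_H) @ R.T

# The regularizer of Eq. (9) is then np.trace(Z @ L_hyper @ Z.T) for a candidate Z.
```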

2.3 Tired Random Walk

The TRW model was proposed in [15] and has proved to be a practical similarity measure on nonlinear manifolds [16]. A TRW-based similarity constraint can therefore not only improve the ability of the model to learn the overall geometric information of the data, but also ensure the symmetry of the similarity matrix, giving the model better interpretability.

For an undirected weighted graph with \(n\) vertices, the transition probability matrix of the random walk is \(P={D}^{-1}W\), where \(W\) is the affinity matrix of the graph and \(D\) is the diagonal degree matrix with \({D}_{ii}=\mathop{\sum }\nolimits_{j=1}^{n}{W}_{ij}\). According to [17], the cumulative transition probability matrix over all path lengths is \({P}_{TRW}=\mathop{\sum}\nolimits_{s=0}^{\infty }{(\tau P)}^{s}\), where \(\tau \in (\mathrm{0,1})\); since the eigenvalues of \(P\) lie in \([\mathrm{0,1}]\), the series converges and the TRW matrix is as follows:

$${P}_{TRW}=\mathop{\sum}\limits_{s=0}^{\infty }{(\tau P)}^{s}={\left(I-\tau P\right)}^{-1}.$$
(10)

To weaken the effect of errors in the original samples and ensure that paired sample points have consistent correlation weights, we further symmetrize \({P}_{TRW}\) to obtain the final TRW similarity matrix \(S\in {R}^{n\times n}\) as follows:

$$S\left({x}_{i},{x}_{j}\right)=\frac{{\left({P}_{TRW}\right)}_{ij}+{\left({P}_{TRW}\right)}_{ji}}{2}.$$
(11)
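The TRW similarity of Eqs. (10)–(11) reduces to a single matrix inversion. A minimal sketch (the function name and the default \(\tau\) are ours):

```python
import numpy as np

def trw_similarity(W, tau=0.5):
    """TRW similarity, Eqs. (10)-(11). W: symmetric non-negative cell affinity matrix."""
    P = np.diag(1.0 / W.sum(axis=1)) @ W                   # transition matrix P = D^{-1} W
    P_trw = np.linalg.inv(np.eye(W.shape[0]) - tau * P)    # Eq. (10): sum_s (tau P)^s
    return (P_trw + P_trw.T) / 2                           # Eq. (11): symmetrization
```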

2.4 Objective Function of THSLRR

THSLRR learns the representation matrix \(Z\in {R}^{n\times n}\) from the scRNA-seq data matrix \(X=[{X}_{1},{X}_{2},\dots ,{X}_{n}]\in {R}^{m\times n}\) with \(m\) genes and \(n\) cells through the following objective function:

$$\begin{array}{c}\underset{Z,E}{\mathit{min}}{\Vert Z\Vert }_{*}+{\lambda }_{1}{\Vert Z\Vert }_{1}+{\lambda }_{2}tr\left(Z{L}_{hyper}{Z}^{T}\right)+\beta {\Vert Z-S\Vert }_{F}^{2}+\gamma {\Vert E\Vert }_{\mathrm{2,1}} \\ {\textit{s.t.}} \; X=XZ+E, Z\ge 0, \end{array}$$
(12)

where \(Z\) is the coefficient matrix to be optimized, \({L}_{hyper}\in {R}^{n\times n}\) is the hypergraph Laplacian matrix, \(S\in {R}^{n\times n}\) is the symmetric cell similarity matrix generated by TRW, \(E\in {R}^{m\times n}\) is the error term, \({\Vert *\Vert }_{F}\) is the Frobenius norm, and \({\lambda }_{1}\), \({\lambda }_{2}\), \(\beta\) and \(\gamma\) are penalty parameters.

2.5 Optimization Process and Spectral Clustering of THSLRR Method

The objective function of THSLRR, which involves multiple constraints, is a convex optimization problem. To solve problem (12) effectively, we adopt the Linearized Alternating Direction Method with Adaptive Penalty (LADMAP) [18].

First, we separate the objective function (12) by introducing an auxiliary variable \(J\), obtaining formula (13):

$$\begin{array}{c}\underset{Z,E,J}{\mathit{min}}{\Vert Z\Vert }_{*}+{\lambda }_{1}{\Vert J\Vert }_{1}+{\lambda }_{2}tr\left(Z{L}_{hyper}{Z}^{T}\right)+\beta {\Vert Z-S\Vert }_{F}^{2}+\gamma {\Vert E\Vert }_{\mathrm{2,1}} \\ {\textit{s.t.}} \; X=XZ+E, Z=J, Z\ge 0.\end{array}$$
(13)

Then, the augmented Lagrangian multiplier method is applied to remove the linear constraints in (13), which yields the following formula:

$$\begin{aligned}L\left(Z,E,J,{Y}_{1},{Y}_{2}\right)={\Vert Z\Vert }_{*} &+\,{\lambda }_{1}{\Vert J\Vert }_{1}+{\lambda }_{2}tr\left(Z{L}_{hyper}{Z}^{T}\right)+\beta {\Vert Z-S\Vert }_{F}^{2}+\gamma {\Vert E\Vert }_{\mathrm{2,1}}\\ &+\,\langle {Y}_{1}, X-XZ-E\rangle +\langle {Y}_{2}, Z-J\rangle \\ & + \,\frac{\mu }{2}\left({\Vert X-XZ-E\Vert }_{F}^{2}+{\Vert Z-J\Vert }_{F}^{2}\right).\end{aligned}$$
(14)

Here, \(\mu\) is a penalty parameter, and \({Y}_{1}\) and \({Y}_{2}\) are Lagrangian multipliers.

Finally, the optimization problem is solved by updating each variable in turn while the others are fixed. The update rules for \(Z\), \(E\) and \(J\) are as follows:

$${Z}_{k+1}={\theta }_{\frac{1}{\eta \mu }}\left({Z}_{k}-\frac{{\nabla }_{Z}q\left({Z}_{k}\right)}{\eta }\right).$$
(15)
$${E}_{k+1}\left(i,:\right)=\left\{\begin{array}{ll}\frac{\Vert {p}_{i}\Vert -\frac{\gamma }{{\mu }_{k}}}{\Vert {p}_{i}\Vert }{p}_{i}, & \frac{\gamma }{{\mu }_{k}}<\Vert {p}_{i}\Vert \\ 0, & otherwise\end{array}\right.$$
(16)
$${J}_{k+1}=max\left\{{\theta }_{\frac{{\lambda }_{1}}{{\mu }_{k}}}\left({Z}_{k+1}+{Y}_{2}^{k}/{\mu }_{k}\right), 0\right\}.$$
(17)
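In Eqs. (15)–(17), \(\theta\) denotes soft thresholding, applied to the singular values in the \(Z\)-update and element-wise in the \(J\)-update, and the \(E\)-update is the standard \({l}_{\mathrm{2,1}}\) shrinkage; this is the usual reading in the LADMAP literature. A minimal sketch of these proximal operators (our naming):

```python
import numpy as np

def soft_threshold(A, tau):
    # Element-wise shrinkage, used in the J-update of Eq. (17)
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    # Singular value thresholding, used in the Z-update of Eq. (15)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(P, tau):
    # Shrinkage of Eq. (16), applied row by row as written there
    out = np.zeros_like(P)
    for i in range(P.shape[0]):
        norm = np.linalg.norm(P[i, :])
        if norm > tau:
            out[i, :] = (norm - tau) / norm * P[i, :]
    return out
```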

Solving the THSLRR model yields the sparse low-rank symmetric matrix \({Z}^{*}\), whose off-diagonal elements correspond to the similarity weights between sample points. Inspired by [19], we use the principal direction angle information of \({Z}^{*}\) to learn the affinity matrix \(H\). Finally, the learned matrix \(H\) is used as the input of the SC method to obtain the clustering results.
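Given \({Z}^{*}\), the final clustering step can be sketched as follows; for illustration the affinity is a simple symmetrization of \(|{Z}^{*}|\), whereas the paper builds \(H\) from the principal-direction angle information of \({Z}^{*}\) following [19]:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(Z_star, n_clusters):
    # Simplified affinity; the angle-based construction of [19] is not reproduced here.
    H = (np.abs(Z_star) + np.abs(Z_star.T)) / 2
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(H)
```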

3 Results and Discussion

3.1 Evaluation Measurements

In the experiments, two commonly used indicators are adopted to assess the effectiveness of THSLRR, namely the adjusted Rand index (ARI) [20] and normalized mutual information (NMI) [21]. ARI takes values in \(\left[-\mathrm{1,1}\right]\), while NMI takes values in \(\left[\mathrm{0,1}\right]\).

Given the true cluster labels \(T=\{{T}_{1},{T}_{2},\dots ,{T}_{K}\}\) and the predicted cluster labels \(Y=\{{Y}_{1},{Y}_{2},\dots ,{Y}_{K}\}\) of \(n\) sample points, ARI is defined as follows:

$$ARI\left(T,Y\right)=\frac{\left(\genfrac{}{}{0pt}{}{n}{2}\right)\left({a}_{ty}+a\right)-\left[\left({a}_{ty}+{a}_{t}\right)\left({a}_{ty}+{a}_{y}\right)+\left({a}_{t}+a\right)\left({a}_{y}+a\right)\right]}{{\left(\genfrac{}{}{0pt}{}{n}{2}\right)}^{2}-\left[\left({a}_{ty}+{a}_{t}\right)\left({a}_{ty}+{a}_{y}\right)+\left({a}_{t}+a\right)\left({a}_{y}+a\right)\right]}.$$
(18)

Here, \({a}_{ty}\) denotes the number of data point pairs assigned to the same cluster in both \(T\) and \(Y\), \({a}_{t}\) denotes the number of pairs in the same cluster in \(T\) but in different clusters in \(Y\), \({a}_{y}\) denotes the number of pairs in the same cluster in \(Y\) but not in \(T\), and \(a\) is the number of pairs that are in different clusters in both \(Y\) and \(T\).

NMI is defined as follows:

$$NMI\left(T,Y\right)=\frac{\mathop{\sum}\limits_{t\in T}\mathop{\sum}\limits_{y\in Y}p\left(t,y\right)\mathit{ln}\left(\frac{p\left(t,y\right)}{p\left(t\right)p\left(y\right)}\right)}{\sqrt{H\left(T\right)\cdot H\left(Y\right)}},$$
(19)

Here, \(H\left(T\right)\) and \(H\left(Y\right)\) represent the information entropy of the labels \(T\) and \(Y\), respectively. \(p\left(t\right)\) and \(p\left(y\right)\) are the marginal distributions of \(t\) and \(y\), and \(p\left(t,y\right)\) is their joint distribution.
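Both scores can be computed with scikit-learn; in the sketch below, `true_labels` and `pred_labels` are placeholder label vectors, and `average_method="geometric"` matches the \(\sqrt{H\left(T\right)\cdot H\left(Y\right)}\) normalization in Eq. (19):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels, average_method="geometric")
```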

Table 1. The scRNA-seq data sets used in experiments.

3.2 scRNA-seq Datasets

In this paper, nine different scRNA-seq datasets are used for the experimental analysis: Treutlein [22], Ting [23], Pollen [24], Deng [25], Goolam [26], Kolod [27], mECS, Engel4 [28] and Darmanis [29]. Detailed information on the nine datasets is given in Table 1.

Table 2. The optimal values of four parameters for scRNA-seq data sets.
Fig. 2. Sensitivity of different parameters to clustering performance on nine scRNA-seq data sets. (a) \({\lambda }_{1}\) varying. (b) \({\lambda }_{2}\) varying. (c) \(\beta\) varying. (d) \(\gamma\) varying.

3.3 Parameters Setting

In this part, we discuss the influence of the different parameters on the effectiveness of the THSLRR method. We use grid search to determine the optimal parameter combination: each of the four parameters is varied over the interval \([{10}^{-5}, {10}^{5}]\) while the other parameters are fixed, which produces Fig. 2. As Fig. 2 shows, the clustering results are insensitive to \({\lambda }_{1}\), while \({\lambda }_{2}\), \(\beta\) and \(\gamma\) have a larger impact on model performance. Fortunately, within a certain range an appropriate combination of parameters can be chosen to achieve the optimal clustering result. The optimal parameters for the different datasets are given in Table 2.
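The sensitivity analysis can be sketched as a coordinate-wise sweep; `run_thslrr`, `spectral_cluster`, `X`, `n_clusters` and `true_labels` below are placeholders rather than functions defined in this paper:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

grid = np.logspace(-5, 5, 11)                      # 1e-5 ... 1e5
defaults = {"lam1": 1.0, "lam2": 1.0, "beta": 1.0, "gamma": 1.0}
for name in defaults:
    for value in grid:
        params = dict(defaults, **{name: value})   # vary one parameter, fix the rest
        Z = run_thslrr(X, **params)                # placeholder: fit THSLRR
        labels = spectral_cluster(Z, n_clusters)   # placeholder: cluster from Z
        print(name, value, adjusted_rand_score(true_labels, labels))
```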

3.4 Comparative Analysis of Clustering

We conduct experiments on the nine scRNA-seq data sets described in Table 1 to evaluate the clustering performance of THSLRR. t-SNE, K-means, SIMLR, SC, Corr, MPSSC and SinNLRR are selected as comparison methods. To ensure a fair and objective comparison, we supply the true number of clusters to THSLRR and the other seven methods, and all of their parameters are set to their optimal values. The comparison results are shown in Fig. 3 and Table 3.

By observing Fig. 3 and Table 3, we can draw the following conclusions:

  1) In Fig. 3(a), the median ARI of every comparison method across all datasets is below 0.7, while the median value of THSLRR is greater than 0.9. Furthermore, the box plot of THSLRR is the flattest of the eight methods, indicating that its performance is more stable. Similar results can be found in Fig. 3(b).

  2) In Table 3, SinNLRR outperforms SIMLR, MPSSC and Corr on most datasets, and its average ARI is approximately 11%, 6% and 20% higher, respectively. THSLRR exceeds SIMLR, MPSSC and Corr on all datasets except mECS, and outperforms them in average ARI by about 27%, 22% and 36%, respectively. The low-rank-based clustering methods SinNLRR and THSLRR thus achieve satisfactory results on most of the datasets, once again indicating the critical contribution of global information to clustering performance. In contrast, SIMLR, MPSSC and Corr only take into account the local information between samples, so their clustering performance is not as impressive as that of SinNLRR and THSLRR on most of the datasets.

  3) Table 3 also shows that THSLRR exceeds SinNLRR by about 16% in ARI score. Two main factors account for this. First, THSLRR uses hypergraph regularization to thoroughly mine the complex high-order relationships in scRNA-seq data, whereas SinNLRR only considers the overall information of the data. Second, the TRW-based similarity captures the global manifold structure of the data and improves the learning ability of the model.

In conclusion, THSLRR achieves the best results on most datasets, and its average ARI and NMI are approximately 12% and 22% higher than those of the comparison methods. The THSLRR method is therefore sound and offers clear advantages for cell type identification.

Fig. 3. Clustering results of eight clustering methods on nine scRNA-seq data sets. (a) ARI. (b) NMI.

Table 3. The clustering performance on the scRNA-seq data

3.5 Visualize Cells Using t-SNE

Following [6], we use the improved t-SNE to map the learned matrix \(H\) into two-dimensional space to examine the structure-preserving performance of the THSLRR method. Because of space limitations, we only present the visualization results for the Treutlein, Ting, Pollen and Darmanis datasets.
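Since the modified t-SNE of [6] takes a similarity matrix as input, one way to approximate this step with standard tools is to convert the learned affinity \(H\) into a distance matrix and run t-SNE on it (a rough sketch; the conversion below is an assumption, not the procedure of [6]):

```python
import numpy as np
from sklearn.manifold import TSNE

D = 1.0 - H / H.max()                       # crude affinity-to-distance conversion (assumption)
np.fill_diagonal(D, 0.0)
embedding = TSNE(n_components=2, metric="precomputed",
                 init="random", random_state=0).fit_transform(D)
```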

As shown in Fig. 4(a), THSLRR does not separate class 1 from class 4 on the Treutlein data, but the boundaries among the other cell types are clear. SinNLRR does not distinguish the three cell types 1, 3 and 4, and the boundary between classes 2 and 5 is also very blurred. The cell distributions produced by t-SNE, SIMLR and MPSSC are likewise scattered. In Fig. 4(b), the result of t-SNE is the worst, SIMLR splits cells belonging to the same class into two clusters, SinNLRR and MPSSC fail to separate two of the cell types, and THSLRR correctly separates five cell types. None of the methods shows fully satisfactory results on the Pollen and Darmanis datasets in Fig. 4(c) and Fig. 4(d), but THSLRR performs best overall: almost all cells belonging to the same cluster are grouped together, and the boundaries between clusters are relatively clear.

Fig. 4. Visualization results of the cells on (a) Treutlein, (b) Ting, (c) Pollen, and (d) Darmanis datasets.

3.6 Gene Markers Prioritization

In this section, the affinity matrix \(H\) learned by THSLRR is used to prioritize genes. First, the bootstrap Laplacian score proposed in [6] is applied to the matrix \(H\) to identify gene markers. Then, the genes are ranked in descending order of their importance in distinguishing cell subpopulations. Finally, the top ten genes are selected for visual analysis. We use the Engel4 and Darmanis datasets for gene marker analysis.
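For reference, the plain (non-bootstrap) Laplacian score underlying this procedure can be sketched as follows; a lower score indicates a gene that varies more smoothly over the cell graph defined by \(H\) and is therefore ranked higher:

```python
import numpy as np

def laplacian_scores(X, H):
    """Laplacian score of each gene. X: (m genes, n cells); H: (n, n) affinity.
    The paper uses the bootstrap variant of [6]; this is the basic score."""
    d = H.sum(axis=1)
    L = np.diag(d) - H
    scores = np.empty(X.shape[0])
    for g in range(X.shape[0]):
        f = X[g, :]
        f_tilde = f - (f @ d) / d.sum()          # remove the degree-weighted mean
        scores[g] = (f_tilde @ L @ f_tilde) / (f_tilde @ (d * f_tilde) + 1e-12)
    return scores                                 # sort ascending to prioritize markers
```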

For the Darmanis and Engel4 data sets, the top 10 gene markers are shown in Fig. 5(a) and Fig. 5(b), respectively. The color of the ring indicates the mean expression level of the gene: the darker the color, the higher the average expression level. The size of the ring indicates the percentage of cells expressing the gene.

Figure 5(a) shows the top ten genes of the Darmanis data set. The genes SLC1A3, SLC1A2, SPARCL1 and AQP4 are highly expressed in astrocytes and play an essential part in early astrocyte development. In fetal quiescent cells, SOX4, SOX11, TUBA1A and MAP1B are highly expressed and have been proven to be marker genes with specific roles [30,31,32,33]; MAP1B is also highly expressed in neurons. PLP1 and CLDND1, which are highly expressed in oligodendrocytes, can be regarded as gene markers of oligodendrocytes [34]. In the Engel4 data, as shown in Fig. 5(b), Serpinb1a, Tmsb10, Hmgb2 and Malat1 have been confirmed as markers by Engel et al. [28]. The remaining genes have also been reported as marker genes in the related literature [35, 36].

Fig. 5. The top ten gene markers. (a) Darmanis data set. (b) Engel4 data set.

4 Conclusion

In this paper, we propose a subspace-learning-based clustering method named THSLRR. Our method differs from other subspace clustering methods in two main respects. The first is the introduction of hypergraph regularization, which encodes higher-order geometric relationships among the data and mines their internal information, allowing our method to extract complex relationships that other subspace clustering methods cannot. The second is the TRW-based similarity constraint, which mines the global nonlinear manifold structure of the data and improves both the clustering performance and the interpretability of the model. Comparative experiments demonstrate the effectiveness of the THSLRR method. Moreover, THSLRR can provide guidance for data mining and be employed in other related domains.

Finally, we discuss the limitations of our model. First, although the optimal combination of parameters can be found by grid search, it would be helpful if the optimal parameters could be determined automatically based on some strategy. Second, our model uses a single similarity criterion, which may not capture the similarity information of the data comprehensively, so in future work we will try measurement fusion to obtain more accurate prior information.