
1 Introduction

The problem of clustering is well known; there are many reviews (such as [1]) on this topic. Clustering, in general, studies the formation of naturally occurring groups within data. The simplest (and still the most popular) approach is perhaps K-means [2]. K-means segments the data by relative distances; samples close to each other (under some pre-defined distance metric) are assumed to belong to the same cluster. Owing to the linear nature of this distance, K-means cannot capture non-linearly separable groups. This issue was partially addressed by the introduction of kernel K-means [3], where instead of a distance between samples, a kernel distance (Gaussian, Laplacian, polynomial, etc.) is used for clustering. Closely related to kernel K-means is spectral clustering [3]. The latter generalizes kernel distances to any affinity measure and applies graph cuts to segment the clusters.

K-means, kernel K-means, and spectral clustering are inter-related. A completely different approach is subspace clustering [4]. In the latter, it is assumed that samples belonging to the same group/cluster lie in the same subspace. There are several variants of subspace clustering, but the most popular among them is sparse subspace clustering (SSC) [5]. In SSC it is assumed that the clusters occupy only a few subspaces (out of all possibilities), hence the epithet "sparse".

So far, we have discussed generic clustering techniques. In single-cell analysis, cell type identification is important for downstream analysis; therefore, clustering forms a crucial step in single-cell RNA expression analysis. Single-cell RNA sequencing (scRNA-seq) measures the transcription level of genes. However, the amount of RNA present in a single cell is very low, due to which some genes are not detected even though they are expressed, and this results in zero-inflated data. The data is further compounded by trivial biological noise such as variability in cell cycle-specific genes. Also, a large number of genes are assayed during an experiment but only a handful of them are useful for cell-type identification. This leads to high feature dimensionality and high feature redundancy in single-cell data. Applying clustering techniques directly to the high-dimensional data causes suboptimal partitioning of cells.

This triggers the need for customized techniques. The existing state-of-the-art clustering techniques for single-cell data do not propose new clustering algorithms per se but apply existing algorithms to extracted/reduced feature sets. One popular technique, Seurat [6], instead of applying a distance-based clustering technique to all the genes, selects highly variable genes from which a shared nearest neighbor graph is constructed for segmentation. GiniClust [7] is similar to the former and differs only in its use of the Gini coefficient for identifying differentiating genes. The single-cell consensus clustering (SC3) [8] algorithm uses principal component analysis (PCA) to reduce the dimensionality and then applies a cluster-based similarity partitioning algorithm for segmentation.

The success of deep learning is well known in every field today. What is interesting to note is that this success has been largely driven by supervised tasks; there are only a handful of fundamental papers on deep learning-based clustering [9]. Deep dictionary learning is a new framework for deep learning. In the past, it has been used for unsupervised feature extraction [10], supervised classification [11], and even domain adaptation [12]. However, it has never been used for clustering; this is the first work on that topic. The advantage of deep dictionary learning is that it is mathematically flexible and can easily accommodate different cost functions. In this work, we propose to incorporate K-means clustering and sparse subspace clustering as losses within the unsupervised framework of deep dictionary learning.

2 Proposed Formulation

There are three pillars of deep learning - convolutional neural networks (CNN), stacked autoencoders (SAE), and deep belief networks (DBN). The discussion on CNNs is not relevant here since they can only handle naturally occurring signals with local correlations. Moreover, they cannot operate in an unsupervised fashion and hence are not a candidate for our topic of interest. Stacked autoencoders have been used for our purpose (deep learning-based clustering); the main issue with SAEs is that they tend to overfit, since one needs to learn twice the number of parameters (encoder and decoder) compared to other standard neural networks. However, SAEs are operationally easy to handle with good mathematical flexibility. DBNs, on the other hand, learn the optimal number of parameters and hence do not overfit. However, the cost function of the DBN is not amenable to mathematical manipulations.

Deep dictionary learning keeps the best of both worlds. It learns the optimal number of parameters like a DBN and has a mathematically flexible cost function, making it amenable to different types of penalties. This is the primary reason for building our clustering on top of the deep dictionary learning (DDL) framework. In our proposed formulation, we regularize the DDL cost function (1) with clustering penalties. Here \(X\) is the given data (in our case, single cells are along the columns and genes are along the rows), and the dictionaries \(D_1,\ldots ,D_N\) are learned to synthesize the data from the learned coefficients \(Z\).

$$\begin{aligned} \mathop {\min }\limits _{{D_1},...{D_N},Z} \left\| {X - {D_1}\varphi \left( {{D_2}\varphi (...\varphi ({D_N}Z))} \right) } \right\| _F^2 \end{aligned}$$
(1)

The first clustering penalty will be with K-means.

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{D_1},{D_2},{D_3},Z,H} \underbrace{\left\| {X - {D_1}{D_2}{D_3}Z} \right\| _F^2\mathrm{{ s}}\mathrm{{.t}}\mathrm{{. }}{D_2}{D_3}Z \ge 0,{D_3}Z \ge 0,Z \ge 0}_{\mathrm{{Dictionary Learning}}}\\ + \underbrace{\left\| {Z - Z{H^T}{{\left( {H{H^T}} \right) }^{ - 1}}H} \right\| _F^2\mathrm{{ s}}\mathrm{{.t}}\mathrm{{. }}{h_{ij}} \in \{ 0,1\} \,\mathrm{{ and }}\,\sum \limits _j {{h_{ij}} = 1} }_{\mathrm{{K - means}}} \end{array} \end{aligned}$$
(2)

Note that we have changed the cost function for dictionary learning. Instead of using activation functions like sigmoid or tanh, we use a ReLU-type cost function by incorporating positivity constraints. The reason for choosing ReLU over the others is its better function approximation capability [13]. The notation in the K-means clustering penalty has been changed accordingly.

In this work, we will follow the greedy approach for solving (2). In the dictionary learning part, we substitute \(Z_{1} = D_{2}D_{3}Z\). This leads to the greedy solution of the first layer of deep dictionary learning.

$$\begin{aligned} \mathop {\min }\limits _{{D_1},{Z_1}} \left\| {X - {D_1}{Z_1}} \right\| _F^2\mathrm{{ s}}\mathrm{{.t}}\mathrm{{. }}{Z_1} \ge 0 \end{aligned}$$
(3)

The second layer of dictionary learning takes the output of the first layer (\(Z_1\)) as its input. The substitution is \(Z_{2} = D_{3}Z\). This leads to the following problem

$$\begin{aligned} \mathop {\min }\limits _{{D_2},{Z_2}} \left\| {{Z_1} - {D_2}{Z_2}} \right\| _F^2\mathrm{{s}}\mathrm{{.t}}\mathrm{{. }}{Z_2} \ge 0 \end{aligned}$$
(4)

For the third (and final) layer no substitution is necessary; only the output from the second layer is fed into it.

$$\begin{aligned} \mathop {\min }\limits _{{D_3},Z} \left\| {{Z_2} - {D_3}Z} \right\| _F^2\mathrm{{s}}\mathrm{{.t}}\mathrm{{. }}Z \ge 0 \end{aligned}$$
(5)

All the problems (3)–(5) can be solved by non-negative matrix factorization techniques; in particular, we have used the multiplicative updates [14]. Although shown here for three layers, it can be extended to any number.
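For concreteness, a minimal sketch of one such layer-wise factorization is given below. It uses the standard Lee-Seung multiplicative updates, which keep both factors non-negative; this is slightly stronger than the constraints in (3)-(5), where only the coefficients are required to be non-negative. The function and parameter names are illustrative, not from the original implementation.

```python
import numpy as np

def nmf_layer(X, k, n_iter=200, eps=1e-10, seed=0):
    """One layer of the greedy factorization, min ||X - D Z||_F^2.

    Standard multiplicative updates (non-negative D and Z); illustrative
    sketch only, not the authors' exact solver.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.random((m, k)) + eps
    Z = rng.random((k, n)) + eps
    for _ in range(n_iter):
        Z *= (D.T @ X) / (D.T @ D @ Z + eps)   # update coefficients
        D *= (X @ Z.T) / (D @ Z @ Z.T + eps)   # update dictionary
    return D, Z
```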

The input to K-means clustering is the set of coefficients from the final (deepest) layer, \(Z\). This is shown as

$$\begin{aligned} \mathop {\min }\limits _H \left\| {Z - Z{H^T}{{\left( {H{H^T}} \right) }^{ - 1}}H} \right\| _F^2\mathrm{{ s}}\mathrm{{.t}}\mathrm{{. }}{h_{ij}} \in \{ 0,1\} \,\mathrm{{ and }}\,\sum \limits _j {{h_{ij}} = 1} \end{aligned}$$
(6)

The standard K-means clustering algorithm is used to solve it.

This concludes our algorithm for K-means embedded deep dictionary learning. Owing to the greedy nature of the solution (there is no feedback from deeper to shallower layers), we cannot claim it to be optimal; however, each of the sub-problems (3)–(6) has a well-known solution.
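A compact sketch of the overall greedy pipeline, stacking the layer-wise factorizations and running K-means on the deepest coefficients, could look as follows; it assumes the nmf_layer helper above, and the two layer sizes mirror the configuration reported in Sect. 3 (these names and defaults are illustrative assumptions).

```python
from sklearn.cluster import KMeans

def ddl_kmeans(X, n_clusters, layer_sizes=(20, None)):
    """Greedy DDL (Eqs. (3)-(5)) followed by K-means (Eq. (6)).

    Sketch only: layer_sizes=(20, n_clusters) mirrors the experimental
    configuration; X has genes in rows and cells in columns.
    """
    sizes = [s if s is not None else n_clusters for s in layer_sizes]
    Z = X
    for k in sizes:                 # solve one factorization per layer
        _, Z = nmf_layer(Z, k)
    # cluster the deepest coefficients (one column per cell)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z.T)
```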

Next, we show how the sparse subspace clustering algorithm can be embedded in the deep dictionary learning framework.

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{{D_1},{D_2},{D_3},Z,C} \underbrace{\left\| {X - {D_1}{D_2}{D_3}Z} \right\| _F^2\mathrm{{ s}}\mathrm{{.t}}\mathrm{{. }}{D_2}{D_3}Z \ge 0,{D_3}Z \ge 0,Z \ge 0}_{\mathrm{{Dictionary Learning}}}\\ + \underbrace{\sum \limits _i {\left\| {{z_i} - {Z_{{i^c}}}{c_i}} \right\| _2^2 + {{\left\| {{c_i}} \right\| }_1},\forall i\,\mathrm{{ in }}\,\{ 1,...,n\} } }_{{\mathrm{Sparse Subspace Clustering}}} \end{array} \end{aligned}$$
(7)

The solution to the deep dictionary learning part remains the same as before; it can be solved greedily using (3)–(5). Once the coefficients from the deepest layer (\(Z\)) are obtained, they are fed into the sparse subspace clustering step. This is given by

$$\begin{aligned} \mathop {\min }\limits _{{c_i}'s} \sum \limits _i {\left\| {{z_i} - {Z_{{i^c}}}{c_i}} \right\| _2^2 + {{\left\| {{c_i}} \right\| }_1},\forall i\,\mathrm{{ in }}\,\{ 1,...,n\} } \end{aligned}$$
(8)

Once (8) is solved, the affinity matrix is created and is further used for segmenting the data using Normalized Cuts.
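As an illustration, problem (8) and the subsequent graph segmentation can be approximated with off-the-shelf tools. The sketch below uses a per-cell lasso for the sparse self-expression and scikit-learn's SpectralClustering as a stand-in for Normalized Cuts; the regularization weight alpha is an assumed value, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_on_coefficients(Z, n_clusters, alpha=0.01):
    """Sparse subspace clustering on the deepest coefficients Z (Eq. (8)).

    Each column z_i is expressed as a sparse combination of the remaining
    columns; the symmetrized |C| serves as the affinity matrix. Sketch
    only; alpha and the spectral-clustering stand-in are assumptions.
    """
    n = Z.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]      # exclude z_i itself
        model = Lasso(alpha=alpha, max_iter=5000)
        model.fit(Z[:, idx], Z[:, i])
        C[idx, i] = model.coef_
    W = np.abs(C) + np.abs(C).T                    # affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```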

3 Experimental Evaluation

3.1 Datasets

To evaluate the performance of the proposed method we used seven single-cell datasets from different studies.

Blakeley: The dataset consists of three cell lineages of the human blastocyst which are obtained using single-cell RNA sequencing (scRNA-seq). This scRNA-seq data of the human embryo gives an insight into early human development and was validated using protein levels. The study consists of 30 transcriptomes from three lineages, namely, human pluripotent epiblast (EPI) cells, extraembryonic trophectoderm cells, and primitive endoderm cells [15].

Cell Line: A microfluidic technology-based protocol, Fluidigm, was used to perform scRNA-seq of 630 single cells acquired from 7 cell lines. Each cell line was sequenced separately; therefore, the original annotations were used directly. Sequencing yielded 9 groups, namely, A549, GM12878 B1, GM12878 B2, H1 B1, H1 B2, H1437, HCT116, IMR90, and K562, since the cell lines GM12878 and H1 each had two batches [16].

Jurkat-293T: This dataset consists of 3,300 transcriptomes from two different cell lines, Jurkat and 293T, combined in vitro in equal proportions (50:50). All transcriptomes are labeled according to the mutations and expression of the cell-type-specific markers CD3D and XIST [17].

Kolodziejczyk: This study reports the scRNA-seq of \(\sim \)704 mouse embryonic stem cells (mESCs) cultured under three different conditions, namely, serum, 2i, and the alternative ground state a2i. The different culture conditions result in different cellular mRNA expression [18].

PBMC: This dataset constitutes \(\sim \)68,000 peripheral blood mononuclear cell (PBMC) transcriptomes from healthy donors. They are annotated into 11 common PBMC subtypes based on their correlation with fluorescence-activated cell sorting (FACS)-based purified bulk RNA-seq data of common PBMC subtypes. For this study, we randomly sampled 100 cells from each annotated subtype and retained the complete cluster whenever it contained fewer than 100 cells [17].

Usoskin: The data consists of 799 transcriptomes from mouse lumbar dorsal root ganglion (DRG). The authors used an unsupervised approach to cluster the cells. Out of 799 cells, 622 cells were classified as neurons, 68 cells had an ambiguous assignment and 109 cells were non-neuronal. The 622 mouse neuron cells were further classified into four major groups, namely, neurofilament containing (NF), non-peptidergic nociceptors (NP), peptidergic nociceptors (PEP), and tyrosine hydroxylase containing (TH), based on well-known markers [19].

Zygote: The RNA-sequencing data consists of 265 single cells of mouse preimplantation embryos. It contains expression profiles of cells from the zygote, early 2-cell, middle 2-cell, late 2-cell, 4-cell, 8-cell, 16-cell, early blastocyst, middle blastocyst, and late blastocyst stages [20].

3.2 Numerical Results

In the first set of experiments, we compared the proposed algorithm with two state-of-the-art deep learning techniques. The first is a stacked autoencoder (SAE) comprising two hidden layers; the number of neurons in the first hidden layer is 20, and the number of nodes in the second layer equals the number of cell types in the single-cell data. The second benchmark is a deep belief network (DBN). Like the SAE, the DBN also has two hidden layers, with 100 nodes in the first layer and as many nodes in the second layer as there are clusters in the given dataset. For our proposed deep dictionary learning (DDL), the number of nodes in the first layer was 20 and the number in the second equals the number of cell types (the same configuration as the SAE). These configurations yielded the best results. Both benchmark techniques, as well as the proposed method, apply the K-means algorithm to the deepest layer of features to determine the clusters in the data.

To determine how well SAE, DBN, and the proposed method segregate different cell types using their respective deepest layers of features, we employed two clustering metrics: the adjusted Rand index (ARI) and normalized mutual information (NMI), since the ground-truth annotation (class) of each sample or cell is known a priori (Table 1).
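For reference, both metrics can be computed directly with scikit-learn; the label arrays below are placeholders standing in for the published annotations and the predicted cluster assignments.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Placeholder labels; in practice these are the published cell-type
# annotations and the cluster assignments from a compared method.
true_labels = np.array([0, 0, 1, 1, 2, 2])
pred_labels = np.array([0, 0, 1, 2, 2, 2])

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```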

Table 1. Clustering accuracy of the proposed method and existing deep learning techniques on single-cell datasets.

We see that the proposed method improves over existing deep learning tools by a large margin. Only in the case of PBMC are the results from SAE a close second.

In the next set of experiments, we used two well-known single-cell clustering methods, namely, GiniClust [7] and Seurat [6] as benchmark techniques. For both of our proposed methods (K-means and SSC) the configuration remains the same as before.

Table 2. Clustering accuracy of the proposed method and single-cell clustering algorithms on single-cell datasets

GiniClust could not yield any clustering results for the Cell Line dataset. It performs clustering by utilizing genes with a high Gini coefficient value, but for this particular dataset it could not identify any highly variable gene and hence could not cluster. Overall, GiniClust almost always yields the worst results.

Among the proposed techniques (K-means and SSC), we find that K-means is more stable and consistently yields good results. Results from SSC fluctuate, ranging from perfect clustering on Blakeley to poor results on Kolodziejczyk, PBMC, and Usoskin. Only for the Kolodziejczyk and PBMC datasets does Seurat yield results comparable to Proposed + K-means; for the rest, Seurat is considerably worse than either of our techniques (Table 2).

4 Conclusion

This work proposes a deep dictionary learning-based clustering framework. Given the input (where single cells are along the columns and genes are along the rows), it generates a low-dimensional embedding of the data, which is fed into a clustering algorithm. The low-dimensional embedding represents each transcriptome; it is learned in such a manner that the final output is naturally clustered.

To evaluate the proposed method, we have compared against state-of-the-art deep learning techniques (SAE and DBN) and tailored single-cell RNA clustering techniques (GiniClust and Seurat). Our method yields the best overall results.

The current approach is greedy and hence sub-optimal; there is no feedback between the deeper and shallower layers. In the future, we would like to jointly solve the complete formulations (2) and (7) using state-of-the-art optimization tools.