Advances in technology over the past few years have enabled us to profile various types of genome-wide molecular features at single-cell resolution, including DNA, gene expression, protein binding, histone modifications, and chromatin accessibility. Previously, such genomic approaches could only be applied to bulk tissue samples composed of an ensemble of many cells, providing average genomic measures but masking cell-to-cell differences [1]. Different types of genomic features capture complementary information, and together they provide a more complete biological picture.

The high technical variation and high level of noise present in single-cell datasets, especially in single-cell epigenomic datasets, pose challenges for the extraction of biological variation. The large scale of single-cell datasets also necessitates efficient algorithms. In this review, we focus on computational methods developed for single-cell multi-omics data. Depending on the data structure, the computational methods fall into two broad categories: methods that integrate data from multiple molecular modalities profiled in different cells from similar biological tissue, and methods that integrate data from multiple molecular modalities profiled simultaneously in the same cells. Throughout, we use “scRNA-Seq data” to represent single-cell gene expression data and “scATAC-Seq data” to represent single-cell chromatin accessibility data, acknowledging that multiple platforms with different names can profile gene expression and chromatin accessibility at single-cell resolution.

Methods that integrate multiple scRNA-Seq datasets have also been developed [2,3,4,5,6], but they will not be discussed in detail in this review. Methods that integrate multi-omics data obtained from bulk tissues have also been developed; these are summarized and reviewed in [7].

1 Multi-Omics Data Profiled on Different Cells

Cells are sacrificed in single-cell experiments, so it is experimentally more challenging to obtain multiple types of genomic data from the same cell than to obtain such data from the same sample in bulk genomic experiments. Computational methods have therefore been developed for the setting where multiple types of genomic data are obtained from different subsets of cells from a similar cell population (i.e., tissue).

Based on their goals, the methods can be classified into the following categories: (a) Methods that learn low-dimensional embeddings in which different types of genomic features are aligned to the same latent space. After cells from different modalities are aligned, cell type identification can be achieved by a separate clustering step using the low-dimensional representation of the cells. The methods in this category include coupled NMF [8], DC3 [9], Seurat V3 [10], LIGER [11], online iNMF [12], UINMF [13], and MAESTRO [14]; (b) Methods that directly perform joint clustering in the original data space, where the shared and unshared cell types across data modalities are identified through joint clustering. The methods in this category include scACE [15] and scAMACE [16]; (c) Transfer learning-based methods, where one dataset (typically scRNA-Seq data) facilitates the analysis of another, noisier dataset (typically single-cell epigenomic data). The methods in this category include coupleCoC [17], coupleCoC+ [18], and scJoint [19].

Because different types of genomic features are profiled in different cells, these methods require that at least a subset of features be connected across the multi-omics data. To connect scATAC-Seq data with scRNA-Seq data, the gene activity score [20], which summarizes the peak accessibility near the gene body, was used in online iNMF [12], UINMF [21], MAESTRO [14], scAMACE [16], coupleCoC [17], coupleCoC+ [18], and scJoint [19]; promoter accessibility was used in scACE [15]; a prediction model trained on reference data was used in coupled NMF [8]; and external chromatin conformation data that links regulatory regions to genes was used in DC3 [9]. To connect single-cell methylation data with scRNA-Seq data, gene body mCH methylation was used in LIGER [11], online iNMF [12], scAMACE [16], coupleCoC [17], and coupleCoC+ [18]; promoter methylation was also used in scAMACE [16].

coupled NMF [8] was designed for integrative analysis of scRNA-Seq and scATAC-Seq data obtained from different sets of cells. Let \(\mathbf{O}\) be a \(p_1 \times n_1\) data matrix for scATAC-Seq data, where \(p_1\) is the number of regions and \(n_1\) is the number of cells. Let \(\mathbf{E}\) be a \(p_2 \times n_2\) data matrix for scRNA-Seq data, where \(p_2\) is the number of genes and \(n_2\) is the number of cells. The following optimization problem was proposed in coupled NMF:

$$\displaystyle \begin{aligned} \mathop{\text{argmin}}_{{\mathbf{W}}_1, {\mathbf{H}}_1, {\mathbf{W}}_2, {\mathbf{H}}_2 \geq 0} \frac{1}{2} \| \mathbf{O} - {\mathbf{W}}_1 {\mathbf{H}}_1 \|{}_F^2 + \frac{\lambda_1}{2} \| \mathbf{E} - {\mathbf{W}}_2 {\mathbf{H}}_2 \|{}_F^2 - \lambda_2 \, \text{tr}({\mathbf{W}}_2^T \mathbf{A} {\mathbf{W}}_1) + \mu (\|{\mathbf{W}}_1\|{}^2_F + \|{\mathbf{W}}_2\|{}_F^2), \end{aligned} $$
(1)

where \(\mathbf{W}_1\) is the \(p_1 \times K\) region-factor matrix for scATAC-Seq data, \(\mathbf{W}_2\) is the \(p_2 \times K\) gene-factor matrix for scRNA-Seq data, and \(\mathbf{H}_1\) and \(\mathbf{H}_2\) are matrices of dimensions \(K \times n_1\) and \(K \times n_2\), representing the low-dimensional embeddings for the cells in scATAC-Seq and scRNA-Seq data, respectively. coupled NMF is based on non-negative matrix factorization [22], and the entries in \(\mathbf{W}_1\), \(\mathbf{W}_2\), \(\mathbf{H}_1\), and \(\mathbf{H}_2\) are non-negative. In coupled NMF, the regions in scATAC-Seq data and the genes in scRNA-Seq data are connected through the term \(tr({\mathbf {W}}_2^T \mathbf {A} {\mathbf {W}}_1)\), where \(\mathbf{A}\) is a known \(p_2 \times p_1\) matrix (so that the product \({\mathbf {W}}_2^T \mathbf {A} {\mathbf {W}}_1\) is \(K \times K\)) obtained from training non-negative least squares regression models on bulk gene expression and chromatin accessibility datasets. The matrix \(\mathbf{A}\) is set to the regression coefficients, where bulk gene expression data was used as the outcome and bulk chromatin accessibility data was used as the predictor. The term \(\|{\mathbf {W}}_1\|{ }^2_F + \|{\mathbf {W}}_2\|{ }_F^2\) penalizes the scales of \(\mathbf{W}_1\) and \(\mathbf{W}_2\). \(\lambda_1\), \(\lambda_2\), and \(\mu\) are tuning parameters.
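To make the roles of the four terms in objective (1) concrete, the following is a minimal sketch that only evaluates the objective on toy matrices; it does not perform the optimization. It uses plain Python lists, the helper names are hypothetical, and it assumes that A is oriented as genes by regions so that the trace term conforms.

```python
# Sketch: evaluate the coupled NMF objective (1) for given factor matrices.
# Matrices are lists of rows; all helper names are illustrative only.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frob_sq(A):
    """Squared Frobenius norm."""
    return sum(x * x for row in A for x in row)

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def coupled_nmf_objective(O, E, W1, H1, W2, H2, A, lam1, lam2, mu):
    fit_atac = 0.5 * frob_sq(subtract(O, matmul(W1, H1)))
    fit_rna = 0.5 * lam1 * frob_sq(subtract(E, matmul(W2, H2)))
    # Coupling term tr(W2^T A W1); assumes A has shape p2 x p1.
    coupling = trace(matmul(matmul(transpose(W2), A), W1))
    penalty = mu * (frob_sq(W1) + frob_sq(W2))
    return fit_atac + fit_rna - lam2 * coupling + penalty

# Toy example: p1 = 2 regions, p2 = 1 gene, K = 1 factor, 2 cells per modality.
O = [[1.0, 0.0], [2.0, 1.0]]
E = [[3.0, 2.0]]
W1, H1 = [[1.0], [2.0]], [[1.0, 0.0]]
W2, H2 = [[3.0]], [[1.0, 1.0]]
A = [[0.5, 0.25]]
value = coupled_nmf_objective(O, E, W1, H1, W2, H2, A, lam1=2.0, lam2=1.0, mu=0.1)
```

A larger trace term lowers the objective, so factor pairs whose region loadings and gene loadings agree with the bulk-derived matrix A are favored.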

DC3 [9] is a follow-up work based on the general framework of coupled NMF. Instead of a pre-trained regression model, DC3 connects the regions in scATAC-Seq data and the genes in scRNA-Seq data through bulk HiChIP data obtained from tissues similar to those profiled in the scRNA-Seq and scATAC-Seq data. Simultaneously with clustering, DC3 also performs deconvolution of the bulk HiChIP data into different cell subpopulations. The following is the objective function proposed in DC3:

$$\displaystyle \begin{aligned} \min_{{\mathbf{W}}_1, {\mathbf{H}}_1, {\mathbf{W}}_2, {\mathbf{H}}_2, \alpha, \boldsymbol{\Lambda} \geq 0} \frac{\mu_1}{2} \| \mathbf{O} - {\mathbf{W}}_1 {\mathbf{H}}_1 \|{}_F^2 + \frac{\mu_2}{2} \| \mathbf{E} - {\mathbf{W}}_2 {\mathbf{H}}_2 \|{}_F^2 + \frac{1}{2} \| \mathbf{C} - \alpha \mathbf{D} \odot ({\mathbf{W}}_2 \Lambda {\mathbf{W}}_1^T) \|{}_F^2 \\ \text{subject to } \ \sum_{k=1}^K h_{1,kj}=1\text{ for } j=1,2,\cdots,n_1; \sum_{k=1}^K h_{2,kj}=1, \text{ for } j=1,2,\cdots,n_2; \sum_{k=1}^K \lambda_k=1, \end{aligned} $$
(2)

where \(\mathbf{O}\), \(\mathbf{E}\), \(\mathbf{W}_1\), \(\mathbf{H}_1\), \(\mathbf{W}_2\), and \(\mathbf{H}_2\) are the same as the corresponding matrices in coupled NMF; \(h_{1,kj}\) and \(h_{2,kj}\) are the \((k, j)\)th entries in \(\mathbf{H}_1\) and \(\mathbf{H}_2\), respectively; and \(\mathbf{C}\) is a \(p_2 \times p_1\) bulk HiChIP data matrix (matching the dimensions of \({\mathbf {W}}_2 \Lambda {\mathbf {W}}_1^T\)), representing the enhancer-promoter interaction strength. The bulk HiChIP data is obtained from tissues similar to those profiled in the scRNA-Seq and scATAC-Seq data, and it is considered a mixture of different cell subpopulations. The term \(\| \mathbf {C} - \alpha \mathbf {D} \odot ({\mathbf {W}}_2 \Lambda {\mathbf {W}}_1^T) \|{ }_F^2\) is the key for deconvolution of bulk HiChIP data into cell subpopulation-specific enhancer-promoter interactions, and for linking genes (promoters) in scRNA-Seq data and regions (enhancers) in scATAC-Seq data. \(\alpha\) is a scaling factor, and \(\mathbf{D}\) is a masking matrix that selects the entries in \(\mathbf{C}\) that are at least 1: \(d_{ij} = 1\) if \(c_{ij} \geq 1\) and \(d_{ij} = 0\) if \(c_{ij} < 1\). \(\Lambda = \text{diag}(\lambda_1, \cdots, \lambda_K)\) is a diagonal matrix. \({\mathbf {W}}_2 \Lambda {\mathbf {W}}_1^T = \sum _{k=1}^K \lambda _k {\mathbf {w}}_{2, \cdot k} {\mathbf {w}}^{T}_{1, \cdot k}\), where \({\mathbf {w}}_{1, \cdot k}\) and \({\mathbf {w}}_{2, \cdot k}\) denote the kth columns in \(\mathbf{W}_1\) and \(\mathbf{W}_2\), respectively. \({\mathbf {w}}_{2, \cdot k} {\mathbf {w}}^{T}_{1, \cdot k}\) can be interpreted as the enhancer-promoter interaction strength in the kth cell subpopulation (represented by the kth factor in NMF), which provides the deconvolution of \(\mathbf{C}\), and \(\lambda_k\) can be interpreted as the proportion of cells that belong to the kth cell subpopulation.
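The deconvolution identity and the masked HiChIP term in (2) can be illustrated with a small sketch. The function names below are hypothetical, matrices are plain lists of rows, and the example assumes that C is stored as genes by regions so that it matches the shape of the weighted sum of outer products.

```python
# Sketch: subpopulation-specific interactions and the masked HiChIP loss in DC3.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def dc3_interaction(W2, lam, W1):
    """Return sum_k lam[k] * w2[:, k] w1[:, k]^T, a p2 x p1 matrix."""
    p2, p1 = len(W2), len(W1)
    total = [[0.0] * p1 for _ in range(p2)]
    for k, lam_k in enumerate(lam):
        w2k = [W2[g][k] for g in range(p2)]   # gene loadings of factor k
        w1k = [W1[r][k] for r in range(p1)]   # region loadings of factor k
        component = outer(w2k, w1k)           # interactions in subpopulation k
        for g in range(p2):
            for r in range(p1):
                total[g][r] += lam_k * component[g][r]
    return total

def hichip_loss(C, W2, lam, W1, alpha):
    """0.5 * ||C - alpha * D * M||_F^2 (elementwise D), with d_ij = 1 if c_ij >= 1."""
    M = dc3_interaction(W2, lam, W1)
    loss = 0.0
    for g, row in enumerate(C):
        for r, c in enumerate(row):
            d = 1.0 if c >= 1 else 0.0
            loss += (c - alpha * d * M[g][r]) ** 2
    return 0.5 * loss

# Toy example: p2 = 2 genes, p1 = 3 regions, K = 2 subpopulations.
W2 = [[1.0, 0.0], [0.0, 2.0]]
W1 = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
lam = [0.25, 0.75]                       # subpopulation proportions, sum to 1
M = dc3_interaction(W2, lam, W1)
C = [[1.0, 2.0, 0.5], [6.0, 0.0, 6.0]]
loss = hichip_loss(C, W2, lam, W1, alpha=4.0)
```

In this toy case the only contribution to the loss comes from the entry of C below 1, which the mask sets aside from the reconstruction.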

Seurat V3 [10] integrates multiple single-cell datasets. The demonstrated examples integrate multiple scRNA-Seq datasets, scRNA-Seq with scATAC-Seq data, and in situ gene expression with scRNA-Seq data. The features are assumed to be the same across datasets; for scATAC-Seq data, the gene activity score was used. Seurat V3 implements the following four steps to integrate two datasets, Y and X; the correction procedure can also be extended to multiple datasets.

Step 1:

Data preprocessing and feature selection with highly variable genes.

Step 2:

Dimension reduction and identification of “anchor” correspondences between datasets. \(\mathbf{X}\) is a \(p \times n_X\) single-cell dataset, and \(\mathbf{Y}\) is a \(p \times n_Y\) single-cell dataset, where p is the number of features (i.e., genes), and \(n_X\) and \(n_Y\) are the numbers of cells. Seurat V3 performs canonical correlation analysis (CCA) for dimension reduction of \(\mathbf{X}\) and \(\mathbf{Y}\). The first pair of canonical vectors \(\mathbf {u} \in \mathbb {R}^{n_X}\) and \(\mathbf {v} \in \mathbb {R}^{n_Y}\) is obtained by solving the following problem:

$$\displaystyle \begin{aligned} & \max_{\mathbf{u}, \mathbf{v}} {\mathbf{u}}^T{\mathbf{X}}^T\mathbf{Y}\mathbf{v}, \\ & \text{subject to } \| \mathbf{u} \|{}^2_2 \leq 1, \| \mathbf{v} \|{}^2_2 \leq 1. \end{aligned} $$
(3)

Note that the formulation of CCA in (3) differs from the usual one in statistics, where the projection vectors are defined in the feature space; here, \(\mathbf{u}\) and \(\mathbf{v}\) are defined in the cell spaces. Seurat V3 obtains the first k pairs of canonical vectors and then normalizes them so that the 2-norm of the vector for each cell equals 1. The normalized canonical vectors are used as the low-dimensional representation of the cells. Mutual nearest neighbors (MNN; pairs of cells, with one from each dataset, that are contained within each other’s neighborhoods) are obtained from the low-dimensional representations. These pairwise correspondences are referred to as “anchors.”

Step 3:

Filtering, scoring, and weighting of anchor correspondences. The initial anchor pairs obtained in step 2 are filtered so that only those also supported by the original high-dimensional space are retained. The anchors are then scored based on their strength using an approach similar to shared nearest neighbor graphs. Suppose the matrix \(\mathbf{X}\) is used to correct the matrix \(\mathbf{Y}\). \(\mathbf{W}\) is a weight matrix for the cells in \(\mathbf{Y}\), with dimension \(n_Y \times\) (number of anchor cells). \(w_{ij}\) represents the weighted similarity between cell i in \(\mathbf{Y}\) and anchor cell j in \(\mathbf{Y}\), which considers not only the distance between cells i and j but also the anchor score of cell j: if cell j has a higher anchor score, \(w_{ij}\) will tend to be larger.

Step 4:

Data matrix correction. Let \(\mathbf{a}_X\) and \(\mathbf{a}_Y\) denote the indices of the paired anchor cells in \(\mathbf{X}\) and \(\mathbf{Y}\), respectively. Seurat V3 first computes the differences between the pairs of anchor cells in the two data matrices:

$$\displaystyle \begin{aligned} \mathbf{B} = \mathbf{Y}[,{\mathbf{a}}_Y] - \mathbf{X}[, {\mathbf{a}}_X]. \end{aligned} $$
(4)

The corrected data matrix \(\hat {\mathbf {Y}}\) is obtained as the following:

$$\displaystyle \begin{aligned} \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{B} {\mathbf{W}}^T. \end{aligned} $$
(5)
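Two pieces of the steps above lend themselves to a short numerical sketch: the first canonical pair in (3), which is the leading singular vector pair of \(\mathbf{X}^T \mathbf{Y}\), and the correction in (4) and (5). The code below is not the Seurat implementation; it finds the leading pair by power iteration, and the anchors and the weight matrix are supplied by hand rather than derived from mutual nearest neighbors and anchor scores.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def first_canonical_pair(X, Y, iters=200):
    """Power iteration for max u^T X^T Y v subject to ||u||, ||v|| <= 1."""
    M = matmul(transpose(X), Y)                      # n_X x n_Y
    n_x, n_y = len(M), len(M[0])
    v = normalize([1.0] * n_y)
    for _ in range(iters):
        u = normalize([sum(M[i][j] * v[j] for j in range(n_y)) for i in range(n_x)])
        v = normalize([sum(M[i][j] * u[i] for i in range(n_x)) for j in range(n_y)])
    return u, v

def correct(Y, X, anchors_Y, anchors_X, W):
    """Y_hat = Y - B W^T with B = Y[:, a_Y] - X[:, a_X]; W is n_Y x (number of anchors)."""
    n_anchor = len(anchors_Y)
    B = [[Y[f][ay] - X[f][ax] for ay, ax in zip(anchors_Y, anchors_X)]
         for f in range(len(Y))]
    return [[Y[f][c] - sum(B[f][a] * W[c][a] for a in range(n_anchor))
             for c in range(len(Y[0]))] for f in range(len(Y))]

# Toy data: Y equals X plus a shift of 1 in the first feature.
X = [[1.0, 2.0], [0.0, 1.0]]
Y = [[2.0, 3.0], [0.0, 1.0]]
u, v = first_canonical_pair(X, Y)

# Hand-picked anchors (cell c in Y paired with cell c in X) and identity weights.
Y_hat = correct(Y, X, anchors_Y=[0, 1], anchors_X=[0, 1], W=[[1.0, 0.0], [0.0, 1.0]])
```

With identity weights each cell in Y is shifted by the difference at its own anchor, so the toy batch shift is removed exactly; in practice the weights spread each anchor's correction over neighboring cells.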

LIGER [11] employs integrative non-negative matrix factorization (iNMF) [23]. LIGER uses genes as the features that connect different datasets. For single-cell methylation data, it uses one minus the non-CpG (mCH) gene body methylation level, because non-CpG gene body methylation is generally negatively correlated with gene expression in neurons. Although the integrative analysis of scRNA-Seq and scATAC-Seq was not presented in [11], scATAC-Seq data can in principle be incorporated using the gene activity score. The objective function in LIGER is as follows:

$$\displaystyle \begin{aligned} \mathop{\text{argmin}}_{\mathbf{W}, {\mathbf{H}}_i, {\mathbf{V}}_i \geq 0} \sum_i \|{\mathbf{E}}_i - (\mathbf{W}+{\mathbf{V}}_i){\mathbf{H}}_i \|{}_F^2 + \lambda \sum_i \|{\mathbf{V}}_i {\mathbf{H}}_i \|{}_F^2, \end{aligned} $$
(6)

where \(\mathbf{E}_i\) denotes dataset i, which is of dimension \(m \times n_i\) so that the products in (6) conform, where m denotes the number of genes and \(n_i\) denotes the number of cells in dataset i. \(\mathbf{W}\) is of dimension \(m \times K\), and it is the factor loading shared across datasets. \(\mathbf{V}_i\) is of dimension \(m \times K\), and it is the factor loading that is unique to dataset i. \(\mathbf{H}_i\) is of dimension \(K \times n_i\), and it denotes the low-dimensional embedding for the cells in dataset i. In the objective function (6), \(\mathbf{E}_i\) is approximated by \(\mathbf{W}\mathbf{H}_i + \mathbf{V}_i \mathbf{H}_i\), where \(\mathbf{W}\mathbf{H}_i\) represents the shared variation across datasets, and \(\mathbf{V}_i \mathbf{H}_i\) denotes the dataset-specific effect. The regularization term \(\lambda \sum _i \|{\mathbf {V}}_i {\mathbf {H}}_i \|{ }_F^2\) controls the strength of the dataset-specific variation, and \(\lambda\) is a tuning parameter. After obtaining the low-dimensional embeddings \(\mathbf{H}_i\) for the cells across datasets, LIGER further builds a shared factor neighborhood graph in which cells are connected based on their similarity in the low-dimensional embeddings, and joint clusters are identified by performing community detection on this graph.
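The split between shared and dataset-specific variation in (6) can be seen by evaluating the objective on a toy example. This is a sketch that only computes the objective value, not the LIGER optimizer; it assumes a genes-by-cells layout for each dataset so that the products conform, and all helper names are hypothetical.

```python
# Sketch: evaluate the iNMF objective (6) for two toy datasets.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frob_sq(A):
    return sum(x * x for row in A for x in row)

def inmf_objective(E_list, W, V_list, H_list, lam):
    total = 0.0
    for E, V, H in zip(E_list, V_list, H_list):
        # Reconstruction with shared (W) plus dataset-specific (V) loadings.
        total += frob_sq(subtract(E, matmul(add(W, V), H)))
        # Penalty on the dataset-specific part V_i H_i.
        total += lam * frob_sq(matmul(V, H))
    return total

# m = 2 genes, K = 1 factor; dataset 1 has 2 cells, dataset 2 has 1 cell.
W = [[1.0], [1.0]]
V_list = [[[0.0], [1.0]], [[0.0], [0.0]]]
H_list = [[[1.0, 2.0]], [[3.0]]]
E_list = [[[1.0, 2.0], [2.0, 5.0]], [[3.0], [3.0]]]
value = inmf_objective(E_list, W, V_list, H_list, lam=0.5)
```

Raising `lam` makes it more expensive to explain variation through the dataset-specific loadings, which pushes more of the structure into the shared factors.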

Other than integrative analysis of single-cell multi-omics data, examples that integrate multiple scRNA-Seq datasets from different individuals, time points, species, and spatial gene expression data were also presented in LIGER. Methods have been developed based on extensions of LIGER, including online iNMF [12] and UINMF [21].

Online iNMF [12] has the same objective function as formula 6 in LIGER. The major advantage of online iNMF is its computational efficiency and fixed memory usage for large datasets. It enables integration of large, multi-modal datasets by cycling through the data multiple times in small mini-batches and integration of continually arriving datasets, where the entire dataset is not available at any point during training. Online iNMF builds upon the online non-negative matrix factorization approach in [24].

UINMF [21] addresses one limitation of LIGER, namely that features not linked across datasets are not utilized. For example, peaks in intergenic regions in scATAC-Seq data are not directly linked to the genes in scRNA-Seq data, so these peaks are not included in the objective function of LIGER. UINMF was developed to include such unlinked features. The objective function of UINMF is as follows:

$$\displaystyle \begin{aligned} \mathop{\text{argmin}}_{\mathbf{W}, {\mathbf{H}}_i, {\mathbf{U}}_i, {\mathbf{V}}_i \geq 0} \sum_i \Big\{ \|{\mathbf{E}}_i - (\mathbf{W} + {\mathbf{V}}_i) {\mathbf{H}}_i \|{}_F^2 + \|{\mathbf{P}}_i - {\mathbf{U}}_i {\mathbf{H}}_i\|{}_F^2 \Big\} + \sum_i \lambda_i \Big\{ \|{\mathbf{V}}_i {\mathbf{H}}_i \|{}_F^2 + \|{\mathbf{U}}_i {\mathbf{H}}_i \|{}_F^2 \Big\}. \end{aligned} $$
(7)

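In formula (7), the terms \(\mathbf{E}_i\), \(\mathbf{H}_i\), \(\mathbf{V}_i\), and \(\mathbf{W}\) are the same as those in formula (6). \(\mathbf{P}_i\) is a matrix of dimension \(z_i \times n_i\) (so that it conforms with \(\mathbf{U}_i \mathbf{H}_i\)), where \(n_i\) is the number of cells and \(z_i\) is the number of unlinked features in the ith dataset, and \(\mathbf{U}_i\) is the factor loading for the unlinked features. The matrices for the linked features \(\mathbf{E}_i\) and the unlinked features \(\mathbf{P}_i\) share the same \(\mathbf{H}_i\), which is the low-dimensional embedding for the cells. Note that the tuning parameter \(\lambda_i\) differs across datasets, and the variation of the unlinked features \(\mathbf{U}_i \mathbf{H}_i\) is included in the penalization term.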

MAESTRO [14] provides a comprehensive open-source computational workflow for the integrative analyses of scRNA-Seq and scATAC-Seq data from multiple platforms. MAESTRO provides functions for preprocessing, alignment, quality control, expression and chromatin accessibility quantification, clustering, differential analysis, and annotation. Most other methods in this review start from processed datasets, while MAESTRO supports input from fastq files for a wide variety of single-cell sequencing-based platforms. To integrate the cells from scRNA-Seq and scATAC-Seq, MAESTRO first calculates the regulatory potential for each gene in each cell, which measures the scATAC-Seq reads near the gene, weighted by an exponential decay of the read distance to the transcriptional start site of the gene. Note that the regulatory potential is computed similarly to the gene activity score. MAESTRO then performs a canonical correlation analysis between gene expression from scRNA-Seq and regulatory potential from scATAC-Seq. A pair of cells, one from scRNA-Seq and the other from scATAC-Seq, can be anchored using mutual nearest neighbors after dimension reduction. Then, MAESTRO transfers the cell type labels from scRNA-Seq (cell type labels in scRNA-Seq data are obtained from clustering by Seurat) to scATAC-Seq using the anchored cell pairs.
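The idea of a distance-weighted regulatory potential can be sketched in a few lines. This is an illustration of the exponential-decay weighting only, not MAESTRO's actual formula: the function name, the use of base e, and the decay length of 10 kb are all assumptions made for the example.

```python
import math

def regulatory_potential(read_positions, tss, decay=10_000.0):
    """Sum of reads near a gene, each weighted by exp(-distance to TSS / decay).

    `decay` (in base pairs) is a hypothetical decay length chosen for illustration.
    """
    return sum(math.exp(-abs(pos - tss) / decay) for pos in read_positions)

# Reads (genomic coordinates) observed in two hypothetical cells near one gene.
tss = 50_000
cell_reads = {
    "cell_a": [50_000, 50_000, 60_000],   # two reads at the TSS, one 10 kb away
    "cell_b": [150_000],                  # a single distal read
}
potentials = {cell: regulatory_potential(reads, tss) for cell, reads in cell_reads.items()}
```

Reads at the transcriptional start site contribute a weight of 1, and the contribution shrinks smoothly with distance, so distal reads add little to the score of the gene.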

After integrating scRNA-Seq and scATAC-Seq cells, MAESTRO predicts transcriptional regulators from scRNA-Seq data using LISA [25] and from scATAC-Seq data using GIGGLE [26], and combines the two sets of predictions using the rank product. The final candidate regulators are further filtered based on the regulator expression from scRNA-Seq.

scACE [15] and scAMACE [16] are clustering methods built upon Bayesian hierarchical models. scACE integrates scRNA-Seq and scATAC-Seq data profiled on different sets of single cells. scAMACE builds upon scACE and extends it to model scRNA-Seq, scATAC-Seq, and sc-methylation data. The goal in scACE and scAMACE is to cluster similar cell types within and across different molecular modalities. The details of the model in scAMACE are as follows. The model for scRNA-Seq data:

$$\displaystyle \begin{aligned} & \boldsymbol{\omega}_{\cdot g}^{rna} \overset{\boldsymbol{z}_{l \cdot }}{\longrightarrow} u_{lg} \longrightarrow v_{lg} \longrightarrow y_{lg} \\ & \boldsymbol{z}_{l \cdot} \sim \text{Categorical} ( \boldsymbol{\psi}^{rna} ),\\ & u_{lg} \mid z_{lk}=1 \sim \text{Bernoulli}(\omega_{kg}^{rna}),\\ & v_{lg} \mid u_{lg}=1 \sim \text{Bernoulli}(\pi_{l1}); v_{lg} \mid u_{lg}=0 \sim \text{Bernoulli}(\pi_{l0}), \\ & p(y_{lg} \mid v_{lg} ) = v_{lg} g_1(y_{lg}) + (1 - v_{lg}) g_0(y_{lg}). \end{aligned}$$

Assume that there are K cell clusters in total. The random variable \(z_{lk}\) denotes whether cell l belongs to cluster \(k \in \{1, \ldots, K\}\), and \(\boldsymbol{z}_{l \cdot}\) follows a categorical distribution with probability \(\psi _k^{rna}\) for cluster k. \(\omega ^{rna}_{kg}\) denotes the probability that gene g is active in cluster k. \(u_{lg}\) is a binary latent variable representing whether gene g is active in cell l, and \(u_{lg} = 1\) represents that it is active. \(v_{lg}\) denotes whether gene g is expressed in cell l, and \(v_{lg} = 1\) represents that it is expressed. When gene g is active in cell l (\(u_{lg} = 1\)), the probability that gene g is expressed in cell l (\(v_{lg} = 1\)) is \(\pi_{l1}\), while the probability that gene g is expressed is \(\pi_{l0}\) if the gene is not active (\(u_{lg} = 0\)). Since genes are more likely to be expressed when they are active, it was assumed that \(\pi_{l1} \geq \pi_{l0}\). \(y_{lg}\) denotes the observed gene expression for gene g in cell l (after normalization to account for sequencing depth and gene length), and it was assumed that \(y_{lg} \mid v_{lg}\) follows a mixture distribution, where \(g_1(\cdot)\) and \(g_0(\cdot)\) are density functions of the expression level conditional on \(v_{lg}\).
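The generative chain from cluster membership to observed expression can be made concrete by simulating one cell. The sketch below follows the hierarchy described above; the densities \(g_1\) and \(g_0\) are not specified in this section, so the gamma and exponential distributions used here are placeholders chosen purely for illustration, as are the function and parameter names.

```python
import random

def simulate_rna_cell(psi, omega, pi1, pi0, rng):
    """Draw (cluster, u, v, y) for one cell from the scRNA-Seq hierarchy.

    psi:   cluster probabilities (length K)
    omega: omega[k][g] = probability that gene g is active in cluster k
    pi1, pi0: probability of expression given active / inactive (pi1 >= pi0)
    """
    n_clusters, n_genes = len(psi), len(omega[0])
    k = rng.choices(range(n_clusters), weights=psi)[0]               # cluster label
    u = [int(rng.random() < omega[k][g]) for g in range(n_genes)]    # gene active?
    v = [int(rng.random() < (pi1 if u_g else pi0)) for u_g in u]     # gene expressed?
    # Placeholder densities (assumptions, not from the model description):
    # expressed genes ~ Gamma(4, 1), non-expressed genes ~ Exponential(rate 10).
    y = [rng.gammavariate(4.0, 1.0) if v_g else rng.expovariate(10.0) for v_g in v]
    return k, u, v, y

rng = random.Random(0)
omega = [[0.9, 0.9, 0.1, 0.1],    # cluster 0: first two genes tend to be active
         [0.1, 0.1, 0.9, 0.9]]    # cluster 1: last two genes tend to be active
k, u, v, y = simulate_rna_cell(psi=[0.5, 0.5], omega=omega, pi1=0.8, pi0=0.1, rng=rng)

# Degenerate settings make the chain deterministic, which is handy for checking.
k_det, u_det, v_det, y_det = simulate_rna_cell(
    psi=[1.0, 0.0], omega=[[1.0] * 4, [0.0] * 4], pi1=1.0, pi0=0.0, rng=rng)
```

The same structure carries over to the other modalities by swapping the observed layer and the direction of the inequality between the two conditional probabilities.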

The model for scATAC-Seq data:

$$\displaystyle \begin{aligned} & \boldsymbol{\omega}_{\cdot g}^{acc} \overset{\boldsymbol{z}_{i \cdot }}{\longrightarrow} u_{ig} \longrightarrow o_{ig} \longrightarrow x_{ig} \\ & \boldsymbol{z}_{i \cdot} \sim \text{Categorical} ( \boldsymbol{\psi}^{acc} ),\\ & u_{ig} \mid z_{ik}=1 \sim \text{Bernoulli}(\omega_{kg}^{acc}),\\ & o_{ig} \mid u_{ig}=1 \sim \text{Bernoulli}(\pi_{i1}); o_{ig} \mid u_{ig}=0 \sim \text{Bernoulli}(\pi_{i0}), \\ & p(x_{ig} \mid o_{ig} ) =o_{ig} f_1(x_{ig}) + (1 - o_{ig}) f_0(x_{ig}). \end{aligned}$$

The random variables \(\omega _{kg}^{acc}\), \(z_{ik}\), \(\psi _k^{acc}\), and \(u_{ig}\) have similar interpretations to their corresponding variables in the model for scRNA-Seq data. The cells in the scATAC-Seq data are different from the cells in the scRNA-Seq data, as indicated by the different notation i, which represents the cells. \(x_{ig}\) denotes the observed gene activity score for gene g in cell i. It was modeled by a mixture distribution with density functions \(f_1(\cdot)\) and \(f_0(\cdot)\) and binary latent variable \(o_{ig}\). \(o_{ig} = 1\) and \(o_{ig} = 0\) represent the mixture components with high (\(f_1\)) and low (\(f_0\)) gene scores, respectively. Accessibility tends to be positively associated with the activity of the gene. This positive relationship was modeled by the distribution of \(o_{ig} \mid u_{ig}\). When gene g is active in cell i (\(u_{ig} = 1\)), the probability that it has a high gene score (\(o_{ig} = 1\)) is \(\pi_{i1}\); when gene g is inactive in cell i (\(u_{ig} = 0\)), the probability that it has a high gene score (\(o_{ig} = 1\)) is \(\pi_{i0}\). \(\pi_{i1}\) was assumed to be larger than \(\pi_{i0}\) to represent the positive relationship.

The model for sc-methylation data:

$$\displaystyle \begin{aligned} & \boldsymbol{\omega}_{\cdot g}^{met} \overset{\boldsymbol{z}_{d \cdot }}{\longrightarrow} u_{dg} \longrightarrow m_{dg} \longrightarrow t_{dg} \\ & \boldsymbol{z}_{d \cdot} \sim \text{Categorical} ( \boldsymbol{\psi}^{met} ),\\ & u_{dg} \mid z_{dk}=1 \sim \text{Bernoulli}(\omega_{kg}^{met}),\\ & m_{dg} \mid u_{dg}=1 \sim \text{Bernoulli}(\pi_{d1}); m_{dg} \mid u_{dg}=0 \sim \text{Bernoulli}(\pi_{d0}), \\ & p(t_{dg} \mid m_{dg} ) =m_{dg} h_1(t_{dg}) + (1 - m_{dg}) h_0(t_{dg}). \end{aligned}$$

The random variables \(\omega _{kg}^{met}\), \(z_{dk}\), \(\psi _k^{met}\), and \(u_{dg}\) have similar interpretations to their corresponding variables in the model for scRNA-Seq data. The cells in the sc-methylation data are different from the cells in the scRNA-Seq data, as indicated by the different notation d, which represents the cells. The binary random variable \(m_{dg}\) denotes whether gene g is methylated in cell d, and \(m_{dg} = 1\) represents that it is methylated. Methylation of a gene (promoter methylation/gene body methylation) tends to be negatively associated with the activity of the gene, and this negative relationship is modeled with \(m_{dg} \mid u_{dg}\): when gene g is active in cell d (\(u_{dg} = 1\)), it is less likely to be methylated (\(m_{dg} = 1\)), as \(\pi_{d1} \leq \pi_{d0}\). \(t_{dg}\) denotes the observed methylation level for gene g in cell d, and \(t_{dg} \mid m_{dg}\) was assumed to follow a mixture distribution, where \(h_1(\cdot)\) and \(h_0(\cdot)\) are density functions conditional on \(m_{dg}\).

The model that connects the three molecular modalities: For scATAC-Seq data, \(\omega _{kg}^{acc}\) was assumed to follow a beta distribution with mean \(\mu _{kg}^{acc}\) and precision \(\phi^{acc}\). The variable \(\mu _{kg}^{acc}\) is connected with \(\omega _{kg}^{rna}\) in scRNA-Seq data through the logit function: \(\text{logit}(\mu ^{acc}_{kg})=\eta +\gamma \omega _{kg}^{rna}+\tau (\omega _{kg}^{rna})^2\). For sc-methylation data, the mean of \(\omega _{kg}^{met}\), \(\mu _{kg}^{met}\), is connected with \(\omega _{kg}^{rna}\) through the logit function: \(\text{logit}(\mu ^{met}_{kg})=\delta +\theta \omega _{kg}^{rna}\). Biologically, methylation and chromatin accessibility regulate gene expression. The model was nevertheless specified in the reverse order, so that gene expression plays the central role. Because scRNA-Seq data is usually less noisy than single-cell epigenomic data, specifying the model this way improves the clustering performance on single-cell epigenomic data without sacrificing much of the clustering performance on scRNA-Seq data.
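The link between the modalities can be sketched numerically: a value of \(\omega_{kg}^{rna}\) is mapped through the inverse logit to a mean, which together with the precision determines a beta distribution. The conversion to shape parameters below assumes the common mean-precision parameterization (shape parameters \(\mu\phi\) and \((1-\mu)\phi\)); the text does not spell this out, and the function names and coefficient values are hypothetical.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def acc_beta_params(omega_rna, eta, gamma, tau, phi):
    """Beta shape parameters for omega^acc given omega^rna.

    logit(mu) = eta + gamma * omega_rna + tau * omega_rna**2,
    then (a, b) = (mu * phi, (1 - mu) * phi)  [assumed parameterization].
    """
    mu = expit(eta + gamma * omega_rna + tau * omega_rna ** 2)
    return mu * phi, (1.0 - mu) * phi

def met_beta_params(omega_rna, delta, theta, phi):
    """Same construction for methylation, with a linear predictor."""
    mu = expit(delta + theta * omega_rna)
    return mu * phi, (1.0 - mu) * phi

# Hypothetical coefficients: accessibility rises and methylation falls with activity.
a_lo, b_lo = acc_beta_params(0.1, eta=-2.0, gamma=4.0, tau=0.0, phi=10.0)
a_hi, b_hi = acc_beta_params(0.9, eta=-2.0, gamma=4.0, tau=0.0, phi=10.0)
m_lo, n_lo = met_beta_params(0.1, delta=2.0, theta=-4.0, phi=10.0)
m_hi, n_hi = met_beta_params(0.9, delta=2.0, theta=-4.0, phi=10.0)
```

A positive slope in the accessibility link and a negative slope in the methylation link encode the expected directions of association with gene activity.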

scJoint [19] is a transfer learning method that integrates atlas-scale, heterogeneous collections of scRNA-Seq and scATAC-Seq data. It is a semi-supervised approach where the cell type labels for scRNA-Seq data are assumed to be known. The goal of scJoint is to transfer knowledge from massive scRNA-Seq data to scATAC-Seq through joint embedding in a low-dimensional space, and it also transfers the cell type labels from scRNA-Seq to scATAC-Seq data. scJoint uses gene activity score for scATAC-Seq data.

The neural network in scJoint consists of one input layer and two fully connected layers. Linear activation functions were used. Let \(\{ {\mathbf {x}}_i^{(s)} \}_{i=1}^{N_s}\) be the expression profiles for the cells in batch s in scRNA-Seq data, and \({\mathbf {y}}_i^{(s)} \in \{ 1, \cdots , K \}\) is the cell type label for cell i. Let \(\{ {\mathbf {x}}_i^{(t)} \}_{i=1}^{N_t}\) denote the gene activity scores for the cells in batch t in scATAC-Seq data. \(f_{\theta , i}^{(s)}=f({\mathbf {x}}_i^{(s)};\theta ) \text{ and } f_{\theta , i}^{(t)}=f({\mathbf {x}}_i^{(t)};\theta ) \in \mathbb {R}^D\), D = 64, are the outputs of the joint embedding layer for scRNA-Seq and scATAC-Seq data, where θ denotes the parameters in the neural network and it is shared in the two datasets. Note that although the same notation i is used to represent cells in scRNA-Seq and scATAC-Seq data, the two types of data are obtained on different sets of cells. \(h(f({\mathbf {x}}_i^{(s)};\theta ))\) and \(h(f({\mathbf {x}}_i^{(t)};\theta ))\) are the outputs from the prediction layer for scRNA-Seq and scATAC-Seq data, respectively. \(g_{\theta , i}^{(s)}=\text{softmax}(h(f({\mathbf {x}}_i^{(s)};\theta )))\) and \(g_{\theta , i}^{(t)}=\text{softmax}(h(f({\mathbf {x}}_i^{(t)};\theta )))\) are vectors of length K, representing the probabilities of the assignment of cells to the K cell types.
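The forward pass described above can be sketched without any deep learning library. The code below is an illustration under stated assumptions, not the scJoint implementation: weights are random rather than trained, bias terms are included as an assumption, and the parameter names are hypothetical. It maps an input profile to the 64-dimensional joint embedding \(f(\cdot)\), then to the prediction layer \(h(\cdot)\), and finally to softmax probabilities.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer with linear activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)                                  # subtract max for numerical stability
    exps = [math.exp(val - m) for val in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(n_genes, n_types, dim=64, seed=0):
    """Random (untrained) parameters theta, shared by both modalities."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_f": mat(dim, n_genes), "b_f": [0.0] * dim,
            "W_h": mat(n_types, dim), "b_h": [0.0] * n_types}

def forward(x, theta):
    embedding = linear(x, theta["W_f"], theta["b_f"])        # f(x; theta) in R^D
    logits = linear(embedding, theta["W_h"], theta["b_h"])   # h(f(x; theta))
    return embedding, softmax(logits)                        # g = softmax(h(f(x)))

theta = init_params(n_genes=5, n_types=3)
rna_profile = [1.0, 0.0, 2.0, 0.5, 0.0]      # expression of one scRNA-Seq cell
atac_profile = [0.0, 1.0, 1.0, 0.0, 0.0]     # gene activity scores of one scATAC-Seq cell
emb_rna, prob_rna = forward(rna_profile, theta)
emb_atac, prob_atac = forward(atac_profile, theta)
```

Because the same parameters are applied to both inputs, cells from the two modalities land in one embedding space, which is what makes the later nearest-neighbor label transfer possible.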

There are three steps in scJoint. The first step is to train the neural network with the following loss function:

$$\displaystyle \begin{aligned} \mathcal{L}_1(\mathcal{B}_0, \theta)= \sum_{s=1}^S \big( \mathcal{L}_{\text{NNDR}} (\mathcal{B}^{(s)}, \theta) + \mathcal{L}_{\text{entropy}} (\mathcal{B}^{(s)}, \theta) \big) + \sum_{t=1}^T \big( \mathcal{L}_{\text{NNDR}}(\mathcal{B}^{(t)}, \theta) + \mathcal{L}_{\text{COS}} (\mathcal{B}^{(t)}, \mathcal{B}_R, \theta) \big), \end{aligned} $$
(8)

where \(\mathcal {B}^{(s)}\) denotes the data for batch s in scRNA-Seq data, \(\mathcal {B}^{(t)}\) denotes the data for batch t in scATAC-Seq data, and \(\mathcal {B}_0=\{ \mathcal {B}^{(s)} \}_{s=1}^{S} \cup \{ \mathcal {B}^{(t)} \}_{t=1}^{T} \). In a spirit similar to PCA, the NNDR loss \(\mathcal {L}_{\text{NNDR}}(\cdot )\) aims to capture low-dimensional orthogonal features in the joint embedding layer represented by the function f(⋅). The cosine similarity loss \(\mathcal {L}_{\text{COS}}(\cdot )\) aims to align a subset of scRNA-Seq and scATAC-Seq cells in the joint embedding space. The cross entropy loss \(\mathcal {L}_{\text{entropy}}(\cdot )\) represents the supervised component: it penalizes the disagreement between the predicted cell type probabilities given by the function g(⋅) and the known cell type labels in scRNA-Seq data. The second step in scJoint transfers cell type labels from scRNA-Seq data to scATAC-Seq data through k-nearest neighbors in the joint embedding space. The third step in scJoint refines the joint embedding space and improves the mixing of cells from the same cell type in scRNA-Seq and scATAC-Seq data. The neural network is trained with the following loss function:

$$\displaystyle \begin{aligned} \mathcal{L}_{\text{scJoint}}(\mathcal{B}_0, \theta) = \mathcal{L}_1(\mathcal{B}_0, \theta) + \mathcal{L}_{\text{entropy}} (\mathcal{B}^{(t)}, \theta) + \mathcal{L}_{\text{center}}(\mathcal{B}_0, \theta), \end{aligned} $$
(9)

where \(\mathcal {L}_1(\mathcal {B}_0, \theta )\) is the same as the loss in step 1; \(\mathcal {L}_{\text{entropy}} (\mathcal {B}^{(t)}, \theta )\) is the cross entropy loss using the transferred cell type labels for scATAC-Seq data, which are obtained in step 2; and the term \(\mathcal {L}_{\text{center}}(\mathcal {B}_0, \theta )\) encourages cells with the same cell type label to form clusters in the joint embedding space (determined by the function f(⋅)). This last term is similar to the loss function in k-means clustering, which encourages cells with the same cell type label to be close to the center of the cell type.

coupleCoC [17] and coupleCoC+ [18] are based on the information-theoretic co-clustering [42] transfer learning framework, where the features and observations are clustered simultaneously and the co-clustering result achieves minimal loss in mutual information.
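The notion of loss in mutual information can be illustrated on a toy cell-by-feature table. In information-theoretic co-clustering, the data matrix is normalized to a joint distribution, and the loss of a co-clustering is the mutual information of the original table minus that of the table aggregated by row and column clusters. The sketch below computes this quantity only; it is not the coupleCoC optimization, and the function names are hypothetical.

```python
import math

def mutual_information(P):
    """Mutual information (in nats) of a joint distribution given as a list of rows."""
    row_sums = [sum(row) for row in P]
    col_sums = [sum(col) for col in zip(*P)]
    return sum(p * math.log(p / (row_sums[i] * col_sums[j]))
               for i, row in enumerate(P) for j, p in enumerate(row) if p > 0)

def aggregate(P, row_clusters, col_clusters):
    """Sum the probability mass within each (row cluster, column cluster) block."""
    n_r, n_c = max(row_clusters) + 1, max(col_clusters) + 1
    Q = [[0.0] * n_c for _ in range(n_r)]
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            Q[row_clusters[i]][col_clusters[j]] += p
    return Q

def mi_loss(P, row_clusters, col_clusters):
    """Loss in mutual information incurred by the co-clustering."""
    return mutual_information(P) - mutual_information(aggregate(P, row_clusters, col_clusters))

# Four cells (rows) by four features (columns) with two clear blocks.
P = [[0.125, 0.125, 0.0, 0.0],
     [0.125, 0.125, 0.0, 0.0],
     [0.0, 0.0, 0.125, 0.125],
     [0.0, 0.0, 0.125, 0.125]]
loss_good = mi_loss(P, row_clusters=[0, 0, 1, 1], col_clusters=[0, 0, 1, 1])
loss_bad = mi_loss(P, row_clusters=[0, 1, 0, 1], col_clusters=[0, 0, 1, 1])
```

The co-clustering that respects the block structure preserves all of the mutual information, while the one that mixes the two groups of cells loses all of it.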

The goal of coupleCoC+ is to utilize one dataset, the source data (S), to facilitate the analysis of another dataset, the target data. Depending on whether the features are linked with the source data, the target data can be partitioned into two parts: data T, which contains the linked features, and data U, which contains the unlinked features. As an example, consider the setting where scRNA-Seq and scATAC-Seq are profiled on similar cell subpopulations but different cells. It is desirable to utilize the information in scRNA-Seq data to help cluster scATAC-Seq data, which is typically sparser and noisier. scRNA-Seq data can therefore be used as the source data S, and scATAC-Seq data can be used as the target data. In scATAC-Seq data, the data matrix of gene activity scores is directly linked with gene expression in scRNA-Seq data, so it can be regarded as data T; the data matrix of peak accessibility can be regarded as data U, because the peaks that are distal to the genes are not directly linked with gene expression.

In coupleCoC+, both the genomic features and the cells are clustered. \(C_Y\), \(C_X\), \(C_Z\), and \(C_U\) denote the clustering functions for the cells in the target data, the cells in the source data, the linked features in the two datasets, and the unlinked features that are unique to the target data, respectively. The following objective function was proposed in coupleCoC+:

(10)

The first two terms \(T(C_Y, C_Z)\) and \(S(C_X, C_Z)\) are the losses in mutual information for co-clustering the cells and the shared features in the target data and source data, respectively. The shared features Z have the same clustering \(C_Z\) in both the target data and the source data, so \(C_Z\) can be viewed as a bridge to transfer knowledge between the source data and the target data. The dimension of the feature space shared by the source data S and the data T is reduced by clustering and aggregating similar features. Aggregating similar features guided by the source data S enables knowledge transfer between the source data S and the data T, which reduces the noise in the single-cell data and can generally improve the clustering performance of the cells in the target data. The term \(U(C_Y, C_U)\) corresponds to the loss in mutual information for co-clustering the cells and the features that are unique to the target data. The clustering of the cells in the target data, \(C_Y\), is the same in the terms \(U(C_Y, C_U)\) and \(T(C_Y, C_Z)\). The remaining term in the objective aims to match a subset of the cell clusters in the two datasets. \(\lambda\), \(\beta\), and \(\gamma\) are tuning parameters.

The objective function in coupleCoC is similar to that in coupleCoC+. The differences are that coupleCoC does not consider the unlinked features across datasets, and that the matching of cell types across datasets is implemented in a separate step instead of being integrated in the objective function. Apart from the integrative analysis of scRNA-Seq and scATAC-Seq data, coupleCoC and coupleCoC+ were utilized for the integrative analysis of sc-methylation and scRNA-Seq data, and of scRNA-Seq data from mouse and human.

2 Multi-Omics Data Profiled on the Same Single Cells

Technologies that can profile multiple types of genomic features simultaneously in the same cells are beginning to emerge, and have the potential to reveal causal regulatory relations [27,28,29,30,31,32,33]. Methods have been developed for integrative analysis of these datasets, where one major goal is to integrate multiple molecular modalities profiled on the same cells to obtain better dimension reduction and clustering results compared with using single modalities alone [34,35,36,37,38,39]. These methods do not require the features to be linked across different molecular modalities.

MOFA+ [34] was designed to capture a common latent space, which integrates multi-omics data obtained from the same set of cells. It also considers sample structure (batches, donors, etc.) in the factor analysis model. Let M denote the number of data modalities. MOFA+ assumes the following factor analysis model for the mth data modality in group g:

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{gm} = {\mathbf{Z}}_g {\mathbf{W}}^T_m + \boldsymbol{\epsilon}_{gm}. \end{aligned} $$
(11)

Y gm is an N g × D m matrix, where N g is the number of cells in group/batch g, and D m is the number of features in modality m; Z g is an N g × K matrix, which represents the matrix of K factors in group g; W m is a D m × K matrix, which is the weight matrix for the mth modality; and 𝜖 gm is the residual noise matrix. In MOFA+, the factor matrix Z g is shared across different modalities within group g, and the weight matrix W m is shared for the same modality across different groups. Element-wise spike-and-slab priors were assumed for the entries in Z g and W m for regularization. MOFA+ also extends model (11) to support non-Gaussian likelihoods, including a Poisson model for count data and a Bernoulli model for binary data. Inference was achieved using stochastic variational inference, which scales to large datasets.
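The sharing structure in model (11) can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: it uses Gaussian factors, weights, and noise, and it omits the spike-and-slab priors and the variational inference used by MOFA+. The function name and arguments are hypothetical and do not correspond to the MOFA+ software.

```python
import random

def simulate_mofa_like(n_cells, n_features, n_factors, noise_sd=0.1, seed=0):
    """Simulate Y_gm = Z_g W_m^T + eps_gm for every group g and modality m.

    n_cells:    list of N_g, one entry per group
    n_features: list of D_m, one entry per modality
    """
    rng = random.Random(seed)
    # One factor matrix per group (N_g x K), shared by all modalities.
    Z = [[[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n)]
         for n in n_cells]
    # One weight matrix per modality (D_m x K), shared by all groups.
    W = [[[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(d)]
         for d in n_features]
    Y = {}
    for g, Zg in enumerate(Z):
        for m, Wm in enumerate(W):
            # Entry (n, d) is the inner product of cell n's factors and
            # feature d's weights, plus Gaussian residual noise.
            Y[(g, m)] = [
                [sum(z * w for z, w in zip(z_row, w_row)) + rng.gauss(0, noise_sd)
                 for w_row in Wm]
                for z_row in Zg
            ]
    return Z, W, Y

# Two groups (5 and 3 cells), two modalities (4 and 6 features), K = 2 factors.
Z, W, Y = simulate_mofa_like(n_cells=[5, 3], n_features=[4, 6], n_factors=2)
for (g, m), Ygm in sorted(Y.items()):
    print(f"group {g}, modality {m}: {len(Ygm)} x {len(Ygm[0])}")
```

Each group contributes one factor matrix and each modality one weight matrix, so G groups and M modalities yield G × M data matrices from only G + M parameter matrices.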

WNN (Seurat V4) [35] was primarily designed for the analysis of CITE-Seq data (RNA + surface protein abundance), and it was also applied to paired measurements of RNA and chromatin accessibility in the same cells. The key is to construct a weighted nearest neighbor (WNN) graph, defined as a K-nearest neighbor (KNN) graph constructed using a weighted similarity metric that combines the information in the two modalities. The WNN graph can then be used for downstream analysis, including data visualization, clustering, and trajectory analysis. The weighted similarity between cell i and cell j is defined as

$$\displaystyle \begin{aligned} \theta_{weighted}(i, j) = w_{rna}(i)\theta_{rna}({\mathbf{r}}_i, {\mathbf{r}}_j) + w_{protein}(i)\theta_{protein}({\mathbf{p}}_i, {\mathbf{p}}_j), \end{aligned} $$
(12)

where r represents the observed RNA profile for a cell, and p represents the observed surface protein levels for a cell. w rna(i) and w protein(i) are the weights for the RNA and protein profiles, respectively, and the weights are specific to cell i. θ rna(r i, r j) and θ protein(p i, p j) denote the affinities between cell i and cell j computed from RNA levels and protein levels, respectively, and they are defined as the following:

$$\displaystyle \begin{aligned} \theta_{rna}({\mathbf{r}}_i, {\mathbf{r}}_j) = \exp(\frac{-\text{max}(d({\mathbf{r}}_i, {\mathbf{r}}_j) - d({\mathbf{r}}_i, {\mathbf{r}}_{knn_{r,i,1}}), 0)}{\sigma_{r, i} - d({\mathbf{r}}_i, {\mathbf{r}}_{knn_{r,i,1}})}), \\ \theta_{protein}({\mathbf{p}}_i, {\mathbf{p}}_j) = \exp(\frac{-\text{max}(d({\mathbf{p}}_i, {\mathbf{p}}_j) - d({\mathbf{p}}_i, {\mathbf{p}}_{knn_{p,i,1}}), 0)}{\sigma_{p, i} - d({\mathbf{p}}_i, {\mathbf{p}}_{knn_{p,i,1}})}), \end{aligned} $$
(13)

where d(⋅, ⋅) is a distance between two profiles; \({\mathbf {r}}_{knn_{r,i,1}}\) denotes the RNA profile for the cell that is closest to cell i, using RNA data to calculate the distance; \({\mathbf {p}}_{knn_{p,i,1}}\) denotes the protein profile for the cell that is closest to cell i, using protein data to calculate the distance; and σ r,i and σ p,i are cell-specific kernel bandwidths. The affinities θ rna(r i, r j) and θ protein(p i, p j) therefore represent the similarities between cells i and j using RNA and protein profiles, while accounting for the distance between cell i and its nearest neighbor.
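A minimal sketch of the affinity in (13) for a single modality is given below. It assumes a Euclidean distance and takes the bandwidth σ as a user-supplied number larger than the nearest-neighbor distance; how the bandwidth is chosen in practice is not covered here. The function names are illustrative and are not the Seurat API.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two profiles (an assumed choice of d)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def affinity(profiles, i, j, sigma_i):
    """Affinity between cell i and cell j within one modality, following (13)."""
    d_ij = euclidean(profiles[i], profiles[j])
    # Distance from cell i to its nearest neighbor, excluding cell i itself.
    d_nn = min(euclidean(profiles[i], profiles[k])
               for k in range(len(profiles)) if k != i)
    if sigma_i <= d_nn:
        raise ValueError("sigma_i must exceed the nearest-neighbor distance")
    return math.exp(-max(d_ij - d_nn, 0.0) / (sigma_i - d_nn))

# Three cells on a line; cell 1 is the nearest neighbor of cell 0.
profiles = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
print(affinity(profiles, 0, 1, sigma_i=2.0))  # nearest neighbor: affinity 1
print(affinity(profiles, 0, 2, sigma_i=2.0))  # farther cell: exp(-2)
```

Subtracting the nearest-neighbor distance means that the closest cell always receives affinity 1, so the kernel adapts to the local density around cell i.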

The weights w rna(i) and w protein(i) are chosen as the following:

$$\displaystyle \begin{aligned} s_{rna}(i)=\frac{\theta_{rna}({\mathbf{r}}_i, \hat{\mathbf{r}}_{i,knn_r})}{\theta_{rna}({\mathbf{r}}_i, \hat{\mathbf{r}}_{i,knn_p})+\epsilon}, s_{protein}(i)=\frac{\theta_{protein}({\mathbf{p}}_i, \hat{\mathbf{p}}_{i,knn_p})}{\theta_{protein}({\mathbf{p}}_i, \hat{\mathbf{p}}_{i, knn_r})+\epsilon}, \\ w_{rna}(i)=\frac{e^{s_{rna}(i)}}{e^{s_{rna}(i)} + e^{s_{protein}(i)}}, w_{protein}(i)=\frac{e^{s_{protein}(i)}}{e^{s_{rna}(i)} + e^{s_{protein}(i)}}, \end{aligned} $$
(14)

where \(\hat {\mathbf {r}}_{i,knn_r}\) and \(\hat {\mathbf {r}}_{i,knn_p}\) are the average RNA profiles among the neighbors of cell i: in \(\hat {\mathbf {r}}_{i,knn_r}\), the neighborhood is obtained by the closest distances in RNA profiles, while in \(\hat {\mathbf {r}}_{i,knn_p}\), the neighborhood is obtained by the closest distances in protein profiles. \(\hat {\mathbf {p}}_{i,knn_p}\) and \(\hat {\mathbf {p}}_{i, knn_r}\) are the average protein profiles among the neighbors of cell i, where the neighborhoods are obtained by the closest distances in protein profiles and RNA profiles, respectively. 𝜖 is a small positive constant that avoids division by zero. The intuition for choosing the weights is that when the neighborhood obtained from RNA profiles predicts the RNA and protein profiles of cell i better than the neighborhood obtained from the protein profiles does, the weight w rna(i) will tend to be larger.
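The weights in (14) are a two-class softmax of the ratios s rna(i) and s protein(i). The sketch below takes the four affinities for one cell as inputs; the default value of the constant eps is an assumption made for illustration, and the function name is hypothetical.

```python
import math

def modality_weights(theta_rna_own, theta_rna_cross,
                     theta_prot_own, theta_prot_cross, eps=1e-4):
    """Cell-specific modality weights following (14).

    theta_rna_own:    affinity of r_i to the RNA average over RNA neighbors
    theta_rna_cross:  affinity of r_i to the RNA average over protein neighbors
    theta_prot_own:   affinity of p_i to the protein average over protein neighbors
    theta_prot_cross: affinity of p_i to the protein average over RNA neighbors
    """
    s_rna = theta_rna_own / (theta_rna_cross + eps)
    s_prot = theta_prot_own / (theta_prot_cross + eps)
    # Softmax over the two ratios; subtracting the maximum keeps exp() stable.
    m = max(s_rna, s_prot)
    e_rna, e_prot = math.exp(s_rna - m), math.exp(s_prot - m)
    total = e_rna + e_prot
    return e_rna / total, e_prot / total

# RNA neighbors predict the RNA profile much better than protein neighbors do,
# while the two neighborhoods predict the protein profile about equally well.
w_rna, w_protein = modality_weights(0.9, 0.3, 0.6, 0.5)
print(w_rna, w_protein)
```

In this toy case the RNA ratio is larger than the protein ratio, so the RNA modality receives the larger weight for this cell.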

TotalVI [36] was developed for CITE-Seq data and is based on the variational autoencoder [40]. Suppose that there are B batches, and let s n be a vector of length B that represents the known one-hot batch index for cell n. The batch index s n is the same for RNA data and protein data. TotalVI learns a shared latent representation for RNA and protein data. Let z n denote the latent representation for cell n. The prior on z n is specified as

$$\displaystyle \begin{aligned} \boldsymbol{z}_n \sim \text{LogisticNormal}(\boldsymbol{\mu}=\boldsymbol{0}, \boldsymbol{\Sigma}=\boldsymbol{I}). \end{aligned} $$
(15)

The following hierarchical model was assumed for the RNA levels in CITE-Seq data.

Similar to the specification in scVI [41], the size factor for RNA data for cell n, represented as \(\ell _n \in \mathbb {R}_+\), was assumed to be latent and it depends on the batch index s n:

$$\displaystyle \begin{aligned} \ell_n \mid \boldsymbol{s}_n \sim \text{LogNormal}(\mu=\boldsymbol{\ell}_{\mu}^T \boldsymbol{s}_n, \sigma^2=\boldsymbol{\ell}_{\sigma^2}^T \boldsymbol{s}_n), \end{aligned} $$
(16)

where \(\boldsymbol {\ell }_{\mu }\) and \(\boldsymbol {\ell }_{\sigma ^2}\) are vectors of length B, and their entries are set to the empirical mean and variance of the log(RNA library size) calculated from the cells within individual batches. Let x ng denote the observed RNA count for gene g in cell n. It was modeled as the following:

$$\displaystyle \begin{aligned} & x_{ng} \mid \ell_n, \boldsymbol{\rho}_n, \theta_g \sim \text{NB}(\text{mean}=\ell_n \rho_{ng}, \text{dispersion}=1/\theta_g), \\ & \boldsymbol{\rho}_n =f_{\boldsymbol{\rho}}(\boldsymbol{z}_n, \boldsymbol{s}_n), \end{aligned} $$
(17)

where NB stands for negative binomial distribution. The function f ρ(z n, s n) is a neural network: its inputs are z n and s n, and the output is a vector ρ n, which represents the abundance of the genes in cell n. The model specification for RNA data is very similar to that in scVI, and the major difference is that zero inflation was not considered in TotalVI.
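As an illustration of how the prior parameters in (16) can be obtained, the following sketch computes the per-batch mean and variance of the log library size and then selects the entry for a cell through its one-hot batch vector. The use of the population variance, the function names, and the toy library sizes are assumptions made for this example.

```python
import math
from statistics import mean, pvariance

def library_size_prior(library_sizes, batch_ids, n_batches):
    """Per-batch empirical mean and variance of log(RNA library size)."""
    l_mu, l_sigma2 = [], []
    for b in range(n_batches):
        logs = [math.log(s) for s, bid in zip(library_sizes, batch_ids) if bid == b]
        l_mu.append(mean(logs))
        l_sigma2.append(pvariance(logs))  # population variance, an assumed choice
    return l_mu, l_sigma2

def one_hot(b, n_batches):
    """One-hot batch vector s_n for a cell in batch b."""
    return [1.0 if k == b else 0.0 for k in range(n_batches)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data: total RNA counts for five cells from two batches.
library_sizes = [1200, 1500, 900, 4000, 5200]
batch_ids = [0, 0, 0, 1, 1]
l_mu, l_sigma2 = library_size_prior(library_sizes, batch_ids, n_batches=2)

# The inner product with s_n picks out the parameters of the cell's own batch.
s_n = one_hot(1, 2)
print(dot(l_mu, s_n), dot(l_sigma2, s_n))
```

Because s n is one-hot, the inner products in (16) simply look up the mean and variance of the batch that cell n belongs to.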

The following hierarchical model was assumed for the protein levels in CITE-Seq data.

Let y nt denote the observed count for protein t in cell n. It was assumed to follow a negative binomial mixture distribution:

$$\displaystyle \begin{aligned} y_{nt} \mid v_{nt}, \beta_{nt}, \alpha_{nt} \sim {} & v_{nt}\, \text{NB}(\text{mean}=\beta_{nt}, \text{dispersion}=1/\phi_t) \\ & + (1 - v_{nt})\, \text{NB}(\text{mean}=\beta_{nt} \alpha_{nt}, \text{dispersion}=1/\phi_t), \end{aligned} $$
(18)

where v nt is a binary latent variable representing the mixture component, and ϕ t is a protein-specific parameter that controls the dispersion. β nt represents the background intensity, and α nt > 1 represents the fold change in mean for the mixture component with the larger mean. Thus v nt = 1 corresponds to the background component, and v nt = 0 corresponds to the mixture component with the larger mean. The distribution for v nt was specified as the following:

$$\displaystyle \begin{aligned} & v_{nt} \mid \boldsymbol{\pi}_n \sim \text{Bernoulli}(\pi_{nt}), \\ & \boldsymbol{\pi}_n = h_{\boldsymbol{\pi}}(\boldsymbol{z}_n, \boldsymbol{s}_n), \end{aligned} $$
(19)

where h π(z n, s n) is a neural network: its inputs are z n and s n, and its output is a vector of probabilities π n for cell n.
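Summing out the binary variable v nt in (18) and (19) gives a two-component negative binomial mixture for each protein count. The sketch below evaluates this marginal probability for fixed values of β, α, π, and ϕ. It assumes the common parameterization in which ϕ acts as an inverse dispersion, so that the variance is mean + mean²/ϕ; the function names and numerical values are illustrative.

```python
import math

def nb_logpmf(x, mean, theta):
    """Log pmf of a negative binomial with the given mean and inverse dispersion theta.

    Assumed parameterization: variance = mean + mean**2 / theta.
    """
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mean))
            + x * math.log(mean / (theta + mean)))

def protein_pmf(y, beta, alpha, pi, phi):
    """Marginal pmf of a protein count after summing out the mixture indicator v."""
    background = math.exp(nb_logpmf(y, beta, phi))          # v = 1, mean beta
    foreground = math.exp(nb_logpmf(y, beta * alpha, phi))  # v = 0, mean beta * alpha
    return pi * background + (1.0 - pi) * foreground

# Toy values: background mean 2, fold change 5, background probability 0.3.
beta, alpha, pi, phi = 2.0, 5.0, 0.3, 4.0
probs = [protein_pmf(y, beta, alpha, pi, phi) for y in range(500)]
total = sum(probs)
expected = sum(y * p for y, p in enumerate(probs))
print(total, expected)
```

The marginal mean equals π β + (1 − π) β α, which separates the contribution of the background from that of the component with the larger mean.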

The distribution of β nt is specified as

$$\displaystyle \begin{aligned} \beta_{nt} \mid \boldsymbol{s}_n \sim \text{LogNormal} (\mu=\boldsymbol{c}_t^T \boldsymbol{s}_n, \sigma^2=\boldsymbol{d}_{t}^T \boldsymbol{s}_n), \end{aligned} $$
(20)

where c t and d t are vectors of length B to be estimated from the data. The vector α n, with entries α nt, is specified as α n = g α(z n, s n), where g α is a neural network. Inference of TotalVI was performed under the variational autoencoder framework.

scAI [37] was developed for the integrative analysis of single-cell transcriptome and epigenome profiled in the same single cells, and it is based on non-negative matrix factorization. The following optimization problem was proposed in scAI:

$$\displaystyle \begin{aligned} \mathop{\text{argmin}}_{{\mathbf{W}}_1, {\mathbf{W}}_2, \mathbf{H}, \mathbf{Z} \geq 0} \alpha \|{\mathbf{X}}_1 - {\mathbf{W}}_1 \mathbf{H} \|{}_F^2 + \|{\mathbf{X}}_2 (\mathbf{Z} \circ \mathbf{R}) - {\mathbf{W}}_2 \mathbf{H} \|{}_F^2 + \lambda \| \mathbf{Z} - {\mathbf{H}}^T \mathbf{H} \|{}_F^2 + \gamma \sum_j \|{\mathbf{H}}_{\cdot j} \|{}_1^2. \end{aligned} $$
(21)

X 1 is the normalized p × n (p genes in n cells) data matrix for single-cell transcriptomic data, and X 2 is the normalized q × n (q regions in n cells) data matrix for single-cell epigenomic data. W 1 and W 2 are the gene loading and region loading matrices with dimensions p × K and q × K, respectively. H is the K × n cell loading matrix shared by the transcriptomic and epigenomic data. Z is the n × n cell-cell similarity matrix. R is a binary matrix whose entries are generated from a Bernoulli distribution with probability s. The symbol ∘ represents element-wise multiplication. α, λ, and γ are tuning parameters. The term X 2(Z ∘R) has a smoothing effect on the single-cell epigenomic data matrix: the epigenomic profiles from similar cells are aggregated based on the cell-cell similarity matrix Z, which helps to address the sparsity and high level of noise in single-cell epigenomic data.
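To make the roles of the four terms in (21) explicit, the following sketch evaluates the objective for given matrices stored as nested lists. It only computes the objective value and does not implement the optimization used in scAI; the helper names and the toy matrices are illustrative.

```python
def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Element-wise product of two matrices of the same shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sq_frobenius(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def scai_objective(X1, X2, W1, W2, H, Z, R, alpha, lam, gamma):
    """Value of the objective in (21) for given factor matrices."""
    fit_rna = sq_frobenius(X1, matmul(W1, H))
    # Epigenomic profiles are first aggregated over similar cells via Z o R.
    fit_epi = sq_frobenius(matmul(X2, hadamard(Z, R)), matmul(W2, H))
    similarity = sq_frobenius(Z, matmul(transpose(H), H))
    # Squared L1 norm of each column of H (one column per cell).
    sparsity = sum(sum(abs(h) for h in col) ** 2 for col in zip(*H))
    return alpha * fit_rna + fit_epi + lam * similarity + gamma * sparsity

# Toy case with n = 2 cells and K = 2 factors in which every fit term vanishes.
H = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # p = 3 genes
W2 = [[1.0, 0.0], [0.0, 1.0]]               # q = 2 regions
X1, X2 = matmul(W1, H), matmul(W2, H)
Z = matmul(transpose(H), H)                 # equals the identity here
R = [[1.0, 1.0], [1.0, 1.0]]
value = scai_objective(X1, X2, W1, W2, H, Z, R, alpha=1.0, lam=1.0, gamma=0.5)
print(value)
```

In this toy case only the sparsity penalty on the columns of H contributes, so the value equals γ times the number of cells.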

JSNMF [38] was also developed for the integrative analysis of single-cell transcriptome and epigenome profiled in the same single cells. The following optimization problem was proposed in JSNMF:

$$\displaystyle \begin{aligned} \min_{{\mathbf{W}}_i, {\mathbf{H}}_i, \mathbf{Z}, \lambda_i} \ & \sum_{i=1}^2 \|{\mathbf{X}}_i - {\mathbf{W}}_i {\mathbf{H}}_i \|{}_F^2 + \frac{\alpha}{2} \sum_{i=1}^2 \| \mathbf{Z}- {\mathbf{H}}_i^T {\mathbf{H}}_i \|{}_F^2 + \sum_{i=1}^2 \frac{\varphi_i}{2} \|{\mathbf{H}}_i {\mathbf{H}}_i^T - \mathbf{I} \|{}_F^2 \\ & + \eta \|{\mathbf{1}}^T \mathbf{Z} - {\mathbf{1}}^T \|{}_F^2 + \gamma \sum_{i=1}^2 \lambda_i^2 \ tr({\mathbf{H}}_i {\mathbf{L}}_i {\mathbf{H}}_i^T) \\ \text{s.t. } & {\mathbf{W}}_i, {\mathbf{H}}_i, \mathbf{Z}, \lambda_i \geq 0 \text{ for } i \in \{ 1, 2 \}; \ \sum_{i=1}^2 \lambda_i^2=1. \end{aligned} $$
(22)

Similar to scAI, JSNMF is also based on non-negative matrix factorization. X 1, X 2, W 1, and W 2 have interpretations similar to those in scAI. One key difference between JSNMF and scAI is that JSNMF assumes different cell loading matrices H 1 and H 2 for the two data modalities and integrates the information in H 1 and H 2 through consensus graph fusion, \(\sum _{i=1}^2 \| \mathbf {Z}- {\mathbf {H}}_i^T {{\mathbf {H}}_i} \|{ }_F^2\). This integration strategy was shown to be beneficial when the data from different types of genomic features have different levels of noise. The term \(\sum _{i=1}^2 \frac {\varphi _i}{2} \|{\mathbf {H}}_i {\mathbf {H}}_i^T - \mathbf {I} \|{ }_F^2\) improves the interpretability of the factors. \(\|{\mathbf {1}}^T \mathbf {Z} - {\mathbf {1}}^T \|{ }_F^2\) is a normalization term that encourages the columns of Z to sum to values close to 1. L i ∈ R n×n is the graph Laplacian for the ith data modality, and it captures the high-dimensional geometrical structure in the original data space. The term \(\sum _{i=1}^2 \lambda _i^2 \ tr({\mathbf {H}}_i {\mathbf {L}}_i {\mathbf {H}}_i^T)\) encourages the low-dimensional embeddings H i to preserve the high-dimensional geometrical structure. In JSNMF, formula (22) was also extended to the integration of more than two molecular modalities profiled on the same cells and to the integration of multiple single-cell multi-omics experiments. JSNMF also includes a module that infers cell type-specific region-gene associations.
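The role of the trace term can be seen from a standard identity for graph Laplacians. The sketch below assumes the unnormalized Laplacian L = D − A built from a symmetric cell-cell affinity matrix A, which may differ from the exact construction used in JSNMF, and checks numerically that tr(H L Hᵀ) equals half the affinity-weighted sum of squared distances between the embeddings of pairs of cells. The function names and toy matrices are illustrative.

```python
def graph_laplacian(A):
    """Unnormalized Laplacian L = D - A of a symmetric affinity matrix A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def trace_penalty(H, L):
    """tr(H L H^T) for a K x n embedding H and an n x n Laplacian L."""
    n = len(L)
    return sum(h[i] * L[i][j] * h[j] for h in H for i in range(n) for j in range(n))

def pairwise_penalty(H, A):
    """0.5 * sum_ij A_ij * ||h_i - h_j||^2, where h_i is the i-th column of H."""
    cols = list(zip(*H))
    n = len(A)
    return 0.5 * sum(
        A[i][j] * sum((a - b) ** 2 for a, b in zip(cols[i], cols[j]))
        for i in range(n) for j in range(n)
    )

# Three cells: cells 0 and 1 are weakly connected, cells 1 and 2 strongly connected.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
H = [[1.0, 2.0, 4.0],   # K = 2 factors (rows), n = 3 cells (columns)
     [0.0, 1.0, 1.0]]
L = graph_laplacian(A)
print(trace_penalty(H, L), pairwise_penalty(H, A))
```

Both quantities agree, which shows that the penalty is small when cells with high affinity in the original data space receive similar low-dimensional embeddings.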

3 Challenges and Future Perspectives

Different molecular modalities capture different aspects of the cell. Most methods in this review focus on exploratory analysis, including dimension reduction and clustering. The natural next step is methodology development for downstream analysis, including estimating transcriptional regulatory networks, integrating the data with summary statistics from genome-wide association studies to unravel the mechanisms of human diseases, and relating single-cell multi-omics data to the clinical outcomes of patients.

Multi-omics data obtained from the same single cells tend to be noisier than single-omic data. It will be interesting to integrate these data with existing reference data, especially atlas-scale data, to help deal with the high noise level. Computational burden will be another challenge as technology developments increase the throughput of cells. Single-cell epigenomic data tend to have many more features than scRNA-Seq data, and the analysis of these datasets will be more demanding computationally.