4.1 Brief Introduction to Single Cell Sequencing

Single cell sequencing (SCS) can be harnessed to profile the genomes, transcriptomes and epigenomes of individual cells. Next generation sequencing (NGS) technology is the driving force behind single cell sequencing. Although various biases can be introduced during molecular amplification, it is well recognized that SCS, with the help of mathematical algorithms and models, can detect single nucleotide variations (SNVs) [1], copy number variations (CNVs) [2], structural variations (SVs) [3], gene expression and fusions [4,5,6,7,8], novel transcripts and alternative splicing [9], methylation [10] and chromatin patterns [11, 12] at the single cell level. SCS also has great potential to reveal biology that could not be investigated before. For example, researchers have used single cell RNA-seq (scRNA-seq) to uncover new cell types in the nervous system [13], the immune system and the hematopoietic system [14], and to gain new insights into the clonal evolution of cancer [15]. Most recently, the accuracy and throughput of SCS have increased dramatically, to the point where thousands of single cells or more can be profiled in parallel [5, 16].

4.2 Large-Scale scRNA-seq Library Preparation

scRNA-seq requires a lengthy pipeline comprising single cell sorting, RNA extraction, reverse transcription, amplification, library construction, sequencing and subsequent bioinformatic analysis. As the key factor in increasing the throughput of scRNA-seq studies, high-throughput library preparation technology has developed very quickly in recent years. FACS-based scRNA-seq library preparation combined with automated liquid handling enables the processing of single cells in 96-well or 384-well plates in each run [17]. The Fluidigm C1 system, based on a microvalve microfluidic chip developed by Quake's lab, enabled the preparation of full-length transcripts from 96 single cells in parallel in 2012 [7], and a similar chip with higher throughput, preparing 3′ end transcripts from up to 800 single cells, was released in 2015. Another type of microfluidic chip, the microwell chip, has also been used for single-cell RNA amplification. Wu et al. developed an approach called MIRALCS [4], which allows massively parallel amplification of full-length transcripts from 500–1000 single cells on a 5184-well chip. Using the same chip, Wafergen Inc. released a single cell preparation system named ICell8, which prepares 3′ single cell transcripts with a throughput of up to 1800 cells per run [18]. Taking advantage of barcoded-bead technology, two groups independently described microwell chip based methods with the capacity to obtain gene expression from thousands of cells at the single cell level [19, 20]. In addition, droplet microfluidic techniques raise the throughput of single cell 3′ end RNA-seq to the million-cell level and reduce the reaction volume to picoliters [21, 22]. A commercial instrument using the same strategy was developed by 10x Genomics, enabling the preparation of at most 48,000 single cells from eight different samples in parallel.
With the development of high-throughput scRNA-seq library preparation technology, the cost has been reduced to less than one dollar per cell, which greatly promotes studies at the single cell level.

4.3 Computational Analysis of scRNA-seq Data

Computational algorithms are essential for many tasks of interest that use scRNA-seq data (Fig. 4.1). There is a general consensus that the analysis of scRNA-seq data sets has a lot in common with that of conventional RNA-seq data. More specifically, the vast majority of the basic pipelines and tools established for sequencing data derived from bulk cell populations are applicable to data from single cells, for steps including read alignment, quality control and gene expression estimation. However, more dedicated software is urgently needed for tasks such as identifying and characterizing cellular subpopulations, exploring the evolution of cell groups and inferring transcriptional kinetics, owing to the zero-inflated nature of scRNA-seq data and the additional analyses it enables.

  • Quality Control: Single-cell datasets are expected to be especially noisy, so a quality control step should precede any downstream analysis. To begin with, FastQC [23], Qualimap2 [24] and RSeQC [25] are commonly used to assess overall sequencing quality. After removal of adapters and low-quality data, the reads obtained from a well-designed experiment are first aligned to the reference genome using tools such as TopHat [26], HISAT [27] and STAR [28]. Subsequently, several indicators are calculated to identify cells with degraded RNA or a substandard sequencing library, for instance the number of expressed genes, the proportion of reads mapped to endogenous genes and the fraction of external spike-ins among mapped reads [9, 29, 30]. In addition, Treutlein et al. used normal expression of housekeeping genes as a criterion for healthy cells [31].

Fig. 4.1
Representative tasks enabled by scRNA-seq. (a) Subpopulation analysis can be performed with various unsupervised clustering algorithms; (b) Pseudotemporal ordering is essential to understand developmental trajectory or disease progression; (c) Differential gene expression analysis is important for the discovery of cell type specific biomarkers; (d) Network inference can be performed to learn regulatory intracellular and intercellular networks; (e) Analysis of alternative splicing offers a new perspective on biology and medicine; (f) Allele specific expression patterns can be addressed using scRNA-seq data
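
The three quality-control indicators named in the Quality Control item above (number of expressed genes, proportion of reads mapped to endogenous genes, and fraction of spike-in reads) can be illustrated with a minimal sketch. The `CellQC` container, the function names and all threshold values are hypothetical choices made for this illustration. They are not taken from the cited studies, and suitable cut-offs depend on the protocol and sequencing depth.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CellQC:
    """Per-cell alignment summary (hypothetical container for this sketch)."""
    cell_id: str
    gene_counts: Dict[str, int]   # reads per endogenous gene
    spike_in_reads: int           # reads mapped to external spike-ins
    total_mapped_reads: int       # all mapped reads for this cell


def qc_indicators(cell: CellQC) -> Dict[str, float]:
    """Compute the three indicators for one cell."""
    total = cell.total_mapped_reads
    n_expressed = sum(1 for count in cell.gene_counts.values() if count > 0)
    if total == 0:
        # No mapped reads: report zeros instead of dividing by zero.
        return {"n_expressed": 0, "endogenous_frac": 0.0, "spike_in_frac": 0.0}
    return {
        "n_expressed": n_expressed,
        "endogenous_frac": sum(cell.gene_counts.values()) / total,
        "spike_in_frac": cell.spike_in_reads / total,
    }


def filter_cells(cells: List[CellQC],
                 min_genes: int = 1000,              # assumed threshold
                 min_endogenous_frac: float = 0.5,   # assumed threshold
                 max_spike_in_frac: float = 0.3      # assumed threshold
                 ) -> List[str]:
    """Return the IDs of cells that pass all three checks."""
    kept = []
    for cell in cells:
        if cell.total_mapped_reads == 0:
            continue  # a cell without mapped reads cannot be assessed
        ind = qc_indicators(cell)
        if (ind["n_expressed"] >= min_genes
                and ind["endogenous_frac"] >= min_endogenous_frac
                and ind["spike_in_frac"] <= max_spike_in_frac):
            kept.append(cell.cell_id)
    return kept
```

In practice the thresholds are usually chosen by inspecting the distribution of each indicator across all cells in the experiment, not fixed in advance.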

  • Expression estimation and normalization: For data without UMIs, gene expression levels of cells that pass quality control can be estimated as counts using HTSeq [32], WemIQ [33] or RSEM [34], while relative expression measures, including transcripts per million (TPM) and reads/fragments per kilobase of transcript per million mapped reads (RPKM/FPKM), are widely adopted in downstream analysis. Islam et al. [29] and Hashimshony et al. [35] provide solutions for UMI-tagged reads. Normalization is essential because technical variability affects comparisons of expression levels between samples. Median normalization or a similar method is popular in many scRNA-seq studies without spike-ins or UMIs [30, 36,37,38,39]. In single cell experiments where spike-ins were applied, technical artifacts can be estimated from the difference between the expected and observed spike-in expression. Nevertheless, instability arising from inconsistent detection of spike-ins has led to the more recent approach of comparing absolute molecule counts between cells using UMIs, which greatly reduce amplification noise by attaching random sequences to cDNA fragments ahead of PCR [21, 29, 40].

  • Identification of subpopulations: Identifying cellular subpopulations among heterogeneous cells is one of the most exciting areas of exploration in scRNA-seq experiments, and various clustering algorithms have been developed for this purpose. Pollen et al. [41] distinguished different types of cells along lung development using principal component analysis. The study by Li et al. [42] showed transcriptional heterogeneity in colorectal tumors with a novel strategy named reference component analysis (RCA). Along similar lines, self-organizing maps (SOMs) [43], circular a posteriori projection (CAP), ZIFA [44], t-SNE [45] and BackSPIN [46] clustering are approaches developed for differentiating between cells within a biological condition by dimensionality reduction of scRNA-seq data. In addition, RaceID [6] is a computationally efficient tool that relies on k-means clustering, whereas SNN-Cliq [47] clusters individual cells with a graph-based algorithm that uses a shared nearest neighbor (SNN) similarity measure. Guo et al. [48] further presented a pipeline for known cell type enrichment that is analogous to gene set enrichment analysis.

  • Differential expression and transcript isoforms across conditions: Once subpopulations are distinguished, differential expression analysis can be applied to characterize cell types. Researchers used to investigate differentially expressed genes among cells of different types or stages with strategies designed for bulk RNA-seq. However, the abundance of zero values in single-cell expression matrices can yield false sets of genes that appear differentially expressed only because of noise. As a consequence, many mixture-model-based methods such as MAST [49] and SCDE [50] have been created to accommodate bimodality in expression levels. Similarly, D3E [51] identifies DE genes by comparing two probability distributions under a transcriptional bursting model. More recently, Korthauer et al. [52] established a more accurate Bayesian modeling framework, scDD, for detecting differential expression patterns under a wide range of circumstances. Unlike traditional methods that test for a simple mean shift, scDD provides posterior probabilities of differential distributions (DD) for each gene and classifies each DD gene as a difference between unimodal distributions (traditional DE), differential modality (DM), differential proportion (DP), or both DM and DE (abbreviated DB).

  • Pseudotemporal ordering: Knowledge of the global expression profile of individual cells provides opportunities to investigate dynamic cellular processes such as normal tissue development, stem cell differentiation and tumor progression. A number of computational methods build on the idea that differentiation paths can be constructed by reordering unsynchronized cells according to gradual changes in gene expression levels across stages. As with approaches for cellular subpopulation identification, most methods perform pseudotemporal ordering after reducing the dimensionality of the gene expression data. Monocle [53], for example, was the most effective tool for constructing differentiation paths in the infancy of single cell technology. In Monocle, a minimum spanning tree (MST) is built on data processed by independent component analysis (ICA), and by default the longest path through the MST is taken as the differentiation trajectory. Subsequently, Haghverdi et al. [54] developed a diffusion map based method that allows trajectory reconstruction in a single step. Rizvi et al. [55] presented a topology-based algorithm named single-cell topological data analysis (scTDA) for the unbiased study of transcriptional regulation through a nonlinear and unsupervised statistical framework. Furthermore, for oscillatory processes, Oscope [56] can be used to reconstruct oscillatory trajectories from co-regulation information among oscillators.

  • Interrogation of spatial information: Besides following the development of cell populations over time, scRNA-seq can be applied to spatial reconstruction by integrating in situ RNA patterns with genome-wide gene expression profiles. Satija et al. [57] established an accurate, spatially resolved tool, Seurat, for mapping cellular localization, with which they inferred the localization of cells from dissociated zebrafish (Danio rerio) embryos and generated a transcriptome-wide map of spatial patterning. Meanwhile, Achim et al. [58] published another high-throughput approach that relies on a reference gene expression database and successfully allocates brain cells from the marine annelid Platynereis dumerilii to precise locations by comparing specificity-weighted mRNA profiles. Halpern et al. [59] reconstructed a genomic blueprint of the mammalian liver by combining landmark gene expression with single-molecule fluorescence in situ hybridization.

  • Network inference: Identifying co-regulated genes from single cell data is advantageous because it can provide insight into regulatory networks that are hard to detect at the bulk level. Understanding transcriptional regulatory networks is of primary interest in a myriad of studies. For convenience, some statistical methods from bulk studies have been reused to explore scRNA-seq data. Weighted correlation network analysis (WGCNA) [60] can be used to cluster and summarize genes, and offers a comprehensive collection of functions for network construction, module detection, gene selection, calculation of topological properties, data simulation and visualization. Cell-centric statistics (CCs) [61] was devised to model transcriptome dynamics by analyzing aggregated cell-cell statistical distances within biomolecular pathways, for instance to find differentially expressed pathways for a single cell of interest. SCODE [62], in turn, infers the co-regulatory network with ordinary differential equations (ODEs) by integrating the transformation of linear ODEs and linear regression.

  • Differential Splicing: Experimental protocols with full-length transcript coverage at sufficient sequencing depth provide insight into the determination and quantification of alternative splicing isoforms in scRNA-seq data, which reflects heterogeneity among cells of a biological component from another perspective. A 2013 study of heterogeneity in immune cells [9] was the first to reveal the dramatic diversity of splicing patterns in mouse bone-marrow-derived dendritic cells (BMDCs). Gokce et al. [63] used Fisher's exact test to define differentially spliced junctions among cell types and identified splice variants expressed in the mouse striatum. SingleSplice [64] is the latest tailored method for detecting isoform usage differences in scRNA-seq data. It was applied to mouse embryonic stem cells and shed light on the connection between alternative splicing and the cell cycle.

  • Allelic Expression Patterns: Another subtle point is that allele-specific expression can be assessed with scRNA-seq to investigate the contribution of each parental allele. Deng et al. [65] demonstrated abundant random allele-specific gene expression at single-cell resolution in mouse preimplantation embryos using strain-specific SNPs. Reinius et al. [66] argued from an allele-sensitive scRNA-seq experiment that most patterns of random monoallelic expression of autosomal genes (aRME) occur in a decentralized fashion rather than being confined to clonally related cells, as had previously been hypothesized.
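
The Differential Splicing item above mentions Fisher's exact test for comparing splice junctions between cell types. A minimal sketch of a two-sided test on a 2×2 table follows. The table layout (reads supporting a junction versus all other reads at that locus, in two cell types) and the function name are assumptions made for illustration, not a reproduction of the cited analysis. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout for this sketch:
        a = junction-supporting reads in cell type 1
        b = other reads at the locus in cell type 1
        c = junction-supporting reads in cell type 2
        d = other reads at the locus in cell type 2
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x junction reads in cell type 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

When many junctions are tested across cell types, the resulting p-values would normally be corrected for multiple testing before junctions are called differentially spliced.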

4.4 Application of High Throughput scRNA-seq

  • Cancer Biology: scRNA-seq has already enabled researchers to revisit long-standing questions in cancer biology, including cancer metastasis, heterogeneity and evolution. Circulating tumor cells (CTCs) are not only an important vehicle of cancer metastasis [67], but also offer a convenient way to diagnose and monitor cancer that does not depend on surgical resection. One landmark study analysed CTCs isolated from prostate cancer patients and revealed that resistance to androgen receptor inhibition in recurrent disease is partly due to noncanonical Wnt signaling [68].

scRNA-seq is redefining the picture of cancer heterogeneity. Several studies have revealed heterogeneity among cancer cells [69, 70]. A comprehensive profiling of melanoma using scRNA-seq is a classic example [70]. Two distinct transcriptional signatures were found, but they were not mutually exclusive: melanomas characterized by activation of the transcription factor MITF also harbored a small subpopulation of cells with low MITF activity. The heterogeneity of cancer is not limited to cell-to-cell variability among cancer cells. More importantly, a tumor is itself a heterogeneous tissue composed of malignant, immune, stromal and endothelial cells [71]. Recently, profiling of immune cells within the tumor microenvironment has attracted considerable attention [72,73,74,75]. These studies covered a variety of cancers and single cell omic technologies. A recent study employed scRNA-seq to analyse T cells isolated from tumor tissues and adjacent normal tissues from hepatocellular carcinoma (HCC) patients, revealing the distinctive functional composition of T cells in HCC and the clonal enrichment of infiltrating Tregs and exhausted CD8 T cells [72].

The clonal evolution of cancer was proposed more than 40 years ago [76]. Longitudinal single cell analysis is now adding new evidence for this widely held concept [77]. Single nucleus sequencing of biopsies from a primary breast cancer and its liver metastasis suggested that tumor evolution might follow a punctuated expansion mode instead of a gradual progression path [78]. Single cell genome and exome sequencing enabled by multiple displacement amplification (MDA) further increased the coverage of single cell genome sequencing and made mutation and SNP calling possible at the single cell level [79, 80]. The mutation and SNP information for individual cancer cells is valuable for population genetic analysis aimed at understanding the clonal evolution of tumors.

  • Developmental Biology: Our understanding of developmental biology has also been dramatically enhanced by scRNA-seq. The identification of rare cell types was made possible by combining organoid culture, scRNA-seq and a novel algorithm [6]. This culminated in the identification of Reg4 as a novel marker for enteroendocrine cells. Such new markers will in turn facilitate the investigation of rare cell types. Another recent study focused on cells in the blood and identified new types of dendritic cells and monocytes using scRNA-seq [14]. Our understanding of the cell types and subtypes constituting the brain, traditionally defined by morphology, location and function, was renewed by single-nucleus RNA sequencing [81] and scRNA-seq [82].

  • The Human Cell Atlas: With the development of high throughput single cell molecular profiling techniques, an international community is rapidly taking shape to undertake the ambitious project of identifying all cell types in the human body [83]. Single cell omic technologies are at the heart of the human cell atlas. Major efforts will be devoted to key organs, such as the liver, the heart, the kidney and the pancreas [84], as well as key systems, such as the immune system and the central nervous system [85].

Our understanding of disease will also be greatly refined with the realization of the human cell atlas. In the future, biopsies from patients will be routinely assayed with single cell techniques [70, 86] and compared to the normal reference in the human cell atlas. Specific abnormalities will be identified and used to inform both diagnosis and treatment.

The drug industry will benefit dramatically from the human cell atlas. Traditionally, drug discovery and development efforts have been hindered by the fact that all healthy and diseased tissues are inherently heterogeneous [87]. The emergence and rapid application of single cell analysis tools will pave the way to understanding both health and disease at an unprecedented level, so that medicine can finally usher in a new era of personalized healthcare [88, 89].