
1 Introduction

A normal diploid human genome is composed of ~6 billion bp of nuclear DNA distributed in 23 pairs of linear chromosomes, which form DNA-protein complexes, or chromatin, during the interphase of the cell cycle. In addition, there are thousands of copies of the mitochondrial genome (mitogenome) per cell, each a circular DNA molecule of 16,569 bp. Together, the nuclear genome and the mitochondrial genome form the basis of the information reservoir. Through differential expression of ~26,000 protein-coding genes [1] and hundreds, or more, noncoding RNAs, they drive all the biological activities of human life. As the Central Dogma teaches [2], gene expression is mediated by transcription of genomic sequences to produce RNA molecules (mRNA, tRNA, rRNA, and other noncoding RNA) and translation of the protein-coding mRNA molecules.

All matter in the universe is subject to dynamic environmental impacts, and DNA and RNA molecules are no exception. Environmental impacts, including toxic chemical compounds, radiation, magnetism, reactive oxygen species (ROS), and many others, may cause DNA/RNA damage. Aberrations in DNA/RNA sequences are then subject to selection by evolution, which constantly occurs in the context of the environment. Some aberrations, whether as small as single base alterations or as large as deletions or insertions of chromosomal fragments and chromosomal breakage and rearrangements, are likely to result in cell death. Other aberrations, on the other hand, may lead to tumorigenesis [3].

One of the most remarkable features of cancer is genome instability, which acts as a driving force for genetic mutation and plays a key role in wide-spectrum reshuffling of the genomic material [4, 5], eventually causing a tumor mass to diversify into a mixture of heterogeneous, aneuploid genomes. During cancer progression, genomic alterations are recorded as “mutational fingerprints,” which, in theory, can be traced to reveal the evolutionary history of the cancer. Through gene expression, the genomic alterations are carried over to the transcriptome and then the proteome, producing abnormal gene expression profiles and altered proteomic states, respectively. Besides cancerous cells, tumors also contain infiltrating normal cells (such as fibroblasts) and immune cells (such as macrophages and lymphocytes) [6, 7]. Multiple lines of evidence indicate that the complexity of cancer cannot be fully appreciated by conventional approaches that study the cancer cell population as a whole. With the advances of sequencing technologies, aberrations in nucleic acids can now be revealed and analyzed by next-generation sequencing (NGS) at a large scale [8, 9], and single cell sequencing (SCS) technologies are becoming the method of choice. Can the cancer genomes in a tumor be subdivided into subpopulations or subclones? Is SCS able to reveal the developmental history of a tumor? Moreover, since solid tumors also contain a number of noncancerous cells and the gene expression of these “normal” cells is known to be influenced by cancer cells, can SCS provide further insight into how these cells interact to foster cancer progression? From the technical point of view, can SCS compete against ensemble methods for early detection of cancer? How far can SCS go in resolving these issues and answering these questions?

To maintain cell integrity and help define the boundary of cellular activities, each eukaryotic cell has a well-defined cell membrane, which can be directly observed under an electron microscope. Analysis of gene expression at the single cell level by in situ hybridization followed by imaging long preceded PCR and NGS approaches [10, 11]. Automation of Sanger sequencing, together with the invention of PCR by Kary Mullis in the 1980s, stimulated the invention of NGS platforms during 2005–2007, which have created a sustainable state of sequencing-based biological research [12] and are now, through the study of single cells [13], revolutionizing cancer research [14].

Prior to the invention of NGS technologies, tumor evolutionary histories could be studied by dissecting tumors into subregions based on phenotypical characteristics or topographical locations, followed by sequencing library preparation using tissues from subregions, sequencing, and sequence data analysis [15]. With single cell sequencing technologies, the process can be significantly simplified. Recent studies have shown that the mutational fingerprints in single cells of the same tumor, when revealed by SCS, can be interconnected to form lineages of clonal expansion and the history of cancer evolution [14, 16]. Furthermore, the temporal order of the mutational fingerprints may also indicate the driver mutations and their role in cancer progression. Recent research has also shown that the expression of cancer-associated genes (e.g., oncogenes, tumor suppressor genes, oncogenic miRNA, hormone receptor genes, etc.) can now be better understood at single cell resolution [13, 17]. How these genes coordinate and interact with one another will provide an in-depth understanding of intercellular molecular activities in a cell population.

2 Single Cell Sequencing and Its Challenges

One cannot fully appreciate the beauty of a technology without knowing the challenges it has overcome. Indeed, it is much more difficult to conduct single cell sequencing than tissue sequencing [18], mainly because of the limited quantity of genetic material a single cell can provide. Until single molecule sequencing matures and becomes available for sequencing minute amounts of DNA, SCS technologies have to rely on NGS technologies for sequencing the genetic material in single cells. Conventional NGS technologies require bulk genetic material. However, each single cell, even a cancer cell, contains only a minute quantity (normally at the picogram level), at least three orders of magnitude less than that required for conventional NGS protocols. This gap can only be filled by DNA and/or RNA amplification using PCR-based and/or RNA polymerase-based strategies, especially the former. PCR amplification is therefore crucial for SCS. PCR, however, may introduce bias due to imbalanced amplification efficiency between amplicons and thus needs to be carefully designed. In fact, this was one of the major barriers that hindered the progress of SCS in the past. Furthermore, because most molecules in a single cell are of low abundance and low copy number, SCS is extremely vulnerable to fluctuations that may result from loss of genetic material, inconsistent personal technique, or contamination. Thus, the experimental procedure of an SCS protocol, especially from single cell or nucleus isolation up to DNA/RNA amplification, has to be well formulated, and every step needs to be carefully characterized and optimized when dealing with trace amounts of DNA or RNA. Moreover, sequencing error is also an issue that needs to be addressed. While the two alleles at a heterozygous site have an equal chance of being sequenced, sequencing errors appear at a much lower frequency and can be corrected by sufficient fold-of-coverage or, as demonstrated by Kim and Simon, by incorporating the probability of sequencing error into a Bayesian probability test (see below) [19].

2.1 Cells Suitable for Single Cell Sequencing

SCS is suitable for the study of a number of normal and diseased cell types of both prokaryotes and eukaryotes [20, 21]. These include early-stage embryonic cells, stem cells, immune cells, rare cells, microbes that cannot be easily cultured, differentiated cells, bacteria- or virus-infected cells, and cancer cells at various stages. The application of SCS to the study of cancer cells and normal cells in the tumor-surrounding microenvironment is of particular interest because of their complexity and tight association with human disease. Cultured cancer cells are probably the most accessible cancer cells for researchers, while circulating tumor cells (CTCs) and intratumor cancer cells require reliable experimental procedures to isolate. SCS analyses will benefit diagnosis and help in guiding chemotherapy and in monitoring and following up the progress of treatment.

2.2 Methods for Cancer Cell Isolation

Depending on the conditions of the single cells, a number of methods are available for single cell isolation. The most commonly used methods are micromanipulation, flow cytometry, microdissection, single cell labeling, and cell trapping [21]. Recently, an automated system (C1 System by Fluidigm™) for single cell genome and transcriptome analyses has been made commercially available. It is relatively easy to isolate single cells from a cell culture or from the blood, because most of these cells are either already separated or can be separated easily. In this situation, single cells can be washed, diluted, and isolated by manipulation such as mouth pipetting under a microscope. Compared to flow-sorting, micromanipulation is more tedious and may not be suitable for collecting a large number of single cells. However, micromanipulation is probably the gentlest approach; it minimizes the impact of harsh conditions, such as the high pressure produced by a flow cytometer, and thus preserves cell integrity. As such, this method is useful for transcriptome analysis. On the other hand, flow cytometry is probably the most efficient approach [21]. As mentioned above, since high flow pressure can easily damage the cell membrane and cause leakage of the cytoplasm and cross-contamination, precautions have to be taken before using a flow cytometer for whole cell isolation. A short period of cell wash and/or cell culture may be needed. Single cell trapping, which uses a matrix coated with specific molecule(s), is also used for isolating specific cells. It is important to pick single cells that are representative, healthy, with well-maintained cellular integrity, and free of contamination. The automated C1 System facilitates single cell genome and transcriptome analyses by increasing throughput, reducing technical variation, and easing control and comparison [22, 23]. However, its potential problem with primer dimers may result in false positive signals [22]. Moreover, its sensitivity may not be high enough to detect low abundance transcripts. Overcoming these drawbacks appears to be a critical issue for future improvement of automated single cell instruments.

2.3 Methods for Single Cell Whole Genome Amplification

Whole genome amplification (WGA) is the key to single cell sequencing, whether for exome sequencing or whole genome sequencing of the single cells. Preserving the original state of the genome by minimizing uneven amplification is essential for whole genome amplification. This is frequently done by reducing the number of PCR cycles, preventing PCR by-products, or using barcodes. Barcodes not only minimize bias resulting from differences in personal skill and experimental procedure, but also allow multiple small samples to be combined into a larger sample. A number of methods for DNA amplification, together with their limitations, have been reviewed previously [24]. Here, we briefly review the methods employed by Navin and Hou.

In 2011, Navin and colleagues adopted degenerate oligos to prime single cell whole genome amplification (DOP-WGA) [14]. Although efficient enough for the authors to generate reliable datasets for copy number variation (CNV) analysis, this approach provided only low coverage (~6×), presumably due to the limitation in the size range of DOP-WGA products. In 2012, Hou and colleagues published the Multiple Displacement Amplification (MDA) method for single cell whole genome amplification. In this method, they used the Φ29 (Phi29) polymerase to amplify DNA in linear fashion [16]. The products were subsequently subjected to a fluorometry-based quantitation procedure which selected quality sequences for further analysis (see below). This seems to be an efficient approach for single cell genomic amplification for NGS. However, allelic dropout and imbalanced amplification have been reported as drawbacks of the approach [24]. Imbalanced amplification is a common phenomenon in multiplex PCR. Previous studies have indicated that GC content is responsible for these drawbacks. A GC pair is stronger than an AT pair because AT base pairing is mediated by two hydrogen bonds, whereas GC pairing is mediated by three. Besides, high GC content favors the Z-form conformation [25]. These properties influence the efficiency of DNA amplification, causing imbalanced PCR amplification (especially when a significant number of primer pairs are deployed across the entire genome [26]), with allelic dropouts being the most severe cases. Kim and Simon proposed a computational approach to correct potential sequence errors, which may be introduced either by multiple displacement amplification (causing false discovery or allelic dropout) or by sequencing (see Fig. 3).

2.4 NGS DNA Sequencing

Besides genome amplification, next-generation sequencing (NGS), which is currently the only method able to provide sufficient coverage (sequencing depth) for reliable analysis, is also essential for single cell sequencing. After years of competition since 2005, the Illumina platform has prevailed over the others (e.g., 454 and SOLiD systems) [9]. For single cell sequencing, longer reads would be preferable in most cases. However, R2 quality drops faster than R1 quality, making 250 bp an evident barrier for the current Illumina sequencing technology (Fig. 1). The faster quality drop for R2 may result from cluster regeneration. Right before R2 sequencing, R1 clusters (i.e., clusters used as the templates for R1 sequencing) are replaced by their complementary clusters (i.e., R2 clusters made from the complementary strands of the R1 clusters). During the process, some templates in each cluster may get damaged or degraded, leaving the cluster depleted and thus more vulnerable to errors in the sequencing reactions. Limited read length is inherent to the chemistry: the reading of some templates in a cluster may go wrong at any step of the sequencing reaction, and these mistakes accumulate over time, causing the quality value to drop gradually.

Fig. 1

Comparison between R1 and R2 QV profiles. The M. cyclopis mitochondrial DNA library was sequenced by MiSeq with 2 × 300 bp paired-end sequencing. The R1 and R2 QV profiles were then displayed in parallel to indicate the faster decrease in R2 quality
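A per-cycle quality profile of the kind compared in Fig. 1 can be approximated by averaging Phred scores at each read position. The following is a minimal sketch, assuming Phred+33 encoded quality strings; the function name and the toy quality strings are hypothetical and are not taken from the MiSeq run described in the caption.

```python
def mean_quality_per_cycle(quality_strings, offset=33):
    """Return the mean Phred score at each read position (cycle).

    quality_strings: FASTQ quality lines, assumed to be Phred+33 encoded.
    Reads may differ in length; each position averages the reads that reach it.
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):          # first read to reach this cycle
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings (hypothetical values chosen only for illustration).
r1 = ["IIIIHHGG", "IIIHHHGF"]
r2 = ["IIHGFEDC", "IHHGFDCB"]

r1_profile = mean_quality_per_cycle(r1)
r2_profile = mean_quality_per_cycle(r2)
```

Plotting `r1_profile` and `r2_profile` side by side would give a simplified version of the R1 versus R2 comparison; in this toy input the R2 profile ends lower than the R1 profile.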

3 Existing Applications

3.1 Genome Sequencing of Individual Cancer Cells

Depending on the original cell type and the developmental stage, cancer cells may exist in various forms. During cancer progression, cancer cells further diversify in genomic makeup and function. The heterogeneity of the cancer genome presents a challenge for cancer research. At the same time, however, it provides an opportunity for the study of intratumor substructure, cancer progression, and cancer evolution. This is made possible by single cell sequencing of the genomes in the same tumor mass, a breakthrough in next-generation sequencing.

The report by Navin and colleagues in 2011 using single nucleus sequencing (SNS) to study the evolutionary history of human breast cancer marked a breakthrough in cancer research [14]. The experimental procedure can be outlined to include three major steps: isolation of single cancer nuclei by flow-sorting, whole genome amplification by random priming with degenerated oligonucleotides, and next-generation sequencing. The sequence reads were then analyzed to resolve genomic differences in copy number among individual cancer cells (Fig. 2).

Fig. 2

Single cell genome sequencing reveals the differences in genomic sequences between single cells. Individual cells, or nuclei, are first collected in separate tubes. Genomic sequences (chromosomes) are amplified and sequenced. Sequence reads are then mapped against human genome assembly to identify their locations and sequence variations (e.g., copy number variations, insertions, deletions, single base alterations, etc.). Cross-comparison on sequence variation of single cell libraries, if isolated from the same cancer, allows us to reconstitute the evolutionary history of that cancer

The experimental approach was first validated by using single nuclei isolated from the SK-BR-3 cell line, together with a million-cell population control from the same cell line. After genomic amplification using degenerate oligos in random priming, they obtained only low coverage (~6×) of the single cell genomes. However, such a level of coverage is sufficient for CNV analysis. In terms of bioinformatics, the authors designed unique analytical approaches. For example, instead of using fixed intervals to calculate integer copy number, they used variable-length bins with uniform expected unique counts, which corrects for biases that have been reported in WGA. Pileups (over-replicated loci) were found to be sparse and randomly distributed, and therefore did not affect the results. In both the single cells and the million-cell population control, they found major amplifications in genes encoding the MET, TPD52, ERBB2, and BCAS1 proteins. Deletion of the DCC (deleted in colorectal cancer) gene was also detected in both the single cells and the million-cell population of SK-BR-3 cells. These results from the SK-BR-3 cell line allowed them to move forward to study single nuclei isolated from different sections of breast tumors.
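The idea of variable-length bins with uniform expected unique counts can be illustrated with a small sketch. This is not the authors' pipeline: it assumes a hypothetical per-position mappability mask (1 for a uniquely mappable position, 0 otherwise) and simply closes a bin whenever a fixed number of mappable positions has been accumulated, so that every bin has the same expected count under uniform sampling.

```python
def variable_bins(mappable, per_bin):
    """Split a sequence into half-open bins that each contain exactly
    `per_bin` uniquely mappable positions. A trailing partial bin is dropped.

    mappable: sequence of 0/1 flags, one per position (hypothetical mask).
    """
    bins, start, seen = [], 0, 0
    for pos, flag in enumerate(mappable):
        seen += flag
        if seen == per_bin:
            bins.append((start, pos + 1))
            start, seen = pos + 1, 0
    return bins


def bin_counts(read_starts, bins):
    """Count read start positions falling into each bin."""
    counts = [0] * len(bins)
    for r in read_starts:
        for i, (s, e) in enumerate(bins):
            if s <= r < e:
                counts[i] += 1
                break
    return counts


# Toy data: bins are longer where mappability is poor.
mask = [1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
bins = variable_bins(mask, per_bin=3)
counts = bin_counts([0, 1, 4, 5, 6, 8, 9, 10, 11, 11], bins)
```

Because each bin holds the same number of mappable positions, a bin with an elevated count (the last bin in this toy input) suggests a copy number gain rather than a mappability artifact.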

They divided a high-grade (grade III) triple-negative (ER−, PR−, and Her2−) carcinoma (labeled as T10) into six sections and analyzed 100 single cell nuclei isolated from these sections. From the study, they identified distinct clonal subpopulations in the genetically heterogeneous ductal carcinoma. The integer copy number profiles were built and analyzed; the tumor was found to contain 63 % normal cells and 37 % tumor cells and to be infiltrated with leukocytes. By calculating pairwise distances between the 100 profiles and then building a phylogenetic tree using neighbor joining, four subpopulations were identified, one with a flat diploid profile and three with complex (advanced) genomic structures, suggesting three clonal expansions. Moreover, their method was able to detect diverse chromosome gains and losses and to discern ‘pseudodiploid’ nuclei among diploid nuclei. Further clonal analysis allowed the authors to trace the evolutionary history of the cancer from the primary stage to the metastatic stage. The data further suggested that, in contrast to gradual models of tumor progression, tumors grow by “punctuated” clonal expansion, without discernible intermediate branching.
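The distance-then-tree step can be sketched as follows. This is a simplified illustration, not the published analysis: it assumes Euclidean distance between integer copy number profiles (the distance metric is an assumption here) and shows only the first step of neighbor joining, namely choosing the pair of cells that minimizes the standard Q criterion. The five toy profiles are hypothetical.

```python
import math


def pairwise_distances(profiles):
    """Euclidean distance matrix between copy number profiles (assumed metric)."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(profiles[i], profiles[j])))
            d[i][j] = d[j][i] = dist
    return d


def nj_first_pair(d):
    """Return the pair (i, j) minimizing the neighbor-joining criterion
    Q(i, j) = (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best


# Hypothetical integer copy number profiles over four bins.
profiles = [
    [2, 2, 2, 2],  # cell 0: flat diploid
    [2, 2, 2, 3],  # cell 1: near diploid
    [4, 4, 1, 2],  # cell 2: aneuploid clone
    [4, 4, 1, 3],  # cell 3: same clone, one bin differs
    [6, 1, 5, 1],  # cell 4: a different aneuploid clone
]
dist = pairwise_distances(profiles)
first_pair = nj_first_pair(dist)
```

A full neighbor-joining run would merge the selected pair into a new node, update the distance matrix, and repeat until the tree is resolved; established phylogenetics packages implement the complete procedure.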

The application of single nucleus sequencing in reconstitution of cancer evolutionary history, which suggested cancer progression mediated by punctuated clonal expansion, was reviewed [21, 27]. The drawbacks of such copy number-based approach include low coverage and being unable to reach down to the nucleotide resolution.

Following the report by Navin et al., Hou and colleagues reported an MDA-based exome sequencing of single cancer cells [16]. Deviating from copy number-based SCS analysis, this report presented a pilot study at the single cell nucleotide level. The single cell nucleotide sequencing is made possible by using multiple displacement amplification of the whole genome. MDA products are mostly of high molecular weight (>10 Kb), being able to boost genome coverage.

The MDA method was first tested with two single cells under multicell control and hg18 was used as the human genome reference. The coverage was found to be >15× (mean fold coverage = 18×), and sequences of both single cells covered more than 90 % of the reference genome, while more than 95 % of the bases in hg18 were recovered with >15× sequencing depth. WGA failure was found to be associated with GC content, with failed regions containing higher GC% than the average 41 % GC content in human genome. The ratio of allele dropout (ADO), which indicated whether non-amplification occurred in one of the alleles present in a heterozygous sample and would lead to false negative, was maintained at ~11 %. ADO showed no bias relative to genomic location, and errors of MDA also showed no preferences on genes or functions.
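An allele dropout rate of the kind quoted above can be estimated by checking, at sites known to be heterozygous in the multicell control, how often the single cell shows only one allele. The sketch below assumes a hypothetical data layout (dictionaries mapping a site label to a two-allele genotype) and is meant only to make the definition concrete.

```python
def allele_dropout_rate(control_genotypes, cell_genotypes):
    """Fraction of control-heterozygous sites that appear homozygous
    in the single cell. Sites with no call in the single cell are skipped.

    Both arguments map a site label to a tuple of two alleles
    (hypothetical layout used only for this sketch).
    """
    het_sites = [s for s, gt in control_genotypes.items() if gt[0] != gt[1]]
    covered = [s for s in het_sites if s in cell_genotypes]
    if not covered:
        return float("nan")
    dropped = sum(1 for s in covered
                  if cell_genotypes[s][0] == cell_genotypes[s][1])
    return dropped / len(covered)


# Toy genotypes (hypothetical).
control = {
    "chr1:100": ("A", "G"),
    "chr1:200": ("C", "T"),
    "chr2:50": ("G", "G"),   # homozygous in control, ignored
    "chr2:90": ("A", "C"),
    "chr3:10": ("T", "C"),   # not covered in the single cell
}
cell = {
    "chr1:100": ("A", "A"),  # one allele lost: counted as dropout
    "chr1:200": ("C", "T"),
    "chr2:50": ("G", "G"),
    "chr2:90": ("A", "C"),
}
ado = allele_dropout_rate(control, cell)
```

In this toy input one of the three covered heterozygous sites has lost an allele, so the estimated dropout rate is one third.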

After testing, the authors applied the MDA procedure to study the genes involved in the evolution of essential thrombocythemia (ET). They conducted exome sequencing rather than whole genome sequencing. A total of 90 single cells from an ET patient were sequenced to a mean depth of 30×. After filtering out the single cells with <70 % coverage, 58 single cells were chosen for further analysis. These single cells had an average of ~70 % of target bases at >5× depth. The authors considered this coverage sufficient for population variant calling when multiple single cells have the same variant. Exome sequencing generated a somatic mutant allele frequency spectrum (SMAFS) for evolutionary study. The results indicate that the ET patient carries a distinct set of mutations and that the ET cancer cells have a monoclonal origin.
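The filtering and frequency steps described above can be sketched in a few lines. The 70 % coverage threshold is from the text; everything else is an assumption made for illustration, including the data layout, the function names, and the simple definition of mutant allele frequency as mutant alleles divided by twice the number of cells with a call.

```python
def filter_cells(coverage_by_cell, min_coverage=0.70):
    """Keep cells whose target coverage is at least `min_coverage`."""
    return [c for c, cov in coverage_by_cell.items() if cov >= min_coverage]


def mutant_allele_frequencies(calls, cells):
    """Per-site mutant allele frequency across the retained cells.

    calls: {site: {cell: n_mutant_alleles}} with values 0, 1, 2, or None
    for a missing call (hypothetical layout). Diploidy is assumed.
    """
    freqs = {}
    for site, per_cell in calls.items():
        observed = [per_cell[c] for c in cells if per_cell.get(c) is not None]
        if observed:
            freqs[site] = sum(observed) / (2 * len(observed))
    return freqs


# Toy input (hypothetical cells and sites).
coverage = {"c1": 0.82, "c2": 0.65, "c3": 0.74}
calls = {
    "siteA": {"c1": 1, "c2": 1, "c3": 1},
    "siteB": {"c1": 0, "c3": 1},
    "siteC": {"c1": None, "c3": 1},
}
kept = filter_cells(coverage)
freqs = mutant_allele_frequencies(calls, kept)
```

A histogram of the values in `freqs` would give a crude frequency spectrum; the published SMAFS analysis may define and normalize the spectrum differently.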

In a parallel study, Xu et al. used single cell exome sequencing to investigate clear cell renal cell carcinoma (ccRCC) and revealed single nucleotide mutations characteristic of the kidney tumor [28]. Unlike the report by Navin and colleagues, no significant clonal subpopulations were identified in their ccRCC cases, presumably due to the difference in cancer type and origin.

Besides the above-mentioned reports, a computational approach for inferring the evolutionary mutation history of a cancer using single cell sequencing data has also been reported by Kim and Simon (Fig. 3) [19]. The quality of a library is influenced by a number of factors, such as the make or design of the sequencer, personal skill, and the quality or preparation of the material. Nevertheless, up-front quality control removes questionable reads and keeps the sequencing error rate in the qualified reads low, whereas the two alleles at a heterozygous site retain an equal chance of being detected in the sequence reads. These two types of sequence variation are therefore readily distinguishable from each other. Sequencing errors can be detected more easily with sufficient coverage (normally set at 30-fold or above) and then removed by programs, or handled by incorporating the probability of sequencing error into the computation, as demonstrated by the authors.
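Why a heterozygous allele and a sequencing error are distinguishable at adequate depth can be shown with a simple Bayesian comparison. The sketch below is not the model of Kim and Simon; it is a two-hypothesis binomial illustration that ignores allelic dropout and amplification errors, and the error rate and prior used as defaults are assumed values.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def posterior_heterozygous(k, n, error_rate=0.01, prior_het=0.001):
    """Posterior probability that a site is heterozygous, given that
    k of n reads carry the variant base.

    Heterozygous hypothesis: each read shows the variant with probability 0.5.
    Error hypothesis: each read shows it with probability `error_rate`.
    Both `error_rate` and `prior_het` are assumed, illustrative values.
    """
    like_het = binom_pmf(k, n, 0.5)
    like_err = binom_pmf(k, n, error_rate)
    numerator = like_het * prior_het
    return numerator / (numerator + like_err * (1 - prior_het))


p_balanced = posterior_heterozygous(14, 30)  # roughly half the reads
p_single = posterior_heterozygous(1, 30)     # a single discordant read
```

At 30-fold coverage, a site where about half of the reads carry the variant is assigned to the heterozygous hypothesis with near certainty, while a single discordant read is attributed to error.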

Fig. 3

Workflow for the construction of evolutionary mutation tree presented by Kim and Simon

3.2 Transcriptome Sequencing

Sometimes transcriptional information of individual cells, instead of a cell population, is desired. Obtaining such information relies on single cell transcriptome (SCT) sequencing and analysis (http://genomebiology.com/2010/11/S1/P8), which remains a great challenge for current technologies [17].

Indeed, several lines of evidence show that stochastic gene expression is likely to be a natural phenomenon, and thus the gene expression of a tumor or a tissue should be considered a combinatorial phenomenon summarized from its constituent single cell transcriptomes. SCT sequencing is becoming the most advanced approach for studying gene expression and regulation (Fig. 4), and this line of application relies heavily on cDNA synthesis. In fact, methods for single cell cDNA synthesis were published before NGS became a popular technology for sequencing. These include the first single cell transcriptome analysis reported by Eberwine and colleagues in 1992 [12]. Later, in 2006, Kurimoto et al. published another method for microarray-mediated SCT analysis [29]. The first NGS-based single cell transcriptome sequencing was published in 2009 [13]. In 2012, Ramskold and colleagues published an elegant method for cDNA synthesis and amplification [30]. This approach has been commercialized by Clontech as a kit called “the SMARTer Ultra Low RNA Kit.” Instead of using oligo-dT as employed by Tang et al., Ramskold and colleagues used the CDS primer, which carries a VN tail (V stands for ‘non-T’ and N stands for ‘any base’) at the 3’ end of the oligo-dT sequence, to prime first strand cDNA synthesis. The VN tail significantly enhances the specificity of priming because it allows the primer to “hook” onto the last two bases right in front of the polyA tail in the mRNA molecule. Without the VN tail, the primer would ‘slip’ within the polyA region, resulting in a significant amount of imprecise priming. The above-mentioned automated single cell analysis system is an efficient approach, but only for the study of a limited number of genes or genomic regions. Here, too, primer dimers are a concern [22].
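The anchoring role of the VN tail can be made concrete with a short sketch. Given an mRNA sequence ending in a polyA tail, the code below derives the two primer bases that pair with the last two mRNA bases in front of the tail. The function name and the toy sequence are hypothetical, and the sketch ignores real primer design concerns such as oligo-dT length and adapter sequences.

```python
# Complement of an RNA base as a DNA base in the primer.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def vn_anchor(mrna):
    """Return the VN dinucleotide (written 5'->3' on the primer) that anchors
    an oligo-dT primer immediately upstream of the polyA tail.

    mrna: RNA sequence written 5'->3' and ending in a run of A.
    V pairs with the last non-A base before the tail, so it is never T;
    N pairs with the base before that and can be any base.
    """
    body = mrna.rstrip("A")            # strip the polyA tail
    if len(body) < 2 or len(body) == len(mrna):
        raise ValueError("need a polyA tail and two upstream bases")
    v = RNA_TO_DNA_COMPLEMENT[body[-1]]
    n = RNA_TO_DNA_COMPLEMENT[body[-2]]
    return v + n


# Toy transcript (hypothetical): body ...UGC followed by a polyA tail.
anchor = vn_anchor("AUGGCCUGC" + "A" * 10)
primer_3prime_end = "T" * 10 + anchor
```

Because the base immediately before the polyA run is by definition not A, its complement V can only be A, C, or G, which is why V is described as ‘non-T’.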

Fig. 4

Single cell transcriptome sequencing. The profiles of gene expression in single cells can now be studied by single cell transcriptome sequencing. Basically, the procedure is similar to that used for transcriptome analysis of a cell population, except that each cell has to be collected separately and then goes through cell lysis, cDNA synthesis, cDNA amplification, sequencing, sequence data analysis, and cross-library comparison. To prevent material loss, all reactions are conducted at very low volume (e.g., a few microliters or less) and wash is also minimized until the cDNA molecules have been amplified

Single cells are delicate entities, and thus concerns about the accuracy of SCT analysis are inevitable. Various potential factors that may cause transcriptional variation have been, and will continue to be, examined. Indeed, it can be difficult to identify the factors causing variation in single cell transcriptomes. Some variation between individual cells may be real, but some may result from differences in personal technical skill and thus needs to be minimized. To minimize the influence of variability in personal technical skill, it is recommended to increase the number of SCTs and to use internal controls such as housekeeping genes and previously studied expression patterns of certain genes. For example, in our study of single cell transcriptomes of MCF-7 breast cancer cells [31], we used the expression of the LDHB (lactate dehydrogenase B) gene, which is known to be completely shut down in MCF-7 cells, as an internal control. As expected, its expression was not detected in any of the libraries (data not shown).

4 SCT Protocols Can Be Modified for Various Reasons

Protocols may need to be modified for certain reasons [9]. For example, a protocol might not have been optimized when it was published. This frequently occurs when a protocol is published in a hurry by a company trying to catch up with market demand. In addition, one can take advantage of the low input requirement of a single cell sequencing protocol to generate a sufficient amount of input material for a regular sequencer. As shown in the following section, total RNA or mRNA, instead of single cells, can be used as the input material for a single cell protocol. By doing so, we bypass the conventional protocol and use the single cell protocol to rescue the situation when material is not sufficient for a regular sequencer.

5 Using Different Types of Materials as the Input

Since polyA+-RNA molecules in single cells are the only molecules required for double-stranded cDNA synthesis, it is reasonable to use either total RNA or mRNA to replace single cells.

The majority of RNA molecules are ribosomal RNA (rRNA) and transfer RNA (tRNA), while mRNA species constitute only ~2 % of the total RNA. The presence of rRNA and tRNA may reduce the efficiency of cDNA synthesis by interfering with a number of reactions, including oligo-dT priming, reverse transcription, and PCR amplification. As such, it is strongly recommended to use mRNA as the starting material if the amount of total RNA is sufficient for mRNA isolation. In fact, it has been empirically demonstrated that mRNA works better than total RNA.

How, then, do we correlate the results produced from mRNA or total RNA with cell number? Using MCF-7 as an example, each MCF-7 cell contains about 10 pg of total RNA. Accordingly, 50 ng of total RNA is equivalent to about 5000 cells, and, with mRNA making up ~2 % of total RNA (about 0.2 pg per cell), 50 ng of mRNA is equivalent to about 250,000 (=0.25 million) single cells. One can calculate and use a certain amount of total RNA or mRNA based on the number of cells to be represented in the study. Since the total amount of RNA in a single cell varies across cell types, it is strongly recommended to empirically fine-tune this value for the cell type being used.
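The conversion between RNA mass and cell equivalents is simple arithmetic and can be captured in a small helper. The per-cell value of 10 pg total RNA and the ~2 % mRNA fraction are the figures quoted in the text for MCF-7; the function name and parameter names are hypothetical.

```python
def cell_equivalents(rna_ng, total_rna_pg_per_cell=10.0, mrna_fraction=None):
    """Convert an RNA mass in nanograms into an approximate number of cells.

    rna_ng: mass of input RNA in ng.
    total_rna_pg_per_cell: total RNA per cell in pg (about 10 pg for MCF-7).
    mrna_fraction: set to the mRNA share of total RNA (e.g. 0.02) when the
        input is purified mRNA; leave as None when the input is total RNA.
    """
    per_cell_pg = total_rna_pg_per_cell
    if mrna_fraction is not None:
        per_cell_pg *= mrna_fraction      # pg of mRNA per cell
    return rna_ng * 1000.0 / per_cell_pg  # 1 ng = 1000 pg


total_rna_cells = cell_equivalents(50)                  # about 5,000 cells
mrna_cells = cell_equivalents(50, mrna_fraction=0.02)   # about 250,000 cells
```

For other cell types, the per-cell RNA content should be replaced with an empirically determined value, as recommended above.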

Completion of the second PCR amplification marks the junction where the SCT protocol and other protocols meet. At this point, the retrieval of mRNA molecular information from a single cell is complete and preparation of a sequencing library can be initiated. Here, one can decide which type of sequencing library to make: shotgun fragment sequencing, paired-end (PE) sequencing, or paired-end ditag (PED) sequencing.

6 Future Potential Applications of Single Cell Sequencing

We can expect many more SCS applications to be developed for cancer research in the near future. Taking transcription factor binding site (TFBS) analysis as an example, there is strong potential for conducting TFBS analysis at the single cell level. Currently, most cancer research has focused on the study of mutations in cancer-associated genes such as KRAS, TP53, and cMYC. Less attention has been paid to in vivo study of how alterations in DNA motifs affect their interaction with transcription factors (TFs) and/or other intracellular proteins, and how a mutation in a TF affects its DNA binding. Will it form complexes with other, unexpected proteins? Or will it bind to different locations in the genome? It would be interesting to further understand how an altered motif influences TF binding at the single cell level and how the effects at the single cell level exert a combinatorial effect at the population level. In theory, one would expect an alteration in a DNA motif to result in a corresponding switch in the interacting protein(s), which may in turn play a role in tumorigenesis, angiogenesis, and/or metastasis. Empirical SCS data will help to either prove or disprove this speculation.

7 Further Improvements

To enhance SCS data analysis and protocol design, it can be helpful to produce a virtual population profile through the integration of SCS profiles by statistical approach. By comparing the virtual population profile with the empirically produced population profile, we can evaluate and improve the SCS procedure. Conceivably, protocols for single cell experiments have to be reproducible, straightforward, and adaptable. However, many protocols do not yet meet these criteria and need to be optimized.
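One simple way to build and evaluate a virtual population profile is to average the single cell profiles gene by gene and then correlate the result with the empirically measured population profile. The sketch below assumes expression values stored as dictionaries keyed by gene name; the averaging scheme, the use of Pearson correlation, and the toy values are all assumptions chosen for illustration rather than a prescribed method.

```python
from math import sqrt


def virtual_population_profile(cell_profiles):
    """Average expression per gene across single cell profiles.
    A gene missing from a cell is treated as zero expression."""
    genes = sorted({g for profile in cell_profiles for g in profile})
    n = len(cell_profiles)
    return {g: sum(p.get(g, 0.0) for p in cell_profiles) / n for g in genes}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Toy expression values (hypothetical counts for three cells and one bulk sample).
cells = [
    {"GAPDH": 100, "ERBB2": 0, "TP53": 10},
    {"GAPDH": 120, "ERBB2": 40, "TP53": 0},
    {"GAPDH": 80, "ERBB2": 20, "TP53": 5},
]
bulk = {"GAPDH": 95, "ERBB2": 25, "TP53": 6}

virtual = virtual_population_profile(cells)
genes = sorted(virtual)
r = pearson([virtual[g] for g in genes], [bulk[g] for g in genes])
```

A high correlation between the virtual and the empirical population profile would suggest that the single cell procedure captures the population faithfully, while a low correlation would point to steps in the protocol that need optimization.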

The progress of biological research heavily relies on the advance of biotechnologies. There is no doubt that more innovative SCS approaches will be created in the near future.