1 High-Throughput Sequencing Techniques

Since Sanger’s technology in the 1970s, DNA sequencing has been continuously improved in terms of both throughput and cost. Next-generation sequencing (NGS), also called high-throughput or deep sequencing, constitutes a breakthrough in research power and a revolutionary advance in molecular biology. An increasing number of biological questions can be addressed by NGS technologies, which provide a far more comprehensive survey than the Sanger method and support a systems biology perspective. Transcriptomics has particularly benefited from these new technologies, in an application known as RNA-seq, which allows a complete characterization of the whole transcriptome at both gene (Kvam et al. 2012) and exon (Anders et al. 2012) levels, with the additional ability to identify rare transcripts, new genes, novel splicing junctions, and gene fusions (Wang et al. 2009; Katz et al. 2010; Van Verk et al. 2013). More recently, single-cell sequencing has become feasible, allowing a deeper and more systemic view of the transcriptomes of individual cells.

This chapter first gives a brief overview of sequencing techniques, the most common next-generation platforms, and computational methods for RNA-seq data analysis. Then, we present two case studies to assess the capabilities of RNA-seq in addressing important biological issues.

1.1 Sanger’s Sequencing Technology

In 1977, Frederick Sanger and colleagues (1977) developed the DNA sequencing method that, in 2001, enabled the first human genome draft (Lander et al. 2001). This method, called dideoxy chain-termination or simply the Sanger method, is based on special nucleotides (dideoxynucleotides, ddNTPs) that lack the 3′-OH group on the deoxyribose, which blocks DNA elongation. These special nucleotides are mixed at lower concentrations with the regular nucleotides and used as reagents for the DNA polymerase reaction. Because synthesis stops when a ddNTP is incorporated, the last nucleotide of each fragment can be determined. Each of the four ddNTPs was added separately in four different reactions. Originally, one of the regular nucleotides, most commonly dATP or dCTP, was radioactively labeled (e.g., with 32P or 35S) to provide a detectable signal. Usually, polyacrylamide gel electrophoresis was used to separate the DNA molecules, which differed in length by a single nucleotide. The gel was then dried and exposed to X-ray film.
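The way the four reactions reveal the sequence can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: the function names are hypothetical, the newly synthesized strand is given directly, and every position is assumed to yield one terminated fragment in the lane of its ddNTP.

```python
def chain_termination_lanes(new_strand):
    """Return, for each ddNTP lane, the lengths of the fragments terminated in it."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        # A fragment of this length ends with the dideoxy form of `base`
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering all bands from shortest to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = chain_termination_lanes("GATTACA")
recovered = read_gel(lanes)
```

Ordering the bands by fragment length recovers the synthesized strand, which mirrors how the lanes of the gel were read.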

An important modification of the method was the substitution of the radioactive label with fluorescent dyes (Smith et al. 1986). Each distinct wavelength emitted by the fluorescent dyes linked to the dideoxynucleotides corresponds to a different nucleotide, so the four sequencing reactions can be performed in the same tube. With the automation of the Sanger method, up to 96 different reactions could be run in parallel by capillary gel electrophoresis (Marsh et al. 1997), which is considered the first-generation technology. At the peak of the technology, 384 samples could be sequenced at once in a single multi-well plate. The main sequencing devices for the Sanger method are ABI (Applied Biosystems) and MegaBACE (GE Healthcare Life Sciences).

1.2 Next Generation Sequencing

Regulatory mechanisms and gene expression profiles have been widely investigated to elucidate several essential cellular processes. Hybridization-based technology, e.g., microarrays, has been beneficial for determining global gene expression. However, high background levels due to cross-hybridization, a limited range of quantification, and detection restricted to known genes are bottlenecks for the large-scale use of this technique (Shendure 2008). RNA-seq allows a genome-scale transcriptome analysis, including novel genes and splice variants, with a wide range of quantification and reduced sequencing costs (Wang et al. 2009; Soon et al. 2013). These advantages make RNA-seq a better and more attractive solution for whole-genome transcriptome analysis of many organisms, even those with no sequenced reference genome.

Nowadays, the most commonly used NGS platforms for RNA-seq research are Illumina, PacBio, and Nanopore. These and other novel platforms are rapidly becoming more popular as they produce short and long reads at a reasonable price per base. Older NGS technologies are being replaced quickly, and pioneering methods such as pyrosequencing are now wholly abandoned. A comparison of current NGS technologies is shown in Table 3.1.

Table 3.1 Comparison of next-generation sequencing technologies

The enormous amounts of data generated by NGS create new challenges to the downstream bioinformatics analysis, which has to handle large sequence files while searching for comprehensive and useful biological information, discussed later in this chapter.

1.3 Illumina Sequencing

Illumina sequencing uses a reversible dye-terminator technique that adds a single nucleotide to the DNA template in each cycle (Bentley et al. 2008). This system was initially developed by Solexa, which was acquired by Illumina, Inc. in 2007. Illumina is widely used in transcriptome studies since it achieves the greatest sequencing depth among NGS technologies, despite its short read length (150–300 bp).

Illumina sequencing is based on sequencing-by-synthesis. Sequencing is performed on a solid slide covered by adaptors complementary to those added to the fragmented DNA sequences (Metzker 2010). The templates are amplified on the slide by a procedure called bridge PCR, in which bent DNA molecules attached by both ends to the solid surface are copied (Fig. 3.1a). By the end of this clonal amplification, clusters of identical DNA sequences (polonies) have formed, which amplifies the fluorescence signal. In each round, a single nucleotide is added to the single-stranded template sequences, followed by fluorescence detection with a high-sensitivity CCD camera (Fig. 3.1b). As in Sanger’s technology, a different fluorophore is attached to each type of nucleotide; however, only one nucleotide is incorporated in each cycle. After imaging, the fluorophore and the terminating group are removed, which frees the 3′-OH of the newly added nucleotide and allows it to receive a new monomer in the subsequent sequencing cycle.

Fig. 3.1

The Illumina sequencing technology. (a) The two basic steps are the initial priming and extension of the single-stranded, single-molecule template, and the bridge amplification of the immobilized template on a solid device with immediately adjacent primers to form clusters; (b) the images highlight the sequencing data from two sequence clusters; (c) paired-end sequencing, in which reads are generated from both ends of the template. “A” blocks indicate the device-ligation adaptors and “SP” the sequencing primers. (Source: Metzker (2010) and http://www.illumina.com/)

Single-end sequencing, i.e., reads generated from a single-end adaptor, is being replaced by paired-end sequencing, since the latter improves the accuracy of downstream analysis at a reasonable cost. Paired-end reads are produced from the adaptor priming sites at both ends of the template sequence, with the second adaptor primer being used in a subsequent sequencing run (Fig. 3.1c).

1.4 Pacific BioSciences Sequencing

Single-molecule real-time (SMRT) sequencing was devised by Pacific BioSciences (PacBio) in 2009, and it is also called PacBio sequencing (Eid et al. 2009). This platform uses a single DNA polymerase attached to the bottom of a picolitre well, the zero-mode waveguide (ZMW), which replicates a single-molecule template per well to produce a signal for light detection in the smallest volume. In this method, the template is capped by hairpin adapters at both ends of the double-stranded DNA molecule, forming a single-stranded circular DNA (called a SMRTbell). Consequently, the polymerase repeatedly passes over the circular template and sequences it multiple times, which yields long reads and, through the repeated passes, higher accuracy (Rhoads and Au 2015). The PacBio platform enables the simultaneous analysis of millions of wells per chip in a single run, providing read lengths of up to 60 kb (with average read lengths of 20 kb) (Nakano et al. 2017).

Overall, this technology is considered highly accurate and robust, although its first sequencers had some drawbacks that narrowed its application, for instance, limited throughput, higher cost, and a higher error rate compared with second-generation sequencing (SGS) technologies (Kanzi et al. 2020; Wang et al. 2020). However, in 2019, PacBio launched the Sequel II System, which was designed to address these limitations, generating highly accurate (99.9%) individual long reads of up to 25 kb (HiFi reads) and reducing project cost and time compared with prior versions (Wenger et al. 2019; Logsdon et al. 2020). These HiFi reads are generated by circular consensus sequencing (CCS), in which a consensus is derived from the multiple passes over the same circular template (Wenger et al. 2019; Pereira et al. 2020).
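The idea behind a consensus built from repeated passes can be sketched with a per-position majority vote. This is only an illustration under strong assumptions: the function name is hypothetical, the passes are taken to be already aligned and of equal length, and the actual CCS algorithm is considerably more elaborate.

```python
from collections import Counter


def circular_consensus(passes):
    """Majority vote per column over equal-length, pre-aligned passes."""
    if len({len(single_pass) for single_pass in passes}) != 1:
        raise ValueError("passes must be pre-aligned to the same length")
    consensus = []
    for column in zip(*passes):
        # Keep the base observed most often at this position
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five noisy passes over the same template; errors fall at different positions
passes = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGTAC", "ACGTAC"]
consensus = circular_consensus(passes)
```

Because the errors of individual passes are mostly independent, each one is outvoted by the other passes at that position.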

For transcriptomic analysis, the SMRT isoform sequencing (Iso-Seq) method from PacBio increased the read length compared to SGS technologies. This platform achieves full-length transcript sequencing, improving the analysis in different applications, including gene annotation, isoform identification, fusion transcript identification, and long non-coding RNA discovery (Weirather et al. 2015; Nattestad et al. 2018; Wang et al. 2019; Zhang et al. 2020a; Hu et al. 2020).

1.5 Nanopore MinION Sequencing

The long-read sequencer MinION, the first nanopore sequencing device, was announced by Oxford Nanopore Technologies (ONT) in 2012 as a portable, compact, real-time sequencing device controlled by a laptop computer (Deamer et al. 2016). Since then, new nanopore platforms have quickly emerged, such as PromethION, which offers a greater scale of sequencing, and SmidgION, the smallest sequencing platform, designed for use with smartphones or other mobile devices.

After library preparation, each strand is attached to adapters. The adapters bind to a motor protein that guides the sequence to the protein pore, which processes it. Beginning at the 5′-end, the DNA or RNA polymer passes through the pore under the control of the motor protein, which unzips dsDNA and translocates a single strand (Fig. 3.2). The translocated strand modulates the ion current flowing through the pore in the membrane (Ip et al. 2015). The variation of the current caused by each different nucleotide is measured by a sensor, and the different signal patterns enable the nucleotides to be identified. The resulting signals are stored in a FAST5 format file and are then used for base-calling, a process in which the nucleotides are predicted from the raw signals and written to a FASTQ file. Base-calling can be performed using information from only one strand (1D) or from two strands (2D) to build a consensus, with information from both strands resulting in better base prediction (Lu et al. 2016). Currently, neural-network base-callers that detect signal patterns reach accuracies between 85% and 95% (Zhang et al. 2020b).

Fig. 3.2

Schematic view of the nanopore sequencer. The MinION device processes a double-stranded DNA helix. First, the motor protein unzips the DNA and passes a single strand through the pore. The movement of the single strand modulates an ionic current that is measured and converted to nucleotide data by the base-calling analysis

Although sequencing full-length reads improves isoform identification and discovery in transcriptome sequencing, it suffers from high error rates (Kovaka et al. 2019). To reduce error rates before analysis, nanopore reads can be corrected by one of three strategies: hybrid error correction, which uses high-accuracy short reads to correct long reads; self-correction methods, which rely only on long reads; or reference-based methods, which use a reference genome for error correction (Zhao et al. 2019).

2 Bioinformatics Pipelines for Transcriptome Projects

Illumina sequencing is the most widely used technique in transcriptome studies, since the number of sequenced reads (the raw data) makes it possible to recover virtually the complete set of expressed genes (transcripts). However, longer reads allow a more precise definition of the transcripts. In both cases, reconstructing the transcripts is like assembling a puzzle, where the pieces (the reads) have to be put together (with or without a reference genome) to obtain the picture (the transcripts in a transcriptome). After this, different analyses can be performed on the reconstructed transcripts, e.g., quantitative and differential expression analyses. In a transcriptomic project, the tasks of reconstructing transcripts and performing biological analyses are carried out by bioinformatics pipelines, discussed next.

2.1 Pipelines

A bioinformatics pipeline or workflow is a computational system composed of a sequence of programs executed one after another. The output data from one program is the input data for the next program (Wercelens et al. 2019). In general, transcriptome bioinformatics pipelines have the following steps, which can be combined according to the raw input data and the objectives of each project:

  • Quality control of raw data: This initial step allows visualization, analysis, and filtering (cleaning) the data. Usually, this process takes two sub-steps as follows: clipping and trimming. In the clipping step, adapters (primers) attached to the ends of the sequenced reads (or even the whole read) are removed. In the trimming step, low-quality sequences in the reads are filtered. The filtering guarantees a reliable dataset of quality reads to be used in the following phases of the pipeline.

  • Assembly: In the absence of a reference genome or transcriptome, one must be assembled. To do so, overlapping reads (where the end of one read is similar to the beginning of another) are joined into groups of reads (called contigs), from which a larger sequence (called the consensus) is constructed; this consensus is a predicted (fragment of a) transcript. The complete set of transcripts is the predicted transcriptome (Fig. 3.3).

  • Mapping: The filtered reads can be aligned to a reference genome or transcriptome to find the actively expressed exons or transcripts. The number of reads mapping to a single exon/transcript is proportional to its expression.

  • Analysis: The whole set of (fragments of) transcripts obtained from the mapping or the assembly step is used to obtain relevant biological information, e.g.:

    (a) quantitative analysis: among others, coverage analysis shows the abundance of genes expressed in one RNA-seq sample, more precisely, the number of reads mapped to a certain region of the chromosome.

    (b) differential expression: analyzes the differences and variability of gene expression between samples along distinct genomic regions.

    (c) annotation: assigns a biological function to each transcript.
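The chaining of steps described above, in which the output of one program is the input of the next, can be sketched as follows. All function names, the toy reference sequence, and the filtering rule are illustrative assumptions; a real pipeline calls external tools at each step.

```python
from functools import reduce

REFERENCE = "ACGTACGTGGA"  # toy reference sequence (assumption)


def quality_control(reads):
    """Discard reads containing ambiguous bases."""
    return [read for read in reads if "N" not in read]


def mapping(reads):
    """Pair each read with its first exact position in the reference (-1 if absent)."""
    return [(read, REFERENCE.find(read)) for read in reads]


def quantification(hits):
    """Count the reads that were placed on the reference."""
    return sum(1 for _, position in hits if position >= 0)


def run_pipeline(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, data)


mapped_reads = run_pipeline(
    ["ACGT", "NNGT", "GGA", "TTTT"],
    [quality_control, mapping, quantification],
)
```

Swapping the list of steps (for example, replacing the mapping function with an assembly function) changes the pipeline without touching the code that runs it.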

Fig. 3.3

Examples of pipelines for transcriptome analysis: (a) Pipelines for short reads, with a well-characterized reference genome, and two types of analyses – coverage statistics and differential expression. (b) Pipeline for longer reads, with no reference genome, and annotation (biological function, gene categories, and ontologies)

Designing a particular pipeline mainly depends on the transcriptome project’s objectives and other information, such as the sequencing platform employed (since the sequencing techniques may cause specific errors in the raw data). It also depends on the availability of a reference genome or transcriptome in the mapping step and the analysis step’s accuracy and biases. Two generic bioinformatics pipelines for transcriptomes are discussed next.

Pipeline 1

The organism of interest has already been sequenced, preferably with high coverage, and has well-annotated genes and other relevant biological characteristics. The reads are usually short (about 150–300 bp), typically produced by Illumina sequencing platforms. This pipeline would be composed of a minimum of three steps (Fig. 3.3a): quality control, mapping, and quantitative analysis.

Pipeline 2

The organism of interest has not been sequenced before. The reads are usually long (up to 40 kb), typically produced by the PacBio sequencing platform. This pipeline would be composed of a minimum of three steps (Fig. 3.3b): quality control, assembly, and annotation. The assembly phase constructs one consensus sequence for each group of reads with similar extremities. This approach depends heavily on sequencing quality, and combining data from multiple platforms improves the final assembled transcriptome. Finally, the annotation phase assigns biological functions to the consensus sequences.

A bioinformatics pipeline is usually implemented on the command line (e.g., a GNU/Linux terminal), mainly because it is a fast, relatively simple, and reliable way to control and manipulate large datasets. Programming languages such as Shell Script, Python, R, and Perl can also help to implement a pipeline and to solve minor tasks by scripting. The pipeline’s files and data can be organized in directories or in database management systems, either relational databases (e.g., MySQL, Oracle) or NoSQL databases (e.g., MongoDB, Neo4J), to store, retrieve, and manage the data. Most software used in pipelines is free, open-source, and publicly available, and some of the most common programs are described next.

Frameworks to manage workflows are also available, such as Snakemake (Köster and Rahmann 2012) and the Common Workflow Language (CWL) (https://www.commonwl.org/v1.0). They provide a reliable way to standardize the syntax and semantics for invoking programs and to create robust and reproducible workflows.

2.2 Bioinformatics Software

2.2.1 Software for Quality Control

The overall quality of the output sequencing data must be assessed to eliminate low-quality, poorly sequenced, or ambiguous raw data that could negatively affect further analysis. Thus, filtering (or cleaning) strategies capable of clipping and trimming are essential to guarantee the reliability of transcriptomics data and to obtain relevant and trustworthy biological information. The sequenced reads are stored in the FASTQ format, which holds the nucleotide sequence of each read and the corresponding quality scores.
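To make the FASTQ layout concrete, the following minimal sketch reads records and decodes their quality strings. It assumes the common four-line record layout and the Phred+33 encoding, neither of which is stated in this chapter; the function names are hypothetical, and real files may need a more defensive parser.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, qualities) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality_line = handle.readline().rstrip()
        # Phred+33 (assumption): quality = ASCII code of the character minus 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality_line]


def phred_to_error(quality):
    """Convert a Phred score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-quality / 10)


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
```

A Phred score of 20 corresponds to an estimated error probability of 1 in 100, which is why thresholds around this value are often used when trimming.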

Some tools are used to assess and visualize the overall quality of the data, such as FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc), a popular Java-based quality control program. Other tools perform the filtering steps; for example, FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) provides options for both clipping and trimming. Other commonly used tools are Cutadapt (Martin 2011) for clipping, and PRINSEQ (Schmieder and Edwards 2011) and Trimmomatic (Bolger et al. 2014) for trimming. Fastp (Chen et al. 2018) is an ultra-fast, all-in-one quality control and data-filtering tool that can replace a combination of several slower programs. All of these tools offer several options, such as minimum read size, minimum quality score, and polyadenylation removal.
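The two filtering sub-steps can be sketched in a few lines. This is a simplified illustration, not the algorithm of any of the tools above: the function names, the adapter fragment, and the thresholds are assumptions, adapters are matched exactly, and trimming only removes low-quality bases from the 3′ end.

```python
def clip_adapter(sequence, qualities, adapter):
    """Clipping: remove the adapter and everything after it, if present."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, qualities
    return sequence[:index], qualities[:index]


def trim_low_quality(sequence, qualities, min_quality=20):
    """Trimming: drop bases from the 3' end while their quality is too low."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


ADAPTER = "AGATCGG"  # illustrative adapter fragment (assumption)
sequence = "ACGTACGTAGATCGG"
qualities = [38, 38, 37, 36, 30, 25, 12, 8, 30, 30, 30, 30, 30, 30, 30]

clipped = clip_adapter(sequence, qualities, ADAPTER)
trimmed = trim_low_quality(*clipped, min_quality=20)
```

A minimum-length check on the trimmed read would normally follow, so that reads too short to map unambiguously are discarded.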

2.2.2 Software for Mapping

The main objective of the mapping phase is to find the location in a reference genome/transcriptome to which each filtered short read corresponds (Fig. 3.4).

Fig. 3.4

Short reads mapped to a reference genome. Reads are aligned to a reference genome, and the accumulated alignments reveal expressed exons and splice junctions

There are many programs capable of performing the mapping process. In general, these programs are computationally intensive (in both processing and storage), and mapping techniques use indices to accelerate the search and to reduce the memory cost of locating reads in the reference genome.

Bowtie (Langmead et al. 2009) is a fast short-read aligner that tolerates a small number of mismatches. Bowtie first concatenates the whole reference genome into a single string and applies the Burrows-Wheeler transform (BWT) to generate an index of this reference genome. Next, each read is matched against the index character by character until the entire sequence is aligned. If no perfect alignment can be found for a read, the program backtracks one character, substitutes this character, and repeats the process until the alignment is completed. The maximum number of character substitutions is a parameter in Bowtie. The rapid improvement in throughput and the increase in read length of sequencing technologies required the development of Bowtie2 (Langmead and Salzberg 2012), a tool supporting gapped alignment that performs faster and more sensitive mapping for reads longer than 50 bp.
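The character-by-character search over a BWT index can be illustrated with a naive implementation. The sketch below is not Bowtie's code: the function names are hypothetical, the transform is built by sorting all rotations (practical only for tiny strings), and only exact matching is shown, without the backtracking step that handles mismatches.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations of text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern by backward search on the BWT."""
    # first_row[c]: index of the first rotation that starts with character c
    first_row = {}
    for index, char in enumerate(sorted(bwt_string)):
        first_row.setdefault(char, index)

    def occ(char, position):
        # Occurrences of char in bwt_string[:position] (naive, linear time)
        return bwt_string[:position].count(char)

    low, high = 0, len(bwt_string)
    for char in reversed(pattern):  # the pattern is consumed right to left
        if char not in first_row:
            return 0
        low = first_row[char] + occ(char, low)
        high = first_row[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low


index = bwt("GATTACA")
hits = count_occurrences(index, "TA")
```

Real aligners precompute the occurrence counts so that each character of the read is processed in constant time, independent of genome size.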

TopHat (Trapnell et al. 2009) can identify exon splice sites by mapping RNA-seq reads against a reference genome. First, the Bowtie mapping program is employed to map the short unspliced reads to the reference genome. The reads that are not initially mapped are not filtered out but are set apart. After the main alignment, each unmapped read is split into shorter fragments, which are then aligned individually and independently to identify splice junctions between exons. TopHat2 (Kim et al. 2013) is an updated version of TopHat with improved overall accuracy and a better alignment procedure.

The Spliced Transcripts Alignment to a Reference (STAR) (Dobin et al. 2013) is an important alignment algorithm for RNA-seq data. STAR aligns non-contiguous sequences (exons) directly to a reference genome in two main steps. First, in the seed searching phase, a maximal mappable prefix (MMP) search is employed to map the reads correctly against the reference genome even if a read contains a splice junction. Next, the algorithm stitches together the previously aligned seeds to construct alignments of the whole read sequences. Finally, using a defined local alignment scoring system, the seed combination with the highest score is chosen as the best alignment for a read.
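The seed searching phase can be illustrated with a toy version of the maximal mappable prefix search. This is a sketch under stated assumptions: the function names and sequences are made up, the plain string search stands in for the index used by STAR, and the stitching and scoring steps are omitted.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, position) of the longest read prefix found in the genome."""
    best_length, best_position = 0, -1
    for length in range(1, len(read) + 1):
        position = genome.find(read[:length])
        if position == -1:
            break  # a longer prefix cannot match either
        best_length, best_position = length, position
    return best_length, best_position


def seed_search(read, genome):
    """Split a read into consecutive seeds as (read_offset, genome_position, length)."""
    seeds, offset = [], 0
    while offset < len(read):
        length, position = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            offset += 1  # skip a base that cannot be mapped
            continue
        seeds.append((offset, position, length))
        offset += length
    return seeds


# Toy genome: exon 1 (0-7), intron (8-21), exon 2 (22-29)
genome = "ACGTTGCA" + "GTAAGTCCCTTTAG" + "CGATCCTA"
read = "TTGCA" + "CGATCC"  # spans the junction between the two exons
seeds = seed_search(read, genome)
```

The gap between the end of the first seed (genome position 8) and the start of the second (position 22) corresponds to the intron skipped by the spliced read.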

Segemehl (Hoffmann et al. 2009, 2014) maps short reads to reference genomes, detecting mismatches, insertions, and deletions. Moreover, Segemehl can deal with different read lengths and can correctly map reads contaminated with primers or polyadenylation. Its matching method is based on enhanced suffix arrays. It supports the SAM format and gzipped read queries, which saves disk and memory space, and allows both bisulfite sequencing and split-read mapping.

Minimap2 (Li 2018) is a fast aligner that maps long reads against a reference database. Minimap2 handles the long, noisy reads with high error rates generated by both ONT and PacBio sequencing. When aligning spliced sequences, it recovers insertions and deletions and predicts splice junctions to produce a correct alignment.

There are many other computational methods to map short reads to a reference genome, as shown in Table 3.2.

Table 3.2 Mapping software and their websites

2.2.3 Software for Assembling

Mapping approaches for transcriptome reconstruction can be particularly tricky, since correctly assigning reads to a reference genome is usually computationally demanding and prone to errors caused by splice junctions, sequencing inaccuracy, and absent or unfinished reference sequences. In contrast, assembly (or de novo assembly) approaches do not require a reference genome, a desirable feature when genomic sequences are unavailable or do not meet minimum quality requirements.

Assembly algorithms usually aim to group reads with similar extremities, i.e., the overlap of the end of one read with the beginning of another indicates that both probably belong to the same transcript (Fig. 3.5). These similar extremities enable the reconstruction of larger regions of the transcripts. As said before, each of these groups is called a contig. The sequence resulting from the overlapping reads in one contig, called the consensus, is a predicted (fragment of a) transcript.

Fig. 3.5

Reads that contain overlapping extremities are likely to be parts of the same transcript. Multiple reads overlapping each other create a longer fragment called a contig, whose consensus sequence represents a specific locus
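The grouping of reads by overlapping extremities can be sketched as a greedy merge. This is a toy illustration rather than the method of any particular assembler: the function names and the minimum overlap are assumptions, overlaps must match exactly, and sequencing errors, reverse complements, and repeats are ignored.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap_length(left, right, min_overlap)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # no remaining pair overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [
            contig for k, contig in enumerate(contigs) if k not in (best_i, best_j)
        ]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
contigs = greedy_assemble(reads)
```

Here the three reads collapse into a single consensus sequence; reads without a sufficient overlap would remain as separate contigs.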

Short-read sequencing usually has greater accuracy than long-read sequencing; however, short reads often align to multiple regions, which makes it difficult to identify the correct isoforms. Long-read sequencing can therefore improve the discovery and identification of isoforms, but it is less accurate due to base-calling errors. When possible, combining long reads with Illumina short reads is the best strategy for assembling complete and accurate transcriptomes (Kovaka et al. 2019).

The Trinity (Grabherr et al. 2011) software package is a major de novo assembly method composed of three modular components: Inchworm, Chrysalis, and Butterfly. Initially, Inchworm decomposes all reads into k-mers (k = 25), selects the most common k-mer as the seed, and assembles contigs by greedy extension based on (k−1)-mer overlaps. Chrysalis clusters and connects Inchworm contigs into components that could originate from alternative splicing or related genes: if contigs overlap by k−1 bases and reads span the junction between different contigs, the contigs are grouped, and a de Bruijn graph is built for each component. Finally, Butterfly reconciles the de Bruijn graphs produced in the Chrysalis stage with their corresponding RNA-seq reads, allowing the reconstruction of transcript sequences similar to the original transcripts.
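The seed-and-extend idea of the first component can be sketched as follows. This is a simplified illustration and not Trinity's implementation: the function names are hypothetical, a small k is used instead of 25, a single contig is built, and error handling and reverse complements are left out.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_contig(reads, k):
    """Seed with the most frequent k-mer and extend greedily in both directions."""
    counts = count_kmers(reads, k)
    seed, _ = counts.most_common(1)[0]
    used = {seed}
    contig = seed
    for direction in ("right", "left"):
        while True:
            if direction == "right":
                anchor = contig[-(k - 1):]
                options = [anchor + base for base in "ACGT"]
            else:
                anchor = contig[:k - 1]
                options = [base + anchor for base in "ACGT"]
            # Keep only unused k-mers that were observed in the reads
            options = [
                kmer for kmer in options if counts[kmer] > 0 and kmer not in used
            ]
            if not options:
                break
            best = max(options, key=lambda kmer: counts[kmer])
            used.add(best)
            contig = contig + best[-1] if direction == "right" else best[0] + contig
    return contig


contig = greedy_contig(["ATGGCGTA", "CGTACGTT", "CGTTAGC"], k=5)
```

Each extension step adds the single base carried by the most abundant k-mer that overlaps the current end of the contig by k−1 characters.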

Trans-ABySS (Transcriptome Assembly By Short Sequences) (Robertson et al. 2010) is a de novo assembly tool designed to assemble paired-end short reads from transcriptome data. Trans-ABySS is derived from ABySS (Simpson et al. 2009), a short-read genomic data assembler. Trans-ABySS employs a de Bruijn graph approach, assembling the data with standard k-mers (k = 32) to achieve a good balance between assembling frequent and rare transcripts. The single-processor version is useful for assembling genomes of up to 100 Mbases. In contrast, the parallel version (implemented using MPI) can assemble larger genomes, benefiting from parallel processing.

MaSuRCA (Zimin et al. 2013) performs hybrid assembly: it builds “super-reads” from short reads, uses them to construct synthetic long reads with a low error rate, and combines these with long reads from Nanopore/PacBio. This allows long and short reads to be used together, overcoming the high error rates of long-read sequencing (Table 3.3).

Table 3.3 Assembly software and their websites

2.2.4 Software for Analysis

In transcriptome projects, quantitative analysis, differential expression, and transcript annotation are extensively used. Many suitable tools for these analyses are available in the R language, which provides a wide variety of statistical and graphical resources. R is highly extensible and produces well-designed, publication-quality plots; it also includes effective data handling and storage facilities and a collection of intermediate tools for data analysis. Bioconductor (https://www.bioconductor.org) is a repository of (mostly) R packages that provides open-source tools to analyze biological high-throughput data. Similarly, there are many Python-based resources, such as Biopython (https://biopython.org), a set of freely available tools for biological computation written in Python.

Quantitative Analysis

The transcript coverage is the number of reads “covering” (or the number of mapped reads in) a transcript. The greater the number, the more abundant is the expressed gene in an RNA-seq sample (Fig. 3.6). The RNASeqMap library (Leśniewska and Okoniewski 2011), for instance, provides classes and functions to analyze the RNA-sequencing data using the coverage profiles in multiple samples at a time.

Fig. 3.6

Read coverage of transcripts relative to a reference genome. Each red bar plotted indicates a locus alignment coverage. The arcs represent splicing junctions between exons. Finally, the arc numbers are the observed numbers of reads across the junction. (Source: https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html)
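A coverage profile such as the one plotted in the figure can be computed directly from alignment coordinates. The following minimal sketch uses hypothetical names and assumes alignments are given as 0-based, half-open (start, end) intervals on a single reference; spliced alignments are not handled.

```python
def coverage_profile(alignments, reference_length):
    """Per-base read depth from (start, end) alignments, 0-based and half-open."""
    depth = [0] * reference_length
    for start, end in alignments:
        # Clamp the interval to the reference before counting
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    return depth


alignments = [(0, 4), (2, 6), (3, 8)]
depth = coverage_profile(alignments, reference_length=10)
mean_depth = sum(depth) / len(depth)
```

Regions with a depth of zero, such as the last two positions in this example, have no evidence of expression in the sample.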

Differential Expression

Differential expression refers to the study of the variability of gene expression between samples. One important objective of RNA-seq projects is to identify the genes that are differentially expressed between two or more conditions (Rapaport et al. 2013). These genes are usually selected based on p-values generated by statistical modeling. The expression level of a transcript is measured from the number of reads mapping to it, often normalized to a unit such as transcripts per million (TPM), and is expected to correlate directly with its abundance. This measure differs from that of probe-based methods, e.g., microarrays. In RNA-seq, the measured expression of a transcript is limited by the sequencing depth and depends on the expression levels of other transcripts, whereas in array-based methods probe intensities are independent of each other. These and other technical differences have motivated many statistical algorithms, with different approaches to normalization and differential expression detection. For example, modeling gene count data with Poisson or negative binomial distributions, combined with various normalization procedures, is a common approach.
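The TPM normalization mentioned above can be written in a few lines. This sketch uses the usual definition (read counts divided by transcript length, then rescaled so that the values sum to one million); the function name, gene names, and numbers are invented for illustration.

```python
def transcripts_per_million(counts, lengths_kb):
    """TPM: length-normalized read rates rescaled to sum to one million."""
    rates = {name: counts[name] / lengths_kb[name] for name in counts}
    total = sum(rates.values())
    return {name: rate / total * 1_000_000 for name, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}
tpm = transcripts_per_million(counts, lengths_kb)
```

Because every value is divided by the same total, a change in the reads assigned to one transcript shifts the TPM of all the others, which illustrates the dependence between transcripts described above.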

Cufflinks (Trapnell et al. 2010) may be used to measure global de novo transcript isoform expression. It assembles transcripts, estimates their abundances, and determines differential expression (Trapnell et al. 2013) in RNA-seq samples. Moreover, Cufflinks accepts reads aligned by other mappers and assembles the alignments into a parsimonious set of transcripts. It then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.

Some articles discuss and compare statistical methods to compute differential expression. In a review, Kvam et al. (2012) compared four statistical methods – edgeR, DESeq, baySeq, and a method with a two-stage Poisson model (TSPM). Rapaport et al. (2013) describe an extensive evaluation of common methods – Cuffdiff (Trapnell et al. 2013), edgeR (Robinson et al. 2010), DESeq (Anders and Huber 2010), PoissonSeq (Li et al. 2012), baySeq (Hardcastle and Kelly 2010), and limma (Smyth 2004) adapted for RNA-seq use, using the Sequencing Quality Control (SEQC) benchmark dataset and ENCODE data.

Splice Junctions

Splice junctions are nucleotide sequences at the exon–intron boundaries in the pre-messenger RNA of eukaryotes, where introns are removed during RNA splicing. This process can generate many processed transcripts from a single gene. Computationally, the problem is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (called EI sites) and recognizing intron/exon boundaries (IE sites). IE borders are called “acceptor sites,” while EI borders are called “donor sites.” The recognition and quantification of splice variants are among the advantages of RNA-seq over microarrays for measuring differential gene expression. The splice junctions help to delineate and quantify the transcript model, as observed in Fig. 3.6.

TopHat (Trapnell et al. 2009) identifies splice junctions and writes them to the junctions.bed file, in which the score field indicates coverage depth. The identified splice junctions can be displayed in browsers (e.g., the UCSC genome browser (Kuhn et al. 2013)) using .bed files that encode them; junction files should be in the standard .bed format. Pasta (Patterned Alignments for Splicing and Transcriptome Analysis) (Tang and Riva 2013) is a splice junction detection algorithm designed for RNA-seq data. Its alignment strategy combines heuristic and statistical methods to identify exon–intron junctions with high accuracy.

Annotation

The annotation step aims to assign a biological function for each transcript, identifying genes and finding more information, e.g., biological categories and ontologies. The annotation process is characteristic of novel transcriptomes since reference genomes and transcriptomes are typically associated with curated gene annotation.

The annotation methods can be organized into two classes:

  • Pairwise comparison of every transcript against a file with known transcripts and their corresponding annotation. This can be done by comparing the nucleotides or the translated nucleotides.

  • Ab initio gene prediction, where the presence of structural features and motifs of known genes is used to infer function.

Pairwise sequence comparison (or pairwise alignment), in which a query sequence (a transcript of the organism of interest) is compared with datasets of annotated sequences, relies on an algorithm that computes an alignment between two sequences. The underlying hypothesis comes from Darwin’s theory of evolution, which states that living organisms evolved from ancestral organisms. Therefore, if two transcripts have similar sequences, they may be homologs and probably share the same biological function. This means that biological function may be inferred from sequence similarity. Important pairwise algorithms, which produce alignments between pairs of sequences, are Smith-Waterman (Smith and Waterman 1981) and BLAST (Altschul et al. 1990).
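To make the idea of a local alignment concrete, the following Python sketch computes a Smith-Waterman score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for illustration, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# The shared core "ACGT" is found despite different flanking bases.
score = smith_waterman_score("TTACGTTT", "GGACGTGG")
```

A high score relative to the sequence lengths is the kind of evidence used to propose homology; BLAST approximates this search heuristically to make it fast against large databases.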

Similar to the assembly step, the main difficulty in the annotation is due to the transcript length. The resulting genes may be fragmented, causing loss of information. Since alignment programs are error-tolerant, it is reasonable to expect that the annotation for transcripts (predicted from reads generated by high-throughput sequencers) is correct if functions of genes of other organisms have been found correctly.

In contrast, ab initio gene finding is less robust to errors, since sequencing errors can lead to incorrect gene prediction. In particular, a sequencing error that introduces a stop codon can result in an incorrectly predicted gene.
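The effect of such an error can be shown with a small Python sketch. The sequence is a made-up open reading frame, and the function is a deliberately naive ORF scanner (first ATG, then codons until the first stop), not a gene predictor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_codons(seq):
    """Number of codons from the first ATG up to (not including) the first stop."""
    start = seq.find("ATG")
    if start == -1:
        return 0
    n = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return n
        n += 1
    return n


original = "ATGGCATACGGATTGTAA"                # ATG GCA TAC GGA TTG TAA
mutated = original[:8] + "A" + original[9:]    # one C>A error: TAC becomes TAA

full_length = orf_codons(original)
truncated = orf_codons(mutated)
```

A single miscalled base shortens the predicted coding region from five codons to two, so the predicted gene no longer matches the real one.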

3 Single-Cell Transcriptome Sequencing (scRNA-seq)

Although cells in an organism share almost identical genotypes, gene expression is heterogeneous and reflects the activity of a subset of genes. scRNA-seq technologies are capable of generating datasets that describe the transcriptome of single cells. Single-cell transcriptome sequencing (scRNA-seq) expands the biological panorama granted by RNA-seq. It allows estimating the expression levels of the whole transcriptome, or of targeted genes, from a single cell and addresses new biological questions such as the heterogeneity of cell responses and their gene regulatory networks. It emerged with an mRNA-seq assay in which a single mouse blastomere was sequenced, detecting the expression of 75% more genes than microarray techniques (Tang et al. 2009). This pioneering scRNA-seq method profiled transcriptomes from single cells using oligo-dT primers followed by adapter-ligation PCR (Tang et al. 2009). The limitation of this method is the inefficiency of the reverse transcriptase in first-strand cDNA synthesis, which causes a 3′ bias.

As technologies advanced, new protocols and lower sequencing costs made scRNA-seq more accessible, resulting in continuously growing datasets, ranging from ~10^2 to ~10^6 cells. Some of the most prominent methods for scRNA-seq are Smart-seq (Ramsköld et al. 2012), Smart-seq2 (Picelli et al. 2014), Drop-seq (Macosko et al. 2015), inDrop (Klein et al. 2015), CEL-seq2 (Hashimshony et al. 2016), 10× Chromium (Zheng et al. 2017), and Smart-seq3 (Hagemann-Jensen et al. 2020).

In general, scRNA-seq methods tag transcripts so that their cell of origin can be identified, and then generate libraries for sequencing. scRNA-seq data can come from both next-generation sequencing (NGS) and single-molecule sequencing (SMS) (Gao 2018). Smart-seq, Smart-seq2, Smart-seq3, and CEL-seq2 can be considered low-throughput plate-based methods, in which the cells are sorted into the wells of a multi-well plate. Alternatively, bead-based high-throughput methods distribute the cell suspension into tiny droplets containing reagents and barcoded beads (Drop-seq, 10× Chromium, and inDrops) or into microwell plates (Seq-Well and sci-RNA-seq), so that each droplet or well holds one cell and one bead that marks the cDNA generated from that cell (Ding et al. 2019).

Smart-seq (Ramsköld et al. 2012) addressed the 3′ bias of the pioneering protocol by using a Moloney Murine Leukemia Virus reverse transcriptase (M-MLV RT) to synthesize cDNA from long messenger RNA templates. Unique molecular identifiers (UMIs) were incorporated into each RNA molecule as unique barcodes before the whole transcriptome amplification (WTA) (Islam et al. 2014). Smart-seq2 (Picelli et al. 2014) combines sensitivity (it captures a considerable fraction of the RNAs present in cells) with full-length transcript coverage and can detect genes per cell and across cells, enabling isoform-level expression to be quantified from single cells, but it does not incorporate UMIs. Smart-seq3 (Hagemann-Jensen et al. 2020) improves the sensitivity of Smart-seq2 by adding optimized reverse transcriptase and buffer conditions, together with a partial Tn5 motif and a tag sequence in the template-switching oligonucleotide, to directly assign individual RNA molecules to isoforms and establish their allelic origin in single cells.

Drop-seq dissociates a tissue into individual cells and encapsulates them into droplets with microparticles that deliver barcoded primers. After barcodes are associated with each cell’s RNAs, the RNAs are reverse-transcribed into cDNAs to generate beads called “Single-cell Transcriptomes Attached to Microparticles” (STAMPs). Then, the STAMPs are amplified in pools for high-throughput mRNA-seq (Macosko et al. 2015) (Fig. 3.7). The 10× Chromium system works by generating a large number of “Gel Bead-in-emulsion” partitions (GEMs) to index each cell’s transcriptome separately. The barcoded gel beads (read, 10× barcode, UMI, oligo-dT) are mixed with cells, enzymes, and partitioning oil to create single-cell GEMs. Then, the single-cell GEMs undergo reverse transcription (RT) to generate a 10× barcoded cDNA library in which the cDNA from an individual cell shares a common 10× barcode; the library can be used for single-cell whole transcriptome sequencing or targeted sequencing workflows (10× Genomics Inc. 2020). In the inDrops method, the cells are also encapsulated into droplets with lysis buffer, hydrogel microspheres carrying barcoded primers, and an RT mix. After the release of the primers, the cDNA in each droplet is barcoded during reverse transcription. After the droplets are broken, all cellular material can be amplified for sequencing (Klein et al. 2015).

Fig. 3.7 The transcriptome of individual cells can be analyzed using scRNA-seq. Single cells from disrupted tissue are mixed with barcoded bead primers and reagents in oil droplets in a microfluidic device. Each droplet formed contains a single cell and a barcode. After lysis and primer hybridization, RNA is reverse transcribed and sequenced as in a conventional RNA-seq experiment. The UMI and barcode sequences are incorporated into the final sequenced reads and guide the scRNA-seq processing

The Smart-seq methods can detect many genes in a cell, including low-abundance transcripts and alternatively spliced transcripts. CEL-seq2 (Hashimshony et al. 2016), Drop-seq, 10× Chromium, and inDrops can quantify mRNA levels with less amplification noise by counting UMIs. As a limitation, inDrops droplets may contain two cells or two different types of barcodes. Table 3.4 shows a comparison of some important aspects of these scRNA-seq methods.
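How UMIs reduce amplification noise can be sketched in a few lines of Python: reads that share the same cell barcode, gene, and UMI are treated as copies of one original molecule and counted once. The barcodes, UMIs, and read assignments below are invented, and real pipelines additionally correct sequencing errors in barcodes and UMIs, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical reads already assigned to a gene: (cell barcode, UMI, gene)
reads = [
    ("AAAC", "TTGA", "GNLY"),
    ("AAAC", "TTGA", "GNLY"),  # same cell, gene and UMI: an amplification duplicate
    ("AAAC", "CCGT", "GNLY"),
    ("AAAC", "GGAT", "IL7R"),
    ("TTTG", "TTGA", "GNLY"),  # same UMI but another cell: a different molecule
]


def umi_counts(reads):
    """Count distinct UMIs per (cell, gene) pair."""
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


counts = umi_counts(reads)
```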

Table 3.4 Comparison of some aspects of low- and high-throughput scRNA-seq methods

3.1 scRNA-seq Computational Analysis

Despite the different methods available, scRNA-seq data is essentially the result of high-throughput sequencing of cDNA reverse transcribed from mRNA isolated from a pool of cells. The primary difference is that the sequenced data is tagged so that its origin can be assigned to individual cells. Some standard steps remain the same as in RNA-seq, such as read quality filtering and read mapping to a reference genome. Read quality filtering can rely on a sequencing quality metric, such as the percentage of base calls above a given quality (Q) score. The reads are then mapped to a reference genome and quantified to generate an expression profile matrix. Some specialized scRNA-seq tools can both align and quantify the reads. Additionally, a second filtering step can be performed after quantification to discard cells expressing a low number of genes or a high number of mitochondrial genes (Park and Lee 2020). The next step of the pipeline is data normalization using an expression normalization metric such as TPM (Transcripts Per Million) or RPKM (Reads Per Kilobase per Million mapped reads) (Gao 2018). At this point, the scRNA-seq computational analysis reaches its two fundamental problems: cluster analysis and sample/feature reduction.
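The two normalization metrics differ only in the order of the operations, as the Python sketch below shows. The counts and gene lengths are made-up values for a single sample.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Length-normalize first, then rescale so the values sum to one million."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


counts = [100, 200, 700]          # hypothetical read counts for three genes
lengths_bp = [1000, 2000, 7000]   # hypothetical gene lengths in base pairs

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

Because TPM values always sum to one million within a sample, they are easier to compare across samples than RPKM values, whose sum varies from sample to sample.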

Normalization allows consistent comparison of gene expression measurements across individual cells by accounting for technical variation due to the number of sequenced reads or transcripts identified per cell. A normalized gene expression matrix is a matrix of n samples (cells) by m features (genes, transcripts, or exons), where the choice of feature depends on the read length. For example, for transcripts as features, PacBio full-length transcriptome sequencing could be the right choice, whereas for short Illumina reads the features could be genes. Because the number of features can be as large as the number of annotated genes of the target organism, the matrix can be large and sparse, which justifies sample and feature reduction. Feature selection can be understood as removing genes that do not help to distinguish biological variation across samples.
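A very simple form of feature selection can be sketched in Python by measuring the sparsity of the matrix and keeping the genes with the highest variance across cells. The matrix and gene names are hypothetical, and ranking by raw variance is a simplification chosen for illustration; dedicated tools model the mean–variance relationship instead.

```python
from statistics import pvariance

# Hypothetical normalized matrix: rows are cells, columns are genes.
genes = ["geneA", "geneB", "geneC"]
matrix = [
    [0.0, 2.0, 1.0],
    [0.0, 0.0, 1.1],
    [0.0, 4.0, 0.9],
]

# Fraction of zero entries in the matrix.
n_values = len(matrix) * len(matrix[0])
sparsity = sum(v == 0 for row in matrix for v in row) / n_values


def top_variable_genes(matrix, genes, k):
    """Return the k genes with the largest variance across cells."""
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(zip(genes, variances), key=lambda gv: gv[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


selected = top_variable_genes(matrix, genes, k=2)
```

geneA is never detected and therefore carries no information to separate the cells, so it is the first feature to be dropped.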

Clustering cells allows us to identify cells with correlated phenotypes by grouping them based on the similarity of their gene expression profiles. This is achieved using dimension reduction algorithms to embed the expression matrix into a low-dimensional space that summarizes the data structure in as few dimensions as possible (Gao 2018; Luecken and Theis 2019). These low-dimensional spaces can come from dimension reduction methods such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), Multidimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).
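The idea of embedding cells in fewer dimensions can be illustrated with a bare-bones PCA in Python. This sketch finds only the first principal component, using power iteration on the covariance matrix; that shortcut is chosen here to avoid external libraries and is not how production tools compute PCA. The expression values are invented.

```python
import math


def first_principal_component(data, iterations=200):
    """Return (direction, scores) of the first principal component.

    data: list of cells, each a list of gene expression values.
    """
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0] * m  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return v, scores


# Hypothetical cells: two low-expressing and two high-expressing
cells = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [9.0, 9.0, 1.0], [10.0, 10.0, 1.0]]
direction, scores = first_principal_component(cells)
```

Each cell is now described by a single score instead of three values, and the sign of the score already separates the two groups of cells.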

3.2 scRNA-seq Analysis Tools

Seurat (Hao et al. 2020) is an R package that integrates quality control, analysis, and exploration of single-cell RNA-seq data. It is based on a Seurat object, which serves as a container for both data (like the count matrix) and analysis results (like PCA or clustering results). Seurat can also analyze simultaneous measurements of multiple data types from the same cell, known as multimodal analysis, as well as spatially resolved RNA-seq data.

Cell Ranger is a set of tools to process Chromium single-cell RNA-seq data. The package contains cellranger mkfastq, which demultiplexes raw Illumina base call (BCL) files into FASTQ files. These files are then taken as input by cellranger count, which performs alignment, filtering, barcode counting, and UMI counting. In the next step, cellranger aggr aggregates and normalizes the outputs from multiple runs of cellranger count, recomputing the feature-barcode matrices and analyzing the combined data. Finally, cellranger reanalyze reruns the dimensionality reduction, clustering, and gene expression algorithms on the feature-barcode matrices produced by cellranger count or cellranger aggr. Cell Ranger uses the STAR aligner (Dobin et al. 2013), and the output is delivered in formats like bam, mex, csv, hdf5, and html.

MetaCell (Baran et al. 2019) is a tool for deriving metacells and analyzing scRNA-seq data. A metacell is a group of scRNA-seq cell profiles that are statistically equivalent to samples derived from the same RNA pool; metacells are obtained by partitioning scRNA-seq datasets into disjoint and homogeneous groups of cells.

SEQC (Azizi et al. 2018) is a Python package for scRNA-seq analysis in the cloud, with subsequent analyses on a local machine. Its dependencies are STAR (Spliced Transcripts Alignment to a Reference) (Dobin et al. 2013), Samtools (Li et al. 2009), and the HDF5 data model, and it has been tested on 10× Genomics v2 and inDrop v2 data.

zUMIs (Parekh et al. 2018) is a pipeline to process RNA-seq data with or without UMIs. zUMIs takes as input fastq files with the cDNA reads and fastq files with the reads containing the UMI and cell barcode information. It was written in R, Perl, shell, and Python and depends on STAR (Dobin et al. 2013).

robustSingleCell is an R package that provides clustering and comparison of population compositions across tissues and experimental models through a similarity analysis characterizing transcriptomic similarities in meta-clusters by identifying their defining overexpressed genes (Magen et al. 2019) (Table 3.5).

Table 3.5 Computational tools for scRNA-seq analysis

4 Case Study 1

  • RNA-seq as an Efficient Tool to Analyze and Identify Gene Expression Patterns Related to Murine Bone Marrow-Derived Macrophage’s Susceptibility and Resistance to Candida albicans Infection

The improvements in organ transplantation techniques and the rise of immunocompromising diseases, such as AIDS, are directly linked to the exponential growth of opportunistic infections in these patients. Therefore, the study of the etiological agents of these infections, particularly fungal pathogens, together with the immune response they elicit, has become an important issue (Marr et al. 2002; Richardson and Lass-Flörl 2008; Miceli et al. 2011). Candida albicans appears to be the leading cause of invasive infections among fungi, showing high morbidity and mortality rates (Chi et al. 2011; Shigemura et al. 2014).

Many studies have sought to understand aspects of the immune response to C. albicans (Tierney et al. 2012; Miramón et al. 2013; Hünniger et al. 2014; Martínez-Álvarez et al. 2014). In this case study, the transcriptomic response of murine bone marrow-derived macrophages (BMDMs) from BALB/c (resistant) and DBA/2J (susceptible) mouse strains to C. albicans infection was analyzed by RNA-seq to compare both transcriptomic patterns. The main objective was to identify the differences in BMDM gene expression patterns between resistant and susceptible mice after C. albicans infection by analyzing the resulting transcriptome profiles.

Bone marrow was extracted from the mice, and the hematopoietic stem cells were then differentiated into macrophages. An amount of 2 × 10^6 BMDMs was co-cultured with 4 × 10^6 C. albicans yeasts for 90 min, and the RNA was extracted using RNeasy (Qiagen). RNA quality and concentration were verified employing a Bioanalyzer (Agilent) and NanoDrop (Thermo Scientific), respectively. Three μg of total RNA was used for the library preparation, including a step of rRNA depletion using Ribo-Zero (Epicentre) before library construction and sequencing on an Illumina HiSeq platform.

The sequencing results were provided in fastq format. FastQC was used to assess quality. Adapter clipping and quality trimming were performed using Cutadapt (Martin 2011). Two mapping programs, NextGenMap (NGM) (Sedlazeck et al. 2013) and TopHat2 (Kim et al. 2013), were employed. Since both generated a similar number of mapped reads, we chose NextGenMap because it runs faster. Low-quality mappings were removed using Samtools (Li et al. 2009), which was also used to sort, index, and convert the mapping results from .sam to .bam files. Bedtools (Quinlan and Hall 2010) was then used to count reads for both genes and exons and to generate a table of these counts to be analyzed for differential expression. As said before, differential expression can be analyzed using different methodologies (Wagner et al. 2012; Soneson and Delorenzi 2013); edgeR (Robinson et al. 2010) and DESeq (Anders and Huber 2010) were chosen, and both produced very similar results. Alternative splicing can be checked by differential exon usage (Anders et al. 2012). Finally, the resulting list of differentially expressed genes or transcripts (adjusted p-value <0.05 and fold change of at least 2.5 in either direction) was checked for gene ontology (GO) terms using the clusterProfiler (Yu et al. 2012) Bioconductor package.
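The final filtering step can be sketched in Python as follows. The p-value adjustment shown is the Benjamini-Hochberg procedure, which is assumed here because the text does not name the method; the gene names, p-values, and fold changes are invented, and the fold change is expressed as an infected/control ratio.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical results: gene -> (raw p-value, fold change ratio)
results = {"gene1": (0.01, 3.0), "gene2": (0.04, 1.2),
           "gene3": (0.03, 0.3), "gene4": (0.005, 2.0)}

names = list(results)
adjusted = benjamini_hochberg([results[g][0] for g in names])

# Keep genes with adjusted p < 0.05 and at least a 2.5-fold change up or down.
min_log2_fc = math.log2(2.5)
selected = [g for g, q in zip(names, adjusted)
            if q < 0.05 and abs(math.log2(results[g][1])) >= min_log2_fc]
```

Working on the log2 scale treats a 2.5-fold increase and a 2.5-fold decrease symmetrically.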

Several problems may occur in RNA-seq projects, and here we point out some of these:

  • Infection conditions: the protocols for co-culture conditions, as well as for RNA extraction, may be hard to optimize. Setting a multiplicity of infection (MOI – the ratio of pathogen to host cells in the co-culture) that suffices to induce a transcriptomic response in the host cells is the first step. However, a very high MOI may cause host cell death and apoptosis, which may result in altered gene expression or low amounts of RNA extracted from these cells.

  • Infection time: defining the correct time intervals of interaction between pathogen and host cells is essential, since different genes have different transcription kinetics during co-culture. These kinetics may vary drastically between host–pathogen pairs and also depend on the main question of interest.

  • Biological replicates: in transcriptomic studies, robust statistical analysis is fundamental. In this sense, the experimental design has to incorporate proper biological replicates to allow valid statistical inferences (Robles et al. 2012).

  • Library preparation and sequencing parameters: the choice of the preparation methodologies, e.g., poly-A enrichment protocols versus rRNA depletion protocols, or paired-end versus single-end sequencing, may strongly impact the results. Improper handling of samples in this step may also result in sample degradation or inefficient rRNA depletion, which may compromise the whole experiment if not properly adjusted. A well-defined experimental design for the sequencing step must also be taken into consideration. A final low coverage of the transcriptome can result in an inadequate analysis of differential gene expression.

A significant disparity was observed in the differentially expressed genes upon C. albicans infection between BMDMs from both mice strains. BMDMs from the susceptible DBA/2J strain modulated a higher number of genes (4021) upon infection with C. albicans than BMDMs from the resistant BALB/c strain (99), and both sets have few genes in common (60) (Fig. 3.8).

Fig. 3.8 Venn diagram of positively (red) and negatively (blue) regulated genes in BMDMs from BALB/c and DBA/2J mouse strains infected with C. albicans. Genes were considered differentially expressed when the adjusted p-value was <0.05 and the fold change was at least 2.5 in either direction

Analysis focusing on GO categories of biological processes revealed enrichment (p <0.01) of upregulated genes in terms related to inflammatory response, cellular response to biotic stimulus, and cytokine production in both resistant and susceptible strains (Fig. 3.9). However, they markedly differed in the modulation of some terms. For example, macrophages from the resistant strain upregulated genes related to apoptosis and neutrophil chemotaxis. In contrast, macrophages from the susceptible strain upregulated genes involved in innate immune response and leukocyte migration.

Fig. 3.9 Gene ontology enrichment of upregulated genes in BMDMs from DBA/2J and BALB/c mouse strains upon C. albicans infection. Enriched GO terms (adjusted p-value <0.01) from the biological process category associated with upregulated genes in BMDMs derived from the susceptible DBA/2J (left) and the resistant BALB/c (right) mouse strains. Dot size represents the enrichment (gene modulated ratio/gene background ratio) for each GO term. Only major terms related to immune response were plotted
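The enrichment value described in the caption, together with a p-value for one GO term, can be sketched in Python. The p-value uses a one-sided hypergeometric test, which is an assumption about the test since the text does not state it, and all the counts are invented.

```python
from math import comb


def enrichment(k, n, K, N):
    """Enrichment ratio and hypergeometric p-value for one GO term.

    k: modulated genes annotated with the term
    n: all modulated genes
    K: background genes annotated with the term
    N: all background genes
    """
    ratio = (k / n) / (K / N)  # gene modulated ratio / gene background ratio
    # Probability of observing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return ratio, tail / comb(N, n)


# Hypothetical term: 5 of 10 modulated genes versus 50 of 1000 background genes
ratio, p_value = enrichment(k=5, n=10, K=50, N=1000)
```

A ratio of about 10 means the term is ten times more frequent among the modulated genes than in the background; p-values obtained this way still need multiple-testing adjustment across all the terms tested.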

5 Case Study 2

  • Single-Cell Sequencing of SARS-CoV-2 Infected Individuals with Distinct Levels of Severity

The COVID-19 outbreak has had critical consequences for all countries, including many deaths and hospitalizations, in addition to its economic impact. Besides vaccination, it is important to research specific drugs to treat affected individuals. Monoclonal antibodies have demonstrated their effectiveness in medicine (Maranhão et al. 2020). Therefore, developing new potential antibodies against viral proteins remains highly valuable.

This example of scRNA-seq analysis is based on the work “Single cell RNA and immune repertoire profiling of COVID-19 patients reveals novel neutralizing antibody” by Fang Li et al. (2020). They conducted a study using single-cell transcriptome sequencing (scRNA-seq), single-cell BCR sequencing (scBCR-seq), and deep BCR repertoire sequencing to reveal neutralizing antibody sequences in patients who had recently cleared the virus. They collected blood samples (peripheral blood mononuclear cells – PBMCs) from 16 COVID-19 patients and eight healthy controls to reveal the changes in immune cells caused by SARS-CoV-2 infection. The scRNA-seq in Fang Li et al. (2020) was performed using the 10× Genomics platform. The original data is available on Zenodo at https://zenodo.org/record/3744141.

This case study uses a subset of the Fang Li et al. (2020) samples, with data from two patients, to demonstrate how to identify distinct cell types by clustering their transcripts and how to obtain the differentially expressed genes. The input files are barcodes.tsv, datasets.rds, genes.tsv, and matrix.mtx. For this case study, we filtered the complete data to work only with the patient 3 (P3) and patient 10 (P10) samples, both from 59-year-old females with distinct levels of COVID-19 severity. P3 had severe symptoms, and P10 had moderate symptoms.

This example uses the R package Seurat 4.0 (Hao et al. 2020) to perform the analysis directly from the matrix. The R code below is commented, and its results are presented. The first step is to install and load the required R packages. Seurat 4.0 requires R version 4.x.

figure a

The next step is to download, extract, and read the COVID-19 data. This results in a matrix with 33,538 rows and 96,404 columns. Each column corresponds to a cell barcode (the tag shared by the transcripts of one cell), and each row corresponds to a gene to which those transcripts were mapped.

figure b

Once the full data is loaded, it can be filtered to keep only the P3 and P10 samples by using a regular expression that matches only the data from these two patients. The new dimensions of the P3 and P10 data will be 33,538 rows (genes) by 16,056 columns (cell barcodes).

figure c
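Independently of the R listing, the column selection by regular expression can be sketched in Python. The column naming scheme used here (patient label, underscore, barcode) is hypothetical and chosen only for illustration; the real names in the dataset may differ.

```python
import re

# Hypothetical column names: "<sample>_<cell barcode>"
columns = ["P1_AAACCTG-1", "P3_AAACGGG-1", "P10_AAAGATG-1",
           "P30_AACTCAG-1", "HC2_AACCGCG-1"]
matrix = [
    [0, 1, 2, 0, 3],  # counts for gene 1 across the five columns
    [5, 0, 1, 1, 0],  # counts for gene 2
]

# Anchoring the pattern and requiring the separator avoids matching "P30".
pattern = re.compile(r"^(P3|P10)_")
keep = [i for i, name in enumerate(columns) if pattern.match(name)]

subset_columns = [columns[i] for i in keep]
subset_matrix = [[row[i] for i in keep] for row in matrix]
```

The number of rows is unchanged and only the columns are reduced, which matches the change in dimensions described above.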

The function CreateSeuratObject() initializes the Seurat object with the non-normalized data, constrained by the following parameters: a minimum of two cells with at least 20 expressed genes and at least 2,000 features. In this case, the object will contain 17,169 genes and 2,123 cell barcodes that met the criteria.

figure d
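The logic of this kind of two-sided filter can be sketched in Python, independently of the R listing. The sketch keeps cells with at least min_features detected genes and then keeps genes detected in at least min_cells of the remaining cells; the order of the two steps, the toy matrix, and the thresholds are assumptions made for illustration.

```python
def filter_counts(matrix, min_cells, min_features):
    """matrix: rows are genes, columns are cells. Returns kept row and column indices."""
    n_cells = len(matrix[0])
    # 1) keep cells in which enough genes are detected
    detected = [sum(1 for row in matrix if row[c] > 0) for c in range(n_cells)]
    kept_cells = [c for c in range(n_cells) if detected[c] >= min_features]
    # 2) keep genes detected in enough of the remaining cells
    kept_genes = [g for g, row in enumerate(matrix)
                  if sum(1 for c in kept_cells if row[c] > 0) >= min_cells]
    return kept_genes, kept_cells


# Hypothetical 4 genes x 4 cells count matrix
counts = [
    [1, 0, 3, 0],
    [2, 1, 0, 0],
    [0, 0, 5, 0],
    [4, 2, 1, 1],
]
kept_genes, kept_cells = filter_counts(counts, min_cells=2, min_features=2)
```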

Before starting the data processing, we create two new metadata columns: one for the patient (P3 or P10) and one for the percentage of mitochondrial transcripts. The [[]] operator can add columns to the object metadata. In this case, we create a column to identify patients P3 and P10. We also store quality control (QC) statistics for mitochondrial genes, which are identified by names starting with “MT-”.

figure e
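The mitochondrial percentage itself is simple arithmetic, sketched here in Python for a single cell and independently of the R listing. The counts are invented.

```python
def percent_mito(gene_names, cell_counts, prefix="MT-"):
    """Percentage of a cell's counts that come from genes whose names start with prefix."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    mito = sum(c for g, c in zip(gene_names, cell_counts) if g.startswith(prefix))
    return 100.0 * mito / total


gene_names = ["MT-CO1", "MT-ND1", "GNLY", "IL7R"]
cell_counts = [3, 2, 50, 45]  # hypothetical counts for one cell

pct = percent_mito(gene_names, cell_counts)
```

A high percentage is commonly taken as a sign of a stressed or damaged cell, which is why this value is stored for later filtering.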

Next, it is possible to build a violin plot to visualize the QC metrics for number of features, read count and mitochondrial percentage, grouped by patient (Fig. 3.10).

Fig. 3.10 Quality control (QC) metrics for the number of features, read count, and mitochondrial percentage, grouped by patient. Left: number of detected genes for patients 3 (red) and 10 (blue) after filtering for >2000 features. Middle: read counts for P3 and P10. Right: reads of mitochondrial origin, shown as a percentage

figure f

The next step is to remove unwanted cells from the dataset. In this case, we can apply a new filter to keep only cells with at least 2000 features and less than 5% mitochondrial reads.

figure g

To normalize the data, we can use function LogNormalize(), which normalizes the feature expression measurements for each cell by the total expression. It multiplies this by a scale factor (10,000 by default), and log-transforms the result.

figure h
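The arithmetic described for this normalization can be written out in Python for one cell, independently of the R listing. The counts are invented, and the natural logarithm of one plus the scaled value is used so that zero counts stay at zero.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide by the cell total, multiply by the scale factor, then apply log(1 + x)."""
    total = sum(cell_counts)  # assumes the cell has at least one count
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


normalized = log_normalize([1, 0, 3])  # hypothetical counts for three genes
```

Dividing by the cell total removes differences in sequencing depth between cells, and the logarithm compresses the range of highly expressed genes.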

Once the data is normalized, the next step is to identify highly variable features (feature selection) using the vst method, which, according to the Seurat manual, fits a line to the relationship between log(variance) and log(mean) using local polynomial regression (loess). Then, it standardizes the feature values using the observed mean and the expected variance (given by the fitted line). Finally, the feature variance is computed on the standardized values after clipping to a maximum (the default is “auto”, which sets this value to the square root of the number of cells).

figure i

At this point, it is possible to list, for instance, the 20 most highly variable genes identified (Fig. 3.11): ‘IGHA1’, ‘JCHAIN’, ‘IGHG1’, ‘IGKC’, ‘IGLC2’, ‘IGHG2’, ‘DERL3’, ‘IGLL5’, ‘IGHV3-23’, ‘ITM2C’, ‘IGKV3-20’, ‘MZB1’, ‘LILRA4’, ‘IGHV3-7’, ‘FKBP11’, ‘GNLY’, ‘IGKV4-1’, ‘TNFRSF17’, ‘STMN1’, and ‘HIST1H4C’. Interestingly, most of these genes are involved in the immune system, more precisely in B lymphocytes, a known player in the inflammatory aspect of COVID-19. IGHA1, the heavy constant chain of immunoglobulin alpha, codes for an antibody isotype well characterized as participating in mucosal immunity, the mucosa being the natural site of SARS-CoV-2 infection.

Fig. 3.11 Twenty most highly variable genes identified versus their average expression. The 2000 most variable genes among cells are shown in red, and 20 of them are labeled for exploration purposes

figure j

Before performing the dimensional reduction, it is necessary to apply a linear transformation that scales the data. This is a standard pre-processing step prior to applying techniques like PCA.

figure k
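Such a scaling is typically a per-gene centering and rescaling, sketched below in Python independently of the R listing. The use of the sample standard deviation and the handling of constant genes are choices made for this sketch.

```python
from statistics import mean, stdev


def scale_gene(values):
    """Center a gene's values across cells to mean 0 and scale to unit variance."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return [0.0] * len(values)  # a constant gene carries no variation
    return [(v - mu) / sd for v in values]


scaled = scale_gene([1.0, 2.0, 3.0])  # hypothetical normalized expression in three cells
```

After this step every gene has the same weight in PCA, so highly expressed genes do not dominate the principal components.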

Now, it is possible to determine the dimensionality of the dataset. The function JackStraw() determines the statistical significance of PCA scores by randomly permuting a subset of the data and calculating projected PCA scores for these “random” genes. The ScoreJackStraw() function computes the significance of each PC; significant PCs show a p-value distribution that is strongly skewed toward low values compared with the null distribution.

figure l

We can now cluster the cells. The function FindNeighbors() computes the k.param nearest neighbors of each cell using the k-nearest neighbors algorithm and builds a shared nearest neighbor (SNN) graph from them. Then, the function FindClusters() identifies clusters of cells from the SNN graph. The higher the resolution parameter, the larger the number of communities obtained.

figure m
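The graph construction behind this step can be sketched in Python, independently of the R listing: each cell gets a set of k nearest neighbors, and the edge weight between two cells is the overlap (Jaccard index) of their neighbor sets. The points stand in for cells in PCA space and are invented; including each cell in its own neighbor set is a choice made for this sketch.

```python
import math


def knn(points, k):
    """For each point, the set containing itself and its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]} | {i})
    return neighbors


def snn_weight(neighbors, i, j):
    """Jaccard index of the neighbor sets of cells i and j."""
    shared = len(neighbors[i] & neighbors[j])
    return shared / len(neighbors[i] | neighbors[j])


# Hypothetical cells in a 2-D reduced space: two well separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
neighbors = knn(points, k=2)

within = snn_weight(neighbors, 0, 1)   # cells of the same group
between = snn_weight(neighbors, 0, 3)  # cells of different groups
```

Community detection then groups cells connected by heavy edges, so cells that share most of their neighbors end up in the same cluster.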

Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. It is founded on three assumptions about the data: (i) the data is uniformly distributed on a Riemannian manifold; (ii) the Riemannian metric is locally constant (or can be approximated as such); and (iii) the manifold is locally connected.

figure n

Finally, it is possible to plot the clusters of distinct cell types in the samples. Using these parameters, we find 10 clusters, as can be seen in Fig. 3.12.

Fig. 3.12 Ten cell clusters belonging to patients P3 and P10. Dimensionality reduction yields clusters of cells correlated by gene expression profile. Each cluster is labeled with a different color and is identified by a number; it can later be annotated as a particular cell type based on the gene markers expressed in the cluster

figure o

As can be seen in Fig. 3.12, cluster number 4 includes expressed genes from both patients 3 and 10. In this case, we first split the data of patients 3 and 10 and then run the function FindAllMarkers(), which finds the differentially expressed genes for each of the patients in this dataset. Some constraints can be used to filter these genes, such as min.pct, which excludes genes that are very infrequently expressed and has a default value of 0.1. The results are joined, and the gene markers are filtered to keep only those of cluster number 4.

figure p

The next step is to group the expressed genes as “Not Significant,” “Significant,” “FoldChange,” and “Significant&FoldChange” depending on the values of p-value and fold change. A plot (Fig. 3.13) with the most significant differentially expressed genes for the patients P3 and P10 can be built to highlight them.

Fig. 3.13 Differentially expressed genes for patients 3 and 10. Each cluster of cells is tested against all remaining clusters. The most significant down- and upregulated genes are highlighted. Patient 3 is shown on the left and patient 10 on the right

figure q
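The four-way grouping can be expressed as a small Python function, independently of the R listing. The cutoffs (adjusted p-value below 0.05 and absolute log2 fold change of at least 1) and the example values are hypothetical; the thresholds actually used in the study are not stated in the text.

```python
def classify(p_adj, log2_fc, p_cut=0.05, fc_cut=1.0):
    """Assign a gene to one of the four groups used in the plot."""
    significant = p_adj < p_cut
    large_change = abs(log2_fc) >= fc_cut
    if significant and large_change:
        return "Significant&FoldChange"
    if significant:
        return "Significant"
    if large_change:
        return "FoldChange"
    return "Not Significant"


# Hypothetical marker statistics: gene -> (adjusted p-value, log2 fold change)
markers = {"geneA": (1e-8, 1.6), "geneB": (1e-4, 0.4),
           "geneC": (0.30, -1.2), "geneD": (0.80, 0.1)}
groups = {gene: classify(p, fc) for gene, (p, fc) in markers.items()}
```

Only the genes in the “Significant&FoldChange” group meet both criteria and would be highlighted.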

The differentially expressed genes depicted in Fig. 3.13 reveal that six genes meet both the statistical and the fold change criteria. The IL7 receptor (IL7R) appears upregulated in both patients, while GNLY, MYOM2, CST7, and NKG7 are upregulated only in patient 3. The LincRNA 00861, a non-coding RNA, is upregulated only in patient 10, who had a milder infection. All of these genes are usually expressed in cytotoxic CD8 lymphocytes, but patient 3, who developed a strong inflammatory response, shows a different gene response that is not associated with the LincRNA but is strongly associated with genes involved in cytotoxicity (NKG7 and GNLY).

Single-cell computational analysis can consume vast computational resources. This case study uses only part of the original data so that it can be reproduced on a regular desktop or notebook computer. All the code is available for download, together with the environment set-up instructions, at https://github.com/waldeyr/single_cell_analysis.