1 Introduction

RNA-Seq is one of the most advanced techniques that use high-throughput sequencing (HTS), also called next-generation sequencing (NGS), to decipher the transcriptome. The transcriptome comprises the complete set of transcripts in a cell, tissue, or organism under a given physiological condition. Transcripts include protein-coding messenger RNA (mRNA) and noncoding RNAs such as ribosomal RNA (rRNA), transfer RNA (tRNA), and other ncRNAs (Lindberg and Lundeberg 2010; Okazaki et al. 2002). RNA-Seq reveals which regions of the genome are transcribed in a sample and quantifies the expression of those transcripts. Because the transcriptome varies with physiological conditions, transcriptomics is a significant field of study, and RNA-Seq is a powerful tool for dissecting many biological phenomena, such as the mechanisms and pathways that control disease initiation, development, and progression.

Several technologies for studying the transcriptome have emerged over the years, but RNA-Seq characterizes the transcriptome more globally and more effectively than microarrays and other traditional strategies. RNA-Seq relies on sequencing cDNA derived from the RNA sample of interest (Wilhelm et al. 2008). An RNA-Seq experiment starts with library construction, followed by sequencing on a specific NGS platform and subsequent bioinformatic analysis. In a nutshell, library construction requires isolation of RNA, which is randomly fragmented into smaller pieces and then reverse transcribed. Reverse transcription converts the RNA fragments into cDNA, and adapter sequences are ligated to one or both ends for amplification. RNA can be fragmented prior to reverse transcription, or reverse transcription can be done first, followed by cDNA fragmentation (Roberts et al. 2011; Wang et al. 2009). This choice matters because each option introduces its own bias into the final results. In particular, cDNA fragmentation under-represents the 5′ ends of transcripts, while RNA fragmentation gives better coverage of the transcript body but tends to deplete the transcript ends (Mortazavi et al. 2008). The basic steps of an RNA-sequencing experiment are almost the same for every platform and are shown in Fig. 10.1.

Fig. 10.1 A basic layout of RNA-sequencing experiment

Fragment size selection and priming of the sequencing reaction, along with the above steps, can vary with the implementation of the protocol and introduce technical biases in the resulting data. The final sequencing step relies on an NGS platform such as the 454 pyrosequencing system (454 Life Sciences, a Roche subsidiary), the AB SOLiD system (Life Technologies), or the Illumina Genome Analyzer (Illumina) (Liu et al. 2012; Marguerat and Bahler 2010; Ansorge 2009), each having its own library construction method. Both the 454 and the SOLiD systems employ emulsion polymerase chain reaction (emulsion PCR) for clonal amplification. In emulsion PCR, the cDNA fragments from a library are attached to beads and compartmentalized in the aqueous droplets of a water-in-oil emulsion, so that each droplet ideally contains a single template molecule segregated from the others. The templates are then amplified within these very small emulsified droplets (Dressman et al. 2003).

The Illumina Genome Analyzer (GA) uses “bridge PCR” amplification, in which adapter-linked single-stranded cDNA fragments are immobilized on a glass slide by oligonucleotide hybridization in a bridging way, followed by clonal PCR amplification (Fedurco et al. 2006). Clonal amplification yields a population of identical templates, but PCR artifacts may introduce a bias into the RNA-Seq result. For this reason, biological replicates are needed to determine whether the same short reads are present in different replicates (Wang et al. 2009). Different NGS platforms use different sequencing strategies (Metzker 2009), and several reviews describe the mechanisms of these NGS technologies and compare them in detail (Liu et al. 2012a; Metzker 2009; Shendure and Ji 2008; Ansorge 2009). Sequencing can produce single-end or paired-end reads. In paired-end sequencing, a fragment is sequenced from both ends, while in single-end sequencing, only one end is used. Because each fragment is sequenced from both ends, paired-end sequencing generally yields more informative data that can be aligned more reliably.

Since the advent of RNA-Seq in 2008, it has emerged as a superior technique to study transcriptome over traditional methods which were either hybridization (microarray) or sequence based (SAGE, CAGE). Being superior in resolution at the single-base level, this technique can effectively measure the expression level of thousands of genes simultaneously in addition to information on alternative splicing, unannotated exons, allele-specific expression (Heap et al. 2010), microRNAs, variants like SNPs (Quinn et al. 2013), and novel transcripts (gene or noncoding RNAs). Additionally, many significant phenomena such as detection of differential alternative splicing and isoform abundance can be studied in detail with RNA-Seq technique (Park et al. 2013).

Although RNA-Seq is clearly more informative and advantageous, the data produced by this technique are complex and very large. NGS platforms generate high-throughput data in the form of millions of short sequences termed “reads.” Each read comes with base-call quality scores that indicate the reliability of each base call. The length of these short reads depends on the NGS platform used for sequencing, but generally falls within 25–450 bp. The resulting reads are categorized into three types: exonic reads, junction reads, and poly(A) end reads (Wang et al. 2009). Analyzing this kind of data is not straightforward and is usually a bottleneck. Fortunately, continuous progress in bioinformatics has made RNA-Seq data easier to handle. There are now various bioinformatic tools/software, web servers, and whole pipelines for analyzing RNA-Seq data. Also, various strategies for RNA-Seq data analysis can be implemented in Bioconductor (Huber et al. 2015; Gentleman et al. 2004) through the statistical language “R” (https://www.r-project.org). Bioconductor is free and open-source and handles not only RNA-Seq data but other high-throughput genomic data as well. It is organized into “packages” dedicated to different types of tasks. Many Bioconductor packages cover the whole RNA-Seq data analysis and can be run with only a little proficiency in R. Many tools can be combined for the analysis of RNA-Seq data, and researchers may build their own custom analysis pipelines according to their objectives.

Bioinformatic analysis of RNA-Seq data can be divided into several stages. The very first step is experiment/technology dependent, and the methods for downstream analysis are chosen on the basis of the type of experiment. The first step of bioinformatic analysis begins during sequencing itself, with the transformation of fluorescent measurements into nucleotide bases with associated quality scores. A base quality score is a value representing the confidence in the called base. The final output of this base-calling step is the set of short reads (raw data) in FASTQ (FAST-All with quality score) format. The next task is to map these short reads to a reference genome (or transcriptome in case of transcriptomic data) if one is available, or otherwise to assemble them de novo first. After mapping, further downstream analysis proceeds according to the research goals; a usual workflow of bioinformatics-based analysis of RNA-Seq data is shown in the flowchart (Fig. 10.2). During the analysis, different tools/software or strategies may be applied at different steps.

Fig. 10.2 A usual flow chart of bioinformatics-based analysis of RNA-Seq data

It would not be inappropriate to say that RNA sequencing has a variety of different applications and data analysis strategies depending on the organism under study and research objectives. RNA-Seq has the power of identifying transcripts and quantifying gene expression which is the key to decipher more knowledge on the relationship between genome and proteome. Elucidating RNA isoform expression, alternative splicing, and ncRNA levels are other applications of RNA-Seq having great importance in molecular biology.

2 Data Format, Quality Check, and Preprocessing

Raw reads (FASTQ format) obtained after the base-calling step contain nucleotides associated with quality scores. Although different NGS platforms have their own base-calling software to evaluate base quality, various third-party groups have also developed base-calling methods. The most notable example is the enhanced ABI base caller Phred, which played an important role in the Human Genome Project (Ewing and Green 1998; Ewing et al. 1998). Nowadays, most NGS platforms provide the user with a Phred-like score (Ewing et al. 1998) for base quality evaluation, which encodes the probability of error in the corresponding base call on a logarithmic scale. The base-calling step is particularly important because its accuracy affects the downstream analysis. The output format of base calling, FASTQ, resembles FASTA (FAST-All), the standard format for biological sequences, but additionally carries a quality score, usually a Phred score, for each nucleotide.
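As a minimal sketch of how a Phred score relates to the error probability, the standard relation Q = -10 log10(P) can be coded as follows. The function names and the example record are hypothetical, and the ASCII offset of 33 is an assumption (it is the common Sanger/Illumina 1.8+ encoding, while some older pipelines used 64).

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the base-call error probability P."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_string]

# A hypothetical four-line FASTQ record:
# identifier, sequence, separator, per-base qualities.
record = ["@read_1", "GATTACA", "+", "IIIII5#"]
scores = decode_quality(record[3])
error_probs = [phred_to_error_prob(q) for q in scores]
```

Under this encoding, a score of 20 corresponds to an expected error rate of 1 in 100 and a score of 30 to 1 in 1000, which is why thresholds around these values are commonly used when filtering reads.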

Reads may be represented in other formats, such as FASTA and standard flowgram format (SFF), which can be converted to one another, but FASTQ is the most common format and is accepted as input by many applications. FASTQ files may be very large and may contain contaminants that need to be eliminated before downstream analysis because contaminated input directly affects the outcome. Preprocessing of the data is thus a very important and necessary step before moving on to downstream analysis. Preprocessing includes checking Phred scores, read lengths, and per-base read quality, and trimming the reads to remove adapters, low-quality sequences, duplicate sequences, and Ns (positions where no base was assigned during base calling). Preprocessing tools are available as stand-alone software, as parts of whole data analysis pipelines, through web servers like Galaxy (https://galaxyproject.org/), through language platforms like R/Bioconductor, or simply as command-line programs.
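The trimming and filtering steps described above can be illustrated with a small sketch. It is not the algorithm of any particular tool; the function names, the thresholds (minimum quality 20, minimum length 5, no Ns allowed), and the toy reads are all assumptions chosen for illustration.

```python
def trim_3prime(seq, quals, min_q=20):
    """Trim bases from the 3' end until a base with quality >= min_q is found."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def keep_read(seq, min_len=5, max_n=0):
    """Keep a read only if it is long enough and has at most max_n uncalled bases."""
    return len(seq) >= min_len and seq.count("N") <= max_n

def preprocess(reads, min_q=20, min_len=5, max_n=0):
    """Apply 3' quality trimming, then length and N filtering, to (seq, quals) pairs."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_3prime(seq, quals, min_q)
        if keep_read(seq, min_len, max_n):
            kept.append((seq, quals))
    return kept

# Hypothetical reads with already decoded Phred scores.
reads = [
    ("GATTACAGG", [38, 38, 37, 36, 35, 30, 25, 10, 2]),  # low-quality tail
    ("ACGTNACGT", [30] * 9),                              # contains an N
    ("ACGTA", [30, 30, 12, 8, 5]),                        # too short after trimming
]
clean = preprocess(reads)
```

Only the first read survives here, with its two low-quality 3′ bases removed. Dedicated tools add adapter detection, sliding-window trimming, and paired-end handling on top of this basic idea.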

Some popular tools for quality check and preprocessing of RNA-Seq data are FastQC (Andrews 2010) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit), Cutadapt (https://cutadapt.readthedocs.org/en/stable) (Martin 2011), and Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic) (Bolger et al. 2014). These tasks are also achievable through some R/Bioconductor packages like “ShortRead” (Morgan et al. 2009). We present a list of some recently developed tools for data quality check and preprocessing (Table 10.1).

Table 10.1 List of recently developed tools/software for data QC and preprocessing

3 Mapping

Mapping is a crucial step in analyzing any NGS data. “Mapping” assigns each read to a particular position in the genome/transcriptome. Because RNA-Seq reads may originate either from within a single exon, without crossing an exon–exon boundary (unspliced), or from two exons, so that the read spans an intron when aligned to the genome (spliced), the mapping strategy requires closer consideration. If RNA-Seq reads are aligned to the genome with a general-purpose method, such as one based on the Burrows–Wheeler transform, both the aligned and the unaligned reads have to be considered: fully aligned reads may be unspliced, whereas reads that fail to align may be genuinely spliced reads spanning an intron. Today, many aligners for NGS data are available that use different approaches, such as seed-based approaches, e.g., SHRiMP2 (David et al. 2011), BFAST (Homer et al. 2009), SeqMap (Jiang and Wong 2008), CUSHAW3 (Liu et al. 2014), SOAP (Li et al. 2008a), MAQ (Li et al. 2008b), and STAMPY (Lunter and Goodson 2011), or hash-based approaches, e.g., MOSAIK (Lee et al. 2014) and HIVE hexagon (Santana et al. 2014). Additionally, the Burrows–Wheeler transform (BWT), an algorithm widely used in data compression, underlies some excellent mapping tools like BWA (Li and Durbin 2009d), SOAP2 (Li et al. 2009a), and Bowtie (Langmead 2010). Several tools such as TopHat (Trapnell et al. 2009), STAR (Dobin et al. 2013), SpliceMap (Au et al. 2010), and MapSplice (Wang et al. 2010) perform mapping while considering both exonic reads and splicing events.
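To give a feel for why the BWT is useful for mapping, the following sketch builds the transform naively from sorted rotations and counts exact occurrences of a pattern by backward search. This is an illustration of the general idea only, not the implementation used by BWA, SOAP2, or Bowtie, which rely on compressed indexes and tolerate mismatches; the function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    text = text + "$"  # sentinel; '$' sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern in the original text by backward search."""
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(bwt_string)):
        first[c] = total
        total += bwt_string.count(c)

    def occ(c, i):
        # number of occurrences of c in bwt_string[:i]
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

reference = "ACAACG"  # hypothetical toy reference
transformed = bwt(reference)
hits = count_occurrences(transformed, "AC")
```

Each pattern character narrows the range of matching rotations, so the search cost depends on the read length rather than on the genome size once rank queries are precomputed. Here `occ` is computed by scanning, which a real index would replace with stored checkpoints.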

Mapping locates the short reads on a reference genome/transcriptome and is therefore feasible only when such a reference is available. Without one, a de novo assembly of the RNA-Seq reads is required to generate full transcript sequences (Robertson et al. 2010). De novo assembly is usually complex and involves the construction of de Bruijn graphs from k-mers. There are many tools for de novo assembly of RNA-Seq data, like Trinity (Haas et al. 2013), Velvet (Zerbino and Birney 2008), Bridger (Chang et al. 2015), SOAPdenovo (Li et al. 2010), and Trans-ABySS (Simpson et al. 2009). Here we discuss some useful de novo assemblers and some mappers that are very efficient at mapping RNA-Seq reads.
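The construction of a de Bruijn graph from k-mers can be sketched as follows. This is a deliberately simplified illustration with hypothetical function names and toy reads. Real assemblers additionally handle sequencing errors, reverse complements, coverage, and branching caused by isoforms or repeats.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Greedily follow unambiguous edges from start to spell out a contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Three overlapping hypothetical reads from one transcript.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_contig(graph, "ATG")
```

The walk stops at the first node with zero or several successors. In a transcriptome graph, such branch points are where alternative isoforms diverge, which is the part that dedicated assemblers resolve.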

3.1 Trinity

Trinity (Haas et al. 2013) is the first method designed specifically for transcriptome assembly and works on the basis of de Bruijn graphs. It comprises three independent software modules, Inchworm, Chrysalis, and Butterfly, which are used sequentially to produce transcripts. Inchworm assembles the RNA-Seq data into transcript sequences, Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster, and then Butterfly processes the individual graphs in parallel to trace the paths of reads within the graph, ultimately reporting full-length transcripts.

3.2 Bridger

Bridger is a newer framework for de novo transcript assembly (Chang et al. 2015). Its name reflects the aim of building a bridge between the key ideas of two popular assemblers: Cufflinks (the reference-based assembler (Trapnell et al. 2012)) and Trinity (the de novo assembler (Haas et al. 2013)). It has some advantages over other de novo assemblers; for example, it allows different k-mer lengths for different data, while Trinity has a fixed k-mer length of 25. It also has a lower false-positive rate and uses less memory and run time than Trinity.

On the other hand, the presence of a reference genome/transcriptome makes the mapping process relatively faster and easier to implement with web-based or command-line tools. Multimapping, where a read maps equally well to more than one location, is also commonly seen and needs to be taken care of. Generally, mapping uses a heuristic first step to find likely candidate locations, followed by local alignment, because local alignment alone does not scale to moderate- to large-sized genomes. Thus, most aligners/mappers rely on a fast heuristic so that fewer local alignments have to be performed. As mentioned above, RNA-Seq mappers should be able to handle the spliced alignment problem, i.e., they should be able to place spliced reads across introns and correctly determine exon–intron boundaries. In the present scenario of RNA-Seq research, many aligners work well in this kind of mapping, among which Bowtie2 (Langmead et al. 2009) is a popular one. We discuss a few other tools that have proven their worth.

3.3 TopHat

TopHat is a program that aligns RNA-Seq reads to a genome/transcriptome while considering splice junction mapping (Trapnell et al. 2009). It uses the ultrahigh-throughput short read aligner Bowtie and then analyzes the mapping results to identify splice junctions between exons. Using this initial mapping information from Bowtie, TopHat builds a database of possible splice junctions and then maps the reads again against these junctions to confirm them. It runs on Linux and Mac OS X and was originally designed to work with reads produced by the Illumina Genome Analyzer, although it has been successfully applied to reads from other technologies as well. It can also be used from R through some Bioconductor packages as well as on the Galaxy server. Moreover, the mapping can be visualized with the Integrative Genomics Viewer (https://www.broadinstitute.org/igv/) (Robinson et al. 2011).

Before performing further downstream analysis, it is also recommended to check the quality of mapping as it greatly influences the downstream analysis. A list of data QC and preprocessing tools capable of checking and processing the data at many stages (including mapping) of data analysis is provided in Table 10.1. Tools like SAMStat (Lassmann et al. 2010) and dupRadar (Sayols and Klein 2015) (R package for QC) are easily accessible and very useful in checking and dealing with mapping quality issues.

3.4 STAR

STAR (Spliced Transcripts Alignment to Reference) (Dobin et al. 2013) is an important alignment tool capable of identifying alternative splicing junctions in RNA-Seq reads. It is free, open-source software (under the GPLv3 license) that can be downloaded from http://code.google.com/p/rna-star/. It first indexes the reference genome by building a suffix array index, which accelerates the subsequent alignment step. STAR is as accurate as TopHat while taking comparatively less time. It handles both single- and paired-end reads, and its accuracy increases if it is provided with an annotation (.gtf) file. STAR was not developed as an extension of a short read mapper but as stand-alone C++ code. Because it can run parallel threads on multi-core systems, STAR is faster than other tools.

Visualizing mapped reads graphically, preferably in an interactive mode, is necessary to look closely at the mapped regions and other factors. Various tools/software packages such as “SAMtools tview” (Li et al. 2009b), “MapView” (Bao et al. 2009), “Tablet” (Milne et al. 2013), “IGV” (Thorvaldsdóttir et al. 2013), and “Bambino” (Edmonson et al. 2011) enable the visualization of mapped reads.

In NGS data analysis, quality control is significant at every single step. Since mapping is the basis for further analysis of the data, the quality of the mapped files must be checked to guard against errors in the results. Alongside established NGS data manipulators like Picard (http://picard.sourceforge.net) and SAMtools (Li et al. 2009b), some more recently developed and powerful tools like RSeQC and QoRTs assist in quality control, data processing, and management. Each of these tools is a package of various utilities that handle the data at different levels.

QoRTs (Hartley and Mullikin 2015) is a fast and portable multifunction toolkit that easily handles cross-comparison of replicates (biological/experimental) and detection of errors, artifacts, and biases. Additionally, it can produce count data that can be used in Bioconductor packages such as DESeq, DESeq2, and edgeR.

RSeQC (Wang et al. 2012), on the other hand, is a comprehensive package of programs that provides a number of modules to evaluate RNA-Seq data from different aspects. Its “basic modules” check properties of the raw reads such as sequence quality, PCR bias, nucleotide composition bias, and GC bias, while its “RNA-Seq specific modules” evaluate sequencing saturation for both splice junction detection and expression estimation. RSeQC is written in Python and C and is freely available at http://code.google.com/p/rseqc/.

Mapping is also fundamental to many applications of RNA-Seq, like transcript identification and characterization, gene expression quantification, detection of alternatively spliced isoforms, detection of allele-specific expression (ASE), and differential gene expression. Programs like HTSeq-count (Anders et al. 2015) and featureCounts (Liao et al. 2014) use the raw counts of mapped reads for gene quantification. Gene quantification also uses a gene transfer format (GTF) file containing the genome coordinates of exons and genes. The number of reads mapped to a reference transcript is the most important information for estimating gene and transcript expression. For expression analysis, however, raw read counts alone are not sufficient because they are affected by factors such as sequence biases, the total number of reads, and transcript length. These factors are handled by normalization methods like RPKM (reads per kilobase per million mapped reads) (Mortazavi et al. 2008), FPKM (fragments per kilobase of transcript per million mapped reads) (Trapnell et al. 2010), and TPM (transcripts per million), which are elaborated later in other sections. “Cufflinks” (Trapnell et al. 2012) is a widely used program for estimating transcript-level expression from the mapping using an EM (expectation–maximization) approach while taking into account biases like the nonuniform distribution of reads along the gene length.
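The definitions of RPKM and TPM translate directly into a few lines of code. The sketch below uses hypothetical gene names, counts, and lengths. FPKM follows the RPKM formula with fragments counted in place of reads.

```python
def rpkm(counts, lengths):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths[g]) for g in counts}

def tpm(counts, lengths):
    """TPM: length-normalized rates rescaled so that they sum to one million."""
    rate = {g: counts[g] / lengths[g] for g in counts}
    total_rate = sum(rate.values())
    return {g: rate[g] * 1e6 / total_rate for g in counts}

# Hypothetical read counts and transcript lengths (in bases).
counts = {"geneA": 500, "geneB": 1000, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

In this toy example geneB has the most reads but the lowest normalized expression because it is the longest. TPM values always sum to one million within a sample, which makes proportions easier to compare across samples than RPKM values.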

Because RNA-Seq can identify and quantify the overall expression of RNAs in a sample, it enables genome-wide studies of alternative pre-mRNA splicing, which is an important factor in understanding differential expression. Alternative splicing produces multiple isoforms by skipping or differentially joining exons or introns within a pre-mRNA transcript (Fig. 10.3); it thereby gives a gene functional diversity during posttranscriptional processing and affects gene regulation.

Fig. 10.3 A graphical illustration of alternative splicing event that eventually results in isoforms

Analyzing the expression of transcripts at the isoform level is very important for understanding differential expression. Since many genes have multiple isoforms, deciphering isoform-specific expression is not straightforward because it is difficult to assign a read to one particular isoform. The basic approach to this task was to quantify transcript isoforms using only those sequences that are unique to particular isoforms (Filichkin et al. 2010). This approach started from known or predicted transcript isoforms of a given gene, from which a set of sequences was derived that could differentiate one isoform from the others. Mapping the reads to this set of sequences then gave the expression of the corresponding isoform.

Similarly ALEXA-seq (Griffith et al. 2010) method used only those reads that mapped uniquely to one isoform to estimate isoform-specific expression, but these kinds of approaches usually are limited. This is because many isoforms are mostly nonunique or may have minor sequence differences, and also these approaches demand a prior knowledge of precise annotation of splice variants.

Tools for isoform identification, quantification, abundance estimation, pre-mRNA alternative splicing discovery, and mapping/alignment are already widespread, and new methods are being developed at an accelerating pace. We present a list (Table 10.2) of some recently developed methods/tools dedicated to these tasks along with a brief description of each tool.

Table 10.2 Recently developed methods/tools for isoform discovery, quantification, abundance estimation, alternative splicing discovery, assembling transcriptome, and alignment of RNA-Seq reads

Lately, some algorithms like Sailfish, Kallisto, and Salmon have come into existence that use an alignment-free approach to deal with gene/isoform quantification task. These algorithms are considered to be lightweight algorithms that are faster than traditional mapping steps. A succinct overview of all three algorithms is briefed below.

3.5 Sailfish

Sailfish (Patro et al. 2014) is free and open-source software, available at http://www.cs.cmu.edu/~ckingsf/software/sailfish. It is a much faster in silico method that quantifies RNA isoform abundance while entirely avoiding the time-consuming mapping step. Instead of mapping, it inspects the k-mers in reads to observe transcript coverage, which results in fast processing of reads. It maintains accuracy by incorporating an EM procedure that statistically couples k-mers, and it handles sequencing errors by discarding k-mers that overlap erroneous bases. Overall, it relies on only a single explicit parameter, the k-mer length. Longer k-mers are easier to assign to their transcript of origin than short k-mers but are more likely to be affected by sequencing errors, which Sailfish addresses with the error handling described above. In practice, Sailfish first builds an index from a FASTA file of reference transcripts and a chosen k-mer length. Data structures in the index, such as a minimal perfect hash function, map each k-mer in the reference transcripts to an identifier in such a way that no two k-mers share an identifier. The index does not need to be changed or rebuilt unless the reference or the choice of k changes. After the index is built, the quantification step takes the index and a set of RNA-Seq reads as input and estimates isoform abundance, measured in RPKM, KPKM (k-mers per kilobase per million mapped k-mers), and TPM. Sailfish can also be used for non-model organisms in de novo mode. Because Sailfish works on k-mer counts, it is computationally efficient and can effectively exploit many CPU cores.
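The indexing and counting idea can be sketched in a few lines. This is an illustration of the concept only, not Sailfish's implementation: a plain dictionary stands in for the minimal perfect hash function, the EM step is omitted, and all names and sequences are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """Yield all overlapping k-mers of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(transcripts, k):
    """Assign every distinct transcript k-mer a unique integer identifier.

    A plain dict stands in for the minimal perfect hash function here.
    """
    index = {}
    for seq in transcripts.values():
        for kmer in kmers(seq, k):
            if kmer not in index:
                index[kmer] = len(index)
    return index

def count_read_kmers(reads, index, k):
    """Count read k-mers by identifier, ignoring k-mers absent from the index."""
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in index:
                counts[index[kmer]] += 1
    return counts

# Hypothetical toy transcripts and reads.
transcripts = {"tx1": "ACGTAC", "tx2": "CGTACC"}
index = build_index(transcripts, k=4)
kmer_counts = count_read_kmers(["ACGTA", "GTACC", "TTTTT"], index, k=4)
```

Because the index depends only on the transcript set and on k, it can be built once and reused for every sample, which matches the remark above that it needs rebuilding only when the reference or k changes.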

3.6 Kallisto

Kallisto (Bray et al. 2016) was developed by the Pachter lab and uses the same lightweight approach as Sailfish to quantify transcript abundance but improves on it with a “pseudoalignment” process. It is a fast program written mainly in C++. It is considered near optimal in speed and accuracy, and its developers used it to analyze 30 million unaligned paired-end RNA-Seq reads in less than 5 min on a standard desktop. The software is widely popular because of its accuracy relative to existing tools. Rather than determining the position in a transcript where a read aligns, it determines only which transcripts a read is compatible with, which takes far less time than the traditional alignment process.
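The notion of compatibility can be illustrated with a simplified sketch: a read is considered compatible with the transcripts that contain all of its k-mers. This is an assumption-laden simplification rather than Kallisto's actual algorithm, which works on a de Bruijn graph of the transcriptome and skips redundant k-mers; the names and sequences below are hypothetical.

```python
from collections import defaultdict

def build_kmer_map(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    kmer_map = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            kmer_map[seq[i:i + k]].add(name)
    return kmer_map

def pseudoalign(read, kmer_map, k):
    """Return the set of transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = kmer_map.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()

# Two hypothetical isoforms of one gene; isoform2 skips the middle exon.
transcripts = {
    "isoform1": "ATGGCGTACGTTAG",
    "isoform2": "ATGGCGTTAG",
}
kmer_map = build_kmer_map(transcripts, k=5)

shared = pseudoalign("ATGGCG", kmer_map, k=5)
only_1 = pseudoalign("GCGTACG", kmer_map, k=5)
only_2 = pseudoalign("GGCGTTA", kmer_map, k=5)
```

A read from the shared first exon is compatible with both isoforms, whereas reads covering the middle exon or the exon-skipping junction identify a single isoform. No alignment coordinates are computed at any point, which is where the speed advantage comes from.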

3.7 Salmon

Salmon (Patro et al. 2015) is an open-source software under the GPL v3 license and available at http://combine-lab.github.io/salmon/. Its developers call it a wicked-fast transcript quantification software that requires a set of target transcripts for quantification task and may be run in two modes: the quasi-mapping-based mode and the alignment-based mode. The quasi-mapping-based mode like Sailfish incorporates two phases, indexing and quantification, while the alignment-based mode uses the alignment file (SAM/BAM) provided by the user along with reference transcript FASTA file and does not require indexing.

4 Differential Expression

An important application of RNA-Seq is to identify genes that change in abundance between conditions, i.e., genes whose read counts differ between conditions. Differential expression (DE) analysis compares the expression levels of genes between two conditions, e.g., stimulated versus unstimulated, wild type versus mutant, or normal versus treated. If there is a statistically significant difference in read counts between the two conditions, a gene is regarded as differentially expressed. The aforementioned steps of data preprocessing and mapping are mandatory for the analysis of differential expression. DE analysis also requires the read-count distributions, typically represented as an n × m matrix N, where Nij is the number of reads assigned to gene i in sequencing experiment/condition j. Bioconductor has many packages to support DE analysis of RNA-Seq data. Many packages like DESeq2 (Love et al. 2014), edgeR (Robinson et al. 2010), limma (Ritchie et al. 2015), and baySeq (Hardcastle 2012) provide whole RNA-Seq data analysis pipelines, which can be of great use. Most packages for DE analysis expect input data in the form of a matrix of integer values. To prepare the count matrix, a SAM/BAM alignment file can be used along with a file specifying the genomic features, e.g., a GFF3 or GTF file. For this, other Bioconductor packages like Rsubread (Liao et al. 2013) and GenomicAlignments (Lawrence et al. 2013) may be used.
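As a small illustration of the count matrix N, the sketch below builds it from a list of read assignments and computes a naive log2 fold change for one gene. The function names, the pseudocount of 1, and the toy data are assumptions. The fold change alone is not a statistical test: packages such as DESeq2 and edgeR model the counts with a negative binomial distribution and use replicates to estimate dispersion.

```python
import math

def build_count_matrix(assignments, genes, samples):
    """Build N (genes x samples) from (gene, sample) read assignments."""
    row = {g: i for i, g in enumerate(genes)}
    col = {s: j for j, s in enumerate(samples)}
    matrix = [[0] * len(samples) for _ in genes]
    for gene, sample in assignments:
        matrix[row[gene]][col[sample]] += 1
    return matrix

def log2_fold_change(matrix, i, cols_a, cols_b, pseudocount=1):
    """Log2 ratio of the mean count of gene i in group B versus group A."""
    mean_a = sum(matrix[i][j] for j in cols_a) / len(cols_a)
    mean_b = sum(matrix[i][j] for j in cols_b) / len(cols_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

# Hypothetical read assignments: one (gene, sample) pair per mapped read.
genes = ["g1", "g2"]
samples = ["control", "treated"]
assignments = (
    [("g1", "control")] * 3 + [("g1", "treated")] * 7
    + [("g2", "control")] * 2 + [("g2", "treated")] * 2
)
N = build_count_matrix(assignments, genes, samples)
lfc_g1 = log2_fold_change(N, 0, cols_a=[0], cols_b=[1])
lfc_g2 = log2_fold_change(N, 1, cols_a=[0], cols_b=[1])
```

The integer matrix `N` has the shape that most DE packages expect as input, with one row per gene and one column per sample.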

The two most popular packages for DE analysis are DESeq2 (Love et al. 2014) and edgeR (Robinson et al. 2010). They are modular, meaning that each package can be entered at several points in the analysis. They leave the user free to obtain read counts with an alternative aligner or a different strategy or tool and then use the package for the rest of the analysis. Since there is no universal standard for DE analysis, the analysis is to some extent driven by the research objective and depends heavily on external data like reference assemblies and annotation. Thus, two different analysis strategies applied to the same data cannot be expected to give identical results, although the results should still be similar.

Normalization is also a very significant step in the analysis of DE. It is necessary to correct for technical biases, both between samples (e.g., differences in library size) and within samples (gene-specific effects that may be related to gene length and GC content) (Oshlack and Wakefield 2009). There are various normalization methods for DE analysis, including Total Count (TC), Upper Quartile (UQ), Median (Med), the DESeq normalization implemented in the DESeq Bioconductor package, Trimmed Mean of M values (TMM) implemented in the edgeR Bioconductor package, Quantile (Q), and RPKM normalization. FPKM normalization is also popular and is used by tools like Cufflinks (Trapnell et al. 2010). FPKM is analogous to RPKM but counts fragments rather than individual reads, so that the two reads of a paired-end fragment are counted once.
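To make the between-sample correction concrete, the following sketch computes size factors in the spirit of the DESeq median-of-ratios normalization: each sample is compared with a per-gene geometric mean, and the median ratio becomes that sample's scaling factor. It is a simplified pure-Python illustration with hypothetical function names and data, not the package's code.

```python
import math
from statistics import median

def size_factors(matrix):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    n_samples = len(matrix[0])
    # Use only genes with nonzero counts in every sample.
    usable = [row for row in matrix if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(matrix):
    """Divide each count by the size factor of its sample."""
    factors = size_factors(matrix)
    return [[c / f for c, f in zip(row, factors)] for row in matrix]

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # excluded from the size-factor estimate because of the zero
]
factors = size_factors(counts)
normalized = normalize(counts)
```

After normalization, the first three genes have equal values in both samples, reflecting that their apparent twofold difference came from sequencing depth alone. Using the median makes the factor robust to a minority of genes that are truly differentially expressed.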

This overview covers only the basics of DE analysis. In practice, each step involves a large number of parameters that can change the results, and every step, including preprocessing and mapping, affects the subsequent steps. As in other tasks of RNA-Seq data analysis where newer algorithms and tools are making a mark, differential expression has also opened the way for the development of newer and different algorithms/tools. BitSeq (Hensman et al. 2015; Glaus et al. 2012), deGPS (Chu et al. 2015), NOISeq (Tarazona et al. 2015), and XBSeq (Chen et al. 2015) are some of the recently developed tools that differ substantially in their algorithms and performance and broaden the spectrum of differential expression analysis of RNA-Seq data.

Although in this chapter we elaborate on different approaches and tools for analysis of RNA-Seq data, continuous research in this field has provided us some great whole analysis pipelines to also deal with RNA-Seq data. Since RNA-Seq technique has unprecedented ability to study transcriptome to a much greater extent than previous technologies, the analyses of ncRNAs have also become more accessible and feasible. Here we succinctly present a list of recently developed pipelines dedicated to RNA-Seq data and also some tools/pipelines dedicated to analyses of ncRNAs obtained through RNA-Seq technique (Table 10.3).

Table 10.3 List of some popular and recently developed pipelines dedicated to whole RNA-Seq and ncRNA data (obtained through RNA-Seq technique) analysis

5 Summary

Today, RNA-Seq is the mainstream tool for the analysis of transcriptomes, which are rich in information, and the technique is progressing day by day. It has wide applications in areas like clinical diagnostics, pharmacogenomics, and drug development. It can find novel transcripts and identify drug-related genes and microRNAs. Although RNA-Seq technology is still developing, it has made substantial contributions to our understanding of many transcriptomes, from those of simple unicellular organisms to those of complex mammalian cells, as well as of tissues in normal and disease states. Still, the data from RNA-Seq are complex to analyze and very sensitive to technical biases. This chapter focused mainly on some tools/software for RNA-Seq data analysis and on platforms like R/Bioconductor and the Galaxy web server, where many of these tools can be accessed and data can be analyzed. It is worth noting that many tools mentioned in this chapter are not restricted to RNA-Seq data and may be used for other kinds of NGS data as well. There are also several other tools, software, whole analysis pipelines, and statistical strategies for analyzing RNA-Seq data that are not discussed here. Bioinformatics-based tools are progressing rapidly, and there is wide opportunity for building new tools and strategies for analyzing RNA-Seq data as well as data derived from other NGS technologies. As NGS technologies are continuously evolving, we can hope for further technical and analytical developments in RNA-Seq at lower cost in the near future.