What This Chapter Will Teach You

A gene is active when it is being transcribed into mRNA, which acts as a template for protein synthesis. Measuring how much mRNA is transcribed can therefore be used to estimate how active a gene is under a given condition. It is possible to gain insight into the regulation status of all genes of a prokaryotic strain at the same time by sequencing the total mRNA (the transcriptome) and aligning the sequences against the fully sequenced genome of the strain. The method is called RNA sequencing (RNA-seq), and this chapter will teach you how to prepare RNA material for sequencing, what to consider in the experimental design, and how the data are analyzed.

10.1 Introduction to Transcriptomics

The ability to measure how genes are regulated under certain developmental stages or physiological conditions has expanded the knowledge of the biology of both human and prokaryotic cells tremendously. A technique called Northern blotting, developed in 1977, was the first method to study gene expression (Alwine et al. 1977). More methods to investigate the regulation of genes, including quantitative PCR (qPCR) and microarrays, have since become available. These methods, however, have several limitations; in qPCR analysis, for instance, only a few genes can be studied at the same time. For all of these methods, the major drawback is that hybridization probes (needed, e.g., for Northern blotting and microarrays) and specific primers (for qPCR) need to be designed manually, and consequently, “you only find what you are looking for.” Despite the bias introduced when the researcher decides which probes to put on the array, microarrays were the gold standard for studying differential gene expression during the 2000s. As the price of high-throughput sequencing decreased rapidly in the same period, RNA sequencing (RNA-seq) is now rapidly replacing hybridization techniques in genome-wide expression studies (Wang et al. 2009).

RNA-seq relies on high-throughput sequencing and allows genome-wide detection of “active” genes by measuring their transcript levels. The technique relates to many bioinformatics techniques already described, including sequencing, annotation, databases, alignments, and BLAST (Fig. 10.1).

Fig. 10.1
A screenshot of a chart of bioinformatics techniques like sequencing, assembly, sequence quality control, databases, pair-wise alignment, molecular typing, transcriptions, and others.

Relationship of the chapter to other chapters of the book. RNA-seq relates to many preceding chapters dealing with sequence assembly, annotation, databases, alignments, and BLAST

In RNA-seq experiments, different conditions of the same prokaryotic strain are often compared. For instance, the highly oxacillin-resistant Staphylococcus aureus strain USA300 cultured in normal growth medium can be compared to the same strain cultured in growth medium with therapeutic concentrations of oxacillin (the concentration of oxacillin normally used to treat infections with oxacillin-sensitive S. aureus). RNA-seq will enable the identification of differentially expressed (DE) genes between the two conditions and may elucidate the mechanism by which S. aureus USA300 is able to withstand exposure to oxacillin. Apart from this example of how RNA sequencing can be used for functional studies, RNA-seq may also be applied for detection of SNPs, discovery of novel genes, or total transcriptome assembly, to give just some examples of the use of transcriptomic data.

In overview, the workflow for RNA-seq is relatively simple: extracted RNA is converted to cDNA; the cDNA is sequenced on a next-generation sequencing (NGS) platform such as Illumina, Helicos, or SOLiD; and finally, the sequence data are matched to genes by sequence alignment (Chaps. 2 and 4).

The application of NGS has several advantages. First, in contrast to hybridization-dependent methods, transcription is studied in an unbiased manner, since no probe sequences need to be specified. Second, the experimental design does not need to be altered in accordance with differences in genome sequences. Finally, probeless sequencing allows the discovery of new genetic features (Buermans and den Dunnen 2014). Although the analysis of microarray data is slightly more user-friendly than that of RNA sequencing data, RNA-seq gives a more comprehensive overview of the transcriptome, has a better dynamic range, and makes it possible to detect SNPs (Bester-Van Der Merwe et al. 2013). These features taken together are probably the key reasons why RNA-seq has overtaken microarrays as the method of choice for analyzing gene expression (Malone and Oliver 2011).

10.2 Experimental Design

There are several types of RNA that can be sequenced, such as total RNA, small RNA, or mRNA. mRNA constitutes only about 2% of the total RNA in the cell; however, the assumption for transcription profiling is that changes in the mRNA level correlate with the phenotype (protein expression). This chapter focuses only on the sequencing of mRNA from a single culture using the Illumina short-read technology, although mixed samples (metagenomics) and other sequencing platforms may also be suitable for RNA sequencing (Chu and Corey 2012).

The first thing to do before beginning an experiment is to decide on the experimental design (Fig. 10.2). The prokaryotic growth phase needs to be considered, since gene expression may vary significantly between the exponential and the stationary growth phase (Rolfe et al. 2012). Further points to consider are the number of replicates of each sample, whether rRNA needs to be removed, the sequencing depth required to answer the particular research or biological question, the optimal read length, and whether samples can be pooled. Finally, the type of library and the sequencing platform must be selected, including whether single-end or paired-end sequencing is required.

Fig. 10.2
A block diagram depicts an experimental design for RNA sequencing. The labels read from left to right: experimental design, isolate RNA, prepare library, sequence, and analysis, with the process given below each.

Overview of RNA sequencing workflow

It is important to make a distinction between biological and technical replicates. Biological replicates assess variation between samples, whereas technical replicates determine variation within sample preparations. RNA extracted from one sample and divided into three samples to be sequenced would be three technical replicates, whereas three samples of RNA harvested from three independently cultured colonies treated under similar conditions would represent three biological replicates. More biological replicates should always be favored over technical replicates, but take care to distribute samples evenly over time. Hence, do not extract RNA from all control samples on Monday and from all treated samples on Wednesday, as factors such as air humidity or even slight changes in the temperature of the culture water bath may influence the RNA profile. Differences between control and treated samples in this setup could then be due to external (environmental) factors rather than to the treatment. More biological replicates (at least two; three is better) will increase the statistical power of the subsequent analysis.

The adequate read depth needed has to be assessed from experiment to experiment, depending on prokaryotic species and the research question under investigation. If you do not have sufficient read depth, the vast majority of reads will be associated with the highly expressed genes, which may not be the biologically most important genes needed to answer your research question, for example, how does antibiotic x affect genome-wide gene expression (Depardieu et al. 2007). That being said, there is a trade-off between more read depth and replicates, meaning you can add more replicates rather than increase the read depth (Sims et al. 2014).

For highly expressed genes, increasing the sequencing depth has little effect on the number of differentially expressed (DE) genes detected. In this case, increasing the number of biological replicates will be more beneficial. However, for lowly expressed genes, both sequencing depth and biological replicates increase the power to detect DE. According to ENCODE guidelines, 10–20 million reads are sufficient for differential gene expression analysis, but additional unique transcripts are still being found at one billion reads (Liu et al. 2014).

The Illumina platform can perform single-end or paired-end sequencing. Paired-end sequencing is more expensive than single-end sequencing but improves mapping to repeat sequences and improves the accuracy of differential expression detection for lowly expressed genes.

10.3 Preparing an RNA-seq Library

The Illumina TruSeq RNA protocol is the most commonly used protocol. With this protocol, there are six steps in preparing an RNA-seq library:

  • Step 1: Isolation of RNA

  • Step 2: Depletion or removal of rRNA

  • Step 3: Conversion of RNA into complementary, double-stranded DNA (cDNA)

  • Step 4: Addition of sequencing adaptors

  • Step 5: PCR amplification (enrichment)

  • Step 6: Quality control of the library

10.3.1 Step 1: Isolation of RNA

There are several companies offering sequencing of RNA. It is, however, the responsibility of the researcher to provide RNA samples of high quality, which is essential for successful RNA-seq experiments. Many methods are available for purification of RNA from prokaryotic cells, including different manual protocols (e.g., an acidic phenol-chloroform RNA extraction protocol) and commercial kits (e.g., Qiagen RNA extraction kit). It is important to use the same RNA extraction protocol for all samples to be compared, since differences between protocols may slightly influence the RNA material (Sultan et al. 2014; Kumar et al. 2017). It is also important to acknowledge that, irrespective of protocol, RNA is much more fragile than DNA during extraction because of the ubiquitous and hardy RNases that degrade RNA. Furthermore, RNA extraction is challenging because of the short half-life of mRNA (Tan and Yiap 2009). Gloves should always be worn when handling or isolating RNA (human skin carries RNases), and all samples should be kept on ice. It is acceptable to spin down prokaryotic cells at room temperature, but as soon as the cells have been lysed, the temperature should be kept low (by keeping samples on ice), which will inhibit the activity of any lurking RNases. Preferably, use pipettes that are reserved for work with RNA. If that is not possible, at least make sure always to use RNase-free filter tips. Be aware that RNases are very stable and will not be eliminated by autoclaving; hence, use only certified RNase-free water.

The RNA quantity and purity can be evaluated by measuring the UV absorption of the sample using a spectrophotometer, for example, a NanoDrop spectrophotometer (Thermo Fisher Scientific). RNA has a maximum absorption at 260 nm, and the RNA concentration is determined from the OD reading at 260 nm. In addition to the OD260, measurements should also be taken at 280 nm and 230 nm. The A260/A280 ratio provides an indication of the level of protein contamination in the sample. Pure RNA has an A260/A280 ratio of 2.1; however, values between 1.8 and 2.0 are considered acceptable for many protocols. Be aware that absorbance measurements can change depending on the pH of the RNA solution. The best results are obtained when RNA is dissolved in TE buffer. In general, RNA concentrations must be above 20 μg/mL to give reliable readings.
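As a worked illustration of these readings, the following sketch converts absorbance values into a concentration and purity ratios. It assumes the commonly used conversion factor of about 40 μg/mL per A260 unit for single-stranded RNA at a 1 cm path length, which is not stated in this chapter; the function name and the example readings are hypothetical.

```python
# Sketch: RNA concentration and purity from spectrophotometer readings.
# Assumption (not from the chapter): 1 A260 unit of single-stranded RNA
# corresponds to about 40 ug/mL at a 1 cm path length.
RNA_UG_PER_ML_PER_A260 = 40.0


def rna_summary(a260, a280, a230, dilution_factor=1.0):
    """Return the concentration (ug/mL) and the two absorbance ratios."""
    concentration = a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor
    return {
        "concentration_ug_per_ml": concentration,
        "a260_a280": a260 / a280,  # indicator of protein contamination
        "a260_a230": a260 / a230,  # second purity ratio from the 230 nm reading
    }


# Hypothetical readings for one undiluted sample.
summary = rna_summary(a260=1.25, a280=0.60, a230=0.58)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these invented readings, the concentration lies above the 20 μg/mL reliability limit mentioned above, and the A260/A280 ratio is close to the value expected for pure RNA.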

The integrity of the RNA preparation must be checked, since degraded RNA will not perform well in downstream applications. The easiest and cheapest way to assess RNA integrity is to run the sample on a standard 1% agarose gel and examine the ribosomal RNA (rRNA) bands. The upper ribosomal band (23S in prokaryotic cells) should be about twice as intense as the lower band (16S in prokaryotic cells), and both bands should be crisp and tight. If the rRNA bands are of equal intensity, some degradation has probably occurred. mRNA runs between the two ribosomal bands and might be seen as a smear. This is acceptable; however, smearing below the rRNA bands suggests that the RNA is of poor quality. Higher-molecular-weight bands, which might indicate that the RNA is contaminated with DNA, must not be observed.

A second method to check the integrity of your RNA, which has become more popular, especially within microarray analyses, is to use a bioanalyzer, such as the Agilent Bioanalyzer. Bioanalyzers use small amounts of RNA (1–2 μL) and microfluidics to determine the quantity and quality of RNA samples. The analyzer measures the sizes of the rRNA bands and determines an RNA Integrity Number (RIN) to standardize between RNA samples. Bioanalyzers are expensive, but they can often be found in core facilities.

The input needed for the TruSeq Stranded mRNA kit (Illumina) is 0.1–4 μg of total RNA. When measured on the Agilent Bioanalyzer, the RIN value should be greater than or equal to 8.

10.3.2 Step 2: Depletion or Removal of rRNA

The majority of the total RNA is rRNA, so either rRNA must be depleted from the sample or mRNA must be captured. mRNA can be captured by poly(A) selection, that is, by selecting RNA with 3′ poly(A) tails, which are mature, processed, coding transcripts. Poly(A) selection is performed by mixing RNA with poly(T) oligomers covalently attached to a substrate, typically magnetic beads (Cui et al. 2010). Note that poly(A) selection is mainly suited to eukaryotic samples; prokaryotic mRNA generally lacks stable poly(A) tails, so rRNA depletion is the usual choice for prokaryotic RNA-seq. If total RNA is to be sequenced instead, rRNA must likewise be depleted. There are a number of commercially available kits for rRNA depletion, such as Ribo-Zero (Illumina).

10.3.3 Step 3: Conversion of RNA into Complementary DNA (cDNA)

The first stage is fragmentation of the RNA to be sequenced. Fragmentation is achieved by using divalent cations, under elevated temperature, which ensures good coverage of the transcriptome. The cleaved RNA fragments are copied into first strand cDNA using reverse-transcriptase and random primers.

One of the advantages of the Illumina TruSeq protocol is that it is “stranded,” meaning that it provides information on which of the two DNA strands a given RNA is derived from. This gives a more complete picture of the transcriptome (Wang et al. 2009).

The strand specificity of the samples is achieved by replacing dTTP with dUTP in the second strand synthesis process, which quenches the subsequent amplification of this strand because the polymerase used in the assay will not incorporate past this nucleotide. The final product of this step is blunt-ended, double-stranded cDNA molecules. The addition of an A-base to the ends of the cDNA prevents the blunt-ended fragments from ligating to one another during the adapter ligation reaction.

10.3.4 Step 4: Addition of Sequencing Adaptors

A ligation reaction takes place, which ligates multiple indexing adapters to the ends of the DNA fragments, preparing them for hybridization onto a flow cell.

10.3.5 Step 5: PCR Amplification (Enrichment)

The adaptor-added products are purified and enriched with PCR to create the cDNA library. This step is necessary to ensure that the sequencing signal will be strong enough to be detected unambiguously for each base of each fragment.

10.3.6 Step 6: Quality Control of the Library

To assess library quality, 1 μL of the post-enriched library is loaded on one of the following instruments: Advanced Analytical Technologies Standard Sensitivity NGS Fragment Analysis Kit (Advanced Analytical, Heidelberg, Germany) or Agilent High Sensitivity DNA Chip (Agilent, Santa Clara, USA). The size distribution of the library is checked; the DNA fragments should range from approximately 200 bp to 1 kb. The manufacturer’s instructions should be followed for the respective instrument and kit.

10.4 Sequencing

When preparation of cDNA is done with TruSeq RNA, cDNA sequencing can be done on Illumina MiSeq or HiSeq platform as described elsewhere in the book (Chap. 2). Overall, the sequencing of RNA does not differ from sequencing of genomic DNA. The suggested read length is 50–250 bp.

10.5 Data Management (Sequence Reads)

Data from sequencing will be provided in FASTQ format. Data management includes assessing the quality of the data, aligning the reads to a reference genome, and normalizing the data before the differential gene expression analysis can be conducted. Some of the most important bioinformatics programs are listed in Table 10.1, and examples of their use will be given in the text.

Table 10.1 Examples of software programs to manage and analyze RNA sequencing data

10.5.1 Raw Data

The Illumina platform provides raw sequence reads in FASTQ format, which may be stored directly at the Sequence Read Archive (SRA) (Chap. 3). When assessing the raw data, start by checking whether the FASTQ file is consistent. The format can be validated with software such as FastQValidator (https://github.com/statgen/fastQValidator). Then the base calling quality score, which is part of the sequencing data output, needs to be assessed. Quality scores reflect how confidently the right bases have been called. FastQC is an excellent tool for assessing the quality of the sequencing run. The base calling quality score is called a Phred score, Q, which is logarithmically related to the probability p that a base call is incorrect: Q = −10 log10(p). For example, a Phred score of 10 corresponds to one error in every ten base calls (Q = −10 log10(1/10)), or 90% accuracy; a Phred score of 20 corresponds to one error in every 100 base calls, or 99% accuracy. A higher Phred score thus reflects higher confidence in the reported base. There is no defined cutoff for how low an acceptable Phred score can be, but one should aim for Phred scores higher than 33. FastQC can also be used to confirm graphically that the GC content of the reads follows a roughly bell-shaped distribution.
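The relationship between Phred score and error probability can be sketched in a few lines of Python. The example also decodes a FASTQ quality line, assuming the Phred+33 ASCII encoding (an assumption about the file format, not something stated in this chapter); the function names and the quality string are invented for illustration.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score Q = -10 * log10(p) for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Convert a FASTQ quality line into a list of Phred scores.

    Assumption: each character encodes Q as the ASCII code minus `offset`
    (Phred+33 by default).
    """
    return [ord(char) - offset for char in quality]


# Hypothetical quality line of a five-base read.
scores = decode_quality_string("II5+!")
for q in scores:
    print(q, phred_to_error_probability(q))
```

The last character of the example decodes to a Phred score of 0, that is, an error probability of 1, which is the kind of base that would normally be trimmed before alignment.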

If the quality score of the run is low, it may be possible to remove unwanted parts of the raw data. The unwanted parts are either technical, such as low-quality read segments and adaptor sequences, or biological contamination, such as poly(A) tails, rRNA, mitochondrial RNA, or tRNA. When these parts have been removed, FastQC is run again, and you must decide whether to proceed with the data or to repeat the sequencing run.
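To make the idea of removing technical sequence concrete, here is a deliberately simplified sketch of adaptor and quality trimming for a single read. It only finds exact adaptor matches and only trims from the 3′ end, whereas dedicated trimming software also handles partial and mismatched adaptors; the function name, threshold, read, and adaptor fragment are all illustrative assumptions.

```python
def trim_read(sequence, qualities, adapter, min_quality=20):
    """Remove an exact adaptor match and low-quality 3' bases from one read.

    `qualities` is a list of Phred scores, one per base.
    """
    # Cut the read at the first exact occurrence of the adaptor, if any.
    adapter_start = sequence.find(adapter)
    if adapter_start != -1:
        sequence = sequence[:adapter_start]
        qualities = qualities[:adapter_start]
    # Drop trailing bases whose Phred score is below the threshold.
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


# Hypothetical read: 10 bases of insert followed by 13 bases of adaptor.
adapter = "AGATCGGAAGAGC"
read = "ACGTACGTTT" + adapter
quals = [38] * 8 + [12, 9] + [30] * 13

trimmed_seq, trimmed_quals = trim_read(read, quals, adapter)
print(trimmed_seq, trimmed_quals)
```

In this example, the adaptor is removed first, and the two low-quality bases at the new 3′ end are then trimmed as well.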

If you decide that the quality of the reads is sufficient to proceed, the sequences are ready to be aligned to a reference genome.

10.5.2 Alignments of Sequence Reads

Now the reads need to be aligned to the genome(s) of the organism(s) from which the RNA sample(s) were obtained; this is called read alignment or mapping. While the term “alignment” describes the process of finding the position of a sequencing read on the reference genome, “mapping” refers to assigning already aligned reads to transcripts, which is also called quantification. The general challenge of short-read alignment is to map millions of reads accurately and in a reasonable time, despite the presence of sequencing errors, genomic variation, and repetitive elements. There are several tools to assist in the alignment, depending on whether the reads are aligned to a reference genome or assembled de novo, in which case the reads first need to be assembled into longer contigs. These contigs can then be considered as the expressed transcriptome to which reads are re-mapped for quantification.
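The quantification step, assigning already aligned reads to genes, can be illustrated with a toy example. The gene coordinates, read positions, and the rule that a read must lie entirely within a gene are all assumptions made for this sketch; real quantification software applies more refined overlap rules and takes strand information into account.

```python
# Hypothetical annotation: gene name -> (start, end), 1-based and inclusive.
genes = {"geneA": (100, 499), "geneB": (600, 1199), "geneC": (1500, 1799)}

# Hypothetical alignment result: leftmost genome position of each read.
read_starts = [120, 130, 480, 610, 615, 700, 1100, 1300, 1600]
READ_LENGTH = 50


def count_reads_per_gene(genes, read_starts, read_length):
    """Count the reads that fall entirely within each annotated gene."""
    counts = {name: 0 for name in genes}
    for start in read_starts:
        end = start + read_length - 1
        for name, (gene_start, gene_end) in genes.items():
            if gene_start <= start and end <= gene_end:
                counts[name] += 1
                break  # genes do not overlap in this toy annotation
    return counts


counts = count_reads_per_gene(genes, read_starts, READ_LENGTH)
print(counts)
```

Reads that extend beyond a gene boundary or fall between genes are left unassigned here; the resulting count table is the input for the normalization described in the next section.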

10.5.3 Normalization of Data

Normalization is a process designed for adjustment for sequencing depth and compositional bias. It is done to identify and remove systematic technical differences between samples that occur in the data and to ensure that technical bias has minimal impact on the results. The most common need for normalization is related to differences in the total number of aligned reads.

Library size influences read counts. If one library is sequenced to 20 M reads and another to 40 M, most genes in the latter will have approximately double the counts. Library composition also matters, as highly expressed genes may be overrepresented at the cost of lowly expressed genes. One way to normalize data is by the use of Reads Per Kilobase per Million mapped reads (RPKM) (for single-end RNA-seq) or Fragments Per Kilobase per Million mapped fragments (FPKM) (for paired-end RNA-seq). RPKM is calculated by dividing the number of reads for gene A by the product of the length of gene A (in kilobases) and the total number of mapped reads (in millions).

$$ \mathrm{RPKM}=\frac{\mathrm{Reads}\ \mathrm{for}\ \mathrm{gene}\;\mathrm{A}}{\mathrm{Length}\ \mathrm{of}\ \mathrm{gene}\;\mathrm{A}\ \left(\mathrm{kb}\right)\times \mathrm{Total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{reads}\ \left(\mathrm{millions}\right)} $$

This formula normalizes read counts for: (1) The sequencing depth, since sequencing runs with more depth will have more reads mapping to the gene, and (2) the length of the gene, since longer genes will have more reads mapping to them.
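A small worked example may help here. The sketch below computes RPKM for one hypothetical gene in two libraries sequenced to 20 M and 40 M reads, with the gene length expressed in kilobases and the library size in millions of reads; the function name and all numbers are invented for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    reads_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * reads_millions)


# Hypothetical 2,000 bp gene, equally expressed in two libraries of
# different sequencing depth.
library_1 = rpkm(read_count=400, gene_length_bp=2_000, total_mapped_reads=20_000_000)
library_2 = rpkm(read_count=800, gene_length_bp=2_000, total_mapped_reads=40_000_000)
print(library_1, library_2)
```

Although the raw count in the second library is twice as high, both libraries give the same RPKM value, which is the intended effect of correcting for sequencing depth.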

An alternative that might be considered a better solution is TPM (Transcripts Per Million). TPM is very similar to RPKM/FPKM. The only difference is the order of operations:

  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).

  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.

  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
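The three steps above can be sketched as a short Python function. The gene names, counts, and lengths are hypothetical and serve only to show the order of operations.

```python
def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million for one sample, following the three steps."""
    # Step 1: reads per kilobase (RPK) for every gene.
    rpk = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1_000)
        for gene in read_counts
    }
    # Step 2: the "per million" scaling factor of the sample.
    scaling_factor = sum(rpk.values()) / 1_000_000
    # Step 3: divide each RPK value by the scaling factor.
    return {gene: value / scaling_factor for gene, value in rpk.items()}


# Hypothetical counts and gene lengths for a three-gene sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1_000, "geneB": 3_000, "geneC": 2_000}

tpm_values = tpm(counts, lengths)
print(tpm_values)
print(sum(tpm_values.values()))
```

Because the length correction is applied before the depth correction, the TPM values of every sample sum to one million (up to floating-point rounding), which makes proportions directly comparable between samples. In this example, geneA and geneB receive the same TPM even though their raw counts differ threefold, because geneB is three times longer.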

Normalization for the library size may also be done using different software, such as DESeq2 and edgeR (Table 10.1).

Gene length influences the count, as longer transcripts generate more reads. However, the transcript length does not differ between samples. Since it is the relative difference in expression of the same gene between samples that is of interest, counts do not need to be normalized for gene length in a differential expression analysis.

10.6 Differential Gene Expression

RNA-seq is a relative abundance measurement technology. The primary goal of the differential gene expression analysis is to quantitatively measure differences in the levels of transcripts between two or more treatments or groups.

Step one in any analysis is always the same: plot the data! You may use a principal component analysis (PCA) or something similar to plot the data. This plot gives a good overview of how similar the different replicates are in their RNA composition (the closer, the better) and will give a first impression of whether you can expect to find interesting differences between your samples (Fig. 10.3).

Fig. 10.3
A scatter plot giving an overview of how similar the replicates are in RNA composition, plotting PC2 (24.1%) versus PC1 (48.3%).

Examples of a principal component analysis (PCA) to visualize the global gene expression of Escherichia coli under four different treatment conditions (C, T, S, ST). All conditions were performed in triplicates. Samples located together have similar gene expression. Genes (gray dots) located in the same direction as samples have higher expression in those samples

The actual identification of differentially expressed genes (DEGs) is typically done in the statistical programming environment R with either edgeR or DESeq2 and will result in a plot similar to Fig. 10.4. In Fig. 10.4, black dots represent genes that are expressed at the same level in both conditions, while each red dot is a gene that is expressed differently between the two sample conditions (“C” and “S” in Fig. 10.4). The X-axis tells you how much each gene is transcribed, while the Y-axis tells you how big the relative difference is between “C” and “S.” The red genes are therefore your interesting genes. If you know what you are looking for, you can see whether the experiment validates your hypothesis. If you don’t know what you are looking for, you can see whether certain pathways are enriched under the different sample conditions using, for example, KEGG Mapper (https://www.genome.jp/kegg/mapper/) or Cytoscape (Cline et al. 2007).

Fig. 10.4
An MA plot of log2 fold change versus mean expression (base mean). Dots denote genes; genes that are expressed the same way are distinguished from genes that are differentially expressed.

This MA plot shows the fold change between the two treatment conditions of Escherichia coli (C and S) compared (log2 scaled) as function of the average expression of each gene. Each point represents a gene; red indicates an adjusted p-value <0.01
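The two coordinates of each point in an MA plot can be computed as in the following sketch. It assumes already normalized mean counts per condition and adds a pseudocount of 1 to avoid taking the logarithm of zero; both the pseudocount and the gene values are assumptions for illustration. The sketch performs no statistical testing, so the adjusted p-values used to color the points would still have to come from a package such as edgeR or DESeq2.

```python
import math


def ma_values(mean_control, mean_treated, pseudocount=1.0):
    """Return (average expression, log2 fold change) for one gene.

    Assumption: both inputs are normalized mean counts; the pseudocount
    keeps the ratio defined when a gene has zero counts in one condition.
    """
    log2_fold_change = math.log2(
        (mean_treated + pseudocount) / (mean_control + pseudocount)
    )
    average_expression = (mean_control + mean_treated) / 2
    return average_expression, log2_fold_change


# Hypothetical normalized mean counts: gene -> (condition C, condition S).
genes = {"geneA": (99, 399), "geneB": (50, 50), "geneC": (799, 199)}

for name, (control, treated) in genes.items():
    average, change = ma_values(control, treated)
    print(f"{name}: mean expression = {average:.1f}, log2 fold change = {change:+.2f}")
```

A log2 fold change of +2 corresponds to a fourfold increase in the second condition and −2 to a fourfold decrease, while genes with unchanged expression lie on the horizontal line at zero.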

10.7 Conclusion

Since the first reported studies using RNA-seq were published in 2008, our understanding of gene expression has been taken to a new level. However, there are still some technical issues awaiting resolution, such as the PCR amplification stage of the library construction, which may result in redundant sequence reads and bias in the final dataset. Nevertheless, RNA-seq holds the promise to continue to replace other genome-wide expression analyses in the future and will likely aid in refining our understanding of gene regulation in prokaryotes.