19.1 Introduction

The genome of an organism determines its phenotype by setting the range of variability for numerous traits. Environmental factors shape the phenotype within this predetermined range. Knowledge about the genome and genes of a species facilitates various biological research projects. Research on Arabidopsis thaliana (A. thaliana) Columbia–0 was boosted by the availability of the first plant genome sequence (Somssich 2018). The transcriptome of an organism reveals which parts of the genome are ‘active’ at a certain point in time, under specific conditions, and in a defined cell type. Since the nucleic acid types DNA and RNA have very similar biochemical properties, the investigation of genome and transcriptome can be performed by similar methods. Both omics layers, genomics and transcriptomics, are easily accessible by analytic methods, because general biochemical properties of these nucleic acids are independent from the actual sequence. The intention of this chapter is (1) to describe genomics and transcriptomics workflows which are commonly used in plant research, and (2) to list frequently deployed bioinformatic tools for the analysis steps (Fig. 19.1).

Fig. 19.1

Selected genomics and transcriptomics workflows in plant sciences. These workflows are deployed in many studies in plant research and the listed tools can be applied to perform the displayed steps. Several alternative and additional tools are listed within this chapter

19.2 Sequencing Technologies

Existing sequencing technologies can be grouped into different generations based on their key properties. However, there is disagreement in the literature about this classification system and the assignment of technologies to different generations (Metzker 2010; Shendure et al. 2004; Shendure and Ji 2008; Schadt et al. 2010; Glenn 2011; Quail et al. 2012; Goodwin et al. 2016; Peterson and Arick 2018). Here, we distinguish between three generations: (I) Sanger chain termination sequencing and Maxam Gilbert sequencing as first generation sequencing technologies, (II) Roche/454 pyrosequencing, IonTorrent, Solexa/Illumina, and Beijing Genomics Institute (BGI) sequencing as second generation sequencing technologies, and (III) Single molecule real time sequencing (Pacific Biosciences, PacBio) and nanopore sequencing (Oxford Nanopore Technologies, ONT) as third generation sequencing technologies. Technical details of these sequencing technologies were reviewed elsewhere (Metzker 2009, 2010; Shendure and Ji 2008; Quail et al. 2012; Goodwin et al. 2016; Margulies et al. 2005; Mardis 2008a).

Since the invention of chain termination sequencing (Sanger and Coulson 1975; Sanger et al. 1977), substantial technological advances have paved the way for cost reductions. As a result, the broad application of high throughput sequencing (Metzker 2010) and, more recently, of long read sequencing technologies (Li et al. 2017) became possible. Sanger sequencing generates a single read per sample, while other technologies produce large numbers of reads per sample and are hence crucial for many genome sequencing projects. Reads produced by Roche/454 pyrosequencing and IonTorrent are comparable in length to Sanger reads, but have a reduced accuracy. Illumina, in contrast, has been dominating the market for high throughput sequencing with substantially shorter reads, owing to the high accuracy and low costs of its sequencing technology. The BGI became a serious competitor during the past years and now offers the generation of similar sequencing data-sets based on its own technologies. While Illumina sequencing platforms are distributed all around the globe, BGI sequencing technology is exclusively available in China.

Paired-end sequencing provides the opportunity to analyze both ends of the same molecule. Overlapping reads, e.g. 2 × 300 nt, can be merged, leading to a total length of just under 600 nt. Sophisticated approaches like TruSeq synthetic long reads (McCoy et al. 2014) were developed to extend the read length of second generation technologies to several thousand nucleotides. Mate pair reads provide information about the distance between both reads in addition to the mere sequences of both reads. In mate pair sequencing, long DNA fragments are modified at their ends, circularized, and fragmented. Fragments carrying the marked ends are enriched and finally sequenced as paired-end libraries. The size of the initial fragments determines the distance between the two generated reads and thus provides valuable linkage information during genome assembly.
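The merging of overlapping paired-end reads can be illustrated with a minimal sketch. It assumes error-free reads and an exact overlap, which real read mergers do not require; the function names and the `min_overlap` parameter are hypothetical and only serve the illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def merge_pair(forward: str, reverse: str, min_overlap: int = 10):
    """Merge a read pair if the 3' end of the forward read overlaps exactly
    with the reverse complement of the reverse read; otherwise return None.
    Simplification: mismatches in the overlap are not tolerated."""
    reverse_rc = reverse_complement(reverse)
    longest = min(len(forward), len(reverse_rc))
    # Test the longest possible overlap first and shrink step by step.
    for overlap in range(longest, min_overlap - 1, -1):
        if forward[-overlap:] == reverse_rc[:overlap]:
            return forward + reverse_rc[overlap:]
    return None
```

The merged length equals the sum of both read lengths minus the overlap, which is why two 300 nt reads always yield less than 600 nt.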

However, the length of reads generated by mate pair sequencing is inferior to that of reads generated by Oxford Nanopore Technologies (ONT) and Pacific Biosciences. For ONT, the longest sequenced DNA molecule reported to date exceeds 2 Mbp (Payne et al. 2018) and the longest single read is close to 1 Mbp (Jain et al. 2018). Dropping sequencing costs and the rise of long read technologies enabled sequencing projects for numerous plant species (Bolger et al. 2014a; Jiao and Schneeberger 2017; Chen et al. 2018). Nevertheless, short reads are still valuable in projects, e.g. RNA-Seq or re-sequencing projects, where a high number of tags is more important than the read length.

In addition to generating extremely long reads at low costs, ONT also provides the first portable sequencers, namely MinION and Flongle, that can be deployed in field applications (Tyler et al. 2018; Pomerantz et al. 2018). Sequencing in the field opens up opportunities to monitor pathogens accurately (Hu et al. 2019) and to assess biodiversity (Pomerantz et al. 2018). Real time base calling and the start of downstream analysis before completion of a sequencing run are beneficial when decisions are time critical (Stoiber and Brown 2017). Moreover, real time analysis allows researchers to stop the sequencing process once sufficient data have been generated and to commit the remaining sequencing capacity to other projects (Nguyen et al. 2017).

19.3 Genomics

19.3.1 Genome Assembly

Quality control and preprocessing

Quality checks via FastQC (Andrews 2010) or MultiQC (Ewels et al. 2016) are usually the first steps to assess the quality of sequencing data. Next, reads need to be preprocessed prior to a de novo assembly, while this is not necessary for other applications like read mapping. Low quality sequences and remaining adapter fragments are removed during the trimming process, e.g. by Trimmomatic (Bolger et al. 2014b). Removal of adapter sequences is especially important for de novo genome assemblies, because these sequences can occur in independent reads and cause the misjoining of unrelated sequences into contigs.
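The idea behind quality trimming can be sketched as a sliding window over the Phred scores of a read. This is a simplified illustration and not the implementation of Trimmomatic or any other tool; the function names, the window size, and the thresholds are assumptions chosen for the example.

```python
def phred_scores(quality_string: str, offset: int = 33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def trim_read(seq: str, quals, window: int = 4,
              min_mean_quality: float = 20.0, min_length: int = 36):
    """Cut the read at the start of the first window whose mean quality
    drops below the threshold. Return None if the remainder is too short."""
    for start in range(0, len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean_quality:
            seq, quals = seq[:start], quals[:start]
            break
    if len(seq) < min_length:
        return None  # read is discarded
    return seq, quals
```

Adapter removal would be an additional step that compares read ends against known adapter sequences; it is omitted here to keep the sketch short.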

Assembly concept

A read can only represent a fraction of a complete genome sequence. Hence, intense manual work or the application of sophisticated bioinformatic tools is necessary to reconstruct complete genome sequences based on sequence reads (Mardis 2008b; Chaisson et al. 2009; Myers 2016). Initial sequencing projects involved the cloning of genomic fragments into vectors like bacterial artificial chromosomes (BACs) prior to sequencing. Genome sequences were generated by sequencing several BACs consecutively and combining the BAC sequences almost manually.

Second generation genome assemblies

In particular, the rise of high throughput sequencing methods caused a shift from manually curated BAC-based high continuity genome sequences towards whole genome shotgun draft assemblies. Dedicated assemblers were developed to harness the full potential of the available data types, for example combinations of paired-end and mate pair data. SOAPdenovo2 (Luo et al. 2012), ALLPATHS-LG (Gnerre et al. 2011), Platanus (Kajitani et al. 2014), and the proprietary CLC assembler (QIAGEN 2016) are examples of tools which were successfully deployed for the assembly of plant genomes, but there are also many alternatives (Table 19.1). Parameters, especially k-mer sizes, should be optimized empirically (Bradnam et al. 2013; Chikhi and Medvedev 2014; Shariat et al. 2014; Salzberg et al. 2012). In addition, the best combination of data from multiple sequencing libraries and sequencing technologies needs to be identified. After the generation of contigs in the assembly process, the information of mate pair and paired-end data-sets can be used to connect contigs to scaffolds without knowing the sequence enclosed between contigs of a scaffold. While some assemblers provide this functionality, dedicated tools like SSPACE (Boetzer et al. 2011) are also available. Next, gaps between contigs within a scaffold can be partially closed, e.g. via GapFiller (Boetzer and Pirovano 2012) or Sealer (Paulino et al. 2015). The reduced sequencing costs allowed the assembly of plant genome sequences by single groups (Pucker et al. 2016), but most of these genome sequences were highly fragmented. More recently, the proprietary NRGene assembler (DeNovoMAGIC™) and the competing open source alternative TRITEX (Monat et al. 2019) have been promising substantially improved assemblies.

Table 19.1 Assembler for second generation sequencing data

Third generation genome assemblies

The assembly situation changed again when long reads became available, thus enabling the generation of high continuity genome assemblies for numerous plant species with moderate effort (Michael et al. 2018; Pucker et al. 2019; Copetti et al. 2017; Lightfoot et al. 2017). The technological boost on the sequencing side caused an explosion in the development of novel assemblers and read correction tools which can handle noisy long reads efficiently (Table 19.2). FALCON (Chin et al. 2016), Canu (Koren et al. 2017), Flye (Kolmogorov et al. 2019), Miniasm (Li 2016), and wtdbg2 (Ruan and Li 2019) are examples of frequently applied assemblers. Depending on the sequencing coverage and repeat content, the computational costs of assemblies can be high. Several hundred CPU hours, some hundred GB of RAM, and several TB of disk space are often required to assemble plant genomes. Assembled contigs can be joined into scaffolds based on additional information like genetic linkage (Pucker et al. 2019; Gan et al. 2016), optical mapping information, e.g. from Bionano Genomics and OpGen (Jiao et al. 2017; Lin et al. 2012; Tang et al. 2015), and Hi-C (Jiao et al. 2017; Burton et al. 2013; Phillippy 2017). Genetic linkage can rely on molecular markers measured in the lab (Pucker et al. 2019) or on sequencing of multiple individual plants of a segregating population by a high throughput method (Gan et al. 2016). Optical mapping is a size estimation of large DNA fragments which are generated by enzymatic restriction digest and cut site specific labeling with fluorescent dyes. Hi-C captures the spatial (3D) proximity of genomic loci and builds on the assumption that loci which are close in 3D space are also likely to be neighbors in the linear sequence.

Table 19.2 Third generation assembler

Due to the high error rate in long reads, raw assemblies require several polishing steps. Firstly, long reads are aligned to the assembly for correction, e.g. via BLASR (Chaisson and Tesler 2012) or minimap2 (Li 2018). Arrow (Chin et al. 2016) can be applied to polish assemblies based on PacBio reads, while nanopolish (Loman et al. 2015) is a common choice for ONT reads. Secondly, highly accurate short reads are mapped to the assembly to further correct the sequence in single copy regions. Paired-end or mate pair reads provide higher specificity during the mapping compared to single end reads. BWA-MEM (Li 2013) is a suitable read mapping tool and Pilon (Walker et al. 2014) can be used for the detection and correction of assembly errors. Iterative rounds of correction are possible, and there is still an ongoing debate about the optimal number of polishing rounds (Koren et al. 2017; Vaser et al. 2017). Since the most frequent error types are insertions/deletions, open reading frames are often affected by apparent frameshifts and premature stop codons. Therefore, the contribution of polishing approaches can be benchmarked based on the increase or decrease of frameshifts and premature stop codons in protein encoding genes. The optimal number of correction rounds can be determined by minimizing the number of these variants.
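The benchmarking idea described above can be sketched as follows: for each polishing round, count the coding sequences that show an apparent frameshift or a premature stop codon, and select the round with the lowest count. The input format (a dictionary mapping a round number to a list of coding sequences extracted from that assembly version) and all function names are assumptions made for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_disrupted_orf(cds: str) -> bool:
    """True if a coding sequence shows an apparent frameshift (length not
    divisible by three) or a stop codon before the final codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return True
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def count_disrupted(cds_list) -> int:
    """Number of coding sequences with a disrupted open reading frame."""
    return sum(has_disrupted_orf(cds) for cds in cds_list)


def best_polishing_round(cds_per_round: dict):
    """Return the round with the fewest disrupted open reading frames."""
    return min(cds_per_round, key=lambda rnd: count_disrupted(cds_per_round[rnd]))
```

Genuine biological frameshifts and pseudogenes would also be counted, so the absolute numbers are less informative than the change between rounds.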

Assembly validation

After combining reads into contigs, the correctness of these connections needs to be assessed. This assembly validation can be performed by mapping all reads back to the generated sequence, e.g. via BWA-MEM (Li 2013), and analyzing the distances of paired reads in this mapping, e.g. via REAPR (Hunt et al. 2013). Alternative approaches such as the one implemented in KAT (Mapleson et al. 2017) inspect the assembly based on the k-mers it contains. Most genome sequencing projects involve the generation of multiple assemblies with different tools and parameter settings. Selection of the best assembly can be challenging and the criteria depend on the research questions. The largest reasonable assembly, the assembly with the highest continuity, or the assembly resolving the highest number of genes might be of interest. Benchmarking Universal Single-Copy Orthologs (BUSCO) (Simão et al. 2015) is a frequently applied method to assess the assembly completeness and correctness. The underlying assumption is that all benchmarking genes should appear exactly once in the assembly. Different benchmarking sets exist for different taxonomic groups (Kriventseva et al. 2019). Due to a large phylogenetic distance to other sequenced species, this assumption might not be perfectly accurate for the species of interest. However, the detection of single copy and complete genes is a good indicator of a high quality assembly. High numbers of duplicated BUSCOs can indicate separated haplophases. Recently, DOGMA (Dohmen et al. 2016) was released as an alternative tool for the analysis of sequence set completeness which also comes with an online version (https://domainworld-services.uni-muenster.de/dogma).
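When assemblies are compared by continuity, one widely used summary statistic is the N50: the contig length at which contigs of this length or longer contain at least half of the total assembly size. The chapter does not prescribe this metric, so the following is only a sketch of one possible continuity measure; the function name is hypothetical.

```python
def n50(contig_lengths) -> int:
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(contig_lengths)
    running_sum = 0
    # Walk through contigs from longest to shortest until half of the
    # total assembly size is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_sum += length
        if running_sum * 2 >= total:
            return length
    return 0
```

A high N50 alone does not prove correctness, because misjoined contigs also increase it; it is therefore best interpreted together with read mapping and BUSCO results.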

19.3.2 Gene Prediction

After generation and polishing of an assembly, the prediction of genes is often the next step. Besides protein encoding genes, there are also various RNA genes, transposable element genes, and numerous repeats which should be annotated as part of a genome project. In general, predictions are distinguished into (I) intrinsic approaches, which rely only on sequence properties, and (II) extrinsic approaches, which harness sequence similarity to previously annotated sequences to transfer annotation. However, frequently applied tools are designed to harness the power of both approaches (Table 19.3). AUGUSTUS (Stanke et al. 2006; Hoff and Stanke 2019) and GeneMark derivatives (Lomsadze et al. 2005, 2014; Ter-Hovhannisyan et al. 2008; Borodovsky and Lomsadze 2011) can predict genes ab initio without any external information. BUSCO can be applied to generate parameter files for this gene prediction process by assessing the gene structure of BUSCO genes (Waterhouse et al. 2018). In contrast to these ab initio approaches, GeMoMa (Keilwagen et al. 2016, 2018) combines external hints to construct a gene annotation based on sequence alignments. The exon-intron structure of plant genes poses a challenge to the gene prediction process, because tools need to account for interruptions of an open reading frame by on average four to five introns per gene (Pucker and Brockington 2018). Intron borders are often detected based on their conserved sequences: GT at the 5′ end and AG at the 3′ end. However, on average at least 5% of all plant genes contain non-canonical splice sites, i.e. deviations from the GT-AG combination (Pucker and Brockington 2018; Pucker et al. 2017). Most gene prediction tools exclude non-canonical splice sites at least in the ab initio mode, because the number of possible gene models increases substantially when permitting many more possible intron positions. Therefore, external hints for intron positions are crucial to achieve an accurate prediction.
If the identification of all isoforms of a gene is of interest, the accurate annotation of all exon-intron borders is especially important. Expressed sequence tags (ESTs), contigs of a transcriptome assembly, or unassembled RNA-Seq reads can be aligned to the genomic sequence to generate hints. These sequences should originate from a broad range of different samples, e.g. collected under different environmental conditions, from different tissues, and from different developmental stages. The accurate alignment of transcript sequences to an assembly requires dedicated tools to account for introns. While BLAT (Kent 2002) can align long sequences, STAR (Dobin et al. 2013; Dobin and Gingeras 2015) is well suited for the split alignment of RNA-Seq reads. Dedicated tools like exonerate (Slater and Birney 2005) allow the alignment of previously annotated peptide sequences from other species. Resulting alignments can be converted into gene prediction hints. Annotation pipelines like MAKER2 (Holt and Yandell 2011), BRAKER1 (Hoff et al. 2016), and Gnomon (Souvorov et al. 2010) can integrate the information from different hint sources with ab initio prediction. While the prediction of protein encoding parts of a gene works relatively well, the annotation of untranslated regions (UTRs) and other non-coding sequences is still associated with a higher uncertainty (Pucker et al. 2017; Haas et al. 2002; Fickett and Hatzigeorgiou 1997). The quality of the gene prediction process is in general not keeping pace with the rapid improvement of sequencing capacities and the frequent generation of highly contiguous assemblies (Salzberg 2019).
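The distinction between canonical (GT-AG) and non-canonical splice sites can be checked directly on annotated intron coordinates. The sketch below assumes forward-strand introns given as 0-based, end-exclusive coordinates; these conventions and the function names are assumptions for the example, and reverse-strand genes would require reverse complementing first.

```python
def splice_site_type(sequence: str, intron_start: int, intron_end: int) -> str:
    """Classify an intron by its terminal dinucleotides.
    Coordinates are 0-based and end-exclusive (forward strand assumed)."""
    intron = sequence[intron_start:intron_end].upper()
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical"
    return "non-canonical"


def fraction_genes_with_non_canonical(sequence: str, genes: dict) -> float:
    """genes maps a gene ID to a list of (intron_start, intron_end) tuples.
    Returns the fraction of genes with at least one non-canonical intron."""
    if not genes:
        return 0.0
    affected = sum(
        any(splice_site_type(sequence, start, end) == "non-canonical"
            for start, end in introns)
        for introns in genes.values()
    )
    return affected / len(genes)
```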

Table 19.3 Plant gene prediction tools

Technological progress allows the systematic investigation of non-protein encoding genes, e.g. through RNA-Seq experiments dedicated to the analysis of short RNAs. INFERNAL (Nawrocki and Eddy 2013) and tRNAscan-SE2 (Chan et al. 2019) are tools for the prediction of pure RNA genes.

Masking of repeats, e.g. via RepeatMasker (Smit et al. 2015), is frequently performed prior to the prediction of protein encoding genes, but this can have almost no effect or even detrimental effects on the prediction accuracy for certain gene families (Bayer et al. 2018). Although transposable elements (TEs) and other repeats account for the major proportion of many plant genomes (Michael 2014; Vicient and Casacuberta 2017), the annotation of repeats is often performed poorly or omitted completely (Flutre et al. 2011; El Baidouri et al. 2015; Hoen et al. 2015). There is a plethora of annotation tools like RepeatScout (Price et al. 2005) and RepeatMasker (Smit et al. 2015). Bioinformatic pipelines were developed to account for weaknesses of single tools and to combine the strengths of many individual tools (Estill and Bennetzen 2009; Saha et al. 2008; Bergman and Quesneville 2007). One major issue with the TE and repeat annotation is the lack of a universal benchmarking study which could point to the best tool for certain purposes (Hoen et al. 2015; Lerat 2010). While the annotation of protein encoding genes can be checked for completeness based on BUSCO (Simão et al. 2015) and DOGMA (Dohmen et al. 2016; Kemena et al. 2019), there is no such benchmarking data-set available for TEs.

Application examples

Sequencing the genome of a plant species can provide insights into specific adaptations to local environmental conditions. Crucihimalaya himalaica grows at high altitudes in the Himalayas, and its genome sequence reveals a reduced number of pathogen response genes as well as an increased number of DNA repair genes, which are interpreted as responses to a reduced pathogen load and an increased UV exposure, respectively (Kemena et al. 2019).

19.3.3 Re-sequencing and Variant Calling

Once a suitable reference genome sequence is available, re-sequencing projects can bypass the laborious and expensive assembly step. Reads can be mapped to a reference sequence to identify differences between individuals of the same species or even between closely related species. Since the re-sequencing data-set does not need to provide sufficient data for a de novo assembly, the costs for re-sequencing are low compared to the initial genome project. Re-sequencing of 1135 A. thaliana accessions revealed insights into the genomic diversity of this species (Alonso-Blanco et al. 2016). Since accessions are adapted to local environmental conditions, this project can reveal insights into adaptation mechanisms. Sequencing data also advance the understanding of population structures, genomic diversity between accessions, and genome evolution.

BWA-MEM (Li 2013) and bowtie2 (Langmead and Salzberg 2012) are frequently applied tools for the mapping of reads to a reference sequence (Table 19.4). The removal of PCR duplicates is necessary to avoid introducing a bias into subsequent coverage analyses or variant calling. PCR duplicates are reads originating from the same DNA fragment, which was amplified by PCR during the sequencing library preparation step. Functions like MarkDuplicates of Picard tools (Broad Institute 2019) allow the identification and removal of reads or read pairs originating from identical PCR products. This removal can be based on identical read sequences or on identical positions in the mapping to a reference sequence. The detection of copy number variations depends on the equal representation of all genomic parts in the reads, so PCR duplicates could cause the identification of false positive duplications. The identification of sequence variants is also sensitive to PCR duplicates, because a minimal number of reads displaying a variant is frequently used as a filter criterion to remove false positive variant calls. PCR duplicates can produce high numbers of identical reads which all display an apparent variant caused by a PCR error in an early amplification step.
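Position-based duplicate removal can be sketched in a few lines. This is a deliberately simplified illustration for single-end reads and does not reproduce the logic of Picard MarkDuplicates or any other tool; the `MappedRead` record and the rule of keeping the read with the highest mapping quality are assumptions made for this example.

```python
from typing import NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical minimal record of a mapped single-end read."""
    name: str
    chrom: str
    pos: int      # leftmost mapping position
    strand: str   # "+" or "-"
    mapq: int     # mapping quality


def remove_duplicates(reads):
    """Keep one read per (chromosome, position, strand) combination,
    preferring the read with the highest mapping quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())
```

For paired-end data, the positions of both mates would have to be part of the key, which makes the accidental removal of independent fragments less likely.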

Table 19.4 Read mapping tools

There are numerous tools for the detection of genomic differences based on a short read mapping (Table 19.5). Genome Analysis Tool Kit (GATK) (McKenna et al. 2010; Van der Auwera et al. 2013), samtools/bcftools (Li et al. 2009a), and VarDict (Lai et al. 2016) can detect single nucleotide variations (SNVs) and small insertions/deletions (InDels). The rise of long read sequencing technologies added substantially to the sensitivity of the insertion/deletion detection. Moreover, long reads allow the identification of large scale structural rearrangements. GraphMap (Sović et al. 2016), marginAlign (Jain et al. 2015), and PoreSeq (Szalay and Golovchenko 2015) can align long reads to a reference sequence to call variants. Other tools like SVIM (Heller and Vingron 2018) rely on alignments generated by dedicated long read aligners like minimap2 (Li 2018) or BLASR (Chaisson and Tesler 2012). Identified variants can be subjected to downstream filtering, e.g. based on the number of supporting and contradicting reads.
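A downstream filter based on supporting and contradicting reads could look like the following sketch. The thresholds are hypothetical defaults that might suit homozygous variants in an inbred line; they are not recommendations from the cited tools and would need to be adapted to ploidy, zygosity, and sequencing depth.

```python
def passes_read_support_filter(ref_reads: int, alt_reads: int,
                               min_alt_reads: int = 5,
                               min_alt_fraction: float = 0.8) -> bool:
    """Accept a variant only if enough reads support the alternative allele
    and these reads clearly outnumber the contradicting (reference) reads."""
    depth = ref_reads + alt_reads
    if depth == 0:
        return False  # no coverage, no decision possible
    return alt_reads >= min_alt_reads and alt_reads / depth >= min_alt_fraction
```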

Table 19.5 Variant callers

Once the variants are identified, it is possible to assign functional annotations. Established tools for this purpose are SnpEff (Cingolani et al. 2012) and ANNOVAR (Wang et al. 2010). Based on the structural annotation of the reference sequence, SnpEff and ANNOVAR assign functional implications like “premature stop codon” or “frameshift” to single variants. Since these tools are predicting the effect for a single variant at a time, NAVIP (Baasner et al. 2019) was developed for the integrated annotation of all variants within one coding sequence. NAVIP accounts for combined effects of neighboring variants, e.g. two short InDels which are both causing a frameshift on their own, but result in a few substituted amino acids when considered together.
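The difference between a per-variant and a combined assessment can be illustrated with a minimal frame calculation. This sketch only considers the net length change of InDels within one coding sequence and is not the NAVIP algorithm; the variant representation as (position, reference allele, alternative allele) tuples is an assumption.

```python
def shifts_frame(ref_allele: str, alt_allele: str) -> bool:
    """True if a single variant changes the length by a non-multiple of three."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


def combined_shifts_frame(variants) -> bool:
    """True if the net length change of all variants in one coding
    sequence is not a multiple of three."""
    net_change = sum(len(alt) - len(ref) for _, ref, alt in variants)
    return net_change % 3 != 0


# Hypothetical example: a 1 bp deletion followed by a 1 bp insertion.
example_variants = [(10, "AT", "A"), (16, "G", "GC")]
individual = [shifts_frame(ref, alt) for _, ref, alt in example_variants]
combined = combined_shifts_frame(example_variants)
```

In this example each variant is a frameshift on its own, while the combination restores the reading frame downstream of the second InDel, so only the codons between both positions are altered.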

19.3.4 Mapping by Sequencing

Forward genetics

Forward genetics describes the genetic screening of mutants which have been isolated based on an outstanding phenotype (Schneeberger 2014). Crossing a mutant with a wild type plant and selfing of the F1 offspring leads to a segregating F2 population. A large segregating population forms the basis for a forward genetics screen. Such a population contains members with the wild type and mutant phenotypes, respectively. Except for the causal locus, the genotypes of this population should display a random distribution of alleles. Since this population is used for genetic mapping, it is called a mapping population. Genetic markers located near the causal mutation will co-segregate with this mutation. As a result of this linkage between the causal locus and flanking markers, one allele of the flanking markers should be over-represented in the mutant plants. Due to a gradually decreasing linkage, the frequency of the coupled marker allele should drop when moving away from the causal locus. Therefore, the allele frequency can be used to pinpoint loci of interest. Originally, the identification of the location of the causal mutation in the genome of a mutant was a lengthy procedure requiring a high number of genetic markers. Once a target region had been identified, this region was screened for candidate genes. In order to validate the link between the assumed candidate gene and the expected phenotype, complementation experiments were frequently conducted. In subsequent studies, the molecular function of the mutated gene was often elucidated.

Next generation forward genetics

Technological advances in next generation sequencing enable the use of small sequence variants as genetic markers. Since these small sequence variants occur in large numbers, the resolution of the resulting genetic map is extremely high. Allele frequencies at all sequence variants are calculated for identification of genomic regions associated with the phenotype of interest (Garcia et al. 2016). First approaches used bulk segregant analysis (BSA), where DNA from the mapping population is pooled based on the phenotypes of individuals and then sequenced, i.e. one pool comprises the wild type allele of a certain locus and the other pool the mutant allele of the respective locus. Next, reads are mapped against a reference genome sequence to detect sequence variants. In the next step, allele frequencies for all small sequence variants are calculated. High allele frequencies can indicate linkage with the causal locus. This approach is also known as mapping-by-sequencing (MBS) and allows the fast and simple identification of causal mutations through allele frequency deviations (Schneeberger 2014).
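The allele frequency calculation for two sequenced pools can be sketched as follows. The input format, one tuple of read counts per variant, and the use of the difference between both pools (often called delta allele frequency) are assumptions made for this illustration; published MBS tools apply additional smoothing and filtering.

```python
def allele_frequency(alt_reads: int, ref_reads: int) -> float:
    """Fraction of reads supporting the alternative (mutant) allele."""
    depth = alt_reads + ref_reads
    return alt_reads / depth if depth else 0.0


def delta_allele_frequencies(variants):
    """variants: iterable of tuples
    (position, mutant_pool_alt, mutant_pool_ref, wt_pool_alt, wt_pool_ref).
    Returns a list of (position, delta allele frequency)."""
    deltas = []
    for pos, mut_alt, mut_ref, wt_alt, wt_ref in variants:
        delta = allele_frequency(mut_alt, mut_ref) - allele_frequency(wt_alt, wt_ref)
        deltas.append((pos, delta))
    return deltas


def strongest_signal(variants):
    """Return the (position, delta) pair with the largest frequency difference."""
    return max(delta_allele_frequencies(variants), key=lambda item: item[1])
```

In practice, values would be averaged over windows of neighboring variants, because single variants are affected by sampling noise at low coverage.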

Mutagenesis

Natural variation can provide mutants, but it is also possible to generate mutant plants via mutagenesis. DNA damaging agents deployed in these mutagenesis experiments can be classified as physical mutagens (e.g. gamma radiation and fast neutron bombardment) or chemical mutagens (e.g. ethyl methanesulfonate, diepoxybutane, sodium azide) (Sikora et al. 2011). In order to achieve maximal genetic variation with a minimal decrease in viability, the mutagenic dosage and specific properties of the mutagen need to be considered (Sikora et al. 2011). High mutagenic dosages likely result in a high number of mutations in the individual genome, so that the high diversity around a causal mutation might impede its identification (Schneeberger 2014). If a mutagen introduces large genomic rearrangements (e.g. deletions or translocations of large regions), the resulting mutation density is typically low compared to a mutagen which causes predominantly single nucleotide variations. Furthermore, large genomic rearrangements might impede or even prevent the identification of the causal mutation by breaking apart a set of linked genes.

Biological material

Mapping-by-sequencing (MBS) can be based on four different sets of biological material. A classical mapping population scheme was frequently used during the first MBS experiments. This involved outcrossing of mutagenized plants with diverged strains followed by one round of selfing to generate the mapping population (Schneeberger et al. 2009; Cuperus et al. 2010). Sequencing was performed on two genomic F2 pools of mutant and wildtype plants, respectively. Starting with A. thaliana, this method was rapidly applied to other model organisms (Wenger et al. 2010; Leshchiner et al. 2012). An isogenic population is generated by crossing homozygous mutants with the non-mutagenized progenitor, resulting in segregation of subtle phenotypic differences in the F2 population (Abe et al. 2012). Therefore, the only segregating genetic variation is that induced by mutagens. MBS is performed as described above. Homozygosity mapping uses only the genomes of affected individuals, originally in the context of recessive disease alleles in inbred humans (Lander and Botstein 1987). In order to identify the causal homozygous mutation, the genomes are screened for regions with low heterozygosity. This approach enables MBS for species where a generation of a mapping population is not feasible (Lander and Botstein 1987; Singh et al. 2013) and no prior knowledge about the parental alleles (Voz et al. 2012) or crossing history is needed (Bowen et al. 2012). Sequencing of individual mutant genomes (Schneeberger 2014) is an expensive, but even more powerful approach. Phenotyping errors can contaminate pools in MBS, but this approach allows an in silico pooling.

Resolution and accuracy

In general, correct phenotyping of each individual of the mapping population is essential for the accuracy of MBS approaches. Contamination of the mapping population with incorrectly phenotyped individuals results in a larger mapping interval, thus complicating the identification of the causal mutation (Greenberg et al. 2011). Therefore, the resolution of MBS depends on the sampling size of correctly phenotyped and genotyped individuals in the mapping population (Schneeberger 2014). However, the resolution is only slightly affected by the number of backcrossed generations (James et al. 2013). As with conventional methods (e.g. classic genetic markers), re-sequencing data can be used to fine map the trait(s) of interest in a crossing population (Schneeberger and Weigel 2011). The higher the number of recombinants analyzed, the narrower the final mapping interval. All variants can be considered as markers and thus the variant with the closest link to the trait hints towards the genomic position of the underlying locus. Due to the high marker density derived from natural polymorphisms in the recombinant mapping population, a stringent marker selection decreases the number of false-positive markers. However, at the same time the risk of excluding causal mutations increases, leading to a critical trade-off.

Mapping-by-sequencing applications

SHOREmap demonstrated the applicability of MBS in A. thaliana (Schneeberger 2014; Schneeberger et al. 2009). Subsequent projects applied MBS to various crop species including sugar beet (Ries et al. 2016), rice (Abe et al. 2012), maize (Liu et al. 2012), barley (Mascher et al. 2014), and cotton (Chen et al. 2015). Liu et al. applied a modification of MBS, called BSR-Seq, to maize for the identification of a drought tolerance locus (Liu et al. 2012). BSR-Seq uses RNA-Seq reads for the identification of causal mutations without any prior knowledge about polymorphic markers. As a proof of concept, RNA-Seq was performed for the recessive glossy3 (gl3) mutation in a segregating F2 population. The gl3 gene encodes a putative R2R3 type myb transcription factor, which regulates the biosynthesis of very-long-chain fatty acids, the precursors of epicuticular waxes. Maize seedlings lacking glossy3 show a strongly reduced epicuticular wax layer on juvenile leaves. By using this alternative MBS approach, the gl3 locus was mapped to an interval of approximately 2 Mb. In summary, mapping-by-sequencing is a powerful technique, which can contribute to the development of (crop) plants that are well adapted to biotic and abiotic stresses.

19.4 Transcriptomics

19.4.1 RNA-Seq

RNA-Seq, the sequencing of cDNAs, emerged as a valuable method for (1) gene expression analysis, (2) de novo transcriptome assembly, and (3) the generation of hints for the gene annotation. The Illumina sequencing workflow of cDNA is very similar to the sequencing of genomic DNA. Besides RNA-Seq, the direct sequencing of RNA became broadly available with ONT sequencing (Garalde et al. 2018). In addition, PacBio provides Iso-Seq to reveal the sequence of full length transcripts, which can facilitate gene annotation in plants (Minoche et al. 2015).

Gene expression analysis

Short RNA-Seq reads have almost completely replaced previous methods for systematic gene expression analyses like microarrays (Wang et al. 2009; Nagalakshmi et al. 2008; Mortazavi et al. 2008). The abundance of transcripts can be quantified without any prior knowledge about the sequence (Wang et al. 2009; Cheng et al. 2017a, b), e.g. by generating a de novo transcriptome assembly based on the RNA-Seq reads (see below) (Haak et al. 2018; Müller et al. 2017). RNA-Seq even makes it possible to distinguish between different transcript isoforms of the same gene (Wang et al. 2009; Cheng et al. 2017a, b). Saturation of the signal as observed for microarrays is no longer an issue, because the number of reads is proportional to the transcript abundance (Wang et al. 2009; Mortazavi et al. 2008). Low amounts of samples can be analyzed and transcripts with low abundance can be detected, because a single read would be sufficient to reveal the presence of a certain transcript (Wang et al. 2009; Hayashi et al. 2018). Transcript quantification can be performed based on alignments against a reference sequence, e.g. using STAR (Dobin et al. 2013), or alignment-free, e.g. via Kallisto (Bray et al. 2016) (Table 19.6). Information about the transcript abundance can be subjected to downstream analyses like the identification of differentially expressed genes between samples, e.g. via DESeq2 (Love et al. 2014). An alternative approach is the identification of co-expressed genes or the construction of co-expression networks as described by van Dam et al. (2017) and references therein.
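Because read numbers scale with transcript abundance but also with transcript length and sequencing depth, counts are usually normalized before samples are compared. The following minimal sketch converts raw counts into transcripts per million (TPM), one commonly used normalization. The function name and the dictionary-based input are hypothetical choices for illustration, and the sketch is not meant to reproduce the internal statistics of any of the tools named above.

```python
def counts_to_tpm(counts, lengths):
    """Convert raw read counts per transcript into transcripts per million (TPM).

    counts:  dict mapping transcript ID to the number of assigned reads
    lengths: dict mapping transcript ID to the transcript length in bp
    """
    # Reads per base: corrects for longer transcripts yielding more reads
    rates = {tid: counts[tid] / lengths[tid] for tid in counts}
    total = sum(rates.values())
    if total == 0:
        return {tid: 0.0 for tid in counts}
    # Scale so that all values of one sample sum up to one million
    return {tid: rate / total * 1_000_000 for tid, rate in rates.items()}


# Toy data (invented for illustration only)
toy_counts = {"t1": 100, "t2": 100, "t3": 0}
toy_lengths = {"t1": 1000, "t2": 2000, "t3": 500}
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
for transcript_id, value in toy_tpm.items():
    print(transcript_id, round(value, 2))
```

In this toy example, t1 and t2 receive the same number of reads, but t1 obtains twice the TPM value of t2 because it is only half as long.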

Table 19.6 RNA-Seq gene expression tools

De novo transcriptome assembly

RNA-Seq reads contain comprehensive information about the transcript sequences. Therefore, a de novo assembly can be generated to reveal the sequences of transcripts present in the analyzed sample (Schliesky et al. 2012). De novo transcriptome assemblies were frequently applied to discover candidate genes which are responsible for a certain trait of interest (Han et al. 2017, 2018; Wu et al. 2017). One of the most popular transcriptome assemblers is Trinity (Haas et al. 2013), which comprises three sequentially applied modules. Trinity performs an in silico normalization of the provided reads, i.e. identical reads are filtered out to achieve a similar coverage depth for all transcripts. Supplying stranded RNA-Seq reads, i.e. reads originating from a specified strand, makes it possible to distinguish between reads originating from mRNAs and reads originating from regulatory antisense transcripts. Trinity performed well in benchmarking studies (Hölzer and Marz 2019; Behera et al. 2017), but there are more tools that can be evaluated on a given data set (Table 19.7). Several transcriptome assemblers including Cufflinks (Trapnell et al. 2010), Trinity (Haas et al. 2013), and StringTie (Pertea et al. 2015) allow the integration of a genome sequence for reference-based or genome-guided assembly.

Table 19.7 De novo transcriptome assembly tools

After generation of an initial assembly, very short sequences as well as bacterial and fungal contamination sequences are usually filtered out, the latter based on sequence similarity to databases. Since no introns are included in assembled transcript sequences, the identification of protein coding regions can be performed by searching for open reading frames of sufficient length. ORFfinder (Wheeler et al. 2003), OrfPredictor (Min et al. 2005), and TransDecoder (Haas et al. 2013) can perform this task. Collapsing very similar sequences is sometimes required and can be performed by CD-HIT (Li and Godzik 2006; Fu et al. 2012). Once a final set of sequences is identified, the assignment of a functional annotation is usually the next step. Sequence similarity to functionally annotated databases like Swiss-Prot (Bairoch and Apweiler 2000; The UniProt Consortium 2017) can be harnessed to transfer the functional annotation to the newly assembled sequences. InterProScan5 (Finn et al. 2017) assigns functional annotations including gene ontology (GO) terms and identifies Pfam domains.
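The search for open reading frames of sufficient length in an intron-free transcript sequence can be sketched as follows. This is a minimal illustration which assumes the standard genetic code, ATG as the only start codon, and a hypothetical minimal length of 300 bp. The dedicated tools named above apply more elaborate criteria, so this sketch is not a substitute for them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_length):
    """Yield (start, end) of ORFs from ATG to stop codon in all three frames."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    yield start, i + 3
                start = None  # continue the search behind the stop codon


def longest_orf(seq, min_length=300):
    """Return the longest ORF (including stop codon) on either strand, or None."""
    seq = seq.upper()
    best = None
    for strand_seq in (seq, reverse_complement(seq)):
        for start, end in orfs_in_strand(strand_seq, min_length):
            if best is None or end - start > len(best):
                best = strand_seq[start:end]
    return best


# Toy sequence (invented for illustration only)
print(longest_orf("CCATGAAATTTGGGTAACC", min_length=9))
```

ORFs without a stop codon, which occur in incompletely assembled transcripts, are ignored in this sketch; whether such partial ORFs should be kept is a design decision that depends on the downstream analysis.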

Gene prediction hints

Since RNA-Seq reads reveal transcript sequences, they can be incorporated in the prediction of genes. The alignment of RNA-Seq reads to a genome assembly indicates the positions of introns through gaps in the alignment. In addition, continuously aligned parts of RNA-Seq reads reveal exon positions. STAR (Dobin et al. 2013) and HISAT2 (Kim et al. 2015) are suitable tools for the mapping of RNA-Seq reads. If reads are already assembled into contigs, exonerate (Slater and Birney 2005) can be utilized to align transcript sequences to an assembly. Dedicated alignment tools also allow the incorporation of peptide sequences as hints by aligning the sequences of well annotated species against the new assembly. Examples of such peptide alignment tools are exonerate (Slater and Birney 2005) and BLAT (Kent 2002).
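How gaps in a read alignment reveal intron positions can be illustrated with the CIGAR string of the SAM format, in which the operation N marks a region of the reference that is skipped by the read. The following minimal sketch derives intron coordinates from the alignment start position and the CIGAR string. The function name is a hypothetical choice for illustration, and gene prediction pipelines usually apply additional filters, e.g. for the number of supporting reads, which are omitted here.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations which advance the position on the reference sequence
REFERENCE_CONSUMING = set("MDN=X")


def introns_from_cigar(start, cigar):
    """Return intron coordinates (1-based, inclusive) of one spliced alignment.

    start: 1-based leftmost reference position of the alignment (SAM POS field)
    cigar: CIGAR string of the alignment (SAM CIGAR field)
    """
    introns = []
    position = start
    for length, operation in CIGAR_PATTERN.findall(cigar):
        length = int(length)
        if operation == "N":
            # Skipped reference region, i.e. a putative intron
            introns.append((position, position + length - 1))
        if operation in REFERENCE_CONSUMING:
            position += length
    return introns


# Toy alignment (invented for illustration only)
print(introns_from_cigar(100, "50M200N50M"))
```

The continuously aligned blocks between the reported introns correspond to the exon positions mentioned above and could be collected in the same loop.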

19.5 Future Directions

Recent developments in sequencing technologies enabled the cost-efficient generation of genome and transcriptome sequences for numerous plant species of interest (Bolger et al. 2014a; Jiao and Schneeberger 2017). Most of the traditional plant research already benefits from the availability of sequence information for the respective species of interest. This technological progress enables completely new research projects like comparative genomics of large taxonomic groups. Re-sequencing projects, which rely on a reference sequence for comparison, might be replaced by independent de novo genome assemblies for all samples of interest (Jiao and Schneeberger 2017).

The availability of large sequence data-sets will also lead to more data-based studies which just re-use the existing sequence data-sets. These publicly available data-sets can be harnessed to answer novel questions which could not have been addressed before (Pucker and Brockington 2018).

Availability of plant genome sequences can foster the research on and usage of orphan crops (Chang et al. 2018) and help during de novo domestication of crops (Fernie and Yan 2019). Intensifying research activity in this field is especially important to cope with global warming and climatic changes.