5.1 Introduction

The genetic code is the basis of organismal life. Deciphering the primary DNA sequence of an organism, i.e. its genome and the associated genes, has become a fundamental resource in biology. Genome sequencing has progressed faster for mammalian and microbial genomes than for plant genomes. Advances in next-generation sequencing technologies have made whole genome sequencing technically feasible and cost-effective. Genomics and genome data have a significant impact on biological investigations irrespective of area and discipline. Whole genome sequencing combined with de novo assembly generates new reference genomes for genetic investigation and trait discovery in each organism. State-of-the-art sequencing techniques and the associated computational processes provide better resolution of transcriptomes, epigenomes, and organellar genomes (Varshney et al. 2009). In addition, targeted sequencing, including that of exomes and regulomes, has brought a major shift to high-throughput trait diagnostics (Hashmi et al. 2015). Information on a total of 24,002 genomes is available at the NCBI, covering eukaryotes, prokaryotes, viruses, plasmids, and organelles (https://www.ncbi.nlm.nih.gov/genome/browse/).

The release of the Arabidopsis genome in 2000 (Arabidopsis Genome Consortium 2000) and subsequent technological advances in sequencing methods raised the number of sequenced plant genomes to nearly 185. More than 50% of these are crop genomes, and the rest belong to model plants, orphan crops, and wild relative species. The availability of this genome information has shifted genetic investigation to the next level of research, precision genomics. A collection of genomes representing various species, genera, and classes in the plant kingdom also helps reveal the evolutionary relationships between plants and clarifies synteny and gene family evolution. Above all, crop-specific genome sequence and transcript information enhances the power of gene and functional marker discovery, which will greatly impact next-generation breeding programs, including genomics-assisted breeding and genomic selection.

5.2 Genome Sequencing Platform

The first human genome (Venter et al. 2001; HGSC 2004) and other genome sequencing projects, including those for model plants and major crop species such as rice, soybean, sorghum, maize, and grape, depended mainly on Sanger sequencing methods (Sanger et al. 1977), which are expensive and time-consuming for complex genomes. The introduction of next-generation sequencing technologies dramatically improved the output/cost ratio of genome sequencing. Among these technologies, pyrosequencing approaches were replaced to a certain extent by “sequencing by synthesis” approaches (Margulies et al. 2005; Shendure et al. 2005). In its early phase, this technology provided multiplexed sequencing of DNA fragments but generated short reads of low quality; after improvements to the basic chemistry, it is now possible to acquire high-quality short-read sequences (Fuller et al. 2009). However, assembling complex genomes from these comparatively short reads has major drawbacks, including difficulty in assembling and resolving complex genomic regions and in identifying gene isoforms and other structural rearrangements.

Single-molecule, real-time sequencing technology developed by Pacific Biosciences offers longer read lengths and higher consensus accuracies than the short-read technologies, making it well suited to better characterization in genome, transcriptome, and epigenetics research (Eid et al. 2009; Nakano et al. 2017). A major requirement for a de novo reference assembly is a low percentage of gaps, and contigs generated from longer reads significantly improve assemblies by closing gaps. In addition, long reads help to better characterize structural variations in genomes (Utturkar et al. 2014; Rhoads and Au 2015). More recently, the development of optical mapping has helped to detect errors in assemblies, providing high-resolution genomic maps and allowing more accurate and complete assemblies (Schwartz et al. 1993; Tang et al. 2015; Jiao et al. 2017a). In general, hybrid sequencing strategies that combine short reads, long reads, and optical maps are more affordable and scalable for genomic investigations. These high-quality sequences improve the quality of contigs and scaffolds, yielding severalfold better genome assemblies.
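
The gap fraction mentioned above can be illustrated with a minimal sketch. It assumes that gaps are represented as runs of the character N in scaffold sequences, which is a common convention but is not stated in this chapter; the function name and the toy sequences are hypothetical.

```python
def gap_percentage(sequences):
    """Return the percentage of positions that are 'N' (unknown bases) across all sequences."""
    total_bases = sum(len(seq) for seq in sequences)
    if total_bases == 0:
        return 0.0
    # Count both upper- and lower-case N as gap positions.
    gap_bases = sum(seq.upper().count("N") for seq in sequences)
    return 100.0 * gap_bases / total_bases


# Toy scaffolds (hypothetical): one contains a 4-base gap, one is gap-free.
toy_scaffolds = ["ACGTNNNNACGT", "ACGTACGT"]
gap_pct = gap_percentage(toy_scaffolds)
print(f"{gap_pct:.1f}% of scaffold bases are gaps")
```

Closing a gap with a long read would replace the N run with real sequence and lower this percentage.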

5.3 Plant Genomes and Assemblies

The development of reference genome assemblies is essential for identifying DNA sequence variation. Plant genome assembly is challenging, even more so than animal genome assembly. Large genome size, polyploidy, highly repetitive genomic regions, and genome duplication are the major challenges in most plant genomes. For example, the repetitive fraction of the human genome is between 35 and 45%, whereas in maize it is 64–73% (Imelfort and Edwards 2009). Even though there are more than 180 plant reference genomes, only a small number of them are assembled at the chromosome level (Schnable et al. 2009; Jiao et al. 2017b). Recent advances in next-generation sequencing technologies, especially long-read sequencing, long-range scaffolding, and chromosome conformation capture (Burton et al. 2013), have helped to overcome the challenges of chromosome-level genome assembly.

Genomes sequenced by the Sanger technique were initially assembled using the TIGR Assembler (Sutton et al. 1995) and the Celera Assembler (Myers 1995). Advances in genome sequencing technology required newer assembly tools as well. For genome assembly from short reads, de Bruijn graph assemblers (Pevzner et al. 2001) such as Velvet (Zerbino and Birney 2008) and ABySS (Simpson et al. 2009) were used. Other de novo assembly tools that can generate good-quality, robust assemblies of complex genomes from short reads include SSAKE (Warren et al. 2007), VCAKE (Jeck et al. 2007), Edena (Hernandez et al. 2008), EulerSR (Chaisson and Pevzner 2008), and AllPaths (Butler et al. 2008). More recently, a new assembler called Supernova (Weisenfeld et al. 2017) and a new version of the ABySS assembler exploit linked reads and optical mapping for improved scaffolding.
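
To make the de Bruijn graph idea concrete, the following is a minimal sketch, not the implementation used by Velvet or ABySS. It assumes error-free reads from a single strand and ignores reverse complements, coverage, and graph simplification; the function names and toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from `start` while each node has exactly one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle (e.g. a repeat): stop extending
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (hypothetical) sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

The walk stops at branches and cycles, which is where repeats make short-read assembly ambiguous.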

A combination of single-molecule sequencing with complementary technologies has also become a common strategy. Technologies such as PacBio, Nanopore, and optical mapping produce longer reads exceeding 10 kb, and the greater read length and higher error rate of these new technologies required updated assembly methods. New assembly tools such as Canu (Koren et al. 2017), HINGE (Kamath et al. 2017), and Racon (Vaser et al. 2017) were designed specifically for assembling long-read PacBio and Nanopore data. In plants, Zimin et al. (2017) assembled the highly repetitive genome of the grass Aegilops tauschii by combining PacBio long reads and Illumina short reads. In addition, a combination of chromosome conformation capture and optical mapping generated a new version of the genome of the model plant Arabidopsis thaliana (Jiao et al. 2017a). This approach improved assembly contiguity to the chromosome-arm level.

5.4 Soybean Genome Assembly and Annotation

5.4.1 Soybean Genome Before Whole Genome Sequencing

Grain legumes play a significant role in global food and nutritional security, and soybean is one of the leading oilseed crops in the world. The United States Department of Agriculture (USDA) estimated that global soybean production in 2016/2017 would be 348.04 million metric tons, around 2.07 million tons more than the previous month’s projection (USDA-FAS, May 2017). The development of several genomic tools, including better genetic maps and genome sequence information, has contributed immensely to crop improvement. The cultivated soybean, Glycine max, and its close relatives, including the wild species Glycine soja and the wild perennial Glycine tomentella, are members of the tribe Phaseoleae, the most economically important of the legume tribes. The genus Glycine is paleopolyploid, with 2n = 40 as its base chromosome number, compared with other phaseoloid legumes, which are largely 2n = 20 or 22 (Goldblatt 1981). The size of the soybean genome was estimated at about 1.1 × 10^9 bp/C based on flow cytometry (Arumuganathan and Earle 1991), and similar values were predicted by DNA re-association kinetics (Cot analysis; Goldberg 1978; Gurley et al. 1979). Both of these soybean Cot studies suggested that 40–60% of the genome is repetitive; re-association of fragments of different sizes indicates that the majority of the repetitive sequences are physically linked, while a smaller fraction is interspersed with single-copy DNA. This latter conclusion is supported by 2700 sequence sampling of nearly 1000 loci across all 20 soybean linkage groups (Marek et al. 2001). This research predicts that gene-rich and repetitive regions are occasionally interspersed, but that gene-rich “islands” are present as well. Cytological studies show that euchromatin represents ~65% of the genome, with heterochromatin mainly localized to the pericentromeric regions and the short arms of four chromosomes (Singh and Hymowitz 1988). Collectively, these studies suggest that gene space accounts for about one third of the soybean genome (i.e. ~3.7 × 10^8 bp). Two large-scale genome-wide duplication events occurred in soybean, at 40–50 and 8–10 Mya, and the duplicated regions were segmented and reshuffled after these events (Shultz et al. 2007).

The soybean community has developed a substantial amount of marker information and generated physical and genetic maps of soybean, which contributed heavily to whole genome sequencing and chromosome assembly. As a complementary approach to whole genome sequencing, a large number of expressed sequence tags (ESTs) were generated for soybean. Initially, around 200,000 ESTs were generated from more than 50 cDNA libraries representing a wide range of organs, developmental stages, genotypes, and environmental conditions (Shoemaker et al. 2002). Genome mapping of soybean based on DNA markers was initiated in the early 1990s, and several genetic linkage maps of soybean have been published in the last decade. The early maps were based primarily on restriction fragment length polymorphism (RFLP), amplified fragment length polymorphism (AFLP), and simple sequence repeat (SSR) markers, and the more recent maps include single nucleotide polymorphism (SNP) markers. In total, several thousand genetic markers (mostly SSR and SNP markers) were mapped in soybean before the release of the first draft genome sequence (Keim et al. 1990; Cregan et al. 1999; Wu et al. 2004, 2010; Song et al. 2004; Kassem et al. 2006; Choi et al. 2007; Xia et al. 2007; Hisano et al. 2008; Yang et al. 2008). The initial physical map for the cultivar Williams 82 was developed from sequencing of bacterial artificial chromosome (BAC) libraries (Luo et al. 2003; Warren 2006; Shoemaker et al. 2008). Soybean physical maps for “Forrest” and “Williams 82”, representing the southern and northern US soybean germplasm bases, were constructed with different fingerprinting methods. These physical maps are complementary in their coverage of gaps on the 20 soybean linkage groups. More than 5000 genetic markers have been anchored onto the Williams 82 physical map, but only a limited number have been anchored to the Forrest physical map. A framework map with almost 1000 genetic markers was constructed using a core set of recombinant inbred lines (RILs) developed from a Forrest × Williams 82 mapping population (Wu et al. 2011). High-resolution physical maps for both G. max and G. soja were developed later (Ha et al. 2012), and these maps served as a framework for ordering sequence fragments, comparative genomics, gene cloning, and evolutionary analyses of legume genomes.

5.4.2 Soybean Whole Genome Sequencing

5.4.2.1 Cultivated Soybean Genome Version 1

The first reference genome of cultivated soybean was released in 2010 by the soybean research community in collaboration with the Department of Energy Joint Genome Institute (DOE-JGI) (Schmutz et al. 2010). The northern US cultivar Williams 82 provided the first soybean reference genome. A whole genome shotgun approach was adopted to sequence the 1.1 gigabase (Gb) genome. Libraries of three insert sizes and several BAC libraries were sequenced using Sanger sequencing protocols on ABI 3730XL capillary sequencing machines at the JGI. A total of 15,332,163 sequence reads were assembled into 3363 scaffolds using the Arachne assembler (Batzoglou et al. 2002; Jaffe et al. 2003). This assembly covers 969.6 Mb of the soybean genome. To obtain 20 chromosome-level pseudomolecules, the genome assembly was integrated with the physical map and a high-density genetic map of soybean, and the resulting assembly covered 937.3 Mb of the genome. In addition, 1148 unmapped scaffolds covering 17.7 Mb of the genome were added, for a total sequence coverage of 8.04×. After genome assembly and chromosome assignment, this reference genome represented 85% (955 Mb) of the estimated genome size, with 1.9% gaps (Table 5.1).
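
The totals quoted above can be checked with a short calculation. This sketch assumes that the 85% figure refers to the 1.12 Gb genome size estimate quoted for Williams 82 later in this chapter; that pairing is an assumption, and the function name is hypothetical.

```python
def assembly_summary(anchored_mb, unanchored_mb, estimated_genome_mb):
    """Return total assembled Mb and its percentage of the estimated genome size."""
    total_mb = anchored_mb + unanchored_mb
    return total_mb, 100.0 * total_mb / estimated_genome_mb


# 937.3 Mb in pseudomolecules plus 17.7 Mb in unmapped scaffolds (from the text);
# 1120 Mb is the assumed genome size estimate (see lead-in).
total_mb, percent_of_genome = assembly_summary(937.3, 17.7, 1120.0)
print(f"{total_mb:.1f} Mb assembled, {percent_of_genome:.1f}% of the estimate")
```

Under this assumption the two components sum to the 955 Mb given in the text, and the ratio rounds to the stated 85%.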

Table 5.1 Final assembly statistics of G. max cv. Williams 82 reference genome version 1 (Glyma1.01)

Large fractions of plant genomes consist of repeated DNA sequences of various types. A combination of genome structure analysis and homology studies identified 59% of the soybean genome as repeat-rich, with the majority of repeats being transposable elements (TEs). Analysis of the repetitive elements in the soybean reference genome revealed that 57% of the repeat-rich, low-recombination heterochromatic regions surround the centromeres. The percentage of repeat elements is higher than in Arabidopsis (de la Chaux et al. 2012) and rice (Takata et al. 2007), and lower than in the maize genome (Schnable et al. 2009). Long terminal repeat (LTR) retrotransposons make up the majority of the repetitive elements in the soybean genome (42%) and comprise 510 families containing 14,106 intact elements, including 9733 Gypsy-like and 4373 Copia-like elements. A major role of transposable elements is to generate genomic novelty in organisms, which arises through a combination of chromosome rearrangements and associated gene regulation processes. TEs influence genome size, gene content, gene order, and several aspects of nuclear biology (Bennetzen and Wang 2014). Large-scale epigenetic changes due to polyploidization or stress responses affect TEs, and together these influence genomes to generate novel functions (Galindo-González et al. 2017).

The Williams 82 genome contains 46,430 high-confidence protein-coding loci and another ~20,000 predicted loci with low confidence. Comparison of these loci with angiosperm protein families identified 283 legume-specific gene families harbouring 448 soybean-specific genes. About 12.2% of the high-confidence protein-coding loci (5671) represent putative transcription factors (Schmutz et al. 2010). These transcription factors, from the 28 families, were further annotated, and a comprehensive soybean transcription factor database, SoyDB, was generated (Wang et al. 2010). The annotations in SoyDB include predicted tertiary structures, protein domains, multiple sequence alignments, DNA binding sites, and consensus sequences for each transcription factor family. Experimental data supporting the gene content became available after the release of the soybean genome information (Libault et al. 2010; Severin et al. 2010). Transcriptome analysis of soybean tissues collected from 14 different biological conditions revealed the transcription of 55,616 annotated genes and indicated that 13,529 annotated soybean genes are putative pseudogenes (Libault et al. 2010). Mining of this gene expression atlas identified several tissue-specific transcription factors and helped to clarify the molecular basis of transcriptional regulation associated with various biological processes.

5.4.2.2 Cultivated Soybean Genome Version 2

Recently, a new version of the soybean genome assembly (Wm82.a2.v1) was released after several issues in the first version (Glyma1.01) were corrected. The Glyma1.01 assembly of the whole genome sequence contains 236 unanchored scaffolds with lengths ranging from 10 to 100 kb and 51 unanchored scaffolds with lengths greater than 100 kb (Song et al. 2016). Even though the first assembly was generated using integrated linkage maps and a genetic map with additional markers, the marker density was not sufficient to fully cover all regions of the soybean genome. Two high-density genetic linkage maps of soybean were therefore constructed: one based on 21,478 SNP loci mapped in the G. max Williams 82 × G. soja PI479752 population of 1083 RILs, and one based on 11,922 SNP loci mapped in the G. max Essex × G. max Williams 82 population of 922 RILs. These maps helped to identify false joins, misplaced scaffolds, and unanchored scaffolds in the Glyma1.01 assembly, and the corresponding scaffolds were broken or reassembled into the new Wm82.a2.v1 assembly (Song et al. 2016).

The Wm82.a2.v1 assembly was generated using the latest version of the Arachne assembler. A combination of high-density genetic linkage maps and Phaseolus synteny was used to identify false joins in the Glyma1.01 assembly, and scaffolds were broken at these false joins. A total of 63 breaks were made, and the broken scaffolds were ordered and oriented using the high-density markers and Phaseolus synteny. The total sequence, including the 1170 unmapped scaffolds, was 978.5 Mb, and the new build of the 20 chromosomes captured 949.2 Mb (Table 5.2). In the new version of the soybean genome assembly, 56,044 protein-coding loci and 88,647 transcripts were predicted. The Wm82.a2.v1 gene set integrates ~1.6 million ESTs and 1.5 billion paired-end Illumina RNA sequence reads with homology-based gene predictions (https://phytozome.jgi.doe.gov).
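
One way to picture how a genetic map exposes a false join is sketched below. This is an illustration only, not the procedure of Song et al. (2016): it flags a scaffold when consecutive markers along it map to different linkage groups, and it ignores linkage groups supported by fewer than a chosen number of markers. The marker records, scaffold names, linkage group labels, and the min_support threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def find_candidate_breaks(markers, min_support=2):
    """Return {scaffold: [(left_pos, right_pos, left_lg, right_lg), ...]}.

    markers: iterable of (scaffold_id, position_bp, linkage_group).
    An interval is reported where adjacent, well-supported markers on the
    same scaffold map to different linkage groups.
    """
    by_scaffold = defaultdict(list)
    for scaffold, pos, lg in markers:
        by_scaffold[scaffold].append((pos, lg))

    breaks = {}
    for scaffold, hits in by_scaffold.items():
        counts = Counter(lg for _, lg in hits)
        # Drop linkage groups seen too rarely; they may be mis-mapped markers.
        supported = sorted((pos, lg) for pos, lg in hits if counts[lg] >= min_support)
        intervals = [
            (left_pos, right_pos, left_lg, right_lg)
            for (left_pos, left_lg), (right_pos, right_lg) in zip(supported, supported[1:])
            if left_lg != right_lg
        ]
        if intervals:
            breaks[scaffold] = intervals
    return breaks


markers = [
    ("scaffold_1", 10_000, "Gm01"),
    ("scaffold_1", 250_000, "Gm01"),
    ("scaffold_1", 480_000, "Gm01"),
    ("scaffold_1", 900_000, "Gm07"),
    ("scaffold_1", 1_200_000, "Gm07"),
    ("scaffold_2", 5_000, "Gm03"),
    ("scaffold_2", 300_000, "Gm03"),
    ("scaffold_2", 310_000, "Gm15"),  # single outlier, below min_support
    ("scaffold_2", 700_000, "Gm03"),
]
breaks = find_candidate_breaks(markers)
print(breaks)
```

The reported interval only brackets the candidate break; higher marker density narrows it, which is why the denser maps were needed.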

Table 5.2 Final assembly statistics of G. max cv. Williams 82 reference genome version 2 (Wm82.a2.v1)

5.4.2.3 Wild Soybean Whole Genome Sequencing

The domestication of cultivated soybean traces back to around 5000 years ago in China, and the wild soybean Glycine soja is considered the closest relative of the cultivated soybean, Glycine max (Hymowitz 1970). Both species have 20 chromosome pairs (2n = 40) and belong to the primary gene pool, in which hybrids are vigorous and exhibit normal meiotic chromosome pairing and normal gene segregation. However, wild and cultivated soybeans differ in several morphological characteristics (Hymowitz 2004). Publicly available DNA sequence resources for G. soja are limited, and the wild species is considered an untapped genetic resource for crop improvement (Valliyodan et al. 2016). After the release of the cultivated soybean genome, the whole genome of the wild soybean G. soja var. IT 812932 was sequenced using two platforms, Illumina-GA and GS-FLX (Kim et al. 2010). Around 48.8 Gb of short DNA reads were aligned to the G. max reference genome, and a consensus sequence was determined for G. soja. This consensus sequence spanned 915.4 Mb, covering 97.65% of the available G. max reference genome sequence at an average mapping depth of 43-fold. In addition, 32.4 Mb of large deletions and 8.3 Mb of novel sequence contigs were detected in the G. soja genome. The unmapped and unpaired G. soja reads were assembled de novo with the Velvet assembler (version 0.7.31; Zerbino and Birney 2008). Genome structural analysis revealed 5794 deletions and 194 inversions and predicted 8554 insertions in the G. soja genome. More than 48% of the deleted regions in the wild soybean genome contain repetitive elements, including the major classes of retrotransposons (20.74%). In another study, the 5794 deletions were compared against G. max gene positions for regions of overlap. This comparative genomic analysis identified 425 genes, among the 46,430 high-confidence gene models of Glyma1.01, that are absent in G. soja and unique to G. max (Joshi et al. 2013). In addition, this study showed that there are significant genome-level differences between G. max and G. soja associated with some functionally important genes for seed, oil, and protein traits. Further investigation of these genes may explain some of the phenotypic differences observed between G. soja and G. max, especially in major seed composition and agronomic traits.
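
The comparison of deletions against gene positions described above amounts to an interval-overlap query. The following is a minimal sketch of that idea, not the pipeline of Joshi et al. (2013); it assumes 1-based inclusive coordinates, and the gene identifiers, chromosome names, coordinates, and min_fraction parameter are hypothetical.

```python
def genes_hit_by_deletions(deletions, genes, min_fraction=0.0):
    """Return IDs of genes overlapped by a deletion on the same chromosome.

    deletions: list of (chrom, start, end)
    genes:     list of (gene_id, chrom, start, end)
    min_fraction: minimum share of the gene length that a single deletion
                  must cover (1.0 keeps only fully deleted genes).
    """
    hits = []
    for gene_id, g_chrom, g_start, g_end in genes:
        gene_len = g_end - g_start + 1
        for d_chrom, d_start, d_end in deletions:
            if d_chrom != g_chrom:
                continue
            # Length of the shared interval; zero or negative means no overlap.
            overlap = min(g_end, d_end) - max(g_start, d_start) + 1
            if overlap > 0 and overlap / gene_len >= min_fraction:
                hits.append(gene_id)
                break
    return hits


deletions = [("Gm01", 1000, 5000), ("Gm02", 200, 900)]
genes = [
    ("gene_A", "Gm01", 1500, 3000),  # entirely inside the first deletion
    ("gene_B", "Gm01", 4800, 6000),  # partially overlaps the first deletion
    ("gene_C", "Gm01", 7000, 8000),  # no overlap
    ("gene_D", "Gm02", 1000, 2000),  # same chromosome as second deletion, no overlap
]
any_overlap = genes_hit_by_deletions(deletions, genes)
fully_deleted = genes_hit_by_deletions(deletions, genes, min_fraction=1.0)
print(any_overlap, fully_deleted)
```

The nested loop is adequate for a few thousand deletions; sorted intervals or an interval tree would be a natural refinement for larger sets.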

Recently, the de novo assembly of the genome of a salt-tolerant wild soybean (G. soja) accession, W05, was reported (Qi et al. 2014). The genome size of this wild soybean was estimated at 1.17 Gb using k-mer statistics (Lander and Waterman 1988), close to the estimated size (1.12 Gb) of the cultivated soybean reference genome, Williams 82. The SOAPdenovo software (Li et al. 2010) assembled 868 Mb of the W05 genome (Table 5.3). About 43.41% of the genome consists of repeat elements, the majority of which are LTRs (30.89%). The W05 genome contains 52,395 protein-coding genes, of which 49,560 are functionally annotated, and the average coding sequence length is 1083.9 bp. The development of de novo wild soybean genomes can facilitate the identification of genetic material from this untapped genomic resource for crop improvement, since these genomes can reveal small and large genomic variations.
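
A common way to estimate genome size from k-mer statistics is to divide the total number of k-mers in the reads by the depth at the main peak of the k-mer frequency histogram. The sketch below illustrates that simple peak method under stated assumptions; it is not necessarily the calculation used by Qi et al. (2014), and the histogram values, the function name, and the min_depth cutoff for discarding likely sequencing errors are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    histogram: {depth: number of distinct k-mers observed at that depth}
    min_depth: depths below this are treated as sequencing errors and ignored.
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The most populated depth approximates the sequencing depth of unique k-mers.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: error k-mers at depth 1-2, a main peak near 20,
# and a few repeat k-mers at twice the peak depth.
toy_histogram = {1: 5000, 2: 800, 18: 40, 19: 90, 20: 120, 21: 95, 22: 45, 40: 10}
size_estimate, peak = estimate_genome_size(toy_histogram)
print(f"peak depth {peak}, estimated size {size_estimate:.2f} k-mer positions")
```

In this toy case the k-mers at depth 40 count twice toward the estimate, which is how repeated sequence is reflected in the genome size.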

Table 5.3 Final assembly statistics of G. soja, accession W05 reference genome

In order to capture the entire genomic sequence present within a species, including the complete gene set, the pangenome needs to be sequenced (Golicz et al. 2016). Tettelin et al. (2005) introduced the concept of the pangenome, defining it as the full genomic (genic) makeup of a species. A single genome is insufficient to represent the genomic content of either cultivated or wild soybean, in which individuals are distinct from one another due to low levels of genetic exchange and recombination. A pangenome of seven G. soja accessions was generated (Li et al. 2014) using Illumina HiSeq2000 sequencing reads and the SOAPdenovo assembler (Li et al. 2010). The average genome coverage was 119.9×, and the estimated genome size ranged from 889.33 Mbp for accession GsojaG (93.6% of GmaxW82) to 1118.34 Mbp for accession GsojaD (117.7% of GmaxW82) (Li et al. 2014). In this study, the estimated average number of genes per G. soja genome was 55,570, slightly more than in the G. max Williams 82 genome. These G. soja assemblies enabled accurate detection of variation and comparative analyses within genic regions and revealed significant structural variation relative to the G. max assembly.
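
At the gene level, the pangenome idea can be sketched with set operations: the pangenome is the union of the gene sets of all accessions. The split into genes shared by every accession ("core") and the remainder ("variable") follows terminology commonly used in pangenome studies and is an assumption here rather than something taken from Li et al. (2014); the accession names, gene names, and function name are hypothetical.

```python
def summarize_pangenome(gene_sets):
    """Return (pan, core, variable) gene sets for a dict {accession: set_of_genes}."""
    if not gene_sets:
        raise ValueError("at least one accession is required")
    sets = list(gene_sets.values())
    pan = set().union(*sets)         # genes present in at least one accession
    core = set.intersection(*sets)   # genes present in every accession
    variable = pan - core            # genes missing from at least one accession
    return pan, core, variable


toy_accessions = {
    "acc1": {"g1", "g2", "g3"},
    "acc2": {"g1", "g2", "g4"},
    "acc3": {"g1", "g2", "g3", "g5"},
}
pan, core, variable = summarize_pangenome(toy_accessions)
print(len(pan), sorted(core), sorted(variable))
```

No single accession in this toy example carries all five genes, which is the sense in which one reference genome under-represents a species.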

5.5 Conclusion and Future Perspectives

To meet the increasing global demand for grain legumes, including soybean, we have to maximize the development of high-yielding cultivars that perform better under biotic and abiotic stress conditions. To achieve this goal, precise dissection of the genomic regions associated with major traits, down to the haplotype/allelic level, is needed, leading to next-generation breeding strategies. Advances in next-generation sequencing platforms, improved cost-effectiveness per bp of genome sequence, and the development of various computational tools, including better genome assemblers, will expedite the generation of individual crop genomes. Long sequence reads and the associated genome assemblers will substantially reduce the major challenges in genome assembly, including those posed by genomic repeats, and help improve haplotype resolution and the analysis of genomic variation. Accurate detection of genomic variation and comparative analyses within genic regions are essential for the discovery of novel genes or alleles associated with traits of interest and for the development of new varieties.

A single reference genome is not enough to capture genomic diversity because of the large amount of structural variation, including copy number variation and presence/absence variation, which significantly alters individual genomic sequences. Pangenomes representing landrace, elite, and wild soybean from the global collection are needed to address this issue of genome structure variation and to mine rare alleles. Large-scale sequencing of soybean germplasm and additional reference genome builds for cultivated and wild soybean are in the pipeline and will soon be available to the soybean community. These efforts have shown that long-read sequencing of the soybean genome on the PacBio platform considerably improves whole genome assembly. Wild soybean is an important resource for novel alleles because of its higher genomic diversity. More reference genomes or pangenomes for wild soybean will significantly advance the tapping of rare genomic resources for crop improvement. Soybean genomes exhibit higher linkage disequilibrium (LD), and deep sequencing of more germplasm lines will enable precise genotype–phenotype association studies for pinpointing the candidate genes associated with specific traits and for genomic prediction. In addition, the development of a robust haplotype map for soybean (HapMap2) is essential for finding specific genetic variants that affect plant developmental processes and responses to disease and climate change. Large-scale sequencing of germplasm and various genetic populations calls for better data visualization tools to compare and mine genomic information at the allelic level. Such tools will enable breeders and the soybean community to easily associate genome sequence information with traits of interest and apply it to genome-enabled crop improvement programs, including genome editing and next-generation breeding strategies.