Introduction

Lentil (Lens culinaris Medik.) is a diploid (2n = 14), self-pollinated crop with a genome size of approximately 4 Gbp [1]. It is a cool-season crop cultivated in more than 52 countries. It is one of the healthiest pulse crops because its seeds are rich in minerals, fibers and carbohydrates in addition to proteins (22–35%). In recent years, lentil productivity has increased significantly owing to the cultivation of high-yielding varieties developed through traditional breeding. However, poor competitive ability against weeds, a high rate of flower drop, pod shedding, and several biotic and abiotic stresses still reduce productivity in lentil [2]. Moreover, environmental conditions strongly influence the expression of agronomically important quantitative traits, leading to poor genetic gain in lentil [3]. This is difficult to overcome under current climatic conditions. Genomics offers a potential way to increase genetic gain, and significant progress has been made in recent years in developing genomic resources for lentil. These resources include genome sequences, transcriptome sequences, molecular markers (i.e. SNPs and SSRs), mapping populations, and markers linked to genes/QTLs controlling important traits [4,5,6,7,8,9,10]. Further, advanced next generation sequencing technologies have made the development of genomic resources rapid and cost-effective, and their use has enriched genomic resources in many crops, including orphan legume crops. Thus legume crops that were once poor in genomic resources have become rich in them owing to the availability of draft genome sequences [11, 12] and NGS-based approaches. For example, QTL-Seq has made it possible to identify genes/QTLs rapidly in legume crops including chickpea, pigeonpea and groundnut [13,14,15,16,17].
Therefore, several authors have discussed the prospects of next generation sequencing for discovering genes underlying agronomically important traits and for using them in breeding legume and food crops [18,19,20,21]. These reviews of NGS-based breeding covered three food legumes (chickpea, pigeonpea and groundnut) in which NGS has been used for different purposes [19, 20]. In lentil, an important food legume crop, considerable attention has been given in recent years to using next generation sequencing to enrich genomic resources. It has been used for identification of SSR and SNP markers, development of unigenes and transcripts, identification of candidate genes for biotic and abiotic stresses, analysis of genetic diversity, and identification of genes/QTLs for agronomically important traits [10, 22]. Next generation sequencing has revolutionized genomic research and has made genomics-assisted breeding faster and more cost-effective in many crops, including legumes. In this review, we discuss the prospects of NGS-based breeding in lentil.

DNA sequencing methods

Different methods have been developed for analysing genomic sequences. These can be broadly categorized into the following two groups.

Conventional DNA sequencing

This is the first generation of DNA sequencing technology, based on Sanger’s sequencing method. It has been widely used in crop plants for developing molecular markers and sequencing plant genomes. It emerged in 1977 and was used for three decades [23]. This first generation technology was used to sequence the genomes of model plant species including Arabidopsis thaliana, Oryza sativa (rice), Carica papaya (papaya) and Zea mays (maize) [18]. In lentil, first generation DNA sequencing has been used to develop ESTs and genic and genomic SSR and SNP markers [4]. The kompetitive allele specific PCR (KASP) methodology has been used to detect SNPs from the available EST database generated with this sequencing technology [2, 24].

Advanced DNA sequencing

Advances in DNA sequencing led to next-generation sequencing (NGS) technologies, which sequence millions to billions of DNA nucleotides in parallel in a high-throughput manner so that multiple samples can be sequenced at low cost [25]. Additionally, cloning of DNA fragments is not required [26]. NGS technologies are further divided into second and third generation technologies, which are discussed below with reference to lentil.

Second-generation sequencing

Sequencing technologies of this generation use template libraries prepared by ligating DNA fragments to linkers and/or adapters. Thus DNA fragments are not cloned into host cells before sequencing [27]. In crop plants, the commonly used second generation sequencing (SGS) platforms are Roche 454 pyrosequencing, Illumina (Solexa) HiSeq and MiSeq sequencing, Sequencing by Oligonucleotide Ligation and Detection (SOLiD), DNA nanoball sequencing by BGI Retrovolocity and Ion Torrent. Each has advantages and disadvantages, which have been discussed in earlier reviews [27, 28]. Draft genome sequences of at least 421 plant species, including food and horticultural crops, had been published using second generation sequencing platforms as of March 2020, according to the www.plabipd.de website [18, 29, 30]. In lentil, different second generation technologies have been used to develop the draft genome sequence and to sequence transcriptomes [2, 10, 22, 31,32,33].

Third-generation sequencing

In recent years, draft genome sequences of several crop species have been developed by assembling short DNA reads, generated with second generation technologies, into contigs and scaffolds. However, polyploidy, repetitive DNA sequences and the large genome size of many crop species make it difficult to assemble long chromosome sequences from short reads, and hence these fragments are not properly mapped to their chromosomal locations [34,35,36]. Advances in DNA sequencing have since led to third generation sequencing technologies. These emerging technologies sequence single molecules and can generate long DNA sequences and scaffolds that can cover a complete chromosome [37]. Optical mapping [38], chromosome conformation capture [39] and DNA dilution-based technologies [40, 41] are examples of such third generation technologies. They are used by several sequencing platforms: (i) single-molecule real-time (SMRT) sequencing by Pacific Biosciences, (ii) Helicos sequencing by the Genetic Analysis System, (iii) nanopore sequencing by Oxford Nanopore Technologies (MinION and PromethION) and (iv) NGS by electron microscopy. These platforms are rapidly gaining ground and could replace SGS platforms in the coming years because of their cost-effectiveness and speed [42]. In lentil, a draft genome sequence has been developed from short reads generated by second generation sequencing methods [31], but it has many large gaps. These gaps can be filled by generating longer DNA sequences rapidly and more cost-effectively with third generation sequencing (TGS) technologies. Moreover, these technologies can increase the accuracy of SNP discovery, reduce the chance of calling false SNPs, and allow many markers to be sequenced/genotyped in a single step in lentil [43].

Current application of NGS

Next-generation sequencing technologies have been deployed to enrich genomic resources. In lentil, NGS-based SNP markers have been used for different purposes [44,45,46,47,48]. These are discussed below.

Identification of unigenes, transcripts and functional markers

In lentil, second generation sequencing platforms have been used to accelerate the development of molecular markers, unigenes and transcripts (Table 1). Initially, second generation sequencing technologies were used to develop simple sequence repeat (SSR)-containing ESTs, which were identified from consensus sequences [22]. In this study, 2393 EST-SSR markers were developed, and a subset of 192 EST-SSR markers was validated across genotypes of cultivated and wild species [22]. Subsequently, these markers have been used in several studies to assess genetic diversity and to identify QTLs for agronomically important traits [7, 8]. Further, second generation sequencing-based transcriptome analysis led to the identification of SSRs and SNPs from functional genomic regions of lentil [2, 6, 10, 33]. These were used to develop a SNP genotyping platform with SNP array chips for genotyping lentil accessions for genetic diversity analysis and molecular mapping of QTLs for agronomically important traits [2, 49].

Table 1 Genetic resources developed in lentil through NGS

Phylogenetic relationship studies

In lentil, NGS-based SNPs have made it possible to determine the genetic relationships between species and their times of divergence. SNPs in coding regions capture the genetic variation between species that arises from changes in the coding regions of the genome. Based on the rate of synonymous substitutions, L. culinaris and L. ervoides were estimated to have separated 677,000 years ago, while L. culinaris separated from M. truncatula 38 million years ago (MYA), which is similar to estimates from earlier studies [2, 53]. Advanced next generation sequencing platforms may, in the near future, further increase the amount of genome sequence data for multiple genotypes of L. culinaris and other Lens species, which can help to elucidate the nature of the domestication process in this crop [52].
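The dating logic behind such estimates can be sketched under a strict molecular clock, where the time since divergence is the synonymous substitution distance divided by twice the per-lineage substitution rate. This is a minimal illustration only: the function name and the input values are hypothetical and are not taken from the cited studies, which may have used different models and calibrations.

```python
def divergence_time_years(ks: float, rate_per_site_per_year: float) -> float:
    """Estimate the time since two lineages split under a strict molecular clock.

    ks: synonymous substitutions per synonymous site between the two species.
    rate_per_site_per_year: assumed synonymous substitution rate per lineage.
    Both lineages accumulate changes independently, hence the factor of 2.
    """
    if ks < 0 or rate_per_site_per_year <= 0:
        raise ValueError("ks must be non-negative and the rate must be positive")
    return ks / (2.0 * rate_per_site_per_year)


# Hypothetical inputs chosen only to illustrate the calculation.
example_ks = 0.01        # assumed distance, not a measured lentil value
example_rate = 5e-9      # assumed substitutions per site per year
example_time = divergence_time_years(example_ks, example_rate)
```

With these assumed inputs the estimate is about one million years; the result scales linearly with the distance and inversely with the assumed rate, so the choice of rate dominates the uncertainty.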

Identification of genetic diversity

NGS technologies have accelerated the development of SNPs in lentil, and these SNPs have been widely used to assess genetic diversity and determine genetic relationships among species. In one study, 384 genome-wide SNP markers, developed earlier using NGS, were used to genotype 505 cultivars and landraces; the cultivars clustered according to their geographical origin. However, landraces were not grouped clearly, and landraces from the Mediterranean region, especially Greece and Turkey, showed a high level of genetic diversity [44]. In another study, 1194 SNPs were used to estimate genetic diversity among 352 accessions from 54 countries. These genotypes grouped broadly according to geographical origin: the Mediterranean Basin (including the Nile valley from Egypt to Ethiopia), subtropical Asia and the northern temperate region [46]. In a recent study, genotyping by sequencing (GBS) was used to study the association of the Mediterranean gene pool with specific geographic or phenotypic patterns. For this, 6693 SNPs were used to characterize 349 accessions of the Mediterranean gene pool according to geographic patterns and phenotypic traits. This study suggested that lentil cultivation was introduced into Mediterranean countries through post-domestication routes and that selection of improved types shaped the structure of the lentil population [52]. These studies revealed considerable genetic diversity among genotypes of the cultivated gene pool, and hence NGS-based SNP genotyping can be used to mine diverse genotypes for hybridization in breeding programs for genetic improvement [46]. In another study, genetic diversity was assessed among 467 wild and cultivated lentil accessions from 10 diverse geographical regions using 422,101 high-confidence SNP markers.
In this study, these germplasm accessions grouped into four clusters, which were poorly correlated with geographical origin. However, accessions belonging to L. nigricans showed the greatest allelic diversity compared to all other species/subspecies [54].
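One common way to summarize diversity from SNP genotypes is expected heterozygosity (gene diversity) per marker, averaged over markers. The sketch below assumes biallelic SNPs coded as alternate-allele counts (0, 1 or 2) with None for missing calls; the toy matrix is hypothetical, and the cited studies may have used other statistics or clustering methods.

```python
from typing import List, Optional, Sequence


def allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """Alternate-allele frequency at one biallelic SNP, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this SNP")
    return sum(called) / (2 * len(called))


def expected_heterozygosity(genotypes: Sequence[Optional[int]]) -> float:
    """Gene diversity at one SNP: 1 - p^2 - q^2, which equals 2pq."""
    p = allele_frequency(genotypes)
    return 2 * p * (1 - p)


def mean_gene_diversity(matrix: List[Sequence[Optional[int]]]) -> float:
    """Average gene diversity over SNPs (rows are SNPs, columns are accessions)."""
    values = [expected_heterozygosity(row) for row in matrix]
    return sum(values) / len(values)


# Hypothetical toy data: three SNPs scored on four accessions.
toy_snps = [
    [0, 0, 2, 2],     # p = 0.5, diversity = 0.5
    [0, 0, 0, 0],     # monomorphic, diversity = 0
    [1, 1, None, 0],  # p = 1/3, diversity = 4/9
]
toy_diversity = mean_gene_diversity(toy_snps)
```

Comparing this statistic between groups of accessions (for example wild versus cultivated) gives a simple first view of where allelic diversity is concentrated.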

Linkage maps and mapping of genes/QTLs

High-density molecular maps have been developed in lentil using SNP markers generated through NGS. In one study, 376 SNPs, selected from 50,960 SNPs developed through NGS-based transcriptome analysis, were used along with SSR and ISSR markers to construct a dense molecular map. This map covered 432.8 cM with an average distance of 1.11 cM between adjacent markers [51]. Using the same NGS approach, 6306 high-quality polymorphic SNPs were used to construct a high-density inter-specific linkage map covering 5782.19 cM. This map helped to identify major QTLs controlling seed coat spotting (Scp), flower color (FC), stem pigmentation (SP), time to flowering, seed size and ascochyta blight resistance (Table 2). A major QTL for flowering time explained 55.73% of the phenotypic variance, and three major QTLs explained 35.48% of the phenotypic variance for seed size. One of these QTLs has been associated with stem pigmentation. These QTLs were located in a chromosomal region spanning 20.8 Mb, which was annotated with 366 genes [33].
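The map densities quoted above can be checked with simple arithmetic. The sketch below uses only the figures given in the text; it assumes that all 6306 SNPs were placed on the inter-specific map and ignores linkage-group boundaries, so the values are approximate.

```python
def mean_marker_spacing_cm(map_length_cm: float, n_markers: int) -> float:
    """Approximate average spacing as map length divided by marker count."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return map_length_cm / n_markers


def implied_marker_intervals(map_length_cm: float, mean_spacing_cm: float) -> int:
    """Number of marker intervals implied by a map length and a mean spacing."""
    if mean_spacing_cm <= 0:
        raise ValueError("mean_spacing_cm must be positive")
    return round(map_length_cm / mean_spacing_cm)


# Inter-specific map: 5782.19 cM covered by 6306 SNPs (assumed all mapped).
interspecific_spacing = mean_marker_spacing_cm(5782.19, 6306)

# First map: 432.8 cM at an average of 1.11 cM between markers.
first_map_intervals = implied_marker_intervals(432.8, 1.11)
```

Under these assumptions the inter-specific map averages roughly 0.92 cM per SNP, and the first map implies about 390 marker intervals, which is consistent with 376 SNPs supplemented by SSR and ISSR markers.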

Table 2 Identification of QTLs for economically important traits using NGS in lentil.

Identification of candidate genes and pathways

Identification of candidate genes for tolerance to biotic and abiotic stresses and for other agronomic traits can be accelerated with NGS technologies [6, 10, 32]. These technologies have been used to sequence genes expressed at a specific time and growth condition or stage. The resulting gene sequences are annotated against sequences of known function, leading to the identification of candidate genes or pathways associated with a phenotype. Genes expressed by plants at different stages under a particular stress can thus be identified easily and rapidly through NGS-based transcriptome analysis. In lentil, NGS-based transcript sequencing has identified up- and down-regulated genes and pathways expressed at the seedling stage under drought and heat conditions [6, 10]. In these studies, genes encoding plasmodesmata callose-binding protein 3 (PDCB), phosphatidylinositol/phosphatidylcholine transfer protein SFH13, CDP-diacylglycerol–glycerol-3-phosphate 3-phosphatidyltransferase 1 chloroplastic, probable glycerol-3-phosphate acyltransferase 2 (GPAT2), O-acyltransferase and phosphatidylcholine diacylglycerol choline phosphotransferase were identified through transcriptome analysis. These genes were up-regulated under heat stress, and one gene, encoding pyruvate phosphate dikinase, was related to the shikimate pathway, which produces secondary metabolites responsible for heat tolerance in plants [10]. However, in lentil, heat stress usually affects the terminal stage of growth and development. Therefore, the same approach has been used to identify up- and down-regulated genes and pathways at the reproductive stage under heat stress in the field (J. Kumar, personal communication). These genes/pathways differed from those expressed at the seedling stage under heat stress in the aforementioned studies (J. Kumar, personal communication).
Similarly, candidate genes expressed under drought conditions and associated with different functional groups, including molecular, cellular and biological processes, have been identified through transcriptome analysis in lentil [6]. In lentil, a reference set of unigenes comprising 58,986 contigs and scaffolds was compared with publicly available gene/protein databases, which identified candidate genes associated with mechanisms of boron toxicity tolerance and time of flowering [50]. In another study, comparative mapping of the flanking sequences of SNP markers associated with boron toxicity led to the identification of candidate genes for a boron transporter-like protein and the MIP family [49]. NGS-based transcriptome analysis has also been used to uncover the genetic basis of disease resistance and to identify candidate genes involved in defence response in lentil. For example, differential gene expression studies between resistant and susceptible genotypes using the RNA-seq approach [32, 55] and the MACE technique [56] have identified a number of key genes that play an important role in resistance against ascochyta blight. These genes are involved in different defence response functions (see Table 3). Khorramdelazad et al. [32] observed that the resistant genotype is able to detect infection and trigger signalling responses much earlier and faster than the susceptible genotype, and that structural defence-related genes are over-expressed in lentil. In lentil, NGS has been used to identify genes encoding disease resistance proteins in the host and, later, virulence proteins (i.e. effectors) in the pathogen. A complex interaction between resistance and effector genes has been observed in the development of anthracnose disease in lentil [57].
In this study, 26 resistance genes, including suppressor of npr1-1, constitutive 1 (NBS-LRR) and dirigent (resistance response protein), were identified in the host after infection with an isolate of the virulent race 0 of Colletotrichum lentis [57]. Transcriptome analysis identified candidate resistance genes encoding calcium-transporting ATPase and glutamate receptor 3.2, and another candidate gene of unknown function associated with susceptibility to Stemphylium blight in lentil. These candidate genes were validated through bulk segregant analysis in a mapping population previously used to identify QTL for this disease [58]. In another study, transcriptome analysis of Ascochyta lentis-infected plants identified 18 candidate genes and reported a critical role of the lignin biosynthesis and jasmonic acid (JA) pathways in the resistance reaction, whereas gibberellin synthesis was related to susceptibility to the pathogen [56]. These candidate genes can be used for several purposes, including pathway-specific expression analysis, genetic modification approaches, development of resources for genotypic analysis, and annotation of a future lentil genome sequence, and can also be useful for developing diagnostic functional markers for breeding.
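The idea of calling genes up- or down-regulated from transcript abundance can be illustrated with a log2 fold-change rule. This is a deliberately simplified sketch: the gene names, counts, pseudocount and threshold are all hypothetical, and real RNA-seq analyses rely on replicated samples and statistical testing rather than a fold-change cutoff alone.

```python
import math
from typing import Dict, Tuple


def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of treated to control expression, with a pseudocount to avoid log(0)."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify_genes(
    mean_counts: Dict[str, Tuple[float, float]], threshold: float = 1.0
) -> Dict[str, str]:
    """Label each gene 'up', 'down' or 'unchanged'.

    mean_counts maps a gene name to (control mean, stressed mean).
    A gene is 'up' if its log2 fold change is at least +threshold
    and 'down' if it is at most -threshold.
    """
    labels = {}
    for gene, (control, treated) in mean_counts.items():
        lfc = log2_fold_change(treated, control)
        if lfc >= threshold:
            labels[gene] = "up"
        elif lfc <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical normalized counts: (control, heat-stressed).
toy_counts = {
    "geneA": (10.0, 87.0),  # (87 + 1) / (10 + 1) = 8, log2 = 3
    "geneB": (63.0, 7.0),   # (7 + 1) / (63 + 1) = 1/8, log2 = -3
    "geneC": (20.0, 25.0),  # small change, below the threshold
}
toy_labels = classify_genes(toy_counts)
```

Genes labelled this way are only candidates; annotation against sequences of known function, as described above, is what links them to pathways.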

Table 3 Key candidate genes responsible for defence-response to A. lentis (Ascochyta blight) in lentil [32]

Prospects

Accelerating genetic gain in breeding populations

Advances in next generation sequencing (NGS) have made the generation of genome sequences and the discovery of SNPs rapid and cost-effective. NGS has accelerated sequencing/re-sequencing of the entire genome, or a part of it, in a large number of genotypes to identify polymorphisms in a crop species. Consequently, high-throughput genome-wide SNP genotyping platforms such as genotyping-by-sequencing (GBS) have in recent years helped to select genotypes with high breeding value in breeding populations on the basis of their genotypic constitution [59], and have been used for genomic selection (GS) in several crops including legumes [60]. GBS covers a much greater fraction of the genome and captures more population/family-specific genetic variation than other SNP genotyping methods currently used in crop plants [61]. Therefore, NGS-based GBS has been identified as ideal for GS owing to its flexibility, low cost and the prediction accuracy of genomic estimated breeding values (GEBV). GEBV prediction accuracy has been observed to be higher (by 0.1 to 0.2) with NGS-based genotyping than with other established marker platforms [59]. In soybean, prediction accuracy for grain yield, assessed using cross-validation, was estimated to be 0.64, indicating good potential for using GS for grain yield [62]. NGS-based GS therefore helps to enhance genetic gain in the following ways.

  • NGS allows genotyping of a large number of accessions, which helps to uncover the wide genetic variation available in the genome. This increases the power of GS. More than 1000 SNPs can be screened per sample in a single run through NGS, and hence GS can be performed for minor QTLs of traits with low heritability [63]. Consequently, higher genetic gain is possible through NGS-based GS.

  • Selection based on marker/genotype profile helped to identify the individuals in a breeding population with high breeding values by deploying higher selection intensity and accuracy for quantitative traits. This can give higher genetic gains per year compared to phenotypic selection [64, 65].

  • Genomic selection also helps to achieve better genetic gain for traits that have a long generation time or are difficult to evaluate (e.g. insect resistance, bread-making quality and others). It can also make selection cheaper. Genotype × environment interactions strongly influence quantitatively inherited traits. Therefore, considering genomic loci that interact with environments during genomic selection can enhance prediction accuracy and genetic gain. GS predicts the breeding values of lines on the basis of genome-wide marker profiles without phenotyping in the field and helps to select improved breeding lines across target environments. Thus, discovering large numbers of SNPs and using them to genotype large breeding populations through NGS, especially GBS, can replace phenotypic selection with GS in the coming years [61]. In chickpea, the prediction accuracy of GS for yield-related traits ranged from 0.138 (seed yield) to 0.912 (100-seed weight), and including genotype × environment (G × E) interaction in GS models improved prediction accuracy [66, 67].

In lentil, different GS models and prediction scenarios have been evaluated for designing GS strategies for breeding. Models that included G × E interactions and multiple traits showed higher prediction accuracy for a low-heritability trait. Moreover, prediction accuracies within populations and across environments were moderate to high in lentil [68]. This study suggested that GS can accelerate genetic gain within populations and across environments if applied to larger populations in lentil [68]. Therefore, NGS can help to accelerate genetic gain rapidly and cost-effectively through GS in lentil by genotyping large breeding populations with large numbers of markers.
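The notion of GEBV prediction accuracy assessed by cross-validation can be sketched as follows. This is an illustrative example on simulated data, assuming ridge regression on marker genotypes as the prediction model and NumPy as the numerical library; the function names, penalty value and simulation settings are hypothetical, and the cited studies used their own models and data.

```python
import numpy as np


def ridge_marker_effects(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge estimate of marker effects on centred data."""
    n_markers = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)


def cv_prediction_accuracy(
    X: np.ndarray, y: np.ndarray, k: int = 5, lam: float = 1.0, seed: int = 0
) -> float:
    """Mean Pearson correlation between predicted and observed phenotypes over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Centre with training-set means so the test fold stays unseen.
        x_mean = X[train_mask].mean(axis=0)
        y_mean = y[train_mask].mean()
        effects = ridge_marker_effects(
            X[train_mask] - x_mean, y[train_mask] - y_mean, lam
        )
        gebv = (X[test_idx] - x_mean) @ effects + y_mean
        accuracies.append(np.corrcoef(gebv, y[test_idx])[0, 1])
    return float(np.mean(accuracies))


# Simulated example: 200 lines, 50 SNPs coded 0/1/2, additive effects plus noise.
sim_rng = np.random.default_rng(1)
sim_genotypes = sim_rng.integers(0, 3, size=(200, 50)).astype(float)
sim_effects = sim_rng.normal(0.0, 1.0, size=50)
sim_phenotypes = sim_genotypes @ sim_effects + sim_rng.normal(0.0, 2.0, size=200)
sim_accuracy = cv_prediction_accuracy(sim_genotypes, sim_phenotypes)
```

In this setup the accuracy is high because the simulated trait is strongly heritable and every marker has an effect; for low-heritability traits, or with fewer genotyped lines, the same procedure would be expected to return lower values, which is why population size and marker number matter for GS.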

Towards the development of pan-genome and super pan-genome

A pan-genome refers to the total set of genes across individuals of a species [69]. These genes can be grouped into core and dispensable genes. The core genes are conserved across all individuals and are usually housekeeping genes responsible for essential cellular functions [70]. The pan-genome with core genes is also known as a closed pan-genome. The dispensable genes are present in one or a few individuals but not in all. These genes are functionally associated with various adaptive traits such as tolerance to biotic and abiotic stresses, receptor and antioxidant activity, gene regulation and signal transduction [71,72,73,74,75,76]. Therefore, these genes contribute more to the diversity of a species and evolve faster than core genes [70]. In soybean, non-synonymous and synonymous substitutions have been observed to be higher in dispensable genes [72]. In lentil, synonymous and non-synonymous SNP substitutions in the coding regions cause genetic variation [2]. The concept of the pan-genome is based on capturing the genetic variation, particularly structural variation, in the gene content of individuals belonging to the same species [69]. It became feasible owing to advances in NGS technology that allow sequencing/re-sequencing of multiple accessions belonging to one or more species. These structural variations (SVs) include presence/absence variations, copy number variations (CNVs) and other forms of variation such as inversions, trans-versions and inter/intra-chromosomal translocations [77,78,79,80]. Pan-genome analysis involves the genomic sequences of multiple accessions of a cultivated species, whereas super pan-genome analysis includes the genomic sequences of accessions from each species within a genus [81, 82].
As wild relatives have many specific traits, super pan-genome analysis offers a unique opportunity to exploit the structural genomic variation of a genus for genetic improvement by associating it with a trait of interest through genome-wide association analysis [82]. Pan-genomics has been used for genetic diversity analysis in several crops including soybean [72], B. oleracea [74], B. napus [76], maize [83], rice [84], wheat [85], sesame [86], sunflower [87] and tomato [88]. In lentil, NGS technology has been used to sequence RNA from many accessions of cultivated and wild species, and partial genome analysis has identified SNPs/InDels in the genome. However, SNPs/InDels alone do not account for all the genetic diversity available within a species [89, 90]. As a reference genome sequence is now available in lentil [31], and structural variations in the chromosomes due to translocations have been reported within and between species [91], re-sequencing of diverse accessions can help to determine the prevalence of structural variations at the genomic level in lentil, as observed in soybean [92, 93] and pigeonpea, a pulse crop [94]. The following factors determine the identification of dispensable genes in a pan-genome and will be useful to consider during the development of a pan-genome in lentil.

Size of pan-genome

The size of a pan-genome is determined by the number of sequenced individuals it includes: using a larger number of sequenced individuals increases the percentage of dispensable genes and decreases the percentage of core genes. For example, in rice, a pan-genome of 48,098 genes developed from 3010 accessions contained 41% dispensable genes, while another pan-genome of 40,362 genes developed from three accessions contained 8% dispensable genes [73, 84].
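The effect of adding accessions can be illustrated with a toy presence/absence example, treating the pan-genome as the union of the gene sets, the core as their intersection and the dispensable fraction as the remainder. The accession and gene names below are hypothetical, and real pan-genome construction involves assembly and gene clustering steps that this sketch does not attempt.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(gene_sets: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split a pan-genome into core genes (in every accession) and dispensable genes."""
    if not gene_sets:
        raise ValueError("at least one accession is required")
    accessions = list(gene_sets.values())
    pan = set().union(*accessions)
    core = set.intersection(*accessions)
    return core, pan - core


def dispensable_percentage(gene_sets: Dict[str, Set[str]]) -> float:
    """Percentage of pan-genome genes that are missing from at least one accession."""
    core, dispensable = partition_pan_genome(gene_sets)
    return 100.0 * len(dispensable) / (len(core) + len(dispensable))


# Hypothetical gene content of three accessions.
two_accessions = {
    "acc1": {"g1", "g2", "g3", "g4"},
    "acc2": {"g1", "g2", "g3", "g5"},
}
three_accessions = dict(two_accessions, acc3={"g1", "g2", "g6"})

pct_two = dispensable_percentage(two_accessions)      # 2 of 5 genes
pct_three = dispensable_percentage(three_accessions)  # 4 of 6 genes
```

In this toy case the dispensable share rises from 40% to about 67% when the third accession is added, mirroring the direction of the rice figures quoted above.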

Type of accessions

The use of closely related accessions does not capture the full size of the pan-genome. It is therefore better to use diverse accessions. Including accessions of wild species along with those of the cultivated species can help to develop a larger pan-genome with a higher percentage of dispensable genes than using accessions of the cultivated species alone. For example, in rice, a pan-genome built from 66 accessions of cultivated (Oryza sativa) and wild (Oryza rufipogon) species contained 42,580 genes, of which 38% were dispensable [95], while another pan-genome built only from accessions of the cultivated species contained 40,362 genes, of which 7.83% were dispensable [73]. Moreover, including accessions of wild species in a pan-genome can help to identify genes lost during the domestication of a crop, as suggested by Khan et al. [82].

Accelerating the use of gene-based markers in breeding

Before the advent of NGS, re-sequencing of unigene-derived amplicons or ESTs through conventional sequencing led to the development of gene-based SNP markers, which were validated through PCR [96, 97]. More recently, next generation sequencing-based transcriptome analysis has identified candidate genes expressed under several biotic and abiotic stresses in lentil [6, 10]. Moreover, these functional unigene sequences also carry SNPs and SSRs, which have become useful resources for developing gene-based/functional markers [2, 6, 10, 22, 50,51,52, 98]. The SNPs identified through transcriptome analysis have been used to develop genotyping arrays and to identify associations with traits of interest [2, 49]. Moreover, NGS makes it possible to re-sequence many candidate genes in a large number of genotypes in a single run at lower cost, which can be used to associate SNPs in candidate genes with target traits [49]. Candidate gene sequences can also be used to develop PCR-based markers that allow plant breeders to genotype large breeding populations easily, as done in other crops [99]. Also, the use of functional markers in marker-trait association mapping can help to develop perfect markers for breeding. In lentil, comparative genomic mapping of the flanking sequences of SNPs associated with the target trait against the genome sequences of other species (Medicago truncatula, soybean and Arabidopsis thaliana) led to the identification of candidate genes that are functionally associated with boron toxicity tolerance. These candidate gene sequences can potentially be used to develop a gene-based diagnostic marker in lentil [50].

Capturing exome variation

A draft genome sequence of lentil is available for reference [31]. Capturing genetic variation across the whole genome through whole genome sequencing of many accessions is challenging because the large (4063 Mb) and complex lentil genome, shaped by gene duplication, chromosomal arrangements and repetitive elements, is difficult to assemble [100]. However, capturing the genetic variation in the coding regions of the genome is useful for genetic improvement because these regions carry genes controlling many traits of agronomic importance. Coding regions are therefore more important than non-coding regions for breeding crops that have large and complex genomes [101]. In lentil, 3.2% of the genome (i.e. 130 Mb) consists of genic regions [100], which can be common targets for lentil research. Efforts have therefore been made to capture the genetic variation in these genic regions, and exome capture arrays covering 85 Mb have been developed in lentil [100]. Such arrays make it possible to sequence only the protein-coding regions of the genome rather than the whole genome, and exome capture is therefore a cost-effective sequencing method [102]. The exome arrays developed have been used to identify genetic variation among 38 diverse accessions of lentil, including 16 accessions of wild species [100]. The authors also suggested using exome capture arrays in downstream research, including identification of genetic relationships within the genus Lens, gene discovery, development of genic markers for tracking, identifying and selecting beneficial traits in breeding populations, identification of genes for adaptive traits in wild relatives, and DNA barcoding in future [100].
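The scale of the saving from exome capture follows directly from the sizes quoted above. The short calculation below uses only those figures; it assumes equal sequencing depth for both approaches and ignores off-target reads and capture efficiency, so it gives an upper bound rather than a measured saving.

```python
GENOME_MB = 4063.0   # lentil genome size quoted in the text
GENIC_MB = 130.0     # genic region quoted in the text
CAPTURE_MB = 85.0    # exome capture target quoted in the text


def percent_of(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


genic_pct = percent_of(GENIC_MB, GENOME_MB)      # share of the genome that is genic
capture_pct = percent_of(CAPTURE_MB, GENOME_MB)  # share of the genome targeted by capture
fold_reduction = GENOME_MB / CAPTURE_MB          # fewer bases per sample at equal depth
```

The genic share works out to about 3.2%, matching the figure in the text, and the capture target is about 2.1% of the genome, so roughly 48 times fewer bases need to be sequenced per accession at a given depth.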

Concluding remarks

In lentil, NGS technologies have been used to identify SSRs, SNPs and candidate genes expressed under specific environmental conditions. The SSRs and SNPs have been used to develop markers to describe genetic variation in germplasm collections and its association with different phenotypic traits, including seed quality, disease resistance and micronutrient concentration [44,45,46,47,48]. However, NGS technology is not as widely used in lentil breeding programs as in other crops, despite the low cost of genotyping a large number of individuals. In other crops, NGS technologies have been used to develop pan-genomes and super pan-genomes by sequencing or re-sequencing large numbers of accessions. In lentil, these efforts have not been made because genome complexity impairs the assembly and scaffolding of short reads generated by second generation sequencing. Third generation single-molecule sequencing technologies have since emerged that reduce the cost of sequencing and can sequence long DNA fragments, making assembly and scaffolding of complex genomes easier. Hence third generation sequencing can overcome the problems associated with the large genome size of lentil, and in the coming years the use of NGS will boost genetic gain in lentil.