10.1 Introduction

Chickpea (Cicer arietinum L.) is a self-pollinated, diploid (2n = 2x = 16) legume crop primarily grown by resource-poor farmers in the semi-arid regions of the world. The nitrogen fixing ability and high protein content of chickpea make it a crop of high economic importance in developing countries. Based on the grain size and seed coat color, two market classes of chickpea, namely desi and kabuli, are cultivated extensively. Advances in genomics technologies facilitated the adoption of genomics tools in crop improvement referred as genomics-assisted breeding (Goodwin et al. 2016; Varshney et al. 2014; Koboldt et al. 2013; Metzker 2010). The availability of draft genomes of major cereals including rice (Oryza sativa; IRGSP 2002; Goff et al. 2002), sorghum (Sorghum bicolor; Paterson et al. 2008), maize (Zea mays; Schnable et al. 2009), and legumes such as pigeonpea (Cajanus cajan; Varshney et al. 2012) facilitates the deployment of genomic information in crop improvement.

Owing to the economic importance of chickpea and given the usefulness of draft genomes, the International Chickpea Genome Sequencing Consortium (ICGSC) led by ICRISAT decoded the draft genome of kabuli genotype CDC frontier. This chapter mainly summarizes the tools and strategies used for generating the draft genomes and various analyses for understanding the genome architecture of chickpea and synteny with other sequenced legumes. In addition, this chapter also provides a comparative view of both desi and kabuli genomes available.

10.2 Strategies and Tools for Sequencing

The chickpea genome sequencing was carried out using the short reads from Illumina HiSeq 2000 and bacterial artificial chromosome (BAC) end sequencing (Varshney et al. 2013). The Illumina short reads were assembled into contigs which were further used to construct the scaffolds. The BAC end sequencing was used to form the backbone for the scaffolding. Further, the high-density genetic maps were used to anchor the scaffolds on to the pseudomolecules. The unanchored scaffolds and contigs were reported separately along with pseudomolecules, as a part of the final assembly. Paired-end sequencing libraries (11 in total) were formulated with insert sizes of ~170 bp, 500 bp, 800 bp, 2 Kb, 5 Kb, 10 Kb, and 20 Kb. For the development of assembly, scaffolds’ construction and gap closure, SOAPdenovo2 (Luo et al. 2012) was used. The genetic marker sequences along with flanking regions were searched in the assembly using BLASTN (Altschul et al. 1997) and also using e-PCR (Schuler 1997) in case of the presence of only primer sequences, to place these sequences on the scaffolds. The microbial contamination was eliminated from the genome assembly using searches against the bacterial and fungal genomes with the help of Megablast. Further, BLAT (Kent 2002) was used to screen for contamination of organellar DNA, chloroplast genome sequence of chickpea, and Lotus (L. japonicus) mitochondrion in the chickpea genome assembly. The completeness of the genome assembly was verified by mapping the transcriptome assembly contigs to the genome assembly using BLAT. The exome coverage prediction was carried out by mapping the core eukaryotic genes, identified by core eukaryotic gene mapping approach CEGMA v.2.3 (Parra et al. 2007), to the genome assembly.

10.3 Assembly

A total of 153.01 Gb of sequence data was generated for the development of the first draft genome assembly in chickpea. This resulted in coverage of 207.32X from 11 genomic libraries sequenced using Illumina platform with insert sizes ranging from 180 bp to 20 Kb. The high-quality sequence data of 87.65 Gb after filtering was used to assemble into 544.73 Mb of genome sequence scaffolds. The N50 for these scaffolds was 645.3 Kb, and the maximum size of these scaffolds was found to be 6.17 Mb. The chickpea genome is estimated to be of 738.09 in size which shows that the assembled scaffolds were able to cover 73.8% of the genome. The non-assembled genome is believed to be enriched with repetitive sequences as observed by increased read depth in repeat-containing regions in comparison with non-repeat regions and also by having four-fold lower k-mer diversity in non-assembled fraction as compared to non-repetitive assembled fraction. An improved assembly spanning 532.29 Mb with a N50 of 39.99 Mb having 7,163 scaffolds was generated with the help of 46,270 repeat masked paired bacterial artificial chromosome (BAC) end sequences. The anchoring of 65.23% of this assembly to eight genetic linkage groups was carried out with the help of 1,292 genetic markers reported in previous studies. This data was used to obtain eight pseudomolecules namely, Ca1-Ca8. The anchoring of 93.4% of these scaffolds was validated using restriction-site-associated DNA (RAD) single nucleotide polymorphism (SNP) markers that were discovered between two segregating recombinant inbred line populations. This approach resulted in the identification of low-proportion chimeric scaffolds, i.e., 1.7% of the total scaffolds which amounted to 4.6 Mb of mis-assembled genomic sequence. These chimeric scaffolds were processed by excluding the erroneous part of the scaffold sequences and removing them from the pseudomolecule models. Another synteny-based approach was used to anchor the scaffolds onto the pseudomolecules. In this approach, regions lacking genetic support but showing conserved synteny with Medicago (Medicago truncatula) were anchored to pseudomolecules. The regions supported by synteny are hypothetical placements in the pseudomolecules which will be eventually updated upon availability of improved genetic maps supporting these regions or if there are modifications in the assembly of Medicago. The RAD genotyping data was used to anchor 75% of the scaffolds, while the synteny-based approach by comparing scaffolds with Medicago was used to anchor rest of the 25% scaffolds to the pseudomolecules.

10.4 Repetitive Sequences

Repeat regions in the genome were identified using Tandem Repeat Finder (Benson 1999) which resulted in a total of 127,377 such regions. It was observed that 84.9% repeat regions occurred in span of <1 kb, while in gap-spanning clones repeat regions were present in the tracts of 10–103 kb. Out of the total repeat regions identified, 29,018 regions could not be assembled due to low-sequence complexity and the occurrence of such repeats was masked by adding Ns within the pseudomolecules. Nearly half of the chickpea genome consists of transposable elements (TEs) and unclassified repeat elements similar to the percentage observed in other legume crops such as Medicago (30.5%), pigeonpea (Cajanus cajan; 51.6%), and soybean (Glycine max; 59%). The most abundant transposable elements are long terminal repeat (LTR) which covers more than 45% of total nuclear genome. The centromere regions are made up of the microsatellites which are dispersed as tandem repeats. The most found tandem repeats within the genome are 163-bp (18%), 100-bp (30%), and 74-bp (13%) unit repeats and constitute a total of 61% of total tandem repeats identified. The 163-bp and 100-bp units correspond to already identified chickpea microsatellites, CaSat1 and CaSat2, respectively, while 74-bp repeat is similar to dispersed highly repetitive element CaRep2. Tandem repeat finder was used to filter for the genomic regions >3 copies and >60 bp consensus length across the genome assembly. The genome assembly was scanned for the presence of transposable elements combining two approaches of de novo and homology-based searches. LTR_Finder v 1.03 (Xu and Wang 2007), PILER-DF v 1.0 (Edgar and Myers 2005), and RepeatScout v 1.05 (Price et al. 2005), all three de novo software, were used to build a chickpea repeat database. Repeat Masker v 3.2.7 (http://repeatmasker.org/, v 3.2.2) was deployed to identify repeats with the help of the constructed chickpea repeat database and Repbase (Jurka 1995). Along with these approaches, Repbase was also used to identify repeat-related proteins in the genome using RepeatProteinMask (http://repeatmasker.org/, v 3.2.2).

10.5 Gene Annotation

Gene prediction was done using combined approaches of ab initio modeling and homology-based searches with gene sets taken from six closely related legume species and CaTA transcript sequences. These approaches resulted in a non-redundant set of 28,269 gene models where average transcript and coding sequence size were 3,055 bp and 1,166 bp, respectively. Majority of these genes show homology with the gene models present in TrEMBL and Interpro (Zdobnov and Apweiler 2001) databases. The functions were assigned to 89.73% genes, while the rest 2,904 genes remained unannotated. The gene density was observed to be on the rise toward the ends of the pseudomolecules. The nonprotein coding genes resulted in the prediction of 684 tRNA, 478 rRNA, 420 miRNA, and 647 snRNA genes in the genome. The 454/Roche transcriptome data generated for CDC frontier line was mapped to the genome assembly for validation of the gene space capture by the draft genome assembly. The gene coverage is calculated to be ~90.8%. More than 98% homologs for core eukaryotic genes were found to be conserved in the draft genome assembly. BLASTP search using the chickpea proteome as query against the proteomes of Medicago, soybean, pigeonpea, and Lotus (Lotus japonicus) was carried out to estimate the conservation of chickpea gene models present in mentioned species. Proteome of chickpea was found to be most similar to Medicago (89.7% chickpea proteins correspond to Medicago proteins) and least similar to Arabidopsis (Arabidopsis thaliana: 79.2% were found similar to Arabidopsis proteins).

Three approaches homology-based, de novo, and transcript sequence-based were used for the gene prediction. The results of these approaches were fed to GLEAN (Elsik et al. 2007), which after multiple filtration resulted in a gene set of 28,256 genes. Further, CEGMA identified 453 core genes which are highly conserved across all eukaryotes. Out of these 453 core genes, 13 genes did not align to any gene with the set defined by GLEAN and rest were found present in the genome and hence were added to a final set resulting in 28,269 genes. BLASTP against SwissProt and TrEMBL databases (Magrane and Consortium 2011) was used to assign functions to the final predicted gene set. The presence of motifs and domains in genes was detected using InterProScan against protein databases which include Pfam (Punta et al. 2011), PROSITE (Sigrist et al. 2010), SMART (Letunic et al. 2012), PRINTS (Attwood et al. 2003), PANTHER (Thomas et al. 2003), and ProDom (Corpet et al. 2000). Genes were assigned gene ontology IDs, and with the information obtained from KEGG database (Kanehisa and Goto 2000) annotated with their associated pathway. tRNAscan-s.e.m. v1.23 (Lowe and Eddy 1997) was used to scan for tRNA genes, and INFERNAL v0.81 (Nawrocki et al., 2009) was used to predict snRNA and miRNA genes by searches against the Rfam database.

10.6 Genome Duplication

The genome duplication events occur in the genome over the course of evolution for a species. The scanning of the genome sequence for the presence of segmental duplications resulted in 110 syntenic blocks that contained 5 to 62 gene pairs. The divergence time was observed to be 58 million years (Myr) ago based on the rates of synonymous substitution per synonymous site (Ks) for the syntenic blocks. The divergence time is in consistence with genome duplication event that occurred at the base of Papilionoideae. The galegoid (Medicago, Lotus and chickpea) and millettioid (soybean, pigeonpea) clades in this family separated around 54 Myr ago. The chickpea species diverged from Lotus around 20–30 Myr ago and from Medicago around 10–20 Myr ago based on the analysis of four-fold degenerate sites using the calculation of genetic distance–transversion rates.

10.7 Synteny with Allied and Model Genomes

Synteny analysis was carried out for chickpea with 6 other closely related crops, namely Medicago, Lotus, pigeonpea, soybean, Arabidopsis, and grape (Vitis vinifera). The synteny analyses revealed extensive conservation between chickpea, and other species shows that high percentage of chickpea assembly has conserved regions matching with one or more species included in the synteny analysis. The maximum number of conserved syntenic blocks (>10 kb) was seen in Medicago, while it was substantially fragmented with Lotus. When compared with legumes, soybean showed the maximum number of syntenic blocks depicting its recent polyploidy ancestry, while fragmented colinearity with pigeonpea suggests the incompleteness of the pigeonpea genome assembly. The 28,269 gene models of chickpea were compared with 230,161 gene models from four legumes and two non-legumes resulting in 15,441 orthologous groups using reciprocal pairwise approach. Of these, 5,940 orthologous groups were observed having a single chickpea gene indicating simple orthology relationship, while 4,468 chickpea genes were observed in species-specific groups, with no ortholog but having paralogs within the genome. These groups may be attributed to the structural rearrangements that lack simple orthology followed by duplication, as is observed in the case of NBS-LRR disease resistance genes. The percentage of the total predicted gene models which were classified into orthologous groups by OrthoMCL gives insights for the genes which have history of duplication after the divergence of legumes from Arabidopsis and grape. The chickpea genome may be attributed to a series of gene loss and gene duplications as it is the same time interval required for whole-genome duplication event at the base of the Papilionoideae. Several genes from each of the 7 species could not be placed into orthologous groups which may be because of the heterogeneity in gene prediction for each of these species while it may also be due to lineage-specific evolution events. MUMmer (Delcher et al. 2003) and SyMAP (Soderlund et al. 2011) were used in combination for the synteny analysis. Classification of orthologous genes and gene clusters was carried out using OrthoMCL (Li et al. 2003).

10.8 Comparison of Desi and Kabuli Genomes

There was another effort made towards the genome sequencing of chickpea, by whole genome sequencing of the ICC 4958 genotype which is desi type (Jain et al. 2013). There exist various differences in the final assemblies reported by two efforts mentioned above (Table 10.1). The genome size of the kabuli genome was 532.29 Mb, and in case of desi genome, it was 519.84 Mb. The number of gene models reported for the two genomes was similar: 28,269 in kabuli and 27,571 in desi. The number significantly differs for the number of scaffolds assembled and N50 for the two genome assemblies. The number of scaffolds is comparatively too less in case of kabuli genome, and also, the N50 value is comparatively high which states that the kabuli genome is much better in terms of these assembly parameters. As compared to desi, kabuli genome has ~2.8 times sequence anchored at pseudomolecules, and the longest scaffold in kabuli is more than twice the size of longest scaffold observed in desi genome. The number of miRNA, tRNA, and rRNA fragments is significantly higher in the kabuli genome. The GC content is bit higher in the kabuli genome which may be attributed to only Illumina technology used to develop the assembly. The desi genome is more fragmented in comparison with the kabuli genome, and kabuli genome will serve a better resource for genome-based studies in chickpea.

Table 10.1 Comparison of the features of first two draft genome assemblies in chickpea

10.9 Subsequent Validation

Both kabuli and desi assemblies were subsequently assessed using a chromosomal genomics approach to determine whether the differences in the genome assemblies represent real differences in genome structure or are artifacts of assembly of one or both genomes. Isolated chromosomes from each of the varieties were sequenced and the data was mapped to the pseudomolecules (Ruperao et al 2014). This analysis demonstrated that the physical genomes of kabuli and desi chickpea types are very similar and the observed differences in the sequence assemblies are due to major errors in the desi genome assembly, including the misplacement of whole chromosomes, portions of chromosomes, and the inclusion of a large portion of sequence assembly which does not appear to be from the genome of chickpea. In contrast, the kabuli assembly is mostly correct. Based on this analysis, updated versions of both kabuli and desi genome assemblies have been produced (http://doi.org/10.7946/P2G596 and http://doi.org/10.7946/P2KW2Q), with GBrowse access at http://www.cicer.info/.

10.10 Conclusion

The chickpea genome sequencing has provided the much needed thrust to genomics based breeding approaches. Further, the re-sequencing of the germplasm will help in better understanding of the diversity present in Cicer species. The resource generated from these sequencing efforts will help in improvement of the genome assembly with enhanced coverage. The improved genome assemblies will help in identification of regions linked to important agronomic traits. These sequencing efforts are expected to enhance the chickpea yield and its resistance to biotic and abiotic stresses.