3.1 Introduction

Climate change and rapid population growth pose the biggest challenges to food and nutritional security across the globe. By 2050, the global population is predicted to increase by 55 to 70%; as a result, the proportion of people at risk of hunger may increase to around 8% (van Dijk et al. 2021a). With diminishing resources and limited arable land, sustainable production that meets food and nutritional demands has become a daunting task. Plant breeders and geneticists are under constant pressure to develop improved crop varieties that are both climate-resilient and high-yielding. Low genetic diversity, prolonged breeding cycles, and limited access to high-quality seeds for cultivation have been serious obstacles to achieving greater genetic gains (Varshney et al. 2020). Although conventional breeding programs have contributed to the development of improved varieties, Sustainable Development Goal 2 (“zero hunger”), adopted by the United Nations, advocates the integration of modern breeding approaches in agriculture (Varshney et al. 2018).

Since the rediscovery of Mendel's laws, there has been a paradigm shift from a phenotype-based understanding of trait genetics to the use of molecular markers, genomics, genomes and sequence-based trait dissection (Varshney et al. 2019; Thudi et al. 2023). During the last two decades, genomics and next-generation sequencing (NGS) technologies have not only revolutionized our understanding of the molecular basis of economically important traits, but also increased the rate of adoption of modern breeding approaches to develop climate-resilient crop varieties (Thudi et al. 2020; Varshney et al. 2021a). To date, draft genomes of more than 1000 plants representing 788 species are available in the public domain (Sun et al. 2021). Beyond draft genomes, gold-standard and even platinum-standard reference genomes are available for crops like rice (Zhou et al. 2020) and also for cetacean species (Morin et al. 2020). Efforts are also underway to sequence all known eukaryotic species through “The Earth BioGenome Project,” which is expected to provide insights into the biology of life (Lewin et al. 2018). Apart from draft genomes, several germplasm lines, including wild species accessions, have been sequenced in crops including pearl millet (Varshney et al. 2017a), chickpea (Thudi et al. 2016; Varshney et al. 2021b), pigeon pea (Varshney et al. 2017b) and rice (Wang et al. 2018; Stein et al. 2018). Development of pangenomes and super-pangenomes is underway in many crop species (Khan et al. 2020). With the rapid growth of biological data in the public domain, the rate-limiting factor in genomics research has shifted from sequencing to computational analysis (Kathiresan et al. 2017). Statistical and bioinformatics tools and algorithms developed earlier are becoming obsolete, and computational tools and algorithms that handle “BIG data” are gaining importance (Edwards et al. 2009; Batley and Edwards 2016).

In this chapter, we review NGS data analysis and the databases that have been developed to store and retrieve biological information produced by different omics approaches. We also discuss the computational tools and approaches that enable pangenome development, haplotype identification and genome editing. Besides highlighting the challenges, we outline the scope for improving bioinformatics approaches so that they can be used effectively in crop improvement.

3.2 Understanding Genetic Diversity and Trait Mapping

Genetic diversity plays a major role in gaining greater insight into, and dissecting, complex traits. Prior to the advent of molecular markers, the phenotypic plasticity of a crop species was assessed using simple experimental analyses and programs like XLSTAT or SPSS (Addinsoft 2021; IBM Corp 2017). In addition, statistical packages like INDOSTAT are used for analysis of variance, D² statistics, canonical roots, path analysis, etc. (Khetan and Ameerpet 2015). The Statistical Tool for Agricultural Research (STAR) has modules for randomization and layout of experimental designs for crop research, data management, and fundamental statistical analysis, including descriptive statistics, hypothesis testing, and ANOVA of designed experiments (Gulles et al. 2014). The stability of a crop over different locations and years is a crucial consideration in plant breeding. Software such as GGE biplot, GEA-R, STABILITYSOFT, and AMMISOFT is used to analyse Genotype × Environment (G × E) interactions (Yan 2001; Pacheco et al. 2015; Pour-Aboughadareh et al. 2019; Gauch and Moran 2019). These tools examine stability and performance simultaneously, allowing a comprehensive understanding of the crop's behaviour across different environments and conditions (Table 3.1).

Table 3.1 List of commonly used software packages for plant breeding

With the availability of molecular markers, efforts were made to map the genomic regions or genes responsible for complex traits using both linkage (QTL) mapping and linkage disequilibrium-based mapping (association analysis). The most common software packages used for mapping genomic regions are Mapmaker-QTL, QTL Cartographer, Win-QTL Cartographer, PLABQTL and MapQTL, which include command-line programs (Lincoln et al. 1993; Basten et al. 2002; Wang 2005; Utz and Melchinger 1996; Van Ooijen and Maliepaard 1999; Bradbury et al. 2007; Gupta et al. 2015). Mapmaker-QTL can only perform simple interval mapping (Lincoln et al. 1993). The most versatile QTL mapping software is QTL Cartographer. A range of software tools, including the widely used STRUCTURE, are available for determining population structure (Pritchard et al. 2000). With STRUCTURE, the number of subpopulations can be inferred using either all marker data or a subset of unlinked markers from the marker collection. Alternatively, principal component analysis (PCA) can be performed on the marker data and the first few components used as covariates to adjust for population structure. Association analysis can be done with TASSEL. Even without forming a core collection, a population can be tested for its suitability as an association panel and then used directly for TASSEL analysis. However, some prerequisite analyses are required, namely population structure, kinship analysis, and PCA (Bradbury et al. 2007). TASSEL uses marker data to calculate kinship, which helps to account for family relatedness and population structure (Table 3.1) (Gupta et al. 2015).
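The idea of deriving kinship from marker data can be illustrated with a minimal sketch. It assumes biallelic markers coded as 0, 1 or 2 copies of the alternate allele and uses a simple identity-by-state (IBS) similarity; this is not the estimator implemented in TASSEL, and the function names and toy panel are hypothetical.

```python
from itertools import combinations


def ibs_similarity(g1, g2):
    """Mean identity-by-state between two individuals.

    Genotypes are coded 0, 1 or 2 (copies of the alternate allele);
    None marks a missing call and that marker is skipped.
    """
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)
              if a is not None and b is not None]
    if not shared:
        raise ValueError("no markers genotyped in both individuals")
    return sum(shared) / (2 * len(shared))


def kinship_matrix(genotypes):
    """Return pairwise IBS similarities keyed by (name, name)."""
    names = sorted(genotypes)
    matrix = {(n, n): 1.0 for n in names}
    for a, b in combinations(names, 2):
        s = ibs_similarity(genotypes[a], genotypes[b])
        matrix[(a, b)] = matrix[(b, a)] = s
    return matrix


# Hypothetical toy panel: three lines scored at five biallelic markers.
panel = {
    "line_A": [0, 2, 2, 0, 1],
    "line_B": [0, 2, 2, 0, 2],
    "line_C": [2, 0, 0, 2, None],
}
K = kinship_matrix(panel)
print(K[("line_A", "line_B")], K[("line_A", "line_C")])
```

In an association model, a matrix of this kind would typically enter as the covariance structure of a random effect, while the leading principal components enter as fixed covariates.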

3.3 Identification and Understanding Key Genes Using Multi-Omics Approaches

Interpreting molecular complexity and variability at several levels, such as the genome, transcriptome, proteome and metabolome, is necessary for a comprehensive understanding of an organism's entire metabolism. The data from these levels are together referred to as “multi-omics” data. Multi-omics data provide insights into the flow of biological information across levels and can help to elucidate the mechanisms underlying a biological state of interest.

In the last decade, technological advances in DNA sequencing (Le Nguyen et al. 2019), transcriptome analysis via RNA-seq (Mashaki et al. 2018), SWATH-based proteomics (Zhu et al. 2020) and metabolomics via UPLC-MS and GC-MS (Balcke et al. 2012) have contributed significantly to the growth of biological data. The first omics field to emerge was genomics, which deals with the study of complete genomes. Genomic studies such as QTL/association mapping have been used to detect genomic regions associated with agronomically important traits (Varshney et al. 2014, 2021b; Bhatta et al. 2019; Thudi et al. 2021; Yoshida et al. 2022) and provide a basic framework for other omics approaches. Additionally, genes differentially expressed under several biotic and abiotic stresses have been identified through transcriptomics studies in several crop plants (Nayak et al. 2017; Channale et al. 2021; Chen et al. 2022; Pal et al. 2022). Gene expression atlases provide insights into the subsets of genes expressed during different growth stages in pigeon pea (Pazhamala et al. 2017), chickpea (Kudapa et al. 2018) and groundnut (Sinha et al. 2020). The spatial transcriptomics method developed by Giacomello et al. (2017) enables high-throughput, spatially resolved transcriptomics in plant tissues using a combination of histological imaging and RNA sequencing. Proteomics supports functional analysis of the translated regions of the genome, while metabolomics serves as a diagnostic tool for assessing plant performance under different stimuli (Villate et al. 2021). A number of repositories have been developed to organise the data generated from different experiments and sequencing studies. These repositories include DNA, RNA and protein sequence databases, as well as specialized databases for specific information (Lai et al. 2012; Thudi et al. 2020).
Based on the type of omics data, databases can be classified into four classes: (1) genomics databases, which contain nucleotide or genomic sequences; (2) transcriptomics databases, which include functional RNA sequences; (3) proteomics databases, which contain information on amino acid sequences and protein structures; and (4) metabolomics databases, which contain information about metabolites and metabolic pathways (Table 3.2, Fig. 3.1a).

Table 3.2 Summary of widely used databases in plant genetics and breeding research
Fig. 3.1 Summary of databases and applications of artificial intelligence in agriculture: (a) databases developed to store and retrieve the biological information produced by various omics approaches, including genomics, transcriptomics, proteomics and metabolomics (for example P G D D, M G P R and P L U T O); (b, c) different kinds of predictive analysis based on AI, with panel b listing the benefits and challenges of artificial intelligence and panel c listing approaches to predictive analysis, including genomic selection and gene editing

Examples include PlantTFDB (planttfdb.gao-lab.org/), a database of plant transcription factors that is widely used for expression analysis and functional genomics and allows users to retrieve sequence information for known plant transcription factors. The Phytozome database (phytozome-next.jgi.doe.gov/) provides access to selected plant genome sequences and an improved platform for comparative analysis of genomes. Through such resources, breeders have access to useful tools like molecular markers that can speed up crop improvement programs. In chickpea, the “CicArVarDB” database provides information on single nucleotide polymorphisms (SNPs) and insertion/deletion (Indel) variations that can be utilized for advanced genetics research (Doddamani et al. 2015). Additionally, the AgBioData consortium (Harper et al. 2018) works across different agriculture-related databases to identify approaches for integrating and standardizing database operations. This collaborative effort aims to develop database products that are more interoperable. The major challenge is to manage and translate the sequence information for crop improvement.

3.4 Evolution of Sequencing Technologies and Tools

About 25 years after the discovery of the double-helical structure of DNA, first-generation sequencing technologies such as Sanger sequencing and Maxam and Gilbert sequencing became available for sequencing both small and large genomes. A plethora of sequencing technologies has evolved during the last 15 years, bringing increases in data output, read lengths, efficiency, and range of applications. Second-generation sequencing technologies improved sequencing throughput, turnaround time and read length at low cost. Short-read sequencing technologies (up to 600 bp) have been widely used in genomics research, as they support a wide range of statistical analyses using cost-effective pipelines (Heather and Chain 2016). However, short reads complicate the reconstruction of larger fragments or of the original molecules because of the presence of homopolymers. Long-read sequencing (up to 10 kb) is a highly accurate approach that can be used to sequence traditionally challenging genomes and to facilitate de novo assembly; it also helps in identifying transcript isoforms and structural variants, and it yields better pangenomes than short-read sequencing. In rice, third-generation long-read sequencing was used to construct a pangenome from 105 accessions, which revealed 604 Mb of novel sequence not present in the reference genome (Zhang et al. 2022). Specialised analytical tools that take the properties of long-read data into account are needed, but the speed at which these tools are being developed can be daunting. Currently, more than 350 long-read analysis tools are available, most of which are used with the Nanopore and SMRT sequencing platforms (Amarasinghe et al. 2020). To help users choose among them, the publicly available database “long-read-tools.org” catalogues long-read analysis tools (Amarasinghe et al. 2021).
Analysing and interpreting NGS data requires highly qualified and competent bioinformaticians. Accurate downstream analysis of sequencing data depends on appropriate analysis tools and begins with the conversion of raw signal data to sequence data.

Sequencing data analysis includes raw read quality control, sequence alignment, variant calling, genome assembly, genome annotation and other advanced analyses. Numerous bioinformatics tools have been developed and used for sequence analysis (Table 3.3). It is essential to evaluate the raw sequence data to ensure their quality before any subsequent analysis. This evaluation gives a broad overview of read counts and lengths, read coverage, contaminating sequences and the level of sequence duplication. In the first stage, adapter sequences and low-quality sequences are removed from whole-genome sequencing data through a quality assessment process. FastQC is a well-known bioinformatics tool for quality control of sequencing reads (Andrews 2010). More recently, the fastp tool has also been used for quality control, base correction and filtering of sequencing reads. fastp is two to five times faster than previous approaches (Chen et al. 2018) and performs both read quality filtering and adapter trimming.
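The quality-filtering step can be sketched in a few lines. The example below assumes four-line FASTQ records with Phred+33 quality encoding and keeps reads whose mean base quality reaches a threshold; it is an illustration of the principle, not a reimplementation of FastQC or fastp, and the function names and the threshold of 20 are assumptions.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20):
    """Keep reads whose mean base quality is at least min_mean_q."""
    return [(rid, seq) for rid, seq, qual in parse_fastq(handle)
            if qual and mean_phred(qual) >= min_mean_q]


# Hypothetical two-read input: 'I' encodes Q40 and '#' encodes Q2.
fastq_text = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGT\n+\n########\n"
kept = filter_reads(io.StringIO(fastq_text))
print(kept)
```

Production tools additionally trim adapters and low-quality read ends rather than discarding whole reads, which this sketch does not attempt.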

Table 3.3 Bioinformatics tools used for NGS data analysis

The second step is read/sequence alignment, in which the sequences are aligned to a reference genome. When no reference genome is available, de novo genome assembly is used to generate contigs by joining overlapping regions. This step is the most crucial in the entire workflow. A variety of tools and algorithms align the sequence reads precisely and quickly to the appropriate positions in the reference genome. Popular aligners include BWA (Li and Durbin 2009), Bowtie2 (Langmead and Salzberg 2012), CUSHAW3 (Liu et al. 2014), MOSAIK (Lee et al. 2014), and Novoalign (http://novocraft.com/). MOSAIK is a mapping tool that can align reads produced by all the major sequencing technologies. Minimap2 is a flexible pairwise nucleotide sequence aligner and mapper that can be used with short reads, assembly contigs, and long noisy genomic and RNA-seq reads (Li 2018). The lra tool requires less time and memory for alignment than Minimap2 (Ren and Chaisson 2021). The recently developed kngMap (k-mer neighbourhood graph-based mapper) is specifically designed to align long noisy reads to a reference genome (Wei et al. 2022).

The third step is variant calling. Differences between the sequenced reads and the reference sequence are called variants. SNPs, Indels, presence/absence variations (PAVs), copy number variations and haplotype blocks are detected using variant calling tools. These tools include SAMtools (Li et al. 2009), Genome Analysis Toolkit HaplotypeCaller (GATK-HC) (McKenna et al. 2010), Freebayes (Garrison and Marth 2012), SNPSVM (O’Fallon et al. 2013), VarScan (Koboldt et al. 2013), DeepVariant (Poplin et al. 2018), Torrent Variant Caller (TVC) (Life Technologies, Rockville, MD), etc. Numerous automated workflows have been developed to streamline the variant calling process. These workflows integrate various aligners and variant callers with other upstream and downstream tools to provide an end-to-end solution (Kanzi et al. 2020). Tools such as ToTem and Appreci8 (Tom et al. 2018; Sandmann et al. 2018) are completely automated variant calling pipelines. ToTem is becoming popular because it offers automated pipeline optimization and efficient analysis management. Appreci8 improves calling accuracy by running eight different tools on the same task and then filtering and combining their outputs. The final step is data visualization, for which various tools are available depending on the experiments and research objectives. One popular visualization tool for reference genomes is the Integrative Genomics Viewer (Thorvaldsdottir et al. 2012). VISTA is another visualization tool, which can be used to compare two genomic sequences.
For biologists with little or no knowledge of Perl or Python, desktop solutions covering a wide range of genomic analyses, including transcriptomics, variant calling, epigenomics, metagenomics and comparative genomics, are available, such as Qiagen CLC Genomics Workbench, geWorkbench, Partek Genomics Suite, JMP Genomics, DNA Baser-NextGen Sequence Workbench, etc.
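The core idea behind variant calling, comparing aligned reads with the reference at each position, can be shown with a deliberately naive pileup sketch. It assumes gap-free alignments given as (start, sequence) pairs, reports only positions where a non-reference base dominates, and ignores base qualities, Indels and heterozygous calls; the function name and thresholds are hypothetical and do not reflect how any of the tools named above work internally.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=3, min_alt_fraction=0.8):
    """Report positions where the reads support a non-reference base.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first read base; gaps are not handled.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_alt_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


# Hypothetical toy data: three reads that all carry an A at position 5.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAAGT"), (2, "GTAAGTAC"), (1, "CGTAAGTA")]
snps = call_snps(reference, reads)
print(snps)
```

Real callers replace the fixed thresholds with statistical or machine learning models of sequencing error, which is where the tools differ most.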

During NGS analysis, numerous intermediate analysis and result files are generated that require large amounts of storage. Converting these complicated NGS data files into knowledge about important traits is difficult, especially because large aggregated volumes of variants or heterogeneous sequencing data require high-performance computational resources. After analysis, NGS data can be interpreted effectively using machine learning-based techniques.

3.5 Approaches for Development of Genome and Pangenome Assemblies

Crop wild relatives harbour large genetic diversity and the ability to survive under various biotic and abiotic stresses. Crop domestication and evolution have significantly decreased the genetic diversity in cultivated species, which has led to the loss of key loci that govern crucial traits. Traditional crop improvement approaches involve selecting superior traits from either cultivated varieties or wild relatives and utilizing them in breeding programs (Dempewolf et al. 2017). In the course of selection, crops have become more susceptible to different stresses owing to the impact of climate change and the evolution of pathogens and pests. To address these limitations, it is necessary to utilize crop wild relatives, which are known to carry genes for several biotic and abiotic stress tolerance traits that were lost during domestication or breeding. As a result of advances in sequencing technologies, reference genome sequences have become accessible for a number of crops, serving as the foundation for efforts to boost crop improvement programs (Varshney et al. 2017a, 2017b). In addition to genomes of cultivated crops, de novo assembled genomes of a number of wild relatives have been made available. The idea of pangenomes is also being adopted more widely, owing to the growing recognition that a single reference genome cannot capture the diversity contained within a species.

A pangenome is the complete collection of genes or DNA sequences in a species, and it provides a useful resource for functional genomics and evolutionary studies that can be used for crop improvement. Pangenomic studies have been conducted in various model and crop plants, including Arabidopsis, stiff brome, wheat, cabbage, tomato, soybean, rice, rapeseed, barley, chickpea and sorghum (Hurgobin et al. 2018; Gao et al. 2019; Jayakodi et al. 2020; Barchi et al. 2021; Ruperao et al. 2021; Varshney et al. 2021b; Jha et al. 2022) (Table 3.4). Genome assembly is the process of arranging nucleotides in the proper order. Sequence read lengths are currently far shorter than most genomes or even most genes; therefore, reads must be assembled to construct a genome or pangenome. In plants and other eukaryotic organisms, genes are found in the same physical place on the chromosome, but copy number and repeat content can vary, which makes assembly more difficult. Pangenomes have been constructed via de novo, iterative, and graph-based assembly techniques. De novo assembly is the simplest and most straightforward approach to pangenome development. It assembles reads using overlapping regions and does not require a reference genome. It requires high-depth sequencing of all targeted accessions and creates a separate de novo assembly for each accession. Comparing the resulting individual assemblies identifies conserved and variable genomic regions across the genomes. Advances in long-read sequencing technologies and complementary strategies such as Hi-C and BioNano maps make it possible to obtain high-quality, chromosome-level plant genomes (Miga 2020). Comparative analysis is then used to identify all types of variation and to characterize the genes found in core and dispensable regions (Mahmoud et al. 2019).
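Once each accession has been assembled and annotated, the split into core and dispensable genes reduces to set operations on a presence/absence table. The following minimal sketch assumes that orthologous genes have already been assigned shared identifiers across accessions; the accession and gene names are hypothetical.

```python
def classify_genes(presence):
    """Split genes into core (present in every accession) and dispensable.

    presence maps an accession name to the set of gene identifiers found
    in that accession's assembly.
    """
    if not presence:
        return set(), set()
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)            # every gene seen anywhere
    core = set.intersection(*gene_sets)      # genes shared by all accessions
    return core, pan - core


# Hypothetical presence/absence data for three accessions.
presence = {
    "acc1": {"g1", "g2", "g3"},
    "acc2": {"g1", "g2", "g4"},
    "acc3": {"g1", "g2", "g3", "g5"},
}
core, dispensable = classify_genes(presence)
print(sorted(core), sorted(dispensable))
```

In practice the hard part is the upstream step of deciding which genes are the same across accessions, which is what the dedicated pangenome tools address.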

Table 3.4 Summary of important tools in various plant genetics and genomics approaches

Several bioinformatics tools have been developed for assembling prokaryotic pangenomes, but these handle only less complex genomic content (Khan et al. 2020). Tools developed for constructing eukaryotic pangenomes (Table 3.4) include EUPAN (Hu et al. 2017), GET_HOMOLOGUES (Contreras-Moreira and Vinuesa 2013), PanTools (Sheikhizadeh et al. 2016), etc. EUPAN was one of the first attempts to examine eukaryotic pangenomes; it supports genome assembly, identification of core and dispensable gene sets using read coverage, and gene annotation of the pangenomic dataset. GET_HOMOLOGUES, which is written in Perl and R, can also be used for eukaryotic pangenome development. Additionally, the Panaconda tool (Warren et al. 2017) compares multiple whole-genome sequences and represents the relationships between sequences as a graph; this is the initial step towards a de Bruijn graph that can be used for pangenome construction. PanTools is also used to construct and visualize pangenomes, with the pangenome represented as a de Bruijn graph. PAN2HGENE (Silva de Oliveira et al. 2021) is a recently developed computational tool for pangenome analysis that performs automated comparative analysis of both complete and draft genomes and identifies genes that are missing from the original genome sequence.
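The de Bruijn graph representation mentioned above can be illustrated with a short sketch: every k-mer in the input sequences becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, so sequence shared between genomes collapses into common paths and variation appears as branches. This is a generic textbook construction, not the data structure used internally by PanTools or Panaconda, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def de_bruijn(sequences, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix -> suffix
    return graph


# Two hypothetical sequences that share the prefix ACG and then diverge.
graph = de_bruijn(["ACGTC", "ACGAC"], k=3)
branching = sorted(node for node, nxt in graph.items() if len(nxt) > 1)
print(branching)
```

Here the node CG has two successors (GT and GA), marking the point where the two sequences differ.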

3.6 Bioinformatics Tools Used in K-Mer Analysis

Our growing understanding of biological information, and of what the vast volume of DNA data implies, has highlighted the importance of computational methods that support sequencing technologies. Counting k-mers is an essential component of many bioinformatics techniques, such as sequence assembly, metagenomic sequencing and sequencing error correction (Melsted and Pritchard 2011). A k-mer is a subsequence of length k of a nucleotide sequence. The distribution of statistically significant k-mers in genomes and in regulatory subregions has been described in a number of recent studies (Hashim and Abdullah 2015; Cserhati et al. 2018). K-mers have also been employed in comparative studies (Cserhati et al. 2019), and the major advantages of alignment-free, k-mer-based approaches are their speed and their ability to remove biases. Most association mapping studies have been done using SNPs. However, this approach has some limitations (Rahman et al. 2018), and k-mer-based analysis is an alternative method that addresses some of them.

At its most basic, k-mer count analysis considers just two parameters: the length of the k-mer and whether the orientation of the DNA strand is known. k is normally chosen to be at least 20 and frequently falls between 20 and 31. If k is too small, the counts become redundant because the probability that a k-mer is unique to a genome is reduced. However, as k increases, so does the probability that a k-mer contains a sequencing error. A number of bioinformatics tools have been developed to analyse and make further use of k-mers. BFCounter is a program for counting k-mers in DNA sequence data (Melsted and Pritchard 2011) (Table 3.4). KAT (k-mer Analysis Toolkit) is a multipurpose tool for reference-free quality control and de novo assembly (Mapleson et al. 2017). iMOKA (interactive multi-objective k-mer analysis) is software that enables comprehensive k-mer-based analysis of large collections of sequencing data. It combines a Naive Bayes classifier augmented by adaptive entropy with a graph-based filter to reduce search time (Lorenzi et al. 2020). KmerGO identifies group-specific nucleotide sequences between two groups and can also test for association between nucleotide sequences and quantitative traits (Wang et al. 2020). KITSUNE identifies the empirically optimal k-mer length for phylogenetic analysis and provides an alignment-free alternative for comparative studies (Pornputtapong et al. 2020).
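The two parameters described above, k-mer length and strand orientation, map directly onto a small counting routine. In this sketch, when the strand is unknown each k-mer is replaced by its canonical form, taken here as the lexicographically smaller of the k-mer and its reverse complement, which is a common convention. The function names are illustrative, and dedicated counters such as BFCounter use far more memory-efficient data structures.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seq, k, canonical=True):
    """Count k-mers in seq; k-mers containing non-ACGT symbols are skipped.

    With canonical=True, a k-mer and its reverse complement are counted
    together, which is appropriate when strand orientation is unknown.
    """
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


# Small k for illustration only; real analyses typically use k of 20 to 31.
counts = count_kmers("ACGTTGCA", 3)
print(dict(counts))
```

With k = 3 this toy sequence yields six k-mers that collapse to four canonical forms, showing how strand-agnostic counting merges reverse-complementary k-mers.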

3.7 Artificial Intelligence

Artificial intelligence (AI) is the simulation of human intelligence processes by computer systems, and it holds great promise for making better use of available datasets to achieve accurate prediction and a better understanding of genetic complexity (Fig. 3.1a, b). The three cognitive skills that make up AI are learning (acquiring data and then developing algorithms to transform it into usable information), reasoning (selecting the appropriate algorithm to arrive at a desired result), and self-correction (constantly adjusting the designed algorithms to ensure that they deliver the most accurate results) (Gharaei et al. 2019). Breeders have access to an ever-growing suite of high-throughput sensors and imaging techniques for a wide range of traits and field situations. In addition, novel genomic assays that can reveal missing heritability are constantly being developed (Harfouche et al. 2019). A major challenge today is the management and utilization of big data. Combining these data with AI technologies can accelerate breeding programs, increasing productivity and supporting the development of climate-resilient crops through phenotyping, efficient and effective disease diagnosis, and precise selection of individuals for breeding (Fig. 3.1c). AI can also help breeders to determine quickly which plants grow fastest in a specific climate and which genes support plant growth and adaptation, to produce the best gene combination for a given location, and to choose traits that increase yield and fend off the effects of a changing climate.

One of the important elements of AI is machine learning (ML), which uses statistical and mathematical approaches to make use of data more efficiently and to produce accurate predictions (Ayed and Hanana 2021). ML can distinguish between various types of genomic regions, for instance active genes and pseudogenes, using features like DNA methylation (Sartor et al. 2019). Additionally, ML has been used to predict the locations of DNA crossovers (Demirci et al. 2018). Single-cell RNA sequencing is a fascinating new area in which ML is essential (Speranza et al. 2021; van Dijk et al. 2021b). This method makes it possible to examine cellular development and responses to environmental stimuli in diverse tissues. Digital plant phenotyping has been an active area of study for accelerating plant science. Different imaging systems can be used to study plants at various levels, for example, real-time stomata phenotyping using microscopic observation (Toda et al. 2021). Numerous sensors have been employed for accurate phenotyping, including spectral, lidar/laser, fluorescence and ultrasonic sensors and thermography (Qiu et al. 2018).

AI systems currently in use include neural networks (NNs) and extreme gradient boosting (XGBoost), both of which are popular machine learning models employed for a variety of tasks, including regression and classification (Chen and Guestrin 2016). Deep learning techniques are based on neural networks, sometimes referred to as artificial or simulated neural networks, and are a subset of machine learning. AI in agriculture has shown impressive results in image-based disease identification using deep learning models trained on publicly available image datasets (Mohanty et al. 2016). XGBoost, in contrast, is a tree-based method belonging to the supervised branch of machine learning. In maize, different AI models were used to predict yield, and XGBoost gave the better results (Nyeki et al. 2019). The internal workings and decision-making procedures of these AI systems are opaque: the results can be seen, but it is not clear why a particular choice was made. Consequently, there is a need for new explainable AI algorithms that not only provide a prediction model but also give the reasons for each choice. This is the first stage in the development of next-generation AI (Harfouche et al. 2019).

3.8 Identification of Superior Haplotype for Crop Improvement

Second-generation molecular markers have been used successfully in plant breeding for the development of improved varieties and in genome mapping, but they give low QTL resolution (Zargar et al. 2015). Advances in NGS technologies provide sequence-based markers (SNPs) with wide, high-density coverage (Gouda et al. 2021) and wide applications in plant breeding. These markers help to increase the resolution of genome mapping and the accuracy of genomic selection (Yadav et al. 2019). However, SNPs have some limitations, including their bi-allelic nature, difficulty in identifying rare alleles, lower polymorphism, linkage drag and false-positive results (Voss-Fels and Snowdon 2016; Bhat et al. 2021). In this context, haplotype-based approaches are a successful strategy for overcoming the limitations of SNPs and boosting the resolution of genomic regions (Qian et al. 2017). A haplotype is a combination of nucleotides or markers from polymorphic sites, on the same or different chromosomes, that are inherited together and show strong linkage disequilibrium (Bhat et al. 2021). A number of studies have demonstrated that a haplotype-based association study can find variants that would not be detected by a typical SNP-based investigation (Zakharov et al. 2013). Additionally, a recent study identified several important genes that can be used as molecular markers for genetic manipulation to design and develop robust and resistant crop cultivars (Pal et al. 2022).
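Linkage disequilibrium between two biallelic sites is commonly summarised by the squared correlation r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA pB. The sketch below computes this from phased two-site haplotypes coded as strings of '0' and '1'; the coding, the function name and the toy data are assumptions made for illustration.

```python
def ld_r2(haplotypes):
    """Squared correlation (r^2) between two biallelic sites.

    haplotypes is a list of two-character strings such as "01", where each
    character is the allele ('0' or '1') carried at one of the two sites.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "1" for h in haplotypes) / n    # allele 1 at site A
    p_b = sum(h[1] == "1" for h in haplotypes) / n    # allele 1 at site B
    p_ab = sum(h == "11" for h in haplotypes) / n     # both alleles together
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


# Hypothetical samples: complete linkage versus independent sites.
linked = ld_r2(["11", "11", "00", "00"])
independent = ld_r2(["11", "10", "01", "00"])
print(linked, independent)
```

Sites with r² close to 1 are candidates for grouping into the same haplotype block, whereas values near 0 indicate that the sites segregate independently.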

The detection of haplotypes and their use in genetic investigations are strongly influenced by the availability of high-throughput sequencing technologies. Second-generation sequencing technologies generate short reads of about 150 base pairs, so haplotype identification is difficult and requires powerful statistical tools (Delaneau et al. 2019). Third-generation sequencing technologies, such as Oxford Nanopore and Pacific Biosciences, generate long reads from which haplotypes can be constructed directly (Maestri et al. 2020). Haplotype mining can be used to dissect complex traits through approaches like haplotype-based breeding, haplotype-GWAS and haplotype-assisted genomic selection (Table 3.4).

Haplotype identification, characterization and visualization are important for the utilization of haplotypes in crop improvement, and many tools have been developed to estimate and visualize haplotypes. Haplotype identification or estimation, also called “phasing,” is the process of estimating or constructing haplotype sequences from genotypic data, and it is used to understand sequence-specific variation. Haplotype-based GWAS is more complicated than SNP-based analysis because identifying associations involves three major steps: phasing (haplotype estimation), block determination and statistical analysis. Estimating haplotypes requires the pooled information of all individuals in the sample. The number of unrelated individuals is an important factor influencing haplotype estimation, and more individuals generally give better results. Related individuals, however, can be phased by considering the haplotypes shared by family members descended from one another (Browning and Browning 2011). Numerous phasing techniques that enable the construction of haplotypes from long-read sequencing data have recently been established, such as reference-based phasing, de novo genome assembly and strain-resolved metagenome assembly (Garg 2021; Bhat et al. 2021). The choice of phasing and block determination algorithms, and the interaction between them, are important factors that can influence the accuracy of phasing haplotype blocks (Bkhetan et al. 2019). Various haplotype analysis approaches, combined with computational tools such as DESMAN, Falcon phase, HapCut2, HapTree, Hifiasm, MetaMaps, POLYTE, SDip, and WhatsHap, are extensively reviewed by Garg (2021). Combining these analysis approaches and computational tools with long-read sequencing technologies makes it possible to exploit the full potential of these sequencing methodologies for haplotype construction. SNPViz v2.0 (Zeng et al. 2020) is a web-based tool that enhances the identification of large-scale haplotype blocks. HaplotypeTools (Farrer 2021) is a tool that phases variants based on reads that overlap ≥ 2 heterozygous positions and on the extent of those reads; it is also a powerful tool for analysing hybrid and polyploid genomic regions. Recently, the Python-based tool HAPPE (Feng et al. 2022) was developed to construct and visualize haplotypes easily (Table 3.4). Additionally, the Practical Haplotype Graph is a powerful tool for the storage, retrieval and imputation of haplotypes for genomic studies (Bradbury et al. 2022).
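The idea that reads spanning two or more heterozygous positions carry phase information can be illustrated with a deliberately simplified Python sketch. It is not the algorithm implemented in HaplotypeTools, WhatsHap or any other cited tool; the function name `phase_reads`, the 0/1 allele coding and the majority-vote rule between adjacent sites are assumptions made only for illustration.

```python
def phase_reads(het_positions, reads):
    """Greedy read-based phasing of a diploid sample (illustrative only).

    het_positions: sorted list of heterozygous site coordinates.
    reads: list of dicts mapping site coordinate -> observed allele (0 or 1).
    Returns a list of phase blocks; each block maps site -> allele carried
    by one haplotype (the other haplotype carries the complementary allele).
    """
    if not het_positions:
        return []
    blocks = []
    current = {het_positions[0]: 0}  # arbitrary orientation for the block
    for prev, pos in zip(het_positions, het_positions[1:]):
        same = diff = 0
        for read in reads:
            # Only reads covering both sites are informative for this pair.
            if prev in read and pos in read:
                if read[prev] == read[pos]:
                    same += 1
                else:
                    diff += 1
        if same == diff:
            # No linking reads, or a tie: phase is unknown, start a new block.
            blocks.append(current)
            current = {pos: 0}
        else:
            prev_allele = current[prev]
            current[pos] = prev_allele if same > diff else 1 - prev_allele
    blocks.append(current)
    return blocks


if __name__ == "__main__":
    reads = [
        {100: 0, 200: 1},
        {100: 1, 200: 0},
        {200: 1, 300: 1},
        {200: 0, 300: 0},
        {300: 0},
    ]
    print(phase_reads([100, 200, 300, 900], reads))
```

In this toy input, the site at coordinate 900 is not linked to any other site by a read, so it ends up in its own block. This mirrors why longer reads, which span more heterozygous sites, tend to yield longer phase blocks.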

3.9 Genome Editing

CRISPR/Cas9 is a potent genetic modification technique and a prime example of genome editing technology. It has proved to be an extremely effective tool not only in basic science but also in plant breeding. The development of genome editing technologies (ZFN, TALEN, CRISPR/Cas9, etc.) has drawn a lot of attention because they overcome the restrictions of traditional breeding approaches (Matres et al. 2021). These methods enable precise and effective targeted genome modifications, greatly shortening the time needed to obtain plants with desired traits for the development of new crop varieties. Sequence-specific nucleases and single guide RNAs (gRNAs) are the key components of the CRISPR-based gene editing approach for generating precise modifications. The CRISPR/Cas system is still evolving, but two significant obstacles remain: off-target effects and on-target efficiency (Xu et al. 2015; Zhang et al. 2015; Liu et al. 2020). To overcome these issues, in silico gRNA design using effective computational methods plays an important role in optimizing guide RNAs (Doench et al. 2016; Hassan et al. 2021). One of the key factors affecting gRNA effectiveness is the nucleotide content of the target sequence. The PAM (Protospacer Adjacent Motif) sequence and its neighbouring nucleotides are particularly important for efficiency (Liu et al. 2020). Guanines are favoured at the first and second nucleotide positions before the PAM sequence, while thymines are not preferred within four nucleotides upstream or downstream of the PAM sequence. Furthermore, sequences upstream of PAMs have no discernible influence, although sequences downstream can affect gRNA efficiency (Doench et al. 2014). Cytosine is preferred at the cleavage site, and GC content downstream of the PAM sequence is associated with higher gRNA efficiency. Numerous efficiency prediction models built on this information are available.
Various tools based on these models have been developed to design gRNAs using alignment-based, hypothesis-driven and/or learning-based approaches (Konstantakos et al. 2022). Tools based on hypothesis-driven and learning-based models perform better than alignment-based ones. Tools developed to predict gRNAs with high on-target efficiency include E-CRISP, CHOPCHOP, CRISPR-FOCUS, PROTOSPACER, CLD, CRISPOR, and CRISPETa (Table 3.4). WheatCRISPR is a web-based bioinformatics tool that is generally used for constructing target-specific gRNAs in wheat (Cram et al. 2019). Additionally, CROPSR is the first open-source bioinformatics tool for fast genome-wide guide RNA design for CRISPR-based genome editing, which reduces the challenges posed by complex crop genomes (Paul et al. 2022).
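As a rough illustration of the first step that such design tools perform, the Python sketch below scans the forward strand of a sequence for candidate target sites. It assumes a 20-nucleotide protospacer followed by an NGG PAM, as used by the widely applied Streptococcus pyogenes Cas9; the text above does not specify the PAM, so this is an assumption. The function name `find_grna_candidates` and the reported features (GC fraction, guanines immediately before the PAM) are illustrative and do not constitute a validated efficiency score or the method of any cited tool.

```python
def find_grna_candidates(seq, protospacer_len=20):
    """List candidate gRNA target sites on the forward strand.

    Assumes an NGG PAM directly 3' of the protospacer (SpCas9-style);
    this assumption and the reported features are illustrative only.
    """
    seq = seq.upper()
    valid = set("ACGT")
    candidates = []
    # Slide a window of protospacer + 3-nt PAM along the sequence.
    for start in range(len(seq) - protospacer_len - 3 + 1):
        proto = seq[start:start + protospacer_len]
        pam = seq[start + protospacer_len:start + protospacer_len + 3]
        if pam[1:] != "GG" or not set(proto + pam) <= valid:
            continue
        gc = (proto.count("G") + proto.count("C")) / protospacer_len
        candidates.append({
            "start": start,                      # 0-based start of protospacer
            "protospacer": proto,
            "pam": pam,
            "gc_fraction": gc,
            # True if both nucleotides directly before the PAM are guanines.
            "g_before_pam": proto[-2:] == "GG",
        })
    return candidates


if __name__ == "__main__":
    demo = "TTTTT" + "ACGTACGTACGTACGTACGG" + "TGG" + "AAAA"
    for site in find_grna_candidates(demo):
        print(site)
```

A practical design workflow would also scan the reverse strand and check each candidate against the whole genome for potential off-target sites, which is where the dedicated tools listed above become necessary.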

3.10 Major Challenges in Bioinformatics

NGS technologies have driven a genomic revolution by generating enormous amounts of data quickly and affordably. Bioinformatics is becoming increasingly essential in life science research. Because the amount and complexity of life science data have grown exponentially over the past two decades, data analysis is frequently the main bottleneck. Handling, analysing, and storing information has become a new barrier for biologists. Efficient data processing is necessary, and many algorithms are available for these specific tasks. Increasing efficiency and accuracy requires a combination of tools and sufficient resources for smooth operation. Another challenge for biologists is learning languages such as Python, Perl or R for efficient data handling, compounded by a lack of training from expert bioinformaticians who understand biological problems and their associated complexities. Genome assembly has gained increasing attention as advanced sequencing technologies have been developed. Despite the abundance of genome assembly tools available, de novo genome assembly using next-generation reads still faces four significant obstacles: sequencing errors, sequencing bias, topological complexity of repetitive regions and huge computational resource consumption (Liao et al. 2019). The accuracy of results can have a big impact on downstream analysis of sequencing data. Errors during data processing may lead to false positives and inaccurate findings. On the other hand, poorly chosen approaches or tools may produce false negatives, resulting in the loss of genuine variants. Finding a suitable balance between accuracy and sensitivity is thus another major problem in data analysis. The application of ML in plant research is also an important issue. Traditionally, statistical techniques have been used to predict genotype-phenotype relationships. These techniques have been very effective and successful throughout the past century. 
Decision-making by researchers and practitioners typically relies on confidence measures and model interpretation. Further, the data-driven flexibility of ML offers a range of advantages over stringent statistical approaches, making it a powerful tool for solving complex problems and extracting valuable insights from diverse and dynamic datasets.
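The trade-off between false positives and false negatives in variant detection, noted earlier in this section, is often quantified by comparing a call set with a trusted truth set. The short Python sketch below shows one way this could be done; the function name `call_metrics` and the representation of variants as (chromosome, position, reference allele, alternate allele) tuples are assumptions made for illustration.

```python
def call_metrics(called, truth):
    """Compare called variants with a truth set (illustrative sketch).

    called, truth: iterables of hashable variant keys, for example
    (chromosome, position, ref, alt) tuples (an assumed representation).
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # genuine variants that were called
    fp = len(called - truth)   # calls not supported by the truth set
    fn = len(truth - called)   # genuine variants that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")]
    called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 50, "G", "A"), ("chr2", 99, "A", "T")]
    print(call_metrics(called, truth))
```

Stricter filtering typically raises precision at the cost of recall, so reporting both values, or a combined measure such as F1, makes the chosen balance explicit.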

3.11 Future Prospects and Conclusions

Bioinformatics has emerged as a cross-cutting discipline across different fields of agricultural sciences, enhancing our understanding of the complex mechanisms underlying different traits in different crop plants for crop improvement (Fig. 3.2). NGS has brought a paradigm shift in the life sciences and has transformed genomics research. In addition to being crucial for fundamental genomic and molecular biology research, bioinformatics also has a significant influence on many fields of agricultural and medical sciences. Suitable computational tools and the right resources are essential for identifying biological information that adds value and offers novel insights into biological systems. The rise in omics-based research calls for education in the relevant technologies and in bioinformatics so that experimental and computational efforts are translated correctly. AI-based solutions help to increase efficiency and manage a number of factors, including crop yield, soil profile, crop irrigation, weeding, and crop monitoring (Bhardwaj et al. 2022). The possibilities for using AI in agriculture will increase as the field of AI matures and more trained algorithms become available. Recently, a genetic algorithm-based Internet of precision agricultural things (IopaT) has been developed and is becoming popular in rural areas for solving real-time problems; this genetic algorithm-based system was developed to predict water requirements (Roy and De 2020). Systems of this kind will also support decision-making in agriculture, for example on crop patterns and water management at a particular place (Xu et al. 2022a). Future applications of AI/ML in plant research include predicting which regions of the genome should be modified to produce a particular phenotype and providing the best possible local growing conditions by monitoring crop performance in vivo in the greenhouse or in the field. We are still very early in the genomics era and undoubtedly a long way from accomplishing this ambitious objective. 
In fact, efforts are still required for in-depth and appropriate analyses of genome, transcriptome, and metagenome data to identify the link between organization and functionality. Moreover, chemical genomics approaches aid in understanding how to overcome stress conditions and improve crop yield and productivity (Pal et al. 2022; Adhinarayanreddy et al. 2022). Integrated genomic-enviromic prediction (Xu et al. 2022b), a term recently proposed for an extension of genomic prediction that utilizes integrated multi-omics data, big data technology, and artificial intelligence, is expected to accelerate breeding programs. With the use of big data, AI and robust bioinformatics analysis, plant breeding will become increasingly smart. The establishment of integrative plant breeding platforms and open-source breeding initiatives can help translate smart breeding efforts into genetic gains.

Fig. 3.2
An infographic chart for the role of bioinformatics. It lists the multi-omics approaches, which include high-throughput precise phenotyping; databases, such as crop-specific and specialized ones; robust analysis via haplotypes and k-mers; and applications, such as stress-tolerant and biofortified crops.

Role of bioinformatics in genetics and plant breeding research for developing climate-resilient crops and sustainable food production. (a) Generation of biological data from various omics approaches as well as phenotyping data from multiple environments; (b) Storage and processing of the different omics data generated; (c) Robust analysis of the raw data and its transformation into useful information using bioinformatics tools for appropriate interpretation; (d) Application of bioinformatics in agricultural research