Introduction

Wheat is a crop of great historical significance: its domestication about 10,000 years ago marked a turning point in human civilization. It is grown worldwide and ranks third after maize and rice in global production (Shiferaw et al. 2013). Numerous efforts over the last three decades have improved wheat genetic maps, beginning with restriction fragment length polymorphism (RFLP) markers and extending to exon capture analysis (Botstein et al. 1980; Saintenac et al. 2011). Since the 1980s, the development of molecular markers has been accompanied by advanced statistical models and high-speed computer software, which aid in the detection of genomic regions associated with both simple and complex traits in crops. Linkage mapping involves the mapping of quantitative trait loci (QTLs) to specific locations in the genome using a bi-parental population. It is a regression analysis that unravels the association between a genomic locus and variation in the phenotypic data collected from the population (Lander and Botstein 1989; Xie et al. 1993). The important factors affecting linkage mapping include the density of molecular markers on the genetic map, the quality of the phenotypic data, and the size of the mapping population. The biggest limitation of linkage mapping is its low genetic resolution.

Recently, association or linkage disequilibrium (LD) mapping using genome-wide markers, i.e. the genome-wide association study (GWAS), has been increasingly adopted in wheat because of its two main advantages: (i) it does not require the cost and time associated with population development and (ii) it provides high mapping resolution, as it efficiently exploits the many historical crossover events that have occurred in the diverse association panel. GWAS detects the association between a particular genotype and the trait of interest using the LD conserved in the selected panel of accessions (Myles et al. 2009). It is being adopted at a rapid pace by plant geneticists and breeders because of the reduction in genotyping cost, which was previously a major bottleneck. In wheat, next-generation sequencing (NGS) tools such as genotyping by sequencing (GBS), together with different SNP arrays, provide a plethora of information for genome-wide analysis at an affordable price (Tibbs Cortes et al. 2021; Sandhu et al. 2021e).

GWAS statistical algorithms have advanced rapidly, from single-locus to multi-locus models, with the aim of detecting real associations while controlling false positives and false negatives (Yu et al. 2006; Huang et al. 2018). Population structure and familial relatedness (kinship) are the main causes of spurious associations, and these are controlled in modified GWAS models by including population structure and kinship matrix components (Price et al. 2006; VanRaden 2008). Since the first association mapping study in wheat, GWAS has been rapidly adopted for dissecting the genetic architecture of various important traits (Breseghello and Sorrells 2006). Several challenges remain for the adoption of this technique: the majority of QTLs identified through GWAS have minor effects and are population specific, and their exact effects are difficult to estimate. Nevertheless, hundreds of GWA studies have been conducted in wheat over the last 15 years for different traits (Tables 3, 4, 5 and 6).

Here, we provide detailed information about the experimental designs adopted for GWA studies and compare them with conventional linkage mapping and nested association mapping (NAM) population designs. Furthermore, various next-generation sequencing platforms, high-throughput phenotyping (HTP), and statistical models are discussed, with reference to previously conducted studies, to explain the whole GWAS analysis pipeline to a novice. We have tried to cover most of the GWA studies conducted in wheat and have critically analyzed whether results from these studies were used by wheat breeding programs for marker-assisted selection (MAS). The genetic architecture of most agronomic traits, as deciphered from various GWAS, is also provided. Finally, having explored association studies of most economically important traits, we turn to the future prospects of genomic selection, OMICS approaches, and machine and deep learning.

Experimental populations for association mapping

Association or LD mapping is a powerful tool for dissecting the genetic architecture of a trait with the help of phenotypic and genotypic information derived from a diverse panel (Kaur et al. 2021). It has been used for deciphering the genetic architecture of most of the complex quantitative traits in wheat (Edae et al. 2014; Arora et al. 2017), owing to the availability of high-throughput genotyping and phenotyping platforms along with larger population sizes. Association mapping has high mapping resolution, which can be attributed to historical recombination events and greater allele richness. The panels frequently used in association mapping comprise lines or accessions with vast genetic diversity. Linkage disequilibrium (LD) in these accessions is used for associating markers with QTLs (Nordborg and Weigel 2008).

The association results provided by LD-based mapping depend not only on the linkage between different genomic regions but also on the population structure and relatedness among the lines in the population (Korte and Ashley 2013). The panel used for association mapping should have genetic diversity with negligible population structure, which otherwise might result in spurious associations or subgrouping. Family-based association mapping was proposed to overcome this obstacle. Efforts were also made to combine traditional linkage mapping and association mapping in joint linkage association mapping, which can combine the advantages of both designs while avoiding their pitfalls, but this technique has not gained traction in wheat (Würschum et al. 2012). One of the biggest issues with these populations is that alleles are already fixed or rare in the sub-populations, so rare alleles are missed even when a large population is used.

However, multi-parent population designs such as multi-parental advanced generation intercross (MAGIC) and nested association mapping (NAM) populations can address these issues by identifying rare variants while retaining high mapping resolution (Beyer et al. 2008; Sandhu et al. 2021e). To our knowledge, eight sets of MAGIC populations are available in wheat and have been used for genetic characterization of various traits, viz. grain protein content, disease resistance, and grain yield (Mackay et al. 2014; Delhaize et al. 2015). MAGIC populations are developed by several generations of inter-mating among multiple founder parents. Compared with a bi-parental mapping population, the multiple founders maintain relatively high allelic diversity, depending upon the number of parents used. Several generations of inter-crossing increase the number of recombination events and ultimately result in high resolution in the mapping of marker-trait associations (MTAs). Mackay et al. (2014) developed a MAGIC population in wheat using eight diverse founder parents for studying the genetic architecture of awns in UK and European wheats. Similarly, Delhaize et al. (2015) developed two MAGIC populations of wheat, using four and eight founder parents, respectively, for studying the genetics of rhizosheath size.

The nested association mapping (NAM) population is a novel genetic design for mapping complex traits that combines the advantages of both association and linkage mapping (Sandhu et al. 2021e). A NAM population is created by crossing diverse founder parents with a common cultivar and selfing the resulting F1 plants to produce recombinant inbred lines (RILs) (Sandhu et al. 2021b). In this way, NAM populations have high allelic variation because of the diverse founder parents and high mapping resolution because of the recombinant inbred lines (Song et al. 2017). The first NAM population was developed in maize, where 25 lines were crossed with a single common parent, B73, followed by selfing to create a population of 5000 recombinant inbred lines (200 per cross) (Yu et al. 2008). To date, a couple of studies have reported the use of NAM populations for mapping complex traits in wheat (Ren et al. 2018; Kidane et al. 2019). Jordan et al. (2018) utilized a NAM population of 2100 RILs derived from 26 founder parents for genetic mapping and for studying recombination events in different regions of the chromosomes to identify recombination hotspots. Linked top cross populations in wheat are another resource; they can have even more parents than NAM populations, with a higher possibility of allele discovery than any other population.

Recent advances in genotyping technologies

The extent of LD provides an estimate of the marker density needed for GWAS in any crop. The basic LD measure, D, is the difference between the observed gametic frequency of a haplotype and the frequency expected under linkage equilibrium (Tibbs Cortes et al. 2021). Besides D, several other measures of LD (D', r², R, D2, D*, Q*, F', X(2), and δ) have been developed to quantify LD in various bi-allelic and multi-allelic situations (Devlin and Risch 1995; Gupta et al. 2005). Detailed descriptions and formulae for these LD measures, along with their sensitivity, merits, comparison, suitable statistical tests, and calculation methodology, are available in the literature (Gupta et al. 2005). A low level of LD in crop accessions implies that a higher number of markers is required for sufficient coverage of the genome, which is crucial for finding the MTAs and/or QTLs for the traits concerned; conversely, a high level of LD means fewer markers are needed. In wheat, the extent of LD has been analyzed extensively (Maccaferri et al. 2005; Chao et al. 2010). Analysis of LD patterns using simple sequence repeat (SSR) markers in hexaploid wheat revealed significant LD extending to 5 cM (Breseghello and Sorrells 2006) or 10 cM (Chao et al. 2007), whereas tetraploid wheat populations maintained around 50% of their initial LD value at distances up to 20 cM (Maccaferri et al. 2005). Moreover, the different genomes of wheat show different extents of significant LD; a comparatively greater extent of LD has been observed in the D genome than in the A and B genomes (Chao et al. 2010). Theoretically, the extent of LD can be increased by selecting a set of closely related accessions or genotypes, in which case fewer markers are sufficient for detecting associations. In practice, however, accessions with high genetic diversity are employed for GWAS; therefore, a large number of markers sufficiently covering the whole genome becomes essential.
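
To make the definition of D concrete, the following minimal Python sketch computes D and two of the derived measures, D' and r², for a pair of bi-allelic loci. The function name `ld_measures` and the haplotype and allele frequencies are hypothetical illustrations, not values from any of the cited wheat studies; the absolute value |D'| is reported here, although signed conventions also exist.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two bi-allelic loci.

    p_ab : observed frequency of the haplotype carrying alleles A and B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    # D: observed haplotype frequency minus the frequency expected
    # under linkage equilibrium (p_a * p_b).
    d = p_ab - p_a * p_b

    # D' rescales D by the largest magnitude it could take
    # given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical frequencies chosen only for illustration.
d, d_prime, r2 = ld_measures(p_ab=0.4, p_a=0.5, p_b=0.5)
print(f"D = {d:.3f}, |D'| = {d_prime:.3f}, r^2 = {r2:.3f}")
```

With these toy frequencies, D is 0.15, |D'| is 0.6, and r² is 0.36. In a real panel the frequencies would be estimated from the genotype data, and pairwise r² would typically be plotted against genetic or physical distance to judge how quickly LD decays.
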
Initially, SSR markers were used for association analysis in wheat (Prasad et al. 1999; Chen et al. 2014). The number of SSR markers used in these studies was very small and did not cover the genome sufficiently. Hence, numerous causal genomic regions might have gone unnoticed. Later, various advances in genotyping techniques enabled successful GWA studies capable of capturing far more of the genetic variation in diverse accessions. To overcome the limitations associated with SSR markers, a high-throughput genotyping system, diversity array technology (DArT), was developed, which allowed rapid and cost-effective genome-wide genotyping in wheat (Crossa et al. 2007).

DArT is a microarray-based genotyping technique that is independent of sequence information. Since the first study was published in 2007, several studies have reported the successful use of DArT markers for GWAS of many complex traits in wheat (Joukhadar et al. 2013; Jighly et al. 2016; Kalia et al. 2018). The number of DArT markers used in various studies in wheat ranged from 242 (Crossa et al. 2007) to 1728 markers, the latter covering a total genetic distance of 2,851.7 cM at an average spacing of 1.7 cM (Joukhadar et al. 2013). Although these DArT markers were more abundant than SSRs, several researchers found some chromosomes to be poorly covered (Crossa et al. 2007; Langer et al. 2014). The arrival of next-generation sequencing technologies allowed the development of an advanced genotyping platform, GBS (Elshire et al. 2011), which provided access to a comparatively large number of single nucleotide polymorphism (SNP) markers in a cost-effective manner. GBS-based genotyping has been widely used in wheat for GWAS (Arruda et al. 2016; Liu et al. 2018a; Pradhan et al. 2019). The number of GBS-SNPs used in GWA studies in wheat ranged from 17,937 (Kumar et al. 2018) to 327,609 SNPs (Pang et al. 2020).

Owing to low read coverage, GBS sometimes shows genotyping errors and a poor ability to detect true homozygotes. Moreover, its performance is highly affected by the quality of the reference genome. Wheat has a large genome (1C = 16 Gb) comprising three homoeologous genomes, which contain more than 85% repetitive DNA sequences and significant gaps (only 14.1 Gb of the total have been accurately assigned and ordered so far). Therefore, the relative incidence of genotyping errors may increase further, as paralogous sequences might be treated as the same reads (Appels et al. 2018; Rahimi et al. 2019). Moreover, Chinese Spring (whose genome is generally used as the reference for SNP calling) is derived from a landrace, which is known to have wide genetic variation compared to modern/advanced cultivars, resulting in low coverage of SNP markers by GBS and therefore restricting the utilization of GBS in modern wheat cultivars. GBS is also frequently troubled by a high amount of missing data, which can potentially reduce the accuracy of any GWA study. One approach to dealing with missing data is imputation, which has been widely applied in many crops including wheat (Arruda et al. 2016; Liu et al. 2018a; Pradhan et al. 2019). It can increase the number of variants used for GWAS by relying on linkage information obtained from common haplotypes. Furthermore, GBS-SNPs imputed based on the reference genome of ‘W7984’ have shown the highest imputation accuracy (Alipour et al. 2019).

Later, a comparatively cheaper and easier but efficient genotyping technique, DArTseq™, was developed by combining DArT with next-generation sequencing platforms; it provides a relatively large number of markers for building denser linkage maps cost-effectively. These high-density linkage maps have been widely used for various GWAS in wheat (Dinglasan et al. 2019; Alahmad et al. 2019; Long et al. 2019). In some studies, DArT and DArTseq markers have been used together to obtain greater marker density (Ledesma-Ramírez et al. 2019).

SNP arrays are another genotyping platform, with features such as high marker density, low cost, high throughput, wide commercial availability, and flexibility that make them well suited to GWAS in wheat. Unlike NGS and PCR-based markers, SNP arrays are flexible with respect to customization of sample and data point number, which contributes to their high-density scanning and comparatively higher and more robust call rates. To date, a number of high-density SNP genotyping arrays have been developed and employed for various GWA studies in wheat, for instance, the Illumina Wheat 9K iSelect SNP array (Cavanagh et al. 2013; Guo et al. 2018a), the Wheat 15K SNP array (Boeven et al. 2016; Qaseem et al. 2019), the Wheat Breeders’ 35K Axiom array developed from 820K SNP array (Sheoran et al. 2019; Kumar et al. 2020), the Wheat 55K SNP array developed from 660K array (Ye et al. 2019; Jin et al. 2020), the Illumina Wheat 90K iSelect SNP genotyping array (Dhakal et al. 2018; Mellers et al. 2020), TaBW280K (Rimbert et al. 2018), and the Axiom Wheat 660K SNP array (Yang et al. 2019). These arrays have been very promising for detecting extensive variation in secondary and tertiary gene pools in addition to the primary gene pool of wheat. To cover the maximum genetic variation present in large germplasm collections, different combinations of SNP arrays have also been utilized in wheat, such as the combination of 9K and 90K SNP arrays (Lewien et al. 2018), 35K and 90K SNP arrays (Muqaddasi et al. 2017), and 90K and 660K SNP arrays (Liu et al. 2017e). However, a more recent study showed that the Wheat 660K SNP array could be used as a substitute for other SNP arrays for a wide range of possible applications including GWAS, as it comprises the highest percentage of genome-specific SNPs with precise physical positions (Sun et al. 2020). The timeline of advancements in genotyping of whole-genome variants employed for GWAS in wheat is diagrammatically represented in Fig. 1.

Fig. 1 Timeline of advancements in genotyping of whole-genome variants employed for GWAS in wheat

The Wheat 660K SNP array, developed by the Chinese Academy of Agricultural Sciences, is based on tetraploid and hexaploid wheat, Aegilops tauschii, and emmer wheat. It has the advantages of being genome-specific, efficient, high-density, and cost-effective, with a wide range of possible applications, and it adds numerous markers to the D genome (Sun et al. 2020). However, the choice of array largely depends upon the aim of the researcher, the available resources, and the kind of population used for genetic dissection of the trait(s). The use of SNP arrays has allowed access to an unprecedented number of markers for genomic studies; however, there are drawbacks in using these technologies (Lachance and Tishkoff 2013; Elbasyoni et al. 2018; Chu et al. 2020). Inherent ascertainment bias owing to smaller population size is one of the major drawbacks of using SNP arrays for genotyping (Lachance and Tishkoff 2013). Since the SNP probes on arrays are static, sub-population-specific variants or rare variants are generally not assayed. This causes bias in population genetics studies including GWAS and does not permit the identification of rare functional variants controlling the traits in question. By allowing access to all potential polymorphisms in the population of interest, rather than restricting the analysis to discrete markers on an array, a method that detects markers directly from sequence data may reduce ascertainment bias on an experiment-by-experiment basis. Access to all possible polymorphisms can improve the resolution of genetic mapping and GWA studies. However, confidence in sequence-based prediction of genotypes is undermined by the uncertain alignment of short reads in the genomes of polyploids such as wheat. To overcome these challenges, a method of sequence-based genotyping has been proposed (Clevenger et al. 2018) which, instead of applying a filter to individual sites, collects observed haplotypes from sequence reads and contrasts those haplotypes between genotypes/accessions to identify available polymorphic markers in polyploids. Haplotype-based genotyping should be broadly applicable in wheat and other allopolyploids.

On the other hand, in most techniques of association analysis, SNP markers are evaluated individually for their association with the trait of interest, which can be problematic for complex traits regulated by several genetic loci (Gupta et al. 2014; Zhai et al. 2018). Furthermore, due to the bi-allelic nature of SNPs, a single model may be unable to describe the true allelic diversity available in the population (Lu et al. 2011). Therefore, for a better description of the genetic architecture of complex traits, researchers recommend testing numerous SNP markers jointly, either with a multi-locus model that evaluates adjacent markers simultaneously or with haplotype blocks, where closely linked markers are combined into a single multi-locus haplotype block (Da 2015). The use of haplotype blocks reduces the cost and time spent on GWAS since it eliminates the need to test every individual SNP. Several GWA studies have demonstrated the importance of this approach in the identification of QTLs associated with different traits in wheat (Liu et al. 2020a). Furthermore, in-depth analysis of the causative loci identified by GWAS, e.g., haplotype-based analysis, is considered key for genomics-assisted crop breeding. Using “wheat AND GWAS” as the keywords, we screened papers published on GWAS in wheat in the PubMed database (https://pubmed.ncbi.nlm.nih.gov/). A total of 552 research articles published up to December 31, 2020, reported the successful application of GWAS for dissecting various complex traits in wheat. This analysis shows that wheat geneticists and breeders increasingly recognize the importance of GWAS for genetic dissection of complex traits; the number of papers on GWAS in wheat is expanding rapidly, with the maximum number reported in the year 2020 (Fig. 2).

Fig. 2 The number of publications related to GWAS in wheat published from 01/01/2009 to 31/12/2020. Source: PubMed (keywords “wheat AND GWAS” were used to search the number of publications in PubMed)

Alleviating the phenomics bottleneck: high-throughput phenotyping

Associating genotype with phenotype for complex traits at the genome level requires an association panel of diverse accessions, or a mapping population, along with a high density of molecular markers. Plant phenotyping refers to the assessment and measurement of observable characteristics of plants in the field or under controlled conditions. The capability to collect accurate phenotypic data in field and/or greenhouse conditions is a major bottleneck for precise genetic dissection of complex traits. Recent advancements in genotyping technologies have already provided almost limitless access to high-density molecular markers; therefore, there is an urgent need to shift the focus of plant science research from genomics to phenomics. Several efforts have been made in the recent past to replace low-throughput and invasive phenotyping methods with high-throughput, rapid, and non-invasive phenotyping technologies (Mir et al. 2019). Table 1 includes several such imaging techniques developed in the last decade which have revolutionized crop phenomics.

Table 1 Various high throughput phenotyping techniques or platforms presently used for phenotyping different traits in various crops including wheat

Hyperspectral imaging was used in wheat to determine spectral changes during salt stress (Moghimi et al. 2018). Moreover, a “hyperspectral absorption-reflectance-transmittance imaging (HyperART)” system was employed for the non-invasive quantification of different leaf traits (Bergsträsser et al. 2015). Various “unmanned aerial vehicles” (UAVs) having different sensors attached to them were utilized effectively to measure different traits in various crops including wheat (Yang et al. 2020b).

These non-invasive high-throughput phenotyping platforms involve the utilization of various sophisticated technologies such as (a) fluorescent spectroscopy to evaluate photosynthetic rates; (b) infrared imagery and thermography to examine transpiration/temperature profiles; (c) light detection and ranging (LIDAR) to measure development/growth rates; (d) 3-dimensional reconstruction to measure plant structure and growth rate; (e) canopy spectral reflectance for monitoring dynamic complex quantitative traits; (f) magnetic resonance imaging (MRI) and positron emission tomography (PET) to measure leaf/root physiology, growth/development patterns, photosynthetic assimilate translocation properties, and water relations; (g) digital RGB (red, green, and blue colour) imaging for recording data on several characteristics of shoots, roots, leaves, and seeds; and (h) nuclear magnetic resonance (NMR) for monitoring the sucrose allocation and the structure of tissues (Yang et al. 2020b).

Although the potential of these high-throughput phenotyping technologies has been, and continues to be, demonstrated for various applications in wheat (Crain et al. 2018; Sandhu et al. 2021d), they have not been fully explored for GWA studies in wheat. Only a few papers published so far have used phenotypic data recorded via high-throughput phenotyping platforms to dissect complex traits such as normalized difference vegetation index (NDVI) (Condorelli et al. 2018), lodging (Singh et al. 2019), and transpiration efficiency (Fletcher et al. 2019) in wheat. For the first time in wheat, a study reported the increased ability of aerial platforms, viz. UAVs, over ground-based phenotyping platforms to identify QTLs by GWAS for NDVI under terminal drought stress conditions (Condorelli et al. 2018). In 2019, another study provided a proof-of-concept application of UAS-based phenotyping of a complex phenological trait, i.e. lodging, for describing the genetic architecture of lodging tolerance in wheat through GWAS (Singh et al. 2019). Phenotypic data recorded on transpiration efficiency (TE) via a high-throughput lysimeter platform were also successfully used for identifying QTLs associated with TE in wheat (Fletcher et al. 2019).

Several state-of-the-art phenomics centers have been established to increase the visibility and impact of plant phenotyping in crops including wheat. Moreover, an association known as the International Plant Phenotyping Network (IPPN) (https://www.plant-phenotyping.org/) has also been established to disseminate information about high-throughput phenotyping. This network has six main national partners/centers: (i) the Austrian Plant Phenotyping Network (https://www.appn.at/), (ii) Australian Plant Phenomics Facility (https://www.plantphenomics.org.au/), (iii) China Plant Phenotyping Network, (iv) German Plant Phenotyping Network (https://dppn.plant-phenotyping-network.de/), (v) Phen-Italy (http://www.phen-italy.it/index.php), and (vi) the PHENOME-The French plant phenomic network (https://www6.dijon.inrae.fr/umragroecologie_eng/Research-Programs/Investissement-Avenir/PHENOME). Other major centers are the Julich Plant Phenotyping Centre (https://www.fz-juelich.de/ibg/ibg-2/EN/_organisation/JPPC/JPPC_node.html) in Germany and the High-Resolution Plant Phenomics Centre located in Canberra at CSIRO Plant Industry. In India, the major high-throughput phenotyping centres are the Nanaji Deshmukh Plant Phenomics Centre (developed by Saveer Biotech Limited) at ICAR-IARI, the high-throughput plant phenomics facility at the ICAR-Indian Institute of Horticultural Research (IIHR), the high-throughput automated phenotyping platform at ICRISAT (https://www.icrisat.org/researchfacilities/), and the phenomics facility (http://www.niam.res.in/Phenomics-facility) at ICAR-National Institute of Abiotic Stress Management (developed by LemnaTec, Germany).

These centres use platforms designed mainly for phenotyping under artificial/controlled conditions; however, efforts are being made to create relevant technologies and tools for use under field conditions at both industrial and experimental scales. Moreover, the establishment of high-throughput phenotyping systems is time-consuming and costly and needs in-depth knowledge of computational and engineering sciences to maintain functionality and flexibility. The implementation of such systems may only be justified at big research centres and companies as the unit cost depends on throughput. Several private companies like ‘LemnaTec’, ‘PhenoSpex’, ‘Phenokey’, ‘Photon System Instruments’, ‘We Provide Solutions’, ‘WIWAM’, and ‘Saveer Biotech Limited’ offer large-scale, custom, high-throughput phenotyping platforms for both controlled and field environments (Gehan and Kellogg 2017).

One of the biggest problems associated with high-throughput phenotyping platforms is handling the large volume, velocity, and variety of the data. This may be one reason why high-throughput phenotyping technologies have not been fully explored in wheat for genetic studies. To overcome the challenges associated with the analysis and interpretation of enormous datasets, machine learning (ML) and deep learning (DL) algorithms can be employed (Ma et al. 2018b; Sandhu et al. 2021a). These ML and DL algorithms are multidisciplinary approaches that provide more efficient, accurate, and faster data analytics by utilizing concepts from statistics, probability theory, decision theory, and optimization (González-Camacho et al. 2018). The application of ML and DL methods to the prediction of phenotypes holds great promise, and these methods are therefore likely to be integral tools for future breeding programs (Shah et al. 2019).

Mixed models and significance thresholds for GWAS in wheat

Several GWAS models, ranging from simple to increasingly complex, are available for associating phenotypic variation with particular genotype configurations in wheat (Huang et al. 2018). Traditionally, linear models such as ANOVA, t-tests, and linear regression were used for studying MTAs, but these models usually produced several spurious associations because they ignored population structure and familial relatedness (Price et al. 2006; Yu et al. 2006). The wide geographical distribution of wheat parents in an association panel results in strong population structure, so it is important to use GWAS models that reduce false associations due to population structure. Population structure can be estimated from molecular marker information using STRUCTURE or principal component analysis (PCA) packages, and including these structure parameters as covariates in the GWAS model controls false positives (Pritchard et al. 2000). General linear models (GLM) test association one marker at a time, with population structure included as a fixed effect, and can be represented as:

$$\mathrm{Y}=\mathrm{SNP}+\mathrm{Q}\left[\mathrm{PCs}\right]+e$$

where Y is the trait of interest, SNP represents the matrix of genotypic information, Q is the population structure obtained using STRUCTURE or PCA, and e is the residual error. The results from STRUCTURE and PCA are usually similar, but PCA is more often used because it requires less computational cost and fewer resources (Wang et al. 2009; Wu et al. 2011). Initially, GLM was most frequently used for GWAS analysis in wheat, but it was later realized that GLM produces many false-positive associations because it accounts only for population structure and completely ignores relatedness among the individuals in the population; hence, it was replaced by more recent mixed models (Segura et al. 2012). Yu et al. (2006) developed the unified mixed model approach by including familial relatedness as a random effect in the GLM, resulting in a mixed linear model (MLM) for GWAS, which can be represented as:

$$\mathrm{Y}=\mathrm{SNP}+\mathrm{Q}\left[\mathrm{PCs}\right]+\mathrm{Kinship}+e$$

The terms of this equation are as described above, with the addition of Kinship, which denotes the random component of the model and captures the relationships between individuals in the population, obtained using pedigree or genotypic information. This model improves on previously developed models that account for only population structure or only familial relatedness, resulting in a more powerful GWAS model (Abecasis et al. 2000). Presently, this is the most often used GWAS model. More than 50% of the GWAS conducted in wheat were performed with this model, as is also evident from the information on models provided in Table 2.
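
As an illustration of how a kinship matrix can be obtained from genotypic information, the sketch below implements one widely used estimator, the genomic relationship matrix G = ZZ'/(2Σp(1−p)) of VanRaden (2008), in plain Python. This is an assumption about one reasonable choice, not a statement of what any particular wheat study used; the function name `vanraden_kinship` and the toy genotypes are hypothetical, and markers are assumed to be coded 0/1/2 as counts of one allele with no missing data.

```python
def vanraden_kinship(genotypes):
    """Genomic relationship (kinship) matrix from marker scores.

    genotypes : list of lines, each a list of marker scores coded
                0, 1 or 2 (copies of the counted allele); no missing data.
    Returns an n x n list of lists.
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Frequency of the counted allele at each marker.
    freqs = [sum(line[j] for line in genotypes) / (2 * n) for j in range(m)]

    # Centre each marker by its expected score, 2p.
    z = [[line[j] - 2 * freqs[j] for j in range(m)] for line in genotypes]

    # Scale so that the matrix is comparable to a pedigree relationship matrix.
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic; kinship is undefined")

    return [
        [sum(z[i][k] * z[j][k] for k in range(m)) / scale for j in range(n)]
        for i in range(n)
    ]


# Four hypothetical inbred lines scored at three markers.
toy_lines = [
    [0, 0, 2],
    [0, 2, 2],
    [2, 2, 0],
    [2, 0, 0],
]
kinship = vanraden_kinship(toy_lines)
for row in kinship:
    print([round(value, 2) for value in row])
```

In this toy panel, lines 1 and 2 share two of three marker alleles and receive a positive coefficient, whereas lines 1 and 3 differ at every marker and receive a negative one. In the MLM, a matrix of this kind defines the covariance structure of the random Kinship term.
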

Table 2 A brief comparison of different GWAS models

MLM was shown to be superior in controlling false positives in simulation studies, but it suffers from substantial computational cost. The computational time of MLM scales as mpn³, where m is the number of markers, p is the number of iterations required to solve the model, and n is the number of individuals in the random effect model. The computational time therefore increases with the cube of the number of individuals in the random component of the MLM. Zhang et al. (2010) developed the compressed mixed linear model (CMLM), which reduces the computational time of the MLM by clustering individuals into groups in the random effect model. This model reduces computational time relative to MLM while retaining the same or higher statistical power. CMLM can be represented as:

$$\mathrm{Y}=\mathrm{SNP}+\mathrm{Q}\left[\mathrm{PCs}\right]+\mathrm{Kinship}+e$$

Here, kinship is obtained among the groups using the maximum likelihood method. If all the individuals are classified into one group, it is equivalent to GLM, while if all the individuals are in separate groups, this will result in MLM. In this regard, CMLM is intermediate to MLM and GLM. CMLM gained its popularity in wheat due to the grouping of lines from the same breeding programs or regions into one pool, to account for the relatedness for controlling false positives. Several studies utilized CMLM for association analysis and reported its computational superiority over the MLM in wheat (Arruda et al. 2016).
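The effect of compression on the mpn³ term can be illustrated with simple arithmetic. The sketch below assumes hypothetical values for the number of markers, iterations, individuals, and groups; the function `relative_mlm_cost` is an illustrative helper, and real running times also depend on the software implementation and hardware.

```python
def relative_mlm_cost(n_markers, n_iterations, n_random_units):
    """Relative cost m * p * n^3, where n is the number of units in the random effect."""
    return n_markers * n_iterations * n_random_units ** 3


# Hypothetical panel: 10,000 markers, 20 iterations, 1,000 individuals.
full_mlm = relative_mlm_cost(10_000, 20, 1_000)

# Same panel after compressing the 1,000 individuals into 100 groups.
compressed = relative_mlm_cost(10_000, 20, 100)

speedup = full_mlm / compressed
print(f"relative reduction in the m*p*n^3 term: {speedup:.0f}-fold")
```

Under these assumptions, a 10-fold reduction in the number of random-effect units reduces the cubic term 1000-fold, which is why grouping is attractive for large panels.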

The first mixed linear model for GWAS was published in 2006, and since then many MLMs have been proposed that account for population structure and familial relatedness to control false positives (Breseghello and Sorrells 2006). However, all these MLMs are single-locus models that test one marker at a time, whereas the majority of traits in wheat are controlled by a large number of QTLs, so these models fail to mimic the true genetic architecture of the traits (Segura et al. 2012; Liu et al. 2016). Multi-locus GWAS models were therefore needed in wheat to reduce the false negatives that single-locus models produce because of overfitting. This overfitting arises because single-locus models explain the variation for each marker individually, ignoring the other significant markers and the interactions between markers, so that some real associations are missed. Segura et al. (2012) developed the multi-locus mixed model (MLMM), which tests multiple associations using stepwise regression, with heritability as the criterion for forward inclusion and backward elimination of markers in the model. This model can be represented as:

$$\mathrm{Y}=\mathrm{SNP}+{\mathrm{QTN}}_{1}+{\mathrm{QTN}}_{2}+\dots +{\mathrm{QTN}}_{n}+\mathrm{Q}\left[\mathrm{PCs}\right]+\mathrm{Kinship}+e$$

Quantitative trait nucleotides (QTNs) are added to the model one at a time, with the heritability estimate used as the criterion for stopping further inclusion. Once forward inclusion is complete, backward elimination is performed, in which each added QTN is removed individually, again using the heritability estimate, to identify the set of QTNs that control the variation in the trait.
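The forward-inclusion and backward-elimination logic can be sketched as follows. This is a deliberately simplified stand-in and not the MLMM algorithm itself: it uses ordinary least squares without a kinship term and swaps the heritability-based criterion for the Bayesian information criterion (BIC). The simulated data and the helper names `bic`, `design`, and `stepwise_qtn_selection` are hypothetical.

```python
import numpy as np


def bic(y, X):
    """BIC of an ordinary least squares fit of y on the columns of X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + X.shape[1] * np.log(n)


def design(geno, markers):
    """Design matrix with an intercept and the chosen marker columns."""
    n = geno.shape[0]
    return np.column_stack([np.ones(n)] + [geno[:, j] for j in markers])


def stepwise_qtn_selection(y, geno, max_qtn=10):
    selected = []
    current = bic(y, design(geno, selected))
    # Forward inclusion: add the marker that lowers BIC the most, until none does.
    while len(selected) < max_qtn:
        candidates = [j for j in range(geno.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = [bic(y, design(geno, selected + [j])) for j in candidates]
        best = int(np.argmin(scores))
        if scores[best] >= current:
            break
        selected.append(candidates[best])
        current = scores[best]
    # Backward elimination: drop any marker whose removal lowers BIC.
    improved = True
    while improved and selected:
        improved = False
        for j in list(selected):
            reduced = [k for k in selected if k != j]
            score = bic(y, design(geno, reduced))
            if score < current:
                selected, current, improved = reduced, score, True
                break
    return sorted(selected)


rng = np.random.default_rng(1)
geno = rng.integers(0, 2, size=(200, 20)) * 2.0  # hypothetical 0/2 genotypes
# Two simulated QTNs (markers 3 and 7) plus random noise.
y = 1.5 * geno[:, 3] + 1.0 * geno[:, 7] + rng.normal(0.0, 1.0, 200)

selected_qtns = stepwise_qtn_selection(y, geno)
print("selected markers:", selected_qtns)
```

With effects this large both simulated QTNs are expected to be retained; occasional extra markers may also be selected by chance, which is one reason published multi-locus models rely on more careful criteria than this sketch.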

MLMM uses all the SNP markers to derive the kinship matrix among individuals. The settlement of MLM under progressively exclusive relationship (SUPER) model was subsequently developed, which derives the kinship matrix from the significant QTNs only. This model has higher statistical power and is also more computationally efficient than MLMM (Wang et al. 2014a). Both MLMM and SUPER incorporate the significant QTNs to reduce the confounding between the tested markers and kinship. To remove this confounding altogether, the fixed and random model circulating probability unification (FarmCPU) was developed, which divides the model into a fixed effect model and a random effect model (Liu et al. 2016). The fixed effect model tests a single marker at a time, while the random effect model uses the multiple associated markers to derive the kinship, which is included as a covariate and also controls false positives. This model is superior to previous multi-locus models, with higher statistical power and less computational time (Liu et al. 2016). This model is represented as:

$$\mathrm{Y}=\mathrm{SNP}+{\mathrm{QTN}}_{1}+{\mathrm{QTN}}_{2}+\dots +{\mathrm{QTN}}_{n}+\mathrm{Q}\left[\mathrm{PCs}\right]+e$$

This is a fixed component of the model where each QTN is tested individually at a time. The random effect component of the model is represented as:

$$\mathrm{Y}=\mathrm{Q}\left[\mathrm{PCs}\right]+\mathrm{Kinship}+e$$

Kinship in this model is obtained using the multiple associated markers from the fixed effect model, thus controlling false positives. Several studies using the FarmCPU model for association analysis have been reported in wheat (Bhatta et al. 2018b). FarmCPU has been reported to be superior for GWAS because of the computational advantage gained by separating the fixed and random effects. This is particularly important for wheat, owing to its large and complex hexaploid genome.

Malik et al. (2019) compared the performance of three GWAS models, namely SUPER, CMLM, and FarmCPU, for association analysis of yield and straw quality traits in wheat. FarmCPU was reported to be superior for MTAs for plant height, yield, lodging, Septoria tritici blotch, and harvest index. Q-Q plots and P-value inflation were compared, leading to the conclusion that FarmCPU performed best for all the traits analyzed and to the recommendation that FarmCPU be used for future analyses in wheat. Similar results were obtained by Ward et al. (2019) when comparing MLM and FarmCPU for GWAS of yield and yield-related components in wheat. They showed that MLM results in a large number of false negatives, as MLM identified only nine significant MTAs while FarmCPU identified 74. These results were also supported by the Q-Q plots from the association analysis.

The FarmCPU model assumes that QTNs are randomly distributed across the genome, thus ignoring the LD along the genome. Furthermore, the random effect model of FarmCPU carries a computational cost that grows with larger data sets. Bayesian information and LD iteratively nested keyway (BLINK) is the most recent GWAS model, and it addresses these limitations of FarmCPU (Huang et al. 2018). BLINK uses the Bayesian information criterion to replace the random effect component of FarmCPU with a fixed effect model. Furthermore, LD information is used to include a single marker at a time in the model, which eliminates the confounding problem (Huang et al. 2018). This model is reported to be superior for analysis in wheat, but because of its recent release only a few studies have so far reported its use (Liu et al. 2020b). A comparison of different GWAS models is presented in Table 2.

False positives and false negatives arise not only from the choice of GWAS model but also from an over-conservative or insufficiently stringent significance threshold, so the choice of threshold is crucial in wheat (Dudbridge and Gusnanto 2008; Pe’er et al. 2008). The significance thresholds commonly used in wheat are the Bonferroni correction, false discovery rate (FDR), and positive false discovery rate (PFDR) (Benjamini and Hochberg 1995). The Bonferroni correction at a significance level of 0.05 is a strict threshold (obtained by dividing the significance level by the number of markers) and causes a number of false negatives because of its over-conservative nature: it treats all markers as independent tests, although markers on the same chromosome are often in LD and are therefore not independent (Hayes 2013). MTAs identified with the Bonferroni correction are highly significant and provide high confidence for incorporating particular MTAs in a breeding program. Very few studies report significant MTAs with the Bonferroni correction, as is evident from Table 3. FDR and PFDR are somewhat less stringent threshold criteria and are often used for reporting significant associations (Tables 3, 4, 5, and 6). FDR is the expected proportion of declared MTAs that are in fact false positives (Hayes 2013). There is a strong need to agree on a strict threshold for controlling false-positive associations in wheat. Permutation testing was proposed for solving the multiple testing problem in humans; it selects a significance threshold by analyzing a large number of simulated data sets generated from the real data set by random shuffling (Churchill and Doerge 1994). This led to the selection of a P-value < 5 × 10⁻⁸ as a strict cutoff for reporting significant MTAs in humans, and a cutoff of this kind is needed for association studies in crops including wheat, even though researchers remain free to report their own subjective threshold P-value.
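The difference in stringency between the Bonferroni correction and FDR can be illustrated with a small example. The sketch below is a minimal implementation of the Bonferroni threshold and the Benjamini and Hochberg (1995) step-up procedure; the eight P-values are invented for illustration, and the function names are hypothetical.

```python
import numpy as np


def bonferroni_threshold(alpha, n_markers):
    """Per-marker P-value cutoff: significance level divided by number of tests."""
    return alpha / n_markers


def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of tests declared significant by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Compare the i-th smallest P-value with (i / m) * fdr.
    passed = ranked <= fdr * np.arange(1, m + 1) / m
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.nonzero(passed)[0].max())
        significant[order[: k + 1]] = True
    return significant


# Invented P-values for eight markers.
p_values = np.array([0.00001, 0.0004, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9])

cutoff = bonferroni_threshold(0.05, p_values.size)
bonferroni_hits = p_values < cutoff
fdr_hits = benjamini_hochberg(p_values, fdr=0.05)

print("Bonferroni cutoff:", cutoff)
print("significant by Bonferroni:", int(bonferroni_hits.sum()))
print("significant by FDR:", int(fdr_hits.sum()))
```

For these invented values the Bonferroni cutoff retains two markers, whereas the FDR procedure retains three, showing how the less stringent criterion reports additional associations.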

Table 3 Recent GWAS in wheat and characterization of candidate genes conducted over the last decade (2010–2020) for biotic stress traits
Table 4 Recent GWAS in wheat and characterization of candidate genes conducted over the last decade (2010–2020) for abiotic stress traits
Table 5 Recent GWAS in wheat and characterization of candidate genes conducted over the last decade (2010–2020) for agronomic traits
Table 6 Recent GWAS in wheat and characterization of candidate genes conducted over the last decade (2010–2020) for end-use quality traits

GWAS and characterization of candidate genes

Over the years, GWA studies have been successfully conducted to better define the relative role of genes in various crops and have further assisted in exploring the genetic basis of natural selection and population differences among the individuals of a population, developing into a well-verified and mature method today. GWAS has been extensively used to investigate various biological and physiological traits in wheat during the last decade (Tables 3, 4, 5, and 6). In the literature that we surveyed, 86,122 wheat lines have been studied in various GWA studies reporting 46,940 loci. However, further utilization of these loci remains largely limited. Nevertheless, this huge information source can be further utilized for identifying meta-QTLs through meta-GWAS. Meta-analysis of QTLs identified through interval mapping has regularly been conducted in wheat (Saini et al. 2021a, 2021b, 2021c). The traits evaluated include cold tolerance (Zhao et al. 2020b), seed dormancy (Zuo et al. 2019), coleoptile length (Ma et al. 2020), spike fertility (Pradhan et al. 2019), agro-morphological traits (Sheoran et al. 2019), kernel weight and length (Daba et al. 2018), end-use quality traits (Jernigan et al. 2018), anther extrusion (Muqaddasi et al. 2017), root traits (Beyer et al. 2019), disease resistance (Bhatta et al. 2018b), micro-nutrients in grain (Cu et al. 2020), and multiple pest resistances (Ando et al. 2018) (see Tables 3, 4, 5, 6). Bar graphs are provided to show the number of GWA studies conducted over the last decade, and the recorded studies have been divided into four major categories to compile the data for this study. The largest number of GWA studies has been conducted for agronomic and yield traits, followed by biotic stress resistance, quality traits, and abiotic stress tolerance in wheat over the last decade (Figs. 3, 4, and 5).

Fig. 3 Histogram showing the number of GWAS over the last decade under different categories

Fig. 4 Histogram showing the number of GWAS over the last decade for various (a) biotic stresses and (b) abiotic stresses. Trait name is given as reported in published reports

Fig. 5 Histogram showing the number of GWAS over the last decade for various (a) agronomic and (b) quality traits. Trait name is given as reported in published reports

While the number of studies in the four major categories is represented here in the form of histograms, a comprehensive table is also provided to summarize all the GWAS conducted in wheat during the last decade, with information on the population size of each study, the number of QTLs identified, putative genes, and the GWAS models used for analysis. Only high confidence putative genes or associated markers having a significant PVE (phenotypic variation explained) or R² value for a particular trait have been documented in this review (Tables 3, 4, 5, 6).

Multiple disease resistance to leaf rust (Puccinia triticina), stripe rust (Puccinia striiformis f. sp. tritici), common bunt (Tilletia tritici), and tan spot (Pyrenophora tritici-repentis) was phenotyped in 81 accessions, and 94 MTAs were identified on seven chromosomes for the studied traits. The major effect genomic regions identified coincided with previously identified genes such as Tsn1 (Perez-Lara et al. 2017). Resistance to powdery mildew was phenotyped in 97 accessions, and 262 significant loci were identified. Based on GWAS and linkage map-based QTL analysis, two large effect QTLs with dynamic gene action were identified on chromosome arms 1BL and 2BL for adult plant resistance to powdery mildew, which may be used in breeding programs; some candidate genes, such as TraesCS1B02G264000, were also identified and annotated (Mohler and Stadlmeier 2019).

In a recent study, a total of 319 varieties were phenotyped for resistance to stripe rust, and 47 loci were found to be significantly associated with the trait. Moreover, using gene enrichment with mapping-by-sequencing and the homozygosity haplotyping algorithm, 589 high confidence genes were detected, of which 10 (e.g. TraesCS2B01G486100, TraesCS2B01G486200) were annotated for disease resistance and had homology to a previously characterized Yr7 candidate gene (Gardiner et al. 2020). Similarly, Fusarium crown rot resistance was phenotyped in 358 accessions, and 104 loci were found to be significantly associated with it. A novel significant region was detected on chromosome arm 5DL, and qRT-PCR was used to validate the involvement of candidate genes in providing resistance to Fusarium crown rot. The validated candidate genes, namely TraesCS5D01G138700.1 and TraesCS5D01G142400.1, encode proteins belonging to the widely known disease resistance protein (TIR-NBS-LRR class) family (Jin et al. 2020).

Combining advanced techniques with GWAS allows precise targeting of candidates for the trait of interest. In turn, advancements in breeding methodologies based on GWAS also play a key role in developing resistant lines, and efficient methods such as genomic selection can assist in this. One such study concerned Septoria tritici blotch (STB) resistance: a total of 371 accessions were phenotyped, and 44 loci were found to be significantly associated with STB resistance. Putative candidate genes TraesCS1A01G323600 and TraesCS6D01G365100 were also identified and annotated. Along with candidate gene identification, the use of the GWAS results in genomic prediction was also assessed in order to highlight the potential of combining GWAS and genomic selection for STB resistance (Muqaddasi et al. 2019a).

Multiple GWAS models have also been used for conducting association analysis in wheat. For instance, a study was conducted on Barley yellow dwarf (BYD) virus resistance in which 335 accessions were phenotyped and both MLM and GLM were employed, resulting in the identification of 36 loci significantly associated with the target trait. The candidate genes TraesCS2B01G037300 and TraesCS2B01G038300 were identified, which may be useful in breeding programs for achieving stable resistance to BYD virus (Choudhury et al. 2019). Along with disease resistance, traits such as the quality of the wheat grain and its processed products have also been well studied using GWAS. A comprehensive study of baking and milling traits, which included flour yield, softness equivalent, flour protein, and four solvent retention capacities, was conducted in 270 accessions, and 84 loci were found to be significantly associated; all but one of the identified associations were novel. Moreover, two putative genes, viz. TraesCS1B01G12950 and TraesCS7A01G01360, were also identified (Gaire et al. 2019). As in the BYD study, more than one GWAS model, namely GLM and MLM, was used to study grain protein content. Grain protein content was phenotyped in 161 accessions, and 145 loci were found to be significantly associated with it. Furthermore, two large effect QTLs on chromosomes 2B and 7B and the underlying putative genes, namely TraesCS1D01G029200.2 and TraesCS2A01G328100.1, were identified.

Abiotic stresses pose a great threat to crops and can cause huge losses (Kaur et al. 2021). To gain better insight into the genetics of traits associated with abiotic stresses, several GWA studies have been conducted in wheat. Drought, nutrient toxicity, extreme temperatures, and salinity are some of the major abiotic stresses that can significantly impede the normal development of plants. Among these abiotic stresses, the effect of drought on yield and related traits has been extensively studied by GWAS. Li et al. (2019a) conducted GWAS using 277 accessions and identified 295 significantly associated loci using three different models, namely GLM, MLM, and FarmCPU. Candidate genes including TraesCS6A02G124100 and TraesCS6D02G114400 were also identified (Li et al. 2019a, 2019b, 2019c, 2019d). Similarly, the effect of drought stress on grain yield was studied by Bhatta et al. (2018a) using the FarmCPU model; the trait was phenotyped in 123 accessions, and 90 loci were found to be significantly associated. Candidate genes TraesCS7A01G158200.1 and TraesCS3D01G002700 were also identified. As evident from the histogram (Fig. 4), nine GWA studies have been conducted for drought tolerance alone, followed by heat stress (3), salt tolerance (3), and drought and heat together (2), while many other abiotic stresses have been studied at least once.

Quality traits including the micronutrients like vitamins and minerals have been the topics of keen interest with respect to market pricing and consumer preference; in this regard, GWAS was performed for evaluating the variation for vitamins B1 and B2 content in wheat, which were phenotyped among 166 cultivars. A total of 24 loci were declared significant (17 loci for Vitamin B1 and 7 loci for Vitamin B2) in this study. IWB43809, IWB69903, and IWB23595 were identified as putative markers which can be of interest to the breeders. However, the candidate genes remain unidentified as little is known about biosynthetic pathways of Vitamins B1 and B2 in plants (Li et al. 2018). Similarly, copper content in wheat grains was phenotyped using 243 accessions, and 489 loci were found to be significantly associated with the trait. Furthermore, haplotype analysis revealed three important genetic loci, GCC_Hap_2A1, GCC_Hap_3B1, and GCC_Hap_5A1 associated with grain copper content. Linkage mapping identified four QTLs on chromosomes 1D, 6A, 6B, and 7D, associated with copper content in wheat grains. Two of the significant SNPs, detected on chromosome 1D via GWAS, were mapped within the interval of one QTL (QGCC.hau-1D), implying that this locus has an important role in regulating copper content in wheat grains (Zhao et al. 2020a). Fig. 5 represents the number of GWA studies considered or covered under a particular quality trait over the last decade.

Agronomic traits such as plant architecture, root structure, and, most importantly, yield-affecting traits have also been studied by GWAS. For example, a GWA study of twelve agronomic traits phenotyped in 768 accessions under multiple environments identified a total of 807 loci significantly associated with the traits under study. A total of 9 environmentally stable QTLs were identified, which can be of great use in breeding programs. Candidate genes TraesCS4B02G049100 and TraesCS1B02G415500 were reported for QTLs associated with spike seed setting and grain size, respectively (Pang et al. 2020). For root traits, six related traits were phenotyped in 196 accessions, and 1,105 loci were found to be significantly associated with the traits under study. Three candidate genes, TraesCS5A02G022300, TraesCS4A02G484800, and TraesCS4A02G493900, were also reported; the proteins encoded by these genes were found to be associated with carbon metabolism, nitrogen metabolism, signal induction, stress responses, and DNA synthesis (Xu et al. 2020). Similarly, yield and its contributing traits were phenotyped in 320 accessions of a highly diverse wheat association mapping panel, and 46 loci were found to be significantly associated with five traits. Candidate genes TraesCS3B02G123600 and TraesCS4A02G389900 were also reported (Gahlaut et al. 2019). Generally, a single-locus GWAS model is used to identify significant MTAs, but for complex traits such as nutrient use efficiency, which are controlled by multiple loci, a more sensitive model is required. Hence, multi-locus models were developed, as they can detect potential MTAs using less stringent significance criteria. Such a study was conducted for potassium use efficiency, which was phenotyped in 150 accessions. In this study, both single-locus and multi-locus GWAS models were used, and a total of 534 loci were found to be significantly associated with the traits in question. Candidate genes TraesCS1A02G288500 and TraesCS2B02G201400 were reported (Bin Safdar et al. 2020). Similarly, spikelet sterility was phenotyped in 710 accessions, and 118 loci were found to be significant using GWAS. Candidate genes TraesCS1B01G144500 and TraesCS1B01G145500 were reported (Alqudah et al. 2020). Fig. 5 represents the number of GWA studies for each agronomic trait over the last decade. Grain yield remains the most extensively studied of the agronomic traits of wheat investigated by GWAS.

Applications in breeding

Recent advancements in molecular genetics have made it possible to use molecular technologies in breeding programs and to develop diverse molecular breeding strategies for efficient and effective crop improvement. One such tool is GWAS, which has been extensively used to search for genomic regions associated with various traits. These identified genomic regions can then be used to develop breeder-friendly markers for use in breeding programs. Many traits, including phenology, height, and resistance to rusts in wheat, are affected by a few key genes (such as Ppd, Vrn, Yr, Lr, Sr, and Rht) with major effects. These genes can reduce the sensitivity for other minor QTLs (or hinder the detection of minor effect QTLs), since the different alleles/QTLs can only be analyzed accurately in the respective group of lines. In fact, the estimate of the total number of QTLs depends on the distribution of QTL effects. If the overall distribution of the effects follows an exponential distribution, the distribution of identified QTL effects becomes a truncated exponential distribution once the Beavis effect is taken into account. This must not be confused with the original Beavis experiment, in which all simulated QTLs were assumed to have an equal genetic effect. According to the Beavis experiment, the average estimates of phenotypic variance associated with correctly identified QTLs are greatly overestimated when only 100 progeny are evaluated, slightly overestimated when 500 progeny are evaluated, and fairly close to the actual magnitude when 1000 progeny are evaluated (Beavis et al. 1994; Beavis 2019). The statistical power of detecting a minor QTL is as low as 3% when the sample size is modest, say 100, and the estimated effects are frequently inflated 10-fold. This phenomenon has since been termed the Beavis effect and has formed the basis of a number of subsequent analyses (Beavis et al. 1994; Beavis 2019). Furthermore, minor alleles/genes can also be detected for any trait that has been measured in response to the major genes (at least background genes such as Vrn, Ppd, and Rht), or by using wheat genotypes carrying null alleles for these major genes together with a genotyping technology that facilitates the selection of the most informative SNPs (by adding or removing targeted loci) in a custom-designed fashion.
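The Beavis effect described above can be reproduced qualitatively with a short simulation. The sketch below is illustrative only and is not a re-implementation of the original Beavis experiment: it assumes a single minor QTL of fixed effect segregating 1:1 in 100 progeny, unit residual variance, and an arbitrary detection threshold on the t-statistic; all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.2      # assumed effect of the minor QTL (in residual SD units)
n_progeny = 100        # modest population size
n_experiments = 2000   # number of simulated mapping experiments
t_threshold = 3.3      # arbitrary stringent cutoff for declaring the QTL

estimates = np.empty(n_experiments)
t_values = np.empty(n_experiments)
for i in range(n_experiments):
    # Two genotype classes segregating 1:1, as in an inbred bi-parental population.
    genotype = rng.integers(0, 2, size=n_progeny).astype(float)
    phenotype = true_effect * genotype + rng.normal(0.0, 1.0, size=n_progeny)
    # Simple linear regression of phenotype on genotype.
    x = genotype - genotype.mean()
    sxx = x @ x
    slope = (x @ phenotype) / sxx
    resid = (phenotype - phenotype.mean()) - slope * x
    std_err = np.sqrt(resid @ resid / (n_progeny - 2) / sxx)
    estimates[i] = slope
    t_values[i] = slope / std_err

detected = t_values > t_threshold
power = detected.mean()
mean_all = estimates.mean()
mean_detected = estimates[detected].mean()

print(f"true effect:                     {true_effect}")
print(f"power (fraction detected):       {power:.3f}")
print(f"mean estimate, all experiments:  {mean_all:.3f}")
print(f"mean estimate, detected only:    {mean_detected:.3f}")
```

Averaged over all experiments the estimate is close to the true effect, but averaged only over the experiments in which the QTL passes the threshold it is inflated several-fold, because only overestimates are large enough to be declared significant in a small population.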

Various modifications of GWAS have broadened its application in genetic studies from the gene to the molecule (metabolite) level: (a) eGWAS, which uses data from gene expression profiling and can be very useful for the identification and annotation of candidate genes involved in metabolic pathways (Luo 2015); (b) PWAS (proteome-wide association study), which can be used to link variation in proteome abundance with phenotypic variation (Brandes et al. 2020); (c) mGWAS (metabolic GWAS), which is used to define the relationship between genetic factors and the metabolome of a tissue or the complete plant (Luo 2015); and (d) TWAS (transcriptome-wide association study), which conducts expression mapping by creating functionally relevant maps that correspond to genes and their expression (Wainberg et al. 2019). Another concept, PheWAS (phenome-wide association study), applies a contrasting phenotype-to-genotype approach for assessing sequence polymorphisms across diverse phenotypes, thereby complementing the data from GWAS (Denny et al. 2010). The markers derived from GWAS can be included in genomic selection/genomic prediction models as fixed effects to enhance prediction accuracy (e.g. for grain yield and yield-related traits in wheat) (Odilbekov et al. 2019). Despite all these applications and the advantages of GWAS over conventional breeding, the true potential of GWAS still awaits full exploitation in wheat breeding because there are many gaps between genomic studies and breeding (Samantara et al. 2021). One gap is that even breeders with training in molecular biology often fall short in handling genomic data; more user-friendly software systems are required to fill this gap. Similarly, modifications of GWAS such as eGWAS, PWAS, and TWAS are still relatively new concepts to breeders and thus require wider adoption. Moreover, the effectiveness of molecular breeding based on GWAS data for highly complex traits such as yield and related traits needs to be further improved. Complex traits are controlled by multiple loci, so conventional single-locus GWAS is poorly suited to finding the associated MTAs, because marker selection criteria such as FDR, as implemented in single-locus GWAS, are too stringent for the detection of multiple MTAs. Hence, multi-locus GWAS models with higher sensitivity are required in order to detect the MTAs of such complex traits using a less stringent criterion. Another gap arises because the QTL×QTL and QTL×E interactions of complex traits are not completely described in genetic studies. Knowledge of these interactions is very important for a better understanding of complex quantitative traits and for using GWAS effectively for them (Samantara et al. 2021).

Because wheat is an allopolyploid, genetic redundancy is no new concept in this crop. For many traits, such as seed dormancy (Abe et al. 2019) and broad-spectrum resistance to diseases like powdery mildew (Wang et al. 2014b), genetic analysis in wheat has been very difficult because of the presence of multiple homeoalleles: completely recessive mutants do not exist in natural populations, so the function and effects of the underlying homeoalleles cannot be readily studied. To solve this issue, and for the improvement or better understanding of polyploid crops, simultaneous editing of multiple homeoalleles of a trait is required. The CRISPR-Cas9 system and its predecessors ZFN (zinc finger nucleases) and TALEN (transcription activator-like effector nuclease) are powerful tools for genome editing which can be used to precisely edit multiple QTLs simultaneously and to generate novel alleles, providing rapid genetic enhancements (Abe et al. 2019). The significant MTAs from GWAS can be fed into candidate gene identification approaches to find putative genes. CRISPR-Cas9 or TALEN can then be used to generate genome-edited plants in order to validate the function of the associated putative genes, or they can be used directly to edit candidate genes if the data are already available. So far, CRISPR-Cas9 and TALEN have been successfully used in wheat for editing traits such as male sterility (Okada et al. 2019), powdery mildew resistance (Wang et al. 2014b), and quality traits such as the gluten content of grains (Jouanin et al. 2020). Hence, genome editing tools pave the way to the era of ‘GWAS-plus’ in wheat, a concept first proposed in rice (Wang et al. 2020).

In wheat, several recent studies have also demonstrated the power of association mapping in identifying and characterizing the candidate genes that control target traits (Li et al. 2019a; Wang et al. 2019; Sandhu et al. 2021e). For instance, association mapping using the wheat 90K SNP assay was performed for grain length and thousand-grain weight, leading to the identification of numerous significant SNPs located on chromosome 7B. Furthermore, haplotype analysis of these significant SNPs on 7B generated a block containing the predicted TaGW8-B1 gene, which was then cloned by sequencing in bread wheat. Analysis of agronomic traits revealed that genotypes with the TaGW8-B1a allele had significantly more grains per spike, greater grain width, higher thousand-grain weight, longer grains, and more spikelets per spike than genotypes with TaGW8-B1b (Yan et al. 2019). Another GWA study, conducted using a 90K genotyping assay for six quality-related traits in Chinese wheat cultivars in eight environments over 4 years, identified a total of 846 significant SNPs, including 103 multi-environment significant SNPs detected in more than four environments (Chen et al. 2019b). Furthermore, some important genes, including genes of known function and annotated genes of unknown function, were found to be linked to the six quality traits. Among the annotated genes of unknown function, TaRPP13L1 was found to be associated with flour colour. Wheat cultivars or lines with the TaRPP13L1-B1a allele showed considerably higher flour redness and lower yellowness than those with TaRPP13L1-B1b in both the Chinese wheat natural population and the bi-parental population. This study provided valuable information for further dissection of the genetic basis of flour colour and also provided potential genes or genetic loci for marker-assisted selection to improve breeding for wheat quality (Chen et al. 2019b). These successful examples of genes discovered using GWAS give strong evidence that GWAS can be utilized as part of a rapid gene-cloning strategy.

Researchers may have been misled by early optimism regarding QTL deployment in populations using marker-assisted selection. Because favourable alleles frequently have population-specific effects, a QTL found in one population may not have an effect of the same size in other populations. This could be the result of epistatic interactions between the QTL and the overall genetic background, resulting in low penetrance and varying degrees of expression (Gaire et al. 2020). After a large number of QTLs/alleles have been interrogated simultaneously in natural populations through GWAS, bi-parental populations can be used to validate a subset of the detected QTLs. This validation step is required to choose a parental line confirmed to carry the favourable allele, which can then be used as a donor in a marker-assisted breeding program. For instance, a GWA study identified several QTLs associated with grain yield, yield components, and plant water status in wheat. Fourteen of these QTLs, detected as significant in at least three environments in the GWAS, were further validated using a panel of eight bi-parental mapping populations (Zhang et al. 2018). The QTLs identified and validated in this study provided beneficial information for the improvement of wheat under full and limited irrigation.

Moreover, GWAS results can also be validated using meta-QTL analysis. For instance, a GWAS analysis identified a total of 13 and 11 significant MTAs for fertile tiller number and total tiller number, respectively, in Iranian wheat under varying water regimes (Bilgrami et al. 2020). A meta-analysis was then conducted using 30 previously published independent studies; it identified 30 meta-QTL regions on 11 wheat chromosomes, which validated at least 5 of the significant MTAs (identified through GWAS) associated with the traits in question (Bilgrami et al. 2020). GWAS offers the opportunity to identify genes that contribute to naturally occurring variation in complex quantitative traits. However, GWAS relies largely on statistical association, so functional validation is necessary to make strong claims about gene function. The genes identified through GWAS can be validated using different strategies, including transgenesis, gene silencing, and gene and genome editing (Curtin et al. 2017).

GWAS can be considered an exploratory analysis for the right selection of true segregating genotypes/accessions that may be used as parents of bi-parental mapping populations, as well as for further genetic and molecular validation of the associations (Alqudah et al. 2020). GWAS can also be used to gain insights into breeding-program variation (the genetic variation in the natural population used to develop improved breeding material) or to support MAS (where candidates are screened for target markers, their phenotypes are predicted based on allelic states, and selections are then made based on these predictions), because the association mapping population can be considered a source of favourable alleles that are absent or rare in bi-parental populations. QTLs/MTAs identified through GWAS can be followed by MAS if the associated markers explain a significant proportion of the genetic variation for the trait. Initially, tracking a large number of loci based on hundreds of identified SNPs through simple MAS was not practical, because it required re-genotyping and was therefore costly; however, platforms for multiplexed SNP genotyping have since emerged, which may enable practical use of the information generated through GWAS or GS. For instance, AgriSeq targeted GBS can target and uniformly amplify hundreds to thousands of markers in a single PCR reaction using a highly efficient multiplexed PCR chemistry (Gujjula et al. 2019). A targeted, sequence-based, scalable, and flexible multiplexed genotyping technology known as KeyGene SNPSelect was also proposed; it facilitates the selection of the most informative SNPs (by adding or removing loci), permitting cost-efficient yet highly informative genotyping in a custom-made fashion (Hogers et al. 2018). Most recently, in 2020, a method known as SNP-seq was developed that combines the advantages of multiplex PCR amplification and high-throughput sequencing.
This method is flexible in both the number of SNPs and the number of samples targeted, yields high accuracy, particularly when genotyping genome-wide perfect SNPs with high polymorphism and conserved flanking sequences, and is also cost-effective (Zhang et al. 2020a).

Under significant epistasis, interacting loci distributed across the genome alter the outcome of a major single-locus QTL. This epistatic background influence limits the usefulness of QTLs in other populations (Korte et al. 2012; Bocianowski 2013). The QTL and the interacting loci act as a package within the specific genetic background of the discovery population (Bocianowski 2013); in these scenarios, special statistical techniques may be required to identify and minimize background epistasis effects. Xavier et al. (2015) advocated simultaneously assessing marker effects in different populations to eliminate variations in QTL phasing, genetic background, and effect sizes from one population to another (Bocianowski 2013). More recently, Malosetti et al. combined QTL discovery in pre-breeding populations subjected to intensive phenotypic selection for the target trait across several plant generations with accelerated generation turnover (i.e. ‘speed breeding’), which allows multiple plant generations to be cycled each year. They demonstrated that QTL detection in breeding populations under selection for the target trait can detect QTLs associated with that trait, and that the frequency of the favourable alleles increases in response to selection, thereby validating the identified QTLs. This is a useful opportunistic approach that may provide QTL information that is more readily transferred to breeding applications (Malosetti et al. 2020). They also envisaged great potential for integrating speed breeding with GWAS, accelerating the rate of crop improvement.

GWAS-assisted genomic selection

Conventionally, plant breeders relied on phenotypic information to select desirable plants in the field. With the development of high-throughput genotyping tools, the selection process was complemented by MAS. MAS allows the rapid selection of superior genotypes by identifying QTLs with a major effect on the trait. Still, it fails for most complex quantitative traits in crop plants, which are usually controlled by a large number of small-effect QTLs (Xu and Crouch 2008). Furthermore, these small-effect QTLs are highly affected by environmental conditions, different genetic backgrounds, and QTL-by-environment interactions (Bernardo 2016). Even linkage and association mapping have not been able to properly account for such small-effect QTLs. Moreover, MAS is used to introduce a single gene at a time, which increases the time required for variety release, especially in wheat, which has a large number of contributing genes due to its hexaploid nature. Hence, genome-wide prediction (GP) or genomic selection (GS) came in handy: it uses genome-wide marker information to predict the breeding value of a plant, known as the genomic estimated breeding value (GEBV), and these GEBVs then assist in making selections (Meuwissen et al. 2001).

Originally proposed by Meuwissen et al. (2001) in animal breeding, GS could be considered an upgraded version of MAS, where all the markers are used to calculate the GEBV of a plant. It is assumed that each QTL is in LD with at least one molecular marker, and that this marker captures the genetic variance of that QTL in the GS model used for predicting GEBVs (Lorenz et al. 2011). Genomic selection requires a training population that is both genotyped and phenotyped, and the generated information is then used to define the GS model. The model estimates the effects of all molecular markers using phenotypic data from the training population (Rutkoski et al. 2011). Once the GS model is trained, it is used to predict the GEBVs of the breeding/testing population, which is only genotyped. Plant breeders can choose the parents for inter-mating, generate segregating populations, genotype them, make selections based on the GEBVs, and develop cultivars without further testing, thus accelerating the breeding cycle and ultimately increasing the genetic gain per unit time (Bernardo 2016). Genomic selection is being applied in breeding programs for the selection of parents for crossing, the selection of top-performing lines in breeding trials, and the prediction of breeding values in multi-environment trials, and it assists in the maintenance of high-performing lines in the program (Sandhu et al. 2021b, a).
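
The training and prediction steps described above can be illustrated with a minimal sketch. It assumes a simple ridge regression on marker genotypes (in the spirit of ridge-regression BLUP) rather than any specific model from the cited studies. The function names, the simulated data, and the shrinkage parameter `lam` are all hypothetical choices made for illustration only.

```python
import numpy as np

def train_ridge_gs(Z_train, y_train, lam):
    """Estimate ridge-shrunken effects for all markers from the training population."""
    mu = y_train.mean()                      # overall phenotypic mean
    center = Z_train.mean(axis=0)            # per-marker mean used for centring
    Zc = Z_train - center
    n_markers = Zc.shape[1]
    # Ridge solution: (Zc'Zc + lam * I)^-1 Zc'(y - mu)
    effects = np.linalg.solve(Zc.T @ Zc + lam * np.eye(n_markers),
                              Zc.T @ (y_train - mu))
    return mu, center, effects

def predict_gebv(Z_new, center, effects):
    """GEBVs of genotyped-only candidates: centred genotypes times marker effects."""
    return (Z_new - center) @ effects

# Hypothetical simulated data: 250 lines, 100 markers coded 0/1/2.
rng = np.random.default_rng(0)
Z = rng.integers(0, 3, size=(250, 100)).astype(float)
true_effects = rng.normal(0.0, 0.3, size=100)
y = Z @ true_effects + rng.normal(0.0, 2.0, size=250)

# The first 200 lines act as the training population (genotyped and phenotyped);
# the remaining 50 act as the breeding population (treated as genotyped only).
train, test = np.arange(200), np.arange(200, 250)
mu, center, effects = train_ridge_gs(Z[train], y[train], lam=10.0)
gebv = predict_gebv(Z[test], center, effects)

# Prediction accuracy: correlation between GEBVs and the held-back phenotypes.
accuracy = np.corrcoef(gebv, y[test])[0, 1]
print(f"prediction accuracy (r) = {accuracy:.2f}")
```

In practice, the shrinkage parameter would be derived from estimated variance components or tuned by cross-validation; the fixed value used here is only a placeholder.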

Several factors affect GS prediction accuracy, namely, the heritability of the trait, relatedness between the training and testing populations, sample size, cross-validation scenario, marker density, and the GS model used (Lorenz et al. 2011; Sandhu et al. 2021c). Some traits in wheat are controlled by large-effect QTLs, and hence, including the effects of those QTLs in the GS model may provide an excellent opportunity (Fig. 6). Several studies have shown that incorporating GWAS results as fixed effects in GS models increases prediction accuracy for quantitative traits (Boichard et al. 2012; Bernardo 2014). Bernardo (2014) showed in a simulation study that including a QTL that explains more than 10% of the genetic variance as a fixed effect in the GS model significantly increased model performance. GWAS-assisted GS has several benefits: it does not require additional data, it increases prediction accuracy, and it is easily accessible to plant breeders without the need to consider the underlying genetic architecture of the trait (Spindel et al. 2016). In addition, population structure can be accounted for by including principal components from PCA as fixed effects in the GS models. In a recent study from CIMMYT, the authors showed that including GWAS loci as fixed effects in the GS model resulted in a 9 to 10% increase in prediction accuracy for grain yield in spring wheat (Sehgal et al. 2020). Similarly, Odilbekov et al. (2019) demonstrated the ability of GWAS-assisted GS to predict resistance to Septoria tritici blotch in winter wheat; prediction accuracy increased from 47 to 62% when all significant QTLs were included in the GS model. GWAS-assisted GS has given significant improvements for traits controlled by a small number of QTLs; however, for grain yield, significant improvement has not been observed.
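
As a rough illustration of how GWAS hits can enter a GS model as fixed effects, the sketch below fits an intercept and the selected markers without shrinkage while all remaining markers are ridge-penalised, by solving Henderson-style mixed model equations. This is a simplified stand-in, not the model of any study cited above. The function names, the simulated data, the value of `lam`, and the assumption that marker 0 was flagged by a prior GWAS are all hypothetical.

```python
import numpy as np

def fit_gwas_assisted_gs(Z, y, fixed_idx, lam):
    """Fit intercept + selected markers as fixed effects; shrink all other markers."""
    n, m = Z.shape
    fixed_idx = list(fixed_idx)
    random_idx = [j for j in range(m) if j not in fixed_idx]
    X = np.column_stack([np.ones(n), Z[:, fixed_idx]])   # unpenalised design
    W = Z[:, random_idx]                                  # penalised markers
    # Mixed model equations: only the block for W carries the ridge penalty.
    lhs = np.block([[X.T @ X, X.T @ W],
                    [W.T @ X, W.T @ W + lam * np.eye(W.shape[1])]])
    rhs = np.concatenate([X.T @ y, W.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return {"beta": sol[:p], "u": sol[p:],
            "fixed_idx": fixed_idx, "random_idx": random_idx}

def predict_values(model, Z_new):
    """Predicted genetic values: fixed part plus shrunken marker effects."""
    X = np.column_stack([np.ones(Z_new.shape[0]), Z_new[:, model["fixed_idx"]]])
    return X @ model["beta"] + Z_new[:, model["random_idx"]] @ model["u"]

# Hypothetical simulated data: one large-effect QTL (marker 0) plus many small effects.
rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(300, 200)).astype(float)
effects = rng.normal(0.0, 0.05, size=200)
effects[0] = 2.0
y = Z @ effects + rng.normal(0.0, 1.0, size=300)
train, test = np.arange(200), np.arange(200, 300)

# Compare a plain GS model with one that treats the assumed GWAS hit as fixed.
for label, fixed in [("GS only", []), ("GWAS-assisted GS", [0])]:
    model = fit_gwas_assisted_gs(Z[train], y[train], fixed, lam=50.0)
    acc = np.corrcoef(predict_values(model, Z[test]), y[test])[0, 1]
    print(f"{label}: prediction accuracy (r) = {acc:.2f}")
```

To avoid inflated accuracies in a real analysis, the GWAS that selects the fixed-effect markers should be run on the training set only, within each cross-validation fold.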

Fig. 6

Steps in GWAS-assisted genomic selection. The results from GWA studies (association population) are used as fixed effects in the genomic selection pipeline. Genomic selection models are trained on previous data sets, and significant QTLs are included as fixed effects during the prediction of breeding values. The concept for this figure is taken from Crossa et al. (2017), Sehgal et al. (2020b), and Sandhu et al. (2021c)

Mixed models used in GS take only additive genetic effects into account, ignoring dominance, epistatic, and environmental variances (Crossa et al. 2019). With the rapid adoption of machine learning (ML) and deep learning (DL) approaches in other disciplines, there is also a need to apply these highly efficient approaches to GS in wheat breeding. ML and DL models are flexible in modelling both large- and small-effect QTLs, and hence, these models have overcome the need for separately including GWAS-assisted fixed effects in the GS models. ML and DL models have shown their superiority for predicting grain length (Ma et al. 2018b), grain yield (Sandhu et al. 2021c), and rust resistance (González-Camacho et al. 2018) in wheat. These models do not rely on the assumptions of traditional GS models during training, owing to their use of nonlinear activation functions (Bellot et al. 2018; Sandhu et al. 2021c). Commonly used models are random forest, reproducing kernel Hilbert space, support vector machine, multilayer perceptron, convolutional neural network, and recurrent neural network. This review opens up an avenue for shifting from GWAS to GS using ML and DL models to make the best selections, thus increasing the genetic gain in crop plants for complex quantitative traits.

Transcriptome-wide association studies (TWAS) and probabilistic TWAS (PTWAS)

GWAS is performed in humans, animals, and plants to associate various traits with genomic loci (MacArthur et al. 2017). Most GWAS loci lie in the intronic region of the genome; therefore, information about the causal genes underlying the gene-trait association is largely lacking. GWAS has failed to determine the exact causal genes that have a major effect on trait variation and the causal genomic loci that drive the association (Gallagher and Chen-Plotkin 2018). This limitation has led to the development of new methods to prioritize causal genes at GWAS loci. Transcriptome-wide association study (TWAS) is one such method, which uses gene expression data to determine gene-complex trait associations and prioritizes likely causal genes at GWAS loci (Gamazon et al. 2015). TWAS follows a three-step procedure: first, it uses expression panels to train models that predict expression from genotype; second, these models are used to predict each individual’s expression in the GWAS cohort; and finally, the statistical association between predicted gene expression and phenotypic traits is estimated. TWAS can be performed with individual-level data using PrediXcan (Gamazon et al. 2015), or with GWAS summary data using Fusion (Gusev et al. 2016) or S-PrediXcan (Barbeira et al. 2018).
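
The three-step TWAS procedure can be sketched for a single gene as follows. This is a toy illustration with individual-level data, and it assumes a plain ridge regression as the expression model; it does not reproduce the PrediXcan, Fusion, or S-PrediXcan implementations. All function names, simulated data, and parameter values are hypothetical.

```python
import numpy as np

def train_expression_model(G_ref, expr_ref, lam):
    """Step 1: learn SNP weights that predict expression in the reference panel."""
    mu = G_ref.mean(axis=0)
    Gc = G_ref - mu
    weights = np.linalg.solve(Gc.T @ Gc + lam * np.eye(Gc.shape[1]),
                              Gc.T @ (expr_ref - expr_ref.mean()))
    return mu, weights

def predict_expression(G, mu, weights):
    """Step 2: genetically predicted expression for genotyped individuals."""
    return (G - mu) @ weights

def association_t(pred_expr, phenotype):
    """Step 3: correlation and t-statistic of phenotype on predicted expression."""
    n = len(phenotype)
    r = np.corrcoef(pred_expr, phenotype)[0, 1]
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return r, t

# Hypothetical simulation: 20 cis-SNPs, three of which regulate the gene.
rng = np.random.default_rng(2)
n_ref, n_gwas, n_snps = 150, 500, 20
w_true = np.zeros(n_snps)
w_true[:3] = 0.8

# Reference expression panel: genotypes and measured expression.
G_ref = rng.integers(0, 3, size=(n_ref, n_snps)).astype(float)
expr_ref = G_ref @ w_true + rng.normal(0.0, 1.0, size=n_ref)

# GWAS cohort: genotypes and phenotype only; expression is never measured here.
G_gwas = rng.integers(0, 3, size=(n_gwas, n_snps)).astype(float)
hidden_expr = G_gwas @ w_true + rng.normal(0.0, 1.0, size=n_gwas)
phenotype = 0.5 * hidden_expr + rng.normal(0.0, 1.0, size=n_gwas)

mu, weights = train_expression_model(G_ref, expr_ref, lam=5.0)
pred_expr = predict_expression(G_gwas, mu, weights)
r, t_stat = association_t(pred_expr, phenotype)
print(f"gene-trait association: r = {r:.2f}, t = {t_stat:.1f}")
```

A p-value for the gene would follow from the t-distribution with n - 2 degrees of freedom, and a genome-wide analysis would repeat these steps for every gene with a multiple-testing correction.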

However, TWAS has also been found to make false prioritizations when expression panels from unrelated tissues are used. The Fusion platform for TWAS performs better in prioritizing genes at loci than two simple baselines, i.e. random per-locus ranking and expression ranking (Wainberg et al. 2019). Nevertheless, TWAS is challenged by two factors, i.e. tissue bias and co-regulation. Tissue bias can be reduced by using the mechanistically most related tissue. If the tissue is too small to provide a sufficient sample size, other related tissues can be included to increase the sample size. Co-regulation can be addressed by TWAS fine mapping, but this is more challenging to perform than GWAS fine mapping. Therefore, additional computational methods are needed alongside TWAS to improve its performance in gene prioritization at GWAS loci. Two further limitations of TWAS are that (i) it does not validate the causal implications of an association and (ii) it lacks estimation of the causal effect of gene-trait associations (Zhang et al. 2020b). Probabilistic transcriptome-wide association studies (PTWAS) address these limitations by testing relationships between causal genes and complex traits, allowing validation of causal implications and estimation of causal effects.

Probabilistic transcriptome-wide association studies use instrumental variable analysis and probabilistic eQTL annotations to estimate the causal relationship between gene expression and phenotypic traits (Zhang et al. 2020a, 2020b). The approach is more powerful than other existing methods, as it provides causal implications and estimates tissue-specific gene-to-trait effects using multi-tissue eQTL data.

The phenome-wide association study (PheWAS) is a high-throughput tool that determines the association between genotypic variation and the phenotypes of an organism to give a better understanding of the effect of a genotype. GWAS determines genotype-phenotype association by linking many genotypic variants, such as SNPs, to a single phenotypic trait or disease, whereas PheWAS studies the link between a genotypic variant and many phenotypic traits. In other words, GWAS focuses on a single target phenotype across many genotypes (possibly up to 500,000 SNPs), whereas PheWAS relates a single target genotype to many phenotypes (up to 1,000). PheWAS has recently been used in medicine to identify the association of genetic loci with many diseases. However, its application in plant science has not yet been explored.
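
The contrast between the two scan directions can be made concrete with a small sketch: a GWAS-style scan tests one phenotype against every SNP, whereas a PheWAS-style scan tests one SNP against every phenotype. The example uses simple marginal regressions on simulated data, without the covariates or population-structure corrections a real analysis would need. The function name, data dimensions, and planted effects are hypothetical.

```python
import numpy as np

def marginal_t(columns, target):
    """t-statistics for simple regressions between `target` and each column.

    In a simple regression the t-statistic depends only on the correlation and
    the sample size, so it is the same whichever variable is the response.
    """
    n = len(target)
    xc = columns - columns.mean(axis=0)
    yc = target - target.mean()
    r = (xc.T @ yc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

# Hypothetical data: 400 accessions, 50 SNPs (coded 0/1/2), 30 phenotypes.
rng = np.random.default_rng(3)
n, n_snps, n_traits = 400, 50, 30
G = rng.integers(0, 3, size=(n, n_snps)).astype(float)
P = rng.normal(0.0, 1.0, size=(n, n_traits))
P[:, 3] += 1.0 * G[:, 7]     # SNP 7 is made to affect phenotype 3 ...
P[:, 12] += 1.0 * G[:, 7]    # ... and, pleiotropically, phenotype 12

# GWAS direction: one target phenotype scanned against all SNPs.
gwas_t = marginal_t(G, P[:, 3])
top_snp = int(np.argmax(np.abs(gwas_t)))

# PheWAS direction: one target SNP scanned against all phenotypes.
phewas_t = marginal_t(P, G[:, 7])
top_traits = sorted(np.argsort(-np.abs(phewas_t))[:2].tolist())

print("GWAS top SNP for phenotype 3:", top_snp)
print("PheWAS top phenotypes for SNP 7:", top_traits)
```

Because each direction performs many tests, significance thresholds should be corrected for the number of SNPs in the GWAS scan and for the number of phenotypes in the PheWAS scan.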

Conclusion and future perspectives

With the arrival of high-throughput next-generation sequencing technologies and the development of various efficient statistical models, GWAS has become a method of choice for the genetic dissection of complex quantitative traits in many crops, including wheat. The various GWA studies, which together report 46,940 loci, have generated information that is potentially of great use in breeding and may form the basis of meta-GWAS analysis; however, actual utilization of these loci is not yet apparent, and we are yet to see them transferred from publications to actual varieties. Using GWAS, the genetic architecture of several different agronomic, physiological, and quality traits has been widely investigated, and thousands of MTAs or causal SNPs have been revealed for these traits in wheat (Tables 3, 4, 5, and 6). These causal SNPs or MTAs have largely allowed the identification of candidate genes for different complex traits (Tables 3, 4, 5, and 6). The use of significant MTAs as fixed effects in genomic prediction models has also increased the prediction accuracy of GS for various traits in wheat (Sehgal et al. 2020), which indicates that the highly significant and robust genomic regions identified via GWAS can largely improve the utility of GS in future wheat breeding programs.

Almost 14 years have passed since the first association study in wheat was published (Breseghello and Sorrells 2006), but GWAS still faces some challenges that need to be addressed carefully to fully exploit this important approach. These challenges or limitations include the false discovery rate (FDR), the ‘large p small n problem’, markers with rare genetic variants and rare alleles, the family-wise error rate (FWER), and the reproducibility of identified loci. These issues have been discussed elsewhere, and solutions have also been sought to manage them (Gupta et al. 2019a). Moreover, epistatic interactions and G × E interactions have largely been ignored in wheat GWAS, although these interactions have been demonstrated to be important for complex quantitative traits (Sehgal et al. 2017). Improved statistical models/methods and experimental designs for dissecting these interactions need to be explored in the future. Furthermore, for a detailed understanding of the molecular mechanisms underlying genotype-phenotype relationships, causative genes along with other causative sequence variants need to be identified. We also do not have enough knowledge of the potential effects of sequence variants in untranslated regions (UTRs) and promoter regions. Integrated use of multi-omics data can also help in gaining insights into these molecular mechanisms. Various modifications of GWAS, such as eGWAS, PWAS, and TWAS, have emerged, but these are still relatively new concepts to wheat breeders/geneticists. A new method, meta-GWAS, has recently emerged that can enable the identification of more robust and significant genomic regions associated with the target traits. Nevertheless, a few meta-GWA studies have been used within a wheat breeding program to reveal associated genomic regions and directly implement genomics-assisted breeding.
Moreover, with the increasing interest in ML and DL techniques, the analysis of multi-dimensional data will soon become much easier (Sandhu et al. 2021a, c). With these advancements, it will be possible to construct the networks that might be involved in the expression of target phenotypes for complex traits. We believe that these efforts will greatly facilitate molecular breeding in wheat.