Introduction

Association mapping (AM), as a complement to linkage mapping (Breseghello and Sorrells, 2006a), overcomes the limitations of bi-parental populations. In a bi-parental population, loci controlling the traits of interest escape detection when the parents are similar at those loci. AM instead uses diverse individuals, which increases the number of alleles examined and captures multiple historical recombination events. Rare alleles are difficult to detect in AM, so AM panels must be chosen carefully to obtain adequate allele frequencies while representing the genetic diversity of the crop species. A well-chosen panel reduces the time and cost of detecting markers linked to quantitative traits. AM uses panels of diverse cultivars to capture more recombination events, which gives higher resolution than linkage mapping for finding regions associated with a trait (Zhu et al. 2008; Stich and Melchinger 2010). AM has two components: genome-wide association studies (GWAS) and candidate gene analysis. Quantitative trait loci (QTL) obtained through interval mapping are validated using GWAS. AM identifies markers that are associated with traits. However, the diversity of the sample and the number of chromosomes affect the genotyping effort. Markers such as Simple Sequence Repeats (SSR), Expressed Sequence Tag (EST), Restriction Fragment Length Polymorphism (RFLP), Random Amplified Polymorphic DNA (RAPD), Amplified Fragment Length Polymorphism (AFLP), Diversity Arrays Technology (DArT), and Single Nucleotide Polymorphisms (SNP) have contributed to AM. GWAS scans the whole genome to determine whether any relationship exists between phenotypes and markers. However, many markers are required to cover the genome, depending on the anticipated rate of linkage disequilibrium (LD) decay. Most AM is done using GWAS with SNPs. Nevertheless, this method has several restrictions.
First, knowledge of the genome is needed to design SNP arrays and to locate SNPs in the genome. Second, the phenotype might be caused by rare variants that are not on the SNP chip. A third restriction is the existence of structural variations. High-throughput sequencing data can overcome some of these limitations. In mapping by sequencing (MBS), which includes RNA-Seq, whole-genome sequencing (WGS) and bulked segregant analysis (BSA), all reads are mapped to the reference genome and variants are then called (Hartwig et al., 2012). These variants are then tested for association. However, this requires a reference genome; regions absent from the reference are not captured in the study, which may bias variant calling. Moreover, genotype calling is complicated at low sequencing depth (Nielsen et al. 2012) because of sequencing errors and repetitive regions. An alternative approach is genotyping with simultaneous de novo assembly, using tools such as Cortex (Iqbal et al. 2012). However, both approaches are computationally intensive and costly. Sequences that have diverged significantly from the reference genes, such as R genes, are difficult to identify in GWAS. Testing trait associations on sub-sequences (k-mers) overcomes this limitation. Needle in the k-stack (NIKS) was introduced to identify mutations by comparing sequencing data from two strains using k-mers (Nordström et al., 2013). The k-mers associated with the phenotype are counted and identified, and overlapping k-mers are then assembled to obtain sequences corresponding to the associated regions. Arora et al. (2019) used k-mer-based association genetics to clone R genes from a plant diversity panel. The approach was combined with R gene enrichment sequencing (AgRenSeq) to identify Sr genes in the phenotyped panel.
However, the authors observed that complete Nucleotide-binding Leucine-rich Repeat (NLR) contigs would not be generated if local assembly approaches that use only those k-mers that are strongly linked to the trait were used.
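The k-mer association idea described above can be illustrated with a minimal sketch. The function names, the toy sequences and the mean-difference score are assumptions made for illustration only; this is not the NIKS or AgRenSeq implementation, both of which work on sequencing reads and use formal statistical tests.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_trait_scores(samples, k):
    """Score each k-mer by mean phenotype of carriers minus non-carriers.

    samples: list of (sequence, phenotype) pairs; one toy sequence stands
    in for the sequencing reads of each accession (an assumption).
    """
    carriers = {}
    for idx, (seq, _) in enumerate(samples):
        for km in kmers(seq, k):
            carriers.setdefault(km, set()).add(idx)

    n = len(samples)
    scores = {}
    for km, idxs in carriers.items():
        if len(idxs) == n:
            continue  # present in every accession, so uninformative
        with_km = [samples[i][1] for i in idxs]
        without = [samples[i][1] for i in range(n) if i not in idxs]
        scores[km] = sum(with_km) / len(with_km) - sum(without) / len(without)
    return scores


# Hypothetical panel: two resistant (high score) and two susceptible accessions.
panel = [("ACGTACGA", 1.0), ("ACGTACGA", 0.9),
         ("ACGTTCGA", 0.1), ("ACGTTCGA", 0.0)]
scores = kmer_trait_scores(panel, k=4)
top = sorted(scores, key=scores.get, reverse=True)[:4]
print(top)
```

In this toy panel the highest-scoring k-mers are those found only in the resistant accessions; they overlap one another, which is what makes the subsequent assembly step possible.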

Linkage analysis and AM

In linkage mapping, few recombination events occur within pedigrees and families (Zhu et al. 2008), which leads to low mapping resolution. In AM, recombination is more frequent and diverse because natural genetic diversity is exploited, leading to higher resolution. Wu and Zeng (2001) proposed joint linkage-association mapping (JLAM) to overcome the low resolution of bi-parental mapping and the low power of AM, and so to harness the potential of both. Both association and linkage mapping rely on chromosomal crossing over to break up allele associations into new haplotypes that can be linked to phenotypic variation (Myles et al. 2009). The key difference between the two methods lies in the degree of control the researcher has over recombination events, whether through mating design or through selection of the germplasm set. In linkage mapping, the researcher usually uses bi-parental populations, which makes it more feasible to control the opportunity for recombination in the progeny, though with a corresponding loss in mapping resolution relative to AM. Association panels can be regarded as a more natural experiment because there is no control over the number of recombination events that produced the tested genotypes (Álvarez et al., 2015). Diverse panels, such as bi- and multi-parent populations and breeding populations, are used in both AM and linkage studies, each with its own advantages and limitations (Xiao et al. 2017). AM assesses correlations between phenotypes and genotypes, from which QTL can be detected for traits that show variation. The main advantages of AM over linkage mapping are its higher resolution and its ability to test multiple alleles for association.
Moreover, linkage mapping allows populations to be created that contrast positive and negative alleles, whereas AM covers only the range of phenotypic values of the alleles present in the population (Álvarez et al., 2015). It should not be assumed that the frequency distribution of alleles at functional loci is the same as that of alleles at random loci. Indeed, it will be difficult to account for most phenotypic differences using AM, because rare alleles usually cause most of them.

Measuring the LD decay rate in AM requires diverse germplasm, so marker density is typically higher than in linkage mapping (Álvarez et al. 2015). LD decay is slightly higher in Recombinant Inbred Lines (RIL) than in F2 populations; nevertheless, the resolution remains lower than that of an AM population (Álvarez et al. 2015). Linkage mapping locates QTL with a resolution on the order of 5–10 cM, an interval within which many genes are present (Buckler IV and Thornsberry 2002). In germplasm panels with low LD, by contrast, the diagnostic power of a single marker extends only a short distance, so a large number of markers is needed for a whole-genome scan. Additionally, the population in AM is assembled by a strategy suited to sampling or breeding objectives, whereas in linkage mapping the population structure is usually constructed and maintained. Breeding lines are challenging to maintain in QTL mapping (Myles et al. 2009), but germplasm accessions are well suited to AM because of the large number of alleles they contain. Generally, AM is an alternative to QTL mapping that does not need the development of bi-parental crosses or the screening of progeny generations.

Limitations

The detection power of AM depends on the phenotype under study and on its association with the marker locus (Álvarez et al. 2015). In most germplasm collections, relevant alleles are present, and such collections are frequently very significant sources of desirable alleles (Rafalski 2010). However, in some germplasm the presence of individuals adapted to diverse growing conditions is a barrier to its use. Phenotypic evaluation of diverse germplasm must therefore be given due consideration in association studies (Myles et al. 2009). Gupta et al. (2019) stated that false positives and false negatives, together with the issue of missing heritability, are the main problems in GWAS. Recently, several approaches have been employed to overcome these limitations. They include epigenetics, the use of expression profiles, re-sequencing, identification of candidate genes, and functional characterization using reverse genetic approaches such as gene silencing or retrotransposon-mediated gene disruption, among others. The analysis of rare alleles and rare variants is also among the approaches used to enhance GWAS. The advantage of association over linkage mapping is reduced in populations where LD is extensive and does not decay rapidly across most of the genome, which appears to be common in many self-pollinating species. In this case, no dependable relationships can be established between traits and specific genes; instead, an entire genomic region is associated because haplotype blocks have not been broken up. Alternatives that still take advantage of AM exist for crops in which high LD or other restrictions give low resolution, and the limitations of population structure and rare alleles can also be overcome. One such alternative is to cross breeding lines to form a multi-parent population, in which functional allele combinations are identified and used directly to detect marker x trait associations (Kover et al. 2009).
AM utilizing Q + K model has been modified to deal with large p, multiple testing and small n limitations in GWAS (Yu and Buckler, 2006). With the development of high throughput technology, haplotypes and SNP-sets (instead of single SNPs) are being used for GWAS, thereby overcoming the limitations of multiple testing and enhancing the identification of candidate genes which in turn facilitate gene-set-based and gene-based association mappings.

Genomic technology

The technology involved in manipulating and analyzing genomic information is referred to as genomic technology. It was initiated following the invention of DNA cloning in the 1970s (Galas and McCormack 2003). The availability of model species and their genome annotations as well as the application of genomic technology provide sequences for various complex traits and candidate genes for further association analysis (Zhu et al. 2008). Genome re-sequencing, reduced representation sequencing and pool-seq are very accessible and inexpensive approaches for population genomic studies (Therkildsen and Palumbi 2017). Targeting Induced Local Lesions in Genomes (TILLING, a method in molecular biology that allows direct identification of mutations in a specific gene) and Ecotype TILLING (EcoTILLING, a modification of TILLING technique that looks for natural mutations in individuals, usually for population genetics analysis) are among the genomic methods used for germplasm collections and screening allelic variant mutants in target genes. Genome re-sequencing is very useful for genome-wide discovery of markers for high-throughput genotyping, such as SNPs and SSRs or for the construction of high-density genetic maps. These, in turn, enhance genetic diversity study, and make identifications of markers linked to genes and QTL achievable via a variety of approaches including fine genetic mapping, bulked segregant analysis (BSA) and association mapping (M Perez-de-Castro et al. 2012). Currently, whole genome sequencing and characterization have been achieved using molecular markers and play a role in marker-assisted breeding (Song et al. 2010). High-density markers are needed to detect alleles that are involved in agronomic traits (Tardivel et al. 2014). Genetic improvement of complex characters (especially drought and salt tolerance) has now been achieved with genomic technology. 
Today, detection of specific genes is effectively accomplished at faster rate by combining marker-assisted selection (MAS) with genomic technology as compared to classical breeding (Saade et al. 2016).

Natural diversity

Introgression library (IL) and advanced backcross QTL (AB-QTL) techniques are used to explore natural diversity. They are used to remove alleles from germplasm to improve quality, productivity, nutritional value and adaptation of crops (Zamir 2001). Interestingly, large scale functional diversity of a crop species can be evaluated using AM, which makes it different from AB-QTL and IL (Breseghello and Sorrells 2006b). Generally, information derived from association mapping applies to a broader germplasm base while that of bi-parental mapping is specific to the same or genetically similar population (Zhu et al. 2008).

Different phases of association mapping

Five stages are involved in AM, as illustrated in Fig. 1: (i) individuals are selected for the population, (ii) the selected population is genotyped, (iii) population structure is analyzed based on the genotyping, (iv) phenotypic traits of interest are characterized in the population (phenotyping), and (v) phenotype-genotype relationships are determined (association analysis).

Fig. 1 Simple illustration of association mapping

Selection

In the AM process, an important component to consider is the careful selection of the population, and the resolution power of the study would be better if more recombinations are observed. Generally, a diverse population must be considered rather than classified or structured (Álvarez et al. 2015). Association analysis can be successfully achieved by a careful selection of the population (Breseghello and Sorrells 2006b). Appropriate techniques for association analysis and the power of statistics to detect marker-phenotype association depend on the genetic germplasm diversity, the LD level in genome-wide association, and population structure level and population relatedness under study (Stich and Melchinger 2010).

Plant populations can be classified along two dimensions: (a) the extent of population structure and (b) familial relatedness (Yu and Buckler 2006). Based on these dimensions, samples fall into the following categories: (i) ideal samples with subtle population structure and familial relatedness, (ii) multi-family samples without population structure, (iii) samples with population structure but without familial relatedness, (iv) samples with both population structure and familial relatedness, and (v) samples with severe population structure and familial relatedness. The category into which a population falls determines the statistical methods to be applied in the association analysis. AM populations can also be categorized by the source of materials (Breseghello and Sorrells 2006b): germplasm bank collections, elite breeding lines, natural populations, or synthetic populations. These sources are expected to vary in the extent of LD, in genotypic and phenotypic diversity, and in the importance of population structure and familial relatedness.

Genotyping

GWAS and candidate gene analysis are the two approaches used in AM (Fig. 1). The choice between them depends on the number of markers available for the association. GWAS tests for associations across most segments of the genome by genotyping a population of individuals at densely distributed marker loci covering all the chromosomes (Rafalski, 2010). In candidate gene association analysis, by contrast, markers are chosen based on their location in the genome, on previous QTL studies, or on the functions of the genes thought to underlie the variation.

Unlinked neutral background markers are selected to give good genome coverage and are used in association studies (Zhu et al. 2008) to characterize the genetic composition of individuals. These markers are highly useful for assigning individuals to populations (Pritchard and Rosenberg 1999). In this way, the limitations of population structure and relatedness are overcome (Yu and Buckler 2006), and inbreeding and kinship are determined (Lynch and Ritland 1999). Molecular markers trace genetic loci that can be scored in a population and may be related to a specific trait or gene of interest (Hayward et al. 2015). Heritable differences within a population arise from mutations such as translocations, inversions or insertions, which can be detected and screened using molecular markers (Hayward et al. 2015). Markers can also be used to establish the identity of individual plants. An ideal marker for MAS lies within about 1 cM of the target locus and is capable of high-throughput, reproducible genotyping (Mohan et al. 1997). AFLP and RAPD markers have poor genomic distribution, poor reproducibility and low polymorphism, which limit their application in MAS (Vos et al. 1995; Williams et al. 1990). They also need special statistical methods if they are to be used for estimating population genetic parameters. On the other hand, SNPs and SSRs are more informative markers and are used to determine the relative kinship matrix and population structure, which makes them more powerful (Zhu et al. 2008). When genetic parameters are calculated using SSRs, size homoplasy, the high mutation rate and allele sizing may be serious challenges, especially if the population is large (Estoup et al. 2002). For SNPs, a suitable genotyping technology must be selected according to the number of markers and the number of individuals to be scored (Syvänen 2005).
The mutation rate per generation of SNPs is low compared with that of SSRs (Li et al. 2002). However, the biallelic nature of SNPs makes them less informative than multiallelic SSRs. Because expected heterozygosity is lower for SNPs, more SNP than SSR background markers are required for a practical assessment of population structure and familial relatedness in most crops. On the other hand, SNPs are widely distributed throughout the genome and are less expensive to score than SSRs. Wessinger et al. (2018) established that the effectiveness of detecting SNPs that explain phenotypic variation depends on genetic factors of the population such as allele frequency, effect size, sampling effects, epistasis and genotype uncertainty. SNPs are the most heritable markers and are suited to fine mapping (Singh et al. 2001). Large sets of polymorphic SNP markers can be screened even in complex polyploid species through large-scale sequencing (Collard and Mackill 2007), and, as such, support genome-wide association studies.
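The difference in information content between biallelic SNPs and multiallelic SSRs can be made concrete with the standard expected heterozygosity, He = 1 - Σ p_i², where p_i are the allele frequencies at a locus. The following is a minimal sketch; the function name and the example frequencies are illustrative assumptions, not values from the cited studies.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity He = 1 - sum(p_i^2) at one locus."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# A biallelic SNP reaches its maximum when both alleles are equally frequent.
snp_he = expected_heterozygosity([0.5, 0.5])

# Hypothetical SSR with five equally frequent alleles.
ssr_he = expected_heterozygosity([0.2] * 5)

print(snp_he, ssr_he)
```

A biallelic locus cannot exceed He = 0.5, whereas the hypothetical five-allele SSR reaches 0.8, which is why more SNPs than SSRs are needed to carry the same amount of information about structure and relatedness.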

Population structure

Population structure refers to differences in allele frequencies that arise from non-random mating within a species (Ersoz et al. 2007), and it is considered a limiting factor in association mapping. It produces false positives (spurious associations), and following up such signals through independent replication studies and through biochemical and molecular analyses is complicated and expensive (Zhu et al. 2008). Approaches such as principal component analysis (PCA), mixed models, structured association (SA) and genomic control (GC), among others, are used to account for familial relatedness and population structure (Price et al. 2006; Yu and Buckler 2006). False positives from population structure can be overcome through explicit approaches (e.g., mixed models and SA) or through ad hoc adjustments (Zhao et al. 2007). Structured association has emerged as a method of choice for handling population structure in most association studies. Here, population substructure is inferred and individuals are assigned to subpopulations using random unlinked markers (Pritchard et al. 2000). Population structure is often estimated using the STRUCTURE software (Pritchard et al. 2000), which uses a Bayesian algorithm to estimate the proportion of each individual’s genome that originated from each inferred population. Individuals are then clustered into groups based on these proportions. STRUCTURE assumes that each population is in Hardy-Weinberg equilibrium and that individuals are unrelated. The program also estimates the degree of admixture of each individual. PCA is also used to estimate population structure, as reported by Price et al. (2006), and is quicker and more effective than STRUCTURE (Zhao et al. 2007).
Generalized linear model (GLM), one of the various structured association models, relates genotypes to phenotypes using subpopulation membership (Q) as covariates in a regression model (Thornsberry et al. 2001). However, this may not control false positives even when used along with the GC model (Yu and Buckler 2006). The unified mixed model (or Q + K model; K = kinship matrix) retains Q as covariates and adds K to account for relatedness among individuals (Yu and Buckler 2006). Studies in Arabidopsis concluded that Q + K is more advantageous than the Q model (Zhao et al. 2007), and it is therefore recommended for most GWAS. Currently, GWAS analysis can be carried out using the Trait Analysis by Association, Evolution and Linkage (TASSEL) software (Bradbury et al. 2007).
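The use of PCA to reveal population structure can be sketched as follows. This is a simplified illustration in plain Python, not the procedure of Price et al. (2006): it only centres each marker and extracts the first principal component by power iteration. The function names, the simulated allele frequencies and the sample sizes are assumptions.

```python
import random


def first_pc_scores(G, iters=200):
    """Scores of individuals on the first principal component.

    G: list of rows (individuals) of 0/1/2 allele counts per marker.
    """
    n, m = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n for j in range(m)]
    X = [[row[j] - means[j] for j in range(m)] for row in G]  # centre markers
    v = [1.0] * m  # starting vector for power iteration
    for _ in range(iters):
        s = [sum(x * w for x, w in zip(row, v)) for row in X]          # X v
        v = [sum(X[i][j] * s[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in X]


def simulate(rng, freq, n_ind, n_markers):
    """Simulate diploid genotypes with the same allele frequency at all markers."""
    return [[sum(rng.random() < freq for _ in range(2)) for _ in range(n_markers)]
            for _ in range(n_ind)]


rng = random.Random(1)
pop_a = simulate(rng, 0.1, 10, 30)  # hypothetical subpopulation A
pop_b = simulate(rng, 0.9, 10, 30)  # hypothetical subpopulation B
pc1 = first_pc_scores(pop_a + pop_b)
print([round(s, 2) for s in pc1])
```

With two strongly differentiated subpopulations, the first component places the two groups at opposite ends of the axis. Scores of this kind are what is supplied as covariates in place of the Q matrix when PCA is used to correct for structure.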

Linkage Disequilibrium (LD) decay

LD is the non-random association of alleles at different loci (Oraguzie et al. 2007), and it determines the resolution of AM studies. Resolution is expected to be very high when LD decays over a short distance, although many markers are then needed. Conversely, mapping resolution is low when LD extends over long distances, but only a few markers are required. In general, high LD means low resolution, and vice versa. Many factors affect LD, including population subdivision, population size, genetic isolation among lineages, recombination rate, mutation and the amount of inbreeding, among others (Mackay and Powell 2007; Gupta et al. 2005). LD is measured as the difference between the observed and expected gametic haplotype frequencies (Soto-Cerda and Cloutier 2012). LD is commonly displayed as a plot of r² (the squared Pearson product-moment correlation coefficient) against the genetic distance between polymorphic sites within a locus or gene or along a chromosome (Bradbury et al. 2007). For pair-wise measurements between markers in decay plots, r² is usually preferred to D because it is less biased (Soto-Cerda and Cloutier 2012). The rate of LD decay over distance must be understood in order to determine the number of markers required for GWAS. Measuring LD in the AM panel therefore comes before deciding the number of markers needed to saturate the genome. LD is used to identify candidate genomic regions related to a specific trait or disease and can offer finer resolution than linkage-based mapping (Mackay and Powell 2007). LD is also used to quantify genetic diversity and to make inferences about the evolutionary history of populations (Zhu et al. 2015; Slatkin 2008).
Genetic drift, population growth, admixture (introduction of genes from one previously distinct population into another) or migration, population structure, natural selection, gene conversion, variable recombination and mutation rate are among the factors that influence linkage disequilibrium (Ardlie et al. 2002). The level of LD in a diverse population also depends on the species’ mode of reproduction (Flint-Garcia et al. 2003). Self-pollinated crops such as wheat have much longer LD distances than cross-pollinated crops such as maize. Where LD is generated by population structure, the sample must be considered carefully to avoid faulty interpretation of the results (Ersoz et al. 2007). Bilton et al. (2018) reported that evolutionary and genetic forces affect LD; its pattern is therefore used to quantify genetic diversity and to make inferences about the evolutionary history of natural populations. In addition, the relationship between map distance and LD level can be used to estimate effective population size (Sved et al. 2013; Waples 2006). Bilton et al. (2018) developed a new likelihood method, Genotyping Uncertainty with Sequencing data - Linkage Disequilibrium (GUS-LD), to calculate pairwise LD from low-coverage sequencing data while accounting for under-called heterozygous genotypes. The authors concluded that GUS-LD gave reliable estimates, whereas LD is underestimated if no adjustment is made for these errors. Many authors have studied traits controlled by one or a few loci with large effects, especially the biochemical basis of essential phenotypes such as abiotic and biotic stress tolerance. These phenotypes have a great impact on crop production, especially in MAS breeding (Foolad and Panthee 2012).
However, variation in complex traits has proven very difficult to understand, as the genetic architecture of these essential traits (especially salt and drought tolerance) involves many loci with small effects that interact with one another and with the environment (Buckler et al. 2009; Collard and Mackill 2007). Combinations of statistical tools are now used to distinguish such small effects. Among them, LD is used to survey genetic variation, with resolution limited by the mapping population rather than by marker density. LD describes the relationship between polymorphisms in a population. Myles et al. (2009) stated that the strength of the relationship between any two markers depends on the distance between them. The rate at which LD decays over distance determines the resolution at which QTL can be mapped. Thus, the first stage in the design of AM studies could be an analysis of the structure of LD.
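The two pairwise LD measures discussed in this section can be computed directly from haplotype data. The sketch below assumes phased haplotypes at two biallelic loci with alleles coded 0/1; the function name and the toy data are illustrative. D is the difference between the observed and expected haplotype frequency, and r² is D² scaled by the allele frequencies at both loci.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (a, b) pairs with alleles coded 0/1
    (phased data is an assumption of this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n      # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n      # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                         # observed minus expected
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Complete LD: allele 1 at locus A always travels with allele 1 at locus B.
d_full, r2_full = ld_stats([(1, 1)] * 5 + [(0, 0)] * 5)

# Linkage equilibrium: all four haplotypes equally frequent.
d_none, r2_none = ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)])

print(d_full, r2_full, d_none, r2_none)
```

Computing r² for many marker pairs and plotting it against the distance between them gives the LD decay plot from which marker density requirements are judged.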

Identification of candidate genes

Candidate genes are genes with known biological functions that directly or indirectly affect the development of a trait, and whose role can be validated by assessing the effects of causative gene variants in association analysis (Zhu and Zhao 2007). The approach has been applied in genetic association studies, biomarker research, gene-disease studies and drug target selection in many organisms, from animals to humans (Tabor et al. 2002). Besides genome scans, candidate gene analysis is also used for positional cloning of QTL regulating the main genetic differences in traits of interest. The causative genes lie within QTL, that is, chromosomal regions that significantly affect genetic variation in the traits under study. Such a QTL region contains several genes within a confidence interval of about 20 cM (Zhu and Zhao 2007). SNPs in LD with the causative polymorphism offer the highest resolution for mapping QTL; for this reason, they are usually chosen as the candidate-gene variants to genotype in AM (Rafalski 2002). Candidate-gene AM requires the identification of SNPs within specific genes and between lines. Consequently, SNP identification in candidate genes relies on resequencing amplicons of specific genes from several genetically diverse individuals drawn from a larger association population (Zhu et al. 2008). Generally, a larger panel of individuals is required to identify rarer SNPs, while fewer individuals are needed to identify common SNPs. Promoter, exon, intron and 5′/3′ untranslated regions are all reasonable targets for SNP identification in a candidate gene, with coding regions showing lower nucleotide diversity than non-coding regions (Zhu et al. 2008). The number of SNPs per unit length required to detect significant associations at a candidate gene locus depends on the rate of LD decay (Flint-Garcia et al. 2003).
Hence, adequate sampling of a candidate gene locus depends on SNP distribution and LD, as well as on the number of amplicons and their length in base pairs. In one study, 713 upland cotton (Gossypium hirsutum L.) accessions from a natural population were evaluated for salt tolerance-related traits (Sun et al. 2018). From the GWAS results, the authors obtained seven genomic regions represented by 23 SNPs. Salt tolerance and survival rate were among the significantly associated traits, and both were associated with two SNP markers on chromosome D09 (i47388Gh and i46598Gh). Two hundred and eighty possible candidate genes were also screened across all loci under salt stress (Sun et al. 2018). Genes such as MYB, NAC, WD40, NXH, CDPK, CIPK and LEA are involved in plant salt tolerance, acting as transporters or participating in numerous enzymatic and transcriptional activities. The candidate gene approach has been criticized because it cannot include all causative genes and because its results show low repeatability (Tabor et al. 2002). The digital candidate gene approach (DigiCGA) has been developed to overcome some of these bottlenecks in the detection of candidate genes (Zhu and Zhao 2007).

Phenotyping

Association mapping requires a relatively large number of diverse accessions, which makes it challenging to collect replicated phenotypic data across environments and years. For heterogeneous fields, careful consideration of QTL x environment interactions, the use of incomplete block designs and appropriate statistical methods enhance mapping power (Eskridge 2003). Variability within and between years, environments and seasons may complicate trait phenotyping through G x E interactions (Atlin et al. 2011). For abiotic stress responses in plants, phenotyping can be further improved under controlled environments (Negin and Moshelion 2017). Nevertheless, reproducing actual field conditions in such environments, particularly for drought, is challenging (Passioura 2012). Phenotyping in association mapping has received less attention than genotyping (Zhu et al. 2008). For large-scale association mapping, obtaining robust phenotypic data remains very difficult. The experimental area should therefore be laid out efficiently, for example with a lattice design (an incomplete block design), because of its potential to increase mapping power (Piepho et al. 2006). In addition, if unbalanced plant breeding trials are used as sources of phenotypic data, appropriate statistical modeling of the experimental design, as well as of genotype x environment and marker x environment interactions, must be taken into consideration (Malosetti et al. 2008); this increases mapping power (Stich et al. 2008). As stated by Cobb et al. (2013), reliable phenotyping based on proper quantitative measurement is needed to dissect genetic differences precisely.
Heritability is usually calculated for each trait to understand the proportion of genetic variance explained by the detected QTL. Several phenomics systems have been established for traits such as biomass, photosynthesis, pigment content and canopy attributes using rapid, GPS-guided measurements (Simko et al. 2016), responses to abiotic stress factors (Cobb et al. 2013), flowering (Guo et al. 2015) and pathogenesis (Mahlein 2016). The relationship between phenotypic variation in field and controlled environments must be examined critically so that it can inform improvements to phenotyping techniques in controlled environments.

Statistical analysis

The most straightforward statistical approach for association analysis of quantitative traits is the analysis of variance (Yu and Buckler 2006). However, to address restrictions of AM studies, especially those arising from population structure, the quantitative transmission disequilibrium test (QTDT) was modified to apply to inbred plant populations (Stich et al. 2006). Genomic control and structured association are now used in both human and plant association studies for population-based samples. In the mixed model, random effects such as multiple background QTL are combined with population membership estimates (the Q matrix) to correct for false associations while, in the same vein, accounting for covariances due to relatedness (Bradbury et al. 2007). Kinship (K), derived from random markers or pedigree, can be used to estimate the average relationship between individuals. The most effective model, however, is the one that combines both Q and K (Yu and Buckler 2006). For diagnosing population structure, PCA is used to study genetic diversity in an association mapping context (Patterson et al. 2006). For structured association analysis, the Q method is implemented in the GLM function of the TASSEL software. The STRUCTURE program and PCA have been used to derive covariates for the model from population membership estimates (Pritchard et al. 2000; Zhao et al. 2007). In structured association, a set of random markers must first be used to estimate the population structure, and the outcome is then used for further analysis (Falush et al. 2003; Pritchard and Rosenberg 1999). Logistic regression has been used for the modified structured association (Thornsberry et al. 2001). Chhatre (2013) reported the use of StrAuto v0.3.1. 
It is Python-based software that automates STRUCTURE analyses on Linux-based computers, and the approach has been utilized for (i) discovery of genetic structure in sample populations for medical purposes (Pritchard and Donnelly 2001); (ii) population structure studies (Randi and Lucchini 2002); and (iii) detection of cryptic genetic structure of natural populations (Caizergues et al. 2003). PCA and Multiple Correspondence Analysis (MCA) are performed in 2D or 3D space to observe the relative distribution of subpopulations (Rahim et al. 2018). They require less computing time than maximum likelihood estimation. Rahim et al. (2018) therefore concluded that PCA and discriminant analysis are the most frequently used analytical techniques in population structure analysis, whereas STRUCTURE is the program frequently used for Bayesian clustering.
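The effect of including population membership as a covariate can be illustrated with a minimal sketch. It is not the TASSEL GLM implementation: hard subpopulation labels stand in for the Q matrix, the data are simulated, and all names and parameter values (allele frequencies of 0.2 and 0.8, a subpopulation shift of 2.0) are assumptions chosen for illustration. Centering both the marker and the phenotype within subpopulations gives the same marker coefficient as a regression that fits subpopulation membership as a fixed covariate.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (an intercept is fitted implicitly)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def center_within_groups(values, groups):
    """Subtract each group's mean, i.e. adjust for group membership."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

random.seed(0)
subpop = [0] * 100 + [1] * 100  # hypothetical hard assignments (stand-in for Q)

# Simulated marker (0/1/2 allele copies) whose frequency differs by subpopulation
marker = [sum(random.random() < (0.2 if g == 0 else 0.8) for _ in range(2))
          for g in subpop]

# Simulated phenotype that depends on subpopulation only; the marker has no true effect
phenotype = [2.0 * g + random.gauss(0.0, 1.0) for g in subpop]

# Without correction, the marker picks up the subpopulation difference
naive_effect = slope(marker, phenotype)

# With subpopulation membership accounted for, the spurious effect largely disappears
adjusted_effect = slope(center_within_groups(marker, subpop),
                        center_within_groups(phenotype, subpop))

print(f"naive effect:    {naive_effect:.3f}")
print(f"adjusted effect: {adjusted_effect:.3f}")
```

In real panels, membership is usually fractional (admixture proportions) or replaced by principal component scores, and a kinship matrix is added as a random effect; the sketch only shows why ignoring structure inflates marker effects.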

Software packages

A number of software and statistical packages have been used in AM studies. These include TASSEL, Statistical Analysis System (SAS), R packages, STRUCTURE, Spatial Pattern Analysis of Genetic Diversity (SPAGeDi), EIGENSTRAT, Multiple Trait Derivative-Free Restricted Maximum Likelihood (MTDFREML), and Residual Maximum Likelihood (ASREML) (Zhu et al. 2008). STRAT, BIMBAM and GenStat 11 have been added more recently (Álvarez et al. 2015). The summary-data-based Mendelian randomization (SMR) and heterogeneity in dependent instruments (HEIDI) tools have been used to test for pleiotropic association between gene expression level and complex traits using expression quantitative trait loci (eQTL) and GWAS data (Zhu et al. 2016). Moreover, these tools can be employed to assess the size of the effect of a SNP on the phenotype that is mediated by gene expression.

Application of association mapping in plant breeding

AM supports breeding practices that capture superior alleles from diverse individuals and introgress them into elite breeding germplasm. The most studied characters are abiotic stresses, quality, yield, and morphological parameters (Table 1). Liu et al. (2018) identified 122 and 134 QTL for yield-related traits and fiber quality in cotton, respectively. Using GWAS, the same authors also identified 139 quantitative trait nucleotides (QTNs) for yield components and 209 QTNs for fiber quality, among which 74 were observed in two environments. Among the 35 common candidate genes observed, four were possibly pleiotropic. Patishtan et al. (2018) used a panel of 306 diverse rice accessions to perform GWAS and identified transcription factors and components of the ubiquitination pathway as important sources of genetic diversity. The RD2, HAT22, PIP2 and PP2C genes were proposed to be potentially significant for drought tolerance in cotton using RNA-seq and were verified through quantitative reverse transcription-polymerase chain reaction (qRT-PCR) (Hou et al. 2018). Resende et al. (2018) carried out regional heritability mapping (RHM) and GWAS for lodging, productivity, and plant architecture across two environments using 188 common bean accessions. The study detected three trait-associated markers using GWAS, whereas 145 markers along chromosomes 5 with eight QTL were identified using RHM. The authors concluded that allelic differences at QTL with large effects could be successfully incorporated into whole-genome prediction models and can easily be traced using marker-assisted selection. 
Salt tolerance loci in rice were also identified using GWAS: Na+/K+ ratios, which showed the major association, were measured at the reproductive stage, and the associated region was found to contain Saltol, the major QTL on chromosome 1 regulating salinity tolerance at the seedling stage (Kumar et al. 2015). Maulana et al. (2018) mapped QTL and identified SNPs associated with seedling heat tolerance in wheat. Their findings revealed some QTL that are effective for heat tolerance from the seedling to the reproductive stage. Interestingly, new QTL responding to seedling heat stress were also found that had not previously been reported at the reproductive stage. Analysis of candidate genes also indicated high sequence similarity of some loci to candidate genes involved in plant responses to stresses such as salt, heat and drought. Su et al. (2018) determined the genetic basis of cotton plant architecture using GWAS, identifying 30 significant associations between five plant architecture traits and 22 SNP markers. Additionally, several plant architecture component traits were concurrently associated with a region on chromosome D03 containing four peak SNPs. Taylor et al. (2018) obtained 37,901 SNP markers in switchgrass and used them for GWAS. A homolog of Arabidopsis pseudo-response regulator 5 on chromosome 8a was associated with heading date across environments and years. The study found that genetic variation associated with floral development influences flowering date and productivity. Du et al. (2018) identified significant quantitative trait SNP markers, about 87, 21 and 16 for fatty acid, oil and protein content, respectively. Protein content was controlled by epistatic effects, which accounted for about 65.18% of the total variation. In addition, 20 QTNs located on 16 chromosomes were found to contribute to six drought tolerance traits. 
Moreover, messenger RNA (mRNA) expression levels of the genes in the target interval were verified, through which the potential loci and genes regulating branch number in Brassica napus were identified (He et al. 2017). Two SNP markers on chromosome D09, i47388Gh and i46598Gh, were found to be associated with salt tolerance level and relative survival rate in cotton, respectively (Sun et al. 2018). Additionally, about 280 candidate genes were screened for differential expression under salt stress, from which the CIPK, NXH, MYB, LEA, WD40 and CDPK genes were found to be responsible for plant salt tolerance. Most of these genes encode transcription factors, transporters or enzymes. SNP markers and QTL were identified that could be used effectively for biofortification and for breeding disease resistance in rice (Descalsota et al. 2018). Breeding material is used directly in genetic studies, such as recurrent selection or multiple-cross pedigree programs. Where greater gains are sought, marker-assisted recurrent selection (MARS) is used, and advanced backcross QTL (AB-QTL) methods have been used for introgression and genetic study for commercial production. Geneticists, breeders and statisticians have used breeding lines and populations to develop models for whole-genome selection that are refined at each consecutive generation, season and phenotyping exercise and that rely on whole-genome haplotypes rather than on the evaluation of individual genes; this approach is known as genomic selection (GS). Because GS couples statistical modeling with high-throughput markers, it has transformed MAS. The GS strategy was proposed in 2001, building on several reports of statistical models (Hayes and Goddard 2001). It has been used to improve the precision of preselection, especially by using genomic information for complex agronomic traits. 
GS uses genotypic and phenotypic data from a training population (TP) to calculate genomic estimated breeding values (GEBVs), allowing accurate selection of individuals in the breeding population that are genotyped but not phenotyped (Jonas and de Koning 2013). The main advantage of GS over other approaches is that all marker effects are estimated directly, so that loci with minor effects on complex characters can be captured across the whole genome (Nakaya and Isobe 2012). Additionally, the rate of annual genetic gain can be significantly enhanced by reducing time and cost and accelerating breeding cycles, because selection relies on an individual’s genotype without requiring phenotypic records (Xu et al. 2017). To assess the performance of a breeding program under genomic selection, prediction accuracy (rMG) is estimated as Pearson’s correlation (r) between the GEBVs of candidate individuals and their true breeding values. The prediction ability of GS is affected by many factors that directly influence the accuracy of GEBVs. These include population structure, marker density, model performance, the relationship between the training and breeding populations, heritability of the target trait, and the sizes of both the TP and the breeding population (BP). rMG also varies with the statistical model used for GS (Endelman 2011; Gianola 2013; Juliana et al. 2017; Ornella et al. 2014; VanRaden 2008).
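The calculation of GEBVs and of prediction accuracy described above can be sketched as follows. The marker effects, genotypes and true breeding values are hypothetical numbers invented for illustration; in practice the effects would be estimated from the TP by a GS model, and true breeding values are known only in simulation (with real data, observed phenotypes from a validation set are commonly used in their place).

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical marker effects, assumed to have been estimated in the training population
marker_effects = [0.5, -0.2, 0.1]

# Hypothetical genotypes of five unphenotyped candidates (0/1/2 allele copies per marker)
candidate_genotypes = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 1],
    [2, 2, 0],
    [0, 0, 2],
]

# GEBV of each candidate: sum over markers of (allele count x estimated effect)
gebv = [sum(effect * count for effect, count in zip(marker_effects, genotype))
        for genotype in candidate_genotypes]

# Hypothetical true breeding values, available here only because the data are made up
true_breeding_values = [0.1, 0.9, 0.5, 0.4, 0.3]

# Prediction accuracy (rMG): correlation between GEBVs and true breeding values
prediction_accuracy = pearson_r(gebv, true_breeding_values)
print(f"prediction accuracy rMG = {prediction_accuracy:.3f}")
```

Because every marker contributes to the sum, loci with small effects are retained rather than filtered out by a significance threshold, which is the advantage of GS noted in the text.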

Table 1 Examples of AM studies in various plant species

Font et al. (2019) used about 94 peach germplasm accessions and identified 347 significant associations between markers and traits, which mapped within intervals containing many candidate genes involved in different pathways. Zhang and Yuan (2019) conducted AM and genomic prediction (GP) analyses using 300 maize inbred lines from different collection zones. They found that 1549 SNPs were significantly associated with 12 trait-environment combinations; the phenotypic variance explained (PVE) by these significant SNPs was about 4.33%, and 541 of them had a PVE value greater than 5%. They observed fewer significant associations and candidate genes, but with higher PVE values, in haplotype-based association mapping than in single SNP-based association mapping. Arab et al. (2019) explored the genomic variation and population structure of Persian walnut and identified loci underlying the variation in kernel and nut-related traits using the new Axiom J. regia 700K SNP genotyping array. They uncovered 55 significant SNPs associated with kernel and nut-related traits. Gao et al. (2019) also identified 17 genes and 4 QTL, correlated with 42 significant SNPs associated with thermotolerance of seed set, by GWAS and linkage mapping, respectively.

Future perspective

Association genetics studies in plants are still developing. Despite a series of improvements intended to bridge gaps in AM studies and enhance their efficacy in crop breeding and genetics, appropriate phenotyping methods, the development of error-free statistical software and access to genotyping remain the major challenges. It must be noted that using AM to dissect QTL in evolutionary population studies requires full information about the organism in order to determine the number of markers needed (Álvarez et al. 2015). The effectiveness of AM studies will be maximized if the recombination history of breeding populations is known for several population types. Additionally, phenotyping for GS and AM remains challenging because of the need to capture the right phenotype and because of differences among materials and breeding programs (Álvarez et al. 2015). Association cannot be found at a single locus, particularly when population structure and morphological characteristics are correlated; however, prediction involving epigenetic interactions and multiple loci can be improved through GS approaches (Jannink et al. 2010). Where phenotypic variability is established within subpopulations, candidate genes are known and marker density is adequate, AM approaches can be implemented successfully. The limitation of low marker density in GWAS can be overcome by increasing the number of markers for all crops, although this depends on the type of marker selected, its representation and distribution relative to genes, and the level of LD. However, a large number of markers is not required in AM studies. In the future, AM approaches should focus on improvements in computational and statistical methods (such as SNP imputation, Bayesian and haplotype methods) and their integration with gene annotation data or functional analysis (Zhang et al. 2014). 
Additionally, advances in crop genome re-sequencing and the expansion of work on mammalian and other model organisms will influence GWAS (Visscher et al. 2017). In general, user-friendly statistical tools and genomics resources need to be improved. When applying AM, factors such as population size, marker density and population structure should all be taken into consideration. For detecting marker-phenotype relationships, the choice of germplasm, the quality of genotypic and phenotypic data, the use of appropriate statistical analysis and the verification of marker-phenotype associations are key. To harness the strengths of both linkage-based QTL mapping and AM, joint linkage association mapping has been proposed. Bayesian regression methods can be used to control the genome-wide error rate (GWER) and are expected to be used more frequently in GWAS, especially where artificial intelligence networks are involved. Markers with rare alleles are often excluded from GWAS analyses, which contributes to missing heritability; rare allele and rare variant analysis will therefore be an important area for enhancing AM studies.
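The exclusion of rare alleles mentioned above usually takes the form of a minor allele frequency (MAF) filter. The following sketch uses hypothetical marker names, made-up genotype calls and an assumed threshold of 0.10 (thresholds vary between studies); it records the excluded markers so that they remain available for a dedicated rare-variant analysis.

```python
def minor_allele_frequency(calls):
    """MAF of one marker from genotype calls coded as 0, 1 or 2 copies of an allele."""
    allele_frequency = sum(calls) / (2 * len(calls))
    return min(allele_frequency, 1.0 - allele_frequency)

# Hypothetical genotype calls for ten diploid accessions at three markers
markers = {
    "marker_a": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # counted allele is rare
    "marker_b": [2, 2, 2, 2, 2, 2, 2, 1, 1, 1],  # counted allele is common
    "marker_c": [0, 1, 2, 1, 1, 0, 2, 1, 2, 0],  # balanced alleles
}

maf_threshold = 0.10  # assumed cut-off, not taken from any study cited in the text

# Markers at or above the threshold go forward to the standard GWAS scan
common_markers = sorted(name for name, calls in markers.items()
                        if minor_allele_frequency(calls) >= maf_threshold)

# Markers below the threshold are set aside rather than discarded
rare_markers = sorted(set(markers) - set(common_markers))

print("tested in GWAS:", common_markers)
print("set aside for rare-variant analysis:", rare_markers)
```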

Conclusion

AM is a tool used in plant breeding and genetics to locate QTL and to identify and monitor essential characters. It offers broad scope to assess and discover the diversity of plant species for modern agricultural production. The limitations of linkage-based mapping include the fact that many loci controlling traits of interest escape detection, in particular loci that do not differ between similar parents. Integrating linkage mapping with AM, however, provides high resolution and allows multiple alleles to be tested in the same experiment. Additionally, PCA and discriminant analysis are suitable for population structure analysis, while STRUCTURE is recommended for Bayesian clustering.