Introduction

Genome-wide association studies (GWAS) and genomic prediction have become accessible to human geneticists and to animal and plant breeders. Since the early 2000s, investigations based on real and simulated data have established GWAS as an effective approach for identifying genes responsible for qualitative and quantitative traits in humans and in animal and plant species (Ingvarsson and Street 2011; Ku et al. 2010), and genomic prediction as an effective procedure for population and hybrid breeding (Daetwyler et al. 2013; de los Campos et al. 2013a; Zhao et al. 2015) and for predicting complex human traits (de los Campos et al. 2013b). The efficacy of identifying candidate genes, or at least putative quantitative trait loci (QTLs), from statistically significant associations between single nucleotide polymorphisms (SNPs) and a complex trait, and the efficacy of genomic prediction of additive or genotypic value, depend strongly on genetic variability, genetic relationship, linkage disequilibrium (LD), sample size, and population structure. Other determinants include the number, density, and genome coverage of SNPs and, for GWAS, the precision of phenotyping. High efficacy can also be attributed to the development of genetic models and the use of complex statistical approaches such as restricted maximum likelihood and Bayesian analyses, including for missing data imputation (Browning and Browning 2007; Endelman 2011; Yu et al. 2006).

Assuming that there is genetic variability, GWAS and genomic prediction rely mainly on the LD between SNPs in dense marker maps and QTLs (Meuwissen et al. 2001), but relationship information is also a key factor in achieving high efficiency. GWAS evolved from single marker association in the context of case-only or case-control designs into a mixed model approach with genomic relationship information (Yu et al. 2006). Prediction of additive values was pioneered by Henderson (1974) in the context of animal breeding. His approach, the best linear unbiased prediction (BLUP), uses the pedigree-based relationship. An important application of BLUP in plant breeding is the prediction of untested single crosses (Bernardo 1994), which evolved from using pedigree-based relationship information to using genomic relationship information (Technow et al. 2014).

How LD and relationship information allow identification of QTLs and genomic prediction can be understood from the studies of Fisher (1918), Haseman and Elston (1972), Weir (2008), Goddard (2009), and Gianola et al. (2009). Regardless of LD, Fisher (1918) proved that there is correlation between relatives, whose value is proportional to twice the coefficient of coancestry. Also regardless of LD, Haseman and Elston (1972) related marker and QTL based on identity by descent (IBD). Regardless of the degree of relationship, Weir (2008) demonstrated marker-trait association for case-only and case-control designs. Goddard (2009) and Gianola et al. (2009) related SNP and QTL variances in populations with LD. Thus, LD and genetic relationship contribute to efficient identification of QTLs and genomic prediction in human and breeding populations. It is interesting to highlight that, due to LD, even if there is no genetic relationship in a population, identity by state (IBS) is expected to contribute to increasing the efficacy of identification of QTLs from GWAS and of genomic prediction. This is because individuals with similar SNP genotypes will share similar QTL genotypes (the same occurs when there is relatedness due to IBD).

The first genomic additive relationship matrix was proposed by VanRaden (2008) (first method). Endelman and Jannink (2012) introduced shrinkage to maximize additive value prediction accuracy. Their scaling for the genomic additive relationship matrix keeps the average diagonal element at 1 + F, where F is the inbreeding coefficient of the current population. Both approaches are identical under high SNP density. The main advantage of the genomic additive relationship matrix (observed or realized) over the pedigree-based additive relationship matrix (expected) is the discrimination between individuals with the same relationship. For example, full-sibs can have zero, one, or two alleles IBD at a QTL locus. Furthermore, although on average genomic and pedigree relationships are consistent, the former provides relationship values that are affected by genetic drift, recombination, mutation, and selection (Wang and Da 2014).
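
As an illustration of how a genomic additive relationship matrix is built from SNP genotypes, the following minimal sketch implements the first method of VanRaden (2008) in plain Python. The genotype coding (0, 1, or 2 copies of a counted allele), the function name `vanraden_g`, and the toy genotypes are assumptions made for this example; allele frequencies are estimated from the sample itself, and no shrinkage is applied.

```python
def vanraden_g(genotypes):
    """Genomic additive relationship matrix, VanRaden (2008), first method.

    genotypes: list of individuals, each a list of SNP genotypes coded as
    the number of copies (0, 1, or 2) of the counted allele (assumed coding).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the counted allele at each SNP, estimated from the sample.
    freqs = [sum(ind[k] for ind in genotypes) / (2 * n) for k in range(m)]
    # Centre each genotype by its expectation, 2p.
    centred = [[ind[k] - 2 * freqs[k] for k in range(m)] for ind in genotypes]
    # Scaling factor: 2 * sum of p(1 - p) over SNPs.
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [
        [sum(a * b for a, b in zip(centred[i], centred[j])) / scale
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: four individuals, four SNPs.
toy = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],  # same SNP genotypes as the first individual
    [2, 1, 0, 1],
    [1, 1, 1, 1],
]
G = vanraden_g(toy)
```

In this toy example the first two individuals, which are identical by state at every SNP, receive the same value on and off the diagonal, whereas the first and third individuals, which carry opposite homozygotes, receive a negative value.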

The importance of relationship information for genomic selection has been extensively investigated, but there is limited knowledge on the impact of relationship information on GWAS, especially in human, animal, and cross-pollinating populations. In the short term, the relationship between individuals in the training and selection populations has a higher effect on additive value prediction accuracy when compared to LD (Clark et al. 2012; Liu et al. 2015; Wientjes et al. 2013). Some genome-wide association studies with inbred line panels have shown a positive effect of relationship information on adherence to the nominal level of significance (Bernardo 2013; Stich and Melchinger 2009). Furthermore, a deeper understanding of the genetic architecture of complex traits can only be achieved by identification of low heritability QTLs, which require larger sample sizes. Our objective was to analyze the relevance of the relationship information on the identification of low heritability QTLs from a GWAS and on the genomic prediction of complex traits (of low heritability) in human, animal, and cross-pollinating populations.

Material and methods

Data sets

The software REALbreeding (available on request) was employed to simulate SNP genotypes and phenotypes of parents and 50 samples of 1000 individuals of seven related populations, derived from a common population with LD (same QTL and SNP allele frequencies). The common population was a second-generation composite derived by crossing two populations in linkage equilibrium. A composite is a Hardy-Weinberg equilibrium population with LD only for linked SNPs and QTLs. The software does not assume a distribution for LD values (nor for gene effects), but computes the true LD values for QTLs based on quantitative genetics theory (Viana 2004) to determine the parametric additive, dominance, and genotypic values. The parametric LD for QTLs i and j in a composite of two populations in linkage equilibrium is \( {\varDelta}_{ij}=\left(\frac{1- 2{\theta}_{ij}}{4}\right)\;\left(\;{p}_i^1-{p}_i^2\right)\;\left(\;{p}_j^1-{p}_j^2\right) \), where θ is the recombination fraction, p is an allele frequency, and the indices 1 and 2 refer to the parental populations. Because SNP and gene positions and frequencies are random, the LD values in the populations are also random, ranging between −0.25 and 0.25. SNP and QTL frequencies follow a beta distribution. To provide general results, we simulated scenarios close to human populations (200 full-sib progeny of five individuals), wild/domesticated animal populations (full-sib progeny), and non-inbred and inbred (one generation of selfing) maize populations (half-sib and S1 progeny = inbred full-sib progeny), each composed of 100 progeny of 10 individuals or 50 progeny of 20 individuals. As a reference population, we included 1000 non-related individuals. Non-inbred and inbred full-sib progeny compose some breeding populations for cross- and self-pollinated crops (in this case with equal allele frequencies).
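
The parametric LD expression above can be evaluated directly. The following minimal sketch is not part of REALbreeding; the function name `composite_ld` and the allele frequencies are hypothetical and serve only to show how the bounds of −0.25 and 0.25 arise.

```python
def composite_ld(theta, p_i1, p_i2, p_j1, p_j2):
    """Parametric LD between loci i and j in a composite of two populations
    in linkage equilibrium: ((1 - 2*theta) / 4) * (p_i1 - p_i2) * (p_j1 - p_j2).

    theta: recombination fraction between the two loci (0 to 0.5).
    p_i1, p_i2: allele frequency at locus i in parental populations 1 and 2.
    p_j1, p_j2: allele frequency at locus j in parental populations 1 and 2.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return ((1 - 2 * theta) / 4) * (p_i1 - p_i2) * (p_j1 - p_j2)


# Completely linked loci, parents fixed for the same or for opposite phases.
ld_max = composite_ld(0.0, 1.0, 0.0, 1.0, 0.0)
ld_min = composite_ld(0.0, 1.0, 0.0, 0.0, 1.0)

# Unlinked loci (theta = 0.5) show no LD, whatever the frequencies.
ld_unlinked = composite_ld(0.5, 0.9, 0.1, 0.8, 0.2)

# Hypothetical intermediate case: theta = 0.05, moderate frequency differences.
ld_example = composite_ld(0.05, 0.8, 0.3, 0.6, 0.2)
```

The extremes are reached only when the loci are completely linked and the parental populations are fixed for alternative alleles at both loci; any recombination or smaller frequency difference shrinks the LD toward zero.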

Based on our input, REALbreeding randomly distributed 10,000 SNPs, 10 QTLs, and 90 minor genes (QTLs of lower effect) in 10 chromosomes (1000 SNPs and 10 genes per chromosome). Four chromosomes had one QTL, three chromosomes had two QTLs, and three chromosomes had no QTLs. The average SNP density was 0.1 cM and each QTL had an SNP within it (same frequency). We simulated a quantitative trait showing dominance (maize grain yield, with minimum and maximum genotypic values for homozygotes of 30 and 140 g per plant, respectively). The minimum and maximum genotypic values for homozygotes are used to compute the deviations a (the difference between the genotypic value of the homozygote of greater expression and the mean of the homozygotes) and d (dominance deviation). The deviation a for a QTL was defined as three to six times greater than the deviation for a minor gene. The dominance deviation is computed from the degree of dominance (d/a). We defined positive dominance (0 < d/a ≤ 1.2) (Bernardo 1994). The true additive and dominance genetic values were computed from the population gene frequencies, LD values, average effects of gene substitution, and dominance deviations. The phenotypic values were computed from the true population mean, additive and dominance values, and from error effects sampled from a normal distribution. The error variance was computed from the broad sense heritability, assumed to be 30%. The narrow sense heritability was 25%, which implies a maximum accuracy of phenotypic selection of 0.50. The QTL heritabilities ranged from 1.1 to 2.9%.
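
The relation between the heritabilities, the error variance, and the maximum accuracy of phenotypic selection can be sketched as follows. The additive and dominance variances below are hypothetical values chosen only so that the broad and narrow sense heritabilities match the 30% and 25% stated above; the sketch also assumes that the genotypic variance is the sum of the additive and dominance variances.

```python
import math


def error_variance(genotypic_var, broad_h2):
    """Error variance implied by a genotypic variance and a broad sense
    heritability H2 = var_G / (var_G + var_E)."""
    return genotypic_var * (1 - broad_h2) / broad_h2


# Hypothetical variance components (not taken from the simulated data).
add_var = 50.0
dom_var = 10.0
broad_h2 = 0.30

gen_var = add_var + dom_var                 # assumes no additive-dominance covariance
err_var = error_variance(gen_var, broad_h2)
phen_var = gen_var + err_var
narrow_h2 = add_var / phen_var

# Accuracy of phenotypic selection: correlation between phenotype and
# additive value, which equals the square root of the narrow sense heritability.
max_accuracy = math.sqrt(narrow_h2)
```

With these inputs the narrow sense heritability is 0.25 and its square root is 0.50, which is the reference value used when expressing genomic prediction accuracies on a relative scale.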

Data availability

The data sets are available at https://doi.org/10.6084/m9.figshare.5417290.v1. The ‘File description.doc’ contains the description of all data files (SNP and QTL positions, SNP genotypes, and phenotypic values).

GWAS

The model employed for analysis was an extension of the mixed model proposed by Yu et al. (2006) that includes the SNP dominance deviation without the population effect:

$$ y=M+{Z}_1\alpha +{Z}_2\delta +{Z}_3u+\varepsilon $$

where y is the vector of phenotypic values, M is the population mean, α is the SNP average effect of an allele substitution, δ is the SNP dominance deviation, u is the vector of additive values, \( {Z}_1 \), \( {Z}_2 \), and \( {Z}_3 \) are incidence matrices, and ε is the error vector. The variance of u is K\( {\sigma}_A^2 \), where K is the additive relationship matrix and \( {\sigma}_A^2 \) is the additive variance. We defined K as either the pedigree-based (expected) additive relationship matrix (A), the genomic (realized or observed) additive relationship matrix (G), or ignored the relationship information defining K as an identity matrix (I). The analyses were performed using the R package GWASpoly (Rosyara et al. 2016). The R package pedigree (Coster 2013) was used to calculate A. To control the type I error we used Benjamini-Hochberg false discovery rate (FDR) of 5 and 1% (Benjamini and Hochberg 1995). A significant association was considered true when the difference between the SNP position and the true QTL (candidate gene) position was less than or equal to 2.5 cM. The significant associations in chromosomes without QTL were used to compute the observed FDR. Because in random cross populations the level of LD between SNP and QTL depends on the allelic frequencies and physical distance, we counted the number of significant associations outside of the QTL interval (5 cM). The mapping precision was expressed as the average difference between QTL and significant SNP positions, within the QTL interval.
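
For readers unfamiliar with the Benjamini-Hochberg procedure, the step-up rule can be sketched in a few lines. This is a generic illustration rather than the GWASpoly implementation; the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the tests declared significant by the
    Benjamini-Hochberg step-up procedure at the given FDR level."""
    m = len(p_values)
    # Indices of the tests ordered from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # All tests ranked at or below the cutoff are declared significant.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten SNPs.
p_vals = [0.039, 0.0005, 0.216, 0.008, 0.041, 0.06, 0.042, 0.074, 0.205, 0.212]
sig_5 = benjamini_hochberg(p_vals, fdr=0.05)
sig_1 = benjamini_hochberg(p_vals, fdr=0.01)
```

In this toy example two SNPs are declared significant at an FDR of 5% and only one at 1%, which illustrates the trade-off between detection power and control of the type I error that the two significance levels represent.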

Genomic prediction

We fitted the additive model in a genomic BLUP (GBLUP) approach given by:

$$ y=M+ Zu+\varepsilon $$

where the terms y, M, u, and ε are as previously defined, Z is an incidence matrix, and V(u) = G\( {\sigma}_A^2 \). We also employed a pedigree-based BLUP prediction, defining V(u) = A\( {\sigma}_A^2 \). The analyses were performed with the R package rrBLUP (Endelman 2011). We computed additive value prediction accuracies for non-inbred and inbred full-sibs, non-inbred half-sibs, and non-inbred and inbred descendants. The training set size ranged from 20 to 80% and from 5 to 40% of the population size for predictions in the same generation (sibs) and in a future generation (descendants), respectively. The additive value prediction accuracy was computed as the correlation between the true additive values computed by REALbreeding and the values predicted by GBLUP or pedigree-based BLUP.
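
The evaluation step described above can be sketched as follows, assuming that individuals are assigned to the training set at random and that the accuracy is the Pearson correlation between true and predicted additive values. The function names and the toy values are hypothetical; the prediction itself (GBLUP or pedigree-based BLUP) is not reproduced here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_training(n_individuals, training_fraction, seed=1):
    """Randomly split individual indices into training and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n_individuals))
    rng.shuffle(indices)
    n_train = round(n_individuals * training_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# A training set of 20% of a population of 1000 individuals.
train, validation = split_training(1000, 0.20)

# Hypothetical true and predicted additive values for five individuals.
true_values = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_values = [0.4, 0.9, 1.1, 2.2, 2.4]
accuracy = pearson(true_values, predicted_values)
```

Because the correlation is invariant to scale, predictions that are shrunk toward the mean, as BLUP predictions are, can still reach a high accuracy.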

Results

GWAS

Regardless of the significance level, population, and degree of relationship, it is not reasonable to ignore the relationship information in a GWAS, since this greatly increases the observed FDR and the number of significant associations outside of the QTL intervals, thus making it impossible to identify candidate genes because of the high number of significant associations along the chromosomes (Table 1). For most populations the observed FDR was much greater than the significance level and the number of significant associations outside of the QTL intervals ranged from approximately six to 2408. By using the relationship information, the number of significant associations outside of the QTL intervals was greatly reduced, ranging from 0.2 to 7.1. The observed FDR ranged from 0.7 to 6.6% and from 2.0 to 6.6% for significance levels of 1 and 5%, respectively. Defining the level of significance at 5% maximized the power of detection for the low heritability QTLs and effectively controlled the type I error rate. Except in three populations, the observed FDR was kept below 5%. In this scenario, the power of detection for low heritability QTLs ranged from approximately 14 to 50%. The QTL detection power was consistently higher using the genomic relationship matrix than using the pedigree-based relationship matrix. The increase gained from using G instead of A ranged from approximately 6 to 32%. As expected, GWAS proved to be a precise approach for identifying QTLs, with an average bias between significant SNPs and a QTL ranging from less than 0.01 to 0.23 cM. Generally, the QTLs were also identified by SNPs in close proximity (up to 2.5 cM), and therefore, the QTL detection power would not be affected if there were no SNPs within the QTLs. The number of significant SNPs inside the QTL intervals ranged from approximately 2.0 to 6.0. Surprisingly, the best general scenario, considering QTL detection power, control of type I error, and number of significant associations outside of the QTL intervals, was unrelated individuals. The QTL detection power reached approximately 55%, the observed FDR was kept below the level of significance of 5%, and there was one significant association outside of the QTL interval on average.
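
The summary statistics reported in this section can be illustrated with a minimal sketch. It assumes that each significant SNP is assigned to the nearest QTL on the same chromosome, that a QTL counts as detected when at least one significant SNP lies within 2.5 cM of it, and that the observed FDR is the share of significant associations located in chromosomes without QTLs. The last assumption, the function name `gwas_metrics`, and the positions are ours and may differ in detail from the computations behind Table 1.

```python
def gwas_metrics(qtls, significant_snps, half_window=2.5):
    """Summarise GWAS hits against known QTL positions.

    qtls, significant_snps: lists of (chromosome, position_cM) tuples.
    half_window: maximum SNP-QTL distance (cM) for a true association.
    """
    detected = set()
    inside_distances = []
    n_outside = 0
    for chrom, pos in significant_snps:
        # Distance to every QTL on the same chromosome, nearest first.
        candidates = sorted(
            (abs(pos - q_pos), q)
            for q, (q_chrom, q_pos) in enumerate(qtls)
            if q_chrom == chrom
        )
        if candidates and candidates[0][0] <= half_window:
            distance, q = candidates[0]
            detected.add(q)
            inside_distances.append(distance)
        else:
            n_outside += 1
    qtl_chroms = {chrom for chrom, _ in qtls}
    n_qtl_free = sum(1 for chrom, _ in significant_snps if chrom not in qtl_chroms)
    n_sig = len(significant_snps)
    return {
        "power_pct": 100 * len(detected) / len(qtls),
        "mean_bias_cM": (sum(inside_distances) / len(inside_distances)
                         if inside_distances else None),
        "n_inside": len(inside_distances),
        "n_outside": n_outside,
        # Assumed definition: hits on QTL-free chromosomes over all hits.
        "observed_fdr_pct": 100 * n_qtl_free / n_sig if n_sig else 0.0,
    }


# Hypothetical positions: four QTLs and five significant SNPs.
qtls = [(1, 20.0), (2, 55.0), (2, 80.0), (4, 10.0)]
hits = [(1, 20.0), (1, 21.5), (2, 54.0), (2, 70.0), (3, 5.0)]
result = gwas_metrics(qtls, hits)
```

In this toy example two of the four QTLs are detected, three significant SNPs fall inside QTL intervals, and two fall outside, only one of which is located in a chromosome without QTLs. This mirrors the remark that associations outside of the QTL intervals are not all false positives.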

Table 1 Average QTL detection power (%), observed FDR (%), mapping precision (bias in the QTL position; cM), and number of SNPs inside and outside of the QTL interval, as a function of the significance level (%), population, average coancestry coefficient (Av. r), coancestry coefficient within the progeny (r), and the genetic relationship matrix (G, A or I)

Genomic prediction

Regardless of the population and training set size, with one exception the genomic prediction (GBLUP) provided higher prediction accuracy of a complex trait than the pedigree-based prediction (BLUP) (Table 2). The average superiority ranged from approximately 1 to 35%. The accuracy of additive value prediction with unrelated individuals is only due to LD. Also regardless of the population and training set size, the efficacy of genomic prediction when there is relatedness between individuals in the training set and the reference population is much higher than the values for unrelated individuals. The average increase ranged from approximately 169 to 260%, inversely proportional to the training set size. This result indicates that relatedness is more important than LD in genomic prediction. Taking into account that the maximum prediction accuracy of additive value, by phenotyping all individuals in the population, is 0.5, it is impressive that, with one exception, the relative accuracy (accuracy/square root of the heritability) of the genomic prediction of related non-phenotyped individuals ranged from 60%, for the lower training set size, to 143%, for the higher training set size. With a small training set size (5 or 10% of the reference population), the estimates of additive value genomic prediction accuracy of descendants from unrelated parents in the training set ranged from approximately 0.20, in most scenarios with non-inbred progeny, to approximately 0.50, in most scenarios with inbred full-sibs (Table 3). Furthermore, the results from two scenarios with non-inbred full-sibs (200 × 5 and 100 × 10) revealed that the prediction accuracy of descendants is similar to the prediction accuracy of sibs assuming the same training set size.
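
The two relative measures used in this section can be written as short functions. The accuracies passed to them below are not values from Table 2: the first two are back-calculated from the relative accuracies of 60 and 143% and the narrow sense heritability of 0.25, and the pair used for the superiority is purely hypothetical.

```python
import math


def relative_accuracy(accuracy, narrow_h2):
    """Prediction accuracy relative to the accuracy of phenotypic selection,
    that is, accuracy divided by the square root of the heritability."""
    return accuracy / math.sqrt(narrow_h2)


def superiority_pct(accuracy_new, accuracy_reference):
    """Percentage increase of one accuracy over a reference accuracy."""
    return 100 * (accuracy_new - accuracy_reference) / accuracy_reference


narrow_h2 = 0.25

# Accuracies implied by relative accuracies of 60% and 143% when h2 = 0.25.
rel_low = relative_accuracy(0.30, narrow_h2)
rel_high = relative_accuracy(0.715, narrow_h2)

# Hypothetical GBLUP and pedigree-based BLUP accuracies.
gain = superiority_pct(0.54, 0.40)
```

A relative accuracy above 1 (100%) means that the genomic prediction of a non-phenotyped individual is more accurate than selection based on its own phenotype would be.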

Table 2 Average additive value prediction accuracy and its standard deviation (in parentheses) in the same generation as a function of the genetic relationship matrix, population, average coancestry coefficient (Av. r), coancestry coefficient within the progeny (r), and training set size (%)
Table 3 Average additive value genomic prediction accuracy of descendants and its standard deviation (in parentheses), for seven populations that differ in the average coancestry coefficient (Av. r) or coancestry coefficient within the progeny (r), and training set size (%)

Discussion

Relatedness and LD are important factors affecting GWAS and genomic prediction efficacy. It is important to emphasize that human geneticists and breeders can identify QTLs (candidate genes) and predict complex traits regardless of LD, based on IBD, and regardless of relatedness, based on LD. As highlighted by Haseman and Elston (1972), a statistically significant association for a marker indicates that there is linkage between the marker locus and one or more QTL loci. The efficacy of QTL detection is proportional to the total genetic variance for the QTLs and inversely proportional to the recombination fractions between the marker locus and the linked QTLs. Regardless of LD, Fisher (1918) proved that human geneticists and breeders can predict complex traits for non-phenotyped individuals based on the phenotypes of their relatives. This is the principle of genetic improvement. For example, assuming a narrow sense heritability of 0.25, the accuracy of predicting the additive value of a non-phenotyped descendant from a single phenotyped parent is approximately 0.12 (half of the heritability). Further, regardless of relatedness, Weir (2008), Goddard (2009), and Gianola et al. (2009) showed that the efficacy of identification of marker-trait association and genomic prediction are proportional to the LD between markers and QTLs.

In GWAS, relationship information does not directly affect QTL detection power but does impact FDR. Liu et al. (2016) stated that false positive associations in a GWAS can be effectively controlled by fitting a population fixed effect and a random polygenic effect, but this adjustment can compromise the QTL detection power. Because in populations SNP and QTL can have a high level of LD regardless of the physical distance, the efficacy of identification of QTLs increases when the significant associations are restricted to a few SNPs in close proximity to the QTLs. Regardless of the population, the inclusion of the relationship information keeps the FDR below the level of significance and greatly decreases the number of significant associations outside of the QTL intervals (not all false-positive associations). Interestingly, sampling non-related individuals can maximize the power of detection of low heritability QTLs, if there is a high level of LD. In this scenario, the FDR is kept under the significance level and the significant associations are for SNPs in close proximity to the QTLs. Because human geneticists and breeders should find a balance between QTL detection power, control of type I error, mapping precision, and minimization of the number of significant associations outside of the QTL intervals, we recommend GWAS based on the genomic relationship matrix. Cheng et al. (2013) agree that ignoring relatedness often results in higher type I error rates. In their study with simulated and real data, when sufficiently dense SNP data were used to estimate relatedness, type I error was efficiently controlled and the QTL detection power increased. Concerning the populations, the power of detection of low heritability QTLs is optimized by maximizing the number of progeny and minimizing the number of members, particularly with non-inbred half- and full-sibs. Guo et al. (2013) argue that family-based association mapping takes advantage of LD between segregating families within crosses and among parents to provide greater power than association mapping and greater resolution than linkage mapping. The study of Guo et al. (2010) showed that the power of detection of a QTL depends on the progeny and member numbers.

Concerning genomic prediction of complex traits, relatedness between individuals in the training set and those to be predicted is key for maximizing the additive value prediction accuracy, regardless of the population. Our results indicate that relatedness affects prediction accuracy more than LD, especially when measured based on the genomic additive relationship matrix. Liu et al. (2015) and Wientjes et al. (2013) showed that the level of relationship between individuals in the training and selection sets has a much higher effect on prediction accuracy than LD. It is also important to highlight that the additive value prediction accuracy is proportional to the average coancestry coefficient in the population (correlation of 0.89 in our study) and that high prediction accuracy can be achieved with even a relatively low level of relationship. In our study, the maximum value for the average coancestry coefficient was approximately 0.01. Concerning the populations, higher additive value prediction accuracy was achieved by minimizing the number of progeny and maximizing the number of members, especially with inbred and non-inbred full-sibs. This implies that human populations, generally composed of full-sib families with few members, are better suited for GWAS than for genomic prediction of complex traits. Using stochastic simulation, Vela-Avitua et al. (2015) compared the additive value prediction accuracy from pedigree-based BLUP, IBS GBLUP (VanRaden’s G), and IBD GBLUP (Fernando and Grossman’s G). Regardless of the trait heritability, IBS GBLUP provided the highest accuracy under high SNP density. For lower SNP density, IBD GBLUP performed better.

In the context of genomic selection, where the main objective is to shorten the selection cycle, maximizing the genetic gain per cycle and minimizing the cost per unit of genetic gain (Meuwissen et al. 2013), a breeder can define the training set size as 40% of the selection population size. This will provide a relative prediction accuracy of additive values of at least 0.80, compared to selection based on measuring all individuals in the population, except for half-sibs. With inbred full-sibs the accuracy can be 10 to 29% greater than the theoretical maximum accuracy. Concerning genomic prediction of complex human traits, geneticists are expected to have a smaller training set size relative to the reference population size. Assuming reduced training set size (5 to 10% of the reference population), the relative prediction accuracy of sibs and descendants computed from genotyped and phenotyped sibs and unrelated parents, respectively, ranged from approximately 0.30–0.60 with half-sibs to 0.80–1.00 for inbred full-sibs (prediction accuracy of sibs computed for training set sizes of 5 and 10% of the reference population based on a quadratic regression model). Based on type-2 diabetes and height data sets, de los Campos et al. (2013b) and Makowsky et al. (2011) observed equivalent prediction accuracy based on pedigree-BLUP, GBLUP, and Bayesian Lasso (0.50), assuming training and testing set sizes of approximately 90 and 10%. The genomic prediction accuracy for unrelated individuals was 0.18 for type-2 diabetes. The maximum expected genomic prediction accuracy was 0.89 for the three data sets. This implies relative accuracies of 0.60 and 0.20 for related and non-related individuals, respectively. In both investigations the authors conclude that whole-genome methods provide a promising approach for the prediction of complex traits.

In conclusion, the relationship information is a key factor for both GWAS and genomic prediction of complex traits: in GWAS, for effective control of the type I error rate and for decreasing the number of significant associations outside of the QTL intervals, and in genomic prediction, for maximizing the prediction accuracy of additive or genotypic value. The identification of low heritability QTLs from GWAS requires greater sample sizes. The estimates of additive value prediction accuracy support previous statements that whole-genome regression methods have potential use in preventive and personalized medicine and for genomic selection.