Introduction

In recent years, the identification of thousands of single nucleotide polymorphisms (SNPs) dispersed across the genome has made it possible to predict breeding values from marker information (genomic breeding values) (Meuwissen et al. 2001). Compared with traditional selection based on phenotypic records, genomic selection increases genetic progress by shortening the generation interval, since animals need not reach a specific age before evaluation (Schrooten et al. 2005). Genomic selection is a form of marker-assisted selection (MAS) in which all genetic markers covering the entire genome are used simultaneously (Meuwissen et al. 2001; Goddard 2009). To this end, the number of markers should be such that each quantitative trait locus (QTL) is in linkage disequilibrium (LD) with at least one marker (Toosi et al. 2010). The accuracy of genomic estimated breeding values (GEBV) is influenced by heritability, marker density, minor allele frequency (MAF), and the genetic architecture of the target trait (De los Campos et al. 2013). In genomic selection, the animals of the training set, which have phenotypic records, are first genotyped with a large number of markers, and the effects of all markers are estimated simultaneously with statistical models. The estimated marker effects are then used to predict the genomic breeding values of individuals without phenotypic records in the validation set (Meuwissen et al. 2001).

An appropriate training set is essential for accurately predicting the breeding values of young individuals without phenotypic records, since it determines how well the marker effects are estimated. Factors such as the number of individuals, the reliability of their phenotypic information, the genetic relationships within the training set, and the relationships between the training and validation individuals play a key role in prediction accuracy (Samuel et al. 2012).

Bayes B assumes zero genetic variance for many marker effects in a large proportion of the analytical cycles, so those effects do not enter the equations; as a result, Bayes A requires more computational power than Bayes B (Meuwissen et al. 2001). According to simulation studies, for traits with a limited number of large-effect QTLs, methods that apply differential shrinkage of estimated effects and variable selection (e.g. Bayes A and Bayes B) are expected to be superior and yield higher accuracy than genomic best linear unbiased prediction (GBLUP). However, the differences between methods shown by simulation studies have not always been confirmed by empirical analyses of real data (De los Campos et al. 2013).

Ghafouri-Kesbi et al. (2017) compared three machine learning algorithms (support vector machines, boosting and random forests) as well as GBLUP for predicting genomic breeding values. GBLUP had better predictive accuracy than the machine learning methods, particularly in scenarios with normal or uniform distributions of QTL effects and higher numbers of QTLs. In scenarios with a small number of QTLs and a gamma distribution of QTL effects, boosting surpassed the other methods.

Some data do not follow a particular statistical distribution (e.g. the normal distribution). In such cases, marker effects cannot be reliably estimated with conventional parametric methods such as frequentist approaches (GBLUP and ridge regression best linear unbiased prediction) or Bayesian methods, and nonparametric methods must be used instead.

The purpose of this study was to compare the accuracy of two parametric methods (Bayesian ridge regression and Bayes A), a semiparametric method (reproducing kernel Hilbert spaces) and two nonparametric methods (random forest and support vector machine) in predicting genomic breeding values for traits with different genetic architectures, in terms of marker density, number of QTLs, heritability and training set size (number of observations), using simulated data.

Materials and methods

Population simulation

The populations were simulated in R using the hypred package (Technow 2013). A base population of 100 individuals (50 males and 50 females) was simulated and randomly mated for 50 generations (the historical population) to create recombination, drift, and linkage disequilibrium between the markers and the QTLs. In the historical population, each pair of parents was assumed to produce two progeny over the 50 generations of random mating, so the effective population size remained constant across the base generations. Progeny chromosomes were obtained by random sampling of each parent's paternal and maternal chromosomes. In the 51st generation, the population size was increased to 1000 or 2000 individuals (depending on the scenario), which formed the training set; these individuals had both genotypic and phenotypic information. Generations 52, 53 and 54 formed the validation set, for which only genomic data were simulated and genomic breeding values were predicted.

Genome simulation

In this study, a genome consisting of four chromosomes, each 100 cM long, was simulated. On each chromosome, 500, 1000 or 2000 markers were placed at equal intervals throughout the genome. Depending on the simulation scenario, 50 or 200 QTLs were randomly distributed over the chromosomes. The markers and QTLs were bi-allelic with an initial allele frequency of 0.5. In the 51st generation, the substitution effect of each QTL was drawn from a standard normal distribution (mean 0 and variance 1), and three levels of heritability (0.1, 0.3 and 0.5) were considered. The whole genetic variance of the trait was attributed to the QTLs, and the true breeding value of each individual was calculated from its genotype as the sum of the QTL effects:

$$ \mathrm{TBV}_i = \sum_{j=1}^{n} x_{ij} b_j , $$

where TBVi is the true breeding value of individual i, n is the number of QTLs affecting the trait, xij is the QTL genotype at position j, and bj is the additive effect of the jth QTL.

The following equation was used to simulate the phenotype:

$$ y_i = \mathrm{TBV}_i + e_i , $$

where yi is the phenotype of individual i and ei is the residual effect.
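The two simulation equations above can be sketched numerically. The paper generated populations in R with hypred; the following NumPy fragment is only an illustrative approximation, with the individual count, QTL number, heritability and 0/1/2 genotype coding chosen as assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

n_ind, n_qtl, h2 = 1000, 50, 0.3               # assumed scenario values
X = rng.integers(0, 3, size=(n_ind, n_qtl))    # QTL genotypes coded 0/1/2
b = rng.standard_normal(n_qtl)                 # additive effects b_j ~ N(0, 1)

tbv = X @ b                                    # TBV_i = sum_j x_ij * b_j
var_e = tbv.var() * (1 - h2) / h2              # residual variance giving target h2
y = tbv + rng.normal(0.0, np.sqrt(var_e), n_ind)  # y_i = TBV_i + e_i
```

Scaling the residual variance from the realized genetic variance is one common way to hit a target heritability in simulation.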

LD estimation

The LD in the training set was measured with the r2 statistic (Hill and Robertson 1968):

$$ r^{2} = \frac{D^{2}}{\mathrm{freq}(A_{1}) \, \mathrm{freq}(A_{2}) \, \mathrm{freq}(B_{1}) \, \mathrm{freq}(B_{2})} , $$

where freq(A1) is the frequency of allele A1 in the population, and likewise for the other alleles. D is the deviation of the observed haplotype frequencies from those expected under linkage equilibrium, estimated as:

$$ D = \mathrm{freq}(A_{1}B_{1}) \, \mathrm{freq}(A_{2}B_{2}) - \mathrm{freq}(A_{1}B_{2}) \, \mathrm{freq}(A_{2}B_{1}) . $$
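The r2 and D formulas above can be computed directly from haplotype counts. A minimal sketch (the 0/1 allele coding and the helper name `ld_r2` are illustrative, not from the paper):

```python
import numpy as np

def ld_r2(hap):
    """r^2 between two biallelic loci; hap is an (n, 2) array of 0/1 alleles."""
    pA1 = np.mean(hap[:, 0] == 0)                         # freq(A1)
    pB1 = np.mean(hap[:, 1] == 0)                         # freq(B1)
    pA1B1 = np.mean((hap[:, 0] == 0) & (hap[:, 1] == 0))  # freq(A1B1)
    # D = freq(A1B1)*freq(A2B2) - freq(A1B2)*freq(A2B1),
    # which simplifies to freq(A1B1) - freq(A1)*freq(B1)
    D = pA1B1 - pA1 * pB1
    denom = pA1 * (1 - pA1) * pB1 * (1 - pB1)
    return D**2 / denom if denom > 0 else 0.0
```

Two loci in complete LD give r2 = 1; loci at linkage equilibrium give r2 = 0.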

Evaluation methods: parametric methods

Bayesian ridge regression (BRR)

In ridge regression (Hoerl and Kennard 1970), marker effects are assumed to be normally distributed, nonzero and small.

Ridge regression is similar to ordinary least squares, with the difference that it remains well defined when the number of effects exceeds the number of observations and is numerically stable when markers are correlated. The estimates are obtained by minimizing the penalized residual sum of squares:

$$ \hat{\beta} = \operatorname*{argmin}_{\beta} \left\{ \sum_{i} \Big[ y_{i} - \sum_{j} x_{ij} \beta_{j} \Big]^{2} + \lambda \sum_{j} \beta_{j}^{2} \right\} , $$

where λ ≥ 0 is a regularization parameter that balances goodness of fit (measured by the residual sum of squares) against model complexity (measured by the sum of squared marker effects). λ is added to the diagonal of the coefficient matrix and shrinks the estimates towards zero; although this increases the bias, it reduces the variance of the estimates. As λ tends to infinity, β tends to zero; when λ is zero, the estimates coincide with those of ordinary least squares.
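The minimization above has the familiar closed-form solution β̂ = (X'X + λI)⁻¹X'y. A brief sketch of the shrinkage behaviour (the function name and toy data are illustrative):

```python
import numpy as np

def ridge_effects(X, y, lam):
    """Closed-form ridge estimates: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

As λ grows, the norm of the estimated effects shrinks towards zero; with λ near zero (and full column rank) the estimates approach ordinary least squares.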

Bayes A

In Bayes A (Meuwissen et al. 2001), the prior assumption is that a large number of loci have minor effects and a small number have major effects, and the conditional distribution of the marker effects is a t distribution. The prior distribution of the variance is a scaled inverse chi-square distribution with ν degrees of freedom and scale parameter s. The posterior distribution combines the prior information with the information in the data and is therefore also a scaled inverse chi-square distribution.

Semiparametric method

Reproducing kernel Hilbert space (RKHS) method

The RKHS method (Gianola et al. 2006) is a semiparametric method for estimating genomic breeding values in which the regression function is a linear combination of basis functions generated by the reproducing kernel (RK). The choice of RK is therefore one of the central elements of model specification. The RK is a function that maps pairs of points in the input space onto the real line and must be positive semidefinite:

$$ k(x_{i}, x_{j}) : (x_{i}, x_{j}) \to \mathbb{R} . $$

The entries of the kernel matrix K are as follows:

$$ K(x_{i}, x_{j}) = \exp \{ -h \times d(x_{i}, x_{j}) \} , $$

where \( d(x_{i}, x_{j}) \) is the squared Euclidean distance between individuals i and j based on their marker genotypes:

$$ d(x_{i}, x_{j}) = \lVert x_{i} - x_{j} \rVert^{2} , $$

h is a bandwidth parameter that controls how fast the covariance (kernel) function decays as the distance between pairs of genotype vectors increases. If h is too small (e.g. 0.001), the kernel is very flat; if h is too large (e.g. 50), the kernel is very narrow, so this parameter plays an important role. In this research, a Gaussian kernel was used, and after optimization the value of h was set to 0.1.
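The Gaussian kernel matrix above can be built in a vectorized way. A small sketch with h = 0.1 as in the paper (the function name and toy genotypes are illustrative):

```python
import numpy as np

def gaussian_kernel(X, h=0.1):
    """K_ij = exp(-h * d(x_i, x_j)), with d the squared Euclidean distance."""
    sq = np.sum(X**2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-h * np.maximum(d, 0.0))           # clip tiny negative round-off
```

The resulting matrix is symmetric with ones on the diagonal, as required of a covariance (kernel) matrix.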

Nonparametric methods

Random forest method

The RF regression was built from an ensemble of decision trees, each fitted to a bootstrap sample of the training data containing genotypic and phenotypic information. The model is trained on the training set and applied to the validation set. At each split of each tree, a random subset of mtry markers is considered, and the marker information is used to partition the animals according to their genotypes at the selected marker. Splitting continues sequentially until the terminal nodes are maximally homogeneous (animals with phenotypic records and similar genotypes for different SNPs accumulate in the same node). The RF prediction, \( f_{\mathrm{rf}}^{B}(x) \), is obtained by averaging the B trees \( \{ T(x, \Psi_{b}) \}_{1}^{B} \) as follows (Hastie et al. 2009):

$$ f_{\mathrm{rf}}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T(x, \Psi_{b}) , $$

where Ψb represents the bth tree in the RF. The most important parameters of the RF are the number of variables sampled at each split (mtry), the number of trees (ntree), and the minimum number of observations in the terminal nodes (nodesize), whose values must be set before the analysis. For continuous data, the suggested number of randomly sampled variables at each split is mtry = p/3, where p is the number of markers. In this study, mtry values of half, equal to, and twice this default were examined. After optimization, the values of mtry, ntree and nodesize were 4000, 1000 and 5, respectively.
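The averaged-trees formula above can be illustrated with a deliberately simplified random forest built from one-split regression trees (stumps). This is a teaching sketch, not the randomForest implementation used in the paper; the function names and settings are assumptions:

```python
import numpy as np

def fit_stump(X, y, feat_idx):
    """Best single-split regression stump over the candidate features."""
    best = (np.inf, None, None, y.mean(), y.mean())
    for j in feat_idx:
        for t in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left = X[:, j] <= t
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if sse < best[0]:
                best = (sse, j, t, y[left].mean(), y[~left].mean())
    return best[1:]                                 # (feature, threshold, means)

def rf_fit_predict(X, y, Xnew, ntree=100, mtry=None, seed=0):
    """f(x) = (1/B) * sum_b T(x, Psi_b): average of stumps on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mtry = mtry or max(1, p // 3)                   # default mtry = p/3 (regression)
    preds = np.zeros(len(Xnew))
    for _ in range(ntree):
        idx = rng.integers(0, n, n)                 # bootstrap sample
        feats = rng.choice(p, mtry, replace=False)  # random feature subset
        j, t, lmean, rmean = fit_stump(X[idx], y[idx], feats)
        if j is None:
            preds += y[idx].mean()
        else:
            preds += np.where(Xnew[:, j] <= t, lmean, rmean)
    return preds / ntree
```

Real RF trees are grown much deeper than a single split, but the bootstrap-plus-random-feature-subset averaging is the same idea.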

Support vector machine (SVM) method

The SVM is a machine learning algorithm that learns from training data to classify observations; its purpose is to identify complex patterns in the data and categorize them. It is well suited to linearly separable two-class problems, where it constructs the maximum-margin hyperplane separating the classes. SVM regression can model the relationship between marker genotypes and phenotypes with a linear or nonlinear function that maps samples from the predictor space to a feature space (Hastie et al. 2009). The statistical model is as follows:

$$ f(x) = b + wx , $$

where b is the intercept and w is the vector of unknown weights.

The function f(x) is obtained by minimizing \( \lambda \sum_{i=1}^{n} L\left( y_{i} - f(x_{i}) \right) + \tfrac{1}{2} \lVert w \rVert^{2} \), where L(·) is a loss function that measures the quality of the estimates and λ is a regularization parameter that balances model fit against complexity. When λ is very large, almost no errors are tolerated in fitting the data, which leads to overfitting and reduces the generalizability of the model. When λ is small, the penalty is mild and more errors are tolerated; with very small λ, underfitting occurs, the model is poorly trained and new data are predicted with a high error rate. ||w|| is inversely related to model complexity, so choosing w to minimize ||w|| reduces the complexity of the model. Several loss functions are used in SVM regression, including the squared loss, the absolute loss and the ε-insensitive loss: (i) the squared loss, \( L\left( y - f(x) \right) = \left( y - f(x) \right)^{2} \), penalizes outliers quadratically, so outliers must be dealt with before the regression analysis; (ii) the absolute loss, \( L\left( y - f(x) \right) = \left| y - f(x) \right| \), grows linearly with the error size, which mitigates the influence of outliers while using all the data; (iii) the ε-insensitive loss function is as follows:

$$ L\left( y - f(x) \right) = \begin{cases} 0 & \text{if } \left| y - f(x) \right| < \varepsilon \\ \left| y - f(x) \right| - \varepsilon & \text{if } \left| y - f(x) \right| \ge \varepsilon \end{cases} $$

where ε determines the number of support vectors (SVs) used in the regression function; increasing ε yields fewer support vectors in the fit. The ε-insensitive loss function ignores errors smaller than ε, and when the error exceeds ε the loss is \( \left| y - f(x) \right| - \varepsilon \). Accordingly, the solution has the form:

$$ \hat{f}(x) = \sum_{i=1}^{n} \left( a_{i} - a_{i}^{*} \right) K(x, x_{i}) + b , $$

where ai and ai* are positive weights assigned to each observation and estimated from the data, and the kernel K(x, xi) forms a positive definite n × n matrix. In this study, ε-regression with a Gaussian kernel function was used, and the cost (penalty) parameter λ was set to 3 after optimization.
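The ε-insensitive loss and the kernel-expansion predictor above can be written out directly. A minimal sketch (function names and values are illustrative, not tied to the e1071 settings used in the paper):

```python
import numpy as np

def eps_insensitive_loss(y, fx, eps=0.1):
    """Zero inside the eps-tube, |y - f(x)| - eps outside it."""
    r = np.abs(y - fx)
    return np.where(r < eps, 0.0, r - eps)

def svr_predict(a, a_star, K, b):
    """f_hat(x) = sum_i (a_i - a_i*) K(x, x_i) + b; rows of K are kernel values."""
    return K @ (a - a_star) + b
```

Observations whose residual stays inside the ε-tube contribute no loss and do not become support vectors.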

The accuracy of the genomic estimated breeding values (GEBV) was obtained as the correlation between the true and predicted breeding values. Because a randomized model was used, each scenario was replicated 10 times. The R package BGLR (Perez and De los Campos 2014) was used to run the ridge regression, Bayes A and RKHS methods; the randomForest package (Liaw 2013) was used for the RF method; and the e1071 package (Meyer et al. 2013) was used for the SVM method. R was also used to investigate the factors affecting the accuracy of genomic breeding values. Statistical methods were compared with Tukey's test at a significance level of 0.05.
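The accuracy measure described above is simply the Pearson correlation between true and predicted breeding values; for example:

```python
import numpy as np

def gebv_accuracy(tbv, gebv):
    """Accuracy of genomic prediction: Pearson correlation of TBV and GEBV."""
    return np.corrcoef(tbv, gebv)[0, 1]
```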

Results and discussion

The values of r2, as a measure of LD, for the different marker densities and QTL numbers are shown in table 1. As shown, r2 increased with marker density, reaching its highest value (0.21) at the density of 2000 markers. In genome-wide association studies and genomic selection, the minimum r2 between markers and QTLs should be 0.2 to track the average effect (Hayes 2007). The results of the analysis of variance of prediction accuracy are presented in table 2. The main factors were marker density, number of QTLs, heritability and training set size, as well as the interactions among these factors. Among these effects, heritability, training set size and statistical method, in that order, had the greatest effects on the accuracy of prediction of genomic breeding values.

Table 1 Values of r2 for different marker densities and QTL numbers.
Table 2 Analysis of variance output for prediction accuracy.

The prediction accuracies of the five methods for the four generations (the first generation is the training set; generations two to four are the validation set) and for the different combinations of marker densities (500, 1000 and 2000), heritability levels (0.1, 0.3 and 0.5) and training set sizes (1000 and 2000) are shown in figures 1–5, respectively.

Figure 1. The accuracy of genomic breeding value prediction with ridge regression, Bayes A, RKHS, random forest and SVM over four generations.

Figure 2. The accuracy of genomic breeding value prediction with ridge regression, Bayes A, RKHS, random forest and SVM at three marker density levels.

Figure 3. The accuracy of genomic breeding value prediction with ridge regression, Bayes A, RKHS, random forest and SVM at two QTL number levels (50 and 200).

Figure 4. The accuracy of genomic breeding value prediction with ridge regression, Bayes A, RKHS, random forest and SVM at three heritability levels (0.1, 0.3 and 0.5).

Figure 5. The accuracy of genomic breeding value prediction with ridge regression, Bayes A, RKHS, random forest and SVM for two training set sizes (1000 and 2000).

As the generation interval between the training and validation sets increased, the accuracy of genomic breeding values decreased significantly (figure 1), mainly because of changes in marker or haplotype structure and the decay of LD between markers and QTLs due to recombination (Hayes et al. 2009). As presented in figure 2, increasing the marker density increased the predictive accuracy of genomic breeding values (P < 0.05). Parametric and semiparametric methods showed higher accuracy than nonparametric methods (P < 0.05). Among the parametric and semiparametric methods, Bayes A showed the highest prediction accuracy, although the difference was not statistically significant (P > 0.05). Among the nonparametric methods, SVM showed higher accuracy than random forest, but the difference was not statistically significant (P > 0.05).

In a simulation study, doubling the number of markers increased the accuracy from 0.63 to 0.73 (Piyasatian and Dekkers 2013).

Gianola et al. (2006), comparing RKHS with multiple linear regression (MLR), reported that when gene action was additive the two methods showed the same accuracy, but when gene action was nonadditive (involving interactions of additive effects) RKHS was clearly superior to the parametric MLR. In a simulation study, the accuracies of Bayes A and Bayes L were similar and higher than that of RKHS (Howard et al. 2014).

In all methods, increasing the number of QTLs from 50 to 200 reduced the accuracy of genomic breeding values, consistent with other studies in this field (Daetwyler et al. 2010). Similarly, Abdollahi-Arpanahi et al. (2013) simulated a trait controlled by 50, 100, 267 and 200 QTLs and observed that prediction accuracy decreased as the number of QTLs increased. This is because, for a fixed amount of genetic variance spread over a larger number of QTLs, the contribution of each QTL to the total genetic value decreases, reducing both the accuracy of the genomic breeding values and the power of the models to estimate the effects. Moreover, as the number of QTLs increases, the number of markers should also increase so that the effects of all QTLs can be captured (Habier et al. 2009); an increase in the number of QTLs can therefore increase the accuracy of genomic breeding values only if the number of markers increases accordingly.

The results of comparing the predictive performance of the different statistical methods at the three heritability levels (0.1, 0.3 and 0.5) are presented in figure 4. In all methods, the accuracy of the estimated breeding values increased significantly with heritability. It has been reported that increasing heritability from 0.1 to 0.9 raised the predictive accuracy of genomic breeding values from 0.3 to 0.7 (Hayes et al. 2010), and that increasing heritability from 0.25 to 1 raised prediction accuracy, depending on the genetic architecture of the trait, from 0.05 to about 1 (Combs and Bernardo 2012). A high heritability indicates that environmental factors play a smaller role than genetic factors in generating variation; reducing the environmental contribution to the phenotype reduces the error variance of the model and consequently increases the predictive accuracy of genomic breeding values (Meuwissen 2013). According to the equation \( r = \sqrt{N_{p} h^{2} \left[ N_{p} h^{2} + M_{e} \right]^{-1}} \) (Daetwyler et al. 2013), the predictive accuracy of the genomic breeding value (r) is directly related to the number of individuals with genotypic and phenotypic information in the training set (Np) and the trait heritability (h2), and inversely related to the number of independent chromosome segments (Me). Consequently, the maximum predictive accuracy is expected for high-heritability traits and large training sets (Hayes et al. 2009).
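The deterministic formula of Daetwyler et al. (2013) cited above is easy to evaluate; a short sketch showing how the expected accuracy responds to training set size and heritability (the example values are illustrative):

```python
import numpy as np

def expected_accuracy(n_p, h2, m_e):
    """r = sqrt(N_p * h2 / (N_p * h2 + M_e)) (Daetwyler et al. 2013)."""
    return np.sqrt(n_p * h2 / (n_p * h2 + m_e))
```

Increasing either the training set size Np or the heritability h2 pushes r towards 1, matching the trends discussed around figures 4 and 5.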

The results of comparing the predictive ability of the different statistical methods at the two training set sizes (1000 and 2000) are presented in figure 5. Increasing the training set size from 1000 to 2000 clearly increased the accuracy in all methods, confirming the direct relationship between the number of observations and predictive accuracy. It has been shown that increasing the number of individuals in the training set from 500 to 1000 and then 2200 increased the estimation accuracy of breeding values in Bayes B from 0.708 to 0.787 and then to 0.848 (Meuwissen et al. 2001). It has also been reported that, at a heritability of 0.2, increasing the number of animals in the training set from 1151 to 3576 linearly increased the predictive accuracy from 0.35 to 0.53 (VanRaden et al. 2009), and that increasing the training set from 200 to 1600 bulls increased the predictive accuracy of genomic breeding values from 0.3 to 0.6 (Hayes et al. 2010). Genomic studies using real data are subject to biases such as genotyping and sampling errors; simulation studies lack these biases, which can lead to differences between the results of simulation and real-data studies.

According to the results, the predictive accuracy of breeding values in genomic selection depends on heritability, marker density, QTL number, number of training individuals (number of observations), and the statistical model used. In models with only additive gene effects, nonparametric methods such as random forest and SVM showed lower accuracy than parametric and semiparametric methods such as RKHS (P < 0.05). The parametric methods also showed higher accuracy than the semiparametric method, although this superiority was not statistically significant (P > 0.05). For genomic evaluation programmes to succeed, markers should be in an acceptable level of LD with the QTLs so that they can capture the QTL effects efficiently in the population. The accuracy of the estimates is directly related to the heritability of the trait: as heritability decreases, the ratio of environmental (residual) variance to genetic variance increases, so the environmental variance distributed among the recorded and genotyped animals increases and the accuracy of the predictions decreases. Although enlarging the training set increases genotyping costs, it improves the accuracy of the estimated allelic effects and thus increases genetic gain. Comparison of these methods for nonadditive models, under different simulation settings as well as with real data, is recommended.