Background

Rice (Oryza sativa) is a fundamental staple crop, essential to humans for its nutritional and caloric value. World rice production has reached a plateau [1], and with conventional breeding techniques alone, rice yield will soon fail to meet the high demand caused by the growing world population. New approaches are therefore needed to secure global nutritional requirements by increasing both the quality and the quantity of rice yield. Genomic Prediction (GP) can help to achieve these goals by accelerating breeding progress [2]. Various studies in plants have shown the effectiveness of GP in increasing breeding speed [3,4,5,6]. The GP framework has been widely used in rice to predict various quantitative traits, with moderate to high predictive performance reported [7]. Complex traits are controlled by numerous loci that are difficult to detect with genetic mapping. GP assumes that quantitative trait loci (QTL) are in linkage disequilibrium (LD) with molecular markers. Thus, instead of detecting all the QTL associated with a trait, an indirect association between markers and trait is exploited.

Conceptually, since the number of genotyped individuals n is typically smaller than the number of molecular markers p, GP faces statistical challenges such as large sampling variance and increased mean-square error. To overcome this limitation, variables must be selected, restrictions must be imposed on the solutions, or both. The main classes of GP methods are genomic relationship-based methods such as Genomic Best Linear Unbiased Prediction (GBLUP, [8]) and SNP effect-based methods such as the Bayesian family [2, 9, 10] and the LASSO [11]. Bayesian models, including the latter, do not necessarily assume variance heterogeneity for marker effects. They perform variable selection and shrinkage of the effects simultaneously using priors other than the Gaussian. BayesC is an example of this category: it assumes as a prior a normal distribution with constant variance, while a fraction of markers has no effect [9]. On the other hand, methods such as GBLUP impose a restriction on the square of the solutions (L2 norm), with marker effects assumed to be normally distributed with equal variance.

Deep Learning (DL) networks are a collection of machine learning algorithms that have exhibited excellent performance in some prediction tasks [12, 13]. DL models are trained to find complex relationships in the data. DL networks consist of multiple layers of interconnected nodes; each layer uses the output of the previous layer as input to optimize the prediction or classification. Numerous DL architectures have been proposed, such as the Multi-Layer Perceptron (MLP), the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN, [14]). DL has been around for decades but only recently started to be widely used, owing to the easy implementation frameworks provided by various online libraries (e.g. https://keras.io/, https://pytorch.org/). The performance of DL networks depends on the choice of hyperparameters, which is not an easy task and requires abundant computational resources [15, 16].

Despite these features, various works have shown that the performance of DL in genomic prediction is comparable to that of linear models [17,18,19,20]. Zingaretti et al. [21] did not find a considerable advantage of DL over linear models, except when epistatic variance was important. Ehret et al. [22] did not observe significant differences between a GBLUP and an MLP model. In a wheat study [19], DL performed better than GBLUP when used to predict phenotypes from genotypes. Similarly, Gianola et al. [23] found that an MLP performed better than a Bayesian linear model in wheat. In another wheat study, Pérez-Rodríguez et al. [24] extensively compared the prediction performance of Radial Basis Function Neural Networks and Bayesian Regularized Neural Networks against linear kernels and semiparametric models such as Reproducing Kernel Hilbert Space (RKHS) regression. The authors concluded that the non-linear models, such as DL, demonstrated a higher prediction ability than the linear models. Evaluating the potential of DL algorithms, Sandhu et al. [25] showed that DL achieved higher prediction ability than linear models in spring wheat. For an extensive review of GP using DL models, see Montesinos-López et al. [26] and Pérez-Enciso and Zingaretti [27].

While genomic prediction of quantitative traits has been broadly studied, binary and ordinal traits are also of great importance in plant breeding. In rice in particular, culm morphology and stay-green traits are important targets for genetic improvement [28,29,30]. However, the prediction ability for different types of traits depends on the statistical models used. Thus, many studies in Genomic Selection (GS) focus on improving existing models or developing new ones, as pointed out by Montesinos-López et al. [31].

Most of the studies in GP rely only on SNPs, disregarding other sources of genomic variation such as structural variants (SVs). Among the different types of SVs, Transposable Insertion Polymorphisms (TIPs) account for an important fraction and play a key role in plant evolution, from domestication to adaptation and breeding [32]. Studies in tomato and in rice found that the use of TIPs can lead to the identification of novel associations not detected by SNPs in genome-wide association studies [33,34,35]. In a recent study, we showed that TIPs explain a sizable fraction of the genetic variance in several rice agronomic traits and significantly improve genomic prediction [36]. However, TIPs are not the only type of structural variation in the genome. Other SVs such as deletions, inversions, and duplications account for an important fraction of genetic variation and have been key in the domestication and diversification of plant crops [37]. SVs have a high potential to generate large effect mutations when they affect genes or gene regulatory regions. Over the last few years, several studies have demonstrated the importance of presence-absence variation and structural variation as a source of gene expression and phenotypic variability in plant crops including rice, tomato, or soybean [38,39,40].

Here we investigate whether merging all the structural and nucleotide genome-wide variation can improve phenotypic prediction in rice compared to using only SNPs. We further explore the performance of DL in GP by (i) using multiple marker input strategies, (ii) proposing several approaches to accommodate large-scale marker information, and (iii) optimizing network architectures. We also provide documented Python code based on TensorFlow 2 [41] at https://github.com/ivourlaki/Deep-Learning-in-rice-for-prediction.

Methods

Rice accessions and traits

In this study we used a subset of 738 accessions from the 3000-rice genome project [42]. Accessions were chosen based on the availability of at least 15× sequencing depth. The 738 accessions are representative of the main rice population groups: Aus/Boro (AUS, N = 75), Indica (IND, N = 451), Japonica (JAP, N = 166) and Aromatic (ARO, N = 17). An additional group of admixed varieties (ADM, N = 29), consisting of accessions that cannot be assigned to a specific rice group, was also used. The SNP-based group assignments from Alexandrov et al. [43] and Sun et al. [44] were used to identify the different subsets of this study. The studied traits are publicly available at the IRRI SNP-Seek database (https://snp-seek.irri.org/). Among continuous traits, grain weight and time to flowering were used, whereas culm diameter (1st internode) and leaf senescence were selected among the available binary and ordinal traits. The ordinal trait leaf senescence (taking the values 1, 3, 5, 7 and 9) was binned into two classes: values equal to 1 were assigned to class 1, while the rest were assigned to class 2. The motivation behind this transformation was to balance the number of observations per class, since several classes had few individuals and rare classes are difficult to predict. Both culm diameter at first internode (originally recorded as 0 or 1) and the transformed leaf senescence values were treated as binary traits. Finally, time to flowering was log-transformed.

Markers

We used the filtered SNP dataset of Vourlaki et al. [36]. Specifically, a binary ped file with the 3K RG Core SNPs dataset for all chromosomes was downloaded from the SNP-Seek database. The original dataset consisted of 404,399 bi-allelic SNPs from 3034 rice accessions, including the 738 accessions selected. We filtered out markers with minor allele frequency ≤ 0.01 and missing rate > 1% using plink2 [45, 46]. Missing genotypes were imputed using Beagle 5.2 with default parameters [47]. After filtering, the final dataset consisted of 228,871 SNPs. The TIP dataset described in Castanera et al. [34] (containing two categories: MITE-DTX and RLX-RIX) was complemented with non-TE deletions (DEL), duplications (DUP) and inversions (INV) downloaded from the SNP-Seek database (3K RG Large Structural Variants release 1.0). We filtered out SV events containing multiple overlapping deletions, as these complex variants are difficult to genotype with short reads and are thus less reliable. SV genotypes were coded as 0/1 (absence/presence). All markers with minor allele frequency ≤ 0.01 were filtered out. SV matrices were subjected to PCA and compared to the SNP-based PCA to evaluate their overall congruence (Supplementary Figs. 1–6). The final dataset used in our analysis consists of 52,120 MITE-DTX, 21,517 RLX-RIX, 74,136 DEL, 25,670 DUP and 7,527 INV markers.

Genetic variance inference

To estimate the additive genetic variance components explained by each marker set, we fitted the following linear model using RKHS [48]:

$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{1}+\mathbf{Z}{\mathbf{a}}_{2}+\mathbf{Z}{\mathbf{a}}_{3}+\mathbf{Z}{\mathbf{a}}_{4}+\mathbf{Z}{\mathbf{a}}_{5}+\mathbf{Z}{\mathbf{a}}_{6}+\mathbf{e}$$
(1)

where \(\mu\) is the general mean, \(y\) is the phenotype vector of size n (the number of accessions), \(Z\) is an identity incidence matrix, \({a}_{1},\dots ,{a}_{6}\) are the additive random effects of each of the marker groups (SNPs, MITE-DTX, RLX-RIX, DEL, DUP and INV), and \(e\) is the residual. Random effects are assumed to be normally distributed, \({a}_{k}\sim N\left(0,{A}_{k}{\sigma }_{k}^{2}\right)\) for \(k=1,\dots ,6\), where \({A}_{1},\dots ,{A}_{6}\) are the linear genomic relationship matrices (GRMs) based on VanRaden [8], obtained with AGHMatrix [49] for each of the six marker sets. The model was implemented in the BGLR package [10] using default priors to estimate \({\sigma }_{1}^{2},\dots ,{\sigma }_{6}^{2}\).
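The GRMs above were computed with the AGHMatrix R package; as a language-consistent illustration, the VanRaden (2008) additive relationship matrix can be sketched from a genotype matrix as follows (a minimal NumPy sketch of the formula, not the package code; the toy data are arbitrary):

```python
import numpy as np

def vanraden_grm(M):
    """VanRaden (2008) additive GRM from an (n x p) genotype matrix
    coded 0/1/2 (or 0/1 for presence/absence markers)."""
    p = M.mean(axis=0) / 2.0              # estimated allele frequencies
    W = M - 2.0 * p                       # centre by twice the frequency
    denom = 2.0 * np.sum(p * (1.0 - p))   # scaling so diagonal averages ~1
    return W @ W.T / denom

# toy example: 4 accessions x 6 markers
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(4, 6)).astype(float)
A = vanraden_grm(M)   # a symmetric 4 x 4 relationship matrix
```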

Genomic prediction models

Bayesian regression models

Two Bayesian methods were employed in this study: Bayesian RKHS and BayesC. RKHS is a kernel-based method with properties similar to GBLUP, based on kernel regression and L2 regularization. BayesC is a variable selection method that estimates the effects of the markers. Both methods were applied to each trait separately. For each method, various models were designed to compare the predictive performance of using all markers together versus using only SNPs. For RKHS, we performed the prediction analysis using all markers at the same time (model 1 above), and additional models were studied as well. In particular, we were interested in whether a single matrix built from the different types of markers could yield higher prediction ability than a model using only SNPs. As a result, the following models were studied:

$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{10k\_variants}+\mathbf{e}$$
(2)
$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{10k\_SNPs}+\mathbf{Z}{\mathbf{d}}_{10k\_SNPs}+\mathbf{Z}{\mathbf{e}}_{10k\_SNPs}+\mathbf{e}$$
(3)
$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{Linked\_SNPs}+\mathbf{Z}{\mathbf{d}}_{Linked\_SNPs}+\mathbf{Z}{\mathbf{e}}_{Linked\_SNPs}+\mathbf{e}$$
(4)

where \(a\sim N\left(0,A{\sigma }_{a}^{2}\right)\), \(d\sim N\left(0,D{\sigma }_{d}^{2}\right)\) and \(e\sim N\left(0,E{\sigma }_{e}^{2}\right)\) are the normally distributed additive, dominant and epistatic random effects of each marker group. The subscripts 10k_variants, 10k_SNPs and Linked_SNPs correspond to GRMs computed from the 10,000 markers most associated with the trait among the six sets, from the 10,000 most associated SNPs, and from the SNPs linked to structural variation, respectively. \(D\) is the dominance GRM based on Vitezica et al. [50], computed with the AGHMatrix package. The epistatic matrix was calculated as the Hadamard product of the additive GRM with itself, i.e. \(A\odot A\) [51]. Dominant and epistatic effects were considered only for models using SNPs exclusively. LD between SNPs and SVs was calculated using the ngsLD software [52], and a SNP was considered linked to an SV when r² ≥ 0.8.
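The Hadamard construction of the epistatic relationship matrix is simply an element-wise product; a minimal NumPy illustration (toy values for \(A\), not a real GRM):

```python
import numpy as np

# Epistatic relationship matrix as the Hadamard (element-wise) product
# of the additive GRM with itself, E = A ⊙ A.
A = np.array([[1.00, 0.25, 0.10],
              [0.25, 1.05, 0.30],
              [0.10, 0.30, 0.98]])
E = A * A   # '*' on NumPy arrays is element-wise, i.e. the Hadamard product
```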

For BayesC, the complete models were:

$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{10k\_variants}\boldsymbol{\beta }+\mathbf{e}$$
(5)
$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{10k\_SNPs}\boldsymbol{\beta }+\mathbf{e}$$
(6)
$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{Linked\_SNPs}\boldsymbol{\beta }+\mathbf{e}$$
(7)

where \({X}_{10k\_variants}\), \({X}_{10k\_SNPs}\) and \({X}_{Linked\_SNPs}\) are the standardized genotypes for the 10,000 most associated markers among all marker sets, the 10,000 most associated SNPs, and the linked SNPs, respectively. The 10,000 most associated markers were selected based on a linear regression-based genome-wide association (GWAS) analysis; a detailed description is given in the "Marker input strategies" section. Finally, \(\beta\) is the vector of effects for the corresponding matrix. BayesC assumes a priori that markers have a non-zero effect with probability \(\pi\) and zero effect with probability \(\left(1-\pi \right)\), with \(\pi\) sampled from a beta distribution, \(\pi \sim Beta\left({p}_{0},{\pi }_{0}\right)\). As described by Pérez and de los Campos [10], the beta prior is parameterized such that the expected value is \(E\left(\pi \right)={\pi }_{0}\), while \({p}_{0}\) can be interpreted as the number of prior counts (prior "successes" plus prior "failures"). Here we chose \({p}_{0}=5\) and a prior probability per feature of \({\pi }_{0}=0.01\). For the binary traits, the option "response_type = ordinal" was applied in both methods (RKHS, BayesC). Finally, BGLR was run for 100,000 iterations using default priors for RKHS, with burn-in and thinning values of 500 and 5, respectively.
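To illustrate this parameterization (our reading of the prior, not BGLR code): a beta distribution with shape parameters \(p_0\pi_0\) and \(p_0(1-\pi_0)\) has mean \(\pi_0\) and total "prior counts" \(p_0\), which can be checked numerically:

```python
import numpy as np

# Beta prior on pi with E(pi) = pi0 and p0 prior counts, i.e. shape
# parameters (p0 * pi0, p0 * (1 - pi0)); values as chosen in this study.
p0, pi0 = 5.0, 0.01
rng = np.random.default_rng(42)
pi_samples = rng.beta(p0 * pi0, p0 * (1.0 - pi0), size=200_000)
# the sample mean should be close to pi0 = 0.01
```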

Using either RKHS or BayesC, the phenotypes to be predicted were removed from the dataset and the model was fitted using the remaining phenotypes. Prediction ability was assessed by computing a metric appropriate to the type of trait: (i) the loss of the model and (ii) binary cross-entropy. A loss metric is used to optimize the performance of the model and indicates how good the prediction is. Here, the loss between the predicted and observed phenotypes of the quantitative traits was computed as the mean squared error (MSE), defined as:

$$MSE=\frac{1}{N}\sum_{i=1}^{N}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}$$
(8)

where \({\widehat{y}}_{i}\) and \({y}_{i}\) are the predicted and observed phenotypes, respectively, for \(N\) individuals. For binary traits, binary cross-entropy [53] was employed to quantify the difference between predicted probabilities and actual binary labels, as follows:

$$-\frac{1}{N}\sum_{i=1}^{N}\left({y}_{i}\,\text{log}\left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\text{log}\left(1-p\left({y}_{i}\right)\right)\right)$$
(9)

where \(N\) is the number of individuals, \({y}_{i}\) is the observed class, \(p\left({y}_{i}\right)\) is the predicted probability that the trait value belongs to class one, and \(1-p\left({y}_{i}\right)\) is the probability of class zero. The logarithm in the formula penalizes the model's inaccurate predictions based on the probabilities assigned to each class.
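Both metrics can be written compactly (a minimal NumPy sketch of Eqs. (8) and (9); the `eps` clipping guard against log(0) is our addition):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed and predicted phenotypes (Eq. 8)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and predicted probabilities (Eq. 9)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(mse([1.0, 2.0], [1.5, 2.5]))   # 0.25
```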

Multilayer perceptron

One of the most popular DL architectures is the Multilayer Perceptron (MLP). An MLP is a fully connected feedforward artificial neural network which transforms an input of any dimension into the desired output dimension. Each neuron is connected to every neuron in the previous layer and to every neuron in the next layer. Each neuron receives the inputs multiplied by their corresponding weight coefficients; the weighted sum of the inputs plus a bias is then passed to an activation function, which introduces non-linearity into the network. The output of each hidden layer can be represented as (note the transposes):

$${Z}_{l}=f\left(X{W}^{\left(l-1\right)T}+{b}^{\left(l-1\right)T}\right)$$
(10)

where \({Z}_{l}\) is the output of layer \(l\), \({b}^{(l-1)T}\) is the bias vector, \(X\) is a single matrix containing all training examples, so that all predictions can be computed with a single matrix multiplication, \({W}^{(l-1)T}\) is the weight matrix and \(f\) is a nonlinear activation function. The layers are applied successively, that is, the output of the neurons in one layer is the input of the next layer. Figure 1 shows the basic workflow of an MLP network.
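Equation (10) can be sketched as a plain NumPy forward pass (an illustration of the layer recursion only, not our trained network; layer sizes and weights are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(X, weights, biases, hidden_act=relu):
    """Forward pass of an MLP as in Eq. (10): each layer computes
    f(Z W^T + b^T), with the previous layer's output as input.
    `weights[l]` has shape (units_l, units_{l-1}); the final layer is
    linear, as used here for quantitative traits."""
    Z = X
    for W, b in zip(weights[:-1], biases[:-1]):
        Z = hidden_act(Z @ W.T + b)
    return Z @ weights[-1].T + biases[-1]   # linear output unit

# toy check: 3 accessions x 4 markers -> 5 hidden units -> 1 output
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(1, 5))]
bs = [np.zeros(5), np.zeros(1)]
y_hat = mlp_forward(X, Ws, bs)   # shape (3, 1)
```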

Fig. 1

Panel A shows the Multilayer perceptron (MLP) representation with markers as input layers; Panel B displays the basic workflow of a perceptron

Convolutional neural networks

Convolutional neural networks (CNNs) can exploit spatial relationships between nearby variables (e.g., pixels) of the input matrix. This architecture can accommodate situations where input variables are distributed along a space and associated with each other, such as the linkage disequilibrium between nearby markers [27]. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, flatten layers and fully connected dense layers. In each convolutional layer, the CNN performs a convolution, a linear operation applied along the input with a predefined width and stride using kernels or filters. The same weights are used for all marker windows: the filter moves along equally sized windows of markers, performing a multiplication operation (dot product) until the entire matrix is traversed. The output of the convolution can be described as an integral transformation [54], as follows:

$$s\left(t\right)=\left(f*k\right)\left(t\right)={\sum }_{x}k\left(t-x\right)f\left(x\right)$$
(11)

where \(k\) represents the kernel and the convolution is the transformation of \(f\) into \(s\left(t\right)\): the kernel is shifted along \(f\) and a weighted sum is computed at each position. An activation function is applied after each convolution to produce the output layer. After the nonlinearity has been applied to the feature map produced by the first layer, a pooling layer usually follows, aiming to reduce the dimensionality and smooth the representation (Fig. 2). The benefit of CNNs is their ability to develop an internal representation of a two-dimensional matrix, extracting the most important features. CNNs leverage the fact that nearby input variables are more strongly related than distant ones.
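A minimal numerical illustration of Eq. (11), using NumPy's discrete convolution (toy values, not actual marker data; `mode="valid"` keeps only fully overlapping windows, as a stride-1 convolutional layer without padding would):

```python
import numpy as np

# Discrete 1D convolution s(t) = sum_x k(t - x) f(x), Eq. (11).
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 1.0])   # toy input signal
k = np.array([1.0, -1.0])                      # toy filter weights
s = np.convolve(f, k, mode="valid")
# with this kernel the output is the adjacent differences of f:
# [1, 1, -1, -1, 1]
```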

Fig. 2

Schematic representation of a Convolutional Neural Network (CNN)

Cross-validation and independent prediction

Here we evaluated prediction accuracy under two challenging validation scenarios relevant to breeding programs: prediction of individuals from two distinct population groups, and prediction of randomly selected individuals using the remaining ones. For the first strategy, we predicted the performance of the admixed (ADM, N = 29) and aromatic (ARO, N = 17) groups using the remaining accessions (IND, JAP and AUS/Boro). Since the accessions to be predicted are not phylogenetically close to the accessions in the training set, a low prediction ability would be expected in this scenario.

In the second strategy, prediction accuracy was evaluated with a 10-fold cross-validation (CV) in which the training population consisted of 90% of the data and the test set of the remaining 10%. The analysis was performed on each of the ten training sets separately, giving ten different validation scenarios. Since accessions are selected at random rather than by population group, samples in the training set might be related to the predicted ones. Note that, for the DL applications, the training population was further split, with 20% of the training data set aside as a validation dataset (Fig. 3).

Fig. 3

A visualization of how the three data sets (training, validation and test) are divided

The validation dataset is used during training to provide an unbiased evaluation of the model fit on the training dataset while tuning model hyperparameters. It is important to note that the model "sees" these data and uses them to evaluate the process, but never "learns" from them. After the model was trained, we retrieved the best hyperparameters and performed prediction on the test dataset. The test dataset provides the gold standard for an unbiased final evaluation of the model; it is used only once, after the model has been completely trained on the training and validation sets.
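One fold of this split can be sketched as follows (a minimal NumPy illustration; the function name is ours, and the fractions and 738 accessions follow this study's design in Fig. 3):

```python
import numpy as np

def split_indices(n, test_frac=0.10, val_frac=0.20, seed=0):
    """Three-way split: hold out `test_frac` of the data for testing,
    then set aside `val_frac` of the remaining training data for
    validation, as described in the text."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(round(len(rest) * val_frac))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(738)   # 738 accessions as in this study
# -> 531 training, 133 validation, 74 test accessions
```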

Marker input strategies

We used different marker input strategies aiming to enhance network flexibility and thus improve prediction ability. Four strategies were designed as follows:

  1. Most associated markers: In the first input strategy, we merged the structural and nucleotide genome-wide variation to test whether prediction accuracy could be improved. However, using all six genotype matrices (SNPs, MITE-DTX, RLX-RIX, DEL, DUP, INV) would add high complexity to the network and might cause overfitting. Studies have shown that using a subset of markers can yield prediction ability equivalent to using the full data sets [17, 36, 55]. Thus, from the total of 409,892 molecular markers we selected the 10,000 most associated with the traits of interest. Specifically, we performed a genome-wide association study (GWAS) fitting a linear model (single-marker regression) to find associations between each of the six marker sets and each of the four traits (4 × 6). For each fitted model, a p-value for each marker was collected, and the 10,000 markers with the smallest p-values were selected. GWAS was performed only on the training sets. Since we followed two different cross-validation strategies, the process was repeated for each of them, that is, for the across-population training set, for each of the ten partition training sets, and for each trait. This strategy (hereafter "COMBINED_VARIANTS") was applied to the DL and Bayesian linear models. Additionally, we selected the 10,000 most associated SNPs (hereafter "SNPs") to perform the same analysis and compare directly with the COMBINED_VARIANTS strategy.

  2. Linked SNPs: Using causal variants associated with the specific traits can result in prediction accuracies close to 1 [56]. Multiple examples of structural variants have been found at the origin of trait variability in plant crops [37, 57]. Nevertheless, SNPs are easier to genotype in populations than SVs (i.e., by genotyping chips). We reasoned that SNPs in LD with SVs could be used as a replacement for SVs and would be easier to use in further experimental or breeding programs. Hence, we used ngsLD to detect SNPs in LD with SVs (r² ≥ 0.8) and used them for prediction. The software was applied to each trait and cross-validation scenario using only the training set. Then, for each case, the unique SNP-SV pairs with r² ≥ 0.8 were selected. The type, position and chromosome of the variants meeting these criteria were collected in a list provided in the GitHub repository. The linked SNP sets were used as marker sets across all analyses. The abbreviation chosen for this input strategy was "LINKED_SNPS". The average number of linked SNPs across the analyses was ~ 12,000.

  3. PCs single matrix: In the third strategy, we exploited the advantages of principal component analysis (PCA) by incorporating it into the neural networks. Studies have shown that using principal components in a DL framework can be particularly advantageous [58]. In our study, PCs were computed from the eigenvectors of each GRM. We ran the analysis by feeding the network a single matrix merging the PCs computed from the six GRMs, introduced as a single input layer. A separate analysis was also performed for SNPs and linked SNPs to test whether this strategy would enhance the performance of the model. The abbreviation chosen for this input strategy was "COMBINED_VARIANTS", with the method labelled MLP_PCs.

  4. Multiple inputs: Here, we tested whether a multiple-input strategy could improve trait prediction. In particular, we used the six PC sets (computed from the GRM of each marker set) as six inputs fed simultaneously to the network in different input layers. Other works have shown that a multiple-input strategy can reduce overfitting and computational cost while exploiting mixed data to improve prediction [59, 60]. Thus, the network accepted six different input layers, each forwarding independently into its own hidden dense layer; the six branches are then merged by a concatenate layer (Fig. 4). To compare this DL strategy directly against the Bayesian linear models, RKHS was run with model Eq. 1 using all the GRMs as inputs at the same time. The abbreviation chosen for this input strategy was "MULTIPLE INPUTS".

Fig. 4

Representation of Multiple inputs strategy employed in the present study
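The multiple-inputs architecture of Fig. 4 can be sketched in Keras as follows (an illustrative skeleton only: the numbers of PCs per set, units and activations are placeholders, not the tuned values from this study):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_multi_input_mlp(pc_dims, hidden_units=8):
    """One input branch per marker set, each with its own dense layer,
    merged by a concatenate layer before the output, as in Fig. 4."""
    inputs, branches = [], []
    for name, dim in pc_dims.items():
        inp = layers.Input(shape=(dim,), name=name)
        branches.append(layers.Dense(hidden_units, activation="relu")(inp))
        inputs.append(inp)
    merged = layers.Concatenate()(branches)
    out = layers.Dense(1, activation="linear")(merged)  # quantitative trait
    model = Model(inputs=inputs, outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model

# hypothetical sizes: 50 PCs retained per marker set
pc_dims = {"SNPs": 50, "MITE_DTX": 50, "RLX_RIX": 50,
           "DEL": 50, "DUP": 50, "INV": 50}
model = build_multi_input_mlp(pc_dims)
```

For the binary traits, the output layer would instead use one sigmoid unit with a binary cross-entropy loss, as described below for the hyperparameter search.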

Optimization of hyperparameters

CNNs, MLP, BayesC and RKHS were run using COMBINED_VARIANTS (SNPs + SVs), SNPs and LINKED_SNPS as separate inputs. Additionally, the MLP network was employed using the six PC sets as a single input matrix, as six different input layers, and using the PCs produced from the SNP GRM alone. All models were applied separately to each trait and to eleven cross-validation scenarios (the 10-fold CV and the ARO/ADM across-population prediction). Supplementary Table 1 shows the different models implemented in our analysis.

For each run, hyperparameter tuning was performed to obtain the best hyperparameters, and the model was then retrained with the hyperparameters found by the search. The Keras Tuner library [61] was used to pick the optimal set. Hyperparameters are the variables that control the training process and the topology of the model. When the model is built for hyperparameter tuning, the search space is defined in addition to the model architecture. A tuner must then be selected to determine which hyperparameter combinations should be tested; in our analysis we used the Hyperband tuner. The Hyperband tuning algorithm uses adaptive resource allocation and early stopping to converge quickly on a high-performing model. The algorithm trains a large number of configurations for a few epochs (an epoch being one complete pass through the entire training dataset) and carries forward only the top-performing half of the models to the next round [62], evaluating performance by computing the MSE (for quantitative traits) or binary cross-entropy (for binary traits) on a held-out validation set. The best model is the one that minimizes the error. After the hyperparameter search finished, we evaluated the model on the test data and performed prediction, computing the aforementioned evaluation metrics on the test dataset. Figure 5 displays the suggested scheme.

Fig. 5

Basic scheme from the Hyperband tuner, determining the best configuration, to the final evaluation of the model

DL performance is controlled by various parameters, and thus the optimization of the hyperparameters is not a trivial step. Here, we designed the tuner search space based on the available literature [21, 25]. The hyperparameters chosen to be optimized were: activation function (ReLU, Tanh, Linear); number of neurons in the first layer of the MLP and number of filters in the CNN (16, 38, 64, 128); number of hidden dense layers, (0, 1, 2, 3) for the MLP and (1, 2, 3) for the CNN; number of neurons in the hidden layers (2, 4, 8, 16); optimizer (Adam, RMSprop, SGD); dropout rate (0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3); and L1 and L2 regularizers with an optimized weight decay parameter (0.001, 0.01, 0.1). For the output layer we used one unit with a linear activation function for the quantitative traits, whereas a sigmoid was used for the binary traits, as suggested in [63]. For the hyperparameter optimization, 80% of the training set was used and the remaining 20% served as the validation dataset for inner testing. Training a DL network that generalizes well to new data is a challenging issue. A model with too little capacity cannot learn from the data, a problem known as underfitting, whereas a model with large capacity can fit the training dataset too well, resulting in overfitting. Several techniques can be applied to a DL network to avoid or reduce these two phenomena.

Here we used two regularization techniques, L1 and L2, with a weight decay parameter. These techniques penalize the weight values of the network, shrinking them toward zero to obtain a more parsimonious model. L1 adds the absolute value of the coefficient magnitudes as a penalty term to the loss function, while L2 adds their squared magnitudes. We added L1 and L2 regularizers to the hidden layers. In addition to the regularization, dropout and early stopping were applied to reduce the effect of overfitting on our models; a dropout layer was applied before the output layer. Our analysis was implemented using the TensorFlow 2.8.0 library with the Keras 2.8.0 interface and Keras Tuner 1.1.2.

Results

Phenotypic structure and genetic inference

We used PCA to determine the underlying structure of our data and the direction of maximum variation when projected into a lower-dimensional space. Figure 6A shows the projections of the variables of each trait onto the principal components. The length of each arrow is proportional to the trait's contribution, whereas the angle between arrows shows whether traits are correlated (pointing in the same direction) or not. An analysis of the first two principal components showed that the first component depends on grain weight, which contributes the most to the total phenotypic variation. The main contributors to the second component are, in descending order, time to flowering, culm diameter and leaf senescence. Our analysis showed that no clear phenotypic structure is observed in the varieties analysed (Fig. 6B).

Fig. 6
figure 6

Panel A displays the PCA loadings of each trait for the first two standardized principal components. The different colours indicate the percentage of contribution to the principal components, displayed as “contrib”. Panel B shows the projections of the accessions
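For reference, loadings of the kind plotted in Fig. 6A can be obtained from an SVD of the column-centred trait matrix. A minimal NumPy sketch on toy data (the trait values below are hypothetical, not the rice phenotypes):

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """PCA via SVD of the column-centred data matrix.

    Returns the loadings of each variable (column of X) on the
    first `n_components` principal components.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T  # rows: variables, columns: PCs

# Toy data: 5 accessions x 3 hypothetical traits
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.7],
              [3.0, 3.5, 0.2],
              [4.0, 3.0, 0.9],
              [5.0, 4.5, 0.4]])
loadings = pca_loadings(X)
```

Each column of `loadings` is a unit vector, so the arrow lengths in a biplot directly reflect the relative contribution of each trait.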

Genetic variance estimates were obtained for each trait (Fig. 7) using model (1). In particular, we estimated the genetic variance explained by each SV marker set in comparison to SNPs, to understand the relative importance of each set in determining the observed phenotype. Figure 7 shows that structural variants explain a significant fraction of the additive genetic variance in the two binary traits, culm diameter and leaf senescence; in the latter, it is even larger than the fraction explained by SNPs. Among the different types of SVs, TE-related variants (MITE-DTX and RLX-RIX) explained more genetic variance than non-TE variants. Among the latter, deletions explained more variance than duplications and inversions.

Fig. 7
figure 7

Means of the posterior distributions of the genetic variances explained by each marker set. Red bars represent SNPs, while blue bars represent the different kinds of SVs

Comparison of model performances

The prediction ability of the DL implementations was compared to that of Bayesian regression using RKHS and BayesC for each trait under eleven cross-validation sets. In particular, we assessed prediction following two different validation strategies: prediction using ten randomly selected training sets produced by a 10-fold cross-validation strategy, and prediction across populations. All the models were applied separately to each of the eleven validation scenarios (see Materials and Methods). Figure 8 shows the performance of each of the models under the 10-fold cross-validation. We used binary cross-entropy as the evaluation metric for binary traits and MSE for quantitative traits. Other metrics can also be applied (e.g. Pearson's correlation for quantitative traits, accuracy for binary traits); however, here we measure the prediction ability in terms of the loss of the model, as previously reported [64]. That is, the closer the model predictions are to the actual values, the lower the loss of the model. The highest prediction ability (minimum loss) for culm diameter was obtained using CNN and linked SNPs (loss = 0.580). For leaf senescence, the best prediction was obtained using MLP with PCs computed from the combined variants (loss = 0.576). For the quantitative traits, Bayesian regression models showed higher prediction accuracy than the DL models in the 10-fold cross-validation strategy. In particular, grain weight and time to flowering were better predicted under the RKHS model using linked SNPs (loss = 0.715) and multiple inputs (loss = 0.327), respectively.

Fig. 8
figure 8

Performance of each of the model-input combinations under the 10-fold cross-validation strategy and four predicted phenotypes. Each model was applied separately to each of the ten partitions. Points represent the evaluation metric for each partition, whereas the boxplot shows the distribution of the numerical values, displaying the data quartiles. The value that appears in bold is the average value for each model. The y-axis shows the loss metric, which for binary traits is the binary cross-entropy and for quantitative traits the MSE
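The two loss metrics used throughout are standard; a minimal sketch in plain Python (illustrative, not the TensorFlow implementation used in the analysis):

```python
import math

def mse(y_true, y_pred):
    # Mean squared error: loss metric for the quantitative traits
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # Binary cross-entropy: loss metric for the binary traits;
    # predicted probabilities are clipped to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Toy examples with hypothetical observed/predicted values
quantitative_loss = mse([1.2, 0.8, 1.5], [1.0, 1.0, 1.4])
binary_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Both losses decrease as the predictions approach the observed values, which is why the minimum loss corresponds to the highest prediction ability.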

In the independent prediction, the phenotypes of all ADM and ARO accessions were predicted given the rest of the accessions. Figure 9 shows the prediction ability of the across-population strategy under eleven different models. In particular, culm diameter and leaf senescence were better predicted under the MLP_PCs model and the “COMBINED_VARIANTS” marker strategy, as in the 10-fold partitions. However, grain weight and time to flowering showed the lowest loss values under CNN with SNPs (loss = 0.963) and MLP with combined variants (loss = 0.44), respectively. In general, time to flowering was better predicted than the rest of the traits in both cross-validation strategies. On average, prediction across populations was less accurate for the quantitative traits than in the 10-fold scenarios, as expected given the more distantly related training and test datasets.

Fig. 9
figure 9

Performance of each of the model-input combinations under the across-population strategy. Points represent the evaluation metric. The y-axis shows the loss metric, which for binary traits is the binary cross-entropy and for quantitative traits the MSE

Impact of marker selection on genomic prediction

Our results showed that phenotypic traits such as leaf senescence and time to flowering were better predicted using combined variants or multiple inputs. Also, using SNPs linked to SVs exhibited efficient prediction ability, especially for culm diameter, under both validation strategies. We observed that incorporating structural variation into a genomic prediction framework, either by combining it with SNPs or by using the SNPs linked to these variants, resulted in improved prediction performance in 87% of the studied cases compared to using only SNPs (Table 1).

Table 1 Minimum prediction loss and corresponding model with input strategy

Discussion

In this study we show that combining structural and nucleotide genome-wide variation for genomic prediction can improve prediction ability for important agronomic traits in rice. Previous studies on plants have explored the association between structural variants and phenotypic traits [65,66,67], and some SVs have been demonstrated to be the causal variants for a diversity of phenotypes across major traits in plants [68, 69]. For example, late or early flowering in wheat depends on an increased copy number of the Vrn-A1 and Ppd-B1 genes, respectively [70], and plant height is associated with a specific tandem duplication [71]. We show that SVs can explain a significant fraction of the genetic variance in binary traits, especially in leaf senescence. This might be due to the presence of causal SVs lacking markers in high or complete linkage disequilibrium in the filtered SNP dataset, as described in [66]. The strong regulatory potential of rice SVs [40, 65, 72] could explain their relationship with trait variability and their high performance in the prediction of phenotypic traits when used alone or in combination with SNPs. However, the incorporation of SVs into breeding programs demands that their genotyping be automatized. This can be a complex task, as SVs are highly diverse and commercial SV genotyping assays do not exist. Our results suggest that the identification of SNPs highly linked to SVs (ideally in complete LD) from high-density SNP datasets could be an effective strategy to indirectly incorporate SVs into genomic prediction in a straightforward manner. We propose that the accurate characterization of SVs in large accession panels (i.e., from pangenome studies) should be a priority to enhance crop breeding, as they constitute a promising resource for genomics-assisted breeding.

A second objective addressed is the evaluation of the performance of DL and Bayesian methods to predict agronomically important traits in rice. Culm diameter, leaf senescence and time to flowering are correlated (Fig. 6), whereas grain weight is uncorrelated with them. Traits such as time to flowering and grain weight are polygenic, controlled by many QTLs of large effect [73,74,75]. Studies of culm diameter have shown that it is controlled by at least twelve QTLs associated with lodging resistance in dry direct-seeded rice [76]. In addition, delayed leaf senescence, or stay-green, is associated with forty-six QTLs that make up the genetic basis of this important trait in rice [77]. Genomic prediction of traits such as time to flowering was quite accurate using both DL and Bayesian models, with the reported loss being the lowest across the whole study (average MSE of 0.327, Table 1). For leaf senescence, the prediction performance was lower than for time to flowering, yet still accurate. It is worth mentioning that, in the ARO/ADM validation strategy, the prediction ability for leaf senescence and time to flowering was improved by DL over the best values of the Bayesian models by 24% and 21%, respectively (Fig. 9). Since the genetic relatedness of the accessions used for training increases prediction accuracy, it is interesting that DL models outperformed the Bayesian ones in both binary and quantitative traits for genetically distant lines (Table 1).

Increasing the prediction accuracy of traits in rice breeding is challenging but at the same time highly important, considering the increasing environmental constraints that limit world production. New methods attempt to improve the prediction of agronomic traits, promising lower computational cost and better results. DL is a state-of-the-art method applied in many different fields, and many recent studies have started to compare DL with standard linear models for genomic prediction [17,18,19,20,21, 25, 31, 64]. Here, we studied the performance of DL models for predicting complex traits in rice, comparing them to Bayesian regression methods under different input strategies and scenarios. Overall, our results showed that DL increased prediction accuracy compared to Bayesian methods in 75% of the implementations (6/8 studied cases, Table 1). Across DL architectures, MLP and CNN were the optimal choices in an equal number of cases, depending on the trait and training population. This observation shows that there is no clear winner, as evidenced by contrasting findings in the literature, where MLP outperforms CNN according to [25], whereas [17, 21] report the opposite trend. Among the Bayesian regression models, RKHS clearly outperformed BayesC.

Another critical and challenging issue in DL models is the optimization of the hyperparameters, mainly due to the high computational cost. The tuning of the hyperparameters for each trait depends on the genetic basis and architecture of the trait. As we show in Supplementary Tables 2–5, different combinations of hyperparameters were selected for the various traits, as the prediction ability is highly associated with the interaction of these factors [17, 20]. We observed that Tanh was the most useful activation function for quantitative traits, being selected in 75% of the cases (6/8), whereas for binary traits the ReLU function was the optimal choice in 63% of the cases (5/8). Moreover, the Adam optimizer was the most frequently chosen for binary traits during hyperparameter tuning (63% of cases), whereas RMSprop was the optimal option for quantitative traits (50% of cases). DL models can capture interactions of high order because of the presence of hidden layers [14, 78]. However, RKHS models are also able to capture complex interaction patterns, and our results demonstrate that both methods can indeed capture complex interactions.

The third aim of this work was to study the impact of various input strategies on the prediction results. It is commonly believed that GP requires a large marker set for efficient prediction. However, our current results and some related works [17, 36] support that GP models can be effective even with a smaller set of markers, although the optimal marker size can be related to the studied trait [25]. We also observed that, in some cases, the best input strategy is affected by the chosen phenotypic trait and the training set (Table 1). Notably, MLP models using PCs as the input strategy proved beneficial in 66.7% of the cases in which MLP was the best model. In any case, the different input strategies that we followed indicated that accommodating subsets of the markers in the GP framework can be as informative as, or even more informative than, using the whole marker set.

Finally, we would like to mention the challenges and limitations of DL models. Firstly, DL models do not provide clear insights into the genetic architecture of the traits, nor do they give information about the effects of specific markers on the studied traits. Different hyperparameters act on different parts of the data, making it hard to interpret the biological significance and importance of each marker in the model [17, 79]. Also, the high computational cost of training is a significant drawback, especially when multiple hyperparameters must be optimized for each trait separately [80]. A main advantage of DL over linear models may be related to the assumptions the latter must fulfil. To be optimal, linear models must satisfy several assumptions, such as linearity (a linear relationship between independent and dependent variables), no multicollinearity (no highly correlated independent variables), homogeneity of variance (the residuals have equal variance) and normality (the residuals follow a normal distribution with a mean of zero). On the other hand, DL allows for non-linearity and can, in principle, identify complex patterns in high-dimensional data. However, DL does not always outperform linear models. The prediction ability depends on the studied traits and can be affected by various factors, such as the genetic architecture of the studied trait, the input strategy, the LD between markers, the model architecture and the hyperparameter tuning. All these factors could explain the vast diversity found in the literature when the performance of DL models is compared to that of other models. Thus, our results support the view that no single algorithm performs best across all species and traits [27]. Nevertheless, even though the advantage of DL networks over linear methods has not been established yet, their incorporation into plant breeding can be important for improving genetic merit for complex traits.

Conclusions

Our study reveals that the use of structural genomic variants can enhance trait prediction in rice, independently of the methodology used, a result likely applicable to other plant crops. Importantly, SNPs linked to SVs can be effectively used to indirectly incorporate SVs into genomic prediction. From a practical point of view, the use of SNPs linked to known SVs can greatly facilitate the incorporation of this previously ignored source of genetic variability into genomic prediction. We show that, under our tested scenarios, deep learning networks perform better than Bayesian models in the prediction of binary traits, although the latter perform better for quantitative traits. Nevertheless, we found that DL generally outperforms Bayesian models in both binary and quantitative traits when the training and target sets are not closely related. This highlights the potential advantage of DL for crop improvement in specific scenarios.