Background

Rice (Oryza sativa) is a fundamental staple crop, essential to humans for its nutritional and caloric value. World rice production has reached a plateau [1], and with conventional breeding techniques alone, rice yield will soon fail to meet the high demand caused by the growing world population. New approaches are therefore needed to secure global nutritional requirements by increasing both the quality and the quantity of rice yield. Genomic Prediction (GP) can help to achieve these goals by accelerating breeding progress [2]. Various studies in plants have shown the effectiveness of GP in increasing breeding speed [3,4,5,6]. The GP framework has been widely used in rice to predict various quantitative traits, with moderate to high predictive performance reported [7]. Complex traits are controlled by numerous loci that are difficult to detect with genetic mapping. GP assumes that quantitative trait loci (QTL) are in linkage disequilibrium (LD) with molecular markers. Thus, instead of detecting all the QTL associated with a trait, an indirect association between markers and trait is exploited.

Conceptually, since the number of genotyped individuals n is typically smaller than the number of molecular markers p, GP faces statistical challenges such as large sampling variance and increased mean-square error. To overcome this limitation, variables must be selected, restrictions must be imposed on the solutions, or both. The main classes of GP methods are genomic relationship-based methods such as Genomic Best Linear Unbiased Prediction (GBLUP, [8]) and SNP effect-based methods such as the Bayesian family [2, 9, 10] and the LASSO [11]. Bayesian models, including the latter, do not necessarily assume variance heterogeneity for marker effects. They perform variable selection and shrinkage of the effects simultaneously using priors other than the Gaussian. BayesC is an example of this category: it assumes as a prior a normal distribution with constant variance, while a fraction of markers has no effect [9]. On the other hand, methods such as GBLUP impose a restriction on the square of the solutions (L2 norm), with marker effects assumed to be normally distributed with equal variance.

Deep Learning (DL) networks are a collection of machine learning algorithms that have exhibited excellent performance in some prediction tasks [12, 13]. DL models are trained to find complex relationships in the data. DL networks consist of multiple layers of interconnected nodes; each layer uses the output of the previous layer as input to optimize the prediction or classification. Numerous DL architectures have been proposed, such as the Multi-Layer Perceptron (MLP), the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN, [14]). DL has been around for decades but only recently started to be widely used, owing to the easy implementation frameworks provided by various online libraries (e.g. https://keras.io/, https://pytorch.org/). The performance of DL networks depends on the choice of hyperparameters, which is not an easy task and requires abundant computational resources [15, 16].

Despite these features, various works have shown that the performance of DL in genomic prediction is comparable to that of linear models [17,18,19,20]. Zingaretti et al. [21] did not find a considerable advantage of DL over linear models, except when epistatic variance was important. Ehret et al. [22] did not observe significant differences between a GBLUP and an MLP model. In a wheat study [19], DL performed better than GBLUP when used to predict phenotypes from genotypes. Similarly, Gianola et al. [23] found that an MLP performed better than a Bayesian linear model in wheat. In another wheat study, Pérez-Rodríguez et al. [24] extensively compared the prediction performance of Radial Basis Function Neural Networks and Bayesian Regularized Neural Networks against linear kernels and semiparametric models such as Reproducing Kernel Hilbert Space (RKHS) regression. The authors concluded that the non-linear models, such as DL, demonstrated a higher prediction ability than the linear models. Evaluating the potential of DL algorithms, Sandhu et al. [25] showed that DL achieved higher prediction ability than linear models in spring wheat. For an extensive review of GP using DL models, see Montesinos-López et al. [26] and Pérez-Enciso and Zingaretti [27].

While genomic prediction of quantitative traits has been broadly studied, binary and ordinal traits are also of great importance in plant breeding. In rice in particular, culm morphology and stay-green traits are important targets for genetic improvement [28,29,30]. However, the prediction ability for different types of traits depends on the statistical models used. Thus, many studies in Genomic Selection (GS) focus on improving existing models or developing new ones, as pointed out by Montesinos-López et al. [31].

Most of the studies in GP rely only on SNPs, disregarding other sources of genomic variation such as structural variants (SVs). Among the different types of SVs, Transposable Insertion Polymorphisms (TIPs) account for an important fraction and play a key role in plant evolution, from domestication to adaptation and breeding [32]. Studies in tomato and in rice found that the use of TIPs can lead to the identification of novel associations not detected by SNPs in genome-wide association studies [33,34,35]. In a recent study, we showed that TIPs explain a sizable fraction of the genetic variance in several rice agronomic traits and significantly improve genomic prediction [36]. However, TIPs are not the only type of structural variation in the genome. Other SVs such as deletions, inversions, and duplications account for an important fraction of genetic variation and have been key in the domestication and diversification of plant crops [37]. SVs have a high potential to generate large effect mutations when they affect genes or gene regulatory regions. Over the last few years, several studies have demonstrated the importance of presence-absence variation and structural variation as a source of gene expression and phenotypic variability in plant crops including rice, tomato, or soybean [38,39,40].

Here we investigate whether merging all the structural and nucleotide genome-wide variation can improve phenotypic prediction in rice compared to using only SNPs. We further explore the performance of DL in GP by (i) using multiple marker input strategies, (ii) proposing several approaches to accommodate large-scale marker information, and (iii) optimizing network architectures. We also provide documented Python code based on TensorFlow 2 [41] at https://github.com/ivourlaki/Deep-Learning-in-rice-for-prediction.

Methods

Rice accessions and traits

In this study we used a subset of 738 accessions from the 3000-rice genome project [42]. Accessions were chosen based on the availability of at least 15× sequencing depth. The 738 accessions are representative of the main rice population groups: Aus/Boro (AUS, N = 75), Indica (IND, N = 451), Japonica (JAP, N = 166) and Aromatic (ARO, N = 17). An additional group of admixed varieties (ADM, N = 29), consisting of accessions that cannot be assigned to a specific rice group, was also used. The SNP-based group assignments from Alexandrov et al. [43] and Sun et al. [44] were used to identify the different subsets of this study. The studied traits are publicly available at the IRRI SNP-Seek database (https://snp-seek.irri.org/). Among continuous traits, grain weight and time to flowering were used, whereas culm diameter (1st internode) and leaf senescence were selected among the available binary and ordinal traits. The ordinal trait leaf senescence (taking the values 1, 3, 5, 7 and 9) was binned into two classes: values equal to 1 were assigned to class 1, while the rest were assigned to class 2. The motivation behind this transformation was to balance the number of observations per class, since several classes had few individuals and rare classes are difficult to predict. Both culm diameter at first internode (originally recorded as 0 or 1) and the transformed leaf senescence values were treated as binary traits. Finally, time to flowering was log-transformed.

Markers

We used the filtered SNP dataset of Vourlaki et al. [36]. Specifically, a binary ped file with the 3K RG Core SNPs dataset for all chromosomes was downloaded from the SNP-Seek database. The original dataset consisted of 404,399 bi-allelic SNPs from 3034 rice accessions, including the 738 accessions selected. We filtered out markers with minor allele frequency ≤ 0.01 and missing rate > 1% using plink2 [45, 46]. Missing genotypes were imputed using Beagle 5.2 with default parameters [47]. After filtering, the final dataset consisted of 228,871 SNPs. The TIP dataset described in Castanera et al. [34] (containing two categories: MITE-DTX and RLX-RIX) was complemented with non-TE deletions (DEL), duplications (DUP) and inversions (INV) downloaded from the SNP-Seek database (3K RG Large Structural Variants release 1.0). We filtered out SV events containing multiple overlapping deletions, as these complex variants are difficult to genotype with short reads and are thus less reliable. SV genotypes were coded as 0/1 (absence/presence). All markers with minor allele frequency ≤ 0.01 were filtered out. SV matrices were subjected to PCA and compared to the SNP-based PCA to evaluate their overall congruence (Supplementary Figs. 1–6). The final dataset used in our analysis consists of 52,120 MITE-DTX, 21,517 RLX-RIX, 74,136 DEL, 25,670 DUP and 7,527 INV markers.

Genetic variance inference

To estimate the additive genetic variance components explained by each marker set, we fitted the following linear model using RKHS [48]:

$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{1}+\mathbf{Z}{\mathbf{a}}_{2}+\mathbf{Z}{\mathbf{a}}_{3}+\mathbf{Z}{\mathbf{a}}_{4}+\mathbf{Z}{\mathbf{a}}_{5}+\mathbf{Z}{\mathbf{a}}_{6}+\mathbf{e}$$
(1)

where \(\mu\) is the general mean, \(y\) is the phenotype vector of size n (the number of accessions), \(Z\) is an identity incidence matrix, \({a}_{1},\dots ,{a}_{6}\) are the additive random effects of each of the marker groups (SNPs, MITE-DTX, RLX-RIX, DEL, DUP and INV), and \(e\) is the residual. Random effects are assumed to be normally distributed, \({a}_{k}\sim N\left(0,{A}_{k}{\sigma }_{k}^{2}\right)\) for \(k=1,\dots ,6\), where \({A}_{1},\dots ,{A}_{6}\) are the linear genomic relationship matrices (GRMs) based on VanRaden [8], obtained with AGHMatrix [49] for each of the six marker sets. The model was implemented in the BGLR package [10] using default priors to estimate \({\sigma }_{1}^{2},\dots ,{\sigma }_{6}^{2}\).
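The GRMs above were computed with the AGHMatrix R package; as a language-consistent illustration, the VanRaden (2008) additive relationship matrix can be sketched from a genotype matrix as follows (a minimal NumPy sketch of the formula, not the package code; the toy data are arbitrary):

```python
import numpy as np

def vanraden_grm(M):
    """VanRaden (2008) additive GRM from an (n x p) genotype matrix
    coded 0/1/2 (or 0/1 for presence/absence markers)."""
    p = M.mean(axis=0) / 2.0              # estimated allele frequencies
    W = M - 2.0 * p                       # centre by twice the frequency
    denom = 2.0 * np.sum(p * (1.0 - p))   # scaling so diagonal averages ~1
    return W @ W.T / denom

# toy example: 4 accessions x 6 markers
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(4, 6)).astype(float)
A = vanraden_grm(M)   # a symmetric 4 x 4 relationship matrix
```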

Genomic prediction models

Bayesian regression models

Two Bayesian methods were employed in this study: Bayesian RKHS and BayesC. RKHS is a kernel-based method with properties similar to GBLUP, based on kernel regression and L2 regularization. BayesC is a variable selection method that estimates the effects of the markers. Both methods were applied to each trait separately. For each method, various models were designed to compare the predictive performance of using all markers together versus using only SNPs. For RKHS, we performed the prediction analysis using all markers at the same time (model 1 above), and additional models were studied as well. In particular, we were interested in whether a single matrix built from the different types of markers could yield higher prediction ability than a model using only SNPs. As a result, the following models were studied:

$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{10k\_variants}+\mathbf{e}$$
(2)
$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{10k\_SNPs}+\mathbf{Z}{\mathbf{d}}_{10k\_SNPs}+\mathbf{Z}{\mathbf{e}}_{10k\_SNPs}+\mathbf{e}$$
(3)
$$\mathbf{y}=\boldsymbol{\mu }+\mathbf{Z}{\mathbf{a}}_{Linked\_SNPs}+\mathbf{Z}{\mathbf{d}}_{Linked\_SNPs}+\mathbf{Z}{\mathbf{e}}_{Linked\_SNPs}+\mathbf{e}$$
(4)

where \(a\sim N\left(0,A{\sigma }_{a}^{2}\right)\), \(d\sim N\left(0,D{\sigma }_{d}^{2}\right)\) and \(e\sim N\left(0,E{\sigma }_{e}^{2}\right)\) are the normally distributed additive, dominant and epistatic random effects of each marker group. The subscripts 10k_variants, 10k_SNPs and Linked_SNPs correspond to GRMs computed from the 10,000 markers most associated with the trait among the six sets, from the 10,000 most associated SNPs, and from the SNPs linked to structural variation, respectively. \(D\) is the dominance GRM based on Vitezica et al. [50], computed with the AGHMatrix package. The epistatic matrix was calculated as the Hadamard product of the additive GRM with itself, i.e. \(A\odot A\) [51]. Dominant and epistatic effects were considered only for models using SNPs exclusively. LD between SNPs and SVs was calculated using the ngsLD software [52], and a SNP was considered linked to an SV when r² ≥ 0.8.
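The Hadamard construction of the epistatic relationship matrix is simply an element-wise product; a minimal NumPy illustration (toy values for \(A\), not a real GRM):

```python
import numpy as np

# Epistatic relationship matrix as the Hadamard (element-wise) product
# of the additive GRM with itself, E = A ⊙ A.
A = np.array([[1.00, 0.25, 0.10],
              [0.25, 1.05, 0.30],
              [0.10, 0.30, 0.98]])
E = A * A   # '*' on NumPy arrays is element-wise, i.e. the Hadamard product
```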

For BayesC, the complete models were:

$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{10k\_variants}\boldsymbol{\beta }+\mathbf{e}$$
(5)
$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{10k\_SNPs}\boldsymbol{\beta }+\mathbf{e}$$
(6)
$$\mathbf{y}=\boldsymbol{\mu }+{\mathbf{X}}_{Linked\_SNPs}\boldsymbol{\beta }+\mathbf{e}$$
(7)

where \({X}_{10k\_variants}\), \({X}_{10k\_SNPs}\) and \({X}_{Linked\_SNPs}\) are the standardized genotypes for the 10,000 most associated markers among all marker sets, the 10,000 most associated SNPs, and the linked SNPs, respectively. The 10,000 most associated markers were selected based on a linear regression-based genome-wide association (GWAS) analysis; a detailed description is given in the "Marker input strategies" section. Finally, \(\beta\) is the vector of effects for the corresponding matrix. BayesC assumes a priori that markers have a non-zero effect with probability \(\pi\) and zero effect with probability \(\left(1-\pi \right)\), with \(\pi\) sampled from a beta distribution, \(\pi \sim Beta\left({p}_{0},{\pi }_{0}\right)\). As described by Pérez and de los Campos [10], the beta prior is parameterized such that the expected value is \(E\left(\pi \right)={\pi }_{0}\), while \({p}_{0}\) can be interpreted as the number of prior counts (prior "successes" plus prior "failures"). Here we chose \({p}_{0}=5\) and a prior probability per feature of \({\pi }_{0}=0.01\). For the binary traits, the option "response_type = ordinal" was applied in both methods (RKHS, BayesC). Finally, BGLR was run for 100,000 iterations using default priors for RKHS, with burn-in and thinning values of 500 and 5, respectively.
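To illustrate this parameterization (our reading of the prior, not BGLR code): a beta distribution with shape parameters \(p_0\pi_0\) and \(p_0(1-\pi_0)\) has mean \(\pi_0\) and total "prior counts" \(p_0\), which can be checked numerically:

```python
import numpy as np

# Beta prior on pi with E(pi) = pi0 and p0 prior counts, i.e. shape
# parameters (p0 * pi0, p0 * (1 - pi0)); values as chosen in this study.
p0, pi0 = 5.0, 0.01
rng = np.random.default_rng(42)
pi_samples = rng.beta(p0 * pi0, p0 * (1.0 - pi0), size=200_000)
# the sample mean should be close to pi0 = 0.01
```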

Using either RKHS or BayesC, the phenotypes to be predicted were removed from the dataset and the model was fitted using the remaining phenotypes. Prediction ability was assessed by computing a metric appropriate to the type of trait: (i) the loss of the model and (ii) binary cross-entropy. A loss metric is used to optimize the performance of the model and indicates how good the prediction is. Here, the loss between the predicted and observed phenotypes of the quantitative traits was computed as the mean squared error (MSE), defined as:

$$MSE=\frac{1}{N}\sum_{i=1}^{N}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}$$
(8)

where \({\widehat{y}}_{i}\) and \({y}_{i}\) are the predicted and observed phenotypes, respectively, for \(N\) individuals. For binary traits, binary cross-entropy [53] was employed to quantify the difference between predicted probabilities and actual binary labels, as follows:

$$-\frac{1}{N}\sum_{i=1}^{N}\left({y}_{i}\,\text{log}\left(p\left({y}_{i}\right)\right)+\left(1-{y}_{i}\right)\text{log}\left(1-p\left({y}_{i}\right)\right)\right)$$
(9)

where \(N\) is the number of individuals, \({y}_{i}\) is the observed class, \(p\left({y}_{i}\right)\) is the predicted probability that the trait value belongs to class one, and \(1-p\left({y}_{i}\right)\) is the probability of class zero. The logarithm in the formula penalizes the model's inaccurate predictions based on the probabilities assigned to each class.
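Both metrics can be written compactly (a minimal NumPy sketch of Eqs. (8) and (9); the `eps` clipping guard against log(0) is our addition):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed and predicted phenotypes (Eq. 8)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy between 0/1 labels and predicted probabilities (Eq. 9)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(mse([1.0, 2.0], [1.5, 2.5]))   # 0.25
```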

Multilayer perceptron

One of the most popular DL architectures is the Multilayer Perceptron (MLP). An MLP is a fully connected feedforward artificial neural network which transforms an input of any dimension into the desired output dimension. Each neuron is connected to every neuron in the previous layer and to every neuron in the next layer. Each neuron receives the inputs multiplied by their corresponding weight coefficients; the weighted sum of the inputs plus a bias is then passed to an activation function, which introduces non-linearity into the network. The output of each hidden layer can be represented as (note the transposes):

$${Z}_{l}=f\left(X{W}^{\left(l-1\right)T}+{b}^{\left(l-1\right)T}\right)$$
(10)

where \({Z}_{l}\) is the output of layer \(l\), \({b}^{(l-1)T}\) is the bias vector, \(X\) is a single matrix containing all training examples, so that all predictions can be computed with a single matrix multiplication, \({W}^{(l-1)T}\) is the weight matrix and \(f\) is a nonlinear activation function. The layers are applied successively, that is, the output of the neurons in one layer is the input of the next layer. Figure 1 shows the basic workflow of an MLP network.
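Equation (10) can be sketched as a plain NumPy forward pass (an illustration of the layer recursion only, not our trained network; layer sizes and weights are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(X, weights, biases, hidden_act=relu):
    """Forward pass of an MLP as in Eq. (10): each layer computes
    f(Z W^T + b^T), with the previous layer's output as input.
    `weights[l]` has shape (units_l, units_{l-1}); the final layer is
    linear, as used here for quantitative traits."""
    Z = X
    for W, b in zip(weights[:-1], biases[:-1]):
        Z = hidden_act(Z @ W.T + b)
    return Z @ weights[-1].T + biases[-1]   # linear output unit

# toy check: 3 accessions x 4 markers -> 5 hidden units -> 1 output
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(1, 5))]
bs = [np.zeros(5), np.zeros(1)]
y_hat = mlp_forward(X, Ws, bs)   # shape (3, 1)
```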

Fig. 1

Panel A shows the Multilayer perceptron (MLP) representation with markers as input layers; Panel B displays the basic workflow of a perceptron

Convolutional neural networks

Convolutional neural networks (CNNs) can exploit spatial relationships between nearby variables (e.g., pixels) of the input matrix. This architecture can accommodate situations where input variables are distributed along a space and associated with each other, such as the linkage disequilibrium between nearby markers [27]. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, flatten layers and fully connected dense layers. In each convolutional layer, the CNN performs a convolution, a linear operation applied along the input with a predefined width and stride using kernels or filters. The same weights are used for all marker windows: the filter moves along equally sized windows of markers, performing a multiplication operation (dot product) until the entire matrix is traversed. The output of the convolution can be described as an integral transformation [54], as follows:

$$s\left(t\right)=\left(f*k\right)\left(t\right)={\sum }_{x}k\left(t-x\right)f\left(x\right)$$
(11)

where \(k\) represents the kernel and the convolution is the transformation of \(f\) into \(s\left(t\right)\): the kernel is shifted along \(f\) and a weighted sum is computed at each position. An activation function is applied after each convolution to produce the output layer. After the nonlinearity has been applied to the feature map produced by the first layer, a pooling layer usually follows, aiming to reduce the dimensionality and smooth the representation (Fig. 2). The benefit of CNNs is their ability to develop an internal representation of a two-dimensional matrix, extracting the most important features. CNNs leverage the fact that nearby input variables are more strongly related than distant ones.
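A minimal numerical illustration of Eq. (11), using NumPy's discrete convolution (toy values, not actual marker data; `mode="valid"` keeps only fully overlapping windows, as a stride-1 convolutional layer without padding would):

```python
import numpy as np

# Discrete 1D convolution s(t) = sum_x k(t - x) f(x), Eq. (11).
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 1.0])   # toy input signal
k = np.array([1.0, -1.0])                      # toy filter weights
s = np.convolve(f, k, mode="valid")
# with this kernel the output is the adjacent differences of f:
# [1, 1, -1, -1, 1]
```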

Fig. 2

Schematic representation of a Convolutional Neural Network (CNN)

Cross-validation and independent prediction

Here we evaluated prediction accuracy under two challenging validation scenarios relevant to breeding programs: prediction of individuals from two distinct population groups, and prediction of randomly selected individuals using the remaining ones. For the first strategy, we predicted the performance of the admixed (ADM, N = 29) and aromatic (ARO, N = 17) groups using the remaining accessions (IND, JAP and AUS/Boro). Since the accessions to be predicted are not phylogenetically close to the accessions in the training set, a low prediction ability would be expected in this scenario.

In the second strategy, prediction accuracy was evaluated with a 10-fold cross-validation (CV) in which the training population consisted of 90% of the data and the test set of the remaining 10%. The analysis was performed on each of the ten training sets separately, giving ten different validation scenarios. Since accessions are selected at random rather than by population group, samples in the training set might be related to the predicted ones. Note that, for the DL applications, the training population was further split, with 20% of the training data set aside as a validation dataset (Fig. 3).

Fig. 3

A visualization of how the three data sets (training, validation and test) are divided

The validation dataset is used during training to provide an unbiased evaluation of the model fit on the training dataset while tuning model hyperparameters. It is important to note that the model "sees" these data and uses them to evaluate the process, but never "learns" from them. After the model was trained, we retrieved the best hyperparameters and performed prediction on the test dataset. The test dataset provides the gold standard for an unbiased final evaluation of the model; it is used only once, after the model has been completely trained on the training and validation sets.
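One fold of this split can be sketched as follows (a minimal NumPy illustration; the function name is ours, and the fractions and 738 accessions follow this study's design in Fig. 3):

```python
import numpy as np

def split_indices(n, test_frac=0.10, val_frac=0.20, seed=0):
    """Three-way split: hold out `test_frac` of the data for testing,
    then set aside `val_frac` of the remaining training data for
    validation, as described in the text."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(round(len(rest) * val_frac))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(738)   # 738 accessions as in this study
# -> 531 training, 133 validation, 74 test accessions
```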

Marker input strategies

We used different marker input strategies aiming to enhance network flexibility and thus improve prediction ability. Four strategies were designed as follows:

  1. Most associated markers: In the first input strategy, we merged the structural and nucleotide genome-wide variation to test whether prediction accuracy could be improved. However, using all six genotype matrices (SNPs, MITE-DTX, RLX-RIX, DEL, DUP, INV) would add high complexity to the network and might cause overfitting. Studies have shown that using a subset of markers can yield prediction ability equivalent to using the full data sets [17, 36, 55]. Thus, from the total of 409,892 molecular markers we selected the 10,000 most associated with the traits of interest. Specifically, we performed a genome-wide association study (GWAS) fitting a linear model (single-marker regression) to find associations between each of the six marker sets and each of the four traits (4 × 6). For each fitted model, a p-value for each marker was collected, and the 10,000 markers with the smallest p-values were selected. GWAS was performed only on the training sets. Since we followed two different cross-validation strategies, the process was repeated for each of them, that is, for the across-population training set, for each of the ten partition training sets, and for each trait. This strategy (hereafter "COMBINED_VARIANTS") was applied to the DL and Bayesian linear models. Additionally, we selected the 10,000 most associated SNPs (hereafter "SNPs") to perform the same analysis and compare directly with the COMBINED_VARIANTS strategy.

  2. Linked SNPs: Using causal variants associated with the specific traits can result in prediction accuracies close to 1 [56]. Multiple examples of structural variants have been found at the origin of trait variability in plant crops [37, 57]. Nevertheless, SNPs are easier to genotype in populations than SVs (i.e., by genotyping chips). We reasoned that SNPs in LD with SVs could be used as a replacement for SVs and would be easier to use in further experimental or breeding programs. Hence, we used ngsLD to detect SNPs in LD with SVs (r² ≥ 0.8) and used them for prediction. The software was applied to each trait and cross-validation scenario using only the training set. Then, for each case, the unique SNP-SV pairs with r² ≥ 0.8 were selected. The type, position and chromosome of the variants meeting these criteria were collected in a list provided in the GitHub repository. The linked SNP sets were used as marker sets across all analyses. The abbreviation chosen for this input strategy was "LINKED_SNPS". The average number of linked SNPs across the analyses was ~ 12,000.

  3. PCs single matrix: In the third strategy, we exploited the advantages of principal component analysis (PCA) by incorporating it into the neural networks. Studies have shown that using principal components in a DL framework can be particularly advantageous [58]. In our study, PCs were computed from the eigenvectors of each GRM. We ran the analysis by feeding the network a single matrix merging the PCs computed from the six GRMs, introduced as a single input layer. A separate analysis was also performed for SNPs and linked SNPs to test whether this strategy would enhance the performance of the model. The abbreviation chosen for this input strategy was "COMBINED_VARIANTS", with the method labelled MLP_PCs.

  4. Multiple inputs: Here, we tested whether a multiple-input strategy could improve trait prediction. In particular, we used the six PC sets (computed from the GRM of each marker set) as six inputs fed simultaneously to the network in different input layers. Other works have shown that a multiple-input strategy can reduce overfitting and computational cost while exploiting mixed data to improve prediction [59, 60]. Thus, the network accepted six different input layers, each forwarding independently into its own hidden dense layer; the six branches are then merged by a concatenate layer (Fig. 4). To compare this DL strategy directly against the Bayesian linear models, RKHS was run with model Eq. 1 using all the GRMs as inputs at the same time. The abbreviation chosen for this input strategy was "MULTIPLE INPUTS".

Fig. 4

Representation of Multiple inputs strategy employed in the present study
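The multiple-inputs architecture of Fig. 4 can be sketched in Keras as follows (an illustrative skeleton only: the numbers of PCs per set, units and activations are placeholders, not the tuned values from this study):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_multi_input_mlp(pc_dims, hidden_units=8):
    """One input branch per marker set, each with its own dense layer,
    merged by a concatenate layer before the output, as in Fig. 4."""
    inputs, branches = [], []
    for name, dim in pc_dims.items():
        inp = layers.Input(shape=(dim,), name=name)
        branches.append(layers.Dense(hidden_units, activation="relu")(inp))
        inputs.append(inp)
    merged = layers.Concatenate()(branches)
    out = layers.Dense(1, activation="linear")(merged)  # quantitative trait
    model = Model(inputs=inputs, outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model

# hypothetical sizes: 50 PCs retained per marker set
pc_dims = {"SNPs": 50, "MITE_DTX": 50, "RLX_RIX": 50,
           "DEL": 50, "DUP": 50, "INV": 50}
model = build_multi_input_mlp(pc_dims)
```

For the binary traits, the output layer would instead use one sigmoid unit with a binary cross-entropy loss, as described below for the hyperparameter search.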

Optimization of hyperparameters

CNNs, MLP, BayesC and RKHS were run using COMBINED_VARIANTS (SNPs + SVs), SNPs and LINKED_SNPS as separate inputs. Additionally, the MLP network was employed using the six PC sets as a single input matrix, as six different input layers, and using the PCs produced from the SNP GRM alone. All models were applied separately to each trait and to eleven cross-validation scenarios (the 10-fold CV and the ARO/ADM across-population prediction). Supplementary Table 1 shows the different models implemented in our analysis.

For each run, hyperparameter tuning was performed to obtain the best hyperparameters, and the model was then retrained with the hyperparameters found by the search. The Keras Tuner library [61] was used to pick the optimal set. Hyperparameters are the variables that control the training process and the topology of the model. When the model is built for hyperparameter tuning, the search space is defined in addition to the model architecture. A tuner must then be selected to determine which hyperparameter combinations should be tested; in our analysis we used the Hyperband tuner. The Hyperband tuning algorithm uses adaptive resource allocation and early stopping to converge quickly on a high-performing model. The algorithm trains a large number of configurations for a few epochs (an epoch being one complete pass through the entire training dataset) and carries forward only the top-performing half of the models to the next round [62], evaluating performance by computing the MSE (for quantitative traits) or binary cross-entropy (for binary traits) on a held-out validation set. The best model is the one that minimizes the error. After the hyperparameter search finished, we evaluated the model on the test data and performed prediction, computing the aforementioned evaluation metrics on the test dataset. Figure 5 displays the suggested scheme.

Fig. 5

Basic scheme from the Hyperband tuner, determining the best configuration, to the final evaluation of the model

DL performance is controlled by various parameters, and thus the optimization of the hyperparameters is not a trivial step. Here, we designed the tuner search space based on the available literature [21, 25]. The hyperparameters chosen to be optimized were: activation function (ReLU, Tanh, Linear); number of neurons in the first layer of the MLP and number of filters in the CNN (16, 38, 64, 128); number of hidden dense layers, (0, 1, 2, 3) for the MLP and (1, 2, 3) for the CNN; number of neurons in the hidden layers (2, 4, 8, 16); optimizer (Adam, RMSprop, SGD); dropout rate (0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3); and L1 and L2 regularizers with an optimized weight decay parameter (0.001, 0.01, 0.1). For the output layer we used one unit with a linear activation function for the quantitative traits, whereas a sigmoid was used for the binary traits, as suggested in [63]. For the hyperparameter optimization, 80% of the training set was used and the remaining 20% served as the validation dataset for inner testing. Training a DL network that generalizes well to new data is a challenging issue. A model with too little capacity cannot learn from the data, a problem known as underfitting, whereas a model with large capacity can fit the training dataset too well, resulting in overfitting. Several techniques can be applied to a DL network to avoid or reduce these two phenomena.

Here we used two regularization techniques, L1 and L2, with a weight decay parameter. These techniques penalize the weight values of the network, shrinking them toward zero to obtain a more parsimonious model. L1 adds the absolute value of the coefficient magnitudes as a penalty term to the loss function, while L2 adds their squared magnitudes. We added L1 and L2 regularizers to the hidden layers. In addition to the regularization, dropout and early stopping were applied to reduce the effect of overfitting on our models; a dropout layer was applied before the output layer. Our analysis was implemented using the TensorFlow 2.8.0 library with the Keras 2.8.0 interface and Keras Tuner 1.1.2.

Results

Phenotypic structure and genetic inference

We used PCA to determine the underlying structure of our data and the direction of maximum variation when projected into a lower-dimensional space. Figure 6A shows the projections of the variables of each trait onto the principal components. The length of each arrow is proportional to the trait's contribution, whereas the angle between arrows shows whether traits are correlated (pointing in the same direction) or not. An analysis of the first two principal components showed that the first component depends on grain weight, which contributes the most to the total phenotypic variation. The main contributors to the second component are, in descending order, time to flowering, culm diameter and leaf senescence. Our analysis showed that no clear phenotypic structure is observed in the varieties analysed (Fig. 6B).

Fig. 6
figure 6

Panel A displays the PCA loadings of each trait for the first two standardized principal components. The different colours indicate the percentage of contribution to the principal components, displayed as “contrib”. Panel B shows the projections of the accessions
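For reference, loadings of the kind plotted in Fig. 6A can be obtained from an SVD of the column-centred trait matrix. A minimal NumPy sketch on toy data (the trait values below are hypothetical, not the rice phenotypes):

```python
import numpy as np

def pca_loadings(X, n_components=2):
    """PCA via SVD of the column-centred data matrix.

    Returns the loadings of each variable (column of X) on the
    first `n_components` principal components.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T  # rows: variables, columns: PCs

# Toy data: 5 accessions x 3 hypothetical traits
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 0.7],
              [3.0, 3.5, 0.2],
              [4.0, 3.0, 0.9],
              [5.0, 4.5, 0.4]])
loadings = pca_loadings(X)
```

Each column of `loadings` is a unit vector, so the arrow lengths in a biplot directly reflect the relative contribution of each trait.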

Genetic variance estimates were obtained for each trait (Fig. 7) using model (1). In particular, we estimated the genetic variance explained by each SV marker set in comparison to SNPs, to understand the relative importance of each set in determining the observed phenotype. Figure 7 shows that structural variants explain a significant fraction of the additive genetic variance in the two binary traits, culm diameter and leaf senescence; in the latter, it is even larger than the fraction explained by SNPs. Among the different types of SVs, TE-related variants (MITE-DTX and RLX-RIX) explained more genetic variance than non-TE variants. Among the latter, deletions explained more variance than duplications and inversions.

Fig. 7
figure 7

Means of the posterior distributions of the genetic variances explained by each marker set. Red bars represent SNPs, while blue bars represent the different kinds of SVs

Comparison of model performances

The prediction ability of the DL implementations was compared to that of Bayesian regression using RKHS and BayesC for each trait under eleven cross-validation sets. In particular, we assessed prediction following two different validation strategies: prediction using ten randomly selected training sets produced by a 10-fold cross-validation strategy, and prediction across populations. All the models were applied separately to each of the eleven validation scenarios (see Materials and Methods). Figure 8 shows the performance of each of the models under the 10-fold cross-validation. We used binary cross-entropy as the evaluation metric for binary traits and MSE for quantitative traits. Other metrics can also be applied (e.g. Pearson's correlation for quantitative traits, accuracy for binary traits); however, here we measure the prediction ability in terms of the loss of the model, as previously reported [64]. That is, the closer the model predictions are to the actual values, the lower the loss of the model. The highest prediction ability (minimum loss) for culm diameter was obtained using CNN and linked SNPs (loss = 0.580). For leaf senescence, the best prediction was obtained using MLP with PCs computed from the combined variants (loss = 0.576). For the quantitative traits, Bayesian regression models showed higher prediction accuracy than the DL models in the 10-fold cross-validation strategy. In particular, grain weight and time to flowering were better predicted under the RKHS model using linked SNPs (loss = 0.715) and multiple inputs (loss = 0.327), respectively.

Fig. 8
figure 8

Performance of each of the model-input combinations under the 10-fold cross-validation strategy and four predicted phenotypes. Each model was applied separately to each of the ten partitions. Points represent the evaluation metric for each partition, whereas the boxplot shows the distribution of the numerical values, displaying the data quartiles. The value that appears in bold is the average value for each model. The y-axis shows the loss metric, which for binary traits is the binary cross-entropy and for quantitative traits the MSE
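The two loss metrics used throughout are standard; a minimal sketch in plain Python (illustrative, not the TensorFlow implementation used in the analysis):

```python
import math

def mse(y_true, y_pred):
    # Mean squared error: loss metric for the quantitative traits
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # Binary cross-entropy: loss metric for the binary traits;
    # predicted probabilities are clipped to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Toy examples with hypothetical observed/predicted values
quantitative_loss = mse([1.2, 0.8, 1.5], [1.0, 1.0, 1.4])
binary_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
```

Both losses decrease as the predictions approach the observed values, which is why the minimum loss corresponds to the highest prediction ability.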

In the independent prediction, the phenotypes of all ADM and ARO accessions were predicted given the rest of the accessions. Figure 9 shows the prediction ability of the across-population strategy under eleven different models. In particular, culm diameter and leaf senescence were better predicted under the MLP_PCs model and the “COMBINED_VARIANTS” marker strategy, as in the 10-fold partitions. However, grain weight and time to flowering showed the lowest loss values under CNN with SNPs (loss = 0.963) and MLP with combined variants (loss = 0.44), respectively. In general, time to flowering was better predicted than the rest of the traits in both cross-validation strategies. On average, prediction across populations was less accurate for the quantitative traits than in the 10-fold scenarios, as expected given the more distantly related training and test datasets.

Fig. 9
figure 9

Performance of each of the model-input combinations under the across-population strategy. Points represent the evaluation metric. The y-axis shows the loss metric, which for binary traits is the binary cross-entropy and for quantitative traits the MSE

Impact of marker selection on genomic prediction

Our results showed that phenotypic traits such as leaf senescence and time to flowering were better predicted using combined variants or multiple inputs. Also, using SNPs linked to SVs exhibited efficient prediction ability, especially for culm diameter, under both validation strategies. We observed that incorporating structural variation into a genomic prediction framework, either by combining it with SNPs or by using the SNPs linked to these variants, resulted in improved prediction performance in 87% of the studied cases compared to using only SNPs (Table 1).

Table 1 Minimum prediction loss and corresponding model with input strategy

Discussion

In this study we show that combining structural and nucleotide genome-wide variation for genomic prediction can improve prediction ability for important agronomic traits in rice. Previous studies on plants have explored the association between structural variants and phenotypic traits [65,66,67], and some SVs have been demonstrated to be the causal variants for a diversity of phenotypes across major traits in plants [68, 69]. For example, late or early flowering in wheat depends on an increased copy number of the Vrn-A1 and Ppd-B1 genes, respectively [70], and plant height is associated with a specific tandem duplication [71]. We show that SVs can explain a significant fraction of the genetic variance in binary traits, especially in leaf senescence. This might be due to the presence of causal SVs lacking markers in high or complete linkage disequilibrium in the filtered SNP dataset, as described in [66]. The strong regulatory potential of rice SVs [40, 65, 72] could explain their relationship with trait variability and their high performance in the prediction of phenotypic traits when used alone or in combination with SNPs. However, the incorporation of SVs into breeding programs demands that their genotyping be automatized. This can be a complex task, as SVs are highly diverse and commercial SV genotyping assays do not exist. Our results suggest that the identification of SNPs highly linked to SVs (ideally in complete LD) from high-density SNP datasets could be an effective strategy to indirectly incorporate SVs into genomic prediction in a straightforward manner. We propose that the accurate characterization of SVs in large accession panels (i.e., from pangenome studies) should be a priority to enhance crop breeding, as they constitute a promising resource for genomics-assisted breeding.

A second objective addressed is the evaluation of the performance of DL and Bayesian methods to predict agronomically important traits in rice. Culm diameter, leaf senescence and time to flowering are correlated (Fig. 6), whereas grain weight is uncorrelated with them. Traits such as time to flowering and grain weight are polygenic, controlled by many QTLs of large effect [73,74,75]. Studies of culm diameter have shown that it is controlled by at least twelve QTLs associated with lodging resistance in dry direct-seeded rice [76]. In addition, delayed leaf senescence, or stay-green, is associated with forty-six QTLs that make up the genetic basis of this important trait in rice [77]. Genomic prediction of traits such as time to flowering was quite accurate using both DL and Bayesian models, with the reported loss being the lowest across the whole study (average MSE of 0.327, Table 1). For leaf senescence, the prediction performance was lower than for time to flowering, yet still accurate. It is worth mentioning that, in the ARO/ADM validation strategy, the prediction ability for leaf senescence and time to flowering was improved by DL over the best values of the Bayesian models by 24% and 21%, respectively (Fig. 9). Since the genetic relatedness of the accessions used for training increases prediction accuracy, it is interesting that DL models outperformed the Bayesian ones in both binary and quantitative traits for genetically distant lines (Table 1).

Increasing the prediction accuracy of traits in rice breeding is challenging but at the same time highly important, considering the increasing environmental constraints that limit world production. New methods attempt to improve the prediction of agronomic traits, promising lower computational cost and better results. DL is a state-of-the-art method applied in many different fields, and many recent studies have started to compare DL with standard linear models for genomic prediction [17,18,19,20,21, 25, 31, 64]. Here, we studied the performance of DL models for predicting complex traits in rice, comparing them to Bayesian regression methods under different input strategies and scenarios. Overall, our results showed that DL increased prediction accuracy compared to Bayesian methods in 75% of the implementations (6/8 studied cases, Table 1). Across DL architectures, MLP and CNN were the optimal choices in an equal number of cases, depending on the trait and training population. This observation shows that there is no clear winner, as evidenced by contrasting findings in the literature, where MLP outperforms CNN according to [25], whereas [17, 21] report the opposite trend. Among the Bayesian regression models, RKHS clearly outperformed BayesC.

Another critical and challenging issue in DL models is the optimization of the hyperparameters, mainly due to the high computational cost. The tuning of the hyperparameters for each trait depends on the genetic basis and architecture of the trait. As we show in Supplementary Tables 2–5, different combinations of hyperparameters were selected for the various traits, as the prediction ability is highly associated with the interaction of these factors [17, 20]. We observed that Tanh was the most useful activation function for quantitative traits, being selected in 75% of the cases (6/8), whereas for binary traits the ReLU function was the optimal choice in 63% of the cases (5/8). Moreover, the Adam optimizer was the most frequently chosen for binary traits during hyperparameter tuning (63% of cases), whereas RMSprop was the optimal option for quantitative traits (50% of cases). DL models can capture interactions of high order because of the presence of hidden layers [14, 78]. However, RKHS models are also able to capture complex interaction patterns, and our results demonstrate that both methods can indeed capture complex interactions.

The third aim of this work was to study the impact of various input strategies on the prediction results. It is commonly believed that GP requires a large marker set for efficient prediction. However, our current results and some related works [17, 36] support that GP models can be effective even with a smaller set of markers, although the optimal marker size can be related to the studied trait [25]. We also observed that, in some cases, the best input strategy is affected by the chosen phenotypic trait and the training set (Table 1). Notably, MLP models using PCs as the input strategy proved beneficial in 66.7% of the cases in which MLP was the best model. In any case, the different input strategies that we followed indicated that accommodating subsets of the markers in the GP framework can be as informative as, or even more informative than, using the whole marker set.

Finally, we would like to mention the challenges and limitations of DL models. Firstly, DL models do not provide clear insights into the genetic architecture of the traits, nor do they give information about the effects of specific markers on the studied traits. Different hyperparameters act on different parts of the data, making it hard to interpret the biological significance and importance of each marker in the model [17, 79]. Also, the high computational cost of training is a significant drawback, especially when multiple hyperparameters must be optimized for each trait separately [80]. A main advantage of DL over linear models may be related to the assumptions the latter must fulfil. To be optimal, linear models must satisfy several assumptions, such as linearity (a linear relationship between independent and dependent variables), no multicollinearity (no highly correlated independent variables), homogeneity of variance (the residuals have equal variance) and normality (the residuals follow a normal distribution with a mean of zero). On the other hand, DL allows for non-linearity and can, in principle, identify complex patterns in high-dimensional data. However, DL does not always outperform linear models. The prediction ability depends on the studied traits and can be affected by various factors, such as the genetic architecture of the studied trait, the input strategy, the LD between markers, the model architecture and the hyperparameter tuning. All these factors could explain the vast diversity found in the literature when the performance of DL models is compared to that of other models. Thus, our results support the view that no single algorithm performs best across all species and traits [27]. Nevertheless, even though the advantage of DL networks over linear methods has not been established yet, their incorporation into plant breeding can be important for improving genetic merit for complex traits.

Conclusions

Our study reveals that the use of structural genomic variants can enhance trait prediction in rice, independently of the methodology used, a result likely applicable to other plant crops. Importantly, SNPs linked to SVs can be effectively used to indirectly incorporate SVs into genomic prediction. From a practical point of view, the use of SNPs linked to known SVs can greatly facilitate the incorporation of this previously ignored source of genetic variability into genomic prediction. We show that, under our tested scenarios, deep learning networks perform better than Bayesian models in the prediction of binary traits, although the latter perform better for quantitative traits. Nevertheless, we found that DL generally outperforms Bayesian models in both binary and quantitative traits when the training and target sets are not closely related. This highlights the potential advantage of DL for crop improvement in specific scenarios.