Introduction

Accurate prediction of phenotypes of plants is important for breeding and management because it facilitates the design and selection of new genotypes, the introduction of new genetic resources and decision-making in management practices. It is desirable for a number of traits to be predicted in cultivation, including yield, phenology, stress resistance and morphological architectures. These traits are often complex and influenced by a number of factors such as interactions among genes, environments and management methods. A key statistical technique recently developed for the prediction of complex traits is genomic prediction or genome-wide prediction. Genomic prediction uses genome-wide DNA markers simultaneously as predictor variables and plays an essential role in genomic selection. Although this approach was proposed for predicting genetic merit in animal breeding (Meuwissen et al. 2001), it has been suggested that accurate genomic prediction is of practical use for phenotypic prediction in plants such as crops (Crossa et al. 2010), trees (Resende et al. 2012) and fruits (Iwata et al. 2013). However, a drawback of genomic prediction is that it is often difficult to predict the performance of plants in diverse environments (Resende et al. 2012; Ly et al. 2013), because phenotypes of complex traits are often influenced by genotype–environment interaction. As climate change and variability are serious concerns for crop and livestock production (Thornton et al. 2014), statistical models that enable prediction of the performance of plants cultivated in various environmental conditions will play a crucial role in breeding for environmental adaptability and optimization of management methods.

Another approach for predicting phenotypes of plants is a mathematical modelling approach known as crop modelling or eco-physiological modelling (Yin et al. 2004; Hammer et al. 2006). Crop models generally consist of multiple functions that depict physiological processes (Tardieu 2003). Using environment and management information, including temperature, photoperiod, precipitation and sowing date, the models predict the responses of plants/organs to environment stimuli (Tardieu 2003). Because it is recognized that the parameters of crop models are genetically controlled, parameters are often referred to as genetic coefficients or model input traits (Yin et al. 2003). Thus, genetic dissection of crop model parameters has been conducted to reveal the genetic control of responses to the environment. Models include those for growth and flowering time in barley (Yin et al. 2000, 2005), leaf growth in maize (Reymond et al. 2003), fruit quality in peach (Quilot et al. 2005), flowering time in rice (Nakagawa et al. 2005) and wheat (Bogard et al. 2014), yield under drought stress in rice (Gu et al. 2014) and flowering time in Brassica oleracea (Uptmoor et al. 2008). Based on estimated QTL effects, phenotypes of untested lines in untested environments can be predicted via the prediction of crop model parameter values. For example, Gu et al. (2014) predicted yields in rice and reported that prediction accuracy, measured as the coefficient of determination between the observed and predicted values, was 0.20‒0.21. Reymond et al. (2003) predicted leaf elongation rates in maize and reported that the coefficient of determination was 0.74. Days to heading (DTH) in wheat was predicted by Bogard et al. (2014) with a root mean squared error (RMSE) between the observed and predicted of 6.3 days. These studies used only significant QTLs (markers) to predict the parameter values of crop models. Because detected QTLs usually capture only a part of the genetic variance and QTL effect sizes tend to be overestimated, using whole-genome markers (i.e. genomic prediction) is preferable for prediction.

A straightforward way to use genomic prediction with crop models is to predict the model input traits (parameters of crop models) using genomic prediction methods. This approach (the two-step approach) is similar to the studies based on detected QTLs described above. In this approach, the model input traits are first statistically optimized. Then the genomic prediction methods are trained with the optimized model input traits. A major advantage of this approach will be the feasibility of statistical implementation. However, when the model input traits are statistically optimized, an issue for this approach will be that uncertainty in the optimization of model input traits is not taken into account in building genomic prediction models because these processes (optimization of model input traits and training of genomic prediction) are performed separately. Another issue is that, because crop model parameters are often optimized for each line/cultivar independently (Nakagawa et al. 2005; Bogard et al. 2014), genetic relatedness among lines/cultivars cannot be taken into account in the optimization. A possible way of addressing these issues would be to adopt an approach more comprehensive than the two-step approach, i.e. a single hierarchical model that combines a crop model with a genomic prediction model. With this approach (the integrated approach), the issues of the two-step approach can be avoided through the simultaneous inference of crop model parameters and marker effects and consideration of genetic relatedness among lines/cultivars via whole-genome markers. Thus, the predictions of the integrated approach would be expected to be more accurate and robust than those of the two-step approach.

Technow et al. (2015) proposed a statistical approach to combine genomic prediction with crop modelling for predicting the yield of maize. Using genome-wide markers, the authors defined prior distributions of four measurable traits that are inputs for a maize growth model and predicted the input traits and grain yield using an approximated Bayesian computation technique. During simulation analysis, the authors showed that based on environmental information, the integrated approach predicted grain yield more accurately than genomic prediction. However, to date, there has been no comparison between the integrated and two-step approaches; therefore, it has not been revealed whether the integrated approach, which is statistically more complex than the two-step approach is superior to the two-step approach pertaining to their prediction accuracy. Moreover, the integrated approach has not been applied to real data.

To achieve the integration of genomic selection with crop modelling, in this study, we developed an integrated approach to predict the heading date of rice (Oryza sativa, L.) and compared the predictive ability of the integrated approach with that of the two-step approaches using real data. Modelling of phenological events including the heading date (i.e. flowering time) and maturity time is important when simulating crop growth, and is thus an indispensable part of a complete crop model (Soltani and Sinclair 2012). We also compared the integrated approach with conventional genomic prediction and phenological (crop) modelling. As a phenological model for the trait, we used a developmental rate (DVR) phenological model (Yin et al. 1997; Nakagawa et al. 2005). We compared four approaches for predicting the heading date. The first approach is conventional genomic prediction. In this approach, heading date was directly predicted using whole-genome markers with the extended Bayesian LASSO (EBL) (Mutshinda and Sillanpaa 2010). The second is conventional phenological modelling. In this approach, heading date was predicted based on daily mean air temperature and photoperiod using the DVR model. Parameters of the DVR model were inferred in two ways using Nelder–Mead (i.e. non-Bayesian) and Markov chain Monte Carlo (MCMC) (i.e. Bayesian) algorithms. The third approach is the two-step approach. In this, the DVR model parameters were first inferred by the Nelder–Mead or MCMC algorithms, and then the genomic prediction model for the parameters was built using the EBL. The fourth approach is the integrated method, wherein the DVR model was combined with a genomic prediction model in a Bayesian two-level hierarchical model. The first level was the DVR model and the second level was the EBL, which inferred whole-genome marker effects from the DVR model parameters. The predictive performance of these methods was evaluated via field experiments using 174 backcross inbred lines (BILs) derived from japonica and indica varieties, Koshihikari and Kasalath, respectively. These lines were cultivated across 3 years in six locations that represented a wide range of day lengths.

Materials and methods

Plant materials and field evaluation

We used 174 Koshihikari/Kasalath//Koshihikari BILs together with the parental lines (Koshihikari and Kasalath). These lines were developed by the National Institute of Agrobiological Sciences Rice Genome Resource Center (Ma et al. 2002). Genotypes and a linkage map of 162 restriction fragment length polymorphism markers are available at http://www.rgrc.dna.affrc.go.jp/ineKKBIL182.html. All these markers were bi-allelic. The BILs were cultivated in six experimental fields across 3 years, resulting in field trials in a total of nine environments (Table S1). At each site, seeds were pre-germinated in water and sown to seedling trays filled with soil. Emergence was recorded for each genotype when about 80 % of plants emerged from the soil surface. Seedlings were grown either in a puddled field or in a poly-house, depending on the environment, and air temperature was measured in the nursery environment. Nursery duration varied from 15 days to 25 days depending on the environments (Table S1), but we commonly transplanted seedlings with 3–4 leaves. Each environment included two field replicates, in each of which 10 seedlings of each BIL were transplanted in a single row, with 30-cm spacing between the rows. The distance between the plants in a row was 15 cm. All fields received an equal amount of fertilizer at a rate of 50 kg/ha of N, 50 kg/ha of P2O5 and 50 kg/ha of K2O, applied prior to transplanting. The field was kept flooded at least until flowering for all lines. Heading dates were recorded for five central plants for each line and were averaged. Although two field replications were available at each environment, missing records were occasionally observed. Averages of two replications and records from one replication had different proportions of environmental (non-genetic) variance. Thus, for each environment, we used the phenotypic records (i.e. DTH) of the replication with the fewest missing records. Air temperatures were measured onsite and the daily photoperiod (P d) was calculated astronomically, as per Horie and Nakagawa (1990), from:

$$\cos \left( {{{\varpi P_{d} } \mathord{\left/ {\vphantom {{\varpi P_{d} } 2}} \right. \kern-0pt} 2}} \right) = - \tan \phi \tan \delta ,$$

where ω is the angular rate of daily rotation, ϕ represents the latitude of the observation point and δ is the declination of the sun.

Prediction methods

The prediction methods used were categorized into four groups (Table 1). Genomic prediction based on EBL (the GP method) used DTH and whole-genome marker genotypes as response and predictor variables, respectively. The phenological (DVR) models based on the Nelder–Mead and MCMC algorithms (C-Nel and C-Bay, respectively) predicted DTH based on mean air temperature and photoperiod in the growth period of the rice plants. The two-step methods were the combinations of C-Nel and C-Bay with EBL (T-Nel and T-Bay, respectively). In the EBL regression of the two-step methods, the DVR model parameters derived from the respective algorithms and the marker genotypes were considered to be the response and predictor variables, respectively. In the integrated model method (IM), the DVR model parameters were regressed on genome-wide markers using EBL and the model parameters and marker effects on them were inferred simultaneously. In the Bayesian methods (C-Bay, T-Bay and IM), the DVR model parameters were log-transformed. In T-Bay and IM, regression analysis was also conducted in the log-transformed scale. The prediction methods are explained in the subsequent sections.

Table 1 Prediction methods used in this study

EBL

The EBL was developed from the Bayesian LASSO (Mutshinda and Sillanpaa 2010). Whereas the Bayesian LASSO controls the level of shrinkage of predictor variables with a single hyperparameter (Park and Casella 2008), the EBL controls the level with two hyperparameters: one at the global level, the other at the marker-specific level (Mutshinda and Sillanpaa 2010). EBL has been shown to be robust across factors, e.g. the number of QTLs or heritability, which influence the prediction accuracy of genomic prediction (Onogi et al. 2015). We initially tried the genomic best linear unbiased prediction (GBLUP) method, which is more commonly used for genomic prediction. However, because EBL proved more accurate than GBLUP for the genomic prediction of DTHs using the rice population of the present study (data not shown), we focused on EBL. Because the primary purpose of this study was to compare the integrated approach with the two-step approach and genomic prediction, it was important to use the same regression method in all the approaches compared (IM, T-Nel, T-Bay and GP). We had not intended to compare the different regression methods (EBL vs. GBLUP) in this study.

We used the variational Bayesian algorithm introduced by Li and Sillanpaa (2012) and modified by Onogi et al. (2015) for parameter estimation. The algorithm was executed using an R package, VIGoR (Onogi 2015). We provide the model structure of EBL and the calculation procedure performed by the package in the Supplementary Methods. The EBL was used in GP, T-Nel, T-Bay and IM. In each method, the predictor variables were marker genotypes and regression coefficients were marker effects. In GP, the response variable was DTH. In T-Nel and T-Bay, the response variables were the DVR model parameters derived from C-Nel and C-Bay, respectively. In IM, the response variables were the DVR model parameters derived simultaneously with regression coefficients for predictor variables (i.e. marker effects on the DVR model parameters). In these methods (GP, T-Nel, T-Bay and IM), the response variable was standardized and the hyperparameters were set to the values common across the methods (see Supplementary Methods).

DVR model

We used the DVR model developed by Nakagawa et al. (2005) which was originally proposed by Yin et al. (1997) as a ‘three-stage Beta model’. The model assumes three developmental phases before flowering: (1) the ‘juvenile phase’, when plants are insensitive to photo stimulus; (2) the ‘photosensitive phase’, when photo stimulus is influential on DVR; and (3) the ‘post-photosensitive phase’ after the photosensitive phase for flowering. DVRs accumulate from the emergence day, and growth of developmental stages (DVSs) is given by:

$$\text{DVS} = \sum\limits_{d = 1}^{D} {\text{DVR}_{d} } ,$$

where D is the number of days since the emergence day. Flowering occurs when DVS reaches 1. The DVR at day d is defined as follows:

$$\text{DVR}_{d} = \left\{ {\begin{array}{*{20}l} {{{f\left( {T_{d} } \right)} \mathord{\left/ {\vphantom {{f\left( {T_{d} } \right)} G}} \right.} G}} & {\text{if DVS}_{d} \text{ < DVS}_{1} \text{ or DVS}_{d} \text{ > DVS}_{2} } \\ {{{f\left( {T_{d} } \right)g\left( {P_{d} } \right)} \mathord{\left/ {\vphantom {{f\left( {T_{d} } \right)g\left( {P_{d} } \right)} G}} \right.} G}} & {\text{if DVS}_{1} \le \text{DVS}_{d} \le \text{DVS}_{2} } \\ \end{array} } \right. ,$$

where T d is the mean air temperature (°C) at day d, P d is the photoperiod (h), f and g denote the functions that relate T d and P d with DVR d , respectively, G (\(\ge 0\)) represents earliness of the heading date given optimal temperature and photoperiod, and DVS 1 and DVS 2 denote the ends of the juvenile and photosensitive phases, respectively. The functions f and g are defined as follows:

$$f\left( {T_{d} } \right) = \left\{ {\begin{array}{*{20}l} {\left[ {\left( {\frac{{T_{d} - T_{b} }}{{T_{o} - T_{b} }}} \right)\left( {\frac{{T_{c} - T_{d} }}{{T_{c} - T_{o} }}} \right)^{{{{\left( {T_{c} - T_{o} } \right)} \mathord{\left/ {\vphantom {{\left( {T_{c} - T_{o} } \right)} {\left( {T_{o} - T_{b} } \right)}}} \right. \kern-0pt} {\left( {T_{o} - T_{b} } \right)}}}} } \right]^{\alpha } } & {\text{if }T_{b} \le T_{d} \le T_{c} } \\ 0 & {\text{otherwise}} \\ \end{array} } \right.,$$

and

$$g\left( {P_{d} } \right) = \left\{ {\begin{array}{*{20}c} {\left[ {\left( {\frac{{P_{d} - P_{b} }}{{P_{o} - P_{b} }}} \right)\left( {\frac{{P_{c} - P_{d} }}{{P_{c} - P_{o} }}} \right)^{{{{\left( {P_{c} - P_{o} } \right)} \mathord{\left/ {\vphantom {{\left( {P_{c} - P_{o} } \right)} {\left( {P_{o} - P_{b} } \right)}}} \right. \kern-0pt} {\left( {P_{o} - P_{b} } \right)}}}} } \right]^{\beta } } & {\text{if }P_{o} \le P_{d} } \\ 1 & {\text{if }P_{o} > P_{d} } \\ \end{array} } \right. ,$$

where T b , T o and T c denote the base, optimum and ceiling temperatures, respectively, and P b , P o and P c denote the base, optimum and ceiling photoperiod, respectively. According to Nakagawa et al. (2005), T b , T o , T c , P b , P o and P c were fixed to be 8, 30, 42 °C, 0, 10 and 24 h, respectively. Thus, G represents DTH under a 10-h photoperiod at 30 °C in the present study as well as in Nakagawa et al. (2005). Parameters α (\(\ge\)0) and β (\(\ge\)0) represent the temperature and photoperiod sensitivity, respectively. Sensitivity increases as these parameters increase. The change of sensitivity due to these parameters is illustrated in Fig. 1 in Nakagawa et al. (2005). DVS 1 and DVS 2 are defined as

$$\begin{aligned} \text{DVS}_{1} = 0.145 + 0.005G \hfill \\ \text{DVS}_{2} = 0.345 + 0.005G \hfill \\ \end{aligned}$$

according to Nakagawa et al. (2005). Although some pre-fixed parameters such as T o may show genetic variation, we assumed genetic variations existed only in G, α and β in this study. Thus, we use G i , α i and β i to denote the DVR model parameters of line i. Note that all these parameters are constrained to be \(\ge\)0.

Fig. 1
figure 1

Marker effects on the developmental rate model parameters derived from the two-step approaches (T-Nel and T-Bay) and the integrated model (IM). Different shades indicate different chromosomes (left to right, chromosome 1 to chromosome 12). Dashed lines indicate the chromosomal positions of known major genes illustrated in the figure. Because T-Bay and IM estimated the marker effects in log-transformed scales, scales of y axes differ from those of T-Nel for all the parameters. G, α and β represent earliness of heading date under optimal conditions, temperature sensitivity and photoperiod sensitivity, respectively

Nelder–Mead optimization of the DVR model

The objective function is

$$\text{argmin}_{{G_{i} ,\alpha_{i} ,\beta_{i} }} \sum\limits_{j = 1} {\left| {H_{ij} - h\left( {G_{i} ,\alpha_{i} ,\beta_{i} ,{\mathbf{T}}_{j} ,{\mathbf{P}}_{j} } \right)} \right|} ,$$

where H ij is the DTH of line i in environment j, h is a function that calculates the DTH for given values of G i , α i and β i and mean daily temperature (T j) and photoperiod (P j) in environment j. | | indicates absolute difference. Although we also tried optimization based on squared loss in the preliminary study, prediction accuracy was lower than that based on absolute loss (data not shown). This may indicate the presence of outliers, because the absolute loss function is more robust than the squared loss function in handling outliers. However, we did not observe any obvious outliers in the distribution of DTH (Fig. S1) or in the local outlier factor analysis (Breunig et al. 2000, data not shown). Optimization was conducted for each line. As initial values for the optimization of G, we tried 40, 55 and 80. For α and β, we tried 0.01, 5 and 10. These values were chosen to cover the ranges of estimates in the BIL population derived from Nipponbare and Kasalath (Nakagawa et al. 2005). Consequently, we conducted optimization a total of 27 (3 × 3 × 3) times with different initial value combinations and obtained 27 sets of estimates of G, α and β. From these, we chose the set that minimized the objective function. When multiple sets of estimates minimized the objective function, we calculated the average of the estimates over the sets. Optimization was conducted by using the optim function in R (R Development Core Team 2011).

MCMC inference of the DVR model

In C-Bay, the DVR model was regarded as a regression model where the response variable was fixed to be 1:

$$\text{1} = \sum\limits_{d = 1}^{{H_{ij} }} {\text{DVR}_{ij,d} } + e_{ij} ,$$
(1)

where DVR ij,d is the DVR of line i in environment j at day d and e ij is the residual. The residual was assumed to follow a normal distribution, \(N\left( {0,\,\sigma_{e}^{2} } \right)\) where \(\sigma_{e}^{2} \sim {1 \mathord{\left/ {\vphantom {1 {\sigma_{e}^{2} }}} \right. \kern-0pt} {\sigma_{e}^{2} }}\). Log-transformed values of G i , α i and β i were assumed to follow normal distributions:

$$\begin{aligned} \tilde{G}_{i} = \text{log}\left( {G_{i} } \right)\sim \text{N}\left( {\mu_{G} ,\,\,\sigma_{G}^{2} } \right) \\ \tilde{\alpha }_{i} = \text{log}\left( {\alpha_{i} } \right)\sim \text{N}\left( {\mu_{\alpha } ,\,\,\sigma_{\alpha }^{2} } \right) \\ \tilde{\beta }_{i} = \text{log}\left( {\beta_{i} } \right)\sim \text{N}\left( {\mu_{\beta } ,\,\,\sigma_{\beta }^{2} } \right), \\ \end{aligned} ,$$

where \(\mu_{G}\), \(\mu_{\alpha }\) and \(\mu_{\beta }\) are the means, and \(\sigma_{G}^{2}\), \(\sigma_{\alpha }^{2}\) and \(\sigma_{\beta }^{2}\) are the variances. Non-informative prior distributions were assigned to the means and variances. \(\mu_{G}\), \(\mu_{\alpha }\) and \(\mu_{\beta }\) were assumed to be proportional to a constant value. \(\sigma_{G}^{2}\), \(\sigma_{\alpha }^{2}\) and \(\sigma_{\beta }^{2}\) were assumed to follow scaled inverse Chi-squared distributions with the degree of freedom being −2 and the scaling parameter being 0. Parameter estimation was conducted by using a Metropolis algorithm (Metropolis et al. 1953) (see Supplementary Methods). The number of iterations was 60,000, and the sampling interval was 50. The first 200 samples were discarded as burn-in, and the posterior distributions were inferred from 1000 samples. T-Bay used the posterior means of \(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\).

Integrated model

Together with C-Bay, the DVR model in IM was regarded as a regression model wherein the response variable was fixed to be 1 (Eq. 1). The residual was assumed to follow a normal distribution \(N\left( {0,\,\,\sigma_{e}^{2} } \right) ,\) where \(\sigma_{e}^{2} \sim {1 \mathord{\left/ {\vphantom {1 {\sigma_{e}^{2} }}} \right. \kern-0pt} {\sigma_{e}^{2} }}\). \(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\) were regressed on marker genotypes. For line i, we assume

$$\begin{aligned} \tilde{G}_{i} \sim \text{N}\left( {\begin{array}{*{20}c} {\mu_{G} + \sum\limits_{p = 1}^{P} {g_{p} x_{ip} } ,} & {{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\tau_{0,G}^{2} }}}\right.\kern-0pt} \!\lower0.7ex\hbox{${\tau_{0,G}^{2} }$}}} \\ \end{array} } \right) \, \hfill \\ \tilde{\alpha }_{i} \sim \text{N}\left( {\begin{array}{*{20}c} {\mu_{\alpha } + \sum\limits_{p = 1}^{P} {a_{p} x_{ip} } ,} & {{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\tau_{0,\alpha }^{2} }}}\right.\kern-0pt} \!\lower0.7ex\hbox{${\tau_{0,\alpha }^{2} }$}}} \\ \end{array} } \right) \hfill \\ \tilde{\beta }_{i} \sim \text{N}\left( {\begin{array}{*{20}c} {\mu_{\beta } + \sum\limits_{p = 1}^{P} {b_{p} x_{ip} } ,} & {{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 {\tau_{0,\beta }^{2} }}}\right.\kern-0pt} \!\lower0.7ex\hbox{${\tau_{0,\beta }^{2} }$}}} \\ \end{array} } \right), \hfill \\ \end{aligned}$$

where g p , a p and b p are effects of marker p, and x ip is the genotype at marker p coded as −1 (AA), 0 (AB) and 1 (BB). IM infers the DVR model parameters (\(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\)) and the parameters included in these regressions simultaneously. The intercepts (\(\mu_{G}\), \(\mu_{\alpha }\) and \(\mu_{\beta }\)) were assumed to be proportional to a constant value. As described above, we used the EBL to model marker effects, g p , a p and b p . The prior distributions of the residual precisions, \(\tau_{0,G}^{2}\),\(\tau_{0,\alpha }^{2}\) and \(\tau_{0,\beta }^{2}\) were \(\tau_{0,G}^{2} \sim {1 \mathord{\left/ {\vphantom {1 {\tau_{0,G}^{2} }}} \right. \kern-0pt} {\tau_{0,G}^{2} }}\), \(\tau_{0,\alpha }^{2} \sim {1 \mathord{\left/ {\vphantom {1 {\tau_{0,\alpha }^{2} }}} \right. \kern-0pt} {\tau_{0,\alpha }^{2} }}\) and \(\tau_{0,\beta }^{2} \sim {1 \mathord{\left/ {\vphantom {1 {\tau_{0,\beta }^{2} }}} \right. \kern-0pt} {\tau_{0,\beta }^{2} }}\), respectively. To infer these parameters, we developed a variational Bayesian method using MCMC simulation to derive the expectations of the DVR model parameters (see Supplementary Methods). In this method, the means and variances of the DVR model parameters (e.g. \(\text{E}\left[ {\tilde{G}_{i} } \right]\) or \(\text{V}\left[ {\tilde{G}_{i} } \right]\)) that are required for updating the EBL parameters (e.g. g p ) are obtained via MCMC sampling. Because the variational Bayesian inference is related to the EM algorithm (Ghahramani and Beal 2001; Bishop 2006), our algorithm for IM can be seen as analogous to the Monte Carlo EM algorithm (Wei and Tanner 1990). MCMC samples of \(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\) at the last iteration of the combined procedure were used to derive the posterior distributions of DTH. The number of MCMC samples at the last iteration was 300.

Full data analysis

To compare the behaviours of the prediction methods, analysis was conducted using full data: GP was fitted to each environment and C-Nel was fitted to each line separately. T-Nel re-estimated the DVR model parameters by using the EBL from the estimates from C-Nel. C-Bay and IM were fitted to data comprising all lines in all environments. T-Bay re-estimated the DVR model parameters by using the EBL from the estimates by C-Bay.

Leave-one-out cross-validation

The predictive ability of the prediction methods was evaluated by using three types of leave-one-out cross-validation (LOO). The first is leave-one-environment-out cross-validation (LOEO). In this scheme, one of nine environments was removed from the data, and the DTHs of all lines in the removed environment were predicted with the prediction model derived from the DTHs of all lines of the remaining eight environments. This procedure was repeated until all environments were removed once. When T-Nel and T-Bay were tested by LOEO, the values of G, α and β estimated by C-Nel and C-Bay, respectively, were used as the response variables for regression. The fitted values, i.e. the values without residuals, were used for prediction. The second LOO is leave-one-line-out cross-validation (LOLO). One of 176 BILs was removed from the data, and the DTHs of the removed line in all environments were predicted with the prediction model derived from the retained data. This was repeated until all BILs were removed once. The third type is leave-one-‘combination of an environment and a line’-out cross-validation (LOELO). In this scheme, one of nine environments and one of 176 BILs were removed from the data, and the DTH of the removed line at the removed environment was predicted with the prediction model derived from the data consisting of 175 BILs in eight environments. This was repeated until all combinations of environments and lines were removed once. LOEO, LOLO and LOELO correspond, respectively, to tests of predictive ability for tested lines in untested environments, untested lines in tested environments and untested lines in untested environments.

The phenological models C-Nel and C-Bay were evaluated using only LOEO because they were not able to predict new lines (Table 1). Conversely, GP was not evaluated using LOEO because GP could not use the environment information (Table 1). LOLO of GP was conducted at each environment separately. To assess the ability of GP for untested lines in untested environments, LOLO was conducted at the environment most similar to the target (untested) environment. We used the root mean squared difference of DTH between environments as the metric of between-environment similarity. We report the results of this ‘LOLO at the most similar environment’ together with the results of LOELO for the two-step and integrated approaches, because both LOO schemes evaluated the predictive ability for untested lines in untested environments. Because we used DTH of the target environments to determine the most similar environments, the prediction ability of GP may have been biassed towards an overestimate. However, because GP showed the least accuracy, as described later in the Results section, such bias, if it existed, would not have influenced our conclusions. Another approach for constructing the GP training data set for prediction in untested environments is to average DTH across the tested (available) environments. Although we attempted this, the results were extremely poor (data not shown).

In each LOO scheme, predicted DTH values were obtained for individual lines in each environment. Predictive ability was measured as RMSE and the Pearson correlation between the predicted and observed DTH. Unlike other traits, including yield, DTH needs to be controlled at a preferable value rather than to be increased or decreased as far as possible; therefore, RMSE is more important in practice. Thus, we compared the methods by primarily using RMSE as the metric of accuracy, although we reported both metrics (RMSE and correlation). We also calculated the regression coefficients of the predicted values on observed values. Because coefficients lower than one indicate shrinkage of prediction, we reported the coefficients as the level of shrinkage.

Results

Phenotype evaluation

The distribution of DTH in each environment is shown in Fig. S1. As expected, DTH tended to be longer at higher latitudes. Correlation and root mean squared difference of DTH between environments are presented in Table S2. Scatterplots of DTH between environments are shown in Fig. S2. As described in the “Materials and methods”, on the basis of the root mean squared difference, we determined the environments used for training of GP to test predictive ability of DTH for untested lines in untested environments: for prediction at Tsukuba2007, Tsukuba2009 was used as the environment for training; for Tsukuba2008E, Ishikawa; for Tsukuba2008L, Ishigaki; for Tsukuba2009, Ishikawa; for Ishikawa, Tsukuba2009; for Fukuoka, Tsukuba2008L; for Ishigaki, Ha Noi, for Thai Nguyen, Ishigaki and for Ha Noi, Ishigaki were used as the environments for training.

Full data analysis

RMSEs between the observed and estimated DTH were the smallest in C-Nel and the largest in T-Bay (Table 2). The residual variance \(\sigma_{e}^{2}\) inferred by C-Bay and IM were quite small, suggesting good fitting of the models.

Table 2 Root mean squared errors (RMSEs) (day) and Pearson correlation between estimated and observed values, regression coefficients of estimated values on observed values (slope), and posterior means of the residual variance of the Bayesian methods in the full data analysis

The Markov chains of C-Bay converged quickly (data not shown). Acceptance rates of Metropolis algorithms for the estimates of \(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\) were 0.26, 0.83 and 0.67, respectively, suggesting good mixing properties. To assess the reproducibility of IM, we repeated runs five times with different initial values of G, α and β. The estimates (posterior means) of \(\tilde{G}\), \(\tilde{\alpha }\) and \(\tilde{\beta }\) were highly consistent among the replications: mean Pearson correlation coefficients among the replications were >0.99 for all the parameters. Estimated marker effects on these parameters were also consistent among replications (mean Pearson correlation coefficients >0.98).

The marker effects inferred by T-Nel, T-Bay and IM are shown in Fig. 1. Effects estimated by these methods were largely consistent between the methods, and association signals were often observed around known heading date genes, Hd1 (Yano et al. 1997, 2000), Hd2 (Yano et al. 1997), Hd3a (Yano et al. 1997; Monna et al. 2002), Hd5 (Yano et al. 1997) and Hd6 (Takahashi et al. 2001) (Fig. 1). A major difference between the methods is that a relatively large signal was detected in chromosome 12 for T-Nel and T-Bay. The associated marker was R1684.

Leave-one-out cross-validation

Results for LOEO are presented in Table 3. RMSE and Pearson correlation calculated for each environment are presented in Table S3. The least RMSE was observed in IM. IM also showed the regression coefficient (slope) that was closest to one, suggesting less shrunken prediction. In LOLO, GP showed the least RMSE and IM showed the second least (Tables 3 and S4). The slope of GP was also closest to one and that of IM was secondly closest. Thus, it is suggested that GP is suitable for heading date prediction of untested lines when the training data set at the target environment is available. In LOELO, IM showed the least RMSE with the least shrinkage tendency (Tables 3 and S5), suggesting the usefulness of the integrated approach in prediction of untested lines in untested environments. The RMSE of GP was greater than the RMSEs of the two-step and integrated approaches. Considering that GP showed the least RMSE in LOLO, the results suggest that genomic prediction of heading dates is sensitive to the differences in environmental conditions between the training and target environments.

Table 3 Root mean squared errors (RMSEs) (day) and Pearson correlation between predicted and observed values and regression coefficients of predicted values on observed values (slope) in the three leave-one out cross-validation schemes

In each LOO experiment, IM consistently showed higher accuracy than the two-step approach, regardless of the optimization algorithms for the DVR model, Nelder–Mead and MCMC algorithms (Table 3). This suggests that prediction accuracy can be improved by integrating genomic prediction with crop modelling in a single hierarchical model.

IM can estimate the posterior variance of predicted DTH. To investigate the empirical reliability of the posterior variance, we calculated the frequency with which the observed (true) DTH was included in the credible intervals of the predicted DTH (Table 4). In each LOO scheme, posterior variance is suggested to have been underestimated. This underestimation was most severe in LOEO. However, in LOLO and LOELO, the underestimation was relaxed and the posterior variances would have reasonable sizes.

Table 4 Frequencies for the true DTH being included in the credible intervals estimated by the integrated model in each leave-one-out cross-validation (LOO) scheme

Discussion

We developed a novel method that integrated whole-genome markers with a phenological model in predicting the heading date of Asian rice, and compared this method with conventional genomic prediction, phenological modelling and two-step approach. The proposed method IM showed the best accuracy in LOEO and LOELO and the second best in LOLO. In each LOO scheme, IM was superior to the two-step approach (T-Nel and C-Nel) pertaining to both prediction accuracy and the amount of shrinkage. An approach that integrates genomic prediction with crop models will be statistically more challenging than the two-step approach and will require the development of special algorithms and programs (e.g. the variational Bayesian inference combined with MCMC in the present study). Thus, it is important to demonstrate the superiority of the integrated approach over the two-step approach to support the development of the integrated approach. Our study has several limitations: the backcross inbred line population used here is not usually used for the development of new cultivars; heading date in rice is considered to have a relatively simple genetic architecture (Matsubara et al. 2014) and the phenological model used in this study, the DVR model, is relatively simple, consisting of only a few functions. Nevertheless, for the first time, our results demonstrate superiority of the integrated approach in predicting plant phenotypes using real data. As the first empirical evidence, this study should encourage further development and application of the integrated approach in predicting plant phenotypes.

An issue in our integrated approach (IM) is the computational time required. When Nelder–Mead optimization is conducted for each line in tandem with the full data analysis, the computational times of IM and T-Nel are similar. However, the Nelder–Mead optimization can be conducted in parallel, which significantly reduces the time required. The computational burden of IM mainly stemmed from the MCMC simulation used to infer the posterior distribution of the DVR model parameters. This can be reduced by decreasing the number of iterations of the MCMC (600) and/or the number of entire optimization steps (100) (see Supplementary Methods), though these reductions may prevent the solution from reaching the global optimum. The number of iterations (600 and 100) was set arbitrarily in this study, but when using IM in practical applications, it would be necessary to assess the effect of the number of iterations on the efficiency of optimization.

In all the cross-validation schemes, IM outperformed T-Nel and T-Bay in terms of prediction accuracy. This superiority can be attributed to several factors: consideration of genetic dependency between lines in deriving the DVR model parameters, consideration of uncertainty of the DVR model parameter derivation in marker effect estimation and avoiding loss of information by simultaneously deriving the DVR model parameters and the marker effects on these. The relative contributions of these factors to the improvement in accuracy are unclear. Note that in C-Bay (i.e. the first step of T-Bay), the derivation of the DVR model parameters was not independent between lines because the variances of the model parameters \(\sigma_{G}^{2}\), \(\sigma_{\alpha }^{2}\) and \(\sigma_{\beta }^{2}\) were estimated from the parameter values of all the lines (see Supplementary Methods). Thus, information of the DVR model parameters was shared among lines via the variances. Nevertheless, C-Bay and T-Bay were not always superior to C-Nel and T-Nel in the cross-validation schemes tested. Thus, sharing information between lines may not largely contribute in improving prediction accuracy of IM. Further work will be needed to elucidate the reasons for the superiority of the integrated approach.

Bogard et al. (2014) predicted heading dates in wheat using a two-step approach that was based on detected QTLs. The authors reported that, when untested lines were predicted at untested environments, the RMSE was 6.3 days. Although the reported RMSE was lower than those obtained by T-Nel, T-Bay and IM in the present study, prediction accuracy will be influenced by various factors other than models and inference methods; therefore, direct comparison of accuracy between the studies would not be meaningful. These factors include: the number of lines and environments used for training (210 lines and 10 environments in Bogard et al. (2014)), the predictive ability of phenological models, genetic relationships between lines and variation in environmental conditions.

IM is a two-level hierarchical model, where a regression model is embedded in each level: the first level is the DVR model and the second level is the genomic prediction models, i.e. multi-locus regression models, which regard the DVR model parameters as the response variables. This model structure is similar to the structures used for functional QTL mapping by Malosetti et al. (2006), Sillanpaa et al. (2012) and Li et al. (2014). These studies, respectively, fitted logistic, polynomial and linear curves to temporary trends and regressed the curve parameters on multi-locus genotypes in single hierarchical models. Comparison between a single hierarchical model and a two-step approach that derives the curve parameters and the marker effects on the parameters separately was conducted by Li et al. (2014) using growth-related traits of conifers. The authors reported that both the methods detected similar sets of QTLs when the sample size was large (≥250); whereas, when the sample size was small (<100), more association signals were detected by using the two-step approach, although it was unclear whether these associations were true. In our results, the integrated (IM) and the two-step methods (T-Nel and T-Bay) detected similar sets of association signals in the full data analysis (Fig. 1). This consistent result across the methods could be because of the relatively large sample size (176) and relatively simple genetic architecture of heading dates in the used population. The fact that both the integrated and two-step approaches captured similar association signals may suggest that the superiority of IM in prediction accuracy was because of the accuracy of marker effect size estimation rather than QTL detection power. The hierarchical model by Technow et al. (2015) also has a two-level structure; its first level is the crop growth model and the second is the genomic prediction model. However, one difference is that, although the model input traits (yTLN, yAM, ySRE, and yMTU in Technow et al. 2015) are regressed on genome-wide markers in the second level, no residual term was assumed (see Eq. 2 in Technow et al. 2015). This means that the complete variance of the model input traits was assumed to be explained by the additive effects of the markers, which may be a strong assumption in real data analysis.

In LOELO, the RMSE of GP was greater than the RMSEs of the two-step and the integrated approaches, particularly at Tsukuba2007 and Tsukuba2008E (Table S5). For prediction for these two environments, GP was trained with records at Tsukuba2009 and Ishikawa, respectively, as described in the “Results” section. However, the root mean squared differences of DTH between the target and training environments (i.e. between Tsukuba2007 and Tsukuba2009 and between Tsukuba2008E and Ishikawa), were 10.3 and 17.9, respectively, which were the largest two values among the differences between the target and training environments. Thus, the results of LOELO clearly suggest that accuracy of genomic prediction depends on the similarity between the target and training environments, which is also illustrated by morphological traits in loblolly pine (Resende et al. 2012). On the other hand, the phenological model-based methods, IM, T-Nel and T-Bay did not appear to show the dependency on the similarity. GP can predict plant performance at multiple environments by modelling the covariance structure between genotypic (polygenic) effects between environments in a mixed model framework (Burgueno et al. 2012). However, this multi-environment mixed model is unable to predict phenotypes in environments where no line has been cultivated. To overcome this deficiency, Heslot et al. (2014) proposed mixed models that incorporate stress covariates derived from crop models to model genotype–environment interaction and to predict phenotypes in untested environments. The authors performed multiple and separated steps in parameter inference, which may impair the predictive ability. Their approach is, however, an alternative to integrating genomic prediction with crop models; it will be interesting to compare the predictive ability with that of the crop model-based methods.

Author contribution statement

AO conceived and developed the integrated Bayesian method, conducted all statistical analyses and drafted the manuscript. MW developed the Nelder–Mead optimization procedure, and HN collaborated on this development. TM, T. Hasegawa and HN conducted the field trials and collaborated on the preparation of the manuscript. T. Hayashi collaborated on statistical analysis. T. Hasegawa and HI were involved in the conception and design of the study, administrative support and supervision. All authors have read and approved the final manuscript.