Introduction

Wheat (Triticum aestivum L.) is one of the most important crops, providing one-fifth of the total calories for the world’s population (Reynolds et al. 2011). Successful wheat varieties must display high and stable yield performances in their target environments. Increasing yield stability is, therefore, a central goal in wheat breeding. Yield performance of a genotype in a specific environment is the result of numerous physiological and biochemical processes (Ceccarelli et al. 1991). Thus, yield stability is the complex outcome of the reaction of a genotype to changing environmental factors (Muhleisen et al. 2014a).

Yield stability can be defined using either a static or dynamic concept (Becker and Léon 1988). Genotypes with constant performances across varying growing conditions exhibit high static yield stability. Static stability can be assessed by a linear regression of individual genotype performance on the environmental means (Finlay and Wilkinson 1963). Static yield stability is often associated with low yield level and, therefore, of low economic relevance in particular for high-yielding target environments (Becker and Léon 1988). In the dynamic yield stability concept, a stable genotype shows only slight deviation from the response to varying growing conditions. Numerous biometrical parameters have been proposed to assess the dynamic stability (Becker and Léon 1988; Lin et al. 1986; Piepho 1998) and one popular approach estimates genotype-specific genotype-by-environment interaction variances (Wricke 1962; Shukla 1972).

Precise evaluation of dynamic and static yield stability requires large-scale phenotyping with recommendations ranging up to 200 environments (Piepho 1998). Muhleisen et al. (2014a) proposed to assess the heritability for yield stability based on experimental data. The heritability of yield stability is estimated in three steps (Muhleisen et al. 2014a): (1) randomly divide the total number of environments into two subsets; (2) evaluate yield stability in both subsets; and (3) estimate correlation coefficient between yield stability parameters of both subsets. The heritability of yield stability is then estimated using the squared correlation coefficient. The above procedure to estimate heritability of yield stability; however, can be biased due to the assumption that the residuals for both subsets are independent (Muhleisen et al. 2014a).

Several alternatives exists which potentially facilitate a more economic and accurate estimation of yield stability: (1) dissect the genetic architecture underlying stability and apply genomic-assisted breeding, (2) use indirect traits to assess yield stability, or (3) select fewer contrasting and more diagnostic environments. The genetic architecture of phenotypic stability was studied for wheat (Huang et al. 2016), barley (Kraakman et al. 2004), and rye (Wang et al. 2015). The findings suggest a very complex genetic architecture underlying yield stability requiring large mapping populations to identify reliable QTL. The potential of using indirect traits which are tightly correlated with yield stability has been studied only for barley (Muhleisen et al. 2014a), but not for wheat. The findings for barley indicate low efficiency of indirect selection for yield stability. Moreover, despite its potential, selecting subsets of more diagnostic environments have not been studied so far.

Hybrids are expected to exhibit higher grain yield performance and yield stability compared to inbred lines due to the exploitation of heterosis and individual buffering effects (Hallauer et al. 2010). Yield stability of hybrids and lines can be compared contrasting the means of both groups or by inspecting the differences of individual pairs of genotypes. Assessing the difference between individual genotypes requires intensive phenotyping in many environments, but group-contrast can be implemented with a limited number of environments (Muhleisen et al. 2014b). Examining group-contrasts revealed for both selfing and outcrossing crops that hybrids were more yield stable than lines (Muhleisen et al. 2014b; Léon 1994). Comparing individual pairs of wheat hybrids and lines was not yet done with the required phenotyping intensity.

Our study is based on five series of official registration trials of winter wheat conducted in Germany. The objectives of the present study were to (1) develop and implement an improved heritability estimate to examine the required scale of phenotyping for assessing yield stability, (2) compare yield performance and yield stability of hybrids and inbred lines, (3) investigate the association of agronomic traits with yield stability, and (4) explore the possibility of selecting subsets of environments allowing to portray large proportion of the variation of yield stability.

Materials and methods

Field design and plant materials

The yield data consisted of five series of official winter wheat registration trials in Germany. For each series, the Federal Plant Variety Office (Bundessortenamt, Hannover) tested different genotypes and checks in multi-locations in three consecutive years (Supplementary Fig. 1). After each year some genotypes were rejected and checks could be changed. The total number of environments, i.e., location-by-year combinations, in each series varied from 40 to 50 (Supplementary Fig. 1) since some environments were discarded due to frost damage and high field heterogeneity.

The field design followed a split plot design with crop protection as main-plot factor and genotype as sub-plot factor. The main-plot factor crop protection had two intensity levels. Intensity level one entailed reduced crop protection where no fungicides were applied and growth regulators only in exceptional cases at up to 50% of the local custom application rate. In intensity level two, fungicides and growth regulators were applied according to local practices. Each intensity level comprised two replicates and 30–132 genotypes and all tested genotypes within each replicate of each intensity level were completely randomized. In exceptional environments, three replicates were performed or additional genotypes were included due to the interest of resident agricultural departments (Länderdienststellen). However, these genotypes were not considered in the present study.

The yield data (Mg ha−1; moisture content 14%) under intensity level two was used for our study since farmers usually apply intensive crop protection. In addition, the Federal Plant Variety Office also provided scores under intensity level one for several agronomic traits. These traits were date of heading and date of yellow maturity (days after the beginning of the years), number of ears, grain dry matter (%), plant height (cm), resistance to diseases (brown rust resistance, powdery mildew resistance, Septoria tritici blotch resistance, and Fusarium head blight resistance; scores from 1 to 9 for low to high susceptibility), deficiencies after winter (scores from 1 for no deficiencies to 9 for all plants killed) and lodging before harvest (scores from 1 to 9 for high to low resistance).

Phenotypic analysis

All the series were analyzed separately due to small number of shared checks and environments across series (Supplementary Fig. 1). We adopted a two-stage analysis procedure to estimate the adjusted genotype means. After outlier tests (Anscombe and Tukey 1963), we estimated the adjusted genotype means for each environment, i.e., year-by-location combination, and repeatabilities based on the following model:

$$y_{in} = \mu + g_{i} + r_{n} + \varepsilon_{in} ,$$
(1)

where y in was the yield performance of ith genotype in nth replicate, μ was the intercept, g i was the genetic effect of the ith genotype, r n was the effect of nth replicate, and ɛ in was the corresponding residual. For estimation of adjusted genotype means, only the residual and replication effects were assumed to be random effects with zero means and variances equal to \(\sigma_{\varepsilon }^{2}\) and \(\sigma_{r}^{2}\). For estimating the repeatability, only μ was treated as fixed effect. A few environments had very low repeatability estimates. Therefore, we removed environments with repeatability estimates below 0.5, which corresponds to 5% of the environments.

To provide an overview of the different sources of the phenotypic variation and to estimate the heritability, we fitted the following model:

$$y_{ijkn} = \mu + g_{i} + l_{j} + y_{k} + (ly)_{jk} + (gl)_{ij} + (gy)_{ik} + (gly)_{ijk} + r_{jkn} + \varepsilon_{ijkn} ,$$
(2)

where y ijkn was the yield performance of the ith genotype in the nth replicate of the jth location in the kth year, μ was the intercept, g i was the genetic effect of the ith genotype, l j was the effect of the jth location, y k was the effect of the kth year, (ly) jk was the year-by-location interaction effect of the jth location in the kth year, (gl) ij was the genotype-by-location interaction effect of the ith genotype in the jth location, (gy) ik was the genotype-by-year interaction effect of the ith genotype in the kth year, (gly) ijk was the three way interaction effect of the ith genotype in the jth location of the kth year, r jkn was the effect of the nth replication in the jth location of the kth year, and ɛ ijkn was the residual of y ijkn . The g i , l j , y k , (ly) jk , (gl) ij , (gy) ik , (gly) ijk , and r jkn were assumed to be random effects with zero means and variances equal to \(\sigma_{g}^{2}\), \(\sigma_{i}^{2}\), \(\sigma_{y}^{2}\), \(\sigma_{ly}^{2}\), \(\sigma_{gl}^{2}\), \(\sigma_{gy}^{2}\), \(\sigma_{gly}^{2}\), and \(\sigma_{r}^{2}\). The ɛ ijkn was assumed to be random effects with zero means and heterogeneous variances (\(\sigma_{{\varepsilon_{(jk)} }}^{2}\)) among different year-by-location combinations and the effect μ was assumed to be a fixed effect. Broad-sense heritability of grain yield was estimated as follow:

$$h^{2} = \frac{{\sigma_{g}^{2} }}{{\sigma_{g}^{2} + \frac{{\sigma_{gl}^{2} }}{{{\text{Nr}} . {\text{Loc}}}} + \frac{{\sigma_{gy}^{2} }}{{{\text{Nr}} . {\text{Year}}}} + \frac{{\sigma_{gly}^{2} }}{{{\text{Nr}} . {\text{Loc}}\; \times \;{\text{Nr}} . {\text{Year}}}} + \frac{{\sigma_{\varepsilon }^{2} }}{{{\text{Nr}} . {\text{Loc}} \times {\text{Nr}} . {\text{Year}}\; \times \;{\text{Nr}} . {\text{Rep}}}}}} ,$$
(3)

where \(\sigma_{g}^{2}\) is the genotypic variance, \(\sigma_{gl}^{2}\) was variance of genotype-by-location interaction, \(\sigma_{gy}^{2}\) was variance of genotype-by-year interaction, \(\sigma_{gly}^{2}\) was variance of genotype-by-location–by-year interaction, and \(\sigma_{\varepsilon }^{2}\) was the average variance of the residuals across environments. Nr.Loc, Nr.Year, and Nr.Rep represent number of locations, number of years, and number of replicates, respectively.

Estimation of yield stability parameters

We followed the approach to estimate yield stability parameters proposed by Piepho (1999). The approach suggested by Piepho (1999) was deduced based on the following model assuming homogeneous residual variance across environments:

$$y_{imn} = \mu + g_{i} + e_{m} + (ge)_{im} + r_{mn} + \varepsilon_{imn,}$$
(4)

where y imn was the yield performance of the ith genotype in the nth replicate of the mth environment, μ was the intercept, g i was the genetic effect of the ith genotype, e m was the effect of the mth environment, (ge) im was the genotype-by-environment interaction effect of the ith genotype in the mth environment, r mn was the effect of the nth replication in the mth environment, and ɛ imn was the residual of y imn . The effect μ was assumed to be a fixed effect and the effects g i , e m , (ge) im , r mn , and ɛ imn were assumed to be random effects with zero means and variances equal to \(\sigma_{g}^{2}\), \(\sigma_{e}^{2}\), \(\sigma_{ge}^{2}\), \(\sigma_{r}^{2}\), and \(\sigma_{\varepsilon }^{2}\). Starting with model [4], Piepho (1999) elaborated computational efficient approximations to estimate yield stability parameters which are based on the adjusted genotype means of each environments:

$$y_{im} = \mu + g_{i} + v_{m} + f_{im} ,$$
(5)

where y im was the adjusted mean of the ith genotype in the mth environment, μ was the intercept, g i was the genetic effect of the ith genotype, v m was the effect of the mth environment, and f im was the genotype-by-environment interaction effect of the ith genotype in the mth environment confounded with the residual effects. Only v m and f im were assumed to be random effects with zero means and variances equal to \(\sigma_{v}^{2}\) and \(\sigma_{f}^{2}\).

For Shukla’s yield stability variance (SVAR; Shukla 1972), a stability model derived from model [5] was used to extract genotype-specific variance (\(\sigma_{f(i)}^{2}\)) by fitting SAS statement “REPEATED GEN/SUB = ENV TYPE = UN (1)”. The statement “REPEATED GEN/SUB = ENV” specifies the identity matrix for the effect f im and the statement “TYPE = UN (1)” defines a banded main diagonal covariance structure. The confounded variance (\(\sigma_{f}^{2}\)) can be decomposed into variance of genotype-by-environment interaction (\(\sigma_{ge}^{2}\)) and the residual variance (\(\sigma_{\varepsilon }^{2}\)) in model [4]. Because the data within each environment is well balanced by replicates, the adjusted genotype means for each environment is then equivalent to the genotype means. When comparing model [4] and [5], we observed that \(y_{im} = \sum\nolimits_{n} {{{y_{imn} } \mathord{\left/ {\vphantom {{y_{imn} } {{\text{Nr}} . {\text{Rep}}}}} \right. \kern-0pt} {{\text{Nr}} . {\text{Rep}}}}}\), \(v_{m} = e_{m} + \sum\nolimits_{n} {{{r_{mn} } \mathord{\left/ {\vphantom {{r_{mn} } {{\text{Nr}} . {\text{Rep,}}}}} \right. \kern-0pt} {{\text{Nr}} . {\text{Rep,}}}}}\) and \(f_{im} = (ge)_{ij} + \sum\nolimits_{n} {{{\varepsilon_{imn} } \mathord{\left/ {\vphantom {{\varepsilon_{imn} } {{\text{Nr}} . {\text{Rep}}}}} \right. \kern-0pt} {{\text{Nr}} . {\text{Rep}}}}}\). Then, \(\sigma_{v}^{2} = \sigma_{e}^{2} + \sigma_{r}^{2} /{\text{Nr}} . {\text{Rep}}\) and the variance of f im for the ith genotype is \(\sigma_{f(i)}^{2} = \sigma_{ge(i)}^{2} + \sigma_{\varepsilon }^{2} /{\text{Nr}} . {\text{Rep}}\). The genotype-by-environment interaction variance of the ith genotype \(\sigma_{ge(i)}^{2}\) reflects its stability variance. Due to the use of mean data, we cannot directly estimate it in model [4], so we only estimate \(\sigma_{f(i)}^{2}\) and termed this variance as stability variance (SVAR). Since the \(\sigma_{\varepsilon }^{2}\) and \(\sigma_{r}^{2}\) in model [4] are constants for each series, using model [4] should not affect the ranking of stability variance. Small SVAR parameters indicate high stability according to the dynamic concept.

To estimate static stability parameters, we used the model proposed by Finlay and Wilkinson (1963):

$$y_{imn} = \mu + g_{i} + \lambda_{i} w_{m} + d_{im} + r_{mn} + e_{imn} ,$$
(6)

where λ i is the sensitivity (FW for the Finlay–Wilkinson model; Finlay and Wilkinson 1963) of the ith genotype to a latent environmental variable w m and d im is the deviation of ith genotype from the mth environment effect. Due to use of the adjusted genotype means, we implemented the model as described by Piepho (1999):

$$y_{im} = \mu + g_{i} + \lambda_{i} w_{m} + s_{im} ,$$
(7)

where \(s_{im} = d_{im} + \sum\nolimits_{n} {{{\varepsilon_{imn} } \mathord{\left/ {\vphantom {{\varepsilon_{imn} } {{\text{Nr}} . {\text{Rep}}}}} \right. \kern-0pt} {{\text{Nr}} . {\text{Rep}}}}}\) is a random deviation from the environments confounded with the residuals with variance \(\sigma_{s}^{2} = \sigma_{d}^{2} + \sigma_{\varepsilon }^{2} /Nr.Rep\). The term λ i w m is over-parameterized and, therefore, we set the variance of w m as \(\sigma_{w}^{2}\)  = 1 following Piepho (1999). The genotype-specific sensitivity was estimated by fitting the SAS option “REPEATED GEN/SUB = ENV TYPE = FA1 (1)” defining an equal diagonal first-order factor analytic covariance structure. The interpretation of sensitivity is straightforward. A genotype with small sensitivity is stable according to the static concept. The variance of d im (\(\sigma_{d}^{2}\)) is independent to the environment effects and can be used as dynamic stability measure in analogy to the stability variance. Since the analysis was based on adjusted genotype means, we could only estimate the variance of s im (\(\sigma_{s}^{2}\)) and termed this variance deviation variance. The effect s im is assumed to have common variance in Finlay–Wilkinson model.

The Eberhart–Russell model (ER; Eberhart and Russell 1966) differs from the Finlay–Wilkinson model by considering genotype-specific variance (\(\sigma_{d(i)}^{2}\)) for d im . The genotype-specific variance \(\sigma_{d(i)}^{2}\) was obtained by replacing the SAS option “TYPE = FA1 (1)” from FW model with SAS option “TYPE = FA (1)”, which specifies a first-order factor analytic covariance structure. Due to the use of adjusted genotype means, we can only estimate genotype-specific s im variance \(\sigma_{s(i)}^{2}\) and we have \(\sigma_{s(i)}^{2} = \sigma_{d(i)}^{2} + \sigma_{\varepsilon }^{2} /{\text{Nr}} . {\text{Rep}}\). As stated in Shukla’s stability variance model, the difference between \(\sigma_{s(i)}^{2}\) and \(\sigma_{d(i)}^{2}\) is a constant for each series, and \(\sigma_{s(i)}^{2}\) was termed as deviation variance (DVAR; Becker and Léon 1988; Shukla 1972).

Association of agronomic traits and yield stability

Phenotypic correlation coefficients were calculated between grain yield and yield stability parameters for the five series. Genotypes with standardized yield performance (adjusted genotype means minus series’ mean performance) in different series were combined into a set to increase the robustness of the results. Scatter plots were drawn to show the association between grain yield and yield stability parameters. In addition, comparison on grain yield and yield stability between hybrids and lines was also performed.

Each agronomic trait (brown rust resistance, deficiencies after winter, date of heading, date of yellow maturity, ears, Fusarium head blight resistance, grain dry matter, lodging before harvest, plant height, powdery mildew resistance, and Septoria tritici blotch resistance) in each series was analyzed with models [1] and [5]. The adjusted genotype means across environments were then extracted and standardized with the following formula:

$$z_{i} = (x_{i} - \overline{x} )/s_{i} ,$$
(8)

where z i is the standardized value, x i is the adjusted mean of the ith genotype, \(\overline{x}\) the mean of adjusted means of all genotypes, and s i the corresponding standard deviation.

For each series and the response variables, i.e., YIELD (grain yield), SVAR, DVAR, ER, and FW, a multiple linear regression was performed based on standardized values of all 11 agronomic traits. For model selection, GLMSELECT procedure of the SAS system (Inc 2011) was used to select significantly associated variables and the selected variables entered the regression analysis with the REG procedure of the SAS system (Inc 2011) and regression coefficients and adjusted R 2 were recorded. To assess how transferable the regression coefficients were across series, each regression coefficient was averaged across four series to predict the remaining series.

Estimation of heritability of yield stability

Our study aimed to reveal the association between the magnitude of heritability and the number of test environments. We put forward two different strategies to verify how many environments are required for precise estimation of yield stability parameters and contrasted this with the heritability of grain yield. For strategy I, environments were split into two halves: one half was defined as the reference set, and a sample of three or more environments (up to all the environments in the second half) from the other half was used as test set. If the number of environments was uneven, we used the larger half as reference set. For strategy II, we used all available environments as reference set. As test set we randomly sampled subset of three or more environments (up to N−1 environments, N refers to total number of environments). For each series, only genotypes which were tested in all the environments (balanced set, Table 1) were used to avoid a potential bias and accelerate computation and 1000 runs of sampling were performed for each size of test sets.

Table 1 Heritability (h 2) and variance components (\(\sigma_{g}^{2}\) genotypic, \(\sigma_{l}^{2}\) location, \(\sigma_{y}^{2}\) year, \(\sigma_{ly}^{2}\) location-by-year, \(\sigma_{gl}^{2}\) genotype-by-location interactions, \(\sigma_{gy}^{2}\) genotype-by-year interactions, \(\sigma_{gly}^{2}\) genotype-by-location-by-year interactions, \(\sigma_{\varepsilon }^{2}\) error) of grain yield (Mg ha−1) for the complete set (com) and the corresponding balanced set (bal) of wheat genotypes evaluated in five series of official registration trials in Germany

For strategy II, the Pearson correlation between estimates of reference and test set was recorded. For strategy I, Pearson correlation was corrected for the impact of the covariance between residuals of the reference and test sets (for details see “Appendix”). The average correlation was calculated for each number of test environments. For strategy II, the average correlation is assumed to be an estimate of the square root of heritability; for strategy I, the average correlation is not an estimate of the square root of heritability, since the estimated values of grain yield and stability parameters in the reference set were not the true values. To adjust for that under estimation, the correlations were divided by the square root of the correlations estimated with the maximum number of test environments. For a detailed explanation of the adjustment see the Appendix. The heritability estimated with strategy I was denoted throughout the manuscript as independent heritability \(h_{\text{ind}}^{2}\) and that of strategy II as dependent heritability \(h_{\text{dep}}^{2}\). To make the result comparable, we also implemented strategy I and II to estimate the heritability for grain yield.

Extrapolated heritabilities for an arbitrary number of test environments

Estimates of \(h_{\text{ind}}^{2}\) were available for up to 23 test environments. Therefore, we adopted a formula to describe the relationship between heritability and the number of test environments and infer the expected heritability for data set with more than 23 environments. Heritability (h 2) can be expressed with the following equation:

$$h^{2} = \frac{{\sigma_{g}^{2} }}{{\sigma_{g}^{2} + \frac{{\sigma_{f}^{2} }}{n} }} = \frac{1}{{1 + \frac{{\sigma_{f}^{2} }}{{n*\sigma_{g}^{2} }} }},$$
(9)

where \(\sigma_{g}^{2}\) is the genotypic variance, \(\sigma_{f}^{2}\) the genotype-by-environment interaction variance confounded with the error variance, and n the number of test environments. Assuming that \(\frac{{\sigma_{f}^{2} }}{{\sigma_{g}^{2} }}\) is a constant ratio for a given population of environments and genotypes leads to the equation:

$$h^{2} = \frac{1}{{1 + \frac{c}{n} }} .$$
(10)

For a given value of c, heritability can be estimated for any number of test environments. We assumed that c is specific for each series. To obtain estimates for c, a non-linear regression with formula [10] based on the estimated values of subdivision heritability was performed for each series for parameter grain yield. For stability parameters, n was defined as number of test environments minus two. Subsequently heritability for both yield stability and grain yield were predicted with the estimated values of c for up to 50 test environments.

Selection of subsets of environments facilitating to assess yield stability reliably

The genotype-by-environment interaction effects were extracted with model [4] and were used to estimate Euclidean distances among all pairs of environments. Clustering based on Euclidean distance of environment-specific genotype-by-environment interaction effects was performed. The K-Medoids procedure (Reynolds et al. 2006) was applied to select K (from three to eight) representative environments. To eliminate the genotype- and environment-based bias, 1000 resampling runs on genotypes and environments were performed separately. Sub-population containing 80% of environments or 80% of genotypes from the balanced set was randomly sampled in each resampling run. The representative environments were recorded in each run. The corresponding K (from 3 to 8) environments with highest detection frequencies across two cross-validation approaches were termed as optimal environments. Field data from each number of optimal environments were used to estimate grain yield and yield stability as stated above. The heritabilities were calculated by squaring the correlation coefficients that were calculated between estimates based on optimal environments and all environments. Efficiency of selection strategy was shown by comparing heritabilities with dependent heritability (\(h_{\text{dep}}^{2}\)) under the corresponding number of environments.

Results

Decomposition of the phenotypic variance

Phenotyping was performed in multi-location-year field trials with a substantial fraction of the locations varying across years (Supplementary Fig. 1). After quality check, 11 out of 226 environments, i.e., year-by-location combination, were removed because estimates of repeatability were below 0.5. Intensive phenotyping led to estimates of heritability above 0.9 (Table 1). The phenotypic variance components involving genotypes and their interactions with locations and years were stable across the five series. The variance of genotype-by-location-by-year interactions was much higher than those of genotype-by-location and genotype-by-year interactions.

Fig. 1
figure 1

Observed (dots) and predicted (lines) values of independent heritability (\(h_{\text{ind}}^{2}\)) for grain yield (GY) and yield stability with an increasing number of test environments. SVAR and DVAR refer to the dynamic yield stability estimated as stability variance and deviation variance, respectively. ER and FW denote the static yield stability estimated following the Eberhart–Russell model and the Finlay–Wilkinson model, respectively

Estimates of heritability for yield stability are lower than those for grain yield

The heritability of grain yield measured as \(h_{\text{ind}}^{2}\) and \(h_{\text{dep}}^{2}\) increased rapidly with increasing number of environments and plateaued already at around 10 environments (Figs. 1 and 2). This was not the case for \(h_{\text{ind}}^{2}\) for static and dynamic yield stability which increased throughout the full range of number of environments examined. The estimated \(h_{\text{ind}}^{2}\) for the full set of around 40 environments was on average larger for grain yield (0.97) than for static (0.75) and dynamic yield stability (0.66). For \(h_{\text{dep}}^{2}\), we observed contrasting trends for static and dynamic yield stability (Fig. 2): Dynamic yield stability heritability increased linearly with increasing number of environments for all series except Series B. In contrast, static yield stability heritability showed a logarithmic growth with increasing number of environments for all series except Series E.

Fig. 2
figure 2

Dependent heritability (\(h_{\text{dep}}^{2}\)) for grain yield (GY) and yield stability with an increasing number of test environments. SVAR and DVAR refer to the dynamic yield stability estimated as stability variance and deviation variance, respectively. ER and FW denote the static yield stability estimated following the Eberhart–Russell model and the Finlay–Wilkinson model, respectively

Grain yield performance and yield stability of wheat hybrids versus pure lines

Static yield stability was significantly (P < 1E − 06) correlated with grain yield (Supplementary Fig. 2). Thus, improving static stability entails a penalty in grain yield performance. In contrast, dynamic yield stability was not significantly (P < 0.95) associated with grain yield performance. The average grain yield performance of the group of hybrids was significantly (P < 1E − 06) larger than that of the group of inbred lines (Fig. 3). Neither dynamic nor static stability differed (P > 0.20) between the groups of hybrids and inbred lines.

Fig. 3
figure 3

Violin plots of standardized grain yield performance (GY) estimated based on model [5] and yield stability parameters of the inbred lines. Red horizontal lines denote the performance of the evaluated hybrids. SVAR and DVAR refer to the dynamic yield stability estimated as stability variance and deviation variance, respectively. ER and FW denote the static yield stability estimated following the Eberhart–Russell model and the Finlay–Wilkinson model, respectively

Agronomic traits influencing yield stability

Regression analysis of individual series revealed that dynamic and static yield stability as well as grain yield were significantly (P < 0.05) associated with varying agronomic traits (Table 2). Grain yield was impacted in 40% of the series from the four traits brown rust resistance, deficiencies after winter, Fusarium head blight resistance, and plant height. The four traits were also in 50% of the series significantly (P < 0.05) associated with static yield stability, but only in 12.5% of the series with dynamic stability. The adjusted R 2 values averaged across the five series was 0.22 for grain yield, 0.21 for dynamic stability, and 0.16 for static stability. We validated the regression model of the agronomic traits established in four series and predicted yield stability of the remaining series. The R 2 values decreased by 93.8% for dynamic stability, by 54.1% for static stability, and by 41.7% for grain yield.

Table 2 Standardized regression coefficients, adjusted and validated R 2 of the multiple regression models regressing different response variables

Selecting subsets of environments portraying large proportion of the variation of yield stability

We applied resampling procedure to examine the robustness of selecting representative environments with the K-Medoids approach and observed a high consistency of the chosen subsets (Supplementary Table 1). Interestingly, large proportion of the variation for dynamic yield stability can be portrayed selecting representative subsets of environments and allowing thus to decrease the number of environments considerably (Table 3): For instance, in Series D, dynamic yield stability estimated in four environments explained 76% of the variation observed in the full set of 40 environments. The observed benefit was large compared to a random scenario with a difference for instance for Series D and four environments achieving an advantage of explained proportion of variance for selected subsample of environments (δ) of 69%.

Table 3 Proportion of variance of dynamic yield stability explained in a reduced set of environments selected with the K-Medoids approach in comparison to the dynamic yield stability observed in the total number of environments (\(r_{p}^{2}\))

In addition, we observed across the five series a variation in the number of environments that resulted in the largest proportion of variation of dynamic yield stability portrayed in the subsamples. The optimum was strongly associated with the number of distinct subgroups of environments observed in the clustering based on the matrix of genotype-by-environment interaction effects (Supplementary Fig. 3).

Discussion

Two-stage analysis procedure to estimate yield stability

We followed the two-stage approach to estimate yield stability parameters proposed by Piepho (1999), which was deduced assuming homogeneous residual variance across environments. The main reason for selecting this analysis strategy was that a two-stage approach is computationally less demanding than a one-stage procedure. Different methods of two-stage analyses of plant breeding trials have been compared with a one-step analysis assuming heterogeneous residual variance across environments (Möhring and Piepho 2009). Two-stage analysis assuming homogeneous residual variance across environments and, thus, without weighting means from the first stage, also produced acceptable results for estimating the genotypic effects. We observed that the residual variance components of individual environments ranged from 0.02 to 0.81 (Supplementary Fig. 5), which poses the question whether the simplification of homogeneous residual variance across environments affected our estimates of yield stability parameters or not. We analyzed for each series the balanced data set, i.e., genotypes which were tested in all the environments, using weighted means and estimated Shukla’s yield stability variance (SVAR; Shukla 1972). As weighting factor we used the squared standard error of a mean of a genotype in the respective environment (Method 2 outlined in Möhring and Piepho 2009). Shukla’s yield stability variance (SVAR; Shukla 1972) estimated using weighted means and unweighted analysis were tightly correlated with an average Pearson correlation coefficient of 0.96 (Supplementary Fig. 6). This suggests that the simplification of homogeneous residual variance across environments made in our study is not substantially affecting our findings.

Estimating the heritability of yield stability

Assessing the quality of phenotypic data is of central importance before using it either for selection or genome-wide mapping studies. We determined \(h_{\text{dep}}^{2}\) to measure the quality of the estimates of yield stability. Because of the dependency between test and reference sets, \(h_{\text{dep}}^{2}\) is a biased estimate for the heritability of yield stability. Nevertheless, it is possible to draw at least partial conclusions whether the number of environments is sufficient to reliably assess yield stability or not by inspecting the trends of \(h_{\text{dep}}^{2}\) with increasing number of environments. The trend of \(h_{\text{dep}}^{2}\) for dynamic yield stability is rather characterized by a linear increase, which indicates that the number of environments is likely too low to obtain precise estimates (Fig. 2). In contrast, static yield stability plateaued for 80% of the scenarios, which suggests that these parameters are accurately estimated. Determining \(h_{\text{dep}}^{2}\) is in particular of interest if the number of environments is low and do not facilitate to form independent reference and test sets with appropriated sizes.

As a further strategy to assess the quality of the estimates of yield stability we modified the approach elaborated by Muhleisen et al. (2014a) and relaxed the assumption of absence of covariance between residuals of the reference and test sets (\(h_{\text{ind}}^{2}\)). Comparing our approach with the method suggested by Muhleisen et al. (2014a) revealed that \(h_{\text{ind}}^{2}\) can be underestimated when ignoring the covariance between residuals of the reference and test sets (Supplementary Fig. 4). The official variety tests in Germany are based on at least 40 environments. The corresponding \(h_{\text{ind}}^{2}\) amounted to 0.75 for static and 0.66 for dynamic yield stability (Fig. 1). Although the reported heritability was based on the balanced set, in which the limited number of genotypes could lead to certain bias, both parameters may be considered at least as valuable complementary information in the national plant variety catalogue.

Hybrids outperformed lines with respect to grain yield, but not to yield stability

Hybrid wheat breeding enables to exploit the phenomenon of heterosis in contrast to line breeding (Longin et al. 2012). A recent large-scale study examining the yield potential of wheat hybrids revealed a maximum surplus of yield of 1 Mg ha−1 reflecting roughly 15 years of breeding progress (Zhao et al. 2015). Our results are in line with this finding as the group of hybrids outperformed the group of inbred lines (Fig. 3). Besides the benefits of an increased grain yield performance also higher yield stability estimates were reported previously for groups of hybrids as compared to groups of lines (for a detailed discussion, see Muhleisen et al. 2014b). In contrast, we observed no significant differences of the yield stability of groups of hybrids and lines (Fig. 3). This finding can be explained by the strong pre-selection and the limited number of hybrids. The evaluated genotypes suffered intensive selection not only during official variety registration trials, but also in the process of multi-stage selection, which already restricts the diversity of yield stability performance. The limited number of candidate varieties within each series, especially the few hybrids, could probably lead to a biased variation of yield stability for different populations. Moreover, hybrid wheat breeding is still a niche (Longin et al. 2012) with far lower investments as compared to line breeding. Thus, the selection intensity in wheat breeding in Germany is much higher for developing line than hybrid varieties. The situation is currently changing with a major emphasis to establish hybrid wheat breeding programs. Thus, it is of interest to monitor whether the absence of differences in yield stability of hybrids and lines observed in the official variety tests maintains stable or not in the coming years.

Phenotyping in a minimum of 30–40 environments is required to reliably measure yield stability of individual wheat genotypes (Fig. 1). In contrast to our study, this phenotyping intensity was never met in previous surveys in wheat (Huang et al. 2016; Muhleisen et al. 2014b; Robert 2002; Jalaluddin and Harrison 1993) which used a maximum of 22 environments (Jalaluddin and Harrison 1993). The analysis of the yield stability of individual genotypes in our study revealed undesired correlations between static yield stability and grain yield performance (Supplementary Fig. 2), which clearly suggests that static yield stability is not a proper selection target. Our finding is in accordance with previous reports in maize (Pham and Kang 1988), common bean (Mekbib 2003), soybean (Sneller et al. 1997), and barley (Muhleisen et al. 2014a). The analyses of dynamic yield stability of individual wheat genotypes in our study revealed absence of an association with grain yield performance (Supplementary Fig. 2), which is in line with previous findings in barley (Muhleisen et al. 2014a). Thus, breeding high-yielding and stable wheat varieties for a climate-smart agriculture is not hampered by undesired pleiotropic or linkage effects. Nevertheless, the challenge posed by the complexity of dynamic yield stability is to identify efficient selection strategies.

Dynamic stability cannot be improved by indirect selection

Indirect selection is an attractive alternative to improve agronomic traits that are difficult to phenotype (Falconer and Mackay 1996). Interestingly, grain yield assessed applying intensive pest control was still significantly associated with the resistance level of genotypes against Fusarium head blight (Table 2). Besides biotic stress resistance, improved tolerance against abiotic stresses occurring during winter reduced the plant height and prolonged life-time promoted grain yield. Although the impact of individual traits varied among series, validated R 2 amounted to 0.13 indicating that these traits are valuable for indirect selection of grain yield performance at least in early selection stages. For the trait dynamic yield stability; however, validated R 2 amounted to only 0.01 on average (Table 2). Consequently, indirect selection is not an interesting strategy for improving dynamic yield stability.

Selecting subsets of environments is a promising strategy to reduce phenotyping intensity still maintaining high accuracy to assess dynamic yield stability

We evaluated the option to select a diverse subsample of environments, which allowed portraying large proportion of the variation of the total number of environments (Table 3). The selection is based on the matrix of the genotype-by-environment interaction effects and revealed that large proportion of the variance observed in the full set of environments can be explained in the subsamples including less or equal than eight environments (Table 3). The advantage compared to random samples with the same size was at the optimum number of environments substantial with an absolute difference of at least 29% (Table 3). Thus, our implemented strategy is an interesting option to improve the efficiency of assessing the complex trait dynamic yield stability. The central question is whether the selection of environments can be done a priori and not posteriori as in our study. An a priori selection can only be implemented based on genotype-by-location interactions, because all interactions involving years are not predictable. Nevertheless, the variance components analysis revealed that genotype-by-location interactions contributed only to small proportion of the total phenotypic variance (Table 1). The estimate has to be interpreted with caution because of the unbalanced structure of the location-by-year combination (Supplementary Fig. 1) which may bias the estimation of the phenotypic variance components. Thus, further research is required studying the potential to identify stable clusters of locations in the Central European target environment.

Author contribution statement

Jochen Christoph Reif and Yusheng Zhao conceived the study; Vilson Mirdita prepared the data; Guozheng Liu and Yusheng Zhao analyzed the data; Guozheng Liu and Jochen Christoph Reif wrote the manuscript; all authors read, edited and approved the manuscript.