Introduction

The prediction of hybrid performance using information from parental inbred lines is of great interest to breeders. If successful, it can increase substantially the efficiency of breeding programs. Prediction methods using the genetic distance between the parental lines failed to predict reliably the hybrid performance of inter-group hybrids in plant breeding programs (Melchinger 1999). In contrast, prediction methods using markers linked to quantitative trait loci affecting the trait under consideration were successfully developed (Vulysteke et al. 2000; Schrag et al. 2006, 2007, 2009a, b).

In hybrid breeding, the germplasm is usually divided into genetically distant heterotic pools. Molecular marker-based genetic distances, graphically displayed by multivariate statistical methods, such as principal coordinate and cluster analyses can be helpful to accomplish this task (Reif et al. 2003, 2005).

With the advent of transcriptome analysis by gene expression profiling, a new lab technology has emerged. It can be employed in studying the molecular basis of heterosis (Bircheler et al. 2003). Stupar et al. (2008) found a correlation of genetic diversity and transcriptional variation. Guo et al. (2006) suggested that differential allele regulation may play an important role for heterosis. Springer and Stupar (2007) suggested that modified levels of gene expression in hybrids may contribute to heterotic phenotypes. The applicability of expression profiles, SNP markers, and metabolites for prediction of hybrid performance and heterosis was recently investigated with various approaches (Maenhout et al. 2009; Repsilber et al. 2009; Steinfath et al. 2009).

A primary focus of interest is to use gene expression data in functional analyses for detection of genes underlying agronomic traits (Thiemann et al. 2009). An alternative view on gene expression data is possible by disregarding all functional information on the analyzed genes and considering the transcript abundance levels as quantitative variables characterizing a genotype. These quantitative variables could then be used to construct distance measures between genotypes on the basis of their transcription profiles. In combination with multivariate methods, the distances could be employed for grouping of germplasm and in combination with linear models for prediction of hybrid performance and heterosis. To our knowledge, no previous investigation on transcriptome-based distance measures and their applications is available.

The goal of our study was to investigate the potential application of transcriptome-based distance measures in maize hybrid breeding programs. In particular, our objectives were to (a) investigate distance measures between inbred lines based on gene expression profiles, (b) examine their correlation with molecular marker-based genetic distances, and (c) assess the suitability of transcriptome-based distances for germplasm grouping and prediction of heterosis and hybrid performance.

Materials and methods

Field data

Seven flint and 14 dent elite inbred lines developed by the maize breeding program of the University of Hohenheim were used as parental lines for a 7 × 14 factorial mating design. The inbreds comprised eight dent lines with Iowa Stiff Stalk Synthetic (S028, S036, S044, S046, S049, S050, S058, S067) and six with Iodent background (P033, P040, P046, P048, P063, P066). Four flint lines (F037, F039, F043, F047) had an European Flint and three (L024, L035, L043) a Flint/Lancaster background.

The factorial crosses were evaluated in 2002 at six agroecologically diverse locations in Germany (Bad Krozingen, Eckartsweier, Hohenheim, Landau, Sünching, Vechta). The 21 inbred parents were evaluated for their per se performance in 2003 at four locations (Eckartsweier, Hohenheim, Sünching, Pocking) and in 2004 at three locations (Eckartsweier, Hohenheim, Bad Krozingen). The trials were evaluated in two-row plots using adjacent α designs (generalized lattices) with two to three replications. Grain yield (Mg ha−1) adjusted to 155 g kg−1, grain moisture, and grain dry matter concentration (%) for the inbred parents and factorial crosses were recorded. The data were analyzed with a mixed linear model as described in detail by Schrag et al. (2009a, b). The factorial set of crosses investigated here is one of the nine factorials analyzed by Schrag et al. (2009a, b) and was also included in the studies of Schrag et al. (2006, 2007), where it was referred to as Experiment 1.

Molecular marker data

The inbred lines were assayed for AFLP markers with 20 primer combinations as described in detail by Schrag et al. (2006). The AFLP analyses resulted in 1,835 markers.

The genetic distance D A between inbred lines i and j was calculated from the banding pattern of n m AFLP bands as

$$ D_A(i,j) = \sqrt{\frac{1}{n_m} \sum_{m = 1}^{n_m} \left[ b_{m}(i) - b_{m}(j)\right]^2}, $$
(1)

where b m (i) and b m (j) are indicator variables taking the value one, if band m was observed in inbred line i or j, respectively, and zero otherwise. D A (ij) is related to the single matching coefficient SM(ij)

$$ D_A(i,j) = \sqrt{1-\hbox{SM}(i,j)}. $$
(2)

D A has the property of being Euclidean and, therefore, is well suited for principal coordinate and cluster analyses.

Gene expression data

Five plants of each of the 21 inbred lines were grown for 7 days in a climate chamber under regulated growth conditions. The five biological replicates were pooled (Kendziorski et al. 2005) and RNA was isolated from a mixture of the five seedlings. The 46-k array from the maize oligonucleotide array project (http://www.maizearray.org/, University of Arizona, USA) with 43381 gene-oriented 70-mer maize oligo-spots (in total 46,128 features) printed on a glass-slide was used for hybridization analyses (Thiemann et al. 2009). For the mircroarray analysis we employed an intervoven loop design (Kerr and Churchill 2001) resulting in 62 direct comparisons of dent and flint lines by sampling each dent line five times and each flint line eight times.

Differences in the gene expression were tested with a modified F test using a false discovery rate of 0.01 for all genes showing a fold change of at least 1.3 and expression level (log2) of at least 8.0. All genes, which were differentially expressed in at least one pair of parental lines of the 98 factorial crosses, were assigned to the subset of genes S p .

Transcriptome-based distance measures

A Euclidean distance D E between lines i and j can be determined from the gene expression data of n g genes as

$$ D_E (i,j) = \sqrt{\sum_{g = 1}^{n_g} \left[ l_{g}(i) - l_{g}(j) \right]^2}, $$
(3)

where l g (i) and l g (j) are the base-two logarithms of the transcript abundance of gene g in inbred lines i and j.

A binary distance D B between lines i and j can be determined from the gene expression data as

$$ D_B(i,j) = \sqrt{\frac{1}{n_g} \sum_{g = 1}^{n_g} \left[ x_{g}(i) - x_{g}(j)\right]^2} $$
(4)

where x g (i) and x g (j) are indicator variables taking the values zero and one depending on differential gene expression of gene g in inbred lines i and j. If gene g is differentially expressed in lines i and j, then

$$ \begin{aligned} x_g(i) & = 1 \quad \hbox{and} \quad x_g(j)=0 \quad \hbox{for}\; l_g(i) > l_g(j), \quad{\hbox{and}}\\ x_g(i) & = 0 \quad \hbox{and} \quad x_g(j)=1 \quad \hbox{for}\; l_g(i) \leq l_g(j). \end{aligned} $$

If gene g is not differentially expressed, then x g (i) =  x g (j) = 0. Equation 4 simplifies to

$$ D_B (i,j) = \sqrt{n_s(i,j) / n_g}, $$
(5)

where n s (ij) is the number of genes differentially expressed in lines i and j.

The transcriptome-based distances D E and D B were determined for the subset of genes S p . The correlation of D E and D B with D A and the correlation of D A , D E , and D B with hybrid performance and heterosis was determined. Furthermore, the distances D A , D E , and D B were subjected to a principal coordinate analysis and a hierarchical cluster analysis using the complete linkage clustering algorithm implemented in the ‘hclust’ function of statistical software R (Ihaka and Gentleman 1996).

Association of differential gene expression with hybrid performance and heterosis

To identify genes, which were differentially expressed in the parents of hybrids with high performance, the hybrids are divided in two classes T and L of equal size. The class T consists of hybrids with high and the class L of hybrids with low hybrid performance Y:

$$ \forall\, Y(i,j) \in T, Y(k,l) \in L : Y(i,j) \geq Y(k,l). $$
(6)

Consider gene g and let o gt and o gl denote the numbers of hybrids in class T and L, respectively, for which the parents show differential expression of gene g. Let

$$ P_g = \sum_{k=o_{gt}}^{o_{gt}+o_{gl}} \hbox{Bin}_{n,p}(k) $$
(7)

where Binn,p is the probability function of the binomial distribution with parameters n = o gt  + o gl and p = 1/2. P g denotes the probability that the count o gt , or a larger one, is observed under the condition that differential gene expression occurs with the same probability in the hybrids showing a high and those showing a low hybrid performance. Hence, small values for P g indicate an association of differential expression with high hybrid performance. A subset of genes, the differential expression of which is associated with mid parent heterosis can be determined in an analogous manner.

On the basis of all 98 hybrids of the factorial, we determined the subsets of genes associated with hybrid performance S y and heterosis S h . To accomplish this, we compared the probabilities P g with threshold values determined with the Bonferroni-Holm procedure for a Type-I error of α = 0.01. The distances D B and D E were calculated for the subsets of genes S y and S h and the correlations of these distances with hybrid performance and heterosis, respectively, were determined.

Transcriptome based prediction of hybrid performance and heterosis

For prediction of new hybrids, a reference set of related breeding material (estimation set) is required. The estimation set consists of the expression profiles of parental inbred lines uv and the field data of their factorial crosses Y(uv). From these data, the set of genes S is determined, of which differential expression is associated with hybrid performance. The (Euclidean or binary) distances D S(uv) on basis of S between the parental lines are determined. The distances between parents and the performance data of the hybrids in the estimation set are used to estimate the regression parameters β0 and β1 with the linear regression model

$$ Y(u,v) = \beta_0 + \beta_1 D^S(u,v). $$
(8)

To predict the performance Y(ij) of a new hybrid, the gene expression profiles of the parental lines i and j are assessed. The set of genes S is used to determine the (binary or Euclidean) distance D S (ij) between the parental lines. From the distance D S (ij) and the regression parameters β0 and β1 the performance Y(ij) of the new hybrid is predicted with Eq. 8. The prediction procedure is summarized in Fig. 1. Mid-parent heterosis H(ij) is predicted in an analogous manner.

Fig. 1
figure 1

Prediction of hybrid performance

Assessment of prediction efficiency

The prediction efficiency was evaluated with a cross-validation procedure in which we divided our data in an estimation set, used for estimation of prediction parameters, and a validation set for which prediction was carried out. The estimation set consisted of the transcriptome data of five randomly chosen dent and three randomly chosen flint lines and the field data of their hybrids. The validation set consisted of the remaining inbreds and hybrids of the factorial. The subsets of genes associated with hybrid performance and heterosis were determined in the estimation set by comparing the binomial probability P g with a threshold of α = 0.05 (employing no adjustment for multiple testing). With these distances, the regression parameters β0 and β1 were determined in the estimation set (Eq. 8). Employing S, β0, and β1, the hybrid performance and heterosis of the hybrids in the validation set were predicted. The correlation coefficient r of observed with predicted values of Y(ij) and H(ij) was assessed in 100 cross validation rounds. It was employed as a measure to assess the prediction efficiency.

In this cross validation scheme, less than half of the lines are used to estimate the prediction parameters and more than half of the lines for validation. Therefore, it models the reality of maize breeding programs much better than the older approach of a “one-leave-out” cross validation (Vuylsteke et al. 2000; Schrag et al. 2006).

The efficiency of the transcriptome-based prediction was compared with that of prediction based on the general combining ability (GCA) estimated from field trials and that of prediction with the ‘single-marker total effects of associated markers’ (SM-TEAM) approach of Schrag et al. (2007).

Results

The mean grain yield of the 98 hybrids was 11.72 Mg ha−1 with a broad sense heritability of 80.3%. The GCA and SCA variance components, as well as their interactions with the locations were significantly different from zero (α = 0.05). The ratio of SCA:GCA variance components was 1.12. Therefore, the GCA is expected to explain only partially the variation in hybrid performance in this factorial. The results were presented in detail by Schrag et al. (2006).

The subset S p of genes, which were differentially expressed in at least one pair of parental lines of the factorial crosses consisted of 10,810 genes. The ranges of the distances between lines from the same heterotic pool (intra-pool distances) were 0.22 ≤ D A  ≤ 0.57, 0.16 ≤ D B  ≤ 0.53, and 25.7 ≤ D E  ≤ 76.5. The ranges of the distances between lines from opposite heterotic pools (inter-pool distances) were 0.56 ≤ D A  ≤ 0.61, 0.39 ≤ D B  ≤ 0.69, and 38.6 ≤ D E  ≤ 89.0. The genetic distance D A was strongly correlated with the transcriptome-based distances D B and D E for intra-pool crosses, but only loosely for inter-pool crosses (Fig. 2).

Fig. 2
figure 2

Correlation r of the genetic distance D A with the transcriptome-based distances D B and D E . The distances D B and D E were determined from the subset of genes S p , comprising 10,810 differentially expressed genes. *P ≤ 0.05, ***P ≤ 0.001

The first principal coordinate clearly separated the flint from the dent lines for all three distance measures (Fig. 3). The only exception was flint line L024, which had a first principal coordinate closer to the dent than to the flint pool for distance D E . The second principal coordinate separated the dent lines with Stiff Stalk from those with Iodent background (Fig. 3).

Fig. 3
figure 3

Principle coordinate analyses and hierarchical cluster analyses based on the genetic distance D A and the transcriptome-based distances D B and D E . The distances D B and D E were determined from the subset of genes S p comprising 10,810 differentially expressed genes

Cluster analyses based on the genetic distance D A and the binary distance D B resulted in separate clusters for the flint and dent lines (Fig. 3). The binary distance D B even separated sub-clusters of flint lines having European flint and Flint/Lancaster background. The Euclidean distances D E did not separate the flint and dent lines. A clear separation between the different dent backgrounds (Stiff Stalk vs. Iodent) was not observed for any of the distances.

The genetic distance D A between parental lines was neither significantly correlated with hybrid performance nor with heterosis for grain yield (Fig. 4). In contrast, the distances D B and D E , determined from the subset S p of differentially expressed genes, were correlated with hybrid performance and heterosis.

Fig. 4
figure 4

Correlation r of hybrid performance Y (left hand side) and mid-parent heterosis H (right-hand side) for grain yield with the genetic distance D A (top), binary distance D B (center), and Euclidean distance D E (bottom). The distances D B and D E were determined from the subset of genes S P comprising 10,810 differentially expressed genes. ns: P > 0.05, ***P ≤ 0.001

The subset of genes S y whose differential expression was associated with hybrid performance consisted of 1,424 genes, and the subset S h of genes associated with heterosis of 1,763 genes. The distances D B and D E , determined from the subsets of genes S y and S h were strongly correlated with hybrid performance and heterosis for grain yield (Fig. 5).

Fig. 5
figure 5

Correlation r of hybrid performance Y (left-hand side) and mid-parent heterosis H (right-hand side) for grain yield with the binary distance D B (top) and Euclidean distance D E (bottom). The distances were determined from the subset of genes S y comprising 1,424 genes of which differential expression is associated with hybrid performance (left-hand side), and the subset S h comprising 1,763 genes of which differential expression is associated with heterosis (right-hand side). ***P ≤ 0.001

In cross validation, the correlation of observed with predicted values for hybrid performance and heterosis was greater for prediction with the distances D B and D E than for prediction with the earlier prediction methods GCA and SM-TEAM (Fig. 6). Prediction with the binary distance D B resulted in stronger correlations than prediction with the Euclidean distance D E for both, hybrid performance and heterosis.

Fig. 6
figure 6

Correlations r of the observed with predicted hybrid performance Y and mid-parent heterosis H for grain yield obtained by cross validation with 100 rounds. The bold lines in the boxes denote the median, the borders of the boxes the quartiles, and the ends of the whiskers the extreme values of r. Left-hand side: The genes for prediction were selected in the estimation set with an exact binomial test (α = 0.05), resulting in approx. 800–1,000 genes for each trait in the individual cross validation rounds. D B denotes prediction with the binary and D E with the Euclidean transcriptome-based distances. Results for GCA and SM-TEAM were taken from Schrag et al. (2007). Right hand side: Fixed numbers of 50 to 5,000 genes were selected in each cross validation round on basis of small probabilities P g

For prediction of grain yield with the binary distance D B , the correlations of observed with predicted values had the smallest ranges and inter-quartile distances if 1,000–1,500 genes were selected for prediction (Fig. 6). For fewer than 1,000 genes, the ranges and inter-quartile distances increased and for more than 1,500 genes the median decreased.

Discussion

Transcriptome analysis in the seedling stage

The RNA for expression profiling was extracted from entire seedlings 7 days after sowing. From a biological point of view, this raises the question whether there are justifications for the assumption that the transcript levels in seedlings are related to agronomic performance. There may be specific development stages and specific tissues, in which gene expression is functionally more closely related to grain yield, grain dry matter concentration, or other important traits. From a practical point of view, expression profiling in the seedling stage has the big advantage that the data can be generated quickly and with limited resources, compared with growing plants for a longer time period and analyzing specific tissues. It remains open to further investigations, whether a possible gain in information content of the transcriptome data at later development stages may outweigh this time advantage in breeding programs, in particular because establishing a new variety one season earlier in the market can be the key to its economic success.

Theoretical properties of the transcriptome-based distances

The distance D E has the desirable property of being Euclidean and, hence, from a mathematical point of view, is suitable for a broad range of multivariate methods. From a genetical point of view, it has the shortcoming that genes with a big difference in the transcript abundance influence more the final value of the distance than genes with a smaller difference. Transformations, such as the base-two logarithmic transformation employed in this study, reduce the numerical effect. However, genetical or physiological models justifying such transformations are lacking.

The binary distance D B discards the quantitative information on the transcription levels and assigns equal effects to all differentially expressed genes. This corresponds to a quantitative genetic model with many genes having small effects of similar size. Employing D B for prediction of heterosis with the linear regression model of Eq. 8 corresponds well to the hypothesis, that small dominance effects at a large number of loci are resulting in the heterosis observed in a hybrid.

Correlation of the genetic distances with the transcriptome-based distances

The transcriptome-based distances showed a significant correlation of about 0.7 with the genetic distance for intra-pool crosses, whereas for inter-pool crosses only a loose correlation of about 0.2 was observed (Fig. 2). The high correlation of the transcriptome-based distances with the genetic distance for intra-pool crosses supports the hypothesis that within one heterotic pool both measures contain to a large part similar information. For inter-pool crosses, the low correlation suggests that the information content of the transcriptome-based and genetic distances differs. The range of inter-pool genetic distance was considerably smaller than the range of inter-pool transcriptome-based distances. This supports the hypothesis that the inter-pool transcriptome-based distances carry more information than the inter-pool genetic distances. The transcriptome-based distances are directly quantifying the expression of genes, which may be responsible for the phenotype and do not rely on the linkage between markers and genes. In consequence, transcriptome data should be preferable to marker data when different heterotic pools are considered.

Germplasm grouping

The multivariate analyses employing the binary distance D B grouped the flint and dent pools as clearly as did the genetic distance D A . In addition, cluster analysis with D B separated two subgroups within the flint pool (Fig. 3). In principal coordinate analysis on basis of D E , the flint line L024 was close to the dent lines, and in the cluster analyses no separate clusters for flint and dent lines were formed. In conclusion, multivariate analyses based on D B were more effective in grouping the germplasm than analyses based on the genetic distance D A , whereas grouping on basis of D E was slightly less effective than grouping on basis of D A .

Correlation of the transcriptome-based distances with hybrid performance and heterosis

In many studies in maize, it has been observed that the genetic distance between the parents of inter-pool hybrids was not correlated with the hybrid performance or heterosis (cf. Melchinger 1999). Our data on the AFLP-based genetic distance D A confirms this result (Fig. 4). The significant correlations of the transcriptome-based distances with hybrid performance and heterosis may be explained by the high density of investigated loci, the analysis of the genes rather than markers, and the inclusion of additive–additive interactions in the analysis.

Transcription profiling resulted in 10,810 differentially expressed genes in the factorial. Such a high number of loci investigated means a good coverage of the genes underlying grain yield and, therefore, resulted not only in significant but very strong correlations.

Founder effects, selection, and random genetic drift can result in differences in the linkage disequilibrium between marker alleles and functional alleles in different heterotic pools (Boppenmeier et al. 1992; Charcosset and Essioux 1994). While in one heterotic pool, a certain marker allele can be in coupling phase linkage with a certain allele at a functional locus, in the opposite pool the marker allele may be in repulsion phase linkage with this functional allele. Therefore, inter-pool genetic distances at marker loci may provide only a poor estimate for the differences at functional genes between two lines belonging to different heterotic pools. Expression profiling investigates directly the genes, and does not rely on linkage disequilibrium between marker alleles and functional alleles. Therefore, it is not affected by different linkage phases in different heterotic pools and quantifies directly the differences at functional genes between two lines. This seems to be the main reason why the transcriptome-based distances are strongly correlated with hybrid performance and heterosis, whereas the inter-pool genetic distances are not correlated.

Furthermore, additive–additive interactions responsible for increased RNA transcription are accounted for in the transcription profiling and, hence, contribute to the transcriptome-based distances. These may increase the proportion of phenotypic variance explained by the distances and, thus, can also contribute to the high correlation of transcriptome-based distances with hybrid performance and heterosis.

The correlation of transcriptome-based distances with hybrid performance and heterosis for selected genes (subsets S y and S h , Fig. 5) was considerably larger than for unselected genes (subset S p , Fig. 4). Hence, using a set of selected genes for prediction models has the potential to increase prediction efficiency.

Transcriptome-based prediction of hybrid performance and heterosis

Selection of the set of genes S, employed for prediction, was based on the binomial probability (Eq. 7). It was determined separately for each cross validation run, employing a threshold of α = 0.05 and no adjustment for multiple testing. This was an arbitrary choice, and the number of genes selected with such a procedure depends strongly on the size of the estimation set and chosen thresholds. For the present data set, however, it resulted in approximately 800–1,000 genes for each run, which was near the optimum value of 1,000–1,500 genes (Fig. 6). An alternative strategy would be to determine once a ‘core’ set of genes responsible for a given trait and use these genes subsequently for prediction. It is an interesting area for further research, whether using a fixed set of genes improves the prediction efficiency or not.

Predicted values of heterosis are of use in practical breeding programs only in combination with the mid-parent value of line per se performance, whereas the predicted hybrid performance can be applied directly for making selection decisions. Therefore, the prediction of heterosis is of use, if it can be accomplished with higher precision than that of hybrid performance. In our study, the predictions of hybrid performance and heterosis were equally precise (Fig. 6). Consequently, we recommend prediction of the hybrid performance rather than that of heterosis.

Predictions with the binary distance D B showed greater correlations to the observed values than predictions with the Euclidean distance D E . Further, D B outperformed by far the GCA-based prediction and also the prediction with a linear model using selected AFLP markers (Fig. 6). In consequence, prediction models employing the transcriptome-based distance D B should provide the most precise predictions of hybrid performance available to date.

Applications in breeding programs

Expression profiling of seedlings can be conducted directly after producing new inbred lines. From the transcriptome data, the performance of possible hybrids can be predicted and the promising hybrids can be produced and tested in field trials. This indirect pre-selection step based on expression profilecs can enhance the response to selection. At present transcriptome analysis is expensive, but a decrease in lab costs is expected and with such a cost decrease, the suggested pre-selection can increase the cost efficiency of breeding programs.

Assuming a heritability near one of the transcriptome data (h 2 I  = 1), the response G I to indirect selection based on transcriptome-based distances is G I  = i h I r σ g , whereas that for direct selection based on field trials is G D  = i h D σ g (assuming constant selection intensity i and genetic variance σ 2 g ). Consequently, with correlations of r ≈ 0.8 (Fig. 6) indirect selection has the same efficiency as that of direct selection with a heritability of h 2 D  = (0.8)2 = 0.64. This demonstrates the potential utility of the transcriptome-based prediction of hybrid performance in hybrid breeding programs.