Background

Rapeseed (Brassica napus L.) is a pivotal oilseed crop cultivated across diverse regions such as Europe, Canada, Australia and Iran, owing to its substantial genetic diversity (FAO 2020). A primary goal of rapeseed research is to expand germplasm diversity in order to attain higher yields. However, progress in most critical agronomic traits of canola is limited. These quantitative traits are governed by numerous minor alleles, which complicates the identification of chromosomal allele locations and of their relative contributions to trait expression and phenotypic distribution (Sabouri et al. 2012; Liu et al. 2022). Genetic marker-based breeding has enabled the identification of quantitative trait alleles and the creation of genetic maps. Molecular marker-based breeding is a suitable and useful method because it eases the expansion of genetic diversity and is not subject to the time limits of rapeseed cultivation (Ton et al. 2020; Chugh et al. 2023; Singh et al. 2022). Genetic markers include morphological, cytogenetic, biochemical and molecular markers, with DNA-level polymorphic markers being especially important. Markers such as RAPD, SSR, AFLP, and ISSR are extensively employed for locating genes linked to polygenic and monogenic traits (Suping et al. 2021; Dolatabadian et al. 2022). Several investigations have explored genetic diversity within rapeseed germplasm using the same or different marker systems (Chai et al. 2019; Singh et al. 2017; Jesske et al. 2013). For instance, Motallebinia et al. (2019) assessed genetic diversity in 12 canola genotypes using 18 ISSR markers, identifying 60 polymorphic bands out of 106 amplified bands. Similarly, Safari and Mehrabi (2017) reported 100% polymorphism across 45 canola genotypes using 12 RAPD markers. Masoudi et al. (2017) appraised 60 wheat genotypes using three marker systems (IPBS, ISSR and IRAP) and observed 47 polymorphic bands out of 61 amplified bands, with the highest and lowest percentages of polymorphism obtained with the ISSR and IPBS primers, respectively.

Artificial neural networks (ANNs) are modern machine learning methods for predicting responses to complex problems, partly inspired by the way the biological nervous system processes data and information. An ANN is a set of computational elements called neurons that function similarly to biological neurons. These networks are capable of learning and correcting their errors. Learning in these systems is adaptive, i.e., the synaptic weights are adjusted using examples so that the system produces a correct response when new inputs are given. The characteristics of ANNs include the ability to learn, distributed representation of information, generalization, parallel processing, robustness, and general modeling of physical processes, which can be approached by deductive or inductive methods. The deductive method rests on mathematical theories and formulas; in other words, modeling is done through relationships and constant coefficients obtained from experiments (Kasabov 2019). ANN-based modeling methods, however, are more useful and flexible than linear regression in dealing with possible non-linear relationships (Jamshidi et al. 2016; Niazian and Niedbała 2020). The potential of molecular and phenotypic data in predicting crop yield has been harnessed through various ANN models (Wojciechowski et al. 2016; Sharma and Singh 2017; Torkashvand et al. 2017). Neural networks offer great flexibility for accurate pre-harvest yield prediction and are gaining traction in genetic research (Ma et al. 2018; Singh et al. 2016; Gholipoor and Nadali 2019; Wang et al. 2019). Combining quantitative and qualitative data within network models yields enhanced predictions, and both types of data have been used in the context of rapeseed yield (Zhang et al. 2020; Wawrzyniak et al. 2020).

There are different types of ANNs, such as radial basis function (RBF) and multilayer perceptron (MLP) networks (Araghinejad et al. 2017), which do not depend on any prior knowledge of the structure of, or the relationships between, input and output signals. Such models are therefore useful for modeling and optimization in plant genetics, for example in tissue culture (Jamshidi et al. 2016; Eren et al. 2023; Aasim et al. 2022) and molecular markers (Sandhu et al. 2021). An ANN model with MLP architecture predicted rapeseed yield based on meteorological (temperature and precipitation) and fertilization data, with the 15:15-18-11-1:1 structure giving lower MAPE values (Niedbała 2019).

Support vector machine (SVM) is a learning system used both for classifying input data and estimating the data fit function so that the least error occurs in the data classification and regression. The data is divided into three categories: training, validation, and test so that training data causes SVM training, validation data is used to calibrate the parameters of the machine, and finally, this machine is used to classify or estimate test data. This method is based on constraint optimization theory that uses the principle of minimization of structural error and leads to a solution with overall optimum returns (Campbell and Ying 2011; Hesami and Jones 2020). Among these models, the SVM emerges as a widely adopted machine learning algorithm, adeptly addressing both classification and regression tasks (Noble 2006).

The integration of machine learning-based techniques into breeding introduces a novel avenue, promising accurate prediction of rapeseed hybrid performance. This paper aims to predict rapeseed yield using phenotypic and molecular data, employing diverse machine learning models. These trained models reduce the need for resource-intensive experiments, marking a significant advancement in rapeseed breeding research.

Materials and methods

During the 2017–2018 crop year, a Diallel genetic design was employed to cross eight parents (refer to Table 1) at the Gorgan City Natural and Agricultural Resources Research Station. Subsequently, in the autumn of 2018, a total of 8 parents and 56 hybrid offspring were cultivated in the research field using a randomized complete block design (RCBD) with three replicates. These plants were sourced from Dr. Payghamzadeh at the Gene Bank of Horticultural Products Research Institute, Golestan Agriculture and Natural Resources Research and Education Center, under the Agricultural Research, Education and Extension Organization (AREEO). The plant specimens, identified as voucher IDs (SPN-202, SPN-204, SPN-206, SPN-207, SPN-217, SPN-225, SPN-227, SPN-182) are accessible for study and verification at the Herbarium of the Research Institute (AREEO).
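The entry counts in this design follow directly from the crossing scheme: a full diallel of 8 parents without selfing gives 8 × 7 = 56 hybrids, half of them direct and half reciprocal crosses. The short sketch below only illustrates this arithmetic. The parent labels `P1` to `P8` are placeholders rather than the genotypes of Table 1, and treating the lower-numbered parent as the female of a "direct" cross is an assumption made purely for illustration.

```python
from itertools import permutations

# Placeholder parent labels (assumption); Table 1 lists the real genotypes.
parents = [f"P{i}" for i in range(1, 9)]

# Full diallel without selfs: every ordered (female, male) pair of distinct parents.
crosses = list(permutations(parents, 2))

# Illustrative convention only: a cross is "direct" when the female parent
# comes first in the parent list, and "reciprocal" otherwise.
direct = [(f, m) for f, m in crosses if parents.index(f) < parents.index(m)]
reciprocal = [(f, m) for f, m in crosses if parents.index(f) > parents.index(m)]

entries = len(parents) + len(crosses)  # parents plus hybrids grown in the field
plots = entries * 3                    # three replicates in the RCBD

print(len(crosses), len(direct), len(reciprocal), entries, plots)
# prints: 56 28 28 64 192
```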

Table 1 Specifications of rapeseed genotypes used in this study

Molecular analysis

The researchers acquired seeds of eight distinct rapeseed genotypes from the Gorgan Agriculture and Natural Resources Research Station. Subsequently, 15–20 seeds of each genotype, including parents and hybrids, were planted in small pots in the greenhouse at Gonbad-Kavous University. To extract DNA from every genotype, plant samples were harvested at the three-leaf stage, pulverized with liquid nitrogen and then stored in a −20 °C freezer. DNA was extracted from leaf samples by the CTAB method described by Saghi Maroof et al. (1994), and the quality of the extracted DNA was assessed through 0.8% agarose gel electrophoresis. To explore the genomes of the studied genotypes, a set of 40 primers (refer to Table 2) capable of amplifying the genome of these genotypes was employed. The PCR products were subsequently separated by electrophoresis on a 1.5% agarose gel, and the resulting gel was visualized under UV light.

Table 2 List of the 40 markers used in this research

Yield prediction utilizing MLP neural network

Neural networks exhibit capabilities encompassing classification, prediction and clustering. The training process involves increasing and decreasing the weight coefficients of input nodes. These networks generally comprise fundamental neural units forming an input layer, one or more hidden layers, and an output layer. The input signal propagates through the network in a direct layer-by-layer path, often referred to as the MLP architecture. The structure of a multilayer neural network is depicted in Fig. 1.

Fig. 1

Structure of multilayer perceptron (MLP) neural network
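The layer-by-layer propagation described for the MLP can be sketched in a few lines. This is a minimal illustration of a forward pass only, not the network used in this study: the layer sizes, the tanh hidden activation, the linear output and the random initial weights are all assumptions chosen for the example.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, weights, biases, activation):
    """Each neuron: weighted sum of its inputs plus a bias, then the activation."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """Propagate x forward layer by layer: tanh in hidden layers, linear output."""
    for i, (weights, biases) in enumerate(layers):
        is_output = i == len(layers) - 1
        activation = (lambda z: z) if is_output else math.tanh
        x = dense(x, weights, biases, activation)
    return x

rng = random.Random(0)
# Hypothetical sizes: 4 input traits, one hidden layer of 3 neurons, 1 output (yield).
layers = [init_layer(4, 3, rng), init_layer(3, 1, rng)]
prediction = mlp_forward([0.2, -0.1, 0.7, 0.5], layers)
print(prediction)
```

Training then consists of adjusting the entries of `weights` and `biases` so that the output approaches the observed yield, which is the increase and decrease of weight coefficients mentioned above.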

Yield prediction utilizing RBF neural network

The radial basis function (RBF) network is a type of ANN in which each hidden unit produces an output that depends on the distance between the input vector and the center of that unit. This network is trained with the backpropagation algorithm with a diminishing learning rate (BDLRF). The advantages of this algorithm include ease of parameter adjustment, reduced learning time and improved learning behavior of the network. The schematic of a three-layer RBF network, comprising input, hidden and output layers, is presented in Fig. 2.

Fig. 2

Structure of the radial basis function (RBF) neural network
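To make the three-layer structure concrete, the sketch below builds a small RBF network with Gaussian hidden units and fits only the output weights by gradient descent with a learning rate that shrinks over the epochs. It is a simplified stand-in, not the BDLRF algorithm itself: the Gaussian width, the decay schedule `lr0 / (1 + decay * epoch)`, the choice of one center per training point and the toy data are all assumptions made for illustration.

```python
import math

def rbf_features(x, centers, width):
    """Gaussian activation of each hidden unit: exp(-||x - c||^2 / (2 * width^2))."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * width ** 2))
            for c in centers]

def predict(x, centers, width, weights, bias):
    """Output layer: linear combination of the hidden-unit activations."""
    phi = rbf_features(x, centers, width)
    return sum(w * p for w, p in zip(weights, phi)) + bias

def train(X, y, centers, width, epochs=500, lr0=0.5, decay=0.01):
    """Fit the output weights by gradient descent with a decaying learning rate."""
    weights, bias = [0.0] * len(centers), 0.0
    for epoch in range(epochs):
        lr = lr0 / (1.0 + decay * epoch)  # learning rate diminishes each epoch
        for x, target in zip(X, y):
            phi = rbf_features(x, centers, width)
            err = sum(w * p for w, p in zip(weights, phi)) + bias - target
            weights = [w - lr * err * p for w, p in zip(weights, phi)]
            bias -= lr * err
    return weights, bias

def sum_squared_error(X, y, centers, width, weights, bias):
    return sum((predict(x, centers, width, weights, bias) - t) ** 2
               for x, t in zip(X, y))

# Toy one-dimensional data, made up for illustration: y = x^2.
X = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
y = [1.0, 0.25, 0.0, 0.25, 1.0]
centers = X  # one hidden unit centered on each training point

sse_before = sum_squared_error(X, y, centers, 0.5, [0.0] * len(centers), 0.0)
weights, bias = train(X, y, centers, width=0.5)
sse_after = sum_squared_error(X, y, centers, 0.5, weights, bias)
print(sse_before, sse_after)
```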

Yield prediction utilizing SVM model

The support vector machine is a supervised learning approach used for classification and regression tasks. Support vectors, a set of points in the n-dimensional space of the data, establish the category boundaries, effectively segmenting and categorizing the data as displayed in Fig. 3. The algorithm seeks a boundary between categories that lies as far as possible from the support vectors of each category. The essence of the method lies in processing the data through kernel functions, which map it into a new space in which complex, nonlinearly separable data can be analyzed. Various kernel functions are available, including linear, polynomial, cyclic, and radial, and the choice of kernel affects the results.

Fig. 3

Support vectors in support vector machine (SVM) model
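The role of the kernel can be illustrated with three common kernel functions and the general form of an SVM decision function, f(x) = Σ αᵢ K(svᵢ, x) + b. The sketch below is not a trained model: the support vectors, coefficients, bias and kernel parameters (`degree`, `coef0`, `gamma`) are hypothetical values chosen only to show how the kernel enters the prediction.

```python
import math

def linear_kernel(a, b):
    """Plain dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """(a . b + coef0) ** degree"""
    return (linear_kernel(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """Radial kernel: exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, coeffs, bias, kernel):
    """f(x) = sum_i coeff_i * K(sv_i, x) + bias; the sign gives the class."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias

# Hypothetical support vectors and signed coefficients, not fitted to any data.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
coeffs = [1.0, -1.0]
bias = 0.0

near_first = svm_decision([0.9, 1.1], support_vectors, coeffs, bias, rbf_kernel)
near_second = svm_decision([-1.0, -0.8], support_vectors, coeffs, bias, rbf_kernel)
print(near_first > 0, near_second < 0)
```

For regression, the same weighted sum of kernel values is used directly as the predicted value rather than being reduced to a sign.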

Input data for models

Trait assessment involved recording days to flower initiation from emergence date, days to flower termination from emergence date, flowering duration, physiological maturity, plant height (cm), number of lateral branches, branching height (cm), podding height (cm), main stem length (cm), pod length (cm), and stem diameter (mm). Additionally, number of pods per main stem, number of pods per lateral branch, number of pods per plant, number of grains per pod, 1000-grain weight (g), and yield (kg ha−1) were recorded for both direct and reciprocal crosses after the cultivation period. From each plot, five randomly chosen plants from the two central rows were tagged before flowering (BBCH: 32) and harvested at maturity (BBCH: 99) (Meier et al. 2009) to collect data on these traits. Moreover, the entire plot was harvested to obtain grain yield per plot. Detailed data collection procedures are outlined in Table 3. Genetic data from the 40 markers in Table 2 were also employed. Alleles were scored as zero (absence of band) or one (presence of band).

Table 3 Description of investigated traits in the experiment

Evaluation of model performance

When predicting the yield of rapeseed hybrids, phenological, morphological, and yield-related traits, seed yield components, and parental molecular data were used together to train the models. The quantity and distribution of the training data are pivotal factors influencing prediction accuracy (Duan et al. 2015). The dataset was randomly divided into two segments, training and testing: 80% of the data was allocated for training, while the remaining 20% was reserved for model testing. With this method, data points are selected at random from the dataset so that each cultivar has an equal probability of being selected during sampling (Yates et al. 2008). Additionally, validation involved comparing model-predicted values with actual values obtained from phenotypic and molecular data. To assess the models’ performance in predicting hybrid performance, statistical criteria such as MAE, RMSE, and R2 were employed (Zhang et al. 2020). The coefficient of determination (R2) describes how closely the measured and simulated values agree (Eq. 1). In other words, when the measured values increase, the predicted values increase as well. The values of R2 generally lie between zero and one, and the closer the value is to one, the more closely the predicted values match the measured values.

Mean square error (MSE) is a statistical measure of the difference between the target values of the observed dataset and the output values predicted by the model (Eq. 2). It is the mean of the squared differences between the predicted and actual values. The errors are squared so that large errors carry more weight and so that positive and negative differences do not cancel each other out. Root mean square error (RMSE) is the square root of the MSE (Eq. 3).

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}}{\sum_{i=1}^{n} \left( y_{i} - y_{ave} \right)^{2}}$$
(1)
$$MSE = \frac{1}{n}\sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}$$
(2)
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}}{n}}$$
(3)

In these equations, \(y_{i}\) and \(\hat{y}_{i}\) are the actual and predicted values, \(y_{ave}\) is the average of the actual values, and n is the number of observations.
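As a worked illustration of these criteria, the following sketch computes R2, MSE and RMSE from paired actual and predicted values. MAE is not defined by an equation in the text, so the usual definition (the mean of the absolute differences) is assumed here. The four yield values are invented for the example and are not data from this study.

```python
import math

def regression_metrics(actual, predicted):
    """Return R2, MSE, RMSE and MAE for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    mse = ss_res / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
    }

# Invented yields in kg per hectare, for illustration only.
actual = [2000.0, 2500.0, 1500.0, 3000.0]
predicted = [2100.0, 2400.0, 1600.0, 2900.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

Every prediction in this example is off by exactly 100 kg per hectare, so MAE and RMSE are both 100, MSE is 10000, and R2 is 1 − 40000/1250000 = 0.968.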

The MATLAB 2018b software was employed to establish and train the neural network within the programming environment (Sajid et al. 2022).

Results

The analysis of the studied genotypes’ averages (Table 4) indicates that the estimated average yield for parents was 1975.17 (kg. ha−1), with the highest parent yield reaching 2853.49 (kg. ha−1) and the lowest at 1433.48 (kg. ha−1). Similarly, the hybrid yields revealed that the average yield of reciprocal crosses (2025.90 kg. ha−1) exceeded that of direct crosses (1974.12 kg. ha−1). The maximum seed yield observed among direct crosses and the reciprocal crosses was 3002.65 and 2969.07 (kg. ha−1), respectively. Conversely, the lowest yield recorded within the reciprocal crosses was 1237.67 (kg. ha−1).

Table 4 Yield obtained from rapeseed genotypes

Molecular assessment

As evidenced by the data presented in Table 5, the employed primers yielded distinct and clear banding patterns. The 40 primers examined comprised one CAAT primer, twelve IPBS primers, eleven ISSR primers, seven ISJ primers, and nine SCoT primers. Evaluation of the genotypes revealed a total of 196 alleles, averaging 4.90 alleles per marker. Of these, 114 alleles were polymorphic, an average of 2.85 polymorphic alleles per marker. The average proportion of polymorphism across all primers was 58.16%. High percentages of polymorphism were observed among the IPBS, ISSR, ISJ, and SCoT markers. Specifically, the primers IPBS15 (80%), ISSR58 (100%), ISJ10 (100%), and SCoT1 and SCoT9 (both 80%) displayed notable polymorphism levels. The mean polymorphic information content (PIC) of the primers was 0.34. Among all markers, the SCoT9 primer had the highest PIC value (0.50), while the primers CAAT28, IPBS5, IPBS8, and ISSR47 had the lowest (0). The Shannon index (I) is an important parameter for evaluating the efficiency of primers in detecting polymorphism. The primers ISJ10 and SCoT1 exhibited higher values of the Shannon index than the other markers, and the overall mean value was 0.29. The Nei genetic diversity index (H) ranged from 0 to 0.39 across markers: ISJ10 recorded the highest value (0.39), whereas CAAT28, IPBS5, IPBS8, and ISSR47 recorded the lowest (0). The overall average Nei genetic diversity was 0.19. Furthermore, the analysis of effective alleles (Ne) revealed that the ISJ10 primer had the highest number of effective alleles (1.72), while the primers CAAT28, IPBS5, IPBS8, and ISSR47 had the lowest.
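The summary figures above can be checked with simple arithmetic: 114 polymorphic alleles out of 196 gives 58.16%, and dividing by the 40 primers gives 4.90 and 2.85 alleles per marker. The sketch below reproduces these numbers and also shows one common way of computing PIC for a dominant (presence/absence) band, 2f(1 − f), where f is the frequency of band presence. The text does not state which PIC formula was used, so this formula is an assumption, and the band scores are hypothetical.

```python
def percent_polymorphism(polymorphic, total):
    """Share of polymorphic alleles (bands) among all scored alleles, in percent."""
    return 100.0 * polymorphic / total

def band_pic(scores):
    """PIC of one dominant band as 2f(1 - f), f = frequency of band presence.

    Assumed formula; the source does not say how PIC was calculated.
    """
    f = sum(scores) / len(scores)
    return 2 * f * (1 - f)

# Totals reported in the text: 196 alleles, 114 of them polymorphic, 40 primers.
print(round(percent_polymorphism(114, 196), 2))  # 58.16
print(196 / 40, 114 / 40)                        # 4.9 2.85

# Hypothetical 0/1 scores of three bands across 8 genotypes.
monomorphic = [1, 1, 1, 1, 1, 1, 1, 1]   # present everywhere
balanced = [1, 0, 1, 0, 1, 0, 1, 0]      # present in half of the genotypes
skewed = [1, 1, 1, 1, 1, 1, 0, 0]        # present in six of eight genotypes
print(band_pic(monomorphic), band_pic(balanced), band_pic(skewed))
# prints: 0.0 0.5 0.375
```

Under this formula a monomorphic band has a PIC of 0 and the maximum possible value is 0.5, which matches the range of PIC values reported for the primers.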

Table 5 Results of the evaluation of 8 rapeseed genotypes using markers

Assessment of MLP, RBF and SVM model performance

The construction and training of the MLP, RBF, and SVM models were carried out utilizing the fitrnet, newrb, and fitrsvm functions embedded within the MATLAB software. Outcomes of machine learning models employing distinct data partitioning approaches are outlined in Table 6. We explored various data splitting ratios, including 90–10, 80–20, 70–30, and 60–40, with the optimal performance observed under the 80–20 ratio. Examining the minimum, maximum, and average values across these data partitioning ratios revealed that no significant differences exist among the mentioned data partitioning ratios.
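The random partitioning into training and test sets at the ratios listed above can be sketched as follows. This is an illustration of the splitting step only, written in Python rather than MATLAB. The function name `split_indices`, the fixed seed and the sample count of 100 are assumptions for the example and do not reflect the size of the dataset used in this study.

```python
import random

def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and cut them into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # every sample is equally likely to land in either subset
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

n_samples = 100  # hypothetical number of observations
sizes = {}
for train_fraction in (0.9, 0.8, 0.7, 0.6):
    train_idx, test_idx = split_indices(n_samples, train_fraction)
    sizes[train_fraction] = (len(train_idx), len(test_idx))

print(sizes)
# prints: {0.9: (90, 10), 0.8: (80, 20), 0.7: (70, 30), 0.6: (60, 40)}
```

Fixing the seed makes the split reproducible, so that the different models are compared on the same training and test observations.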

Table 6 Results of machine learning models with different data partitioning schemes

Assessment of MLP, RBF, and SVM model effectiveness based on 80–20 ratio for predicting hybrid performance using various criteria

The illustration of the effectiveness of the MLP, RBF, and SVM models in forecasting hybrid performance, as evaluated against the MAE criterion and utilizing an 80–20 ratio, is depicted in Fig. 4. Among models trained exclusively with genetic traits in direct crosses, the MLP model displayed a notably lower error compared to the other models. Conversely, in models trained with genetic traits in reciprocal crosses, the MAE errors were closely aligned. When utilizing phenotypic traits, both the MLP and RBF models trained for direct crosses demonstrated similar and improved MAE errors compared to the SVM model. Remarkably, for the SVM model trained with phenotypic traits, the lowest MAE error was observed in the context of reciprocal crosses as indicated in Fig. 4.

Fig. 4

Comparison of multilayer perceptron (MLP), radial basis function (RBF), and support vector machine (SVM) model performance in predicting hybrid yield using MAE criterion. G1: Model trained using genetic traits in direct crosses; G2: Model trained using genetic traits in reciprocal crosses; P1: Model trained using phenotypic traits in direct crosses; P2: Model trained using phenotypic traits in reciprocal crosses; PG1: Model trained using phenotypic and genetic traits in direct crosses; PG2: Model trained using phenotypic and genetic traits in reciprocal crosses

Further examination of the MLP, RBF and SVM models’ effectiveness in predicting hybrid performance, evaluated through the RMSE criterion, is visualized in Fig. 5. Models trained with genetic traits in both direct and reciprocal crosses yielded closely clustered RMSE values. Notably, the RMSE was lower in reciprocal crosses than in direct crosses. The use of phenotypic traits led to a decreased RMSE, particularly for the direct crosses of the RBF model and the reciprocal crosses of the SVM model. Combining phenotypic and genetic traits likewise yielded models with decreased RMSE in both the direct crosses of the RBF model and the reciprocal crosses of the SVM model.

Fig. 5

Comparison of multilayer perceptron (MLP), radial basis function (RBF), and support vector machine (SVM) model performance in predicting hybrid yield using root mean square error (RMSE) criterion. G1: Model trained using genetic traits in direct crosses; G2: Model trained using genetic traits in reciprocal crosses; P1: Model trained using phenotypic traits in direct crosses; P2: Model trained using phenotypic traits in reciprocal crosses; PG1: Model trained using phenotypic and genetic traits in direct crosses; PG2: Model trained using phenotypic and genetic traits in reciprocal crosses

The assessment of MLP, RBF, and SVM models’ efficacy in predicting hybrid performance based on the R2 criterion is portrayed in Fig. 6. Across diverse datasets and model training inputs, models trained with genetic traits in direct crosses exhibited the lowest R2 values. For models trained using genetic traits in reciprocal crosses, their R2 values closely aligned. In contrast, models utilizing phenotypic traits in both direct and reciprocal crosses showcased a superior R2 value for the MLP model compared to the other models. Incorporating both phenotypic and genetic traits resulted in models with comparable R2 values for both direct and reciprocal crosses.

Fig. 6

Comparison of multilayer perceptron (MLP), radial basis function (RBF), and support vector machine (SVM) model performance in predicting hybrid yield using coefficient of determination (R2) criterion. G1: Model trained using genetic traits in direct crosses; G2: Model trained using genetic traits in reciprocal crosses; P1: Model trained using phenotypic traits in direct crosses; P2: Model trained using phenotypic traits in reciprocal crosses; PG1: Model trained using phenotypic and genetic traits in direct crosses; PG2: Model trained using phenotypic and genetic traits in reciprocal crosses

The ultimate configuration details for each machine learning model–MLP, RBF, and SVM–are elucidated in Table 7.

Table 7 The final configuration of multilayer perceptron (MLP), radial basis function (RBF) and support vector machine (SVM) models

Discussion

Markers efficiency

The comparison of markers in terms of their discriminating power relies on crucial parameters such as polymorphic information content (PIC) and the Nei index. Higher values of these parameters indicate heightened polymorphism, the presence of alleles or rare alleles within a marker band, and a marker’s proficiency in differentiation (Badirdast et al. 2018). Our exploration aimed to evaluate the performance and efficiency of markers in assessing the extent of diversity among rapeseed parents. The mean percentage of polymorphism across all primers and the average PIC value of the primers were 58.16% and 0.34, respectively, indicating the capacity to discern and characterize genetic diversity among the 8 canola parents. In a study by Motallebinia et al. (2019) involving canola and ISSR markers, polymorphic information content values ranged from 0.08 to 0.36. Furthermore, the highest effective allele (Ne) value was observed for the ISJ10 primer (1.72), while the primers CAAT28, IPBS5, IPBS8, and ISSR47 displayed the lowest number of effective alleles. The discrepancy between the total and effective numbers of alleles signifies the presence of rare alleles found in only a few genotypes, which can be exploited for identification purposes. A proper distribution of markers throughout the genome, achieved by selecting markers from different genome regions, enhances the accuracy of molecular diversity measurement because the entire genome is represented more comprehensively (Yeken et al. 2022; Tiwari et al. 2022; Pour-Aboughadareh et al. 2022; Heikal et al. 2022). Thus, our findings are consistent with previous research, indicating that the markers studied here exhibit a diversified distribution within the genome similar to SCoT and ISSR markers (Badirdast et al. 2021; Khodadadi et al. 2021; Shah-Ghobadi et al. 2018), underlining the genetic diversity across the parents.

Model performance

The assessment of model performance shows that, in terms of RMSE, the MLP, RBF, and SVM models yielded values within the ranges [207, 405], [175, 367], and [168, 374], respectively (Fig. 5). For MAE, the models exhibited values spanning [182, 322], [147, 309], and [141, 296], respectively (Fig. 4). For R2, the values spanned [0.64, 0.92], [0.63, 0.89], and [0.55, 0.89], respectively (Fig. 6). Among models trained on genetic traits in direct crosses, none of the MLP, RBF, or SVM models surpassed an accuracy of 65% (R2) in predicting hybrid performance. However, in reciprocal crosses, all three models exhibited an accuracy of 89% (R2). Turning to models trained using phenotypic traits, the MLP model demonstrated superior predictive capabilities in both direct and reciprocal crosses, with accuracies of 89% and 92%, respectively. Furthermore, models trained with both phenotypic and genetic traits exhibited comparable accuracy across the three models, with the highest values reaching 77% in direct crosses and 89% in reciprocal crosses. While the application of artificial intelligence-based methods for phenotypic and genetic prediction remains limited, the significance of neural networks in genetic improvement has been underscored by previous research. Marini et al. (2004) successfully predicted corn and soybean yields from environmental and climatic conditions using neural networks, achieving coefficients of determination of 0.77 for corn and 0.81 for soybean. Similarly, Rosado et al. (2020) demonstrated that employing ANN-MLP neural networks for bean genetic prediction, incorporating phenotypic and genetic traits, led to a 90% increase in model accuracy. This approach capitalizes on quantitative features to improve prediction accuracy. Results across various crop plants further endorse the efficiency of neural networks for crop performance prediction (Eren et al. 2023; Shamsabadi et al. 2022; Hara et al. 2023; Huang 2023).

Conclusion

In recent years, the use of artificial intelligence in analysis, modeling, and forecasting has gained prominence. This study harnessed diverse algorithms with distinct structures to predict rapeseed hybrid performance. Using both molecular and phenotypic data as model inputs revealed that the MLP model exhibited lower RMSE and MAE values and a higher R2 when predicting reciprocal crosses than when predicting direct crosses. Training the MLP model on molecular and phenotypic data yielded an R2 of up to 89%, highlighting its capability to approximate real data accurately. The proposed neural network model enables breeders to predict the hybrid performance of parent combinations prior to crossing, streamlining efforts towards optimal hybrid outcomes. The remarkable versatility of neural networks has spurred advancements in learning and predictive models using both phenotypic and molecular data, enabling comprehensive exploration of various plant traits. Further research is warranted to explore the potential of other machine learning models.