1 Introduction

Solar radiation (SR) is the primary energy source of water cycle, and thus, it is one of the key meteorological data having direct effects on developmental stages of plant growth and evapotranspiration (ET). ET is the summation of evaporation from the soil surface and transpiration from the plant leaves, and it is a significant parameter in agricultural and environmental implementations. ET is calculated from various meteorological data including SR. Because of its being and endless form of natural energy, and along with technologic developments such as photovoltaic solar panels, in the last 20 years, SR has been being used as a source of clean and inexhaustible energy in many countries. And, apparently, solar energy will keep being used at an increasing rate. Hence, SR is another kind of renewable energy used in residential establishments and industrial facilities (Kalogirou et al. 1999). Therefore, SR is a critical variable influencing the hydrological, environmental, agricultural, ecological, and the industrial activities.

Satellite-based instruments have directly measured SR since 1978 and demonstrate that on mean, about 1361 W/m2 achieves the top of the Earth’s atmosphere. Parts of the Earth’s surface, air pollution and clouds in the atmosphere act as mirrors and reflect about 30% of this energy back into space. Higher SR values are recorded when the sun is more active. Changes in lighting follow a sunspot cycle of approximately 11 years, with SR values fluctuating on average around 0.1% in recent cycles (Stocker et al. 2013).

SR strongly affects the evaporation from terrestrial surfaces. Llasat and Snyder (1998) noted that minor changes in SR might have distinctive impacts on reference evapotranspiration (ET0). Bois et al. (2008) demonstrated that ET0 computed for Southern France by FAO-56-PM method was highly sensitive to SR. Cobaner (2011) employed the method of adaptive neuro-fuzzy inference system (ANFIS) for ET0 estimation and used SR and air temperature (Ta) as the input data. Citakoglu et al. (2014) obtained more accurate ET0 estimations when only the SR was taken as the input data than the models taking wind speed (WS), moisture, and Ta as explanatory variables.

There are significant correlations between SR and some meteorological variables. For instance, Guan et al. (2007) demonstrated positive correlations of solar energy as related with humidity, atmospheric pressure, WS, and Ta. Ododo et al. (1995) noted that besides the other meteorological data, SR was also dependent on maximum temperature (Tmax). Hargreaves and Samani (1982) presented a simple equation for SR estimation as related to Ta and position of the sun. ElNesr et al. (2015) used relevant data of 29 climate stations in Saudi Arabia and noticed a reverse relationship of SR with the altitude. Rahimikhoob (2010) worked on artificial neural network (ANN) models to calculate SR and used Ta as the input variable.

Various empirical models have been proposed using sun hours, TmaxTmin, and moisture-like meteorological data in order to calculate SR. Angstrom (1924) used sunny hours as inputs for model development and proposed two models in estimating SR. Tymvios et al. (2006) and Mubiru and Banda (2008) used different meteorological data as input variables in ANN models for estimation of SR. Rehman and Mohandes (2008) used different input combinations with Ta, moisture, and days of the year and developed three ANN models for Saudi Arabia. Authors demonstrated that models with Ta and moisture variables provide adequate performance for SR estimation. Alam et al. (2009) developed an ANN model for SR estimation in India. Dastorani et al. (2010) proposed empirical equations, multiple regression, and artificial intelligence (AI) models for SR estimation and noted that the outcomes of AI models were superior to the other models. Since ET0 varies both spatially and temporally, latitude (La), longitude (Lo), and altitude (Al) were also incorporated into ANN models developed for SR estimation (Rehman and Mohandes 2008; Mohandes et al. 1998; Hontoria et al. 2005; Siqueira et al. 2010). Linares-Rodríguez et al. (2011) selected La, Lo, the number of days of total cloud cover, Ta, total water vapor, and total ozone data as the input variables of an ANN model for SR estimation with the aid of an ANN model with plausible success. Fadare et al. (2010)designed multi-layer, feed forward back propagation ANN models with different architectures by using La, Lo, Al, month, mean sunshine duration (n), Tmean, and moisture as the input data and showed that fuzzy genetic model yielded better outcomes than ANN and ANFIS models. Olatomiwa et al. (2015) used n, Tmax, and Tmin as the input data for SR estimation and employed support vector machines (SVM), firefly algorithm (FFA), ANN, and genetic programming (GP) models and demonstrated that the SVM–FFA model was an efficient machine learning approach for SR estimation in Tabas province of Iran. Mohammadi et al. (2015a) used ANFIS model to estimate SR by using the year as the single input and showed that the ANFIS model was successful for SR estimation. Mohammadi et al. (2015b) employed n, Ta, moisture, Tmean, and extraterrestrial SR as input data for daily and monthly SR estimation in coastal zones of Iran and developed SVM with wavelet transform algorithm (WT), ANN, and GP models. Mashaly and Alazba (2016) modeled instant thermal efficiency of the sun with the aid of meteorological and operational data in ANN, multivariate regression (MVR), and stepwise regression (SWR) methods. Researches indicated that ANN model had better performance values than the MVR and SWR models. Mehdizadeh et al. (2016) used meteorological data and compared the performance of AI models such as gene expression programming (GEP), ANN, and ANFIS in SR estimation and demonstrated that SR and meteorological parameter-dependent scenarios of ANN and ANFIS models achieved better accuracy than the empirical equations. Nasruddin et al. (2017) measured global SRs in Jakarta province of Indonesia for two whole years and calculated monthly SR with the aid of four empirical models in literature and linear regression model, and they concluded that the empirical models could be preferred in monthly SR estimation in Indonesia. These researchers also indicated the Allen equation as the best model for monthly SR estimation in Indonesia. Meenal and Selvakumar (2018) employed SVM, ANN, and empirical SR models for various experimental input combinations to estimate SR of different provinces of India. By using the Waikato Environment for Knowledge Analysis (WEKA) software, Meenal and Selvakumar (2018) observed that month, La, Tmax, and shiny hours were the most effective parameters and the moisture as the least effective parameter for SR estimation. And, these researchers also indicated that the SVM model had a better performance than both ANN and empirical models. Keshtegara et al. (2018) conducted a research for SR estimation with four different heuristic regression methods, which were Kriging, response surface method (RSM), multivariate adaptive regression (MARS), and M5 model tree (M5Tree), using meteorological data as explanatory variables. These researchers demonstrated that M5Tree model estimations yielded erroneous outcomes compared with both maximum errors and the other models for minimum agreement. They also noted that the Kriging models had better performance values than the MARS, RSM, and M5Tree models. Ozoegwu (2018) stated that because Third World countries have too few measured meteorological data, single-data models had greater use than the multi-data models, and therefore, he indicated that temperature necessarily has to be the only distinctive data to be used in models to be developed for SR estimation. Since the Hargreaves–Samani equation depends on Tmax and Tmin in SR estimation, Ozoegwu (2018) developed temperature-dependent models for Nigeria and indicated that single meteorological data model was statistically more reliable. Kaba et al. (2018) applied a new model obtained by a combination of deep learning (DL) and multiple ANN methods to 34 stations of Turkey. They used SR data of the years 2001–2007 for model train and test and compared the simulated and actual data and demonstrated that DL model yielded quite accurate outcomes for daily global SR estimation. They also indicated that DL model yielded more successful outcomes than various other methods in literature and also pointed out that DL model may constitute an alternative method and be used reliably in similar regions. Samadianfard et al. (2019) noted that SR was mostly predicted by artificial intelligence techniques or by empirical equations, and they applied SWR, ANFIS, GEP, and model trees (MT) approaches to detect the relations between SR and several meteorological variables in Tabriz region of Iran. As a result of their analyses, Samadianfard et al. (2019) observed that the relationship between the SR and n was pretty strong, and according to the performance evaluations, the SVR-6 model turned out to be better than the other models in predicting SR. Antonopoulos et al. (2019) used empirical equations, ANNs, and multi-linear regression methods (MLR) in order to estimate SR using daily meteorological measurements in Northern Greece. They have tried combinations of different input variables for ANNs and MLR models. They indicated that according to RMSE criteria, the results of the ANN models were in harmony with the results of the MLR models with the same input data. Fan et al. (2019a) used empirical equations, ANNs, ANFIS, SVM, MT, and machine learning models to estimate SR in different region of China. They demonstrated that the machine learning models performed better than the empirical models. They noted that ANFIS was highly amendable, while multivariate adaptive regression spline (MARS) and extreme gradient boosting (XGBoost) were also promising models to estimate daily SR. Fan et al. (2019b) used 72 existing empirical equations to evaluate for forecasting diffuse SR in different regions of China. They stated that sunshine duration-based equations and daily-based equations gave similar results. Chen et al. (2019) studied empirical equations available in literatures by using the commonly measured meteorological data and geographic factors. Chen et al. (2019) collected the entire 294 different empirical equations and divided the equations into 37 groups according to input data. They provided an overview of advanced empirical equations in the literature with their study, and they pointed out the more suitable equations in the three Gorges Reservoir area in China.

ANN has shown its success in many applications in various fields of civil engineering (e.g., Coulibaly et al. 2001; Coppola et al. 2003; Uncuoglu et al. 2008; Laman and Uncuoglu 2009; Bilgili and Sahin 2010; Altun and Dirikgil 2013; Cobaner et al. 2014; Citakoglu 2015; Bayram et al. 2016; Bayram and Al-Jibouri 2016). ANN is an effective algorithm for solving engineering problems and mainly consists of input layer, output layer, and one or more hidden layers. But how ANN defines the relationships between the input and the output is a black box, and there is no general framework for the selection of ANN structures and parameters. Consequently, it cannot easily obtain an explicit formulation of the relation between the input and output by using ANN.

GP is an approach of machine learning branch of artificial intelligence. The idea of GP was first proposed by Koza (1992) to handle challenging problems by making use of automatically evolving computer programs. Although GP has biological evolution inspiration like genetic algorithm, it has solutions represented by tree structures rather than binary strings. Also, GP does not require prior form of the existing relationships to achieve simplified prediction equations. Recent relevant literature shows that GP models successfully simulate the behavior of various branches of civil engineering problems (e.g, Aytek and Kisi 2008; Azamathulla et al. 2008; Kashid and Maity 2012; Gandomi et al. 2016; Kurugodu et al. 2018; Mehr et al. 2018). Recently, due to Searson et al. (2010), multi-gene genetic programming (MGGP) has emerged as a promising variant of traditional GP. MGGP can deal with non-linearity and maps the relationships among all involved factors by combining the model structure selection capability of the traditional GP and parameter estimation power of classical regression. Because MGGP uses small trees to compose a generalized model, it provides simpler models than that of traditional GP. Because of its success, there are several applications of MGGP in civil engineering problems (Gandomi and Alavi 2012a, 2012b; Bayazidi et al. 2014; Kumar et al. 2014; Cobaner et al. 2016a, b; Hadi and Tombul 2018; Pedrino et al. 2019).

In the present study, extraterrestrial radiation (Ra), sunshine durations (n), mean of monthly maximum sunny hours (N), long-term mean of monthly maximum air temperatures (Tmax), long-term mean of monthly minimum temperatures (Tmin), monthly mean air temperature (Tmean), and monthly mean moisture (relative humidity) (RHmean) data over the period from 1975 through 2015 are used as the input data to estimate SR values of Turkey. The main objective of this study is to obtain a prediction model by MGGP approach using SR values of Turkey. To achieve this objective, the following aspects are considered: (i) to investigate the accuracy of three different original empirical equations in SR estimation, (ii) to modify the coefficients of these three empirical equations for Turkey, (iii) to obtain more practical equations with the aid of MGGP method, (iv) to obtain new equations appropriate for Turkey with the aid of data used in empirical equations and compare these new equations with those calibrated three empirical equations, (v) to obtain a succinct equation suitable for Turkey needing only easy-to-acquire data like Tmax and Tmin, and (vi) to compare single-data models with multi-data models. The ultimate goal of the study summarized in this paper is to obtain an efficient and compact model for SR.

2 Materials and methods

2.1 Material

Long-term monthly means of meteorological data used in this study were measured in 163 stations operated by the General Directorate of State Meteorological Services (MGM). Measurements cover the period between the years 1975 and 2015. Turkey is located between 26′−45′′ E longitudes and 36′−42′′ N latitudes. The stations used cover all of seven climate zones of Turkey. The altitudes of the stations range between 3 and 2400 m. Location of the metrological stations in Turkey is illustrated in Fig. 1. As can be seen in Fig. 1, the stations used in the train and test are distributed uniformly over Turkey and are colored in red and green, respectively.

Fig. 1
figure 1

Locations of the meteorological stations used in train are colored in red rounds and used in test are colored in green rounds

Fifty-six percent of Turkey has an altitude above 1000 m. The lands with a slope greater than 15% constitute 62% of country’s total area. Twenty-five percent of the country is overlain by the first-, second-, and third-quality class soils, and 90% of these lands are used for agriculture. Of a total of 80 million hectares of land area of Turkey, 26.3 million hectares are used for cultivation activities. Turkey has quite large agricultural fields in different climate conditions. Therefore, different plant species are grown in terrestrial and coastal sections of the country. Fruits, vegetables, and citrus species both in open fields and in greenhouses are dominant in the Mediterranean region; olive, tobacco, and citrus are common in the Aegean region; sunflower is dominant in the Marmara region; cereals and sugar beet are the common plants in the Central Anatolia region; and finally, cotton and cereals are the major crops in the Southeastern Anatolia region.

2.2 Methods

2.2.1 Multi-gene genetic programming (MGGP)

Genetic programming (GP) is a symbolic regression technique developed to derive predictive models for challenging computational problems (Koza 1992). Different from other forms of regression such as linear regression, in GP, the form of the model is not defined a priori and does not require any assumptions to develop models. GP automatically evolves computer programs for solving a specific task through an evolutionary process. First, GP starts with a randomly generated individual program. Then, genetic operators like crossover and mutation are applied to select individual programs based on their fitness values to produce new individuals for the next generations. The programs are evolved and improved until better fitness values are obtained. Traditional GP model has a single-tree structure and is constructed by a set of functional or terminal sets of elements. A functional set can be arithmetic operators (+, *, /, or –), mathematical functions (sin, cos, tanh, log, etc.), Boolean operators (AND, OR, NOT, etc.), or even logical expressions (IF or THEN). The terminal set can include variables (a, b, c, etc.) or constant values. Random selections are used for functions and terminals to provide and express tree structures. A tree structure of a traditional GP model (\( \sqrt{\log (b)+a\ast c\ } \)symbolic expression) is depicted in Fig. 2.

Fig. 2
figure 2

A tree structure of a traditional GP model

Multi-gene genetic programming (MGGP) is an automatic programming technique, and like GP, MGGP uses terminals and functions (Searson et al. 2010). MGGP can design effective prediction models because of its outstanding adaptability and versatility. Because it integrates the ability of traditional GP for selecting the model structure by the least squares technique, this peculiarity provides MGGP to generate models by using low-depth GP trees and combining low-order non-linear transformations of the input-output variables. Traditional GP uses a single-tree structure, but MGGP models are derived from several genes (each gene is a tree structure). Each gene is weighted (d1, d2, d3…), and then, a random bias term (d0) is added to predict an output (y). The schematic representation of an MGGP model tree structure is shown in Fig. 3.

Fig. 3
figure 3

An MGGP model tree structure

The maximum gene number and tree depth are two essential control parameters of the MGGP model. These two parameters have significant effect on the complexity of the evolved models. Detailed explanations of MGGP parameters and operators are given in publications by Searson et al. (2010) and Gandomi and Alavi (2012a).

2.2.2 Empirical equations

There are several empirical equations in literature to define quantitative relations between SR and meteorological data. FAO recommends Angstrom and Hargreaves–Samani equations for SR estimation. First, we have used the Angstrom equation for estimation of SR. This equation related SR with extraterrestrial radiation reaching to earth and sunshine durations (n) as expressed below:

$$ {R}_s=\left(a+b\bullet \frac{n}{N}\right)\bullet {R}_a $$
(1)

where Rs is solar radiation (MJ/m2), Ra is extraterrestrial radiation (MJ/m2), n; sunshine durations, N is maximum monthly mean sunshine durations, and a and b are coefficients for which the values recommended in [11] are a = 0.25, b = 0.50.

Hargreaves and Samani (1982) developed an equation for estimation of SR as related to Tmax, Tmin, and Ra. Since it used data which is easiest to gauge (Ta) as explanatory variables, this equation has become the most practical one, and it is rewritten below:

$$ {R}_s=a\bullet {R}_a\bullet \sqrt{T_{\mathrm{max}}-{T}_{\mathrm{min}}} $$
(2)

where Tmax is maximum long-term monthly mean temperature (°C), Tmin is minimum long-term monthly mean temperature (°C), and a is a coefficient, for which the numerical values of 0.19 for coastal zones and 0.17 for terrestrial zones were recommended by Hargreaves and Samani (1982).

In recent years, several researchers developed different empirical equations for SR estimation relating it to various meteorological data. For instance, Abdalla (1994) recommended a new equation for SR estimation from sunshine durations (n), maximum monthly mean sunshine durations (N), mean temperature, and relative humidity, which is:

$$ {R}_s=\left(a+b\bullet \frac{n}{N}+c\bullet {T}_{\mathrm{mean}}+d\bullet {\mathrm{RH}}_{\mathrm{mean}}\right)\bullet {R}_a $$
(3)

where Tmean is monthly mean temperature; RHmean is monthly mean moisture; and a, b, c, d are coefficients, for which the numerical values a = 1.943, b = 0.577, c = − 0.01483, and d = − 0.12129 are recommended by Abdalla (1994).

2.2.3 Calibration with Microsoft Excel Solver

We have used the Excel Solver to find the optimal value of the equation in the target cell. The Excel Solver interoperates with a cell group directly or indirectly related to the equation in the target cell. Constraints can be set for the values to be used in the model by the Solver, and such constraints can also be applied to the other cells influencing the equation of the target cell [Excel–help].

The enlisted steps below were followed to find out numerical relations between SR and the other meteorological data with the aid of Excel Solver.

  1. 1.

    Open the Solver parameters dialog box from Solver command in analysis group of data tab.

  2. 2.

    In the set target cell box, enter data points for objective cell or enter into Excel worksheet. The target cell should contain an equation. In this study, as the target cell, the cell containing the evaluation parameters of mean absolute error (MAE), root mean square error (RMSE), and mean absolute relative error (MARE) are selected.

  3. 3.

    From the target cell menu, the “minimum” option is selected for each one of the MAE, RMSE, and MARE parameters, because of the obvious reason that the closer the parameter to zero the better the tested model.

  4. 4.

    From the changing variables menu, a name is entered for each decision variable cell range. Variable cells should be directly or indirectly related to the target cell. In this study, the cells containing the values of a, b, c, and d coefficients will be selected as variable cells. Initial values of a, b, c, and d coefficients are given as (1,1,1,1).

  5. 5.

    From the Solver parameters dialog box, proper one of three algorithms or solving method is selected. The non-linear generalized reduced gradient (GRG) scheme is used for smooth non-linear problems, while the simple LP method is used for linear problems, and the expansion method is used for non-smooth problems. The GRG method is used in this study.

  6. 6.

    Following the click on the solve button, the equation is solved by the Solver and those numerical values of a, b, c, and d coefficients with the least error values are obtained.

There are several equations available in literature for SR estimation. In this study, Angstrom, Abdalla, and Hargreaves–Samani equations are considered incorporated with the mentioned meteorological data in Turkey. The explanatory meteorological variables for these equations include maximum, minimum, and mean temperatures; extraterrestrial radiation; mean moisture; n; and monthly maximum sunshine durations. The data measured in 163 meteorological stations are divided into two portions, 75% of which (randomly selected 1467 elements) are used to develop the models and the remaining 25% (randomly selected 489 data) are used to test the power of the models. The stations used for model train are indicated in red circles, and the stations used for model tests are indicated in green circles (Fig. 1). In the first stage, train and test data sets are used to calculate SR values for Turkey with the aid of original equations. In the second stage, a, b, c, and d coefficients of original equations were computed for 1975–2015 period, and actual SR values by using the long-term monthly mean values of the meteorological data were recorded at 163 stations with the aid of Solver menu of Microsoft Excel. In the third stage, the meteorological data of Angstrom, Abdalla, and Hargreaves–Samani equations are used to develop new equations with the aid of the MGGP method.

2.2.4 Performance assessment criteria

In this study, extraterrestrial radiation (Ra), sunshine durations (n), mean of monthly maximum sunny hours (N), long-term mean of monthly maximum air temperatures (Tmax), long-term mean of monthly minimum temperatures (Tmin), monthly mean air temperature (Tmean), and monthly mean moisture (RHmean) data measured at 163 stations of Turkey are used as input for (1) multi-gene genetic programming (MGGP), (2) calibrated three empirical equations, and (3) original empirical equations to estimate SR. Performance assessment criteria include mean absolute error (MAE), mean absolute relative error (MARE), mean square error (MSE),and root mean square error (RMSE). In this study, MAE, MARE, MSE, and RMSE are defined as follows:

$$ \mathrm{MAE}=\frac{1}{N_s}\left|{\mathrm{RS}}_e-{\mathrm{RS}}_m\right| $$
(4)
$$ \mathrm{MARE}=100\frac{1}{N_s}\left|\frac{{\mathrm{RS}}_e-{\mathrm{RS}}_m}{{\mathrm{RS}}_m}\right| $$
(5)
$$ \mathrm{MSE}=\frac{1}{N_s}{\left({\mathrm{RS}}_e={\mathrm{RS}}_m\right)}^2 $$
(6)
$$ \mathrm{RMSE}=\sqrt{\frac{1}{N_s}}{\left({\mathrm{RS}}_2-{\mathrm{RS}}_m\right)}^2 $$
(7)

3 Results and discussions

Turkey lies in a climatologically heterogeneous area of about 780 thousand km2 in the northern hemisphere. Accordingly, first, simple regressions between SR against each one of these input variables are done in order to have an initial assessment of dependence of SR individually on these explanatory variables. The correlation coefficients of the relationships between SR and each input data for different climate areas of Turkey are given in Table 1. The correlation coefficients for the minimum temperature and moisture of the Black Sea region are lower than those of the other regions. However, in general, the correlation coefficients for all of seven regions of Turkey are not that much different from each other, and they are also close to those for Turkey handled as one region. So, instead of trying separate models for each region individually, a generalized model is attempted for Turkey as a whole.

Table 1 Correlation coefficients indicating the relationships between each input data and SR for different climate zones of Turkey

Before developing new equations by MGGP and numerical optimization methods, comparison criteria (error values) are determined for the original equations considered in this study (Table 2). Comparison criteria of the original equations are given in Table 2. As can be seen in Table 2, the comparison criteria of the train and test data are similar. While original Abdalla and Hargreaves–Samani equations have quite large error values, original Angstrom equation yields reasonable ones. The comparison of the measured and estimated SR values from train and test data is presented in Figs. 4 and 5. As can be seen in Figs. 4 and 5, Angstrom equation yields better estimations for both test data sets. The values computed by Abdalla and Hargreaves–Samani equations are greater than the measured values. As seen in Figs. 4 and 5, Abdalla and Hargreaves–Samani equations have high determination coefficients. But, it can be observed that no points estimated Abdalla and Hargreaves–Samani equations drop on the y = x (45°) line. It is clear from Figs. 4 and 5 that the values computed by Abdalla and Hargreaves–Samani equations are not in compliance with the measured values, while the values yielded by Angstrom equation are more consistent.

Table 2 MAE, MARE, MSE, RMSE, and R2 statistics of original equations for train and test stages
Fig. 4
figure 4

SR values calculated with the original equations and measured by MGM for train data

Fig. 5
figure 5

SR values calculated with the original equations and measured by MGM for test data

The magnitudes of a, b, c, and d coefficients of Angstrom, Abdalla, and Hargreaves–Samani equations are calibrated for entire Turkey with the aid of Excel Solver. In this stage, three different equations are used, and hence, three different calibrations are performed. Coefficients of the original and the calibrated equations are given in Table 3. According to Table 3, the new coefficients are different from the original ones. As mentioned before, the coefficients of the original Hargreaves–Samani equation are different for coastal and inland areas. However, such a difference is not obtained in this study. The comparison criteria for three different calibration equations are given in Table 4. According to Table 4, the error values of the calibrated equations are appreciably lower than those of the original equations. There is a distinct decrease in error values of the calibrated Abdalla and Hargreaves–Samani equations. As can be seen from Figs. 6 and 7, the decrease in error values is reflected also on scatter diagrams of the calibrated equations. Yet, deviations from y = x (45°) line by the calibrated Hargreaves–Samani equation are still more than those of the other two equations in scatter plots of the train and test stages (Figs. 6 and 7).

Table 3 The coefficients of the original and the calibrated equations
Table 4 MAE, MARE, MSE, RMSE, and R2 statistics of calibrated equations for train and test stages
Fig. 6
figure 6

SR values calculated with the calibrated equations and measured by MGM for train data

Fig. 7
figure 7

SR values calculated with the calibrated equations and measured by MGM for test data

MGGP method is an alternative method employed in this study. SR values are also estimated by the MGGP method. The MGGP parameters listed in Table 5 are used to generate the prediction models for SR.

Table 5 The parameter settings of the MGGP model

The selection of parameter values of the MGGP approach is very important because it influences the generalization capability of the models to be formed. Therefore, the parameters given in Table 5 are determined through the preliminary studies. Especially, the population size and the number of generations are selected based on the complexity of the problem. The larger the values of these two parameters, the longer the study will take. The maximum number of genes (Gmax) and the maximum gen depth (Dmax) influence the size of the search universe and the solutions to be discovered in this search universe. More successful outcomes are achieved by increasing these parameters, but the complexity of the problem and the study duration increases in that case. Different Gmax and Dmax values are tried to make the model produce more accurate outcomes, and the best models are obtained by Dmax of 4–5 and Gmax of 3–4. MSE values are used for goodness-of-fit, and the model with the least MSE is identified as the best model. The maximum number of generations is used as the termination criterion. MAE, MARE, MSE, RMSE, and determination coefficient R2 criteria are used to assess the prediction models.

Equations generated by the MGGP method by using the data of Abdalla equation (n, N, Tmean, RHmean, Ra), of Angstrom equation (n, N, Ra), and of Hargreaves–Samani equation (Tmax, Tmin, Ra) are given below:

  • The equation generated with the data of Abdalla equation is:

$$ {\displaystyle \begin{array}{l}{R}_S=0.8348\times {R}_a\hbox{--} 0.4174\times {T}_{\mathrm{mean}}\hbox{--} 0.8348\times N+0.8348\times n+0.01629\times n\times \left({\mathrm{RH}}_{\mathrm{mean}}+3\times n\right)\\ {}\hbox{--} 0.003428\times {\mathrm{RH}}_{\mathrm{mean}}\times \left( Ra\hbox{--} {T}_{\mathrm{mean}}+5\times n+4.65\right)\hbox{--} 0.01629\times N\times \left(2\times n+\sin (N)\right)+\\ {}0.001393\times \left[{T}_{\mathrm{mean}}\hbox{--} n+{\mathrm{RH}}_{\mathrm{mean}}\right)\Big]\times \left({\mathrm{T}}_{\mathrm{mean}}\hbox{--} n+\sin (n)\right)\times \left(N\hbox{--} n+\mathit{\sin}\left(n\hbox{--} N\right)+\sin (n)\right)+5.099\end{array}} $$
(8)
  • The equation generated with the data of Angstrom equation is:

$$ {\displaystyle \begin{array}{l}{R}_S=0.8738\times {R}_a\hbox{--} 0.4369\times n\hbox{--} 0.4369\times N\hbox{--} 0.1311\times N2\Big)\hbox{--} 0.4369\times \sin \left(N+0.9479\right)\\ {}\hbox{--} 0.1311\times \left(N\hbox{--} 3.58\right)\times \left(N\hbox{--} n\right)\hbox{--} \left(12.27\times \left(N\hbox{--} {R}_a+n\times {R}_a\right)\right)/\exp \left(n+3.735\right)+3.618\end{array}} $$
(9)
  • The equation generated with the data of Hargreaves–Samani equation is:

$$ {\displaystyle \begin{array}{l}{R}_S=0.263\times {R}_a\hbox{--} 0.5171\times {T}_{\mathrm{max}}+0.007334\times \sin \left(0.6695\times {T}_{\mathrm{max}}\right)+0.8905\times \sin \left(0.6695\times {R}_a\right)\\ {}\hbox{--} 1.881\times \sin \left({R}_a+\sin \left({R}_a\right)\hbox{--} 0.07122\right)\hbox{--} 0.0113\times \left(0.2697\times {R}_a\hbox{--} \sin \left({R}_a\right)\right).\ast \left({T}_{\mathrm{min}}\hbox{--} 2\times {R}_a+10.69\right)\\ {}+0.007334\times \left({T}_{\mathrm{max}}+9.993\right)\times \left({T}_{\mathrm{max}}+10.61\right)+5.074\end{array}} $$
(10)

Comparisons of MAE, MARE, MSE, RMSE, and R2 values of the computed results of Eqs. 8, 9, and 10 against the measured values at train and test stages are provided in Table 6. According to Table 6, with the use of MGGP method, MARE values are decreased by 7–10% for train and 8–11% for test, MSE values decreased by 2–3. 3% for train and test, MARE and RMSE values decreased by 1–1.8% for train and 1.5–1.8% for test. It is observed in scatter plots in Figs. 8 and 9 that SR values obtained by the MGGP method are much scattered around the trend line than original equations.

Table 6 MAE, MARE, MSE, RMSE, and R2 statistics of equations obtained by the MGGP method for train and test stages
Fig. 8
figure 8

SR values calculated with MGGP and measured by MGM for train data

Fig. 9
figure 9

SR values calculated with MGGP and measured by MGM for test data

There are three different equations considered in this study: Abdalla, Angstrom, and Hargreaves–Samani. In the last stage of the study, first, comparisons are made for each one of these three equations separately. Then, all equations are compared. While SR estimation by the original Hargreaves–Samani equation turns out to be the least successful one, numerical optimization and MGGP methods yield much closer estimations to the actual values. Considering the RMSE values in Tables 2, 4, and 6, it is observed that the MGGP method has much lower RMSE values than the original Hargreaves–Samani equation and the calibrated Hargreaves–Samani equation. Therefore, the MGGP method is found to be much more successful than the other methods. Although the original Angstrom equation has lower error values than the other two original equations, it is not sufficiently successful in the estimation of SR values obtained with the aid of numerical optimization and the MGGP methods. It is observed in Tables 2, 4, and 6 that the RMSE values of three different methods are close to each other. The original Abdalla equation is identified as the weakest method in SR estimation. The MARE value of the original Abdalla equation for test data is identified as 317.42 (Table 2). However, the MARE values after numerical optimization and the MGGP methods for test data drop down to 10.09 and 8.32 (Tables 4 and 6), respectively. When the numerical optimization and the MGGP methods are compared, it is observed that the MGGP method has lower error values.

Comparing the magnitudes of the parameters used for goodness criteria for all of the methods and equations in Tables 2, 4, and 6, it is obvious that the MGGP method has the least error values and the best method for SR estimation. The equations obtained with MGGP approach are much practical and easy than the original and calibrated empirical equations.

4 Conclusions

In this paper, solar radiation values are estimated for Turkey using extraterrestrial radiation, maximum temperature (Tmax), minimum temperature, mean temperature sunshine durations, maximum monthly mean sunshine durations, and mean moisture data as independent variables. Original Abdalla, Angstrom, and Hargreaves–Samani equations, calibrated Abdalla, Angstrom, and Hargreaves–Samani equations and new equations improved by MGGP method are used for solar radiation prediction. These three different equations and three different methods are compared based on the commonly used performance criteria of MAE, MARE, MSE, RMSE, and R2 statistics. The following conclusions are drawn from the present study:

  • When the original Abdalla, Angstrom, and Hargreaves–Samani equations are compared, the Angstrom equation yields closer estimations to the SR values measured by the General Directorate of Meteorology of Turkey (MGM) than the other two equations.

  • When the estimation performance of the original, calibrated, and MGGP-derived equations are compared at train and test stages, it is observed that the SR values estimated by the MGGP method are closer to the SR values measured by MGM.

  • The MGGP-derived equation with the use of data of Angstrom equation is not found to be sufficiently successful in SR estimation as compared with the other two methods.

  • Of the equations developed by the MGGP method by using the data of the original equations, the new equation obtained through the use of the data of Abdalla equation yields better outcomes in terms of the error criteria than the new equations obtained through the use of the data of the other two methods.

  • Although multi-data models are not preferred for the third world countries, in MGGP model, multi-data models yield better outcomes than the single-data models.