1 Introduction

Solar radiation (SR) is the energy emitted by the sun [1]. The energy balances of several physical, chemical and biological processes are influenced by the solar radiation reaching the Earth's surface [2, 3, 4]. Changes in solar radiation have a significant impact on heat fluxes, the hydrological cycle, terrestrial biological ecosystems and climate [5, 6]. In addition, solar energy produces significantly less pollution than traditional sources such as fossil fuels; it is the most abundant renewable and sustainable energy resource worldwide and can be exploited commercially through large solar power plants [7, 8, 9]. Thus, precise measurement and comprehension of the spatial–temporal variability of solar radiation are critical for meteorological and hydrological processes as well as for energy development and usage [10, 11].

Several studies have forecast SR for use in meteorological, hydrological and agricultural applications [12]. For example, Ododo et al. [13] proposed temperature as a predictor of solar radiation. According to Bandyopadhyay et al. [14], SR has substantial relationships with air temperature. Ododo [15] used relative humidity and maximum temperature to forecast solar radiation. Rehman and Mohandes [16] utilized average air temperature measurements as input data to forecast solar radiation. Kisi et al. [17] suggested several meteorological parameters for the SR forecast. In addition, recent studies in the literature have shown that station location information is used in the forecast of global solar radiation [18]. For example, Kumar et al. [19] reviewed different models for SR forecasting with latitude, longitude and altitude data, and Chabane et al. [20] estimated SR as a function of latitude and longitude coordinates.

Over 400 articles on machine learning (ML) approaches for SR forecasting were found in the Scopus database. The VOSviewer tool was used to generate a list of important keywords for this research domain (Fig. 1a). Furthermore, when the retrieved research is examined across time (Fig. 1a), it is clear that many studies were published in 2018 and beyond. These studies appear to be increasingly interested in climate change, deep learning, newer machine learning models such as SVM and ELM, and the development of renewable energy generation. Figure 1b shows the main regions where solar radiation estimation has been investigated: China has the most studies (76), followed by the USA (63), India (51), Spain (25), Iran (22), France (21) and Turkey (18).

Fig. 1 Literature review: a keywords for SR forecasting using ML approaches; b research regions

Some researchers have investigated SR modeling using different mathematical equations and ML approaches. For example, Kumar et al. [19] compared regression models with ANN models for SR prediction. Kisi et al. [17] employed the wavelet transform approach with ANN, ELM and radial basis function (RBF) models and their hybrid variants. Rahimikhoob et al. [21] compared ANN and statistical methodologies for deriving SR from satellite images. Polo et al. [22] investigated the sensitivity of satellite-based approaches for calculating SR to various aerosol inputs and model choices. Ahmad and Tiwari [23] investigated various SR models and discovered that the Collares-Pereira and Rabl model, as modified by Gueymard, had the best accuracy for projecting mean hourly SR, and that the Ertekin and Yaldiz model performed best against measured data from Konya, Turkey. Sonmete et al. [24] compared 147 SR models available in the literature for monthly solar radiation estimation in Ankara (Turkey). Citakoglu [25] compared the ANFIS, ANN and MLR models and different empirical equations; the results showed that, for estimating monthly SR in Turkey, the ANN model outperformed the ANFIS, MLR and empirical equations. Wang et al. [11] compared three different ANN methods (GRNN, RBNN and MLP models) for predicting daily SR using meteorological variables such as air temperature, relative humidity and sunshine duration. To our knowledge, no research has evaluated the performance of machine learning approaches for SR prediction by examining optimum conditions such as optimization or training algorithms and by reducing the input parameters.

Solar radiation is studied widely around the world, particularly in solar-rich locations such as the Mediterranean and the Middle East [26, 27]. Unfortunately, most sites lack measurements of observed solar radiation. The costs of obtaining, installing and maintaining devices, as well as difficulties in calibrating radiation-detecting equipment, are the main causes of the absence of trustworthy radiation data [17]. As a result, location-based models, temperature-based models, remote sensing-based approaches, day- and month-number-based models, cloudiness-based models, sunshine-based models and hybrid models are all commonly employed to estimate solar radiation [11, 25, 28,29,30,31,32,33,34,35,36,37,38]. However, due to intricate connections between independent and dependent variables, these models cannot always provide trustworthy estimates, particularly in humid places where solar radiation is heavily influenced by clouds [11].

The aim of this study is to investigate the forecasting of SR with five different ML approaches: long short-term memory (LSTM), support vector machine regression (SVMR), Gaussian process regression (GPR), extreme learning machines (ELM) and K-nearest neighbors (KNN). Geographical position (latitude, longitude and elevation), the time information of the station measurements (month and year) and monthly observed meteorological measurements (temperature, evaporation, wind speed and relative humidity) from 163 meteorological stations in Turkey were used to estimate SR. This research will make a substantial contribution to the existing literature in the following ways:

(i) The majority of Turkish meteorological stations are used in the SR forecasting process, and the data cover a continuous, long-term recording period.

(ii) Five different models were utilized for SR forecasting and compared. In the comparisons, the sub-parameters and the number of inputs were varied, and the best result was determined for each model parameter and input count.

(iii) Variance inflation factor (VIF) analysis was performed to enhance SR forecasting accuracy, and the input parameters that reduced model accuracy were excluded from the study.

(iv) Finally, the Kruskal–Wallis and ANOVA tests were used to detect whether the estimated and measured data were from the same distribution.

The next section of the paper presents the materials and method, including the study area description and data set and the theoretical framework of the ML approaches; this is followed by the performance metrics (Sect. 3) and the application of variable selection (Sect. 4). Section 5 then presents the results and discussion, and Sect. 6 presents the concluding remarks.

2 Materials and method

2.1 Study area description and data set

Turkey is bordered by the sea on three sides (north, south and west). It lies between latitudes 36° and 42° N and longitudes 26° and 45° E. The country has a roughly rectangular shape with a width of 1660 km. The actual surface area of the country, including lakes and islands, is 814,578 km², whereas the projected area is 783,562 km². The large gap between these two values is due to the country's steep and rugged terrain. The highest and most mountainous regions are largely found in the east, while the interior of the country is primarily flat. A dry climate prevails in the interior of Turkey: summers are hot and dry, and winters are dry and chilly, especially in areas distant from the sea. A continental climate is seen in the mountainous parts of the Eastern Region, Southeast Region and Inner Region of Turkey; in this climate type, the annual temperature range is large and the winters are cold [39].

The data used in this study were obtained from the Turkish General Directorate of Meteorology (MGM). In total, 163 meteorological stations were used. Monthly solar radiation (SR, MJ/m²), maximum temperature (Tmax, °C), average temperature (Tavg, °C), minimum temperature (Tmin, °C), average wind speed (WSavg, m/s), elevation (m), year, longitude (°), month, latitude (°), maximum relative humidity (RHmax, %) and minimum relative humidity (RHmin, %) data were supplied by MGM. The data cover the years 1967–2020; the stations are located at elevations between 2 and 1777 m, the highest temperature is 46.40 °C, the recorded relative humidity reaches a maximum of 110%, and the maximum solar radiation read at the regional stations is 31.54 MJ/m². The month variable provides a continuous, 12-month periodic component. In the modeling phase, these parameters were introduced as input data in turn. The order of the input data in the models was determined by the strength of the correlation between SR and each parameter, from strongest to weakest. The parameters used in the SR estimation are, in this order, Tmax, Tavg, Tmin, WSavg, elevation, year, longitude, month, latitude, RHmax and RHmin. The correlation between these data and SR is given in Fig. 2.

Fig. 2 Correlation matrix between SR and each input variable

In Fig. 2, a strong correlation is observed between the temperature variables and SR, while negative correlations are observed between RH and SR. Before modeling, the data were randomly divided into two parts: training and test data. Of the 163 stations' data, 70% was used in the training phase to construct the models, and the remaining 30% was used in the testing phase to compare the performance of the models. These training and test ratios are frequently used and recommended in the literature [11, 40, 41, 42]. Figure 3 shows the stations that were utilized.
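The station-level split could be reproduced in MATLAB as in the minimal sketch below; the table and column names (data, StationID) are hypothetical placeholders, and the fixed random seed is an assumption added for reproducibility.

    % Random 70/30 split at station level (sketch; names are placeholders)
    rng(1);                                  % fix the seed for reproducibility
    ids      = unique(data.StationID);       % the 163 station identifiers
    ids      = ids(randperm(numel(ids)));    % shuffle the stations at random
    nTrain   = round(0.7*numel(ids));        % 70% of the stations for training
    trainIDs = ids(1:nTrain);
    isTrain  = ismember(data.StationID, trainIDs);
    trainSet = data(isTrain,  :);            % used to construct the models
    testSet  = data(~isTrain, :);            % held out to compare performance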

Fig. 3 Spatial distribution of training and testing stations

Although the aim of the study was a regionally homogeneous distribution of stations, the completely random selection left some regions, especially in the north, without training stations, while some other regions, especially in the interior, lack test stations. The distance between, and independence of, the training and test stations (Fig. 3) show that a solution is being sought for a difficult problem. Statistical information about the training and test stations is given in Table 1.

Table 1 Information about the test and training data

When the stations were separated into the training and testing phases, no station-level balancing was performed after the data were split according to the training/test ratio. As a result, for some stations the entire recording period was used in the training phase, while for others the earlier years fell into the training phase and the data of the remaining years were transferred to the testing phase. This is why Table 1 shows year values ranging from 1967 to 2020 in both sets. In Table 1, the data are distributed homogeneously; for example, maximum temperatures are around 45–46 °C and SR values are about 31 MJ/m² in both sets.

2.2 Long short-term memory (LSTM)

LSTM was first presented by Hochreiter and Schmidhuber [43] based on recurrent neural networks (RNN). It was created to solve the vanishing and exploding gradient difficulties of RNNs and, using its unique structure of gates and a cell state, it can maintain dependencies over long periods of time. To ensure the completeness of this work, a brief review of the LSTM unit is offered below; the fundamentals of RNN and LSTM are extensively described in [44]. The LSTM cell is made up of four neural network layers that interact in a particular way. The network is built up of memory blocks, also known as cells: information is stored in one cell and then passed to the next under the control of gates, which make it much easier to regulate precisely which information is kept [45, 46]. Figure 4 shows the construction of the LSTM. The LSTM equations are listed below:

$$i_{t} = \sigma (W_{\text{i}} x_{t} + U_{\text{i}} h_{t - 1} + b_{\text{i}} )$$
(1)
$$f_{t} = \sigma (W_{\text{f}} x_{t} + U_{\text{f}} h_{t - 1} + b_{\text{f}} )$$
(2)
$$o_{t} = \sigma (W_{\text{o}} x_{t} + U_{\text{o}} h_{t - 1} + b_{\text{o}} )$$
(3)
$$\tilde{C}_{t} = \tanh (W_{\text{c}} x_{t} + U_{\text{c}} h_{t - 1} + b_{\text{c}} )$$
(4)
$$C_{t} = f_{t} \otimes C_{t - 1} + i_{t} \otimes \tilde{C}_{t}$$
(5)
$$h_{t} = o_{t} \otimes \tanh (C_{t} )$$
(6)
Fig. 4 LSTM structure

In the equations, it, ft and ot are the input, forget and output gates, respectively; Wi, Wf and Wo are the weights connecting the input, forget and output gates to the input, respectively; Ui, Uf and Uo are the recurrent weights connecting the hidden state to the input, forget and output gates, respectively; bi, bf and bo are the input, forget and output gate bias vectors, respectively; \(\widetilde{C}\)t is the candidate cell state; Ct is the current cell state; ht−1 is the cell's output at the previous time step; and ht is the output of the cell [47]. In this study, SR was estimated using meteorological, spatial and temporal input parameters. The LSTM model was implemented with codes written in MATLAB. The adaptive moment estimation (Adam), stochastic gradient descent with momentum (SGDM) and root-mean-square propagation (RMSProp) optimization algorithms were used for training the model, and their forecasting performances were compared. For details of the optimization algorithms, Pandey and Srivastava [48] can be examined.

The equations for Adam are as follows:

$$m_{t} = \beta_{1} m_{t - 1} + (1 - \beta_{1} )g_{t}$$
(7)
$$v_{t} = \beta_{2} v_{t - 1} + (1 - \beta_{2} )g_{t}^{2}$$
(8)
$$m^{^{\prime}}_{t} = \frac{{m_{t} }}{{1 - \beta_{1}^{t} }},\;\;\;v^{^{\prime}}_{t} = \frac{{v_{t} }}{{1 - \beta_{2}^{t} }}$$
(9)
$$\theta_{t + 1} = \theta_{t} - \frac{\eta }{{\sqrt {v^{^{\prime}}_{t} + \in } }}m^{^{\prime}}_{t}$$
(10)

The equations for RMSProp are as follows:

$$E\left[ {g^{2} } \right]_{t} = 0.9E\left[ {g^{2} } \right]_{t - 1} + 0.1g^{2}_{t}$$
(11)
$$\theta_{t + 1} = \theta_{t} - \frac{\eta }{{\sqrt {E\left[ {g^{2} } \right]}_{t} + \in }}g_{t}$$
(12)
$$g_{t} = \nabla_{\theta t} J(\theta_{t} )$$
(13)

The equations for SGDM are as follows:

$$v_{t + 1} = \gamma v_{t} + \eta \nabla_{\theta } J$$
(14)
$$\theta_{t + 1} = \theta_{t} - v_{t + 1}$$
(15)

θ ∈ ℝd: model parameters; η: learning rate; ∇θJ(θt; x(i); y(i)): the gradient of the objective function with respect to the parameters; mt and vt: estimates of the first and second moments of the gradient, with decay rates β1 and β2 and bias-corrected forms m′t and v′t; γ: the momentum coefficient; and ϵ: a small constant that prevents the learning rate term from dividing by 0 [49].
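As a minimal illustration, the LSTM training setup used in this study (hyperparameters as reported in Sect. 5) could be configured with MATLAB's Deep Learning Toolbox as sketched below; X, Y and Xtest are placeholders for the standardized inputs and SR targets, and the specific neuron count shown is only one of the values tried (tanh state and sigmoid gate activations are the lstmLayer defaults).

    % Sketch of the LSTM regression setup (Deep Learning Toolbox)
    layers = [ ...
        sequenceInputLayer(9)               % 9 inputs retained after VIF analysis
        lstmLayer(20)                       % 10-30 hidden neurons were tried
        fullyConnectedLayer(1)              % single SR output
        regressionLayer];
    options = trainingOptions('sgdm', ...   % 'adam' and 'rmsprop' were also compared
        'MaxEpochs',           300, ...     % 50-300 iterations were tried
        'InitialLearnRate',    0.05, ...
        'LearnRateSchedule',   'piecewise', ...
        'LearnRateDropFactor', 0.2, ...     % learning rate reduction factor
        'LearnRateDropPeriod', 125, ...     % learning rate reduction period
        'Verbose',             false);
    net   = trainNetwork(X, Y, layers, options);
    SRhat = predict(net, Xtest);            % SR forecasts for the test data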

2.3 Support vector machine regression (SVMR) model

Support vector machine (SVM) was first proposed by Vapnik [50] in 1995. The concept of SVM is based on statistical learning theory and the principle of structural risk minimization [50]. Smola [51] devised a form of regression model called support vector machine regression (SVMR). SVMR models were created by merging regression functions with SVM to handle forecasting, prediction and regression problems [52, 53]. The SVMR model's main goal is to discover a function that deviates from the target vectors by at most ε for all training data points while being as flat as possible [51]. The SVMR model's structural configuration is shown in Fig. 5. The SVMR regression function is summarized as follows [54]:

$$f\left( x \right) = w \cdot \phi \left( x \right) + b$$
(16)
Fig. 5 Nonlinear support vector regression configuration

In the equation, w is the weight vector, b is the bias term, and ϕ is the transfer (mapping) function.

Optimal conditions are obtained with Lagrange multipliers and a kernel function in SVMR. Linear, polynomial, radial basis function (RBF) and sigmoid functions are examples of kernel functions [39, 55, 56]. The SVMR model was implemented with codes written in MATLAB. The linear, polynomial and radial basis kernel functions were used for training the model, and the forecasting performances were compared. The following technical report contains more information on SVM and SVMR approaches: Classification and regression with support vector machines [57].
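The sketch below shows how such an SVMR model could be fitted with MATLAB's Statistics and Machine Learning Toolbox; Xtrain, Ytrain and Xtest are placeholders, the kernel shown is one of the three that were compared, and the SMO solver is the one reported in Sect. 5.

    % Sketch of the SVMR fit (Statistics and Machine Learning Toolbox)
    mdl = fitrsvm(Xtrain, Ytrain, ...
        'KernelFunction', 'polynomial', ...  % 'linear' and 'rbf' were also compared
        'Solver',         'SMO', ...         % sequential minimal optimization
        'Standardize',    true);             % standardize the inputs (assumption)
    SRhat = predict(mdl, Xtest);             % SR forecasts for the test data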

2.4 Gaussian process regression (GPR) model

GPR is a probabilistic nonparametric approach with which both estimates and confidence intervals are calculated. GPR is a significant extension of the Gaussian probability distribution: the probability of a Gaussian distribution is calculated using the input vectors, and the probability of each input data vector is determined. As a result, the GPR model computes a mean and a variance–covariance vector [58, 59]. The GPR regression function is:

$$f \sim \mathcal{GP}\left(m(x),k(x,x^{\prime})\right)$$
(17)

where x is the vector of input variables; m(x) is the mean function of the input variables; and k(x, x′) is the variance–covariance (kernel) function. The variance–covariance matrix defines the shape of the multivariate Gaussian distribution.

Kernel covariance functions (ardmatern32, ardmatern52, squaredexponential, ardsquaredexponential, matern32, matern52) and basis functions (constant, none, linear, pureQuadratic) were used in this study because they performed better in forecasting studies than the others. In the SR estimation, the function that gives the least error according to the error criteria in the next section is used. For details of the covariance functions, Rasmussen and Williams [58] and Neal [60] can be examined.
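The sketch below illustrates such a GPR fit in MATLAB; Xtrain, Ytrain and Xtest are placeholders, the kernel and basis choices shown are one of the combinations tried, and the fit and predict approximations are those reported in Sect. 5.

    % Sketch of the GPR fit (Statistics and Machine Learning Toolbox)
    mdl = fitrgp(Xtrain, Ytrain, ...
        'KernelFunction', 'ardmatern52', ... % one of the covariances listed above
        'BasisFunction',  'constant', ...    % constant/none/linear/pureQuadratic tried
        'FitMethod',      'sr', ...          % subset-of-regressors approximation
        'PredictMethod',  'fic');            % fully independent conditional approx.
    [SRhat, sd] = predict(mdl, Xtest);       % mean forecast and its standard deviation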

2.5 Extreme learning machines (ELM)

Extreme learning machine (ELM) was first presented by Huang et al. in 2006. ELM is a training algorithm for single-hidden-layer feedforward neural networks that converges significantly faster than traditional ANN methods and produces promising results [61, 62]. This is because the input weights are generated at random, so that a unique least-squares solution for the output weights can be obtained via the Moore–Penrose generalized inverse [63]. Because the randomly initiated hidden neurons are fixed, ELM is extraordinarily efficient at reaching a global optimum solution with universal approximation capability. Slow convergence, poor generalization, local minima difficulties, overfitting and the necessity for iterative tuning are key drawbacks of the ANN model, all of which point to ELM's advantage over ANN [63, 64, 65]. The ELM model's general structural configuration is given in Fig. 6.

Fig. 6 ELM structure

The ELM regression function is summarized as follows:

$$\sum\limits_{i = 1}^{L} {B_{i} g_{i} } (\alpha_{i} x_{t} + \beta_{i} ) = z_{t}$$
(18)

In Eq. 18, L is the number of hidden nodes, gi(αi xt + βi) is the hidden layer output function, αi and βi are the hidden node parameters, Bi is the weight connecting the ith hidden node to the output node, and zt is the ELM model output.

The ELM model was implemented with codes written in MATLAB. The number of hidden neurons was tried from 1 to 300, and the training ratio was chosen as 0.7 in this study. The input parameters in Fig. 6 were introduced to the ELM model one by one according to the correlation order described in the data set section (see Fig. 2).
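The sketch below implements Eq. (18) directly, assuming a sigmoid hidden-layer activation; the output weights are obtained with the Moore–Penrose pseudoinverse (pinv), and the variable names and specific neuron count are placeholders.

    % Sketch of an ELM with L hidden neurons and sigmoid activation (Eq. 18)
    L     = 100;                             % 1-300 hidden neurons were tried
    alpha = 2*rand(L, size(Xtrain,2)) - 1;   % random input weights in [-1, 1]
    beta  = rand(L, 1);                      % random hidden node biases
    g     = @(X) 1./(1 + exp(-(alpha*X' + beta)));  % hidden layer output (L-by-N)
    H     = g(Xtrain);                       % hidden layer matrix, training data
    B     = pinv(H') * Ytrain;               % least-squares output weights
    SRhat = g(Xtest)' * B;                   % ELM forecasts for the test data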

2.6 K-nearest neighbors (KNN)

KNN is a nonparametric classification method invented by Evelyn Fix and Joseph Hodges in 1951 [66] and expanded by Altman [67]. KNN is used for both data classification and regression; in both cases, the input consists of the k closest training samples in the data set. The KNN approach searches a database for data that are most similar to the observed data; these data are referred to as the nearest neighbors of the present data [68]. In this paper, KNN is used to forecast SR at the testing stations from the most closely related training stations. The KNN regression function is summarized as follows:

$$f_{{{\text{KNN}}}} (x^{\prime}) = \frac{1}{K}\sum\limits_{{i \in N_{K} (x^{\prime})}} {y_{i} }$$
(19)

For an unknown pattern x′, KNN regression computes the mean of the function values of its K-nearest neighbors, with the set NK(x′) containing the indices of the K-nearest neighbors of x′. The notion of localization of functions in data and label space underpins the idea of averaging in KNN: patterns x in the local neighborhood of xi are expected to have continuous labels similar to yi = f(xi) [69]. The KNN model was implemented with codes written in MATLAB. The kdtree and exhaustive nearest neighbor search methods were used for training the model, and the forecasting performances were compared. The flowchart of this study is given in Fig. 7.
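The sketch below carries out KNN regression following Eq. (19) with a kd-tree searcher in MATLAB; the neighbor count K, the distance metric and the variable names are assumptions.

    % Sketch of KNN regression (Eq. 19); an exhaustive searcher was also compared
    K     = 5;                                   % assumed neighbor count
    NS    = createns(Xtrain, 'NSMethod', 'kdtree', ...
                     'Distance', 'euclidean');   % kd-tree on the training inputs
    idx   = knnsearch(NS, Xtest, 'K', K);        % indices of the K nearest neighbors
    SRhat = mean(Ytrain(idx), 2);                % average the neighbors' SR values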

Fig. 7 Study flowchart

3 Performance metrics

The accuracy of the models proposed in this research was evaluated using widely known performance metrics [70]: the mean absolute error (MAE), mean absolute relative error (MARE), root-mean-square error (RMSE), coefficient of determination (R2) and Nash–Sutcliffe efficiency (NSE). Low MAE, MARE and RMSE values, as well as R2 values near 1, suggest accurate and dependable estimations. NSE values range from −∞ to 1 [71].

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({\mathrm{SR}}_{\mathrm{predicted}}-{\mathrm{SR}}_{\mathrm{measured}}\right)}^{2}}$$
(20)
$$\mathrm{MARE}=100\frac{1}{n}\sum_{i=1}^{n}\frac{\left|{\mathrm{SR}}_{\mathrm{predicted}}-{\mathrm{SR}}_{\mathrm{measured}}\right|}{{\mathrm{SR}}_{\mathrm{measured}}}$$
(21)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|{\mathrm{SR}}_{\mathrm{predicted}}-{\mathrm{SR}}_{\mathrm{measured}}\right|$$
(22)
$$R^{2}=\frac{\left[\sum_{i=1}^{n}\left({\mathrm{SR}}_{i,\mathrm{measured}}-\overline{{\mathrm{SR}}_{\mathrm{measured}}}\right)\left({\mathrm{SR}}_{i,\mathrm{predicted}}-\overline{{\mathrm{SR}}_{\mathrm{predicted}}}\right)\right]^{2}}{\sum_{i=1}^{n}{\left({\mathrm{SR}}_{i,\mathrm{measured}}-\overline{{\mathrm{SR}}_{\mathrm{measured}}}\right)}^{2}\sum_{i=1}^{n}{\left({\mathrm{SR}}_{i,\mathrm{predicted}}-\overline{{\mathrm{SR}}_{\mathrm{predicted}}}\right)}^{2}}$$
(23)
$$\mathrm{NSE}=1-\frac{\sum_{i=1}^{n}{\left({\mathrm{SR}}_{\mathrm{predicted }}-{\mathrm{SR}}_{\mathrm{measured}}\right)}^{2}}{\sum_{i=1}^{n}{\left({\mathrm{SR}}_{\mathrm{measured}}-\overline{{\mathrm{SR} }_{\mathrm{measured}}}\right)}^{2}}$$
(24)

where SRmeasured denotes the SR values measured by MGM; SRpredicted denotes the SR values predicted by the approaches; \(\overline{{SR }_{\mathrm{measured}}}\) is the average of the measured SR values; and n is the number of data points. In this study, Taylor diagrams, violin plots and box error plots were used to compare the LSTM, SVMR, GPR, ELM and KNN approaches. These diagrams graphically summarize how close the models are to the observations [72, 73, 74]. Comparisons in the Taylor diagram were made using the model correlations and the root-mean-square deviation (RMSD). The violin diagram, on the other hand, uses several statistical parameters such as the mean, median and standard deviation. In addition, for the final evaluation of model performance, a spider graph of the methods for the input combinations that gave the best results was drawn, so that multiple evaluation criteria could be assessed in a single figure [75].
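As a minimal sketch, the metrics in Eqs. (20)–(24) could be computed in MATLAB as below; the function and argument names are placeholders (saved, e.g., as srMetrics.m).

    % Sketch of the performance metrics in Eqs. (20)-(24)
    function s = srMetrics(meas, pred)           % meas, pred: column vectors
        e      = pred - meas;
        s.RMSE = sqrt(mean(e.^2));                           % Eq. (20)
        s.MARE = 100*mean(abs(e)./meas);                     % Eq. (21)
        s.MAE  = mean(abs(e));                               % Eq. (22)
        c      = cov(meas, pred);
        s.R2   = c(1,2)^2/(c(1,1)*c(2,2));                   % Eq. (23)
        s.NSE  = 1 - sum(e.^2)/sum((meas - mean(meas)).^2);  % Eq. (24)
    end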

4 Application of variable selection

In this study, model development was realized by reducing the input parameters. The models' variance inflation factors (VIFs) were calculated in three steps, and significant variables were selected from among many potential variables; the VIF of predictor j is 1/(1 − Rj²), where Rj² is the coefficient of determination obtained by regressing predictor j on the remaining predictors. Table 2 shows the computed VIFs for each step, with VIFs greater than 5.0 shown in bold. A VIF of 5.0 is the critical value, and parameters exceeding it should be excluded from the modeling [25]. In the first step, the VIFs of the Tavg and Tmin variables are all greater than 5.0, as shown in Table 2, and the t values of Tavg and Tmin are also less than tcri. As a result, the variables Tavg and Tmin are no longer included in the models. In the second step, a smaller number of variables is purposefully chosen; after the high VIFs observed for Tavg and Tmin, these two variables are deleted for good, as they are highly correlated. In the last step, the VIFs of all coefficients are less than 5.0, and the t values of all variables are greater than tcri. Consequently, Tmax (°C), WSavg (m/s), elevation (m), year, longitude (°), month, latitude (°), RHmax (%) and RHmin (%) are selected as valid input variables.

Table 2 Summary of the results of the VIF analysis

The performance of the LSTM, SVMR, GPR, ELM and KNN models was checked using the suggested 9 input parameters (e.g., Tmax as the first input, WSavg as the second, …, RHmin as the last) during both the training and testing phases. The results are given in the next section.
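A minimal sketch of the VIF screening is given below, using the standard identity that the VIFs equal the diagonal of the inverse correlation matrix of the predictors; X is a placeholder for the n-by-p matrix of candidate inputs.

    % Sketch of VIF-based input screening
    R    = corrcoef(X);            % p-by-p correlation matrix of the inputs
    vif  = diag(inv(R));           % VIF_j = 1/(1 - R_j^2) for each predictor
    keep = vif < 5.0;              % 5.0 is the critical value used above
    X    = X(:, keep);             % drop the collinear inputs (here Tavg and Tmin)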

5 Results and discussion

The LSTM, SVMR, GPR, ELM and KNN techniques were utilized in this study to create models for forecasting SR in Turkey using meteorological parameters and spatial and temporal information.

The LSTM model was used to estimate SR in the first part of the research. Several trials were undertaken throughout the creation phase of the LSTM models by adjusting the number of neurons in the hidden layer. In the LSTM models, tanh was used as the state activation function and sigmoid as the gate activation function [39]. Also, the Adam, SGDM and RMSProp optimization algorithms were used for network training. Trials were conducted with a single hidden layer, between 10 and 30 neurons, and 50 to 300 iterations in the LSTM model architecture. For the other parameters of the LSTM model, the initial learning rate was set to 0.05, the learning rate reduction factor to 0.2 and the learning rate reduction period to 125. The selection of the LSTM model parameters was inspired by [39]. The best outcome for each output value obtained in the LSTM modeling trials is provided in Table 3.

The SVMR model was utilized to estimate SR in the study's second phase. Estimation models were created with different kernel functions in the SVMR technique. The common nonlinear radial basis function (RBF), linear and polynomial kernels were utilized in this study because they performed better in estimation studies than the other kernel functions. The lowest values of the alpha (αi − α*i) parameters, representing the difference between two Lagrange multipliers, and of the bias (b) were obtained with sequential minimal optimization (SMO).

The GPR model was utilized to estimate SR in the third phase. Estimation models were created using kernel and basis functions in the GPR technique, and many kernel functions were tested. To obtain the best performance, the matern32, matern52, ardmatern32, ardmatern52, ardsquaredexponential and squaredexponential covariances were tried. Similarly, several basis functions were tried: constant, none, linear and pureQuadratic. The function that gave the least error in the training phase was used in the testing phase. The subset-of-regressors approximation and the fully independent conditional approximation were used to determine the beta and sigma parameters of the GPR approach [39].

In the fourth phase of the present study, the ELM model was used for the estimation of SR. The ELM trains a single-hidden-layer feedforward network using the Moore–Penrose pseudoinverse of a matrix [62]. In the ELM approach, estimation models were developed by varying the number of neurons in the hidden layer, using the standardization equation (Eq. 25) to introduce the data to the model and a training ratio of 0.7 [62]. The number of hidden neurons was tried from 1 to 300, and the error criteria in the testing phase were obtained for the neuron count that gave the least RMSE error.

Table 3 Comparison of the model results of the testing phase after reducing the number of inputs
$$y=\frac{xi-\overline{x}}{\sigma }$$
(25)

The KNN model was utilized to estimate SR in the study's final phase. To determine the k-nearest neighbors, estimation models were created using the exhaustive and kdtree functions of the KNN technique, and many distance metrics were tested. To obtain the best performance, the seuclidean, cosine, hamming, correlation, mahalanobis, jaccard and spearman distance metrics were tried with the exhaustive function. Similarly, several kdtree distance functions were tried: euclidean, cityblock, minkowski and chebychev. The function that gave the least error in the training phase was used in the testing phase.

A direct comparison of the approaches in the testing phase is made in Table 3. The input parameters were introduced to the models in order of the magnitude of their correlation with SR. The input parameters used for SR forecasting are, in this order, Tmax, WSavg, elevation, year, longitude, month, latitude, RHmax and RHmin.

It can be noted that the LSTM model outperformed the GPR, SVMR, KNN and ELM models in terms of each average performance metric in the testing phase. MARE values (%) varied between 15.17 and 28.31, MAE values between 1.759 and 3.358 and RMSE values between 2.297 and 4.422. In the testing phase, the best input combination was observed with the SGDM optimization algorithm of LSTM, in which 7 input parameters were used. When the models are compared within themselves, the kernel function is superior to the basis function in GPR. Similarly, SGDM is the LSTM optimization algorithm that gives the lowest error metrics, followed by Adam and RMSProp. In SVMR, the polynomial kernel gives the lowest error, followed by the RBF and linear kernels. The lowest error in KNN was observed with the kdtree function, followed by the exhaustive function. The optimum sets of model inputs were not the same for each of the investigated predictive modeling strategies, demonstrating that each model type reacts differently to distinct input variable sets and data patterns/attributes in the input data [76]. Overall, the most accurate input combinations for LSTM, SVMR, GPR, ELM and KNN were models 7, 7, 9, 6 and 8, respectively. The evaluation of the different modeling approaches (LSTM, SVMR, GPR, ELM and KNN) with different sets of input variables (i.e., 1–9) shows that the most accurate predictions depend on the model used and on its optimization.

An NSE value of 1 is ideal, as it indicates a perfect match. Low estimation success is indicated by NSE values between 0.3 and 0.5, acceptable estimation success by values between 0.5 and 0.7, great estimation success by values between 0.7 and 0.9, and outstanding estimation success by values between 0.9 and 1 [71]. In Table 3, the mean NSE values ranged between 0.228 and 0.875. These values indicate that the models in which some inputs are used show low estimation success, but the best model shows great estimation success. For example, according to the NSE criterion, the most successful result was obtained in the modeling using the 7-input combination in the SGDM architecture of the LSTM approach. The estimating power of the LSTM with SGDM model was fairly good. The current findings demonstrate that the LSTM model could capture the nonlinear relationships between the variables, indicating that it performed well.

The scatter plots for the LSTM, SVMR, GPR, ELM and KNN models are shown in Fig. 8, together with the coefficient of determination R2 and the regression equation (y = ax + b). With the best R2 value of 0.8957, the LSTM model obtained the best fit line between observed and predicted SR values using the 7-input combination. The SVMR, GPR, ELM and KNN approaches had R2 = 0.8871, 0.8701, 0.796 and 0.773, respectively.

Fig. 8 Scatterplot comparison of measured and model-predicted SR values

Figure 8 shows the observed and estimated SR values for the five models during the testing phase and indicates where SR values are underestimated or overestimated. As shown in the figure, low SR values are slightly overestimated (this can be observed by following the dashed black line, the 1:1 or best line, y = x). The relatively weak performance for these extreme values of SR indicates that the models likely fall short on the training data set used to estimate their parameters. When the relationship between the scatterplots of the model results and the best line is examined, it is observed that the models forecast low SR values (< 10 MJ/m²) too high and high values (> 20 MJ/m²) too low. This can be observed at the intersection of the best line and the model regression line (red) and is a disadvantage observed in all approaches. Although all approaches predict intermediate SR values (10–20 MJ/m²) better, some deviations were observed when estimating high and low values. The convergence to the best line is greatest in the LSTM-SGDM approach, while in the KNN approach some low values are estimated as pronounced (high-valued) outliers.

In the previous statements, LSTM was considered the best model for the SR forecast since it had the lowest RMSE, MAE and MARE and the highest R2 and NSE values. All data were distributed around the regression lines in the scatter plots, and it was seen that all models essentially followed the same regression lines. Although the MAE, MARE and RMSE error criteria indicate the correctness of the forecasted variables, they do not offer information about the distribution of the models' errors [39]. Therefore, a violin plot (Fig. 9), a box error plot (Fig. 10) and a Taylor diagram (Fig. 11) were used for comparison.

Fig. 9 Violin plot for the GPR, LSTM, SVMR, ELM and KNN approaches

Fig. 10 Box plot diagrams and error diagrams for the GPR, LSTM, SVMR, ELM and KNN approaches

Fig. 11 Taylor diagrams for the GPR, LSTM, SVMR, ELM and KNN approaches

The conformity of the estimated data with the observed data was examined using the violin plot, with which further statistical comparisons of the models were conducted. Figure 9 shows a violin plot for the best outcome of the LSTM, SVMR, GPR, ELM and KNN techniques. Differences between the ML approaches are seen in the errors presented by the box plots (Fig. 10), with smaller error values observed for the GPR, LSTM, SVMR and ELM models. The error graph was obtained as the absolute value of the difference between the observed and predicted values [77]. The Taylor diagram in Fig. 11 graphically represents the root-mean-square deviation (RMSD, cf. Eq. 20) between the model and the observed values together with the correlation between these two sets of values [78].
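As a minimal sketch, the three statistics summarized by a Taylor diagram (the standard deviations, the correlation and the centered RMSD) can be computed in MATLAB as follows; meas and pred are placeholder column vectors.

    % Sketch of the statistics plotted on a Taylor diagram
    sdObs = std(meas);                     % standard deviation of the observations
    sdMod = std(pred);                     % standard deviation of the model
    r     = corr(meas, pred);              % correlation coefficient
    RMSD  = sqrt(sdObs^2 + sdMod^2 - 2*sdObs*sdMod*r);  % centered pattern RMSD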

The five best models in Fig. 9 were very similar to each other; however, LSTM was distinguishable from the other four approaches in the box plot and error diagrams of Fig. 10. The extreme error values of these models are almost at the same level; however, the KNN method differs from the other methods in the excessive errors of its estimations. When the error graph is examined, it is seen that the KNN model in particular overestimates, while the ELM model underestimates. In the Taylor diagram, the LSTM model produced SR estimates that were quite similar to the observed values; the Taylor diagram thus also demonstrated that the LSTM technique outperformed the other models.

It was quite difficult to identify the superior method for SR estimation in this study; for this reason, many statistical and graphical methods were used. Finally, a spider plot, which evaluates all error criteria of the best approaches together, was drawn. Figure 12 shows the spider plot.

Fig. 12 Spider graph for the best approaches

The spider graph makes it easy to see that LSTM has lower RMSE, MARE and MAE values than the other approaches and, conversely, better NSE and R2 values than the other methods. In addition, it was determined that the least successful method is the KNN technique.

Finally, statistical significance comparisons between the results of the five approaches and the observed data were made. Firstly, the Kruskal–Wallis (KW) test was employed to see whether the distributions of the estimated and measured data were identical [39, 79]. For the estimations of three approaches (GPR, LSTM and KNN) in Table 4, the H0 hypothesis cannot be rejected; in other words, the means of the predicted and observed data are not significantly different. The other models, on the other hand, show a considerable difference, and it is likely that their results do not come from the same population as the actual data. The KW test was performed at the 95% confidence level.

Table 4 KW test results

The KW test in Table 4 shows that a model's having low errors does not by itself indicate that the technique is fully appropriate. This result shows that these models do not always provide reliable SR estimates due to the complex connections between the independent and dependent variables. In particular, the large number of data points and the inability to predict the extreme values well cause the H0 hypothesis to be rejected in the KW test [80]. In Table 4, the GPR, LSTM and KNN approaches passed the KW test, meaning that the estimates given by these methods come from the same mean as the measured SR. The test results of the applied models were then also evaluated by one-way analysis of variance (ANOVA) to evaluate the robustness (the significance of differences between the measured and estimated SR values) of the different machine learning approaches [81]. The test was set at the 95% confidence level. Table 5 gives the test statistics.

Table 5 ANOVA test results of the LSTM, SVMR, GPR, ELM and KNN techniques in the testing phase

In Table 5, the GPR kernel model has the lowest test statistic (0.51) with the highest significance level (p = 0.4734) compared with the others. According to the ANOVA test, the GPR kernel model is more robust than the GPR basis and LSTM-Adam models (the similarity between the measured SR values and the GPR kernel forecasts is significantly high) in modeling monthly SR; all other methods failed this test. Thus, unlike the findings stated above, it was decided that the GPR kernel and LSTM-Adam were the most successful methods in this part of the study.
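A minimal sketch of how the KW and ANOVA checks could be run in MATLAB is given below; meas and pred are placeholder column vectors of equal length, and the 0.05 threshold corresponds to the 95% confidence level used above.

    % Sketch of the Kruskal-Wallis and one-way ANOVA checks
    n     = numel(meas);
    vals  = [meas; pred];
    group = [ones(n,1); 2*ones(n,1)];           % 1 = measured, 2 = predicted
    pKW   = kruskalwallis(vals, group, 'off');  % H0: same distribution
    pAOV  = anova1(vals, group, 'off');         % H0: equal means
    same  = (pKW > 0.05) && (pAOV > 0.05);      % H0 not rejected at the 95% level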

6 Conclusion

The primary aim of the current research was to predict SR with different machine learning approaches, and to investigate the applicability and capacity of the ML approaches by examining the effect of the input parameters on forecast accuracy and removing the parameters that decrease it. The most important findings of this study can be stated as follows:

1. VIF analysis was performed to develop the models; thus, the input parameters that reduce model performance were eliminated.

2. When the models are compared within themselves, the kernel function is superior to the basis function in GPR, the polynomial kernel is superior to the RBF and linear kernels in SVMR, SGDM is superior to the Adam and RMSProp optimization algorithms in LSTM, and the kdtree function is superior to the exhaustive function in KNN.

3. According to the MAE, MARE, RMSE, R2 and NSE error criteria and the Taylor, violin, box error and spider plots, the method that best predicted the observed values was LSTM, followed by GPR, SVMR, ELM and KNN.

4. For the LSTM model, the average performance metrics in the testing phase were as follows: MARE values (%) varied between 15.17 and 28.31, MAE values between 1.759 and 3.358 and RMSE values between 2.297 and 4.422, and the mean NSE value reached 0.875.

5. In addition, the statistical significance of the analysis results was tested with the KW and ANOVA tests. GPR was found to be more robust than the other methods, followed by LSTM and KNN. These tests showed that the predictions of the SVMR and ELM models were doubtful, while those of the GPR, LSTM and KNN models could represent the mean.

6. Finally, these results proved that the LSTM and GPR algorithms are applicable, valid alternatives for SR estimation in Turkey, which has arid and semi-arid climatic regions.

The six main limitations of this study can be mentioned as follows: (i) using data from 163 meteorological stations to represent Turkey, (ii) using data from 1967 to 2020, (iii) using VIF analysis for input selection, (iv) using different optimization techniques and five different machine learning methods, (v) using visual comparison criteria (violin, Taylor, spider and box plots) in addition to the performance metrics and (vi) using the KW and ANOVA tests to check the accuracy of the results.

This study is an effort to estimate SR in Turkey, which is of great importance for energy balances and production, biological processes, the hydrological cycle, terrestrial ecosystems and climate. In future studies, the accuracy of the regional study can be increased by applying new machine learning methods. In addition to machine learning methods, the development of models that produce explicit equations using nature-inspired optimization algorithms and the input parameters is envisaged.