Introduction

The amount of total global solar irradiance (TGSI) and its temporal distribution are important factors for determining water availability and for studying the impact of changes in radiation levels on climate change, ecosystems, and economic activities. These factors are also critical for optimizing the design of solar energy systems (e.g., Wang et al. 2015; Widén et al. 2015). This latter use has become relevant to recent research since solar energy is one of the most important renewable energy sources.

The most accurate way to obtain solar radiation levels is by using ground-based measurements. However, due to the high cost of measuring equipment and its maintenance, a variety of methods to estimate solar radiation from different variables have been proposed and analyzed (e.g., Sonmete et al. 2011; Elani 2007; Barron et al. 2009). Since clouds are the most important meteorological factor attenuating TGSI (e.g., Wacker et al. 2015), sunshine duration is the most direct variable to estimate TGSI. Ångström was the first to show a linear relation between the duration of bright sunshine and TGSI (Ångström 1924). Some years later this relation was modified by Prescott, who proposed the so-called Ångström-Prescott equation (Prescott 1940). Unlike the Ångström equation, it uses the extraterrestrial irradiance on a horizontal surface to estimate TGSI. Different authors have introduced modifications to the original Ångström-Prescott model in order to apply it to different sites and climatic conditions. A review of the works using bright sunshine measurements for estimating TGSI can be found in Bakirci (2009a).

Although sunshine data are easier to obtain than TGSI measurements, they are not available at most of the meteorological stations. For this reason, models based only on geographical locations and easily available meteorological variables are attractive options. As in the case of sunshine hours, the difference between the daily maximum and minimum temperatures can be a good indicator of cloudiness, since cloud cover decreases the maximum air temperature (due to the reduction in transmissivity) and increases the minimum air temperature (due to cloud emissivity). Hargreaves and Samani (1982) were the first to introduce this variable to estimate TGSI through its relation with the extraterrestrial irradiance. Since then, many modifications to the Hargreaves and Samani model have been proposed. These modifications include the use of a different mathematical relation between the extraterrestrial irradiance and the maximum and minimum temperatures and the use of more meteorological variables such as pressure, relative humidity, and atmospheric precipitable water vapor. Besides the models based on sunshine measurements and meteorological data, other methods estimate TGSI using data related to the presence of clouds such as cloud cover and cloud types. These models involve the routine observation of the sky conditions. A review of the empirical models proposed to estimate TGSI can be found in Besharat et al. (2013).

Soft computing techniques have been widely used in environmental studies since they allow both handling a large amount of data and making estimates even when the underlying physical processes involved in these estimations are not completely understood (He et al. 2014). For example, they have been used to predict the amount of precipitation (Choubin et al. 2018), to assess the risk of groundwater contamination (Sajedi-Hosseini et al. 2018) and to model meteorological drought (Rafiei-Sardooi et al. 2018). The artificial neural network (ANN) (Basheer and Hajmeer 2000) is one of the soft computing techniques commonly used to estimate global and spectral irradiance (Moreno-Sáez and Mora-López 2014; Khosravi et al. 2018; Voyant et al. 2017). One of the advantages of this tool compared with the empirical models is that the user does not need to define how each input parameter is related to the obtained value. However, once the ANN model has been trained with TGSI measurements and generalized for a particular site, its application to different sites is not straightforward. Likewise, although empirical models are simple to apply, their coefficients are accurate only for the location where they were obtained. To retrieve and assess these coefficients, TGSI measurements are needed. Therefore, these measurements should cover the entire range of local meteorological conditions to be representative of the site of study.

In the literature, there are several studies focused on estimating solar radiation for a given site by using new empirical models. These models require TGSI measurements to calibrate the empirical coefficients and to validate the obtained estimations. Similarly, an ANN model also requires these measurements. For example, Quej et al. (2016) proposed an empirical model to estimate TGSI in six sites in Mexico by using TGSI measurements recorded, on average, over a 10-year period. In a later work, Quej et al. (2017) used the same set of measurements to develop an ANN model for TGSI estimations. De Souza et al. (2016) used measurements recorded over a 3-year period in a state of Brazil to calibrate a new model and measurements recorded over a 1-year period to validate it. Besharat et al. (2013) proposed a new model to estimate TGSI in a city in Iran by using TGSI measurements recorded from 1988 to 2008. In conclusion, all these works use a set of TGSI measurements recorded on a daily basis for many years.

Choosing a particular model is not an easy task when the available TGSI dataset is small. Therefore, the present work aimed to determine which of the models can be used to estimate daily TGSI when these measurements are scarce. Besides the ANN models, empirical models based on temperature and sunshine measurements selected from the literature were analyzed and evaluated. A sensitivity analysis of the number of TGSI measurements needed to obtain accurate estimations was performed for both the empirical and the ANN models. Statistical estimators were used to evaluate the accuracy of the models against the measurements performed in Córdoba, a city located in the center of Argentina.

Materials and methods

Data

Site of study

This work was performed in Córdoba City (31.4° S; 64.18° W; 470 m.a.s.l.), which has an estimated population of 1.3 million inhabitants (2010 National Demographic Census). It is a Mediterranean city in Argentina located in the center of the country in a semi-arid region. The climate is sub-humid and the prevailing wind direction is from the NE. The annual average precipitation is about 700 mm. However, the area is affected by severe and persistent dry periods that occur cyclically in wintertime. Snowfalls are infrequent. Thus, precipitation consists mostly of rain, which is concentrated mainly in summer. The study was carried out between 2000 and 2013, so a wide range of meteorological conditions was covered during that period.

Meteorological data

Minimum, maximum, and mean ambient temperatures, relative humidity, wind speed, atmospheric pressure, and sunshine hours were provided by a National Weather Service station located near the irradiance measurement site. Sunshine hour data, also provided by the National Weather Service, were measured using a Campbell-Stokes heliograph. All of these meteorological data were obtained on a daily timescale for all the days between 2000 and 2013. For these days, solar radiation measurements are available (see the “Solar radiation measurements” section).

Solar radiation measurements

Total global solar irradiance (TGSI) surface measurements (300–3000 nm) were obtained with a pyranometer YES (Yankee Environmental System, Inc.) model TSP-700 operated by our research team. The pyranometer was placed in a wide-open area on the University Campus in Córdoba. Broadband observations were recorded as half-minute average values throughout the day. These measurements were integrated throughout the day following the trapezoidal rule in order to obtain the daily TGSI values. It should be mentioned that the TGSI station that provided the values is not an automatic station, so the pyranometer was mounted every day at the mentioned site early in the morning and removed at night. The measurements and data from the heliograph were manually checked for errors and inconsistencies, and quality control was performed to eliminate inaccurate measurements. Finally, a total of 696 days were used. TGSI measurements recorded during these days represent a scenario with high data availability if a model is to be calibrated and the accuracy of its TGSI estimates is to be evaluated. Therefore, to simulate scenarios with limited data availability, only a small number of these TGSI measurements were selected to calibrate the models. This procedure is detailed in the “Measured vs. estimated irradiance: large number of available TGSI measurements” and “Measured vs. estimated irradiance: scarce TGSI measurements” sections.
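The daily integration described above can be illustrated with a minimal sketch. It assumes equally spaced samples 30 s apart and converts the result to kWh m−2, the unit used later in Table 4; the function name `daily_tgsi_kwh` and the example values are hypothetical and are not taken from the original processing chain.

```python
def daily_tgsi_kwh(irradiance_w_m2, step_s=30.0):
    """Integrate equally spaced irradiance samples (W m-2) with the trapezoidal rule.

    Returns the daily total in kWh m-2.
    """
    if len(irradiance_w_m2) < 2:
        return 0.0
    total_j = 0.0
    for left, right in zip(irradiance_w_m2, irradiance_w_m2[1:]):
        total_j += 0.5 * (left + right) * step_s  # area of one trapezoid, J m-2
    return total_j / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical example: a constant 500 W m-2 held for one hour (121 samples).
samples = [500.0] * 121
print(daily_tgsi_kwh(samples))  # 0.5
```

Gaps in the record (for instance, while the instrument is being mounted) would need separate handling, which this sketch does not attempt.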

Figure 1 shows the location of Córdoba City in the center of Argentina, as well as a map of the city showing the location of the meteorological station and the irradiance measurement site. As can be seen, the two sites are approximately 2.5 km apart and are located at the same elevation.

Fig. 1

Location of Córdoba City on a shaded relief map. On the right, the location of the meteorological station and the irradiance measurement site in the city

Empirical models

The most commonly used empirical models based on sunshine duration and meteorological data were selected. Models with fixed coefficients were discarded. Following this criterion, fifteen models were selected to obtain estimated TGSI values. All of the selected models use extraterrestrial irradiance (Ho), which was calculated from the following widely used equation (e.g., Duffie and Beckman 1991):

$$ {H}_{\mathrm{o}}=\frac{24{I}_{\mathrm{SC}}}{\pi}\left[1+0.033\cos \left(\frac{360{n}_{\mathrm{o}}}{365}\right)\right]\left[\cos \varphi \cos \delta \sin {w}_{\mathrm{s}}+\left(\frac{2\pi {w}_{\mathrm{s}}}{360}\right)\sin \varphi \sin \delta \right] $$
(1)

where ISC is the solar constant (1367 W m–2); no is the Julian day; φ is the latitude of the site (in degrees); δ is the solar declination angle (in degrees); and ws is the sunset hour angle (in degrees). The two latter variables were calculated as follows:

$$ \delta =23.45\sin \left[360\left({n}_{\mathrm{o}}+284\right)/365\right] $$
(2)
$$ {w}_{\mathrm{s}}={\cos}^{-1}\left(-\tan \varphi \tan \delta \right) $$
(3)
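Equations 1 to 3 can be sketched in a few lines of code. The snippet below is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the latitude is entered as a negative number for the Southern Hemisphere (an assumed sign convention), and with ISC in W m−2 the factor 24 (hours) gives Ho in Wh m−2 d−1.

```python
import math

I_SC = 1367.0  # solar constant, W m-2


def declination_deg(day_of_year):
    """Eq. 2: solar declination in degrees."""
    return 23.45 * math.sin(math.radians(360.0 * (day_of_year + 284) / 365.0))


def sunset_hour_angle_deg(lat_deg, decl_deg):
    """Eq. 3: sunset hour angle in degrees (valid outside the polar circles)."""
    x = -math.tan(math.radians(lat_deg)) * math.tan(math.radians(decl_deg))
    return math.degrees(math.acos(x))


def extraterrestrial_wh(lat_deg, day_of_year):
    """Eq. 1: daily extraterrestrial irradiance on a horizontal surface, Wh m-2 d-1."""
    decl = declination_deg(day_of_year)
    ws = sunset_hour_angle_deg(lat_deg, decl)
    phi, delta, ws_rad = map(math.radians, (lat_deg, decl, ws))
    # Eccentricity correction of the Earth's orbit
    ecc = 1.0 + 0.033 * math.cos(math.radians(360.0 * day_of_year / 365.0))
    # The term 2*pi*ws/360 in Eq. 1 is simply ws expressed in radians
    geom = (math.cos(phi) * math.cos(delta) * math.sin(ws_rad)
            + ws_rad * math.sin(phi) * math.sin(delta))
    return 24.0 * I_SC / math.pi * ecc * geom


# Córdoba City (31.4° S): austral summer and winter solstices
summer = extraterrestrial_wh(-31.4, 355)
winter = extraterrestrial_wh(-31.4, 172)
print(summer / 1000.0, winter / 1000.0)  # kWh m-2 d-1
```

For this latitude the sketch yields roughly 12 kWh m−2 d−1 near the December solstice and roughly 5 kWh m−2 d−1 near the June solstice.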

Figure 2 shows the annual evolution of the measured TGSI and the extraterrestrial irradiance. The effect of clouds on TGSI measurements was evident on days with lower than expected TGSI values.

Fig. 2

Annual variation of daily TGSI and daily extraterrestrial irradiance

The models used in this work are listed in Table 1. In this table, He is the estimated total global surface solar irradiance; n is the measured sunshine duration; N is the maximum possible sunshine duration (day length); a, b, c, and d are the empirical coefficients; Tmax and Tmin are the maximum and minimum daily temperature, respectively (°C); Kr is an empirical coefficient; P is the mean atmospheric pressure of the site (hPa); i and (i + 1) represent the day of interest and the next day, respectively; Tavg is the average temperature (in °C); f(Tavg) and f(Tmin) are functions of Tavg and Tmin, respectively. For all the models used in this work, n was obtained from the heliograph, while N was calculated as:

$$ N=2{w}_{\mathrm{s}}/15 $$
(4)
Table 1 Equations of the models used to estimate TGSI. He is the estimated total global surface solar irradiance; n is the measured sunshine duration; N is the maximum possible sunshine duration (day length); a, b, c, d, and Kr are the empirical coefficients; Tmax and Tmin are the maximum and minimum daily temperature, respectively (°C); P is the mean atmospheric pressure of the site (hPa); i and (i + 1) are the day of interest and the next day, respectively; Tavg is the average temperature (°C)
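As an illustration of how a sunshine-based model from this family is calibrated, the sketch below fits the widely cited linear Ångström-Prescott form, He/Ho = a + b(n/N), with n taken as the measured sunshine hours and N from Eq. 4. Several points are assumptions: the source text does not state the fitting method, so ordinary least squares is assumed; the function names are hypothetical; and the data are synthetic values generated only to check the fit.

```python
def max_sunshine_hours(ws_deg):
    """Eq. 4: maximum possible sunshine duration N, in hours."""
    return 2.0 * ws_deg / 15.0


def fit_angstrom_prescott(h_meas, h_o, n_hours, n_max):
    """Ordinary least squares for H/Ho = a + b * (n/N); returns (a, b)."""
    x = [n / big_n for n, big_n in zip(n_hours, n_max)]
    y = [h / ho for h, ho in zip(h_meas, h_o)]
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Synthetic (not measured) data built from a = 0.25 and b = 0.50
h_o = [10.0, 8.0, 6.0, 12.0]        # extraterrestrial irradiance
n_max = [13.0, 12.0, 10.5, 14.0]    # N, hours
n_hours = [10.0, 3.0, 7.0, 0.0]     # n, hours
h_meas = [ho * (0.25 + 0.50 * n / big_n)
          for ho, n, big_n in zip(h_o, n_hours, n_max)]

a, b = fit_angstrom_prescott(h_meas, h_o, n_hours, n_max)
print(round(a, 6), round(b, 6))  # 0.25 0.5
```

Models with nonlinear terms or more coefficients would need a nonlinear fit rather than this closed-form regression.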

Artificial neural network

ANNs have been designed to mimic the human brain and are structurally formed by processing units called neurons. In a typical ANN, each input variable is jointly connected to neurons placed in one or several intermediate layers, which are then connected to the outputs. The weight of each individual connection is a parameter adjusted during the ANN training. The training process is iterative. In each loop, the error between the known and the calculated output is evaluated, and consequently, the connection weights are modified until the network can correctly predict the already known outputs.
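The iterative training loop described above can be sketched with a very small network. This is only an illustration under stated assumptions: it uses plain stochastic gradient descent instead of the Levenberg-Marquardt algorithm used in this work, the function name `train_tiny_network` is hypothetical, and the data are synthetic rather than meteorological.

```python
import math
import random


def train_tiny_network(samples, targets, hidden=3, epochs=200, lr=0.05, seed=0):
    """Train a one-hidden-layer network (tanh hidden units, linear output).

    Returns the mean squared error accumulated during each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, y in zip(samples, targets):
            # Forward pass: inputs -> hidden layer -> single output
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(w1, b1)]
            out = sum(w * hj for w, hj in zip(w2, h)) + b2
            err = out - y  # error between calculated and known output
            sq_err += err * err
            # Backward pass: adjust every connection weight against the gradient
            for j in range(hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(sq_err / len(samples))
    return history


# Synthetic data: the target is a simple function of two inputs
data_rng = random.Random(1)
samples = [[data_rng.random(), data_rng.random()] for _ in range(20)]
targets = [0.5 * u + 0.2 * v for u, v in samples]

history = train_tiny_network(samples, targets)
print(history[0], history[-1])  # the error shrinks as training proceeds
```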

For the present work, and given that the ANNs are trained in a semi-automatic way, all the possible combinations among the following meteorological factors were evaluated: Tmax, Tmin, Tmin(i + 1), Tmax (i + 1), Pmax, Pmin, HRmax, HRmin, Pmonth, and Tmonth. By analogy with the nomenclature previously used, Tmax (i + 1) symbolizes the maximum temperature of the next day; Pmax and Pmin represent, respectively, the maximum and minimum atmospheric pressure of the i day; and HRmax and HRmin are, respectively, the maximum and minimum relative humidity of the i day. Finally, Pmonth and Tmonth are the monthly mean atmospheric pressure and temperature, respectively. Besides Ho, a certain number, from one to all, of these meteorological parameters were used as input parameters to estimate TGSI, resulting in 1024 possible combinations. Each of these combinations was separately trained with a feed-forward network using the Levenberg-Marquardt back-propagation training algorithm (Marquardt 1963; Hagan and Menhaj 1994) in order to find which of the combinations yielded the best results. In this network, the first layer has a connection from the network input, then a single hidden layer has a connection from the previous layer and the final layer produces the network’s output. Also, for each combination of variables, several ANNs were trained using a different number of neurons in the hidden layer in order to find the best network architecture for each data set. The number of neurons of the intermediate layer was set to 3, 5, or 10, while the transfer function (i.e., how the input and output of the neurons are correlated) was set as tan-sigmoid or linear, resulting in 6144 different network configurations.
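The counts quoted above can be reproduced by enumerating the search space. The sketch below assumes that the subset containing none of the ten variables (Ho alone) is counted, since that is the reading that gives 1024 combinations; the constant names are hypothetical.

```python
from itertools import combinations, product

VARIABLES = ["Tmax", "Tmin", "Tmin(i+1)", "Tmax(i+1)", "Pmax", "Pmin",
             "HRmax", "HRmin", "Pmonth", "Tmonth"]
HIDDEN_NEURONS = (3, 5, 10)
TRANSFER_FUNCTIONS = ("tan-sigmoid", "linear")

# Every subset of the ten variables; Ho is always added as an input.
# Assumption: the empty subset (Ho alone) is included in the count.
input_sets = [("Ho",) + combo
              for size in range(len(VARIABLES) + 1)
              for combo in combinations(VARIABLES, size)]

# One configuration per (input set, hidden-layer size, transfer function)
configurations = list(product(input_sets, HIDDEN_NEURONS, TRANSFER_FUNCTIONS))

print(len(input_sets))      # 1024
print(len(configurations))  # 6144
```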

The results presented here were obtained by using the best network selected among all the possible variable combinations and architectures. This network was chosen in terms of the average root mean square error between the irradiance values obtained using the ANN model and the measured values considering all the subsets described in the “Results and discussion” section. All the ANN calculations were done using the Matlab® Neural Network Toolbox (version R2011b). Among all the 6144 configurations employed (different inputs, different numbers of neurons in the hidden layer, and different transfer functions), the selected ANN configuration used the following input parameters: Tmin, Pmin, HRmin, Tmin(i + 1), Pmonth, and Tmonth, with 10 neurons in the hidden layer and a tan-sigmoid activation function. This configuration, hereinafter referred to simply as the ANN configuration, was the only one used in this work.

To better understand the results obtained, it is worth mentioning some aspects of the working process to estimate TGSI. The common approach to estimate TGSI using empirical models is to split the measurements into two sub-datasets. The first one is used to determine the empirical coefficients of the models, while the second one is used to evaluate the accuracy of the models by comparing the estimated with measured TGSI. Both groups are frequently referred to as calibration and validation group, respectively (Besharat et al. 2013). However, the use of an ANN also requires a validation stage, but as a part of the training process. Hereinafter, the term “evaluation” will be used to refer to the process that analyzes the performance of the models by considering the group of data not included in the determination of the coefficients of the models or in the ANN training and by comparing the estimates with the measurements.

Comparison techniques

To compare the estimated with measured TGSI, the recommendations by Yorukoglu and Celik (2006) were followed. To evaluate the accuracy and performance of the derived correlations (in predicting TGSI), the statistical estimators shown in Table 2 were used. In this table, Hi,m and Hi,e are, respectively, the measured and estimated TGSI for day i; nd is the number of data; Ha,m and Ha,e are the average of the measured and estimated TGSI, respectively, and (nd–1) are the degrees of freedom.

Table 2 Equations of the statistical estimators used to compare the estimated with measured TGSI. Hi,m and Hi,e are the measured and estimated TGSI for day i, respectively; nd is the number of data; Ha,m and Ha,e are the average of the measured and estimated TGSI, respectively; (nd–1) are the degrees of freedom
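Since the equations of Table 2 are not reproduced in this text, the sketch below uses the definitions of these estimators that are common in the solar radiation literature. The sign convention (estimated minus measured) and the percentage form of MPE and MAPE are assumptions and may differ from Table 2; the function name is hypothetical.

```python
import math


def estimators(measured, estimated):
    """Return common agreement statistics between measured and estimated values."""
    n = len(measured)
    diffs = [e - m for m, e in zip(measured, estimated)]
    m_avg = sum(measured) / n
    e_avg = sum(estimated) / n
    cov = sum((m - m_avg) * (e - e_avg) for m, e in zip(measured, estimated))
    var_m = sum((m - m_avg) ** 2 for m in measured)
    var_e = sum((e - e_avg) ** 2 for e in estimated)
    return {
        "MBE": sum(diffs) / n,                                   # mean bias error
        "MABE": sum(abs(d) for d in diffs) / n,                  # mean absolute bias error
        "RMSE": math.sqrt(sum(d * d for d in diffs) / n),        # root mean square error
        "MPE": 100.0 * sum(d / m for d, m in zip(diffs, measured)) / n,
        "MAPE": 100.0 * sum(abs(d) / m for d, m in zip(diffs, measured)) / n,
        "r": cov / math.sqrt(var_m * var_e),                     # correlation coefficient
        "NSE": 1.0 - sum(d * d for d in diffs) / var_m,          # Nash-Sutcliffe efficiency
    }


# Hypothetical daily values in kWh m-2 d-1
measured = [4.0, 5.0, 6.0, 7.0]
estimated = [4.5, 4.5, 6.5, 6.5]
stats = estimators(measured, estimated)
print(stats["MBE"], stats["RMSE"], stats["NSE"])
```

In this hypothetical example the positive and negative errors cancel, so MBE is zero while RMSE is 0.5 kWh m−2 d−1, which shows why several estimators are examined together.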

A statistical t test was performed to determine the existence of significant differences between the TGSI estimated by each of the models and the TGSI measurements. The t value obtained from the expression in Table 2 (e.g., Muzathik et al. 2011) was compared with a critical t value (tcrit) obtained from standard statistical tables for a given level of significance (α) and (nd–1) degrees of freedom (e.g., Walpole et al. 2012). In this work, the t table was used for a two-sided test using a value of 0.05 for α. Thus, the null hypothesis is accepted if:

$$ -{t}_{\mathrm{crit},\,\alpha /2,\,\left({n}_{\mathrm{d}}-1\right)}<t<{t}_{\mathrm{crit},\,\alpha /2,\,\left({n}_{\mathrm{d}}-1\right)} $$
(5)

The results showed that there were no significant differences between the TGSI values obtained using the model and the measured TGSI.
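The decision rule of Eq. 5 can be sketched as follows. The t expression of Table 2 is not reproduced in this text, so the sketch assumes the form commonly used in this literature, written here with its sign, t = MBE · sqrt[(nd − 1)/(RMSE² − MBE²)], which is algebraically the paired t statistic of the differences. The function names are hypothetical, and tcrit is passed in from statistical tables rather than computed.

```python
import math


def t_statistic(measured, estimated):
    """Signed t statistic: MBE * sqrt((nd - 1) / (RMSE**2 - MBE**2)).

    Undefined when all differences are identical (zero denominator).
    """
    nd = len(measured)
    diffs = [e - m for m, e in zip(measured, estimated)]
    mbe = sum(diffs) / nd
    mse = sum(d * d for d in diffs) / nd  # RMSE squared
    return mbe * math.sqrt((nd - 1) / (mse - mbe ** 2))


def no_significant_difference(t_value, t_crit):
    """Eq. 5: the null hypothesis is kept when -t_crit < t < t_crit."""
    return -t_crit < t_value < t_crit


# Hypothetical four-day example
t_demo = t_statistic([4.0, 5.0, 6.0, 7.0], [4.6, 5.2, 6.1, 7.5])
print(round(t_demo, 3))

# t values quoted later in the text, checked against tcrit = 1.983
print(no_significant_difference(0.184, 1.983))  # True
print(no_significant_difference(4.501, 1.983))  # False
```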

Results and discussion

The “Measured vs. estimated irradiance: large number of available TGSI measurements” section discusses the comparison between the measured and estimated TGSI when 85% of the TGSI measurements were used to calibrate the empirical models and train ANN. The results obtained showed that the models were adequate to estimate TGSI when a large number of TGSI measurements were available, which is not the most common scenario.

The accuracy of the models was analyzed using the statistical estimators described in the “Comparison techniques” section. The empirical coefficients obtained for each model for Córdoba City were analyzed and compared with those previously reported for other sites.

The “Measured vs. estimated irradiance: scarce TGSI measurements” section describes a sensitivity analysis performed to determine the number of daily TGSI measurements needed to calibrate the empirical models and ANNs. In contrast to the “Measured vs. estimated irradiance: large number of available TGSI measurements” section, the results described in this section are useful in scenarios where TGSI measurements are scarce.

Measured vs. estimated irradiance: large number of available TGSI measurements

To calibrate the empirical models, train the ANN, and then evaluate all of them, the measured TGSI was divided into two groups. For the calibration and training stage, the set of daily TGSI measurements was randomly selected. A number of measurements corresponding to 85% of the total data were included in this calibration/training group. The remaining 15% formed the evaluation group.
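A random 85/15 partition of this kind can be sketched as below. The function name, the seed, and the rounding of the group size are assumptions made for the illustration; day indices stand in for the 696 daily records.

```python
import random


def split_calibration_evaluation(days, calibration_fraction=0.85, seed=None):
    """Shuffle the days and split them into calibration and evaluation groups."""
    rng = random.Random(seed)
    shuffled = list(days)
    rng.shuffle(shuffled)
    cut = round(calibration_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


calibration, evaluation = split_calibration_evaluation(range(696), seed=42)
print(len(calibration), len(evaluation))  # 592 104
```

Under this rounding, the evaluation group holds 104 days, that is, 103 degrees of freedom for the t test.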

Table 3 shows the empirical coefficients obtained for each model in the calibration stage.

Table 3 Adjusted coefficients for all the empirical models

Figure 3 shows the relation between the estimated and measured TGSI for all the analyzed methods applied to the evaluation data set. Red lines indicate the 1:1 relation.

Fig. 3

Relation between the measured and estimated TGSI for the evaluation data set

In order to select the methods in good agreement with the measurements, several statistical estimators were analyzed. Table 4 shows the statistical parameters obtained for each method. For a higher modeling accuracy, MBE, MPE, MABE, MAPE, and RMSE values should be close to zero, while r and NSE coefficients should be close to one.

Table 4 Statistical estimators for each of the models (* = [kWh m−2 d−1])

The values of these parameters show the agreement between the estimated and measured TGSI (Table 4). These results can also be qualitatively inferred from a visual inspection of Fig. 3. As shown in Table 4, for all the models except the Bristow and Campbell model, the calculated t values were lower than tcrit (1.983 at α = 5%). Thus, excluding the Bristow and Campbell model and bearing in mind that some models were more precise than others, it can be concluded that there were no significant differences between the TGSI values estimated by the evaluated models and the measured values for the site of study.

The empirical coefficients previously shown in Table 3 were compared with those obtained for different sites and consequently for different meteorological conditions. The obtained coefficients were in the range of values previously reported (e.g., Muzathik et al. 2011, reported a = 0.2207 and b = 0.5249 for Ångström-Prescott model in Malaysia, Almorox et al. 2013, reported a = 0.7685, b = − 0.0714, and c = 1.0919 for Bristow and Campbell model in Argentina and Besharat et al. 2013, reported Kr = 0.1746 in Iran).

From all these models, the Almorox et al. meteorological model was chosen because it was developed and evaluated in Cañada de Luque City, which is approximately 100 km from Córdoba City. Because of its similar topography and proximity to Córdoba City, the empirical coefficients for the Almorox et al. (2013) meteorological model were expected to be approximately the same for both cities. Considering this hypothesis, the empirical coefficients obtained by Almorox et al. (2013) for Cañada de Luque were used to estimate the TGSI values for Córdoba City. Table 5 shows the empirical coefficients reported for Cañada de Luque (Almorox et al. 2013) and the t and p values calculated from the comparison between the estimated and measured TGSI for Córdoba City.

Table 5 Empirical coefficients of Almorox et al. meteorological model reported for Cañada de Luque (Almorox et al. 2013) and Córdoba (this work) and the t and p values for Córdoba City

In conclusion, despite the closeness of Cañada de Luque to Córdoba City, the statistical comparison resulted in a t value of 4.501. This value was higher than the critical t value, indicating the high spatial variability of the empirical coefficients. Therefore, the coefficients obtained for a given site should not be blindly applied to a different site, no matter how close the site is. This spatial dependence of the empirical coefficients has been previously reported in the literature and has led to studies of the spatial sensitivity of the models. For example, Liu et al. (2014) focused on the spatial modeling of the parameters of the Ångström-Prescott model in order to extend its application to different sites in China. Urraca et al. (2017) also analyzed the spatial performance of the TGSI estimates of a model in Spain.

Measured vs. estimated irradiance: scarce TGSI measurements

Given that empirical coefficients should be obtained from local TGSI measurements, an analysis was performed to determine the minimum number of TGSI measurements needed to calibrate the empirical models or to train the ANN models. This analysis followed the same calibration and training procedure (for the empirical and ANN models, respectively) described in the “Measured vs. estimated irradiance: large number of available TGSI measurements” section, followed by the evaluation. However, subsets with different numbers of TGSI measurements were tested. The subsets of TGSI measurements were randomly selected but, in order to consider all the seasons and meteorological conditions, all months had the same number of days. Subsets containing 2, 3, 5, 7, and 10 days per month were formed. That is to say, the numbers of days used were 24, 36, 60, 84, and 120. The days in each subset were randomly divided into 85% for calibration and 15% for validation. In this analysis, the days in a given month came from different years (these days were also randomly selected). For these different groups, the calibration/validation processes were as in the “Measured vs. estimated irradiance: large number of available TGSI measurements” section (the weights of the ANN model when 2, 3, 5, 7, and 10 days per month were used for the calibration/validation processes are shown in the Appendix). The resulting coefficients and weights were used to estimate TGSI and to compare the estimates with the TGSI measurements included in the evaluation data set (all the available measurements). All the statistical estimators were calculated. However, only the analysis of the t values is shown. Considering the amount of data in the evaluation group, tcrit was 1.963 (α = 5%).
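The construction of subsets with the same number of days in every month can be sketched as follows. This is a hedged illustration: the function name is hypothetical, the pool of dates is a synthetic two-year calendar rather than the 696 measured days, and days are pooled by calendar month regardless of the year, as the text describes.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def sample_days_per_month(dates, days_per_month, seed=None):
    """Pick the same number of days from every calendar month, pooling all years."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for d in dates:
        by_month[d.month].append(d)
    subset = []
    for month in range(1, 13):
        # Random draw without replacement within each month
        subset.extend(rng.sample(by_month[month], days_per_month))
    return subset


# Synthetic pool of candidate days (two full years), not the measured record
pool = [date(2001, 1, 1) + timedelta(days=i) for i in range(730)]

sizes = [len(sample_days_per_month(pool, k, seed=0)) for k in (2, 3, 5, 7, 10)]
print(sizes)  # [24, 36, 60, 84, 120]
```

Each subset produced this way can then be split into calibration and validation groups before the estimates are evaluated against all the available measurements.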

Figure 4 shows the t values obtained for each model when different numbers of days per month were used in the subset. Red horizontal lines indicate the ± tcrit values.

Fig. 4

t values calculated by using a different number of daily TGSI measurements per month to calibrate the empirical models and to train the ANN models

As can be seen, all the empirical models presented some t values outside the ± tcrit range. However, all the t values for the ANN models were within this range. This behavior shows that 2, 3, 5, 7, or 10 days per month are enough to train the ANN model to estimate TGSI.

In the case of the empirical models, the Samuel, Allen, Samani, Chen et al. and Almorox et al. meteorological models were adequate to estimate daily TGSI when almost 600 measurements of TGSI were available for calibration (the “Measured vs. estimated irradiance: large number of available TGSI measurements” section). However, when the calibration was carried out using a subset of a limited number of TGSI measurements per month, they presented t values higher than tcrit (see Fig. 4). This fact shows the strong relationship between the number of calibration days and the accuracy of the models, since more days were needed to obtain the convergence of the empirical coefficients.

The increase in the number of days used for calibration and training was expected to lead to t values closer to zero. This behavior was effectively verified for the ANN models, but not for any of the empirical models. This shows the improvement in the performance of the ANN models as the amount of data used for training was increased (Table 4 shows that the t value was 0.184 for about 600 days in the training group).

The innovative character of this research does not lie in the fact that the performance of the ANN model is better than that of the empirical models. In fact, this better performance has been widely demonstrated in previous works (e.g., Sharifi et al. 2016; Citakoglu 2015; Antonopoulos et al. 2019). The innovative nature of this study lies in the fact that daily TGSI measurements performed on only 10 days per month (or even fewer) for 1 year were enough to train the ANN models and to obtain estimates statistically equal to the measurements for the entire period. This is an important result for countries like Argentina where these measurements are scarce. However, the main disadvantage of this method is that its application is limited to sites where TGSI measurements are recorded.

Lastly, the results obtained in this work can also be used to complement studies analyzing the spatial sensitivity of the models. For example, Urraca et al. (2017) evaluated different models in relation to the spatial performance at nearly forty ground stations in central Spain from 2001 to 2013. They focused on studying the spatial variability of irradiance measurements and on obtaining irradiance estimates by means of interpolation techniques and found that the accuracy of interpolation at each location depends on the distance to the closest pyranometer. If the determination of the distance between the pyranometers needed to cover a certain area is complemented by the determination of the minimum number of TGSI measurements needed to obtain accurate estimates, it would constitute an important advance in optimizing the resources and, at the same time, in generating more accurate solar radiation maps.

Summary and concluding remarks

This work analyzed the performance of the empirical and ANN models in order to estimate TGSI in Córdoba City between the years 2000 and 2013. From the statistical analysis, it was found that all the analyzed empirical and ANN models, except the Bristow and Campbell model, can be used to estimate TGSI in Córdoba City with good accuracy provided that a large number of measurements are available to calibrate these models. However, sites where TGSI estimates are needed do not always meet this requirement. In addition, even when the calibration/training and evaluation stages for a given site are performed using an appropriate number of measurements, the results cannot be applied to other nearby sites. This expected result was statistically shown when the empirical coefficients previously reported in the literature for the Almorox et al. meteorological model at a site located almost 100 km away were applied to Córdoba City.

If the spatial closeness between two sites does not guarantee accurate TGSI estimates by using the same empirical coefficients, then TGSI measurements are required to calibrate the models. This fact led us to investigate the minimum number of daily TGSI measurements required to calibrate/train each model. The number of daily TGSI measurements per month used to calibrate the empirical models and to train the ANN configuration was varied. From the analysis of the results, it was concluded that the ANN model is the only one that can provide reliable estimates of daily TGSI with 10 measurements per month (or even fewer) for 1 year. This represents an important advantage over the empirical models, although the latter are simple to apply. However, the main weakness of this method is that its application is still limited to sites where TGSI measurements are recorded.