
1 Introduction

Due to the increasing use of photovoltaic systems for alternative energy generation, it is difficult to obtain mathematical or physical models that lead to the efficient use of such systems. At the same time, the tools and algorithms that Machine Learning provides for the use and treatment of data are useful for modeling photovoltaic generation systems. Within the field of data-based forecasting we have multiparametric linear regression, which produces forecasts taking into account a set of independent variables that affect the target or dependent variable. These variables influence the prediction to some degree, but some of them carry little weight in the final forecast, so it is convenient to eliminate them in order to reduce processing cost and time. Among the techniques for excluding irrelevant variables or predictors we have: subset selection, Shrinkage regularization, and dimension reduction. Within the first group, which identifies and selects among all the available predictors those most related to the target variable, we have best subset selection and stepwise selection; the latter includes forward, backward, and hybrid variants. Within the backward approach we have Recursive Feature Elimination (RFE), which is the algorithm used in this paper to model the multiparameter photovoltaic system. RFE is used in various studies, such as attribute selection in classifiers based on artificial neural networks for the detection of cyberbullying [1]; in conjunction with other techniques, such as SVR, for feature selection based on twin support vector regression [2]; and with SVM and Bayes for categorical classification [3]. For the modeling of emotions and affective states from EEG, RFE has been combined with Random Forest (RF), Support Vector Regression (SVR), and tree-based bagging [4]. In identifying features for football game earnings forecasting, it has been combined with Gradient Boosting and Random Forest [5]. In the prediction of boiler system failures, the RFE algorithm has been used in combination with support vector machine recursive feature elimination (SVM-RFE) [6]. In the phenotyping of high-yield plants [7], SVM-RFE, LASSO logistic regression, and random forest are used to eliminate spectral features. To perform short-term electricity price and load forecasting with KNN, [8] uses RFE to eliminate feature redundancy. For heart transplant tests in pigs, [9] uses a combination of RFE-SVM to select the parameters for the estimation of V0. In the present work, we combine RFE with the Shrinkage regularization algorithms Ridge, Lasso, and Bayesian Ridge, establishing a hybrid algorithm for modeling the multiparameter photovoltaic system.

2 Methodology

In regression models, a compromise must be made between the bias and the variance contributed by the data to be predicted and the fitted model. For this, the theory provides the following variable (feature) selection methods: subset selection, Shrinkage, and dimension reduction. The first identifies and selects, among all the available predictors, those most related to the response variable. Shrinkage (regularization) fits the model including all predictors, but uses a method that shrinks the regression coefficient estimates toward zero. Dimension reduction creates a small number of new variables from combinations of the original variables. Each of them has a subset of techniques: for subset selection, best subset selection and stepwise selection (forward, backward, and hybrid); for Shrinkage, Ridge, Lasso, and ElasticNet; for dimension reduction, principal components, Partial Least Squares, and t-SNE. Subset selection is the task of finding a small subset of the most informative elements in a base set. In addition to helping reduce computational time and memory, because the algorithm works on a much smaller representative set, it has found numerous applications, including image and video summarization, speech and document summarization, clustering, feature and model selection, sensor placement, social media marketing, and product recommendation [10]. The Recursive Feature Elimination (RFE) method used here works by recursively removing attributes and building a model on the remaining attributes. It uses accuracy metrics to rank the features by importance. RFE takes as input the model to be used and the required number of features. It then returns the ranking of all the variables, with 1 being the most important, and also provides a support indicator: True for a relevant feature and False for an irrelevant one.
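For illustration, a minimal scikit-learn sketch of this interface is shown below; the data are synthetic placeholders rather than the photovoltaic measurements used in this study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data standing in for the photovoltaic measurements.
X, y = make_regression(n_samples=500, n_features=14, n_informative=11,
                       noise=0.1, random_state=0)

# RFE takes the base model and the number of features to keep as input.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=11)
rfe.fit(X, y)

print(rfe.ranking_)   # 1 marks the most important features
print(rfe.support_)   # True = relevant feature, False = discarded feature
```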

The data were pre-processed by removing the null values. Next, non-multicollinearity between the predictors was checked using a correlation heat map. Three hybrid variable-selection methods were built: RFE-Lasso, RFE-Ridge, and RFE-Bayesian Ridge; RFE-OLS was used as the baseline for our work. Finally, the results were validated under the conditions of linearity, normality, no autocorrelation of the error terms, and homoscedasticity.
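As a sketch of this experimental design, the three hybrid methods and the RFE-OLS baseline could be assembled as scikit-learn pipelines; the number of selected features and the alpha values anticipate the results reported in Sect. 5.2, and the rest is an assumption about the implementation rather than the authors' code.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.pipeline import Pipeline

# Each hybrid method chains RFE (backward elimination) with a Shrinkage regressor;
# RFE-OLS serves only as the baseline for comparison.
regressors = {
    "RFE-OLS":            LinearRegression(),
    "RFE-Ridge":          Ridge(alpha=1.538),
    "RFE-Lasso":          Lasso(alpha=0.01),
    "RFE-Bayesian Ridge": BayesianRidge(),
}

pipelines = {
    name: Pipeline([
        ("rfe", RFE(estimator=LinearRegression(), n_features_to_select=11)),
        ("regressor", reg),
    ])
    for name, reg in regressors.items()
}
# Each pipeline is later fit on the training split and scored on the test split.
```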

3 Methods

3.1 Recursive Feature Elimination

For RFE we will use the following algorithm:

  • 1 Tune/train the model on the training set using all predictors

  • 2 Calculate model performance

  • 3 Calculate variable importance or rankings

  • 4 For each subset size \( S_i \), i = 1...S do

    • 4.1 Keep the \( S_i \) most important variables

    • 4.2 Optional: pre-process the data

    • 4.3 Tune/train the model on the training set using the \( S_i \) predictors

    • 4.4 Calculate model performance

    • 4.5 Optional: recalculate the rankings for each predictor

    • 4.6 End

  • 5 Calculate the performance profile over the \( S_i \)

  • 6 Determine the appropriate number of predictors

  • 7 Use the model corresponding to the optimal \( S_i \)

The algorithm first fits the model to all predictors; each predictor is ranked by its importance to the model. Let S be a sequence of ordered numbers that are candidate values for the number of predictors to retain (\( S_1 \), \( S_2 \), ...). At each iteration of feature selection, the \( S_i \) highest-ranked predictors are retained, the model is refit, and performance is evaluated. The best-performing value of \( S_i \) is determined, and the top \( S_i \) predictors are used to fit the final model. The algorithm has an optional step (4.5) in which the predictor rankings are recalculated on the model with the reduced feature set. For random forest models, performance decreased when the rankings were recalculated at each step. However, in other cases, when the initial rankings are not good (for example, linear models with highly collinear predictors), recalculation may slightly improve performance [11].
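Steps 5-7, the performance profile and the choice of the optimal subset size, can be approximated with scikit-learn's RFECV, which cross-validates every candidate subset size. The sketch below uses synthetic data and an ordinary linear estimator as assumptions; it is not the exact configuration of this study.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic placeholder data.
X, y = make_regression(n_samples=500, n_features=14, n_informative=11,
                       noise=0.1, random_state=0)

# Cross-validated RFE: evaluates model performance for every candidate subset size S_i.
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=KFold(n_splits=5), scoring="r2")
rfecv.fit(X, y)

print(rfecv.n_features_)                     # optimal number of predictors
# Performance profile over the subset sizes (scikit-learn >= 1.0;
# older versions expose grid_scores_ instead).
print(rfecv.cv_results_["mean_test_score"])
```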

3.2 Ridge

For Ridge the sum of squared errors for linear regression is defined by Eq. 1:

$$\begin{aligned} E= \sum _{i=1}^N (y_i-\hat{y_i})^2 \end{aligned}$$
(1)

The data set we want to use to build machine learning models is assumed to follow a Gaussian distribution defined by its mean \( \mu \) and variance \( \sigma ^2 \), represented by \(N(\mu , \sigma ^2)\), i.e., \(X\mathtt {\sim }N(\mu , \sigma ^2)\), where X is the input matrix.

For any point \( x_i \), the probability of \( x_i \) is given by Eq. 2.

$$\begin{aligned} P(x_i)=\frac{1}{\sqrt{2\pi \sigma ^2}}e ^{-\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}} \end{aligned}$$
(2)

Since the occurrence of each \( x_i \) is independent of the occurrence of the others, their joint probability is given by Eq. 3:

$$\begin{aligned} p(x_{1},x_{2}, ...x_{N})=\prod _{i=1}^N{\frac{1}{\sqrt{2\pi \sigma ^2}}}e^{-\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}} \end{aligned}$$
(3)

Furthermore, linear regression is the maximum likelihood solution for the line of best fit, given by Eq. 4:

$$\begin{aligned} P(X\mid \mu )=p(x_{1},x_{2}, ...x_{N})=\prod _{i=1}^N{\frac{1}{\sqrt{2\pi \sigma ^2}}}e^{-\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}} \end{aligned}$$
(4)

Linear regression maximizes this function in order to find the line of best fit. For this, we take the natural logarithm of the likelihood function L, then differentiate and set it equal to zero, as in Eqs. 5-10.

$$\begin{aligned} ln(P(X\mid \mu ))=ln(p(x_{1},x_{2}, ...x_{N}))= \end{aligned}$$
(5)
$$\begin{aligned} \ln \prod _{i=1}^N{\frac{1}{\sqrt{2\pi \sigma ^2}}}e^{-\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}}=\sum _{i=1}^N\ln {\left( \frac{1}{\sqrt{2\pi \sigma ^2}}e^{-\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}}\right) }= \end{aligned}$$
(6)
$$\begin{aligned} \sum _{i=1}^N\ln {\left( \frac{1}{\sqrt{2\pi \sigma ^2}}\right) } - \sum _{i=1}^N \frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2} \end{aligned}$$
(7)
$$\begin{aligned} \frac{\partial \ln (P(X\mid \mu ))}{\partial \mu }=\frac{\partial \sum _{i=1}^N\ln {\left( \frac{1}{\sqrt{2\pi \sigma ^2}}\right) }}{\partial \mu } - \frac{\partial \sum _{i=1}^N\frac{1}{2}\frac{(x_i-\mu )^2}{\sigma ^2}}{\partial \mu } \end{aligned}$$
(8)
$$\begin{aligned} = 0 + \sum _{i=1}^N \frac{(x_i-\mu )}{\sigma ^2}=\sum _{i=1}^N \frac{(x_i-\mu )}{\sigma ^2} \end{aligned}$$
(9)
$$\begin{aligned} \frac{\partial \ln (P(X\mid \mu ))}{\partial \mu }=\sum _{i=1}^N\frac{(x_i-\mu )}{\sigma ^2}=0\Longrightarrow \mu =\frac{\sum _{i=1}^N x_i}{N} \end{aligned}$$
(10)

What we take into account here is that maximizing the likelihood function L is equivalent to minimizing the error function E. Furthermore, y is Gaussian distributed with mean \( w^T X \) and variance \( \sigma ^2 \), as shown in Eq. 11.

$$\begin{aligned} y\mathtt {\sim }N(\omega ^T X, \sigma ^2) \quad \text {or} \quad y=\omega ^T X + \varepsilon \end{aligned}$$
(11)

Where \(\varepsilon \mathtt {\sim }N(0, \sigma ^2)\), i.e., \( \varepsilon \) is Gaussian noise with zero mean and variance \( \sigma ^2 \). This is equivalent to saying that, in linear regression, the errors are Gaussian and the trend is linear. For new data or outliers, the least-squares prediction would be less accurate, so we use the L2 regularization method, or Ridge regression. To do this, we modify the cost function and penalize large weights as in Eq. 12:

$$\begin{aligned} J_{RIDGE}= \sum _{i=1}^N (y_i-\hat{y_i})^2 + \lambda |w|^2 \end{aligned}$$
(12)

Where: \(|w|^2 = w^T w = w_1^2 + w_2^2 +\cdots +w_D^2\)

We now have two probabilities:

Likelihood:

$$\begin{aligned} P(Y|X,w)=\prod _{n=1}^N{\frac{1}{\sqrt{2\pi \sigma ^2}}}exp\left( -\frac{1}{2\sigma ^2}(y_n - w^T x_n)^2\right) \end{aligned}$$
(13)

Prior:

$$\begin{aligned} P(w)=\frac{\lambda }{\sqrt{2\pi }}exp(-\frac{\lambda }{2} w^T w) \end{aligned}$$
(14)

3.3 Ridge-Bayesian

So, applying Bayes' theorem, the posterior is proportional to the likelihood times the prior:

$$\begin{aligned} exp(-J) = \prod _{n=1}^N exp(-(y_n - w^T x_n)^2)\, exp(-\lambda w^T w) \end{aligned}$$
(15)

Taking the negative logarithm: \( J=(Y-Xw)^T (Y-Xw) +\lambda w^T w \)

$$\begin{aligned} = Y^T Y- 2Y^T Xw + w^T X^T Xw + \lambda w^T w \end{aligned}$$
(16)

To minimize J, we take \( \frac{\partial J}{\partial w} \) and set it to 0. Therefore, \( -2X ^ T Y + 2X ^ T Xw + 2 \lambda w = 0 \)

So \( (X ^ T X + \lambda I) w = X ^ T Y \), i.e., \( w = (X ^ T X + \lambda I)^{-1} X ^ T Y \)

This method encourages the weights to be small, since P(w) is a Gaussian centered at 0. This value of w is called the MAP (maximum a posteriori) estimate of w.
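As a numerical illustration of this closed form (synthetic data, not the study's measurements), the MAP/Ridge weights can be computed directly with NumPy:

```python
import numpy as np

def ridge_map_weights(X, y, lam):
    """Closed-form Ridge / MAP solution: w = (X^T X + lambda I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming the explicit inverse.
    return np.linalg.solve(A, X.T @ y)

# Tiny illustrative example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

print(ridge_map_weights(X, y, lam=1.0))   # shrunk toward zero relative to OLS
```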

3.4 Lasso

In the same way, for Lasso:

$$\begin{aligned} J_{LASSO} = \sum _{i=1}^N (y_i- \hat{y_i})^2 + \lambda |w| \end{aligned}$$
(17)

Maximizing the likelihood

$$\begin{aligned} P(Y|X,w) = \prod _{n=1}^N {\frac{1}{\sqrt{2\pi \sigma ^2}}} exp\left( -\frac{1}{2 \sigma ^2}(y_n - w^T x_n)^2\right) \end{aligned}$$
(18)

and the prior is given by:

$$\begin{aligned} P(w) = \frac{\lambda }{2} exp(-\lambda |w|) \end{aligned}$$
(19)

So that \( J = (Y - Xw) ^ T (Y - Xw) + \lambda | w | \)

and \( \frac{\partial J}{\partial w} = -2X ^ T Y + 2X ^ T Xw + \lambda \, sign (w) = 0 \)

Where \( sign (w) = 1 \) if \( w > 0 \), \( -1 \) if \( w < 0 \), and \( 0 \) if \( w = 0 \).
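The sign(w) term in this condition is what drives small coefficients exactly to zero. In the one-dimensional case it leads to the soft-thresholding operator, sketched below as a generic illustration (not part of the paper's implementation):

```python
import numpy as np

def soft_threshold(z, threshold):
    """One-dimensional Lasso update: coefficients inside the threshold are set to
    exactly zero, larger ones are shrunk by the threshold."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

z = np.array([-2.0, -0.3, 0.0, 0.4, 3.0])
# Small entries become zero, large ones are shrunk by 0.5: this is the sparsity
# behaviour that lets Lasso discard predictors.
print(soft_threshold(z, 0.5))
```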

4 Data Set

4.1 Data Acquisition

The data were collected in the department of Puno, whose coordinates are \( 15 ^ {\circ } \) \( 29 ^ \prime \) \( 27 ^ {\prime \prime } \) S and \( 70 ^ {\circ } \) \( 07 ^ \prime \) \( 37 ^ {\prime \prime } \) W. The measurement period ran from April to August 2019.

The data analyzed were: AC Voltage, AC Current, Active Power, Apparent Power, Reactive Power, Frequency, Power Factor, Total Energy, Daily Energy, DC Voltage, DC Current, and DC Power, which were obtained through the StecaGrid 3010 inverter. The ambient and photovoltaic panel temperatures were obtained with PT1000 sensors, which are suitable as temperature-sensing elements given their sensitivity, precision, and reliability. Irradiance was obtained through a calibrated Atersa cell, whose output signal depends exclusively on solar irradiance and not on temperature. The amount of data was reduced from 331157 to 123120 records because many of the values obtained were null, for example, the values recorded at night. Statistics of the pre-processed data, such as the mean, standard deviation, minimum, maximum, and interquartile percentiles, are presented in Table 1 and Table 2.

Table 1. Statistics table
Table 2. Statistics table
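For reference, a minimal pandas sketch of this pre-processing and of the summary statistics in Tables 1 and 2; the file name is a hypothetical placeholder, not the actual data file.

```python
import pandas as pd

# Hypothetical file name; the actual measurements come from the StecaGrid 3010
# inverter, the PT1000 temperature sensors, and the Atersa calibrated cell.
df = pd.read_csv("puno_pv_2019.csv")

# Removing null records (e.g. night-time values) reduces the data
# from 331157 to 123120 rows.
df = df.dropna()

# Mean, standard deviation, min, max, and quartiles per variable,
# as summarized in Table 1 and Table 2.
print(df.describe().T)
```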

5 Results

5.1 Non-multicollinearity Between Predictors - Correlation

The independent variables (predictors) should not be correlated with each other, since correlation causes problems in interpreting the coefficients, as well as the error contributed by each one. To check this, a correlation heat map was used. Correlation is the basis for eliminating or reducing some variables, either through a variable selection algorithm or by the researcher's judgment; advanced methods use an algorithm, as will be done later. Fig. 1 displays the correlation matrix used to validate the subsequent results.
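A heat map such as the one in Fig. 1 can be produced with a short seaborn sketch; the file and column names below are hypothetical placeholders, and 'Potencia activa' is assumed to be the target variable.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical file/column names; the target 'Potencia activa' is excluded
# so that only the predictors appear in the matrix.
df = pd.read_csv("puno_pv_2019.csv").dropna()
corr = df.drop(columns=["Potencia activa"]).corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Correlation matrix of the predictors")
plt.tight_layout()
plt.show()
```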

Fig. 1. Correlation matrix.

5.2 Prediction

First, the RFE method was applied for variable selection; to the features obtained we then applied the following Shrinkage regularization methods: Lasso, Ridge, and Bayesian Ridge. The data set was divided into a training set of 98496 records (80%) and a test set of 24624 records (20%); for better performance, random seeds were also used, the best seed being 8849. Applying the RFE algorithm, the following result is obtained:

[‘Tension AC’, ‘Corriente AC’, ‘Potencia aparente’, ‘Potencia reactiva’, ‘Frecuencia’, ‘Factor de potencia’, ‘Energia total’, ‘Energia diaria’, ‘Tension DC’, ‘Corriente DC’, ‘Potencia DC’, ‘Irradiancia’, ‘Temp modulo’, ‘Temp ambiente’]

[True, True, True, True, True, True, False, False, True, True, True, False, True, True]

[‘Tension AC’, ‘Corriente AC’, ‘Potencia aparente’, ‘Potencia reactiva’, ‘Frecuencia’, ‘Factor de potencia’, ‘Tension DC’, ‘Corriente DC’, ‘Potencia DC’, ‘Temp modulo’, ‘Temp ambiente’]

Of the 14 variables evaluated, the optimal number of feature variables for RFE was 11, with a score of 0.999768. It is important to mention that RFE discards "Energia total", "Energia diaria", and "Irradiancia". The hyperparameters are then determined: for Ridge an alpha value of 1.538 and for Lasso an alpha value of 0.01. For the models found, we determined \(R^2\) and adjusted \(R^2\), the mean absolute error (MAE), the root mean square error (RMSE), and the score.
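A sketch of how these metrics could be computed for one of the hybrid models (RFE-Ridge is shown); the file and column names are hypothetical placeholders, while the split sizes, seed, number of selected features, and alpha follow the values reported in the text.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical file/column names; 'Potencia activa' is assumed to be the target.
df = pd.read_csv("puno_pv_2019.csv").dropna()
X = df.drop(columns=["Potencia activa"]).values
y = df["Potencia activa"].values

# 80%/20% split with the best seed reported in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8849)

# Hybrid RFE-Ridge model: RFE keeps 11 features, Ridge uses alpha = 1.538.
model = Pipeline([
    ("rfe", RFE(LinearRegression(), n_features_to_select=11)),
    ("ridge", Ridge(alpha=1.538)),
])
model.fit(X_train, y_train)
pred = model.predict(X_test)

# R^2, adjusted R^2, MAE, and RMSE on the test set.
r2 = r2_score(y_test, pred)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"R2={r2:.6f}  adj R2={adj_r2:.6f}  MAE={mae:.4f}  RMSE={rmse:.4f}")
```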

Table 3 and Table 4 show the values obtained for the proposed groups. The RFE method with OLS is not part of the research proposal; its result is used as a baseline against which the research results are compared. The remaining methods, RFE with Lasso, RFE with Ridge, and RFE with Bayesian Ridge, form the proposal of this research.

6 Validation of the Results

To validate the results provided by the models, we must verify certain assumptions of linear regression. If they are not fulfilled, the interpretation of the results will not be valid.

Fig. 2. RFE linearity: (a) OLS model, (b) Ridge model, (c) Lasso model, (d) Bayesian Ridge model.

6.1 Linearity

There must be a linear relationship between the actual data and the prediction; otherwise the model may produce inaccurate predictions. This is checked with a scatter plot in which the points should lie on or around the diagonal line of the diagram. Fig. 2 shows the linear relationship.

6.2 Normality of Error Terms

The error terms should be normally distributed. The histogram and the probability plot are shown in Fig. 3.

Fig. 3. RFE residual distribution: (a) OLS model, (b) Ridge model, (c) Lasso model, (d) Bayesian Ridge model.

6.3 No Autocorrelation of the Error Terms

Autocorrelation indicates that some information the model should capture is missing; it appears as a systematic bias below or above the prediction. For this we use the Durbin-Watson test: values below 2 indicate positive autocorrelation, values above 2 indicate negative autocorrelation, and values near 2 indicate little or no autocorrelation. For RFE-OLS the Durbin-Watson statistic is 2.0037021333412754, so there is little to no autocorrelation. For RFE-Bayesian Ridge it is 2.0037008807358965, for RFE-Lasso it is 2.0037472224605053, and for RFE-Ridge it is 2.0037017639830537; in all cases there is little to no autocorrelation.
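The Durbin-Watson statistics reported above can be computed directly from the test residuals; a minimal sketch with statsmodels, assuming `y_test` and `pred` as in the evaluation sketch of Sect. 5.2:

```python
from statsmodels.stats.stattools import durbin_watson

# Residuals of the fitted model on the test set (y_test and pred as in Sect. 5.2).
residuals = y_test - pred

# Values near 2 indicate little or no autocorrelation of the error terms.
print("Durbin-Watson:", durbin_watson(residuals))
```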

6.4 Homoscedasticity

It must be fulfilled that the error made by the model always has the same variance. Heteroscedasticity appears when the model gives too much weight to a subset of the data, particularly where the variance of the error is greatest; to detect it, the residuals are plotted to see whether their variance is uniform (Fig. 4).

Fig. 4. RFE homoscedasticity: (a) OLS model, (b) Ridge model, (c) Lasso model, (d) Bayesian Ridge model.

7 Description and Analysis of the Results

In this article we present three hybrid methods for variable selection in the multiparameter regression of photovoltaic systems, used to predict the active power of the photovoltaic system from 14 independent variables. These methods are RFE-Lasso, RFE-Ridge, and RFE-Bayesian Ridge.

Table 3. RFE with OLS and Shrinkage
Table 4. RFE with OLS and Shrinkage

Table 3 and Table 4 show the method comparison. RFE-OLS, which is not part of our proposal, was compared with OLS to provide a benchmark for the comparisons that follow, which are part of the proposal. RFE-Lasso has a mean absolute error approximately 0.035% greater than Lasso, which counts against the proposal; a mean squared error approximately 0.057% lower than Lasso, a significant result considered an advantage; a coefficient of determination approximately 0.0000309% higher than Lasso, essentially the same, so it is not considered very advantageous but in no way a disadvantage; an adjusted coefficient of determination approximately 0.0000315% greater than Lasso, considered an advantage as in the previous case; a training time approximately 30.904% lower, which is considered a major contribution of this hybrid method; and a test time approximately 4.161% greater than Lasso, which is considered a disadvantage of the proposed model. The other two hybrid methods, RFE-Ridge compared with Ridge and RFE-Bayesian Ridge compared with Bayesian Ridge, are also shown in Table 3 and Table 4; their description and analysis are similar to that of RFE-Lasso compared with Lasso.

8 Conclusions

The selection of independent variables of the multiparameter photovoltaic system allowed us to develop four prediction models with an accuracy greater than 99.97% in all cases. Three RFE proposals are presented: RFE-Ridge, RFE-Lasso, and RFE-Bayesian Ridge. Training time was reduced by 71% for RFE-Ridge over Ridge and by 36% for RFE-OLS over OLS. The variables eliminated with RFE-Ridge and RFE-Bayesian Ridge were "Energia total", "Energia diaria", and "Irradiancia"; additionally, the variable eliminated by RFE-Lasso was "Frecuencia". The root mean square error was reduced by 0.15% for RFE-Lasso over Lasso and by 0.06% for RFE-Bayesian Ridge over Bayesian Ridge. From all that has been done, we note that the proposed hybrid method, by eliminating variables that are not significant for the system, reduces training times without losing prediction accuracy. The results could be improved by implementing algorithms in the pre-processing stages, such as value imputation, or by applying techniques other than linear regression, such as neural networks or XGBoost.