1 Introduction

Historically, load forecasting has been very important for both transmission and distribution electricity companies. With the liberalization of electricity markets, it has also become extremely important for trading companies, which purchase electrical energy in bulk at variable prices and sell it to consumers at fixed rates. To reduce the associated financial risk, the trader must forecast the demand of its customers as precisely as possible in order to provide them with good service at low cost. However, load forecasting is becoming increasingly difficult due to the variability of load curves resulting from dynamic bidding strategies, time-varying electricity prices, price-dependent loads, economic cycles and weather conditions, among other factors. Therefore, it is imperative to investigate advanced prediction models.

Several approaches for short-term load forecasting (STLF), that is, the prediction of the system load over an interval ranging from 1 h to 1 week, have been reported in recent decades and can be divided into statistical methods, artificial-intelligence-based methods and hybrid approaches [1, 2]. The first category includes linear regression, exponential smoothing, stochastic processes, state space and time series methods; a review of statistical methods for electric load forecasting is given in [3], for instance. Approaches based on artificial intelligence, such as pattern recognition, neural networks, fuzzy neural networks and expert systems, have been widely explored for load forecasting [4]. The extensive work in this domain reflects both the increasing complexity of the factors that affect consumption and the growing number of different technologies that have been proposed and tested. This has led to more accurate load forecast methods and to a new generation of hybrid load forecasting methods [5]. Using hybrid models, or combining several models, has become common practice and often improves forecasting performance. Given that combining several methods frequently outperforms single methods, we focus on load forecasting through functional clustering and ensemble learning.

In this work, an approach based on the divide-and-conquer paradigm is proposed. First, the original load diagram database is segmented into distinct groups, according to the phase and amplitude of the load curves, using a functional clustering algorithm. Next, the consumption points of each of the resulting groups are subdivided according to the seven climatic regions that we have defined for our country. The purpose of this finer division of the groups is to study the effect of temperature on load forecasting. Then, extreme learning machines (ELMs) of varying complexity are individually trained on each of these subdivisions, producing specific ELM models. To obtain the final prediction, these individual models are combined into a single model through ensemble learning, an effective strategy to improve upon the accuracy of a single learner.

Functional clustering is a recent research area [6, 7] and has been little explored in load forecasting. Only a few works have recently appeared in the literature, such as [8], which uses functional clustering and linear regression for short-term peak load forecasting applied to past heating demand data in a district heating system, and [9], which focuses on predicting electricity consumption by means of a functional linear regression model. ELM is a simple learning algorithm for feedforward neural networks, which randomly selects the hidden node parameters (including input weights and biases) and analytically determines the output weights of a single hidden layer feedforward neural network. In doing so, the training time is substantially reduced while still reaching minimum training error. ELM has been adopted to build prediction models in various real-world problems, showing fast learning speed and good generalization performance. For short-term load forecasting, where high data volumes must be processed, ELM is well suited to the training and forecasting task. For instance, in [10] an online power load forecasting method based on a regularized fixed-memory ELM is proposed to improve the accuracy and speed of load forecasting. Ensemble methods have also been applied and tested for load forecasting. For instance, in [11] the authors proposed a meta-learning system for multivariate time series forecasting as a general framework for using selective ensemble techniques, and in [12] the re-forecast ensembles consist of various time series models combined using least-squares optimization.

The work described here differs from the previous solutions reported in the load forecasting literature because it combines functional clustering with ensemble learning of extreme learning machines, an innovative solution that, as we will show, yields good results.

The remainder of this paper is organized as follows. In Sect. 2, the proposed methodology is presented, together with a brief explanation of the techniques used to implement it: functional clustering, ELM and the theory of ensemble models. In Sect. 3, the methodology is demonstrated on the case study, starting with a presentation of the clients' load diagram database, followed by the segmentation obtained with the functional clustering algorithm and an analysis of the resulting groups; the generation of the ELM models is then explained, followed by a description of the ensemble learning step. In Sect. 4, the results of the experiments are presented and discussed. In the last section, conclusions and future work are given.

2 Methodology

The load forecast to be supplied by a trader with clients from different economic activities must be built from the bottom up, i.e., by aggregating the clients' load profiles. This aggregation must result in client groups with load profiles as similar as possible; if this is achieved, better predictions will also be obtained with each of the group models. Since we are interested in a single forecast, combining all the individual models, trained on different samples, allows them to compensate for each other, and the final model can reduce the aggregate variance and bias, thus tending to increase accuracy over the individual models (see Sect. 2.3). The block diagram of the proposed methodology is shown in Fig. 1.

Fig. 1
figure 1

Block diagram of the proposed methodology

Generically, the methodology is divided into three main levels and is implemented in a modular way, so that it is easy to experiment with different configurations and study several effects on the final load forecast, such as the influence of temperature conditions on the clusters' electricity consumption, the number of different models in the ensemble learning and different combination schemes.

In order to minimize the temperature forecast error, the country was segmented into seven distinct regions, each with homogeneous climate conditions. Portugal is Europe's southwestern extremity. The country is bordered by the Atlantic Ocean along its extensive western coast, from north to south, and it is on the coast that most of its population is concentrated. The climate differs considerably between the inland and the coast. On the coast, temperatures are milder and rainfall is higher; inland, there is a greater difference between the minimum and maximum temperatures, but less rain. There is also a gradual increase in temperature from north to south. Because of these differences, and also because of the population distribution, the country was segmented into two inland regions (6 and 7) and five regions along the coast (see the map of Portugal in Fig. 1).

The methodology is oriented toward handling short-term forecasting in a unified framework composed of three levels, preceded by a data-preparation stage in which the daily load diagrams of all clients since 2011 are selected from the database and preprocessing operations such as data cleaning, data reduction and data normalization are applied. In the first level, the load diagrams are segmented according to their phase and amplitude, and the consumption points of each of the resulting clusters are then distributed over the seven weather regions defined for the country. In the second level, specific ELM models are developed for the cluster/weather-region subdivisions, taking their ambient temperature into account. In the third level, the final prediction is obtained by an ensemble of the individual model predictions, in order to achieve higher forecasting accuracy.

2.1 Functional clustering

Cluster analysis groups data objects based only on information in data that describes the objects and their relationships. The goal is that the objects within a group are similar to one another and different from the objects in the other groups. The grouping of objects is based on a distance or similarity function, so that clusters can be formed from objects with a high similarity to each other.

The majority of clustering algorithms have been developed for static data, that is, data whose feature values do not change with time, or change very little. Because of this, traditional clustering methods, such as partition, hierarchical or model-based methods, are not adequate for grouping data that arise as curves, designated functional data or longitudinal data. When applied to functional data, traditional clustering methods are able to detect extremes and local amplitude variation, but do not take phase variation into account and hence assume that its presence is limited. In [13], it is shown that ignoring the phase variation of functional data may result in a loss of information. To handle this kind of data, several functional clustering algorithms have been developed. Some are modifications of traditional clustering algorithms adapted to handle time series data; others convert the time series data so that traditional clustering algorithms can be used directly. The former approach usually works directly with the raw time series data, and is thus called the raw-data-based approach; its major modification lies in replacing the distance/similarity measure for static data with one appropriate for time series. The latter approach first converts the raw time series either into a feature vector of lower dimension or into a number of model parameters and then applies a conventional clustering algorithm to the extracted feature vectors or model parameters, and is thus called the feature-based approach [6].

In the system described here, the KmL algorithm [14], an implementation of k-means specifically designed to analyze longitudinal data, is used. The algorithm provides several techniques for dealing with missing values in trajectories, and it runs with distances specifically designed for longitudinal data, such as the Fréchet distance or dynamic time warping, or any user-defined distance. As k-means is a hill-climbing algorithm, in order to avoid convergence to a local optimum, KmL runs several times, varying the starting conditions and/or the number of clusters sought, and uses the Calinski and Harabasz quality criterion [15] to select the adequate number of clusters. In practice, it is better to have several criteria, so that their concordance strengthens the reliability of the result. In addition to Calinski and Harabasz, KmL also provides two other criteria, Ray and Turi [16] and Davies and Bouldin [17]. One of the advantages of KmL over existing algorithms is its graphical interface, which helps the user to choose the appropriate number of clusters.
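For illustration, the selection of the number of clusters with these two criteria can be sketched as follows (a minimal Python example using scikit-learn with standard k-means and Euclidean distance, not the KmL R implementation actually used in this work; the data matrix X and the cluster range are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def select_k(X, k_range=range(2, 16), n_init=20, seed=0):
    """Run k-means for several k and score each partition with the
    Calinski-Harabasz (maximize) and Davies-Bouldin (minimize) criteria."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit_predict(X)
        scores[k] = (calinski_harabasz_score(X, labels),
                     davies_bouldin_score(X, labels))
    best_ch = max(scores, key=lambda k: scores[k][0])   # peak of Calinski-Harabasz
    best_db = min(scores, key=lambda k: scores[k][1])   # minimum of Davies-Bouldin
    return scores, best_ch, best_db
```

In KmL itself, the equivalent choice is made after several runs per candidate number of clusters, with the graphical interface supporting the final decision.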

2.2 Extreme learning machines

Initially, a conventional neural network (NN) was applied with satisfactory accuracy but low computational performance. Given that performance was critical (the prediction for the next day has to be carried out overnight), more efficient techniques were investigated. In [18], the authors argue that the learning speed of feedforward neural networks is in general far slower than required for two key reasons: (1) slow gradient-based learning algorithms are extensively used to train the networks and (2) all the network parameters are tuned iteratively by such learning algorithms. They therefore propose a new learning algorithm, called extreme learning machine (ELM), for single hidden layer feedforward neural networks. This algorithm tends to provide good generalization performance at extremely fast learning speed, hence its name.

The ELM algorithm makes use of single-hidden-layer feedforward neural networks (SLFNs), having only an input layer, a hidden layer and an output layer. The main concept behind ELM lies in the random initialization of the SLFN weights and biases. Under the condition that the transfer functions in the hidden layer are infinitely differentiable, the optimal output weights for a given training set can be determined analytically, and they minimize the square training error. The trained network is thus obtained in very few steps and is very fast to train, which is the main reason we use it for the load forecasting of each cluster. Moreover, the ELM algorithm, unlike a conventionally trained SLFN, can be used as an adaptive algorithm. Given a training dataset with M arbitrary distinct samples (\(x_{i}\), \(y_{i}\)), with \(x_{i} \in R^d\) and \(y_{i} \in R\), the output function of the SLFN with N hidden nodes is modeled as the following sum

$$\begin{aligned} \sum _{i=1}^{N} \beta _{i}\, g( w_{i} \cdot x_{j} + b_{i}) = y_{j}, \quad j = 1,2, \ldots , M \end{aligned}$$
(1)

where g(x) is the activation function, \(w_{i}\) are the input weights to the ith neuron in the hidden layer, \(b_{i}\) the biases, and \(\beta _{i}\) the output weights. ELM is completely different from traditional iterative learning algorithms: it randomly selects the input weights and biases of the hidden nodes, \(w_{i}\) and \(b_{i}\), and analytically calculates the output weights \(\beta _{i}\) by finding the least-squares solution. In doing so, the training error can still be minimized while good generalization performance is achieved.

In the case where the SLFN perfectly approximates the data, meaning the error between the output \(\hat{y}_{j}\) and the actual value \(y_{j}\) is zero, (1) can be expressed in the following matrix form

$$\begin{aligned} H \beta = Y, \end{aligned}$$
(2)

where H is the hidden layer output matrix defined as:

$$\begin{aligned} H = \begin{pmatrix} g(w_{1} \cdot x_{1} + b_{1}) & \cdots & g(w_{N} \cdot x_{1} + b_{N}) \\ \vdots & \ddots & \vdots \\ g(w_{1} \cdot x_{M} + b_{1}) & \cdots & g(w_{N} \cdot x_{M} + b_{N}) \\ \end{pmatrix} \end{aligned}$$

and \(\beta = (\beta _{1} \cdots \beta _{N})^\mathrm{T}\) and \(Y = (y_{1} \cdots y_{M})^\mathrm{T}\).

Given the randomly initialized first layer of the ELM and the training inputs \(x_{i} \in R^d\), the hidden layer output matrix H can be computed. Given H and the target outputs \(y_{i} \in R\) (i.e., Y), the output weights \(\beta \) can be solved from the linear system defined by (2). This solution is given by \(\beta = H^{\dagger} Y\), where \(H^{\dagger}\) is the Moore–Penrose generalized inverse of the matrix H [19], and it is the unique least-squares solution of (2). The ELM algorithm is then:

Given a training set \((x_{i}, y_{i})\), \(x_{i} \in R^{d}\), \(y_{i} \in R\), an activation function \(g : R \mapsto R\) and N the number of hidden nodes,

  1. Randomly assign input weights \(w_{i}\) and biases \(b_{i}\), \(i \in [1,N]\);

  2. Calculate the hidden layer output matrix H;

  3. Calculate the output weight matrix \(\beta = H^{\dagger} Y\).

A more detailed presentation of the algorithm with theoretical proofs is presented in the original paper [18].
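As an illustration, the three steps above can be sketched in a few lines of NumPy (a minimal sketch with a sigmoid activation; the function and variable names are our own and do not come from [18]):

```python
import numpy as np

def elm_train(X, y, n_hidden, seed=0):
    """Train a single-hidden-layer ELM: random input weights/biases,
    output weights by least squares via the Moore-Penrose pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))   # input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                  # biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                     # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ y                               # output weights beta = H^dagger Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forecast with a trained ELM."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Here `np.linalg.pinv` computes the Moore–Penrose generalized inverse \(H^{\dagger}\) used in step 3.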

2.3 Ensemble models

The ensemble approach is based on multiple uncorrelated models with low error rates. The individual models are combined in some way (typically by voting for classification or by averaging for regression) into a single ensemble model as follows:

$$\begin{aligned} \hat{p}_\textit{ens}(t) = \frac{1}{m}\sum _{i=1}^{m} \hat{p}_{i}(t), \end{aligned}$$
(3)

where \(\hat{p}_\textit{ens}(t)\) is the output of the ensemble model, \(\hat{p}_{i}(t)\) are the outputs of the individual models, and m is the number of models. In [20], it was shown that the variance of the ensemble model is lower than the average variance of all the individual models. Let p(t) denote the true output that we are trying to predict and \(\hat{p}_{i}(t)\) the estimate of this value produced by model i. Then, we can write the output \(\hat{p}_{i}(t)\) of model i as the true value p(t) plus some error term \(e_{i}(t)\):

$$\begin{aligned} \hat{p}_{i}(t) = p(t) + e_{i}(t). \end{aligned}$$
(4)

Then, the expected square error of a model becomes:

$$\begin{aligned} E[(\hat{p}_{i}(t) - p(t))^2] = E [e_{i}(t)^2]. \end{aligned}$$
(5)

The average error for m models is given by:

$$\begin{aligned} E_\textit{avg} = \frac{1}{m}\sum _{i=1}^{m} E[e_{i}(t)^2]. \end{aligned}$$
(6)

Similarly, the expected square error of the ensemble model defined in (3) is given by:

$$\begin{aligned} E_\textit{ens} = E\left[ \left( \frac{1}{m}\sum \nolimits _{i=1}^{m} \hat{p}_{i}(t) - p(t) \right)^2 \right] = E\left[ \left( \frac{1}{m}\sum \nolimits _{i=1}^{m} e_{i}(t) \right)^2 \right]. \end{aligned}$$
(7)

Assuming the errors \(e_{i}(t)\) are uncorrelated, i.e., \(E[e_{i}(t)\, e_{j}(t)] = 0\) for \(i \ne j\), and have zero mean \((E[e_{i}(t)] = 0)\), we get

$$\begin{aligned} E_\textit{ens} = \frac{1}{m}E_\textit{avg}. \end{aligned}$$
(8)
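For completeness, the step from (7) to (8) follows by expanding the square and applying the two assumptions above (the cross terms vanish because the errors are uncorrelated):

$$\begin{aligned} E_\textit{ens} = \frac{1}{m^{2}} \sum _{i=1}^{m} E\big[e_{i}(t)^{2}\big] + \frac{1}{m^{2}} \sum _{i \ne j} E\big[e_{i}(t)\, e_{j}(t)\big] = \frac{1}{m^{2}} \sum _{i=1}^{m} E\big[e_{i}(t)^{2}\big] = \frac{1}{m}\, E_\textit{avg}. \end{aligned}$$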

In practice, errors tend to be highly correlated, so they may not be reduced as much as these equations suggest. Nevertheless, the use of ensemble models still leads to a reduction in the error: the expected test error of the ensemble does not exceed the average test error of the individual models \((E_\textit{ens} \le E_\textit{avg})\). The effectiveness of the ensemble model depends on the accuracy and the diversity of the base models; by segmenting the data into several clusters and developing a distinct forecasting model for each cluster, such diversity can be achieved.

3 Short-term load forecasting

3.1 Client’s database description

Elergone is an energy trading company that buys energy in bulk at variable prices and sells it at fixed rates to its clients. If Elergone does not buy the energy necessary to supply its clients, it incurs losses, so a correct prediction of their needs is fundamental to its business. The database contains load diagrams belonging to clients with different economic activities, such as offices, factories with continuous or weekly laboring, hotels, restaurants, health clubs, schools, shopping malls and supermarkets, among others. For each client, the daily load diagram (or load curve) is represented by a vector \(l=\{l_{1}, \ldots , l_{96}\}\), where \(l_{h}\) is the energy consumed during period h, \(h = 1, 2, \ldots, 96\), with measurements made every 15 min. These load diagrams have been collected since 2011, and the database currently comprises 370 clients. The dataset is publicly available at the UCI machine learning repository [21].

An initial analysis of the load curves of all clients shows that the difficulty in defining a satisfactory forecasting model is mainly due to the high variability of the load curves of the different clients. In Fig. 2, a random selection of load curves is plotted. Clearly, the curves are not aligned and show variation both in phase (horizontal) and in amplitude (vertical). This makes the figure difficult to interpret, but it also shows the need to separate the load curves into homogeneous groups; otherwise, it would be very challenging to create a model with acceptable accuracy.

Fig. 2
figure 2

Weekly load curves

3.2 Clients database segmentation

The first goal of this project was to stratify the set of load curves into a few homogeneous groups exhibiting similar demand patterns, that is, curves with similar phase and amplitude. The aim of grouping load diagrams with similar characteristic patterns is to detect a few groups that capture the main changes in load demand.

Several functional clustering algorithms were applied. The feature-based functional algorithms did not lead to meaningful groups, and of all the raw-data-based algorithms tested, the KmL clustering algorithm, available in the KmL R package on CRAN [22], provided the best results. To correctly separate the load diagrams according to their shape rather than their magnitude, it is important to consider the seasonality of the load diagrams, which requires at least one year of data. As the raw data are collected at a 15-min frequency and the market forecasts are made hourly, the data were converted from 15-min samples into hourly samples. Even so, due to the high dimensionality of the dataset, the KmL algorithm presented convergence difficulties. Concerning the number of load diagrams to consider for each client, there were therefore two possibilities: choose some representative weeks of each season or produce an average week. The latter option, called the average weekly load diagram (AWLD), was chosen. The data used to produce the AWLD ranged from July 2013 to June 2014, for all days of the week and for each consumption point, which gives a \(370\times 168\) matrix.
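As a sketch of how the AWLD of one client can be built from the raw measurements (Python/pandas; the column name, the use of a sum to aggregate the 15-min energy readings into hourly values and the function name are our assumptions, not the exact preprocessing code used):

```python
import pandas as pd

def average_weekly_load_diagram(df):
    """Build the 168-value average weekly load diagram (AWLD) of one client.

    df: DataFrame indexed by timestamp (15-min frequency) with a 'load' column,
    covering at least one full year (here July 2013 - June 2014)."""
    hourly = df['load'].resample('1H').sum()            # 15-min samples -> hourly energy
    key = [hourly.index.dayofweek, hourly.index.hour]   # (day of week, hour of day)
    awld = hourly.groupby(key).mean()                    # average over all weeks
    return awld.to_numpy()                               # vector of length 7*24 = 168
```

Stacking the AWLDs of all 370 clients row-wise then yields the \(370\times 168\) matrix mentioned above.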

As already stated, the main goal of this segmentation is to group the data by shape rather than by magnitude. Thus, the AWLD of each client is normalized using min-max normalization [23], resulting, for every consumption point, in a normalized AWLD with values in [0, 1]. To segment the load diagrams into several clusters, the KmL algorithm was applied to the normalized AWLDs. The algorithm starts by transforming the load diagrams into a ClusterizLongData object, in which all the partitions found are stored. Once an object of class ClusterizLongData has been created, KmL runs k-means several times, varying the starting conditions and the number of clusters. The range of the number of clusters was preset between 2 and 15, and the default distance function used was the Euclidean distance with Gower adjustment.
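The per-client min-max normalization can be sketched as follows (NumPy; the small epsilon guarding against flat curves is our addition):

```python
import numpy as np

def min_max_rows(A, eps=1e-12):
    """Scale each row (one client's AWLD) to [0, 1], preserving its shape."""
    mins = A.min(axis=1, keepdims=True)
    maxs = A.max(axis=1, keepdims=True)
    return (A - mins) / (maxs - mins + eps)
```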

The optimal number of clusters is the one that maximizes the Calinski and Harabasz criterion C(k). However, a given criterion may work better on some datasets than on others, so two distinct criteria were computed. Figure 3 displays the two criteria estimated by the algorithm (Calinski and Harabasz, and Davies and Bouldin). Small values of the Davies and Bouldin criterion correspond to compact clusters whose centers are far away from each other; therefore, the cluster configuration that minimizes Davies and Bouldin gives the optimal number of clusters.

Fig. 3
figure 3

Calinski and Harabasz and Davies and Bouldin criteria

As can be seen in Fig. 3, the two criteria are concordant. There is a distinct peak in the Calinski and Harabasz criterion at seven clusters, and the minimum of the Davies and Bouldin criterion is also reached at seven clusters. So, for both criteria the best value of k is seven, which indicates that seven clusters is the best partition for this database.

Table 1 describes the clusters obtained in quantitative terms, that is, the number of consumption points included in each one of the clusters and the corresponding aggregate consumption.

Table 1 Quantitative cluster characterization

Figure 4 displays, for each cluster, the shape of its average weekly load diagrams. The results of the segmentation are quite satisfactory, since the load curves of each cluster have a distinct shape.

Fig. 4
figure 4

Average week load curves

By inspecting the clients' activity in each cluster, cluster 1 was found to be mainly composed of retailers, cluster 2 essentially includes public medium/low voltage (MV/LV) substations, cluster 3 basically contains shopping malls, cluster 4 is a mix of factories with weekly and continuous laboring and logistics, cluster 5 includes colleges, schools and offices, cluster 6 includes logistic installations, and finally cluster 7 contains four cogeneration plants.

3.3 ELMs configuration models

Many factors affect load demand, including economic cycles and client activity, among others. In addition, the load profile of a client can change due to the introduction of energy efficiency measures. Thus, current-year data are important to capture economic cycles and the recent behavior of the forecasting point/cluster.

Another factor affecting load demand, on which positions in the scientific community diverge, is temperature. Several previous works [10, 24,25,26] confirm that outside temperature has a strong influence on electricity consumption because of the use of air conditioning/refrigeration in summer and heating in winter, among other reasons. On the other hand, there are also works [27, 28] that argue that temperature has almost no value for load forecasting. Therefore, in this work we study the effect of both aggregation and temperature on electricity load forecasting with four distinct configurations.

In the first configuration, one ELM is trained for each consumption point. In the second configuration, one ELM is trained for each of the clusters obtained with the functional segmentation. In these first two configurations, no temperature is considered. In the following two configurations, each cluster is divided by spreading its consumption points over the seven climate regions of the country, according to the geographical location of each consumption point. For example, cluster 7 has four cogeneration plants spread over three regions: two are located in region 2, one in region 4 and one in region 6. Applying the same procedure to all consumption points of all clusters resulted in 36 distinct partitions (Table 2). In the third configuration, the more accurate regional temperature is included as an input variable in each of the ELM subdivision models. The fourth configuration is identical to the third, but the load forecasting is made without considering temperature. The four configurations are summarized in Table 3.

Table 2 Clusters points distribution over the country regions
Table 3 Summary of configurations

In the first, second and fourth configurations, since no temperature is used, several ELMs were trained using data selected from the current year and from the 2 years before the date to forecast, in order to best capture the influence of all the previously mentioned factors on load forecasting. Each training dataset for each point/cluster/subdivision is therefore composed of two past periods of n_days immediately preceding the day to forecast in the current year and, for each of the two previous years, one period of n_days before and one period of n_days after the same calendar day. Table 4 shows two training sets of sizes 30 and 60 days for forecasting two different days. The number of records of each dataset depends on the configuration; for example, in the first configuration the dataset with 30 days for all 370 clients contains 66,600 records \((60\times 3\times 370)\). When the day to forecast changes, the datasets are updated accordingly. The ELMs were trained with several datasets of different period sizes (n_days: 20, 30, 45, 60, 90 and 120).
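To make the composition of these training windows concrete, the date selection can be sketched as follows (Python; the function name and the exact boundary conventions are assumptions, since Table 4 defines the precise windows used):

```python
from datetime import date, timedelta

def training_days(forecast_day, n_days):
    """Return the training dates for one forecast day.

    Current year: 2*n_days immediately before the day to forecast.
    Each of the two previous years: n_days before and n_days after the
    same calendar day (leap-day handling omitted in this sketch)."""
    days = []
    # two past periods of n_days ending the day before the forecast day
    for d in range(1, 2 * n_days + 1):
        days.append(forecast_day - timedelta(days=d))
    # n_days before and after the same day in each of the two previous years
    for years_back in (1, 2):
        anchor = forecast_day.replace(year=forecast_day.year - years_back)
        for d in range(1, n_days + 1):
            days.append(anchor - timedelta(days=d))
            days.append(anchor + timedelta(days=d))
    return sorted(days)

# e.g. training_days(date(2014, 7, 15), 30) yields 180 dates (60 per year),
# which for 370 clients gives the 66,600 records mentioned above.
```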

In the third configuration, in order to account for the seasonal variation of temperature by region described in Sect. 2, both electrical data and local temperature data were used; the ELM models were trained using data from the previous year and from 2 years before (2012, 2013) and evaluated using data from 2014.

Table 4 Two training sets of different sizes for forecast two different days

As with SLFNs, the data fed to the ELM should be normalized in order to improve the forecasting results. Therefore, the load diagrams were normalized to the [0, 1] range with min-max normalization. This kind of normalization preserves the shape of the load curve and thus allows a better comparison of the consumption patterns [23].

In the Iberian Electricity Market (MIBEL), trading companies must provide the market operator with their daily load forecasting needs. Buy orders for the following day must be uploaded by 11:00 am (Portuguese local time). At that deadline, the latest information known is the consumption of the previous day. Thus, the closest known period always starts 2 days before the day to forecast, which is why this is a 2-day-ahead load forecasting problem. As a result, the most recent real consumption known is that of 2 days before the day to forecast, and it is used as input to the ELMs. The real consumption 72 h (3 days) and 168 h (7 days) before the day to forecast and a Boolean indicating whether the day is a holiday are also used as inputs. This gives a total of four input neurons. The output layer has one neuron: the consumption forecast per hour. Besides the training dataset and the input variables, it is also important to optimize the number of neurons in the ELM hidden layer: a smaller number of neurons speeds up the training step, whereas a larger number usually leads to better forecast accuracy. Several experiments with 5, 10, 20 and 75 neurons in the hidden layer were performed.
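A sketch of how the four inputs and the target of one training sample could be assembled for the configurations without temperature (in the third configuration the regional temperature would be appended as an additional input); the argument names and the hour-indexed representation are illustrative assumptions:

```python
def make_sample(load, is_holiday, t):
    """Build one (inputs, target) pair for forecast hour t.

    load: sequence of hourly normalized consumption, indexed by hour number;
    is_holiday: sequence of booleans, one per hour, flagging holidays.
    Inputs: consumption 48 h, 72 h and 168 h before hour t, plus the holiday flag."""
    x = [load[t - 48], load[t - 72], load[t - 168],
         1.0 if is_holiday[t] else 0.0]
    y = load[t]        # target: consumption at forecast hour t
    return x, y
```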

It should be noted that the methodology can also be used to create models that predict longer periods, because the methods that read from the database to create the training datasets are parameterized, which guarantees this flexibility.

3.4 Ensemble of ELMs configuration models

Load forecasting is a non-stationary process in which data are continuously generated. Since the information gathered from past samples can become inaccurate, it is necessary to keep learning as new samples become available. One possible way of doing this is to use a combination of different models, each specialized on part of the state space. In this work, diverse models are developed, each specialized on one consumption point, on one cluster or on one cluster/region subdivision, and all contribute to the ensemble.

Many ensemble schemes exist, from the simplest ones, such as the product rule, the sum rule, the min, max or median rule, simple weighting and majority voting, to more elaborate schemes that use regression, evolutionary programming or neural networks, to name just a few. The key issue in these ensemble methods is to determine the weight coefficient of each model. Given the nature of the load forecast to be supplied by an energy trader, we evaluate two ensemble schemes: the simple sum of the ELM predictions, and linear regression to deduce the weight of each individual ELM model in the final prediction. The ensemble model consists of a number of randomly initialized ELMs, each with its own parameters. Model \(\textit{ELM}_{i}\) has an associated weight \(h_{i}\), which determines its contribution to the prediction of the ensemble. Each ELM is individually trained on the training data, and the outputs of the ELMs contribute to the output of the ensemble \(\hat{y}_\textit{ens}\) as follows: \(\hat{y}_\textit{ens}(t + 1) = \sum _{i=1}^{m} h_{i}\hat{y}_{i}(t + 1)\).
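Both schemes can be sketched as follows (NumPy; P denotes a matrix with one column per individual ELM forecast over a fitting period and y the corresponding observed total load; these names are ours, not the paper's):

```python
import numpy as np

def simple_sum(P_new):
    """Simple-sum scheme: the trader's total forecast is the sum of the
    individual (point/cluster/subdivision) forecasts, i.e. h_i = 1 for all i."""
    return P_new.sum(axis=1)

def regression_weights(P, y):
    """Linear-regression scheme: weights h_i fitted by least squares so that
    the weighted sum of individual forecasts best matches the observed load y."""
    h, *_ = np.linalg.lstsq(P, y, rcond=None)
    return h

def ensemble_forecast(P_new, h):
    """Combine the individual forecasts (one column per model) with weights h."""
    return P_new @ h
```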

4 Forecasting results and discussion

Several ELM topologies were tested for all the configurations, using training data from the current year and from the two previous years, as explained before. In the second configuration, the differences among the experiments lie in the number of neurons in the ELM hidden layer and in the size of the training dataset (n_days). The best results were obtained with an ELM topology with 5 hidden neurons, and the best training dataset sizes were 30 days for cluster 1, 120 days for cluster 5 and 60 days for the other clusters.

In the first, third and fourth configurations, the differences among the experiments lie in the size of the training dataset, 1 or 2 years, and the best results were obtained with a 2-year training dataset.

To evaluate and compare the predictive capabilities of the configurations, the R-squared (\(R^{2}\)) measure is used. This metric was chosen because it is an easily interpretable statistic that is widely used in industry. As this work was developed for a trading company, whose managers are familiar with \(R^{2}\), we used this metric to communicate the results to them. Moreover, in this article we are interested in evaluating the fit of the forecasting models to past data rather than in a pure measure of forecast accuracy, for which \(R^{2}\) is adequate. In the context of predictive models, where y is the true outcome, \(\bar{y}\) is the average of the true outcomes and f is the model's prediction, \(R^{2}\) is defined by (9).

$$\begin{aligned} R^{2} =1-\frac{ \sum _{i=1}^{M} (y_{i}-f_{i})^{2}}{\sum _{i=1}^{M}(y_{i}-\bar{y})^{2}}. \end{aligned}$$
(9)

In words, \(R^{2}\) is a measure of how much of the variance in y is explained by the model f and the best possible \(R^{2}\) is 1.0.
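A direct implementation of (9) as a sanity check (NumPy sketch):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 as defined in (9)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```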

To better understand the influence of temperature and to evaluate the effect of aggregation on load forecasting, the predictions obtained with the region subdivisions, in the third and fourth configurations, as well as the individual consumption point predictions of the first configuration, were added up to obtain the corresponding cluster load forecast, so that they can be compared with the cluster predictions of the second configuration. Table 5 presents the \(R^{2}\) of the ELM models developed for all the configurations. The numbers set in bold give the best \(R^2\) value achieved for each cluster/configuration.

Table 5 \(R^{2}\) 1st, 2nd, 3rd and 4th configurations

As can be seen from the values presented in Table 5, the best results were achieved with the first three configurations, which leads to the conclusion that smaller subdivisions of the clusters without considering temperature do not lead to better predictions. Analyzing the results of the first three configurations in detail, cluster 2 obtained its best results without ambient temperature, because its points are mainly public MV/LV substations with a mix of consumption profiles. For clusters 4 and 7, the best prediction occurs with the individual consumption point predictions, because cluster 4 is a mix of factories and cluster 7 contains four cogeneration plants, whose consumption does not depend on temperature. It should be noted that the predictions of these three clusters (2, 4, 7) in the fourth configuration are also better than in the third configuration, which confirms that temperature has no influence on their load forecasting. Cluster 5 has nearly identical values in the first three configurations, which is inconclusive. For the remaining clusters (1, 3, 6), the best prediction occurs when temperature is considered, given that their activities are in some way influenced by temperature and this is reflected in their electricity consumption. Concerning the effect of aggregation, it appears to be somewhat beneficial to load forecasting, because the majority of the best predictions were achieved with the second and third configurations, both of which group the consumption points.

The final step of the methodology consists in building the ensemble of all the individual ELM model predictions obtained with each configuration. The fourth configuration was not considered for the ensemble because it was developed only to test the effect of aggregation and temperature on load forecasting. The \(R^{2}\) of the two ensemble schemes using the individual consumption points (370 models), the KmL clusters (7 models) and the cluster subdivisions with temperature (36 models) is presented in Table 6.

Table 6 \(R^{2}\) for the two ensemble schemes

The linear regression scheme is slightly better than the simple sum scheme for all configurations. This is due to the fact that the regression scheme assigns different weights to the models, depending on their forecast ability, which makes the final forecast more accurate.

Concerning the three configurations, the best results were achieved with the third, which considers 36 models with temperature, because 69.5% of the aggregate consumption in the clients' database (all clusters except 2, 4 and 7, see Table 5) is in some way influenced by temperature conditions. Another reason is the larger number of models involved in this configuration, which benefits the ensemble scheme.

Next, a simple visualization of the model is presented, namely plotting the observed and predicted values to discover areas of the data where the model performs particularly well or badly. This type of qualitative information is critical and is lost when the model is gauged only on summary statistics such as \(R^{2}\). For that purpose, the weekly average \(R^{2}\) was calculated for all weeks of 2014 (the test set). Figure 5 compares, over 7 days, the curve of the real load consumption registered, the curve of our model's prediction and a curve whose prediction equals the consumption of 2 days before, that is, a prediction equal to the latest known information, in this case 48 h before the forecast point. The latter curve corresponds to the prediction made by the company before the adoption of our methodology. As the figure shows, the curve obtained with the ELM linear regression ensemble is very similar to the real load consumption.

Fig. 5
figure 5

Best week load forecasting versus real load curve consumption

5 Conclusions and future work

In this paper, a methodology based on functional clustering and ensemble learning was presented. The experiments showed that the KmL functional clustering algorithm performs well, having correctly separated the load diagrams according to their shape, that is, their phase and amplitude. It was also shown that the linear regression ensemble scheme outperforms the sum scheme and can effectively improve the final prediction.

The several configurations tested suggest that the influence of ambient temperature on electricity consumption is related to the economic activity, and that the aggregation of load consumption curves with similar characteristics is beneficial to load forecasting.

The final prediction achieved with the methodology described here presented an \(R^{2}=0.967\) for a 2-day-ahead prediction, which is good enough for practical use. An added advantage of the methodology is its low computational cost, due to the segmentation of the search space and the very fast training speed of the ELMs, which allows the daily load forecasting to be done quickly. As future work, we intend to investigate techniques for real-time load forecasting, such as deep learning.