1 Introduction

System identification is a challenging and interesting engineering problem that has been studied extensively for decades. It consists of identifying a mathematical model that describes the behaviour of a system based on recorded input–output data [1]. In general, most real-life systems of interest are nonlinear [2], and extensive research on nonlinear system identification has been carried out since the 1980s [1, 3, 4]. In particular, one of the most popular approaches is the Nonlinear AutoRegressive with eXogenous inputs (NARX) methodology, which has proved to be a well-suited scheme for nonlinear system identification problems [1, 5]. This approach ranks a set of candidate terms based on their contribution to the output data and identifies parsimonious models that generalise well on new data. The commonly used criterion to measure the dependency between candidate model terms and the desired output is linear correlation; however, it can only identify linear dependency. Therefore, new metrics have recently been introduced to identify nonlinear dependencies, including entropy [6] and mutual information [7–10]. In particular, mutual information has been extensively used because it captures both linear and nonlinear correlations and makes no assumption on the distribution of the data [11]. Although most of this research is promising, mutual information is hard to interpret because its maximum value is not fixed and depends on the entropy of the variables involved.

Another important issue is the need to extend the deterministic notion of the NARX model to accommodate uncertainties in the parameter estimates, the identified model, and the computed predictions. Some authors have worked towards incorporating the Bayesian approach within the NARX methodology. An interesting example is the work by Baldacchino et al. [12], who developed a computational Bayesian framework for Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) models using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) procedure, an iterative sampling technique for performing inference in the context of model selection [13]. In [12], Bayesian inference is a key element in estimating not only the parameters but also the model. The results obtained are interesting; however, the main drawback is that many assumptions are made about the probability distributions of the parameters involved, and the likelihood and prior distributions are carefully selected to be conjugate, an assumption that may not always be accurate.

In this work, we address both the use of a novel metric to detect nonlinearities within the data set and the extension of the deterministic notion of the NARX model. For the first case, the distance correlation metric is implemented; this measure belongs to a new class of functions of distances between statistical observations and is able to detect all types of nonlinear or non-monotone dependencies between random vectors with finite first moment, but not necessarily of equal dimension [14, 15]. This is the first time that the distance correlation is introduced into the well-known orthogonal forward regression [16]. For the second case, the bagging method is used. Bagging consists of running an algorithm several times on different bootstrap realisations and combining the results to predict a numerical value via averaging (for regression problems) or via voting (for classification problems). The combination of these two implementations enhances the performance of a NARX model and provides interpretability of nonlinear dependencies together with an insightful uncertainty analysis. For simplicity, the discussion is restricted to polynomial models that can be expressed in a linear-in-the-parameters form.

This work is organised as follows. Section 2 gives a brief summary of nonlinear system identification, including the Orthogonal Forward Regression algorithm. Section 3 reviews the bootstrap and bagging methods. In Sect. 4, the distance correlation metric is described. Our new Bagging Forward Orthogonal Regression using distance Correlation (BFOR-dCor) algorithm is proposed in Sect. 5. Three case studies that show the effectiveness of the new algorithm are presented in Sect. 6. The work is concluded in Sect. 7.

2 Nonlinear system identification

System identification is an experimental approach that aims to identify and fit a mathematical model of a system based on experimental data that record the behaviour of the system inputs and outputs [1, 17]. Linear system identification has been used extensively in past years; however, its applicability is limited since the linearity assumption is strict and, in real life, most systems of interest are nonlinear [2]. One of the most popular approaches used to deal with nonlinear systems is the Nonlinear AutoRegressive with eXogenous inputs (NARX) methodology, which has been applied in many case studies with interesting results [1, 5, 18–21].

In general, system identification consists of three steps [17, 22]:

  1. Model structure detection

  2. Parameter estimation

  3. Model validation

Model structure detection has been studied extensively, and there is a considerable amount of information in the literature [3]. It consists of determining the model order and selecting the model terms that contribute to explaining the variation of the system output [1]. In general, most of the candidate model terms in an initially predetermined model are redundant or spurious; therefore, their contribution to the system output is negligible [23]. Furthermore, a model that includes a large number of terms tends to generalise poorly on unseen data [24]. Because of this, different methods have been developed to search for and select the significant model terms that play a major role in the identification process. Some of these methods include clustering [24, 25], the Least Absolute Shrinkage and Selection Operator (LASSO) [26, 27], elastic nets [28, 29], genetic programming [30, 31], the Orthogonal Forward Regression (OFR) using the Error Reduction Ratio (ERR) approach [19], and a recently developed multiobjective extension known as the Multiobjective ERR (MERR) [32]. Once the structure has been identified, the parameter of each model term needs to be estimated in order to test the term’s significance [33, 34]. Finally, a fundamental part of system identification is model validation. It consists of testing the identified model to check whether the parameter estimates are biased and whether the final model is an adequate representation of the recorded data set [1, 22]. For the latter, Billings and Voon [35] developed a set of statistical correlation tests that can be used for nonlinear input–output model testing and validation. In summary, system identification has to consider a trade-off between model parsimony, accuracy, and validity [36].

2.1 Orthogonal forward regression algorithm

The NARX model is a nonlinear recursive difference equation with the following general form:

$$\begin{aligned} y\left( k\right)= & {} f\Big (y\left( k-1\right) ,\ldots ,y\left( k-n_{y}\right) ,\nonumber \\&\quad u\left( k-1\right) ,\ldots ,u\left( k-n_{u}\right) \Big )+\xi \left( k\right) \end{aligned}$$
(1)

where \(f\left( \cdot \right) \) represents an unknown nonlinear mapping, \(y\left( k\right) \), \(u\left( k\right) \), and \(\xi \left( k\right) \) are the output, input, and prediction error sequences with \(k=1,2,\ldots ,N\), where \(N\) is the total number of observations, and \(n_{y}\) and \(n_{u}\) are the maximum lags for the output and input sequences, respectively [9]. For simplicity, in this work we assume that the function \(f\left( \cdot \right) \) is a polynomial model of nonlinear degree \(\ell \).
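For concreteness, the following sketch shows how such a polynomial NARX model can be put into a linear-in-the-parameters form by building a candidate-term matrix from lagged inputs and outputs. It is a minimal illustration only; the function name `build_narx_regressors` and the dense monomial enumeration are our own choices and are not prescribed by the cited works.

```python
import numpy as np
from itertools import combinations_with_replacement

def build_narx_regressors(y, u, n_y, n_u, degree):
    """Build a polynomial NARX candidate-term matrix (linear-in-the-parameters form)."""
    y, u = np.asarray(y, float), np.asarray(u, float)
    N = len(y)
    k0 = max(n_y, n_u)                                     # first usable time index
    lags = [y[k0 - i:N - i] for i in range(1, n_y + 1)]    # y(k-1), ..., y(k-n_y)
    lags += [u[k0 - i:N - i] for i in range(1, n_u + 1)]   # u(k-1), ..., u(k-n_u)
    names = [f"y(k-{i})" for i in range(1, n_y + 1)]
    names += [f"u(k-{i})" for i in range(1, n_u + 1)]
    terms, term_names = [], []
    for d in range(1, degree + 1):                         # monomials up to the chosen degree
        for combo in combinations_with_replacement(range(len(lags)), d):
            terms.append(np.prod([lags[j] for j in combo], axis=0))
            term_names.append("*".join(names[j] for j in combo))
    return np.column_stack(terms), term_names, y[k0:]      # regressors, labels, target
```

Each column of the returned matrix is one candidate term; the identification algorithms discussed below select a small subset of these columns.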

One of the most popular algorithms for the NARX identification approach is the Orthogonal Forward Regression (OFR) algorithm, also known as the Forward Orthogonal Regression (FOR) algorithm [1, 37]. It was developed in the late 1980s by Billings et al. [1]. It is a greedy algorithm [38] that belongs to the class of recursive-partitioning procedures [39]. It identifies and fits a deterministic parsimonious NARX model that can be expressed in a generalised linear regression form [4, 9]. The original OFR algorithm used the Error Reduction Ratio (ERR) index as the dependency metric [1]. The ERR of a term represents the percentage reduction in the total mean square error obtained when that term is included in the final model [6], and it is defined as the non-centralised squared correlation coefficient \(C\left( \mathbf {x},\mathbf {y}\right) \) between two associated vectors \(\mathbf {x}\) and \(\mathbf {y}\) [8]

$$\begin{aligned} C\left( \mathbf {x},\mathbf {y}\right) =\frac{\left( \mathbf {x}^{T}\mathbf {y}\right) ^{2}}{\left( \mathbf {x}^{T}\mathbf {x}\right) \left( \mathbf {y}^{T}\mathbf {y}\right) } \end{aligned}$$
(2)
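A direct evaluation of (2) is straightforward; a minimal sketch follows, where the helper name is hypothetical.

```python
import numpy as np

def squared_correlation(x, y):
    """Non-centralised squared correlation of Eq. (2), the quantity used by the ERR index."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) ** 2 / ((x @ x) * (y @ y))
```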

The non-centralised squared correlation only detects linear dependencies; therefore, new metrics have been implemented recently to identify nonlinear dependencies [6, 8, 9]. Some of these new metrics are entropy [6] and mutual information [7–10]. In particular, mutual information \(I\left( \mathbf {x},\mathbf {y}\right) \) provides a measure of the amount of information that two variables share with each other [8]. It is defined as

$$\begin{aligned} I\left( \mathbf {x},\mathbf {y}\right) =\sum _{x\in {\mathcal {X}}} \sum _{y\in {\mathcal {Y}}}p\left( x,y\right) \ln \left( \frac{p\left( x,y\right) }{p\left( x\right) p\left( y\right) }\right) \end{aligned}$$
(3)
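Equation (3) is stated for discrete distributions; in practice the probabilities must be estimated from data. A simple histogram-based estimate is sketched below; the bin count and the helper name are illustrative assumptions, not choices prescribed by the cited works.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of the mutual information in Eq. (3)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                          # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)
    nz = pxy > 0                              # avoid log(0) on empty bins
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```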

Although most of the research is promising, the mutual information is hard to interpret. Furthermore, the conventional OFR method may incorrectly select some spurious model terms due to the effect of noise, and there is still a need to extend the deterministic notion of the NARX methodology to deal with uncertainties in the parameter estimates, the identified model and the computed predictions.

In the remainder of this work, we refer to the original OFR algorithm as OFR-ERR (Orthogonal Forward Regression using Error Reduction Ratio) [1]; if the mutual information is used as the dependency metric, the algorithm is referred to as FOR-MI (Forward Orthogonal Regression using Mutual Information) [8]. These will be used later in Sect. 6 for comparison with our newly developed algorithm.

3 The bootstrap and bagging methods

The bootstrap method was developed by Bradley Efron [40]. It is a computer-based method that computes measures of accuracy for statistical estimates. Bootstrapping consists of randomly sampling \(R\) times, with replacement, from a given data set in which the observations are assumed to be independent of each other. Each of the resamples is called a bootstrap realisation and has the same length as the original data set. The bootstrap realisations can be treated as unique data sets that produce their own results when used in a specific algorithm, method, or technique. Such results contain information that can be used to make inferences from the original data set [41, 42].

The bootstrap method has been previously used for system identification of NARX models. In [43, 44], bootstrapping was used for structure detection where a backward elimination scheme was implemented to find the significant model terms. Such methodology is computationally expensive, as the bootstrap method must be applied every time a model term is eliminated. Furthermore, the methodology may not work when the lag order of the system is large. In [45], the bootstrap was used for parameter estimation of a fixed model. Although the parameter estimation is improved, by fixing the model there is no guarantee that the bootstrapped data come from the true model. The main drawback of these previous works is that the model structure needs to be correct for bootstrap to work [45].

In this work, the bootstrap method is applied in a different way, based on [41]. Considering that observations at a given time may depend on previously measured observations, the data set is split into overlapping blocks of fixed length \(B\). The first and last observations appear in fewer blocks than the rest; therefore, the data set is wrapped around a circle so that all data points participate equally [42]. The blocks are then sampled with replacement until a new data set is created with the same length as the original one. This methodology is known as the moving blocks bootstrap for time series [41], and it is illustrated in Fig. 1. By sampling the blocks, the correlation present in observations less than \(B\) units apart is preserved. This methodology is less “model dependent” than bootstrapping the residuals [41]. The choice of \(B\) is important: if it is too small, the correlation within the observations may be lost; if it is too large, there is little distinction between the original data set and the bootstrap realisations. Effective methods for choosing \(B\) are still being investigated. In the remainder of this work, we assume that \(B\) is known beforehand.
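A minimal sketch of one such circular moving-blocks resample is given below, assuming a single input and output signal of equal length; the function name and the use of numpy's default random generator are our own choices.

```python
import numpy as np

def moving_blocks_bootstrap(y, u, B, rng):
    """One circular moving-blocks bootstrap realisation of the paired signals (y, u)."""
    y, u = np.asarray(y), np.asarray(u)
    N = len(y)
    idx = np.arange(N)
    n_blocks = int(np.ceil(N / B))                         # enough blocks to cover N points
    starts = rng.integers(0, N, size=n_blocks)             # block starts, sampled with replacement
    blocks = [np.take(idx, np.arange(s, s + B), mode="wrap") for s in starts]   # wrap around the circle
    take = np.concatenate(blocks)[:N]                      # same length as the original data
    return y[take], u[take]
```

For example, `rng = np.random.default_rng(0)` followed by `yb, ub = moving_blocks_bootstrap(y, u, B=5, rng=rng)` produces one realisation of the same length as the original record.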

Fig. 1
figure 1

Schematic of the moving blocks bootstrap for time series methodology. The upper line corresponds to the original time series. The lower line corresponds to a bootstrap realisation generated by choosing a block length \(B=3\)

The bootstrap technique can be extended to what is nowadays a very popular approach. Assume that a total of \(R\) bootstrap realisations have been generated and each of them has been used in a specific algorithm to produce a result of its own. The \(R\) outputs can then all be used to predict a numerical value via averaging (for regression problems) or via voting (for classification problems). This procedure is known as bagging (which stands for bootstrap aggregating) and was proposed by Leo Breiman [46].
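In the regression setting used in this paper, the aggregation step is simply an average over the \(R\) bootstrap outputs; a sketch with a hypothetical helper name is:

```python
import numpy as np

def bagged_prediction(predictions):
    """Aggregate R bootstrap predictions by averaging (regression setting);
    for classification one would take a majority vote instead."""
    predictions = np.asarray(predictions)     # shape (R, N_test)
    return predictions.mean(axis=0), predictions.std(axis=0)
```

The per-point spread of the \(R\) predictions is one way to obtain uncertainty bands such as the 2 SD regions shown later in the case studies.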

4 The distance correlation

The distance correlation was recently developed by Székely et al. [14]. It is a measure that belongs to a new class of functions of distances between statistical observations [15]. Distance correlation, denoted as \(dCor\left( \mathbf {x},\mathbf {y}\right) \), provides a new approach to measure all types of nonlinear or non-monotone dependencies between two random vectors with finite first moment, but not necessarily of equal dimension.

The distance correlation has the following properties [14, 15]:

  (i) \(0\le dCor\left( \mathbf {x},\mathbf {y}\right) \le 1\)

  (ii) If \(dCor\left( \mathbf {x},\mathbf {y}\right) =1\), then the dimensions of the linear subspaces spanned by \(\mathbf {x}\) and \(\mathbf {y}\) are almost surely equal. Therefore, there exists a vector \(\mathbf {a}\), a nonzero real number \(b\) and an orthogonal matrix \(\mathbf {C}\) such that \(\mathbf {y}=\mathbf {a}+b\mathbf {C}\mathbf {x}\).

The distance correlation is analogous to the Pearson product-moment correlation coefficient \(\rho \). However, Pearson’s coefficient only characterises linear dependency between two variables, while the distance correlation is a more general measure that characterises independence of random variables [15]. The procedure to compute this metric is given in [14, 15].
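For completeness, a direct \(O(N^{2})\) implementation of the sample distance correlation, following the double-centring construction of [14, 15], can be sketched as follows; the function name is ours, and vectors of different dimension are handled by reshaping each sample into a row.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation computed from double-centred pairwise distance matrices."""
    x = np.asarray(x, float).reshape(len(x), -1)                 # one observation per row
    y = np.asarray(y, float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # pairwise distances |x_j - x_k|
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()            # double centring
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)                             # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())             # product of squared distance variances
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0
```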

As a simple comparison, Fig. 2 displays three distinct noisy data sets. These have been created using a linear (\(y=x\)), sinusoidal (\(y=\sin \left( x+\frac{\pi }{2}\right) \)), and circular (\(x^{2}+y^{2}=1\)) relationship with additive white noise. Each panel shows the respective values of the Pearson product-moment correlation coefficient, mutual information, and distance correlation. The Pearson coefficient detects a linear dependency in the first data set but, as expected, finds no dependency in the other cases. The mutual information provides better insight into each of the data sets, but its value is difficult to interpret because its maximum is not fixed and depends on the entropy of each of the variables involved. Finally, the distance correlation is able to detect dependencies in all cases. Moreover, the distance correlation is not as strict as the Pearson coefficient, and its fixed range between 0 and 1 is an important characteristic that plays a key role in our new algorithm when determining significant terms. One drawback of the distance correlation metric is its computation time, since it can take about three times longer to compute than the Pearson coefficient or the mutual information.

Fig. 2
figure 2

Three distinct noisy data sets displaying a a linear, b sinusoidal, and c circular dependency. In each case, the Pearson product-moment correlation coefficient (\(\rho \)), mutual information (MI), and distance correlation (dCor) are computed

5 The new BFOR-dCor algorithm

The bagging method and distance correlation are combined with the OFR algorithm to produce the Bagging Forward Orthogonal Regression using distance Correlation (BFOR-dCor) algorithm. This is the first time that the distance correlation metric is introduced and incorporated into the well-known orthogonal forward regression [16]. The algorithm is divided into two parts. Algorithm 1 describes the Forward Orthogonal Regression algorithm using the distance correlation dependency metric. In contrast to the original algorithm developed by Billings et al. [1], which requires a threshold on the Error-to-Signal Ratio (ESR), here the user needs to specify the maximum number of terms \(n_{max}\) that the algorithm will look for [23]. In this algorithm, lines 1–4 search for the candidate term that has the most significant influence on the system output based on the distance correlation metric. Once it is found, lines 5–8 create an orthogonal projection of \(\mathbf {y}\) with respect to \(\mathbf {q}_{1}\) using the modified Gram–Schmidt process. This orthogonalisation sequence is repeated in lines 11–25 until the maximum number of terms \(n_{max}\) specified by the user is reached. To avoid redundant candidate terms, we introduced lines 14–16, which check the squared 2-norm of a candidate term and simply remove the term if it is less than \(10^{-10}\). Following [23], we introduced the concept of Leave-One-Out Cross Validation (LOOCV) in order to prevent under- and over-fitting. Every time a new model term is added, the LOOCV statistic is computed with its standard error (SE) using the following equations:

$$\begin{aligned}&{\text{ LOOCV }}=\frac{1}{N}\sum _{i=1}^{N}\left( \frac{e_{i}}{1-h_{i}}\right) ^{2} \end{aligned}$$
(4)
$$\begin{aligned}&{\text{ SE }}=\sqrt{\frac{1}{N}\underset{i\in \left\{ 1,\ldots ,N\right\} }{\mathrm {Var}}\left[ \left( \frac{e_{i}}{1-h_{i}}\right) ^{2}\right] } \end{aligned}$$
(5)

where \(e_{i}\) is the residual obtained from fitting the model to all \(N\) observations using the candidate terms selected up to iteration \(s\), and \(h_{i}\) are the diagonal values of the influence matrix of the fitted model [47]. Once the maximum number of terms \(n_{max}\) is reached, the most parsimonious model with \(n\le n_{max}\) terms is selected in line 26 using the 1 SE rule [48], i.e. the simplest model for which the LOOCV is within 1 SE of the minimum LOOCV is selected. Finally, the parameters \(\varvec{\theta }\) are computed in line 27, and the algorithm returns them together with the selected significant terms. The parameter \(n_{max}\) can be chosen heuristically, by running Algorithm 1 a couple of times and inspecting the resulting LOOCV curve.
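A sketch of how (4), (5), and the 1 SE rule can be evaluated for a linear-in-the-parameters model is given below; the regressor matrix `P` holds the terms selected so far as columns, and the function names are our own.

```python
import numpy as np

def loocv_and_se(P, y):
    """LOOCV statistic (4) and its standard error (5) for the model y = P @ theta."""
    theta, *_ = np.linalg.lstsq(P, y, rcond=None)
    e = y - P @ theta                                    # residuals e_i
    h = np.diag(P @ np.linalg.pinv(P.T @ P) @ P.T)       # diagonal of the influence (hat) matrix
    r2 = (e / (1.0 - h)) ** 2
    return r2.mean(), np.sqrt(r2.var() / len(y))         # Eqs. (4) and (5)

def one_se_rule(loocv, se):
    """Index of the simplest model whose LOOCV is within one SE of the minimum LOOCV."""
    loocv, se = np.asarray(loocv), np.asarray(se)
    best = int(np.argmin(loocv))
    return int(np.argmax(loocv <= loocv[best] + se[best]))   # first (smallest) qualifying model
```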

Algorithm 2 describes the new BFOR-dCor algorithm. Here, Algorithm 1 is run \(R\) times, each time with a different bootstrap realisation taken from the original input and output signals. Every time a bootstrap realisation is used, the identified model is recorded in a table. After all \(R\) bootstrap realisations have been processed, the table is summarised to identify the different model structures that were found, and each of them is assigned a value equal to the number of times it was selected among the \(R\) bootstrap realisations.

The BFOR-dCor algorithm is a new method, applied here for the first time to nonlinear model selection. It improves on the conventional OFR algorithm by introducing a voting mechanism that aims to find the correct model terms within noisy data.
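A compact sketch of the outer bagging loop of the method follows. The callables `for_dcor` (standing in for Algorithm 1) and `moving_blocks_bootstrap` (the resampling of Sect. 3) are placeholders for the routines described earlier; the structure key and the use of a Counter are our own implementation choices.

```python
from collections import Counter

def bfor_dcor(y, u, R, B, n_max, rng, for_dcor, moving_blocks_bootstrap):
    """Outer loop of the BFOR-dCor procedure: run Algorithm 1 on R block-bootstrap
    realisations and vote on the selected model structures."""
    votes, estimates = Counter(), {}
    for _ in range(R):
        yb, ub = moving_blocks_bootstrap(y, u, B, rng)   # one bootstrap realisation
        terms, theta = for_dcor(yb, ub, n_max)           # Algorithm 1 on that realisation
        key = tuple(sorted(terms))                       # model structure, order-independent
        votes[key] += 1
        estimates.setdefault(key, []).append(theta)      # parameter estimates per structure
    return votes.most_common(), estimates                # structures ranked by their votes
```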

Algorithm 1 (figure a): Forward Orthogonal Regression using the distance correlation metric
Algorithm 2 (figure b): Bagging Forward Orthogonal Regression using distance Correlation (BFOR-dCor)

6 Case studies

In this section, several examples are provided to illustrate the effectiveness of the BFOR-dCor algorithm. First, the new method is compared with both the traditional OFR-ERR and the more recent FOR-MI algorithms. Second, the BFOR-dCor technique is applied to a test model from [12], where the RJMCMC algorithm was applied. Finally, the BFOR-dCor algorithm is applied to the sunspot data provided by the Solar Influences Data Center (SIDC), RWC Belgium, World Data Center for the Sunspot Index, Royal Observatory of Belgium [49]. The data consist of the annual number of sunspots from 1700 to 2013.

6.1 Comparison with OFR-ERR and FOR-MI

The following model was taken from [9]:

$$\begin{aligned} y\left( t\right) =&-0.5y\left( t-2\right) +0.7y\left( t-1\right) u\left( t-1\right) \nonumber \\&+0.6u^{2}\left( t-2\right) +0.2y^{3}\left( t-1\right) \nonumber \\&-0.7y\left( t-2\right) u^{2}\left( t-2\right) +e\left( t\right) \end{aligned}$$
(6)

where the input \(u\left( t\right) \sim \mathcal {U}\left( -1,1\right) \), that is, \(u\left( t\right) \) is uniformly distributed over \(\left[ -1,1\right] \), and the error \(e\left( t\right) \sim \mathcal {N}\left( 0,0.02^{2}\right) \). Following [9], the maximum lags for the input and output are chosen to be \(n_{u}=n_{y}=4\) and the nonlinear degree is \(\ell =3\). The stop criterion for the OFR-ERR and FOR-MI algorithms is an ESR of less than 0.05. A total of 500 input–output data points were generated, and the same random seed is used to ensure a fair comparison. The results for the OFR-ERR algorithm are shown in Table 1 and Fig. 3. All the model terms selected are correct except for the first one. Likewise, the results for the FOR-MI algorithm are displayed in Table 2 and Fig. 4. The four model terms selected are correct, but the algorithm failed to find one of the five required terms. From Tables 1 and 2, both algorithms failed to select all the true model terms in (6). It is interesting to notice that, except for the spurious term found by the OFR-ERR algorithm, the union of the model terms found by the OFR-ERR and FOR-MI algorithms equals the set of true model terms. As explained in [9], the OFR-ERR and FOR-MI algorithms can be used simultaneously to select the model terms based on t-tests; however, this example shows that the selection is still hard to perform, as all the terms selected by both methods are statistically significant.
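For reproducibility, the data-generation step for (6) can be sketched as follows; the random seed is an assumption made only so that the example runs deterministically, not the seed used in the experiments.

```python
import numpy as np

def simulate_model_6(N=500, seed=0):
    """Generate input-output data from model (6)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, N)             # u(t) ~ U(-1, 1)
    e = rng.normal(0.0, 0.02, N)              # e(t) ~ N(0, 0.02^2)
    y = np.zeros(N)
    for t in range(2, N):
        y[t] = (-0.5 * y[t - 2] + 0.7 * y[t - 1] * u[t - 1]
                + 0.6 * u[t - 2] ** 2 + 0.2 * y[t - 1] ** 3
                - 0.7 * y[t - 2] * u[t - 2] ** 2 + e[t])
    return u, y
```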

Fig. 3
figure 3

Model terms selected for (6) by the OFR-ERR algorithm with their corresponding ERR and the updated sum of ERR (SERR)

Fig. 4
figure 4

Model terms selected for (6) by the FOR-MI algorithm with the updated ESR

Table 1 Identified model for (6) using the OFR-ERR algorithm
Table 2 Identified model for (6) using the FOR-MI algorithm

The BFOR-dCor algorithm is applied to model (6) using a total of \(R=1000\) bootstrap realisations and a block length \(B=5\). The maximum number of terms to look for is \(n_{max}=10\). In Table 3, the three top model structures obtained by the BFOR-dCor algorithm are shown. These three model structures correspond to 96.5 % of the bootstrap realisations. The most-voted model structure coincides with that of the true model (6), something that was not achieved with the OFR-ERR and FOR-MI algorithms.

Table 3 Three top model structures identified for (6) using the BFOR-dCor algorithm

For the 924 realisations that have the most-voted model structure, Fig. 5 shows the beanplots [50] for each of the parameter estimates, which clearly suggest that each parameter bootstrap distribution is not Gaussian. Furthermore, Table 4 shows a statistical summary of the parameter estimates. It is interesting to notice that all but one of the true values are within 2 standard deviations (SD) of the mean. The exception is the \(y^{3}\left( t-1\right) \) term. A frequency analysis may reveal an insightful understanding of the contribution of this term.

Fig. 5
figure 5

Beanplots for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for identification of (6), where the red dotted line represents the parameter true value while the black solid line represents the parameter mean estimated value. (Colour figure online)

Table 4 Statistical summary for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for identification of (6)

The results presented here show that the BFOR-dCor algorithm is able to identify 924 realisations with the true model structure together with a bootstrap distribution of the parameter estimates. Furthermore, having several models with the same structure is beneficial for the forecasting task, since all the models, or a sample of them, can be used to compute an average prediction with the corresponding SD.

6.2 Comparison with RJMCMC algorithm

The following model was taken from [12]:

$$\begin{aligned} y\left( t\right) =&-0.5y\left( t-2\right) +0.7y\left( t-1\right) u\left( t-1\right) \nonumber \\&+0.6u^{2}\left( t-2\right) -0.7y\left( t-2\right) u^{2}\left( t-2\right) \nonumber \\&+e\left( t\right) \end{aligned}$$
(7)

In [12], the authors developed a computational Bayesian identification framework for NARMAX models that uses the RJMCMC algorithm to perform structure detection and parameter estimation together with a characterisation of the probability distribution over models. The algorithm is stochastic in nature, which encourages a global search over the model term space while ensuring that the identified model is parsimonious [12, 13]. In their work, the algorithm is executed 10 times on the same input–output data and recovers the true model structure in 7 of the 10 runs. The main drawbacks of this method are that it is computationally expensive and that it needs different probability distributions to be defined for the parameters involved. Most of these distributions are chosen to be conjugate priors to ease the computations, but of course this does not mean that such distributions are faithful to the real unknown distributions.

The BFOR-dCor algorithm requires no assumptions about probability distributions, and it can work extremely well once the basic parameters are defined. Here again the maximum lags for the input and output are \(n_{u}=n_{y}=4\) and the nonlinear degree is \(\ell =3\), exactly the same values as in [12]. A total of 500 input–output data points were generated. The BFOR-dCor algorithm is applied to (7) using a total of \(R=1000\) bootstrap realisations, a block length \(B=5\), and a maximum number of terms \(n_{max}=10\). In Table 5, the three top model structures obtained by the BFOR-dCor algorithm are shown. These three model structures correspond to 88.1 % of the bootstrap realisations. The most-voted model structure coincides with that of the true model (7).

Table 5 Three top model structures identified for (7) using the BFOR-dCor algorithm

Figure 6 shows the beanplots for each of the parameter estimates, which suggest that each parameter may be treated as a Gaussian random variable. Likewise, Table 6 shows a statistical summary of the parameter estimates. It is interesting to notice that all the true values are within 2 SD of the mean.

Fig. 6
figure 6

Beanplots for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for identification of (7), where the red dotted line represents the parameter true value while the black solid line represents the parameter mean estimated value. (Colour figure online)

Table 6 Statistical summary for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for identification of (7)

These results show that the BFOR-dCor algorithm is extremely efficient and works well without the need for assumptions about probability distributions.

6.3 Forecasting the annual sunspot number

The sunspot time series provided by the Solar Influences Data Center (SIDC), RWC Belgium, World Data Center for the Sunspot Index, Royal Observatory of Belgium [49] consists of 314 observations of the annual number of sunspots from 1700 to 2013. The data from 1700 to 1950 are used for structure detection and parameter estimation, while the data from 1951 to 2013 are used for model performance testing and validation. It is assumed that the annual number of sunspots depends only on previous annual observations, i.e. \(n_{u}=0\). Furthermore, it is well known that the sun’s north and south poles reverse approximately every 11 years, which corresponds to a period of great solar activity known as the solar max [51]. Therefore, we choose \(n_{y}=12\) and employ a Nonlinear AutoRegressive (NAR) model with nonlinear degree \(\ell =3\) to test the performance of the proposed BFOR-dCor algorithm.

The BFOR-dCor algorithm is applied using a total of \(R=1000\) bootstrap realisations, a block length \(B=15\), and the maximum number of terms is \(n_{max}=15\). The five top model structures obtained by the BFOR-dCor algorithm are shown in Table 7, which correspond to 7.2 % of the bootstrap realisations.

Table 7 Five top model structures identified for the sunspot time series using the BFOR-dCor algorithm

For the 30 realisations that have the most-voted model structure, Fig. 7 shows the beanplots for each of the parameter estimates, which clearly suggest that most of the bootstrap parameter distributions are not Gaussian. Furthermore, Table 8 shows a statistical summary of the parameter estimates. Figures 8 and 9 show the one-step ahead and model predicted outputs together with the 2 SD region, respectively. From these two graphs, we can see that a simple NAR model has successfully captured the general trend of the sunspot behaviour. The root-mean-square error (RMSE) for the one-step ahead predicted output is 19.39716, while the RMSE for the model predicted output is 28.77858.
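The distinction between the two prediction modes is worth making explicit: the one-step ahead output uses measured lagged values at every step, whereas the model predicted output feeds its own past predictions back into the model. A minimal sketch, with `predict_one_step` a hypothetical callable that evaluates the identified NAR model, is:

```python
import numpy as np

def osa_and_mpo(y, predict_one_step, n_y):
    """One-step ahead (OSA) and model predicted output (MPO) for a NAR model.
    predict_one_step(lags) returns y(k) given [y(k-1), ..., y(k-n_y)]."""
    N = len(y)
    osa = np.full(N, np.nan)
    mpo = np.asarray(y, dtype=float).copy()              # MPO is seeded with measured initial conditions
    for k in range(n_y, N):
        osa[k] = predict_one_step(y[k - n_y:k][::-1])    # measured lagged outputs
        mpo[k] = predict_one_step(mpo[k - n_y:k][::-1])  # previously predicted lagged outputs
    rmse = lambda yhat: float(np.sqrt(np.mean((yhat[n_y:] - y[n_y:]) ** 2)))
    return osa, mpo, rmse(osa), rmse(mpo)
```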

Fig. 7
figure 7

Beanplots for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for forecasting the annual sunspot number, where the black solid line represents the parameter mean estimated value. (Colour figure online)

Fig. 8
figure 8

One-step ahead predicted output for the sunspot time series using the most-voted model structure identified by the BFOR-dCor algorithm, where the black solid line with circles indicates the true measurements, the empty blue circles represent the one-step ahead predicted output, and the blue shadow represents the 2 SD region. (Colour figure online)

Fig. 9
figure 9

Model predicted output for the sunspot time series using the most-voted model structure identified by the BFOR-dCor algorithm, where the black solid line with circles indicates the true measurements, the green diamonds represent the model predicted output, and the green shadow represents the 2 SD region. (Colour figure online)

Table 8 Statistical summary for the parameter estimates of the model terms identified in the most-voted model structure using the BFOR-dCor algorithm for forecasting the annual sunspot number

In [52], Billings and Tao developed a set of tests that are effective for time series model validation:

$$\begin{aligned} {\left\{ \begin{array}{ll} \phi _{\xi '\xi '}\left( \tau \right) =\delta \left( \tau \right) &{} \forall \tau \\ \phi _{\xi '\left( \xi ^{2}\right) '}\left( \tau \right) =0 &{} \forall \tau \\ \phi _{\left( \xi ^{2}\right) '\left( \xi ^{2}\right) '}\left( \tau \right) =\delta \left( \tau \right) &{} \forall \tau \end{array}\right. } \end{aligned}$$
(8)

where \(\xi \left( k\right) =\xi _{k}\) is the prediction error sequence with \(k=1,2,\ldots ,N\), \(\xi '_{k}=\xi _{k}-\overline{\xi }\) and \(\left( \xi ^{2}\right) '_{k}=\xi _{k}^{2}-\overline{\xi ^{2}}\). Fig. 10 shows the statistical correlation tests for the one-step ahead predicted output of the most-voted NAR model identified by the BFOR-dCor algorithm. It can be seen that the second and third tests, i.e. \(\phi _{\xi '\left( \xi ^{2}\right) '}\left( \tau \right) =0\) and \(\phi _{\left( \xi ^{2}\right) '\left( \xi ^{2}\right) '} \left( \tau \right) =\delta \left( \tau \right) \;\forall \,\tau \), are not ideally satisfied, suggesting that autoregressive models may not be sufficient to fully characterise the entire dynamics of the process. Nevertheless, the results obtained by the BFOR-dCor algorithm are still remarkable given the complexity of the system.
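A sketch of how the tests in (8) can be evaluated for a residual sequence, with the usual 95 % confidence limits at \(\pm 1.96/\sqrt{N}\), is given below; the function name is ours.

```python
import numpy as np

def validation_tests(xi, max_lag=20):
    """Normalised correlation tests of Eq. (8) for a residual sequence xi."""
    xi = np.asarray(xi, float)
    N = len(xi)
    a = xi - xi.mean()                        # xi'
    b = xi ** 2 - np.mean(xi ** 2)            # (xi^2)'
    def xcorr(p, q):                          # normalised cross-correlation for lags 0..max_lag
        denom = np.sqrt(np.sum(p ** 2) * np.sum(q ** 2))
        return np.array([np.sum(p[:N - tau] * q[tau:]) / denom
                         for tau in range(max_lag + 1)])
    return {"phi_xi'xi'": xcorr(a, a),
            "phi_xi'(xi^2)'": xcorr(a, b),
            "phi_(xi^2)'(xi^2)'": xcorr(b, b),
            "95% limit": 1.96 / np.sqrt(N)}
```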

Fig. 10
figure 10

Statistical correlation tests (8), with 95 % confidence limits, for the one-step ahead predicted output of the most-voted NAR model identified for the sunspot time series using the BFOR-dCor algorithm

7 Conclusion

A new algorithm for model structure detection and parameter estimation has been developed. This new algorithm combines two concepts that enhance the performance of the original OFR algorithm. First, the distance correlation metric is used, which measures all types of nonlinear or non-monotone dependencies between random vectors. Second, the bagging method is implemented, which produces a different model for each resample of the original data set. The identified models, or a subset of them, can be used together to generate improved predictions via averaging (for regression problems) or via voting (for classification problems). A main advantage of combining these concepts in the new BFOR-dCor algorithm is that it provides interpretability of nonlinear dependencies and an insightful uncertainty analysis. The algorithm can be slow, since the distance correlation is computationally more expensive than other metrics; nevertheless, it produces results that outperform its counterparts and, unlike the RJMCMC algorithm, requires no assumptions about probability distributions. All of this has been demonstrated through numerical case studies.