
1 Introduction

One of the most effective and well-established practical uses of time series analysis is the prediction of future values from past information [1]. However, much of the related statistical analysis is carried out under linear assumptions which, outside trivial cases and ad hoc laboratory-controlled experiments, hardly ever reflect the features of real-world data generating processes (DGPs). The proposed forecast procedure has been designed to account for the complex, possibly non-linear dynamics one may encounter in practice, and combines the wavelet multi-resolution decomposition approach with a non-standard, highly computer-intensive statistical method: artificial neural networks (ANN). The acronym chosen for it, MUNI, short for Multi-resolution Neural Intelligence, reflects both these aspects. In more detail, the MUNI procedure is based on the reconstruction of the original time series after its decomposition, performed through an algorithm based on the inverse of a wavelet transform, called Multi-resolution Approximation (MRA) [24]. In practice, the original time series is decomposed into more elementary components (first step), each of them representing an input variable for the predictive model, and as such individually predicted and finally recombined through the inverse MRA procedure (second step). The predictions themselves are generated by the time-domain, Artificial Intelligence (AI) part of the method, which exploits an algorithm belonging to the class of parallel distributed processing, i.e., ANN [5, 6], with an autoregressive input structure.

1.1 Signal Decomposing and Prediction Procedures

In what follows the time series (signal) of interest is assumed to be real-valued, uniformly sampled and of finite length T, i.e.: \(x_{t}:= \left \{(x_{t})_{t\in \mathbb{Z}^{+}}^{T}\right \}\). MUNI has been implemented with a wavelet [79] signal-coefficient transformation procedure of the type Maximal Overlap Discrete Wavelet Transform (MODWT) [10], a filtering approach which modifies the observed series \(\left \{x\right \}_{t\in \mathbb{Z}^{+}}\) by artificially extending it, so that the unobserved samples \(\left \{x\right \}_{t\in \mathbb{Z}^{-}}\) are assigned the observed values \(x_{T-1}, x_{T-2},\ldots,x_{0}\). This method treats the series as if it were periodic and is known as the use of circular boundary conditions; the wavelet and scaling coefficients are respectively given by:

$$\displaystyle{ d_{j,t} =\sum _{l=0}^{L_{j}-1}\tilde{h}_{j,l}\,x_{t-l\ \mathrm{mod}\ T},\qquad s_{J,t} =\sum _{l=0}^{L_{J}-1}\tilde{g}_{J,l}\,x_{t-l\ \mathrm{mod}\ T}, }$$

with \(\left \{\tilde{h}_{j,l}\right \}\) and \(\left \{\tilde{g}_{j,l}\right \}\) denoting the level-j wavelet and scaling filters of length \(L_{j}\), obtained by rescaling their Discrete Wavelet Transform counterparts, i.e., \(\left \{h_{j,l}\right \}\) and \(\left \{g_{j,l}\right \}\), as follows: \(\tilde{h}_{j,l} = \frac{h_{j,l}} {2^{\,j/2}}\) and \(\tilde{g}_{j,l} = \frac{g_{j,l}} {2^{\,j/2}}\). Here, the sequences of coefficients \(\left \{h_{j,l}\right \}\) and \(\left \{g_{j,l}\right \}\) are approximate filters: the former of the band-pass type, with nominal pass-band \(f \in [ \frac{1} {4\mu _{j}}, \frac{1} {2\mu _{j}}]\), and the latter of the low-pass type, with nominal pass-band \(f \in [0, \frac{1} {4\mu _{j}}]\), with \(\mu _{j}\) denoting the scale. Considering all the \(J = J^{\mathrm{max}}\) sustainable scales, the MRA wavelet representation of \(x_{t}\), in the \(L^{2}(\mathbb{R})\) space, can be expressed as follows:

$$\displaystyle\begin{array}{rcl} x(t)& =& \sum _{k}s_{J,k}\phi _{J,k}(t) +\sum _{k}d_{J,k}\psi _{J,k}(t) +\sum _{k}d_{J-1,k}\psi _{J-1,k}(t) + \cdots \\ & & +\sum _{k}d_{j,k}\psi _{j,k}(t) + \cdots +\sum _{k}d_{1,k}\psi _{1,k}(t), {}\end{array}$$
(1)

with k taking integer values from 1 to the length of the vector of wavelet coefficients related to the component j, and ψ and ϕ denoting, respectively, the mother and father wavelets (see, for example, [11, 12]). Assuming that a number \(J_{0} \leq J^{\mathrm{max}}\) of scales is selected, the MRA is expressed as \(x_{t} =\sum _{ j=1}^{J_{0}}D_{j} + S_{J_{ 0}}\), with \(D_{j} =\sum _{k}d_{j,k}\psi _{j,k}(t)\) and \(S_{J_{0}} =\sum _{k}s_{J_{0},k}\phi _{J_{0},k}(t)\), \(j = 1,2,\ldots,J_{0}\). Each sequence of coefficients \(d_{j}\) (in signal processing called a crystal) represents the original signal at a given resolution level, so that the MRA conducted at a given level j (\(j = 1,2,\ldots,J_{0}\)) delivers the coefficient set \(D_{j}\), which reflects the signal's local variations at the detailing level j, and the set \(S_{J_{0}}\), accounting for the long-run variations. By adding more levels \(\left \{d_{j};\ j = 1,2,\ldots,J^{\mathrm{max}}\right \}\), finer resolution levels are involved in the reconstruction of the original signal and the approximation becomes closer and closer, until the loss of information becomes negligible. The forecasted values are generated by aggregating the predictions obtained separately for each of the wavelet components, once they are transformed via the inverse MODWT algorithm, i.e.: \(\hat{x}_{t}(h) =\sum _{ j=1}^{J_{0}}\hat{D}_{j}^{\mathrm{inv}}(h) +\hat{ S}_{J_{ 0}}^{\mathrm{inv}}(h)\), where D and S are as defined above and the superscript inv indicates the inverse MODWT transform. In total, four choices are required for a proper implementation of MODWT: the boundary conditions, the type of wavelet filter, its width parameter L, and the number of decomposition levels. Regarding the first choice, MUNI has been implemented with periodic boundary conditions; however, alternatives can be evaluated on the basis of the characteristics of the time series and/or as part of a preliminary investigation. The choices related to the type of wavelet function and its length L are generally hard to automatize, therefore their inclusion in MUNI has not been pursued. More simply, MUNI has been implemented with the fourth-order Daubechies least asymmetric wavelet filter (also known as symmlet) [8] of length L = 8, usually denoted LA(8).

Regarding the forecasting method, MUNI uses a neural topology belonging to the family of multilayer perceptrons [13, 14], of the feed-forward type (FFWD-ANN) [15]. This is a popular choice in computational intelligence for its ability to perform in virtually any functional mapping problem, including autoregressive structures. This network represents the non-linear function mapping past observations \(\left \{x_{t-\tau };\ (\tau = 1,2,\ldots,T - 1)\right \}\) to future values \(\left \{x_{h};\ (h = T,T + 1,\ldots )\right \}\), i.e.: \(x_{t} =\sigma _{nn}\big(x_{t-1},x_{t-2},\ldots,x_{t-p},\mathbf{w}\big) + u_{t}\), with p the maximum autoregressive lag, \(\sigma _{nn}(\cdot)\) the activation function defined over the inputs, w the network parameters, and \(u_{t}\) the error term. In practice, the input–output relationship is learnt by linking, via acyclic connections, the output \(x_{t}\) to its lagged values, which constitute the network input, through a set of hidden layers. While the number of hidden layers is usually very low (often 1 or 2), the choice of the input set is critical, since the inclusion of non-significant lags and/or the exclusion of significant ones can affect the quality of the outcomes.
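To make the decomposition and reconstruction steps concrete, the following is a minimal sketch of a MODWT with circular boundary conditions and of its inverse, written in Python/NumPy. It is illustrative only and not MUNI's actual implementation: for brevity it uses the rescaled Haar filters instead of the LA(8) filter adopted by MUNI (the LA(8) coefficients, e.g. those shipped with PyWavelets under the name 'sym4', could be substituted), and all function and variable names are ours.

```python
import numpy as np

def modwt(x, h, g, J):
    """Forward MODWT (pyramid algorithm, circular boundary conditions).
    h, g: rescaled wavelet/scaling filters (DWT filters divided by sqrt(2)).
    Returns the wavelet crystals W[0..J-1] and the level-J scaling coefficients V."""
    N, L = len(x), len(h)
    V = np.asarray(x, dtype=float)
    W = []
    for j in range(1, J + 1):
        Wj, Vj = np.zeros(N), np.zeros(N)
        for t in range(N):
            for l in range(L):
                k = (t - (2 ** (j - 1)) * l) % N   # circular ("mod T") indexing
                Wj[t] += h[l] * V[k]
                Vj[t] += g[l] * V[k]
        W.append(Wj)
        V = Vj
    return W, V

def imodwt(W, V, h, g):
    """Inverse MODWT: recombines the crystals into the original signal."""
    N, L = len(V), len(h)
    for j in range(len(W), 0, -1):
        Vprev = np.zeros(N)
        for t in range(N):
            for l in range(L):
                k = (t + (2 ** (j - 1)) * l) % N
                Vprev[t] += h[l] * W[j - 1][k] + g[l] * V[k]
        V = Vprev
    return V

# rescaled Haar filters; the LA(8) filter used by MUNI could be plugged in here
g = np.array([0.5, 0.5])     # scaling (low-pass) filter
h = np.array([0.5, -0.5])    # wavelet (band-pass) filter

x = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.1 * np.random.randn(128)
W, V = modwt(x, h, g, J=3)
print(np.allclose(imodwt(W, V, h, g), x))   # exact reconstruction: True
```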

1.1.1 The Learning Algorithm and the Regularization Parameter η

MUNI envisions the time series at hand split into three different, non-overlapping parts, serving respectively as training, validation, and test sets. The training set is the sequence \(\left \{(\mathbf{x}_{1},\mathbf{q}_{1}),\ldots,(\mathbf{x}_{p},\mathbf{q}_{p})\right \}\), in the form of p ordered pairs of n- and m-dimensional vectors, where \(\mathbf{q}_{i}\) denotes the target value and \(\mathbf{x}_{i}\) the matrix of the delayed time series values. The network, usually initialized with random weights, is presented with an input pattern and an output, say \(\mathbf{o}_{i}\), is generated as a result. Being in general \(\mathbf{o}_{i}\neq \mathbf{q}_{i}\), the learning algorithm tries to find the optimal weight vector minimizing the error function in the w-space, that is: \(\mathbf{o}_{p} = f_{nn}(\mathbf{w},\mathbf{x}_{p})\), where \(f_{nn}\) denotes the network mapping and the weight vector w connects the input \(\mathbf{x}_{p}\) to the output \(\mathbf{o}_{p}\). Denoting the training set by \(T_{r}\) and the number of pairs by \(P_{tr}\), the regularized error function minimized by the network can be expressed as \(\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{1} {2}\eta \sum _{i}\mathbf{w}_{i}^{2}\), where E(w) is the average error committed by the network over the \(P_{tr}\) training pairs and η is a constraint term aimed at penalizing large model weights and thus limiting the probability of over-fitting.
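As a concrete illustration of the autoregressive FFWD-ANN and of the role of η, the sketch below uses scikit-learn's MLPRegressor as a stand-in for MUNI's network: its L2 penalty alpha plays the part of the weight-decay term η, while the lag order p, the hidden-layer size, and the number of iterations are illustrative placeholders rather than values prescribed by the procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def lag_matrix(x, p):
    """Autoregressive design: row for time t holds (x[t-p], ..., x[t-1]), target x[t]."""
    X = np.array([x[t - p:t] for t in range(p, len(x))])
    return X, np.asarray(x[p:])

def iterated_forecast(net, history, p, H):
    """H-step-ahead forecasts obtained by feeding each prediction back as an input."""
    buf, preds = list(history[-p:]), []
    for _ in range(H):
        yhat = float(net.predict(np.array(buf[-p:]).reshape(1, -1))[0])
        preds.append(yhat)
        buf.append(yhat)
    return np.array(preds)

# illustrative series and hyper-parameters (p, hidden size, eta-like penalty)
x = np.sin(np.linspace(0, 12 * np.pi, 300)) + 0.1 * np.random.randn(300)
p, eta = 6, 1e-2
X, y = lag_matrix(x[:250], p)                       # training segment only
net = MLPRegressor(hidden_layer_sizes=(4,), activation='logistic',
                   alpha=eta, max_iter=2000, random_state=0).fit(X, y)
print(iterated_forecast(net, x[:250], p, H=4))      # 1- to 4-step-ahead forecasts
```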

1.1.2 Intelligent Network Parameters Optimization

In this section, MUNI’s AI-driven part is illustrated and some notation introduced. In essence, it is a multi-grid searching system for the estimation of an optimal network vector of parameters under a suitable loss function, i.e., the root mean square error (RMSE), expressed as

$$\displaystyle{ \mathfrak{B}(x_{i},\hat{x}_{i}) = \left [T^{-1}\sum _{i=1}^{T}\vert e_{ i}\vert ^{2}\right ]^{\frac{1} {2} }, }$$
(2)

with \(x_{i}\) denoting the observed value, \(e_{i}\) the difference between it and its prediction \(\hat{x}_{i}\), and T the sample size. The parameters subjected to the neural-driven search, listed in Table 1 along with the symbols used to denote each of them,

Table 1 MUNI’s parameters and related notation

are stored in the vector \(\boldsymbol{\omega }\), i.e.: \(\boldsymbol{\omega }\equiv (\beta,\rho,\alpha,\eta,\lambda,\nu )\). Each of them is associated with a grid, whose arbitrarily chosen values are all in \(\mathbb{Z}^{+}\). Consistently with the list of parameters in Table 1, the set of these grids is formalized as follows: \(\boldsymbol{\varGamma }= (\{\varGamma _{\beta }\},\{\varGamma _{\rho }\},\{\varGamma _{\alpha }\},\{\varGamma _{\eta }\},\{\varGamma _{\lambda }\},\{\varGamma _{\nu }\})\), where each subset \(\{\varGamma _{(\cdot)}\}\) has cardinality respectively equal to \(\tilde{\beta }\), \(\tilde{\rho }\), \(\tilde{\alpha }\), \(\tilde{\eta }\), \(\tilde{\lambda }\), \(\tilde{\nu }\). The wavelet-based MRA procedure is applied \(\tilde{\beta }\) times, so that the time series of interest is broken down into \(\tilde{\beta }\) different sets, each containing a different number of crystals and in turn contained in a set denominated A, i.e.: \(\{A_{\beta _{w}};w = 1,2,\ldots,\tilde{\beta }\}\subset A\). Here, each of the A's encompasses a number of decomposition levels ranging from a minimum to a maximum, respectively denoted by \(J^{\mathrm{min}}\) and \(J^{\mathrm{max}}\); therefore, for the generic set \(A_{\beta _{w}} \subset A\), it will be:

$$\displaystyle{ A_{\beta _{w}} = \left \{J^{\mathrm{min}} \leq k,k + 1,k + 2,\ldots,K \leq J^{\mathrm{max}};J^{\mathrm{min}}> 1\right \}. }$$
(3)

Assuming a resolution set \(A_{\beta _{0}}\) and a resolution level \(k_{0} \in A_{\beta _{0}}\), the related crystal, denoted by \(\mathcal{X}_{k_{0},\beta _{0}}\), is processed by the network \(\mathfrak{N}_{1}\), which is parametrized by the vector of parameters \(\omega _{1}^{(k_{0},A_{\beta _{0}})} \equiv (\rho ^{(k_{0},A_{\beta _{0}})},\alpha ^{(k_{0},A_{\beta _{0}})},\eta ^{(k_{0},A_{\beta _{0}})},\lambda ^{(k_{0},A_{\beta _{0}})},\nu ^{(k_{0},A_{\beta _{0}})})\). Once trained, the network \(\mathfrak{N}_{1}\), denoted by \(\tilde{\mathfrak{C}}_{1}^{k_{0},A_{\beta _{0}}}\), is employed to generate H-step ahead predictions. MUNI chooses the best parameter vector, i.e., \(^{{\ast}}\boldsymbol{\omega }^{(k_{0},A_{\beta _{0}})}\), for \(\mathcal{X}_{k_{0},\beta _{0}}\) according to the minimization of a set of cost functions of the form \(\mathfrak{B}(\mathcal{X}_{k_{0},\beta _{0}},\hat{\mathcal{X}}_{k_{0},\beta _{0}})\), iteratively computed on the predicted values \(\hat{\mathcal{X}}_{k_{0},\beta _{0}}\) in the validation set. These predictions are generated by a set of networks parametrized and trained according to the set of k-tuples (with k the length of ω) induced by the set of the Cartesian relations on Γ. Denoting the former by \(\tilde{\mathfrak{C}} ^{k_{0},A_{\beta _{0}}}\) and the latter by P, it will be:

$$\displaystyle{ ^{{\ast}}\boldsymbol{\omega }^{(k_{0},A_{\beta _{0}})} =\mathop{\arg \min }\limits_{\boldsymbol{P}}\ \ \mathfrak{B}(\hat{\mathcal{X}}_{ k_{0},A_{\beta _{0}}}(\boldsymbol{P}),\mathcal{X}_{k_{0},A_{\beta _{0}}}). }$$
(4)

The set of all the trained networks attempted at the resolution set \(A_{\beta _{0}}\) (i.e., encompassing all the crystals in \(A_{\beta _{0}}\)) is denoted by \(\tilde{\mathfrak{C}} ^{A_{\beta _{0}}}\), whereas the set of networks trained in the whole exercise (i.e., for all the A's) is denoted by \(\tilde{\mathfrak{C}}^{A}\). The networks \(\tilde{\mathfrak{C}}^{A_{\beta _{0}}}\), parametrized with the optimal vector \((^{{\ast}}\boldsymbol{\omega }_{J^{\mathrm{min}}},\ldots,^{{\ast}}\boldsymbol{\omega }_{J^{\mathrm{max}}}) \equiv \ ^{{\ast}}\boldsymbol{\varOmega }_{\beta _{0}}\), obtained by applying (4) to each crystal, are used to generate predicted values at each resolution level independently. These predictions are combined via the inverse MODWT and evaluated in terms of the loss function \(\mathfrak{B}\), computed on the validation set of the original time series. By repeating the above steps for the remaining sets, i.e., \(\{A_{\beta _{w}};w = 1,2,\ldots,\tilde{\beta }-1\} \subset A\), \(\tilde{\beta }\) optimal sets of networks \(^{{\ast}}\tilde{\mathfrak{C}} ^{A}\), each parametrized by an optimal vector of parameters \(\{^{{\ast}}\boldsymbol{\varOmega }_{w};w = 1,2,\ldots,\tilde{\beta }\}\), are obtained. Each set of networks in \(^{{\ast}}\tilde{\mathfrak{C}} ^{A}\) is used to generate one vector of predictions for \(x_{t}\) in the validation set (by combining the multi-resolution predictions via the inverse MODWT), so that, by iteratively applying (2) to each of them, a vector containing \(\tilde{\beta }\) values of the loss function, say \(\mathfrak{L}_{w}\), is generated. Finally, the set of networks associated with the resolution set, say A, whose parametrization minimizes \(\mathfrak{L}_{w}\) is the winner, i.e.: \(^{{\ast}}\boldsymbol{\varOmega }^{A} =\mathop{\arg \min }\limits_{(^{{\ast}}\boldsymbol{\varOmega }_{ w})}\ \ (\mathfrak{L})\).
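In code, the selection rule (4) amounts to scanning the Cartesian product of the parameter grids and retaining, for a given crystal, the configuration minimizing the validation loss (2). The sketch below is a simplified illustration: the grids are made-up examples (the actual ones are in Table 4), and train_and_predict is a hypothetical placeholder standing for the crystal-specific network training and H-step forecasting described above, not a function of MUNI's implementation.

```python
import itertools
import numpy as np

def rmse(actual, predicted):
    """Loss (2): root mean square error."""
    e = np.asarray(actual) - np.asarray(predicted)
    return np.sqrt(np.mean(np.abs(e) ** 2))

# illustrative grids for (rho, alpha, eta, lambda, nu); real values are in Table 4
grids = {"rho": [1, 2], "alpha": [100, 200], "eta": [0.01, 0.1],
         "lam": [3, 6], "nu": [2, 4, 8]}

def best_parametrization(crystal_train, crystal_valid, train_and_predict):
    """Scan the Cartesian product P of the grids and return the parameter tuple
    minimizing the validation RMSE, together with its loss value."""
    best = (None, np.inf)
    for omega in itertools.product(*grids.values()):
        params = dict(zip(grids.keys(), omega))
        preds = train_and_predict(crystal_train, params, horizon=len(crystal_valid))
        loss = rmse(crystal_valid, preds)
        if loss < best[1]:
            best = (params, loss)
    return best
```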

1.1.3 Human-Driven Decisions

Being a partially self-regulating method, MUNI embodies automatic estimation procedures for a number of parameters, but it still requires a set of preliminary, human-driven choices involving both the decomposition and the AI parts, as summarized in Table 2. Here, the first two entries are in common, in that they refer to activities which are not unit-specific.

Table 2 MUNI procedure: a priori activities and choices

1.1.4 The Algorithm

The MUNI procedure is now detailed in a step-by-step fashion.

  1. Let \(x_{t}\) be the time series of interest [as defined in Sect. 1.1], split into three disjoint segments: the training set \(\left \{x^{\mathrm{Tr}}\right \}_{t},\ t = 1,2,\ldots,T - (S + V)\); the validation set \(\left \{x^{U}\right \}_{t},\ t = T - (S + V) + 1,\ldots,T - S\); and the test set \(\left \{x^{S}\right \}_{t},\ t = T - S + 1,\ldots,T\), where V and S denote, respectively, the lengths of the validation and test sets;

  2. MODWT is applied \(\tilde{\beta }\) times to \(x_{t}^{\mathrm{Tr}}\), and the related sets of crystals are stored in \(\tilde{\beta }\) different sets, which in turn are stored in the matrix \(\mathfrak{D}_{j,w}\), of dimension (\(J^{\mathrm{max}}\times \tilde{\beta }\)).

    Here, the generic column \(\beta _{w}\) represents the set of resolution levels generated by a given MRA run, so that its generic element \(\mathcal{X}_{k,w}\) is the crystal obtained at a given decomposition level k and belonging to the set of crystals \(A_{\beta _{w}}\). For each column vector \(\beta _{w}\), minimum and maximum decomposition levels, \(J^{\mathrm{min}}\) and \(J^{\mathrm{max}}\) as in (3), are arbitrarily chosen;

  3. the set P of the parametrizations of interest is built. It is the set of all the Cartesian relations \(\mathbf{P} \equiv \left \{\varGamma _{\rho }\times \varGamma _{\alpha }\times \varGamma _{\eta }\times \varGamma _{\lambda }\times \varGamma _{\nu }\right \}\), whose cardinality, expressed through the symbol \(\vert \cdot \vert\), is denoted by \(\vert \mathbf{P}\vert\);

  4. an arbitrary set of decomposition levels, say \(A_{0} \subset A\), is selected (the symbol β is suppressed for easier readability);

  5. an arbitrary crystal, say \(\mathcal{X}_{k_{0},A_{0}} \in A_{0}\), is extracted;

  6. the parameter vector \(\boldsymbol{\omega }\) is set to an (arbitrary) initial status \(\boldsymbol{\omega }_{1} \equiv \mathbf{P}_{1} \equiv \big [\varGamma _{\alpha }^{(1)},\varGamma _{\rho }^{(1)},\varGamma _{\eta }^{(1)},\varGamma _{\lambda }^{(1)},\varGamma _{\nu }^{(1)}\big];\)

    (a) \(\mathcal{X}_{k_{0},A_{0}}\) is submitted to and processed by a single-hidden-layer ANN of the form

      $$\displaystyle{ \left \{\begin{array}{@{}l@{\quad }l@{}} \mathfrak{N}_{(1)} =\sigma ^{}\Big(\mathcal{X}_{k_{0},A_{0}},\mathbf{w}_{(1)}\Big)\quad \\ \mathbf{w}_{(1)} =\sigma ^{}(\boldsymbol{\omega }_{1}^{k_{0},A_{0}}), \quad \end{array} \right. }$$

      with \(\sigma\) being the sigmoid activation function and w the network weights evaluated for a given configuration of the parameter vector, i.e., \(\boldsymbol{\omega }_{1}\).

    (b) network \(\mathfrak{N}_{1}\) is trained and the network \(\tilde{\mathfrak{C}}_{1} ^{k_{0},A_{0}}\), obtained as a result, is employed to generate H-step ahead predictions for the validation set \(x^{U}\). These predictions are stored in the vector \(\mathfrak{P}_{1}^{k_{0},A_{0}}\);

    (c) steps 6a–6b are repeated for each of the remaining (\(\vert \mathbf{P}\vert - 1\)) elements of P. The matrix \(\mathfrak{P}_{m}^{k_{0},A_{0}}\) of dimension \(\big(U \times (\vert \mathbf{P}\vert - 1)\big)\), containing the related predictions for the crystal \(\mathcal{X}_{k_{0},A_{0}}\), is generated by the trained networks \(\tilde{\mathfrak{C}} _{2,\ldots,\vert \mathbf{P}\vert }^{k_{0},A_{_{0}}}\);

    (d) the matrix \(_{\mathrm{full}}\mathfrak{P}^{k_{0},A_{0}}\) of dimension \(\big(U \times \vert \mathbf{P}\vert \big)\), containing all the predictions for the crystal \(\mathcal{X}_{k_{0},A_{0}}\), is generated, i.e.,

      $$\displaystyle{ _{\mathrm{full}}\mathfrak{P}^{k_{0},A_{0}} = \mathfrak{P}_{1}^{k_{0},A_{0}} \cup \mathfrak{P}_{m}^{k_{0},A_{0}}; }$$
    (e) steps 6a–6d are repeated for each of the remaining crystals in \(A_{0}\), i.e.,

      $$\displaystyle{ \left \{\mathcal{X}_{k_{i},A_{0}};\quad i = J^{\mathrm{min}},J^{\mathrm{min}}+1,\ldots,(J^{\mathrm{max}} - 1)\right \} \subset A_{ 0} \subset A, }$$

      so that a further \((J^{\mathrm{max}} - J^{\mathrm{min}})\) prediction matrices \(\mathfrak{P}^{k_{i},A_{0}}\) are generated;

    (f) the \(\big((J^{\mathrm{max}} - J^{\mathrm{min}} + 1) \times \vert \mathbf{P}\vert \big)\)-dimensional matrix \(\hat{\mathfrak{J}}_{A_{0}} \equiv \ _{\mathrm{full}}\mathfrak{P}^{k_{0},A_{0}} \cup \mathfrak{P}^{k_{i},A_{0}},\ i = J^{\mathrm{min}},J^{\mathrm{min}}+1,\ldots,(J^{\mathrm{max}} - 1)\), containing all the predictions for the validation set \(\left \{\mathcal{X}^{U}\right \}_{t}\) of all the crystals in \(A_{0}\), is generated, i.e.:

      $$\displaystyle{ \hat{\mathfrak{J}}_{A_{0}} = \left [\begin{array}{cccccc} \hat{\mathcal{X}}_{J^{\mathrm{min}},\omega _{1}} & \hat{\mathcal{X}}_{J^{\mathrm{min}},\omega _{2}} &\cdots & \hat{\mathcal{X}}_{J^{\mathrm{min}},\omega _{k}} &\cdots & \hat{\mathcal{X}}_{J^{\mathrm{min}},\omega _{\vert \mathbf{P}\vert }} \\ \hat{\mathcal{X}}_{J^{\mathrm{min}}+1,\omega _{1}}&\hat{\mathcal{X}}_{J^{\mathrm{min}}+1,\omega _{2}}&\cdots &\hat{\mathcal{X}}_{J^{\mathrm{min}}+1,\omega _{k}}&\cdots &\hat{\mathcal{X}}_{J^{\mathrm{min}}+1,\omega _{\vert \mathbf{P}\vert }} \\ \vdots & \vdots & & \vdots & & \vdots \\ \hat{\mathcal{X}}_{k,\omega _{1}} & \hat{\mathcal{X}}_{k,\omega _{2}} &\cdots & \hat{\mathcal{X}}_{k,\omega _{k}} &\cdots & \hat{\mathcal{X}}_{k,\omega _{\vert \mathbf{P}\vert }} \\ \vdots & \vdots & & \vdots & & \vdots \\ \hat{\mathcal{X}}_{J^{\mathrm{max}},\omega _{1}} & \hat{\mathcal{X}}_{J^{\mathrm{max}},\omega _{2}} &\cdots & \hat{\mathcal{X}}_{J^{\mathrm{max}},\omega _{k}} &\cdots & \hat{\mathcal{X}}_{J^{\mathrm{max}},\omega _{\vert \mathbf{P}\vert }} \end{array} \right ]; }$$
  7. loss function minimization in the validation set is used to build the set of winner ANNs for each of the crystals in \(A_{0}\), i.e., \(\mathfrak{C}_{{\ast}}^{A_{0}} \equiv \left \{^{J^{\mathrm{min}} }\mathfrak{C}_{{\ast}}^{A_{0}},\ldots,\ ^{J^{\mathrm{max}} }\mathfrak{C}_{{\ast}}^{A_{0}}\right \}\). For example, for the generic crystal k, the related optimal network is selected according to: \(^{k}\mathfrak{C}_{{\ast}}^{A_{0}} =\mathop{\arg \min }\limits_{ \mathbf{P}}\ \ \mathfrak{B}(\ \mathcal{X}_{k}^{U},\ \hat{\mathcal{X}}_{k}^{U}(\mathbf{P}))\);

  8. \(\mathfrak{C}_{{\ast}}^{A_{0}}\) is employed to generate the matrix \(\hat{\mathcal{X}}_{{\ast}}^{A_{0}}\) of the optimal predictions for the validation set of each resolution level in \(A_{0}\), i.e., \(\hat{\mathcal{X}}_{{\ast}}^{A_{0}} \equiv \big [\,^{J^{\mathrm{min}} }\hat{\mathcal{X}}^{U},\ldots,\ ^{J^{\mathrm{max}} }\hat{\mathcal{X}}^{U}\big]'\);

  9. by applying the inverse MODWT to \(\hat{\mathcal{X}}_{{\ast}}^{A_{0}}\), the series \(\left \{x^{U}\right \}_{t}\) is reconstructed, i.e., \(\mathfrak{I}\mathfrak{n}\mathfrak{v}(\hat{\mathcal{X}}_{{\ast}}^{A_{0}}) = \left \{\hat{x}_{A_{ 0}}^{U}\right \}_{ t}\), so that the related loss function \(\mathfrak{B}(x_{t}^{U},\hat{x}_{t}^{U})\) is computed and its value stored in the vector \(\boldsymbol{\varDelta }\), whose length is \(\tilde{\beta }\) (one entry per resolution set);

  10. steps 4–9 are repeated for the remaining sets of resolutions \(A_{1},\ldots,A_{w},\ldots,A_{\tilde{\beta }-1}\), so that all the \(\tilde{\beta }\) error function minima are stored in the vector \(\boldsymbol{\varDelta }\);

  11. the network set \(\mathfrak{C}^{{\ast}}\), generating the crystal estimates that minimize \(\boldsymbol{\varDelta }\) over all the network configurations \(\mathfrak{C}_{{\ast}}^{A}\), is the final winner, i.e., \(\mathfrak{C}^{{\ast}} =\mathop{\arg \min }\limits_{ \mathfrak{C}_{{\ast}}^{A}}\ \ \varDelta (\mathfrak{C})\) (a compact sketch of steps 9–11 is given after this list);

  12. the final performance assessment is obtained by using \(\mathfrak{C}^{{\ast}}\) on the test set \(x_{t}^{S}\).
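The outer selection loop of steps 9–11 can be sketched as follows, under simplifying assumptions: the per-crystal optimal validation forecasts (step 8) are taken as given, and reconstruct is a placeholder for the inverse MODWT (for instance the imodwt helper sketched in Sect. 1.1), not MUNI's own routine; the function and label names are ours.

```python
import numpy as np

def rmse(a, b):
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def final_selection(optimal_preds_per_set, x_valid, reconstruct):
    """Steps 9-11: for every resolution set A_w, recombine the per-crystal optimal
    validation forecasts into a single series, score it, and keep the best set.

    optimal_preds_per_set : dict mapping a set label (e.g. 'A_0') to the stacked
        per-crystal forecasts (the matrix of step 8);
    x_valid               : validation segment of the original series;
    reconstruct           : callable applying the inverse MODWT to the stack.
    """
    Delta = {}
    for label, preds in optimal_preds_per_set.items():
        x_hat = reconstruct(preds)            # step 9: inverse MODWT of the forecasts
        Delta[label] = rmse(x_valid, x_hat)   # one loss value per resolution set
    winner = min(Delta, key=Delta.get)        # step 11: arg min over Delta
    return winner, Delta
```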

Table 3 Specification of the time series employed in the empirical analysis
Fig. 1 Graphs of the time series employed in the empirical study

Fig. 2 EACFs for the time series employed in the empirical study

Fig. 3 TS1: test set. True (continuous line) and 1,2,3,4-step ahead predicted values (dashed line)

Fig. 4 TS4: test set. True (continuous line) and 1,2,3,4-step ahead predicted values (dashed line)

2 Empirical Analysis

In this section, the outcomes of an empirical study conducted on four macroeconomic time series are presented: the Japan/USA exchange rate (TS1), the USA civilian unemployment rate in un-transformed (TS2) and differenced (TS3) form, and the Italian industrial production index (TS4). These series (detailed in Tables 3 and 5), along with their empirical autocorrelation functions (EACF) [16], depicted respectively in Figs. 1 and 2, have been considered because they differ substantially in the type of phenomenon measured as well as in inherent characteristics such as time span, probabilistic structure, seasonality, and frequency components. In particular, TS2 and TS3 refer to the same variable (the US civilian unemployment rate) and are included in the empirical analysis to emphasize MUNI's capability to yield comparable results when applied to both original and transformed data (and thus to simulate the case of a non-pre-processed input series). As expected, the two series exhibit different patterns: the un-transformed one (TS2) shows an ill-behaved pattern, in terms of both seasonal components and non-stationarity, in comparison with its differenced counterpart TS3 (differencing orders 1 and 12). On the other hand, TS1 and TS2 show roughly similar overall patterns, with spikes, irregular seasonality, and non-stationarity both in mean and in variance; such a similarity is roughly confirmed by the patterns of their EACFs (Fig. 2). The time series TS3 and TS4 exhibit more regular overall behaviors but deep differences in terms of their structures. In fact, by examining Figs. 1 and 2, it becomes apparent that, unlike TS4, TS3 is roughly trend-stationary with a 12-month seasonality whose persistence is of the moving-average type (according to the unreported partial EACF), whereas TS4 appears to follow an autoregressive process. TS4 has been included because it is affected by two major problems: an irregular trend pattern with a significant structural break located in 2009, and seasonal variations whose size is approximately proportional to the local level of the mean. A potential source of nonlinearity, this form of seasonality is often dealt with by making it additive through an ad hoc data transformation; however, this is not a risk-free procedure, since it is usually associated with the critical task of back-transforming the data, as shown in [17, 18].

Quantitative assessments of the quality of the predictions generated by MUNI are made by means of the following three metrics, computed on the test sets of each of the four time series: \(\mathrm{RMSE}^{(h)} = \sqrt{\frac{1} {s}\sum \vert x^{S} -\hat{ x}^{S}\vert ^{2}}\), \(\mathrm{MPE}^{(h)} = 100\frac{1} {s}\sum \big[\frac{x^{S}-\hat{x}^{S}} {x^{S}} \big]\), \(\mathrm{MAPE}^{(h)} = 100\frac{1} {s}\sum \big\vert \frac{x^{S}-\hat{x}^{S}} {x^{S}} \big\vert\), with s the length of \(x^{S}\) and h the number of steps ahead at which the predictions are evaluated, here h = 1, 2, 3, 4 (Figs. 3 and 4).
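For reference, the three evaluation metrics can be computed as in the short sketch below; x_true and x_pred are hypothetical arrays holding the observed test values \(x^{S}\) and their h-step-ahead predictions, and the numbers used are purely illustrative.

```python
import numpy as np

def rmse(x_true, x_pred):
    return np.sqrt(np.mean(np.abs(np.asarray(x_true) - np.asarray(x_pred)) ** 2))

def mpe(x_true, x_pred):
    x_true, x_pred = np.asarray(x_true), np.asarray(x_pred)
    return 100.0 * np.mean((x_true - x_pred) / x_true)

def mape(x_true, x_pred):
    x_true, x_pred = np.asarray(x_true), np.asarray(x_pred)
    return 100.0 * np.mean(np.abs((x_true - x_pred) / x_true))

x_true = np.array([5.1, 5.3, 5.0, 4.9])    # illustrative observed test values
x_pred = np.array([5.0, 5.4, 5.1, 4.7])    # illustrative 1..4-step-ahead forecasts
print(rmse(x_true, x_pred), mpe(x_true, x_pred), mape(x_true, x_pred))
```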

Table 4 Grid values employed in the empirical analysis

2.1 Results

As illustrated in Sect. 1.1.4, each of the employed ANNs has been implemented according to a variable-specific set \(\boldsymbol{\varGamma }\), containing all the grids whose values are reported in Table 4. It is worth emphasizing that, in practical applications, the set \(\boldsymbol{\varGamma }\) does not necessarily encompass the optimal (in the sense of the target function \(\mathfrak{B}\)) parameter values of a given network. More realistically, due to computational constraints, one can only design a grid set able to steer the search procedure towards good approximating solutions (Table 6). The outcomes of the empirical analysis outlined in Sect. 2 are reported in Table 7. From its inspection, it is possible to see the good performances achieved by the procedure, with a few exceptions. The series denominated TS4, in particular, shows a level of fit that can be considered particularly interesting, especially in the light of its moderate sample size and the irregularities exhibited by the lower frequency components, i.e., d4 and s4 (Fig. 5). The procedure chooses in this case relatively simple architectures: in fact (see Table 6), excluding s4 (with six lags and the parameter ν = 5), all the remaining components have a more limited number of delays and of hidden neurons (ν ≤ 3). Regarding the performances, the level of fit obtained at horizons 1 and 2 seems remarkable, with MAPE values of 0.85 and 1.43 and RMSE values of 0.71 and 1.5, respectively. Visual inspection of Fig. 4 confirms this impression, as well as the lower degree of accuracy recorded at farther horizons, even though horizon-3 predictions (MAPE = 2.63), in particular, can still provide some insight into the future behavior of this variable. Less impressive performances can be noticed for TS1, both from Table 7 and by examining Fig. 3. However, it

Table 5 Length of the subsets of the original time series
Fig. 5 TS4 and its MODWT coefficient sequence \(d_{j,\,t};j = 1,\ldots,4\)

Table 6 Parameters chosen by MUNI for each time series at each frequency component
Table 7 Goodness of fit statistics computed on the test set for the four time series considered

has to be said that, among those included in the empirical experiment, this time series proves to be the most problematic one, both in terms of sample size and because it exhibits a multiple-regime pattern made more complicated by the presence of heteroscedasticity. Such a framework induces MUNI to select architectures which are too complex for the available sample size. As reported in Table 6, in fact, the number of hidden neurons is large and reaches, for almost all the chosen frequency components (β = 5), its maximum grid value (ν = 8). Also, the number of input lags selected is always high, whereas the regularization parameter reaches, for all the decomposition levels, its maximum value (η = 0.1). Such a situation is probably an indication that the procedure tries to limit model complexity by using the greatest value admitted for the regularization term; nevertheless, the selected networks still seem to over-fit. This impression is also supported by the high number of iterations (the selected value of α is 200 for each of the final networks), which might have induced the selected networks to learn irrelevant patterns. As a result, MUNI is not able to properly screen out undesired, noisy components, which therefore affect the quality of the predictions. Notwithstanding this framework, however, the performances can still be regarded as acceptable for the predictions at horizon 1 and perhaps at horizon 2 (RMSE = 1.73 and 2.70, respectively), whereas they deteriorate significantly at horizon 3, where the RMSE reaches the value of 4.71. Horizon 4 is where MUNI breaks down, probably because of the increasing degree of uncertainty present at longer horizons combined with poor network generalization capabilities. With an RMSE of 7.18, that is, more than 4 times the horizon-1 value, and an MPE of 1.59 (more than 5 times higher), additional actions would be in order, e.g., increasing the information set by including ad hoc regressors in the model. As already mentioned, TS2 shows an overall behavior fairly similar to TS1 in terms of probabilistic structure, non-stationary components, and multiple-regime pattern. However, the more satisfactory performances recorded in this case are most likely connected to the much larger available sample size. In particular, it is worth emphasizing the good values of the MAPE for the short-term predictions (h = 1, 2), respectively equal to 1.42 and 2.83, as well as the RMSE obtained at horizon 4, which amounts to 0.37. In this case, more parsimonious architectures are chosen (see Table 6) for the β = 6 components the original time series has been broken into, with the number of neurons ranging from 2 (for the crystal d6) to 5 (for the crystal d2), associated with input sets of moderate size (\(\lambda = 5\) is the maximum value selected). As a pre-processed version of TS2, TS3 shows a more regular behavior (even though a certain amount of heteroscedasticity is still present), with weaker correlation structures at the first lags and a single peak at the seasonal lag 12 (\(EACF = -0.474\)). As expected, MUNI in this case selects simpler architectures for each of the β = 5 sub-series, with a limited number of input lags and a smaller number of neurons (ν ≤ 3). Although generated by more parsimonious networks, the overall performances seem comparable to those obtained in the case of TS2. In fact, while they are slightly worse for the first two horizons (MAPE = 1.84 and 3.22 for h = 1, 2, respectively, versus 1.42 and 2.83), the error committed seems to decrease more smoothly as the forecast horizon increases. In particular, at horizon 4 MUNI delivers better predictions than in the case of the un-transformed series: in fact, the recorded values of RMSE and MAPE are, respectively, 0.29 and 4.07 for TS3, versus 0.37 and 5.83 for TS2.