1 Introduction

Artificial Neural Networks (ANNs) have been investigated and utilized extensively by researchers in numerous fields to conduct predictions and classifications based on the knowledge learned from training data [1, 2]. Significant accomplishments have been achieved by applying ANNs in computer vision, speech recognition and natural language processing [3, 4]. ANNs are mathematical models of the biological neural networks that constitute animal brains [5], with neurons, connections (axons) and transfer functions (synapses). After decades of research and development, ANNs have evolved from the Perceptron [6] to the Hopfield network [7], to the Backpropagation Neural Network [8] and more recently to deep learning [2], which promotes the third wave of Artificial Intelligence (AI). Nonlinear mapping capability is obtained with a sufficiently large number of neurons, connections, weights, biases, transfer functions and suitable learning algorithms. From a mathematical perspective, ANNs are capable of approximating any function to any given precision [9, 10]. However, several critical issues should be addressed to apply ANNs more effectively.

The first issue is the dependence on gradient information when training ANNs. Although the number of neurons, connections, weights, biases and transfer functions are essential aspects of ANNs, a training procedure that adjusts the weights and biases is necessary to ensure that ANNs behave as expected. Backpropagation, a teacher-based supervised learning algorithm, has played an important role since the 1980s because it trains ANNs efficiently. The errors are backpropagated through the network based on the gradient descent (GD) algorithm. However, the algorithm may become trapped in local minima because of its dependence on local gradient information. Although improved methods (e.g. Batch Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent) have been proposed, the convergence of ANNs during the training stage remains a problem that further influences training and prediction performance. Therefore, some researchers propose training ANNs with Heuristic Algorithms (HAs). For example, Zhao proposed the General Vector Machine (GVM), which trains ANNs with a Monte Carlo algorithm and least square errors [11]. The accuracy and generalizability were good on relatively small data sets, but this method could hardly obtain satisfactory results on large data sets because of the exponential increase in computational cost. Simulated Annealing has been integrated with GD and backpropagation to avoid local optima when training ANNs [12]. Progress has also been made by applying Genetic Algorithms to adjust the weights and biases during the training procedure. However, a solid theoretical basis is still missing owing to the heuristic origin of HAs.

The second issue is the uncertainty analysis of ANNs. Uncertainty is an intrinsic property of all kinds of prediction models (including black-box and deterministic models) [13]. Uncertainty analysis quantitatively identifies and reduces uncertainties in models and tries to determine how likely the outputs are when some aspects of the system are not exactly known [14]. Ignoring the sources and influences of uncertainty undermines the robustness and reliability of the results and analysis obtained from a model [15, 16]. There are four important sources of uncertainty for a model [13]: (1) uncertainties in input data; (2) uncertainties in data used for calibration; (3) uncertainties in model parameters; (4) uncertainties due to imperfect model structure. Broadly, there are two groups of uncertainty assessment methods [17], i.e. the Bayesian methods [18] and the Generalized Likelihood Uncertainty Estimation (GLUE) method [19]. The Bayesian methods analyse the uncertainty of a model by assuming a prior probability over hypotheses and determining the probability of a particular hypothesis based on observed evidence. The GLUE method conducts a large number of model runs with many random parameter values selected from a prior probability distribution. The parameter values are accepted or discarded based on a subjective threshold, which leads to a major drawback of the GLUE method: it depends on subjective decisions rather than statistically consistent error models. Furthermore, the large number of model runs is often impractical because of the computational cost of complex models.

Data Assimilation (DA) originated from and has a long tradition in meteorology and oceanography [20, 21]. The essence of DA is to deal with uncertainty by assimilating different kinds of observations. It is well known that a free-running model accumulates errors until its prediction is no longer useful [22]. The only way to avoid this is to allow the model to be influenced by observations [23]. DA provides a solution that evolves the model (updates the states) by incorporating available observations. This procedure has different names in different fields, e.g. state estimation [24], optimization [25], history matching [26], retrieval production [27] and inverse modelling [28]. The objective of DA is to produce information about the posterior Probability Density Function (PDF) by different approaches. There are three categories of Bayesian-based DA strategies: (1) variational DA, implemented as 3D-Var or 4D-Var; (2) ensemble DA, implemented based on the Ensemble Kalman Filter (EnKF); (3) Monte Carlo methods, which allow the assimilation of information with non-Gaussian errors. The EnKF [29] is derived from the merge of the Kalman Filter [30] and Monte Carlo estimation methods [31]. Since it was first proposed by Evensen [29], the algorithm has been examined and applied in various fields such as meteorology, oceanography, petroleum engineering and hydrogeology [32,33,34]. Its simple conceptual formulation and relative ease of implementation (no derivation of a tangent linear operator or adjoint equations is required) with affordable computational requirements account for the popularity of EnKF. The system states can be forecasted and updated simultaneously with minimized error covariance in real time. Bocquet et al. showed the possibilities of combining DA and machine learning from a Bayesian perspective [35]. In [36], the Kalman filter was used to sequentially update the output weights of a single-layer feedforward network based on the Online Sequential Extreme Learning Machine to conduct online learning and handle the effects of multicollinearity. The Extended Kalman Filter (EKF) has been used to optimize the parameters of the Support Vector Machine [37], Feedforward Neural Network (FNN) [38,39,40], Radial Basis Function Neural Network [41] and Recurrent Neural Network (RNN) [42, 43]. Chen et al. proposed a training method based on the ensemble randomized maximum likelihood algorithm, which avoids gradients during training and performs uncertainty analysis at the same time [44]. The Ensemble Smoother with Multiple Data Assimilation (ESMDA) [45] was introduced based on the Ensemble Smoother (ES) proposed by van Leeuwen and Evensen [46] in order to avoid stopping and restarting the model when observations become available. Furthermore, a range of methods based on Monte Carlo techniques have been formulated to conduct DA. For example, the Particle Filter (PF) represents a PDF by ensembles (particles) without the limitation of Gaussianity of the distribution. Bao et al. proposed combining ESMDA with a GAN to deal with non-Gaussianity [47]. DA algorithms thus offer an opportunity to optimize the parameters (i.e. states), quantify the uncertainty and train ANNs without gradients at the same time.

In this paper, a novel training framework for ANNs is proposed by adopting data assimilation. The framework avoids the dependence on gradients and hence some disadvantages of GD-based methods. ESMDA trains ANNs with a predefined number of iterations by updating the parameters using all the available observations, which can be regarded as offline learning. EnKF optimizes ANNs by updating the parameters whenever a new observation becomes available, which can be regarded as real-time learning. To illustrate the idea, a fully connected FNN integrated with EnKF and ESMDA is implemented. Two synthetic cases with regression problems based on the Sine function and the Mexican Hat function are conducted to test and validate the proposed framework. Furthermore, the uncertainty of the FNN parameters is analysed and quantified by ESMDA. The paper is organized as follows. Section 2 provides the theory of the FNN, data assimilation and the proposed framework. Section 3 presents the data and settings of the synthetic cases used to validate the proposed framework. The results are demonstrated in Sect. 4. Finally, summary and conclusions are given in Sect. 5.

2 Methodology

2.1 Notations

Extensive use of mathematical notations is made in this section. The DA-related notations in this paper are consistent with the symbols used in our previous work [48] as much as possible. To better understand the equations, the notations are summarized and described in advance (Table 1).

Table 1 List of symbols

2.2 Feedforward neural network

A Feedforward Neural Network (FNN) is an ANN wherein the information flows from the input layer through the transfer functions to the output layer. There are no feedback connections, and hence the neurons do not form a cycle. The artificial neuron was proposed by Frank Rosenblatt [6], inspired by the work of Warren McCulloch and Walter Pitts [5]. In a neuron, the output is calculated by a nonlinear function (activation function or transfer function) of the sum of its inputs as \(y = f\left( {\mathop \sum \nolimits_{i = 1}^{p} w_{i} \psi_{i} } \right)\). An FNN is formed by combining such neurons, analogous to biological neural networks. Without loss of generality, a three-layer FNN (input layer, hidden layer and output layer) is used as an example in this study (Fig. 1). The feedforward process is the same as in common fully connected neural networks:

$$h_{j} = f_{1} \left( {\mathop \sum \limits_{i = 1}^{n} w_{ji} \times \psi_{i} } \right)\;\; i = 1,2, \ldots ,n;\;j = 1, 2, \ldots , N_{h}$$
(1)
$$y_{k} = f_{2} \left( {\mathop \sum \limits_{j = 1}^{{N_{h} }} w_{kj} \times h_{j} + b_{j} } \right)\;\; k = 1, 2, \ldots , m$$
(2)

where ψi, hj and yk represent the nodal values in the input layer, hidden layer and output layer, respectively; n, Nh and m are the number of neurons in the input layer, hidden layer and output layer; wji is the weight connecting the input ψi and the jth neuron in the hidden layer; bj represents the bias in the output layer; wkj is the weight connecting the jth neuron in the hidden layer (hj) and the output yk; f1 and f2 are the activation functions in the hidden layer and the output layer.
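As a concrete illustration, the following minimal sketch evaluates Eqs. (1) and (2) for a small network. The tanh hidden activation, identity output activation, the 1–10–1 layer sizes and the function name fnn_forward are illustrative assumptions, and Eq. (2) is read literally so that the biases bj enter the output-layer sum.

```python
import numpy as np

def fnn_forward(psi, W1, W2, b, f1=np.tanh, f2=lambda z: z):
    """Forward pass of the three-layer FNN in Eqs. (1)-(2).

    psi : (n,)      input values psi_i
    W1  : (N_h, n)  weights w_ji connecting input and hidden layer
    W2  : (m, N_h)  weights w_kj connecting hidden and output layer
    b   : (N_h,)    biases b_j entering the output-layer sum
    """
    h = f1(W1 @ psi)           # Eq. (1): hidden nodal values h_j
    y = f2(W2 @ h + b.sum())   # Eq. (2), read literally: sum_j (w_kj * h_j + b_j)
    return y

# Illustrative 1-10-1 network evaluated at a single input value.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(10, 1))
W2 = rng.normal(size=(1, 10))
b = rng.normal(size=10)
print(fnn_forward(np.array([0.5]), W1, W2, b))
```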

Fig. 1 The structure of Feedforward Neural Networks and neurons

2.3 Data assimilation

Generally, data assimilation combines information from a variety of sources to improve the accuracy of predictions and takes the uncertainty from measurements, inputs, parameters and model structures into account at the same time. In a nonlinear dynamic system, the state space X is defined as:

$$X_{t} = \left( {\begin{array}{*{20}c} A \\ B \\ \end{array} } \right){ }t = 1,2, \ldots ,{ }N_{t}$$
(3)

where Xt is the state vector at time t with dimension Nx; Nt denotes the number of time steps; A represents the parameter vector with dimension Na; B represents the other state vectors with dimension Nb; Nx denotes the number of all state variables in Xt, which equals Na + Nb.

The system evolves through time t according to the state equation (Eq. (4)) and the observation equation (Eq. (5)).

$$\begin{array}{*{20}c} {X_{t}^{f} = M_{t - 1} \left( {X_{t - 1}^{a} } \right) + \xi }\;\; & {\xi \sim N\left( {0, R_{\xi } } \right)} \\ \end{array}$$
(4)
$$\begin{array}{*{20}c} {Y_{t} = H_{t - 1} \left( {X_{t}^{f} } \right) + \nu }\;\; & {\nu \sim N\left( {0,R_{\nu } } \right)} \\ \end{array}$$
(5)

where f denotes the forecast (prior estimate) of the states; a denotes the analysis (posterior estimate) of the states; Xtf represents the forecast of the states at time t; Mt-1 is the nonlinear model operator; Xt-1a is the analysis of the states at time t-1; \(\xi \sim N\left( {0, R_{\xi } } \right)\) and \(\nu \sim N\left( {0,R_{\nu } } \right)\) indicate Gaussian errors with zero mean and covariance matrices Rξ and Rν; Yt is the observation vector at time t; H represents the observation operator that connects the model states and the observations.

2.3.1 Ensemble Kalman filter

The EnKF is a sequential data assimilation algorithm based on the KF. There are typically two steps in the EnKF: the forecast step and the analysis (update) step. In the forecast step, the forecast states are computed according to Eq. (4). In the analysis step, the observation data Yo are first perturbed by random errors:

$$\begin{array}{*{20}c} {Y_{t}^{o} = Y^{o} + \varepsilon }\; & {\varepsilon \sim N\left( {0, R_{\varepsilon } } \right)} \\ \end{array}$$
(6)

where Yto represents the perturbed observation data at time t; ε ~ N(0, Rε) indicates Gaussian random observation errors with zero mean and covariance matrix Rε.

The analysis states are obtained by updating the forecast as follows:

$$X_{t}^{a} = X_{t}^{f} + \frac{{P_{t}^{f} H^{T} }}{{HP_{t}^{f} H^{T} + R_{\varepsilon } }}\left( {Y_{t}^{o} - Y_{t} } \right)$$
(7)

The analysis covariance matrix at time t is

$$P_{t}^{a} = \left( {I - \frac{{P_{t}^{f} H^{T} }}{{HP_{t}^{f} H^{T} + R_{\varepsilon } }}H} \right)P_{t}^{f}$$
(8)

Here

$$P_{t}^{f} = \frac{1}{{N_{e} - 1}}\mathop \sum \limits_{i = 1}^{{N_{e} }} \left( {x_{i,t}^{f} - \overline{{x_{t}^{f} }} } \right)\left( {x_{i,t}^{f} - \overline{{x_{t}^{f} }} } \right)^{T}$$
(9)
$$\overline{{x_{t}^{f} }} = \frac{1}{{N_{e} }}\mathop \sum \limits_{i = 1}^{{N_{e} }} x_{i,t}^{f}$$
(10)

Define

$$K_{t} = \frac{{P_{t}^{f} H^{T} }}{{HP_{t}^{f} H^{T} + R_{\varepsilon } }}$$
(11)

where Kt is the Kalman gain matrix at time t; Ne represents the ensemble size; Ptf is the forecast covariance matrix at time t; \(\overline{{x_{t}^{f} }}\) is the mean of ensemble members for forecast states.

Equations (4)–(11) illustrate the recursive optimal estimation process of the EnKF, which dynamically updates the system estimates when new observations become available.
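For illustration, the sketch below performs one EnKF analysis step for an ensemble of state vectors following Eqs. (6)–(11); the linear observation operator H, the dimensions in the toy usage and the function name enkf_analysis are assumptions made for this example.

```python
import numpy as np

def enkf_analysis(Xf, H, y_obs, R_eps, rng):
    """One EnKF analysis step, following Eqs. (6)-(11).

    Xf    : (Nx, Ne) ensemble of forecast states x_{i,t}^f
    H     : (Ny, Nx) linear observation operator
    y_obs : (Ny,)    observation vector Y^o at the current time
    R_eps : (Ny, Ny) observation error covariance matrix R_eps
    """
    Nx, Ne = Xf.shape
    Ny = len(y_obs)
    # Eq. (6): perturb the observations for every ensemble member
    Yo = y_obs[:, None] + rng.multivariate_normal(np.zeros(Ny), R_eps, Ne).T
    # Eqs. (9)-(10): ensemble mean and forecast covariance
    anomalies = Xf - Xf.mean(axis=1, keepdims=True)
    Pf = anomalies @ anomalies.T / (Ne - 1)
    # Eq. (11): Kalman gain K_t = Pf H^T (H Pf H^T + R_eps)^(-1)
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R_eps)
    # Eq. (7): analysis update of every member with its perturbed observation
    return Xf + K @ (Yo - H @ Xf)

# Toy usage: 3 state variables, 1 observation, 50 ensemble members.
rng = np.random.default_rng(1)
Xa = enkf_analysis(rng.normal(size=(3, 50)), np.array([[1.0, 0.0, 0.0]]),
                   np.array([0.8]), np.eye(1) * 0.01, rng)
```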

2.3.2 Ensemble smoother with multiple data assimilation

Equations (4)–(11) show that ensemble-based sequential data assimilation (e.g. EnKF, PF) updates the states at the times when observations occur, which makes it necessary to restart the simulations. The recurrent simulation may be inconvenient when the purpose is to incorporate different kinds of data for history matching. Therefore, the Ensemble Smoother with Multiple Data Assimilation (ESMDA) was proposed to update the states by assimilating all the available data simultaneously. Unlike sequential data assimilation, ESMDA does not require restarting the simulations, which enables it to obtain better data matches at lower computational cost.

ESMDA is an iterative Ensemble Smoother with a predefined number of data assimilation iterations. An inflation coefficient αi is applied to the measurement error in each iteration. The requirement on the inflation coefficients, given in Eq. (12), maintains the correct posterior mean and covariance.

$$\mathop \sum \limits_{i = 1}^{{N_{i} }} \frac{1}{{\alpha_{i} }} = 1$$
(12)

where Ni is the predefined number of data assimilation iterations. Clearly, many choices of inflation coefficients satisfy this requirement; for the determination of αi, refer to [49].

The inflation coefficient inflates the perturbation of all observation data and its covariance matrix, which leads to Eq. (13) and Eq. (14):

$$\begin{array}{*{20}c} {Y = Y^{o} + \sqrt {\alpha_{i} } \varepsilon } & {\varepsilon \sim N\left( {0, R_{\varepsilon } } \right)} \\ \end{array}$$
(13)
$$K = \frac{{P^{f} H^{T} }}{{HP^{f} H^{T} + \sqrt {\alpha_{i} } R_{\varepsilon } }}$$
(14)
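As a small worked example, the constant choice αi = Ni satisfies Eq. (12). The sketch below checks this condition and applies one ESMDA update with the inflated observation perturbation and gain as written in Eqs. (13)–(14); the function name esmda_update and the use of a linear H are assumptions.

```python
import numpy as np

Ni = 3                              # predefined number of data assimilation iterations
alphas = [float(Ni)] * Ni           # constant coefficients alpha_i = Ni satisfy Eq. (12)
assert abs(sum(1.0 / a for a in alphas) - 1.0) < 1e-12

def esmda_update(Xf, H, y_obs, R_eps, alpha, rng):
    """One ESMDA iteration over all observations, following Eqs. (13)-(14)."""
    Ne = Xf.shape[1]
    # Eq. (13): perturb all observations with the inflated error sqrt(alpha) * eps
    Yo = y_obs[:, None] + np.sqrt(alpha) * rng.multivariate_normal(
        np.zeros(len(y_obs)), R_eps, Ne).T
    anomalies = Xf - Xf.mean(axis=1, keepdims=True)
    Pf = anomalies @ anomalies.T / (Ne - 1)
    # Eq. (14): gain with the sqrt(alpha)-inflated observation covariance, as written in the text
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + np.sqrt(alpha) * R_eps)
    return Xf + K @ (Yo - H @ Xf)
```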

2.4 Training FNN with DA

Assume that the structure (the number of layers, the nodes in each layer and the connections between nodes) of the FNN for a specified problem is determined and represented by M*. The weights (w in Eqs. (1) and (2)) and biases (b in Eq. (2)) are regarded as the states (X*) of M*, which leads to \(X^{*} = \left( {\begin{array}{*{20}c} w \\ b \\ \end{array} } \right)\).

From the perspective of DA, substituting M in Eq. (4) with M*, we obtain:

$$\begin{array}{*{20}c} {X_{t}^{*f} = M_{t - 1}^{*} \left( {X_{t - 1}^{*a} } \right) + \xi^{*} } & {\xi^{*} \sim N\left( {0, R_{\xi } } \right)} \\ \end{array}$$
(15)
$$Y_{t}^{*} = H^{*} \left( {X_{t}^{*f} } \right)$$
(16)

where Yt* represents the outputs of M*, with elements yk as in Eq. (2); X* contains the parameters, which can be updated by Eqs. (7)–(11).

From the perspective of the FNN, the optimization of the parameters (w and b) in the backpropagation process is replaced by data assimilation. ESMDA can be used to train the FNN with historical data, and sequential data assimilation can then be used to adjust the ESMDA-trained model with real-time observations. The procedure of training the FNN by sequential data assimilation and ESMDA is shown in Fig. 2.

Fig. 2 The procedure of FNN trained by a sequential data assimilation; b ESMDA

The combination of FNN and DA can be summarized as Algorithm 1 (for EnKF) and Algorithm 2 (for ESMDA). Several hyper-parameters in Algorithm 1 and Algorithm 2 should be determined based on prior information about the actual situation.

Algorithm 1 (training FNN with EnKF)
Algorithm 2 (training FNN with ESMDA)
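To make the combined procedure concrete, the following self-contained sketch trains a small FNN on Sine data in the spirit of Algorithm 2 (ESMDA, offline) and then applies one online update in the spirit of Algorithm 1 (EnKF). As in the experiments of Sect. 3, only the biases between the hidden layer and the output layer are treated as states; the fixed random weights, the tanh activation, the function names (fnn, predict_ensemble) and the use of ensemble cross-covariances in place of PfHT for the nonlinear FNN operator are assumptions made for this illustration, not the exact implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small 1-10-1 FNN; only the output-layer biases are treated as uncertain states.
n_hidden = 10
W1 = rng.normal(size=(n_hidden, 1))   # fixed input -> hidden weights (illustrative)
W2 = rng.normal(size=(1, n_hidden))   # fixed hidden -> output weights (illustrative)

def fnn(x, biases):
    """Predict y for inputs x (shape (Nt,)) given one bias vector (shape (n_hidden,))."""
    h = np.tanh(W1 @ x[None, :])                 # Eq. (1)
    return (W2 @ h + biases.sum()).ravel()       # Eq. (2), biases entering the output sum

def predict_ensemble(x, B):
    """Predictions for every ensemble member; B has shape (n_hidden, Ne)."""
    return np.stack([fnn(x, B[:, i]) for i in range(B.shape[1])], axis=1)   # (Nt, Ne)

# Training observations (Sine case used as the example).
x_train = np.linspace(0.0, 2.0 * np.pi, 201)
y_train = np.sin(x_train)

# Algorithm 2 (ESMDA, offline): assimilate all observations Ni times.
Ne, Ni, R_eps = 50, 3, 0.1
alphas = [float(Ni)] * Ni                                   # satisfies Eq. (12)
B = rng.normal(scale=np.sqrt(0.1), size=(n_hidden, Ne))     # prior bias ensemble

for alpha in alphas:
    Y = predict_ensemble(x_train, B)                        # (Nt, Ne) predicted outputs
    Yo = y_train[:, None] + np.sqrt(alpha) * rng.normal(    # Eq. (13)
        scale=np.sqrt(R_eps), size=Y.shape)
    Bp = B - B.mean(axis=1, keepdims=True)
    Yp = Y - Y.mean(axis=1, keepdims=True)
    # Ensemble cross-covariances stand in for Pf H^T and H Pf H^T (nonlinear FNN operator).
    Cxy = Bp @ Yp.T / (Ne - 1)
    Cyy = Yp @ Yp.T / (Ne - 1)
    K = Cxy @ np.linalg.inv(Cyy + np.sqrt(alpha) * R_eps * np.eye(len(x_train)))
    B = B + K @ (Yo - Y)                                    # Eq. (7) with inflated gain

# Algorithm 1 (EnKF, online): update the ensemble when a new observation arrives.
x_new, y_new, R_new = np.array([0.3]), np.array([np.sin(0.3)]), 0.005
Y = predict_ensemble(x_new, B)
Yo = y_new[:, None] + rng.normal(scale=np.sqrt(R_new), size=Y.shape)     # Eq. (6)
Bp = B - B.mean(axis=1, keepdims=True)
Yp = Y - Y.mean(axis=1, keepdims=True)
K = (Bp @ Yp.T / (Ne - 1)) @ np.linalg.inv(Yp @ Yp.T / (Ne - 1) + R_new * np.eye(1))
B = B + K @ (Yo - Y)

y_hat = predict_ensemble(x_train, B).mean(axis=1)           # ensemble mean as the output
```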

3 Synthetic cases

The performance of the proposed integration of FNN and DA is validated through two synthetic cases. The main purpose of the synthetic cases is to analyse the capability of the proposed method to generate accurate estimates without gradient information by comparing its performance with that of the traditional GD method. In the synthetic cases, two regression datasets are generated from the Sine function and the Mexican Hat function (hereafter, the Sine function case and the Mexican Hat function case). Different optimization methods (GD, EnKF, ESMDA) are used to train the FNN model. The methods used in the synthetic cases are summarized in Table 2.

Table 2 Methods used in synthetic cases

3.1 Performance criteria

As recommended by [50], the root mean square error (RMSE) and the coefficient of determination (R2) are used to assess the performance of the different training algorithms in the two synthetic cases (Eqs. (20) and (21)). The RMSE measures the average magnitude of the error between the model simulations (M) and the observations (O). As Eq. (20) shows, the errors are squared before being averaged, so large errors receive relatively high weight; RMSE is therefore useful when large errors are undesirable. R2 measures the predictive ability of the models.

$$RMSE = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left( {M_{i} - O_{i} } \right)^{2} }$$
(20)
$$R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - M_{i} } \right)^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {O_{i} - \overline{O}} \right)^{2} }}$$
(21)

where N represents the total number of observations; \(\overline{O}\) is the average of observations.
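For reference, the two criteria can be computed directly from the paired simulations M and observations O; the function names below are illustrative.

```python
import numpy as np

def rmse(M, O):
    """Root mean square error, Eq. (20)."""
    M, O = np.asarray(M), np.asarray(O)
    return np.sqrt(np.mean((M - O) ** 2))

def r_squared(M, O):
    """Coefficient of determination, Eq. (21)."""
    M, O = np.asarray(M), np.asarray(O)
    return 1.0 - np.sum((O - M) ** 2) / np.sum((O - O.mean()) ** 2)
```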

In addition, the computation time is recorded as a criterion to assess the computational cost of the different models.

3.2 Data

In the Sine function case, two datasets (training data and validation data) are generated. The training data are generated on (0, 2π) with an interval of 0.01π, resulting in 201 samples. The validation data are generated on (0, 2π) with an interval of 0.1, resulting in 63 samples. Detailed information about the data is summarized in Table 3.

Table 3 Data description for the Sine function case

In the Mexican Hat function case, two datasets (training data and validation data) are generated using \({\uppsi }\left( t \right) = \frac{2}{{\sqrt {3\sigma } \pi^{\frac{1}{4}} }}\left( {1 - \left( {\frac{t}{\sigma }} \right)^{2} } \right)e^{{ - \frac{{t^{2} }}{{2\sigma^{2} }}}}\) with σ = 1. In particular, 200 samples are generated in [−5, 5] for the training dataset and 30 samples are generated in [−5, 5] for the validation dataset. Detailed information about the data is summarized in Table 4.

Table 4 Data description for the Mexican Hat function case
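The two datasets can be generated as follows; since the text only gives the intervals, ranges and sample counts, the inclusive endpoints of the Sine training grid and the even spacing of the Mexican Hat samples are assumptions.

```python
import numpy as np

# Sine case: 201 training points at a 0.01*pi interval and 63 validation points at a 0.1 interval.
x_train_sine = np.linspace(0.0, 2.0 * np.pi, 201)
x_valid_sine = np.arange(0.0, 2.0 * np.pi, 0.1)          # 63 points
y_train_sine, y_valid_sine = np.sin(x_train_sine), np.sin(x_valid_sine)

def mexican_hat(t, sigma=1.0):
    """Mexican Hat wavelet psi(t) as given in the text, with sigma = 1."""
    return (2.0 / (np.sqrt(3.0 * sigma) * np.pi ** 0.25)
            * (1.0 - (t / sigma) ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2)))

# Mexican Hat case: 200 training and 30 validation samples on [-5, 5] (even spacing assumed).
x_train_mh, x_valid_mh = np.linspace(-5.0, 5.0, 200), np.linspace(-5.0, 5.0, 30)
y_train_mh, y_valid_mh = mexican_hat(x_train_mh), mexican_hat(x_valid_mh)
```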

The data used in the two synthetic cases are shown in Fig. 3.

Fig. 3 Data used in the training stage and validation stage for a the Sine function case and b the Mexican Hat function case

3.3 Experimental settings

The architecture of the FNN is predefined as a fully connected network with one input layer, one hidden layer and one output layer. Based on the features of the dataset, the numbers of neurons in these layers are one, ten and one, respectively. The parameters (weights and biases) are randomly initialized from a normal distribution. The loss function for the gradient descent method is the mean square error (MSE), and 10,000 epochs with a learning rate of 0.12 are used to train the FNN with gradient descent. Without loss of generality, the biases between the hidden layer and the output layer are selected as the states to be updated by EnKF. The EnKF hyperparameters, namely the ensemble size Ne, the prior parameter covariance Rξ and the observation error covariance Rε, are 50, 0.1 and 0.005, respectively. The observation error covariance Rε is set to a small value because the observations used in EnKF are generated from the Sine wave, which is accurate and much more trustworthy than the FNN model. The EnKF update is used within ESMDA to conduct the data assimilation. The predefined number of data assimilation iterations Ni, the ensemble size Ne, the prior parameter covariance Rξ and the observation error covariance Rε are 3, 50, 0.1 and 0.1, respectively.
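Collected in one place, the settings described above might be written as follows; this simply restates the values given in this subsection, with the dictionary names chosen for illustration.

```python
# Experimental settings as described in the text (dictionary names are illustrative).
fnn_config   = {"layers": (1, 10, 1), "init": "random normal"}
gd_config    = {"loss": "MSE", "epochs": 10_000, "learning_rate": 0.12}
enkf_config  = {"Ne": 50, "R_xi": 0.1, "R_eps": 0.005}   # states: hidden-to-output biases
esmda_config = {"Ni": 3, "Ne": 50, "R_xi": 0.1, "R_eps": 0.1}
```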

4 Results and discussions

4.1 Performance of FNN model optimized by EnKF

In the Sine function case, the results calculated from the FNN optimized by the different methods are shown in Fig. 4. After 10,000 epochs of training, the FNN model trained with the gradient descent method approaches the Sine function with some bias. The RMSE and R2 values for the GD-optimized FNN model are 0.0948 and 0.9819. Although these criteria values are relatively acceptable, there is still bias at the peak and trough of the curve, which may be caused by the variation of the gradient and the static learning rate of the algorithm. In the experiment with the EnKF-optimized FNN model, the parameter ensembles are generated from a random normal distribution. To calculate the performance criteria, the ensemble mean is used as the final model output. The EnKF-optimized FNN model matches the Sine function better (the red curve in Fig. 4), with an RMSE of 0.0317 and an R2 of 0.9980. Each realization of the parameters can be regarded as a possible realization of the FNN. In contrast to the gradient descent algorithm, EnKF is capable of capturing the variation of the gradient because of the updating procedure in the algorithm. The evolution of the parameters (shown in Fig. 5) also reflects how the parameters are corrected to adapt to the larger gradient changes. After the FNN parameters (the biases from the hidden layer to the output layer) are randomly generated, the parameter uncertainty remains relatively large because of the large difference between the FNN model and the Sine wave, according to Eq. (7). The same situation can be found at x = 1.5π. In contrast, the parameter uncertainty is reduced for x ∈ (0.75π, 1.25π) because of the relatively small difference between the FNN model and the Sine wave (Fig. 5). These results indicate that EnKF optimizes the FNN model with higher accuracy than the gradient descent algorithm. Furthermore, EnKF is able to optimize the parameters of the FNN in real time by incorporating real-time observations, which is an intrinsic quality of the method.

Fig. 4 Comparison of the results from FNN optimized by Gradient Descent (black curve) and EnKF (red curve for ensemble mean and grey area for uncertain zone) in the Sine function case (color figure online)

Fig. 5 Parameters trained by EnKF and Gradient Descent in the Sine function case

In the Mexican Hat function case, the hyper-parameters of the FNN and EnKF are identical to those in the Sine function case. The results calculated from the FNN optimized by the different methods are shown in Fig. 6. The RMSE and R2 values for the GD-optimized FNN model are 0.0329 and 0.9891. In the experiment with the EnKF-optimized FNN, the ensemble means of the model outputs are used to calculate the performance criteria, giving an RMSE of 0.018 and an R2 of 0.9967. The better performance is attributed to the update scheme of the model states shown in Eqs. (7)–(11). The evolution of the parameters shown in Fig. 7 illustrates the update process. From Figs. 6 and 7, one can tell that the variance of the parameters is larger where the differences between the observations (\(Y_{t}^{o}\)) and the simulations (Yt) are large, which can be explained by Eq. (7). These results indicate that EnKF is able to optimize the parameters of the FNN with higher accuracy by implementing the update process.

Fig. 6 Comparison of the results from FNN optimized by Gradient Descent (black curve) and EnKF (red curve for ensemble mean and grey area for uncertain zone) in the Mexican Hat function case (color figure online)

Fig. 7 Parameters trained by EnKF and Gradient Descent in the Mexican Hat function case

4.2 Performance of FNN model optimized by ESMDA

In the Sine function case, the outputs and the corresponding parameters of the four iterations are shown in Figs. 8 and 9. Figure 8 displays the outputs of the models, with the grey area indicating the uncertain zone, the blue curve the Sine function, the black curve the optimized outputs from GD and the red curve the ensemble means of the outputs. In the first iteration, 50 samples of the parameters are randomly generated from a normal distribution with covariance matrix Rξ, and the FNN model is executed with the generated samples to yield outputs. The uncertain zone of the outputs in the first iteration is the largest because the random generation of the parameters (shown in Fig. 9) results in the largest parameter uncertainty. In the second iteration, the distributions of the parameters are updated by the EnKF, which significantly narrows the uncertain zone of the outputs. In the third and fourth iterations, the distributions of the parameters are only slightly updated, without significant effects on the outputs. The mean of the 50 ensemble members is considered the best estimate of the outputs in each iteration. The RMSE and R2 are calculated to conduct quantitative comparisons between the observations and simulations (Table 5). Table 5 indicates that better results are obtained by ESMDA than by GD, which demonstrates the effectiveness of ESMDA for updating the parameters. The evolution of the parameters (the biases from the hidden layer to the output layer) is shown in Fig. 9. In each panel, the histogram of the ensemble indicates the distribution of the parameter, and the red line indicates the ensemble mean of the updated parameter. The convergence of the parameters with increasing iterations indicates the effectiveness of ESMDA. The variances of the parameters decrease with the iterations, which can be observed in Fig. 9 from the narrowing of the uncertain range. The mean values of the parameters in Fig. 9, which correspond to the mean values of the trained results in Fig. 8, can be regarded as the optimal parameters for the FNN model.

Fig. 8 Comparison of the results from FNN optimized by Gradient Descent (black curve) and ESMDA (red curve for ensemble mean and grey area for each ensemble) in the Sine function case (color figure online)

Fig. 9 Parameters trained by ESMDA in the Sine function case

Table 5 RMSE and R2 values for the Gradient Descent and ESMDA methods in the Sine function case

In the Mexican Hat function case, the outputs and the corresponding parameters of the four iterations are shown in Figs. 10 and 11, respectively. Similar to Fig. 8, Fig. 10 shows the outputs of the models, with the grey area indicating the uncertain zone, the blue curve the Mexican Hat function, the black curve the optimized outputs from GD and the red curve the ensemble means of the outputs. The uncertain zone of the outputs in the first iteration is the largest because of the random generation of the parameters (shown in Fig. 11). In iterations 2 and 3, the uncertain zone of the outputs keeps narrowing owing to the parameter updating process of ESMDA. It should be noted that the uncertainties of the parameters are larger where the gradient of the Mexican Hat function is close to zero (i.e. around x = ±3, x = ±\(\sqrt 3\) and x = 0). The differences between the outputs of the FNN and the Mexican Hat function are also relatively larger at these points. The reason may again lie in Eq. (7), as described in Sect. 4.1. This phenomenon indicates the adjustment of the parameters (shown in Fig. 11) according to the observations, which also demonstrates the effectiveness of the updating processes in ESMDA. It should also be noted that the variance of the parameters is not enough to cover some points of the model outputs (e.g. ±\(\sqrt 3\) in Fig. 10). This may be because only the biases between the hidden layer and the output layer are perturbed and updated; involving more parameters (for instance, the weights in Eqs. (1)–(2)) in the perturbation and optimization may solve this problem. Quantitative comparisons between the observations and simulations are conducted by calculating RMSE and R2 (Table 6). Table 6 shows that better results are obtained by ESMDA than by Gradient Descent. The evolution of the parameters is shown in Fig. 11. In each panel, the histogram of the ensemble indicates the distribution of the parameter, and the red line indicates the ensemble mean of the updated parameter. The variances of the parameters decrease with the iterations, which results in the narrowing of the uncertain zones in Fig. 10.

Fig. 10 Comparison of the results from FNN optimized by Gradient Descent (black curve) and ESMDA (red curve for ensemble mean and grey area for uncertain zone) in the Mexican Hat function case (color figure online)

Fig. 11 Parameters trained by ESMDA and Gradient Descent in the Mexican Hat function case

Table 6 RMSE and R2 values for the Gradient Descent and ESMDA methods in the Mexican Hat function case

4.3 Validation

The FNNs trained by GD and ESMDA are then validated using the validation data generated in Sect. 3.2. The ensemble means of the ESMDA parameters are used as the optimal parameters. The EnKF-trained FNN is not validated for two reasons. First, EnKF is used for real-time training (online learning) in the proposed framework, which conducts continuous learning whenever new observations become available. Second, EnKF optimizes the parameters based on the observations at a particular time step. The validation results for the Sine function case and the Mexican Hat function case are illustrated in Fig. 12, which shows a considerable match in both cases. Table 7 shows the RMSE and R2 values for the Gradient Descent and ESMDA methods in the validation stage; comparison with Tables 5 and 6 indicates reasonable training. The generalization abilities of GD and ESMDA are both acceptable, and the performance of ESMDA is slightly better than that of GD.

Fig. 12 The validation of the Gradient Descent and ESMDA trained FNN for: a the Sine function case; b the Mexican Hat function case

Table 7 RMSE and R2 values for the Gradient Descent and ESMDA methods in the validation stage

4.4 Uncertainty analysis

Uncertainty is inevitable in all kinds of prediction models, including neural networks. Owing to the nature of EnKF, the uncertainty of the FNN parameters (i.e. the biases) is quantified through the generation of parameter ensembles. There are many ways to describe parameter uncertainty via ensemble analysis, including histograms, probabilities, ensemble means, standard deviations (STD) and correlations. Figures 9 and 11 are examples that use histograms to show how the uncertainty of the parameters changes over the iterations. “Iteration 1” denotes the initial step of ESMDA, in which the parameters are randomly generated from a normal distribution. The ensemble spread of the parameters decreases after the first iteration and continues to narrow as ESMDA proceeds. The narrowing of the uncertainty range is observed for many, but not all, of the parameters.

A quantitative summary of the uncertainty is provided in terms of the ensemble maxima, ensemble minima, ensemble means and STDs of all the parameters in the Sine function case (Table 8) and the Mexican Hat function case (Table 9). It should be noted that statistical values are used because ESMDA works with ensembles. The parameters are listed in the columns, and the ensemble maxima, minima, means and STDs in the different iterations are given in the rows. In both cases, the ensemble STDs of all the parameters narrow along the iterations, which indicates a reduction in uncertainty. However, different degrees of narrowing are observed for different parameters. The narrowing for “Bias 5” in the Sine function case and “Bias 3 ~ Bias 8” in the Mexican Hat function case indicates a significant reduction in uncertainty and reveals that these particular parameters are identifiable from the data (also shown in Figs. 9 and 11). For the other parameters in the Sine function case and the Mexican Hat function case, the ensemble spreads narrow only a little over the three iterations, which indicates that the data contain relatively little information about these parameters and that they are less identifiable. Tables 8 and 9 indicate that, for most parameters, the largest decrease in uncertainty happens in the first update (from “Iteration 1” to “Iteration 2”). Correspondingly, the uncertain zone of the FNN model outputs narrows the most in the first iteration (shown in Figs. 8 and 10). On the one hand, this suggests the capability of ESMDA to update the parameter uncertainty; on the other hand, it also suggests that, even though nonlinearity exists, the overall relationship between the parameters and the data is relatively linear. The ensemble means over the iterations in Tables 8 and 9 indicate the capability of ESMDA to update the parameter values. Comparing the ensemble means in the Sine function case (Table 8) with those in the Mexican Hat function case (Table 9), we find that the degree of updating of the ensemble means is larger in the Sine function case than in the Mexican Hat function case. The reason may be that relatively good parameters are already obtained in the prior estimates in the Mexican Hat function case. This phenomenon is also revealed in the smaller degree of updating of the STDs in the Sine function case than in the Mexican Hat function case.
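The statistics reported in Tables 8 and 9 can be computed directly from the parameter ensemble at each iteration; the array layout (parameters in rows, ensemble members in columns) and the function name are assumptions made for this sketch.

```python
import numpy as np

def ensemble_summary(B):
    """Per-parameter maximum, minimum, mean and STD of an ensemble B of shape (n_params, Ne)."""
    return {
        "max": B.max(axis=1),
        "min": B.min(axis=1),
        "mean": B.mean(axis=1),
        "std": B.std(axis=1, ddof=1),   # sample standard deviation across the Ne members
    }

# e.g. one summary per ESMDA iteration:
# summaries = [ensemble_summary(B_iter) for B_iter in bias_ensembles_per_iteration]
```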

Table 8 Statistical values of FNN parameters from ESMDA in the Sine function case
Table 9 Statistical values of FNN parameters from ESMDA in the Mexican Hat function case

Furthermore, a comparison of the computation cost is conducted to approximately assess the complexity of the different training methods (GD, EnKF and ESMDA). The computation times recorded under the same software and hardware runtime environment (Table 10) indicate the computation cost. The computation cost of ESMDA is similar to that of GD, whereas EnKF is much more computationally expensive than GD and ESMDA, which reflects the fact that the major drawback of EnKF is its computation cost. This drawback is understandable given that EnKF is derived from the merge of the Kalman Filter [30] and Monte Carlo estimation methods [31]: EnKF restarts and executes the FNN Ne (ensemble size) times at each time step when observations become available. In our synthetic cases, the numbers of observation time steps (Nt) are 201 for the Sine function case and 200 for the Mexican Hat function case, so the FNN is executed Ne × Nt times (50 × 201 or 50 × 200) in EnKF, which consumes considerable computation time. However, in the proposed training framework, EnKF is used for real-time training (online learning), which means that only Ne executions of the FNN are needed when a new observation becomes available. This process incorporates new information from the observations and avoids retraining the FNN.

Table 10 The computation time of different training methods (GD, EnKF, ESMDA) in the Sine function case and the Mexican Hat function case

5 Conclusion

In this paper, a new training framework for neural networks based on data assimilation is proposed to avoid the calculation of gradients during neural network training. Feedforward Neural Networks (FNNs), the Ensemble Kalman Filter (EnKF) and the Ensemble Smoother with Multiple Data Assimilation (ESMDA) are used to validate the proposed framework. Synthetic cases with data generated from the Sine function and the Mexican Hat function are implemented to test the methods. EnKF updates the parameters whenever observations become available, which can be regarded as real-time training (online learning). ESMDA updates the parameters using all the available observations with a predefined number of data assimilation iterations, which can be regarded as normal training (offline learning) in the sense of conventional methods. The results from the EnKF-optimized and ESMDA-optimized FNN models show higher accuracy than those from the gradient-descent-optimized FNN model, which indicates the effectiveness of training FNNs with EnKF and ESMDA. Furthermore, the uncertainty of the FNN model parameters is quantified at the same time. The major advantages of the proposed data-assimilation-based training methods are (1) the avoidance of gradient calculation, (2) the ability to train in real time when observations become available and (3) the uncertainty analysis of the neural network parameters. Although only FNN, EnKF and ESMDA are implemented as examples in this study, the potential of data assimilation algorithms for training neural networks is considerable. Future work may include exploring other data assimilation algorithms (e.g. the Particle Filter), exploring other kinds of neural networks (e.g. Recurrent Neural Networks, Graph Neural Networks), involving more neural network parameters and validating the methods with real observation data.