1 Introduction

Data-driven discovery is an emerging and important field of engineering mathematics and mathematical physics concerned with identifying the governing equations of dynamical systems from noisy state measurements. The amount of digitally stored information has exploded over the past two decades, facilitating a variety of innovations at the growing intersection of data-driven methods and applied optimization. Research in this direction aims to discover the structure that explains the spatiotemporal activity of a system, thereby providing more comprehensive information that can deepen understanding and analysis of the system's structure and operating mechanism. It is therefore significant and meaningful to investigate how to learn more accurate dynamical systems from abundant observational or simulated time-series data and to grasp their physical behavior and interpretability.

In recent years, many efforts have been made to construct new models or explore computational methods for identifying the governing equations of systems from measured data [1, 2]. Early methods for building possible nonlinear chaotic systems from data were discussed in [3]. Advanced regression methods from statistics, such as sparse regression, are driving new algorithms that identify parsimonious nonlinear dynamical systems from measurements of complex systems. When the expression of the dynamical system is unknown, a method of approximating it over an overcomplete functional space of the state variables has been studied; it eliminates expansion terms that contribute nothing to the dynamics and has attracted growing attention [4, 5]. Symbolic regression has been used in past work to find the equations of dynamical systems directly from observational data [6,7,8,9,10] and to predict and identify dynamical systems [11,12,13]. However, it can easily lead to over-fitting.

With the influx of large amounts of experimental data, regression and machine learning are becoming important tools for discovering dynamical systems from data. The Least Absolute Shrinkage and Selection Operator (LASSO) algorithm was proposed to alleviate the influence of noise when identifying nonlinear dynamical systems from observational experimental data [14]. Based on minimizing the coefficient error of a polynomial basis expansion, Wang et al. [15] applied compressed sensing to discover nonlinear dynamical systems. Brunton et al. [16] proposed the sparse identification of nonlinear dynamics (SINDy) method, which employs the Sequentially Thresholded Least Squares (STLS) algorithm to learn governing equations from an overcomplete library of candidate basis functions. SINDy has been extended to systems with inputs and control [17], to dynamical systems with rational-function nonlinearities [18], to integral formulations [19], and to highly corrupted and incomplete data [20]. It has also been extended to include partial derivatives, making it possible to identify partial differential equations [21, 22]. Most system identification methods take the form of a regression from data to dynamics, and the difference between the various techniques lies in how this regression is constrained.

In all the above methods, a common assumption is that the data are collected from a single measurement source or simulation experiment, and all the data are discrete points obtained by discretizing the system in time or space. In practical studies of dynamical systems, however, obtaining data from complex systems can consume considerable manpower and financial resources, while system identification demands high accuracy. It is therefore necessary to make full use of all the information collected, in as many ways as possible; such information usually exhibits prominent similarities that are generally ignored. Separate estimates lack the power to exploit important shared patterns because of the limited size of each individual data source, while simply merging the datasets may introduce large errors. We therefore use the similarity information to improve the performance of dynamical system identification through joint analysis of multiple datasets. The fused lasso penalty is chosen because it has been successfully applied to jointly estimate multiple graphical models for discovering differential dependency networks [23,24,25]. To the best of the authors' knowledge, applying it to the identification of nonlinear dynamical systems from multiple measurement datasets, achieving sparsity and more accurate estimates simultaneously, remains unexplored. Hence, there is growing interest in fully mining the information in multiple datasets to fill this gap.

Motivated by this, we propose a novel joint sparse least-square model via a generalized fused lasso penalty (JSLSGF) to jointly identify multiple sparse least-square models corresponding to different data sources. As illustrated in Fig. 1 with data from \(L\) measurement sources, the main idea is to exploit the similarity of multiple sources, thereby improving the identification accuracy of the system. To this end, the generalized fused lasso penalty is imposed on the coefficient values of the basis functions: the \(l_{1} {\text{ - norm}}\) penalty on the system coefficients of each dataset encourages sparsity in each individual coefficient vector, exploiting the sparsity prior of system identification, while the \(l_{1} {\text{ - norm}}\) penalty on the difference between the system coefficients of every two datasets encourages them to share a similar physical mechanism, exploiting the similarity among different measurements. Inspired by the optimization framework of [26], we design an effective two-step algorithm to solve this optimization problem and develop a selection criterion for the threshold parameter in the first step of the algorithm based on the L-curve criterion. Using multiple simulated test datasets and a comprehensive comparison, we demonstrate the effectiveness of the proposed method in identifying nonlinear dynamical systems from multiple measurements.

Fig. 1 Comparison of the separate sparse least squares and joint sparse least squares

The rest of the paper is arranged as follows: In Sect. 2, as a preliminary, the background of identifying dynamical systems from state measurements by sparse regression techniques is introduced. In Sect. 3, the JSLSGF model and the threshold joint sparse least-square algorithm are proposed, and the selection of the threshold parameter and the computation of the time derivatives of the state variables are discussed. In Sect. 4, three numerical experiments are presented to illustrate that our method often improves accuracy compared with separate analysis of each data source. Section 5 concludes the paper.

2 Problem and background

Dynamical systems provide physical insight into and interpretability of a system's behavior through the trajectories and solutions obtained from the governing equations of motion. In most regression problems, only a few terms are important, so a feature selection mechanism is required. Consider the dynamical system [27]

$$ \dot{x}(t) = \frac{{{\text{d}}x(t)}}{{{\text{d}}t}} = f(x(t)),\;x(0) = x_{0} . $$
(1)

The (known or measurable) state variable \(x(t)\) and its derivative \(\dot{x}(t)\) describe the evolution of the system at discrete time points \(t_{1} ,t_{2} , \ldots ,t_{m}\), where \(m\) is the number of samples; the function \(f(x(t))\) constrains the dynamics that govern the motion of the system. Our goal is to seek a parsimonious system in which \(f(x(t))\) includes only a few active terms, meaning that it is sparsely expressed in the candidate basis functions.

We review SINDy, which is formulated as a sparse regression problem. An expanded library of \(p\) candidate functions \(\Phi (x) = [\phi_{1} (x) \cdots \phi_{p} (x)] \in {\mathbb{R}}^{p}\) is built, wherein each \(\phi_{i} (x), \, i = 1, \ldots ,p\) is a candidate function term. The selection and use of the basis functions usually reflect some knowledge of the system of interest, and gathering such information often requires domain knowledge. Common choices of basis functions are constants, polynomials, and trigonometric functions; we select polynomials because the Taylor expansion of a complex nonlinear system is usually polynomial. The basis functions are used to establish an overdetermined linear system that can be written as:

$$ \dot{x}(t) = \Phi (x)\xi , $$
(2)

where the unknown vector \(\xi\) holds the estimated sparse coefficient values of the basis functions and decides which terms are active in \(f(x(t))\).
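As a concrete illustration (our own sketch, not the authors' code), the following minimal NumPy routine builds such a monomial library; the function name polynomial_library and the example dimensions are ours.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_library(X, degree):
    """Build the candidate library Phi(x): one column per monomial of the
    state variables up to the given total degree.
    X has shape (m, n): m time samples, n state variables."""
    m, n = X.shape
    columns = [np.ones(m)]                       # constant term
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            columns.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(columns)

# Example: a 3-state system with monomials up to degree 3 gives the
# 20-term library used later for the Lorenz system in Sect. 4.1.
X = np.random.randn(100, 3)
Phi = polynomial_library(X, degree=3)
assert Phi.shape == (100, 20)
```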

Equation (2) does not hold exactly in practice because there is always noise. Instead, one calculates a coefficient vector \(\xi\) satisfying the least-square constraint

$$ \left\| {\Phi (x)\xi - \dot{x}} \right\|_{2} \le \delta , $$
(3)

where \(\delta \ge 0\) is a tolerance parameter to avoid over-fitting. To solve Eq. (3) and obtain a sparse solution, one usually relaxes the \(l_{0} {\text{ - norm}}\) via the convex \(l_{1} {\text{ - norm}}\) penalty

$$ \mathop {\min }\limits_{\xi } \left\| \xi \right\|_{1} \quad {\text{s}}.{\text{t}}.\;\left\| {\Phi (x)\xi - \dot{x}} \right\|_{2} \le \delta . $$
(4)

The unconstrained expression is given by:

$$ \mathop {\arg \min }\limits_{\xi } \frac{1}{2}\left\| {\Phi (x)\xi - \dot{x}} \right\|_{2}^{2} + \lambda \left\| \xi \right\|_{1} , $$
(5)

where \(\lambda\) is a nonnegative parameter that balances the trade-off between fitting error and sparsity.
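For reference, a minimal sketch of the STLS iteration that SINDy uses as a thresholding alternative to the \(l_{1}\)-penalized problem (5) for a single state variable might look as follows; this is our illustration under the stated assumptions, not the authors' implementation.

```python
import numpy as np

def stls(Phi, dxdt, alpha, n_iters=10):
    """Sequentially thresholded least squares (STLS) for one state variable.
    Phi: (m, p) library matrix, dxdt: (m,) time-derivative samples,
    alpha: relative threshold in (0, 1), cf. the criterion in Eq. (10)."""
    xi = np.linalg.lstsq(Phi, dxdt, rcond=None)[0]       # initial LS fit
    for _ in range(n_iters):
        small = np.abs(xi) < alpha * np.abs(xi).max()    # prune small terms
        xi[small] = 0.0
        active = ~small
        if not active.any():
            break
        # refit least squares on the surviving columns only
        xi[active] = np.linalg.lstsq(Phi[:, active], dxdt, rcond=None)[0]
    return xi
```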

Enabled by the unprecedented availability of data and computational resources, datasets from multiple measurement sources can be collected using different devices or collection methods. Such datasets exhibit prominent similarity that is usually neglected, even though it can enhance system identification performance.

3 Joint sparse least squares via generalized fused lasso penalty approach

3.1 Generalized model

To make full use of the similarity information among multiple noisy state measurements and their derivatives, and thereby boost the accuracy and robustness of SINDy for dynamical system identification, we add a generalized fused lasso penalty to the sparse least-square model. The generalized fused lasso penalty encourages the sparse vectors to share a similar system mechanism, which is crucial for achieving the fusion of the solutions. As described in the introduction and shown in Fig. 1, we assume that \(L\) state measurements are collected and propose the joint sparse least squares via generalized fused lasso penalty (JSLSGF) model as

$$ \mathop {\arg \min }\limits_{{[\xi_{1} , \ldots ,\xi_{L} ]}} \sum\limits_{l = 1}^{L} {\frac{1}{2}\left\| {\Phi_{l} (x)\xi_{l} - \dot{x}_{l} } \right\|_{2}^{2} } + \sum\limits_{l = 1}^{L} {\lambda \left\| {\xi_{l} } \right\|_{1} } + \sum\limits_{{l < l^{\prime } }} {\beta \left\| {\xi_{l} - \xi_{{l^{\prime } }} } \right\|_{1} } , $$
(6)

where \([\xi_{1} , \ldots ,\xi_{L} ] \in {\mathbb{R}}^{p \times L}\) are the \(L\) sparse coefficient vectors of the system; each sparse column vector corresponds to the basis function coefficient values of one data source and determines which terms contribute among the candidate basis functions, and \(\lambda\) and \(\beta\) are nonnegative parameters. Note that the penalty terms (the second and third terms) of the objective function (6) are convex, so the objective function (6) is convex with respect to \(\xi\). Our approach is inspired by the work [28], wherein the generalized fused lasso has been shown to outperform the standard \(l_{1} {\text{ - norm}}\) in promoting sparsity and accuracy from varied data sources. The \(l_{1} {\text{ - norm}}\) on each \(\xi_{l}\), controlled by \(\lambda\), encourages sparsity in each identified coefficient vector. The \(l_{1} {\text{ - norm}}\) penalty on the differences between every two coefficient vectors from different data sources encourages them to share a similar physical mechanism, which exploits the similarity among different measurements. The parameter \(\beta\) plays a vital role in adjusting the degree of fusion. No fusion occurs across the sparse vectors when \(\beta = 0\), while as \(\beta \to \infty\) the objective function (6) forces all the sparse vectors to be identical. In particular, for \(L = 1\), the proposed model reduces to traditional sparse least squares.
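Since the objective (6) is convex, it can in principle be handed to a generic convex solver. The sketch below, using the CVXPY library, is only meant to make the model concrete (the paper instead solves it by the two-step algorithm of Sect. 3.2); all names are ours.

```python
import cvxpy as cp

def jslsgf(Phis, dxs, lam, beta):
    """Solve the JSLSGF objective (6) directly for L data sources.
    Phis: list of L library matrices (m x p), dxs: list of L derivative
    vectors (length m); lam and beta are the penalty weights."""
    L, p = len(Phis), Phis[0].shape[1]
    Xi = cp.Variable((p, L))                  # one coefficient column per source
    fit = sum(0.5 * cp.sum_squares(Phis[l] @ Xi[:, l] - dxs[l])
              for l in range(L))
    sparsity = lam * cp.sum(cp.abs(Xi))       # l1 penalty on every xi_l
    fusion = beta * sum(cp.sum(cp.abs(Xi[:, l] - Xi[:, k]))
                        for l in range(L) for k in range(l + 1, L))
    cp.Problem(cp.Minimize(fit + sparsity + fusion)).solve()
    return Xi.value
```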

3.2 The proposed optimization algorithm

To solve the JSLSGF model, which is a special case of fused lasso signal approximation [29], a two-step (sparsification and fusion) approach is efficient and available, as in [23, 26].

In the sparsification step, a preliminary sparse vector is computed by setting \(\beta = 0\), which gives

$$ \mathop {\arg \min }\limits_{{[\xi_{1} , \ldots ,\xi_{L} ]}} \sum\limits_{l = 1}^{L} {\frac{1}{2}\left\| {\Phi_{l} (x)\xi_{l} - \dot{x}_{l} } \right\|_{2}^{2} } + \sum\limits_{l = 1}^{L} {\lambda \left\| {\xi_{l} } \right\|_{1} } . $$
(7)

Since the convergence of the sequentially thresholded least-square (STLS) algorithm used in SINDy has been established [26], we utilize the STLS algorithm to improve the numerical robustness of the system identification in the calculation process. Setting the derivative of \(\sum\nolimits_{l = 1}^{L} {\frac{1}{2}\left\| {\Phi_{l} (x)\xi_{l} - \dot{x}_{l} } \right\|_{2}^{2} }\) with respect to \(\xi_{l}\) to zero, we obtain

$$ \Phi_{l}^{T} (x)(\Phi_{l} (x)\xi_{l} - \dot{x}_{l} ) = 0, \, l = 1, \ldots ,L, $$
(8)

which yields the preliminary solution

$$ \xi_{l} = (\Phi_{l}^{T} (x)\Phi_{l} (x))^{ - 1} \Phi_{l}^{T} (x)\dot{x}_{l} ,\quad l = 1, \ldots ,L. $$
(9)

We introduce the thresholding criterion

$$ M = \left\{ {j:\left| {\xi_{lj} } \right| \ge \alpha \times \mathop {\max }\limits_{1 \le i \le p} \left| {\xi_{li} } \right|} \right\},\quad j = 1, \ldots ,p, $$
(10)

where the threshold parameter \(\alpha \in (0,1)\). Its selection is challenging because the content of \(\xi_{l}\) is unknown and depends mainly on the candidate basis functions. To alleviate this problem, we adopt the L-curve criterion to guide the selection of the threshold parameter \(\alpha\), yielding a much less sensitive calculation process. The preliminary sparse solution is then obtained.
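A minimal sketch of one sparsification pass for a single source, combining the least-squares solution (9) with the thresholding criterion (10), assuming \(\Phi_{l}^{T} \Phi_{l}\) is invertible:

```python
import numpy as np

def sparsification_pass(Phi, dxdt, alpha):
    """One sparsification pass for a single source: the least-squares
    solution (9) followed by the thresholding criterion (10)."""
    xi = np.linalg.solve(Phi.T @ Phi, Phi.T @ dxdt)   # Eq. (9)
    keep = np.abs(xi) >= alpha * np.abs(xi).max()     # the set M in Eq. (10)
    xi[~keep] = 0.0
    return xi, keep
```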

In the fusion step, setting \(\lambda = 0\), the objective function (6) can be rewritten as:

$$ \mathop {\arg \min }\limits_{{[\xi_{1} , \ldots ,\xi_{L} ]}} \sum\limits_{l = 1}^{L} {\frac{1}{2}\left\| {\Phi_{l}^{M} (x)\xi_{l}^{M} - \dot{x}_{l} } \right\|_{2}^{2} } + \sum\limits_{{l < l^{\prime } }} {\beta \left\| {\xi_{l}^{M} - \xi_{{l^{\prime } }}^{M} } \right\|_{1} } , $$
(11)

where \(\Phi_{l}^{M} (x)\) is the updated matrix composed of the columns of \(\Phi_{l} (x)\) whose positions correspond to the rows of the nonzero terms of the sparse vector \(\xi_{l}\) derived in the sparsification step, and \(\xi_{l}^{M}\) is the vector of all the nonzero terms in \(\xi_{l}\); the entries set to zero are not considered in the subsequent iterations, which greatly reduces the computational complexity. This step fuses the terms in the sparse vectors that do not differ significantly (depending on \(\beta\)). We say that the \(j{\text{ - th}}\) polynomial term in \(\Phi_{l} (x)\) and \(\Phi_{{l^{\prime } }} (x)\) is fused between the \(l\)-th and \(l^{\prime}\)-th data sources if \(\xi_{lj} = \xi_{{l^{\prime}j}}\).
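The fusion step can again be posed as a small convex problem on the reduced support. A sketch with CVXPY, assuming for simplicity that the sparsification step kept the same support \(M\) for every source (our simplification, not necessarily the paper's setting):

```python
import cvxpy as cp

def fusion_step(Phis_M, dxs, beta):
    """Fusion step (11) on the reduced support. Phis_M holds the library
    matrices restricted to the columns kept by the sparsification step,
    assumed identical across sources in this sketch."""
    L, q = len(Phis_M), Phis_M[0].shape[1]
    Xi = cp.Variable((q, L))
    fit = sum(0.5 * cp.sum_squares(Phis_M[l] @ Xi[:, l] - dxs[l])
              for l in range(L))
    fusion = beta * sum(cp.sum(cp.abs(Xi[:, l] - Xi[:, k]))
                        for l in range(L) for k in range(l + 1, L))
    cp.Problem(cp.Minimize(fit + fusion)).solve()
    return Xi.value
```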

The above two steps are iterated until the absolute difference between the sparse vectors of consecutive iterations reaches a minimum.

To assess the quality of the solution obtained by STLS and JSLSGF, the arithmetic mean vector \(\overline{\xi }\) of \(L\) sparse solution vectors is introduced as the final solution. The relative error of the solution is reported as:

$$ \varepsilon_{\xi } = \frac{{\left\| {\overline{\xi } - \xi } \right\|_{2} }}{{\left\| \xi \right\|_{2} }}, $$
(12)

where \(\overline{\xi }\) and \(\xi\) are the approximate and exact solution vectors of each state variable, respectively. The iteration terminates when the error of the solution vector satisfies \(\left\| {\overline{\xi }^{w + 1} - \overline{\xi }^{w} } \right\|_{2}^{2} \le \varepsilon_{0} \, (\varepsilon_{0} = 10^{ - 10} )\). The pseudo-code of the proposed threshold joint sparse least-square algorithm is summarized below.

Algorithm 1 The threshold joint sparse least-square algorithm

3.3 Parameter selection

There are three tuning parameters, \(\lambda\), \(\alpha\) and \(\beta\), in the joint sparse least-square model. The parameter \(\lambda\) plays a minor role and does not affect the operation of the algorithm, since we adopt a thresholding algorithm in the sparsification step and the \(l_{1} {\text{ - norm}}\)-penalized sparse regression problem does not depend on it theoretically. Therefore, only the two remaining parameters \(\alpha\) and \(\beta\) need to be tuned. The former controls the number of polynomial terms selected from the candidate basis functions; the latter determines how similar the derived coefficient values of the selected polynomial terms are. In this work, we utilize a two-step parameter selection method.

In the sparsification step, we note that as the threshold parameter \(\alpha\) increases, the model complexity (the number of nonzero coefficients in the sparse solution vectors) decreases and the residual norm grows, so the results may lose fidelity. In contrast, for a small \(\alpha\), the residual norm is very small, leading to an over-fitted solution in practice. A proper \(\alpha\) should therefore balance data fidelity against solution sparsity, making both norms small simultaneously. The Pareto method has been shown to outperform cross-validation for sparse approximation [30]. Motivated by the previous study [31], we employ a linear scale instead of log–log to display the solution. The curves of the residual norm and the solution norm as the threshold parameter \(\alpha\) varies are used to determine the optimal \(\alpha\) according to the L-curve criterion, wherein we utilize one state measurement (\(L = 1\)) and its derivative and set \(\beta = 0\). This process is illustrated in the following numerical examples.
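The sweep behind the L-curve can be sketched as follows, reusing the stls routine sketched in Sect. 2; here Phi and dxdt stand for one source's library matrix and derivative samples (assumed already computed):

```python
import numpy as np

# Sweep alpha and record the two norms whose trade-off defines the L-curve.
alphas = np.arange(1e-4, 1.0, 1e-4)
residual_norms, solution_norms = [], []
for a in alphas:
    xi = stls(Phi, dxdt, a)
    residual_norms.append(np.linalg.norm(Phi @ xi - dxdt))   # data fidelity
    solution_norms.append(np.linalg.norm(xi, 1))             # sparsity proxy
# Plotting (residual_norms, solution_norms) on a linear scale and locating
# the corner of the curve suggests the admissible range of alpha.
```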

In the fusion step, the joint sparse least-square algorithm is fitted with the fixed \(\alpha\) from the sparsification step and a candidate set of \(\beta\) values. Blind grid search is extremely time-consuming, so we employ some tricks to accelerate the tuning procedure. The major difference between the traditional sparse least squares and the joint sparse least squares is the last penalty term. On the one hand, JSLSGF reduces to STLS when \(\beta\) is too small, i.e., \(\beta < \min_{j} \left| {(\Phi_{l}^{M} (x))^{T} \dot{x}_{l} } \right|\). On the other hand, JSLSGF over-fuses the result if \(\beta\) is very large, i.e., \(\beta > \max_{j} \left| {(\Phi_{l}^{M} (x))^{T} \dot{x}_{l} } \right|\), which forces all the sparse vectors to be identical. It is therefore reasonable to use a parameter range that is neither too large nor too small in the simulation experiments. We conduct a series of experiments to display the identification performance of the system coefficients with a proper \(\beta\).
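The resulting candidate range for \(\beta\) can be computed directly; a small sketch:

```python
import numpy as np

def beta_bounds(Phi_M, dxdt):
    """Heuristic bounds for the fusion weight discussed above: below the
    minimum, JSLSGF behaves like STLS; above the maximum, it over-fuses
    (all sparse vectors become identical)."""
    corr = np.abs(Phi_M.T @ dxdt)
    return corr.min(), corr.max()
```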

3.4 Numerical computation of state time derivatives

In many practical applications, the state trajectory \(x\) is available, and its derivative \(\dot{x}\) can usually be obtained by the finite difference method for ordinary differential equations (ODEs). To mitigate the error caused by differentiation, we employ the total variation regularization method [32] to compute the derivatives, which de-noises the derivative, retains more edge information, and gives a more robust estimate [33].
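A simplified sketch of total-variation-regularized differentiation is given below; it minimizes a smoothed version of the TV objective with an off-the-shelf optimizer, and the regularization weight gamma and smoothing constant eps are illustrative choices of ours, not values from [32] or this paper.

```python
import numpy as np
from scipy.optimize import minimize

def tv_derivative(x, dt, gamma=1e-2, eps=1e-8):
    """Estimate u ~ dx/dt from noisy samples x by minimizing
    0.5 * ||A u - (x - x[0])||^2 + gamma * sum |u[i+1] - u[i]|,
    where A is cumulative (rectangle-rule) integration; the absolute
    value is smoothed as sqrt(d^2 + eps) to allow gradient-based solvers."""
    m = x.size
    A = dt * np.tril(np.ones((m, m)))          # integration operator
    b = x - x[0]
    D = np.diff(np.eye(m), axis=0)             # finite-difference operator

    def objective(u):
        r = A @ u - b
        d = D @ u
        s = np.sqrt(d**2 + eps)                # smoothed |d| for the TV term
        f = 0.5 * r @ r + gamma * s.sum()
        g = A.T @ r + gamma * (D.T @ (d / s))  # analytic gradient
        return f, g

    u0 = np.gradient(x, dt)                    # finite-difference initial guess
    res = minimize(objective, u0, jac=True, method="L-BFGS-B")
    return res.x
```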

4 Numerical examples

In the simulation study, we mainly investigate whether joint sparse least squares with the generalized fused lasso penalty can boost the performance of STLS, in line with our motivation; it is therefore reasonable to use STLS as the benchmark for comparison. For STLS, we estimate the coefficient values of the system individually on each state measurement of every case and then take their arithmetic mean as the final coefficient values; this is denoted STLS (Separate), the separate sparse least squares in Fig. 1. The other benchmark, STLS (Combined), identifies the coefficient values of the system when all state measurements with different noise levels are combined into one dataset.

We present three numerical examples to explore the validity and applicability of JSLSGF for identifying dynamical systems. To study the performance of the proposed method in different scenarios, several cases are constructed to assess the effect of fusion in every example. In this work, we simulate two datasets (\(L = 2\)) with different noise levels for every experimental case to examine the effect of the proposed model on identification performance. We assume no prior knowledge of the governing equations that generate the data, except that they can be sparsely expressed in a multivariate polynomial basis of the state variables of known dimension. The exact state variables (numerical solutions) of the ODEs are computed by the fourth-order Runge–Kutta method with tolerance \(10^{ - 12}\) and sampled at discrete times \(t_{k} = k\Delta t \, (\Delta t = 0.001)\). The number of simulated samples is chosen to capture the fundamental behavior of every dynamical system. Independent zero-mean Gaussian white noise with different variances \(\sigma^{2}\) is added to the exact state variables to obtain state measurements at different noise levels, which tests the robustness of the proposed model. The noise model is

$$ x_{v}^{ * } (t_{k} ) = x_{v} (t_{k} ) + \eta_{v} (t_{k} ), $$
(13)

where \(x_{v}^{ * } (t_{k} ),\;v = 1, \ldots ,n\) and \(x_{v} (t_{k} ),\;v = 1, \ldots ,n\) are the corrupted and exact state variables at times \(t_{k} , \, k = 1, \ldots ,m\), respectively, \(v\) indexes the state variables, and \(\eta_{v} (t_{k} ) \sim N(0,\sigma^{2} )\). We approximate the two ODE systems using the multivariate monomial library \(\Phi (x)\) to form the candidate nonlinear functions [16], where the total degree of the basis is one more than the highest degree in the governing equation.
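Generating the noisy sources of one case according to the noise model (13) amounts to the following sketch (the function name and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only

def make_sources(x_exact, sigmas):
    """One noisy copy of the exact state trajectory per source, following
    the noise model (13); sigmas lists the noise level of each source."""
    return [x_exact + rng.normal(0.0, s, size=x_exact.shape) for s in sigmas]

# e.g. the two sources (L = 2) of a case with noise levels 1e-3 and 1e-1:
# sources = make_sources(X, sigmas=[1e-3, 1e-1])
```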

Numerical differentiation, as mentioned in Sect. 3.4, may yield inaccurate results near the boundaries of the time interval, and the time derivatives of the state are unknown at the end of the training time. We therefore compute the time derivatives of the state within a time interval \(5\%\) larger (on both sides) than the time range required for the experiment, but only the state measurements and estimated time derivatives within the original required time are used to calculate the solution vectors. To evaluate the quality of the derivatives, we define the relative error

$$ \varepsilon_{{\dot{x}}} = \frac{{\left\| {\dot{x}^{ * } - \dot{x}} \right\|_{2} }}{{\left\| {\dot{x}} \right\|_{2} }}, $$
(14)

where \(\dot{x}^{ * }\) and \(\dot{x}\) denote the computed and true state derivative values, respectively. To ensure a fair comparison of the numerical results, we compute the derivative of each noisy state measurement separately for every noise level, so that the same data are used when STLS (Separate and Combined) and our proposed method are executed.

The nonlinear dynamical systems considered are as follows: the Lorenz 63 system as a model of chaotic dynamics, the Duffing oscillator as a model of nonlinear stiffness and damping, and Burgers' equation as a partial differential equation. In the following subsections, we explore how the proposed model can further improve identification performance by using the similarity information of two state measurement sources.

4.1 Lorenz 63 system

We consider the well-known nonlinear chaotic Lorenz system [34], developed when Lorenz studied atmospheric motion by simplifying a convection model. Its state trajectories are chaotic, aperiodic, and confined to a butterfly-shaped attractor. The governing equations of the Lorenz system are:

$$ \left\{ {\begin{array}{*{20}l} {\dot{x} = \beta (y - x),} \hfill \\ {\dot{y} = x(\rho - z) - y,} \hfill \\ {\dot{z} = xy - \mu z,} \hfill \\ \end{array} } \right. $$
(15)

where the coefficient values are \(\beta = 10\), \(\rho = 28\), \(\mu = 8/3\), and the initial value is \((x_{0} ,y_{0} ,z_{0} ) = ( - 8,7,27)\). The highest order of the state variables on the right-hand side of Eq. (15) is quadratic, and the basis function library \(\Phi (x) = \left[ {1\;x\;y\;z\;x^{2} \;xy\;xz\;y^{2} \;yz\;z^{2} \;x^{3} \;x^{2} y\;x^{2} z\;xy^{2} \;xyz\;xz^{2} \;y^{3} \;y^{2} z\;yz^{2} \;z^{3} } \right]\) is constructed.
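A sketch of generating the exact Lorenz trajectory is given below; note that we use SciPy's adaptive RK45 integrator with tight tolerances as a stand-in for the fixed-step fourth-order Runge–Kutta integration described above:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, beta=10.0, rho=28.0, mu=8.0 / 3.0):
    """Right-hand side of Eq. (15)."""
    x, y, z = s
    return [beta * (y - x), x * (rho - z) - y, x * y - mu * z]

t_eval = np.arange(0.001, 2.2 + 1e-9, 0.001)          # m = 2200 samples
sol = solve_ivp(lorenz, (0.0, 2.2), [-8.0, 7.0, 27.0],
                t_eval=t_eval, rtol=1e-12, atol=1e-12)
X = sol.y.T                                           # shape (2200, 3)
```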

The state trajectories of the Lorenz system for each state measurement were obtained from \(t = 0.001\) to \(t = 2.2\) with \(\Delta t = 0.001\), yielding \(m = 2200\) samples, which we perturb with different noise levels \(\sigma\). After truncating the discrete state trajectories and calculating the state measurement derivatives by total variation regularized numerical differentiation, \(2000\) samples from \(t = 0.1\) to \(t = 2.1\) are generated. Table 1 shows the relative errors of the state derivatives for each state measurement at different noise levels. Observe that the derivative error increases rapidly around \(\sigma = 10^{ - 2}\) and is quite large at \(\sigma = 10^{0}\). After extensive simulation experiments, we found that the Lorenz systems identified by all methods fail seriously at noise level \(\sigma = 10^{0}\); this level is therefore discarded. Table 2 lists the combination (\(L = 2\)) of state measurements with different noise levels for every case, so there are \(4000\) samples for each simulation case. In the first two cases the difference between the noise levels is relatively large, whereas in the latter two cases it is small. The purpose of this setup is to explore the effect of fusion on data whose noise levels differ by different amounts.

Table 1 The relative errors of the Lorenz system state derivatives at different noise levels
Table 2 Cases consisting of the Lorenz system state with different noise levels

We take the state measurement and its derivative with noise level \(\sigma = 10^{ - 1}\) in Case 1 and apply the L-curve criterion as an example to investigate how the threshold parameter \(\alpha\) affects the accuracy of the result and how to select a proper \(\alpha\). The objective function (7) is computed for the three state variables with an increment \(\Delta \alpha = 0.0001\) over the range \(0.0001\) to \(1\), and the residual norm and solution norm are calculated, respectively. Figure 2 shows the resulting curves versus \(\alpha\). As mentioned previously, the residual norm increases and the solution norm decreases as \(\alpha\) increases. According to the L-curve criterion and the results obtained with different \(\alpha\) in the sparsification step, for the variables \(x\), \(y\) and \(z\), we observe in Fig. 2a–c that \(0.0016 \le \alpha \le 0.0376\), \(0.002 \le \alpha \le 0.0183\) and \(0.0375 \le \alpha \le 0.3492\) are the approximate parameter ranges, respectively, meaning that a wide range of \(\alpha\) values produces similar or even identical solutions.

Fig. 2 The L-curve for each state of the Lorenz system. a State \(x\), b state \(y\) and c state \(z\) at noise level \(\sigma = 10^{ - 1}\)

Here we empirically analyze the convergence behavior of Algorithm 1 by showing in Fig. 3 the change of the absolute sum of the solution vector over the iterations for Case 1; the absolute sum decreases during the iterations and finally converges to a constant value.

Fig. 3 Convergence curves of our method on a the state \(x\), b the state \(y\) and c the state \(z\) for Case 1

To examine whether the fusion term in the proposed method improves the identification of system coefficients, we run JSLSGF on Case 1. The trajectories of the exact and learned systems up to time \(t = 10\) for the three state variables are drawn in Fig. 4 to compare the prediction performance of the identified Lorenz system. Note that the predicted trajectories agree with the exact ones from \(t = 0\) to beyond \(t = 5\). Moreover, in Fig. 5, the short-time prediction from \(t = 0\) to \(t = 3\) (dashed red line) and the long-time exact trajectory from \(t = 0\) to \(t = 16\) (solid blue line) show that the predicted trajectory is consistent with the exact one over a long time period, even though the Lorenz system is chaotic.

Fig. 4 Comparison of predicted state trajectories of a \(x(t)\), b \(y(t)\) and c \(z(t)\) of the Lorenz system for Case 1

Fig. 5 Exact and identified state trajectories of the Lorenz system for Case 1

The ability to recover the attractor dynamics of the Lorenz system is more essential than predicting the state trajectory, because chaos causes trajectories to diverge exponentially when small perturbations occur in the initial condition or in the system coefficient values. Since the identified Lorenz coefficient values of most terms in the basis function library are zero, Table 3 lists only the nonzero coefficient values obtained from STLS (Separate and Combined) and JSLSGF for the different cases, with the optimal values shown in boldface.

Table 3 The sparse solutions obtained from STLS and JSLSGF for the Lorenz system at different cases that are composed of the state measurement polluted by different noise levels

For the first three cases, we find that JSLSGF not only recognizes the correct basis function terms but also identifies coefficient values closer to the true ones. For Case 4 in Table 2, we require that all the terms involved in the Lorenz system appear in the sparse solution vectors, meaning that the corresponding entries are nonzero, while the error of the solutions is minimized simultaneously. After conducting many experiments, we obtain the identified systems and compute the root-mean-square error (RMSE) of the coefficient values between the identified and true systems as

$$ {\text{STLS}}\;\left( {{\text{Separate}}} \right):\left\{ {\begin{array}{*{20}l} {\dot{x} = - 9.9766x + 9.976y,} \hfill \\ {\dot{y} = 27.6966x - 0.9117y - 0.993xz,} \hfill \\ {\dot{z} = - 2.6649z + 0.999xy,} \hfill \\ \end{array} } \right. $$
$$ {\text{with}}\;{\text{RMSE}}_{{{\text{STLS}}({\text{Separate}})}} = 0.1201; $$
$$ {\text{STLS}}\left( {{\text{Combined}}} \right):\;\left\{ {\begin{array}{*{20}l} {\dot{x} = - 9.9766x + 9.976y,} \hfill \\ {\dot{y} = 27.648x - 0.8939y - 0.9922xz + 0.1205,} \hfill \\ {\dot{z} = - 2.6649z + 0.999xy,} \hfill \\ \end{array} } \right. $$
$$ {\text{with}}\;{\text{RMSE}}_{{{\text{STLS}}({\text{Combined}})}} = 0.1468; $$
$$ {\text{and}}\;{\text{JSLSGF}}:\;\left\{ {\begin{array}{*{20}l} {\dot{x} = - 9.9766x + 9.976y,} \hfill \\ {\dot{y} = 27.7046x - 0.914y - 0.9932xz,} \hfill \\ {\dot{z} = - 2.6649z + 0.999xy,} \hfill \\ \end{array} } \right. $$
$$ {\text{with}}\;{\text{RMSE}}_{{{\text{JSLSGF}}}} = 0.117. $$

Based on the above discussion, in most cases JSLSGF identifies the terms inherent to the system from the two datasets; that is, the similarity information of the datasets is mined. For the latter two kinds of cases, with a small noise difference between the state measurements, fusion plays a relatively small role; the underlying reason could be that the difference between the measurement data is smaller when the difference in noise levels is small. Nevertheless, for all cases, JSLSGF has a clear advantage over STLS, and STLS (Combined) performs even worse than STLS (Separate); in particular, for Case 4, the system identified by STLS (Combined) introduces an extra constant term \(0.1205\). These results demonstrate that JSLSGF with proper fusion makes the best use of multiple state measurements to increase identification accuracy.

4.2 Duffing oscillator

As the second example, the classical Duffing oscillator describes forced vibration and exhibits cubic nonlinearity and, in the forced case, chaotic behavior. The equation of motion of the unforced Duffing oscillator is:

$$ \ddot{x} + \beta \dot{x} + \rho x + \mu x^{3} = 0, $$
(16)

which can be rewritten as a first-order system in two variables by setting \(\dot{x} = y\):

$$ \left\{ {\begin{array}{*{20}l} {\dot{x} = y,} \hfill \\ {\dot{y} = - \beta y - \rho x - \mu x^{3} ,} \hfill \\ \end{array} } \right. $$
(17)

where \(\beta = 0.1\), \(\rho = 1\) and \(\mu = 5\), and the initial condition is \((x_{0} ,y_{0} ) = (1,0)\). The Duffing oscillator does not show chaotic behavior for these values. The highest order of the state variables in (17) is cubic; we establish the basis function library \(\Phi (x) = \left[ {1\;x\;y\;x^{2} \;xy\;y^{2} \;x^{3} \;x^{2} y\;xy^{2} \;y^{3} \;x^{4} \;x^{3} y\;x^{2} y^{2} \;xy^{3} \;y^{4} } \right]\). In fact, the system is expressed by only four of these terms. As for the Lorenz system, the number of samples is \(2200\) for every state measurement at each noise level \(\sigma\) in each case, ranging from \(t = 0.001\) to \(t = 2.2\) with \(\Delta t = 0.001\). The derivatives of the noisy state measurements are calculated by total variation regularized numerical differentiation. We truncate the noisy state measurements and compute the derivatives, yielding \(2000\) samples from \(t = 0.1\) to \(t = 2.1\). The relative errors of the state derivatives for each state measurement at different noise levels are given in Table 4. The derivative error increases rapidly around \(\sigma = 10^{ - 3}\) and is large at \(\sigma = 10^{ - 2}\). After numerous simulation experiments, we discovered that all methods fail to recover the equations at noise level \(\sigma = 10^{ - 2}\). We therefore give the combination (\(L = 2\)) of state measurements with different noise levels in Table 5 for each case, excluding \(\sigma = 10^{ - 2}\), so that 4000 samples are contained in each case. The noise levels in the first two cases differ considerably, whereas the difference in the latter two cases is small.

Table 4 The relative errors of the Duffing oscillator state derivatives at different noise levels
Table 5 Cases consisting of the Duffing oscillator state with different noise levels

We adopt the state measurement and its derivative with noise level \(\sigma = 10^{ - 3}\) as input. For the two state variables, with \(\alpha\) in the range \(0.0001\) to \(1\) and \(\Delta \alpha = 0.0001\), Fig. 6 shows the resulting curves versus \(\alpha\). Based on the L-curve criterion, for the variables \(x\) and \(y\), we find in Fig. 6a, b that \(0.0024 \le \alpha \le 0.1613\) and \(0.0106 \le \alpha \le 0.02\) are the optimal parameter ranges, respectively, implying that a wide range of \(\alpha\) values generates similar or even identical solutions.

Fig. 6 The L-curves for each state of the Duffing oscillator. a State \(x\) and b state \(y\) at noise level \(\sigma = 10^{ - 3}\)

We now investigate the convergence behavior of Algorithm 1. As observed in Fig. 7, the proposed optimization algorithm converges well, essentially within five iterations.

Fig. 7 Convergence curves of our method on a the state \(x\) and b the state \(y\) for Case 1

To examine whether the additional fusion term in the proposed method affects the system identification, we run JSLSGF on Case 1. The trajectories of the exact and learned systems up to time \(t = 16\) for the two state variables are drawn in Fig. 8a, b. Note that the predicted trajectories agree with the exact ones for a long time. Moreover, in Fig. 8c, the prediction from \(t = 0\) to \(t = 20\) (dashed red line) and the exact trajectory from \(t = 0\) to \(t = 50\) (solid blue line) show that the exact and learned trajectories agree over a relatively long time period.

Fig. 8 Comparison of predicted state trajectories of a \(x(t)\) and b \(y(t)\) of the Duffing oscillator for Case 1. c Exact and identified trajectories of the Duffing oscillator for Case 1

Table 6 reports the results of STLS and JSLSGF on the simulated data for every case, giving only the approximate coefficient values of the nonzero terms derived by STLS and JSLSGF, with the best values highlighted in boldface.

Table 6 The sparse solutions obtained from STLS and JSLSGF for the Duffing oscillator at different cases composed of the state measurement polluted by different noise levels

As expected, in all cases, all three methods pick out the correct terms describing the Duffing oscillator. The values estimated by the proposed JSLSGF are essentially consistent with the true coefficient values, meaning that JSLSGF outperforms the competing methods STLS (Separate) and STLS (Combined) in identifying the coefficient values of the system. All the results indicate that JSLSGF effectively uses more information from multiple state measurements and achieves superior identification accuracy for the dynamical system.

4.3 Burgers’ equation

Many systems of interest are governed by partial differential equations (PDEs) [35]. Here we generalize our proposed method to PDEs, considering only combinations of monomials in the state variables and the partial derivatives of the PDE, since these are the common terms in physics. The general form of the PDEs is:

$$ u_{t} = N(1,u,u_{x} ,u_{xx} , \ldots ,\xi ), \, t \in [0,T], $$
(18)

where the subscripts denote partial differentiation with respect to time or space, and \(N( \bullet )\) describes the evolution of the system; it is generally a nonlinear function of \(u(x,t)\) with coefficients in \(\xi\). As a classic example, Burgers' equation is derived from the Navier–Stokes equations for the velocity field and is given as:

$$ u_{t} = \beta uu_{x} + \rho u_{xx} , $$
(19)

where the coefficients are \(\beta = - 1\) and \(\rho = 0.5\). We set the initial condition \(u(x,0) = e^{{ - (x - 3)^{2} }}\) and the Dirichlet boundary condition \(u(0,t) = u(7.5,t) = 0\). The numerical solution is calculated by the finite difference method at the grid points on \(x \in [0,7.5]\) and \(t \in [0,3.15]\) with \(\Delta x = 0.15\) and \(\Delta t = 0.015\); see Fig. 9. The partial derivatives with respect to time are computed from the points at two adjacent time steps, and the partial derivatives with respect to space are calculated using the central-difference formula. Five data points at both ends of the time and space intervals are removed. The basis function library \(\Phi (u) = \left[ {1\;u\;u_{x} \;u^{2} \;uu_{x} \;u_{xx} \;u^{3} \;uu_{xx} \;u^{2} u_{x} \;u_{xxx} } \right]\) is constructed; it contains 10 terms, only two of which contribute. Table 7 provides the different combinations (\(L = 2\)) of the numerical solution of Burgers' equation with different noise levels.
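A sketch of the finite-difference solution of Eq. (19) under the stated initial and boundary conditions might read as follows; we use forward Euler in time with central differences in space, which is stable on this grid (\(\rho \Delta t/\Delta x^{2} = 1/3\), advective CFL number about 0.1):

```python
import numpy as np

beta, rho = -1.0, 0.5
dx, dt = 0.15, 0.015
x = np.arange(0.0, 7.5 + 1e-9, dx)         # 51 spatial grid points
n_steps = int(3.15 / dt)                   # 210 time steps

u = np.exp(-(x - 3.0) ** 2)                # initial condition u(x, 0)
U = [u.copy()]
for _ in range(n_steps):
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)        # central u_x
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2  # central u_xx
    u = u + dt * (beta * u * ux + rho * uxx)                # Eq. (19)
    u[0] = u[-1] = 0.0                     # Dirichlet boundary condition
    U.append(u.copy())
U = np.array(U)                            # shape (n_steps + 1, len(x))
```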

Fig. 9 The numerical solution of the Burgers' equation with \(\beta = - 1\) and \(\rho = 0.5\)

Table 7 Cases for numerical solution of the Burgers’ equation with different noise levels

In Fig. 10, we show the convergence curve recorded during the experiment to illustrate the convergence of the designed algorithm. The absolute sum of the solution vector decreases rapidly within three iterations, illustrating the fast convergence of Algorithm 1.

Fig. 10 Convergence curve of our method for Case 1

Next, the identified results are presented in Table 8 to investigate the performance of JSLSGF for identifying PDEs. Our proposed method significantly outperforms STLS (Separate) and STLS (Combined), and STLS (Combined) is again worse than STLS (Separate). Note that although all the terms of Burgers' equation are identified by STLS (Separate), the coefficient errors for Case 1 and Case 2 are relatively high. Moreover, for Case 3, STLS (Separate) and STLS (Combined) introduce more non-contributing terms than JSLSGF, and the coefficient errors of the selected contributing terms are larger.

Table 8 The identified PDEs from STLS and JSLSGF at different cases composed of the numerical solution polluted with different noise levels

Now, we solve the PDEs on \(x \in [0,7.5]\) and \(t \in [0,8]\) using the coefficients identified by all methods for Case 2 to study the predictive performance of our method; the results are shown in Fig. 11. As a reference, Burgers' equation computed from Eq. (19) with \(\beta = - 1\) and \(\rho = 0.5\) is illustrated in Fig. 11a. Figure 11c deviates greatly from the true equation. Figure 11b, d shows obvious differences on \(x \in [5,6]\) and around \(t = 4\), while Fig. 11d is closer to the true Burgers' equation. These results highlight the broad applicability of our proposed method and its better performance in identifying PDEs.

Fig. 11 The solution of the Burgers' equation plotted in space–time: a the true numerical solution, b predicted by STLS (Separate), c predicted by STLS (Combined), and d predicted by JSLSGF

5 Conclusion

In this work, we have proposed a joint sparse least-square model with a generalized fused lasso penalty, which makes full use of the similarity information among different state measurements and thereby boosts the performance of system identification. We have also presented an efficient threshold joint sparse least-square algorithm, wherein the threshold parameter in the sparsification step is selected using the L-curve criterion. In addition, we have studied the performance of the proposed method on two ODEs and generalized it to a PDE, with a series of cases for each. The numerical results demonstrate that an appropriate fusion of multiple state measurements improves the identification accuracy of the systems compared with the traditional sparse least-square model. The proposed method thus identifies more accurate governing equations and exploits the richness of the data by explicitly accounting for the similarity information of multiple datasets, giving it great potential for dynamical system identification problems that require high accuracy.

As part of future work, we will investigate more extensive basis function search spaces to further increase the power of identifying complicated dynamical systems. Furthermore, extending this work to dynamical system identification problems with more data sources is an exciting avenue for future research.